[PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters

* [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
       [not found] <7caf19ae394accab538d2f94953bb62b55a2c79f.1206486012.git.peff@peff.net>
@ 2008-03-25 23:03 ` Jeff King
  2008-03-26  5:59   ` Teemu Likonen
  0 siblings, 1 reply; 37+ messages in thread
From: Jeff King @ 2008-03-25 23:03 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Teemu Likonen

We always use 'utf-8' as the encoding, since we currently
have no way of getting the information from the user.

This also refactors the quoting of recipient names, since
both processes can share the rfc2047 quoting code.

Signed-off-by: Jeff King <peff@peff.net>
---
 git-send-email.perl   |   18 +++++++++++++++---
 t/t9001-send-email.sh |   15 +++++++++++++++
 2 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/git-send-email.perl b/git-send-email.perl
index 7c4f06c..075cd0b 100755
--- a/git-send-email.perl
+++ b/git-send-email.perl
@@ -501,7 +501,12 @@ if ($compose) {
 	open(C,">",$compose_filename)
 		or die "Failed to open for writing $compose_filename: $!";
 	print C "From $sender # This line is ignored.\n";
-	printf C "Subject: %s\n\n", $initial_subject;
+	print C "Subject: ",
+		($initial_subject =~ /[^[:ascii:]]/ ?
+		quote_rfc2047($initial_subject) :
+		$initial_subject),
+		"\n";
+	print C "\n";
 	printf C <<EOT;
 GIT: Please enter your email below.
 GIT: Lines beginning in "GIT: " will be removed.
@@ -626,6 +631,14 @@ sub unquote_rfc2047 {
 	return wantarray ? ($_, $encoding) : $_;
 }
 
+sub quote_rfc2047 {
+	local $_ = shift;
+	my $encoding = shift || 'utf-8';
+	s/([^-a-zA-Z0-9!*+\/])/sprintf("=%02X", ord($1))/eg;
+	s/(.*)/=\?$encoding\?q\?$1\?=/;
+	return $_;
+}
+
 # use the simplest quoting being able to handle the recipient
 sub sanitize_address
 {
@@ -643,8 +656,7 @@ sub sanitize_address
 
 	# rfc2047 is needed if a non-ascii char is included
 	if ($recipient_name =~ /[^[:ascii:]]/) {
-		$recipient_name =~ s/([^-a-zA-Z0-9!*+\/])/sprintf("=%02X", ord($1))/eg;
-		$recipient_name =~ s/(.*)/=\?utf-8\?q\?$1\?=/;
+		$recipient_name = quote_rfc2047($recipient_name);
 	}
 
 	# double quotes are needed if specials or CTLs are included
diff --git a/t/t9001-send-email.sh b/t/t9001-send-email.sh
index e222c49..a4bcd28 100755
--- a/t/t9001-send-email.sh
+++ b/t/t9001-send-email.sh
@@ -210,4 +210,19 @@ test_expect_success '--compose respects user mime type' '
 	! grep "^Content-Type: text/plain; charset=utf-8" msgtxt1
 '
 
+test_expect_success '--compose adds MIME for utf8 subject' '
+	clean_fake_sendmail &&
+	echo y | \
+	  GIT_EDITOR=$(pwd)/fake-editor \
+	  GIT_SEND_EMAIL_NOTTY=1 \
+	  git send-email \
+	  --compose --subject utf8-sübjëct \
+	  --from="Example <nobody@example.com>" \
+	  --to=nobody@example.com \
+	  --smtp-server="$(pwd)/fake.sendmail" \
+	  $patches &&
+	grep "^fake edit" msgtxt1 &&
+	grep "^Subject: =?utf-8?q?utf8-s=C3=BCbj=C3=ABct?=" msgtxt1
+'
+
 test_done
-- 
1.5.5.rc1.123.ge5f4e6
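
For readers who want to try the new helper outside of git-send-email, here is a minimal standalone sketch. The body of quote_rfc2047 is copied from the patch above; the surrounding script and the sample subject are only illustrative, and the sketch assumes the input is a string of raw UTF-8 bytes (so that ord() stays within 0..255).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Standalone copy of the helper added by the patch, for experimentation.
# The argument is expected to be a byte string (for example UTF-8 octets).
sub quote_rfc2047 {
	local $_ = shift;
	my $encoding = shift || 'utf-8';
	# Hex-escape every byte outside a small set of safe characters.
	s/([^-a-zA-Z0-9!*+\/])/sprintf("=%02X", ord($1))/eg;
	# Wrap the result as an RFC 2047 "Q" encoded-word.
	s/(.*)/=\?$encoding\?q\?$1\?=/;
	return $_;
}

# "utf8-sübjëct" spelled out as UTF-8 bytes, as in the test case.
my $subject = "utf8-s\xC3\xBCbj\xC3\xABct";
print "Subject: ", quote_rfc2047($subject), "\n";
# prints: Subject: =?utf-8?q?utf8-s=C3=BCbj=C3=ABct?=
```

The printed header matches the string that the new test in t9001 greps for.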


* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
  2008-03-25 23:03 ` [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters Jeff King
@ 2008-03-26  5:59   ` Teemu Likonen
  2008-03-26  6:20     ` Jeff King
  0 siblings, 1 reply; 37+ messages in thread
From: Teemu Likonen @ 2008-03-26  5:59 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git

Jeff King wrote:

> We always use 'utf-8' as the encoding, since we currently
> have no way of getting the information from the user.
>
> This also refactors the quoting of recipient names, since
> both processes can share the rfc2047 quoting code.

These patches seem to work, except that the quoting of the Subject field
works only if the user types non-ASCII text at the "What subject should
the initial email start with?" prompt. If she changes the subject in the
editor, it won't be rfc2047-quoted.

Thank you anyway; I think we're going in the right direction. I think 'git
send-email --compose' is a nice way to produce an introductory message for
a patch series. If --compose doesn't support MIME encoding in a reasonable
way, the user may have to write and send the intro message with a real MUA
and find out the Message-Id for the correct In-Reply-To field for the
actual patch series.

The e-mail agents KMail and Mutt have a setting for preferred encodings for
outgoing mail. It's a list of encodings, like "us-ascii,iso-8859-1,utf-8".
The first one that fits (including From, To, Cc, Subject, the body, ...?)
is used, so there is some kind of detection of content after the message
has been composed.
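
The "first one that fits" selection can be sketched in Perl with the core Encode module. This is only an illustration of the idea, not how KMail or Mutt implement it; the helper name pick_charset is hypothetical, and the sketch assumes the text is already a decoded Perl character string.

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding);

# Hypothetical helper: return the first charset in @candidates that can
# represent every character of $text, or undef if none of them can.
sub pick_charset {
	my ($text, @candidates) = @_;
	for my $name (@candidates) {
		next unless find_encoding($name);   # skip unknown charset names
		my $copy = $text;                   # encode() may modify its input when CHECK is set
		my $ok = eval { encode($name, $copy, Encode::FB_CROAK()); 1 };
		return $name if $ok;
	}
	return undef;
}

my @preferred = qw(us-ascii iso-8859-1 utf-8);
print pick_charset("plain subject", @preferred), "\n";      # us-ascii
print pick_charset("s\x{FC}bj\x{EB}ct", @preferred), "\n";  # iso-8859-1
print pick_charset("\x{20AC}100", @preferred), "\n";        # utf-8
```

A real mailer would presumably run such a check over the headers and the body together, so that one charset is chosen for the whole message.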

If portable content encoding detection is difficult or considered 
unnecessary, then I think a documented configurable option is fine 
(UTF-8 by default).


* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
  2008-03-26  5:59   ` Teemu Likonen
@ 2008-03-26  6:20     ` Jeff King
  2008-03-26  8:30       ` Teemu Likonen
  0 siblings, 1 reply; 37+ messages in thread
From: Jeff King @ 2008-03-26  6:20 UTC (permalink / raw)
  To: Teemu Likonen; +Cc: Junio C Hamano, git

On Wed, Mar 26, 2008 at 07:59:48AM +0200, Teemu Likonen wrote:

> These patches seem to work, except that the quoting of the Subject field
> works only if the user types non-ASCII text at the "What subject should
> the initial email start with?" prompt. If she changes the subject in the
> editor, it won't be rfc2047-quoted.

Ah, yes, I hadn't considered that. We should definitely do the quoting
after all of the user's input. Replace 2/2 from my series with the patch
below, which handles this case correctly (and as a bonus, the user sees
the unencoded subject in the editor, which is much more readable).
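
As a rough sketch of that idea (not the actual send-email code), the pass over the edited message could look like the following. The function name encode_subject_in_headers and its list-of-lines interface are assumptions made for illustration; quote_rfc2047 is the helper from the patch, and the input lines are assumed to be raw bytes.

```perl
use strict;
use warnings;

# Helper from the patch: RFC 2047 "Q" encoding of a byte string.
sub quote_rfc2047 {
	local $_ = shift;
	my $encoding = shift || 'utf-8';
	s/([^-a-zA-Z0-9!*+\/])/sprintf("=%02X", ord($1))/eg;
	s/(.*)/=\?$encoding\?q\?$1\?=/;
	return $_;
}

# Hypothetical helper: takes the lines of the edited message and returns
# them with a non-ASCII Subject header encoded. Body lines are untouched.
sub encode_subject_in_headers {
	my @lines = @_;
	my $in_body = 0;
	for my $line (@lines) {
		$in_body = 1 if $line =~ /^$/;   # first empty line ends the headers
		if (!$in_body && $line =~ /^Subject: ?(.*)/i) {
			my $subject = $1;
			$line = "Subject: " .
				($subject =~ /[^[:ascii:]]/ ?
				 quote_rfc2047($subject) :
				 $subject) .
				"\n";
		}
	}
	return @lines;
}

my @edited = (
	"From nobody\@example.com # This line is ignored.\n",
	"Subject: utf8-s\xC3\xBCbj\xC3\xABct\n",
	"\n",
	"Subject: a body line that must stay untouched \xC3\xBC\n",
);
print encode_subject_in_headers(@edited);
```

Because the encoding happens only after the editor has exited, whatever subject the user ends up with is what gets quoted.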

> Thank you anyway; I think we're going in the right direction. I think 'git
> send-email --compose' is a nice way to produce an introductory message for
> a patch series. If --compose doesn't support MIME encoding in a reasonable
> way, the user may have to write and send the intro message with a real MUA
> and find out the Message-Id for the correct In-Reply-To field for the
> actual patch series.

git-format-patch recently got a --cover-letter option which does the
same thing. I actually use a real MUA (mutt) instead of send-email, and
this way you can avoid the message-id cutting and pasting that is
required. It automatically does the right thing with encodings because I
end up sending the message using my MUA.

> The e-mail agents KMail and Mutt have a setting for preferred encodings for
> outgoing mail. It's a list of encodings, like "us-ascii,iso-8859-1,utf-8".
> The first one that fits (including From, To, Cc, Subject, the body, ...?)
> is used, so there is some kind of detection of content after the message
> has been composed.

Yes, the git-send-email code is a real mess for this sort of thing. I
think it started very small and specific, and has gotten hack upon hack
piled on it. It would be much nicer rewritten from scratch around one of
the many abstracted perl mail objects (though that does introduce a new
dependency).

> If portable content encoding detection is difficult or considered 
> unnecessary, then I think a documented configurable option is fine 
> (UTF-8 by default).

I think that is sensible. Want to try adding it on top of my patches?

Below is the revised subject-munging patch.

-- >8 --
send-email: rfc2047-quote subject lines with non-ascii characters

We always use 'utf-8' as the encoding, since we currently
have no way of getting the information from the user.

This also refactors the quoting of recipient names, since
both processes can share the rfc2047 quoting code.

Signed-off-by: Jeff King <peff@peff.net>
---
 git-send-email.perl   |   20 ++++++++++++++++++--
 t/t9001-send-email.sh |   15 +++++++++++++++
 2 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/git-send-email.perl b/git-send-email.perl
index 7c4f06c..3694f81 100755
--- a/git-send-email.perl
+++ b/git-send-email.perl
@@ -536,6 +536,15 @@ EOT
 		if (!$in_body && /^MIME-Version:/i) {
 			$need_8bit_cte = 0;
 		}
+		if (!$in_body && /^Subject: ?(.*)/i) {
+			my $subject = $1;
+			$_ = "Subject: " .
+				($subject =~ /[^[:ascii:]]/ ?
+				 quote_rfc2047($subject) :
+				 $subject) .
+				"\n";
+			}
+		}
 		print C2 $_;
 	}
 	close(C);
@@ -626,6 +635,14 @@ sub unquote_rfc2047 {
 	return wantarray ? ($_, $encoding) : $_;
 }
 
+sub quote_rfc2047 {
+	local $_ = shift;
+	my $encoding = shift || 'utf-8';
+	s/([^-a-zA-Z0-9!*+\/])/sprintf("=%02X", ord($1))/eg;
+	s/(.*)/=\?$encoding\?q\?$1\?=/;
+	return $_;
+}
+
 # use the simplest quoting being able to handle the recipient
 sub sanitize_address
 {
@@ -643,8 +660,7 @@ sub sanitize_address
 
 	# rfc2047 is needed if a non-ascii char is included
 	if ($recipient_name =~ /[^[:ascii:]]/) {
-		$recipient_name =~ s/([^-a-zA-Z0-9!*+\/])/sprintf("=%02X", ord($1))/eg;
-		$recipient_name =~ s/(.*)/=\?utf-8\?q\?$1\?=/;
+		$recipient_name = quote_rfc2047($recipient_name);
 	}
 
 	# double quotes are needed if specials or CTLs are included
diff --git a/t/t9001-send-email.sh b/t/t9001-send-email.sh
index e222c49..a4bcd28 100755
--- a/t/t9001-send-email.sh
+++ b/t/t9001-send-email.sh
@@ -210,4 +210,19 @@ test_expect_success '--compose respects user mime type' '
 	! grep "^Content-Type: text/plain; charset=utf-8" msgtxt1
 '
 
+test_expect_success '--compose adds MIME for utf8 subject' '
+	clean_fake_sendmail &&
+	echo y | \
+	  GIT_EDITOR=$(pwd)/fake-editor \
+	  GIT_SEND_EMAIL_NOTTY=1 \
+	  git send-email \
+	  --compose --subject utf8-sübjëct \
+	  --from="Example <nobody@example.com>" \
+	  --to=nobody@example.com \
+	  --smtp-server="$(pwd)/fake.sendmail" \
+	  $patches &&
+	grep "^fake edit" msgtxt1 &&
+	grep "^Subject: =?utf-8?q?utf8-s=C3=BCbj=C3=ABct?=" msgtxt1
+'
+
 test_done
-- 
1.5.5.rc1.123.ge5f4e6


* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
  2008-03-26  6:20     ` Jeff King
@ 2008-03-26  8:30       ` Teemu Likonen
  2008-03-26  8:39         ` Jeff King
  0 siblings, 1 reply; 37+ messages in thread
From: Teemu Likonen @ 2008-03-26  8:30 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git

Jeff King wrote:

> On Wed, Mar 26, 2008 at 07:59:48AM +0200, Teemu Likonen wrote:
> > These patches seem to work, except that the quoting of the Subject
> > field works only if the user types non-ASCII text at the "What
> > subject should the initial email start with?" prompt. If she changes
> > the subject in the editor, it won't be rfc2047-quoted.
>
> Ah, yes, I hadn't considered that. We should definitely do the quoting
> after all of the user's input. Replace 2/2 from my series with the
> patch below, which handles this case correctly (and as a bonus, the
> user sees the unencoded subject in the editor, which is much more
> readable).

It seems to work nicely after I fixed one unmatched bracket. See below.

> git-format-patch recently got a --cover-letter option which does the
> same thing. I actually use a real MUA (mutt) instead of send-email,
> and this way you can avoid the message-id cutting and pasting that is
> required. It automatically does the right thing with encodings because
> I end up sending the message using my MUA.

I had missed the --cover-letter option completely. It may be useful too.
I'm still trying to find the best way to send patches. If I send the intro
message with a real MUA, I either need to wait for the message to show up
on a mailing list or check my sent-mail folder to find the Message-Id.
Once I know the Message-Id I can send the actual patch series with 'git
send-email' as replies to the intro message. Well, this is OK.

> > If portable content encoding detection is difficult or considered
> > unnecessary, then I think a documented configurable option is fine
> > (UTF-8 by default).
>
> I think that is sensible. Want to try adding it on top of my patches?

I'd like to, but I can only do sh/bash stuff and possibly some
copy-and-paste programming with other scripting languages. You'd end up
fixing my code anyway, sorry.

As you noticed, I accidentally sent you a couple of test emails because
send-email CCed the mails to the patches' author (I think). Now I have set
"suppresscc = all" and "suppressfrom = true", which should prevent such
accidents. Shouldn't these be the defaults? In my opinion it's generally
best practice to always explicitly define which parties emails are sent
to.

There is an unmatched bracket in your patch:

> diff --git a/git-send-email.perl b/git-send-email.perl
> index 7c4f06c..3694f81 100755
> --- a/git-send-email.perl
> +++ b/git-send-email.perl
> @@ -536,6 +536,15 @@ EOT
>  		if (!$in_body && /^MIME-Version:/i) {
>  			$need_8bit_cte = 0;
>  		}
> +		if (!$in_body && /^Subject: ?(.*)/i) {
> +			my $subject = $1;
> +			$_ = "Subject: " .
> +				($subject =~ /[^[:ascii:]]/ ?
> +				 quote_rfc2047($subject) :
> +				 $subject) .
> +				"\n";
> +			}
                        ^-- Shouldn't we remove this one?
> +		}


* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
  2008-03-26  8:30       ` Teemu Likonen
@ 2008-03-26  8:39         ` Jeff King
  2008-03-26  9:23           ` Teemu Likonen
  0 siblings, 1 reply; 37+ messages in thread
From: Jeff King @ 2008-03-26  8:39 UTC (permalink / raw)
  To: Teemu Likonen; +Cc: Junio C Hamano, git

On Wed, Mar 26, 2008 at 10:30:33AM +0200, Teemu Likonen wrote:

> I had missed the --cover-letter option completely. It may be useful too.
> I'm still trying to find the best way to send patches. If I send the intro
> message with a real MUA, I either need to wait for the message to show up
> on a mailing list or check my sent-mail folder to find the Message-Id.
> Once I know the Message-Id I can send the actual patch series with 'git
> send-email' as replies to the intro message. Well, this is OK.

That is how I used to do it; now I use --cover-letter (which you
probably missed because it is brand new in the upcoming 1.5.5).

> > I think that is sensible. Want to try adding it on top of my patches?
> I'd like to, but I can only do sh/bash stuff and possibly some
> copy-and-paste programming with other scripting languages. You'd end up
> fixing my code anyway, sorry.

OK, I will add it to the end of my long todo. Out of curiosity, do you
actually want something besides utf-8, or is this just to make us feel
feature complete?

> As you noticed, I accidentally sent you a couple of test emails because
> send-email CCed the mails to the patches' author (I think). Now I have set
> "suppresscc = all" and "suppressfrom = true", which should prevent such
> accidents. Shouldn't these be the defaults? In my opinion it's generally
> best practice to always explicitly define which parties emails are sent
> to.

I think this is probably a good change. But it is a behavior change,
which means it is definitely out during the -rc freeze. And it may or
may not need a warning period for users.

> There is an unmatched bracket in your patch:

Argh, yes. I _thought_ I ran it successfully through the test script,
but obviously I failed to 'make' and just tested the previous version.
It works fine with the bracket removed.

For reference, the fixed-up patch is below.

-- >8 --
send-email: rfc2047-quote subject lines with non-ascii characters

We always use 'utf-8' as the encoding, since we currently
have no way of getting the information from the user.

This also refactors the quoting of recipient names, since
both processes can share the rfc2047 quoting code.

Signed-off-by: Jeff King <peff@peff.net>
---
 git-send-email.perl   |   19 +++++++++++++++++--
 t/t9001-send-email.sh |   15 +++++++++++++++
 2 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/git-send-email.perl b/git-send-email.perl
index 7c4f06c..d0f9d4a 100755
--- a/git-send-email.perl
+++ b/git-send-email.perl
@@ -536,6 +536,14 @@ EOT
 		if (!$in_body && /^MIME-Version:/i) {
 			$need_8bit_cte = 0;
 		}
+		if (!$in_body && /^Subject: ?(.*)/i) {
+			my $subject = $1;
+			$_ = "Subject: " .
+				($subject =~ /[^[:ascii:]]/ ?
+				 quote_rfc2047($subject) :
+				 $subject) .
+				"\n";
+		}
 		print C2 $_;
 	}
 	close(C);
@@ -626,6 +634,14 @@ sub unquote_rfc2047 {
 	return wantarray ? ($_, $encoding) : $_;
 }
 
+sub quote_rfc2047 {
+	local $_ = shift;
+	my $encoding = shift || 'utf-8';
+	s/([^-a-zA-Z0-9!*+\/])/sprintf("=%02X", ord($1))/eg;
+	s/(.*)/=\?$encoding\?q\?$1\?=/;
+	return $_;
+}
+
 # use the simplest quoting being able to handle the recipient
 sub sanitize_address
 {
@@ -643,8 +659,7 @@ sub sanitize_address
 
 	# rfc2047 is needed if a non-ascii char is included
 	if ($recipient_name =~ /[^[:ascii:]]/) {
-		$recipient_name =~ s/([^-a-zA-Z0-9!*+\/])/sprintf("=%02X", ord($1))/eg;
-		$recipient_name =~ s/(.*)/=\?utf-8\?q\?$1\?=/;
+		$recipient_name = quote_rfc2047($recipient_name);
 	}
 
 	# double quotes are needed if specials or CTLs are included
diff --git a/t/t9001-send-email.sh b/t/t9001-send-email.sh
index e222c49..a4bcd28 100755
--- a/t/t9001-send-email.sh
+++ b/t/t9001-send-email.sh
@@ -210,4 +210,19 @@ test_expect_success '--compose respects user mime type' '
 	! grep "^Content-Type: text/plain; charset=utf-8" msgtxt1
 '
 
+test_expect_success '--compose adds MIME for utf8 subject' '
+	clean_fake_sendmail &&
+	echo y | \
+	  GIT_EDITOR=$(pwd)/fake-editor \
+	  GIT_SEND_EMAIL_NOTTY=1 \
+	  git send-email \
+	  --compose --subject utf8-sübjëct \
+	  --from="Example <nobody@example.com>" \
+	  --to=nobody@example.com \
+	  --smtp-server="$(pwd)/fake.sendmail" \
+	  $patches &&
+	grep "^fake edit" msgtxt1 &&
+	grep "^Subject: =?utf-8?q?utf8-s=C3=BCbj=C3=ABct?=" msgtxt1
+'
+
 test_done
-- 
1.5.5.rc1.133.g360d


* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
  2008-03-26  8:39         ` Jeff King
@ 2008-03-26  9:23           ` Teemu Likonen
  2008-03-26  9:32             ` Teemu Likonen
  2008-03-26  9:33             ` Jeff King
  0 siblings, 2 replies; 37+ messages in thread
From: Teemu Likonen @ 2008-03-26  9:23 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git

Jeff King wrote (26.3.2008 at 4.39):

> On Wed, Mar 26, 2008 at 10:30:33AM +0200, Teemu Likonen wrote:
> 
> > I had missed the --cover-letter option completely. It may be useful
> > too. I'm still trying to find the best way to send patches. If
> > I send the intro message with a real MUA, I either need to wait for
> > the message to show up on a mailing list or check my sent-mail folder
> > to find the Message-Id. Once I know the Message-Id I can send the
> > actual patch series with 'git send-email' as replies to the intro
> > message. Well, this is OK.
> 
> That is how I used to do it; now I use --cover-letter (which you
> probably missed because it is brand new in the upcoming 1.5.5).

I'm using the current 'master' branch so --cover-letter is there.
Managed to miss it anyway. :)

Hmm, do you send the 0000-cover-letter.patch with 'git send-email'? It
seems that this cover letter doesn't get MIME headers when sent that way.
Sending through 'mutt -H' works fine, but then the Message-Id needs to
be copy-pasted manually to send-email for the rest of the series (to have
them appear as replies, that is). No problem with that.

> OK, I will add it to the end of my long todo. Out of curiosity, do you
> actually want something besides utf-8, or is this just to make us feel
> feature complete?

I mostly use (and promote) UTF-8 and now that I begin to understand how
send-email works I can live with the current behaviour just fine. Don't
take my feedback as complaining. :)

In general my interests are in human languages, and I have done quite
a lot of work in different areas to make computers interact nicely with
human languages. This is my interest at a general level, and I tend to
report/fix problems when I notice them. From Git's point of view at the
present moment we can probably say just like you did: "make us feel
feature complete."

Thanks for your work on this. Really.


* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
  2008-03-26  9:23           ` Teemu Likonen
@ 2008-03-26  9:32             ` Teemu Likonen
  2008-03-26  9:35               ` Jeff King
  2008-03-26  9:33             ` Jeff King
  1 sibling, 1 reply; 37+ messages in thread
From: Teemu Likonen @ 2008-03-26  9:32 UTC (permalink / raw)
  To: Jeff King; +Cc: git

I mumbled:

> Thanks for your work on this. Really.

My English is somewhat broken. I meant to thank you for your work.


* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
  2008-03-26  9:23           ` Teemu Likonen
  2008-03-26  9:32             ` Teemu Likonen
@ 2008-03-26  9:33             ` Jeff King
  2008-03-27  7:38               ` Jeff King
  1 sibling, 1 reply; 37+ messages in thread
From: Jeff King @ 2008-03-26  9:33 UTC (permalink / raw)
  To: Teemu Likonen; +Cc: Junio C Hamano, git

On Wed, Mar 26, 2008 at 11:23:03AM +0200, Teemu Likonen wrote:

> Hmm, do you send the 0000-cover-letter.patch with 'git send-email'? It
> seems that this cover letter doesn't get MIME headers when sent that way.
> Sending through 'mutt -H' works fine, but then the Message-Id needs to
> be copy-pasted manually to send-email for the rest of the series (to have
> them appear as replies, that is). No problem with that.

No, I have format-patch do the threading. So something like:

  git format-patch --cover-letter --thread --stdout upstream >mbox
  mutt -f mbox

and then in mutt I bind a key to <resend-message>. For each message, I
do the 'resend', set the recipient headers, look it over one last time,
and then send. The most annoying part is entering the recipients;
usually it isn't too bad because I have short aliases for Junio and the
list, but I had to, e.g., cut and paste your address twice for the other
series.

Probably munging the 'to:' and 'cc:' before running mutt would make the
most sense, but I haven't gotten around to it yet.

> I mostly use (and promote) UTF-8 and now that I begin to understand how
> send-email works I can live with the current behaviour just fine. Don't
> take my feedback as complaining. :)

OK, I am inclined to leave the patches as-is, then, and wait for
somebody to complain about their pet encoding. My reasoning is that:

  - in most cases throughout git, we assume things are happening in
    utf-8, so I don't think it will come as a great surprise
  - I think doing it right might be more complex than just send-email; I
    am thinking there might need to be a "stuff the user inputs is in
    encoding X" config option. And I don't want to do the work. :)

> Thanks for your work on this. Really.

No problem at all. Thank you for helping make git better with bug
reports!

-Peff


* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
  2008-03-26  9:32             ` Teemu Likonen
@ 2008-03-26  9:35               ` Jeff King
  0 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2008-03-26  9:35 UTC (permalink / raw)
  To: Teemu Likonen; +Cc: git

On Wed, Mar 26, 2008 at 11:32:31AM +0200, Teemu Likonen wrote:

> I mumbled:
> 
> > Thanks for your work on this. Really.
> 
> My English is somewhat broken. I meant to thank you for your work.

Maybe it is the late hour, but I am a native English speaker, and it
parsed just fine to me.

-Peff


* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
  2008-03-26  9:33             ` Jeff King
@ 2008-03-27  7:38               ` Jeff King
  2008-03-27 19:44                 ` Todd Zullinger
  0 siblings, 1 reply; 37+ messages in thread
From: Jeff King @ 2008-03-27  7:38 UTC (permalink / raw)
  To: Teemu Likonen; +Cc: git

On Wed, Mar 26, 2008 at 05:33:10AM -0400, Jeff King wrote:

> No, I have format-patch do the threading. So something like:
> 
>   git format-patch --cover-letter --thread --stdout upstream >mbox
>   mutt -f mbox
> 
> and then in mutt I bind a key to <resend-message>. For each message, I

Since it looks like you are using mutt also, I will warn you that there
is a problem with this workflow: when mutt does the resend, it generates
a new message-id. Thus the patches are all connected in a thread because
they all in-reply-to the cover letter, but the cover letter is not
connected, since it has a new message-id.

I'm not sure if there is a way to fix this short of patching mutt. :(

-Peff


* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
  2008-03-27  7:38               ` Jeff King
@ 2008-03-27 19:44                 ` Todd Zullinger
  0 siblings, 0 replies; 37+ messages in thread
From: Todd Zullinger @ 2008-03-27 19:44 UTC (permalink / raw)
  To: git

Jeff King wrote:
> Since it looks like you are using mutt also, I will warn you that
> there is a problem with this workflow: when mutt does the resend, it
> generates a new message-id. Thus the patches are all connected in a
> thread because they all in-reply-to the cover letter, but the cover
> letter is not connected, since it has a new message-id.
> 
> I'm not sure if there is a way to fix this short of patching mutt.
> :(

I don't know if it would help, but perhaps you could try:

:set postponed=/path/to/your/format-patch-mbox

instead of opening the mbox using -f, and then recall the messages to
send.  That *might* prevent mutt from rewriting the message-id, but I
haven't tested it at all.

-- 
Todd        OpenPGP -> KeyID: 0xBEAF0CE3 | URL: www.pobox.com/~tmz/pgp
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Between two evils, I always pick the one I never tried before.
    -- Mae West



* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
  2008-05-21 19:39   ` Junio C Hamano
@ 2008-05-21 19:47     ` Jeff King
  0 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2008-05-21 19:47 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Wed, May 21, 2008 at 12:39:44PM -0700, Junio C Hamano wrote:

> Last night I was going through old mail-logs and found this and another
> one that this is a follow-up to, which I think are still needed.  Does
> anybody see anything wrong with them?
>
> Jeff King <peff@peff.net> writes:
> 
> > We always use 'utf-8' as the encoding, since we currently
> > have no way of getting the information from the user.

Ah, thanks for bringing this up. I noticed a few weeks ago that it
hadn't been applied and meant to bring it up, but somehow I failed to
do so.

Obviously I'm in support of this one, but I also think Horst's patch
looks correct.

-Peff


* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
  2008-03-28 21:29 ` [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters Jeff King
  2008-03-29  7:19   ` Robin Rosenberg
@ 2008-05-21 19:39   ` Junio C Hamano
  2008-05-21 19:47     ` Jeff King
  1 sibling, 1 reply; 37+ messages in thread
From: Junio C Hamano @ 2008-05-21 19:39 UTC (permalink / raw)
  To: git; +Cc: Jeff King

Last night I was going through old mail-logs and found this and another
one that this is a follow-up to, which I think are still needed.  Does
anybody see anything wrong with them?

Jeff King <peff@peff.net> writes:

> We always use 'utf-8' as the encoding, since we currently
> have no way of getting the information from the user.
>
> This also refactors the quoting of recipient names, since
> both processes can share the rfc2047 quoting code.
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  git-send-email.perl   |   19 +++++++++++++++++--
>  t/t9001-send-email.sh |   15 +++++++++++++++
>  2 files changed, 32 insertions(+), 2 deletions(-)
>
> diff --git a/git-send-email.perl b/git-send-email.perl
> index 7c4f06c..d0f9d4a 100755
> --- a/git-send-email.perl
> +++ b/git-send-email.perl
> @@ -536,6 +536,14 @@ EOT
>  		if (!$in_body && /^MIME-Version:/i) {
>  			$need_8bit_cte = 0;
>  		}
> +		if (!$in_body && /^Subject: ?(.*)/i) {
> +			my $subject = $1;
> +			$_ = "Subject: " .
> +				($subject =~ /[^[:ascii:]]/ ?
> +				 quote_rfc2047($subject) :
> +				 $subject) .
> +				"\n";
> +		}
>  		print C2 $_;
>  	}
>  	close(C);
> @@ -626,6 +634,14 @@ sub unquote_rfc2047 {
>  	return wantarray ? ($_, $encoding) : $_;
>  }
>  
> +sub quote_rfc2047 {
> +	local $_ = shift;
> +	my $encoding = shift || 'utf-8';
> +	s/([^-a-zA-Z0-9!*+\/])/sprintf("=%02X", ord($1))/eg;
> +	s/(.*)/=\?$encoding\?q\?$1\?=/;
> +	return $_;
> +}
> +
>  # use the simplest quoting being able to handle the recipient
>  sub sanitize_address
>  {
> @@ -643,8 +659,7 @@ sub sanitize_address
>  
>  	# rfc2047 is needed if a non-ascii char is included
>  	if ($recipient_name =~ /[^[:ascii:]]/) {
> -		$recipient_name =~ s/([^-a-zA-Z0-9!*+\/])/sprintf("=%02X", ord($1))/eg;
> -		$recipient_name =~ s/(.*)/=\?utf-8\?q\?$1\?=/;
> +		$recipient_name = quote_rfc2047($recipient_name);
>  	}
>  
>  	# double quotes are needed if specials or CTLs are included
> diff --git a/t/t9001-send-email.sh b/t/t9001-send-email.sh
> index e222c49..a4bcd28 100755
> --- a/t/t9001-send-email.sh
> +++ b/t/t9001-send-email.sh
> @@ -210,4 +210,19 @@ test_expect_success '--compose respects user mime type' '
>  	! grep "^Content-Type: text/plain; charset=utf-8" msgtxt1
>  '
>  
> +test_expect_success '--compose adds MIME for utf8 subject' '
> +	clean_fake_sendmail &&
> +	echo y | \
> +	  GIT_EDITOR=$(pwd)/fake-editor \
> +	  GIT_SEND_EMAIL_NOTTY=1 \
> +	  git send-email \
> +	  --compose --subject utf8-sübjëct \
> +	  --from="Example <nobody@example.com>" \
> +	  --to=nobody@example.com \
> +	  --smtp-server="$(pwd)/fake.sendmail" \
> +	  $patches &&
> +	grep "^fake edit" msgtxt1 &&
> +	grep "^Subject: =?utf-8?q?utf8-s=C3=BCbj=C3=ABct?=" msgtxt1
> +'
> +
>  test_done
> -- 
> 1.5.5.rc1.141.g50ecd.dirty


* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
  2008-03-29  8:41       ` Robin Rosenberg
  2008-03-29  8:49         ` Jeff King
@ 2008-03-30 23:47         ` Junio C Hamano
  1 sibling, 0 replies; 37+ messages in thread
From: Junio C Hamano @ 2008-03-30 23:47 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: Jeff King, git

Robin Rosenberg <robin.rosenberg.lists@dewire.com> writes:

> On Saturday 29 March 2008 08.22.03, Jeff King wrote:
>> On Sat, Mar 29, 2008 at 08:19:07AM +0100, Robin Rosenberg wrote:
>> > On Friday 28 March 2008 22.29.01, Jeff King wrote:
>> > > We always use 'utf-8' as the encoding, since we currently
>> > > have no way of getting the information from the user.
>> >
>> > Don't set encoding to UTF-8 unless it actually looks like UTF-8.
>>
>> OK. Do you have an example function that guesses with high probability
>> whether a string is utf-8? If there are non-ascii characters but we
>> _don't_ guess utf-8, what should we do?
>
> Any test for valid UTF-8 will do that with a very high probability. The
> perl UTF-8 "api" is a mess; I couldn't find such a routine. Calling
> decode/encode and seeing if you get the original string back works, but
> that is too clumsy, IMHO.

The sequence to decode followed by encode will test if you have a valid
one and if it is canonically encoded, which is testing too much.  You only
want to check if it is valid, and do not care about normalization.

I see this in perluniintro.pod:

    =item *

    How Do I Detect Data That's Not Valid In a Particular Encoding?

    Use the C<Encode> package to try converting it.
    For example,

        use Encode 'decode_utf8';
        if (decode_utf8($string_of_bytes_that_I_think_is_utf8)) {
            # valid
        } else {
            # invalid
        }

For commit log messages, we traditionally use a similar idea: guess by
checking whether it looks like a UTF-8 encoded string and otherwise assume
Latin-1 (and I think we still do if the user does not tell us).

If this issue is only about the --compose part of send-email, perhaps you
can interactively ask instead of "otherwise assume Latin-1"?
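
A minimal sketch of a validity-only check along these lines, assuming the
Encode module's FB_CROAK and LEAVE_SRC check flags. The helper name
looks_like_utf8 is hypothetical and not part of git-send-email.perl. As far
as I can tell, decode_utf8 called without a check argument substitutes
U+FFFD for malformed bytes rather than failing, so the sketch uses the
strict form instead:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of git-send-email.perl): returns 1 if
# $bytes is a well-formed UTF-8 octet sequence, 0 otherwise.
sub looks_like_utf8 {
    my ($bytes) = @_;
    my $ok = eval {
        # FB_CROAK makes decode() die on malformed input;
        # LEAVE_SRC stops it from modifying $bytes in place.
        Encode::decode('UTF-8', $bytes,
                       Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

print looks_like_utf8("G\xc3\xbct") ? "valid\n" : "invalid\n";  # UTF-8 bytes
print looks_like_utf8("G\xfct")     ? "valid\n" : "invalid\n";  # Latin-1 bytes
```

This only tests well-formedness; it does not normalize or re-encode
anything, so it avoids the "testing too much" concern with a
decode/encode round trip.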

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
  2008-03-30  3:40                       ` Sam Vilain
@ 2008-03-30  4:39                         ` Jeff King
  0 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2008-03-30  4:39 UTC (permalink / raw)
  To: Sam Vilain; +Cc: Robin Rosenberg, Junio C Hamano, git

On Sun, Mar 30, 2008 at 04:40:53PM +1300, Sam Vilain wrote:

> > My point is that we don't _know_ what is happening in between the decode
> > and encode. Does that intermediate form have the information required to
> > convert back to the exact same bytes as the original form?
> No, it doesn't.  If you want that, save a copy of the string (it's a
> lazy copy anyway).

We do already save a copy. The question is that Robin is proposing
decode/encode to check for validity. It was not clear to me that such a
process would always return the exact same bytes even for valid utf-8.

But it seems like you are saying below that it is really just the
"decode" part of that which is interesting:

> utf8::decode works in-place; it is essentially checking that the string
> is valid, and if so, marking it as UTF8.
> 
>    my ($encoding);
>    if (utf8::decode($string)) {
>        if (utf8::is_utf8($string)) {
>            $encoding = "UTF-8";
>        }
>        else {
>            $encoding = "US-ASCII";
>        }
>    }
>    else {
>        $encoding = "ISO8859-1"
>    }

OK, that was the magic invocation we were looking for. Thank you.

> For US-ASCII, you'll only have to encode if the string contains special
> characters (those below \037) or any "=" characters.

Ah, yeah. I think our tests are lacking in that they check for only
[^[:ascii:]].
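
A rough sketch of a broader check, following Sam's rule as stated above.
The name subject_needs_quoting is hypothetical, and reading "special
characters" as the control range \000-\037 is an assumption on my part:

```perl
use strict;
use warnings;

# Hypothetical predicate: should this subject line be rfc2047-quoted?
# Assumption: "special characters" means the control range \000-\037.
sub subject_needs_quoting {
    my ($subject) = @_;
    return $subject =~ /[^[:ascii:]]|[\000-\037]|=/ ? 1 : 0;
}

for my $s ("plain subject", "a=b", "tab\there", "utf8-s\xc3\xbcbj") {
    printf "%-20s %s\n", $s, subject_needs_quoting($s) ? "quote" : "as-is";
}
```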

> Anyway, I guess all this rubbish is why people use CPAN modules, so that
> they don't have to continually rediscover every single protocol quirk
> and reinvent the wheel.
> 
> ie, it would be much, much simpler to use MIME::Entity->build for all of
> this, and remove the duplication of code.

Yes, I actually made a similar comment recently. send-email could
probably be shorter, easier to read, and have fewer bugs if it used one
of the many mail-handling CPAN modules. I think it would pretty much
involve scrapping the current send-email and starting fresh, though.

Thanks for your input.

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
  2008-03-30  2:12               ` Sam Vilain
@ 2008-03-30  4:31                 ` Jeff King
  0 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2008-03-30  4:31 UTC (permalink / raw)
  To: Sam Vilain; +Cc: Robin Rosenberg, Junio C Hamano, git

On Sun, Mar 30, 2008 at 03:12:46PM +1300, Sam Vilain wrote:

> > Any idea what version of perl started shipping I18N::Langinfo? I
> > couldn't see anything useful from grepping the Changes files.
> Module::CoreList knows.  See the man page for that.

Thanks, I didn't know about that (I foolishly assumed that such
information would be, well, along with the core of perl).

The answer is: I18N::Langinfo started shipping with 5.007003. I think we
have pretty much given up on perl < 5.6 (at least from my experience
with 5.005 on Solaris), so it is probably safe to use.
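
For reference, a small sketch of the lookup, assuming Module::CoreList is
installed and using its first_release method:

```perl
use strict;
use warnings;
use Module::CoreList;

# first_release() reports the first perl version that bundled a module.
my $first = Module::CoreList->first_release('I18N::Langinfo');
print "I18N::Langinfo first shipped with perl $first\n";
```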

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
  2008-03-29 21:45                     ` Jeff King
@ 2008-03-30  3:40                       ` Sam Vilain
  2008-03-30  4:39                         ` Jeff King
  0 siblings, 1 reply; 37+ messages in thread
From: Sam Vilain @ 2008-03-30  3:40 UTC (permalink / raw)
  To: Jeff King; +Cc: Robin Rosenberg, Junio C Hamano, git

Jeff King wrote:
> My point is that we don't _know_ what is happening in between the decode
> and encode. Does that intermediate form have the information required to
> convert back to the exact same bytes as the original form?

No, it doesn't.  If you want that, save a copy of the string (it's a
lazy copy anyway).

The module that will let you see into the strings to see what is
happening is Devel::Peek.  Using that, you will see the state of the
UTF8 scalar flag.  For example:

 maia:~$ perl -Mutf8 -MDevel::Peek -le 'Dump "Güt"'
 SV = PV(0x605d08) at 0x62f230
   REFCNT = 1
   FLAGS = (PADBUSY,PADTMP,POK,READONLY,pPOK,UTF8)
   PV = 0x60cd20 "G\303\274t"\0 [UTF8 "G\x{fc}t"]
   CUR = 4
   LEN = 8

By default, all strings that are read from files will NOT have this flag
set, unless the filehandle that was read from was marked as being utf-8
(in order to preserve C semantics by default):

 maia:~$ echo "Güt" | perl -MDevel::Peek -nle 'Dump $_'
 SV = PV(0x6052d0) at 0x604220
   REFCNT = 1
   FLAGS = (POK,pPOK)
   PV = 0x62f0e0 "G\303\274t"\0
   CUR = 4
   LEN = 80
 maia:~$ echo "Güt" | perl -MDevel::Peek -nle 'BEGIN { binmode STDIN, ":utf8" } Dump $_'
 SV = PV(0x6052d0) at 0x604220
   REFCNT = 1
   FLAGS = (POK,pPOK,UTF8)
   PV = 0x62f100 "G\303\274t"\0 [UTF8 "G\x{fc}t"]
   CUR = 4
   LEN = 80

> But it still feels a little wrong to test by converting.

utf8::decode works in-place; it is essentially checking that the string
is valid, and if so, marking it as UTF8.

   my ($encoding);
   if (utf8::decode($string)) {
       if (utf8::is_utf8($string)) {
           $encoding = "UTF-8";
       }
       else {
           $encoding = "US-ASCII";
       }
   }
   else {
       $encoding = "ISO8859-1"
   }

For US-ASCII, you'll only have to encode if the string contains special
characters (those below \037) or any "=" characters.
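
The snippet above can be wrapped into a self-contained sketch. The name
guess_encoding is hypothetical, and the ISO8859-1 fallback is simply the
assumption made in the snippet; it relies on utf8::decode returning false
for malformed input and on utf8::is_utf8 to tell multi-byte text from
plain ASCII:

```perl
use strict;
use warnings;

# Hypothetical wrapper around the snippet above. It works on a copy, so
# the caller's bytes are never modified.
sub guess_encoding {
    my ($bytes) = @_;
    if (utf8::decode($bytes)) {
        # The UTF8 flag is only set when multi-byte characters were found.
        return utf8::is_utf8($bytes) ? "UTF-8" : "US-ASCII";
    }
    return "ISO8859-1";    # assumed fallback, as in the snippet
}

for my $s ("Gut", "G\xc3\xbct", "G\xfct") {
    print guess_encoding($s), "\n";
}
```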

You could try using langinfo CODESET instead of hardcoding ISO8859-1
like that, but at least on my system it can return bizarre values like
ANSI_X3.4-1968, which may be in some contexts a "correct" description of
the encoding, but is unlikely to be understood by mail clients.

> There must be
> some way to ask "is this valid utf-8" (there are several candidate
> functions, but I don't think either of us quite knows the right way to
> invoke them).

I think you were just reading the note on the utf8::valid function a
little too strongly.

You could use this block:

   if ($string =~ m/[\200-\377]/) {
       Encode::_utf8_on($string);
       if (!utf8::valid($string)) {
           Encode::_utf8_off($string);
       }
   }

Anyway, I guess all this rubbish is why people use CPAN modules, so that
they don't have to continually rediscover every single protocol quirk
and reinvent the wheel.

ie, it would be much, much simpler to use MIME::Entity->build for all of
this, and remove the duplication of code.

Sam.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
  2008-03-29  9:52             ` Jeff King
  2008-03-29 12:54               ` Robin Rosenberg
@ 2008-03-30  2:12               ` Sam Vilain
  2008-03-30  4:31                 ` Jeff King
  1 sibling, 1 reply; 37+ messages in thread
From: Sam Vilain @ 2008-03-30  2:12 UTC (permalink / raw)
  To: Jeff King; +Cc: Robin Rosenberg, Junio C Hamano, git

Jeff King wrote:
> Any idea what version of perl started shipping I18N::Langinfo? I
> couldn't see anything useful from grepping the Changes files.

Module::CoreList knows.  See the man page for that.

Sam.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
  2008-03-29 21:43                   ` Robin Rosenberg
@ 2008-03-29 22:00                     ` Jeff King
  0 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2008-03-29 22:00 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: Junio C Hamano, git

On Sat, Mar 29, 2008 at 10:43:40PM +0100, Robin Rosenberg wrote:

> First, that is an unlikely sequence even at random. For any "real" string
> it simply won't happen, even in this context. Try scanning everything you
> can think of and see if you find such a sequence that is not actually UTF-8.

That's the problem I was mentioning: "everything I can think of" is
basically just us-ascii with a few accented characters. I don't know
how, e.g., Japanese texts will fare with such a test.

> > But over all commonly used encodings, what is the probability in an
> > average text of that encoding that it contains valid UTF-8?
> > For example, I have no idea what patterns can be found in EUCJP.
> 
> See here http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf

Thanks, that is an interesting read. And he seems to indicate that you
can guess with a reasonable degree of success. But a few points on that
work:

  - he has a specific methodology for guessing, which is more elaborate
    than what you proposed. So to get his results, you would need to
    implement his method. Hopefully if perl does have a "guess if this
    looks like utf8" method, it uses a similar scheme.

  - he does admit that some encodings have difficult to assess
    probabilities, and it will vary from language to language. See page
    22:

      If a specific language does not use all three letters (a single
      letter on the left and the corresponding two letters on the
      right), then this combination presents no danger. Further checks
      can then be made with a dictionary, although there is the problem
      that a dictionary never contains all possible words, and that of
      course resource names don't necessarily have to be words.

  - he mentions Latin, Cyrillic, and Hebrew encodings. I note the
    conspicuous absence of any Asian languages.

> Note that a random string is a randomly generated string. Not a random
> string from the set of actually existing strings.

Sure. But looking at random strings isn't terribly useful; there is a
non-uniform distribution over the set of strings, dependent on the
_actual_ encoding. So there are going to be "good" encodings that will
guess well, and there will be "bad" encodings that might not (and by
"will", I mean "there may be"; that is the very thing I am saying we
don't have good evidence for).

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
  2008-03-29 12:54                   ` Robin Rosenberg
@ 2008-03-29 21:45                     ` Jeff King
  2008-03-30  3:40                       ` Sam Vilain
  0 siblings, 1 reply; 37+ messages in thread
From: Jeff King @ 2008-03-29 21:45 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: Junio C Hamano, git

On Sat, Mar 29, 2008 at 01:54:47PM +0100, Robin Rosenberg wrote:

> > There were several given in the "OS X normalize your UTF-8 filenames"
> > thread a while back. They generally boil down to "a<UMLAUT MODIFIER>"
> > versus "<A WITH UMLAUT>" both of which are valid UTF-8.
> 
> That is what /OS X/ does with file names. It changes one unicode code point
> to a sequence of other "equivalent" code points. I'm pretty sure perl does
> not do that.

My point is that we don't _know_ what is happening in between the decode
and encode. Does that intermediate form have the information required to
convert back to the exact same bytes as the original form? I don't think
you've provided any evidence that it does or does not.

But here is some evidence that it does work:

$ cat test.pl
sub is_valid {
  my $orig = shift;
  my $test = $orig;
  utf8::decode($test);
  utf8::encode($test);
  return $orig eq $test ? "yes" : "no";
}
print "utf-8: ", is_valid("\xc3\xb6"), "\n";
print "latin-1: ", is_valid("\xc3"), "\n";
print "utf-8 w/ combining: ", is_valid("o\xcc\x88"), "\n";

$ perl test.pl
utf-8: yes
latin-1: no
utf-8 w/ combining: yes

But it still feels a little wrong to test by converting. There must be
some way to ask "is this valid utf-8" (there are several candidate
functions, but I don't think either of us quite knows the right way to
invoke them).

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
  2008-03-29 21:18                 ` Jeff King
@ 2008-03-29 21:43                   ` Robin Rosenberg
  2008-03-29 22:00                     ` Jeff King
  0 siblings, 1 reply; 37+ messages in thread
From: Robin Rosenberg @ 2008-03-29 21:43 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git

Den Saturday 29 March 2008 22.18.49 skrev Jeff King:
> On Sat, Mar 29, 2008 at 01:54:10PM +0100, Robin Rosenberg wrote:
> > I think you really should try the UTF-8 guess, since a file may well be
> > UTF-8 even if the user locale is something else. Especially for XML
> > files, UTF-8 is common, but there are many more cases. Look into
> > git-gui/po for more examples. The probability of a UTF-8 test being wrong
> > is just so unimaginably low.
>
> Thinking about this more, I think it is only half the solution. If
> something is not valid utf-8, then we know it must be something else.
> But if something is valid utf-8, is it necessarily utf-8? I think we are
> going to have a much higher probability of guessing wrong there.
>
> For example, consider the bytes { 0xc3, 0xb6 }. In utf-8, they are 'ö'.
> But in iso8859-1, they also have meaning (Ã followed by a paragraph
> symbol). Now that is an unlikely combination to come up. And maybe for
> Latin-1, having two non-ascii characters next to each other is unlikely.

First, that is an unlikely sequence even at random. For any "real" string
it simply won't happen, even in this context. Try scanning everything you
can think of and see if you find such a sequence that is not actually UTF-8.

> But over all commonly used encodings, what is the probability in an
> average text of that encoding that it contains valid UTF-8?
> For example, I have no idea what patterns can be found in EUCJP.

See here http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf

Note that a random string is a randomly generated string. Not a random string
from the set of actually existing strings.

> There is some magic with how Perl marks strings as "binary" versus
> "utf-8" that I don't quite understand. And I think is_utf8 is really
> about asking "is the utf-8 flag set".
>
> I think this discussion would benefit greatly from somebody who has more
> of a clue how perl i18n stuff works. Why don't you work up a patch that
> makes sense for you, and then hopefully that will get some attention?

The only real question as I see it is whether perl has a built-in method that 
works better than the decode/encode. Anyone?

-- robin

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
  2008-03-29 12:54               ` Robin Rosenberg
@ 2008-03-29 21:18                 ` Jeff King
  2008-03-29 21:43                   ` Robin Rosenberg
  0 siblings, 1 reply; 37+ messages in thread
From: Jeff King @ 2008-03-29 21:18 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: Junio C Hamano, git

On Sat, Mar 29, 2008 at 01:54:10PM +0100, Robin Rosenberg wrote:

> I think you really should try the UTF-8 guess, since a file may well be UTF-8 
> even if the user locale is something else. Especially for XML files, UTF-8
> is common, but there are many more cases. Look into git-gui/po for more 
> examples. The probability of a UTF-8 test being wrong is just so unimaginably 
> low.

Thinking about this more, I think it is only half the solution. If
something is not valid utf-8, then we know it must be something else.
But if something is valid utf-8, is it necessarily utf-8? I think we are
going to have a much higher probability of guessing wrong there.

For example, consider the bytes { 0xc3, 0xb6 }. In utf-8, they are 'ö'.
But in iso8859-1, they also have meaning (Ã followed by a paragraph
symbol). Now that is an unlikely combination to come up. And maybe for
Latin-1, having two non-ascii characters next to each other is unlikely.
But over all commonly used encodings, what is the probability in an
average text of that encoding that it contains valid UTF-8?
For example, I have no idea what patterns can be found in EUCJP.
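
To make the ambiguity concrete, here is a small sketch that decodes the
same two bytes under both encodings, assuming the Encode module. The
helper name code_points is hypothetical:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode $bytes as $encoding and list the resulting
# code points in U+XXXX notation.
sub code_points {
    my ($encoding, $bytes) = @_;
    return map { sprintf "U+%04X", ord }
           split //, Encode::decode($encoding, $bytes);
}

my $bytes = "\xc3\xb6";
print "utf-8:     @{[ code_points('UTF-8', $bytes) ]}\n";       # U+00F6
print "iso8859-1: @{[ code_points('ISO-8859-1', $bytes) ]}\n";  # U+00C3 U+00B6
```

Both decodes succeed, which is the point: validity as UTF-8 does not by
itself prove the author meant UTF-8.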

> > PS Your 'require' is more simply written as 'use I18N::Langinfo
> > qw(langinfo CODESET)', or perhaps even simpler:
> 
> See the man page, from which I stole it. It suggests you wrap it all inside 
> eval {}, just in case your perl does not have langinfo.

Yes, that does make sense for a script (I just couldn't see it because
the entire toy example would be inside the eval).

> As for the is_utf8() I'm not sure what it does, but I can't make it work.

There is some magic with how Perl marks strings as "binary" versus
"utf-8" that I don't quite understand. And I think is_utf8 is really
about asking "is the utf-8 flag set".

I think this discussion would benefit greatly from somebody who has more
of a clue how perl i18n stuff works. Why don't you work up a patch that
makes sense for you, and then hopefully that will get some attention?

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
  2008-03-29  9:43                 ` Jeff King
@ 2008-03-29 12:54                   ` Robin Rosenberg
  2008-03-29 21:45                     ` Jeff King
  0 siblings, 1 reply; 37+ messages in thread
From: Robin Rosenberg @ 2008-03-29 12:54 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git

Den Saturday 29 March 2008 10.43.22 skrev Jeff King:
> On Sat, Mar 29, 2008 at 10:39:43AM +0100, Robin Rosenberg wrote:
> > > Because some UTF-8 sequences have multiple representations, and that
> >
> > Care to give an example?
>
> There were several given in the "OS X normalize your UTF-8 filenames"
> thread a while back. They generally boil down to "a<UMLAUT MODIFIER>"
> versus "<A WITH UMLAUT>" both of which are valid UTF-8.

That is what /OS X/ does with file names. It changes one unicode code point
to a sequence of other "equivalent" code points. I'm pretty sure perl does
not do that.

-- robin

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
  2008-03-29  9:52             ` Jeff King
@ 2008-03-29 12:54               ` Robin Rosenberg
  2008-03-29 21:18                 ` Jeff King
  2008-03-30  2:12               ` Sam Vilain
  1 sibling, 1 reply; 37+ messages in thread
From: Robin Rosenberg @ 2008-03-29 12:54 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git

Den Saturday 29 March 2008 10.52.38 skrev Jeff King:
> On Sat, Mar 29, 2008 at 10:38:48AM +0100, Robin Rosenberg wrote:
> > The environment variables are only part of the story. There is a langinfo
> > API for this. See I18N::Langinfo(3pm) that knows about those and
> > something else.
> >
> > # perl -e 'require I18N::Langinfo; I18N::Langinfo->import(qw(langinfo
> > CODESET)); $codeset = langinfo(CODESET()); print "My codeset=".
> > $codeset."\n";'
> > My codeset=ISO-8859-15
>
> Hmm, neat. So perhaps it would make sense to just use this value instead
> of utf-8, and not worry about examining the actual text (since any such
> examination is at best a guess, anyway)?

I think you really should try the UTF-8 guess, since a file may well be UTF-8 
even if the user locale is something else. Especially for XML files, UTF-8
is common, but there are many more cases. Look into git-gui/po for more 
examples. The probability of a UTF-8 test being wrong is just so unimaginably 
low.

> PS Your 'require' is more simply written as 'use I18N::Langinfo
> qw(langinfo CODESET)', or perhaps even simpler:

See the man page, from which I stole it. It suggests you wrap it all inside 
eval {}, just in case your perl does not have langinfo.
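
A minimal sketch of that eval-wrapped form as it might look in a script.
The helper name locale_codeset and the fallback argument are hypothetical;
the require/import/langinfo calls are the same ones as in the one-liner
quoted above:

```perl
use strict;
use warnings;

# Hypothetical helper: return the locale's codeset, or $fallback if
# I18N::Langinfo is unavailable or reports nothing useful.
sub locale_codeset {
    my ($fallback) = @_;
    my $codeset = eval {
        require I18N::Langinfo;
        I18N::Langinfo->import(qw(langinfo CODESET));
        langinfo(CODESET());
    };
    return (defined $codeset && length $codeset) ? $codeset : $fallback;
}

print "My codeset=", locale_codeset('utf-8'), "\n";
```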

As for the is_utf8() I'm not sure what it does, but I can't make it work.

-- robin

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
  2008-03-29  9:38           ` Robin Rosenberg
@ 2008-03-29  9:52             ` Jeff King
  2008-03-29 12:54               ` Robin Rosenberg
  2008-03-30  2:12               ` Sam Vilain
  0 siblings, 2 replies; 37+ messages in thread
From: Jeff King @ 2008-03-29  9:52 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: Junio C Hamano, git

On Sat, Mar 29, 2008 at 10:38:48AM +0100, Robin Rosenberg wrote:

> The environment variables are only part of the story. There is a langinfo API 
> for this. See I18N::Langinfo(3pm) that knows about those and something else.
> 
> # perl -e 'require I18N::Langinfo; I18N::Langinfo->import(qw(langinfo 
> CODESET)); $codeset = langinfo(CODESET()); print "My codeset=".
> $codeset."\n";'
> My codeset=ISO-8859-15

Hmm, neat. So perhaps it would make sense to just use this value instead
of utf-8, and not worry about examining the actual text (since any such
examination is at best a guess, anyway)?

Any idea what version of perl started shipping I18N::Langinfo? I
couldn't see anything useful from grepping the Changes files.

-Peff

PS Your 'require' is more simply written as 'use I18N::Langinfo
qw(langinfo CODESET)', or perhaps even simpler:

  perl -MI18N::Langinfo=langinfo,CODESET ...

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
  2008-03-29  9:39               ` Robin Rosenberg
@ 2008-03-29  9:43                 ` Jeff King
  2008-03-29 12:54                   ` Robin Rosenberg
  0 siblings, 1 reply; 37+ messages in thread
From: Jeff King @ 2008-03-29  9:43 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: Junio C Hamano, git

On Sat, Mar 29, 2008 at 10:39:43AM +0100, Robin Rosenberg wrote:

> > Because some UTF-8 sequences have multiple representations, and that
> 
> Care to give an example?

There were several given in the "OS X normalize your UTF-8 filenames"
thread a while back. They generally boil down to "a<UMLAUT MODIFIER>"
versus "<A WITH UMLAUT>" both of which are valid UTF-8.

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
  2008-03-29  9:11             ` Jeff King
@ 2008-03-29  9:39               ` Robin Rosenberg
  2008-03-29  9:43                 ` Jeff King
  0 siblings, 1 reply; 37+ messages in thread
From: Robin Rosenberg @ 2008-03-29  9:39 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git

Den Saturday 29 March 2008 10.11.45 skrev Jeff King:
> On Sat, Mar 29, 2008 at 10:02:43AM +0100, Robin Rosenberg wrote:
> > My proof is entirely empirical. What happens is that attempting to decode
> > a non-UTF-8 string will put a unicode surrogate pair into the (now
> > Unicode) string and encoding will just encode the surrogate pair into
> > UTF-8 and not the original. As a result, the encode(decode($x)) eq $x
> > *only* if $x is a valid UTF-8 octet sequence. Why would you not get the
> > original back if you start with valid UTF-8?
>
> Because some UTF-8 sequences have multiple representations, and that

Care to give an example?

-- robin

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
  2008-03-29  8:53         ` Jeff King
@ 2008-03-29  9:38           ` Robin Rosenberg
  2008-03-29  9:52             ` Jeff King
  0 siblings, 1 reply; 37+ messages in thread
From: Robin Rosenberg @ 2008-03-29  9:38 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git

Den Saturday 29 March 2008 09.53.04 skrev Jeff King:
> On Sat, Mar 29, 2008 at 09:44:55AM +0100, Robin Rosenberg wrote:
> > > OK. Do you have an example function that guesses with high probability
> > > whether a string is utf-8? If there are non-ascii characters but we
> > > _don't_ guess utf-8, what should we do?
> >
> > I guess the best bet is to assume the locale. Btw, is the encoding header
> > from the commit (when present) completely lost? (not that it can be
> > trusted anyway).
>
> What do you mean by "assume the locale"?  Is there a portable way to say
> "this is the encoding of the locale the user has chosen?" On my system I
> set LANG=en_US, and behind-the-scenes magic chooses utf-8 versus
> iso8859-1.

The environment variables are only part of the story. There is a langinfo API 
for this. See I18N::Langinfo(3pm) that knows about those and something else.

# perl -e 'require I18N::Langinfo; I18N::Langinfo->import(qw(langinfo 
CODESET)); $codeset = langinfo(CODESET()); print "My codeset=".
$codeset."\n";'
My codeset=ISO-8859-15

-- robin

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
  2008-03-29  9:02           ` Robin Rosenberg
@ 2008-03-29  9:11             ` Jeff King
  2008-03-29  9:39               ` Robin Rosenberg
  0 siblings, 1 reply; 37+ messages in thread
From: Jeff King @ 2008-03-29  9:11 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: Junio C Hamano, git

On Sat, Mar 29, 2008 at 10:02:43AM +0100, Robin Rosenberg wrote:

> My proof is entirely empirical. What happens is that attempting to decode a 
> non-UTF-8 string will put a unicode surrogate pair into the (now Unicode) 
> string and encoding will just encode the surrogate pair into UTF-8 and not 
> the original. As a result, the encode(decode($x)) eq $x *only* if $x is a
> valid UTF-8 octet sequence. Why would you not get the original back if
> you start with valid UTF-8?

Because some UTF-8 sequences have multiple representations, and that
information may be lost by whatever intermediate form is the result of
decode($x). In practice, I don't know if this happens or not.

Though it looks like there is an Encode::is_utf8 function (which is also
utf8::is_utf8, but only in perl >= 5.8.1). So we could use that, but it
needs the utf-8 flag turned on for the string. Maybe utf8::valid is
actually what we want.

But there is still a larger question. You have some binary bytes that
will go in a subject header. There are non-ascii bytes. There are
non-utf8 sequences. What do you do?

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
  2008-03-29  8:49         ` Jeff King
@ 2008-03-29  9:02           ` Robin Rosenberg
  2008-03-29  9:11             ` Jeff King
  0 siblings, 1 reply; 37+ messages in thread
From: Robin Rosenberg @ 2008-03-29  9:02 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git

Den Saturday 29 March 2008 09.49.48 skrev Jeff King:
> On Sat, Mar 29, 2008 at 09:41:53AM +0100, Robin Rosenberg wrote:
> > > OK. Do you have an example function that guesses with high probability
> > > whether a string is utf-8? If there are non-ascii characters but we
> > > _don't_ guess utf-8, what should we do?
> >
> > Any test for valid UTF-8 will do that with a very high probability. The
> > perl UTF-8 "api" is a mess. I couldn't find such a routine!?. Calling
> > decode/encode and see if you get the original string works, but that is
> > too clumsy, IMHO.
>
> Does that work? I would think you would have to compare the normalized
> versions of each string, since decode(encode($x)) is not, AIUI,
> guaranteed to produce $x.

I don't claim to understand it either. Hopefully some perl guru will step 
forward and just explain how to do this in perl.

My proof is entirely empirical. What happens is that attempting to decode a 
non-UTF-8 string will put a unicode surrogate pair into the (now Unicode) 
string and encoding will just encode the surrogate pair into UTF-8 and not 
the original. As a result, the encode(decode($x)) eq $x *only* if $x is a
valid UTF-8 octet sequence. Why would you not get the original back if
you start with valid UTF-8?

-- robin

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
  2008-03-29  8:44       ` Robin Rosenberg
@ 2008-03-29  8:53         ` Jeff King
  2008-03-29  9:38           ` Robin Rosenberg
  0 siblings, 1 reply; 37+ messages in thread
From: Jeff King @ 2008-03-29  8:53 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: Junio C Hamano, git

On Sat, Mar 29, 2008 at 09:44:55AM +0100, Robin Rosenberg wrote:

> > OK. Do you have an example function that guesses with high probability
> > whether a string is utf-8? If there are non-ascii characters but we
> > _don't_ guess utf-8, what should we do?
> 
> I guess the best bet is to assume the locale. Btw, is the encoding header
> from the commit (when present) completely lost? (not that it can be trusted
> anyway).

What do you mean by "assume the locale"?  Is there a portable way to say
"this is the encoding of the locale the user has chosen?" On my system I
set LANG=en_US, and behind-the-scenes magic chooses utf-8 versus
iso8859-1.

And there is no encoding header for the commit; the point of this patch
is to handle the "cover letter" message created by "send-email
--compose" (we should already be doing the right thing for the patch
emails, since the commit encoding is output by format-patch in a
content-type header before we even get to send-email).

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
  2008-03-29  8:41       ` Robin Rosenberg
@ 2008-03-29  8:49         ` Jeff King
  2008-03-29  9:02           ` Robin Rosenberg
  2008-03-30 23:47         ` Junio C Hamano
  1 sibling, 1 reply; 37+ messages in thread
From: Jeff King @ 2008-03-29  8:49 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: Junio C Hamano, git

On Sat, Mar 29, 2008 at 09:41:53AM +0100, Robin Rosenberg wrote:

> > OK. Do you have an example function that guesses with high probability
> > whether a string is utf-8? If there are non-ascii characters but we
> > _don't_ guess utf-8, what should we do?
> 
> Any test for valid UTF-8 will do that with a very high probability. The
> perl UTF-8 "api" is a mess. I couldn't find such a routine!?. Calling 
> decode/encode and see if you get the original string works, but that is too
> clumsy, IMHO.

Does that work? I would think you would have to compare the normalized
versions of each string, since decode(encode($x)) is not, AIUI,
guaranteed to produce $x.

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
  2008-03-29  7:22     ` Jeff King
  2008-03-29  8:41       ` Robin Rosenberg
@ 2008-03-29  8:44       ` Robin Rosenberg
  2008-03-29  8:53         ` Jeff King
  1 sibling, 1 reply; 37+ messages in thread
From: Robin Rosenberg @ 2008-03-29  8:44 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git

Den Saturday 29 March 2008 08.22.03 skrev Jeff King:
> On Sat, Mar 29, 2008 at 08:19:07AM +0100, Robin Rosenberg wrote:
> > On Friday 28 March 2008 22.29.01, Jeff King wrote:
> > > We always use 'utf-8' as the encoding, since we currently
> > > have no way of getting the information from the user.
> >
> > Don't set encoding to UTF-8 unless it actually looks like UTF-8.
>
> OK. Do you have an example function that guesses with high probability
> whether a string is utf-8? If there are non-ascii characters but we
> _don't_ guess utf-8, what should we do?

I guess the best bet is to assume the locale's encoding. Btw, is the encoding
header from the commit (when present) completely lost? (Not that it can be
trusted anyway.)

-- robin

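The fallback suggested here could look roughly like the following sketch: prefer utf-8 when the bytes are well-formed UTF-8, otherwise report the charset of the current locale. This is an assumption-laden illustration, not code from the thread; `guess_subject_encoding` is a hypothetical name, and `I18N::Langinfo` is assumed to be available (it is on typical POSIX systems).

```perl
use strict;
use warnings;
use Encode ();
use POSIX qw(setlocale LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);

# Hypothetical helper: guess an encoding name for a raw subject string.
sub guess_subject_encoding {
    my ($bytes) = @_;

    # Strict decode: dies on malformed UTF-8, so eval yields false.
    my $is_utf8 = eval {
        Encode::decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };
    return 'utf-8' if $is_utf8;

    # Not valid UTF-8: fall back to the charset of the user's locale.
    setlocale(LC_CTYPE, '');
    return langinfo(CODESET);
}

print guess_subject_encoding("s\xC3\xBCbj\xC3\xABct"), "\n";  # prints "utf-8"
print guess_subject_encoding("s\xFCbj\xEBct"), "\n";          # locale-dependent
```

The second result depends on the environment, so it remains a guess. A locale that itself uses UTF-8 would still produce a wrong label for Latin-1 bytes.
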

* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
  2008-03-29  7:22     ` Jeff King
@ 2008-03-29  8:41       ` Robin Rosenberg
  2008-03-29  8:49         ` Jeff King
  2008-03-30 23:47         ` Junio C Hamano
  2008-03-29  8:44       ` Robin Rosenberg
  1 sibling, 2 replies; 37+ messages in thread
From: Robin Rosenberg @ 2008-03-29  8:41 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git

On Saturday 29 March 2008 08.22.03, Jeff King wrote:
> On Sat, Mar 29, 2008 at 08:19:07AM +0100, Robin Rosenberg wrote:
> > On Friday 28 March 2008 22.29.01, Jeff King wrote:
> > > We always use 'utf-8' as the encoding, since we currently
> > > have no way of getting the information from the user.
> >
> > Don't set encoding to UTF-8 unless it actually looks like UTF-8.
>
> OK. Do you have an example function that guesses with high probability
> whether a string is utf-8? If there are non-ascii characters but we
> _don't_ guess utf-8, what should we do?

Any test for valid UTF-8 will do that with a very high probability. The
Perl UTF-8 "API" is a mess; I couldn't find such a routine. Calling
decode/encode and checking whether you get the original string back works,
but that is too clumsy, IMHO.

-- robin


* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
  2008-03-29  7:19   ` Robin Rosenberg
@ 2008-03-29  7:22     ` Jeff King
  2008-03-29  8:41       ` Robin Rosenberg
  2008-03-29  8:44       ` Robin Rosenberg
  0 siblings, 2 replies; 37+ messages in thread
From: Jeff King @ 2008-03-29  7:22 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: Junio C Hamano, git

On Sat, Mar 29, 2008 at 08:19:07AM +0100, Robin Rosenberg wrote:

> On Friday 28 March 2008 22.29.01, Jeff King wrote:
> > We always use 'utf-8' as the encoding, since we currently
> > have no way of getting the information from the user.
> 
> Don't set encoding to UTF-8 unless it actually looks like UTF-8.

OK. Do you have an example function that guesses with high probability
whether a string is utf-8? If there are non-ascii characters but we
_don't_ guess utf-8, what should we do?

-Peff


* Re: [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
  2008-03-28 21:29 ` [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters Jeff King
@ 2008-03-29  7:19   ` Robin Rosenberg
  2008-03-29  7:22     ` Jeff King
  2008-05-21 19:39   ` Junio C Hamano
  1 sibling, 1 reply; 37+ messages in thread
From: Robin Rosenberg @ 2008-03-29  7:19 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git

On Friday 28 March 2008 22.29.01, Jeff King wrote:
> We always use 'utf-8' as the encoding, since we currently
> have no way of getting the information from the user.

Don't set encoding to UTF-8 unless it actually looks like UTF-8.

-- robin


* [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters
  2008-03-28 21:27 [ANNOUNCE] GIT 1.5.5-rc2 Jeff King
@ 2008-03-28 21:29 ` Jeff King
  2008-03-29  7:19   ` Robin Rosenberg
  2008-05-21 19:39   ` Junio C Hamano
  0 siblings, 2 replies; 37+ messages in thread
From: Jeff King @ 2008-03-28 21:29 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

We always use 'utf-8' as the encoding, since we currently
have no way of getting the information from the user.

This also refactors the quoting of recipient names, since
both processes can share the rfc2047 quoting code.

Signed-off-by: Jeff King <peff@peff.net>
---
 git-send-email.perl   |   19 +++++++++++++++++--
 t/t9001-send-email.sh |   15 +++++++++++++++
 2 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/git-send-email.perl b/git-send-email.perl
index 7c4f06c..d0f9d4a 100755
--- a/git-send-email.perl
+++ b/git-send-email.perl
@@ -536,6 +536,14 @@ EOT
 		if (!$in_body && /^MIME-Version:/i) {
 			$need_8bit_cte = 0;
 		}
+		if (!$in_body && /^Subject: ?(.*)/i) {
+			my $subject = $1;
+			$_ = "Subject: " .
+				($subject =~ /[^[:ascii:]]/ ?
+				 quote_rfc2047($subject) :
+				 $subject) .
+				"\n";
+		}
 		print C2 $_;
 	}
 	close(C);
@@ -626,6 +634,14 @@ sub unquote_rfc2047 {
 	return wantarray ? ($_, $encoding) : $_;
 }
 
+sub quote_rfc2047 {
+	local $_ = shift;
+	my $encoding = shift || 'utf-8';
+	s/([^-a-zA-Z0-9!*+\/])/sprintf("=%02X", ord($1))/eg;
+	s/(.*)/=\?$encoding\?q\?$1\?=/;
+	return $_;
+}
+
 # use the simplest quoting being able to handle the recipient
 sub sanitize_address
 {
@@ -643,8 +659,7 @@ sub sanitize_address
 
 	# rfc2047 is needed if a non-ascii char is included
 	if ($recipient_name =~ /[^[:ascii:]]/) {
-		$recipient_name =~ s/([^-a-zA-Z0-9!*+\/])/sprintf("=%02X", ord($1))/eg;
-		$recipient_name =~ s/(.*)/=\?utf-8\?q\?$1\?=/;
+		$recipient_name = quote_rfc2047($recipient_name);
 	}
 
 	# double quotes are needed if specials or CTLs are included
diff --git a/t/t9001-send-email.sh b/t/t9001-send-email.sh
index e222c49..a4bcd28 100755
--- a/t/t9001-send-email.sh
+++ b/t/t9001-send-email.sh
@@ -210,4 +210,19 @@ test_expect_success '--compose respects user mime type' '
 	! grep "^Content-Type: text/plain; charset=utf-8" msgtxt1
 '
 
+test_expect_success '--compose adds MIME for utf8 subject' '
+	clean_fake_sendmail &&
+	echo y | \
+	  GIT_EDITOR=$(pwd)/fake-editor \
+	  GIT_SEND_EMAIL_NOTTY=1 \
+	  git send-email \
+	  --compose --subject utf8-sübjëct \
+	  --from="Example <nobody@example.com>" \
+	  --to=nobody@example.com \
+	  --smtp-server="$(pwd)/fake.sendmail" \
+	  $patches &&
+	grep "^fake edit" msgtxt1 &&
+	grep "^Subject: =?utf-8?q?utf8-s=C3=BCbj=C3=ABct?=" msgtxt1
+'
+
 test_done
-- 
1.5.5.rc1.141.g50ecd.dirty

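For readers who want to try the quoting outside of git-send-email, here is a standalone sketch. The body of `quote_rfc2047` is copied from the patch above; the driver lines and the byte-string literal are illustrative additions. It assumes the subject is a raw byte string: with a decoded character string, any code point above 0xFF would format to more than two hex digits.

```perl
use strict;
use warnings;

# Copied from the patch: Q-encode a byte string as an RFC 2047 encoded-word.
sub quote_rfc2047 {
    local $_ = shift;
    my $encoding = shift || 'utf-8';
    # Escape every byte outside the safe set as =XX.
    s/([^-a-zA-Z0-9!*+\/])/sprintf("=%02X", ord($1))/eg;
    # Wrap the result as =?charset?q?...?=
    s/(.*)/=\?$encoding\?q\?$1\?=/;
    return $_;
}

# "utf8-sübjëct" as UTF-8 bytes, matching the subject used in the t9001 test.
my $subject = "utf8-s\xC3\xBCbj\xC3\xABct";

# Only quote when a non-ASCII byte is present, as the patch does.
my $header = $subject =~ /[^[:ascii:]]/ ? quote_rfc2047($subject) : $subject;
print "Subject: $header\n";  # prints "Subject: =?utf-8?q?utf8-s=C3=BCbj=C3=ABct?="
```

The printed header is the same string the new test greps for in msgtxt1.
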

Thread overview: 37+ messages
     [not found] <7caf19ae394accab538d2f94953bb62b55a2c79f.1206486012.git.peff@peff.net>
2008-03-25 23:03 ` [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters Jeff King
2008-03-26  5:59   ` Teemu Likonen
2008-03-26  6:20     ` Jeff King
2008-03-26  8:30       ` Teemu Likonen
2008-03-26  8:39         ` Jeff King
2008-03-26  9:23           ` Teemu Likonen
2008-03-26  9:32             ` Teemu Likonen
2008-03-26  9:35               ` Jeff King
2008-03-26  9:33             ` Jeff King
2008-03-27  7:38               ` Jeff King
2008-03-27 19:44                 ` Todd Zullinger
2008-03-28 21:27 [ANNOUNCE] GIT 1.5.5-rc2 Jeff King
2008-03-28 21:29 ` [PATCH 2/2] send-email: rfc2047-quote subject lines with non-ascii characters Jeff King
2008-03-29  7:19   ` Robin Rosenberg
2008-03-29  7:22     ` Jeff King
2008-03-29  8:41       ` Robin Rosenberg
2008-03-29  8:49         ` Jeff King
2008-03-29  9:02           ` Robin Rosenberg
2008-03-29  9:11             ` Jeff King
2008-03-29  9:39               ` Robin Rosenberg
2008-03-29  9:43                 ` Jeff King
2008-03-29 12:54                   ` Robin Rosenberg
2008-03-29 21:45                     ` Jeff King
2008-03-30  3:40                       ` Sam Vilain
2008-03-30  4:39                         ` Jeff King
2008-03-30 23:47         ` Junio C Hamano
2008-03-29  8:44       ` Robin Rosenberg
2008-03-29  8:53         ` Jeff King
2008-03-29  9:38           ` Robin Rosenberg
2008-03-29  9:52             ` Jeff King
2008-03-29 12:54               ` Robin Rosenberg
2008-03-29 21:18                 ` Jeff King
2008-03-29 21:43                   ` Robin Rosenberg
2008-03-29 22:00                     ` Jeff King
2008-03-30  2:12               ` Sam Vilain
2008-03-30  4:31                 ` Jeff King
2008-05-21 19:39   ` Junio C Hamano
2008-05-21 19:47     ` Jeff King
