All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] send-email: handle adjacent RFC 2047-encoded words properly
@ 2014-11-23 23:50 Роман Донченко
  2014-11-24  7:27 ` Junio C Hamano
  2014-11-24 15:36 ` Jeff King
  0 siblings, 2 replies; 7+ messages in thread
From: Роман Донченко @ 2014-11-23 23:50 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Ævar Arnfjörð Bjarmason,
	Jay Soffian, Jeff King, Thomas Rast

The RFC says that they are to be concatenated after decoding (i.e. the
intervening whitespace is ignored).

I change the sender's name to an all-Cyrillic string in the tests so that
its encoded form goes over the 76 characters in a line limit, forcing
format-patch to split it into multiple encoded words.

Since I have to modify the regular expression for an encoded word anyway,
I take the opportunity to bring it closer to the spec, most notably
disallowing embedded spaces and making it case-insensitive (thus allowing
the encoding to be specified as both "q" and "Q").

Signed-off-by: Роман Донченко <dpb@corrigendum.ru>
---
 git-send-email.perl   | 21 +++++++++++++++------
 t/t9001-send-email.sh | 18 +++++++++---------
 2 files changed, 24 insertions(+), 15 deletions(-)

diff --git a/git-send-email.perl b/git-send-email.perl
index 9949db0..4bb9f6f 100755
--- a/git-send-email.perl
+++ b/git-send-email.perl
@@ -913,13 +913,22 @@ $time = time - scalar $#files;
 
 sub unquote_rfc2047 {
 	local ($_) = @_;
+
+	my $et = qr/[!->@-~]+/; # encoded-text from RFC 2047
+	my $sep = qr/[ \t]+/;
+	my $encoded_word = qr/=\?($et)\?q\?($et)\?=/i;
+
 	my $encoding;
-	s{=\?([^?]+)\?q\?(.*?)\?=}{
-		$encoding = $1;
-		my $e = $2;
-		$e =~ s/_/ /g;
-		$e =~ s/=([0-9A-F]{2})/chr(hex($1))/eg;
-		$e;
+	s{$encoded_word(?:$sep$encoded_word)+}{
+		my @words = split $sep, $&;
+		foreach (@words) {
+			m/$encoded_word/;
+			$encoding = $1;
+			$_ = $2;
+			s/_/ /g;
+			s/=([0-9A-F]{2})/chr(hex($1))/eg;
+		}
+		join '', @words;
 	}eg;
 	return wantarray ? ($_, $encoding) : $_;
 }
diff --git a/t/t9001-send-email.sh b/t/t9001-send-email.sh
index 19a3ced..318b870 100755
--- a/t/t9001-send-email.sh
+++ b/t/t9001-send-email.sh
@@ -236,7 +236,7 @@ test_expect_success $PREREQ 'self name with dot is suppressed' "
 "
 
 test_expect_success $PREREQ 'non-ascii self name is suppressed' "
-	test_suppress_self_quoted 'Füñný Nâmé' 'odd_?=mail@example.com' \
+	test_suppress_self_quoted 'Кириллическое Имя' 'odd_?=mail@example.com' \
 		'non_ascii_self_suppressed'
 "
 
@@ -946,25 +946,25 @@ test_expect_success $PREREQ 'utf8 author is correctly passed on' '
 	clean_fake_sendmail &&
 	test_commit weird_author &&
 	test_when_finished "git reset --hard HEAD^" &&
-	git commit --amend --author "Füñný Nâmé <odd_?=mail@example.com>" &&
-	git format-patch --stdout -1 >funny_name.patch &&
+	git commit --amend --author "Кириллическое Имя <odd_?=mail@example.com>" &&
+	git format-patch --stdout -1 >nonascii_name.patch &&
 	git send-email --from="Example <nobody@example.com>" \
 	  --to=nobody@example.com \
 	  --smtp-server="$(pwd)/fake.sendmail" \
-	  funny_name.patch &&
-	grep "^From: Füñný Nâmé <odd_?=mail@example.com>" msgtxt1
+	  nonascii_name.patch &&
+	grep "^From: Кириллическое Имя <odd_?=mail@example.com>" msgtxt1
 '
 
 test_expect_success $PREREQ 'utf8 sender is not duplicated' '
 	clean_fake_sendmail &&
 	test_commit weird_sender &&
 	test_when_finished "git reset --hard HEAD^" &&
-	git commit --amend --author "Füñný Nâmé <odd_?=mail@example.com>" &&
-	git format-patch --stdout -1 >funny_name.patch &&
-	git send-email --from="Füñný Nâmé <odd_?=mail@example.com>" \
+	git commit --amend --author "Кириллическое Имя <odd_?=mail@example.com>" &&
+	git format-patch --stdout -1 >nonascii_name.patch &&
+	git send-email --from="Кириллическое Имя <odd_?=mail@example.com>" \
 	  --to=nobody@example.com \
 	  --smtp-server="$(pwd)/fake.sendmail" \
-	  funny_name.patch &&
+	  nonascii_name.patch &&
 	grep "^From: " msgtxt1 >msgfrom &&
 	test_line_count = 1 msgfrom
 '
-- 
2.1.1

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] send-email: handle adjacent RFC 2047-encoded words properly
  2014-11-23 23:50 [PATCH] send-email: handle adjacent RFC 2047-encoded words properly Роман Донченко
@ 2014-11-24  7:27 ` Junio C Hamano
  2014-11-24 15:44   ` Jeff King
  2014-11-24 18:09   ` Роман Донченко
  2014-11-24 15:36 ` Jeff King
  1 sibling, 2 replies; 7+ messages in thread
From: Junio C Hamano @ 2014-11-24  7:27 UTC (permalink / raw)
  To: Роман
	Донченко
  Cc: Git Mailing List, Ævar Arnfjörð Bjarmason,
	Jay Soffian, Jeff King, Thomas Rast

On Sun, Nov 23, 2014 at 3:50 PM, Роман Донченко <dpb@corrigendum.ru> wrote:
> The RFC says that they are to be concatenated after decoding (i.e. the
> intervening whitespace is ignored).
>
> I change the sender's name to an all-Cyrillic string in the tests so that
> its encoded form goes over the 76 characters in a line limit, forcing
> format-patch to split it into multiple encoded words.
>
> Since I have to modify the regular expression for an encoded word anyway,
> I take the opportunity to bring it closer to the spec, most notably
> disallowing embedded spaces and making it case-insensitive (thus allowing
> the encoding to be specified as both "q" and "Q").
>
> Signed-off-by: Роман Донченко <dpb@corrigendum.ru>

This sounds like a worthy thing to do in general.

I wonder if the C implementation we have for mailinfo needs similar
update, though. I vaguely recall that we have case-insensitive start for
q/b segments, but do not remember the details offhand.

Was the change to the test to use Cyrillic really necessary, or did it
suffice if you simply extended the existsing "Funny Name" spelled with
strange accents, but you substituted the whole string anyway?

Until I found out what the new string says by running web-based
translation on it, I felt somewhat uneasy. As I do not read
Cyrillic/Russian, we may have been adding some profanity without
knowing. It turns out that the string just says "Cyrillic Name", so I am
not against using the new string, but it simply looked odd to replace the
string whole-sale when you merely need a longer string. It made it look
as if a bug was specific to Cyrillic when it wasn't.

As you may notice by reading "git log --no-merges" from recent history,
we tend not to say "I did X, I did Y". If the tone of the above message
were more similar to them, it may have been easier to read.

But other than these minor nits, the change looks good from
a cursory read.

Thanks.

> ---
>  git-send-email.perl   | 21 +++++++++++++++------
>  t/t9001-send-email.sh | 18 +++++++++---------
>  2 files changed, 24 insertions(+), 15 deletions(-)
>
> diff --git a/git-send-email.perl b/git-send-email.perl
> index 9949db0..4bb9f6f 100755
> --- a/git-send-email.perl
> +++ b/git-send-email.perl
> @@ -913,13 +913,22 @@ $time = time - scalar $#files;
>
>  sub unquote_rfc2047 {
>         local ($_) = @_;
> +
> +       my $et = qr/[!->@-~]+/; # encoded-text from RFC 2047
> +       my $sep = qr/[ \t]+/;
> +       my $encoded_word = qr/=\?($et)\?q\?($et)\?=/i;
> +
>         my $encoding;
> -       s{=\?([^?]+)\?q\?(.*?)\?=}{
> -               $encoding = $1;
> -               my $e = $2;
> -               $e =~ s/_/ /g;
> -               $e =~ s/=([0-9A-F]{2})/chr(hex($1))/eg;
> -               $e;
> +       s{$encoded_word(?:$sep$encoded_word)+}{
> +               my @words = split $sep, $&;
> +               foreach (@words) {
> +                       m/$encoded_word/;
> +                       $encoding = $1;
> +                       $_ = $2;
> +                       s/_/ /g;
> +                       s/=([0-9A-F]{2})/chr(hex($1))/eg;
> +               }
> +               join '', @words;
>         }eg;
>         return wantarray ? ($_, $encoding) : $_;
>  }
> diff --git a/t/t9001-send-email.sh b/t/t9001-send-email.sh
> index 19a3ced..318b870 100755
> --- a/t/t9001-send-email.sh
> +++ b/t/t9001-send-email.sh
> @@ -236,7 +236,7 @@ test_expect_success $PREREQ 'self name with dot is suppressed' "
>  "
>
>  test_expect_success $PREREQ 'non-ascii self name is suppressed' "
> -       test_suppress_self_quoted 'Füñný Nâmé' 'odd_?=mail@example.com' \
> +       test_suppress_self_quoted 'Кириллическое Имя' 'odd_?=mail@example.com' \
>                 'non_ascii_self_suppressed'
>  "
>
> @@ -946,25 +946,25 @@ test_expect_success $PREREQ 'utf8 author is correctly passed on' '
>         clean_fake_sendmail &&
>         test_commit weird_author &&
>         test_when_finished "git reset --hard HEAD^" &&
> -       git commit --amend --author "Füñný Nâmé <odd_?=mail@example.com>" &&
> -       git format-patch --stdout -1 >funny_name.patch &&
> +       git commit --amend --author "Кириллическое Имя <odd_?=mail@example.com>" &&
> +       git format-patch --stdout -1 >nonascii_name.patch &&
>         git send-email --from="Example <nobody@example.com>" \
>           --to=nobody@example.com \
>           --smtp-server="$(pwd)/fake.sendmail" \
> -         funny_name.patch &&
> -       grep "^From: Füñný Nâmé <odd_?=mail@example.com>" msgtxt1
> +         nonascii_name.patch &&
> +       grep "^From: Кириллическое Имя <odd_?=mail@example.com>" msgtxt1
>  '
>
>  test_expect_success $PREREQ 'utf8 sender is not duplicated' '
>         clean_fake_sendmail &&
>         test_commit weird_sender &&
>         test_when_finished "git reset --hard HEAD^" &&
> -       git commit --amend --author "Füñný Nâmé <odd_?=mail@example.com>" &&
> -       git format-patch --stdout -1 >funny_name.patch &&
> -       git send-email --from="Füñný Nâmé <odd_?=mail@example.com>" \
> +       git commit --amend --author "Кириллическое Имя <odd_?=mail@example.com>" &&
> +       git format-patch --stdout -1 >nonascii_name.patch &&
> +       git send-email --from="Кириллическое Имя <odd_?=mail@example.com>" \
>           --to=nobody@example.com \
>           --smtp-server="$(pwd)/fake.sendmail" \
> -         funny_name.patch &&
> +         nonascii_name.patch &&
>         grep "^From: " msgtxt1 >msgfrom &&
>         test_line_count = 1 msgfrom
>  '
> --
> 2.1.1
>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] send-email: handle adjacent RFC 2047-encoded words properly
  2014-11-23 23:50 [PATCH] send-email: handle adjacent RFC 2047-encoded words properly Роман Донченко
  2014-11-24  7:27 ` Junio C Hamano
@ 2014-11-24 15:36 ` Jeff King
  2014-11-24 18:26   ` Роман Донченко
  1 sibling, 1 reply; 7+ messages in thread
From: Jeff King @ 2014-11-24 15:36 UTC (permalink / raw)
  To: Роман
	Донченко
  Cc: git, Junio C Hamano, Ævar Arnfjörð Bjarmason,
	Jay Soffian, Thomas Rast

On Mon, Nov 24, 2014 at 02:50:04AM +0300, Роман Донченко wrote:

> The RFC says that they are to be concatenated after decoding (i.e. the
> intervening whitespace is ignored).
> 
> I change the sender's name to an all-Cyrillic string in the tests so that
> its encoded form goes over the 76 characters in a line limit, forcing
> format-patch to split it into multiple encoded words.
> 
> Since I have to modify the regular expression for an encoded word anyway,
> I take the opportunity to bring it closer to the spec, most notably
> disallowing embedded spaces and making it case-insensitive (thus allowing
> the encoding to be specified as both "q" and "Q").

The overall goal makes sense to me. Thanks for working on this. I have a
few questions/comments, though.

>  sub unquote_rfc2047 {
>  	local ($_) = @_;
> +
> +	my $et = qr/[!->@-~]+/; # encoded-text from RFC 2047
> +	my $sep = qr/[ \t]+/;
> +	my $encoded_word = qr/=\?($et)\?q\?($et)\?=/i;

The first $et in $encoded_word is actually the charset, which is defined
by RFC 2047 as:

     charset = token    ; see section 3

     token = 1*<Any CHAR except SPACE, CTLs, and especials>

     especials = "(" / ")" / "<" / ">" / "@" / "," / ";" / ":" / "
	               <"> / "/" / "[" / "]" / "?" / "." / "="

Your regex is a little more liberal. I doubt that it is a big deal in
practice (actually, in practice, I suspect [a-zA-Z0-9-] would be fine).
But if we are tightening things up in general, it may make sense to do
so here (and I notice that is_rfc2047_quoted does a more thorough $token
definition, and it probably makes sense for the two functions to be
consistent).

For your definition of encoded-text, RFC 2047 says:

     encoded-text = 1*<Any printable ASCII character other than "?"
                          or SPACE>

It looks like you pulled the definition of $et from is_rfc2047_quoted,
but I am not clear on where that original came from (it is from a3a8262,
but that commit message does not explain the regex).

Also, I note that we handle 'q'-style encodings here, but not 'b'. I
wonder if it is worth adding that in while we are in the area (it is not
a big deal if you always send-email git-generated patches, as we never
generate it).

> +	s{$encoded_word(?:$sep$encoded_word)+}{

If I am reading this right, it requires at least two $encoded_words.
Should this "+" be a "*"?

> +		my @words = split $sep, $&;
> +		foreach (@words) {
> +			m/$encoded_word/;
> +			$encoding = $1;
> +			$_ = $2;
> +			s/_/ /g;
> +			s/=([0-9A-F]{2})/chr(hex($1))/eg;

In the spirit of your earlier change, should this final regex be
case-insensitive? RFC 2047 says only "Upper case should be used for
hexadecimal digits "A" through "F." but that does not seem like a "MUST"
to me.

-Peff

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] send-email: handle adjacent RFC 2047-encoded words properly
  2014-11-24  7:27 ` Junio C Hamano
@ 2014-11-24 15:44   ` Jeff King
  2014-11-24 18:09   ` Роман Донченко
  1 sibling, 0 replies; 7+ messages in thread
From: Jeff King @ 2014-11-24 15:44 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Роман
	Донченко,
	Git Mailing List, Ævar Arnfjörð Bjarmason,
	Jay Soffian, Thomas Rast

On Sun, Nov 23, 2014 at 11:27:51PM -0800, Junio C Hamano wrote:

> Was the change to the test to use Cyrillic really necessary, or did it
> suffice if you simply extended the existsing "Funny Name" spelled with
> strange accents, but you substituted the whole string anyway?
> 
> Until I found out what the new string says by running web-based
> translation on it, I felt somewhat uneasy. As I do not read
> Cyrillic/Russian, we may have been adding some profanity without
> knowing. It turns out that the string just says "Cyrillic Name", so I am
> not against using the new string, but it simply looked odd to replace the
> string whole-sale when you merely need a longer string. It made it look
> as if a bug was specific to Cyrillic when it wasn't.

I do not mind hidden Cyrillic profanity[1], but I found the new text
much harder to verify, because the shapes are very unfamiliar to my
eyes. I'd prefer if we can stick to accented Roman letters.  I realize
this is me being totally Anglo-centric. But for Cyrillic readers,
consider how much more difficult it would be to manually verify the test
if it were written in an unfamiliar script (e.g., Hangul).  The
surrounding code is already written in Roman characters (and English),
so it probably makes sense as a common denominator.

-Peff

[1] As long as it is only crude and not mean. :)

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] send-email: handle adjacent RFC 2047-encoded words properly
  2014-11-24  7:27 ` Junio C Hamano
  2014-11-24 15:44   ` Jeff King
@ 2014-11-24 18:09   ` Роман Донченко
  1 sibling, 0 replies; 7+ messages in thread
From: Роман Донченко @ 2014-11-24 18:09 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Git Mailing List, Ævar Arnfjörð Bjarmason,
	Jay Soffian, Jeff King, Thomas Rast

Junio C Hamano <gitster@pobox.com> писал в своём письме Mon, 24 Nov 2014  
10:27:51 +0300:

> On Sun, Nov 23, 2014 at 3:50 PM, Роман Донченко <dpb@corrigendum.ru>  
> wrote:
>> The RFC says that they are to be concatenated after decoding (i.e. the
>> intervening whitespace is ignored).
>>
>> I change the sender's name to an all-Cyrillic string in the tests so  
>> that
>> its encoded form goes over the 76 characters in a line limit, forcing
>> format-patch to split it into multiple encoded words.
>>
>> Since I have to modify the regular expression for an encoded word  
>> anyway,
>> I take the opportunity to bring it closer to the spec, most notably
>> disallowing embedded spaces and making it case-insensitive (thus  
>> allowing
>> the encoding to be specified as both "q" and "Q").
>>
>> Signed-off-by: Роман Донченко <dpb@corrigendum.ru>
>
> This sounds like a worthy thing to do in general.
>
> I wonder if the C implementation we have for mailinfo needs similar
> update, though. I vaguely recall that we have case-insensitive start for
> q/b segments, but do not remember the details offhand.

That's what git am uses, right? I think that already works correctly (or  
at least doesn't have the bug this patch fixes). I didn't do extensive  
testing or look at the code, though.

>
> Was the change to the test to use Cyrillic really necessary, or did it
> suffice if you simply extended the existsing "Funny Name" spelled with
> strange accents, but you substituted the whole string anyway?
>
> Until I found out what the new string says by running web-based
> translation on it, I felt somewhat uneasy. As I do not read
> Cyrillic/Russian, we may have been adding some profanity without
> knowing. It turns out that the string just says "Cyrillic Name", so I am
> not against using the new string, but it simply looked odd to replace the
> string whole-sale when you merely need a longer string. It made it look
> as if a bug was specific to Cyrillic when it wasn't.

Ah, if only I had thought of including profanity beforehand. ;-)

Seriously though, I just needed to hit the 76 character limit, and  
switching the keyboard layout is a lot easier than copypasting Latin  
letters with diacritics (plus I had trouble coming up with a long enough  
extension of "Funny Name"...). I can see how that's problematic, though;  
I'll change it.

> As you may notice by reading "git log --no-merges" from recent history,
> we tend not to say "I did X, I did Y". If the tone of the above message
> were more similar to them, it may have been easier to read.

Technically, I said "I do", not "I did"... but sure, point taken.

Roman.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] send-email: handle adjacent RFC 2047-encoded words properly
  2014-11-24 15:36 ` Jeff King
@ 2014-11-24 18:26   ` Роман Донченко
  2014-11-24 23:03     ` Jeff King
  0 siblings, 1 reply; 7+ messages in thread
From: Роман Донченко @ 2014-11-24 18:26 UTC (permalink / raw)
  To: Jeff King
  Cc: git, Junio C Hamano, Ævar Arnfjörð Bjarmason,
	Jay Soffian, Thomas Rast

Jeff King <peff@peff.net> писал в своём письме Mon, 24 Nov 2014 18:36:09  
+0300:

> On Mon, Nov 24, 2014 at 02:50:04AM +0300, Роман Донченко wrote:
>
>> The RFC says that they are to be concatenated after decoding (i.e. the
>> intervening whitespace is ignored).
>>
>> I change the sender's name to an all-Cyrillic string in the tests so  
>> that
>> its encoded form goes over the 76 characters in a line limit, forcing
>> format-patch to split it into multiple encoded words.
>>
>> Since I have to modify the regular expression for an encoded word  
>> anyway,
>> I take the opportunity to bring it closer to the spec, most notably
>> disallowing embedded spaces and making it case-insensitive (thus  
>> allowing
>> the encoding to be specified as both "q" and "Q").
>
> The overall goal makes sense to me. Thanks for working on this. I have a
> few questions/comments, though.
>
>>  sub unquote_rfc2047 {
>>  	local ($_) = @_;
>> +
>> +	my $et = qr/[!->@-~]+/; # encoded-text from RFC 2047
>> +	my $sep = qr/[ \t]+/;
>> +	my $encoded_word = qr/=\?($et)\?q\?($et)\?=/i;
>
> The first $et in $encoded_word is actually the charset, which is defined
> by RFC 2047 as:
>
>      charset = token    ; see section 3
>
>      token = 1*<Any CHAR except SPACE, CTLs, and especials>
>
>      especials = "(" / ")" / "<" / ">" / "@" / "," / ";" / ":" / "
> 	               <"> / "/" / "[" / "]" / "?" / "." / "="
>
> Your regex is a little more liberal. I doubt that it is a big deal in
> practice (actually, in practice, I suspect [a-zA-Z0-9-] would be fine).
> But if we are tightening things up in general, it may make sense to do
> so here (and I notice that is_rfc2047_quoted does a more thorough $token
> definition, and it probably makes sense for the two functions to be
> consistent).

Yeah, I did realize that token is more restrictive than encoded-text, but  
I didn't want to stray too far from the subject line of the patch. What  
I'll probably do is split the patch into two, one for regex tweaking and  
one for multiple-word handling. And yeah, I'll try to make the two  
functions use the same regexes.

>
> For your definition of encoded-text, RFC 2047 says:
>
>      encoded-text = 1*<Any printable ASCII character other than "?"
>                           or SPACE>
>
> It looks like you pulled the definition of $et from is_rfc2047_quoted,
> but I am not clear on where that original came from (it is from a3a8262,
> but that commit message does not explain the regex).

No, it's actually an independent discovery. :-) I don't think it needs  
explanation, though - it's just a character class with two ranges covering  
every printable character but the question mark.

> Also, I note that we handle 'q'-style encodings here, but not 'b'. I
> wonder if it is worth adding that in while we are in the area (it is not
> a big deal if you always send-email git-generated patches, as we never
> generate it).

I could add "b" decoding, but since format-patch never generates "b"  
encodings, testing would be a problem. And I'd rather not do it without  
any tests.

>
>> +	s{$encoded_word(?:$sep$encoded_word)+}{
>
> If I am reading this right, it requires at least two $encoded_words.
> Should this "+" be a "*"?

I hang my head in shame. Looks like I'll have to add more tests...

>
>> +		my @words = split $sep, $&;
>> +		foreach (@words) {
>> +			m/$encoded_word/;
>> +			$encoding = $1;
>> +			$_ = $2;
>> +			s/_/ /g;
>> +			s/=([0-9A-F]{2})/chr(hex($1))/eg;
>
> In the spirit of your earlier change, should this final regex be
> case-insensitive? RFC 2047 says only "Upper case should be used for
> hexadecimal digits "A" through "F." but that does not seem like a "MUST"
> to me.

Sounds reasonable.

Roman.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] send-email: handle adjacent RFC 2047-encoded words properly
  2014-11-24 18:26   ` Роман Донченко
@ 2014-11-24 23:03     ` Jeff King
  0 siblings, 0 replies; 7+ messages in thread
From: Jeff King @ 2014-11-24 23:03 UTC (permalink / raw)
  To: Роман
	Донченко
  Cc: git, Junio C Hamano, Ævar Arnfjörð Bjarmason,
	Jay Soffian, Thomas Rast

On Mon, Nov 24, 2014 at 09:26:22PM +0300, Роман Донченко wrote:

> Yeah, I did realize that token is more restrictive than encoded-text, but I
> didn't want to stray too far from the subject line of the patch. What I'll
> probably do is split the patch into two, one for regex tweaking and one for
> multiple-word handling. And yeah, I'll try to make the two functions use the
> same regexes.

Thanks, I think that sounds like a good plan.

> >For your definition of encoded-text, RFC 2047 says:
> >
> >     encoded-text = 1*<Any printable ASCII character other than "?"
> >                          or SPACE>
> >
> >It looks like you pulled the definition of $et from is_rfc2047_quoted,
> >but I am not clear on where that original came from (it is from a3a8262,
> >but that commit message does not explain the regex).
> 
> No, it's actually an independent discovery. :-) I don't think it needs
> explanation, though - it's just a character class with two ranges covering
> every printable character but the question mark.

And now it is my turn to hang my head in shame. I viewed that as a set
of characters, rather than ranges. The "-" just blended into the mass of
punctuation, and I mistook the "!" for negation.

I wonder if it would be more readable as:

  [\x21-\x3e\x40-\x7e]

or something. I guess perl even has classes pre-made for "printable
ascii". I dunno. It may be OK as-is, too, and I just need to read more
carefully. :)

> >Also, I note that we handle 'q'-style encodings here, but not 'b'. I
> >wonder if it is worth adding that in while we are in the area (it is not
> >a big deal if you always send-email git-generated patches, as we never
> >generate it).
> 
> I could add "b" decoding, but since format-patch never generates "b"
> encodings, testing would be a problem. And I'd rather not do it without any
> tests.

I think you could include some literal fixtures in the test suite (t5100
already does this for mailinfo). But I don't think handling 'b' is a
requirement here. It's really orthogonal to your patches, and nobody has
actually asked for it, so I don't mind leaving it for another day.

-Peff

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2014-11-24 23:03 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-11-23 23:50 [PATCH] send-email: handle adjacent RFC 2047-encoded words properly Роман Донченко
2014-11-24  7:27 ` Junio C Hamano
2014-11-24 15:44   ` Jeff King
2014-11-24 18:09   ` Роман Донченко
2014-11-24 15:36 ` Jeff King
2014-11-24 18:26   ` Роман Донченко
2014-11-24 23:03     ` Jeff King

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.