linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3] checkpatch: improve email parsing
@ 2020-11-05 11:59 Dwaipayan Ray
  2020-11-05 17:41 ` Joe Perches
  0 siblings, 1 reply; 4+ messages in thread
From: Dwaipayan Ray @ 2020-11-05 11:59 UTC (permalink / raw)
  To: joe
  Cc: linux-kernel-mentees, dwaipayanray1, linux-kernel, lukas.bulwahn,
	yashsri421

checkpatch doesn't report warnings for many common mistakes
in emails. Some of which are trailing commas and incorrect
use of email comments.

At the same time several false positives are reported due to
incorrect handling of mail comments. The most common of which
is due to the pattern:

<stable@vger.kernel.org> # X.X

Improve email parsing in checkpatch.

Some general comment rules are defined:

- Multiple name comments should not be allowed.
- Comments inside address should not be allowed.
- In general comments should be enclosed within parentheses.
  Exception for stable@vger.kernel.org # X.X

Improvements to parsing:

- Detect and report unexpected content after email.
- Quoted names are excluded from comment parsing.
- Trailing dots or commas in email are removed during
  formatting. Correspondingly a BAD_SIGN_OFF warning
  is emitted.
- Improperly quoted email like '"name <address>"' are now
  warned about.

In addition, added fixes for all the possible rules.

Signed-off-by: Dwaipayan Ray <dwaipayanray1@gmail.com>
---
 scripts/checkpatch.pl | 88 ++++++++++++++++++++++++++++++++++---------
 1 file changed, 71 insertions(+), 17 deletions(-)

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index fab38b493cef..dc8b664b7de1 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -1152,6 +1152,7 @@ sub parse_email {
 	my ($formatted_email) = @_;
 
 	my $name = "";
+	my $quoted = "";
 	my $name_comment = "";
 	my $address = "";
 	my $comment = "";
@@ -1183,14 +1184,20 @@ sub parse_email {
 		}
 	}
 
-	$comment = trim($comment);
-	$name = trim($name);
-	$name =~ s/^\"|\"$//g;
-	if ($name =~ s/(\s*\([^\)]+\))\s*//) {
-		$name_comment = trim($1);
+	# Extract comments from names excluding quoted parts
+	# "John A. (Kennedy)" - Do not extract
+	if ($name =~ s/\"(.+)\"//) {
+		$quoted = $1;
+	}
+	while ($name =~ s/\s*($balanced_parens)\s*/ /) {
+		$name_comment .= trim($1);
 	}
+	$name =~ s/^[ \"]+|[ \"]+$//g;
+	$name = trim("$quoted $name");
+
 	$address = trim($address);
 	$address =~ s/^\<|\>$//g;
+	$comment = trim($comment);
 
 	if ($name =~ /[^\w \-]/i) { ##has "must quote" chars
 		$name =~ s/(?<!\\)"/\\"/g; ##escape quotes
@@ -1205,17 +1212,20 @@ sub format_email {
 
 	my $formatted_email;
 
-	$name_comment = trim($name_comment);
-	$comment = trim($comment);
-	$name = trim($name);
-	$name =~ s/^\"|\"$//g;
+	$name =~ s/^[ \"]+|[ \"]+$//g;
 	$address = trim($address);
+	$address =~ s/(?:\.|\,)+$//; ##trailing commas or dots
 
 	if ($name =~ /[^\w \-]/i) { ##has "must quote" chars
 		$name =~ s/(?<!\\)"/\\"/g; ##escape quotes
 		$name = "\"$name\"";
 	}
 
+	$name_comment = trim($name_comment);
+	$name_comment = " $name_comment" if length($name_comment) > 0;
+	$comment = trim($comment);
+	$comment = " $comment" if length($comment) > 0;
+
 	if ("$name" eq "") {
 		$formatted_email = "$address";
 	} else {
@@ -1233,15 +1243,11 @@ sub reformat_email {
 }
 
 sub same_email_addresses {
-	my ($email1, $email2, $match_comment) = @_;
+	my ($email1, $email2) = @_;
 
 	my ($email1_name, $name1_comment, $email1_address, $comment1) = parse_email($email1);
 	my ($email2_name, $name2_comment, $email2_address, $comment2) = parse_email($email2);
 
-	if ($match_comment != 1) {
-		return $email1_name eq $email2_name &&
-		       $email1_address eq $email2_address;
-	}
 	return $email1_name eq $email2_name &&
 	       $email1_address eq $email2_address &&
 	       $name1_comment eq $name2_comment &&
@@ -2704,7 +2710,7 @@ sub process {
 			$signoff++;
 			$in_commit_log = 0;
 			if ($author ne ''  && $authorsignoff != 1) {
-				if (same_email_addresses($1, $author, 1)) {
+				if (same_email_addresses($1, $author)) {
 					$authorsignoff = 1;
 				} else {
 					my $ctx = $1;
@@ -2800,9 +2806,57 @@ sub process {
 				$dequoted =~ s/" </ </;
 				# Don't force email to have quotes
 				# Allow just an angle bracketed address
-				if (!same_email_addresses($email, $suggested_email, 0)) {
+				if (!same_email_addresses($email, $suggested_email)) {
+					if (WARN("BAD_SIGN_OFF",
+					    "email address '$email' might be better as '$suggested_email'\n" . $herecurr) &&
+						$fix) {
+						$fixed[$fixlinenr] =~ s/\Q$email\E/$suggested_email/;
+					}
+				}
+
+				# Address part shouldn't have comments
+				my $stripped_address = $email_address;
+				$stripped_address =~ s/\([^\(\)]*\)//g;
+				if ($email_address ne $stripped_address) {
+					if (WARN("BAD_SIGN_OFF",
+					    "address part of email should not have comments: '$email_address'\n" . $herecurr) &&
+						$fix) {
+						$fixed[$fixlinenr] =~ s/\Q$email_address\E/$stripped_address/;
+					}
+				}
+
+				# Only one name comment should be allowed
+				my $comment_count = () = $name_comment =~ /\([^\)]+\)/g;
+				if ($comment_count > 1) {
 					WARN("BAD_SIGN_OFF",
-					     "email address '$email' might be better as '$suggested_email'\n" . $herecurr);
+					     "Use a single name comment in email: '$email'\n" . $herecurr);
+				}
+
+				# Comments must begin only with (
+				# or # in case of stable@vger.kernel.org
+				if ($email =~ /^.*stable\@vger/) {
+					if ($comment ne "" && $comment !~ /^#.+/) {
+						if (WARN("BAD_SIGN_OFF",
+						    "Invalid comment format for stable: '$email', prefer parentheses\n" . $herecurr) &&
+							$fix) {
+							my $new_comment = $comment;
+							$new_comment =~ s/^[ \(\[]+|[ \)\]]+$//g;
+							$new_comment = " # $new_comment" if length($new_comment) > 0;
+							$fixed[$fixlinenr] =~ s/\s*\Q$comment\E$/$new_comment/;
+						}
+					}
+				} else {
+					if ($comment ne "" && $comment !~ /^\(.+\)$/) {
+						if (WARN("BAD_SIGN_OFF",
+						    "Unexpected content after email: '$email'\n" . $herecurr) &&
+							$fix) {
+							my $new_comment = $comment;
+							$new_comment =~ s/^(?:\#|\/\*|\.|\,)//g;
+							$new_comment =~ s/^[ \{\[]+|[ \}\]]+$//g;
+							$new_comment = " ($new_comment)" if length($new_comment) > 0;
+							$fixed[$fixlinenr] =~ s/\s*\Q$comment\E$/$new_comment/;
+						}
+					}
 				}
 			}
 
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH v3] checkpatch: improve email parsing
  2020-11-05 11:59 [PATCH v3] checkpatch: improve email parsing Dwaipayan Ray
@ 2020-11-05 17:41 ` Joe Perches
  2020-11-05 20:08   ` Greg KH
       [not found]   ` <CABJPP5BTRtAjdNQdRCx-3HjmyJ0AnoXBzB95YiPRMoF0njOOaA@mail.gmail.com>
  0 siblings, 2 replies; 4+ messages in thread
From: Joe Perches @ 2020-11-05 17:41 UTC (permalink / raw)
  To: Dwaipayan Ray, stable, Greg KH
  Cc: linux-kernel-mentees, linux-kernel, lukas.bulwahn, yashsri421

(adding stable and Greg KH for additional review)
On Thu, 2020-11-05 at 17:29 +0530, Dwaipayan Ray wrote:
> checkpatch doesn't report warnings for many common mistakes
> in emails. Some of which are trailing commas and incorrect
> use of email comments.

I presume you've tested this against the git tree.

Can you send me a file with the BAD_SIGN_OFF messages generated
and if possible the git SHA-1s of the commits?

> At the same time several false positives are reported due to
> incorrect handling of mail comments. The most common of which
> is due to the pattern:
> 
> <stable@vger.kernel.org> # X.X
> 
> Improve email parsing in checkpatch.
> 
> Some general comment rules are defined:
> 
> - Multiple name comments should not be allowed.
> - Comments inside address should not be allowed.
> - In general comments should be enclosed within parentheses.
>   Exception for stable@vger.kernel.org # X.X

not just vger.kernel.org, but this should also allow stable@kernel.org
and only allow cc: and not any other -by: type for that email address.

A process preference question for Greg and the stable team:

The most common stable forms are

	stable@vger.kernel.org # version info
then
	stable@vger.kernel.org [ version info ]

with some other relatively infrequently used outlier styles, some
that use parentheses, but this is not frequent.

It might be sensible to standardize on the "# version info" trailer
comment version info style and warn on any other form.

A somewhat common style for the stable address is to use a name
before the stable address which describes the version info:

Perhaps any name before stable should be warned and the version
should be a comment.

Here's a list of the stable addresses with "version name" then
stable address in the git tree and other outlier styles.

     24 linux-stable <stable@vger.kernel.org>
     21 5.4+ <stable@vger.kernel.org>
     14 All applicable <stable@vger.kernel.org>
      6 3.10+ <stable@vger.kernel.org>
      5 5.9+ <stable@vger.kernel.org>
      5 5.3+ <stable@vger.kernel.org>
      5 5.1+ <stable@vger.kernel.org>
      4 5.6+ <stable@vger.kernel.org>
      4 4.20+ <stable@vger.kernel.org>
      3 Stable Team <stable@vger.kernel.org>
      3 4.19+ <stable@vger.kernel.org>
      3 4.15+ <stable@vger.kernel.org>
      3 4.10+ <stable@vger.kernel.org>
      2 stable@vger.kernel.org (v2.6.12+)
      2 5.2+ <stable@vger.kernel.org>
      2 4.16+ <stable@vger.kernel.org>
      1 v5.8+ <stable@vger.kernel.org>
      1 v5.7+ <stable@vger.kernel.org>
      1 v5.6+ <stable@vger.kernel.org>
      1 v5.3+ <stable@vger.kernel.org>
      1 v5.0+ <stable@vger.kernel.org>
      1 v4.9+ <stable@vger.kernel.org>
      1 <stable@vger.kernel.org> v5.0+
      1 <stable@vger.kernel.org> +v4.18
      1 stable@vger.kernel.org (3.14+)
      1 5.8+ <stable@vger.kernel.org>
      1 5.5+ <stable@vger.kernel.org>
      1 5.0+ <stable@vger.kernel.org>
      1 4.18+ <stable@vger.kernel.org>
      1 4.14+ <stable@vger.kernel.org>
      1 4.13+ <stable@vger.kernel.org>
      1 4.0+ <stable@vger.kernel.org>
      1 3.15+ <stable@vger.kernel.org>
      1 3.11+ <stable@vger.kernel.org>

> Improvements to parsing:
> 
> - Detect and report unexpected content after email.
> - Quoted names are excluded from comment parsing.
> - Trailing dots or commas in email are removed during
>   formatting. Correspondingly a BAD_SIGN_OFF warning
>   is emitted.
> - Improperly quoted email like '"name <address>"' are now
>   warned about.

All of the above seems right but perhaps the comment style for any
<foo>-by: lines should also allow # comments.

The below is just comments on the patch itself.

> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
[]
> @@ -2800,9 +2806,57 @@ sub process {
>  				$dequoted =~ s/" </ </;
>  				# Don't force email to have quotes
>  				# Allow just an angle bracketed address
> -				if (!same_email_addresses($email, $suggested_email, 0)) {
> +				if (!same_email_addresses($email, $suggested_email)) {
> +					if (WARN("BAD_SIGN_OFF",
> +					    "email address '$email' might be better as '$suggested_email'\n" . $herecurr) &&
> +						$fix) {

trivia:

Please always align $fix with tabs to the if and then 4 spaces to the
open parenthesis.

> +				# Comments must begin only with (
> +				# or # in case of stable@vger.kernel.org
> +				if ($email =~ /^.*stable\@vger/) {

I believe this should be

				if ($email =~ /^stable\@(?:vger\.)?kernel.org$/) {

> +					if ($comment ne "" && $comment !~ /^#.+/) {
> +						if (WARN("BAD_SIGN_OFF",
> +						    "Invalid comment format for stable: '$email', prefer parentheses\n" . $herecurr) &&

Prefer #

> +							$fix) {
> +							my $new_comment = $comment;
> +							$new_comment =~ s/^[ \(\[]+|[ \)\]]+$//g;

Does the comment include any leading whitespace here?
I presumed not given the $comment !~ /^#/ test above.



^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH v3] checkpatch: improve email parsing
  2020-11-05 17:41 ` Joe Perches
@ 2020-11-05 20:08   ` Greg KH
       [not found]   ` <CABJPP5BTRtAjdNQdRCx-3HjmyJ0AnoXBzB95YiPRMoF0njOOaA@mail.gmail.com>
  1 sibling, 0 replies; 4+ messages in thread
From: Greg KH @ 2020-11-05 20:08 UTC (permalink / raw)
  To: Joe Perches
  Cc: Dwaipayan Ray, stable, linux-kernel-mentees, linux-kernel,
	lukas.bulwahn, yashsri421

On Thu, Nov 05, 2020 at 09:41:15AM -0800, Joe Perches wrote:
> (adding stable and Greg KH for additional review)
> On Thu, 2020-11-05 at 17:29 +0530, Dwaipayan Ray wrote:
> > checkpatch doesn't report warnings for many common mistakes
> > in emails. Some of which are trailing commas and incorrect
> > use of email comments.
> 
> I presume you've tested this against the git tree.
> 
> Can you send me a file with the BAD_SIGN_OFF messages generated
> and if possible the git SHA-1s of the commits?
> 
> > At the same time several false positives are reported due to
> > incorrect handling of mail comments. The most common of which
> > is due to the pattern:
> > 
> > <stable@vger.kernel.org> # X.X
> > 
> > Improve email parsing in checkpatch.
> > 
> > Some general comment rules are defined:
> > 
> > - Multiple name comments should not be allowed.
> > - Comments inside address should not be allowed.
> > - In general comments should be enclosed within parentheses.
> >   Exception for stable@vger.kernel.org # X.X
> 
> not just vger.kernel.org, but this should also allow stable@kernel.org
> and only allow cc: and not any other -by: type for that email address.
> 
> A process preference question for Greg and the stable team:
> 
> The most common stable forms are
> 
> 	stable@vger.kernel.org # version info

That is what is documented it should be, yes.

> then
> 	stable@vger.kernel.org [ version info ]

Really?  Ick, no wonder my email parsing scripts choke on that :)

> with some other relatively infrequently used outlier styles, some
> that use parentheses, but this is not frequent.

The one with '#' should be preferred.  If not, we need to change our
documentation.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH v3] checkpatch: improve email parsing
       [not found]   ` <CABJPP5BTRtAjdNQdRCx-3HjmyJ0AnoXBzB95YiPRMoF0njOOaA@mail.gmail.com>
@ 2020-11-06  0:26     ` Joe Perches
  0 siblings, 0 replies; 4+ messages in thread
From: Joe Perches @ 2020-11-06  0:26 UTC (permalink / raw)
  To: Dwaipayan Ray
  Cc: Lukas Bulwahn, linux-kernel-mentees, linux-kernel, Aditya Srivastava

On Fri, 2020-11-06 at 02:46 +0530, Dwaipayan Ray wrote:
> > Can you send me a file with the BAD_SIGN_OFF messages generated
> > and if possible the git SHA-1s of the commits?
> Yes sure, am attaching the file for data tested on about 27k commits
> from v5.4.

Thanks.

> Excluded the duplicate signatures, tags and spacing ones for
> simplicity.
> 
> >      24 linux-stable <stable@vger.kernel.org>
> >      21 5.4+ <stable@vger.kernel.org>
> >      14 All applicable <stable@vger.kernel.org>
> >       6 3.10+ <stable@vger.kernel.org>
> >       5 5.9+ <stable@vger.kernel.org>
> >       5 5.3+ <stable@vger.kernel.org>
> 
> Do I also convert these then for the fix?

Yes.

Ideally removing any case insensitive name like linux-stable or stable
and also removing any leading < or trailing > from the email address.

And the stable email address should be forced to lower case only.

Any other name should be moved as a comment after the email address
then #.

> > > Improvements to parsing:
> > > 
> > > - Detect and report unexpected content after email.
> > > - Quoted names are excluded from comment parsing.
> > > - Trailing dots or commas in email are removed during
> > >   formatting. Correspondingly a BAD_SIGN_OFF warning
> > >   is emitted.
> > > - Improperly quoted email like '"name <address>"' are now
> > >   warned about.
[]
> > > +                                                     my $new_comment = $comment;
> > > +                                                     $new_comment =~ s/^[ \(\[]+|[ \)\]]+$//g;
> > 
> > Does the comment include any leading whitespace here?
> > I presumed not given the $comment !~ /^#/ test above.
> > 
> I added it to discard empty spaces before or after the first
> brackets are unwrapped. Something like (  v5.7 ) would
> be better if translated to # v5.7. (The extra spaces would be purged).
> It just looks good :). Should I change it to keep it as it is?

No.  Converting the comment style is very sensible and a good change.
I did not understand why the regex included a space.  I do now.
I suggest changing the space to use \s in case there are tabs.



^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2020-11-06  0:26 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-05 11:59 [PATCH v3] checkpatch: improve email parsing Dwaipayan Ray
2020-11-05 17:41 ` Joe Perches
2020-11-05 20:08   ` Greg KH
     [not found]   ` <CABJPP5BTRtAjdNQdRCx-3HjmyJ0AnoXBzB95YiPRMoF0njOOaA@mail.gmail.com>
2020-11-06  0:26     ` Joe Perches

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).