checkpatch: extend author Signed-off-by check for split From: header
diff mbox series

Message ID 20200919081225.28624-1-dwaipayanray1@gmail.com
State Superseded
Headers show
Series
  • checkpatch: extend author Signed-off-by check for split From: header
Related show

Commit Message

Dwaipayan Ray Sept. 19, 2020, 8:12 a.m. UTC
Checkpatch did not handle cases where the author From: header
was split into two lines. The author string went empty and
checkpatch generated a false NO_AUTHOR_SIGN_OFF warning.

Support split From: headers in AUTHOR_SIGN_OFF check by adding
an additional clause to resolve author identity in such cases.

Signed-off-by: Dwaipayan Ray <dwaipayanray1@gmail.com>
---
 scripts/checkpatch.pl | 28 ++++++++++++++++++++++++----
 1 file changed, 24 insertions(+), 4 deletions(-)

Comments

Joe Perches Sept. 19, 2020, 5:26 p.m. UTC | #1
On Sat, 2020-09-19 at 13:42 +0530, Dwaipayan Ray wrote:
> Checkpatch did not handle cases where the author From: header
> was split into two lines. The author string went empty and
> checkpatch generated a false NO_AUTHOR_SIGN_OFF warning.

It's good to provide an example where the current code
doesn't work.

It likely would be better to do this by searching forward for
any extension lines after a "^From:' rather than searching
backwards as there can be any number of extension lines.

> Support split From: headers in AUTHOR_SIGN_OFF check by adding
> an additional clause to resolve author identity in such cases.
> 
> Signed-off-by: Dwaipayan Ray <dwaipayanray1@gmail.com>
> ---
>  scripts/checkpatch.pl | 28 ++++++++++++++++++++++++----
>  1 file changed, 24 insertions(+), 4 deletions(-)
> 
> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> index 504d2e431c60..86975baead22 100755
> --- a/scripts/checkpatch.pl
> +++ b/scripts/checkpatch.pl
> @@ -1210,6 +1210,16 @@ sub reformat_email {
>  	return format_email($email_name, $email_address);
>  }
>  
> +sub format_author_email {
> +	my ($email, $from) = @_;
> +
> +	$email = encode("utf8", $email) if ($from =~ /=\?utf-8\?/i);
> +	$email =~ s/"//g;
> +	$email = reformat_email($email);
> +
> +	return $email;
> +}
> +
>  sub same_email_addresses {
>  	my ($email1, $email2) = @_;
>  
> @@ -2347,6 +2357,7 @@ sub process {
>  	my $signoff = 0;
>  	my $author = '';
>  	my $authorsignoff = 0;
> +	my $prevheader = '';
>  	my $is_patch = 0;
>  	my $is_binding_patch = -1;
>  	my $in_header_lines = $file ? 0 : 1;
> @@ -2658,12 +2669,21 @@ sub process {
>  			}
>  		}
>  
> +# Check the patch for a split From:
> +		if($prevheader ne '') {
> +			if ($author eq '' && decode("MIME-Header", $prevheader) =~ /^From:\s*(.*)/) {
> +				my $email = $1.$line;
> +				$author = format_author_email($email, $prevheader);
> +			}
> +			$prevheader = '';
> +		}
> +
>  # Check the patch for a From:
>  		if (decode("MIME-Header", $line) =~ /^From:\s*(.*)/) {
> -			$author = $1;
> -			$author = encode("utf8", $author) if ($line =~ /=\?utf-8\?/i);
> -			$author =~ s/"//g;
> -			$author = reformat_email($author);
> +			$author = format_author_email($1, $line);
> +			if($author eq '') {
> +				$prevheader = $line;
> +			}
>  		}
>  
>  # Check the patch for a signoff:
Lukas Bulwahn Sept. 19, 2020, 6:12 p.m. UTC | #2
On Sat, 19 Sep 2020, Joe Perches wrote:

> On Sat, 2020-09-19 at 13:42 +0530, Dwaipayan Ray wrote:
> > Checkpatch did not handle cases where the author From: header
> > was split into two lines. The author string went empty and
> > checkpatch generated a false NO_AUTHOR_SIGN_OFF warning.
> 
> It's good to provide an example where the current code
> doesn't work.
>

Joe, as this is a linux-kernel-mentees patch, we discussed that before 
reaching out to you; you can find Dwaipayan's own evaluation here:

https://lore.kernel.org/linux-kernel-mentees/CABJPP5BOTG0QLFSaRJTb2vAZ_hJf229OAQihHKG4sYd35i_WMw@mail.gmail.com/

Dwaipayan, Joe's comment is still valid; it would be good to describe
the reasons why patches might have split lines (as far as see, long 
encodings for non-ascii names).

I will run my own evaluation of checkpatch.pl before and after patch 
application on Monday and then check if I can confirm Dwaipayan's results.

> It likely would be better to do this by searching forward for
> any extension lines after a "^From:' rather than searching
> backwards as there can be any number of extension lines.
>

Just to sure what you are talking about...

You mean just to access the next line through the lines array, rather 
than using prevheader and trying to decode that one line twice.

I agree the logic is a bit redundant and complicated at the moment.

Once prevheader is non-empty, it already clear that author is '' and 
prevheader decodes with that match, because that is the only way to
make prevheader non-empty in the first place; at least as far I see it 
right now.


Lukas
 
> > Support split From: headers in AUTHOR_SIGN_OFF check by adding
> > an additional clause to resolve author identity in such cases.
> > 
> > Signed-off-by: Dwaipayan Ray <dwaipayanray1@gmail.com>
> > ---
> >  scripts/checkpatch.pl | 28 ++++++++++++++++++++++++----
> >  1 file changed, 24 insertions(+), 4 deletions(-)
> > 
> > diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> > index 504d2e431c60..86975baead22 100755
> > --- a/scripts/checkpatch.pl
> > +++ b/scripts/checkpatch.pl
> > @@ -1210,6 +1210,16 @@ sub reformat_email {
> >  	return format_email($email_name, $email_address);
> >  }
> >  
> > +sub format_author_email {
> > +	my ($email, $from) = @_;
> > +
> > +	$email = encode("utf8", $email) if ($from =~ /=\?utf-8\?/i);
> > +	$email =~ s/"//g;
> > +	$email = reformat_email($email);
> > +
> > +	return $email;
> > +}
> > +
> >  sub same_email_addresses {
> >  	my ($email1, $email2) = @_;
> >  
> > @@ -2347,6 +2357,7 @@ sub process {
> >  	my $signoff = 0;
> >  	my $author = '';
> >  	my $authorsignoff = 0;
> > +	my $prevheader = '';
> >  	my $is_patch = 0;
> >  	my $is_binding_patch = -1;
> >  	my $in_header_lines = $file ? 0 : 1;
> > @@ -2658,12 +2669,21 @@ sub process {
> >  			}
> >  		}
> >  
> > +# Check the patch for a split From:
> > +		if($prevheader ne '') {
> > +			if ($author eq '' && decode("MIME-Header", $prevheader) =~ /^From:\s*(.*)/) {
> > +				my $email = $1.$line;
> > +				$author = format_author_email($email, $prevheader);
> > +			}
> > +			$prevheader = '';
> > +		}
> > +
> >  # Check the patch for a From:
> >  		if (decode("MIME-Header", $line) =~ /^From:\s*(.*)/) {
> > -			$author = $1;
> > -			$author = encode("utf8", $author) if ($line =~ /=\?utf-8\?/i);
> > -			$author =~ s/"//g;
> > -			$author = reformat_email($author);
> > +			$author = format_author_email($1, $line);
> > +			if($author eq '') {
> > +				$prevheader = $line;
> > +			}
> >  		}
> >  
> >  # Check the patch for a signoff:
> 
>
Joe Perches Sept. 19, 2020, 6:19 p.m. UTC | #3
On Sat, 2020-09-19 at 20:12 +0200, Lukas Bulwahn wrote:
> 
> On Sat, 19 Sep 2020, Joe Perches wrote:
> 
> > On Sat, 2020-09-19 at 13:42 +0530, Dwaipayan Ray wrote:
> > > Checkpatch did not handle cases where the author From: header
> > > was split into two lines. The author string went empty and
> > > checkpatch generated a false NO_AUTHOR_SIGN_OFF warning.
> > 
> > It's good to provide an example where the current code
> > doesn't work.
> > 
> 
> Joe, as this is a linux-kernel-mentees patch, we discussed that before 
> reaching out to you; you can find Dwaipayan's own evaluation here:
> 
> https://lore.kernel.org/linux-kernel-mentees/CABJPP5BOTG0QLFSaRJTb2vAZ_hJf229OAQihHKG4sYd35i_WMw@mail.gmail.com/
> 
> Dwaipayan, Joe's comment is still valid; it would be good to describe
> the reasons why patches might have split lines (as far as see, long 
> encodings for non-ascii names).
> 
> I will run my own evaluation of checkpatch.pl before and after patch 
> application on Monday and then check if I can confirm Dwaipayan's results.
> 
> > It likely would be better to do this by searching forward for
> > any extension lines after a "^From:' rather than searching
> > backwards as there can be any number of extension lines.
> > 
> 
> Just to sure what you are talking about...
> 
> You mean just to access the next line through the lines array, rather 
> than using prevheader and trying to decode that one line twice.
> 
> I agree the logic is a bit redundant and complicated at the moment.
> 
> Once prevheader is non-empty, it already clear that author is '' and 
> prevheader decodes with that match, because that is the only way to
> make prevheader non-empty in the first place; at least as far I see it 
> right now.

Yeah, something like this (completely untested):
---
 scripts/checkpatch.pl | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index 3e474072aa90..2c710d05b184 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -2679,9 +2679,13 @@ sub process {
 		}
 
 # Check the patch for a From:
-		if (decode("MIME-Header", $line) =~ /^From:\s*(.*)/) {
+		if ($line =~ /^From:\s*(.*)/i) {
 			$author = $1;
-			$author = encode("utf8", $author) if ($line =~ /=\?utf-8\?/i);
+			my $curline = $linenr;
+			while (defined($rawlines[$curline] && $rawlines[$curline++] =~ /^ \s*(.*)/) {
+				$author .= $1;
+			}
+			$author = encode("utf8", $author) if ($author =~ /=\?utf-8\?/i);
 			$author =~ s/"//g;
 			$author = reformat_email($author);
 		}
Lukas Bulwahn Sept. 19, 2020, 6:36 p.m. UTC | #4
On Sat, 19 Sep 2020, Joe Perches wrote:

> On Sat, 2020-09-19 at 20:12 +0200, Lukas Bulwahn wrote:
> > 
> > On Sat, 19 Sep 2020, Joe Perches wrote:
> > 
> > > On Sat, 2020-09-19 at 13:42 +0530, Dwaipayan Ray wrote:
> > > > Checkpatch did not handle cases where the author From: header
> > > > was split into two lines. The author string went empty and
> > > > checkpatch generated a false NO_AUTHOR_SIGN_OFF warning.
> > > 
> > > It's good to provide an example where the current code
> > > doesn't work.
> > > 
> > 
> > Joe, as this is a linux-kernel-mentees patch, we discussed that before 
> > reaching out to you; you can find Dwaipayan's own evaluation here:
> > 
> > https://lore.kernel.org/linux-kernel-mentees/CABJPP5BOTG0QLFSaRJTb2vAZ_hJf229OAQihHKG4sYd35i_WMw@mail.gmail.com/
> > 
> > Dwaipayan, Joe's comment is still valid; it would be good to describe
> > the reasons why patches might have split lines (as far as see, long 
> > encodings for non-ascii names).
> > 
> > I will run my own evaluation of checkpatch.pl before and after patch 
> > application on Monday and then check if I can confirm Dwaipayan's results.
> > 
> > > It likely would be better to do this by searching forward for
> > > any extension lines after a "^From:' rather than searching
> > > backwards as there can be any number of extension lines.
> > > 
> > 
> > Just to sure what you are talking about...
> > 
> > You mean just to access the next line through the lines array, rather 
> > than using prevheader and trying to decode that one line twice.
> > 
> > I agree the logic is a bit redundant and complicated at the moment.
> > 
> > Once prevheader is non-empty, it already clear that author is '' and 
> > prevheader decodes with that match, because that is the only way to
> > make prevheader non-empty in the first place; at least as far I see it 
> > right now.
> 
> Yeah, something like this (completely untested):
> ---
>  scripts/checkpatch.pl | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> index 3e474072aa90..2c710d05b184 100755
> --- a/scripts/checkpatch.pl
> +++ b/scripts/checkpatch.pl
> @@ -2679,9 +2679,13 @@ sub process {
>  		}
>  
>  # Check the patch for a From:
> -		if (decode("MIME-Header", $line) =~ /^From:\s*(.*)/) {
> +		if ($line =~ /^From:\s*(.*)/i) {
>  			$author = $1;
> -			$author = encode("utf8", $author) if ($line =~ /=\?utf-8\?/i);
> +			my $curline = $linenr;
> +			while (defined($rawlines[$curline] && $rawlines[$curline++] =~ /^ \s*(.*)/) {
> +				$author .= $1;
> +			}
> +			$author = encode("utf8", $author) if ($author =~ /=\?utf-8\?/i);
>  			$author =~ s/"//g;
>  			$author = reformat_email($author);
>  		}
>

Yeah, I get how you would like to see that being implemented. I will work 
with Dwaipayan to get that properly implemented, properly described and 
tested.

But let us keep the fun of that task to Dwaipayan... that is what a 
mentorship is all about :)

Lukas
Dwaipayan Ray Sept. 19, 2020, 7:38 p.m. UTC | #5
On Sun, Sep 20, 2020 at 12:06 AM Lukas Bulwahn <lukas.bulwahn@gmail.com> wrote:
>
>
>
> On Sat, 19 Sep 2020, Joe Perches wrote:
>
> > On Sat, 2020-09-19 at 20:12 +0200, Lukas Bulwahn wrote:
> > >
> > > On Sat, 19 Sep 2020, Joe Perches wrote:
> > >
> > > > On Sat, 2020-09-19 at 13:42 +0530, Dwaipayan Ray wrote:
> > > > > Checkpatch did not handle cases where the author From: header
> > > > > was split into two lines. The author string went empty and
> > > > > checkpatch generated a false NO_AUTHOR_SIGN_OFF warning.
> > > >
> > > > It's good to provide an example where the current code
> > > > doesn't work.
> > > >
> > >
> > > Joe, as this is a linux-kernel-mentees patch, we discussed that before
> > > reaching out to you; you can find Dwaipayan's own evaluation here:
> > >
> > > https://lore.kernel.org/linux-kernel-mentees/CABJPP5BOTG0QLFSaRJTb2vAZ_hJf229OAQihHKG4sYd35i_WMw@mail.gmail.com/
> > >
> > > Dwaipayan, Joe's comment is still valid; it would be good to describe
> > > the reasons why patches might have split lines (as far as see, long
> > > encodings for non-ascii names).
> > >
> > > I will run my own evaluation of checkpatch.pl before and after patch
> > > application on Monday and then check if I can confirm Dwaipayan's results.
> > >
> > > > It likely would be better to do this by searching forward for
> > > > any extension lines after a "^From:' rather than searching
> > > > backwards as there can be any number of extension lines.
> > > >
> > >
> > > Just to sure what you are talking about...
> > >
> > > You mean just to access the next line through the lines array, rather
> > > than using prevheader and trying to decode that one line twice.
> > >
> > > I agree the logic is a bit redundant and complicated at the moment.
> > >
> > > Once prevheader is non-empty, it already clear that author is '' and
> > > prevheader decodes with that match, because that is the only way to
> > > make prevheader non-empty in the first place; at least as far I see it
> > > right now.
> >
> > Yeah, something like this (completely untested):
> > ---
> >  scripts/checkpatch.pl | 8 ++++++--
> >  1 file changed, 6 insertions(+), 2 deletions(-)
> >
> > diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> > index 3e474072aa90..2c710d05b184 100755
> > --- a/scripts/checkpatch.pl
> > +++ b/scripts/checkpatch.pl
> > @@ -2679,9 +2679,13 @@ sub process {
> >               }
> >
> >  # Check the patch for a From:
> > -             if (decode("MIME-Header", $line) =~ /^From:\s*(.*)/) {
> > +             if ($line =~ /^From:\s*(.*)/i) {
> >                       $author = $1;
> > -                     $author = encode("utf8", $author) if ($line =~ /=\?utf-8\?/i);
> > +                     my $curline = $linenr;
> > +                     while (defined($rawlines[$curline] && $rawlines[$curline++] =~ /^ \s*(.*)/) {
> > +                             $author .= $1;
> > +                     }
> > +                     $author = encode("utf8", $author) if ($author =~ /=\?utf-8\?/i);
> >                       $author =~ s/"//g;
> >                       $author = reformat_email($author);
> >               }
> >
>

Hi,

Yeah I think the backwards checking was pretty redundant after all. If the
extended encoding went too long, the From: header would be split into
more than two lines and my proposed solution would fail.

Thanks for the heads up, Joe!

> Yeah, I get how you would like to see that being implemented. I will work
> with Dwaipayan to get that properly implemented, properly described and
> tested.
>
> But let us keep the fun of that task to Dwaipayan... that is what a
> mentorship is all about :)
>
> Lukas

Yes definitely, the task is interesting for me, and I would like to solve
it in a proper way.

As for the fix, shouldn't we stop the author string concatenation once
an email address is found? something like:

  last if  $rawlines[$curline] = ~/^\s*(\S+\@\S+)\s*/

I will update the patch and sync up with Lukas on this.

Thanks,
Dwaipayan.
Joe Perches Sept. 19, 2020, 7:47 p.m. UTC | #6
On Sun, 2020-09-20 at 01:08 +0530, Dwaipayan Ray wrote:
> On Sun, Sep 20, 2020 at 12:06 AM Lukas Bulwahn <lukas.bulwahn@gmail.com> wrote:
> > On Sat, 19 Sep 2020, Joe Perches wrote:
> > > On Sat, 2020-09-19 at 20:12 +0200, Lukas Bulwahn wrote:
> > > > On Sat, 19 Sep 2020, Joe Perches wrote:
> > > > > On Sat, 2020-09-19 at 13:42 +0530, Dwaipayan Ray wrote:
> > > > > > Checkpatch did not handle cases where the author From: header
> > > > > > was split into two lines. The author string went empty and
> > > > > > checkpatch generated a false NO_AUTHOR_SIGN_OFF warning.
> > > > > 
> > > > > It's good to provide an example where the current code
> > > > > doesn't work.
> > > > > 
> > > > Joe, as this is a linux-kernel-mentees patch, we discussed that before
> > > > reaching out to you; you can find Dwaipayan's own evaluation here:
> > > > 
> > > > https://lore.kernel.org/linux-kernel-mentees/CABJPP5BOTG0QLFSaRJTb2vAZ_hJf229OAQihHKG4sYd35i_WMw@mail.gmail.com/
> > > > 
> > > > Dwaipayan, Joe's comment is still valid; it would be good to describe
> > > > the reasons why patches might have split lines (as far as see, long
> > > > encodings for non-ascii names).
> > > > 
> > > > I will run my own evaluation of checkpatch.pl before and after patch
> > > > application on Monday and then check if I can confirm Dwaipayan's results.
> > > > 
> > > > > It likely would be better to do this by searching forward for
> > > > > any extension lines after a "^From:' rather than searching
> > > > > backwards as there can be any number of extension lines.
> > > > > 
> > > > Just to sure what you are talking about...
> > > > 
> > > > You mean just to access the next line through the lines array, rather
> > > > than using prevheader and trying to decode that one line twice.
> > > > 
> > > > I agree the logic is a bit redundant and complicated at the moment.
> > > > 
> > > > Once prevheader is non-empty, it already clear that author is '' and
> > > > prevheader decodes with that match, because that is the only way to
> > > > make prevheader non-empty in the first place; at least as far I see it
> > > > right now.
> > > 
> > > Yeah, something like this (completely untested):
> > > ---
> > >  scripts/checkpatch.pl | 8 ++++++--
> > >  1 file changed, 6 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> > > index 3e474072aa90..2c710d05b184 100755
> > > --- a/scripts/checkpatch.pl
> > > +++ b/scripts/checkpatch.pl
> > > @@ -2679,9 +2679,13 @@ sub process {
> > >               }
> > > 
> > >  # Check the patch for a From:
> > > -             if (decode("MIME-Header", $line) =~ /^From:\s*(.*)/) {
> > > +             if ($line =~ /^From:\s*(.*)/i) {
> > >                       $author = $1;
> > > -                     $author = encode("utf8", $author) if ($line =~ /=\?utf-8\?/i);
> > > +                     my $curline = $linenr;
> > > +                     while (defined($rawlines[$curline] && $rawlines[$curline++] =~ /^ \s*(.*)/) {
> > > +                             $author .= $1;
> > > +                     }
> > > +                     $author = encode("utf8", $author) if ($author =~ /=\?utf-8\?/i);
> > >                       $author =~ s/"//g;
> > >                       $author = reformat_email($author);
> > >               }
> > > 
> 
> Hi,
> 
> Yeah I think the backwards checking was pretty redundant after all. If the
> extended encoding went too long, the From: header would be split into
> more than two lines and my proposed solution would fail.
> 
> Thanks for the heads up, Joe!
> 
> > Yeah, I get how you would like to see that being implemented. I will work
> > with Dwaipayan to get that properly implemented, properly described and
> > tested.
> > 
> > But let us keep the fun of that task to Dwaipayan... that is what a
> > mentorship is all about :)
> > 
> > Lukas
> 
> Yes definitely, the task is interesting for me, and I would like to solve
> it in a proper way.
> 
> As for the fix, shouldn't we stop the author string concatenation once
> an email address is found? something like:
> 
>   last if  $rawlines[$curline] = ~/^\s*(\S+\@\S+)\s*/

Probably not.

I think it should follow the rfc standard with extension
lines starting with a space.

See rfc 5322, 2.2.3 Long Header Fields

> I will update the patch and sync up with Lukas on this.

Enjoy.

I believe I now have a working version, we can compare later.

cheers, Joe

Patch
diff mbox series

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index 504d2e431c60..86975baead22 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -1210,6 +1210,16 @@  sub reformat_email {
 	return format_email($email_name, $email_address);
 }
 
+sub format_author_email {
+	my ($email, $from) = @_;
+
+	$email = encode("utf8", $email) if ($from =~ /=\?utf-8\?/i);
+	$email =~ s/"//g;
+	$email = reformat_email($email);
+
+	return $email;
+}
+
 sub same_email_addresses {
 	my ($email1, $email2) = @_;
 
@@ -2347,6 +2357,7 @@  sub process {
 	my $signoff = 0;
 	my $author = '';
 	my $authorsignoff = 0;
+	my $prevheader = '';
 	my $is_patch = 0;
 	my $is_binding_patch = -1;
 	my $in_header_lines = $file ? 0 : 1;
@@ -2658,12 +2669,21 @@  sub process {
 			}
 		}
 
+# Check the patch for a split From:
+		if($prevheader ne '') {
+			if ($author eq '' && decode("MIME-Header", $prevheader) =~ /^From:\s*(.*)/) {
+				my $email = $1.$line;
+				$author = format_author_email($email, $prevheader);
+			}
+			$prevheader = '';
+		}
+
 # Check the patch for a From:
 		if (decode("MIME-Header", $line) =~ /^From:\s*(.*)/) {
-			$author = $1;
-			$author = encode("utf8", $author) if ($line =~ /=\?utf-8\?/i);
-			$author =~ s/"//g;
-			$author = reformat_email($author);
+			$author = format_author_email($1, $line);
+			if($author eq '') {
+				$prevheader = $line;
+			}
 		}
 
 # Check the patch for a signoff: