[v3] checkpatch: fix false positives in REPEATED_WORD warning
diff mbox series

Message ID 20201023133828.19609-1-yashsri421@gmail.com
State New, archived
Headers show
Series
  • [v3] checkpatch: fix false positives in REPEATED_WORD warning
Related show

Commit Message

Aditya Srivastava Oct. 23, 2020, 1:38 p.m. UTC
Presence of hexadecimal address or symbol results in false warning
message by checkpatch.pl.

For example, running checkpatch on commit b8ad540dd4e4 ("mptcp: fix
memory leak in mptcp_subflow_create_socket()") results in warning:

WARNING:REPEATED_WORD: Possible repeated word: 'ff'
    00 00 00 00 00 00 00 00 00 2f 30 0a 81 88 ff ff  ........./0.....

Similarly, the presence of list command output in commit results in
an unnecessary warning.

For example, running checkpatch on commit 899e5ffbf246 ("perf record:
Introduce --switch-output-event") gives:

WARNING:REPEATED_WORD: Possible repeated word: 'root'
  dr-xr-x---. 12 root root    4096 Apr 27 17:46 ..

Here, it reports 'ff' and 'root to be repeated, but it is in fact part
of some address or code, where it has to be repeated.

In these cases, the intent of the warning to find stylistic issues in
commit messages is not met and the warning is just completely wrong in
this case.

To avoid these warnings, add additional regex check for the
directory permission pattern and avoid checking the line for this
class of warning. Similarly, to avoid hex pattern, check if the word
consists of hex symbols and skip this warning if it is not among the
common english words formed using hex letters.

A quick evaluation on v5.6..v5.8 showed that this fix reduces
REPEATED_WORD warnings from 2797 to 907.

A quick manual check found all cases are related to hex output or
list command outputs in commit messages.

Signed-off-by: Aditya Srivastava <yashsri421@gmail.com>
---
 scripts/checkpatch.pl | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

Comments

Lukas Bulwahn Oct. 23, 2020, 7:06 p.m. UTC | #1
On Fri, 23 Oct 2020, Aditya Srivastava wrote:

> Presence of hexadecimal address or symbol results in false warning
> message by checkpatch.pl.
> 
> For example, running checkpatch on commit b8ad540dd4e4 ("mptcp: fix
> memory leak in mptcp_subflow_create_socket()") results in warning:
> 
> WARNING:REPEATED_WORD: Possible repeated word: 'ff'
>     00 00 00 00 00 00 00 00 00 2f 30 0a 81 88 ff ff  ........./0.....
> 
> Similarly, the presence of list command output in commit results in
> an unnecessary warning.
> 
> For example, running checkpatch on commit 899e5ffbf246 ("perf record:
> Introduce --switch-output-event") gives:
> 
> WARNING:REPEATED_WORD: Possible repeated word: 'root'
>   dr-xr-x---. 12 root root    4096 Apr 27 17:46 ..
> 
> Here, it reports 'ff' and 'root to be repeated, but it is in fact part
> of some address or code, where it has to be repeated.
> 
> In these cases, the intent of the warning to find stylistic issues in
> commit messages is not met and the warning is just completely wrong in
> this case.
> 
> To avoid these warnings, add additional regex check for the
> directory permission pattern and avoid checking the line for this
> class of warning. Similarly, to avoid hex pattern, check if the word
> consists of hex symbols and skip this warning if it is not among the
> common english words formed using hex letters.
> 
> A quick evaluation on v5.6..v5.8 showed that this fix reduces
> REPEATED_WORD warnings from 2797 to 907.
> 
> A quick manual check found all cases are related to hex output or
> list command outputs in commit messages.
> 
> Signed-off-by: Aditya Srivastava <yashsri421@gmail.com>

I think this strategy now makes sense and has the right complexity for a 
good heuristics in this case.

Nice job, Aditya.

Are you ready for a next challenge of this kind? Would you like to work on 
further rules that can be improved with your evaluation approach?

Lukas

> ---
>  scripts/checkpatch.pl | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> index 9b9ffd876e8a..3bd8205c48d8 100755
> --- a/scripts/checkpatch.pl
> +++ b/scripts/checkpatch.pl
> @@ -3051,7 +3051,9 @@ sub process {
>  		}
>  
>  # check for repeated words separated by a single space
> -		if ($rawline =~ /^\+/ || $in_commit_log) {
> +# avoid false positive from list command eg, '-rw-r--r-- 1 root root'
> +		if (($rawline =~ /^\+/ || $in_commit_log) &&
> +				$rawline !~ /[bcCdDlMnpPs\?-][rwxsStT-]{9}/) {
>  			while ($rawline =~ /\b($word_pattern) (?=($word_pattern))/g) {
>  
>  				my $first = $1;
> @@ -3065,6 +3067,17 @@ sub process {
>  				next if ($first ne $second);
>  				next if ($first eq 'long');
>  
> +				# avoid repeating hex occurrences like 'ff ff fe 09 ...'
> +				my @non_hex_words = (
> +					"add", "added", "ace", "aced", "bad", "be", "bed", "bead",
> +					"beaded", "bedded", "cab", "cabbed", "cede", "ceded",
> +					"dead", "deaf", "deb", "decade", "deed", "deface", "defaced", "efface",
> +					"effaced", "face", "faced", "fade", "faded", "fed", "fee", "feed"
> +				);
> +				if ($first =~ /\b[0-9a-f]{2,}\b/) {
> +					next if (!grep(/^$first$/, @non_hex_words));
> +				}
> +
>  				if (WARN("REPEATED_WORD",
>  					 "Possible repeated word: '$first'\n" . $herecurr) &&
>  				    $fix) {
> -- 
> 2.17.1
> 
>
Joe Perches Oct. 23, 2020, 8:19 p.m. UTC | #2
On Fri, 2020-10-23 at 21:06 +0200, Lukas Bulwahn wrote:
> On Fri, 23 Oct 2020, Aditya Srivastava wrote:
> > A quick evaluation on v5.6..v5.8 showed that this fix reduces
> > REPEATED_WORD warnings from 2797 to 907.
> > 
> > A quick manual check found all cases are related to hex output or
> > list command outputs in commit messages.
[] 
> I think this strategy now makes sense and has the right complexity for a 
> good heuristics in this case.

The proper indentation was not followed as described in a previous reply.
This patch also causes conflicts with Dwaipayan's earlier effort.

Lastly, I'm not sure this complexity is worthwhile.

The style is also somewhat complex and doesn't allow for upper and lower
case variants.

I still think that a trivial "is $first hex only" test is simpler.

Perhaps use a hash and an exists test if this is useful at all.

What specific matches were found with this word list?
Why not just add those to the word list?

In all the nearly 1 million commit descriptions in the kernel without
the patch contexts, these were the only hex-like word matches I found
with their count of matches:

76 be
17 Add
13 add
6 dada
2 Bad
2 bad

100 or so false positives in about a million commits isn't many.

dada is not an actual word in a commit message so I suggest
adding hex word like checks only for "be" "add" and "bad".

Maybe "added" too but that hasn't been used as far as I can tell.

But all the other dictionary entries don't appear to be useful.

So maybe something like:

our %allow_repeated_words = (
	add => '',
	added => '',
	bad => '',
	be => '',
);

and

	next if (exists($allow_repeated_words{lc($first)}));
Aditya Srivastava Oct. 24, 2020, 1:24 p.m. UTC | #3
On 24/10/20 12:36 am, Lukas Bulwahn wrote:
> 
> 
> On Fri, 23 Oct 2020, Aditya Srivastava wrote:
> 
>> Presence of hexadecimal address or symbol results in false warning
>> message by checkpatch.pl.
>>
> 
> I think this strategy now makes sense and has the right complexity for a 
> good heuristics in this case.
> 
> Nice job, Aditya.
> 
> Are you ready for a next challenge of this kind? Would you like to work on 
> further rules that can be improved with your evaluation approach?
> 
> Lukas
> 

Yes, I would like work on further rules.

Thanks
Aditya
Joe Perches Oct. 24, 2020, 3:33 p.m. UTC | #4
On Sat, 2020-10-24 at 18:54 +0530, Aditya wrote:
> > Would you like to work on 
> > further rules that can be improved with your evaluation approach?
> 
> Yes, I would like work on further rules.

Some generic ideas:

How about working to reduce runtime and complexity by
making the rules extensible or separable at startup.

Maybe move each existing rules into a separate
directory as an individual file and aggregate them at
checkpatch startup.

Maybe look at the existing rules that do not have a
$fix option and add them as appropriate.

You could fix the multiline indentation where the
current warning and fix is only for a single line

	value = function(arg1,
		arg2,
		arg3);

where checkpatch emits only single warning and fix
for the line with arg2, but not the line with arg3);

Maybe try to make the coding styles supported more
flexible:

Allow braces in different places, support different
tab indentation sizes, spacing rules around operators,
function definition layouts, etc.
Aditya Srivastava Oct. 24, 2020, 6:12 p.m. UTC | #5
On 24/10/20 9:03 pm, Joe Perches wrote:
> On Sat, 2020-10-24 at 18:54 +0530, Aditya wrote:
>>> Would you like to work on 
>>> further rules that can be improved with your evaluation approach?
>>
>> Yes, I would like work on further rules.
> 
> Some generic ideas:
> 
> How about working to reduce runtime and complexity by
> making the rules extensible or separable at startup.
> 
> Maybe move each existing rules into a separate
> directory as an individual file and aggregate them at
> checkpatch startup.
> 
> Maybe look at the existing rules that do not have a
> $fix option and add them as appropriate.
> 
> You could fix the multiline indentation where the
> current warning and fix is only for a single line
> 
> 	value = function(arg1,
> 		arg2,
> 		arg3);
> 
> where checkpatch emits only single warning and fix
> for the line with arg2, but not the line with arg3);
> 
> Maybe try to make the coding styles supported more
> flexible:
> 
> Allow braces in different places, support different
> tab indentation sizes, spacing rules around operators,
> function definition layouts, etc.
> 
> 
> 

Thanks for all these suggestions. I'll make observations regarding
these and get back to you :)

Aditya
Lukas Bulwahn Oct. 25, 2020, 5:51 a.m. UTC | #6
On Sat, 24 Oct 2020, Joe Perches wrote:

> On Sat, 2020-10-24 at 18:54 +0530, Aditya wrote:
> > > Would you like to work on 
> > > further rules that can be improved with your evaluation approach?
> > 
> > Yes, I would like work on further rules.
> 
> Some generic ideas:
> 
> How about working to reduce runtime and complexity by
> making the rules extensible or separable at startup.
> 
> Maybe move each existing rules into a separate
> directory as an individual file and aggregate them at
> checkpatch startup.
> 
> Maybe look at the existing rules that do not have a
> $fix option and add them as appropriate.
>

I certainly support working on fixes.

I have this crazy dream that:

We can identify a set of rules and clear automatic fixes that
maintainers can simply run this script over the patches they pick
up when they pick them up.
Realistically in an easy interactive mode that works with the maintainers' 
workflow, so they simply confirm the automatic fixes where the script 
cannot be 100% sure that the quick fix is the right thing to do.

I think for commit message fixes that would be very nice.
E.g., correcting URLs, normalizing tags, correcting obvious typos.
 
> You could fix the multiline indentation where the
> current warning and fix is only for a single line
> 
> 	value = function(arg1,
> 		arg2,
> 		arg3);
> 
> where checkpatch emits only single warning and fix
> for the line with arg2, but not the line with arg3);
> 
> Maybe try to make the coding styles supported more
> flexible:
> 
> Allow braces in different places, support different
> tab indentation sizes, spacing rules around operators,
> function definition layouts, etc.
> 
>

This is a nice idea as well. And I recommend to do this kind of research 
looking at checkpatch and clang-format; both will not be the 'final tools' 
but I think if you can identify a good mix and combination of those two 
tools, we will get a good step forward of documenting the 
commonalities and differences of coding styles in the different 
subcommunities (the preferences of specific maintainers and subsystems).

A last nice idea to consider:

I would like to have an interactive checkpatch.pl mode where an authors
can say that they ran checkpatch.pl, seen the warning and errors 
reported, but disagree and comment why they disagree. Then this kind of
information is added to the patch in a non-disturbing area of the patch
and we can pick up and aggregate that information from the patches to see
the complaints, false positives to address or suggestions which rules
should be disabled for some subcommunities.

Aditya, your task is now to make those ideas more specific and write down 
a one to two page project proposal for the mentorship. It can be to some 
extent in bullet points as long as it is roughly understandable to all of 
us what you plan and would like to do.

With that proposal accepted, you have shown to be applicable to the 
mentorship program here and we take care of the last organisational points 
before the start.

Lukas
Joe Perches Oct. 25, 2020, 6:06 a.m. UTC | #7
On Sun, 2020-10-25 at 06:51 +0100, Lukas Bulwahn wrote:
> We can identify a set of rules and clear automatic fixes that
> maintainers can simply run this script over the patches they pick
> up when they pick them up.

checkpatch --fix-inplace does that now.

But realistically, this is more an interactive bit that IMO
should be a separate script on top of checkpatch.
Lukas Bulwahn Oct. 25, 2020, 6:11 a.m. UTC | #8
On Sat, 24 Oct 2020, Joe Perches wrote:

> On Sun, 2020-10-25 at 06:51 +0100, Lukas Bulwahn wrote:
> > We can identify a set of rules and clear automatic fixes that
> > maintainers can simply run this script over the patches they pick
> > up when they pick them up.
> 
> checkpatch --fix-inplace does that now.
> 
> But realistically, this is more an interactive bit that IMO
> should be a separate script on top of checkpatch.
>

Agree, and of course, we can work with some maintainers to find the best 
way to combine that into their 'pick up and apply patches' scripts?

E.g., some might be using b4 and we simply hook into that in a friendly 
way.

Further just documenting the potential options for maintainer in the 
maintainers handbook might also increase adoption.
Aditya Srivastava Oct. 25, 2020, 10:01 a.m. UTC | #9
On 25/10/20 11:21 am, Lukas Bulwahn wrote:

> This is a nice idea as well. And I recommend to do this kind of research 
> looking at checkpatch and clang-format; both will not be the 'final tools' 
> but I think if you can identify a good mix and combination of those two 
> tools, we will get a good step forward of documenting the 
> commonalities and differences of coding styles in the different 
> subcommunities (the preferences of specific maintainers and subsystems).
> 
> A last nice idea to consider:
> 
> I would like to have an interactive checkpatch.pl mode where an authors
> can say that they ran checkpatch.pl, seen the warning and errors 
> reported, but disagree and comment why they disagree. Then this kind of
> information is added to the patch in a non-disturbing area of the patch
> and we can pick up and aggregate that information from the patches to see
> the complaints, false positives to address or suggestions which rules
> should be disabled for some subcommunities.
> 
> Aditya, your task is now to make those ideas more specific and write down 
> a one to two page project proposal for the mentorship. It can be to some 
> extent in bullet points as long as it is roughly understandable to all of 
> us what you plan and would like to do.
> 
> With that proposal accepted, you have shown to be applicable to the 
> mentorship program here and we take care of the last organisational points 
> before the start.
> 
> Lukas
> 

Alright Sir. I'll be sending my proposal in few days time, on the
basis of my formulated ideas.
Will let you know if I face any doubts during the procedure :)

Thanks
Aditya

Patch
diff mbox series

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index 9b9ffd876e8a..3bd8205c48d8 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -3051,7 +3051,9 @@  sub process {
 		}
 
 # check for repeated words separated by a single space
-		if ($rawline =~ /^\+/ || $in_commit_log) {
+# avoid false positive from list command eg, '-rw-r--r-- 1 root root'
+		if (($rawline =~ /^\+/ || $in_commit_log) &&
+				$rawline !~ /[bcCdDlMnpPs\?-][rwxsStT-]{9}/) {
 			while ($rawline =~ /\b($word_pattern) (?=($word_pattern))/g) {
 
 				my $first = $1;
@@ -3065,6 +3067,17 @@  sub process {
 				next if ($first ne $second);
 				next if ($first eq 'long');
 
+				# avoid repeating hex occurrences like 'ff ff fe 09 ...'
+				my @non_hex_words = (
+					"add", "added", "ace", "aced", "bad", "be", "bed", "bead",
+					"beaded", "bedded", "cab", "cabbed", "cede", "ceded",
+					"dead", "deaf", "deb", "decade", "deed", "deface", "defaced", "efface",
+					"effaced", "face", "faced", "fade", "faded", "fed", "fee", "feed"
+				);
+				if ($first =~ /\b[0-9a-f]{2,}\b/) {
+					next if (!grep(/^$first$/, @non_hex_words));
+				}
+
 				if (WARN("REPEATED_WORD",
 					 "Possible repeated word: '$first'\n" . $herecurr) &&
 				    $fix) {