checkpatch: fix false positive for REPEATED_WORD warning

Message ID 20201021150120.29920-1-yashsri421@gmail.com
State New, archived

Commit Message

Aditya Srivastava Oct. 21, 2020, 3:01 p.m. UTC
The presence of a hexadecimal address or symbol results in a false
warning message from checkpatch.pl.

For example, running checkpatch on commit b8ad540dd4e4 ("mptcp: fix
memory leak in mptcp_subflow_create_socket()") results in warning:

WARNING:REPEATED_WORD: Possible repeated word: 'ff'
    00 00 00 00 00 00 00 00 00 2f 30 0a 81 88 ff ff  ........./0.....

Here, it reports 'ff' as repeated, but it is in fact part of an
address or code dump, where the repetition is legitimate.
The warning is meant to find stylistic issues in commit messages, so
in this case it is simply wrong.

To avoid all such reports, add an additional regex check that skips
lines containing a run of 4 or more two-character hex words separated
by spaces.

A quick evaluation on v5.6..v5.8 showed that this fix reduces
REPEATED_WORD warnings from 2797 to 1043.

A quick manual check found that all of the removed warnings are related
to hex output in commit messages.

Signed-off-by: Aditya Srivastava <yashsri421@gmail.com>
---
 scripts/checkpatch.pl | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)
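
For illustration, the guard added by this patch can be tried on its own.
This is only a minimal sketch: looks_like_hex_dump() is a hypothetical
helper name, not part of checkpatch.pl, and the last sample line is made
up to show the lowercase-only character class. The other samples are
lines quoted in this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Guard from the patch: true when the line holds four or more consecutive
# two-character lowercase hex words, each followed by one or more spaces.
# (Hypothetical helper name, used here for illustration only.)
sub looks_like_hex_dump {
	my ($rawline) = @_;
	return $rawline =~ /(\b[0-9a-f]{2}( )+){4,}/ ? 1 : 0;
}

my @samples = (
	"    00 00 00 00 00 00 00 00 00 2f 30 0a 81 88 ff ff  ........./0.....",
	"\t0x0040:  ffff ffff ffff ffff ffff ffff ffff ffff",
	" Code: ff ff 48 (...)",
	"FF FF FF FF FF FF FF FF",	# made-up uppercase sample
);

for my $line (@samples) {
	printf("%-6s %s\n", looks_like_hex_dump($line) ? "skip" : "check", $line);
}
```

Only the first sample is skipped. The four-digit groups, the short
'ff ff 48' run and the uppercase line are still passed on to the
repeated-word check.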

Comments

Lukas Bulwahn Oct. 21, 2020, 3:08 p.m. UTC | #1
On Wed, 21 Oct 2020, Aditya Srivastava wrote:

> Presence of hexadecimal address or symbol results in false warning
> message by checkpatch.pl.
> 
> For example, running checkpatch on commit b8ad540dd4e4 ("mptcp: fix
> memory leak in mptcp_subflow_create_socket()") results in warning:
> 
> WARNING:REPEATED_WORD: Possible repeated word: 'ff'
>     00 00 00 00 00 00 00 00 00 2f 30 0a 81 88 ff ff  ........./0.....
> 
> Here, it reports 'ff' to be repeated, but it is in fact part of some
> address or code, where it has to be repeated.
> In this case, the intent of the warning to find stylistic issues in
> commit messages is not met and the warning is just completely wrong in
> this case.
> 
> To avoid all such reports, add an additional regex check for a repeating
> pattern of 4 or more 2-lettered words separated by space in a line.
> 
> A quick evaluation on v5.6..v5.8 showed that this fix reduces
> REPEATED_WORD warnings from 2797 to 1043.
> 
> A quick manual check found all cases are related to hex output in
> commit messages.
>

Aditya, one thing I just noticed: the commit message header is a bit
uninformative.

How about something like:

identify typical hex output for a better REPEATED_WORD check

Other than that, it looks good. You might want to share the link to the 
complete report of differences before and after this patch for Joe to 
check as well.

Lukas

> Signed-off-by: Aditya Srivastava <yashsri421@gmail.com>
> ---
>  scripts/checkpatch.pl | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> index 9b9ffd876e8a..78aeb7a3ca3d 100755
> --- a/scripts/checkpatch.pl
> +++ b/scripts/checkpatch.pl
> @@ -3050,8 +3050,10 @@ sub process {
>  			}
>  		}
>  
> -# check for repeated words separated by a single space
> -		if ($rawline =~ /^\+/ || $in_commit_log) {
> +# check for repeated words separated by a single space and
> +# avoid repeating hex occurrences like 'ff ff fe 09 ...'
> +		if (($rawline =~ /^\+/ || $in_commit_log) &&
> +		$rawline !~ /(\b[0-9a-f]{2}( )+){4,}/) {
>  			while ($rawline =~ /\b($word_pattern) (?=($word_pattern))/g) {
>  
>  				my $first = $1;
> -- 
> 2.17.1
> 
>
Joe Perches Oct. 21, 2020, 3:18 p.m. UTC | #2
On Wed, 2020-10-21 at 20:31 +0530, Aditya Srivastava wrote:
> Presence of hexadecimal address or symbol results in false warning
> message by checkpatch.pl.
> 
> For example, running checkpatch on commit b8ad540dd4e4 ("mptcp: fix
> memory leak in mptcp_subflow_create_socket()") results in warning:
> 
> WARNING:REPEATED_WORD: Possible repeated word: 'ff'
>     00 00 00 00 00 00 00 00 00 2f 30 0a 81 88 ff ff  ........./0.....

Right.

> To avoid all such reports, add an additional regex check for a repeating
> pattern of 4 or more 2-lettered words separated by space in a line.

> A quick evaluation on v5.6..v5.8 showed that this fix reduces
> REPEATED_WORD warnings from 2797 to 1043.

Are many of the other 1043 false positives?
Any pattern to them?

> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
[]
> @@ -3050,8 +3050,10 @@ sub process {
>  			}
>  		}
>  
> -# check for repeated words separated by a single space
> -		if ($rawline =~ /^\+/ || $in_commit_log) {
> +# check for repeated words separated by a single space and
> +# avoid repeating hex occurrences like 'ff ff fe 09 ...'
> +		if (($rawline =~ /^\+/ || $in_commit_log) &&
> +		$rawline !~ /(\b[0-9a-f]{2}( )+){4,}/) {

This might be better as \b$Hex to avoid FF FF
and FFFFFFFF FFFFFFFF

I might add that check to the line below where
the repeated words are checked against long
---
 scripts/checkpatch.pl | 1 +
 1 file changed, 1 insertion(+)

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index fab38b493cef..929866999f81 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -3062,6 +3062,7 @@ sub process {
 
 				next if ($first ne $second);
 				next if ($first eq 'long');
+				next if ($first =~ /^$Hex$/;
 
 				if (WARN("REPEATED_WORD",
 					 "Possible repeated word: '$first'\n" . $herecurr) &&
Joe Perches Oct. 21, 2020, 3:28 p.m. UTC | #3
On Wed, 2020-10-21 at 08:18 -0700, Joe Perches wrote:
> I might add that check to the line below where
> the repeated words are checked against long
[]
> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
[]
> @@ -3062,6 +3062,7 @@ sub process {
>  
>  				next if ($first ne $second);
>  				next if ($first eq 'long');
> +				next if ($first =~ /^$Hex$/;

oops.  with a close parenthesis added of course...
Joe Perches Oct. 21, 2020, 4:50 p.m. UTC | #4
On Wed, 2020-10-21 at 08:28 -0700, Joe Perches wrote:
> On Wed, 2020-10-21 at 08:18 -0700, Joe Perches wrote:
> > I might add that check to the line below where
> > the repeated words are checked against long
> []
> > diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> []
> > @@ -3062,6 +3062,7 @@ sub process {
> >  
> >  				next if ($first ne $second);
> >  				next if ($first eq 'long');
> > +				next if ($first =~ /^$Hex$/;
> 
> oops.  with a close parenthesis added of course...

That doesn't work as $Hex expects a leading 0x.

But this does...

The downside of this approach is that it would also not emit
a warning on these repeated words (doesn't seem too bad):

$ grep -P '^[0-9a-f]{2,}$' /usr/share/dict/words
abed
accede
acceded
ace
aced
ad
add
added
baa
baaed
babe
bad
bade
be
bead
beaded
bed
bedded
bee
beef
beefed
cab
cabbed
cad
cede
ceded
dab
dabbed
dad
dead
deaf
deb
decade
decaf
deed
deeded
deface
defaced
ebb
ebbed
efface
effaced
fa
facade
face
faced
fad
fade
faded
fed
fee
feed
---
 scripts/checkpatch.pl | 1 +
 1 file changed, 1 insertion(+)

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index fab38b493cef..79d7a4cba19e 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -3062,6 +3062,7 @@ sub process {
 
 				next if ($first ne $second);
 				next if ($first eq 'long');
+				next if ($first =~ /^[0-9a-f]+$/i);
 
 				if (WARN("REPEATED_WORD",
 					 "Possible repeated word: '$first'\n" . $herecurr) &&
Dwaipayan Ray Oct. 21, 2020, 4:59 p.m. UTC | #5
On Wed, Oct 21, 2020 at 10:21 PM Joe Perches <joe@perches.com> wrote:
>
> On Wed, 2020-10-21 at 08:28 -0700, Joe Perches wrote:
> > On Wed, 2020-10-21 at 08:18 -0700, Joe Perches wrote:
> > > I might add that check to the line below where
> > > the repeated words are checked against long
> > []
> > > diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> > []
> > > @@ -3062,6 +3062,7 @@ sub process {
> > >
> > >                             next if ($first ne $second);
> > >                             next if ($first eq 'long');
> > > +                           next if ($first =~ /^$Hex$/;
> >
> > oops.  with a close parenthesis added of course...
>
> That doesn't work as $Hex expects a leading 0x.
>
> But this does...
>
> The negative of this approach is it would also not emit
> a warning on these repeated words: (doesn't seem too bad)
>
> $ grep -P '^[0-9a-f]{2,}$' /usr/share/dict/words
> abed
> accede
> acceded
> ace
> aced
> ad
> add
> added
> baa
> baaed
> babe
> bad
> bade
> be
> bead
> beaded
> bed
> bedded
> bee
> beef
> beefed
> cab
> cabbed
> cad
> cede
> ceded
> dab
> dabbed
> dad
> dead
> deaf
> deb
> decade
> decaf
> deed
> deeded
> deface
> defaced
> ebb
> ebbed
> efface
> effaced
> fa
> facade
> face
> faced
> fad
> fade
> faded
> fed
> fee
> feed
> ---
>  scripts/checkpatch.pl | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> index fab38b493cef..79d7a4cba19e 100755
> --- a/scripts/checkpatch.pl
> +++ b/scripts/checkpatch.pl
> @@ -3062,6 +3062,7 @@ sub process {
>
>                                 next if ($first ne $second);
>                                 next if ($first eq 'long');
> +                               next if ($first =~ /^[0-9a-f]+$/i);
>
>                                 if (WARN("REPEATED_WORD",
>                                          "Possible repeated word: '$first'\n" . $herecurr) &&
>
>

Hi,
Can we assume that hex numbers occur mostly in pairs or in
groups of 8 digits, like "FF" or "FFFFFFFF"?

I think that might reduce the false negatives further.

Thanks,
Dwaipayan.
Joe Perches Oct. 21, 2020, 5:17 p.m. UTC | #6
On Wed, 2020-10-21 at 22:29 +0530, Dwaipayan Ray wrote:
> Can it be considered that the Hex numbers occur
> mostly in pairs or groups of 8, like "FF" or "FFFFFFFF"?
> 
> I think it might reduce the negative side further.

Maybe.  This already looks for pairs.

Combined with your previous patch,
https://lore.kernel.org/linux-kernel-mentees/20201017162732.152351-1-dwaipayanray1@gmail.com/
it seems OK to me.

Try something out and see if it makes a difference.
Aditya Srivastava Oct. 21, 2020, 5:55 p.m. UTC | #7
On 21/10/20 10:20 pm, Joe Perches wrote:
> On Wed, 2020-10-21 at 08:28 -0700, Joe Perches wrote:
>> On Wed, 2020-10-21 at 08:18 -0700, Joe Perches wrote:
>>> I might add that check to the line below where
>>> the repeated words are checked against long
>> []
>>> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
>> []
>>> @@ -3062,6 +3062,7 @@ sub process {
>>>  
>>>  				next if ($first ne $second);
>>>  				next if ($first eq 'long');
>>> +				next if ($first =~ /^$Hex$/;
>>
>> oops.  with a close parenthesis added of course...
> 
> That doesn't work as $Hex expects a leading 0x.
> 
> But this does...
> 
> The negative of this approach is it would also not emit
> a warning on these repeated words: (doesn't seem too bad)
> 
> $ grep -P '^[0-9a-f]{2,}$' /usr/share/dict/words
> abed
> accede
> acceded
> ace
> aced
> ad
> add
> added
> baa
> baaed
> babe
> bad
> bade
> be
> bead
> beaded
> bed
> bedded
> bee
> beef
> beefed
> cab
> cabbed
> cad
> cede
> ceded
> dab
> dabbed
> dad
> dead
> deaf
> deb
> decade
> decaf
> deed
> deeded
> deface
> defaced
> ebb
> ebbed
> efface
> effaced
> fa
> facade
> face
> faced
> fad
> fade
> faded
> fed
> fee
> feed
> ---
>  scripts/checkpatch.pl | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> index fab38b493cef..79d7a4cba19e 100755
> --- a/scripts/checkpatch.pl
> +++ b/scripts/checkpatch.pl
> @@ -3062,6 +3062,7 @@ sub process {
>  
>  				next if ($first ne $second);
>  				next if ($first eq 'long');
> +				next if ($first =~ /^[0-9a-f]+$/i);
>  
>  				if (WARN("REPEATED_WORD",
>  					 "Possible repeated word: '$first'\n" . $herecurr) &&
> 
> 
> 

Hi Sir,
Thanks for your feedback. I ran a manual check using this approach
over v5.6..v5.8.
The false negatives with this approach are the words 'be'
(frequency 5) and 'add' (frequency 1). For example:
WARNING:REPEATED_WORD: Possible repeated word: 'be'
#278: FILE: drivers/net/ethernet/intel/ice/ice_flow.c:388:
+ * @seg: index of packet segment whose raw fields are to be be extracted

WARNING:REPEATED_WORD: Possible repeated word: 'add'
#21:
Let's also add add a note about using only the l3 access without l4

Apart from these, it works as expected. It also handles the other
hex forms you mentioned. For example:

WARNING:REPEATED_WORD: Possible repeated word: 'ffff'
#15:
	0x0040:  ffff ffff ffff ffff ffff ffff ffff ffff

My approach missed these cases.

It also handles hex sequences that occur fewer than 4 times on a
line (frequency 2), for example:

WARNING:REPEATED_WORD: Possible repeated word: 'ff'
#38:
 Code: ff ff 48 (...)

I'll try to combine both methods and come up with a better approach.

Aditya
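
The behaviour reported above can be reproduced in isolation with a rough
sketch of the repeated-word loop. Assumptions: $word_pattern below is a
simplified stand-in for the variable of the same name in checkpatch.pl,
repeated_words() is a hypothetical helper, and the real script does more
work around each match than is shown here. The last sample line is made
up; the others are quoted from this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: simplified stand-in for checkpatch's $word_pattern.
my $word_pattern = '\b[A-Z]?[a-z]{2,}\b';

# Return the repeated words that would still be reported for one line.
# With $skip_hex set, words made only of hex digits are ignored, as in
# the one-line change proposed earlier in the thread.
sub repeated_words {
	my ($rawline, $skip_hex) = @_;
	my @found;
	while ($rawline =~ /\b($word_pattern) (?=($word_pattern))/g) {
		my ($first, $second) = ($1, $2);
		next if ($first ne $second);
		next if ($first eq 'long');
		next if ($skip_hex && $first =~ /^[0-9a-f]+$/i);
		push(@found, $first);
	}
	return @found;
}

my @samples = (
	" * \@seg: index of packet segment whose raw fields are to be be extracted",
	"Let's also add add a note about using only the l3 access without l4",
	"\t0x0040:  ffff ffff ffff ffff ffff ffff ffff ffff",
	" Code: ff ff 48 (...)",
	"this is is a typo",	# made-up sample of a genuine repeated word
);

for my $line (@samples) {
	my $reported = join(",", repeated_words($line, 1)) || "(none)";
	printf("%-8s %s\n", $reported, $line);
}
```

With the hex skip enabled, only the made-up 'is is' line is reported.
The two hex dumps are silenced as intended, and 'be be' and 'add add'
become false negatives because both words consist solely of hex digits.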
Joe Perches Oct. 21, 2020, 6:05 p.m. UTC | #8
On Wed, 2020-10-21 at 23:25 +0530, Aditya wrote:
> Thanks for your feedback. I ran a manual check using this approach
> over v5.6..v5.8.
> The negatives occurring with this approach are for the word 'be'
> (Frequency 5) and 'add'(Frequency 1). For eg.
> 
> WARNING:REPEATED_WORD: Possible repeated word: 'be'
> #278: FILE: drivers/net/ethernet/intel/ice/ice_flow.c:388:
> + * @seg: index of packet segment whose raw fields are to be be extracted
> 
> WARNING:REPEATED_WORD: Possible repeated word: 'add'
> #21:
> Let's also add add a note about using only the l3 access without l4
> 
> Apart from these, it works as expected. It also takes into account the
> cases for multiple occurrences of hex, as you mentioned. For eg.
> 
> WARNING:REPEATED_WORD: Possible repeated word: 'ffff'
> #15:
[]
> I'll try to combine both methods and come up with a better approach.

Enjoy, but please consider:

If for over 30K patches, there are just a few false positives and
a few false negatives, it likely doesn't need much improvement...

checkpatch works on patch contexts.

It's not intended to be perfect.

It's just a little tool that can help avoid some common defects.
Aditya Srivastava Oct. 21, 2020, 6:25 p.m. UTC | #9
On 21/10/20 11:35 pm, Joe Perches wrote:
> On Wed, 2020-10-21 at 23:25 +0530, Aditya wrote:
>> Thanks for your feedback. I ran a manual check using this approach
>> over v5.6..v5.8.
>> The negatives occurring with this approach are for the word 'be'
>> (Frequency 5) and 'add'(Frequency 1). For eg.
>>
>> WARNING:REPEATED_WORD: Possible repeated word: 'be'
>> #278: FILE: drivers/net/ethernet/intel/ice/ice_flow.c:388:
>> + * @seg: index of packet segment whose raw fields are to be be extracted
>>
>> WARNING:REPEATED_WORD: Possible repeated word: 'add'
>> #21:
>> Let's also add add a note about using only the l3 access without l4
>>
>> Apart from these, it works as expected. It also takes into account the
>> cases for multiple occurrences of hex, as you mentioned. For eg.
>>
>> WARNING:REPEATED_WORD: Possible repeated word: 'ffff'
>> #15:
> []
>> I'll try to combine both methods and come up with a better approach.
> 
> Enjoy, but please consider:
> 
> If for over 30K patches, there are just a few false positives and
> a few false negatives, it likely doesn't need much improvement...
> 
> checkpatch works on patch contexts.
> 
> It's not intended to be perfect.
> 
> It's just a little tool that can help avoid some common defects.
> 
> 

Alright Sir. Then we can proceed with the method you suggested, as it
is more or less perfect.
I'll resend the patch with the updated figures for the reduction in
warnings.

Thanks
Aditya
Aditya Srivastava Oct. 21, 2020, 7:10 p.m. UTC | #10
On 21/10/20 8:48 pm, Joe Perches wrote:
> On Wed, 2020-10-21 at 20:31 +0530, Aditya Srivastava wrote:
>> Presence of hexadecimal address or symbol results in false warning
>> message by checkpatch.pl.
>>
>> For example, running checkpatch on commit b8ad540dd4e4 ("mptcp: fix
>> memory leak in mptcp_subflow_create_socket()") results in warning:
>>
>> WARNING:REPEATED_WORD: Possible repeated word: 'ff'
>>     00 00 00 00 00 00 00 00 00 2f 30 0a 81 88 ff ff  ........./0.....
> 
> Right.
> 
>> To avoid all such reports, add an additional regex check for a repeating
>> pattern of 4 or more 2-lettered words separated by space in a line.
> 
>> A quick evaluation on v5.6..v5.8 showed that this fix reduces
>> REPEATED_WORD warnings from 2797 to 1043.
> 
> Are many of the other 1043 false positives?
> Any pattern to them?
> 

Apart from the cases addressed by the changes Dwaipayan suggested in
https://lore.kernel.org/linux-kernel-mentees/20201017162732.152351-1-dwaipayanray1@gmail.com/
the 'ls -l' output seems to be another common false positive for
REPEATED_WORD (frequency 106 over v5.6..v5.8). For example:

WARNING:REPEATED_WORD: Possible repeated word: 'root'
#18:
  drwxr-xr-x. 2 root root    0 Apr 17 10:53 .

WARNING:REPEATED_WORD: Possible repeated word: 'nobody'
#28:
drwxr-xr-x 5 nobody nobody    0 Jan 25 18:08 .

WARNING:REPEATED_WORD: Possible repeated word: 'irogers'
#17:
  -rw-r----- 1 irogers irogers 553 Apr 17 14:31 ../../../util/unwind-libdw.h

These can be avoided by using:
@@ -3050,8 +3050,10 @@ sub process {
 			}
 		}

-		if ($rawline =~ /^\+/ || $in_commit_log) {
+		if (($rawline =~ /^\+/ || $in_commit_log) &&
+		$rawline !~ /\b[a-z-]+.* \d{1,3} [a-zA-Z]+ \w+ +\d+ \w{3} \d{1,2} \d{1,2}:\d{1,2}/) {

Sincerely
Aditya

>> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> []
>> @@ -3050,8 +3050,10 @@ sub process {
>>  			}
>>  		}
>>  
>> -# check for repeated words separated by a single space
>> -		if ($rawline =~ /^\+/ || $in_commit_log) {
>> +# check for repeated words separated by a single space and
>> +# avoid repeating hex occurrences like 'ff ff fe 09 ...'
>> +		if (($rawline =~ /^\+/ || $in_commit_log) &&
>> +		$rawline !~ /(\b[0-9a-f]{2}( )+){4,}/) {
> 
> This might be better as \b$Hex to avoid FF FF
> and FFFFFFFF FFFFFFFF
> 
> I might add that check to the line below where
> the repeated words are checked against long
> ---
>  scripts/checkpatch.pl | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> index fab38b493cef..929866999f81 100755
> --- a/scripts/checkpatch.pl
> +++ b/scripts/checkpatch.pl
> @@ -3062,6 +3062,7 @@ sub process {
>  
>  				next if ($first ne $second);
>  				next if ($first eq 'long');
> +				next if ($first =~ /^$Hex$/;
>  
>  				if (WARN("REPEATED_WORD",
>  					 "Possible repeated word: '$first'\n" . $herecurr) &&
> 
>
Lukas Bulwahn Oct. 21, 2020, 7:12 p.m. UTC | #11
On Wed, Oct 21, 2020 at 8:25 PM Aditya <yashsri421@gmail.com> wrote:
>
> On 21/10/20 11:35 pm, Joe Perches wrote:
> > On Wed, 2020-10-21 at 23:25 +0530, Aditya wrote:
> >> Thanks for your feedback. I ran a manual check using this approach
> >> over v5.6..v5.8.
> >> The negatives occurring with this approach are for the word 'be'
> >> (Frequency 5) and 'add'(Frequency 1). For eg.
> >>
> >> WARNING:REPEATED_WORD: Possible repeated word: 'be'
> >> #278: FILE: drivers/net/ethernet/intel/ice/ice_flow.c:388:
> >> + * @seg: index of packet segment whose raw fields are to be be extracted
> >>
> >> WARNING:REPEATED_WORD: Possible repeated word: 'add'
> >> #21:
> >> Let's also add add a note about using only the l3 access without l4
> >>
> >> Apart from these, it works as expected. It also takes into account the
> >> cases for multiple occurrences of hex, as you mentioned. For eg.
> >>
> >> WARNING:REPEATED_WORD: Possible repeated word: 'ffff'
> >> #15:
> > []
> >> I'll try to combine both methods and come up with a better approach.
> >
> > Enjoy, but please consider:
> >
> > If for over 30K patches, there are just a few false positives and
> > a few false negatives, it likely doesn't need much improvement...
> >
> > checkpatch works on patch contexts.
> >
> > It's not intended to be perfect.
> >
> > It's just a little tool that can help avoid some common defects.
> >
> >
>
> Alright Sir. Then, we can proceed with the method you suggested, as it
> is more or less perfect.
> I'll re-send the patch with modified reduced warning figure.
>

Aditya, you can also choose to implement your solution;
yes, it is more work for you but it also seems to function better in
the long run.

Clearly, Joe would settle for a simpler solution, but his TODO list of
topics to engage in and work on is also much longer...

Lukas
Joe Perches Oct. 21, 2020, 7:26 p.m. UTC | #12
On Thu, 2020-10-22 at 00:40 +0530, Aditya wrote:
> On 21/10/20 8:48 pm, Joe Perches wrote:
> > On Wed, 2020-10-21 at 20:31 +0530, Aditya Srivastava wrote:
> > > Presence of hexadecimal address or symbol results in false warning
> > > message by checkpatch.pl.
> > > 
> > > For example, running checkpatch on commit b8ad540dd4e4 ("mptcp: fix
> > > memory leak in mptcp_subflow_create_socket()") results in warning:
> > > 
> > > WARNING:REPEATED_WORD: Possible repeated word: 'ff'
> > >     00 00 00 00 00 00 00 00 00 2f 30 0a 81 88 ff ff  ........./0.....
> > 
> > Right.
> > 
> > > To avoid all such reports, add an additional regex check for a repeating
> > > pattern of 4 or more 2-lettered words separated by space in a line.
> > > A quick evaluation on v5.6..v5.8 showed that this fix reduces
> > > REPEATED_WORD warnings from 2797 to 1043.
> > 
> > Are many of the other 1043 false positives?
> > Any pattern to them?
> > 
> Apart from the changes suggested by Dwaipayan in
> https://lore.kernel.org/linux-kernel-mentees/20201017162732.152351-1-dwaipayanray1@gmail.com/
> 
> The 'ls -l' output seems to be another common false positive for
> REPEATED_WORD (Frequency 106 over v5.6..v5.8). For eg.
> 
> WARNING:REPEATED_WORD: Possible repeated word: 'root'
> #18:
>   drwxr-xr-x. 2 root root    0 Apr 17 10:53 .
[]
> @@ -3050,8 +3050,10 @@ sub process {
>  			}
>  		}
> 
> -		if ($rawline =~ /^\+/ || $in_commit_log) {
> +		if (($rawline =~ /^\+/ || $in_commit_log) &&
> +		$rawline !~ /\b[a-z-]+.* \d{1,3} [a-zA-Z]+ \w+ +\d+ \w{3} \d{1,2} \d{1,2}:\d{1,2}/) {

Perhaps a regex for permissions is good enough

	$line !~ /\b[cbdl-][rwxs-]{9,9}\b/
Joe Perches Oct. 21, 2020, 8:36 p.m. UTC | #13
On Wed, 2020-10-21 at 12:26 -0700, Joe Perches wrote:

> Perhaps a regex for permissions is good enough
> 	$line !~ /\b[cbdl-][rwxs-]{9,9}\b/

Maybe not completely correct...

From info ls:

    The file type is one of the following characters:

     ‘-’
          regular file
     ‘b’
          block special file
     ‘c’
          character special file
     ‘C’
          high performance (“contiguous data”) file
     ‘d’
          directory
     ‘D’
          door (Solaris 2.5 and up)
     ‘l’
          symbolic link
     ‘M’
          off-line (“migrated”) file (Cray DMF)
     ‘n’
          network special file (HP-UX)
     ‘p’
          FIFO (named pipe)
     ‘P’
          port (Solaris 10 and up)
     ‘s’
          socket
     ‘?’
          some other file type

     The file mode bits listed are similar to symbolic mode
     specifications (*note Symbolic Modes::).  But ‘ls’ combines
     multiple bits into the third character of each set of permissions
     as follows:

     ‘s’
          If the set-user-ID or set-group-ID bit and the corresponding
          executable bit are both set.

     ‘S’
          If the set-user-ID or set-group-ID bit is set but the
          corresponding executable bit is not set.

     ‘t’
          If the restricted deletion flag or sticky bit, and the
          other-executable bit, are both set.  The restricted deletion
          flag is another name for the sticky bit.  *Note Mode
          Structure::.

     ‘T’
          If the restricted deletion flag or sticky bit is set but the
          other-executable bit is not set.

     ‘x’
          If the executable bit is set and none of the above apply.

     ‘-’
          Otherwise.

So apparently to be correct this should be:

	$line !~ /\b[bcCdDlMnpPs\?-][rwxsStT-]{9,9}\b/
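
A quick way to sanity-check the pattern above is to run it against the
'ls -l' lines quoted earlier in this thread, as in the sketch below.
Assumptions: is_ls_line() is a hypothetical helper, and the second
pattern (the same expression without the \b anchors) is included purely
for comparison, not as something proposed in the thread. In Perl, \b
matches only between a word character and a non-word character, and '-'
is a non-word character, which matters for entries that begin or end
with '-'.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pattern as posted, with \b on both sides.
my $anchored = qr/\b[bcCdDlMnpPs\?-][rwxsStT-]{9,9}\b/;
# Hypothetical variant without the \b anchors, for comparison only.
my $unanchored = qr/[bcCdDlMnpPs\?-][rwxsStT-]{9,9}/;

# True when the line contains something that looks like an ls -l mode field.
sub is_ls_line {
	my ($re, $line) = @_;
	return $line =~ $re ? 1 : 0;
}

my @samples = (
	"  drwxr-xr-x. 2 root root    0 Apr 17 10:53 .",
	"drwxr-xr-x 5 nobody nobody    0 Jan 25 18:08 .",
	"  -rw-r----- 1 irogers irogers 553 Apr 17 14:31",
);

for my $line (@samples) {
	printf("anchored=%d unanchored=%d  %s\n",
	       is_ls_line($anchored, $line),
	       is_ls_line($unanchored, $line), $line);
}
```

Both directory entries match either pattern. The regular-file entry
'-rw-r-----' matches only the variant without \b, since there is no
word boundary between a space and '-'.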
Aditya Srivastava Oct. 22, 2020, 2:21 p.m. UTC | #14
On 22/10/20 12:42 am, Lukas Bulwahn wrote:
> On Wed, Oct 21, 2020 at 8:25 PM Aditya <yashsri421@gmail.com> wrote:
>>
>> On 21/10/20 11:35 pm, Joe Perches wrote:
>>> On Wed, 2020-10-21 at 23:25 +0530, Aditya wrote:
>>>> Thanks for your feedback. I ran a manual check using this approach
>>>> over v5.6..v5.8.
>>>> The negatives occurring with this approach are for the word 'be'
>>>> (Frequency 5) and 'add'(Frequency 1). For eg.
>>>>
>>>> WARNING:REPEATED_WORD: Possible repeated word: 'be'
>>>> #278: FILE: drivers/net/ethernet/intel/ice/ice_flow.c:388:
>>>> + * @seg: index of packet segment whose raw fields are to be be extracted
>>>>
>>>> WARNING:REPEATED_WORD: Possible repeated word: 'add'
>>>> #21:
>>>> Let's also add add a note about using only the l3 access without l4
>>>>
>>>> Apart from these, it works as expected. It also takes into account the
>>>> cases for multiple occurrences of hex, as you mentioned. For eg.
>>>>
>>>> WARNING:REPEATED_WORD: Possible repeated word: 'ffff'
>>>> #15:
>>> []
>>>> I'll try to combine both methods and come up with a better approach.
>>>
>>> Enjoy, but please consider:
>>>
>>> If for over 30K patches, there are just a few false positives and
>>> a few false negatives, it likely doesn't need much improvement...
>>>
>>> checkpatch works on patch contexts.
>>>
>>> It's not intended to be perfect.
>>>
>>> It's just a little tool that can help avoid some common defects.
>>>
>>>
>>
>> Alright Sir. Then, we can proceed with the method you suggested, as it
>> is more or less perfect.
>> I'll re-send the patch with modified reduced warning figure.
>>
> 
> Aditya, you can also choose to implement your solution;
> yes, it is more work for you but it also seems to function better in
> the long run.
> 
> Clearly, Joe would settle for a simpler solution, but his TODO list of
> topics to engage in and work on is also much longer...
> 
> Lukas
> 

Hi Sir
I have implemented my solution. Should I send the patch in reply to
this mail or as a separate mail? Also, should I label it as v2? I have
also addressed the warnings caused by list command ('ls -l') output in
it. For example:

WARNING:REPEATED_WORD: Possible repeated word: 'root'
#18:
  drwxr-xr-x. 2 root root    0 Apr 17 10:53 .

WARNING:REPEATED_WORD: Possible repeated word: 'nobody'
#28:
drwxr-xr-x 5 nobody nobody    0 Jan 25 18:08 .

Sincerely
Aditya
Joe Perches Oct. 22, 2020, 2:35 p.m. UTC | #15
On Thu, 2020-10-22 at 19:51 +0530, Aditya wrote:
> > > Alright Sir.

Joe is fine, sir isn't necessary.
> Hi Sir
> I have implemented my solution. Should I send the patch in reply to
> this mail or as a different mail? Also should I label it as v2? I have
> also addressed the warnings out of list command output in it. for eg.

Either way works.

Patch

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index 9b9ffd876e8a..78aeb7a3ca3d 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -3050,8 +3050,10 @@  sub process {
 			}
 		}
 
-# check for repeated words separated by a single space
-		if ($rawline =~ /^\+/ || $in_commit_log) {
+# check for repeated words separated by a single space and
+# avoid repeating hex occurrences like 'ff ff fe 09 ...'
+		if (($rawline =~ /^\+/ || $in_commit_log) &&
+		$rawline !~ /(\b[0-9a-f]{2}( )+){4,}/) {
 			while ($rawline =~ /\b($word_pattern) (?=($word_pattern))/g) {
 
 				my $first = $1;