[v2] checkpatch: fix false positives in REPEATED_WORD warning
diff mbox series

Message ID 20201022145021.28211-1-yashsri421@gmail.com
State New, archived
Headers show
Series
  • [v2] checkpatch: fix false positives in REPEATED_WORD warning
Related show

Commit Message

Aditya Oct. 22, 2020, 2:50 p.m. UTC
Presence of hexadecimal address or symbol results in false warning
message by checkpatch.pl.

For example, running checkpatch on commit b8ad540dd4e4 ("mptcp: fix
memory leak in mptcp_subflow_create_socket()") results in warning:

WARNING:REPEATED_WORD: Possible repeated word: 'ff'
    00 00 00 00 00 00 00 00 00 2f 30 0a 81 88 ff ff  ........./0.....

Similarly, the presence of list command output in commit results in
an unnecessary warning.

For example, running checkpatch on commit 899e5ffbf246 ("perf record:
Introduce --switch-output-event") gives:

WARNING:REPEATED_WORD: Possible repeated word: 'root'
  dr-xr-x---. 12 root root    4096 Apr 27 17:46 ..

Here, it reports 'ff' and 'root to be repeated, but it is in fact part
of some address or code, where it has to be repeated.

In these cases, the intent of the warning to find stylistic issues in
commit messages is not met and the warning is just completely wrong in
this case.

To avoid these warnings, add additional regex check for the
directory permission pattern and avoid checking the line for this
class of warning. Similarly, to avoid hex pattern, check if the word
consists of hex symbols and skip this warning if, it forms a pattern
of repeating sequence or contains more special character after pattern
than non hex characters.

A quick evaluation on v5.6..v5.8 showed that this fix reduces
REPEATED_WORD warnings from 2797 to 907.

A quick manual check found all cases are related to hex output or
list command outputs in commit messages.

Signed-off-by: Aditya Srivastava <yashsri421@gmail.com>
---
 scripts/checkpatch.pl | 33 ++++++++++++++++++++++++++++++++-
 1 file changed, 32 insertions(+), 1 deletion(-)

Comments

Aditya Oct. 22, 2020, 2:58 p.m. UTC | #1
On 22/10/20 8:20 pm, Aditya Srivastava wrote:
> Presence of hexadecimal address or symbol results in false warning
> message by checkpatch.pl.
> 
> For example, running checkpatch on commit b8ad540dd4e4 ("mptcp: fix
> memory leak in mptcp_subflow_create_socket()") results in warning:
> 
> WARNING:REPEATED_WORD: Possible repeated word: 'ff'
>     00 00 00 00 00 00 00 00 00 2f 30 0a 81 88 ff ff  ........./0.....
> 
> Similarly, the presence of list command output in commit results in
> an unnecessary warning.
> 
> For example, running checkpatch on commit 899e5ffbf246 ("perf record:
> Introduce --switch-output-event") gives:
> 
> WARNING:REPEATED_WORD: Possible repeated word: 'root'
>   dr-xr-x---. 12 root root    4096 Apr 27 17:46 ..
> 
> Here, it reports 'ff' and 'root to be repeated, but it is in fact part
> of some address or code, where it has to be repeated.
> 
> In these cases, the intent of the warning to find stylistic issues in
> commit messages is not met and the warning is just completely wrong in
> this case.
> 
> To avoid these warnings, add additional regex check for the
> directory permission pattern and avoid checking the line for this
> class of warning. Similarly, to avoid hex pattern, check if the word
> consists of hex symbols and skip this warning if, it forms a pattern
> of repeating sequence or contains more special character after pattern
> than non hex characters.
> 
> A quick evaluation on v5.6..v5.8 showed that this fix reduces
> REPEATED_WORD warnings from 2797 to 907.
> 
> A quick manual check found all cases are related to hex output or
> list command outputs in commit messages.
> 
> Signed-off-by: Aditya Srivastava <yashsri421@gmail.com>
> ---
>  scripts/checkpatch.pl | 33 ++++++++++++++++++++++++++++++++-
>  1 file changed, 32 insertions(+), 1 deletion(-)
> 
> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> index 9b9ffd876e8a..0f57e99ed670 100755
> --- a/scripts/checkpatch.pl
> +++ b/scripts/checkpatch.pl
> @@ -3051,7 +3051,10 @@ sub process {
>  		}
>  
>  # check for repeated words separated by a single space
> -		if ($rawline =~ /^\+/ || $in_commit_log) {
> +# avoid false positive from list command eg, '-rw-r--r-- 1 root root'
> +		if (($rawline =~ /^\+/ || $in_commit_log) &&
> +		$rawline !~ /[bcCdDlMnpPs\?-][rwxsStT-]{9}/) {
> +			my @hex_seq = $rawline =~ /\b[0-9a-f]{2,} \b/g;
>  			while ($rawline =~ /\b($word_pattern) (?=($word_pattern))/g) {
>  
>  				my $first = $1;
> @@ -3065,6 +3068,34 @@ sub process {
>  				next if ($first ne $second);
>  				next if ($first eq 'long');
>  
> +				# avoid repeating hex occurrences like 'ff ff fe 09 ...'
> +				if ($first =~ /\b[0-9a-f]{2,}/) {
> +					# if such sequence occurs more than 4, it is most probably part of some of code
> +					next if ((scalar @hex_seq)>4);
> +
> +					# for hex occurrences which are less than 4
> +					# get first hex word in the line
> +					if ($rawline =~ /\b[0-9a-f]{2,} /) {
> +						my $post_hex_seq = $';
> +
> +						# set suffieciently high default values to avoid ignoring or counting in absence of another
> +						my $non_hex_char_pos = 1000;
> +						my $special_chars_pos = 500;
> +
> +						if ($post_hex_seq =~ /[g-z]+/) {
> +							# first non hex character in post_hex_seq
> +							$non_hex_char_pos = $-[0];
> +						}
> +						if($post_hex_seq =~ /[^a-zA-Z0-9]{2,}/) {
> +							# first occurrence of 2 or more special chars
> +							$special_chars_pos = $-[0];
> +						}
> +
> +						# as such occurrences are not common, it is more likely to be a part of some code
> +						next if ($special_chars_pos<$non_hex_char_pos);
> +					}
> +				}
> +
>  				if (WARN("REPEATED_WORD",
>  					 "Possible repeated word: '$first'\n" . $herecurr) &&
>  				    $fix) {
> 

I have also generated some reports showing impact caused by the changes:

Hex variations causing warning previously:
https://github.com/AdityaSrivast/kernel-tasks/blob/master/Task3/hex_variations.txt

Change in warnings and errors:
https://github.com/AdityaSrivast/kernel-tasks/blob/master/Task3/relative_summary/summary_relative.txt

Warnings which are dropped:
https://github.com/AdityaSrivast/kernel-tasks/blob/master/Task3/relative_summary/dropped_warnings/summary.txt

Thanks
Aditya
Joe Perches Oct. 22, 2020, 4:10 p.m. UTC | #2
On Thu, 2020-10-22 at 20:20 +0530, Aditya Srivastava wrote:
> Presence of hexadecimal address or symbol results in false warning
> message by checkpatch.pl.
[]
> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
[]
> @@ -3051,7 +3051,10 @@ sub process {
>  		}
>  
>  # check for repeated words separated by a single space
> -		if ($rawline =~ /^\+/ || $in_commit_log) {
> +# avoid false positive from list command eg, '-rw-r--r-- 1 root root'
> +		if (($rawline =~ /^\+/ || $in_commit_log) &&
> +		$rawline !~ /[bcCdDlMnpPs\?-][rwxsStT-]{9}/) {

Alignment and use \b before and after the regex please.

		if (($rawline =~ /^\+/ || $in_commit_log) &&
		    $rawline !~ /\b[bcCdDlMnpPs\?-][rwxsStT-]{9}\b/) {
> @@ -3065,6 +3068,34 @@ sub process {
>  				next if ($first ne $second);
>  				next if ($first eq 'long');
>  
> +				# avoid repeating hex occurrences like 'ff ff fe 09 ...'
> +				if ($first =~ /\b[0-9a-f]{2,}/) {
> +					# if such sequence occurs more than 4, it is most probably part of some of code
> +					next if ((scalar @hex_seq)>4);
> +					# for hex occurrences which are less than 4
> +					# get first hex word in the line
> +					if ($rawline =~ /\b[0-9a-f]{2,} /) {
> +						my $post_hex_seq = $';
> +
> +						# set suffieciently high default values to avoid ignoring or counting in absence of another
> +						my $non_hex_char_pos = 1000;
> +						my $special_chars_pos = 500;
> +
> +						if ($post_hex_seq =~ /[g-z]+/) {
> +							# first non hex character in post_hex_seq
> +							$non_hex_char_pos = $-[0];
> +						}
> +						if($post_hex_seq =~ /[^a-zA-Z0-9]{2,}/) {
> +							# first occurrence of 2 or more special chars
> +							$special_chars_pos = $-[0];
> +						}

What does all this code actually avoid?
Aditya Oct. 22, 2020, 7:14 p.m. UTC | #3
On 22/10/20 9:40 pm, Joe Perches wrote:
> On Thu, 2020-10-22 at 20:20 +0530, Aditya Srivastava wrote:
>> Presence of hexadecimal address or symbol results in false warning
>> message by checkpatch.pl.
> []
>> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> []
>> @@ -3051,7 +3051,10 @@ sub process {
>>  		}
>>  
>>  # check for repeated words separated by a single space
>> -		if ($rawline =~ /^\+/ || $in_commit_log) {
>> +# avoid false positive from list command eg, '-rw-r--r-- 1 root root'
>> +		if (($rawline =~ /^\+/ || $in_commit_log) &&
>> +		$rawline !~ /[bcCdDlMnpPs\?-][rwxsStT-]{9}/) {
> 
> Alignment and use \b before and after the regex please.

If we use \b either before or after or both it does not match patterns
such as:
+   -rw-r--r--. 1 root root 112K Mar 20 12:16
selinux-policy-3.14.4-48.fc31.noarch.rpm

This is happening probably because it is counting '-' for '\b'
I have not observed any negatives of using this though.

> 
> 		if (($rawline =~ /^\+/ || $in_commit_log) &&
> 		    $rawline !~ /\b[bcCdDlMnpPs\?-][rwxsStT-]{9}\b/) {
>> @@ -3065,6 +3068,34 @@ sub process {
>>  				next if ($first ne $second);
>>  				next if ($first eq 'long');
>>  
>> +				# avoid repeating hex occurrences like 'ff ff fe 09 ...'
>> +				if ($first =~ /\b[0-9a-f]{2,}/) {
>> +					# if such sequence occurs more than 4, it is most probably part of some of code
>> +					next if ((scalar @hex_seq)>4);
>> +					# for hex occurrences which are less than 4
>> +					# get first hex word in the line
>> +					if ($rawline =~ /\b[0-9a-f]{2,} /) {
>> +						my $post_hex_seq = $';
>> +
>> +						# set suffieciently high default values to avoid ignoring or counting in absence of another
>> +						my $non_hex_char_pos = 1000;
>> +						my $special_chars_pos = 500;
>> +
>> +						if ($post_hex_seq =~ /[g-z]+/) {
>> +							# first non hex character in post_hex_seq
>> +							$non_hex_char_pos = $-[0];
>> +						}
>> +						if($post_hex_seq =~ /[^a-zA-Z0-9]{2,}/) {
>> +							# first occurrence of 2 or more special chars
>> +							$special_chars_pos = $-[0];
>> +						}
> 
> What does all this code actually avoid?
> 
> 

Sir, there are multiple variations of hex for which this warning is
occurring, for eg:
1) 00 c0 06 16 00 00 ff ff 00 93 1c 18 00 00 ff ff  ................
2) ffffffff ffffffff 00000000 c070058c
3)     f5a:       48 c7 44 24 78 ff ff    movq
$0xffffffffffffffff,0x78(%rsp)
4) +    fe fe
5) +    fe fe   - ? end marker ?
6) Code: ff ff 48 (...)

So I first check if the repeated word matches /\b[0-9a-f]{2,}/ . If it
does and occurs as a sequence of such repetitions more than 4(ie more
than or equal to 5), then it is most probably a part of hexadecimal
code. This is implemented here,

+				if ($first =~ /\b[0-9a-f]{2,}/) {
+					# if such sequence occurs more than 4, it is most probably part
of some of code
+					next if ((scalar @hex_seq)>4);

This addresses our issues for warning similar to example (1),(2) and (3).

But still we haven't detected 4,5,6. One can argue that we can modify:

+					next if ((scalar @hex_seq)>4);

with (scalar @hex_seq)>2 or (scalar @hex_seq)>3

but then, we'll not be able to account for warnings such as:

7) +	 * sets this to -1, the slack value will be calculated to be be
halfway
8) + * @seg: index of packet segment whose raw fields are to be be
extracted
9) The data in destination buffer is expected to be be parsed in big
10) +	 *   1. New session or device can'be be created - session sysfs
files

Here I observed that in hex codes, there are atleast 2 special
characters present before any non-hex character, for eg. in (5). Also
generally such occurrences are very rare in writing english, and it is
also helpful in our case.

This is implemented here:

>> +				# avoid repeating hex occurrences like 'ff ff fe 09 ...'
>> +				if ($first =~ /\b[0-9a-f]{2,}/) {
>> +					# if such sequence occurs more than 4, it is most probably
part of some of code
>> +					next if ((scalar @hex_seq)>4);
>> +					# for hex occurrences which are less than 4
>> +					# get first hex word in the line
>> +					if ($rawline =~ /\b[0-9a-f]{2,} /) {
>> +						my $post_hex_seq = $';
>> +
>> +						# set suffieciently high default values to avoid ignoring or
counting in absence of another
>> +						my $non_hex_char_pos = 1000;
>> +						my $special_chars_pos = 500;
>> +
>> +						if ($post_hex_seq =~ /[g-z]+/) {
>> +							# first non hex character in post_hex_seq
>> +							$non_hex_char_pos = $-[0];
>> +						}
>> +						if($post_hex_seq =~ /[^a-zA-Z0-9]{2,}/) {
>> +							# first occurrence of 2 or more special chars
>> +							$special_chars_pos = $-[0];
>> +						}

I have used these two lines for cases like example(4):
+						my $non_hex_char_pos = 1000;
+						my $special_chars_pos = 500;

Here, non-hex characters are missing, thus the default character helps
us to get desired result.
Also, I have set higher values such that if one of them occurs in a
line, the result remain unaffected, than with lower default values.


Thanks
Aditya
Joe Perches Oct. 22, 2020, 7:33 p.m. UTC | #4
On Fri, 2020-10-23 at 00:44 +0530, Aditya wrote:
> On 22/10/20 9:40 pm, Joe Perches wrote:
> > On Thu, 2020-10-22 at 20:20 +0530, Aditya Srivastava wrote:
> > > Presence of hexadecimal address or symbol results in false warning
> > > message by checkpatch.pl.
> > []
> > > diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> > []
> > > @@ -3051,7 +3051,10 @@ sub process {
> > >  		}
> > >  
> > >  # check for repeated words separated by a single space
> > > -		if ($rawline =~ /^\+/ || $in_commit_log) {
> > > +# avoid false positive from list command eg, '-rw-r--r-- 1 root root'
> > > +		if (($rawline =~ /^\+/ || $in_commit_log) &&
> > > +		$rawline !~ /[bcCdDlMnpPs\?-][rwxsStT-]{9}/) {
> > 
> > Alignment and use \b before and after the regex please.
> 
> If we use \b either before or after or both it does not match patterns
> such as:
> +   -rw-r--r--. 1 root root 112K Mar 20 12:16'
selinux-policy-3.14.4-48.fc31.noarch.rpm

OK, thanks, it's good you checked.

> > []

> > What does all this code actually avoid?
> 
> Sir, there are multiple variations of hex for which this warning is
> occurring, for eg:
> 1) 00 c0 06 16 00 00 ff ff 00 93 1c 18 00 00 ff ff  ................
> 2) ffffffff ffffffff 00000000 c070058c
> 3)     f5a:       48 c7 44 24 78 ff ff    movq
> $0xffffffffffffffff,0x78(%rsp)
> 4) +    fe fe
> 5) +    fe fe   - ? end marker ?
> 6) Code: ff ff 48 (...)

So why not just match first with /^[0-9a-f]+$/i ?

Doesn't that match all the cases listed above?
Aditya Oct. 22, 2020, 9:05 p.m. UTC | #5
On 23/10/20 1:03 am, Joe Perches wrote:
> On Fri, 2020-10-23 at 00:44 +0530, Aditya wrote:
>> On 22/10/20 9:40 pm, Joe Perches wrote:
>>> On Thu, 2020-10-22 at 20:20 +0530, Aditya Srivastava wrote:
>>>> Presence of hexadecimal address or symbol results in false warning
>>>> message by checkpatch.pl.
>>> []
>>>> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
>>> []
>>>> @@ -3051,7 +3051,10 @@ sub process {
>>>>  		}
>>>>  
>>>>  # check for repeated words separated by a single space
>>>> -		if ($rawline =~ /^\+/ || $in_commit_log) {
>>>> +# avoid false positive from list command eg, '-rw-r--r-- 1 root root'
>>>> +		if (($rawline =~ /^\+/ || $in_commit_log) &&
>>>> +		$rawline !~ /[bcCdDlMnpPs\?-][rwxsStT-]{9}/) {
>>>
>>> Alignment and use \b before and after the regex please.
>>
>> If we use \b either before or after or both it does not match patterns
>> such as:
>> +   -rw-r--r--. 1 root root 112K Mar 20 12:16'
> selinux-policy-3.14.4-48.fc31.noarch.rpm
> 
> OK, thanks, it's good you checked.
> 
>>> []
> 
>>> What does all this code actually avoid?
>>
>> Sir, there are multiple variations of hex for which this warning is
>> occurring, for eg:
>> 1) 00 c0 06 16 00 00 ff ff 00 93 1c 18 00 00 ff ff  ................
>> 2) ffffffff ffffffff 00000000 c070058c
>> 3)     f5a:       48 c7 44 24 78 ff ff    movq
>> $0xffffffffffffffff,0x78(%rsp)
>> 4) +    fe fe
>> 5) +    fe fe   - ? end marker ?
>> 6) Code: ff ff 48 (...)
> 
> So why not just match first with /^[0-9a-f]+$/i ?
> 
> Doesn't that match all the cases listed above?
> 
> 

Then, we'll not be able to account for cases such as:

1) +     * sets this to -1, the slack value will be calculated to be be
halfway [For 'be' 'be']
2) + * @seg: index of packet segment whose raw fields are to be be
extracted [For 'be' 'be']
3) Let's also add add a note about using only the l3 access without l4
[For 'add' 'add']

where it will not detect them.

Aditya
Joe Perches Oct. 22, 2020, 10:46 p.m. UTC | #6
On Fri, 2020-10-23 at 02:35 +0530, Aditya wrote:
> On 23/10/20 1:03 am, Joe Perches wrote:
> > On Fri, 2020-10-23 at 00:44 +0530, Aditya wrote:
> > > On 22/10/20 9:40 pm, Joe Perches wrote:
> > > > On Thu, 2020-10-22 at 20:20 +0530, Aditya Srivastava wrote:
> > > > > Presence of hexadecimal address or symbol results in false warning
> > > > > message by checkpatch.pl.
> > > > []
> > > > > diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> > > > []
> > > > > @@ -3051,7 +3051,10 @@ sub process {
> > > > >  		}
> > > > >  
> > > > >  # check for repeated words separated by a single space
> > > > > -		if ($rawline =~ /^\+/ || $in_commit_log) {
> > > > > +# avoid false positive from list command eg, '-rw-r--r-- 1 root root'
> > > > > +		if (($rawline =~ /^\+/ || $in_commit_log) &&
> > > > > +		$rawline !~ /[bcCdDlMnpPs\?-][rwxsStT-]{9}/) {
> > > > 
> > > > Alignment and use \b before and after the regex please.
> > > 
> > > If we use \b either before or after or both it does not match patterns
> > > such as:
> > > +   -rw-r--r--. 1 root root 112K Mar 20 12:16'
> > selinux-policy-3.14.4-48.fc31.noarch.rpm
> > 
> > OK, thanks, it's good you checked.
> > 
> > > > []
> > > > What does all this code actually avoid?
> > > 
> > > Sir, there are multiple variations of hex for which this warning is
> > > occurring, for eg:
> > > 1) 00 c0 06 16 00 00 ff ff 00 93 1c 18 00 00 ff ff  ................
> > > 2) ffffffff ffffffff 00000000 c070058c
> > > 3)     f5a:       48 c7 44 24 78 ff ff    movq
> > > $0xffffffffffffffff,0x78(%rsp)
> > > 4) +    fe fe
> > > 5) +    fe fe   - ? end marker ?
> > > 6) Code: ff ff 48 (...)
> > 
> > So why not just match first with /^[0-9a-f]+$/i ?
> > 
> > Doesn't that match all the cases listed above?
> > 
> > 
> 
> Then, we'll not be able to account for cases such as:
> 
> 1) +     * sets this to -1, the slack value will be calculated to be be
> halfway [For 'be' 'be']
> 2) + * @seg: index of packet segment whose raw fields are to be be
> extracted [For 'be' 'be']
> 3) Let's also add add a note about using only the l3 access without l4
> [For 'add' 'add']

Like the use of long, I think you're better off with
either a list or hash of specific words to ignore.
Aditya Oct. 23, 2020, 6:33 a.m. UTC | #7
On 23/10/20 4:16 am, Joe Perches wrote:
> On Fri, 2020-10-23 at 02:35 +0530, Aditya wrote:
>> On 23/10/20 1:03 am, Joe Perches wrote:
>>> On Fri, 2020-10-23 at 00:44 +0530, Aditya wrote:
>>>> On 22/10/20 9:40 pm, Joe Perches wrote:
>>>>> On Thu, 2020-10-22 at 20:20 +0530, Aditya Srivastava wrote:
>>>>>> Presence of hexadecimal address or symbol results in false warning
>>>>>> message by checkpatch.pl.
>>>>> []
>>>>>> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
>>>>> []
>>>>>> @@ -3051,7 +3051,10 @@ sub process {
>>>>>>  		}
>>>>>>  
>>>>>>  # check for repeated words separated by a single space
>>>>>> -		if ($rawline =~ /^\+/ || $in_commit_log) {
>>>>>> +# avoid false positive from list command eg, '-rw-r--r-- 1 root root'
>>>>>> +		if (($rawline =~ /^\+/ || $in_commit_log) &&
>>>>>> +		$rawline !~ /[bcCdDlMnpPs\?-][rwxsStT-]{9}/) {
>>>>>
>>>>> Alignment and use \b before and after the regex please.
>>>>
>>>> If we use \b either before or after or both it does not match patterns
>>>> such as:
>>>> +   -rw-r--r--. 1 root root 112K Mar 20 12:16'
>>> selinux-policy-3.14.4-48.fc31.noarch.rpm
>>>
>>> OK, thanks, it's good you checked.
>>>
>>>>> []
>>>>> What does all this code actually avoid?
>>>>
>>>> Sir, there are multiple variations of hex for which this warning is
>>>> occurring, for eg:
>>>> 1) 00 c0 06 16 00 00 ff ff 00 93 1c 18 00 00 ff ff  ................
>>>> 2) ffffffff ffffffff 00000000 c070058c
>>>> 3)     f5a:       48 c7 44 24 78 ff ff    movq
>>>> $0xffffffffffffffff,0x78(%rsp)
>>>> 4) +    fe fe
>>>> 5) +    fe fe   - ? end marker ?
>>>> 6) Code: ff ff 48 (...)
>>>
>>> So why not just match first with /^[0-9a-f]+$/i ?
>>>
>>> Doesn't that match all the cases listed above?
>>>
>>>
>>
>> Then, we'll not be able to account for cases such as:
>>
>> 1) +     * sets this to -1, the slack value will be calculated to be be
>> halfway [For 'be' 'be']
>> 2) + * @seg: index of packet segment whose raw fields are to be be
>> extracted [For 'be' 'be']
>> 3) Let's also add add a note about using only the l3 access without l4
>> [For 'add' 'add']
> 
> Like the use of long, I think you're better off with
> either a list or hash of specific words to ignore.
> 
> 
> 

Ok Sir. Actually, I was proceeding with this approach initially. But
it has some side affects like it won't be able to detect hex such as:

'ff 4d be be be 9f 04 48'

Here although 'be' is part of hex sequence, it will be reported in the
warning.
However, though such cases haven't occurred over v6..v8.

Actually, we could simplify it more on the basis of occurrences in
v6..v8, for eg. if we check for /[0-9c-f][0-9a-f]+/ (instead of
/[0-9a-f]{2,}/), it gives us desired result over v6..v8, but again
we'll miss out cases such as:

'ff af af bc bc ba ba bd bd'

To counter all these, I had implemented the solution in the proposed
way. But, I'll change it as you suggested. The advantage of it can be
that the code complexity is less and the negatives will be few.

Thanks
Aditya
Aditya Oct. 23, 2020, 6:38 a.m. UTC | #8
On 23/10/20 12:03 pm, Aditya wrote:
> However, though such cases haven't occurred over v6..v8.
> 
> Actually, we could simplify it more on the basis of occurrences in
> v6..v8, for eg. if we check for /[0-9c-f][0-9a-f]+/ (instead of
> /[0-9a-f]{2,}/), it gives us desired result over v6..v8, but again
> we'll miss out cases such as:
> 

Sorry, v5.6..v5.8

Patch
diff mbox series

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index 9b9ffd876e8a..0f57e99ed670 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -3051,7 +3051,10 @@  sub process {
 		}
 
 # check for repeated words separated by a single space
-		if ($rawline =~ /^\+/ || $in_commit_log) {
+# avoid false positive from list command eg, '-rw-r--r-- 1 root root'
+		if (($rawline =~ /^\+/ || $in_commit_log) &&
+		$rawline !~ /[bcCdDlMnpPs\?-][rwxsStT-]{9}/) {
+			my @hex_seq = $rawline =~ /\b[0-9a-f]{2,} \b/g;
 			while ($rawline =~ /\b($word_pattern) (?=($word_pattern))/g) {
 
 				my $first = $1;
@@ -3065,6 +3068,34 @@  sub process {
 				next if ($first ne $second);
 				next if ($first eq 'long');
 
+				# avoid repeating hex occurrences like 'ff ff fe 09 ...'
+				if ($first =~ /\b[0-9a-f]{2,}/) {
+					# if such sequence occurs more than 4, it is most probably part of some of code
+					next if ((scalar @hex_seq)>4);
+
+					# for hex occurrences which are less than 4
+					# get first hex word in the line
+					if ($rawline =~ /\b[0-9a-f]{2,} /) {
+						my $post_hex_seq = $';
+
+						# set suffieciently high default values to avoid ignoring or counting in absence of another
+						my $non_hex_char_pos = 1000;
+						my $special_chars_pos = 500;
+
+						if ($post_hex_seq =~ /[g-z]+/) {
+							# first non hex character in post_hex_seq
+							$non_hex_char_pos = $-[0];
+						}
+						if($post_hex_seq =~ /[^a-zA-Z0-9]{2,}/) {
+							# first occurrence of 2 or more special chars
+							$special_chars_pos = $-[0];
+						}
+
+						# as such occurrences are not common, it is more likely to be a part of some code
+						next if ($special_chars_pos<$non_hex_char_pos);
+					}
+				}
+
 				if (WARN("REPEATED_WORD",
 					 "Possible repeated word: '$first'\n" . $herecurr) &&
 				    $fix) {