All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] checkpatch: handle utf8 while computing length of commit msg lines
@ 2022-10-21 19:15 Antonio Borneo
  2022-10-22  5:48 ` Joe Perches
  0 siblings, 1 reply; 3+ messages in thread
From: Antonio Borneo @ 2022-10-21 19:15 UTC (permalink / raw)
  To: Andy Whitcroft, Joe Perches, Dwaipayan Ray, Lukas Bulwahn, linux-kernel
  Cc: Antonio Borneo

The current check for the length of each line in the commit msg
uses length($line) that counts line's bytes.
If the line contains utf8 characters, the byte count can exceed
the cap even on quite short lines.

Count the utf8 characters for checking line length.

Signed-off-by: Antonio Borneo <antonio.borneo@foss.st.com>

---

Actually it's not fully clear to me if utf8 characters in the
commit msg are acceptable/tolerated or to be avoided.
In the commit msg of 15662b3e8644 ("checkpatch: add a --strict
check for utf-8 in commit logs") is stated:
	Some find using utf-8 in commit logs inappropriate.


 scripts/checkpatch.pl | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index 1e5e66ae5a52..eaad5da50554 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -3220,7 +3220,7 @@ sub process {
 
 # Check for line lengths > 75 in commit log, warn once
 		if ($in_commit_log && !$commit_log_long_line &&
-		    length($line) > 75 &&
+		    length(decode("utf8", $line)) > 75 &&
 		    !($line =~ /^\s*[a-zA-Z0-9_\/\.]+\s+\|\s+\d+/ ||
 					# file delta changes
 		      $line =~ /^\s*(?:[\w\.\-\+]*\/)++[\w\.\-\+]+:/ ||

base-commit: 9abf2313adc1ca1b6180c508c25f22f9395cc780
-- 
2.38.0


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH] checkpatch: handle utf8 while computing length of commit msg lines
  2022-10-21 19:15 [PATCH] checkpatch: handle utf8 while computing length of commit msg lines Antonio Borneo
@ 2022-10-22  5:48 ` Joe Perches
  2022-10-22 18:33   ` Linus Torvalds
  0 siblings, 1 reply; 3+ messages in thread
From: Joe Perches @ 2022-10-22  5:48 UTC (permalink / raw)
  To: Antonio Borneo, Andy Whitcroft, Dwaipayan Ray, Lukas Bulwahn,
	linux-kernel
  Cc: Andrew Morton, Linus Torvalds

On Fri, 2022-10-21 at 21:15 +0200, Antonio Borneo wrote:
> The current check for the length of each line in the commit msg
> uses length($line) that counts line's bytes.
> If the line contains utf8 characters, the byte count can exceed
> the cap even on quite short lines.
> 
> Count the utf8 characters for checking line length.
> 
> Signed-off-by: Antonio Borneo <antonio.borneo@foss.st.com>
> 
> ---
> 
> Actually it's not fully clear to me if utf8 characters in the
> commit msg are acceptable/tolerated or to be avoided.

Nor is it to me, likely it's OK though as at least checkpatch has an
existing test/comment for nominally valid UTF-8 in commit messages.

			CHK("INVALID_UTF8",
			    "Invalid UTF-8, patch and commit message should be encoded in UTF-8\n" . $hereptr);

> In the commit msg of 15662b3e8644 ("checkpatch: add a --strict
> check for utf-8 in commit logs") is stated:
> 	Some find using utf-8 in commit logs inappropriate.

I don't particularly care one way or another.

Andrew?  Linus?

> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> index 1e5e66ae5a52..eaad5da50554 100755
> --- a/scripts/checkpatch.pl
> +++ b/scripts/checkpatch.pl
> @@ -3220,7 +3220,7 @@ sub process {
>  
>  # Check for line lengths > 75 in commit log, warn once
>  		if ($in_commit_log && !$commit_log_long_line &&
> -		    length($line) > 75 &&
> +		    length(decode("utf8", $line)) > 75 &&
>  		    !($line =~ /^\s*[a-zA-Z0-9_\/\.]+\s+\|\s+\d+/ ||
>  					# file delta changes
>  		      $line =~ /^\s*(?:[\w\.\-\+]*\/)++[\w\.\-\+]+:/ ||
> 
> base-commit: 9abf2313adc1ca1b6180c508c25f22f9395cc780


^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH] checkpatch: handle utf8 while computing length of commit msg lines
  2022-10-22  5:48 ` Joe Perches
@ 2022-10-22 18:33   ` Linus Torvalds
  0 siblings, 0 replies; 3+ messages in thread
From: Linus Torvalds @ 2022-10-22 18:33 UTC (permalink / raw)
  To: Joe Perches
  Cc: Antonio Borneo, Andy Whitcroft, Dwaipayan Ray, Lukas Bulwahn,
	linux-kernel, Andrew Morton

On Fri, Oct 21, 2022 at 10:48 PM Joe Perches <joe@perches.com> wrote:
>
> On Fri, 2022-10-21 at 21:15 +0200, Antonio Borneo wrote:
> >
> > Actually it's not fully clear to me if utf8 characters in the
> > commit msg are acceptable/tolerated or to be avoided.

utf8 is not just acceptable, but actively encouraged in commit
messages. Not *grtatuitous* use (please - no emojis) but there is
absolutely nothing wrong when using utf8 when appropriate.

And getting people's names right is not just appropriate, but actually
important. And depending on where in the world you are from, utf8 is
absolutely required, and no, we don't do Latin1 for that subset of the
world (any more - we have a dark history of Latin1 in some corners).

That said, I'm not convinced the whole line length check really
matters, or is even appropriate. A lot of commit messages absolutely
should have long lines, regardless of any UTF8 issues.

Just as a recent example, see commit 71e2d666ef85 ("mm/huge_memory: do
not clobber swp_entry_t during THP split"), which has a 200+ character
line, and that's *exactly* what it should have. Splitting that line
would be actively wrong. The same often goes for things like quoted
compiler warnings etc.

I personally can't think of a case where we've actually had issues wrt
"line length in bytes vs line length in characters". And I'm not
convinced the length check is appropriate in the first place.

The only line that really shouldn't be overly long is the _first_ line
of the commit message, because that tends to be a "somebody write a
whole paragraph in line-wrapped mode". And the first line of the
commit message really is special, and should not just be of a
reasonable length (although75 chars may be too restrictive), but
should have an empty line after it.

I didn't look into what the checkpatch.pl script does around that
code, maybe that's what it already does.

                       Linus

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2022-10-22 18:33 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-21 19:15 [PATCH] checkpatch: handle utf8 while computing length of commit msg lines Antonio Borneo
2022-10-22  5:48 ` Joe Perches
2022-10-22 18:33   ` Linus Torvalds

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.