linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2] checkpatch: Only encode UTF-8 quoted printable mail headers
@ 2018-07-18 14:52 Geert Uytterhoeven
  2018-07-18 21:42 ` Andrew Morton
  2018-07-19 14:50 ` Arnd Bergmann
  0 siblings, 2 replies; 7+ messages in thread
From: Geert Uytterhoeven @ 2018-07-18 14:52 UTC (permalink / raw)
  To: Andy Whitcroft, Joe Perches, Andrew Morton
  Cc: Stephen Rothwell, linux-kernel, Geert Uytterhoeven

As PERL uses its own internal character encoding, always calling
encode("utf8", ...) on the author name may cause corruption, leading to
an author signoff mismatch.

This happens in the following cases:
  - If a patch is in ISO-8859, and contains a non-ASCII author name in
    the From: line, it is converted to UTF-8, while the Signed-off-by
    line will still be in ISO-8859.
  - If a patch is in UTF-8, and contains a non-ASCII author name in the
    body (not header) From: line, it is assumed to be encoded in PERL's
    internal character encoding, and converted to UTF-8 incorrectly,
    while the Signed-off-by line will be in real UTF-8.

Fix this by only doing the encode step if the From: line used UTF-8
quoted printable encoding.

Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
---
Fixes: bc76e3a125b44379 ("checkpatch: warn if missing author Signed-off-by")
in -next

To be folded into "checkpatch: Warn if missing author Signed-off-by" in
Andrew's tree.

v2:
  - Add parentheses.
---
 scripts/checkpatch.pl | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index 3d01fee203c4775d..017253b1df513bcb 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -2523,7 +2523,8 @@ sub process {
 
 # Check the patch for a From:
 		if (decode("MIME-Header", $line) =~ /^From:\s*(.*)/) {
-			$author = encode("utf8", $1);
+			$author = $1;
+			$author = encode("utf8", $author) if ($line =~ /=\?utf-8\?/i);
 			$author =~ s/"//g;
 		}
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] checkpatch: Only encode UTF-8 quoted printable mail headers
  2018-07-18 14:52 [PATCH v2] checkpatch: Only encode UTF-8 quoted printable mail headers Geert Uytterhoeven
@ 2018-07-18 21:42 ` Andrew Morton
  2018-07-18 21:55   ` Joe Perches
  2018-07-19 14:50 ` Arnd Bergmann
  1 sibling, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2018-07-18 21:42 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Andy Whitcroft, Joe Perches, Stephen Rothwell, linux-kernel

On Wed, 18 Jul 2018 16:52:54 +0200 Geert Uytterhoeven <geert+renesas@glider.be> wrote:

> As PERL uses its own internal character encoding, always calling
> encode("utf8", ...) on the author name may cause corruption, leading to
> an author signoff mismatch.
> 
> This happens in the following cases:
>   - If a patch is in ISO-8859, and contains a non-ASCII author name in
>     the From: line, it is converted to UTF-8, while the Signed-off-by
>     line will still be in ISO-8859.
>   - If a patch is in UTF-8, and contains a non-ASCII author name in the
>     body (not header) From: line, it is assumed to be encoded in PERL's
>     internal character encoding, and converted to UTF-8 incorrectly,
>     while the Signed-off-by line will be in real UTF-8.
> 
> Fix this by only doing the encode step if the From: line used UTF-8
> quoted printable encoding.

Works for me, thanks.


Relatedly, would it be worth adding a checkpatch warning if a patch
contains anything other than ASCII or UTF-8?

I added this to my little local patch-checking script.

	if ! file $p | grep -q -P "ASCII text|Unicode text"
	then
		echo $p: weird charset
	fi


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] checkpatch: Only encode UTF-8 quoted printable mail headers
  2018-07-18 21:42 ` Andrew Morton
@ 2018-07-18 21:55   ` Joe Perches
  0 siblings, 0 replies; 7+ messages in thread
From: Joe Perches @ 2018-07-18 21:55 UTC (permalink / raw)
  To: Andrew Morton, Geert Uytterhoeven
  Cc: Andy Whitcroft, Stephen Rothwell, linux-kernel

On Wed, 2018-07-18 at 14:42 -0700, Andrew Morton wrote:
> On Wed, 18 Jul 2018 16:52:54 +0200 Geert Uytterhoeven <geert+renesas@glider.be> wrote:
> 
> > As PERL uses its own internal character encoding, always calling
> > encode("utf8", ...) on the author name may cause corruption, leading to
> > an author signoff mismatch.
> > 
> > This happens in the following cases:
> >   - If a patch is in ISO-8859, and contains a non-ASCII author name in
> >     the From: line, it is converted to UTF-8, while the Signed-off-by
> >     line will still be in ISO-8859.
> >   - If a patch is in UTF-8, and contains a non-ASCII author name in the
> >     body (not header) From: line, it is assumed to be encoded in PERL's
> >     internal character encoding, and converted to UTF-8 incorrectly,
> >     while the Signed-off-by line will be in real UTF-8.
> > 
> > Fix this by only doing the encode step if the From: line used UTF-8
> > quoted printable encoding.
> 
> Works for me, thanks.

Me too so far, but I've more testing I'd like to do.

> Relatedly, would it be worth adding a checkpatch warning if a patch
> contains anything other than ASCII or UTF-8?
> 
> I added this to my little local patch-checking script.
> 
> 	if ! file $p | grep -q -P "ASCII text|Unicode text"
> 	then
> 		echo $p: weird charset
> 	fi

Might be hard to be effective.

For instance, the lkml mail I've kept so far this year
has a mixture of ascii/utf-8/iso-8859/windows-1252 and
some others with a few different encodings used too.

$ grep -Poh "\bcharset=\S+" ~/.local/share/evolution/mail/local/.MailingLists.Linux-Kernel/cur/*|cut -f3- -d:|sort|uniq -c|sort -rn
    821 charset=us-ascii
    469 charset="UTF-8"
    394 charset="ISO-8859-1"
    252 charset=US-ASCII
    221 charset=utf-8
    118 charset=utf-8;
     97 charset="utf-8"
     66 charset=UTF-8
     60 charset="us-ascii"
     33 charset=ISO-8859-15
     24 charset=iso-8859-1
     18 charset=US-ASCII;
     11 charset=us-ascii;
      7 charset=windows-1252;
      7 charset="utf-8";
      6 charset="UTF-8";
      5 charset=windows-1252
      5 charset="iso-8859-1"
      4 charset="windows-1252"
      3 charset=UTF-8;
      3 charset="US-ASCII"
      2 charset="iso-2022-jp"
      2 charset=gbk;
      2 charset="gb2312"
      1 charset="utf-7"
      1 charset="iso-8859-15"
      1 charset=ISO-8859-1
      1 charset="gbk";

And

$ grep "^Content-Transfer-Encoding:" ~/.local/share/evolution/mail/local/.MailingLists.Linux-Kernel/cur/*|cut -f3- -d:|sort|uniq -c|sort -rn
    873 Content-Transfer-Encoding: 7bit
    212 Content-Transfer-Encoding: 8bit
     97 Content-Transfer-Encoding: quoted-printable
     63 Content-Transfer-Encoding: base64
     56 Content-Transfer-Encoding: 8BIT
     24 Content-Transfer-Encoding: 7Bit
      3 Content-Transfer-Encoding: 7BIT
      2 Content-Transfer-Encoding: QUOTED-PRINTABLE


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] checkpatch: Only encode UTF-8 quoted printable mail headers
  2018-07-18 14:52 [PATCH v2] checkpatch: Only encode UTF-8 quoted printable mail headers Geert Uytterhoeven
  2018-07-18 21:42 ` Andrew Morton
@ 2018-07-19 14:50 ` Arnd Bergmann
  2018-07-19 15:03   ` Geert Uytterhoeven
  1 sibling, 1 reply; 7+ messages in thread
From: Arnd Bergmann @ 2018-07-19 14:50 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Andy Whitcroft, Joe Perches, Andrew Morton, Stephen Rothwell,
	Linux Kernel Mailing List, Martin Schwidefsky

On Wed, Jul 18, 2018 at 4:52 PM, Geert Uytterhoeven
<geert+renesas@glider.be> wrote:
> As PERL uses its own internal character encoding, always calling
> encode("utf8", ...) on the author name may cause corruption, leading to
> an author signoff mismatch.
>
> This happens in the following cases:
>   - If a patch is in ISO-8859, and contains a non-ASCII author name in
>     the From: line, it is converted to UTF-8, while the Signed-off-by
>     line will still be in ISO-8859.
>   - If a patch is in UTF-8, and contains a non-ASCII author name in the
>     body (not header) From: line, it is assumed to be encoded in PERL's
>     internal character encoding, and converted to UTF-8 incorrectly,
>     while the Signed-off-by line will be in real UTF-8.
>
> Fix this by only doing the encode step if the From: line used UTF-8
> quoted printable encoding.
>
> Reported-by: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
> ---
> Fixes: bc76e3a125b44379 ("checkpatch: warn if missing author Signed-off-by")
> in -next
>
> To be folded into "checkpatch: Warn if missing author Signed-off-by" in
> Andrew's tree.

On a related note, I've looked through all files in the kernel, and found
that very file files in there are something other than 7-bit ASCII, UTF-8
or non-text files (according to /usr/bin/file). These are the only ones I found:

Documentation/devicetree/bindings/net/nfc/pn544.txt: ISO-8859 text
arch/arm/boot/dts/sun4i-a10-inet97fv2.dts:           C source, ISO-8859 text
arch/arm/crypto/sha256_glue.c:                       C source, ISO-8859 text
arch/arm/crypto/sha256_neon_glue.c:                  C source, ISO-8859 text
arch/m68k/hp300/hp300map.map:                        ISO-8859 text
arch/s390/kernel/ebcdic.c:                           C source, Non-ISO
extended-ASCII text
drivers/crypto/vmx/ghashp8-ppc.pl:                   a /usr/bin/env
perl script, ISO-8859 text executable
drivers/iio/dac/ltc2632.c:                           C source, ISO-8859 text
drivers/power/reset/ltc2952-poweroff.c:              C source, ISO-8859 text
drivers/staging/rtl8188eu/include/odm.h:             C source, ISO-8859 text
drivers/tty/vt/defkeymap.map:                        ISO-8859 text
kernel/events/callchain.c:                           C source, ISO-8859 text
lib/fonts/font_7x14.c:                               data
lib/fonts/font_8x16.c:                               data
lib/fonts/font_8x8.c:                                data
lib/fonts/font_pearl_8x8.c:                          data
net/netfilter/ipvs/Kconfig:                          ISO-8859 text
net/netfilter/ipvs/ip_vs_mh.c:                       C source, ISO-8859 text
tools/power/cpupower/po/de.po:                       GNU gettext
message catalogue, ISO-8859 text
tools/power/cpupower/po/fr.po:                       GNU gettext
message catalogue, ISO-8859 text

Almost all of those can be trivially converted using 'recode ISO-8859-1..UTF-8',
which we should probably do. The four font files contain comments for each
of the 256 characters, so that recode turns e.g. the <FF> character
into <U+00FF>,
which is probably still what we want here.

The one exception seems to be arch/s390/kernel/ebcdic.c, which apparently
uses 0x81 bytes as an excape before characters ISO-8859-1 characters with
the high bit set. I don't know what that encoding is called, but I managed
to manually convert it into something useful.

       Arnd

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] checkpatch: Only encode UTF-8 quoted printable mail headers
  2018-07-19 14:50 ` Arnd Bergmann
@ 2018-07-19 15:03   ` Geert Uytterhoeven
  2018-07-19 17:38     ` Joe Perches
  0 siblings, 1 reply; 7+ messages in thread
From: Geert Uytterhoeven @ 2018-07-19 15:03 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Geert Uytterhoeven, Andy Whitcroft, Joe Perches, Andrew Morton,
	Stephen Rothwell, Linux Kernel Mailing List, Martin Schwidefsky

Hi Arnd,

On Thu, Jul 19, 2018 at 4:50 PM Arnd Bergmann <arnd@arndb.de> wrote:
> On a related note, I've looked through all files in the kernel, and found
> that very file files in there are something other than 7-bit ASCII, UTF-8
> or non-text files (according to /usr/bin/file). These are the only ones I found:
>
> Documentation/devicetree/bindings/net/nfc/pn544.txt: ISO-8859 text
> arch/arm/boot/dts/sun4i-a10-inet97fv2.dts:           C source, ISO-8859 text
> arch/arm/crypto/sha256_glue.c:                       C source, ISO-8859 text
> arch/arm/crypto/sha256_neon_glue.c:                  C source, ISO-8859 text
> arch/m68k/hp300/hp300map.map:                        ISO-8859 text
> arch/s390/kernel/ebcdic.c:                           C source, Non-ISO
> extended-ASCII text
> drivers/crypto/vmx/ghashp8-ppc.pl:                   a /usr/bin/env
> perl script, ISO-8859 text executable
> drivers/iio/dac/ltc2632.c:                           C source, ISO-8859 text
> drivers/power/reset/ltc2952-poweroff.c:              C source, ISO-8859 text
> drivers/staging/rtl8188eu/include/odm.h:             C source, ISO-8859 text
> drivers/tty/vt/defkeymap.map:                        ISO-8859 text
> kernel/events/callchain.c:                           C source, ISO-8859 text
> lib/fonts/font_7x14.c:                               data
> lib/fonts/font_8x16.c:                               data
> lib/fonts/font_8x8.c:                                data
> lib/fonts/font_pearl_8x8.c:                          data
> net/netfilter/ipvs/Kconfig:                          ISO-8859 text
> net/netfilter/ipvs/ip_vs_mh.c:                       C source, ISO-8859 text
> tools/power/cpupower/po/de.po:                       GNU gettext
> message catalogue, ISO-8859 text
> tools/power/cpupower/po/fr.po:                       GNU gettext
> message catalogue, ISO-8859 text
>
> Almost all of those can be trivially converted using 'recode ISO-8859-1..UTF-8',
> which we should probably do. The four font files contain comments for each
> of the 256 characters, so that recode turns e.g. the <FF> character
> into <U+00FF>,
> which is probably still what we want here.
>
> The one exception seems to be arch/s390/kernel/ebcdic.c, which apparently
> uses 0x81 bytes as an excape before characters ISO-8859-1 characters with
> the high bit set. I don't know what that encoding is called, but I managed
> to manually convert it into something useful.

Yes, we should convert everything to UTF-8.

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] checkpatch: Only encode UTF-8 quoted printable mail headers
  2018-07-19 15:03   ` Geert Uytterhoeven
@ 2018-07-19 17:38     ` Joe Perches
  2018-07-24 11:19       ` Arnd Bergmann
  0 siblings, 1 reply; 7+ messages in thread
From: Joe Perches @ 2018-07-19 17:38 UTC (permalink / raw)
  To: Geert Uytterhoeven, Arnd Bergmann
  Cc: Geert Uytterhoeven, Andy Whitcroft, Andrew Morton,
	Stephen Rothwell, Linux Kernel Mailing List, Martin Schwidefsky

On Thu, 2018-07-19 at 17:03 +0200, Geert Uytterhoeven wrote:
> Hi Arnd,
> 
> On Thu, Jul 19, 2018 at 4:50 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > On a related note, I've looked through all files in the kernel, and found
> > that very file files in there are something other than 7-bit ASCII, UTF-8
> > or non-text files (according to /usr/bin/file). These are the only ones I found:
> > 
> > Documentation/devicetree/bindings/net/nfc/pn544.txt: ISO-8859 text
> > arch/arm/boot/dts/sun4i-a10-inet97fv2.dts:           C source, ISO-8859 text
> > arch/arm/crypto/sha256_glue.c:                       C source, ISO-8859 text
> > arch/arm/crypto/sha256_neon_glue.c:                  C source, ISO-8859 text
> > arch/m68k/hp300/hp300map.map:                        ISO-8859 text
> > arch/s390/kernel/ebcdic.c:                           C source, Non-ISO
> > extended-ASCII text
> > drivers/crypto/vmx/ghashp8-ppc.pl:                   a /usr/bin/env
> > perl script, ISO-8859 text executable
> > drivers/iio/dac/ltc2632.c:                           C source, ISO-8859 text
> > drivers/power/reset/ltc2952-poweroff.c:              C source, ISO-8859 text
> > drivers/staging/rtl8188eu/include/odm.h:             C source, ISO-8859 text
> > drivers/tty/vt/defkeymap.map:                        ISO-8859 text
> > kernel/events/callchain.c:                           C source, ISO-8859 text
> > lib/fonts/font_7x14.c:                               data
> > lib/fonts/font_8x16.c:                               data
> > lib/fonts/font_8x8.c:                                data
> > lib/fonts/font_pearl_8x8.c:                          data
> > net/netfilter/ipvs/Kconfig:                          ISO-8859 text
> > net/netfilter/ipvs/ip_vs_mh.c:                       C source, ISO-8859 text
> > tools/power/cpupower/po/de.po:                       GNU gettext
> > message catalogue, ISO-8859 text
> > tools/power/cpupower/po/fr.po:                       GNU gettext
> > message catalogue, ISO-8859 text
> > 
> > Almost all of those can be trivially converted using 'recode ISO-8859-1..UTF-8',
> > which we should probably do. The four font files contain comments for each
> > of the 256 characters, so that recode turns e.g. the <FF> character
> > into <U+00FF>,
> > which is probably still what we want here.
> > 
> > The one exception seems to be arch/s390/kernel/ebcdic.c, which apparently
> > uses 0x81 bytes as an excape before characters ISO-8859-1 characters with
> > the high bit set. I don't know what that encoding is called, but I managed
> > to manually convert it into something useful.
> 
> Yes, we should convert everything to UTF-8.

Thanks.

Can you send a patch or a script for Linus to apply?


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] checkpatch: Only encode UTF-8 quoted printable mail headers
  2018-07-19 17:38     ` Joe Perches
@ 2018-07-24 11:19       ` Arnd Bergmann
  0 siblings, 0 replies; 7+ messages in thread
From: Arnd Bergmann @ 2018-07-24 11:19 UTC (permalink / raw)
  To: Joe Perches
  Cc: Geert Uytterhoeven, Geert Uytterhoeven, Andy Whitcroft,
	Andrew Morton, Stephen Rothwell, Linux Kernel Mailing List,
	Martin Schwidefsky

On Thu, Jul 19, 2018 at 7:38 PM, Joe Perches <joe@perches.com> wrote:
> On Thu, 2018-07-19 at 17:03 +0200, Geert Uytterhoeven wrote:

>> >
>> > The one exception seems to be arch/s390/kernel/ebcdic.c, which apparently
>> > uses 0x81 bytes as an excape before characters ISO-8859-1 characters with
>> > the high bit set. I don't know what that encoding is called, but I managed
>> > to manually convert it into something useful.
>>
>> Yes, we should convert everything to UTF-8.
>
> Thanks.
>
> Can you send a patch or a script for Linus to apply?

I sent a series of four patches now, one for each type of change I did.

      Arnd

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2018-07-24 11:19 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-18 14:52 [PATCH v2] checkpatch: Only encode UTF-8 quoted printable mail headers Geert Uytterhoeven
2018-07-18 21:42 ` Andrew Morton
2018-07-18 21:55   ` Joe Perches
2018-07-19 14:50 ` Arnd Bergmann
2018-07-19 15:03   ` Geert Uytterhoeven
2018-07-19 17:38     ` Joe Perches
2018-07-24 11:19       ` Arnd Bergmann

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).