* [PATCH v2] checkpatch: Only encode UTF-8 quoted printable mail headers @ 2018-07-18 14:52 Geert Uytterhoeven 2018-07-18 21:42 ` Andrew Morton 2018-07-19 14:50 ` Arnd Bergmann 0 siblings, 2 replies; 7+ messages in thread From: Geert Uytterhoeven @ 2018-07-18 14:52 UTC (permalink / raw) To: Andy Whitcroft, Joe Perches, Andrew Morton Cc: Stephen Rothwell, linux-kernel, Geert Uytterhoeven As PERL uses its own internal character encoding, always calling encode("utf8", ...) on the author name may cause corruption, leading to an author signoff mismatch. This happens in the following cases: - If a patch is in ISO-8859, and contains a non-ASCII author name in the From: line, it is converted to UTF-8, while the Signed-off-by line will still be in ISO-8859. - If a patch is in UTF-8, and contains a non-ASCII author name in the body (not header) From: line, it is assumed to be encoded in PERL's internal character encoding, and converted to UTF-8 incorrectly, while the Signed-off-by line will be in real UTF-8. Fix this by only doing the encode step if the From: line used UTF-8 quoted printable encoding. Reported-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> --- Fixes: bc76e3a125b44379 ("checkpatch: warn if missing author Signed-off-by") in -next To be folded into "checkpatch: Warn if missing author Signed-off-by" in Andrew's tree. v2: - Add parentheses. --- scripts/checkpatch.pl | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index 3d01fee203c4775d..017253b1df513bcb 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -2523,7 +2523,8 @@ sub process { # Check the patch for a From: if (decode("MIME-Header", $line) =~ /^From:\s*(.*)/) { - $author = encode("utf8", $1); + $author = $1; + $author = encode("utf8", $author) if ($line =~ /=\?utf-8\?/i); $author =~ s/"//g; } -- 2.17.1 ^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH v2] checkpatch: Only encode UTF-8 quoted printable mail headers 2018-07-18 14:52 [PATCH v2] checkpatch: Only encode UTF-8 quoted printable mail headers Geert Uytterhoeven @ 2018-07-18 21:42 ` Andrew Morton 2018-07-18 21:55 ` Joe Perches 2018-07-19 14:50 ` Arnd Bergmann 1 sibling, 1 reply; 7+ messages in thread From: Andrew Morton @ 2018-07-18 21:42 UTC (permalink / raw) To: Geert Uytterhoeven Cc: Andy Whitcroft, Joe Perches, Stephen Rothwell, linux-kernel On Wed, 18 Jul 2018 16:52:54 +0200 Geert Uytterhoeven <geert+renesas@glider.be> wrote: > As PERL uses its own internal character encoding, always calling > encode("utf8", ...) on the author name may cause corruption, leading to > an author signoff mismatch. > > This happens in the following cases: > - If a patch is in ISO-8859, and contains a non-ASCII author name in > the From: line, it is converted to UTF-8, while the Signed-off-by > line will still be in ISO-8859. > - If a patch is in UTF-8, and contains a non-ASCII author name in the > body (not header) From: line, it is assumed to be encoded in PERL's > internal character encoding, and converted to UTF-8 incorrectly, > while the Signed-off-by line will be in real UTF-8. > > Fix this by only doing the encode step if the From: line used UTF-8 > quoted printable encoding. Works for me, thanks. Relatedly, would it be worth adding a checkpatch warning if a patch contains anything other than ASCII or UTF-8? I added this to my little local patch-checking script. if ! file $p | grep -q -P "ASCII text|Unicode text" then echo $p: weird charset fi ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v2] checkpatch: Only encode UTF-8 quoted printable mail headers 2018-07-18 21:42 ` Andrew Morton @ 2018-07-18 21:55 ` Joe Perches 0 siblings, 0 replies; 7+ messages in thread From: Joe Perches @ 2018-07-18 21:55 UTC (permalink / raw) To: Andrew Morton, Geert Uytterhoeven Cc: Andy Whitcroft, Stephen Rothwell, linux-kernel On Wed, 2018-07-18 at 14:42 -0700, Andrew Morton wrote: > On Wed, 18 Jul 2018 16:52:54 +0200 Geert Uytterhoeven <geert+renesas@glider.be> wrote: > > > As PERL uses its own internal character encoding, always calling > > encode("utf8", ...) on the author name may cause corruption, leading to > > an author signoff mismatch. > > > > This happens in the following cases: > > - If a patch is in ISO-8859, and contains a non-ASCII author name in > > the From: line, it is converted to UTF-8, while the Signed-off-by > > line will still be in ISO-8859. > > - If a patch is in UTF-8, and contains a non-ASCII author name in the > > body (not header) From: line, it is assumed to be encoded in PERL's > > internal character encoding, and converted to UTF-8 incorrectly, > > while the Signed-off-by line will be in real UTF-8. > > > > Fix this by only doing the encode step if the From: line used UTF-8 > > quoted printable encoding. > > Works for me, thanks. Me too so far, but I've more testing I'd like to do. > Relatedly, would it be worth adding a checkpatch warning if a patch > contains anything other than ASCII or UTF-8? > > I added this to my little local patch-checking script. > > if ! file $p | grep -q -P "ASCII text|Unicode text" > then > echo $p: weird charset > fi Might be hard to be effective. For instance, the lkml mail I've kept so far this year has a mixture of ascii/utf-8/iso-8859/windows-1252 and some others with a few different encodings used too. $ grep -Poh "\bcharset=\S+" ~/.local/share/evolution/mail/local/.MailingLists.Linux-Kernel/cur/*|cut -f3- -d:|sort|uniq -c|sort -rn 821 charset=us-ascii 469 charset="UTF-8" 394 charset="ISO-8859-1" 252 charset=US-ASCII 221 charset=utf-8 118 charset=utf-8; 97 charset="utf-8" 66 charset=UTF-8 60 charset="us-ascii" 33 charset=ISO-8859-15 24 charset=iso-8859-1 18 charset=US-ASCII; 11 charset=us-ascii; 7 charset=windows-1252; 7 charset="utf-8"; 6 charset="UTF-8"; 5 charset=windows-1252 5 charset="iso-8859-1" 4 charset="windows-1252" 3 charset=UTF-8; 3 charset="US-ASCII" 2 charset="iso-2022-jp" 2 charset=gbk; 2 charset="gb2312" 1 charset="utf-7" 1 charset="iso-8859-15" 1 charset=ISO-8859-1 1 charset="gbk"; And $ grep "^Content-Transfer-Encoding:" ~/.local/share/evolution/mail/local/.MailingLists.Linux-Kernel/cur/*|cut -f3- -d:|sort|uniq -c|sort -rn 873 Content-Transfer-Encoding: 7bit 212 Content-Transfer-Encoding: 8bit 97 Content-Transfer-Encoding: quoted-printable 63 Content-Transfer-Encoding: base64 56 Content-Transfer-Encoding: 8BIT 24 Content-Transfer-Encoding: 7Bit 3 Content-Transfer-Encoding: 7BIT 2 Content-Transfer-Encoding: QUOTED-PRINTABLE ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v2] checkpatch: Only encode UTF-8 quoted printable mail headers 2018-07-18 14:52 [PATCH v2] checkpatch: Only encode UTF-8 quoted printable mail headers Geert Uytterhoeven 2018-07-18 21:42 ` Andrew Morton @ 2018-07-19 14:50 ` Arnd Bergmann 2018-07-19 15:03 ` Geert Uytterhoeven 1 sibling, 1 reply; 7+ messages in thread From: Arnd Bergmann @ 2018-07-19 14:50 UTC (permalink / raw) To: Geert Uytterhoeven Cc: Andy Whitcroft, Joe Perches, Andrew Morton, Stephen Rothwell, Linux Kernel Mailing List, Martin Schwidefsky On Wed, Jul 18, 2018 at 4:52 PM, Geert Uytterhoeven <geert+renesas@glider.be> wrote: > As PERL uses its own internal character encoding, always calling > encode("utf8", ...) on the author name may cause corruption, leading to > an author signoff mismatch. > > This happens in the following cases: > - If a patch is in ISO-8859, and contains a non-ASCII author name in > the From: line, it is converted to UTF-8, while the Signed-off-by > line will still be in ISO-8859. > - If a patch is in UTF-8, and contains a non-ASCII author name in the > body (not header) From: line, it is assumed to be encoded in PERL's > internal character encoding, and converted to UTF-8 incorrectly, > while the Signed-off-by line will be in real UTF-8. > > Fix this by only doing the encode step if the From: line used UTF-8 > quoted printable encoding. > > Reported-by: Andrew Morton <akpm@linux-foundation.org> > Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> > --- > Fixes: bc76e3a125b44379 ("checkpatch: warn if missing author Signed-off-by") > in -next > > To be folded into "checkpatch: Warn if missing author Signed-off-by" in > Andrew's tree. On a related note, I've looked through all files in the kernel, and found that very file files in there are something other than 7-bit ASCII, UTF-8 or non-text files (according to /usr/bin/file). These are the only ones I found: Documentation/devicetree/bindings/net/nfc/pn544.txt: ISO-8859 text arch/arm/boot/dts/sun4i-a10-inet97fv2.dts: C source, ISO-8859 text arch/arm/crypto/sha256_glue.c: C source, ISO-8859 text arch/arm/crypto/sha256_neon_glue.c: C source, ISO-8859 text arch/m68k/hp300/hp300map.map: ISO-8859 text arch/s390/kernel/ebcdic.c: C source, Non-ISO extended-ASCII text drivers/crypto/vmx/ghashp8-ppc.pl: a /usr/bin/env perl script, ISO-8859 text executable drivers/iio/dac/ltc2632.c: C source, ISO-8859 text drivers/power/reset/ltc2952-poweroff.c: C source, ISO-8859 text drivers/staging/rtl8188eu/include/odm.h: C source, ISO-8859 text drivers/tty/vt/defkeymap.map: ISO-8859 text kernel/events/callchain.c: C source, ISO-8859 text lib/fonts/font_7x14.c: data lib/fonts/font_8x16.c: data lib/fonts/font_8x8.c: data lib/fonts/font_pearl_8x8.c: data net/netfilter/ipvs/Kconfig: ISO-8859 text net/netfilter/ipvs/ip_vs_mh.c: C source, ISO-8859 text tools/power/cpupower/po/de.po: GNU gettext message catalogue, ISO-8859 text tools/power/cpupower/po/fr.po: GNU gettext message catalogue, ISO-8859 text Almost all of those can be trivially converted using 'recode ISO-8859-1..UTF-8', which we should probably do. The four font files contain comments for each of the 256 characters, so that recode turns e.g. the <FF> character into <U+00FF>, which is probably still what we want here. The one exception seems to be arch/s390/kernel/ebcdic.c, which apparently uses 0x81 bytes as an excape before characters ISO-8859-1 characters with the high bit set. I don't know what that encoding is called, but I managed to manually convert it into something useful. Arnd ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v2] checkpatch: Only encode UTF-8 quoted printable mail headers 2018-07-19 14:50 ` Arnd Bergmann @ 2018-07-19 15:03 ` Geert Uytterhoeven 2018-07-19 17:38 ` Joe Perches 0 siblings, 1 reply; 7+ messages in thread From: Geert Uytterhoeven @ 2018-07-19 15:03 UTC (permalink / raw) To: Arnd Bergmann Cc: Geert Uytterhoeven, Andy Whitcroft, Joe Perches, Andrew Morton, Stephen Rothwell, Linux Kernel Mailing List, Martin Schwidefsky Hi Arnd, On Thu, Jul 19, 2018 at 4:50 PM Arnd Bergmann <arnd@arndb.de> wrote: > On a related note, I've looked through all files in the kernel, and found > that very file files in there are something other than 7-bit ASCII, UTF-8 > or non-text files (according to /usr/bin/file). These are the only ones I found: > > Documentation/devicetree/bindings/net/nfc/pn544.txt: ISO-8859 text > arch/arm/boot/dts/sun4i-a10-inet97fv2.dts: C source, ISO-8859 text > arch/arm/crypto/sha256_glue.c: C source, ISO-8859 text > arch/arm/crypto/sha256_neon_glue.c: C source, ISO-8859 text > arch/m68k/hp300/hp300map.map: ISO-8859 text > arch/s390/kernel/ebcdic.c: C source, Non-ISO > extended-ASCII text > drivers/crypto/vmx/ghashp8-ppc.pl: a /usr/bin/env > perl script, ISO-8859 text executable > drivers/iio/dac/ltc2632.c: C source, ISO-8859 text > drivers/power/reset/ltc2952-poweroff.c: C source, ISO-8859 text > drivers/staging/rtl8188eu/include/odm.h: C source, ISO-8859 text > drivers/tty/vt/defkeymap.map: ISO-8859 text > kernel/events/callchain.c: C source, ISO-8859 text > lib/fonts/font_7x14.c: data > lib/fonts/font_8x16.c: data > lib/fonts/font_8x8.c: data > lib/fonts/font_pearl_8x8.c: data > net/netfilter/ipvs/Kconfig: ISO-8859 text > net/netfilter/ipvs/ip_vs_mh.c: C source, ISO-8859 text > tools/power/cpupower/po/de.po: GNU gettext > message catalogue, ISO-8859 text > tools/power/cpupower/po/fr.po: GNU gettext > message catalogue, ISO-8859 text > > Almost all of those can be trivially converted using 'recode ISO-8859-1..UTF-8', > which we should probably do. The four font files contain comments for each > of the 256 characters, so that recode turns e.g. the <FF> character > into <U+00FF>, > which is probably still what we want here. > > The one exception seems to be arch/s390/kernel/ebcdic.c, which apparently > uses 0x81 bytes as an excape before characters ISO-8859-1 characters with > the high bit set. I don't know what that encoding is called, but I managed > to manually convert it into something useful. Yes, we should convert everything to UTF-8. Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v2] checkpatch: Only encode UTF-8 quoted printable mail headers 2018-07-19 15:03 ` Geert Uytterhoeven @ 2018-07-19 17:38 ` Joe Perches 2018-07-24 11:19 ` Arnd Bergmann 0 siblings, 1 reply; 7+ messages in thread From: Joe Perches @ 2018-07-19 17:38 UTC (permalink / raw) To: Geert Uytterhoeven, Arnd Bergmann Cc: Geert Uytterhoeven, Andy Whitcroft, Andrew Morton, Stephen Rothwell, Linux Kernel Mailing List, Martin Schwidefsky On Thu, 2018-07-19 at 17:03 +0200, Geert Uytterhoeven wrote: > Hi Arnd, > > On Thu, Jul 19, 2018 at 4:50 PM Arnd Bergmann <arnd@arndb.de> wrote: > > On a related note, I've looked through all files in the kernel, and found > > that very file files in there are something other than 7-bit ASCII, UTF-8 > > or non-text files (according to /usr/bin/file). These are the only ones I found: > > > > Documentation/devicetree/bindings/net/nfc/pn544.txt: ISO-8859 text > > arch/arm/boot/dts/sun4i-a10-inet97fv2.dts: C source, ISO-8859 text > > arch/arm/crypto/sha256_glue.c: C source, ISO-8859 text > > arch/arm/crypto/sha256_neon_glue.c: C source, ISO-8859 text > > arch/m68k/hp300/hp300map.map: ISO-8859 text > > arch/s390/kernel/ebcdic.c: C source, Non-ISO > > extended-ASCII text > > drivers/crypto/vmx/ghashp8-ppc.pl: a /usr/bin/env > > perl script, ISO-8859 text executable > > drivers/iio/dac/ltc2632.c: C source, ISO-8859 text > > drivers/power/reset/ltc2952-poweroff.c: C source, ISO-8859 text > > drivers/staging/rtl8188eu/include/odm.h: C source, ISO-8859 text > > drivers/tty/vt/defkeymap.map: ISO-8859 text > > kernel/events/callchain.c: C source, ISO-8859 text > > lib/fonts/font_7x14.c: data > > lib/fonts/font_8x16.c: data > > lib/fonts/font_8x8.c: data > > lib/fonts/font_pearl_8x8.c: data > > net/netfilter/ipvs/Kconfig: ISO-8859 text > > net/netfilter/ipvs/ip_vs_mh.c: C source, ISO-8859 text > > tools/power/cpupower/po/de.po: GNU gettext > > message catalogue, ISO-8859 text > > tools/power/cpupower/po/fr.po: GNU gettext > > message catalogue, ISO-8859 text > > > > Almost all of those can be trivially converted using 'recode ISO-8859-1..UTF-8', > > which we should probably do. The four font files contain comments for each > > of the 256 characters, so that recode turns e.g. the <FF> character > > into <U+00FF>, > > which is probably still what we want here. > > > > The one exception seems to be arch/s390/kernel/ebcdic.c, which apparently > > uses 0x81 bytes as an excape before characters ISO-8859-1 characters with > > the high bit set. I don't know what that encoding is called, but I managed > > to manually convert it into something useful. > > Yes, we should convert everything to UTF-8. Thanks. Can you send a patch or a script for Linus to apply? ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v2] checkpatch: Only encode UTF-8 quoted printable mail headers 2018-07-19 17:38 ` Joe Perches @ 2018-07-24 11:19 ` Arnd Bergmann 0 siblings, 0 replies; 7+ messages in thread From: Arnd Bergmann @ 2018-07-24 11:19 UTC (permalink / raw) To: Joe Perches Cc: Geert Uytterhoeven, Geert Uytterhoeven, Andy Whitcroft, Andrew Morton, Stephen Rothwell, Linux Kernel Mailing List, Martin Schwidefsky On Thu, Jul 19, 2018 at 7:38 PM, Joe Perches <joe@perches.com> wrote: > On Thu, 2018-07-19 at 17:03 +0200, Geert Uytterhoeven wrote: >> > >> > The one exception seems to be arch/s390/kernel/ebcdic.c, which apparently >> > uses 0x81 bytes as an excape before characters ISO-8859-1 characters with >> > the high bit set. I don't know what that encoding is called, but I managed >> > to manually convert it into something useful. >> >> Yes, we should convert everything to UTF-8. > > Thanks. > > Can you send a patch or a script for Linus to apply? I sent a series of four patches now, one for each type of change I did. Arnd ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2018-07-24 11:19 UTC | newest] Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2018-07-18 14:52 [PATCH v2] checkpatch: Only encode UTF-8 quoted printable mail headers Geert Uytterhoeven 2018-07-18 21:42 ` Andrew Morton 2018-07-18 21:55 ` Joe Perches 2018-07-19 14:50 ` Arnd Bergmann 2018-07-19 15:03 ` Geert Uytterhoeven 2018-07-19 17:38 ` Joe Perches 2018-07-24 11:19 ` Arnd Bergmann
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).