tools.linux.kernel.org archive mirror
* utf-8 issues on b4 master
@ 2021-07-17 20:50 Michael S. Tsirkin
  2021-07-17 21:21 ` Kyle Meyer
  0 siblings, 1 reply; 8+ messages in thread
From: Michael S. Tsirkin @ 2021-07-17 20:50 UTC
  To: Konstantin Ryabitsev, tools, users

Passing message id
bbe52a89-c7ea-c155-6226-0397f223cd80@linux.alibaba.com to b4
gives this backtrace:

Traceback (most recent call last):
  File "/scm/b4/b4/command.py", line 263, in <module>
    cmd()
  File "/scm/b4/b4/command.py", line 246, in cmd
    cmdargs.func(cmdargs)
  File "/scm/b4/b4/command.py", line 41, in cmd_mbox
    b4.mbox.main(cmdargs)
  File "/scm/b4/b4/mbox.py", line 581, in main
    msgid, msgs = get_msgs(cmdargs)
  File "/scm/b4/b4/mbox.py", line 523, in get_msgs
    msgid = b4.get_msgid(cmdargs)
  File "/scm/b4/b4/__init__.py", line 2080, in get_msgid
    msgid = get_msgid_from_stdin()
  File "/scm/b4/b4/__init__.py", line 2072, in get_msgid_from_stdin
    message = email.message_from_string(sys.stdin.read())
  File "/usr/lib64/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd4 in position 5886: invalid continuation byte

mutt does not seem to have trouble decoding this ... weird.
-- 
MST
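
For readers unfamiliar with the final line of the traceback, here is a
minimal sketch of how that kind of error arises. The bytes below are made
up for illustration and are not the content of the reported message; the
only thing borrowed from the report is the offending byte value 0xd4. In
UTF-8, 0xd4 opens a two-byte sequence, so it must be followed by a
continuation byte. The sketch also shows the "surrogateescape" error
handler, which lets arbitrary bytes survive a decode/encode round trip.

```python
# Illustrative bytes only; NOT the actual message from the report.
data = b"Subject: test\n\n\xd4 body"

error = None
try:
    # Strict UTF-8 decoding rejects 0xd4 followed by a plain space.
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc
    print(exc)

# With surrogateescape, undecodable bytes become lone surrogates and
# can be turned back into the original bytes on encode.
text = data.decode("utf-8", errors="surrogateescape")
roundtrip = text.encode("utf-8", errors="surrogateescape")
```

Whether mutt avoids the problem by honouring the charset declared in the
message is an assumption; the thread does not say.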



* Re: utf-8 issues on b4 master
  2021-07-17 20:50 utf-8 issues on b4 master Michael S. Tsirkin
@ 2021-07-17 21:21 ` Kyle Meyer
  2021-07-18  1:39   ` Michael S. Tsirkin
  0 siblings, 1 reply; 8+ messages in thread
From: Kyle Meyer @ 2021-07-17 21:21 UTC
  To: Michael S. Tsirkin; +Cc: Konstantin Ryabitsev, tools, users

Michael S. Tsirkin writes:

> Passing message id
> bbe52a89-c7ea-c155-6226-0397f223cd80@linux.alibaba.com to b4
> gives this backtrace:
>
> Traceback (most recent call last):
>   File "/scm/b4/b4/command.py", line 263, in <module>
>     cmd()
>   File "/scm/b4/b4/command.py", line 246, in cmd
>     cmdargs.func(cmdargs)
>   File "/scm/b4/b4/command.py", line 41, in cmd_mbox
>     b4.mbox.main(cmdargs)
>   File "/scm/b4/b4/mbox.py", line 581, in main
>     msgid, msgs = get_msgs(cmdargs)
>   File "/scm/b4/b4/mbox.py", line 523, in get_msgs
>     msgid = b4.get_msgid(cmdargs)
>   File "/scm/b4/b4/__init__.py", line 2080, in get_msgid
>     msgid = get_msgid_from_stdin()
>   File "/scm/b4/b4/__init__.py", line 2072, in get_msgid_from_stdin
>     message = email.message_from_string(sys.stdin.read())
>   File "/usr/lib64/python3.9/codecs.py", line 322, in decode
>     (result, consumed) = self._buffer_decode(data, self.errors, final)
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd4 in position 5886: invalid continuation byte
>
> mutt does not seem to have trouble decoding this ... weird.

I'm confused by that backtrace.  I think get_msgid_from_stdin() should
be called only when a message is fed on stdin.  You say you're passing a
message ID.  That's as a positional argument, right?

Fwiw I wasn't able to trigger the issue on my end.

  $ b4 am bbe52a89-c7ea-c155-6226-0397f223cd80@linux.alibaba.com
  Looking up https://lore.kernel.org/r/bbe52a89-c7ea-c155-6226-0397f223cd80%40linux.alibaba.com
  Analyzing 5 messages in the thread
  ---
    [PATCH] virtio-balloon: Use virtio_find_vqs() helper
      + Reviewed-by: David Hildenbrand <david@redhat.com>
  ---
  Total patches: 1
  ---
   Link: https://lore.kernel.org/r/1626190724-7942-1-git-send-email-xianting_tian@126.com
   Base: not specified
         git am ./20210713_xianting_tian_virtio_balloon_use_virtio_find_vqs_helper.mbx
  
  $ b4 mbox bbe52a89-c7ea-c155-6226-0397f223cd80@linux.alibaba.com
  Looking up https://lore.kernel.org/r/bbe52a89-c7ea-c155-6226-0397f223cd80%40linux.alibaba.com
  5 messages in the thread
  Saved ./bbe52a89-c7ea-c155-6226-0397f223cd80@linux.alibaba.com.mbx

That's with

  Python 3.7.3
  b4 v0.7.0-32-g45ef591
  patatt v0.4.6


* Re: utf-8 issues on b4 master
  2021-07-17 21:21 ` Kyle Meyer
@ 2021-07-18  1:39   ` Michael S. Tsirkin
  2021-07-18  4:34     ` [PATCH b4 0/2] Avoid decoding errors when extracting message ID from stdin Kyle Meyer
  0 siblings, 1 reply; 8+ messages in thread
From: Michael S. Tsirkin @ 2021-07-18  1:39 UTC
  To: Kyle Meyer; +Cc: Konstantin Ryabitsev, tools, users

On Sat, Jul 17, 2021 at 05:21:30PM -0400, Kyle Meyer wrote:
> Michael S. Tsirkin writes:
> 
> > Passing message id
> > bbe52a89-c7ea-c155-6226-0397f223cd80@linux.alibaba.com to b4
> > gives this backtrace:
> >
> > Traceback (most recent call last):
> >   File "/scm/b4/b4/command.py", line 263, in <module>
> >     cmd()
> >   File "/scm/b4/b4/command.py", line 246, in cmd
> >     cmdargs.func(cmdargs)
> >   File "/scm/b4/b4/command.py", line 41, in cmd_mbox
> >     b4.mbox.main(cmdargs)
> >   File "/scm/b4/b4/mbox.py", line 581, in main
> >     msgid, msgs = get_msgs(cmdargs)
> >   File "/scm/b4/b4/mbox.py", line 523, in get_msgs
> >     msgid = b4.get_msgid(cmdargs)
> >   File "/scm/b4/b4/__init__.py", line 2080, in get_msgid
> >     msgid = get_msgid_from_stdin()
> >   File "/scm/b4/b4/__init__.py", line 2072, in get_msgid_from_stdin
> >     message = email.message_from_string(sys.stdin.read())
> >   File "/usr/lib64/python3.9/codecs.py", line 322, in decode
> >     (result, consumed) = self._buffer_decode(data, self.errors, final)
> > UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd4 in position 5886: invalid continuation byte
> >
> > mutt does not seem to have trouble decoding this ... weird.
> 
> I'm confused by that backtrace.  I think get_msgid_from_stdin() should
> be called only when a message is fed on stdin.  You say you're passing a
> message ID.  That's as a positional argument, right?

Sorry. I passed the message on the stdin. I supplied the
message ID so you can get the original from the list archives.

To reproduce:

wget -O - https://lore.kernel.org/lkml/bbe52a89-c7ea-c155-6226-0397f223cd80@linux.alibaba.com/raw | b4 mbox

> Fwiw I wasn't able to trigger the issue on my end.
> 
>   $ b4 am bbe52a89-c7ea-c155-6226-0397f223cd80@linux.alibaba.com
>   Looking up https://lore.kernel.org/r/bbe52a89-c7ea-c155-6226-0397f223cd80%40linux.alibaba.com
>   Analyzing 5 messages in the thread
>   ---
>     [PATCH] virtio-balloon: Use virtio_find_vqs() helper
>       + Reviewed-by: David Hildenbrand <david@redhat.com>
>   ---
>   Total patches: 1
>   ---
>    Link: https://lore.kernel.org/r/1626190724-7942-1-git-send-email-xianting_tian@126.com
>    Base: not specified
>          git am ./20210713_xianting_tian_virtio_balloon_use_virtio_find_vqs_helper.mbx
>   
>   $ b4 mbox bbe52a89-c7ea-c155-6226-0397f223cd80@linux.alibaba.com
>   Looking up https://lore.kernel.org/r/bbe52a89-c7ea-c155-6226-0397f223cd80%40linux.alibaba.com
>   5 messages in the thread
>   Saved ./bbe52a89-c7ea-c155-6226-0397f223cd80@linux.alibaba.com.mbx
> 
> That's with
> 
>   Python 3.7.3
>   b4 v0.7.0-32-g45ef591
>   patatt v0.4.6


b4 v0.7.0-32-g45ef591

python3-3.9.5-2.fc33.x86_64

I don't know about patatt.

-- 
MST



* [PATCH b4 0/2] Avoid decoding errors when extracting message ID from stdin
  2021-07-18  1:39   ` Michael S. Tsirkin
@ 2021-07-18  4:34     ` Kyle Meyer
  2021-07-18  4:34       ` [PATCH b4 1/2] " Kyle Meyer
                         ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Kyle Meyer @ 2021-07-18  4:34 UTC
  To: Michael S. Tsirkin; +Cc: Konstantin Ryabitsev, tools, users

Michael S. Tsirkin writes:

> On Sat, Jul 17, 2021 at 05:21:30PM -0400, Kyle Meyer wrote:
>> Michael S. Tsirkin writes:
>> 
>> > Passing message id
>> > bbe52a89-c7ea-c155-6226-0397f223cd80@linux.alibaba.com to b4
>> > gives this backtrace:
>> >
>> > Traceback (most recent call last):
>> > [....]
>> >   File "/scm/b4/b4/__init__.py", line 2072, in get_msgid_from_stdin
>> >     message = email.message_from_string(sys.stdin.read())
>> >   File "/usr/lib64/python3.9/codecs.py", line 322, in decode
>> >     (result, consumed) = self._buffer_decode(data, self.errors, final)
>> > UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd4 in position 5886: invalid continuation byte
>> >
>> > mutt does not seem to have trouble decoding this ... weird.
>> 
>> I'm confused by that backtrace.  I think get_msgid_from_stdin() should
>> be called only when a message is fed on stdin.  You say you're passing a
>> message ID.  That's as a positional argument, right?
>
> Sorry. I passed the message on the stdin. I supplied the
> message ID so you can get the original from the list archives.
>
> To reproduce:
>
> wget -O - https://lore.kernel.org/lkml/bbe52a89-c7ea-c155-6226-0397f223cd80@linux.alibaba.com/raw | b4 mbox

Thanks.  I can trigger that on my end too.  Here's a possible fix.
The first patch is the actual fix.  The second patch makes this code
path do a little less work but isn't necessary.

  [1/2] Avoid decoding errors when extracting message ID from stdin
  [2/2] Parse just headers when extracting message ID from stdin mbox

 b4/__init__.py | 4 +++-
 b4/pr.py       | 5 ++++-
 2 files changed, 7 insertions(+), 2 deletions(-)


base-commit: 06cc7c8820aea85d1329911b785d7bf4ecaacb1f
-- 
2.32.0



* [PATCH b4 1/2] Avoid decoding errors when extracting message ID from stdin
  2021-07-18  4:34     ` [PATCH b4 0/2] Avoid decoding errors when extracting message ID from stdin Kyle Meyer
@ 2021-07-18  4:34       ` Kyle Meyer
  2021-07-18  4:34       ` [PATCH b4 2/2] Parse just headers when extracting message ID from stdin mbox Kyle Meyer
  2021-08-03 16:03       ` [PATCH b4 0/2] Avoid decoding errors when extracting message ID from stdin Konstantin Ryabitsev
  2 siblings, 0 replies; 8+ messages in thread
From: Kyle Meyer @ 2021-07-18  4:34 UTC
  To: Michael S. Tsirkin; +Cc: Konstantin Ryabitsev, tools, users

The mbox, am, and pr subcommands accept an mbox on stdin and extract
the message ID.  When stdin.read() is called, Python assumes the
encoding is locale.getpreferredencoding(False).  This may not match
the content encoding, leading to a decoding error.

Instead feed the stdin bytes to message_from_bytes(), which leads to a
decode('ASCII', errors='surrogateescape') underneath.  That's
sufficient to get the message ID from the ASCII headers.

Reported-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Kyle Meyer <kyle@kyleam.com>
---

  Note: I've tested only `b4 am/mbox' with the reproducer message
  mentioned upthread; I haven't tested `b4 pr'.
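
  The difference described in the commit message above can be sketched
  outside of b4 roughly as follows. The message bytes and the
  example.invalid message ID are invented for illustration, and the
  TextIOWrapper only stands in for what sys.stdin would do under a UTF-8
  locale; none of this is b4 code.

```python
import email
import io

# Made-up message: ASCII headers, one byte in the body that is not UTF-8.
raw = (b"Message-ID: <example@example.invalid>\n"
       b"Subject: demo\n"
       b"\n"
       b"body with a stray byte: \xd4 \n")

# Stand-in for sys.stdin under a UTF-8 locale: text mode decodes first.
stdin_text = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8")
try:
    email.message_from_string(stdin_text.read())
    failed = False
except UnicodeDecodeError:
    failed = True

# Stand-in for sys.stdin.buffer.read(): hand the parser raw bytes.
msg = email.message_from_bytes(raw)
msgid = msg.get("Message-ID", None)
```

  The string path fails before the parser ever runs, while the bytes
  path still yields the message ID from the ASCII headers.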

 b4/__init__.py | 2 +-
 b4/pr.py       | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/b4/__init__.py b/b4/__init__.py
index 0e007be..5b32fb4 100644
--- a/b4/__init__.py
+++ b/b4/__init__.py
@@ -1948,7 +1948,7 @@ def get_requests_session():
 
 def get_msgid_from_stdin():
     if not sys.stdin.isatty():
-        message = email.message_from_string(sys.stdin.read())
+        message = email.message_from_bytes(sys.stdin.buffer.read())
         return message.get('Message-ID', None)
     return None
 
diff --git a/b4/pr.py b/b4/pr.py
index d8ff7f4..fbb2a71 100644
--- a/b4/pr.py
+++ b/b4/pr.py
@@ -433,7 +433,7 @@ def main(cmdargs):
 
     if not sys.stdin.isatty():
         logger.debug('Getting PR message from stdin')
-        msg = email.message_from_string(sys.stdin.read())
+        msg = email.message_from_bytes(sys.stdin.buffer.read())
         msgid = b4.LoreMessage.get_clean_msgid(msg)
         lmsg = parse_pr_data(msg)
     else:
-- 
2.32.0



* [PATCH b4 2/2] Parse just headers when extracting message ID from stdin mbox
  2021-07-18  4:34     ` [PATCH b4 0/2] Avoid decoding errors when extracting message ID from stdin Kyle Meyer
  2021-07-18  4:34       ` [PATCH b4 1/2] " Kyle Meyer
@ 2021-07-18  4:34       ` Kyle Meyer
  2021-07-18  4:45         ` Kyle Meyer
  2021-08-03 16:03       ` [PATCH b4 0/2] Avoid decoding errors when extracting message ID from stdin Konstantin Ryabitsev
  2 siblings, 1 reply; 8+ messages in thread
From: Kyle Meyer @ 2021-07-18  4:34 UTC
  To: Michael S. Tsirkin; +Cc: Konstantin Ryabitsev, tools, users

When the pr, mbox, and am subcommands grab a message ID from the mbox
on stdin, they call message_from_bytes(), which in turn calls
BytesParser().parsebytes(s).

parsebytes() has a headersonly parameter that can be used to tell it
to stop parsing after reading the headers.  The headers are all that's
needed here, so use BytesParser directly and set headersonly.

Signed-off-by: Kyle Meyer <kyle@kyleam.com>
---

  Note: I've tested only `b4 am/mbox' with the reproducer message
  mentioned upthread; I haven't tested `b4 pr'.
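
  As a standalone illustration of the headersonly parameter described in
  the commit message above (the sample bytes and message ID are made up;
  this is a sketch, not b4 code):

```python
from email.parser import BytesParser

# Made-up message; the body is never inspected in this sketch.
raw = (b"Message-ID: <example@example.invalid>\n"
       b"Subject: demo\n"
       b"\n"
       b"a body that does not need to be parsed\n")

# headersonly=True tells the parser to stop after the header block.
message = BytesParser().parsebytes(raw, headersonly=True)
msgid = message.get("Message-ID", None)
subject = message.get("Subject", None)
```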

 b4/__init__.py | 4 +++-
 b4/pr.py       | 5 ++++-
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/b4/__init__.py b/b4/__init__.py
index 5b32fb4..4722826 100644
--- a/b4/__init__.py
+++ b/b4/__init__.py
@@ -1948,7 +1948,9 @@ def get_requests_session():
 
 def get_msgid_from_stdin():
     if not sys.stdin.isatty():
-        message = email.message_from_bytes(sys.stdin.buffer.read())
+        from email.parser import BytesParser
+        message = BytesParser().parsebytes(
+            sys.stdin.buffer.read(), headersonly=True)
         return message.get('Message-ID', None)
     return None
 
diff --git a/b4/pr.py b/b4/pr.py
index fbb2a71..e52c2ab 100644
--- a/b4/pr.py
+++ b/b4/pr.py
@@ -433,7 +433,10 @@ def main(cmdargs):
 
     if not sys.stdin.isatty():
         logger.debug('Getting PR message from stdin')
-        msg = email.message_from_bytes(sys.stdin.buffer.read())
+        from email.parser import BytesHeaderParser
+        from email.parser import BytesHeaderParser
+        msg = BytesParser().parsebytes(
+            sys.stdin.buffer.read(), headersonly=True)
         msgid = b4.LoreMessage.get_clean_msgid(msg)
         lmsg = parse_pr_data(msg)
     else:
-- 
2.32.0



* Re: [PATCH b4 2/2] Parse just headers when extracting message ID from stdin mbox
  2021-07-18  4:34       ` [PATCH b4 2/2] Parse just headers when extracting message ID from stdin mbox Kyle Meyer
@ 2021-07-18  4:45         ` Kyle Meyer
  0 siblings, 0 replies; 8+ messages in thread
From: Kyle Meyer @ 2021-07-18  4:45 UTC
  To: Michael S. Tsirkin; +Cc: Konstantin Ryabitsev, tools, users

Kyle Meyer writes:

> diff --git a/b4/pr.py b/b4/pr.py
> index fbb2a71..e52c2ab 100644
> --- a/b4/pr.py
> +++ b/b4/pr.py
> @@ -433,7 +433,10 @@ def main(cmdargs):
>  
>      if not sys.stdin.isatty():
>          logger.debug('Getting PR message from stdin')
> -        msg = email.message_from_bytes(sys.stdin.buffer.read())
> +        from email.parser import BytesHeaderParser
> +        from email.parser import BytesHeaderParser

Doh, there's a repeated import here :/
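
Note that the quoted hunk imports BytesHeaderParser while the call a few
lines later in the patch uses BytesParser. Both classes live in
email.parser, and as far as the standard library goes either spelling
should give the same headers-only result. A small sketch with made-up
input, not taken from b4:

```python
from email.parser import BytesHeaderParser, BytesParser

# Made-up message for illustration.
raw = b"Message-ID: <example@example.invalid>\nSubject: demo\n\nbody\n"

# BytesHeaderParser defaults to headers-only parsing ...
via_header_parser = BytesHeaderParser().parsebytes(raw)
# ... which BytesParser does when headersonly=True is passed.
via_flag = BytesParser().parsebytes(raw, headersonly=True)
```

Which of the two the patch meant to import is a guess; the thread only
mentions the duplication.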


* Re: [PATCH b4 0/2] Avoid decoding errors when extracting message ID from stdin
  2021-07-18  4:34     ` [PATCH b4 0/2] Avoid decoding errors when extracting message ID from stdin Kyle Meyer
  2021-07-18  4:34       ` [PATCH b4 1/2] " Kyle Meyer
  2021-07-18  4:34       ` [PATCH b4 2/2] Parse just headers when extracting message ID from stdin mbox Kyle Meyer
@ 2021-08-03 16:03       ` Konstantin Ryabitsev
  2 siblings, 0 replies; 8+ messages in thread
From: Konstantin Ryabitsev @ 2021-08-03 16:03 UTC
  To: Kyle Meyer; +Cc: Michael S. Tsirkin, tools, users

On Sun, Jul 18, 2021 at 12:34:04AM -0400, Kyle Meyer wrote:
> Thanks.  I can trigger that on my end too.  Here's a possible fix.
> The first patch is the actual fix.  The second patch makes this code
> path do a little less work but isn't necessary.
> 
>   [1/2] Avoid decoding errors when extracting message ID from stdin
>   [2/2] Parse just headers when extracting message ID from stdin mbox

I've applied 1/2 and part of 2/2. In b4/pr.py we actually want to
parse the entire message, not just the headers.

Thanks!

-K
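
To illustrate in general terms what is lost with a headers-only parse,
here is a sketch using a made-up multipart message. It does not show what
parse_pr_data() in b4 actually needs from the body, which the thread does
not spell out; it only shows how the two parse results differ.

```python
import email
from email.parser import BytesParser

# Made-up multipart message for illustration.
raw = (b"Message-ID: <example@example.invalid>\n"
       b"Content-Type: multipart/mixed; boundary=XX\n"
       b"\n"
       b"--XX\n"
       b"Content-Type: text/plain\n"
       b"\n"
       b"part one\n"
       b"--XX--\n")

# Full parse: the body is split into MIME parts.
full = email.message_from_bytes(raw)
first_part = full.get_payload()[0]

# Headers-only parse: the body is kept as one unparsed string.
headers_only = BytesParser().parsebytes(raw, headersonly=True)
raw_body = headers_only.get_payload()
```

With the full parse, is_multipart() is true and the parts can be walked
individually; with headersonly=True the payload is a single string that
still contains the boundary markers.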


Thread overview: 8+ messages
2021-07-17 20:50 utf-8 issues on b4 master Michael S. Tsirkin
2021-07-17 21:21 ` Kyle Meyer
2021-07-18  1:39   ` Michael S. Tsirkin
2021-07-18  4:34     ` [PATCH b4 0/2] Avoid decoding errors when extracting message ID from stdin Kyle Meyer
2021-07-18  4:34       ` [PATCH b4 1/2] " Kyle Meyer
2021-07-18  4:34       ` [PATCH b4 2/2] Parse just headers when extracting message ID from stdin mbox Kyle Meyer
2021-07-18  4:45         ` Kyle Meyer
2021-08-03 16:03       ` [PATCH b4 0/2] Avoid decoding errors when extracting message ID from stdin Konstantin Ryabitsev