Linux maintainer tooling and workflows
* utf-8 issues on b4 master
@ 2021-07-17 20:50 Michael S. Tsirkin
  2021-07-17 21:21 ` Kyle Meyer
  0 siblings, 1 reply; 7+ messages in thread
From: Michael S. Tsirkin @ 2021-07-17 20:50 UTC
  To: Konstantin Ryabitsev, tools, users

Passing message id
bbe52a89-c7ea-c155-6226-0397f223cd80@linux.alibaba.com to b4
gives this backtrace:

Traceback (most recent call last):
  File "/scm/b4/b4/command.py", line 263, in <module>
    cmd()
  File "/scm/b4/b4/command.py", line 246, in cmd
    cmdargs.func(cmdargs)
  File "/scm/b4/b4/command.py", line 41, in cmd_mbox
    b4.mbox.main(cmdargs)
  File "/scm/b4/b4/mbox.py", line 581, in main
    msgid, msgs = get_msgs(cmdargs)
  File "/scm/b4/b4/mbox.py", line 523, in get_msgs
    msgid = b4.get_msgid(cmdargs)
  File "/scm/b4/b4/__init__.py", line 2080, in get_msgid
    msgid = get_msgid_from_stdin()
  File "/scm/b4/b4/__init__.py", line 2072, in get_msgid_from_stdin
    message = email.message_from_string(sys.stdin.read())
  File "/usr/lib64/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd4 in position 5886: invalid continuation byte

mutt does not seem to have trouble decoding this ... weird.
-- 
MST
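
For readers following along, here is a minimal sketch of the failure
mode in the traceback above. The bytes are made up for illustration
(the thread does not show the offending part of the message), and
decoding them as GBK is only an assumption about what a charset-aware
reader such as mutt might do; it is not taken from the actual message.

```python
# Hypothetical payload: 0xd4 opens a two-byte UTF-8 sequence, so the next
# byte would have to be a continuation byte (0x80-0xBF). 0xc2 is not one.
body = b"\xd4\xc2"
raw = b"Subject: test\n\n" + body + b"\n"

try:
    raw.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    # Same class of error as in the traceback: strict UTF-8 decoding fails.
    error = exc

print(error.reason, "at position", error.start)

# Assumption for illustration only: the same two bytes are a valid
# character in a legacy double-byte charset, so a reader that honours the
# declared charset would not complain.
as_gbk = body.decode("gbk")
print(len(as_gbk))
```

The point is only that the bytes themselves can be fine; what fails is
decoding them with a codec that was never declared for the message.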



* Re: utf-8 issues on b4 master
  2021-07-17 20:50 utf-8 issues on b4 master Michael S. Tsirkin
@ 2021-07-17 21:21 ` Kyle Meyer
  2021-07-18  1:39   ` Michael S. Tsirkin
  0 siblings, 1 reply; 7+ messages in thread
From: Kyle Meyer @ 2021-07-17 21:21 UTC
  To: Michael S. Tsirkin; +Cc: Konstantin Ryabitsev, tools, users

Michael S. Tsirkin writes:

> Passing message id
> bbe52a89-c7ea-c155-6226-0397f223cd80@linux.alibaba.com to b4
> gives this backtrace:
>
> Traceback (most recent call last):
>   File "/scm/b4/b4/command.py", line 263, in <module>
>     cmd()
>   File "/scm/b4/b4/command.py", line 246, in cmd
>     cmdargs.func(cmdargs)
>   File "/scm/b4/b4/command.py", line 41, in cmd_mbox
>     b4.mbox.main(cmdargs)
>   File "/scm/b4/b4/mbox.py", line 581, in main
>     msgid, msgs = get_msgs(cmdargs)
>   File "/scm/b4/b4/mbox.py", line 523, in get_msgs
>     msgid = b4.get_msgid(cmdargs)
>   File "/scm/b4/b4/__init__.py", line 2080, in get_msgid
>     msgid = get_msgid_from_stdin()
>   File "/scm/b4/b4/__init__.py", line 2072, in get_msgid_from_stdin
>     message = email.message_from_string(sys.stdin.read())
>   File "/usr/lib64/python3.9/codecs.py", line 322, in decode
>     (result, consumed) = self._buffer_decode(data, self.errors, final)
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd4 in position 5886: invalid continuation byte
>
> mutt does not seem to have trouble decoding this ... weird.

I'm confused by that backtrace.  I think get_msgid_from_stdin() should
be called only when a message is fed on stdin.  You say you're passing a
message ID.  That's as a positional argument, right?

Fwiw I wasn't able to trigger the issue on my end.

  $ b4 am bbe52a89-c7ea-c155-6226-0397f223cd80@linux.alibaba.com
  Looking up https://lore.kernel.org/r/bbe52a89-c7ea-c155-6226-0397f223cd80%40linux.alibaba.com
  Analyzing 5 messages in the thread
  ---
    [PATCH] virtio-balloon: Use virtio_find_vqs() helper
      + Reviewed-by: David Hildenbrand <david@redhat.com>
  ---
  Total patches: 1
  ---
   Link: https://lore.kernel.org/r/1626190724-7942-1-git-send-email-xianting_tian@126.com
   Base: not specified
         git am ./20210713_xianting_tian_virtio_balloon_use_virtio_find_vqs_helper.mbx
  
  $ b4 mbox bbe52a89-c7ea-c155-6226-0397f223cd80@linux.alibaba.com
  Looking up https://lore.kernel.org/r/bbe52a89-c7ea-c155-6226-0397f223cd80%40linux.alibaba.com
  5 messages in the thread
  Saved ./bbe52a89-c7ea-c155-6226-0397f223cd80@linux.alibaba.com.mbx

That's with

  Python 3.7.3
  b4 v0.7.0-32-g45ef591
  patatt v0.4.6


* Re: utf-8 issues on b4 master
  2021-07-17 21:21 ` Kyle Meyer
@ 2021-07-18  1:39   ` Michael S. Tsirkin
  2021-07-18  4:34     ` [PATCH b4 0/2] Avoid decoding errors when extracting message ID from stdin Kyle Meyer
  0 siblings, 1 reply; 7+ messages in thread
From: Michael S. Tsirkin @ 2021-07-18  1:39 UTC
  To: Kyle Meyer; +Cc: Konstantin Ryabitsev, tools, users

On Sat, Jul 17, 2021 at 05:21:30PM -0400, Kyle Meyer wrote:
> Michael S. Tsirkin writes:
> 
> > Passing message id
> > bbe52a89-c7ea-c155-6226-0397f223cd80@linux.alibaba.com to b4
> > gives this backtrace:
> >
> > Traceback (most recent call last):
> >   File "/scm/b4/b4/command.py", line 263, in <module>
> >     cmd()
> >   File "/scm/b4/b4/command.py", line 246, in cmd
> >     cmdargs.func(cmdargs)
> >   File "/scm/b4/b4/command.py", line 41, in cmd_mbox
> >     b4.mbox.main(cmdargs)
> >   File "/scm/b4/b4/mbox.py", line 581, in main
> >     msgid, msgs = get_msgs(cmdargs)
> >   File "/scm/b4/b4/mbox.py", line 523, in get_msgs
> >     msgid = b4.get_msgid(cmdargs)
> >   File "/scm/b4/b4/__init__.py", line 2080, in get_msgid
> >     msgid = get_msgid_from_stdin()
> >   File "/scm/b4/b4/__init__.py", line 2072, in get_msgid_from_stdin
> >     message = email.message_from_string(sys.stdin.read())
> >   File "/usr/lib64/python3.9/codecs.py", line 322, in decode
> >     (result, consumed) = self._buffer_decode(data, self.errors, final)
> > UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd4 in position 5886: invalid continuation byte
> >
> > mutt does not seem to have trouble decoding this ... weird.
> 
> I'm confused by that backtrace.  I think get_msgid_from_stdin() should
> be called only when a message is fed on stdin.  You say you're passing a
> message ID.  That's as a positional argument, right?

Sorry. I passed the message on the stdin. I supplied the
message ID so you can get the original from the list archives.

To reproduce:

wget -O - https://lore.kernel.org/lkml/bbe52a89-c7ea-c155-6226-0397f223cd80@linux.alibaba.com/raw | b4 mbox





> Fwiw I wasn't able to trigger the issue on my end.
> 
>   $ b4 am bbe52a89-c7ea-c155-6226-0397f223cd80@linux.alibaba.com
>   Looking up https://lore.kernel.org/r/bbe52a89-c7ea-c155-6226-0397f223cd80%40linux.alibaba.com
>   Analyzing 5 messages in the thread
>   ---
>     [PATCH] virtio-balloon: Use virtio_find_vqs() helper
>       + Reviewed-by: David Hildenbrand <david@redhat.com>
>   ---
>   Total patches: 1
>   ---
>    Link: https://lore.kernel.org/r/1626190724-7942-1-git-send-email-xianting_tian@126.com
>    Base: not specified
>          git am ./20210713_xianting_tian_virtio_balloon_use_virtio_find_vqs_helper.mbx
>   
>   $ b4 mbox bbe52a89-c7ea-c155-6226-0397f223cd80@linux.alibaba.com
>   Looking up https://lore.kernel.org/r/bbe52a89-c7ea-c155-6226-0397f223cd80%40linux.alibaba.com
>   5 messages in the thread
>   Saved ./bbe52a89-c7ea-c155-6226-0397f223cd80@linux.alibaba.com.mbx
> 
> That's with
> 
>   Python 3.7.3
>   b4 v0.7.0-32-g45ef591
>   patatt v0.4.6


b4 v0.7.0-32-g45ef591

python3-3.9.5-2.fc33.x86_64

I don't know about patatt.

-- 
MST



* [PATCH b4 0/2] Avoid decoding errors when extracting message ID from stdin
  2021-07-18  1:39   ` Michael S. Tsirkin
@ 2021-07-18  4:34     ` Kyle Meyer
  2021-07-18  4:34       ` [PATCH b4 1/2] " Kyle Meyer
  2021-07-18  4:34       ` [PATCH b4 2/2] Parse just headers when extracting message ID from stdin mbox Kyle Meyer
  0 siblings, 2 replies; 7+ messages in thread
From: Kyle Meyer @ 2021-07-18  4:34 UTC
  To: Michael S. Tsirkin; +Cc: Konstantin Ryabitsev, tools, users

Michael S. Tsirkin writes:

> On Sat, Jul 17, 2021 at 05:21:30PM -0400, Kyle Meyer wrote:
>> Michael S. Tsirkin writes:
>> 
>> > Passing message id
>> > bbe52a89-c7ea-c155-6226-0397f223cd80@linux.alibaba.com to b4
>> > gives this backtrace:
>> >
>> > Traceback (most recent call last):
>> > [....]
>> >   File "/scm/b4/b4/__init__.py", line 2072, in get_msgid_from_stdin
>> >     message = email.message_from_string(sys.stdin.read())
>> >   File "/usr/lib64/python3.9/codecs.py", line 322, in decode
>> >     (result, consumed) = self._buffer_decode(data, self.errors, final)
>> > UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd4 in position 5886: invalid continuation byte
>> >
>> > mutt does not seem to have trouble decoding this ... weird.
>> 
>> I'm confused by that backtrace.  I think get_msgid_from_stdin() should
>> be called only when a message is fed on stdin.  You say you're passing a
>> message ID.  That's as a positional argument, right?
>
> Sorry. I passed the message on the stdin. I supplied the
> message ID so you can get the original from the list archives.
>
> To reproduce:
>
> wget -O - https://lore.kernel.org/lkml/bbe52a89-c7ea-c155-6226-0397f223cd80@linux.alibaba.com/raw | b4 mbox

Thanks.  I can trigger that on my end too.  Here's a possible fix.
The first patch is the actual fix.  The second patch makes this code
path do a little less work but isn't necessary.

  [1/2] Avoid decoding errors when extracting message ID from stdin
  [2/2] Parse just headers when extracting message ID from stdin mbox

 b4/__init__.py | 4 +++-
 b4/pr.py       | 5 ++++-
 2 files changed, 7 insertions(+), 2 deletions(-)


base-commit: 06cc7c8820aea85d1329911b785d7bf4ecaacb1f
-- 
2.32.0



* [PATCH b4 1/2] Avoid decoding errors when extracting message ID from stdin
  2021-07-18  4:34     ` [PATCH b4 0/2] Avoid decoding errors when extracting message ID from stdin Kyle Meyer
@ 2021-07-18  4:34       ` Kyle Meyer
  2021-07-18  4:34       ` [PATCH b4 2/2] Parse just headers when extracting message ID from stdin mbox Kyle Meyer
  1 sibling, 0 replies; 7+ messages in thread
From: Kyle Meyer @ 2021-07-18  4:34 UTC
  To: Michael S. Tsirkin; +Cc: Konstantin Ryabitsev, tools, users

The mbox, am, and pr subcommands accept an mbox on stdin and extract
the message ID.  When stdin.read() is called, Python assumes the
encoding is locale.getpreferredencoding(False).  This may not match
the content encoding, leading to a decoding error.

Instead feed the stdin bytes to message_from_bytes(), which leads to a
decode('ASCII', errors='surrogateescape') underneath.  That's
sufficient to get the message ID from the ASCII headers.

Reported-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Kyle Meyer <kyle@kyleam.com>
---

  Note: I've tested only `b4 am' and `b4 mbox' with the reproducer
  message mentioned upthread; I haven't tested `b4 pr'.

 b4/__init__.py | 2 +-
 b4/pr.py       | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/b4/__init__.py b/b4/__init__.py
index 0e007be..5b32fb4 100644
--- a/b4/__init__.py
+++ b/b4/__init__.py
@@ -1948,7 +1948,7 @@ def get_requests_session():
 
 def get_msgid_from_stdin():
     if not sys.stdin.isatty():
-        message = email.message_from_string(sys.stdin.read())
+        message = email.message_from_bytes(sys.stdin.buffer.read())
         return message.get('Message-ID', None)
     return None
 
diff --git a/b4/pr.py b/b4/pr.py
index d8ff7f4..fbb2a71 100644
--- a/b4/pr.py
+++ b/b4/pr.py
@@ -433,7 +433,7 @@ def main(cmdargs):
 
     if not sys.stdin.isatty():
         logger.debug('Getting PR message from stdin')
-        msg = email.message_from_string(sys.stdin.read())
+        msg = email.message_from_bytes(sys.stdin.buffer.read())
         msgid = b4.LoreMessage.get_clean_msgid(msg)
         lmsg = parse_pr_data(msg)
     else:
-- 
2.32.0
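
The difference described in the commit message above can be sketched
outside of b4 as follows. This is an illustration under assumptions,
not b4 code: the message bytes and the Message-ID are invented, and
io.TextIOWrapper over io.BytesIO stands in for sys.stdin under a UTF-8
locale, with the raw BytesIO standing in for sys.stdin.buffer.

```python
import email
import io

# Invented message: ASCII headers, 8-bit body that is not valid UTF-8.
raw = (b"Message-ID: <example@example.invalid>\n"
       b"Subject: demo\n"
       b"Content-Type: text/plain; charset=gbk\n"
       b"Content-Transfer-Encoding: 8bit\n"
       b"\n"
       b"\xd4\xc2\n")

# Old code path: a text stream decodes the whole input with the locale
# encoding (assumed UTF-8 here) before the email parser ever sees it.
text_stdin = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8")
try:
    email.message_from_string(text_stdin.read())
    text_failed = False
except UnicodeDecodeError:
    text_failed = True

# New code path: hand the undecoded bytes to the email package.
bytes_stdin = io.BytesIO(raw)
msg = email.message_from_bytes(bytes_stdin.read())
msgid = msg.get("Message-ID", None)

print(text_failed, msgid)
```

Only the headers are needed for the message ID, so the non-ASCII body
never has to be decoded correctly for this lookup to succeed.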



* [PATCH b4 2/2] Parse just headers when extracting message ID from stdin mbox
  2021-07-18  4:34     ` [PATCH b4 0/2] Avoid decoding errors when extracting message ID from stdin Kyle Meyer
  2021-07-18  4:34       ` [PATCH b4 1/2] " Kyle Meyer
@ 2021-07-18  4:34       ` Kyle Meyer
  2021-07-18  4:45         ` Kyle Meyer
  1 sibling, 1 reply; 7+ messages in thread
From: Kyle Meyer @ 2021-07-18  4:34 UTC
  To: Michael S. Tsirkin; +Cc: Konstantin Ryabitsev, tools, users

When the pr, mbox, and am subcommands grab a message ID from the mbox
on stdin, they call message_from_bytes(), which in turn calls
BytesParser().parsebytes(s).

parsebytes() has a headersonly parameter that can be used to tell it
to stop parsing after reading the headers.  The headers are all that's
needed here, so use BytesParser directly and set headersonly.

Signed-off-by: Kyle Meyer <kyle@kyleam.com>
---

  Note: I've tested only `b4 am' and `b4 mbox' with the reproducer
  message mentioned upthread; I haven't tested `b4 pr'.

 b4/__init__.py | 4 +++-
 b4/pr.py       | 5 ++++-
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/b4/__init__.py b/b4/__init__.py
index 5b32fb4..4722826 100644
--- a/b4/__init__.py
+++ b/b4/__init__.py
@@ -1948,7 +1948,9 @@ def get_requests_session():
 
 def get_msgid_from_stdin():
     if not sys.stdin.isatty():
-        message = email.message_from_bytes(sys.stdin.buffer.read())
+        from email.parser import BytesParser
+        message = BytesParser().parsebytes(
+            sys.stdin.buffer.read(), headersonly=True)
         return message.get('Message-ID', None)
     return None
 
diff --git a/b4/pr.py b/b4/pr.py
index fbb2a71..e52c2ab 100644
--- a/b4/pr.py
+++ b/b4/pr.py
@@ -433,7 +433,10 @@ def main(cmdargs):
 
     if not sys.stdin.isatty():
         logger.debug('Getting PR message from stdin')
-        msg = email.message_from_bytes(sys.stdin.buffer.read())
+        from email.parser import BytesParser
+        from email.parser import BytesParser
+        msg = BytesParser().parsebytes(
+            sys.stdin.buffer.read(), headersonly=True)
         msgid = b4.LoreMessage.get_clean_msgid(msg)
         lmsg = parse_pr_data(msg)
     else:
-- 
2.32.0
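
A small sketch of what headersonly changes, as described in the commit
message above. The multipart message and its Message-ID are invented
for illustration; only BytesParser().parsebytes() and its headersonly
parameter come from the patch.

```python
from email.parser import BytesParser

# Invented two-level message: headers plus one MIME part.
raw = (b"Message-ID: <example@example.invalid>\n"
       b"Content-Type: multipart/mixed; boundary=XX\n"
       b"\n"
       b"--XX\n"
       b"Content-Type: text/plain\n"
       b"\n"
       b"part one\n"
       b"--XX--\n")

# Full parse: the body is split into MIME parts.
full = BytesParser().parsebytes(raw)

# Header-only parse: the body is kept as an unparsed string payload.
headers_only = BytesParser().parsebytes(raw, headersonly=True)

print(full.is_multipart())
print(headers_only.is_multipart())
print(headers_only.get("Message-ID"))
```

Both parses expose the same headers; the header-only one just skips
building the MIME tree, which is the "little less work" the cover
letter refers to.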



* Re: [PATCH b4 2/2] Parse just headers when extracting message ID from stdin mbox
  2021-07-18  4:34       ` [PATCH b4 2/2] Parse just headers when extracting message ID from stdin mbox Kyle Meyer
@ 2021-07-18  4:45         ` Kyle Meyer
  0 siblings, 0 replies; 7+ messages in thread
From: Kyle Meyer @ 2021-07-18  4:45 UTC
  To: Michael S. Tsirkin; +Cc: Konstantin Ryabitsev, tools, users

Kyle Meyer writes:

> diff --git a/b4/pr.py b/b4/pr.py
> index fbb2a71..e52c2ab 100644
> --- a/b4/pr.py
> +++ b/b4/pr.py
> @@ -433,7 +433,10 @@ def main(cmdargs):
>  
>      if not sys.stdin.isatty():
>          logger.debug('Getting PR message from stdin')
> -        msg = email.message_from_bytes(sys.stdin.buffer.read())
> +        from email.parser import BytesParser
> +        from email.parser import BytesParser

Doh, there's a repeated import here :/
