All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3 0/3] python/machine.py: refactor shutdown
@ 2020-06-04 19:52 John Snow
  2020-06-04 19:52 ` [PATCH v3 1/3] python/machine.py: consolidate _post_shutdown() John Snow
                   ` (5 more replies)
  0 siblings, 6 replies; 20+ messages in thread
From: John Snow @ 2020-06-04 19:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: kwolf, Eduardo Habkost, philmd, Markus Armbruster, Cleber Rosa,
	John Snow

v3:
 - Split _post_shutdown refactor into own patch (now 1/3)
 - Re-add sigkill warning squelch (now 3/3)

NOTE: I re-added the squelch in its own patch for review purposes, but
for the purposes of avoiding temporary breakage, a maintainer may wish
to squash patches 2 and 3 if they are accepted.

v2: Philippe took patches 1, 3 and 4.

This is a re-write of what was:
[PATCH RFC 03/32] python//machine.py: remove bare except
[PATCH 2/4] python/machine.py: remove bare except

It's a bit heavier handed, but it should address some of kwolf's
feedback from the RFC version.

Applies straight to origin/master, ought to pass pylint and flake8:

> cd ~/src/qemu/python/qemu
> pylint *.py
> flake8 *.py

John Snow (3):
  python/machine.py: consolidate _post_shutdown()
  python/machine.py: refactor shutdown
  python/machine.py: re-add sigkill warning suppression

 python/qemu/machine.py | 100 +++++++++++++++++++++++++++++------------
 1 file changed, 71 insertions(+), 29 deletions(-)

-- 
2.21.3



^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v3 1/3] python/machine.py: consolidate _post_shutdown()
  2020-06-04 19:52 [PATCH v3 0/3] python/machine.py: refactor shutdown John Snow
@ 2020-06-04 19:52 ` John Snow
  2020-06-04 19:52 ` [PATCH v3 2/3] python/machine.py: refactor shutdown John Snow
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: John Snow @ 2020-06-04 19:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: kwolf, Eduardo Habkost, philmd, Markus Armbruster, Cleber Rosa,
	John Snow

Move more cleanup actions into _post_shutdown. As a change, if QEMU
should so happen to be terminated during a call to wait(), that event
will now be logged.

This is not likely to occur during normative use.

Signed-off-by: John Snow <jsnow@redhat.com>
---
 python/qemu/machine.py | 27 +++++++++++++--------------
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/python/qemu/machine.py b/python/qemu/machine.py
index 041c615052e..f7e68191c28 100644
--- a/python/qemu/machine.py
+++ b/python/qemu/machine.py
@@ -283,6 +283,8 @@ def _post_launch(self):
             self._qmp.accept()
 
     def _post_shutdown(self):
+        self._load_io_log()
+
         if self._qemu_log_file is not None:
             self._qemu_log_file.close()
             self._qemu_log_file = None
@@ -296,6 +298,17 @@ def _post_shutdown(self):
         while len(self._remove_files) > 0:
             self._remove_if_exists(self._remove_files.pop())
 
+        exitcode = self.exitcode()
+        if exitcode is not None and exitcode < 0:
+            msg = 'qemu received signal %i; command: "%s"'
+            if self._qemu_full_args:
+                command = ' '.join(self._qemu_full_args)
+            else:
+                command = ''
+            LOG.warning(msg, -int(exitcode), command)
+
+        self._launched = False
+
     def launch(self):
         """
         Launch the VM and make sure we cleanup and expose the
@@ -344,7 +357,6 @@ def wait(self):
         self._popen.wait()
         if self._qmp:
             self._qmp.close()
-        self._load_io_log()
         self._post_shutdown()
 
     def shutdown(self, has_quit=False, hard=False):
@@ -371,21 +383,8 @@ def shutdown(self, has_quit=False, hard=False):
                     self._popen.kill()
             self._popen.wait()
 
-        self._load_io_log()
         self._post_shutdown()
 
-        exitcode = self.exitcode()
-        if exitcode is not None and exitcode < 0 and \
-                not (exitcode == -9 and hard):
-            msg = 'qemu received signal %i: %s'
-            if self._qemu_full_args:
-                command = ' '.join(self._qemu_full_args)
-            else:
-                command = ''
-            LOG.warning(msg, -int(exitcode), command)
-
-        self._launched = False
-
     def kill(self):
         self.shutdown(hard=True)
 
-- 
2.21.3



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v3 2/3] python/machine.py: refactor shutdown
  2020-06-04 19:52 [PATCH v3 0/3] python/machine.py: refactor shutdown John Snow
  2020-06-04 19:52 ` [PATCH v3 1/3] python/machine.py: consolidate _post_shutdown() John Snow
@ 2020-06-04 19:52 ` John Snow
  2020-06-04 19:52 ` [PATCH v3 3/3] python/machine.py: re-add sigkill warning suppression John Snow
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: John Snow @ 2020-06-04 19:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: kwolf, Eduardo Habkost, philmd, Markus Armbruster, Cleber Rosa,
	John Snow

This is done primarily to avoid the 'bare except' pattern, which
suppresses ALL exceptions and not just ones that we are anticipating to
see.

Replace this with a pattern that isolates the different kind of shutdown
paradigms and a new fallback shutdown handler that gracefully attempts
one before the other.

Ensure that the main shutdown() function ALWAYS calls the post_shutdown
logic, no matter what kind of error we encountered. Subprocess wait
timeouts are considered expected, everything else is unexpected.

In cases where we encounter an expected error in the graceful shutdown
timeout, we will not re-raise an exception above shutdown(). Otherwise,
after post_shutdown cleanup, we will.

I anticipate that this WILL lead to additional bug reports filed against
this module, but that is unfortunately somewhat the point: This code
shouldn't be hiding failures that exist elsewhere within the python
code.

Signed-off-by: John Snow <jsnow@redhat.com>
---
 python/qemu/machine.py | 68 ++++++++++++++++++++++++++++++++----------
 1 file changed, 53 insertions(+), 15 deletions(-)

diff --git a/python/qemu/machine.py b/python/qemu/machine.py
index f7e68191c28..a2abd2c35e3 100644
--- a/python/qemu/machine.py
+++ b/python/qemu/machine.py
@@ -359,9 +359,51 @@ def wait(self):
             self._qmp.close()
         self._post_shutdown()
 
-    def shutdown(self, has_quit=False, hard=False):
+    def _hard_shutdown(self) -> None:
         """
-        Terminate the VM and clean up
+        Kill the VM if it is running.
+        """
+        if not self.is_running():
+            return
+
+        self._popen.kill()
+        self._popen.wait(timeout=60)
+
+    def _soft_shutdown(self, has_quit: bool = False, timeout: int = 3) -> None:
+        """
+        Attempt to shutdown the VM gracefully if it is running.
+
+        :param has_quit: When True, don't attempt to issue 'quit' QMP command
+        :param timeout: Timeout for graceful shutdown. Default 3 seconds.
+        """
+        if not self.is_running():
+            return
+
+        if self._qmp is not None:
+            if not has_quit:
+                self._qmp.cmd('quit')
+            self._qmp.close()
+
+        self._popen.wait(timeout=timeout)
+
+    def _do_shutdown(self, has_quit: bool = False, timeout: int = 3) -> None:
+        """
+        Attempt to shutdown the VM gracefully; fallback to a hard shutdown.
+
+        :param has_quit: When True, don't attempt to issue 'quit' QMP command
+        :param timeout: Timeout for graceful shutdown. Default 3 seconds.
+        """
+        try:
+            self._soft_shutdown(has_quit, timeout)
+        except subprocess.TimeoutExpired:
+            self._hard_shutdown()
+        except:
+            self._hard_shutdown()
+            raise
+
+    def shutdown(self, has_quit: bool = False, hard: bool = False) -> None:
+        """
+        Terminate the VM (gracefully if possible) and perform cleanup.
         """
         # If we keep the console socket open, we may deadlock waiting
         # for QEMU to exit, while QEMU is waiting for the socket to
@@ -370,22 +412,18 @@ def shutdown(self, has_quit=False, hard=False):
             self._console_socket.close()
             self._console_socket = None
 
-        if self.is_running():
+        try:
             if hard:
-                self._popen.kill()
-            elif self._qmp:
-                try:
-                    if not has_quit:
-                        self._qmp.cmd('quit')
-                    self._qmp.close()
-                    self._popen.wait(timeout=3)
-                except:
-                    self._popen.kill()
-            self._popen.wait()
-
-        self._post_shutdown()
+                self._hard_shutdown()
+            else:
+                self._do_shutdown(has_quit)
+        finally:
+            self._post_shutdown()
 
     def kill(self):
+        """
+        Terminate the VM forcefully and perform cleanup.
+        """
         self.shutdown(hard=True)
 
     def set_qmp_monitor(self, enabled=True):
-- 
2.21.3



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v3 3/3] python/machine.py: re-add sigkill warning suppression
  2020-06-04 19:52 [PATCH v3 0/3] python/machine.py: refactor shutdown John Snow
  2020-06-04 19:52 ` [PATCH v3 1/3] python/machine.py: consolidate _post_shutdown() John Snow
  2020-06-04 19:52 ` [PATCH v3 2/3] python/machine.py: refactor shutdown John Snow
@ 2020-06-04 19:52 ` John Snow
  2020-06-05  9:34 ` [PATCH v3 0/3] python/machine.py: refactor shutdown Kevin Wolf
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: John Snow @ 2020-06-04 19:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: kwolf, Eduardo Habkost, philmd, Markus Armbruster, Cleber Rosa,
	John Snow

If the user kills QEMU on purpose, we don't need to warn them about that
having happened: they know already.

Signed-off-by: John Snow <jsnow@redhat.com>
---
 python/qemu/machine.py | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/python/qemu/machine.py b/python/qemu/machine.py
index a2abd2c35e3..99bcb499878 100644
--- a/python/qemu/machine.py
+++ b/python/qemu/machine.py
@@ -22,6 +22,7 @@
 import os
 import subprocess
 import shutil
+import signal
 import socket
 import tempfile
 from typing import Optional, Type
@@ -122,6 +123,7 @@ def __init__(self, binary, args=None, wrapper=None, name=None,
         self._console_address = None
         self._console_socket = None
         self._remove_files = []
+        self._killed = False
 
     def __enter__(self):
         return self
@@ -282,7 +284,7 @@ def _post_launch(self):
         if self._qmp:
             self._qmp.accept()
 
-    def _post_shutdown(self):
+    def _post_shutdown(self) -> None:
         self._load_io_log()
 
         if self._qemu_log_file is not None:
@@ -299,7 +301,8 @@ def _post_shutdown(self):
             self._remove_if_exists(self._remove_files.pop())
 
         exitcode = self.exitcode()
-        if exitcode is not None and exitcode < 0:
+        if (exitcode is not None and exitcode < 0
+                and not (self._killed and exitcode == -signal.SIGKILL)):
             msg = 'qemu received signal %i; command: "%s"'
             if self._qemu_full_args:
                 command = ' '.join(self._qemu_full_args)
@@ -307,6 +310,7 @@ def _post_shutdown(self):
                 command = ''
             LOG.warning(msg, -int(exitcode), command)
 
+        self._killed = False
         self._launched = False
 
     def launch(self):
@@ -414,6 +418,7 @@ def shutdown(self, has_quit: bool = False, hard: bool = False) -> None:
 
         try:
             if hard:
+                self._killed = True
                 self._hard_shutdown()
             else:
                 self._do_shutdown(has_quit)
-- 
2.21.3



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 0/3] python/machine.py: refactor shutdown
  2020-06-04 19:52 [PATCH v3 0/3] python/machine.py: refactor shutdown John Snow
                   ` (2 preceding siblings ...)
  2020-06-04 19:52 ` [PATCH v3 3/3] python/machine.py: re-add sigkill warning suppression John Snow
@ 2020-06-05  9:34 ` Kevin Wolf
  2020-06-09 13:08 ` Philippe Mathieu-Daudé
  2020-06-17 17:15 ` Philippe Mathieu-Daudé
  5 siblings, 0 replies; 20+ messages in thread
From: Kevin Wolf @ 2020-06-05  9:34 UTC (permalink / raw)
  To: John Snow
  Cc: Markus Armbruster, Cleber Rosa, philmd, qemu-devel, Eduardo Habkost

Am 04.06.2020 um 21:52 hat John Snow geschrieben:
> v3:
>  - Split _post_shutdown refactor into own patch (now 1/3)

Feels much easier to read now!

>  - Re-add sigkill warning squelch (now 3/3)
> 
> NOTE: I re-added the squelch in its own patch for review purposes, but
> for the purposes of avoiding temporary breakage, a maintainer may wish
> to squash patches 2 and 3 if they are accepted.

As I found in the v2 review that we don't even have a single user of
hard=True, I don't think it matters. A separate patch makes things
easier to understand, so I would leave it as you have it in this series.

Reviewed-by: Kevin Wolf <kwolf@redhat.com>



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 0/3] python/machine.py: refactor shutdown
  2020-06-04 19:52 [PATCH v3 0/3] python/machine.py: refactor shutdown John Snow
                   ` (3 preceding siblings ...)
  2020-06-05  9:34 ` [PATCH v3 0/3] python/machine.py: refactor shutdown Kevin Wolf
@ 2020-06-09 13:08 ` Philippe Mathieu-Daudé
  2020-06-09 21:55   ` John Snow
  2020-06-17 17:15 ` Philippe Mathieu-Daudé
  5 siblings, 1 reply; 20+ messages in thread
From: Philippe Mathieu-Daudé @ 2020-06-09 13:08 UTC (permalink / raw)
  To: John Snow, qemu-devel
  Cc: kwolf, Markus Armbruster, Eduardo Habkost, Cleber Rosa

Hi John,

On 6/4/20 9:52 PM, John Snow wrote:
> v3:
>  - Split _post_shutdown refactor into own patch (now 1/3)
>  - Re-add sigkill warning squelch (now 3/3)
> 
> NOTE: I re-added the squelch in its own patch for review purposes, but
> for the purposes of avoiding temporary breakage, a maintainer may wish
> to squash patches 2 and 3 if they are accepted.
> 
> v2: Philippe took patches 1, 3 and 4.
> 
> This is a re-write of what was:
> [PATCH RFC 03/32] python//machine.py: remove bare except
> [PATCH 2/4] python/machine.py: remove bare except
> 
> It's a bit heavier handed, but it should address some of kwolf's
> feedback from the RFC version.
> 
> Applies straight to origin/master, ought to pass pylint and flake8:
> 
>> cd ~/src/qemu/python/qemu
>> pylint *.py
>> flake8 *.py
> 
> John Snow (3):
>   python/machine.py: consolidate _post_shutdown()
>   python/machine.py: refactor shutdown
>   python/machine.py: re-add sigkill warning suppression
> 
>  python/qemu/machine.py | 100 +++++++++++++++++++++++++++++------------
>  1 file changed, 71 insertions(+), 29 deletions(-)
> 

I'm now seeing this error:

21:31:58 DEBUG| / # reboot
21:32:01 DEBUG| / # reboot: Restarting system
21:32:01 DEBUG| >>> {'execute': 'quit'}
21:32:01 WARNI| qemu received signal 9; command:
"mips-softmmu/qemu-system-mips -display none -vga none -chardev
socket,id=mon,path=/tmp/tmp679upvrk/qemu-10292-monitor.sock -mon
chardev=mon,mode=control -machine malta -chardev
socket,id=console,path=/tmp/tmp679upvrk/qemu-10292-console.sock,server,nowait
-serial chardev:console -kernel
/tmp/avocado_b3aaagr9/avocado_job_5bj0xe1h/12-tests_acceptance_boot_linux_console.py_BootLinuxConsole.test_mips_malta_cpio/boot/vmlinux-4.5.0-2-4kc-malta
-initrd
/tmp/avocado_b3aaagr9/avocado_job_5bj0xe1h/12-tests_acceptance_boot_linux_console.py_BootLinuxConsole.test_mips_malta_cpiorootfs.cpio
-append printk.time=0 console=ttyS0 console=tty rdinit=/sbin/init
noreboot -no-reboot"
21:32:01 ERROR|
21:32:01 ERROR| Reproduced traceback from:
/home/travis/build/philmd/qemu/build/tests/venv/lib/python3.6/site-packages/avocado/core/test.py:886
21:32:01 ERROR| Traceback (most recent call last):
21:32:01 ERROR|   File
"/home/travis/build/philmd/qemu/build/tests/acceptance/avocado_qemu/__init__.py",
line 195, in tearDown
21:32:01 ERROR|     vm.shutdown()
21:32:01 ERROR|   File
"/home/travis/build/philmd/qemu/python/qemu/machine.py", line 449, in
shutdown
21:32:01 ERROR|     self._do_shutdown(has_quit)
21:32:01 ERROR|   File
"/home/travis/build/philmd/qemu/python/qemu/machine.py", line 426, in
_do_shutdown
21:32:01 ERROR|     self._soft_shutdown(has_quit, timeout)
21:32:01 ERROR|   File
"/home/travis/build/philmd/qemu/python/qemu/machine.py", line 413, in
_soft_shutdown
21:32:01 ERROR|     self._qmp.cmd('quit')
21:32:01 ERROR|   File
"/home/travis/build/philmd/qemu/python/qemu/qmp.py", line 271, in cmd
21:32:01 ERROR|     return self.cmd_obj(qmp_cmd)
21:32:01 ERROR|   File
"/home/travis/build/philmd/qemu/python/qemu/qmp.py", line 249, in cmd_obj
21:32:01 ERROR|     self.__sock.sendall(json.dumps(qmp_cmd).encode('utf-8'))
21:32:01 ERROR| BrokenPipeError: [Errno 32] Broken pipe
21:32:01 ERROR|
21:32:01 DEBUG| DATA (filename=output.expected) => NOT FOUND (data
sources: variant, test, file)
21:32:01 DEBUG| DATA (filename=stdout.expected) => NOT FOUND (data
sources: variant, test, file)
21:32:01 DEBUG| DATA (filename=stderr.expected) => NOT FOUND (data
sources: variant, test, file)
21:32:01 DEBUG| Not logging /var/log/syslog (lack of permissions)
21:32:01 ERROR| ERROR
12-tests/acceptance/boot_linux_console.py:BootLinuxConsole.test_mips_malta_cpio
-> TestSetupFail: [Errno 32] Broken pipe
21:32:01 INFO |

https://travis-ci.org/github/philmd/qemu/jobs/696142277#L5329



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 0/3] python/machine.py: refactor shutdown
  2020-06-09 13:08 ` Philippe Mathieu-Daudé
@ 2020-06-09 21:55   ` John Snow
  2020-06-15 15:21     ` Philippe Mathieu-Daudé
  0 siblings, 1 reply; 20+ messages in thread
From: John Snow @ 2020-06-09 21:55 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé, qemu-devel
  Cc: kwolf, Markus Armbruster, Eduardo Habkost, Cleber Rosa



On 6/9/20 9:08 AM, Philippe Mathieu-Daudé wrote:
> Hi John,
> 
> On 6/4/20 9:52 PM, John Snow wrote:
>> v3:
>>  - Split _post_shutdown refactor into own patch (now 1/3)
>>  - Re-add sigkill warning squelch (now 3/3)
>>
>> NOTE: I re-added the squelch in its own patch for review purposes, but
>> for the purposes of avoiding temporary breakage, a maintainer may wish
>> to squash patches 2 and 3 if they are accepted.
>>
>> v2: Philippe took patches 1, 3 and 4.
>>
>> This is a re-write of what was:
>> [PATCH RFC 03/32] python//machine.py: remove bare except
>> [PATCH 2/4] python/machine.py: remove bare except
>>
>> It's a bit heavier handed, but it should address some of kwolf's
>> feedback from the RFC version.
>>
>> Applies straight to origin/master, ought to pass pylint and flake8:
>>
>>> cd ~/src/qemu/python/qemu
>>> pylint *.py
>>> flake8 *.py
>>
>> John Snow (3):
>>   python/machine.py: consolidate _post_shutdown()
>>   python/machine.py: refactor shutdown
>>   python/machine.py: re-add sigkill warning suppression
>>
>>  python/qemu/machine.py | 100 +++++++++++++++++++++++++++++------------
>>  1 file changed, 71 insertions(+), 29 deletions(-)
>>
> 
> I'm now seeing this error:
> 
> 21:31:58 DEBUG| / # reboot
> 21:32:01 DEBUG| / # reboot: Restarting system
> 21:32:01 DEBUG| >>> {'execute': 'quit'}
> 21:32:01 WARNI| qemu received signal 9; command:
> "mips-softmmu/qemu-system-mips -display none -vga none -chardev
> socket,id=mon,path=/tmp/tmp679upvrk/qemu-10292-monitor.sock -mon
> chardev=mon,mode=control -machine malta -chardev
> socket,id=console,path=/tmp/tmp679upvrk/qemu-10292-console.sock,server,nowait
> -serial chardev:console -kernel
> /tmp/avocado_b3aaagr9/avocado_job_5bj0xe1h/12-tests_acceptance_boot_linux_console.py_BootLinuxConsole.test_mips_malta_cpio/boot/vmlinux-4.5.0-2-4kc-malta
> -initrd
> /tmp/avocado_b3aaagr9/avocado_job_5bj0xe1h/12-tests_acceptance_boot_linux_console.py_BootLinuxConsole.test_mips_malta_cpiorootfs.cpio
> -append printk.time=0 console=ttyS0 console=tty rdinit=/sbin/init
> noreboot -no-reboot"
> 21:32:01 ERROR|
> 21:32:01 ERROR| Reproduced traceback from:
> /home/travis/build/philmd/qemu/build/tests/venv/lib/python3.6/site-packages/avocado/core/test.py:886
> 21:32:01 ERROR| Traceback (most recent call last):
> 21:32:01 ERROR|   File
> "/home/travis/build/philmd/qemu/build/tests/acceptance/avocado_qemu/__init__.py",
> line 195, in tearDown
> 21:32:01 ERROR|     vm.shutdown()
> 21:32:01 ERROR|   File
> "/home/travis/build/philmd/qemu/python/qemu/machine.py", line 449, in
> shutdown
> 21:32:01 ERROR|     self._do_shutdown(has_quit)
> 21:32:01 ERROR|   File
> "/home/travis/build/philmd/qemu/python/qemu/machine.py", line 426, in
> _do_shutdown
> 21:32:01 ERROR|     self._soft_shutdown(has_quit, timeout)
> 21:32:01 ERROR|   File
> "/home/travis/build/philmd/qemu/python/qemu/machine.py", line 413, in
> _soft_shutdown
> 21:32:01 ERROR|     self._qmp.cmd('quit')
> 21:32:01 ERROR|   File
> "/home/travis/build/philmd/qemu/python/qemu/qmp.py", line 271, in cmd
> 21:32:01 ERROR|     return self.cmd_obj(qmp_cmd)
> 21:32:01 ERROR|   File
> "/home/travis/build/philmd/qemu/python/qemu/qmp.py", line 249, in cmd_obj
> 21:32:01 ERROR|     self.__sock.sendall(json.dumps(qmp_cmd).encode('utf-8'))
> 21:32:01 ERROR| BrokenPipeError: [Errno 32] Broken pipe
> 21:32:01 ERROR|
> 21:32:01 DEBUG| DATA (filename=output.expected) => NOT FOUND (data
> sources: variant, test, file)
> 21:32:01 DEBUG| DATA (filename=stdout.expected) => NOT FOUND (data
> sources: variant, test, file)
> 21:32:01 DEBUG| DATA (filename=stderr.expected) => NOT FOUND (data
> sources: variant, test, file)
> 21:32:01 DEBUG| Not logging /var/log/syslog (lack of permissions)
> 21:32:01 ERROR| ERROR
> 12-tests/acceptance/boot_linux_console.py:BootLinuxConsole.test_mips_malta_cpio
> -> TestSetupFail: [Errno 32] Broken pipe
> 21:32:01 INFO |
> 
> https://travis-ci.org/github/philmd/qemu/jobs/696142277#L5329
> 

Gotcha.

The problem here is that `test_mips_malta_cpio` in boot_linux_console.py
does this:

        self.vm.add_args('-kernel', kernel_path,
                         '-initrd', initrd_path,
                         '-append', kernel_command_line,
                         '-no-reboot')

and then:

exec_command_and_wait_for_pattern(self, 'reboot',
                         'reboot: Restarting system')

and (in avocado_qemu/) __init__.py does this:

    def tearDown(self):
        for vm in self._vms.values():
            vm.shutdown()



What's happening here is that we are instructing QEMU to *close* when
the guest reboots instead of allowing it to reboot. Then, we are issuing
a reboot command to the guest, which will effectively terminate QEMU as
well. Finally, we are trying to send a shutdown command to QEMU, but
QEMU has already gone.

Now, in the shutdown code, we do make an attempt to catch this:

def is_running(self):
    """Returns true if the VM is running."""
    return self._popen is not None and self._popen.poll() is None

But, well, race conditions.

When we make it here:

        if self._qmp is not None:
            if not has_quit:
                self._qmp.cmd('quit')
            self._qmp.close()

We believe we are running and we believe we have an open QMP socket.
Attempting to engage the socket by sending 'quit' causes the error.

It's a tight window: if quit happens earlier, we send the command
successfully and everything's OK. If quit happens later, we realize QEMU
isn't running and proceed to cleanup.

Ultimately:

- Avocado should not try to shut down QEMU twice, but
- machine.py shouldn't enable the race condition either.



for my part, how about this:

diff --git a/python/qemu/machine.py b/python/qemu/machine.py
index 99bcb499878..813f8e477db 100644
--- a/python/qemu/machine.py
+++ b/python/qemu/machine.py
@@ -385,7 +385,14 @@ def _soft_shutdown(self, has_quit: bool = False,
timeout: int = 3) -> None:

         if self._qmp is not None:
             if not has_quit:
-                self._qmp.cmd('quit')
+                try:
+                    self._qmp.cmd('quit')
+                except (BrokenPipeError, ConnectionResetError):
+                    # QMP went away just before or just after sending
'quit'
+                    if not self.is_running():
+                        # "Mission Accomplished"
+                        pass
+                    raise
             self._qmp.close()

         self._popen.wait(timeout=timeout)



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 0/3] python/machine.py: refactor shutdown
  2020-06-09 21:55   ` John Snow
@ 2020-06-15 15:21     ` Philippe Mathieu-Daudé
  2020-06-15 22:21       ` Eduardo Habkost
  2020-06-16 21:49       ` Cleber Rosa
  0 siblings, 2 replies; 20+ messages in thread
From: Philippe Mathieu-Daudé @ 2020-06-15 15:21 UTC (permalink / raw)
  To: John Snow, qemu-devel, Cleber Rosa
  Cc: kwolf, Eduardo Habkost, Markus Armbruster

On 6/9/20 11:55 PM, John Snow wrote:
> 
> 
> On 6/9/20 9:08 AM, Philippe Mathieu-Daudé wrote:
>> Hi John,
>>
>> On 6/4/20 9:52 PM, John Snow wrote:
>>> v3:
>>>  - Split _post_shutdown refactor into own patch (now 1/3)
>>>  - Re-add sigkill warning squelch (now 3/3)
>>>
>>> NOTE: I re-added the squelch in its own patch for review purposes, but
>>> for the purposes of avoiding temporary breakage, a maintainer may wish
>>> to squash patches 2 and 3 if they are accepted.
>>>
>>> v2: Philippe took patches 1, 3 and 4.
>>>
>>> This is a re-write of what was:
>>> [PATCH RFC 03/32] python//machine.py: remove bare except
>>> [PATCH 2/4] python/machine.py: remove bare except
>>>
>>> It's a bit heavier handed, but it should address some of kwolf's
>>> feedback from the RFC version.
>>>
>>> Applies straight to origin/master, ought to pass pylint and flake8:
>>>
>>>> cd ~/src/qemu/python/qemu
>>>> pylint *.py
>>>> flake8 *.py
>>>
>>> John Snow (3):
>>>   python/machine.py: consolidate _post_shutdown()
>>>   python/machine.py: refactor shutdown
>>>   python/machine.py: re-add sigkill warning suppression
>>>
>>>  python/qemu/machine.py | 100 +++++++++++++++++++++++++++++------------
>>>  1 file changed, 71 insertions(+), 29 deletions(-)
>>>
>>
>> I'm now seeing this error:
>>
>> 21:31:58 DEBUG| / # reboot
>> 21:32:01 DEBUG| / # reboot: Restarting system
>> 21:32:01 DEBUG| >>> {'execute': 'quit'}
>> 21:32:01 WARNI| qemu received signal 9; command:
>> "mips-softmmu/qemu-system-mips -display none -vga none -chardev
>> socket,id=mon,path=/tmp/tmp679upvrk/qemu-10292-monitor.sock -mon
>> chardev=mon,mode=control -machine malta -chardev
>> socket,id=console,path=/tmp/tmp679upvrk/qemu-10292-console.sock,server,nowait
>> -serial chardev:console -kernel
>> /tmp/avocado_b3aaagr9/avocado_job_5bj0xe1h/12-tests_acceptance_boot_linux_console.py_BootLinuxConsole.test_mips_malta_cpio/boot/vmlinux-4.5.0-2-4kc-malta
>> -initrd
>> /tmp/avocado_b3aaagr9/avocado_job_5bj0xe1h/12-tests_acceptance_boot_linux_console.py_BootLinuxConsole.test_mips_malta_cpiorootfs.cpio
>> -append printk.time=0 console=ttyS0 console=tty rdinit=/sbin/init
>> noreboot -no-reboot"
>> 21:32:01 ERROR|
>> 21:32:01 ERROR| Reproduced traceback from:
>> /home/travis/build/philmd/qemu/build/tests/venv/lib/python3.6/site-packages/avocado/core/test.py:886
>> 21:32:01 ERROR| Traceback (most recent call last):
>> 21:32:01 ERROR|   File
>> "/home/travis/build/philmd/qemu/build/tests/acceptance/avocado_qemu/__init__.py",
>> line 195, in tearDown
>> 21:32:01 ERROR|     vm.shutdown()
>> 21:32:01 ERROR|   File
>> "/home/travis/build/philmd/qemu/python/qemu/machine.py", line 449, in
>> shutdown
>> 21:32:01 ERROR|     self._do_shutdown(has_quit)
>> 21:32:01 ERROR|   File
>> "/home/travis/build/philmd/qemu/python/qemu/machine.py", line 426, in
>> _do_shutdown
>> 21:32:01 ERROR|     self._soft_shutdown(has_quit, timeout)
>> 21:32:01 ERROR|   File
>> "/home/travis/build/philmd/qemu/python/qemu/machine.py", line 413, in
>> _soft_shutdown
>> 21:32:01 ERROR|     self._qmp.cmd('quit')
>> 21:32:01 ERROR|   File
>> "/home/travis/build/philmd/qemu/python/qemu/qmp.py", line 271, in cmd
>> 21:32:01 ERROR|     return self.cmd_obj(qmp_cmd)
>> 21:32:01 ERROR|   File
>> "/home/travis/build/philmd/qemu/python/qemu/qmp.py", line 249, in cmd_obj
>> 21:32:01 ERROR|     self.__sock.sendall(json.dumps(qmp_cmd).encode('utf-8'))
>> 21:32:01 ERROR| BrokenPipeError: [Errno 32] Broken pipe
>> 21:32:01 ERROR|
>> 21:32:01 DEBUG| DATA (filename=output.expected) => NOT FOUND (data
>> sources: variant, test, file)
>> 21:32:01 DEBUG| DATA (filename=stdout.expected) => NOT FOUND (data
>> sources: variant, test, file)
>> 21:32:01 DEBUG| DATA (filename=stderr.expected) => NOT FOUND (data
>> sources: variant, test, file)
>> 21:32:01 DEBUG| Not logging /var/log/syslog (lack of permissions)
>> 21:32:01 ERROR| ERROR
>> 12-tests/acceptance/boot_linux_console.py:BootLinuxConsole.test_mips_malta_cpio
>> -> TestSetupFail: [Errno 32] Broken pipe
>> 21:32:01 INFO |
>>
>> https://travis-ci.org/github/philmd/qemu/jobs/696142277#L5329
>>
> 
> Gotcha.
> 
> The problem here is that `test_mips_malta_cpio` in boot_linux_console.py
> does this:
> 
>         self.vm.add_args('-kernel', kernel_path,
>                          '-initrd', initrd_path,
>                          '-append', kernel_command_line,
>                          '-no-reboot')
> 
> and then:
> 
> exec_command_and_wait_for_pattern(self, 'reboot',
>                          'reboot: Restarting system')
> 
> and (in avocado_qemu/) __init__.py does this:
> 
>     def tearDown(self):
>         for vm in self._vms.values():
>             vm.shutdown()
> 
> 
> 
> What's happening here is that we are instructing QEMU to *close* when
> the guest reboots instead of allowing it to reboot. Then, we are issuing
> a reboot command to the guest, which will effectively terminate QEMU as
> well. Finally, we are trying to send a shutdown command to QEMU, but
> QEMU has already gone.
> 
> Now, in the shutdown code, we do make an attempt to catch this:
> 
> def is_running(self):
>     """Returns true if the VM is running."""
>     return self._popen is not None and self._popen.poll() is None
> 
> But, well, race conditions.
> 
> When we make it here:
> 
>         if self._qmp is not None:
>             if not has_quit:
>                 self._qmp.cmd('quit')
>             self._qmp.close()
> 
> We believe we are running and we believe we have an open QMP socket.
> Attempting to engage the socket by sending 'quit' causes the error.
> 
> It's a tight window: if quit happens earlier, we send the command
> successfully and everything's OK. If quit happens later, we realize QEMU
> isn't running and proceed to cleanup.

Nice debugging :)

> 
> Ultimately:
> 
> - Avocado should not try to shut down QEMU twice, but
> - machine.py shouldn't enable the race condition either.
> 
> 
> 
> for my part, how about this:
> 
> diff --git a/python/qemu/machine.py b/python/qemu/machine.py
> index 99bcb499878..813f8e477db 100644
> --- a/python/qemu/machine.py
> +++ b/python/qemu/machine.py
> @@ -385,7 +385,14 @@ def _soft_shutdown(self, has_quit: bool = False,
> timeout: int = 3) -> None:
> 
>          if self._qmp is not None:
>              if not has_quit:
> -                self._qmp.cmd('quit')
> +                try:
> +                    self._qmp.cmd('quit')
> +                except (BrokenPipeError, ConnectionResetError):
> +                    # QMP went away just before or just after sending
> 'quit'
> +                    if not self.is_running():
> +                        # "Mission Accomplished"
> +                        pass
> +                    raise

Looks OK to me, Cleber/Eduardo can you Ack?

>              self._qmp.close()
> 
>          self._popen.wait(timeout=timeout)
> 



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 0/3] python/machine.py: refactor shutdown
  2020-06-15 15:21     ` Philippe Mathieu-Daudé
@ 2020-06-15 22:21       ` Eduardo Habkost
  2020-06-16 15:06         ` John Snow
  2020-06-16 21:49       ` Cleber Rosa
  1 sibling, 1 reply; 20+ messages in thread
From: Eduardo Habkost @ 2020-06-15 22:21 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé
  Cc: kwolf, Markus Armbruster, John Snow, qemu-devel, Cleber Rosa

On Mon, Jun 15, 2020 at 05:21:18PM +0200, Philippe Mathieu-Daudé wrote:
> On 6/9/20 11:55 PM, John Snow wrote:
> > 
> > 
> > On 6/9/20 9:08 AM, Philippe Mathieu-Daudé wrote:
> >> Hi John,
> >>
> >> On 6/4/20 9:52 PM, John Snow wrote:
> >>> v3:
> >>>  - Split _post_shutdown refactor into own patch (now 1/3)
> >>>  - Re-add sigkill warning squelch (now 3/3)
> >>>
> >>> NOTE: I re-added the squelch in its own patch for review purposes, but
> >>> for the purposes of avoiding temporary breakage, a maintainer may wish
> >>> to squash patches 2 and 3 if they are accepted.
> >>>
> >>> v2: Philippe took patches 1, 3 and 4.
> >>>
> >>> This is a re-write of what was:
> >>> [PATCH RFC 03/32] python//machine.py: remove bare except
> >>> [PATCH 2/4] python/machine.py: remove bare except
> >>>
> >>> It's a bit heavier handed, but it should address some of kwolf's
> >>> feedback from the RFC version.
> >>>
> >>> Applies straight to origin/master, ought to pass pylint and flake8:
> >>>
> >>>> cd ~/src/qemu/python/qemu
> >>>> pylint *.py
> >>>> flake8 *.py
> >>>
> >>> John Snow (3):
> >>>   python/machine.py: consolidate _post_shutdown()
> >>>   python/machine.py: refactor shutdown
> >>>   python/machine.py: re-add sigkill warning suppression
> >>>
> >>>  python/qemu/machine.py | 100 +++++++++++++++++++++++++++++------------
> >>>  1 file changed, 71 insertions(+), 29 deletions(-)
> >>>
> >>
> >> I'm now seeing this error:
> >>
> >> 21:31:58 DEBUG| / # reboot
> >> 21:32:01 DEBUG| / # reboot: Restarting system
> >> 21:32:01 DEBUG| >>> {'execute': 'quit'}
> >> 21:32:01 WARNI| qemu received signal 9; command:
> >> "mips-softmmu/qemu-system-mips -display none -vga none -chardev
> >> socket,id=mon,path=/tmp/tmp679upvrk/qemu-10292-monitor.sock -mon
> >> chardev=mon,mode=control -machine malta -chardev
> >> socket,id=console,path=/tmp/tmp679upvrk/qemu-10292-console.sock,server,nowait
> >> -serial chardev:console -kernel
> >> /tmp/avocado_b3aaagr9/avocado_job_5bj0xe1h/12-tests_acceptance_boot_linux_console.py_BootLinuxConsole.test_mips_malta_cpio/boot/vmlinux-4.5.0-2-4kc-malta
> >> -initrd
> >> /tmp/avocado_b3aaagr9/avocado_job_5bj0xe1h/12-tests_acceptance_boot_linux_console.py_BootLinuxConsole.test_mips_malta_cpiorootfs.cpio
> >> -append printk.time=0 console=ttyS0 console=tty rdinit=/sbin/init
> >> noreboot -no-reboot"
> >> 21:32:01 ERROR|
> >> 21:32:01 ERROR| Reproduced traceback from:
> >> /home/travis/build/philmd/qemu/build/tests/venv/lib/python3.6/site-packages/avocado/core/test.py:886
> >> 21:32:01 ERROR| Traceback (most recent call last):
> >> 21:32:01 ERROR|   File
> >> "/home/travis/build/philmd/qemu/build/tests/acceptance/avocado_qemu/__init__.py",
> >> line 195, in tearDown
> >> 21:32:01 ERROR|     vm.shutdown()
> >> 21:32:01 ERROR|   File
> >> "/home/travis/build/philmd/qemu/python/qemu/machine.py", line 449, in
> >> shutdown
> >> 21:32:01 ERROR|     self._do_shutdown(has_quit)
> >> 21:32:01 ERROR|   File
> >> "/home/travis/build/philmd/qemu/python/qemu/machine.py", line 426, in
> >> _do_shutdown
> >> 21:32:01 ERROR|     self._soft_shutdown(has_quit, timeout)
> >> 21:32:01 ERROR|   File
> >> "/home/travis/build/philmd/qemu/python/qemu/machine.py", line 413, in
> >> _soft_shutdown
> >> 21:32:01 ERROR|     self._qmp.cmd('quit')
> >> 21:32:01 ERROR|   File
> >> "/home/travis/build/philmd/qemu/python/qemu/qmp.py", line 271, in cmd
> >> 21:32:01 ERROR|     return self.cmd_obj(qmp_cmd)
> >> 21:32:01 ERROR|   File
> >> "/home/travis/build/philmd/qemu/python/qemu/qmp.py", line 249, in cmd_obj
> >> 21:32:01 ERROR|     self.__sock.sendall(json.dumps(qmp_cmd).encode('utf-8'))
> >> 21:32:01 ERROR| BrokenPipeError: [Errno 32] Broken pipe
> >> 21:32:01 ERROR|
> >> 21:32:01 DEBUG| DATA (filename=output.expected) => NOT FOUND (data
> >> sources: variant, test, file)
> >> 21:32:01 DEBUG| DATA (filename=stdout.expected) => NOT FOUND (data
> >> sources: variant, test, file)
> >> 21:32:01 DEBUG| DATA (filename=stderr.expected) => NOT FOUND (data
> >> sources: variant, test, file)
> >> 21:32:01 DEBUG| Not logging /var/log/syslog (lack of permissions)
> >> 21:32:01 ERROR| ERROR
> >> 12-tests/acceptance/boot_linux_console.py:BootLinuxConsole.test_mips_malta_cpio
> >> -> TestSetupFail: [Errno 32] Broken pipe
> >> 21:32:01 INFO |
> >>
> >> https://travis-ci.org/github/philmd/qemu/jobs/696142277#L5329
> >>
> > 
> > Gotcha.
> > 
> > The problem here is that `test_mips_malta_cpio` in boot_linux_console.py
> > does this:
> > 
> >         self.vm.add_args('-kernel', kernel_path,
> >                          '-initrd', initrd_path,
> >                          '-append', kernel_command_line,
> >                          '-no-reboot')
> > 
> > and then:
> > 
> > exec_command_and_wait_for_pattern(self, 'reboot',
> >                          'reboot: Restarting system')
> > 
> > and (in avocado_qemu/) __init__.py does this:
> > 
> >     def tearDown(self):
> >         for vm in self._vms.values():
> >             vm.shutdown()
> > 
> > 
> > 
> > What's happening here is that we are instructing QEMU to *close* when
> > the guest reboots instead of allowing it to reboot. Then, we are issuing
> > a reboot command to the guest, which will effectively terminate QEMU as
> > well. Finally, we are trying to send a shutdown command to QEMU, but
> > QEMU has already gone.
> > 
> > Now, in the shutdown code, we do make an attempt to catch this:
> > 
> > def is_running(self):
> >     """Returns true if the VM is running."""
> >     return self._popen is not None and self._popen.poll() is None
> > 
> > But, well, race conditions.
> > 
> > When we make it here:
> > 
> >         if self._qmp is not None:
> >             if not has_quit:
> >                 self._qmp.cmd('quit')
> >             self._qmp.close()
> > 
> > We believe we are running and we believe we have an open QMP socket.
> > Attempting to engage the socket by sending 'quit' causes the error.
> > 
> > It's a tight window: if quit happens earlier, we send the command
> > successfully and everything's OK. If quit happens later, we realize QEMU
> > isn't running and proceed to cleanup.
> 
> Nice debugging :)
> 
> > 
> > Ultimately:
> > 
> > - Avocado should not try to shut down QEMU twice, but
> > - machine.py shouldn't enable the race condition either.
> > 
> > 
> > 
> > for my part, how about this:
> > 
> > diff --git a/python/qemu/machine.py b/python/qemu/machine.py
> > index 99bcb499878..813f8e477db 100644
> > --- a/python/qemu/machine.py
> > +++ b/python/qemu/machine.py
> > @@ -385,7 +385,14 @@ def _soft_shutdown(self, has_quit: bool = False,
> > timeout: int = 3) -> None:
> > 
> >          if self._qmp is not None:
> >              if not has_quit:
> > -                self._qmp.cmd('quit')
> > +                try:
> > +                    self._qmp.cmd('quit')
> > +                except (BrokenPipeError, ConnectionResetError):
> > +                    # QMP went away just before or just after sending
> > 'quit'
> > +                    if not self.is_running():
> > +                        # "Mission Accomplished"
> > +                        pass
> > +                    raise
> 
> Looks OK to me, Cleber/Eduardo can you Ack?

Looks good to me.

Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>


> 
> >              self._qmp.close()
> > 
> >          self._popen.wait(timeout=timeout)
> > 
> 

-- 
Eduardo



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 0/3] python/machine.py: refactor shutdown
  2020-06-15 22:21       ` Eduardo Habkost
@ 2020-06-16 15:06         ` John Snow
  2020-06-16 15:36           ` Philippe Mathieu-Daudé
  0 siblings, 1 reply; 20+ messages in thread
From: John Snow @ 2020-06-16 15:06 UTC (permalink / raw)
  To: Eduardo Habkost, Philippe Mathieu-Daudé
  Cc: kwolf, Markus Armbruster, qemu-devel, Cleber Rosa



On 6/15/20 6:21 PM, Eduardo Habkost wrote:
> On Mon, Jun 15, 2020 at 05:21:18PM +0200, Philippe Mathieu-Daudé wrote:
>> On 6/9/20 11:55 PM, John Snow wrote:
>>>
>>>
>>> On 6/9/20 9:08 AM, Philippe Mathieu-Daudé wrote:
>>>> Hi John,
>>>>
>>>> On 6/4/20 9:52 PM, John Snow wrote:
>>>>> v3:
>>>>>  - Split _post_shutdown refactor into own patch (now 1/3)
>>>>>  - Re-add sigkill warning squelch (now 3/3)
>>>>>
>>>>> NOTE: I re-added the squelch in its own patch for review purposes, but
>>>>> for the purposes of avoiding temporary breakage, a maintainer may wish
>>>>> to squash patches 2 and 3 if they are accepted.
>>>>>
>>>>> v2: Philippe took patches 1, 3 and 4.
>>>>>
>>>>> This is a re-write of what was:
>>>>> [PATCH RFC 03/32] python//machine.py: remove bare except
>>>>> [PATCH 2/4] python/machine.py: remove bare except
>>>>>
>>>>> It's a bit heavier handed, but it should address some of kwolf's
>>>>> feedback from the RFC version.
>>>>>
>>>>> Applies straight to origin/master, ought to pass pylint and flake8:
>>>>>
>>>>>> cd ~/src/qemu/python/qemu
>>>>>> pylint *.py
>>>>>> flake8 *.py
>>>>>
>>>>> John Snow (3):
>>>>>   python/machine.py: consolidate _post_shutdown()
>>>>>   python/machine.py: refactor shutdown
>>>>>   python/machine.py: re-add sigkill warning suppression
>>>>>
>>>>>  python/qemu/machine.py | 100 +++++++++++++++++++++++++++++------------
>>>>>  1 file changed, 71 insertions(+), 29 deletions(-)
>>>>>
>>>>
>>>> I'm now seeing this error:
>>>>
>>>> 21:31:58 DEBUG| / # reboot
>>>> 21:32:01 DEBUG| / # reboot: Restarting system
>>>> 21:32:01 DEBUG| >>> {'execute': 'quit'}
>>>> 21:32:01 WARNI| qemu received signal 9; command:
>>>> "mips-softmmu/qemu-system-mips -display none -vga none -chardev
>>>> socket,id=mon,path=/tmp/tmp679upvrk/qemu-10292-monitor.sock -mon
>>>> chardev=mon,mode=control -machine malta -chardev
>>>> socket,id=console,path=/tmp/tmp679upvrk/qemu-10292-console.sock,server,nowait
>>>> -serial chardev:console -kernel
>>>> /tmp/avocado_b3aaagr9/avocado_job_5bj0xe1h/12-tests_acceptance_boot_linux_console.py_BootLinuxConsole.test_mips_malta_cpio/boot/vmlinux-4.5.0-2-4kc-malta
>>>> -initrd
>>>> /tmp/avocado_b3aaagr9/avocado_job_5bj0xe1h/12-tests_acceptance_boot_linux_console.py_BootLinuxConsole.test_mips_malta_cpiorootfs.cpio
>>>> -append printk.time=0 console=ttyS0 console=tty rdinit=/sbin/init
>>>> noreboot -no-reboot"
>>>> 21:32:01 ERROR|
>>>> 21:32:01 ERROR| Reproduced traceback from:
>>>> /home/travis/build/philmd/qemu/build/tests/venv/lib/python3.6/site-packages/avocado/core/test.py:886
>>>> 21:32:01 ERROR| Traceback (most recent call last):
>>>> 21:32:01 ERROR|   File
>>>> "/home/travis/build/philmd/qemu/build/tests/acceptance/avocado_qemu/__init__.py",
>>>> line 195, in tearDown
>>>> 21:32:01 ERROR|     vm.shutdown()
>>>> 21:32:01 ERROR|   File
>>>> "/home/travis/build/philmd/qemu/python/qemu/machine.py", line 449, in
>>>> shutdown
>>>> 21:32:01 ERROR|     self._do_shutdown(has_quit)
>>>> 21:32:01 ERROR|   File
>>>> "/home/travis/build/philmd/qemu/python/qemu/machine.py", line 426, in
>>>> _do_shutdown
>>>> 21:32:01 ERROR|     self._soft_shutdown(has_quit, timeout)
>>>> 21:32:01 ERROR|   File
>>>> "/home/travis/build/philmd/qemu/python/qemu/machine.py", line 413, in
>>>> _soft_shutdown
>>>> 21:32:01 ERROR|     self._qmp.cmd('quit')
>>>> 21:32:01 ERROR|   File
>>>> "/home/travis/build/philmd/qemu/python/qemu/qmp.py", line 271, in cmd
>>>> 21:32:01 ERROR|     return self.cmd_obj(qmp_cmd)
>>>> 21:32:01 ERROR|   File
>>>> "/home/travis/build/philmd/qemu/python/qemu/qmp.py", line 249, in cmd_obj
>>>> 21:32:01 ERROR|     self.__sock.sendall(json.dumps(qmp_cmd).encode('utf-8'))
>>>> 21:32:01 ERROR| BrokenPipeError: [Errno 32] Broken pipe
>>>> 21:32:01 ERROR|
>>>> 21:32:01 DEBUG| DATA (filename=output.expected) => NOT FOUND (data
>>>> sources: variant, test, file)
>>>> 21:32:01 DEBUG| DATA (filename=stdout.expected) => NOT FOUND (data
>>>> sources: variant, test, file)
>>>> 21:32:01 DEBUG| DATA (filename=stderr.expected) => NOT FOUND (data
>>>> sources: variant, test, file)
>>>> 21:32:01 DEBUG| Not logging /var/log/syslog (lack of permissions)
>>>> 21:32:01 ERROR| ERROR
>>>> 12-tests/acceptance/boot_linux_console.py:BootLinuxConsole.test_mips_malta_cpio
>>>> -> TestSetupFail: [Errno 32] Broken pipe
>>>> 21:32:01 INFO |
>>>>
>>>> https://travis-ci.org/github/philmd/qemu/jobs/696142277#L5329
>>>>
>>>
>>> Gotcha.
>>>
>>> The problem here is that `test_mips_malta_cpio` in boot_linux_console.py
>>> does this:
>>>
>>>         self.vm.add_args('-kernel', kernel_path,
>>>                          '-initrd', initrd_path,
>>>                          '-append', kernel_command_line,
>>>                          '-no-reboot')
>>>
>>> and then:
>>>
>>> exec_command_and_wait_for_pattern(self, 'reboot',
>>>                          'reboot: Restarting system')
>>>
>>> and (in avocado_qemu/) __init__.py does this:
>>>
>>>     def tearDown(self):
>>>         for vm in self._vms.values():
>>>             vm.shutdown()
>>>
>>>
>>>
>>> What's happening here is that we are instructing QEMU to *close* when
>>> the guest reboots instead of allowing it to reboot. Then, we are issuing
>>> a reboot command to the guest, which will effectively terminate QEMU as
>>> well. Finally, we are trying to send a shutdown command to QEMU, but
>>> QEMU has already gone.
>>>
>>> Now, in the shutdown code, we do make an attempt to catch this:
>>>
>>> def is_running(self):
>>>     """Returns true if the VM is running."""
>>>     return self._popen is not None and self._popen.poll() is None
>>>
>>> But, well, race conditions.
>>>
>>> When we make it here:
>>>
>>>         if self._qmp is not None:
>>>             if not has_quit:
>>>                 self._qmp.cmd('quit')
>>>             self._qmp.close()
>>>
>>> We believe we are running and we believe we have an open QMP socket.
>>> Attempting to engage the socket by sending 'quit' causes the error.
>>>
>>> It's a tight window: if quit happens earlier, we send the command
>>> successfully and everything's OK. If quit happens later, we realize QEMU
>>> isn't running and proceed to cleanup.
>>
>> Nice debugging :)
>>
>>>
>>> Ultimately:
>>>
>>> - Avocado should not try to shut down QEMU twice, but
>>> - machine.py shouldn't enable the race condition either.
>>>
>>>
>>>
>>> for my part, how about this:
>>>
>>> diff --git a/python/qemu/machine.py b/python/qemu/machine.py
>>> index 99bcb499878..813f8e477db 100644
>>> --- a/python/qemu/machine.py
>>> +++ b/python/qemu/machine.py
>>> @@ -385,7 +385,14 @@ def _soft_shutdown(self, has_quit: bool = False,
>>> timeout: int = 3) -> None:
>>>
>>>          if self._qmp is not None:
>>>              if not has_quit:
>>> -                self._qmp.cmd('quit')
>>> +                try:
>>> +                    self._qmp.cmd('quit')
>>> +                except (BrokenPipeError, ConnectionResetError):
>>> +                    # QMP went away just before or just after sending
>>> 'quit'
>>> +                    if not self.is_running():
>>> +                        # "Mission Accomplished"
>>> +                        pass
>>> +                    raise
>>
>> Looks OK to me, Cleber/Eduardo can you Ack?
> 
> Looks good to me.
> 
> Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
> 
> 
>>
>>>              self._qmp.close()
>>>
>>>          self._popen.wait(timeout=timeout)
>>>
>>
> 

I'll respin with the fix squashed.

Thanks!

--js



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 0/3] python/machine.py: refactor shutdown
  2020-06-16 15:06         ` John Snow
@ 2020-06-16 15:36           ` Philippe Mathieu-Daudé
  0 siblings, 0 replies; 20+ messages in thread
From: Philippe Mathieu-Daudé @ 2020-06-16 15:36 UTC (permalink / raw)
  To: John Snow, Eduardo Habkost
  Cc: kwolf, Markus Armbruster, qemu-devel, Cleber Rosa

On 6/16/20 5:06 PM, John Snow wrote:
> 
> 
> On 6/15/20 6:21 PM, Eduardo Habkost wrote:
>> On Mon, Jun 15, 2020 at 05:21:18PM +0200, Philippe Mathieu-Daudé wrote:
>>> On 6/9/20 11:55 PM, John Snow wrote:
>>>>
>>>>
>>>> On 6/9/20 9:08 AM, Philippe Mathieu-Daudé wrote:
>>>>> Hi John,
>>>>>
>>>>> On 6/4/20 9:52 PM, John Snow wrote:
>>>>>> v3:
>>>>>>  - Split _post_shutdown refactor into own patch (now 1/3)
>>>>>>  - Re-add sigkill warning squelch (now 3/3)
>>>>>>
>>>>>> NOTE: I re-added the squelch in its own patch for review purposes, but
>>>>>> for the purposes of avoiding temporary breakage, a maintainer may wish
>>>>>> to squash patches 2 and 3 if they are accepted.
>>>>>>
>>>>>> v2: Philippe took patches 1, 3 and 4.
>>>>>>
>>>>>> This is a re-write of what was:
>>>>>> [PATCH RFC 03/32] python//machine.py: remove bare except
>>>>>> [PATCH 2/4] python/machine.py: remove bare except
>>>>>>
>>>>>> It's a bit heavier handed, but it should address some of kwolf's
>>>>>> feedback from the RFC version.
>>>>>>
>>>>>> Applies straight to origin/master, ought to pass pylint and flake8:
>>>>>>
>>>>>>> cd ~/src/qemu/python/qemu
>>>>>>> pylint *.py
>>>>>>> flake8 *.py
>>>>>>
>>>>>> John Snow (3):
>>>>>>   python/machine.py: consolidate _post_shutdown()
>>>>>>   python/machine.py: refactor shutdown
>>>>>>   python/machine.py: re-add sigkill warning suppression
>>>>>>
>>>>>>  python/qemu/machine.py | 100 +++++++++++++++++++++++++++++------------
>>>>>>  1 file changed, 71 insertions(+), 29 deletions(-)
>>>>>>
>>>>>
>>>>> I'm now seeing this error:
>>>>>
>>>>> 21:31:58 DEBUG| / # reboot
>>>>> 21:32:01 DEBUG| / # reboot: Restarting system
>>>>> 21:32:01 DEBUG| >>> {'execute': 'quit'}
>>>>> 21:32:01 WARNI| qemu received signal 9; command:
>>>>> "mips-softmmu/qemu-system-mips -display none -vga none -chardev
>>>>> socket,id=mon,path=/tmp/tmp679upvrk/qemu-10292-monitor.sock -mon
>>>>> chardev=mon,mode=control -machine malta -chardev
>>>>> socket,id=console,path=/tmp/tmp679upvrk/qemu-10292-console.sock,server,nowait
>>>>> -serial chardev:console -kernel
>>>>> /tmp/avocado_b3aaagr9/avocado_job_5bj0xe1h/12-tests_acceptance_boot_linux_console.py_BootLinuxConsole.test_mips_malta_cpio/boot/vmlinux-4.5.0-2-4kc-malta
>>>>> -initrd
>>>>> /tmp/avocado_b3aaagr9/avocado_job_5bj0xe1h/12-tests_acceptance_boot_linux_console.py_BootLinuxConsole.test_mips_malta_cpiorootfs.cpio
>>>>> -append printk.time=0 console=ttyS0 console=tty rdinit=/sbin/init
>>>>> noreboot -no-reboot"
>>>>> 21:32:01 ERROR|
>>>>> 21:32:01 ERROR| Reproduced traceback from:
>>>>> /home/travis/build/philmd/qemu/build/tests/venv/lib/python3.6/site-packages/avocado/core/test.py:886
>>>>> 21:32:01 ERROR| Traceback (most recent call last):
>>>>> 21:32:01 ERROR|   File
>>>>> "/home/travis/build/philmd/qemu/build/tests/acceptance/avocado_qemu/__init__.py",
>>>>> line 195, in tearDown
>>>>> 21:32:01 ERROR|     vm.shutdown()
>>>>> 21:32:01 ERROR|   File
>>>>> "/home/travis/build/philmd/qemu/python/qemu/machine.py", line 449, in
>>>>> shutdown
>>>>> 21:32:01 ERROR|     self._do_shutdown(has_quit)
>>>>> 21:32:01 ERROR|   File
>>>>> "/home/travis/build/philmd/qemu/python/qemu/machine.py", line 426, in
>>>>> _do_shutdown
>>>>> 21:32:01 ERROR|     self._soft_shutdown(has_quit, timeout)
>>>>> 21:32:01 ERROR|   File
>>>>> "/home/travis/build/philmd/qemu/python/qemu/machine.py", line 413, in
>>>>> _soft_shutdown
>>>>> 21:32:01 ERROR|     self._qmp.cmd('quit')
>>>>> 21:32:01 ERROR|   File
>>>>> "/home/travis/build/philmd/qemu/python/qemu/qmp.py", line 271, in cmd
>>>>> 21:32:01 ERROR|     return self.cmd_obj(qmp_cmd)
>>>>> 21:32:01 ERROR|   File
>>>>> "/home/travis/build/philmd/qemu/python/qemu/qmp.py", line 249, in cmd_obj
>>>>> 21:32:01 ERROR|     self.__sock.sendall(json.dumps(qmp_cmd).encode('utf-8'))
>>>>> 21:32:01 ERROR| BrokenPipeError: [Errno 32] Broken pipe
>>>>> 21:32:01 ERROR|
>>>>> 21:32:01 DEBUG| DATA (filename=output.expected) => NOT FOUND (data
>>>>> sources: variant, test, file)
>>>>> 21:32:01 DEBUG| DATA (filename=stdout.expected) => NOT FOUND (data
>>>>> sources: variant, test, file)
>>>>> 21:32:01 DEBUG| DATA (filename=stderr.expected) => NOT FOUND (data
>>>>> sources: variant, test, file)
>>>>> 21:32:01 DEBUG| Not logging /var/log/syslog (lack of permissions)
>>>>> 21:32:01 ERROR| ERROR
>>>>> 12-tests/acceptance/boot_linux_console.py:BootLinuxConsole.test_mips_malta_cpio
>>>>> -> TestSetupFail: [Errno 32] Broken pipe
>>>>> 21:32:01 INFO |
>>>>>
>>>>> https://travis-ci.org/github/philmd/qemu/jobs/696142277#L5329
>>>>>
>>>>
>>>> Gotcha.
>>>>
>>>> The problem here is that `test_mips_malta_cpio` in boot_linux_console.py
>>>> does this:
>>>>
>>>>         self.vm.add_args('-kernel', kernel_path,
>>>>                          '-initrd', initrd_path,
>>>>                          '-append', kernel_command_line,
>>>>                          '-no-reboot')
>>>>
>>>> and then:
>>>>
>>>> exec_command_and_wait_for_pattern(self, 'reboot',
>>>>                          'reboot: Restarting system')
>>>>
>>>> and (in avocado_qemu/) __init__.py does this:
>>>>
>>>>     def tearDown(self):
>>>>         for vm in self._vms.values():
>>>>             vm.shutdown()
>>>>
>>>>
>>>>
>>>> What's happening here is that we are instructing QEMU to *close* when
>>>> the guest reboots instead of allowing it to reboot. Then, we are issuing
>>>> a reboot command to the guest, which will effectively terminate QEMU as
>>>> well. Finally, we are trying to send a shutdown command to QEMU, but
>>>> QEMU has already gone.
>>>>
>>>> Now, in the shutdown code, we do make an attempt to catch this:
>>>>
>>>> def is_running(self):
>>>>     """Returns true if the VM is running."""
>>>>     return self._popen is not None and self._popen.poll() is None
>>>>
>>>> But, well, race conditions.
>>>>
>>>> When we make it here:
>>>>
>>>>         if self._qmp is not None:
>>>>             if not has_quit:
>>>>                 self._qmp.cmd('quit')
>>>>             self._qmp.close()
>>>>
>>>> We believe we are running and we believe we have an open QMP socket.
>>>> Attempting to engage the socket by sending 'quit' causes the error.
>>>>
>>>> It's a tight window: if quit happens earlier, we send the command
>>>> successfully and everything's OK. If quit happens later, we realize QEMU
>>>> isn't running and proceed to cleanup.
>>>
>>> Nice debugging :)
>>>
>>>>
>>>> Ultimately:
>>>>
>>>> - Avocado should not try to shut down QEMU twice, but
>>>> - machine.py shouldn't enable the race condition either.
>>>>
>>>>
>>>>
>>>> for my part, how about this:
>>>>
>>>> diff --git a/python/qemu/machine.py b/python/qemu/machine.py
>>>> index 99bcb499878..813f8e477db 100644
>>>> --- a/python/qemu/machine.py
>>>> +++ b/python/qemu/machine.py
>>>> @@ -385,7 +385,14 @@ def _soft_shutdown(self, has_quit: bool = False,
>>>> timeout: int = 3) -> None:
>>>>
>>>>          if self._qmp is not None:
>>>>              if not has_quit:
>>>> -                self._qmp.cmd('quit')
>>>> +                try:
>>>> +                    self._qmp.cmd('quit')
>>>> +                except (BrokenPipeError, ConnectionResetError):
>>>> +                    # QMP went away just before or just after sending
>>>> 'quit'
>>>> +                    if not self.is_running():
>>>> +                        # "Mission Accomplished"
>>>> +                        pass
>>>> +                    raise
>>>
>>> Looks OK to me, Cleber/Eduardo can you Ack?
>>
>> Looks good to me.
>>
>> Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
>>
>>
>>>
>>>>              self._qmp.close()
>>>>
>>>>          self._popen.wait(timeout=timeout)
>>>>
>>>
>>
> 
> I'll respin with the fix squashed.

Not needed, it is here already:
https://gitlab.com/philmd/qemu/-/commit/d95ff1cf5d47f96d
I was waiting the build to success to respond to this
thread.

> 
> Thanks!
> 
> --js
> 



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 0/3] python/machine.py: refactor shutdown
  2020-06-15 15:21     ` Philippe Mathieu-Daudé
  2020-06-15 22:21       ` Eduardo Habkost
@ 2020-06-16 21:49       ` Cleber Rosa
  2020-06-17 17:13         ` Philippe Mathieu-Daudé
  1 sibling, 1 reply; 20+ messages in thread
From: Cleber Rosa @ 2020-06-16 21:49 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé
  Cc: kwolf, Markus Armbruster, John Snow, qemu-devel, Eduardo Habkost

[-- Attachment #1: Type: text/plain, Size: 9247 bytes --]

On Mon, Jun 15, 2020 at 05:21:18PM +0200, Philippe Mathieu-Daudé wrote:
> On 6/9/20 11:55 PM, John Snow wrote:
> > 
> > 
> > On 6/9/20 9:08 AM, Philippe Mathieu-Daudé wrote:
> >> Hi John,
> >>
> >> On 6/4/20 9:52 PM, John Snow wrote:
> >>> v3:
> >>>  - Split _post_shutdown refactor into own patch (now 1/3)
> >>>  - Re-add sigkill warning squelch (now 3/3)
> >>>
> >>> NOTE: I re-added the squelch in its own patch for review purposes, but
> >>> for the purposes of avoiding temporary breakage, a maintainer may wish
> >>> to squash patches 2 and 3 if they are accepted.
> >>>
> >>> v2: Philippe took patches 1, 3 and 4.
> >>>
> >>> This is a re-write of what was:
> >>> [PATCH RFC 03/32] python//machine.py: remove bare except
> >>> [PATCH 2/4] python/machine.py: remove bare except
> >>>
> >>> It's a bit heavier handed, but it should address some of kwolf's
> >>> feedback from the RFC version.
> >>>
> >>> Applies straight to origin/master, ought to pass pylint and flake8:
> >>>
> >>>> cd ~/src/qemu/python/qemu
> >>>> pylint *.py
> >>>> flake8 *.py
> >>>
> >>> John Snow (3):
> >>>   python/machine.py: consolidate _post_shutdown()
> >>>   python/machine.py: refactor shutdown
> >>>   python/machine.py: re-add sigkill warning suppression
> >>>
> >>>  python/qemu/machine.py | 100 +++++++++++++++++++++++++++++------------
> >>>  1 file changed, 71 insertions(+), 29 deletions(-)
> >>>
> >>
> >> I'm now seeing this error:
> >>
> >> 21:31:58 DEBUG| / # reboot
> >> 21:32:01 DEBUG| / # reboot: Restarting system
> >> 21:32:01 DEBUG| >>> {'execute': 'quit'}
> >> 21:32:01 WARNI| qemu received signal 9; command:
> >> "mips-softmmu/qemu-system-mips -display none -vga none -chardev
> >> socket,id=mon,path=/tmp/tmp679upvrk/qemu-10292-monitor.sock -mon
> >> chardev=mon,mode=control -machine malta -chardev
> >> socket,id=console,path=/tmp/tmp679upvrk/qemu-10292-console.sock,server,nowait
> >> -serial chardev:console -kernel
> >> /tmp/avocado_b3aaagr9/avocado_job_5bj0xe1h/12-tests_acceptance_boot_linux_console.py_BootLinuxConsole.test_mips_malta_cpio/boot/vmlinux-4.5.0-2-4kc-malta
> >> -initrd
> >> /tmp/avocado_b3aaagr9/avocado_job_5bj0xe1h/12-tests_acceptance_boot_linux_console.py_BootLinuxConsole.test_mips_malta_cpiorootfs.cpio
> >> -append printk.time=0 console=ttyS0 console=tty rdinit=/sbin/init
> >> noreboot -no-reboot"
> >> 21:32:01 ERROR|
> >> 21:32:01 ERROR| Reproduced traceback from:
> >> /home/travis/build/philmd/qemu/build/tests/venv/lib/python3.6/site-packages/avocado/core/test.py:886
> >> 21:32:01 ERROR| Traceback (most recent call last):
> >> 21:32:01 ERROR|   File
> >> "/home/travis/build/philmd/qemu/build/tests/acceptance/avocado_qemu/__init__.py",
> >> line 195, in tearDown
> >> 21:32:01 ERROR|     vm.shutdown()
> >> 21:32:01 ERROR|   File
> >> "/home/travis/build/philmd/qemu/python/qemu/machine.py", line 449, in
> >> shutdown
> >> 21:32:01 ERROR|     self._do_shutdown(has_quit)
> >> 21:32:01 ERROR|   File
> >> "/home/travis/build/philmd/qemu/python/qemu/machine.py", line 426, in
> >> _do_shutdown
> >> 21:32:01 ERROR|     self._soft_shutdown(has_quit, timeout)
> >> 21:32:01 ERROR|   File
> >> "/home/travis/build/philmd/qemu/python/qemu/machine.py", line 413, in
> >> _soft_shutdown
> >> 21:32:01 ERROR|     self._qmp.cmd('quit')
> >> 21:32:01 ERROR|   File
> >> "/home/travis/build/philmd/qemu/python/qemu/qmp.py", line 271, in cmd
> >> 21:32:01 ERROR|     return self.cmd_obj(qmp_cmd)
> >> 21:32:01 ERROR|   File
> >> "/home/travis/build/philmd/qemu/python/qemu/qmp.py", line 249, in cmd_obj
> >> 21:32:01 ERROR|     self.__sock.sendall(json.dumps(qmp_cmd).encode('utf-8'))
> >> 21:32:01 ERROR| BrokenPipeError: [Errno 32] Broken pipe
> >> 21:32:01 ERROR|
> >> 21:32:01 DEBUG| DATA (filename=output.expected) => NOT FOUND (data
> >> sources: variant, test, file)
> >> 21:32:01 DEBUG| DATA (filename=stdout.expected) => NOT FOUND (data
> >> sources: variant, test, file)
> >> 21:32:01 DEBUG| DATA (filename=stderr.expected) => NOT FOUND (data
> >> sources: variant, test, file)
> >> 21:32:01 DEBUG| Not logging /var/log/syslog (lack of permissions)
> >> 21:32:01 ERROR| ERROR
> >> 12-tests/acceptance/boot_linux_console.py:BootLinuxConsole.test_mips_malta_cpio
> >> -> TestSetupFail: [Errno 32] Broken pipe
> >> 21:32:01 INFO |
> >>
> >> https://travis-ci.org/github/philmd/qemu/jobs/696142277#L5329
> >>
> > 
> > Gotcha.
> > 
> > The problem here is that `test_mips_malta_cpio` in boot_linux_console.py
> > does this:
> > 
> >         self.vm.add_args('-kernel', kernel_path,
> >                          '-initrd', initrd_path,
> >                          '-append', kernel_command_line,
> >                          '-no-reboot')
> > 
> > and then:
> > 
> > exec_command_and_wait_for_pattern(self, 'reboot',
> >                          'reboot: Restarting system')
> > 
> > and (in avocado_qemu/) __init__.py does this:
> > 
> >     def tearDown(self):
> >         for vm in self._vms.values():
> >             vm.shutdown()
> > 
> > 
> > 
> > What's happening here is that we are instructing QEMU to *close* when
> > the guest reboots instead of allowing it to reboot. Then, we are issuing
> > a reboot command to the guest, which will effectively terminate QEMU as
> > well. Finally, we are trying to send a shutdown command to QEMU, but
> > QEMU has already gone.
> > 
> > Now, in the shutdown code, we do make an attempt to catch this:
> > 
> > def is_running(self):
> >     """Returns true if the VM is running."""
> >     return self._popen is not None and self._popen.poll() is None
> > 
> > But, well, race conditions.
> > 
> > When we make it here:
> > 
> >         if self._qmp is not None:
> >             if not has_quit:
> >                 self._qmp.cmd('quit')
> >             self._qmp.close()
> > 
> > We believe we are running and we believe we have an open QMP socket.
> > Attempting to engage the socket by sending 'quit' causes the error.
> > 
> > It's a tight window: if quit happens earlier, we send the command
> > successfully and everything's OK. If quit happens later, we realize QEMU
> > isn't running and proceed to cleanup.
> 
> Nice debugging :)
> 
> > 
> > Ultimately:
> > 
> > - Avocado should not try to shut down QEMU twice, but
> > - machine.py shouldn't enable the race condition either.
> > 
> > 
> > 
> > for my part, how about this:
> > 
> > diff --git a/python/qemu/machine.py b/python/qemu/machine.py
> > index 99bcb499878..813f8e477db 100644
> > --- a/python/qemu/machine.py
> > +++ b/python/qemu/machine.py
> > @@ -385,7 +385,14 @@ def _soft_shutdown(self, has_quit: bool = False,
> > timeout: int = 3) -> None:
> > 
> >          if self._qmp is not None:
> >              if not has_quit:
> > -                self._qmp.cmd('quit')
> > +                try:
> > +                    self._qmp.cmd('quit')
> > +                except (BrokenPipeError, ConnectionResetError):
> > +                    # QMP went away just before or just after sending
> > 'quit'
> > +                    if not self.is_running():
> > +                        # "Mission Accomplished"
> > +                        pass
> > +                    raise
> 
> Looks OK to me, Cleber/Eduardo can you Ack?
>

John's description seems spot on, and so does the solution.  I was
unable, though, to reproduce it given that the window is indeed tiny.

Phil,

In addition to this, wouldn't it make sense to:

diff --git a/tests/acceptance/boot_linux_console.py b/tests/acceptance/boot_linux_console.py
index 3d02519660..e117597e3b 100644
--- a/tests/acceptance/boot_linux_console.py
+++ b/tests/acceptance/boot_linux_console.py
@@ -180,8 +180,8 @@ class BootLinuxConsole(LinuxKernelTest):
                                + 'rdinit=/sbin/init noreboot')
         self.vm.add_args('-kernel', kernel_path,
                          '-initrd', initrd_path,
-                         '-append', kernel_command_line,
-                         '-no-reboot')
+                         '-append', kernel_command_line)
+
         self.vm.launch()
         self.wait_for_console_pattern('Boot successful.')
 
@@ -189,8 +189,8 @@ class BootLinuxConsole(LinuxKernelTest):
                                                 'BogoMIPS')
         exec_command_and_wait_for_pattern(self, 'uname -a',
                                                 'Debian')
-        exec_command_and_wait_for_pattern(self, 'reboot',
-                                                'reboot: Restarting system')
+        exec_command_and_wait_for_pattern(self, 'halt',
+                                                'reboot: System halted')
 
     @skipUnless(os.getenv('AVOCADO_ALLOW_UNTRUSTED_CODE'), 'untrusted code')
     def test_mips64el_malta_5KEc_cpio(self):

Do you know of any reason to "fake halt" (reboot) instead of actually
halting from the guests?

> >              self._qmp.close()
> > 
> >          self._popen.wait(timeout=timeout)
> > 
> 

Either way (^ with this extra block):

Reviewed-by: Cleber Rosa <crosa@redhat.com>
Tested-by: Cleber Rosa <crosa@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 0/3] python/machine.py: refactor shutdown
  2020-06-16 21:49       ` Cleber Rosa
@ 2020-06-17 17:13         ` Philippe Mathieu-Daudé
  2020-06-20 10:14           ` Philippe Mathieu-Daudé
  0 siblings, 1 reply; 20+ messages in thread
From: Philippe Mathieu-Daudé @ 2020-06-17 17:13 UTC (permalink / raw)
  To: Cleber Rosa
  Cc: kwolf, Markus Armbruster, John Snow, qemu-devel, Eduardo Habkost

On 6/16/20 11:49 PM, Cleber Rosa wrote:
> On Mon, Jun 15, 2020 at 05:21:18PM +0200, Philippe Mathieu-Daudé wrote:
>> On 6/9/20 11:55 PM, John Snow wrote:
>>>
>>>
>>> On 6/9/20 9:08 AM, Philippe Mathieu-Daudé wrote:
>>>> Hi John,
>>>>
>>>> On 6/4/20 9:52 PM, John Snow wrote:
>>>>> v3:
>>>>>  - Split _post_shutdown refactor into own patch (now 1/3)
>>>>>  - Re-add sigkill warning squelch (now 3/3)
>>>>>
>>>>> NOTE: I re-added the squelch in its own patch for review purposes, but
>>>>> for the purposes of avoiding temporary breakage, a maintainer may wish
>>>>> to squash patches 2 and 3 if they are accepted.
>>>>>
>>>>> v2: Philippe took patches 1, 3 and 4.
>>>>>
>>>>> This is a re-write of what was:
>>>>> [PATCH RFC 03/32] python//machine.py: remove bare except
>>>>> [PATCH 2/4] python/machine.py: remove bare except
>>>>>
>>>>> It's a bit heavier handed, but it should address some of kwolf's
>>>>> feedback from the RFC version.
>>>>>
>>>>> Applies straight to origin/master, ought to pass pylint and flake8:
>>>>>
>>>>>> cd ~/src/qemu/python/qemu
>>>>>> pylint *.py
>>>>>> flake8 *.py
>>>>>
>>>>> John Snow (3):
>>>>>   python/machine.py: consolidate _post_shutdown()
>>>>>   python/machine.py: refactor shutdown
>>>>>   python/machine.py: re-add sigkill warning suppression
>>>>>
>>>>>  python/qemu/machine.py | 100 +++++++++++++++++++++++++++++------------
>>>>>  1 file changed, 71 insertions(+), 29 deletions(-)
>>>>>
>>>>
>>>> I'm now seeing this error:
>>>>
>>>> 21:31:58 DEBUG| / # reboot
>>>> 21:32:01 DEBUG| / # reboot: Restarting system
>>>> 21:32:01 DEBUG| >>> {'execute': 'quit'}
>>>> 21:32:01 WARNI| qemu received signal 9; command:
>>>> "mips-softmmu/qemu-system-mips -display none -vga none -chardev
>>>> socket,id=mon,path=/tmp/tmp679upvrk/qemu-10292-monitor.sock -mon
>>>> chardev=mon,mode=control -machine malta -chardev
>>>> socket,id=console,path=/tmp/tmp679upvrk/qemu-10292-console.sock,server,nowait
>>>> -serial chardev:console -kernel
>>>> /tmp/avocado_b3aaagr9/avocado_job_5bj0xe1h/12-tests_acceptance_boot_linux_console.py_BootLinuxConsole.test_mips_malta_cpio/boot/vmlinux-4.5.0-2-4kc-malta
>>>> -initrd
>>>> /tmp/avocado_b3aaagr9/avocado_job_5bj0xe1h/12-tests_acceptance_boot_linux_console.py_BootLinuxConsole.test_mips_malta_cpiorootfs.cpio
>>>> -append printk.time=0 console=ttyS0 console=tty rdinit=/sbin/init
>>>> noreboot -no-reboot"
>>>> 21:32:01 ERROR|
>>>> 21:32:01 ERROR| Reproduced traceback from:
>>>> /home/travis/build/philmd/qemu/build/tests/venv/lib/python3.6/site-packages/avocado/core/test.py:886
>>>> 21:32:01 ERROR| Traceback (most recent call last):
>>>> 21:32:01 ERROR|   File
>>>> "/home/travis/build/philmd/qemu/build/tests/acceptance/avocado_qemu/__init__.py",
>>>> line 195, in tearDown
>>>> 21:32:01 ERROR|     vm.shutdown()
>>>> 21:32:01 ERROR|   File
>>>> "/home/travis/build/philmd/qemu/python/qemu/machine.py", line 449, in
>>>> shutdown
>>>> 21:32:01 ERROR|     self._do_shutdown(has_quit)
>>>> 21:32:01 ERROR|   File
>>>> "/home/travis/build/philmd/qemu/python/qemu/machine.py", line 426, in
>>>> _do_shutdown
>>>> 21:32:01 ERROR|     self._soft_shutdown(has_quit, timeout)
>>>> 21:32:01 ERROR|   File
>>>> "/home/travis/build/philmd/qemu/python/qemu/machine.py", line 413, in
>>>> _soft_shutdown
>>>> 21:32:01 ERROR|     self._qmp.cmd('quit')
>>>> 21:32:01 ERROR|   File
>>>> "/home/travis/build/philmd/qemu/python/qemu/qmp.py", line 271, in cmd
>>>> 21:32:01 ERROR|     return self.cmd_obj(qmp_cmd)
>>>> 21:32:01 ERROR|   File
>>>> "/home/travis/build/philmd/qemu/python/qemu/qmp.py", line 249, in cmd_obj
>>>> 21:32:01 ERROR|     self.__sock.sendall(json.dumps(qmp_cmd).encode('utf-8'))
>>>> 21:32:01 ERROR| BrokenPipeError: [Errno 32] Broken pipe
>>>> 21:32:01 ERROR|
>>>> 21:32:01 DEBUG| DATA (filename=output.expected) => NOT FOUND (data
>>>> sources: variant, test, file)
>>>> 21:32:01 DEBUG| DATA (filename=stdout.expected) => NOT FOUND (data
>>>> sources: variant, test, file)
>>>> 21:32:01 DEBUG| DATA (filename=stderr.expected) => NOT FOUND (data
>>>> sources: variant, test, file)
>>>> 21:32:01 DEBUG| Not logging /var/log/syslog (lack of permissions)
>>>> 21:32:01 ERROR| ERROR
>>>> 12-tests/acceptance/boot_linux_console.py:BootLinuxConsole.test_mips_malta_cpio
>>>> -> TestSetupFail: [Errno 32] Broken pipe
>>>> 21:32:01 INFO |
>>>>
>>>> https://travis-ci.org/github/philmd/qemu/jobs/696142277#L5329
>>>>
>>>
>>> Gotcha.
>>>
>>> The problem here is that `test_mips_malta_cpio` in boot_linux_console.py
>>> does this:
>>>
>>>         self.vm.add_args('-kernel', kernel_path,
>>>                          '-initrd', initrd_path,
>>>                          '-append', kernel_command_line,
>>>                          '-no-reboot')
>>>
>>> and then:
>>>
>>> exec_command_and_wait_for_pattern(self, 'reboot',
>>>                          'reboot: Restarting system')
>>>
>>> and (in avocado_qemu/) __init__.py does this:
>>>
>>>     def tearDown(self):
>>>         for vm in self._vms.values():
>>>             vm.shutdown()
>>>
>>>
>>>
>>> What's happening here is that we are instructing QEMU to *close* when
>>> the guest reboots instead of allowing it to reboot. Then, we are issuing
>>> a reboot command to the guest, which will effectively terminate QEMU as
>>> well. Finally, we are trying to send a shutdown command to QEMU, but
>>> QEMU has already gone.
>>>
>>> Now, in the shutdown code, we do make an attempt to catch this:
>>>
>>> def is_running(self):
>>>     """Returns true if the VM is running."""
>>>     return self._popen is not None and self._popen.poll() is None
>>>
>>> But, well, race conditions.
>>>
>>> When we make it here:
>>>
>>>         if self._qmp is not None:
>>>             if not has_quit:
>>>                 self._qmp.cmd('quit')
>>>             self._qmp.close()
>>>
>>> We believe we are running and we believe we have an open QMP socket.
>>> Attempting to engage the socket by sending 'quit' causes the error.
>>>
>>> It's a tight window: if quit happens earlier, we send the command
>>> successfully and everything's OK. If quit happens later, we realize QEMU
>>> isn't running and proceed to cleanup.
>>
>> Nice debugging :)
>>
>>>
>>> Ultimately:
>>>
>>> - Avocado should not try to shut down QEMU twice, but
>>> - machine.py shouldn't enable the race condition either.
>>>
>>>
>>>
>>> for my part, how about this:
>>>
>>> diff --git a/python/qemu/machine.py b/python/qemu/machine.py
>>> index 99bcb499878..813f8e477db 100644
>>> --- a/python/qemu/machine.py
>>> +++ b/python/qemu/machine.py
>>> @@ -385,7 +385,14 @@ def _soft_shutdown(self, has_quit: bool = False,
>>> timeout: int = 3) -> None:
>>>
>>>          if self._qmp is not None:
>>>              if not has_quit:
>>> -                self._qmp.cmd('quit')
>>> +                try:
>>> +                    self._qmp.cmd('quit')
>>> +                except (BrokenPipeError, ConnectionResetError):
>>> +                    # QMP went away just before or just after sending
>>> 'quit'
>>> +                    if not self.is_running():
>>> +                        # "Mission Accomplished"
>>> +                        pass
>>> +                    raise
>>
>> Looks OK to me, Cleber/Eduardo can you Ack?
>>
> 
> John's description seems spot on, and so does the solution.  I was
> unable, though, to reproduce it given that the window is indeed tiny.
> 
> Phil,
> 
> In addition to this, wouldn't it make sense to:
> 
> diff --git a/tests/acceptance/boot_linux_console.py b/tests/acceptance/boot_linux_console.py
> index 3d02519660..e117597e3b 100644
> --- a/tests/acceptance/boot_linux_console.py
> +++ b/tests/acceptance/boot_linux_console.py
> @@ -180,8 +180,8 @@ class BootLinuxConsole(LinuxKernelTest):
>                                 + 'rdinit=/sbin/init noreboot')
>          self.vm.add_args('-kernel', kernel_path,
>                           '-initrd', initrd_path,
> -                         '-append', kernel_command_line,
> -                         '-no-reboot')
> +                         '-append', kernel_command_line)

Very few machines implement the hardware shutdown (sending
the SHUTDOWN_CAUSE_GUEST_SHUTDOWN event). AFAIK The
SHUTDOWN_CAUSE_GUEST_RESET is a bit more present.
When the hardware doesn't model this, the guest loops
infinitely. Some physical machines are not supposed to
power off, until cutting its power. So firmware also loops
without trying to power things off.

'-no-reboot' means "if the guest tries to reboot, consider
the emulation done".

> +
>          self.vm.launch()
>          self.wait_for_console_pattern('Boot successful.')
>  
> @@ -189,8 +189,8 @@ class BootLinuxConsole(LinuxKernelTest):
>                                                  'BogoMIPS')
>          exec_command_and_wait_for_pattern(self, 'uname -a',
>                                                  'Debian')
> -        exec_command_and_wait_for_pattern(self, 'reboot',
> -                                                'reboot: Restarting system')
> +        exec_command_and_wait_for_pattern(self, 'halt',
> +                                                'reboot: System halted')
>  
>      @skipUnless(os.getenv('AVOCADO_ALLOW_UNTRUSTED_CODE'), 'untrusted code')
>      def test_mips64el_malta_5KEc_cpio(self):
> 
> Do you know of any reason to "fake halt" (reboot) instead of actually
> halting from the guests?

This might be my embedded system reflex (explained earlier, some
hardware can not power itself down). I will give your diff a try.

> 
>>>              self._qmp.close()
>>>
>>>          self._popen.wait(timeout=timeout)
>>>
>>
> 
> Either way (^ with this extra block):
> 
> Reviewed-by: Cleber Rosa <crosa@redhat.com>
> Tested-by: Cleber Rosa <crosa@redhat.com>

Thanks!



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 0/3] python/machine.py: refactor shutdown
  2020-06-04 19:52 [PATCH v3 0/3] python/machine.py: refactor shutdown John Snow
                   ` (4 preceding siblings ...)
  2020-06-09 13:08 ` Philippe Mathieu-Daudé
@ 2020-06-17 17:15 ` Philippe Mathieu-Daudé
  5 siblings, 0 replies; 20+ messages in thread
From: Philippe Mathieu-Daudé @ 2020-06-17 17:15 UTC (permalink / raw)
  To: John Snow, qemu-devel
  Cc: kwolf, Markus Armbruster, Eduardo Habkost, Cleber Rosa

On 6/4/20 9:52 PM, John Snow wrote:
> v3:
>  - Split _post_shutdown refactor into own patch (now 1/3)
>  - Re-add sigkill warning squelch (now 3/3)
> 
> NOTE: I re-added the squelch in its own patch for review purposes, but
> for the purposes of avoiding temporary breakage, a maintainer may wish
> to squash patches 2 and 3 if they are accepted.
> 
> v2: Philippe took patches 1, 3 and 4.
> 
> This is a re-write of what was:
> [PATCH RFC 03/32] python//machine.py: remove bare except
> [PATCH 2/4] python/machine.py: remove bare except
> 
> It's a bit heavier handed, but it should address some of kwolf's
> feedback from the RFC version.
> 
> Applies straight to origin/master, ought to pass pylint and flake8:
> 
>> cd ~/src/qemu/python/qemu
>> pylint *.py
>> flake8 *.py
> 
> John Snow (3):
>   python/machine.py: consolidate _post_shutdown()
>   python/machine.py: refactor shutdown
>   python/machine.py: re-add sigkill warning suppression
> 
>  python/qemu/machine.py | 100 +++++++++++++++++++++++++++++------------
>  1 file changed, 71 insertions(+), 29 deletions(-)
> 

I had prepared:

Thanks, applied to my python-next tree (including your
diff in python/qemu/machine.py):
https://gitlab.com/philmd/qemu/commits/python-next

But I just noticed Cleber review comment asking for more testing,
so I'm keeping this applied, but will delay the pull request
since I have to re-run the testing.

Phil.



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 0/3] python/machine.py: refactor shutdown
  2020-06-17 17:13         ` Philippe Mathieu-Daudé
@ 2020-06-20 10:14           ` Philippe Mathieu-Daudé
  2020-06-22 17:51             ` John Snow
  2020-06-26 17:28             ` John Snow
  0 siblings, 2 replies; 20+ messages in thread
From: Philippe Mathieu-Daudé @ 2020-06-20 10:14 UTC (permalink / raw)
  To: Cleber Rosa
  Cc: kwolf, Markus Armbruster, John Snow, qemu-devel, Eduardo Habkost

On 6/17/20 7:13 PM, Philippe Mathieu-Daudé wrote:
> On 6/16/20 11:49 PM, Cleber Rosa wrote:
>> On Mon, Jun 15, 2020 at 05:21:18PM +0200, Philippe Mathieu-Daudé wrote:
>>> On 6/9/20 11:55 PM, John Snow wrote:
>>>>
>>>>
>>>> On 6/9/20 9:08 AM, Philippe Mathieu-Daudé wrote:
>>>>> Hi John,
>>>>>
>>>>> On 6/4/20 9:52 PM, John Snow wrote:
>>>>>> v3:
>>>>>>  - Split _post_shutdown refactor into own patch (now 1/3)
>>>>>>  - Re-add sigkill warning squelch (now 3/3)
>>>>>>
>>>>>> NOTE: I re-added the squelch in its own patch for review purposes, but
>>>>>> for the purposes of avoiding temporary breakage, a maintainer may wish
>>>>>> to squash patches 2 and 3 if they are accepted.
>>>>>>
>>>>>> v2: Philippe took patches 1, 3 and 4.
>>>>>>
>>>>>> This is a re-write of what was:
>>>>>> [PATCH RFC 03/32] python//machine.py: remove bare except
>>>>>> [PATCH 2/4] python/machine.py: remove bare except
>>>>>>
>>>>>> It's a bit heavier handed, but it should address some of kwolf's
>>>>>> feedback from the RFC version.
>>>>>>
>>>>>> Applies straight to origin/master, ought to pass pylint and flake8:
>>>>>>
>>>>>>> cd ~/src/qemu/python/qemu
>>>>>>> pylint *.py
>>>>>>> flake8 *.py
>>>>>>
>>>>>> John Snow (3):
>>>>>>   python/machine.py: consolidate _post_shutdown()
>>>>>>   python/machine.py: refactor shutdown
>>>>>>   python/machine.py: re-add sigkill warning suppression
>>>>>>
>>>>>>  python/qemu/machine.py | 100 +++++++++++++++++++++++++++++------------
>>>>>>  1 file changed, 71 insertions(+), 29 deletions(-)
>>>>>>
>>>>>
>>>>> I'm now seeing this error:
>>>>>
>>>>> 21:31:58 DEBUG| / # reboot
>>>>> 21:32:01 DEBUG| / # reboot: Restarting system
>>>>> 21:32:01 DEBUG| >>> {'execute': 'quit'}
>>>>> 21:32:01 WARNI| qemu received signal 9; command:
>>>>> "mips-softmmu/qemu-system-mips -display none -vga none -chardev
>>>>> socket,id=mon,path=/tmp/tmp679upvrk/qemu-10292-monitor.sock -mon
>>>>> chardev=mon,mode=control -machine malta -chardev
>>>>> socket,id=console,path=/tmp/tmp679upvrk/qemu-10292-console.sock,server,nowait
>>>>> -serial chardev:console -kernel
>>>>> /tmp/avocado_b3aaagr9/avocado_job_5bj0xe1h/12-tests_acceptance_boot_linux_console.py_BootLinuxConsole.test_mips_malta_cpio/boot/vmlinux-4.5.0-2-4kc-malta
>>>>> -initrd
>>>>> /tmp/avocado_b3aaagr9/avocado_job_5bj0xe1h/12-tests_acceptance_boot_linux_console.py_BootLinuxConsole.test_mips_malta_cpiorootfs.cpio
>>>>> -append printk.time=0 console=ttyS0 console=tty rdinit=/sbin/init
>>>>> noreboot -no-reboot"
>>>>> 21:32:01 ERROR|
>>>>> 21:32:01 ERROR| Reproduced traceback from:
>>>>> /home/travis/build/philmd/qemu/build/tests/venv/lib/python3.6/site-packages/avocado/core/test.py:886
>>>>> 21:32:01 ERROR| Traceback (most recent call last):
>>>>> 21:32:01 ERROR|   File
>>>>> "/home/travis/build/philmd/qemu/build/tests/acceptance/avocado_qemu/__init__.py",
>>>>> line 195, in tearDown
>>>>> 21:32:01 ERROR|     vm.shutdown()
>>>>> 21:32:01 ERROR|   File
>>>>> "/home/travis/build/philmd/qemu/python/qemu/machine.py", line 449, in
>>>>> shutdown
>>>>> 21:32:01 ERROR|     self._do_shutdown(has_quit)
>>>>> 21:32:01 ERROR|   File
>>>>> "/home/travis/build/philmd/qemu/python/qemu/machine.py", line 426, in
>>>>> _do_shutdown
>>>>> 21:32:01 ERROR|     self._soft_shutdown(has_quit, timeout)
>>>>> 21:32:01 ERROR|   File
>>>>> "/home/travis/build/philmd/qemu/python/qemu/machine.py", line 413, in
>>>>> _soft_shutdown
>>>>> 21:32:01 ERROR|     self._qmp.cmd('quit')
>>>>> 21:32:01 ERROR|   File
>>>>> "/home/travis/build/philmd/qemu/python/qemu/qmp.py", line 271, in cmd
>>>>> 21:32:01 ERROR|     return self.cmd_obj(qmp_cmd)
>>>>> 21:32:01 ERROR|   File
>>>>> "/home/travis/build/philmd/qemu/python/qemu/qmp.py", line 249, in cmd_obj
>>>>> 21:32:01 ERROR|     self.__sock.sendall(json.dumps(qmp_cmd).encode('utf-8'))
>>>>> 21:32:01 ERROR| BrokenPipeError: [Errno 32] Broken pipe
>>>>> 21:32:01 ERROR|
>>>>> 21:32:01 DEBUG| DATA (filename=output.expected) => NOT FOUND (data
>>>>> sources: variant, test, file)
>>>>> 21:32:01 DEBUG| DATA (filename=stdout.expected) => NOT FOUND (data
>>>>> sources: variant, test, file)
>>>>> 21:32:01 DEBUG| DATA (filename=stderr.expected) => NOT FOUND (data
>>>>> sources: variant, test, file)
>>>>> 21:32:01 DEBUG| Not logging /var/log/syslog (lack of permissions)
>>>>> 21:32:01 ERROR| ERROR
>>>>> 12-tests/acceptance/boot_linux_console.py:BootLinuxConsole.test_mips_malta_cpio
>>>>> -> TestSetupFail: [Errno 32] Broken pipe
>>>>> 21:32:01 INFO |
>>>>>
>>>>> https://travis-ci.org/github/philmd/qemu/jobs/696142277#L5329
>>>>>
>>>>
>>>> Gotcha.
>>>>
>>>> The problem here is that `test_mips_malta_cpio` in boot_linux_console.py
>>>> does this:
>>>>
>>>>         self.vm.add_args('-kernel', kernel_path,
>>>>                          '-initrd', initrd_path,
>>>>                          '-append', kernel_command_line,
>>>>                          '-no-reboot')
>>>>
>>>> and then:
>>>>
>>>> exec_command_and_wait_for_pattern(self, 'reboot',
>>>>                          'reboot: Restarting system')
>>>>
>>>> and (in avocado_qemu/) __init__.py does this:
>>>>
>>>>     def tearDown(self):
>>>>         for vm in self._vms.values():
>>>>             vm.shutdown()
>>>>
>>>>
>>>>
>>>> What's happening here is that we are instructing QEMU to *close* when
>>>> the guest reboots instead of allowing it to reboot. Then, we are issuing
>>>> a reboot command to the guest, which will effectively terminate QEMU as
>>>> well. Finally, we are trying to send a shutdown command to QEMU, but
>>>> QEMU has already gone.
>>>>
>>>> Now, in the shutdown code, we do make an attempt to catch this:
>>>>
>>>> def is_running(self):
>>>>     """Returns true if the VM is running."""
>>>>     return self._popen is not None and self._popen.poll() is None
>>>>
>>>> But, well, race conditions.
>>>>
>>>> When we make it here:
>>>>
>>>>         if self._qmp is not None:
>>>>             if not has_quit:
>>>>                 self._qmp.cmd('quit')
>>>>             self._qmp.close()
>>>>
>>>> We believe we are running and we believe we have an open QMP socket.
>>>> Attempting to engage the socket by sending 'quit' causes the error.
>>>>
>>>> It's a tight window: if quit happens earlier, we send the command
>>>> successfully and everything's OK. If quit happens later, we realize QEMU
>>>> isn't running and proceed to cleanup.
>>>
>>> Nice debugging :)
>>>
>>>>
>>>> Ultimately:
>>>>
>>>> - Avocado should not try to shut down QEMU twice, but
>>>> - machine.py shouldn't enable the race condition either.
>>>>
>>>>
>>>>
>>>> for my part, how about this:
>>>>
>>>> diff --git a/python/qemu/machine.py b/python/qemu/machine.py
>>>> index 99bcb499878..813f8e477db 100644
>>>> --- a/python/qemu/machine.py
>>>> +++ b/python/qemu/machine.py
>>>> @@ -385,7 +385,14 @@ def _soft_shutdown(self, has_quit: bool = False,
>>>> timeout: int = 3) -> None:
>>>>
>>>>          if self._qmp is not None:
>>>>              if not has_quit:
>>>> -                self._qmp.cmd('quit')
>>>> +                try:
>>>> +                    self._qmp.cmd('quit')
>>>> +                except (BrokenPipeError, ConnectionResetError):
>>>> +                    # QMP went away just before or just after sending
>>>> 'quit'
>>>> +                    if not self.is_running():
>>>> +                        # "Mission Accomplished"
>>>> +                        pass
>>>> +                    raise
>>>
>>> Looks OK to me, Cleber/Eduardo can you Ack?
>>>
>>
>> John's description seems spot on, and so does the solution.  I was
>> unable, though, to reproduce it given that the window is indeed tiny.

Still doesn't work...
https://travis-ci.org/github/philmd/qemu/jobs/700309163#L5445

>>
>> Phil,
>>
>> In addition to this, wouldn't it make sense to:
>>
>> diff --git a/tests/acceptance/boot_linux_console.py b/tests/acceptance/boot_linux_console.py
>> index 3d02519660..e117597e3b 100644
>> --- a/tests/acceptance/boot_linux_console.py
>> +++ b/tests/acceptance/boot_linux_console.py
>> @@ -180,8 +180,8 @@ class BootLinuxConsole(LinuxKernelTest):
>>                                 + 'rdinit=/sbin/init noreboot')
>>          self.vm.add_args('-kernel', kernel_path,
>>                           '-initrd', initrd_path,
>> -                         '-append', kernel_command_line,
>> -                         '-no-reboot')
>> +                         '-append', kernel_command_line)
> 
> Very few machines implement the hardware shutdown (sending
> the SHUTDOWN_CAUSE_GUEST_SHUTDOWN event). AFAIK The
> SHUTDOWN_CAUSE_GUEST_RESET is a bit more present.
> When the hardware doesn't model this, the guest loops
> infinitely. Some physical machines are not supposed to
> power off, until cutting its power. So firmware also loops
> without trying to power things off.
> 
> '-no-reboot' means "if the guest tries to reboot, consider
> the emulation done".
> 
>> +
>>          self.vm.launch()
>>          self.wait_for_console_pattern('Boot successful.')
>>  
>> @@ -189,8 +189,8 @@ class BootLinuxConsole(LinuxKernelTest):
>>                                                  'BogoMIPS')
>>          exec_command_and_wait_for_pattern(self, 'uname -a',
>>                                                  'Debian')
>> -        exec_command_and_wait_for_pattern(self, 'reboot',
>> -                                                'reboot: Restarting system')
>> +        exec_command_and_wait_for_pattern(self, 'halt',
>> +                                                'reboot: System halted')
>>  
>>      @skipUnless(os.getenv('AVOCADO_ALLOW_UNTRUSTED_CODE'), 'untrusted code')
>>      def test_mips64el_malta_5KEc_cpio(self):
>>
>> Do you know of any reason to "fake halt" (reboot) instead of actually
>> halting from the guests?
> 
> This might be my embedded system reflex (explained earlier, some
> hardware can not power itself down). I will give your diff a try.
> 
>>
>>>>              self._qmp.close()
>>>>
>>>>          self._popen.wait(timeout=timeout)
>>>>
>>>
>>
>> Either way (^ with this extra block):
>>
>> Reviewed-by: Cleber Rosa <crosa@redhat.com>
>> Tested-by: Cleber Rosa <crosa@redhat.com>
> 
> Thanks!
> 



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 0/3] python/machine.py: refactor shutdown
  2020-06-20 10:14           ` Philippe Mathieu-Daudé
@ 2020-06-22 17:51             ` John Snow
  2020-06-22 17:55               ` Philippe Mathieu-Daudé
  2020-06-26 17:28             ` John Snow
  1 sibling, 1 reply; 20+ messages in thread
From: John Snow @ 2020-06-22 17:51 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé, Cleber Rosa
  Cc: kwolf, Markus Armbruster, qemu-devel, Eduardo Habkost



On 6/20/20 6:14 AM, Philippe Mathieu-Daudé wrote:
> Still doesn't work...
> https://travis-ci.org/github/philmd/qemu/jobs/700309163#L5445

I tried running from master with just these patches applied. So far, the
only failures I am seeing on gitlab/travis are job timeouts.

Are you doing something special to avoid timeout failures?

--js



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 0/3] python/machine.py: refactor shutdown
  2020-06-22 17:51             ` John Snow
@ 2020-06-22 17:55               ` Philippe Mathieu-Daudé
  2020-06-22 18:08                 ` John Snow
  0 siblings, 1 reply; 20+ messages in thread
From: Philippe Mathieu-Daudé @ 2020-06-22 17:55 UTC (permalink / raw)
  To: John Snow, Cleber Rosa
  Cc: kwolf, Markus Armbruster, qemu-devel, Eduardo Habkost

On 6/22/20 7:51 PM, John Snow wrote:
> 
> 
> On 6/20/20 6:14 AM, Philippe Mathieu-Daudé wrote:
>> Still doesn't work...
>> https://travis-ci.org/github/philmd/qemu/jobs/700309163#L5445
> 
> I tried running from master with just these patches applied. So far, the
> only failures I am seeing on gitlab/travis are job timeouts.
> 
> Are you doing something special to avoid timeout failures?

Unfortunately no :S Cleber tried too and couldn't reproduce,
I'm the only one lucky enough to win the race condition...

Retrying the acceptance tests on Travis-CI is painful, yes,
I know...



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 0/3] python/machine.py: refactor shutdown
  2020-06-22 17:55               ` Philippe Mathieu-Daudé
@ 2020-06-22 18:08                 ` John Snow
  2020-06-22 18:28                   ` Philippe Mathieu-Daudé
  0 siblings, 1 reply; 20+ messages in thread
From: John Snow @ 2020-06-22 18:08 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé, Cleber Rosa
  Cc: kwolf, Markus Armbruster, qemu-devel, Eduardo Habkost



On 6/22/20 1:55 PM, Philippe Mathieu-Daudé wrote:
> On 6/22/20 7:51 PM, John Snow wrote:
>>
>>
>> On 6/20/20 6:14 AM, Philippe Mathieu-Daudé wrote:
>>> Still doesn't work...
>>> https://travis-ci.org/github/philmd/qemu/jobs/700309163#L5445
>>
>> I tried running from master with just these patches applied. So far, the
>> only failures I am seeing on gitlab/travis are job timeouts.
>>
>> Are you doing something special to avoid timeout failures?
> 
> Unfortunately no :S Cleber tried too and couldn't reproduce,
> I'm the only one lucky enough to win the race condition...
> 
> Retrying the acceptance tests on Travis-CI is painful, yes,
> I know...
> 

You didn't address the timeouts I asked about. Are you not seeing job
timeouts? How is the CI passing at all, for anyone?

We seem to have bigger problems here.



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 0/3] python/machine.py: refactor shutdown
  2020-06-22 18:08                 ` John Snow
@ 2020-06-22 18:28                   ` Philippe Mathieu-Daudé
  0 siblings, 0 replies; 20+ messages in thread
From: Philippe Mathieu-Daudé @ 2020-06-22 18:28 UTC (permalink / raw)
  To: John Snow, Cleber Rosa, Alex Bennée
  Cc: kwolf, Markus Armbruster, qemu-devel, Eduardo Habkost

+Alex

On 6/22/20 8:08 PM, John Snow wrote:
> On 6/22/20 1:55 PM, Philippe Mathieu-Daudé wrote:
>> On 6/22/20 7:51 PM, John Snow wrote:
>>>
>>>
>>> On 6/20/20 6:14 AM, Philippe Mathieu-Daudé wrote:
>>>> Still doesn't work...
>>>> https://travis-ci.org/github/philmd/qemu/jobs/700309163#L5445
>>>
>>> I tried running from master with just these patches applied. So far, the
>>> only failures I am seeing on gitlab/travis are job timeouts.
>>>
>>> Are you doing something special to avoid timeout failures?
>>
>> Unfortunately no :S Cleber tried too and couldn't reproduce,
>> I'm the only one lucky enough to win the race condition...
>>
>> Retrying the acceptance tests on Travis-CI is painful, yes,
>> I know...
>>
> 
> You didn't address the timeouts I asked about. Are you not seeing job
> timeouts? How is the CI passing at all, for anyone?
> 
> We seem to have bigger problems here.

OK, your problem/question was not obvious.

What helped me to is to know you setup'd your Travis-CI
account today.
At some point the jobs were taking too long, by caching
the ccache directory between runs, we could get quicker
builds.

My guess is as your setup is new, your cache is empty,
and you are reaching the 50min timeout...

If this is the case, I'm not sure how to resolve this.
Retry and retry again?



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 0/3] python/machine.py: refactor shutdown
  2020-06-20 10:14           ` Philippe Mathieu-Daudé
  2020-06-22 17:51             ` John Snow
@ 2020-06-26 17:28             ` John Snow
  1 sibling, 0 replies; 20+ messages in thread
From: John Snow @ 2020-06-26 17:28 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé, Cleber Rosa
  Cc: kwolf, Markus Armbruster, qemu-devel, Eduardo Habkost



On 6/20/20 6:14 AM, Philippe Mathieu-Daudé wrote:
> On 6/17/20 7:13 PM, Philippe Mathieu-Daudé wrote:
>> On 6/16/20 11:49 PM, Cleber Rosa wrote:
>>> On Mon, Jun 15, 2020 at 05:21:18PM +0200, Philippe Mathieu-Daudé wrote:
>>>> On 6/9/20 11:55 PM, John Snow wrote:
>>>>>
>>>>>
>>>>> On 6/9/20 9:08 AM, Philippe Mathieu-Daudé wrote:
>>>>>> Hi John,
>>>>>>
>>>>>> On 6/4/20 9:52 PM, John Snow wrote:
>>>>>>> v3:
>>>>>>>  - Split _post_shutdown refactor into own patch (now 1/3)
>>>>>>>  - Re-add sigkill warning squelch (now 3/3)
>>>>>>>
>>>>>>> NOTE: I re-added the squelch in its own patch for review purposes, but
>>>>>>> for the purposes of avoiding temporary breakage, a maintainer may wish
>>>>>>> to squash patches 2 and 3 if they are accepted.
>>>>>>>
>>>>>>> v2: Philippe took patches 1, 3 and 4.
>>>>>>>
>>>>>>> This is a re-write of what was:
>>>>>>> [PATCH RFC 03/32] python//machine.py: remove bare except
>>>>>>> [PATCH 2/4] python/machine.py: remove bare except
>>>>>>>
>>>>>>> It's a bit heavier handed, but it should address some of kwolf's
>>>>>>> feedback from the RFC version.
>>>>>>>
>>>>>>> Applies straight to origin/master, ought to pass pylint and flake8:
>>>>>>>
>>>>>>>> cd ~/src/qemu/python/qemu
>>>>>>>> pylint *.py
>>>>>>>> flake8 *.py
>>>>>>>
>>>>>>> John Snow (3):
>>>>>>>   python/machine.py: consolidate _post_shutdown()
>>>>>>>   python/machine.py: refactor shutdown
>>>>>>>   python/machine.py: re-add sigkill warning suppression
>>>>>>>
>>>>>>>  python/qemu/machine.py | 100 +++++++++++++++++++++++++++++------------
>>>>>>>  1 file changed, 71 insertions(+), 29 deletions(-)
>>>>>>>
>>>>>>
>>>>>> I'm now seeing this error:
>>>>>>
>>>>>> 21:31:58 DEBUG| / # reboot
>>>>>> 21:32:01 DEBUG| / # reboot: Restarting system
>>>>>> 21:32:01 DEBUG| >>> {'execute': 'quit'}
>>>>>> 21:32:01 WARNI| qemu received signal 9; command:
>>>>>> "mips-softmmu/qemu-system-mips -display none -vga none -chardev
>>>>>> socket,id=mon,path=/tmp/tmp679upvrk/qemu-10292-monitor.sock -mon
>>>>>> chardev=mon,mode=control -machine malta -chardev
>>>>>> socket,id=console,path=/tmp/tmp679upvrk/qemu-10292-console.sock,server,nowait
>>>>>> -serial chardev:console -kernel
>>>>>> /tmp/avocado_b3aaagr9/avocado_job_5bj0xe1h/12-tests_acceptance_boot_linux_console.py_BootLinuxConsole.test_mips_malta_cpio/boot/vmlinux-4.5.0-2-4kc-malta
>>>>>> -initrd
>>>>>> /tmp/avocado_b3aaagr9/avocado_job_5bj0xe1h/12-tests_acceptance_boot_linux_console.py_BootLinuxConsole.test_mips_malta_cpiorootfs.cpio
>>>>>> -append printk.time=0 console=ttyS0 console=tty rdinit=/sbin/init
>>>>>> noreboot -no-reboot"
>>>>>> 21:32:01 ERROR|
>>>>>> 21:32:01 ERROR| Reproduced traceback from:
>>>>>> /home/travis/build/philmd/qemu/build/tests/venv/lib/python3.6/site-packages/avocado/core/test.py:886
>>>>>> 21:32:01 ERROR| Traceback (most recent call last):
>>>>>> 21:32:01 ERROR|   File
>>>>>> "/home/travis/build/philmd/qemu/build/tests/acceptance/avocado_qemu/__init__.py",
>>>>>> line 195, in tearDown
>>>>>> 21:32:01 ERROR|     vm.shutdown()
>>>>>> 21:32:01 ERROR|   File
>>>>>> "/home/travis/build/philmd/qemu/python/qemu/machine.py", line 449, in
>>>>>> shutdown
>>>>>> 21:32:01 ERROR|     self._do_shutdown(has_quit)
>>>>>> 21:32:01 ERROR|   File
>>>>>> "/home/travis/build/philmd/qemu/python/qemu/machine.py", line 426, in
>>>>>> _do_shutdown
>>>>>> 21:32:01 ERROR|     self._soft_shutdown(has_quit, timeout)
>>>>>> 21:32:01 ERROR|   File
>>>>>> "/home/travis/build/philmd/qemu/python/qemu/machine.py", line 413, in
>>>>>> _soft_shutdown
>>>>>> 21:32:01 ERROR|     self._qmp.cmd('quit')
>>>>>> 21:32:01 ERROR|   File
>>>>>> "/home/travis/build/philmd/qemu/python/qemu/qmp.py", line 271, in cmd
>>>>>> 21:32:01 ERROR|     return self.cmd_obj(qmp_cmd)
>>>>>> 21:32:01 ERROR|   File
>>>>>> "/home/travis/build/philmd/qemu/python/qemu/qmp.py", line 249, in cmd_obj
>>>>>> 21:32:01 ERROR|     self.__sock.sendall(json.dumps(qmp_cmd).encode('utf-8'))
>>>>>> 21:32:01 ERROR| BrokenPipeError: [Errno 32] Broken pipe
>>>>>> 21:32:01 ERROR|
>>>>>> 21:32:01 DEBUG| DATA (filename=output.expected) => NOT FOUND (data
>>>>>> sources: variant, test, file)
>>>>>> 21:32:01 DEBUG| DATA (filename=stdout.expected) => NOT FOUND (data
>>>>>> sources: variant, test, file)
>>>>>> 21:32:01 DEBUG| DATA (filename=stderr.expected) => NOT FOUND (data
>>>>>> sources: variant, test, file)
>>>>>> 21:32:01 DEBUG| Not logging /var/log/syslog (lack of permissions)
>>>>>> 21:32:01 ERROR| ERROR
>>>>>> 12-tests/acceptance/boot_linux_console.py:BootLinuxConsole.test_mips_malta_cpio
>>>>>> -> TestSetupFail: [Errno 32] Broken pipe
>>>>>> 21:32:01 INFO |
>>>>>>
>>>>>> https://travis-ci.org/github/philmd/qemu/jobs/696142277#L5329
>>>>>>
>>>>>
>>>>> Gotcha.
>>>>>
>>>>> The problem here is that `test_mips_malta_cpio` in boot_linux_console.py
>>>>> does this:
>>>>>
>>>>>         self.vm.add_args('-kernel', kernel_path,
>>>>>                          '-initrd', initrd_path,
>>>>>                          '-append', kernel_command_line,
>>>>>                          '-no-reboot')
>>>>>
>>>>> and then:
>>>>>
>>>>> exec_command_and_wait_for_pattern(self, 'reboot',
>>>>>                          'reboot: Restarting system')
>>>>>
>>>>> and (in avocado_qemu/) __init__.py does this:
>>>>>
>>>>>     def tearDown(self):
>>>>>         for vm in self._vms.values():
>>>>>             vm.shutdown()
>>>>>
>>>>>
>>>>>
>>>>> What's happening here is that we are instructing QEMU to *close* when
>>>>> the guest reboots instead of allowing it to reboot. Then, we are issuing
>>>>> a reboot command to the guest, which will effectively terminate QEMU as
>>>>> well. Finally, we are trying to send a shutdown command to QEMU, but
>>>>> QEMU has already gone.
>>>>>
>>>>> Now, in the shutdown code, we do make an attempt to catch this:
>>>>>
>>>>> def is_running(self):
>>>>>     """Returns true if the VM is running."""
>>>>>     return self._popen is not None and self._popen.poll() is None
>>>>>
>>>>> But, well, race conditions.
>>>>>
>>>>> When we make it here:
>>>>>
>>>>>         if self._qmp is not None:
>>>>>             if not has_quit:
>>>>>                 self._qmp.cmd('quit')
>>>>>             self._qmp.close()
>>>>>
>>>>> We believe we are running and we believe we have an open QMP socket.
>>>>> Attempting to engage the socket by sending 'quit' causes the error.
>>>>>
>>>>> It's a tight window: if quit happens earlier, we send the command
>>>>> successfully and everything's OK. If quit happens later, we realize QEMU
>>>>> isn't running and proceed to cleanup.
>>>>
>>>> Nice debugging :)
>>>>
>>>>>
>>>>> Ultimately:
>>>>>
>>>>> - Avocado should not try to shut down QEMU twice, but
>>>>> - machine.py shouldn't enable the race condition either.
>>>>>
>>>>>
>>>>>
>>>>> for my part, how about this:
>>>>>
>>>>> diff --git a/python/qemu/machine.py b/python/qemu/machine.py
>>>>> index 99bcb499878..813f8e477db 100644
>>>>> --- a/python/qemu/machine.py
>>>>> +++ b/python/qemu/machine.py
>>>>> @@ -385,7 +385,14 @@ def _soft_shutdown(self, has_quit: bool = False,
>>>>> timeout: int = 3) -> None:
>>>>>
>>>>>          if self._qmp is not None:
>>>>>              if not has_quit:
>>>>> -                self._qmp.cmd('quit')
>>>>> +                try:
>>>>> +                    self._qmp.cmd('quit')
>>>>> +                except (BrokenPipeError, ConnectionResetError):
>>>>> +                    # QMP went away just before or just after sending
>>>>> 'quit'
>>>>> +                    if not self.is_running():
>>>>> +                        # "Mission Accomplished"
>>>>> +                        pass
>>>>> +                    raise
>>>>
>>>> Looks OK to me, Cleber/Eduardo can you Ack?
>>>>
>>>
>>> John's description seems spot on, and so does the solution.  I was
>>> unable, though, to reproduce it given that the window is indeed tiny.
> 
> Still doesn't work...
> https://travis-ci.org/github/philmd/qemu/jobs/700309163#L5445
> 
I have overlooked a very fundamental mistake!

'pass' does not terminate execution of an exception handler...

It needs to be:

except (BrokenPipeError, ConnectionResetError):
    if self.is_running():
        raise

Will re-spin, sorry about that.

--js



^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2020-06-26 17:29 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-04 19:52 [PATCH v3 0/3] python/machine.py: refactor shutdown John Snow
2020-06-04 19:52 ` [PATCH v3 1/3] python/machine.py: consolidate _post_shutdown() John Snow
2020-06-04 19:52 ` [PATCH v3 2/3] python/machine.py: refactor shutdown John Snow
2020-06-04 19:52 ` [PATCH v3 3/3] python/machine.py: re-add sigkill warning suppression John Snow
2020-06-05  9:34 ` [PATCH v3 0/3] python/machine.py: refactor shutdown Kevin Wolf
2020-06-09 13:08 ` Philippe Mathieu-Daudé
2020-06-09 21:55   ` John Snow
2020-06-15 15:21     ` Philippe Mathieu-Daudé
2020-06-15 22:21       ` Eduardo Habkost
2020-06-16 15:06         ` John Snow
2020-06-16 15:36           ` Philippe Mathieu-Daudé
2020-06-16 21:49       ` Cleber Rosa
2020-06-17 17:13         ` Philippe Mathieu-Daudé
2020-06-20 10:14           ` Philippe Mathieu-Daudé
2020-06-22 17:51             ` John Snow
2020-06-22 17:55               ` Philippe Mathieu-Daudé
2020-06-22 18:08                 ` John Snow
2020-06-22 18:28                   ` Philippe Mathieu-Daudé
2020-06-26 17:28             ` John Snow
2020-06-17 17:15 ` Philippe Mathieu-Daudé

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.