All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] tests/avocado/reverse_debugging: Disable the ppc64 tests by default
@ 2023-11-14 16:31 Thomas Huth
  2023-11-14 16:35 ` Philippe Mathieu-Daudé
  2023-11-14 16:37 ` Philippe Mathieu-Daudé
  0 siblings, 2 replies; 22+ messages in thread
From: Thomas Huth @ 2023-11-14 16:31 UTC (permalink / raw)
  To: Alex Bennée, Philippe Mathieu-Daudé, qemu-devel
  Cc: Wainer dos Santos Moschetta, Beraldo Leal, Cleber Rosa,
	Pavel Dovgalyuk, Paolo Bonzini, Nicholas Piggin, qemu-ppc

The tests seem currently to be broken. Disable them by default
until someone fixes them.

Signed-off-by: Thomas Huth <thuth@redhat.com>
---
 tests/avocado/reverse_debugging.py | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/tests/avocado/reverse_debugging.py b/tests/avocado/reverse_debugging.py
index fc47874eda..2585c78f34 100644
--- a/tests/avocado/reverse_debugging.py
+++ b/tests/avocado/reverse_debugging.py
@@ -11,6 +11,7 @@
 import logging
 
 from avocado import skipIf
+from avocado import skipUnless
 from avocado_qemu import BUILD_DIR
 from avocado.utils import gdb
 from avocado.utils import process
@@ -241,8 +242,8 @@ class ReverseDebugging_ppc64(ReverseDebugging):
 
     REG_PC = 0x40
 
-    # unidentified gitlab timeout problem
-    @skipIf(os.getenv('GITLAB_CI'), 'Running on GitLab')
+    # Test seems to be broken right now
+    @skipUnless(os.getenv('QEMU_TEST_FLAKY_TESTS'), 'Test gets stuck')
     def test_ppc64_pseries(self):
         """
         :avocado: tags=arch:ppc64
@@ -254,7 +255,7 @@ def test_ppc64_pseries(self):
         self.endian_is_le = False
         self.reverse_debugging()
 
-    @skipIf(os.getenv('GITLAB_CI'), 'Running on GitLab')
+    @skipUnless(os.getenv('QEMU_TEST_FLAKY_TESTS'), 'Test often fails')
     def test_ppc64_powernv(self):
         """
         :avocado: tags=arch:ppc64
-- 
2.41.0



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH] tests/avocado/reverse_debugging: Disable the ppc64 tests by default
  2023-11-14 16:31 [PATCH] tests/avocado/reverse_debugging: Disable the ppc64 tests by default Thomas Huth
@ 2023-11-14 16:35 ` Philippe Mathieu-Daudé
  2023-11-14 16:37 ` Philippe Mathieu-Daudé
  1 sibling, 0 replies; 22+ messages in thread
From: Philippe Mathieu-Daudé @ 2023-11-14 16:35 UTC (permalink / raw)
  To: Thomas Huth, Alex Bennée, qemu-devel
  Cc: Wainer dos Santos Moschetta, Beraldo Leal, Cleber Rosa,
	Pavel Dovgalyuk, Paolo Bonzini, Nicholas Piggin, qemu-ppc

On 14/11/23 17:31, Thomas Huth wrote:
> The tests seem currently to be broken. Disable them by default
> until someone fixes them.
> 
> Signed-off-by: Thomas Huth <thuth@redhat.com>
> ---
>   tests/avocado/reverse_debugging.py | 7 ++++---
>   1 file changed, 4 insertions(+), 3 deletions(-)

Acked-by: Philippe Mathieu-Daudé <philmd@linaro.org>



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] tests/avocado/reverse_debugging: Disable the ppc64 tests by default
  2023-11-14 16:31 [PATCH] tests/avocado/reverse_debugging: Disable the ppc64 tests by default Thomas Huth
  2023-11-14 16:35 ` Philippe Mathieu-Daudé
@ 2023-11-14 16:37 ` Philippe Mathieu-Daudé
  2023-11-14 18:29   ` Thomas Huth
  1 sibling, 1 reply; 22+ messages in thread
From: Philippe Mathieu-Daudé @ 2023-11-14 16:37 UTC (permalink / raw)
  To: Thomas Huth, Alex Bennée, qemu-devel
  Cc: Wainer dos Santos Moschetta, Beraldo Leal, Cleber Rosa,
	Pavel Dovgalyuk, Paolo Bonzini, Nicholas Piggin, qemu-ppc

On 14/11/23 17:31, Thomas Huth wrote:
> The tests seem currently to be broken. Disable them by default
> until someone fixes them.
> 
> Signed-off-by: Thomas Huth <thuth@redhat.com>
> ---
>   tests/avocado/reverse_debugging.py | 7 ++++---
>   1 file changed, 4 insertions(+), 3 deletions(-)

Similarly, I suspect https://gitlab.com/qemu-project/qemu/-/issues/1961
which has a fix ready:
https://lore.kernel.org/qemu-devel/20231110170831.185001-1-richard.henderson@linaro.org/

Maybe wait the fix gets in first?


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] tests/avocado/reverse_debugging: Disable the ppc64 tests by default
  2023-11-14 16:37 ` Philippe Mathieu-Daudé
@ 2023-11-14 18:29   ` Thomas Huth
  2023-11-15  1:15     ` Nicholas Piggin
  0 siblings, 1 reply; 22+ messages in thread
From: Thomas Huth @ 2023-11-14 18:29 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé, Alex Bennée, qemu-devel
  Cc: Wainer dos Santos Moschetta, Beraldo Leal, Cleber Rosa,
	Pavel Dovgalyuk, Paolo Bonzini, Nicholas Piggin, qemu-ppc

On 14/11/2023 17.37, Philippe Mathieu-Daudé wrote:
> On 14/11/23 17:31, Thomas Huth wrote:
>> The tests seem currently to be broken. Disable them by default
>> until someone fixes them.
>>
>> Signed-off-by: Thomas Huth <thuth@redhat.com>
>> ---
>>   tests/avocado/reverse_debugging.py | 7 ++++---
>>   1 file changed, 4 insertions(+), 3 deletions(-)
> 
> Similarly, I suspect https://gitlab.com/qemu-project/qemu/-/issues/1961
> which has a fix ready:
> https://lore.kernel.org/qemu-devel/20231110170831.185001-1-richard.henderson@linaro.org/
> 
> Maybe wait the fix gets in first?

No, I applied Richard's patch, but the problem persists. Does this test 
still work for you?

  Thomas




^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] tests/avocado/reverse_debugging: Disable the ppc64 tests by default
  2023-11-14 18:29   ` Thomas Huth
@ 2023-11-15  1:15     ` Nicholas Piggin
  2023-11-15  6:23       ` Thomas Huth
  0 siblings, 1 reply; 22+ messages in thread
From: Nicholas Piggin @ 2023-11-15  1:15 UTC (permalink / raw)
  To: Thomas Huth, Philippe Mathieu-Daudé, Alex Bennée, qemu-devel
  Cc: Wainer dos Santos Moschetta, Beraldo Leal, Cleber Rosa,
	Pavel Dovgalyuk, Paolo Bonzini, qemu-ppc, John Snow

On Wed Nov 15, 2023 at 4:29 AM AEST, Thomas Huth wrote:
> On 14/11/2023 17.37, Philippe Mathieu-Daudé wrote:
> > On 14/11/23 17:31, Thomas Huth wrote:
> >> The tests seem currently to be broken. Disable them by default
> >> until someone fixes them.
> >>
> >> Signed-off-by: Thomas Huth <thuth@redhat.com>
> >> ---
> >>   tests/avocado/reverse_debugging.py | 7 ++++---
> >>   1 file changed, 4 insertions(+), 3 deletions(-)
> > 
> > Similarly, I suspect https://gitlab.com/qemu-project/qemu/-/issues/1961
> > which has a fix ready:
> > https://lore.kernel.org/qemu-devel/20231110170831.185001-1-richard.henderson@linaro.org/
> > 
> > Maybe wait the fix gets in first?
>
> No, I applied Richard's patch, but the problem persists. Does this test 
> still work for you?

I bisected it to 1d4796cd008373 ("python/machine: use socketpair() for
console connections"), which causes this halfway through the test:

2023-11-15 10:37:04,600 stacktrace       L0045 ERROR| Traceback (most recent call last):
2023-11-15 10:37:04,600 stacktrace       L0045 ERROR|   File "/home/npiggin/src/qemu/build/pyvenv/lib/python3.11/site-packages/avocado/core/decorators.py", line 90, in wrapper
2023-11-15 10:37:04,600 stacktrace       L0045 ERROR|     return function(obj, *args, **kwargs)
2023-11-15 10:37:04,600 stacktrace       L0045 ERROR|            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2023-11-15 10:37:04,600 stacktrace       L0045 ERROR|   File "/home/npiggin/src/qemu/build/tests/avocado/reverse_debugging.py", line 264, in test_ppc64_powernv
2023-11-15 10:37:04,600 stacktrace       L0045 ERROR|     self.reverse_debugging()
2023-11-15 10:37:04,600 stacktrace       L0045 ERROR|   File "/home/npiggin/src/qemu/build/tests/avocado/reverse_debugging.py", line 173, in reverse_debugging
2023-11-15 10:37:04,600 stacktrace       L0045 ERROR|     g.cmd(b'c')
2023-11-15 10:37:04,600 stacktrace       L0045 ERROR|   File "/home/npiggin/src/qemu/build/pyvenv/lib/python3.11/site-packages/avocado/utils/gdb.py", line 783, in cmd
2023-11-15 10:37:04,600 stacktrace       L0045 ERROR|     response_payload = self.decode(result)
2023-11-15 10:37:04,600 stacktrace       L0045 ERROR|                        ^^^^^^^^^^^^^^^^^^^
2023-11-15 10:37:04,600 stacktrace       L0045 ERROR|   File "/home/npiggin/src/qemu/build/pyvenv/lib/python3.11/site-packages/avocado/utils/gdb.py", line 738, in decode
2023-11-15 10:37:04,600 stacktrace       L0045 ERROR|     raise InvalidPacketError
2023-11-15 10:37:04,600 stacktrace       L0045 ERROR| avocado.utils.gdb.InvalidPacketError
2023-11-15 10:37:04,600 stacktrace       L0046 ERROR|

It doesn't always fail the same gdb command
(I saw a bc on line 182 as well). It seems to be receiving a
zero length response?

No idea what's happening or why ppc seems to be more fragile.
Or why changing console connection affects gdb connection?

Thanks,
Nick


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] tests/avocado/reverse_debugging: Disable the ppc64 tests by default
  2023-11-15  1:15     ` Nicholas Piggin
@ 2023-11-15  6:23       ` Thomas Huth
  2023-11-15 13:14         ` Daniel P. Berrangé
  0 siblings, 1 reply; 22+ messages in thread
From: Thomas Huth @ 2023-11-15  6:23 UTC (permalink / raw)
  To: John Snow, Daniel P. Berrangé, Ani Sinha
  Cc: Wainer dos Santos Moschetta, Beraldo Leal, Cleber Rosa,
	Pavel Dovgalyuk, Paolo Bonzini, qemu-ppc, John Snow,
	Nicholas Piggin, qemu-devel, Alex Bennée,
	Philippe Mathieu-Daudé

On 15/11/2023 02.15, Nicholas Piggin wrote:
> On Wed Nov 15, 2023 at 4:29 AM AEST, Thomas Huth wrote:
>> On 14/11/2023 17.37, Philippe Mathieu-Daudé wrote:
>>> On 14/11/23 17:31, Thomas Huth wrote:
>>>> The tests seem currently to be broken. Disable them by default
>>>> until someone fixes them.
>>>>
>>>> Signed-off-by: Thomas Huth <thuth@redhat.com>
>>>> ---
>>>>    tests/avocado/reverse_debugging.py | 7 ++++---
>>>>    1 file changed, 4 insertions(+), 3 deletions(-)
>>>
>>> Similarly, I suspect https://gitlab.com/qemu-project/qemu/-/issues/1961
>>> which has a fix ready:
>>> https://lore.kernel.org/qemu-devel/20231110170831.185001-1-richard.henderson@linaro.org/
>>>
>>> Maybe wait the fix gets in first?
>>
>> No, I applied Richard's patch, but the problem persists. Does this test
>> still work for you?
> 
> I bisected it to 1d4796cd008373 ("python/machine: use socketpair() for
> console connections"),

Maybe John (who wrote that commit) can help?

  Thomas


> which causes this halfway through the test:
> 
> 2023-11-15 10:37:04,600 stacktrace       L0045 ERROR| Traceback (most recent call last):
> 2023-11-15 10:37:04,600 stacktrace       L0045 ERROR|   File "/home/npiggin/src/qemu/build/pyvenv/lib/python3.11/site-packages/avocado/core/decorators.py", line 90, in wrapper
> 2023-11-15 10:37:04,600 stacktrace       L0045 ERROR|     return function(obj, *args, **kwargs)
> 2023-11-15 10:37:04,600 stacktrace       L0045 ERROR|            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 2023-11-15 10:37:04,600 stacktrace       L0045 ERROR|   File "/home/npiggin/src/qemu/build/tests/avocado/reverse_debugging.py", line 264, in test_ppc64_powernv
> 2023-11-15 10:37:04,600 stacktrace       L0045 ERROR|     self.reverse_debugging()
> 2023-11-15 10:37:04,600 stacktrace       L0045 ERROR|   File "/home/npiggin/src/qemu/build/tests/avocado/reverse_debugging.py", line 173, in reverse_debugging
> 2023-11-15 10:37:04,600 stacktrace       L0045 ERROR|     g.cmd(b'c')
> 2023-11-15 10:37:04,600 stacktrace       L0045 ERROR|   File "/home/npiggin/src/qemu/build/pyvenv/lib/python3.11/site-packages/avocado/utils/gdb.py", line 783, in cmd
> 2023-11-15 10:37:04,600 stacktrace       L0045 ERROR|     response_payload = self.decode(result)
> 2023-11-15 10:37:04,600 stacktrace       L0045 ERROR|                        ^^^^^^^^^^^^^^^^^^^
> 2023-11-15 10:37:04,600 stacktrace       L0045 ERROR|   File "/home/npiggin/src/qemu/build/pyvenv/lib/python3.11/site-packages/avocado/utils/gdb.py", line 738, in decode
> 2023-11-15 10:37:04,600 stacktrace       L0045 ERROR|     raise InvalidPacketError
> 2023-11-15 10:37:04,600 stacktrace       L0045 ERROR| avocado.utils.gdb.InvalidPacketError
> 2023-11-15 10:37:04,600 stacktrace       L0046 ERROR|
> 
> It doesn't always fail the same gdb command
> (I saw a bc on line 182 as well). It seems to be receiving a
> zero length response?
> 
> No idea what's happening or why ppc seems to be more fragile.
> Or why changing console connection affects gdb connection?
> 
> Thanks,
> Nick
> 



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] tests/avocado/reverse_debugging: Disable the ppc64 tests by default
  2023-11-15  6:23       ` Thomas Huth
@ 2023-11-15 13:14         ` Daniel P. Berrangé
  2023-11-15 17:22           ` Daniel P. Berrangé
  0 siblings, 1 reply; 22+ messages in thread
From: Daniel P. Berrangé @ 2023-11-15 13:14 UTC (permalink / raw)
  To: Thomas Huth
  Cc: John Snow, Ani Sinha, Wainer dos Santos Moschetta, Beraldo Leal,
	Cleber Rosa, Pavel Dovgalyuk, Paolo Bonzini, qemu-ppc,
	Nicholas Piggin, qemu-devel, Alex Bennée,
	Philippe Mathieu-Daudé

On Wed, Nov 15, 2023 at 07:23:01AM +0100, Thomas Huth wrote:
> On 15/11/2023 02.15, Nicholas Piggin wrote:
> > On Wed Nov 15, 2023 at 4:29 AM AEST, Thomas Huth wrote:
> > > On 14/11/2023 17.37, Philippe Mathieu-Daudé wrote:
> > > > On 14/11/23 17:31, Thomas Huth wrote:
> > > > > The tests seem currently to be broken. Disable them by default
> > > > > until someone fixes them.
> > > > > 
> > > > > Signed-off-by: Thomas Huth <thuth@redhat.com>
> > > > > ---
> > > > >    tests/avocado/reverse_debugging.py | 7 ++++---
> > > > >    1 file changed, 4 insertions(+), 3 deletions(-)
> > > > 
> > > > Similarly, I suspect https://gitlab.com/qemu-project/qemu/-/issues/1961
> > > > which has a fix ready:
> > > > https://lore.kernel.org/qemu-devel/20231110170831.185001-1-richard.henderson@linaro.org/
> > > > 
> > > > Maybe wait the fix gets in first?
> > > 
> > > No, I applied Richard's patch, but the problem persists. Does this test
> > > still work for you?
> > 
> > I bisected it to 1d4796cd008373 ("python/machine: use socketpair() for
> > console connections"),
> 
> Maybe John (who wrote that commit) can help?

I find it hard to believe this commit is a direct root cause of the
problem since all it does is change the QEMU startup sequence so that
instead of QEMU listening for a monitor connection, it is given a
pre-opened monitor connection.

At the very most that should affect the startup timing a little.

I notice all the reverse debugging tests have a skip on gitlab
with a comment:

    # unidentified gitlab timeout problem

this makes be suspicious that John's patch has merely made this
(henceforth undiagnosed) timeout more likely to ocurr.

> > which causes this halfway through the test:
> > 
> > 2023-11-15 10:37:04,600 stacktrace       L0045 ERROR| Traceback (most recent call last):
> > 2023-11-15 10:37:04,600 stacktrace       L0045 ERROR|   File "/home/npiggin/src/qemu/build/pyvenv/lib/python3.11/site-packages/avocado/core/decorators.py", line 90, in wrapper
> > 2023-11-15 10:37:04,600 stacktrace       L0045 ERROR|     return function(obj, *args, **kwargs)
> > 2023-11-15 10:37:04,600 stacktrace       L0045 ERROR|            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > 2023-11-15 10:37:04,600 stacktrace       L0045 ERROR|   File "/home/npiggin/src/qemu/build/tests/avocado/reverse_debugging.py", line 264, in test_ppc64_powernv
> > 2023-11-15 10:37:04,600 stacktrace       L0045 ERROR|     self.reverse_debugging()
> > 2023-11-15 10:37:04,600 stacktrace       L0045 ERROR|   File "/home/npiggin/src/qemu/build/tests/avocado/reverse_debugging.py", line 173, in reverse_debugging
> > 2023-11-15 10:37:04,600 stacktrace       L0045 ERROR|     g.cmd(b'c')
> > 2023-11-15 10:37:04,600 stacktrace       L0045 ERROR|   File "/home/npiggin/src/qemu/build/pyvenv/lib/python3.11/site-packages/avocado/utils/gdb.py", line 783, in cmd
> > 2023-11-15 10:37:04,600 stacktrace       L0045 ERROR|     response_payload = self.decode(result)
> > 2023-11-15 10:37:04,600 stacktrace       L0045 ERROR|                        ^^^^^^^^^^^^^^^^^^^
> > 2023-11-15 10:37:04,600 stacktrace       L0045 ERROR|   File "/home/npiggin/src/qemu/build/pyvenv/lib/python3.11/site-packages/avocado/utils/gdb.py", line 738, in decode
> > 2023-11-15 10:37:04,600 stacktrace       L0045 ERROR|     raise InvalidPacketError
> > 2023-11-15 10:37:04,600 stacktrace       L0045 ERROR| avocado.utils.gdb.InvalidPacketError
> > 2023-11-15 10:37:04,600 stacktrace       L0046 ERROR|
> > 
> > It doesn't always fail the same gdb command
> > (I saw a bc on line 182 as well). It seems to be receiving a
> > zero length response?
> > 
> > No idea what's happening or why ppc seems to be more fragile.
> > Or why changing console connection affects gdb connection?

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] tests/avocado/reverse_debugging: Disable the ppc64 tests by default
  2023-11-15 13:14         ` Daniel P. Berrangé
@ 2023-11-15 17:22           ` Daniel P. Berrangé
  2023-11-16  1:15             ` Nicholas Piggin
                               ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Daniel P. Berrangé @ 2023-11-15 17:22 UTC (permalink / raw)
  To: Thomas Huth, John Snow, Ani Sinha, Wainer dos Santos Moschetta,
	Beraldo Leal, Cleber Rosa, Pavel Dovgalyuk, Paolo Bonzini,
	qemu-ppc, Nicholas Piggin, qemu-devel, Alex Bennée,
	Philippe Mathieu-Daudé

On Wed, Nov 15, 2023 at 01:14:53PM +0000, Daniel P. Berrangé wrote:
> On Wed, Nov 15, 2023 at 07:23:01AM +0100, Thomas Huth wrote:
> > On 15/11/2023 02.15, Nicholas Piggin wrote:
> > > On Wed Nov 15, 2023 at 4:29 AM AEST, Thomas Huth wrote:
> > > > On 14/11/2023 17.37, Philippe Mathieu-Daudé wrote:
> > > > > On 14/11/23 17:31, Thomas Huth wrote:
> > > > > > The tests seem currently to be broken. Disable them by default
> > > > > > until someone fixes them.
> > > > > > 
> > > > > > Signed-off-by: Thomas Huth <thuth@redhat.com>
> > > > > > ---
> > > > > >    tests/avocado/reverse_debugging.py | 7 ++++---
> > > > > >    1 file changed, 4 insertions(+), 3 deletions(-)
> > > > > 
> > > > > Similarly, I suspect https://gitlab.com/qemu-project/qemu/-/issues/1961
> > > > > which has a fix ready:
> > > > > https://lore.kernel.org/qemu-devel/20231110170831.185001-1-richard.henderson@linaro.org/
> > > > > 
> > > > > Maybe wait the fix gets in first?
> > > > 
> > > > No, I applied Richard's patch, but the problem persists. Does this test
> > > > still work for you?
> > > 
> > > I bisected it to 1d4796cd008373 ("python/machine: use socketpair() for
> > > console connections"),
> > 
> > Maybe John (who wrote that commit) can help?
> 
> I find it hard to believe this commit is a direct root cause of the
> problem since all it does is change the QEMU startup sequence so that
> instead of QEMU listening for a monitor connection, it is given a
> pre-opened monitor connection.
> 
> At the very most that should affect the startup timing a little.
> 
> I notice all the reverse debugging tests have a skip on gitlab
> with a comment:
> 
>     # unidentified gitlab timeout problem
> 
> this makes be suspicious that John's patch has merely made this
> (henceforth undiagnosed) timeout more likely to ocurr.

After an absolutely horrendous hours long debugging session I think
I figured out the problem. The QEMU process is blocking in

    qemu_chr_write_buffer

spinning in the loop on EAGAIN.

The Python  Machine() class has passed one of a pre-created socketpair
FDs for the serial port chardev. The guest is trying to write to this
and blocking.  Nothing in the Machine() class is reading from the
other end of the serial port console.


Before John's change, the serial port uses a chardev in server mode
and crucially  'wait=off', and the Machine() class never opened the
console socket unless the test case wanted to read from it.

IOW, QEMU had a background job setting there waiting for a connection
that would never come.

As a result when QEMU started executing the guest, all the serial port
writes get sent into to the void.


So John's patch has had a semantic change in behaviour, because the
console socket is permanently open, and thus socket buffers are liable
to fill up.

As a demo I increased the socket buffers to 1MB and everything then
succeeded.

@@ -357,6 +360,10 @@ def _pre_launch(self) -> None:
 
         if self._console_set:
             self._cons_sock_pair = socket.socketpair()
+            self._cons_sock_pair[0].setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1024*1024);
+            self._cons_sock_pair[0].setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1024*1024);
+            self._cons_sock_pair[1].setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1024*1024);
+            self._cons_sock_pair[1].setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1024*1024);
             os.set_inheritable(self._cons_sock_pair[0].fileno(), True)
 
         # NOTE: Make sure any opened resources are *definitely* freed in


The Machine class doesn't know if anything will ever use the console,
so as is the change is unsafe.

The original goal of John's change was to guarantee we capture early
boot messages as some test need that.  

I think we need to be able to have a flag to say whether the caller needs
an "early console" facility, and only use the pre-opened FD passing for
that case. Tests we need early console will have to ask for that guarantee
explicitly.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] tests/avocado/reverse_debugging: Disable the ppc64 tests by default
  2023-11-15 17:22           ` Daniel P. Berrangé
@ 2023-11-16  1:15             ` Nicholas Piggin
  2023-11-16  3:55               ` Ani Sinha
                                 ` (2 more replies)
  2023-11-16  3:50             ` Ani Sinha
  2023-11-20 19:18             ` John Snow
  2 siblings, 3 replies; 22+ messages in thread
From: Nicholas Piggin @ 2023-11-16  1:15 UTC (permalink / raw)
  To: Daniel P. Berrangé,
	Thomas Huth, John Snow, Ani Sinha, Wainer dos Santos Moschetta,
	Beraldo Leal, Cleber Rosa, Pavel Dovgalyuk, Paolo Bonzini,
	qemu-ppc, qemu-devel, Alex Bennée,
	Philippe Mathieu-Daudé

On Thu Nov 16, 2023 at 3:22 AM AEST, Daniel P. Berrangé wrote:
> On Wed, Nov 15, 2023 at 01:14:53PM +0000, Daniel P. Berrangé wrote:
> > On Wed, Nov 15, 2023 at 07:23:01AM +0100, Thomas Huth wrote:
> > > On 15/11/2023 02.15, Nicholas Piggin wrote:
> > > > On Wed Nov 15, 2023 at 4:29 AM AEST, Thomas Huth wrote:
> > > > > On 14/11/2023 17.37, Philippe Mathieu-Daudé wrote:
> > > > > > On 14/11/23 17:31, Thomas Huth wrote:
> > > > > > > The tests seem currently to be broken. Disable them by default
> > > > > > > until someone fixes them.
> > > > > > > 
> > > > > > > Signed-off-by: Thomas Huth <thuth@redhat.com>
> > > > > > > ---
> > > > > > >    tests/avocado/reverse_debugging.py | 7 ++++---
> > > > > > >    1 file changed, 4 insertions(+), 3 deletions(-)
> > > > > > 
> > > > > > Similarly, I suspect https://gitlab.com/qemu-project/qemu/-/issues/1961
> > > > > > which has a fix ready:
> > > > > > https://lore.kernel.org/qemu-devel/20231110170831.185001-1-richard.henderson@linaro.org/
> > > > > > 
> > > > > > Maybe wait the fix gets in first?
> > > > > 
> > > > > No, I applied Richard's patch, but the problem persists. Does this test
> > > > > still work for you?
> > > > 
> > > > I bisected it to 1d4796cd008373 ("python/machine: use socketpair() for
> > > > console connections"),
> > > 
> > > Maybe John (who wrote that commit) can help?
> > 
> > I find it hard to believe this commit is a direct root cause of the
> > problem since all it does is change the QEMU startup sequence so that
> > instead of QEMU listening for a monitor connection, it is given a
> > pre-opened monitor connection.
> > 
> > At the very most that should affect the startup timing a little.
> > 
> > I notice all the reverse debugging tests have a skip on gitlab
> > with a comment:
> > 
> >     # unidentified gitlab timeout problem
> > 
> > this makes be suspicious that John's patch has merely made this
> > (henceforth undiagnosed) timeout more likely to ocurr.
>
> After an absolutely horrendous hours long debugging session I think
> I figured out the problem. The QEMU process is blocking in
>
>     qemu_chr_write_buffer
>
> spinning in the loop on EAGAIN.

Great work.

Why does this make the gdb socket give an empty response? Something
just times out?

>
> The Python  Machine() class has passed one of a pre-created socketpair
> FDs for the serial port chardev. The guest is trying to write to this
> and blocking.  Nothing in the Machine() class is reading from the
> other end of the serial port console.
>
>
> Before John's change, the serial port uses a chardev in server mode
> and crucially  'wait=off', and the Machine() class never opened the
> console socket unless the test case wanted to read from it.
>
> IOW, QEMU had a background job setting there waiting for a connection
> that would never come.
>
> As a result when QEMU started executing the guest, all the serial port
> writes get sent into to the void.
>
>
> So John's patch has had a semantic change in behaviour, because the
> console socket is permanently open, and thus socket buffers are liable
> to fill up.
>
> As a demo I increased the socket buffers to 1MB and everything then
> succeeded.
>
> @@ -357,6 +360,10 @@ def _pre_launch(self) -> None:
>  
>          if self._console_set:
>              self._cons_sock_pair = socket.socketpair()
> +            self._cons_sock_pair[0].setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1024*1024);
> +            self._cons_sock_pair[0].setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1024*1024);
> +            self._cons_sock_pair[1].setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1024*1024);
> +            self._cons_sock_pair[1].setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1024*1024);
>              os.set_inheritable(self._cons_sock_pair[0].fileno(), True)
>  
>          # NOTE: Make sure any opened resources are *definitely* freed in

So perhaps ppc64 fails just because it prints more to the console in early
boot than other targets?

> The Machine class doesn't know if anything will ever use the console,
> so as is the change is unsafe.
>
> The original goal of John's change was to guarantee we capture early
> boot messages as some test need that.  
>
> I think we need to be able to have a flag to say whether the caller needs
> an "early console" facility, and only use the pre-opened FD passing for
> that case. Tests we need early console will have to ask for that guarantee
> explicitly.

The below patch makes this test work. Maybe as a quick fix it is
better than disabling the test.

I guess we still have a problem if a test invokes vm.launch()
directly without subsequently waiting for a console pattern or
doing something with the console as you say. Your suggesstion is
add something like vm.launch(console=True) ? 

Thanks,
Nick
---

diff --git a/tests/avocado/reverse_debugging.py b/tests/avocado/reverse_debugging.py
index fc47874eda..128d85bc0e 100644
--- a/tests/avocado/reverse_debugging.py
+++ b/tests/avocado/reverse_debugging.py
@@ -12,6 +12,7 @@
 
 from avocado import skipIf
 from avocado_qemu import BUILD_DIR
+from avocado.utils import datadrainer
 from avocado.utils import gdb
 from avocado.utils import process
 from avocado.utils.network.ports import find_free_port
@@ -52,6 +53,10 @@ def run_vm(self, record, shift, args, replay_path, image_path, port):
         if args:
             vm.add_args(*args)
         vm.launch()
+        console_drainer = datadrainer.LineLogger(vm.console_socket.fileno(),
+                                    logger=self.log.getChild('console'),
+                                    stop_check=(lambda : not vm.is_running()))
+        console_drainer.start()
         return vm
 
     @staticmethod


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH] tests/avocado/reverse_debugging: Disable the ppc64 tests by default
  2023-11-15 17:22           ` Daniel P. Berrangé
  2023-11-16  1:15             ` Nicholas Piggin
@ 2023-11-16  3:50             ` Ani Sinha
  2023-11-20 19:18             ` John Snow
  2 siblings, 0 replies; 22+ messages in thread
From: Ani Sinha @ 2023-11-16  3:50 UTC (permalink / raw)
  To: "Daniel P. Berrangé"
  Cc: Thomas Huth, John Snow, Wainer dos Santos Moschetta,
	Beraldo Leal, Cleber Rosa, Pavel Dovgalyuk, Paolo Bonzini,
	qemu-ppc, Nicholas Piggin, QEMU Developers, Alex Bennée,
	Philippe Mathieu-Daudé



> On 15-Nov-2023, at 10:52 PM, Daniel P. Berrangé <berrange@redhat.com> wrote:
> 
> On Wed, Nov 15, 2023 at 01:14:53PM +0000, Daniel P. Berrangé wrote:
>> On Wed, Nov 15, 2023 at 07:23:01AM +0100, Thomas Huth wrote:
>>> On 15/11/2023 02.15, Nicholas Piggin wrote:
>>>> On Wed Nov 15, 2023 at 4:29 AM AEST, Thomas Huth wrote:
>>>>> On 14/11/2023 17.37, Philippe Mathieu-Daudé wrote:
>>>>>> On 14/11/23 17:31, Thomas Huth wrote:
>>>>>>> The tests seem currently to be broken. Disable them by default
>>>>>>> until someone fixes them.
>>>>>>> 
>>>>>>> Signed-off-by: Thomas Huth <thuth@redhat.com>
>>>>>>> ---
>>>>>>>   tests/avocado/reverse_debugging.py | 7 ++++---
>>>>>>>   1 file changed, 4 insertions(+), 3 deletions(-)
>>>>>> 
>>>>>> Similarly, I suspect https://gitlab.com/qemu-project/qemu/-/issues/1961
>>>>>> which has a fix ready:
>>>>>> https://lore.kernel.org/qemu-devel/20231110170831.185001-1-richard.henderson@linaro.org/
>>>>>> 
>>>>>> Maybe wait the fix gets in first?
>>>>> 
>>>>> No, I applied Richard's patch, but the problem persists. Does this test
>>>>> still work for you?
>>>> 
>>>> I bisected it to 1d4796cd008373 ("python/machine: use socketpair() for
>>>> console connections"),
>>> 
>>> Maybe John (who wrote that commit) can help?
>> 
>> I find it hard to believe this commit is a direct root cause of the
>> problem since all it does is change the QEMU startup sequence so that
>> instead of QEMU listening for a monitor connection, it is given a
>> pre-opened monitor connection.
>> 
>> At the very most that should affect the startup timing a little.
>> 
>> I notice all the reverse debugging tests have a skip on gitlab
>> with a comment:
>> 
>>    # unidentified gitlab timeout problem
>> 
>> this makes be suspicious that John's patch has merely made this
>> (henceforth undiagnosed) timeout more likely to ocurr.
> 
> After an absolutely horrendous hours long debugging session I think
> I figured out the problem. The QEMU process is blocking in
> 
>    qemu_chr_write_buffer
> 
> spinning in the loop on EAGAIN.
> 
> The Python  Machine() class has passed one of a pre-created socketpair
> FDs for the serial port chardev. The guest is trying to write to this
> and blocking.  Nothing in the Machine() class is reading from the
> other end of the serial port console.
> 
> 
> Before John's change, the serial port uses a chardev in server mode
> and crucially  'wait=off', and the Machine() class never opened the
> console socket unless the test case wanted to read from it.
> 
> IOW, QEMU had a background job setting there waiting for a connection
> that would never come.
> 
> As a result when QEMU started executing the guest, all the serial port
> writes get sent into to the void.
> 
> 
> So John's patch has had a semantic change in behaviour, because the
> console socket is permanently open, and thus socket buffers are liable
> to fill up.
> 
> As a demo I increased the socket buffers to 1MB and everything then
> succeeded.
> 
> @@ -357,6 +360,10 @@ def _pre_launch(self) -> None:
> 
>         if self._console_set:
>             self._cons_sock_pair = socket.socketpair()
> +            self._cons_sock_pair[0].setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1024*1024);
> +            self._cons_sock_pair[0].setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1024*1024);
> +            self._cons_sock_pair[1].setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1024*1024);
> +            self._cons_sock_pair[1].setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1024*1024);
>             os.set_inheritable(self._cons_sock_pair[0].fileno(), True)
> 
>         # NOTE: Make sure any opened resources are *definitely* freed in
> 
> 
> The Machine class doesn't know if anything will ever use the console,
> so as is the change is unsafe.
> 
> The original goal of John's change was to guarantee we capture early
> boot messages as some test need that.

As in https://gitlab.com/qemu-project/qemu/-/blob/master/tests/avocado/acpi-bits.py?ref_type=heads#L395 ?
Some other tests do this too.

>  
> 
> I think we need to be able to have a flag to say whether the caller needs
> an "early console" facility, and only use the pre-opened FD passing for
> that case. Tests we need early console will have to ask for that guarantee
> explicitly.
> 
> With regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] tests/avocado/reverse_debugging: Disable the ppc64 tests by default
  2023-11-16  1:15             ` Nicholas Piggin
@ 2023-11-16  3:55               ` Ani Sinha
  2023-11-16  7:14                 ` Nicholas Piggin
  2023-11-16  7:09               ` Thomas Huth
  2023-11-16  9:00               ` Daniel P. Berrangé
  2 siblings, 1 reply; 22+ messages in thread
From: Ani Sinha @ 2023-11-16  3:55 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: "Daniel P. Berrangé",
	Thomas Huth, John Snow, Wainer dos Santos Moschetta,
	Beraldo Leal, Cleber Rosa, Pavel Dovgalyuk, Paolo Bonzini,
	qemu-ppc, qemu-devel, Alex Bennée,
	Philippe Mathieu-Daudé



> On 16-Nov-2023, at 6:45 AM, Nicholas Piggin <npiggin@gmail.com> wrote:
> 
> On Thu Nov 16, 2023 at 3:22 AM AEST, Daniel P. Berrangé wrote:
>> On Wed, Nov 15, 2023 at 01:14:53PM +0000, Daniel P. Berrangé wrote:
>>> On Wed, Nov 15, 2023 at 07:23:01AM +0100, Thomas Huth wrote:
>>>> On 15/11/2023 02.15, Nicholas Piggin wrote:
>>>>> On Wed Nov 15, 2023 at 4:29 AM AEST, Thomas Huth wrote:
>>>>>> On 14/11/2023 17.37, Philippe Mathieu-Daudé wrote:
>>>>>>> On 14/11/23 17:31, Thomas Huth wrote:
>>>>>>>> The tests seem currently to be broken. Disable them by default
>>>>>>>> until someone fixes them.
>>>>>>>> 
>>>>>>>> Signed-off-by: Thomas Huth <thuth@redhat.com>
>>>>>>>> ---
>>>>>>>>   tests/avocado/reverse_debugging.py | 7 ++++---
>>>>>>>>   1 file changed, 4 insertions(+), 3 deletions(-)
>>>>>>> 
>>>>>>> Similarly, I suspect https://gitlab.com/qemu-project/qemu/-/issues/1961
>>>>>>> which has a fix ready:
>>>>>>> https://lore.kernel.org/qemu-devel/20231110170831.185001-1-richard.henderson@linaro.org/
>>>>>>> 
>>>>>>> Maybe wait the fix gets in first?
>>>>>> 
>>>>>> No, I applied Richard's patch, but the problem persists. Does this test
>>>>>> still work for you?
>>>>> 
>>>>> I bisected it to 1d4796cd008373 ("python/machine: use socketpair() for
>>>>> console connections"),
>>>> 
>>>> Maybe John (who wrote that commit) can help?
>>> 
>>> I find it hard to believe this commit is a direct root cause of the
>>> problem since all it does is change the QEMU startup sequence so that
>>> instead of QEMU listening for a monitor connection, it is given a
>>> pre-opened monitor connection.
>>> 
>>> At the very most that should affect the startup timing a little.
>>> 
>>> I notice all the reverse debugging tests have a skip on gitlab
>>> with a comment:
>>> 
>>>    # unidentified gitlab timeout problem
>>> 
>>> this makes be suspicious that John's patch has merely made this
>>> (henceforth undiagnosed) timeout more likely to ocurr.
>> 
>> After an absolutely horrendous hours long debugging session I think
>> I figured out the problem. The QEMU process is blocking in
>> 
>>    qemu_chr_write_buffer
>> 
>> spinning in the loop on EAGAIN.
> 
> Great work.
> 
> Why does this make the gdb socket give an empty response? Something
> just times out?
> 
>> 
>> The Python  Machine() class has passed one of a pre-created socketpair
>> FDs for the serial port chardev. The guest is trying to write to this
>> and blocking.  Nothing in the Machine() class is reading from the
>> other end of the serial port console.
>> 
>> 
>> Before John's change, the serial port uses a chardev in server mode
>> and crucially  'wait=off', and the Machine() class never opened the
>> console socket unless the test case wanted to read from it.
>> 
>> IOW, QEMU had a background job setting there waiting for a connection
>> that would never come.
>> 
>> As a result when QEMU started executing the guest, all the serial port
>> writes get sent into to the void.
>> 
>> 
>> So John's patch has had a semantic change in behaviour, because the
>> console socket is permanently open, and thus socket buffers are liable
>> to fill up.
>> 
>> As a demo I increased the socket buffers to 1MB and everything then
>> succeeded.
>> 
>> @@ -357,6 +360,10 @@ def _pre_launch(self) -> None:
>> 
>>         if self._console_set:
>>             self._cons_sock_pair = socket.socketpair()
>> +            self._cons_sock_pair[0].setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1024*1024);
>> +            self._cons_sock_pair[0].setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1024*1024);
>> +            self._cons_sock_pair[1].setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1024*1024);
>> +            self._cons_sock_pair[1].setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1024*1024);
>>             os.set_inheritable(self._cons_sock_pair[0].fileno(), True)
>> 
>>         # NOTE: Make sure any opened resources are *definitely* freed in
> 
> So perhaps ppc64 fails just because it prints more to the console in early
> boot than other targets?
> 
>> The Machine class doesn't know if anything will ever use the console,
>> so as is the change is unsafe.
>> 
>> The original goal of John's change was to guarantee we capture early
>> boot messages as some test need that.  
>> 
>> I think we need to be able to have a flag to say whether the caller needs
>> an "early console" facility, and only use the pre-opened FD passing for
>> that case. Tests we need early console will have to ask for that guarantee
>> explicitly.
> 
> The below patch makes this test work. Maybe as a quick fix it is
> better than disabling the test.
> 
> I guess we still have a problem if a test invokes vm.launch()
> directly without subsequently waiting for a console pattern or
> doing something with the console as you say. Your suggesstion is
> add something like vm.launch(console=True) ? 

I think what he is saying is to add a new property for QEMUMachine() with which the test can explicitly tell the machine init code that it is going to drain the console logs. By default it can be false. When tests use console_drainer, they can set the property to true and inspect the early console logs after draining it. 

> 
> Thanks,
> Nick
> ---
> 
> diff --git a/tests/avocado/reverse_debugging.py b/tests/avocado/reverse_debugging.py
> index fc47874eda..128d85bc0e 100644
> --- a/tests/avocado/reverse_debugging.py
> +++ b/tests/avocado/reverse_debugging.py
> @@ -12,6 +12,7 @@
> 
> from avocado import skipIf
> from avocado_qemu import BUILD_DIR
> +from avocado.utils import datadrainer
> from avocado.utils import gdb
> from avocado.utils import process
> from avocado.utils.network.ports import find_free_port
> @@ -52,6 +53,10 @@ def run_vm(self, record, shift, args, replay_path, image_path, port):
>         if args:
>             vm.add_args(*args)
>         vm.launch()
> +        console_drainer = datadrainer.LineLogger(vm.console_socket.fileno(),
> +                                    logger=self.log.getChild('console'),
> +                                    stop_check=(lambda : not vm.is_running()))
> +        console_drainer.start()
>         return vm
> 
>     @staticmethod




^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] tests/avocado/reverse_debugging: Disable the ppc64 tests by default
  2023-11-16  1:15             ` Nicholas Piggin
  2023-11-16  3:55               ` Ani Sinha
@ 2023-11-16  7:09               ` Thomas Huth
  2023-11-16  9:45                 ` Nicholas Piggin
  2023-11-16  9:00               ` Daniel P. Berrangé
  2 siblings, 1 reply; 22+ messages in thread
From: Thomas Huth @ 2023-11-16  7:09 UTC (permalink / raw)
  To: Nicholas Piggin, Daniel P. Berrangé,
	John Snow, Ani Sinha, Wainer dos Santos Moschetta, Beraldo Leal,
	Cleber Rosa, Pavel Dovgalyuk, Paolo Bonzini, qemu-ppc,
	qemu-devel, Alex Bennée, Philippe Mathieu-Daudé

On 16/11/2023 02.15, Nicholas Piggin wrote:
> On Thu Nov 16, 2023 at 3:22 AM AEST, Daniel P. Berrangé wrote:
>> On Wed, Nov 15, 2023 at 01:14:53PM +0000, Daniel P. Berrangé wrote:
...
>> The Machine class doesn't know if anything will ever use the console,
>> so as is the change is unsafe.
>>
>> The original goal of John's change was to guarantee we capture early
>> boot messages as some test need that.
>>
>> I think we need to be able to have a flag to say whether the caller needs
>> an "early console" facility, and only use the pre-opened FD passing for
>> that case. Tests we need early console will have to ask for that guarantee
>> explicitly.
> 
> The below patch makes this test work. Maybe as a quick fix it is
> better than disabling the test.
> 
> I guess we still have a problem if a test invokes vm.launch()
> directly without subsequently waiting for a console pattern or
> doing something with the console as you say. Your suggesstion is
> add something like vm.launch(console=True) ?
> 
> Thanks,
> Nick
> ---
> 
> diff --git a/tests/avocado/reverse_debugging.py b/tests/avocado/reverse_debugging.py
> index fc47874eda..128d85bc0e 100644
> --- a/tests/avocado/reverse_debugging.py
> +++ b/tests/avocado/reverse_debugging.py
> @@ -12,6 +12,7 @@
>   
>   from avocado import skipIf
>   from avocado_qemu import BUILD_DIR
> +from avocado.utils import datadrainer
>   from avocado.utils import gdb
>   from avocado.utils import process
>   from avocado.utils.network.ports import find_free_port
> @@ -52,6 +53,10 @@ def run_vm(self, record, shift, args, replay_path, image_path, port):
>           if args:
>               vm.add_args(*args)
>           vm.launch()
> +        console_drainer = datadrainer.LineLogger(vm.console_socket.fileno(),
> +                                    logger=self.log.getChild('console'),
> +                                    stop_check=(lambda : not vm.is_running()))
> +        console_drainer.start()
>           return vm
>   
>       @staticmethod

Tested-by: Thomas Huth <thuth@redhat.com>

Could you please send this as a proper patch, with a S-o-b line, and a short 
comment in front of the newly added code explaining it?

  Thanks,
   Thomas




^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] tests/avocado/reverse_debugging: Disable the ppc64 tests by default
  2023-11-16  3:55               ` Ani Sinha
@ 2023-11-16  7:14                 ` Nicholas Piggin
  2023-11-16  8:55                   ` Daniel P. Berrangé
  0 siblings, 1 reply; 22+ messages in thread
From: Nicholas Piggin @ 2023-11-16  7:14 UTC (permalink / raw)
  To: Ani Sinha
  Cc: Daniel P. Berrangé,
	Thomas Huth, John Snow, Wainer dos Santos Moschetta,
	Beraldo Leal, Cleber Rosa, Pavel Dovgalyuk, Paolo Bonzini,
	qemu-ppc, qemu-devel, Alex Bennée,
	Philippe Mathieu-Daudé

On Thu Nov 16, 2023 at 1:55 PM AEST, Ani Sinha wrote:
>
>
> > On 16-Nov-2023, at 6:45 AM, Nicholas Piggin <npiggin@gmail.com> wrote:
> > 
> > On Thu Nov 16, 2023 at 3:22 AM AEST, Daniel P. Berrangé wrote:
> >> On Wed, Nov 15, 2023 at 01:14:53PM +0000, Daniel P. Berrangé wrote:
> >>> On Wed, Nov 15, 2023 at 07:23:01AM +0100, Thomas Huth wrote:
> >>>> On 15/11/2023 02.15, Nicholas Piggin wrote:
> >>>>> On Wed Nov 15, 2023 at 4:29 AM AEST, Thomas Huth wrote:
> >>>>>> On 14/11/2023 17.37, Philippe Mathieu-Daudé wrote:
> >>>>>>> On 14/11/23 17:31, Thomas Huth wrote:
> >>>>>>>> The tests seem currently to be broken. Disable them by default
> >>>>>>>> until someone fixes them.
> >>>>>>>> 
> >>>>>>>> Signed-off-by: Thomas Huth <thuth@redhat.com>
> >>>>>>>> ---
> >>>>>>>>   tests/avocado/reverse_debugging.py | 7 ++++---
> >>>>>>>>   1 file changed, 4 insertions(+), 3 deletions(-)
> >>>>>>> 
> >>>>>>> Similarly, I suspect https://gitlab.com/qemu-project/qemu/-/issues/1961
> >>>>>>> which has a fix ready:
> >>>>>>> https://lore.kernel.org/qemu-devel/20231110170831.185001-1-richard.henderson@linaro.org/
> >>>>>>> 
> >>>>>>> Maybe wait the fix gets in first?
> >>>>>> 
> >>>>>> No, I applied Richard's patch, but the problem persists. Does this test
> >>>>>> still work for you?
> >>>>> 
> >>>>> I bisected it to 1d4796cd008373 ("python/machine: use socketpair() for
> >>>>> console connections"),
> >>>> 
> >>>> Maybe John (who wrote that commit) can help?
> >>> 
> >>> I find it hard to believe this commit is a direct root cause of the
> >>> problem since all it does is change the QEMU startup sequence so that
> >>> instead of QEMU listening for a monitor connection, it is given a
> >>> pre-opened monitor connection.
> >>> 
> >>> At the very most that should affect the startup timing a little.
> >>> 
> >>> I notice all the reverse debugging tests have a skip on gitlab
> >>> with a comment:
> >>> 
> >>>    # unidentified gitlab timeout problem
> >>> 
> >>> this makes be suspicious that John's patch has merely made this
> >>> (henceforth undiagnosed) timeout more likely to ocurr.
> >> 
> >> After an absolutely horrendous hours long debugging session I think
> >> I figured out the problem. The QEMU process is blocking in
> >> 
> >>    qemu_chr_write_buffer
> >> 
> >> spinning in the loop on EAGAIN.
> > 
> > Great work.
> > 
> > Why does this make the gdb socket give an empty response? Something
> > just times out?
> > 
> >> 
> >> The Python  Machine() class has passed one of a pre-created socketpair
> >> FDs for the serial port chardev. The guest is trying to write to this
> >> and blocking.  Nothing in the Machine() class is reading from the
> >> other end of the serial port console.
> >> 
> >> 
> >> Before John's change, the serial port uses a chardev in server mode
> >> and crucially  'wait=off', and the Machine() class never opened the
> >> console socket unless the test case wanted to read from it.
> >> 
> >> IOW, QEMU had a background job setting there waiting for a connection
> >> that would never come.
> >> 
> >> As a result when QEMU started executing the guest, all the serial port
> >> writes get sent into to the void.
> >> 
> >> 
> >> So John's patch has had a semantic change in behaviour, because the
> >> console socket is permanently open, and thus socket buffers are liable
> >> to fill up.
> >> 
> >> As a demo I increased the socket buffers to 1MB and everything then
> >> succeeded.
> >> 
> >> @@ -357,6 +360,10 @@ def _pre_launch(self) -> None:
> >> 
> >>         if self._console_set:
> >>             self._cons_sock_pair = socket.socketpair()
> >> +            self._cons_sock_pair[0].setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1024*1024);
> >> +            self._cons_sock_pair[0].setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1024*1024);
> >> +            self._cons_sock_pair[1].setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1024*1024);
> >> +            self._cons_sock_pair[1].setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1024*1024);
> >>             os.set_inheritable(self._cons_sock_pair[0].fileno(), True)
> >> 
> >>         # NOTE: Make sure any opened resources are *definitely* freed in
> > 
> > So perhaps ppc64 fails just because it prints more to the console in early
> > boot than other targets?
> > 
> >> The Machine class doesn't know if anything will ever use the console,
> >> so as is the change is unsafe.
> >> 
> >> The original goal of John's change was to guarantee we capture early
> >> boot messages as some test need that.  
> >> 
> >> I think we need to be able to have a flag to say whether the caller needs
> >> an "early console" facility, and only use the pre-opened FD passing for
> >> that case. Tests we need early console will have to ask for that guarantee
> >> explicitly.
> > 
> > The below patch makes this test work. Maybe as a quick fix it is
> > better than disabling the test.
> > 
> > I guess we still have a problem if a test invokes vm.launch()
> > directly without subsequently waiting for a console pattern or
> > doing something with the console as you say. Your suggesstion is
> > add something like vm.launch(console=True) ? 
>
> I think what he is saying is to add a new property for QEMUMachine() with which the test can explicitly tell the machine init code that it is going to drain the console logs. By default it can be false. When tests use console_drainer, they can set the property to true and inspect the early console logs after draining it. 

Hmm... well we do have QEMUMachine.set_console already. Is this enough?
If the test case is not going to drain or interact with the console
then it could set it to false. Or am I missing something?

Thanks,
Nick


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] tests/avocado/reverse_debugging: Disable the ppc64 tests by default
  2023-11-16  7:14                 ` Nicholas Piggin
@ 2023-11-16  8:55                   ` Daniel P. Berrangé
  2023-11-16 11:17                     ` Ani Sinha
  0 siblings, 1 reply; 22+ messages in thread
From: Daniel P. Berrangé @ 2023-11-16  8:55 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Ani Sinha, Thomas Huth, John Snow, Wainer dos Santos Moschetta,
	Beraldo Leal, Cleber Rosa, Pavel Dovgalyuk, Paolo Bonzini,
	qemu-ppc, qemu-devel, Alex Bennée,
	Philippe Mathieu-Daudé

On Thu, Nov 16, 2023 at 05:14:43PM +1000, Nicholas Piggin wrote:
> On Thu Nov 16, 2023 at 1:55 PM AEST, Ani Sinha wrote:
> >
> >
> > > On 16-Nov-2023, at 6:45 AM, Nicholas Piggin <npiggin@gmail.com> wrote:
> > > 
> > > On Thu Nov 16, 2023 at 3:22 AM AEST, Daniel P. Berrangé wrote:
> > >> On Wed, Nov 15, 2023 at 01:14:53PM +0000, Daniel P. Berrangé wrote:
> > >>> On Wed, Nov 15, 2023 at 07:23:01AM +0100, Thomas Huth wrote:
> > >>>> On 15/11/2023 02.15, Nicholas Piggin wrote:
> > >>>>> On Wed Nov 15, 2023 at 4:29 AM AEST, Thomas Huth wrote:
> > >>>>>> On 14/11/2023 17.37, Philippe Mathieu-Daudé wrote:
> > >>>>>>> On 14/11/23 17:31, Thomas Huth wrote:
> > >>>>>>>> The tests seem currently to be broken. Disable them by default
> > >>>>>>>> until someone fixes them.
> > >>>>>>>> 
> > >>>>>>>> Signed-off-by: Thomas Huth <thuth@redhat.com>
> > >>>>>>>> ---
> > >>>>>>>>   tests/avocado/reverse_debugging.py | 7 ++++---
> > >>>>>>>>   1 file changed, 4 insertions(+), 3 deletions(-)
> > >>>>>>> 
> > >>>>>>> Similarly, I suspect https://gitlab.com/qemu-project/qemu/-/issues/1961
> > >>>>>>> which has a fix ready:
> > >>>>>>> https://lore.kernel.org/qemu-devel/20231110170831.185001-1-richard.henderson@linaro.org/
> > >>>>>>> 
> > >>>>>>> Maybe wait the fix gets in first?
> > >>>>>> 
> > >>>>>> No, I applied Richard's patch, but the problem persists. Does this test
> > >>>>>> still work for you?
> > >>>>> 
> > >>>>> I bisected it to 1d4796cd008373 ("python/machine: use socketpair() for
> > >>>>> console connections"),
> > >>>> 
> > >>>> Maybe John (who wrote that commit) can help?
> > >>> 
> > >>> I find it hard to believe this commit is a direct root cause of the
> > >>> problem since all it does is change the QEMU startup sequence so that
> > >>> instead of QEMU listening for a monitor connection, it is given a
> > >>> pre-opened monitor connection.
> > >>> 
> > >>> At the very most that should affect the startup timing a little.
> > >>> 
> > >>> I notice all the reverse debugging tests have a skip on gitlab
> > >>> with a comment:
> > >>> 
> > >>>    # unidentified gitlab timeout problem
> > >>> 
> > >>> this makes be suspicious that John's patch has merely made this
> > >>> (henceforth undiagnosed) timeout more likely to ocurr.
> > >> 
> > >> After an absolutely horrendous hours long debugging session I think
> > >> I figured out the problem. The QEMU process is blocking in
> > >> 
> > >>    qemu_chr_write_buffer
> > >> 
> > >> spinning in the loop on EAGAIN.
> > > 
> > > Great work.
> > > 
> > > Why does this make the gdb socket give an empty response? Something
> > > just times out?
> > > 
> > >> 
> > >> The Python  Machine() class has passed one of a pre-created socketpair
> > >> FDs for the serial port chardev. The guest is trying to write to this
> > >> and blocking.  Nothing in the Machine() class is reading from the
> > >> other end of the serial port console.
> > >> 
> > >> 
> > >> Before John's change, the serial port uses a chardev in server mode
> > >> and crucially  'wait=off', and the Machine() class never opened the
> > >> console socket unless the test case wanted to read from it.
> > >> 
> > >> IOW, QEMU had a background job setting there waiting for a connection
> > >> that would never come.
> > >> 
> > >> As a result when QEMU started executing the guest, all the serial port
> > >> writes get sent into to the void.
> > >> 
> > >> 
> > >> So John's patch has had a semantic change in behaviour, because the
> > >> console socket is permanently open, and thus socket buffers are liable
> > >> to fill up.
> > >> 
> > >> As a demo I increased the socket buffers to 1MB and everything then
> > >> succeeded.
> > >> 
> > >> @@ -357,6 +360,10 @@ def _pre_launch(self) -> None:
> > >> 
> > >>         if self._console_set:
> > >>             self._cons_sock_pair = socket.socketpair()
> > >> +            self._cons_sock_pair[0].setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1024*1024);
> > >> +            self._cons_sock_pair[0].setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1024*1024);
> > >> +            self._cons_sock_pair[1].setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1024*1024);
> > >> +            self._cons_sock_pair[1].setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1024*1024);
> > >>             os.set_inheritable(self._cons_sock_pair[0].fileno(), True)
> > >> 
> > >>         # NOTE: Make sure any opened resources are *definitely* freed in
> > > 
> > > So perhaps ppc64 fails just because it prints more to the console in early
> > > boot than other targets?
> > > 
> > >> The Machine class doesn't know if anything will ever use the console,
> > >> so as is the change is unsafe.
> > >> 
> > >> The original goal of John's change was to guarantee we capture early
> > >> boot messages as some test need that.  
> > >> 
> > >> I think we need to be able to have a flag to say whether the caller needs
> > >> an "early console" facility, and only use the pre-opened FD passing for
> > >> that case. Tests we need early console will have to ask for that guarantee
> > >> explicitly.
> > > 
> > > The below patch makes this test work. Maybe as a quick fix it is
> > > better than disabling the test.
> > > 
> > > I guess we still have a problem if a test invokes vm.launch()
> > > directly without subsequently waiting for a console pattern or
> > > doing something with the console as you say. Your suggesstion is
> > > add something like vm.launch(console=True) ? 
> >
> > I think what he is saying is to add a new property for QEMUMachine() with which the test can explicitly tell the machine init code that it is going to drain the console logs. By default it can be false. When tests use console_drainer, they can set the property to true and inspect the early console logs after draining it. 
> 
> Hmm... well we do have QEMUMachine.set_console already. Is this enough?
> If the test case is not going to drain or interact with the console
> then it could set it to false. Or am I missing something?

Yeah, set_console is enough - i missed that that exists.

Thus problem is more specific. It hits when a test calls
set_console(True), but then fails to read from the console.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] tests/avocado/reverse_debugging: Disable the ppc64 tests by default
  2023-11-16  1:15             ` Nicholas Piggin
  2023-11-16  3:55               ` Ani Sinha
  2023-11-16  7:09               ` Thomas Huth
@ 2023-11-16  9:00               ` Daniel P. Berrangé
  2 siblings, 0 replies; 22+ messages in thread
From: Daniel P. Berrangé @ 2023-11-16  9:00 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Thomas Huth, John Snow, Ani Sinha, Wainer dos Santos Moschetta,
	Beraldo Leal, Cleber Rosa, Pavel Dovgalyuk, Paolo Bonzini,
	qemu-ppc, qemu-devel, Alex Bennée,
	Philippe Mathieu-Daudé

On Thu, Nov 16, 2023 at 11:15:25AM +1000, Nicholas Piggin wrote:
> On Thu Nov 16, 2023 at 3:22 AM AEST, Daniel P. Berrangé wrote:
> > On Wed, Nov 15, 2023 at 01:14:53PM +0000, Daniel P. Berrangé wrote:
> > > On Wed, Nov 15, 2023 at 07:23:01AM +0100, Thomas Huth wrote:
> > > > On 15/11/2023 02.15, Nicholas Piggin wrote:
> > > > > On Wed Nov 15, 2023 at 4:29 AM AEST, Thomas Huth wrote:
> > > > > > On 14/11/2023 17.37, Philippe Mathieu-Daudé wrote:
> > > > > > > On 14/11/23 17:31, Thomas Huth wrote:
> > > > > > > > The tests seem currently to be broken. Disable them by default
> > > > > > > > until someone fixes them.
> > > > > > > > 
> > > > > > > > Signed-off-by: Thomas Huth <thuth@redhat.com>
> > > > > > > > ---
> > > > > > > >    tests/avocado/reverse_debugging.py | 7 ++++---
> > > > > > > >    1 file changed, 4 insertions(+), 3 deletions(-)
> > > > > > > 
> > > > > > > Similarly, I suspect https://gitlab.com/qemu-project/qemu/-/issues/1961
> > > > > > > which has a fix ready:
> > > > > > > https://lore.kernel.org/qemu-devel/20231110170831.185001-1-richard.henderson@linaro.org/
> > > > > > > 
> > > > > > > Maybe wait the fix gets in first?
> > > > > > 
> > > > > > No, I applied Richard's patch, but the problem persists. Does this test
> > > > > > still work for you?
> > > > > 
> > > > > I bisected it to 1d4796cd008373 ("python/machine: use socketpair() for
> > > > > console connections"),
> > > > 
> > > > Maybe John (who wrote that commit) can help?
> > > 
> > > I find it hard to believe this commit is a direct root cause of the
> > > problem since all it does is change the QEMU startup sequence so that
> > > instead of QEMU listening for a monitor connection, it is given a
> > > pre-opened monitor connection.
> > > 
> > > At the very most that should affect the startup timing a little.
> > > 
> > > I notice all the reverse debugging tests have a skip on gitlab
> > > with a comment:
> > > 
> > >     # unidentified gitlab timeout problem
> > > 
> > > this makes be suspicious that John's patch has merely made this
> > > (henceforth undiagnosed) timeout more likely to ocurr.
> >
> > After an absolutely horrendous hours long debugging session I think
> > I figured out the problem. The QEMU process is blocking in
> >
> >     qemu_chr_write_buffer
> >
> > spinning in the loop on EAGAIN.
> 
> Great work.
> 
> Why does this make the gdb socket give an empty response? Something
> just times out?

There's no empty response - avocado is sending SIGTERM to the
test because no response ever arrived after 10 seconds 

> > @@ -357,6 +360,10 @@ def _pre_launch(self) -> None:
> >  
> >          if self._console_set:
> >              self._cons_sock_pair = socket.socketpair()
> > +            self._cons_sock_pair[0].setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1024*1024);
> > +            self._cons_sock_pair[0].setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1024*1024);
> > +            self._cons_sock_pair[1].setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1024*1024);
> > +            self._cons_sock_pair[1].setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1024*1024);
> >              os.set_inheritable(self._cons_sock_pair[0].fileno(), True)
> >  
> >          # NOTE: Make sure any opened resources are *definitely* freed in
> 
> So perhaps ppc64 fails just because it prints more to the console in early
> boot than other targets?

Yes, that's exactly it - the ppc firmware is very verbose, so is enough
to tip it over the limit.

> > The Machine class doesn't know if anything will ever use the console,
> > so as is the change is unsafe.
> >
> > The original goal of John's change was to guarantee we capture early
> > boot messages as some test need that.  
> >
> > I think we need to be able to have a flag to say whether the caller needs
> > an "early console" facility, and only use the pre-opened FD passing for
> > that case. Tests we need early console will have to ask for that guarantee
> > explicitly.
> 
> The below patch makes this test work. Maybe as a quick fix it is
> better than disabling the test.
> 
> I guess we still have a problem if a test invokes vm.launch()
> directly without subsequently waiting for a console pattern or
> doing something with the console as you say. Your suggesstion is
> add something like vm.launch(console=True) ? 
> 
> Thanks,
> Nick
> ---
> 
> diff --git a/tests/avocado/reverse_debugging.py b/tests/avocado/reverse_debugging.py
> index fc47874eda..128d85bc0e 100644
> --- a/tests/avocado/reverse_debugging.py
> +++ b/tests/avocado/reverse_debugging.py
> @@ -12,6 +12,7 @@
>  
>  from avocado import skipIf
>  from avocado_qemu import BUILD_DIR
> +from avocado.utils import datadrainer
>  from avocado.utils import gdb
>  from avocado.utils import process
>  from avocado.utils.network.ports import find_free_port
> @@ -52,6 +53,10 @@ def run_vm(self, record, shift, args, replay_path, image_path, port):
>          if args:
>              vm.add_args(*args)
>          vm.launch()
> +        console_drainer = datadrainer.LineLogger(vm.console_socket.fileno(),
> +                                    logger=self.log.getChild('console'),
> +                                    stop_check=(lambda : not vm.is_running()))
> +        console_drainer.start()
>          return vm
>  
>      @staticmethod

Yes, thus should do the trick.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] tests/avocado/reverse_debugging: Disable the ppc64 tests by default
  2023-11-16  7:09               ` Thomas Huth
@ 2023-11-16  9:45                 ` Nicholas Piggin
  0 siblings, 0 replies; 22+ messages in thread
From: Nicholas Piggin @ 2023-11-16  9:45 UTC (permalink / raw)
  To: Thomas Huth, Daniel P. Berrangé,
	John Snow, Ani Sinha, Wainer dos Santos Moschetta, Beraldo Leal,
	Cleber Rosa, Pavel Dovgalyuk, Paolo Bonzini, qemu-ppc,
	qemu-devel, Alex Bennée, Philippe Mathieu-Daudé

On Thu Nov 16, 2023 at 5:09 PM AEST, Thomas Huth wrote:
> On 16/11/2023 02.15, Nicholas Piggin wrote:
> > On Thu Nov 16, 2023 at 3:22 AM AEST, Daniel P. Berrangé wrote:
> >> On Wed, Nov 15, 2023 at 01:14:53PM +0000, Daniel P. Berrangé wrote:
> ...
> >> The Machine class doesn't know if anything will ever use the console,
> >> so as is the change is unsafe.
> >>
> >> The original goal of John's change was to guarantee we capture early
> >> boot messages as some test need that.
> >>
> >> I think we need to be able to have a flag to say whether the caller needs
> >> an "early console" facility, and only use the pre-opened FD passing for
> >> that case. Tests we need early console will have to ask for that guarantee
> >> explicitly.
> > 
> > The below patch makes this test work. Maybe as a quick fix it is
> > better than disabling the test.
> > 
> > I guess we still have a problem if a test invokes vm.launch()
> > directly without subsequently waiting for a console pattern or
> > doing something with the console as you say. Your suggesstion is
> > add something like vm.launch(console=True) ?
> > 
> > Thanks,
> > Nick
> > ---
> > 
> > diff --git a/tests/avocado/reverse_debugging.py b/tests/avocado/reverse_debugging.py
> > index fc47874eda..128d85bc0e 100644
> > --- a/tests/avocado/reverse_debugging.py
> > +++ b/tests/avocado/reverse_debugging.py
> > @@ -12,6 +12,7 @@
> >   
> >   from avocado import skipIf
> >   from avocado_qemu import BUILD_DIR
> > +from avocado.utils import datadrainer
> >   from avocado.utils import gdb
> >   from avocado.utils import process
> >   from avocado.utils.network.ports import find_free_port
> > @@ -52,6 +53,10 @@ def run_vm(self, record, shift, args, replay_path, image_path, port):
> >           if args:
> >               vm.add_args(*args)
> >           vm.launch()
> > +        console_drainer = datadrainer.LineLogger(vm.console_socket.fileno(),
> > +                                    logger=self.log.getChild('console'),
> > +                                    stop_check=(lambda : not vm.is_running()))
> > +        console_drainer.start()
> >           return vm
> >   
> >       @staticmethod
>
> Tested-by: Thomas Huth <thuth@redhat.com>

Thank you.

> Could you please send this as a proper patch, with a S-o-b line, and a short 
> comment in front of the newly added code explaining it?

Yep, just testing it now. Should have it sent out in an hour or two.

Thanks,
Nick


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] tests/avocado/reverse_debugging: Disable the ppc64 tests by default
  2023-11-16  8:55                   ` Daniel P. Berrangé
@ 2023-11-16 11:17                     ` Ani Sinha
  2023-11-16 11:31                       ` Daniel P. Berrangé
  0 siblings, 1 reply; 22+ messages in thread
From: Ani Sinha @ 2023-11-16 11:17 UTC (permalink / raw)
  To: "Daniel P. Berrangé"
  Cc: Nicholas Piggin, Thomas Huth, John Snow,
	Wainer dos Santos Moschetta, Beraldo Leal, Cleber Rosa,
	Pavel Dovgalyuk, Paolo Bonzini, qemu-ppc, QEMU Developers,
	Alex Bennée, Philippe Mathieu-Daudé



> On 16-Nov-2023, at 2:25 PM, Daniel P. Berrangé <berrange@redhat.com> wrote:
> 
> On Thu, Nov 16, 2023 at 05:14:43PM +1000, Nicholas Piggin wrote:
>> On Thu Nov 16, 2023 at 1:55 PM AEST, Ani Sinha wrote:
>>> 
>>> 
>>>> On 16-Nov-2023, at 6:45 AM, Nicholas Piggin <npiggin@gmail.com> wrote:
>>>> 
>>>> On Thu Nov 16, 2023 at 3:22 AM AEST, Daniel P. Berrangé wrote:
>>>>> On Wed, Nov 15, 2023 at 01:14:53PM +0000, Daniel P. Berrangé wrote:
>>>>>> On Wed, Nov 15, 2023 at 07:23:01AM +0100, Thomas Huth wrote:
>>>>>>> On 15/11/2023 02.15, Nicholas Piggin wrote:
>>>>>>>> On Wed Nov 15, 2023 at 4:29 AM AEST, Thomas Huth wrote:
>>>>>>>>> On 14/11/2023 17.37, Philippe Mathieu-Daudé wrote:
>>>>>>>>>> On 14/11/23 17:31, Thomas Huth wrote:
>>>>>>>>>>> The tests seem currently to be broken. Disable them by default
>>>>>>>>>>> until someone fixes them.
>>>>>>>>>>> 
>>>>>>>>>>> Signed-off-by: Thomas Huth <thuth@redhat.com>
>>>>>>>>>>> ---
>>>>>>>>>>>  tests/avocado/reverse_debugging.py | 7 ++++---
>>>>>>>>>>>  1 file changed, 4 insertions(+), 3 deletions(-)
>>>>>>>>>> 
>>>>>>>>>> Similarly, I suspect https://gitlab.com/qemu-project/qemu/-/issues/1961
>>>>>>>>>> which has a fix ready:
>>>>>>>>>> https://lore.kernel.org/qemu-devel/20231110170831.185001-1-richard.henderson@linaro.org/
>>>>>>>>>> 
>>>>>>>>>> Maybe wait the fix gets in first?
>>>>>>>>> 
>>>>>>>>> No, I applied Richard's patch, but the problem persists. Does this test
>>>>>>>>> still work for you?
>>>>>>>> 
>>>>>>>> I bisected it to 1d4796cd008373 ("python/machine: use socketpair() for
>>>>>>>> console connections"),
>>>>>>> 
>>>>>>> Maybe John (who wrote that commit) can help?
>>>>>> 
>>>>>> I find it hard to believe this commit is a direct root cause of the
>>>>>> problem since all it does is change the QEMU startup sequence so that
>>>>>> instead of QEMU listening for a monitor connection, it is given a
>>>>>> pre-opened monitor connection.
>>>>>> 
>>>>>> At the very most that should affect the startup timing a little.
>>>>>> 
>>>>>> I notice all the reverse debugging tests have a skip on gitlab
>>>>>> with a comment:
>>>>>> 
>>>>>>   # unidentified gitlab timeout problem
>>>>>> 
>>>>>> this makes be suspicious that John's patch has merely made this
>>>>>> (henceforth undiagnosed) timeout more likely to ocurr.
>>>>> 
>>>>> After an absolutely horrendous hours long debugging session I think
>>>>> I figured out the problem. The QEMU process is blocking in
>>>>> 
>>>>>   qemu_chr_write_buffer
>>>>> 
>>>>> spinning in the loop on EAGAIN.
>>>> 
>>>> Great work.
>>>> 
>>>> Why does this make the gdb socket give an empty response? Something
>>>> just times out?
>>>> 
>>>>> 
>>>>> The Python  Machine() class has passed one of a pre-created socketpair
>>>>> FDs for the serial port chardev. The guest is trying to write to this
>>>>> and blocking.  Nothing in the Machine() class is reading from the
>>>>> other end of the serial port console.
>>>>> 
>>>>> 
>>>>> Before John's change, the serial port uses a chardev in server mode
>>>>> and crucially  'wait=off', and the Machine() class never opened the
>>>>> console socket unless the test case wanted to read from it.
>>>>> 
>>>>> IOW, QEMU had a background job setting there waiting for a connection
>>>>> that would never come.
>>>>> 
>>>>> As a result when QEMU started executing the guest, all the serial port
>>>>> writes get sent into to the void.
>>>>> 
>>>>> 
>>>>> So John's patch has had a semantic change in behaviour, because the
>>>>> console socket is permanently open, and thus socket buffers are liable
>>>>> to fill up.
>>>>> 
>>>>> As a demo I increased the socket buffers to 1MB and everything then
>>>>> succeeded.
>>>>> 
>>>>> @@ -357,6 +360,10 @@ def _pre_launch(self) -> None:
>>>>> 
>>>>>        if self._console_set:
>>>>>            self._cons_sock_pair = socket.socketpair()
>>>>> +            self._cons_sock_pair[0].setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1024*1024);
>>>>> +            self._cons_sock_pair[0].setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1024*1024);
>>>>> +            self._cons_sock_pair[1].setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1024*1024);
>>>>> +            self._cons_sock_pair[1].setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1024*1024);
>>>>>            os.set_inheritable(self._cons_sock_pair[0].fileno(), True)
>>>>> 
>>>>>        # NOTE: Make sure any opened resources are *definitely* freed in
>>>> 
>>>> So perhaps ppc64 fails just because it prints more to the console in early
>>>> boot than other targets?
>>>> 
>>>>> The Machine class doesn't know if anything will ever use the console,
>>>>> so as is the change is unsafe.
>>>>> 
>>>>> The original goal of John's change was to guarantee we capture early
>>>>> boot messages as some test need that.  
>>>>> 
>>>>> I think we need to be able to have a flag to say whether the caller needs
>>>>> an "early console" facility, and only use the pre-opened FD passing for
>>>>> that case. Tests we need early console will have to ask for that guarantee
>>>>> explicitly.
>>>> 
>>>> The below patch makes this test work. Maybe as a quick fix it is
>>>> better than disabling the test.
>>>> 
>>>> I guess we still have a problem if a test invokes vm.launch()
>>>> directly without subsequently waiting for a console pattern or
>>>> doing something with the console as you say. Your suggesstion is
>>>> add something like vm.launch(console=True) ? 
>>> 
>>> I think what he is saying is to add a new property for QEMUMachine() with which the test can explicitly tell the machine init code that it is going to drain the console logs. By default it can be false. When tests use console_drainer, they can set the property to true and inspect the early console logs after draining it. 
>> 
>> Hmm... well we do have QEMUMachine.set_console already. Is this enough?
>> If the test case is not going to drain or interact with the console
>> then it could set it to false. Or am I missing something?
> 
> Yeah, set_console is enough - i missed that that exists.
> 
> Thus problem is more specific. It hits when a test calls
> set_console(True), but then fails to read from the console.

So then it is a test issue in that the test requests console to be enabled but does not look at the console o/p.

> 
> 
> With regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] tests/avocado/reverse_debugging: Disable the ppc64 tests by default
  2023-11-16 11:17                     ` Ani Sinha
@ 2023-11-16 11:31                       ` Daniel P. Berrangé
  0 siblings, 0 replies; 22+ messages in thread
From: Daniel P. Berrangé @ 2023-11-16 11:31 UTC (permalink / raw)
  To: Ani Sinha
  Cc: Nicholas Piggin, Thomas Huth, John Snow,
	Wainer dos Santos Moschetta, Beraldo Leal, Cleber Rosa,
	Pavel Dovgalyuk, Paolo Bonzini, qemu-ppc, QEMU Developers,
	Alex Bennée, Philippe Mathieu-Daudé

On Thu, Nov 16, 2023 at 04:47:48PM +0530, Ani Sinha wrote:
> 
> 
> > On 16-Nov-2023, at 2:25 PM, Daniel P. Berrangé <berrange@redhat.com> wrote:
> > 
> > On Thu, Nov 16, 2023 at 05:14:43PM +1000, Nicholas Piggin wrote:
> >> On Thu Nov 16, 2023 at 1:55 PM AEST, Ani Sinha wrote:
> >>> 
> >>> 
> >>>> On 16-Nov-2023, at 6:45 AM, Nicholas Piggin <npiggin@gmail.com> wrote:
> >>>> 
> >>>> On Thu Nov 16, 2023 at 3:22 AM AEST, Daniel P. Berrangé wrote:
> >>>>> On Wed, Nov 15, 2023 at 01:14:53PM +0000, Daniel P. Berrangé wrote:
> >>>>> 
> >>>>> After an absolutely horrendous hours long debugging session I think
> >>>>> I figured out the problem. The QEMU process is blocking in
> >>>>> 
> >>>>>   qemu_chr_write_buffer
> >>>>> 
> >>>>> spinning in the loop on EAGAIN.
> >>>> 
> >>>> Great work.
> >>>> 
> >>>> Why does this make the gdb socket give an empty response? Something
> >>>> just times out?
> >>>> 
> >>>>> 
> >>>>> The Python  Machine() class has passed one of a pre-created socketpair
> >>>>> FDs for the serial port chardev. The guest is trying to write to this
> >>>>> and blocking.  Nothing in the Machine() class is reading from the
> >>>>> other end of the serial port console.
> >>>>> 
> >>>>> 
> >>>>> Before John's change, the serial port uses a chardev in server mode
> >>>>> and crucially  'wait=off', and the Machine() class never opened the
> >>>>> console socket unless the test case wanted to read from it.
> >>>>> 
> >>>>> IOW, QEMU had a background job setting there waiting for a connection
> >>>>> that would never come.
> >>>>> 
> >>>>> As a result when QEMU started executing the guest, all the serial port
> >>>>> writes get sent into to the void.
> >>>>> 
> >>>>> 
> >>>>> So John's patch has had a semantic change in behaviour, because the
> >>>>> console socket is permanently open, and thus socket buffers are liable
> >>>>> to fill up.
> >>>>> 
> >>>>> As a demo I increased the socket buffers to 1MB and everything then
> >>>>> succeeded.
> >>>>> 
> >>>>> @@ -357,6 +360,10 @@ def _pre_launch(self) -> None:
> >>>>> 
> >>>>>        if self._console_set:
> >>>>>            self._cons_sock_pair = socket.socketpair()
> >>>>> +            self._cons_sock_pair[0].setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1024*1024);
> >>>>> +            self._cons_sock_pair[0].setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1024*1024);
> >>>>> +            self._cons_sock_pair[1].setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1024*1024);
> >>>>> +            self._cons_sock_pair[1].setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1024*1024);
> >>>>>            os.set_inheritable(self._cons_sock_pair[0].fileno(), True)
> >>>>> 
> >>>>>        # NOTE: Make sure any opened resources are *definitely* freed in
> >>>> 
> >>>> So perhaps ppc64 fails just because it prints more to the console in early
> >>>> boot than other targets?
> >>>> 
> >>>>> The Machine class doesn't know if anything will ever use the console,
> >>>>> so as is the change is unsafe.
> >>>>> 
> >>>>> The original goal of John's change was to guarantee we capture early
> >>>>> boot messages as some test need that.  
> >>>>> 
> >>>>> I think we need to be able to have a flag to say whether the caller needs
> >>>>> an "early console" facility, and only use the pre-opened FD passing for
> >>>>> that case. Tests we need early console will have to ask for that guarantee
> >>>>> explicitly.
> >>>> 
> >>>> The below patch makes this test work. Maybe as a quick fix it is
> >>>> better than disabling the test.
> >>>> 
> >>>> I guess we still have a problem if a test invokes vm.launch()
> >>>> directly without subsequently waiting for a console pattern or
> >>>> doing something with the console as you say. Your suggesstion is
> >>>> add something like vm.launch(console=True) ? 
> >>> 
> >>> I think what he is saying is to add a new property for QEMUMachine() with which the test can explicitly tell the machine init code that it is going to drain the console logs. By default it can be false. When tests use console_drainer, they can set the property to true and inspect the early console logs after draining it. 
> >> 
> >> Hmm... well we do have QEMUMachine.set_console already. Is this enough?
> >> If the test case is not going to drain or interact with the console
> >> then it could set it to false. Or am I missing something?
> > 
> > Yeah, set_console is enough - i missed that that exists.
> > 
> > Thus problem is more specific. It hits when a test calls
> > set_console(True), but then fails to read from the console.
> 
> So then it is a test issue in that the test requests console to be enabled but does not look at the console o/p.

Yes, any test cases affected by this are broken already, and need to drain
the console. We might easily miss if we only test with x86_64 QEMU and not
other arches, as eg ppc firmware appears more chatty on serial port than
seabios

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] tests/avocado/reverse_debugging: Disable the ppc64 tests by default
  2023-11-15 17:22           ` Daniel P. Berrangé
  2023-11-16  1:15             ` Nicholas Piggin
  2023-11-16  3:50             ` Ani Sinha
@ 2023-11-20 19:18             ` John Snow
  2023-11-23  2:04               ` Nicholas Piggin
  2023-11-23 10:52               ` Peter Maydell
  2 siblings, 2 replies; 22+ messages in thread
From: John Snow @ 2023-11-20 19:18 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Thomas Huth, Ani Sinha, Wainer dos Santos Moschetta,
	Beraldo Leal, Cleber Rosa, Pavel Dovgalyuk, Paolo Bonzini,
	qemu-ppc, Nicholas Piggin, qemu-devel, Alex Bennée,
	Philippe Mathieu-Daudé

On Wed, Nov 15, 2023 at 12:23 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Wed, Nov 15, 2023 at 01:14:53PM +0000, Daniel P. Berrangé wrote:
> > On Wed, Nov 15, 2023 at 07:23:01AM +0100, Thomas Huth wrote:
> > > On 15/11/2023 02.15, Nicholas Piggin wrote:
> > > > On Wed Nov 15, 2023 at 4:29 AM AEST, Thomas Huth wrote:
> > > > > On 14/11/2023 17.37, Philippe Mathieu-Daudé wrote:
> > > > > > On 14/11/23 17:31, Thomas Huth wrote:
> > > > > > > The tests seem currently to be broken. Disable them by default
> > > > > > > until someone fixes them.
> > > > > > >
> > > > > > > Signed-off-by: Thomas Huth <thuth@redhat.com>
> > > > > > > ---
> > > > > > >    tests/avocado/reverse_debugging.py | 7 ++++---
> > > > > > >    1 file changed, 4 insertions(+), 3 deletions(-)
> > > > > >
> > > > > > Similarly, I suspect https://gitlab.com/qemu-project/qemu/-/issues/1961
> > > > > > which has a fix ready:
> > > > > > https://lore.kernel.org/qemu-devel/20231110170831.185001-1-richard.henderson@linaro.org/
> > > > > >
> > > > > > Maybe wait the fix gets in first?
> > > > >
> > > > > No, I applied Richard's patch, but the problem persists. Does this test
> > > > > still work for you?
> > > >
> > > > I bisected it to 1d4796cd008373 ("python/machine: use socketpair() for
> > > > console connections"),
> > >
> > > Maybe John (who wrote that commit) can help?
> >
> > I find it hard to believe this commit is a direct root cause of the
> > problem since all it does is change the QEMU startup sequence so that
> > instead of QEMU listening for a monitor connection, it is given a
> > pre-opened monitor connection.
> >
> > At the very most that should affect the startup timing a little.
> >
> > I notice all the reverse debugging tests have a skip on gitlab
> > with a comment:
> >
> >     # unidentified gitlab timeout problem
> >
> > this makes be suspicious that John's patch has merely made this
> > (henceforth undiagnosed) timeout more likely to ocurr.
>
> After an absolutely horrendous hours long debugging session I think
> I figured out the problem. The QEMU process is blocking in
>
>     qemu_chr_write_buffer
>
> spinning in the loop on EAGAIN.
>
> The Python  Machine() class has passed one of a pre-created socketpair
> FDs for the serial port chardev. The guest is trying to write to this
> and blocking.  Nothing in the Machine() class is reading from the
> other end of the serial port console.
>
>
> Before John's change, the serial port uses a chardev in server mode
> and crucially  'wait=off', and the Machine() class never opened the
> console socket unless the test case wanted to read from it.
>
> IOW, QEMU had a background job setting there waiting for a connection
> that would never come.
>
> As a result when QEMU started executing the guest, all the serial port
> writes get sent into to the void.
>
>
> So John's patch has had a semantic change in behaviour, because the
> console socket is permanently open, and thus socket buffers are liable
> to fill up.
>
> As a demo I increased the socket buffers to 1MB and everything then
> succeeded.
>
> @@ -357,6 +360,10 @@ def _pre_launch(self) -> None:
>
>          if self._console_set:
>              self._cons_sock_pair = socket.socketpair()
> +            self._cons_sock_pair[0].setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1024*1024);
> +            self._cons_sock_pair[0].setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1024*1024);
> +            self._cons_sock_pair[1].setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1024*1024);
> +            self._cons_sock_pair[1].setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1024*1024);
>              os.set_inheritable(self._cons_sock_pair[0].fileno(), True)
>
>          # NOTE: Make sure any opened resources are *definitely* freed in
>
>
> The Machine class doesn't know if anything will ever use the console,
> so as is the change is unsafe.
>
> The original goal of John's change was to guarantee we capture early
> boot messages as some test need that.
>
> I think we need to be able to have a flag to say whether the caller needs
> an "early console" facility, and only use the pre-opened FD passing for
> that case. Tests we need early console will have to ask for that guarantee
> explicitly.

Tch. I see. Thank you for diagnosing this.

From the machine.py perspective, you have to *opt in* to having a
console, so I hadn't considered that a caller would enable the console
and then ... not read from it. Surely that's a bug in the caller?

If you don't intend to read from the console, you shouldn't call set_console().

(The async rewrite I have been toying with on and off has a built-in
drainer that writes to a log file that would probably remedy this, but
the client tests should still be fixed, I think. Otherwise, do you
have any suggestions for how I might make this failure state more
obvious/friendly? I wonder if on close of the machine.py object I
could detect that the pipe is full and emit a warning about that.)

--js

>
> With regards,
> Daniel
> --
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] tests/avocado/reverse_debugging: Disable the ppc64 tests by default
  2023-11-20 19:18             ` John Snow
@ 2023-11-23  2:04               ` Nicholas Piggin
  2023-11-23 10:52               ` Peter Maydell
  1 sibling, 0 replies; 22+ messages in thread
From: Nicholas Piggin @ 2023-11-23  2:04 UTC (permalink / raw)
  To: John Snow, Daniel P. Berrangé
  Cc: Thomas Huth, Ani Sinha, Wainer dos Santos Moschetta,
	Beraldo Leal, Cleber Rosa, Pavel Dovgalyuk, Paolo Bonzini,
	qemu-ppc, qemu-devel, Alex Bennée,
	Philippe Mathieu-Daudé

On Tue Nov 21, 2023 at 5:18 AM AEST, John Snow wrote:
> On Wed, Nov 15, 2023 at 12:23 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
> >
> > On Wed, Nov 15, 2023 at 01:14:53PM +0000, Daniel P. Berrangé wrote:
> > > On Wed, Nov 15, 2023 at 07:23:01AM +0100, Thomas Huth wrote:
> > > > On 15/11/2023 02.15, Nicholas Piggin wrote:
> > > > > On Wed Nov 15, 2023 at 4:29 AM AEST, Thomas Huth wrote:
> > > > > > On 14/11/2023 17.37, Philippe Mathieu-Daudé wrote:
> > > > > > > On 14/11/23 17:31, Thomas Huth wrote:
> > > > > > > > The tests seem currently to be broken. Disable them by default
> > > > > > > > until someone fixes them.
> > > > > > > >
> > > > > > > > Signed-off-by: Thomas Huth <thuth@redhat.com>
> > > > > > > > ---
> > > > > > > >    tests/avocado/reverse_debugging.py | 7 ++++---
> > > > > > > >    1 file changed, 4 insertions(+), 3 deletions(-)
> > > > > > >
> > > > > > > Similarly, I suspect https://gitlab.com/qemu-project/qemu/-/issues/1961
> > > > > > > which has a fix ready:
> > > > > > > https://lore.kernel.org/qemu-devel/20231110170831.185001-1-richard.henderson@linaro.org/
> > > > > > >
> > > > > > > Maybe wait the fix gets in first?
> > > > > >
> > > > > > No, I applied Richard's patch, but the problem persists. Does this test
> > > > > > still work for you?
> > > > >
> > > > > I bisected it to 1d4796cd008373 ("python/machine: use socketpair() for
> > > > > console connections"),
> > > >
> > > > Maybe John (who wrote that commit) can help?
> > >
> > > I find it hard to believe this commit is a direct root cause of the
> > > problem since all it does is change the QEMU startup sequence so that
> > > instead of QEMU listening for a monitor connection, it is given a
> > > pre-opened monitor connection.
> > >
> > > At the very most that should affect the startup timing a little.
> > >
> > > I notice all the reverse debugging tests have a skip on gitlab
> > > with a comment:
> > >
> > >     # unidentified gitlab timeout problem
> > >
> > > this makes be suspicious that John's patch has merely made this
> > > (henceforth undiagnosed) timeout more likely to ocurr.
> >
> > After an absolutely horrendous hours long debugging session I think
> > I figured out the problem. The QEMU process is blocking in
> >
> >     qemu_chr_write_buffer
> >
> > spinning in the loop on EAGAIN.
> >
> > The Python  Machine() class has passed one of a pre-created socketpair
> > FDs for the serial port chardev. The guest is trying to write to this
> > and blocking.  Nothing in the Machine() class is reading from the
> > other end of the serial port console.
> >
> >
> > Before John's change, the serial port uses a chardev in server mode
> > and crucially  'wait=off', and the Machine() class never opened the
> > console socket unless the test case wanted to read from it.
> >
> > IOW, QEMU had a background job setting there waiting for a connection
> > that would never come.
> >
> > As a result when QEMU started executing the guest, all the serial port
> > writes get sent into to the void.
> >
> >
> > So John's patch has had a semantic change in behaviour, because the
> > console socket is permanently open, and thus socket buffers are liable
> > to fill up.
> >
> > As a demo I increased the socket buffers to 1MB and everything then
> > succeeded.
> >
> > @@ -357,6 +360,10 @@ def _pre_launch(self) -> None:
> >
> >          if self._console_set:
> >              self._cons_sock_pair = socket.socketpair()
> > +            self._cons_sock_pair[0].setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1024*1024);
> > +            self._cons_sock_pair[0].setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1024*1024);
> > +            self._cons_sock_pair[1].setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1024*1024);
> > +            self._cons_sock_pair[1].setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1024*1024);
> >              os.set_inheritable(self._cons_sock_pair[0].fileno(), True)
> >
> >          # NOTE: Make sure any opened resources are *definitely* freed in
> >
> >
> > The Machine class doesn't know if anything will ever use the console,
> > so as is the change is unsafe.
> >
> > The original goal of John's change was to guarantee we capture early
> > boot messages as some test need that.
> >
> > I think we need to be able to have a flag to say whether the caller needs
> > an "early console" facility, and only use the pre-opened FD passing for
> > that case. Tests we need early console will have to ask for that guarantee
> > explicitly.
>
> Tch. I see. Thank you for diagnosing this.
>
> From the machine.py perspective, you have to *opt in* to having a
> console, so I hadn't considered that a caller would enable the console
> and then ... not read from it. Surely that's a bug in the caller?
>
> If you don't intend to read from the console, you shouldn't call set_console().

Agree, hence the fix patch for the test case.

Although most tests wait for console, ones like this that control the
machine with gdb/qmp are rarer, so less examples to copy paste from.

>
> (The async rewrite I have been toying with on and off has a built-in
> drainer that writes to a log file that would probably remedy this, but
> the client tests should still be fixed, I think. Otherwise, do you

This sounds good because no matter the test, you rarely don't want to
log console output. Separating that from what the test does with
console would be nice.

> have any suggestions for how I might make this failure state more
> obvious/friendly? I wonder if on close of the machine.py object I
> could detect that the pipe is full and emit a warning about that.)

That's an idea. It wouldn't be foolproof (test could be waiting for
something else or failed for some other reason), but at least it could
give a suggestion (similar to my warning in the chardev code).

How would you do it? Maybe the simplest/portable way would be keep
a pipe write fd open in the harness and try write something to it
with O_NONBLOCK?

Thanks,
Nick


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] tests/avocado/reverse_debugging: Disable the ppc64 tests by default
  2023-11-20 19:18             ` John Snow
  2023-11-23  2:04               ` Nicholas Piggin
@ 2023-11-23 10:52               ` Peter Maydell
  2024-01-08 23:52                 ` John Snow
  1 sibling, 1 reply; 22+ messages in thread
From: Peter Maydell @ 2023-11-23 10:52 UTC (permalink / raw)
  To: John Snow
  Cc: Daniel P. Berrangé,
	Thomas Huth, Ani Sinha, Wainer dos Santos Moschetta,
	Beraldo Leal, Cleber Rosa, Pavel Dovgalyuk, Paolo Bonzini,
	qemu-ppc, Nicholas Piggin, qemu-devel, Alex Bennée,
	Philippe Mathieu-Daudé

On Mon, 20 Nov 2023 at 19:19, John Snow <jsnow@redhat.com> wrote:
>
> On Wed, Nov 15, 2023 at 12:23 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > The Python  Machine() class has passed one of a pre-created socketpair
> > FDs for the serial port chardev. The guest is trying to write to this
> > and blocking.  Nothing in the Machine() class is reading from the
> > other end of the serial port console.

> > The Machine class doesn't know if anything will ever use the console,
> > so as is the change is unsafe.
> >
> > The original goal of John's change was to guarantee we capture early
> > boot messages as some test need that.
> >
> > I think we need to be able to have a flag to say whether the caller needs
> > an "early console" facility, and only use the pre-opened FD passing for
> > that case. Tests we need early console will have to ask for that guarantee
> > explicitly.
>
> Tch. I see. Thank you for diagnosing this.
>
> From the machine.py perspective, you have to *opt in* to having a
> console, so I hadn't considered that a caller would enable the console
> and then ... not read from it. Surely that's a bug in the caller?

From an Avocado test perspective, I would expect that the test case
should have to explicitly opt *out* of "the console messages appear
in the avocado test log, even if the test case doesn't care about them
for the purposes of identifying when to end the test or whatever".
The console logs are important for after-the-fact human diagnosis
of why a test might have failed, so we should always collect them.

thanks
-- PMM


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] tests/avocado/reverse_debugging: Disable the ppc64 tests by default
  2023-11-23 10:52               ` Peter Maydell
@ 2024-01-08 23:52                 ` John Snow
  0 siblings, 0 replies; 22+ messages in thread
From: John Snow @ 2024-01-08 23:52 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Daniel P. Berrangé,
	Thomas Huth, Ani Sinha, Wainer dos Santos Moschetta,
	Beraldo Leal, Cleber Rosa, Pavel Dovgalyuk, Paolo Bonzini,
	qemu-ppc, Nicholas Piggin, qemu-devel, Alex Bennée,
	Philippe Mathieu-Daudé

On Thu, Nov 23, 2023 at 5:53 AM Peter Maydell <peter.maydell@linaro.org> wrote:
>
> On Mon, 20 Nov 2023 at 19:19, John Snow <jsnow@redhat.com> wrote:
> >
> > On Wed, Nov 15, 2023 at 12:23 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > The Python  Machine() class has passed one of a pre-created socketpair
> > > FDs for the serial port chardev. The guest is trying to write to this
> > > and blocking.  Nothing in the Machine() class is reading from the
> > > other end of the serial port console.
>
> > > The Machine class doesn't know if anything will ever use the console,
> > > so as is the change is unsafe.
> > >
> > > The original goal of John's change was to guarantee we capture early
> > > boot messages as some test need that.
> > >
> > > I think we need to be able to have a flag to say whether the caller needs
> > > an "early console" facility, and only use the pre-opened FD passing for
> > > that case. Tests we need early console will have to ask for that guarantee
> > > explicitly.
> >
> > Tch. I see. Thank you for diagnosing this.
> >
> > From the machine.py perspective, you have to *opt in* to having a
> > console, so I hadn't considered that a caller would enable the console
> > and then ... not read from it. Surely that's a bug in the caller?
>
> From an Avocado test perspective, I would expect that the test case
> should have to explicitly opt *out* of "the console messages appear
> in the avocado test log, even if the test case doesn't care about them
> for the purposes of identifying when to end the test or whatever".
> The console logs are important for after-the-fact human diagnosis
> of why a test might have failed, so we should always collect them.
>
> thanks
> -- PMM
>

Understood. In that case, fixing the test would involve engaging's the
avocado suite's draining utility to ensure that the log is being
consumed and logged.

I think there's a potential here to simplify all of the
draining-and-logging code we have split across the avocado test suite,
console_socket.py and machine.py, but I can't promise that the rewrite
I've been working on will be ready quickly, so if this is still busted
(I'm still catching back up with my mail post-holidays) then we want a
quicker fix if we haven't committed one yet.

--js



^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2024-01-08 23:53 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-11-14 16:31 [PATCH] tests/avocado/reverse_debugging: Disable the ppc64 tests by default Thomas Huth
2023-11-14 16:35 ` Philippe Mathieu-Daudé
2023-11-14 16:37 ` Philippe Mathieu-Daudé
2023-11-14 18:29   ` Thomas Huth
2023-11-15  1:15     ` Nicholas Piggin
2023-11-15  6:23       ` Thomas Huth
2023-11-15 13:14         ` Daniel P. Berrangé
2023-11-15 17:22           ` Daniel P. Berrangé
2023-11-16  1:15             ` Nicholas Piggin
2023-11-16  3:55               ` Ani Sinha
2023-11-16  7:14                 ` Nicholas Piggin
2023-11-16  8:55                   ` Daniel P. Berrangé
2023-11-16 11:17                     ` Ani Sinha
2023-11-16 11:31                       ` Daniel P. Berrangé
2023-11-16  7:09               ` Thomas Huth
2023-11-16  9:45                 ` Nicholas Piggin
2023-11-16  9:00               ` Daniel P. Berrangé
2023-11-16  3:50             ` Ani Sinha
2023-11-20 19:18             ` John Snow
2023-11-23  2:04               ` Nicholas Piggin
2023-11-23 10:52               ` Peter Maydell
2024-01-08 23:52                 ` John Snow

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.