* hang in virtio-failover-test (s390 host)
From: Peter Maydell @ 2022-03-24 11:11 UTC
  To: QEMU Developers; +Cc: Laurent Vivier, Alex Bennée, Dr. David Alan Gilbert

This is a backtrace from virtio-failover-test, which had hung
on the s390 gitlab CI runner. Both processes were using CPU,
so this is some kind of livelock, not a deadlock.

Looking more closely at the virtio-net-failover process, in the function
test_migrate_off_abort() we execute 'migrate_cancel' and then go
into a loop waiting for 'status' to be "cancelled", aborting if
it is either "failed" or "active". But the status the QEMU process
returns is "completed", so we loop forever waiting for a status change
that will never come (I assume).
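
The loop in question looks roughly like this (paraphrased, not a
verbatim copy of tests/qtest/virtio-net-failover.c; 'ret' is a QDict *
and 'status' a const char * declared earlier in the function, and the
exact assertions are my reading of the code rather than a quote):

    while (true) {
        ret = migrate_status(qts);

        status = qdict_get_str(ret, "status");
        if (strcmp(status, "cancelled") == 0) {
            qobject_unref(ret);
            break;
        }
        /* the test aborts on any unexpected state */
        g_assert_cmpstr(status, !=, "failed");
        g_assert_cmpstr(status, !=, "active");
        qobject_unref(ret);
    }

Since QEMU is already reporting "completed", none of those branches
ever fires, so the loop keeps re-issuing the status query -- which
also matches both processes spinning on CPU.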

I also have a hang in migration-test on this system, which I'll
send a separate email about.

-- PMM

virtio-net-fail(2706321)───qemu-system-i38(2706783)
===========================================================
PROCESS: 2706321
gitlab-+ 2706321 2691914 46 Mar23 ?        08:49:56
/home/gitlab-runner/builds/-LCfcJ2T/0/qemu-project/qemu/build/tests/qtest/virtio-net-failover
--tap -k
[New LWP 2706322]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/s390x-linux-gnu/libthread_db.so.1".
__libc_read (nbytes=1, buf=0x3ffe6bfa4d7, fd=3) at
../sysdeps/unix/sysv/linux/read.c:26
26      ../sysdeps/unix/sysv/linux/read.c: No such file or directory.

Thread 2 (Thread 0x3ff8617f900 (LWP 2706322)):
#0  syscall () at ../sysdeps/unix/sysv/linux/s390/s390-64/syscall.S:37
#1  0x000002aa3f8bb334 in qemu_futex_wait (val=<optimized out>,
f=<optimized out>) at
/home/gitlab-runner/builds/-LCfcJ2T/0/qemu-project/qemu/include/qemu/futex.h:29
#2  qemu_event_wait (ev=ev@entry=0x2aa3f8f6ce8 <rcu_call_ready_event>)
at ../util/qemu-thread-posix.c:481
#3  0x000002aa3f8c5a2a in call_rcu_thread (opaque=opaque@entry=0x0) at
../util/rcu.c:261
#4  0x000002aa3f8ba32a in qemu_thread_start (args=<optimized out>) at
../util/qemu-thread-posix.c:556
#5  0x000003ff86407e66 in start_thread (arg=0x3ff8617f900) at
pthread_create.c:477
#6  0x000003ff862fcbf6 in thread_start () at
../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 1 (Thread 0x3ff867f5430 (LWP 2706321)):
#0  __libc_read (nbytes=1, buf=0x3ffe6bfa4d7, fd=3) at
../sysdeps/unix/sysv/linux/read.c:26
#1  __libc_read (fd=fd@entry=3, buf=buf@entry=0x3ffe6bfa4d7,
nbytes=nbytes@entry=1) at ../sysdeps/unix/sysv/linux/read.c:24
#2  0x000002aa3f89d4f4 in read (__nbytes=1, __buf=0x3ffe6bfa4d7,
__fd=3) at /usr/include/s390x-linux-gnu/bits/unistd.h:44
#3  qmp_fd_receive (fd=<optimized out>) at ../tests/qtest/libqtest.c:613
#4  0x000002aa3f89d59a in qtest_qmp_receive_dict (s=0x2aa401fcc50) at
../tests/qtest/libqtest.c:648
#5  qtest_qmp_receive (s=0x2aa401fcc50) at ../tests/qtest/libqtest.c:636
#6  0x000002aa3f89de84 in qtest_vqmp (s=<optimized out>,
fmt=<optimized out>, ap=ap@entry=0x3ffe6bfa680) at
../tests/qtest/libqtest.c:749
#7  0x000002aa3f89e0aa in qtest_qmp (s=<optimized out>, fmt=<optimized
out>) at ../tests/qtest/libqtest.c:790
#8  0x000002aa3f88cc88 in migrate_status (qts=<optimized out>) at
../tests/qtest/virtio-net-failover.c:595
#9  0x000002aa3f892fcc in test_migrate_off_abort (opaque=<optimized
out>) at ../tests/qtest/virtio-net-failover.c:1425
#10 0x000003ff865fe608 in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
#11 0x000003ff865fe392 in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
#12 0x000003ff865fe392 in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
#13 0x000003ff865fe392 in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
#14 0x000003ff865fe392 in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
#15 0x000003ff865feada in g_test_run_suite () from
/lib/s390x-linux-gnu/libglib-2.0.so.0
#16 0x000003ff865feb10 in g_test_run () from
/lib/s390x-linux-gnu/libglib-2.0.so.0
#17 0x000002aa3f88c43a in main (argc=<optimized out>, argv=<optimized
out>) at ../tests/qtest/virtio-net-failover.c:1869
[Inferior 1 (process 2706321) detached]

===========================================================
PROCESS: 2706783
gitlab-+ 2706783 2706321 33 Mar23 ?        06:19:26 ./qemu-system-i386
-qtest unix:/tmp/qtest-2706321.sock -qtest-log /dev/null -chardev
socket,path=/tmp/qtest-2706321.qmp,id=char0 -mon
chardev=char0,mode=control -display none -M q35 -nodefaults -device
pcie-root-port,id=root0,addr=0x1,bus=pcie.0,chassis=1 -device
pcie-root-port,id=root1,addr=0x2,bus=pcie.0,chassis=2 -netdev
user,id=hs0 -netdev user,id=hs1 -accel qtest
[New LWP 2706795]
[New LWP 2706796]
[New LWP 2706797]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/s390x-linux-gnu/libthread_db.so.1".
0x000003ffa5cf1c9c in __ppoll (fds=0x2aa22222600, nfds=5,
timeout=<optimized out>, timeout@entry=0x3ffca97b438,
sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:44
44      ../sysdeps/unix/sysv/linux/ppoll.c: No such file or directory.

Thread 4 (Thread 0x3ff95ca1900 (LWP 2706797)):
#0  futex_wait_cancelable (private=0, expected=0,
futex_word=0x2aa214a5dd8) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0,
mutex=0x2aa1f426f10 <qemu_global_mutex>, cond=0x2aa214a5db0) at
pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=cond@entry=0x2aa214a5db0,
mutex=mutex@entry=0x2aa1f426f10 <qemu_global_mutex>) at
pthread_cond_wait.c:638
#3  0x000002aa1eab0e56 in qemu_cond_wait_impl (cond=0x2aa214a5db0,
mutex=0x2aa1f426f10 <qemu_global_mutex>, file=0x2aa1eb2e3fc
"../softmmu/cpus.c", line=<optimized out>) at
../util/qemu-thread-posix.c:195
#4  0x000002aa1e4f4436 in qemu_wait_io_event
(cpu=cpu@entry=0x2aa214993b0) at ../softmmu/cpus.c:424
#5  0x000002aa1e882f22 in dummy_cpu_thread_fn
(arg=arg@entry=0x2aa214993b0) at ../accel/dummy-cpus.c:53
#6  0x000002aa1eab05c2 in qemu_thread_start (args=<optimized out>) at
../util/qemu-thread-posix.c:556
#7  0x000003ffa5e07e66 in start_thread (arg=0x3ff95ca1900) at
pthread_create.c:477
#8  0x000003ffa5cfcbf6 in thread_start () at
../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 3 (Thread 0x3ff964a2900 (LWP 2706796)):
#0  0x000003ffa5cf1b42 in __GI___poll (fds=0x3ff88003250, nfds=3,
timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x000003ffa83d4386 in  () at /lib/s390x-linux-gnu/libglib-2.0.so.0
#2  0x000003ffa83d4790 in g_main_loop_run () at
/lib/s390x-linux-gnu/libglib-2.0.so.0
#3  0x000002aa1e8bfa4e in iothread_run
(opaque=opaque@entry=0x2aa212f3400) at ../iothread.c:73
#4  0x000002aa1eab05c2 in qemu_thread_start (args=<optimized out>) at
../util/qemu-thread-posix.c:556
#5  0x000003ffa5e07e66 in start_thread (arg=0x3ff964a2900) at
pthread_create.c:477
#6  0x000003ffa5cfcbf6 in thread_start () at
../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 2 (Thread 0x3ff96da4900 (LWP 2706795)):
#0  syscall () at ../sysdeps/unix/sysv/linux/s390/s390-64/syscall.S:37
#1  0x000002aa1eab15cc in qemu_futex_wait (val=<optimized out>,
f=<optimized out>) at
/home/gitlab-runner/builds/-LCfcJ2T/0/qemu-project/qemu/include/qemu/futex.h:29
#2  qemu_event_wait (ev=ev@entry=0x2aa1f457e60 <rcu_call_ready_event>)
at ../util/qemu-thread-posix.c:481
#3  0x000002aa1eab9e42 in call_rcu_thread (opaque=opaque@entry=0x0) at
../util/rcu.c:261
#4  0x000002aa1eab05c2 in qemu_thread_start (args=<optimized out>) at
../util/qemu-thread-posix.c:556
#5  0x000003ffa5e07e66 in start_thread (arg=0x3ff96da4900) at
pthread_create.c:477
#6  0x000003ffa5cfcbf6 in thread_start () at
../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 1 (Thread 0x3ffa91fe440 (LWP 2706783)):
#0  0x000003ffa5cf1c9c in __ppoll (fds=0x2aa22222600, nfds=5,
timeout=<optimized out>, timeout@entry=0x3ffca97b438,
sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:44
#1  0x000002aa1ead154a in ppoll (__ss=0x0, __timeout=0x3ffca97b438,
__nfds=<optimized out>, __fds=<optimized out>) at
/usr/include/s390x-linux-gnu/bits/poll2.h:77
#2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
timeout=timeout@entry=1000000000) at ../util/qemu-timer.c:348
#3  0x000002aa1eacd768 in os_host_main_loop_wait (timeout=1000000000)
at ../util/main-loop.c:250
#4  main_loop_wait (nonblocking=nonblocking@entry=0) at ../util/main-loop.c:531
#5  0x000002aa1e4faa94 in qemu_main_loop () at ../softmmu/runstate.c:727
#6  0x000002aa1e4ae7a0 in main (argc=<optimized out>, argv=<optimized
out>, envp=<optimized out>) at ../softmmu/main.c:50
[Inferior 1 (process 2706783) detached]

thanks
-- PMM


* Re: hang in virtio-failover-test (s390 host)
From: Laurent Vivier @ 2022-03-24 11:53 UTC
  To: Peter Maydell, QEMU Developers; +Cc: Alex Bennée, Dr. David Alan Gilbert

On 24/03/2022 12:11, Peter Maydell wrote:
> This is a backtrace from virtio-failover-test, which had hung
> on the s390 gitlab CI runner. Both processes were using CPU,
> so this is some kind of livelock, not a deadlock.
> 
> Looking more closely at the virtio-net-failover process, in the function
> test_migrate_off_abort() we execute 'migrate_cancel' and then go
> into a loop waiting for 'status' to be "cancelled", aborting if
> it is either "failed" or "active". But the status the QEMU process
> returns is "completed", so we loop forever waiting for a status change
> that will never come (I assume).
>

It means the migration has already completed before we tried to cancel it.
The test doesn't fail, but it is not valid.

Could you try this:

diff --git a/tests/qtest/virtio-net-failover.c b/tests/qtest/virtio-net-failover.c
index 80292eecf65f..80cda4ca28ce 100644
--- a/tests/qtest/virtio-net-failover.c
+++ b/tests/qtest/virtio-net-failover.c
@@ -1425,6 +1425,11 @@ static void test_migrate_off_abort(gconstpointer opaque)
          ret = migrate_status(qts);

          status = qdict_get_str(ret, "status");
+        if (strcmp(status, "completed") == 0) {
+            g_test_skip("Failed to cancel the migration");
+            qobject_unref(ret);
+            goto out;
+        }
          if (strcmp(status, "cancelled") == 0) {
              qobject_unref(ret);
              break;
@@ -1437,6 +1442,7 @@ static void test_migrate_off_abort(gconstpointer opaque)
      check_one_card(qts, true, "standby0", MAC_STANDBY0);
      check_one_card(qts, true, "primary0", MAC_PRIMARY0);

+out:
      qos_object_destroy((QOSGraphObject *)vdev);
      machine_stop(qts);
  }
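
(g_test_skip() marks the test as skipped rather than failed, so a run
where the migration happens to finish before migrate_cancel takes
effect is reported as skipped instead of as a regression.)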

Thanks,
Laurent



* Re: hang in virtio-failover-test (s390 host)
From: Peter Maydell @ 2022-03-24 13:01 UTC
  To: Laurent Vivier; +Cc: Alex Bennée, QEMU Developers, Dr. David Alan Gilbert

On Thu, 24 Mar 2022 at 11:53, Laurent Vivier <lvivier@redhat.com> wrote:
>
> On 24/03/2022 12:11, Peter Maydell wrote:
> > This is a backtrace from virtio-failover-test, which had hung
> > on the s390 gitlab CI runner. Both processes were using CPU,
> > so this is some kind of livelock, not a deadlock.
> >
> > Looking more closely at the virtio-net-failover process, in the function
> > test_migrate_off_abort() we execute 'migrate_cancel' and then go
> > into a loop waiting for 'status' to be "cancelled", aborting if
> > it is either "failed" or "active". But the status the QEMU process
> > returns is "completed", so we loop forever waiting for a status change
> > that will never come (I assume).
> >
>
> It means the migration has already completed before we tried to cancel it.
> The test doesn't fail, but it is not valid.
>
> Could you try this:
>
> diff --git a/tests/qtest/virtio-net-failover.c b/tests/qtest/virtio-net-failover.c
> index 80292eecf65f..80cda4ca28ce 100644
> --- a/tests/qtest/virtio-net-failover.c
> +++ b/tests/qtest/virtio-net-failover.c
> @@ -1425,6 +1425,11 @@ static void test_migrate_off_abort(gconstpointer opaque)
>           ret = migrate_status(qts);
>
>           status = qdict_get_str(ret, "status");
> +        if (strcmp(status, "completed") == 0) {
> +            g_test_skip("Failed to cancel the migration");
> +            qobject_unref(ret);
> +            goto out;
> +        }
>           if (strcmp(status, "cancelled") == 0) {
>               qobject_unref(ret);
>               break;
> @@ -1437,6 +1442,7 @@ static void test_migrate_off_abort(gconstpointer opaque)
>       check_one_card(qts, true, "standby0", MAC_STANDBY0);
>       check_one_card(qts, true, "primary0", MAC_PRIMARY0);
>
> +out:
>       qos_object_destroy((QOSGraphObject *)vdev);
>       machine_stop(qts);
>   }

Looks plausible, but I can't currently get this hang to reproduce
(it's probably a fairly rare intermittent failure), so I can't really
test a fix in any meaningful way.

It looks like there are several other loops in other tests in
this file which also need to check for "completed".

I would suggest maybe using check_migration_status() instead
of hand-rolling loops here, except that that function seems
to assert on an unexpected "completed" status whereas we want
the test to skip. It could probably be improved to be usable
here, though.
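
For illustration, the kind of helper shape I mean -- this is only a
sketch, the name is made up rather than existing libqtest or
migration-helpers API, and it assumes only migrate_status() and
qdict_get_str() plus the usual glib calls:

static char *wait_for_terminal_migration_status(QTestState *qts)
{
    /* Poll until the migration reaches a terminal state and return it,
     * so each caller can decide whether "completed" means skip or fail. */
    for (;;) {
        QDict *ret = migrate_status(qts);
        const char *status = qdict_get_str(ret, "status");

        if (strcmp(status, "cancelled") == 0 ||
            strcmp(status, "completed") == 0 ||
            strcmp(status, "failed") == 0) {
            char *terminal = g_strdup(status);
            qobject_unref(ret);
            return terminal;        /* caller must g_free() it */
        }
        qobject_unref(ret);
        g_usleep(10 * 1000);        /* don't hammer the QMP socket */
    }
}

and then in test_migrate_off_abort() and the other loops something like:

    char *status = wait_for_terminal_migration_status(qts);
    if (strcmp(status, "completed") == 0) {
        g_test_skip("Failed to cancel the migration");
        g_free(status);
        goto out;
    }
    g_assert_cmpstr(status, ==, "cancelled");
    g_free(status);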

thanks
-- PMM

