QEMU-Devel Archive on lore.kernel.org
 help / color / Atom feed
* [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system
@ 2018-11-26 22:53 dann frazier
  2018-11-26 23:26 ` [Qemu-devel] [Bug 1805256] " John Snow
                   ` (94 more replies)
  0 siblings, 95 replies; 141+ messages in thread
From: dann frazier @ 2018-11-26 22:53 UTC (permalink / raw)
  To: qemu-devel

Public bug reported:

On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
frequently hangs (~50% of the time) with this command:

qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
qcow2->qcow2 conversion happens to be something uvtool does every time
it fetches images.

Once hung, attaching gdb gives the following backtrace:

(gdb) bt
#0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760, 
    timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
    at ../sysdeps/unix/sysv/linux/ppoll.c:39
#1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, 
    __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
#2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, 
    timeout=timeout@entry=-1) at util/qemu-timer.c:322
#3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
    at util/main-loop.c:233
#4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
#5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
#6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
#7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

Reproduced w/ latest QEMU git (@ 53744e0a182)

** Affects: qemu
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on high core count ARM system

Status in QEMU:
  New

Bug description:
  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760, 
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, 
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, 
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions

^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Qemu-devel] [Bug 1805256] Re: qemu-img hangs on high core count ARM system
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
@ 2018-11-26 23:26 ` John Snow
  2018-11-26 23:54 ` dann frazier
                   ` (93 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: John Snow @ 2018-11-26 23:26 UTC (permalink / raw)
  To: qemu-devel

Hi, can you do a `thread apply all bt` instead? If I were to bet, we're
probably waiting for some slow call like lseek to return in another
thread.

What filesystem/blockdevice is involved here?

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on high core count ARM system

Status in QEMU:
  New

Bug description:
  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760, 
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, 
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, 
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions

^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Qemu-devel] [Bug 1805256] Re: qemu-img hangs on high core count ARM system
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
  2018-11-26 23:26 ` [Qemu-devel] [Bug 1805256] " John Snow
@ 2018-11-26 23:54 ` dann frazier
  2018-12-05 11:20 ` Alex Bennée
                   ` (92 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: dann frazier @ 2018-11-26 23:54 UTC (permalink / raw)
  To: qemu-devel

ext4 filesystem, SATA drive:

(gdb) thread apply all bt

Thread 3 (Thread 0xffff9bffc9a0 (LWP 9015)):
#0  0x0000ffffaaa462cc in __GI___sigtimedwait (set=<optimized out>, 
    set@entry=0xaaaae725c070, info=info@entry=0xffff9bffbf18, 
    timeout=0x3ff0000000000001, timeout@entry=0x0)
    at ../sysdeps/unix/sysv/linux/sigtimedwait.c:42
#1  0x0000ffffaab7dfac in __sigwait (set=set@entry=0xaaaae725c070, 
    sig=sig@entry=0xffff9bffbff4) at ../sysdeps/unix/sysv/linux/sigwait.c:28
#2  0x0000aaaad998a628 in sigwait_compat (opaque=0xaaaae725c070)
    at util/compatfd.c:36
#3  0x0000aaaad998bce0 in qemu_thread_start (args=<optimized out>)
    at util/qemu-thread-posix.c:498
#4  0x0000ffffaab73088 in start_thread (arg=0xffffc528531f)
    at pthread_create.c:463
#5  0x0000ffffaaae34ec in thread_start ()
    at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

Thread 2 (Thread 0xffffa0e779a0 (LWP 9014)):
#0  syscall () at ../sysdeps/unix/sysv/linux/aarch64/syscall.S:38
#1  0x0000aaaad998c9e8 in qemu_futex_wait (val=<optimized out>, f=<optimized out>)
    at /home/ubuntu/qemu/include/qemu/futex.h:29
#2  qemu_event_wait (ev=ev@entry=0xaaaad9a091c0 <rcu_call_ready_event>)
    at util/qemu-thread-posix.c:442
#3  0x0000aaaad99a6834 in call_rcu_thread (opaque=<optimized out>)
    at util/rcu.c:261
#4  0x0000aaaad998bce0 in qemu_thread_start (args=<optimized out>)
    at util/qemu-thread-posix.c:498
#5  0x0000ffffaab73088 in start_thread (arg=0xffffc528542f)
    at pthread_create.c:463
#6  0x0000ffffaaae34ec in thread_start ()
    at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

Thread 1 (Thread 0xffffa0fa8010 (LWP 9013)):
#0  0x0000ffffaaada154 in __GI_ppoll (fds=0xaaaae7291dc0, nfds=187650771816320, 
    timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc52852e0)
    at ../sysdeps/unix/sysv/linux/ppoll.c:39
#1  0x0000aaaad9987f00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, 
    __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
#2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, 
    timeout=timeout@entry=-1) at util/qemu-timer.c:322
#3  0x0000aaaad9988f80 in os_host_main_loop_wait (timeout=-1)
    at util/main-loop.c:233
#4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
#5  0x0000aaaad98b7a30 in convert_do_copy (s=0xffffc52854e8) at qemu-img.c:1980
#6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
#7  0x0000aaaad98b033c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on high core count ARM system

Status in QEMU:
  New

Bug description:
  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760, 
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, 
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, 
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions

^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Qemu-devel] [Bug 1805256] Re: qemu-img hangs on high core count ARM system
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
  2018-11-26 23:26 ` [Qemu-devel] [Bug 1805256] " John Snow
  2018-11-26 23:54 ` dann frazier
@ 2018-12-05 11:20 ` Alex Bennée
  2019-04-15 12:59 ` 贞贵李
                   ` (91 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Alex Bennée @ 2018-12-05 11:20 UTC (permalink / raw)
  To: qemu-devel

** Tags added: qemu-img

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on high core count ARM system

Status in QEMU:
  New

Bug description:
  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760, 
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, 
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, 
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions

^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Qemu-devel] [Bug 1805256] Re: qemu-img hangs on high core count ARM system
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (2 preceding siblings ...)
  2018-12-05 11:20 ` Alex Bennée
@ 2019-04-15 12:59 ` 贞贵李
  2019-04-15 12:59   ` 贞贵李
  2019-04-15 14:37 ` 贞贵李
                   ` (90 subsequent siblings)
  94 siblings, 1 reply; 141+ messages in thread
From: 贞贵李 @ 2019-04-15 12:59 UTC (permalink / raw)
  To: qemu-devel

Hi, I also found a problem that qemu-img convert hands in ARM.

The convert command line is "qemu-img convert -f qcow2 -O raw disk.qcow2
disk.raw ".

The bt is below:

Thread 2 (Thread 0x40000b776e50 (LWP 27215)):
#0 0x000040000a3f2994 in sigtimedwait () from /lib64/libc.so.6
#1 0x000040000a39c60c in sigwait () from /lib64/libpthread.so.0
#2 0x0000aaaaaae82610 in sigwait_compat (opaque=0xaaaac5163b00) at util/compatfd.c:37
#3 0x0000aaaaaae85038 in qemu_thread_start (args=args@entry=0xaaaac5163b90) at util/qemu_thread_posix.c:496
#4 0x000040000a3918bc in start_thread () from /lib64/libpthread.so.0
#5 0x000040000a492b2c in thread_start () from /lib64/libc.so.6

Thread 1 (Thread 0x40000b573370 (LWP 27214)):
#0 0x000040000a489020 in ppoll () from /lib64/libc.so.6
#1 0x0000aaaaaadaefc0 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2 qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=<optimized out>) at qemu_timer.c:391
#3 0x0000aaaaaadae014 in os_host_main_loop_wait (timeout=<optimized out>) at main_loop.c:272
#4 0x0000aaaaaadae190 in main_loop_wait (nonblocking=<optimized out>) at main_loop.c:534
#5 0x0000aaaaaad97be0 in convert_do_copy (s=0xffffdc32eb48) at qemu-img.c:1923
#6 0x0000aaaaaada2d70 in img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2414
#7 0x0000aaaaaad99ac4 in main (argc=7, argv=<optimized out>) at qemu-img.c:5305


Do you find the cause of the problem and fix it? Thanks for your reply!

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on high core count ARM system

Status in QEMU:
  New

Bug description:
  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760, 
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, 
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, 
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions

^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Qemu-devel] [Bug 1805256] Re: qemu-img hangs on high core count ARM system
  2019-04-15 12:59 ` 贞贵李
@ 2019-04-15 12:59   ` 贞贵李
  0 siblings, 0 replies; 141+ messages in thread
From: 贞贵李 @ 2019-04-15 12:59 UTC (permalink / raw)
  To: qemu-devel

Hi, I also found a problem that qemu-img convert hands in ARM.

The convert command line is "qemu-img convert -f qcow2 -O raw disk.qcow2
disk.raw ".

The bt is below:

Thread 2 (Thread 0x40000b776e50 (LWP 27215)):
#0 0x000040000a3f2994 in sigtimedwait () from /lib64/libc.so.6
#1 0x000040000a39c60c in sigwait () from /lib64/libpthread.so.0
#2 0x0000aaaaaae82610 in sigwait_compat (opaque=0xaaaac5163b00) at util/compatfd.c:37
#3 0x0000aaaaaae85038 in qemu_thread_start (args=args@entry=0xaaaac5163b90) at util/qemu_thread_posix.c:496
#4 0x000040000a3918bc in start_thread () from /lib64/libpthread.so.0
#5 0x000040000a492b2c in thread_start () from /lib64/libc.so.6

Thread 1 (Thread 0x40000b573370 (LWP 27214)):
#0 0x000040000a489020 in ppoll () from /lib64/libc.so.6
#1 0x0000aaaaaadaefc0 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2 qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=<optimized out>) at qemu_timer.c:391
#3 0x0000aaaaaadae014 in os_host_main_loop_wait (timeout=<optimized out>) at main_loop.c:272
#4 0x0000aaaaaadae190 in main_loop_wait (nonblocking=<optimized out>) at main_loop.c:534
#5 0x0000aaaaaad97be0 in convert_do_copy (s=0xffffdc32eb48) at qemu-img.c:1923
#6 0x0000aaaaaada2d70 in img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2414
#7 0x0000aaaaaad99ac4 in main (argc=7, argv=<optimized out>) at qemu-img.c:5305


Do you find the cause of the problem and fix it? Thanks for your reply!

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on high core count ARM system

Status in QEMU:
  New

Bug description:
  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760, 
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, 
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, 
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Qemu-devel] [Bug 1805256] Re: qemu-img hangs on high core count ARM system
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (3 preceding siblings ...)
  2019-04-15 12:59 ` 贞贵李
@ 2019-04-15 14:37 ` 贞贵李
  2019-04-15 14:37   ` 贞贵李
  2019-04-15 22:25 ` dann frazier
                   ` (89 subsequent siblings)
  94 siblings, 1 reply; 141+ messages in thread
From: 贞贵李 @ 2019-04-15 14:37 UTC (permalink / raw)
  To: qemu-devel

sorry, I make a spelling mistake here("Hi, I also found a problem that
qemu-img convert hands in ARM.").The right is "I also found a problem
that qemu-img convert hangs in ARM".

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on high core count ARM system

Status in QEMU:
  New

Bug description:
  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760, 
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, 
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, 
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions

^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Qemu-devel] [Bug 1805256] Re: qemu-img hangs on high core count ARM system
  2019-04-15 14:37 ` 贞贵李
@ 2019-04-15 14:37   ` 贞贵李
  0 siblings, 0 replies; 141+ messages in thread
From: 贞贵李 @ 2019-04-15 14:37 UTC (permalink / raw)
  To: qemu-devel

sorry, I make a spelling mistake here("Hi, I also found a problem that
qemu-img convert hands in ARM.").The right is "I also found a problem
that qemu-img convert hangs in ARM".

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on high core count ARM system

Status in QEMU:
  New

Bug description:
  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760, 
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, 
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, 
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Qemu-devel] [Bug 1805256] Re: qemu-img hangs on high core count ARM system
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (4 preceding siblings ...)
  2019-04-15 14:37 ` 贞贵李
@ 2019-04-15 22:25 ` dann frazier
  2019-04-15 22:25   ` dann frazier
  2019-04-15 23:37 ` dann frazier
                   ` (88 subsequent siblings)
  94 siblings, 1 reply; 141+ messages in thread
From: dann frazier @ 2019-04-15 22:25 UTC (permalink / raw)
  To: qemu-devel

** Changed in: qemu
       Status: New => Confirmed

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on high core count ARM system

Status in QEMU:
  Confirmed

Bug description:
  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760, 
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, 
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, 
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions

^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Qemu-devel] [Bug 1805256] Re: qemu-img hangs on high core count ARM system
  2019-04-15 22:25 ` dann frazier
@ 2019-04-15 22:25   ` dann frazier
  0 siblings, 0 replies; 141+ messages in thread
From: dann frazier @ 2019-04-15 22:25 UTC (permalink / raw)
  To: qemu-devel

** Changed in: qemu
       Status: New => Confirmed

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on high core count ARM system

Status in QEMU:
  Confirmed

Bug description:
  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760, 
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, 
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, 
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Qemu-devel] [Bug 1805256] Re: qemu-img hangs on high core count ARM system
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (5 preceding siblings ...)
  2019-04-15 22:25 ` dann frazier
@ 2019-04-15 23:37 ` dann frazier
  2019-04-15 23:37   ` dann frazier
  2019-04-16  8:16 ` 贞贵李
                   ` (87 subsequent siblings)
  94 siblings, 1 reply; 141+ messages in thread
From: dann frazier @ 2019-04-15 23:37 UTC (permalink / raw)
  To: qemu-devel

No, sorry - this bugs still persists w/ latest upstream (@ afccfc0). I
found a report of similar symptoms:

  https://patchwork.kernel.org/patch/10047341/
  https://bugzilla.redhat.com/show_bug.cgi?id=1524770#c13

To be clear, ^ is already fixed upstream, so it is not the *same* issue
- but perhaps related.


** Bug watch added: Red Hat Bugzilla #1524770
   https://bugzilla.redhat.com/show_bug.cgi?id=1524770

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on high core count ARM system

Status in QEMU:
  Confirmed

Bug description:
  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760, 
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, 
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, 
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions

^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Qemu-devel] [Bug 1805256] Re: qemu-img hangs on high core count ARM system
  2019-04-15 23:37 ` dann frazier
@ 2019-04-15 23:37   ` dann frazier
  0 siblings, 0 replies; 141+ messages in thread
From: dann frazier @ 2019-04-15 23:37 UTC (permalink / raw)
  To: qemu-devel

No, sorry - this bugs still persists w/ latest upstream (@ afccfc0). I
found a report of similar symptoms:

  https://patchwork.kernel.org/patch/10047341/
  https://bugzilla.redhat.com/show_bug.cgi?id=1524770#c13

To be clear, ^ is already fixed upstream, so it is not the *same* issue
- but perhaps related.


** Bug watch added: Red Hat Bugzilla #1524770
   https://bugzilla.redhat.com/show_bug.cgi?id=1524770

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on high core count ARM system

Status in QEMU:
  Confirmed

Bug description:
  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760, 
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, 
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, 
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Qemu-devel] [Bug 1805256] Re: qemu-img hangs on high core count ARM system
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (6 preceding siblings ...)
  2019-04-15 23:37 ` dann frazier
@ 2019-04-16  8:16 ` 贞贵李
  2019-04-16  8:16   ` 贞贵李
  2019-04-16 13:32 ` 贞贵李
                   ` (86 subsequent siblings)
  94 siblings, 1 reply; 141+ messages in thread
From: 贞贵李 @ 2019-04-16  8:16 UTC (permalink / raw)
  To: qemu-devel

Do you have any good ideas about it? Maybe somewhere lack of memeory
barriers that cause it?

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on high core count ARM system

Status in QEMU:
  Confirmed

Bug description:
  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760, 
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, 
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, 
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions

^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Qemu-devel] [Bug 1805256] Re: qemu-img hangs on high core count ARM system
  2019-04-16  8:16 ` 贞贵李
@ 2019-04-16  8:16   ` 贞贵李
  0 siblings, 0 replies; 141+ messages in thread
From: 贞贵李 @ 2019-04-16  8:16 UTC (permalink / raw)
  To: qemu-devel

Do you have any good ideas about it? Maybe somewhere lack of memeory
barriers that cause it?

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on high core count ARM system

Status in QEMU:
  Confirmed

Bug description:
  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760, 
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, 
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, 
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Qemu-devel] [Bug 1805256] Re: qemu-img hangs on high core count ARM system
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (7 preceding siblings ...)
  2019-04-16  8:16 ` 贞贵李
@ 2019-04-16 13:32 ` 贞贵李
  2019-04-16 13:32   ` 贞贵李
  2019-04-23  1:29 ` 贞贵李
                   ` (85 subsequent siblings)
  94 siblings, 1 reply; 141+ messages in thread
From: 贞贵李 @ 2019-04-16 13:32 UTC (permalink / raw)
  To: qemu-devel

frazier, Do you find the conditions that necessarily make this problem appear?

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on high core count ARM system

Status in QEMU:
  Confirmed

Bug description:
  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760, 
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, 
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, 
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions

^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Qemu-devel] [Bug 1805256] Re: qemu-img hangs on high core count ARM system
  2019-04-16 13:32 ` 贞贵李
@ 2019-04-16 13:32   ` 贞贵李
  0 siblings, 0 replies; 141+ messages in thread
From: 贞贵李 @ 2019-04-16 13:32 UTC (permalink / raw)
  To: qemu-devel

frazier, Do you find the conditions that necessarily make this problem appear?

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on high core count ARM system

Status in QEMU:
  Confirmed

Bug description:
  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760, 
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, 
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, 
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Qemu-devel] [Bug 1805256] Re: qemu-img hangs on high core count ARM system
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (8 preceding siblings ...)
  2019-04-16 13:32 ` 贞贵李
@ 2019-04-23  1:29 ` 贞贵李
  2019-04-23  1:29   ` 贞贵李
  2019-06-05 16:16 ` dann frazier
                   ` (84 subsequent siblings)
  94 siblings, 1 reply; 141+ messages in thread
From: 贞贵李 @ 2019-04-23  1:29 UTC (permalink / raw)
  To: qemu-devel

I can reproduce this problem with qemu.git/matser. It still exists in qemu.git/matser. I found that when an IO return in worker threads and want to call aio_notify to wake up main_loop, but it found that ctx->notify_me is cleared to 0 by main_loop in aio_ctx_check by calling atomic_and(&ctx->notify_me, ~1) . So worker thread won't write enventfd to notify main_loop.If such a scene happens, the main_loop will hang:
    main loop                        worker thread1                  worker thread2
-----------------------------------------------------------------------------------------------
    qemu_poll_ns                    aio_worker
                            qemu_bh_schedule(pool->completion_bh)
    glib_pollfds_poll
    g_main_context_check
    aio_ctx_check
    atomic_and(&ctx->notify_me, ~1)                                aio_worker
                                                      qemu_bh_schedule(pool->completion_bh)
    /* do something for event */
    qemu_poll_ns
    /* hangs !!!*/

As we known, ctx->notify_me will be visited by worker thread and main
loop. I thank we should add a lock protection for ctx->notify_me to
avoid this happend.what do you thank so?

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on high core count ARM system

Status in QEMU:
  Confirmed

Bug description:
  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760, 
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, 
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, 
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions

^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Qemu-devel] [Bug 1805256] Re: qemu-img hangs on high core count ARM system
  2019-04-23  1:29 ` 贞贵李
@ 2019-04-23  1:29   ` 贞贵李
  0 siblings, 0 replies; 141+ messages in thread
From: 贞贵李 @ 2019-04-23  1:29 UTC (permalink / raw)
  To: qemu-devel

I can reproduce this problem with qemu.git/matser. It still exists in qemu.git/matser. I found that when an IO return in worker threads and want to call aio_notify to wake up main_loop, but it found that ctx->notify_me is cleared to 0 by main_loop in aio_ctx_check by calling atomic_and(&ctx->notify_me, ~1) . So worker thread won't write enventfd to notify main_loop.If such a scene happens, the main_loop will hang:
    main loop                        worker thread1                  worker thread2
-----------------------------------------------------------------------------------------------
    qemu_poll_ns                    aio_worker
                            qemu_bh_schedule(pool->completion_bh)
    glib_pollfds_poll
    g_main_context_check
    aio_ctx_check
    atomic_and(&ctx->notify_me, ~1)                                aio_worker
                                                      qemu_bh_schedule(pool->completion_bh)
    /* do something for event */
    qemu_poll_ns
    /* hangs !!!*/

As we known, ctx->notify_me will be visited by worker thread and main
loop. I thank we should add a lock protection for ctx->notify_me to
avoid this happend.what do you thank so?

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on high core count ARM system

Status in QEMU:
  Confirmed

Bug description:
  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760, 
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, 
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, 
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Qemu-devel] [Bug 1805256] Re: qemu-img hangs on high core count ARM system
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (9 preceding siblings ...)
  2019-04-23  1:29 ` 贞贵李
@ 2019-06-05 16:16 ` dann frazier
  2019-09-05 15:03 ` Rafael David Tinoco
                   ` (83 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: dann frazier @ 2019-06-05 16:16 UTC (permalink / raw)
  To: qemu-devel

** Also affects: qemu (Ubuntu)
   Importance: Undecided
       Status: New

** Changed in: qemu (Ubuntu)
       Status: New => Confirmed

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on high core count ARM system

Status in QEMU:
  Confirmed
Status in qemu package in Ubuntu:
  Confirmed

Bug description:
  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760, 
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, 
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, 
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Qemu-devel] [Bug 1805256] Re: qemu-img hangs on high core count ARM system
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (10 preceding siblings ...)
  2019-06-05 16:16 ` dann frazier
@ 2019-09-05 15:03 ` Rafael David Tinoco
  2019-09-06 15:12 ` Rafael David Tinoco
                   ` (82 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Rafael David Tinoco @ 2019-09-05 15:03 UTC (permalink / raw)
  To: qemu-devel

** Changed in: qemu (Ubuntu)
       Status: Confirmed => In Progress

** Changed in: qemu (Ubuntu)
     Assignee: (unassigned) => Rafael David Tinoco (rafaeldtinoco)

** Changed in: qemu (Ubuntu)
   Importance: Undecided => Medium

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on high core count ARM system

Status in QEMU:
  Confirmed
Status in qemu package in Ubuntu:
  In Progress

Bug description:
  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760, 
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, 
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, 
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Qemu-devel] [Bug 1805256] Re: qemu-img hangs on high core count ARM system
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (11 preceding siblings ...)
  2019-09-05 15:03 ` Rafael David Tinoco
@ 2019-09-06 15:12 ` Rafael David Tinoco
  2019-09-06 15:16 ` Rafael David Tinoco
                   ` (81 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Rafael David Tinoco @ 2019-09-06 15:12 UTC (permalink / raw)
  To: qemu-devel

Hello Liz,

I'll try to reproduce this issue in a Cortex-A53 aarch64 real
environment (w/ 24 HW threads) AND in a virtual environment w/ lots of
vCPUs... but, if it's a barrier missing - or the lack of atomicity
and/or ordering in a primitive - then, I'm afraid the context switch in
between vCPUs might not be the same as in real CPUs (IPIs are sent and
handled differently and host kernel delays IPI delivery because of its
own callbacks, before scheduling, etc...) and I could need a qemu dump
from your environment.

Would that be feasible ? Can you reproduce this nowadays ? This bug has
aged a little, so I'm now sure!

Could you provide me the dump caused by latest package available for
your Ubuntu version ? This way I have the debug symbols to work with.

Meanwhile, I'll be trying to reproduce on my side.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on high core count ARM system

Status in QEMU:
  Confirmed
Status in qemu package in Ubuntu:
  In Progress

Bug description:
  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760, 
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, 
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, 
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Qemu-devel] [Bug 1805256] Re: qemu-img hangs on high core count ARM system
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (12 preceding siblings ...)
  2019-09-06 15:12 ` Rafael David Tinoco
@ 2019-09-06 15:16 ` Rafael David Tinoco
  2019-09-06 21:22 ` Rafael David Tinoco
                   ` (80 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Rafael David Tinoco @ 2019-09-06 15:16 UTC (permalink / raw)
  To: qemu-devel

OOhh nm on the virtual environment test, as I just remembered we don't
have KVM on 2nd level for aarch64 yet (at least in ARMv8 implementing
virt extension). I'll try to reproduce in the real env only.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on high core count ARM system

Status in QEMU:
  Confirmed
Status in qemu package in Ubuntu:
  In Progress

Bug description:
  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760, 
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, 
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, 
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Qemu-devel] [Bug 1805256] Re: qemu-img hangs on high core count ARM system
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (13 preceding siblings ...)
  2019-09-06 15:16 ` Rafael David Tinoco
@ 2019-09-06 21:22 ` Rafael David Tinoco
  2019-09-09 16:47 ` Rafael David Tinoco
                   ` (79 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Rafael David Tinoco @ 2019-09-06 21:22 UTC (permalink / raw)
  To: qemu-devel

Alright, I couldn't reproduce this yet, I'm running same test case in a
24 cores box and causing lots of context switches and CPU migrations in
parallel (trying to exhaust the logic).

Will let this running for sometime to check.

Unfortunately this can be related QEMU AIO BH locking/primitives and
cache coherency in the HW in question (which I got specs from:
https://en.wikichip.org/wiki/hisilicon/kunpeng/hi1616):

l1$ size	8 MiB
l1d$ size	4 MiB
l1i$ size	4 MiB
l2$ size	32 MiB
l3$ size	64 MiB

like for example when having 2 threads in different NUMA domains, or
some other situation.

I can't simulate the same since I have a SOC with:

Cortex-A53 MPCore 24cores,

L1 I/D=32KB/32KB
L2 =256KB
L3 =4MB

and I'm not even close to L1/L2/L3 cache numbers from D06 =o).

Just got a note that I'll be able to reproduce this in the real HW, will
get back soon with real gdb debugging.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on high core count ARM system

Status in QEMU:
  Confirmed
Status in qemu package in Ubuntu:
  In Progress

Bug description:
  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760, 
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, 
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, 
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Qemu-devel] [Bug 1805256] Re: qemu-img hangs on high core count ARM system
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (14 preceding siblings ...)
  2019-09-06 21:22 ` Rafael David Tinoco
@ 2019-09-09 16:47 ` Rafael David Tinoco
  2019-09-10  2:04 ` Rafael David Tinoco
                   ` (78 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Rafael David Tinoco @ 2019-09-09 16:47 UTC (permalink / raw)
  To: qemu-devel

Alright, with a d06 aarch64 machine I was able to reproduce it after 8
attempts.I'll debug it today and provide feedback on my findings.

(gdb) bt full
#0  0x0000ffffb0b2181c in __GI_ppoll (fds=0xaaaace5ab770, nfds=4, timeout=<optimized out>, timeout@entry=0x0,
    sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:39
        _x3tmp = 0
        _x0tmp = 187650583213936
        _x0 = 187650583213936
        _x3 = 0
        _x4tmp = 8
        _x1tmp = 4
        _x1 = 4
        _x4 = 8
        _x2tmp = <optimized out>
        _x2 = 0
        _x8 = 73
        _sys_result = <optimized out>
        _sys_result = <optimized out>
        sc_cancel_oldtype = 0
        sc_ret = <optimized out>
        tval = {tv_sec = 0, tv_nsec = 187650583137792}
#1  0x0000aaaacd2a773c in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>)
    at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
No locals.
#2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=-1) at ./util/qemu-timer.c:322
No locals.
#3  0x0000aaaacd2a8764 in os_host_main_loop_wait (timeout=-1) at ./util/main-loop.c:233
        context = 0xaaaace599d90
        ret = <optimized out>
        context = <optimized out>
        ret = <optimized out>
#4  main_loop_wait (nonblocking=<optimized out>) at ./util/main-loop.c:497
        ret = <optimized out>
        timeout = 4294967295
        timeout_ns = <optimized out>
#5  0x0000aaaacd1df454 in convert_do_copy (s=0xfffff9b2b1d8) at ./qemu-img.c:1981
        ret = <optimized out>
        i = <optimized out>
        n = <optimized out>
        sector_num = <optimized out>
        ret = <optimized out>
        i = <optimized out>
        n = <optimized out>
        sector_num = <optimized out>
#6  img_convert (argc=<optimized out>, argv=<optimized out>) at ./qemu-img.c:2457
        c = <optimized out>
        bs_i = <optimized out>
        flags = 16898
        src_flags = 0
        fmt = 0xfffff9b2bad1 "qcow2"
        out_fmt = <optimized out>
        cache = 0xaaaacd2cb1c8 "unsafe"
        src_cache = 0xaaaacd2ca9c0 "writeback"
        out_baseimg = <optimized out>
        out_filename = <optimized out>
        out_baseimg_param = <optimized out>
        snapshot_name = 0x0
        drv = <optimized out>
        proto_drv = <optimized out>
        bdi = {cluster_size = 65536, vm_state_offset = 32212254720, is_dirty = false, unallocated_blocks_are_zero = true,
          needs_compressed_writes = false}
        out_bs = <optimized out>
        opts = 0xaaaace5ab390
        sn_opts = 0x0
        create_opts = 0xaaaace5ab0c0
        open_opts = <optimized out>
        options = 0x0
        local_err = 0x0
        writethrough = false
        src_writethrough = false
        quiet = <optimized out>
        image_opts = false
        skip_create = false
        progress = <optimized out>
        tgt_image_opts = false
        ret = <optimized out>
        force_share = false
        explict_min_sparse = false
        s = {src = 0xaaaace577240, src_sectors = 0xaaaace577300, src_num = 1, total_sectors = 62914560,allocated_sectors = 9572096, allocated_done = 6541440, sector_num = 8863744, wr_offs = 8859776, status = BLK_DATA, sector_next_status = 8863744, target = 0xaaaace5bd2a0, has_zero_init = true,compressed = false, unallocated_blocks_are_zero = true, target_has_backing = false, target_backing_sectors = -1, wr_in_order = true, copy_range = false, min_sparse = 8, alignment = 8,cluster_sectors = 128, buf_sectors = 4096, num_coroutines = 8, running_coroutines = 8, co = {0xaaaace5ceda0,0xaaaace5cef50, 0xaaaace5cf100, 0xaaaace5cf2b0, 0xaaaace5cf460, 0xaaaace5cf610, 0xaaaace5cf7c0,0xaaaace5cf970, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, wait_sector_num = {-1, 8859904, 8860928, 8863360,8861952, 8862976, 8862592, 8861440, 0, 0, 0, 0, 0, 0, 0, 0}, lock = {locked = 0, ctx = 0x0, from_push = {slh_first = 0x0}, to_pop = {slh_first = 0x0}, handoff = 0, sequence = 0, holder = 0x0}, ret = -115}
        __PRETTY_FUNCTION__ = "img_convert"
#7  0x0000aaaacd1d8400 in main (argc=7, argv=<optimized out>) at ./qemu-img.c:4976
        cmd = 0xaaaacd34ad78 <img_cmds+80>
        cmdname = <optimized out>
        local_error = 0x0
        trace_file = 0x0
        c = <optimized out>
        long_options = {{name = 0xaaaacd2cbbb0 "help", has_arg = 0, flag = 0x0, val = 104}, {
            name = 0xaaaacd2cbc78 "version", has_arg = 0, flag = 0x0, val = 86}, {name = 0xaaaacd2cbc80 "trace",
            has_arg = 1, flag = 0x0, val = 84}, {name = 0x0, has_arg = 0, flag = 0x0, val = 0}}

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on high core count ARM system

Status in QEMU:
  Confirmed
Status in qemu package in Ubuntu:
  In Progress

Bug description:
  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760, 
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, 
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, 
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Qemu-devel] [Bug 1805256] Re: qemu-img hangs on high core count ARM system
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (15 preceding siblings ...)
  2019-09-09 16:47 ` Rafael David Tinoco
@ 2019-09-10  2:04 ` Rafael David Tinoco
  2019-09-10 14:16 ` Rafael David Tinoco
                   ` (77 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Rafael David Tinoco @ 2019-09-10  2:04 UTC (permalink / raw)
  To: qemu-devel

Alright,

I'm still investigating this but wanted to share some findings... I
haven't got a kernel dump yet after the task is frozen, I have analyzed
only the userland part of it (although I have checked if code was
running inside kernel with perf cycles:u/cycles:k at some point).

The big picture is this: Whenever qemu-img hangs, we have 3 hung tasks
basically with these stacks:

----

TRHREAD #1
__GI_ppoll (../sysdeps/unix/sysv/linux/ppoll.c:39)
ppoll (/usr/include/aarch64-linux-gnu/bits/poll2.h:77)
qemu_poll_ns (./util/qemu-timer.c:322)
os_host_main_loop_wait (./util/main-loop.c:233)
main_loop_wait (./util/main-loop.c:497)
convert_do_copy (./qemu-img.c:1981)
img_convert (./qemu-img.c:2457)
main (./qemu-img.c:4976)

got stack traces:

./33293/stack                          ./33293/stack                         
[<0>] __switch_to+0xc0/0x218           [<0>] __switch_to+0xc0/0x218          
[<0>] ptrace_stop+0x148/0x2b0          [<0>] do_sys_poll+0x508/0x5c0         
[<0>] get_signal+0x5a4/0x730           [<0>] __arm64_sys_ppoll+0xc0/0x118    
[<0>] do_notify_resume+0x158/0x358     [<0>] el0_svc_common+0xa0/0x168       
[<0>] work_pending+0x8/0x10            [<0>] el0_svc_handler+0x38/0x78       
                                       [<0>] el0_svc+0x8/0xc  

root@d06-1:~$ perf record -F 9999 -e cycles:u -p 33293 -- sleep 10
[ perf record: Woken up 6 times to write data ]
[ perf record: Captured and wrote 1.871 MB perf.data (48730 samples) ]

root@d06-1:~$ perf report --stdio
# Overhead  Command   Shared Object       Symbol
# ........  ........  ..................  ......................
#
    37.82%  qemu-img  libc-2.29.so        [.] 0x00000000000df710
    21.81%  qemu-img  [unknown]           [k] 0xffff000010099504
    14.23%  qemu-img  [unknown]           [k] 0xffff000010085dc0
     9.13%  qemu-img  [unknown]           [k] 0xffff00001008fff8
     6.47%  qemu-img  libc-2.29.so        [.] 0x00000000000df708
     5.69%  qemu-img  qemu-img            [.] qemu_event_reset
     2.57%  qemu-img  libc-2.29.so        [.] 0x00000000000df678
     0.63%  qemu-img  libc-2.29.so        [.] 0x00000000000df700
     0.49%  qemu-img  libc-2.29.so        [.] __sigtimedwait
     0.42%  qemu-img  libpthread-2.29.so  [.] __libc_sigwait

----

TRHREAD #3
__GI___sigtimedwait (../sysdeps/unix/sysv/linux/sigtimedwait.c:29)
__sigwait (linux/sigwait.c:28)
qemu_thread_start (./util/qemu-thread-posix.c:498)
start_thread (pthread_create.c:486)
thread_start (linux/aarch64/clone.S:78)


./33303/stack                          ./33303/stack                               
[<0>] __switch_to+0xc0/0x218           [<0>] __switch_to+0xc0/0x218                
[<0>] ptrace_stop+0x148/0x2b0          [<0>] do_sigtimedwait.isra.9+0x194/0x288    
[<0>] get_signal+0x5a4/0x730           [<0>] __arm64_sys_rt_sigtimedwait+0xac/0x110
[<0>] do_notify_resume+0x158/0x358     [<0>] el0_svc_common+0xa0/0x168             
[<0>] work_pending+0x8/0x10            [<0>] el0_svc_handler+0x38/0x78             
                                       [<0>] el0_svc+0x8/0xc   

root@d06-1:~$ perf record -F 9999 -e cycles:u -p 33303 -- sleep 10
[ perf record: Woken up 6 times to write data ]
[ perf record: Captured and wrote 1.905 MB perf.data (49647 samples) ]

root@d06-1:~$ perf report --stdio
# Overhead  Command   Shared Object       Symbol
# ........  ........  ..................  ......................
#
    45.37%  qemu-img  libc-2.29.so        [.] 0x00000000000df710
    23.52%  qemu-img  [unknown]           [k] 0xffff000010099504
     9.08%  qemu-img  [unknown]           [k] 0xffff00001008fff8
     8.89%  qemu-img  [unknown]           [k] 0xffff000010085dc0
     5.56%  qemu-img  libc-2.29.so        [.] 0x00000000000df708
     3.66%  qemu-img  libc-2.29.so        [.] 0x00000000000df678
     1.01%  qemu-img  libc-2.29.so        [.] __sigtimedwait
     0.80%  qemu-img  libc-2.29.so        [.] 0x00000000000df700
     0.64%  qemu-img  qemu-img            [.] qemu_event_reset
     0.55%  qemu-img  libc-2.29.so        [.] 0x00000000000df718
     0.52%  qemu-img  libpthread-2.29.so  [.] __libc_sigwait

----

TRHREAD #2
syscall (linux/aarch64/syscall.S:38)
qemu_futex_wait (./util/qemu-thread-posix.c:438)
qemu_event_wait (./util/qemu-thread-posix.c:442)
call_rcu_thread (./util/rcu.c:261)
qemu_thread_start (./util/qemu-thread-posix.c:498)
start_thread (pthread_create.c:486)
thread_start (linux/aarch64/clone.S:78)

./33302/stack                          ./33302/stack                       
[<0>] __switch_to+0xc0/0x218           [<0>] __switch_to+0xc0/0x218        
[<0>] ptrace_stop+0x148/0x2b0          [<0>] ptrace_stop+0x148/0x2b0       
[<0>] get_signal+0x5a4/0x730           [<0>] get_signal+0x5a4/0x730        
[<0>] do_notify_resume+0x1c4/0x358     [<0>] do_notify_resume+0x1c4/0x358  
[<0>] work_pending+0x8/0x10            [<0>] work_pending+0x8/0x10    

<stack does not change at all>

root@d06-1:~$ perf report --stdio
# Overhead  Command   Shared Object       Symbol
# ........  ........  ..................  ......................
#
    50.30%  qemu-img  libc-2.29.so        [.] 0x00000000000df710
    26.44%  qemu-img  [unknown]           [k] 0xffff000010099504
     5.88%  qemu-img  libc-2.29.so        [.] 0x00000000000df708
     5.26%  qemu-img  [unknown]           [k] 0xffff000010085dc0
     5.25%  qemu-img  [unknown]           [k] 0xffff00001008fff8
     4.25%  qemu-img  libc-2.29.so        [.] 0x00000000000df678
     0.93%  qemu-img  libc-2.29.so        [.] __sigtimedwait
     0.51%  qemu-img  libc-2.29.so        [.] 0x00000000000df700
     0.35%  qemu-img  libpthread-2.29.so  [.] __libc_sigwait

Their stack show those tasks are pretty much "stuck" in same userland
program logic, while one of them is stuck at the same program counter
address. Profiling those tasks give no much information without more
debugging data and less optimizations.

Although all the 0x000000dfXXX addresses seem broken as we get where
libc was mapped (mid heap) and we have:

(gdb) print __libc_sigwait
$25 = {int (const sigset_t *, int *)} 0xffffbf128080 <__GI___sigwait>

----

Anyway, continuing.... I investigated the qemu_event_{set,reset,xxx}
logic. In non Linux OSes it uses pthread primitives, but, for Linux, it
uses a futex() implementation with a struct QemuEvent
(rcu_call_ready_event) being the one holding values (busy, set, free,
etc).

I got 2 hung situations:

(gdb) print (struct QemuEvent) *(0xaaaacd35fce8)
$16 = {
  value = 4294967295,
  initialized = true
}

value = 4294967295 -> THIS IS A 32-bit 0xFFFF (casting vs overflow issue
?)

AND

a situation where value was either 0 or 1 (like expected). In this last
situation I changed things by hand to make program to continue its
execution:

void qemu_event_wait(QemuEvent *ev)
{
    unsigned value;

    assert(ev->initialized);
    value = atomic_read(&ev->value);
    smp_mb_acquire();
    if (value != EV_SET) {
        if (value == EV_FREE) {

            if (atomic_cmpxchg(&ev->value, 
                       EV_FREE, EV_BUSY) == EV_SET) {
                return;
            }
        }
        qemu_futex_wait(ev, EV_BUSY);
    }
}

438     in ./util/qemu-thread-posix.c
   0x0000aaaaaabd4174 <+44>:    mov     w1, #0xffffffff                 // #-1
   0x0000aaaaaabd4178 <+48>:    ldaxr   w0, [x19]
   0x0000aaaaaabd417c <+52>:    cmp     w0, #0x1
   0x0000aaaaaabd4180 <+56>:    b.ne    0xaaaaaabd418c <qemu_event_wait+68>  // b.any
=> 0x0000aaaaaabd4184 <+60>:    stlxr   w2, w1, [x19]
   0x0000aaaaaabd4188 <+64>:    cbnz    w2, 0xaaaaaabd4178 <qemu_event_wait+48>
   0x0000aaaaaabd418c <+68>:    cbz     w0, 0xaaaaaabd41cc <qemu_event_wait+132>
   0x0000aaaaaabd4190 <+72>:    mov     w6, #0x0                        // #0
   0x0000aaaaaabd4194 <+76>:    mov     x5, #0x0                        // #0
   0x0000aaaaaabd4198 <+80>:    mov     x4, #0x0                        // #0
   0x0000aaaaaabd419c <+84>:    mov     w3, #0xffffffff                 // #-1
   0x0000aaaaaabd41a0 <+88>:    mov     w2, #0x0                        // #0
   0x0000aaaaaabd41a4 <+92>:    mov     x1, x19
   0x0000aaaaaabd41a8 <+96>:    mov     x0, #0x62                       // #98
   0x0000aaaaaabd41ac <+100>:   bl      0xaaaaaaaff380 <syscall@plt>
   
I unblocked it by hand, setting the program counter register to outside that logic:

(gdb) print qemu_event_wait+132
$15 = (void (*)(QemuEvent *)) 0xaaaaaabd41cc <qemu_event_wait+132>
(gdb) print rcu_call_ready_event
$16 = {value = 1, initialized = true}
(gdb) set rcu_call_ready_event->value=0
(gdb) set $pc=0xaaaaaabd41cc

And it got stuck again with program counter in other STLXR instruction:

(gdb) thread 2

[Switching to thread 2 (Thread 0xffffbec61d90 (LWP 33302))]
#0  0x0000aaaaaabd4110 in qemu_event_reset (ev=0xaaaaaac87ce8 <rcu_call_ready_event>) at ./util/qemu-thread-posix.c:414
414     ./util/qemu-thread-posix.c: No such file or directory.
(gdb) bt
#0  0x0000aaaaaabd4110 in qemu_event_reset (ev=0xaaaaaac87ce8 <rcu_call_ready_event>) at ./util/qemu-thread-posix.c:414
#1  0x0000aaaaaabedff8 in call_rcu_thread (opaque=opaque@entry=0x0) at ./util/rcu.c:255
#2  0x0000aaaaaabd34c8 in qemu_thread_start (args=<optimized out>) at ./util/qemu-thread-posix.c:498
#3  0x0000ffffbf26a880 in start_thread (arg=0xfffffffff5bf) at pthread_create.c:486
#4  0x0000ffffbf1c4b9c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

(gdb) print rcu_call_ready_event
$20 = {value = 1, initialized = true}

(gdb) disassemble qemu_event_reset
Dump of assembler code for function qemu_event_reset:
   0x0000aaaaaabd40f0 <+0>:     ldrb    w1, [x0, #4]
   0x0000aaaaaabd40f4 <+4>:     cbz     w1, 0xaaaaaabd411c <qemu_event_reset+44>
   0x0000aaaaaabd40f8 <+8>:     ldr     w1, [x0]
   0x0000aaaaaabd40fc <+12>:    dmb     ishld
   0x0000aaaaaabd4100 <+16>:    cbz     w1, 0xaaaaaabd4108 <qemu_event_reset+24>
   0x0000aaaaaabd4104 <+20>:    ret
   0x0000aaaaaabd4108 <+24>:    ldaxr   w1, [x0]
   0x0000aaaaaabd410c <+28>:    orr     w1, w1, #0x1
=> 0x0000aaaaaabd4110 <+32>:    stlxr   w2, w1, [x0]
   0x0000aaaaaabd4114 <+36>:    cbnz    w2, 0xaaaaaabd4108 <qemu_event_reset+24>
   0x0000aaaaaabd4118 <+40>:    ret
   
And it does not matter if I continue, CPU keeps stuck in that program counter (again in a STLXR instruction)

----

So, initially I was afraid that the lack barriers (or not so strong ones
being used) could have caused a race condition that would make one
thread to depend on the other thread logic.

Unfortunately it looks that instruction STLXR might not be behaving
appropriately for this CPU/architecture as program counter seem to be
stuck in the same instruction (which is super weird, by not throwing a
general exception for some microcode issue, for example).

But this was just an initial overview, I still have to revisit this in
order interpret results better (and recompile qemu with debugging data,
and possible with other GCC version).

Any comments are appreciated.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on high core count ARM system

Status in QEMU:
  Confirmed
Status in qemu package in Ubuntu:
  In Progress

Bug description:
  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760, 
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, 
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, 
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Qemu-devel] [Bug 1805256] Re: qemu-img hangs on high core count ARM system
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (16 preceding siblings ...)
  2019-09-10  2:04 ` Rafael David Tinoco
@ 2019-09-10 14:16 ` Rafael David Tinoco
  2019-09-10 18:15 ` [Qemu-devel] [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images Rafael David Tinoco
                   ` (76 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Rafael David Tinoco @ 2019-09-10 14:16 UTC (permalink / raw)
  To: qemu-devel

Alright, here is what is happening:

Whenever program is stuck, thread #2 backtrace is this:

(gdb) bt
#0  syscall () at ../sysdeps/unix/sysv/linux/aarch64/syscall.S:38
#1  0x0000aaaaaabd41b0 in qemu_futex_wait (val=<optimized out>, f=<optimized out>) at ./util/qemu-thread-posix.c:438
#2  qemu_event_wait (ev=ev@entry=0xaaaaaac87ce8 <rcu_call_ready_event>) at ./util/qemu-thread-posix.c:442
#3  0x0000aaaaaabee03c in call_rcu_thread (opaque=opaque@entry=0x0) at ./util/rcu.c:261
#4  0x0000aaaaaabd34c8 in qemu_thread_start (args=<optimized out>) at ./util/qemu-thread-posix.c:498
#5  0x0000ffffbf26a880 in start_thread (arg=0xfffffffff5bf) at pthread_create.c:486
#6  0x0000ffffbf1c4b9c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

Meaning that code is waiting for a futex inside kernel.

(gdb) print rcu_call_ready_event
$4 = {value = 4294967295, initialized = true}

The QemuEvent "rcu_call_ready_event->value" is set to INT_MAX and I
don't know why yet.

rcu_call_ready_event->value is only touched by:

qemu_event_init() -> bool init ? EV_SET : EV_FREE
qemu_event_reset() -> atomic_or(&ev->value, EV_FREE)
qemu_event_set() -> atomic_xchg(&ev->value, EV_SET)
qemu_event_wait() -> atomic_cmpxchg(&ev->value, EV_FREE, EV_BUSY)'

And there should be no 0x7fff value for "ev->value".

qemu_event_init() is the one initializing the global:

    static QemuEvent rcu_call_ready_event;

and it is called by "rcu_init_complete()" which is called by
"rcu_init()":

    static void __attribute__((__constructor__)) rcu_init(void)

a constructor function.

So, "fixing" this issue by:

    (gdb) print rcu_call_ready_event
    $8 = {value = 4294967295, initialized = true}
    
    (gdb) watch rcu_call_ready_event
    Hardware watchpoint 1: rcu_call_ready_event
    
    (gdb) set rcu_call_ready_event.initialized = 1
    
    (gdb) set rcu_call_ready_event.value = 0

and note that I added a watchpoint to rcu_call_ready_event global:

<HANG>

Thread 1 "qemu-img" received signal SIGINT, Interrupt.
(gdb) thread 2
[Switching to thread 2 (Thread 0xffffbec61d90 (LWP 33625))]

(gdb) bt
#0  0x0000aaaaaabd4110 in qemu_event_reset (ev=ev@entry=0xaaaaaac87ce8 <rcu_call_ready_event>)
#1  0x0000aaaaaabedff8 in call_rcu_thread (opaque=opaque@entry=0x0) at ./util/rcu.c:255
#2  0x0000aaaaaabd34c8 in qemu_thread_start (args=<optimized out>) at ./util/qemu-thread-posix.c:498
#3  0x0000ffffbf26a880 in start_thread (arg=0xfffffffff5bf) at pthread_create.c:486
#4  0x0000ffffbf1c4b9c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
(gdb) print rcu_call_ready_event
$9 = {value = 0, initialized = true}

You can see I advanced in the qemu_event_{reset,set,wait} logic.

    (gdb) disassemble /m 0x0000aaaaaabd4110
    Dump of assembler code for function qemu_event_reset:
    408     in ./util/qemu-thread-posix.c

    409     in ./util/qemu-thread-posix.c

    410     in ./util/qemu-thread-posix.c
    411     in ./util/qemu-thread-posix.c
       0x0000aaaaaabd40f0 <+0>:     ldrb    w1, [x0, #4]
       0x0000aaaaaabd40f4 <+4>:     cbz     w1, 0xaaaaaabd411c <qemu_event_reset+44>
       0x0000aaaaaabd411c <+44>:    stp     x29, x30, [sp, #-16]!
       0x0000aaaaaabd4120 <+48>:    adrp    x3, 0xaaaaaac20000
       0x0000aaaaaabd4124 <+52>:    add     x3, x3, #0x908
       0x0000aaaaaabd4128 <+56>:    mov     x29, sp
       0x0000aaaaaabd412c <+60>:    adrp    x1, 0xaaaaaac20000
       0x0000aaaaaabd4130 <+64>:    adrp    x0, 0xaaaaaac20000
       0x0000aaaaaabd4134 <+68>:    add     x3, x3, #0x290
       0x0000aaaaaabd4138 <+72>:    add     x1, x1, #0xc00
       0x0000aaaaaabd413c <+76>:    add     x0, x0, #0xd40
       0x0000aaaaaabd4140 <+80>:    mov     w2, #0x19b    // #411
       0x0000aaaaaabd4144 <+84>:    bl      0xaaaaaaaff190 <__assert_fail@plt>

    412     in ./util/qemu-thread-posix.c
       0x0000aaaaaabd40f8 <+8>:     ldr     w1, [x0]

    413     in ./util/qemu-thread-posix.c
       0x0000aaaaaabd40fc <+12>:    dmb     ishld

    414     in ./util/qemu-thread-posix.c
       0x0000aaaaaabd4100 <+16>:    cbz     w1, 0xaaaaaabd4108 <qemu_event_reset+24>
       0x0000aaaaaabd4104 <+20>:    ret
       0x0000aaaaaabd4108 <+24>:    ldaxr   w1, [x0]
       0x0000aaaaaabd410c <+28>:    orr     w1, w1, #0x1
    => 0x0000aaaaaabd4110 <+32>:    stlxr   w2, w1, [x0]
       0x0000aaaaaabd4114 <+36>:    cbnz    w2, 0xaaaaaabd4108 <qemu_event_reset+24>
       0x0000aaaaaabd4118 <+40>:    ret

And I'm currently inside the STLXR and LDAXR logic. To make sure my program counter is advancing, I added a breakpoint at 0x0000aaaaaabd4108, so CBNZ instruction would branch indefinitely into LDXAR instruction again, until the
LDAXR<->STLXR logic is satisfied (inside qemu_event_wait()).

(gdb) break *(0x0000aaaaaabd4108)
Breakpoint 2 at 0xaaaaaabd4108: file ./util/qemu-thread-posix.c, line 414.

which is basically this:

    if (value == EV_SET) {                        EV_SET == 0
        atomic_or(&ev->value, EV_FREE);           EV_FREE = 1
    }
    
and we can see that this logic being called one time after another:

(gdb) c
Thread 2 "qemu-img" hit Breakpoint 3, 0x0000aaaaaabd4108 in qemu_event_reset (
    ev=ev@entry=0xaaaaaac87ce8 <rcu_call_ready_event>) at ./util/qemu-thread-posix.c:414

(gdb) c
Thread 2 "qemu-img" hit Breakpoint 3, 0x0000aaaaaabd4108 in qemu_event_reset (
    ev=ev@entry=0xaaaaaac87ce8 <rcu_call_ready_event>) at ./util/qemu-thread-posix.c:414
    
(gdb) c
Thread 2 "qemu-img" hit Breakpoint 3, 0x0000aaaaaabd4108 in qemu_event_reset (
    ev=ev@entry=0xaaaaaac87ce8 <rcu_call_ready_event>) at ./util/qemu-thread-posix.c:414

EVEN when rcu_call_ready_event->value is already EV_SET (0):

(gdb) print rcu_call_ready_event
$11 = {value = 0, initialized = true}

(gdb) info break
Num     Type           Disp Enb Address            What
1       hw watchpoint  keep y                      rcu_call_ready_event
3       breakpoint     keep n   0x0000aaaaaabd4108 qemu-thread-posix.c:414
        breakpoint already hit 23 times
4       breakpoint     keep y   0x0000aaaaaabd4148 qemu-thread-posix.c:424

IF I enable only rcu_call_ready_event HW watchpoint, nothing is
triggered.

(gdb) watch *(rcu_call_ready_event->value)
Hardware watchpoint 6: *(rcu_call_ready_event->value)

not if I set it directly to QemuEvent->value...

    assert(ev->initialized);
    value = atomic_read(&ev->value);
    smp_mb_acquire();
    if (value == EV_SET) {
        atomic_or(&ev->value, EV_FREE);
    }

meaning that "value" and "ev->value" might have a diff value... is that
so ?

(gdb) print value
$14 = <optimized out>

can't say.. checking registers AND stack:

       0x0000aaaaaabd4100 <+16>:    cbz     w1, 0xaaaaaabd4108 <qemu_event_reset+24>
       0x0000aaaaaabd4104 <+20>:    ret
       0x0000aaaaaabd4108 <+24>:    ldaxr   w1, [x0]
       0x0000aaaaaabd410c <+28>:    orr     w1, w1, #0x1
    => 0x0000aaaaaabd4110 <+32>:    stlxr   w2, w1, [x0]
       0x0000aaaaaabd4114 <+36>:    cbnz    w2, 0xaaaaaabd4108 
       

x0             0xaaaaaac87ce8      187649986428136
x1             0x1                 1
x2             0x1                 1
x3             0x0                 0
x4             0xffffbec61e98      281473882398360
x5             0xffffbec61c90      281473882397840
x6             0xffffbec61c90      281473882397840
x7             0x1                 1
x8             0x65                101
x9             0x0                 0
x10            0x0                 0
x11            0x0                 0
x12            0xffffbec61d90      281473882398096
x13            0x0                 0
x14            0x0                 0
x15            0x2                 2
x16            0xffffbf67ccf0      281473892994288
x17            0xffffbf274938      281473888766264
x18            0x23f               575
x19            0x0                 0
x20            0xaaaaaac87ce8      187649986428136
x21            0x0                 0
x22            0xfffffffff5bf      281474976708031
x23            0xaaaaaac87ce0      187649986428128
x24            0xaaaaaac29000      187649986039808
x25            0xfffffffff658      281474976708184
x26            0x1000              4096
x27            0xffffbf28c000      281473888862208
x28            0xffffbec61d90      281473882398096
x29            0xffffbec61420      281473882395680
x30            0xaaaaaabedff8      187649985798136
sp             0xffffbec61420      0xffffbec61420
pc             0xaaaaaabd4110      0xaaaaaabd4110 <qemu_event_reset+32>
cpsr           0x0                 [ EL=0 ]
fpsr           0x0                 0
fpcr           0x0                 0

AND the ORR instruction is ALWAYS being executed against 0x1 (not 0x0,
which is what I just changed by changing .value):

(gdb) print value
$14 = <optimized out>

       0x0000aaaaaabd410c <+28>:    orr     w1, w1, #0x1
       
#0x1 is being used instead of contents of "value" local variable (volatile).

I'll recompile QEMU flagging all those local "unsigned value" variables
as being volatile and check if optimization changes. Or even try to
disable optimizations.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on high core count ARM system

Status in QEMU:
  Confirmed
Status in qemu package in Ubuntu:
  In Progress

Bug description:
  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760, 
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, 
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, 
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Qemu-devel] [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (17 preceding siblings ...)
  2019-09-10 14:16 ` Rafael David Tinoco
@ 2019-09-10 18:15 ` Rafael David Tinoco
  2019-09-10 22:56 ` Rafael David Tinoco
                   ` (75 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Rafael David Tinoco @ 2019-09-10 18:15 UTC (permalink / raw)
  To: qemu-devel

** Summary changed:

- qemu-img hangs on high core count ARM system
+ qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images

** Changed in: qemu
       Status: Confirmed => In Progress

** Changed in: qemu
     Assignee: (unassigned) => Rafael David Tinoco (rafaeldtinoco)

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in QEMU:
  In Progress
Status in qemu package in Ubuntu:
  In Progress

Bug description:
  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760, 
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, 
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, 
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Qemu-devel] [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (18 preceding siblings ...)
  2019-09-10 18:15 ` [Qemu-devel] [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images Rafael David Tinoco
@ 2019-09-10 22:56 ` Rafael David Tinoco
  2019-09-11  2:15 ` [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues Rafael David Tinoco
                   ` (74 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Rafael David Tinoco @ 2019-09-10 22:56 UTC (permalink / raw)
  To: qemu-devel

QEMU BUG: #1

Alright, one of the issues is (according to comment #14):

"""
Meaning that code is waiting for a futex inside kernel.

(gdb) print rcu_call_ready_event
$4 = {value = 4294967295, initialized = true}

The QemuEvent "rcu_call_ready_event->value" is set to INT_MAX and I
don't know why yet.

rcu_call_ready_event->value is only touched by:

qemu_event_init() -> bool init ? EV_SET : EV_FREE
qemu_event_reset() -> atomic_or(&ev->value, EV_FREE)
qemu_event_set() -> atomic_xchg(&ev->value, EV_SET)
qemu_event_wait() -> atomic_cmpxchg(&ev->value, EV_FREE, EV_BUSY)'
"""

Now I know why rcu_call_ready_event->value is set to INT_MAX. That is
because in the following declaration:

struct QemuEvent {
#ifndef __linux__
    pthread_mutex_t lock;
    pthread_cond_t cond;
#endif
    unsigned value;
    bool initialized;
};

#define EV_SET         0
#define EV_FREE        1
#define EV_BUSY       -1

"value" is declared as unsigned, but EV_BUSY sets it to -1, and,
according to the Two's Complement Operation
(https://en.wikipedia.org/wiki/Two%27s_complement), it will be INT_MAX
(4294967295).

So this is the "first bug" found AND it is definitely funny that this
hasn't been seen in other architectures at all... I can reproduce it at
will.

With that said, it seems that there is still another issue causing (less
frequently):

(gdb) thread 2
[Switching to thread 2 (Thread 0xffffbec5ad90 (LWP 17459))]
#0  syscall () at ../sysdeps/unix/sysv/linux/aarch64/syscall.S:38
38      ../sysdeps/unix/sysv/linux/aarch64/syscall.S: No such file or directory.
(gdb) bt
#0  syscall () at ../sysdeps/unix/sysv/linux/aarch64/syscall.S:38
#1  0x0000aaaaaabd41cc in qemu_futex_wait (val=<optimized out>, f=<optimized out>) at ./util/qemu-thread-posix.c:438
#2  qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>) at ./util/qemu-thread-posix.c:442
#3  0x0000aaaaaabed05c in call_rcu_thread (opaque=opaque@entry=0x0) at ./util/rcu.c:261
#4  0x0000aaaaaabd34c8 in qemu_thread_start (args=<optimized out>) at ./util/qemu-thread-posix.c:498
#5  0x0000ffffbf25c880 in start_thread (arg=0xfffffffff5bf) at pthread_create.c:486
#6  0x0000ffffbf1b6b9c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

Thread 2 to be stuck at "futex()" kernel syscall (like the FUTEX_WAKE
never happened and/or wasn't atomic for this arch/binary). Need to
investigate this also.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in QEMU:
  In Progress
Status in qemu package in Ubuntu:
  In Progress

Bug description:
  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760, 
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, 
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, 
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues
@ 2019-09-11  2:15 ` Rafael David Tinoco
  2019-09-11  2:15   ` [Qemu-devel] [Bug 1805256] " Rafael David Tinoco
                     ` (2 more replies)
  0 siblings, 3 replies; 141+ messages in thread
From: Rafael David Tinoco @ 2019-09-11  2:15 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Bug 1805256, QEMU Developers - ARM, QEMU Developers

Paolo,

While debugging hungs in ARM64 while doing a simple:

qemu-img convert -f qcow2 -O qcow2 file.qcow2 output.qcow2

I might have found 2 issues which I'd like you to review, if possible.

ISSUE #1
========

I've caught the following stack trace after an HUNG in qemu-img convert:

(gdb) bt
#0 syscall ()
#1 0x0000aaaaaabd41cc in qemu_futex_wait
#2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
#3 0x0000aaaaaabed05c in call_rcu_thread
#4 0x0000aaaaaabd34c8 in qemu_thread_start
#5 0x0000ffffbf25c880 in start_thread
#6 0x0000ffffbf1b6b9c in thread_start ()

(gdb) print rcu_call_ready_event
$4 = {value = 4294967295, initialized = true}

value INT_MAX (4294967295) seems WRONG for qemu_futex_wait():

- EV_BUSY, being -1, and passed as an argument qemu_futex_wait(void *,
unsigned), is a two's complement, making argument into a INT_MAX when
that's not what is expected (unless I missed something).

*** If that is the case, unsure if you, Paolo, prefer declaring
*(QemuEvent)->value as an integer or changing EV_BUSY to "2" would okay
here ***

BUG: description:
https://bugs.launchpad.net/qemu/+bug/1805256/comments/15

========
ISSUE #2
========

I found this when debugging lockups while in futex() in a specific ARM64
server - https://bugs.launchpad.net/qemu/+bug/1805256 - which I'm still
investigating.

After fixing the issue above, I'm still getting stuck into:

qemu_event_wait() -> qemu_futex_wait()

***
As if qemu_event_set() has ran before qemu_futex_wait() ever started running
***

The Other threads are waiting for poll() on a PIPE coming from this
stuck thread (thread #1), and in sigwait():

(gdb) thread 1
...
(gdb) bt
#0  0x0000ffffbf1ad81c in __GI_ppoll
#1  0x0000aaaaaabcf73c in ppoll
#2  qemu_poll_ns
#3  0x0000aaaaaabd0764 in os_host_main_loop_wait
#4  main_loop_wait
...

(gdb) thread 2
...
(gdb) bt
#0 syscall ()
#1 0x0000aaaaaabd41cc in qemu_futex_wait
#2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
#3 0x0000aaaaaabed05c in call_rcu_thread
#4 0x0000aaaaaabd34c8 in qemu_thread_start
#5 0x0000ffffbf25c880 in start_thread
#6 0x0000ffffbf1b6b9c in thread_start ()

(gdb) thread 3
...
(gdb) bt
#0  0x0000ffffbf11aa20 in __GI___sigtimedwait
#1  0x0000ffffbf2671b4 in __sigwait
#2  0x0000aaaaaabd1ddc in sigwait_compat
#3  0x0000aaaaaabd34c8 in qemu_thread_start
#4  0x0000ffffbf25c880 in start_thread
#5  0x0000ffffbf1b6b9c in thread_start

QUESTION:

- Should qemu_event_set() check return code from
qemu_futex_wake()->qemu_futex()->syscall() in order to know if ANY
waiter was ever woken up ? Maybe even loop until at least 1 is awaken ?

Tks in advance,

Rafael D. Tinoco


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Qemu-devel] [Bug 1805256] qemu_futex_wait() lockups in ARM64: 2 possible issues
  2019-09-11  2:15 ` [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues Rafael David Tinoco
@ 2019-09-11  2:15   ` Rafael David Tinoco
  2019-09-11  7:05   ` [Qemu-devel] " Rafael David Tinoco
  2019-09-11 13:17   ` [Qemu-devel] " Paolo Bonzini
  2 siblings, 0 replies; 141+ messages in thread
From: Rafael David Tinoco @ 2019-09-11  2:15 UTC (permalink / raw)
  To: qemu-devel

Paolo,

While debugging hungs in ARM64 while doing a simple:

qemu-img convert -f qcow2 -O qcow2 file.qcow2 output.qcow2

I might have found 2 issues which I'd like you to review, if possible.

ISSUE #1
========

I've caught the following stack trace after an HUNG in qemu-img convert:

(gdb) bt
#0 syscall ()
#1 0x0000aaaaaabd41cc in qemu_futex_wait
#2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
#3 0x0000aaaaaabed05c in call_rcu_thread
#4 0x0000aaaaaabd34c8 in qemu_thread_start
#5 0x0000ffffbf25c880 in start_thread
#6 0x0000ffffbf1b6b9c in thread_start ()

(gdb) print rcu_call_ready_event
$4 = {value = 4294967295, initialized = true}

value INT_MAX (4294967295) seems WRONG for qemu_futex_wait():

- EV_BUSY, being -1, and passed as an argument qemu_futex_wait(void *,
unsigned), is a two's complement, making argument into a INT_MAX when
that's not what is expected (unless I missed something).

*** If that is the case, unsure if you, Paolo, prefer declaring
*(QemuEvent)->value as an integer or changing EV_BUSY to "2" would okay
here ***

BUG: description:
https://bugs.launchpad.net/qemu/+bug/1805256/comments/15

========
ISSUE #2
========

I found this when debugging lockups while in futex() in a specific ARM64
server - https://bugs.launchpad.net/qemu/+bug/1805256 - which I'm still
investigating.

After fixing the issue above, I'm still getting stuck into:

qemu_event_wait() -> qemu_futex_wait()

***
As if qemu_event_set() has ran before qemu_futex_wait() ever started running
***

The Other threads are waiting for poll() on a PIPE coming from this
stuck thread (thread #1), and in sigwait():

(gdb) thread 1
...
(gdb) bt
#0  0x0000ffffbf1ad81c in __GI_ppoll
#1  0x0000aaaaaabcf73c in ppoll
#2  qemu_poll_ns
#3  0x0000aaaaaabd0764 in os_host_main_loop_wait
#4  main_loop_wait
...

(gdb) thread 2
...
(gdb) bt
#0 syscall ()
#1 0x0000aaaaaabd41cc in qemu_futex_wait
#2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
#3 0x0000aaaaaabed05c in call_rcu_thread
#4 0x0000aaaaaabd34c8 in qemu_thread_start
#5 0x0000ffffbf25c880 in start_thread
#6 0x0000ffffbf1b6b9c in thread_start ()

(gdb) thread 3
...
(gdb) bt
#0  0x0000ffffbf11aa20 in __GI___sigtimedwait
#1  0x0000ffffbf2671b4 in __sigwait
#2  0x0000aaaaaabd1ddc in sigwait_compat
#3  0x0000aaaaaabd34c8 in qemu_thread_start
#4  0x0000ffffbf25c880 in start_thread
#5  0x0000ffffbf1b6b9c in thread_start

QUESTION:

- Should qemu_event_set() check return code from
qemu_futex_wake()->qemu_futex()->syscall() in order to know if ANY
waiter was ever woken up ? Maybe even loop until at least 1 is awaken ?

Tks in advance,

Rafael D. Tinoco

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in QEMU:
  In Progress
Status in qemu package in Ubuntu:
  In Progress

Bug description:
  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760, 
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, 
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, 
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Qemu-devel] [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (20 preceding siblings ...)
  2019-09-11  2:15 ` [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues Rafael David Tinoco
@ 2019-09-11  2:17 ` Rafael David Tinoco
  2019-09-11 11:19 ` Rafael David Tinoco
                   ` (72 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Rafael David Tinoco @ 2019-09-11  2:17 UTC (permalink / raw)
  To: qemu-devel

In comment #14, please disregard the second half of the issue, related
to:

       0x0000aaaaaabd4100 <+16>: cbz w1, 0xaaaaaabd4108 <qemu_event_reset+24>
       0x0000aaaaaabd4104 <+20>: ret
       0x0000aaaaaabd4108 <+24>: ldaxr w1, [x0]
       0x0000aaaaaabd410c <+28>: orr w1, w1, #0x1
    => 0x0000aaaaaabd4110 <+32>: stlxr w2, w1, [x0]
       0x0000aaaaaabd4114 <+36>: cbnz w2, 0xaaaaaabd4108

Duh! This is just a regular load/xor/store logic for atomic_or() inside
qemu_event_reset().

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in QEMU:
  In Progress
Status in qemu package in Ubuntu:
  In Progress

Bug description:
  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760, 
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, 
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, 
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues
  2019-09-11  2:15 ` [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues Rafael David Tinoco
  2019-09-11  2:15   ` [Qemu-devel] [Bug 1805256] " Rafael David Tinoco
@ 2019-09-11  7:05   ` Rafael David Tinoco
  2019-09-11  7:05     ` [Qemu-devel] [Bug 1805256] " Rafael David Tinoco
  2019-09-11 13:17   ` [Qemu-devel] " Paolo Bonzini
  2 siblings, 1 reply; 141+ messages in thread
From: Rafael David Tinoco @ 2019-09-11  7:05 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Bug 1805256, QEMU Developers - ARM, QEMU Developers

Quick update...

> value INT_MAX (4294967295) seems WRONG for qemu_futex_wait():
> 
> - EV_BUSY, being -1, and passed as an argument qemu_futex_wait(void *,
> unsigned), is a two's complement, making argument into a INT_MAX when
> that's not what is expected (unless I missed something).
> 
> *** If that is the case, unsure if you, Paolo, prefer declaring
> *(QemuEvent)->value as an integer or changing EV_BUSY to "2" would okay
> here ***
> 
> BUG: description:
> https://bugs.launchpad.net/qemu/+bug/1805256/comments/15

I realized this might be intentional, but, still, I tried:

    https://pastebin.ubuntu.com/p/6rkkY6fJdm/

looking for anything that could have misbehaved in arm64 (specially
concerned on casting and type conversions between the functions).

> QUESTION:
> 
> - Should qemu_event_set() check return code from
> qemu_futex_wake()->qemu_futex()->syscall() in order to know if ANY
> waiter was ever woken up ? Maybe even loop until at least 1 is awaken ?

And I also tried:

-    qemu_futex(f, FUTEX_WAKE, n, NULL, NULL, 0);
+    while(qemu_futex(pval, FUTEX_WAKE, val, NULL, NULL, 0) == 0)
+        continue;

and it made little difference (took way more time for me to reproduce
the issue though):

"""
(gdb) run
Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
./disk01.ext4.qcow2 ./output.qcow2

[New Thread 0xffffbec5ad90 (LWP 72839)]
[New Thread 0xffffbe459d90 (LWP 72840)]
[New Thread 0xffffbdb57d90 (LWP 72841)]
[New Thread 0xffffacac9d90 (LWP 72859)]
[New Thread 0xffffa7ffed90 (LWP 72860)]
[New Thread 0xffffa77fdd90 (LWP 72861)]
[New Thread 0xffffa6ffcd90 (LWP 72862)]
[New Thread 0xffffa67fbd90 (LWP 72863)]
[New Thread 0xffffa5ffad90 (LWP 72864)]

[Thread 0xffffa5ffad90 (LWP 72864) exited]
[Thread 0xffffa6ffcd90 (LWP 72862) exited]
[Thread 0xffffa77fdd90 (LWP 72861) exited]
[Thread 0xffffbdb57d90 (LWP 72841) exited]
[Thread 0xffffa67fbd90 (LWP 72863) exited]
[Thread 0xffffacac9d90 (LWP 72859) exited]
[Thread 0xffffa7ffed90 (LWP 72860) exited]

<HUNG w/ 3 threads in the stack trace showed before>
"""

All the tasks left are blocked in a system call, so no task left to call
qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
thread #1 (doing poll() in a pipe with thread #2).

Those 7 threads exit before disk conversion is complete (sometimes in
the beginning, sometimes at the end).

I'll try to check why those tasks exited.

Any thoughts ?

Tks


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Qemu-devel] [Bug 1805256] Re: qemu_futex_wait() lockups in ARM64: 2 possible issues
  2019-09-11  7:05   ` [Qemu-devel] " Rafael David Tinoco
@ 2019-09-11  7:05     ` Rafael David Tinoco
  0 siblings, 0 replies; 141+ messages in thread
From: Rafael David Tinoco @ 2019-09-11  7:05 UTC (permalink / raw)
  To: qemu-devel

Quick update...

> value INT_MAX (4294967295) seems WRONG for qemu_futex_wait():
> 
> - EV_BUSY, being -1, and passed as an argument qemu_futex_wait(void *,
> unsigned), is a two's complement, making argument into a INT_MAX when
> that's not what is expected (unless I missed something).
> 
> *** If that is the case, unsure if you, Paolo, prefer declaring
> *(QemuEvent)->value as an integer or changing EV_BUSY to "2" would okay
> here ***
> 
> BUG: description:
> https://bugs.launchpad.net/qemu/+bug/1805256/comments/15

I realized this might be intentional, but, still, I tried:

    https://pastebin.ubuntu.com/p/6rkkY6fJdm/

looking for anything that could have misbehaved in arm64 (specially
concerned on casting and type conversions between the functions).

> QUESTION:
> 
> - Should qemu_event_set() check return code from
> qemu_futex_wake()->qemu_futex()->syscall() in order to know if ANY
> waiter was ever woken up ? Maybe even loop until at least 1 is awaken ?

And I also tried:

-    qemu_futex(f, FUTEX_WAKE, n, NULL, NULL, 0);
+    while(qemu_futex(pval, FUTEX_WAKE, val, NULL, NULL, 0) == 0)
+        continue;

and it made little difference (took way more time for me to reproduce
the issue though):

"""
(gdb) run
Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
./disk01.ext4.qcow2 ./output.qcow2

[New Thread 0xffffbec5ad90 (LWP 72839)]
[New Thread 0xffffbe459d90 (LWP 72840)]
[New Thread 0xffffbdb57d90 (LWP 72841)]
[New Thread 0xffffacac9d90 (LWP 72859)]
[New Thread 0xffffa7ffed90 (LWP 72860)]
[New Thread 0xffffa77fdd90 (LWP 72861)]
[New Thread 0xffffa6ffcd90 (LWP 72862)]
[New Thread 0xffffa67fbd90 (LWP 72863)]
[New Thread 0xffffa5ffad90 (LWP 72864)]

[Thread 0xffffa5ffad90 (LWP 72864) exited]
[Thread 0xffffa6ffcd90 (LWP 72862) exited]
[Thread 0xffffa77fdd90 (LWP 72861) exited]
[Thread 0xffffbdb57d90 (LWP 72841) exited]
[Thread 0xffffa67fbd90 (LWP 72863) exited]
[Thread 0xffffacac9d90 (LWP 72859) exited]
[Thread 0xffffa7ffed90 (LWP 72860) exited]

<HUNG w/ 3 threads in the stack trace showed before>
"""

All the tasks left are blocked in a system call, so no task left to call
qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
thread #1 (doing poll() in a pipe with thread #2).

Those 7 threads exit before disk conversion is complete (sometimes in
the beginning, sometimes at the end).

I'll try to check why those tasks exited.

Any thoughts ?

Tks

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in QEMU:
  In Progress
Status in qemu package in Ubuntu:
  In Progress

Bug description:
  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760, 
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, 
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, 
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Qemu-devel] [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (21 preceding siblings ...)
  2019-09-11  2:17 ` [Qemu-devel] [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images Rafael David Tinoco
@ 2019-09-11 11:19 ` Rafael David Tinoco
  2019-09-11 19:23 ` Rafael David Tinoco
                   ` (71 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Rafael David Tinoco @ 2019-09-11 11:19 UTC (permalink / raw)
  To: qemu-devel

** Description changed:

+ Command:
+ 
+ qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2
+ 
+ Hangs indefinitely approximately 30% of the runs.
+ 
+ ----
+ 
+ Workaround:
+ 
+ qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2
+ 
+ Run "qemu-img convert" with "a single coroutine" to avoid this issue.
+ 
+ ----
+ 
+ (gdb) thread 1
+ ...
+ (gdb) bt
+ #0 0x0000ffffbf1ad81c in __GI_ppoll
+ #1 0x0000aaaaaabcf73c in ppoll
+ #2 qemu_poll_ns
+ #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
+ #4 main_loop_wait
+ ...
+ 
+ (gdb) thread 2
+ ...
+ (gdb) bt
+ #0 syscall ()
+ #1 0x0000aaaaaabd41cc in qemu_futex_wait
+ #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
+ #3 0x0000aaaaaabed05c in call_rcu_thread
+ #4 0x0000aaaaaabd34c8 in qemu_thread_start
+ #5 0x0000ffffbf25c880 in start_thread
+ #6 0x0000ffffbf1b6b9c in thread_start ()
+ 
+ (gdb) thread 3
+ ...
+ (gdb) bt
+ #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
+ #1 0x0000ffffbf2671b4 in __sigwait
+ #2 0x0000aaaaaabd1ddc in sigwait_compat
+ #3 0x0000aaaaaabd34c8 in qemu_thread_start
+ #4 0x0000ffffbf25c880 in start_thread
+ #5 0x0000ffffbf1b6b9c in thread_start
+ 
+ ----
+ 
+ (gdb) run
+ Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
+ ./disk01.ext4.qcow2 ./output.qcow2
+ 
+ [New Thread 0xffffbec5ad90 (LWP 72839)]
+ [New Thread 0xffffbe459d90 (LWP 72840)]
+ [New Thread 0xffffbdb57d90 (LWP 72841)]
+ [New Thread 0xffffacac9d90 (LWP 72859)]
+ [New Thread 0xffffa7ffed90 (LWP 72860)]
+ [New Thread 0xffffa77fdd90 (LWP 72861)]
+ [New Thread 0xffffa6ffcd90 (LWP 72862)]
+ [New Thread 0xffffa67fbd90 (LWP 72863)]
+ [New Thread 0xffffa5ffad90 (LWP 72864)]
+ 
+ [Thread 0xffffa5ffad90 (LWP 72864) exited]
+ [Thread 0xffffa6ffcd90 (LWP 72862) exited]
+ [Thread 0xffffa77fdd90 (LWP 72861) exited]
+ [Thread 0xffffbdb57d90 (LWP 72841) exited]
+ [Thread 0xffffa67fbd90 (LWP 72863) exited]
+ [Thread 0xffffacac9d90 (LWP 72859) exited]
+ [Thread 0xffffa7ffed90 (LWP 72860) exited]
+ 
+ <HUNG w/ 3 threads in the stack trace showed before>
+ """
+ 
+ All the tasks left are blocked in a system call, so no task left to call
+ qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
+ thread #1 (doing poll() in a pipe with thread #2).
+ 
+ Those 7 threads exit before disk conversion is complete (sometimes in
+ the beginning, sometimes at the end).
+ 
+ ----
+ 
+ [ Original Description ]
+ 
  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:
  
  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2
  
  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.
  
  Once hung, attaching gdb gives the following backtrace:
  
  (gdb) bt
- #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760, 
-     timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
-     at ../sysdeps/unix/sysv/linux/ppoll.c:39
- #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, 
-     __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
- #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, 
-     timeout=timeout@entry=-1) at util/qemu-timer.c:322
+ #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
+     timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
+     at ../sysdeps/unix/sysv/linux/ppoll.c:39
+ #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
+     __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
+ #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
+     timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
-     at util/main-loop.c:233
+     at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975
  
  Reproduced w/ latest QEMU git (@ 53744e0a182)

** Description changed:

  Command:
  
- qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2
+ qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2
  
  Hangs indefinitely approximately 30% of the runs.
  
  ----
  
  Workaround:
  
  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2
  
  Run "qemu-img convert" with "a single coroutine" to avoid this issue.
  
  ----
  
  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...
  
  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()
  
  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start
  
  ----
  
  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2
  
  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]
  
  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]
  
  <HUNG w/ 3 threads in the stack trace showed before>
  """
  
  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).
  
  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).
  
  ----
  
  [ Original Description ]
  
  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:
  
  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2
  
  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.
  
  Once hung, attaching gdb gives the following backtrace:
  
  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975
  
  Reproduced w/ latest QEMU git (@ 53744e0a182)

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in QEMU:
  In Progress
Status in qemu package in Ubuntu:
  In Progress

Bug description:
  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  [ Original Description ]

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues
  2019-09-11  2:15 ` [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues Rafael David Tinoco
  2019-09-11  2:15   ` [Qemu-devel] [Bug 1805256] " Rafael David Tinoco
  2019-09-11  7:05   ` [Qemu-devel] " Rafael David Tinoco
@ 2019-09-11 13:17   ` Paolo Bonzini
  2019-09-11 14:48     ` Rafael David Tinoco
  2019-09-11 19:09     ` Rafael David Tinoco
  2 siblings, 2 replies; 141+ messages in thread
From: Paolo Bonzini @ 2019-09-11 13:17 UTC (permalink / raw)
  To: Rafael David Tinoco; +Cc: QEMU Developers - ARM, lizhengui, QEMU Developers

Note that the RCU thread is expected to sit most of the time doing 
nothing, so I don't think this matters.

Zhengui's theory that notify_me doesn't work properly on ARM is more
promising, but he couldn't provide a clear explanation of why he thought
notify_me is involved.  In particular, I would have expected notify_me to
be wrong if the qemu_poll_ns call came from aio_ctx_dispatch, for example:


    glib_pollfds_fill
      g_main_context_prepare
        aio_ctx_prepare
          atomic_or(&ctx->notify_me, 1)
    qemu_poll_ns
    glib_pollfds_poll
      g_main_context_check
        aio_ctx_check
          atomic_and(&ctx->notify_me, ~1)
      g_main_context_dispatch
        aio_ctx_dispatch
          /* do something for event */
            qemu_poll_ns 

but all backtraces show thread 1 in os_host_main_loop_wait:

Thread 1 (Thread 0x40000b573370 (LWP 27214)):
#0  0x000040000a489020 in ppoll () from /lib64/libc.so.6
#1  0x0000aaaaaadaefc0 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=<optimized out>) at qemu_timer.c:391
#3  0x0000aaaaaadae014 in os_host_main_loop_wait (timeout=<optimized out>) at main_loop.c:272
#4  0x0000aaaaaadae190 in main_loop_wait (nonblocking=<optimized out>) at main_loop.c:534
#5  0x0000aaaaaad97be0 in convert_do_copy (s=0xffffdc32eb48) at qemu-img.c:1923
#6  0x0000aaaaaada2d70 in img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2414
#7  0x0000aaaaaad99ac4 in main (argc=7, argv=<optimized out>) at qemu-img.c:5305

Can you place somewhere your util/async.o object file for me to look at it?

Anyway:

On 11/09/19 04:15, Rafael David Tinoco wrote:
> I've caught the following stack trace after an HUNG in qemu-img convert:
> 
> (gdb) bt
> #0 syscall ()
> #1 0x0000aaaaaabd41cc in qemu_futex_wait
> #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
> #3 0x0000aaaaaabed05c in call_rcu_thread
> #4 0x0000aaaaaabd34c8 in qemu_thread_start
> #5 0x0000ffffbf25c880 in start_thread
> #6 0x0000ffffbf1b6b9c in thread_start ()
> 
> (gdb) print rcu_call_ready_event
> $4 = {value = 4294967295, initialized = true}
> 
> value INT_MAX (4294967295) seems WRONG for qemu_futex_wait():

This is UINT_MAX, not INT_MAX.  qemu_futex_wait() doesn't care of the 
signedness of the value, which is why it is declared as void *.  (That said,
changing "ev" to "&ev->value" would be nicer indeed).

> - EV_BUSY, being -1, and passed as an argument qemu_futex_wait(void *,
> unsigned), is a two's complement, making argument into a INT_MAX when
> that's not what is expected (unless I missed something).
> 
> *** If that is the case, unsure if you, Paolo, prefer declaring
> *(QemuEvent)->value as an integer or changing EV_BUSY to "2" would okay
> here ***

You could change it to 3, but it has to have all the bits in EV_FREE 
(see atomic_or(&ev->value, EV_FREE) in qemu_event_reset).

You could also change it to -1u, but I don't see a particular need to do so.

> BUG: description:
> https://bugs.launchpad.net/qemu/+bug/1805256/comments/15
> 
> ========
> ISSUE #2
> ========
> 
> I found this when debugging lockups while in futex() in a specific ARM64
> server - https://bugs.launchpad.net/qemu/+bug/1805256 - which I'm still
> investigating.
> 
> After fixing the issue above, I'm still getting stuck into:

If you changed it to 2, it's wrong.

> - Should qemu_event_set() check return code from
> qemu_futex_wake()->qemu_futex()->syscall() in order to know if ANY
> waiter was ever woken up ? Maybe even loop until at least 1 is awaken ?

Why would it need to do so?

Paolo


^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues
  2019-09-11 13:17   ` [Qemu-devel] " Paolo Bonzini
@ 2019-09-11 14:48     ` Rafael David Tinoco
  2019-09-11 19:09     ` Rafael David Tinoco
  1 sibling, 0 replies; 141+ messages in thread
From: Rafael David Tinoco @ 2019-09-11 14:48 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: QEMU Developers - ARM, lizhengui, QEMU Developers


> Note that the RCU thread is expected to sit most of the time doing 
> nothing, so I don't think this matters.

Agreed.

> Zhengui's theory that notify_me doesn't work properly on ARM is more
> promising, but he couldn't provide a clear explanation of why he thought
> notify_me is involved.  In particular, I would have expected notify_me to
> be wrong if the qemu_poll_ns call came from aio_ctx_dispatch, for example:
> 
> 
>     glib_pollfds_fill
>       g_main_context_prepare
>         aio_ctx_prepare
>           atomic_or(&ctx->notify_me, 1)
>     qemu_poll_ns
>     glib_pollfds_poll
>       g_main_context_check
>         aio_ctx_check
>           atomic_and(&ctx->notify_me, ~1)
>       g_main_context_dispatch
>         aio_ctx_dispatch
>           /* do something for event */
>             qemu_poll_ns 
> 

Yep, will focus there.

> 
> Can you place somewhere your util/async.o object file for me to look at it?

Sure!

https://send.firefox.com/download/45c26bbe1075eea1/#ZD_e_96imPG2QuDqaX-jhg

Note: this async.o has value as int, EV_BUSY as 3, aborts if any errno
in qemu_futex() and uses &ev->value as 1st argument to wake/wait (as in
https://pastebin.ubuntu.com/p/xk8D6H6kgM/).

> 
> You could change it to 3, but it has to have all the bits in EV_FREE 
> (see atomic_or(&ev->value, EV_FREE) in qemu_event_reset).
> 
> You could also change it to -1u, but I don't see a particular need to do so.
> 

Yep, it was a dead end on my side.

>> - Should qemu_event_set() check return code from
>> qemu_futex_wake()->qemu_futex()->syscall() in order to know if ANY
>> waiter was ever woken up ? Maybe even loop until at least 1 is awaken ?
> 
> Why would it need to do so?
> 

No need, just realized after I saw no tasks waking that thread up. Like
you said, ctx->notify_me seems more promising, will give it a try.



^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues
  2019-09-11 13:17   ` [Qemu-devel] " Paolo Bonzini
  2019-09-11 14:48     ` Rafael David Tinoco
@ 2019-09-11 19:09     ` Rafael David Tinoco
  2019-09-11 19:09       ` [Qemu-devel] [Bug 1805256] " Rafael David Tinoco
  2019-09-24 20:25       ` [Qemu-devel] " dann frazier
  1 sibling, 2 replies; 141+ messages in thread
From: Rafael David Tinoco @ 2019-09-11 19:09 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Bug 1805256, QEMU Developers - ARM, lizhengui, QEMU Developers

> Zhengui's theory that notify_me doesn't work properly on ARM is more
> promising, but he couldn't provide a clear explanation of why he thought
> notify_me is involved.  In particular, I would have expected notify_me to
> be wrong if the qemu_poll_ns call came from aio_ctx_dispatch, for example:
> 
> 
>     glib_pollfds_fill
>       g_main_context_prepare
>         aio_ctx_prepare
>           atomic_or(&ctx->notify_me, 1)
>     qemu_poll_ns
>     glib_pollfds_poll
>       g_main_context_check
>         aio_ctx_check
>           atomic_and(&ctx->notify_me, ~1)
>       g_main_context_dispatch
>         aio_ctx_dispatch
>           /* do something for event */
>             qemu_poll_ns 
> 

Paolo,

I tried confining execution in a single NUMA domain (cpu & mem) and
still faced the issue, then, I added a mutex "ctx->notify_me_lcktest"
into context to protect "ctx->notify_me", like showed bellow, and it
seems to have either fixed or mitigated it.

I was able to cause the hung once every 3 or 4 runs. I have already ran
qemu-img convert more than 30 times now and couldn't reproduce it again.

Next step is to play with the barriers and check why existing ones
aren't enough for ordering access to ctx->notify_me ... or should I
try/do something else in your opinion ?

This arch/machine (Huawei D06):

$ lscpu
Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              96
On-line CPU(s) list: 0-95
Thread(s) per core:  1
Core(s) per socket:  48
Socket(s):           2
NUMA node(s):        4
Vendor ID:           0x48
Model:               0
Stepping:            0x0
CPU max MHz:         2000.0000
CPU min MHz:         200.0000
BogoMIPS:            200.00
L1d cache:           64K
L1i cache:           64K
L2 cache:            512K
L3 cache:            32768K
NUMA node0 CPU(s):   0-23
NUMA node1 CPU(s):   24-47
NUMA node2 CPU(s):   48-71
NUMA node3 CPU(s):   72-95
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics
cpuid asimdrdm dcpop

----

diff --git a/include/block/aio.h b/include/block/aio.h
index 0ca25dfec6..0724086d91 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -84,6 +84,7 @@ struct AioContext {
      * dispatch phase, hence a simple counter is enough for them.
      */
     uint32_t notify_me;
+    QemuMutex notify_me_lcktest;

     /* A lock to protect between QEMUBH and AioHandler adders and deleter,
      * and to ensure that no callbacks are removed while we're walking and
diff --git a/util/aio-posix.c b/util/aio-posix.c
index 51c41ed3c9..031d6e2997 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -529,7 +529,9 @@ static bool run_poll_handlers(AioContext *ctx,
int64_t max_ns, int64_t *timeout)
     bool progress;
     int64_t start_time, elapsed_time;

+    qemu_mutex_lock(&ctx->notify_me_lcktest);
     assert(ctx->notify_me);
+    qemu_mutex_unlock(&ctx->notify_me_lcktest);
     assert(qemu_lockcnt_count(&ctx->list_lock) > 0);

     trace_run_poll_handlers_begin(ctx, max_ns, *timeout);
@@ -601,8 +603,10 @@ bool aio_poll(AioContext *ctx, bool blocking)
      * so disable the optimization now.
      */
     if (blocking) {
+        qemu_mutex_lock(&ctx->notify_me_lcktest);
         assert(in_aio_context_home_thread(ctx));
         atomic_add(&ctx->notify_me, 2);
+        qemu_mutex_unlock(&ctx->notify_me_lcktest);
     }

     qemu_lockcnt_inc(&ctx->list_lock);
@@ -647,8 +651,10 @@ bool aio_poll(AioContext *ctx, bool blocking)
     }

     if (blocking) {
+        qemu_mutex_lock(&ctx->notify_me_lcktest);
         atomic_sub(&ctx->notify_me, 2);
         aio_notify_accept(ctx);
+        qemu_mutex_unlock(&ctx->notify_me_lcktest);
     }

     /* Adjust polling time */
diff --git a/util/async.c b/util/async.c
index c10642a385..140e1e86f5 100644
--- a/util/async.c
+++ b/util/async.c
@@ -221,7 +221,9 @@ aio_ctx_prepare(GSource *source, gint    *timeout)
 {
     AioContext *ctx = (AioContext *) source;

+    qemu_mutex_lock(&ctx->notify_me_lcktest);
     atomic_or(&ctx->notify_me, 1);
+    qemu_mutex_unlock(&ctx->notify_me_lcktest);

     /* We assume there is no timeout already supplied */
     *timeout = qemu_timeout_ns_to_ms(aio_compute_timeout(ctx));
@@ -239,8 +241,10 @@ aio_ctx_check(GSource *source)
     AioContext *ctx = (AioContext *) source;
     QEMUBH *bh;

+    qemu_mutex_lock(&ctx->notify_me_lcktest);
     atomic_and(&ctx->notify_me, ~1);
     aio_notify_accept(ctx);
+    qemu_mutex_unlock(&ctx->notify_me_lcktest);

     for (bh = ctx->first_bh; bh; bh = bh->next) {
         if (bh->scheduled) {
@@ -346,11 +350,13 @@ void aio_notify(AioContext *ctx)
     /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
      * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
      */
-    smp_mb();
+    //smp_mb();
+    qemu_mutex_lock(&ctx->notify_me_lcktest);
     if (ctx->notify_me) {
         event_notifier_set(&ctx->notifier);
         atomic_mb_set(&ctx->notified, true);
     }
+    qemu_mutex_unlock(&ctx->notify_me_lcktest);
 }

 void aio_notify_accept(AioContext *ctx)
@@ -424,6 +430,8 @@ AioContext *aio_context_new(Error **errp)
     ctx->co_schedule_bh = aio_bh_new(ctx, co_schedule_bh_cb, ctx);
     QSLIST_INIT(&ctx->scheduled_coroutines);

+    qemu_rec_mutex_init(&ctx->notify_me_lcktest);
+
     aio_set_event_notifier(ctx, &ctx->notifier,
                            false,
                            (EventNotifierHandler *)


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Qemu-devel] [Bug 1805256] Re: qemu_futex_wait() lockups in ARM64: 2 possible issues
  2019-09-11 19:09     ` Rafael David Tinoco
@ 2019-09-11 19:09       ` Rafael David Tinoco
  2019-09-24 20:25       ` [Qemu-devel] " dann frazier
  1 sibling, 0 replies; 141+ messages in thread
From: Rafael David Tinoco @ 2019-09-11 19:09 UTC (permalink / raw)
  To: qemu-devel

> Zhengui's theory that notify_me doesn't work properly on ARM is more
> promising, but he couldn't provide a clear explanation of why he thought
> notify_me is involved.  In particular, I would have expected notify_me to
> be wrong if the qemu_poll_ns call came from aio_ctx_dispatch, for example:
> 
> 
>     glib_pollfds_fill
>       g_main_context_prepare
>         aio_ctx_prepare
>           atomic_or(&ctx->notify_me, 1)
>     qemu_poll_ns
>     glib_pollfds_poll
>       g_main_context_check
>         aio_ctx_check
>           atomic_and(&ctx->notify_me, ~1)
>       g_main_context_dispatch
>         aio_ctx_dispatch
>           /* do something for event */
>             qemu_poll_ns 
> 

Paolo,

I tried confining execution in a single NUMA domain (cpu & mem) and
still faced the issue, then, I added a mutex "ctx->notify_me_lcktest"
into context to protect "ctx->notify_me", like showed bellow, and it
seems to have either fixed or mitigated it.

I was able to cause the hung once every 3 or 4 runs. I have already ran
qemu-img convert more than 30 times now and couldn't reproduce it again.

Next step is to play with the barriers and check why existing ones
aren't enough for ordering access to ctx->notify_me ... or should I
try/do something else in your opinion ?

This arch/machine (Huawei D06):

$ lscpu
Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              96
On-line CPU(s) list: 0-95
Thread(s) per core:  1
Core(s) per socket:  48
Socket(s):           2
NUMA node(s):        4
Vendor ID:           0x48
Model:               0
Stepping:            0x0
CPU max MHz:         2000.0000
CPU min MHz:         200.0000
BogoMIPS:            200.00
L1d cache:           64K
L1i cache:           64K
L2 cache:            512K
L3 cache:            32768K
NUMA node0 CPU(s):   0-23
NUMA node1 CPU(s):   24-47
NUMA node2 CPU(s):   48-71
NUMA node3 CPU(s):   72-95
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics
cpuid asimdrdm dcpop

----

diff --git a/include/block/aio.h b/include/block/aio.h
index 0ca25dfec6..0724086d91 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -84,6 +84,7 @@ struct AioContext {
      * dispatch phase, hence a simple counter is enough for them.
      */
     uint32_t notify_me;
+    QemuMutex notify_me_lcktest;

     /* A lock to protect between QEMUBH and AioHandler adders and deleter,
      * and to ensure that no callbacks are removed while we're walking and
diff --git a/util/aio-posix.c b/util/aio-posix.c
index 51c41ed3c9..031d6e2997 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -529,7 +529,9 @@ static bool run_poll_handlers(AioContext *ctx,
int64_t max_ns, int64_t *timeout)
     bool progress;
     int64_t start_time, elapsed_time;

+    qemu_mutex_lock(&ctx->notify_me_lcktest);
     assert(ctx->notify_me);
+    qemu_mutex_unlock(&ctx->notify_me_lcktest);
     assert(qemu_lockcnt_count(&ctx->list_lock) > 0);

     trace_run_poll_handlers_begin(ctx, max_ns, *timeout);
@@ -601,8 +603,10 @@ bool aio_poll(AioContext *ctx, bool blocking)
      * so disable the optimization now.
      */
     if (blocking) {
+        qemu_mutex_lock(&ctx->notify_me_lcktest);
         assert(in_aio_context_home_thread(ctx));
         atomic_add(&ctx->notify_me, 2);
+        qemu_mutex_unlock(&ctx->notify_me_lcktest);
     }

     qemu_lockcnt_inc(&ctx->list_lock);
@@ -647,8 +651,10 @@ bool aio_poll(AioContext *ctx, bool blocking)
     }

     if (blocking) {
+        qemu_mutex_lock(&ctx->notify_me_lcktest);
         atomic_sub(&ctx->notify_me, 2);
         aio_notify_accept(ctx);
+        qemu_mutex_unlock(&ctx->notify_me_lcktest);
     }

     /* Adjust polling time */
diff --git a/util/async.c b/util/async.c
index c10642a385..140e1e86f5 100644
--- a/util/async.c
+++ b/util/async.c
@@ -221,7 +221,9 @@ aio_ctx_prepare(GSource *source, gint    *timeout)
 {
     AioContext *ctx = (AioContext *) source;

+    qemu_mutex_lock(&ctx->notify_me_lcktest);
     atomic_or(&ctx->notify_me, 1);
+    qemu_mutex_unlock(&ctx->notify_me_lcktest);

     /* We assume there is no timeout already supplied */
     *timeout = qemu_timeout_ns_to_ms(aio_compute_timeout(ctx));
@@ -239,8 +241,10 @@ aio_ctx_check(GSource *source)
     AioContext *ctx = (AioContext *) source;
     QEMUBH *bh;

+    qemu_mutex_lock(&ctx->notify_me_lcktest);
     atomic_and(&ctx->notify_me, ~1);
     aio_notify_accept(ctx);
+    qemu_mutex_unlock(&ctx->notify_me_lcktest);

     for (bh = ctx->first_bh; bh; bh = bh->next) {
         if (bh->scheduled) {
@@ -346,11 +350,13 @@ void aio_notify(AioContext *ctx)
     /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
      * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
      */
-    smp_mb();
+    //smp_mb();
+    qemu_mutex_lock(&ctx->notify_me_lcktest);
     if (ctx->notify_me) {
         event_notifier_set(&ctx->notifier);
         atomic_mb_set(&ctx->notified, true);
     }
+    qemu_mutex_unlock(&ctx->notify_me_lcktest);
 }

 void aio_notify_accept(AioContext *ctx)
@@ -424,6 +430,8 @@ AioContext *aio_context_new(Error **errp)
     ctx->co_schedule_bh = aio_bh_new(ctx, co_schedule_bh_cb, ctx);
     QSLIST_INIT(&ctx->scheduled_coroutines);

+    qemu_rec_mutex_init(&ctx->notify_me_lcktest);
+
     aio_set_event_notifier(ctx, &ctx->notifier,
                            false,
                            (EventNotifierHandler *)

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in QEMU:
  In Progress
Status in qemu package in Ubuntu:
  In Progress

Bug description:
  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  [ Original Description ]

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Qemu-devel] [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (22 preceding siblings ...)
  2019-09-11 11:19 ` Rafael David Tinoco
@ 2019-09-11 19:23 ` Rafael David Tinoco
  2019-10-02 11:02 ` Jan Glauber
                   ` (70 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Rafael David Tinoco @ 2019-09-11 19:23 UTC (permalink / raw)
  To: qemu-devel

** Also affects: qemu (Ubuntu Ff-series)
   Importance: Undecided
       Status: New

** Also affects: qemu (Ubuntu Bionic)
   Importance: Undecided
       Status: New

** Also affects: qemu (Ubuntu Eoan)
   Importance: Medium
     Assignee: Rafael David Tinoco (rafaeldtinoco)
       Status: In Progress

** Also affects: qemu (Ubuntu Disco)
   Importance: Undecided
       Status: New

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in QEMU:
  In Progress
Status in qemu package in Ubuntu:
  In Progress
Status in qemu source package in Bionic:
  New
Status in qemu source package in Disco:
  New
Status in qemu source package in Eoan:
  In Progress
Status in qemu source package in FF-Series:
  New

Bug description:
  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  [ Original Description ]

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues
  2019-09-11 19:09     ` Rafael David Tinoco
  2019-09-11 19:09       ` [Qemu-devel] [Bug 1805256] " Rafael David Tinoco
@ 2019-09-24 20:25       ` dann frazier
  2019-09-24 20:25         ` [Bug 1805256] " dann frazier
  2019-10-02  9:23         ` Jan Glauber
  1 sibling, 2 replies; 141+ messages in thread
From: dann frazier @ 2019-09-24 20:25 UTC (permalink / raw)
  To: Rafael David Tinoco
  Cc: lizhengui, Jan Glauber, QEMU Developers, Bug 1805256,
	QEMU Developers - ARM, Paolo Bonzini

On Wed, Sep 11, 2019 at 04:09:25PM -0300, Rafael David Tinoco wrote:
> > Zhengui's theory that notify_me doesn't work properly on ARM is more
> > promising, but he couldn't provide a clear explanation of why he thought
> > notify_me is involved.  In particular, I would have expected notify_me to
> > be wrong if the qemu_poll_ns call came from aio_ctx_dispatch, for example:
> > 
> > 
> >     glib_pollfds_fill
> >       g_main_context_prepare
> >         aio_ctx_prepare
> >           atomic_or(&ctx->notify_me, 1)
> >     qemu_poll_ns
> >     glib_pollfds_poll
> >       g_main_context_check
> >         aio_ctx_check
> >           atomic_and(&ctx->notify_me, ~1)
> >       g_main_context_dispatch
> >         aio_ctx_dispatch
> >           /* do something for event */
> >             qemu_poll_ns 
> > 
> 
> Paolo,
> 
> I tried confining execution in a single NUMA domain (cpu & mem) and
> still faced the issue, then, I added a mutex "ctx->notify_me_lcktest"
> into context to protect "ctx->notify_me", like showed bellow, and it
> seems to have either fixed or mitigated it.
> 
> I was able to cause the hung once every 3 or 4 runs. I have already ran
> qemu-img convert more than 30 times now and couldn't reproduce it again.
> 
> Next step is to play with the barriers and check why existing ones
> aren't enough for ordering access to ctx->notify_me ... or should I
> try/do something else in your opinion ?
> 
> This arch/machine (Huawei D06):
> 
> $ lscpu
> Architecture:        aarch64
> Byte Order:          Little Endian
> CPU(s):              96
> On-line CPU(s) list: 0-95
> Thread(s) per core:  1
> Core(s) per socket:  48
> Socket(s):           2
> NUMA node(s):        4
> Vendor ID:           0x48
> Model:               0
> Stepping:            0x0
> CPU max MHz:         2000.0000
> CPU min MHz:         200.0000
> BogoMIPS:            200.00
> L1d cache:           64K
> L1i cache:           64K
> L2 cache:            512K
> L3 cache:            32768K
> NUMA node0 CPU(s):   0-23
> NUMA node1 CPU(s):   24-47
> NUMA node2 CPU(s):   48-71
> NUMA node3 CPU(s):   72-95
> Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics
> cpuid asimdrdm dcpop

Note that I'm also seeing this on a ThunderX2 (same calltrace):

$ lscpu
Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              224
On-line CPU(s) list: 0-223
Thread(s) per core:  4
Core(s) per socket:  28
Socket(s):           2
NUMA node(s):        2
Vendor ID:           Cavium
Model:               1
Model name:          ThunderX2 99xx
Stepping:            0x1
BogoMIPS:            400.00
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            32768K
NUMA node0 CPU(s):   0-111
NUMA node1 CPU(s):   112-223
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics cpuid asimdrdm

  -dann

> ----
> 
> diff --git a/include/block/aio.h b/include/block/aio.h
> index 0ca25dfec6..0724086d91 100644
> --- a/include/block/aio.h
> +++ b/include/block/aio.h
> @@ -84,6 +84,7 @@ struct AioContext {
>       * dispatch phase, hence a simple counter is enough for them.
>       */
>      uint32_t notify_me;
> +    QemuMutex notify_me_lcktest;
> 
>      /* A lock to protect between QEMUBH and AioHandler adders and deleter,
>       * and to ensure that no callbacks are removed while we're walking and
> diff --git a/util/aio-posix.c b/util/aio-posix.c
> index 51c41ed3c9..031d6e2997 100644
> --- a/util/aio-posix.c
> +++ b/util/aio-posix.c
> @@ -529,7 +529,9 @@ static bool run_poll_handlers(AioContext *ctx,
> int64_t max_ns, int64_t *timeout)
>      bool progress;
>      int64_t start_time, elapsed_time;
> 
> +    qemu_mutex_lock(&ctx->notify_me_lcktest);
>      assert(ctx->notify_me);
> +    qemu_mutex_unlock(&ctx->notify_me_lcktest);
>      assert(qemu_lockcnt_count(&ctx->list_lock) > 0);
> 
>      trace_run_poll_handlers_begin(ctx, max_ns, *timeout);
> @@ -601,8 +603,10 @@ bool aio_poll(AioContext *ctx, bool blocking)
>       * so disable the optimization now.
>       */
>      if (blocking) {
> +        qemu_mutex_lock(&ctx->notify_me_lcktest);
>          assert(in_aio_context_home_thread(ctx));
>          atomic_add(&ctx->notify_me, 2);
> +        qemu_mutex_unlock(&ctx->notify_me_lcktest);
>      }
> 
>      qemu_lockcnt_inc(&ctx->list_lock);
> @@ -647,8 +651,10 @@ bool aio_poll(AioContext *ctx, bool blocking)
>      }
> 
>      if (blocking) {
> +        qemu_mutex_lock(&ctx->notify_me_lcktest);
>          atomic_sub(&ctx->notify_me, 2);
>          aio_notify_accept(ctx);
> +        qemu_mutex_unlock(&ctx->notify_me_lcktest);
>      }
> 
>      /* Adjust polling time */
> diff --git a/util/async.c b/util/async.c
> index c10642a385..140e1e86f5 100644
> --- a/util/async.c
> +++ b/util/async.c
> @@ -221,7 +221,9 @@ aio_ctx_prepare(GSource *source, gint    *timeout)
>  {
>      AioContext *ctx = (AioContext *) source;
> 
> +    qemu_mutex_lock(&ctx->notify_me_lcktest);
>      atomic_or(&ctx->notify_me, 1);
> +    qemu_mutex_unlock(&ctx->notify_me_lcktest);
> 
>      /* We assume there is no timeout already supplied */
>      *timeout = qemu_timeout_ns_to_ms(aio_compute_timeout(ctx));
> @@ -239,8 +241,10 @@ aio_ctx_check(GSource *source)
>      AioContext *ctx = (AioContext *) source;
>      QEMUBH *bh;
> 
> +    qemu_mutex_lock(&ctx->notify_me_lcktest);
>      atomic_and(&ctx->notify_me, ~1);
>      aio_notify_accept(ctx);
> +    qemu_mutex_unlock(&ctx->notify_me_lcktest);
> 
>      for (bh = ctx->first_bh; bh; bh = bh->next) {
>          if (bh->scheduled) {
> @@ -346,11 +350,13 @@ void aio_notify(AioContext *ctx)
>      /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
>       * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
>       */
> -    smp_mb();
> +    //smp_mb();
> +    qemu_mutex_lock(&ctx->notify_me_lcktest);
>      if (ctx->notify_me) {
>          event_notifier_set(&ctx->notifier);
>          atomic_mb_set(&ctx->notified, true);
>      }
> +    qemu_mutex_unlock(&ctx->notify_me_lcktest);
>  }
> 
>  void aio_notify_accept(AioContext *ctx)
> @@ -424,6 +430,8 @@ AioContext *aio_context_new(Error **errp)
>      ctx->co_schedule_bh = aio_bh_new(ctx, co_schedule_bh_cb, ctx);
>      QSLIST_INIT(&ctx->scheduled_coroutines);
> 
> +    qemu_rec_mutex_init(&ctx->notify_me_lcktest);
> +
>      aio_set_event_notifier(ctx, &ctx->notifier,
>                             false,
>                             (EventNotifierHandler *)
> 


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues
  2019-09-24 20:25       ` [Qemu-devel] " dann frazier
@ 2019-09-24 20:25         ` dann frazier
  2019-10-02  9:23         ` Jan Glauber
  1 sibling, 0 replies; 141+ messages in thread
From: dann frazier @ 2019-09-24 20:25 UTC (permalink / raw)
  To: qemu-devel

On Wed, Sep 11, 2019 at 04:09:25PM -0300, Rafael David Tinoco wrote:
> > Zhengui's theory that notify_me doesn't work properly on ARM is more
> > promising, but he couldn't provide a clear explanation of why he thought
> > notify_me is involved.  In particular, I would have expected notify_me to
> > be wrong if the qemu_poll_ns call came from aio_ctx_dispatch, for example:
> > 
> > 
> >     glib_pollfds_fill
> >       g_main_context_prepare
> >         aio_ctx_prepare
> >           atomic_or(&ctx->notify_me, 1)
> >     qemu_poll_ns
> >     glib_pollfds_poll
> >       g_main_context_check
> >         aio_ctx_check
> >           atomic_and(&ctx->notify_me, ~1)
> >       g_main_context_dispatch
> >         aio_ctx_dispatch
> >           /* do something for event */
> >             qemu_poll_ns 
> > 
> 
> Paolo,
> 
> I tried confining execution in a single NUMA domain (cpu & mem) and
> still faced the issue, then, I added a mutex "ctx->notify_me_lcktest"
> into context to protect "ctx->notify_me", like showed bellow, and it
> seems to have either fixed or mitigated it.
> 
> I was able to cause the hung once every 3 or 4 runs. I have already ran
> qemu-img convert more than 30 times now and couldn't reproduce it again.
> 
> Next step is to play with the barriers and check why existing ones
> aren't enough for ordering access to ctx->notify_me ... or should I
> try/do something else in your opinion ?
> 
> This arch/machine (Huawei D06):
> 
> $ lscpu
> Architecture:        aarch64
> Byte Order:          Little Endian
> CPU(s):              96
> On-line CPU(s) list: 0-95
> Thread(s) per core:  1
> Core(s) per socket:  48
> Socket(s):           2
> NUMA node(s):        4
> Vendor ID:           0x48
> Model:               0
> Stepping:            0x0
> CPU max MHz:         2000.0000
> CPU min MHz:         200.0000
> BogoMIPS:            200.00
> L1d cache:           64K
> L1i cache:           64K
> L2 cache:            512K
> L3 cache:            32768K
> NUMA node0 CPU(s):   0-23
> NUMA node1 CPU(s):   24-47
> NUMA node2 CPU(s):   48-71
> NUMA node3 CPU(s):   72-95
> Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics
> cpuid asimdrdm dcpop

Note that I'm also seeing this on a ThunderX2 (same calltrace):

$ lscpu
Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              224
On-line CPU(s) list: 0-223
Thread(s) per core:  4
Core(s) per socket:  28
Socket(s):           2
NUMA node(s):        2
Vendor ID:           Cavium
Model:               1
Model name:          ThunderX2 99xx
Stepping:            0x1
BogoMIPS:            400.00
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            32768K
NUMA node0 CPU(s):   0-111
NUMA node1 CPU(s):   112-223
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics cpuid asimdrdm

  -dann

> ----
> 
> diff --git a/include/block/aio.h b/include/block/aio.h
> index 0ca25dfec6..0724086d91 100644
> --- a/include/block/aio.h
> +++ b/include/block/aio.h
> @@ -84,6 +84,7 @@ struct AioContext {
>       * dispatch phase, hence a simple counter is enough for them.
>       */
>      uint32_t notify_me;
> +    QemuMutex notify_me_lcktest;
> 
>      /* A lock to protect between QEMUBH and AioHandler adders and deleter,
>       * and to ensure that no callbacks are removed while we're walking and
> diff --git a/util/aio-posix.c b/util/aio-posix.c
> index 51c41ed3c9..031d6e2997 100644
> --- a/util/aio-posix.c
> +++ b/util/aio-posix.c
> @@ -529,7 +529,9 @@ static bool run_poll_handlers(AioContext *ctx,
> int64_t max_ns, int64_t *timeout)
>      bool progress;
>      int64_t start_time, elapsed_time;
> 
> +    qemu_mutex_lock(&ctx->notify_me_lcktest);
>      assert(ctx->notify_me);
> +    qemu_mutex_unlock(&ctx->notify_me_lcktest);
>      assert(qemu_lockcnt_count(&ctx->list_lock) > 0);
> 
>      trace_run_poll_handlers_begin(ctx, max_ns, *timeout);
> @@ -601,8 +603,10 @@ bool aio_poll(AioContext *ctx, bool blocking)
>       * so disable the optimization now.
>       */
>      if (blocking) {
> +        qemu_mutex_lock(&ctx->notify_me_lcktest);
>          assert(in_aio_context_home_thread(ctx));
>          atomic_add(&ctx->notify_me, 2);
> +        qemu_mutex_unlock(&ctx->notify_me_lcktest);
>      }
> 
>      qemu_lockcnt_inc(&ctx->list_lock);
> @@ -647,8 +651,10 @@ bool aio_poll(AioContext *ctx, bool blocking)
>      }
> 
>      if (blocking) {
> +        qemu_mutex_lock(&ctx->notify_me_lcktest);
>          atomic_sub(&ctx->notify_me, 2);
>          aio_notify_accept(ctx);
> +        qemu_mutex_unlock(&ctx->notify_me_lcktest);
>      }
> 
>      /* Adjust polling time */
> diff --git a/util/async.c b/util/async.c
> index c10642a385..140e1e86f5 100644
> --- a/util/async.c
> +++ b/util/async.c
> @@ -221,7 +221,9 @@ aio_ctx_prepare(GSource *source, gint    *timeout)
>  {
>      AioContext *ctx = (AioContext *) source;
> 
> +    qemu_mutex_lock(&ctx->notify_me_lcktest);
>      atomic_or(&ctx->notify_me, 1);
> +    qemu_mutex_unlock(&ctx->notify_me_lcktest);
> 
>      /* We assume there is no timeout already supplied */
>      *timeout = qemu_timeout_ns_to_ms(aio_compute_timeout(ctx));
> @@ -239,8 +241,10 @@ aio_ctx_check(GSource *source)
>      AioContext *ctx = (AioContext *) source;
>      QEMUBH *bh;
> 
> +    qemu_mutex_lock(&ctx->notify_me_lcktest);
>      atomic_and(&ctx->notify_me, ~1);
>      aio_notify_accept(ctx);
> +    qemu_mutex_unlock(&ctx->notify_me_lcktest);
> 
>      for (bh = ctx->first_bh; bh; bh = bh->next) {
>          if (bh->scheduled) {
> @@ -346,11 +350,13 @@ void aio_notify(AioContext *ctx)
>      /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
>       * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
>       */
> -    smp_mb();
> +    //smp_mb();
> +    qemu_mutex_lock(&ctx->notify_me_lcktest);
>      if (ctx->notify_me) {
>          event_notifier_set(&ctx->notifier);
>          atomic_mb_set(&ctx->notified, true);
>      }
> +    qemu_mutex_unlock(&ctx->notify_me_lcktest);
>  }
> 
>  void aio_notify_accept(AioContext *ctx)
> @@ -424,6 +430,8 @@ AioContext *aio_context_new(Error **errp)
>      ctx->co_schedule_bh = aio_bh_new(ctx, co_schedule_bh_cb, ctx);
>      QSLIST_INIT(&ctx->scheduled_coroutines);
> 
> +    qemu_rec_mutex_init(&ctx->notify_me_lcktest);
> +
>      aio_set_event_notifier(ctx, &ctx->notifier,
>                             false,
>                             (EventNotifierHandler *)
>

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in QEMU:
  In Progress
Status in qemu package in Ubuntu:
  In Progress
Status in qemu source package in Bionic:
  New
Status in qemu source package in Disco:
  New
Status in qemu source package in Eoan:
  In Progress
Status in qemu source package in FF-Series:
  New

Bug description:
  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  [ Original Description ]

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues
  2019-09-24 20:25       ` [Qemu-devel] " dann frazier
  2019-09-24 20:25         ` [Bug 1805256] " dann frazier
@ 2019-10-02  9:23         ` Jan Glauber
  2019-10-02  9:23           ` Jan Glauber
                             ` (2 more replies)
  1 sibling, 3 replies; 141+ messages in thread
From: Jan Glauber @ 2019-10-02  9:23 UTC (permalink / raw)
  To: qemu-devel

I've looked into this on ThunderX2. The arm64 code generated for the
atomic_[add|sub] accesses of ctx->notify_me doesn't contain any
memory barriers. It is just plain ldaxr/stlxr.

>From my understanding this is not sufficient for SMP sync.

If I read this comment correct:

    void aio_notify(AioContext *ctx)
    {
        /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
         * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
         */
        smp_mb();
        if (ctx->notify_me) {

it points out that the smp_mb() should be paired. But as
I said the used atomics don't generate any barriers at all.

I've tried to verify me theory with this patch and didn't run into the
issue for ~500 iterations (usually I would trigger the issue ~20 iterations).

--Jan

diff --git a/util/aio-posix.c b/util/aio-posix.c
index d8f0cb4af8dd..d07dcd4e9993 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -591,6 +591,7 @@ bool aio_poll(AioContext *ctx, bool blocking)
      */
     if (blocking) {
         atomic_add(&ctx->notify_me, 2);
+        smp_mb();
     }
 
     qemu_lockcnt_inc(&ctx->list_lock);
@@ -632,6 +633,7 @@ bool aio_poll(AioContext *ctx, bool blocking)
 
     if (blocking) {
         atomic_sub(&ctx->notify_me, 2);
+        smp_mb();
     }
 
     /* Adjust polling time */
diff --git a/util/async.c b/util/async.c
index 4dd9d95a9e73..92ac209c4615 100644
--- a/util/async.c
+++ b/util/async.c
@@ -222,6 +222,7 @@ aio_ctx_prepare(GSource *source, gint    *timeout)
     AioContext *ctx = (AioContext *) source;
 
     atomic_or(&ctx->notify_me, 1);
+    smp_mb();
 
     /* We assume there is no timeout already supplied */
     *timeout = qemu_timeout_ns_to_ms(aio_compute_timeout(ctx));
@@ -240,6 +241,7 @@ aio_ctx_check(GSource *source)
     QEMUBH *bh;
 
     atomic_and(&ctx->notify_me, ~1);
+    smp_mb();
     aio_notify_accept(ctx);
 
     for (bh = ctx->first_bh; bh; bh = bh->next) {

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in QEMU:
  In Progress
Status in qemu package in Ubuntu:
  In Progress
Status in qemu source package in Bionic:
  New
Status in qemu source package in Disco:
  New
Status in qemu source package in Eoan:
  In Progress
Status in qemu source package in FF-Series:
  New

Bug description:
  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  [ Original Description ]

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues
  2019-10-02  9:23         ` Jan Glauber
@ 2019-10-02  9:23           ` Jan Glauber
  2019-10-02  9:45           ` Paolo Bonzini
  2019-10-07 11:06           ` [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues Paolo Bonzini
  2 siblings, 0 replies; 141+ messages in thread
From: Jan Glauber @ 2019-10-02  9:23 UTC (permalink / raw)
  To: dann frazier
  Cc: Rafael David Tinoco, lizhengui, QEMU Developers, Bug 1805256,
	QEMU Developers - ARM, Paolo Bonzini

I've looked into this on ThunderX2. The arm64 code generated for the
atomic_[add|sub] accesses of ctx->notify_me doesn't contain any
memory barriers. It is just plain ldaxr/stlxr.

From my understanding this is not sufficient for SMP sync.

If I read this comment correct:

    void aio_notify(AioContext *ctx)
    {
        /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
         * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
         */
        smp_mb();
        if (ctx->notify_me) {

it points out that the smp_mb() should be paired. But as
I said the used atomics don't generate any barriers at all.

I've tried to verify me theory with this patch and didn't run into the
issue for ~500 iterations (usually I would trigger the issue ~20 iterations).

--Jan

diff --git a/util/aio-posix.c b/util/aio-posix.c
index d8f0cb4af8dd..d07dcd4e9993 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -591,6 +591,7 @@ bool aio_poll(AioContext *ctx, bool blocking)
      */
     if (blocking) {
         atomic_add(&ctx->notify_me, 2);
+        smp_mb();
     }
 
     qemu_lockcnt_inc(&ctx->list_lock);
@@ -632,6 +633,7 @@ bool aio_poll(AioContext *ctx, bool blocking)
 
     if (blocking) {
         atomic_sub(&ctx->notify_me, 2);
+        smp_mb();
     }
 
     /* Adjust polling time */
diff --git a/util/async.c b/util/async.c
index 4dd9d95a9e73..92ac209c4615 100644
--- a/util/async.c
+++ b/util/async.c
@@ -222,6 +222,7 @@ aio_ctx_prepare(GSource *source, gint    *timeout)
     AioContext *ctx = (AioContext *) source;
 
     atomic_or(&ctx->notify_me, 1);
+    smp_mb();
 
     /* We assume there is no timeout already supplied */
     *timeout = qemu_timeout_ns_to_ms(aio_compute_timeout(ctx));
@@ -240,6 +241,7 @@ aio_ctx_check(GSource *source)
     QEMUBH *bh;
 
     atomic_and(&ctx->notify_me, ~1);
+    smp_mb();
     aio_notify_accept(ctx);
 
     for (bh = ctx->first_bh; bh; bh = bh->next) {


^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues
  2019-10-02  9:23         ` Jan Glauber
  2019-10-02  9:23           ` Jan Glauber
@ 2019-10-02  9:45           ` Paolo Bonzini
  2019-10-02 11:05             ` Jan Glauber
  2019-10-07 11:06           ` [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues Paolo Bonzini
  2 siblings, 1 reply; 141+ messages in thread
From: Paolo Bonzini @ 2019-10-02  9:45 UTC (permalink / raw)
  To: Jan Glauber, dann frazier
  Cc: QEMU Developers, Rafael David Tinoco, QEMU Developers - ARM,
	lizhengui, Bug 1805256

On 02/10/19 11:23, Jan Glauber wrote:
> I've tried to verify me theory with this patch and didn't run into the
> issue for ~500 iterations (usually I would trigger the issue ~20 iterations).

Awesome!  That would be a compiler bug though, as atomic_add and atomic_sub
are defined as sequentially consistent:

#define atomic_add(ptr, n) ((void) __atomic_fetch_add(ptr, n, __ATOMIC_SEQ_CST))
#define atomic_sub(ptr, n) ((void) __atomic_fetch_sub(ptr, n, __ATOMIC_SEQ_CST))

What compiler are you using and what distro?  Can you compile util/aio-posix.c
with "-fdump-rtl-all -fdump-tree-all", zip the boatload of debugging files and
send them my way?

Thanks,

Paolo



^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (23 preceding siblings ...)
  2019-09-11 19:23 ` Rafael David Tinoco
@ 2019-10-02 11:02 ` Jan Glauber
  2019-10-03 12:28 ` Rafael David Tinoco
                   ` (69 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Jan Glauber @ 2019-10-02 11:02 UTC (permalink / raw)
  To: qemu-devel

Debug files for aio-posix generated on 18.04 on ThunderX2.

Compiler:
gcc version 7.4.0 (Ubuntu/Linaro 7.4.0-1ubuntu1~18.04.1)

Distro:
Ubuntu 18.04.3 LTS

** Attachment added: "aio-posix.tar.xz"
   https://bugs.launchpad.net/qemu/+bug/1805256/+attachment/5293619/+files/aio-posix.tar.xz

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in QEMU:
  In Progress
Status in qemu package in Ubuntu:
  In Progress
Status in qemu source package in Bionic:
  New
Status in qemu source package in Disco:
  New
Status in qemu source package in Eoan:
  In Progress
Status in qemu source package in FF-Series:
  New

Bug description:
  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  [ Original Description ]

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues
  2019-10-02  9:45           ` Paolo Bonzini
@ 2019-10-02 11:05             ` Jan Glauber
  2019-10-02 11:05               ` [Bug 1805256] " Jan Glauber
  2019-10-02 13:20               ` memory barriers and ATOMIC_SEQ_CST on aarch64 (was Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues) Paolo Bonzini
  0 siblings, 2 replies; 141+ messages in thread
From: Jan Glauber @ 2019-10-02 11:05 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Rafael David Tinoco, lizhengui, dann frazier, QEMU Developers,
	Bug 1805256, QEMU Developers - ARM

On Wed, Oct 02, 2019 at 11:45:19AM +0200, Paolo Bonzini wrote:
> On 02/10/19 11:23, Jan Glauber wrote:
> > I've tried to verify me theory with this patch and didn't run into the
> > issue for ~500 iterations (usually I would trigger the issue ~20 iterations).
> 
> Awesome!  That would be a compiler bug though, as atomic_add and atomic_sub
> are defined as sequentially consistent:
> 
> #define atomic_add(ptr, n) ((void) __atomic_fetch_add(ptr, n, __ATOMIC_SEQ_CST))
> #define atomic_sub(ptr, n) ((void) __atomic_fetch_sub(ptr, n, __ATOMIC_SEQ_CST))

Compiler bug sounds kind of unlikely...

> What compiler are you using and what distro?  Can you compile util/aio-posix.c
> with "-fdump-rtl-all -fdump-tree-all", zip the boatload of debugging files and
> send them my way?

This is on Ubuntu 18.04.3,
gcc version 7.4.0 (Ubuntu/Linaro 7.4.0-1ubuntu1~18.04.1)

I've uploaded the debug files to:
https://bugs.launchpad.net/qemu/+bug/1805256/+attachment/5293619/+files/aio-posix.tar.xz

Thanks,
Jan

> Thanks,
> 
> Paolo


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues
  2019-10-02 11:05             ` Jan Glauber
@ 2019-10-02 11:05               ` Jan Glauber
  2019-10-02 13:20               ` memory barriers and ATOMIC_SEQ_CST on aarch64 (was Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues) Paolo Bonzini
  1 sibling, 0 replies; 141+ messages in thread
From: Jan Glauber @ 2019-10-02 11:05 UTC (permalink / raw)
  To: qemu-devel

On Wed, Oct 02, 2019 at 11:45:19AM +0200, Paolo Bonzini wrote:
> On 02/10/19 11:23, Jan Glauber wrote:
> > I've tried to verify me theory with this patch and didn't run into the
> > issue for ~500 iterations (usually I would trigger the issue ~20 iterations).
> 
> Awesome!  That would be a compiler bug though, as atomic_add and atomic_sub
> are defined as sequentially consistent:
> 
> #define atomic_add(ptr, n) ((void) __atomic_fetch_add(ptr, n, __ATOMIC_SEQ_CST))
> #define atomic_sub(ptr, n) ((void) __atomic_fetch_sub(ptr, n, __ATOMIC_SEQ_CST))

Compiler bug sounds kind of unlikely...

> What compiler are you using and what distro?  Can you compile util/aio-posix.c
> with "-fdump-rtl-all -fdump-tree-all", zip the boatload of debugging files and
> send them my way?

This is on Ubuntu 18.04.3,
gcc version 7.4.0 (Ubuntu/Linaro 7.4.0-1ubuntu1~18.04.1)

I've uploaded the debug files to:
https://bugs.launchpad.net/qemu/+bug/1805256/+attachment/5293619/+files/aio-posix.tar.xz

Thanks,
Jan

> Thanks,
> 
> Paolo

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in QEMU:
  In Progress
Status in qemu package in Ubuntu:
  In Progress
Status in qemu source package in Bionic:
  New
Status in qemu source package in Disco:
  New
Status in qemu source package in Eoan:
  In Progress
Status in qemu source package in FF-Series:
  New

Bug description:
  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  [ Original Description ]

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* memory barriers and ATOMIC_SEQ_CST on aarch64 (was Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues)
  2019-10-02 11:05             ` Jan Glauber
  2019-10-02 11:05               ` [Bug 1805256] " Jan Glauber
@ 2019-10-02 13:20               ` Paolo Bonzini
  2019-10-02 14:58                 ` Torvald Riegel
  1 sibling, 1 reply; 141+ messages in thread
From: Paolo Bonzini @ 2019-10-02 13:20 UTC (permalink / raw)
  To: Jan Glauber
  Cc: Rafael David Tinoco, lizhengui, dann frazier, Richard Henderson,
	QEMU Developers, Bug 1805256, QEMU Developers - ARM,
	Torvald Riegel, Will Deacon

On 02/10/19 13:05, Jan Glauber wrote:
> The arm64 code generated for the
> atomic_[add|sub] accesses of ctx->notify_me doesn't contain any
> memory barriers. It is just plain ldaxr/stlxr.
> 
> From my understanding this is not sufficient for SMP sync.
> 
>>> If I read this comment correct:
>>> 
>>>     void aio_notify(AioContext *ctx)
>>>     {
>>>         /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
>>>          * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
>>>          */
>>>         smp_mb();
>>>         if (ctx->notify_me) {
>>>
>>> it points out that the smp_mb() should be paired. But as
>>> I said the used atomics don't generate any barriers at all.
>>
>> Awesome!  That would be a compiler bug though, as atomic_add and atomic_sub
>> are defined as sequentially consistent:
>>
>> #define atomic_add(ptr, n) ((void) __atomic_fetch_add(ptr, n, __ATOMIC_SEQ_CST))
>> #define atomic_sub(ptr, n) ((void) __atomic_fetch_sub(ptr, n, __ATOMIC_SEQ_CST))
>
> Compiler bug sounds kind of unlikely...
Indeed the assembly produced by the compiler matches for example the
mappings at https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html.  A
small testcase is as follows:

  int ctx_notify_me;
  int bh_scheduled;

  int x()
  {
      int one = 1;
      int ret;
      __atomic_store(&bh_scheduled, &one, __ATOMIC_RELEASE);     // x1
      __atomic_thread_fence(__ATOMIC_SEQ_CST);                   // x2
      __atomic_load(&ctx_notify_me, &ret, __ATOMIC_RELAXED);     // x3
      return ret;
  }

  int y()
  {
      int ret;
      __atomic_fetch_add(&ctx_notify_me, 2, __ATOMIC_SEQ_CST);  // y1
      __atomic_load(&bh_scheduled, &ret, __ATOMIC_RELAXED);     // y2
      return ret;
  }

Here y (which is aio_poll) wants to order the write to ctx->notify_me
before reads of bh->scheduled.  However, the processor can speculate the
load of bh->scheduled between the load-acquire and store-release of
ctx->notify_me.  So you can have something like:

 thread 0 (y)                          thread 1 (x)
 -----------------------------------   -----------------------------
 y1: load-acq ctx->notify_me
 y2: load-rlx bh->scheduled
                                       x1: store-rel bh->scheduled <-- 1
                                       x2: memory barrier
                                       x3: load-rlx ctx->notify_me
 y1: store-rel ctx->notify_me <-- 2

Being very puzzled, I tried to put this into cppmem:

  int main() {
    atomic_int ctx_notify_me = 0;
    atomic_int bh_scheduled = 0;
    {{{ {
          bh_scheduled.store(1, mo_release);
          atomic_thread_fence(mo_seq_cst);
          // must be zero since the bug report shows no notification
          ctx_notify_me.load(mo_relaxed).readsvalue(0);
        }
    ||| {
          ctx_notify_me.store(2, mo_seq_cst);
          r2=bh_scheduled.load(mo_relaxed);
        }
    }}};
    return 0;
  }

and much to my surprise, the tool said r2 *can* be 0.  Same if I put a
CAS like

        cas_strong_explicit(ctx_notify_me.readsvalue(0), 0, 2,
                            mo_seq_cst, mo_seq_cst);

which resembles the code in the test case a bit more.

I then found a discussion about using the C11 memory model in Linux
(https://gcc.gnu.org/ml/gcc/2014-02/msg00058.html) which contains the
following statement, which is a bit disheartening even though it is
about a different test:

   My first gut feeling was that the assertion should never fire, but
   that was wrong because (as I seem to usually forget) the seq-cst
   total order is just a constraint but doesn't itself contribute
   to synchronizes-with -- but this is different for seq-cst fences.

and later in the thread:

   Use of C11 atomics to implement Linux kernel atomic operations
   requires knowledge of the underlying architecture and the compiler's
   implementation, as was noted earlier in this thread.

Indeed if I add an atomic_thread_fence I get only one valid execution,
where r2 must be 1.  This is similar to GCC's bug
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697, and we can fix it in
QEMU by using __sync_fetch_and_add; in fact cppmem also shows one valid
execution if the store is replaced with something like GCC's assembly
for __sync_fetch_and_add (or Linux's assembly for atomic_add_return):

        cas_strong_explicit(ctx_notify_me.readsvalue(0), 0, 2,
                            mo_release, mo_release);
        atomic_thread_fence(mo_seq_cst);

So we should:

1) understand why ATOMIC_SEQ_CST is not enough in this case.  QEMU code
seems to be making the same assumptions as Linux about the memory model,
and this is wrong because QEMU uses C11 atomics if available.
Fortunately, this kind of synchronization in QEMU is relatively rare and
only this particular bit seems affected.  If there is a fix which stays
within the C11 memory model, and does not pessimize code on x86, we can
use it[1] and document the pitfall.

2) if there's no way to fix the bug, qemu/atomic.h needs to switch to
__sync_fetch_and_add and friends.  And again, in this case the
difference between the C11 and Linux/QEMU memory models must be documented.

Torvald, Will, help me please... :((

Paolo

[1] as would be the case if fetch_add was implemented as
fetch_add(RELEASE)+thread_fence(SEQ_CST).


^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: memory barriers and ATOMIC_SEQ_CST on aarch64 (was Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues)
  2019-10-02 13:20               ` memory barriers and ATOMIC_SEQ_CST on aarch64 (was Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues) Paolo Bonzini
@ 2019-10-02 14:58                 ` Torvald Riegel
  2019-10-02 16:30                   ` Paolo Bonzini
  0 siblings, 1 reply; 141+ messages in thread
From: Torvald Riegel @ 2019-10-02 14:58 UTC (permalink / raw)
  To: Paolo Bonzini, Jan Glauber
  Cc: Rafael David Tinoco, lizhengui, dann frazier, Richard Henderson,
	QEMU Developers, Bug 1805256, QEMU Developers - ARM, Will Deacon

On Wed, 2019-10-02 at 15:20 +0200, Paolo Bonzini wrote:
> On 02/10/19 13:05, Jan Glauber wrote:
> > The arm64 code generated for the
> > atomic_[add|sub] accesses of ctx->notify_me doesn't contain any
> > memory barriers. It is just plain ldaxr/stlxr.
> > 
> > From my understanding this is not sufficient for SMP sync.
> > 
> > > > If I read this comment correct:
> > > > 
> > > >     void aio_notify(AioContext *ctx)
> > > >     {
> > > >         /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
> > > >          * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
> > > >          */
> > > >         smp_mb();
> > > >         if (ctx->notify_me) {
> > > > 
> > > > it points out that the smp_mb() should be paired. But as
> > > > I said the used atomics don't generate any barriers at all.
> > > 
> > > Awesome!  That would be a compiler bug though, as atomic_add and atomic_sub
> > > are defined as sequentially consistent:
> > > 
> > > #define atomic_add(ptr, n) ((void) __atomic_fetch_add(ptr, n, __ATOMIC_SEQ_CST))
> > > #define atomic_sub(ptr, n) ((void) __atomic_fetch_sub(ptr, n, __ATOMIC_SEQ_CST))
> > 
> > Compiler bug sounds kind of unlikely...
> 
> Indeed the assembly produced by the compiler matches for example the
> mappings at https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html.  A
> small testcase is as follows:
> 
>   int ctx_notify_me;
>   int bh_scheduled;
> 
>   int x()
>   {
>       int one = 1;
>       int ret;
>       __atomic_store(&bh_scheduled, &one, __ATOMIC_RELEASE);     // x1
>       __atomic_thread_fence(__ATOMIC_SEQ_CST);                   // x2
>       __atomic_load(&ctx_notify_me, &ret, __ATOMIC_RELAXED);     // x3
>       return ret;
>   }
> 
>   int y()
>   {
>       int ret;
>       __atomic_fetch_add(&ctx_notify_me, 2, __ATOMIC_SEQ_CST);  // y1
>       __atomic_load(&bh_scheduled, &ret, __ATOMIC_RELAXED);     // y2
>       return ret;
>   }
> 
> Here y (which is aio_poll) wants to order the write to ctx->notify_me
> before reads of bh->scheduled.  However, the processor can speculate the
> load of bh->scheduled between the load-acquire and store-release of
> ctx->notify_me.  So you can have something like:
> 
>  thread 0 (y)                          thread 1 (x)
>  -----------------------------------   -----------------------------
>  y1: load-acq ctx->notify_me
>  y2: load-rlx bh->scheduled
>                                        x1: store-rel bh->scheduled <-- 1
>                                        x2: memory barrier
>                                        x3: load-rlx ctx->notify_me
>  y1: store-rel ctx->notify_me <-- 2
> 
> Being very puzzled, I tried to put this into cppmem:
> 
>   int main() {
>     atomic_int ctx_notify_me = 0;
>     atomic_int bh_scheduled = 0;
>     {{{ {
>           bh_scheduled.store(1, mo_release);
>           atomic_thread_fence(mo_seq_cst);
>           // must be zero since the bug report shows no notification
>           ctx_notify_me.load(mo_relaxed).readsvalue(0);
>         }
>     ||| {
>           ctx_notify_me.store(2, mo_seq_cst);
>           r2=bh_scheduled.load(mo_relaxed);
>         }
>     }}};
>     return 0;
>   }
> 
> and much to my surprise, the tool said r2 *can* be 0.  Same if I put a
> CAS like
> 
>         cas_strong_explicit(ctx_notify_me.readsvalue(0), 0, 2,
>                             mo_seq_cst, mo_seq_cst);
> 
> which resembles the code in the test case a bit more.

This example looks like Dekker synchronization (if I get the intent right).

Two possible implementations of this are either (1) with all memory
accesses having seq-cst MO, or (2) with relaxed-MO accesses and seq-cst
fences on between the store and load on both ends.  It's possible to mix
both, but that get's trickier I think.  I'd prefer the one with just
fences, just because it's easiest, conceptually.

> I then found a discussion about using the C11 memory model in Linux
> (https://gcc.gnu.org/ml/gcc/2014-02/msg00058.html) which contains the
> following statement, which is a bit disheartening even though it is
> about a different test:
> 
>    My first gut feeling was that the assertion should never fire, but
>    that was wrong because (as I seem to usually forget) the seq-cst
>    total order is just a constraint but doesn't itself contribute
>    to synchronizes-with -- but this is different for seq-cst fences.

It works if you use (1) or (2) consistently.  cppmem and the Batty et al.
tech report should give you the gory details.
My comment is just about seq-cst working differently on memory accesses vs.
fences (in the way it's specified in the memory model).

> and later in the thread:
> 
>    Use of C11 atomics to implement Linux kernel atomic operations
>    requires knowledge of the underlying architecture and the compiler's
>    implementation, as was noted earlier in this thread.
> 
> Indeed if I add an atomic_thread_fence I get only one valid execution,
> where r2 must be 1.  This is similar to GCC's bug
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697, and we can fix it in
> QEMU by using __sync_fetch_and_add; in fact cppmem also shows one valid
> execution if the store is replaced with something like GCC's assembly
> for __sync_fetch_and_add (or Linux's assembly for atomic_add_return):
> 
>         cas_strong_explicit(ctx_notify_me.readsvalue(0), 0, 2,
>                             mo_release, mo_release);
>         atomic_thread_fence(mo_seq_cst);
> 
> So we should:
> 
> 1) understand why ATOMIC_SEQ_CST is not enough in this case.  QEMU code
> seems to be making the same assumptions as Linux about the memory model,
> and this is wrong because QEMU uses C11 atomics if available.
> Fortunately, this kind of synchronization in QEMU is relatively rare and
> only this particular bit seems affected.  If there is a fix which stays
> within the C11 memory model, and does not pessimize code on x86, we can
> use it[1] and document the pitfall.

Using the fences between the store/load pairs in Dekker-like
synchronization should do that, right?  It's also relatively easy to deal
with.

> 2) if there's no way to fix the bug, qemu/atomic.h needs to switch to
> __sync_fetch_and_add and friends.  And again, in this case the
> difference between the C11 and Linux/QEMU memory models must be documented.

I surely not aware of all the constraints here, but I'd be surprised if the
C11 memory model isn't good enough for portable synchronization code (with
the exception of the consume MO minefield, perhaps). 



^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: memory barriers and ATOMIC_SEQ_CST on aarch64 (was Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues)
  2019-10-02 14:58                 ` Torvald Riegel
@ 2019-10-02 16:30                   ` Paolo Bonzini
  0 siblings, 0 replies; 141+ messages in thread
From: Paolo Bonzini @ 2019-10-02 16:30 UTC (permalink / raw)
  To: Torvald Riegel, Jan Glauber
  Cc: Rafael David Tinoco, lizhengui, dann frazier, Richard Henderson,
	QEMU Developers, Bug 1805256, QEMU Developers - ARM, Will Deacon

On 02/10/19 16:58, Torvald Riegel wrote:
> This example looks like Dekker synchronization (if I get the intent right).

It is the same pattern.  However, one of the two synchronized variables
is a counter rather than just a flag.

> Two possible implementations of this are either (1) with all memory
> accesses having seq-cst MO, or (2) with relaxed-MO accesses and seq-cst
> fences on between the store and load on both ends.  It's possible to mix
> both, but that get's trickier I think.  I'd prefer the one with just
> fences, just because it's easiest, conceptually.

Got it.

I'd also prefer the one with just fences, because we only really control
one side of the synchronization primitive (ctx_notify_me in my litmus
test) and I don't like the idea of forcing seq-cst MO on the other side
(bh_scheduled).  The performance issue that I mentioned is that x86
doesn't have relaxed fetch and add, so you'd have a redundant fence like
this:

	lock	xaddl $2, mem1
	mfence
	...
	movl	mem1, %r8

(Gory QEMU details however allow us to use relaxed load and store here,
because there's only one writer).

> It works if you use (1) or (2) consistently.  cppmem and the Batty et al.
> tech report should give you the gory details.
>
>> 1) understand why ATOMIC_SEQ_CST is not enough in this case.  QEMU code
>> seems to be making the same assumptions as Linux about the memory model,
>> and this is wrong because QEMU uses C11 atomics if available.
>> Fortunately, this kind of synchronization in QEMU is relatively rare and
>> only this particular bit seems affected.  If there is a fix which stays
>> within the C11 memory model, and does not pessimize code on x86, we can
>> use it[1] and document the pitfall.
>
> Using the fences between the store/load pairs in Dekker-like
> synchronization should do that, right?  It's also relatively easy to deal
> with.
> 
>> 2) if there's no way to fix the bug, qemu/atomic.h needs to switch to
>> __sync_fetch_and_add and friends.  And again, in this case the
>> difference between the C11 and Linux/QEMU memory models must be documented.
>
> I surely not aware of all the constraints here, but I'd be surprised if the
> C11 memory model isn't good enough for portable synchronization code (with
> the exception of the consume MO minefield, perhaps). 

This helps a lot already; I'll work on a documentation and code patch.
Thanks very much.

Paolo

>>   int main() {
>>     atomic_int ctx_notify_me = 0;
>>     atomic_int bh_scheduled = 0;
>>     {{{ {
>>           bh_scheduled.store(1, mo_release);
>>           atomic_thread_fence(mo_seq_cst);
>>           // must be zero since the bug report shows no notification
>>           ctx_notify_me.load(mo_relaxed).readsvalue(0);
>>         }
>>     ||| {
>>           ctx_notify_me.store(2, mo_seq_cst);
>>           r2=bh_scheduled.load(mo_relaxed);
>>         }
>>     }}};
>>     return 0;
>>   }



^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (24 preceding siblings ...)
  2019-10-02 11:02 ` Jan Glauber
@ 2019-10-03 12:28 ` Rafael David Tinoco
  2019-10-03 12:29 ` Rafael David Tinoco
                   ` (68 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Rafael David Tinoco @ 2019-10-03 12:28 UTC (permalink / raw)
  To: qemu-devel

Documenting this here as bug# was dropped from the mail thread:

On 02/10/19 13:05, Jan Glauber wrote:
> The arm64 code generated for the
> atomic_[add|sub] accesses of ctx->notify_me doesn't contain any
> memory barriers. It is just plain ldaxr/stlxr.
>
> From my understanding this is not sufficient for SMP sync.
>
>>> If I read this comment correct:
>>>
>>>     void aio_notify(AioContext *ctx)
>>>     {
>>>         /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
>>>          * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
>>>          */
>>>         smp_mb();
>>>         if (ctx->notify_me) {
>>>
>>> it points out that the smp_mb() should be paired. But as
>>> I said the used atomics don't generate any barriers at all.
>>
>> Awesome!  That would be a compiler bug though, as atomic_add and atomic_sub
>> are defined as sequentially consistent:
>>
>> #define atomic_add(ptr, n) ((void) __atomic_fetch_add(ptr, n, __ATOMIC_SEQ_CST))
>> #define atomic_sub(ptr, n) ((void) __atomic_fetch_sub(ptr, n, __ATOMIC_SEQ_CST))
>
> Compiler bug sounds kind of unlikely...
Indeed the assembly produced by the compiler matches for example the
mappings at https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html.  A
small testcase is as follows:

  int ctx_notify_me;
  int bh_scheduled;

  int x()
  {
      int one = 1;
      int ret;
      __atomic_store(&bh_scheduled, &one, __ATOMIC_RELEASE);     // x1
      __atomic_thread_fence(__ATOMIC_SEQ_CST);                   // x2
      __atomic_load(&ctx_notify_me, &ret, __ATOMIC_RELAXED);     // x3
      return ret;
  }

  int y()
  {
      int ret;
      __atomic_fetch_add(&ctx_notify_me, 2, __ATOMIC_SEQ_CST);  // y1
      __atomic_load(&bh_scheduled, &ret, __ATOMIC_RELAXED);     // y2
      return ret;
  }

Here y (which is aio_poll) wants to order the write to ctx->notify_me
before reads of bh->scheduled.  However, the processor can speculate the
load of bh->scheduled between the load-acquire and store-release of
ctx->notify_me.  So you can have something like:

 thread 0 (y)                          thread 1 (x)
 -----------------------------------   -----------------------------
 y1: load-acq ctx->notify_me
 y2: load-rlx bh->scheduled
                                       x1: store-rel bh->scheduled <-- 1
                                       x2: memory barrier
                                       x3: load-rlx ctx->notify_me
 y1: store-rel ctx->notify_me <-- 2

Being very puzzled, I tried to put this into cppmem:

  int main() {
    atomic_int ctx_notify_me = 0;
    atomic_int bh_scheduled = 0;
    {{{ {
          bh_scheduled.store(1, mo_release);
          atomic_thread_fence(mo_seq_cst);
          // must be zero since the bug report shows no notification
          ctx_notify_me.load(mo_relaxed).readsvalue(0);
        }
    ||| {
          ctx_notify_me.store(2, mo_seq_cst);
          r2=bh_scheduled.load(mo_relaxed);
        }
    }}};
    return 0;
  }

and much to my surprise, the tool said r2 *can* be 0.  Same if I put a
CAS like

        cas_strong_explicit(ctx_notify_me.readsvalue(0), 0, 2,
                            mo_seq_cst, mo_seq_cst);

which resembles the code in the test case a bit more.

I then found a discussion about using the C11 memory model in Linux
(https://gcc.gnu.org/ml/gcc/2014-02/msg00058.html) which contains the
following statement, which is a bit disheartening even though it is
about a different test:

   My first gut feeling was that the assertion should never fire, but
   that was wrong because (as I seem to usually forget) the seq-cst
   total order is just a constraint but doesn't itself contribute
   to synchronizes-with -- but this is different for seq-cst fences.

and later in the thread:

   Use of C11 atomics to implement Linux kernel atomic operations
   requires knowledge of the underlying architecture and the compiler's
   implementation, as was noted earlier in this thread.

Indeed if I add an atomic_thread_fence I get only one valid execution,
where r2 must be 1.  This is similar to GCC's bug
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697, and we can fix it in
QEMU by using __sync_fetch_and_add; in fact cppmem also shows one valid
execution if the store is replaced with something like GCC's assembly
for __sync_fetch_and_add (or Linux's assembly for atomic_add_return):

        cas_strong_explicit(ctx_notify_me.readsvalue(0), 0, 2,
                            mo_release, mo_release);
        atomic_thread_fence(mo_seq_cst);

So we should:

1) understand why ATOMIC_SEQ_CST is not enough in this case.  QEMU code
seems to be making the same assumptions as Linux about the memory model,
and this is wrong because QEMU uses C11 atomics if available.
Fortunately, this kind of synchronization in QEMU is relatively rare and
only this particular bit seems affected.  If there is a fix which stays
within the C11 memory model, and does not pessimize code on x86, we can
use it[1] and document the pitfall.

2) if there's no way to fix the bug, qemu/atomic.h needs to switch to
__sync_fetch_and_add and friends.  And again, in this case the
difference between the C11 and Linux/QEMU memory models must be documented.

Torvald, Will, help me please... :((

Paolo

[1] as would be the case if fetch_add was implemented as
fetch_add(RELEASE)+thread_fence(SEQ_CST).


** Bug watch added: GCC Bugzilla #65697
   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in QEMU:
  In Progress
Status in qemu package in Ubuntu:
  In Progress
Status in qemu source package in Bionic:
  New
Status in qemu source package in Disco:
  New
Status in qemu source package in Eoan:
  In Progress
Status in qemu source package in FF-Series:
  New

Bug description:
  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  [ Original Description ]

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (25 preceding siblings ...)
  2019-10-03 12:28 ` Rafael David Tinoco
@ 2019-10-03 12:29 ` Rafael David Tinoco
  2019-10-03 12:29 ` Rafael David Tinoco
                   ` (67 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Rafael David Tinoco @ 2019-10-03 12:29 UTC (permalink / raw)
  To: qemu-devel

On Wed, 2019-10-02 at 15:20 +0200, Paolo Bonzini wrote:
> On 02/10/19 13:05, Jan Glauber wrote:
>> The arm64 code generated for the
>> atomic_[add|sub] accesses of ctx->notify_me doesn't contain any
>> memory barriers. It is just plain ldaxr/stlxr.
>>
>> From my understanding this is not sufficient for SMP sync.
>>
>>>> If I read this comment correct:
>>>>
>>>>     void aio_notify(AioContext *ctx)
>>>>     {
>>>>         /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
>>>>          * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
>>>>          */
>>>>         smp_mb();
>>>>         if (ctx->notify_me) {
>>>>
>>>> it points out that the smp_mb() should be paired. But as
>>>> I said the used atomics don't generate any barriers at all.
>>>
>>> Awesome!  That would be a compiler bug though, as atomic_add and atomic_sub
>>> are defined as sequentially consistent:
>>>
>>> #define atomic_add(ptr, n) ((void) __atomic_fetch_add(ptr, n, __ATOMIC_SEQ_CST))
>>> #define atomic_sub(ptr, n) ((void) __atomic_fetch_sub(ptr, n, __ATOMIC_SEQ_CST))
>>
>> Compiler bug sounds kind of unlikely...
>
> Indeed the assembly produced by the compiler matches for example the
> mappings at https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html.  A
> small testcase is as follows:
>
>   int ctx_notify_me;
>   int bh_scheduled;
>
>   int x()
>   {
>       int one = 1;
>       int ret;
>       __atomic_store(&bh_scheduled, &one, __ATOMIC_RELEASE);     // x1
>       __atomic_thread_fence(__ATOMIC_SEQ_CST);                   // x2
>       __atomic_load(&ctx_notify_me, &ret, __ATOMIC_RELAXED);     // x3
>       return ret;
>   }
>
>   int y()
>   {
>       int ret;
>       __atomic_fetch_add(&ctx_notify_me, 2, __ATOMIC_SEQ_CST);  // y1
>       __atomic_load(&bh_scheduled, &ret, __ATOMIC_RELAXED);     // y2
>       return ret;
>   }
>
> Here y (which is aio_poll) wants to order the write to ctx->notify_me
> before reads of bh->scheduled.  However, the processor can speculate the
> load of bh->scheduled between the load-acquire and store-release of
> ctx->notify_me.  So you can have something like:
>
>  thread 0 (y)                          thread 1 (x)
>  -----------------------------------   -----------------------------
>  y1: load-acq ctx->notify_me
>  y2: load-rlx bh->scheduled
>                                        x1: store-rel bh->scheduled <-- 1
>                                        x2: memory barrier
>                                        x3: load-rlx ctx->notify_me
>  y1: store-rel ctx->notify_me <-- 2
>
> Being very puzzled, I tried to put this into cppmem:
>
>   int main() {
>     atomic_int ctx_notify_me = 0;
>     atomic_int bh_scheduled = 0;
>     {{{ {
>           bh_scheduled.store(1, mo_release);
>           atomic_thread_fence(mo_seq_cst);
>           // must be zero since the bug report shows no notification
>           ctx_notify_me.load(mo_relaxed).readsvalue(0);
>         }
>     ||| {
>           ctx_notify_me.store(2, mo_seq_cst);
>           r2=bh_scheduled.load(mo_relaxed);
>         }
>     }}};
>     return 0;
>   }
>
> and much to my surprise, the tool said r2 *can* be 0.  Same if I put a
> CAS like
>
>         cas_strong_explicit(ctx_notify_me.readsvalue(0), 0, 2,
>                             mo_seq_cst, mo_seq_cst);
>
> which resembles the code in the test case a bit more.

This example looks like Dekker synchronization (if I get the intent
right).

Two possible implementations of this are either (1) with all memory
accesses having seq-cst MO, or (2) with relaxed-MO accesses and seq-cst
fences on between the store and load on both ends.  It's possible to mix
both, but that get's trickier I think.  I'd prefer the one with just
fences, just because it's easiest, conceptually.

> I then found a discussion about using the C11 memory model in Linux
> (https://gcc.gnu.org/ml/gcc/2014-02/msg00058.html) which contains the
> following statement, which is a bit disheartening even though it is
> about a different test:
>
>    My first gut feeling was that the assertion should never fire, but
>    that was wrong because (as I seem to usually forget) the seq-cst
>    total order is just a constraint but doesn't itself contribute
>    to synchronizes-with -- but this is different for seq-cst fences.

It works if you use (1) or (2) consistently.  cppmem and the Batty et al.
tech report should give you the gory details.
My comment is just about seq-cst working differently on memory accesses vs.
fences (in the way it's specified in the memory model).

> and later in the thread:
>
>    Use of C11 atomics to implement Linux kernel atomic operations
>    requires knowledge of the underlying architecture and the compiler's
>    implementation, as was noted earlier in this thread.
>
> Indeed if I add an atomic_thread_fence I get only one valid execution,
> where r2 must be 1.  This is similar to GCC's bug
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697, and we can fix it in
> QEMU by using __sync_fetch_and_add; in fact cppmem also shows one valid
> execution if the store is replaced with something like GCC's assembly
> for __sync_fetch_and_add (or Linux's assembly for atomic_add_return):
>
>         cas_strong_explicit(ctx_notify_me.readsvalue(0), 0, 2,
>                             mo_release, mo_release);
>         atomic_thread_fence(mo_seq_cst);
>
> So we should:
>
> 1) understand why ATOMIC_SEQ_CST is not enough in this case.  QEMU code
> seems to be making the same assumptions as Linux about the memory model,
> and this is wrong because QEMU uses C11 atomics if available.
> Fortunately, this kind of synchronization in QEMU is relatively rare and
> only this particular bit seems affected.  If there is a fix which stays
> within the C11 memory model, and does not pessimize code on x86, we can
> use it[1] and document the pitfall.

Using the fences between the store/load pairs in Dekker-like
synchronization should do that, right?  It's also relatively easy to deal
with.

> 2) if there's no way to fix the bug, qemu/atomic.h needs to switch to
> __sync_fetch_and_add and friends.  And again, in this case the
> difference between the C11 and Linux/QEMU memory models must be documented.

I surely not aware of all the constraints here, but I'd be surprised if the
C11 memory model isn't good enough for portable synchronization code (with
the exception of the consume MO minefield, perhaps).

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in QEMU:
  In Progress
Status in qemu package in Ubuntu:
  In Progress
Status in qemu source package in Bionic:
  New
Status in qemu source package in Disco:
  New
Status in qemu source package in Eoan:
  In Progress
Status in qemu source package in FF-Series:
  New

Bug description:
  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  [ Original Description ]

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (26 preceding siblings ...)
  2019-10-03 12:29 ` Rafael David Tinoco
@ 2019-10-03 12:29 ` Rafael David Tinoco
  2019-10-03 21:35 ` dann frazier
                   ` (66 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Rafael David Tinoco @ 2019-10-03 12:29 UTC (permalink / raw)
  To: qemu-devel

On 02/10/19 16:58, Torvald Riegel wrote:
> This example looks like Dekker synchronization (if I get the intent right).

It is the same pattern.  However, one of the two synchronized variables
is a counter rather than just a flag.

> Two possible implementations of this are either (1) with all memory
> accesses having seq-cst MO, or (2) with relaxed-MO accesses and seq-cst
> fences on between the store and load on both ends.  It's possible to mix
> both, but that get's trickier I think.  I'd prefer the one with just
> fences, just because it's easiest, conceptually.

Got it.

I'd also prefer the one with just fences, because we only really control
one side of the synchronization primitive (ctx_notify_me in my litmus
test) and I don't like the idea of forcing seq-cst MO on the other side
(bh_scheduled).  The performance issue that I mentioned is that x86
doesn't have relaxed fetch and add, so you'd have a redundant fence like
this:

	lock	xaddl $2, mem1
	mfence
	...
	movl	mem1, %r8

(Gory QEMU details however allow us to use relaxed load and store here,
because there's only one writer).

> It works if you use (1) or (2) consistently.  cppmem and the Batty et al.
> tech report should give you the gory details.
>
>> 1) understand why ATOMIC_SEQ_CST is not enough in this case.  QEMU code
>> seems to be making the same assumptions as Linux about the memory model,
>> and this is wrong because QEMU uses C11 atomics if available.
>> Fortunately, this kind of synchronization in QEMU is relatively rare and
>> only this particular bit seems affected.  If there is a fix which stays
>> within the C11 memory model, and does not pessimize code on x86, we can
>> use it[1] and document the pitfall.
>
> Using the fences between the store/load pairs in Dekker-like
> synchronization should do that, right?  It's also relatively easy to deal
> with.
>
>> 2) if there's no way to fix the bug, qemu/atomic.h needs to switch to
>> __sync_fetch_and_add and friends.  And again, in this case the
>> difference between the C11 and Linux/QEMU memory models must be documented.
>
> I surely not aware of all the constraints here, but I'd be surprised if the
> C11 memory model isn't good enough for portable synchronization code (with
> the exception of the consume MO minefield, perhaps). 

This helps a lot already; I'll work on a documentation and code patch.
Thanks very much.

Paolo

>>   int main() {
>>     atomic_int ctx_notify_me = 0;
>>     atomic_int bh_scheduled = 0;
>>     {{{ {
>>           bh_scheduled.store(1, mo_release);
>>           atomic_thread_fence(mo_seq_cst);
>>           // must be zero since the bug report shows no notification
>>           ctx_notify_me.load(mo_relaxed).readsvalue(0);
>>         }
>>     ||| {
>>           ctx_notify_me.store(2, mo_seq_cst);
>>           r2=bh_scheduled.load(mo_relaxed);
>>         }
>>     }}};
>>     return 0;
>>   }



** Changed in: qemu (Ubuntu Disco)
   Importance: Undecided => Medium

** Changed in: qemu (Ubuntu Bionic)
   Importance: Undecided => Medium

** Changed in: qemu (Ubuntu Ff-series)
   Importance: Undecided => Medium

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in QEMU:
  In Progress
Status in qemu package in Ubuntu:
  In Progress
Status in qemu source package in Bionic:
  New
Status in qemu source package in Disco:
  New
Status in qemu source package in Eoan:
  In Progress
Status in qemu source package in FF-Series:
  New

Bug description:
  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  [ Original Description ]

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (27 preceding siblings ...)
  2019-10-03 12:29 ` Rafael David Tinoco
@ 2019-10-03 21:35 ` dann frazier
  2019-12-13 14:24 ` dann frazier
                   ` (65 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: dann frazier @ 2019-10-03 21:35 UTC (permalink / raw)
  To: qemu-devel

** Also affects: kunpeng920
   Importance: Undecided
       Status: New

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  New
Status in QEMU:
  In Progress
Status in qemu package in Ubuntu:
  In Progress
Status in qemu source package in Bionic:
  New
Status in qemu source package in Disco:
  New
Status in qemu source package in Eoan:
  In Progress
Status in qemu source package in FF-Series:
  New

Bug description:
  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  [ Original Description ]

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues
  2019-10-02  9:23         ` Jan Glauber
  2019-10-02  9:23           ` Jan Glauber
  2019-10-02  9:45           ` Paolo Bonzini
@ 2019-10-07 11:06           ` Paolo Bonzini
  2019-10-07 14:36             ` Jan Glauber
  2019-10-07 14:44             ` dann frazier
  2 siblings, 2 replies; 141+ messages in thread
From: Paolo Bonzini @ 2019-10-07 11:06 UTC (permalink / raw)
  To: Jan Glauber, dann frazier
  Cc: QEMU Developers, Rafael David Tinoco, QEMU Developers - ARM,
	lizhengui, Bug 1805256

On 02/10/19 11:23, Jan Glauber wrote:
> I've looked into this on ThunderX2. The arm64 code generated for the
> atomic_[add|sub] accesses of ctx->notify_me doesn't contain any
> memory barriers. It is just plain ldaxr/stlxr.
> 
> From my understanding this is not sufficient for SMP sync.
> 
> If I read this comment correct:
> 
>     void aio_notify(AioContext *ctx)
>     {
>         /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
>          * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
>          */
>         smp_mb();
>         if (ctx->notify_me) {
> 
> it points out that the smp_mb() should be paired. But as
> I said the used atomics don't generate any barriers at all.

Based on the rest of the thread, this patch should also fix the bug:

diff --git a/util/async.c b/util/async.c
index 47dcbfa..721ea53 100644
--- a/util/async.c
+++ b/util/async.c
@@ -249,7 +249,7 @@ aio_ctx_check(GSource *source)
     aio_notify_accept(ctx);
 
     for (bh = ctx->first_bh; bh; bh = bh->next) {
-        if (bh->scheduled) {
+        if (atomic_mb_read(&bh->scheduled)) {
             return true;
         }
     }


And also the memory barrier in aio_notify can actually be replaced
with a SEQ_CST load:

diff --git a/util/async.c b/util/async.c
index 47dcbfa..721ea53 100644
--- a/util/async.c
+++ b/util/async.c
@@ -349,11 +349,11 @@ LinuxAioState *aio_get_linux_aio(AioContext *ctx)
 
 void aio_notify(AioContext *ctx)
 {
-    /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
-     * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
+    /* Using atomic_mb_read ensures that e.g. bh->scheduled is written before
+     * ctx->notify_me is read.  Pairs with atomic_or in aio_ctx_prepare or
+     * atomic_add in aio_poll.
      */
-    smp_mb();
-    if (ctx->notify_me) {
+    if (atomic_mb_read(&ctx->notify_me)) {
         event_notifier_set(&ctx->notifier);
         atomic_mb_set(&ctx->notified, true);
     }


Would you be able to test these (one by one possibly)?

> I've tried to verify me theory with this patch and didn't run into the
> issue for ~500 iterations (usually I would trigger the issue ~20 iterations).

Sorry for asking the obvious---500 iterations of what?

Paolo



^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues
  2019-10-07 11:06           ` [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues Paolo Bonzini
@ 2019-10-07 14:36             ` Jan Glauber
  2019-10-07 14:36               ` [Bug 1805256] " Jan Glauber
  2019-10-07 14:44             ` dann frazier
  1 sibling, 1 reply; 141+ messages in thread
From: Jan Glauber @ 2019-10-07 14:36 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Rafael David Tinoco, lizhengui, dann frazier, QEMU Developers,
	Bug 1805256, QEMU Developers - ARM

On Mon, Oct 07, 2019 at 01:06:20PM +0200, Paolo Bonzini wrote:
> On 02/10/19 11:23, Jan Glauber wrote:
> > I've looked into this on ThunderX2. The arm64 code generated for the
> > atomic_[add|sub] accesses of ctx->notify_me doesn't contain any
> > memory barriers. It is just plain ldaxr/stlxr.
> > 
> > From my understanding this is not sufficient for SMP sync.
> > 
> > If I read this comment correct:
> > 
> >     void aio_notify(AioContext *ctx)
> >     {
> >         /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
> >          * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
> >          */
> >         smp_mb();
> >         if (ctx->notify_me) {
> > 
> > it points out that the smp_mb() should be paired. But as
> > I said the used atomics don't generate any barriers at all.
> 
> Based on the rest of the thread, this patch should also fix the bug:
> 
> diff --git a/util/async.c b/util/async.c
> index 47dcbfa..721ea53 100644
> --- a/util/async.c
> +++ b/util/async.c
> @@ -249,7 +249,7 @@ aio_ctx_check(GSource *source)
>      aio_notify_accept(ctx);
>  
>      for (bh = ctx->first_bh; bh; bh = bh->next) {
> -        if (bh->scheduled) {
> +        if (atomic_mb_read(&bh->scheduled)) {
>              return true;
>          }
>      }
> 
> 
> And also the memory barrier in aio_notify can actually be replaced
> with a SEQ_CST load:
> 
> diff --git a/util/async.c b/util/async.c
> index 47dcbfa..721ea53 100644
> --- a/util/async.c
> +++ b/util/async.c
> @@ -349,11 +349,11 @@ LinuxAioState *aio_get_linux_aio(AioContext *ctx)
>  
>  void aio_notify(AioContext *ctx)
>  {
> -    /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
> -     * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
> +    /* Using atomic_mb_read ensures that e.g. bh->scheduled is written before
> +     * ctx->notify_me is read.  Pairs with atomic_or in aio_ctx_prepare or
> +     * atomic_add in aio_poll.
>       */
> -    smp_mb();
> -    if (ctx->notify_me) {
> +    if (atomic_mb_read(&ctx->notify_me)) {
>          event_notifier_set(&ctx->notifier);
>          atomic_mb_set(&ctx->notified, true);
>      }
> 
> 
> Would you be able to test these (one by one possibly)?

Sure.

> > I've tried to verify me theory with this patch and didn't run into the
> > issue for ~500 iterations (usually I would trigger the issue ~20 iterations).
> 
> Sorry for asking the obvious---500 iterations of what?

The testcase mentioned in the Canonical issue:
https://bugs.launchpad.net/qemu/+bug/1805256

It's a simple image convert:
qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

Usually it got stuck after 3-20 iterations.

--Jan


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues
  2019-10-07 14:36             ` Jan Glauber
@ 2019-10-07 14:36               ` Jan Glauber
  0 siblings, 0 replies; 141+ messages in thread
From: Jan Glauber @ 2019-10-07 14:36 UTC (permalink / raw)
  To: qemu-devel

On Mon, Oct 07, 2019 at 01:06:20PM +0200, Paolo Bonzini wrote:
> On 02/10/19 11:23, Jan Glauber wrote:
> > I've looked into this on ThunderX2. The arm64 code generated for the
> > atomic_[add|sub] accesses of ctx->notify_me doesn't contain any
> > memory barriers. It is just plain ldaxr/stlxr.
> > 
> > From my understanding this is not sufficient for SMP sync.
> > 
> > If I read this comment correct:
> > 
> >     void aio_notify(AioContext *ctx)
> >     {
> >         /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
> >          * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
> >          */
> >         smp_mb();
> >         if (ctx->notify_me) {
> > 
> > it points out that the smp_mb() should be paired. But as
> > I said the used atomics don't generate any barriers at all.
> 
> Based on the rest of the thread, this patch should also fix the bug:
> 
> diff --git a/util/async.c b/util/async.c
> index 47dcbfa..721ea53 100644
> --- a/util/async.c
> +++ b/util/async.c
> @@ -249,7 +249,7 @@ aio_ctx_check(GSource *source)
>      aio_notify_accept(ctx);
>  
>      for (bh = ctx->first_bh; bh; bh = bh->next) {
> -        if (bh->scheduled) {
> +        if (atomic_mb_read(&bh->scheduled)) {
>              return true;
>          }
>      }
> 
> 
> And also the memory barrier in aio_notify can actually be replaced
> with a SEQ_CST load:
> 
> diff --git a/util/async.c b/util/async.c
> index 47dcbfa..721ea53 100644
> --- a/util/async.c
> +++ b/util/async.c
> @@ -349,11 +349,11 @@ LinuxAioState *aio_get_linux_aio(AioContext *ctx)
>  
>  void aio_notify(AioContext *ctx)
>  {
> -    /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
> -     * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
> +    /* Using atomic_mb_read ensures that e.g. bh->scheduled is written before
> +     * ctx->notify_me is read.  Pairs with atomic_or in aio_ctx_prepare or
> +     * atomic_add in aio_poll.
>       */
> -    smp_mb();
> -    if (ctx->notify_me) {
> +    if (atomic_mb_read(&ctx->notify_me)) {
>          event_notifier_set(&ctx->notifier);
>          atomic_mb_set(&ctx->notified, true);
>      }
> 
> 
> Would you be able to test these (one by one possibly)?

Sure.

> > I've tried to verify me theory with this patch and didn't run into the
> > issue for ~500 iterations (usually I would trigger the issue ~20 iterations).
> 
> Sorry for asking the obvious---500 iterations of what?

The testcase mentioned in the Canonical issue:
https://bugs.launchpad.net/qemu/+bug/1805256

It's a simple image convert:
qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

Usually it got stuck after 3-20 iterations.

--Jan

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  New
Status in QEMU:
  In Progress
Status in qemu package in Ubuntu:
  In Progress
Status in qemu source package in Bionic:
  New
Status in qemu source package in Disco:
  New
Status in qemu source package in Eoan:
  In Progress
Status in qemu source package in FF-Series:
  New

Bug description:
  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  [ Original Description ]

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues
  2019-10-07 11:06           ` [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues Paolo Bonzini
  2019-10-07 14:36             ` Jan Glauber
@ 2019-10-07 14:44             ` dann frazier
  2019-10-07 14:44               ` [Bug 1805256] " dann frazier
  2019-10-07 14:58               ` Paolo Bonzini
  1 sibling, 2 replies; 141+ messages in thread
From: dann frazier @ 2019-10-07 14:44 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Rafael David Tinoco, lizhengui, Jan Glauber, QEMU Developers,
	Bug 1805256, QEMU Developers - ARM

On Mon, Oct 07, 2019 at 01:06:20PM +0200, Paolo Bonzini wrote:
> On 02/10/19 11:23, Jan Glauber wrote:
> > I've looked into this on ThunderX2. The arm64 code generated for the
> > atomic_[add|sub] accesses of ctx->notify_me doesn't contain any
> > memory barriers. It is just plain ldaxr/stlxr.
> > 
> > From my understanding this is not sufficient for SMP sync.
> > 
> > If I read this comment correct:
> > 
> >     void aio_notify(AioContext *ctx)
> >     {
> >         /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
> >          * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
> >          */
> >         smp_mb();
> >         if (ctx->notify_me) {
> > 
> > it points out that the smp_mb() should be paired. But as
> > I said the used atomics don't generate any barriers at all.
> 
> Based on the rest of the thread, this patch should also fix the bug:
> 
> diff --git a/util/async.c b/util/async.c
> index 47dcbfa..721ea53 100644
> --- a/util/async.c
> +++ b/util/async.c
> @@ -249,7 +249,7 @@ aio_ctx_check(GSource *source)
>      aio_notify_accept(ctx);
>  
>      for (bh = ctx->first_bh; bh; bh = bh->next) {
> -        if (bh->scheduled) {
> +        if (atomic_mb_read(&bh->scheduled)) {
>              return true;
>          }
>      }
> 
> 
> And also the memory barrier in aio_notify can actually be replaced
> with a SEQ_CST load:
> 
> diff --git a/util/async.c b/util/async.c
> index 47dcbfa..721ea53 100644
> --- a/util/async.c
> +++ b/util/async.c
> @@ -349,11 +349,11 @@ LinuxAioState *aio_get_linux_aio(AioContext *ctx)
>  
>  void aio_notify(AioContext *ctx)
>  {
> -    /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
> -     * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
> +    /* Using atomic_mb_read ensures that e.g. bh->scheduled is written before
> +     * ctx->notify_me is read.  Pairs with atomic_or in aio_ctx_prepare or
> +     * atomic_add in aio_poll.
>       */
> -    smp_mb();
> -    if (ctx->notify_me) {
> +    if (atomic_mb_read(&ctx->notify_me)) {
>          event_notifier_set(&ctx->notifier);
>          atomic_mb_set(&ctx->notified, true);
>      }
> 
> 
> Would you be able to test these (one by one possibly)?

Paolo,
  I tried them both separately and together on a Hi1620 system, each
time it hung in the first iteration. Here's a backtrace of a run with
both patches applied:

(gdb) thread apply all bt

Thread 3 (Thread 0xffff8154b820 (LWP 63900)):
#0  0x0000ffff8b9402cc in __GI___sigtimedwait (set=<optimized out>, set@entry=0xaaaaf1e08070, 
    info=info@entry=0xffff8154ad98, timeout=timeout@entry=0x0) at ../sysdeps/unix/sysv/linux/sigtimedwait.c:42
#1  0x0000ffff8ba77fac in __sigwait (set=set@entry=0xaaaaf1e08070, sig=sig@entry=0xffff8154ae74)
    at ../sysdeps/unix/sysv/linux/sigwait.c:28
#2  0x0000aaaab7dc1610 in sigwait_compat (opaque=0xaaaaf1e08070) at util/compatfd.c:35
#3  0x0000aaaab7dc3e80 in qemu_thread_start (args=<optimized out>) at util/qemu-thread-posix.c:519
#4  0x0000ffff8ba6d088 in start_thread (arg=0xffffceefbf4f) at pthread_create.c:463
#5  0x0000ffff8b9dd4ec in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

Thread 2 (Thread 0xffff81d4c820 (LWP 63899)):
#0  syscall () at ../sysdeps/unix/sysv/linux/aarch64/syscall.S:38
#1  0x0000aaaab7dc4cd8 in qemu_futex_wait (val=<optimized out>, f=<optimized out>)
    at /home/ubuntu/qemu/include/qemu/futex.h:29
#2  qemu_event_wait (ev=ev@entry=0xaaaab7e48708 <rcu_call_ready_event>) at util/qemu-thread-posix.c:459
#3  0x0000aaaab7ddf44c in call_rcu_thread (opaque=<optimized out>) at util/rcu.c:260
#4  0x0000aaaab7dc3e80 in qemu_thread_start (args=<optimized out>) at util/qemu-thread-posix.c:519
#5  0x0000ffff8ba6d088 in start_thread (arg=0xffffceefc05f) at pthread_create.c:463
#6  0x0000ffff8b9dd4ec in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

Thread 1 (Thread 0xffff81e83010 (LWP 63898)):
#0  0x0000ffff8b9d4154 in __GI_ppoll (fds=0xaaaaf1e0dbc0, nfds=187650205809964, timeout=<optimized out>, 
    timeout@entry=0x0, sigmask=0xffffceefbef0) at ../sysdeps/unix/sysv/linux/ppoll.c:39
#1  0x0000aaaab7dbedb0 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>)
    at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
#2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=-1) at util/qemu-timer.c:340
#3  0x0000aaaab7dbfd2c in os_host_main_loop_wait (timeout=-1) at util/main-loop.c:236
#4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:517
#5  0x0000aaaab7ce86e8 in convert_do_copy (s=0xffffceefc068) at qemu-img.c:2028
#6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2520
#7  0x0000aaaab7ce1e54 in main (argc=8, argv=<optimized out>) at qemu-img.c:5097

> > I've tried to verify me theory with this patch and didn't run into the
> > issue for ~500 iterations (usually I would trigger the issue ~20 iterations).
> 
> Sorry for asking the obvious---500 iterations of what?

$ for i in $(seq 1 500); do echo "==$i=="; ./qemu/qemu-img convert -p -f qcow2 -O qcow2 bionic-server-cloudimg-arm64.img out.img; done
==1==
    (37.19/100%)

  -dann


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues
  2019-10-07 14:44             ` dann frazier
@ 2019-10-07 14:44               ` dann frazier
  2019-10-07 14:58               ` Paolo Bonzini
  1 sibling, 0 replies; 141+ messages in thread
From: dann frazier @ 2019-10-07 14:44 UTC (permalink / raw)
  To: qemu-devel

On Mon, Oct 07, 2019 at 01:06:20PM +0200, Paolo Bonzini wrote:
> On 02/10/19 11:23, Jan Glauber wrote:
> > I've looked into this on ThunderX2. The arm64 code generated for the
> > atomic_[add|sub] accesses of ctx->notify_me doesn't contain any
> > memory barriers. It is just plain ldaxr/stlxr.
> > 
> > From my understanding this is not sufficient for SMP sync.
> > 
> > If I read this comment correct:
> > 
> >     void aio_notify(AioContext *ctx)
> >     {
> >         /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
> >          * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
> >          */
> >         smp_mb();
> >         if (ctx->notify_me) {
> > 
> > it points out that the smp_mb() should be paired. But as
> > I said the used atomics don't generate any barriers at all.
> 
> Based on the rest of the thread, this patch should also fix the bug:
> 
> diff --git a/util/async.c b/util/async.c
> index 47dcbfa..721ea53 100644
> --- a/util/async.c
> +++ b/util/async.c
> @@ -249,7 +249,7 @@ aio_ctx_check(GSource *source)
>      aio_notify_accept(ctx);
>  
>      for (bh = ctx->first_bh; bh; bh = bh->next) {
> -        if (bh->scheduled) {
> +        if (atomic_mb_read(&bh->scheduled)) {
>              return true;
>          }
>      }
> 
> 
> And also the memory barrier in aio_notify can actually be replaced
> with a SEQ_CST load:
> 
> diff --git a/util/async.c b/util/async.c
> index 47dcbfa..721ea53 100644
> --- a/util/async.c
> +++ b/util/async.c
> @@ -349,11 +349,11 @@ LinuxAioState *aio_get_linux_aio(AioContext *ctx)
>  
>  void aio_notify(AioContext *ctx)
>  {
> -    /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
> -     * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
> +    /* Using atomic_mb_read ensures that e.g. bh->scheduled is written before
> +     * ctx->notify_me is read.  Pairs with atomic_or in aio_ctx_prepare or
> +     * atomic_add in aio_poll.
>       */
> -    smp_mb();
> -    if (ctx->notify_me) {
> +    if (atomic_mb_read(&ctx->notify_me)) {
>          event_notifier_set(&ctx->notifier);
>          atomic_mb_set(&ctx->notified, true);
>      }
> 
> 
> Would you be able to test these (one by one possibly)?

Paolo,
  I tried them both separately and together on a Hi1620 system, each
time it hung in the first iteration. Here's a backtrace of a run with
both patches applied:

(gdb) thread apply all bt

Thread 3 (Thread 0xffff8154b820 (LWP 63900)):
#0  0x0000ffff8b9402cc in __GI___sigtimedwait (set=<optimized out>, set@entry=0xaaaaf1e08070, 
    info=info@entry=0xffff8154ad98, timeout=timeout@entry=0x0) at ../sysdeps/unix/sysv/linux/sigtimedwait.c:42
#1  0x0000ffff8ba77fac in __sigwait (set=set@entry=0xaaaaf1e08070, sig=sig@entry=0xffff8154ae74)
    at ../sysdeps/unix/sysv/linux/sigwait.c:28
#2  0x0000aaaab7dc1610 in sigwait_compat (opaque=0xaaaaf1e08070) at util/compatfd.c:35
#3  0x0000aaaab7dc3e80 in qemu_thread_start (args=<optimized out>) at util/qemu-thread-posix.c:519
#4  0x0000ffff8ba6d088 in start_thread (arg=0xffffceefbf4f) at pthread_create.c:463
#5  0x0000ffff8b9dd4ec in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

Thread 2 (Thread 0xffff81d4c820 (LWP 63899)):
#0  syscall () at ../sysdeps/unix/sysv/linux/aarch64/syscall.S:38
#1  0x0000aaaab7dc4cd8 in qemu_futex_wait (val=<optimized out>, f=<optimized out>)
    at /home/ubuntu/qemu/include/qemu/futex.h:29
#2  qemu_event_wait (ev=ev@entry=0xaaaab7e48708 <rcu_call_ready_event>) at util/qemu-thread-posix.c:459
#3  0x0000aaaab7ddf44c in call_rcu_thread (opaque=<optimized out>) at util/rcu.c:260
#4  0x0000aaaab7dc3e80 in qemu_thread_start (args=<optimized out>) at util/qemu-thread-posix.c:519
#5  0x0000ffff8ba6d088 in start_thread (arg=0xffffceefc05f) at pthread_create.c:463
#6  0x0000ffff8b9dd4ec in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

Thread 1 (Thread 0xffff81e83010 (LWP 63898)):
#0  0x0000ffff8b9d4154 in __GI_ppoll (fds=0xaaaaf1e0dbc0, nfds=187650205809964, timeout=<optimized out>, 
    timeout@entry=0x0, sigmask=0xffffceefbef0) at ../sysdeps/unix/sysv/linux/ppoll.c:39
#1  0x0000aaaab7dbedb0 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>)
    at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
#2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=-1) at util/qemu-timer.c:340
#3  0x0000aaaab7dbfd2c in os_host_main_loop_wait (timeout=-1) at util/main-loop.c:236
#4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:517
#5  0x0000aaaab7ce86e8 in convert_do_copy (s=0xffffceefc068) at qemu-img.c:2028
#6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2520
#7  0x0000aaaab7ce1e54 in main (argc=8, argv=<optimized out>) at qemu-img.c:5097

> > I've tried to verify me theory with this patch and didn't run into the
> > issue for ~500 iterations (usually I would trigger the issue ~20 iterations).
> 
> Sorry for asking the obvious---500 iterations of what?

$ for i in $(seq 1 500); do echo "==$i=="; ./qemu/qemu-img convert -p -f qcow2 -O qcow2 bionic-server-cloudimg-arm64.img out.img; done
==1==
    (37.19/100%)

  -dann

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  New
Status in QEMU:
  In Progress
Status in qemu package in Ubuntu:
  In Progress
Status in qemu source package in Bionic:
  New
Status in qemu source package in Disco:
  New
Status in qemu source package in Eoan:
  In Progress
Status in qemu source package in FF-Series:
  New

Bug description:
  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  [ Original Description ]

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues
  2019-10-07 14:44             ` dann frazier
  2019-10-07 14:44               ` [Bug 1805256] " dann frazier
@ 2019-10-07 14:58               ` Paolo Bonzini
  2019-10-09  8:02                 ` Jan Glauber
  1 sibling, 1 reply; 141+ messages in thread
From: Paolo Bonzini @ 2019-10-07 14:58 UTC (permalink / raw)
  To: dann frazier
  Cc: Rafael David Tinoco, lizhengui, Jan Glauber, QEMU Developers,
	Bug 1805256, QEMU Developers - ARM

On 07/10/19 16:44, dann frazier wrote:
> On Mon, Oct 07, 2019 at 01:06:20PM +0200, Paolo Bonzini wrote:
>> On 02/10/19 11:23, Jan Glauber wrote:
>>> I've looked into this on ThunderX2. The arm64 code generated for the
>>> atomic_[add|sub] accesses of ctx->notify_me doesn't contain any
>>> memory barriers. It is just plain ldaxr/stlxr.
>>>
>>> From my understanding this is not sufficient for SMP sync.
>>>
>>> If I read this comment correct:
>>>
>>>     void aio_notify(AioContext *ctx)
>>>     {
>>>         /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
>>>          * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
>>>          */
>>>         smp_mb();
>>>         if (ctx->notify_me) {
>>>
>>> it points out that the smp_mb() should be paired. But as
>>> I said the used atomics don't generate any barriers at all.
>>
>> Based on the rest of the thread, this patch should also fix the bug:
>>
>> diff --git a/util/async.c b/util/async.c
>> index 47dcbfa..721ea53 100644
>> --- a/util/async.c
>> +++ b/util/async.c
>> @@ -249,7 +249,7 @@ aio_ctx_check(GSource *source)
>>      aio_notify_accept(ctx);
>>  
>>      for (bh = ctx->first_bh; bh; bh = bh->next) {
>> -        if (bh->scheduled) {
>> +        if (atomic_mb_read(&bh->scheduled)) {
>>              return true;
>>          }
>>      }
>>
>>
>> And also the memory barrier in aio_notify can actually be replaced
>> with a SEQ_CST load:
>>
>> diff --git a/util/async.c b/util/async.c
>> index 47dcbfa..721ea53 100644
>> --- a/util/async.c
>> +++ b/util/async.c
>> @@ -349,11 +349,11 @@ LinuxAioState *aio_get_linux_aio(AioContext *ctx)
>>  
>>  void aio_notify(AioContext *ctx)
>>  {
>> -    /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
>> -     * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
>> +    /* Using atomic_mb_read ensures that e.g. bh->scheduled is written before
>> +     * ctx->notify_me is read.  Pairs with atomic_or in aio_ctx_prepare or
>> +     * atomic_add in aio_poll.
>>       */
>> -    smp_mb();
>> -    if (ctx->notify_me) {
>> +    if (atomic_mb_read(&ctx->notify_me)) {
>>          event_notifier_set(&ctx->notifier);
>>          atomic_mb_set(&ctx->notified, true);
>>      }
>>
>>
>> Would you be able to test these (one by one possibly)?
> 
> Paolo,
>   I tried them both separately and together on a Hi1620 system, each
> time it hung in the first iteration. Here's a backtrace of a run with
> both patches applied:

Ok, not a great start...  I'll find myself an aarch64 machine and look
at it more closely.  I'd like the patch to be something we can
understand and document, since this is probably the second most-used
memory barrier idiom in QEMU.

Paolo



^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues
  2019-10-07 14:58               ` Paolo Bonzini
@ 2019-10-09  8:02                 ` Jan Glauber
  2019-10-09  8:02                   ` [Bug 1805256] " Jan Glauber
  2019-10-09  9:15                   ` Paolo Bonzini
  0 siblings, 2 replies; 141+ messages in thread
From: Jan Glauber @ 2019-10-09  8:02 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Rafael David Tinoco, lizhengui, dann frazier, QEMU Developers,
	Bug 1805256, QEMU Developers - ARM

On Mon, Oct 07, 2019 at 04:58:30PM +0200, Paolo Bonzini wrote:
> On 07/10/19 16:44, dann frazier wrote:
> > On Mon, Oct 07, 2019 at 01:06:20PM +0200, Paolo Bonzini wrote:
> >> On 02/10/19 11:23, Jan Glauber wrote:
> >>> I've looked into this on ThunderX2. The arm64 code generated for the
> >>> atomic_[add|sub] accesses of ctx->notify_me doesn't contain any
> >>> memory barriers. It is just plain ldaxr/stlxr.
> >>>
> >>> From my understanding this is not sufficient for SMP sync.
> >>>
> >>> If I read this comment correct:
> >>>
> >>>     void aio_notify(AioContext *ctx)
> >>>     {
> >>>         /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
> >>>          * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
> >>>          */
> >>>         smp_mb();
> >>>         if (ctx->notify_me) {
> >>>
> >>> it points out that the smp_mb() should be paired. But as
> >>> I said the used atomics don't generate any barriers at all.
> >>
> >> Based on the rest of the thread, this patch should also fix the bug:
> >>
> >> diff --git a/util/async.c b/util/async.c
> >> index 47dcbfa..721ea53 100644
> >> --- a/util/async.c
> >> +++ b/util/async.c
> >> @@ -249,7 +249,7 @@ aio_ctx_check(GSource *source)
> >>      aio_notify_accept(ctx);
> >>  
> >>      for (bh = ctx->first_bh; bh; bh = bh->next) {
> >> -        if (bh->scheduled) {
> >> +        if (atomic_mb_read(&bh->scheduled)) {
> >>              return true;
> >>          }
> >>      }
> >>
> >>
> >> And also the memory barrier in aio_notify can actually be replaced
> >> with a SEQ_CST load:
> >>
> >> diff --git a/util/async.c b/util/async.c
> >> index 47dcbfa..721ea53 100644
> >> --- a/util/async.c
> >> +++ b/util/async.c
> >> @@ -349,11 +349,11 @@ LinuxAioState *aio_get_linux_aio(AioContext *ctx)
> >>  
> >>  void aio_notify(AioContext *ctx)
> >>  {
> >> -    /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
> >> -     * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
> >> +    /* Using atomic_mb_read ensures that e.g. bh->scheduled is written before
> >> +     * ctx->notify_me is read.  Pairs with atomic_or in aio_ctx_prepare or
> >> +     * atomic_add in aio_poll.
> >>       */
> >> -    smp_mb();
> >> -    if (ctx->notify_me) {
> >> +    if (atomic_mb_read(&ctx->notify_me)) {
> >>          event_notifier_set(&ctx->notifier);
> >>          atomic_mb_set(&ctx->notified, true);
> >>      }
> >>
> >>
> >> Would you be able to test these (one by one possibly)?
> > 
> > Paolo,
> >   I tried them both separately and together on a Hi1620 system, each
> > time it hung in the first iteration. Here's a backtrace of a run with
> > both patches applied:
> 
> Ok, not a great start...  I'll find myself an aarch64 machine and look
> at it more closely.  I'd like the patch to be something we can
> understand and document, since this is probably the second most-used
> memory barrier idiom in QEMU.
> 
> Paolo

I'm still not sure what the actual issue is here, but could it be some bad
interaction between the notify_me and the list_lock? The are both 4 byte
and side-by-side:

address notify_me: 0xaaaadb528aa0  sizeof notify_me: 4
address list_lock: 0xaaaadb528aa4  sizeof list_lock: 4

AFAICS the generated code looks OK (all load/store exclusive done
with 32 bit size):

     e6c:       885ffc01        ldaxr   w1, [x0]
     e70:       11000821        add     w1, w1, #0x2
     e74:       8802fc01        stlxr   w2, w1, [x0]

...but if I bump notify_me size to uint64_t the issue goes away.

BTW, the image file I convert in the testcase is ~20 GB.

--Jan

diff --git a/include/block/aio.h b/include/block/aio.h
index a1d6b9e24939..e8a5ea3860bb 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -83,7 +83,7 @@ struct AioContext {
      * Instead, the aio_poll calls include both the prepare and the
      * dispatch phase, hence a simple counter is enough for them.
      */
-    uint32_t notify_me;
+    uint64_t notify_me;
 
     /* A lock to protect between QEMUBH and AioHandler adders and deleter,
      * and to ensure that no callbacks are removed while we're walking and


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues
  2019-10-09  8:02                 ` Jan Glauber
@ 2019-10-09  8:02                   ` Jan Glauber
  2019-10-09  9:15                   ` Paolo Bonzini
  1 sibling, 0 replies; 141+ messages in thread
From: Jan Glauber @ 2019-10-09  8:02 UTC (permalink / raw)
  To: qemu-devel

On Mon, Oct 07, 2019 at 04:58:30PM +0200, Paolo Bonzini wrote:
> On 07/10/19 16:44, dann frazier wrote:
> > On Mon, Oct 07, 2019 at 01:06:20PM +0200, Paolo Bonzini wrote:
> >> On 02/10/19 11:23, Jan Glauber wrote:
> >>> I've looked into this on ThunderX2. The arm64 code generated for the
> >>> atomic_[add|sub] accesses of ctx->notify_me doesn't contain any
> >>> memory barriers. It is just plain ldaxr/stlxr.
> >>>
> >>> From my understanding this is not sufficient for SMP sync.
> >>>
> >>> If I read this comment correct:
> >>>
> >>>     void aio_notify(AioContext *ctx)
> >>>     {
> >>>         /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
> >>>          * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
> >>>          */
> >>>         smp_mb();
> >>>         if (ctx->notify_me) {
> >>>
> >>> it points out that the smp_mb() should be paired. But as
> >>> I said the used atomics don't generate any barriers at all.
> >>
> >> Based on the rest of the thread, this patch should also fix the bug:
> >>
> >> diff --git a/util/async.c b/util/async.c
> >> index 47dcbfa..721ea53 100644
> >> --- a/util/async.c
> >> +++ b/util/async.c
> >> @@ -249,7 +249,7 @@ aio_ctx_check(GSource *source)
> >>      aio_notify_accept(ctx);
> >>  
> >>      for (bh = ctx->first_bh; bh; bh = bh->next) {
> >> -        if (bh->scheduled) {
> >> +        if (atomic_mb_read(&bh->scheduled)) {
> >>              return true;
> >>          }
> >>      }
> >>
> >>
> >> And also the memory barrier in aio_notify can actually be replaced
> >> with a SEQ_CST load:
> >>
> >> diff --git a/util/async.c b/util/async.c
> >> index 47dcbfa..721ea53 100644
> >> --- a/util/async.c
> >> +++ b/util/async.c
> >> @@ -349,11 +349,11 @@ LinuxAioState *aio_get_linux_aio(AioContext *ctx)
> >>  
> >>  void aio_notify(AioContext *ctx)
> >>  {
> >> -    /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
> >> -     * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
> >> +    /* Using atomic_mb_read ensures that e.g. bh->scheduled is written before
> >> +     * ctx->notify_me is read.  Pairs with atomic_or in aio_ctx_prepare or
> >> +     * atomic_add in aio_poll.
> >>       */
> >> -    smp_mb();
> >> -    if (ctx->notify_me) {
> >> +    if (atomic_mb_read(&ctx->notify_me)) {
> >>          event_notifier_set(&ctx->notifier);
> >>          atomic_mb_set(&ctx->notified, true);
> >>      }
> >>
> >>
> >> Would you be able to test these (one by one possibly)?
> > 
> > Paolo,
> >   I tried them both separately and together on a Hi1620 system, each
> > time it hung in the first iteration. Here's a backtrace of a run with
> > both patches applied:
> 
> Ok, not a great start...  I'll find myself an aarch64 machine and look
> at it more closely.  I'd like the patch to be something we can
> understand and document, since this is probably the second most-used
> memory barrier idiom in QEMU.
> 
> Paolo

I'm still not sure what the actual issue is here, but could it be some bad
interaction between the notify_me and the list_lock? The are both 4 byte
and side-by-side:

address notify_me: 0xaaaadb528aa0  sizeof notify_me: 4
address list_lock: 0xaaaadb528aa4  sizeof list_lock: 4

AFAICS the generated code looks OK (all load/store exclusive done
with 32 bit size):

     e6c:       885ffc01        ldaxr   w1, [x0]
     e70:       11000821        add     w1, w1, #0x2
     e74:       8802fc01        stlxr   w2, w1, [x0]

...but if I bump notify_me size to uint64_t the issue goes away.

BTW, the image file I convert in the testcase is ~20 GB.

--Jan

diff --git a/include/block/aio.h b/include/block/aio.h
index a1d6b9e24939..e8a5ea3860bb 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -83,7 +83,7 @@ struct AioContext {
      * Instead, the aio_poll calls include both the prepare and the
      * dispatch phase, hence a simple counter is enough for them.
      */
-    uint32_t notify_me;
+    uint64_t notify_me;
 
     /* A lock to protect between QEMUBH and AioHandler adders and deleter,
      * and to ensure that no callbacks are removed while we're walking and

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  New
Status in QEMU:
  In Progress
Status in qemu package in Ubuntu:
  In Progress
Status in qemu source package in Bionic:
  New
Status in qemu source package in Disco:
  New
Status in qemu source package in Eoan:
  In Progress
Status in qemu source package in FF-Series:
  New

Bug description:
  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  [ Original Description ]

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues
  2019-10-09  8:02                 ` Jan Glauber
  2019-10-09  8:02                   ` [Bug 1805256] " Jan Glauber
@ 2019-10-09  9:15                   ` Paolo Bonzini
  2019-10-11  6:05                     ` Jan Glauber
  1 sibling, 1 reply; 141+ messages in thread
From: Paolo Bonzini @ 2019-10-09  9:15 UTC (permalink / raw)
  To: Jan Glauber
  Cc: Rafael David Tinoco, lizhengui, dann frazier, QEMU Developers,
	Bug 1805256, QEMU Developers - ARM

On 09/10/19 10:02, Jan Glauber wrote:
> On Mon, Oct 07, 2019 at 04:58:30PM +0200, Paolo Bonzini wrote:
>> On 07/10/19 16:44, dann frazier wrote:
>>> On Mon, Oct 07, 2019 at 01:06:20PM +0200, Paolo Bonzini wrote:
>>>> On 02/10/19 11:23, Jan Glauber wrote:
>>>>> I've looked into this on ThunderX2. The arm64 code generated for the
>>>>> atomic_[add|sub] accesses of ctx->notify_me doesn't contain any
>>>>> memory barriers. It is just plain ldaxr/stlxr.
>>>>>
>>>>> From my understanding this is not sufficient for SMP sync.
>>>>>
>>>>> If I read this comment correct:
>>>>>
>>>>>     void aio_notify(AioContext *ctx)
>>>>>     {
>>>>>         /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
>>>>>          * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
>>>>>          */
>>>>>         smp_mb();
>>>>>         if (ctx->notify_me) {
>>>>>
>>>>> it points out that the smp_mb() should be paired. But as
>>>>> I said the used atomics don't generate any barriers at all.
>>>>
>>>> Based on the rest of the thread, this patch should also fix the bug:
>>>>
>>>> diff --git a/util/async.c b/util/async.c
>>>> index 47dcbfa..721ea53 100644
>>>> --- a/util/async.c
>>>> +++ b/util/async.c
>>>> @@ -249,7 +249,7 @@ aio_ctx_check(GSource *source)
>>>>      aio_notify_accept(ctx);
>>>>  
>>>>      for (bh = ctx->first_bh; bh; bh = bh->next) {
>>>> -        if (bh->scheduled) {
>>>> +        if (atomic_mb_read(&bh->scheduled)) {
>>>>              return true;
>>>>          }
>>>>      }
>>>>
>>>>
>>>> And also the memory barrier in aio_notify can actually be replaced
>>>> with a SEQ_CST load:
>>>>
>>>> diff --git a/util/async.c b/util/async.c
>>>> index 47dcbfa..721ea53 100644
>>>> --- a/util/async.c
>>>> +++ b/util/async.c
>>>> @@ -349,11 +349,11 @@ LinuxAioState *aio_get_linux_aio(AioContext *ctx)
>>>>  
>>>>  void aio_notify(AioContext *ctx)
>>>>  {
>>>> -    /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
>>>> -     * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
>>>> +    /* Using atomic_mb_read ensures that e.g. bh->scheduled is written before
>>>> +     * ctx->notify_me is read.  Pairs with atomic_or in aio_ctx_prepare or
>>>> +     * atomic_add in aio_poll.
>>>>       */
>>>> -    smp_mb();
>>>> -    if (ctx->notify_me) {
>>>> +    if (atomic_mb_read(&ctx->notify_me)) {
>>>>          event_notifier_set(&ctx->notifier);
>>>>          atomic_mb_set(&ctx->notified, true);
>>>>      }
>>>>
>>>>
>>>> Would you be able to test these (one by one possibly)?
>>>
>>> Paolo,
>>>   I tried them both separately and together on a Hi1620 system, each
>>> time it hung in the first iteration. Here's a backtrace of a run with
>>> both patches applied:
>>
>> Ok, not a great start...  I'll find myself an aarch64 machine and look
>> at it more closely.  I'd like the patch to be something we can
>> understand and document, since this is probably the second most-used
>> memory barrier idiom in QEMU.
>>
>> Paolo
> 
> I'm still not sure what the actual issue is here, but could it be some bad
> interaction between the notify_me and the list_lock? The are both 4 byte
> and side-by-side:
> 
> address notify_me: 0xaaaadb528aa0  sizeof notify_me: 4
> address list_lock: 0xaaaadb528aa4  sizeof list_lock: 4
> 
> AFAICS the generated code looks OK (all load/store exclusive done
> with 32 bit size):
> 
>      e6c:       885ffc01        ldaxr   w1, [x0]
>      e70:       11000821        add     w1, w1, #0x2
>      e74:       8802fc01        stlxr   w2, w1, [x0]
> 
> ...but if I bump notify_me size to uint64_t the issue goes away.

Ouch. :)  Is this with or without my patch(es)?

Also, what if you just add a dummy uint32_t after notify_me?

Thanks,

Paolo

> 
> BTW, the image file I convert in the testcase is ~20 GB.
> 
> --Jan
> 
> diff --git a/include/block/aio.h b/include/block/aio.h
> index a1d6b9e24939..e8a5ea3860bb 100644
> --- a/include/block/aio.h
> +++ b/include/block/aio.h
> @@ -83,7 +83,7 @@ struct AioContext {
>       * Instead, the aio_poll calls include both the prepare and the
>       * dispatch phase, hence a simple counter is enough for them.
>       */
> -    uint32_t notify_me;
> +    uint64_t notify_me;
>  
>      /* A lock to protect between QEMUBH and AioHandler adders and deleter,
>       * and to ensure that no callbacks are removed while we're walking and
> 



^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues
  2019-10-09  9:15                   ` Paolo Bonzini
@ 2019-10-11  6:05                     ` Jan Glauber
  2019-10-11  6:05                       ` [Bug 1805256] " Jan Glauber
                                         ` (2 more replies)
  0 siblings, 3 replies; 141+ messages in thread
From: Jan Glauber @ 2019-10-11  6:05 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Rafael David Tinoco, lizhengui, dann frazier, QEMU Developers,
	Bug 1805256, QEMU Developers - ARM

On Wed, Oct 09, 2019 at 11:15:04AM +0200, Paolo Bonzini wrote:
> On 09/10/19 10:02, Jan Glauber wrote:

> > I'm still not sure what the actual issue is here, but could it be some bad
> > interaction between the notify_me and the list_lock? The are both 4 byte
> > and side-by-side:
> > 
> > address notify_me: 0xaaaadb528aa0  sizeof notify_me: 4
> > address list_lock: 0xaaaadb528aa4  sizeof list_lock: 4
> > 
> > AFAICS the generated code looks OK (all load/store exclusive done
> > with 32 bit size):
> > 
> >      e6c:       885ffc01        ldaxr   w1, [x0]
> >      e70:       11000821        add     w1, w1, #0x2
> >      e74:       8802fc01        stlxr   w2, w1, [x0]
> > 
> > ...but if I bump notify_me size to uint64_t the issue goes away.
> 
> Ouch. :)  Is this with or without my patch(es)?
> 
> Also, what if you just add a dummy uint32_t after notify_me?

With the dummy the testcase also runs fine for 500 iterations.

Dann, can you try if this works on the Hi1620 too?

--Jan


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues
  2019-10-11  6:05                     ` Jan Glauber
@ 2019-10-11  6:05                       ` Jan Glauber
  2019-10-11  8:18                       ` Paolo Bonzini
  2019-10-11 17:50                       ` dann frazier
  2 siblings, 0 replies; 141+ messages in thread
From: Jan Glauber @ 2019-10-11  6:05 UTC (permalink / raw)
  To: qemu-devel

On Wed, Oct 09, 2019 at 11:15:04AM +0200, Paolo Bonzini wrote:
> On 09/10/19 10:02, Jan Glauber wrote:

> > I'm still not sure what the actual issue is here, but could it be some bad
> > interaction between the notify_me and the list_lock? The are both 4 byte
> > and side-by-side:
> > 
> > address notify_me: 0xaaaadb528aa0  sizeof notify_me: 4
> > address list_lock: 0xaaaadb528aa4  sizeof list_lock: 4
> > 
> > AFAICS the generated code looks OK (all load/store exclusive done
> > with 32 bit size):
> > 
> >      e6c:       885ffc01        ldaxr   w1, [x0]
> >      e70:       11000821        add     w1, w1, #0x2
> >      e74:       8802fc01        stlxr   w2, w1, [x0]
> > 
> > ...but if I bump notify_me size to uint64_t the issue goes away.
> 
> Ouch. :)  Is this with or without my patch(es)?
> 
> Also, what if you just add a dummy uint32_t after notify_me?

With the dummy the testcase also runs fine for 500 iterations.

Dann, can you try if this works on the Hi1620 too?

--Jan

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  New
Status in QEMU:
  In Progress
Status in qemu package in Ubuntu:
  In Progress
Status in qemu source package in Bionic:
  New
Status in qemu source package in Disco:
  New
Status in qemu source package in Eoan:
  In Progress
Status in qemu source package in FF-Series:
  New

Bug description:
  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  [ Original Description ]

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues
  2019-10-11  6:05                     ` Jan Glauber
  2019-10-11  6:05                       ` [Bug 1805256] " Jan Glauber
@ 2019-10-11  8:18                       ` Paolo Bonzini
  2019-10-11  8:30                         ` Jan Glauber
  2019-10-11 17:50                       ` dann frazier
  2 siblings, 1 reply; 141+ messages in thread
From: Paolo Bonzini @ 2019-10-11  8:18 UTC (permalink / raw)
  To: Jan Glauber
  Cc: Rafael David Tinoco, lizhengui, dann frazier, QEMU Developers,
	Bug 1805256, QEMU Developers - ARM

On 11/10/19 08:05, Jan Glauber wrote:
> On Wed, Oct 09, 2019 at 11:15:04AM +0200, Paolo Bonzini wrote:
>>> ...but if I bump notify_me size to uint64_t the issue goes away.
>>
>> Ouch. :)  Is this with or without my patch(es)?

You didn't answer this question.

>> Also, what if you just add a dummy uint32_t after notify_me?
> 
> With the dummy the testcase also runs fine for 500 iterations.

You might be lucky and causing list_lock to be in another cache line.
What if you add __attribute__((aligned(16)) to notify_me (and keep the
dummy)?

Paolo

> Dann, can you try if this works on the Hi1620 too?



^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues
  2019-10-11  8:18                       ` Paolo Bonzini
@ 2019-10-11  8:30                         ` Jan Glauber
  2019-10-11  8:30                           ` [Bug 1805256] " Jan Glauber
  2019-10-11 17:55                           ` dann frazier
  0 siblings, 2 replies; 141+ messages in thread
From: Jan Glauber @ 2019-10-11  8:30 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Rafael David Tinoco, lizhengui, dann frazier, QEMU Developers,
	Bug 1805256, QEMU Developers - ARM

On Fri, Oct 11, 2019 at 10:18:18AM +0200, Paolo Bonzini wrote:
> On 11/10/19 08:05, Jan Glauber wrote:
> > On Wed, Oct 09, 2019 at 11:15:04AM +0200, Paolo Bonzini wrote:
> >>> ...but if I bump notify_me size to uint64_t the issue goes away.
> >>
> >> Ouch. :)  Is this with or without my patch(es)?
> 
> You didn't answer this question.

Oh, sorry... I did but the mail probably didn't make it out.
I have both of your changes applied (as I think they make sense).

> >> Also, what if you just add a dummy uint32_t after notify_me?
> > 
> > With the dummy the testcase also runs fine for 500 iterations.
> 
> You might be lucky and causing list_lock to be in another cache line.
> What if you add __attribute__((aligned(16)) to notify_me (and keep the
> dummy)?

Good point. I'll try to force both into the same cacheline.

--Jan

> Paolo
> 
> > Dann, can you try if this works on the Hi1620 too?


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues
  2019-10-11  8:30                         ` Jan Glauber
@ 2019-10-11  8:30                           ` Jan Glauber
  2019-10-11 17:55                           ` dann frazier
  1 sibling, 0 replies; 141+ messages in thread
From: Jan Glauber @ 2019-10-11  8:30 UTC (permalink / raw)
  To: qemu-devel

On Fri, Oct 11, 2019 at 10:18:18AM +0200, Paolo Bonzini wrote:
> On 11/10/19 08:05, Jan Glauber wrote:
> > On Wed, Oct 09, 2019 at 11:15:04AM +0200, Paolo Bonzini wrote:
> >>> ...but if I bump notify_me size to uint64_t the issue goes away.
> >>
> >> Ouch. :)  Is this with or without my patch(es)?
> 
> You didn't answer this question.

Oh, sorry... I did but the mail probably didn't make it out.
I have both of your changes applied (as I think they make sense).

> >> Also, what if you just add a dummy uint32_t after notify_me?
> > 
> > With the dummy the testcase also runs fine for 500 iterations.
> 
> You might be lucky and causing list_lock to be in another cache line.
> What if you add __attribute__((aligned(16)) to notify_me (and keep the
> dummy)?

Good point. I'll try to force both into the same cacheline.

--Jan

> Paolo
> 
> > Dann, can you try if this works on the Hi1620 too?

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  New
Status in QEMU:
  In Progress
Status in qemu package in Ubuntu:
  In Progress
Status in qemu source package in Bionic:
  New
Status in qemu source package in Disco:
  New
Status in qemu source package in Eoan:
  In Progress
Status in qemu source package in FF-Series:
  New

Bug description:
  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  [ Original Description ]

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues
  2019-10-11  6:05                     ` Jan Glauber
  2019-10-11  6:05                       ` [Bug 1805256] " Jan Glauber
  2019-10-11  8:18                       ` Paolo Bonzini
@ 2019-10-11 17:50                       ` dann frazier
  2019-10-11 17:50                         ` [Bug 1805256] " dann frazier
  2 siblings, 1 reply; 141+ messages in thread
From: dann frazier @ 2019-10-11 17:50 UTC (permalink / raw)
  To: Jan Glauber
  Cc: Rafael David Tinoco, lizhengui, QEMU Developers, Bug 1805256,
	QEMU Developers - ARM, Paolo Bonzini

On Fri, Oct 11, 2019 at 06:05:25AM +0000, Jan Glauber wrote:
> On Wed, Oct 09, 2019 at 11:15:04AM +0200, Paolo Bonzini wrote:
> > On 09/10/19 10:02, Jan Glauber wrote:
> 
> > > I'm still not sure what the actual issue is here, but could it be some bad
> > > interaction between the notify_me and the list_lock? The are both 4 byte
> > > and side-by-side:
> > > 
> > > address notify_me: 0xaaaadb528aa0  sizeof notify_me: 4
> > > address list_lock: 0xaaaadb528aa4  sizeof list_lock: 4
> > > 
> > > AFAICS the generated code looks OK (all load/store exclusive done
> > > with 32 bit size):
> > > 
> > >      e6c:       885ffc01        ldaxr   w1, [x0]
> > >      e70:       11000821        add     w1, w1, #0x2
> > >      e74:       8802fc01        stlxr   w2, w1, [x0]
> > > 
> > > ...but if I bump notify_me size to uint64_t the issue goes away.
> > 
> > Ouch. :)  Is this with or without my patch(es)?
> > 
> > Also, what if you just add a dummy uint32_t after notify_me?
> 
> With the dummy the testcase also runs fine for 500 iterations.
> 
> Dann, can you try if this works on the Hi1620 too?

On Hi1620, it hung on the first iteration. Here's the complete patch
I'm running with:

diff --git a/include/block/aio.h b/include/block/aio.h
index 6b0d52f732..e6fd6f1a1a 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -82,7 +82,7 @@ struct AioContext {
      * Instead, the aio_poll calls include both the prepare and the
      * dispatch phase, hence a simple counter is enough for them.
      */
-    uint32_t notify_me;
+    uint64_t notify_me;
 
     /* A lock to protect between QEMUBH and AioHandler adders and deleter,
      * and to ensure that no callbacks are removed while we're walking and
diff --git a/util/async.c b/util/async.c
index ca83e32c7f..024c4c567d 100644
--- a/util/async.c
+++ b/util/async.c
@@ -242,7 +242,7 @@ aio_ctx_check(GSource *source)
     aio_notify_accept(ctx);
 
     for (bh = ctx->first_bh; bh; bh = bh->next) {
-        if (bh->scheduled) {
+        if (atomic_mb_read(&bh->scheduled)) {
             return true;
         }
     }
@@ -342,12 +342,12 @@ LinuxAioState *aio_get_linux_aio(AioContext *ctx)
 
 void aio_notify(AioContext *ctx)
 {
-    /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
-     * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
+    /* Using atomic_mb_read ensures that e.g. bh->scheduled is written before
+     * ctx->notify_me is read.  Pairs with atomic_or in aio_ctx_prepare or
+     * atomic_add in aio_poll.
      */
-    smp_mb();
-    if (ctx->notify_me) {
-        event_notifier_set(&ctx->notifier);
+    if (atomic_mb_read(&ctx->notify_me)) {
+	event_notifier_set(&ctx->notifier);
         atomic_mb_set(&ctx->notified, true);
     }
 }


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues
  2019-10-11 17:50                       ` dann frazier
@ 2019-10-11 17:50                         ` dann frazier
  0 siblings, 0 replies; 141+ messages in thread
From: dann frazier @ 2019-10-11 17:50 UTC (permalink / raw)
  To: qemu-devel

On Fri, Oct 11, 2019 at 06:05:25AM +0000, Jan Glauber wrote:
> On Wed, Oct 09, 2019 at 11:15:04AM +0200, Paolo Bonzini wrote:
> > On 09/10/19 10:02, Jan Glauber wrote:
> 
> > > I'm still not sure what the actual issue is here, but could it be some bad
> > > interaction between the notify_me and the list_lock? The are both 4 byte
> > > and side-by-side:
> > > 
> > > address notify_me: 0xaaaadb528aa0  sizeof notify_me: 4
> > > address list_lock: 0xaaaadb528aa4  sizeof list_lock: 4
> > > 
> > > AFAICS the generated code looks OK (all load/store exclusive done
> > > with 32 bit size):
> > > 
> > >      e6c:       885ffc01        ldaxr   w1, [x0]
> > >      e70:       11000821        add     w1, w1, #0x2
> > >      e74:       8802fc01        stlxr   w2, w1, [x0]
> > > 
> > > ...but if I bump notify_me size to uint64_t the issue goes away.
> > 
> > Ouch. :)  Is this with or without my patch(es)?
> > 
> > Also, what if you just add a dummy uint32_t after notify_me?
> 
> With the dummy the testcase also runs fine for 500 iterations.
> 
> Dann, can you try if this works on the Hi1620 too?

On Hi1620, it hung on the first iteration. Here's the complete patch
I'm running with:

diff --git a/include/block/aio.h b/include/block/aio.h
index 6b0d52f732..e6fd6f1a1a 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -82,7 +82,7 @@ struct AioContext {
      * Instead, the aio_poll calls include both the prepare and the
      * dispatch phase, hence a simple counter is enough for them.
      */
-    uint32_t notify_me;
+    uint64_t notify_me;
 
     /* A lock to protect between QEMUBH and AioHandler adders and deleter,
      * and to ensure that no callbacks are removed while we're walking and
diff --git a/util/async.c b/util/async.c
index ca83e32c7f..024c4c567d 100644
--- a/util/async.c
+++ b/util/async.c
@@ -242,7 +242,7 @@ aio_ctx_check(GSource *source)
     aio_notify_accept(ctx);
 
     for (bh = ctx->first_bh; bh; bh = bh->next) {
-        if (bh->scheduled) {
+        if (atomic_mb_read(&bh->scheduled)) {
             return true;
         }
     }
@@ -342,12 +342,12 @@ LinuxAioState *aio_get_linux_aio(AioContext *ctx)
 
 void aio_notify(AioContext *ctx)
 {
-    /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
-     * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
+    /* Using atomic_mb_read ensures that e.g. bh->scheduled is written before
+     * ctx->notify_me is read.  Pairs with atomic_or in aio_ctx_prepare or
+     * atomic_add in aio_poll.
      */
-    smp_mb();
-    if (ctx->notify_me) {
-        event_notifier_set(&ctx->notifier);
+    if (atomic_mb_read(&ctx->notify_me)) {
+	event_notifier_set(&ctx->notifier);
         atomic_mb_set(&ctx->notified, true);
     }
 }

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  New
Status in QEMU:
  In Progress
Status in qemu package in Ubuntu:
  In Progress
Status in qemu source package in Bionic:
  New
Status in qemu source package in Disco:
  New
Status in qemu source package in Eoan:
  In Progress
Status in qemu source package in FF-Series:
  New

Bug description:
  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  [ Original Description ]

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues
  2019-10-11  8:30                         ` Jan Glauber
  2019-10-11  8:30                           ` [Bug 1805256] " Jan Glauber
@ 2019-10-11 17:55                           ` dann frazier
  2019-10-11 17:55                             ` dann frazier
                                               ` (2 more replies)
  1 sibling, 3 replies; 141+ messages in thread
From: dann frazier @ 2019-10-11 17:55 UTC (permalink / raw)
  To: qemu-devel

On Fri, Oct 11, 2019 at 08:30:02AM +0000, Jan Glauber wrote:
> On Fri, Oct 11, 2019 at 10:18:18AM +0200, Paolo Bonzini wrote:
> > On 11/10/19 08:05, Jan Glauber wrote:
> > > On Wed, Oct 09, 2019 at 11:15:04AM +0200, Paolo Bonzini wrote:
> > >>> ...but if I bump notify_me size to uint64_t the issue goes away.
> > >>
> > >> Ouch. :)  Is this with or without my patch(es)?
> > 
> > You didn't answer this question.
> 
> Oh, sorry... I did but the mail probably didn't make it out.
> I have both of your changes applied (as I think they make sense).
> 
> > >> Also, what if you just add a dummy uint32_t after notify_me?
> > > 
> > > With the dummy the testcase also runs fine for 500 iterations.
> > 
> > You might be lucky and causing list_lock to be in another cache line.
> > What if you add __attribute__((aligned(16)) to notify_me (and keep the
> > dummy)?
> 
> Good point. I'll try to force both into the same cacheline.

On the Hi1620, this still hangs in the first iteration:

diff --git a/include/block/aio.h b/include/block/aio.h
index 6b0d52f732..00e56a5412 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -82,7 +82,7 @@ struct AioContext {
      * Instead, the aio_poll calls include both the prepare and the
      * dispatch phase, hence a simple counter is enough for them.
      */
-    uint32_t notify_me;
+    __attribute__((aligned(16))) uint64_t notify_me;
 
     /* A lock to protect between QEMUBH and AioHandler adders and deleter,
      * and to ensure that no callbacks are removed while we're walking and
diff --git a/util/async.c b/util/async.c
index ca83e32c7f..024c4c567d 100644
--- a/util/async.c
+++ b/util/async.c
@@ -242,7 +242,7 @@ aio_ctx_check(GSource *source)
     aio_notify_accept(ctx);
 
     for (bh = ctx->first_bh; bh; bh = bh->next) {
-        if (bh->scheduled) {
+        if (atomic_mb_read(&bh->scheduled)) {
             return true;
         }
     }
@@ -342,12 +342,12 @@ LinuxAioState *aio_get_linux_aio(AioContext *ctx)
 
 void aio_notify(AioContext *ctx)
 {
-    /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
-     * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
+    /* Using atomic_mb_read ensures that e.g. bh->scheduled is written before
+     * ctx->notify_me is read.  Pairs with atomic_or in aio_ctx_prepare or
+     * atomic_add in aio_poll.
      */
-    smp_mb();
-    if (ctx->notify_me) {
-        event_notifier_set(&ctx->notifier);
+    if (atomic_mb_read(&ctx->notify_me)) {
+	event_notifier_set(&ctx->notifier);
         atomic_mb_set(&ctx->notified, true);
     }
 }

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  New
Status in QEMU:
  In Progress
Status in qemu package in Ubuntu:
  In Progress
Status in qemu source package in Bionic:
  New
Status in qemu source package in Disco:
  New
Status in qemu source package in Eoan:
  In Progress
Status in qemu source package in FF-Series:
  New

Bug description:
  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  [ Original Description ]

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues
  2019-10-11 17:55                           ` dann frazier
@ 2019-10-11 17:55                             ` dann frazier
  2019-10-12  0:24                             ` [Bug 1805256] " no-reply
  2019-10-12  0:49                             ` no-reply
  2 siblings, 0 replies; 141+ messages in thread
From: dann frazier @ 2019-10-11 17:55 UTC (permalink / raw)
  To: Jan Glauber
  Cc: Rafael David Tinoco, lizhengui, QEMU Developers, Bug 1805256,
	QEMU Developers - ARM, Paolo Bonzini

On Fri, Oct 11, 2019 at 08:30:02AM +0000, Jan Glauber wrote:
> On Fri, Oct 11, 2019 at 10:18:18AM +0200, Paolo Bonzini wrote:
> > On 11/10/19 08:05, Jan Glauber wrote:
> > > On Wed, Oct 09, 2019 at 11:15:04AM +0200, Paolo Bonzini wrote:
> > >>> ...but if I bump notify_me size to uint64_t the issue goes away.
> > >>
> > >> Ouch. :)  Is this with or without my patch(es)?
> > 
> > You didn't answer this question.
> 
> Oh, sorry... I did but the mail probably didn't make it out.
> I have both of your changes applied (as I think they make sense).
> 
> > >> Also, what if you just add a dummy uint32_t after notify_me?
> > > 
> > > With the dummy the testcase also runs fine for 500 iterations.
> > 
> > You might be lucky and causing list_lock to be in another cache line.
> > What if you add __attribute__((aligned(16)) to notify_me (and keep the
> > dummy)?
> 
> Good point. I'll try to force both into the same cacheline.

On the Hi1620, this still hangs in the first iteration:

diff --git a/include/block/aio.h b/include/block/aio.h
index 6b0d52f732..00e56a5412 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -82,7 +82,7 @@ struct AioContext {
      * Instead, the aio_poll calls include both the prepare and the
      * dispatch phase, hence a simple counter is enough for them.
      */
-    uint32_t notify_me;
+    __attribute__((aligned(16))) uint64_t notify_me;
 
     /* A lock to protect between QEMUBH and AioHandler adders and deleter,
      * and to ensure that no callbacks are removed while we're walking and
diff --git a/util/async.c b/util/async.c
index ca83e32c7f..024c4c567d 100644
--- a/util/async.c
+++ b/util/async.c
@@ -242,7 +242,7 @@ aio_ctx_check(GSource *source)
     aio_notify_accept(ctx);
 
     for (bh = ctx->first_bh; bh; bh = bh->next) {
-        if (bh->scheduled) {
+        if (atomic_mb_read(&bh->scheduled)) {
             return true;
         }
     }
@@ -342,12 +342,12 @@ LinuxAioState *aio_get_linux_aio(AioContext *ctx)
 
 void aio_notify(AioContext *ctx)
 {
-    /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
-     * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
+    /* Using atomic_mb_read ensures that e.g. bh->scheduled is written before
+     * ctx->notify_me is read.  Pairs with atomic_or in aio_ctx_prepare or
+     * atomic_add in aio_poll.
      */
-    smp_mb();
-    if (ctx->notify_me) {
-        event_notifier_set(&ctx->notifier);
+    if (atomic_mb_read(&ctx->notify_me)) {
+	event_notifier_set(&ctx->notifier);
         atomic_mb_set(&ctx->notified, true);
     }
 }


^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [Bug 1805256] Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues
  2019-10-11 17:55                           ` dann frazier
  2019-10-11 17:55                             ` dann frazier
@ 2019-10-12  0:24                             ` no-reply
  2019-10-12  0:49                             ` no-reply
  2 siblings, 0 replies; 141+ messages in thread
From: no-reply @ 2019-10-12  0:24 UTC (permalink / raw)
  To: dann.frazier; +Cc: qemu-devel

Patchew URL: https://patchew.org/QEMU/20191011175536.GB25464@xps13.dannf/



Hi,

This series failed the docker-mingw@fedora build test. Please find the testing commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#! /bin/bash
export ARCH=x86_64
make docker-image-fedora V=1 NETWORK=1
time make docker-test-mingw@fedora J=14 NETWORK=1
=== TEST SCRIPT END ===

In file included from /tmp/qemu-test/src/include/qemu/osdep.h:51,
                 from /tmp/qemu-test/src/util/async.c:26:
/tmp/qemu-test/src/util/async.c: In function 'aio_notify':
/tmp/qemu-test/src/include/qemu/compiler.h:86:36: error: static assertion failed: "not expecting: sizeof(*&ctx->notify_me) > ATOMIC_REG_SIZE"
 #define QEMU_BUILD_BUG_MSG(x, msg) _Static_assert(!(x), msg)
                                    ^~~~~~~~~~~~~~
/tmp/qemu-test/src/include/qemu/compiler.h:94:30: note: in expansion of macro 'QEMU_BUILD_BUG_MSG'
---
/tmp/qemu-test/src/util/async.c:349:9: note: in expansion of macro 'atomic_mb_read'
     if (atomic_mb_read(&ctx->notify_me)) {
         ^~~~~~~~~~~~~~
make: *** [/tmp/qemu-test/src/rules.mak:69: util/async.o] Error 1
make: *** Waiting for unfinished jobs....
Traceback (most recent call last):
  File "./tests/docker/docker.py", line 662, in <module>
---
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sudo', '-n', 'docker', 'run', '--label', 'com.qemu.instance.uuid=b90449ea684e4a14ad6b34f4f5dcb0c9', '-u', '1001', '--security-opt', 'seccomp=unconfined', '--rm', '-e', 'TARGET_LIST=', '-e', 'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 'J=14', '-e', 'DEBUG=', '-e', 'SHOW_ENV=', '-e', 'CCACHE_DIR=/var/tmp/ccache', '-v', '/home/patchew/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', '/var/tmp/patchew-tester-tmp-0y2eb54d/src/docker-src.2019-10-11-20.12.57.30254:/var/tmp/qemu:z,ro', 'qemu:fedora', '/var/tmp/qemu/run', 'test-mingw']' returned non-zero exit status 2.
filter=--filter=label=com.qemu.instance.uuid=b90449ea684e4a14ad6b34f4f5dcb0c9
make[1]: *** [docker-run] Error 1
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-0y2eb54d/src'
make: *** [docker-run-test-mingw@fedora] Error 2

real    11m27.735s
user    0m7.532s


The full log is available at
http://patchew.org/logs/20191011175536.GB25464@xps13.dannf/testing.docker-mingw@fedora/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [Bug 1805256] Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues
  2019-10-11 17:55                           ` dann frazier
  2019-10-11 17:55                             ` dann frazier
  2019-10-12  0:24                             ` [Bug 1805256] " no-reply
@ 2019-10-12  0:49                             ` no-reply
  2 siblings, 0 replies; 141+ messages in thread
From: no-reply @ 2019-10-12  0:49 UTC (permalink / raw)
  To: dann.frazier; +Cc: qemu-devel

Patchew URL: https://patchew.org/QEMU/20191011175536.GB25464@xps13.dannf/



Hi,

This series seems to have some coding style problems. See output below for
more information:

Subject: [Bug 1805256] Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues
Type: series
Message-id: 20191011175536.GB25464@xps13.dannf

=== TEST SCRIPT BEGIN ===
#!/bin/bash
git rev-parse base > /dev/null || exit 0
git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram
./scripts/checkpatch.pl --mailback base..
=== TEST SCRIPT END ===

Switched to a new branch 'test'
d3707c3 qemu_futex_wait() lockups in ARM64: 2 possible issues

=== OUTPUT BEGIN ===
WARNING: Block comments use a leading /* on a separate line
#66: FILE: util/async.c:345:
+    /* Using atomic_mb_read ensures that e.g. bh->scheduled is written before

ERROR: code indent should never use tabs
#74: FILE: util/async.c:350:
+^Ievent_notifier_set(&ctx->notifier);$

ERROR: Missing Signed-off-by: line(s)

total: 2 errors, 1 warnings, 33 lines checked

Commit d3707c31b5da (qemu_futex_wait() lockups in ARM64: 2 possible issues) has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
=== OUTPUT END ===

Test command exited with code: 1


The full log is available at
http://patchew.org/logs/20191011175536.GB25464@xps13.dannf/testing.checkpatch/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com

^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (28 preceding siblings ...)
  2019-10-03 21:35 ` dann frazier
@ 2019-12-13 14:24 ` dann frazier
  2019-12-17  1:34 ` Fred Kimmy
                   ` (64 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: dann frazier @ 2019-12-13 14:24 UTC (permalink / raw)
  To: qemu-devel

** Changed in: kunpeng920
       Status: New => Confirmed

** Changed in: qemu (Ubuntu Bionic)
       Status: New => Confirmed

** Changed in: qemu (Ubuntu Disco)
       Status: New => Confirmed

** Changed in: qemu (Ubuntu Focal)
       Status: New => Confirmed

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Confirmed
Status in QEMU:
  In Progress
Status in qemu package in Ubuntu:
  Confirmed
Status in qemu source package in Bionic:
  Confirmed
Status in qemu source package in Disco:
  Confirmed
Status in qemu source package in Eoan:
  In Progress
Status in qemu source package in Focal:
  Confirmed

Bug description:
  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  [ Original Description ]

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (29 preceding siblings ...)
  2019-12-13 14:24 ` dann frazier
@ 2019-12-17  1:34 ` Fred Kimmy
  2019-12-17 19:17 ` dann frazier
                   ` (63 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Fred Kimmy @ 2019-12-17  1:34 UTC (permalink / raw)
  To: qemu-devel

include/block/aio.h | 3 +++
 qemu-img.c          | 4 ++++
 util/async.c        | 5 +----
 3 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/include/block/aio.h b/include/block/aio.h
index e9bc04c..9153d87 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -89,6 +89,9 @@ struct AioContext {
      */
     uint32_t notify_me;
 
+    /* force to notify for qemu-img convert */
+    bool notify_for_convert;
+
     /* lock to protect between bh's adders and deleter */
     QemuMutex bh_lock;
 
diff --git a/qemu-img.c b/qemu-img.c
index 60a2be3..cf037aa 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -2411,6 +2411,10 @@ static int img_convert(int argc, char **argv)
         .wr_in_order        = wr_in_order,
         .num_coroutines     = num_coroutines,
     };
+
+    AioContext *ctx = qemu_get_aio_context();
+    ctx->notify_for_convert = 1;
+
     ret = convert_do_copy(&state);
 
 out:
diff --git a/util/async.c b/util/async.c
index 042bf8a..af235fc 100644
--- a/util/async.c
+++ b/util/async.c
@@ -336,12 +336,9 @@ void aio_notify(AioContext *ctx)
      * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
      */
     smp_mb();
-    if (ctx->notify_me) {
+    if (ctx->notify_me || ctx->notify_for_convert) {
         event_notifier_set(&ctx->notifier);
         atomic_mb_set(&ctx->notified, true);
-#if defined(__aarch64__)
-        kill(getpid(), SIGIO);
-#endif
     }
 }

Can you try this aboving patchset to slove it?

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Confirmed
Status in QEMU:
  In Progress
Status in qemu package in Ubuntu:
  Confirmed
Status in qemu source package in Bionic:
  Confirmed
Status in qemu source package in Disco:
  Confirmed
Status in qemu source package in Eoan:
  In Progress
Status in qemu source package in Focal:
  Confirmed

Bug description:
  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  [ Original Description ]

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (30 preceding siblings ...)
  2019-12-17  1:34 ` Fred Kimmy
@ 2019-12-17 19:17 ` dann frazier
  2019-12-18  2:40 ` Rafael David Tinoco
                   ` (62 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: dann frazier @ 2019-12-17 19:17 UTC (permalink / raw)
  To: qemu-devel

I tested the patch in Comment #34, and it was able to pass 500
iterations.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Confirmed
Status in QEMU:
  In Progress
Status in qemu package in Ubuntu:
  Confirmed
Status in qemu source package in Bionic:
  Confirmed
Status in qemu source package in Disco:
  Confirmed
Status in qemu source package in Eoan:
  In Progress
Status in qemu source package in Focal:
  Confirmed

Bug description:
  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  [ Original Description ]

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (31 preceding siblings ...)
  2019-12-17 19:17 ` dann frazier
@ 2019-12-18  2:40 ` Rafael David Tinoco
  2019-12-18  9:52 ` iveskim
                   ` (61 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Rafael David Tinoco @ 2019-12-18  2:40 UTC (permalink / raw)
  To: qemu-devel

Hello Fred,

Based on Dann's feedback on testing, I'm failing to see where your patch
fixes the "root" cause (despite being able to mitigate the issue by
changing the aio notification mechanism).

I think the root cause is best described in this 2 emails from the
thread:

https://lore.kernel.org/qemu-devel/20191009080220.GA2905@hc/

and

https://lore.kernel.org/qemu-devel/966c119d-aa76-2149-108f-
867aebd772f7@redhat.com/

So, by adding ctx->notify_for_convert, it is very likely you
workarounded the issue by doing what Jan already said: removing both
variables (ctx->list_lock and, in old case, ctx->notify_me, in your
case, ctx->notify_for_convert) from the same cacheline and making the
issue to "disappear" (as we would eventually do in a workaround patch).

What about aarch64 issue with both, ctx->list_lock and
ctx->notify_for_convert, being synchronized by qemu used primitives, and
being in the same cache line ?

Any "workaround" here would try to dodge the same cacheline situation,
but, for upstream, I suppose Paolo wants to have something else
regarding aarch64 ATOMIC_SEQ_CST.

like describe in this part of the discussion:

https://lore.kernel.org/qemu-devel/96c26e21-5996-0c63-ce8b-
99a1b5473453@redhat.com/

Unless I'm missing something, am I ?

Thank you!

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Confirmed
Status in QEMU:
  In Progress
Status in qemu package in Ubuntu:
  Confirmed
Status in qemu source package in Bionic:
  Confirmed
Status in qemu source package in Disco:
  Confirmed
Status in qemu source package in Eoan:
  In Progress
Status in qemu source package in Focal:
  Confirmed

Bug description:
  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  [ Original Description ]

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (32 preceding siblings ...)
  2019-12-18  2:40 ` Rafael David Tinoco
@ 2019-12-18  9:52 ` iveskim
  2019-12-18 14:52 ` dann frazier
                   ` (60 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: iveskim @ 2019-12-18  9:52 UTC (permalink / raw)
  To: qemu-devel

I tested the patch in Comment #34, and it was also failed to pass 5 iterations.
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "aarch64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 3987
[New LWP 3988]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
0x0000ffffbd2b3154 in __GI_ppoll (fds=0xaaaae80ef080, nfds=187650381360708, 
    timeout=<optimized out>, sigmask=0xffffc31815f0)
    at ../sysdeps/unix/sysv/linux/ppoll.c:39
39	../sysdeps/unix/sysv/linux/ppoll.c: No such file or directory.
(gdb)

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Confirmed
Status in QEMU:
  In Progress
Status in qemu package in Ubuntu:
  Confirmed
Status in qemu source package in Bionic:
  Confirmed
Status in qemu source package in Disco:
  Confirmed
Status in qemu source package in Eoan:
  In Progress
Status in qemu source package in Focal:
  Confirmed

Bug description:
  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  [ Original Description ]

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (33 preceding siblings ...)
  2019-12-18  9:52 ` iveskim
@ 2019-12-18 14:52 ` dann frazier
  2019-12-18 16:21 ` Ubuntu Foundations Team Bug Bot
                   ` (59 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: dann frazier @ 2019-12-18 14:52 UTC (permalink / raw)
  To: qemu-devel

fyi, what I tested in Comment #35 was upstream QEMU (@ aceeaa69d2) with
a port of the patch in Comment #34 applied. I've attached that patch
here. While it did avoid the issue in my testing, I agree with Rafael's
Comment #36 that it does not appear to address the root cause (as I
understand it), and is therefore unlikely something we'd ship in Ubuntu.

** Patch added: "comment-34-ported-to-upstream.patch"
   https://bugs.launchpad.net/qemu/+bug/1805256/+attachment/5313631/+files/comment-34-ported-to-upstream.patch

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Confirmed
Status in QEMU:
  In Progress
Status in qemu package in Ubuntu:
  Confirmed
Status in qemu source package in Bionic:
  Confirmed
Status in qemu source package in Disco:
  Confirmed
Status in qemu source package in Eoan:
  In Progress
Status in qemu source package in Focal:
  Confirmed

Bug description:
  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  [ Original Description ]

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (34 preceding siblings ...)
  2019-12-18 14:52 ` dann frazier
@ 2019-12-18 16:21 ` Ubuntu Foundations Team Bug Bot
  2020-02-13  8:41 ` Ike Panhc
                   ` (58 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Ubuntu Foundations Team Bug Bot @ 2019-12-18 16:21 UTC (permalink / raw)
  To: qemu-devel

** Tags added: patch

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Confirmed
Status in QEMU:
  In Progress
Status in qemu package in Ubuntu:
  Confirmed
Status in qemu source package in Bionic:
  Confirmed
Status in qemu source package in Disco:
  Confirmed
Status in qemu source package in Eoan:
  In Progress
Status in qemu source package in Focal:
  Confirmed

Bug description:
  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  [ Original Description ]

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (35 preceding siblings ...)
  2019-12-18 16:21 ` Ubuntu Foundations Team Bug Bot
@ 2020-02-13  8:41 ` Ike Panhc
  2020-02-13  8:42 ` Andrew Cloke
                   ` (57 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Ike Panhc @ 2020-02-13  8:41 UTC (permalink / raw)
  To: qemu-devel

** Tags added: ikeradar

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Incomplete
Status in QEMU:
  In Progress
Status in qemu package in Ubuntu:
  Incomplete
Status in qemu source package in Bionic:
  Incomplete
Status in qemu source package in Disco:
  Incomplete
Status in qemu source package in Eoan:
  Incomplete
Status in qemu source package in Focal:
  Incomplete

Bug description:
  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  [ Original Description ]

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (36 preceding siblings ...)
  2020-02-13  8:41 ` Ike Panhc
@ 2020-02-13  8:42 ` Andrew Cloke
  2020-02-13  9:20 ` Fred Kimmy
                   ` (56 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Andrew Cloke @ 2020-02-13  8:42 UTC (permalink / raw)
  To: qemu-devel

Could HiSilicon respond to Dann & Rafael's comments #36 and #38?
Is there an upstream acceptable patch that addresses this issue?

** Changed in: kunpeng920
       Status: Confirmed => Incomplete

** Changed in: qemu (Ubuntu Bionic)
       Status: Confirmed => Incomplete

** Changed in: qemu (Ubuntu Disco)
       Status: Confirmed => Incomplete

** Changed in: qemu (Ubuntu Eoan)
       Status: In Progress => Incomplete

** Changed in: qemu (Ubuntu Focal)
       Status: Confirmed => Incomplete

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Incomplete
Status in QEMU:
  In Progress
Status in qemu package in Ubuntu:
  Incomplete
Status in qemu source package in Bionic:
  Incomplete
Status in qemu source package in Disco:
  Incomplete
Status in qemu source package in Eoan:
  Incomplete
Status in qemu source package in Focal:
  Incomplete

Bug description:
  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  [ Original Description ]

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (37 preceding siblings ...)
  2020-02-13  8:42 ` Andrew Cloke
@ 2020-02-13  9:20 ` Fred Kimmy
  2020-04-15  2:47 ` Rafael David Tinoco
                   ` (55 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Fred Kimmy @ 2020-02-13  9:20 UTC (permalink / raw)
  To: qemu-devel

=》Could HiSilicon respond to Dann & Rafael's comments #36 and #38?
=》Is there an upstream acceptable patch that addresses this issue?

No upstream patchset, I Only provide a private solution and do not know
this root cause.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Incomplete
Status in QEMU:
  In Progress
Status in qemu package in Ubuntu:
  Incomplete
Status in qemu source package in Bionic:
  Incomplete
Status in qemu source package in Disco:
  Incomplete
Status in qemu source package in Eoan:
  Incomplete
Status in qemu source package in Focal:
  Incomplete

Bug description:
  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  [ Original Description ]

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (38 preceding siblings ...)
  2020-02-13  9:20 ` Fred Kimmy
@ 2020-04-15  2:47 ` Rafael David Tinoco
  2020-05-04  7:24 ` Ike Panhc
                   ` (54 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Rafael David Tinoco @ 2020-04-15  2:47 UTC (permalink / raw)
  To: qemu-devel

** Changed in: qemu (Ubuntu Eoan)
     Assignee: Rafael David Tinoco (rafaeldtinoco) => (unassigned)

** Changed in: qemu
     Assignee: Rafael David Tinoco (rafaeldtinoco) => (unassigned)

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Incomplete
Status in QEMU:
  In Progress
Status in qemu package in Ubuntu:
  Incomplete
Status in qemu source package in Bionic:
  Incomplete
Status in qemu source package in Disco:
  Incomplete
Status in qemu source package in Eoan:
  Incomplete
Status in qemu source package in Focal:
  Incomplete

Bug description:
  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  [ Original Description ]

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (39 preceding siblings ...)
  2020-04-15  2:47 ` Rafael David Tinoco
@ 2020-05-04  7:24 ` Ike Panhc
  2020-05-05  0:54 ` Ike Panhc
                   ` (53 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Ike Panhc @ 2020-05-04  7:24 UTC (permalink / raw)
  To: qemu-devel

PPA created with temporarily workaround in comment #34.

https://launchpad.net/~ikepanhc/+archive/ubuntu/lp1805256

This PPA can solve temporarily but is not acceptable for offical
release.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Incomplete
Status in QEMU:
  In Progress
Status in qemu package in Ubuntu:
  Incomplete
Status in qemu source package in Bionic:
  Incomplete
Status in qemu source package in Disco:
  Incomplete
Status in qemu source package in Eoan:
  Incomplete
Status in qemu source package in Focal:
  Incomplete

Bug description:
  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  [ Original Description ]

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (40 preceding siblings ...)
  2020-05-04  7:24 ` Ike Panhc
@ 2020-05-05  0:54 ` Ike Panhc
  2020-05-05  1:22 ` Ying Fang
                   ` (52 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Ike Panhc @ 2020-05-05  0:54 UTC (permalink / raw)
  To: qemu-devel

Take several CPUs offline and re-test. Even only 32 threads left, I
still can reproduce this issue easily.

ubuntu@kreiken:~$ lscpu | grep list;for i in `seq 1 10`;do echo ;rm -f out.img;timeout 30 qemu-img convert -f qcow2 -O qcow2 ./bionic-server-cloudimg-arm64.img out.img -p; done
On-line CPU(s) list:  0-31
Off-line CPU(s) list: 32-127

    (100.00/100%)

    (43.20/100%)
    (0.00/100%)
    (1.00/100%)

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Incomplete
Status in QEMU:
  In Progress
Status in qemu package in Ubuntu:
  Incomplete
Status in qemu source package in Bionic:
  Incomplete
Status in qemu source package in Disco:
  Incomplete
Status in qemu source package in Eoan:
  Incomplete
Status in qemu source package in Focal:
  Incomplete

Bug description:
  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  [ Original Description ]

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (41 preceding siblings ...)
  2020-05-05  0:54 ` Ike Panhc
@ 2020-05-05  1:22 ` Ying Fang
  2020-05-05  6:15 ` Ike Panhc
                   ` (51 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Ying Fang @ 2020-05-05  1:22 UTC (permalink / raw)
  To: qemu-devel

Hi, Ike.

I think this tricky bug was fixed by Paolo last month. 
Please try patch https://git.qemu.org/?p=qemu.git;a=commitdiff;h=5710a3e09f9b85801e5ce70797a4a511e5fc9e2c.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Incomplete
Status in QEMU:
  In Progress
Status in qemu package in Ubuntu:
  Incomplete
Status in qemu source package in Bionic:
  Incomplete
Status in qemu source package in Disco:
  Incomplete
Status in qemu source package in Eoan:
  Incomplete
Status in qemu source package in Focal:
  Incomplete

Bug description:
  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  [ Original Description ]

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (42 preceding siblings ...)
  2020-05-05  1:22 ` Ying Fang
@ 2020-05-05  6:15 ` Ike Panhc
  2020-05-05 15:01 ` Ike Panhc
                   ` (50 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Ike Panhc @ 2020-05-05  6:15 UTC (permalink / raw)
  To: qemu-devel

Thanks. I will test it.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Incomplete
Status in QEMU:
  In Progress
Status in qemu package in Ubuntu:
  Incomplete
Status in qemu source package in Bionic:
  Incomplete
Status in qemu source package in Disco:
  Incomplete
Status in qemu source package in Eoan:
  Incomplete
Status in qemu source package in Focal:
  Incomplete

Bug description:
  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  [ Original Description ]

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (43 preceding siblings ...)
  2020-05-05  6:15 ` Ike Panhc
@ 2020-05-05 15:01 ` Ike Panhc
  2020-05-05 18:48 ` Rafael David Tinoco
                   ` (49 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Ike Panhc @ 2020-05-05 15:01 UTC (permalink / raw)
  To: qemu-devel

The test deb has been pushed to
https://launchpad.net/~ikepanhc/+archive/ubuntu/lp1805256

40 run with patch mentioned in #43 and all passed.

Thanks.


** Changed in: kunpeng920
       Status: Incomplete => Triaged

** Changed in: kunpeng920
     Assignee: (unassigned) => Ike Panhc (ikepanhc)

** Also affects: kunpeng920/upstream-kernel
   Importance: Undecided
       Status: New

** Also affects: kunpeng920/ubuntu-20.04
   Importance: Undecided
       Status: New

** Also affects: kunpeng920/ubuntu-19.10
   Importance: Undecided
       Status: New

** Also affects: kunpeng920/ubuntu-18.04
   Importance: Undecided
       Status: New

** Also affects: kunpeng920/ubuntu-18.04-hwe
   Importance: Undecided
       Status: New

** Changed in: kunpeng920/upstream-kernel
       Status: New => Fix Committed

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Triaged
Status in kunpeng920 ubuntu-18.04 series:
  New
Status in kunpeng920 ubuntu-18.04-hwe series:
  New
Status in kunpeng920 ubuntu-19.10 series:
  New
Status in kunpeng920 ubuntu-20.04 series:
  New
Status in kunpeng920 upstream-kernel series:
  Fix Committed
Status in QEMU:
  In Progress
Status in qemu package in Ubuntu:
  Incomplete
Status in qemu source package in Bionic:
  Incomplete
Status in qemu source package in Disco:
  Incomplete
Status in qemu source package in Eoan:
  Incomplete
Status in qemu source package in Focal:
  Incomplete

Bug description:
  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  [ Original Description ]

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (44 preceding siblings ...)
  2020-05-05 15:01 ` Ike Panhc
@ 2020-05-05 18:48 ` Rafael David Tinoco
  2020-05-05 23:55 ` dann frazier
                   ` (48 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Rafael David Tinoco @ 2020-05-05 18:48 UTC (permalink / raw)
  To: qemu-devel

Hello Ike,

Please, let me know if you want me to go after the needed SRUs for this
fix or if you will.

I'll wait for the final feedback from tests with your PPA.

Cheers!

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Triaged
Status in kunpeng920 ubuntu-18.04 series:
  New
Status in kunpeng920 ubuntu-18.04-hwe series:
  New
Status in kunpeng920 ubuntu-19.10 series:
  New
Status in kunpeng920 ubuntu-20.04 series:
  New
Status in kunpeng920 upstream-kernel series:
  Fix Committed
Status in QEMU:
  In Progress
Status in qemu package in Ubuntu:
  Incomplete
Status in qemu source package in Bionic:
  Incomplete
Status in qemu source package in Disco:
  Incomplete
Status in qemu source package in Eoan:
  Incomplete
Status in qemu source package in Focal:
  Incomplete

Bug description:
  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  [ Original Description ]

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (45 preceding siblings ...)
  2020-05-05 18:48 ` Rafael David Tinoco
@ 2020-05-05 23:55 ` dann frazier
  2020-05-06 13:08 ` Rafael David Tinoco
                   ` (47 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: dann frazier @ 2020-05-05 23:55 UTC (permalink / raw)
  To: qemu-devel

fyi, I backported that fix also to focal/groovy and eoan, and with those
builds. On my test systems the hang reliable occurs within 20
iterations. After the fix, they have survived > 500 iterations thus far.
I'll leave running overnight just to be sure.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Triaged
Status in kunpeng920 ubuntu-18.04 series:
  New
Status in kunpeng920 ubuntu-18.04-hwe series:
  New
Status in kunpeng920 ubuntu-19.10 series:
  New
Status in kunpeng920 ubuntu-20.04 series:
  New
Status in kunpeng920 upstream-kernel series:
  Fix Committed
Status in QEMU:
  In Progress
Status in qemu package in Ubuntu:
  Incomplete
Status in qemu source package in Bionic:
  Incomplete
Status in qemu source package in Disco:
  Incomplete
Status in qemu source package in Eoan:
  Incomplete
Status in qemu source package in Focal:
  Incomplete

Bug description:
  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  [ Original Description ]

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (46 preceding siblings ...)
  2020-05-05 23:55 ` dann frazier
@ 2020-05-06 13:08 ` Rafael David Tinoco
  2020-05-06 13:23 ` Rafael David Tinoco
                   ` (46 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Rafael David Tinoco @ 2020-05-06 13:08 UTC (permalink / raw)
  To: qemu-devel

** Changed in: qemu (Ubuntu)
     Assignee: Rafael David Tinoco (rafaeldtinoco) => (unassigned)

** Changed in: qemu
       Status: In Progress => Fix Released

** Changed in: qemu (Ubuntu Focal)
       Status: Incomplete => In Progress

** Changed in: qemu (Ubuntu Eoan)
       Status: Incomplete => In Progress

** Changed in: qemu (Ubuntu Disco)
       Status: Incomplete => In Progress

** Changed in: qemu (Ubuntu Bionic)
       Status: Incomplete => In Progress

** Changed in: qemu (Ubuntu)
       Status: Incomplete => In Progress

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Triaged
Status in kunpeng920 ubuntu-18.04 series:
  New
Status in kunpeng920 ubuntu-18.04-hwe series:
  New
Status in kunpeng920 ubuntu-19.10 series:
  New
Status in kunpeng920 ubuntu-20.04 series:
  New
Status in kunpeng920 upstream-kernel series:
  Fix Committed
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  In Progress
Status in qemu source package in Bionic:
  In Progress
Status in qemu source package in Disco:
  In Progress
Status in qemu source package in Eoan:
  In Progress
Status in qemu source package in Focal:
  In Progress

Bug description:
  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  [ Original Description ]

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (47 preceding siblings ...)
  2020-05-06 13:08 ` Rafael David Tinoco
@ 2020-05-06 13:23 ` Rafael David Tinoco
  2020-05-06 15:45 ` Ike Panhc
                   ` (45 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Rafael David Tinoco @ 2020-05-06 13:23 UTC (permalink / raw)
  To: qemu-devel

** Description changed:

+ [Impact]
+ 
+ * QEMU locking primitives might face a race condition in QEMU Async I/O
+ bottom halves scheduling. This leads to a dead lock making either QEMU
+ or one of its tools to hang indefinitely.
+ 
+ [Test Case]
+ 
+ * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2
+ 
+ Hangs indefinitely approximately 30% of the runs in Aarch64.
+ 
+ [Regression Potential]
+ 
+ * This is a change to a core part of QEMU: The AIO scheduling. It works
+ like a "kernel" scheduler, whereas kernel schedules OS tasks, the QEMU
+ AIO code is responsible to schedule QEMU coroutines or event listeners
+ callbacks.
+ 
+ * There was a long discussion upstream about primitives and Aarch64.
+ After quite sometime Paolo released this patch and it solves the issue.
+ Tested platforms were: amd64 and aarch64 based on his commit log.
+ 
+ * Christian suggests that this fix stay little longer in -proposed to
+ make sure it won't cause any regressions.
+ 
+ [Other Info]
+ 
+  * Original Description bellow:
+ 
+ 
  Command:
  
  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2
  
  Hangs indefinitely approximately 30% of the runs.
  
  ----
  
  Workaround:
  
  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2
  
  Run "qemu-img convert" with "a single coroutine" to avoid this issue.
  
  ----
  
  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...
  
  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()
  
  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start
  
  ----
  
  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2
  
  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]
  
  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]
  
  <HUNG w/ 3 threads in the stack trace showed before>
  """
  
  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).
  
  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).
  
  ----
  
- [ Original Description ]
- 
  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:
  
  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2
  
  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.
  
  Once hung, attaching gdb gives the following backtrace:
  
  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975
  
  Reproduced w/ latest QEMU git (@ 53744e0a182)

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Triaged
Status in kunpeng920 ubuntu-18.04 series:
  New
Status in kunpeng920 ubuntu-18.04-hwe series:
  New
Status in kunpeng920 ubuntu-19.10 series:
  New
Status in kunpeng920 ubuntu-20.04 series:
  New
Status in kunpeng920 upstream-kernel series:
  Fix Committed
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  In Progress
Status in qemu source package in Bionic:
  In Progress
Status in qemu source package in Disco:
  In Progress
Status in qemu source package in Eoan:
  In Progress
Status in qemu source package in Focal:
  In Progress

Bug description:
  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  [Other Info]

   * Original Description bellow:

  
  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (48 preceding siblings ...)
  2020-05-06 13:23 ` Rafael David Tinoco
@ 2020-05-06 15:45 ` Ike Panhc
  2020-05-06 16:42 ` dann frazier
                   ` (44 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Ike Panhc @ 2020-05-06 15:45 UTC (permalink / raw)
  To: qemu-devel

** Changed in: kunpeng920/ubuntu-18.04
       Status: New => Triaged

** Changed in: kunpeng920/ubuntu-18.04
     Assignee: (unassigned) => Ike Panhc (ikepanhc)

** Changed in: kunpeng920/ubuntu-18.04-hwe
       Status: New => Triaged

** Changed in: kunpeng920/ubuntu-18.04-hwe
     Assignee: (unassigned) => Ike Panhc (ikepanhc)

** Changed in: kunpeng920/ubuntu-19.10
       Status: New => Triaged

** Changed in: kunpeng920/ubuntu-19.10
     Assignee: (unassigned) => Ike Panhc (ikepanhc)

** Changed in: kunpeng920/ubuntu-20.04
       Status: New => Triaged

** Changed in: kunpeng920/ubuntu-20.04
     Assignee: (unassigned) => Ike Panhc (ikepanhc)

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Triaged
Status in kunpeng920 ubuntu-18.04 series:
  Triaged
Status in kunpeng920 ubuntu-18.04-hwe series:
  Triaged
Status in kunpeng920 ubuntu-19.10 series:
  Triaged
Status in kunpeng920 ubuntu-20.04 series:
  Triaged
Status in kunpeng920 upstream-kernel series:
  Fix Committed
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  In Progress
Status in qemu source package in Bionic:
  In Progress
Status in qemu source package in Disco:
  In Progress
Status in qemu source package in Eoan:
  In Progress
Status in qemu source package in Focal:
  In Progress

Bug description:
  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  [Other Info]

   * Original Description bellow:

  
  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (49 preceding siblings ...)
  2020-05-06 15:45 ` Ike Panhc
@ 2020-05-06 16:42 ` dann frazier
  2020-05-06 19:04 ` Launchpad Bug Tracker
                   ` (43 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: dann frazier @ 2020-05-06 16:42 UTC (permalink / raw)
  To: qemu-devel

** Description changed:

  [Impact]
  
  * QEMU locking primitives might face a race condition in QEMU Async I/O
  bottom halves scheduling. This leads to a dead lock making either QEMU
  or one of its tools to hang indefinitely.
  
  [Test Case]
  
  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2
  
  Hangs indefinitely approximately 30% of the runs in Aarch64.
  
  [Regression Potential]
  
  * This is a change to a core part of QEMU: The AIO scheduling. It works
  like a "kernel" scheduler, whereas kernel schedules OS tasks, the QEMU
  AIO code is responsible to schedule QEMU coroutines or event listeners
  callbacks.
  
  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the issue.
  Tested platforms were: amd64 and aarch64 based on his commit log.
  
  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.
  
+ * dannf suggests we also check for performance regressions; e.g. how
+ long it takes to convert a cloud image on high-core systems.
+ 
  [Other Info]
  
-  * Original Description bellow:
- 
+  * Original Description bellow:
  
  Command:
  
  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2
  
  Hangs indefinitely approximately 30% of the runs.
  
  ----
  
  Workaround:
  
  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2
  
  Run "qemu-img convert" with "a single coroutine" to avoid this issue.
  
  ----
  
  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...
  
  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()
  
  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start
  
  ----
  
  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2
  
  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]
  
  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]
  
  <HUNG w/ 3 threads in the stack trace showed before>
  """
  
  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).
  
  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).
  
  ----
  
  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:
  
  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2
  
  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.
  
  Once hung, attaching gdb gives the following backtrace:
  
  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975
  
  Reproduced w/ latest QEMU git (@ 53744e0a182)

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Triaged
Status in kunpeng920 ubuntu-18.04 series:
  Triaged
Status in kunpeng920 ubuntu-18.04-hwe series:
  Triaged
Status in kunpeng920 ubuntu-19.10 series:
  Triaged
Status in kunpeng920 ubuntu-20.04 series:
  Triaged
Status in kunpeng920 upstream-kernel series:
  Fix Committed
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  In Progress
Status in qemu source package in Bionic:
  In Progress
Status in qemu source package in Disco:
  In Progress
Status in qemu source package in Eoan:
  In Progress
Status in qemu source package in Focal:
  In Progress

Bug description:
  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (50 preceding siblings ...)
  2020-05-06 16:42 ` dann frazier
@ 2020-05-06 19:04 ` Launchpad Bug Tracker
  2020-05-06 19:09 ` Philippe Mathieu-Daudé
                   ` (42 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Launchpad Bug Tracker @ 2020-05-06 19:04 UTC (permalink / raw)
  To: qemu-devel

** Merge proposal linked:
   https://code.launchpad.net/~rafaeldtinoco/ubuntu/+source/qemu/+git/qemu/+merge/383530

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Triaged
Status in kunpeng920 ubuntu-18.04 series:
  Triaged
Status in kunpeng920 ubuntu-18.04-hwe series:
  Triaged
Status in kunpeng920 ubuntu-19.10 series:
  Triaged
Status in kunpeng920 ubuntu-20.04 series:
  Triaged
Status in kunpeng920 upstream-kernel series:
  Fix Committed
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  In Progress
Status in qemu source package in Bionic:
  In Progress
Status in qemu source package in Disco:
  In Progress
Status in qemu source package in Eoan:
  In Progress
Status in qemu source package in Focal:
  In Progress

Bug description:
  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (51 preceding siblings ...)
  2020-05-06 19:04 ` Launchpad Bug Tracker
@ 2020-05-06 19:09 ` Philippe Mathieu-Daudé
  2020-05-06 19:57   ` dann frazier
  2020-05-06 20:11 ` Rafael David Tinoco
                   ` (41 subsequent siblings)
  94 siblings, 1 reply; 141+ messages in thread
From: Philippe Mathieu-Daudé @ 2020-05-06 19:09 UTC (permalink / raw)
  To: qemu-devel

Isn't this fixed by commit 5710a3e09f9?

commit 5710a3e09f9b85801e5ce70797a4a511e5fc9e2c
Author: Paolo Bonzini <pbonzini@redhat.com>
Date:   Tue Apr 7 10:07:46 2020 -0400

    async: use explicit memory barriers
    
    When using C11 atomics, non-seqcst reads and writes do not participate
    in the total order of seqcst operations.  In util/async.c and util/aio-posix.c,
    in particular, the pattern that we use
    
              write ctx->notify_me                 write bh->scheduled
              read bh->scheduled                   read ctx->notify_me
              if !bh->scheduled, sleep             if ctx->notify_me, notify
    
    needs to use seqcst operations for both the write and the read.  In
    general this is something that we do not want, because there can be
    many sources that are polled in addition to bottom halves.  The
    alternative is to place a seqcst memory barrier between the write
    and the read.  This also comes with a disadvantage, in that the
    memory barrier is implicit on strongly-ordered architectures and
    it wastes a few dozen clock cycles.
    
    Fortunately, ctx->notify_me is never written concurrently by two
    threads, so we can assert that and relax the writes to ctx->notify_me.
    The resulting solution works and performs well on both aarch64 and x86.
    
    Note that the atomic_set/atomic_read combination is not an atomic
    read-modify-write, and therefore it is even weaker than C11 ATOMIC_RELAXED;
    on x86, ATOMIC_RELAXED compiles to a locked operation.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Triaged
Status in kunpeng920 ubuntu-18.04 series:
  Triaged
Status in kunpeng920 ubuntu-18.04-hwe series:
  Triaged
Status in kunpeng920 ubuntu-19.10 series:
  Triaged
Status in kunpeng920 ubuntu-20.04 series:
  Triaged
Status in kunpeng920 upstream-kernel series:
  Fix Committed
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  In Progress
Status in qemu source package in Bionic:
  In Progress
Status in qemu source package in Disco:
  In Progress
Status in qemu source package in Eoan:
  In Progress
Status in qemu source package in Focal:
  In Progress

Bug description:
  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2020-05-06 19:09 ` Philippe Mathieu-Daudé
@ 2020-05-06 19:57   ` dann frazier
  0 siblings, 0 replies; 141+ messages in thread
From: dann frazier @ 2020-05-06 19:57 UTC (permalink / raw)
  To: qemu-devel

On Wed, May 6, 2020 at 1:20 PM Philippe Mathieu-Daudé
<1805256@bugs.launchpad.net> wrote:
>
> Isn't this fixed by commit 5710a3e09f9?

See comment #43. The discussions hence are about testing/integration
of that fix.

  -dann

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Triaged
Status in kunpeng920 ubuntu-18.04 series:
  Triaged
Status in kunpeng920 ubuntu-18.04-hwe series:
  Triaged
Status in kunpeng920 ubuntu-19.10 series:
  Triaged
Status in kunpeng920 ubuntu-20.04 series:
  Triaged
Status in kunpeng920 upstream-kernel series:
  Fix Committed
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  In Progress
Status in qemu source package in Bionic:
  In Progress
Status in qemu source package in Disco:
  In Progress
Status in qemu source package in Eoan:
  In Progress
Status in qemu source package in Focal:
  In Progress

Bug description:
  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (52 preceding siblings ...)
  2020-05-06 19:09 ` Philippe Mathieu-Daudé
@ 2020-05-06 20:11 ` Rafael David Tinoco
  2020-05-06 21:10 ` Launchpad Bug Tracker
                   ` (40 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Rafael David Tinoco @ 2020-05-06 20:11 UTC (permalink / raw)
  To: qemu-devel

FYIO, from now on all the "merge" work will be done in the merge
requests being linked to this BUG (at the top). @paelzer will be
verifying those.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Triaged
Status in kunpeng920 ubuntu-18.04 series:
  Triaged
Status in kunpeng920 ubuntu-18.04-hwe series:
  Triaged
Status in kunpeng920 ubuntu-19.10 series:
  Triaged
Status in kunpeng920 ubuntu-20.04 series:
  Triaged
Status in kunpeng920 upstream-kernel series:
  Fix Committed
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  In Progress
Status in qemu source package in Bionic:
  In Progress
Status in qemu source package in Disco:
  In Progress
Status in qemu source package in Eoan:
  In Progress
Status in qemu source package in Focal:
  In Progress

Bug description:
  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (53 preceding siblings ...)
  2020-05-06 20:11 ` Rafael David Tinoco
@ 2020-05-06 21:10 ` Launchpad Bug Tracker
  2020-05-06 21:44 ` Launchpad Bug Tracker
                   ` (39 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Launchpad Bug Tracker @ 2020-05-06 21:10 UTC (permalink / raw)
  To: qemu-devel

** Merge proposal linked:
   https://code.launchpad.net/~rafaeldtinoco/ubuntu/+source/qemu/+git/qemu/+merge/383545

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Triaged
Status in kunpeng920 ubuntu-18.04 series:
  Triaged
Status in kunpeng920 ubuntu-18.04-hwe series:
  Triaged
Status in kunpeng920 ubuntu-19.10 series:
  Triaged
Status in kunpeng920 ubuntu-20.04 series:
  Triaged
Status in kunpeng920 upstream-kernel series:
  Fix Committed
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  In Progress
Status in qemu source package in Bionic:
  In Progress
Status in qemu source package in Disco:
  In Progress
Status in qemu source package in Eoan:
  In Progress
Status in qemu source package in Focal:
  In Progress

Bug description:
  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (54 preceding siblings ...)
  2020-05-06 21:10 ` Launchpad Bug Tracker
@ 2020-05-06 21:44 ` Launchpad Bug Tracker
  2020-05-07  3:37 ` Launchpad Bug Tracker
                   ` (38 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Launchpad Bug Tracker @ 2020-05-06 21:44 UTC (permalink / raw)
  To: qemu-devel

** Merge proposal linked:
   https://code.launchpad.net/~rafaeldtinoco/ubuntu/+source/qemu/+git/qemu/+merge/383551

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Triaged
Status in kunpeng920 ubuntu-18.04 series:
  Triaged
Status in kunpeng920 ubuntu-18.04-hwe series:
  Triaged
Status in kunpeng920 ubuntu-19.10 series:
  Triaged
Status in kunpeng920 ubuntu-20.04 series:
  Triaged
Status in kunpeng920 upstream-kernel series:
  Fix Committed
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  In Progress
Status in qemu source package in Bionic:
  In Progress
Status in qemu source package in Disco:
  In Progress
Status in qemu source package in Eoan:
  In Progress
Status in qemu source package in Focal:
  In Progress

Bug description:
  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (55 preceding siblings ...)
  2020-05-06 21:44 ` Launchpad Bug Tracker
@ 2020-05-07  3:37 ` Launchpad Bug Tracker
  2020-05-07  7:00 ` Ike Panhc
                   ` (37 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Launchpad Bug Tracker @ 2020-05-07  3:37 UTC (permalink / raw)
  To: qemu-devel

** Merge proposal linked:
   https://code.launchpad.net/~rafaeldtinoco/ubuntu/+source/qemu/+git/qemu/+merge/383566

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Triaged
Status in kunpeng920 ubuntu-18.04 series:
  Triaged
Status in kunpeng920 ubuntu-18.04-hwe series:
  Triaged
Status in kunpeng920 ubuntu-19.10 series:
  Triaged
Status in kunpeng920 ubuntu-20.04 series:
  Triaged
Status in kunpeng920 upstream-kernel series:
  Fix Committed
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  In Progress
Status in qemu source package in Bionic:
  In Progress
Status in qemu source package in Disco:
  In Progress
Status in qemu source package in Eoan:
  In Progress
Status in qemu source package in Focal:
  In Progress

Bug description:
  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (56 preceding siblings ...)
  2020-05-07  3:37 ` Launchpad Bug Tracker
@ 2020-05-07  7:00 ` Ike Panhc
  2020-05-07 22:27 ` dann frazier
                   ` (36 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Ike Panhc @ 2020-05-07  7:00 UTC (permalink / raw)
  To: qemu-devel

Tested debs in ppa:rafaeldtinoco/lp1805256 for focal and eoan and 1000
qemu-img convert passed.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Triaged
Status in kunpeng920 ubuntu-18.04 series:
  Triaged
Status in kunpeng920 ubuntu-18.04-hwe series:
  Triaged
Status in kunpeng920 ubuntu-19.10 series:
  Triaged
Status in kunpeng920 ubuntu-20.04 series:
  Triaged
Status in kunpeng920 upstream-kernel series:
  Fix Committed
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  In Progress
Status in qemu source package in Bionic:
  In Progress
Status in qemu source package in Disco:
  In Progress
Status in qemu source package in Eoan:
  In Progress
Status in qemu source package in Focal:
  In Progress

Bug description:
  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (57 preceding siblings ...)
  2020-05-07  7:00 ` Ike Panhc
@ 2020-05-07 22:27 ` dann frazier
  2020-05-14  8:05 ` Andrew Cloke
                   ` (35 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: dann frazier @ 2020-05-07 22:27 UTC (permalink / raw)
  To: qemu-devel

Ike's backport in
https://launchpad.net/~ikepanhc/+archive/ubuntu/lp1805256 tests well for
me on Cavium Sabre. One minor note is that the function
in_aio_context_home_thread() is being called in aio-win32.c, but that
function didn't exist in 2.11. We probably want to change that to
aio_context_in_iothread(). It was renamed in
https://git.qemu.org/?p=qemu.git;a=commitdiff;h=d2b63ba8dd20c1091b3f1033e6a95ef95b18149d

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Triaged
Status in kunpeng920 ubuntu-18.04 series:
  Triaged
Status in kunpeng920 ubuntu-18.04-hwe series:
  Triaged
Status in kunpeng920 ubuntu-19.10 series:
  Triaged
Status in kunpeng920 ubuntu-20.04 series:
  Triaged
Status in kunpeng920 upstream-kernel series:
  Fix Committed
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  In Progress
Status in qemu source package in Bionic:
  In Progress
Status in qemu source package in Disco:
  In Progress
Status in qemu source package in Eoan:
  In Progress
Status in qemu source package in Focal:
  In Progress

Bug description:
  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (58 preceding siblings ...)
  2020-05-07 22:27 ` dann frazier
@ 2020-05-14  8:05 ` Andrew Cloke
  2020-05-27  4:55 ` Christian Ehrhardt 
                   ` (34 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Andrew Cloke @ 2020-05-14  8:05 UTC (permalink / raw)
  To: qemu-devel

** Changed in: kunpeng920
       Status: Triaged => In Progress

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  In Progress
Status in kunpeng920 ubuntu-18.04 series:
  Triaged
Status in kunpeng920 ubuntu-18.04-hwe series:
  Triaged
Status in kunpeng920 ubuntu-19.10 series:
  Triaged
Status in kunpeng920 ubuntu-20.04 series:
  Triaged
Status in kunpeng920 upstream-kernel series:
  Fix Committed
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  In Progress
Status in qemu source package in Bionic:
  In Progress
Status in qemu source package in Disco:
  In Progress
Status in qemu source package in Eoan:
  In Progress
Status in qemu source package in Focal:
  In Progress

Bug description:
  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (59 preceding siblings ...)
  2020-05-14  8:05 ` Andrew Cloke
@ 2020-05-27  4:55 ` Christian Ehrhardt 
  2020-05-28 14:58 ` Christian Ehrhardt 
                   ` (33 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Christian Ehrhardt  @ 2020-05-27  4:55 UTC (permalink / raw)
  To: qemu-devel

** No longer affects: qemu (Ubuntu Disco)

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  In Progress
Status in kunpeng920 ubuntu-18.04 series:
  Triaged
Status in kunpeng920 ubuntu-18.04-hwe series:
  Triaged
Status in kunpeng920 ubuntu-19.10 series:
  Triaged
Status in kunpeng920 ubuntu-20.04 series:
  Triaged
Status in kunpeng920 upstream-kernel series:
  Fix Committed
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  In Progress
Status in qemu source package in Bionic:
  In Progress
Status in qemu source package in Eoan:
  In Progress
Status in qemu source package in Focal:
  In Progress

Bug description:
  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (60 preceding siblings ...)
  2020-05-27  4:55 ` Christian Ehrhardt 
@ 2020-05-28 14:58 ` Christian Ehrhardt 
  2020-05-29  7:55 ` Launchpad Bug Tracker
                   ` (32 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Christian Ehrhardt  @ 2020-05-28 14:58 UTC (permalink / raw)
  To: qemu-devel

FYI: sponsored into groovy

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  In Progress
Status in kunpeng920 ubuntu-18.04 series:
  Triaged
Status in kunpeng920 ubuntu-18.04-hwe series:
  Triaged
Status in kunpeng920 ubuntu-19.10 series:
  Triaged
Status in kunpeng920 ubuntu-20.04 series:
  Triaged
Status in kunpeng920 upstream-kernel series:
  Fix Committed
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  In Progress
Status in qemu source package in Bionic:
  In Progress
Status in qemu source package in Eoan:
  In Progress
Status in qemu source package in Focal:
  In Progress

Bug description:
  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (61 preceding siblings ...)
  2020-05-28 14:58 ` Christian Ehrhardt 
@ 2020-05-29  7:55 ` Launchpad Bug Tracker
  2020-05-29  8:01 ` Christian Ehrhardt 
                   ` (31 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Launchpad Bug Tracker @ 2020-05-29  7:55 UTC (permalink / raw)
  To: qemu-devel

This bug was fixed in the package qemu - 1:4.2-3ubuntu8

---------------
qemu (1:4.2-3ubuntu8) groovy; urgency=medium

  * d/p/ubuntu/lp-1805256*: Fixes for QEMU on aarch64 ARM hosts
    - async: use explicit memory barriers (LP: #1805256)
    - aio-wait: delegate polling of main AioContext if BQL not held

 -- Rafael David Tinoco <rafaeldtinoco@ubuntu.com>  Wed, 27 May 2020
21:47:21 +0000

** Changed in: qemu (Ubuntu)
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  In Progress
Status in kunpeng920 ubuntu-18.04 series:
  Triaged
Status in kunpeng920 ubuntu-18.04-hwe series:
  Triaged
Status in kunpeng920 ubuntu-19.10 series:
  Triaged
Status in kunpeng920 ubuntu-20.04 series:
  Triaged
Status in kunpeng920 upstream-kernel series:
  Fix Committed
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Bionic:
  In Progress
Status in qemu source package in Eoan:
  In Progress
Status in qemu source package in Focal:
  In Progress

Bug description:
  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (62 preceding siblings ...)
  2020-05-29  7:55 ` Launchpad Bug Tracker
@ 2020-05-29  8:01 ` Christian Ehrhardt 
  2020-06-02 22:45 ` Brian Murray
                   ` (30 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Christian Ehrhardt  @ 2020-05-29  8:01 UTC (permalink / raw)
  To: qemu-devel

Migrated right now, sponsoring the related SRU portions into B/E/F ...
for consideration by the SRU Team.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  In Progress
Status in kunpeng920 ubuntu-18.04 series:
  Triaged
Status in kunpeng920 ubuntu-18.04-hwe series:
  Triaged
Status in kunpeng920 ubuntu-19.10 series:
  Triaged
Status in kunpeng920 ubuntu-20.04 series:
  Triaged
Status in kunpeng920 upstream-kernel series:
  Fix Committed
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Bionic:
  In Progress
Status in qemu source package in Eoan:
  In Progress
Status in qemu source package in Focal:
  In Progress

Bug description:
  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (63 preceding siblings ...)
  2020-05-29  8:01 ` Christian Ehrhardt 
@ 2020-06-02 22:45 ` Brian Murray
  2020-06-02 22:49 ` [Bug 1805256] Please test proposed package Brian Murray
                   ` (29 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Brian Murray @ 2020-06-02 22:45 UTC (permalink / raw)
  To: qemu-devel

Hello dann, or anyone else affected,

Accepted qemu into focal-proposed. The package will build now and be
available at https://launchpad.net/ubuntu/+source/qemu/1:4.2-3ubuntu6.2
in a few hours, and then in the -proposed repository.

Please help us by testing this new package.  See
https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how
to enable and use -proposed.  Your feedback will aid us getting this
update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug,
mentioning the version of the package you tested, what testing has been
performed on the package and change the tag from verification-needed-
focal to verification-done-focal. If it does not fix the bug for you,
please add a comment stating that, and change the tag to verification-
failed-focal. In either case, without details of your testing we will
not be able to proceed.

Further information regarding the verification process can be found at
https://wiki.ubuntu.com/QATeam/PerformingSRUVerification .  Thank you in
advance for helping!

N.B. The updated package will be released to -updates after the bug(s)
fixed by this package have been verified and the package has been in
-proposed for a minimum of 7 days.

** Tags added: block-proposed-focal

** Changed in: qemu (Ubuntu Focal)
       Status: In Progress => Fix Committed

** Tags added: verification-needed verification-needed-focal

** Changed in: qemu (Ubuntu Eoan)
       Status: In Progress => Fix Committed

** Tags added: verification-needed-eoan

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  In Progress
Status in kunpeng920 ubuntu-18.04 series:
  Triaged
Status in kunpeng920 ubuntu-18.04-hwe series:
  Triaged
Status in kunpeng920 ubuntu-19.10 series:
  Triaged
Status in kunpeng920 ubuntu-20.04 series:
  Triaged
Status in kunpeng920 upstream-kernel series:
  Fix Committed
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Bionic:
  Fix Committed
Status in qemu source package in Eoan:
  Fix Committed
Status in qemu source package in Focal:
  Fix Committed

Bug description:
  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Please test proposed package
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (64 preceding siblings ...)
  2020-06-02 22:45 ` Brian Murray
@ 2020-06-02 22:49 ` Brian Murray
  2020-06-02 22:54 ` Brian Murray
                   ` (28 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Brian Murray @ 2020-06-02 22:49 UTC (permalink / raw)
  To: qemu-devel

Hello dann, or anyone else affected,

Accepted qemu into eoan-proposed. The package will build now and be
available at https://launchpad.net/ubuntu/+source/qemu/1:4.0+dfsg-
0ubuntu9.7 in a few hours, and then in the -proposed repository.

Please help us by testing this new package.  See
https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how
to enable and use -proposed.  Your feedback will aid us getting this
update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug,
mentioning the version of the package you tested, what testing has been
performed on the package and change the tag from verification-needed-
eoan to verification-done-eoan. If it does not fix the bug for you,
please add a comment stating that, and change the tag to verification-
failed-eoan. In either case, without details of your testing we will not
be able to proceed.

Further information regarding the verification process can be found at
https://wiki.ubuntu.com/QATeam/PerformingSRUVerification .  Thank you in
advance for helping!

N.B. The updated package will be released to -updates after the bug(s)
fixed by this package have been verified and the package has been in
-proposed for a minimum of 7 days.

** Tags added: block-proposed-eoan

** Changed in: qemu (Ubuntu Bionic)
       Status: In Progress => Fix Committed

** Tags added: verification-needed-bionic

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  In Progress
Status in kunpeng920 ubuntu-18.04 series:
  Triaged
Status in kunpeng920 ubuntu-18.04-hwe series:
  Triaged
Status in kunpeng920 ubuntu-19.10 series:
  Triaged
Status in kunpeng920 ubuntu-20.04 series:
  Triaged
Status in kunpeng920 upstream-kernel series:
  Fix Committed
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Bionic:
  Fix Committed
Status in qemu source package in Eoan:
  Fix Committed
Status in qemu source package in Focal:
  Fix Committed

Bug description:
  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Please test proposed package
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (65 preceding siblings ...)
  2020-06-02 22:49 ` [Bug 1805256] Please test proposed package Brian Murray
@ 2020-06-02 22:54 ` Brian Murray
  2020-06-03  4:09 ` [Bug 1805256] Autopkgtest regression report (qemu/1:4.0+dfsg-0ubuntu9.7) Ubuntu SRU Bot
                   ` (27 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Brian Murray @ 2020-06-02 22:54 UTC (permalink / raw)
  To: qemu-devel

Hello dann, or anyone else affected,

Accepted qemu into bionic-proposed. The package will build now and be
available at https://launchpad.net/ubuntu/+source/qemu/1:2.11+dfsg-
1ubuntu7.27 in a few hours, and then in the -proposed repository.

Please help us by testing this new package.  See
https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how
to enable and use -proposed.  Your feedback will aid us getting this
update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug,
mentioning the version of the package you tested, what testing has been
performed on the package and change the tag from verification-needed-
bionic to verification-done-bionic. If it does not fix the bug for you,
please add a comment stating that, and change the tag to verification-
failed-bionic. In either case, without details of your testing we will
not be able to proceed.

Further information regarding the verification process can be found at
https://wiki.ubuntu.com/QATeam/PerformingSRUVerification .  Thank you in
advance for helping!

N.B. The updated package will be released to -updates after the bug(s)
fixed by this package have been verified and the package has been in
-proposed for a minimum of 7 days.

** Tags removed: verification-needed-bionic
** Tags added: block-proposed-bionic

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  In Progress
Status in kunpeng920 ubuntu-18.04 series:
  Triaged
Status in kunpeng920 ubuntu-18.04-hwe series:
  Triaged
Status in kunpeng920 ubuntu-19.10 series:
  Triaged
Status in kunpeng920 ubuntu-20.04 series:
  Triaged
Status in kunpeng920 upstream-kernel series:
  Fix Committed
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Bionic:
  Fix Committed
Status in qemu source package in Eoan:
  Fix Committed
Status in qemu source package in Focal:
  Fix Committed

Bug description:
  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Autopkgtest regression report (qemu/1:4.0+dfsg-0ubuntu9.7)
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (66 preceding siblings ...)
  2020-06-02 22:54 ` Brian Murray
@ 2020-06-03  4:09 ` Ubuntu SRU Bot
  2020-06-03  6:35 ` [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images Ike Panhc
                   ` (26 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Ubuntu SRU Bot @ 2020-06-03  4:09 UTC (permalink / raw)
  To: qemu-devel

All autopkgtests for the newly accepted qemu (1:4.0+dfsg-0ubuntu9.7) for eoan have finished running.
The following regressions have been reported in tests triggered by the package:

edk2/0~20190606.20d2e5a1-2ubuntu1.1 (amd64, armhf)


Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-
migration/eoan/update_excuses.html#qemu

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  In Progress
Status in kunpeng920 ubuntu-18.04 series:
  Triaged
Status in kunpeng920 ubuntu-18.04-hwe series:
  Triaged
Status in kunpeng920 ubuntu-19.10 series:
  Triaged
Status in kunpeng920 ubuntu-20.04 series:
  Triaged
Status in kunpeng920 upstream-kernel series:
  Fix Committed
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Bionic:
  Fix Committed
Status in qemu source package in Eoan:
  Fix Committed
Status in qemu source package in Focal:
  Fix Committed

Bug description:
  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (67 preceding siblings ...)
  2020-06-03  4:09 ` [Bug 1805256] Autopkgtest regression report (qemu/1:4.0+dfsg-0ubuntu9.7) Ubuntu SRU Bot
@ 2020-06-03  6:35 ` Ike Panhc
  2020-06-03  8:40 ` [Bug 1805256] Autopkgtest regression report (qemu/1:4.2-3ubuntu6.2) Ubuntu SRU Bot
                   ` (25 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Ike Panhc @ 2020-06-03  6:35 UTC (permalink / raw)
  To: qemu-devel

100 run on bionic/eoan/focal -proposed `qemu-img convert` all
successful. No hang occurs. Thanks a lot.

** Tags removed: verification-needed verification-needed-eoan verification-needed-focal
** Tags added: verification-done-bionic verification-done-eoan verification-done-focal

** Changed in: kunpeng920/ubuntu-18.04
       Status: Triaged => In Progress

** Changed in: kunpeng920/ubuntu-18.04
     Assignee: Ike Panhc (ikepanhc) => (unassigned)

** Changed in: kunpeng920
     Assignee: Ike Panhc (ikepanhc) => (unassigned)

** Changed in: kunpeng920/ubuntu-18.04-hwe
       Status: Triaged => In Progress

** Changed in: kunpeng920/ubuntu-18.04-hwe
     Assignee: Ike Panhc (ikepanhc) => (unassigned)

** Changed in: kunpeng920/ubuntu-19.10
       Status: Triaged => In Progress

** Changed in: kunpeng920/ubuntu-19.10
     Assignee: Ike Panhc (ikepanhc) => (unassigned)

** Changed in: kunpeng920/ubuntu-20.04
       Status: Triaged => In Progress

** Changed in: kunpeng920/ubuntu-20.04
     Assignee: Ike Panhc (ikepanhc) => (unassigned)

** Changed in: kunpeng920/upstream-kernel
       Status: Fix Committed => Invalid

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  In Progress
Status in kunpeng920 ubuntu-18.04 series:
  In Progress
Status in kunpeng920 ubuntu-18.04-hwe series:
  In Progress
Status in kunpeng920 ubuntu-19.10 series:
  In Progress
Status in kunpeng920 ubuntu-20.04 series:
  In Progress
Status in kunpeng920 upstream-kernel series:
  Invalid
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Bionic:
  Fix Committed
Status in qemu source package in Eoan:
  Fix Committed
Status in qemu source package in Focal:
  Fix Committed

Bug description:
  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Autopkgtest regression report (qemu/1:4.2-3ubuntu6.2)
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (68 preceding siblings ...)
  2020-06-03  6:35 ` [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images Ike Panhc
@ 2020-06-03  8:40 ` Ubuntu SRU Bot
  2020-06-05  3:51 ` [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images Christian Ehrhardt 
                   ` (24 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Ubuntu SRU Bot @ 2020-06-03  8:40 UTC (permalink / raw)
  To: qemu-devel

All autopkgtests for the newly accepted qemu (1:4.2-3ubuntu6.2) for focal have finished running.
The following regressions have been reported in tests triggered by the package:

systemd/245.4-4ubuntu3.1 (arm64)


Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-
migration/focal/update_excuses.html#qemu

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  In Progress
Status in kunpeng920 ubuntu-18.04 series:
  In Progress
Status in kunpeng920 ubuntu-18.04-hwe series:
  In Progress
Status in kunpeng920 ubuntu-19.10 series:
  In Progress
Status in kunpeng920 ubuntu-20.04 series:
  In Progress
Status in kunpeng920 upstream-kernel series:
  Invalid
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Bionic:
  Fix Committed
Status in qemu source package in Eoan:
  Fix Committed
Status in qemu source package in Focal:
  Fix Committed

Bug description:
  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (69 preceding siblings ...)
  2020-06-03  8:40 ` [Bug 1805256] Autopkgtest regression report (qemu/1:4.2-3ubuntu6.2) Ubuntu SRU Bot
@ 2020-06-05  3:51 ` Christian Ehrhardt 
  2020-06-11  8:04 ` Andrew Cloke
                   ` (23 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Christian Ehrhardt  @ 2020-06-05  3:51 UTC (permalink / raw)
  To: qemu-devel

I've looked and retried the tests - all green now.
Let us give it a few extra days in proposed as planned, but other than that it looks ok to be released.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  In Progress
Status in kunpeng920 ubuntu-18.04 series:
  In Progress
Status in kunpeng920 ubuntu-18.04-hwe series:
  In Progress
Status in kunpeng920 ubuntu-19.10 series:
  In Progress
Status in kunpeng920 ubuntu-20.04 series:
  In Progress
Status in kunpeng920 upstream-kernel series:
  Invalid
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Bionic:
  Fix Committed
Status in qemu source package in Eoan:
  Fix Committed
Status in qemu source package in Focal:
  Fix Committed

Bug description:
  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (70 preceding siblings ...)
  2020-06-05  3:51 ` [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images Christian Ehrhardt 
@ 2020-06-11  8:04 ` Andrew Cloke
  2020-06-17  5:16 ` Christian Ehrhardt 
                   ` (22 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Andrew Cloke @ 2020-06-11  8:04 UTC (permalink / raw)
  To: qemu-devel

** Changed in: kunpeng920/ubuntu-18.04
       Status: In Progress => Fix Committed

** Changed in: kunpeng920/ubuntu-18.04-hwe
       Status: In Progress => Fix Committed

** Changed in: kunpeng920/ubuntu-19.10
       Status: In Progress => Fix Committed

** Changed in: kunpeng920/ubuntu-20.04
       Status: In Progress => Fix Committed

** Changed in: kunpeng920
       Status: In Progress => Fix Committed

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Fix Committed
Status in kunpeng920 ubuntu-18.04 series:
  Fix Committed
Status in kunpeng920 ubuntu-18.04-hwe series:
  Fix Committed
Status in kunpeng920 ubuntu-19.10 series:
  Fix Committed
Status in kunpeng920 ubuntu-20.04 series:
  Fix Committed
Status in kunpeng920 upstream-kernel series:
  Invalid
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Bionic:
  Fix Committed
Status in qemu source package in Eoan:
  Fix Committed
Status in qemu source package in Focal:
  Fix Committed

Bug description:
  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (71 preceding siblings ...)
  2020-06-11  8:04 ` Andrew Cloke
@ 2020-06-17  5:16 ` Christian Ehrhardt 
  2020-06-18  9:23 ` Launchpad Bug Tracker
                   ` (21 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Christian Ehrhardt  @ 2020-06-17  5:16 UTC (permalink / raw)
  To: qemu-devel

We had the 14 (instead f 7) days in -proposed for some extended maturing. Nothing came up in regard to this and all validations were good.
Dropping block-proposed to be released once the SRU Team gets to it.

** Tags removed: block-proposed-bionic block-proposed-eoan block-
proposed-focal

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Fix Committed
Status in kunpeng920 ubuntu-18.04 series:
  Fix Committed
Status in kunpeng920 ubuntu-18.04-hwe series:
  Fix Committed
Status in kunpeng920 ubuntu-19.10 series:
  Fix Committed
Status in kunpeng920 ubuntu-20.04 series:
  Fix Committed
Status in kunpeng920 upstream-kernel series:
  Invalid
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Bionic:
  Fix Committed
Status in qemu source package in Eoan:
  Fix Committed
Status in qemu source package in Focal:
  Fix Committed

Bug description:
  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (72 preceding siblings ...)
  2020-06-17  5:16 ` Christian Ehrhardt 
@ 2020-06-18  9:23 ` Launchpad Bug Tracker
  2020-06-18  9:23 ` [Bug 1805256] Update Released Łukasz Zemczak
                   ` (20 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Launchpad Bug Tracker @ 2020-06-18  9:23 UTC (permalink / raw)
  To: qemu-devel

This bug was fixed in the package qemu - 1:4.2-3ubuntu6.2

---------------
qemu (1:4.2-3ubuntu6.2) focal; urgency=medium

  * d/p/ubuntu/lp-1805256*: Fixes for QEMU on aarch64 ARM hosts
    - async: use explicit memory barriers (LP: #1805256)
    - aio-wait: delegate polling of main AioContext if BQL not held

 -- Rafael David Tinoco <rafaeldtinoco@ubuntu.com>  Wed, 27 May 2020
21:19:20 +0000

** Changed in: qemu (Ubuntu Focal)
       Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Fix Committed
Status in kunpeng920 ubuntu-18.04 series:
  Fix Committed
Status in kunpeng920 ubuntu-18.04-hwe series:
  Fix Committed
Status in kunpeng920 ubuntu-19.10 series:
  Fix Committed
Status in kunpeng920 ubuntu-20.04 series:
  Fix Committed
Status in kunpeng920 upstream-kernel series:
  Invalid
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Bionic:
  Fix Committed
Status in qemu source package in Eoan:
  Fix Committed
Status in qemu source package in Focal:
  Fix Released

Bug description:
  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Update Released
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (73 preceding siblings ...)
  2020-06-18  9:23 ` Launchpad Bug Tracker
@ 2020-06-18  9:23 ` Łukasz Zemczak
  2020-06-18  9:38 ` [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images Launchpad Bug Tracker
                   ` (19 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Łukasz Zemczak @ 2020-06-18  9:23 UTC (permalink / raw)
  To: qemu-devel

The verification of the Stable Release Update for qemu has completed
successfully and the package is now being released to -updates.
Subsequently, the Ubuntu Stable Release Updates Team is being
unsubscribed and will not receive messages about this bug report.  In
the event that you encounter a regression using the package from
-updates please report a new bug using ubuntu-bug and tag the bug report
regression-update so we can easily find any regressions.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Fix Committed
Status in kunpeng920 ubuntu-18.04 series:
  Fix Committed
Status in kunpeng920 ubuntu-18.04-hwe series:
  Fix Committed
Status in kunpeng920 ubuntu-19.10 series:
  Fix Committed
Status in kunpeng920 ubuntu-20.04 series:
  Fix Committed
Status in kunpeng920 upstream-kernel series:
  Invalid
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Bionic:
  Fix Committed
Status in qemu source package in Eoan:
  Fix Committed
Status in qemu source package in Focal:
  Fix Released

Bug description:
  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (74 preceding siblings ...)
  2020-06-18  9:23 ` [Bug 1805256] Update Released Łukasz Zemczak
@ 2020-06-18  9:38 ` Launchpad Bug Tracker
  2020-06-18  9:39 ` Launchpad Bug Tracker
                   ` (18 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Launchpad Bug Tracker @ 2020-06-18  9:38 UTC (permalink / raw)
  To: qemu-devel

This bug was fixed in the package qemu - 1:4.0+dfsg-0ubuntu9.7

---------------
qemu (1:4.0+dfsg-0ubuntu9.7) eoan; urgency=medium

  * d/p/ubuntu/lp-1805256*: Fixes for QEMU on aarch64 ARM hosts
    - async: use explicit memory barriers (LP: #1805256)
    - aio-wait: delegate polling of main AioContext if BQL not held

 -- Rafael David Tinoco <rafaeldtinoco@ubuntu.com>  Wed, 27 May 2020
20:07:57 +0000

** Changed in: qemu (Ubuntu Eoan)
       Status: Fix Committed => Fix Released

** Changed in: qemu (Ubuntu Bionic)
       Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Fix Committed
Status in kunpeng920 ubuntu-18.04 series:
  Fix Committed
Status in kunpeng920 ubuntu-18.04-hwe series:
  Fix Committed
Status in kunpeng920 ubuntu-19.10 series:
  Fix Committed
Status in kunpeng920 ubuntu-20.04 series:
  Fix Committed
Status in kunpeng920 upstream-kernel series:
  Invalid
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Bionic:
  Fix Released
Status in qemu source package in Eoan:
  Fix Released
Status in qemu source package in Focal:
  Fix Released

Bug description:
  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (75 preceding siblings ...)
  2020-06-18  9:38 ` [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images Launchpad Bug Tracker
@ 2020-06-18  9:39 ` Launchpad Bug Tracker
  2020-06-18 10:27 ` Andrew Cloke
                   ` (17 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Launchpad Bug Tracker @ 2020-06-18  9:39 UTC (permalink / raw)
  To: qemu-devel

This bug was fixed in the package qemu - 1:2.11+dfsg-1ubuntu7.27

---------------
qemu (1:2.11+dfsg-1ubuntu7.27) bionic; urgency=medium

  * d/p/ubuntu/lp-1805256*: Fixes for QEMU on aarch64 ARM hosts
    - aio: rename aio_context_in_iothread() to in_aio_context_home_thread()
    - aio: Do aio_notify_accept only during blocking aio_poll
    - aio-posix: Assert that aio_poll() is always called in home thread
    - async: use explicit memory barriers (LP: #1805256)
    - aio-wait: delegate polling of main AioContext if BQL not held
    - aio-posix: Don't count ctx->notifier as progress when polling

 -- Rafael David Tinoco <rafaeldtinoco@ubuntu.com>  Tue, 26 May 2020
17:39:21 +0000

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Fix Committed
Status in kunpeng920 ubuntu-18.04 series:
  Fix Committed
Status in kunpeng920 ubuntu-18.04-hwe series:
  Fix Committed
Status in kunpeng920 ubuntu-19.10 series:
  Fix Committed
Status in kunpeng920 ubuntu-20.04 series:
  Fix Committed
Status in kunpeng920 upstream-kernel series:
  Invalid
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Bionic:
  Fix Released
Status in qemu source package in Eoan:
  Fix Released
Status in qemu source package in Focal:
  Fix Released

Bug description:
  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (76 preceding siblings ...)
  2020-06-18  9:39 ` Launchpad Bug Tracker
@ 2020-06-18 10:27 ` Andrew Cloke
  2020-06-30  6:54 ` Christian Ehrhardt 
                   ` (16 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Andrew Cloke @ 2020-06-18 10:27 UTC (permalink / raw)
  To: qemu-devel

** Changed in: kunpeng920/ubuntu-20.04
       Status: Fix Committed => Fix Released

** Changed in: kunpeng920/ubuntu-19.10
       Status: Fix Committed => Fix Released

** Changed in: kunpeng920/ubuntu-18.04-hwe
       Status: Fix Committed => Fix Released

** Changed in: kunpeng920/ubuntu-18.04
       Status: Fix Committed => Fix Released

** Changed in: kunpeng920
       Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Fix Released
Status in kunpeng920 ubuntu-18.04 series:
  Fix Released
Status in kunpeng920 ubuntu-18.04-hwe series:
  Fix Released
Status in kunpeng920 ubuntu-19.10 series:
  Fix Released
Status in kunpeng920 ubuntu-20.04 series:
  Fix Released
Status in kunpeng920 upstream-kernel series:
  Invalid
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Bionic:
  Fix Released
Status in qemu source package in Eoan:
  Fix Released
Status in qemu source package in Focal:
  Fix Released

Bug description:
  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (77 preceding siblings ...)
  2020-06-18 10:27 ` Andrew Cloke
@ 2020-06-30  6:54 ` Christian Ehrhardt 
  2020-07-01  7:01 ` Ike Panhc
                   ` (15 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Christian Ehrhardt  @ 2020-06-30  6:54 UTC (permalink / raw)
  To: qemu-devel

This will re-open again for Bionic due to bug 1885419 forcing a revert of the former backports.
After a deeper evaluation if the assert is wrong in the backport or just flagging a problem formerly already existing in Bionic this will be re-fixed.

** Changed in: qemu (Ubuntu Bionic)
       Status: Fix Released => Triaged

** Changed in: qemu (Ubuntu Bionic)
     Assignee: (unassigned) => Rafael David Tinoco (rafaeldtinoco)

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Fix Released
Status in kunpeng920 ubuntu-18.04 series:
  Fix Released
Status in kunpeng920 ubuntu-18.04-hwe series:
  Fix Released
Status in kunpeng920 ubuntu-19.10 series:
  Fix Released
Status in kunpeng920 ubuntu-20.04 series:
  Fix Released
Status in kunpeng920 upstream-kernel series:
  Invalid
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Bionic:
  Triaged
Status in qemu source package in Eoan:
  Fix Released
Status in qemu source package in Focal:
  Fix Released

Bug description:
  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (78 preceding siblings ...)
  2020-06-30  6:54 ` Christian Ehrhardt 
@ 2020-07-01  7:01 ` Ike Panhc
  2020-07-12 13:16 ` Rafael David Tinoco
                   ` (14 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Ike Panhc @ 2020-07-01  7:01 UTC (permalink / raw)
  To: qemu-devel

Re-open for bionic due to regression found

** Changed in: kunpeng920/ubuntu-18.04
       Status: Fix Released => Triaged

** Changed in: kunpeng920/ubuntu-18.04-hwe
       Status: Fix Released => Triaged

** Changed in: kunpeng920
       Status: Fix Released => Triaged

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Triaged
Status in kunpeng920 ubuntu-18.04 series:
  Triaged
Status in kunpeng920 ubuntu-18.04-hwe series:
  Triaged
Status in kunpeng920 ubuntu-19.10 series:
  Fix Released
Status in kunpeng920 ubuntu-20.04 series:
  Fix Released
Status in kunpeng920 upstream-kernel series:
  Invalid
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Bionic:
  Triaged
Status in qemu source package in Eoan:
  Fix Released
Status in qemu source package in Focal:
  Fix Released

Bug description:
  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (79 preceding siblings ...)
  2020-07-01  7:01 ` Ike Panhc
@ 2020-07-12 13:16 ` Rafael David Tinoco
  2020-07-13  3:59 ` Launchpad Bug Tracker
                   ` (13 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Rafael David Tinoco @ 2020-07-12 13:16 UTC (permalink / raw)
  To: qemu-devel

Started working on this again...

** Changed in: qemu (Ubuntu Bionic)
       Status: Triaged => In Progress

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Triaged
Status in kunpeng920 ubuntu-18.04 series:
  Triaged
Status in kunpeng920 ubuntu-18.04-hwe series:
  Triaged
Status in kunpeng920 ubuntu-19.10 series:
  Fix Released
Status in kunpeng920 ubuntu-20.04 series:
  Fix Released
Status in kunpeng920 upstream-kernel series:
  Invalid
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Bionic:
  In Progress
Status in qemu source package in Eoan:
  Fix Released
Status in qemu source package in Focal:
  Fix Released

Bug description:
  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (80 preceding siblings ...)
  2020-07-12 13:16 ` Rafael David Tinoco
@ 2020-07-13  3:59 ` Launchpad Bug Tracker
  2020-07-13  4:12 ` Rafael David Tinoco
                   ` (12 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Launchpad Bug Tracker @ 2020-07-13  3:59 UTC (permalink / raw)
  To: qemu-devel

** Merge proposal linked:
   https://code.launchpad.net/~rafaeldtinoco/ubuntu/+source/qemu/+git/qemu/+merge/387269

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Triaged
Status in kunpeng920 ubuntu-18.04 series:
  Triaged
Status in kunpeng920 ubuntu-18.04-hwe series:
  Triaged
Status in kunpeng920 ubuntu-19.10 series:
  Fix Released
Status in kunpeng920 ubuntu-20.04 series:
  Fix Released
Status in kunpeng920 upstream-kernel series:
  Invalid
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Bionic:
  In Progress
Status in qemu source package in Eoan:
  Fix Released
Status in qemu source package in Focal:
  Fix Released

Bug description:
  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (81 preceding siblings ...)
  2020-07-13  3:59 ` Launchpad Bug Tracker
@ 2020-07-13  4:12 ` Rafael David Tinoco
  2020-07-15 15:31 ` dann frazier
                   ` (11 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Rafael David Tinoco @ 2020-07-13  4:12 UTC (permalink / raw)
  To: qemu-devel

Worked being done for the Bionic SRU:

BUG: https://bugs.launchpad.net/qemu/+bug/1805256
(fix for the bionic regression demonstrated at LP: #1885419)
PPA: https://launchpad.net/~rafaeldtinoco/+archive/ubuntu/lp1805256-bionic
MERGE: https://tinyurl.com/y8sucs6x

Merge proposal currently going under review, tests and discussions.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Triaged
Status in kunpeng920 ubuntu-18.04 series:
  Triaged
Status in kunpeng920 ubuntu-18.04-hwe series:
  Triaged
Status in kunpeng920 ubuntu-19.10 series:
  Fix Released
Status in kunpeng920 ubuntu-20.04 series:
  Fix Released
Status in kunpeng920 upstream-kernel series:
  Invalid
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Bionic:
  In Progress
Status in qemu source package in Eoan:
  Fix Released
Status in qemu source package in Focal:
  Fix Released

Bug description:
  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (82 preceding siblings ...)
  2020-07-13  4:12 ` Rafael David Tinoco
@ 2020-07-15 15:31 ` dann frazier
  2020-07-20 12:22 ` Rafael David Tinoco
                   ` (10 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: dann frazier @ 2020-07-15 15:31 UTC (permalink / raw)
  To: qemu-devel

I ran the new PPA build (1:2.11+dfsg-1ubuntu7.29~ppa01) on both a
ThunderX2 system and a Hi1620 system overnight, and both survived (6574
& 12919 iterations, respectively).

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Triaged
Status in kunpeng920 ubuntu-18.04 series:
  Triaged
Status in kunpeng920 ubuntu-18.04-hwe series:
  Triaged
Status in kunpeng920 ubuntu-19.10 series:
  Fix Released
Status in kunpeng920 ubuntu-20.04 series:
  Fix Released
Status in kunpeng920 upstream-kernel series:
  Invalid
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Bionic:
  In Progress
Status in qemu source package in Eoan:
  Fix Released
Status in qemu source package in Focal:
  Fix Released

Bug description:
  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (83 preceding siblings ...)
  2020-07-15 15:31 ` dann frazier
@ 2020-07-20 12:22 ` Rafael David Tinoco
  2020-07-21 20:02 ` Rafael David Tinoco
                   ` (9 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Rafael David Tinoco @ 2020-07-20 12:22 UTC (permalink / raw)
  To: qemu-devel

Thanks @dannf! I spoke to Christian and him and I agreed to confine this
change into ARM builds only (as SRU for Bionic). Preparing it...

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Triaged
Status in kunpeng920 ubuntu-18.04 series:
  Triaged
Status in kunpeng920 ubuntu-18.04-hwe series:
  Triaged
Status in kunpeng920 ubuntu-19.10 series:
  Fix Released
Status in kunpeng920 ubuntu-20.04 series:
  Fix Released
Status in kunpeng920 upstream-kernel series:
  Invalid
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Bionic:
  In Progress
Status in qemu source package in Eoan:
  Fix Released
Status in qemu source package in Focal:
  Fix Released

Bug description:
  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (84 preceding siblings ...)
  2020-07-20 12:22 ` Rafael David Tinoco
@ 2020-07-21 20:02 ` Rafael David Tinoco
  2020-07-21 20:03 ` Rafael David Tinoco
                   ` (8 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Rafael David Tinoco @ 2020-07-21 20:02 UTC (permalink / raw)
  To: qemu-devel

Status from old attempts to solve same nature issues:

----

Older (2018) merge request from @raharper:

https://github.com/koverstreet/bcache-tools/pull/1

addressing the fact that kernel uevents would not always emit 
CACHED_UUID parameters, making udev to delete (whenever that happens) 
/dev/bcache/{by-uuid,by-label} symlinks.

This last MR pointed to previous related bugs:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=890446
https://bugs.launchpad.net/curtin/+bug/1728742

And to an upstream kernel patch:

https://lore.kernel.org/patchwork/patch/921298/

to

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1729145

that wasn't accepted upstream.

Even not being accepted upstream, the SRU was attempted:

LP: #1729145

https://lists.ubuntu.com/archives/kernel-team/2017-December/088680.html
https://lists.ubuntu.com/archives/kernel-team/2017-December/088679.html

Both were NACKED.

Attempted again:

https://lists.ubuntu.com/archives/kernel-team/2017-December/088682.html
https://lists.ubuntu.com/archives/kernel-team/2017-December/088683.html

NACKED again.

And a v2 was sent:

https://lists.ubuntu.com/archives/kernel-team/2017-December/088751.html
https://lists.ubuntu.com/archives/kernel-team/2017-December/088750.html
https://lists.ubuntu.com/archives/kernel-team/2017-December/088749.html

and acked in January 2018 by Coling:

https://lists.ubuntu.com/archives/kernel-team/2018-January/089492.html

but not upstreamed.

BIONIC contains the fix:

commit ed9333e1b583
Author: Ryan Harper <ryan.harper@canonical.com>
Date:   Mon Dec 11 12:12:01 2017

    UBUNTU: SAUCE: (no-up) bcache: decouple emitting a cached_dev CHANGE uevent
    
    BugLink: http://bugs.launchpad.net/bugs/1729145
    
    - decouple emitting a cached_dev CHANGE uevent which includes dev.uuid
      and dev.label from bch_cached_dev_run() which only happens when a
      bcacheX device is bound to the actual backing block device (bcache0 -> vdb)
    
    - update bch_cached_dev_run() to invoke bch_cached_dev_emit_change() as
      needed; no functional code path changes here
    
    - Modify register_bcache to detect a re-registering of a bcache
      cached_dev, and in that case call bcache_cached_dev_emit_change() to
    
    Signed-off-by: Ryan Harper <ryan.harper@canonical.com>
    Signed-off-by: Joseph Salisbury <joseph.salisbury@canonical.com>
    Acked-by: Colin Ian King <colin.king@canonical.com>
    Acked-by: Stefan Bader <stefan.bader@canonical.com>
    Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
    [ saf: fix incorrect indentation ]
    Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

FOCAL contains the fix:

commit 67553dcd7905
Author: Ryan Harper <ryan.harper@canonical.com>
Date:   Mon Dec 11 12:12:01 2017

    UBUNTU: SAUCE: (no-up) bcache: decouple emitting a cached_dev CHANGE
uevent

GROOVY contains the fix:

commit 67553dcd7905
Author: Ryan Harper <ryan.harper@canonical.com>
Date:   Mon Dec 11 12:12:01 2017

    UBUNTU: SAUCE: (no-up) bcache: decouple emitting a cached_dev CHANGE
uevent

----

So, the kernel patch wasn't accepted, nor bcache-tools patch by 
@raharper, the bcache-export-cached.

----

New Upstream summary from @raharper:

https://github.com/systemd/systemd/pull/16317#issuecomment-655647313

in the upstream merge request made by @rbalint.


** Bug watch added: Debian Bug tracker #890446
   https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=890446

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Triaged
Status in kunpeng920 ubuntu-18.04 series:
  Triaged
Status in kunpeng920 ubuntu-18.04-hwe series:
  Triaged
Status in kunpeng920 ubuntu-19.10 series:
  Fix Released
Status in kunpeng920 ubuntu-20.04 series:
  Fix Released
Status in kunpeng920 upstream-kernel series:
  Invalid
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Bionic:
  In Progress
Status in qemu source package in Eoan:
  Fix Released
Status in qemu source package in Focal:
  Fix Released

Bug description:
  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (85 preceding siblings ...)
  2020-07-21 20:02 ` Rafael David Tinoco
@ 2020-07-21 20:03 ` Rafael David Tinoco
  2020-07-31 18:51 ` Rafael David Tinoco
                   ` (7 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Rafael David Tinoco @ 2020-07-21 20:03 UTC (permalink / raw)
  To: qemu-devel

I've hidden last post as it was posted in the wrong bug.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Triaged
Status in kunpeng920 ubuntu-18.04 series:
  Triaged
Status in kunpeng920 ubuntu-18.04-hwe series:
  Triaged
Status in kunpeng920 ubuntu-19.10 series:
  Fix Released
Status in kunpeng920 ubuntu-20.04 series:
  Fix Released
Status in kunpeng920 upstream-kernel series:
  Invalid
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Bionic:
  In Progress
Status in qemu source package in Eoan:
  Fix Released
Status in qemu source package in Focal:
  Fix Released

Bug description:
  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (86 preceding siblings ...)
  2020-07-21 20:03 ` Rafael David Tinoco
@ 2020-07-31 18:51 ` Rafael David Tinoco
  2020-07-31 21:42 ` Rafael David Tinoco
                   ` (6 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Rafael David Tinoco @ 2020-07-31 18:51 UTC (permalink / raw)
  To: qemu-devel

** Description changed:

+ 
+ SRU TEAM REVIEWER: This has already been SRUed for Focal, Eoan and Bionic. Unfortunately the Bionic SRU did not work and we had to reverse the change. Since then we had another update and now I'm retrying the SRU.
+ 
+ After discussing with @paelzer (and @dannf as a reviewer) extensively,
+ Christian and I agreed that we should scope this SRU as Aarch64 only AND
+ I was much, much more conservative in question of what is being changed
+ in the AIO qemu code.
+ 
+ New code has been tested against the initial Test Case and the new one,
+ regressed for Bionic. More information (about tests and discussion) can
+ be found in the MR at ~rafaeldtinoco/ubuntu/+source/qemu:lp1805256
+ -bionic-refix
+ 
+ BIONIC REGRESSION BUG:
+ 
+ https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1885419
+ 
  [Impact]
  
  * QEMU locking primitives might face a race condition in QEMU Async I/O
  bottom halves scheduling. This leads to a dead lock making either QEMU
  or one of its tools to hang indefinitely.
  
  [Test Case]
+ 
+ INITIAL
  
  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2
  
  Hangs indefinitely approximately 30% of the runs in Aarch64.
  
  [Regression Potential]
  
  * This is a change to a core part of QEMU: The AIO scheduling. It works
  like a "kernel" scheduler, whereas kernel schedules OS tasks, the QEMU
  AIO code is responsible to schedule QEMU coroutines or event listeners
  callbacks.
  
  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the issue.
  Tested platforms were: amd64 and aarch64 based on his commit log.
  
  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.
  
  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.
+ 
+ BIONIC REGRESSED ISSUE
+ 
+ https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1885419
  
  [Other Info]
  
   * Original Description bellow:
  
  Command:
  
  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2
  
  Hangs indefinitely approximately 30% of the runs.
  
  ----
  
  Workaround:
  
  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2
  
  Run "qemu-img convert" with "a single coroutine" to avoid this issue.
  
  ----
  
  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...
  
  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()
  
  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start
  
  ----
  
  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2
  
  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]
  
  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]
  
  <HUNG w/ 3 threads in the stack trace showed before>
  """
  
  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).
  
  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).
  
  ----
  
  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:
  
  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2
  
  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.
  
  Once hung, attaching gdb gives the following backtrace:
  
  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975
  
  Reproduced w/ latest QEMU git (@ 53744e0a182)

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Triaged
Status in kunpeng920 ubuntu-18.04 series:
  Triaged
Status in kunpeng920 ubuntu-18.04-hwe series:
  Triaged
Status in kunpeng920 ubuntu-19.10 series:
  Fix Released
Status in kunpeng920 ubuntu-20.04 series:
  Fix Released
Status in kunpeng920 upstream-kernel series:
  Invalid
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Bionic:
  In Progress
Status in qemu source package in Eoan:
  Fix Released
Status in qemu source package in Focal:
  Fix Released

Bug description:
  
  SRU TEAM REVIEWER: This has already been SRUed for Focal, Eoan and Bionic. Unfortunately the Bionic SRU did not work and we had to reverse the change. Since then we had another update and now I'm retrying the SRU.

  After discussing with @paelzer (and @dannf as a reviewer) extensively,
  Christian and I agreed that we should scope this SRU as Aarch64 only
  AND I was much, much more conservative in question of what is being
  changed in the AIO qemu code.

  New code has been tested against the initial Test Case and the new
  one, regressed for Bionic. More information (about tests and
  discussion) can be found in the MR at
  ~rafaeldtinoco/ubuntu/+source/qemu:lp1805256-bionic-refix

  BIONIC REGRESSION BUG:

  https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1885419

  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  INITIAL

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  BIONIC REGRESSED ISSUE

  https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1885419

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (87 preceding siblings ...)
  2020-07-31 18:51 ` Rafael David Tinoco
@ 2020-07-31 21:42 ` Rafael David Tinoco
  2020-08-07  9:53 ` Timo Aaltonen
                   ` (5 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Rafael David Tinoco @ 2020-07-31 21:42 UTC (permalink / raw)
  To: qemu-devel

I just pushed/uploaded a SRU for bionic from:

https://code.launchpad.net/~rafaeldtinoco/ubuntu/+source/qemu/+git/qemu/+merge/387269

Waiting for SRU on it.


** Changed in: qemu (Ubuntu Bionic)
     Assignee: Rafael David Tinoco (rafaeldtinoco) => (unassigned)

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Triaged
Status in kunpeng920 ubuntu-18.04 series:
  Triaged
Status in kunpeng920 ubuntu-18.04-hwe series:
  Triaged
Status in kunpeng920 ubuntu-19.10 series:
  Fix Released
Status in kunpeng920 ubuntu-20.04 series:
  Fix Released
Status in kunpeng920 upstream-kernel series:
  Invalid
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Bionic:
  In Progress
Status in qemu source package in Eoan:
  Fix Released
Status in qemu source package in Focal:
  Fix Released

Bug description:
  
  SRU TEAM REVIEWER: This has already been SRUed for Focal, Eoan and Bionic. Unfortunately the Bionic SRU did not work and we had to reverse the change. Since then we had another update and now I'm retrying the SRU.

  After discussing with @paelzer (and @dannf as a reviewer) extensively,
  Christian and I agreed that we should scope this SRU as Aarch64 only
  AND I was much, much more conservative in question of what is being
  changed in the AIO qemu code.

  New code has been tested against the initial Test Case and the new
  one, regressed for Bionic. More information (about tests and
  discussion) can be found in the MR at
  ~rafaeldtinoco/ubuntu/+source/qemu:lp1805256-bionic-refix

  BIONIC REGRESSION BUG:

  https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1885419

  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  INITIAL

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  BIONIC REGRESSED ISSUE

  https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1885419

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (88 preceding siblings ...)
  2020-07-31 21:42 ` Rafael David Tinoco
@ 2020-08-07  9:53 ` Timo Aaltonen
  2020-08-07 14:41 ` [Bug 1805256] Autopkgtest regression report (qemu/1:2.11+dfsg-1ubuntu7.30) Ubuntu SRU Bot
                   ` (4 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Timo Aaltonen @ 2020-08-07  9:53 UTC (permalink / raw)
  To: qemu-devel

Hello dann, or anyone else affected,

Accepted qemu into bionic-proposed. The package will build now and be
available at https://launchpad.net/ubuntu/+source/qemu/1:2.11+dfsg-
1ubuntu7.30 in a few hours, and then in the -proposed repository.

Please help us by testing this new package.  See
https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how
to enable and use -proposed.  Your feedback will aid us getting this
update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug,
mentioning the version of the package you tested, what testing has been
performed on the package and change the tag from verification-needed-
bionic to verification-done-bionic. If it does not fix the bug for you,
please add a comment stating that, and change the tag to verification-
failed-bionic. In either case, without details of your testing we will
not be able to proceed.

Further information regarding the verification process can be found at
https://wiki.ubuntu.com/QATeam/PerformingSRUVerification .  Thank you in
advance for helping!

N.B. The updated package will be released to -updates after the bug(s)
fixed by this package have been verified and the package has been in
-proposed for a minimum of 7 days.

** Changed in: qemu (Ubuntu Bionic)
       Status: In Progress => Fix Committed

** Tags removed: verification-done-bionic
** Tags added: verification-needed verification-needed-bionic

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Triaged
Status in kunpeng920 ubuntu-18.04 series:
  Triaged
Status in kunpeng920 ubuntu-18.04-hwe series:
  Triaged
Status in kunpeng920 ubuntu-19.10 series:
  Fix Released
Status in kunpeng920 ubuntu-20.04 series:
  Fix Released
Status in kunpeng920 upstream-kernel series:
  Invalid
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Bionic:
  Fix Committed
Status in qemu source package in Eoan:
  Fix Released
Status in qemu source package in Focal:
  Fix Released

Bug description:
  
  SRU TEAM REVIEWER: This has already been SRUed for Focal, Eoan and Bionic. Unfortunately the Bionic SRU did not work and we had to reverse the change. Since then we had another update and now I'm retrying the SRU.

  After discussing with @paelzer (and @dannf as a reviewer) extensively,
  Christian and I agreed that we should scope this SRU as Aarch64 only
  AND I was much, much more conservative in question of what is being
  changed in the AIO qemu code.

  New code has been tested against the initial Test Case and the new
  one, regressed for Bionic. More information (about tests and
  discussion) can be found in the MR at
  ~rafaeldtinoco/ubuntu/+source/qemu:lp1805256-bionic-refix

  BIONIC REGRESSION BUG:

  https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1885419

  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  INITIAL

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  BIONIC REGRESSED ISSUE

  https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1885419

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Autopkgtest regression report (qemu/1:2.11+dfsg-1ubuntu7.30)
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (89 preceding siblings ...)
  2020-08-07  9:53 ` Timo Aaltonen
@ 2020-08-07 14:41 ` Ubuntu SRU Bot
  2020-08-07 20:13 ` [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images dann frazier
                   ` (3 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: Ubuntu SRU Bot @ 2020-08-07 14:41 UTC (permalink / raw)
  To: qemu-devel

All autopkgtests for the newly accepted qemu (1:2.11+dfsg-1ubuntu7.30) for bionic have finished running.
The following regressions have been reported in tests triggered by the package:

systemd/237-3ubuntu10.42 (i386)
ubuntu-image/1.9+18.04ubuntu1 (arm64, armhf, i386, s390x, ppc64el, amd64)
cinder/2:12.0.9-0ubuntu1.2 (s390x)


Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-
migration/bionic/update_excuses.html#qemu

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Triaged
Status in kunpeng920 ubuntu-18.04 series:
  Triaged
Status in kunpeng920 ubuntu-18.04-hwe series:
  Triaged
Status in kunpeng920 ubuntu-19.10 series:
  Fix Released
Status in kunpeng920 ubuntu-20.04 series:
  Fix Released
Status in kunpeng920 upstream-kernel series:
  Invalid
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Bionic:
  Fix Committed
Status in qemu source package in Eoan:
  Fix Released
Status in qemu source package in Focal:
  Fix Released

Bug description:
  
  SRU TEAM REVIEWER: This has already been SRUed for Focal, Eoan and Bionic. Unfortunately the Bionic SRU did not work and we had to reverse the change. Since then we had another update and now I'm retrying the SRU.

  After discussing with @paelzer (and @dannf as a reviewer) extensively,
  Christian and I agreed that we should scope this SRU as Aarch64 only
  AND I was much, much more conservative in question of what is being
  changed in the AIO qemu code.

  New code has been tested against the initial Test Case and the new
  one, regressed for Bionic. More information (about tests and
  discussion) can be found in the MR at
  ~rafaeldtinoco/ubuntu/+source/qemu:lp1805256-bionic-refix

  BIONIC REGRESSION BUG:

  https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1885419

  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  INITIAL

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  BIONIC REGRESSED ISSUE

  https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1885419

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (90 preceding siblings ...)
  2020-08-07 14:41 ` [Bug 1805256] Autopkgtest regression report (qemu/1:2.11+dfsg-1ubuntu7.30) Ubuntu SRU Bot
@ 2020-08-07 20:13 ` dann frazier
  2020-08-14 19:49 ` dann frazier
                   ` (2 subsequent siblings)
  94 siblings, 0 replies; 141+ messages in thread
From: dann frazier @ 2020-08-07 20:13 UTC (permalink / raw)
  To: qemu-devel

Verified w/ over 500 successful iterations on a m6g.metal instance, and
over 300 in an armhf chroot on the same.

** Tags removed: verification-needed verification-needed-bionic
** Tags added: verification-done verification-done-bionic

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Triaged
Status in kunpeng920 ubuntu-18.04 series:
  Triaged
Status in kunpeng920 ubuntu-18.04-hwe series:
  Triaged
Status in kunpeng920 ubuntu-19.10 series:
  Fix Released
Status in kunpeng920 ubuntu-20.04 series:
  Fix Released
Status in kunpeng920 upstream-kernel series:
  Invalid
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Bionic:
  Fix Committed
Status in qemu source package in Eoan:
  Fix Released
Status in qemu source package in Focal:
  Fix Released

Bug description:
  
  SRU TEAM REVIEWER: This has already been SRUed for Focal, Eoan and Bionic. Unfortunately the Bionic SRU did not work and we had to reverse the change. Since then we had another update and now I'm retrying the SRU.

  After discussing with @paelzer (and @dannf as a reviewer) extensively,
  Christian and I agreed that we should scope this SRU as Aarch64 only
  AND I was much, much more conservative in question of what is being
  changed in the AIO qemu code.

  New code has been tested against the initial Test Case and the new
  one, regressed for Bionic. More information (about tests and
  discussion) can be found in the MR at
  ~rafaeldtinoco/ubuntu/+source/qemu:lp1805256-bionic-refix

  BIONIC REGRESSION BUG:

  https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1885419

  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  INITIAL

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  BIONIC REGRESSED ISSUE

  https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1885419

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (91 preceding siblings ...)
  2020-08-07 20:13 ` [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images dann frazier
@ 2020-08-14 19:49 ` dann frazier
  2020-08-19 16:36 ` Launchpad Bug Tracker
  2020-08-19 17:16 ` Andrew Cloke
  94 siblings, 0 replies; 141+ messages in thread
From: dann frazier @ 2020-08-14 19:49 UTC (permalink / raw)
  To: qemu-devel

** Changed in: kunpeng920/ubuntu-18.04-hwe
       Status: Triaged => Fix Committed

** Changed in: kunpeng920/ubuntu-18.04
       Status: Triaged => Fix Committed

** Changed in: kunpeng920
       Status: Triaged => Fix Committed

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Fix Committed
Status in kunpeng920 ubuntu-18.04 series:
  Fix Committed
Status in kunpeng920 ubuntu-18.04-hwe series:
  Fix Committed
Status in kunpeng920 ubuntu-19.10 series:
  Fix Released
Status in kunpeng920 ubuntu-20.04 series:
  Fix Released
Status in kunpeng920 upstream-kernel series:
  Invalid
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Bionic:
  Fix Committed
Status in qemu source package in Eoan:
  Fix Released
Status in qemu source package in Focal:
  Fix Released

Bug description:
  
  SRU TEAM REVIEWER: This has already been SRUed for Focal, Eoan and Bionic. Unfortunately the Bionic SRU did not work and we had to reverse the change. Since then we had another update and now I'm retrying the SRU.

  After discussing with @paelzer (and @dannf as a reviewer) extensively,
  Christian and I agreed that we should scope this SRU as Aarch64 only
  AND I was much, much more conservative in question of what is being
  changed in the AIO qemu code.

  New code has been tested against the initial Test Case and the new
  one, regressed for Bionic. More information (about tests and
  discussion) can be found in the MR at
  ~rafaeldtinoco/ubuntu/+source/qemu:lp1805256-bionic-refix

  BIONIC REGRESSION BUG:

  https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1885419

  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  INITIAL

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  BIONIC REGRESSED ISSUE

  https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1885419

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (92 preceding siblings ...)
  2020-08-14 19:49 ` dann frazier
@ 2020-08-19 16:36 ` Launchpad Bug Tracker
  2020-08-19 17:16 ` Andrew Cloke
  94 siblings, 0 replies; 141+ messages in thread
From: Launchpad Bug Tracker @ 2020-08-19 16:36 UTC (permalink / raw)
  To: qemu-devel

This bug was fixed in the package qemu - 1:2.11+dfsg-1ubuntu7.31

---------------
qemu (1:2.11+dfsg-1ubuntu7.31) bionic-security; urgency=medium

  * SECURITY UPDATE: out-of-bounds read in slirp networking
    - debian/patches/CVE-2020-10756.patch: drop bogus IPv6 messages in
      slirp/src/ip6_input.c.
    - CVE-2020-10756
  * SECURITY UPDATE: out-of-bounds read and write in sm501
    - debian/patches/CVE-2020-12829-pre1.patch: use values from the pitch
      register for 2D operations.
    - debian/patches/CVE-2020-12829-pre2.patch: implement negated
      destination raster operation mode.
    - debian/patches/CVE-2020-12829-pre3.patch: log unimplemented raster
      operation modes.
    - debian/patches/CVE-2020-12829-pre4.patch: fix support for non-zero
      frame buffer start address.
    - debian/patches/CVE-2020-12829-pre5.patch: set updated region dirty
      after 2D operation.
    - debian/patches/CVE-2020-12829-pre6.patch: adjust endianness of pixel
      value in rectangle fill.
    - debian/patches/CVE-2020-12829-pre7.patch: convert printf +
      abort to qemu_log_mask.
    - debian/patches/CVE-2020-12829-pre8.patch: shorten long
      variable names in sm501_2d_operation.
    - debian/patches/CVE-2020-12829-pre9.patch: use BIT(x) macro to
      shorten constant.
    - debian/patches/CVE-2020-12829-pre10.patch: clean up local
      variables in sm501_2d_operation.
    - debian/patches/CVE-2020-12829.patch: replace hand written
      implementation with pixman where possible.
    - debian/patches/CVE-2020-12829-2.patch: optimize small overlapping
      blits.
    - debian/patches/CVE-2020-12829-3.patch: fix bounds checks.
    - debian/patches/CVE-2020-12829-4.patch: drop unneded variable.
    - debian/patches/CVE-2020-12829-5.patch: do not allow guest to set
      invalid format.
    - debian/patches/CVE-2020-12829-6.patch: introduce variable for
      commonly used value for better readability.
    - debian/patches/CVE-2020-12829-7.patch: fix and optimize overlap
      check.
    - CVE-2020-12829
  * SECURITY UPDATE: out-of-bounds read during sdhci_write() operations
    - debian/patches/CVE-2020-13253.patch: do not switch to ReceivingData
      if address is invalid in hw/sd/sd.c.
    - CVE-2020-13253
  * SECURITY UPDATE: out-of-bounds access during es1370_write() operation
    - debian/patches/CVE-2020-13361.patch: check total frame count against
      current frame in hw/audio/es1370.c.
    - CVE-2020-13361
  * SECURITY UPDATE: out-of-bounds read via crafted reply_queue_head
    - debian/patches/CVE-2020-13362-1.patch: use unsigned type for
      reply_queue_head and check index in hw/scsi/megasas.c.
    - debian/patches/CVE-2020-13362-2.patch: avoid NULL pointer dereference
      in hw/scsi/megasas.c.
    - debian/patches/CVE-2020-13362-3.patch: use unsigned type for positive
      numeric fields in hw/scsi/megasas.c.
    - CVE-2020-13362
  * SECURITY UPDATE: NULL pointer dereference related to BounceBuffer
    - debian/patches/CVE-2020-13659.patch: set map length to zero when
      returning NULL in exec.c, include/exec/memory.h.
    - CVE-2020-13659
  * SECURITY UPDATE: out-of-bounds access via msi-x mmio operation
    - debian/patches/CVE-2020-13754-1.patch: revert accepting mismatching
      sizes in memory_region_access_valid in memory.c.
    - debian/patches/CVE-2020-13754-2.patch: accept byte and word access to
      core ACPI registers in hw/acpi/core.c.
    - CVE-2020-13754
  * SECURITY UPDATE: invalid memory copy operation via rom_copy
    - debian/patches/CVE-2020-13765.patch: add extra check to
      hw/core/loader.c.
    - CVE-2020-13765
  * SECURITY UPDATE: buffer overflow in XGMAC Ethernet controller
    - debian/patches/CVE-2020-15863.patch: check bounds in hw/net/xgmac.c.
    - CVE-2020-15863
  * SECURITY UPDATE: reachable assertion failure
    - debian/patches/CVE-2020-16092.patch: fix assertion failure in
      hw/net/net_tx_pkt.c.
    - CVE-2020-16092

 -- Marc Deslauriers <marc.deslauriers@ubuntu.com>  Tue, 11 Aug 2020
13:19:33 -0400

** Changed in: qemu (Ubuntu Bionic)
       Status: Fix Committed => Fix Released

** CVE added: https://cve.mitre.org/cgi-bin/cvename.cgi?name=2020-10756

** CVE added: https://cve.mitre.org/cgi-bin/cvename.cgi?name=2020-12829

** CVE added: https://cve.mitre.org/cgi-bin/cvename.cgi?name=2020-13253

** CVE added: https://cve.mitre.org/cgi-bin/cvename.cgi?name=2020-13361

** CVE added: https://cve.mitre.org/cgi-bin/cvename.cgi?name=2020-13362

** CVE added: https://cve.mitre.org/cgi-bin/cvename.cgi?name=2020-13659

** CVE added: https://cve.mitre.org/cgi-bin/cvename.cgi?name=2020-13754

** CVE added: https://cve.mitre.org/cgi-bin/cvename.cgi?name=2020-13765

** CVE added: https://cve.mitre.org/cgi-bin/cvename.cgi?name=2020-15863

** CVE added: https://cve.mitre.org/cgi-bin/cvename.cgi?name=2020-16092

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Fix Committed
Status in kunpeng920 ubuntu-18.04 series:
  Fix Committed
Status in kunpeng920 ubuntu-18.04-hwe series:
  Fix Committed
Status in kunpeng920 ubuntu-19.10 series:
  Fix Released
Status in kunpeng920 ubuntu-20.04 series:
  Fix Released
Status in kunpeng920 upstream-kernel series:
  Invalid
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Bionic:
  Fix Released
Status in qemu source package in Eoan:
  Fix Released
Status in qemu source package in Focal:
  Fix Released

Bug description:
  
  SRU TEAM REVIEWER: This has already been SRUed for Focal, Eoan and Bionic. Unfortunately the Bionic SRU did not work and we had to reverse the change. Since then we had another update and now I'm retrying the SRU.

  After discussing with @paelzer (and @dannf as a reviewer) extensively,
  Christian and I agreed that we should scope this SRU as Aarch64 only
  AND I was much, much more conservative in question of what is being
  changed in the AIO qemu code.

  New code has been tested against the initial Test Case and the new
  one, regressed for Bionic. More information (about tests and
  discussion) can be found in the MR at
  ~rafaeldtinoco/ubuntu/+source/qemu:lp1805256-bionic-refix

  BIONIC REGRESSION BUG:

  https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1885419

  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  INITIAL

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  BIONIC REGRESSED ISSUE

  https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1885419

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
  2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
                   ` (93 preceding siblings ...)
  2020-08-19 16:36 ` Launchpad Bug Tracker
@ 2020-08-19 17:16 ` Andrew Cloke
  94 siblings, 0 replies; 141+ messages in thread
From: Andrew Cloke @ 2020-08-19 17:16 UTC (permalink / raw)
  To: qemu-devel

** Changed in: kunpeng920/ubuntu-18.04
       Status: Fix Committed => Fix Released

** No longer affects: kunpeng920/ubuntu-18.04-hwe

** Changed in: kunpeng920
       Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Fix Released
Status in kunpeng920 ubuntu-18.04 series:
  Fix Released
Status in kunpeng920 ubuntu-19.10 series:
  Fix Released
Status in kunpeng920 ubuntu-20.04 series:
  Fix Released
Status in kunpeng920 upstream-kernel series:
  Invalid
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Bionic:
  Fix Released
Status in qemu source package in Eoan:
  Fix Released
Status in qemu source package in Focal:
  Fix Released

Bug description:
  
  SRU TEAM REVIEWER: This has already been SRUed for Focal, Eoan and Bionic. Unfortunately the Bionic SRU did not work and we had to reverse the change. Since then we had another update and now I'm retrying the SRU.

  After discussing with @paelzer (and @dannf as a reviewer) extensively,
  Christian and I agreed that we should scope this SRU as Aarch64 only
  AND I was much, much more conservative in question of what is being
  changed in the AIO qemu code.

  New code has been tested against the initial Test Case and the new
  one, regressed for Bionic. More information (about tests and
  discussion) can be found in the MR at
  ~rafaeldtinoco/ubuntu/+source/qemu:lp1805256-bionic-refix

  BIONIC REGRESSION BUG:

  https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1885419

  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a dead lock making either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  INITIAL

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs in Aarch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: The AIO scheduling. It
  works like a "kernel" scheduler, whereas kernel schedules OS tasks,
  the QEMU AIO code is responsible to schedule QEMU coroutines or event
  listeners callbacks.

  * There was a long discussion upstream about primitives and Aarch64.
  After quite sometime Paolo released this patch and it solves the
  issue. Tested platforms were: amd64 and aarch64 based on his commit
  log.

  * Christian suggests that this fix stay little longer in -proposed to
  make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions; e.g. how
  long it takes to convert a cloud image on high-core systems.

  BIONIC REGRESSED ISSUE

  https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1885419

  [Other Info]

   * Original Description bellow:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xff