qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Rafael David Tinoco <rafaeldtinoco@kernelpath.com>
To: qemu-devel@nongnu.org
Subject: [Qemu-devel] [Bug 1805256] Re: qemu_futex_wait() lockups in ARM64: 2 possible issues
Date: Wed, 11 Sep 2019 19:09:25 -0000	[thread overview]
Message-ID: <1864070a-2f84-1d98-341e-f01ddf74ec4b@ubuntu.com> (raw)
Message-ID: <20190911190925.0vp6QDpasMkAzWDbTLeEX3JzBiThyvtc21CnE4sHnh8@z> (raw)
In-Reply-To: 154327283728.15443.11625169757714443608.malonedeb@soybean.canonical.com

> Zhengui's theory that notify_me doesn't work properly on ARM is more
> promising, but he couldn't provide a clear explanation of why he thought
> notify_me is involved.  In particular, I would have expected notify_me to
> be wrong if the qemu_poll_ns call came from aio_ctx_dispatch, for example:
> 
> 
>     glib_pollfds_fill
>       g_main_context_prepare
>         aio_ctx_prepare
>           atomic_or(&ctx->notify_me, 1)
>     qemu_poll_ns
>     glib_pollfds_poll
>       g_main_context_check
>         aio_ctx_check
>           atomic_and(&ctx->notify_me, ~1)
>       g_main_context_dispatch
>         aio_ctx_dispatch
>           /* do something for event */
>             qemu_poll_ns 
> 

Paolo,

I tried confining execution in a single NUMA domain (cpu & mem) and
still faced the issue, then, I added a mutex "ctx->notify_me_lcktest"
into context to protect "ctx->notify_me", like showed bellow, and it
seems to have either fixed or mitigated it.

I was able to cause the hung once every 3 or 4 runs. I have already ran
qemu-img convert more than 30 times now and couldn't reproduce it again.

Next step is to play with the barriers and check why existing ones
aren't enough for ordering access to ctx->notify_me ... or should I
try/do something else in your opinion ?

This arch/machine (Huawei D06):

$ lscpu
Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              96
On-line CPU(s) list: 0-95
Thread(s) per core:  1
Core(s) per socket:  48
Socket(s):           2
NUMA node(s):        4
Vendor ID:           0x48
Model:               0
Stepping:            0x0
CPU max MHz:         2000.0000
CPU min MHz:         200.0000
BogoMIPS:            200.00
L1d cache:           64K
L1i cache:           64K
L2 cache:            512K
L3 cache:            32768K
NUMA node0 CPU(s):   0-23
NUMA node1 CPU(s):   24-47
NUMA node2 CPU(s):   48-71
NUMA node3 CPU(s):   72-95
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics
cpuid asimdrdm dcpop

----

diff --git a/include/block/aio.h b/include/block/aio.h
index 0ca25dfec6..0724086d91 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -84,6 +84,7 @@ struct AioContext {
      * dispatch phase, hence a simple counter is enough for them.
      */
     uint32_t notify_me;
+    QemuMutex notify_me_lcktest;

     /* A lock to protect between QEMUBH and AioHandler adders and deleter,
      * and to ensure that no callbacks are removed while we're walking and
diff --git a/util/aio-posix.c b/util/aio-posix.c
index 51c41ed3c9..031d6e2997 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -529,7 +529,9 @@ static bool run_poll_handlers(AioContext *ctx,
int64_t max_ns, int64_t *timeout)
     bool progress;
     int64_t start_time, elapsed_time;

+    qemu_mutex_lock(&ctx->notify_me_lcktest);
     assert(ctx->notify_me);
+    qemu_mutex_unlock(&ctx->notify_me_lcktest);
     assert(qemu_lockcnt_count(&ctx->list_lock) > 0);

     trace_run_poll_handlers_begin(ctx, max_ns, *timeout);
@@ -601,8 +603,10 @@ bool aio_poll(AioContext *ctx, bool blocking)
      * so disable the optimization now.
      */
     if (blocking) {
+        qemu_mutex_lock(&ctx->notify_me_lcktest);
         assert(in_aio_context_home_thread(ctx));
         atomic_add(&ctx->notify_me, 2);
+        qemu_mutex_unlock(&ctx->notify_me_lcktest);
     }

     qemu_lockcnt_inc(&ctx->list_lock);
@@ -647,8 +651,10 @@ bool aio_poll(AioContext *ctx, bool blocking)
     }

     if (blocking) {
+        qemu_mutex_lock(&ctx->notify_me_lcktest);
         atomic_sub(&ctx->notify_me, 2);
         aio_notify_accept(ctx);
+        qemu_mutex_unlock(&ctx->notify_me_lcktest);
     }

     /* Adjust polling time */
diff --git a/util/async.c b/util/async.c
index c10642a385..140e1e86f5 100644
--- a/util/async.c
+++ b/util/async.c
@@ -221,7 +221,9 @@ aio_ctx_prepare(GSource *source, gint    *timeout)
 {
     AioContext *ctx = (AioContext *) source;

+    qemu_mutex_lock(&ctx->notify_me_lcktest);
     atomic_or(&ctx->notify_me, 1);
+    qemu_mutex_unlock(&ctx->notify_me_lcktest);

     /* We assume there is no timeout already supplied */
     *timeout = qemu_timeout_ns_to_ms(aio_compute_timeout(ctx));
@@ -239,8 +241,10 @@ aio_ctx_check(GSource *source)
     AioContext *ctx = (AioContext *) source;
     QEMUBH *bh;

+    qemu_mutex_lock(&ctx->notify_me_lcktest);
     atomic_and(&ctx->notify_me, ~1);
     aio_notify_accept(ctx);
+    qemu_mutex_unlock(&ctx->notify_me_lcktest);

     for (bh = ctx->first_bh; bh; bh = bh->next) {
         if (bh->scheduled) {
@@ -346,11 +350,13 @@ void aio_notify(AioContext *ctx)
     /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
      * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
      */
-    smp_mb();
+    //smp_mb();
+    qemu_mutex_lock(&ctx->notify_me_lcktest);
     if (ctx->notify_me) {
         event_notifier_set(&ctx->notifier);
         atomic_mb_set(&ctx->notified, true);
     }
+    qemu_mutex_unlock(&ctx->notify_me_lcktest);
 }

 void aio_notify_accept(AioContext *ctx)
@@ -424,6 +430,8 @@ AioContext *aio_context_new(Error **errp)
     ctx->co_schedule_bh = aio_bh_new(ctx, co_schedule_bh_cb, ctx);
     QSLIST_INIT(&ctx->scheduled_coroutines);

+    qemu_rec_mutex_init(&ctx->notify_me_lcktest);
+
     aio_set_event_notifier(ctx, &ctx->notifier,
                            false,
                            (EventNotifierHandler *)

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in QEMU:
  In Progress
Status in qemu package in Ubuntu:
  In Progress

Bug description:
  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
  ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  [ Original Description ]

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0  0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1  0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
      __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3  0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5  0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6  img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7  0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions


  parent reply	other threads:[~2019-09-11 19:21 UTC|newest]

Thread overview: 141+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
2018-11-26 23:26 ` [Qemu-devel] [Bug 1805256] " John Snow
2018-11-26 23:54 ` dann frazier
2018-12-05 11:20 ` Alex Bennée
2019-04-15 12:59 ` 贞贵李
2019-04-15 12:59   ` 贞贵李
2019-04-15 14:37 ` 贞贵李
2019-04-15 14:37   ` 贞贵李
2019-04-15 22:25 ` dann frazier
2019-04-15 22:25   ` dann frazier
2019-04-15 23:37 ` dann frazier
2019-04-15 23:37   ` dann frazier
2019-04-16  8:16 ` 贞贵李
2019-04-16  8:16   ` 贞贵李
2019-04-16 13:32 ` 贞贵李
2019-04-16 13:32   ` 贞贵李
2019-04-23  1:29 ` 贞贵李
2019-04-23  1:29   ` 贞贵李
2019-06-05 16:16 ` dann frazier
2019-09-05 15:03 ` Rafael David Tinoco
2019-09-06 15:12 ` Rafael David Tinoco
2019-09-06 15:16 ` Rafael David Tinoco
2019-09-06 21:22 ` Rafael David Tinoco
2019-09-09 16:47 ` Rafael David Tinoco
2019-09-10  2:04 ` Rafael David Tinoco
2019-09-10 14:16 ` Rafael David Tinoco
2019-09-10 18:15 ` [Qemu-devel] [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images Rafael David Tinoco
2019-09-10 22:56 ` Rafael David Tinoco
2019-09-11  2:15 ` [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues Rafael David Tinoco
2019-09-11  2:15   ` [Qemu-devel] [Bug 1805256] " Rafael David Tinoco
2019-09-11  7:05   ` [Qemu-devel] " Rafael David Tinoco
2019-09-11  7:05     ` [Qemu-devel] [Bug 1805256] " Rafael David Tinoco
2019-09-11 13:17   ` [Qemu-devel] " Paolo Bonzini
2019-09-11 14:48     ` Rafael David Tinoco
2019-09-11 19:09     ` Rafael David Tinoco [this message]
2019-09-11 19:09       ` [Qemu-devel] [Bug 1805256] " Rafael David Tinoco
2019-09-24 20:25       ` [Qemu-devel] " dann frazier
2019-09-24 20:25         ` [Bug 1805256] " dann frazier
2019-10-02  9:23         ` Jan Glauber
2019-10-02  9:23           ` Jan Glauber
2019-10-02  9:45           ` Paolo Bonzini
2019-10-02 11:05             ` Jan Glauber
2019-10-02 11:05               ` [Bug 1805256] " Jan Glauber
2019-10-02 13:20               ` memory barriers and ATOMIC_SEQ_CST on aarch64 (was Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues) Paolo Bonzini
2019-10-02 14:58                 ` Torvald Riegel
2019-10-02 16:30                   ` Paolo Bonzini
2019-10-07 11:06           ` [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues Paolo Bonzini
2019-10-07 14:36             ` Jan Glauber
2019-10-07 14:36               ` [Bug 1805256] " Jan Glauber
2019-10-07 14:44             ` dann frazier
2019-10-07 14:44               ` [Bug 1805256] " dann frazier
2019-10-07 14:58               ` Paolo Bonzini
2019-10-09  8:02                 ` Jan Glauber
2019-10-09  8:02                   ` [Bug 1805256] " Jan Glauber
2019-10-09  9:15                   ` Paolo Bonzini
2019-10-11  6:05                     ` Jan Glauber
2019-10-11  6:05                       ` [Bug 1805256] " Jan Glauber
2019-10-11  8:18                       ` Paolo Bonzini
2019-10-11  8:30                         ` Jan Glauber
2019-10-11  8:30                           ` [Bug 1805256] " Jan Glauber
2019-10-11 17:55                           ` dann frazier
2019-10-11 17:55                             ` dann frazier
2019-10-12  0:24                             ` [Bug 1805256] " no-reply
2019-10-12  0:49                             ` no-reply
2019-10-11 17:50                       ` dann frazier
2019-10-11 17:50                         ` [Bug 1805256] " dann frazier
2019-09-11  2:17 ` [Qemu-devel] [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images Rafael David Tinoco
2019-09-11 11:19 ` Rafael David Tinoco
2019-09-11 19:23 ` Rafael David Tinoco
2019-10-02 11:02 ` Jan Glauber
2019-10-03 12:28 ` Rafael David Tinoco
2019-10-03 12:29 ` Rafael David Tinoco
2019-10-03 12:29 ` Rafael David Tinoco
2019-10-03 21:35 ` dann frazier
2019-12-13 14:24 ` dann frazier
2019-12-17  1:34 ` Fred Kimmy
2019-12-17 19:17 ` dann frazier
2019-12-18  2:40 ` Rafael David Tinoco
2019-12-18  9:52 ` iveskim
2019-12-18 14:52 ` dann frazier
2019-12-18 16:21 ` Ubuntu Foundations Team Bug Bot
2020-02-13  8:41 ` Ike Panhc
2020-02-13  8:42 ` Andrew Cloke
2020-02-13  9:20 ` Fred Kimmy
2020-04-15  2:47 ` Rafael David Tinoco
2020-05-04  7:24 ` Ike Panhc
2020-05-05  0:54 ` Ike Panhc
2020-05-05  1:22 ` Ying Fang
2020-05-05  6:15 ` Ike Panhc
2020-05-05 15:01 ` Ike Panhc
2020-05-05 18:48 ` Rafael David Tinoco
2020-05-05 23:55 ` dann frazier
2020-05-06 13:08 ` Rafael David Tinoco
2020-05-06 13:23 ` Rafael David Tinoco
2020-05-06 15:45 ` Ike Panhc
2020-05-06 16:42 ` dann frazier
2020-05-06 19:04 ` Launchpad Bug Tracker
2020-05-06 19:09 ` Philippe Mathieu-Daudé
2020-05-06 19:57   ` dann frazier
2020-05-06 20:11 ` Rafael David Tinoco
2020-05-06 21:10 ` Launchpad Bug Tracker
2020-05-06 21:44 ` Launchpad Bug Tracker
2020-05-07  3:37 ` Launchpad Bug Tracker
2020-05-07  7:00 ` Ike Panhc
2020-05-07 22:27 ` dann frazier
2020-05-14  8:05 ` Andrew Cloke
2020-05-27  4:55 ` Christian Ehrhardt 
2020-05-28 14:58 ` Christian Ehrhardt 
2020-05-29  7:55 ` Launchpad Bug Tracker
2020-05-29  8:01 ` Christian Ehrhardt 
2020-06-02 22:45 ` Brian Murray
2020-06-02 22:49 ` [Bug 1805256] Please test proposed package Brian Murray
2020-06-02 22:54 ` Brian Murray
2020-06-03  4:09 ` [Bug 1805256] Autopkgtest regression report (qemu/1:4.0+dfsg-0ubuntu9.7) Ubuntu SRU Bot
2020-06-03  6:35 ` [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images Ike Panhc
2020-06-03  8:40 ` [Bug 1805256] Autopkgtest regression report (qemu/1:4.2-3ubuntu6.2) Ubuntu SRU Bot
2020-06-05  3:51 ` [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images Christian Ehrhardt 
2020-06-11  8:04 ` Andrew Cloke
2020-06-17  5:16 ` Christian Ehrhardt 
2020-06-18  9:23 ` Launchpad Bug Tracker
2020-06-18  9:23 ` [Bug 1805256] Update Released Łukasz Zemczak
2020-06-18  9:38 ` [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images Launchpad Bug Tracker
2020-06-18  9:39 ` Launchpad Bug Tracker
2020-06-18 10:27 ` Andrew Cloke
2020-06-30  6:54 ` Christian Ehrhardt 
2020-07-01  7:01 ` Ike Panhc
2020-07-12 13:16 ` Rafael David Tinoco
2020-07-13  3:59 ` Launchpad Bug Tracker
2020-07-13  4:12 ` Rafael David Tinoco
2020-07-15 15:31 ` dann frazier
2020-07-20 12:22 ` Rafael David Tinoco
2020-07-21 20:02 ` Rafael David Tinoco
2020-07-21 20:03 ` Rafael David Tinoco
2020-07-31 18:51 ` Rafael David Tinoco
2020-07-31 21:42 ` Rafael David Tinoco
2020-08-07  9:53 ` Timo Aaltonen
2020-08-07 14:41 ` [Bug 1805256] Autopkgtest regression report (qemu/1:2.11+dfsg-1ubuntu7.30) Ubuntu SRU Bot
2020-08-07 20:13 ` [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images dann frazier
2020-08-14 19:49 ` dann frazier
2020-08-19 16:36 ` Launchpad Bug Tracker
2020-08-19 17:16 ` Andrew Cloke

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1864070a-2f84-1d98-341e-f01ddf74ec4b@ubuntu.com \
    --to=rafaeldtinoco@kernelpath.com \
    --cc=1805256@bugs.launchpad.net \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).