From: dann frazier <dann.frazier@canonical.com>
To: qemu-devel@nongnu.org
Subject: [Bug 1805256] Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues
Date: Tue, 24 Sep 2019 20:25:17 -0000 [thread overview]
Message-ID: <20190924202517.GA21422@xps13.dannf> (raw)
Message-ID: <20190924202517.3dgHfsYhXVxQULX5KvVYf4j5XvBtMA8ovb_8xa8XGBI@z> (raw)
In-Reply-To: 1864070a-2f84-1d98-341e-f01ddf74ec4b@ubuntu.com
On Wed, Sep 11, 2019 at 04:09:25PM -0300, Rafael David Tinoco wrote:
> > Zhengui's theory that notify_me doesn't work properly on ARM is more
> > promising, but he couldn't provide a clear explanation of why he thought
> > notify_me is involved. In particular, I would have expected notify_me to
> > be wrong if the qemu_poll_ns call came from aio_ctx_dispatch, for example:
> >
> >
> > glib_pollfds_fill
> > g_main_context_prepare
> > aio_ctx_prepare
> > atomic_or(&ctx->notify_me, 1)
> > qemu_poll_ns
> > glib_pollfds_poll
> > g_main_context_check
> > aio_ctx_check
> > atomic_and(&ctx->notify_me, ~1)
> > g_main_context_dispatch
> > aio_ctx_dispatch
> > /* do something for event */
> > qemu_poll_ns
> >
>
> Paolo,
>
> I tried confining execution in a single NUMA domain (cpu & mem) and
> still faced the issue, then, I added a mutex "ctx->notify_me_lcktest"
> into context to protect "ctx->notify_me", like showed bellow, and it
> seems to have either fixed or mitigated it.
>
> I was able to cause the hung once every 3 or 4 runs. I have already ran
> qemu-img convert more than 30 times now and couldn't reproduce it again.
>
> Next step is to play with the barriers and check why existing ones
> aren't enough for ordering access to ctx->notify_me ... or should I
> try/do something else in your opinion ?
>
> This arch/machine (Huawei D06):
>
> $ lscpu
> Architecture: aarch64
> Byte Order: Little Endian
> CPU(s): 96
> On-line CPU(s) list: 0-95
> Thread(s) per core: 1
> Core(s) per socket: 48
> Socket(s): 2
> NUMA node(s): 4
> Vendor ID: 0x48
> Model: 0
> Stepping: 0x0
> CPU max MHz: 2000.0000
> CPU min MHz: 200.0000
> BogoMIPS: 200.00
> L1d cache: 64K
> L1i cache: 64K
> L2 cache: 512K
> L3 cache: 32768K
> NUMA node0 CPU(s): 0-23
> NUMA node1 CPU(s): 24-47
> NUMA node2 CPU(s): 48-71
> NUMA node3 CPU(s): 72-95
> Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics
> cpuid asimdrdm dcpop
Note that I'm also seeing this on a ThunderX2 (same calltrace):
$ lscpu
Architecture: aarch64
Byte Order: Little Endian
CPU(s): 224
On-line CPU(s) list: 0-223
Thread(s) per core: 4
Core(s) per socket: 28
Socket(s): 2
NUMA node(s): 2
Vendor ID: Cavium
Model: 1
Model name: ThunderX2 99xx
Stepping: 0x1
BogoMIPS: 400.00
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 32768K
NUMA node0 CPU(s): 0-111
NUMA node1 CPU(s): 112-223
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics cpuid asimdrdm
-dann
> ----
>
> diff --git a/include/block/aio.h b/include/block/aio.h
> index 0ca25dfec6..0724086d91 100644
> --- a/include/block/aio.h
> +++ b/include/block/aio.h
> @@ -84,6 +84,7 @@ struct AioContext {
> * dispatch phase, hence a simple counter is enough for them.
> */
> uint32_t notify_me;
> + QemuMutex notify_me_lcktest;
>
> /* A lock to protect between QEMUBH and AioHandler adders and deleter,
> * and to ensure that no callbacks are removed while we're walking and
> diff --git a/util/aio-posix.c b/util/aio-posix.c
> index 51c41ed3c9..031d6e2997 100644
> --- a/util/aio-posix.c
> +++ b/util/aio-posix.c
> @@ -529,7 +529,9 @@ static bool run_poll_handlers(AioContext *ctx,
> int64_t max_ns, int64_t *timeout)
> bool progress;
> int64_t start_time, elapsed_time;
>
> + qemu_mutex_lock(&ctx->notify_me_lcktest);
> assert(ctx->notify_me);
> + qemu_mutex_unlock(&ctx->notify_me_lcktest);
> assert(qemu_lockcnt_count(&ctx->list_lock) > 0);
>
> trace_run_poll_handlers_begin(ctx, max_ns, *timeout);
> @@ -601,8 +603,10 @@ bool aio_poll(AioContext *ctx, bool blocking)
> * so disable the optimization now.
> */
> if (blocking) {
> + qemu_mutex_lock(&ctx->notify_me_lcktest);
> assert(in_aio_context_home_thread(ctx));
> atomic_add(&ctx->notify_me, 2);
> + qemu_mutex_unlock(&ctx->notify_me_lcktest);
> }
>
> qemu_lockcnt_inc(&ctx->list_lock);
> @@ -647,8 +651,10 @@ bool aio_poll(AioContext *ctx, bool blocking)
> }
>
> if (blocking) {
> + qemu_mutex_lock(&ctx->notify_me_lcktest);
> atomic_sub(&ctx->notify_me, 2);
> aio_notify_accept(ctx);
> + qemu_mutex_unlock(&ctx->notify_me_lcktest);
> }
>
> /* Adjust polling time */
> diff --git a/util/async.c b/util/async.c
> index c10642a385..140e1e86f5 100644
> --- a/util/async.c
> +++ b/util/async.c
> @@ -221,7 +221,9 @@ aio_ctx_prepare(GSource *source, gint *timeout)
> {
> AioContext *ctx = (AioContext *) source;
>
> + qemu_mutex_lock(&ctx->notify_me_lcktest);
> atomic_or(&ctx->notify_me, 1);
> + qemu_mutex_unlock(&ctx->notify_me_lcktest);
>
> /* We assume there is no timeout already supplied */
> *timeout = qemu_timeout_ns_to_ms(aio_compute_timeout(ctx));
> @@ -239,8 +241,10 @@ aio_ctx_check(GSource *source)
> AioContext *ctx = (AioContext *) source;
> QEMUBH *bh;
>
> + qemu_mutex_lock(&ctx->notify_me_lcktest);
> atomic_and(&ctx->notify_me, ~1);
> aio_notify_accept(ctx);
> + qemu_mutex_unlock(&ctx->notify_me_lcktest);
>
> for (bh = ctx->first_bh; bh; bh = bh->next) {
> if (bh->scheduled) {
> @@ -346,11 +350,13 @@ void aio_notify(AioContext *ctx)
> /* Write e.g. bh->scheduled before reading ctx->notify_me. Pairs
> * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
> */
> - smp_mb();
> + //smp_mb();
> + qemu_mutex_lock(&ctx->notify_me_lcktest);
> if (ctx->notify_me) {
> event_notifier_set(&ctx->notifier);
> atomic_mb_set(&ctx->notified, true);
> }
> + qemu_mutex_unlock(&ctx->notify_me_lcktest);
> }
>
> void aio_notify_accept(AioContext *ctx)
> @@ -424,6 +430,8 @@ AioContext *aio_context_new(Error **errp)
> ctx->co_schedule_bh = aio_bh_new(ctx, co_schedule_bh_cb, ctx);
> QSLIST_INIT(&ctx->scheduled_coroutines);
>
> + qemu_rec_mutex_init(&ctx->notify_me_lcktest);
> +
> aio_set_event_notifier(ctx, &ctx->notifier,
> false,
> (EventNotifierHandler *)
>
--
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256
Title:
qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
converting images
Status in QEMU:
In Progress
Status in qemu package in Ubuntu:
In Progress
Status in qemu source package in Bionic:
New
Status in qemu source package in Disco:
New
Status in qemu source package in Eoan:
In Progress
Status in qemu source package in FF-Series:
New
Bug description:
Command:
qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2
Hangs indefinitely approximately 30% of the runs.
----
Workaround:
qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2
Run "qemu-img convert" with "a single coroutine" to avoid this issue.
----
(gdb) thread 1
...
(gdb) bt
#0 0x0000ffffbf1ad81c in __GI_ppoll
#1 0x0000aaaaaabcf73c in ppoll
#2 qemu_poll_ns
#3 0x0000aaaaaabd0764 in os_host_main_loop_wait
#4 main_loop_wait
...
(gdb) thread 2
...
(gdb) bt
#0 syscall ()
#1 0x0000aaaaaabd41cc in qemu_futex_wait
#2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
#3 0x0000aaaaaabed05c in call_rcu_thread
#4 0x0000aaaaaabd34c8 in qemu_thread_start
#5 0x0000ffffbf25c880 in start_thread
#6 0x0000ffffbf1b6b9c in thread_start ()
(gdb) thread 3
...
(gdb) bt
#0 0x0000ffffbf11aa20 in __GI___sigtimedwait
#1 0x0000ffffbf2671b4 in __sigwait
#2 0x0000aaaaaabd1ddc in sigwait_compat
#3 0x0000aaaaaabd34c8 in qemu_thread_start
#4 0x0000ffffbf25c880 in start_thread
#5 0x0000ffffbf1b6b9c in thread_start
----
(gdb) run
Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
./disk01.ext4.qcow2 ./output.qcow2
[New Thread 0xffffbec5ad90 (LWP 72839)]
[New Thread 0xffffbe459d90 (LWP 72840)]
[New Thread 0xffffbdb57d90 (LWP 72841)]
[New Thread 0xffffacac9d90 (LWP 72859)]
[New Thread 0xffffa7ffed90 (LWP 72860)]
[New Thread 0xffffa77fdd90 (LWP 72861)]
[New Thread 0xffffa6ffcd90 (LWP 72862)]
[New Thread 0xffffa67fbd90 (LWP 72863)]
[New Thread 0xffffa5ffad90 (LWP 72864)]
[Thread 0xffffa5ffad90 (LWP 72864) exited]
[Thread 0xffffa6ffcd90 (LWP 72862) exited]
[Thread 0xffffa77fdd90 (LWP 72861) exited]
[Thread 0xffffbdb57d90 (LWP 72841) exited]
[Thread 0xffffa67fbd90 (LWP 72863) exited]
[Thread 0xffffacac9d90 (LWP 72859) exited]
[Thread 0xffffa7ffed90 (LWP 72860) exited]
<HUNG w/ 3 threads in the stack trace showed before>
"""
All the tasks left are blocked in a system call, so no task left to call
qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
thread #1 (doing poll() in a pipe with thread #2).
Those 7 threads exit before disk conversion is complete (sometimes in
the beginning, sometimes at the end).
----
[ Original Description ]
On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
frequently hangs (~50% of the time) with this command:
qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2
Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
qcow2->qcow2 conversion happens to be something uvtool does every time
it fetches images.
Once hung, attaching gdb gives the following backtrace:
(gdb) bt
#0 0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
at ../sysdeps/unix/sysv/linux/ppoll.c:39
#1 0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
__fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
#2 qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
timeout=timeout@entry=-1) at util/qemu-timer.c:322
#3 0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
at util/main-loop.c:233
#4 main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
#5 0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
#6 img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
#7 0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975
Reproduced w/ latest QEMU git (@ 53744e0a182)
To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions
next prev parent reply other threads:[~2019-09-24 20:38 UTC|newest]
Thread overview: 141+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-11-26 22:53 [Qemu-devel] [Bug 1805256] [NEW] qemu-img hangs on high core count ARM system dann frazier
2018-11-26 23:26 ` [Qemu-devel] [Bug 1805256] " John Snow
2018-11-26 23:54 ` dann frazier
2018-12-05 11:20 ` Alex Bennée
2019-04-15 12:59 ` 贞贵李
2019-04-15 12:59 ` 贞贵李
2019-04-15 14:37 ` 贞贵李
2019-04-15 14:37 ` 贞贵李
2019-04-15 22:25 ` dann frazier
2019-04-15 22:25 ` dann frazier
2019-04-15 23:37 ` dann frazier
2019-04-15 23:37 ` dann frazier
2019-04-16 8:16 ` 贞贵李
2019-04-16 8:16 ` 贞贵李
2019-04-16 13:32 ` 贞贵李
2019-04-16 13:32 ` 贞贵李
2019-04-23 1:29 ` 贞贵李
2019-04-23 1:29 ` 贞贵李
2019-06-05 16:16 ` dann frazier
2019-09-05 15:03 ` Rafael David Tinoco
2019-09-06 15:12 ` Rafael David Tinoco
2019-09-06 15:16 ` Rafael David Tinoco
2019-09-06 21:22 ` Rafael David Tinoco
2019-09-09 16:47 ` Rafael David Tinoco
2019-09-10 2:04 ` Rafael David Tinoco
2019-09-10 14:16 ` Rafael David Tinoco
2019-09-10 18:15 ` [Qemu-devel] [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images Rafael David Tinoco
2019-09-10 22:56 ` Rafael David Tinoco
2019-09-11 2:15 ` [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues Rafael David Tinoco
2019-09-11 2:15 ` [Qemu-devel] [Bug 1805256] " Rafael David Tinoco
2019-09-11 7:05 ` [Qemu-devel] " Rafael David Tinoco
2019-09-11 7:05 ` [Qemu-devel] [Bug 1805256] " Rafael David Tinoco
2019-09-11 13:17 ` [Qemu-devel] " Paolo Bonzini
2019-09-11 14:48 ` Rafael David Tinoco
2019-09-11 19:09 ` Rafael David Tinoco
2019-09-11 19:09 ` [Qemu-devel] [Bug 1805256] " Rafael David Tinoco
2019-09-24 20:25 ` dann frazier [this message]
2019-09-24 20:25 ` [Bug 1805256] Re: [Qemu-devel] " dann frazier
2019-10-02 9:23 ` Jan Glauber
2019-10-02 9:23 ` Jan Glauber
2019-10-02 9:45 ` Paolo Bonzini
2019-10-02 11:05 ` Jan Glauber
2019-10-02 11:05 ` [Bug 1805256] " Jan Glauber
2019-10-02 13:20 ` memory barriers and ATOMIC_SEQ_CST on aarch64 (was Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues) Paolo Bonzini
2019-10-02 14:58 ` Torvald Riegel
2019-10-02 16:30 ` Paolo Bonzini
2019-10-07 11:06 ` [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues Paolo Bonzini
2019-10-07 14:36 ` Jan Glauber
2019-10-07 14:36 ` [Bug 1805256] " Jan Glauber
2019-10-07 14:44 ` dann frazier
2019-10-07 14:44 ` [Bug 1805256] " dann frazier
2019-10-07 14:58 ` Paolo Bonzini
2019-10-09 8:02 ` Jan Glauber
2019-10-09 8:02 ` [Bug 1805256] " Jan Glauber
2019-10-09 9:15 ` Paolo Bonzini
2019-10-11 6:05 ` Jan Glauber
2019-10-11 6:05 ` [Bug 1805256] " Jan Glauber
2019-10-11 8:18 ` Paolo Bonzini
2019-10-11 8:30 ` Jan Glauber
2019-10-11 8:30 ` [Bug 1805256] " Jan Glauber
2019-10-11 17:55 ` dann frazier
2019-10-11 17:55 ` dann frazier
2019-10-12 0:24 ` [Bug 1805256] " no-reply
2019-10-12 0:49 ` no-reply
2019-10-11 17:50 ` dann frazier
2019-10-11 17:50 ` [Bug 1805256] " dann frazier
2019-09-11 2:17 ` [Qemu-devel] [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images Rafael David Tinoco
2019-09-11 11:19 ` Rafael David Tinoco
2019-09-11 19:23 ` Rafael David Tinoco
2019-10-02 11:02 ` Jan Glauber
2019-10-03 12:28 ` Rafael David Tinoco
2019-10-03 12:29 ` Rafael David Tinoco
2019-10-03 12:29 ` Rafael David Tinoco
2019-10-03 21:35 ` dann frazier
2019-12-13 14:24 ` dann frazier
2019-12-17 1:34 ` Fred Kimmy
2019-12-17 19:17 ` dann frazier
2019-12-18 2:40 ` Rafael David Tinoco
2019-12-18 9:52 ` iveskim
2019-12-18 14:52 ` dann frazier
2019-12-18 16:21 ` Ubuntu Foundations Team Bug Bot
2020-02-13 8:41 ` Ike Panhc
2020-02-13 8:42 ` Andrew Cloke
2020-02-13 9:20 ` Fred Kimmy
2020-04-15 2:47 ` Rafael David Tinoco
2020-05-04 7:24 ` Ike Panhc
2020-05-05 0:54 ` Ike Panhc
2020-05-05 1:22 ` Ying Fang
2020-05-05 6:15 ` Ike Panhc
2020-05-05 15:01 ` Ike Panhc
2020-05-05 18:48 ` Rafael David Tinoco
2020-05-05 23:55 ` dann frazier
2020-05-06 13:08 ` Rafael David Tinoco
2020-05-06 13:23 ` Rafael David Tinoco
2020-05-06 15:45 ` Ike Panhc
2020-05-06 16:42 ` dann frazier
2020-05-06 19:04 ` Launchpad Bug Tracker
2020-05-06 19:09 ` Philippe Mathieu-Daudé
2020-05-06 19:57 ` dann frazier
2020-05-06 20:11 ` Rafael David Tinoco
2020-05-06 21:10 ` Launchpad Bug Tracker
2020-05-06 21:44 ` Launchpad Bug Tracker
2020-05-07 3:37 ` Launchpad Bug Tracker
2020-05-07 7:00 ` Ike Panhc
2020-05-07 22:27 ` dann frazier
2020-05-14 8:05 ` Andrew Cloke
2020-05-27 4:55 ` Christian Ehrhardt
2020-05-28 14:58 ` Christian Ehrhardt
2020-05-29 7:55 ` Launchpad Bug Tracker
2020-05-29 8:01 ` Christian Ehrhardt
2020-06-02 22:45 ` Brian Murray
2020-06-02 22:49 ` [Bug 1805256] Please test proposed package Brian Murray
2020-06-02 22:54 ` Brian Murray
2020-06-03 4:09 ` [Bug 1805256] Autopkgtest regression report (qemu/1:4.0+dfsg-0ubuntu9.7) Ubuntu SRU Bot
2020-06-03 6:35 ` [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images Ike Panhc
2020-06-03 8:40 ` [Bug 1805256] Autopkgtest regression report (qemu/1:4.2-3ubuntu6.2) Ubuntu SRU Bot
2020-06-05 3:51 ` [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images Christian Ehrhardt
2020-06-11 8:04 ` Andrew Cloke
2020-06-17 5:16 ` Christian Ehrhardt
2020-06-18 9:23 ` Launchpad Bug Tracker
2020-06-18 9:23 ` [Bug 1805256] Update Released Łukasz Zemczak
2020-06-18 9:38 ` [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images Launchpad Bug Tracker
2020-06-18 9:39 ` Launchpad Bug Tracker
2020-06-18 10:27 ` Andrew Cloke
2020-06-30 6:54 ` Christian Ehrhardt
2020-07-01 7:01 ` Ike Panhc
2020-07-12 13:16 ` Rafael David Tinoco
2020-07-13 3:59 ` Launchpad Bug Tracker
2020-07-13 4:12 ` Rafael David Tinoco
2020-07-15 15:31 ` dann frazier
2020-07-20 12:22 ` Rafael David Tinoco
2020-07-21 20:02 ` Rafael David Tinoco
2020-07-21 20:03 ` Rafael David Tinoco
2020-07-31 18:51 ` Rafael David Tinoco
2020-07-31 21:42 ` Rafael David Tinoco
2020-08-07 9:53 ` Timo Aaltonen
2020-08-07 14:41 ` [Bug 1805256] Autopkgtest regression report (qemu/1:2.11+dfsg-1ubuntu7.30) Ubuntu SRU Bot
2020-08-07 20:13 ` [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images dann frazier
2020-08-14 19:49 ` dann frazier
2020-08-19 16:36 ` Launchpad Bug Tracker
2020-08-19 17:16 ` Andrew Cloke
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190924202517.GA21422@xps13.dannf \
--to=dann.frazier@canonical.com \
--cc=1805256@bugs.launchpad.net \
--cc=qemu-devel@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).