From: dann frazier <dann.frazier@canonical.com>
To: qemu-devel@nongnu.org
Subject: [Bug 1805256] Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues
Date: Fri, 11 Oct 2019 17:55:36 -0000
Message-ID: <20191011175536.GB25464@xps13.dannf>
In-Reply-To: 20191011082954.GA10493@hc

On Fri, Oct 11, 2019 at 08:30:02AM +0000, Jan Glauber wrote:
> On Fri, Oct 11, 2019 at 10:18:18AM +0200, Paolo Bonzini wrote:
> > On 11/10/19 08:05, Jan Glauber wrote:
> > > On Wed, Oct 09, 2019 at 11:15:04AM +0200, Paolo Bonzini wrote:
> > >>> ...but if I bump notify_me size to uint64_t the issue goes away.
> > >>
> > >> Ouch. :) Is this with or without my patch(es)?
> >
> > You didn't answer this question.
>
> Oh, sorry... I did but the mail probably didn't make it out.
> I have both of your changes applied (as I think they make sense).
>
> > >> Also, what if you just add a dummy uint32_t after notify_me?
> > >
> > > With the dummy the testcase also runs fine for 500 iterations.
> >
> > You might be lucky and causing list_lock to be in another cache line.
> > What if you add __attribute__((aligned(16)) to notify_me (and keep the
> > dummy)?
>
> Good point. I'll try to force both into the same cacheline.

On the Hi1620, this still hangs in the first iteration:

diff --git a/include/block/aio.h b/include/block/aio.h
index 6b0d52f732..00e56a5412 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -82,7 +82,7 @@ struct AioContext {
      * Instead, the aio_poll calls include both the prepare and the
      * dispatch phase, hence a simple counter is enough for them.
      */
-    uint32_t notify_me;
+    __attribute__((aligned(16))) uint64_t notify_me;
 
     /* A lock to protect between QEMUBH and AioHandler adders and deleter,
      * and to ensure that no callbacks are removed while we're walking and
diff --git a/util/async.c b/util/async.c
index ca83e32c7f..024c4c567d 100644
--- a/util/async.c
+++ b/util/async.c
@@ -242,7 +242,7 @@ aio_ctx_check(GSource *source)
     aio_notify_accept(ctx);
 
     for (bh = ctx->first_bh; bh; bh = bh->next) {
-        if (bh->scheduled) {
+        if (atomic_mb_read(&bh->scheduled)) {
             return true;
         }
     }
@@ -342,12 +342,12 @@ LinuxAioState *aio_get_linux_aio(AioContext *ctx)
 
 void aio_notify(AioContext *ctx)
 {
-    /* Write e.g. bh->scheduled before reading ctx->notify_me. Pairs
-     * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
+    /* Using atomic_mb_read ensures that e.g. bh->scheduled is written before
+     * ctx->notify_me is read. Pairs with atomic_or in aio_ctx_prepare or
+     * atomic_add in aio_poll.
      */
-    smp_mb();
-    if (ctx->notify_me) {
-        event_notifier_set(&ctx->notifier);
+    if (atomic_mb_read(&ctx->notify_me)) {
+        event_notifier_set(&ctx->notifier);
         atomic_mb_set(&ctx->notified, true);
     }
 }

-- 
You received this bug notification because you are a member of
qemu-devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images

Status in kunpeng920: New
Status in QEMU: In Progress
Status in qemu package in Ubuntu: In Progress
Status in qemu source package in Bionic: New
Status in qemu source package in Disco: New
Status in qemu source package in Eoan: In Progress
Status in qemu source package in FF-Series: New

Bug description:
  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely approximately 30% of the runs.

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2 ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  <HUNG w/ 3 threads in the stack trace showed before>
  """

  All the tasks left are blocked in a system call, so no task left to call
  qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock
  thread #1 (doing poll() in a pipe with thread #2).

  Those 7 threads exit before disk conversion is complete (sometimes in
  the beginning, sometimes at the end).

  ----

  [ Original Description ]

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0 0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760, timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950) at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1 0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2 qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3 0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1) at util/main-loop.c:233
  #4 main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
  #5 0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6 img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
  #7 0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions
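
The cache-line question from the quoted exchange (does a dummy word after notify_me push list_lock into a different line, and does aligned(16) keep the two together?) can be checked on paper with a small layout sketch. This is only an illustration, not QEMU code: the three structs are hypothetical cut-down stand-ins for the relevant corner of AioContext, struct lock_standin is an assumed placeholder for the real lock type, the 64-byte line size is an assumption (the real value is CPU-specific), and __attribute__((aligned)) is the GCC/Clang spelling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. Check the target CPU before relying on it. */
#define CACHE_LINE 64

/* Hypothetical stand-in for the lock that follows notify_me. */
struct lock_standin {
    unsigned count;
};

/* Original layout: 32-bit counter directly followed by the lock. */
struct ctx_orig {
    uint32_t notify_me;
    struct lock_standin list_lock;
};

/* Variant discussed in the thread: dummy 32-bit word after notify_me. */
struct ctx_dummy {
    uint32_t notify_me;
    uint32_t dummy;
    struct lock_standin list_lock;
};

/* Variant from the diff: 16-byte aligned 64-bit counter (GCC/Clang syntax). */
struct ctx_aligned {
    __attribute__((aligned(16))) uint64_t notify_me;
    struct lock_standin list_lock;
};

/* The member alignment propagates to the enclosing struct. */
static_assert(_Alignof(struct ctx_aligned) == 16,
              "aligned(16) member should raise the struct alignment");

/* True if two byte offsets, measured from a line-aligned base, share a line. */
static bool same_cache_line(size_t a, size_t b)
{
    return a / CACHE_LINE == b / CACHE_LINE;
}

/*
 * With a 16-byte aligned counter the lock sits 8 bytes later, so the pair
 * cannot straddle a 64-byte boundary. Check every 16-aligned start offset.
 */
static bool aligned_variant_always_shares_line(void)
{
    for (size_t start = 0; start < CACHE_LINE; start += 16) {
        size_t lock = start + offsetof(struct ctx_aligned, list_lock);
        if (!same_cache_line(start, lock)) {
            return false;
        }
    }
    return true;
}
```

As a usage example under the same assumptions: if notify_me happened to sit at offset 56 within a line (a made-up number), same_cache_line(56, 56 + offsetof(struct ctx_orig, list_lock)) is true, while the same call with struct ctx_dummy is false, which is the "lucky" case described in the quoted text. The aligned variant keeps both fields in one line for any start offset, so the hang reported above occurs with the two fields sharing a line.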