From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-3.8 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9F089ECE58D for ; Fri, 11 Oct 2019 18:04:35 +0000 (UTC) Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 7071D21A4A for ; Fri, 11 Oct 2019 18:04:35 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 7071D21A4A Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=canonical.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Received: from localhost ([::1]:55106 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1iIzHF-0006cU-HS for qemu-devel@archiver.kernel.org; Fri, 11 Oct 2019 14:04:33 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:53876) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1iIzDs-00043Z-Ru for qemu-devel@nongnu.org; Fri, 11 Oct 2019 14:01:06 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1iIzDp-0000b8-4E for qemu-devel@nongnu.org; Fri, 11 Oct 2019 14:01:04 -0400 Received: from indium.canonical.com ([91.189.90.7]:39562) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1iIzDo-0000Zi-U7 for qemu-devel@nongnu.org; Fri, 11 Oct 2019 14:01:01 -0400 Received: from loganberry.canonical.com ([91.189.90.37]) by indium.canonical.com with esmtp (Exim 4.86_2 #2 (Debian)) id 1iIzDn-00060S-0u for ; Fri, 11 Oct 2019 18:00:59 +0000 Received: from loganberry.canonical.com (localhost [127.0.0.1]) by loganberry.canonical.com (Postfix) with ESMTP id 03B5F2E80C7 for ; Fri, 11 Oct 2019 18:00:59 +0000 (UTC) MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Date: Fri, 11 Oct 2019 17:55:36 -0000 From: dann frazier To: qemu-devel@nongnu.org X-Launchpad-Notification-Type: bug X-Launchpad-Bug: product=kunpeng920; status=New; importance=Undecided; assignee=None; X-Launchpad-Bug: product=qemu; status=In Progress; importance=Undecided; assignee=rafaeldtinoco@kernelpath.com; X-Launchpad-Bug: distribution=ubuntu; sourcepackage=qemu; component=main; status=In Progress; importance=Medium; assignee=rafaeldtinoco@kernelpath.com; X-Launchpad-Bug: distribution=ubuntu; distroseries=bionic; sourcepackage=qemu; component=main; status=New; importance=Medium; assignee=None; X-Launchpad-Bug: distribution=ubuntu; distroseries=disco; sourcepackage=qemu; component=main; status=New; importance=Medium; assignee=None; X-Launchpad-Bug: distribution=ubuntu; distroseries=eoan; sourcepackage=qemu; component=main; status=In Progress; importance=Medium; assignee=rafaeldtinoco@kernelpath.com; X-Launchpad-Bug: distribution=ubuntu; distroseries=ff-series; sourcepackage=qemu; component=None; status=New; importance=Medium; assignee=None; X-Launchpad-Bug-Tags: qemu-img X-Launchpad-Bug-Information-Type: Public X-Launchpad-Bug-Private: no X-Launchpad-Bug-Security-Vulnerability: no X-Launchpad-Bug-Commenters: dannf jan-glauber-i jnsnow lizhengui rafaeldtinoco X-Launchpad-Bug-Reporter: dann frazier (dannf) X-Launchpad-Bug-Modifier: dann frazier (dannf) References: <154327283728.15443.11625169757714443608.malonedeb@soybean.canonical.com> <20191011082954.GA10493@hc> Message-Id: <20191011175536.GB25464@xps13.dannf> Subject: [Bug 1805256] Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues X-Launchpad-Message-Rationale: Subscriber (QEMU) @qemu-devel-ml X-Launchpad-Message-For: qemu-devel-ml Precedence: bulk X-Generated-By: Launchpad (canonical.com); Revision="af2eefe214bd95389a09b7c956720881bab16807"; Instance="production-secrets-lazr.conf" X-Launchpad-Hash: a88a00f130f18559a7be3c18a21d7f606269868d X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] [fuzzy] X-Received-From: 91.189.90.7 X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.23 List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Reply-To: Bug 1805256 <1805256@bugs.launchpad.net> Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" On Fri, Oct 11, 2019 at 08:30:02AM +0000, Jan Glauber wrote: > On Fri, Oct 11, 2019 at 10:18:18AM +0200, Paolo Bonzini wrote: > > On 11/10/19 08:05, Jan Glauber wrote: > > > On Wed, Oct 09, 2019 at 11:15:04AM +0200, Paolo Bonzini wrote: > > >>> ...but if I bump notify_me size to uint64_t the issue goes away. > > >> > > >> Ouch. :) Is this with or without my patch(es)? > > = > > You didn't answer this question. > = > Oh, sorry... I did but the mail probably didn't make it out. > I have both of your changes applied (as I think they make sense). > = > > >> Also, what if you just add a dummy uint32_t after notify_me? > > > = > > > With the dummy the testcase also runs fine for 500 iterations. > > = > > You might be lucky and causing list_lock to be in another cache line. > > What if you add __attribute__((aligned(16)) to notify_me (and keep the > > dummy)? > = > Good point. I'll try to force both into the same cacheline. On the Hi1620, this still hangs in the first iteration: diff --git a/include/block/aio.h b/include/block/aio.h index 6b0d52f732..00e56a5412 100644 --- a/include/block/aio.h +++ b/include/block/aio.h @@ -82,7 +82,7 @@ struct AioContext { * Instead, the aio_poll calls include both the prepare and the * dispatch phase, hence a simple counter is enough for them. */ - uint32_t notify_me; + __attribute__((aligned(16))) uint64_t notify_me; = /* A lock to protect between QEMUBH and AioHandler adders and deleter, * and to ensure that no callbacks are removed while we're walking and diff --git a/util/async.c b/util/async.c index ca83e32c7f..024c4c567d 100644 --- a/util/async.c +++ b/util/async.c @@ -242,7 +242,7 @@ aio_ctx_check(GSource *source) aio_notify_accept(ctx); = for (bh =3D ctx->first_bh; bh; bh =3D bh->next) { - if (bh->scheduled) { + if (atomic_mb_read(&bh->scheduled)) { return true; } } @@ -342,12 +342,12 @@ LinuxAioState *aio_get_linux_aio(AioContext *ctx) = void aio_notify(AioContext *ctx) { - /* Write e.g. bh->scheduled before reading ctx->notify_me. Pairs - * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll. + /* Using atomic_mb_read ensures that e.g. bh->scheduled is written bef= ore + * ctx->notify_me is read. Pairs with atomic_or in aio_ctx_prepare or + * atomic_add in aio_poll. */ - smp_mb(); - if (ctx->notify_me) { - event_notifier_set(&ctx->notifier); + if (atomic_mb_read(&ctx->notify_me)) { + event_notifier_set(&ctx->notifier); atomic_mb_set(&ctx->notified, true); } } -- = You received this bug notification because you are a member of qemu- devel-ml, which is subscribed to QEMU. https://bugs.launchpad.net/bugs/1805256 Title: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images Status in kunpeng920: New Status in QEMU: In Progress Status in qemu package in Ubuntu: In Progress Status in qemu source package in Bionic: New Status in qemu source package in Disco: New Status in qemu source package in Eoan: In Progress Status in qemu source package in FF-Series: New Bug description: Command: qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2 Hangs indefinitely approximately 30% of the runs. ---- Workaround: qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2 Run "qemu-img convert" with "a single coroutine" to avoid this issue. ---- (gdb) thread 1 ... (gdb) bt #0 0x0000ffffbf1ad81c in __GI_ppoll #1 0x0000aaaaaabcf73c in ppoll #2 qemu_poll_ns #3 0x0000aaaaaabd0764 in os_host_main_loop_wait #4 main_loop_wait ... (gdb) thread 2 ... (gdb) bt #0 syscall () #1 0x0000aaaaaabd41cc in qemu_futex_wait #2 qemu_event_wait (ev=3Dev@entry=3D0xaaaaaac86ce8 ) #3 0x0000aaaaaabed05c in call_rcu_thread #4 0x0000aaaaaabd34c8 in qemu_thread_start #5 0x0000ffffbf25c880 in start_thread #6 0x0000ffffbf1b6b9c in thread_start () (gdb) thread 3 ... (gdb) bt #0 0x0000ffffbf11aa20 in __GI___sigtimedwait #1 0x0000ffffbf2671b4 in __sigwait #2 0x0000aaaaaabd1ddc in sigwait_compat #3 0x0000aaaaaabd34c8 in qemu_thread_start #4 0x0000ffffbf25c880 in start_thread #5 0x0000ffffbf1b6b9c in thread_start ---- (gdb) run Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2 ./disk01.ext4.qcow2 ./output.qcow2 [New Thread 0xffffbec5ad90 (LWP 72839)] [New Thread 0xffffbe459d90 (LWP 72840)] [New Thread 0xffffbdb57d90 (LWP 72841)] [New Thread 0xffffacac9d90 (LWP 72859)] [New Thread 0xffffa7ffed90 (LWP 72860)] [New Thread 0xffffa77fdd90 (LWP 72861)] [New Thread 0xffffa6ffcd90 (LWP 72862)] [New Thread 0xffffa67fbd90 (LWP 72863)] [New Thread 0xffffa5ffad90 (LWP 72864)] [Thread 0xffffa5ffad90 (LWP 72864) exited] [Thread 0xffffa6ffcd90 (LWP 72862) exited] [Thread 0xffffa77fdd90 (LWP 72861) exited] [Thread 0xffffbdb57d90 (LWP 72841) exited] [Thread 0xffffa67fbd90 (LWP 72863) exited] [Thread 0xffffacac9d90 (LWP 72859) exited] [Thread 0xffffa7ffed90 (LWP 72860) exited] """ All the tasks left are blocked in a system call, so no task left to call qemu_futex_wake() to unblock thread #2 (in futex()), which would unblock thread #1 (doing poll() in a pipe with thread #2). Those 7 threads exit before disk conversion is complete (sometimes in the beginning, sometimes at the end). ---- [ Original Description ] On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img frequently hangs (~50% of the time) with this command: qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2 Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This qcow2->qcow2 conversion happens to be something uvtool does every time it fetches images. Once hung, attaching gdb gives the following backtrace: (gdb) bt #0 0x0000ffffae4f8154 in __GI_ppoll (fds=3D0xaaaae8a67dc0, nfds=3D187650= 274213760, =C2=A0=C2=A0=C2=A0=C2=A0timeout=3D, timeout@entry=3D0x0, s= igmask=3D0xffffc123b950) =C2=A0=C2=A0=C2=A0=C2=A0at ../sysdeps/unix/sysv/linux/ppoll.c:39 #1 0x0000aaaabbefaf00 in ppoll (__ss=3D0x0, __timeout=3D0x0, __nfds=3D, =C2=A0=C2=A0=C2=A0=C2=A0__fds=3D) at /usr/include/aarch64-= linux-gnu/bits/poll2.h:77 #2 qemu_poll_ns (fds=3D, nfds=3D, =C2=A0=C2=A0=C2=A0=C2=A0timeout=3Dtimeout@entry=3D-1) at util/qemu-timer.= c:322 #3 0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=3D-1) =C2=A0=C2=A0=C2=A0=C2=A0at util/main-loop.c:233 #4 main_loop_wait (nonblocking=3D) at util/main-loop.c:497 #5 0x0000aaaabbe2aa30 in convert_do_copy (s=3D0xffffc123bb58) at qemu-im= g.c:1980 #6 img_convert (argc=3D, argv=3D) at qemu-= img.c:2456 #7 0x0000aaaabbe2333c in main (argc=3D7, argv=3D) at qemu= -img.c:4975 Reproduced w/ latest QEMU git (@ 53744e0a182) To manage notifications about this bug go to: https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-5.3 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_SANE_1 autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4D70BC47404 for ; Fri, 11 Oct 2019 18:54:39 +0000 (UTC) Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 1F4F72089F for ; Fri, 11 Oct 2019 18:54:39 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 1F4F72089F Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=canonical.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Received: from localhost ([::1]:55900 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1iJ03i-0005V0-BX for qemu-devel@archiver.kernel.org; Fri, 11 Oct 2019 14:54:38 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:53152) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1iIz8f-0006el-LK for qemu-devel@nongnu.org; Fri, 11 Oct 2019 13:55:43 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1iIz8e-00069V-9h for qemu-devel@nongnu.org; Fri, 11 Oct 2019 13:55:41 -0400 Received: from youngberry.canonical.com ([91.189.89.112]:35073) by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1iIz8e-00067g-41 for qemu-devel@nongnu.org; Fri, 11 Oct 2019 13:55:40 -0400 Received: from mail-io1-f70.google.com ([209.85.166.70]) by youngberry.canonical.com with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.86_2) (envelope-from ) id 1iIz8b-0007uo-Ui for qemu-devel@nongnu.org; Fri, 11 Oct 2019 17:55:38 +0000 Received: by mail-io1-f70.google.com with SMTP id r20so15689610ioh.7 for ; Fri, 11 Oct 2019 10:55:37 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to:user-agent; bh=TJabvGBzYoH+K4DTwkXfzex0GExkMyVSB8XXZLit9qI=; b=AtMRAdW8HnhyvcPm3ZJzBtqIhbJH0N6c386tlQ5JqH8FSwcBewMSMchZnDSnxVHI1k LmWwcOd49B6xygtsRIMEhl4on3TDkHDFo0t949HripxlsSOSkHU5JrMyK4/fhV9ACXCc Uv+5kOp9qGy5HFo8R7Bl2iSGzlrUCYsRvefTwYOwMl8O7SZ9/fxAX18IpB66XbX3BlpO UjtpJSP9/RQMB240tFMIfL0Px0R+4XBxAmjAi+BIatcN5fS+1aV/b4gHwyrpcyTKLy0X 5k51y0QecrnRXhU4jqwg121/qsBUmsWVs59uHwkbtITN1q3sDjkcrV3I9841c3CWneg2 TpVw== X-Gm-Message-State: APjAAAU47nS36uc94Iu5ziJyOBgerUKVHH7M8ceKLVuceAvihMUOCJJq sZOdX6A8CsQzGEUcrqxZic1nu/Vi6PgWCddHaHEjm04zOcAmL8zOjg4srPU4lIP/nw7ZO/G0zr0 BglfC7ZmsbfVexMyN2dZzjILqBOKr84wO X-Received: by 2002:a5d:8991:: with SMTP id m17mr7780100iol.104.1570816536857; Fri, 11 Oct 2019 10:55:36 -0700 (PDT) X-Google-Smtp-Source: APXvYqzrZ5NZjnnq4tF4R6PQev4BU0bFGzfLPJOith1vTdJsoQ3pyTj08BA+rbRfPaAGd4j0xD+ZTQ== X-Received: by 2002:a5d:8991:: with SMTP id m17mr7780076iol.104.1570816536534; Fri, 11 Oct 2019 10:55:36 -0700 (PDT) Received: from xps13.canonical.com (c-71-56-235-36.hsd1.co.comcast.net. [71.56.235.36]) by smtp.gmail.com with ESMTPSA id f12sm6162295iob.58.2019.10.11.10.55.35 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 11 Oct 2019 10:55:35 -0700 (PDT) Date: Fri, 11 Oct 2019 11:55:36 -0600 From: dann frazier To: Jan Glauber Subject: Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues Message-ID: <20191011175536.GB25464@xps13.dannf> References: <20190924202517.GA21422@xps13.dannf> <20191002092253.GA3857@hc> <6dd73749-49b0-0fbc-b9bb-44c3736642b8@redhat.com> <20191007144432.GA29958@xps13.dannf> <065a52a9-5bb0-1259-6c73-41af60e0a05d@redhat.com> <20191009080220.GA2905@hc> <20191011060518.GA6920@hc> <966c119d-aa76-2149-108f-867aebd772f7@redhat.com> <20191011082954.GA10493@hc> MIME-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Content-Disposition: inline In-Reply-To: <20191011082954.GA10493@hc> User-Agent: Mutt/1.10.1 (2018-07-13) X-detected-operating-system: by eggs.gnu.org: GNU/Linux 3.x [fuzzy] X-Received-From: 91.189.89.112 X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Rafael David Tinoco , lizhengui , QEMU Developers , Bug 1805256 <1805256@bugs.launchpad.net>, QEMU Developers - ARM , Paolo Bonzini Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" Message-ID: <20191011175536.cYY-aDIU6pxHm8wt_PwWlW2eP2oQAjn2_vgeDh-Segw@z> On Fri, Oct 11, 2019 at 08:30:02AM +0000, Jan Glauber wrote: > On Fri, Oct 11, 2019 at 10:18:18AM +0200, Paolo Bonzini wrote: > > On 11/10/19 08:05, Jan Glauber wrote: > > > On Wed, Oct 09, 2019 at 11:15:04AM +0200, Paolo Bonzini wrote: > > >>> ...but if I bump notify_me size to uint64_t the issue goes away. > > >> > > >> Ouch. :) Is this with or without my patch(es)? > > > > You didn't answer this question. > > Oh, sorry... I did but the mail probably didn't make it out. > I have both of your changes applied (as I think they make sense). > > > >> Also, what if you just add a dummy uint32_t after notify_me? > > > > > > With the dummy the testcase also runs fine for 500 iterations. > > > > You might be lucky and causing list_lock to be in another cache line. > > What if you add __attribute__((aligned(16)) to notify_me (and keep the > > dummy)? > > Good point. I'll try to force both into the same cacheline. On the Hi1620, this still hangs in the first iteration: diff --git a/include/block/aio.h b/include/block/aio.h index 6b0d52f732..00e56a5412 100644 --- a/include/block/aio.h +++ b/include/block/aio.h @@ -82,7 +82,7 @@ struct AioContext { * Instead, the aio_poll calls include both the prepare and the * dispatch phase, hence a simple counter is enough for them. */ - uint32_t notify_me; + __attribute__((aligned(16))) uint64_t notify_me; /* A lock to protect between QEMUBH and AioHandler adders and deleter, * and to ensure that no callbacks are removed while we're walking and diff --git a/util/async.c b/util/async.c index ca83e32c7f..024c4c567d 100644 --- a/util/async.c +++ b/util/async.c @@ -242,7 +242,7 @@ aio_ctx_check(GSource *source) aio_notify_accept(ctx); for (bh = ctx->first_bh; bh; bh = bh->next) { - if (bh->scheduled) { + if (atomic_mb_read(&bh->scheduled)) { return true; } } @@ -342,12 +342,12 @@ LinuxAioState *aio_get_linux_aio(AioContext *ctx) void aio_notify(AioContext *ctx) { - /* Write e.g. bh->scheduled before reading ctx->notify_me. Pairs - * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll. + /* Using atomic_mb_read ensures that e.g. bh->scheduled is written before + * ctx->notify_me is read. Pairs with atomic_or in aio_ctx_prepare or + * atomic_add in aio_poll. */ - smp_mb(); - if (ctx->notify_me) { - event_notifier_set(&ctx->notifier); + if (atomic_mb_read(&ctx->notify_me)) { + event_notifier_set(&ctx->notifier); atomic_mb_set(&ctx->notified, true); } }