Date: Tue, 21 Jul 2020 20:02:38 -0000
From: Rafael David Tinoco <1805256@bugs.launchpad.net>
To: qemu-devel@nongnu.org
X-Launchpad-Bug-Tags: ikeradar patch qemu-img verification-done-bionic verification-done-eoan verification-done-focal
X-Launchpad-Bug-Reporter: dann frazier (dannf)
References: <154327283728.15443.11625169757714443608.malonedeb@soybean.canonical.com>
Message-Id: <159536175813.19361.12699030388697702605.malone@chaenomeles.canonical.com>
Subject: [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
Reply-To: Bug 1805256 <1805256@bugs.launchpad.net>

Status of older attempts to solve issues of the same nature:

----

Older (2018) merge request from @raharper:

https://github.com/koverstreet/bcache-tools/pull/1

addressing the fact that kernel uevents would not always emit CACHED_UUID
parameters, causing udev to delete (whenever that happens) the
/dev/bcache/{by-uuid,by-label} symlinks.

This last MR pointed to previous related bugs:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=890446
https://bugs.launchpad.net/curtin/+bug/1728742

And to an upstream kernel patch:

https://lore.kernel.org/patchwork/patch/921298/

for https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1729145, which
wasn't accepted upstream.

Even though it was not accepted upstream, an SRU was attempted:

LP: #1729145

https://lists.ubuntu.com/archives/kernel-team/2017-December/088680.html
https://lists.ubuntu.com/archives/kernel-team/2017-December/088679.html

Both were NACKed. Attempted again:

https://lists.ubuntu.com/archives/kernel-team/2017-December/088682.html
https://lists.ubuntu.com/archives/kernel-team/2017-December/088683.html

NACKed again. And a v2 was sent:

https://lists.ubuntu.com/archives/kernel-team/2017-December/088751.html
https://lists.ubuntu.com/archives/kernel-team/2017-December/088750.html
https://lists.ubuntu.com/archives/kernel-team/2017-December/088749.html

and acked in January 2018 by Colin:

https://lists.ubuntu.com/archives/kernel-team/2018-January/089492.html

but not upstreamed.

BIONIC contains the fix:

commit ed9333e1b583
Author: Ryan Harper
Date: Mon Dec 11 12:12:01 2017

    UBUNTU: SAUCE: (no-up) bcache: decouple emitting a cached_dev CHANGE uevent

    BugLink: http://bugs.launchpad.net/bugs/1729145

    - decouple emitting a cached_dev CHANGE uevent which includes dev.uuid
      and dev.label from bch_cached_dev_run() which only happens when a
      bcacheX device is bound to the actual backing block device (bcache0 -> vdb)

    - update bch_cached_dev_run() to invoke bch_cached_dev_emit_change() as
      needed; no functional code path changes here

    - Modify register_bcache to detect a re-registering of a bcache
      cached_dev, and in that case call bcache_cached_dev_emit_change() to

    Signed-off-by: Ryan Harper
    Signed-off-by: Joseph Salisbury
    Acked-by: Colin Ian King
    Acked-by: Stefan Bader
    Signed-off-by: Khalid Elmously
    [ saf: fix incorrect indentation ]
    Signed-off-by: Seth Forshee

FOCAL contains the fix:

commit 67553dcd7905
Author: Ryan Harper
Date: Mon Dec 11 12:12:01 2017

    UBUNTU: SAUCE: (no-up) bcache: decouple emitting a cached_dev CHANGE uevent

GROOVY contains the fix:

commit 67553dcd7905
Author: Ryan Harper
Date: Mon Dec 11 12:12:01 2017

    UBUNTU: SAUCE: (no-up) bcache: decouple emitting a cached_dev CHANGE uevent

----

So the kernel patch wasn't accepted, and neither was the bcache-tools
patch by @raharper (bcache-export-cached).

----

New upstream summary from @raharper:

https://github.com/systemd/systemd/pull/16317#issuecomment-655647313

in the upstream merge request made by @rbalint.

** Bug watch added: Debian Bug tracker #890446
   https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=890446

-- 
You received this bug notification because you are a member of
qemu-devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256

Title:
  qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
  converting images

Status in kunpeng920:
  Triaged
Status in kunpeng920 ubuntu-18.04 series:
  Triaged
Status in kunpeng920 ubuntu-18.04-hwe series:
  Triaged
Status in kunpeng920 ubuntu-19.10 series:
  Fix Released
Status in kunpeng920 ubuntu-20.04 series:
  Fix Released
Status in kunpeng920 upstream-kernel series:
  Invalid
Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Bionic:
  In Progress
Status in qemu source package in Eoan:
  Fix Released
Status in qemu source package in Focal:
  Fix Released

Bug description:
  [Impact]

  * QEMU locking primitives might face a race condition in QEMU Async
  I/O bottom halves scheduling. This leads to a deadlock, causing either
  QEMU or one of its tools to hang indefinitely.

  [Test Case]

  * qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely in approximately 30% of the runs on AArch64.

  [Regression Potential]

  * This is a change to a core part of QEMU: the AIO scheduling. It
  works like a "kernel" scheduler: whereas the kernel schedules OS
  tasks, the QEMU AIO code is responsible for scheduling QEMU coroutines
  and event-listener callbacks.

  * There was a long discussion upstream about primitives and AArch64.
  After quite some time Paolo released this patch, and it solves the
  issue. According to his commit log, the tested platforms were amd64
  and aarch64.

  * Christian suggests that this fix stay a little longer in -proposed
  to make sure it won't cause any regressions.

  * dannf suggests we also check for performance regressions, e.g. how
  long it takes to convert a cloud image on high-core systems.

  [Other Info]

  * Original description below:

  Command:

  qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Hangs indefinitely in approximately 30% of the runs.
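
  The hang rate quoted above, and the conversion time dannf asks about,
  could be measured with a small harness along these lines. This is only
  a sketch, not part of the original report: the function name
  `measure`, the run count and the timeout are assumptions, and a run is
  counted as "hung" purely because it exceeded the timeout, so the
  timeout has to be well above the normal conversion time on the
  machine under test.

```python
import subprocess
import time


def measure(cmd, runs=20, timeout_s=600.0):
    """Run `cmd` `runs` times.

    Returns (hung, durations): the number of runs that exceeded
    `timeout_s` seconds, and the wall-clock time in seconds of every
    run that finished in time.
    """
    hung = 0
    durations = []
    for _ in range(runs):
        start = time.monotonic()
        try:
            # On timeout, subprocess.run() kills the child process and
            # raises TimeoutExpired.
            subprocess.run(cmd, timeout=timeout_s, check=False,
                           stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL)
        except subprocess.TimeoutExpired:
            hung += 1
        else:
            durations.append(time.monotonic() - start)
    return hung, durations


# Example call (assumes qemu-img is installed and ./disk01.qcow2 exists):
#   cmd = ["qemu-img", "convert", "-f", "qcow2", "-O", "qcow2",
#          "./disk01.qcow2", "./output.qcow2"]
#   hung, durations = measure(cmd)
#   print(f"{hung} of 20 runs hung")
```

  Inserting "-m", "1" after "convert" in the command list gives the
  same measurement for the single-coroutine workaround, so the two
  hang counts and duration lists can be compared directly.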

  ----

  Workaround:

  qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2

  Run "qemu-img convert" with "a single coroutine" to avoid this issue.

  ----

  (gdb) thread 1
  ...
  (gdb) bt
  #0 0x0000ffffbf1ad81c in __GI_ppoll
  #1 0x0000aaaaaabcf73c in ppoll
  #2 qemu_poll_ns
  #3 0x0000aaaaaabd0764 in os_host_main_loop_wait
  #4 main_loop_wait
  ...

  (gdb) thread 2
  ...
  (gdb) bt
  #0 syscall ()
  #1 0x0000aaaaaabd41cc in qemu_futex_wait
  #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 )
  #3 0x0000aaaaaabed05c in call_rcu_thread
  #4 0x0000aaaaaabd34c8 in qemu_thread_start
  #5 0x0000ffffbf25c880 in start_thread
  #6 0x0000ffffbf1b6b9c in thread_start ()

  (gdb) thread 3
  ...
  (gdb) bt
  #0 0x0000ffffbf11aa20 in __GI___sigtimedwait
  #1 0x0000ffffbf2671b4 in __sigwait
  #2 0x0000aaaaaabd1ddc in sigwait_compat
  #3 0x0000aaaaaabd34c8 in qemu_thread_start
  #4 0x0000ffffbf25c880 in start_thread
  #5 0x0000ffffbf1b6b9c in thread_start

  ----

  (gdb) run
  Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2 ./disk01.ext4.qcow2 ./output.qcow2

  [New Thread 0xffffbec5ad90 (LWP 72839)]
  [New Thread 0xffffbe459d90 (LWP 72840)]
  [New Thread 0xffffbdb57d90 (LWP 72841)]
  [New Thread 0xffffacac9d90 (LWP 72859)]
  [New Thread 0xffffa7ffed90 (LWP 72860)]
  [New Thread 0xffffa77fdd90 (LWP 72861)]
  [New Thread 0xffffa6ffcd90 (LWP 72862)]
  [New Thread 0xffffa67fbd90 (LWP 72863)]
  [New Thread 0xffffa5ffad90 (LWP 72864)]

  [Thread 0xffffa5ffad90 (LWP 72864) exited]
  [Thread 0xffffa6ffcd90 (LWP 72862) exited]
  [Thread 0xffffa77fdd90 (LWP 72861) exited]
  [Thread 0xffffbdb57d90 (LWP 72841) exited]
  [Thread 0xffffa67fbd90 (LWP 72863) exited]
  [Thread 0xffffacac9d90 (LWP 72859) exited]
  [Thread 0xffffa7ffed90 (LWP 72860) exited]

  """

  All the remaining tasks are blocked in a system call, so no task is
  left to call qemu_futex_wake() to unblock thread #2 (in futex()),
  which would in turn unblock thread #1 (doing poll() on a pipe with
  thread #2).

  Those 7 threads exit before the disk conversion is complete (sometimes
  at the beginning, sometimes at the end).

  ----

  On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
  frequently hangs (~50% of the time) with this command:

  qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2

  Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
  qcow2->qcow2 conversion happens to be something uvtool does every time
  it fetches images.

  Once hung, attaching gdb gives the following backtrace:

  (gdb) bt
  #0 0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
      timeout=, timeout@entry=0x0, sigmask=0xffffc123b950)
      at ../sysdeps/unix/sysv/linux/ppoll.c:39
  #1 0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=,
      __fds=) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
  #2 qemu_poll_ns (fds=, nfds=,
      timeout=timeout@entry=-1) at util/qemu-timer.c:322
  #3 0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
      at util/main-loop.c:233
  #4 main_loop_wait (nonblocking=) at util/main-loop.c:497
  #5 0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
  #6 img_convert (argc=, argv=) at qemu-img.c:2456
  #7 0x0000aaaabbe2333c in main (argc=7, argv=) at qemu-img.c:4975

  Reproduced w/ latest QEMU git (@ 53744e0a182)

To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions