From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:41642)
	by lists.gnu.org with esmtp (Exim 4.71) (envelope-from )
	id 1agXh5-00080D-0T for qemu-devel@nongnu.org;
	Thu, 17 Mar 2016 09:10:31 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from ) id 1agXgz-0002tK-6K for qemu-devel@nongnu.org;
	Thu, 17 Mar 2016 09:10:26 -0400
Received: from e06smtp12.uk.ibm.com ([195.75.94.108]:51930)
	by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from )
	id 1agXgy-0002sY-UA for qemu-devel@nongnu.org;
	Thu, 17 Mar 2016 09:10:21 -0400
Received: from localhost by e06smtp12.uk.ibm.com with IBM ESMTP SMTP Gateway:
	Authorized Use Only! Violators will be prosecuted for from ;
	Thu, 17 Mar 2016 13:10:18 -0000
Date: Thu, 17 Mar 2016 14:02:23 +0100
From: Cornelia Huck
Message-ID: <20160317140223.5c3abdc5.cornelia.huck@de.ibm.com>
In-Reply-To: <56EAA576.8020709@de.ibm.com>
References: <1458123018-18651-1-git-send-email-famz@redhat.com>
	<56E9355A.5070700@redhat.com> <56E93A22.1080102@de.ibm.com>
	<56E93ECE.10103@redhat.com> <56E9425C.8030201@de.ibm.com>
	<56E957AD.2050005@redhat.com> <56E961EA.4090908@de.ibm.com>
	<56EAA170.1000904@linux.vnet.ibm.com> <56EAA576.8020709@de.ibm.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [PATCH 0/4] Tweaks around virtio-blk start/stop
To: Christian Borntraeger
Cc: Kevin Wolf, Fam Zheng, qemu-block@nongnu.org, "Michael S. Tsirkin",
	qemu-devel@nongnu.org, tu bo, Stefan Hajnoczi, Paolo Bonzini

On Thu, 17 Mar 2016 13:39:18 +0100
Christian Borntraeger wrote:

> On 03/17/2016 01:22 PM, tu bo wrote:
> >
> > On 03/16/2016 09:38 PM, Christian Borntraeger wrote:
> >> On 03/16/2016 01:55 PM, Paolo Bonzini wrote:
> >>>
> >>>
> >>> On 16/03/2016 12:24, Christian Borntraeger wrote:
> >>>> On 03/16/2016 12:09 PM, Paolo Bonzini wrote:
> >>>>> On 16/03/2016 11:49, Christian Borntraeger wrote:
> >>>>>> #3 0x00000000800b713e in virtio_blk_data_plane_start (s=0xba232d80) at /home/cborntra/REPOS/qemu/hw/block/dataplane/virtio-blk.c:224
> >>>>>> #4 0x00000000800b4ea0 in virtio_blk_handle_output (vdev=0xb9eee7e8, vq=0xba305270) at /home/cborntra/REPOS/qemu/hw/block/virtio-blk.c:590
> >>>>>> #5 0x00000000800ef3dc in virtio_queue_notify_vq (vq=0xba305270) at /home/cborntra/REPOS/qemu/hw/virtio/virtio.c:1095
> >>>>>> #6 0x00000000800f1c9c in virtio_queue_host_notifier_read (n=0xba3052c8) at /home/cborntra/REPOS/qemu/hw/virtio/virtio.c:1785
> >>>
> >>> If you just remove the calls to virtio_queue_host_notifier_read, here
> >>> and in virtio_queue_aio_set_host_notifier_fd_handler, does it work
> >>> (keeping patches 2-4 in)?
> >>
> >> With these changes and patches 2-4 it no longer locks up.
> >> I'll keep it running for some hours to check if a crash happens.
> >>
> >> Tu Bo, your setup is currently better suited for reproducing. Can you also check?
> >
> > With the calls to virtio_queue_host_notifier_read removed, and keeping patches 2-4 in,
> >
> > I got the same crash as before:
> > (gdb) bt
> > #0 bdrv_co_do_rw (opaque=0x0) at block/io.c:2172
> > #1 0x000002aa0c65d786 in coroutine_trampoline (i0=, i1=-2013204784) at util/coroutine-ucontext.c:79
> > #2 0x000003ff99ad150a in __makecontext_ret () from /lib64/libc.so.6
> >
> >
>
> As an interesting side note, I updated my system from F20 to F23 some days ago
> (after the initial report), while Tu Bo is still on an F20 system. I was not able
> to reproduce the original crash on F23, but going back to F20 made this
> problem reappear.
>
> Stack trace of thread 26429:
> #0 0x00000000802008aa tracked_request_begin (qemu-system-s390x)
> #1 0x0000000080203f3c bdrv_co_do_preadv (qemu-system-s390x)
> #2 0x000000008020567c bdrv_co_do_readv (qemu-system-s390x)
> #3 0x000000008025d0f4 coroutine_trampoline (qemu-system-s390x)
> #4 0x000003ff943d150a __makecontext_ret (libc.so.6)
>
> This is with patches 2-4 plus the removal of virtio_queue_host_notifier_read.
>
> Without removing virtio_queue_host_notifier_read, I get the same mutex lockup (as expected).
>
> Maybe we have two independent issues here and this is some old bug in glibc or
> whatever?

Doesn't sound unlikely.

But the notifier_read removal makes sense. Fix this now and continue
searching for the root cause of the F20 breakage? We should try to
understand this even if it has been fixed in the meantime.