From: David Hill
Subject: Re: Shutting down a VM with Kernel 4.14 will sometime hang and a reboot is the only way to recover.
Date: Wed, 6 Dec 2017 23:42:34 -0500
To: Jason Wang, Paolo Bonzini, kvm@vger.kernel.org
In-Reply-To: <2c153ff8-57cc-715b-6d2f-1758bcb66abb@redhat.com>

On 2017-12-06 11:34 PM, David Hill wrote:
>
>
> On 2017-12-04 02:51 PM, David Hill wrote:
>>
>> On 2017-12-03 11:08 PM, Jason Wang wrote:
>>>
>>>
>>> On 2017-12-02 00:38, David Hill wrote:
>>>>>
>>>>> Finally, I reverted 581fe0ea61584d88072527ae9fb9dcb9d1f2783e too
>>>>> ... compiling and I'll keep you posted.
>>>>
>>>> So I'm still able to reproduce this issue even with reverting these
>>>> 3 commits.  Would you have other suspect commits ?
>>>
>>> Thanks for the testing.
>>> No, I don't have other suspect commits.
>>>
>>> Looks like somebody else is hitting your issue too (see
>>> https://www.spinics.net/lists/netdev/msg468319.html)
>>>
>>> But he claims the issue was fixed by using qemu 2.10.1.
>>>
>>> So you may:
>>>
>>> -try to see if qemu 2.10.1 solves your issue
>> It didn't solve it for him... it's only harder to reproduce. [1]
>>> -if not, try to see if commit
>>> 2ddf71e23cc246e95af72a6deed67b4a50a7b81c ("net: add notifier hooks
>>> for devmap bpf map") is the first bad commit
>> I'll try to see what I can do here
> I'm looking at that commit and it was introduced before v4.13 if
> I'm not mistaken, while this issue appeared between v4.13 and
> v4.14-rc1.  Between those two releases, there are 1352 commits.
> Is there a way to quickly know which commits are touching vhost-net,
> zerocopy ?
>
>
> [ 7496.553044]  __schedule+0x2dc/0xbb0
> [ 7496.553055]  ? trace_hardirqs_on+0xd/0x10
> [ 7496.553074]  schedule+0x3d/0x90
> [ 7496.553087]  vhost_net_ubuf_put_and_wait+0x73/0xa0 [vhost_net]
> [ 7496.553100]  ? finish_wait+0x90/0x90
> [ 7496.553115]  vhost_net_ioctl+0x542/0x910 [vhost_net]
> [ 7496.553144]  do_vfs_ioctl+0xa6/0x6c0
> [ 7496.553166]  SyS_ioctl+0x79/0x90
> [ 7496.553182]  entry_SYSCALL_64_fastpath+0x1f/0xbe

That vhost_net_ubuf_put_and_wait call was changed in this commit, with
the following comment:

commit 0ad8b480d6ee916aa84324f69acf690142aecd0e
Author: Michael S. Tsirkin
Date:   Thu Feb 13 11:42:05 2014 +0200

    vhost: fix ref cnt checking deadlock

    vhost checked the counter within the refcnt before decrementing.  It
    really wanted to know that it is the one that has the last reference, as
    a way to batch freeing resources a bit more efficiently.

    Note: we only let refcount go to 0 on device release.

    This works well but we now access the ref counter twice so there's a
    race: all users might see a high count and decide to defer freeing
    resources.
    In the end no one initiates freeing resources until the last reference
    is gone (which is on VM shutdown so might happen after a looooong time).

    Let's do what we probably should have done straight away:
    switch from kref to plain atomic, documenting the
    semantics, return the refcount value atomically after decrement,
    then use that to avoid the deadlock.

    Reported-by: Qin Chuanyu
    Signed-off-by: Michael S. Tsirkin
    Acked-by: Jason Wang
    Signed-off-by: David S. Miller

So at this point, are we hitting a deadlock when using
experimental_zcopytx?

>
>>> -if not, maybe you can continue your bisection through git bisect skip
>>>
>> Some commits are so broken that the system won't boot ...  What I
>> fear is that if I git bisect skip those commits, I'll also skip the
>> commit culprit of my original problem
>>
>> [1] https://www.spinics.net/lists/netdev/msg469887.html
>
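
To make the race in that commit message concrete, here is a rough
userspace sketch of the two patterns it contrasts: reading the counter
and then decrementing it in a second step, versus decrementing and
getting the new value back in one atomic operation. This is not the
vhost code. The struct and function names (ubuf_ref, put_racy,
put_atomic) are made up for illustration, and it uses C11
<stdatomic.h> rather than the kernel's atomic helpers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a refcounted object; not the kernel struct. */
struct ubuf_ref {
    atomic_int refcount;
};

/*
 * Racy pattern: the counter is accessed twice. Two callers can both
 * read the same "high" value before either one decrements, so both
 * conclude that somebody else still holds the last reference.
 * Returns true if this caller believes it dropped the last reference.
 */
bool put_racy(struct ubuf_ref *r)
{
    int seen = atomic_load(&r->refcount);  /* first access */
    atomic_fetch_sub(&r->refcount, 1);     /* second access */
    return seen == 1;
}

/*
 * Pattern described in the commit message: decrement and obtain the
 * resulting value in a single atomic step. Exactly one caller sees 0.
 * Returns the number of references left after the decrement.
 */
int put_atomic(struct ubuf_ref *r)
{
    /* atomic_fetch_sub returns the old value, so subtract 1 for the new one. */
    int left = atomic_fetch_sub(&r->refcount, 1) - 1;
    assert(left >= 0);  /* dropping below zero would mean an unbalanced put */
    return left;
}
```

With put_atomic, whoever gets 0 back knows it is the one that must wake
up the waiter or free the resources. With put_racy, two concurrent
callers starting from a count of 2 can both return false even though
the count ends at 0, which is the kind of "nobody initiates the
cleanup" situation the commit message describes.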