From mboxrd@z Thu Jan  1 00:00:00 1970
From: David Hill <dhill@redhat.com>
Subject: Re: Shutting down a VM with Kernel 4.14 will sometime hang and a
 reboot is the only way to recover.
Date: Fri, 8 Dec 2017 13:03:15 -0500
Message-ID: <af1452f1-c327-b7f1-11ac-dc01e22bcfb5@redhat.com>
References: <efd45fba-5724-0036-8473-0274b5816ae9@redhat.com>
 <a0ec66f5-ebc0-3c54-26a8-dfba06801084@redhat.com>
 <9c912f3b-081c-8b02-17c8-453ebf36f42c@redhat.com>
 <10fe2b98-1e26-9539-9f49-0d01f8693e04@redhat.com>
 <6b41b4e5-6c0c-fce6-21fe-02dd8f550095@redhat.com>
 <c63ba0d1-c0d2-85c6-ad1c-7f777b59eae8@redhat.com>
 <634116a6-6338-4249-7d2d-430b654cc99c@redhat.com>
 <1f789868-7fda-3553-7078-3298873fb355@redhat.com>
 <918c4152-bcf9-b28c-0f54-f51d07d82bfc@redhat.com>
 <b8b2238c-30a0-6743-9399-ec441fd6043e@redhat.com>
 <68b5d4aa-1d48-d9a1-fc47-62ee8d7ad07a@redhat.com>
 <623df785-b79c-80d1-899f-6fcc10f70e69@redhat.com>
 <61be2e2b-9aeb-1a82-d607-a6af00f8c9c6@redhat.com>
 <094aabc6-4e6b-841e-2b7b-177b31e8ed07@redhat.com>
 <ce97ed55-75fa-2f5d-d6cb-8f7e356a555a@redhat.com>
 <9da15781-b6e0-3688-f6b2-2ef483b39d0d@redhat.com>
 <2c153ff8-57cc-715b-6d2f-1758bcb66abb@redhat.com>
 <dcaafacd-182c-6bf9-636a-0726299f7ce2@redhat.com>
 <4c8c81e6-e582-f292-79ed-f3d62518e2d9@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
To: Jason Wang <jasowang@redhat.com>,
        Paolo Bonzini <pbonzini@redhat.com>, kvm@vger.kernel.org
Return-path: <kvm-owner@vger.kernel.org>
Received: from mail-qt0-f193.google.com ([209.85.216.193]:36432 "EHLO
        mail-qt0-f193.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1751800AbdLHSDT (ORCPT <rfc822;kvm@vger.kernel.org>);
        Fri, 8 Dec 2017 13:03:19 -0500
Received: by mail-qt0-f193.google.com with SMTP id a16so27880595qtj.3
        for <kvm@vger.kernel.org>; Fri, 08 Dec 2017 10:03:19 -0800 (PST)
In-Reply-To: <4c8c81e6-e582-f292-79ed-f3d62518e2d9@redhat.com>
Content-Language: en-US
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>


On 2017-12-07 12:13 AM, Jason Wang wrote:
>
>
> On 2017年12月07日 12:42, David Hill wrote:
>>
>>
>> On 2017-12-06 11:34 PM, David Hill wrote:
>>>
>>>
>>> On 2017-12-04 02:51 PM, David Hill wrote:
>>>>
>>>> On 2017-12-03 11:08 PM, Jason Wang wrote:
>>>>>
>>>>>
>>>>> On 2017年12月02日 00:38, David Hill wrote:
>>>>>>>
>>>>>>> Finally, I reverted 581fe0ea61584d88072527ae9fb9dcb9d1f2783e too 
>>>>>>> ... compiling and I'll keep you posted.
>>>>>>
>>>>>> So I'm still able to reproduce this issue even after reverting
>>>>>> these 3 commits.  Do you have any other suspect commits?
>>>>>
>>>>> Thanks for the testing. No, I don't have other suspect commits.
>>>>>
>>>>> Looks like somebody else is hitting your issue too (see
>>>>> https://www.spinics.net/lists/netdev/msg468319.html)
>>>>>
>>>>> But he claims the issue was fixed by using qemu 2.10.1.
>>>>>
>>>>> So you may:
>>>>>
>>>>> -try to see if qemu 2.10.1 solves your issue
>>>> It didn't solve it for him... it's only harder to reproduce. [1]
>>>>> -if not, try to see if commit 
>>>>> 2ddf71e23cc246e95af72a6deed67b4a50a7b81c ("net: add notifier hooks 
>>>>> for devmap bpf map") is the first bad commit
>>>> I'll try to see what I can do here
>>> I looked at that commit and, if I'm not mistaken, it was introduced
>>> before v4.13, while this issue appeared between v4.13 and v4.14-rc1.
>>> There are 1352 commits between those two releases.
>>> Is there a quick way to find out which of those commits touch
>>> vhost-net or zerocopy?
>>>
>>>
>>> [ 7496.553044]  __schedule+0x2dc/0xbb0
>>> [ 7496.553055]  ? trace_hardirqs_on+0xd/0x10
>>> [ 7496.553074]  schedule+0x3d/0x90
>>> [ 7496.553087]  vhost_net_ubuf_put_and_wait+0x73/0xa0 [vhost_net]
>>> [ 7496.553100]  ? finish_wait+0x90/0x90
>>> [ 7496.553115]  vhost_net_ioctl+0x542/0x910 [vhost_net]
>>> [ 7496.553144]  do_vfs_ioctl+0xa6/0x6c0
>>> [ 7496.553166]  SyS_ioctl+0x79/0x90
>>> [ 7496.553182]  entry_SYSCALL_64_fastpath+0x1f/0xbe
>>
>> That vhost_net_ubuf_put_and_wait call was changed in the following
>> commit:
>>
>> commit 0ad8b480d6ee916aa84324f69acf690142aecd0e
>> Author: Michael S. Tsirkin <mst@redhat.com>
>> Date:   Thu Feb 13 11:42:05 2014 +0200
>>
>>     vhost: fix ref cnt checking deadlock
>>
>>     vhost checked the counter within the refcnt before decrementing.  It
>>     really wanted to know that it is the one that has the last reference,
>>     as a way to batch freeing resources a bit more efficiently.
>>
>>     Note: we only let refcount go to 0 on device release.
>>
>>     This works well but we now access the ref counter twice so there's a
>>     race: all users might see a high count and decide to defer freeing
>>     resources.
>>     In the end no one initiates freeing resources until the last reference
>>     is gone (which is on VM shotdown so might happen after a looooong time).
>>
>>     Let's do what we probably should have done straight away:
>>     switch from kref to plain atomic, documenting the
>>     semantics, return the refcount value atomically after decrement,
>>     then use that to avoid the deadlock.
>>
>>     Reported-by: Qin Chuanyu <qinchuanyu@huawei.com>
>>     Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>>     Acked-by: Jason Wang <jasowang@redhat.com>
>>     Signed-off-by: David S. Miller <davem@davemloft.net>
>>
>>
>>
>> So at this point, are we hitting a deadlock when using
>> experimental_zcopytx?
>
> Yes. But another possibility is that it is not caused by vhost_net
> itself but by some other place that holds a packet.
>
> Thanks
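
For reference, the two possibilities can be pictured with a toy Python
model of the wait that the trace above is stuck in. This is only a sketch,
not kernel code: the UbufRef class, the shutdown() helper and the timeout
are made up for illustration (the real wait has no timeout), and it assumes
one reference per in-flight zerocopy buffer plus one held by the device.

```python
import threading


class UbufRef:
    """Toy model of a vhost-net style ubuf reference counter (not kernel code)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._count = 1  # assumption: the device itself holds the first reference

    def get(self):
        """Take a reference for one in-flight zerocopy buffer."""
        with self._cond:
            self._count += 1

    def put(self):
        """Drop one reference and return the new count in the same step."""
        with self._cond:
            self._count -= 1
            if self._count == 0:
                self._cond.notify_all()  # wake anyone blocked in put_and_wait()
            return self._count

    def put_and_wait(self, timeout):
        """Drop the device's reference, then wait for the count to reach zero.

        The timeout exists only so this demo terminates.
        Returns True if the count reached zero in time, False otherwise.
        """
        self.put()
        with self._cond:
            return self._cond.wait_for(lambda: self._count == 0, timeout)


def shutdown(buffer_completed, timeout=0.2):
    """Simulate device shutdown with one zerocopy packet still in flight."""
    ref = UbufRef()
    ref.get()  # packet handed to the network stack
    if buffer_completed:
        # The completion callback runs later on another thread and drops
        # the reference that was taken for the packet.
        threading.Timer(0.05, ref.put).start()
    return ref.put_and_wait(timeout)
```

In this model shutdown(True) returns True, while shutdown(False) returns
False after the timeout: whoever still holds the packet never drops its
reference, so without a timeout the waiter would block forever, regardless
of whether the holder is vhost_net or some other place.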

While bisecting, I reached commit
46d4b68f891bee5d83a32508bfbd9778be6b1b63, where the kernel panics when I
run virt-customize:

Message from syslogd@zappa at Dec  8 12:52:06 ...
  kernel:[  350.016376] Kernel panic - not syncing: Fatal exception in interrupt

I marked that commit as bad again and will continue bisecting!