From: David Hill <dhill@redhat.com>
To: Jason Wang <jasowang@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	kvm@vger.kernel.org
Subject: Re: Shutting down a VM with Kernel 4.14 will sometime hang and a reboot is the only way to recover.
Date: Fri, 8 Dec 2017 13:03:15 -0500
Message-ID: <af1452f1-c327-b7f1-11ac-dc01e22bcfb5@redhat.com>
In-Reply-To: <4c8c81e6-e582-f292-79ed-f3d62518e2d9@redhat.com>



On 2017-12-07 12:13 AM, Jason Wang wrote:
>
>
> On 2017-12-07 12:42, David Hill wrote:
>>
>>
>> On 2017-12-06 11:34 PM, David Hill wrote:
>>>
>>>
>>> On 2017-12-04 02:51 PM, David Hill wrote:
>>>>
>>>> On 2017-12-03 11:08 PM, Jason Wang wrote:
>>>>>
>>>>>
>>>>> On 2017-12-02 00:38, David Hill wrote:
>>>>>>>
>>>>>>> Finally, I reverted 581fe0ea61584d88072527ae9fb9dcb9d1f2783e too 
>>>>>>> ... compiling and I'll keep you posted.
>>>>>>
>>>>>> So I'm still able to reproduce this issue even after reverting
>>>>>> these 3 commits.  Do you have any other suspect commits?
>>>>>
>>>>> Thanks for the testing. No, I don't have other suspect commits.
>>>>>
>>>>> Looks like somebody else is hitting your issue too (see
>>>>> https://www.spinics.net/lists/netdev/msg468319.html)
>>>>>
>>>>> But he claims the issue was fixed by using qemu 2.10.1.
>>>>>
>>>>> So you may:
>>>>>
>>>>> -try to see if qemu 2.10.1 solves your issue
>>>> It didn't solve it for him... it's only harder to reproduce. [1]
>>>>> -if not, try to see if commit 
>>>>> 2ddf71e23cc246e95af72a6deed67b4a50a7b81c ("net: add notifier hooks 
>>>>> for devmap bpf map") is the first bad commit
>>>> I'll try to see what I can do here
>>> I'm looking at that commit and, if I'm not mistaken, it was
>>> introduced before v4.13, whereas this issue appeared between v4.13
>>> and v4.14-rc1.  There are 1352 commits between those two releases.
>>> Is there a quick way to find out which commits touch vhost-net or
>>> zerocopy?
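
For reference, one way to narrow that range down is git's path limiting. This is only a minimal sketch, assuming a kernel checkout in the current directory. The helper names are made up for illustration, and the path list is my guess at where the vhost-net and zerocopy transmit code lives, not something confirmed in this thread.

```python
import subprocess

# Assumed paths: drivers/vhost holds the vhost code; the remaining entries
# are guesses at files involved in zerocopy transmit.
SUSPECT_PATHS = [
    "drivers/vhost",
    "drivers/net/tun.c",
    "drivers/net/tap.c",
    "net/core/skbuff.c",
]


def git_log_command(good, bad, paths, repo="."):
    """Build a `git log` command listing commits in good..bad that touch paths."""
    return ["git", "-C", repo, "log", "--oneline", f"{good}..{bad}", "--", *paths]


def commits_touching(good, bad, paths, repo="."):
    """Run the command and return one 'sha subject' string per matching commit."""
    result = subprocess.run(
        git_log_command(good, bad, paths, repo),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.splitlines()


# Example call (needs a kernel tree that has both tags):
#   commits_touching("v4.13", "v4.14-rc1", SUSPECT_PATHS)
```

The same path list can also be appended to `git bisect start <bad> <good> -- <paths>` so that the bisection only visits commits touching those files.
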
>>>
>>>
>>> [ 7496.553044]  __schedule+0x2dc/0xbb0
>>> [ 7496.553055]  ? trace_hardirqs_on+0xd/0x10
>>> [ 7496.553074]  schedule+0x3d/0x90
>>> [ 7496.553087]  vhost_net_ubuf_put_and_wait+0x73/0xa0 [vhost_net]
>>> [ 7496.553100]  ? finish_wait+0x90/0x90
>>> [ 7496.553115]  vhost_net_ioctl+0x542/0x910 [vhost_net]
>>> [ 7496.553144]  do_vfs_ioctl+0xa6/0x6c0
>>> [ 7496.553166]  SyS_ioctl+0x79/0x90
>>> [ 7496.553182]  entry_SYSCALL_64_fastpath+0x1f/0xbe
>>
>> That vhost_net_ubuf_put_and_wait call was changed in this commit,
>> with the following comment:
>>
>> commit 0ad8b480d6ee916aa84324f69acf690142aecd0e
>> Author: Michael S. Tsirkin <mst@redhat.com>
>> Date:   Thu Feb 13 11:42:05 2014 +0200
>>
>>     vhost: fix ref cnt checking deadlock
>>
>>     vhost checked the counter within the refcnt before decrementing.  It
>>     really wanted to know that it is the one that has the last reference, as
>>     a way to batch freeing resources a bit more efficiently.
>>
>>     Note: we only let refcount go to 0 on device release.
>>
>>     This works well but we now access the ref counter twice so there's a
>>     race: all users might see a high count and decide to defer freeing
>>     resources.
>>     In the end no one initiates freeing resources until the last reference
>>     is gone (which is on VM shotdown so might happen after a looooong time).
>>
>>     Let's do what we probably should have done straight away:
>>     switch from kref to plain atomic, documenting the
>>     semantics, return the refcount value atomically after decrement,
>>     then use that to avoid the deadlock.
>>
>>     Reported-by: Qin Chuanyu <qinchuanyu@huawei.com>
>>     Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>>     Acked-by: Jason Wang <jasowang@redhat.com>
>>     Signed-off-by: David S. Miller <davem@davemloft.net>
>>
>>
>>
>> So at this point, are we hitting a deadlock when using 
>> experimental_zcopytx ? 
>
> Yes. But it is also possible that this is not caused by vhost_net
> itself but by some other place that is holding on to a packet.
>
> Thanks
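
To make the put-and-wait behaviour discussed above concrete, here is a toy user-space analogue. It is not the kernel code: the class name `UbufRef` and the `timeout` argument are invented for illustration, and a lock plus condition variable stand in for the plain atomic counter the commit message describes. It shows the two points from this thread: `put()` returns the post-decrement value in one step, and the waiter never wakes up if something else still holds a reference.

```python
import threading


class UbufRef:
    """Toy analogue of a put-and-wait reference counter (not the kernel code)."""

    def __init__(self):
        self._count = 1  # the owner's initial reference
        self._cond = threading.Condition()

    def get(self):
        """Take one more reference, e.g. for a packet still in flight."""
        with self._cond:
            self._count += 1

    def put(self):
        """Drop one reference and return the new count from the same critical section."""
        with self._cond:
            self._count -= 1
            remaining = self._count
            if remaining == 0:
                self._cond.notify_all()  # wake anyone blocked in put_and_wait()
            return remaining

    def put_and_wait(self, timeout=None):
        """Drop the owner's reference, then block until every other one is gone.

        Returns True once the count reaches zero, False if `timeout` expires.
        With timeout=None this blocks forever when a reference is never dropped.
        """
        self.put()
        with self._cond:
            return self._cond.wait_for(lambda: self._count == 0, timeout)
```

If a holder calls `get()` and never calls `put()`, `put_and_wait()` with no timeout blocks indefinitely, which is the same shape as the hung task in the trace above.
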

While bisecting, when I reached commit
46d4b68f891bee5d83a32508bfbd9778be6b1b63, the kernel panicked when I
ran virt-customize:

Message from syslogd@zappa at Dec  8 12:52:06 ...
  kernel:[  350.016376] Kernel panic - not syncing: Fatal exception in interrupt

I marked that commit as bad again.   Will continue bisecting!
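
For anyone following along, the search that bisecting performs can be sketched as below. This is a simplified model, not git's implementation: it assumes a linear list of commits ordered oldest to newest, a known-bad last commit, and a test (`is_bad`, a hypothetical callback) that flips from good to bad exactly once. A commit that fails for an unrelated reason breaks that last assumption, which is the case `git bisect skip` exists for.

```python
def first_bad(commits, is_bad):
    """Return (first bad commit, number of tests run).

    Assumes commits are ordered oldest to newest, the last one is bad,
    and every commit after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1  # commits[hi] is known to be bad
    steps = 0
    while lo < hi:
        mid = (lo + hi) // 2
        steps += 1
        if is_bad(commits[mid]):
            hi = mid       # first bad commit is at mid or earlier
        else:
            lo = mid + 1   # first bad commit is after mid
    return commits[lo], steps
```

With the 1352 commits mentioned earlier, this needs at most 11 build-and-test rounds.
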
