From mboxrd@z Thu Jan  1 00:00:00 1970
From: David Hill
Subject: Re: Shutting down a VM with Kernel 4.14 will sometime hang and a reboot is the only way to recover.
Date: Fri, 1 Dec 2017 11:38:46 -0500
Message-ID: <094aabc6-4e6b-841e-2b7b-177b31e8ed07@redhat.com>
References: <92c4f997-80db-fabf-98c8-fcb92da064a7@redhat.com>
 <7bd45f84-d07e-7fca-6ca3-07dededd092d@redhat.com>
 <29f8e09f-8920-52d0-02f4-c0fb779135ee@redhat.com>
 <9c912f3b-081c-8b02-17c8-453ebf36f42c@redhat.com>
 <10fe2b98-1e26-9539-9f49-0d01f8693e04@redhat.com>
 <6b41b4e5-6c0c-fce6-21fe-02dd8f550095@redhat.com>
 <634116a6-6338-4249-7d2d-430b654cc99c@redhat.com>
 <1f789868-7fda-3553-7078-3298873fb355@redhat.com>
 <918c4152-bcf9-b28c-0f54-f51d07d82bfc@redhat.com>
 <68b5d4aa-1d48-d9a1-fc47-62ee8d7ad07a@redhat.com>
 <623df785-b79c-80d1-899f-6fcc10f70e69@redhat.com>
 <61be2e2b-9aeb-1a82-d607-a6af00f8c9c6@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
To: Jason Wang, Paolo Bonzini, kvm@vger.kernel.org
Return-path:
Received: from mail-qk0-f193.google.com ([209.85.220.193]:42346 "EHLO
 mail-qk0-f193.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
 with ESMTP id S1750964AbdLAQiu (ORCPT ); Fri, 1 Dec 2017 11:38:50 -0500
Received: by mail-qk0-f193.google.com with SMTP id a71so13845299qkc.9
 for ; Fri, 01 Dec 2017 08:38:50 -0800 (PST)
In-Reply-To: <61be2e2b-9aeb-1a82-d607-a6af00f8c9c6@redhat.com>
Content-Language: en-US
Sender: kvm-owner@vger.kernel.org
List-ID:

On 2017-11-30 03:59 PM, David Hill wrote:
>
>
> On 2017-11-30 03:52 PM, David Hill wrote:
>>
>>
>> On 2017-11-29 09:42 PM, Jason Wang wrote:
>>>
>>>
>>> On 2017-11-30 03:13, David Hill wrote:
>>>>
>>>>
>>>> On 2017-11-29 12:15 AM, Jason Wang wrote:
>>>>>
>>>>>
>>>>> On 2017-11-29 10:52, Dave Hill wrote:
>>>>>>>>
>>>>>>>
>>>>>>> Thanks. Zerocopy is disabled by several distributions by default.
>>>>>>> For upstream, the only reason to leave it on is the hope that
>>>>>>> more developers can help fix the issues.
>>>>>>>
>>>>>>>
>>>>>> So I never hit this issue with previous kernels and this issue
>>>>>> started happening with the v4.14-rc series.
>>>>>
>>>>>
>>>>> Right, this still needs to be investigated to see if it was
>>>>> introduced recently.
>>>>>
>>>>> Looking at git history, the only suspect commit for 4.14 is
>>>>>
>>>>> commit 1e6f74536de08b5e50cf0e37e735911c2cef7c62
>>>>> Author: Willem de Bruijn
>>>>> Date:   Fri Oct 6 13:22:31 2017 -0400
>>>>>
>>>>>     vhost_net: do not stall on zerocopy depletion
>>>>>
>>>>> Maybe you can try to revert it and see.
>>>>>
>>>>> If it does not solve your issue, I suspect there's a bug elsewhere
>>>>> that causes a packet to be held for a very long time.
>>>>>
>>>>>> I'm using rawhide so perhaps this is why it isn't disabled by
>>>>>> default but I have to mention it's an update of FC25 up to FC28
>>>>>> and it never got disabled.
>>>>>> Perhaps it should be disabled in Fedora too if it's not the
>>>>>> case... I'm not sure this is the place to discuss this ... is it?
>>>>>
>>>>> Probably not, but I guess Fedora tries to use new technology
>>>>> aggressively.
>>>>>
>>>>> Thanks
>>>>
>>>> I can revert that commit in 4.15-rc1 but I can't find it in 4.14.2
>>>> ... Is there another commit that could affect this?
>>>
>>> My bad, the suspects are then:
>>>
>>> 1f8b977ab32dc5d148f103326e80d9097f1cefb5 ("sock: enable MSG_ZEROCOPY")
>>> c1d1b437816f0afa99202be3cb650c9d174667bc ("net: convert (struct
>>> ubuf_info)->refcnt to refcount_t")
>>>
>>> Thanks
>>>
>>
>> Reverting those two commits breaks kernel compilation:
>>
>> net/core/dev.c: In function ‘dev_queue_xmit_nit’:
>> net/core/dev.c:1952:8: error: implicit declaration of function
>> ‘skb_orphan_frags_rx’; did you mean ‘skb_orphan_frags’?
>> [-Werror=implicit-function-declaration]
>>    if (!skb_orphan_frags_rx(skb2, GFP_ATOMIC))
>>         ^~~~~~~~~~~~~~~~~~~
>>         skb_orphan_frags
>>
>>
>> I changed skb_orphan_frags_rx to skb_orphan_frags and it compiled but
>> will everything blow up?
>>
>> Thanks,
>> Dave
>
> Finally, I reverted 581fe0ea61584d88072527ae9fb9dcb9d1f2783e too ...
> compiling and I'll keep you posted.

So I'm still able to reproduce this issue even after reverting these 3
commits. Do you have any other suspect commits?
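
Since zerocopy being on or off by default came up earlier in the thread, here is a minimal sketch for checking its state on a host while trying to reproduce the hang. It assumes vhost_net exposes its zerocopy TX switch as the read-only module parameter experimental_zcopytx under /sys/module/vhost_net/parameters/. That path and the helper name zerocopy_tx_enabled are assumptions for illustration and were not confirmed by anyone in this thread.

```python
from pathlib import Path
from typing import Optional

# Assumed sysfs location of the vhost_net zerocopy TX module parameter.
ZCOPY_PARAM = Path("/sys/module/vhost_net/parameters/experimental_zcopytx")


def zerocopy_tx_enabled(param: Path = ZCOPY_PARAM) -> Optional[bool]:
    """Return True/False for the parameter value, or None if it cannot be read."""
    try:
        raw = param.read_text().strip()
    except OSError:
        # Module not loaded, parameter absent, or file not readable.
        return None
    try:
        # The parameter is an integer; any non-zero value means enabled.
        return int(raw) != 0
    except ValueError:
        # Unexpected content; report unknown rather than guess.
        return None


if __name__ == "__main__":
    state = zerocopy_tx_enabled()
    if state is None:
        print("vhost_net zerocopy TX: unknown (parameter not readable)")
    else:
        print("vhost_net zerocopy TX:", "enabled" if state else "disabled")
```

If the parameter reads as enabled, one way to test with it off would be to stop the guests and reload the module with experimental_zcopytx=0, then check whether the shutdown hang still reproduces.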