From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jason Wang
Subject: Re: Shutting down a VM with Kernel 4.14 will sometime hang and a reboot is the only way to recover.
Date: Wed, 29 Nov 2017 09:52:14 +0800
Message-ID: <634116a6-6338-4249-7d2d-430b654cc99c@redhat.com>
References: <92c4f997-80db-fabf-98c8-fcb92da064a7@redhat.com>
 <7bd45f84-d07e-7fca-6ca3-07dededd092d@redhat.com>
 <29f8e09f-8920-52d0-02f4-c0fb779135ee@redhat.com>
 <9c912f3b-081c-8b02-17c8-453ebf36f42c@redhat.com>
 <10fe2b98-1e26-9539-9f49-0d01f8693e04@redhat.com>
 <6b41b4e5-6c0c-fce6-21fe-02dd8f550095@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
To: David Hill, Paolo Bonzini, kvm@vger.kernel.org
Return-path:
Received: from mx1.redhat.com ([209.132.183.28]:40328 "EHLO mx1.redhat.com"
 rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
 id S1751381AbdK2BwU (ORCPT ); Tue, 28 Nov 2017 20:52:20 -0500
Received: from smtp.corp.redhat.com (int-mx01.intmail.prod.int.phx2.redhat.com [10.5.11.11])
 (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by mx1.redhat.com (Postfix) with ESMTPS id D379C356EB
 for ; Wed, 29 Nov 2017 01:52:20 +0000 (UTC)
In-Reply-To:
Content-Language: en-US
Sender: kvm-owner@vger.kernel.org
List-ID:

On 2017-11-29 02:00, David Hill wrote:
>
>
> On 2017-11-27 02:38 PM, David Hill wrote:
>>
>>
>> On 2017-11-26 10:44 PM, Jason Wang wrote:
>>>
>>>
>>> On 2017-11-25 00:22, David Hill wrote:
>>>> The VMs all have 2 vNICs ...
and this is the hypervisor:
>>>>
>>>> [root@zappa ~]# brctl show
>>>> bridge name    bridge id            STP enabled    interfaces
>>>> virbr0         8000.525400914858    yes            virbr0-nic
>>>>                                                    vnet0
>>>>                                                    vnet1
>>>>
>>>>
>>>> 1: lo: mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
>>>>     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>>>     inet 127.0.0.1/8 scope host lo
>>>>        valid_lft forever preferred_lft forever
>>>>     inet6 ::1/128 scope host
>>>>        valid_lft forever preferred_lft forever
>>>> 2: eno1: mtu 1500 qdisc mq state UP group default qlen 1000
>>>>     link/ether 84:2b:2b:13:f2:91 brd ff:ff:ff:ff:ff:ff
>>>>     inet redacted/24 brd 173.178.138.255 scope global dynamic eno1
>>>>        valid_lft 48749sec preferred_lft 48749sec
>>>>     inet6 fe80::862b:2bff:fe13:f291/64 scope link
>>>>        valid_lft forever preferred_lft forever
>>>> 3: eno2: mtu 1500 qdisc mq state UP group default qlen 1000
>>>>     link/ether 84:2b:2b:13:f2:92 brd ff:ff:ff:ff:ff:ff
>>>>     inet 192.168.1.3/24 brd 192.168.1.255 scope global eno2
>>>>        valid_lft forever preferred_lft forever
>>>>     inet6 fe80::862b:2bff:fe13:f292/64 scope link
>>>>        valid_lft forever preferred_lft forever
>>>> 4: virbr0: mtu 1500 qdisc noqueue state UP group default qlen 1000
>>>>     link/ether 52:54:00:91:48:58 brd ff:ff:ff:ff:ff:ff
>>>>     inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0
>>>>        valid_lft forever preferred_lft forever
>>>>     inet 192.168.122.10/32 scope global virbr0
>>>>        valid_lft forever preferred_lft forever
>>>>     inet 192.168.122.11/32 scope global virbr0
>>>>        valid_lft forever preferred_lft forever
>>>>     inet 192.168.122.12/32 scope global virbr0
>>>>        valid_lft forever preferred_lft forever
>>>>     inet 192.168.122.15/32 scope global virbr0
>>>>        valid_lft forever preferred_lft forever
>>>>     inet 192.168.122.16/32 scope global virbr0
>>>>        valid_lft forever preferred_lft forever
>>>>     inet 192.168.122.17/32 scope global virbr0
>>>>        valid_lft forever preferred_lft forever
>>>>     inet 192.168.122.18/32 scope global virbr0
>>>>        valid_lft forever preferred_lft forever
>>>>     inet 192.168.122.31/32 scope global virbr0
>>>>        valid_lft forever preferred_lft forever
>>>>     inet 192.168.122.32/32 scope global virbr0
>>>>        valid_lft forever preferred_lft forever
>>>>     inet 192.168.122.33/32 scope global virbr0
>>>>        valid_lft forever preferred_lft forever
>>>>     inet 192.168.122.34/32 scope global virbr0
>>>>        valid_lft forever preferred_lft forever
>>>>     inet 192.168.122.35/32 scope global virbr0
>>>>        valid_lft forever preferred_lft forever
>>>>     inet 192.168.122.36/32 scope global virbr0
>>>>        valid_lft forever preferred_lft forever
>>>>     inet 192.168.122.37/32 scope global virbr0
>>>>        valid_lft forever preferred_lft forever
>>>>     inet 192.168.122.45/32 scope global virbr0
>>>>        valid_lft forever preferred_lft forever
>>>>     inet 192.168.122.46/32 scope global virbr0
>>>>        valid_lft forever preferred_lft forever
>>>>     inet 192.168.122.47/32 scope global virbr0
>>>>        valid_lft forever preferred_lft forever
>>>>     inet 192.168.122.48/32 scope global virbr0
>>>>        valid_lft forever preferred_lft forever
>>>>     inet 192.168.122.49/32 scope global virbr0
>>>>        valid_lft forever preferred_lft forever
>>>>     inet 192.168.122.50/32 scope global virbr0
>>>>        valid_lft forever preferred_lft forever
>>>>     inet 192.168.122.51/32 scope global virbr0
>>>>        valid_lft forever preferred_lft forever
>>>> 5: virbr0-nic: mtu 1500 qdisc fq_codel master virbr0 state DOWN group default qlen 1000
>>>>     link/ether 52:54:00:91:48:58 brd ff:ff:ff:ff:ff:ff
>>>> 125: tun0: mtu 1360 qdisc fq_codel state UNKNOWN group default
qlen 100
>>>>     link/none
>>>>     inet 10.10.122.28/21 brd 10.10.127.255 scope global tun0
>>>>        valid_lft forever preferred_lft forever
>>>>     inet6 fe80::1f9b:bfd4:e9c9:2059/64 scope link stable-privacy
>>>>        valid_lft forever preferred_lft forever
>>>> 402: vnet0: mtu 1500 qdisc fq_codel master virbr0 state UNKNOWN group default qlen 1000
>>>>     link/ether fe:54:00:09:27:39 brd ff:ff:ff:ff:ff:ff
>>>>     inet6 fe80::fc54:ff:fe09:2739/64 scope link
>>>>        valid_lft forever preferred_lft forever
>>>> 403: vnet1: mtu 1500 qdisc fq_codel master virbr0 state UNKNOWN group default qlen 1000
>>>>     link/ether fe:54:00:ea:6b:18 brd ff:ff:ff:ff:ff:ff
>>>>     inet6 fe80::fc54:ff:feea:6b18/64 scope link
>>>>        valid_lft forever preferred_lft forever
>>>>
>>>
>>> I could not reproduce this locally by simply running netperf through
>>> a mlx4 card. Some more questions:
>>>
>>> - What kind of workloads did you run in guest?
>>> - Did you meet this issue in a specific type of network card (I
>>> guess broadcom is used in this case)?
>>> - Virbr0 looks like a bridge created by libvirt that did NAT and
>>> other stuffs, can you still hit this issue if you don't use virbr0?
>>>
>>> And what's more important, zerocopy is known to have issues, for
>>> production environment, need to disable it through vhost_net module
>>> parameters.
>>>
>>> Thanks
>>
>> I'm deploying an overcloud through a undercloud virtual machine...
>> The VM has 4vCPUs and 16GB of RAM as well as to virtio nics so I'm
>> using only virtual hardware here.
>> I spawn 7 VMs on the hypervisor and deploy an overcloud using tripleo
>> on them ... everything's virtual and if I remove the bridge, then
>> I'll have to configure each VMs differently.
>> The load is quite high on the VM that won't shutdown but when I shut
>> it down, it's doing nothing ...
This is a hard bug to troubleshoot
>> and I can't bisect the kernel because at some
>> point the system simply won't boot properly.
>
> I've disabled zerocopy with the following:
>
> [root@zappa modprobe.d]# cat vhost-net.conf
> options vhost_net  experimental_zcopytx=0
>
>
> And I haven't reproduce this issue so far.   The problem I have right
> now is that experimental_zcopytx has been enabled by default with this
> commit:
>
> commit f9611c43ab0ddaf547b395c90fb842f55959334c
> Author: Michael S. Tsirkin
> Date:   Thu Dec 6 14:56:00 2012 +0200
>
>     vhost-net: enable zerocopy tx by default
>
>     Zero copy TX has been around for a while now.
>     We seem to be down to eliminating theoretical bugs
>     and performance tuning at this point:
>     it's probably time to enable it by default so that
>     most users get the benefit.
>
>     Keep the flag around meanwhile so users can experiment
>     with disabling this if they experience regressions.
>     I expect that we will remove it in the future.
>
>     Signed-off-by: Michael S. Tsirkin
>
> I'll try some more pass in producing this issue and I'll keep you posted.
>
> Thank you very much,
>
> David Hill
>

Thanks. Zerocopy is disabled by default in several distributions.
Upstream, the only reason to leave it on is the hope that more
developers will help find and fix the issues.
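
For anyone following along who wants to confirm that the modprobe.d
option above actually took effect, here is a minimal sketch. It assumes
the vhost_net module exposes the parameter read-only at the usual sysfs
location, /sys/module/vhost_net/parameters/experimental_zcopytx; the
function name is made up for illustration. Keep in mind that options in
modprobe.d are only applied when the module is loaded, so an already
loaded vhost_net keeps its old value until it is reloaded or the host
is rebooted.

```python
from pathlib import Path

# Assumed sysfs location of the vhost_net module parameter.
PARAM = Path("/sys/module/vhost_net/parameters/experimental_zcopytx")


def zerocopy_tx_state(param_path=PARAM):
    """Return 'disabled', 'enabled', or 'module not loaded' (hypothetical helper)."""
    try:
        value = Path(param_path).read_text().strip()
    except FileNotFoundError:
        # No sysfs entry: vhost_net is not loaded (or is built differently).
        return "module not loaded"
    # The parameter is an integer; 0 means zerocopy TX is off.
    return "disabled" if value == "0" else "enabled"


print(zerocopy_tx_state())
```

If this still reports "enabled" after adding vhost-net.conf, the module
was most likely loaded before the option was written.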