From: David Hill
Subject: Re: Shutting down a VM with Kernel 4.14 will sometime hang and a reboot is the only way to recover.
Date: Mon, 27 Nov 2017 14:38:06 -0500
Message-ID: <6b41b4e5-6c0c-fce6-21fe-02dd8f550095@redhat.com>
References: <92c4f997-80db-fabf-98c8-fcb92da064a7@redhat.com> <7bd45f84-d07e-7fca-6ca3-07dededd092d@redhat.com> <29f8e09f-8920-52d0-02f4-c0fb779135ee@redhat.com> <9c912f3b-081c-8b02-17c8-453ebf36f42c@redhat.com> <10fe2b98-1e26-9539-9f49-0d01f8693e04@redhat.com>
In-Reply-To: <10fe2b98-1e26-9539-9f49-0d01f8693e04@redhat.com>
To: Jason Wang, Paolo Bonzini, kvm@vger.kernel.org

On 2017-11-26 10:44 PM, Jason Wang wrote:
>
>
> On 2017-11-25 00:22, David Hill wrote:
>> The VMs all have 2 vNICs ... and this is the hypervisor:
>>
>> [root@zappa ~]# brctl show
>> bridge name    bridge id            STP enabled    interfaces
>> virbr0         8000.525400914858    yes            virbr0-nic
>>                                                    vnet0
>>                                                    vnet1
>>
>> 1: lo: mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
>>     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>     inet 127.0.0.1/8 scope host lo
>>        valid_lft forever preferred_lft forever
>>     inet6 ::1/128 scope host
>>        valid_lft forever preferred_lft forever
>> 2: eno1: mtu 1500 qdisc mq state UP group default qlen 1000
>>     link/ether 84:2b:2b:13:f2:91 brd ff:ff:ff:ff:ff:ff
>>     inet redacted/24 brd 173.178.138.255 scope global dynamic eno1
>>        valid_lft 48749sec preferred_lft 48749sec
>>     inet6 fe80::862b:2bff:fe13:f291/64 scope link
>>        valid_lft forever preferred_lft forever
>> 3: eno2: mtu 1500 qdisc mq state UP group default qlen 1000
>>     link/ether 84:2b:2b:13:f2:92 brd ff:ff:ff:ff:ff:ff
>>     inet 192.168.1.3/24 brd 192.168.1.255 scope global eno2
>>        valid_lft forever preferred_lft forever
>>     inet6 fe80::862b:2bff:fe13:f292/64 scope link
>>        valid_lft forever preferred_lft forever
>> 4: virbr0: mtu 1500 qdisc noqueue state UP group default qlen 1000
>>     link/ether 52:54:00:91:48:58 brd ff:ff:ff:ff:ff:ff
>>     inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0
>>        valid_lft forever preferred_lft forever
>>     inet 192.168.122.10/32 scope global virbr0
>>        valid_lft forever preferred_lft forever
>>     inet 192.168.122.11/32 scope global virbr0
>>        valid_lft forever preferred_lft forever
>>     inet 192.168.122.12/32 scope global virbr0
>>        valid_lft forever preferred_lft forever
>>     inet 192.168.122.15/32 scope global virbr0
>>        valid_lft forever preferred_lft forever
>>     inet 192.168.122.16/32 scope global virbr0
>>        valid_lft forever preferred_lft forever
>>     inet 192.168.122.17/32 scope global virbr0
>>        valid_lft forever preferred_lft forever
>>     inet 192.168.122.18/32 scope global virbr0
>>        valid_lft forever preferred_lft forever
>>     inet 192.168.122.31/32 scope global virbr0
>>        valid_lft forever preferred_lft forever
>>     inet 192.168.122.32/32 scope global virbr0
>>        valid_lft forever preferred_lft forever
>>     inet 192.168.122.33/32 scope global virbr0
>>        valid_lft forever preferred_lft forever
>>     inet 192.168.122.34/32 scope global virbr0
>>        valid_lft forever preferred_lft forever
>>     inet 192.168.122.35/32 scope global virbr0
>>        valid_lft forever preferred_lft forever
>>     inet 192.168.122.36/32 scope global virbr0
>>        valid_lft forever preferred_lft forever
>>     inet 192.168.122.37/32 scope global virbr0
>>        valid_lft forever preferred_lft forever
>>     inet 192.168.122.45/32 scope global virbr0
>>        valid_lft forever preferred_lft forever
>>     inet 192.168.122.46/32 scope global virbr0
>>        valid_lft forever preferred_lft forever
>>     inet 192.168.122.47/32 scope global virbr0
>>        valid_lft forever preferred_lft forever
>>     inet 192.168.122.48/32 scope global virbr0
>>        valid_lft forever preferred_lft forever
>>     inet 192.168.122.49/32 scope global virbr0
>>        valid_lft forever preferred_lft forever
>>     inet 192.168.122.50/32 scope global virbr0
>>        valid_lft forever preferred_lft forever
>>     inet 192.168.122.51/32 scope global virbr0
>>        valid_lft forever preferred_lft forever
>> 5: virbr0-nic: mtu 1500 qdisc fq_codel master virbr0 state DOWN group default qlen 1000
>>     link/ether 52:54:00:91:48:58 brd ff:ff:ff:ff:ff:ff
>> 125: tun0: mtu 1360 qdisc fq_codel state UNKNOWN group default qlen 100
>>     link/none
>>     inet 10.10.122.28/21 brd 10.10.127.255 scope global tun0
>>        valid_lft forever preferred_lft forever
>>     inet6 fe80::1f9b:bfd4:e9c9:2059/64 scope link stable-privacy
>>        valid_lft forever preferred_lft forever
>> 402: vnet0: mtu 1500 qdisc fq_codel master virbr0 state UNKNOWN group default qlen 1000
>>     link/ether fe:54:00:09:27:39 brd ff:ff:ff:ff:ff:ff
>>     inet6 fe80::fc54:ff:fe09:2739/64 scope link
>>        valid_lft forever preferred_lft forever
>> 403: vnet1: mtu 1500 qdisc fq_codel master virbr0 state UNKNOWN group default qlen 1000
>>     link/ether fe:54:00:ea:6b:18 brd ff:ff:ff:ff:ff:ff
>>     inet6 fe80::fc54:ff:feea:6b18/64 scope link
>>        valid_lft forever preferred_lft forever
>>
>
> I could not reproduce this locally by simply running netperf through a
> mlx4 card. Some more questions:
>
> - What kind of workloads did you run in the guests?
> - Did you hit this issue only with a specific type of network card (I
> guess Broadcom is used in this case)?
> - virbr0 looks like a bridge created by libvirt that does NAT and other
> things; can you still hit this issue if you don't use virbr0?
>
> More importantly, zerocopy is known to have issues; for production
> environments it needs to be disabled through the vhost_net module
> parameters.
>
> Thanks

I'm deploying an overcloud through an undercloud virtual machine ... The undercloud VM has 4 vCPUs, 16 GB of RAM and two virtio NICs, so I'm using only virtual hardware here. I spawn 7 VMs on the hypervisor and deploy an overcloud on them with TripleO ... everything is virtual, and if I remove the bridge I'll have to configure each VM differently. The load is quite high on the VM that won't shut down, but it's doing nothing when I shut it down ...

This is a hard bug to troubleshoot, and I can't bisect the kernel because at some point the system simply won't boot properly.
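
For reference, one way to answer the virbr0 question is to take libvirt's NAT network out of the picture and attach the guest NICs to a plain host bridge instead. A rough sketch; the bridge name br0 and the domain name undercloud are placeholders, not taken from this setup:

    # confirm that virbr0 belongs to libvirt's NAT "default" network
    virsh net-dumpxml default
    # then edit the guest and point its NIC at a pre-existing host bridge
    virsh edit undercloud
    #   <interface type='bridge'>
    #     <source bridge='br0'/>
    #     <model type='virtio'/>
    #   </interface>

If the hang no longer reproduces with the guests on a plain bridge, that would point at the NAT/bridge path rather than vhost_net itself.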
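The vhost_net parameter in question is experimental_zcopytx; something along these lines should disable zerocopy (the modprobe.d file name is the usual convention, adjust to taste):

    # check the current setting (1 = zerocopy TX enabled)
    cat /sys/module/vhost_net/parameters/experimental_zcopytx
    # make the change persistent across module reloads
    echo "options vhost_net experimental_zcopytx=0" > /etc/modprobe.d/vhost-net.conf
    # or apply it immediately, with all VMs shut down first
    modprobe -r vhost_net && modprobe vhost_net experimental_zcopytx=0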
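If the unbootable revisions are the only obstacle to bisecting, git bisect can step around them with skip; a minimal sketch, assuming v4.13 was the last known good kernel:

    git bisect start
    git bisect bad v4.14
    git bisect good v4.13
    # build, boot and test each candidate, then mark it:
    git bisect good   # VM shuts down cleanly
    git bisect bad    # shutdown hangs
    git bisect skip   # this revision won't boot at all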