From: Jason Wang
Subject: Re: Shutting down a VM with Kernel 4.14 will sometime hang and a reboot is the only way to recover.
Date: Fri, 24 Nov 2017 11:11:57 +0800
To: Paolo Bonzini, David Hill, kvm@vger.kernel.org

On 2017/11/24 07:48, Paolo Bonzini wrote:
> Jason, any ideas?
>
> Thanks,
>
> Paolo
>
> On 22/11/2017 19:22, David Hill wrote:
>> ore than 120 seconds.
>> [ 7496.552987] Tainted: G I 4.14.0-0.rc1.git3.1.fc28.x86_64 #1
>> [ 7496.552996] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [ 7496.553006] qemu-system-x86 D12240 5978 1 0x00000004
>> [ 7496.553024] Call Trace:
>> [ 7496.553044] __schedule+0x2dc/0xbb0
>> [ 7496.553055] ? trace_hardirqs_on+0xd/0x10
>> [ 7496.553074] schedule+0x3d/0x90
>> [ 7496.553087] vhost_net_ubuf_put_and_wait+0x73/0xa0 [vhost_net]
>> [ 7496.553100] ? finish_wait+0x90/0x90
>> [ 7496.553115] vhost_net_ioctl+0x542/0x910 [vhost_net]
>> [ 7496.553144] do_vfs_ioctl+0xa6/0x6c0
>> [ 7496.553166] SyS_ioctl+0x79/0x90
>> [ 7496.553182] entry_SYSCALL_64_fastpath+0x1f/0xbe
>> [ 7496.553190] RIP: 0033:0x7fa1ea0e1817
>> [ 7496.553196] RSP: 002b:00007ffe3d854bc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
>> [ 7496.553209] RAX: ffffffffffffffda RBX: 000000000000001d RCX: 00007fa1ea0e1817
>> [ 7496.553215] RDX: 00007ffe3d854bd0 RSI: 000000004008af30 RDI: 0000000000000021
>> [ 7496.553222] RBP: 000055e33352b610 R08: 000055e33024a6f0 R09: 000055e330245d92
>> [ 7496.553228] R10: 000055e33344e7f0 R11: 0000000000000246 R12: 000055e33351a000
>> [ 7496.553235] R13: 0000000000000001 R14: 0000000400000000 R15: 0000000000000000
>> [ 7496.553284]
>> Showing all locks held in the system:
>> [ 7496.553313] 1 lock held by khungtaskd/161:
>> [ 7496.553319] #0: (tasklist_lock){.+.+}, at: [] debug_show_all_locks+0x3d/0x1a0
>> [ 7496.553373] 1 lock held by in:imklog/1194:
>> [ 7496.553379] #0: (&f->f_pos_lock){+.+.}, at: [] __fdget_pos+0x4c/0x60
>> [ 7496.553541] 1 lock held by qemu-system-x86/5978:
>> [ 7496.553547] #0: (&dev->mutex#3){+.+.}, at: [] vhost_net_ioctl+0x358/0x910 [vhost_net]

Hi:

The backtrace shows that a zero-copied skb was not sent for a long time, for some reason. This could be caused either by a bug in vhost_net or by something elsewhere, such as the host driver or the qdiscs.

What is your network setup on the host (e.g. the qdiscs or the network driver)? Can you still hit the issue if you switch to another type of Ethernet driver or card? Is this still reproducible with net.git (https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/)?

I will try to reproduce this locally.

Thanks
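For readers following the trace: the hung task is sleeping in vhost_net_ubuf_put_and_wait(), and the RSI value 0x4008af30 appears to decode as the VHOST_NET_SET_BACKEND ioctl. As far as can be told from the vhost_net source of that era, that function drops the device's own reference on a count of outstanding zerocopy buffers and then waits, with no timeout, for the count to reach zero. One skb that the lower layers never release therefore keeps the ioctl (and the &dev->mutex shown in the lock dump) held indefinitely. The following is a minimal userspace analogue under that assumption, not the vhost_net code: struct ubuf_ref, ubuf_init(), ubuf_get(), ubuf_put(), ubuf_put_and_wait() and device_complete_later() are hypothetical names, and a pthread condition variable with a timeout stands in for the kernel wait queue so that the stuck case returns instead of hanging.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical userspace analogue of a counter of in-flight zerocopy
 * buffers; this is NOT the kernel's vhost_net implementation. */
struct ubuf_ref {
    pthread_mutex_t lock;
    pthread_cond_t zero;  /* broadcast when refcount drops to 0 */
    int refcount;         /* owner's reference + one per in-flight buffer */
};

static void ubuf_init(struct ubuf_ref *u)
{
    pthread_mutex_init(&u->lock, NULL);
    pthread_cond_init(&u->zero, NULL);
    u->refcount = 1; /* the owner's own reference */
}

/* A zerocopy buffer is handed to the lower layers (qdisc, NIC driver, ...). */
static void ubuf_get(struct ubuf_ref *u)
{
    pthread_mutex_lock(&u->lock);
    u->refcount++;
    pthread_mutex_unlock(&u->lock);
}

/* Completion path: the lower layers are done with one buffer. */
static void ubuf_put(struct ubuf_ref *u)
{
    pthread_mutex_lock(&u->lock);
    assert(u->refcount > 0);
    if (--u->refcount == 0)
        pthread_cond_broadcast(&u->zero);
    pthread_mutex_unlock(&u->lock);
}

/* Drop the owner's reference, then wait until every in-flight buffer has
 * completed. Unlike the kernel wait, this gives up after timeout_s seconds
 * and returns false, so the "never released" case can be observed. */
static bool ubuf_put_and_wait(struct ubuf_ref *u, int timeout_s)
{
    struct timespec deadline;
    /* pthread_cond_timedwait measures against CLOCK_REALTIME by default. */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_s;

    pthread_mutex_lock(&u->lock);
    assert(u->refcount > 0);
    u->refcount--;
    int rc = 0;
    /* Loop to absorb spurious wakeups; stop on ETIMEDOUT or any error. */
    while (u->refcount != 0 && rc == 0)
        rc = pthread_cond_timedwait(&u->zero, &u->lock, &deadline);
    bool done = (u->refcount == 0);
    pthread_mutex_unlock(&u->lock);
    return done;
}

/* Hypothetical "device" thread that completes one buffer after about 100 ms. */
static void *device_complete_later(void *arg)
{
    struct timespec delay = { .tv_sec = 0, .tv_nsec = 100L * 1000 * 1000 };
    nanosleep(&delay, NULL);
    ubuf_put(arg);
    return NULL;
}
```

For example, after ubuf_init() and one ubuf_get() with no matching ubuf_put(), ubuf_put_and_wait(&u, 1) returns false after about a second. That unmatched buffer plays the role of the zero-copied skb that was never released; in the kernel the corresponding wait has no timeout, which matches the task sitting in D state in the trace.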