From: Jim Minter
Message-ID: <56B28E8B.1030107@redhat.com>
Date: Wed, 3 Feb 2016 23:34:35 +0000
In-Reply-To: <56B28B1C.7060202@redhat.com>
References: <56B2754B.7030809@redhat.com> <56B28B1C.7060202@redhat.com>
Subject: Re: [Qemu-devel] sda abort with virtio-scsi
To: Paolo Bonzini, qemu-devel, Hannes Reinecke

Hi again, thanks for replying.

On 03/02/16 23:19, Paolo Bonzini wrote:
> On 03/02/2016 22:46, Jim Minter wrote:
>> I am hitting the following VM lockup issue running a VM with latest
>> RHEL7 kernel on a host also running latest RHEL7 kernel. FWIW I'm
>> using virtio-scsi because I want to use discard=unmap.
>> I ran the VM as follows:
>>
>> /usr/libexec/qemu-kvm -nodefaults \
>> -cpu host \
>> -smp 4 \
>> -m 8192 \
>> -drive discard=unmap,file=vm.qcow2,id=disk1,if=none,cache=unsafe \
>> -device virtio-scsi-pci \
>> -device scsi-disk,drive=disk1 \
>> -netdev bridge,id=net0,br=br0 \
>> -device virtio-net-pci,netdev=net0,mac=$(utils/random-mac.py) \
>> -chardev socket,id=chan0,path=/tmp/rhev.sock,server,nowait \
>> -chardev socket,id=chan1,path=/tmp/qemu.sock,server,nowait \
>> -monitor unix:tmp/vm.sock,server,nowait \
>> -device virtio-serial-pci \
>> -device virtserialport,chardev=chan0,name=com.redhat.rhevm.vdsm \
>> -device virtserialport,chardev=chan1,name=org.qemu.guest_agent.0 \
>> -device cirrus-vga \
>> -vnc none \
>> -usbdevice tablet
>>
>> The host was busyish at the time, but not excessively (IMO). Nothing
>> untoward in the host's kernel log; the host storage subsystem is fine.
>> I didn't get any qemu logs this time around, but I will when the issue
>> next recurs. The VM's full kernel log is attached; here are the
>> highlights:

> Hannes, were you going to send a patch to disable timeouts?
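As a stopgap while such a patch is discussed, the per-device SCSI command timeout inside the guest can at least be inspected (and raised) via sysfs. This is only a sketch assuming the standard sysfs layout; the 180-second value is illustrative, not a recommendation from this thread:

```shell
# List the SCSI command timeout (in seconds) for each block device that
# exposes one; the kernel default is typically 30.
for t in /sys/block/*/device/timeout; do
    [ -e "$t" ] || continue          # glob may match nothing on this system
    printf '%s: %s\n' "$t" "$(cat "$t")"
    # To raise the timeout (as root), e.g.: echo 180 > "$t"
done
```

Raising the timeout only papers over a stalled host, but it can keep the guest's sd driver from aborting commands while the underlying cause is investigated.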
>
>>
>> INFO: rcu_sched detected stalls on CPUs/tasks: { 3} (detected by 2, t=60002 jiffies, g=5253, c=5252, q=0)
>> sending NMI to all CPUs:
>> NMI backtrace for cpu 1
>> CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.10.0-327.4.5.el7.x86_64 #1
>> Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
>> task: ffff88023417d080 ti: ffff8802341a4000 task.ti: ffff8802341a4000
>> RIP: 0010:[] [] native_safe_halt+0x6/0x10
>> RSP: 0018:ffff8802341a7e98 EFLAGS: 00000286
>> RAX: 00000000ffffffed RBX: ffff8802341a4000 RCX: 0100000000000000
>> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000046
>> RBP: ffff8802341a7e98 R08: 0000000000000000 R09: 0000000000000000
>> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
>> R13: ffff8802341a4000 R14: ffff8802341a4000 R15: 0000000000000000
>> FS: 0000000000000000(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 00007f4978587008 CR3: 000000003645e000 CR4: 00000000003407e0
>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> Stack:
>> ffff8802341a7eb8 ffffffff8101dbcf ffff8802341a4000 ffffffff81a68260
>> ffff8802341a7ec8 ffffffff8101e4d6 ffff8802341a7f20 ffffffff810d62e5
>> ffff8802341a7fd8 ffff8802341a4000 2581685d70de192c 7ba58fdb3a3bc8d4
>> Call Trace:
>> [] default_idle+0x1f/0xc0
>> [] arch_cpu_idle+0x26/0x30
>> [] cpu_startup_entry+0x245/0x290
>> [] start_secondary+0x1ba/0x230
>> Code: 00 00 00 00 00 55 48 89 e5 fa 5d c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 fb 5d c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 fb f4 <5d> c3 0f 1f 84 00 00 00 00 00 55 48 89 e5 f4 5d c3 66 0f 1f 84
>> NMI backtrace for cpu 0
>
> This is the NMI watchdog firing; the CPU got stuck for 20 seconds. The
> issue was not a busy host, but busy storage (could it be a network
> partition, if the disk was hosted on NFS?)
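For what it's worth, the stall length can be read straight off the t= value in the rcu_sched message above. A quick back-of-envelope conversion, assuming CONFIG_HZ=1000 (the usual RHEL7 x86_64 default, not verified against this exact kernel build):

```shell
# Convert the reported jiffies count to seconds.
HZ=1000        # assumed CONFIG_HZ; check with: grep CONFIG_HZ= /boot/config-$(uname -r)
JIFFIES=60002  # the t= value from the rcu_sched stall warning
echo "stall lasted roughly $((JIFFIES / HZ)) seconds"
```

So under that assumption the RCU stall detector had already been waiting about a minute before the NMI backtraces were collected.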
The VM qcow2 storage is on host-local SSD, and although there's some
competition for the host CPU and storage, it seems surprising to me that
the VM should be starved of CPU to this extent. I was worried there was
some way in which the contention could cause an abort, and perhaps
thence the lockup (which does not seem to recover when the host load
goes down).

> Firing the NMI watchdog is fixed in more recent QEMU, which has
> asynchronous cancellation, assuming you're running RHEL's QEMU 1.5.3
> (try /usr/libexec/qemu-kvm --version, or rpm -qf /usr/libexec/qemu-kvm).

/usr/libexec/qemu-kvm --version reports:

QEMU emulator version 1.5.3 (qemu-kvm-1.5.3-105.el7_2.3)

Cheers,

Jim