From: Jim Minter <jminter@redhat.com>
To: Paolo Bonzini <pbonzini@redhat.com>,
	qemu-devel <qemu-devel@nongnu.org>,
	Hannes Reinecke <hare@suse.de>
Subject: Re: [Qemu-devel] sda abort with virtio-scsi
Date: Wed, 3 Feb 2016 23:34:35 +0000
Message-ID: <56B28E8B.1030107@redhat.com>
In-Reply-To: <56B28B1C.7060202@redhat.com>

Hi again, thanks for replying,

On 03/02/16 23:19, Paolo Bonzini wrote:
> On 03/02/2016 22:46, Jim Minter wrote:
>> I am hitting the following VM lockup issue running a VM with latest
>> RHEL7 kernel on a host also running latest RHEL7 kernel.  FWIW I'm using
>> virtio-scsi because I want to use discard=unmap.  I ran the VM as follows:
>>
>> /usr/libexec/qemu-kvm -nodefaults \
>>    -cpu host \
>>    -smp 4 \
>>    -m 8192 \
>>    -drive discard=unmap,file=vm.qcow2,id=disk1,if=none,cache=unsafe \
>>    -device virtio-scsi-pci \
>>    -device scsi-disk,drive=disk1 \
>>    -netdev bridge,id=net0,br=br0 \
>>    -device virtio-net-pci,netdev=net0,mac=$(utils/random-mac.py) \
>>    -chardev socket,id=chan0,path=/tmp/rhev.sock,server,nowait \
>>    -chardev socket,id=chan1,path=/tmp/qemu.sock,server,nowait \
>>    -monitor unix:tmp/vm.sock,server,nowait \
>>    -device virtio-serial-pci \
>>    -device virtserialport,chardev=chan0,name=com.redhat.rhevm.vdsm \
>>    -device virtserialport,chardev=chan1,name=org.qemu.guest_agent.0 \
>>    -device cirrus-vga \
>>    -vnc none \
>>    -usbdevice tablet
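
As an aside, in case the device setup matters here: discard=unmap only 
does anything once the guest actually issues unmaps.  A quick sanity 
check from inside the guest -- a rough sketch, assuming the disk shows 
up as /dev/sda and the root filesystem is ext4 or xfs:

  # non-zero DISC-GRAN/DISC-MAX means the unmap path is exposed to the guest
  lsblk --discard /dev/sda

  # trim the root filesystem once and report how much was discarded
  fstrim -v /
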
>>
>> The host was busyish at the time, but not excessively (IMO).  Nothing
>> untoward in the host's kernel log; host storage subsystem is fine.  I
>> didn't get any qemu logs this time around, but I will when the issue
>> next recurs.  The VM's full kernel log is attached; here are the
>> highlights:
>
> Hannes, were you going to send a patch to disable time outs?
>
>>
>> INFO: rcu_sched detected stalls on CPUs/tasks: { 3} (detected by 2, t=60002 jiffies, g=5253, c=5252, q=0)
>> sending NMI to all CPUs:
>> NMI backtrace for cpu 1
>> CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.10.0-327.4.5.el7.x86_64 #1
>> Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
>> task: ffff88023417d080 ti: ffff8802341a4000 task.ti: ffff8802341a4000
>> RIP: 0010:[<ffffffff81058e96>]  [<ffffffff81058e96>] native_safe_halt+0x6/0x10
>> RSP: 0018:ffff8802341a7e98  EFLAGS: 00000286
>> RAX: 00000000ffffffed RBX: ffff8802341a4000 RCX: 0100000000000000
>> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000046
>> RBP: ffff8802341a7e98 R08: 0000000000000000 R09: 0000000000000000
>> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
>> R13: ffff8802341a4000 R14: ffff8802341a4000 R15: 0000000000000000
>> FS:  0000000000000000(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 00007f4978587008 CR3: 000000003645e000 CR4: 00000000003407e0
>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> Stack:
>>   ffff8802341a7eb8 ffffffff8101dbcf ffff8802341a4000 ffffffff81a68260
>>   ffff8802341a7ec8 ffffffff8101e4d6 ffff8802341a7f20 ffffffff810d62e5
>>   ffff8802341a7fd8 ffff8802341a4000 2581685d70de192c 7ba58fdb3a3bc8d4
>> Call Trace:
>>   [<ffffffff8101dbcf>] default_idle+0x1f/0xc0
>>   [<ffffffff8101e4d6>] arch_cpu_idle+0x26/0x30
>>   [<ffffffff810d62e5>] cpu_startup_entry+0x245/0x290
>>   [<ffffffff810475fa>] start_secondary+0x1ba/0x230
>> Code: 00 00 00 00 00 55 48 89 e5 fa 5d c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 fb 5d c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 fb f4 <5d> c3 0f 1f 84 00 00 00 00 00 55 48 89 e5 f4 5d c3 66 0f 1f 84
>> NMI backtrace for cpu 0
>
> This is the NMI watchdog firing; the CPU got stuck for 20 seconds.  The
> issue was not a busy host, but a busy storage (could it be a network
> partition if the disk was hosted on NFS???)

The VM's qcow2 storage is on a host-local SSD, and although there is 
some competition for host CPU and storage, it surprises me that the VM 
should be starved of CPU to this extent.  I was worried that the 
contention could somehow cause an abort and, from there, the lockup 
(which does not seem to recover once the host load drops).
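
In case it is relevant to the abort side of this: the abort comes from 
the guest's SCSI command timer expiring, so one thing I can try in the 
meantime is watching (and, purely as an experiment, raising) that 
timeout.  A rough sketch, assuming the disk is /dev/sda in the guest:

  # per-command timeout in seconds (defaults to 30)
  cat /sys/block/sda/device/timeout

  # temporarily raise it to see whether the aborts stop during host load spikes
  echo 180 > /sys/block/sda/device/timeout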

> Firing the NMI watchdog is fixed in more recent QEMU, which has
> asynchronous cancellation, assuming you're running RHEL's QEMU 1.5.3
> (try /usr/libexec/qemu-kvm --version, or rpm -qf /usr/libexec/qemu-kvm).

/usr/libexec/qemu-kvm --version reports "QEMU emulator version 1.5.3 
(qemu-kvm-1.5.3-105.el7_2.3)".
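
For anyone else following along, a quick way to see what the host has 
installed and whether a newer build is on offer -- a sketch assuming a 
stock yum-managed RHEL7 host:

  # exact installed build of the emulator
  rpm -q qemu-kvm

  # any newer qemu-kvm builds available from the configured repos
  yum list available 'qemu-kvm*'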

Cheers,

Jim

