* Linux Crash Caused By KVM?
@ 2012-04-11 2:11 Peijie Yu
2012-04-11 14:45 ` Avi Kivity
0 siblings, 1 reply; 4+ messages in thread
From: Peijie Yu @ 2012-04-11 2:11 UTC (permalink / raw)
To: kvm
Hi all,
I have run into some problems while using KVM.
The test environment is:
Summary: Dell R610, 1 x Xeon E5645 2.40GHz, 47.1GB / 48GB 1333MHz DDR3
System: Dell PowerEdge R610 (Dell 08GXHX)
Processors: 1 (of 2) x Xeon E5645 2.40GHz 5860MHz FSB (HT enabled,
6 cores, 24 threads)
Memory: 47.1GB / 48GB 1333MHz DDR3 == 12 x 4GB
Disk: sda: 299GB (72%) JBOD
Disk: sdb (host9): 5.0TB JBOD == 1 x VIRTUAL-DISK
Disk: sdc (host11): 5.0TB JBOD == 1 x VIRTUAL-DISK
Disk: sdd (host12): 5.0TB JBOD == 1 x VIRTUAL-DISK
Disk: sde (host10): 5.0TB JBOD == 1 x VIRTUAL-DISK
Disk-Control: mpt2sas0: LSI Logic / Symbios Logic SAS2008
PCI-Express Fusion-MPT SAS-2 [Falcon]
Disk-Control: host9:
Disk-Control: host10:
Disk-Control: host11:
Disk-Control: host12:
Chipset: Intel 82801IB (ICH9)
Network: br1 (bridge): 14:fe:b5:dc:2c:6e
Network: em1 (bnx2): Broadcom NetXtreme II BCM5709 Gigabit,
14:fe:b5:dc:2c:6e, 1000Mb/s <full-duplex>
Network: em2 (bnx2): Broadcom NetXtreme II BCM5709 Gigabit,
14:fe:b5:dc:2c:70, 1000Mb/s <full-duplex>
Network: em3 (bnx2): Broadcom NetXtreme II BCM5709 Gigabit,
14:fe:b5:dc:2c:72, 1000Mb/s <full-duplex>
Network: em4 (bnx2): Broadcom NetXtreme II BCM5709 Gigabit,
14:fe:b5:dc:2c:74, 1000Mb/s <full-duplex>
Network: vnet0 (tun): fe:16:3e:49:fb:05, 10Mb/s <full-duplex>
Network: vnet1 (tun): fe:16:3e:cb:c0:d1, 10Mb/s <full-duplex>
Network: vnet2 (tun): fe:16:3e:1e:c1:c4, 10Mb/s <full-duplex>
Network: vnet3 (tun): fe:16:3e:d5:58:f4, 10Mb/s <full-duplex>
Network: vnet4 (tun): fe:16:3e:15:b4:16, 10Mb/s <full-duplex>
Network: vnet5 (tun): fe:16:3e:d2:07:47, 10Mb/s <full-duplex>
Network: vnet6 (tun): fe:16:3e:e1:2b:b9, 10Mb/s <full-duplex>
OS: RHEL Server 6.1 (Santiago), Linux
2.6.32-220.2.1.el6.x86_64 x86_64, 64-bit
BIOS: Dell 3.0.0 01/31/2011
While running KVM, I have hit the following issues:
1. Host crashes:
a. Kernel panic
KERNEL: /usr/lib/debug/lib/modules/2.6.32-131.12.1.el6.x86_64/vmlinux
DUMPFILE: ../vmcore_2012.13.46 [PARTIAL DUMP]
CPUS: 24
DATE: Wed Jan 11 13:34:13 2012
UPTIME: 25 days, 04:11:05
LOAD AVERAGE: 223.16, 172.97, 158.23
TASKS: 1464
NODENAME: dell2.localdomain
RELEASE: 2.6.32-131.12.1.el6.x86_64
VERSION: #1 SMP Sun Jul 31 16:44:56 EDT 2011
MACHINE: x86_64 (2394 Mhz)
MEMORY: 48 GB
PANIC: "kernel BUG at arch/x86/kernel/traps.c:547!"
PID: 11851
COMMAND: "qemu-kvm"
TASK: ffff880c071c3500 [THREAD_INFO: ffff880c132d8000]
CPU: 1
STATE: TASK_RUNNING (PANIC)

PID: 11851 TASK: ffff880c071c3500 CPU: 1 COMMAND: "qemu-kvm"
#0 [ffff880028207be0] machine_kexec at ffffffff810310cb
#1 [ffff880028207c40] crash_kexec at ffffffff810b6392
#2 [ffff880028207d10] oops_end at ffffffff814de670
#3 [ffff880028207d40] die at ffffffff8100f2eb
#4 [ffff880028207d70] do_trap at ffffffff814ddf64
#5 [ffff880028207dd0] do_invalid_op at ffffffff8100ceb5
#6 [ffff880028207e70] invalid_op at ffffffff8100bf5b
[exception RIP: do_nmi+554]
RIP: ffffffff814de43a RSP: ffff880028207f28 RFLAGS: 00010002
RAX: ffff880c132d9fd8 RBX: ffff880028207f58 RCX: 00000000c0000101
RDX: 00000000ffff8800 RSI: ffffffffffffffff RDI: ffff880028207f58
RBP: ffff880028207f48 R8: ffff88005ebf9800 R9: ffff880028203fc0
R10: 0000000000000034 R11: 00000000000003e8 R12: 000000000000cc20
R13: ffffffff816024a0 R14: ffff88005ebf9800 R15: 00007ffffffff000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#7 [ffff880028207f50] nmi at ffffffff814ddc90
[exception RIP: bad_to_user+37]
RIP: ffffffff814e4e2b RSP: ffff880028207bb0 RFLAGS: 00010046
RAX: ffff880c132d9fd8 RBX: ffff880c132d9c48 RCX: 0000000000000001
RDX: 0000000000000000 RSI: 000000010000000b RDI: ffff880028207c08
RBP: ffff880028207c48 R8: ffff88005ebf9800 R9: ffff880028203fc0
R10: 0000000000000034 R11: 00000000000003e8 R12: 000000000000cc20
R13: ffffffff816024a0 R14: ffff88005ebf9800 R15: 00007ffffffff000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
For this problem, I found that the panic is triggered by
BUG_ON(in_nmi()), which means an NMI occurred while we were already
in NMI context. But I checked the Intel Software Developer's Manual,
which says: "While an NMI interrupt handler is executing, the
processor disables additional calls to the NMI handler until the
next IRET instruction is executed." So how can this happen?
b. Hard lockup on the CPU running a qemu-kvm process
KERNEL: /usr/lib/debug/lib/modules/2.6.32-131.12.1.el6.x86_64/vmlinux
DUMPFILE: /var/crash/127.0.0.1-2012-02-18-21:20:13/vmcore [PARTIAL DUMP]
CPUS: 24
DATE: Sat Feb 18 20:03:56 2012
UPTIME: 71 days, 09:42:23
LOAD AVERAGE: 46.81, 44.32, 35.15
TASKS: 1018
NODENAME: virt15-njhx-kvm-19
RELEASE: 2.6.32-131.12.1.el6.x86_64
VERSION: #1 SMP Sun Jul 31 16:44:56 EDT 2011
MACHINE: x86_64 (2394 Mhz)
MEMORY: 48 GB
PANIC: "Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 12"
PID: 18704
COMMAND: "qemu-kvm"
TASK: ffff880041efb580 [THREAD_INFO: ffff8807309ba000]
CPU: 12
STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 18704 TASK: ffff880041efb580 CPU: 12 COMMAND: "qemu-kvm"
#0 [ffff8806454c7af0] machine_kexec at ffffffff810310cb
#1 [ffff8806454c7b50] crash_kexec at ffffffff810b6392
#2 [ffff8806454c7c20] panic at ffffffff814da64f
#3 [ffff8806454c7ca0] watchdog_overflow_callback at ffffffff810d648d
#4 [ffff8806454c7cc0] __perf_event_overflow at ffffffff81108b26
#5 [ffff8806454c7d60] perf_event_overflow at ffffffff81109119
#6 [ffff8806454c7d70] intel_pmu_handle_irq at ffffffff8101dd46
#7 [ffff8806454c7e80] perf_event_nmi_handler at ffffffff814debd8
#8 [ffff8806454c7ea0] notifier_call_chain at ffffffff814e0735
#9 [ffff8806454c7ee0] atomic_notifier_call_chain at ffffffff814e079a
#10 [ffff8806454c7ef0] notify_die at ffffffff8109411e
#11 [ffff8806454c7f20] do_nmi at ffffffff814de383
#12 [ffff8806454c7f50] nmi at ffffffff814ddc90
RIP: 00000000004083ab RSP: 00007fffc80115d8 RFLAGS: 00000206
RAX: 000000007e2bf790 RBX: 0000000001c753f0 RCX: 0000000000008000
RDX: 0000000000000000 RSI: 0000093b76bfc600 RDI: 000000001277546d
RBP: 0000000000000200 R8: 00000000fbc80000 R9: 0000000000000000
R10: 0000000000000064 R11: 0000000000000246 R12: 1277546d7d3d8c69
R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0033 SS: 002b
--- <NMI exception stack> ---
2. Guest boot hangs when many guest-creation requests are processed
by libvirt at the same time; the guest is configured with -smp 1.
Does anyone have any idea about these issues?
Thanks
* Re: Linux Crash Caused By KVM?
2012-04-11 2:11 Linux Crash Caused By KVM? Peijie Yu
@ 2012-04-11 14:45 ` Avi Kivity
2012-04-11 18:59 ` Eric Northup
0 siblings, 1 reply; 4+ messages in thread
From: Avi Kivity @ 2012-04-11 14:45 UTC (permalink / raw)
To: Peijie Yu; +Cc: kvm
On 04/11/2012 05:11 AM, Peijie Yu wrote:
> [system details and crash dump snipped]
> For this problem, I found that the panic is triggered by
> BUG_ON(in_nmi()), which means an NMI occurred while we were already
> in NMI context. But I checked the Intel Software Developer's Manual,
> which says: "While an NMI interrupt handler is executing, the
> processor disables additional calls to the NMI handler until the
> next IRET instruction is executed." So how can this happen?
>
The NMI path for kvm is different; the processor exits from the guest
with NMIs blocked, then executes kvm code until it issues "int $2" in
vmx_complete_interrupts(). If an IRET is executed in this path, then
NMIs will be unblocked and nested NMIs may occur.
One way this can happen is if we access the vmap area and incur a fault,
between the VMEXIT and invoking the NMI handler. Or perhaps the NMI
handler itself generates a fault. Or we have a debug exception in that path.
Is this reproducible?
--
error compiling committee.c: too many arguments to function
* Re: Linux Crash Caused By KVM?
2012-04-11 14:45 ` Avi Kivity
@ 2012-04-11 18:59 ` Eric Northup
2012-04-15 10:05 ` Avi Kivity
0 siblings, 1 reply; 4+ messages in thread
From: Eric Northup @ 2012-04-11 18:59 UTC (permalink / raw)
To: Avi Kivity; +Cc: Peijie Yu, kvm
On Wed, Apr 11, 2012 at 7:45 AM, Avi Kivity <avi@redhat.com> wrote:
> On 04/11/2012 05:11 AM, Peijie Yu wrote:
>> For this problem, I found that the panic is triggered by
>> BUG_ON(in_nmi()), which means an NMI occurred while we were already
>> in NMI context. But I checked the Intel Software Developer's Manual,
>> which says: "While an NMI interrupt handler is executing, the
>> processor disables additional calls to the NMI handler until the
>> next IRET instruction is executed." So how can this happen?
>>
>
> The NMI path for kvm is different; the processor exits from the guest
> with NMIs blocked, then executes kvm code until it issues "int $2" in
> vmx_complete_interrupts(). If an IRET is executed in this path, then
> NMIs will be unblocked and nested NMIs may occur.
>
> One way this can happen is if we access the vmap area and incur a fault,
> between the VMEXIT and invoking the NMI handler. Or perhaps the NMI
> handler itself generates a fault. Or we have a debug exception in that path.
>
> Is this reproducible?
As an FYI, there have been BIOSes whose SMI handlers ran IRETs. So
the NMI blocking can go away surprisingly.
See 29.8 "NMI handling while in SMM" in the Intel SDM vol 3.
* Re: Linux Crash Caused By KVM?
2012-04-11 18:59 ` Eric Northup
@ 2012-04-15 10:05 ` Avi Kivity
0 siblings, 0 replies; 4+ messages in thread
From: Avi Kivity @ 2012-04-15 10:05 UTC (permalink / raw)
To: Eric Northup; +Cc: Peijie Yu, kvm
On 04/11/2012 09:59 PM, Eric Northup wrote:
> On Wed, Apr 11, 2012 at 7:45 AM, Avi Kivity <avi@redhat.com> wrote:
> > On 04/11/2012 05:11 AM, Peijie Yu wrote:
> >> For this problem, I found that the panic is triggered by
> >> BUG_ON(in_nmi()), which means an NMI occurred while we were already
> >> in NMI context. But I checked the Intel Software Developer's Manual,
> >> which says: "While an NMI interrupt handler is executing, the
> >> processor disables additional calls to the NMI handler until the
> >> next IRET instruction is executed." So how can this happen?
> >>
> >
> > The NMI path for kvm is different; the processor exits from the guest
> > with NMIs blocked, then executes kvm code until it issues "int $2" in
> > vmx_complete_interrupts(). If an IRET is executed in this path, then
> > NMIs will be unblocked and nested NMIs may occur.
> >
> > One way this can happen is if we access the vmap area and incur a fault,
> > between the VMEXIT and invoking the NMI handler. Or perhaps the NMI
> > handler itself generates a fault. Or we have a debug exception in that path.
> >
> > Is this reproducible?
>
> As an FYI, there have been BIOSes whose SMI handlers ran IRETs. So
> the NMI blocking can go away surprisingly.
>
> See 29.8 "NMI handling while in SMM" in the Intel SDM vol 3.
Interesting, thanks.
From 29.8 it looks like you don't even need to issue IRET within SMM,
since SMM doesn't save/restore the NMI blocking flag.
However, this being a server, and the crash being in kvm code, I don't
think we can rule out that this is a kvm bug.
--
error compiling committee.c: too many arguments to function