* Linux Crash Caused By KVM?
@ 2012-04-11  2:11 Peijie Yu
  2012-04-11 14:45 ` Avi Kivity
  0 siblings, 1 reply; 4+ messages in thread
From: Peijie Yu @ 2012-04-11  2:11 UTC (permalink / raw)
  To: kvm

Hi, all,
  I have run into some problems while using KVM.
  The test environment is:
Summary:        Dell R610, 1 x Xeon E5645 2.40GHz, 47.1GB / 48GB 1333MHz DDR3
System:         Dell PowerEdge R610 (Dell 08GXHX)
Processors:     1 (of 2) x Xeon E5645 2.40GHz 5860MHz FSB (HT enabled,
6 cores, 24 threads)
Memory:         47.1GB / 48GB 1333MHz DDR3 == 12 x 4GB
Disk:           sda: 299GB (72%) JBOD
Disk:           sdb (host9): 5.0TB JBOD == 1 x VIRTUAL-DISK
Disk:           sdc (host11): 5.0TB JBOD == 1 x VIRTUAL-DISK
Disk:           sdd (host12): 5.0TB JBOD == 1 x VIRTUAL-DISK
Disk:           sde (host10): 5.0TB JBOD == 1 x VIRTUAL-DISK
Disk-Control:   mpt2sas0: LSI Logic / Symbios Logic SAS2008
PCI-Express Fusion-MPT SAS-2 [Falcon]
Disk-Control:   host9:
Disk-Control:   host10:
Disk-Control:   host11:
Disk-Control:   host12:
Chipset:        Intel 82801IB (ICH9)
Network:        br1 (bridge): 14:fe:b5:dc:2c:6e
Network:        em1 (bnx2): Broadcom NetXtreme II BCM5709 Gigabit,
14:fe:b5:dc:2c:6e, 1000Mb/s <full-duplex>
Network:        em2 (bnx2): Broadcom NetXtreme II BCM5709 Gigabit,
14:fe:b5:dc:2c:70, 1000Mb/s <full-duplex>
Network:        em3 (bnx2): Broadcom NetXtreme II BCM5709 Gigabit,
14:fe:b5:dc:2c:72, 1000Mb/s <full-duplex>
Network:        em4 (bnx2): Broadcom NetXtreme II BCM5709 Gigabit,
14:fe:b5:dc:2c:74, 1000Mb/s <full-duplex>
Network:        vnet0 (tun): fe:16:3e:49:fb:05, 10Mb/s <full-duplex>
Network:        vnet1 (tun): fe:16:3e:cb:c0:d1, 10Mb/s <full-duplex>
Network:        vnet2 (tun): fe:16:3e:1e:c1:c4, 10Mb/s <full-duplex>
Network:        vnet3 (tun): fe:16:3e:d5:58:f4, 10Mb/s <full-duplex>
Network:        vnet4 (tun): fe:16:3e:15:b4:16, 10Mb/s <full-duplex>
Network:        vnet5 (tun): fe:16:3e:d2:07:47, 10Mb/s <full-duplex>
Network:        vnet6 (tun): fe:16:3e:e1:2b:b9, 10Mb/s <full-duplex>
OS:             RHEL Server 6.1 (Santiago), Linux
2.6.32-220.2.1.el6.x86_64 x86_64, 64-bit
BIOS:           Dell 3.0.0 01/31/2011

  While using KVM, I have hit the following issues:
  1.   Host crash caused by:
      a.   Kernel panic
       KERNEL: /usr/lib/debug/lib/modules/2.6.32-131.12.1.el6.x86_64/vmlinux
     DUMPFILE: ../vmcore_2012.13.46  [PARTIAL DUMP]
         CPUS: 24
         DATE: Wed Jan 11 13:34:13 2012
       UPTIME: 25 days, 04:11:05
 LOAD AVERAGE: 223.16, 172.97, 158.23
        TASKS: 1464
     NODENAME: dell2.localdomain
      RELEASE: 2.6.32-131.12.1.el6.x86_64
      VERSION: #1 SMP Sun Jul 31 16:44:56 EDT 2011
      MACHINE: x86_64  (2394 Mhz)
       MEMORY: 48 GB
        PANIC: "kernel BUG at arch/x86/kernel/traps.c:547!"
          PID: 11851
      COMMAND: "qemu-kvm"
         TASK: ffff880c071c3500  [THREAD_INFO: ffff880c132d8000]
          CPU: 1
        STATE: TASK_RUNNING (PANIC)

PID: 11851  TASK: ffff880c071c3500  CPU: 1   COMMAND: "qemu-kvm"
 #0 [ffff880028207be0] machine_kexec at ffffffff810310cb
 #1 [ffff880028207c40] crash_kexec at ffffffff810b6392
 #2 [ffff880028207d10] oops_end at ffffffff814de670
 #3 [ffff880028207d40] die at ffffffff8100f2eb
 #4 [ffff880028207d70] do_trap at ffffffff814ddf64
 #5 [ffff880028207dd0] do_invalid_op at ffffffff8100ceb5
 #6 [ffff880028207e70] invalid_op at ffffffff8100bf5b
    [exception RIP: do_nmi+554]
    RIP: ffffffff814de43a  RSP: ffff880028207f28  RFLAGS: 00010002
    RAX: ffff880c132d9fd8  RBX: ffff880028207f58  RCX: 00000000c0000101
    RDX: 00000000ffff8800  RSI: ffffffffffffffff  RDI: ffff880028207f58
    RBP: ffff880028207f48   R8: ffff88005ebf9800   R9: ffff880028203fc0
    R10: 0000000000000034  R11: 00000000000003e8  R12: 000000000000cc20
    R13: ffffffff816024a0  R14: ffff88005ebf9800  R15: 00007ffffffff000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #7 [ffff880028207f50] nmi at ffffffff814ddc90
    [exception RIP: bad_to_user+37]
    RIP: ffffffff814e4e2b  RSP: ffff880028207bb0  RFLAGS: 00010046
    RAX: ffff880c132d9fd8  RBX: ffff880c132d9c48  RCX: 0000000000000001
    RDX: 0000000000000000  RSI: 000000010000000b  RDI: ffff880028207c08
    RBP: ffff880028207c48   R8: ffff88005ebf9800   R9: ffff880028203fc0
    R10: 0000000000000034  R11: 00000000000003e8  R12: 000000000000cc20
    R13: ffffffff816024a0  R14: ffff88005ebf9800  R15: 00007ffffffff000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---

     For this problem, I found that the panic is caused by
BUG_ON(in_nmi()), which means an NMI occurred while another NMI was
already in progress. But when I checked the Intel Technical Manual, I
found: "While an NMI interrupt handler is executing, the processor
disables additional calls to the NMI handler until the next IRET
instruction is executed." So how can this happen?


    b.  Hard lockup on a CPU running a qemu-kvm process
      KERNEL: /usr/lib/debug/lib/modules/2.6.32-131.12.1.el6.x86_64/vmlinux
    DUMPFILE: /var/crash/127.0.0.1-2012-02-18-21:20:13/vmcore  [PARTIAL DUMP]
        CPUS: 24
        DATE: Sat Feb 18 20:03:56 2012
      UPTIME: 71 days, 09:42:23
LOAD AVERAGE: 46.81, 44.32, 35.15
       TASKS: 1018
    NODENAME: virt15-njhx-kvm-19
     RELEASE: 2.6.32-131.12.1.el6.x86_64
     VERSION: #1 SMP Sun Jul 31 16:44:56 EDT 2011
     MACHINE: x86_64  (2394 Mhz)
      MEMORY: 48 GB
       PANIC: "Kernel panic - not syncing: Watchdog detected hard
LOCKUP on cpu 12"
         PID: 18704
     COMMAND: "qemu-kvm"
        TASK: ffff880041efb580  [THREAD_INFO: ffff8807309ba000]
         CPU: 12
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 18704  TASK: ffff880041efb580  CPU: 12  COMMAND: "qemu-kvm"
 #0 [ffff8806454c7af0] machine_kexec at ffffffff810310cb
 #1 [ffff8806454c7b50] crash_kexec at ffffffff810b6392
 #2 [ffff8806454c7c20] panic at ffffffff814da64f
 #3 [ffff8806454c7ca0] watchdog_overflow_callback at ffffffff810d648d
 #4 [ffff8806454c7cc0] __perf_event_overflow at ffffffff81108b26
 #5 [ffff8806454c7d60] perf_event_overflow at ffffffff81109119
 #6 [ffff8806454c7d70] intel_pmu_handle_irq at ffffffff8101dd46
 #7 [ffff8806454c7e80] perf_event_nmi_handler at ffffffff814debd8
 #8 [ffff8806454c7ea0] notifier_call_chain at ffffffff814e0735
 #9 [ffff8806454c7ee0] atomic_notifier_call_chain at ffffffff814e079a
#10 [ffff8806454c7ef0] notify_die at ffffffff8109411e
#11 [ffff8806454c7f20] do_nmi at ffffffff814de383
#12 [ffff8806454c7f50] nmi at ffffffff814ddc90
    RIP: 00000000004083ab  RSP: 00007fffc80115d8  RFLAGS: 00000206
    RAX: 000000007e2bf790  RBX: 0000000001c753f0  RCX: 0000000000008000
    RDX: 0000000000000000  RSI: 0000093b76bfc600  RDI: 000000001277546d
    RBP: 0000000000000200   R8: 00000000fbc80000   R9: 0000000000000000
    R10: 0000000000000064  R11: 0000000000000246  R12: 1277546d7d3d8c69
    R13: 0000000000000000  R14: 0000000000000001  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b
--- <NMI exception stack> ---

 2.  Guest boot hangs when many guest-creation requests are processed
by libvirt at the same time.
      The guest is configured with -smp 1.

 Does anyone have any ideas about these issues?
 Thanks


* Re: Linux Crash Caused By KVM?
  2012-04-11  2:11 Linux Crash Caused By KVM? Peijie Yu
@ 2012-04-11 14:45 ` Avi Kivity
  2012-04-11 18:59   ` Eric Northup
  0 siblings, 1 reply; 4+ messages in thread
From: Avi Kivity @ 2012-04-11 14:45 UTC (permalink / raw)
  To: Peijie Yu; +Cc: kvm

On 04/11/2012 05:11 AM, Peijie Yu wrote:
> [hardware description and crash dump snipped]
>
>      For this problem, I found that the panic is caused by
> BUG_ON(in_nmi()), which means an NMI occurred while another NMI was
> already in progress. But when I checked the Intel Technical Manual, I
> found: "While an NMI interrupt handler is executing, the processor
> disables additional calls to the NMI handler until the next IRET
> instruction is executed." So how can this happen?
>

The NMI path for kvm is different; the processor exits from the guest
with NMIs blocked, then executes kvm code until it issues "int $2" in
vmx_complete_interrupts(). If an IRET is executed in this path, then
NMIs will be unblocked and nested NMIs may occur.

One way this can happen is if we access the vmap area and incur a fault,
between the VMEXIT and invoking the NMI handler. Or perhaps the NMI
handler itself generates a fault. Or we have a debug exception in that path.

Is this reproducible?

-- 
error compiling committee.c: too many arguments to function



* Re: Linux Crash Caused By KVM?
  2012-04-11 14:45 ` Avi Kivity
@ 2012-04-11 18:59   ` Eric Northup
  2012-04-15 10:05     ` Avi Kivity
  0 siblings, 1 reply; 4+ messages in thread
From: Eric Northup @ 2012-04-11 18:59 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Peijie Yu, kvm

On Wed, Apr 11, 2012 at 7:45 AM, Avi Kivity <avi@redhat.com> wrote:
> On 04/11/2012 05:11 AM, Peijie Yu wrote:
>>      For this problem, I found that the panic is caused by
>> BUG_ON(in_nmi()), which means an NMI occurred while another NMI was
>> already in progress. But when I checked the Intel Technical Manual, I
>> found: "While an NMI interrupt handler is executing, the processor
>> disables additional calls to the NMI handler until the next IRET
>> instruction is executed." So how can this happen?
>>
>
> The NMI path for kvm is different; the processor exits from the guest
> with NMIs blocked, then executes kvm code until it issues "int $2" in
> vmx_complete_interrupts(). If an IRET is executed in this path, then
> NMIs will be unblocked and nested NMIs may occur.
>
> One way this can happen is if we access the vmap area and incur a fault,
> between the VMEXIT and invoking the NMI handler. Or perhaps the NMI
> handler itself generates a fault. Or we have a debug exception in that path.
>
> Is this reproducible?

As an FYI, there have been BIOSes whose SMI handlers ran IRETs.  So
the NMI blocking can go away surprisingly.

See 29.8 "NMI handling while in SMM" in the Intel SDM vol 3.


* Re: Linux Crash Caused By KVM?
  2012-04-11 18:59   ` Eric Northup
@ 2012-04-15 10:05     ` Avi Kivity
  0 siblings, 0 replies; 4+ messages in thread
From: Avi Kivity @ 2012-04-15 10:05 UTC (permalink / raw)
  To: Eric Northup; +Cc: Peijie Yu, kvm

On 04/11/2012 09:59 PM, Eric Northup wrote:
> On Wed, Apr 11, 2012 at 7:45 AM, Avi Kivity <avi@redhat.com> wrote:
> > On 04/11/2012 05:11 AM, Peijie Yu wrote:
> >>      For this problem, I found that the panic is caused by
> >> BUG_ON(in_nmi()), which means an NMI occurred while another NMI was
> >> already in progress. But when I checked the Intel Technical Manual, I
> >> found: "While an NMI interrupt handler is executing, the processor
> >> disables additional calls to the NMI handler until the next IRET
> >> instruction is executed." So how can this happen?
> >>
> >
> > The NMI path for kvm is different; the processor exits from the guest
> > with NMIs blocked, then executes kvm code until it issues "int $2" in
> > vmx_complete_interrupts(). If an IRET is executed in this path, then
> > NMIs will be unblocked and nested NMIs may occur.
> >
> > One way this can happen is if we access the vmap area and incur a fault,
> > between the VMEXIT and invoking the NMI handler. Or perhaps the NMI
> > handler itself generates a fault. Or we have a debug exception in that path.
> >
> > Is this reproducible?
>
> As an FYI, there have been BIOSes whose SMI handlers ran IRETs.  So
> the NMI blocking can go away surprisingly.
>
> See 29.8 "NMI handling while in SMM" in the Intel SDM vol 3.

Interesting, thanks.

From 29.8 it looks like you don't even need to issue an IRET within SMM,
since SMM doesn't save/restore the NMI blocking flag.

However, this being a server, and the crash being in kvm code, I don't
think we can rule out that this is a kvm bug.

-- 
error compiling committee.c: too many arguments to function


