linux-kernel.vger.kernel.org archive mirror
* [PATCH] x86, kdump, ioapic: Fix kdump race with migrating irq
@ 2012-01-31 21:25 Don Zickus
  2012-01-31 21:37 ` Vivek Goyal
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Don Zickus @ 2012-01-31 21:25 UTC (permalink / raw)
  To: x86; +Cc: LKML, vgoyal, ebiederm, kexec-list, Don Zickus

A customer of ours noticed when their machine crashed, kdump did not
work but hung instead.  Using their firmware dumping solution they
grabbed a vmcore and decoded the stacks on the cpus.  What they
noticed seemed to be a rare deadlock with the ioapic_lock.

 CPU4:
 machine_crash_shutdown
 -> machine_ops.crash_shutdown
    -> native_machine_crash_shutdown
       -> kdump_nmi_shootdown_cpus ------> Send NMI to other CPUs
       -> disable_IO_APIC
          -> clear_IO_APIC
             -> clear_IO_APIC_pin
                -> ioapic_read_entry
                   -> spin_lock_irqsave(&ioapic_lock, flags)
                   ---Infinite loop here---

 CPU0:
 do_IRQ
 -> handle_irq
    -> handle_edge_irq
        -> ack_apic_edge
           -> move_native_irq
               -> mask_IO_APIC_irq
                  -> mask_IO_APIC_irq_desc
                     -> spin_lock_irqsave(&ioapic_lock, flags)
                     ---Receive NMI here after getting spinlock---
                        -> nmi
                           -> do_nmi
                              -> crash_nmi_callback
                              ---Infinite loop here---

The problem is that although kdump tries to shut down minimal hardware,
it still needs to disable the IO APIC.  This requires spinlocks which
may be held by another cpu.  This other cpu is being held indefinitely in
an NMI context by kdump in order to serialize the crashing path.  Instant
deadlock.

I attempted to resolve this by busting the spinlock in the kdump case only.
My justification was that kdump has already stopped the other cpus and it
is only clearing the io apic which shouldn't cause harm when overwriting
what the other cpu was doing.
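
As a rough user-space illustration of that reasoning (not the kernel
code), the sketch below models ioapic_lock as a toy test-and-set lock
built on C11 atomics.  The names toy_lock, toy_trylock,
toy_bust_if_forced and toy_demo are hypothetical; only the
"force && is locked, then re-initialise" check is meant to mirror the
patch below, and it is only safe under the assumption that the previous
owner can never run again.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for ioapic_lock; NOT the kernel's spinlock type. */
struct toy_lock {
	atomic_bool locked;
};

static inline void toy_lock_init(struct toy_lock *l)
{
	atomic_store(&l->locked, false);
}

/* One acquisition attempt; returns true if the caller took the lock. */
static inline bool toy_trylock(struct toy_lock *l)
{
	return !atomic_exchange(&l->locked, true);
}

static inline bool toy_is_locked(struct toy_lock *l)
{
	return atomic_load(&l->locked);
}

/*
 * Same shape as the check added to disable_IO_APIC(force): when the
 * caller forces it and the lock is held, re-initialise the lock.
 * Returns true if the lock was busted.
 */
static inline bool toy_bust_if_forced(struct toy_lock *l, int force)
{
	if (force && toy_is_locked(l)) {
		toy_lock_init(l);
		return true;
	}
	return false;
}

/* Walks through the scenario: owner parked forever, crasher needs the lock. */
static inline void toy_demo(void)
{
	struct toy_lock lock;

	toy_lock_init(&lock);
	assert(toy_trylock(&lock));            /* "CPU0" takes the lock, then is parked in NMI */
	assert(!toy_trylock(&lock));           /* "CPU4" can never get it: deadlock */
	assert(!toy_bust_if_forced(&lock, 0)); /* force == 0: lock left alone */
	assert(toy_bust_if_forced(&lock, 1));  /* force == 1 (kdump): lock busted */
	assert(toy_trylock(&lock));            /* crashing cpu can now proceed */
}
```

Calling toy_demo() runs the whole sequence; with force == 0 the lock is
never touched, which is why the non-kdump callers pass 0.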

I tested this by loading a dummy module that grabs the ioapic_lock and then,
on another cpu, running 'echo c > /proc/sysrq-trigger'.  The deadlock was detected
and fixed with the patch below.

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 arch/x86/kernel/apic/io_apic.c     |   18 +++++++++++++++++-
 arch/x86/kernel/crash.c            |    2 +-
 arch/x86/kernel/machine_kexec_32.c |    2 +-
 arch/x86/kernel/machine_kexec_64.c |    2 +-
 arch/x86/kernel/reboot.c           |    2 +-
 5 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index fb07275..5fe4423 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -1991,9 +1991,25 @@ void __init enable_IO_APIC(void)
 /*
  * Not an __init, needed by the reboot code
  */
-void disable_IO_APIC(void)
+void disable_IO_APIC(int force)
 {
 	/*
+	 * Use force to bust the io_apic spinlock
+	 *
+	 * There is a case where kdump can race with irq
+	 * migration such that kdump will inject an NMI
+	 * while another cpu holds the ioapic_lock to
+	 * migrate the irq.  This would cause a deadlock.
+	 *
+	 * Because kdump stops all the cpus, we can safely
+	 * bust the spinlock as we are just clearing the
+	 * io apic anyway.
+	 */
+	if (force && spin_is_locked(&ioapic_lock))
+		/* only one cpu should be running now */
+		spin_lock_init(&ioapic_lock);
+
+	/*
 	 * Clear the IO-APIC before rebooting:
 	 */
 	clear_IO_APIC();
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 13ad899..c8383b0 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -97,7 +97,7 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
 
 	lapic_shutdown();
 #if defined(CONFIG_X86_IO_APIC)
-	disable_IO_APIC();
+	disable_IO_APIC(1);
 #endif
 #ifdef CONFIG_HPET_TIMER
 	hpet_disable();
diff --git a/arch/x86/kernel/machine_kexec_32.c b/arch/x86/kernel/machine_kexec_32.c
index a3fa43b..3c60005 100644
--- a/arch/x86/kernel/machine_kexec_32.c
+++ b/arch/x86/kernel/machine_kexec_32.c
@@ -212,7 +212,7 @@ void machine_kexec(struct kimage *image)
 		 * one form or other. kexec jump path also need
 		 * one.
 		 */
-		disable_IO_APIC();
+		disable_IO_APIC(0);
 #endif
 	}
 
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index b3ea9db..ed94a6a 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -295,7 +295,7 @@ void machine_kexec(struct kimage *image)
 		 * one form or other. kexec jump path also need
 		 * one.
 		 */
-		disable_IO_APIC();
+		disable_IO_APIC(0);
 #endif
 	}
 
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index 37a458b..766795c 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -652,7 +652,7 @@ void native_machine_shutdown(void)
 	lapic_shutdown();
 
 #ifdef CONFIG_X86_IO_APIC
-	disable_IO_APIC();
+	disable_IO_APIC(0);
 #endif
 
 #ifdef CONFIG_HPET_TIMER
-- 
1.7.7.5



* Re: [PATCH] x86, kdump, ioapic: Fix kdump race with migrating irq
  2012-01-31 21:25 [PATCH] x86, kdump, ioapic: Fix kdump race with migrating irq Don Zickus
@ 2012-01-31 21:37 ` Vivek Goyal
  2012-01-31 22:08 ` Eric W. Biederman
  2012-02-20 15:20 ` Seiji Aguchi
  2 siblings, 0 replies; 10+ messages in thread
From: Vivek Goyal @ 2012-01-31 21:37 UTC (permalink / raw)
  To: Don Zickus; +Cc: x86, LKML, ebiederm, kexec-list

On Tue, Jan 31, 2012 at 04:25:14PM -0500, Don Zickus wrote:
> A customer of ours noticed when their machine crashed, kdump did not
> work but hung instead.  Using their firmware dumping solution they
> grabbed a vmcore and decoded the stacks on the cpus.  What they
> noticed seemed to be a rare deadlock with the ioapic_lock.
> 
>  CPU4:
>  machine_crash_shutdown
>  -> machine_ops.crash_shutdown
>     -> native_machine_crash_shutdown
>        -> kdump_nmi_shootdown_cpus ------> Send NMI to other CPUs
>        -> disable_IO_APIC
>           -> clear_IO_APIC
>              -> clear_IO_APIC_pin
>                 -> ioapic_read_entry
>                    -> spin_lock_irqsave(&ioapic_lock, flags)
>                    ---Infinite loop here---
> 
>  CPU0:
>  do_IRQ
>  -> handle_irq
>     -> handle_edge_irq
>         -> ack_apic_edge
>            -> move_native_irq
>                -> mask_IO_APIC_irq
>                   -> mask_IO_APIC_irq_desc
>                      -> spin_lock_irqsave(&ioapic_lock, flags)
>                      ---Receive NMI here after getting spinlock---
>                         -> nmi
>                            -> do_nmi
>                               -> crash_nmi_callback
>                               ---Infinite loop here---
> 
> The problem is that although kdump tries to shut down minimal hardware,
> it still needs to disable the IO APIC.  This requires spinlocks which
> may be held by another cpu.  This other cpu is being held indefinitely in
> an NMI context by kdump in order to serialize the crashing path.  Instant
> deadlock.
> 
> I attempted to resolve this by busting the spinlock in the kdump case only.
> My justification was that kdump has already stopped the other cpus and it
> is only clearing the io apic which shouldn't cause harm when overwriting
> what the other cpu was doing.
> 
> I tested this by loading a dummy module that grabs the ioapic_lock and then,
> on another cpu, running 'echo c > /proc/sysrq-trigger'.  The deadlock was detected
> and fixed with the patch below.
> 
> Signed-off-by: Don Zickus <dzickus@redhat.com>

Sounds reasonable to me. 

Acked-by: Vivek Goyal <vgoyal@redhat.com>

Thanks
Vivek


* Re: [PATCH] x86, kdump, ioapic: Fix kdump race with migrating irq
  2012-01-31 21:25 [PATCH] x86, kdump, ioapic: Fix kdump race with migrating irq Don Zickus
  2012-01-31 21:37 ` Vivek Goyal
@ 2012-01-31 22:08 ` Eric W. Biederman
  2012-01-31 22:27   ` Don Zickus
  2012-02-20 15:20 ` Seiji Aguchi
  2 siblings, 1 reply; 10+ messages in thread
From: Eric W. Biederman @ 2012-01-31 22:08 UTC (permalink / raw)
  To: Don Zickus; +Cc: x86, LKML, vgoyal, kexec-list

Don Zickus <dzickus@redhat.com> writes:

> A customer of ours noticed when their machine crashed, kdump did not
> work but hung instead.  Using their firmware dumping solution they
> grabbed a vmcore and decoded the stacks on the cpus.  What they
> noticed seemed to be a rare deadlock with the ioapic_lock.
>
>  CPU4:
>  machine_crash_shutdown
>  -> machine_ops.crash_shutdown
>     -> native_machine_crash_shutdown
>        -> kdump_nmi_shootdown_cpus ------> Send NMI to other CPUs
>        -> disable_IO_APIC
>           -> clear_IO_APIC
>              -> clear_IO_APIC_pin
>                 -> ioapic_read_entry
>                    -> spin_lock_irqsave(&ioapic_lock, flags)
>                    ---Infinite loop here---
>
>  CPU0:
>  do_IRQ
>  -> handle_irq
>     -> handle_edge_irq
>         -> ack_apic_edge
>            -> move_native_irq
>                -> mask_IO_APIC_irq
>                   -> mask_IO_APIC_irq_desc
>                      -> spin_lock_irqsave(&ioapic_lock, flags)
>                      ---Receive NMI here after getting spinlock---
>                         -> nmi
>                            -> do_nmi
>                               -> crash_nmi_callback
>                               ---Infinite loop here---
>
> The problem is that although kdump tries to shut down minimal hardware,
> it still needs to disable the IO APIC.  This requires spinlocks which
> may be held by another cpu.  This other cpu is being held indefinitely in
> an NMI context by kdump in order to serialize the crashing path.  Instant
> deadlock.

Can you test to see if kexec on panic still needs to disable the IO
APIC?  Last I looked we were close, if not all of the way there, to not
needing to boot the kernel in pic mode.

If we can skip the ioapic disable entirely we should be much more
robust.

Eric


* Re: [PATCH] x86, kdump, ioapic: Fix kdump race with migrating irq
  2012-01-31 22:08 ` Eric W. Biederman
@ 2012-01-31 22:27   ` Don Zickus
  2012-01-31 22:38     ` Eric W. Biederman
  0 siblings, 1 reply; 10+ messages in thread
From: Don Zickus @ 2012-01-31 22:27 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: x86, LKML, vgoyal, kexec-list

On Tue, Jan 31, 2012 at 02:08:29PM -0800, Eric W. Biederman wrote:
> > The problem is that although kdump tries to shut down minimal hardware,
> > it still needs to disable the IO APIC.  This requires spinlocks which
> > may be held by another cpu.  This other cpu is being held indefinitely in
> > an NMI context by kdump in order to serialize the crashing path.  Instant
> > deadlock.
> 
> Can you test to see if kexec on panic still needs to disable the IO
> APIC?  Last I looked we were close, if not all of the way there, to not
> needing to boot the kernel in pic mode.

Ok, so you just blindly remove disable_IO_APIC from
native_machine_crash_shutdown and re-run some panic tests on various
machines?  What about the disable_IO_APIC path in native_machine_shutdown?

Also, where could I look to see if that work was done?  Is that in the
ioapic setup code?

> 
> If we can skip the ioapic disable entirely we should be much more
> robust.

Agreed.

Cheers,
Don


* Re: [PATCH] x86, kdump, ioapic: Fix kdump race with migrating irq
  2012-01-31 22:27   ` Don Zickus
@ 2012-01-31 22:38     ` Eric W. Biederman
  2012-02-01 23:04       ` Don Zickus
  0 siblings, 1 reply; 10+ messages in thread
From: Eric W. Biederman @ 2012-01-31 22:38 UTC (permalink / raw)
  To: Don Zickus; +Cc: x86, LKML, vgoyal, kexec-list

Don Zickus <dzickus@redhat.com> writes:

> On Tue, Jan 31, 2012 at 02:08:29PM -0800, Eric W. Biederman wrote:
>> > The problem is that although kdump tries to shut down minimal hardware,
>> > it still needs to disable the IO APIC.  This requires spinlocks which
>> > may be held by another cpu.  This other cpu is being held indefinitely in
>> > an NMI context by kdump in order to serialize the crashing path.  Instant
>> > deadlock.
>> 
>> Can you test to see if kexec on panic still needs to disable the IO
>> APIC?  Last I looked we were close, if not all of the way there, to not
>> needing to boot the kernel in pic mode.
>
> Ok, so you just blindly remove disable_IO_APIC from
> native_machine_crash_shutdown and re-run some panic tests on various
> machines?  What about the disable_IO_APIC path in native_machine_shutdown?
>

Yes.  Just native_machine_crash_shutdown.

native_machine_shutdown is the case when all is good and we attempt to
put the hardware back the way we found it.

Any normal x86 machine that the kernel runs in ioapic mode should be
enough to get a first approximation.

> Also, where could I look to see if that work was done?  Is that in the
> ioapic setup code?

The primary question is do we call the ioapic setup code without calling
the pic setup code first.  On some embedded x86 platforms we certainly
do.  I don't know if that code has been generalized.

Historically the problem is that we started the pit timer in pic mode
and used that to calibrate the delay loop.

So what we are looking to verify is that the linux kernel boot skips
pic mode entirely.

Eric


* Re: [PATCH] x86, kdump, ioapic: Fix kdump race with migrating irq
  2012-01-31 22:38     ` Eric W. Biederman
@ 2012-02-01 23:04       ` Don Zickus
  2012-02-02  1:34         ` Eric W. Biederman
  0 siblings, 1 reply; 10+ messages in thread
From: Don Zickus @ 2012-02-01 23:04 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: x86, LKML, vgoyal, kexec-list

On Tue, Jan 31, 2012 at 02:38:15PM -0800, Eric W. Biederman wrote:
> Don Zickus <dzickus@redhat.com> writes:
> 
> > On Tue, Jan 31, 2012 at 02:08:29PM -0800, Eric W. Biederman wrote:
> >> > The problem is that although kdump tries to shut down minimal hardware,
> >> > it still needs to disable the IO APIC.  This requires spinlocks which
> >> > may be held by another cpu.  This other cpu is being held indefinitely in
> >> > an NMI context by kdump in order to serialize the crashing path.  Instant
> >> > deadlock.
> >> 
> >> Can you test to see if kexec on panic still needs to disable the IO
> >> APIC?  Last I looked we were close, if not all of the way there, to not
> >> needing to boot the kernel in pic mode.
> >
> > Ok, so you just blindly remove disable_IO_APIC from
> > native_machine_crash_shutdown and re-run some panic tests on various
> > machines?  What about the disable_IO_APIC path in native_machine_shutdown?
> >
> 
> Yes.  Just native_machine_crash_shutdown.
> 
> native_machine_shutdown is the case when all is good and we attempt to
> put the hardware back the way we found it.

Ok.

> 
> Any normal x86 machine that the kernel runs in ioapic mode should be
> enough to get a first approximation.
> 
> > Also, where could I look to see if that work was done?  Is that in the
> > ioapic setup code?
> 
> The primary question is do we call the ioapic setup code without calling
> the pic setup code first.  On some embedded x86 platforms we certainly
> do.  I don't know if that code has been generalized.
> 
> Historically the problem is that we started the pit timer in pic mode
> and used that to calibrate the delay loop.
> 
> So what we are looking to verify is that the linux kernel boot skips
> pic mode entirely.

It seems to boot fine on an Ivy Bridge machine and a single cpu Pentium4.
I will try an athlon3 and a nehalem tomorrow.

Talking to folks here and trying to read the code it seems like the PIT
stuff is delayed until after the IOAPIC is configured using Fast TSC
calibration as a mechanism to work around the PIT??

I attached the output of the Pentium4 when kdumping.  Not sure what to
really look for to verify the PIC is being skipped.  Perhaps you know?
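
One thing the attached log does show is "Calibrating delay loop
(skipped), value calculated using timer frequency", i.e. the delay loop
was not calibrated by counting timer interrupts.  The sketch below
reproduces the printed numbers from the detected 1694.460 MHz frequency.
It assumes HZ=1000 (inferred from lpj matching the kHz value, not stated
in the log), and the formulas are my understanding of how the kernel
derives and prints these values; the toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: HZ=1000 for this kernel build (inferred, not stated in the log). */
#define TOY_HZ 1000u

/* Loops-per-jiffy derived directly from the timer frequency in kHz. */
static inline uint64_t toy_lpj_from_khz(uint64_t khz)
{
	return khz * 1000u / TOY_HZ;
}

/* Whole part of the BogoMIPS figure. */
static inline uint64_t toy_bogomips_whole(uint64_t lpj)
{
	return lpj / (500000u / TOY_HZ);
}

/* Two-digit fractional part of the BogoMIPS figure. */
static inline uint64_t toy_bogomips_frac(uint64_t lpj)
{
	return (lpj / (5000u / TOY_HZ)) % 100u;
}

/* Checks the sketch against the values printed in the log. */
static inline void toy_check_against_log(void)
{
	uint64_t lpj = toy_lpj_from_khz(1694460); /* "Detected 1694.460 MHz processor." */

	assert(lpj == 1694460);                   /* "(lpj=1694460)" */
	assert(toy_bogomips_whole(lpj) == 3388);  /* "3388.92 BogoMIPS" */
	assert(toy_bogomips_frac(lpj) == 92);
}
```

This only shows that the numbers are consistent with a frequency-derived
value; it does not by itself prove the PIC was never touched.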

Cheers,
Don

DMI 2.3 present.
last_pfn = 0x20000 max_arch_pfn = 0x1000000
x86 PAT enabled: cpu 0, old 0x7010600070106, new 0x7010600070106
found SMP MP-table at [c00fe710] fe710
init_memory_mapping: 0000000000000000-0000000020000000
RAMDISK: 1fab5000 - 1ff5f000
ACPI: RSDP 000fd560 00014 (v00 DELL  )
ACPI: RSDT 000fd574 00034 (v01 DELL    GX240   00000008 ASL  00000061)
ACPI: FACP 000fd5a8 00074 (v01 DELL    GX240   00000008 ASL  00000061)
ACPI: DSDT fffe3c22 02393 (v01   DELL    dt_ex 00001000 MSFT 0100000D)
ACPI: FACS 3ff77000 00040
ACPI: SSDT fffe5fb5 000A7 (v01   DELL    st_ex 00001000 MSFT 0100000D)
ACPI: APIC 000fd61c 0005C (v01 DELL    GX240   00000008 ASL  00000061)
ACPI: BOOT 000fd678 00028 (v01 DELL    GX240   00000008 ASL  00000061)
0MB HIGHMEM available.
512MB LOWMEM available.
  mapped low ram: 0 - 20000000
  low ram: 0 - 20000000
Zone PFN ranges:
  DMA      0x00000010 -> 0x00001000
  Normal   0x00001000 -> 0x00020000
  HighMem  empty
Movable zone start PFN for each node
Early memory PFN ranges
    0: 0x00000010 -> 0x000000a0
    0: 0x00018000 -> 0x0001ff6a
    0: 0x0001ff6b -> 0x0001ff6f
    0: 0x0001ffff -> 0x00020000
Using APIC driver default
ACPI: PM-Timer IO Port: 0x808
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] disabled)
ACPI: IOAPIC (id[0x01] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 1, version 32, address 0xfec00000, GSI 0-23
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
Using ACPI (MADT) for SMP configuration information
2 Processors exceeds NR_CPUS limit of 1
SMP: Allowing 1 CPUs, 0 hotplug CPUs
PM: Registered nosave memory: 00000000000a0000 - 00000000000f0000
PM: Registered nosave memory: 00000000000f0000 - 0000000000100000
PM: Registered nosave memory: 0000000000100000 - 0000000018000000
PM: Registered nosave memory: 000000001ff6a000 - 000000001ff6b000
PM: Registered nosave memory: 000000001ff6f000 - 000000001ffff000
Allocating PCI resources starting at 40000000 (gap: 40000000:bec00000)
Booting paravirtualized kernel on bare hardware
setup_percpu: NR_CPUS:32 nr_cpumask_bits:32 nr_cpu_ids:1 nr_node_ids:1
PERCPU: Embedded 13 pages/cpu @df400000 s32704 r0 d20544 u2097152
Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 31743
Kernel command line: ro root=/dev/mapper/vg_dellgx24003-lv_root rd_NO_LUKS
LANG=en_US.UTF-8 rd_NO_MD KEYTABLE=us console=ttyS0,115200
rd_LVM_LV=vg_dellgx24003/lv_root rd_LVM_LV=vg_dellgx24003/lv_swap
SYSFONT=latarcyrheb-sun16 rd_NO_DM irqpoll nr_cpus=1 reset_devices
cgroup_disable=memory  memmap=exactmap memmap=64K$0K memmap=576K@64K
memmap=64K$960K memmap=130472K@393216K memmap=19K@523689K
memmap=4K@524284K memmap=8K#1048028K memmap=540K$1048036K
memmap=64K$4173824K memmap=64K$4175872K memmap=5120K$4189184K
elfcorehdr=523688K
Misrouted IRQ fixup and polling support enabled
This may significantly impact system performance
Disabling memory control group subsystem
PID hash table entries: 512 (order: -1, 2048 bytes)
Dentry cache hash table entries: 16384 (order: 4, 65536 bytes)
Inode-cache hash table entries: 8192 (order: 3, 32768 bytes)
Initializing CPU#0
Initializing HighMem for node 0 (00000000:00000000)
Memory: 112876k/524288k available (4429k kernel code, 18192k reserved,
2305k data, 500k init, 0k highmem)
virtual kernel memory layout:
    fixmap  : 0xffa96000 - 0xfffff000   (5540 kB)
    pkmap   : 0xff600000 - 0xff800000   (2048 kB)
    vmalloc : 0xe0800000 - 0xff5fe000   ( 493 MB)
    lowmem  : 0xc0000000 - 0xe0000000   ( 512 MB)
      .init : 0xd8a94000 - 0xd8b11000   ( 500 kB)
      .data : 0xd8853712 - 0xd8a93d80   (2305 kB)
      .text : 0xd8400000 - 0xd8853712   (4429 kB)
Checking if this processor honours the WP bit even in supervisor
mode...Ok.
Hierarchical RCU implementation.
NR_IRQS:2304 nr_irqs:256 16
Spurious LAPIC timer interrupt on cpu 0
do_IRQ: 0.89 No irq handler for vector (irq -1)
Console: colour VGA+ 80x25
console [ttyS0] enabled
Fast TSC calibration using PIT
Detected 1694.460 MHz processor.
Calibrating delay loop (skipped), value calculated using timer frequency..
3388.92 BogoMIPS (lpj=1694460)
pid_max: default: 32768 minimum: 301
Security Framework initialized
SELinux:  Initializing.
Mount-cache hash table entries: 512
Initializing cgroup subsys cpuacct
Initializing cgroup subsys memory
Initializing cgroup subsys devices
Initializing cgroup subsys freezer
Initializing cgroup subsys net_cls
Initializing cgroup subsys blkio
Initializing cgroup subsys perf_event
CPU0: Hyper-Threading is disabled
mce: CPU supports 4 MCE banks
SMP alternatives: switching to UP code
Freeing SMP alternatives: 20k freed
ACPI: Core revision 20120111
Enabling APIC mode:  Flat.  Using 1 I/O APICs
..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
CPU0: Intel(R) Pentium(R) 4 CPU 1.70GHz stepping 02
Performance Events: Netburst events, Broken PMU hardware detected, using
software events only.
NMI watchdog disabled (cpu0): hardware events not enabled
Brought up 1 CPUs
Total of 1 processors activated (3388.92 BogoMIPS).
<snip>


* Re: [PATCH] x86, kdump, ioapic: Fix kdump race with migrating irq
  2012-02-01 23:04       ` Don Zickus
@ 2012-02-02  1:34         ` Eric W. Biederman
  2012-02-02 15:33           ` Don Zickus
  2012-02-02 17:45           ` Don Zickus
  0 siblings, 2 replies; 10+ messages in thread
From: Eric W. Biederman @ 2012-02-02  1:34 UTC (permalink / raw)
  To: Don Zickus; +Cc: x86, LKML, vgoyal, kexec-list

Don Zickus <dzickus@redhat.com> writes:

> On Tue, Jan 31, 2012 at 02:38:15PM -0800, Eric W. Biederman wrote:
>> Don Zickus <dzickus@redhat.com> writes:
>> 
>> > On Tue, Jan 31, 2012 at 02:08:29PM -0800, Eric W. Biederman wrote:
>> >> > The problem is that although kdump tries to shut down minimal hardware,
>> >> > it still needs to disable the IO APIC.  This requires spinlocks which
>> >> > may be held by another cpu.  This other cpu is being held indefinitely in
>> >> > an NMI context by kdump in order to serialize the crashing path.  Instant
>> >> > deadlock.
>> >> 
>> >> Can you test to see if kexec on panic still needs to disable the IO
>> >> APIC?  Last I looked we were close, if not all of the way there, to not
>> >> needing to boot the kernel in pic mode.
>> >
>> > Ok, so you just blindly remove disable_IO_APIC from
>> > native_machine_crash_shutdown and re-run some panic tests on various
>> > machines?  What about the disable_IO_APIC path in native_machine_shutdown?
>> >
>> 
>> Yes.  Just native_machine_crash_shutdown.
>> 
>> native_machine_shutdown is the case when all is good and we attempt to
>> put the hardware back the way we found it.
>
> Ok.
>
>> 
>> Any normal x86 machine that the kernel runs in ioapic mode should be
>> enough to get a first approximation.
>> 
>> > Also, where could I look to see if that work was done?  Is that in the
>> > ioapic setup code?
>> 
>> The primary question is do we call the ioapic setup code without calling
>> the pic setup code first.  On some embedded x86 platforms we certainly
>> do.  I don't know if that code has been generalized.
>> 
>> Historically the problem is that we started the pit timer in pic mode
>> and used that to calibrate the delay loop.
>> 
>> So what we are looking to verify is that the linux kernel boot skips
>> pic mode entirely.
>
> It seems to boot fine on an Ivy Bridge machine and a single cpu Pentium4.
> I will try an athlon3 and a nehalem tomorrow.
>
> Talking to folks here and trying to read the code it seems like the PIT
> stuff is delayed until after the IOAPIC is configured using Fast TSC
> calibration as a mechanism to work around the PIT??
>
> I attached the output of the Pentium4 when kdumping.  Not sure what to
> really look for to verify the PIC is being skipped.  Perhaps you know?

The important part is that kexec on panic works without shutting down
the ioapic.  There should be no corner case issues; it should either
work or it should fail.

The problem used to be that we always would initialize the PIT interrupt
in the 8259 interrupt controller before we would initialize the ioapics
and that would kill the boot.

If I have read your testing correctly you are apparently booting in the
kexec on panic case.  That seems to be successful to me.  So we should
be able to just remove the ioapic shutdown code from
machine_crash_shutdown as it is no longer needed.

Thank you for being careful and testing on a number of different
platforms. 

The only case I can think of that won't work without ioapic disables
is using a crash kernel that doesn't enable the ioapics.

Eric


* Re: [PATCH] x86, kdump, ioapic: Fix kdump race with migrating irq
  2012-02-02  1:34         ` Eric W. Biederman
@ 2012-02-02 15:33           ` Don Zickus
  2012-02-02 17:45           ` Don Zickus
  1 sibling, 0 replies; 10+ messages in thread
From: Don Zickus @ 2012-02-02 15:33 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: x86, LKML, vgoyal, kexec-list

On Wed, Feb 01, 2012 at 05:34:01PM -0800, Eric W. Biederman wrote:
> > Talking to folks here and trying to read the code it seems like the PIT
> > stuff is delayed until after the IOAPIC is configured using Fast TSC
> > calibration as a mechanism to work around the PIT??
> >
> > I attached the output of the Pentium4 when kdumping.  Not sure what to
> > really look for to verify the PIC is being skipped.  Perhaps you know?
> 
> The important part is that kexec on panic works without shutting down
> the ioapic.  There should be no corner case issues; it should either
> work or it should fail.
> 
> The problem used to be that we always would initialize the PIT interrupt
> in the 8259 interrupt controller before we would initialize the ioapics
> and that would kill the boot.
> 
> If I have read your testing correctly you are apparently booting in the
> kexec on panic case.  That seems to be successful to me.  So we should
> be able to just remove the ioapic shutdown code from
> machine_crash_shutdown as it is no longer needed.
> 
> Thank you for being careful and testing on a number of different
> platforms. 

No problem.  I was actually trying to find machines that did not have
ioapics to make sure they still worked (it's hard!).

So if I test on a couple more machines (hopefully one without an ioapic),
can I get your ack?  Or is there something else you would like me to do to
verify things are working correctly?

I will also need your help writing the changelog such that people
understand why removing that line is safe now.

> 
> The only case I can think of that won't work without ioapic disables
> is using a crash kernel that doesn't enable the ioapics.

Ok.  I can see that.  Would you agree that scenario is not a very sane
case? :-) Does not using the ioapic really save you anything?

Otherwise the alternative is to use my original patch.

Thanks for the help.

Cheers,
Don

> 
> Eric


* Re: [PATCH] x86, kdump, ioapic: Fix kdump race with migrating irq
  2012-02-02  1:34         ` Eric W. Biederman
  2012-02-02 15:33           ` Don Zickus
@ 2012-02-02 17:45           ` Don Zickus
  1 sibling, 0 replies; 10+ messages in thread
From: Don Zickus @ 2012-02-02 17:45 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: x86, LKML, vgoyal, kexec-list

On Wed, Feb 01, 2012 at 05:34:01PM -0800, Eric W. Biederman wrote:
> > I attached the output of the Pentium4 when kdumping.  Not sure what to
> > really look for to verify the PIC is being skipped.  Perhaps you know?
> 
> The important part is that kexec on panic works without shutting down
> the ioapic.  There should be no corner case issues; it should either
> work or it should fail.
> 
> The problem used to be that we always would initialize the PIT interrupt
> in the 8259 interrupt controller before we would initialize the ioapics
> and that would kill the boot.

So I dug up an old athlon (family 0x6, model 0x2) which didn't have an
ioapic and kdump seems to work fine.

I'll repost the patch with just the one line removed and come up with some
sort of explanation for it.

Thanks,
Don

----
here is the boot log for that kdump kernel in case it is of interest..

[root@athlon3 ~]# SysRq : Trigger a crash
BUG: unable to handle kernel NULL pointer dereference at   (null)
IP: [<c06873ef>] sysrq_handle_crash+0xf/0x20
*pdpt = 00000000326e6001 *pde = 0000000000000000
Oops: 0002 [#1] SMP
Modules linked in: sunrpc ipv6 ppdev floppy microcode pcspkr serio_raw
3c59x mii via686a i2c_viapro i2c_core sg parport_pc parport ext4 mbcache
jbd2 sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic pata_via
dm_mirror dm_region_hash dm_log dm_mod [last unloaded: mperf]

Pid: 4748, comm: bash Not tainted 3.3.0-rc1nmi+ #1 System Manufacturer
System Name/<K7V-RM>
EIP: 0060:[<c06873ef>] EFLAGS: 00010086 CPU: 0
EIP is at sysrq_handle_crash+0xf/0x20
EAX: 00000063 EBX: 00000063 ECX: 00000000 EDX: 00000000
ESI: c0a6b760 EDI: 00000296 EBP: 00000000 ESP: f26dff24
 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Process bash (pid: 4748, ti=f26de000 task=f26665b0 task.ti=f26de000)
Stack:
 c0687a50 c098f07f c09e0b34 00000007 00000000 f26c8280 c0687ab0 fffffffb
 c0687aec 00000002 b77df000 f56f2920 c057334f f26dff9c 00000002 b77df000
 f26c8280 00000002 b77df000 c05732f0 c05274d0 f26dff9c f26671cc 00000000
Call Trace:
 [<c0687a50>] ? __handle_sysrq+0xf0/0x150
 [<c0687ab0>] ? __handle_sysrq+0x150/0x150
 [<c0687aec>] ? write_sysrq_trigger+0x3c/0x50
 [<c057334f>] ? proc_reg_write+0x5f/0x90
 [<c05732f0>] ? proc_reg_poll+0x80/0x80
 [<c05274d0>] ? vfs_write+0xa0/0x170
 [<c0527671>] ? sys_write+0x41/0x80
 [<c085241f>] ? sysenter_do_call+0x12/0x28
Code: a6 c0 01 0f b6 41 03 19 d2 f7 d2 83 e2 03 83 e0 8f c1 e2 04 09 d0 88
41 03 f3 c3 90 c7 05 10 76 b2 c0 01 00 00 00 f0 83 04 24 00 <c6> 05 00 00
00 00 01 c3 89 f6 8d bc 27 00 00 00 00 8d 50 d0 83
EIP: [<c06873ef>] sysrq_handle_crash+0xf/0x20 SS:ESP 0068:f26dff24
CR2: 0000000000000000
Initializing cgroup subsys cpuset
Initializing cgroup subsys cpu
Linux version 3.3.0-rc1nmi+ (dzickus@ihatethathostname.lab.bos.redhat.com)
(gcc version 4.4.4 20100726 (Red Hat 4.4.4-13) (GCC) ) #1 SMP Wed Feb 1
17:22:40 EST 2012
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000100 - 00000000000a0000 (usable)
 BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 000000005fffc000 (usable)
 BIOS-e820: 000000005fffc000 - 000000005ffff000 (ACPI data)
 BIOS-e820: 000000005ffff000 - 0000000060000000 (ACPI NVS)
 BIOS-e820: 00000000ffff0000 - 0000000100000000 (reserved)
last_pfn = 0x5fffc max_arch_pfn = 0x1000000
Notice: NX (Execute Disable) protection missing in CPU!
user-defined physical RAM map:
 user: 0000000000000000 - 0000000000010000 (reserved)
 user: 0000000000010000 - 00000000000a0000 (usable)
 user: 00000000000f0000 - 0000000000100000 (reserved)
 user: 0000000018000000 - 000000001ff6a000 (usable)
 user: 000000001ff6a400 - 000000001ff6f000 (usable)
 user: 000000001ffff000 - 0000000020000000 (usable)
 user: 000000005fffc000 - 0000000060000000 (ACPI data)
 user: 00000000ffff0000 - 0000000100000000 (reserved)
DMI 2.3 present.
last_pfn = 0x20000 max_arch_pfn = 0x1000000
x86 PAT enabled: cpu 0, old 0x7010600070106, new 0x7010600070106
init_memory_mapping: 0000000000000000-0000000020000000
RAMDISK: 1fb7e000 - 1ff5f000
crashkernel reservation failed - No suitable area found.
ACPI: RSDP 000f5c30 00014 (v00 ASUS  )
ACPI: RSDT 5fffc000 0002C (v01 ASUS   K7V-RM   30303031 MSFT 31313031)
ACPI: FACP 5fffc080 00074 (v01 ASUS   K7V-RM   30303031 MSFT 31313031)
ACPI: DSDT 5fffc100 0267F (v01   ASUS K7V-RM   00001000 MSFT 0100000B)
ACPI: FACS 5ffff000 00040
ACPI: BOOT 5fffc040 00028 (v01 ASUS   K7V-RM   30303031 MSFT 31313031)
0MB HIGHMEM available.
512MB LOWMEM available.
  mapped low ram: 0 - 20000000
  low ram: 0 - 20000000
Zone PFN ranges:
  DMA      0x00000010 -> 0x00001000
  Normal   0x00001000 -> 0x00020000
  HighMem  empty
Movable zone start PFN for each node
Early memory PFN ranges
    0: 0x00000010 -> 0x000000a0
    0: 0x00018000 -> 0x0001ff6a
    0: 0x0001ff6b -> 0x0001ff6f
    0: 0x0001ffff -> 0x00020000
Using APIC driver default
ACPI: PM-Timer IO Port: 0xe408
SMP: Allowing 1 CPUs, 0 hotplug CPUs
Local APIC disabled by BIOS -- you can enable it with "lapic"
APIC: disable apic facility
APIC: switched to apic NOOP
PM: Registered nosave memory: 00000000000a0000 - 00000000000f0000
PM: Registered nosave memory: 00000000000f0000 - 0000000000100000
PM: Registered nosave memory: 0000000000100000 - 0000000018000000
PM: Registered nosave memory: 000000001ff6a000 - 000000001ff6b000
PM: Registered nosave memory: 000000001ff6f000 - 000000001ffff000
Allocating PCI resources starting at 60000000 (gap: 60000000:9fff0000)
Booting paravirtualized kernel on bare hardware
setup_percpu: NR_CPUS:32 nr_cpumask_bits:32 nr_cpu_ids:1 nr_node_ids:1
PERCPU: Embedded 13 pages/cpu @df400000 s32704 r0 d20544 u2097152
Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 31743
Kernel command line: ro root=/dev/mapper/vg_athlon3-lv_root rd_NO_LUKS
LANG=en_US.UTF-8 rd_LVM_LV=vg_athlon3/lv_swap rd_NO_MD KEYTABLE=us
console=ttyS0,115200 rd_LVM_LV=vg_athlon3/lv_root
SYSFONT=latarcyrheb-sun16 rd_NO_DM crashkernel=128M irqpoll nr_cpus=1
reset_devices cgroup_disable=memory  memmap=exactmap memmap=64K$0K
memmap=576K@64K memmap=64K$960K memmap=130472K@393216K memmap=19K@523689K
memmap=4K@524284K memmap=12K#1572848K memmap=4K#1572860K
memmap=64K$4194240K elfcorehdr=523688K
Misrouted IRQ fixup and polling support enabled
This may significantly impact system performance
Disabling memory control group subsystem
PID hash table entries: 512 (order: -1, 2048 bytes)
Dentry cache hash table entries: 16384 (order: 4, 65536 bytes)
Inode-cache hash table entries: 8192 (order: 3, 32768 bytes)
Initializing CPU#0
Initializing HighMem for node 0 (00000000:00000000)
Memory: 113684k/524288k available (4429k kernel code, 17384k reserved,
2305k data, 500k init, 0k highmem)
virtual kernel memory layout:
    fixmap  : 0xffa96000 - 0xfffff000   (5540 kB)
    pkmap   : 0xff600000 - 0xff800000   (2048 kB)
    vmalloc : 0xe0800000 - 0xff5fe000   ( 493 MB)
    lowmem  : 0xc0000000 - 0xe0000000   ( 512 MB)
      .init : 0xd8a94000 - 0xd8b11000   ( 500 kB)
      .data : 0xd8853712 - 0xd8a93d80   (2305 kB)
      .text : 0xd8400000 - 0xd8853712   (4429 kB)
Checking if this processor honours the WP bit even in supervisor
mode...Ok.
Hierarchical RCU implementation.
NR_IRQS:2304 nr_irqs:256 16
Console: colour VGA+ 80x25
console [ttyS0] enabled
Fast TSC calibration using PIT
Detected 700.010 MHz processor.
Calibrating delay loop (skipped), value calculated using timer frequency..
1400.02 BogoMIPS (lpj=700010)
pid_max: default: 32768 minimum: 301
Security Framework initialized
SELinux:  Initializing.
Mount-cache hash table entries: 512
Initializing cgroup subsys cpuacct
Initializing cgroup subsys memory
Initializing cgroup subsys devices
Initializing cgroup subsys freezer
Initializing cgroup subsys net_cls
Initializing cgroup subsys blkio
Initializing cgroup subsys perf_event
mce: CPU supports 4 MCE banks
SMP alternatives: switching to UP code
Freeing SMP alternatives: 20k freed
ACPI: Core revision 20120111
ACPI: setting ELCR to 0200 (from 0600)
weird, boot CPU (#0) not listed by the BIOS.
SMP motherboard not detected.
Local APIC not detected. Using dummy APIC emulation.
SMP disabled
Performance Events:
no APIC, boot with the "lapic" boot parameter to force-enable it.
no hardware sampling interrupt available.
AMD PMU driver.
... version:                0
... bit width:              48
... generic registers:      4
... value mask:             0000ffffffffffff
... max period:             00007fffffffffff
... fixed-purpose events:   0
... event mask:             000000000000000f
NMI watchdog disabled (cpu0): not supported (no LAPIC?)
Brought up 1 CPUs
Total of 1 processors activated (1400.02 BogoMIPS).
devtmpfs: initialized
print_constraints: dummy:
NET: Registered protocol family 16
ACPI: bus type pci registered
PCI: PCI BIOS revision 2.10 entry at 0xf08c0, last bus=1
PCI: Using configuration type 1 for base access
bio: create slab <bio-0> at 0
ACPI: Added _OSI(Module Device)
ACPI: Added _OSI(Processor Device)
ACPI: Added _OSI(3.0 _SCP Extensions)
ACPI: Added _OSI(Processor Aggregator Device)
ACPI: Interpreter enabled
ACPI: (supports S0 S1 S4 S5)
ACPI: Using PIC for interrupt routing
ACPI: No dock devices found.
PCI: Ignoring host bridge windows from ACPI; if necessary, use
"pci=use_crs" and report a bug
ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-ff])
<snip>

* RE: [PATCH] x86, kdump, ioapic: Fix kdump race with migrating irq
  2012-01-31 21:25 [PATCH] x86, kdump, ioapic: Fix kdump race with migrating irq Don Zickus
  2012-01-31 21:37 ` Vivek Goyal
  2012-01-31 22:08 ` Eric W. Biederman
@ 2012-02-20 15:20 ` Seiji Aguchi
  2 siblings, 0 replies; 10+ messages in thread
From: Seiji Aguchi @ 2012-02-20 15:20 UTC (permalink / raw)
  To: Don Zickus, x86; +Cc: LKML, vgoyal, ebiederm, kexec-list

> -void disable_IO_APIC(void)
> +void disable_IO_APIC(int force)
> {
> 	/*
>+	 * Use force to bust the io_apic spinlock
>+	 *
>+	 * There is a case where kdump can race with irq
>+	 * migration such that kdump will inject an NMI
>+	 * while another cpu holds the ioapic_lock to
>+	 * migrate the irq.  This would cause a deadlock.
>+	 *
>+	 * Because kdump stops all the cpus, we can safely
>+	 * bust the spinlock as we are just clearing the
>+	 * io apic anyway.
>+	 */
>+	if (force && spin_is_locked(&ioapic_lock))
>+		/* only one cpu should be running now */
>+		spin_lock_init(&ioapic_lock);
>+
>+	/*

Hmm...

This patch solves the kdump race, but it may confuse us when we analyze the vmcore,
because it clears the value of ioapic_lock without notice.

If a kernel panic is related to the IO_APIC, people analyzing it will check the value of ioapic_lock.

So it would be better to restore the value of ioapic_lock after calling disable_IO_APIC().
IOW, my idea is the following:
 - Back up the value of ioapic_lock
 - Bust ioapic_lock
 - Call disable_IO_APIC()
 - Restore the value of ioapic_lock

But, as you are discussing with Eric, we don't need to worry about this if we can completely remove disable_IO_APIC().

Seiji

end of thread, other threads:[~2012-02-20 15:20 UTC | newest]

Thread overview: 10+ messages
2012-01-31 21:25 [PATCH] x86, kdump, ioapic: Fix kdump race with migrating irq Don Zickus
2012-01-31 21:37 ` Vivek Goyal
2012-01-31 22:08 ` Eric W. Biederman
2012-01-31 22:27   ` Don Zickus
2012-01-31 22:38     ` Eric W. Biederman
2012-02-01 23:04       ` Don Zickus
2012-02-02  1:34         ` Eric W. Biederman
2012-02-02 15:33           ` Don Zickus
2012-02-02 17:45           ` Don Zickus
2012-02-20 15:20 ` Seiji Aguchi