* [PATCH v2 0/5] x86: fix hang when AP bringup is too slow
@ 2014-03-31 20:09 Igor Mammedov
  2014-03-31 20:09 ` [PATCH v2 1/5] x86: replace timeouts when booting secondary CPU with infinite wait loop Igor Mammedov
                   ` (4 more replies)
  0 siblings, 5 replies; 11+ messages in thread
From: Igor Mammedov @ 2014-03-31 20:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, mingo, hpa, x86, imammedo, bp, paul.gortmaker, JBeulich,
	prarit, drjones, toshi.kani, riel, gong.chen

Changes since v1:
  * reword comment in cpu_init() as suggested by Prarit
  * make master CPU report wakeup error at ERR level
    instead of DBG level so it would be visible to user.
  * while testing found and fixed memory corruption caused
    by invalid usage of x86_cpu_to_apicid and cpu_present maps
    on failure path in do_boot_cpu()
--

A hang is observed on virtual machines during CPU hotplug,
especially in big guests with many CPUs. (It happens more
often if the host is over-committed.)

The hang happens because the master CPU times out while waiting
for the AP to boot and 'cancels' the CPU online operation, assuming
the AP is not functional. The AP may nevertheless continue to run
wild later, causing various hangs or panics in the running kernel,
which assumes that the AP is offline.

This is an alternative approach: instead of canceling an
in-progress AP bringup (https://lkml.org/lkml/2014/3/6/257),
it removes the timeouts so that AP bringup is not affected by
poor timing, and it syncs the AP with the master CPU at early
startup, making sure that the AP won't run wild if the master CPU
doesn't expect it to come online.

Below is a detailed description of the more frequently seen hang:
-----
The master CPU may time out before cpu_callin_mask is set and cancel
booting the CPU, but the CPU being onlined still continues to boot, sets
cpu_active_mask (CPU_STARTING notifiers) and spins in
check_tsc_sync_target() waiting for the master CPU to arrive. A following
attempt to online another CPU hangs in stop_machine, initiated from here:
smp_callin ->
  smp_store_cpu_info ->
    identify_secondary_cpu ->
      mtrr_ap_init -> set_mtrr_from_inactive_cpu

stop_machine waits on completion of stop_work on all CPUs from
cpu_active_mask including a failed CPU that spins in check_tsc_sync_target().
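
The hang condition described above can be illustrated with a small
user-space model. This is only a sketch, not kernel code: the type
cpu_mask_model and both function names are hypothetical, and each mask
is a plain bitmask with bit n standing for logical CPU n.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: one bit per logical CPU, bit n = CPU n. */
typedef uint32_t cpu_mask_model;

/*
 * A stop_machine-style rendezvous needs every CPU in the active mask
 * to run the stop work. It can complete only if none of the active
 * CPUs is stuck (for example, spinning while waiting for a master
 * CPU that has already given up on it).
 */
bool rendezvous_can_complete(cpu_mask_model active, cpu_mask_model stuck)
{
	return (active & stuck) == 0;
}

/* CPUs that are both marked active and stuck: they block the rendezvous. */
cpu_mask_model rendezvous_blockers(cpu_mask_model active, cpu_mask_model stuck)
{
	return active & stuck;
}
```

In this model, a CPU 2 that set its active bit after the master timed
out and is now spinning gives active = 0x7 and stuck = 0x4, so the
rendezvous cannot complete. If the failed CPU had never reached the
point of setting its active bit (active = 0x3), it would not block
anything.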


Igor Mammedov (5):
  x86: replace timeouts when booting secondary CPU with infinite wait
    loop
  x86: cleanup not needed cpu_initialized_mask
  x86: log error on secondary CPU wakeup failure at ERR level
  x86: fix list corruption on CPU hotplug
  x86: fix memory corruption in acpi_unmap_lsapic()

 arch/x86/include/asm/cpumask.h |    1 -
 arch/x86/kernel/cpu/common.c   |   28 ++++++++-------
 arch/x86/kernel/smpboot.c      |   73 ++--------------------------------------
 3 files changed, 18 insertions(+), 84 deletions(-)



* [PATCH v2 1/5] x86: replace timeouts when booting secondary CPU with infinite wait loop
  2014-03-31 20:09 [PATCH v2 0/5] x86: fix hang when AP bringup is too slow Igor Mammedov
@ 2014-03-31 20:09 ` Igor Mammedov
  2014-04-02 17:15   ` Andi Kleen
  2014-03-31 20:09 ` [PATCH v2 2/5] x86: cleanup not needed cpu_initialized_mask Igor Mammedov
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 11+ messages in thread
From: Igor Mammedov @ 2014-03-31 20:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, mingo, hpa, x86, imammedo, bp, paul.gortmaker, JBeulich,
	prarit, drjones, toshi.kani, riel, gong.chen

A hang is observed on virtual machines during CPU hotplug,
especially in big guests with many CPUs. (It is reproducible
more often if the host is over-committed.)

It happens because the master CPU gives up waiting on the
secondary CPU and allows it to run wild. As a result, the
AP locks up or crashes the system, for example
as described here: https://lkml.org/lkml/2014/3/6/257

If the master CPU has sent the STARTUP IPI successfully,
make it wait indefinitely till the AP boots.
To ensure that the AP won't ever run wild, make it
wait at early startup till the master CPU confirms its
intention to wait for it.

Signed-off-by: Igor Mammedov <imammedo@redhat.com>
---
v2:
 - amend comment in cpu_init()
---
 arch/x86/kernel/cpu/common.c |   17 +++++++++-
 arch/x86/kernel/smpboot.c    |   68 ++----------------------------------------
 2 files changed, 18 insertions(+), 67 deletions(-)

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 8e28bf2..248161e 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1235,16 +1235,22 @@ void cpu_init(void)
 	struct task_struct *me;
 	struct tss_struct *t;
 	unsigned long v;
-	int cpu;
+	int cpu = stack_smp_processor_id();
 	int i;
 
 	/*
+	 * wait for ACK from master CPU before continuing
+	 * with AP initialization
+	 */
+	while (!cpumask_test_cpu(cpu, cpu_callout_mask))
+		cpu_relax();
+
+	/*
 	 * Load microcode on this cpu if a valid microcode is available.
 	 * This is early microcode loading procedure.
 	 */
 	load_ucode_ap();
 
-	cpu = stack_smp_processor_id();
 	t = &per_cpu(init_tss, cpu);
 	oist = &per_cpu(orig_ist, cpu);
 
@@ -1335,6 +1341,13 @@ void cpu_init(void)
 	struct tss_struct *t = &per_cpu(init_tss, cpu);
 	struct thread_struct *thread = &curr->thread;
 
+	/*
+	 * wait for ACK from master CPU before continuing
+	 * with AP initialization
+	 */
+	while (!cpumask_test_cpu(cpu, cpu_callout_mask))
+		cpu_relax();
+
 	show_ucode_info_early();
 
 	if (cpumask_test_and_set_cpu(cpu, cpu_initialized_mask)) {
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 37e11e5..8bf67bd 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -111,7 +111,6 @@ atomic_t init_deasserted;
 static void smp_callin(void)
 {
 	int cpuid, phys_id;
-	unsigned long timeout;
 
 	/*
 	 * If waken up by an INIT in an 82489DX configuration
@@ -129,37 +128,6 @@ static void smp_callin(void)
 	 * (This works even if the APIC is not enabled.)
 	 */
 	phys_id = read_apic_id();
-	if (cpumask_test_cpu(cpuid, cpu_callin_mask)) {
-		panic("%s: phys CPU#%d, CPU#%d already present??\n", __func__,
-					phys_id, cpuid);
-	}
-	pr_debug("CPU#%d (phys ID: %d) waiting for CALLOUT\n", cpuid, phys_id);
-
-	/*
-	 * STARTUP IPIs are fragile beasts as they might sometimes
-	 * trigger some glue motherboard logic. Complete APIC bus
-	 * silence for 1 second, this overestimates the time the
-	 * boot CPU is spending to send the up to 2 STARTUP IPIs
-	 * by a factor of two. This should be enough.
-	 */
-
-	/*
-	 * Waiting 2s total for startup (udelay is not yet working)
-	 */
-	timeout = jiffies + 2*HZ;
-	while (time_before(jiffies, timeout)) {
-		/*
-		 * Has the boot CPU finished it's STARTUP sequence?
-		 */
-		if (cpumask_test_cpu(cpuid, cpu_callout_mask))
-			break;
-		cpu_relax();
-	}
-
-	if (!time_before(jiffies, timeout)) {
-		panic("%s: CPU%d started up but did not get a callout!\n",
-		      __func__, cpuid);
-	}
 
 	/*
 	 * the boot CPU has finished the init stage and is spinning
@@ -749,7 +717,6 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle)
 	unsigned long start_ip = real_mode_header->trampoline_start;
 
 	unsigned long boot_error = 0;
-	int timeout;
 	int cpu0_nmi_registered = 0;
 
 	/* Just in case we booted with a single CPU. */
@@ -818,12 +785,9 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle)
 		pr_debug("After Callout %d\n", cpu);
 
 		/*
-		 * Wait 5s total for a response
+		 * Wait till AP completes initial initialization
 		 */
-		for (timeout = 0; timeout < 50000; timeout++) {
-			if (cpumask_test_cpu(cpu, cpu_callin_mask))
-				break;	/* It has booted */
-			udelay(100);
+		while (!cpumask_test_cpu(cpu, cpu_callin_mask)) {
 			/*
 			 * Allow other tasks to run while we wait for the
 			 * AP to come online. This also gives a chance
@@ -832,33 +796,7 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle)
 			 */
 			schedule();
 		}
-
-		if (cpumask_test_cpu(cpu, cpu_callin_mask)) {
-			print_cpu_msr(&cpu_data(cpu));
-			pr_debug("CPU%d: has booted.\n", cpu);
-		} else {
-			boot_error = 1;
-			if (*trampoline_status == 0xA5A5A5A5)
-				/* trampoline started but...? */
-				pr_err("CPU%d: Stuck ??\n", cpu);
-			else
-				/* trampoline code not run */
-				pr_err("CPU%d: Not responding\n", cpu);
-			if (apic->inquire_remote_apic)
-				apic->inquire_remote_apic(apicid);
-		}
-	}
-
-	if (boot_error) {
-		/* Try to put things back the way they were before ... */
-		numa_remove_cpu(cpu); /* was set by numa_add_cpu */
-
-		/* was set by do_boot_cpu() */
-		cpumask_clear_cpu(cpu, cpu_callout_mask);
-
-		/* was set by cpu_init() */
-		cpumask_clear_cpu(cpu, cpu_initialized_mask);
-
+	} else {
 		set_cpu_present(cpu, false);
 		per_cpu(x86_cpu_to_apicid, cpu) = BAD_APICID;
 	}
-- 
1.7.1



* [PATCH v2 2/5] x86: cleanup not needed cpu_initialized_mask
  2014-03-31 20:09 [PATCH v2 0/5] x86: fix hang when AP bringup is too slow Igor Mammedov
  2014-03-31 20:09 ` [PATCH v2 1/5] x86: replace timeouts when booting secondary CPU with infinite wait loop Igor Mammedov
@ 2014-03-31 20:09 ` Igor Mammedov
  2014-03-31 20:09 ` [PATCH v2 3/5] x86: log error on secondary CPU wakeup failure at ERR level Igor Mammedov
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 11+ messages in thread
From: Igor Mammedov @ 2014-03-31 20:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, mingo, hpa, x86, imammedo, bp, paul.gortmaker, JBeulich,
	prarit, drjones, toshi.kani, riel, gong.chen

cpu_initialized_mask is used in cpu_init() for detecting
whether the AP has already been started. But that test can now
never return true, since the master CPU no longer gives up waiting
on AP startup and native_cpu_up() checks for:

 physid_isset(apicid, phys_cpu_present_map) &&
 cpu_callin_mask

so that it does not attempt to call do_boot_cpu() on
an already started AP.
Since cpu_initialized_mask isn't used for anything
else, remove it altogether.

Signed-off-by: Igor Mammedov <imammedo@redhat.com>
---
 arch/x86/include/asm/cpumask.h |    1 -
 arch/x86/kernel/cpu/common.c   |   11 -----------
 arch/x86/kernel/smpboot.c      |    2 --
 3 files changed, 0 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/cpumask.h b/arch/x86/include/asm/cpumask.h
index 61c852f..c64c5f5 100644
--- a/arch/x86/include/asm/cpumask.h
+++ b/arch/x86/include/asm/cpumask.h
@@ -5,7 +5,6 @@
 
 extern cpumask_var_t cpu_callin_mask;
 extern cpumask_var_t cpu_callout_mask;
-extern cpumask_var_t cpu_initialized_mask;
 extern cpumask_var_t cpu_sibling_setup_mask;
 
 extern void setup_cpu_local_masks(void);
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 248161e..7dfac93 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -47,7 +47,6 @@
 #include "cpu.h"
 
 /* all of these masks are initialized in setup_cpu_local_masks() */
-cpumask_var_t cpu_initialized_mask;
 cpumask_var_t cpu_callout_mask;
 cpumask_var_t cpu_callin_mask;
 
@@ -57,7 +56,6 @@ cpumask_var_t cpu_sibling_setup_mask;
 /* correctly size the local cpu masks */
 void __init setup_cpu_local_masks(void)
 {
-	alloc_bootmem_cpumask_var(&cpu_initialized_mask);
 	alloc_bootmem_cpumask_var(&cpu_callin_mask);
 	alloc_bootmem_cpumask_var(&cpu_callout_mask);
 	alloc_bootmem_cpumask_var(&cpu_sibling_setup_mask);
@@ -1262,9 +1260,6 @@ void cpu_init(void)
 
 	me = current;
 
-	if (cpumask_test_and_set_cpu(cpu, cpu_initialized_mask))
-		panic("CPU#%d already initialized!\n", cpu);
-
 	pr_debug("Initializing CPU#%d\n", cpu);
 
 	clear_in_cr4(X86_CR4_VME|X86_CR4_PVI|X86_CR4_TSD|X86_CR4_DE);
@@ -1350,12 +1345,6 @@ void cpu_init(void)
 
 	show_ucode_info_early();
 
-	if (cpumask_test_and_set_cpu(cpu, cpu_initialized_mask)) {
-		printk(KERN_WARNING "CPU#%d already initialized!\n", cpu);
-		for (;;)
-			local_irq_enable();
-	}
-
 	printk(KERN_INFO "Initializing CPU#%d\n", cpu);
 
 	if (cpu_has_vme || cpu_has_tsc || cpu_has_de)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 8bf67bd..803f658 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1237,8 +1237,6 @@ static void __ref remove_cpu_from_maps(int cpu)
 	set_cpu_online(cpu, false);
 	cpumask_clear_cpu(cpu, cpu_callout_mask);
 	cpumask_clear_cpu(cpu, cpu_callin_mask);
-	/* was set by cpu_init() */
-	cpumask_clear_cpu(cpu, cpu_initialized_mask);
 	numa_remove_cpu(cpu);
 }
 
-- 
1.7.1



* [PATCH v2 3/5] x86: log error on secondary CPU wakeup failure at ERR level
  2014-03-31 20:09 [PATCH v2 0/5] x86: fix hang when AP bringup is too slow Igor Mammedov
  2014-03-31 20:09 ` [PATCH v2 1/5] x86: replace timeouts when booting secondary CPU with infinite wait loop Igor Mammedov
  2014-03-31 20:09 ` [PATCH v2 2/5] x86: cleanup not needed cpu_initialized_mask Igor Mammedov
@ 2014-03-31 20:09 ` Igor Mammedov
  2014-03-31 20:09 ` [PATCH v2 4/5] x86: fix list corruption on CPU hotplug Igor Mammedov
  2014-03-31 20:09 ` [PATCH v2 5/5] x86: fix memory corruption in acpi_unmap_lsapic() Igor Mammedov
  4 siblings, 0 replies; 11+ messages in thread
From: Igor Mammedov @ 2014-03-31 20:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, mingo, hpa, x86, imammedo, bp, paul.gortmaker, JBeulich,
	prarit, drjones, toshi.kani, riel, gong.chen

If the system is running without debug level logging,
it will not log an error when do_boot_cpu() fails to
wake up an AP. This may lead to silent AP bringup
failures at boot time.
Change the message level to KERN_ERR to make the error
visible to the user, as is done on other architectures.

Signed-off-by: Igor Mammedov <imammedo@redhat.com>
---
 arch/x86/kernel/smpboot.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 803f658..e540b44 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -858,7 +858,7 @@ int native_cpu_up(unsigned int cpu, struct task_struct *tidle)
 
 	err = do_boot_cpu(apicid, cpu, tidle);
 	if (err) {
-		pr_debug("do_boot_cpu failed %d\n", err);
+		pr_err("do_boot_cpu failed(%d) to wakeup CPU#%u\n", err, cpu);
 		return -EIO;
 	}
 
-- 
1.7.1



* [PATCH v2 4/5] x86: fix list corruption on CPU hotplug
  2014-03-31 20:09 [PATCH v2 0/5] x86: fix hang when AP bringup is too slow Igor Mammedov
                   ` (2 preceding siblings ...)
  2014-03-31 20:09 ` [PATCH v2 3/5] x86: log error on secondary CPU wakeup failure at ERR level Igor Mammedov
@ 2014-03-31 20:09 ` Igor Mammedov
  2014-03-31 20:09 ` [PATCH v2 5/5] x86: fix memory corruption in acpi_unmap_lsapic() Igor Mammedov
  4 siblings, 0 replies; 11+ messages in thread
From: Igor Mammedov @ 2014-03-31 20:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, mingo, hpa, x86, imammedo, bp, paul.gortmaker, JBeulich,
	prarit, drjones, toshi.kani, riel, gong.chen

Currently, if AP wakeup fails, the master CPU marks the AP as not present
in do_boot_cpu() by calling set_cpu_present(cpu, false).
That leads to the following list corruption on the next physical CPU
hotplug:

[  418.107336] WARNING: CPU: 1 PID: 45 at lib/list_debug.c:33 __list_add+0xbe/0xd0()
[  418.115268] list_add corruption. prev->next should be next (ffff88003dc57600), but was ffff88003e20c3a0. (prev=ffff88003e20c3a0).
[  418.123693] Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6t_REJECT ipt_REJECT cfg80211 xt_conntrack rfkill ee
[  418.138979] CPU: 1 PID: 45 Comm: kworker/u10:1 Not tainted 3.14.0-rc6+ #387
[  418.149989] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[  418.165750] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
[  418.166433]  0000000000000021 ffff880038ca7988 ffffffff8159b22d 0000000000000021
[  418.176460]  ffff880038ca79d8 ffff880038ca79c8 ffffffff8106942c ffff880038ca79e8
[  418.177453]  ffff88003e20c3a0 ffff88003dc57600 ffff88003e20c3a0 00000000ffffffea
[  418.178445] Call Trace:
[  418.185811]  [<ffffffff8159b22d>] dump_stack+0x49/0x5c
[  418.186440]  [<ffffffff8106942c>] warn_slowpath_common+0x8c/0xc0
[  418.187192]  [<ffffffff81069516>] warn_slowpath_fmt+0x46/0x50
[  418.191231]  [<ffffffff8136ef51>] ? acpi_ns_get_node+0xb7/0xc7
[  418.193889]  [<ffffffff812f796e>] __list_add+0xbe/0xd0
[  418.196649]  [<ffffffff812e2aa9>] kobject_add_internal+0x79/0x200
[  418.208610]  [<ffffffff812e2e18>] kobject_add_varg+0x38/0x60
[  418.213831]  [<ffffffff812e2ef4>] kobject_add+0x44/0x70
[  418.229961]  [<ffffffff813e2c60>] device_add+0xd0/0x550
[  418.234991]  [<ffffffff813f0e95>] ? pm_runtime_init+0xe5/0xf0
[  418.250226]  [<ffffffff813e32be>] device_register+0x1e/0x30
[  418.255296]  [<ffffffff813e82a3>] register_cpu+0xe3/0x130
[  418.266539]  [<ffffffff81592be5>] arch_register_cpu+0x65/0x150
[  418.285845]  [<ffffffff81355c0d>] acpi_processor_hotadd_init+0x5a/0x9b
...
This is caused by the fact that generic_processor_info() allocates
a logical CPU id by calling:

 cpu = cpumask_next_zero(-1, cpu_present_mask);

which returns the id of the CPU that previously failed to wake up, since
its bit was cleared by do_boot_cpu(). As a result, register_cpu() tries to
register another CPU with the same id as the CPU that is already present
but failed to come online.

Taking into account that the AP will not do anything if the master CPU
failed to wake it up, there is no reason to mark that AP as not present
and break subsequent CPU hotplug attempts. As a side effect of not marking
the AP as not present, the user is allowed to online it again later.

Signed-off-by: Igor Mammedov <imammedo@redhat.com>
---
 arch/x86/kernel/smpboot.c |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index e540b44..0400418 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -797,7 +797,6 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle)
 			schedule();
 		}
 	} else {
-		set_cpu_present(cpu, false);
 		per_cpu(x86_cpu_to_apicid, cpu) = BAD_APICID;
 	}
 
-- 
1.7.1



* [PATCH v2 5/5] x86: fix memory corruption in acpi_unmap_lsapic()
  2014-03-31 20:09 [PATCH v2 0/5] x86: fix hang when AP bringup is too slow Igor Mammedov
                   ` (3 preceding siblings ...)
  2014-03-31 20:09 ` [PATCH v2 4/5] x86: fix list corruption on CPU hotplug Igor Mammedov
@ 2014-03-31 20:09 ` Igor Mammedov
  4 siblings, 0 replies; 11+ messages in thread
From: Igor Mammedov @ 2014-03-31 20:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, mingo, hpa, x86, imammedo, bp, paul.gortmaker, JBeulich,
	prarit, drjones, toshi.kani, riel, gong.chen

If, during CPU hotplug, the master CPU fails to wake up an AP,
it sets the per-CPU x86_cpu_to_apicid to BAD_APICID=0xFFFF for that AP.

However, a following attempt to unplug that CPU will lead to an
out-of-bounds write access to __apicid_to_node[], which is
32768 items long on an x86_64 kernel.

So drop setting x86_cpu_to_apicid to BAD_APICID in do_boot_cpu()
and allow acpi_processor_remove()->acpi_unmap_lsapic() to cleanly
remove the CPU.
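
The arithmetic behind the out-of-bounds access can be checked with a
small user-space sketch. The two constants come from the description
above (0xFFFF and 32768); everything else is hypothetical and only for
illustration: the MODEL_ names, the table, and the checked setter. Note
that the patch itself does not add a bounds check, it stops storing
BAD_APICID in the first place.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_BAD_APICID     0xFFFFu /* value stored on wakeup failure */
#define MODEL_MAX_LOCAL_APIC 32768u  /* table length quoted above */

/* Stand-in for the apicid-to-node table; not the kernel's array. */
static int16_t model_apicid_to_node[MODEL_MAX_LOCAL_APIC];

/* True if apicid is a valid index into the model table. */
bool apicid_index_in_bounds(unsigned int apicid)
{
	return apicid < MODEL_MAX_LOCAL_APIC;
}

/*
 * Store a node for an APIC id, refusing ids that would index past
 * the end of the table. Returns 0 on success, -1 if out of bounds.
 */
int set_apicid_node_checked(unsigned int apicid, int16_t node)
{
	if (!apicid_index_in_bounds(apicid))
		return -1;
	model_apicid_to_node[apicid] = node;
	return 0;
}
```

Since 0xFFFF is 65535 and the last valid index is 32767, an unchecked
write indexed by BAD_APICID lands well past the end of the table.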

Signed-off-by: Igor Mammedov <imammedo@redhat.com>
---
 arch/x86/kernel/smpboot.c |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 0400418..2b0b8ee 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -796,8 +796,6 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle)
 			 */
 			schedule();
 		}
-	} else {
-		per_cpu(x86_cpu_to_apicid, cpu) = BAD_APICID;
 	}
 
 	/* mark "stuck" area as not stuck */
-- 
1.7.1



* Re: [PATCH v2 1/5] x86: replace timeouts when booting secondary CPU with infinite wait loop
  2014-03-31 20:09 ` [PATCH v2 1/5] x86: replace timeouts when booting secondary CPU with infinite wait loop Igor Mammedov
@ 2014-04-02 17:15   ` Andi Kleen
  2014-04-02 21:29     ` Igor Mammedov
  0 siblings, 1 reply; 11+ messages in thread
From: Andi Kleen @ 2014-04-02 17:15 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: linux-kernel, tglx, mingo, hpa, x86, bp, paul.gortmaker,
	JBeulich, prarit, drjones, toshi.kani, riel, gong.chen

Igor Mammedov <imammedo@redhat.com> writes:

> Hang is observed on virtual machines during CPU hotplug,
> especially in big guests with many CPUs. (It reproducible
> more often if host is over-committed).
>
> It happens because master CPU gives up waiting on
> secondary CPU and allows it to run wild. As result
> AP causes locking or crashing system. For example
> as described here: https://lkml.org/lkml/2014/3/6/257
>
> If master CPU have sent STARTUP IPI successfully,
> make it wait indefinitely till AP boots.


But what happens on a real machine when the other CPU is dead?

I've seen that. Kernel still boots. With your patch it would
hang.

I don't think you can do that. It needs to have some timeout.
Maybe a longer or configurable one?

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only


* Re: [PATCH v2 1/5] x86: replace timeouts when booting secondary CPU with infinite wait loop
  2014-04-02 17:15   ` Andi Kleen
@ 2014-04-02 21:29     ` Igor Mammedov
  2014-04-02 23:48       ` Andi Kleen
  2014-04-03  6:43       ` Ingo Molnar
  0 siblings, 2 replies; 11+ messages in thread
From: Igor Mammedov @ 2014-04-02 21:29 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, tglx, mingo, hpa, x86, bp, paul.gortmaker,
	JBeulich, prarit, drjones, toshi.kani, riel, gong.chen

On Wed, 02 Apr 2014 10:15:29 -0700
Andi Kleen <andi@firstfloor.org> wrote:

> Igor Mammedov <imammedo@redhat.com> writes:
> 
> > Hang is observed on virtual machines during CPU hotplug,
> > especially in big guests with many CPUs. (It reproducible
> > more often if host is over-committed).
> >
> > It happens because master CPU gives up waiting on
> > secondary CPU and allows it to run wild. As result
> > AP causes locking or crashing system. For example
> > as described here: https://lkml.org/lkml/2014/3/6/257
> >
> > If master CPU have sent STARTUP IPI successfully,
> > make it wait indefinitely till AP boots.
> 
> 
> But what happens on a real machine when the other CPU is dead?
One possible way to boot such a machine would be to disable the dead CPU
in kernel parameters.

> I've seen that. Kernel still boots. With your patch it would
> hang.
> 
> I don't think you can do that. It needs to have some timeout.
> Maybe a longer or configurable one?
There was a patch that tried to keep the timeouts and 'gracefully'
cancel AP boot if the master timed out on it:
https://lkml.org/lkml/2014/3/6/257

It's possible to keep timeouts in do_boot_cpu(). Is setting
trampoline_status a sufficient indication that the AP is not dead
and worth waiting for?

Then it could be rewritten like this:
  if (!boot_error) {
      boot_error = 1;
      for (timeout = 0; timeout < 50000; timeout++) {
          /* Wait till AP signals that it's ready to start initialization */
          if (*trampoline_status == 0xA5A5A5A5) {
              boot_error = 0;
              /* allow AP to start initializing. */
              cpumask_set_cpu(cpu, cpu_callout_mask);

              /* wait till AP boots till cpu_callin_mask point */
              while (!cpumask_test_cpu(cpu, cpu_callin_mask))
                   schedule();

              break;  /* It has booted */
          }
          udelay(100);
      }
  }
                   
It will provide a timeout if the AP is dead and still keep the AP from
running wild if the master CPU timed out on it.


> 
> -Andi
> 
> -- 
> ak@linux.intel.com -- Speaking for myself only


-- 
Regards,
  Igor


* Re: [PATCH v2 1/5] x86: replace timeouts when booting secondary CPU with infinite wait loop
  2014-04-02 21:29     ` Igor Mammedov
@ 2014-04-02 23:48       ` Andi Kleen
  2014-04-03  6:43       ` Ingo Molnar
  1 sibling, 0 replies; 11+ messages in thread
From: Andi Kleen @ 2014-04-02 23:48 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Andi Kleen, linux-kernel, tglx, mingo, hpa, x86, bp,
	paul.gortmaker, JBeulich, prarit, drjones, toshi.kani, riel,
	gong.chen

On Wed, Apr 02, 2014 at 11:29:56PM +0200, Igor Mammedov wrote:
> On Wed, 02 Apr 2014 10:15:29 -0700
> Andi Kleen <andi@firstfloor.org> wrote:
> 
> > Igor Mammedov <imammedo@redhat.com> writes:
> > 
> > > Hang is observed on virtual machines during CPU hotplug,
> > > especially in big guests with many CPUs. (It reproducible
> > > more often if host is over-committed).
> > >
> > > It happens because master CPU gives up waiting on
> > > secondary CPU and allows it to run wild. As result
> > > AP causes locking or crashing system. For example
> > > as described here: https://lkml.org/lkml/2014/3/6/257
> > >
> > > If master CPU have sent STARTUP IPI successfully,
> > > make it wait indefinitely till AP boots.
> > 
> > 
> > But what happens on a real machine when the other CPU is dead?
> One possible way to boot such machine would be to disable dead CPU
> in kernel parameters.

That would need explicit user action. It's much better to recover 
automatically, even if somewhat crippled.

-Andi



* Re: [PATCH v2 1/5] x86: replace timeouts when booting secondary CPU with infinite wait loop
  2014-04-02 21:29     ` Igor Mammedov
  2014-04-02 23:48       ` Andi Kleen
@ 2014-04-03  6:43       ` Ingo Molnar
  2014-04-03 21:03         ` Andi Kleen
  1 sibling, 1 reply; 11+ messages in thread
From: Ingo Molnar @ 2014-04-03  6:43 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Andi Kleen, linux-kernel, tglx, mingo, hpa, x86, bp,
	paul.gortmaker, JBeulich, prarit, drjones, toshi.kani, riel,
	gong.chen


* Igor Mammedov <imammedo@redhat.com> wrote:

> > I've seen that. Kernel still boots. With your patch it would hang.

Nonsense, not booting is OK when critical hardware is genuinely bad - 
this isn't a disk drive or networking where bad IO 'happens sometimes' 
and failure is something we have to engineer for - this is the CPU!

If a critical piece of hardware like the CPU or RAM is non-functional 
then it should be excluded by the user explicitly, not worked around 
after some ugly, non-deterministic and fragile timeout.

The timeout in the SMP bringup code was really an ancient property, 
introduced more than a decade ago when hardware makers were 
ignorant of Linux and we were ignorant of how to properly interface with 
SMP hardware.

Today a 'timeout' means one of 3 things:

  - bad, fragile hardware - this we don't want to hide, unless 
    explicitly told so by the user. I've seen such symptoms related to 
    overclocking for example - so not booting is perfectly justified, 
    it can prevent reporting a bogus kernel crash down the line.

  - buggy SMP bringup. That is a bug that needs to be fixed, not 
    worked around.

  - timeout fragility in virtualized environments

I'm not aware of any genuine case where timing out is the correct 
thing to do.

So the patches look fine to me as-is, I planned on looking at them 
more closely after the merge window.

Thanks,

	Ingo


* Re: [PATCH v2 1/5] x86: replace timeouts when booting secondary CPU with infinite wait loop
  2014-04-03  6:43       ` Ingo Molnar
@ 2014-04-03 21:03         ` Andi Kleen
  0 siblings, 0 replies; 11+ messages in thread
From: Andi Kleen @ 2014-04-03 21:03 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Igor Mammedov, Andi Kleen, linux-kernel, tglx, mingo, hpa, x86,
	bp, paul.gortmaker, JBeulich, prarit, drjones, toshi.kani, riel,
	gong.chen

On Thu, Apr 03, 2014 at 08:43:37AM +0200, Ingo Molnar wrote:
> 
> * Igor Mammedov <imammedo@redhat.com> wrote:
> 
> > > I've seen that. Kernel still boots. With your patch it would hang.
> 
> Nonsense, not booting is OK when critical hardware is genuinely bad - 
> this isn't a disk drive or networking where bad IO 'happens sometimes' 
> and failure is something we have to engineer for - this is the CPU!
> 
> If a critical piece of hardware like the CPU or RAM is non-functional 
> then it should be excluded by the user explicitly, not worked around 
> after some ugly, non-deterministic and fragile timeout.

That's generally not true. We try to recover as best as we can
and continue.

That's true for RCU stalls, and RAM errors (hwpoison) and
other error conditions. It's true for kernel problems
(we try to oops and continue, not to panic etc.)

Hanging forever is not recovering, it's just poor and broken 
error handling and generally not acceptable these days.

-Andi

