linux-kernel.vger.kernel.org archive mirror
* [PATCH][RFC v3] x86, hotplug: Use hlt instead of mwait if invoked from disable_nonboot_cpus
@ 2016-06-28  9:16 Chen Yu
  2016-07-07  0:33 ` Rafael J. Wysocki
  2016-07-10  1:49 ` [PATCH] x86 / hibernate: Use hlt_play_dead() when resuming from hibernation Rafael J. Wysocki
  0 siblings, 2 replies; 15+ messages in thread
From: Chen Yu @ 2016-06-28  9:16 UTC (permalink / raw)
  To: linux-pm
  Cc: Rafael J. Wysocki, Thomas Gleixner, H. Peter Anvin, Pavel Machek,
	Borislav Petkov, Peter Zijlstra, Ingo Molnar, Len Brown, x86,
	linux-kernel, Chen Yu

A stress test from Varun Koyyalagunta shows that a nonboot CPU
occasionally hangs when resuming from hibernation. Further
investigation shows that the hang occurs when the nonboot CPU
has been woken up incorrectly and tries to monitor mwait_ptr
for the second time; an exception is then triggered by an
illegal vaddr access, something like:
'Unable to handle kernel address of 0xffff8800ba800010...'

This exception is caused by accessing a page without the PRESENT
flag, because the pte entry for this vaddr is zero. Here is how the
problem happens: the page table for the direct mapping is allocated
dynamically by kernel_physical_mapping_init(). During resume, while
the boot CPU is writing pages back to their original addresses, it
may happen to write to the monitored mwait_ptr and so wake up one of
the nonboot CPUs. Since the page table currently used by that nonboot
CPU might not be the same as it was before hibernation, an exception
might occur due to the inconsistent page table.

The first attempt to get rid of this problem was to change the monitor
address from task.flag to the zero page, because no one would write
data to the zero page. But there is still a problem because of a
ping-pong wake-up scenario in mwait_play_dead():

One possible implementation of clflush is a read-invalidate snoop,
which is what a store might look like, so clflush might break the mwait.

1. CPU1 waits at the zero page.
2. CPU2 clflushes the zero page, waking CPU1 up, then CPU2 waits at
   the zero page.
3. CPU1 is woken up and clflushes the zero page, thus waking CPU2 up
   again.
As a result, the nonboot CPUs never sleep for long.
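
The ping-pong effect described above can be illustrated with a toy
user-space model. This is only a sketch under stated assumptions: it is
not kernel code, the name simulate_wakeups() and the two-CPU setup are
hypothetical, and it simply assumes that a flush by one CPU wakes every
other CPU monitoring the same cache line.

```c
#include <stdbool.h>
#include <string.h>

#define NR_SIM_CPUS 2

/*
 * Toy model (not kernel code): each simulated CPU monitors the cache line
 * whose index is given in line_of_cpu[].  When a CPU wakes up, it flushes
 * its own monitored line before re-arming, and that flush is assumed to
 * wake every other CPU monitoring the same line on the next step.
 * Returns the total number of wake-ups seen in 'steps' steps.
 */
static unsigned int simulate_wakeups(const int line_of_cpu[NR_SIM_CPUS],
				     unsigned int steps)
{
	bool pending[NR_SIM_CPUS] = { false };
	unsigned int wakeups = 0;

	pending[0] = true;	/* one initial stray wake-up of CPU 0 */

	for (unsigned int s = 0; s < steps; s++) {
		bool next[NR_SIM_CPUS] = { false };

		for (int cpu = 0; cpu < NR_SIM_CPUS; cpu++) {
			if (!pending[cpu])
				continue;
			wakeups++;
			/* The woken CPU flushes its line before waiting again. */
			for (int other = 0; other < NR_SIM_CPUS; other++) {
				if (other != cpu &&
				    line_of_cpu[other] == line_of_cpu[cpu])
					next[other] = true;
			}
		}
		memcpy(pending, next, sizeof(pending));
	}
	return wakeups;
}
```

With both CPUs on the same line the model produces one wake-up per step
indefinitely, while with separate lines the initial stray wake-up is the
only one.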

So it is better to monitor a different address for each nonboot
CPU. However, since there is only one zero page, at most
PAGE_SIZE/L1_CACHE_LINE CPUs can be satisfied, which is usually 64
on x86_64. That is apparently not enough for servers, so more
zero pages might be required.
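
As a rough check of the figure above, the arithmetic can be sketched as
follows. The 4096-byte page and 64-byte cache line are assumed typical
x86_64 values rather than numbers taken from this patch, and the helper
names are hypothetical.

```c
/* Assumed typical x86_64 values, for illustration only. */
#define EXAMPLE_PAGE_SIZE	4096u
#define EXAMPLE_CACHE_LINE_SIZE	64u

/* Number of CPUs that can each monitor a private cache line in one page. */
static unsigned int monitor_slots_per_page(unsigned int page_size,
					   unsigned int line_size)
{
	return line_size ? page_size / line_size : 0;
}

/* Pages needed so that every nonboot CPU gets its own line (rounds up). */
static unsigned int zero_pages_needed(unsigned int nr_cpus,
				      unsigned int page_size,
				      unsigned int line_size)
{
	unsigned int slots = monitor_slots_per_page(page_size, line_size);

	return slots ? (nr_cpus + slots - 1) / slots : 0;
}
```

Under these assumptions one page covers 64 CPUs, so a machine with, say,
288 nonboot CPUs would need five such pages.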

So choose a new solution, as Brian suggested: put the nonboot CPUs
into hlt before resume, without touching any memory during s/r.
Theoretically there might still be problems if some of the CPUs have
already been put offline, but since that case is very rare and users
can work around it, we do not deal with this special case in the
kernel for now.

BTW, as James mentioned, he might want to encapsulate disable_nonboot_cpus()
in an arch-specific function, so this patch might need a small change after that.

Comments and suggestions would be appreciated.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=106371
Reported-and-tested-by: Varun Koyyalagunta <cpudebug@centtech.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 arch/x86/kernel/smpboot.c | 17 +++++++++++++++++
 include/linux/smp.h       |  2 ++
 kernel/cpu.c              | 11 +++++++++++
 3 files changed, 30 insertions(+)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index fafe8b9..00b5181 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1331,6 +1331,8 @@ void __init native_smp_prepare_cpus(unsigned int max_cpus)
 	smp_quirk_init_udelay();
 }
 
+static bool force_hlt_play_dead;
+
 void arch_enable_nonboot_cpus_begin(void)
 {
 	set_mtrr_aps_delayed_init();
@@ -1342,6 +1344,19 @@ void arch_enable_nonboot_cpus_end(void)
 }
 
 /*
+ * If we come from disable_nonboot_cpus, use hlt directly.
+ */
+void arch_disable_nonboot_cpus_begin(void)
+{
+	force_hlt_play_dead = true;
+}
+
+void arch_disable_nonboot_cpus_end(void)
+{
+	force_hlt_play_dead = false;
+}
+
+/*
  * Early setup to make printk work.
  */
 void __init native_smp_prepare_boot_cpu(void)
@@ -1642,6 +1657,8 @@ void native_play_dead(void)
 	play_dead_common();
 	tboot_shutdown(TB_SHUTDOWN_WFS);
 
+	if (force_hlt_play_dead)
+		hlt_play_dead();
 	mwait_play_dead();	/* Only returns on failure */
 	if (cpuidle_play_dead())
 		hlt_play_dead();
diff --git a/include/linux/smp.h b/include/linux/smp.h
index c441407..b2d1b4c 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -193,6 +193,8 @@ extern void arch_disable_smp_support(void);
 
 extern void arch_enable_nonboot_cpus_begin(void);
 extern void arch_enable_nonboot_cpus_end(void);
+extern void arch_disable_nonboot_cpus_begin(void);
+extern void arch_disable_nonboot_cpus_end(void);
 
 void smp_setup_processor_id(void);
 
diff --git a/kernel/cpu.c b/kernel/cpu.c
index d948e44..fc9e839 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1017,6 +1017,14 @@ EXPORT_SYMBOL_GPL(cpu_up);
 #ifdef CONFIG_PM_SLEEP_SMP
 static cpumask_var_t frozen_cpus;
 
+void __weak arch_disable_nonboot_cpus_begin(void)
+{
+}
+
+void __weak arch_disable_nonboot_cpus_end(void)
+{
+}
+
 int disable_nonboot_cpus(void)
 {
 	int cpu, first_cpu, error = 0;
@@ -1029,6 +1037,8 @@ int disable_nonboot_cpus(void)
 	 */
 	cpumask_clear(frozen_cpus);
 
+	arch_disable_nonboot_cpus_begin();
+
 	pr_info("Disabling non-boot CPUs ...\n");
 	for_each_online_cpu(cpu) {
 		if (cpu == first_cpu)
@@ -1049,6 +1059,7 @@ int disable_nonboot_cpus(void)
 	else
 		pr_err("Non-boot CPUs are not disabled\n");
 
+	arch_disable_nonboot_cpus_end();
 	/*
 	 * Make sure the CPUs won't be enabled by someone else. We need to do
 	 * this even in case of failure as all disable_nonboot_cpus() users are
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH][RFC v3] x86, hotplug: Use hlt instead of mwait if invoked from disable_nonboot_cpus
  2016-06-28  9:16 [PATCH][RFC v3] x86, hotplug: Use hlt instead of mwait if invoked from disable_nonboot_cpus Chen Yu
@ 2016-07-07  0:33 ` Rafael J. Wysocki
  2016-07-07  2:50   ` Chen, Yu C
  2016-07-07  8:38   ` James Morse
  2016-07-10  1:49 ` [PATCH] x86 / hibernate: Use hlt_play_dead() when resuming from hibernation Rafael J. Wysocki
  1 sibling, 2 replies; 15+ messages in thread
From: Rafael J. Wysocki @ 2016-07-07  0:33 UTC (permalink / raw)
  To: Chen Yu, James Morse
  Cc: linux-pm, Thomas Gleixner, H. Peter Anvin, Pavel Machek,
	Borislav Petkov, Peter Zijlstra, Ingo Molnar, Len Brown, x86,
	linux-kernel

On Tuesday, June 28, 2016 05:16:43 PM Chen Yu wrote:
> Stress test from Varun Koyyalagunta reports that, the
> nonboot CPU would hang occasionally, when resuming from
> hibernation. Further investigation shows that, the precise
> stage when nonboot CPU hangs, is the time when the nonboot
> CPU been woken up incorrectly, and tries to monitor the
> mwait_ptr for the second time, then an exception is
> triggered due to illegal vaddr access, say, something like,
> 'Unable to handler kernel address of 0xffff8800ba800010...'
> 
> Further investigation shows that, this exception is caused
> by accessing a page without PRESENT flag, because the pte entry
> for this vaddr is zero. Here's the scenario how this problem
> happens: Page table for direct mapping is allocated dynamically
> by kernel_physical_mapping_init, it is possible that in the
> resume process, when the boot CPU is trying to write back pages
> to their original address, and just right to writes to the monitor
> mwait_ptr then wakes up one of the nonboot CPUs, since the page
> table currently used by the nonboot CPU might not the same as it
> is before the hibernation, an exception might occur due to
> inconsistent page table.
> 
> First try is to get rid of this problem by changing the monitor
> address from task.flag to zero page, because no one would write
> data to zero page. But there is still problem because of a ping-pong
> wake up scenario in mwait_play_dead:
> 
> One possible implementation of a clflush is a read-invalidate snoop,
> which is what a store might look like, so cflush might break the mwait.
> 
> 1. CPU1 wait at zero page
> 2. CPU2 cflush zero page, wake CPU1 up, then CPU2 waits at zero page
> 3. CPU1 is woken up, and invoke cflush zero page, thus wake up CPU2 again.
> then the nonboot CPUs never sleep for long.
> 
> So it's better to monitor different address for each
> nonboot CPUs, however since there is only one zero page, at most:
> PAGE_SIZE/L1_CACHE_LINE CPUs are satisfied, which is usually 64
> on a x86_64, apparently it's not enough for servers, maybe more
> zero pages are required.
> 
> So choose a new solution as Brian suggested, to put the nonboot CPUs
> into hlt before resume, without touching any memory during s/r.
> Theoretically there might still be some problems if some of the CPUs have
> already been put offline, but since the case is very rare and users
> can work around it, we do not deal with this special case in kernel
> for now.
> 
> BTW, as James mentioned, he might want to encapsulate disable_nonboot_cpus
> into arch-specific, so this patch might need small change after that.
> 
> Comments and suggestions would be appreciated.
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=106371
> Reported-and-tested-by: Varun Koyyalagunta <cpudebug@centtech.com>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>

Below is my sort of version of this (untested) and I did it this way, because
the issue is specific to resume from hibernation (the workaround need not be
applied anywhere else) and the hibernate_resume_nonboot_cpu_disable() thing may
be useful to arm64 too if I'm not mistaken (James?).

Actually, if arm64 uses it too, the __weak implementation can be dropped,
because it will be possible to make it depend on ARCH_HIBERNATION_HEADER
(x86 and arm64 are the only users of that).

Thanks,
Rafael


---
 arch/x86/include/asm/cpu.h |    2 ++
 arch/x86/kernel/smpboot.c  |    5 +++++
 arch/x86/power/cpu.c       |   19 +++++++++++++++++++
 kernel/power/hibernate.c   |    7 ++++++-
 kernel/power/power.h       |    2 ++
 5 files changed, 34 insertions(+), 1 deletion(-)

Index: linux-pm/kernel/power/hibernate.c
===================================================================
--- linux-pm.orig/kernel/power/hibernate.c
+++ linux-pm/kernel/power/hibernate.c
@@ -409,6 +409,11 @@ int hibernation_snapshot(int platform_mo
 	goto Close;
 }
 
+int __weak hibernate_resume_nonboot_cpu_disable(void)
+{
+	return disable_nonboot_cpus();
+}
+
 /**
  * resume_target_kernel - Restore system state from a hibernation image.
  * @platform_mode: Whether or not to use the platform driver.
@@ -433,7 +438,7 @@ static int resume_target_kernel(bool pla
 	if (error)
 		goto Cleanup;
 
-	error = disable_nonboot_cpus();
+	error = hibernate_resume_nonboot_cpu_disable();
 	if (error)
 		goto Enable_cpus;
 
Index: linux-pm/kernel/power/power.h
===================================================================
--- linux-pm.orig/kernel/power/power.h
+++ linux-pm/kernel/power/power.h
@@ -38,6 +38,8 @@ static inline char *check_image_kernel(s
 }
 #endif /* CONFIG_ARCH_HIBERNATION_HEADER */
 
+extern int hibernate_resume_nonboot_cpu_disable(void);
+
 /*
  * Keep some memory free so that I/O operations can succeed without paging
  * [Might this be more than 4 MB?]
Index: linux-pm/arch/x86/power/cpu.c
===================================================================
--- linux-pm.orig/arch/x86/power/cpu.c
+++ linux-pm/arch/x86/power/cpu.c
@@ -266,6 +266,25 @@ void notrace restore_processor_state(voi
 EXPORT_SYMBOL(restore_processor_state);
 #endif
 
+#if defined(CONFIG_HIBERNATION) && defined(CONFIG_HOTPLUG_CPU)
+int hibernate_resume_nonboot_cpu_disable(void)
+{
+	int ret;
+
+	/*
+	 * Ensure that MONITOR/MWAIT will not be used in the "play dead" loop
+	 * during image restoration, because it is likely that the monitored
+	 * address will be actually written to at that time and then the "dead"
+	 * CPU may start executing instructions from an image kernel's page
+	 * (and that may not be the "play dead" loop any more).
+	 */
+	force_hlt_play_dead = true;
+	ret = disable_nonboot_cpus();
+	force_hlt_play_dead = false;
+	return ret;
+}
+#endif
+
 /*
  * When bsp_check() is called in hibernate and suspend, cpu hotplug
  * is disabled already. So it's unnessary to handle race condition between
Index: linux-pm/arch/x86/kernel/smpboot.c
===================================================================
--- linux-pm.orig/arch/x86/kernel/smpboot.c
+++ linux-pm/arch/x86/kernel/smpboot.c
@@ -1441,6 +1441,8 @@ __init void prefill_possible_map(void)
 
 #ifdef CONFIG_HOTPLUG_CPU
 
+bool force_hlt_play_dead;
+
 static void remove_siblinginfo(int cpu)
 {
 	int sibling;
@@ -1642,6 +1644,9 @@ void native_play_dead(void)
 	play_dead_common();
 	tboot_shutdown(TB_SHUTDOWN_WFS);
 
+	if (force_hlt_play_dead)
+		hlt_play_dead();
+
 	mwait_play_dead();	/* Only returns on failure */
 	if (cpuidle_play_dead())
 		hlt_play_dead();
Index: linux-pm/arch/x86/include/asm/cpu.h
===================================================================
--- linux-pm.orig/arch/x86/include/asm/cpu.h
+++ linux-pm/arch/x86/include/asm/cpu.h
@@ -26,6 +26,8 @@ struct x86_cpu {
 };
 
 #ifdef CONFIG_HOTPLUG_CPU
+extern bool force_hlt_play_dead;
+
 extern int arch_register_cpu(int num);
 extern void arch_unregister_cpu(int);
 extern void start_cpu0(void);


* RE: [PATCH][RFC v3] x86, hotplug: Use hlt instead of mwait if invoked from disable_nonboot_cpus
  2016-07-07  0:33 ` Rafael J. Wysocki
@ 2016-07-07  2:50   ` Chen, Yu C
  2016-07-07 16:03     ` James Morse
  2016-07-07  8:38   ` James Morse
  1 sibling, 1 reply; 15+ messages in thread
From: Chen, Yu C @ 2016-07-07  2:50 UTC (permalink / raw)
  To: Rafael J. Wysocki, James Morse
  Cc: linux-pm, Thomas Gleixner, H. Peter Anvin, Pavel Machek,
	Borislav Petkov, Peter Zijlstra, Ingo Molnar, Len Brown, x86,
	linux-kernel


> -----Original Message-----
> From: Rafael J. Wysocki [mailto:rjw@rjwysocki.net]
> Sent: Thursday, July 07, 2016 8:33 AM
> To: Chen, Yu C; James Morse
> Cc: linux-pm@vger.kernel.org; Thomas Gleixner; H. Peter Anvin; Pavel Machek;
> Borislav Petkov; Peter Zijlstra; Ingo Molnar; Len Brown; x86@kernel.org; linux-
> kernel@vger.kernel.org
> Subject: Re: [PATCH][RFC v3] x86, hotplug: Use hlt instead of mwait if invoked
> from disable_nonboot_cpus
> 
> On Tuesday, June 28, 2016 05:16:43 PM Chen Yu wrote:
> > Stress test from Varun Koyyalagunta reports that, the nonboot CPU
> > would hang occasionally, when resuming from hibernation. Further
> > investigation shows that, the precise stage when nonboot CPU hangs, is
> > the time when the nonboot CPU been woken up incorrectly, and tries to
> > monitor the mwait_ptr for the second time, then an exception is
> > triggered due to illegal vaddr access, say, something like, 'Unable to
> > handler kernel address of 0xffff8800ba800010...'
> >
> > Further investigation shows that, this exception is caused by
> > accessing a page without PRESENT flag, because the pte entry for this
> > vaddr is zero. Here's the scenario how this problem
> > happens: Page table for direct mapping is allocated dynamically by
> > kernel_physical_mapping_init, it is possible that in the resume
> > process, when the boot CPU is trying to write back pages to their
> > original address, and just right to writes to the monitor mwait_ptr
> > then wakes up one of the nonboot CPUs, since the page table currently
> > used by the nonboot CPU might not the same as it is before the
> > hibernation, an exception might occur due to inconsistent page table.
> >
> > First try is to get rid of this problem by changing the monitor
> > address from task.flag to zero page, because no one would write data
> > to zero page. But there is still problem because of a ping-pong wake
> > up scenario in mwait_play_dead:
> >
> > One possible implementation of a clflush is a read-invalidate snoop,
> > which is what a store might look like, so cflush might break the mwait.
> >
> > 1. CPU1 wait at zero page
> > 2. CPU2 cflush zero page, wake CPU1 up, then CPU2 waits at zero page
> > 3. CPU1 is woken up, and invoke cflush zero page, thus wake up CPU2 again.
> > then the nonboot CPUs never sleep for long.
> >
> > So it's better to monitor different address for each nonboot CPUs,
> > however since there is only one zero page, at most:
> > PAGE_SIZE/L1_CACHE_LINE CPUs are satisfied, which is usually 64 on a
> > x86_64, apparently it's not enough for servers, maybe more zero pages
> > are required.
> >
> > So choose a new solution as Brian suggested, to put the nonboot CPUs
> > into hlt before resume, without touching any memory during s/r.
> > Theoretically there might still be some problems if some of the CPUs
> > have already been put offline, but since the case is very rare and
> > users can work around it, we do not deal with this special case in
> > kernel for now.
> >
> > BTW, as James mentioned, he might want to encapsulate
> > disable_nonboot_cpus into arch-specific, so this patch might need small
> change after that.
> >
> > Comments and suggestions would be appreciated.
> >
> > Link: https://bugzilla.kernel.org/show_bug.cgi?id=106371
> > Reported-and-tested-by: Varun Koyyalagunta <cpudebug@centtech.com>
> > Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> 
> Below is my sort of version of this (untested) and I did it this way, because the
> issue is specific to resume from hibernation (the workaround need not be
> applied anywhere else) and the hibernate_resume_nonboot_cpu_disable()
> thing may be useful to arm64 too if I'm not mistaken (James?).

James might want a flag to distinguish whether it is called from suspend or
resume, in his arch-specific disable_nonboot_cpus()?

And this patch works on my Xeon.
Tested-by: Chen Yu <yu.c.chen@intel.com>

> 
> Actually, if arm64 uses it too, the __weak implementation can be dropped,
> because it will be possible to make it depend on ARCH_HIBERNATION_HEADER
> (x86 and arm64 are the only users of that).
> 
> Thanks,
> Rafael
> 


* Re: [PATCH][RFC v3] x86, hotplug: Use hlt instead of mwait if invoked from disable_nonboot_cpus
  2016-07-07  0:33 ` Rafael J. Wysocki
  2016-07-07  2:50   ` Chen, Yu C
@ 2016-07-07  8:38   ` James Morse
  2016-07-07 12:25     ` Rafael J. Wysocki
  1 sibling, 1 reply; 15+ messages in thread
From: James Morse @ 2016-07-07  8:38 UTC (permalink / raw)
  To: Rafael J. Wysocki, Chen Yu
  Cc: linux-pm, Thomas Gleixner, H. Peter Anvin, Pavel Machek,
	Borislav Petkov, Peter Zijlstra, Ingo Molnar, Len Brown, x86,
	linux-kernel

Hi Rafael,

On 07/07/16 01:33, Rafael J. Wysocki wrote:
> Below is my sort of version of this (untested) and I did it this way, because
> the issue is specific to resume from hibernation (the workaround need not be
> applied anywhere else) and the hibernate_resume_nonboot_cpu_disable() thing may
> be useful to arm64 too if I'm not mistaken (James?).

Yes, we will always need to do something extra (based on data in the
arch_hibernation_header) to resume if CPU0 was offline, or kexec meant we no
longer know which CPU the firmware will boot us on.


> Actually, if arm64 uses it too, the __weak implementation can be dropped,
> because it will be possible to make it depend on ARCH_HIBERNATION_HEADER
> (x86 and arm64 are the only users of that).

Heh, I avoided that as it felt too much like a hack!



Thanks,

James


* Re: [PATCH][RFC v3] x86, hotplug: Use hlt instead of mwait if invoked from disable_nonboot_cpus
  2016-07-07  8:38   ` James Morse
@ 2016-07-07 12:25     ` Rafael J. Wysocki
  0 siblings, 0 replies; 15+ messages in thread
From: Rafael J. Wysocki @ 2016-07-07 12:25 UTC (permalink / raw)
  To: James Morse
  Cc: Chen Yu, linux-pm, Thomas Gleixner, H. Peter Anvin, Pavel Machek,
	Borislav Petkov, Peter Zijlstra, Ingo Molnar, Len Brown, x86,
	linux-kernel

On Thursday, July 07, 2016 09:38:14 AM James Morse wrote:
> Hi Rafael,
> 
> On 07/07/16 01:33, Rafael J. Wysocki wrote:
> > Below is my sort of version of this (untested) and I did it this way, because
> > the issue is specific to resume from hibernation (the workaround need not be
> > applied anywhere else) and the hibernate_resume_nonboot_cpu_disable() thing may
> > be useful to arm64 too if I'm not mistaken (James?).
> 
> Yes, we will always need to do something extra (based on data in the
> arch_hibernation_header) to resume if CPU0 was offline, or kexec meant we no
> longer know which CPU the firmware will boot us on.
> 
> 
> > Actually, if arm64 uses it too, the __weak implementation can be dropped,
> > because it will be possible to make it depend on ARCH_HIBERNATION_HEADER
> > (x86 and arm64 are the only users of that).
> 
> Heh, I avoided that as it felt too much like a hack!

OK, let's do as follows, then.

I'll queue up this patch for 4.8 if people don't object.

Then you can implement hibernate_resume_nonboot_cpu_disable() as needed on arm64
and we'll drop the __weak thing next.

Since both users of ARCH_HIBERNATION_HEADER will have their own implementations
of hibernate_resume_nonboot_cpu_disable(), we can just make it a static inline
wrapper around disable_nonboot_cpus() if ARCH_HIBERNATION_HEADER is unset.

That actually makes sense, because when ARCH_HIBERNATION_HEADER is unset, then
(a) the layout of the kernel text and static data during image restoration must
be the same as before hibernation (in which case issues like the MWAIT/MONITOR one
on x86 simply cannot happen) and (b) the restore kernel is unable to handle any
differences between the current (ie. image restoration time) and pre-hibernation
configurations of the system.
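
The header arrangement proposed above might look roughly like the
following user-space sketch. It is not the actual kernel change: the
disable_nonboot_cpus() body here is a stand-in that only counts calls,
and disable_calls is a hypothetical helper for illustration.

```c
/* Stand-in for the real disable_nonboot_cpus(); only counts its calls. */
static int disable_calls;

static int disable_nonboot_cpus(void)
{
	disable_calls++;
	return 0;
}

#ifdef CONFIG_ARCH_HIBERNATION_HEADER
/* Architectures with their own header (x86, arm64) would define this. */
int hibernate_resume_nonboot_cpu_disable(void);
#else
/* Everyone else gets a plain wrapper, so no __weak symbol is needed. */
static inline int hibernate_resume_nonboot_cpu_disable(void)
{
	return disable_nonboot_cpus();
}
#endif
```

When CONFIG_ARCH_HIBERNATION_HEADER is not defined, as in a plain
user-space build of this sketch, calling
hibernate_resume_nonboot_cpu_disable() simply forwards to
disable_nonboot_cpus().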

Thanks,
Rafael


* Re: [PATCH][RFC v3] x86, hotplug: Use hlt instead of mwait if invoked from disable_nonboot_cpus
  2016-07-07  2:50   ` Chen, Yu C
@ 2016-07-07 16:03     ` James Morse
  0 siblings, 0 replies; 15+ messages in thread
From: James Morse @ 2016-07-07 16:03 UTC (permalink / raw)
  To: Chen, Yu C, Rafael J. Wysocki
  Cc: linux-pm, Thomas Gleixner, H. Peter Anvin, Pavel Machek,
	Borislav Petkov, Peter Zijlstra, Ingo Molnar, Len Brown, x86,
	linux-kernel

Hi,

On 07/07/16 03:50, Chen, Yu C wrote:
>> From: Rafael J. Wysocki [mailto:rjw@rjwysocki.net]
>> Below is my sort of version of this (untested) and I did it this way, because the
>> issue is specific to resume from hibernation (the workaround need not be
>> applied anywhere else) and the hibernate_resume_nonboot_cpu_disable()
>> thing may be useful to arm64 too if I'm not mistaken (James?).
> 
> James might want a flag to distinguish whether it is from suspend or resume,
> in his arch-specific disabled_nonboot_cpus?

That isn't serious; we can work out whether it is hibernate or resume based on
whether we've read data out of the arch header. I added it in the other
series as it looked cleaner to pass the value in instead of inferring it.


Thanks,

James


* [PATCH] x86 / hibernate: Use hlt_play_dead() when resuming from hibernation
  2016-06-28  9:16 [PATCH][RFC v3] x86, hotplug: Use hlt instead of mwait if invoked from disable_nonboot_cpus Chen Yu
  2016-07-07  0:33 ` Rafael J. Wysocki
@ 2016-07-10  1:49 ` Rafael J. Wysocki
  2016-07-13  9:56   ` Pavel Machek
  2016-07-14  1:55   ` [PATCH v2] " Rafael J. Wysocki
  1 sibling, 2 replies; 15+ messages in thread
From: Rafael J. Wysocki @ 2016-07-10  1:49 UTC (permalink / raw)
  To: linux-pm, x86
  Cc: Chen Yu, Thomas Gleixner, H. Peter Anvin, Pavel Machek,
	Borislav Petkov, Peter Zijlstra, Ingo Molnar, Len Brown,
	linux-kernel, James Morse

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

On Intel hardware, native_play_dead() uses mwait_play_dead() by
default and only falls back to the other methods if that fails.
That also happens during resume from hibernation, when the restore
(boot) kernel runs disable_nonboot_cpus() to take all of the CPUs
except for the boot one offline.

However, that is problematic, because the address passed to
__monitor() in mwait_play_dead() is likely to be written to in the
last phase of hibernate image restoration and that causes the "dead"
CPU to start executing instructions again.  Unfortunately, the page
containing the address in that CPU's instruction pointer may not be
valid any more at that point.

First, that page may have been overwritten with image kernel memory
contents already, so the instructions the CPU attempts to execute may
simply be invalid.  Second, the page tables previously used by that
CPU may have been overwritten by image kernel memory contents, so the
address in its instruction pointer is impossible to resolve then.

A report from Varun Koyyalagunta and investigation carried out by
Chen Yu show that the latter sometimes happens in practice.

To prevent it from happening, modify native_play_dead() to make
it use hlt_play_dead() instead of mwait_play_dead() during resume
from hibernation which avoids the inadvertent "revivals" of "dead"
CPUs.
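
The flag-bracketing idea described above can be mocked in user space
as follows. This is a sketch only: every mock_* function, the enum and
last_method are hypothetical stand-ins, and only the name
force_hlt_play_dead mirrors the patch below.

```c
#include <stdbool.h>

enum play_dead_method { PLAY_DEAD_MWAIT, PLAY_DEAD_HLT };

static bool force_hlt_play_dead;
static enum play_dead_method last_method = PLAY_DEAD_MWAIT;

/* Mock of native_play_dead(): only records which method would be used. */
static void mock_play_dead(void)
{
	last_method = force_hlt_play_dead ? PLAY_DEAD_HLT : PLAY_DEAD_MWAIT;
}

/* Mock of disable_nonboot_cpus(): "offlines" every CPU except CPU 0. */
static int mock_disable_nonboot_cpus(int nr_cpus)
{
	for (int cpu = 1; cpu < nr_cpus; cpu++)
		mock_play_dead();
	return 0;
}

/* The flag is set only around the resume-time offlining, then cleared. */
static int mock_hibernate_resume_nonboot_cpu_disable(int nr_cpus)
{
	int ret;

	force_hlt_play_dead = true;
	ret = mock_disable_nonboot_cpus(nr_cpus);
	force_hlt_play_dead = false;
	return ret;
}
```

In this mock, ordinary offlining still selects the MWAIT path, and only
the resume-time wrapper selects HLT, leaving the flag clear afterwards.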

A slightly unpleasant consequence of this change is that if the
system is hibernated with one or more CPUs offline, it will generally
draw more power after resume than it did before hibernation, because
the physical state entered by CPUs via hlt_play_dead() is higher-power
than the mwait_play_dead() one in the majority of cases.  It is
possible to work around this, but it is unclear how much of a problem
that's going to be in practice, so the workaround will be implemented
later if it turns out to be necessary.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=106371
Reported-by: Varun Koyyalagunta <cpudebug@centtech.com>
Original-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

This is a slightly rearranged new version of

https://patchwork.kernel.org/patch/9217459/

---
 arch/x86/include/asm/cpu.h |    6 ++++++
 arch/x86/kernel/smpboot.c  |    3 +++
 arch/x86/power/cpu.c       |   21 +++++++++++++++++++++
 kernel/power/hibernate.c   |    7 ++++++-
 kernel/power/power.h       |    2 ++
 5 files changed, 38 insertions(+), 1 deletion(-)

Index: linux-pm/kernel/power/hibernate.c
===================================================================
--- linux-pm.orig/kernel/power/hibernate.c
+++ linux-pm/kernel/power/hibernate.c
@@ -409,6 +409,11 @@ int hibernation_snapshot(int platform_mo
 	goto Close;
 }
 
+int __weak hibernate_resume_nonboot_cpu_disable(void)
+{
+	return disable_nonboot_cpus();
+}
+
 /**
  * resume_target_kernel - Restore system state from a hibernation image.
  * @platform_mode: Whether or not to use the platform driver.
@@ -433,7 +438,7 @@ static int resume_target_kernel(bool pla
 	if (error)
 		goto Cleanup;
 
-	error = disable_nonboot_cpus();
+	error = hibernate_resume_nonboot_cpu_disable();
 	if (error)
 		goto Enable_cpus;
 
Index: linux-pm/kernel/power/power.h
===================================================================
--- linux-pm.orig/kernel/power/power.h
+++ linux-pm/kernel/power/power.h
@@ -38,6 +38,8 @@ static inline char *check_image_kernel(s
 }
 #endif /* CONFIG_ARCH_HIBERNATION_HEADER */
 
+extern int hibernate_resume_nonboot_cpu_disable(void);
+
 /*
  * Keep some memory free so that I/O operations can succeed without paging
  * [Might this be more than 4 MB?]
Index: linux-pm/arch/x86/power/cpu.c
===================================================================
--- linux-pm.orig/arch/x86/power/cpu.c
+++ linux-pm/arch/x86/power/cpu.c
@@ -266,6 +266,27 @@ void notrace restore_processor_state(voi
 EXPORT_SYMBOL(restore_processor_state);
 #endif
 
+#if defined(CONFIG_HIBERNATION) && defined(CONFIG_HOTPLUG_CPU)
+bool force_hlt_play_dead __read_mostly;
+
+int hibernate_resume_nonboot_cpu_disable(void)
+{
+	int ret;
+
+	/*
+	 * Ensure that MONITOR/MWAIT will not be used in the "play dead" loop
+	 * during hibernate image restoration, because it is likely that the
+	 * monitored address will be actually written to at that time and then
+	 * the "dead" CPU may start executing instructions from an image
+	 * kernel's page (and that may not be the "play dead" loop any more).
+	 */
+	force_hlt_play_dead = true;
+	ret = disable_nonboot_cpus();
+	force_hlt_play_dead = false;
+	return ret;
+}
+#endif
+
 /*
  * When bsp_check() is called in hibernate and suspend, cpu hotplug
  * is disabled already. So it's unnessary to handle race condition between
Index: linux-pm/arch/x86/kernel/smpboot.c
===================================================================
--- linux-pm.orig/arch/x86/kernel/smpboot.c
+++ linux-pm/arch/x86/kernel/smpboot.c
@@ -1642,6 +1642,9 @@ void native_play_dead(void)
 	play_dead_common();
 	tboot_shutdown(TB_SHUTDOWN_WFS);
 
+	if (force_hlt_play_dead)
+		hlt_play_dead();
+
 	mwait_play_dead();	/* Only returns on failure */
 	if (cpuidle_play_dead())
 		hlt_play_dead();
Index: linux-pm/arch/x86/include/asm/cpu.h
===================================================================
--- linux-pm.orig/arch/x86/include/asm/cpu.h
+++ linux-pm/arch/x86/include/asm/cpu.h
@@ -26,6 +26,12 @@ struct x86_cpu {
 };
 
 #ifdef CONFIG_HOTPLUG_CPU
+#ifdef CONFIG_HIBERNATION
+extern bool force_hlt_play_dead;
+#else
+#define force_hlt_play_dead	(false)
+#endif
+
 extern int arch_register_cpu(int num);
 extern void arch_unregister_cpu(int);
 extern void start_cpu0(void);


* Re: [PATCH] x86 / hibernate: Use hlt_play_dead() when resuming from hibernation
  2016-07-10  1:49 ` [PATCH] x86 / hibernate: Use hlt_play_dead() when resuming from hibernation Rafael J. Wysocki
@ 2016-07-13  9:56   ` Pavel Machek
  2016-07-13 10:29     ` Chen Yu
  2016-07-13 12:01     ` Rafael J. Wysocki
  2016-07-14  1:55   ` [PATCH v2] " Rafael J. Wysocki
  1 sibling, 2 replies; 15+ messages in thread
From: Pavel Machek @ 2016-07-13  9:56 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-pm, x86, Chen Yu, Thomas Gleixner, H. Peter Anvin,
	Borislav Petkov, Peter Zijlstra, Ingo Molnar, Len Brown,
	linux-kernel, James Morse

On Sun 2016-07-10 03:49:25, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> On Intel hardware, native_play_dead() uses mwait_play_dead() by
> default and only falls back to the other methods if that fails.
> That also happens during resume from hibernation, when the restore
> (boot) kernel runs disable_nonboot_cpus() to take all of the CPUs
> except for the boot one offline.
> 
> However, that is problematic, because the address passed to
> __monitor() in mwait_play_dead() is likely to be written to in the
> last phase of hibernate image restoration and that causes the "dead"
> CPU to start executing instructions again.  Unfortunately, the page
> containing the address in that CPU's instruction pointer may not be
> valid any more at that point.
> 
> First, that page may have been overwritten with image kernel memory
> contents already, so the instructions the CPU attempts to execute may
> simply be invalid.  Second, the page tables previously used by that
> CPU may have been overwritten by image kernel memory contents, so the
> address in its instruction pointer is impossible to resolve then.
> 
> A report from Varun Koyyalagunta and investigation carried out by
> Chen Yu show that the latter sometimes happens in practice.
> 
> To prevent it from happening, modify native_play_dead() to make
> it use hlt_play_dead() instead of mwait_play_dead() during resume
> from hibernation which avoids the inadvertent "revivals" of "dead"
> CPUs.
> 
> A slightly unpleasant consequence of this change is that if the
> system is hibernated with one or more CPUs offline, it will generally
> draw more power after resume than it did before hibernation, because
> the physical state entered by CPUs via hlt_play_dead() is higher-power
> than the mwait_play_dead() one in the majority of cases.  It is
> possible to work around this, but it is unclear how much of a problem
> that's going to be in practice, so the workaround will be implemented
> later if it turns out to be necessary.
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=106371
> Reported-by: Varun Koyyalagunta <cpudebug@centtech.com>
> Original-by: Chen Yu <yu.c.chen@intel.com>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

I notice that it changes even i386, where it should not be
necessary. But we probably should switch i386 to support similar to
that of x86-64 one day (and I have patches), so no problem there.

But I wonder if a simpler solution would be to place the mwait
semaphore at a known address? (The nosave region comes to mind.)

Best regards,
								Pavel
								
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: [PATCH] x86 / hibernate: Use hlt_play_dead() when resuming from hibernation
  2016-07-13  9:56   ` Pavel Machek
@ 2016-07-13 10:29     ` Chen Yu
  2016-07-13 12:01     ` Rafael J. Wysocki
  1 sibling, 0 replies; 15+ messages in thread
From: Chen Yu @ 2016-07-13 10:29 UTC (permalink / raw)
  To: Pavel Machek
  Cc: linux-pm, x86, Thomas Gleixner, H. Peter Anvin, Borislav Petkov,
	Peter Zijlstra, Ingo Molnar, Len Brown, linux-kernel,
	James Morse, Rafael J. Wysocki, varun koyyalagunta

Hi Pavel,

On 2016-07-13 17:56, Pavel Machek wrote:
> On Sun 2016-07-10 03:49:25, Rafael J. Wysocki wrote:
>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>
>> On Intel hardware, native_play_dead() uses mwait_play_dead() by
>> default and only falls back to the other methods if that fails.
>> That also happens during resume from hibernation, when the restore
>> (boot) kernel runs disable_nonboot_cpus() to take all of the CPUs
>> except for the boot one offline.
>>
>> However, that is problematic, because the address passed to
>> __monitor() in mwait_play_dead() is likely to be written to in the
>> last phase of hibernate image restoration and that causes the "dead"
>> CPU to start executing instructions again.  Unfortunately, the page
>> containing the address in that CPU's instruction pointer may not be
>> valid any more at that point.
>>
>> First, that page may have been overwritten with image kernel memory
>> contents already, so the instructions the CPU attempts to execute may
>> simply be invalid.  Second, the page tables previously used by that
>> CPU may have been overwritten by image kernel memory contents, so the
>> address in its instruction pointer is impossible to resolve then.
>>
>> A report from Varun Koyyalagunta and investigation carried out by
>> Chen Yu show that the latter sometimes happens in practice.
>>
>> To prevent it from happening, modify native_play_dead() to make
>> it use hlt_play_dead() instead of mwait_play_dead() during resume
>> from hibernation which avoids the inadvertent "revivals" of "dead"
>> CPUs.
>>
>> A slightly unpleasant consequence of this change is that if the
>> system is hibernated with one or more CPUs offline, it will generally
>> draw more power after resume than it did before hibernation, because
>> the physical state entered by CPUs via hlt_play_dead() is higher-power
>> than the mwait_play_dead() one in the majority of cases.  It is
>> possible to work around this, but it is unclear how much of a problem
>> that's going to be in practice, so the workaround will be implemented
>> later if it turns out to be necessary.
>>
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=106371
>> Reported-by: Varun Koyyalagunta <cpudebug@centtech.com>
>> Original-by: Chen Yu <yu.c.chen@intel.com>
>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> I notice that it changes even i386, where it should not be
> necessary. But we probably should switch i386 to support similar to
> that of x86-64 one day (and I have patches), so no problem there.
>
> But I wonder if a simpler solution would be to place the mwait
> semaphore at a known address? (The nosave region comes to mind.)

Previously we tried changing the monitor address from task.flag to
the zero page, because no one would write data to the zero page. But
there is still a problem because of a possible ping-pong wake-up
scenario in mwait_play_dead:

As Varun Koyyalagunta said (on his x86 platform), one possible
implementation of a clflush is a read-invalidate snoop, which is what
a store might look like, so clflush might wake up the CPU from mwait.

1. CPU1 waits at the zero page.
2. CPU2 does a clflush of the zero page, which wakes CPU1 up, then CPU2
   waits at the zero page.
3. CPU1 is woken up and does a clflush of the zero page, which wakes
   CPU2 up again.
As a result, the nonboot CPUs never sleep for long.
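
To make the ping-pong above concrete, here is a toy user-space model
(not kernel code). It assumes that each "dead" CPU repeats clflush,
monitor, mwait on its line, and that a clflush wakes any other CPU
waiting on the same line. The names sketch_count_wakeups, SKETCH_NR_CPUS
and line_of_cpu are made up for the illustration.

```c
#include <stdbool.h>

#define SKETCH_NR_CPUS 2

/*
 * Toy model, not kernel code. Assumptions: every "dead" CPU repeats
 * clflush(line); monitor(line); mwait(), and a clflush of a line wakes
 * any other CPU currently waiting on that same line.
 * line_of_cpu[i] is the (hypothetical) index of the line CPU i monitors.
 */
static unsigned int sketch_count_wakeups(const int line_of_cpu[SKETCH_NR_CPUS],
					 unsigned int rounds)
{
	bool waiting[SKETCH_NR_CPUS] = { false };
	unsigned int wakeups = 0;

	for (unsigned int r = 0; r < rounds; r++) {
		for (int cpu = 0; cpu < SKETCH_NR_CPUS; cpu++) {
			if (waiting[cpu])
				continue;	/* still in mwait, does nothing */

			/* clflush: looks like a store to other monitors of the line */
			for (int other = 0; other < SKETCH_NR_CPUS; other++) {
				if (other != cpu && waiting[other] &&
				    line_of_cpu[other] == line_of_cpu[cpu]) {
					waiting[other] = false;
					wakeups++;
				}
			}

			waiting[cpu] = true;	/* monitor + mwait */
		}
	}
	return wakeups;
}
```

With both CPUs on the same line the wakeup count keeps growing with the
number of rounds, while with one line per CPU it stays at zero in this
model.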

So it is better to monitor a different address for each nonboot CPU.
However, since there is only one zero page, at most
PAGE_SIZE/L1_CACHE_LINE CPUs can be satisfied, which is usually 64 on
x86_64. That is apparently not enough for servers, so more zero pages
might be required. So we tried to use hlt, which looks simpler.
Using the nosave region might also have this problem IMO.
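
The arithmetic behind the "usually 64" figure can be sketched as below.
This is only an illustration: the 4096-byte page and 64-byte cache line
are assumed typical x86_64 values, and the helper names are hypothetical,
not kernel functions.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed typical x86_64 values; the real kernel constants may differ. */
#define SKETCH_PAGE_SIZE	4096u
#define SKETCH_CACHE_LINE	64u

/* Number of distinct cache lines (monitor targets) that fit in one page. */
static size_t sketch_monitor_slots(size_t page_size, size_t line_size)
{
	assert(line_size > 0 && page_size % line_size == 0);
	return page_size / line_size;
}

/*
 * Byte offset within the page of the line that @cpu would monitor,
 * or -1 if the page has no separate line left for that CPU.
 */
static long sketch_monitor_offset(unsigned int cpu, size_t page_size,
				  size_t line_size)
{
	if (cpu >= sketch_monitor_slots(page_size, line_size))
		return -1;
	return (long)(cpu * line_size);
}
```

With these assumed values, CPUs 0..63 each get their own line and CPU 64
already has nowhere to go, which is why one page does not cover larger
servers.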

thanks,
Yu


* Re: [PATCH] x86 / hibernate: Use hlt_play_dead() when resuming from hibernation
  2016-07-13  9:56   ` Pavel Machek
  2016-07-13 10:29     ` Chen Yu
@ 2016-07-13 12:01     ` Rafael J. Wysocki
  2016-07-13 12:41       ` Rafael J. Wysocki
  2016-07-28 19:33       ` Pavel Machek
  1 sibling, 2 replies; 15+ messages in thread
From: Rafael J. Wysocki @ 2016-07-13 12:01 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Rafael J. Wysocki, Linux PM, the arch/x86 maintainers, Chen Yu,
	Thomas Gleixner, H. Peter Anvin, Borislav Petkov, Peter Zijlstra,
	Ingo Molnar, Len Brown, Linux Kernel Mailing List, James Morse

On Wed, Jul 13, 2016 at 11:56 AM, Pavel Machek <pavel@ucw.cz> wrote:
> On Sun 2016-07-10 03:49:25, Rafael J. Wysocki wrote:
>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>
>> On Intel hardware, native_play_dead() uses mwait_play_dead() by
>> default and only falls back to the other methods if that fails.
>> That also happens during resume from hibernation, when the restore
>> (boot) kernel runs disable_nonboot_cpus() to take all of the CPUs
>> except for the boot one offline.
>>
>> However, that is problematic, because the address passed to
>> __monitor() in mwait_play_dead() is likely to be written to in the
>> last phase of hibernate image restoration and that causes the "dead"
>> CPU to start executing instructions again.  Unfortunately, the page
>> containing the address in that CPU's instruction pointer may not be
>> valid any more at that point.
>>
>> First, that page may have been overwritten with image kernel memory
>> contents already, so the instructions the CPU attempts to execute may
>> simply be invalid.  Second, the page tables previously used by that
>> CPU may have been overwritten by image kernel memory contents, so the
>> address in its instruction pointer is impossible to resolve then.
>>
>> A report from Varun Koyyalagunta and investigation carried out by
>> Chen Yu show that the latter sometimes happens in practice.
>>
>> To prevent it from happening, modify native_play_dead() to make
>> it use hlt_play_dead() instead of mwait_play_dead() during resume
>> from hibernation which avoids the inadvertent "revivals" of "dead"
>> CPUs.
>>
>> A slightly unpleasant consequence of this change is that if the
>> system is hibernated with one or more CPUs offline, it will generally
>> draw more power after resume than it did before hibernation, because
>> the physical state entered by CPUs via hlt_play_dead() is higher-power
>> than the mwait_play_dead() one in the majority of cases.  It is
>> possible to work around this, but it is unclear how much of a problem
>> that's going to be in practice, so the workaround will be implemented
>> later if it turns out to be necessary.
>>
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=106371
>> Reported-by: Varun Koyyalagunta <cpudebug@centtech.com>
>> Original-by: Chen Yu <yu.c.chen@intel.com>
>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> I notice that it changes even i386, where it should not be
> necessary. But we probably should switch i386 to support similar to
> that of x86-64 one day (and I have patches), so no problem there.
>
> But I wonder if a simpler solution would be to place the mwait
> semaphore at a known address? (The nosave region comes to mind.)

It might work, but it wouldn't be simpler.

First off, we'd need to monitor a separate cache line for each CPU
(see the message from Chen Yu) and it'd be a pain to guarantee that.
Second, CPUs may be woken up from MWAIT for other reasons, so that
needs to be taken into account too.

In principle, we might set up a MONITOR/MWAIT "play dead" loop in a
safe page and make the "dead" CPUs jump to it during image restore,
but then the image kernel (after getting control back) would need to
migrate them away from there again, so doing the "halt" thing is *way*
simpler than that.

Thanks,
Rafael


* Re: [PATCH] x86 / hibernate: Use hlt_play_dead() when resuming from hibernation
  2016-07-13 12:01     ` Rafael J. Wysocki
@ 2016-07-13 12:41       ` Rafael J. Wysocki
  2016-07-28 19:33       ` Pavel Machek
  1 sibling, 0 replies; 15+ messages in thread
From: Rafael J. Wysocki @ 2016-07-13 12:41 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Rafael J. Wysocki, Linux PM, the arch/x86 maintainers, Chen Yu,
	Thomas Gleixner, H. Peter Anvin, Borislav Petkov, Peter Zijlstra,
	Ingo Molnar, Len Brown, Linux Kernel Mailing List, James Morse

On Wed, Jul 13, 2016 at 2:01 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Wed, Jul 13, 2016 at 11:56 AM, Pavel Machek <pavel@ucw.cz> wrote:
>> On Sun 2016-07-10 03:49:25, Rafael J. Wysocki wrote:
>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>
>>> On Intel hardware, native_play_dead() uses mwait_play_dead() by
>>> default and only falls back to the other methods if that fails.
>>> That also happens during resume from hibernation, when the restore
>>> (boot) kernel runs disable_nonboot_cpus() to take all of the CPUs
>>> except for the boot one offline.
>>>
>>> However, that is problematic, because the address passed to
>>> __monitor() in mwait_play_dead() is likely to be written to in the
>>> last phase of hibernate image restoration and that causes the "dead"
>>> CPU to start executing instructions again.  Unfortunately, the page
>>> containing the address in that CPU's instruction pointer may not be
>>> valid any more at that point.
>>>
>>> First, that page may have been overwritten with image kernel memory
>>> contents already, so the instructions the CPU attempts to execute may
>>> simply be invalid.  Second, the page tables previously used by that
>>> CPU may have been overwritten by image kernel memory contents, so the
>>> address in its instruction pointer is impossible to resolve then.
>>>
>>> A report from Varun Koyyalagunta and investigation carried out by
>>> Chen Yu show that the latter sometimes happens in practice.
>>>
>>> To prevent it from happening, modify native_play_dead() to make
>>> it use hlt_play_dead() instead of mwait_play_dead() during resume
>>> from hibernation which avoids the inadvertent "revivals" of "dead"
>>> CPUs.
>>>
>>> A slightly unpleasant consequence of this change is that if the
>>> system is hibernated with one or more CPUs offline, it will generally
>>> draw more power after resume than it did before hibernation, because
>>> the physical state entered by CPUs via hlt_play_dead() is higher-power
>>> than the mwait_play_dead() one in the majority of cases.  It is
>>> possible to work around this, but it is unclear how much of a problem
>>> that's going to be in practice, so the workaround will be implemented
>>> later if it turns out to be necessary.
>>>
>>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=106371
>>> Reported-by: Varun Koyyalagunta <cpudebug@centtech.com>
>>> Original-by: Chen Yu <yu.c.chen@intel.com>
>>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>
>> I notice that it changes even i386, where it should not be
>> necessary. But we probably should switch i386 to support similar to
>> that of x86-64 one day (and I have patches), so no problem there.
>>
>> But I wonder if a simpler solution would be to place the mwait
>> semaphore at a known address? (The nosave region comes to mind.)
>
> It might work, but it wouldn't be simpler.
>
> First off, we'd need to monitor a separate cache line for each CPU
> (see the message from Chen Yu) and it'd be a pain to guarantee that.
> Second, CPUs may be woken up from MWAIT for other reasons, so that
> needs to be taken into account too.
>
> In principle, we might set up a MONITOR/MWAIT "play dead" loop in a
> safe page and make the "dead" CPUs jump to it during image restore,
> but then the image kernel (after getting control back) would need to
> migrate them away from there again,

And even this is not enough, because we'd also need to ensure that the
non-boot CPUs use "safe" page tables while restore_image() is running.

Thanks,
Rafael


* [PATCH v2] x86 / hibernate: Use hlt_play_dead() when resuming from hibernation
  2016-07-10  1:49 ` [PATCH] x86 / hibernate: Use hlt_play_dead() when resuming from hibernation Rafael J. Wysocki
  2016-07-13  9:56   ` Pavel Machek
@ 2016-07-14  1:55   ` Rafael J. Wysocki
  2016-07-14  8:57     ` Ingo Molnar
  2016-07-28 19:34     ` Pavel Machek
  1 sibling, 2 replies; 15+ messages in thread
From: Rafael J. Wysocki @ 2016-07-14  1:55 UTC (permalink / raw)
  To: linux-pm
  Cc: x86, Chen Yu, Thomas Gleixner, H. Peter Anvin, Pavel Machek,
	Borislav Petkov, Peter Zijlstra, Ingo Molnar, Len Brown,
	linux-kernel, James Morse

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

On Intel hardware, native_play_dead() uses mwait_play_dead() by
default and only falls back to the other methods if that fails.
That also happens during resume from hibernation, when the restore
(boot) kernel runs disable_nonboot_cpus() to take all of the CPUs
except for the boot one offline.

However, that is problematic, because the address passed to
__monitor() in mwait_play_dead() is likely to be written to in the
last phase of hibernate image restoration and that causes the "dead"
CPU to start executing instructions again.  Unfortunately, the page
containing the address in that CPU's instruction pointer may not be
valid any more at that point.

First, that page may have been overwritten with image kernel memory
contents already, so the instructions the CPU attempts to execute may
simply be invalid.  Second, the page tables previously used by that
CPU may have been overwritten by image kernel memory contents, so the
address in its instruction pointer is impossible to resolve then.

A report from Varun Koyyalagunta and investigation carried out by
Chen Yu show that the latter sometimes happens in practice.

To prevent it from happening, temporarily change the smp_ops.play_dead
pointer during resume from hibernation so that it points to a special
"play dead" routine which uses hlt_play_dead() and avoids the
inadvertent "revivals" of "dead" CPUs this way.

A slightly unpleasant consequence of this change is that if the
system is hibernated with one or more CPUs offline, it will generally
draw more power after resume than it did before hibernation, because
the physical state entered by CPUs via hlt_play_dead() is higher-power
than the mwait_play_dead() one in the majority of cases.  It is
possible to work around this, but it is unclear how much of a problem
that's going to be in practice, so the workaround will be implemented
later if it turns out to be necessary.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=106371
Reported-by: Varun Koyyalagunta <cpudebug@centtech.com>
Original-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

A new version here.

I prefer this one, because it only adds any overhead if hibernation is actually
used, but then it is a bit more of a hack.

If you prefer the previous one, please let me know.

---
 arch/x86/include/asm/smp.h |    1 +
 arch/x86/kernel/smpboot.c  |    2 +-
 arch/x86/power/cpu.c       |   30 ++++++++++++++++++++++++++++++
 kernel/power/hibernate.c   |    7 ++++++-
 kernel/power/power.h       |    2 ++
 5 files changed, 40 insertions(+), 2 deletions(-)

Index: linux-pm/kernel/power/hibernate.c
===================================================================
--- linux-pm.orig/kernel/power/hibernate.c
+++ linux-pm/kernel/power/hibernate.c
@@ -409,6 +409,11 @@ int hibernation_snapshot(int platform_mo
 	goto Close;
 }
 
+int __weak hibernate_resume_nonboot_cpu_disable(void)
+{
+	return disable_nonboot_cpus();
+}
+
 /**
  * resume_target_kernel - Restore system state from a hibernation image.
  * @platform_mode: Whether or not to use the platform driver.
@@ -433,7 +438,7 @@ static int resume_target_kernel(bool pla
 	if (error)
 		goto Cleanup;
 
-	error = disable_nonboot_cpus();
+	error = hibernate_resume_nonboot_cpu_disable();
 	if (error)
 		goto Enable_cpus;
 
Index: linux-pm/kernel/power/power.h
===================================================================
--- linux-pm.orig/kernel/power/power.h
+++ linux-pm/kernel/power/power.h
@@ -38,6 +38,8 @@ static inline char *check_image_kernel(s
 }
 #endif /* CONFIG_ARCH_HIBERNATION_HEADER */
 
+extern int hibernate_resume_nonboot_cpu_disable(void);
+
 /*
  * Keep some memory free so that I/O operations can succeed without paging
  * [Might this be more than 4 MB?]
Index: linux-pm/arch/x86/power/cpu.c
===================================================================
--- linux-pm.orig/arch/x86/power/cpu.c
+++ linux-pm/arch/x86/power/cpu.c
@@ -12,6 +12,7 @@
 #include <linux/export.h>
 #include <linux/smp.h>
 #include <linux/perf_event.h>
+#include <linux/tboot.h>
 
 #include <asm/pgtable.h>
 #include <asm/proto.h>
@@ -266,6 +267,35 @@ void notrace restore_processor_state(voi
 EXPORT_SYMBOL(restore_processor_state);
 #endif
 
+#if defined(CONFIG_HIBERNATION) && defined(CONFIG_HOTPLUG_CPU)
+static void resume_play_dead(void)
+{
+	play_dead_common();
+	tboot_shutdown(TB_SHUTDOWN_WFS);
+	hlt_play_dead();
+}
+
+int hibernate_resume_nonboot_cpu_disable(void)
+{
+	void (*play_dead)(void) = smp_ops.play_dead;
+	int ret;
+
+	/*
+	 * Ensure that MONITOR/MWAIT will not be used in the "play dead" loop
+	 * during hibernate image restoration, because it is likely that the
+	 * monitored address will be actually written to at that time and then
+	 * the "dead" CPU will attempt to execute instructions again, but the
+	 * address in its instruction pointer may not be possible to resolve
+	 * any more at that point (the page tables used by it previously may
+	 * have been overwritten by hibernate image data).
+	 */
+	smp_ops.play_dead = resume_play_dead;
+	ret = disable_nonboot_cpus();
+	smp_ops.play_dead = play_dead;
+	return ret;
+}
+#endif
+
 /*
  * When bsp_check() is called in hibernate and suspend, cpu hotplug
  * is disabled already. So it's unnessary to handle race condition between
Index: linux-pm/arch/x86/kernel/smpboot.c
===================================================================
--- linux-pm.orig/arch/x86/kernel/smpboot.c
+++ linux-pm/arch/x86/kernel/smpboot.c
@@ -1622,7 +1622,7 @@ static inline void mwait_play_dead(void)
 	}
 }
 
-static inline void hlt_play_dead(void)
+void hlt_play_dead(void)
 {
 	if (__this_cpu_read(cpu_info.x86) >= 4)
 		wbinvd();
Index: linux-pm/arch/x86/include/asm/smp.h
===================================================================
--- linux-pm.orig/arch/x86/include/asm/smp.h
+++ linux-pm/arch/x86/include/asm/smp.h
@@ -135,6 +135,7 @@ int native_cpu_up(unsigned int cpunum, s
 int native_cpu_disable(void);
 int common_cpu_die(unsigned int cpu);
 void native_cpu_die(unsigned int cpu);
+void hlt_play_dead(void);
 void native_play_dead(void);
 void play_dead_common(void);
 void wbinvd_on_cpu(int cpu);
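
For readers who want to see the save/swap/restore idea from the changelog
in isolation, here is a small user-space sketch. It is not the kernel
code: all sketch_* names are hypothetical stand-ins, and the "play dead"
routines only record which method was chosen instead of stopping a CPU.

```c
#include <stddef.h>

/* Hypothetical stand-in for the play_dead member of smp_ops. */
struct sketch_smp_ops {
	void (*play_dead)(void);
};

enum sketch_method { SKETCH_NONE, SKETCH_MWAIT, SKETCH_HLT };

/* Records which "play dead" routine ran last (illustration only). */
static enum sketch_method sketch_last_method = SKETCH_NONE;

static void sketch_mwait_play_dead(void) { sketch_last_method = SKETCH_MWAIT; }
static void sketch_hlt_play_dead(void)   { sketch_last_method = SKETCH_HLT; }

/* Default: the MWAIT-based routine, as on Intel hardware in the thread. */
static struct sketch_smp_ops sketch_ops = {
	.play_dead = sketch_mwait_play_dead,
};

/* Stand-in for disable_nonboot_cpus(): "offlines" one CPU via the hook. */
static int sketch_disable_nonboot_cpus(void)
{
	sketch_ops.play_dead();
	return 0;
}

/* Temporarily route "play dead" to the HLT variant, then restore it. */
static int sketch_resume_nonboot_cpu_disable(void)
{
	void (*saved)(void) = sketch_ops.play_dead;
	int ret;

	sketch_ops.play_dead = sketch_hlt_play_dead;
	ret = sketch_disable_nonboot_cpus();
	sketch_ops.play_dead = saved;
	return ret;
}
```

In this model the HLT variant is only used for the one call made during
"resume", and any later offlining goes back to the saved default.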


* Re: [PATCH v2] x86 / hibernate: Use hlt_play_dead() when resuming from hibernation
  2016-07-14  1:55   ` [PATCH v2] " Rafael J. Wysocki
@ 2016-07-14  8:57     ` Ingo Molnar
  2016-07-28 19:34     ` Pavel Machek
  1 sibling, 0 replies; 15+ messages in thread
From: Ingo Molnar @ 2016-07-14  8:57 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-pm, x86, Chen Yu, Thomas Gleixner, H. Peter Anvin,
	Pavel Machek, Borislav Petkov, Peter Zijlstra, Ingo Molnar,
	Len Brown, linux-kernel, James Morse


* Rafael J. Wysocki <rjw@rjwysocki.net> wrote:

> A new version here.
> 
> I prefer this one, because it only adds any overhead if hibernation is actually
> used, but then it is a bit more of a hack.
> 
> If you prefer the previous one, please let me know.
> 
> ---
>  arch/x86/include/asm/smp.h |    1 +
>  arch/x86/kernel/smpboot.c  |    2 +-
>  arch/x86/power/cpu.c       |   30 ++++++++++++++++++++++++++++++
>  kernel/power/hibernate.c   |    7 ++++++-
>  kernel/power/power.h       |    2 ++
>  5 files changed, 40 insertions(+), 2 deletions(-)

Acked-by: Ingo Molnar <mingo@kernel.org>

Thanks,

	Ingo


* Re: [PATCH] x86 / hibernate: Use hlt_play_dead() when resuming from hibernation
  2016-07-13 12:01     ` Rafael J. Wysocki
  2016-07-13 12:41       ` Rafael J. Wysocki
@ 2016-07-28 19:33       ` Pavel Machek
  1 sibling, 0 replies; 15+ messages in thread
From: Pavel Machek @ 2016-07-28 19:33 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Linux PM, the arch/x86 maintainers, Chen Yu,
	Thomas Gleixner, H. Peter Anvin, Borislav Petkov, Peter Zijlstra,
	Ingo Molnar, Len Brown, Linux Kernel Mailing List, James Morse

On Wed 2016-07-13 14:01:52, Rafael J. Wysocki wrote:
> On Wed, Jul 13, 2016 at 11:56 AM, Pavel Machek <pavel@ucw.cz> wrote:
> > On Sun 2016-07-10 03:49:25, Rafael J. Wysocki wrote:
> >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>
> >> On Intel hardware, native_play_dead() uses mwait_play_dead() by
> >> default and only falls back to the other methods if that fails.
> >> That also happens during resume from hibernation, when the restore
> >> (boot) kernel runs disable_nonboot_cpus() to take all of the CPUs
> >> except for the boot one offline.
> >>
> >> However, that is problematic, because the address passed to
> >> __monitor() in mwait_play_dead() is likely to be written to in the
> >> last phase of hibernate image restoration and that causes the "dead"
> >> CPU to start executing instructions again.  Unfortunately, the page
> >> containing the address in that CPU's instruction pointer may not be
> >> valid any more at that point.
> >>
> >> First, that page may have been overwritten with image kernel memory
> >> contents already, so the instructions the CPU attempts to execute may
> >> simply be invalid.  Second, the page tables previously used by that
> >> CPU may have been overwritten by image kernel memory contents, so the
> >> address in its instruction pointer is impossible to resolve then.
> >>
> >> A report from Varun Koyyalagunta and investigation carried out by
> >> Chen Yu show that the latter sometimes happens in practice.
> >>
> >> To prevent it from happening, modify native_play_dead() to make
> >> it use hlt_play_dead() instead of mwait_play_dead() during resume
> >> from hibernation which avoids the inadvertent "revivals" of "dead"
> >> CPUs.
> >>
> >> A slightly unpleasant consequence of this change is that if the
> >> system is hibernated with one or more CPUs offline, it will generally
> >> draw more power after resume than it did before hibernation, because
> >> the physical state entered by CPUs via hlt_play_dead() is higher-power
> >> than the mwait_play_dead() one in the majority of cases.  It is
> >> possible to work around this, but it is unclear how much of a problem
> >> that's going to be in practice, so the workaround will be implemented
> >> later if it turns out to be necessary.
> >>
> >> Link: https://bugzilla.kernel.org/show_bug.cgi?id=106371
> >> Reported-by: Varun Koyyalagunta <cpudebug@centtech.com>
> >> Original-by: Chen Yu <yu.c.chen@intel.com>
> >> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > I notice that it changes even i386, where it should not be
> > necessary. But we probably should switch i386 to support similar to
> > that of x86-64 one day (and I have patches), so no problem there.
> >
> > But I wonder if a simpler solution would be to place the mwait
> > semaphore at a known address? (The nosave region comes to mind.)
> 
> It might work, but it wouldn't be simpler.
> 
> First off, we'd need to monitor a separate cache line for each CPU
> (see the message from Chen Yu) and it'd be a pain to guarantee that.
> Second, CPUs may be woken up from MWAIT for other reasons, so that
> needs to be taken into account too.
> 
> > In principle, we might set up a MONITOR/MWAIT "play dead" loop in a
> safe page and make the "dead" CPUs jump to it during image restore,
> but then the image kernel (after getting control back) would need to
> migrate them away from there again, so doing the "halt" thing is *way*
> simpler than that.

OK, it looks like you have the best solution. Thanks...
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: [PATCH v2] x86 / hibernate: Use hlt_play_dead() when resuming from hibernation
  2016-07-14  1:55   ` [PATCH v2] " Rafael J. Wysocki
  2016-07-14  8:57     ` Ingo Molnar
@ 2016-07-28 19:34     ` Pavel Machek
  1 sibling, 0 replies; 15+ messages in thread
From: Pavel Machek @ 2016-07-28 19:34 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-pm, x86, Chen Yu, Thomas Gleixner, H. Peter Anvin,
	Borislav Petkov, Peter Zijlstra, Ingo Molnar, Len Brown,
	linux-kernel, James Morse

On Thu 2016-07-14 03:55:23, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> On Intel hardware, native_play_dead() uses mwait_play_dead() by
> default and only falls back to the other methods if that fails.
> That also happens during resume from hibernation, when the restore
> (boot) kernel runs disable_nonboot_cpus() to take all of the CPUs
> except for the boot one offline.
> 
> However, that is problematic, because the address passed to
> __monitor() in mwait_play_dead() is likely to be written to in the
> last phase of hibernate image restoration and that causes the "dead"
> CPU to start executing instructions again.  Unfortunately, the page
> containing the address in that CPU's instruction pointer may not be
> valid any more at that point.
> 
> First, that page may have been overwritten with image kernel memory
> contents already, so the instructions the CPU attempts to execute may
> simply be invalid.  Second, the page tables previously used by that
> CPU may have been overwritten by image kernel memory contents, so the
> address in its instruction pointer is impossible to resolve then.
> 
> A report from Varun Koyyalagunta and investigation carried out by
> Chen Yu show that the latter sometimes happens in practice.
> 
> To prevent it from happening, temporarily change the smp_ops.play_dead
> pointer during resume from hibernation so that it points to a special
> "play dead" routine which uses hlt_play_dead() and avoids the
> inadvertent "revivals" of "dead" CPUs this way.
> 
> A slightly unpleasant consequence of this change is that if the
> system is hibernated with one or more CPUs offline, it will generally
> draw more power after resume than it did before hibernation, because
> the physical state entered by CPUs via hlt_play_dead() is higher-power
> than the mwait_play_dead() one in the majority of cases.  It is
> possible to work around this, but it is unclear how much of a problem
> that's going to be in practice, so the workaround will be implemented
> later if it turns out to be necessary.
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=106371
> Reported-by: Varun Koyyalagunta <cpudebug@centtech.com>
> Original-by: Chen Yu <yu.c.chen@intel.com>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Acked-by: Pavel Machek <pavel@ucw.cz>

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Thread overview: 15+ messages
2016-06-28  9:16 [PATCH][RFC v3] x86, hotplug: Use hlt instead of mwait if invoked from disable_nonboot_cpus Chen Yu
2016-07-07  0:33 ` Rafael J. Wysocki
2016-07-07  2:50   ` Chen, Yu C
2016-07-07 16:03     ` James Morse
2016-07-07  8:38   ` James Morse
2016-07-07 12:25     ` Rafael J. Wysocki
2016-07-10  1:49 ` [PATCH] x86 / hibernate: Use hlt_play_dead() when resuming from hibernation Rafael J. Wysocki
2016-07-13  9:56   ` Pavel Machek
2016-07-13 10:29     ` Chen Yu
2016-07-13 12:01     ` Rafael J. Wysocki
2016-07-13 12:41       ` Rafael J. Wysocki
2016-07-28 19:33       ` Pavel Machek
2016-07-14  1:55   ` [PATCH v2] " Rafael J. Wysocki
2016-07-14  8:57     ` Ingo Molnar
2016-07-28 19:34     ` Pavel Machek
