All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] x86: mcheck: suspend/resume bug fixes
@ 2008-12-12 18:06 Andreas Herrmann
  2008-12-12 18:08 ` [PATCH 1/2] x86: don't enable MCE after suspend/resume when system was booted with "nomce" Andreas Herrmann
  2008-12-12 18:10 ` [PATCH 2/2] x86: re-enable MCE on secondary CPUS after suspend/resume Andreas Herrmann
  0 siblings, 2 replies; 11+ messages in thread
From: Andreas Herrmann @ 2008-12-12 18:06 UTC (permalink / raw)
  To: Ingo Molnar, Thomas Gleixner, H. Peter Anvin; +Cc: linux-kernel

Following two patches fix suspend/resume issues with current kernel.
Patches are against v2.6.28-rc8-1-g6c34bc2
IMHO this should go into .28.


Thanks,

Andreas



^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/2] x86: don't enable MCE after suspend/resume when system was booted with "nomce"
  2008-12-12 18:06 [PATCH 0/2] x86: mcheck: suspend/resume bug fixes Andreas Herrmann
@ 2008-12-12 18:08 ` Andreas Herrmann
  2008-12-12 19:08   ` Andi Kleen
  2008-12-12 18:10 ` [PATCH 2/2] x86: re-enable MCE on secondary CPUS after suspend/resume Andreas Herrmann
  1 sibling, 1 reply; 11+ messages in thread
From: Andreas Herrmann @ 2008-12-12 18:08 UTC (permalink / raw)
  To: Ingo Molnar, Thomas Gleixner, H. Peter Anvin; +Cc: linux-kernel

Impact: fix suspend/resume bug with MCE

A suspend/resume cycle unconditionally enables MCE
for the boot CPU if MCE is compiled into the kernel.
I.e. even a system booted with "nomce" configures MCE for the boot CPU.

This patch ensures that MCE is not turned on for systems booted with
"nomce".

Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
---
 arch/x86/kernel/cpu/mcheck/mce_64.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce_64.c b/arch/x86/kernel/cpu/mcheck/mce_64.c
index 4b031a4..e2d9649 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_64.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_64.c
@@ -443,6 +443,9 @@ static void mce_init(void *dummy)
 	u64 cap;
 	int i;
 
+	if (mce_dont_init)
+		return;
+
 	rdmsrl(MSR_IA32_MCG_CAP, cap);
 	banks = cap & 0xff;
 	if (banks > MCE_EXTENDED_BANK) {
-- 
1.6.0.4




^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 2/2] x86: re-enable MCE on secondary CPUS after suspend/resume
  2008-12-12 18:06 [PATCH 0/2] x86: mcheck: suspend/resume bug fixes Andreas Herrmann
  2008-12-12 18:08 ` [PATCH 1/2] x86: don't enable MCE after suspend/resume when system was booted with "nomce" Andreas Herrmann
@ 2008-12-12 18:10 ` Andreas Herrmann
  2008-12-12 19:06   ` Andi Kleen
  1 sibling, 1 reply; 11+ messages in thread
From: Andreas Herrmann @ 2008-12-12 18:10 UTC (permalink / raw)
  To: Ingo Molnar, Thomas Gleixner, H. Peter Anvin; +Cc: linux-kernel

Impact: fix suspend/resume bug with MCE

After suspend/resume MCx_CTL registers of secondary CPUs are cleared.
(At least that's what I've observed on several systems.)
Linux currently only re-initializes MCE on the boot CPU - see mce_resume().
Thus after suspend/resume we end up with a system where MCE is active
on the boot CPU but switched off on all other CPUs.

By calling mce_init() whenever a CPU comes online this problem is
solved.

Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
---
 arch/x86/kernel/cpu/mcheck/mce_64.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce_64.c b/arch/x86/kernel/cpu/mcheck/mce_64.c
index e2d9649..0174539 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_64.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_64.c
@@ -889,6 +889,7 @@ static int __cpuinit mce_cpu_callback(struct notifier_block *nfb,
 		mce_create_device(cpu);
 		if (threshold_cpu_callback)
 			threshold_cpu_callback(action, cpu);
+		mce_init(NULL);
 		break;
 	case CPU_DEAD:
 	case CPU_DEAD_FROZEN:
-- 
1.6.0.4




^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] x86: re-enable MCE on secondary CPUS after suspend/resume
  2008-12-12 18:10 ` [PATCH 2/2] x86: re-enable MCE on secondary CPUS after suspend/resume Andreas Herrmann
@ 2008-12-12 19:06   ` Andi Kleen
  2008-12-15 19:05     ` Andreas Herrmann
  0 siblings, 1 reply; 11+ messages in thread
From: Andi Kleen @ 2008-12-12 19:06 UTC (permalink / raw)
  To: Andreas Herrmann
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, linux-kernel

Andreas Herrmann <andreas.herrmann3@amd.com> writes:

> Impact: fix suspend/resume bug with MCE
>
> After suspend/resume MCx_CTL registers of secondary CPUs are cleared.
> (At least that's what I've observed on several systems.)
> Linux currently only re-initializes MCE on the boot CPU - see mce_resume().
> Thus after suspend/resume we end up with a system where MCE is active
> on the boot CPU but switched off on all other CPUs.
>
> By calling mce_init() whenever a CPU comes online this problem is
> solved.

Can you double check that please?

Suspend/resume are supposted to hotunplug all CPUs except the BP and
then re-online them on resume (with "disable_nonboot_cpus()) . The
re-online initializes MCEs in the standard CPU bootup path.

A good way is to stick a WARN_ON(num_online_cpus() > 1) into
mce_suspend(). I had that here for some time and didn't see
it trigger.

I got a couple of suspend bug fixes in my mce improvement tree, see:

http://git.kernel.org/?p=linux/kernel/git/mingo/linux-2.6-x86.git;a=history;f=arch/x86/kernel/cpu/mcheck/mce_64.c;h=9512a7eab4e7b03a584f5bb647bd242bd4c003dc;hb=x86/mce

During review it was decided to all defer it to .29 though.

-Andi

-- 
ak@linux.intel.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] x86: don't enable MCE after suspend/resume when system was booted with "nomce"
  2008-12-12 18:08 ` [PATCH 1/2] x86: don't enable MCE after suspend/resume when system was booted with "nomce" Andreas Herrmann
@ 2008-12-12 19:08   ` Andi Kleen
  2008-12-15 18:55     ` Andreas Herrmann
  0 siblings, 1 reply; 11+ messages in thread
From: Andi Kleen @ 2008-12-12 19:08 UTC (permalink / raw)
  To: Andreas Herrmann
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, linux-kernel

Andreas Herrmann <andreas.herrmann3@amd.com> writes:

> Impact: fix suspend/resume bug with MCE
>
> A suspend/resume cycle unconditionally enables MCE
> for the boot CPU if MCE is compiled into the kernel.
> I.e. even a system booted with "nomce" configures MCE for the boot CPU.
>
> This patch ensures that MCE is not turned on for systems booted with
> "nomce".

I have that fixed in a better way in

http://git.kernel.org/?p=linux/kernel/git/mingo/linux-2.6-x86.git;a=commit;h=81c7af997dafbabad464ccefba89af2b247899da

I believe


-Andi

-- 
ak@linux.intel.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] x86: don't enable MCE after suspend/resume when system was booted with "nomce"
  2008-12-12 19:08   ` Andi Kleen
@ 2008-12-15 18:55     ` Andreas Herrmann
  2008-12-15 22:23       ` Andi Kleen
  0 siblings, 1 reply; 11+ messages in thread
From: Andreas Herrmann @ 2008-12-15 18:55 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, linux-kernel

On Fri, Dec 12, 2008 at 08:08:28PM +0100, Andi Kleen wrote:
> Andreas Herrmann <andreas.herrmann3@amd.com> writes:
> 
> > Impact: fix suspend/resume bug with MCE
> >
> > A suspend/resume cycle unconditionally enables MCE
> > for the boot CPU if MCE is compiled into the kernel.
> > I.e. even a system booted with "nomce" configures MCE for the boot CPU.
> >
> > This patch ensures that MCE is not turned on for systems booted with
> > "nomce".
> 
> I have that fixed in a better way in

> http://git.kernel.org/?p=linux/kernel/git/mingo/linux-2.6-x86.git;a=commit;h=81c7af997dafbabad464ccefba89af2b247899da

Thanks for that pointer.
I didn't look at tip/x86/mce up to know. Is this .28 material?
My fix is intended for .28.

I didn't care about the lurking sysfs interface when booted with
"nomce".  It's there and you could enable MCE with changing attributes
but then you are actively fiddling with MCE.

The fix is for the more severe problem that mce_init() is called on
resume and thus will enable MCE (setup cr4 and corresponding MSRs)
which shouldn't happen when booted with "nomce".


Regards,

Andreas



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] x86: re-enable MCE on secondary CPUS after suspend/resume
  2008-12-12 19:06   ` Andi Kleen
@ 2008-12-15 19:05     ` Andreas Herrmann
  2008-12-15 22:33       ` Andi Kleen
  0 siblings, 1 reply; 11+ messages in thread
From: Andreas Herrmann @ 2008-12-15 19:05 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, linux-kernel

On Fri, Dec 12, 2008 at 08:06:21PM +0100, Andi Kleen wrote:
> Andreas Herrmann <andreas.herrmann3@amd.com> writes:
> 
> > Impact: fix suspend/resume bug with MCE
> >
> > After suspend/resume MCx_CTL registers of secondary CPUs are cleared.
> > (At least that's what I've observed on several systems.)
> > Linux currently only re-initializes MCE on the boot CPU - see mce_resume().
> > Thus after suspend/resume we end up with a system where MCE is active
> > on the boot CPU but switched off on all other CPUs.
> >
> > By calling mce_init() whenever a CPU comes online this problem is
> > solved.
> 
> Can you double check that please?
> 
> Suspend/resume are supposted to hotunplug all CPUs except the BP and
> then re-online them on resume (with "disable_nonboot_cpus()) . The
> re-online initializes MCEs in the standard CPU bootup path.

For BP we have

/* On resume clear all MCE state. Don't want to see leftovers from the BIOS.
   Only one CPU is active at this time, the others get readded later using
   CPU hotplug. */
static int mce_resume(struct sys_device *dev)
{
        mce_init(NULL);
        return 0;
}

For APs mcheck_init() is called on resume. But as the respective bit
for an AP is usually set in "mce_cpus" after boot (which is correct, I
think) mcheck_init does not call mce_init, see:

void __cpuinit mcheck_init(struct cpuinfo_x86 *c)
{
        static cpumask_t mce_cpus = CPU_MASK_NONE;

        mce_cpu_quirks(c);

        if (mce_dont_init ||
            cpu_test_and_set(smp_processor_id(), mce_cpus) ||
            !mce_available(c))
  =>             return;

        mce_init(NULL);
        mce_cpu_features(c);
}

But we need to call mce_init to clear all MCE state.
IMHO the best location to call mce_init for APs is the cpu notifier.


Regards,

Andreas



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] x86: don't enable MCE after suspend/resume when system was booted with "nomce"
  2008-12-15 18:55     ` Andreas Herrmann
@ 2008-12-15 22:23       ` Andi Kleen
  0 siblings, 0 replies; 11+ messages in thread
From: Andi Kleen @ 2008-12-15 22:23 UTC (permalink / raw)
  To: Andreas Herrmann
  Cc: Andi Kleen, Ingo Molnar, Thomas Gleixner, H. Peter Anvin, linux-kernel

> > http://git.kernel.org/?p=linux/kernel/git/mingo/linux-2.6-x86.git;a=commit;h=81c7af997dafbabad464ccefba89af2b247899da
> 
> Thanks for that pointer.
> I didn't look at tip/x86/mce up to know. Is this .28 material?

I would think so, but I'm not sure the x86 maintainers think the same.

> My fix is intended for .28.
> 
> I didn't care about the lurking sysfs interface when booted with
> "nomce".  It's there and you could enable MCE with changing attributes
> but then you are actively fiddling with MCE.
> 
> The fix is for the more severe problem that mce_init() is called on
> resume and thus will enable MCE (setup cr4 and corresponding MSRs)
> which shouldn't happen when booted with "nomce".

The patch does the same (although the description doesn't say that). 
When there is no sysfs interface there won't
be any resume either because resume goes through the sysdev devices. 
So it's a cleaner and more complete fix imho.

-Andi

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] x86: re-enable MCE on secondary CPUS after suspend/resume
  2008-12-15 19:05     ` Andreas Herrmann
@ 2008-12-15 22:33       ` Andi Kleen
  2008-12-15 22:41         ` Andreas Herrmann
  0 siblings, 1 reply; 11+ messages in thread
From: Andi Kleen @ 2008-12-15 22:33 UTC (permalink / raw)
  To: Andreas Herrmann
  Cc: Andi Kleen, Ingo Molnar, Thomas Gleixner, H. Peter Anvin, linux-kernel

> void __cpuinit mcheck_init(struct cpuinfo_x86 *c)
> {
>         static cpumask_t mce_cpus = CPU_MASK_NONE;
> 
>         mce_cpu_quirks(c);
> 
>         if (mce_dont_init ||
>             cpu_test_and_set(smp_processor_id(), mce_cpus) ||
>             !mce_available(c))
>   =>             return;
> 
>         mce_init(NULL);
>         mce_cpu_features(c);
> }
> 
> But we need to call mce_init to clear all MCE state.
> IMHO the best location to call mce_init for APs is the cpu notifier.

Ah got it. Thanks that makes sense. 

But I think the better fix is to just drop the mce_cpus check and
then handly it naturally in the standard bootup path. I'm not
sure what it was good for anyways. I copied it into the 64bit code
from 32bit, but I suppose even there it isn't really needed and on 
32bit it is already gone even.

How about this patch. Does it fix the problem for you too?

-Andi

--


Don't prevent multiple initialization of MCEs.

Back from early prehistory mcheck_init() has a reentry check. Presumably
that was needed in very old kernels to prevent it entering twice.

But as Andreas points out this prevents CPU hotplug (and therefore resume)
to correctly reinitialize MCEs when a AP boots again after being
offlined. 

Just drop the check.

Based on a report from Andreas Herrmann

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 arch/x86/kernel/cpu/mcheck/mce_64.c |    3 ---
 1 file changed, 3 deletions(-)

Index: linux/arch/x86/kernel/cpu/mcheck/mce_64.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/mce_64.c	2008-12-15 23:13:02.000000000 +0100
+++ linux/arch/x86/kernel/cpu/mcheck/mce_64.c	2008-12-15 23:13:44.000000000 +0100
@@ -510,12 +510,9 @@
  */
 void __cpuinit mcheck_init(struct cpuinfo_x86 *c)
 {
-	static cpumask_t mce_cpus = CPU_MASK_NONE;
-
 	mce_cpu_quirks(c);
 
 	if (mce_dont_init ||
-	    cpu_test_and_set(smp_processor_id(), mce_cpus) ||
 	    !mce_available(c))
 		return;
 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] x86: re-enable MCE on secondary CPUS after suspend/resume
  2008-12-15 22:33       ` Andi Kleen
@ 2008-12-15 22:41         ` Andreas Herrmann
  2008-12-16 22:03           ` Ingo Molnar
  0 siblings, 1 reply; 11+ messages in thread
From: Andreas Herrmann @ 2008-12-15 22:41 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, linux-kernel

On Mon, Dec 15, 2008 at 11:33:10PM +0100, Andi Kleen wrote:
> > void __cpuinit mcheck_init(struct cpuinfo_x86 *c)
> > {
> >         static cpumask_t mce_cpus = CPU_MASK_NONE;
> > 
> >         mce_cpu_quirks(c);
> > 
> >         if (mce_dont_init ||
> >             cpu_test_and_set(smp_processor_id(), mce_cpus) ||
> >             !mce_available(c))
> >   =>             return;
> > 
> >         mce_init(NULL);
> >         mce_cpu_features(c);
> > }
> > 
> > But we need to call mce_init to clear all MCE state.
> > IMHO the best location to call mce_init for APs is the cpu notifier.
> 
> Ah got it. Thanks that makes sense. 
> 
> But I think the better fix is to just drop the mce_cpus check and
> then handly it naturally in the standard bootup path. I'm not
> sure what it was good for anyways. I copied it into the 64bit code
> from 32bit, but I suppose even there it isn't really needed and on 
> 32bit it is already gone even.
> 
> How about this patch. Does it fix the problem for you too?

Yup, works as well:

Tested-by: Andreas Herrmann <andreas.herrmann3@amd.com>


Thanks,

Andreas

> -Andi
> 
> --
> 
> 
> Don't prevent multiple initialization of MCEs.
> 
> Back from early prehistory mcheck_init() has a reentry check. Presumably
> that was needed in very old kernels to prevent it entering twice.
> 
> But as Andreas points out this prevents CPU hotplug (and therefore resume)
> to correctly reinitialize MCEs when a AP boots again after being
> offlined. 
> 
> Just drop the check.
> 
> Based on a report from Andreas Herrmann
> 
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> 
> ---
>  arch/x86/kernel/cpu/mcheck/mce_64.c |    3 ---
>  1 file changed, 3 deletions(-)
> 
> Index: linux/arch/x86/kernel/cpu/mcheck/mce_64.c
> ===================================================================
> --- linux.orig/arch/x86/kernel/cpu/mcheck/mce_64.c	2008-12-15 23:13:02.000000000 +0100
> +++ linux/arch/x86/kernel/cpu/mcheck/mce_64.c	2008-12-15 23:13:44.000000000 +0100
> @@ -510,12 +510,9 @@
>   */
>  void __cpuinit mcheck_init(struct cpuinfo_x86 *c)
>  {
> -	static cpumask_t mce_cpus = CPU_MASK_NONE;
> -
>  	mce_cpu_quirks(c);
>  
>  	if (mce_dont_init ||
> -	    cpu_test_and_set(smp_processor_id(), mce_cpus) ||
>  	    !mce_available(c))
>  		return;
>  
> 



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] x86: re-enable MCE on secondary CPUS after suspend/resume
  2008-12-15 22:41         ` Andreas Herrmann
@ 2008-12-16 22:03           ` Ingo Molnar
  0 siblings, 0 replies; 11+ messages in thread
From: Ingo Molnar @ 2008-12-16 22:03 UTC (permalink / raw)
  To: Andreas Herrmann
  Cc: Andi Kleen, Thomas Gleixner, H. Peter Anvin, linux-kernel


* Andreas Herrmann <andreas.herrmann3@amd.com> wrote:

> On Mon, Dec 15, 2008 at 11:33:10PM +0100, Andi Kleen wrote:
> > > void __cpuinit mcheck_init(struct cpuinfo_x86 *c)
> > > {
> > >         static cpumask_t mce_cpus = CPU_MASK_NONE;
> > > 
> > >         mce_cpu_quirks(c);
> > > 
> > >         if (mce_dont_init ||
> > >             cpu_test_and_set(smp_processor_id(), mce_cpus) ||
> > >             !mce_available(c))
> > >   =>             return;
> > > 
> > >         mce_init(NULL);
> > >         mce_cpu_features(c);
> > > }
> > > 
> > > But we need to call mce_init to clear all MCE state.
> > > IMHO the best location to call mce_init for APs is the cpu notifier.
> > 
> > Ah got it. Thanks that makes sense. 
> > 
> > But I think the better fix is to just drop the mce_cpus check and
> > then handly it naturally in the standard bootup path. I'm not
> > sure what it was good for anyways. I copied it into the 64bit code
> > from 32bit, but I suppose even there it isn't really needed and on 
> > 32bit it is already gone even.
> > 
> > How about this patch. Does it fix the problem for you too?
> 
> Yup, works as well:
> 
> Tested-by: Andreas Herrmann <andreas.herrmann3@amd.com>

applied to tip/x86/urgent, thanks guys.

	Ingo

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2008-12-16 22:03 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-12-12 18:06 [PATCH 0/2] x86: mcheck: suspend/resume bug fixes Andreas Herrmann
2008-12-12 18:08 ` [PATCH 1/2] x86: don't enable MCE after suspend/resume when system was booted with "nomce" Andreas Herrmann
2008-12-12 19:08   ` Andi Kleen
2008-12-15 18:55     ` Andreas Herrmann
2008-12-15 22:23       ` Andi Kleen
2008-12-12 18:10 ` [PATCH 2/2] x86: re-enable MCE on secondary CPUS after suspend/resume Andreas Herrmann
2008-12-12 19:06   ` Andi Kleen
2008-12-15 19:05     ` Andreas Herrmann
2008-12-15 22:33       ` Andi Kleen
2008-12-15 22:41         ` Andreas Herrmann
2008-12-16 22:03           ` Ingo Molnar

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.