cpu offline causes backtrace from cmci

All of lore.kernel.org
 help / color / mirror / Atom feed

* cpu offline causes backtrace from cmci_rediscover
@ 2013-03-19 22:44 Dave Jones
  2013-03-20  3:16 ` Chen Gong
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Jones @ 2013-03-19 22:44 UTC (permalink / raw)
  To: Linux Kernel; +Cc: x86

offlining a CPU in 3.9-rc3 gets me this trace..

numa_remove_cpu cpu 1 node 0: mask now 0,2-3
smpboot: CPU 1 is now offline
BUG: using smp_processor_id() in preemptible [00000000] code: cpu-offline.sh/10591
caller is cmci_rediscover+0x6a/0xe0
Pid: 10591, comm: cpu-offline.sh Not tainted 3.9.0-rc3+ #2
Call Trace:
 [<ffffffff81333bbd>] debug_smp_processor_id+0xdd/0x100
 [<ffffffff8101edba>] cmci_rediscover+0x6a/0xe0
 [<ffffffff815f5b9f>] mce_cpu_callback+0x19d/0x1ae
 [<ffffffff8160ea66>] notifier_call_chain+0x66/0x150
 [<ffffffff8107ad7e>] __raw_notifier_call_chain+0xe/0x10
 [<ffffffff8104c2e3>] cpu_notify+0x23/0x50
 [<ffffffff8104c31e>] cpu_notify_nofail+0xe/0x20
 [<ffffffff815ef082>] _cpu_down+0x302/0x350
 [<ffffffff815ef106>] cpu_down+0x36/0x50
 [<ffffffff815f1c9d>] store_online+0x8d/0xd0
 [<ffffffff813edc48>] dev_attr_store+0x18/0x30
 [<ffffffff81226eeb>] sysfs_write_file+0xdb/0x150
 [<ffffffff811adfb2>] vfs_write+0xa2/0x170
 [<ffffffff811ae16c>] sys_write+0x4c/0xa0
 [<ffffffff81613019>] system_call_fastpath+0x16/0x1b


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: cpu offline causes backtrace from cmci_rediscover
  2013-03-19 22:44 cpu offline causes backtrace from cmci_rediscover Dave Jones
@ 2013-03-20  3:16 ` Chen Gong
  2013-03-20  5:20   ` Mike Galbraith
  2013-03-20 10:01   ` [PATCH] x86/mce: Rework cmci_rediscover() to play well with CPU hotplug Srivatsa S. Bhat
  0 siblings, 2 replies; 9+ messages in thread
From: Chen Gong @ 2013-03-20  3:16 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel, x86

[-- Attachment #1: Type: text/plain, Size: 2168 bytes --]

On Tue, Mar 19, 2013 at 06:44:08PM -0400, Dave Jones wrote:
> Date:	Tue, 19 Mar 2013 18:44:08 -0400
> From: Dave Jones <davej@redhat.com>
> To: Linux Kernel <linux-kernel@vger.kernel.org>
> Cc: x86@kernel.org
> Subject: cpu offline causes backtrace from cmci_rediscover
> User-Agent: Mutt/1.5.21 (2010-09-15)
> 
> offlining a CPU in 3.9-rc3 gets me this trace..
> 
> numa_remove_cpu cpu 1 node 0: mask now 0,2-3
> smpboot: CPU 1 is now offline
> BUG: using smp_processor_id() in preemptible [00000000] code: cpu-offline.sh/10591
> caller is cmci_rediscover+0x6a/0xe0
> Pid: 10591, comm: cpu-offline.sh Not tainted 3.9.0-rc3+ #2
> Call Trace:
>  [<ffffffff81333bbd>] debug_smp_processor_id+0xdd/0x100
>  [<ffffffff8101edba>] cmci_rediscover+0x6a/0xe0
>  [<ffffffff815f5b9f>] mce_cpu_callback+0x19d/0x1ae
>  [<ffffffff8160ea66>] notifier_call_chain+0x66/0x150
>  [<ffffffff8107ad7e>] __raw_notifier_call_chain+0xe/0x10
>  [<ffffffff8104c2e3>] cpu_notify+0x23/0x50
>  [<ffffffff8104c31e>] cpu_notify_nofail+0xe/0x20
>  [<ffffffff815ef082>] _cpu_down+0x302/0x350
>  [<ffffffff815ef106>] cpu_down+0x36/0x50
>  [<ffffffff815f1c9d>] store_online+0x8d/0xd0
>  [<ffffffff813edc48>] dev_attr_store+0x18/0x30
>  [<ffffffff81226eeb>] sysfs_write_file+0xdb/0x150
>  [<ffffffff811adfb2>] vfs_write+0xa2/0x170
>  [<ffffffff811ae16c>] sys_write+0x4c/0xa0
>  [<ffffffff81613019>] system_call_fastpath+0x16/0x1b
> 
Try this patch:

diff a/arch/x86/kernel/cpu/mcheck/mce_intel.c b/arch/x86/kernel/cpu/mcheck/mce_intel.c
index 402c454..692c91e 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_intel.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_intel.c
@@ -311,10 +311,12 @@ void cmci_rediscover(int dying)
                if (cpu == dying)
                        continue;

-               if (cpu == smp_processor_id()) {
+               if (cpu == get_cpu()) {
+                       put_cpu();
                        cmci_rediscover_work_func(NULL);
                        continue;
-               }
+               } else
+                       put_cpu();

                work_on_cpu(cpu, cmci_rediscover_work_func, NULL);
        }


[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: cpu offline causes backtrace from cmci_rediscover
  2013-03-20  3:16 ` Chen Gong
@ 2013-03-20  5:20   ` Mike Galbraith
  2013-03-20 10:01   ` [PATCH] x86/mce: Rework cmci_rediscover() to play well with CPU hotplug Srivatsa S. Bhat
  1 sibling, 0 replies; 9+ messages in thread
From: Mike Galbraith @ 2013-03-20  5:20 UTC (permalink / raw)
  To: Chen Gong; +Cc: Dave Jones, Linux Kernel, x86

On Tue, 2013-03-19 at 23:16 -0400, Chen Gong wrote: 
> On Tue, Mar 19, 2013 at 06:44:08PM -0400, Dave Jones wrote:
> > Date:	Tue, 19 Mar 2013 18:44:08 -0400
> > From: Dave Jones <davej@redhat.com>
> > To: Linux Kernel <linux-kernel@vger.kernel.org>
> > Cc: x86@kernel.org
> > Subject: cpu offline causes backtrace from cmci_rediscover
> > User-Agent: Mutt/1.5.21 (2010-09-15)
> > 
> > offlining a CPU in 3.9-rc3 gets me this trace..
> > 
> > numa_remove_cpu cpu 1 node 0: mask now 0,2-3
> > smpboot: CPU 1 is now offline
> > BUG: using smp_processor_id() in preemptible [00000000] code: cpu-offline.sh/10591
> > caller is cmci_rediscover+0x6a/0xe0
> > Pid: 10591, comm: cpu-offline.sh Not tainted 3.9.0-rc3+ #2
> > Call Trace:
> >  [<ffffffff81333bbd>] debug_smp_processor_id+0xdd/0x100
> >  [<ffffffff8101edba>] cmci_rediscover+0x6a/0xe0
> >  [<ffffffff815f5b9f>] mce_cpu_callback+0x19d/0x1ae
> >  [<ffffffff8160ea66>] notifier_call_chain+0x66/0x150
> >  [<ffffffff8107ad7e>] __raw_notifier_call_chain+0xe/0x10
> >  [<ffffffff8104c2e3>] cpu_notify+0x23/0x50
> >  [<ffffffff8104c31e>] cpu_notify_nofail+0xe/0x20
> >  [<ffffffff815ef082>] _cpu_down+0x302/0x350
> >  [<ffffffff815ef106>] cpu_down+0x36/0x50
> >  [<ffffffff815f1c9d>] store_online+0x8d/0xd0
> >  [<ffffffff813edc48>] dev_attr_store+0x18/0x30
> >  [<ffffffff81226eeb>] sysfs_write_file+0xdb/0x150
> >  [<ffffffff811adfb2>] vfs_write+0xa2/0x170
> >  [<ffffffff811ae16c>] sys_write+0x4c/0xa0
> >  [<ffffffff81613019>] system_call_fastpath+0x16/0x1b
> > 
> Try this patch:
> 
> diff a/arch/x86/kernel/cpu/mcheck/mce_intel.c b/arch/x86/kernel/cpu/mcheck/mce_intel.c
> index 402c454..692c91e 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce_intel.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce_intel.c
> @@ -311,10 +311,12 @@ void cmci_rediscover(int dying)
>                 if (cpu == dying)
>                         continue;
> 
> -               if (cpu == smp_processor_id()) {
> +               if (cpu == get_cpu()) {
> +                       put_cpu();
>                         cmci_rediscover_work_func(NULL);
>                         continue;
> -               }
> +               } else
> +                       put_cpu();
> 
>                 work_on_cpu(cpu, cmci_rediscover_work_func, NULL);
>         }
> 

raw_smp_processor_id()?

-Mike


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH] x86/mce: Rework cmci_rediscover() to play well with CPU hotplug
  2013-03-20  3:16 ` Chen Gong
  2013-03-20  5:20   ` Mike Galbraith
@ 2013-03-20 10:01   ` Srivatsa S. Bhat
  2013-03-20 10:46     ` Borislav Petkov
                       ` (2 more replies)
  1 sibling, 3 replies; 9+ messages in thread
From: Srivatsa S. Bhat @ 2013-03-20 10:01 UTC (permalink / raw)
  To: Chen Gong
  Cc: Dave Jones, Linux Kernel, x86, Thomas Gleixner, andi,
	Ingo Molnar, Paul E. McKenney

On 03/20/2013 08:46 AM, Chen Gong wrote:
> On Tue, Mar 19, 2013 at 06:44:08PM -0400, Dave Jones wrote:
>>
>> offlining a CPU in 3.9-rc3 gets me this trace..
>>
>> numa_remove_cpu cpu 1 node 0: mask now 0,2-3
>> smpboot: CPU 1 is now offline
>> BUG: using smp_processor_id() in preemptible [00000000] code: cpu-offline.sh/10591
>> caller is cmci_rediscover+0x6a/0xe0
>> Pid: 10591, comm: cpu-offline.sh Not tainted 3.9.0-rc3+ #2
>> Call Trace:
>>  [<ffffffff81333bbd>] debug_smp_processor_id+0xdd/0x100
>>  [<ffffffff8101edba>] cmci_rediscover+0x6a/0xe0
>>  [<ffffffff815f5b9f>] mce_cpu_callback+0x19d/0x1ae
>>  [<ffffffff8160ea66>] notifier_call_chain+0x66/0x150
>>  [<ffffffff8107ad7e>] __raw_notifier_call_chain+0xe/0x10
>>  [<ffffffff8104c2e3>] cpu_notify+0x23/0x50
>>  [<ffffffff8104c31e>] cpu_notify_nofail+0xe/0x20
>>  [<ffffffff815ef082>] _cpu_down+0x302/0x350
>>  [<ffffffff815ef106>] cpu_down+0x36/0x50
>>  [<ffffffff815f1c9d>] store_online+0x8d/0xd0
>>  [<ffffffff813edc48>] dev_attr_store+0x18/0x30
>>  [<ffffffff81226eeb>] sysfs_write_file+0xdb/0x150
>>  [<ffffffff811adfb2>] vfs_write+0xa2/0x170
>>  [<ffffffff811ae16c>] sys_write+0x4c/0xa0
>>  [<ffffffff81613019>] system_call_fastpath+0x16/0x1b
>>
> Try this patch:
> 
> diff a/arch/x86/kernel/cpu/mcheck/mce_intel.c b/arch/x86/kernel/cpu/mcheck/mce_intel.c
> index 402c454..692c91e 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce_intel.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce_intel.c
> @@ -311,10 +311,12 @@ void cmci_rediscover(int dying)
>                 if (cpu == dying)
>                         continue;
> 
> -               if (cpu == smp_processor_id()) {
> +               if (cpu == get_cpu()) {
> +                       put_cpu();
>                         cmci_rediscover_work_func(NULL);
>                         continue;
> -               }
> +               } else
> +                       put_cpu();
> 
>                 work_on_cpu(cpu, cmci_rediscover_work_func, NULL);
>         }
> 

That doesn't really look right to me. In fact, the function cmci_rediscover()
looks like it needs some attention. Let me quote the function here, so
that its easier to discuss what's wrong with it..


/*
 * After a CPU went down cycle through all the others and rediscover
 * Must run in process context.
 */
void cmci_rediscover(int dying)
{
        int cpu, banks;

        if (!cmci_supported(&banks))
                return;

        for_each_online_cpu(cpu) {
                if (cpu == dying)
                        continue;

                if (cpu == smp_processor_id()) {
                        cmci_rediscover_work_func(NULL);
                        continue;
                }

                work_on_cpu(cpu, cmci_rediscover_work_func, NULL);
	}
}

First of all, I think the comment that says that it must run in process
context, is stale. I think its a remnant of the code which used to do
GFP_KERNEL allocations for a temporary cpumask (looking at git logs).
The function cmci_discover() takes a spin lock with irqs disabled. So
obviously this whole thing can run in atomic context.

And cmci_rediscover() is called from CPU_POST_DEAD handler. So the CPU
which was supposed to go offline would have already gone offline and
out of the cpu_online_mask. So there is no point checking
'if (cpu == dying)' in that for-loop.

So, how about something like this:

------------------------------------------------------------>

From: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Subject: [PATCH] x86/mce: Rework cmci_rediscover() to play well with CPU hotplug

Dave Jones reports that offlining a CPU leads to this trace:

numa_remove_cpu cpu 1 node 0: mask now 0,2-3
smpboot: CPU 1 is now offline
BUG: using smp_processor_id() in preemptible [00000000] code:
cpu-offline.sh/10591
caller is cmci_rediscover+0x6a/0xe0
Pid: 10591, comm: cpu-offline.sh Not tainted 3.9.0-rc3+ #2
Call Trace:
 [<ffffffff81333bbd>] debug_smp_processor_id+0xdd/0x100
 [<ffffffff8101edba>] cmci_rediscover+0x6a/0xe0
 [<ffffffff815f5b9f>] mce_cpu_callback+0x19d/0x1ae
 [<ffffffff8160ea66>] notifier_call_chain+0x66/0x150
 [<ffffffff8107ad7e>] __raw_notifier_call_chain+0xe/0x10
 [<ffffffff8104c2e3>] cpu_notify+0x23/0x50
 [<ffffffff8104c31e>] cpu_notify_nofail+0xe/0x20
 [<ffffffff815ef082>] _cpu_down+0x302/0x350
 [<ffffffff815ef106>] cpu_down+0x36/0x50
 [<ffffffff815f1c9d>] store_online+0x8d/0xd0
 [<ffffffff813edc48>] dev_attr_store+0x18/0x30
 [<ffffffff81226eeb>] sysfs_write_file+0xdb/0x150
 [<ffffffff811adfb2>] vfs_write+0xa2/0x170
 [<ffffffff811ae16c>] sys_write+0x4c/0xa0
 [<ffffffff81613019>] system_call_fastpath+0x16/0x1b


However, a look at cmci_rediscover shows that it can be simplified quite
a bit, apart from solving the above issue. It invokes functions that
take spin locks with interrupts disabled, and hence it can run in atomic
context. Also, it is run in the CPU_POST_DEAD phase, so the dying CPU
is already dead and out of the cpu_online_mask. So take these points into
account and simplify the code, and thereby also fix the above issue.

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 arch/x86/include/asm/mce.h             |    4 ++--
 arch/x86/kernel/cpu/mcheck/mce.c       |    2 +-
 arch/x86/kernel/cpu/mcheck/mce_intel.c |   25 +++++--------------------
 3 files changed, 8 insertions(+), 23 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index f4076af..fa5f71e 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -146,13 +146,13 @@ DECLARE_PER_CPU(struct device *, mce_device);
 void mce_intel_feature_init(struct cpuinfo_x86 *c);
 void cmci_clear(void);
 void cmci_reenable(void);
-void cmci_rediscover(int dying);
+void cmci_rediscover(void);
 void cmci_recheck(void);
 #else
 static inline void mce_intel_feature_init(struct cpuinfo_x86 *c) { }
 static inline void cmci_clear(void) {}
 static inline void cmci_reenable(void) {}
-static inline void cmci_rediscover(int dying) {}
+static inline void cmci_rediscover(void) {}
 static inline void cmci_recheck(void) {}
 #endif
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 7bc1263..9239504 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -2358,7 +2358,7 @@ mce_cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu)
 
 	if (action == CPU_POST_DEAD) {
 		/* intentionally ignoring frozen here */
-		cmci_rediscover(cpu);
+		cmci_rediscover();
 	}
 
 	return NOTIFY_OK;
diff --git a/arch/x86/kernel/cpu/mcheck/mce_intel.c b/arch/x86/kernel/cpu/mcheck/mce_intel.c
index 402c454..ae1697c 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_intel.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_intel.c
@@ -285,39 +285,24 @@ void cmci_clear(void)
 	raw_spin_unlock_irqrestore(&cmci_discover_lock, flags);
 }
 
-static long cmci_rediscover_work_func(void *arg)
+static void cmci_rediscover_work_func(void *arg)
 {
 	int banks;
 
 	/* Recheck banks in case CPUs don't all have the same */
 	if (cmci_supported(&banks))
 		cmci_discover(banks);
-
-	return 0;
 }
 
-/*
- * After a CPU went down cycle through all the others and rediscover
- * Must run in process context.
- */
-void cmci_rediscover(int dying)
+/* After a CPU went down cycle through all the others and rediscover */
+void cmci_rediscover(void)
 {
-	int cpu, banks;
+	int banks;
 
 	if (!cmci_supported(&banks))
 		return;
 
-	for_each_online_cpu(cpu) {
-		if (cpu == dying)
-			continue;
-
-		if (cpu == smp_processor_id()) {
-			cmci_rediscover_work_func(NULL);
-			continue;
-		}
-
-		work_on_cpu(cpu, cmci_rediscover_work_func, NULL);
-	}
+	on_each_cpu(cmci_rediscover_work_func, NULL, 1);
 }
 
 /*


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] x86/mce: Rework cmci_rediscover() to play well with CPU hotplug
  2013-03-20 10:01   ` [PATCH] x86/mce: Rework cmci_rediscover() to play well with CPU hotplug Srivatsa S. Bhat
@ 2013-03-20 10:46     ` Borislav Petkov
  2013-03-20 23:28     ` Tony Luck
  2013-04-02 14:08     ` Srivatsa S. Bhat
  2 siblings, 0 replies; 9+ messages in thread
From: Borislav Petkov @ 2013-03-20 10:46 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Chen Gong, Dave Jones, Linux Kernel, x86, Thomas Gleixner, andi,
	Ingo Molnar, Paul E. McKenney, Tony Luck

+ Tony.

On Wed, Mar 20, 2013 at 03:31:29PM +0530, Srivatsa S. Bhat wrote:
> On 03/20/2013 08:46 AM, Chen Gong wrote:
> > On Tue, Mar 19, 2013 at 06:44:08PM -0400, Dave Jones wrote:
> >>
> >> offlining a CPU in 3.9-rc3 gets me this trace..
> >>
> >> numa_remove_cpu cpu 1 node 0: mask now 0,2-3
> >> smpboot: CPU 1 is now offline
> >> BUG: using smp_processor_id() in preemptible [00000000] code: cpu-offline.sh/10591
> >> caller is cmci_rediscover+0x6a/0xe0
> >> Pid: 10591, comm: cpu-offline.sh Not tainted 3.9.0-rc3+ #2
> >> Call Trace:
> >>  [<ffffffff81333bbd>] debug_smp_processor_id+0xdd/0x100
> >>  [<ffffffff8101edba>] cmci_rediscover+0x6a/0xe0
> >>  [<ffffffff815f5b9f>] mce_cpu_callback+0x19d/0x1ae
> >>  [<ffffffff8160ea66>] notifier_call_chain+0x66/0x150
> >>  [<ffffffff8107ad7e>] __raw_notifier_call_chain+0xe/0x10
> >>  [<ffffffff8104c2e3>] cpu_notify+0x23/0x50
> >>  [<ffffffff8104c31e>] cpu_notify_nofail+0xe/0x20
> >>  [<ffffffff815ef082>] _cpu_down+0x302/0x350
> >>  [<ffffffff815ef106>] cpu_down+0x36/0x50
> >>  [<ffffffff815f1c9d>] store_online+0x8d/0xd0
> >>  [<ffffffff813edc48>] dev_attr_store+0x18/0x30
> >>  [<ffffffff81226eeb>] sysfs_write_file+0xdb/0x150
> >>  [<ffffffff811adfb2>] vfs_write+0xa2/0x170
> >>  [<ffffffff811ae16c>] sys_write+0x4c/0xa0
> >>  [<ffffffff81613019>] system_call_fastpath+0x16/0x1b
> >>
> > Try this patch:
> > 
> > diff a/arch/x86/kernel/cpu/mcheck/mce_intel.c b/arch/x86/kernel/cpu/mcheck/mce_intel.c
> > index 402c454..692c91e 100644
> > --- a/arch/x86/kernel/cpu/mcheck/mce_intel.c
> > +++ b/arch/x86/kernel/cpu/mcheck/mce_intel.c
> > @@ -311,10 +311,12 @@ void cmci_rediscover(int dying)
> >                 if (cpu == dying)
> >                         continue;
> > 
> > -               if (cpu == smp_processor_id()) {
> > +               if (cpu == get_cpu()) {
> > +                       put_cpu();
> >                         cmci_rediscover_work_func(NULL);
> >                         continue;
> > -               }
> > +               } else
> > +                       put_cpu();
> > 
> >                 work_on_cpu(cpu, cmci_rediscover_work_func, NULL);
> >         }
> > 
> 
> That doesn't really look right to me. In fact, the function cmci_rediscover()
> looks like it needs some attention. Let me quote the function here, so
> that its easier to discuss what's wrong with it..
> 
> 
> /*
>  * After a CPU went down cycle through all the others and rediscover
>  * Must run in process context.
>  */
> void cmci_rediscover(int dying)
> {
>         int cpu, banks;
> 
>         if (!cmci_supported(&banks))
>                 return;
> 
>         for_each_online_cpu(cpu) {
>                 if (cpu == dying)
>                         continue;
> 
>                 if (cpu == smp_processor_id()) {
>                         cmci_rediscover_work_func(NULL);
>                         continue;
>                 }
> 
>                 work_on_cpu(cpu, cmci_rediscover_work_func, NULL);
> 	}
> }
> 
> First of all, I think the comment that says that it must run in process
> context, is stale. I think its a remnant of the code which used to do
> GFP_KERNEL allocations for a temporary cpumask (looking at git logs).
> The function cmci_discover() takes a spin lock with irqs disabled. So
> obviously this whole thing can run in atomic context.
> 
> And cmci_rediscover() is called from CPU_POST_DEAD handler. So the CPU
> which was supposed to go offline would have already gone offline and
> out of the cpu_online_mask. So there is no point checking
> 'if (cpu == dying)' in that for-loop.
> 
> So, how about something like this:
> 
> ------------------------------------------------------------>
> 
> From: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
> Subject: [PATCH] x86/mce: Rework cmci_rediscover() to play well with CPU hotplug
> 
> Dave Jones reports that offlining a CPU leads to this trace:
> 
> numa_remove_cpu cpu 1 node 0: mask now 0,2-3
> smpboot: CPU 1 is now offline
> BUG: using smp_processor_id() in preemptible [00000000] code:
> cpu-offline.sh/10591
> caller is cmci_rediscover+0x6a/0xe0
> Pid: 10591, comm: cpu-offline.sh Not tainted 3.9.0-rc3+ #2
> Call Trace:
>  [<ffffffff81333bbd>] debug_smp_processor_id+0xdd/0x100
>  [<ffffffff8101edba>] cmci_rediscover+0x6a/0xe0
>  [<ffffffff815f5b9f>] mce_cpu_callback+0x19d/0x1ae
>  [<ffffffff8160ea66>] notifier_call_chain+0x66/0x150
>  [<ffffffff8107ad7e>] __raw_notifier_call_chain+0xe/0x10
>  [<ffffffff8104c2e3>] cpu_notify+0x23/0x50
>  [<ffffffff8104c31e>] cpu_notify_nofail+0xe/0x20
>  [<ffffffff815ef082>] _cpu_down+0x302/0x350
>  [<ffffffff815ef106>] cpu_down+0x36/0x50
>  [<ffffffff815f1c9d>] store_online+0x8d/0xd0
>  [<ffffffff813edc48>] dev_attr_store+0x18/0x30
>  [<ffffffff81226eeb>] sysfs_write_file+0xdb/0x150
>  [<ffffffff811adfb2>] vfs_write+0xa2/0x170
>  [<ffffffff811ae16c>] sys_write+0x4c/0xa0
>  [<ffffffff81613019>] system_call_fastpath+0x16/0x1b
> 
> 
> However, a look at cmci_rediscover shows that it can be simplified quite
> a bit, apart from solving the above issue. It invokes functions that
> take spin locks with interrupts disabled, and hence it can run in atomic
> context. Also, it is run in the CPU_POST_DEAD phase, so the dying CPU
> is already dead and out of the cpu_online_mask. So take these points into
> account and simplify the code, and thereby also fix the above issue.
> 
> Reported-by: Dave Jones <davej@redhat.com>
> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
> ---
> 
>  arch/x86/include/asm/mce.h             |    4 ++--
>  arch/x86/kernel/cpu/mcheck/mce.c       |    2 +-
>  arch/x86/kernel/cpu/mcheck/mce_intel.c |   25 +++++--------------------
>  3 files changed, 8 insertions(+), 23 deletions(-)
> 
> diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
> index f4076af..fa5f71e 100644
> --- a/arch/x86/include/asm/mce.h
> +++ b/arch/x86/include/asm/mce.h
> @@ -146,13 +146,13 @@ DECLARE_PER_CPU(struct device *, mce_device);
>  void mce_intel_feature_init(struct cpuinfo_x86 *c);
>  void cmci_clear(void);
>  void cmci_reenable(void);
> -void cmci_rediscover(int dying);
> +void cmci_rediscover(void);
>  void cmci_recheck(void);
>  #else
>  static inline void mce_intel_feature_init(struct cpuinfo_x86 *c) { }
>  static inline void cmci_clear(void) {}
>  static inline void cmci_reenable(void) {}
> -static inline void cmci_rediscover(int dying) {}
> +static inline void cmci_rediscover(void) {}
>  static inline void cmci_recheck(void) {}
>  #endif
>  
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index 7bc1263..9239504 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -2358,7 +2358,7 @@ mce_cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu)
>  
>  	if (action == CPU_POST_DEAD) {
>  		/* intentionally ignoring frozen here */
> -		cmci_rediscover(cpu);
> +		cmci_rediscover();
>  	}
>  
>  	return NOTIFY_OK;
> diff --git a/arch/x86/kernel/cpu/mcheck/mce_intel.c b/arch/x86/kernel/cpu/mcheck/mce_intel.c
> index 402c454..ae1697c 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce_intel.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce_intel.c
> @@ -285,39 +285,24 @@ void cmci_clear(void)
>  	raw_spin_unlock_irqrestore(&cmci_discover_lock, flags);
>  }
>  
> -static long cmci_rediscover_work_func(void *arg)
> +static void cmci_rediscover_work_func(void *arg)
>  {
>  	int banks;
>  
>  	/* Recheck banks in case CPUs don't all have the same */
>  	if (cmci_supported(&banks))
>  		cmci_discover(banks);
> -
> -	return 0;
>  }
>  
> -/*
> - * After a CPU went down cycle through all the others and rediscover
> - * Must run in process context.
> - */
> -void cmci_rediscover(int dying)
> +/* After a CPU went down cycle through all the others and rediscover */
> +void cmci_rediscover(void)
>  {
> -	int cpu, banks;
> +	int banks;
>  
>  	if (!cmci_supported(&banks))
>  		return;
>  
> -	for_each_online_cpu(cpu) {
> -		if (cpu == dying)
> -			continue;
> -
> -		if (cpu == smp_processor_id()) {
> -			cmci_rediscover_work_func(NULL);
> -			continue;
> -		}
> -
> -		work_on_cpu(cpu, cmci_rediscover_work_func, NULL);
> -	}
> +	on_each_cpu(cmci_rediscover_work_func, NULL, 1);
>  }
>  
>  /*
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] x86/mce: Rework cmci_rediscover() to play well with CPU hotplug
  2013-03-20 10:01   ` [PATCH] x86/mce: Rework cmci_rediscover() to play well with CPU hotplug Srivatsa S. Bhat
  2013-03-20 10:46     ` Borislav Petkov
@ 2013-03-20 23:28     ` Tony Luck
  2013-04-02 14:08     ` Srivatsa S. Bhat
  2 siblings, 0 replies; 9+ messages in thread
From: Tony Luck @ 2013-03-20 23:28 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Chen Gong, Dave Jones, Linux Kernel, x86, Thomas Gleixner, andi,
	Ingo Molnar, Paul E. McKenney

Looks lots cleaner.  Applied on top of 3.9.rc3 and took it for a spin
offlining and onlining cpus at random intervals.  First time round I
saw a few splats like the one below.  But after a reboot I can no
longer reproduce.

-Tony

INFO: task devkit-power-da:19861 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
devkit-power-da D ffff8810114cd048     0 19861      1 0x00000080
 ffff88101063b828 0000000000000082 ffff8810114cca80 ffff88101063bfd8
 ffff88101063bfd8 ffff88101063bfd8 ffff88081fa18040 ffff8810114cca80
 ffff88101063b808 ffff88080c0cf800 0000000000000000 ffff88101063b8b0
Call Trace:
 [<ffffffff815dcd39>] schedule+0x29/0x70
 [<ffffffff81478865>] usb_kill_urb+0x85/0xc0
 [<ffffffff810771f0>] ? wake_up_bit+0x40/0x40
 [<ffffffff81479b08>] usb_start_wait_urb+0xd8/0x160
 [<ffffffff81479dec>] usb_control_msg+0xcc/0x110
 [<ffffffff81479ef3>] ? usb_get_status+0x43/0xc0
 [<ffffffff81479f33>] usb_get_status+0x83/0xc0
 [<ffffffff8146e707>] ? usb_set_device_state+0x127/0x160
 [<ffffffff81471fa6>] usb_port_resume+0x2c6/0x630
 [<ffffffff810135d1>] ? __switch_to+0x181/0x4a0
 [<ffffffff81485935>] generic_resume+0x15/0x30
 [<ffffffff8147c405>] usb_resume_both+0x105/0x150
 [<ffffffff8147d8aa>] usb_runtime_resume+0x1a/0x20
 [<ffffffff813b5bd1>] __rpm_callback+0x31/0x90
 [<ffffffff813b5c5f>] rpm_callback+0x2f/0x90
 [<ffffffff813b6d5c>] rpm_resume+0x40c/0x670
 [<ffffffff810771f0>] ? wake_up_bit+0x40/0x40
 [<ffffffff813b725c>] __pm_runtime_resume+0x5c/0x90
 [<ffffffff8147cee9>] usb_autoresume_device+0x29/0x60
 [<ffffffff81482b50>] usbdev_open+0x110/0x210
 [<ffffffff8116dc4c>] chrdev_open+0x9c/0x180
 [<ffffffff811677df>] do_dentry_open+0x20f/0x2c0
 [<ffffffff8116dbb0>] ? cdev_put+0x30/0x30
 [<ffffffff811678c5>] finish_open+0x35/0x50
 [<ffffffff81177e5e>] do_last+0x6de/0xde0
 [<ffffffff811749a8>] ? inode_permission+0x18/0x50
 [<ffffffff81174a58>] ? link_path_walk+0x78/0x880
 [<ffffffff81178617>] path_openat+0xb7/0x4a0
 [<ffffffff81178cc1>] do_filp_open+0x41/0xa0
 [<ffffffff81185382>] ? __alloc_fd+0x42/0x110
 [<ffffffff81168c24>] do_sys_open+0xf4/0x1e0
 [<ffffffff81014959>] ? do_notify_resume+0x59/0x80
 [<ffffffff81168d31>] sys_open+0x21/0x30
 [<ffffffff815e6159>] system_call_fastpath+0x16/0x1b

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] x86/mce: Rework cmci_rediscover() to play well with CPU hotplug
  2013-03-20 10:01   ` [PATCH] x86/mce: Rework cmci_rediscover() to play well with CPU hotplug Srivatsa S. Bhat
  2013-03-20 10:46     ` Borislav Petkov
  2013-03-20 23:28     ` Tony Luck
@ 2013-04-02 14:08     ` Srivatsa S. Bhat
  2013-04-02 14:17       ` Borislav Petkov
  2 siblings, 1 reply; 9+ messages in thread
From: Srivatsa S. Bhat @ 2013-04-02 14:08 UTC (permalink / raw)
  To: Ingo Molnar, Borislav Petkov
  Cc: Chen Gong, Dave Jones, Linux Kernel, x86, Thomas Gleixner, andi,
	Paul E. McKenney, tony.luck

On 03/20/2013 03:31 PM, Srivatsa S. Bhat wrote:
[...]
> ------------------------------------------------------------>
> 
> From: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
> Subject: [PATCH] x86/mce: Rework cmci_rediscover() to play well with CPU hotplug
> 
> Dave Jones reports that offlining a CPU leads to this trace:
> 
> numa_remove_cpu cpu 1 node 0: mask now 0,2-3
> smpboot: CPU 1 is now offline
> BUG: using smp_processor_id() in preemptible [00000000] code:
> cpu-offline.sh/10591
> caller is cmci_rediscover+0x6a/0xe0
> Pid: 10591, comm: cpu-offline.sh Not tainted 3.9.0-rc3+ #2
> Call Trace:
>  [<ffffffff81333bbd>] debug_smp_processor_id+0xdd/0x100
>  [<ffffffff8101edba>] cmci_rediscover+0x6a/0xe0
>  [<ffffffff815f5b9f>] mce_cpu_callback+0x19d/0x1ae
>  [<ffffffff8160ea66>] notifier_call_chain+0x66/0x150
>  [<ffffffff8107ad7e>] __raw_notifier_call_chain+0xe/0x10
>  [<ffffffff8104c2e3>] cpu_notify+0x23/0x50
>  [<ffffffff8104c31e>] cpu_notify_nofail+0xe/0x20
>  [<ffffffff815ef082>] _cpu_down+0x302/0x350
>  [<ffffffff815ef106>] cpu_down+0x36/0x50
>  [<ffffffff815f1c9d>] store_online+0x8d/0xd0
>  [<ffffffff813edc48>] dev_attr_store+0x18/0x30
>  [<ffffffff81226eeb>] sysfs_write_file+0xdb/0x150
>  [<ffffffff811adfb2>] vfs_write+0xa2/0x170
>  [<ffffffff811ae16c>] sys_write+0x4c/0xa0
>  [<ffffffff81613019>] system_call_fastpath+0x16/0x1b
> 
> 
> However, a look at cmci_rediscover shows that it can be simplified quite
> a bit, apart from solving the above issue. It invokes functions that
> take spin locks with interrupts disabled, and hence it can run in atomic
> context. Also, it is run in the CPU_POST_DEAD phase, so the dying CPU
> is already dead and out of the cpu_online_mask. So take these points into
> account and simplify the code, and thereby also fix the above issue.
> 
> Reported-by: Dave Jones <davej@redhat.com>
> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
> ---
> 
>  arch/x86/include/asm/mce.h             |    4 ++--
>  arch/x86/kernel/cpu/mcheck/mce.c       |    2 +-
>  arch/x86/kernel/cpu/mcheck/mce_intel.c |   25 +++++--------------------
>  3 files changed, 8 insertions(+), 23 deletions(-)
> 
> diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
> index f4076af..fa5f71e 100644
> --- a/arch/x86/include/asm/mce.h
> +++ b/arch/x86/include/asm/mce.h
> @@ -146,13 +146,13 @@ DECLARE_PER_CPU(struct device *, mce_device);
>  void mce_intel_feature_init(struct cpuinfo_x86 *c);
>  void cmci_clear(void);
>  void cmci_reenable(void);
> -void cmci_rediscover(int dying);
> +void cmci_rediscover(void);
>  void cmci_recheck(void);
>  #else
>  static inline void mce_intel_feature_init(struct cpuinfo_x86 *c) { }
>  static inline void cmci_clear(void) {}
>  static inline void cmci_reenable(void) {}
> -static inline void cmci_rediscover(int dying) {}
> +static inline void cmci_rediscover(void) {}
>  static inline void cmci_recheck(void) {}
>  #endif
>  
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index 7bc1263..9239504 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -2358,7 +2358,7 @@ mce_cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu)
>  
>  	if (action == CPU_POST_DEAD) {
>  		/* intentionally ignoring frozen here */
> -		cmci_rediscover(cpu);
> +		cmci_rediscover();
>  	}
>  
>  	return NOTIFY_OK;
> diff --git a/arch/x86/kernel/cpu/mcheck/mce_intel.c b/arch/x86/kernel/cpu/mcheck/mce_intel.c
> index 402c454..ae1697c 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce_intel.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce_intel.c
> @@ -285,39 +285,24 @@ void cmci_clear(void)
>  	raw_spin_unlock_irqrestore(&cmci_discover_lock, flags);
>  }
>  
> -static long cmci_rediscover_work_func(void *arg)
> +static void cmci_rediscover_work_func(void *arg)
>  {
>  	int banks;
>  
>  	/* Recheck banks in case CPUs don't all have the same */
>  	if (cmci_supported(&banks))
>  		cmci_discover(banks);
> -
> -	return 0;
>  }
>  
> -/*
> - * After a CPU went down cycle through all the others and rediscover
> - * Must run in process context.
> - */
> -void cmci_rediscover(int dying)
> +/* After a CPU went down cycle through all the others and rediscover */
> +void cmci_rediscover(void)
>  {
> -	int cpu, banks;
> +	int banks;
>  
>  	if (!cmci_supported(&banks))
>  		return;
>  
> -	for_each_online_cpu(cpu) {
> -		if (cpu == dying)
> -			continue;
> -
> -		if (cpu == smp_processor_id()) {
> -			cmci_rediscover_work_func(NULL);
> -			continue;
> -		}
> -
> -		work_on_cpu(cpu, cmci_rediscover_work_func, NULL);
> -	}
> +	on_each_cpu(cmci_rediscover_work_func, NULL, 1);
>  }
>  
>  /*
> 

Hi Boris,

Tony mentioned that this patch worked fine for him. So could you kindly
pick up this patch?

Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] x86/mce: Rework cmci_rediscover() to play well with CPU hotplug
  2013-04-02 14:08     ` Srivatsa S. Bhat
@ 2013-04-02 14:17       ` Borislav Petkov
  2013-04-02 20:58         ` Luck, Tony
  0 siblings, 1 reply; 9+ messages in thread
From: Borislav Petkov @ 2013-04-02 14:17 UTC (permalink / raw)
  To: Srivatsa S. Bhat, Tony Luck
  Cc: Ingo Molnar, Chen Gong, Dave Jones, Linux Kernel, x86,
	Thomas Gleixner, andi, Paul E. McKenney, tony.luck

On Tue, Apr 02, 2013 at 07:38:11PM +0530, Srivatsa S. Bhat wrote:
> Tony mentioned that this patch worked fine for him. So could you
> kindly pick up this patch?

Normally, Tony picks up the Intel side of MCE. Tony, want me to do it?

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [PATCH] x86/mce: Rework cmci_rediscover() to play well with CPU hotplug
  2013-04-02 14:17       ` Borislav Petkov
@ 2013-04-02 20:58         ` Luck, Tony
  0 siblings, 0 replies; 9+ messages in thread
From: Luck, Tony @ 2013-04-02 20:58 UTC (permalink / raw)
  To: Borislav Petkov, Srivatsa S. Bhat
  Cc: Ingo Molnar, Chen Gong, Dave Jones, Linux Kernel, x86,
	Thomas Gleixner, andi, Paul E. McKenney, tony.luck

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 366 bytes --]

>> Tony mentioned that this patch worked fine for him. So could you
>> kindly pick up this patch?
>
> Normally, Tony picks up the Intel side of MCE. Tony, want me to do it?

I'll pick it up. Thanks.

-Tony
ÿôèº{.nÇ+‰·Ÿ®‰†+%ŠËÿ±éÝ¶\x17¥Šwÿº{.nÇ+‰·¥Š{±þG«éÿŠ{ayº\x1dÊ‡Ú™ë,j\a¢f£¢·hšïêÿ‘êçz_è®\x03(éšŽŠÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?™¨èÚ&£ø§~á¶iO•æ¬z·švØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?–I¥

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2013-04-02 20:58 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-03-19 22:44 cpu offline causes backtrace from cmci_rediscover Dave Jones
2013-03-20  3:16 ` Chen Gong
2013-03-20  5:20   ` Mike Galbraith
2013-03-20 10:01   ` [PATCH] x86/mce: Rework cmci_rediscover() to play well with CPU hotplug Srivatsa S. Bhat
2013-03-20 10:46     ` Borislav Petkov
2013-03-20 23:28     ` Tony Luck
2013-04-02 14:08     ` Srivatsa S. Bhat
2013-04-02 14:17       ` Borislav Petkov
2013-04-02 20:58         ` Luck, Tony

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.