linux-kernel.vger.kernel.org archive mirror
* x86/mce: machine check warning during poweroff
@ 2012-01-11  0:00 Djalal Harouni
  2012-01-12 14:22 ` Ming Lei
  0 siblings, 1 reply; 49+ messages in thread
From: Djalal Harouni @ 2012-01-11  0:00 UTC (permalink / raw)
  To: Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar, Andi Kleen
  Cc: linux-kernel

Today's pull from Linus' tree shows a warning during poweroff; the
message is related to machinecheck.
drivers/base/core.c:device_release() did not find a registered
release() function.

This kernel is used for development and it's running under KVM/Qemu, so
if you need further information or tests let me know.

Qemu is simulating 2 CPUs.

Thanks.


[ 1879.944193] ------------[ cut here ]------------
[ 1879.950488] WARNING: at drivers/base/core.c:194 device_release+0x82/0x90()
[ 1879.959424] Hardware name: Bochs
[ 1879.964714] Device 'machinecheck1' does not have a release() function, it is broken and must be fixed.
[ 1879.977354] Modules linked in:
[ 1879.979704] Pid: 1738, comm: halt Not tainted 3.2.0-minimal-kvm-05692-g1c81065-dirty #41
[ 1879.989093] Call Trace:
[ 1879.992729]  [<ffffffff8103952a>] warn_slowpath_common+0x7a/0xb0
[ 1879.999308]  [<ffffffff81039601>] warn_slowpath_fmt+0x41/0x50
[ 1880.005463]  [<ffffffff8172b022>] device_release+0x82/0x90
[ 1880.012915]  [<ffffffff81601667>] kobject_release+0x47/0x90
[ 1880.019107]  [<ffffffff8160152c>] kobject_put+0x2c/0x60
[ 1880.024269]  [<ffffffff8172acc2>] put_device+0x12/0x20
[ 1880.031254]  [<ffffffff8172ba19>] device_unregister+0x19/0x20
[ 1880.038594]  [<ffffffff81afb49d>] mce_cpu_callback+0xea/0x18b
[ 1880.043389]  [<ffffffff81b08924>] notifier_call_chain+0x64/0xf0
[ 1880.051928]  [<ffffffff81066c89>] __raw_notifier_call_chain+0x9/0x10
[ 1880.059077]  [<ffffffff8103b50b>] __cpu_notify+0x1b/0x30
[ 1880.063894]  [<ffffffff8103b530>] cpu_notify_nofail+0x10/0x20
[ 1880.071952]  [<ffffffff81ae27dd>] _cpu_down+0x11d/0x2c0
[ 1880.078534]  [<ffffffff81b01235>] ? printk+0x3c/0x3e
[ 1880.082662]  [<ffffffff8103b7cb>] disable_nonboot_cpus+0x8b/0x110
[ 1880.091129]  [<ffffffff81053f21>] kernel_power_off+0x21/0x50
[ 1880.098420]  [<ffffffff81054220>] sys_reboot+0x110/0x220
[ 1880.104098]  [<ffffffff8108efdd>] ? trace_hardirqs_on+0xd/0x10
[ 1880.112006]  [<ffffffff81b04deb>] ? _raw_spin_unlock_irq+0x2b/0x50
[ 1880.119181]  [<ffffffff8106dc0d>] ? finish_task_switch+0x8d/0x1a0
[ 1880.126741]  [<ffffffff8106dbce>] ? finish_task_switch+0x4e/0x1a0
[ 1880.134793]  [<ffffffff81b02f0b>] ? __schedule+0x3db/0x890
[ 1880.140510]  [<ffffffff81b0cfc7>] ? sysret_check+0x1b/0x56
[ 1880.148101]  [<ffffffff8160d33e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 1880.156706]  [<ffffffff81b0cfa2>] system_call_fastpath+0x16/0x1b
[ 1880.162885] ---[ end trace d8faf9d3af9f23e8 ]---
[ 1880.171148] Power down.


$ cat /proc/cpuinfo 
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 2
model name      : QEMU Virtual CPU version 0.12.5
stepping        : 3
microcode       : 0x1
cpu MHz         : 2800.047
cache size      : 4096 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 4
wp              : yes
flags           : fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pse36 clflush mmx fxsr sse sse2 syscall nx lm rep_good nopl pni cx16
tsc_deadline_timer hypervisor lahf_lm
bogomips        : 5600.09
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:


-- 
tixxdz
http://opendz.org

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: x86/mce: machine check warning during poweroff
  2012-01-11  0:00 x86/mce: machine check warning during poweroff Djalal Harouni
@ 2012-01-12 14:22 ` Ming Lei
  2012-01-13 20:22   ` Srivatsa S. Bhat
  0 siblings, 1 reply; 49+ messages in thread
From: Ming Lei @ 2012-01-12 14:22 UTC (permalink / raw)
  To: Djalal Harouni
  Cc: Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar,
	Andi Kleen, linux-kernel

Hi,

I saw the warning too during S2R.

thanks,
--
Ming Lei


On Wed, Jan 11, 2012 at 8:00 AM, Djalal Harouni <tixxdz@opendz.org> wrote:
> Today's pull from Linus' tree shows a warning during poweroff, the
> message is related to the machinecheck.
> drivers/base/core.c:device_release() did not find a registered
> release() function.
>
> This kernel is used for development and it's running under KVM/Qemu, so
> if you need further information or tests let me know.
>
> Qemu is simulating 2 CPUs.
>
> Thanks.
>
>
> [ 1879.944193] ------------[ cut here ]------------
> [ 1879.950488] WARNING: at drivers/base/core.c:194 device_release+0x82/0x90()
> [ 1879.959424] Hardware name: Bochs
> [ 1879.964714] Device 'machinecheck1' does not have a release() function, it is broken and must be fixed.
> [ 1879.977354] Modules linked in:
> [ 1879.979704] Pid: 1738, comm: halt Not tainted 3.2.0-minimal-kvm-05692-g1c81065-dirty #41
> [ 1879.989093] Call Trace:
> [ 1879.992729]  [<ffffffff8103952a>] warn_slowpath_common+0x7a/0xb0
> [ 1879.999308]  [<ffffffff81039601>] warn_slowpath_fmt+0x41/0x50
> [ 1880.005463]  [<ffffffff8172b022>] device_release+0x82/0x90
> [ 1880.012915]  [<ffffffff81601667>] kobject_release+0x47/0x90
> [ 1880.019107]  [<ffffffff8160152c>] kobject_put+0x2c/0x60
> [ 1880.024269]  [<ffffffff8172acc2>] put_device+0x12/0x20
> [ 1880.031254]  [<ffffffff8172ba19>] device_unregister+0x19/0x20
> [ 1880.038594]  [<ffffffff81afb49d>] mce_cpu_callback+0xea/0x18b
> [ 1880.043389]  [<ffffffff81b08924>] notifier_call_chain+0x64/0xf0
> [ 1880.051928]  [<ffffffff81066c89>] __raw_notifier_call_chain+0x9/0x10
> [ 1880.059077]  [<ffffffff8103b50b>] __cpu_notify+0x1b/0x30
> [ 1880.063894]  [<ffffffff8103b530>] cpu_notify_nofail+0x10/0x20
> [ 1880.071952]  [<ffffffff81ae27dd>] _cpu_down+0x11d/0x2c0
> [ 1880.078534]  [<ffffffff81b01235>] ? printk+0x3c/0x3e
> [ 1880.082662]  [<ffffffff8103b7cb>] disable_nonboot_cpus+0x8b/0x110
> [ 1880.091129]  [<ffffffff81053f21>] kernel_power_off+0x21/0x50
> [ 1880.098420]  [<ffffffff81054220>] sys_reboot+0x110/0x220
> [ 1880.104098]  [<ffffffff8108efdd>] ? trace_hardirqs_on+0xd/0x10
> [ 1880.112006]  [<ffffffff81b04deb>] ? _raw_spin_unlock_irq+0x2b/0x50
> [ 1880.119181]  [<ffffffff8106dc0d>] ? finish_task_switch+0x8d/0x1a0
> [ 1880.126741]  [<ffffffff8106dbce>] ? finish_task_switch+0x4e/0x1a0
> [ 1880.134793]  [<ffffffff81b02f0b>] ? __schedule+0x3db/0x890
> [ 1880.140510]  [<ffffffff81b0cfc7>] ? sysret_check+0x1b/0x56
> [ 1880.148101]  [<ffffffff8160d33e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [ 1880.156706]  [<ffffffff81b0cfa2>] system_call_fastpath+0x16/0x1b
> [ 1880.162885] ---[ end trace d8faf9d3af9f23e8 ]---
> [ 1880.171148] Power down.
>
>
> $ cat /proc/cpuinfo
> processor       : 0
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 2
> model name      : QEMU Virtual CPU version 0.12.5
> stepping        : 3
> microcode       : 0x1
> cpu MHz         : 2800.047
> cache size      : 4096 KB
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 4
> wp              : yes
> flags           : fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca
> cmov pse36 clflush mmx fxsr sse sse2 syscall nx lm rep_good nopl pni cx16
> tsc_deadline_timer hypervisor lahf_lm
> bogomips        : 5600.09
> clflush size    : 64
> cache_alignment : 64
> address sizes   : 40 bits physical, 48 bits virtual
> power management:
>
>
> --
> tixxdz
> http://opendz.org





* Re: x86/mce: machine check warning during poweroff
  2012-01-12 14:22 ` Ming Lei
@ 2012-01-13 20:22   ` Srivatsa S. Bhat
  2012-01-13 20:34     ` Justin P. Mattock
                       ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: Srivatsa S. Bhat @ 2012-01-13 20:22 UTC (permalink / raw)
  To: Ming Lei
  Cc: Djalal Harouni, Borislav Petkov, Tony Luck, Hidetoshi Seto,
	Ingo Molnar, Andi Kleen, linux-kernel, Greg Kroah-Hartman,
	Linus Torvalds, Kay Sievers, gouders, Marcos Souza,
	Linux PM mailing list, Rafael J. Wysocki, tglx, prasad,
	justinmattock, Jeff Chua

On 01/12/2012 07:52 PM, Ming Lei wrote:

> Hi,
> 
> I saw the warning too during S2R.
>
> On Wed, Jan 11, 2012 at 8:00 AM, Djalal Harouni <tixxdz@opendz.org> wrote:
>> Today's pull from Linus' tree shows a warning during poweroff, the
>> message is related to the machinecheck.
>> drivers/base/core.c:device_release() did not find a registered
>> release() function.
>>
>> This kernel is used for development and it's running under KVM/Qemu, so
>> if you need further information or tests let me know.
>>
>> Qemu is simulating 2 CPUs.
>>
>> Thanks.
>>
>>
>> [ 1879.944193] ------------[ cut here ]------------
>> [ 1879.950488] WARNING: at drivers/base/core.c:194 device_release+0x82/0x90()
>> [ 1879.959424] Hardware name: Bochs
>> [ 1879.964714] Device 'machinecheck1' does not have a release() function, it is broken and must be fixed.
>> [ 1879.977354] Modules linked in:
>> [ 1879.979704] Pid: 1738, comm: halt Not tainted 3.2.0-minimal-kvm-05692-g1c81065-dirty #41
>> [ 1879.989093] Call Trace:
>> [ 1879.992729]  [<ffffffff8103952a>] warn_slowpath_common+0x7a/0xb0
>> [ 1879.999308]  [<ffffffff81039601>] warn_slowpath_fmt+0x41/0x50
>> [ 1880.005463]  [<ffffffff8172b022>] device_release+0x82/0x90
>> [ 1880.012915]  [<ffffffff81601667>] kobject_release+0x47/0x90
>> [ 1880.019107]  [<ffffffff8160152c>] kobject_put+0x2c/0x60
>> [ 1880.024269]  [<ffffffff8172acc2>] put_device+0x12/0x20
>> [ 1880.031254]  [<ffffffff8172ba19>] device_unregister+0x19/0x20
>> [ 1880.038594]  [<ffffffff81afb49d>] mce_cpu_callback+0xea/0x18b
>> [ 1880.043389]  [<ffffffff81b08924>] notifier_call_chain+0x64/0xf0
>> [ 1880.051928]  [<ffffffff81066c89>] __raw_notifier_call_chain+0x9/0x10
>> [ 1880.059077]  [<ffffffff8103b50b>] __cpu_notify+0x1b/0x30
>> [ 1880.063894]  [<ffffffff8103b530>] cpu_notify_nofail+0x10/0x20
>> [ 1880.071952]  [<ffffffff81ae27dd>] _cpu_down+0x11d/0x2c0
>> [ 1880.078534]  [<ffffffff81b01235>] ? printk+0x3c/0x3e
>> [ 1880.082662]  [<ffffffff8103b7cb>] disable_nonboot_cpus+0x8b/0x110
>> [ 1880.091129]  [<ffffffff81053f21>] kernel_power_off+0x21/0x50
>> [ 1880.098420]  [<ffffffff81054220>] sys_reboot+0x110/0x220
>> [ 1880.104098]  [<ffffffff8108efdd>] ? trace_hardirqs_on+0xd/0x10
>> [ 1880.112006]  [<ffffffff81b04deb>] ? _raw_spin_unlock_irq+0x2b/0x50
>> [ 1880.119181]  [<ffffffff8106dc0d>] ? finish_task_switch+0x8d/0x1a0
>> [ 1880.126741]  [<ffffffff8106dbce>] ? finish_task_switch+0x4e/0x1a0
>> [ 1880.134793]  [<ffffffff81b02f0b>] ? __schedule+0x3db/0x890
>> [ 1880.140510]  [<ffffffff81b0cfc7>] ? sysret_check+0x1b/0x56
>> [ 1880.148101]  [<ffffffff8160d33e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
>> [ 1880.156706]  [<ffffffff81b0cfa2>] system_call_fastpath+0x16/0x1b
>> [ 1880.162885] ---[ end trace d8faf9d3af9f23e8 ]---
>> [ 1880.171148] Power down.
>>


Fundamentally, this warning is triggered during CPU Offline, which is done
during poweroff, suspend, hibernate etc. IOW, even a simple
# echo 0 > /sys/devices/system/cpu/cpuX/online will trigger it.
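
For anyone who wants to drive the same trigger from a test program rather
than the shell, a minimal sketch in C follows. The helper names
cpu_online_path() and cpu_set_online() are made up for illustration; only
the sysfs path is taken from the command above. It assumes root privileges,
and the boot CPU often has no 'online' file at all.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build the sysfs 'online' path for one CPU.
 * Returns 0 on success, -1 if the buffer is too small. */
static int cpu_online_path(unsigned int cpu, char *buf, size_t len)
{
	int n;

	assert(buf != NULL);
	n = snprintf(buf, len, "/sys/devices/system/cpu/cpu%u/online", cpu);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Hypothetical helper: write "0" or "1" to that file, i.e. the same
 * thing the echo command does. Returns 0 on success, -1 on failure. */
static int cpu_set_online(unsigned int cpu, int online)
{
	char path[64];
	FILE *f;
	int rc;

	if (cpu_online_path(cpu, path, sizeof(path)) != 0)
		return -1;

	f = fopen(path, "w");
	if (!f) {
		fprintf(stderr, "open %s: %s\n", path, strerror(errno));
		return -1;
	}

	rc = (fputs(online ? "1" : "0", f) == EOF) ? -1 : 0;
	if (fclose(f) != 0)	/* the write may only fail at flush time */
		rc = -1;
	return rc;
}
```

Calling cpu_set_online(1, 0) followed by cpu_set_online(1, 1) would take
CPU 1 through one offline/online cycle, which on an affected kernel should
be enough to hit the warning.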

Some discussion about this warning and a probable fix is going on in this
thread: https://lkml.org/lkml/2012/1/13/278

[And there have been reports of Suspend/Hibernate not working in recent
kernels (3.3 merge window)]

However, note that technically this warning (machinecheck1 not having a
release() function) is not all that new. People probably just didn't
notice it earlier (reason explained below).

Prior to the 3.3 merge window (when everything was fine, particularly
suspend/resume), upon a CPU offline, we used to get the following message:

Broke affinity for irq 49
Broke affinity for irq 87
CPU 1 is now offline
kobject:kobject: 'index0' (ffff8802764e5c00): does not have a release() function, it is broken and must be fixed.
kobject:kobject: 'index1' (ffff8802764e5c48): does not have a release() function, it is broken and must be fixed.
kobject:kobject: 'index2' (ffff8802764e5c90): does not have a release() function, it is broken and must be fixed.
kobject:kobject: 'index3' (ffff8802764e5cd8): does not have a release() function, it is broken and must be fixed.
kobject:kobject: 'cache' (ffff88027926c480): does not have a release() function, it is broken and must be fixed.
kobject:kobject: 'machinecheck1' (ffff88002822d8f0): does not have a release() function, it is broken and must be fixed.
                    ^^^^^^^^^
This is from the kobject_cleanup() function defined in lib/kobject.c. Since
pr_debug() is used for printing, the message was easy to miss.

After commit 8a25a2fd (cpu: convert 'cpu' and 'machinecheck' sysdev_class to
a regular subsystem), the callpaths changed and we now hit the rather strong
looking WARN() in drivers/base/core.c:device_release(), which is why it is
getting everyone's attention now.
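
To make the call path in the trace a bit more concrete, here is a small
userspace model in C. It is a sketch, not kernel code: struct model_device
and the model_*() functions are invented for illustration and only mimic
the shape of device_unregister() -> put_device() -> device_release(). The
real device_release() also consults the device type and class for a release
callback before warning; the model collapses that into a single pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Illustrative stand-in for struct device; not the kernel definition. */
struct model_device {
	const char *name;
	int refcount;
	void (*release)(struct model_device *dev);	/* may be NULL */
	int warned;	/* set when the missing-release warning fires */
	int released;	/* set by a release callback, if one exists */
};

/* Called when the last reference goes away. */
static void model_device_release(struct model_device *dev)
{
	if (dev->release) {
		dev->release(dev);
	} else {
		fprintf(stderr,
			"Device '%s' does not have a release() function, "
			"it is broken and must be fixed.\n", dev->name);
		dev->warned = 1;
	}
}

/* Drop one reference; run the release path when it reaches zero. */
static void model_put_device(struct model_device *dev)
{
	assert(dev->refcount > 0);
	if (--dev->refcount == 0)
		model_device_release(dev);
}

/* Unregistering drops the reference taken at registration time. */
static void model_device_unregister(struct model_device *dev)
{
	model_put_device(dev);
}

/* Example release callback that a fixed driver might provide. */
static void model_mce_release(struct model_device *dev)
{
	dev->released = 1;
}
```

A device set up with release = NULL and refcount = 1 prints the warning
when passed to model_device_unregister(); the same device with release =
model_mce_release does not.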

So, in the recent kernels (3.3 merge window), we get:

(Note the difference in the kobject line about machinecheck)

[46407.738415] kobject: 'cpufreq' (ffff88026f794098): calling ktype release
[46407.752649] CPU 1 is now offline
[46407.757002] kobject: 'index0' (ffff88026f0cac00): does not have a release() function, it is broken and must be fixed.
[46407.769302] kobject: 'index1' (ffff88026f0cac48): does not have a release() function, it is broken and must be fixed.
[46407.781412] kobject: 'index2' (ffff88026f0cac90): does not have a release() function, it is broken and must be fixed.
[46407.793480] kobject: 'index3' (ffff88026f0cacd8): does not have a release() function, it is broken and must be fixed.
[46407.805547] kobject: 'cache' (ffff880272e0d3c0): does not have a release() function, it is broken and must be fixed.
[46407.817906] kobject: 'machinecheck1' (ffff88027fc2cb70): calling ktype release
[46407.826182] ------------[ cut here ]------------
[46407.831514] WARNING: at drivers/base/core.c:194 device_release+0x82/0x90()
[46407.831515] Hardware name: IBM System X iDataPlex dx360 M4 Server -[7912AC1]-
[46407.831517] Device 'machinecheck1' does not have a release() function, it is broken and must be fixed.

IOW, the warning about machinecheck has just been moved from one place to
another.

My only point here is that we have essentially seen this warning before
when suspend/resume was working fine. And it has been reported that
suspend/resume works fine if CONFIG_X86_MCE is not set. So I guess something
else is wrong somewhere.. IOW, I feel whether or not machinecheck has a
release function doesn't really matter that much for suspend/resume to get
any better.

Regards,
Srivatsa S. Bhat
IBM Linux Technology Center



* Re: x86/mce: machine check warning during poweroff
  2012-01-13 20:22   ` Srivatsa S. Bhat
@ 2012-01-13 20:34     ` Justin P. Mattock
  2012-01-13 20:37     ` Linus Torvalds
  2012-01-13 23:02     ` Linus Torvalds
  2 siblings, 0 replies; 49+ messages in thread
From: Justin P. Mattock @ 2012-01-13 20:34 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck,
	Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-kernel,
	Greg Kroah-Hartman, Linus Torvalds, Kay Sievers, gouders,
	Marcos Souza, Linux PM mailing list, Rafael J. Wysocki, tglx,
	prasad, Jeff Chua

On 01/13/2012 12:22 PM, Srivatsa S. Bhat wrote:
> On 01/12/2012 07:52 PM, Ming Lei wrote:
>
>> Hi,
>>
>> I saw the warning too during S2R.
>>
>> On Wed, Jan 11, 2012 at 8:00 AM, Djalal Harouni <tixxdz@opendz.org> wrote:
>>> Today's pull from Linus' tree shows a warning during poweroff, the
>>> message is related to the machinecheck.
>>> drivers/base/core.c:device_release() did not find a registered
>>> release() function.
>>>
>>> This kernel is used for development and it's running under KVM/Qemu, so
>>> if you need further information or tests let me know.
>>>
>>> Qemu is simulating 2 CPUs.
>>>
>>> Thanks.
>>>
>>>
>>> [ 1879.944193] ------------[ cut here ]------------
>>> [ 1879.950488] WARNING: at drivers/base/core.c:194 device_release+0x82/0x90()
>>> [ 1879.959424] Hardware name: Bochs
>>> [ 1879.964714] Device 'machinecheck1' does not have a release() function, it is broken and must be fixed.
>>> [ 1879.977354] Modules linked in:
>>> [ 1879.979704] Pid: 1738, comm: halt Not tainted 3.2.0-minimal-kvm-05692-g1c81065-dirty #41
>>> [ 1879.989093] Call Trace:
>>> [ 1879.992729]  [<ffffffff8103952a>] warn_slowpath_common+0x7a/0xb0
>>> [ 1879.999308]  [<ffffffff81039601>] warn_slowpath_fmt+0x41/0x50
>>> [ 1880.005463]  [<ffffffff8172b022>] device_release+0x82/0x90
>>> [ 1880.012915]  [<ffffffff81601667>] kobject_release+0x47/0x90
>>> [ 1880.019107]  [<ffffffff8160152c>] kobject_put+0x2c/0x60
>>> [ 1880.024269]  [<ffffffff8172acc2>] put_device+0x12/0x20
>>> [ 1880.031254]  [<ffffffff8172ba19>] device_unregister+0x19/0x20
>>> [ 1880.038594]  [<ffffffff81afb49d>] mce_cpu_callback+0xea/0x18b
>>> [ 1880.043389]  [<ffffffff81b08924>] notifier_call_chain+0x64/0xf0
>>> [ 1880.051928]  [<ffffffff81066c89>] __raw_notifier_call_chain+0x9/0x10
>>> [ 1880.059077]  [<ffffffff8103b50b>] __cpu_notify+0x1b/0x30
>>> [ 1880.063894]  [<ffffffff8103b530>] cpu_notify_nofail+0x10/0x20
>>> [ 1880.071952]  [<ffffffff81ae27dd>] _cpu_down+0x11d/0x2c0
>>> [ 1880.078534]  [<ffffffff81b01235>] ? printk+0x3c/0x3e
>>> [ 1880.082662]  [<ffffffff8103b7cb>] disable_nonboot_cpus+0x8b/0x110
>>> [ 1880.091129]  [<ffffffff81053f21>] kernel_power_off+0x21/0x50
>>> [ 1880.098420]  [<ffffffff81054220>] sys_reboot+0x110/0x220
>>> [ 1880.104098]  [<ffffffff8108efdd>] ? trace_hardirqs_on+0xd/0x10
>>> [ 1880.112006]  [<ffffffff81b04deb>] ? _raw_spin_unlock_irq+0x2b/0x50
>>> [ 1880.119181]  [<ffffffff8106dc0d>] ? finish_task_switch+0x8d/0x1a0
>>> [ 1880.126741]  [<ffffffff8106dbce>] ? finish_task_switch+0x4e/0x1a0
>>> [ 1880.134793]  [<ffffffff81b02f0b>] ? __schedule+0x3db/0x890
>>> [ 1880.140510]  [<ffffffff81b0cfc7>] ? sysret_check+0x1b/0x56
>>> [ 1880.148101]  [<ffffffff8160d33e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
>>> [ 1880.156706]  [<ffffffff81b0cfa2>] system_call_fastpath+0x16/0x1b
>>> [ 1880.162885] ---[ end trace d8faf9d3af9f23e8 ]---
>>> [ 1880.171148] Power down.
>>>
>
>
> Fundamentally, this warning is triggered during CPU Offline, which is done
> during poweroff, suspend, hibernate etc. IOW, even a simple
> # echo 0 > /sys/devices/system/cpu/cpuX/online will trigger it.
>
> Some discussion about this warning and a probable fix is going on in this
> thread: https://lkml.org/lkml/2012/1/13/278
>
> [And there have been reports of Suspend/Hibernate not working in recent
> kernels (3.3 merge window)]
>
> However, it is to be noted that, technically this warning (machinecheck1
> not having a release() function) is not all that new. Just that people
> didn't probably notice it earlier (reason explained below).
>
> Prior to the 3.3 merge window (when everything was fine, particularly
> suspend/resume), upon a CPU offline, we used to get the following message:
>
> Broke affinity for irq 49
> Broke affinity for irq 87
> CPU 1 is now offline
> kobject:kobject: 'index0' (ffff8802764e5c00): does not have a release() function, it is broken and must be fixed.
> kobject:kobject: 'index1' (ffff8802764e5c48): does not have a release() function, it is broken and must be fixed.
> kobject:kobject: 'index2' (ffff8802764e5c90): does not have a release() function, it is broken and must be fixed.
> kobject:kobject: 'index3' (ffff8802764e5cd8): does not have a release() function, it is broken and must be fixed.
> kobject:kobject: 'cache' (ffff88027926c480): does not have a release() function, it is broken and must be fixed.
> kobject:kobject: 'machinecheck1' (ffff88002822d8f0): does not have a release() function, it is broken and must be fixed.
>                      ^^^^^^^^^
> This is from the kobject_cleanup() function defined in lib/kobject.c. Since
> pr_debug() was used for printing, it made this kind of obscure.
>
> After commit 8a25a2fd (cpu: convert 'cpu' and 'machinecheck' sysdev_class to
> a regular subsystem), the callpaths changed and we now hit the rather strong
> looking WARN() in drivers/base/core.c:device_release(), which is why it is
> getting everyone's attention now.
>
> So, in the recent kernels (3.3 merge window), we get:
>
> (Note the difference in the kobject line about machinecheck)
>
> [46407.738415] kobject: 'cpufreq' (ffff88026f794098): calling ktype release
> [46407.752649] CPU 1 is now offline
> [46407.757002] kobject: 'index0' (ffff88026f0cac00): does not have a release() function, it is broken and must be fixed.
> [46407.769302] kobject: 'index1' (ffff88026f0cac48): does not have a release() function, it is broken and must be fixed.
> [46407.781412] kobject: 'index2' (ffff88026f0cac90): does not have a release() function, it is broken and must be fixed.
> [46407.793480] kobject: 'index3' (ffff88026f0cacd8): does not have a release() function, it is broken and must be fixed.
> [46407.805547] kobject: 'cache' (ffff880272e0d3c0): does not have a release() function, it is broken and must be fixed.
> [46407.817906] kobject: 'machinecheck1' (ffff88027fc2cb70): calling ktype release
> [46407.826182] ------------[ cut here ]------------
> [46407.831514] WARNING: at drivers/base/core.c:194 device_release+0x82/0x90()
> [46407.831515] Hardware name: IBM System X iDataPlex dx360 M4 Server -[7912AC1]-
> [46407.831517] Device 'machinecheck1' does not have a release() function, it is broken and must be fixed.
>
> IOW, the warning about machinecheck has just been moved from one place to
> another.
>
> My only point here is that we have essentially seen this warning before
> when suspend/resume was working fine. And it has been reported that
> suspend/resume works fine if CONFIG_X86_MCE is not set. So I guess something
> else is wrong somewhere.. IOW, I feel whether or not machinecheck has a
> release function doesn't really matter that much for suspend/resume to get
> any better.
>
> Regards,
> Srivatsa S. Bhat
> IBM Linux Technology Center
>
>

Well, I don't care much about the message itself since it's only a warning
(it should be fixed, though); my concern is that the machine froze. Maybe I
hit something other than this warning. I can try suspending some more to
see if the freeze shows up again, then try to capture the syslog or an
image and post it.

Justin P. Mattock


* Re: x86/mce: machine check warning during poweroff
  2012-01-13 20:22   ` Srivatsa S. Bhat
  2012-01-13 20:34     ` Justin P. Mattock
@ 2012-01-13 20:37     ` Linus Torvalds
  2012-01-13 20:53       ` Srivatsa S. Bhat
  2012-01-14  0:44       ` Dirk Gouders
  2012-01-13 23:02     ` Linus Torvalds
  2 siblings, 2 replies; 49+ messages in thread
From: Linus Torvalds @ 2012-01-13 20:37 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck,
	Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-kernel,
	Greg Kroah-Hartman, Kay Sievers, gouders, Marcos Souza,
	Linux PM mailing list, Rafael J. Wysocki, tglx, prasad,
	justinmattock, Jeff Chua

On Fri, Jan 13, 2012 at 12:22 PM, Srivatsa S. Bhat
<srivatsa.bhat@linux.vnet.ibm.com> wrote:
> However, it is to be noted that, technically this warning (machinecheck1
> not having a release() function) is not all that new. Just that people
> didn't probably notice it earlier (reason explained below).

.. but there really is some semantic issue going on here. I'm not the
only one for whom suspend/resume simply does not work with MCE
enabled, and that "not work" coincides with the switch-over from
sysdev to struct device.

I suspect it is some subtle difference in sysdev_unregister() vs
device_unregister(). The message change is just a symptom of the fact
that we go through different paths.

For me MCE hangs only on the *second* suspend, but it sounds like
others see it on the first one.

                        Linus


* Re: x86/mce: machine check warning during poweroff
  2012-01-13 20:37     ` Linus Torvalds
@ 2012-01-13 20:53       ` Srivatsa S. Bhat
  2012-01-13 21:08         ` Linus Torvalds
  2012-01-14  0:44       ` Dirk Gouders
  1 sibling, 1 reply; 49+ messages in thread
From: Srivatsa S. Bhat @ 2012-01-13 20:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck,
	Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-kernel,
	Greg Kroah-Hartman, Kay Sievers, gouders, Marcos Souza,
	Linux PM mailing list, Rafael J. Wysocki, tglx, prasad,
	justinmattock, Jeff Chua

On 01/14/2012 02:07 AM, Linus Torvalds wrote:

> On Fri, Jan 13, 2012 at 12:22 PM, Srivatsa S. Bhat
> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>> However, it is to be noted that, technically this warning (machinecheck1
>> not having a release() function) is not all that new. Just that people
>> didn't probably notice it earlier (reason explained below).
> 
> .. but there really is some semantic issue going on here. I'm not the
> only one for whom suspend/resume simply does not work with MCE
> enabled, and that "not work" coincides with the switch-over from
> sysdev to struct device.
> 


True. Even Marcos Souza reported that suspend/resume doesn't work with
MCE enabled.

> I suspect it is some subtle difference in sysdev_unregister() vs
> device_unregister(). The message change is just a symptom of the fact
> that we go through different paths.
> 
> For me MCE hangs only on the *second* suspend, but it sounds like
> others see it on the first one.
> 


Wait a minute, did you mention "second attempt"? I think I have something
interesting..

I was trying out the pm_test framework
(Documentation/power/basic-pm-debugging.txt) at the processors level (in
which CPUs are offlined). First attempt succeeded. Second attempt was
reproducibly problematic. And this was with MCE enabled.

Here is the log (covers both first and second attempts at processors level):

[   71.427970] PM: Hibernation mode set to 'reboot'

[  109.588229] PM: Marking nosave pages: 000000000009c000 - 0000000000100000

[  109.595195] PM: Marking nosave pages: 000000007ce5e000 - 000000007cf22000

[  109.602154] PM: Marking nosave pages: 000000007d6ae000 - 000000007d75e000

[  109.609111] PM: Marking nosave pages: 000000007f68f000 - 000000007f7ff000

[  109.616075] PM: Marking nosave pages: 000000007f800000 - 0000000100000000

[  109.626743] PM: Basic memory bitmaps created

[  109.631179] PM: Syncing filesystems ... done.

[  109.662279] Freezing user space processes ... (elapsed 0.01 seconds) done.

[  109.687212] PM: Preallocating image memory... done (allocated 506656 pages)

[  112.056392] PM: Allocated 2026624 kbytes in 2.36 seconds (858.73 MB/s)

[  112.063046] Freezing remaining freezable tasks ... (elapsed 0.01 seconds) done.

[  118.796863] ioatdma 0000:00:16.6: setting latency timer to 64

[  118.796866] uhci_hcd 0000:00:1a.0: setting latency timer to 64

[  118.796870] ioatdma 0000:00:16.7: setting latency timer to 64

[  118.796891] ehci_hcd 0000:00:1a.7: setting latency timer to 64

[  118.796916] uhci_hcd 0000:00:1d.0: setting latency timer to 64

[  118.796932] usb usb3: root hub lost power or was reset

[  118.796952] usb usb4: root hub lost power or was reset

[  118.796954] uhci_hcd 0000:00:1d.1: setting latency timer to 64

[  118.796975] uhci_hcd 0000:00:1d.2: setting latency timer to 64

[  118.796988] usb usb5: root hub lost power or was reset

[  118.797011] ehci_hcd 0000:00:1d.7: setting latency timer to 64

[  118.797014] usb usb6: root hub lost power or was reset

[  118.797035] pci 0000:00:1e.0: setting latency timer to 64

[  118.797055] mptbase: ioc0: pci-resume: pdev=0xffff8808ca39b000, slot=0000:0b:00.0, Previous operating state [D0]

[  118.797069] mptsas 0000:0b:00.0: PCI INT A -> GSI 28 (level, low) -> IRQ 28

[  118.797086] usb usb1: root hub lost power or was reset

[  118.797132] usb usb2: root hub lost power or was reset

[  118.797155] mptbase: ioc0: pci-resume: ioc-state=0x1,doorbell=0x10000000

[  118.797487] pci 0000:06:00.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20

[  118.797498] pci 0000:06:00.0: setting latency timer to 

	SOL session is temporarily not available.

	Attempting to re-establish your session ...



	SOL connection is re-established. SOL session is now active.

[  135.087886] mptbase: ioc0: pci-resume: success

[  135.093612] Extended CMOS year: 2000

[  135.097667] PM: restore of devices complete after 16301.354 msecs

[  135.105915] PM: Image restored successfully.

[  135.110467] Restarting tasks ... done.

[  135.116548] PM: Basic memory bitmaps freed

[  169.935234] PM: Marking nosave pages: 000000000009c000 - 0000000000100000

[  169.942263] PM: Marking nosave pages: 000000007ce5e000 - 000000007cf22000

[  169.949252] PM: Marking nosave pages: 000000007d6ae000 - 000000007d75e000

[  169.956228] PM: Marking nosave pages: 000000007f68f000 - 000000007f7ff000

[  169.963215] PM: Marking nosave pages: 000000007f800000 - 0000000100000000

[  169.973855] PM: Basic memory bitmaps created

[  169.978314] PM: Syncing filesystems ... done.

[  170.045487] Freezing user space processes ... (elapsed 0.01 seconds) done.

[  170.067268] PM: Preallocating image memory... done (allocated 505915 pages)

[  172.421628] PM: Allocated 2023660 kbytes in 2.34 seconds (864.81 MB/s)

[  172.428334] Freezing remaining freezable tasks ... (elapsed 0.01 seconds) done.

[  172.453631] ------------[ cut here ]------------

[  172.456007] WARNING: at kernel/lockdep.c:956 __bfs+0x1f5/0x270()

[  172.456007] Hardware name: IBM System x -[7870C4Q]-

[  172.456007] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf microcode fuse loop dm_mod shpchp i7core_edac cdc_ether i2c_i801 ioatdma bnx2 usbnet iTCO_wdt edac_core pci_hotplug dca i2c_core tpm_tis pcspkr serio_raw tpm mii sg rtc_cmos button iTCO_vendor_support tpm_bios uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon

[  172.456007] Pid: 6166, comm: bash Tainted: G        W    3.2.0-33-default #3

[  172.456007] Call Trace:

[  172.456007]  [<ffffffff81095415>] ? __bfs+0x1f5/0x270

[  172.456007]  [<ffffffff8103ceca>] warn_slowpath_common+0x7a/0xb0

[  172.456007]  [<ffffffff81492bab>] ? _raw_spin_unlock_irq+0x2b/0x50

[  172.456007]  [<ffffffff8103cf15>] warn_slowpath_null+0x15/0x20

[  172.456007]  [<ffffffff81095415>] __bfs+0x1f5/0x270

[  172.456007]  [<ffffffff81095100>] ? noop_count+0x10/0x10

[  172.456007]  [<ffffffff8109892b>] check_usage_backwards+0x6b/0xf0

[  172.456007]  [<ffffffff81496d66>] ? notifier_call_chain+0x26/0x110

[  172.456007]  [<ffffffff810988c0>] ? check_usage_forwards+0x110/0x110

[  172.456007]  [<ffffffff8109930a>] mark_lock_irq+0x9a/0x200

[  172.456007]  [<ffffffff8109957c>] mark_lock+0x10c/0x1b0

[  172.456007]  [<ffffffff81099740>] mark_irqflags+0x120/0x190

[  172.456007]  [<ffffffff8109ae60>] __lock_acquire+0x460/0x520

[  172.456007]  [<ffffffff8109b041>] lock_acquire+0x121/0x190

[  172.456007]  [<ffffffff81473fc0>] ? klist_next+0x30/0x110

[  172.456007]  [<ffffffff81099830>] ? mark_held_locks+0x80/0x120

[  172.456007]  [<ffffffff81341700>] ? device_pm_wait_for_dev+0x40/0x40

[  172.456007]  [<ffffffff814922cc>] _raw_spin_lock+0x3c/0x50

[  172.456007]  [<ffffffff81473fc0>] ? klist_next+0x30/0x110

[  172.456007]  [<ffffffff81473fc0>] klist_next+0x30/0x110

[  172.456007]  [<ffffffff81341700>] ? device_pm_wait_for_dev+0x40/0x40

[  172.456007]  [<ffffffff81336d12>] device_for_each_child+0x52/0x70

[  172.456007]  [<ffffffff81342241>] __device_suspend+0x31/0x2e0

[  172.456007]  [<ffffffff81342550>] dpm_suspend+0x60/0x260

[  172.456007]  [<ffffffff810ad107>] hibernation_snapshot+0xd7/0x1b0

[  172.456007]  [<ffffffff810ad497>] hibernate+0x137/0x260

[  172.456007]  [<ffffffff810aaea4>] state_store+0x114/0x140

[  172.456007]  [<ffffffff8126c3a7>] kobj_attr_store+0x17/0x20

[  172.456007]  [<ffffffff811e9214>] sysfs_write_file+0xd4/0x150

[  172.456007]  [<ffffffff81173aeb>] vfs_write+0xcb/0x130

[  172.456007]  [<ffffffff81173c40>] sys_write+0x50/0x90

[  172.456007]  [<ffffffff8149bc39>] system_call_fastpath+0x16/0x1b

[  172.456007] ---[ end trace b286a5fb903f51e7 ]---

[  172.456007] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010

[  172.456007] IP: [<ffffffff81095309>] __bfs+0xe9/0x270

[  172.456007] PGD 1146b62067 PUD 1147c8b067 PMD 0 

[  172.456007] Oops: 0000 [#1] SMP 

[  172.456007] CPU 2 

[  172.456007] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf microcode fuse loop dm_mod shpchp i7core_edac cdc_ether i2c_i801 ioatdma bnx2 usbnet iTCO_wdt edac_core pci_hotplug dca i2c_core tpm_tis pcspkr serio_raw tpm mii sg rtc_cmos button iTCO_vendor_support tpm_bios uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon

[  172.456007] 

[  172.456007] Pid: 6166, comm: bash Tainted: G        W    3.2.0-33-default #3 IBM IBM System x -[7870C4Q]-/68Y8033     

[  172.456007] RIP: 0010:[<ffffffff81095309>]  [<ffffffff81095309>] __bfs+0xe9/0x270

[  172.456007] RSP: 0018:ffff8808c6607ab8  EFLAGS: 00010096

[  172.456007] RAX: 0000000000000000 RBX: ffffffff820aa330 RCX: ffff8808c73b13c0

[  172.456007] RDX: ffff8808c73b13c0 RSI: 0000000000000001 RDI: 0000000000000009

[  172.456007] RBP: ffff8808c6607b18 R08: ffff8808c6607ae8 R09: 0000000000000000

[  172.456007] R10: 0000000000000000 R11: 0000000000000004 R12: 0000000000000000

[  172.456007] R13: 0000000001f87ca6 R14: ffff8808c73b13c0 R15: 0000000000000000

[  172.456007] FS:  00007f46e193a700(0000) GS:ffff8808ffc80000(0000) knlGS:0000000000000000

[  172.456007] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

[  172.456007] CR2: 0000000000000010 CR3: 00000011476b7000 CR4: 00000000000006e0

[  172.456007] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000

[  172.456007] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

[  172.456007] Process bash (pid: 6166, threadinfo ffff8808c6606000, task ffff8808c73b13c0)

[  172.456007] Stack:

[  172.456007]  ffff8808c6607b18 ffff8808c6607ae8 ffff8808c6607b70 ffffffff81095100

[  172.456007]  0000000000000000 ffffffff816016d0 ffff8808c6607b28 ffff8808c6607b28

[  172.456007]  ffff8808c73b1cc8 ffffffff81779414 ffff8808c73b13c0 ffff8808c73b13c0

[  172.456007] Call Trace:

[  172.456007]  [<ffffffff81095100>] ? noop_count+0x10/0x10

[  172.456007]  [<ffffffff8109892b>] check_usage_backwards+0x6b/0xf0

[  172.456007]  [<ffffffff81496d66>] ? notifier_call_chain+0x26/0x110

[  172.456007]  [<ffffffff810988c0>] ? check_usage_forwards+0x110/0x110

[  172.456007]  [<ffffffff8109930a>] mark_lock_irq+0x9a/0x200

[  172.456007]  [<ffffffff8109957c>] mark_lock+0x10c/0x1b0

[  172.456007]  [<ffffffff81099740>] mark_irqflags+0x120/0x190

[  172.456007]  [<ffffffff8109ae60>] __lock_acquire+0x460/0x520

[  172.456007]  [<ffffffff8109b041>] lock_acquire+0x121/0x190

[  172.456007]  [<ffffffff81473fc0>] ? klist_next+0x30/0x110

[  172.456007]  [<ffffffff81099830>] ? mark_held_locks+0x80/0x120

[  172.456007]  [<ffffffff81341700>] ? device_pm_wait_for_dev+0x40/0x40

[  172.456007]  [<ffffffff814922cc>] _raw_spin_lock+0x3c/0x50

[  172.456007]  [<ffffffff81473fc0>] ? klist_next+0x30/0x110

[  172.456007]  [<ffffffff81473fc0>] klist_next+0x30/0x110

[  172.456007]  [<ffffffff81341700>] ? device_pm_wait_for_dev+0x40/0x40

[  172.456007]  [<ffffffff81336d12>] device_for_each_child+0x52/0x70

[  172.456007]  [<ffffffff81342241>] __device_suspend+0x31/0x2e0

[  172.456007]  [<ffffffff81342550>] dpm_suspend+0x60/0x260

[  172.456007]  [<ffffffff810ad107>] hibernation_snapshot+0xd7/0x1b0

[  172.456007]  [<ffffffff810ad497>] hibernate+0x137/0x260

[  172.456007]  [<ffffffff810aaea4>] state_store+0x114/0x140

[  172.456007]  [<ffffffff8126c3a7>] kobj_attr_store+0x17/0x20

[  172.456007]  [<ffffffff811e9214>] sysfs_write_file+0xd4/0x150

[  172.456007]  [<ffffffff81173aeb>] vfs_write+0xcb/0x130

[  172.456007]  [<ffffffff81173c40>] sys_write+0x50/0x90

[  172.456007]  [<ffffffff8149bc39>] system_call_fastpath+0x16/0x1b

[  172.456007] Code: 00 45 85 ff 48 0f 44 d8 4c 8b 23 49 39 dc 74 aa 4d 89 e5 49 81 ed 80 d6 e0 81 49 c1 fd 06 4c 3b 2d 05 83 d7 00 0f 83 f7 00 00 00 <49> 8b 54 24 10 8b 42 2c 3b 05 d1 01 43 01 0f 84 88 00 00 00 4c 

[  172.456007] RIP  [<ffffffff81095309>] __bfs+0xe9/0x270

[  172.456007]  RSP <ffff8808c6607ab8>

[  172.456007] CR2: 0000000000000010

[  172.456007] ---[ end trace b286a5fb903f51e8 ]---

[  172.456007] BUG: sleeping function called from invalid context at kernel/rwsem.c:21

[  172.456007] in_atomic(): 1, irqs_disabled(): 1, pid: 6166, name: bash

[  172.456007] INFO: lockdep is turned off.

[  172.456007] irq event stamp: 4430551

[  172.456007] hardirqs last  enabled at (4430551): [<ffffffff8148f5ff>] __mutex_unlock_slowpath+0xef/0x180

[  172.456007] hardirqs last disabled at (4430550): [<ffffffff8148f559>] __mutex_unlock_slowpath+0x49/0x180

[  172.456007] softirqs last  enabled at (4424302): [<ffffffff810451be>] __do_softirq+0x21e/0x3a0

[  172.456007] softirqs last disabled at (4424285): [<ffffffff8149d16c>] call_softirq+0x1c/0x30

[  172.456007] Pid: 6166, comm: bash Tainted: G      D W    3.2.0-33-default #3

[  172.456007] Call Trace:

[  172.456007]  [<ffffffff81072992>] __might_sleep+0x152/0x1f0

[  172.456007]  [<ffffffff8149013f>] down_read+0x1f/0x60

[  172.456007]  [<ffffffff810550ff>] exit_signals+0x1f/0x140

[  172.456007]  [<ffffffff8106c411>] ? blocking_notifier_call_chain+0x11/0x20

[  172.456007]  [<ffffffff81042742>] do_exit+0xb2/0x480

[  172.456007]  [<ffffffff81493db4>] oops_end+0xe4/0xf0

[  172.456007]  [<ffffffff8102f609>] no_context+0x119/0x200

[  172.456007]  [<ffffffff8102f83d>] __bad_area_nosemaphore+0x14d/0x220

[  172.456007]  [<ffffffff8102f91e>] bad_area_nosemaphore+0xe/0x10

[  172.456007]  [<ffffffff81496c0e>] do_page_fault+0x38e/0x4c0

[  172.456007]  [<ffffffff8109650d>] ? trace_hardirqs_off+0xd/0x10

[  172.456007]  [<ffffffff8103df92>] ? vprintk+0x362/0x530

[  172.456007]  [<ffffffff8127725d>] ? trace_hardirqs_off_thunk+0x3a/0x3c

[  172.456007]  [<ffffffff814931c5>] page_fault+0x25/0x30

[  172.456007]  [<ffffffff81095309>] ? __bfs+0xe9/0x270

[  172.456007]  [<ffffffff81095415>] ? __bfs+0x1f5/0x270

[  172.456007]  [<ffffffff81095100>] ? noop_count+0x10/0x10

[  172.456007]  [<ffffffff8109892b>] check_usage_backwards+0x6b/0xf0

[  172.456007]  [<ffffffff81496d66>] ? notifier_call_chain+0x26/0x110

[  172.456007]  [<ffffffff810988c0>] ? check_usage_forwards+0x110/0x110

[  172.456007]  [<ffffffff8109930a>] mark_lock_irq+0x9a/0x200

[  172.456007]  [<ffffffff8109957c>] mark_lock+0x10c/0x1b0

[  172.456007]  [<ffffffff81099740>] mark_irqflags+0x120/0x190

[  172.456007]  [<ffffffff8109ae60>] __lock_acquire+0x460/0x520

[  172.456007]  [<ffffffff8109b041>] lock_acquire+0x121/0x190

[  172.456007]  [<ffffffff81473fc0>] ? klist_next+0x30/0x110

[  172.456007]  [<ffffffff81099830>] ? mark_held_locks+0x80/0x120

[  172.456007]  [<ffffffff81341700>] ? device_pm_wait_for_dev+0x40/0x40

[  172.456007]  [<ffffffff814922cc>] _raw_spin_lock+0x3c/0x50

[  172.456007]  [<ffffffff81473fc0>] ? klist_next+0x30/0x110

[  172.456007]  [<ffffffff81473fc0>] klist_next+0x30/0x110

[  172.456007]  [<ffffffff81341700>] ? device_pm_wait_for_dev+0x40/0x40

[  172.456007]  [<ffffffff81336d12>] device_for_each_child+0x52/0x70

[  172.456007]  [<ffffffff81342241>] __device_suspend+0x31/0x2e0

[  172.456007]  [<ffffffff81342550>] dpm_suspend+0x60/0x260

[  172.456007]  [<ffffffff810ad107>] hibernation_snapshot+0xd7/0x1b0

[  172.456007]  [<ffffffff810ad497>] hibernate+0x137/0x260

[  172.456007]  [<ffffffff810aaea4>] state_store+0x114/0x140

[  172.456007]  [<ffffffff8126c3a7>] kobj_attr_store+0x17/0x20

[  172.456007]  [<ffffffff811e9214>] sysfs_write_file+0xd4/0x150

[  172.456007]  [<ffffffff81173aeb>] vfs_write+0xcb/0x130

[  172.456007]  [<ffffffff81173c40>] sys_write+0x50/0x90

[  172.456007]  [<ffffffff8149bc39>] system_call_fastpath+0x16/0x1b

[  172.456007] note: bash[6166] exited with preempt_count 1

[  173.455156] BUG: scheduling while atomic: bash/6166/0x10000002

[  173.461156] INFO: lockdep is turned off.

[  173.465253] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf microcode fuse loop dm_mod shpchp i7core_edac cdc_ether i2c_i801 ioatdma bnx2 usbnet iTCO_wdt edac_core pci_hotplug dca i2c_core tpm_tis pcspkr serio_raw tpm mii sg rtc_cmos button iTCO_vendor_support tpm_bios uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon

[  173.513250] Pid: 6166, comm: bash Tainted: G      D W    3.2.0-33-default #3

[  173.523507] Call Trace:

[  173.526124]  [<ffffffff81073157>] __schedule_bug+0x97/0xa0

[  173.531777]  [<ffffffff814910b5>] __schedule+0x705/0x9a0

[  173.537268]  [<ffffffff8116a8c8>] ? mem_cgroup_update_page_stat+0x148/0x250

[  173.544396]  [<ffffffff8116a7d1>] ? mem_cgroup_update_page_stat+0x51/0x250

[  173.551451]  [<ffffffff81150eba>] ? alloc_pages_current+0xaa/0x120

[  173.557801]  [<ffffffff81492c76>] ? _raw_spin_unlock+0x26/0x40

[  173.563806]  [<ffffffff811336c4>] ? zap_pte_range+0x84/0x3b0

[  173.569634]  [<ffffffff811337f5>] ? zap_pte_range+0x1b5/0x3b0

[  173.575547]  [<ffffffff81496ef6>] ? __atomic_notifier_call_chain+0xa6/0x130

[  173.582689]  [<ffffffff81078af5>] __cond_resched+0x25/0x40

[  173.588345]  [<ffffffff814913dd>] _cond_resched+0x2d/0x40

[  173.593914]  [<ffffffff811342ce>] unmap_page_range+0x25e/0x300

[  173.599921]  [<ffffffff8113443c>] unmap_vmas+0xcc/0x150

[  173.605319]  [<ffffffff81139dbd>] exit_mmap+0x8d/0x120

[  173.610633]  [<ffffffff8103ac3c>] mmput+0x6c/0x150

[  173.615595]  [<ffffffff8103ffca>] exit_mm+0x10a/0x140

[  173.620812]  [<ffffffff81492bab>] ? _raw_spin_unlock_irq+0x2b/0x50

[  173.627168]  [<ffffffff8130f413>] ? tty_audit_exit+0x23/0xa0

[  173.633001]  [<ffffffff810427e3>] do_exit+0x153/0x480

[  173.638224]  [<ffffffff81493db4>] oops_end+0xe4/0xf0

[  173.643358]  [<ffffffff8102f609>] no_context+0x119/0x200

[  173.648848]  [<ffffffff8102f83d>] __bad_area_nosemaphore+0x14d/0x220

[  173.655371]  [<ffffffff8102f91e>] bad_area_nosemaphore+0xe/0x10

[  173.661462]  [<ffffffff81496c0e>] do_page_fault+0x38e/0x4c0

[  173.667204]  [<ffffffff8109650d>] ? trace_hardirqs_off+0xd/0x10

[  173.673291]  [<ffffffff8103df92>] ? vprintk+0x362/0x530

[  173.678684]  [<ffffffff8127725d>] ? trace_hardirqs_off_thunk+0x3a/0x3c

[  173.685385]  [<ffffffff814931c5>] page_fault+0x25/0x30

[  173.690696]  [<ffffffff81095309>] ? __bfs+0xe9/0x270

[  173.695837]  [<ffffffff81095415>] ? __bfs+0x1f5/0x270

[  173.701060]  [<ffffffff81095100>] ? noop_count+0x10/0x10

[  173.706542]  [<ffffffff8109892b>] check_usage_backwards+0x6b/0xf0

[  173.712811]  [<ffffffff81496d66>] ? notifier_call_chain+0x26/0x110

[  173.719141]  [<ffffffff810988c0>] ? check_usage_forwards+0x110/0x110

[  173.725672]  [<ffffffff8109930a>] mark_lock_irq+0x9a/0x200

[  173.731324]  [<ffffffff8109957c>] mark_lock+0x10c/0x1b0

[  173.736723]  [<ffffffff81099740>] mark_irqflags+0x120/0x190

[  173.742466]  [<ffffffff8109ae60>] __lock_acquire+0x460/0x520

[  173.748297]  [<ffffffff8109b041>] lock_acquire+0x121/0x190

[  173.753956]  [<ffffffff81473fc0>] ? klist_next+0x30/0x110

[  173.759523]  [<ffffffff81099830>] ? mark_held_locks+0x80/0x120

[  173.765528]  [<ffffffff81341700>] ? device_pm_wait_for_dev+0x40/0x40

[  173.772079]  [<ffffffff814922cc>] _raw_spin_lock+0x3c/0x50

[  173.777748]  [<ffffffff81473fc0>] ? klist_next+0x30/0x110

[  173.783317]  [<ffffffff81473fc0>] klist_next+0x30/0x110

[  173.788738]  [<ffffffff81341700>] ? device_pm_wait_for_dev+0x40/0x40

[  173.795264]  [<ffffffff81336d12>] device_for_each_child+0x52/0x70

[  173.801534]  [<ffffffff81342241>] __device_suspend+0x31/0x2e0

[  173.807451]  [<ffffffff81342550>] dpm_suspend+0x60/0x260

[  173.812938]  [<ffffffff810ad107>] hibernation_snapshot+0xd7/0x1b0

[  173.819201]  [<ffffffff810ad497>] hibernate+0x137/0x260

[  173.824598]  [<ffffffff810aaea4>] state_store+0x114/0x140

[  173.830168]  [<ffffffff8126c3a7>] kobj_attr_store+0x17/0x20

[  173.835915]  [<ffffffff811e9214>] sysfs_write_file+0xd4/0x150

[  173.841833]  [<ffffffff81173aeb>] vfs_write+0xcb/0x130

[  173.847147]  [<ffffffff81173c40>] sys_write+0x50/0x90

[  173.852376]  [<ffffffff8149bc39>] system_call_fastpath+0x16/0x1b


Regards,
Srivatsa S. Bhat
IBM Linux Technology Center
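
A note on the NULL pointer dereference in the trace above: the byte
marked <49> in the "Code:" line starts the sequence 49 8b 54 24 10,
which decodes by hand to "mov 0x10(%r12),%rdx". R12 is zero in the
register dump, so the load touches address 0x10, which is exactly what
CR2 reports. The Python sketch below only redoes that arithmetic on
lines copied from the oops. The helper name parse_registers is made up
for illustration, and the displacement is hand-decoded rather than
produced by a disassembler.

```python
import re

# Register lines copied from the oops above (timestamps dropped).
OOPS = """\
RAX: 0000000000000000 RBX: ffffffff820aa330 RCX: ffff8808c73b13c0
R10: 0000000000000000 R11: 0000000000000004 R12: 0000000000000000
CR2: 0000000000000010 CR3: 00000011476b7000 CR4: 00000000000006e0
"""


def parse_registers(text):
    """Return {name: value} for every 'NAME: <16 hex digits>' pair."""
    pairs = re.findall(r"\b([A-Z][A-Z0-9]{1,2}): ([0-9a-f]{16})\b", text)
    return {name: int(value, 16) for name, value in pairs}


regs = parse_registers(OOPS)

# Hand-decoded from the marked bytes <49> 8b 54 24 10:
#   mov 0x10(%r12),%rdx
DISPLACEMENT = 0x10
effective_address = regs["R12"] + DISPLACEMENT

# Both values should agree if R12 really was the NULL base pointer.
print(hex(effective_address), hex(regs["CR2"]))
```

This says only that __bfs() loaded a field at offset 0x10 through a
NULL pointer. It does not say which structure the pointer should have
referred to.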



* Re: x86/mce: machine check warning during poweroff
  2012-01-13 20:53       ` Srivatsa S. Bhat
@ 2012-01-13 21:08         ` Linus Torvalds
  2012-01-13 21:24           ` Andi Kleen
  0 siblings, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2012-01-13 21:08 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck,
	Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-kernel,
	Greg Kroah-Hartman, Kay Sievers, gouders, Marcos Souza,
	Linux PM mailing list, Rafael J. Wysocki, tglx, prasad,
	justinmattock, Jeff Chua

On Fri, Jan 13, 2012 at 12:53 PM, Srivatsa S. Bhat
<srivatsa.bhat@linux.vnet.ibm.com> wrote:
>
> Wait a minute, did you mention "second attempt"? I think I have something
> interesting..

Yes, I think you're hitting the exact same thing.

I *think* that what is going on is that we free some data structure
too early, and we didn't use to free them before. I tried to see if I
could catch it with slab and list debugging, but I didn't see
anything, and the machine I used for suspend/resume had other issues
too (wireless network - which is the *only* network on that machine -
hung on resume), so I ended up punting and just disabling MCE to
concentrate on those issues.

One of the differences between sysdev and 'struct device' is that
sysdev doesn't bother refcounting parents etc. So there could have
been some refcount problem that was never relevant with the old sysdev
code. I dunno.

The wireless issues got resolved for me, and I haven't gotten back to
MCE yet. I was *really* hoping that somebody else could figure it out,
since I'm not the only one seeing it..

                    Linus
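
The refcounting point above can be illustrated with a toy model. This
is a Python sketch, not kernel code: the Device class, its get()/put()
methods and the names "parent0"/"child0" are all hypothetical, and it
does not claim to reproduce the actual MCE bug. It only shows the rule
being described: a child pins its parent, and an object's release
callback runs when the last reference is dropped.

```python
class Device:
    """Toy refcounted object; NOT the kernel's struct device."""

    def __init__(self, name, parent=None, release=None):
        self.name = name
        self.refcount = 1            # reference held by the creator
        self.release = release
        self.parent = parent
        if parent is not None:
            parent.get()             # a child pins its parent

    def get(self):
        self.refcount += 1
        return self

    def put(self):
        self.refcount -= 1
        if self.refcount == 0:
            if self.release is None:
                raise RuntimeError(
                    f"Device '{self.name}' does not have a release() function")
            self.release(self)
            if self.parent is not None:
                self.parent.put()    # drop the pin taken in __init__


released = []
parent = Device("parent0", release=lambda d: released.append(d.name))
child = Device("child0", parent=parent, release=lambda d: released.append(d.name))

parent.put()                         # creator lets go of the parent first
order_after_parent_put = list(released)
child.put()                          # last child reference, then the parent's
print(order_after_parent_put, released)
```

In this model the parent outlives the child even though its creator
dropped it first. Code that never took the parent reference would have
released the parent at the first put().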


* Re: x86/mce: machine check warning during poweroff
  2012-01-13 21:08         ` Linus Torvalds
@ 2012-01-13 21:24           ` Andi Kleen
  2012-01-13 21:38             ` Justin P. Mattock
  2012-01-13 23:13             ` Andi Kleen
  0 siblings, 2 replies; 49+ messages in thread
From: Andi Kleen @ 2012-01-13 21:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Srivatsa S. Bhat, Ming Lei, Djalal Harouni, Borislav Petkov,
	Tony Luck, Hidetoshi Seto, Ingo Molnar, linux-kernel,
	Greg Kroah-Hartman, Kay Sievers, gouders, Marcos Souza,
	Linux PM mailing list, Rafael J. Wysocki, tglx, prasad,
	justinmattock, Jeff Chua


> Yes, I think you're hitting the exact same thing.
>
> I *think* that what is going on is that we free some data structure
> too early, and we didn't use to free them before.

Assuming the problem is with the multiple devices (not fully sure):

The redundant devices don't make much sense anyway; they are all
mirrors of each other (state changes always affect all CPUs). We could
just collapse them down to a single device and symlink the others.

-Andi
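
As a rough picture of the "single device plus symlinks" idea, here is
a userspace Python sketch that builds such a layout in a temporary
directory. It is only an illustration of the shape of the proposal: the
directory names and the attribute file "setting" are hypothetical, and
nothing here touches sysfs or reflects how the kernel would create the
links.

```python
import os
import tempfile

root = tempfile.mkdtemp(prefix="mce-sketch-")

# One real directory holding the shared state (hypothetical attribute name).
shared = os.path.join(root, "machinecheck")
os.mkdir(shared)
with open(os.path.join(shared, "setting"), "w") as f:
    f.write("0\n")

# The per-CPU names become relative symlinks to the single real directory.
for cpu in range(2):
    os.symlink("machinecheck", os.path.join(root, f"machinecheck{cpu}"))

# A write through one per-CPU name is visible through every other name.
with open(os.path.join(root, "machinecheck0", "setting"), "w") as f:
    f.write("1\n")
with open(os.path.join(root, "machinecheck1", "setting")) as f:
    seen_via_cpu1 = f.read().strip()

print(seen_via_cpu1)
```

With this layout there is only one object whose lifetime has to be
managed, which is the attraction of the proposal if the per-CPU copies
really are mirrors of each other.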



* Re: x86/mce: machine check warning during poweroff
  2012-01-13 21:24           ` Andi Kleen
@ 2012-01-13 21:38             ` Justin P. Mattock
  2012-01-13 22:06               ` Srivatsa S. Bhat
  2012-01-13 23:13             ` Andi Kleen
  1 sibling, 1 reply; 49+ messages in thread
From: Justin P. Mattock @ 2012-01-13 21:38 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, Srivatsa S. Bhat, Ming Lei, Djalal Harouni,
	Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar,
	linux-kernel, Greg Kroah-Hartman, Kay Sievers, gouders,
	Marcos Souza, Linux PM mailing list, Rafael J. Wysocki, tglx,
	prasad, Jeff Chua

On 01/13/2012 01:24 PM, Andi Kleen wrote:
>
>> Yes, I think you're hitting the exact same thing.
>>
>> I *think* that what is going on is that we free some data structure
>> too early, and we didn't use to free them before.
>
> Assuming the problem is with the multiple devices (not fully sure):
>
> The redundant devices don't make much sense anyways, they are all
> mirrors of each other
> (state changes always affect all CPUs) Could just collapse down to a
> single device and symlink
> the others.
>
> -Andi
>
>

this showed up using no_console_suspend


[  131.875143] usb 5-1: device descriptor read/64, error -110
[  140.599340] PM: Syncing filesystems ... done.
[  140.815981] PM: Preparing system for mem sleep
[  140.829117] Freezing user space processes ... (elapsed 0.01 seconds) done.
[  140.840150] Freezing remaining freezable tasks ...
[  147.079160] usb 5-1: device descriptor read/64, error -110
[  147.282166] usb 5-1: new full-speed USB device number 6 using uhci_hcd
[  157.686165] usb 5-1: device not accepting address 6, error -110
[  157.788183] usb 5-1: new full-speed USB device number 7 using uhci_hcd
[  160.849310]
[  160.849320] Freezing of tasks failed after 20.00 seconds (1 tasks refusing to freeze, wq_busy=0):
[  160.849460] khubd           D f5d90020     0    20      2 0x00000000
[  160.849471]  f5d95d50 00000046 f5d095d0 f5d90020 00000000 c16ec3c0 bce8e78a 00000024
[  160.849488]  c16ec3c0 bce7b7f6 00000024 f60063c0 f5d09170 f5d95d20 c120490b c1039aa6
[  160.849505]  00000000 00000046 c1721180 00000296 f5d95d70 f5d95d40 c1465208 00000000
[  160.849521] Call Trace:
[  160.849538]  [<c120490b>] ? do_raw_spin_lock+0x3b/0xf0
[  160.849548]  [<c1039aa6>] ? lock_timer_base.isra.24+0x26/0x50
[  160.849558]  [<c1465208>] ? _raw_spin_lock_irqsave+0x58/0x70
[  160.849567]  [<c1204a4e>] ? do_raw_spin_unlock+0x4e/0x90
[  160.849574]  [<c1463c30>] schedule+0x30/0x50
[  160.849582]  [<c1461b7f>] schedule_timeout+0x10f/0x1f0
[  160.849589]  [<c10396f0>] ? usleep_range+0x40/0x40
[  160.849597]  [<c1463ae0>] wait_for_common+0xb0/0x120
[  160.849605]  [<c1053bb0>] ? try_to_wake_up+0x260/0x260
[  160.849614]  [<c1463bdd>] wait_for_completion_timeout+0xd/0x10
[  160.849624]  [<c13250e1>] usb_start_wait_urb+0xb1/0xe0
[  160.849632]  [<c10e0101>] ? sys_swapon+0xab1/0xc50
[  160.849640]  [<c1325308>] usb_control_msg+0xb8/0xf0
[  160.849648]  [<c12ad1e8>] ? _dev_info+0x28/0x30
[  160.849656]  [<c131e627>] hub_port_init+0x627/0x710
[  160.849664]  [<c131d396>] ? usb_set_device_state+0x76/0x130
[  160.849672]  [<c1320906>] hub_thread+0x626/0x1080
[  160.849681]  [<c10515a1>] ? finish_task_switch+0x31/0xf0
[  160.849688]  [<c14635c0>] ? __schedule+0x3b0/0x7b0
[  160.849698]  [<c10490c0>] ? __init_waitqueue_head+0x50/0x50
[  160.849705]  [<c1050ef9>] ? complete+0x49/0x60
[  160.849713]  [<c13202e0>] ? usb_remote_wakeup+0x40/0x40
[  160.849720]  [<c1048928>] kthread+0x78/0x80
[  160.849728]  [<c10488b0>] ? __init_kthread_worker+0x60/0x60
[  160.849736]  [<c146b0fe>] kernel_thread_helper+0x6/0xd
[  160.849755]
[  160.849759] Restarting tasks ... done.
[  160.865733] power_supply BAT0: uevent
[  160.865737] power_supply BAT0: POWER_SUPPLY_NAME=BAT0
[  160.886551] power_supply BAT0: prop STATUS=Full
[  160.886562] power_supply BAT0: prop PRESENT=1
[  160.886570] power_supply BAT0: prop TECHNOLOGY=Unknown
[  160.886577] power_supply BAT0: prop CYCLE_COUNT=0

I can supply full dmesg if needed.
a bisect on this should not take too long, just need the time to do so.

last good kernel I have here is: 3.2.0-06541-gf33180c

Justin P. Mattock


* Re: x86/mce: machine check warning during poweroff
  2012-01-13 21:38             ` Justin P. Mattock
@ 2012-01-13 22:06               ` Srivatsa S. Bhat
  2012-01-13 22:17                 ` Alan Stern
  2012-01-13 22:18                 ` Srivatsa S. Bhat
  0 siblings, 2 replies; 49+ messages in thread
From: Srivatsa S. Bhat @ 2012-01-13 22:06 UTC (permalink / raw)
  To: Justin P. Mattock
  Cc: Andi Kleen, Linus Torvalds, Ming Lei, Djalal Harouni,
	Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar,
	linux-kernel, Greg Kroah-Hartman, Kay Sievers, gouders,
	Marcos Souza, Linux PM mailing list, Rafael J. Wysocki, tglx,
	prasad, Jeff Chua

On 01/14/2012 03:08 AM, Justin P. Mattock wrote:

>>
> 
> this showed up using no_console_suspend
> 
> 
> [  131.875143] usb 5-1: device descriptor read/64, error -110
> [  140.599340] PM: Syncing filesystems ... done.
> [  140.815981] PM: Preparing system for mem sleep
> [  140.829117] Freezing user space processes ... (elapsed 0.01 seconds)
> done.
> [  140.840150] Freezing remaining freezable tasks ...
> [  147.079160] usb 5-1: device descriptor read/64, error -110
> [  147.282166] usb 5-1: new full-speed USB device number 6 using uhci_hcd
> [  157.686165] usb 5-1: device not accepting address 6, error -110
> [  157.788183] usb 5-1: new full-speed USB device number 7 using uhci_hcd
> [  160.849310]
> [  160.849320] Freezing of tasks failed after 20.00 seconds (1 tasks
> refusing to freeze, wq_busy=0):

> [  160.849460] khubd           D f5d90020     0    20      2 0x00000000
> [  160.849471]  f5d95d50 00000046 f5d095d0 f5d90020 00000000 c16ec3c0
> bce8e78a 00000024
> [  160.849488]  c16ec3c0 bce7b7f6 00000024 f60063c0 f5d09170 f5d95d20
> c120490b c1039aa6
> [  160.849505]  00000000 00000046 c1721180 00000296 f5d95d70 f5d95d40
> c1465208 00000000
> [  160.849521] Call Trace:
> [  160.849538]  [<c120490b>] ? do_raw_spin_lock+0x3b/0xf0
> [  160.849548]  [<c1039aa6>] ? lock_timer_base.isra.24+0x26/0x50
> [  160.849558]  [<c1465208>] ? _raw_spin_lock_irqsave+0x58/0x70
> [  160.849567]  [<c1204a4e>] ? do_raw_spin_unlock+0x4e/0x90
> [  160.849574]  [<c1463c30>] schedule+0x30/0x50
> [  160.849582]  [<c1461b7f>] schedule_timeout+0x10f/0x1f0
> [  160.849589]  [<c10396f0>] ? usleep_range+0x40/0x40
> [  160.849597]  [<c1463ae0>] wait_for_common+0xb0/0x120
> [  160.849605]  [<c1053bb0>] ? try_to_wake_up+0x260/0x260
> [  160.849614]  [<c1463bdd>] wait_for_completion_timeout+0xd/0x10
> [  160.849624]  [<c13250e1>] usb_start_wait_urb+0xb1/0xe0
> [  160.849632]  [<c10e0101>] ? sys_swapon+0xab1/0xc50
> [  160.849640]  [<c1325308>] usb_control_msg+0xb8/0xf0
> [  160.849648]  [<c12ad1e8>] ? _dev_info+0x28/0x30
> [  160.849656]  [<c131e627>] hub_port_init+0x627/0x710
> [  160.849664]  [<c131d396>] ? usb_set_device_state+0x76/0x130
> [  160.849672]  [<c1320906>] hub_thread+0x626/0x1080
> [  160.849681]  [<c10515a1>] ? finish_task_switch+0x31/0xf0
> [  160.849688]  [<c14635c0>] ? __schedule+0x3b0/0x7b0
> [  160.849698]  [<c10490c0>] ? __init_waitqueue_head+0x50/0x50
> [  160.849705]  [<c1050ef9>] ? complete+0x49/0x60
> [  160.849713]  [<c13202e0>] ? usb_remote_wakeup+0x40/0x40
> [  160.849720]  [<c1048928>] kthread+0x78/0x80
> [  160.849728]  [<c10488b0>] ? __init_kthread_worker+0x60/0x60
> [  160.849736]  [<c146b0fe>] kernel_thread_helper+0x6/0xd
> [  160.849755]
> [  160.849759] Restarting tasks ... done.
> [  160.865733] power_supply BAT0: uevent
> [  160.865737] power_supply BAT0: POWER_SUPPLY_NAME=BAT0
> [  160.886551] power_supply BAT0: prop STATUS=Full
> [  160.886562] power_supply BAT0: prop PRESENT=1
> [  160.886570] power_supply BAT0: prop TECHNOLOGY=Unknown
> [  160.886577] power_supply BAT0: prop CYCLE_COUNT=0
> 
> I can supply full dmesg if needed.
> a bisect on this should not take too long, just need the time to do so.
> 
> last good kernel I have here is: 3.2.0-06541-gf33180c
> 

A freezing failure is a totally different problem. Freezing happens well
before CPUs are taken offline, and even before devices are suspended.
But yes, if freezing fails, suspend fails too (or rather, it is aborted).
And freezing failures are typically a bit harder to trigger since they
occur due to some race conditions. But the suspend failure problem
discussed earlier in this thread (while discussing the MCE warnings) is a
deterministic thing and very easily reproducible.

Regards,
Srivatsa S. Bhat
IBM Linux Technology Center
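
The ordering described above can be written down as a small model. This
is a Python sketch under the assumptions stated in the message (freeze
first, then devices, then CPUs, with a freeze failure aborting the whole
attempt). The function name suspend and the step strings are invented
for illustration and do not correspond to kernel symbols.

```python
def suspend(freeze_tasks):
    """Run the toy sequence; freeze_tasks() returns True on success."""
    steps = ["freeze tasks"]
    if not freeze_tasks():
        # Freezing failed: abort before touching devices or CPUs.
        steps.append("restart tasks")
        return False, steps
    steps.append("suspend devices")
    steps.append("take CPUs offline")
    return True, steps


ok, steps = suspend(lambda: True)
aborted_ok, aborted_steps = suspend(lambda: False)

print(ok, steps)
print(aborted_ok, aborted_steps)
```

In the aborted run the device and CPU steps never appear, which is why
a freezer timeout like the khubd one cannot be the same failure as a
problem that shows up while devices are being suspended or CPUs are
going offline.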



* Re: x86/mce: machine check warning during poweroff
  2012-01-13 22:06               ` Srivatsa S. Bhat
@ 2012-01-13 22:17                 ` Alan Stern
  2012-01-13 22:18                 ` Srivatsa S. Bhat
  1 sibling, 0 replies; 49+ messages in thread
From: Alan Stern @ 2012-01-13 22:17 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Justin P. Mattock, Andi Kleen, Linus Torvalds, Ming Lei,
	Djalal Harouni, Borislav Petkov, Tony Luck, Hidetoshi Seto,
	Ingo Molnar, linux-kernel, Greg Kroah-Hartman, Kay Sievers,
	gouders, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki,
	tglx, prasad, Jeff Chua

On Sat, 14 Jan 2012, Srivatsa S. Bhat wrote:

> On 01/14/2012 03:08 AM, Justin P. Mattock wrote:
> 
> >>
> > 
> > this showed up using no_console_suspend
> > 
> > 
> > [  131.875143] usb 5-1: device descriptor read/64, error -110
> > [  140.599340] PM: Syncing filesystems ... done.
> > [  140.815981] PM: Preparing system for mem sleep
> > [  140.829117] Freezing user space processes ... (elapsed 0.01 seconds)
> > done.
> > [  140.840150] Freezing remaining freezable tasks ...
> > [  147.079160] usb 5-1: device descriptor read/64, error -110
> > [  147.282166] usb 5-1: new full-speed USB device number 6 using uhci_hcd
> > [  157.686165] usb 5-1: device not accepting address 6, error -110
> > [  157.788183] usb 5-1: new full-speed USB device number 7 using uhci_hcd
> > [  160.849310]
> > [  160.849320] Freezing of tasks failed after 20.00 seconds (1 tasks
> > refusing to freeze, wq_busy=0):
> 
> > [  160.849460] khubd           D f5d90020     0    20      2 0x00000000
> > [  160.849471]  f5d95d50 00000046 f5d095d0 f5d90020 00000000 c16ec3c0
> > bce8e78a 00000024
> > [  160.849488]  c16ec3c0 bce7b7f6 00000024 f60063c0 f5d09170 f5d95d20
> > c120490b c1039aa6
> > [  160.849505]  00000000 00000046 c1721180 00000296 f5d95d70 f5d95d40
> > c1465208 00000000
> > [  160.849521] Call Trace:
> > [  160.849538]  [<c120490b>] ? do_raw_spin_lock+0x3b/0xf0
> > [  160.849548]  [<c1039aa6>] ? lock_timer_base.isra.24+0x26/0x50
> > [  160.849558]  [<c1465208>] ? _raw_spin_lock_irqsave+0x58/0x70
> > [  160.849567]  [<c1204a4e>] ? do_raw_spin_unlock+0x4e/0x90
> > [  160.849574]  [<c1463c30>] schedule+0x30/0x50
> > [  160.849582]  [<c1461b7f>] schedule_timeout+0x10f/0x1f0
> > [  160.849589]  [<c10396f0>] ? usleep_range+0x40/0x40
> > [  160.849597]  [<c1463ae0>] wait_for_common+0xb0/0x120
> > [  160.849605]  [<c1053bb0>] ? try_to_wake_up+0x260/0x260
> > [  160.849614]  [<c1463bdd>] wait_for_completion_timeout+0xd/0x10
> > [  160.849624]  [<c13250e1>] usb_start_wait_urb+0xb1/0xe0
> > [  160.849632]  [<c10e0101>] ? sys_swapon+0xab1/0xc50
> > [  160.849640]  [<c1325308>] usb_control_msg+0xb8/0xf0
> > [  160.849648]  [<c12ad1e8>] ? _dev_info+0x28/0x30
> > [  160.849656]  [<c131e627>] hub_port_init+0x627/0x710
> > [  160.849664]  [<c131d396>] ? usb_set_device_state+0x76/0x130
> > [  160.849672]  [<c1320906>] hub_thread+0x626/0x1080
> > [  160.849681]  [<c10515a1>] ? finish_task_switch+0x31/0xf0
> > [  160.849688]  [<c14635c0>] ? __schedule+0x3b0/0x7b0
> > [  160.849698]  [<c10490c0>] ? __init_waitqueue_head+0x50/0x50
> > [  160.849705]  [<c1050ef9>] ? complete+0x49/0x60
> > [  160.849713]  [<c13202e0>] ? usb_remote_wakeup+0x40/0x40
> > [  160.849720]  [<c1048928>] kthread+0x78/0x80
> > [  160.849728]  [<c10488b0>] ? __init_kthread_worker+0x60/0x60
> > [  160.849736]  [<c146b0fe>] kernel_thread_helper+0x6/0xd
> > [  160.849755]
> > [  160.849759] Restarting tasks ... done.
> > [  160.865733] power_supply BAT0: uevent
> > [  160.865737] power_supply BAT0: POWER_SUPPLY_NAME=BAT0
> > [  160.886551] power_supply BAT0: prop STATUS=Full
> > [  160.886562] power_supply BAT0: prop PRESENT=1
> > [  160.886570] power_supply BAT0: prop TECHNOLOGY=Unknown
> > [  160.886577] power_supply BAT0: prop CYCLE_COUNT=0
> > 
> > I can supply full dmesg if needed.
> > a bisect on this should not take too long, just need the time to do so.
> > 
> > last good kernel I have here is: 3.2.0-06541-gf33180c
> > 
> 
> A freezing failure is a totally different problem. Freezing happens well
> before CPUs are taken offline, and even before devices are suspended.
> But yes, if freezing fails, suspend fails too (or rather, it is aborted).
> And freezing failures are typically a bit harder to trigger since they
> occur due to some race conditions. But the suspend failure problem
> discussed earlier in this thread (while discussing the MCE warnings) is a
> deterministic thing and very easily reproducible.

The freezing failure is easy to debug.  The khubd thread was busy 
trying to initialize a non-working USB device.  It doesn't check for 
freezes while doing this, and it has a lot of (probably too many) 
nested retry loops with long delays.

If the non-working USB device were unplugged, the problem would go 
away.

Alan Stern
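
To make the retry-loop point concrete, here is a userspace Python
sketch of a retry loop that does, or does not, look at a freeze request
between attempts. It is a model only: init_device, make_dead_device and
the use of threading.Event as the "freeze requested" flag are all
hypothetical, the delays are left out, and this is not how khubd is
written.

```python
import threading


def init_device(try_once, freeze_requested, retries=4, check_freeze=True):
    """Retry try_once(); return (result, number of attempts made)."""
    for attempt in range(retries):
        if check_freeze and freeze_requested.is_set():
            return "frozen", attempt      # give up early so the freezer can proceed
        if try_once():
            return "ok", attempt + 1
    return "failed", retries


def make_dead_device(freeze_requested, freeze_at=2):
    """A device that never answers; the freeze request lands during attempt freeze_at."""
    calls = {"n": 0}

    def try_once():
        calls["n"] += 1
        if calls["n"] == freeze_at:
            freeze_requested.set()
        return False

    return try_once


ev1 = threading.Event()
with_check = init_device(make_dead_device(ev1), ev1)

ev2 = threading.Event()
without_check = init_device(make_dead_device(ev2), ev2, check_freeze=False)

print(with_check, without_check)
```

The loop that checks the flag stops after the attempt during which the
request arrived. The one that does not check keeps going through every
retry, which with long per-attempt delays is how a 20 second freezer
timeout can be exceeded.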



* Re: x86/mce: machine check warning during poweroff
  2012-01-13 22:06               ` Srivatsa S. Bhat
  2012-01-13 22:17                 ` Alan Stern
@ 2012-01-13 22:18                 ` Srivatsa S. Bhat
  1 sibling, 0 replies; 49+ messages in thread
From: Srivatsa S. Bhat @ 2012-01-13 22:18 UTC (permalink / raw)
  To: Justin P. Mattock
  Cc: Andi Kleen, Linus Torvalds, Ming Lei, Djalal Harouni,
	Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar,
	linux-kernel, Greg Kroah-Hartman, Kay Sievers, gouders,
	Marcos Souza, Linux PM mailing list, Rafael J. Wysocki, tglx,
	prasad, Jeff Chua, Tejun Heo, Alan Stern

On 01/14/2012 03:36 AM, Srivatsa S. Bhat wrote:

> On 01/14/2012 03:08 AM, Justin P. Mattock wrote:
> 
>>>
>>
>> this showed up using no_console_suspend
>>
>>
>> [  131.875143] usb 5-1: device descriptor read/64, error -110
>> [  140.599340] PM: Syncing filesystems ... done.
>> [  140.815981] PM: Preparing system for mem sleep
>> [  140.829117] Freezing user space processes ... (elapsed 0.01 seconds)
>> done.
>> [  140.840150] Freezing remaining freezable tasks ...
>> [  147.079160] usb 5-1: device descriptor read/64, error -110
>> [  147.282166] usb 5-1: new full-speed USB device number 6 using uhci_hcd
>> [  157.686165] usb 5-1: device not accepting address 6, error -110
>> [  157.788183] usb 5-1: new full-speed USB device number 7 using uhci_hcd
>> [  160.849310]
>> [  160.849320] Freezing of tasks failed after 20.00 seconds (1 tasks
>> refusing to freeze, wq_busy=0):
> 
>> [  160.849460] khubd           D f5d90020     0    20      2 0x00000000
>> [  160.849471]  f5d95d50 00000046 f5d095d0 f5d90020 00000000 c16ec3c0
>> bce8e78a 00000024
>> [  160.849488]  c16ec3c0 bce7b7f6 00000024 f60063c0 f5d09170 f5d95d20
>> c120490b c1039aa6
>> [  160.849505]  00000000 00000046 c1721180 00000296 f5d95d70 f5d95d40
>> c1465208 00000000
>> [  160.849521] Call Trace:
>> [  160.849538]  [<c120490b>] ? do_raw_spin_lock+0x3b/0xf0
>> [  160.849548]  [<c1039aa6>] ? lock_timer_base.isra.24+0x26/0x50
>> [  160.849558]  [<c1465208>] ? _raw_spin_lock_irqsave+0x58/0x70
>> [  160.849567]  [<c1204a4e>] ? do_raw_spin_unlock+0x4e/0x90
>> [  160.849574]  [<c1463c30>] schedule+0x30/0x50
>> [  160.849582]  [<c1461b7f>] schedule_timeout+0x10f/0x1f0
>> [  160.849589]  [<c10396f0>] ? usleep_range+0x40/0x40
>> [  160.849597]  [<c1463ae0>] wait_for_common+0xb0/0x120
>> [  160.849605]  [<c1053bb0>] ? try_to_wake_up+0x260/0x260
>> [  160.849614]  [<c1463bdd>] wait_for_completion_timeout+0xd/0x10
>> [  160.849624]  [<c13250e1>] usb_start_wait_urb+0xb1/0xe0
>> [  160.849632]  [<c10e0101>] ? sys_swapon+0xab1/0xc50
>> [  160.849640]  [<c1325308>] usb_control_msg+0xb8/0xf0
>> [  160.849648]  [<c12ad1e8>] ? _dev_info+0x28/0x30
>> [  160.849656]  [<c131e627>] hub_port_init+0x627/0x710
>> [  160.849664]  [<c131d396>] ? usb_set_device_state+0x76/0x130
>> [  160.849672]  [<c1320906>] hub_thread+0x626/0x1080
>> [  160.849681]  [<c10515a1>] ? finish_task_switch+0x31/0xf0
>> [  160.849688]  [<c14635c0>] ? __schedule+0x3b0/0x7b0
>> [  160.849698]  [<c10490c0>] ? __init_waitqueue_head+0x50/0x50
>> [  160.849705]  [<c1050ef9>] ? complete+0x49/0x60
>> [  160.849713]  [<c13202e0>] ? usb_remote_wakeup+0x40/0x40
>> [  160.849720]  [<c1048928>] kthread+0x78/0x80
>> [  160.849728]  [<c10488b0>] ? __init_kthread_worker+0x60/0x60
>> [  160.849736]  [<c146b0fe>] kernel_thread_helper+0x6/0xd
>> [  160.849755]
>> [  160.849759] Restarting tasks ... done.
>> [  160.865733] power_supply BAT0: uevent
>> [  160.865737] power_supply BAT0: POWER_SUPPLY_NAME=BAT0
>> [  160.886551] power_supply BAT0: prop STATUS=Full
>> [  160.886562] power_supply BAT0: prop PRESENT=1
>> [  160.886570] power_supply BAT0: prop TECHNOLOGY=Unknown
>> [  160.886577] power_supply BAT0: prop CYCLE_COUNT=0
>>
>> I can supply full dmesg if needed.
>> a bisect on this should not take too long, just need the time to do so.
>>
>> last good kernel I have here is: 3.2.0-06541-gf33180c
>>
> 
> Freezing failure is a totally different problem. Freezing happens much
> before CPUs are taken offline and even before devices are suspended.
> But yes, if freezing fails, suspend fails too (it is aborted rather).
> And freezing failures are typically a bit harder to trigger since they
> occur due to some race conditions. But the suspend failure problem
> discussed earlier in this thread (while discussing the MCE warnings) is a
> deterministic thing and very easily reproducible.
> 


So, it looks like we have 2 problems: a) a freezing failure, apparently due to
USB-related code, and b) a suspend failure due to the MCE overhaul.

Adding Tejun and Alan Stern to Cc.

Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: x86/mce: machine check warning during poweroff
  2012-01-13 20:22   ` Srivatsa S. Bhat
  2012-01-13 20:34     ` Justin P. Mattock
  2012-01-13 20:37     ` Linus Torvalds
@ 2012-01-13 23:02     ` Linus Torvalds
  2012-01-13 23:27       ` Srivatsa S. Bhat
  2 siblings, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2012-01-13 23:02 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck,
	Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-kernel,
	Greg Kroah-Hartman, Kay Sievers, gouders, Marcos Souza,
	Linux PM mailing list, Rafael J. Wysocki, tglx, prasad,
	justinmattock, Jeff Chua

On Fri, Jan 13, 2012 at 12:22 PM, Srivatsa S. Bhat
<srivatsa.bhat@linux.vnet.ibm.com> wrote:
>
> Fundamentally, this warning is triggered during CPU Offline, which is done
> during poweroff, suspend, hibernate etc. IOW, even a simple
> # echo 0 > /sys/devices/system/cpu/cpuX/online will trigger it.

There is definitely something wrong with CPU hotplug and MCE.

I seem to be able to trigger not only warnings, but some oopses, by doing:

 - enable list debugging, slab debugging, and kobject debugging in the
kernel (I've got some other things enabled too, but I think those are
the main ones)

 - do

     echo 0 > /sys/devices/system/cpu/cpuX/online

   this gets a few warnings

 - then do

     echo 1 > /sys/devices/system/cpu/cpuX/online

where bringing it up again will crash the machine entirely.
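
The offline/online cycle above can also be driven from a small C helper,
which is handy when repeating it in a loop. This is only a minimal
userspace sketch: the function names write_control_file() and cycle_cpu()
are made up for illustration, and the caller is assumed to pass the path of
the CPU's "online" file and to run as root.

```c
#include <stdio.h>

/*
 * Write a short string to a sysfs-style control file.
 * Returns 0 on success, -1 on failure.
 */
int write_control_file(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (!f)
        return -1;
    ok = (fputs(value, f) >= 0);
    /* fclose() flushes the buffer, so a failed write is reported here at the latest. */
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/*
 * Take one CPU offline and bring it back, mirroring
 * "echo 0 > .../online" followed by "echo 1 > .../online".
 */
int cycle_cpu(const char *online_path)
{
    if (write_control_file(online_path, "0\n") != 0)
        return -1;
    return write_control_file(online_path, "1\n");
}
```

Calling cycle_cpu("/sys/devices/system/cpu/cpu1/online") would perform one
offline/online round trip on CPU 1; on an affected kernel the second write
is the one expected to crash.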

The oops scrolled off the screen and didn't get caught anywhere, but
the call trace seems to be (warning: hand-entered, so some of this may
be bogus):

Oops in:
  kobject_get+0x10/0x40

Code:
  55 48 89 f8 48 89 e5 48 83 ec 10 48 85 ff 74 0b <8b> 57 38 85 d2 74 06 f0 ff

Call trace:
  get_device
  klist_device_get
  klist_add_tail
  bus_add_device
  device_add
  device_register
  mce_device_create
  notifier_call_chain
  __raw_notifier_call_chain
  __cpu_notify
  _cpu_up
  store_online
  dev_attr_change
  sysfs_write_file

so it's definitely something bad in MCE device handling, and probably
something to do with reusing a 'struct device' after freeing it, or
after not having completely cleaned it up.

I didn't see if I could spot the problem, but I think this is entirely
reproducible, so hopefully somebody who knows the MCE code can
trivially see this and fix it.

                   Linus

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: x86/mce: machine check warning during poweroff
  2012-01-13 21:24           ` Andi Kleen
  2012-01-13 21:38             ` Justin P. Mattock
@ 2012-01-13 23:13             ` Andi Kleen
  1 sibling, 0 replies; 49+ messages in thread
From: Andi Kleen @ 2012-01-13 23:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Srivatsa S. Bhat, Ming Lei, Djalal Harouni, Borislav Petkov,
	Tony Luck, Hidetoshi Seto, Ingo Molnar, linux-kernel,
	Greg Kroah-Hartman, Kay Sievers, gouders, Marcos Souza,
	Linux PM mailing list, Rafael J. Wysocki, tglx, prasad,
	justinmattock, Jeff Chua


> Assuming the problem is with the multiple devices (not fully sure):
>
> The redundant devices don't make much sense anyways, they are all 
> mirrors of each other
> (state changes always affect all CPUs)  Could just collapse down to a 
> single device and symlink
> the others.

Checking again it seems like the AMD threshold code reports per CPU 
counters so it would
still need per CPU devices.

All the other code is just a single instance I believe and doesn't 
actually need all the
complicated per CPU code.

-Andi



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: x86/mce: machine check warning during poweroff
  2012-01-13 23:02     ` Linus Torvalds
@ 2012-01-13 23:27       ` Srivatsa S. Bhat
  2012-01-14  0:05         ` Linus Torvalds
  0 siblings, 1 reply; 49+ messages in thread
From: Srivatsa S. Bhat @ 2012-01-13 23:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck,
	Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-kernel,
	Greg Kroah-Hartman, Kay Sievers, gouders, Marcos Souza,
	Linux PM mailing list, Rafael J. Wysocki, tglx, prasad,
	justinmattock, Jeff Chua, Suresh B Siddha, Peter Zijlstra,
	Mel Gorman, Gilad Ben-Yossef

On 01/14/2012 04:32 AM, Linus Torvalds wrote:

> On Fri, Jan 13, 2012 at 12:22 PM, Srivatsa S. Bhat
> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>>
>> Fundamentally, this warning is triggered during CPU Offline, which is done
>> during poweroff, suspend, hibernate etc. IOW, even a simple
>> # echo 0 > /sys/devices/system/cpu/cpuX/online will trigger it.
> 
> There is definitely something wrong with CPU hotplug and MCE.
> 
> I seem to be able to trigger not only warnings, but some oopses, by doing:
> 
>  - enable list debugging, slab debugging, and kobject debugging in the
> kernel (I've got some other things enabled too, but I think those are
> the main ones)
> 
>  - do
> 
>      echo 0 > /sys/devices/system/cpu/cpuX/online
> 
>    this gets a few warnings
> 
>  - then do
> 
>      echo 1 > /sys/devices/system/cpu/cpuX/online
> 
> where bringing it up again will crash the machine entirely.
> 


I observed this too; and it is very easy to reproduce.
Here is the log:

# echo 0 > /sys/devices/system/cpu/cpu1/online

[   65.091045] CPU 1 is now offline
[   65.097267] ------------[ cut here ]------------
[   65.102045] WARNING: at drivers/base/core.c:194 device_release+0x82/0x90()
[   65.109137] Hardware name: IBM System x -[7870C4Q]-
[   65.109139] Device 'machinecheck1' does not have a release() function, it is broken and must be fixed.
[   65.109141] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf microcode fuse loop dm_mod cdc_ether usbnet i7core_edac edac_core mii serio_raw i2c_i801 shpchp ioatdma iTCO_wdt iTCO_vendor_support dca pci_hotplug pcspkr bnx2 i2c_core tpm_tis tpm tpm_bios sg rtc_cmos button uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
[   65.109195] Pid: 6631, comm: bash Not tainted 3.2.0-debugkernel-0.0.0.28.36b5ec9-default #4
[   65.109197] Call Trace:
[   65.109202]  [<ffffffff8133b462>] ? device_release+0x82/0x90
[   65.109208]  [<ffffffff8103cc2a>] warn_slowpath_common+0x7a/0xb0
[   65.109212]  [<ffffffff8103cd01>] warn_slowpath_fmt+0x41/0x50
[   65.109216]  [<ffffffff8133b462>] device_release+0x82/0x90
[   65.109223]  [<ffffffff8127051e>] ? kobj_kset_leave+0x1e/0x60
[   65.109228]  [<ffffffff8127060d>] kobject_cleanup+0x6d/0x1b0
[   65.109233]  [<ffffffff8127075d>] kobject_release+0xd/0x10
[   65.109237]  [<ffffffff812704ab>] kobject_put+0x2b/0x60
[   65.109241]  [<ffffffff8133ab42>] put_device+0x12/0x20
[   65.109245]  [<ffffffff8133bfc5>] device_unregister+0x25/0x60
[   65.109252]  [<ffffffff8148a22f>] mce_cpu_callback+0x149/0x1a5
[   65.109257]  [<ffffffff8149b4a2>] notifier_call_chain+0x72/0x110
[   65.109263]  [<ffffffff8106bf19>] __raw_notifier_call_chain+0x9/0x10
[   65.109270]  [<ffffffff8147b9b6>] _cpu_down+0x1c6/0x320
[   65.109274]  [<ffffffff8147bb4b>] cpu_down+0x3b/0x60
[   65.109279]  [<ffffffff8147db1d>] store_online+0x6d/0xc8
[   65.109283]  [<ffffffff8133a70b>] dev_attr_store+0x1b/0x20
[   65.109288]  [<ffffffff811ecb04>] sysfs_write_file+0xd4/0x150
[   65.109295]  [<ffffffff81176d1b>] vfs_write+0xcb/0x130
[   65.109299]  [<ffffffff81176e70>] sys_write+0x50/0x90
[   65.109304]  [<ffffffff814a0379>] system_call_fastpath+0x16/0x1b
[   65.109307] ---[ end trace dafb3fda8041063e ]---
[   65.112016] ------------[ cut here ]------------
[   65.112024] WARNING: at arch/x86/kernel/smp.c:120 native_smp_send_reschedule+0x59/0x60()
[   65.112027] Hardware name: IBM System x -[7870C4Q]-
[   65.112028] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf microcode fuse loop dm_mod cdc_ether usbnet i7core_edac edac_core mii serio_raw i2c_i801 shpchp ioatdma iTCO_wdt iTCO_vendor_support dca pci_hotplug pcspkr bnx2 i2c_core tpm_tis tpm tpm_bios sg rtc_cmos button uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
[   65.112067] Pid: 2277, comm: udevd Tainted: G        W    3.2.0-debugkernel-0.0.0.28.36b5ec9-default #4
[   65.112070] Call Trace:
[   65.112071]  <IRQ>  [<ffffffff81021349>] ? native_smp_send_reschedule+0x59/0x60
[   65.112079]  [<ffffffff8103cc2a>] warn_slowpath_common+0x7a/0xb0
[   65.112083]  [<ffffffff8103cc75>] warn_slowpath_null+0x15/0x20
[   65.112086]  [<ffffffff81021349>] native_smp_send_reschedule+0x59/0x60
[   65.112092]  [<ffffffff810825f5>] trigger_load_balance+0x185/0x4f0
[   65.112096]  [<ffffffff8108262b>] ? trigger_load_balance+0x1bb/0x4f0
[   65.112101]  [<ffffffff81073617>] scheduler_tick+0x107/0x170
[   65.112107]  [<ffffffff8104e057>] update_process_times+0x67/0x80
[   65.112113]  [<ffffffff8109353f>] tick_sched_timer+0x5f/0xc0
[   65.112117]  [<ffffffff810934e0>] ? tick_nohz_handler+0x100/0x100
[   65.112122]  [<ffffffff8106a05e>] __run_hrtimer+0x12e/0x330
[   65.112126]  [<ffffffff8106a4a7>] hrtimer_interrupt+0xc7/0x1f0
[   65.112131]  [<ffffffff81022f64>] smp_apic_timer_interrupt+0x64/0xa0
[   65.112135]  [<ffffffff814a0eb3>] apic_timer_interrupt+0x73/0x80
[   65.112137]  <EOI>  [<ffffffff8115f788>] ? __slab_alloc+0x228/0x4e0
[   65.112145]  [<ffffffff810654f0>] ? __wake_up_bit+0x10/0x30
[   65.112150]  [<ffffffff8110b7e5>] unlock_page+0x25/0x30
[   65.112157]  [<ffffffff81135f75>] do_wp_page+0x4f5/0x7b0
[   65.112161]  [<ffffffff8113708d>] handle_pte_fault+0x19d/0x1e0
[   65.112165]  [<ffffffff81137248>] handle_mm_fault+0x178/0x2e0
[   65.112169]  [<ffffffff8149b171>] do_page_fault+0x201/0x4c0
[   65.112173]  [<ffffffff8103c109>] ? do_fork+0x179/0x350
[   65.112177]  [<ffffffff8119900e>] ? mntput+0x1e/0x30
[   65.112182]  [<ffffffff811786ef>] ? __fput+0x16f/0x210
[   65.112187]  [<ffffffff8127ae3d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[   65.112192]  [<ffffffff81497905>] page_fault+0x25/0x30
[   65.112195] ---[ end trace dafb3fda8041063f ]---
[   65.541793] CPU 9 MCA banks CMCI:2 CMCI:3 CMCI:5
[   75.472229] lockdep: fixing up alternatives.

The above warning is related to the reschedule IPI sent to an offline cpu.
I guess this is due to the recent changes done to nohz_balancer_kick() and
find_new_ilb() in kernel/sched/fair.c. I had never seen this warning before
3.3 merge window, even during CPU Hotplug stress tests. Now this warning
is seen pretty often during CPU offline.

[Adding Suresh Siddha and Peter Zijlstra to Cc.]

# echo 1 > /sys/devices/system/cpu/cpu1/online

[   75.476772] Booting Node 0 Processor 1 APIC 0x2
[   75.481495] smpboot cpu 1: start_ip = 97000
[   75.492927] Calibrating delay loop (skipped) already calibrated this CPU
[   75.508449] NMI watchdog enabled, takes one hw-pmu counter.
[   75.515402] general protection fault: 0000 [#1] SMP 
[   75.518940] CPU 7 
[   75.518940] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf microcode fuse loop dm_mod cdc_ether usbnet i7core_edac edac_core mii serio_raw i2c_i801 shpchp ioatdma iTCO_wdt iTCO_vendor_support dca pci_hotplug pcspkr bnx2 i2c_core tpm_tis tpm tpm_bios sg rtc_cmos button uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
[   75.518940] 
[   75.518940] Pid: 6631, comm: bash Tainted: G        W    3.2.0-debugkernel-0.0.0.28.36b5ec9-default #4 IBM IBM System x -[7870C4Q]-/68Y8033     
[   75.518940] RIP: 0010:[<ffffffff81270779>]  [<ffffffff81270779>] kobject_get+0x19/0x60
[   75.518940] RSP: 0018:ffff8808c6cc7c18  EFLAGS: 00010206
[   75.518940] RAX: 0000000000000000 RBX: 6b6b6b6b6b6b6b7b RCX: 0000000000000006
[   75.518940] RDX: ffffffff81e98ae0 RSI: ffff8808ccc93080 RDI: 6b6b6b6b6b6b6b7b
[   75.518940] RBP: ffff8808c6cc7c28 R08: 5ff145670d8e439e R09: 0000000000000000
[   75.518940] R10: 0000000000000005 R11: 0000000000000001 R12: ffff88114ded3608
[   75.518940] R13: ffffffff81a13440 R14: ffff8808ddc4cb60 R15: 0000000000000001
[   75.518940] FS:  00007f9a3218e700(0000) GS:ffff88117fcc0000(0000) knlGS:0000000000000000
[   75.518940] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   75.518940] CR2: 000000000068a2a0 CR3: 000000114bd59000 CR4: 00000000000006e0
[   75.518940] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   75.518940] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   75.518940] Process bash (pid: 6631, threadinfo ffff8808c6cc6000, task ffff8808c6d9c600)
[   75.518940] Stack:
[   75.518940]  ffff8808ccc93080 ffff88114ded3608 ffff8808c6cc7c38 ffffffff8133ab14
[   75.518940]  ffff8808c6cc7c48 ffffffff8133ddad ffff8808c6cc7c68 ffffffff81478b82
[   75.518940]  ffff88114ded3608 ffff8808ccc93080 ffff8808c6cc7c88 ffffffff81479062
[   75.518940] Call Trace:
[   75.518940]  [<ffffffff8133ab14>] get_device+0x14/0x20
[   75.518940]  [<ffffffff8133ddad>] klist_devices_get+0xd/0x10
[   75.518940]  [<ffffffff81478b82>] klist_node_init+0x42/0x70
[   75.518940]  [<ffffffff81479062>] klist_add_tail+0x22/0x60
[   75.518940]  [<ffffffff8133e76b>] bus_add_device+0x1bb/0x200
[   75.518940]  [<ffffffff8133c7c7>] device_add+0x2e7/0x570
[   75.518940]  [<ffffffff813479e0>] ? device_pm_init+0x70/0xa0
[   75.518940]  [<ffffffff8133ca69>] device_register+0x19/0x20
[   75.518940]  [<ffffffff81489fe6>] mce_device_create+0x8b/0x18b
[   75.518940]  [<ffffffff8148a26d>] mce_cpu_callback+0x187/0x1a5
[   75.518940]  [<ffffffff8149b4a2>] notifier_call_chain+0x72/0x110
[   75.518940]  [<ffffffff8106bf19>] __raw_notifier_call_chain+0x9/0x10
[   75.518940]  [<ffffffff8148db41>] _cpu_up+0x124/0x12a
[   75.518940]  [<ffffffff8148dc03>] cpu_up+0xbc/0x114
[   75.518940]  [<ffffffff8147db45>] store_online+0x95/0xc8
[   75.518940]  [<ffffffff8133a70b>] dev_attr_store+0x1b/0x20
[   75.518940]  [<ffffffff811ecb04>] sysfs_write_file+0xd4/0x150
[   75.518940]  [<ffffffff81176d1b>] vfs_write+0xcb/0x130
[   75.518940]  [<ffffffff81176e70>] sys_write+0x50/0x90
[   75.518940]  [<ffffffff814a0379>] system_call_fastpath+0x16/0x1b
[   75.518940] Code: ff ff 55 48 83 ef 38 48 89 e5 e8 43 fe ff ff c9 c3 90 55 48 89 e5 48 83 ec 10 48 85 ff 48 89 1c 24 4c 89 64 24 08 48 89 fb 74 0f <8b> 47 38 4c 8d 67 38 85 c0 74 1c f0 ff 43 38 48 89 d8 4c 8b 64 
[   75.518940] RIP  [<ffffffff81270779>] kobject_get+0x19/0x60
[   75.518940]  RSP <ffff8808c6cc7c18>
[   75.856395] ---[ end trace dafb3fda80410640 ]---


And in a separate try, I got this during cpu online operation:
(Pretty much the same as above, but with the BUG description present.)

[   83.491328] Booting Node 1 Processor 6 APIC 0x14
[   83.496135] smpboot cpu 6: start_ip = 97000
[   72.494772] Calibrating delay loop (skipped) already calibrated this CPU
[   83.522491] NMI watchdog enabled, takes one hw-pmu counter.
[   83.529016] BUG: unable to handle kernel paging request at 000000350000004a
[   83.532868] IP: [<ffffffff8126cac9>] kobject_get+0x19/0x60
[   83.532868] PGD 8c7909067 PUD 0
[   83.532868] Oops: 0000 [#1] SMP
[   83.532868] CPU 0
[   83.532868] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf microcode fuse loop dm_mod ioatdma cdc_ether usbnet bnx2 shpchp mii tpm_tis tpm i7core_edac rtc_cmos serio_raw i2c_i801 dca pcspkr pci_hotplug edac_core i2c_core iTCO_wdt iTCO_vendor_support sg tpm_bios button uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
[   83.532868]
[   83.532868] Pid: 6347, comm: allon_cpu_statu Tainted: G        W    3.2.0-33-default #3 IBM IBM System x -[7870C4Q]-/68Y8033
[   83.532868] RIP: 0010:[<ffffffff8126cac9>]  [<ffffffff8126cac9>] kobject_get+0x19/0x60
[   83.532868] RSP: 0018:ffff8808c78c1c18  EFLAGS: 00010206
[   83.532868] RAX: 0000000000000000 RBX: 0000003500000012 RCX: 0000000000000006
[   83.532868] RDX: ffffffff81f0f180 RSI: ffff8808c7f01118 RDI: 0000003500000012
[   83.532868] RBP: ffff8808c78c1c28 R08: 543148780dbe0391 R09: 0000000000000000
[   83.532868] R10: 0000000000000005 R11: 0000000000000001 R12: ffff8808c9f37d38
[   83.532868] R13: ffffffff81a13440 R14: ffff88117fc8cb60 R15: 0000000000000006
[   83.532868] FS:  00007f7043861700(0000) GS:ffff8808ffc00000(0000) knlGS:0000000000000000
[   83.532868] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   83.532868] CR2: 000000350000004a CR3: 00000008c7ee9000 CR4: 00000000000006f0
[   83.532868] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   83.532868] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   83.532868] Process allon_cpu_statu (pid: 6347, threadinfo ffff8808c78c0000, task ffff8808ca7c8bc0)
[   83.532868] Stack:
[   83.532868]  ffff8808c7f01118 ffff8808c9f37d38 ffff8808c78c1c38 ffffffff813362e4
[   83.532868]  ffff8808c78c1c48 ffffffff8133951d ffff8808c78c1c68 ffffffff81473db2
[   83.532868]  ffff8808c9f37d38 ffff8808c7f01118 ffff8808c78c1c88 ffffffff81474292
[   83.532868] Call Trace:
[   83.532868]  [<ffffffff813362e4>] get_device+0x14/0x20
[   83.532868]  [<ffffffff8133951d>] klist_devices_get+0xd/0x10
[   83.532868]  [<ffffffff81473db2>] klist_node_init+0x42/0x70
[   83.532868]  [<ffffffff81474292>] klist_add_tail+0x22/0x60
[   83.532868]  [<ffffffff81339edb>] bus_add_device+0x1bb/0x200
[   83.532868]  [<ffffffff81337f77>] device_add+0x2e7/0x570
[   83.532868]  [<ffffffff81343080>] ? device_pm_init+0x70/0xa0
[   83.532868]  [<ffffffff81338219>] device_register+0x19/0x20
[   83.532868]  [<ffffffff8148537f>] mce_device_create+0x8b/0x18b
[   83.532868]  [<ffffffff81485606>] mce_cpu_callback+0x187/0x1a5
[   83.532868]  [<ffffffff81496db2>] notifier_call_chain+0x72/0x110
[   83.532868]  [<ffffffff8106c1c9>] __raw_notifier_call_chain+0x9/0x10
[   83.532868]  [<ffffffff81488dc1>] _cpu_up+0x124/0x12a
[   83.532868]  [<ffffffff81488e83>] cpu_up+0xbc/0x114
[   83.532868]  [<ffffffff81479065>] store_online+0x95/0xc8
[   83.532868]  [<ffffffff81335edb>] dev_attr_store+0x1b/0x20
[   83.532868]  [<ffffffff811e9214>] sysfs_write_file+0xd4/0x150
[   83.532868]  [<ffffffff81173aeb>] vfs_write+0xcb/0x130
[   83.532868]  [<ffffffff81173c40>] sys_write+0x50/0x90
[   83.532868]  [<ffffffff8149bc39>] system_call_fastpath+0x16/0x1b
[   83.532868] Code: ff ff 55 48 83 ef 38 48 89 e5 e8 43 fe ff ff c9 c3 90 55 48 89 e5 48 83 ec 10 48 85 ff 48 89 1c 24 4c 89 64 24 08 48 89 fb 74 0f <8b> 47 38 4c 8d 67 38 85 c0 74 1c f0 ff 43 38 48 89 d8 4c 8b 64
[   83.532868] RIP  [<ffffffff8126cac9>] kobject_get+0x19/0x60
[   83.532868]  RSP <ffff8808c78c1c18>
[   83.532868] CR2: 000000350000004a
[   83.890209] ---[ end trace fab5021066ee998d ]---


> so it's definitely something bad in MCE device handling, and probably
> something to do with reusing a 'struct device' after freeing it, or
> after not having completely cleaned it up.
> 
> I didn't see if I could spot the problem, but I think this is entirely
> reproducible, so hopefully somebody who knows the MCE code can
> trivially see this and fix it.
> 
>                    Linus
> 

 
Regards,
Srivatsa S. Bhat
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: x86/mce: machine check warning during poweroff
  2012-01-13 23:27       ` Srivatsa S. Bhat
@ 2012-01-14  0:05         ` Linus Torvalds
  2012-01-14  2:41           ` Srivatsa S. Bhat
  0 siblings, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2012-01-14  0:05 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck,
	Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-kernel,
	Greg Kroah-Hartman, Kay Sievers, gouders, Marcos Souza,
	Linux PM mailing list, Rafael J. Wysocki, tglx, prasad,
	justinmattock, Jeff Chua, Suresh B Siddha, Peter Zijlstra,
	Mel Gorman, Gilad Ben-Yossef

On Fri, Jan 13, 2012 at 3:27 PM, Srivatsa S. Bhat
<srivatsa.bhat@linux.vnet.ibm.com> wrote:
>
> # echo 1 > /sys/devices/system/cpu/cpu1/online
>
> [   75.476772] Booting Node 0 Processor 1 APIC 0x2
> [   75.481495] smpboot cpu 1: start_ip = 97000
> [   75.492927] Calibrating delay loop (skipped) already calibrated this CPU
> [   75.508449] NMI watchdog enabled, takes one hw-pmu counter.
> [   75.515402] general protection fault: 0000 [#1] SMP
> [   75.518940]
> [   75.518940] Pid: 6631, comm: bash Tainted: G        W    3.2.0-debugkernel-0.0.0.28.36b5ec9-default #4 IBM IBM System x -[7870C4Q]-/68Y8033
> [   75.518940] RIP: 0010:[<ffffffff81270779>]  [<ffffffff81270779>] kobject_get+0x19/0x60
> [   75.518940] RSP: 0018:ffff8808c6cc7c18  EFLAGS: 00010206
> [   75.518940] RAX: 0000000000000000 RBX: 6b6b6b6b6b6b6b7b RCX: 0000000000000006
> [   75.518940] RDX: ffffffff81e98ae0 RSI: ffff8808ccc93080 RDI: 6b6b6b6b6b6b6b7b

The magic is the %rdi value. The instruction that oopses is

    mov    0x38(%rdi),%eax

and "rdi" is 0x10 + the magic 6b6b6b.. pattern. Which is obviously
'poison_free'.

And the 0x10 is because get_device() does

    return dev ? to_dev(kobject_get(&dev->kobj)) : NULL;

and I bet "kobj" is at offset 16 in the device structure. So we had a
pointer to a "struct device", but it was loaded from memory that was
free'd, turning the kobject pointer into that 0x6b6b6b6b6b6b6b7b
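
The arithmetic above can be checked with a small userspace sketch. The
mock_device layout here is purely an assumption made to match the guess
that "kobj" sits at offset 16; it is not copied from the kernel headers,
and the function names are invented for illustration. The 0x6b fill byte is
the value the slab debugging code uses for freed objects.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte value that slab debugging writes over freed objects. */
#define POISON_FREE_BYTE 0x6b

/* Hypothetical stand-ins: only the 16-byte offset of kobj matters here. */
struct mock_kobject {
    uint64_t name;
    uint32_t refcount;
};

struct mock_device {
    uint64_t parent;            /* 8 bytes, stands in for a pointer */
    uint64_t priv;              /* 8 bytes, stands in for a pointer */
    struct mock_kobject kobj;   /* therefore at offset 16 */
};

/* Read a pointer-sized value out of memory filled with the poison byte. */
uint64_t poisoned_pointer(void)
{
    unsigned char freed[sizeof(uint64_t)];
    uint64_t value;

    memset(freed, POISON_FREE_BYTE, sizeof(freed));
    memcpy(&value, freed, sizeof(value));
    return value;
}

/*
 * Compute what &dev->kobj would be for a given dev address.
 * Done in integer arithmetic so nothing bogus is ever dereferenced.
 */
uint64_t kobj_address(uint64_t dev_address)
{
    return dev_address + offsetof(struct mock_device, kobj);
}
```

Feeding poisoned_pointer() into kobj_address() yields 0x6b6b6b6b6b6b6b7b,
the value seen in %rdi and %rbx in the oops.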

So somebody got a pointer from free'd memory. That somebody seems to
be 'klist_devices_get()' that got it from a 'struct klist_node', so I
think we have free'd something from the klist_devices list in the bus.
But I dunno. Odd. I would have expected us to hit that invalid pointer
long before if the klist entry was bogus.

I'm not seeing anything obvious in mce.c. But the fact that it's that
magic per_cpu allocation makes me nervous. It uses that magic
"mce_device_initialized" bit array etc, and it clearly must have
worked before, but it equally clearly does *not* work now.

Looking more at it, I think that maybe something keeps the mce_device
around (refcounts that didn't use to exist before?) so when we
unregister it, it is still in use. And then when we re-register it
when we bring it up, we do that

    memset(&dev->kobj, 0, sizeof(struct kobject));

on the device that is in use. I dunno. It's all scary. Somebody who
knows the MCE layer should look at it.

                   Linus

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: x86/mce: machine check warning during poweroff
  2012-01-13 20:37     ` Linus Torvalds
  2012-01-13 20:53       ` Srivatsa S. Bhat
@ 2012-01-14  0:44       ` Dirk Gouders
  1 sibling, 0 replies; 49+ messages in thread
From: Dirk Gouders @ 2012-01-14  0:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Srivatsa S. Bhat, Ming Lei, Djalal Harouni, Borislav Petkov,
	Tony Luck, Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-kernel,
	Greg Kroah-Hartman, Kay Sievers, Marcos Souza,
	Linux PM mailing list, Rafael J. Wysocki, tglx, prasad,
	justinmattock, Jeff Chua

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Fri, Jan 13, 2012 at 12:22 PM, Srivatsa S. Bhat
> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>> However, it is to be noted that, technically this warning (machinecheck1
>> not having a release() function) is not all that new. Just that people
>> didn't probably notice it earlier (reason explained below).
>
> .. but there really is some semantic issue going on here. I'm not the
> only one for whom suspend/resume simply does not work with MCE
> enabled, and that "not work" coincides with the switch-over from
> sysdev to struct device.
>
> I suspect it is some subtle difference in sysdev_unregister() vs
> device_unregister(). The message change is just a symptom of the fact
> that we go through different paths.
>
> For me MCE hangs only on the *second* suspend, but it sounds like
> others see it on the first one.

Yes, my laptop hangs on the first suspend.
Another effect of the problem is that the laptop's clock is extremely
off after a forced reboot.

Dirk

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: x86/mce: machine check warning during poweroff
  2012-01-14  0:05         ` Linus Torvalds
@ 2012-01-14  2:41           ` Srivatsa S. Bhat
  2012-01-14  2:53             ` Linus Torvalds
  2012-01-17  2:21             ` Suresh Siddha
  0 siblings, 2 replies; 49+ messages in thread
From: Srivatsa S. Bhat @ 2012-01-14  2:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck,
	Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-kernel,
	Greg Kroah-Hartman, Kay Sievers, gouders, Marcos Souza,
	Linux PM mailing list, Rafael J. Wysocki, tglx, prasad,
	justinmattock, Jeff Chua, Suresh B Siddha, Peter Zijlstra,
	Mel Gorman, Gilad Ben-Yossef

On 01/14/2012 05:35 AM, Linus Torvalds wrote:

> On Fri, Jan 13, 2012 at 3:27 PM, Srivatsa S. Bhat
> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>>
>> # echo 1 > /sys/devices/system/cpu/cpu1/online
>>
>> [   75.476772] Booting Node 0 Processor 1 APIC 0x2
>> [   75.481495] smpboot cpu 1: start_ip = 97000
>> [   75.492927] Calibrating delay loop (skipped) already calibrated this CPU
>> [   75.508449] NMI watchdog enabled, takes one hw-pmu counter.
>> [   75.515402] general protection fault: 0000 [#1] SMP
>> [   75.518940]
>> [   75.518940] Pid: 6631, comm: bash Tainted: G        W    3.2.0-debugkernel-0.0.0.28.36b5ec9-default #4 IBM IBM System x -[7870C4Q]-/68Y8033
>> [   75.518940] RIP: 0010:[<ffffffff81270779>]  [<ffffffff81270779>] kobject_get+0x19/0x60
>> [   75.518940] RSP: 0018:ffff8808c6cc7c18  EFLAGS: 00010206
>> [   75.518940] RAX: 0000000000000000 RBX: 6b6b6b6b6b6b6b7b RCX: 0000000000000006
>> [   75.518940] RDX: ffffffff81e98ae0 RSI: ffff8808ccc93080 RDI: 6b6b6b6b6b6b6b7b
> 
> The magic is the %rdi value. The instruction that oopses is
> 
>     mov    0x38(%rdi),%eax
> 
> and "rdi" is 0x10 + the magic 6b6b6b.. pattern. Which is obviously
> 'poison_free'.
> 
> And the 0x10 is because get_device() does
> 
>     return dev ? to_dev(kobject_get(&dev->kobj)) : NULL;
> 
> and I bet "kobj" is at offset 16 in the device structure. So we had a
> pointer to a "struct device", but it was loaded from memory that was
> free'd, turning the kobject pointer into that 0x6b6b6b6b6b6b6b7b
> 
> So somebody got a pointer from free'd memory. That somebody seems to
> be 'klist_devices_get()' that got it from a 'struct klist_node', so I
> think we have free'd something from the klist_devices list in the bus.
> But I dunno. Odd. I would have expected us to hit that invalid pointer
> long before if the klist entry was bogus.
> 
> I'm not seeing anything obvious in mce.c. But the fact that it's that
> magic per_cpu allocation makes me nervous. It uses that magic
> "mce_device_initialized" bit array etc, and it clearly must have
> worked before, but it equally clearly does *not* work now.
> 
> Looking more at it, I think that maybe something keeps the mce_device
> around (refcounts that didn't use to exist before?) so when we
> unregister it, it is still in use. And then when we re-register it
> when we bring it up, we do that
> 
>     memset(&dev->kobj, 0, sizeof(struct kobject));
> 
> on the device that is in use. I dunno. It's all scary. Somebody who
> knows the MCE layer should look at it.
> 
>                    Linus
> 


YES!! Finally I have a fix for this whole MCE thing! :-)

The patch below works perfectly for me - I tested multiple CPU hotplug
operations as well as multiple pm_test runs at core level. Please let me
know if this solves the suspend issue as well..

Of course, the warnings at drivers/base/core.c: device_release()
as well as the IPI to offline cpu warnings still appear, but they are
unrelated to the issue being discussed and are harmless.
So, with this patch CPU hotplug doesn't crash the system and suspend and
hibernate are expected to work fine.
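
For context on the device_release() warning that remains: as far as I
understand the driver core, it looks for a release callback on the device,
then on its type, then on its class, and complains if none is set. Below is
a rough userspace model of that lookup. All the mock_* names and
counting_release() are invented for illustration; this is not the kernel
code itself.

```c
#include <stdio.h>
#include <stddef.h>

struct mock_device;
typedef void (*release_fn)(struct mock_device *dev);

struct mock_device_type { release_fn release; };
struct mock_class       { release_fn dev_release; };

struct mock_device {
    const char *name;
    release_fn release;                     /* per-device callback */
    const struct mock_device_type *type;    /* may be NULL */
    const struct mock_class *class_;        /* may be NULL */
};

/* Example callback that only counts how often it ran. */
int release_calls;

void counting_release(struct mock_device *dev)
{
    (void)dev;
    release_calls++;
}

/*
 * Model of the lookup order: device, then type, then class.
 * Returns 1 if a callback ran, 0 if the "broken device" warning fired.
 */
int mock_device_release(struct mock_device *dev)
{
    if (dev->release) {
        dev->release(dev);
        return 1;
    }
    if (dev->type && dev->type->release) {
        dev->type->release(dev);
        return 1;
    }
    if (dev->class_ && dev->class_->dev_release) {
        dev->class_->dev_release(dev);
        return 1;
    }
    fprintf(stderr,
            "Device '%s' does not have a release() function, "
            "it is broken and must be fixed.\n", dev->name);
    return 0;
}
```

A device set up with none of the three callbacks, as the per-CPU
machinecheck devices appear to be, takes the last branch every time it is
unregistered.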

-------
From: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Subject: [PATCH] x86/mce: Fix CPU hotplug and suspend regression related to MCE

Commit 8a25a2f (cpu: convert 'cpu' and 'machinecheck' sysdev_class
to a regular subsystem) changed how things are dealt with in
the MCE subsystem. Some of the things that got broken due to this
are CPU hotplug and suspend/hibernate.

MCE uses per_cpu allocations of struct device. So, when a CPU goes
offline and comes back online, in order to ensure that we start
from a clean slate with respect to the MCE subsystem, zero out the
entire per_cpu device structure before using it.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 arch/x86/kernel/cpu/mcheck/mce.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index f22a9f7..29ba329 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -2011,7 +2011,7 @@ static __cpuinit int mce_device_create(unsigned int cpu)
 	if (!mce_available(&boot_cpu_data))
 		return -EIO;
 
-	memset(&dev->kobj, 0, sizeof(struct kobject));
+	memset(dev, 0, sizeof(struct device));
 	dev->id  = cpu;
 	dev->bus = &mce_subsys;
 



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: x86/mce: machine check warning during poweroff
  2012-01-14  2:41           ` Srivatsa S. Bhat
@ 2012-01-14  2:53             ` Linus Torvalds
  2012-01-14  3:05               ` Srivatsa S. Bhat
  2012-01-14 14:49               ` Greg KH
  2012-01-17  2:21             ` Suresh Siddha
  1 sibling, 2 replies; 49+ messages in thread
From: Linus Torvalds @ 2012-01-14  2:53 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck,
	Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-kernel,
	Greg Kroah-Hartman, Kay Sievers, gouders, Marcos Souza,
	Linux PM mailing list, Rafael J. Wysocki, tglx, prasad,
	justinmattock, Jeff Chua, Suresh B Siddha, Peter Zijlstra,
	Mel Gorman, Gilad Ben-Yossef

On Fri, Jan 13, 2012 at 6:41 PM, Srivatsa S. Bhat
<srivatsa.bhat@linux.vnet.ibm.com> wrote:
>
> YES!! Finally I have a fix for this whole MCE thing! :-)

Goodie.

> The patch below works perfectly for me - I tested multiple CPU hotplug
> operations as well as multiple pm_test runs at core level. Please let me
> know if this solves the suspend issue as well..

Ok, I'll try, and I bet it does.

HOWEVER.

I'd be a whole lot happier knowing exactly which field in "struct
device" needed to be NULL before it gets registered.

I don't like how

  device_register() + device_create_file(dev)..

is not sufficiently undone by

 .. device_remove_file(dev) +  device_unregister()

so that it can't be repeated. Exactly *what* state is stale and
re-used incorrectly if you do that device_register() a second time?

It smells like a misfeature of the device core handling.

But that does obviously explain why this started happening with a
fairly straightforward conversion from sysdev to struct device. It
just makes me worry about any *other* such conversions.

Of course, normal users will allocate and free the memory, so never
see this "re-use the same piece of memory" issue. But still..

                 Linus


* Re: x86/mce: machine check warning during poweroff
  2012-01-14  2:53             ` Linus Torvalds
@ 2012-01-14  3:05               ` Srivatsa S. Bhat
  2012-01-14  3:10                 ` Linus Torvalds
  2012-01-14 14:49               ` Greg KH
  1 sibling, 1 reply; 49+ messages in thread
From: Srivatsa S. Bhat @ 2012-01-14  3:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck,
	Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-kernel,
	Greg Kroah-Hartman, Kay Sievers, gouders, Marcos Souza,
	Linux PM mailing list, Rafael J. Wysocki, tglx, prasad,
	justinmattock, Jeff Chua, Suresh B Siddha, Peter Zijlstra,
	Mel Gorman, Gilad Ben-Yossef

On 01/14/2012 08:23 AM, Linus Torvalds wrote:

> On Fri, Jan 13, 2012 at 6:41 PM, Srivatsa S. Bhat
> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>>
>> YES!! Finally I have a fix for this whole MCE thing! :-)
> 
> Goodie.
> 
>> The patch below works perfectly for me - I tested multiple CPU hotplug
>> operations as well as multiple pm_test runs at core level. Please let me
>> know if this solves the suspend issue as well..
> 
> Ok, I'll try, and I bet it does.
> 
> HOWEVER.
> 
> I'd be a whole lot happier knowing exactly which field in "struct
> device" that needed to be NULL before it gets registered.
> 
> I don't like how
> 
>   device_register() + device_create_file(dev)..
> 
> is not sufficiently undone by
> 
>  .. device_remove_file(dev) +  device_unregister()
> 
> so that it can't be repeated. Exactly *what* state is stale and
> re-used incorrectly if you do that device_register() a second time.
> 
> It smells like a misfeature of the device core handling.
> 
> But that does obviously explain why this started happening with a
> fairly straightforward conversion from sysdev to struct device. It
> just makes me worry about any *other* such conversions.
> 
> Of course, normal users will allocate and free the memory, so never
> see this "re-use the same piece of memory" issue. But still..
> 

I totally agree with you. I had set out to find out *exactly* what
was going wrong myself. After spending a significant amount of time digging
through the code (unsuccessfully), the idea of zeroing out everything struck
me, and it worked, as expected. Yes, it is definitely important to know the
exact issue so that we can fix the driver core and avoid other mishaps,
but I guess finding that out is not all that simple.. as of now I am
rather exhausted from following those zillions of pointers continuously
for the past few hours.. ;-/

Regards,
Srivatsa S. Bhat
IBM Linux Technology Center



* Re: x86/mce: machine check warning during poweroff
  2012-01-14  3:05               ` Srivatsa S. Bhat
@ 2012-01-14  3:10                 ` Linus Torvalds
  2012-01-14  3:18                   ` Srivatsa S. Bhat
  2012-01-14  5:15                   ` Tony Luck
  0 siblings, 2 replies; 49+ messages in thread
From: Linus Torvalds @ 2012-01-14  3:10 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck,
	Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-kernel,
	Greg Kroah-Hartman, Kay Sievers, gouders, Marcos Souza,
	Linux PM mailing list, Rafael J. Wysocki, tglx, prasad,
	justinmattock, Jeff Chua, Suresh B Siddha, Peter Zijlstra,
	Mel Gorman, Gilad Ben-Yossef

On Fri, Jan 13, 2012 at 7:05 PM, Srivatsa S. Bhat
<srivatsa.bhat@linux.vnet.ibm.com> wrote:
>> On Fri, Jan 13, 2012 at 6:41 PM, Srivatsa S. Bhat
>> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>>
>>> The patch below works perfectly for me - I tested multiple CPU hotplug
>>> operations as well as multiple pm_test runs at core level. Please let me
>>> know if this solves the suspend issue as well..
>>
>> Ok, I'll try, and I bet it does.

Yup, I can suspend and resume multiple times now.

> I totally agree with you. Even I had set out to find out *exactly* what
> was going wrong. After spending significant amount of time digging through
> the code (unsuccessfully), this idea of zeroing out everything struck me
> and it worked, as expected. Yes, it is definitely important to know the
> exact issue so that we can fix the driver core and avoid other mishaps,
> but I guess finding that out is not all that simple.. as of now I am
> rather exhausted following those zillions of pointers continuously
> for the past few hours.. ;-/

Fair enough. The many layers of clever indirection are really really
nasty to follow, I agree.

                   Linus


* Re: x86/mce: machine check warning during poweroff
  2012-01-14  3:10                 ` Linus Torvalds
@ 2012-01-14  3:18                   ` Srivatsa S. Bhat
  2012-01-14  3:41                     ` Linus Torvalds
  2012-01-14  5:15                   ` Tony Luck
  1 sibling, 1 reply; 49+ messages in thread
From: Srivatsa S. Bhat @ 2012-01-14  3:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck,
	Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-kernel,
	Greg Kroah-Hartman, Kay Sievers, gouders, Marcos Souza,
	Linux PM mailing list, Rafael J. Wysocki, tglx, prasad,
	justinmattock, Jeff Chua, Suresh B Siddha, Peter Zijlstra,
	Mel Gorman, Gilad Ben-Yossef

On 01/14/2012 08:40 AM, Linus Torvalds wrote:

> On Fri, Jan 13, 2012 at 7:05 PM, Srivatsa S. Bhat
> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>>> On Fri, Jan 13, 2012 at 6:41 PM, Srivatsa S. Bhat
>>> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>>>
>>>> The patch below works perfectly for me - I tested multiple CPU hotplug
>>>> operations as well as multiple pm_test runs at core level. Please let me
>>>> know if this solves the suspend issue as well..
>>>
>>> Ok, I'll try, and I bet it does.
> 
> Yup, I can suspend and resume multiple times now.


Wow! Feels good to hear that, at last! :-)

By the way, in my excitement, while sending the patch I forgot to give
credit to the problem reporters with a "Reported-by". Hope that gets
resolved.

> 
>> I totally agree with you. Even I had set out to find out *exactly* what
>> was going wrong. After spending significant amount of time digging through
>> the code (unsuccessfully), this idea of zeroing out everything struck me
>> and it worked, as expected. Yes, it is definitely important to know the
>> exact issue so that we can fix the driver core and avoid other mishaps,
>> but I guess finding that out is not all that simple.. as of now I am
>> rather exhausted following those zillions of pointers continuously
>> for the past few hours.. ;-/
> 
> Fair enough. The many layers of clever indirection are really really
> nasty to follow, I agree.
> 


:-)
 
Regards,
Srivatsa S. Bhat
IBM Linux Technology Center



* Re: x86/mce: machine check warning during poweroff
  2012-01-14  3:18                   ` Srivatsa S. Bhat
@ 2012-01-14  3:41                     ` Linus Torvalds
  0 siblings, 0 replies; 49+ messages in thread
From: Linus Torvalds @ 2012-01-14  3:41 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck,
	Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-kernel,
	Greg Kroah-Hartman, Kay Sievers, gouders, Marcos Souza,
	Linux PM mailing list, Rafael J. Wysocki, tglx, prasad,
	justinmattock, Jeff Chua, Suresh B Siddha, Peter Zijlstra,
	Mel Gorman, Gilad Ben-Yossef

On Fri, Jan 13, 2012 at 7:18 PM, Srivatsa S. Bhat
<srivatsa.bhat@linux.vnet.ibm.com> wrote:
>
> By the way, in my excitement, while sending the patch I forgot to give
> credits to the problem reporters with a "Reported-by". Hope that gets
> resolved.

.. and in my excitement to get it fixed, I committed and pushed it out
without thinking about it either.

So no, reporters didn't get any glory this time.

Sucks to be them,

                 Linus


* Re: x86/mce: machine check warning during poweroff
  2012-01-14  3:10                 ` Linus Torvalds
  2012-01-14  3:18                   ` Srivatsa S. Bhat
@ 2012-01-14  5:15                   ` Tony Luck
  1 sibling, 0 replies; 49+ messages in thread
From: Tony Luck @ 2012-01-14  5:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Srivatsa S. Bhat, Ming Lei, Djalal Harouni, Borislav Petkov,
	Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-kernel,
	Greg Kroah-Hartman, Kay Sievers, gouders, Marcos Souza,
	Linux PM mailing list, Rafael J. Wysocki, tglx, prasad,
	justinmattock, Jeff Chua, Suresh B Siddha, Peter Zijlstra,
	Mel Gorman, Gilad Ben-Yossef

On Fri, Jan 13, 2012 at 7:10 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> Fair enough. The many layers of clever indirection are really really
> nasty to follow, I agree.

You could cheat and bisect (not git bisect, numerical bisect). We know that:

      memset(&dev->kobj, 0, sizeof(struct kobject));

is not enough. And that:

      memset(dev, 0, sizeof(struct device));

works. The kobject is near the beginning of struct device. So try clearing
all the other bits from start of struct device up to "kobj" ... if that doesn't
help, then play with clearing various sized chunks after. Should be able to
zoom in on the culprit field ... then go look at the code.
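
A rough userspace sketch of this kind of numerical bisect follows. It is a
variant of the idea rather than the exact procedure described above: it
binary-searches for the smallest prefix of the structure that has to be
cleared before registration succeeds. The fake_device layout, the
"leftover" culprit field and fake_register() are all hypothetical; in the
real kernel each probe would mean a rebuild and a hotplug test.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout; the real struct device is much larger. */
struct fake_device {
	unsigned long before;		/* harmless when stale */
	struct { int refcount; } kobj;
	unsigned long leftover;		/* hypothetical culprit field */
	unsigned long after;		/* harmless when stale */
};

/* Stand-in for device_register(): succeeds only if the culprit is clear. */
static int fake_register(const struct fake_device *dev)
{
	return dev->leftover == 0;
}

/* Clear the first n bytes of a copy of *stale and try to register it. */
static int works_after_clearing_prefix(const struct fake_device *stale,
				       size_t n)
{
	struct fake_device copy = *stale;

	assert(n <= sizeof(copy));
	memset(&copy, 0, n);
	return fake_register(&copy);
}

/*
 * Smallest n such that clearing bytes [0, n) makes registration work.
 * Assumes clearing the whole structure works, as the patch showed.
 */
static size_t smallest_working_prefix(const struct fake_device *stale)
{
	size_t lo = 0, hi = sizeof(*stale);

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (works_after_clearing_prefix(stale, mid))
			hi = mid;	/* prefix of mid bytes is enough */
		else
			lo = mid + 1;	/* need to clear more */
	}
	return lo;
}
```

The byte at offset n - 1 lies inside the last field that must be zeroed;
comparing it with offsetof() values for the structure's members names the
field to go and read the code for.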

-Tony


* Re: x86/mce: machine check warning during poweroff
  2012-01-14  2:53             ` Linus Torvalds
  2012-01-14  3:05               ` Srivatsa S. Bhat
@ 2012-01-14 14:49               ` Greg KH
  2012-01-14 16:30                 ` Alan Stern
  2012-01-16 18:11                 ` Greg KH
  1 sibling, 2 replies; 49+ messages in thread
From: Greg KH @ 2012-01-14 14:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Srivatsa S. Bhat, Ming Lei, Djalal Harouni, Borislav Petkov,
	Tony Luck, Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-kernel,
	Kay Sievers, gouders, Marcos Souza, Linux PM mailing list,
	Rafael J. Wysocki, tglx, prasad, justinmattock, Jeff Chua,
	Suresh B Siddha, Peter Zijlstra, Mel Gorman, Gilad Ben-Yossef

On Fri, Jan 13, 2012 at 06:53:04PM -0800, Linus Torvalds wrote:
> On Fri, Jan 13, 2012 at 6:41 PM, Srivatsa S. Bhat
> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
> >
> > YES!! Finally I have a fix for this whole MCE thing! :-)
> 
> Goodie.
> 
> > The patch below works perfectly for me - I tested multiple CPU hotplug
> > operations as well as multiple pm_test runs at core level. Please let me
> > know if this solves the suspend issue as well..
> 
> Ok, I'll try, and I bet it does.
> 
> HOWEVER.
> 
> I'd be a whole lot happier knowing exactly which field in "struct
> device" that needed to be NULL before it gets registered.
> 
> I don't like how
> 
>   device_register() + device_create_file(dev)..
> 
> is not sufficiently undone by
> 
>  .. device_remove_file(dev) +  device_unregister()
> 
> so that it can't be repeated. Exactly *what* state is stale and
> re-used incorrectly if you do that device_register() a second time.
> 
> It smells like a misfeature of the device core handling.

It has to do with the fact that this is a "static" device that is being
reused.  Normally it would be cleaned up properly in the release
function, but as there isn't one, some fields are being left in a bad
state.

I'll look into this more closely on Sunday when I have the chance; I'm
currently on the road until late tonight, skiing, and it's hard to write
patches from a chair lift...

thanks,

greg k-h


* Re: x86/mce: machine check warning during poweroff
  2012-01-14 14:49               ` Greg KH
@ 2012-01-14 16:30                 ` Alan Stern
  2012-01-14 20:45                   ` Jeff Chua
  2012-01-15  2:05                   ` Tony Luck
  2012-01-16 18:11                 ` Greg KH
  1 sibling, 2 replies; 49+ messages in thread
From: Alan Stern @ 2012-01-14 16:30 UTC (permalink / raw)
  To: Greg KH
  Cc: Linus Torvalds, Srivatsa S. Bhat, Ming Lei, Djalal Harouni,
	Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar,
	Andi Kleen, linux-kernel, Kay Sievers, gouders, Marcos Souza,
	Linux PM mailing list, Rafael J. Wysocki, tglx, prasad,
	justinmattock, Jeff Chua, Suresh B Siddha, Peter Zijlstra,
	Mel Gorman, Gilad Ben-Yossef

On Sat, 14 Jan 2012, Greg KH wrote:

> On Fri, Jan 13, 2012 at 06:53:04PM -0800, Linus Torvalds wrote:
> > On Fri, Jan 13, 2012 at 6:41 PM, Srivatsa S. Bhat
> > <srivatsa.bhat@linux.vnet.ibm.com> wrote:
> > >
> > > YES!! Finally I have a fix for this whole MCE thing! :-)
> > 
> > Goodie.
> > 
> > > The patch below works perfectly for me - I tested multiple CPU hotplug
> > > operations as well as multiple pm_test runs at core level. Please let me
> > > know if this solves the suspend issue as well..
> > 
> > Ok, I'll try, and I bet it does.
> > 
> > HOWEVER.
> > 
> > I'd be a whole lot happier knowing exactly which field in "struct
> > device" that needed to be NULL before it gets registered.
> > 
> > I don't like how
> > 
> >   device_register() + device_create_file(dev)..
> > 
> > is not sufficiently undone by
> > 
> >  .. device_remove_file(dev) +  device_unregister()
> > 
> > so that it can't be repeated. Exactly *what* state is stale and
> > re-used incorrectly if you do that device_register() a second time.
> > 
> > It smells like a misfeature of the device core handling.
> 
> It has to do with the fact that this is a "static" device that is being
> reused.  Normally it would be cleaned up properly in the release
> function, but as there isn't one, some fields are being left in a bad
> state.

That's exactly right.  In general, device structures should never be
reused.  Apart from the reinitialization issues, in the general case
you have the problem that the references to the previous incarnation
may not all have been dropped.  Now, perhaps in the MCE case you _do_
know that they're all gone (I can't tell), but relying on it is
dangerous.

The driver core isn't designed to handle device structures that get
unregistered and then spring back to life; callers are supposed to
allocate a fresh new structure instead.  (We had to solve this very
same problem in the USB subsystem a number of years ago; figuring it
all out was tricky even back then.)  And this is true regardless of
whether the original structure was allocated dynamically or not.
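
The rule can be sketched in userspace C as follows. This is only a model
under stated assumptions: fake_device, fake_get(), fake_put() and the
live_devices counter are hypothetical names, calloc() stands in for
kzalloc(), and the real driver core does its reference counting through
the embedded kobject rather than a plain int.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical refcounted object with a release callback. */
struct fake_device {
	int refcount;
	unsigned int id;
	void (*release)(struct fake_device *dev);
};

static int live_devices;	/* objects created but not yet released */

/* Called when the last reference goes away; frees the memory. */
static void fake_release(struct fake_device *dev)
{
	live_devices--;
	free(dev);
}

/* Always hand out a fresh, zeroed structure; never recycle an old one. */
static struct fake_device *fake_device_create(unsigned int id)
{
	struct fake_device *dev = calloc(1, sizeof(*dev));

	if (!dev)
		return NULL;
	dev->refcount = 1;
	dev->id = id;
	dev->release = fake_release;
	live_devices++;
	return dev;
}

static void fake_get(struct fake_device *dev)
{
	assert(dev->refcount > 0);
	dev->refcount++;
}

static void fake_put(struct fake_device *dev)
{
	assert(dev->refcount > 0);
	if (--dev->refcount == 0)
		dev->release(dev);
}
```

If some other holder still has a reference when the owner drops its own,
the old object simply lives on until that holder calls fake_put(), while
the next registration for the same id gets a separate object from
fake_device_create(). The two incarnations never share memory.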

Alan Stern



* Re: x86/mce: machine check warning during poweroff
  2012-01-14 16:30                 ` Alan Stern
@ 2012-01-14 20:45                   ` Jeff Chua
  2012-01-15  2:05                   ` Tony Luck
  1 sibling, 0 replies; 49+ messages in thread
From: Jeff Chua @ 2012-01-14 20:45 UTC (permalink / raw)
  To: Alan Stern
  Cc: Greg KH, Linus Torvalds, Srivatsa S. Bhat, Ming Lei,
	Djalal Harouni, Borislav Petkov, Tony Luck, Hidetoshi Seto,
	Ingo Molnar, Andi Kleen, linux-kernel, Kay Sievers, gouders,
	Marcos Souza, Linux PM mailing list, Rafael J. Wysocki, tglx,
	prasad, justinmattock, Suresh B Siddha, Peter Zijlstra,
	Mel Gorman, Gilad Ben-Yossef

On Sun, Jan 15, 2012 at 12:30 AM, Alan Stern <stern@rowland.harvard.edu> wrote:
> On Sat, 14 Jan 2012, Greg KH wrote:
>
>> On Fri, Jan 13, 2012 at 06:53:04PM -0800, Linus Torvalds wrote:
>> > On Fri, Jan 13, 2012 at 6:41 PM, Srivatsa S. Bhat
>> > <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>> > >
>> > > YES!! Finally I have a fix for this whole MCE thing! :-)

Yeah! This fixed suspend/resume too (re: Linux-3.2.0 Pid: 1902,
comm: modprobe Not tainted 3.2.0 #64).

Now, I'm onto another issue with bluetooth. Bluetooth works well with
3.0.17 (maybe 3.1.xx too), but the latest git freezes the whole system
during pairing. Not good. Not sure whether this is related to the MCE.


Jeff.


* Re: x86/mce: machine check warning during poweroff
  2012-01-14 16:30                 ` Alan Stern
  2012-01-14 20:45                   ` Jeff Chua
@ 2012-01-15  2:05                   ` Tony Luck
  2012-01-15  2:34                     ` Greg KH
  1 sibling, 1 reply; 49+ messages in thread
From: Tony Luck @ 2012-01-15  2:05 UTC (permalink / raw)
  To: Alan Stern
  Cc: Greg KH, Linus Torvalds, Srivatsa S. Bhat, Ming Lei,
	Djalal Harouni, Borislav Petkov, Hidetoshi Seto, Ingo Molnar,
	Andi Kleen, linux-kernel, Kay Sievers, gouders, Marcos Souza,
	Linux PM mailing list, Rafael J. Wysocki, tglx, prasad,
	justinmattock, Jeff Chua, Suresh B Siddha, Peter Zijlstra,
	Mel Gorman, Gilad Ben-Yossef

On Sat, Jan 14, 2012 at 8:30 AM, Alan Stern <stern@rowland.harvard.edu> wrote:
> The driver core isn't designed to handle device structures that get
> unregistered and then spring back to life; callers are supposed to
> allocate a fresh new structure instead.

Does every caller use "kzalloc()" for this fresh new structure (or even
for the original registration) rather than "kmalloc()"?

-Tony


* Re: x86/mce: machine check warning during poweroff
  2012-01-15  2:05                   ` Tony Luck
@ 2012-01-15  2:34                     ` Greg KH
  2012-01-15  3:36                       ` Alan Stern
  0 siblings, 1 reply; 49+ messages in thread
From: Greg KH @ 2012-01-15  2:34 UTC (permalink / raw)
  To: Tony Luck
  Cc: Alan Stern, Linus Torvalds, Srivatsa S. Bhat, Ming Lei,
	Djalal Harouni, Borislav Petkov, Hidetoshi Seto, Ingo Molnar,
	Andi Kleen, linux-kernel, Kay Sievers, gouders, Marcos Souza,
	Linux PM mailing list, Rafael J. Wysocki, tglx, prasad,
	justinmattock, Jeff Chua, Suresh B Siddha, Peter Zijlstra,
	Mel Gorman, Gilad Ben-Yossef

On Sat, Jan 14, 2012 at 06:05:24PM -0800, Tony Luck wrote:
> On Sat, Jan 14, 2012 at 8:30 AM, Alan Stern <stern@rowland.harvard.edu> wrote:
> > The driver core isn't designed to handle device structures that get
> > unregistered and then spring back to life; callers are supposed to
> > allocate a fresh new structure instead.
> 
> Does every caller use "kzalloc()" for this fresh new structure (or even
> for the original registration) rather than "kmalloc()"?

They better, as that's what they are required to do :)

The kernel would be crashing with slab debugging enabled, if it wasn't
so today.

thanks,

greg k-h


* Re: x86/mce: machine check warning during poweroff
  2012-01-15  2:34                     ` Greg KH
@ 2012-01-15  3:36                       ` Alan Stern
  2012-01-16 18:15                         ` Greg KH
  0 siblings, 1 reply; 49+ messages in thread
From: Alan Stern @ 2012-01-15  3:36 UTC (permalink / raw)
  To: Greg KH
  Cc: Tony Luck, Linus Torvalds, Srivatsa S. Bhat, Ming Lei,
	Djalal Harouni, Borislav Petkov, Hidetoshi Seto, Ingo Molnar,
	Andi Kleen, linux-kernel, Kay Sievers, gouders, Marcos Souza,
	Linux PM mailing list, Rafael J. Wysocki, tglx, prasad,
	justinmattock, Jeff Chua, Suresh B Siddha, Peter Zijlstra,
	Mel Gorman, Gilad Ben-Yossef

On Sat, 14 Jan 2012, Greg KH wrote:

> On Sat, Jan 14, 2012 at 06:05:24PM -0800, Tony Luck wrote:
> > On Sat, Jan 14, 2012 at 8:30 AM, Alan Stern <stern@rowland.harvard.edu> wrote:
> > > The driver core isn't designed to handle device structures that get
> > > unregistered and then spring back to life; callers are supposed to
> > > allocate a fresh new structure instead.
> > 
> > Does every caller use "kzalloc()" for this fresh new structure (or even
> > for the original registration) rather than "kmalloc()"?
> 
> They better, as that's what they are required to do :)
> 
> The kernel would be crashing with slab debugging enabled, if it wasn't
> so today.

These are both important restrictions on the use of the driver core,
and they don't seem to be explained anywhere in the kernel source.  
Would you like to receive a documentation patch?

Alan Stern



* Re: x86/mce: machine check warning during poweroff
  2012-01-14 14:49               ` Greg KH
  2012-01-14 16:30                 ` Alan Stern
@ 2012-01-16 18:11                 ` Greg KH
  2012-01-16 18:27                   ` Luck, Tony
  2012-01-16 18:42                   ` Kay Sievers
  1 sibling, 2 replies; 49+ messages in thread
From: Greg KH @ 2012-01-16 18:11 UTC (permalink / raw)
  To: Linus Torvalds, Kay Sievers
  Cc: Srivatsa S. Bhat, Ming Lei, Djalal Harouni, Borislav Petkov,
	Tony Luck, Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-kernel,
	Kay Sievers, gouders, Marcos Souza, Linux PM mailing list,
	Rafael J. Wysocki, tglx, prasad, justinmattock, Jeff Chua,
	Suresh B Siddha, Peter Zijlstra, Mel Gorman, Gilad Ben-Yossef

On Sat, Jan 14, 2012 at 06:49:38AM -0800, Greg KH wrote:
> On Fri, Jan 13, 2012 at 06:53:04PM -0800, Linus Torvalds wrote:
> > On Fri, Jan 13, 2012 at 6:41 PM, Srivatsa S. Bhat
> > <srivatsa.bhat@linux.vnet.ibm.com> wrote:
> > >
> > > YES!! Finally I have a fix for this whole MCE thing! :-)
> > 
> > Goodie.
> > 
> > > The patch below works perfectly for me - I tested multiple CPU hotplug
> > > operations as well as multiple pm_test runs at core level. Please let me
> > > know if this solves the suspend issue as well..
> > 
> > Ok, I'll try, and I bet it does.
> > 
> > HOWEVER.
> > 
> > I'd be a whole lot happier knowing exactly which field in "struct
> > device" that needed to be NULL before it gets registered.
> > 
> > I don't like how
> > 
> >   device_register() + device_create_file(dev)..
> > 
> > is not sufficiently undone by
> > 
> >  .. device_remove_file(dev) +  device_unregister()
> > 
> > so that it can't be repeated. Exactly *what* state is stale and
> > re-used incorrectly if you do that device_register() a second time.
> > 
> > It smells like a misfeature of the device core handling.
> 
> It has to do with the fact that this is a "static" device that is being
> reused.  Normally it would be cleaned up properly in the release
> function, but as there isn't one, some fields are being left in a bad
> state.

Kay, I looked at this this morning, and it comes down to the line:

DEFINE_PER_CPU(struct device, mce_device);

Where we are creating static struct device variables.  I'm guessing this
is just done for "convenience" as we really don't care about where in
memory these structures are, we just want to make sure we have enough of
them around (this is the way all the other mce per-cpu structures are
handled.)

I couldn't figure out a "simple" way to create a variable per cpu here,
dynamically.  I tried doing something like:
	struct device *mce_device[CONFIG_NR_CPUS];
and dynamically creating the devices, setting the array value to NULL when
they are unregistered, and letting them clean up in the release function.
But does that race with creating the device again?

It seems that this would work, but I'm probably missing something
obvious here, any ideas?
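
For illustration, the pointer-array idea might look roughly like this in
userspace C. Everything here is an assumption made for the sketch:
NR_CPUS_SKETCH, mce_device_slot, slot_create() and slot_remove() are
hypothetical names, calloc()/free() stand in for the kernel allocators,
and the code is single-threaded, so it says nothing about the race asked
about above.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_CPUS_SKETCH 4	/* hypothetical stand-in for CONFIG_NR_CPUS */

struct fake_device {
	unsigned int id;
	void (*release)(struct fake_device *dev);
};

/* One pointer per CPU; NULL means "no device registered". */
static struct fake_device *mce_device_slot[NR_CPUS_SKETCH];

static void fake_release(struct fake_device *dev)
{
	free(dev);
}

/* CPU comes online: allocate a fresh zeroed device for its slot. */
static int slot_create(unsigned int cpu)
{
	struct fake_device *dev;

	assert(cpu < NR_CPUS_SKETCH);
	if (mce_device_slot[cpu])
		return -1;	/* already registered */

	dev = calloc(1, sizeof(*dev));
	if (!dev)
		return -1;
	dev->id = cpu;
	dev->release = fake_release;
	mce_device_slot[cpu] = dev;
	return 0;
}

/* CPU goes offline: forget the pointer, then let release free it. */
static void slot_remove(unsigned int cpu)
{
	struct fake_device *dev;

	assert(cpu < NR_CPUS_SKETCH);
	dev = mce_device_slot[cpu];
	if (!dev)
		return;
	mce_device_slot[cpu] = NULL;
	dev->release(dev);	/* in the kernel, the final put triggers this */
}
```

Because the slot is cleared before the old structure is released, the next
slot_create() for the same CPU never sees or reuses the previous
incarnation.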

The "correct" way to fix this up would be to have a per-cpu structure
for all of the different mce things that are created in this driver
(struct device, struct mce, exception counts, work queues, polling
banks, etc.), but that seems pretty messy, and I imagine some of these
want to stay as-is for some performance issues.  As I don't know this
code at all, I'm a bit leery to make that kind of change.

thanks,

greg k-h


* Re: x86/mce: machine check warning during poweroff
  2012-01-15  3:36                       ` Alan Stern
@ 2012-01-16 18:15                         ` Greg KH
  0 siblings, 0 replies; 49+ messages in thread
From: Greg KH @ 2012-01-16 18:15 UTC (permalink / raw)
  To: Alan Stern
  Cc: Tony Luck, Linus Torvalds, Srivatsa S. Bhat, Ming Lei,
	Djalal Harouni, Borislav Petkov, Hidetoshi Seto, Ingo Molnar,
	Andi Kleen, linux-kernel, Kay Sievers, gouders, Marcos Souza,
	Linux PM mailing list, Rafael J. Wysocki, tglx, prasad,
	justinmattock, Jeff Chua, Suresh B Siddha, Peter Zijlstra,
	Mel Gorman, Gilad Ben-Yossef

On Sat, Jan 14, 2012 at 10:36:53PM -0500, Alan Stern wrote:
> On Sat, 14 Jan 2012, Greg KH wrote:
> 
> > On Sat, Jan 14, 2012 at 06:05:24PM -0800, Tony Luck wrote:
> > > On Sat, Jan 14, 2012 at 8:30 AM, Alan Stern <stern@rowland.harvard.edu> wrote:
> > > > The driver core isn't designed to handle device structures that get
> > > > unregistered and then spring back to life; callers are supposed to
> > > > allocate a fresh new structure instead.
> > > 
> > > Does every caller use "kzalloc()" for this fresh new structure (or even
> > > for the original registration) rather than "kmalloc()"?
> > 
> > They better, as that's what they are required to do :)
> > 
> > The kernel would be crashing with slab debugging enabled, if it wasn't
> > so today.
> 
> These are both important restrictions on the use of the driver core,
> and they don't seem to be explained anywhere in the kernel source.  
> Would you like to receive a documentation patch?

For some reason I thought it was documented somewhere, but I can't seem
to find that right now.  So yes, a patch documenting it would be great
to have.

thanks,

greg k-h


* RE: x86/mce: machine check warning during poweroff
  2012-01-16 18:11                 ` Greg KH
@ 2012-01-16 18:27                   ` Luck, Tony
  2012-01-16 18:34                     ` Greg KH
  2012-01-16 18:42                   ` Kay Sievers
  1 sibling, 1 reply; 49+ messages in thread
From: Luck, Tony @ 2012-01-16 18:27 UTC (permalink / raw)
  To: Greg KH, Linus Torvalds, Kay Sievers
  Cc: Srivatsa S. Bhat, Ming Lei, Djalal Harouni, Borislav Petkov,
	Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-kernel,
	Kay Sievers, gouders, Marcos Souza, Linux PM mailing list,
	Rafael J. Wysocki, tglx, prasad, justinmattock, Jeff Chua,
	Siddha, Suresh B, Peter Zijlstra, Mel Gorman, Gilad Ben-Yossef

> The "correct" way to fix this up would be to have a per-cpu structure
> for all of the different mce things that are created in this driver
> (struct device, struct mce, exception counts, work queues, polling
> banks, etc.), but that seems pretty messy, and I imagine some of these
> want to stay as-is for some performance issues.  As I don't know this
> code at all, I'm a bit leery to make that kind of change.

If you get so many machine checks that you care about the performance
of the handler - you may be worrying about the wrong things.

I'm more concerned about maintainability of the code. Seto-san has
submitted many patches re-grouping the functions inside mce.c into
functional areas - keeping the data structures separated makes
sense - especially if there is some goal of splitting mce.c into
separate files.

-Tony


* Re: x86/mce: machine check warning during poweroff
  2012-01-16 18:27                   ` Luck, Tony
@ 2012-01-16 18:34                     ` Greg KH
  0 siblings, 0 replies; 49+ messages in thread
From: Greg KH @ 2012-01-16 18:34 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Linus Torvalds, Kay Sievers, Srivatsa S. Bhat, Ming Lei,
	Djalal Harouni, Borislav Petkov, Hidetoshi Seto, Ingo Molnar,
	Andi Kleen, linux-kernel, gouders, Marcos Souza,
	Linux PM mailing list, Rafael J. Wysocki, tglx, prasad,
	justinmattock, Jeff Chua, Siddha, Suresh B, Peter Zijlstra,
	Mel Gorman, Gilad Ben-Yossef

On Mon, Jan 16, 2012 at 06:27:16PM +0000, Luck, Tony wrote:
> > The "correct" way to fix this up would be to have a per-cpu structure
> > for all of the different mce things that are created in this driver
> > (struct device, struct mce, exception counts, work queues, polling
> > banks, etc.), but that seems pretty messy, and I imagine some of these
> > want to stay as-is for some performance issues.  As I don't know this
> > code at all, I'm a bit leery to make that kind of change.
> 
> If you get so many machine checks that you care about the performance
> of the handler - you may be worrying about the wrong things.
> 
> I'm more concerned about maintainability of the code. Seto-san has
> submitted many patches re-grouping the functions inside mce.c into
> functional areas - keeping the data structures separated makes
> sense - especially if there is some goal of splitting mce.c into
> separate files.

Ok, I'll leave that alone, and just focus on the struct device stuff, as
that I know can't be performance critical :)

thanks,

greg k-h


* Re: x86/mce: machine check warning during poweroff
  2012-01-16 18:11                 ` Greg KH
  2012-01-16 18:27                   ` Luck, Tony
@ 2012-01-16 18:42                   ` Kay Sievers
  1 sibling, 0 replies; 49+ messages in thread
From: Kay Sievers @ 2012-01-16 18:42 UTC (permalink / raw)
  To: Greg KH
  Cc: Linus Torvalds, Srivatsa S. Bhat, Ming Lei, Djalal Harouni,
	Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar,
	Andi Kleen, linux-kernel, gouders, Marcos Souza,
	Linux PM mailing list, Rafael J. Wysocki, tglx, prasad,
	justinmattock, Jeff Chua, Suresh B Siddha, Peter Zijlstra,
	Mel Gorman, Gilad Ben-Yossef

On Mon, Jan 16, 2012 at 19:11, Greg KH <gregkh@suse.de> wrote:

>> It has to do with the fact that this is a "static" device that is being
>> reused.  Normally it would be cleaned up properly in the release
>> function, but as there isn't one, some fields are being left in a bad
>> state.
>
> Kay, I looked at this this morning, and it comes down to the line:
>
> DEFINE_PER_CPU(struct device, mce_device);
>
> Where we are creating static struct device variables.  I'm guessing this
> is just done for "convenience" as we really don't care about where in
> memory these structures are, we just want to make sure we have enough of
> them around (this is the way all the other mce per-cpu structures are
> handled.)

It's a pretty common pattern across the CPU handling code. I don't
know the details, but I doubt many of them are needed or really useful
in the context which they are used. So, it's either convenience or
just 'hoping stuff could be made without managing memory'. :)

Having the 'machinecheck' devices pre-allocated per CPU might not make
too much sense, because the driver core operations will still need to
allocate stuff dynamically at register() time anyway. Attaching
devices to interfaces and drivers of a subsystem will require the same
thing, even the simplest kobject allocates the name dynamically.

I guess, the driver core part of machinecheck could just be fully
dynamic, instead of statically pre-allocated.
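
A minimal userspace sketch of what "fully dynamic" could look like,
assuming a refcounted object whose last put calls a release() callback.
struct toy_device and all toy_* functions are hypothetical stand-ins for
illustration; they are not the driver core API.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a refcounted device object. */
struct toy_device {
	int refcount;
	char *name;                              /* allocated at create time */
	void (*release)(struct toy_device *dev); /* called on the last put */
};

int toy_released; /* counts release() calls, for demonstration only */

/* Frees everything that was allocated for the device. */
void toy_release(struct toy_device *dev)
{
	free(dev->name);
	free(dev);
	toy_released++;
}

/* Allocates a device dynamically and wires up its release() callback. */
struct toy_device *toy_device_create(const char *name)
{
	struct toy_device *dev = calloc(1, sizeof(*dev));

	if (!dev)
		return NULL;
	dev->name = malloc(strlen(name) + 1);
	if (!dev->name) {
		free(dev);
		return NULL;
	}
	strcpy(dev->name, name);
	dev->refcount = 1;
	dev->release = toy_release;
	return dev;
}

void toy_device_get(struct toy_device *dev)
{
	dev->refcount++;
}

/* Drops one reference; the object goes away when the count hits zero. */
void toy_device_put(struct toy_device *dev)
{
	assert(dev->refcount > 0);
	if (--dev->refcount == 0) {
		/* A device without release() would be the broken case. */
		assert(dev->release != NULL);
		dev->release(dev);
	}
}
```

In this model every registration starts from a freshly zeroed object, so
nothing from a previous use of the same device can be left in a bad state.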

Kay

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: x86/mce: machine check warning during poweroff
  2012-01-14  2:41           ` Srivatsa S. Bhat
  2012-01-14  2:53             ` Linus Torvalds
@ 2012-01-17  2:21             ` Suresh Siddha
  2012-01-17  9:52               ` Srivatsa S. Bhat
  1 sibling, 1 reply; 49+ messages in thread
From: Suresh Siddha @ 2012-01-17  2:21 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Linus Torvalds, Ming Lei, Djalal Harouni, Borislav Petkov,
	Tony Luck, Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-kernel,
	Greg Kroah-Hartman, Kay Sievers, gouders, Marcos Souza,
	Linux PM mailing list, Rafael J. Wysocki, tglx, prasad,
	justinmattock, Jeff Chua, Peter Zijlstra, Mel Gorman,
	Gilad Ben-Yossef

On Sat, 2012-01-14 at 08:11 +0530, Srivatsa S. Bhat wrote:
> Of course, the warnings at drivers/base/core.c: device_release()
> as well as the IPI to offline cpu warnings still appear but are rather
> unrelated and harmless to the issue being discussed.

As far as the IPI offline cpu warnings are concerned, appended patch should
fix it. Can you please give it a try? Peterz, can you please review and
queue it after Srivatsa confirms that it works? Thanks.
---

From: Suresh Siddha <suresh.b.siddha@intel.com>
Subject: sched, nohz: fix nohz cpu idle load balancing state with cpu hotplug

With the recent nohz scheduler changes, rq's nohz flag 'NOHZ_TICK_STOPPED'
and its associated state doesn't get cleared immediately after the
cpu exits idle. This gets cleared as part of the next tick seen on that cpu.

With the cpu offline, we need to clear this state manually. Fix it by
registering a cpu notifier which clears the nohz idle load balance
state for this rq explicitly.

Reported-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
 kernel/sched/fair.c |   40 +++++++++++++++++++++++++++++++++++-----
 1 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8e42de9..defc92d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4862,6 +4862,15 @@ static void nohz_balancer_kick(int cpu)
 	return;
 }
 
+static inline void clear_nohz_tick_stopped(int cpu)
+{
+	if (unlikely(test_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu)))) {
+		cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
+		atomic_dec(&nohz.nr_cpus);
+		clear_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu));
+	}
+}
+
 static inline void set_cpu_sd_state_busy(void)
 {
 	struct sched_domain *sd;
@@ -4900,6 +4909,12 @@ void select_nohz_load_balancer(int stop_tick)
 {
 	int cpu = smp_processor_id();
 
+	/*
+ 	 * If this cpu is going down, then nothing needs to be done.
+ 	 */
+	if (!cpu_active(cpu))
+		return;
+
 	if (stop_tick) {
 		if (test_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu)))
 			return;
@@ -4910,6 +4925,24 @@ void select_nohz_load_balancer(int stop_tick)
 	}
 	return;
 }
+
+static void __cpuinit __sched_ilb_notifier(void *info)
+{
+	clear_nohz_tick_stopped(smp_processor_id());
+}
+
+static int __cpuinit sched_ilb_notifier(struct notifier_block *nfb,
+					unsigned long action, void *hcpu)
+{
+	switch (action & ~CPU_TASKS_FROZEN) {
+	case CPU_DOWN_PREPARE:
+		smp_call_function_single((unsigned long) hcpu,
+					 __sched_ilb_notifier, NULL, 1);
+		return NOTIFY_OK;
+	default:
+		return NOTIFY_DONE;
+	}
+}
 #endif
 
 static DEFINE_SPINLOCK(balancing);
@@ -5066,11 +5099,7 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
 	* busy tick after returning from idle, we will update the busy stats.
 	*/
 	set_cpu_sd_state_busy();
-	if (unlikely(test_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu)))) {
-		clear_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu));
-		cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
-		atomic_dec(&nohz.nr_cpus);
-	}
+	clear_nohz_tick_stopped(cpu);
 
 	/*
 	 * None are in tickless mode and hence no need for NOHZ idle load
@@ -5586,6 +5615,7 @@ __init void init_sched_fair_class(void)
 
 #ifdef CONFIG_NO_HZ
 	zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);
+	cpu_notifier(sched_ilb_notifier, 0);
 #endif
 #endif /* SMP */
 



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: x86/mce: machine check warning during poweroff
  2012-01-17  2:21             ` Suresh Siddha
@ 2012-01-17  9:52               ` Srivatsa S. Bhat
  2012-01-17 16:15                 ` Jeff Chua
  2012-01-18  3:17                 ` Suresh Siddha
  0 siblings, 2 replies; 49+ messages in thread
From: Srivatsa S. Bhat @ 2012-01-17  9:52 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Linus Torvalds, Ming Lei, Djalal Harouni, Borislav Petkov,
	Tony Luck, Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-kernel,
	Greg Kroah-Hartman, Kay Sievers, gouders, Marcos Souza,
	Linux PM mailing list, Rafael J. Wysocki, tglx, prasad,
	justinmattock, Jeff Chua, Peter Zijlstra, Mel Gorman,
	Gilad Ben-Yossef

On 01/17/2012 07:51 AM, Suresh Siddha wrote:

> On Sat, 2012-01-14 at 08:11 +0530, Srivatsa S. Bhat wrote:
>> Of course, the warnings at drivers/base/core.c: device_release()
>> as well as the IPI to offline cpu warnings still appear but are rather
>> unrelated and harmless to the issue being discussed.
> 
> As far as the IPI offline cpu warnings are concerned, appended patch should
> fix it. Can you please give it a try? Peterz, can you please review and
> queue it after Srivatsa confirms that it works? Thanks.


Hi Suresh,

Thanks for the patch, but unfortunately it doesn't fix the problem!
Exactly the same stack traces are seen during a CPU Hotplug stress test.
(I didn't even have to stress it - it is so fragile that just a script
to offline all cpus except the boot cpu was good enough to reproduce the
problem easily.)

[  562.269083] ------------[ cut here ]------------
[  562.273079] WARNING: at arch/x86/kernel/smp.c:120 native_smp_send_reschedule+0x59/0x60()
[  562.273079] Hardware name: IBM System x -[7870C4Q]-
[  562.273079] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf microcode fuse loop dm_mod iTCO_wdt i7core_edac i2c_i801 ioatdma cdc_ether i2c_core tpm_tis bnx2 shpchp usbnet pcspkr mii iTCO_vendor_support edac_core serio_raw dca sg rtc_cmos tpm tpm_bios pci_hotplug button uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
[  562.273079] Pid: 6, comm: migration/0 Not tainted 3.2.0-sureshipi-0.0.0.28.36b5ec9-default #2
[  562.273079] Call Trace:
[  562.273079]  <IRQ>  [<ffffffff810213d9>] ? native_smp_send_reschedule+0x59/0x60
[  562.273079]  [<ffffffff8103cf4a>] warn_slowpath_common+0x7a/0xb0
[  562.273079]  [<ffffffff8103cf95>] warn_slowpath_null+0x15/0x20
[  562.273079]  [<ffffffff810213d9>] native_smp_send_reschedule+0x59/0x60
[  562.273079]  [<ffffffff81082d65>] trigger_load_balance+0x185/0x500
[  562.273079]  [<ffffffff81082d9b>] ? trigger_load_balance+0x1bb/0x500
[  562.273079]  [<ffffffff81073db7>] scheduler_tick+0x107/0x170
[  562.273079]  [<ffffffff8104e6f7>] update_process_times+0x67/0x80
[  562.273079]  [<ffffffff8109c64f>] tick_sched_timer+0x5f/0xc0
[  562.273079]  [<ffffffff8109c5f0>] ? tick_nohz_handler+0x100/0x100
[  562.273079]  [<ffffffff8106a85e>] __run_hrtimer+0x12e/0x330
[  562.273079]  [<ffffffff8106aca7>] hrtimer_interrupt+0xc7/0x1f0
[  562.273079]  [<ffffffff81022ff4>] smp_apic_timer_interrupt+0x64/0xa0
[  562.273079]  [<ffffffff814a2a33>] apic_timer_interrupt+0x73/0x80
[  562.273079]  <EOI>  [<ffffffff810c563a>] ? stop_machine_cpu_stop+0xda/0x130
[  562.273079]  [<ffffffff810c5560>] ? stop_one_cpu_nowait+0x50/0x50
[  562.273079]  [<ffffffff810c5279>] cpu_stopper_thread+0xd9/0x1b0
[  562.273079]  [<ffffffff81498ddf>] ? _raw_spin_unlock_irqrestore+0x3f/0x80
[  562.273079]  [<ffffffff810c51a0>] ? res_counter_init+0x50/0x50
[  562.273079]  [<ffffffff810a2add>] ? trace_hardirqs_on_caller+0x12d/0x1b0
[  562.273079]  [<ffffffff810a2b6d>] ? trace_hardirqs_on+0xd/0x10
[  562.273079]  [<ffffffff810c51a0>] ? res_counter_init+0x50/0x50
[  562.273079]  [<ffffffff8106553e>] kthread+0x9e/0xb0
[  562.273079]  [<ffffffff814a3334>] kernel_thread_helper+0x4/0x10
[  562.273079]  [<ffffffff81499174>] ? retint_restore_args+0x13/0x13
[  562.273079]  [<ffffffff810654a0>] ? __init_kthread_worker+0x70/0x70
[  562.273079]  [<ffffffff814a3330>] ? gs_change+0x13/0x13
[  562.273079] ---[ end trace 4efec5b2532b902d ]---


I have a few questions regarding the synchronization with CPU Hotplug.
What guarantees that the code which selects and IPIs the new ilb is totally
race-free with respect to CPU hotplug and we will never IPI an offline CPU?

(In 3.2-rc7 I hadn't hit the IPI to offline cpu issue (the above stack trace)
as far as I remember.)

While trying to figure out what changed in the 3.3 merge window, I added a
WARN_ON in the 3.2-rc7 kernel as shown below:

static void nohz_balancer_kick(int cpu)
{
        ....

        if (!cpu_rq(ilb_cpu)->nohz_balance_kick) {
                cpu_rq(ilb_cpu)->nohz_balance_kick = 1;

                smp_mb();
                /*
                 * Use smp_send_reschedule() instead of resched_cpu().
                 * This way we generate a sched IPI on the target cpu which
                 * is idle. And the softirq performing nohz idle load balance
                 * will be run before returning from the IPI.
                 */
==========>      if (!cpu_active(ilb_cpu))
==========>             WARN_ON(1);
                smp_send_reschedule(ilb_cpu);
        }
        return;
}

As expected, I hit this warning during my CPU hotplug stress tests. I am sure
this happens on the latest kernel too (3.3 merge window), since that part of
the code has apparently not changed in this respect.

So, while selecting the new ilb, why are we not careful enough to ensure we
don't select a cpu that is going offline? Is this by design (to avoid some
overhead) or is this a bug? (As demonstrated above, this issue is in 3.2-rc7
as well.)

And the only reason I can think why we did not hit the "IPI to offline CPU"
issue in 3.2-rc7 kernel is that the race window (with CPU offline) was
probably too small and _not_ because we explicitly synchronized with CPU
Hotplug.

Probably I am missing something obvious... I would be grateful if you could
kindly help me understand how this works.

Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: x86/mce: machine check warning during poweroff
  2012-01-17  9:52               ` Srivatsa S. Bhat
@ 2012-01-17 16:15                 ` Jeff Chua
  2012-01-17 16:36                   ` Srivatsa S. Bhat
  2012-01-18  3:17                 ` Suresh Siddha
  1 sibling, 1 reply; 49+ messages in thread
From: Jeff Chua @ 2012-01-17 16:15 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Suresh Siddha, Linus Torvalds, Ming Lei, Djalal Harouni,
	Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar,
	Andi Kleen, linux-kernel, Greg Kroah-Hartman, Kay Sievers,
	gouders, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki,
	tglx, prasad, justinmattock, Peter Zijlstra, Mel Gorman,
	Gilad Ben-Yossef

On Tue, Jan 17, 2012 at 5:52 PM, Srivatsa S. Bhat
<srivatsa.bhat@linux.vnet.ibm.com> wrote:
> On 01/17/2012 07:51 AM, Suresh Siddha wrote:
>
>> On Sat, 2012-01-14 at 08:11 +0530, Srivatsa S. Bhat wrote:
>>> Of course, the warnings at drivers/base/core.c: device_release()
>>> as well as the IPI to offline cpu warnings still appear but are rather
>>> unrelated and harmless to the issue being discussed.
>>
>> As far as the IPI offline cpu warnings are concerned, appended patch should
>> fix it. Can you please give it a try? Peterz, can you please review and
>> queue it after Srivatsa confirms that it works? Thanks.
>
>
> Hi Suresh,
>
> Thanks for the patch, but unfortunately it doesn't fix the problem!
> Exactly the same stack traces are seen during a CPU Hotplug stress test.
> (I didn't even have to stress it - it is so fragile that just a script
> to offline all cpus except the boot cpu was good enough to reproduce the
> problem easily.)

Works for me. But I'm still seeing this only during boot. Related?
Shall I bisect?


Freeing unused kernel memory: 520k freed
Write protecting the kernel read-only data: 8192k
Freeing unused kernel memory: 1140k freed
Freeing unused kernel memory: 464k freed
Adding 8290300k swap on /dev/sda3.  Priority:-1 extents:1 across:8290300k SS
vmalloc: allocation failure: 0 bytes
modprobe: page allocation failure: order:0, mode:0xd2
Pid: 1914, comm: modprobe Not tainted 3.2.0 #6
Call Trace:
[<ffffffff8107c1ff>] ? 0xffffffff8107c1ff
[<ffffffff81061fec>] ? 0xffffffff81061fec
[<ffffffff8109ab6c>] ? 0xffffffff8109ab6c
[<ffffffff81061fec>] ? 0xffffffff81061fec
[<ffffffff8101bacc>] ? 0xffffffff8101bacc
[<ffffffff81061fec>] ? 0xffffffff81061fec
[<ffffffff81061fec>] ? 0xffffffff81061fec
[<ffffffff81062ec8>] ? 0xffffffff81062ec8
[<ffffffff810637c1>] ? 0xffffffff810637c1
[<ffffffff814d9cb9>] ? 0xffffffff814d9cb9
Mem-Info:
Node 0 DMA per-cpu:
CPU    0: hi:    0, btch:   1 usd:   0
CPU    1: hi:    0, btch:   1 usd:   0
CPU    2: hi:    0, btch:   1 usd:   0
CPU    3: hi:    0, btch:   1 usd:   0
Node 0 DMA32 per-cpu:
CPU    0: hi:  186, btch:  31 usd: 158
CPU    1: hi:  186, btch:  31 usd:  25
CPU    2: hi:  186, btch:  31 usd:   0
CPU    3: hi:  186, btch:  31 usd:   0
Node 0 Normal per-cpu:
CPU    0: hi:  186, btch:  31 usd:  93
CPU    1: hi:  186, btch:  31 usd:  74
CPU    2: hi:  186, btch:  31 usd: 170
CPU    3: hi:  186, btch:  31 usd:  60
active_anon:6162 inactive_anon:1 isolated_anon:0
active_file:1782 inactive_file:5164 isolated_file:0
unevictable:0 dirty:0 writeback:0 unstable:0
free:1963131 slab_reclaimable:818 slab_unreclaimable:2728
mapped:1639 shmem:3 pagetables:292 bounce:0


Thanks,
Jeff

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: x86/mce: machine check warning during poweroff
  2012-01-17 16:15                 ` Jeff Chua
@ 2012-01-17 16:36                   ` Srivatsa S. Bhat
  0 siblings, 0 replies; 49+ messages in thread
From: Srivatsa S. Bhat @ 2012-01-17 16:36 UTC (permalink / raw)
  To: Jeff Chua
  Cc: Suresh Siddha, Linus Torvalds, Ming Lei, Djalal Harouni,
	Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar,
	Andi Kleen, linux-kernel, Greg Kroah-Hartman, Kay Sievers,
	gouders, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki,
	tglx, prasad, justinmattock, Peter Zijlstra, Mel Gorman,
	Gilad Ben-Yossef

On 01/17/2012 09:45 PM, Jeff Chua wrote:

> On Tue, Jan 17, 2012 at 5:52 PM, Srivatsa S. Bhat
> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>> On 01/17/2012 07:51 AM, Suresh Siddha wrote:
>>
>>> On Sat, 2012-01-14 at 08:11 +0530, Srivatsa S. Bhat wrote:
>>>> Of course, the warnings at drivers/base/core.c: device_release()
>>>> as well as the IPI to offline cpu warnings still appear but are rather
>>>> unrelated and harmless to the issue being discussed.
>>>
>>> As far as the IPI offline cpu warnings are concerned, appended patch should
>>> fix it. Can you please give it a try? Peterz, can you please review and
>>> queue it after Srivatsa confirms that it works? Thanks.
>>
>>
>> Hi Suresh,
>>
>> Thanks for the patch, but unfortunately it doesn't fix the problem!
>> Exactly the same stack traces are seen during a CPU Hotplug stress test.
>> (I didn't even have to stress it - it is so fragile that just a script
>> to offline all cpus except the boot cpu was good enough to reproduce the
>> problem easily.)
> 
> Works for me. But I'm still seeing this only during boot. Related?
> Shall I bisect?
> 
> 
> Freeing unused kernel memory: 520k freed
> Write protecting the kernel read-only data: 8192k
> Freeing unused kernel memory: 1140k freed
> Freeing unused kernel memory: 464k freed
> Adding 8290300k swap on /dev/sda3.  Priority:-1 extents:1 across:8290300k SS
> vmalloc: allocation failure: 0 bytes


This is a different problem. Not the same as the one Suresh's patch
intended to fix. Your case has something to do with memory allocation
failures. The problem I am facing is Inter-Processor Interrupts (IPIs)
being sent to CPUs that are going offline, after selecting them as the
new ilb (Idle load balancer).

> modprobe: page allocation failure: order:0, mode:0xd2
> Pid: 1914, comm: modprobe Not tainted 3.2.0 #6
> Call Trace:
> [<ffffffff8107c1ff>] ? 0xffffffff8107c1ff
> [<ffffffff81061fec>] ? 0xffffffff81061fec
> [<ffffffff8109ab6c>] ? 0xffffffff8109ab6c
> [<ffffffff81061fec>] ? 0xffffffff81061fec
> [<ffffffff8101bacc>] ? 0xffffffff8101bacc
> [<ffffffff81061fec>] ? 0xffffffff81061fec
> [<ffffffff81061fec>] ? 0xffffffff81061fec
> [<ffffffff81062ec8>] ? 0xffffffff81062ec8
> [<ffffffff810637c1>] ? 0xffffffff810637c1
> [<ffffffff814d9cb9>] ? 0xffffffff814d9cb9
> Mem-Info:
> Node 0 DMA per-cpu:
> CPU    0: hi:    0, btch:   1 usd:   0
> CPU    1: hi:    0, btch:   1 usd:   0
> CPU    2: hi:    0, btch:   1 usd:   0
> CPU    3: hi:    0, btch:   1 usd:   0
> Node 0 DMA32 per-cpu:
> CPU    0: hi:  186, btch:  31 usd: 158
> CPU    1: hi:  186, btch:  31 usd:  25
> CPU    2: hi:  186, btch:  31 usd:   0
> CPU    3: hi:  186, btch:  31 usd:   0
> Node 0 Normal per-cpu:
> CPU    0: hi:  186, btch:  31 usd:  93
> CPU    1: hi:  186, btch:  31 usd:  74
> CPU    2: hi:  186, btch:  31 usd: 170
> CPU    3: hi:  186, btch:  31 usd:  60
> active_anon:6162 inactive_anon:1 isolated_anon:0
> active_file:1782 inactive_file:5164 isolated_file:0
> unevictable:0 dirty:0 writeback:0 unstable:0
> free:1963131 slab_reclaimable:818 slab_unreclaimable:2728
> mapped:1639 shmem:3 pagetables:292 bounce:0
> 
>


Regards,
Srivatsa S. Bhat
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: x86/mce: machine check warning during poweroff
  2012-01-17  9:52               ` Srivatsa S. Bhat
  2012-01-17 16:15                 ` Jeff Chua
@ 2012-01-18  3:17                 ` Suresh Siddha
  2012-01-18 10:19                   ` Srivatsa S. Bhat
  2012-01-18 13:15                   ` Srivatsa S. Bhat
  1 sibling, 2 replies; 49+ messages in thread
From: Suresh Siddha @ 2012-01-18  3:17 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Linus Torvalds, Ming Lei, Djalal Harouni, Borislav Petkov,
	Tony Luck, Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-kernel,
	Greg Kroah-Hartman, Kay Sievers, gouders, Marcos Souza,
	Linux PM mailing list, Rafael J. Wysocki, tglx, prasad,
	justinmattock, Jeff Chua, Peter Zijlstra, Mel Gorman,
	Gilad Ben-Yossef

On Tue, 2012-01-17 at 15:22 +0530, Srivatsa S. Bhat wrote:
> Thanks for the patch, but unfortunately it doesn't fix the problem!
> Exactly the same stack traces are seen during a CPU Hotplug stress test.
> (I didn't even have to stress it - it is so fragile that just a script
> to offline all cpus except the boot cpu was good enough to reproduce the
> problem easily.)

hmm, that's weird. with the patch, sched_ilb_notifier() should have
cleared the cpu going offline from the nohz.idle_cpus_mask. And this
should have happened after that cpu is removed from active mask. So
no-one else should add that cpu back to the nohz.idle_cpus_mask and this
should prevent the issue from happening.
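
A minimal userspace sketch of that expectation, assuming the idle mask,
its counter, and the per-cpu flag can be modeled as plain integers.
struct toy_nohz and the toy_* functions are hypothetical; in particular,
picking the first cpu in the mask as the new ilb is an assumption of this
sketch, not a statement about the scheduler code.

```c
#include <assert.h>

#define TOY_NR_CPUS 16

/* Hypothetical model of the nohz bookkeeping discussed above. */
struct toy_nohz {
	unsigned long idle_cpus_mask; /* models nohz.idle_cpus_mask */
	int nr_cpus;                  /* models nohz.nr_cpus */
	unsigned long tick_stopped;   /* models the NOHZ_TICK_STOPPED flags */
};

/* A cpu stops its tick and enters tickless idle. */
void toy_set_tick_stopped(struct toy_nohz *n, int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	if (n->tick_stopped & (1UL << cpu))
		return;
	n->idle_cpus_mask |= 1UL << cpu;
	n->nr_cpus++;
	n->tick_stopped |= 1UL << cpu;
}

/* Same three steps as the helper in the patch, applied to the model. */
void toy_clear_nohz_tick_stopped(struct toy_nohz *n, int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	if (n->tick_stopped & (1UL << cpu)) {
		n->idle_cpus_mask &= ~(1UL << cpu);
		n->nr_cpus--;
		n->tick_stopped &= ~(1UL << cpu);
	}
}

/* Returns the first cpu in the idle mask, or -1 if the mask is empty. */
int toy_find_new_ilb(const struct toy_nohz *n)
{
	int cpu;

	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		if (n->idle_cpus_mask & (1UL << cpu))
			return cpu;
	return -1;
}
```

Once the hotplug path has run the clear step for a cpu, the model can no
longer return that cpu as the ilb, and a second clear is a no-op.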

I could reproduce the problem easily without the patch but when I
applied the patch I couldn't recreate the issue. Srivatsa, can you
please re-check the kernel you tested indeed has the fix?

re-Reviewing the code/patch also doesn't give me a hint.

> I have a few questions regarding the synchronization with CPU Hotplug.
> What guarantees that the code which selects and IPIs the new ilb is totally
> race-free with respect to CPU hotplug and we will never IPI an offline CPU?

So, nohz_balancer_kick() gets called only with interrupts disabled.
During that time (from selecting the ilb_cpu to sending the IPI), no cpu
can go offline, as the offline happens from the stop-machine process
context with interrupts disabled.

Only thing we need to make sure is the offlined cpu shouldn't be part of
the nohz.idle_cpus_mask and for post 3.2 code, posted patch ensures
that.

For 3.2 and before, when a cpu exits tickless idle, it gets removed from
the nohz.idle_cpus_mask (and also from the nohz.load_balancer). And if
the cpu is not in the active mask (while going offline), subsequent
calls to select_nohz_load_balancer() ensure that the cpu going down
doesn't update the nohz structures. So I thought 3.2 shouldn't exhibit
this problem.


> (As demonstrated above, this issue is in 3.2-rc7
> as well.)

hmm, don't think we ran into this before 3.2. So, what am I missing from
the above? I will try to reproduce it on 3.2 too.

thanks,
suresh


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: x86/mce: machine check warning during poweroff
  2012-01-18  3:17                 ` Suresh Siddha
@ 2012-01-18 10:19                   ` Srivatsa S. Bhat
  2012-01-18 13:15                   ` Srivatsa S. Bhat
  1 sibling, 0 replies; 49+ messages in thread
From: Srivatsa S. Bhat @ 2012-01-18 10:19 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Linus Torvalds, Ming Lei, Djalal Harouni, Borislav Petkov,
	Tony Luck, Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-kernel,
	Greg Kroah-Hartman, Kay Sievers, gouders, Marcos Souza,
	Linux PM mailing list, Rafael J. Wysocki, tglx, prasad,
	justinmattock, Jeff Chua, Peter Zijlstra, Mel Gorman,
	Gilad Ben-Yossef

On 01/18/2012 08:47 AM, Suresh Siddha wrote:

> On Tue, 2012-01-17 at 15:22 +0530, Srivatsa S. Bhat wrote:
>> Thanks for the patch, but unfortunately it doesn't fix the problem!
>> Exactly the same stack traces are seen during a CPU Hotplug stress test.
>> (I didn't even have to stress it - it is so fragile that just a script
>> to offline all cpus except the boot cpu was good enough to reproduce the
>> problem easily.)
> 
> hmm, that's weird. with the patch, sched_ilb_notifier() should have
> cleared the cpu going offline from the nohz.idle_cpus_mask. And this
> should have happened after that cpu is removed from active mask. So
> no-one else should add that cpu back to the nohz.idle_cpus_mask and this
> should prevent the issue from happening.
> 
> I could reproduce the problem easily without the patch but when I
> applied the patch I couldn't recreate the issue. Srivatsa, can you
> please re-check the kernel you tested indeed has the fix?
> 


I rechecked, and just to be sure, I compiled a fresh kernel with top
commit 6a48897 (3.3 merge window) and your patch applied on top of that
(Note that your patch applied with a bit of line fuzzing). And still I could
easily reproduce the issue with CPU hotplug stress test (a script to
offline/online cpus wildly.) Here is the script, for your reference:

#! /bin/bash

NUMBER_OF_CPUS=`ls -d /sys/devices/system/cpu/cpu[0-9]* | wc -l`

cd /sys/devices/system/cpu

while [ 1 ]
do
        for ((i=1; i < NUMBER_OF_CPUS; i++ ))
        do
                state=`cat cpu$i/online`
                if [ $state -eq 0 ]
                then
                        echo 1 > cpu$i/online
                else
                        echo 0 > cpu$i/online
                fi
        done
done

> re-Reviewing the code/patch also doesn't give me a hint.
> 
>> I have a few questions regarding the synchronization with CPU Hotplug.
>> What guarantees that the code which selects and IPIs the new ilb is totally
>> race-free with respect to CPU hotplug and we will never IPI an offline CPU?
> 
> So, nohz_balancer_kick() gets called only with interrupts disabled.
> During that time (from selecting the ilb_cpu to sending the IPI), no cpu
> can go offline, as the offline happens from the stop-machine process
> context with interrupts disabled.
>

> Only thing we need to make sure is the offlined cpu shouldn't be part of
> the nohz.idle_cpus_mask and for post 3.2 code, posted patch ensures
> that.
>

> For 3.2 and before, when a cpu exits tickless idle, it gets removed from
> the nohz.idle_cpus_mask (and also from the nohz.load_balancer). And if
> the cpu is not in the active mask (while going offline), subsequent
> calls to select_nohz_load_balancer() ensures that the cpu going down
> doesn't update the nohz structures. So I thought 3.2 shouldn't exhibit
> this problem.
> 
> 
>> (As demonstrated above, this issue is in 3.2-rc7
>> as well.)
> 
> hmm, don't think we ran into this before 3.2. So, what am I missing from
> the above? I will try to reproduce it on 3.2 too.
> 


Sorry, I think I wasn't very clear about this. I was trying to make a
distinction between the 2 things here: !cpu_active(cpu) vs cpu_is_offline(cpu)

The warning in native_smp_send_reschedule() is for cpu_is_offline(cpu). And I
didn't hit this warning in 3.2.

However, as mentioned in my previous post, I added a warning for
!cpu_active(cpu) as well within native_smp_send_reschedule() just to check
if we could end up choosing a CPU that is going to be offline soon, as the new
ilb. And I hit this particular warning (which *I* added for debugging) in 3.2.

But reading your explanation above (thanks a lot for that btw!), I now
understand that even if we happen to choose a soon-to-be-offline CPU as the
new ilb, that CPU cannot go offline until we are done with
smp_send_reschedule() because of the stop-machine code that you pointed out.
IOW, I think adding a warning for !cpu_active(cpu) was not the right thing to
do, because it wouldn't cause any harm anyway.
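
The distinction can be sketched with a small userspace model, assuming a
cpu going down first leaves the active mask and only later leaves the
online mask. struct toy_cpumasks and the toy_* functions are hypothetical
and only illustrate the two checks; they are not kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the two masks being compared. */
struct toy_cpumasks {
	unsigned long online; /* models the online mask */
	unsigned long active; /* models the active mask */
};

bool toy_cpu_online(const struct toy_cpumasks *m, int cpu)
{
	return (m->online >> cpu) & 1UL;
}

bool toy_cpu_active(const struct toy_cpumasks *m, int cpu)
{
	return (m->active >> cpu) & 1UL;
}

/* First stage of going down: the cpu stops being active. */
void toy_cpu_down_prepare(struct toy_cpumasks *m, int cpu)
{
	m->active &= ~(1UL << cpu);
}

/* Final stage: the cpu, already inactive, goes offline. */
void toy_cpu_offline(struct toy_cpumasks *m, int cpu)
{
	assert(!toy_cpu_active(m, cpu));
	m->online &= ~(1UL << cpu);
}

/* Models the offline check behind the warning: only offline cpus trip it. */
bool toy_ipi_would_warn(const struct toy_cpumasks *m, int cpu)
{
	return !toy_cpu_online(m, cpu);
}
```

In the window between the two stages the cpu is !active but still online,
so an IPI sent to it in this model does not trip the offline check.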

So the only thing we should be watching out for is: sending an IPI to an already
offline CPU (and not one which is yet to go offline). And unfortunately in
the current mainline (even with your patch applied) we still send IPIs to
offline CPUs ! :-(

Here is the log, in case it helps:


[  101.859907] lockdep: fixing up alternatives.
[  101.864370] Booting Node 0 Processor 9 APIC 0x3
[  101.869076] smpboot cpu 9: start_ip = 97000
[  101.884514] Calibrating delay loop (skipped) already calibrated this CPU
[  101.898302] NMI watchdog enabled, takes one hw-pmu counter.
[  101.910463] lockdep: fixing up alternatives.
[  101.915040] Booting Node 0 Processor 10 APIC 0x5
[  101.919912] smpboot cpu 10: start_ip = 97000
[  101.936872] ------------[ cut here ]------------
[  101.940864] WARNING: at arch/x86/kernel/smp.c:120 native_smp_send_reschedule+0x59/0x60()
[  101.940864] Hardware name: IBM System x -[7870C4Q]-
[  101.940864] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf microcode fuse loop dm_mod bnx2 i7core_edac shpchp ioatdma cdc_ether edac_core usbnet i2c_i801 pci_hotplug dca i2c_core iTCO_wdt iTCO_vendor_support pcspkr mii tpm_tis tpm tpm_bios serio_raw button rtc_cmos sg uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
[  101.972661] Pid: 6018, comm: migration/9 Tainted: G        W    3.2.0-6a48897_IPI_patched-0.0.0.28.36b5ec9-default #1
[  101.972661] Call Trace:
[  101.972661]  <IRQ>  [<ffffffff81021409>] ? native_smp_send_reschedule+0x59/0x60
[  101.972661]  [<ffffffff8103cfca>] warn_slowpath_common+0x7a/0xb0
[  101.972661]  [<ffffffff8103d015>] warn_slowpath_null+0x15/0x20
[  101.972661]  [<ffffffff81021409>] native_smp_send_reschedule+0x59/0x60
[  101.972661]  [<ffffffff81082de5>] trigger_load_balance+0x185/0x500
[  101.972661]  [<ffffffff81082e1b>] ? trigger_load_balance+0x1bb/0x500
[  101.972661]  [<ffffffff81073e37>] scheduler_tick+0x107/0x170
[  101.972661]  [<ffffffff8104e777>] update_process_times+0x67/0x80
[  101.972661]  [<ffffffff8109c6cf>] tick_sched_timer+0x5f/0xc0
[  101.972661]  [<ffffffff8109c670>] ? tick_nohz_handler+0x100/0x100
[  101.972661]  [<ffffffff8106a8de>] __run_hrtimer+0x12e/0x330
[  101.972661]  [<ffffffff8106ad27>] hrtimer_interrupt+0xc7/0x1f0
[  101.972661]  [<ffffffff81023024>] smp_apic_timer_interrupt+0x64/0xa0
[  101.972661]  [<ffffffff814a30f3>] apic_timer_interrupt+0x73/0x80
[  101.972661]  <EOI>  [<ffffffff810c56ba>] ? stop_machine_cpu_stop+0xda/0x130
[  101.972661]  [<ffffffff810c55e0>] ? stop_one_cpu_nowait+0x50/0x50
[  101.972661]  [<ffffffff810c52f9>] cpu_stopper_thread+0xd9/0x1b0
[  101.972661]  [<ffffffff8149948f>] ? _raw_spin_unlock_irqrestore+0x3f/0x80
[  101.972661]  [<ffffffff810c5220>] ? res_counter_init+0x50/0x50
[  101.972661]  [<ffffffff810a2b5d>] ? trace_hardirqs_on_caller+0x12d/0x1b0
[  101.972661]  [<ffffffff810a2bed>] ? trace_hardirqs_on+0xd/0x10
[  101.972661]  [<ffffffff810c5220>] ? res_counter_init+0x50/0x50
[  101.972661]  [<ffffffff810655be>] kthread+0x9e/0xb0
[  101.972661]  [<ffffffff814a39f4>] kernel_thread_helper+0x4/0x10
[  101.972661]  [<ffffffff81499834>] ? retint_restore_args+0x13/0x13
[  101.972661]  [<ffffffff81065520>] ? __init_kthread_worker+0x70/0x70
[  101.972661]  [<ffffffff814a39f0>] ? gs_change+0x13/0x13
[  101.972661] ---[ end trace 5ad49345f4e941c1 ]---
[  101.935708] Calibrating delay loop (skipped) already calibrated this CPU
[  101.976841] NMI watchdog enabled, takes one hw-pmu counter.
[  102.207744] lockdep: fixing up alternatives.
[  102.212183] Booting Node 0 Processor 11 APIC 0x7
[  102.216953] smpboot cpu 11: start_ip = 97000
[  102.232665] Calibrating delay loop (skipped) already calibrated this CPU
[  102.247822] NMI watchdog enabled, takes one hw-pmu counter.
[  102.263654] lockdep: fixing up alternatives.
[  102.268187] Booting Node 1 Processor 12 APIC 0x11
[  102.273072] smpboot cpu 12: start_ip = 97000
[  102.288903] Calibrating delay loop (skipped) already calibrated this CPU
[  102.300696] NMI watchdog enabled, takes one hw-pmu counter.
[  102.314589] lockdep: fixing up alternatives.
[  102.319049] Booting Node 1 Processor 13 APIC 0x13
[  102.323935] smpboot cpu 13: start_ip = 97000
[  102.340579] ------------[ cut here ]------------
[  102.344574] WARNING: at arch/x86/kernel/smp.c:120 native_smp_send_reschedule+0x59/0x60()
[  102.344574] Hardware name: IBM System x -[7870C4Q]-
[  102.344574] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf microcode fuse loop dm_mod bnx2 i7core_edac shpchp ioatdma cdc_ether edac_core usbnet i2c_i801 pci_hotplug dca i2c_core iTCO_wdt iTCO_vendor_support pcspkr mii tpm_tis tpm tpm_bios serio_raw button rtc_cmos sg uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
[  102.344574] Pid: 5945, comm: migration/1 Tainted: G        W    3.2.0-6a48897_IPI_patched-0.0.0.28.36b5ec9-default #1
[  102.344574] Call Trace:
[  102.344574]  <IRQ>  [<ffffffff81021409>] ? native_smp_send_reschedule+0x59/0x60
[  102.344574]  [<ffffffff8103cfca>] warn_slowpath_common+0x7a/0xb0
[  102.344574]  [<ffffffff8103d015>] warn_slowpath_null+0x15/0x20
[  102.344574]  [<ffffffff81021409>] native_smp_send_reschedule+0x59/0x60
[  102.344574]  [<ffffffff81082de5>] trigger_load_balance+0x185/0x500
[  102.344574]  [<ffffffff81082e1b>] ? trigger_load_balance+0x1bb/0x500
[  102.344574]  [<ffffffff81073e37>] scheduler_tick+0x107/0x170
[  102.344574]  [<ffffffff8104e777>] update_process_times+0x67/0x80
[  102.344574]  [<ffffffff8109c6cf>] tick_sched_timer+0x5f/0xc0
[  102.344574]  [<ffffffff8109c670>] ? tick_nohz_handler+0x100/0x100
[  102.344574]  [<ffffffff8106a8de>] __run_hrtimer+0x12e/0x330
[  102.344574]  [<ffffffff8106ad27>] hrtimer_interrupt+0xc7/0x1f0
[  102.344574]  [<ffffffff81023024>] smp_apic_timer_interrupt+0x64/0xa0
[  102.344574]  [<ffffffff814a30f3>] apic_timer_interrupt+0x73/0x80
[  102.344574]  <EOI>  [<ffffffff810c56ba>] ? stop_machine_cpu_stop+0xda/0x130
[  102.344574]  [<ffffffff810c55e0>] ? stop_one_cpu_nowait+0x50/0x50
[  102.344574]  [<ffffffff810c52f9>] cpu_stopper_thread+0xd9/0x1b0
[  102.344574]  [<ffffffff8149948f>] ? _raw_spin_unlock_irqrestore+0x3f/0x80
[  102.344574]  [<ffffffff810c5220>] ? res_counter_init+0x50/0x50
[  102.344574]  [<ffffffff810a2b5d>] ? trace_hardirqs_on_caller+0x12d/0x1b0
[  102.344574]  [<ffffffff810a2bed>] ? trace_hardirqs_on+0xd/0x10
[  102.344574]  [<ffffffff810c5220>] ? res_counter_init+0x50/0x50
[  102.344574]  [<ffffffff810655be>] kthread+0x9e/0xb0
[  102.344574]  [<ffffffff814a39f4>] kernel_thread_helper+0x4/0x10
[  102.344574]  [<ffffffff81499834>] ? retint_restore_args+0x13/0x13
[  102.344574]  [<ffffffff81065520>] ? __init_kthread_worker+0x70/0x70
[  102.344574]  [<ffffffff814a39f0>] ? gs_change+0x13/0x13
[  102.344574] ---[ end trace 5ad49345f4e941c2 ]---
[  102.339643] Calibrating delay loop (skipped) already calibrated this CPU
[  102.350294] NMI watchdog enabled, takes one hw-pmu counter.
[  102.612059] lockdep: fixing up alternatives.
[  102.616608] Booting Node 1 Processor 14 APIC 0x15
[  102.621550] smpboot cpu 14: start_ip = 97000
[  102.637305] Calibrating delay loop (skipped) already calibrated this CPU
[  102.655053] NMI watchdog enabled, takes one hw-pmu counter.
[  102.671266] lockdep: fixing up alternatives.
[  102.675736] Booting Node 1 Processor 15 APIC 0x17
[  102.680599] smpboot cpu 15: start_ip = 97000
[  102.696292] Calibrating delay loop (skipped) already calibrated this CPU
[  102.715012] NMI watchdog enabled, takes one hw-pmu counter.
[  102.754087] CPU 1 is now offline
[  102.760470] CPU 9 MCA banks CMCI:2 CMCI:3 CMCI:5
[  102.781900] CPU 2 is now offline
[  102.789950] CPU 10 MCA banks CMCI:2 CMCI:3 CMCI:5
[  102.810198] CPU 3 is now offline
[  102.818284] CPU 11 MCA banks CMCI:2 CMCI:3 CMCI:5
[  102.837816] CPU 4 is now offline
[  102.843629] CPU 5 MCA banks CMCI:6 CMCI:8
[  102.848689] CPU 12 MCA banks CMCI:2 CMCI:3 CMCI:5
[  102.856038] ------------[ cut here ]------------
[  102.860034] WARNING: at arch/x86/kernel/smp.c:120 native_smp_send_reschedule+0x59/0x60()
[  102.860034] Hardware name: IBM System x -[7870C4Q]-
[  102.860034] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf microcode fuse loop dm_mod bnx2 i7core_edac shpchp ioatdma cdc_ether edac_core usbnet i2c_i801 pci_hotplug dca i2c_core iTCO_wdt iTCO_vendor_support pcspkr mii tpm_tis tpm tpm_bios serio_raw button rtc_cmos sg uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
[  102.860034] Pid: 6108, comm: cat Tainted: G        W    3.2.0-6a48897_IPI_patched-0.0.0.28.36b5ec9-default #1
[  102.860034] Call Trace:
[  102.860034]  <IRQ>  [<ffffffff81021409>] ? native_smp_send_reschedule+0x59/0x60
[  102.860034]  [<ffffffff8103cfca>] warn_slowpath_common+0x7a/0xb0
[  102.860034]  [<ffffffff8103d015>] warn_slowpath_null+0x15/0x20
[  102.860034]  [<ffffffff81021409>] native_smp_send_reschedule+0x59/0x60
[  102.860034]  [<ffffffff81082de5>] trigger_load_balance+0x185/0x500
[  102.860034]  [<ffffffff81082e1b>] ? trigger_load_balance+0x1bb/0x500
[  102.860034]  [<ffffffff81073e37>] scheduler_tick+0x107/0x170
[  102.860034]  [<ffffffff8104e777>] update_process_times+0x67/0x80
[  102.860034]  [<ffffffff8109c6cf>] tick_sched_timer+0x5f/0xc0
[  102.860034]  [<ffffffff8109c670>] ? tick_nohz_handler+0x100/0x100
[  102.860034]  [<ffffffff8106a8de>] __run_hrtimer+0x12e/0x330
[  102.860034]  [<ffffffff8106ad27>] hrtimer_interrupt+0xc7/0x1f0
[  102.860034]  [<ffffffff81023024>] smp_apic_timer_interrupt+0x64/0xa0
[  102.860034]  [<ffffffff81198d21>] ? alloc_fd+0x161/0x240
[  102.860034]  [<ffffffff814a30f3>] apic_timer_interrupt+0x73/0x80
[  102.860034]  <EOI>  [<ffffffff81499834>] ? retint_restore_args+0x13/0x13
[  102.860034]  [<ffffffff810a45af>] ? lock_release+0xef/0x140
[  102.860034]  [<ffffffff814994ee>] _raw_spin_unlock+0x1e/0x40
[  102.860034]  [<ffffffff81198d21>] alloc_fd+0x161/0x240
[  102.860034]  [<ffffffff8117834a>] do_sys_open+0x13a/0x1e0
[  102.860034]  [<ffffffff8117842c>] sys_open+0x1c/0x20
[  102.860034]  [<ffffffff814a25b9>] system_call_fastpath+0x16/0x1b
[  102.860034] ---[ end trace 5ad49345f4e941c3 ]---
[  103.082896] CPU 5 is now offline
[  103.088558] CPU 6 MCA banks CMCI:6 CMCI:8
[  103.093662] CPU 13 MCA banks CMCI:2 CMCI:3 CMCI:5
[  103.111169] CPU 6 is now offline
[  103.116628] CPU 7 MCA banks CMCI:6 CMCI:8
[  103.121692] CPU 14 MCA banks CMCI:2 CMCI:3 CMCI:5
[  103.138452] CPU 7 is now offline
[  103.144914] CPU 12 MCA banks CMCI:6 CMCI:8
[  103.149758] CPU 15 MCA banks CMCI:2 CMCI:3 CMCI:5
[  103.163062] CPU 8 is now offline
[  103.179727] CPU 9 is now offline
[  103.194055] CPU 10 is now offline
[  103.208301] CPU 11 is now offline
[  103.220189] Broke affinity for irq 28
[  103.225078] CPU 12 is now offline
[  103.230105] CPU 13 MCA banks CMCI:6 CMCI:8
[  103.243175] CPU 13 is now offline
[  103.249904] CPU 14 MCA banks CMCI:6 CMCI:8
[  103.260233] CPU 14 is now offline
[  103.266841] CPU 15 MCA banks CMCI:6 CMCI:8
[  103.275590] CPU 15 is now offline
[  103.279063] lockdep: fixing up alternatives.
[  103.283469] SMP alternatives: switching to UP code
[  103.308541] lockdep: fixing up alternatives.
[  103.312971] SMP alternatives: switching to SMP code
[  103.321357] Booting Node 0 Processor 1 APIC 0x2
[  103.326024] smpboot cpu 1: start_ip = 97000
[  103.340003] Calibrating delay loop (skipped) already calibrated this CPU
[  103.349334] NMI watchdog enabled, takes one hw-pmu counter.
[  103.365025] lockdep: fixing up alternatives.
[  103.369504] Booting Node 0 Processor 2 APIC 0x4
[  103.374192] smpboot cpu 2: start_ip = 97000
[  103.388176] Calibrating delay loop (skipped) already calibrated this CPU
[  103.397666] NMI watchdog enabled, takes one hw-pmu counter.
[  103.413404] lockdep: fixing up alternatives.
[  103.417862] Booting Node 0 Processor 3 APIC 0x6
[  103.422567] smpboot cpu 3: start_ip = 97000
[  103.438144] Calibrating delay loop (skipped) already calibrated this CPU
[  103.446717] NMI watchdog enabled, takes one hw-pmu counter.
[  103.461218] lockdep: fixing up alternatives.
[  103.465640] Booting Node 1 Processor 4 APIC 0x10
[  103.465642] smpboot cpu 4: start_ip = 97000
[  103.476769] Calibrating delay loop (skipped) already calibrated this CPU
[  103.485894] NMI watchdog enabled, takes one hw-pmu counter.
[  103.499316] lockdep: fixing up alternatives.
[  103.503779] Booting Node 1 Processor 5 APIC 0x12
[  103.508544] smpboot cpu 5: start_ip = 97000
[  103.524071] Calibrating delay loop (skipped) already calibrated this CPU
[  103.533298] NMI watchdog enabled, takes one hw-pmu counter.
[  103.551041] lockdep: fixing up alternatives.
[  103.555526] Booting Node 1 Processor 6 APIC 0x14
[  103.560431] smpboot cpu 6: start_ip = 97000
[  103.572930] Calibrating delay loop (skipped) already calibrated this CPU
[  103.587702] NMI watchdog enabled, takes one hw-pmu counter.
[  103.603090] lockdep: fixing up alternatives.
[  103.607530] Booting Node 1 Processor 7 APIC 0x16
[  103.612295] smpboot cpu 7: start_ip = 97000
[  103.627878] Calibrating delay loop (skipped) already calibrated this CPU
[  103.639610] NMI watchdog enabled, takes one hw-pmu counter.
[  103.655102] lockdep: fixing up alternatives.
[  103.659586] Booting Node 0 Processor 8 APIC 0x1
[  103.664268] smpboot cpu 8: start_ip = 97000
[  103.680600] ------------[ cut here ]------------
[  103.684594] WARNING: at arch/x86/kernel/smp.c:120 native_smp_send_reschedule+0x59/0x60()
[  103.684594] Hardware name: IBM System x -[7870C4Q]-
[  103.684594] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf microcode fuse loop dm_mod bnx2 i7core_edac shpchp ioatdma cdc_ether edac_core usbnet i2c_i801 pci_hotplug dca i2c_core iTCO_wdt iTCO_vendor_support pcspkr mii tpm_tis tpm tpm_bios serio_raw button rtc_cmos sg uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
[  103.730409] Pid: 6199, comm: migration/3 Tainted: G        W    3.2.0-6a48897_IPI_patched-0.0.0.28.36b5ec9-default #1
[  103.730409] Call Trace:
[  103.730409]  <IRQ>  [<ffffffff81021409>] ? native_smp_send_reschedule+0x59/0x60
[  103.730409]  [<ffffffff8103cfca>] warn_slowpath_common+0x7a/0xb0
[  103.730409]  [<ffffffff8103d015>] warn_slowpath_null+0x15/0x20
[  103.730409]  [<ffffffff81021409>] native_smp_send_reschedule+0x59/0x60
[  103.730409]  [<ffffffff81082de5>] trigger_load_balance+0x185/0x500
[  103.730409]  [<ffffffff81082e1b>] ? trigger_load_balance+0x1bb/0x500
[  103.730409]  [<ffffffff81073e37>] scheduler_tick+0x107/0x170
[  103.730409]  [<ffffffff8104e777>] update_process_times+0x67/0x80
[  103.730409]  [<ffffffff8109c6cf>] tick_sched_timer+0x5f/0xc0
[  103.730409]  [<ffffffff8109c670>] ? tick_nohz_handler+0x100/0x100
[  103.730409]  [<ffffffff8106a8de>] __run_hrtimer+0x12e/0x330
[  103.730409]  [<ffffffff8106ad27>] hrtimer_interrupt+0xc7/0x1f0
[  103.730409]  [<ffffffff81023024>] smp_apic_timer_interrupt+0x64/0xa0
[  103.730409]  [<ffffffff814a30f3>] apic_timer_interrupt+0x73/0x80
[  103.730409]  <EOI>  [<ffffffff810c56ba>] ? stop_machine_cpu_stop+0xda/0x130
[  103.730409]  [<ffffffff810c55e0>] ? stop_one_cpu_nowait+0x50/0x50
[  103.730409]  [<ffffffff810c52f9>] cpu_stopper_thread+0xd9/0x1b0
[  103.730409]  [<ffffffff8149948f>] ? _raw_spin_unlock_irqrestore+0x3f/0x80
[  103.730409]  [<ffffffff810c5220>] ? res_counter_init+0x50/0x50
[  103.730409]  [<ffffffff810a2b5d>] ? trace_hardirqs_on_caller+0x12d/0x1b0
[  103.730409]  [<ffffffff810a2bed>] ? trace_hardirqs_on+0xd/0x10
[  103.730409]  [<ffffffff810c5220>] ? res_counter_init+0x50/0x50
[  103.730409]  [<ffffffff810655be>] kthread+0x9e/0xb0
[  103.730409]  [<ffffffff814a39f4>] kernel_thread_helper+0x4/0x10
[  103.730409]  [<ffffffff81499834>] ? retint_restore_args+0x13/0x13
[  103.730409]  [<ffffffff81065520>] ? __init_kthread_worker+0x70/0x70
[  103.730409]  [<ffffffff814a39f0>] ? gs_change+0x13/0x13
[  103.730409] ---[ end trace 5ad49345f4e941c4 ]---
[  103.676391] Calibrating delay loop (skipped) already calibrated this CPU
[  103.733199] NMI watchdog enabled, takes one hw-pmu counter.
[  103.949312] lockdep: fixing up alternatives.
[  103.953755] Booting Node 0 Processor 9 APIC 0x3
[  103.958455] smpboot cpu 9: start_ip = 97000
[  103.973999] Calibrating delay loop (skipped) already calibrated this CPU
[  103.987566] NMI watchdog enabled, takes one hw-pmu counter.
[  104.003382] lockdep: fixing up alternatives.
[  104.007844] Booting Node 0 Processor 10 APIC 0x5
[  104.014298] smpboot cpu 10: start_ip = 97000
[  104.029808] Calibrating delay loop (skipped) already calibrated this CPU
[  104.040937] NMI watchdog enabled, takes one hw-pmu counter.
[  104.057170] lockdep: fixing up alternatives.
[  104.061613] Booting Node 0 Processor 11 APIC 0x7
[  104.066383] smpboot cpu 11: start_ip = 97000
[  104.082333] Calibrating delay loop (skipped) already calibrated this CPU
[  104.097278] NMI watchdog enabled, takes one hw-pmu counter.
[  104.114759] lockdep: fixing up alternatives.
[  104.119185] Booting Node 1 Processor 12 APIC 0x11
[  104.124032] smpboot cpu 12: start_ip = 97000
[  104.140931] ------------[ cut here ]------------
[  104.144928] WARNING: at arch/x86/kernel/smp.c:120 native_smp_send_reschedule+0x59/0x60()
[  104.144928] Hardware name: IBM System x -[7870C4Q]-
[  104.144928] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf microcode fuse loop dm_mod bnx2 i7core_edac shpchp ioatdma cdc_ether edac_core usbnet i2c_i801 pci_hotplug dca i2c_core iTCO_wdt iTCO_vendor_support pcspkr mii tpm_tis tpm tpm_bios serio_raw button rtc_cmos sg uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
[  104.144928] Pid: 6246, comm: migration/8 Tainted: G        W    3.2.0-6a48897_IPI_patched-0.0.0.28.36b5ec9-default #1
[  104.144928] Call Trace:
[  104.144928]  <IRQ>  [<ffffffff81021409>] ? native_smp_send_reschedule+0x59/0x60
[  104.144928]  [<ffffffff8103cfca>] warn_slowpath_common+0x7a/0xb0
[  104.144928]  [<ffffffff8103d015>] warn_slowpath_null+0x15/0x20
[  104.236219]  [<ffffffff81021409>] native_smp_send_reschedule+0x59/0x60
[  104.236219]  [<ffffffff81082de5>] trigger_load_balance+0x185/0x500
[  104.236219]  [<ffffffff81082e1b>] ? trigger_load_balance+0x1bb/0x500
[  104.236219]  [<ffffffff81073e37>] scheduler_tick+0x107/0x170
[  104.236219]  [<ffffffff8104e777>] update_process_times+0x67/0x80
[  104.236219]  [<ffffffff8109c6cf>] tick_sched_timer+0x5f/0xc0
[  104.236219]  [<ffffffff8109c670>] ? tick_nohz_handler+0x100/0x100
[  104.236219]  [<ffffffff8106a8de>] __run_hrtimer+0x12e/0x330
[  104.236219]  [<ffffffff8106ad27>] hrtimer_interrupt+0xc7/0x1f0
[  104.236219]  [<ffffffff81023024>] smp_apic_timer_interrupt+0x64/0xa0
[  104.236219]  [<ffffffff814a30f3>] apic_timer_interrupt+0x73/0x80
[  104.236219]  <EOI>  [<ffffffff810c56ba>] ? stop_machine_cpu_stop+0xda/0x130
[  104.236219]  [<ffffffff810c55e0>] ? stop_one_cpu_nowait+0x50/0x50
[  104.236219]  [<ffffffff810c52f9>] cpu_stopper_thread+0xd9/0x1b0
[  104.236219]  [<ffffffff8149948f>] ? _raw_spin_unlock_irqrestore+0x3f/0x80
[  104.236219]  [<ffffffff810c5220>] ? res_counter_init+0x50/0x50
[  104.236219]  [<ffffffff810a2b5d>] ? trace_hardirqs_on_caller+0x12d/0x1b0
[  104.236219]  [<ffffffff810a2bed>] ? trace_hardirqs_on+0xd/0x10
[  104.236219]  [<ffffffff810c5220>] ? res_counter_init+0x50/0x50
[  104.236219]  [<ffffffff810655be>] kthread+0x9e/0xb0
[  104.236219]  [<ffffffff814a39f4>] kernel_thread_helper+0x4/0x10
[  104.236219]  [<ffffffff81499834>] ? retint_restore_args+0x13/0x13
[  104.236219]  [<ffffffff81065520>] ? __init_kthread_worker+0x70/0x70
[  104.236219]  [<ffffffff814a39f0>] ? gs_change+0x13/0x13
[  104.236219] ---[ end trace 5ad49345f4e941c5 ]---
[  104.139577] Calibrating delay loop (skipped) already calibrated this CPU
[  104.243898] NMI watchdog enabled, takes one hw-pmu counter.
[  104.414707] lockdep: fixing up alternatives.
[  104.419169] Booting Node 1 Processor 13 APIC 0x13
[  104.424110] smpboot cpu 13: start_ip = 97000
[  104.439830] Calibrating delay loop (skipped) already calibrated this CPU
[  104.451953] NMI watchdog enabled, takes one hw-pmu counter.
[  104.467013] lockdep: fixing up alternatives.
[  104.471457] Booting Node 1 Processor 14 APIC 0x15
[  104.476330] smpboot cpu 14: start_ip = 97000
[  104.492131] Calibrating delay loop (skipped) already calibrated this CPU
[  104.510032] NMI watchdog enabled, takes one hw-pmu counter.
[  104.526274] lockdep: fixing up alternatives.
[  104.530715] Booting Node 1 Processor 15 APIC 0x17
[  104.535578] smpboot cpu 15: start_ip = 97000
[  104.551263] Calibrating delay loop (skipped) already calibrated this CPU
[  104.570011] NMI watchdog enabled, takes one hw-pmu counter.
[  104.606910] CPU 1 is now offline
[  104.612958] CPU 9 MCA banks CMCI:2 CMCI:3 CMCI:5
[  104.620013] ------------[ cut here ]------------
[  104.624007] WARNING: at arch/x86/kernel/smp.c:120 native_smp_send_reschedule+0x59/0x60()
[  104.632694] Hardware name: IBM System x -[7870C4Q]-
[  104.632694] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf microcode fuse loop dm_mod bnx2 i7core_edac shpchp ioatdma cdc_ether edac_core usbnet i2c_i801 pci_hotplug dca i2c_core iTCO_wdt iTCO_vendor_support pcspkr mii tpm_tis tpm tpm_bios serio_raw button rtc_cmos sg uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
[  104.632694] Pid: 6329, comm: udevd Tainted: G        W    3.2.0-6a48897_IPI_patched-0.0.0.28.36b5ec9-default #1
[  104.632694] Call Trace:
[  104.632694]  <IRQ>  [<ffffffff81021409>] ? native_smp_send_reschedule+0x59/0x60
[  104.632694]  [<ffffffff8103cfca>] warn_slowpath_common+0x7a/0xb0
[  104.632694]  [<ffffffff8103d015>] warn_slowpath_null+0x15/0x20
[  104.632694]  [<ffffffff81021409>] native_smp_send_reschedule+0x59/0x60
[  104.632694]  [<ffffffff81082de5>] trigger_load_balance+0x185/0x500
[  104.632694]  [<ffffffff81082e1b>] ? trigger_load_balance+0x1bb/0x500
[  104.632694]  [<ffffffff81073e37>] scheduler_tick+0x107/0x170
[  104.632694]  [<ffffffff8104e777>] update_process_times+0x67/0x80
[  104.632694]  [<ffffffff8109c6cf>] tick_sched_timer+0x5f/0xc0
[  104.632694]  [<ffffffff8109c670>] ? tick_nohz_handler+0x100/0x100
[  104.632694]  [<ffffffff8106a8de>] __run_hrtimer+0x12e/0x330
[  104.632694]  [<ffffffff8106ad27>] hrtimer_interrupt+0xc7/0x1f0
[  104.632694]  [<ffffffff81023024>] smp_apic_timer_interrupt+0x64/0xa0
[  104.632694]  [<ffffffff814a30f3>] apic_timer_interrupt+0x73/0x80
[  104.632694]  <EOI>  [<ffffffff810a3fda>] ? lock_acquire+0x13a/0x190
[  104.632694]  [<ffffffff81134b84>] ? zap_pte_range+0x84/0x3b0
[  104.632694]  [<ffffffff81498b4c>] _raw_spin_lock+0x3c/0x50
[  104.632694]  [<ffffffff81134b84>] ? zap_pte_range+0x84/0x3b0
[  104.632694]  [<ffffffff81134b84>] zap_pte_range+0x84/0x3b0
[  104.632694]  [<ffffffff81135773>] unmap_page_range+0x243/0x300
[  104.632694]  [<ffffffff811358fc>] unmap_vmas+0xcc/0x150
[  104.632694]  [<ffffffff8113b27d>] exit_mmap+0x8d/0x120
[  104.632694]  [<ffffffff8103ad1c>] mmput+0x6c/0x150
[  104.632694]  [<ffffffff810400ca>] exit_mm+0x10a/0x140
[  104.632694]  [<ffffffff8149942b>] ? _raw_spin_unlock_irq+0x2b/0x50
[  104.632694]  [<ffffffff813152f3>] ? tty_audit_exit+0x23/0xa0
[  104.632694]  [<ffffffff810428e7>] do_exit+0x157/0x470
[  104.632694]  [<ffffffff810a2b5d>] ? trace_hardirqs_on_caller+0x12d/0x1b0
[  104.632694]  [<ffffffff81042c47>] do_group_exit+0x47/0xc0
[  104.632694]  [<ffffffff81042cd2>] sys_exit_group+0x12/0x20
[  104.632694]  [<ffffffff814a25b9>] system_call_fastpath+0x16/0x1b
[  104.632694] ---[ end trace 5ad49345f4e941c6 ]---
[  104.889588] CPU 2 is now offline
[  104.894956] CPU 10 MCA banks CMCI:2 CMCI:3 CMCI:5
[  104.909813] CPU 3 is now offline
[  104.917505] CPU 11 MCA banks CMCI:2 CMCI:3 CMCI:5
[  104.937367] CPU 4 is now offline
[  104.944280] CPU 5 MCA banks CMCI:6 CMCI:8
[  104.949190] CPU 12 MCA banks CMCI:2 CMCI:3 CMCI:5
[  104.967718] CPU 5 is now offline
[  104.973542] CPU 6 MCA banks CMCI:6 CMCI:8
[  104.978528] CPU 13 MCA banks CMCI:2 CMCI:3 CMCI:5
[  104.996528] CPU 6 is now offline
[  105.001621] CPU 7 MCA banks CMCI:6 CMCI:8
[  105.006747] CPU 14 MCA banks CMCI:2 CMCI:3 CMCI:5
[  105.021426] CPU 7 is now offline
[  105.027310] CPU 12 MCA banks CMCI:6 CMCI:8
[  105.032101] CPU 15 MCA banks CMCI:2 CMCI:3 CMCI:5
[  105.045291] CPU 8 is now offline
[  105.061777] CPU 9 is now offline
[  105.078397] CPU 10 is now offline
[  105.091618] CPU 11 is now offline
[  105.107865] CPU 12 is now offline
[  105.114138] CPU 13 MCA banks CMCI:6 CMCI:8
[  105.127674] CPU 13 is now offline
[  105.133645] CPU 14 MCA banks CMCI:6 CMCI:8
[  105.144174] CPU 14 is now offline
[  105.150421] CPU 15 MCA banks CMCI:6 CMCI:8
[  105.160126] CPU 15 is now offline
[  105.163602] lockdep: fixing up alternatives.
[  105.168014] SMP alternatives: switching to UP code
[  105.192945] lockdep: fixing up alternatives.
[  105.197382] SMP alternatives: switching to SMP code
[  105.205661] Booting Node 0 Processor 1 APIC 0x2
[  105.210360] smpboot cpu 1: start_ip = 97000
[  105.224003] Calibrating delay loop (skipped) already calibrated this CPU
[  105.233600] NMI watchdog enabled, takes one hw-pmu counter.
[  105.240007] ------------[ cut here ]------------
[  105.244004] WARNING: at arch/x86/kernel/smp.c:120 native_smp_send_reschedule+0x59/0x60()
[  105.244004] Hardware name: IBM System x -[7870C4Q]-
[  105.244004] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf microcode fuse loop dm_mod bnx2 i7core_edac shpchp ioatdma cdc_ether edac_core usbnet i2c_i801 pci_hotplug dca i2c_core iTCO_wdt iTCO_vendor_support pcspkr mii tpm_tis tpm tpm_bios serio_raw button rtc_cmos sg uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
[  105.244004] Pid: 1736, comm: udevd Tainted: G        W    3.2.0-6a48897_IPI_patched-0.0.0.28.36b5ec9-default #1
[  105.244004] Call Trace:
[  105.244004]  <IRQ>  [<ffffffff81021409>] ? native_smp_send_reschedule+0x59/0x60
[  105.244004]  [<ffffffff8103cfca>] warn_slowpath_common+0x7a/0xb0
[  105.244004]  [<ffffffff8103d015>] warn_slowpath_null+0x15/0x20
[  105.244004]  [<ffffffff81021409>] native_smp_send_reschedule+0x59/0x60
[  105.244004]  [<ffffffff81082de5>] trigger_load_balance+0x185/0x500
[  105.244004]  [<ffffffff81082e1b>] ? trigger_load_balance+0x1bb/0x500
[  105.244004]  [<ffffffff81073e37>] scheduler_tick+0x107/0x170
[  105.244004]  [<ffffffff8104e777>] update_process_times+0x67/0x80
[  105.244004]  [<ffffffff8109c6cf>] tick_sched_timer+0x5f/0xc0
[  105.244004]  [<ffffffff8109c670>] ? tick_nohz_handler+0x100/0x100
[  105.244004]  [<ffffffff8106a8de>] __run_hrtimer+0x12e/0x330
[  105.244004]  [<ffffffff8106ad27>] hrtimer_interrupt+0xc7/0x1f0
[  105.244004]  [<ffffffff81023024>] smp_apic_timer_interrupt+0x64/0xa0
[  105.244004]  [<ffffffff814a30f3>] apic_timer_interrupt+0x73/0x80
[  105.244004]  <EOI>  [<ffffffff81185b56>] ? getname_flags+0x36/0x270
[  105.244004]  [<ffffffff8115d7a0>] ? check_poison_obj+0x40/0x200
[  105.244004]  [<ffffffff81185b56>] ? getname_flags+0x36/0x270
[  105.244004]  [<ffffffff8115db18>] cache_alloc_debugcheck_after+0x1b8/0x270
[  105.244004]  [<ffffffff81185b56>] ? getname_flags+0x36/0x270
[  105.244004]  [<ffffffff81160510>] kmem_cache_alloc+0x160/0x370
[  105.244004]  [<ffffffff81185b56>] getname_flags+0x36/0x270
[  105.244004]  [<ffffffff8118ad06>] user_path_at_empty+0x36/0xb0
[  105.244004]  [<ffffffff810a2b5d>] ? trace_hardirqs_on_caller+0x12d/0x1b0
[  105.244004]  [<ffffffff810a2bed>] ? trace_hardirqs_on+0xd/0x10
[  105.244004]  [<ffffffff8118ad8c>] user_path_at+0xc/0x10
[  105.244004]  [<ffffffff8117eb76>] vfs_fstatat+0x46/0x80
[  105.244004]  [<ffffffff8117ecd6>] vfs_stat+0x16/0x20
[  105.244004]  [<ffffffff8117ecff>] sys_newstat+0x1f/0x50
[  105.244004]  [<ffffffff810a2b5d>] ? trace_hardirqs_on_caller+0x12d/0x1b0
[  105.244004]  [<ffffffff8127c7be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[  105.244004]  [<ffffffff814a25b9>] system_call_fastpath+0x16/0x1b
[  105.244004] ---[ end trace 5ad49345f4e941c7 ]---
[  105.522657] lockdep: fixing up alternatives.
[  105.527152] Booting Node 0 Processor 2 APIC 0x4
[  105.531859] smpboot cpu 2: start_ip = 97000
[  105.546905] Calibrating delay loop (skipped) already calibrated this CPU
[  105.555722] NMI watchdog enabled, takes one hw-pmu counter.
[  105.569584] lockdep: fixing up alternatives.
[  105.574017] Booting Node 0 Processor 3 APIC 0x6
[  105.578693] smpboot cpu 3: start_ip = 97000
[  105.592007] Calibrating delay loop (skipped) already calibrated this CPU
[  105.602441] NMI watchdog enabled, takes one hw-pmu counter.
[  105.614230] lockdep: fixing up alternatives.
[  105.618692] Booting Node 1 Processor 4 APIC 0x10
[  105.623492] smpboot cpu 4: start_ip = 97000
[  105.640087] ------------[ cut here ]------------
[  105.644084] WARNING: at arch/x86/kernel/smp.c:120 native_smp_send_reschedule+0x59/0x60()
[  105.644084] Hardware name: IBM System x -[7870C4Q]-
[  105.644084] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf microcode fuse loop dm_mod bnx2 i7core_edac shpchp ioatdma cdc_ether edac_core usbnet i2c_i801 pci_hotplug dca i2c_core iTCO_wdt iTCO_vendor_support pcspkr mii tpm_tis tpm tpm_bios serio_raw button rtc_cmos sg uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
[  105.644084] Pid: 6, comm: migration/0 Tainted: G        W    3.2.0-6a48897_IPI_patched-0.0.0.28.36b5ec9-default #1
[  105.644084] Call Trace:
[  105.644084]  <IRQ>  [<ffffffff81021409>] ? native_smp_send_reschedule+0x59/0x60
[  105.644084]  [<ffffffff8103cfca>] warn_slowpath_common+0x7a/0xb0
[  105.644084]  [<ffffffff8103d015>] warn_slowpath_null+0x15/0x20
[  105.644084]  [<ffffffff81021409>] native_smp_send_reschedule+0x59/0x60
[  105.644084]  [<ffffffff81082de5>] trigger_load_balance+0x185/0x500
[  105.644084]  [<ffffffff81082e1b>] ? trigger_load_balance+0x1bb/0x500
[  105.644084]  [<ffffffff81073e37>] scheduler_tick+0x107/0x170
[  105.644084]  [<ffffffff8104e777>] update_process_times+0x67/0x80
[  105.644084]  [<ffffffff8109c6cf>] tick_sched_timer+0x5f/0xc0
[  105.644084]  [<ffffffff8109c670>] ? tick_nohz_handler+0x100/0x100
[  105.644084]  [<ffffffff8106a8de>] __run_hrtimer+0x12e/0x330
[  105.644084]  [<ffffffff8106ad27>] hrtimer_interrupt+0xc7/0x1f0
[  105.644084]  [<ffffffff81023024>] smp_apic_timer_interrupt+0x64/0xa0
[  105.644084]  [<ffffffff814a30f3>] apic_timer_interrupt+0x73/0x80
[  105.644084]  <EOI>  [<ffffffff810c56ba>] ? stop_machine_cpu_stop+0xda/0x130
[  105.644084]  [<ffffffff810c55e0>] ? stop_one_cpu_nowait+0x50/0x50
[  105.644084]  [<ffffffff810c52f9>] cpu_stopper_thread+0xd9/0x1b0
[  105.644084]  [<ffffffff8149948f>] ? _raw_spin_unlock_irqrestore+0x3f/0x80
[  105.644084]  [<ffffffff810c5220>] ? res_counter_init+0x50/0x50
[  105.644084]  [<ffffffff810a2b5d>] ? trace_hardirqs_on_caller+0x12d/0x1b0
[  105.644084]  [<ffffffff810a2bed>] ? trace_hardirqs_on+0xd/0x10
[  105.644084]  [<ffffffff810c5220>] ? res_counter_init+0x50/0x50
[  105.644084]  [<ffffffff810655be>] kthread+0x9e/0xb0
[  105.644084]  [<ffffffff814a39f4>] kernel_thread_helper+0x4/0x10
[  105.644084]  [<ffffffff81499834>] ? retint_restore_args+0x13/0x13
[  105.644084]  [<ffffffff81065520>] ? __init_kthread_worker+0x70/0x70
[  105.644084]  [<ffffffff814a39f0>] ? gs_change+0x13/0x13
[  105.644084] ---[ end trace 5ad49345f4e941c8 ]---
[  105.638988] Calibrating delay loop (skipped) already calibrated this CPU
[  105.899283] NMI watchdog enabled, takes one hw-pmu counter.
[  105.915038] lockdep: fixing up alternatives.
[  105.919475] Booting Node 1 Processor 5 APIC 0x12
[  105.924245] smpboot cpu 5: start_ip = 97000
[  105.940631] ------------[ cut here ]------------
[  105.944627] WARNING: at arch/x86/kernel/smp.c:120 native_smp_send_reschedule+0x59/0x60()
[  105.944627] Hardware name: IBM System x -[7870C4Q]-
[  105.944627] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf microcode fuse loop dm_mod bnx2 i7core_edac shpchp ioatdma cdc_ether edac_core usbnet i2c_i801 pci_hotplug dca i2c_core iTCO_wdt iTCO_vendor_support pcspkr mii tpm_tis tpm tpm_bios serio_raw button rtc_cmos sg uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
[  105.944627] Pid: 6428, comm: migration/2 Tainted: G        W    3.2.0-6a48897_IPI_patched-0.0.0.28.36b5ec9-default #1
[  105.944627] Call Trace:
[  105.944627]  <IRQ>  [<ffffffff81021409>] ? native_smp_send_reschedule+0x59/0x60
[  105.944627]  [<ffffffff8103cfca>] warn_slowpath_common+0x7a/0xb0
[  105.944627]  [<ffffffff8103d015>] warn_slowpath_null+0x15/0x20
[  106.034846]  [<ffffffff81021409>] native_smp_send_reschedule+0x59/0x60
[  106.034846]  [<ffffffff81082de5>] trigger_load_balance+0x185/0x500
[  106.034846]  [<ffffffff81082e1b>] ? trigger_load_balance+0x1bb/0x500
[  106.034846]  [<ffffffff81073e37>] scheduler_tick+0x107/0x170
[  106.034846]  [<ffffffff8104e777>] update_process_times+0x67/0x80
[  106.034846]  [<ffffffff8109c6cf>] tick_sched_timer+0x5f/0xc0
[  106.034846]  [<ffffffff8109c670>] ? tick_nohz_handler+0x100/0x100
[  106.034846]  [<ffffffff8106a8de>] __run_hrtimer+0x12e/0x330
[  106.034846]  [<ffffffff8106ad27>] hrtimer_interrupt+0xc7/0x1f0
[  106.034846]  [<ffffffff81023024>] smp_apic_timer_interrupt+0x64/0xa0
[  106.034846]  [<ffffffff814a30f3>] apic_timer_interrupt+0x73/0x80
[  106.034846]  <EOI>  [<ffffffff810c56ba>] ? stop_machine_cpu_stop+0xda/0x130
[  106.034846]  [<ffffffff810c55e0>] ? stop_one_cpu_nowait+0x50/0x50
[  106.034846]  [<ffffffff810c52f9>] cpu_stopper_thread+0xd9/0x1b0
[  106.034846]  [<ffffffff8149948f>] ? _raw_spin_unlock_irqrestore+0x3f/0x80
[  106.034846]  [<ffffffff810c5220>] ? res_counter_init+0x50/0x50
[  106.034846]  [<ffffffff810a2b5d>] ? trace_hardirqs_on_caller+0x12d/0x1b0
[  106.034846]  [<ffffffff810a2bed>] ? trace_hardirqs_on+0xd/0x10
[  106.034846]  [<ffffffff810c5220>] ? res_counter_init+0x50/0x50
[  106.034846]  [<ffffffff810655be>] kthread+0x9e/0xb0
[  106.034846]  [<ffffffff814a39f4>] kernel_thread_helper+0x4/0x10
[  106.034846]  [<ffffffff81499834>] ? retint_restore_args+0x13/0x13
[  106.034846]  [<ffffffff81065520>] ? __init_kthread_worker+0x70/0x70
[  106.034846]  [<ffffffff814a39f0>] ? gs_change+0x13/0x13
[  106.034846] ---[ end trace 5ad49345f4e941c9 ]---
[  105.939769] Calibrating delay loop (skipped) already calibrated this CPU
[  106.041217] NMI watchdog enabled, takes one hw-pmu counter.
[  106.213412] lockdep: fixing up alternatives.
[  106.217921] Booting Node 1 Processor 6 APIC 0x14
[  106.222769] smpboot cpu 6: start_ip = 97000
[  106.238435] Calibrating delay loop (skipped) already calibrated this CPU
[  106.247789] NMI watchdog enabled, takes one hw-pmu counter.
[  106.263513] lockdep: fixing up alternatives.
[  106.267948] Booting Node 1 Processor 7 APIC 0x16
[  106.272723] smpboot cpu 7: start_ip = 97000
[  106.288103] Calibrating delay loop (skipped) already calibrated this CPU
[  106.297965] NMI watchdog enabled, takes one hw-pmu counter.
[  106.313050] lockdep: fixing up alternatives.
[  106.317523] Booting Node 0 Processor 8 APIC 0x1
[  106.322206] smpboot cpu 8: start_ip = 97000
[  106.337772] Calibrating delay loop (skipped) already calibrated this CPU
[  106.347702] NMI watchdog enabled, takes one hw-pmu counter.
[  106.363043] lockdep: fixing up alternatives.
[  106.367580] Booting Node 0 Processor 9 APIC 0x3
[  106.372315] smpboot cpu 9: start_ip = 97000
[  106.388071] Calibrating delay loop (skipped) already calibrated this CPU
[  106.401878] NMI watchdog enabled, takes one hw-pmu counter.
[  106.418266] lockdep: fixing up alternatives.
[  106.422755] Booting Node 0 Processor 10 APIC 0x5
[  106.422758] smpboot cpu 10: start_ip = 97000
[  106.432548] Calibrating delay loop (skipped) already calibrated this CPU
[  106.444762] NMI watchdog enabled, takes one hw-pmu counter.
[  106.460886] lockdep: fixing up alternatives.
[  106.465317] Booting Node 0 Processor 11 APIC 0x7
[  106.470081] smpboot cpu 11: start_ip = 97000
[  106.485636] Calibrating delay loop (skipped) already calibrated this CPU
[  106.496893] NMI watchdog enabled, takes one hw-pmu counter.
[  106.511748] lockdep: fixing up alternatives.
[  106.516201] Booting Node 1 Processor 12 APIC 0x11
[  106.521093] smpboot cpu 12: start_ip = 97000
[  106.536798] Calibrating delay loop (skipped) already calibrated this CPU
[  106.548584] NMI watchdog enabled, takes one hw-pmu counter.
[  106.563246] lockdep: fixing up alternatives.
[  106.567717] Booting Node 1 Processor 13 APIC 0x13
[  106.572584] smpboot cpu 13: start_ip = 97000
[  106.588291] Calibrating delay loop (skipped) already calibrated this CPU
[  106.600482] NMI watchdog enabled, takes one hw-pmu counter.
[  106.614962] lockdep: fixing up alternatives.
[  106.619398] Booting Node 1 Processor 14 APIC 0x15
[  106.624280] smpboot cpu 14: start_ip = 97000
[  106.640082] Calibrating delay loop (skipped) already calibrated this CPU
[  106.657871] NMI watchdog enabled, takes one hw-pmu counter.
[  106.673572] lockdep: fixing up alternatives.
[  106.678024] Booting Node 1 Processor 15 APIC 0x17
[  106.682912] smpboot cpu 15: start_ip = 97000
[  106.698605] Calibrating delay loop (skipped) already calibrated this CPU
[  106.711375] NMI watchdog enabled, takes one hw-pmu counter.
[  106.748563] CPU 1 is now offline
[  106.754950] CPU 9 MCA banks
[  106.756020] ------------[ cut here ]------------
[  106.756028] WARNING: at arch/x86/kernel/smp.c:120 native_smp_send_reschedule+0x59/0x60()
[  106.756031] Hardware name: IBM System x -[7870C4Q]-
[  106.756033] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf microcode fuse loop dm_mod bnx2 i7core_edac shpchp ioatdma cdc_ether edac_core usbnet i2c_i801 pci_hotplug dca i2c_core iTCO_wdt iTCO_vendor_support pcspkr mii tpm_tis tpm tpm_bios serio_raw button rtc_cmos sg uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
[  106.756077] Pid: 1736, comm: udevd Tainted: G        W    3.2.0-6a48897_IPI_patched-0.0.0.28.36b5ec9-default #1
[  106.756079] Call Trace:
[  106.756081]  <IRQ>  [<ffffffff81021409>] ? native_smp_send_reschedule+0x59/0x60
[  106.756090]  [<ffffffff8103cfca>] warn_slowpath_common+0x7a/0xb0
[  106.756094]  [<ffffffff8103d015>] warn_slowpath_null+0x15/0x20
[  106.756098]  [<ffffffff81021409>] native_smp_send_reschedule+0x59/0x60
[  106.756103]  [<ffffffff81082de5>] trigger_load_balance+0x185/0x500
[  106.756107]  [<ffffffff81082e1b>] ? trigger_load_balance+0x1bb/0x500
[  106.756113]  [<ffffffff81073e37>] scheduler_tick+0x107/0x170
[  106.756119]  [<ffffffff8104e777>] update_process_times+0x67/0x80
[  106.756124]  [<ffffffff8109c6cf>] tick_sched_timer+0x5f/0xc0
[  106.756128]  [<ffffffff8109c670>] ? tick_nohz_handler+0x100/0x100
[  106.756133]  [<ffffffff8106a8de>] __run_hrtimer+0x12e/0x330
[  106.756138]  [<ffffffff8106ad27>] hrtimer_interrupt+0xc7/0x1f0
[  106.756142]  [<ffffffff81023024>] smp_apic_timer_interrupt+0x64/0xa0
[  106.756147]  [<ffffffff81193222>] ? try_to_ascend+0x82/0x160
[  106.756153]  [<ffffffff814a30f3>] apic_timer_interrupt+0x73/0x80
[  106.756155]  <EOI>  [<ffffffff810a45a6>] ? lock_release+0xe6/0x140
[  106.756165]  [<ffffffff814994ee>] _raw_spin_unlock+0x1e/0x40
[  106.756169]  [<ffffffff81193222>] try_to_ascend+0x82/0x160
[  106.756172]  [<ffffffff811931ef>] ? try_to_ascend+0x4f/0x160
[  106.756176]  [<ffffffff811937e1>] select_parent+0x111/0x2b0
[  106.756179]  [<ffffffff81193ef3>] ? shrink_dcache_parent+0x33/0x40
[  106.756183]  [<ffffffff81193ef3>] shrink_dcache_parent+0x33/0x40
[  106.756188]  [<ffffffff81186d9d>] vfs_rmdir+0xcd/0x120
[  106.756192]  [<ffffffff8118ac69>] do_rmdir+0x139/0x140
[  106.756197]  [<ffffffff814a25e5>] ? sysret_check+0x22/0x5d
[  106.756201]  [<ffffffff810a2b5d>] ? trace_hardirqs_on_caller+0x12d/0x1b0
[  106.756207]  [<ffffffff8127c7be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[  106.756211]  [<ffffffff8118acc1>] sys_rmdir+0x11/0x20
[  106.756215]  [<ffffffff814a25b9>] system_call_fastpath+0x16/0x1b
[  106.756217] ---[ end trace 5ad49345f4e941ca ]---
[  107.000131]  CMCI:2 CMCI:3 CMCI:5

Regards,
Srivatsa S. Bhat
IBM Linux Technology Center



* Re: x86/mce: machine check warning during poweroff
  2012-01-18  3:17                 ` Suresh Siddha
  2012-01-18 10:19                   ` Srivatsa S. Bhat
@ 2012-01-18 13:15                   ` Srivatsa S. Bhat
  2012-01-18 13:32                     ` Sergey Senozhatsky
  1 sibling, 1 reply; 49+ messages in thread
From: Srivatsa S. Bhat @ 2012-01-18 13:15 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Linus Torvalds, Ming Lei, Djalal Harouni, Borislav Petkov,
	Tony Luck, Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-kernel,
	Greg Kroah-Hartman, Kay Sievers, gouders, Marcos Souza,
	Linux PM mailing list, Rafael J. Wysocki, tglx, prasad,
	justinmattock, Jeff Chua, Peter Zijlstra, Mel Gorman,
	Gilad Ben-Yossef, Sergey Senozhatsky

On 01/18/2012 08:47 AM, Suresh Siddha wrote:

> On Tue, 2012-01-17 at 15:22 +0530, Srivatsa S. Bhat wrote:
>> Thanks for the patch, but unfortunately it doesn't fix the problem!
>> Exactly the same stack traces are seen during a CPU Hotplug stress test.
>> (I didn't even have to stress it - it is so fragile that just a script
>> to offline all cpus except the boot cpu was good enough to reproduce the
>> problem easily.)
> 
> hmm, that's weird. with the patch, sched_ilb_notifier() should have
> cleared the cpu going offline from the nohz.idle_cpus_mask. And this
> should have happened after that cpu is removed from active mask. So
> no-one else should add that cpu back to the nohz.idle_cpus_mask and this
> should prevent the issue from happening.
> 
> I could reproduce the problem easily without the patch but when I
> applied the patch I couldn't recreate the issue. Srivatsa, can you
> please re-check that the kernel you tested indeed has the fix?
> 
> re-Reviewing the code/patch also doesn't give me a hint.
> 
>> I have a few questions regarding the synchronization with CPU Hotplug.
>> What guarantees that the code which selects and IPIs the new ilb is totally
>> race-free with respect to CPU hotplug and we will never IPI an offline CPU?
> 
> So, nohz_balancer_kick() gets called only with interrupts disabled.
> During that time (from selecting the ilb_cpu to sending the IPI), no cpu
> can go offline, as the offline happens from the stop-machine process
> context with interrupts disabled.
> 
> The only thing we need to make sure is that the offlined cpu is not part
> of the nohz.idle_cpus_mask, and for post-3.2 code the posted patch ensures
> that.
> 
> For 3.2 and before, when a cpu exits tickless idle, it gets removed from
> the nohz.idle_cpus_mask (and also from the nohz.load_balancer). And if
> the cpu is not in the active mask (while going offline), subsequent
> calls to select_nohz_load_balancer() ensure that the cpu going down
> doesn't update the nohz structures. So I thought 3.2 shouldn't exhibit
> this problem.
> 
> 
>> (As demonstrated above, this issue is in 3.2-rc7
>> as well.)
> 
> hmm, don't think we ran into this before 3.2. So, what am I missing from
> the above? I will try to reproduce it on 3.2 too.
> 


I tested again on 3.2. I didn't hit those warnings (IPI to offline cpus).
It happens only in the post-3.2 kernel.

Regards,
Srivatsa S. Bhat
IBM Linux Technology Center
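
For reference, the reproducer mentioned in the quoted text (a script that
offlines every cpu except the boot cpu) can be approximated in C through the
sysfs hotplug control files. This is only a sketch: it assumes cpu0 is the
boot cpu, that the caller is root, and that the cpu count is passed in by the
caller. The helper names are made up for illustration and are not taken from
the test setup used in this thread.

```c
#include <stdio.h>

/* Build the sysfs path of the hotplug control file for one cpu.
 * Returns 0 on success, -1 if the buffer is too small. */
static int cpu_online_path(int cpu, char *buf, size_t len)
{
	int n = snprintf(buf, len, "/sys/devices/system/cpu/cpu%d/online", cpu);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write "0" or "1" to the control file; needs root. Returns 0 on success. */
static int set_cpu_online(int cpu, int online)
{
	char path[64];
	FILE *f;
	int ok;

	if (cpu_online_path(cpu, path, sizeof(path)))
		return -1;
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fputs(online ? "1" : "0", f) >= 0;
	/* The buffered write reaches the kernel at fclose(), so a refused
	 * hotplug request shows up here. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/* Offline every cpu except cpu0 (assumed to be the boot cpu).
 * Returns the number of cpus that could not be offlined. */
static int offline_all_but_boot(int ncpus)
{
	int cpu, failed = 0;

	for (cpu = 1; cpu < ncpus; cpu++)
		if (set_cpu_online(cpu, 0))
			failed++;
	return failed;
}
```

Calling offline_all_but_boot(16) from a small main() on the 16-cpu machine in
the log above would walk cpu1 through cpu15; watching dmesg afterwards shows
whether the native_smp_send_reschedule warning fires.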



* Re: x86/mce: machine check warning during poweroff
  2012-01-18 13:15                   ` Srivatsa S. Bhat
@ 2012-01-18 13:32                     ` Sergey Senozhatsky
  2012-01-18 22:08                       ` Suresh Siddha
  0 siblings, 1 reply; 49+ messages in thread
From: Sergey Senozhatsky @ 2012-01-18 13:32 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Suresh Siddha, Linus Torvalds, Ming Lei, Djalal Harouni,
	Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar,
	Andi Kleen, linux-kernel, Greg Kroah-Hartman, Kay Sievers,
	gouders, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki,
	tglx, prasad, justinmattock, Jeff Chua, Peter Zijlstra,
	Mel Gorman, Gilad Ben-Yossef

On (01/18/12 18:45), Srivatsa S. Bhat wrote:
> Date: Wed, 18 Jan 2012 18:45:55 +0530
> From: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
> 
> > On Tue, 2012-01-17 at 15:22 +0530, Srivatsa S. Bhat wrote:
> >> Thanks for the patch, but unfortunately it doesn't fix the problem!
> >> Exactly the same stack traces are seen during a CPU Hotplug stress test.
> >> (I didn't even have to stress it - it is so fragile that just a script
> >> to offline all cpus except the boot cpu was good enough to reproduce the
> >> problem easily.)
> > 
> > hmm, that's weird. with the patch, sched_ilb_notifier() should have
> > cleared the cpu going offline from the nohz.idle_cpus_mask. And this
> > should have happened after that cpu is removed from active mask. So
> > no-one else should add that cpu back to the nohz.idle_cpus_mask and this
> > should prevent the issue from happening.
> > 
Just a small note, since you're talking about removing a CPU from nohz.idle_cpus_mask:
I'm able to reproduce this problem not only when offlining a CPU, but during
onlining as well (kernel 3.3):

[   67.587942] CPU1 is up
[   67.589710] Call Trace:
[   67.589719]  <IRQ>  [<ffffffff81030092>] warn_slowpath_common+0x7e/0x96
[   67.589745]  [<ffffffff810300bf>] warn_slowpath_null+0x15/0x17
[   67.589762]  [<ffffffff81018ff7>] native_smp_send_reschedule+0x25/0x56
[   67.589783]  [<ffffffff81067ffe>] trigger_load_balance+0x6ac/0x72e
[   67.589802]  [<ffffffff81067bfd>] ? trigger_load_balance+0x2ab/0x72e
[   67.589823]  [<ffffffff8105f05c>] scheduler_tick+0xe2/0xeb
[   67.589842]  [<ffffffff8103f6ac>] update_process_times+0x60/0x70
[   67.589863]  [<ffffffff8107c1e1>] tick_sched_timer+0x6d/0x96
[   67.589882]  [<ffffffff81053b3b>] __run_hrtimer+0x1c2/0x3a1
[   67.589900]  [<ffffffff8107c174>] ? tick_nohz_handler+0xdf/0xdf
[   67.589918]  [<ffffffff81054721>] hrtimer_interrupt+0xe6/0x1b0
[   67.589937]  [<ffffffff81019bdd>] smp_apic_timer_interrupt+0x80/0x93
[   67.589958]  [<ffffffff814a2f73>] apic_timer_interrupt+0x73/0x80
[   67.589975]  <EOI>  [<ffffffff81087bf1>] ? generic_exec_single+0x73/0x8a
[   67.590000]  [<ffffffff81087bea>] ? generic_exec_single+0x6c/0x8a
[   67.590019]  [<ffffffff81017f8b>] ? get_fixed_ranges.constprop.5+0x10b/0x10b
[   67.590039]  [<ffffffff81087d2c>] smp_call_function_single+0x124/0x15c
[   67.590059]  [<ffffffff81017f8b>] ? get_fixed_ranges.constprop.5+0x10b/0x10b
[   67.590081]  [<ffffffff8101696d>] mtrr_save_state+0x19/0x1b
[   67.590100]  [<ffffffff8148bf67>] native_cpu_up+0xa1/0x138
[   67.590117]  [<ffffffff8148d192>] _cpu_up+0x92/0xfc
[   67.590134]  [<ffffffff8147f3eb>] enable_nonboot_cpus+0x48/0xad
[   67.590154]  [<ffffffff8106f080>] suspend_devices_and_enter+0x21a/0x407
[   67.590173]  [<ffffffff8106f391>] enter_state+0x124/0x169
[   67.590191]  [<ffffffff8106e01b>] state_store+0xb7/0x101
[   67.590212]  [<ffffffff8126c82f>] kobj_attr_store+0x17/0x19
[   67.590230]  [<ffffffff8117e20c>] sysfs_write_file+0x103/0x13f
[   67.590249]  [<ffffffff8111f018>] vfs_write+0xad/0x13d
[   67.590266]  [<ffffffff8111f293>] sys_write+0x45/0x6c
[   67.590282]  [<ffffffff814a2439>] system_call_fastpath+0x16/0x1b


	Sergey



* Re: x86/mce: machine check warning during poweroff
  2012-01-18 13:32                     ` Sergey Senozhatsky
@ 2012-01-18 22:08                       ` Suresh Siddha
  2012-01-19  7:50                         ` Sergey Senozhatsky
  2012-01-19 12:02                         ` Srivatsa S. Bhat
  0 siblings, 2 replies; 49+ messages in thread
From: Suresh Siddha @ 2012-01-18 22:08 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Srivatsa S. Bhat, Linus Torvalds, Ming Lei, Djalal Harouni,
	Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar,
	Andi Kleen, linux-kernel, Greg Kroah-Hartman, Kay Sievers,
	gouders, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki,
	tglx, prasad, justinmattock, Jeff Chua, Peter Zijlstra,
	Mel Gorman, Gilad Ben-Yossef

On Wed, 2012-01-18 at 16:32 +0300, Sergey Senozhatsky wrote:
> Just a small note, since you're talking about removing a CPU from nohz.idle_cpus_mask:
> I'm able to reproduce this problem not only when offlining a CPU, but during
> onlining as well (kernel 3.3):

yes, if the nohz state is not cleared properly during offline, then the
issue can happen any time including cpu online etc.
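
To make the failure mode concrete, here is a toy userspace model of it. It is
a sketch under assumptions, not the kernel code: plain bitmasks stand in for
cpu_online_mask and nohz.idle_cpus_mask, and pick_ilb() is a hypothetical
stand-in for the logic that selects an idle load balancer from the idle mask.

```c
#define NR_CPUS 16

/* Toy stand-ins for cpu_online_mask and nohz.idle_cpus_mask. */
static unsigned int online_mask;
static unsigned int idle_cpus_mask;

/* Pick the first cpu in the idle mask as the idle load balancer candidate.
 * Returns NR_CPUS when the mask is empty. */
static int pick_ilb(void)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (idle_cpus_mask & (1u << cpu))
			return cpu;
	return NR_CPUS;
}

/* Returns 1 if kicking the picked cpu would target an offline cpu,
 * which is the condition the reschedule-IPI warning complains about. */
static int kick_hits_offline_cpu(void)
{
	int cpu = pick_ilb();

	return cpu < NR_CPUS && !(online_mask & (1u << cpu));
}
```

Once a cpu's bit is left behind in the idle mask, the stale bit stays there
no matter what hotplug operation runs next, so any later tick on any cpu can
pick it. That is consistent with the warning showing up during onlining too.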

Srivatsa, I thought CPU_PRI_SCHED_INACTIVE was INT_MAX for some reason
and was expecting sched_ilb_notifier() to be called after that cpu had
been set inactive. I am now using CPU_DYING, which is called from the
cpu going down.
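
The ordering problem can be illustrated with a small sketch. Notifier chains
run callbacks from highest to lowest priority. The model below assumes (this
value is not stated in the thread) that CPU_PRI_SCHED_INACTIVE is INT_MIN + 1,
and that the first version of the fix registered its notifier with priority 0.
The struct and the qsort()-based ordering are illustrative only; the kernel
keeps its chains as priority-ordered linked lists.

```c
#include <limits.h>
#include <stdlib.h>

struct notifier {
	const char *name;
	int priority;
};

/* Higher priority runs first, as in the kernel's notifier chains. */
static int by_priority_desc(const void *a, const void *b)
{
	const struct notifier *x = a, *y = b;

	return (x->priority < y->priority) - (x->priority > y->priority);
}

/* Arrange the chain in the order the callbacks would be invoked. */
static void sort_chain(struct notifier *chain, size_t n)
{
	qsort(chain, n, sizeof(*chain), by_priority_desc);
}
```

Under those assumptions a priority-0 notifier runs before the cpu is marked
inactive, so the idle path can still re-add the cpu to the nohz state after
the cleanup. Doing the cleanup at CPU_DYING sidesteps the priority question,
because that event fires later in the offline sequence, on the dying cpu.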

Here is the v2 version of the fix. Can you folks please give it another
try?

Thanks.
---

From: Suresh Siddha <suresh.b.siddha@intel.com>
Subject: sched, nohz: fix nohz cpu idle load balancing state with cpu hotplug

With the recent nohz scheduler changes, rq's nohz flag 'NOHZ_TICK_STOPPED'
and its associated state doesn't get cleared immediately after the
cpu exits idle. This gets cleared as part of the next tick seen on that cpu.

With the cpu offline, we need to clear this state manually. Fix it by
registering a cpu notifier which clears the nohz idle load balance
state for this rq explicitly.

Reported-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
 kernel/sched/fair.c |   34 +++++++++++++++++++++++++++++-----
 1 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2237ffe..f605e1d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4843,6 +4843,15 @@ static void nohz_balancer_kick(int cpu)
 	return;
 }
 
+static inline void clear_nohz_tick_stopped(int cpu)
+{
+	if (unlikely(test_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu)))) {
+		cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
+		atomic_dec(&nohz.nr_cpus);
+		clear_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu));
+	}
+}
+
 static inline void set_cpu_sd_state_busy(void)
 {
 	struct sched_domain *sd;
@@ -4881,6 +4890,12 @@ void select_nohz_load_balancer(int stop_tick)
 {
 	int cpu = smp_processor_id();
 
+	/*
+	 * If this cpu is going down, then nothing needs to be done.
+	 */
+	if (!cpu_active(cpu))
+		return;
+
 	if (stop_tick) {
 		if (test_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu)))
 			return;
@@ -4891,6 +4906,18 @@ void select_nohz_load_balancer(int stop_tick)
 	}
 	return;
 }
+
+static int __cpuinit sched_ilb_notifier(struct notifier_block *nfb,
+					unsigned long action, void *hcpu)
+{
+	switch (action & ~CPU_TASKS_FROZEN) {
+	case CPU_DYING:
+		clear_nohz_tick_stopped(smp_processor_id());
+		return NOTIFY_OK;
+	default:
+		return NOTIFY_DONE;
+	}
+}
 #endif
 
 static DEFINE_SPINLOCK(balancing);
@@ -5047,11 +5074,7 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
 	* busy tick after returning from idle, we will update the busy stats.
 	*/
 	set_cpu_sd_state_busy();
-	if (unlikely(test_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu)))) {
-		clear_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu));
-		cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
-		atomic_dec(&nohz.nr_cpus);
-	}
+	clear_nohz_tick_stopped(cpu);
 
 	/*
 	 * None are in tickless mode and hence no need for NOHZ idle load
@@ -5549,6 +5572,7 @@ __init void init_sched_fair_class(void)
 
 #ifdef CONFIG_NO_HZ
 	zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);
+	cpu_notifier(sched_ilb_notifier, 0);
 #endif
 #endif /* SMP */
 




* Re: x86/mce: machine check warning during poweroff
  2012-01-18 22:08                       ` Suresh Siddha
@ 2012-01-19  7:50                         ` Sergey Senozhatsky
  2012-01-19 12:02                         ` Srivatsa S. Bhat
  1 sibling, 0 replies; 49+ messages in thread
From: Sergey Senozhatsky @ 2012-01-19  7:50 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Srivatsa S. Bhat, Linus Torvalds, Ming Lei, Djalal Harouni,
	Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar,
	Andi Kleen, linux-kernel, Greg Kroah-Hartman, Kay Sievers,
	gouders, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki,
	tglx, prasad, justinmattock, Jeff Chua, Peter Zijlstra,
	Mel Gorman, Gilad Ben-Yossef

On (01/18/12 14:08), Suresh Siddha wrote:
> On Wed, 2012-01-18 at 16:32 +0300, Sergey Senozhatsky wrote:
> > Just a small note, since you're talking about removing a CPU from nohz.idle_cpus_mask:
> > I'm able to reproduce this problem not only when offlining a CPU, but during
> > onlining as well (kernel 3.3):
> 
> yes, if the nohz state is not cleared properly during offline, then the
> issue can happen any time including cpu online etc.
>

Oh, sure. Good point.

Works for me, here is my:

Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> 

	Sergey



* Re: x86/mce: machine check warning during poweroff
  2012-01-18 22:08                       ` Suresh Siddha
  2012-01-19  7:50                         ` Sergey Senozhatsky
@ 2012-01-19 12:02                         ` Srivatsa S. Bhat
  2012-01-20  2:28                           ` Suresh Siddha
  1 sibling, 1 reply; 49+ messages in thread
From: Srivatsa S. Bhat @ 2012-01-19 12:02 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Sergey Senozhatsky, Linus Torvalds, Ming Lei, Djalal Harouni,
	Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar,
	Andi Kleen, linux-kernel, Greg Kroah-Hartman, Kay Sievers,
	gouders, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki,
	tglx, prasad, justinmattock, Jeff Chua, Peter Zijlstra,
	Mel Gorman, Gilad Ben-Yossef

On 01/19/2012 03:38 AM, Suresh Siddha wrote:

> On Wed, 2012-01-18 at 16:32 +0300, Sergey Senozhatsky wrote:
>> Just a small note, since you're talking about removing a CPU from nohz.idle_cpus_mask:
>> I'm able to reproduce this problem not only when offlining a CPU, but during
>> onlining as well (kernel 3.3):
> 
> yes, if the nohz state is not cleared properly during offline, then the
> issue can happen any time including cpu online etc.
> 
> Srivatsa, I thought CPU_PRI_SCHED_INACTIVE as INT_MAX for some reason
> and was expecting sched_ilb_notifier() will be called after setting that
> cpu as inactive. I am now using CPU_DYING which will be called from the
> cpu going down.
> 
> Here is the v2 version of the fix. Can you folks please give it another
> try?
> 


Suresh, your patch works perfectly! Thanks a lot!
Tested-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>

And the reasoning behind the patch matches the test results:
we don't allow select_nohz_load_balancer() to undo the cleanup that we
did in sched_ilb_notifier(), by ensuring that sched_ilb_notifier() runs
*after* sched_cpu_inactive().
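
The interaction can be walked through with a toy userspace model. It is a
sketch, not the kernel code: bitmasks and a plain int stand in for
cpu_active_mask, nohz.idle_cpus_mask, the per-rq NOHZ_TICK_STOPPED flag and
nohz.nr_cpus, and select_nohz_load_balancer() takes the cpu as an explicit
argument here instead of using smp_processor_id(). The function names are
borrowed from the patch purely for readability.

```c
#include <assert.h>

#define NR_CPUS 16

static unsigned int active_mask;     /* stand-in for cpu_active_mask */
static unsigned int idle_cpus_mask;  /* stand-in for nohz.idle_cpus_mask */
static unsigned int tick_stopped;    /* one NOHZ_TICK_STOPPED bit per cpu */
static int nr_cpus;                  /* stand-in for nohz.nr_cpus */

/* Drop the cpu from the nohz idle state; a no-op if it is not in it. */
static void clear_nohz_tick_stopped(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	if (tick_stopped & (1u << cpu)) {
		idle_cpus_mask &= ~(1u << cpu);
		nr_cpus--;
		tick_stopped &= ~(1u << cpu);
	}
}

/* Register the cpu as tickless idle, unless it is already going down. */
static void select_nohz_load_balancer(int cpu, int stop_tick)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	/* A cpu that is no longer active must not re-register itself. */
	if (!(active_mask & (1u << cpu)))
		return;
	if (stop_tick && !(tick_stopped & (1u << cpu))) {
		idle_cpus_mask |= 1u << cpu;
		nr_cpus++;
		tick_stopped |= 1u << cpu;
	}
}
```

In this model, clearing the cpu's bit in active_mask first, then calling
clear_nohz_tick_stopped(), leaves any later select_nohz_load_balancer() call
on that cpu with nothing to do, so the cleanup cannot be undone.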

So, you can have my "Reviewed-by" too, if you like!

By the way, it would be great if you could kindly describe the
above-mentioned subtle aspect in the patch description as well.

Regards,
Srivatsa S. Bhat



* Re: x86/mce: machine check warning during poweroff
  2012-01-19 12:02                         ` Srivatsa S. Bhat
@ 2012-01-20  2:28                           ` Suresh Siddha
  2012-01-23  8:43                             ` Peter Zijlstra
  2012-01-26 20:27                             ` [tip:sched/urgent] sched/nohz: Fix nohz cpu idle load balancing state with cpu hotplug tip-bot for Suresh Siddha
  0 siblings, 2 replies; 49+ messages in thread
From: Suresh Siddha @ 2012-01-20  2:28 UTC (permalink / raw)
  To: Srivatsa S. Bhat, Peter Zijlstra, Ingo Molnar
  Cc: Sergey Senozhatsky, Linus Torvalds, Ming Lei, Djalal Harouni,
	Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar,
	Andi Kleen, linux-kernel, Greg Kroah-Hartman, Kay Sievers,
	gouders, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki,
	tglx, prasad, justinmattock, Jeff Chua, Peter Zijlstra,
	Mel Gorman, Gilad Ben-Yossef

On Thu, 2012-01-19 at 17:32 +0530, Srivatsa S. Bhat wrote:
> Suresh, your patch works perfectly! Thanks a lot!
> Tested-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
> 
> And the reasoning behind the patch matches the test results:
> we don't allow select_nohz_load_balancer() to undo the cleanup that we
> did in sched_ilb_notifier(), by ensuring that sched_ilb_notifier() runs
> *after* sched_cpu_inactive().
> 
> So, you can have my "Reviewed-by" too, if you like!
> 
> By the way, it would be great if you could kindly describe the
> above-mentioned subtle aspect in the patch description as well.

Thanks Srivatsa and Sergey for the tests.

Ingo, Peterz, can you please queue this appended patch for 3.3?

thanks,
suresh
---

From: Suresh Siddha <suresh.b.siddha@intel.com>
Subject: sched, nohz: fix nohz cpu idle load balancing state with cpu hotplug

With the recent nohz scheduler changes, rq's nohz flag 'NOHZ_TICK_STOPPED'
and its associated state doesn't get cleared immediately after the
cpu exits idle. This gets cleared as part of the next tick seen on that cpu.

For the cpu offline support, we need to clear this state manually. Fix it by
registering a cpu notifier, which clears the nohz idle load balance
state for this rq explicitly during the CPU_DYING notification.

There won't be any nohz updates for that cpu after the CPU_DYING notification.
But let's be extra paranoid and skip updating the nohz state in
select_nohz_load_balancer() if the cpu is not in active state anymore.

Reported-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Reviewed-and-tested-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
 kernel/sched/fair.c |   34 +++++++++++++++++++++++++++++-----
 1 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 84adb2d..7c6414f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4866,6 +4866,15 @@ static void nohz_balancer_kick(int cpu)
 	return;
 }
 
+static inline void clear_nohz_tick_stopped(int cpu)
+{
+	if (unlikely(test_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu)))) {
+		cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
+		atomic_dec(&nohz.nr_cpus);
+		clear_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu));
+	}
+}
+
 static inline void set_cpu_sd_state_busy(void)
 {
 	struct sched_domain *sd;
@@ -4904,6 +4913,12 @@ void select_nohz_load_balancer(int stop_tick)
 {
 	int cpu = smp_processor_id();
 
+	/*
+	 * If this cpu is going down, then nothing needs to be done.
+	 */
+	if (!cpu_active(cpu))
+		return;
+
 	if (stop_tick) {
 		if (test_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu)))
 			return;
@@ -4914,6 +4929,18 @@ void select_nohz_load_balancer(int stop_tick)
 	}
 	return;
 }
+
+static int __cpuinit sched_ilb_notifier(struct notifier_block *nfb,
+					unsigned long action, void *hcpu)
+{
+	switch (action & ~CPU_TASKS_FROZEN) {
+	case CPU_DYING:
+		clear_nohz_tick_stopped(smp_processor_id());
+		return NOTIFY_OK;
+	default:
+		return NOTIFY_DONE;
+	}
+}
 #endif
 
 static DEFINE_SPINLOCK(balancing);
@@ -5070,11 +5097,7 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
 	* busy tick after returning from idle, we will update the busy stats.
 	*/
 	set_cpu_sd_state_busy();
-	if (unlikely(test_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu)))) {
-		clear_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu));
-		cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
-		atomic_dec(&nohz.nr_cpus);
-	}
+	clear_nohz_tick_stopped(cpu);
 
 	/*
 	 * None are in tickless mode and hence no need for NOHZ idle load
@@ -5590,6 +5613,7 @@ __init void init_sched_fair_class(void)
 
 #ifdef CONFIG_NO_HZ
 	zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);
+	cpu_notifier(sched_ilb_notifier, 0);
 #endif
 #endif /* SMP */
 




* Re: x86/mce: machine check warning during poweroff
  2012-01-20  2:28                           ` Suresh Siddha
@ 2012-01-23  8:43                             ` Peter Zijlstra
  2012-01-26 20:27                             ` [tip:sched/urgent] sched/nohz: Fix nohz cpu idle load balancing state with cpu hotplug tip-bot for Suresh Siddha
  1 sibling, 0 replies; 49+ messages in thread
From: Peter Zijlstra @ 2012-01-23  8:43 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Srivatsa S. Bhat, Ingo Molnar, Sergey Senozhatsky,
	Linus Torvalds, Ming Lei, Djalal Harouni, Borislav Petkov,
	Tony Luck, Hidetoshi Seto, Andi Kleen, linux-kernel,
	Greg Kroah-Hartman, Kay Sievers, gouders, Marcos Souza,
	Linux PM mailing list, Rafael J. Wysocki, tglx, prasad,
	justinmattock, Jeff Chua, Mel Gorman, Gilad Ben-Yossef

On Thu, 2012-01-19 at 18:28 -0800, Suresh Siddha wrote:
> Ingo, Peterz, can you please queue this appended patch for 3.3?

Thanks Suresh!


* [tip:sched/urgent] sched/nohz: Fix nohz cpu idle load balancing state with cpu hotplug
  2012-01-20  2:28                           ` Suresh Siddha
  2012-01-23  8:43                             ` Peter Zijlstra
@ 2012-01-26 20:27                             ` tip-bot for Suresh Siddha
  1 sibling, 0 replies; 49+ messages in thread
From: tip-bot for Suresh Siddha @ 2012-01-26 20:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, sergey.senozhatsky,
	srivatsa.bhat, suresh.b.siddha, tglx, mingo

Commit-ID:  71325960d16cd68ea0e22a8da15b2495b0f363f7
Gitweb:     http://git.kernel.org/tip/71325960d16cd68ea0e22a8da15b2495b0f363f7
Author:     Suresh Siddha <suresh.b.siddha@intel.com>
AuthorDate: Thu, 19 Jan 2012 18:28:57 -0800
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Thu, 26 Jan 2012 19:38:13 +0100

sched/nohz: Fix nohz cpu idle load balancing state with cpu hotplug

With the recent nohz scheduler changes, rq's nohz flag
'NOHZ_TICK_STOPPED' and its associated state doesn't get cleared
immediately after the cpu exits idle. This gets cleared as part
of the next tick seen on that cpu.

For the cpu offline support, we need to clear this state
manually. Fix it by registering a cpu notifier, which clears the
nohz idle load balance state for this rq explicitly during the
CPU_DYING notification.

There won't be any nohz updates for that cpu after the
CPU_DYING notification. But let's be extra paranoid and skip
updating the nohz state in select_nohz_load_balancer() if
the cpu is not in active state anymore.

Reported-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Reviewed-and-tested-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1327026538.16150.40.camel@sbsiddha-desk.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched/fair.c |   34 +++++++++++++++++++++++++++++-----
 1 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 84adb2d..7c6414f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4866,6 +4866,15 @@ static void nohz_balancer_kick(int cpu)
 	return;
 }
 
+static inline void clear_nohz_tick_stopped(int cpu)
+{
+	if (unlikely(test_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu)))) {
+		cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
+		atomic_dec(&nohz.nr_cpus);
+		clear_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu));
+	}
+}
+
 static inline void set_cpu_sd_state_busy(void)
 {
 	struct sched_domain *sd;
@@ -4904,6 +4913,12 @@ void select_nohz_load_balancer(int stop_tick)
 {
 	int cpu = smp_processor_id();
 
+	/*
+	 * If this cpu is going down, then nothing needs to be done.
+	 */
+	if (!cpu_active(cpu))
+		return;
+
 	if (stop_tick) {
 		if (test_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu)))
 			return;
@@ -4914,6 +4929,18 @@ void select_nohz_load_balancer(int stop_tick)
 	}
 	return;
 }
+
+static int __cpuinit sched_ilb_notifier(struct notifier_block *nfb,
+					unsigned long action, void *hcpu)
+{
+	switch (action & ~CPU_TASKS_FROZEN) {
+	case CPU_DYING:
+		clear_nohz_tick_stopped(smp_processor_id());
+		return NOTIFY_OK;
+	default:
+		return NOTIFY_DONE;
+	}
+}
 #endif
 
 static DEFINE_SPINLOCK(balancing);
@@ -5070,11 +5097,7 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
 	* busy tick after returning from idle, we will update the busy stats.
 	*/
 	set_cpu_sd_state_busy();
-	if (unlikely(test_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu)))) {
-		clear_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu));
-		cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
-		atomic_dec(&nohz.nr_cpus);
-	}
+	clear_nohz_tick_stopped(cpu);
 
 	/*
 	 * None are in tickless mode and hence no need for NOHZ idle load
@@ -5590,6 +5613,7 @@ __init void init_sched_fair_class(void)
 
 #ifdef CONFIG_NO_HZ
 	zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);
+	cpu_notifier(sched_ilb_notifier, 0);
 #endif
 #endif /* SMP */
 

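The bookkeeping that the patch above keeps consistent can be sketched as a small userspace model. This is only an illustration under stated assumptions, not the kernel code: the arrays tick_stopped[] and cpu_is_active[], the helpers cpu_dying() and check_nohz_invariant(), and the constant NR_CPUS are hypothetical stand-ins, the cpu is passed explicitly instead of coming from smp_processor_id(), and clearing the active state is folded into the dying step for brevity (in the kernel the cpu leaves the active state earlier in the offline sequence).

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4 /* hypothetical cpu count for this model */

/* Userspace stand-in for the global nohz idle load balance state. */
static struct {
	unsigned long idle_cpus_mask; /* bit n set: cpu n has stopped its tick */
	int nr_cpus;                  /* must equal the number of bits set above */
} nohz;

static bool tick_stopped[NR_CPUS];  /* models the per-rq NOHZ_TICK_STOPPED bit */
static bool cpu_is_active[NR_CPUS]; /* models cpu_active(cpu) */

/* Idempotent: undo the idle bookkeeping only if the flag is still set. */
static void clear_nohz_tick_stopped(int cpu)
{
	if (tick_stopped[cpu]) {
		nohz.idle_cpus_mask &= ~(1UL << cpu);
		nohz.nr_cpus--;
		tick_stopped[cpu] = false;
	}
}

/* Models the stop_tick path; a cpu that is going down is left alone. */
static void select_nohz_load_balancer(int cpu, int stop_tick)
{
	if (!cpu_is_active[cpu])
		return;

	if (stop_tick) {
		if (tick_stopped[cpu])
			return;
		nohz.idle_cpus_mask |= 1UL << cpu;
		nohz.nr_cpus++;
		tick_stopped[cpu] = true;
	}
}

/* Models the CPU_DYING step: the state is cleared by hand because
 * the dying cpu will never see the tick that would normally clear it. */
static void cpu_dying(int cpu)
{
	cpu_is_active[cpu] = false; /* simplification, see the note above */
	clear_nohz_tick_stopped(cpu);
}

/* Sanity check: the counter and the mask must always agree. */
static void check_nohz_invariant(void)
{
	int bits = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		bits += (int)((nohz.idle_cpus_mask >> cpu) & 1UL);
	assert(bits == nohz.nr_cpus);
}
```

In this model, a cpu that goes idle and is then taken offline leaves neither a stale bit in idle_cpus_mask nor a stale count in nr_cpus, and a later select_nohz_load_balancer() call for the inactive cpu changes nothing.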

end of thread (newest message: 2012-01-26 20:27 UTC)

Thread overview: 49+ messages
2012-01-11  0:00 x86/mce: machine check warning during poweroff Djalal Harouni
2012-01-12 14:22 ` Ming Lei
2012-01-13 20:22   ` Srivatsa S. Bhat
2012-01-13 20:34     ` Justin P. Mattock
2012-01-13 20:37     ` Linus Torvalds
2012-01-13 20:53       ` Srivatsa S. Bhat
2012-01-13 21:08         ` Linus Torvalds
2012-01-13 21:24           ` Andi Kleen
2012-01-13 21:38             ` Justin P. Mattock
2012-01-13 22:06               ` Srivatsa S. Bhat
2012-01-13 22:17                 ` Alan Stern
2012-01-13 22:18                 ` Srivatsa S. Bhat
2012-01-13 23:13             ` Andi Kleen
2012-01-14  0:44       ` Dirk Gouders
2012-01-13 23:02     ` Linus Torvalds
2012-01-13 23:27       ` Srivatsa S. Bhat
2012-01-14  0:05         ` Linus Torvalds
2012-01-14  2:41           ` Srivatsa S. Bhat
2012-01-14  2:53             ` Linus Torvalds
2012-01-14  3:05               ` Srivatsa S. Bhat
2012-01-14  3:10                 ` Linus Torvalds
2012-01-14  3:18                   ` Srivatsa S. Bhat
2012-01-14  3:41                     ` Linus Torvalds
2012-01-14  5:15                   ` Tony Luck
2012-01-14 14:49               ` Greg KH
2012-01-14 16:30                 ` Alan Stern
2012-01-14 20:45                   ` Jeff Chua
2012-01-15  2:05                   ` Tony Luck
2012-01-15  2:34                     ` Greg KH
2012-01-15  3:36                       ` Alan Stern
2012-01-16 18:15                         ` Greg KH
2012-01-16 18:11                 ` Greg KH
2012-01-16 18:27                   ` Luck, Tony
2012-01-16 18:34                     ` Greg KH
2012-01-16 18:42                   ` Kay Sievers
2012-01-17  2:21             ` Suresh Siddha
2012-01-17  9:52               ` Srivatsa S. Bhat
2012-01-17 16:15                 ` Jeff Chua
2012-01-17 16:36                   ` Srivatsa S. Bhat
2012-01-18  3:17                 ` Suresh Siddha
2012-01-18 10:19                   ` Srivatsa S. Bhat
2012-01-18 13:15                   ` Srivatsa S. Bhat
2012-01-18 13:32                     ` Sergey Senozhatsky
2012-01-18 22:08                       ` Suresh Siddha
2012-01-19  7:50                         ` Sergey Senozhatsky
2012-01-19 12:02                         ` Srivatsa S. Bhat
2012-01-20  2:28                           ` Suresh Siddha
2012-01-23  8:43                             ` Peter Zijlstra
2012-01-26 20:27                             ` [tip:sched/urgent] sched/nohz: Fix nohz cpu idle load balancing state with cpu hotplug tip-bot for Suresh Siddha
