linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* WARNING: at kernel/rcutree.c:1550 __rcu_process_callbacks+0x46f/0x4b0()
@ 2012-07-19 12:09 Srivatsa S. Bhat
  2012-07-19 17:15 ` Paul E. McKenney
  0 siblings, 1 reply; 20+ messages in thread
From: Srivatsa S. Bhat @ 2012-07-19 12:09 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: linux-kernel, Srivatsa S. Bhat

Hi Paul,

While running a CPU hotplug stress test on v3.5-rc7+
(mainline commit 8a7298b7805ab) I hit this warning.
I haven't tried to debug this yet...

Line number 1550 maps to:

WARN_ON_ONCE(cpu_is_offline(smp_processor_id()));

inside rcu_do_batch().

Regards,
Srivatsa S. Bhat


[ 1206.095436] CPU 3 is now offline
[ 1206.097086] CPU 3 offline: Remove Rx thread
[ 1206.099133] CPU 2 MCA banks CMCI:0 CMCI:1 CMCI:3
[ 1206.103750] ------------[ cut here ]------------
[ 1206.103758] WARNING: at kernel/rcutree.c:1550 __rcu_process_callbacks+0x46f/0x4b0()
[ 1206.103759] Hardware name: 4180F59
[ 1206.103759] Modules linked in: tcp_lp lp ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat xt_CHECKSUM iptable_mangle bridge stp llc ppdev parport_pc parport fuse be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser ip6t_ipv6header rdma_cm ib_cm iw_cm ib_sa lockd ip6t_REJECT ib_mad ib_core nf_conntrack_ipv6 nf_defrag_ipv6 xt_limit ib_addr iscsi_tcp nf_conntrack_ipv4 libiscsi_tcp libiscsi nf_defrag_ipv4 xt_state nf_conntrack scsi_transport_iscsi ip6table_filter ip6_tables snd_hda_codec_conexant snd_hda_intel snd_hda_codec snd_hwdep arc4 iwlwifi snd_seq snd_seq_device snd_pcm mac80211 snd_timer thinkpad_acpi snd sunrpc vhost_net macvtap uvcvideo macvlan tun videobuf2_core kvm_intel cfg80211 videodev media videobuf2_vmalloc tpm_tis kvm i2c_i801 e1000e joydev tpm videobuf2_memops soundcore snd_page_alloc rfkill lpc_ich mfd_core tpm_bios microcode binfmt_misc uinput sdhci_pci sdhci mmc_core nouveau i915 ttm drm_kms_helper drm mxm_wmi i2!
 c_algo_bit
 i2c_core video wmi
[ 1206.103829]  [last unloaded: scsi_wait_scan]
[ 1206.103832] Pid: 6217, comm: migration/1 Not tainted 3.5.0-rc7-mandeep #1
[ 1206.103833] Call Trace:
[ 1206.103834]  <IRQ>  [<ffffffff8106036f>] warn_slowpath_common+0x7f/0xc0
[ 1206.103840]  [<ffffffff810603ca>] warn_slowpath_null+0x1a/0x20
[ 1206.103843]  [<ffffffff81110d5f>] __rcu_process_callbacks+0x46f/0x4b0
[ 1206.103845]  [<ffffffff8108f894>] ? __run_hrtimer+0x84/0x3a0
[ 1206.103848]  [<ffffffff81697f47>] ? _raw_spin_unlock_irqrestore+0x77/0x80
[ 1206.103852]  [<ffffffff810998dd>] ? account_system_vtime+0xdd/0xf0
[ 1206.103855]  [<ffffffff81110dd3>] rcu_process_callbacks+0x33/0x290
[ 1206.103857]  [<ffffffff8106a638>] __do_softirq+0xc8/0x3e0
[ 1206.103860]  [<ffffffff8103bc2d>] ? lapic_next_event+0x1d/0x30
[ 1206.103863]  [<ffffffff816a26ec>] call_softirq+0x1c/0x30
[ 1206.103866]  [<ffffffff8101c365>] do_softirq+0xa5/0xe0
[ 1206.103870]  [<ffffffff8106acae>] irq_exit+0xbe/0xf0
[ 1206.103872]  [<ffffffff816a306e>] smp_apic_timer_interrupt+0x6e/0x99
[ 1206.103875]  [<ffffffff816a1cef>] apic_timer_interrupt+0x6f/0x80
[ 1206.103878]  <EOI>  [<ffffffff810ee2e3>] ? stop_machine_cpu_stop+0xc3/0x110
[ 1206.103883]  [<ffffffff810edf19>] cpu_stopper_thread+0xd9/0x1b0
[ 1206.103884]  [<ffffffff810ee220>] ? queue_stop_cpus_work+0xf0/0xf0
[ 1206.103893]  [<ffffffff816960b1>] ? __schedule+0x431/0xa20
[ 1206.103896]  [<ffffffff810cd21d>] ? trace_hardirqs_on_caller+0x11d/0x1b0
[ 1206.103900]  [<ffffffff810ede40>] ? cpu_stop_signal_done+0x40/0x40
[ 1206.103902]  [<ffffffff8108a86e>] kthread+0xae/0xc0
[ 1206.103905]  [<ffffffff810cd2bd>] ? trace_hardirqs_on+0xd/0x10
[ 1206.103908]  [<ffffffff816a25f4>] kernel_thread_helper+0x4/0x10
[ 1206.103911]  [<ffffffff816986f0>] ? retint_restore_args+0x13/0x13
[ 1206.103913]  [<ffffffff8108a7c0>] ? __init_kthread_worker+0x70/0x70
[ 1206.103916]  [<ffffffff816a25f0>] ? gs_change+0x13/0x13
[ 1206.103918] ---[ end trace cbc247c5c233d6e7 ]---
[ 1206.104125] CPU 1 is now offline


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: WARNING: at kernel/rcutree.c:1550 __rcu_process_callbacks+0x46f/0x4b0()
  2012-07-19 12:09 WARNING: at kernel/rcutree.c:1550 __rcu_process_callbacks+0x46f/0x4b0() Srivatsa S. Bhat
@ 2012-07-19 17:15 ` Paul E. McKenney
  2012-07-20 10:41   ` Srivatsa S. Bhat
  2012-09-12 12:36   ` WARNING: at kernel/rcutree.c:1558 rcu_do_batch+0x386/0x3a0(), during CPU hotplug Srivatsa S. Bhat
  0 siblings, 2 replies; 20+ messages in thread
From: Paul E. McKenney @ 2012-07-19 17:15 UTC (permalink / raw)
  To: Srivatsa S. Bhat; +Cc: linux-kernel

On Thu, Jul 19, 2012 at 05:39:30PM +0530, Srivatsa S. Bhat wrote:
> Hi Paul,
> 
> While running a CPU hotplug stress test on v3.5-rc7+
> (mainline commit 8a7298b7805ab) I hit this warning.
> I haven't tried to debug this yet...
> 
> Line number 1550 maps to:
> 
> WARN_ON_ONCE(cpu_is_offline(smp_processor_id()));
> 
> inside rcu_do_batch().

Hello, Srivatsa,

I believe that you need commit a16b7a69 (Prevent __call_rcu() from
invoking RCU core on offline CPUs), which is currently in -tip, queued
for 3.6.  Please see below for the patch.

Does this help?

							Thanx, Paul

------------------------------------------------------------------------

rcu: Prevent __call_rcu() from invoking RCU core on offline CPUs

The __call_rcu() function will invoke the RCU core, for example, if
it detects that the current CPU has too many callbacks.  However, this
can happen on an offline CPU that is on its way to the idle loop, in
which case it is an error to invoke the RCU core, and the excess callbacks
will be adopted in any case.  This commit therefore adds checks to
__call_rcu() for running on an offline CPU, refraining from invoking
the RCU core in this case.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index a4a9c91..ceaa959 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1904,11 +1904,11 @@ __call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu),
 	 * If called from an extended quiescent state, invoke the RCU
 	 * core in order to force a re-evaluation of RCU's idleness.
 	 */
-	if (rcu_is_cpu_idle())
+	if (rcu_is_cpu_idle() && cpu_online(smp_processor_id()))
 		invoke_rcu_core();
 
-	/* If interrupts were disabled, don't dive into RCU core. */
-	if (irqs_disabled_flags(flags)) {
+	/* If interrupts were disabled or CPU offline, don't invoke RCU core. */
+	if (irqs_disabled_flags(flags) || cpu_is_offline(smp_processor_id())) {
 		local_irq_restore(flags);
 		return;
 	}


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: WARNING: at kernel/rcutree.c:1550 __rcu_process_callbacks+0x46f/0x4b0()
  2012-07-19 17:15 ` Paul E. McKenney
@ 2012-07-20 10:41   ` Srivatsa S. Bhat
  2012-07-20 14:36     ` Paul E. McKenney
  2012-09-12 12:36   ` WARNING: at kernel/rcutree.c:1558 rcu_do_batch+0x386/0x3a0(), during CPU hotplug Srivatsa S. Bhat
  1 sibling, 1 reply; 20+ messages in thread
From: Srivatsa S. Bhat @ 2012-07-20 10:41 UTC (permalink / raw)
  To: paulmck; +Cc: linux-kernel

On 07/19/2012 10:45 PM, Paul E. McKenney wrote:
> On Thu, Jul 19, 2012 at 05:39:30PM +0530, Srivatsa S. Bhat wrote:
>> Hi Paul,
>>
>> While running a CPU hotplug stress test on v3.5-rc7+
>> (mainline commit 8a7298b7805ab) I hit this warning.
>> I haven't tried to debug this yet...
>>
>> Line number 1550 maps to:
>>
>> WARN_ON_ONCE(cpu_is_offline(smp_processor_id()));
>>
>> inside rcu_do_batch().
> 
> Hello, Srivatsa,
> 
> I believe that you need commit a16b7a69 (Prevent __call_rcu() from
> invoking RCU core on offline CPUs), which is currently in -tip, queued
> for 3.6.  Please see below for the patch.
> 
> Does this help?

Yes, that fixed it, thanks! :-)
Sorry for the noise...

Regards,
Srivatsa S. Bhat

> 
> ------------------------------------------------------------------------
> 
> rcu: Prevent __call_rcu() from invoking RCU core on offline CPUs
> 
> The __call_rcu() function will invoke the RCU core, for example, if
> it detects that the current CPU has too many callbacks.  However, this
> can happen on an offline CPU that is on its way to the idle loop, in
> which case it is an error to invoke the RCU core, and the excess callbacks
> will be adopted in any case.  This commit therefore adds checks to
> __call_rcu() for running on an offline CPU, refraining from invoking
> the RCU core in this case.
> 
> Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
> 
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index a4a9c91..ceaa959 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -1904,11 +1904,11 @@ __call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu),
>  	 * If called from an extended quiescent state, invoke the RCU
>  	 * core in order to force a re-evaluation of RCU's idleness.
>  	 */
> -	if (rcu_is_cpu_idle())
> +	if (rcu_is_cpu_idle() && cpu_online(smp_processor_id()))
>  		invoke_rcu_core();
> 
> -	/* If interrupts were disabled, don't dive into RCU core. */
> -	if (irqs_disabled_flags(flags)) {
> +	/* If interrupts were disabled or CPU offline, don't invoke RCU core. */
> +	if (irqs_disabled_flags(flags) || cpu_is_offline(smp_processor_id())) {
>  		local_irq_restore(flags);
>  		return;
>  	}
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: WARNING: at kernel/rcutree.c:1550 __rcu_process_callbacks+0x46f/0x4b0()
  2012-07-20 10:41   ` Srivatsa S. Bhat
@ 2012-07-20 14:36     ` Paul E. McKenney
  2012-07-20 14:57       ` Srivatsa S. Bhat
  0 siblings, 1 reply; 20+ messages in thread
From: Paul E. McKenney @ 2012-07-20 14:36 UTC (permalink / raw)
  To: Srivatsa S. Bhat; +Cc: linux-kernel

On Fri, Jul 20, 2012 at 04:11:00PM +0530, Srivatsa S. Bhat wrote:
> On 07/19/2012 10:45 PM, Paul E. McKenney wrote:
> > On Thu, Jul 19, 2012 at 05:39:30PM +0530, Srivatsa S. Bhat wrote:
> >> Hi Paul,
> >>
> >> While running a CPU hotplug stress test on v3.5-rc7+
> >> (mainline commit 8a7298b7805ab) I hit this warning.
> >> I haven't tried to debug this yet...
> >>
> >> Line number 1550 maps to:
> >>
> >> WARN_ON_ONCE(cpu_is_offline(smp_processor_id()));
> >>
> >> inside rcu_do_batch().
> > 
> > Hello, Srivatsa,
> > 
> > I believe that you need commit a16b7a69 (Prevent __call_rcu() from
> > invoking RCU core on offline CPUs), which is currently in -tip, queued
> > for 3.6.  Please see below for the patch.
> > 
> > Does this help?
> 
> Yes, that fixed it, thanks! :-)
> Sorry for the noise...

May I add your Tested-by?

							Thanx, Paul

> Regards,
> Srivatsa S. Bhat
> 
> > 
> > ------------------------------------------------------------------------
> > 
> > rcu: Prevent __call_rcu() from invoking RCU core on offline CPUs
> > 
> > The __call_rcu() function will invoke the RCU core, for example, if
> > it detects that the current CPU has too many callbacks.  However, this
> > can happen on an offline CPU that is on its way to the idle loop, in
> > which case it is an error to invoke the RCU core, and the excess callbacks
> > will be adopted in any case.  This commit therefore adds checks to
> > __call_rcu() for running on an offline CPU, refraining from invoking
> > the RCU core in this case.
> > 
> > Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > Reviewed-by: Josh Triplett <josh@joshtriplett.org>
> > 
> > diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> > index a4a9c91..ceaa959 100644
> > --- a/kernel/rcutree.c
> > +++ b/kernel/rcutree.c
> > @@ -1904,11 +1904,11 @@ __call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu),
> >  	 * If called from an extended quiescent state, invoke the RCU
> >  	 * core in order to force a re-evaluation of RCU's idleness.
> >  	 */
> > -	if (rcu_is_cpu_idle())
> > +	if (rcu_is_cpu_idle() && cpu_online(smp_processor_id()))
> >  		invoke_rcu_core();
> > 
> > -	/* If interrupts were disabled, don't dive into RCU core. */
> > -	if (irqs_disabled_flags(flags)) {
> > +	/* If interrupts were disabled or CPU offline, don't invoke RCU core. */
> > +	if (irqs_disabled_flags(flags) || cpu_is_offline(smp_processor_id())) {
> >  		local_irq_restore(flags);
> >  		return;
> >  	}
> > 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: WARNING: at kernel/rcutree.c:1550 __rcu_process_callbacks+0x46f/0x4b0()
  2012-07-20 14:36     ` Paul E. McKenney
@ 2012-07-20 14:57       ` Srivatsa S. Bhat
  0 siblings, 0 replies; 20+ messages in thread
From: Srivatsa S. Bhat @ 2012-07-20 14:57 UTC (permalink / raw)
  To: paulmck; +Cc: linux-kernel

On 07/20/2012 08:06 PM, Paul E. McKenney wrote:
> On Fri, Jul 20, 2012 at 04:11:00PM +0530, Srivatsa S. Bhat wrote:
>> On 07/19/2012 10:45 PM, Paul E. McKenney wrote:
>>> On Thu, Jul 19, 2012 at 05:39:30PM +0530, Srivatsa S. Bhat wrote:
>>>> Hi Paul,
>>>>
>>>> While running a CPU hotplug stress test on v3.5-rc7+
>>>> (mainline commit 8a7298b7805ab) I hit this warning.
>>>> I haven't tried to debug this yet...
>>>>
>>>> Line number 1550 maps to:
>>>>
>>>> WARN_ON_ONCE(cpu_is_offline(smp_processor_id()));
>>>>
>>>> inside rcu_do_batch().
>>>
>>> Hello, Srivatsa,
>>>
>>> I believe that you need commit a16b7a69 (Prevent __call_rcu() from
>>> invoking RCU core on offline CPUs), which is currently in -tip, queued
>>> for 3.6.  Please see below for the patch.
>>>
>>> Does this help?
>>
>> Yes, that fixed it, thanks! :-)
>> Sorry for the noise...
> 
> May I add your Tested-by?

Of course, go ahead! :-)

Regards,
Srivatsa S. Bhat

>>
>>>
>>> ------------------------------------------------------------------------
>>>
>>> rcu: Prevent __call_rcu() from invoking RCU core on offline CPUs
>>>
>>> The __call_rcu() function will invoke the RCU core, for example, if
>>> it detects that the current CPU has too many callbacks.  However, this
>>> can happen on an offline CPU that is on its way to the idle loop, in
>>> which case it is an error to invoke the RCU core, and the excess callbacks
>>> will be adopted in any case.  This commit therefore adds checks to
>>> __call_rcu() for running on an offline CPU, refraining from invoking
>>> the RCU core in this case.
>>>
>>> Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
>>> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>>> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
>>>
>>> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
>>> index a4a9c91..ceaa959 100644
>>> --- a/kernel/rcutree.c
>>> +++ b/kernel/rcutree.c
>>> @@ -1904,11 +1904,11 @@ __call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu),
>>>  	 * If called from an extended quiescent state, invoke the RCU
>>>  	 * core in order to force a re-evaluation of RCU's idleness.
>>>  	 */
>>> -	if (rcu_is_cpu_idle())
>>> +	if (rcu_is_cpu_idle() && cpu_online(smp_processor_id()))
>>>  		invoke_rcu_core();
>>>
>>> -	/* If interrupts were disabled, don't dive into RCU core. */
>>> -	if (irqs_disabled_flags(flags)) {
>>> +	/* If interrupts were disabled or CPU offline, don't invoke RCU core. */
>>> +	if (irqs_disabled_flags(flags) || cpu_is_offline(smp_processor_id())) {
>>>  		local_irq_restore(flags);
>>>  		return;
>>>  	}
>>>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* WARNING: at kernel/rcutree.c:1558 rcu_do_batch+0x386/0x3a0(), during CPU hotplug
  2012-07-19 17:15 ` Paul E. McKenney
  2012-07-20 10:41   ` Srivatsa S. Bhat
@ 2012-09-12 12:36   ` Srivatsa S. Bhat
  2012-09-12 15:31     ` Paul E. McKenney
  2012-09-13  8:35     ` Srivatsa S. Bhat
  1 sibling, 2 replies; 20+ messages in thread
From: Srivatsa S. Bhat @ 2012-09-12 12:36 UTC (permalink / raw)
  To: paulmck; +Cc: linux-kernel, Thomas Gleixner, rusty, Peter Zijlstra, Tejun Heo

On 07/19/2012 10:45 PM, Paul E. McKenney wrote:
> On Thu, Jul 19, 2012 at 05:39:30PM +0530, Srivatsa S. Bhat wrote:
>> Hi Paul,
>>
>> While running a CPU hotplug stress test on v3.5-rc7+
>> (mainline commit 8a7298b7805ab) I hit this warning.
>> I haven't tried to debug this yet...
>>
>> Line number 1550 maps to:
>>
>> WARN_ON_ONCE(cpu_is_offline(smp_processor_id()));
>>
>> inside rcu_do_batch().
> 
> Hello, Srivatsa,
> 
> I believe that you need commit a16b7a69 (Prevent __call_rcu() from
> invoking RCU core on offline CPUs), which is currently in -tip, queued
> for 3.6.  Please see below for the patch.
> 
> Does this help?
> 

Hi Paul,

I am hitting the cpu_is_offline() warning in rcu_do_batch() (see 2 of the
examples below) occasionally while testing CPU hotplug on Thomas' smp/hotplug
branch in -tip. It does contain the commit that you had mentioned above.

The stack trace suggests that we are not hitting this from the __call_rcu()
path. So I guess this needs a different fix?

Regards,
Srivatsa S. Bhat

[   53.882344] smpboot: CPU 7 is now offline
[   53.891072] CPU 12 MCA banks CMCI:6 CMCI:8
[   53.895621] CPU 15 MCA banks CMCI:2 CMCI:3 CMCI:5
[   53.914738] Broke affinity for irq 81
[   53.917769] do_IRQ: 8.211 No irq handler for vector (irq -1) 
[   53.917769] ------------[ cut here ]------------
[   53.917769] WARNING: at kernel/rcutree.c:1558 rcu_do_batch+0x386/0x3a0()
[   53.917769] Hardware name: IBM System x -[7870C4Q]-
[   53.917769] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse loop dm_mod iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm cdc_ether usbnet ioatdma lpc_ich mfd_core crc32c_intel microcode mii pcspkr i2c_i801 shpchp i2c_core serio_raw bnx2 tpm_tis dca tpm pci_hotplug i7core_edac tpm_bios edac_core sg rtc_cmos button uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
[   53.917769] Pid: 47, comm: migration/8 Not tainted 3.6.0-rc1-tglx-hotplug-0.0.0.28.36b5ec9-default #1
[   53.917769] Call Trace:
[   53.917769]  <IRQ>  [<ffffffff810e7806>] ? rcu_do_batch+0x386/0x3a0
[   53.917769]  [<ffffffff810e7806>] ? rcu_do_batch+0x386/0x3a0
[   53.917769]  [<ffffffff8104338a>] warn_slowpath_common+0x7a/0xb0
[   53.917769]  [<ffffffff810433d5>] warn_slowpath_null+0x15/0x20
[   53.917769]  [<ffffffff810e7806>] rcu_do_batch+0x386/0x3a0
[   53.917769]  [<ffffffff810ac8a0>] ? trace_hardirqs_on_caller+0x70/0x1b0
[   53.917769]  [<ffffffff810ac9ed>] ? trace_hardirqs_on+0xd/0x10
[   53.917769]  [<ffffffff810e9143>] __rcu_process_callbacks+0x1a3/0x200
[   53.917769]  [<ffffffff810e9228>] rcu_process_callbacks+0x88/0x240
[   53.917769]  [<ffffffff8104dc79>] __do_softirq+0x159/0x400
[   53.917769]  [<ffffffff814c627c>] call_softirq+0x1c/0x30
[   53.917769]  [<ffffffff810044f5>] do_softirq+0x95/0xd0
[   53.917769]  [<ffffffff8104d745>] irq_exit+0xe5/0x100
[   53.917769]  [<ffffffff81003c14>] do_IRQ+0x64/0xe0
[   53.917769]  [<ffffffff814bc12f>] common_interrupt+0x6f/0x6f
[   53.917769]  <EOI>  [<ffffffff810cfe7a>] ? stop_machine_cpu_stop+0xda/0x130
[   53.917769]  [<ffffffff810cfda0>] ? stop_one_cpu_nowait+0x50/0x50
[   53.917769]  [<ffffffff810cfab9>] cpu_stopper_thread+0xd9/0x1b0
[   53.917769]  [<ffffffff814bbe4f>] ? _raw_spin_unlock_irqrestore+0x3f/0x80
[   53.917769]  [<ffffffff810cf9e0>] ? res_counter_init+0x50/0x50
[   53.917769]  [<ffffffff810ac95d>] ? trace_hardirqs_on_caller+0x12d/0x1b0
[   53.917769]  [<ffffffff810ac9ed>] ? trace_hardirqs_on+0xd/0x10
[   53.917769]  [<ffffffff810cf9e0>] ? res_counter_init+0x50/0x50
[   53.917769]  [<ffffffff8106deae>] kthread+0xde/0xf0
[   53.917769]  [<ffffffff814c6184>] kernel_thread_helper+0x4/0x10
[   53.917769]  [<ffffffff814bc1f0>] ? retint_restore_args+0x13/0x13
[   53.917769]  [<ffffffff8106ddd0>] ? __init_kthread_worker+0x70/0x70
[   53.917769]  [<ffffffff814c6180>] ? gs_change+0x13/0x13
[   53.917769] ---[ end trace f60a282810c4ce78 ]---
[   54.170634] smpboot: CPU 8 is now offline
[   54.192259] NOHZ: local_softirq_pending 200 
[   54.197936] smpboot: CPU 9 is now offline
[   54.219707] NOHZ: local_softirq_pending 200 
[   54.225795] smpboot: CPU 10 is now offline


---

[  372.482434] smpboot: CPU 11 is now offline
[  372.534211] smpboot: CPU 12 is now offline
[  372.539786] CPU 13 MCA banks CMCI:6 CMCI:8
[  372.582474] smpboot: CPU 13 is now offline
[  372.591006] CPU 14 MCA banks CMCI:6 CMCI:8
[  372.629745] ------------[ cut here ]------------
[  372.633736] WARNING: at kernel/rcutree.c:1558 rcu_do_batch+0x386/0x3a0()
[  372.633736] Hardware name: IBM System x -[7870C4Q]-
[  372.633736] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse loop dm_mod iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel
microcode serio_raw tpm_tis i7core_edac cdc_ether usbnet ioatdma mii lpc_ich pcspkr edac_core mfd_core bnx2 shpchp pci_hotplug i2c_i801 i2c_core dca tpm sg tpm_bios rtc_cmos button uhci_hcd ehci_hcd
usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
[  372.633736] Pid: 8625, comm: migration/14 Not tainted 3.6.0-rc1-tglx-hotplug-0.0.0.28.36b5ec9-default #1
[  372.633736] Call Trace:
[  372.633736]  <IRQ>  [<ffffffff810e7806>] ? rcu_do_batch+0x386/0x3a0
[  372.633736]  [<ffffffff810e7806>] ? rcu_do_batch+0x386/0x3a0
[  372.633736]  [<ffffffff8104338a>] warn_slowpath_common+0x7a/0xb0
[  372.633736]  [<ffffffff810433d5>] warn_slowpath_null+0x15/0x20
[  372.633736]  [<ffffffff810e7806>] rcu_do_batch+0x386/0x3a0
[  372.633736]  [<ffffffff810ac8a0>] ? trace_hardirqs_on_caller+0x70/0x1b0
[  372.633736]  [<ffffffff810ac9ed>] ? trace_hardirqs_on+0xd/0x10
[  372.633736]  [<ffffffff810e9143>] __rcu_process_callbacks+0x1a3/0x200
[  372.633736]  [<ffffffff810e9228>] rcu_process_callbacks+0x88/0x240
[  372.633736]  [<ffffffff8104dc79>] __do_softirq+0x159/0x400
[  372.633736]  [<ffffffff814c627c>] call_softirq+0x1c/0x30
[  372.633736]  [<ffffffff810044f5>] do_softirq+0x95/0xd0
[  372.633736]  [<ffffffff8104d745>] irq_exit+0xe5/0x100
[  372.633736]  [<ffffffff81028df9>] smp_apic_timer_interrupt+0x69/0xa0
[  372.633736]  [<ffffffff814c5aef>] apic_timer_interrupt+0x6f/0x80
[  372.633736]  <EOI>  [<ffffffff810cfe7a>] ? stop_machine_cpu_stop+0xda/0x130
[  372.633736]  [<ffffffff810cfda0>] ? stop_one_cpu_nowait+0x50/0x50
[  372.633736]  [<ffffffff810cfab9>] cpu_stopper_thread+0xd9/0x1b0
[  372.633736]  [<ffffffff814bbe4f>] ? _raw_spin_unlock_irqrestore+0x3f/0x80
[  372.633736]  [<ffffffff810cf9e0>] ? res_counter_init+0x50/0x50
[  372.633736]  [<ffffffff810ac95d>] ? trace_hardirqs_on_caller+0x12d/0x1b0
[  372.633736]  [<ffffffff810ac9ed>] ? trace_hardirqs_on+0xd/0x10
[  372.633736]  [<ffffffff810cf9e0>] ? res_counter_init+0x50/0x50
[  372.633736]  [<ffffffff8106deae>] kthread+0xde/0xf0
[  372.633736]  [<ffffffff814c6184>] kernel_thread_helper+0x4/0x10
[  372.633736]  [<ffffffff814bc1f0>] ? retint_restore_args+0x13/0x13
[  372.633736]  [<ffffffff8106ddd0>] ? __init_kthread_worker+0x70/0x70
[  372.633736]  [<ffffffff814c6180>] ? gs_change+0x13/0x13
[  372.633736] ---[ end trace a4296a31284c846d ]---
[  372.883063] smpboot: CPU 14 is now offline
[  372.892721] CPU 15 MCA banks CMCI:6 CMCI:8
[  372.907250] smpboot: CPU 15 is now offline
[  372.911545] SMP alternatives: lockdep: fixing up alternatives
[  372.917292] SMP alternatives: switching to UP code
[  372.941917] SMP alternatives: lockdep: fixing up alternatives


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: WARNING: at kernel/rcutree.c:1558 rcu_do_batch+0x386/0x3a0(), during CPU hotplug
  2012-09-12 12:36   ` WARNING: at kernel/rcutree.c:1558 rcu_do_batch+0x386/0x3a0(), during CPU hotplug Srivatsa S. Bhat
@ 2012-09-12 15:31     ` Paul E. McKenney
  2012-09-13  6:30       ` Michael Wang
  2012-09-13  8:35     ` Srivatsa S. Bhat
  1 sibling, 1 reply; 20+ messages in thread
From: Paul E. McKenney @ 2012-09-12 15:31 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: linux-kernel, Thomas Gleixner, rusty, Peter Zijlstra, Tejun Heo

On Wed, Sep 12, 2012 at 06:06:20PM +0530, Srivatsa S. Bhat wrote:
> On 07/19/2012 10:45 PM, Paul E. McKenney wrote:
> > On Thu, Jul 19, 2012 at 05:39:30PM +0530, Srivatsa S. Bhat wrote:
> >> Hi Paul,
> >>
> >> While running a CPU hotplug stress test on v3.5-rc7+
> >> (mainline commit 8a7298b7805ab) I hit this warning.
> >> I haven't tried to debug this yet...
> >>
> >> Line number 1550 maps to:
> >>
> >> WARN_ON_ONCE(cpu_is_offline(smp_processor_id()));
> >>
> >> inside rcu_do_batch().
> > 
> > Hello, Srivatsa,
> > 
> > I believe that you need commit a16b7a69 (Prevent __call_rcu() from
> > invoking RCU core on offline CPUs), which is currently in -tip, queued
> > for 3.6.  Please see below for the patch.
> > 
> > Does this help?
> > 
> 
> Hi Paul,
> 
> I am hitting the cpu_is_offline() warning in rcu_do_batch() (see 2 of the
> examples below) occasionally while testing CPU hotplug on Thomas' smp/hotplug
> branch in -tip. It does contain the commit that you had mentioned above.
> 
> The stack trace suggests that we are not hitting this from the __call_rcu()
> path. So I guess this needs a different fix?

So there was an interrupt from stop_machine_stop().  Because RCU complained
about offline, I presume that this was on exit from stop_machine_stop().
(Otherwise, on entry to stop_machine_stop(), the CPU has not yet marked
itself offline, right?)

So my question is:  Why didn't the CPU shut off all interrupts before
coming out of stop_machine_stop()?

Or am I confused about what is really happening here?

							Thanx, Paul

> Regards,
> Srivatsa S. Bhat
> 
> [   53.882344] smpboot: CPU 7 is now offline
> [   53.891072] CPU 12 MCA banks CMCI:6 CMCI:8
> [   53.895621] CPU 15 MCA banks CMCI:2 CMCI:3 CMCI:5
> [   53.914738] Broke affinity for irq 81
> [   53.917769] do_IRQ: 8.211 No irq handler for vector (irq -1) 
> [   53.917769] ------------[ cut here ]------------
> [   53.917769] WARNING: at kernel/rcutree.c:1558 rcu_do_batch+0x386/0x3a0()
> [   53.917769] Hardware name: IBM System x -[7870C4Q]-
> [   53.917769] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse loop dm_mod iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm cdc_ether usbnet ioatdma lpc_ich mfd_core crc32c_intel microcode mii pcspkr i2c_i801 shpchp i2c_core serio_raw bnx2 tpm_tis dca tpm pci_hotplug i7core_edac tpm_bios edac_core sg rtc_cmos button uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
> [   53.917769] Pid: 47, comm: migration/8 Not tainted 3.6.0-rc1-tglx-hotplug-0.0.0.28.36b5ec9-default #1
> [   53.917769] Call Trace:
> [   53.917769]  <IRQ>  [<ffffffff810e7806>] ? rcu_do_batch+0x386/0x3a0
> [   53.917769]  [<ffffffff810e7806>] ? rcu_do_batch+0x386/0x3a0
> [   53.917769]  [<ffffffff8104338a>] warn_slowpath_common+0x7a/0xb0
> [   53.917769]  [<ffffffff810433d5>] warn_slowpath_null+0x15/0x20
> [   53.917769]  [<ffffffff810e7806>] rcu_do_batch+0x386/0x3a0
> [   53.917769]  [<ffffffff810ac8a0>] ? trace_hardirqs_on_caller+0x70/0x1b0
> [   53.917769]  [<ffffffff810ac9ed>] ? trace_hardirqs_on+0xd/0x10
> [   53.917769]  [<ffffffff810e9143>] __rcu_process_callbacks+0x1a3/0x200
> [   53.917769]  [<ffffffff810e9228>] rcu_process_callbacks+0x88/0x240
> [   53.917769]  [<ffffffff8104dc79>] __do_softirq+0x159/0x400
> [   53.917769]  [<ffffffff814c627c>] call_softirq+0x1c/0x30
> [   53.917769]  [<ffffffff810044f5>] do_softirq+0x95/0xd0
> [   53.917769]  [<ffffffff8104d745>] irq_exit+0xe5/0x100
> [   53.917769]  [<ffffffff81003c14>] do_IRQ+0x64/0xe0
> [   53.917769]  [<ffffffff814bc12f>] common_interrupt+0x6f/0x6f
> [   53.917769]  <EOI>  [<ffffffff810cfe7a>] ? stop_machine_cpu_stop+0xda/0x130
> [   53.917769]  [<ffffffff810cfda0>] ? stop_one_cpu_nowait+0x50/0x50
> [   53.917769]  [<ffffffff810cfab9>] cpu_stopper_thread+0xd9/0x1b0
> [   53.917769]  [<ffffffff814bbe4f>] ? _raw_spin_unlock_irqrestore+0x3f/0x80
> [   53.917769]  [<ffffffff810cf9e0>] ? res_counter_init+0x50/0x50
> [   53.917769]  [<ffffffff810ac95d>] ? trace_hardirqs_on_caller+0x12d/0x1b0
> [   53.917769]  [<ffffffff810ac9ed>] ? trace_hardirqs_on+0xd/0x10
> [   53.917769]  [<ffffffff810cf9e0>] ? res_counter_init+0x50/0x50
> [   53.917769]  [<ffffffff8106deae>] kthread+0xde/0xf0
> [   53.917769]  [<ffffffff814c6184>] kernel_thread_helper+0x4/0x10
> [   53.917769]  [<ffffffff814bc1f0>] ? retint_restore_args+0x13/0x13
> [   53.917769]  [<ffffffff8106ddd0>] ? __init_kthread_worker+0x70/0x70
> [   53.917769]  [<ffffffff814c6180>] ? gs_change+0x13/0x13
> [   53.917769] ---[ end trace f60a282810c4ce78 ]---
> [   54.170634] smpboot: CPU 8 is now offline
> [   54.192259] NOHZ: local_softirq_pending 200 
> [   54.197936] smpboot: CPU 9 is now offline
> [   54.219707] NOHZ: local_softirq_pending 200 
> [   54.225795] smpboot: CPU 10 is now offline
> 
> 
> ---
> 
> [  372.482434] smpboot: CPU 11 is now offline
> [  372.534211] smpboot: CPU 12 is now offline
> [  372.539786] CPU 13 MCA banks CMCI:6 CMCI:8
> [  372.582474] smpboot: CPU 13 is now offline
> [  372.591006] CPU 14 MCA banks CMCI:6 CMCI:8
> [  372.629745] ------------[ cut here ]------------
> [  372.633736] WARNING: at kernel/rcutree.c:1558 rcu_do_batch+0x386/0x3a0()
> [  372.633736] Hardware name: IBM System x -[7870C4Q]-
> [  372.633736] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse loop dm_mod iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel
> microcode serio_raw tpm_tis i7core_edac cdc_ether usbnet ioatdma mii lpc_ich pcspkr edac_core mfd_core bnx2 shpchp pci_hotplug i2c_i801 i2c_core dca tpm sg tpm_bios rtc_cmos button uhci_hcd ehci_hcd
> usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
> [  372.633736] Pid: 8625, comm: migration/14 Not tainted 3.6.0-rc1-tglx-hotplug-0.0.0.28.36b5ec9-default #1
> [  372.633736] Call Trace:
> [  372.633736]  <IRQ>  [<ffffffff810e7806>] ? rcu_do_batch+0x386/0x3a0
> [  372.633736]  [<ffffffff810e7806>] ? rcu_do_batch+0x386/0x3a0
> [  372.633736]  [<ffffffff8104338a>] warn_slowpath_common+0x7a/0xb0
> [  372.633736]  [<ffffffff810433d5>] warn_slowpath_null+0x15/0x20
> [  372.633736]  [<ffffffff810e7806>] rcu_do_batch+0x386/0x3a0
> [  372.633736]  [<ffffffff810ac8a0>] ? trace_hardirqs_on_caller+0x70/0x1b0
> [  372.633736]  [<ffffffff810ac9ed>] ? trace_hardirqs_on+0xd/0x10
> [  372.633736]  [<ffffffff810e9143>] __rcu_process_callbacks+0x1a3/0x200
> [  372.633736]  [<ffffffff810e9228>] rcu_process_callbacks+0x88/0x240
> [  372.633736]  [<ffffffff8104dc79>] __do_softirq+0x159/0x400
> [  372.633736]  [<ffffffff814c627c>] call_softirq+0x1c/0x30
> [  372.633736]  [<ffffffff810044f5>] do_softirq+0x95/0xd0
> [  372.633736]  [<ffffffff8104d745>] irq_exit+0xe5/0x100
> [  372.633736]  [<ffffffff81028df9>] smp_apic_timer_interrupt+0x69/0xa0
> [  372.633736]  [<ffffffff814c5aef>] apic_timer_interrupt+0x6f/0x80
> [  372.633736]  <EOI>  [<ffffffff810cfe7a>] ? stop_machine_cpu_stop+0xda/0x130
> [  372.633736]  [<ffffffff810cfda0>] ? stop_one_cpu_nowait+0x50/0x50
> [  372.633736]  [<ffffffff810cfab9>] cpu_stopper_thread+0xd9/0x1b0
> [  372.633736]  [<ffffffff814bbe4f>] ? _raw_spin_unlock_irqrestore+0x3f/0x80
> [  372.633736]  [<ffffffff810cf9e0>] ? res_counter_init+0x50/0x50
> [  372.633736]  [<ffffffff810ac95d>] ? trace_hardirqs_on_caller+0x12d/0x1b0
> [  372.633736]  [<ffffffff810ac9ed>] ? trace_hardirqs_on+0xd/0x10
> [  372.633736]  [<ffffffff810cf9e0>] ? res_counter_init+0x50/0x50
> [  372.633736]  [<ffffffff8106deae>] kthread+0xde/0xf0
> [  372.633736]  [<ffffffff814c6184>] kernel_thread_helper+0x4/0x10
> [  372.633736]  [<ffffffff814bc1f0>] ? retint_restore_args+0x13/0x13
> [  372.633736]  [<ffffffff8106ddd0>] ? __init_kthread_worker+0x70/0x70
> [  372.633736]  [<ffffffff814c6180>] ? gs_change+0x13/0x13
> [  372.633736] ---[ end trace a4296a31284c846d ]---
> [  372.883063] smpboot: CPU 14 is now offline
> [  372.892721] CPU 15 MCA banks CMCI:6 CMCI:8
> [  372.907250] smpboot: CPU 15 is now offline
> [  372.911545] SMP alternatives: lockdep: fixing up alternatives
> [  372.917292] SMP alternatives: switching to UP code
> [  372.941917] SMP alternatives: lockdep: fixing up alternatives
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: WARNING: at kernel/rcutree.c:1558 rcu_do_batch+0x386/0x3a0(), during CPU hotplug
  2012-09-12 15:31     ` Paul E. McKenney
@ 2012-09-13  6:30       ` Michael Wang
  2012-09-13 12:47         ` Srivatsa S. Bhat
  0 siblings, 1 reply; 20+ messages in thread
From: Michael Wang @ 2012-09-13  6:30 UTC (permalink / raw)
  To: paulmck
  Cc: Srivatsa S. Bhat, linux-kernel, Thomas Gleixner, rusty,
	Peter Zijlstra, Tejun Heo

On 09/12/2012 11:31 PM, Paul E. McKenney wrote:
> On Wed, Sep 12, 2012 at 06:06:20PM +0530, Srivatsa S. Bhat wrote:
>> On 07/19/2012 10:45 PM, Paul E. McKenney wrote:
>>> On Thu, Jul 19, 2012 at 05:39:30PM +0530, Srivatsa S. Bhat wrote:
>>>> Hi Paul,
>>>>
>>>> While running a CPU hotplug stress test on v3.5-rc7+
>>>> (mainline commit 8a7298b7805ab) I hit this warning.
>>>> I haven't tried to debug this yet...
>>>>
>>>> Line number 1550 maps to:
>>>>
>>>> WARN_ON_ONCE(cpu_is_offline(smp_processor_id()));
>>>>
>>>> inside rcu_do_batch().
>>>
>>> Hello, Srivatsa,
>>>
>>> I believe that you need commit a16b7a69 (Prevent __call_rcu() from
>>> invoking RCU core on offline CPUs), which is currently in -tip, queued
>>> for 3.6.  Please see below for the patch.
>>>
>>> Does this help?
>>>
>>
>> Hi Paul,
>>
>> I am hitting the cpu_is_offline() warning in rcu_do_batch() (see 2 of the
>> examples below) occasionally while testing CPU hotplug on Thomas' smp/hotplug
>> branch in -tip. It does contain the commit that you had mentioned above.
>>
>> The stack trace suggests that we are not hitting this from the __call_rcu()
>> path. So I guess this needs a different fix?
> 
> So there was an interrupt from stop_machine_stop().  Because RCU complained
> about offline, I presume that this was on exit from stop_machine_stop().
> (Otherwise, on entry to stop_machine_stop(), the CPU has not yet marked
> itself offline, right?)
> 
> So my question is:  Why didn't the CPU shut off all interrupts before
> coming out of stop_machine_stop()?
> 
> Or am I confused about what is really happening here?

I think Srivatsa may need patch a3716d2e too:

commit a3716d2e5a50a9ed5268ae3d3c2f093968ff236a
Author: Paul E. McKenney <paul.mckenney@linaro.org>
Date:   Thu Jun 21 09:54:10 2012 -0700

    rcu: Prevent offline CPUs from executing RCU core code
    
    Earlier versions of RCU invoked the RCU core from the CPU_DYING notifier
    in order to note a quiescent state for the outgoing CPU.  Because the
    CPU is marked "offline" during the execution of the CPU_DYING notifiers,
    the RCU core had to tolerate being invoked from an offline CPU.  However,
    commit b1420f1c (Make rcu_barrier() less disruptive) left only tracing
    code in the CPU_DYING notifier, so the RCU core need no longer execute
    on offline CPUs.  This commit therefore enforces this restriction.
    
    Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 300aba6..84a6f55 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1892,6 +1892,8 @@ static void rcu_process_callbacks(struct softirq_action *unused)
 {
        struct rcu_state *rsp;
 
+       if (cpu_is_offline(smp_processor_id()))
+               return;
        trace_rcu_utilization("Start RCU core");
        for_each_rcu_flavor(rsp)
                __rcu_process_callbacks(rsp);

With out it, the RCU_SOFTIRQ raised before offline won't
have a chance to return.

Regards,
Michael Wang
> 
> 							Thanx, Paul
> 
>> Regards,
>> Srivatsa S. Bhat
>>
>> [   53.882344] smpboot: CPU 7 is now offline
>> [   53.891072] CPU 12 MCA banks CMCI:6 CMCI:8
>> [   53.895621] CPU 15 MCA banks CMCI:2 CMCI:3 CMCI:5
>> [   53.914738] Broke affinity for irq 81
>> [   53.917769] do_IRQ: 8.211 No irq handler for vector (irq -1) 
>> [   53.917769] ------------[ cut here ]------------
>> [   53.917769] WARNING: at kernel/rcutree.c:1558 rcu_do_batch+0x386/0x3a0()
>> [   53.917769] Hardware name: IBM System x -[7870C4Q]-
>> [   53.917769] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse loop dm_mod iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm cdc_ether usbnet ioatdma lpc_ich mfd_core crc32c_intel microcode mii pcspkr i2c_i801 shpchp i2c_core serio_raw bnx2 tpm_tis dca tpm pci_hotplug i7core_edac tpm_bios edac_core sg rtc_cmos button uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
>> [   53.917769] Pid: 47, comm: migration/8 Not tainted 3.6.0-rc1-tglx-hotplug-0.0.0.28.36b5ec9-default #1
>> [   53.917769] Call Trace:
>> [   53.917769]  <IRQ>  [<ffffffff810e7806>] ? rcu_do_batch+0x386/0x3a0
>> [   53.917769]  [<ffffffff810e7806>] ? rcu_do_batch+0x386/0x3a0
>> [   53.917769]  [<ffffffff8104338a>] warn_slowpath_common+0x7a/0xb0
>> [   53.917769]  [<ffffffff810433d5>] warn_slowpath_null+0x15/0x20
>> [   53.917769]  [<ffffffff810e7806>] rcu_do_batch+0x386/0x3a0
>> [   53.917769]  [<ffffffff810ac8a0>] ? trace_hardirqs_on_caller+0x70/0x1b0
>> [   53.917769]  [<ffffffff810ac9ed>] ? trace_hardirqs_on+0xd/0x10
>> [   53.917769]  [<ffffffff810e9143>] __rcu_process_callbacks+0x1a3/0x200
>> [   53.917769]  [<ffffffff810e9228>] rcu_process_callbacks+0x88/0x240
>> [   53.917769]  [<ffffffff8104dc79>] __do_softirq+0x159/0x400
>> [   53.917769]  [<ffffffff814c627c>] call_softirq+0x1c/0x30
>> [   53.917769]  [<ffffffff810044f5>] do_softirq+0x95/0xd0
>> [   53.917769]  [<ffffffff8104d745>] irq_exit+0xe5/0x100
>> [   53.917769]  [<ffffffff81003c14>] do_IRQ+0x64/0xe0
>> [   53.917769]  [<ffffffff814bc12f>] common_interrupt+0x6f/0x6f
>> [   53.917769]  <EOI>  [<ffffffff810cfe7a>] ? stop_machine_cpu_stop+0xda/0x130
>> [   53.917769]  [<ffffffff810cfda0>] ? stop_one_cpu_nowait+0x50/0x50
>> [   53.917769]  [<ffffffff810cfab9>] cpu_stopper_thread+0xd9/0x1b0
>> [   53.917769]  [<ffffffff814bbe4f>] ? _raw_spin_unlock_irqrestore+0x3f/0x80
>> [   53.917769]  [<ffffffff810cf9e0>] ? res_counter_init+0x50/0x50
>> [   53.917769]  [<ffffffff810ac95d>] ? trace_hardirqs_on_caller+0x12d/0x1b0
>> [   53.917769]  [<ffffffff810ac9ed>] ? trace_hardirqs_on+0xd/0x10
>> [   53.917769]  [<ffffffff810cf9e0>] ? res_counter_init+0x50/0x50
>> [   53.917769]  [<ffffffff8106deae>] kthread+0xde/0xf0
>> [   53.917769]  [<ffffffff814c6184>] kernel_thread_helper+0x4/0x10
>> [   53.917769]  [<ffffffff814bc1f0>] ? retint_restore_args+0x13/0x13
>> [   53.917769]  [<ffffffff8106ddd0>] ? __init_kthread_worker+0x70/0x70
>> [   53.917769]  [<ffffffff814c6180>] ? gs_change+0x13/0x13
>> [   53.917769] ---[ end trace f60a282810c4ce78 ]---
>> [   54.170634] smpboot: CPU 8 is now offline
>> [   54.192259] NOHZ: local_softirq_pending 200 
>> [   54.197936] smpboot: CPU 9 is now offline
>> [   54.219707] NOHZ: local_softirq_pending 200 
>> [   54.225795] smpboot: CPU 10 is now offline
>>
>>
>> ---
>>
>> [  372.482434] smpboot: CPU 11 is now offline
>> [  372.534211] smpboot: CPU 12 is now offline
>> [  372.539786] CPU 13 MCA banks CMCI:6 CMCI:8
>> [  372.582474] smpboot: CPU 13 is now offline
>> [  372.591006] CPU 14 MCA banks CMCI:6 CMCI:8
>> [  372.629745] ------------[ cut here ]------------
>> [  372.633736] WARNING: at kernel/rcutree.c:1558 rcu_do_batch+0x386/0x3a0()
>> [  372.633736] Hardware name: IBM System x -[7870C4Q]-
>> [  372.633736] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse loop dm_mod iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel
>> microcode serio_raw tpm_tis i7core_edac cdc_ether usbnet ioatdma mii lpc_ich pcspkr edac_core mfd_core bnx2 shpchp pci_hotplug i2c_i801 i2c_core dca tpm sg tpm_bios rtc_cmos button uhci_hcd ehci_hcd
>> usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
>> [  372.633736] Pid: 8625, comm: migration/14 Not tainted 3.6.0-rc1-tglx-hotplug-0.0.0.28.36b5ec9-default #1
>> [  372.633736] Call Trace:
>> [  372.633736]  <IRQ>  [<ffffffff810e7806>] ? rcu_do_batch+0x386/0x3a0
>> [  372.633736]  [<ffffffff810e7806>] ? rcu_do_batch+0x386/0x3a0
>> [  372.633736]  [<ffffffff8104338a>] warn_slowpath_common+0x7a/0xb0
>> [  372.633736]  [<ffffffff810433d5>] warn_slowpath_null+0x15/0x20
>> [  372.633736]  [<ffffffff810e7806>] rcu_do_batch+0x386/0x3a0
>> [  372.633736]  [<ffffffff810ac8a0>] ? trace_hardirqs_on_caller+0x70/0x1b0
>> [  372.633736]  [<ffffffff810ac9ed>] ? trace_hardirqs_on+0xd/0x10
>> [  372.633736]  [<ffffffff810e9143>] __rcu_process_callbacks+0x1a3/0x200
>> [  372.633736]  [<ffffffff810e9228>] rcu_process_callbacks+0x88/0x240
>> [  372.633736]  [<ffffffff8104dc79>] __do_softirq+0x159/0x400
>> [  372.633736]  [<ffffffff814c627c>] call_softirq+0x1c/0x30
>> [  372.633736]  [<ffffffff810044f5>] do_softirq+0x95/0xd0
>> [  372.633736]  [<ffffffff8104d745>] irq_exit+0xe5/0x100
>> [  372.633736]  [<ffffffff81028df9>] smp_apic_timer_interrupt+0x69/0xa0
>> [  372.633736]  [<ffffffff814c5aef>] apic_timer_interrupt+0x6f/0x80
>> [  372.633736]  <EOI>  [<ffffffff810cfe7a>] ? stop_machine_cpu_stop+0xda/0x130
>> [  372.633736]  [<ffffffff810cfda0>] ? stop_one_cpu_nowait+0x50/0x50
>> [  372.633736]  [<ffffffff810cfab9>] cpu_stopper_thread+0xd9/0x1b0
>> [  372.633736]  [<ffffffff814bbe4f>] ? _raw_spin_unlock_irqrestore+0x3f/0x80
>> [  372.633736]  [<ffffffff810cf9e0>] ? res_counter_init+0x50/0x50
>> [  372.633736]  [<ffffffff810ac95d>] ? trace_hardirqs_on_caller+0x12d/0x1b0
>> [  372.633736]  [<ffffffff810ac9ed>] ? trace_hardirqs_on+0xd/0x10
>> [  372.633736]  [<ffffffff810cf9e0>] ? res_counter_init+0x50/0x50
>> [  372.633736]  [<ffffffff8106deae>] kthread+0xde/0xf0
>> [  372.633736]  [<ffffffff814c6184>] kernel_thread_helper+0x4/0x10
>> [  372.633736]  [<ffffffff814bc1f0>] ? retint_restore_args+0x13/0x13
>> [  372.633736]  [<ffffffff8106ddd0>] ? __init_kthread_worker+0x70/0x70
>> [  372.633736]  [<ffffffff814c6180>] ? gs_change+0x13/0x13
>> [  372.633736] ---[ end trace a4296a31284c846d ]---
>> [  372.883063] smpboot: CPU 14 is now offline
>> [  372.892721] CPU 15 MCA banks CMCI:6 CMCI:8
>> [  372.907250] smpboot: CPU 15 is now offline
>> [  372.911545] SMP alternatives: lockdep: fixing up alternatives
>> [  372.917292] SMP alternatives: switching to UP code
>> [  372.941917] SMP alternatives: lockdep: fixing up alternatives
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/
>>
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: WARNING: at kernel/rcutree.c:1558 rcu_do_batch+0x386/0x3a0(), during CPU hotplug
  2012-09-12 12:36   ` WARNING: at kernel/rcutree.c:1558 rcu_do_batch+0x386/0x3a0(), during CPU hotplug Srivatsa S. Bhat
  2012-09-12 15:31     ` Paul E. McKenney
@ 2012-09-13  8:35     ` Srivatsa S. Bhat
  2012-09-14 11:47       ` Fengguang Wu
  1 sibling, 1 reply; 20+ messages in thread
From: Srivatsa S. Bhat @ 2012-09-13  8:35 UTC (permalink / raw)
  To: paulmck, fengguang.wu, jack
  Cc: linux-kernel, Thomas Gleixner, rusty, Peter Zijlstra, Tejun Heo

On 09/12/2012 06:06 PM, Srivatsa S. Bhat wrote:
> On 07/19/2012 10:45 PM, Paul E. McKenney wrote:
>> On Thu, Jul 19, 2012 at 05:39:30PM +0530, Srivatsa S. Bhat wrote:
>>> Hi Paul,
>>>
>>> While running a CPU hotplug stress test on v3.5-rc7+
>>> (mainline commit 8a7298b7805ab) I hit this warning.
>>> I haven't tried to debug this yet...
>>>
>>> Line number 1550 maps to:
>>>
>>> WARN_ON_ONCE(cpu_is_offline(smp_processor_id()));
>>>
>>> inside rcu_do_batch().
>>
>> Hello, Srivatsa,
>>
>> I believe that you need commit a16b7a69 (Prevent __call_rcu() from
>> invoking RCU core on offline CPUs), which is currently in -tip, queued
>> for 3.6.  Please see below for the patch.
>>
>> Does this help?
>>
> 
> Hi Paul,
> 
> I am hitting the cpu_is_offline() warning in rcu_do_batch() (see 2 of the
> examples below) occasionally while testing CPU hotplug on Thomas' smp/hotplug
> branch in -tip. It does contain the commit that you had mentioned above.
> 

I also hit some writeback related problems during some of these runs. But I was
not able to reproduce them after that occurrence. (Adding relevant people to CC.)

I hit the divide error shown below during the CPU hotplug test run, and the general
protection fault subsequently, while trying to shutdown the machine after the test.

Regards,
Srivatsa S. Bhat

[  522.987310] SMP alternatives: switching to SMP code
[  522.999101] smpboot: Booting Node 1 Processor 7 APIC 0x16
[  524.083872] SMP alternatives: lockdep: fixing up alternatives
[  524.090053] smpboot: Booting Node 0 Processor 8 APIC 0x1
[  525.148720] SMP alternatives: lockdep: fixing up alternatives
[  525.154970] smpboot: Booting Node 0 Processor 9 APIC 0x3
[  526.024180] divide error: 0000 [#1] SMP 
[  526.028144] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse loop dm_mod iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm cdc_ether pcspkr usbnet shpchp pci_hotplug i2c_i801 i2c_core ioatdma mii crc32c_intel serio_raw microcode lpc_ich mfd_core i7core_edac bnx2 dca edac_core tpm_tis tpm sg tpm_bios rtc_cmos button uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
[  526.028145] CPU 9 
[  526.028145] Pid: 2235, comm: flush-8:0 Not tainted 3.6.0-rc1-tglx-hotplug-0.0.0.28.36b5ec9-default #1 IBM IBM System x -[7870C4Q]-/68Y8033     
[  526.028145] RIP: 0010:[<ffffffff811276f6>]  [<ffffffff811276f6>] bdi_dirty_limit+0x66/0xc0
[  526.028145] RSP: 0018:ffff8811530bfcc0  EFLAGS: 00010206
[  526.028145] RAX: 0000000000b9877e RBX: 00000000001a8112 RCX: 28f5c28f5c28f5c3
[  526.028145] RDX: 0000000000000000 RSI: 0000000000b9877e RDI: 0000000000000000
[  526.028145] RBP: ffff8811530bfce0 R08: 0000000000000010 R09: 0000000000000000
[  526.028145] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8808d4408e20
[  526.028145] R13: ffff8808d4408e20 R14: ffff8808d44091a0 R15: 0000000000000000
[  526.028145] FS:  0000000000000000(0000) GS:ffff8808ddd40000(0000) knlGS:0000000000000000
[  526.028145] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  526.028145] CR2: 00007fa35dd4eb60 CR3: 0000000001a0c000 CR4: 00000000000007e0
[  526.028145] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  526.028145] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  526.028145] Process flush-8:0 (pid: 2235, threadinfo ffff8811530be000, task ffff88115315c5e0)
[  526.028145] Stack:
[  526.028145]  0400000000000000 0000000000000007 ffff8808d4408e20 ffffffffffffffee
[  526.028145]  ffff8811530bfd10 ffffffff811ae95c 0000000000350225 00000000001a8112
[  526.028145]  0000000000000000 0000000000000002 ffff8811530bfdc0 ffffffff811b0620
[  526.028145] Call Trace:
[  526.209272] SMP alternatives: lockdep: fixing up alternatives
[  526.209275] smpboot: Booting Node 0 Processor 10 APIC 0x5
[  526.220012]  [<ffffffff811ae95c>] over_bground_thresh+0x7c/0x90
[  526.220012]  [<ffffffff811b0620>] wb_do_writeback+0x170/0x310
[  526.220012]  [<ffffffff811b08eb>] bdi_writeback_thread+0x12b/0x420
[  526.220012]  [<ffffffff811b07c0>] ? wb_do_writeback+0x310/0x310
[  526.220012]  [<ffffffff8106deae>] kthread+0xde/0xf0
[  526.220012]  [<ffffffff814c6184>] kernel_thread_helper+0x4/0x10
[  526.220012]  [<ffffffff814bc1f0>] ? retint_restore_args+0x13/0x13
[  526.220012]  [<ffffffff8106ddd0>] ? __init_kthread_worker+0x70/0x70
[  526.220012]  [<ffffffff814c6180>] ? gs_change+0x13/0x13
[  526.220012] Code: 28 5c 8f c2 f5 28 8b 7d e0 48 89 c6 48 0f af f3 48 c1 ee 02 48 89 f0 48 f7 e1 48 89 d6 31 d2 48 c1 ee 02 48 0f af 75 e8 48 89 f0 <48> f7 f7 41 8b 94 24 74 02 00 00 48 0f af d3 48 89 c7 48 c1 ea 
[  526.220012] RIP  [<ffffffff811276f6>] bdi_dirty_limit+0x66/0xc0
[  526.220012]  RSP <ffff8811530bfcc0>
[  526.304469] ---[ end trace bcfc7ab74bdb11a5 ]---
[  527.330948] SMP alternatives: lockdep: fixing up alternatives


----

[ 1941.614775] SMP alternatives: lockdep: fixing up alternatives
[ 1941.620614] smpboot: Booting Node 1 Processor 5 APIC 0x12
[ 1941.657424] SMP alternatives: lockdep: fixing up alternatives
[ 1941.663215] smpboot: Booting Node 1 Processor 6 APIC 0x14

[ 1992.721819] general protection fault: 0000 [#2] SMP 
[ 1992.724844] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse loop dm_mod iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm cdc_ether pcspkr usbnet shpchp pci_hotplug i2c_i801 i2c_core ioatdma mii crc32c_intel serio_raw microcode lpc_ich mfd_core i7core_edac bnx2 dca edac_core tpm_tis tpm sg tpm_bios rtc_cmos button uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase
scsi_transport_sas scsi_mod thermal thermal_sys hwmon
[ 1992.726995] CPU 8 
[ 1992.726995] Pid: 19654, comm: shutdown Tainted: G      D      3.6.0-rc1-tglx-hotplug-0.0.0.28.36b5ec9-default #1 IBM IBM System x -[7870C4Q]-/68Y8033     
[ 1992.726995] RIP: 0010:[<ffffffff810843d7>]  [<ffffffff810843d7>] try_to_wake_up+0x57/0x2f0
[ 1992.726995] RSP: 0018:ffff8808d47e5e58  EFLAGS: 00010002
[ 1992.726995] RAX: 6b6b6b6b6b6b6b6b RBX: 000000000000000f RCX: 000000006b6b6b6b
[ 1992.726995] RDX: 000000006b6c6b6b RSI: ffffffff817a7fbf RDI: ffff88115315cdd0
[ 1992.726995] RBP: ffff8808d47e5e98 R08: 0000000000000002 R09: 0000000000000001
[ 1992.726995] R10: 0000000000000000 R11: 0000000000000001 R12: ffff88115315c5e0
[ 1992.726995] R13: 000000006b6b6b6b R14: ffff8808d44091a0 R15: 0000000000000000
[ 1992.726995] FS:  00007f32b1beb700(0000) GS:ffff8808ddd00000(0000) knlGS:0000000000000000
[ 1992.726995] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1992.726995] CR2: 00007f32b173d1a0 CR3: 0000001153bd0000 CR4: 00000000000007e0
[ 1992.726995] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1992.726995] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1992.726995] Process shutdown (pid: 19654, threadinfo ffff8808d47e4000, task ffff8808d4b045e0)
[ 1992.726995] Stack:
[ 1992.726995]  0000000000000246 ffff88115315cdd0 0000000000000286 0000000000000000
[ 1992.726995]  ffff8808d4408e20 ffff8808d5433288 ffff8808d44091a0 00746c6168206d65
[ 1992.726995]  ffff8808d47e5ea8 ffffffff810846a0 ffff8808d47e5ed8 ffffffff811adce1
[ 1992.726995] Call Trace:
[ 1992.726995]  [<ffffffff810846a0>] wake_up_process+0x10/0x20
[ 1992.726995]  [<ffffffff811adce1>] bdi_queue_work+0xd1/0x1f0
[ 1992.726995]  [<ffffffff811ae7d9>] __bdi_start_writeback+0x79/0x160
[ 1992.726995]  [<ffffffff811af1b0>] wakeup_flusher_threads+0x120/0x1e0
[ 1992.726995]  [<ffffffff811af0ca>] ? wakeup_flusher_threads+0x3a/0x1e0
[ 1992.726995]  [<ffffffff811b45b2>] sys_sync+0x22/0x90
[ 1992.726995]  [<ffffffff814c4fb9>] system_call_fastpath+0x16/0x1b
[ 1992.726995] Code: 31 ed 48 89 c7 48 89 45 c8 e8 46 72 43 00 48 89 45 d0 49 8b 04 24 85 c3 0f 84 02 02 00 00 45 8b 6c 24 2c 49 8b 44 24 08 45 85 ed <44> 8b 70 18 74 75 8b 1d dd ea 9d 00 85 db 0f 85 2d 02 00 00 49  
[ 1992.726995] RIP  [<ffffffff810843d7>] try_to_wake_up+0x57/0x2f0
[ 1992.726995]  RSP <ffff8808d47e5e58>
[ 1992.726995] ---[ end trace bcfc7ab74bdb11a6 ]---
[ 1992.726995] Kernel panic - not syncing: Fatal exception in interrupt




^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: WARNING: at kernel/rcutree.c:1558 rcu_do_batch+0x386/0x3a0(), during CPU hotplug
  2012-09-13  6:30       ` Michael Wang
@ 2012-09-13 12:47         ` Srivatsa S. Bhat
  2012-09-14  4:33           ` Michael Wang
  2012-09-26  9:35           ` Srivatsa S. Bhat
  0 siblings, 2 replies; 20+ messages in thread
From: Srivatsa S. Bhat @ 2012-09-13 12:47 UTC (permalink / raw)
  To: Michael Wang, paulmck
  Cc: linux-kernel, Thomas Gleixner, rusty, Peter Zijlstra, Tejun Heo

On 09/13/2012 12:00 PM, Michael Wang wrote:
> On 09/12/2012 11:31 PM, Paul E. McKenney wrote:
>> On Wed, Sep 12, 2012 at 06:06:20PM +0530, Srivatsa S. Bhat wrote:
>>> On 07/19/2012 10:45 PM, Paul E. McKenney wrote:
>>>> On Thu, Jul 19, 2012 at 05:39:30PM +0530, Srivatsa S. Bhat wrote:
>>>>> Hi Paul,
>>>>>
>>>>> While running a CPU hotplug stress test on v3.5-rc7+
>>>>> (mainline commit 8a7298b7805ab) I hit this warning.
>>>>> I haven't tried to debug this yet...
>>>>>
>>>>> Line number 1550 maps to:
>>>>>
>>>>> WARN_ON_ONCE(cpu_is_offline(smp_processor_id()));
>>>>>
>>>>> inside rcu_do_batch().
>>>>
>>>> Hello, Srivatsa,
>>>>
>>>> I believe that you need commit a16b7a69 (Prevent __call_rcu() from
>>>> invoking RCU core on offline CPUs), which is currently in -tip, queued
>>>> for 3.6.  Please see below for the patch.
>>>>
>>>> Does this help?
>>>>
>>>
>>> Hi Paul,
>>>
>>> I am hitting the cpu_is_offline() warning in rcu_do_batch() (see 2 of the
>>> examples below) occasionally while testing CPU hotplug on Thomas' smp/hotplug
>>> branch in -tip. It does contain the commit that you had mentioned above.
>>>
>>> The stack trace suggests that we are not hitting this from the __call_rcu()
>>> path. So I guess this needs a different fix?
>>
>> So there was an interrupt from stop_machine_stop().  Because RCU complained
>> about offline, I presume that this was on exit from stop_machine_stop().
>> (Otherwise, on entry to stop_machine_stop(), the CPU has not yet marked
>> itself offline, right?)
>>

Yes, that's my understanding too.

>> So my question is:  Why didn't the CPU shut off all interrupts before
>> coming out of stop_machine_stop()?
>>
>> Or am I confused about what is really happening here?
> 
> I think Srivatsa may need patch a3716d2e too:
>

The changelog of this patch doesn't seem to suggest that it was written to
address the problem I am facing. Please see my thoughts below..

> commit a3716d2e5a50a9ed5268ae3d3c2f093968ff236a
> Author: Paul E. McKenney <paul.mckenney@linaro.org>
> Date:   Thu Jun 21 09:54:10 2012 -0700
> 
>     rcu: Prevent offline CPUs from executing RCU core code
>     
>     Earlier versions of RCU invoked the RCU core from the CPU_DYING notifier
>     in order to note a quiescent state for the outgoing CPU.  Because the
>     CPU is marked "offline" during the execution of the CPU_DYING notifiers,
>     the RCU core had to tolerate being invoked from an offline CPU.  However,
>     commit b1420f1c (Make rcu_barrier() less disruptive) left only tracing
>     code in the CPU_DYING notifier, so the RCU core need no longer execute
>     on offline CPUs.  This commit therefore enforces this restriction.
>     
>     Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
>     Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> 
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index 300aba6..84a6f55 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -1892,6 +1892,8 @@ static void rcu_process_callbacks(struct softirq_action *unused)
>  {
>         struct rcu_state *rsp;
> 
> +       if (cpu_is_offline(smp_processor_id()))
> +               return;
>         trace_rcu_utilization("Start RCU core");
>         for_each_rcu_flavor(rsp)
>                 __rcu_process_callbacks(rsp);
> 
> With out it, the RCU_SOFTIRQ raised before offline won't
> have a chance to return.
> 

I'm sure this would help to get rid of the warning, but IMHO, like Paul pointed out,
its rather surprising that the CPU going offline got interrupts *after* coming out of
the stop_machine thing, iow, *after* it was marked offline.. So we need to examine
*this* issue right?

Regards,
Srivatsa S. Bhat

>>>
>>> [   53.882344] smpboot: CPU 7 is now offline
>>> [   53.891072] CPU 12 MCA banks CMCI:6 CMCI:8
>>> [   53.895621] CPU 15 MCA banks CMCI:2 CMCI:3 CMCI:5
>>> [   53.914738] Broke affinity for irq 81
>>> [   53.917769] do_IRQ: 8.211 No irq handler for vector (irq -1) 
>>> [   53.917769] ------------[ cut here ]------------
>>> [   53.917769] WARNING: at kernel/rcutree.c:1558 rcu_do_batch+0x386/0x3a0()
>>> [   53.917769] Hardware name: IBM System x -[7870C4Q]-
>>> [   53.917769] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse loop dm_mod iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm cdc_ether usbnet ioatdma lpc_ich mfd_core crc32c_intel microcode mii pcspkr i2c_i801 shpchp i2c_core serio_raw bnx2 tpm_tis dca tpm pci_hotplug i7core_edac tpm_bios edac_core sg rtc_cmos button uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
>>> [   53.917769] Pid: 47, comm: migration/8 Not tainted 3.6.0-rc1-tglx-hotplug-0.0.0.28.36b5ec9-default #1
>>> [   53.917769] Call Trace:
>>> [   53.917769]  <IRQ>  [<ffffffff810e7806>] ? rcu_do_batch+0x386/0x3a0
>>> [   53.917769]  [<ffffffff810e7806>] ? rcu_do_batch+0x386/0x3a0
>>> [   53.917769]  [<ffffffff8104338a>] warn_slowpath_common+0x7a/0xb0
>>> [   53.917769]  [<ffffffff810433d5>] warn_slowpath_null+0x15/0x20
>>> [   53.917769]  [<ffffffff810e7806>] rcu_do_batch+0x386/0x3a0
>>> [   53.917769]  [<ffffffff810ac8a0>] ? trace_hardirqs_on_caller+0x70/0x1b0
>>> [   53.917769]  [<ffffffff810ac9ed>] ? trace_hardirqs_on+0xd/0x10
>>> [   53.917769]  [<ffffffff810e9143>] __rcu_process_callbacks+0x1a3/0x200
>>> [   53.917769]  [<ffffffff810e9228>] rcu_process_callbacks+0x88/0x240
>>> [   53.917769]  [<ffffffff8104dc79>] __do_softirq+0x159/0x400
>>> [   53.917769]  [<ffffffff814c627c>] call_softirq+0x1c/0x30
>>> [   53.917769]  [<ffffffff810044f5>] do_softirq+0x95/0xd0
>>> [   53.917769]  [<ffffffff8104d745>] irq_exit+0xe5/0x100
>>> [   53.917769]  [<ffffffff81003c14>] do_IRQ+0x64/0xe0
>>> [   53.917769]  [<ffffffff814bc12f>] common_interrupt+0x6f/0x6f
>>> [   53.917769]  <EOI>  [<ffffffff810cfe7a>] ? stop_machine_cpu_stop+0xda/0x130
>>> [   53.917769]  [<ffffffff810cfda0>] ? stop_one_cpu_nowait+0x50/0x50
>>> [   53.917769]  [<ffffffff810cfab9>] cpu_stopper_thread+0xd9/0x1b0
>>> [   53.917769]  [<ffffffff814bbe4f>] ? _raw_spin_unlock_irqrestore+0x3f/0x80
>>> [   53.917769]  [<ffffffff810cf9e0>] ? res_counter_init+0x50/0x50
>>> [   53.917769]  [<ffffffff810ac95d>] ? trace_hardirqs_on_caller+0x12d/0x1b0
>>> [   53.917769]  [<ffffffff810ac9ed>] ? trace_hardirqs_on+0xd/0x10
>>> [   53.917769]  [<ffffffff810cf9e0>] ? res_counter_init+0x50/0x50
>>> [   53.917769]  [<ffffffff8106deae>] kthread+0xde/0xf0
>>> [   53.917769]  [<ffffffff814c6184>] kernel_thread_helper+0x4/0x10
>>> [   53.917769]  [<ffffffff814bc1f0>] ? retint_restore_args+0x13/0x13
>>> [   53.917769]  [<ffffffff8106ddd0>] ? __init_kthread_worker+0x70/0x70
>>> [   53.917769]  [<ffffffff814c6180>] ? gs_change+0x13/0x13
>>> [   53.917769] ---[ end trace f60a282810c4ce78 ]---
>>> [   54.170634] smpboot: CPU 8 is now offline
>>> [   54.192259] NOHZ: local_softirq_pending 200 
>>> [   54.197936] smpboot: CPU 9 is now offline
>>> [   54.219707] NOHZ: local_softirq_pending 200 
>>> [   54.225795] smpboot: CPU 10 is now offline
>>>
>>>
>>> ---
>>>
>>> [  372.482434] smpboot: CPU 11 is now offline
>>> [  372.534211] smpboot: CPU 12 is now offline
>>> [  372.539786] CPU 13 MCA banks CMCI:6 CMCI:8
>>> [  372.582474] smpboot: CPU 13 is now offline
>>> [  372.591006] CPU 14 MCA banks CMCI:6 CMCI:8
>>> [  372.629745] ------------[ cut here ]------------
>>> [  372.633736] WARNING: at kernel/rcutree.c:1558 rcu_do_batch+0x386/0x3a0()
>>> [  372.633736] Hardware name: IBM System x -[7870C4Q]-
>>> [  372.633736] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse loop dm_mod iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel
>>> microcode serio_raw tpm_tis i7core_edac cdc_ether usbnet ioatdma mii lpc_ich pcspkr edac_core mfd_core bnx2 shpchp pci_hotplug i2c_i801 i2c_core dca tpm sg tpm_bios rtc_cmos button uhci_hcd ehci_hcd
>>> usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
>>> [  372.633736] Pid: 8625, comm: migration/14 Not tainted 3.6.0-rc1-tglx-hotplug-0.0.0.28.36b5ec9-default #1
>>> [  372.633736] Call Trace:
>>> [  372.633736]  <IRQ>  [<ffffffff810e7806>] ? rcu_do_batch+0x386/0x3a0
>>> [  372.633736]  [<ffffffff810e7806>] ? rcu_do_batch+0x386/0x3a0
>>> [  372.633736]  [<ffffffff8104338a>] warn_slowpath_common+0x7a/0xb0
>>> [  372.633736]  [<ffffffff810433d5>] warn_slowpath_null+0x15/0x20
>>> [  372.633736]  [<ffffffff810e7806>] rcu_do_batch+0x386/0x3a0
>>> [  372.633736]  [<ffffffff810ac8a0>] ? trace_hardirqs_on_caller+0x70/0x1b0
>>> [  372.633736]  [<ffffffff810ac9ed>] ? trace_hardirqs_on+0xd/0x10
>>> [  372.633736]  [<ffffffff810e9143>] __rcu_process_callbacks+0x1a3/0x200
>>> [  372.633736]  [<ffffffff810e9228>] rcu_process_callbacks+0x88/0x240
>>> [  372.633736]  [<ffffffff8104dc79>] __do_softirq+0x159/0x400
>>> [  372.633736]  [<ffffffff814c627c>] call_softirq+0x1c/0x30
>>> [  372.633736]  [<ffffffff810044f5>] do_softirq+0x95/0xd0
>>> [  372.633736]  [<ffffffff8104d745>] irq_exit+0xe5/0x100
>>> [  372.633736]  [<ffffffff81028df9>] smp_apic_timer_interrupt+0x69/0xa0
>>> [  372.633736]  [<ffffffff814c5aef>] apic_timer_interrupt+0x6f/0x80
>>> [  372.633736]  <EOI>  [<ffffffff810cfe7a>] ? stop_machine_cpu_stop+0xda/0x130
>>> [  372.633736]  [<ffffffff810cfda0>] ? stop_one_cpu_nowait+0x50/0x50
>>> [  372.633736]  [<ffffffff810cfab9>] cpu_stopper_thread+0xd9/0x1b0
>>> [  372.633736]  [<ffffffff814bbe4f>] ? _raw_spin_unlock_irqrestore+0x3f/0x80
>>> [  372.633736]  [<ffffffff810cf9e0>] ? res_counter_init+0x50/0x50
>>> [  372.633736]  [<ffffffff810ac95d>] ? trace_hardirqs_on_caller+0x12d/0x1b0
>>> [  372.633736]  [<ffffffff810ac9ed>] ? trace_hardirqs_on+0xd/0x10
>>> [  372.633736]  [<ffffffff810cf9e0>] ? res_counter_init+0x50/0x50
>>> [  372.633736]  [<ffffffff8106deae>] kthread+0xde/0xf0
>>> [  372.633736]  [<ffffffff814c6184>] kernel_thread_helper+0x4/0x10
>>> [  372.633736]  [<ffffffff814bc1f0>] ? retint_restore_args+0x13/0x13
>>> [  372.633736]  [<ffffffff8106ddd0>] ? __init_kthread_worker+0x70/0x70
>>> [  372.633736]  [<ffffffff814c6180>] ? gs_change+0x13/0x13
>>> [  372.633736] ---[ end trace a4296a31284c846d ]---
>>> [  372.883063] smpboot: CPU 14 is now offline
>>> [  372.892721] CPU 15 MCA banks CMCI:6 CMCI:8
>>> [  372.907250] smpboot: CPU 15 is now offline
>>> [  372.911545] SMP alternatives: lockdep: fixing up alternatives
>>> [  372.917292] SMP alternatives: switching to UP code
>>> [  372.941917] SMP alternatives: lockdep: fixing up alternatives


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: WARNING: at kernel/rcutree.c:1558 rcu_do_batch+0x386/0x3a0(), during CPU hotplug
  2012-09-13 12:47         ` Srivatsa S. Bhat
@ 2012-09-14  4:33           ` Michael Wang
  2012-09-26  9:35           ` Srivatsa S. Bhat
  1 sibling, 0 replies; 20+ messages in thread
From: Michael Wang @ 2012-09-14  4:33 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: paulmck, linux-kernel, Thomas Gleixner, rusty, Peter Zijlstra, Tejun Heo

On 09/13/2012 08:47 PM, Srivatsa S. Bhat wrote:
> On 09/13/2012 12:00 PM, Michael Wang wrote:
>> On 09/12/2012 11:31 PM, Paul E. McKenney wrote:
>>> On Wed, Sep 12, 2012 at 06:06:20PM +0530, Srivatsa S. Bhat wrote:
>>>> On 07/19/2012 10:45 PM, Paul E. McKenney wrote:
>>>>> On Thu, Jul 19, 2012 at 05:39:30PM +0530, Srivatsa S. Bhat wrote:
>>>>>> Hi Paul,
>>>>>>
>>>>>> While running a CPU hotplug stress test on v3.5-rc7+
>>>>>> (mainline commit 8a7298b7805ab) I hit this warning.
>>>>>> I haven't tried to debug this yet...
>>>>>>
>>>>>> Line number 1550 maps to:
>>>>>>
>>>>>> WARN_ON_ONCE(cpu_is_offline(smp_processor_id()));
>>>>>>
>>>>>> inside rcu_do_batch().
>>>>>
>>>>> Hello, Srivatsa,
>>>>>
>>>>> I believe that you need commit a16b7a69 (Prevent __call_rcu() from
>>>>> invoking RCU core on offline CPUs), which is currently in -tip, queued
>>>>> for 3.6.  Please see below for the patch.
>>>>>
>>>>> Does this help?
>>>>>
>>>>
>>>> Hi Paul,
>>>>
>>>> I am hitting the cpu_is_offline() warning in rcu_do_batch() (see 2 of the
>>>> examples below) occasionally while testing CPU hotplug on Thomas' smp/hotplug
>>>> branch in -tip. It does contain the commit that you had mentioned above.
>>>>
>>>> The stack trace suggests that we are not hitting this from the __call_rcu()
>>>> path. So I guess this needs a different fix?
>>>
>>> So there was an interrupt from stop_machine_stop().  Because RCU complained
>>> about offline, I presume that this was on exit from stop_machine_stop().
>>> (Otherwise, on entry to stop_machine_stop(), the CPU has not yet marked
>>> itself offline, right?)
>>>
> 
> Yes, that's my understanding too.
> 
>>> So my question is:  Why didn't the CPU shut off all interrupts before
>>> coming out of stop_machine_stop()?
>>>
>>> Or am I confused about what is really happening here?
>>
>> I think Srivatsa may need patch a3716d2e too:
>>
> 
> The changelog of this patch doesn't seem to suggest that it was written to
> address the problem I am facing. Please see my thoughts below..
> 
>> commit a3716d2e5a50a9ed5268ae3d3c2f093968ff236a
>> Author: Paul E. McKenney <paul.mckenney@linaro.org>
>> Date:   Thu Jun 21 09:54:10 2012 -0700
>>
>>     rcu: Prevent offline CPUs from executing RCU core code
>>     
>>     Earlier versions of RCU invoked the RCU core from the CPU_DYING notifier
>>     in order to note a quiescent state for the outgoing CPU.  Because the
>>     CPU is marked "offline" during the execution of the CPU_DYING notifiers,
>>     the RCU core had to tolerate being invoked from an offline CPU.  However,
>>     commit b1420f1c (Make rcu_barrier() less disruptive) left only tracing
>>     code in the CPU_DYING notifier, so the RCU core need no longer execute
>>     on offline CPUs.  This commit therefore enforces this restriction.
>>     
>>     Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
>>     Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>>
>> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
>> index 300aba6..84a6f55 100644
>> --- a/kernel/rcutree.c
>> +++ b/kernel/rcutree.c
>> @@ -1892,6 +1892,8 @@ static void rcu_process_callbacks(struct softirq_action *unused)
>>  {
>>         struct rcu_state *rsp;
>>
>> +       if (cpu_is_offline(smp_processor_id()))
>> +               return;
>>         trace_rcu_utilization("Start RCU core");
>>         for_each_rcu_flavor(rsp)
>>                 __rcu_process_callbacks(rsp);
>>
>> With out it, the RCU_SOFTIRQ raised before offline won't
>> have a chance to return.
>>
> 
> I'm sure this would help to get rid of the warning, but IMHO, like Paul pointed out,
> its rather surprising that the CPU going offline got interrupts *after* coming out of
> the stop_machine thing, iow, *after* it was marked offline.. So we need to examine
> *this* issue right?

This may be a clue:

>>>> [   53.917769] do_IRQ: 8.211 No irq handler for vector (irq -1)

I'm not familiar with APIC, but that's not something we expected, correct?

Regards,
Michael Wang

> 
> Regards,
> Srivatsa S. Bhat
> 
>>>>
>>>> [   53.882344] smpboot: CPU 7 is now offline
>>>> [   53.891072] CPU 12 MCA banks CMCI:6 CMCI:8
>>>> [   53.895621] CPU 15 MCA banks CMCI:2 CMCI:3 CMCI:5
>>>> [   53.914738] Broke affinity for irq 81
>>>> [   53.917769] do_IRQ: 8.211 No irq handler for vector (irq -1) 
>>>> [   53.917769] ------------[ cut here ]------------
>>>> [   53.917769] WARNING: at kernel/rcutree.c:1558 rcu_do_batch+0x386/0x3a0()
>>>> [   53.917769] Hardware name: IBM System x -[7870C4Q]-
>>>> [   53.917769] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse loop dm_mod iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm cdc_ether usbnet ioatdma lpc_ich mfd_core crc32c_intel microcode mii pcspkr i2c_i801 shpchp i2c_core serio_raw bnx2 tpm_tis dca tpm pci_hotplug i7core_edac tpm_bios edac_core sg rtc_cmos button uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
>>>> [   53.917769] Pid: 47, comm: migration/8 Not tainted 3.6.0-rc1-tglx-hotplug-0.0.0.28.36b5ec9-default #1
>>>> [   53.917769] Call Trace:
>>>> [   53.917769]  <IRQ>  [<ffffffff810e7806>] ? rcu_do_batch+0x386/0x3a0
>>>> [   53.917769]  [<ffffffff810e7806>] ? rcu_do_batch+0x386/0x3a0
>>>> [   53.917769]  [<ffffffff8104338a>] warn_slowpath_common+0x7a/0xb0
>>>> [   53.917769]  [<ffffffff810433d5>] warn_slowpath_null+0x15/0x20
>>>> [   53.917769]  [<ffffffff810e7806>] rcu_do_batch+0x386/0x3a0
>>>> [   53.917769]  [<ffffffff810ac8a0>] ? trace_hardirqs_on_caller+0x70/0x1b0
>>>> [   53.917769]  [<ffffffff810ac9ed>] ? trace_hardirqs_on+0xd/0x10
>>>> [   53.917769]  [<ffffffff810e9143>] __rcu_process_callbacks+0x1a3/0x200
>>>> [   53.917769]  [<ffffffff810e9228>] rcu_process_callbacks+0x88/0x240
>>>> [   53.917769]  [<ffffffff8104dc79>] __do_softirq+0x159/0x400
>>>> [   53.917769]  [<ffffffff814c627c>] call_softirq+0x1c/0x30
>>>> [   53.917769]  [<ffffffff810044f5>] do_softirq+0x95/0xd0
>>>> [   53.917769]  [<ffffffff8104d745>] irq_exit+0xe5/0x100
>>>> [   53.917769]  [<ffffffff81003c14>] do_IRQ+0x64/0xe0
>>>> [   53.917769]  [<ffffffff814bc12f>] common_interrupt+0x6f/0x6f
>>>> [   53.917769]  <EOI>  [<ffffffff810cfe7a>] ? stop_machine_cpu_stop+0xda/0x130
>>>> [   53.917769]  [<ffffffff810cfda0>] ? stop_one_cpu_nowait+0x50/0x50
>>>> [   53.917769]  [<ffffffff810cfab9>] cpu_stopper_thread+0xd9/0x1b0
>>>> [   53.917769]  [<ffffffff814bbe4f>] ? _raw_spin_unlock_irqrestore+0x3f/0x80
>>>> [   53.917769]  [<ffffffff810cf9e0>] ? res_counter_init+0x50/0x50
>>>> [   53.917769]  [<ffffffff810ac95d>] ? trace_hardirqs_on_caller+0x12d/0x1b0
>>>> [   53.917769]  [<ffffffff810ac9ed>] ? trace_hardirqs_on+0xd/0x10
>>>> [   53.917769]  [<ffffffff810cf9e0>] ? res_counter_init+0x50/0x50
>>>> [   53.917769]  [<ffffffff8106deae>] kthread+0xde/0xf0
>>>> [   53.917769]  [<ffffffff814c6184>] kernel_thread_helper+0x4/0x10
>>>> [   53.917769]  [<ffffffff814bc1f0>] ? retint_restore_args+0x13/0x13
>>>> [   53.917769]  [<ffffffff8106ddd0>] ? __init_kthread_worker+0x70/0x70
>>>> [   53.917769]  [<ffffffff814c6180>] ? gs_change+0x13/0x13
>>>> [   53.917769] ---[ end trace f60a282810c4ce78 ]---
>>>> [   54.170634] smpboot: CPU 8 is now offline
>>>> [   54.192259] NOHZ: local_softirq_pending 200 
>>>> [   54.197936] smpboot: CPU 9 is now offline
>>>> [   54.219707] NOHZ: local_softirq_pending 200 
>>>> [   54.225795] smpboot: CPU 10 is now offline
>>>>
>>>>
>>>> ---
>>>>
>>>> [  372.482434] smpboot: CPU 11 is now offline
>>>> [  372.534211] smpboot: CPU 12 is now offline
>>>> [  372.539786] CPU 13 MCA banks CMCI:6 CMCI:8
>>>> [  372.582474] smpboot: CPU 13 is now offline
>>>> [  372.591006] CPU 14 MCA banks CMCI:6 CMCI:8
>>>> [  372.629745] ------------[ cut here ]------------
>>>> [  372.633736] WARNING: at kernel/rcutree.c:1558 rcu_do_batch+0x386/0x3a0()
>>>> [  372.633736] Hardware name: IBM System x -[7870C4Q]-
>>>> [  372.633736] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse loop dm_mod iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel
>>>> microcode serio_raw tpm_tis i7core_edac cdc_ether usbnet ioatdma mii lpc_ich pcspkr edac_core mfd_core bnx2 shpchp pci_hotplug i2c_i801 i2c_core dca tpm sg tpm_bios rtc_cmos button uhci_hcd ehci_hcd
>>>> usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
>>>> [  372.633736] Pid: 8625, comm: migration/14 Not tainted 3.6.0-rc1-tglx-hotplug-0.0.0.28.36b5ec9-default #1
>>>> [  372.633736] Call Trace:
>>>> [  372.633736]  <IRQ>  [<ffffffff810e7806>] ? rcu_do_batch+0x386/0x3a0
>>>> [  372.633736]  [<ffffffff810e7806>] ? rcu_do_batch+0x386/0x3a0
>>>> [  372.633736]  [<ffffffff8104338a>] warn_slowpath_common+0x7a/0xb0
>>>> [  372.633736]  [<ffffffff810433d5>] warn_slowpath_null+0x15/0x20
>>>> [  372.633736]  [<ffffffff810e7806>] rcu_do_batch+0x386/0x3a0
>>>> [  372.633736]  [<ffffffff810ac8a0>] ? trace_hardirqs_on_caller+0x70/0x1b0
>>>> [  372.633736]  [<ffffffff810ac9ed>] ? trace_hardirqs_on+0xd/0x10
>>>> [  372.633736]  [<ffffffff810e9143>] __rcu_process_callbacks+0x1a3/0x200
>>>> [  372.633736]  [<ffffffff810e9228>] rcu_process_callbacks+0x88/0x240
>>>> [  372.633736]  [<ffffffff8104dc79>] __do_softirq+0x159/0x400
>>>> [  372.633736]  [<ffffffff814c627c>] call_softirq+0x1c/0x30
>>>> [  372.633736]  [<ffffffff810044f5>] do_softirq+0x95/0xd0
>>>> [  372.633736]  [<ffffffff8104d745>] irq_exit+0xe5/0x100
>>>> [  372.633736]  [<ffffffff81028df9>] smp_apic_timer_interrupt+0x69/0xa0
>>>> [  372.633736]  [<ffffffff814c5aef>] apic_timer_interrupt+0x6f/0x80
>>>> [  372.633736]  <EOI>  [<ffffffff810cfe7a>] ? stop_machine_cpu_stop+0xda/0x130
>>>> [  372.633736]  [<ffffffff810cfda0>] ? stop_one_cpu_nowait+0x50/0x50
>>>> [  372.633736]  [<ffffffff810cfab9>] cpu_stopper_thread+0xd9/0x1b0
>>>> [  372.633736]  [<ffffffff814bbe4f>] ? _raw_spin_unlock_irqrestore+0x3f/0x80
>>>> [  372.633736]  [<ffffffff810cf9e0>] ? res_counter_init+0x50/0x50
>>>> [  372.633736]  [<ffffffff810ac95d>] ? trace_hardirqs_on_caller+0x12d/0x1b0
>>>> [  372.633736]  [<ffffffff810ac9ed>] ? trace_hardirqs_on+0xd/0x10
>>>> [  372.633736]  [<ffffffff810cf9e0>] ? res_counter_init+0x50/0x50
>>>> [  372.633736]  [<ffffffff8106deae>] kthread+0xde/0xf0
>>>> [  372.633736]  [<ffffffff814c6184>] kernel_thread_helper+0x4/0x10
>>>> [  372.633736]  [<ffffffff814bc1f0>] ? retint_restore_args+0x13/0x13
>>>> [  372.633736]  [<ffffffff8106ddd0>] ? __init_kthread_worker+0x70/0x70
>>>> [  372.633736]  [<ffffffff814c6180>] ? gs_change+0x13/0x13
>>>> [  372.633736] ---[ end trace a4296a31284c846d ]---
>>>> [  372.883063] smpboot: CPU 14 is now offline
>>>> [  372.892721] CPU 15 MCA banks CMCI:6 CMCI:8
>>>> [  372.907250] smpboot: CPU 15 is now offline
>>>> [  372.911545] SMP alternatives: lockdep: fixing up alternatives
>>>> [  372.917292] SMP alternatives: switching to UP code
>>>> [  372.941917] SMP alternatives: lockdep: fixing up alternatives


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: WARNING: at kernel/rcutree.c:1558 rcu_do_batch+0x386/0x3a0(), during CPU hotplug
  2012-09-13  8:35     ` Srivatsa S. Bhat
@ 2012-09-14 11:47       ` Fengguang Wu
  2012-09-14 12:18         ` Srivatsa S. Bhat
  0 siblings, 1 reply; 20+ messages in thread
From: Fengguang Wu @ 2012-09-14 11:47 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: paulmck, jack, linux-kernel, Thomas Gleixner, rusty,
	Peter Zijlstra, Tejun Heo

On Thu, Sep 13, 2012 at 02:05:17PM +0530, Srivatsa S. Bhat wrote:
> On 09/12/2012 06:06 PM, Srivatsa S. Bhat wrote:
> > On 07/19/2012 10:45 PM, Paul E. McKenney wrote:
> >> On Thu, Jul 19, 2012 at 05:39:30PM +0530, Srivatsa S. Bhat wrote:
> >>> Hi Paul,
> >>>
> >>> While running a CPU hotplug stress test on v3.5-rc7+
> >>> (mainline commit 8a7298b7805ab) I hit this warning.
> >>> I haven't tried to debug this yet...
> >>>
> >>> Line number 1550 maps to:
> >>>
> >>> WARN_ON_ONCE(cpu_is_offline(smp_processor_id()));
> >>>
> >>> inside rcu_do_batch().
> >>
> >> Hello, Srivatsa,
> >>
> >> I believe that you need commit a16b7a69 (Prevent __call_rcu() from
> >> invoking RCU core on offline CPUs), which is currently in -tip, queued
> >> for 3.6.  Please see below for the patch.
> >>
> >> Does this help?
> >>
> > 
> > Hi Paul,
> > 
> > I am hitting the cpu_is_offline() warning in rcu_do_batch() (see 2 of the
> > examples below) occasionally while testing CPU hotplug on Thomas' smp/hotplug
> > branch in -tip. It does contain the commit that you had mentioned above.
> > 
> 
> I also hit some writeback related problems during some of these runs. But I was
> not able to reproduce them after that occurrence. (Adding relevant people to CC.)
>
> I hit the divide error shown below during the CPU hotplug test run, and the general

I've hit the divide error before. And I'm not luckier than you in
reproducing the bug. I tempt to add a test as the workaround:

+       if (WARN_ON(!denominator))
+               return dirty;

btw, any chance you may share the CPU hotplug test scripts?
It'd be a valuable addition to my 0day boot test system.

> protection fault subsequently, while trying to shutdown the machine after the test.

Thanks,
Fengguang

> [  522.987310] SMP alternatives: switching to SMP code
> [  522.999101] smpboot: Booting Node 1 Processor 7 APIC 0x16
> [  524.083872] SMP alternatives: lockdep: fixing up alternatives
> [  524.090053] smpboot: Booting Node 0 Processor 8 APIC 0x1
> [  525.148720] SMP alternatives: lockdep: fixing up alternatives
> [  525.154970] smpboot: Booting Node 0 Processor 9 APIC 0x3
> [  526.024180] divide error: 0000 [#1] SMP 
> [  526.028144] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse loop dm_mod iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm cdc_ether pcspkr usbnet shpchp pci_hotplug i2c_i801 i2c_core ioatdma mii crc32c_intel serio_raw microcode lpc_ich mfd_core i7core_edac bnx2 dca edac_core tpm_tis tpm sg tpm_bios rtc_cmos button uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
> [  526.028145] CPU 9 
> [  526.028145] Pid: 2235, comm: flush-8:0 Not tainted 3.6.0-rc1-tglx-hotplug-0.0.0.28.36b5ec9-default #1 IBM IBM System x -[7870C4Q]-/68Y8033     
> [  526.028145] RIP: 0010:[<ffffffff811276f6>]  [<ffffffff811276f6>] bdi_dirty_limit+0x66/0xc0
> [  526.028145] RSP: 0018:ffff8811530bfcc0  EFLAGS: 00010206
> [  526.028145] RAX: 0000000000b9877e RBX: 00000000001a8112 RCX: 28f5c28f5c28f5c3
> [  526.028145] RDX: 0000000000000000 RSI: 0000000000b9877e RDI: 0000000000000000
> [  526.028145] RBP: ffff8811530bfce0 R08: 0000000000000010 R09: 0000000000000000
> [  526.028145] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8808d4408e20
> [  526.028145] R13: ffff8808d4408e20 R14: ffff8808d44091a0 R15: 0000000000000000
> [  526.028145] FS:  0000000000000000(0000) GS:ffff8808ddd40000(0000) knlGS:0000000000000000
> [  526.028145] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [  526.028145] CR2: 00007fa35dd4eb60 CR3: 0000000001a0c000 CR4: 00000000000007e0
> [  526.028145] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  526.028145] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [  526.028145] Process flush-8:0 (pid: 2235, threadinfo ffff8811530be000, task ffff88115315c5e0)
> [  526.028145] Stack:
> [  526.028145]  0400000000000000 0000000000000007 ffff8808d4408e20 ffffffffffffffee
> [  526.028145]  ffff8811530bfd10 ffffffff811ae95c 0000000000350225 00000000001a8112
> [  526.028145]  0000000000000000 0000000000000002 ffff8811530bfdc0 ffffffff811b0620
> [  526.028145] Call Trace:
> [  526.209272] SMP alternatives: lockdep: fixing up alternatives
> [  526.209275] smpboot: Booting Node 0 Processor 10 APIC 0x5
> [  526.220012]  [<ffffffff811ae95c>] over_bground_thresh+0x7c/0x90
> [  526.220012]  [<ffffffff811b0620>] wb_do_writeback+0x170/0x310
> [  526.220012]  [<ffffffff811b08eb>] bdi_writeback_thread+0x12b/0x420
> [  526.220012]  [<ffffffff811b07c0>] ? wb_do_writeback+0x310/0x310
> [  526.220012]  [<ffffffff8106deae>] kthread+0xde/0xf0
> [  526.220012]  [<ffffffff814c6184>] kernel_thread_helper+0x4/0x10
> [  526.220012]  [<ffffffff814bc1f0>] ? retint_restore_args+0x13/0x13
> [  526.220012]  [<ffffffff8106ddd0>] ? __init_kthread_worker+0x70/0x70
> [  526.220012]  [<ffffffff814c6180>] ? gs_change+0x13/0x13
> [  526.220012] Code: 28 5c 8f c2 f5 28 8b 7d e0 48 89 c6 48 0f af f3 48 c1 ee 02 48 89 f0 48 f7 e1 48 89 d6 31 d2 48 c1 ee 02 48 0f af 75 e8 48 89 f0 <48> f7 f7 41 8b 94 24 74 02 00 00 48 0f af d3 48 89 c7 48 c1 ea 
> [  526.220012] RIP  [<ffffffff811276f6>] bdi_dirty_limit+0x66/0xc0
> [  526.220012]  RSP <ffff8811530bfcc0>
> [  526.304469] ---[ end trace bcfc7ab74bdb11a5 ]---
> [  527.330948] SMP alternatives: lockdep: fixing up alternatives
> 
> 
> ----
> 
> [ 1941.614775] SMP alternatives: lockdep: fixing up alternatives
> [ 1941.620614] smpboot: Booting Node 1 Processor 5 APIC 0x12
> [ 1941.657424] SMP alternatives: lockdep: fixing up alternatives
> [ 1941.663215] smpboot: Booting Node 1 Processor 6 APIC 0x14
> 
> [ 1992.721819] general protection fault: 0000 [#2] SMP 
> [ 1992.724844] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse loop dm_mod iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm cdc_ether pcspkr usbnet shpchp pci_hotplug i2c_i801 i2c_core ioatdma mii crc32c_intel serio_raw microcode lpc_ich mfd_core i7core_edac bnx2 dca edac_core tpm_tis tpm sg tpm_bios rtc_cmos button uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase
> scsi_transport_sas scsi_mod thermal thermal_sys hwmon
> [ 1992.726995] CPU 8 
> [ 1992.726995] Pid: 19654, comm: shutdown Tainted: G      D      3.6.0-rc1-tglx-hotplug-0.0.0.28.36b5ec9-default #1 IBM IBM System x -[7870C4Q]-/68Y8033     
> [ 1992.726995] RIP: 0010:[<ffffffff810843d7>]  [<ffffffff810843d7>] try_to_wake_up+0x57/0x2f0
> [ 1992.726995] RSP: 0018:ffff8808d47e5e58  EFLAGS: 00010002
> [ 1992.726995] RAX: 6b6b6b6b6b6b6b6b RBX: 000000000000000f RCX: 000000006b6b6b6b
> [ 1992.726995] RDX: 000000006b6c6b6b RSI: ffffffff817a7fbf RDI: ffff88115315cdd0
> [ 1992.726995] RBP: ffff8808d47e5e98 R08: 0000000000000002 R09: 0000000000000001
> [ 1992.726995] R10: 0000000000000000 R11: 0000000000000001 R12: ffff88115315c5e0
> [ 1992.726995] R13: 000000006b6b6b6b R14: ffff8808d44091a0 R15: 0000000000000000
> [ 1992.726995] FS:  00007f32b1beb700(0000) GS:ffff8808ddd00000(0000) knlGS:0000000000000000
> [ 1992.726995] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1992.726995] CR2: 00007f32b173d1a0 CR3: 0000001153bd0000 CR4: 00000000000007e0
> [ 1992.726995] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 1992.726995] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 1992.726995] Process shutdown (pid: 19654, threadinfo ffff8808d47e4000, task ffff8808d4b045e0)
> [ 1992.726995] Stack:
> [ 1992.726995]  0000000000000246 ffff88115315cdd0 0000000000000286 0000000000000000
> [ 1992.726995]  ffff8808d4408e20 ffff8808d5433288 ffff8808d44091a0 00746c6168206d65
> [ 1992.726995]  ffff8808d47e5ea8 ffffffff810846a0 ffff8808d47e5ed8 ffffffff811adce1
> [ 1992.726995] Call Trace:
> [ 1992.726995]  [<ffffffff810846a0>] wake_up_process+0x10/0x20
> [ 1992.726995]  [<ffffffff811adce1>] bdi_queue_work+0xd1/0x1f0
> [ 1992.726995]  [<ffffffff811ae7d9>] __bdi_start_writeback+0x79/0x160
> [ 1992.726995]  [<ffffffff811af1b0>] wakeup_flusher_threads+0x120/0x1e0
> [ 1992.726995]  [<ffffffff811af0ca>] ? wakeup_flusher_threads+0x3a/0x1e0
> [ 1992.726995]  [<ffffffff811b45b2>] sys_sync+0x22/0x90
> [ 1992.726995]  [<ffffffff814c4fb9>] system_call_fastpath+0x16/0x1b
> [ 1992.726995] Code: 31 ed 48 89 c7 48 89 45 c8 e8 46 72 43 00 48 89 45 d0 49 8b 04 24 85 c3 0f 84 02 02 00 00 45 8b 6c 24 2c 49 8b 44 24 08 45 85 ed <44> 8b 70 18 74 75 8b 1d dd ea 9d 00 85 db 0f 85 2d 02 00 00 49  
> [ 1992.726995] RIP  [<ffffffff810843d7>] try_to_wake_up+0x57/0x2f0
> [ 1992.726995]  RSP <ffff8808d47e5e58>
> [ 1992.726995] ---[ end trace bcfc7ab74bdb11a6 ]---
> [ 1992.726995] Kernel panic - not syncing: Fatal exception in interrupt
> 
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: WARNING: at kernel/rcutree.c:1558 rcu_do_batch+0x386/0x3a0(), during CPU hotplug
  2012-09-14 11:47       ` Fengguang Wu
@ 2012-09-14 12:18         ` Srivatsa S. Bhat
  2012-09-14 12:25           ` Peter Zijlstra
  2012-09-14 12:28           ` Fengguang Wu
  0 siblings, 2 replies; 20+ messages in thread
From: Srivatsa S. Bhat @ 2012-09-14 12:18 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: paulmck, jack, linux-kernel, Thomas Gleixner, rusty,
	Peter Zijlstra, Tejun Heo, Michael Wang

On 09/14/2012 05:17 PM, Fengguang Wu wrote:
> On Thu, Sep 13, 2012 at 02:05:17PM +0530, Srivatsa S. Bhat wrote:
>> On 09/12/2012 06:06 PM, Srivatsa S. Bhat wrote:
>>> On 07/19/2012 10:45 PM, Paul E. McKenney wrote:
>>>> On Thu, Jul 19, 2012 at 05:39:30PM +0530, Srivatsa S. Bhat wrote:
>>>>> Hi Paul,
>>>>>
>>>>> While running a CPU hotplug stress test on v3.5-rc7+
>>>>> (mainline commit 8a7298b7805ab) I hit this warning.
>>>>> I haven't tried to debug this yet...
>>>>>
>>>>> Line number 1550 maps to:
>>>>>
>>>>> WARN_ON_ONCE(cpu_is_offline(smp_processor_id()));
>>>>>
>>>>> inside rcu_do_batch().
>>>>
>>>> Hello, Srivatsa,
>>>>
>>>> I believe that you need commit a16b7a69 (Prevent __call_rcu() from
>>>> invoking RCU core on offline CPUs), which is currently in -tip, queued
>>>> for 3.6.  Please see below for the patch.
>>>>
>>>> Does this help?
>>>>
>>>
>>> Hi Paul,
>>>
>>> I am hitting the cpu_is_offline() warning in rcu_do_batch() (see 2 of the
>>> examples below) occasionally while testing CPU hotplug on Thomas' smp/hotplug
>>> branch in -tip. It does contain the commit that you had mentioned above.
>>>
>>
>> I also hit some writeback related problems during some of these runs. But I was
>> not able to reproduce them after that occurrence. (Adding relevant people to CC.)
>>
>> I hit the divide error shown below during the CPU hotplug test run, and the general
> 
> I've hit the divide error before. And I'm not luckier than you in
> reproducing the bug. I tempt to add a test as the workaround:
> 

Hi Fengguang,
First of all, thanks for taking a look!

I just hit it today during my testing again! See below for the stack-traces..

> +       if (WARN_ON(!denominator))
> +               return dirty;

Yep, I am going to try this out next and see if the denominator really became 0.
I went through the code and I didn't find anything wrong with it. It looks pretty
good (the denominator sanity checking part). Hmm...

> 
> btw, any chance you may share the CPU hotplug test scripts?
> It'd be a valuable addition to my 0day boot test system.
> 

Sure, I was using a very simple script that does offline/online in sequence. That's all.
Please find it below. Note that adding a breather between the hotplug operations (like
adding sleep as shown below) has the potential to expose different bugs than the normal
stress test (no sleeps between subsequent operations). So both cases are useful to test.
For example, with the sleep, I hit this divide error with higher probability, whereas
without the sleep (stress test) I hit the rcu_do_batch warning more often.

(And on a slightly different note, when x86 supports CPU 0 hotplug (today it doesn't),
you can use i = 0 instead of i = 1 in the loop. I believe that code is already in -tip).


CPU hotplug test script
-----------------------
#! /bin/bash

NUMBER_OF_CPUS=`ls -d /sys/devices/system/cpu/cpu[0-9]* | wc -l`

cd /sys/devices/system/cpu

while [ 1 ]
do
        for ((i=1; i < NUMBER_OF_CPUS; i++))
        do
                sleep 1;
                state=`cat cpu$i/online`
                if [ $state -eq 0 ]
                then
                        echo 1 > cpu$i/online
                else
                        echo 0 > cpu$i/online
                fi
        done
done


>> protection fault subsequently, while trying to shutdown the machine after the test.
> 

Dumps from the divide error and general protection fault in today's testing:


[ 4831.353623] SMP alternatives: lockdep: fixing up alternatives
[ 4831.359465] smpboot: Booting Node 1 Processor 13 APIC 0x13
[ 4833.028342] SMP alternatives: lockdep: fixing up alternatives
[ 4833.034188] smpboot: Booting Node 1 Processor 14 APIC 0x15
[ 4835.693836] SMP alternatives: lockdep: fixing up alternatives
[ 4835.699609] smpboot: Booting Node 1 Processor 15 APIC 0x17
[ 4843.142309] Broke affinity for irq 77
[ 4843.147605] smpboot: CPU 1 is now offline
[ 4843.155885] CPU 9 MCA banks CMCI:2 CMCI:3 CMCI:5
[ 4844.379459] Broke affinity for irq 80
[ 4844.384569] smpboot: CPU 2 is now offline
[ 4844.392111] CPU 10 MCA banks CMCI:2 CMCI:3 CMCI:5
[ 4845.810713] smpboot: CPU 3 is now offline
[ 4845.821409] CPU 11 MCA banks CMCI:2 CMCI:3 CMCI:5
[ 4848.710692] Broke affinity for irq 74
[ 4848.716063] smpboot: CPU 4 is now offline
[ 4848.721781] CPU 5 MCA banks CMCI:6 CMCI:8
[ 4848.726440] CPU 12 MCA banks CMCI:2 CMCI:3 CMCI:5
[ 4855.990349] smpboot: CPU 5 is now offline
[ 4855.997251] CPU 6 MCA banks CMCI:6 CMCI:8
[ 4856.002002] CPU 13 MCA banks CMCI:2 CMCI:3 CMCI:5
[ 4859.198420] smpboot: CPU 6 is now offline
[ 4859.205217] CPU 7 MCA banks CMCI:6 CMCI:8
[ 4859.209858] CPU 14 MCA banks CMCI:2 CMCI:3 CMCI:5
[ 4861.202261] smpboot: CPU 7 is now offline
[ 4861.211009] CPU 12 MCA banks CMCI:6 CMCI:8
[ 4861.215403] CPU 15 MCA banks CMCI:2 CMCI:3 CMCI:5
[ 4863.210557] smpboot: CPU 8 is now offline
[ 4890.536087] divide error: 0000 [#1] SMP 
[ 4890.540047] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse loop dm_mod iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel microcode cdc_ether usbnet i2c_i801 mii tpm_tis pcspkr ioatdma i7core_edac tpm serio_raw i2c_core bnx2 shpchp lpc_ich mfd_core edac_core pci_hotplug dca tpm_bios sg rtc_cmos button uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
[ 4890.540047] CPU 9 
[ 4890.540047] Pid: 2243, comm: flush-8:0 Not tainted 3.6.0-rc1-tglx-hotplug-0.0.0.28.36b5ec9-default #1 IBM IBM System x -[7870C4Q]-/68Y8033     
[ 4890.540047] RIP: 0010:[<ffffffff811276f6>]  [<ffffffff811276f6>] bdi_dirty_limit+0x66/0xc0
[ 4890.540047] RSP: 0018:ffff8808d3e1fcc0  EFLAGS: 00010202
[ 4890.540047] RAX: 0000000000000000 RBX: 00000000001a80c4 RCX: 28f5c28f5c28f5c3
[ 4890.540047] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 4890.540047] RBP: ffff8808d3e1fce0 R08: 0000000000000010 R09: 0000000000000000
[ 4890.540047] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8808d3d80e20
[ 4890.540047] R13: ffff8808d3d80e20 R14: ffff8808d3d811a0 R15: 0000000000000000
[ 4890.540047] FS:  0000000000000000(0000) GS:ffff8808ddd40000(0000) knlGS:0000000000000000
[ 4890.540047] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 4890.540047] CR2: ffffffffff600400 CR3: 0000000001a0c000 CR4: 00000000000007e0
[ 4890.540047] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 4890.540047] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 4890.540047] Process flush-8:0 (pid: 2243, threadinfo ffff8808d3e1e000, task ffff8808d4060000)
[ 4890.540047] Stack:
[ 4890.540047]  0001000000000000 0000000000000000 ffff8808d3d80e20 fffffffffffffff7
[ 4890.540047]  ffff8808d3e1fd10 ffffffff811ae95c 0000000000350189 00000000001a80c4
[ 4890.540047]  0000000000000000 0000000000000000 ffff8808d3e1fdc0 ffffffff811b0620
[ 4890.540047] Call Trace:
[ 4890.540047]  [<ffffffff811ae95c>] over_bground_thresh+0x7c/0x90
[ 4890.540047]  [<ffffffff811b0620>] wb_do_writeback+0x170/0x310
[ 4890.540047]  [<ffffffff811b08eb>] bdi_writeback_thread+0x12b/0x420
[ 4890.540047]  [<ffffffff811b07c0>] ? wb_do_writeback+0x310/0x310
[ 4890.540047]  [<ffffffff8106deae>] kthread+0xde/0xf0
[ 4890.540047]  [<ffffffff814c6184>] kernel_thread_helper+0x4/0x10
[ 4890.540047]  [<ffffffff814bc1f0>] ? retint_restore_args+0x13/0x13
[ 4890.540047]  [<ffffffff8106ddd0>] ? __init_kthread_worker+0x70/0x70
[ 4890.540047]  [<ffffffff814c6180>] ? gs_change+0x13/0x13
[ 4890.540047] Code: 28 5c 8f c2 f5 28 8b 7d e0 48 89 c6 48 0f af f3 48 c1 ee 02 48 89 f0 48 f7 e1 48 89 d6 31 d2 48 c1 ee 02 48 0f af 75 e8 48 89 f0 <48> f7 f7 41 8b 94 24 74 02 00 00 48 0f af d3 48 89 c7 48 c1 ea 
[ 4890.540047] RIP  [<ffffffff811276f6>] bdi_dirty_limit+0x66/0xc0
[ 4890.540047]  RSP <ffff8808d3e1fcc0>
[ 4890.805627] ---[ end trace 3bdb61d99b220aeb ]---
[ 4890.818925] Broke affinity for irq 76
[ 4890.823921] smpboot: CPU 9 is now offline
[ 4900.238238] general protection fault: 0000 [#2] SMP 
[ 4900.242203] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse loop dm_mod iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel microcode cdc_ether usbnet i2c_i801 mii tpm_tis pcspkr ioatdma i7core_edac tpm serio_raw i2c_core bnx2 shpchp lpc_ich mfd_core edac_core pci_hotplug dca tpm_bios sg rtc_cmos button uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
[ 4900.248087] CPU 13 
[ 4900.248087] Pid: 0, comm: swapper/13 Tainted: G      D      3.6.0-rc1-tglx-hotplug-0.0.0.28.36b5ec9-default #1 IBM IBM System x -[7870C4Q]-/68Y8033     
[ 4900.248087] RIP: 0010:[<ffffffff810843d7>]  [<ffffffff810843d7>] try_to_wake_up+0x57/0x2f0
[ 4900.248087] RSP: 0018:ffff88117fd43d30  EFLAGS: 00010002
[ 4900.248087] RAX: 6b6b6b6b6b6b6b6b RBX: 000000000000000f RCX: 000000006b6b6b6b
[ 4900.248087] RDX: 000000006b6c6b6b RSI: ffffffff817a7fbf RDI: ffff8808d40607f0
[ 4900.248087] RBP: ffff88117fd43d70 R08: 0000000000000002 R09: 0000000000000001
[ 4900.248087] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8808d4060000
[ 4900.248087] R13: 000000006b6b6b6b R14: ffff8808d3d80e20 R15: 0000000000000000
[ 4900.248087] FS:  0000000000000000(0000) GS:ffff88117fd40000(0000) knlGS:0000000000000000
[ 4900.248087] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 4900.248087] CR2: 00007f472ef5f068 CR3: 0000000001a0c000 CR4: 00000000000007e0
[ 4900.248087] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 4900.248087] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 4900.248087] Process swapper/13 (pid: 0, threadinfo ffff8808d762e000, task ffff8808d7630000)
[ 4900.248087] Stack:
[ 4900.248087]  0000000000000246 ffff8808d40607f0 0000000000000286 ffff8808d762ffd8
[ 4900.248087]  ffff8808d3d80e20 ffff8808d3d811a0 ffff8808d3d80e20 ffffffff8113d970
[ 4900.248087]  ffff88117fd43d80 ffffffff810846a0 ffff88117fd43db0 ffffffff8113da22
[ 4900.248087] Call Trace:
[ 4900.248087]  <IRQ> 
[ 4900.248087]  [<ffffffff8113d970>] ? bdi_init+0x270/0x270
[ 4900.248087]  [<ffffffff810846a0>] wake_up_process+0x10/0x20
[ 4900.248087]  [<ffffffff8113da22>] wakeup_timer_fn+0xb2/0x1c0
[ 4900.248087]  [<ffffffff810557db>] call_timer_fn+0xcb/0x240
[ 4900.248087]  [<ffffffff81055710>] ? detach_if_pending+0x150/0x150
[ 4900.248087]  [<ffffffff8113d970>] ? bdi_init+0x270/0x270
[ 4900.248087]  [<ffffffff81056f75>] run_timer_softirq+0x1c5/0x2c0
[ 4900.248087]  [<ffffffff8104dc79>] __do_softirq+0x159/0x400
[ 4900.248087]  [<ffffffff814c627c>] call_softirq+0x1c/0x30
[ 4900.248087]  [<ffffffff810044f5>] do_softirq+0x95/0xd0
[ 4900.248087]  [<ffffffff8104d745>] irq_exit+0xe5/0x100
[ 4900.248087]  [<ffffffff81028df9>] smp_apic_timer_interrupt+0x69/0xa0
[ 4900.248087]  [<ffffffff814c5aef>] apic_timer_interrupt+0x6f/0x80
[ 4900.248087]  <EOI> 
[ 4900.248087]  [<ffffffff812db3d3>] ? intel_idle+0xc3/0x170
[ 4900.248087]  [<ffffffff812db3d7>] ? intel_idle+0xc7/0x170
[ 4900.248087]  [<ffffffff812db3d3>] ? intel_idle+0xc3/0x170
[ 4900.248087]  [<ffffffff813b1937>] cpuidle_enter+0x17/0x20
[ 4900.248087]  [<ffffffff813b1952>] cpuidle_enter_state+0x12/0x50
[ 4900.248087]  [<ffffffff813b2403>] cpuidle_idle_call+0x133/0x260
[ 4900.248087]  [<ffffffff8100b1f5>] cpu_idle+0x95/0xf0
[ 4900.248087]  [<ffffffff814b1819>] start_secondary+0x85/0x8c
[ 4900.248087] Code: 31 ed 48 89 c7 48 89 45 c8 e8 46 72 43 00 48 89 45 d0 49 8b 04 24 85 c3 0f 84 02 02 00 00 45 8b 6c 24 2c 49 8b 44 24 08 45 85 ed <44> 8b 70 18 74 75 8b 1d dd ea 9d 00 85 db 0f 85 2d 02 00 00 49 
[ 4900.248087] RIP  [<ffffffff810843d7>] try_to_wake_up+0x57/0x2f0
[ 4900.248087]  RSP <ffff88117fd43d30>
[ 4900.248087] ---[ end trace 3bdb61d99b220aec ]---
[ 4900.248087] Kernel panic - not syncing: Fatal exception in interrupt




^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: WARNING: at kernel/rcutree.c:1558 rcu_do_batch+0x386/0x3a0(), during CPU hotplug
  2012-09-14 12:18         ` Srivatsa S. Bhat
@ 2012-09-14 12:25           ` Peter Zijlstra
  2012-09-14 12:32             ` Fengguang Wu
  2012-09-14 12:34             ` Srivatsa S. Bhat
  2012-09-14 12:28           ` Fengguang Wu
  1 sibling, 2 replies; 20+ messages in thread
From: Peter Zijlstra @ 2012-09-14 12:25 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Fengguang Wu, paulmck, jack, linux-kernel, Thomas Gleixner,
	rusty, Tejun Heo, Michael Wang

On Fri, 2012-09-14 at 17:48 +0530, Srivatsa S. Bhat wrote:
> #! /bin/bash

CPUPATH="/sys/devices/system/cpu"

> NUMBER_OF_CPUS=`ls -d /sys/devices/system/cpu/cpu[0-9]* | wc -l`

apply the above

> cd /sys/devices/system/cpu

skip this, so running the script doesn't change PWD

> while [ 1 ]

while :;

> do
>         for ((i=1; i < NUMBER_OF_CPUS; i++))
>         do
>                 sleep 1;

Also play with shorter sleeps like: sleep .1, I've found that higher
hotplug rate triggers some races faster.

>                 state=`cat cpu$i/online`
>                 if [ $state -eq 0 ]
>                 then
>                         echo 1 > cpu$i/online
>                 else
>                         echo 0 > cpu$i/online

	echo $((state^1)) > $CPUPATH/cpu$i/online

>                 fi
>         done
> done 



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: WARNING: at kernel/rcutree.c:1558 rcu_do_batch+0x386/0x3a0(), during CPU hotplug
  2012-09-14 12:18         ` Srivatsa S. Bhat
  2012-09-14 12:25           ` Peter Zijlstra
@ 2012-09-14 12:28           ` Fengguang Wu
  1 sibling, 0 replies; 20+ messages in thread
From: Fengguang Wu @ 2012-09-14 12:28 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: paulmck, jack, linux-kernel, Thomas Gleixner, rusty,
	Peter Zijlstra, Tejun Heo, Michael Wang

On Fri, Sep 14, 2012 at 05:48:20PM +0530, Srivatsa S. Bhat wrote:
> On 09/14/2012 05:17 PM, Fengguang Wu wrote:
> > On Thu, Sep 13, 2012 at 02:05:17PM +0530, Srivatsa S. Bhat wrote:
> >> On 09/12/2012 06:06 PM, Srivatsa S. Bhat wrote:
> >>> On 07/19/2012 10:45 PM, Paul E. McKenney wrote:
> >>>> On Thu, Jul 19, 2012 at 05:39:30PM +0530, Srivatsa S. Bhat wrote:
> >>>>> Hi Paul,
> >>>>>
> >>>>> While running a CPU hotplug stress test on v3.5-rc7+
> >>>>> (mainline commit 8a7298b7805ab) I hit this warning.
> >>>>> I haven't tried to debug this yet...
> >>>>>
> >>>>> Line number 1550 maps to:
> >>>>>
> >>>>> WARN_ON_ONCE(cpu_is_offline(smp_processor_id()));
> >>>>>
> >>>>> inside rcu_do_batch().
> >>>>
> >>>> Hello, Srivatsa,
> >>>>
> >>>> I believe that you need commit a16b7a69 (Prevent __call_rcu() from
> >>>> invoking RCU core on offline CPUs), which is currently in -tip, queued
> >>>> for 3.6.  Please see below for the patch.
> >>>>
> >>>> Does this help?
> >>>>
> >>>
> >>> Hi Paul,
> >>>
> >>> I am hitting the cpu_is_offline() warning in rcu_do_batch() (see 2 of the
> >>> examples below) occasionally while testing CPU hotplug on Thomas' smp/hotplug
> >>> branch in -tip. It does contain the commit that you had mentioned above.
> >>>
> >>
> >> I also hit some writeback related problems during some of these runs. But I was
> >> not able to reproduce them after that occurrence. (Adding relevant people to CC.)
> >>
> >> I hit the divide error shown below during the CPU hotplug test run, and the general
> > 
> > I've hit the divide error before. And I'm not luckier than you in
> > reproducing the bug. I tempt to add a test as the workaround:
> > 
> 
> Hi Fengguang,
> First of all, thanks for taking a look!
> 
> I just hit it today during my testing again! See below for the stack-traces..
> 
> > +       if (WARN_ON(!denominator))
> > +               return dirty;
> 
> Yep, I am going to try this out next and see if the denominator really became 0.

OK, thanks for the testing!

> I went through the code and I didn't find anything wrong with it. It looks pretty
> good (the denominator sanity checking part). Hmm...

Yeah, looks so..

> > btw, any chance you may share the CPU hotplug test scripts?
> > It'd be a valuable addition to my 0day boot test system.
> > 
> 
> Sure, I was using a very simple script that does offline/online in sequence. That's all.
> Please find it below. Note that adding a breather between the hotplug operations (like
> adding sleep as shown below) has the potential to expose different bugs than the normal
> stress test (no sleeps between subsequent operations). So both cases are useful to test.
> For example, with the sleep, I hit this divide error with higher probability, whereas
> without the sleep (stress test) I hit the rcu_do_batch warning more often.
> 
> (And on a slightly different note, when x86 supports CPU 0 hotplug (today it doesn't),
> you can use i = 0 instead of i = 1 in the loop. I believe that code is already in -tip).

Thank you very much for the script and the detailed notes!!

Thanks,
Fengguang

> 
> CPU hotplug test script
> -----------------------
> #! /bin/bash
> 
> NUMBER_OF_CPUS=`ls -d /sys/devices/system/cpu/cpu[0-9]* | wc -l`
> 
> cd /sys/devices/system/cpu
> 
> while [ 1 ]
> do
>         for ((i=1; i < NUMBER_OF_CPUS; i++))
>         do
>                 sleep 1;
>                 state=`cat cpu$i/online`
>                 if [ $state -eq 0 ]
>                 then
>                         echo 1 > cpu$i/online
>                 else
>                         echo 0 > cpu$i/online
>                 fi
>         done
> done
> 
> 
> >> protection fault subsequently, while trying to shutdown the machine after the test.
> > 
> 
> Dumps from the divide error and general protection fault in today's testing:
> 
> 
> [ 4831.353623] SMP alternatives: lockdep: fixing up alternatives
> [ 4831.359465] smpboot: Booting Node 1 Processor 13 APIC 0x13
> [ 4833.028342] SMP alternatives: lockdep: fixing up alternatives
> [ 4833.034188] smpboot: Booting Node 1 Processor 14 APIC 0x15
> [ 4835.693836] SMP alternatives: lockdep: fixing up alternatives
> [ 4835.699609] smpboot: Booting Node 1 Processor 15 APIC 0x17
> [ 4843.142309] Broke affinity for irq 77
> [ 4843.147605] smpboot: CPU 1 is now offline
> [ 4843.155885] CPU 9 MCA banks CMCI:2 CMCI:3 CMCI:5
> [ 4844.379459] Broke affinity for irq 80
> [ 4844.384569] smpboot: CPU 2 is now offline
> [ 4844.392111] CPU 10 MCA banks CMCI:2 CMCI:3 CMCI:5
> [ 4845.810713] smpboot: CPU 3 is now offline
> [ 4845.821409] CPU 11 MCA banks CMCI:2 CMCI:3 CMCI:5
> [ 4848.710692] Broke affinity for irq 74
> [ 4848.716063] smpboot: CPU 4 is now offline
> [ 4848.721781] CPU 5 MCA banks CMCI:6 CMCI:8
> [ 4848.726440] CPU 12 MCA banks CMCI:2 CMCI:3 CMCI:5
> [ 4855.990349] smpboot: CPU 5 is now offline
> [ 4855.997251] CPU 6 MCA banks CMCI:6 CMCI:8
> [ 4856.002002] CPU 13 MCA banks CMCI:2 CMCI:3 CMCI:5
> [ 4859.198420] smpboot: CPU 6 is now offline
> [ 4859.205217] CPU 7 MCA banks CMCI:6 CMCI:8
> [ 4859.209858] CPU 14 MCA banks CMCI:2 CMCI:3 CMCI:5
> [ 4861.202261] smpboot: CPU 7 is now offline
> [ 4861.211009] CPU 12 MCA banks CMCI:6 CMCI:8
> [ 4861.215403] CPU 15 MCA banks CMCI:2 CMCI:3 CMCI:5
> [ 4863.210557] smpboot: CPU 8 is now offline
> [ 4890.536087] divide error: 0000 [#1] SMP 
> [ 4890.540047] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse loop dm_mod iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel microcode cdc_ether usbnet i2c_i801 mii tpm_tis pcspkr ioatdma i7core_edac tpm serio_raw i2c_core bnx2 shpchp lpc_ich mfd_core edac_core pci_hotplug dca tpm_bios sg rtc_cmos button uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
> [ 4890.540047] CPU 9 
> [ 4890.540047] Pid: 2243, comm: flush-8:0 Not tainted 3.6.0-rc1-tglx-hotplug-0.0.0.28.36b5ec9-default #1 IBM IBM System x -[7870C4Q]-/68Y8033     
> [ 4890.540047] RIP: 0010:[<ffffffff811276f6>]  [<ffffffff811276f6>] bdi_dirty_limit+0x66/0xc0
> [ 4890.540047] RSP: 0018:ffff8808d3e1fcc0  EFLAGS: 00010202
> [ 4890.540047] RAX: 0000000000000000 RBX: 00000000001a80c4 RCX: 28f5c28f5c28f5c3
> [ 4890.540047] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> [ 4890.540047] RBP: ffff8808d3e1fce0 R08: 0000000000000010 R09: 0000000000000000
> [ 4890.540047] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8808d3d80e20
> [ 4890.540047] R13: ffff8808d3d80e20 R14: ffff8808d3d811a0 R15: 0000000000000000
> [ 4890.540047] FS:  0000000000000000(0000) GS:ffff8808ddd40000(0000) knlGS:0000000000000000
> [ 4890.540047] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 4890.540047] CR2: ffffffffff600400 CR3: 0000000001a0c000 CR4: 00000000000007e0
> [ 4890.540047] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 4890.540047] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 4890.540047] Process flush-8:0 (pid: 2243, threadinfo ffff8808d3e1e000, task ffff8808d4060000)
> [ 4890.540047] Stack:
> [ 4890.540047]  0001000000000000 0000000000000000 ffff8808d3d80e20 fffffffffffffff7
> [ 4890.540047]  ffff8808d3e1fd10 ffffffff811ae95c 0000000000350189 00000000001a80c4
> [ 4890.540047]  0000000000000000 0000000000000000 ffff8808d3e1fdc0 ffffffff811b0620
> [ 4890.540047] Call Trace:
> [ 4890.540047]  [<ffffffff811ae95c>] over_bground_thresh+0x7c/0x90
> [ 4890.540047]  [<ffffffff811b0620>] wb_do_writeback+0x170/0x310
> [ 4890.540047]  [<ffffffff811b08eb>] bdi_writeback_thread+0x12b/0x420
> [ 4890.540047]  [<ffffffff811b07c0>] ? wb_do_writeback+0x310/0x310
> [ 4890.540047]  [<ffffffff8106deae>] kthread+0xde/0xf0
> [ 4890.540047]  [<ffffffff814c6184>] kernel_thread_helper+0x4/0x10
> [ 4890.540047]  [<ffffffff814bc1f0>] ? retint_restore_args+0x13/0x13
> [ 4890.540047]  [<ffffffff8106ddd0>] ? __init_kthread_worker+0x70/0x70
> [ 4890.540047]  [<ffffffff814c6180>] ? gs_change+0x13/0x13
> [ 4890.540047] Code: 28 5c 8f c2 f5 28 8b 7d e0 48 89 c6 48 0f af f3 48 c1 ee 02 48 89 f0 48 f7 e1 48 89 d6 31 d2 48 c1 ee 02 48 0f af 75 e8 48 89 f0 <48> f7 f7 41 8b 94 24 74 02 00 00 48 0f af d3 48 89 c7 48 c1 ea 
> [ 4890.540047] RIP  [<ffffffff811276f6>] bdi_dirty_limit+0x66/0xc0
> [ 4890.540047]  RSP <ffff8808d3e1fcc0>
> [ 4890.805627] ---[ end trace 3bdb61d99b220aeb ]---
> [ 4890.818925] Broke affinity for irq 76
> [ 4890.823921] smpboot: CPU 9 is now offline
> [ 4900.238238] general protection fault: 0000 [#2] SMP 
> [ 4900.242203] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse loop dm_mod iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel microcode cdc_ether usbnet i2c_i801 mii tpm_tis pcspkr ioatdma i7core_edac tpm serio_raw i2c_core bnx2 shpchp lpc_ich mfd_core edac_core pci_hotplug dca tpm_bios sg rtc_cmos button uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
> [ 4900.248087] CPU 13 
> [ 4900.248087] Pid: 0, comm: swapper/13 Tainted: G      D      3.6.0-rc1-tglx-hotplug-0.0.0.28.36b5ec9-default #1 IBM IBM System x -[7870C4Q]-/68Y8033     
> [ 4900.248087] RIP: 0010:[<ffffffff810843d7>]  [<ffffffff810843d7>] try_to_wake_up+0x57/0x2f0
> [ 4900.248087] RSP: 0018:ffff88117fd43d30  EFLAGS: 00010002
> [ 4900.248087] RAX: 6b6b6b6b6b6b6b6b RBX: 000000000000000f RCX: 000000006b6b6b6b
> [ 4900.248087] RDX: 000000006b6c6b6b RSI: ffffffff817a7fbf RDI: ffff8808d40607f0
> [ 4900.248087] RBP: ffff88117fd43d70 R08: 0000000000000002 R09: 0000000000000001
> [ 4900.248087] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8808d4060000
> [ 4900.248087] R13: 000000006b6b6b6b R14: ffff8808d3d80e20 R15: 0000000000000000
> [ 4900.248087] FS:  0000000000000000(0000) GS:ffff88117fd40000(0000) knlGS:0000000000000000
> [ 4900.248087] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 4900.248087] CR2: 00007f472ef5f068 CR3: 0000000001a0c000 CR4: 00000000000007e0
> [ 4900.248087] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 4900.248087] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 4900.248087] Process swapper/13 (pid: 0, threadinfo ffff8808d762e000, task ffff8808d7630000)
> [ 4900.248087] Stack:
> [ 4900.248087]  0000000000000246 ffff8808d40607f0 0000000000000286 ffff8808d762ffd8
> [ 4900.248087]  ffff8808d3d80e20 ffff8808d3d811a0 ffff8808d3d80e20 ffffffff8113d970
> [ 4900.248087]  ffff88117fd43d80 ffffffff810846a0 ffff88117fd43db0 ffffffff8113da22
> [ 4900.248087] Call Trace:
> [ 4900.248087]  <IRQ> 
> [ 4900.248087]  [<ffffffff8113d970>] ? bdi_init+0x270/0x270
> [ 4900.248087]  [<ffffffff810846a0>] wake_up_process+0x10/0x20
> [ 4900.248087]  [<ffffffff8113da22>] wakeup_timer_fn+0xb2/0x1c0
> [ 4900.248087]  [<ffffffff810557db>] call_timer_fn+0xcb/0x240
> [ 4900.248087]  [<ffffffff81055710>] ? detach_if_pending+0x150/0x150
> [ 4900.248087]  [<ffffffff8113d970>] ? bdi_init+0x270/0x270
> [ 4900.248087]  [<ffffffff81056f75>] run_timer_softirq+0x1c5/0x2c0
> [ 4900.248087]  [<ffffffff8104dc79>] __do_softirq+0x159/0x400
> [ 4900.248087]  [<ffffffff814c627c>] call_softirq+0x1c/0x30
> [ 4900.248087]  [<ffffffff810044f5>] do_softirq+0x95/0xd0
> [ 4900.248087]  [<ffffffff8104d745>] irq_exit+0xe5/0x100
> [ 4900.248087]  [<ffffffff81028df9>] smp_apic_timer_interrupt+0x69/0xa0
> [ 4900.248087]  [<ffffffff814c5aef>] apic_timer_interrupt+0x6f/0x80
> [ 4900.248087]  <EOI> 
> [ 4900.248087]  [<ffffffff812db3d3>] ? intel_idle+0xc3/0x170
> [ 4900.248087]  [<ffffffff812db3d7>] ? intel_idle+0xc7/0x170
> [ 4900.248087]  [<ffffffff812db3d3>] ? intel_idle+0xc3/0x170
> [ 4900.248087]  [<ffffffff813b1937>] cpuidle_enter+0x17/0x20
> [ 4900.248087]  [<ffffffff813b1952>] cpuidle_enter_state+0x12/0x50
> [ 4900.248087]  [<ffffffff813b2403>] cpuidle_idle_call+0x133/0x260
> [ 4900.248087]  [<ffffffff8100b1f5>] cpu_idle+0x95/0xf0
> [ 4900.248087]  [<ffffffff814b1819>] start_secondary+0x85/0x8c
> [ 4900.248087] Code: 31 ed 48 89 c7 48 89 45 c8 e8 46 72 43 00 48 89 45 d0 49 8b 04 24 85 c3 0f 84 02 02 00 00 45 8b 6c 24 2c 49 8b 44 24 08 45 85 ed <44> 8b 70 18 74 75 8b 1d dd ea 9d 00 85 db 0f 85 2d 02 00 00 49 
> [ 4900.248087] RIP  [<ffffffff810843d7>] try_to_wake_up+0x57/0x2f0
> [ 4900.248087]  RSP <ffff88117fd43d30>
> [ 4900.248087] ---[ end trace 3bdb61d99b220aec ]---
> [ 4900.248087] Kernel panic - not syncing: Fatal exception in interrupt
> 
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: WARNING: at kernel/rcutree.c:1558 rcu_do_batch+0x386/0x3a0(), during CPU hotplug
  2012-09-14 12:25           ` Peter Zijlstra
@ 2012-09-14 12:32             ` Fengguang Wu
  2012-09-14 12:34             ` Srivatsa S. Bhat
  1 sibling, 0 replies; 20+ messages in thread
From: Fengguang Wu @ 2012-09-14 12:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srivatsa S. Bhat, paulmck, jack, linux-kernel, Thomas Gleixner,
	rusty, Tejun Heo, Michael Wang

On Fri, Sep 14, 2012 at 02:25:19PM +0200, Peter Zijlstra wrote:
> On Fri, 2012-09-14 at 17:48 +0530, Srivatsa S. Bhat wrote:
> > #! /bin/bash
> 
> CPUPATH="/sys/devices/system/cpu"
> 
> > NUMBER_OF_CPUS=`ls -d /sys/devices/system/cpu/cpu[0-9]* | wc -l`
> 
> apply the above
> 
> > cd /sys/devices/system/cpu
> 
> skip this, so running the script doesn't change PWD

Yes, that's good practice.

> > while [ 1 ]
> 
> while :;

I used to use 'true', but ':' looks cool!

> > do
> >         for ((i=1; i < NUMBER_OF_CPUS; i++))
> >         do
> >                 sleep 1;
> 
> Also play with shorter sleeps like: sleep .1, I've found that higher
> hotplug rate triggers some races faster.

OK.

> >                 state=`cat cpu$i/online`
> >                 if [ $state -eq 0 ]
> >                 then
> >                         echo 1 > cpu$i/online
> >                 else
> >                         echo 0 > cpu$i/online
> 
> 	echo $((state^1)) > $CPUPATH/cpu$i/online

Good!

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: WARNING: at kernel/rcutree.c:1558 rcu_do_batch+0x386/0x3a0(), during CPU hotplug
  2012-09-14 12:25           ` Peter Zijlstra
  2012-09-14 12:32             ` Fengguang Wu
@ 2012-09-14 12:34             ` Srivatsa S. Bhat
  1 sibling, 0 replies; 20+ messages in thread
From: Srivatsa S. Bhat @ 2012-09-14 12:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Fengguang Wu, paulmck, jack, linux-kernel, Thomas Gleixner,
	rusty, Tejun Heo, Michael Wang

[-- Attachment #1: Type: text/plain, Size: 1351 bytes --]

On 09/14/2012 05:55 PM, Peter Zijlstra wrote:
> On Fri, 2012-09-14 at 17:48 +0530, Srivatsa S. Bhat wrote:
>> #! /bin/bash
> 
> CPUPATH="/sys/devices/system/cpu"
> 
>> NUMBER_OF_CPUS=`ls -d /sys/devices/system/cpu/cpu[0-9]* | wc -l`
> 
> apply the above
> 
>> cd /sys/devices/system/cpu
> 
> skip this, so running the script doesn't change PWD
> 
>> while [ 1 ]
> 
> while :;
> 
>> do
>>         for ((i=1; i < NUMBER_OF_CPUS; i++))
>>         do
>>                 sleep 1;
> 
> Also play with shorter sleeps like: sleep .1, I've found that higher
> hotplug rate triggers some races faster.
> 
>>                 state=`cat cpu$i/online`
>>                 if [ $state -eq 0 ]
>>                 then
>>                         echo 1 > cpu$i/online
>>                 else
>>                         echo 0 > cpu$i/online
> 
> 	echo $((state^1)) > $CPUPATH/cpu$i/online
> 
>>                 fi
>>         done
>> done 
> 
> 

Thanks a lot Peter for your suggestions!

Fengguang, like I had mentioned some time before, somebody had pointed
me to a hotplug test-suite that I believe is being used in LTP.

I have a copy of that suite, which I have attached with this mail.
I have never actually used it seriously before (because I could trigger
bugs with my simple script!), so I don't exactly know what state it is
in ;-)

Regards,
Srivatsa S. Bhat

[-- Attachment #2: lhcs_regression-1.6.tgz --]
[-- Type: application/x-compressed-tar, Size: 17854 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: WARNING: at kernel/rcutree.c:1558 rcu_do_batch+0x386/0x3a0(), during CPU hotplug
  2012-09-13 12:47         ` Srivatsa S. Bhat
  2012-09-14  4:33           ` Michael Wang
@ 2012-09-26  9:35           ` Srivatsa S. Bhat
  2012-09-27  2:59             ` Michael Wang
  1 sibling, 1 reply; 20+ messages in thread
From: Srivatsa S. Bhat @ 2012-09-26  9:35 UTC (permalink / raw)
  To: Michael Wang, paulmck, Thomas Gleixner
  Cc: linux-kernel, rusty, Peter Zijlstra, Tejun Heo, Liu, Chuansheng,
	Ingo Molnar, Borislav Petkov, Fengguang Wu

On 09/13/2012 06:17 PM, Srivatsa S. Bhat wrote:
> On 09/13/2012 12:00 PM, Michael Wang wrote:
>> On 09/12/2012 11:31 PM, Paul E. McKenney wrote:
>>> On Wed, Sep 12, 2012 at 06:06:20PM +0530, Srivatsa S. Bhat wrote:
>>>> On 07/19/2012 10:45 PM, Paul E. McKenney wrote:
>>>>> On Thu, Jul 19, 2012 at 05:39:30PM +0530, Srivatsa S. Bhat wrote:
>>>>>> Hi Paul,
>>>>>>
>>>>>> While running a CPU hotplug stress test on v3.5-rc7+
>>>>>> (mainline commit 8a7298b7805ab) I hit this warning.
>>>>>> I haven't tried to debug this yet...
>>>>>>
>>>>>> Line number 1550 maps to:
>>>>>>
>>>>>> WARN_ON_ONCE(cpu_is_offline(smp_processor_id()));
>>>>>>
>>>>>> inside rcu_do_batch().
>>>>>
>>>>> Hello, Srivatsa,
>>>>>
>>>>> I believe that you need commit a16b7a69 (Prevent __call_rcu() from
>>>>> invoking RCU core on offline CPUs), which is currently in -tip, queued
>>>>> for 3.6.  Please see below for the patch.
>>>>>
>>>>> Does this help?
>>>>>
>>>>
>>>> Hi Paul,
>>>>
>>>> I am hitting the cpu_is_offline() warning in rcu_do_batch() (see 2 of the
>>>> examples below) occasionally while testing CPU hotplug on Thomas' smp/hotplug
>>>> branch in -tip. It does contain the commit that you had mentioned above.
>>>>
>>>> The stack trace suggests that we are not hitting this from the __call_rcu()
>>>> path. So I guess this needs a different fix?
>>>
>>> So there was an interrupt from stop_machine_stop().  Because RCU complained
>>> about offline, I presume that this was on exit from stop_machine_stop().
>>> (Otherwise, on entry to stop_machine_stop(), the CPU has not yet marked
>>> itself offline, right?)
>>>
> 
> Yes, that's my understanding too.
> 
>>> So my question is:  Why didn't the CPU shut off all interrupts before
>>> coming out of stop_machine_stop()?
>>>
>>> Or am I confused about what is really happening here?
>>

The interesting thing is that in every single instance of an interrupt hitting
an offline CPU, the IP points to the statement : local_irq_restore(flags)
in stop_machine_cpu_stop(). And the most common case of such an interrupt is the
APIC timer interrupt (though I have also seen a few instances of do_IRQ() (device
interrupts) in some of my stacktraces).

So, I'm inclined to conclude that the functions responsible for clearing interrupts
during CPU offline, such as clear_local_APIC() and friends are not doing their
job properly..

Below are results from some of my CPU hotplug tests, in which I had added debug
print statements and WARN_ON()s to catch situations where interrupts hit offline
CPUs.
(Address <ffffffff810cf8aa> corresponds to line no. 481 in kernel/stop_machine.c,
ie., local_irq_restore(flags) in stop_machine_cpu_stop()).

Regards,
Srivatsa S. Bhat

---------------

[ 4730.906259] smpboot: Booting Node 1 Processor 6 APIC 0x14
[ 4731.018008] SMP alternatives: lockdep: fixing up alternatives
[ 4731.023811] smpboot: Booting Node 1 Processor 7 APIC 0x16
[ 4731.132009] SMP Apic timer interrupt to offline CPU! 8
[ 4731.132009] ------------[ cut here ]------------
[ 4731.132009] WARNING: at arch/x86/kernel/apic/apic.c:904 smp_apic_timer_interrupt+0xc9/0xd0()
[ 4731.132009] Hardware name: IBM System x -[7870C4Q]-
[ 4731.132009] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse loop dm_mod iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm cdc_ether crc32c_intel i7core_edac shpchp usbnet pcspkr edac_core microcode tpm_tis mii bnx2 serio_raw lpc_ich mfd_core i2c_i801 pci_hotplug i2c_core tpm ioatdma dca tpm_bios sg button rtc_cmos uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
[ 4731.132009] Pid: 12843, comm: migration/8 Not tainted 3.6.0-rc7-warn-cpu-interrupt-0.0.0.28.36b5ec9-default #3
[ 4731.132009] Call Trace:
[ 4731.132009]  <IRQ>  [<ffffffff81028f99>] ? smp_apic_timer_interrupt+0xc9/0xd0
[ 4731.132009]  [<ffffffff81028f99>] ? smp_apic_timer_interrupt+0xc9/0xd0
[ 4731.132009]  [<ffffffff810433ea>] warn_slowpath_common+0x7a/0xb0
[ 4731.132009]  [<ffffffff81043435>] warn_slowpath_null+0x15/0x20
[ 4731.132009]  [<ffffffff81028f99>] smp_apic_timer_interrupt+0xc9/0xd0
[ 4731.132009]  [<ffffffff814c662f>] apic_timer_interrupt+0x6f/0x80
[ 4731.132009]  <EOI>  [<ffffffff810cf8aa>] ? stop_machine_cpu_stop+0xda/0x130
[ 4731.132009]  [<ffffffff810cf7d0>] ? stop_one_cpu_nowait+0x50/0x50
[ 4731.132009]  [<ffffffff810cf4e9>] cpu_stopper_thread+0xd9/0x1b0
[ 4731.132009]  [<ffffffff814bc98f>] ? _raw_spin_unlock_irqrestore+0x3f/0x80
[ 4731.132009]  [<ffffffff810cf410>] ? res_counter_init+0x50/0x50
[ 4731.132009]  [<ffffffff810ac36d>] ? trace_hardirqs_on_caller+0x12d/0x1b0
[ 4731.132009]  [<ffffffff810ac3fd>] ? trace_hardirqs_on+0xd/0x10
[ 4731.132009]  [<ffffffff810cf410>] ? res_counter_init+0x50/0x50
[ 4731.132009]  [<ffffffff8106df0e>] kthread+0x9e/0xb0
[ 4731.132009]  [<ffffffff814c6cc4>] kernel_thread_helper+0x4/0x10
[ 4731.132009]  [<ffffffff814bcd30>] ? retint_restore_args+0x13/0x13
[ 4731.132009]  [<ffffffff8106de70>] ? __init_kthread_worker+0x70/0x70
[ 4731.132009]  [<ffffffff814c6cc0>] ? gs_change+0x13/0x13
[ 4731.132009] ---[ end trace 4591bad96b0e5f4b ]---
[ 4731.440036] smpboot: CPU 8 is now offline
[ 4731.485506] smpboot: CPU 9 is now offline
[ 4731.533459] smpboot: CPU 10 is now offline
[ 4731.591316] smpboot: CPU 11 is now offline
[ 4731.616017] CPU 3 MCA banks CMCI:2 CMCI:3 CMCI:5
[ 4731.729817] smpboot: CPU 12 is now offline
[ 4731.748034] CPU 4 MCA banks CMCI:2 CMCI:3 CMCI:5 CMCI:6 CMCI:8
[ 4731.800110] smpboot: CPU 13 is now offline


[ 4909.232645] SMP alternatives: lockdep: fixing up alternatives
[ 4909.238633] smpboot: Booting Node 1 Processor 7 APIC 0x16
[ 4909.350250] smpboot: CPU 8 is now offline
[ 4909.395118] Broke affinity for irq 3
[ 4909.400743] smpboot: CPU 9 is now offline
[ 4909.444681] smpboot: CPU 10 is now offline
[ 4909.496011] SMP Apic timer interrupt to offline CPU! 11
[ 4909.496011] ------------[ cut here ]------------
[ 4909.496011] WARNING: at arch/x86/kernel/apic/apic.c:904 smp_apic_timer_interrupt+0xc9/0xd0()
[ 4909.496011] Hardware name: IBM System x -[7870C4Q]-
[ 4909.496011] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse loop dm_mod iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm cdc_ether crc32c_intel i7core_edac shpchp usbnet pcspkr edac_core microcode tpm_tis mii bnx2 serio_raw lpc_ich mfd_core i2c_i801 pci_hotplug i2c_core tpm ioatdma dca tpm_bios sg button rtc_cmos uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
[ 4909.496011] Pid: 10241, comm: migration/11 Tainted: G        W    3.6.0-rc7-warn-cpu-interrupt-0.0.0.28.36b5ec9-default #3
[ 4909.496011] Call Trace:
[ 4909.496011]  <IRQ>  [<ffffffff81028f99>] ? smp_apic_timer_interrupt+0xc9/0xd0
[ 4909.496011]  [<ffffffff81028f99>] ? smp_apic_timer_interrupt+0xc9/0xd0
[ 4909.496011]  [<ffffffff810433ea>] warn_slowpath_common+0x7a/0xb0
[ 4909.496011]  [<ffffffff81043435>] warn_slowpath_null+0x15/0x20
[ 4909.496011]  [<ffffffff81028f99>] smp_apic_timer_interrupt+0xc9/0xd0
[ 4909.496011]  [<ffffffff814c662f>] apic_timer_interrupt+0x6f/0x80
[ 4909.496011]  <EOI>  [<ffffffff810cf8aa>] ? stop_machine_cpu_stop+0xda/0x130
[ 4909.496011]  [<ffffffff810cf7d0>] ? stop_one_cpu_nowait+0x50/0x50
[ 4909.496011]  [<ffffffff810cf4e9>] cpu_stopper_thread+0xd9/0x1b0
[ 4909.496011]  [<ffffffff814bc98f>] ? _raw_spin_unlock_irqrestore+0x3f/0x80
[ 4909.496011]  [<ffffffff810cf410>] ? res_counter_init+0x50/0x50
[ 4909.496011]  [<ffffffff810ac36d>] ? trace_hardirqs_on_caller+0x12d/0x1b0
[ 4909.496011]  [<ffffffff810ac3fd>] ? trace_hardirqs_on+0xd/0x10
[ 4909.496011]  [<ffffffff810cf410>] ? res_counter_init+0x50/0x50
[ 4909.496011]  [<ffffffff8106df0e>] kthread+0x9e/0xb0
[ 4909.496011]  [<ffffffff814c6cc4>] kernel_thread_helper+0x4/0x10
[ 4909.496011]  [<ffffffff814bcd30>] ? retint_restore_args+0x13/0x13
[ 4909.496011]  [<ffffffff8106de70>] ? __init_kthread_worker+0x70/0x70
[ 4909.496011]  [<ffffffff814c6cc0>] ? gs_change+0x13/0x13
[ 4909.496011] ---[ end trace 4591bad96b0e5f4c ]---
[ 4909.703691] ------------[ cut here ]------------
[ 4909.708124] WARNING: at kernel/rcutree.c:1560 rcu_do_batch+0x386/0x3a0()
[ 4909.708124] Hardware name: IBM System x -[7870C4Q]-
[ 4909.708124] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse loop dm_mod iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm cdc_ether crc32c_intel i7core_edac shpchp usbnet pcspkr edac_core microcode tpm_tis mii bnx2 serio_raw lpc_ich mfd_core i2c_i801 pci_hotplug i2c_core tpm ioatdma dca tpm_bios sg button rtc_cmos uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
[ 4909.708124] Pid: 10241, comm: migration/11 Tainted: G        W    3.6.0-rc7-warn-cpu-interrupt-0.0.0.28.36b5ec9-default #3
[ 4909.708124] Call Trace:
[ 4909.708124]  <IRQ>  [<ffffffff810e7336>] ? rcu_do_batch+0x386/0x3a0
[ 4909.708124]  [<ffffffff810e7336>] ? rcu_do_batch+0x386/0x3a0
[ 4909.708124]  [<ffffffff810433ea>] warn_slowpath_common+0x7a/0xb0
[ 4909.708124]  [<ffffffff81043435>] warn_slowpath_null+0x15/0x20
[ 4909.708124]  [<ffffffff810e7336>] rcu_do_batch+0x386/0x3a0
[ 4909.708124]  [<ffffffff810ac2b0>] ? trace_hardirqs_on_caller+0x70/0x1b0
[ 4909.708124]  [<ffffffff810ac3fd>] ? trace_hardirqs_on+0xd/0x10
[ 4909.708124]  [<ffffffff810e8c73>] __rcu_process_callbacks+0x1a3/0x200
[ 4909.708124]  [<ffffffff810e8d58>] rcu_process_callbacks+0x88/0x240
[ 4909.708124]  [<ffffffff8104dcb9>] __do_softirq+0x159/0x400
[ 4909.708124]  [<ffffffff814c6dbc>] call_softirq+0x1c/0x30
[ 4909.708124]  [<ffffffff81004525>] do_softirq+0x95/0xd0
[ 4909.708124]  [<ffffffff8104d785>] irq_exit+0xe5/0x100
[ 4909.708124]  [<ffffffff81028f51>] smp_apic_timer_interrupt+0x81/0xd0
[ 4909.708124]  [<ffffffff814c662f>] apic_timer_interrupt+0x6f/0x80
[ 4909.708124]  <EOI>  [<ffffffff810cf8aa>] ? stop_machine_cpu_stop+0xda/0x130
[ 4909.708124]  [<ffffffff810cf7d0>] ? stop_one_cpu_nowait+0x50/0x50
[ 4909.708124]  [<ffffffff810cf4e9>] cpu_stopper_thread+0xd9/0x1b0
[ 4909.708124]  [<ffffffff814bc98f>] ? _raw_spin_unlock_irqrestore+0x3f/0x80
[ 4909.708124]  [<ffffffff810cf410>] ? res_counter_init+0x50/0x50
[ 4909.708124]  [<ffffffff810ac36d>] ? trace_hardirqs_on_caller+0x12d/0x1b0
[ 4909.708124]  [<ffffffff810ac3fd>] ? trace_hardirqs_on+0xd/0x10
[ 4909.708124]  [<ffffffff810cf410>] ? res_counter_init+0x50/0x50
[ 4909.708124]  [<ffffffff8106df0e>] kthread+0x9e/0xb0
[ 4909.708124]  [<ffffffff814c6cc4>] kernel_thread_helper+0x4/0x10
[ 4909.708124]  [<ffffffff814bcd30>] ? retint_restore_args+0x13/0x13
[ 4909.708124]  [<ffffffff8106de70>] ? __init_kthread_worker+0x70/0x70
[ 4909.708124]  [<ffffffff814c6cc0>] ? gs_change+0x13/0x13
[ 4909.708124] ---[ end trace 4591bad96b0e5f4d ]---
[ 4909.812522] smpboot: CPU 11 is now offline


[ 4974.727112] smpboot: CPU 12 is now offline
[ 4974.737763] CPU 4 MCA banks CMCI:2 CMCI:3 CMCI:5 CMCI:6 CMCI:8
[ 4974.896043] smpboot: CPU 13 is now offline
[ 4974.903566] CPU 5 MCA banks CMCI:2 CMCI:3 CMCI:5
[ 4974.968010] SMP Apic timer interrupt to offline CPU! 14
[ 4974.968010] ------------[ cut here ]------------
[ 4974.968010] WARNING: at arch/x86/kernel/apic/apic.c:904 smp_apic_timer_interrupt+0xc9/0xd0()
[ 4974.968010] Hardware name: IBM System x -[7870C4Q]-
[ 4974.968010] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse loop dm_mod iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm cdc_ether crc32c_intel i7core_edac shpchp usbnet pcspkr edac_core microcode tpm_tis mii bnx2 serio_raw lpc_ich mfd_core i2c_i801 pci_hotplug i2c_core tpm ioatdma dca tpm_bios sg button rtc_cmos uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
[ 4974.968010] Pid: 20651, comm: migration/14 Tainted: G        W    3.6.0-rc7-warn-cpu-interrupt-0.0.0.28.36b5ec9-default #3
[ 4974.968010] Call Trace:
[ 4974.968010]  <IRQ>  [<ffffffff81028f99>] ? smp_apic_timer_interrupt+0xc9/0xd0
[ 4974.968010]  [<ffffffff81028f99>] ? smp_apic_timer_interrupt+0xc9/0xd0
[ 4974.968010]  [<ffffffff810433ea>] warn_slowpath_common+0x7a/0xb0
[ 4974.968010]  [<ffffffff81043435>] warn_slowpath_null+0x15/0x20
[ 4974.968010]  [<ffffffff81028f99>] smp_apic_timer_interrupt+0xc9/0xd0
[ 4974.968010]  [<ffffffff814c662f>] apic_timer_interrupt+0x6f/0x80
[ 4974.968010]  <EOI>  [<ffffffff810cf8aa>] ? stop_machine_cpu_stop+0xda/0x130
[ 4974.968010]  [<ffffffff810cf7d0>] ? stop_one_cpu_nowait+0x50/0x50
[ 4974.968010]  [<ffffffff810cf4e9>] cpu_stopper_thread+0xd9/0x1b0
[ 4974.968010]  [<ffffffff814bc98f>] ? _raw_spin_unlock_irqrestore+0x3f/0x80
[ 4974.968010]  [<ffffffff810cf410>] ? res_counter_init+0x50/0x50
[ 4974.968010]  [<ffffffff810ac36d>] ? trace_hardirqs_on_caller+0x12d/0x1b0
[ 4974.968010]  [<ffffffff810ac3fd>] ? trace_hardirqs_on+0xd/0x10
[ 4974.968010]  [<ffffffff810cf410>] ? res_counter_init+0x50/0x50
[ 4974.968010]  [<ffffffff8106df0e>] kthread+0x9e/0xb0
[ 4974.968010]  [<ffffffff814c6cc4>] kernel_thread_helper+0x4/0x10
[ 4974.968010]  [<ffffffff814bcd30>] ? retint_restore_args+0x13/0x13
[ 4974.968010]  [<ffffffff8106de70>] ? __init_kthread_worker+0x70/0x70
[ 4974.968010]  [<ffffffff814c6cc0>] ? gs_change+0x13/0x13
[ 4974.968010] ---[ end trace 4591bad96b0e5f4e ]---
[ 4975.179145] smpboot: CPU 14 is now offline


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: WARNING: at kernel/rcutree.c:1558 rcu_do_batch+0x386/0x3a0(), during CPU hotplug
  2012-09-26  9:35           ` Srivatsa S. Bhat
@ 2012-09-27  2:59             ` Michael Wang
  2012-09-27 19:06               ` Srivatsa S. Bhat
  0 siblings, 1 reply; 20+ messages in thread
From: Michael Wang @ 2012-09-27  2:59 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: paulmck, Thomas Gleixner, linux-kernel, rusty, Peter Zijlstra,
	Tejun Heo, Liu, Chuansheng, Ingo Molnar, Borislav Petkov,
	Fengguang Wu, hpa, x86

On 09/26/2012 05:35 PM, Srivatsa S. Bhat wrote:
> On 09/13/2012 06:17 PM, Srivatsa S. Bhat wrote:
>> On 09/13/2012 12:00 PM, Michael Wang wrote:
>>> On 09/12/2012 11:31 PM, Paul E. McKenney wrote:
>>>> On Wed, Sep 12, 2012 at 06:06:20PM +0530, Srivatsa S. Bhat wrote:
>>>>> On 07/19/2012 10:45 PM, Paul E. McKenney wrote:
>>>>>> On Thu, Jul 19, 2012 at 05:39:30PM +0530, Srivatsa S. Bhat wrote:
>>>>>>> Hi Paul,
>>>>>>>
>>>>>>> While running a CPU hotplug stress test on v3.5-rc7+
>>>>>>> (mainline commit 8a7298b7805ab) I hit this warning.
>>>>>>> I haven't tried to debug this yet...
>>>>>>>
>>>>>>> Line number 1550 maps to:
>>>>>>>
>>>>>>> WARN_ON_ONCE(cpu_is_offline(smp_processor_id()));
>>>>>>>
>>>>>>> inside rcu_do_batch().
>>>>>>
>>>>>> Hello, Srivatsa,
>>>>>>
>>>>>> I believe that you need commit a16b7a69 (Prevent __call_rcu() from
>>>>>> invoking RCU core on offline CPUs), which is currently in -tip, queued
>>>>>> for 3.6.  Please see below for the patch.
>>>>>>
>>>>>> Does this help?
>>>>>>
>>>>>
>>>>> Hi Paul,
>>>>>
>>>>> I am hitting the cpu_is_offline() warning in rcu_do_batch() (see 2 of the
>>>>> examples below) occasionally while testing CPU hotplug on Thomas' smp/hotplug
>>>>> branch in -tip. It does contain the commit that you had mentioned above.
>>>>>
>>>>> The stack trace suggests that we are not hitting this from the __call_rcu()
>>>>> path. So I guess this needs a different fix?
>>>>
>>>> So there was an interrupt from stop_machine_stop().  Because RCU complained
>>>> about offline, I presume that this was on exit from stop_machine_stop().
>>>> (Otherwise, on entry to stop_machine_stop(), the CPU has not yet marked
>>>> itself offline, right?)
>>>>
>>
>> Yes, that's my understanding too.
>>
>>>> So my question is:  Why didn't the CPU shut off all interrupts before
>>>> coming out of stop_machine_stop()?
>>>>
>>>> Or am I confused about what is really happening here?
>>>
> 
> The interesting thing is that in every single instance of an interrupt hitting
> an offline CPU, the IP points to the statement : local_irq_restore(flags)
> in stop_machine_cpu_stop(). And the most common case of such an interrupt is the
> APIC timer interrupt (though I have also seen a few instances of do_IRQ() (device
> interrupts) in some of my stacktraces).
> 
> So, I'm inclined to conclude that the functions responsible for clearing interrupts
> during CPU offline, such as clear_local_APIC() and friends are not doing their
> job properly..

Agree, "we try to shut down apic but failed", that's may be the reason.

I think we need some help from the experts on apic code now(or don't do
any job in do_IRQ if cpu is offline, but a bad solution)...

Regards,
Michael Wang

> 
> Below are results from some of my CPU hotplug tests, in which I had added debug
> print statements and WARN_ON()s to catch situations where interrupts hit offline
> CPUs.
> (Address <ffffffff810cf8aa> corresponds to line no. 481 in kernel/stop_machine.c,
> ie., local_irq_restore(flags) in stop_machine_cpu_stop()).
> 
> Regards,
> Srivatsa S. Bhat
> 
> ---------------
> 
> [ 4730.906259] smpboot: Booting Node 1 Processor 6 APIC 0x14
> [ 4731.018008] SMP alternatives: lockdep: fixing up alternatives
> [ 4731.023811] smpboot: Booting Node 1 Processor 7 APIC 0x16
> [ 4731.132009] SMP Apic timer interrupt to offline CPU! 8
> [ 4731.132009] ------------[ cut here ]------------
> [ 4731.132009] WARNING: at arch/x86/kernel/apic/apic.c:904 smp_apic_timer_interrupt+0xc9/0xd0()
> [ 4731.132009] Hardware name: IBM System x -[7870C4Q]-
> [ 4731.132009] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse loop dm_mod iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm cdc_ether crc32c_intel i7core_edac shpchp usbnet pcspkr edac_core microcode tpm_tis mii bnx2 serio_raw lpc_ich mfd_core i2c_i801 pci_hotplug i2c_core tpm ioatdma dca tpm_bios sg button rtc_cmos uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
> [ 4731.132009] Pid: 12843, comm: migration/8 Not tainted 3.6.0-rc7-warn-cpu-interrupt-0.0.0.28.36b5ec9-default #3
> [ 4731.132009] Call Trace:
> [ 4731.132009]  <IRQ>  [<ffffffff81028f99>] ? smp_apic_timer_interrupt+0xc9/0xd0
> [ 4731.132009]  [<ffffffff81028f99>] ? smp_apic_timer_interrupt+0xc9/0xd0
> [ 4731.132009]  [<ffffffff810433ea>] warn_slowpath_common+0x7a/0xb0
> [ 4731.132009]  [<ffffffff81043435>] warn_slowpath_null+0x15/0x20
> [ 4731.132009]  [<ffffffff81028f99>] smp_apic_timer_interrupt+0xc9/0xd0
> [ 4731.132009]  [<ffffffff814c662f>] apic_timer_interrupt+0x6f/0x80
> [ 4731.132009]  <EOI>  [<ffffffff810cf8aa>] ? stop_machine_cpu_stop+0xda/0x130
> [ 4731.132009]  [<ffffffff810cf7d0>] ? stop_one_cpu_nowait+0x50/0x50
> [ 4731.132009]  [<ffffffff810cf4e9>] cpu_stopper_thread+0xd9/0x1b0
> [ 4731.132009]  [<ffffffff814bc98f>] ? _raw_spin_unlock_irqrestore+0x3f/0x80
> [ 4731.132009]  [<ffffffff810cf410>] ? res_counter_init+0x50/0x50
> [ 4731.132009]  [<ffffffff810ac36d>] ? trace_hardirqs_on_caller+0x12d/0x1b0
> [ 4731.132009]  [<ffffffff810ac3fd>] ? trace_hardirqs_on+0xd/0x10
> [ 4731.132009]  [<ffffffff810cf410>] ? res_counter_init+0x50/0x50
> [ 4731.132009]  [<ffffffff8106df0e>] kthread+0x9e/0xb0
> [ 4731.132009]  [<ffffffff814c6cc4>] kernel_thread_helper+0x4/0x10
> [ 4731.132009]  [<ffffffff814bcd30>] ? retint_restore_args+0x13/0x13
> [ 4731.132009]  [<ffffffff8106de70>] ? __init_kthread_worker+0x70/0x70
> [ 4731.132009]  [<ffffffff814c6cc0>] ? gs_change+0x13/0x13
> [ 4731.132009] ---[ end trace 4591bad96b0e5f4b ]---
> [ 4731.440036] smpboot: CPU 8 is now offline
> [ 4731.485506] smpboot: CPU 9 is now offline
> [ 4731.533459] smpboot: CPU 10 is now offline
> [ 4731.591316] smpboot: CPU 11 is now offline
> [ 4731.616017] CPU 3 MCA banks CMCI:2 CMCI:3 CMCI:5
> [ 4731.729817] smpboot: CPU 12 is now offline
> [ 4731.748034] CPU 4 MCA banks CMCI:2 CMCI:3 CMCI:5 CMCI:6 CMCI:8
> [ 4731.800110] smpboot: CPU 13 is now offline
> 
> 
> [ 4909.232645] SMP alternatives: lockdep: fixing up alternatives
> [ 4909.238633] smpboot: Booting Node 1 Processor 7 APIC 0x16
> [ 4909.350250] smpboot: CPU 8 is now offline
> [ 4909.395118] Broke affinity for irq 3
> [ 4909.400743] smpboot: CPU 9 is now offline
> [ 4909.444681] smpboot: CPU 10 is now offline
> [ 4909.496011] SMP Apic timer interrupt to offline CPU! 11
> [ 4909.496011] ------------[ cut here ]------------
> [ 4909.496011] WARNING: at arch/x86/kernel/apic/apic.c:904 smp_apic_timer_interrupt+0xc9/0xd0()
> [ 4909.496011] Hardware name: IBM System x -[7870C4Q]-
> [ 4909.496011] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse loop dm_mod iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm cdc_ether crc32c_intel i7core_edac shpchp usbnet pcspkr edac_core microcode tpm_tis mii bnx2 serio_raw lpc_ich mfd_core i2c_i801 pci_hotplug i2c_core tpm ioatdma dca tpm_bios sg button rtc_cmos uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
> [ 4909.496011] Pid: 10241, comm: migration/11 Tainted: G        W    3.6.0-rc7-warn-cpu-interrupt-0.0.0.28.36b5ec9-default #3
> [ 4909.496011] Call Trace:
> [ 4909.496011]  <IRQ>  [<ffffffff81028f99>] ? smp_apic_timer_interrupt+0xc9/0xd0
> [ 4909.496011]  [<ffffffff81028f99>] ? smp_apic_timer_interrupt+0xc9/0xd0
> [ 4909.496011]  [<ffffffff810433ea>] warn_slowpath_common+0x7a/0xb0
> [ 4909.496011]  [<ffffffff81043435>] warn_slowpath_null+0x15/0x20
> [ 4909.496011]  [<ffffffff81028f99>] smp_apic_timer_interrupt+0xc9/0xd0
> [ 4909.496011]  [<ffffffff814c662f>] apic_timer_interrupt+0x6f/0x80
> [ 4909.496011]  <EOI>  [<ffffffff810cf8aa>] ? stop_machine_cpu_stop+0xda/0x130
> [ 4909.496011]  [<ffffffff810cf7d0>] ? stop_one_cpu_nowait+0x50/0x50
> [ 4909.496011]  [<ffffffff810cf4e9>] cpu_stopper_thread+0xd9/0x1b0
> [ 4909.496011]  [<ffffffff814bc98f>] ? _raw_spin_unlock_irqrestore+0x3f/0x80
> [ 4909.496011]  [<ffffffff810cf410>] ? res_counter_init+0x50/0x50
> [ 4909.496011]  [<ffffffff810ac36d>] ? trace_hardirqs_on_caller+0x12d/0x1b0
> [ 4909.496011]  [<ffffffff810ac3fd>] ? trace_hardirqs_on+0xd/0x10
> [ 4909.496011]  [<ffffffff810cf410>] ? res_counter_init+0x50/0x50
> [ 4909.496011]  [<ffffffff8106df0e>] kthread+0x9e/0xb0
> [ 4909.496011]  [<ffffffff814c6cc4>] kernel_thread_helper+0x4/0x10
> [ 4909.496011]  [<ffffffff814bcd30>] ? retint_restore_args+0x13/0x13
> [ 4909.496011]  [<ffffffff8106de70>] ? __init_kthread_worker+0x70/0x70
> [ 4909.496011]  [<ffffffff814c6cc0>] ? gs_change+0x13/0x13
> [ 4909.496011] ---[ end trace 4591bad96b0e5f4c ]---
> [ 4909.703691] ------------[ cut here ]------------
> [ 4909.708124] WARNING: at kernel/rcutree.c:1560 rcu_do_batch+0x386/0x3a0()
> [ 4909.708124] Hardware name: IBM System x -[7870C4Q]-
> [ 4909.708124] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse loop dm_mod iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm cdc_ether crc32c_intel i7core_edac shpchp usbnet pcspkr edac_core microcode tpm_tis mii bnx2 serio_raw lpc_ich mfd_core i2c_i801 pci_hotplug i2c_core tpm ioatdma dca tpm_bios sg button rtc_cmos uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
> [ 4909.708124] Pid: 10241, comm: migration/11 Tainted: G        W    3.6.0-rc7-warn-cpu-interrupt-0.0.0.28.36b5ec9-default #3
> [ 4909.708124] Call Trace:
> [ 4909.708124]  <IRQ>  [<ffffffff810e7336>] ? rcu_do_batch+0x386/0x3a0
> [ 4909.708124]  [<ffffffff810e7336>] ? rcu_do_batch+0x386/0x3a0
> [ 4909.708124]  [<ffffffff810433ea>] warn_slowpath_common+0x7a/0xb0
> [ 4909.708124]  [<ffffffff81043435>] warn_slowpath_null+0x15/0x20
> [ 4909.708124]  [<ffffffff810e7336>] rcu_do_batch+0x386/0x3a0
> [ 4909.708124]  [<ffffffff810ac2b0>] ? trace_hardirqs_on_caller+0x70/0x1b0
> [ 4909.708124]  [<ffffffff810ac3fd>] ? trace_hardirqs_on+0xd/0x10
> [ 4909.708124]  [<ffffffff810e8c73>] __rcu_process_callbacks+0x1a3/0x200
> [ 4909.708124]  [<ffffffff810e8d58>] rcu_process_callbacks+0x88/0x240
> [ 4909.708124]  [<ffffffff8104dcb9>] __do_softirq+0x159/0x400
> [ 4909.708124]  [<ffffffff814c6dbc>] call_softirq+0x1c/0x30
> [ 4909.708124]  [<ffffffff81004525>] do_softirq+0x95/0xd0
> [ 4909.708124]  [<ffffffff8104d785>] irq_exit+0xe5/0x100
> [ 4909.708124]  [<ffffffff81028f51>] smp_apic_timer_interrupt+0x81/0xd0
> [ 4909.708124]  [<ffffffff814c662f>] apic_timer_interrupt+0x6f/0x80
> [ 4909.708124]  <EOI>  [<ffffffff810cf8aa>] ? stop_machine_cpu_stop+0xda/0x130
> [ 4909.708124]  [<ffffffff810cf7d0>] ? stop_one_cpu_nowait+0x50/0x50
> [ 4909.708124]  [<ffffffff810cf4e9>] cpu_stopper_thread+0xd9/0x1b0
> [ 4909.708124]  [<ffffffff814bc98f>] ? _raw_spin_unlock_irqrestore+0x3f/0x80
> [ 4909.708124]  [<ffffffff810cf410>] ? res_counter_init+0x50/0x50
> [ 4909.708124]  [<ffffffff810ac36d>] ? trace_hardirqs_on_caller+0x12d/0x1b0
> [ 4909.708124]  [<ffffffff810ac3fd>] ? trace_hardirqs_on+0xd/0x10
> [ 4909.708124]  [<ffffffff810cf410>] ? res_counter_init+0x50/0x50
> [ 4909.708124]  [<ffffffff8106df0e>] kthread+0x9e/0xb0
> [ 4909.708124]  [<ffffffff814c6cc4>] kernel_thread_helper+0x4/0x10
> [ 4909.708124]  [<ffffffff814bcd30>] ? retint_restore_args+0x13/0x13
> [ 4909.708124]  [<ffffffff8106de70>] ? __init_kthread_worker+0x70/0x70
> [ 4909.708124]  [<ffffffff814c6cc0>] ? gs_change+0x13/0x13
> [ 4909.708124] ---[ end trace 4591bad96b0e5f4d ]---
> [ 4909.812522] smpboot: CPU 11 is now offline
> 
> 
> [ 4974.727112] smpboot: CPU 12 is now offline
> [ 4974.737763] CPU 4 MCA banks CMCI:2 CMCI:3 CMCI:5 CMCI:6 CMCI:8
> [ 4974.896043] smpboot: CPU 13 is now offline
> [ 4974.903566] CPU 5 MCA banks CMCI:2 CMCI:3 CMCI:5
> [ 4974.968010] SMP Apic timer interrupt to offline CPU! 14
> [ 4974.968010] ------------[ cut here ]------------
> [ 4974.968010] WARNING: at arch/x86/kernel/apic/apic.c:904 smp_apic_timer_interrupt+0xc9/0xd0()
> [ 4974.968010] Hardware name: IBM System x -[7870C4Q]-
> [ 4974.968010] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse loop dm_mod iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm cdc_ether crc32c_intel i7core_edac shpchp usbnet pcspkr edac_core microcode tpm_tis mii bnx2 serio_raw lpc_ich mfd_core i2c_i801 pci_hotplug i2c_core tpm ioatdma dca tpm_bios sg button rtc_cmos uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
> [ 4974.968010] Pid: 20651, comm: migration/14 Tainted: G        W    3.6.0-rc7-warn-cpu-interrupt-0.0.0.28.36b5ec9-default #3
> [ 4974.968010] Call Trace:
> [ 4974.968010]  <IRQ>  [<ffffffff81028f99>] ? smp_apic_timer_interrupt+0xc9/0xd0
> [ 4974.968010]  [<ffffffff81028f99>] ? smp_apic_timer_interrupt+0xc9/0xd0
> [ 4974.968010]  [<ffffffff810433ea>] warn_slowpath_common+0x7a/0xb0
> [ 4974.968010]  [<ffffffff81043435>] warn_slowpath_null+0x15/0x20
> [ 4974.968010]  [<ffffffff81028f99>] smp_apic_timer_interrupt+0xc9/0xd0
> [ 4974.968010]  [<ffffffff814c662f>] apic_timer_interrupt+0x6f/0x80
> [ 4974.968010]  <EOI>  [<ffffffff810cf8aa>] ? stop_machine_cpu_stop+0xda/0x130
> [ 4974.968010]  [<ffffffff810cf7d0>] ? stop_one_cpu_nowait+0x50/0x50
> [ 4974.968010]  [<ffffffff810cf4e9>] cpu_stopper_thread+0xd9/0x1b0
> [ 4974.968010]  [<ffffffff814bc98f>] ? _raw_spin_unlock_irqrestore+0x3f/0x80
> [ 4974.968010]  [<ffffffff810cf410>] ? res_counter_init+0x50/0x50
> [ 4974.968010]  [<ffffffff810ac36d>] ? trace_hardirqs_on_caller+0x12d/0x1b0
> [ 4974.968010]  [<ffffffff810ac3fd>] ? trace_hardirqs_on+0xd/0x10
> [ 4974.968010]  [<ffffffff810cf410>] ? res_counter_init+0x50/0x50
> [ 4974.968010]  [<ffffffff8106df0e>] kthread+0x9e/0xb0
> [ 4974.968010]  [<ffffffff814c6cc4>] kernel_thread_helper+0x4/0x10
> [ 4974.968010]  [<ffffffff814bcd30>] ? retint_restore_args+0x13/0x13
> [ 4974.968010]  [<ffffffff8106de70>] ? __init_kthread_worker+0x70/0x70
> [ 4974.968010]  [<ffffffff814c6cc0>] ? gs_change+0x13/0x13
> [ 4974.968010] ---[ end trace 4591bad96b0e5f4e ]---
> [ 4975.179145] smpboot: CPU 14 is now offline
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: WARNING: at kernel/rcutree.c:1558 rcu_do_batch+0x386/0x3a0(), during CPU hotplug
  2012-09-27  2:59             ` Michael Wang
@ 2012-09-27 19:06               ` Srivatsa S. Bhat
  0 siblings, 0 replies; 20+ messages in thread
From: Srivatsa S. Bhat @ 2012-09-27 19:06 UTC (permalink / raw)
  To: Michael Wang
  Cc: paulmck, Thomas Gleixner, linux-kernel, rusty, Peter Zijlstra,
	Tejun Heo, Liu, Chuansheng, Ingo Molnar, Borislav Petkov,
	Fengguang Wu, hpa, x86, Suresh Siddha, Venkatesh Pallipadi

On 09/27/2012 08:29 AM, Michael Wang wrote:
> On 09/26/2012 05:35 PM, Srivatsa S. Bhat wrote:
>> On 09/13/2012 06:17 PM, Srivatsa S. Bhat wrote:
>>> On 09/13/2012 12:00 PM, Michael Wang wrote:
>>>> On 09/12/2012 11:31 PM, Paul E. McKenney wrote:
>>>>> On Wed, Sep 12, 2012 at 06:06:20PM +0530, Srivatsa S. Bhat wrote:
>>>>>> On 07/19/2012 10:45 PM, Paul E. McKenney wrote:
>>>>>>> On Thu, Jul 19, 2012 at 05:39:30PM +0530, Srivatsa S. Bhat wrote:
>>>>>>>> Hi Paul,
>>>>>>>>
>>>>>>>> While running a CPU hotplug stress test on v3.5-rc7+
>>>>>>>> (mainline commit 8a7298b7805ab) I hit this warning.
>>>>>>>> I haven't tried to debug this yet...
>>>>>>>>
>>>>>>>> Line number 1550 maps to:
>>>>>>>>
>>>>>>>> WARN_ON_ONCE(cpu_is_offline(smp_processor_id()));
>>>>>>>>
>>>>>>>> inside rcu_do_batch().
>>>>>>>
>>>>>>> Hello, Srivatsa,
>>>>>>>
>>>>>>> I believe that you need commit a16b7a69 (Prevent __call_rcu() from
>>>>>>> invoking RCU core on offline CPUs), which is currently in -tip, queued
>>>>>>> for 3.6.  Please see below for the patch.
>>>>>>>
>>>>>>> Does this help?
>>>>>>>
>>>>>>
>>>>>> Hi Paul,
>>>>>>
>>>>>> I am hitting the cpu_is_offline() warning in rcu_do_batch() (see 2 of the
>>>>>> examples below) occasionally while testing CPU hotplug on Thomas' smp/hotplug
>>>>>> branch in -tip. It does contain the commit that you had mentioned above.
>>>>>>
>>>>>> The stack trace suggests that we are not hitting this from the __call_rcu()
>>>>>> path. So I guess this needs a different fix?
>>>>>
>>>>> So there was an interrupt from stop_machine_stop().  Because RCU complained
>>>>> about offline, I presume that this was on exit from stop_machine_stop().
>>>>> (Otherwise, on entry to stop_machine_stop(), the CPU has not yet marked
>>>>> itself offline, right?)
>>>>>
>>>
>>> Yes, that's my understanding too.
>>>
>>>>> So my question is:  Why didn't the CPU shut off all interrupts before
>>>>> coming out of stop_machine_stop()?
>>>>>
>>>>> Or am I confused about what is really happening here?
>>>>
>>
>> The interesting thing is that in every single instance of an interrupt hitting
>> an offline CPU, the IP points to the statement : local_irq_restore(flags)
>> in stop_machine_cpu_stop(). And the most common case of such an interrupt is the
>> APIC timer interrupt (though I have also seen a few instances of do_IRQ() (device
>> interrupts) in some of my stacktraces).
>>
>> So, I'm inclined to conclude that the functions responsible for clearing interrupts
>> during CPU offline, such as clear_local_APIC() and friends are not doing their
>> job properly..
> 
> Agree, "we try to shut down apic but failed", that's may be the reason.
> 
> I think we need some help from the experts on apic code now(or don't do
> any job in do_IRQ if cpu is offline, but a bad solution)...
> 

Adding Suresh and Venki to CC.

Regards,
Srivatsa S. Bhat

> 
>>
>> Below are results from some of my CPU hotplug tests, in which I had added debug
>> print statements and WARN_ON()s to catch situations where interrupts hit offline
>> CPUs.
>> (Address <ffffffff810cf8aa> corresponds to line no. 481 in kernel/stop_machine.c,
>> ie., local_irq_restore(flags) in stop_machine_cpu_stop()).
>>
>> Regards,
>> Srivatsa S. Bhat
>>
>> ---------------
>>
>> [ 4730.906259] smpboot: Booting Node 1 Processor 6 APIC 0x14
>> [ 4731.018008] SMP alternatives: lockdep: fixing up alternatives
>> [ 4731.023811] smpboot: Booting Node 1 Processor 7 APIC 0x16
>> [ 4731.132009] SMP Apic timer interrupt to offline CPU! 8
>> [ 4731.132009] ------------[ cut here ]------------
>> [ 4731.132009] WARNING: at arch/x86/kernel/apic/apic.c:904 smp_apic_timer_interrupt+0xc9/0xd0()
>> [ 4731.132009] Hardware name: IBM System x -[7870C4Q]-
>> [ 4731.132009] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse loop dm_mod iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm cdc_ether crc32c_intel i7core_edac shpchp usbnet pcspkr edac_core microcode tpm_tis mii bnx2 serio_raw lpc_ich mfd_core i2c_i801 pci_hotplug i2c_core tpm ioatdma dca tpm_bios sg button rtc_cmos uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
>> [ 4731.132009] Pid: 12843, comm: migration/8 Not tainted 3.6.0-rc7-warn-cpu-interrupt-0.0.0.28.36b5ec9-default #3
>> [ 4731.132009] Call Trace:
>> [ 4731.132009]  <IRQ>  [<ffffffff81028f99>] ? smp_apic_timer_interrupt+0xc9/0xd0
>> [ 4731.132009]  [<ffffffff81028f99>] ? smp_apic_timer_interrupt+0xc9/0xd0
>> [ 4731.132009]  [<ffffffff810433ea>] warn_slowpath_common+0x7a/0xb0
>> [ 4731.132009]  [<ffffffff81043435>] warn_slowpath_null+0x15/0x20
>> [ 4731.132009]  [<ffffffff81028f99>] smp_apic_timer_interrupt+0xc9/0xd0
>> [ 4731.132009]  [<ffffffff814c662f>] apic_timer_interrupt+0x6f/0x80
>> [ 4731.132009]  <EOI>  [<ffffffff810cf8aa>] ? stop_machine_cpu_stop+0xda/0x130
>> [ 4731.132009]  [<ffffffff810cf7d0>] ? stop_one_cpu_nowait+0x50/0x50
>> [ 4731.132009]  [<ffffffff810cf4e9>] cpu_stopper_thread+0xd9/0x1b0
>> [ 4731.132009]  [<ffffffff814bc98f>] ? _raw_spin_unlock_irqrestore+0x3f/0x80
>> [ 4731.132009]  [<ffffffff810cf410>] ? res_counter_init+0x50/0x50
>> [ 4731.132009]  [<ffffffff810ac36d>] ? trace_hardirqs_on_caller+0x12d/0x1b0
>> [ 4731.132009]  [<ffffffff810ac3fd>] ? trace_hardirqs_on+0xd/0x10
>> [ 4731.132009]  [<ffffffff810cf410>] ? res_counter_init+0x50/0x50
>> [ 4731.132009]  [<ffffffff8106df0e>] kthread+0x9e/0xb0
>> [ 4731.132009]  [<ffffffff814c6cc4>] kernel_thread_helper+0x4/0x10
>> [ 4731.132009]  [<ffffffff814bcd30>] ? retint_restore_args+0x13/0x13
>> [ 4731.132009]  [<ffffffff8106de70>] ? __init_kthread_worker+0x70/0x70
>> [ 4731.132009]  [<ffffffff814c6cc0>] ? gs_change+0x13/0x13
>> [ 4731.132009] ---[ end trace 4591bad96b0e5f4b ]---
>> [ 4731.440036] smpboot: CPU 8 is now offline
>> [ 4731.485506] smpboot: CPU 9 is now offline
>> [ 4731.533459] smpboot: CPU 10 is now offline
>> [ 4731.591316] smpboot: CPU 11 is now offline
>> [ 4731.616017] CPU 3 MCA banks CMCI:2 CMCI:3 CMCI:5
>> [ 4731.729817] smpboot: CPU 12 is now offline
>> [ 4731.748034] CPU 4 MCA banks CMCI:2 CMCI:3 CMCI:5 CMCI:6 CMCI:8
>> [ 4731.800110] smpboot: CPU 13 is now offline
>>
>>
>> [ 4909.232645] SMP alternatives: lockdep: fixing up alternatives
>> [ 4909.238633] smpboot: Booting Node 1 Processor 7 APIC 0x16
>> [ 4909.350250] smpboot: CPU 8 is now offline
>> [ 4909.395118] Broke affinity for irq 3
>> [ 4909.400743] smpboot: CPU 9 is now offline
>> [ 4909.444681] smpboot: CPU 10 is now offline
>> [ 4909.496011] SMP Apic timer interrupt to offline CPU! 11
>> [ 4909.496011] ------------[ cut here ]------------
>> [ 4909.496011] WARNING: at arch/x86/kernel/apic/apic.c:904 smp_apic_timer_interrupt+0xc9/0xd0()
>> [ 4909.496011] Hardware name: IBM System x -[7870C4Q]-
>> [ 4909.496011] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse loop dm_mod iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm cdc_ether crc32c_intel i7core_edac shpchp usbnet pcspkr edac_core microcode tpm_tis mii bnx2 serio_raw lpc_ich mfd_core i2c_i801 pci_hotplug i2c_core tpm ioatdma dca tpm_bios sg button rtc_cmos uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
>> [ 4909.496011] Pid: 10241, comm: migration/11 Tainted: G        W    3.6.0-rc7-warn-cpu-interrupt-0.0.0.28.36b5ec9-default #3
>> [ 4909.496011] Call Trace:
>> [ 4909.496011]  <IRQ>  [<ffffffff81028f99>] ? smp_apic_timer_interrupt+0xc9/0xd0
>> [ 4909.496011]  [<ffffffff81028f99>] ? smp_apic_timer_interrupt+0xc9/0xd0
>> [ 4909.496011]  [<ffffffff810433ea>] warn_slowpath_common+0x7a/0xb0
>> [ 4909.496011]  [<ffffffff81043435>] warn_slowpath_null+0x15/0x20
>> [ 4909.496011]  [<ffffffff81028f99>] smp_apic_timer_interrupt+0xc9/0xd0
>> [ 4909.496011]  [<ffffffff814c662f>] apic_timer_interrupt+0x6f/0x80
>> [ 4909.496011]  <EOI>  [<ffffffff810cf8aa>] ? stop_machine_cpu_stop+0xda/0x130
>> [ 4909.496011]  [<ffffffff810cf7d0>] ? stop_one_cpu_nowait+0x50/0x50
>> [ 4909.496011]  [<ffffffff810cf4e9>] cpu_stopper_thread+0xd9/0x1b0
>> [ 4909.496011]  [<ffffffff814bc98f>] ? _raw_spin_unlock_irqrestore+0x3f/0x80
>> [ 4909.496011]  [<ffffffff810cf410>] ? res_counter_init+0x50/0x50
>> [ 4909.496011]  [<ffffffff810ac36d>] ? trace_hardirqs_on_caller+0x12d/0x1b0
>> [ 4909.496011]  [<ffffffff810ac3fd>] ? trace_hardirqs_on+0xd/0x10
>> [ 4909.496011]  [<ffffffff810cf410>] ? res_counter_init+0x50/0x50
>> [ 4909.496011]  [<ffffffff8106df0e>] kthread+0x9e/0xb0
>> [ 4909.496011]  [<ffffffff814c6cc4>] kernel_thread_helper+0x4/0x10
>> [ 4909.496011]  [<ffffffff814bcd30>] ? retint_restore_args+0x13/0x13
>> [ 4909.496011]  [<ffffffff8106de70>] ? __init_kthread_worker+0x70/0x70
>> [ 4909.496011]  [<ffffffff814c6cc0>] ? gs_change+0x13/0x13
>> [ 4909.496011] ---[ end trace 4591bad96b0e5f4c ]---
>> [ 4909.703691] ------------[ cut here ]------------
>> [ 4909.708124] WARNING: at kernel/rcutree.c:1560 rcu_do_batch+0x386/0x3a0()
>> [ 4909.708124] Hardware name: IBM System x -[7870C4Q]-
>> [ 4909.708124] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse loop dm_mod iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm cdc_ether crc32c_intel i7core_edac shpchp usbnet pcspkr edac_core microcode tpm_tis mii bnx2 serio_raw lpc_ich mfd_core i2c_i801 pci_hotplug i2c_core tpm ioatdma dca tpm_bios sg button rtc_cmos uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
>> [ 4909.708124] Pid: 10241, comm: migration/11 Tainted: G        W    3.6.0-rc7-warn-cpu-interrupt-0.0.0.28.36b5ec9-default #3
>> [ 4909.708124] Call Trace:
>> [ 4909.708124]  <IRQ>  [<ffffffff810e7336>] ? rcu_do_batch+0x386/0x3a0
>> [ 4909.708124]  [<ffffffff810e7336>] ? rcu_do_batch+0x386/0x3a0
>> [ 4909.708124]  [<ffffffff810433ea>] warn_slowpath_common+0x7a/0xb0
>> [ 4909.708124]  [<ffffffff81043435>] warn_slowpath_null+0x15/0x20
>> [ 4909.708124]  [<ffffffff810e7336>] rcu_do_batch+0x386/0x3a0
>> [ 4909.708124]  [<ffffffff810ac2b0>] ? trace_hardirqs_on_caller+0x70/0x1b0
>> [ 4909.708124]  [<ffffffff810ac3fd>] ? trace_hardirqs_on+0xd/0x10
>> [ 4909.708124]  [<ffffffff810e8c73>] __rcu_process_callbacks+0x1a3/0x200
>> [ 4909.708124]  [<ffffffff810e8d58>] rcu_process_callbacks+0x88/0x240
>> [ 4909.708124]  [<ffffffff8104dcb9>] __do_softirq+0x159/0x400
>> [ 4909.708124]  [<ffffffff814c6dbc>] call_softirq+0x1c/0x30
>> [ 4909.708124]  [<ffffffff81004525>] do_softirq+0x95/0xd0
>> [ 4909.708124]  [<ffffffff8104d785>] irq_exit+0xe5/0x100
>> [ 4909.708124]  [<ffffffff81028f51>] smp_apic_timer_interrupt+0x81/0xd0
>> [ 4909.708124]  [<ffffffff814c662f>] apic_timer_interrupt+0x6f/0x80
>> [ 4909.708124]  <EOI>  [<ffffffff810cf8aa>] ? stop_machine_cpu_stop+0xda/0x130
>> [ 4909.708124]  [<ffffffff810cf7d0>] ? stop_one_cpu_nowait+0x50/0x50
>> [ 4909.708124]  [<ffffffff810cf4e9>] cpu_stopper_thread+0xd9/0x1b0
>> [ 4909.708124]  [<ffffffff814bc98f>] ? _raw_spin_unlock_irqrestore+0x3f/0x80
>> [ 4909.708124]  [<ffffffff810cf410>] ? res_counter_init+0x50/0x50
>> [ 4909.708124]  [<ffffffff810ac36d>] ? trace_hardirqs_on_caller+0x12d/0x1b0
>> [ 4909.708124]  [<ffffffff810ac3fd>] ? trace_hardirqs_on+0xd/0x10
>> [ 4909.708124]  [<ffffffff810cf410>] ? res_counter_init+0x50/0x50
>> [ 4909.708124]  [<ffffffff8106df0e>] kthread+0x9e/0xb0
>> [ 4909.708124]  [<ffffffff814c6cc4>] kernel_thread_helper+0x4/0x10
>> [ 4909.708124]  [<ffffffff814bcd30>] ? retint_restore_args+0x13/0x13
>> [ 4909.708124]  [<ffffffff8106de70>] ? __init_kthread_worker+0x70/0x70
>> [ 4909.708124]  [<ffffffff814c6cc0>] ? gs_change+0x13/0x13
>> [ 4909.708124] ---[ end trace 4591bad96b0e5f4d ]---
>> [ 4909.812522] smpboot: CPU 11 is now offline
>>
>>
>> [ 4974.727112] smpboot: CPU 12 is now offline
>> [ 4974.737763] CPU 4 MCA banks CMCI:2 CMCI:3 CMCI:5 CMCI:6 CMCI:8
>> [ 4974.896043] smpboot: CPU 13 is now offline
>> [ 4974.903566] CPU 5 MCA banks CMCI:2 CMCI:3 CMCI:5
>> [ 4974.968010] SMP Apic timer interrupt to offline CPU! 14
>> [ 4974.968010] ------------[ cut here ]------------
>> [ 4974.968010] WARNING: at arch/x86/kernel/apic/apic.c:904 smp_apic_timer_interrupt+0xc9/0xd0()
>> [ 4974.968010] Hardware name: IBM System x -[7870C4Q]-
>> [ 4974.968010] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse loop dm_mod iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm cdc_ether crc32c_intel i7core_edac shpchp usbnet pcspkr edac_core microcode tpm_tis mii bnx2 serio_raw lpc_ich mfd_core i2c_i801 pci_hotplug i2c_core tpm ioatdma dca tpm_bios sg button rtc_cmos uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
>> [ 4974.968010] Pid: 20651, comm: migration/14 Tainted: G        W    3.6.0-rc7-warn-cpu-interrupt-0.0.0.28.36b5ec9-default #3
>> [ 4974.968010] Call Trace:
>> [ 4974.968010]  <IRQ>  [<ffffffff81028f99>] ? smp_apic_timer_interrupt+0xc9/0xd0
>> [ 4974.968010]  [<ffffffff81028f99>] ? smp_apic_timer_interrupt+0xc9/0xd0
>> [ 4974.968010]  [<ffffffff810433ea>] warn_slowpath_common+0x7a/0xb0
>> [ 4974.968010]  [<ffffffff81043435>] warn_slowpath_null+0x15/0x20
>> [ 4974.968010]  [<ffffffff81028f99>] smp_apic_timer_interrupt+0xc9/0xd0
>> [ 4974.968010]  [<ffffffff814c662f>] apic_timer_interrupt+0x6f/0x80
>> [ 4974.968010]  <EOI>  [<ffffffff810cf8aa>] ? stop_machine_cpu_stop+0xda/0x130
>> [ 4974.968010]  [<ffffffff810cf7d0>] ? stop_one_cpu_nowait+0x50/0x50
>> [ 4974.968010]  [<ffffffff810cf4e9>] cpu_stopper_thread+0xd9/0x1b0
>> [ 4974.968010]  [<ffffffff814bc98f>] ? _raw_spin_unlock_irqrestore+0x3f/0x80
>> [ 4974.968010]  [<ffffffff810cf410>] ? res_counter_init+0x50/0x50
>> [ 4974.968010]  [<ffffffff810ac36d>] ? trace_hardirqs_on_caller+0x12d/0x1b0
>> [ 4974.968010]  [<ffffffff810ac3fd>] ? trace_hardirqs_on+0xd/0x10
>> [ 4974.968010]  [<ffffffff810cf410>] ? res_counter_init+0x50/0x50
>> [ 4974.968010]  [<ffffffff8106df0e>] kthread+0x9e/0xb0
>> [ 4974.968010]  [<ffffffff814c6cc4>] kernel_thread_helper+0x4/0x10
>> [ 4974.968010]  [<ffffffff814bcd30>] ? retint_restore_args+0x13/0x13
>> [ 4974.968010]  [<ffffffff8106de70>] ? __init_kthread_worker+0x70/0x70
>> [ 4974.968010]  [<ffffffff814c6cc0>] ? gs_change+0x13/0x13
>> [ 4974.968010] ---[ end trace 4591bad96b0e5f4e ]---
>> [ 4975.179145] smpboot: CPU 14 is now offline
>>


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2012-09-27 19:06 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-07-19 12:09 WARNING: at kernel/rcutree.c:1550 __rcu_process_callbacks+0x46f/0x4b0() Srivatsa S. Bhat
2012-07-19 17:15 ` Paul E. McKenney
2012-07-20 10:41   ` Srivatsa S. Bhat
2012-07-20 14:36     ` Paul E. McKenney
2012-07-20 14:57       ` Srivatsa S. Bhat
2012-09-12 12:36   ` WARNING: at kernel/rcutree.c:1558 rcu_do_batch+0x386/0x3a0(), during CPU hotplug Srivatsa S. Bhat
2012-09-12 15:31     ` Paul E. McKenney
2012-09-13  6:30       ` Michael Wang
2012-09-13 12:47         ` Srivatsa S. Bhat
2012-09-14  4:33           ` Michael Wang
2012-09-26  9:35           ` Srivatsa S. Bhat
2012-09-27  2:59             ` Michael Wang
2012-09-27 19:06               ` Srivatsa S. Bhat
2012-09-13  8:35     ` Srivatsa S. Bhat
2012-09-14 11:47       ` Fengguang Wu
2012-09-14 12:18         ` Srivatsa S. Bhat
2012-09-14 12:25           ` Peter Zijlstra
2012-09-14 12:32             ` Fengguang Wu
2012-09-14 12:34             ` Srivatsa S. Bhat
2012-09-14 12:28           ` Fengguang Wu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).