* Please add to stable:  module: don't unlink the module until we've removed all exposure.
@ 2013-05-31 18:14 Ben Greear
  2013-06-02  5:09 ` Rusty Russell
  0 siblings, 1 reply; 51+ messages in thread
From: Ben Greear @ 2013-05-31 18:14 UTC (permalink / raw)
  To: Linux Kernel Mailing List; +Cc: rusty, stable, Joe Lawrence

It turns out, the bug I spent yesterday chasing in various 3.9 kernels is apparently
fixed by the commit in the title (c9c390bb5535380d40614571894ef0c00bc026ff).

Fortunately, Joe Lawrence somehow saw my email to lkml and pointed me to the
bug report below, which mentions the commit...

https://bugzilla.kernel.org/show_bug.cgi?id=58011


Please consider adding this patch to at least the 3.9 stable queue.

I have a kernel config larded up with debugging options that reproduces
the bug fairly quickly on stock 3.9.[01234] (Fedora-17, 64-bit, in case that matters)
if someone wants it....

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Please add to stable:  module: don't unlink the module until we've removed all exposure.
  2013-05-31 18:14 Please add to stable: module: don't unlink the module until we've removed all exposure Ben Greear
@ 2013-06-02  5:09 ` Rusty Russell
  2013-06-03  3:46   ` Joe Lawrence
  0 siblings, 1 reply; 51+ messages in thread
From: Rusty Russell @ 2013-06-02  5:09 UTC (permalink / raw)
  To: Ben Greear, Linux Kernel Mailing List; +Cc: stable, Joe Lawrence

Ben Greear <greearb@candelatech.com> writes:

> It turns out, the bug I spent yesterday chasing in various 3.9 kernels is apparently
> fixed by the commit in the title (c9c390bb5535380d40614571894ef0c00bc026ff).

Apparently being the operative word.

This commit avoids the entire "module insert failed due to sysfs race"
path in the common case, it doesn't fix any actual problem.

I think the real commit you want is Linus' kobject fix
a49b7e82cab0f9b41f483359be83f44fbb6b4979 "kobject: fix kset_find_obj()
race with concurrent last kobject_put()".

Or is that already in stable?

Cheers,
Rusty.


* Re: Please add to stable:  module: don't unlink the module until we've removed all exposure.
  2013-06-02  5:09 ` Rusty Russell
@ 2013-06-03  3:46   ` Joe Lawrence
  2013-06-03 11:25     ` Joe Lawrence
  0 siblings, 1 reply; 51+ messages in thread
From: Joe Lawrence @ 2013-06-03  3:46 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Ben Greear, Linux Kernel Mailing List, stable, Joe Lawrence

On Sun, 2 Jun 2013, Rusty Russell wrote:

> Ben Greear <greearb@candelatech.com> writes:
> 
> > It turns out, the bug I spent yesterday chasing in various 3.9 kernels is apparently
> > fixed by the commit in the title (c9c390bb5535380d40614571894ef0c00bc026ff).
> 
> Apparently being the operative word.
> 
> This commit avoids the entire "module insert failed due to sysfs race"
> path in the common case, it doesn't fix any actual problem.
> 
> I think the real commit you want is Linus' kobject fix
> a49b7e82cab0f9b41f483359be83f44fbb6b4979 "kobject: fix kset_find_obj()
> race with concurrent last kobject_put()".
> 
> Or is that already in stable?

Hi Rusty,
 
I had pointed Ben (offlist) to that bugzilla entry without realizing
there were other earlier related fixes in this space.  Re-viewing bz-
58011, it looks like it was opened against 3.8.12, while Ben and myself
had encountered module loading problems in versions 3.9 and
3.9.[1-3].  I can update the bugzilla entry to add a comment noting commit
a49b7e82 "kobject: fix kset_find_obj() race with concurrent last
kobject_put()".

That said, it doesn't appear that commit 944a1fa "module: don't unlink the
module until we've removed all exposure" has made it into any stable
kernel.  On my system, applying this on top of 3.9 resolved a module
unload/load race that would occasionally occur on boot (two video adapters
of the same make, the module unloads for whatever reason and I see "module
is already loaded" and "sysfs: cannot create duplicate filename
'/module/mgag200'" messages in 5-10% of instances.)  I have logs if you
are interested in these warnings/crashes.

Hope this clarifies things.

Regards,

-- Joe


* Re: Please add to stable:  module: don't unlink the module until we've removed all exposure.
  2013-06-03  3:46   ` Joe Lawrence
@ 2013-06-03 11:25     ` Joe Lawrence
  2013-06-03 14:17       ` Joe Lawrence
  0 siblings, 1 reply; 51+ messages in thread
From: Joe Lawrence @ 2013-06-03 11:25 UTC (permalink / raw)
  To: Joe Lawrence; +Cc: Rusty Russell, Ben Greear, Linux Kernel Mailing List, stable

[fixing Cc: stable@kernel.org address]

On Sun, 2 Jun 2013, Joe Lawrence wrote:

> On Sun, 2 Jun 2013, Rusty Russell wrote:
> 
> > Ben Greear <greearb@candelatech.com> writes:
> > 
> > > It turns out, the bug I spent yesterday chasing in various 3.9 kernels is apparently
> > > fixed by the commit in the title (c9c390bb5535380d40614571894ef0c00bc026ff).
> > 
> > Apparently being the operative word.
> > 
> > This commit avoids the entire "module insert failed due to sysfs race"
> > path in the common case, it doesn't fix any actual problem.
> > 
> > I think the real commit you want is Linus' kobject fix
> > a49b7e82cab0f9b41f483359be83f44fbb6b4979 "kobject: fix kset_find_obj()
> > race with concurrent last kobject_put()".
> > 
> > Or is that already in stable?
> 
> Hi Rusty,
>  
> I had pointed Ben (offlist) to that bugzilla entry without realizing
> there were other earlier related fixes in this space.  Re-viewing bz-
> 58011, it looks like it was opened against 3.8.12, while Ben and myself
> had encountered module loading problems in versions 3.9 and
> 3.9.[1-3].  I can update the bugzilla entry to add a comment noting commit
> a49b7e82 "kobject: fix kset_find_obj() race with concurrent last
> kobject_put()".
> 
> That said, it doesn't appear that commit 944a1fa "module: don't unlink the
> module until we've removed all exposure" has made it into any stable
> kernel.  On my system, applying this on top of 3.9 resolved a module
> unload/load race that would occasionally occur on boot (two video adapters
> of the same make, the module unloads for whatever reason and I see "module
> is already loaded" and "sysfs: cannot create duplicate filename
> '/module/mgag200'" messages in 5-10% of instances.)  I have logs if you
> are interested in these warnings/crashes.
> 
> Hope this clarifies things.
> 
> Regards,
> 
> -- Joe
> 


* Re: Please add to stable:  module: don't unlink the module until we've removed all exposure.
  2013-06-03 11:25     ` Joe Lawrence
@ 2013-06-03 14:17       ` Joe Lawrence
  2013-06-03 15:59         ` Ben Greear
  2013-06-05  5:07         ` Greg KH
  0 siblings, 2 replies; 51+ messages in thread
From: Joe Lawrence @ 2013-06-03 14:17 UTC (permalink / raw)
  To: Joe Lawrence; +Cc: Rusty Russell, Ben Greear, Linux Kernel Mailing List, stable

[Cc: stable@vger.kernel.org]

Third time is a charm?  The stable address was incorrect from the first 
msg in this thread, but the relevant bits remain quoted below...

On Mon, 3 Jun 2013, Joe Lawrence wrote:

> [fixing Cc: stable@kernel.org address]
> 
> On Sun, 2 Jun 2013, Joe Lawrence wrote:
> 
> > On Sun, 2 Jun 2013, Rusty Russell wrote:
> > 
> > > Ben Greear <greearb@candelatech.com> writes:
> > > 
> > > > It turns out, the bug I spent yesterday chasing in various 3.9 kernels is apparently
> > > > fixed by the commit in the title (c9c390bb5535380d40614571894ef0c00bc026ff).
> > > 
> > > Apparently being the operative word.
> > > 
> > > This commit avoids the entire "module insert failed due to sysfs race"
> > > path in the common case, it doesn't fix any actual problem.
> > > 
> > > I think the real commit you want is Linus' kobject fix
> > > a49b7e82cab0f9b41f483359be83f44fbb6b4979 "kobject: fix kset_find_obj()
> > > race with concurrent last kobject_put()".
> > > 
> > > Or is that already in stable?
> > 
> > Hi Rusty,
> >  
> > I had pointed Ben (offlist) to that bugzilla entry without realizing
> > there were other earlier related fixes in this space.  Re-viewing bz-
> > 58011, it looks like it was opened against 3.8.12, while Ben and myself
> > had encountered module loading problems in versions 3.9 and
> > 3.9.[1-3].  I can update the bugzilla entry to add a comment noting commit
> > a49b7e82 "kobject: fix kset_find_obj() race with concurrent last
> > kobject_put()".
> > 
> > That said, it doesn't appear that commit 944a1fa "module: don't unlink the
> > module until we've removed all exposure" has made it into any stable
> > kernel.  On my system, applying this on top of 3.9 resolved a module
> > unload/load race that would occasionally occur on boot (two video adapters
> > of the same make, the module unloads for whatever reason and I see "module
> > is already loaded" and "sysfs: cannot create duplicate filename
> > '/module/mgag200'" messages in 5-10% of instances.)  I have logs if you
> > are interested in these warnings/crashes.
> > 
> > Hope this clarifies things.
> > 
> > Regards,
> > 
> > -- Joe
> > 
> 


* Re: Please add to stable:  module: don't unlink the module until we've removed all exposure.
  2013-06-03 14:17       ` Joe Lawrence
@ 2013-06-03 15:59         ` Ben Greear
  2013-06-03 16:36           ` Ben Greear
  2013-06-05  5:07         ` Greg KH
  1 sibling, 1 reply; 51+ messages in thread
From: Ben Greear @ 2013-06-03 15:59 UTC (permalink / raw)
  To: Joe Lawrence; +Cc: Rusty Russell, Linux Kernel Mailing List, stable

On 06/03/2013 07:17 AM, Joe Lawrence wrote:

>>> Hi Rusty,
>>>
>>> I had pointed Ben (offlist) to that bugzilla entry without realizing
>>> there were other earlier related fixes in this space.  Re-viewing bz-
>>> 58011, it looks like it was opened against 3.8.12, while Ben and myself
>>> had encountered module loading problems in versions 3.9 and
>>> 3.9.[1-3].  I can update the bugzilla entry to add a comment noting commit
>>> a49b7e82 "kobject: fix kset_find_obj() race with concurrent last
>>> kobject_put()".
>>>
>>> That said, it doesn't appear that commit 944a1fa "module: don't unlink the
>>> module until we've removed all exposure" has made it into any stable
>>> kernel.  On my system, applying this on top of 3.9 resolved a module
>>> unload/load race that would occasionally occur on boot (two video adapters
>>> of the same make, the module unloads for whatever reason and I see "module
>>> is already loaded" and "sysfs: cannot create duplicate filename
>>> '/module/mgag200'" messages in 5-10% of instances.)  I have logs if you
>>> are interested in these warnings/crashes.

It at least works around the problem for me as well.  But, a more rare
migration/[0-3] (I think) related lockup still exists in 3.9.4 for me,
so I will also try applying that other kobject patch and continue testing
today...

Thanks,
Ben


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



* Re: Please add to stable:  module: don't unlink the module until we've removed all exposure.
  2013-06-03 15:59         ` Ben Greear
@ 2013-06-03 16:36           ` Ben Greear
  2013-06-04  4:37             ` Rusty Russell
  2013-06-04  5:56             ` Rusty Russell
  0 siblings, 2 replies; 51+ messages in thread
From: Ben Greear @ 2013-06-03 16:36 UTC (permalink / raw)
  To: Joe Lawrence; +Cc: Rusty Russell, Linux Kernel Mailing List, stable

On 06/03/2013 08:59 AM, Ben Greear wrote:
> On 06/03/2013 07:17 AM, Joe Lawrence wrote:
>
>>>> Hi Rusty,
>>>>
>>>> I had pointed Ben (offlist) to that bugzilla entry without realizing
>>>> there were other earlier related fixes in this space.  Re-viewing bz-
>>>> 58011, it looks like it was opened against 3.8.12, while Ben and myself
>>>> had encountered module loading problems in versions 3.9 and
>>>> 3.9.[1-3].  I can update the bugzilla entry to add a comment noting commit
>>>> a49b7e82 "kobject: fix kset_find_obj() race with concurrent last
>>>> kobject_put()".
>>>>
>>>> That said, it doesn't appear that commit 944a1fa "module: don't unlink the
>>>> module until we've removed all exposure" has made it into any stable
>>>> kernel.  On my system, applying this on top of 3.9 resolved a module
>>>> unload/load race that would occasionally occur on boot (two video adapters
>>>> of the same make, the module unloads for whatever reason and I see "module
>>>> is already loaded" and "sysfs: cannot create duplicate filename
>>>> '/module/mgag200'" messages in 5-10% of instances.)  I have logs if you
>>>> are interested in these warnings/crashes.
>
> It at least works around the problem for me as well.  But, a more rare
> migration/[0-3] (I think) related lockup still exists in 3.9.4 for me,
> so I will also try applying that other kobject patch and continue testing
> today...

Well, that other kobject patch is already in 3.9.4, so I think it's still
a good idea to include the
"module: don't unlink the module until we've removed all exposure."
patch in stable.  I have a decent test case to reproduce the crash, so if someone
wants me to test other patches instead, then I will do so.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



* Re: Please add to stable:  module: don't unlink the module until we've removed all exposure.
  2013-06-03 16:36           ` Ben Greear
@ 2013-06-04  4:37             ` Rusty Russell
  2013-06-04  5:56             ` Rusty Russell
  1 sibling, 0 replies; 51+ messages in thread
From: Rusty Russell @ 2013-06-04  4:37 UTC (permalink / raw)
  To: Ben Greear, Joe Lawrence; +Cc: Linux Kernel Mailing List, stable

Ben Greear <greearb@candelatech.com> writes:
> On 06/03/2013 08:59 AM, Ben Greear wrote:
>> On 06/03/2013 07:17 AM, Joe Lawrence wrote:
>>
>>>>> Hi Rusty,
>>>>>
>>>>> I had pointed Ben (offlist) to that bugzilla entry without realizing
>>>>> there were other earlier related fixes in this space.  Re-viewing bz-
>>>>> 58011, it looks like it was opened against 3.8.12, while Ben and myself
>>>>> had encountered module loading problems in versions 3.9 and
>>>>> 3.9.[1-3].  I can update the bugzilla entry to add a comment noting commit
>>>>> a49b7e82 "kobject: fix kset_find_obj() race with concurrent last
>>>>> kobject_put()".
>>>>>
>>>>> That said, it doesn't appear that commit 944a1fa "module: don't unlink the
>>>>> module until we've removed all exposure" has made it into any stable
>>>>> kernel.  On my system, applying this on top of 3.9 resolved a module
>>>>> unload/load race that would occasionally occur on boot (two video adapters
>>>>> of the same make, the module unloads for whatever reason and I see "module
>>>>> is already loaded" and "sysfs: cannot create duplicate filename
>>>>> '/module/mgag200'" messages in 5-10% of instances.)  I have logs if you
>>>>> are interested in these warnings/crashes.
>>
>> It at least works around the problem for me as well.  But, a more rare
>> migration/[0-3] (I think) related lockup still exists in 3.9.4 for me,
>> so I will also try applying that other kobject patch and continue testing
>> today...
>
> Well, that other kobject patch is already in 3.9.4, so I think it's still
> a good idea to include the
> "module: don't unlink the module until we've removed all exposure."
> patch in stable.  I have a decent test case to reproduce the crash, so if someone
> wants me to test other patches instead, then I will do so.

I understand your eagerness to have this resolved, but we need to
understand the problem.  The fix you asked for in stable was supposed to
be cosmetic, to avoid the sysfs warning.  But it did serve to stress the
cleanup path, which may still have lurking bugs!

I reproduced the oops myself on 3.8.  I will chase it on 3.9.4, too.

Thanks,
Rusty.


* Re: Please add to stable:  module: don't unlink the module until we've removed all exposure.
  2013-06-03 16:36           ` Ben Greear
  2013-06-04  4:37             ` Rusty Russell
@ 2013-06-04  5:56             ` Rusty Russell
  2013-06-04 14:07               ` Joe Lawrence
  1 sibling, 1 reply; 51+ messages in thread
From: Rusty Russell @ 2013-06-04  5:56 UTC (permalink / raw)
  To: Ben Greear, Joe Lawrence; +Cc: Linux Kernel Mailing List, stable

Ben Greear <greearb@candelatech.com> writes:
>> It at least works around the problem for me as well.  But, a more rare
>> migration/[0-3] (I think) related lockup still exists in 3.9.4 for me,
>> so I will also try applying that other kobject patch and continue testing
>> today...
>
> Well, that other kobject patch is already in 3.9.4, so I think it's still
> a good idea to include the
> "module: don't unlink the module until we've removed all exposure."
> patch in stable.  I have a decent test case to reproduce the crash, so if someone
> wants me to test other patches instead, then I will do so.

OK, I cannot reproduce on 3.9.4.  I #if 0'd out the WARNs in sysfs and
kobject, and did this (which reliably broke on 3.8):

# M=`modinfo -F filename e1000`
# for i in `seq 10000`; do insmod $M; rmmod e1000; done >/dev/null 2>&1 &
# for i in `seq 10000`; do insmod $M; rmmod e1000; done >/dev/null 2>&1 &
# for i in `seq 10000`; do insmod $M; rmmod e1000; done >/dev/null 2>&1 &
# for i in `seq 10000`; do insmod $M; rmmod e1000; done >/dev/null 2>&1 &
# 

This was under kvm, 4-way SMP, init=/bin/bash.

Do you have a backtrace of the 3.9.4 crash?  You can add "CFLAGS_module.o
= -O0" to get a clearer backtrace if you want...

Thanks,
Rusty.


* Re: Please add to stable:  module: don't unlink the module until we've removed all exposure.
  2013-06-04  5:56             ` Rusty Russell
@ 2013-06-04 14:07               ` Joe Lawrence
  2013-06-04 16:50                 ` Joe Lawrence
                                   ` (2 more replies)
  0 siblings, 3 replies; 51+ messages in thread
From: Joe Lawrence @ 2013-06-04 14:07 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Ben Greear, Linux Kernel Mailing List, stable

On Tue, 04 Jun 2013 15:26:28 +0930
Rusty Russell <rusty@rustcorp.com.au> wrote:

> Do you have a backtrace of the 3.9.4 crash?  You can add "CFLAGS_module.o
> = -O0" to get a clearer backtrace if you want...

Hi Rusty,

See my 3.9 stack traces below, which may or may not be what Ben had
been seeing.  If you like, I can try a similar loop as the one you were
testing in the other email.  

Regards,

-- Joe

*** First instance ***

------------[ cut here ]------------
WARNING: at fs/sysfs/dir.c:536 sysfs_add_one+0xd4/0x100()
Hardware name: ftServer 6400
sysfs: cannot create duplicate filename '/module/mgag200'
Modules linked in: enclosure(+) mgag200(+) ghash_clmulni_intel(+) pcspkr joydev osst vhost_net st tun macvtap macvlan uinput raid1 mpt2sas(OF) raid_class qla2xxx(OF) scsi_transport_fc scsi_transport_sas sd_mod(OF) usb_storage scsi_tgt scsi_hbas(OF) i2c_algo_bit drm_kms_helper ttm drm i2c_core
Pid: 733, comm: systemd-udevd Tainted: GF          O 3.9.0sra_new+ #1
Call Trace:
 [<ffffffff81061a9f>] warn_slowpath_common+0x7f/0xc0
 [<ffffffff81061b96>] warn_slowpath_fmt+0x46/0x50
 [<ffffffff81319875>] ? strlcat+0x65/0x90
 [<ffffffff81222914>] sysfs_add_one+0xd4/0x100
 [<ffffffff81222b38>] create_dir+0x78/0xd0
 [<ffffffff81222e86>] sysfs_create_dir+0x86/0xe0
 [<ffffffff81313588>] kobject_add_internal+0xa8/0x270
 [<ffffffff81313ab3>] kobject_init_and_add+0x63/0x90
 [<ffffffff810ca34d>] load_module+0x12dd/0x2890
 [<ffffffff81331670>] ? ddebug_proc_open+0xc0/0xc0
 [<ffffffff810cb9ea>] sys_init_module+0xea/0x140
 [<ffffffff81680d19>] system_call_fastpath+0x16/0x1b
---[ end trace 247a5f5f82ef192d ]---
------------[ cut here ]------------
WARNING: at lib/kobject.c:196 kobject_add_internal+0x204/0x270()
Hardware name: ftServer 6400
kobject_add_internal failed for mgag200 with -EEXIST, don't try to register things with the same name in the same directory.
Modules linked in: mdio(+) coretemp(+) crc32c_intel(+) dca(+) enclosure(+) mgag200(+) ghash_clmulni_intel pcspkr joydev osst vhost_net st tun macvtap macvlan uinput raid1 mpt2sas(OF) raid_class qla2xxx(OF) scsi_transport_fc scsi_transport_sas sd_mod(OF) usb_storage scsi_tgt scsi_hbas(OF) i2c_algo_bit drm_kms_helper ttm drm i2c_core
Pid: 733, comm: systemd-udevd Tainted: GF       W  O 3.9.0sra_new+ #1

Call Trace:
 [<ffffffff81061a9f>] warn_slowpath_common+0x7f/0xc0
 [<ffffffff81061b96>] warn_slowpath_fmt+0x46/0x50
 [<ffffffff813136e4>] kobject_add_internal+0x204/0x270
 [<ffffffff81313ab3>] kobject_init_and_add+0x63/0x90
 [<ffffffff810ca34d>] load_module+0x12dd/0x2890
 [<ffffffff81331670>] ? ddebug_proc_open+0xc0/0xc0
 [<ffffffff810cb9ea>] sys_init_module+0xea/0x140
 [<ffffffff81680d19>] system_call_fastpath+0x16/0x1b
---[ end trace 247a5f5f82ef192e ]---


*** Second instance ***

mgag200: module is already loaded
igb: Intel(R) Gigabit Ethernet Network Driver - version 4.1.2-k
BUG: unable to handle kernel paging request at ffffffffa01d060c
IP: [<ffffffff81313276>] kobject_del+0x16/0x40
PGD 1c0f067 PUD 1c10063 PMD 851372067 PTE 0
Oops: 0002 [#1] SMP 
Modules linked in: ixgbe(OF+) igb(OF+) mgag200(+) ptp pps_core mdio dca coretemp crc32c_intel pcspkr ghash_clmulni_intel vhost_net tun macvtap macvlan uinput raid1 usb_storage mpt2sas(OF) raid_class qla2xxx(OF) scsi_transport_fc scsi_transport_sas sd_mod(OF) scsi_tgt scsi_hbas(OF) i2c_algo_bit drm_kms_helper ttm drm i2c_core
CPU 28 
Pid: 719, comm: systemd-udevd Tainted: GF          O 3.9.0sra_new+ #1 Stratus ftServer 6400/G7LAZ
RIP: 0010:[<ffffffff81313276>]  [<ffffffff81313276>] kobject_del+0x16/0x40
RSP: 0018:ffff88103814fd08  EFLAGS: 00010292
RAX: 0000000000000200 RBX: ffffffffa01d05d0 RCX: 0000000100250004
RDX: ffff88103814ffd8 RSI: 0000000000250004 RDI: 0000000000000246
RBP: ffff88103814fd18 R08: ffff88103814fa80 R09: 0000000000000000
R10: ffff88085f821d40 R11: 0000000000000025 R12: ffffffff81c412c0
R13: ffff880852c8cfc0 R14: ffffffffa01e0580 R15: ffffffffa01e0598
FS:  00007fc98fe6c840(0000) GS:ffff88107fd80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffa01d060c CR3: 0000001038137000 CR4: 00000000000407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process systemd-udevd (pid: 719, threadinfo ffff88103814e000, task ffff8810380d98a0)
Stack:
 ffff88103814fd18 ffffffffa01d05d0 ffff88103814fd48 ffffffff81313302
 ffff88103814fd78 ffffffffa01d05d0 ffffffffa01d05d0 ffffffffffffffea
 ffff88103814fd68 ffffffff8131348b 00000000ffff8000 ffff88103814fee8
Call Trace:
 [<ffffffff81313302>] kobject_cleanup+0x62/0x1b0
 [<ffffffff8131348b>] kobject_put+0x2b/0x60
 [<ffffffff810cb8f1>] load_module+0x2881/0x2890
 [<ffffffff81331670>] ? ddebug_proc_open+0xc0/0xc0
 [<ffffffff810cb9ea>] sys_init_module+0xea/0x140
 [<ffffffff81680d19>] system_call_fastpath+0x16/0x1b
Code: 02 00 00 48 8b 5d f0 4c 8b 65 f8 c9 c3 0f 1f 84 00 00 00 00 00 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 85 ff 74 22 e8 7a fc f0 ff <80> 63 3c fd 48 89 df e8 6e ff ff ff 48 8b 7b 18 e8 d5 01 00 00 
RIP  [<ffffffff81313276>] kobject_del+0x16/0x40
 RSP <ffff88103814fd08>
CR2: ffffffffa01d060c
---[ end trace e320c2319820c81a ]---
Kernel panic - not syncing: Fatal exception




* Re: Please add to stable:  module: don't unlink the module until we've removed all exposure.
  2013-06-04 14:07               ` Joe Lawrence
@ 2013-06-04 16:50                 ` Joe Lawrence
  2013-06-04 16:53                 ` Ben Greear
  2013-06-05  3:29                 ` Please add to stable: module: don't unlink the module until we've removed all exposure Rusty Russell
  2 siblings, 0 replies; 51+ messages in thread
From: Joe Lawrence @ 2013-06-04 16:50 UTC (permalink / raw)
  To: Joe Lawrence; +Cc: Rusty Russell, Ben Greear, Linux Kernel Mailing List, stable

On Tue, 4 Jun 2013, Joe Lawrence wrote:

> Hi Rusty,
> 
> See my 3.9 stack traces below, which may or may not be what Ben had
> been seeing.  If you like, I can try a similar loop as the one you were
> testing in the other email.  

I tried a modified version of your module load/unload loop (only insmod 
is needed, as the module initialization routine returns -EINVAL to mimic 
mgag200 with an incorrect modeset value).  It crashed right out of the 
chute on 3.9.4, but is still running OK with 3.9 + commit 944a1fa "module: 
don't unlink the module until we've removed all exposure".

-- Joe

test_mod.c :

#include <linux/module.h>
#include <linux/delay.h>

MODULE_LICENSE("GPL");

/* Always fail init so that every insmod exercises the loader's failure path. */
static int test_mod_init(void) { return -EINVAL; }
static void test_mod_exit(void) {}

module_init(test_mod_init);
module_exit(test_mod_exit);
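
The insmod-only loop itself was not posted. A minimal sketch of what such a loop could look like is given here, assuming test_mod.c above has been built as test_mod.ko. The function name stress_insmod, the INSMOD override and the loop and worker counts are illustrative assumptions, not details from this thread.

```shell
# stress_insmod MODULE_PATH LOOPS WORKERS
# Hypothetical helper: start WORKERS background loops, each calling
# insmod on MODULE_PATH LOOPS times, then wait for all of them.
# No rmmod is needed when the module's init routine always fails.
# INSMOD may be set to another command for a dry run (default: insmod).
stress_insmod() {
    mod=$1
    loops=$2
    workers=$3
    w=0
    while [ "$w" -lt "$workers" ]; do
        (
            i=0
            while [ "$i" -lt "$loops" ]; do
                # Errors are expected here (init returns -EINVAL).
                ${INSMOD:-insmod} "$mod" >/dev/null 2>&1
                i=$((i + 1))
            done
        ) &
        w=$((w + 1))
    done
    wait
}
```

Run as root, something like "stress_insmod ./test_mod.ko 10000 4" would mirror the four parallel loops from Rusty's test, with concurrent loaders racing on the same module name.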


from the console log :

test_mod: module verification failed: signature and/or required key missing - tainting kernel
------------[ cut here ]------------
WARNING: at fs/sysfs/dir.c:536 sysfs_add_one+0xd4/0x100()
Hardware name: ftServer 6400
sysfs: cannot create duplicate filename '/module/test_mod'
Modules linked in: test_mod(OF+) ebtable_nat nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 bonding xt_conntrack nf_conntrack ib_iser rdma_cm ebtable_filter ib_addr ebtables iw_cm ib_cm ib_sa ib_mad ip6table_filter ib_core ip6_tables iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi dm_multipath coretemp crc32c_intel ghash_clmulni_intel pcspkr ixgbe joydev mdio igb ptp pps_core dca vhost_net tun macvtap macvlan uinput raid1 sd_mod i2c_algo_bit drm_kms_helper ttm drm usb_storage mpt2sas raid_class scsi_transport_sas i2c_core
Pid: 8466, comm: insmod Tainted: GF          O 3.9.4 #1
Call Trace:
 [<ffffffff8106159f>] warn_slowpath_common+0x7f/0xc0
 [<ffffffff81061696>] warn_slowpath_fmt+0x46/0x50
 [<ffffffff81319895>] ? strlcat+0x65/0x90
 [<ffffffff81222784>] sysfs_add_one+0xd4/0x100
 [<ffffffff812229a8>] create_dir+0x78/0xd0
 [<ffffffff81222cf6>] sysfs_create_dir+0x86/0xe0
 [<ffffffff813135a8>] kobject_add_internal+0xa8/0x270
 [<ffffffff81313ad3>] kobject_init_and_add+0x63/0x90
 [<ffffffff810c9f9d>] load_module+0x12dd/0x2890
 [<ffffffff81331690>] ? ddebug_proc_open+0xc0/0xc0
 [<ffffffff810cb63a>] sys_init_module+0xea/0x140
 [<ffffffff81681119>] system_call_fastpath+0x16/0x1b
---[ end trace 54bd469258bec620 ]---
------------[ cut here ]------------
WARNING: at lib/kobject.c:196 kobject_add_internal+0x204/0x270()
Hardware name: ftServer 6400
kobject_add_internal failed for test_mod with -EEXIST, don't try to register things with the same name in the same directory.
Modules linked in: test_mod(OF+) ebtable_nat nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 bonding xt_conntrack nf_conntrack ib_iser rdma_cm ebtable_filter ib_addr ebtables iw_cm ib_cm ib_sa ib_mad ip6table_filter ib_core ip6_tables iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi dm_multipath coretemp crc32c_intel ghash_clmulni_intel pcspkr ixgbe joydev mdio igb ptp pps_core dca vhost_net tun macvtap macvlan uinput raid1 sd_mod i2c_algo_bit drm_kms_helper ttm drm usb_storage mpt2sas raid_class scsi_transport_sas i2c_core
Pid: 8466, comm: insmod Tainted: GF       W  O 3.9.4 #1
Call Trace:
 [<ffffffff8106159f>] warn_slowpath_common+0x7f/0xc0
 [<ffffffff81061696>] warn_slowpath_fmt+0x46/0x50
 [<ffffffff81313704>] kobject_add_internal+0x204/0x270
 [<ffffffff81313ad3>] kobject_init_and_add+0x63/0x90
 [<ffffffff810c9f9d>] load_module+0x12dd/0x2890
 [<ffffffff81331690>] ? ddebug_proc_open+0xc0/0xc0
 [<ffffffff810cb63a>] sys_init_module+0xea/0x140
 [<ffffffff81681119>] system_call_fastpath+0x16/0x1b
---[ end trace 54bd469258bec621 ]---
test_mod: module is already loaded
test_mod: module is already loaded
BUG: unable to handle kernel paging request at ffffffffa02ed08c
IP: [<ffffffff81313491>] kobject_put+0x11/0x60
PGD 1c0f067 PUD 1c10063 PMD 84dd68067 PTE 0
Oops: 0000 [#1] SMP 
Modules linked in: test_mod(OF+) ebtable_nat nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 bonding xt_conntrack nf_conntrack ib_iser rdma_cm ebtable_filter ib_addr ebtables iw_cm ib_cm ib_sa ib_mad ip6table_filter ib_core ip6_tables iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi dm_multipath coretemp crc32c_intel ghash_clmulni_intel pcspkr ixgbe joydev mdio igb ptp pps_core dca vhost_net tun macvtap macvlan uinput raid1 sd_mod i2c_algo_bit drm_kms_helper ttm drm usb_storage mpt2sas raid_class scsi_transport_sas i2c_core
CPU 25 
Pid: 8551, comm: insmod Tainted: GF       W  O 3.9.4 #1 Stratus ftServer 6400/G7LAZ
RIP: 0010:[<ffffffff81313491>]  [<ffffffff81313491>] kobject_put+0x11/0x60
RSP: 0018:ffff881050b95d58  EFLAGS: 00010286
RAX: 0000000000000022 RBX: ffffffffa02ed050 RCX: ffff88107fd2fba8
RDX: 0000000000000000 RSI: ffff88107fd2df58 RDI: ffffffffa02ed050
RBP: ffff881050b95d68 R08: ffffffff81ce2080 R09: 00000000000007c6
R10: 0000000000000000 R11: 00000000000007c5 R12: ffffffffa02ed050
R13: ffffffffffffffea R14: ffffffffa035c000 R15: ffffffffa035c018
FS:  00007fd0768a3740(0000) GS:ffff88107fd20000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffa02ed08c CR3: 0000001050bd7000 CR4: 00000000000407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process insmod (pid: 8551, threadinfo ffff881050b94000, task ffff88105169c9e0)
Stack:
 00000000ffff8000 ffff881050b95ee8 ffff881050b95ed8 ffffffff810cb541
 ffffffff81331690 ffffc90017037fff ffffc90017038000 ffffffff00000002
 ffffc900170220e0 ffffc90000000003 ffffffffa02c1270 00000000000002a0
Call Trace:
 [<ffffffff810cb541>] load_module+0x2881/0x2890
 [<ffffffff81331690>] ? ddebug_proc_open+0xc0/0xc0
 [<ffffffff810cb63a>] sys_init_module+0xea/0x140
 [<ffffffff81681119>] system_call_fastpath+0x16/0x1b
Code: 01 00 e9 10 ff ff ff 0f 1f 00 55 48 83 ef 38 48 89 e5 e8 43 fe ff ff 5d c3 90 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 85 ff 74 1a <f6> 47 3c 01 74 21 f0 83 6b 38 01 0f 94 c0 84 c0 74 08 48 89 df 
RIP  [<ffffffff81313491>] kobject_put+0x11/0x60
 RSP <ffff881050b95d58>
CR2: ffffffffa02ed08c
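
One detail of the two oopses can be checked by hand. In each, the faulting address (CR2) is 0x3c bytes past the pointer the faulting instruction appears to dereference: RBX in the 3.9 kobject_del() oops from the earlier message, RDI in the 3.9.4 kobject_put() oops above. Both pointers lie in the x86_64 module mapping range (ffffffffa0000000 and up). That is consistent with the kobject code touching a kobject in module memory that is no longer mapped, though the traces alone do not prove it. The helper name fault_offset is hypothetical and used only for this illustration.

```shell
# fault_offset BASE FAULT: print FAULT - BASE in hex.
# Both arguments are 64-bit kernel addresses written as 16 hex digits
# without a 0x prefix. Only the low 32 bits are subtracted so the
# arithmetic stays within what any POSIX shell can represent; this
# assumes both addresses share the same upper 32 bits.
fault_offset() {
    base_lo=${1#????????}     # drop the first 8 hex digits
    fault_lo=${2#????????}
    printf '0x%x\n' $(( 0x$fault_lo - 0x$base_lo ))
}

# 3.9 oops:   RBX = ffffffffa01d05d0, CR2 = ffffffffa01d060c
fault_offset ffffffffa01d05d0 ffffffffa01d060c   # prints 0x3c
# 3.9.4 oops: RDI = ffffffffa02ed050, CR2 = ffffffffa02ed08c
fault_offset ffffffffa02ed050 ffffffffa02ed08c   # prints 0x3c
```

The same 0x3c displacement shows up in the marked bytes of both Code: lines (<80> 63 3c fd and <f6> 47 3c 01).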


* Re: Please add to stable:  module: don't unlink the module until we've removed all exposure.
  2013-06-04 14:07               ` Joe Lawrence
  2013-06-04 16:50                 ` Joe Lawrence
@ 2013-06-04 16:53                 ` Ben Greear
  2013-06-04 17:45                   ` Ben Greear
  2013-06-05  3:29                 ` Please add to stable: module: don't unlink the module until we've removed all exposure Rusty Russell
  2 siblings, 1 reply; 51+ messages in thread
From: Ben Greear @ 2013-06-04 16:53 UTC (permalink / raw)
  To: Joe Lawrence; +Cc: Rusty Russell, Linux Kernel Mailing List, stable

On 06/04/2013 07:07 AM, Joe Lawrence wrote:
> On Tue, 04 Jun 2013 15:26:28 +0930
> Rusty Russell <rusty@rustcorp.com.au> wrote:
>
>> Do you have a backtrace of the 3.9.4 crash?  You can add "CFLAGS_module.o
>> = -O0" to get a clearer backtrace if you want...
>
> Hi Rusty,
>
> See my 3.9 stack traces below, which may or may not be what Ben had
> been seeing.  If you like, I can try a similar loop as the one you were
> testing in the other email.

My stack traces are similar.  I had better luck reproducing the problem
once I enabled lots of debugging (slub memory poisoning, lockdep,
object debugging, etc).

I'm using Fedora 17 on a 2-core Core i7 (4 CPU threads total) for most of this
testing.  We reproduced on a dual-core Atom system as well
(32-bit Fedora 14 and Fedora 17).  Relatively standard hardware as far
as I know.

I'll run the insmod/rmmod stress test on my patched systems
and see if I can reproduce with the patch in the title applied.

Rusty:  I'm also seeing lockups related to migration on stock 3.9.4+
(with and without the 'don't unlink the module...' patch).  They are much
harder to reproduce.  But that code appears to be mostly called during
module load/unload, so it's possible it is related.  The first
traces are from a system with local patches applied, but a later
post by me has traces from a clean upstream kernel.

Further debugging showed that this could be a race, because it seems
that all migration/ threads think they are done with their state machine,
but the atomic thread counter sits at 1, so no progress is ever made.

http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg443471.html
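
To make the suspected hang concrete, here is a minimal userspace sketch of
the ack-counter handshake as I understand it.  It is not the kernel code:
the names (sketch_smdata, sketch_set_state, sketch_ack_state) and the state
list are made up for illustration, and it assumes that the last thread to
ack a state is the one that advances it.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical states, loosely modeled on the stop_machine state machine. */
enum sketch_state { SK_NONE, SK_PREPARE, SK_DISABLE_IRQ, SK_RUN, SK_EXIT };

/* Hypothetical shared data: one instance is seen by every stopper thread. */
struct sketch_smdata {
	int num_threads;
	atomic_int thread_ack;   /* threads that still have to ack this state */
	enum sketch_state state;
};

/* Entering a state re-arms the ack counter for all threads. */
static void sketch_set_state(struct sketch_smdata *d, enum sketch_state s)
{
	atomic_store(&d->thread_ack, d->num_threads);
	d->state = s;
}

/* Each thread acks once per state; only the last ack advances the state. */
static void sketch_ack_state(struct sketch_smdata *d)
{
	assert(atomic_load(&d->thread_ack) > 0);
	if (atomic_fetch_sub(&d->thread_ack, 1) == 1)
		sketch_set_state(d, (enum sketch_state)(d->state + 1));
}
```

With num_threads set to 4, three acks leave thread_ack at 1 and the state
unchanged; nothing moves until the fourth ack arrives.  That is the
condition the traces seem to show, assuming the sketch matches the real
handshake.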

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Please add to stable:  module: don't unlink the module until we've removed all exposure.
  2013-06-04 16:53                 ` Ben Greear
@ 2013-06-04 17:45                   ` Ben Greear
  2013-06-05  4:17                     ` Rusty Russell
  0 siblings, 1 reply; 51+ messages in thread
From: Ben Greear @ 2013-06-04 17:45 UTC (permalink / raw)
  To: Joe Lawrence; +Cc: Rusty Russell, Linux Kernel Mailing List, stable

On 06/04/2013 09:53 AM, Ben Greear wrote:
> On 06/04/2013 07:07 AM, Joe Lawrence wrote:
>> On Tue, 04 Jun 2013 15:26:28 +0930
>> Rusty Russell <rusty@rustcorp.com.au> wrote:
>>
>>> Do you have a backtrace of the 3.9.4 crash?  You can add "CFLAGS_module.o
>>> = -O0" to get a clearer backtrace if you want...
>>
>> Hi Rusty,
>>
>> See my 3.9 stack traces below, which may or may not be what Ben had
>> been seeing.  If you like, I can try a similar loop as the one you were
>> testing in the other email.
>
> My stack traces are similar.  I had better luck reproducing the problem
> once I enabled lots of debugging (slub memory poisoning, lockdep,
> object debugging, etc).
>
> I'm using Fedora 17 on 2-core core-i7 (4 CPU threads total) for most of this
> testing.  We reproduced on dual-core Atom system as well
> (32-bit Fedora 14 and Fedora 17).  Relatively standard hardware as far
> as I know.
>
> I'll run the insmod/rmmod stress test on my patched systems
> and see if I can reproduce with the patch in the title applied.
>
> Rusty:  I'm also seeing lockups related to migration on stock 3.9.4+
> (with and without the 'don't unlink the module...' patch.  Much harder
> to reproduce.  But, that code appears to be mostly called during
> module load/unload, so it's possible it is related.  The first
> traces are from a system with local patches, applied, but a later
> post by me has traces from clean upstream kernel.
>
> Further debugging showed that this could be a race, because it seems
> that all migration/ threads think they are done with their state machine,
> but the atomic thread counter sits at 1, so no progress is ever made.
>
> http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg443471.html

I reproduced the migration deadlock after a while by loading and unloading
the macvlan module with this command:

for i in `seq 10000`; do modprobe macvlan; rmmod macvlan; done

I did not see the kobj crash, but this kernel was running your patch
that makes the problem go away for me, for whatever reason.

I have some printk debugging in (see bottom of email) and was using a serial
console, so things were probably running a bit slower than on most systems.
Here is the trace from my kernel with local patches and not so much debugging
enabled (this is NOT a clean upstream kernel, though I reproduced the same thing
with a clean upstream 3.9.4 kernel plus your module unlink patch yesterday).

__stop_machine, num-threads: 4, fn: __try_stop_module  data: ffff8801c6ae7f28
cpu: 0 loops: 1 jiffies: 4299011449  timeout: 4299011448 curstate: 0  smdata->state: 1  thread_ack: 4
cpu: 1 loops: 1 jiffies: 4299011449  timeout: 4299011448 curstate: 0  smdata->state: 1  thread_ack: 4
cpu: 2 loops: 1 jiffies: 4299011449  timeout: 4299011448 curstate: 0  smdata->state: 1  thread_ack: 3
cpu: 3 loops: 1 jiffies: 4299011449  timeout: 4299011448 curstate: 0  smdata->state: 1  thread_ack: 2
__stop_machine, num-threads: 4, fn: __unlink_module  data: ffffffffa0aeeab0
cpu: 0 loops: 1 jiffies: 4299011501  timeout: 4299011500 curstate: 0  smdata->state: 1  thread_ack: 4
cpu: 1 loops: 1 jiffies: 4299011501  timeout: 4299011500 curstate: 0  smdata->state: 1  thread_ack: 4
cpu: 3 loops: 1 jiffies: 4299011501  timeout: 4299011500 curstate: 0  smdata->state: 1  thread_ack: 3
ath: wiphy0: Failed to stop TX DMA, queues=0x005!
cpu: 2 loops: 1 jiffies: 4299011501  timeout: 4299011500 curstate: 0  smdata->state: 1  thread_ack: 2
BUG: soft lockup - CPU#3 stuck for 23s! [migration/3:29]
Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_po]
CPU 3
Pid: 29, comm: migration/3 Tainted: G         C O 3.9.4+ #60 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.
RIP: 0010:[<ffffffff8109d69c>]  [<ffffffff8109d69c>] tasklet_action+0x46/0xcc
RSP: 0000:ffff88022bd83ed8  EFLAGS: 00000282
RAX: ffff88022bd8e080 RBX: ffff8802222a4000 RCX: ffffffff81a90f06
RDX: ffff88021f48afa8 RSI: 0000000000000000 RDI: ffffffff81a050b0
RBP: ffff88022bd83ee8 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000000005f2 R11: 00000000fd010018 R12: ffff88022bd83e48
R13: ffffffff815d145d R14: ffff88022bd83ee8 R15: ffff880222282000
FS:  0000000000000000(0000) GS:ffff88022bd80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fa8a06d5000 CR3: 00000002007ea000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process migration/3 (pid: 29, threadinfo ffff880222282000, task ffff880222279770)
Stack:
  ffffffff81a050b0 ffff880222282000 ffff88022bd83f78 ffffffff8109db1f
  ffff88022bd83f08 ffff880222282010 ffff880222283fd8 04208040810b79ef
  00000001003db5d8 000000032bd8e150 ffff880222282000 0000000000000030
Call Trace:
  <IRQ>
  [<ffffffff8109db1f>] __do_softirq+0x107/0x23c
  [<ffffffff8109dce6>] irq_exit+0x4b/0xa8
  [<ffffffff815d257f>] smp_apic_timer_interrupt+0x8b/0x99
  [<ffffffff815d145d>] apic_timer_interrupt+0x6d/0x80
  <EOI>
  [<ffffffff810f9984>] ? stop_machine_cpu_stop+0xc7/0x145
  [<ffffffff810f9974>] ? stop_machine_cpu_stop+0xb7/0x145
  [<ffffffff81016445>] ? sched_clock+0x9/0xd
  [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30
  [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176
  [<ffffffff815c9065>] ? __schedule+0x59f/0x5e7
  [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f
  [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11
  [<ffffffff810b4a09>] kthread+0xb5/0xbd
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
  [<ffffffff815d07ac>] ret_from_fork+0x7c/0xb0
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
Code: 48 8b 14 25 80 e0 00 00 65 48 c7 04 25 80 e0 00 00 00 00 00 00 65 48 03 04 25 d8 da 00 00 65 48 89 04 25 88 e0 00 00 fb
------------[ cut here ]------------
WARNING: at /home/greearb/git/linux-3.9.dev.y/kernel/watchdog.c:245 watchdog_overflow_callback+0x9b/0xa6()
Hardware name: To be filled by O.E.M.
Watchdog detected hard LOCKUP on cpu 1
Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_po]
Pid: 17, comm: migration/1 Tainted: G         C O 3.9.4+ #60
Call Trace:
  <NMI>  [<ffffffff810963e1>] warn_slowpath_common+0x85/0x9f
  [<ffffffff8109649e>] warn_slowpath_fmt+0x46/0x48
  [<ffffffff810c4dfb>] ? sched_clock_cpu+0x44/0xce
  [<ffffffff81103b70>] watchdog_overflow_callback+0x9b/0xa6
  [<ffffffff8113354f>] __perf_event_overflow+0x137/0x1cb
  [<ffffffff8101db3f>] ? x86_perf_event_set_period+0x107/0x113
  [<ffffffff81133a9e>] perf_event_overflow+0x14/0x16
  [<ffffffff810230dc>] intel_pmu_handle_irq+0x2b0/0x32d
  [<ffffffff815cbc51>] perf_event_nmi_handler+0x19/0x1b
  [<ffffffff815cb4ca>] nmi_handle+0x55/0x7e
  [<ffffffff815cb59b>] do_nmi+0xa8/0x2db
  [<ffffffff815cac31>] end_repeat_nmi+0x1e/0x2e
  [<ffffffff810f9942>] ? stop_machine_cpu_stop+0x85/0x145
  [<ffffffff810f9942>] ? stop_machine_cpu_stop+0x85/0x145
  [<ffffffff810f9942>] ? stop_machine_cpu_stop+0x85/0x145
  <<EOE>>  [<ffffffff810c6b1c>] ? set_next_entity+0x28/0x7e
  [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30
  [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176
  [<ffffffff815c9065>] ? __schedule+0x59f/0x5e7
  [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f
  [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11
  [<ffffffff810b4a09>] kthread+0xb5/0xbd
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
  [<ffffffff815d07ac>] ret_from_fork+0x7c/0xb0
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
---[ end trace dcd772d6fdf499cf ]---

...


SysRq : Show backtrace of all active CPUs
sending NMI to all CPUs:
NMI backtrace for cpu 2
CPU 2
Pid: 23, comm: migration/2 Tainted: G        WC O 3.9.4+ #60 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.
RIP: 0010:[<ffffffff810f9987>]  [<ffffffff810f9987>] stop_machine_cpu_stop+0xca/0x145
RSP: 0018:ffff880222219cf8  EFLAGS: 00000012
RAX: 00000001003db5d7 RBX: ffff8801c6ae7e18 RCX: 0000000000000097
RDX: 0000000000000002 RSI: 0000000000000006 RDI: 0000000000000246
RBP: ffff880222219d68 R08: 00000001003dc935 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801c6ae7e3c
R13: 0000000000000000 R14: 0000000000000002 R15: 0000000000000002
FS:  0000000000000000(0000) GS:ffff88022bd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000027e2178 CR3: 000000021955d000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process migration/2 (pid: 23, threadinfo ffff880222218000, task ffff880222210000)
Stack:
  ffff880200000001 ffff880200000002 ffff880222219d28 ffffffff810c6b1c
  ffff880222219d88 00ffffff8100f8a4 000000051058262e 0000000000000292
  ffff880222219d58 ffff88022bd0e400 ffff8801c6ae7d08 ffff880222218000
Call Trace:
  [<ffffffff810c6b1c>] ? set_next_entity+0x28/0x7e
  [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30
  [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176
  [<ffffffff815c9065>] ? __schedule+0x59f/0x5e7
  [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f
  [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11
  [<ffffffff810b4a09>] kthread+0xb5/0xbd
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
  [<ffffffff815d07ac>] ret_from_fork+0x7c/0xb0
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
Code: 48 c7 c7 24 fa 80 81 89 44 24 08 8b 43 20 89 04 24 31 c0 e8 7a e1 4c 00 4c 8b 05 85 c6 9e 00 49 81 c0 88 13 00 00 f3 90
NMI backtrace for cpu 0
CPU 0
Pid: 8, comm: migration/0 Tainted: G        WC O 3.9.4+ #60 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E..
RIP: 0010:[<ffffffff810f9984>]  [<ffffffff810f9984>] stop_machine_cpu_stop+0xc7/0x145
RSP: 0000:ffff880222145cf8  EFLAGS: 00000006
RAX: 00000001003db5d7 RBX: ffff8801c6ae7e18 RCX: 0000000000000004
RDX: 0000000000000002 RSI: ffff88022bc0de68 RDI: 0000000000000246
RBP: ffff880222145d68 R08: 00000001003dc95f R09: 0000000000000001
R10: ffff880222145bf8 R11: 0000000000000000 R12: ffff8801c6ae7e3c
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000002
FS:  0000000000000000(0000) GS:ffff88022bc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000006fb9a8 CR3: 000000021af43000 CR4: 00000000000007f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process migration/0 (pid: 8, threadinfo ffff880222144000, task ffff88022213aee0)
Stack:
  ffff880200000001 ffff880200000004 ffff880222145d28 ffffffff810c6b1c
  ffff880222145d88 01ffffff8100f8a4 000000051023ce2c 0000000000000292
  ffff880222145d98 ffff88022bc0e400 ffff8801c6ae7d08 ffff880222144000
Call Trace:
  [<ffffffff810c6b1c>] ? set_next_entity+0x28/0x7e
  [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30
  [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176
  [<ffffffff815c9065>] ? __schedule+0x59f/0x5e7
  [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f
  [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11
  [<ffffffff810b4a09>] kthread+0xb5/0xbd
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
  [<ffffffff815d07ac>] ret_from_fork+0x7c/0xb0
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
Code: 44 89 f6 48 c7 c7 24 fa 80 81 89 44 24 08 8b 43 20 89 04 24 31 c0 e8 7a e1 4c 00 4c 8b 05 85 c6 9e 00 49 81 c0 88 13 00
NMI backtrace for cpu 1
CPU 1
Pid: 17, comm: migration/1 Tainted: G        WC O 3.9.4+ #60 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.
RIP: 0010:[<ffffffff810f9984>]  [<ffffffff810f9984>] stop_machine_cpu_stop+0xc7/0x145
RSP: 0000:ffff88022217dcf8  EFLAGS: 00000012
RAX: 00000001003db5d7 RBX: ffff8801c6ae7e18 RCX: 0000000000000094
RDX: 0000000000000002 RSI: 0000000000000006 RDI: 0000000000000246
RBP: ffff88022217dd68 R08: 00000001003dc935 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801c6ae7e3c
R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000002
FS:  0000000000000000(0000) GS:ffff88022bc80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fbe801fd000 CR3: 00000001be9bb000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process migration/1 (pid: 17, threadinfo ffff88022217c000, task ffff88022216ddc0)
Stack:
  ffff880200000001 ffff880200000004 ffff88022217dd28 ffffffff810c6b1c
  ffff88022217dd88 00ffffff8100f8a4 00000004c3cbfc0c 0000000000000292
  ffff88022217dd58 ffff88022bc8e400 ffff8801c6ae7d08 ffff88022217c000
Call Trace:
  [<ffffffff810c6b1c>] ? set_next_entity+0x28/0x7e
  [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30
  [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176
  [<ffffffff815c9065>] ? __schedule+0x59f/0x5e7
  [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f
  [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11
  [<ffffffff810b4a09>] kthread+0xb5/0xbd
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
  [<ffffffff815d07ac>] ret_from_fork+0x7c/0xb0
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
Code: 44 89 f6 48 c7 c7 24 fa 80 81 89 44 24 08 8b 43 20 89 04 24 31 c0 e8 7a e1 4c 00 4c 8b 05 85 c6 9e 00 49 81 c0 88 13 00
NMI backtrace for cpu 3
CPU 3
Pid: 29, comm: migration/3 Tainted: G        WC O 3.9.4+ #60 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.
RIP: 0010:[<ffffffff812f6975>]  [<ffffffff812f6975>] delay_tsc+0x83/0xee
RSP: 0000:ffff88022bd83b60  EFLAGS: 00000046
RAX: 00000b04f0fdf718 RBX: ffff880222282000 RCX: ffff880222282010
RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000289672
RBP: ffff88022bd83bb0 R08: 0000000000000040 R09: 0000000000000001
R10: ffff88022bd83ad0 R11: 0000000000000000 R12: 00000000f0fdf6e8
R13: 0000000000000003 R14: ffff880222282000 R15: 0000000000000001
FS:  0000000000000000(0000) GS:ffff88022bd80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fa8a06d5000 CR3: 00000002007ea000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process migration/3 (pid: 29, threadinfo ffff880222282000, task ffff880222279770)
Stack:
  000000000000000f 0000000000000002 ffff880222282010 0028967200000082
  ffff88022bd83b90 0000000000000001 000000000000006c 0000000000000007
  0000000000000086 0000000000000001 ffff88022bd83bc0 ffffffff812f68c5
Call Trace:
  <IRQ>
  [<ffffffff812f68c5>] __const_udelay+0x28/0x2a
  [<ffffffff8102ff25>] arch_trigger_all_cpu_backtrace+0x66/0x7d
  [<ffffffff813ae154>] sysrq_handle_showallcpus+0xe/0x10
  [<ffffffff813ae46b>] __handle_sysrq+0xbf/0x15b
  [<ffffffff813ae80e>] handle_sysrq+0x2c/0x2e
  [<ffffffff813c25a2>] serial8250_rx_chars+0x13c/0x1b9
  [<ffffffff813c2691>] serial8250_handle_irq+0x72/0xa8
  [<ffffffff813c2752>] serial8250_default_handle_irq+0x23/0x28
  [<ffffffff813c148c>] serial8250_interrupt+0x4d/0xc6
  [<ffffffff811046b8>] handle_irq_event_percpu+0x7a/0x1e5
  [<ffffffff81104864>] handle_irq_event+0x41/0x61
  [<ffffffff81107028>] handle_edge_irq+0xa6/0xcb
  [<ffffffff81011d9f>] handle_irq+0x24/0x2d
  [<ffffffff815d248d>] do_IRQ+0x4d/0xb4
  [<ffffffff815ca5ad>] common_interrupt+0x6d/0x6d
  [<ffffffff810a47ee>] ? run_timer_softirq+0x24/0x1df
  [<ffffffff8109dabd>] ? __do_softirq+0xa5/0x23c
  [<ffffffff8109db8a>] ? __do_softirq+0x172/0x23c
  [<ffffffff8109dce6>] irq_exit+0x4b/0xa8
  [<ffffffff815d257f>] smp_apic_timer_interrupt+0x8b/0x99
  [<ffffffff815d145d>] apic_timer_interrupt+0x6d/0x80
  <EOI>
  [<ffffffff810f9984>] ? stop_machine_cpu_stop+0xc7/0x145
  [<ffffffff810f9974>] ? stop_machine_cpu_stop+0xb7/0x145
  [<ffffffff81016445>] ? sched_clock+0x9/0xd
  [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30
  [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176
  [<ffffffff815c9065>] ? __schedule+0x59f/0x5e7
  [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f
  [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11
  [<ffffffff810b4a09>] kthread+0xb5/0xbd
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
  [<ffffffff815d07ac>] ret_from_fork+0x7c/0xb0
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
Code: fa d1 ff 66 90 89 c2 44 29 e2 3b 55 cc 73 4a ff 4b 1c 48 8b 4d c0 48 8b 11 80 e2 08 74 0b 89 45 b8 e8 08 28 2d 00 8b 45


commit 5b783a651bb9941923cd8dbff6d0ce80d2617f97
Author: Ben Greear <greearb@candelatech.com>
Date:   Mon Jun 3 13:31:09 2013 -0700

     debugging:  Instrument the 'migration/[0-x]' process a bit.

     Hoping to figure out the deadlock we see occasionally due
     to migration processes all spinning in a busy loop.

     Signed-off-by: Ben Greear <greearb@candelatech.com>

diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index c09f295..f7a1de5 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -408,6 +408,9 @@ static int stop_machine_cpu_stop(void *data)
         int cpu = smp_processor_id(), err = 0;
         unsigned long flags;
         bool is_active;
+       unsigned long loops = 0;
+       unsigned long start_at = jiffies;
+       unsigned long timeout = start_at - 1;

         /*
          * When called from stop_machine_from_inactive_cpu(), irq might
@@ -422,6 +425,13 @@ static int stop_machine_cpu_stop(void *data)

         /* Simple state machine */
         do {
+               loops++;
+               if (time_after(jiffies, timeout)) {
+                       printk("cpu: %i loops: %lu jiffies: %lu  timeout: %lu curstate: %i  smdata->state:
+                              cpu, loops, jiffies, timeout, curstate, smdata->state,
+                              atomic_read(&smdata->thread_ack));
+                       timeout = jiffies + 5000;
+               }
                 /* Chill out and ensure we re-read stopmachine_state. */
                 cpu_relax();
                 if (smdata->state != curstate) {
@@ -473,6 +483,14 @@ int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)

         /* Set the initial state and stop all online cpus. */
         set_state(&smdata, STOPMACHINE_PREPARE);
+       {
+               char ksym_buf[KSYM_NAME_LEN];
+               printk("__stop_machine, num-threads: %i, fn: %s  data: %p\n",
+                      smdata.num_threads,
+                      kallsyms_lookup((unsigned long)fn, NULL, NULL, NULL,
+                                      ksym_buf),
+                      data);
+       }
         return stop_cpus(cpu_online_mask, stop_machine_cpu_stop, &smdata);
  }

@@ -517,10 +535,16 @@ int stop_machine_from_inactive_cpu(int (*fn)(void *), void *data,
                                             .active_cpus = cpus };
         struct cpu_stop_done done;
         int ret;
+       char ksym_buf[KSYM_NAME_LEN];

         /* Local CPU must be inactive and CPU hotplug in progress. */
         BUG_ON(cpu_active(raw_smp_processor_id()));
         smdata.num_threads = num_active_cpus() + 1;     /* +1 for local */
+       printk("stop-machine-from-inactive-cpu, num-threads: %i  fn: %s(%p)\n",
+              smdata.num_threads,
+              kallsyms_lookup((unsigned long)fn, NULL, NULL, NULL,
+                              ksym_buf),
+              data);

         /* No proper task established and can't sleep - busy wait for lock. */
         while (!mutex_trylock(&stop_cpus_mutex))


Thanks,
Ben


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: Please add to stable:  module: don't unlink the module until we've removed all exposure.
  2013-06-04 14:07               ` Joe Lawrence
  2013-06-04 16:50                 ` Joe Lawrence
  2013-06-04 16:53                 ` Ben Greear
@ 2013-06-05  3:29                 ` Rusty Russell
  2 siblings, 0 replies; 51+ messages in thread
From: Rusty Russell @ 2013-06-05  3:29 UTC (permalink / raw)
  To: Joe Lawrence; +Cc: Ben Greear, Linux Kernel Mailing List, stable

Joe Lawrence <joe.lawrence@stratus.com> writes:
> On Tue, 04 Jun 2013 15:26:28 +0930
> Rusty Russell <rusty@rustcorp.com.au> wrote:
>
>> Do you have a backtrace of the 3.9.4 crash?  You can add "CFLAGS_module.o
>> = -O0" to get a clearer backtrace if you want...
>
> Hi Rusty,
>
> See my 3.9 stack traces below, which may or may not be what Ben had
> been seeing.  If you like, I can try a similar loop as the one you were
> testing in the other email.  
>
> Regards,
>
> -- Joe
>
> *** First instance ***
>
> ------------[ cut here ]------------
> WARNING: at fs/sysfs/dir.c:536 sysfs_add_one+0xd4/0x100()
> Hardware name: ftServer 6400
> sysfs: cannot create duplicate filename '/module/mgag200'
> Modules linked in: enclosure(+) mgag200(+) ghash_clmulni_intel(+) pcspkr joydev osst vhost_net st tun macvtap macvlan uinput raid1 mpt2sas(OF) raid_class qla2xxx(OF) scsi_transport_fc scsi_transport_sas sd_mod(OF) usb_storage scsi_tgt scsi_hbas(OF) i2c_algo_bit drm_kms_helper ttm drm i2c_core
> Pid: 733, comm: systemd-udevd Tainted: GF          O 3.9.0sra_new+ #1
> Call Trace:
>  [<ffffffff81061a9f>] warn_slowpath_common+0x7f/0xc0
>  [<ffffffff81061b96>] warn_slowpath_fmt+0x46/0x50
>  [<ffffffff81319875>] ? strlcat+0x65/0x90
>  [<ffffffff81222914>] sysfs_add_one+0xd4/0x100
>  [<ffffffff81222b38>] create_dir+0x78/0xd0
>  [<ffffffff81222e86>] sysfs_create_dir+0x86/0xe0
>  [<ffffffff81313588>] kobject_add_internal+0xa8/0x270
>  [<ffffffff81313ab3>] kobject_init_and_add+0x63/0x90
>  [<ffffffff810ca34d>] load_module+0x12dd/0x2890
>  [<ffffffff81331670>] ? ddebug_proc_open+0xc0/0xc0
>  [<ffffffff810cb9ea>] sys_init_module+0xea/0x140
>  [<ffffffff81680d19>] system_call_fastpath+0x16/0x1b
> ---[ end trace 247a5f5f82ef192d ]---
> ------------[ cut here ]------------
> WARNING: at lib/kobject.c:196 kobject_add_internal+0x204/0x270()
> Hardware name: ftServer 6400
> kobject_add_internal failed for mgag200 with -EEXIST, don't try to register things with the same name in the same directory.
> Modules linked in:0m] Started Conf mdio(+) coretemp(+) crc32c_intel(+) dca(+) enclosure(+) mgag200(+) ghash_clmulni_intel pcspkr joydev osst vhost_net st tun macvtap macvlan uinput raid1 mpt2sas(OF) raid_class qla2xxx(OF) scsi_transport_fc scsi_transport_sas sd_mod(OF) usb_storage scsi_tgt scsi_hbas(OF) i2c_algo_bit drm_kms_helper ttm drm i2c_core
> Pid: 733, comm: systemd-udevd Tainted: GF       W  O 3.9.0sra_new+ #1
>
> Call Trace:
>  [<ffffffff81061a9f>] warn_slowpath_common+0x7f/0xc0
>  [<ffffffff81061b96>] warn_slowpath_fmt+0x46/0x50
>  [<ffffffff813136e4>] kobject_add_internal+0x204/0x270
>  [<ffffffff81313ab3>] kobject_init_and_add+0x63/0x90
>  [<ffffffff810ca34d>] load_module+0x12dd/0x2890
>  [<ffffffff81331670>] ? ddebug_proc_open+0xc0/0xc0
>  [<ffffffff810cb9ea>] sys_init_module+0xea/0x140
>  [<ffffffff81680d19>] system_call_fastpath+0x16/0x1b
> ---[ end trace 247a5f5f82ef192e ]---

That's a WARN_ON.  It's harmless, but indeed is fixed by the patch
suggested.

> *** Second instance ***
>
> mgag200: module is already loaded
> igb: Intel(R) Gigabit Ethernet Network Driver - version 4.1.2-k
> BUG: unable to handle kernel paging request at ffffffffa01d060c
> IP: [<ffffffff81313276>] kobject_del+0x16/0x40
> PGD 1c0f067 PUD 1c10063 PMD 851372067 PTE 0
> Oops: 0002 [#1] SMP 
> Modules linked in: ixgbe(OF+) igb(OF+) mgag200(+) ptp pps_core mdio dca coretemp crc32c_intel pcspkr ghash_clmulni_intel vhost_net tun macvtap macvlan uinput raid1 usb_storage mpt2sas(OF) raid_class qla2xxx(OF) scsi_transport_fc scsi_transport_sas sd_mod(OF) scsi_tgt scsi_hbas(OF) i2c_algo_bit drm_kms_helper ttm drm i2c_core
> CPU 28 
> Pid: 719, comm: systemd-udevd Tainted: GF          O 3.9.0sra_new+ #1 Stratus ftServer 6400/G7LAZ
> RIP: 0010:[<ffffffff81313276>]  [<ffffffff81313276>] kobject_del+0x16/0x40
> RSP: 0018:ffff88103814fd08  EFLAGS: 00010292
> RAX: 0000000000000200 RBX: ffffffffa01d05d0 RCX: 0000000100250004
> RDX: ffff88103814ffd8 RSI: 0000000000250004 RDI: 0000000000000246
> RBP: ffff88103814fd18 R08: ffff88103814fa80 R09: 0000000000000000
> R10: ffff88085f821d40 R11: 0000000000000025 R12: ffffffff81c412c0
> R13: ffff880852c8cfc0 R14: ffffffffa01e0580 R15: ffffffffa01e0598
> FS:  00007fc98fe6c840(0000) GS:ffff88107fd80000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: ffffffffa01d060c CR3: 0000001038137000 CR4: 00000000000407e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process systemd-udevd (pid: 719, threadinfo ffff88103814e000, task ffff8810380d98a0)
> Stack:
>  ffff88103814fd18 ffffffffa01d05d0 ffff88103814fd48 ffffffff81313302
>  ffff88103814fd78 ffffffffa01d05d0 ffffffffa01d05d0 ffffffffffffffea
>  ffff88103814fd68 ffffffff8131348b 00000000ffff8000 ffff88103814fee8
> Call Trace:
>  [<ffffffff81313302>] kobject_cleanup+0x62/0x1b0
>  [<ffffffff8131348b>] kobject_put+0x2b/0x60
>  [<ffffffff810cb8f1>] load_module+0x2881/0x2890
>  [<ffffffff81331670>] ? ddebug_proc_open+0xc0/0xc0
>  [<ffffffff810cb9ea>] sys_init_module+0xea/0x140
>  [<ffffffff81680d19>] system_call_fastpath+0x16/0x1b
> Code: 02 00 00 48 8b 5d f0 4c 8b 65 f8 c9 c3 0f 1f 84 00 00 00 00 00 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 85 ff 74 22 e8 7a fc f0 ff <80> 63 3c fd 48 89 df e8 6e ff ff ff 48 8b 7b 18 e8 d5 01 00 00 
> RIP  [<ffffffff81313276>] kobject_del+0x16/0x40
>  RSP <ffff88103814fd08>
> CR2: ffffffffa01d060c
> ---[ end trace e320c2319820c81a ]---
> Kernel panic - not syncing: Fatal exception

This is the interesting one!  This is the kind of crash which Linus' a49b7e82
fixed, but 3.9 contains that already.

For some reason, mgag200 is being unloaded (or failing to load) while
another one is being loaded.  The new module does this:

	kobj = kset_find_obj(module_kset, mod->name);
	if (kobj) {
		printk(KERN_ERR "%s: module is already loaded\n", mod->name);
		kobject_put(kobj);

The old one does this:

	kobject_put(&mod->mkobj.kobj);  /* in mod_sysfs_fini */
	...
	module_free(mod, mod->module_core); /* in free_module */

And it doesn't wait for the kobj count to hit zero, so the other
kobject_put() is on a kobject which is freed...

Normally the right answer would be to offload the freeing of the module
memory to module_ktype's release function.
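
A rough userspace sketch of that idea follows; it is not kernel code.  The
sketch_* names are hypothetical, the refcount is a plain int (so it is
single-threaded only), and the cast relies on the kobject being the first
member where real code would use container_of().  The point is only that
the memory is freed from the release hook, so whoever drops the last
reference does the freeing.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a kobject: a refcount plus a release hook. */
struct sketch_kobj {
	int refcount;
	void (*release)(struct sketch_kobj *k);
};

/* Hypothetical stand-in for a module that embeds its kobject. */
struct sketch_module {
	struct sketch_kobj kobj;  /* must stay the first member (see cast) */
	void *core;               /* stands in for the module memory */
	int *freed_flag;          /* set to 1 once core has been freed */
};

static void sketch_kobj_get(struct sketch_kobj *k)
{
	k->refcount++;
}

/* Dropping the last reference runs the release hook. */
static void sketch_kobj_put(struct sketch_kobj *k)
{
	assert(k->refcount > 0);
	if (--k->refcount == 0)
		k->release(k);
}

/* The only place the module memory is freed. */
static void sketch_module_release(struct sketch_kobj *k)
{
	struct sketch_module *mod = (struct sketch_module *)k;

	*mod->freed_flag = 1;
	free(mod->core);
	free(mod);
}

static struct sketch_module *sketch_module_new(int *freed_flag)
{
	struct sketch_module *mod = malloc(sizeof(*mod));

	if (!mod)
		return NULL;
	mod->core = malloc(64);
	if (!mod->core) {
		free(mod);
		return NULL;
	}
	mod->kobj.refcount = 1;   /* reference held by the module itself */
	mod->kobj.release = sketch_module_release;
	mod->freed_flag = freed_flag;
	*freed_flag = 0;
	return mod;
}
```

In this model a second loader that takes a reference with sketch_kobj_get()
keeps the memory alive even after the unloading path has done its own
sketch_kobj_put(); the free happens only on the final put.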

The backported dont-remove-from-list fix prevents this race from
happening, since the duplicate module is now caught earlier.  It does
not prevent the same problem with other kobj references; for example,
what about sysfs accesses to a module which fails init?

So the simplest thing is to backport the fix, as suggested.  But now I'm
trying to trigger the same bug in other ways.

Thanks,
Rusty.





^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Please add to stable:  module: don't unlink the module until we've removed all exposure.
  2013-06-04 17:45                   ` Ben Greear
@ 2013-06-05  4:17                     ` Rusty Russell
  2013-06-05  7:15                       ` Tejun Heo
  0 siblings, 1 reply; 51+ messages in thread
From: Rusty Russell @ 2013-06-05  4:17 UTC (permalink / raw)
  To: Ben Greear, Joe Lawrence; +Cc: Linux Kernel Mailing List, stable, Tejun Heo

Ben Greear <greearb@candelatech.com> writes:
> On 06/04/2013 09:53 AM, Ben Greear wrote:
>> On 06/04/2013 07:07 AM, Joe Lawrence wrote:
>>> On Tue, 04 Jun 2013 15:26:28 +0930
>>> Rusty Russell <rusty@rustcorp.com.au> wrote:
>>>
>>>> Do you have a backtrace of the 3.9.4 crash?  You can add "CFLAGS_module.o
>>>> = -O0" to get a clearer backtrace if you want...
>>>
>>> Hi Rusty,
>>>
>>> See my 3.9 stack traces below, which may or may not be what Ben had
>>> been seeing.  If you like, I can try a similar loop as the one you were
>>> testing in the other email.
>>
>> My stack traces are similar.  I had better luck reproducing the problem
>> once I enabled lots of debugging (slub memory poisoning, lockdep,
>> object debugging, etc).
>>
>> I'm using Fedora 17 on 2-core core-i7 (4 CPU threads total) for most of this
>> testing.  We reproduced on dual-core Atom system as well
>> (32-bit Fedora 14 and Fedora 17).  Relatively standard hardware as far
>> as I know.
>>
>> I'll run the insmod/rmmod stress test on my patched systems
>> and see if I can reproduce with the patch in the title applied.
>>
>> Rusty:  I'm also seeing lockups related to migration on stock 3.9.4+
>> (with and without the 'don't unlink the module...' patch.  Much harder
>> to reproduce.  But, that code appears to be mostly called during
>> module load/unload, so it's possible it is related.  The first
>> traces are from a system with local patches, applied, but a later
>> post by me has traces from clean upstream kernel.
>>
>> Further debugging showed that this could be a race, because it seems
>> that all migration/ threads think they are done with their state machine,
>> but the atomic thread counter sits at 1, so no progress is ever made.
>>
>> http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg443471.html
>
> I reproduced the migration deadlock after a while (loading and unloading
> the macvlan module with this command:
>
> for i in `seq 10000`; do modprobe macvlan; rmmod macvlan; done
>
> I did not see the kobj crash, but this kernel was running your patch
> that makes the problem go away for me, for whatever reason.
>
> I have some printk debugging in (see bottom of email) and was using a serial console, so things
> were probably running a bit slower than on most systems.  Here is trace
> from my kernel with local patches and not so much debugging enabled
> (this is NOT a clean upstream kernel, though I reproduced the same thing
> with a clean upstream 3.9.4 kernel plus your module unlink patch yesterday).

Tejun CC'd.  We can't be running two stop machines in parallel, since
there's a mutex (and there's also one in the module code).

> __stop_machine, num-threads: 4, fn: __try_stop_module  data: ffff8801c6ae7f28
> cpu: 0 loops: 1 jiffies: 4299011449  timeout: 4299011448 curstate: 0  smdata->state: 1  thread_ack: 4
> cpu: 1 loops: 1 jiffies: 4299011449  timeout: 4299011448 curstate: 0  smdata->state: 1  thread_ack: 4
> cpu: 2 loops: 1 jiffies: 4299011449  timeout: 4299011448 curstate: 0  smdata->state: 1  thread_ack: 3
> cpu: 3 loops: 1 jiffies: 4299011449  timeout: 4299011448 curstate: 0  smdata->state: 1  thread_ack: 2
> __stop_machine, num-threads: 4, fn: __unlink_module  data: ffffffffa0aeeab0
> cpu: 0 loops: 1 jiffies: 4299011501  timeout: 4299011500 curstate: 0  smdata->state: 1  thread_ack: 4
> cpu: 1 loops: 1 jiffies: 4299011501  timeout: 4299011500 curstate: 0  smdata->state: 1  thread_ack: 4
> cpu: 3 loops: 1 jiffies: 4299011501  timeout: 4299011500 curstate: 0  smdata->state: 1  thread_ack: 3
> ath: wiphy0: Failed to stop TX DMA, queues=0x005!
> cpu: 2 loops: 1 jiffies: 4299011501  timeout: 4299011500 curstate: 0  smdata->state: 1  thread_ack: 2

What's the ath driver doing here?  Or is that failure normal?

Your patch was mangled (unfinished?) but this doesn't show any timeouts
happening.  I guess you hit STOPMACHINE_DISABLE_IRQ and turned
interrupts off, so no jiffies increment.

Try using the loop counter, with some value big enough that it doesn't
trip normally.
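
A rough user-space sketch of that suggestion, written in Python rather than kernel C: count spin iterations instead of comparing against jiffies, since (per the guess above) jiffies may stop advancing once interrupts are off. The names `spin_until_change` and `MAX_LOOPS` and the threshold value are made up for illustration; this is not the stop_machine code.

```python
MAX_LOOPS = 1_000_000  # hypothetical threshold; pick one that never trips in normal runs


def spin_until_change(read_state, curstate, max_loops=MAX_LOOPS):
    """Spin until read_state() differs from curstate.

    Returns (new_state, loops) on a state change, or (None, loops) if the
    loop budget ran out first. No clock is consulted, so the check still
    works when time-based counters have stopped advancing.
    """
    loops = 0
    while loops < max_loops:
        loops += 1
        state = read_state()
        if state != curstate:
            return state, loops
    return None, loops


# A state source that changes on the third read.
reads = iter([1, 1, 2])
print(spin_until_change(lambda: next(reads), curstate=1))  # (2, 3)

# A state source that never changes: the loop budget trips instead of hanging.
print(spin_until_change(lambda: 1, curstate=1, max_loops=1000))  # (None, 1000)
```

In the real loop the "budget exhausted" branch would be the place to dump state, rather than a return.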


> BUG: soft lockup - CPU#3 stuck for 23s! [migration/3:29]
> Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_po]
> CPU 3
> Pid: 29, comm: migration/3 Tainted: G         C O 3.9.4+ #60 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.
> RIP: 0010:[<ffffffff8109d69c>]  [<ffffffff8109d69c>] tasklet_action+0x46/0xcc
> RSP: 0000:ffff88022bd83ed8  EFLAGS: 00000282
> RAX: ffff88022bd8e080 RBX: ffff8802222a4000 RCX: ffffffff81a90f06
> RDX: ffff88021f48afa8 RSI: 0000000000000000 RDI: ffffffff81a050b0
> RBP: ffff88022bd83ee8 R08: 0000000000000000 R09: 0000000000000000
> R10: 00000000000005f2 R11: 00000000fd010018 R12: ffff88022bd83e48
> R13: ffffffff815d145d R14: ffff88022bd83ee8 R15: ffff880222282000
> FS:  0000000000000000(0000) GS:ffff88022bd80000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00007fa8a06d5000 CR3: 00000002007ea000 CR4: 00000000000007e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process migration/3 (pid: 29, threadinfo ffff880222282000, task ffff880222279770)
> Stack:
>   ffffffff81a050b0 ffff880222282000 ffff88022bd83f78 ffffffff8109db1f
>   ffff88022bd83f08 ffff880222282010 ffff880222283fd8 04208040810b79ef
>   00000001003db5d8 000000032bd8e150 ffff880222282000 0000000000000030
> Call Trace:
>   <IRQ>
>   [<ffffffff8109db1f>] __do_softirq+0x107/0x23c
>   [<ffffffff8109dce6>] irq_exit+0x4b/0xa8
>   [<ffffffff815d257f>] smp_apic_timer_interrupt+0x8b/0x99
>   [<ffffffff815d145d>] apic_timer_interrupt+0x6d/0x80
>   <EOI>
>   [<ffffffff810f9984>] ? stop_machine_cpu_stop+0xc7/0x145
>   [<ffffffff810f9974>] ? stop_machine_cpu_stop+0xb7/0x145
>   [<ffffffff81016445>] ? sched_clock+0x9/0xd
>   [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30
>   [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176
>   [<ffffffff815c9065>] ? __schedule+0x59f/0x5e7
>   [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f
>   [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11
>   [<ffffffff810b4a09>] kthread+0xb5/0xbd
>   [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
>   [<ffffffff815d07ac>] ret_from_fork+0x7c/0xb0
>   [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
> Code: 48 8b 14 25 80 e0 00 00 65 48 c7 04 25 80 e0 00 00 00 00 00 00 65 48 03 04 25 d8 da 00 00 65 48 89 04 25 88 e0 00 00 fb
> ------------[ cut here ]------------
> WARNING: at /home/greearb/git/linux-3.9.dev.y/kernel/watchdog.c:245 watchdog_overflow_callback+0x9b/0xa6()
> Hardware name: To be filled by O.E.M.
> Watchdog detected hard LOCKUP on cpu 1
> Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_po]
> Pid: 17, comm: migration/1 Tainted: G         C O 3.9.4+ #60
> Call Trace:
>   <NMI>  [<ffffffff810963e1>] warn_slowpath_common+0x85/0x9f
>   [<ffffffff8109649e>] warn_slowpath_fmt+0x46/0x48
>   [<ffffffff810c4dfb>] ? sched_clock_cpu+0x44/0xce
>   [<ffffffff81103b70>] watchdog_overflow_callback+0x9b/0xa6
>   [<ffffffff8113354f>] __perf_event_overflow+0x137/0x1cb
>   [<ffffffff8101db3f>] ? x86_perf_event_set_period+0x107/0x113
>   [<ffffffff81133a9e>] perf_event_overflow+0x14/0x16
>   [<ffffffff810230dc>] intel_pmu_handle_irq+0x2b0/0x32d
>   [<ffffffff815cbc51>] perf_event_nmi_handler+0x19/0x1b
>   [<ffffffff815cb4ca>] nmi_handle+0x55/0x7e
>   [<ffffffff815cb59b>] do_nmi+0xa8/0x2db
>   [<ffffffff815cac31>] end_repeat_nmi+0x1e/0x2e
>   [<ffffffff810f9942>] ? stop_machine_cpu_stop+0x85/0x145
>   [<ffffffff810f9942>] ? stop_machine_cpu_stop+0x85/0x145
>   [<ffffffff810f9942>] ? stop_machine_cpu_stop+0x85/0x145
>   <<EOE>>  [<ffffffff810c6b1c>] ? set_next_entity+0x28/0x7e
>   [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30
>   [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176
>   [<ffffffff815c9065>] ? __schedule+0x59f/0x5e7
>   [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f
>   [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11
>   [<ffffffff810b4a09>] kthread+0xb5/0xbd
>   [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
>   [<ffffffff815d07ac>] ret_from_fork+0x7c/0xb0
>   [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
> ---[ end trace dcd772d6fdf499cf ]---
>
> ...
>
>
> SysRq : Show backtrace of all active CPUs
> sending NMI to all CPUs:
> NMI backtrace for cpu 2
> CPU 2
> Pid: 23, comm: migration/2 Tainted: G        WC O 3.9.4+ #60 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.
> RIP: 0010:[<ffffffff810f9987>]  [<ffffffff810f9987>] stop_machine_cpu_stop+0xca/0x145
> RSP: 0018:ffff880222219cf8  EFLAGS: 00000012
> RAX: 00000001003db5d7 RBX: ffff8801c6ae7e18 RCX: 0000000000000097
> RDX: 0000000000000002 RSI: 0000000000000006 RDI: 0000000000000246
> RBP: ffff880222219d68 R08: 00000001003dc935 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801c6ae7e3c
> R13: 0000000000000000 R14: 0000000000000002 R15: 0000000000000002
> FS:  0000000000000000(0000) GS:ffff88022bd00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00000000027e2178 CR3: 000000021955d000 CR4: 00000000000007e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process migration/2 (pid: 23, threadinfo ffff880222218000, task ffff880222210000)
> Stack:
>   ffff880200000001 ffff880200000002 ffff880222219d28 ffffffff810c6b1c
>   ffff880222219d88 00ffffff8100f8a4 000000051058262e 0000000000000292
>   ffff880222219d58 ffff88022bd0e400 ffff8801c6ae7d08 ffff880222218000
> Call Trace:
>   [<ffffffff810c6b1c>] ? set_next_entity+0x28/0x7e
>   [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30
>   [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176
>   [<ffffffff815c9065>] ? __schedule+0x59f/0x5e7
>   [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f
>   [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11
>   [<ffffffff810b4a09>] kthread+0xb5/0xbd
>   [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
>   [<ffffffff815d07ac>] ret_from_fork+0x7c/0xb0
>   [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
> Code: 48 c7 c7 24 fa 80 81 89 44 24 08 8b 43 20 89 04 24 31 c0 e8 7a e1 4c 00 4c 8b 05 85 c6 9e 00 49 81 c0 88 13 00 00 f3 90
> NMI backtrace for cpu 0
> CPU 0
> Pid: 8, comm: migration/0 Tainted: G        WC O 3.9.4+ #60 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E..
> RIP: 0010:[<ffffffff810f9984>]  [<ffffffff810f9984>] stop_machine_cpu_stop+0xc7/0x145
> RSP: 0000:ffff880222145cf8  EFLAGS: 00000006
> RAX: 00000001003db5d7 RBX: ffff8801c6ae7e18 RCX: 0000000000000004
> RDX: 0000000000000002 RSI: ffff88022bc0de68 RDI: 0000000000000246
> RBP: ffff880222145d68 R08: 00000001003dc95f R09: 0000000000000001
> R10: ffff880222145bf8 R11: 0000000000000000 R12: ffff8801c6ae7e3c
> R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000002
> FS:  0000000000000000(0000) GS:ffff88022bc00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00000000006fb9a8 CR3: 000000021af43000 CR4: 00000000000007f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process migration/0 (pid: 8, threadinfo ffff880222144000, task ffff88022213aee0)
> Stack:
>   ffff880200000001 ffff880200000004 ffff880222145d28 ffffffff810c6b1c
>   ffff880222145d88 01ffffff8100f8a4 000000051023ce2c 0000000000000292
>   ffff880222145d98 ffff88022bc0e400 ffff8801c6ae7d08 ffff880222144000
> Call Trace:
>   [<ffffffff810c6b1c>] ? set_next_entity+0x28/0x7e
>   [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30
>   [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176
>   [<ffffffff815c9065>] ? __schedule+0x59f/0x5e7
>   [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f
>   [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11
>   [<ffffffff810b4a09>] kthread+0xb5/0xbd
>   [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
>   [<ffffffff815d07ac>] ret_from_fork+0x7c/0xb0
>   [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
> Code: 44 89 f6 48 c7 c7 24 fa 80 81 89 44 24 08 8b 43 20 89 04 24 31 c0 e8 7a e1 4c 00 4c 8b 05 85 c6 9e 00 49 81 c0 88 13 00
> NMI backtrace for cpu 1
> CPU 1
> Pid: 17, comm: migration/1 Tainted: G        WC O 3.9.4+ #60 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.
> RIP: 0010:[<ffffffff810f9984>]  [<ffffffff810f9984>] stop_machine_cpu_stop+0xc7/0x145
> RSP: 0000:ffff88022217dcf8  EFLAGS: 00000012
> RAX: 00000001003db5d7 RBX: ffff8801c6ae7e18 RCX: 0000000000000094
> RDX: 0000000000000002 RSI: 0000000000000006 RDI: 0000000000000246
> RBP: ffff88022217dd68 R08: 00000001003dc935 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801c6ae7e3c
> R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000002
> FS:  0000000000000000(0000) GS:ffff88022bc80000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00007fbe801fd000 CR3: 00000001be9bb000 CR4: 00000000000007e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process migration/1 (pid: 17, threadinfo ffff88022217c000, task ffff88022216ddc0)
> Stack:
>   ffff880200000001 ffff880200000004 ffff88022217dd28 ffffffff810c6b1c
>   ffff88022217dd88 00ffffff8100f8a4 00000004c3cbfc0c 0000000000000292
>   ffff88022217dd58 ffff88022bc8e400 ffff8801c6ae7d08 ffff88022217c000
> Call Trace:
>   [<ffffffff810c6b1c>] ? set_next_entity+0x28/0x7e
>   [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30
>   [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176
>   [<ffffffff815c9065>] ? __schedule+0x59f/0x5e7
>   [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f
>   [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11
>   [<ffffffff810b4a09>] kthread+0xb5/0xbd
>   [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
>   [<ffffffff815d07ac>] ret_from_fork+0x7c/0xb0
>   [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
> Code: 44 89 f6 48 c7 c7 24 fa 80 81 89 44 24 08 8b 43 20 89 04 24 31 c0 e8 7a e1 4c 00 4c 8b 05 85 c6 9e 00 49 81 c0 88 13 00
> NMI backtrace for cpu 3
> CPU 3
> Pid: 29, comm: migration/3 Tainted: G        WC O 3.9.4+ #60 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.
> RIP: 0010:[<ffffffff812f6975>]  [<ffffffff812f6975>] delay_tsc+0x83/0xee
> RSP: 0000:ffff88022bd83b60  EFLAGS: 00000046
> RAX: 00000b04f0fdf718 RBX: ffff880222282000 RCX: ffff880222282010
> RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000289672
> RBP: ffff88022bd83bb0 R08: 0000000000000040 R09: 0000000000000001
> R10: ffff88022bd83ad0 R11: 0000000000000000 R12: 00000000f0fdf6e8
> R13: 0000000000000003 R14: ffff880222282000 R15: 0000000000000001
> FS:  0000000000000000(0000) GS:ffff88022bd80000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00007fa8a06d5000 CR3: 00000002007ea000 CR4: 00000000000007e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process migration/3 (pid: 29, threadinfo ffff880222282000, task ffff880222279770)
> Stack:
>   000000000000000f 0000000000000002 ffff880222282010 0028967200000082
>   ffff88022bd83b90 0000000000000001 000000000000006c 0000000000000007
>   0000000000000086 0000000000000001 ffff88022bd83bc0 ffffffff812f68c5
> Call Trace:
>   <IRQ>
>   [<ffffffff812f68c5>] __const_udelay+0x28/0x2a
>   [<ffffffff8102ff25>] arch_trigger_all_cpu_backtrace+0x66/0x7d
>   [<ffffffff813ae154>] sysrq_handle_showallcpus+0xe/0x10
>   [<ffffffff813ae46b>] __handle_sysrq+0xbf/0x15b
>   [<ffffffff813ae80e>] handle_sysrq+0x2c/0x2e
>   [<ffffffff813c25a2>] serial8250_rx_chars+0x13c/0x1b9
>   [<ffffffff813c2691>] serial8250_handle_irq+0x72/0xa8
>   [<ffffffff813c2752>] serial8250_default_handle_irq+0x23/0x28
>   [<ffffffff813c148c>] serial8250_interrupt+0x4d/0xc6
>   [<ffffffff811046b8>] handle_irq_event_percpu+0x7a/0x1e5
>   [<ffffffff81104864>] handle_irq_event+0x41/0x61
>   [<ffffffff81107028>] handle_edge_irq+0xa6/0xcb
>   [<ffffffff81011d9f>] handle_irq+0x24/0x2d
>   [<ffffffff815d248d>] do_IRQ+0x4d/0xb4
>   [<ffffffff815ca5ad>] common_interrupt+0x6d/0x6d
>   [<ffffffff810a47ee>] ? run_timer_softirq+0x24/0x1df
>   [<ffffffff8109dabd>] ? __do_softirq+0xa5/0x23c
>   [<ffffffff8109db8a>] ? __do_softirq+0x172/0x23c
>   [<ffffffff8109dce6>] irq_exit+0x4b/0xa8
>   [<ffffffff815d257f>] smp_apic_timer_interrupt+0x8b/0x99
>   [<ffffffff815d145d>] apic_timer_interrupt+0x6d/0x80
>   <EOI>
>   [<ffffffff810f9984>] ? stop_machine_cpu_stop+0xc7/0x145
>   [<ffffffff810f9974>] ? stop_machine_cpu_stop+0xb7/0x145
>   [<ffffffff81016445>] ? sched_clock+0x9/0xd
>   [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30
>   [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176
>   [<ffffffff815c9065>] ? __schedule+0x59f/0x5e7
>   [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f
>   [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11
>   [<ffffffff810b4a09>] kthread+0xb5/0xbd
>   [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
>   [<ffffffff815d07ac>] ret_from_fork+0x7c/0xb0
>   [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
> Code: fa d1 ff 66 90 89 c2 44 29 e2 3b 55 cc 73 4a ff 4b 1c 48 8b 4d c0 48 8b 11 80 e2 08 74 0b 89 45 b8 e8 08 28 2d 00 8b 45

So everyone's doing stop_machine, but no one's making progress.  I'd like
to know why...

Cheers,
Rusty.


* Re: Please add to stable:  module: don't unlink the module until we've removed all exposure.
  2013-06-03 14:17       ` Joe Lawrence
  2013-06-03 15:59         ` Ben Greear
@ 2013-06-05  5:07         ` Greg KH
  2013-06-05  7:13           ` Rusty Russell
  1 sibling, 1 reply; 51+ messages in thread
From: Greg KH @ 2013-06-05  5:07 UTC (permalink / raw)
  To: Joe Lawrence; +Cc: Rusty Russell, Ben Greear, Linux Kernel Mailing List, stable

On Mon, Jun 03, 2013 at 10:17:17AM -0400, Joe Lawrence wrote:
> [Cc: stable@vger.kernel.org]
> 
> Third time is a charm?  The stable address was incorrect from the first 
> msg in this thread, but the relevant bits remain quoted below...

Really?  I'm totally confused...

> On Mon, 3 Jun 2013, Joe Lawrence wrote:
> 
> > [fixing Cc: stable@kernel.org address]
> > 
> > On Sun, 2 Jun 2013, Joe Lawrence wrote:
> > 
> > > On Sun, 2 Jun 2013, Rusty Russell wrote:
> > > 
> > > > Ben Greear <greearb@candelatech.com> writes:
> > > > 
> > > > > It turns out, the bug I spent yesterday chasing in various 3.9 kernels is apparently
> > > > > fixed by the commit in the title (c9c390bb5535380d40614571894ef0c00bc026ff).
> > > > 
> > > > Apparently being the operative word.
> > > > 
> > > > This commit avoids the entire "module insert failed due to sysfs race"
> > > > path in the common case, it doesn't fix any actual problem.
> > > > 
> > > > I think the real commit you want is Linus' kobject fix
> > > > a49b7e82cab0f9b41f483359be83f44fbb6b4979 "kobject: fix kset_find_obj()
> > > > race with concurrent last kobject_put()".
> > > > 
> > > > Or is that already in stable?
> > > 
> > > Hi Rusty,
> > >  
> > > I had pointed Ben (offlist) to that bugzilla entry without realizing
> > > there were other earlier related fixes in this space.  Re-viewing bz-
> > > 58011, it looks like it was opened against 3.8.12, while Ben and myself
> > > had encountered module loading problems in versions 3.9 and
> > > 3.9.[1-3].  I can update the bugzilla entry to add a comment noting commit
> > > a49b7e82 "kobject: fix kset_find_obj() race with concurrent last
> > > kobject_put()".
> > > 
> > > That said, it doesn't appear that commit 944a1fa "module: don't unlink the
> > > module until we've removed all exposure" has made it into any stable
> > > kernel.  On my system, applying this on top of 3.9 resolved a module
> > > unload/load race that would occasionally occur on boot (two video adapters
> > > of the same make, the module unloads for whatever reason and I see "module
> > > is already loaded" and "sysfs: cannot create duplicate filename
> > > '/module/mgag200'" messages every 5-10% instances.)  I have logs if you
> > > were interested in these warnings/crashes.
> > > 
> > > Hope this clarifies things.

After this whole thread, what should I be doing for the 3.9-stable tree?
Add commit 944a1fa?  Or something else?

confused,

greg k-h


* Re: Please add to stable:  module: don't unlink the module until we've removed all exposure.
  2013-06-05  5:07         ` Greg KH
@ 2013-06-05  7:13           ` Rusty Russell
  0 siblings, 0 replies; 51+ messages in thread
From: Rusty Russell @ 2013-06-05  7:13 UTC (permalink / raw)
  To: Greg KH, Joe Lawrence; +Cc: Ben Greear, Linux Kernel Mailing List, stable

Greg KH <gregkh@linuxfoundation.org> writes:
> On Mon, Jun 03, 2013 at 10:17:17AM -0400, Joe Lawrence wrote:
>> [Cc: stable@vger.kernel.org]
>> 
>> Third time is a charm?  The stable address was incorrect from the first 
>> msg in this thread, but the relevant bits remain quoted below...
>
> Really?  I'm totally confused...
>
>> On Mon, 3 Jun 2013, Joe Lawrence wrote:
>> 
>> > [fixing Cc: stable@kernel.org address]
>> > 
>> > On Sun, 2 Jun 2013, Joe Lawrence wrote:
>> > 
>> > > On Sun, 2 Jun 2013, Rusty Russell wrote:
>> > > 
>> > > > Ben Greear <greearb@candelatech.com> writes:
>> > > > 
>> > > > > It turns out, the bug I spent yesterday chasing in various 3.9 kernels is apparently
>> > > > > fixed by the commit in the title (c9c390bb5535380d40614571894ef0c00bc026ff).
>> > > > 
>> > > > Apparently being the operative word.
>> > > > 
>> > > > This commit avoids the entire "module insert failed due to sysfs race"
>> > > > path in the common case, it doesn't fix any actual problem.
>> > > > 
>> > > > I think the real commit you want is Linus' kobject fix
>> > > > a49b7e82cab0f9b41f483359be83f44fbb6b4979 "kobject: fix kset_find_obj()
>> > > > race with concurrent last kobject_put()".
>> > > > 
>> > > > Or is that already in stable?
>> > > 
>> > > Hi Rusty,
>> > >  
>> > > I had pointed Ben (offlist) to that bugzilla entry without realizing
>> > > there were other earlier related fixes in this space.  Re-viewing bz-
>> > > 58011, it looks like it was opened against 3.8.12, while Ben and myself
>> > > had encountered module loading problems in versions 3.9 and
>> > > 3.9.[1-3].  I can update the bugzilla entry to add a comment noting commit
>> > > a49b7e82 "kobject: fix kset_find_obj() race with concurrent last
>> > > kobject_put()".
>> > > 
>> > > That said, it doesn't appear that commit 944a1fa "module: don't unlink the
>> > > module until we've removed all exposure" has made it into any stable
>> > > kernel.  On my system, applying this on top of 3.9 resolved a module
>> > > unload/load race that would occasionally occur on boot (two video adapters
>> > > of the same make, the module unloads for whatever reason and I see "module
>> > > is already loaded" and "sysfs: cannot create duplicate filename
>> > > '/module/mgag200'" messages every 5-10% instances.)  I have logs if you
>> > > were interested in these warnings/crashes.
>> > > 
>> > > Hope this clarifies things.
>
> After this whole thread, what should I be doing for the 3.9-stable tree?
> Add commit 944a1fa?  Or something else?

Yes.  It does fix an Oops unrelated to what it was intended to fix, so
it's the lowest pain path.

There may be other ways of triggering a similar oops, but so far the
obvious attempt has failed (holding a sysfs file open while a module
fails its init).  I might patch it anyway, because it makes me
uncomfortable, but that's separate.

Thanks,
Rusty.


* Re: Please add to stable:  module: don't unlink the module until we've removed all exposure.
  2013-06-05  4:17                     ` Rusty Russell
@ 2013-06-05  7:15                       ` Tejun Heo
  2013-06-05 16:59                         ` Ben Greear
  0 siblings, 1 reply; 51+ messages in thread
From: Tejun Heo @ 2013-06-05  7:15 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Ben Greear, Joe Lawrence, Linux Kernel Mailing List, stable

Hello,

On Wed, Jun 05, 2013 at 01:47:43PM +0930, Rusty Russell wrote:
> > I have some printk debugging in (see bottom of email) and was using a serial console, so things
> > were probably running a bit slower than on most systems.  Here is trace
> > from my kernel with local patches and not so much debugging enabled
> > (this is NOT a clean upstream kernel, though I reproduced the same thing
> > with a clean upstream 3.9.4 kernel plus your module unlink patch yesterday).
> 
> Tejun CC'd.  We can't be running two stop machines in parallel, since
> there's a mutex (and there's also one in the module code).
>
> > __stop_machine, num-threads: 4, fn: __try_stop_module  data: ffff8801c6ae7f28
> > cpu: 0 loops: 1 jiffies: 4299011449  timeout: 4299011448 curstate: 0  smdata->state: 1  thread_ack: 4
> > cpu: 1 loops: 1 jiffies: 4299011449  timeout: 4299011448 curstate: 0  smdata->state: 1  thread_ack: 4
> > cpu: 2 loops: 1 jiffies: 4299011449  timeout: 4299011448 curstate: 0  smdata->state: 1  thread_ack: 3
> > cpu: 3 loops: 1 jiffies: 4299011449  timeout: 4299011448 curstate: 0  smdata->state: 1  thread_ack: 2
> > __stop_machine, num-threads: 4, fn: __unlink_module  data: ffffffffa0aeeab0
> > cpu: 0 loops: 1 jiffies: 4299011501  timeout: 4299011500 curstate: 0  smdata->state: 1  thread_ack: 4
> > cpu: 1 loops: 1 jiffies: 4299011501  timeout: 4299011500 curstate: 0  smdata->state: 1  thread_ack: 4
> > cpu: 3 loops: 1 jiffies: 4299011501  timeout: 4299011500 curstate: 0  smdata->state: 1  thread_ack: 3
> > ath: wiphy0: Failed to stop TX DMA, queues=0x005!
> > cpu: 2 loops: 1 jiffies: 4299011501  timeout: 4299011500 curstate: 0  smdata->state: 1  thread_ack: 2

A bit confused.  I looked at the code again and it still seems
properly interlocked.  Can't see how the above scenario could happen.
Actually, I'm not even sure what I'm looking at.  Is the above from
the lockup?  It'd be helpful if we can get more traces from the locked
up state.  Shouldn't be hard to detect.  Dumb timeout based detection
should work fine.  How easily can you reproduce the issue?

Thanks.

-- 
tejun


* Re: Please add to stable:  module: don't unlink the module until we've removed all exposure.
  2013-06-05  7:15                       ` Tejun Heo
@ 2013-06-05 16:59                         ` Ben Greear
  2013-06-05 18:48                           ` Tejun Heo
  0 siblings, 1 reply; 51+ messages in thread
From: Ben Greear @ 2013-06-05 16:59 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Rusty Russell, Joe Lawrence, Linux Kernel Mailing List, stable

On 06/05/2013 12:15 AM, Tejun Heo wrote:
> Hello,
>
> On Wed, Jun 05, 2013 at 01:47:43PM +0930, Rusty Russell wrote:
>>> I have some printk debugging in (see bottom of email) and was using a serial console, so things
>>> were probably running a bit slower than on most systems.  Here is trace
>>> from my kernel with local patches and not so much debugging enabled
>>> (this is NOT a clean upstream kernel, though I reproduced the same thing
>>> with a clean upstream 3.9.4 kernel plus your module unlink patch yesterday).
>>
>> Tejun CC'd.  We can't be running two stop machines in parallel, since
>> there's a mutex (and there's also one in the module code).
>>
>>> __stop_machine, num-threads: 4, fn: __try_stop_module  data: ffff8801c6ae7f28
>>> cpu: 0 loops: 1 jiffies: 4299011449  timeout: 4299011448 curstate: 0  smdata->state: 1  thread_ack: 4
>>> cpu: 1 loops: 1 jiffies: 4299011449  timeout: 4299011448 curstate: 0  smdata->state: 1  thread_ack: 4
>>> cpu: 2 loops: 1 jiffies: 4299011449  timeout: 4299011448 curstate: 0  smdata->state: 1  thread_ack: 3
>>> cpu: 3 loops: 1 jiffies: 4299011449  timeout: 4299011448 curstate: 0  smdata->state: 1  thread_ack: 2
>>> __stop_machine, num-threads: 4, fn: __unlink_module  data: ffffffffa0aeeab0
>>> cpu: 0 loops: 1 jiffies: 4299011501  timeout: 4299011500 curstate: 0  smdata->state: 1  thread_ack: 4
>>> cpu: 1 loops: 1 jiffies: 4299011501  timeout: 4299011500 curstate: 0  smdata->state: 1  thread_ack: 4
>>> cpu: 3 loops: 1 jiffies: 4299011501  timeout: 4299011500 curstate: 0  smdata->state: 1  thread_ack: 3
>>> ath: wiphy0: Failed to stop TX DMA, queues=0x005!
>>> cpu: 2 loops: 1 jiffies: 4299011501  timeout: 4299011500 curstate: 0  smdata->state: 1  thread_ack: 2
>
> A bit confused.  I looked at the code again and it still seems
> properly interlocked.  Can't see how the above scenario could happen.
> Actually, I'm not even sure what I'm looking at.  Is the above from
> the lockup?  It'd be helpful if we can get more traces from the locked
> up state.  Shouldn't be hard to detect.  Dumb timeout based detection
> should work fine.  How easily can you reproduce the issue?

I can easily reproduce the problem.  First, my machine info:

dual-core i7, 4 cpu threads total.
8 GB RAM
64-bit kernel, Fedora-17 64-bit OS
2 ath9k NICs, each with 200 virtual stations configured.
   I don't think the virtual stations matter too much, as we reproduce easily
   enough with fewer, but I think that maybe the extra IRQ load helps trigger
   the bug.
I added lots of printk debugging, which makes it easier to hit the
lockup.

There is an 'ath' message in the log above, but most traces have nothing related
to ath in them, so I think that message is unrelated.  (It has been a common
ath error message for years and, at least on my hardware, does not appear to cause any
serious problems.)

Here below is another trace with even more debugging printouts.  After a while,
it stopped even printing out cpu-hung messages.  Not sure if that is an important clue or
not.


One pattern I notice repeating for at least most of the hangs is that all but one
CPU thread has irqs disabled and is in state 2.  But, there will be one thread
in state 1 that still has IRQs enabled and it is reported to be in soft-lockup
instead of hard-lockup.  In 'sysrq l' it always shows some IRQ processing,
but typically that of the sysrq itself.  I added printk that would always
print if the thread notices that smdata->state != curstate, and the soft-lockup
thread (cpu 2 below) never shows that message.

I thought it might be because it was reading stale smdata->state, so I changed
that to atomic_t hoping that would mitigate that.  I also tried adding smp_rmb()
below the cpu_relax().  Neither had any effect, so I am left assuming that the
thread instead is stuck handling IRQs and never gets out of the IRQ handler.

Maybe since I have 2 real cores, and 3 processes busy-spinning on their CPU cores,
the remaining process can just never handle all the IRQs and get back to the
cpu shutdown state machine?  The various soft-hang stacks below show at least slightly
different stacks, so I assume that thread is doing at least something.
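
For anyone following the debug output, here is a toy single-threaded Python model of the ack counting it reflects: thread_ack is re-armed to the thread count on each set_state, every thread decrements it once per state, and the last one to ack bumps the state. The names `MultiStop`, `run` and `stuck_cpu` are made up for illustration, and the real code uses per-CPU stopper threads and atomics, so treat this only as a sketch of the counting, under the assumption that one CPU never gets back to the loop.

```python
class MultiStop:
    """Toy model of the ack counter seen in the debug output (not kernel code)."""

    def __init__(self, num_threads, final_state):
        self.num_threads = num_threads
        self.final_state = final_state
        self.state = 0
        self.thread_ack = 0
        self.set_state(1)

    def set_state(self, newstate):
        # Re-arm the counter before publishing the new state.
        self.thread_ack = self.num_threads
        self.state = newstate

    def ack_state(self):
        # The last thread to ack moves everyone on to the next state.
        self.thread_ack -= 1
        if self.thread_ack == 0 and self.state < self.final_state:
            self.set_state(self.state + 1)


def run(num_threads, final_state, stuck_cpu=None, max_rounds=100):
    """Round-robin the CPUs; a stuck CPU never notices state changes."""
    sm = MultiStop(num_threads, final_state)
    curstate = [0] * num_threads
    for _ in range(max_rounds):
        for cpu in range(num_threads):
            if cpu == stuck_cpu:
                continue
            if sm.state != curstate[cpu]:
                curstate[cpu] = sm.state
                sm.ack_state()
    return sm.state, sm.thread_ack


print(run(4, 3))               # (3, 0): every CPU acked every state
print(run(4, 3, stuck_cpu=2))  # (1, 1): counter sits at 1, state never advances
```

The second call matches the pattern in the hangs: three CPUs have acked and spin forever, and the counter stays at 1 until the remaining CPU gets back to the loop.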


__stop_machine, num-threads: 4, fn: __try_stop_module(ffffffff810e86da)  data: ffff8802202abf28  smdata: ffff8802202abe58
set_state, cpu: 3  state: 0 newstate: 1 smdata: ffff8802202abe58  fn: ffffffff810e86da
cpu: 1 loops: 1 jiffies: 4294739830  timeout: 4294739829 curstate: 0  smdata->state: 1  thread_ack: 4  smdata: ffff8802202abe58 fn: ffffffff810e86da
cpu: 3 loops: 1 jiffies: 4294739830  timeout: 4294739829 curstate: 0  smdata->state: 1  thread_ack: 4  smdata: ffff8802202abe58 fn: ffffffff810e86da
state-change: cpu: 3 loops: 1 jiffies: 4294739830 curstate: 0  smdata->state: 1  thread_ack: 4  smdata: ffff8802202abe58 fn: ffffffff810e86da
cpu: 2 loops: 1 jiffies: 4294739830  timeout: 4294739829 curstate: 0  smdata->state: 1  thread_ack: 4  smdata: ffff8802202abe58 fn: ffffffff810e86da
cpu: 0 loops: 1 jiffies: 4294739830  timeout: 4294739829 curstate: 0  smdata->state: 1  thread_ack: 4  smdata: ffff8802202abe58 fn: ffffffff810e86da
cpu: 3 ack_state jiffies: 4294739830 smdata->state: 1  thread_ack: 4 smdata: ffff8802202abe58 fn: ffffffff810e86da
state-change: cpu: 2 loops: 1 jiffies: 4294739830 curstate: 0  smdata->state: 1  thread_ack: 4  smdata: ffff8802202abe58 fn: ffffffff810e86da
state-change: cpu: 0 loops: 1 jiffies: 4294739830 curstate: 0  smdata->state: 1  thread_ack: 4  smdata: ffff8802202abe58 fn: ffffffff810e86da
cpu: 3 ack_state end jiffies: 4294739830 smdata->state: 1  thread_ack: 3  smdata: ffff8802202abe58 fn: ffffffff810e86da
cpu: 2 ack_state jiffies: 4294739830 smdata->state: 1  thread_ack: 3 smdata: ffff8802202abe58 fn: ffffffff810e86da
cpu: 0 ack_state jiffies: 4294739830 smdata->state: 1  thread_ack: 3 smdata: ffff8802202abe58 fn: ffffffff810e86da
cpu: 2 ack_state end jiffies: 4294739830 smdata->state: 1  thread_ack: 2  smdata: ffff8802202abe58 fn: ffffffff810e86da
cpu: 0 ack_state end jiffies: 4294739830 smdata->state: 1  thread_ack: 1  smdata: ffff8802202abe58 fn: ffffffff810e86da
state-change: cpu: 1 loops: 1 jiffies: 4294739981 curstate: 0  smdata->state: 1  thread_ack: 1  smdata: ffff8802202abe58 fn: ffffffff810e86da
cpu: 1 ack_state jiffies: 4294739993 smdata->state: 1  thread_ack: 1 smdata: ffff8802202abe58 fn: ffffffff810e86da
cpu: 1  increment state, smdata: ffff8802202abe58 fn: ffffffff810e86da
set_state, cpu: 1  state: 1 newstate: 2 smdata: ffff8802202abe58  fn: ffffffff810e86da
cpu: 1 ack_state end jiffies: 4294740017 smdata->state: 2  thread_ack: 4  smdata: ffff8802202abe58 fn: ffffffff810e86da
state-change: cpu: 3 loops: 24479339 jiffies: 4294740017 curstate: 1  smdata->state: 2  thread_ack: 4  smdata: ffff8802202abe58 fn: ffffffff810e86da
state-change: cpu: 0 loops: 23433125 jiffies: 4294740017 curstate: 1  smdata->state: 2  thread_ack: 4  smdata: ffff8802202abe58 fn: ffffffff810e86da
state-change: cpu: 2 loops: 23440505 jiffies: 4294740017 curstate: 1  smdata->state: 2  thread_ack: 4  smdata: ffff8802202abe58 fn: ffffffff810e86da
cpu: 3 ack_state jiffies: 4294740017 smdata->state: 2  thread_ack: 4 smdata: ffff8802202abe58 fn: ffffffff810e86da
cpu: 0 ack_state jiffies: 4294740017 smdata->state: 2  thread_ack: 4 smdata: ffff8802202abe58 fn: ffffffff810e86da
cpu: 2 ack_state jiffies: 4294740017 smdata->state: 2  thread_ack: 4 smdata: ffff8802202abe58 fn: ffffffff810e86da
cpu: 3 ack_state end jiffies: 4294740017 smdata->state: 2  thread_ack: 3  smdata: ffff8802202abe58 fn: ffffffff810e86da
cpu: 0 ack_state end jiffies: 4294740017 smdata->state: 2  thread_ack: 2  smdata: ffff8802202abe58 fn: ffffffff810e86da
cpu: 2 ack_state end jiffies: 4294740017 smdata->state: 2  thread_ack: 1  smdata: ffff8802202abe58 fn: ffffffff810e86da
state-change: cpu: 1 loops: 2 jiffies: 4294740017 curstate: 1  smdata->state: 2  thread_ack: 1  smdata: ffff8802202abe58 fn: ffffffff810e86da
cpu: 1 ack_state jiffies: 4294740017 smdata->state: 2  thread_ack: 1 smdata: ffff8802202abe58 fn: ffffffff810e86da
cpu: 1  increment state, smdata: ffff8802202abe58 fn: ffffffff810e86da
set_state, cpu: 1  state: 2 newstate: 3 smdata: ffff8802202abe58  fn: ffffffff810e86da
cpu: 1 ack_state end jiffies: 4294740017 smdata->state: 3  thread_ack: 4  smdata: ffff8802202abe58 fn: ffffffff810e86da
state-change: cpu: 0 loops: 42466689 jiffies: 4294740017 curstate: 2  smdata->state: 3  thread_ack: 4  smdata: ffff8802202abe58 fn: ffffffff810e86da
state-change: cpu: 2 loops: 42473881 jiffies: 4294740017 curstate: 2  smdata->state: 3  thread_ack: 4  smdata: ffff8802202abe58 fn: ffffffff810e86da
state-change: cpu: 3 loops: 43825073 jiffies: 4294740017 curstate: 2  smdata->state: 3  thread_ack: 4  smdata: ffff8802202abe58 fn: ffffffff810e86da
calling fn: cpu: 0 loops: 42466689 curstate: 3  smdata->state: 3  thread_ack: 4  smdata: ffff8802202abe58  fn: ffffffff810e86da
cpu: 2 ack_state jiffies: 4294740017 smdata->state: 3  thread_ack: 4 smdata: ffff8802202abe58 fn: ffffffff810e86da
cpu: 3 ack_state jiffies: 4294740017 smdata->state: 3  thread_ack: 4 smdata: ffff8802202abe58 fn: ffffffff810e86da
cpu: 0 ack_state jiffies: 4294740017 smdata->state: 3  thread_ack: 4 smdata: ffff8802202abe58 fn: ffffffff810e86da
cpu: 2 ack_state end jiffies: 4294740017 smdata->state: 3  thread_ack: 3  smdata: ffff8802202abe58 fn: ffffffff810e86da
cpu: 3 ack_state end jiffies: 4294740017 smdata->state: 3  thread_ack: 2  smdata: ffff8802202abe58 fn: ffffffff810e86da
cpu: 0 ack_state end jiffies: 4294740017 smdata->state: 3  thread_ack: 1  smdata: ffff8802202abe58 fn: ffffffff810e86da
state-change: cpu: 1 loops: 3 jiffies: 4294740017 curstate: 2  smdata->state: 3  thread_ack: 1  smdata: ffff8802202abe58 fn: ffffffff810e86da
cpu: 1 ack_state jiffies: 4294740017 smdata->state: 3  thread_ack: 1 smdata: ffff8802202abe58 fn: ffffffff810e86da
cpu: 1  increment state, smdata: ffff8802202abe58 fn: ffffffff810e86da
set_state, cpu: 1  state: 3 newstate: 4 smdata: ffff8802202abe58  fn: ffffffff810e86da
cpu: 1 ack_state end jiffies: 4294740017 smdata->state: 4  thread_ack: 4  smdata: ffff8802202abe58 fn: ffffffff810e86da
state-change: cpu: 2 loops: 62146520 jiffies: 4294740017 curstate: 3  smdata->state: 4  thread_ack: 4  smdata: ffff8802202abe58 fn: ffffffff810e86da
state-change: cpu: 0 loops: 62138953 jiffies: 4294740017 curstate: 3  smdata->state: 4  thread_ack: 4  smdata: ffff8802202abe58 fn: ffffffff810e86da
state-change: cpu: 3 loops: 64625424 jiffies: 4294740017 curstate: 3  smdata->state: 4  thread_ack: 4  smdata: ffff8802202abe58 fn: ffffffff810e86da
cpu: 2 ack_state jiffies: 4294740017 smdata->state: 4  thread_ack: 4 smdata: ffff8802202abe58 fn: ffffffff810e86da
cpu: 0 ack_state jiffies: 4294740017 smdata->state: 4  thread_ack: 4 smdata: ffff8802202abe58 fn: ffffffff810e86da
cpu: 3 ack_state jiffies: 4294740017 smdata->state: 4  thread_ack: 4 smdata: ffff8802202abe58 fn: ffffffff810e86da
cpu: 2 ack_state end jiffies: 4294740017 smdata->state: 4  thread_ack: 3  smdata: ffff8802202abe58 fn: ffffffff810e86da
cpu: 0 ack_state end jiffies: 4294740017 smdata->state: 4  thread_ack: 2  smdata: ffff8802202abe58 fn: ffffffff810e86da
cpu: 3 ack_state end jiffies: 4294740017 smdata->state: 4  thread_ack: 1  smdata: ffff8802202abe58 fn: ffffffff810e86da
state-change: cpu: 1 loops: 4 jiffies: 4294740435 curstate: 3  smdata->state: 4  thread_ack: 1  smdata: ffff8802202abe58 fn: ffffffff810e86da
cpu: 1 ack_state jiffies: 4294740447 smdata->state: 4  thread_ack: 1 smdata: ffff8802202abe58 fn: ffffffff810e86da
cpu: 1  increment state, smdata: ffff8802202abe58 fn: ffffffff810e86da
set_state, cpu: 1  state: 4 newstate: 5 smdata: ffff8802202abe58  fn: ffffffff810e86da
cpu: 1 ack_state end jiffies: 4294740471 smdata->state: 5  thread_ack: 4  smdata: ffff8802202abe58 fn: ffffffff810e86da
__stop_machine, num-threads: 4, fn: __unlink_module(ffffffff810e817b)  data: ffffffffa0ac5ab0  smdata: ffff8802202abe18
set_state, cpu: 0  state: 0 newstate: 1 smdata: ffff8802202abe18  fn: ffffffff810e817b
cpu: 0 loops: 1 jiffies: 4294740500  timeout: 4294740499 curstate: 0  smdata->state: 1  thread_ack: 4  smdata: ffff8802202abe18 fn: ffffffff810e817b
cpu: 3 loops: 1 jiffies: 4294740500  timeout: 4294740499 curstate: 0  smdata->state: 1  thread_ack: 4  smdata: ffff8802202abe18 fn: ffffffff810e817b
cpu: 1 loops: 1 jiffies: 4294740500  timeout: 4294740499 curstate: 0  smdata->state: 1  thread_ack: 4  smdata: ffff8802202abe18 fn: ffffffff810e817b
state-change: cpu: 3 loops: 1 jiffies: 4294740500 curstate: 0  smdata->state: 1  thread_ack: 4  smdata: ffff8802202abe18 fn: ffffffff810e817b
state-change: cpu: 1 loops: 1 jiffies: 4294740500 curstate: 0  smdata->state: 1  thread_ack: 4  smdata: ffff8802202abe18 fn: ffffffff810e817b
cpu: 3 ack_state jiffies: 4294740500 smdata->state: 1  thread_ack: 4 smdata: ffff8802202abe18 fn: ffffffff810e817b
cpu: 1 ack_state jiffies: 4294740500 smdata->state: 1  thread_ack: 4 smdata: ffff8802202abe18 fn: ffffffff810e817b
cpu: 3 ack_state end jiffies: 4294740500 smdata->state: 1  thread_ack: 3  smdata: ffff8802202abe18 fn: ffffffff810e817b
cpu: 1 ack_state end jiffies: 4294740500 smdata->state: 1  thread_ack: 2  smdata: ffff8802202abe18 fn: ffffffff810e817b
cpu: 2 loops: 1 jiffies: 4294740501  timeout: 4294740500 curstate: 0  smdata->state: 1  thread_ack: 2  smdata: ffff8802202abe18 fn: ffffffff810e817b
state-change: cpu: 2 loops: 1 jiffies: 4294740501 curstate: 0  smdata->state: 1  thread_ack: 2  smdata: ffff8802202abe18 fn: ffffffff810e817b
cpu: 2 ack_state jiffies: 4294740501 smdata->state: 1  thread_ack: 2 smdata: ffff8802202abe18 fn: ffffffff810e817b
cpu: 2 ack_state end jiffies: 4294740501 smdata->state: 1  thread_ack: 1  smdata: ffff8802202abe18 fn: ffffffff810e817b
state-change: cpu: 0 loops: 1 jiffies: 4294740651 curstate: 0  smdata->state: 1  thread_ack: 1  smdata: ffff8802202abe18 fn: ffffffff810e817b
cpu: 0 ack_state jiffies: 4294740664 smdata->state: 1  thread_ack: 1 smdata: ffff8802202abe18 fn: ffffffff810e817b
cpu: 0  increment state, smdata: ffff8802202abe18 fn: ffffffff810e817b
set_state, cpu: 0  state: 1 newstate: 2 smdata: ffff8802202abe18  fn: ffffffff810e817b
cpu: 0 ack_state end jiffies: 4294740688 smdata->state: 2  thread_ack: 4  smdata: ffff8802202abe18 fn: ffffffff810e817b
state-change: cpu: 3 loops: 23659570 jiffies: 4294740688 curstate: 1  smdata->state: 2  thread_ack: 4  smdata: ffff8802202abe18 fn: ffffffff810e817b
state-change: cpu: 1 loops: 23670374 jiffies: 4294740688 curstate: 1  smdata->state: 2  thread_ack: 4  smdata: ffff8802202abe18 fn: ffffffff810e817b
cpu: 3 ack_state jiffies: 4294740688 smdata->state: 2  thread_ack: 4 smdata: ffff8802202abe18 fn: ffffffff810e817b
cpu: 1 ack_state jiffies: 4294740688 smdata->state: 2  thread_ack: 4 smdata: ffff8802202abe18 fn: ffffffff810e817b
cpu: 3 ack_state end jiffies: 4294740688 smdata->state: 2  thread_ack: 3  smdata: ffff8802202abe18 fn: ffffffff810e817b
cpu: 1 ack_state end jiffies: 4294740688 smdata->state: 2  thread_ack: 2  smdata: ffff8802202abe18 fn: ffffffff810e817b
state-change: cpu: 0 loops: 2 jiffies: 4294740688 curstate: 1  smdata->state: 2  thread_ack: 2  smdata: ffff8802202abe18 fn: ffffffff810e817b
cpu: 0 ack_state jiffies: 4294740688 smdata->state: 2  thread_ack: 2 smdata: ffff8802202abe18 fn: ffffffff810e817b
cpu: 0 ack_state end jiffies: 4294740688 smdata->state: 2  thread_ack: 1  smdata: ffff8802202abe18 fn: ffffffff810e817b
------------[ cut here ]------------
WARNING: at /home/greearb/git/linux-3.9.dev.y/kernel/watchdog.c:245 watchdog_overflow_callback+0x9b/0xa6()
Hardware name: To be filled by O.E.M.
Watchdog detected hard LOCKUP on cpu 0
Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm 
gpio]
Pid: 8, comm: migration/0 Tainted: G         C O 3.9.4+ #70
Call Trace:
  <NMI>  [<ffffffff810963e1>] warn_slowpath_common+0x85/0x9f
  [<ffffffff8109649e>] warn_slowpath_fmt+0x46/0x48
  [<ffffffff810c4dfb>] ? sched_clock_cpu+0x44/0xce
  [<ffffffff81103cec>] watchdog_overflow_callback+0x9b/0xa6
  [<ffffffff811336cb>] __perf_event_overflow+0x137/0x1cb
  [<ffffffff8101db3f>] ? x86_perf_event_set_period+0x107/0x113
  [<ffffffff81133c1a>] perf_event_overflow+0x14/0x16
  [<ffffffff810230dc>] intel_pmu_handle_irq+0x2b0/0x32d
  [<ffffffff815cbdd1>] perf_event_nmi_handler+0x19/0x1b
  [<ffffffff815cb64a>] nmi_handle+0x55/0x7e
  [<ffffffff815cb71b>] do_nmi+0xa8/0x2db
  [<ffffffff815cadb1>] end_repeat_nmi+0x1e/0x2e
  [<ffffffff810f9945>] ? stop_machine_cpu_stop+0x88/0x267
  [<ffffffff810f9945>] ? stop_machine_cpu_stop+0x88/0x267
  [<ffffffff810f9945>] ? stop_machine_cpu_stop+0x88/0x267
  <<EOE>>  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30
  [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176
  [<ffffffff815c91e5>] ? __schedule+0x59f/0x5e7
  [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f
  [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11
  [<ffffffff810b4a09>] kthread+0xb5/0xbd
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
  [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
---[ end trace 9767454fd3f66a7f ]---
BUG: soft lockup - CPU#2 stuck for 22s! [migration/2:23]
Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm 
gpio]
CPU 2
Pid: 23, comm: migration/2 Tainted: G        WC O 3.9.4+ #70 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M.
RIP: 0010:[<ffffffff8109daab>]  [<ffffffff8109daab>] __do_softirq+0x93/0x23c
RSP: 0018:ffff88022bd03ef8  EFLAGS: 00000246
RAX: 0000000000000000 RBX: 0000000000000246 RCX: ffff88022bd0e506
RDX: ffff880222218010 RSI: ffff88022bd0e4f0 RDI: 0000000000000002
RBP: ffff88022bd03f78 R08: ffff88022bd0e770 R09: 0000000000000001
R10: ffff88022bd13de0 R11: 0000000000000002 R12: ffff88022bd03e68
R13: ffffffff815d15dd R14: ffff88022bd03f78 R15: ffff880222218000
FS:  0000000000000000(0000) GS:ffff88022bd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f22cc3e7000 CR3: 0000000001a0c000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process migration/2 (pid: 23, threadinfo ffff880222218000, task ffff880222210000)
Stack:
  ffff88022bd03f08 ffff880222218010 ffff880222219fd8 04208040810b79ef
  00000000fffc8ad1 000000022bd0e150 ffff880222218000 0000000000000038
  ffff880200000006 000000111671cb12 000000111671cb12 ffff88022bd0db00
Call Trace:
  <IRQ>
  [<ffffffff8109dce6>] irq_exit+0x4b/0xa8
  [<ffffffff815d26ff>] smp_apic_timer_interrupt+0x8b/0x99
  [<ffffffff815d15dd>] apic_timer_interrupt+0x6d/0x80
  <EOI>
  [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267
  [<ffffffff810f9afd>] ? stop_machine_cpu_stop+0x240/0x267
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30
  [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176
  [<ffffffff815c91e5>] ? __schedule+0x59f/0x5e7
  [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f
  [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11
  [<ffffffff810b4a09>] kthread+0xb5/0xbd
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
  [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
Code: 89 45 90 49 8d 44 24 10 4c 89 65 b0 65 8b 14 25 20 b0 00 00 48 89 45 88 89 55 ac 65 c7 04 25 40 13 01 00 00 00 00 00 fb 66 66 90 <66> 66 90 48 c7 c3 80 50 
a0 8
------------[ cut here ]------------
WARNING: at /home/greearb/git/linux-3.9.dev.y/kernel/watchdog.c:245 watchdog_overflow_callback+0x9b/0xa6()
Hardware name: To be filled by O.E.M.
Watchdog detected hard LOCKUP on cpu 3
Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm 
gpio]
Pid: 29, comm: migration/3 Tainted: G        WC O 3.9.4+ #70
Call Trace:
  <NMI>  [<ffffffff810963e1>] warn_slowpath_common+0x85/0x9f
  [<ffffffff8109649e>] warn_slowpath_fmt+0x46/0x48
  [<ffffffff810c4dfb>] ? sched_clock_cpu+0x44/0xce
  [<ffffffff81103cec>] watchdog_overflow_callback+0x9b/0xa6
  [<ffffffff811336cb>] __perf_event_overflow+0x137/0x1cb
  [<ffffffff8101db3f>] ? x86_perf_event_set_period+0x107/0x113
  [<ffffffff81133c1a>] perf_event_overflow+0x14/0x16
  [<ffffffff810230dc>] intel_pmu_handle_irq+0x2b0/0x32d
  [<ffffffff815cbdd1>] perf_event_nmi_handler+0x19/0x1b
  [<ffffffff815cb64a>] nmi_handle+0x55/0x7e
  [<ffffffff815cb71b>] do_nmi+0xa8/0x2db
  [<ffffffff815cadb1>] end_repeat_nmi+0x1e/0x2e
  [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267
  [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267
  [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267
  <<EOE>>  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30
  [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176
  [<ffffffff815c91e5>] ? __schedule+0x59f/0x5e7
  [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f
  [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11
  [<ffffffff810b4a09>] kthread+0xb5/0xbd
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
  [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
---[ end trace 9767454fd3f66a80 ]---
BUG: soft lockup - CPU#2 stuck for 22s! [migration/2:23]
Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm 
gpio]
CPU 2
Pid: 23, comm: migration/2 Tainted: G        WC O 3.9.4+ #70 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M.
RIP: 0010:[<ffffffff8109d6b7>]  [<ffffffff8109d6b7>] tasklet_action+0x61/0xcc
RSP: 0018:ffff88022bd03ed8  EFLAGS: 00000246
RAX: 0000000000000001 RBX: ffff880222185dc0 RCX: ffff88022bd0e506
RDX: ffff8802186f2fa8 RSI: ffff88022bd0e4f0 RDI: ffffffff81a050b0
RBP: ffff88022bd03ee8 R08: ffff88022bd0e770 R09: 0000000000000001
R10: ffff88022bd13de0 R11: 0000000000000002 R12: ffff88022bd03e48
R13: ffffffff815d15dd R14: ffff88022bd03ee8 R15: ffff8802186f2fb0
FS:  0000000000000000(0000) GS:ffff88022bd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f22cc3e7000 CR3: 0000000001a0c000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process migration/2 (pid: 23, threadinfo ffff880222218000, task ffff880222210000)
Stack:
  ffffffff81a050b0 ffff880222218000 ffff88022bd03f78 ffffffff8109db1f
  ffff88022bd03f08 ffff880222218010 ffff880222219fd8 04208040810b79ef
  00000000fffc8ad1 000000022bd0e150 ffff880222218000 0000000000000030
Call Trace:
  <IRQ>
  [<ffffffff8109db1f>] __do_softirq+0x107/0x23c
  [<ffffffff8109dce6>] irq_exit+0x4b/0xa8
  [<ffffffff815d26ff>] smp_apic_timer_interrupt+0x8b/0x99
  [<ffffffff815d15dd>] apic_timer_interrupt+0x6d/0x80
  <EOI>
  [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267
  [<ffffffff810f9afd>] ? stop_machine_cpu_stop+0x240/0x267
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30
  [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176
  [<ffffffff815c91e5>] ? __schedule+0x59f/0x5e7
  [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f
  [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11
  [<ffffffff810b4a09>] kthread+0xb5/0xbd
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
  [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
Code: da 00 00 65 48 89 04 25 88 e0 00 00 fb 66 66 90 66 66 90 eb 77 48 8b 1a 4c 8d 62 08 f0 0f ba 6a 08 01 19 c0 85 c0 75 2d 8b 42 10 <85> c0 75 20 f0 41 0f ba 
34 2
------------[ cut here ]------------
WARNING: at /home/greearb/git/linux-3.9.dev.y/kernel/watchdog.c:245 watchdog_overflow_callback+0x9b/0xa6()
Hardware name: To be filled by O.E.M.
Watchdog detected hard LOCKUP on cpu 1
Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm 
gpio]
Pid: 17, comm: migration/1 Tainted: G        WC O 3.9.4+ #70
Call Trace:
  <NMI>  [<ffffffff810963e1>] warn_slowpath_common+0x85/0x9f
  [<ffffffff8109649e>] warn_slowpath_fmt+0x46/0x48
  [<ffffffff810c4dfb>] ? sched_clock_cpu+0x44/0xce
  [<ffffffff81103cec>] watchdog_overflow_callback+0x9b/0xa6
  [<ffffffff811336cb>] __perf_event_overflow+0x137/0x1cb
  [<ffffffff8101db3f>] ? x86_perf_event_set_period+0x107/0x113
  [<ffffffff81133c1a>] perf_event_overflow+0x14/0x16
  [<ffffffff810230dc>] intel_pmu_handle_irq+0x2b0/0x32d
  [<ffffffff815cbdd1>] perf_event_nmi_handler+0x19/0x1b
  [<ffffffff815cb64a>] nmi_handle+0x55/0x7e
  [<ffffffff815cb71b>] do_nmi+0xa8/0x2db
  [<ffffffff815cadb1>] end_repeat_nmi+0x1e/0x2e
  [<ffffffff810f99a8>] ? stop_machine_cpu_stop+0xeb/0x267
  [<ffffffff810f99a8>] ? stop_machine_cpu_stop+0xeb/0x267
  [<ffffffff810f99a8>] ? stop_machine_cpu_stop+0xeb/0x267
  <<EOE>>  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff815ca1f4>] ? _raw_spin_unlock_irq+0x10/0x36
  [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30
  [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176
  [<ffffffff815c91e5>] ? __schedule+0x59f/0x5e7
  [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f
  [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11
  [<ffffffff810b4a09>] kthread+0xb5/0xbd
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
  [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
---[ end trace 9767454fd3f66a81 ]---
BUG: soft lockup - CPU#2 stuck for 22s! [migration/2:23]
Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm 
gpio]
CPU 2
Pid: 23, comm: migration/2 Tainted: G        WC O 3.9.4+ #70 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M.
RIP: 0010:[<ffffffff8109daab>]  [<ffffffff8109daab>] __do_softirq+0x93/0x23c
RSP: 0018:ffff88022bd03ef8  EFLAGS: 00000246
RAX: 0000000000000000 RBX: 0000000000000246 RCX: ffff88022bd0e506
RDX: ffff880222218010 RSI: ffff88022bd0e4f0 RDI: 0000000000000002
RBP: ffff88022bd03f78 R08: ffff88022bd0e770 R09: 0000000000000001
R10: ffff88022bd13de0 R11: 0000000000000002 R12: ffff88022bd03e68
R13: ffffffff815d15dd R14: ffff88022bd03f78 R15: ffff880222218000
FS:  0000000000000000(0000) GS:ffff88022bd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f22cc3e7000 CR3: 0000000001a0c000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process migration/2 (pid: 23, threadinfo ffff880222218000, task ffff880222210000)
Stack:
  ffff88022bd03f08 ffff880222218010 ffff880222219fd8 04208040810b79ef
  00000000fffc8ad1 000000022bd0e150 ffff880222218000 0000000000000038
  ffff880200000006 000000111671cb12 000000111671cb12 ffff88022bd0db00
Call Trace:
  <IRQ>
  [<ffffffff8109dce6>] irq_exit+0x4b/0xa8
  [<ffffffff815d26ff>] smp_apic_timer_interrupt+0x8b/0x99
  [<ffffffff815d15dd>] apic_timer_interrupt+0x6d/0x80
  <EOI>
  [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267
  [<ffffffff810f9afd>] ? stop_machine_cpu_stop+0x240/0x267
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30
  [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176
  [<ffffffff815c91e5>] ? __schedule+0x59f/0x5e7
  [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f
  [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11
  [<ffffffff810b4a09>] kthread+0xb5/0xbd
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
  [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
Code: 89 45 90 49 8d 44 24 10 4c 89 65 b0 65 8b 14 25 20 b0 00 00 48 89 45 88 89 55 ac 65 c7 04 25 40 13 01 00 00 00 00 00 fb 66 66 90 <66> 66 90 48 c7 c3 80 50 
a0 8
BUG: soft lockup - CPU#2 stuck for 22s! [migration/2:23]
Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm 
gpio]
CPU 2
Pid: 23, comm: migration/2 Tainted: G        WC O 3.9.4+ #70 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M.
RIP: 0010:[<ffffffff8109d712>]  [<ffffffff8109d712>] tasklet_action+0xbc/0xcc
RSP: 0018:ffff88022bd03ed8  EFLAGS: 00000202
RAX: 0000000000000040 RBX: ffff880222185dc0 RCX: ffff88022bd0e506
RDX: ffff8802186f2fa8 RSI: ffff88022bd0e4f0 RDI: 0000000000000006
RBP: ffff88022bd03ee8 R08: ffff88022bd0e770 R09: 0000000000000001
R10: ffff88022bd13de0 R11: 0000000000000002 R12: ffff88022bd03e48
R13: ffffffff815d15dd R14: ffff88022bd03ee8 R15: ffff8802186f2fb0
FS:  0000000000000000(0000) GS:ffff88022bd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f22cc3e7000 CR3: 0000000001a0c000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process migration/2 (pid: 23, threadinfo ffff880222218000, task ffff880222210000)
Stack:
  ffffffff81a050b0 ffff880222218000 ffff88022bd03f78 ffffffff8109db1f
  ffff88022bd03f08 ffff880222218010 ffff880222219fd8 04208040810b79ef
  00000000fffc8ad1 000000022bd0e150 ffff880222218000 0000000000000030
Call Trace:
  <IRQ>
  [<ffffffff8109db1f>] __do_softirq+0x107/0x23c
  [<ffffffff8109dce6>] irq_exit+0x4b/0xa8
  [<ffffffff815d26ff>] smp_apic_timer_interrupt+0x8b/0x99
  [<ffffffff815d15dd>] apic_timer_interrupt+0x6d/0x80
  <EOI>
  [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267
  [<ffffffff810f9afd>] ? stop_machine_cpu_stop+0x240/0x267
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30
  [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176
  [<ffffffff815c91e5>] ? __schedule+0x59f/0x5e7
  [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f
  [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11
  [<ffffffff810b4a09>] kthread+0xb5/0xbd
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
  [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
Code: 90 bf 06 00 00 00 48 c7 02 00 00 00 00 65 48 8b 04 25 88 e0 00 00 48 89 10 65 48 89 14 25 88 e0 00 00 e8 1c fe ff ff fb 66 66 90 <66> 66 90 48 89 da 48 85 
d2 7
BUG: soft lockup - CPU#2 stuck for 22s! [migration/2:23]
Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm 
gpio]
CPU 2
Pid: 23, comm: migration/2 Tainted: G        WC O 3.9.4+ #70 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M.
RIP: 0010:[<ffffffff8109d6e1>]  [<ffffffff8109d6e1>] tasklet_action+0x8b/0xcc
RSP: 0018:ffff88022bd03ed8  EFLAGS: 00000202
RAX: 0000000000000001 RBX: ffff880222185dc0 RCX: ffff88022bd0e506
RDX: ffff8802186f2fa8 RSI: ffff88022bd0e4f0 RDI: ffffffff81a050b0
RBP: ffff88022bd03ee8 R08: ffff88022bd0e770 R09: 0000000000000001
R10: ffff88022bd13de0 R11: 0000000000000002 R12: ffff88022bd03e48
R13: ffffffff815d15dd R14: ffff88022bd03ee8 R15: ffff8802186f2fb0
FS:  0000000000000000(0000) GS:ffff88022bd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f22cc3e7000 CR3: 0000000001a0c000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process migration/2 (pid: 23, threadinfo ffff880222218000, task ffff880222210000)
Stack:
  ffffffff81a050b0 ffff880222218000 ffff88022bd03f78 ffffffff8109db1f
  ffff88022bd03f08 ffff880222218010 ffff880222219fd8 04208040810b79ef
  00000000fffc8ad1 000000022bd0e150 ffff880222218000 0000000000000030
Call Trace:
  <IRQ>
  [<ffffffff8109db1f>] __do_softirq+0x107/0x23c
  [<ffffffff8109dce6>] irq_exit+0x4b/0xa8
  [<ffffffff815d26ff>] smp_apic_timer_interrupt+0x8b/0x99
  [<ffffffff815d15dd>] apic_timer_interrupt+0x6d/0x80
  <EOI>
  [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267
  [<ffffffff810f9afd>] ? stop_machine_cpu_stop+0x240/0x267
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30
  [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176
  [<ffffffff815c91e5>] ? __schedule+0x59f/0x5e7
  [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f
  [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11
  [<ffffffff810b4a09>] kthread+0xb5/0xbd
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
  [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
Code: 10 85 c0 75 20 f0 41 0f ba 34 24 00 19 c0 85 c0 75 04 0f 0b eb fe 48 8b 7a 20 ff 52 18 f0 41 80 24 24 fd eb 3a f0 41 80 24 24 fd <fa> 66 66 90 66 66 90 bf 
06 0
BUG: soft lockup - CPU#2 stuck for 22s! [migration/2:23]
Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm 
gpio]
CPU 2
Pid: 23, comm: migration/2 Tainted: G        WC O 3.9.4+ #70 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M.
RIP: 0010:[<ffffffff8109d6e1>]  [<ffffffff8109d6e1>] tasklet_action+0x8b/0xcc
RSP: 0018:ffff88022bd03ed8  EFLAGS: 00000202
RAX: 0000000000000001 RBX: ffff880222185dc0 RCX: ffff88022bd0e506
RDX: ffff8802186f2fa8 RSI: ffff88022bd0e4f0 RDI: ffffffff81a050b0
RBP: ffff88022bd03ee8 R08: ffff88022bd0e770 R09: 0000000000000001
R10: ffff88022bd13de0 R11: 0000000000000002 R12: ffff88022bd03e48
R13: ffffffff815d15dd R14: ffff88022bd03ee8 R15: ffff8802186f2fb0
FS:  0000000000000000(0000) GS:ffff88022bd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f22cc3e7000 CR3: 0000000001a0c000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process migration/2 (pid: 23, threadinfo ffff880222218000, task ffff880222210000)
Stack:
  ffffffff81a050b0 ffff880222218000 ffff88022bd03f78 ffffffff8109db1f
  ffff88022bd03f08 ffff880222218010 ffff880222219fd8 04208040810b79ef
  00000000fffc8ad1 000000022bd0e150 ffff880222218000 0000000000000030
Call Trace:
  <IRQ>
  [<ffffffff8109db1f>] __do_softirq+0x107/0x23c
  [<ffffffff8109dce6>] irq_exit+0x4b/0xa8
  [<ffffffff815d26ff>] smp_apic_timer_interrupt+0x8b/0x99
  [<ffffffff815d15dd>] apic_timer_interrupt+0x6d/0x80
  <EOI>
  [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267
  [<ffffffff810f9afd>] ? stop_machine_cpu_stop+0x240/0x267
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30
  [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176
  [<ffffffff815c91e5>] ? __schedule+0x59f/0x5e7
  [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f
  [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11
  [<ffffffff810b4a09>] kthread+0xb5/0xbd
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
  [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
Code: 10 85 c0 75 20 f0 41 0f ba 34 24 00 19 c0 85 c0 75 04 0f 0b eb fe 48 8b 7a 20 ff 52 18 f0 41 80 24 24 fd eb 3a f0 41 80 24 24 fd <fa> 66 66 90 66 66 90 bf 
06 0
BUG: soft lockup - CPU#2 stuck for 22s! [migration/2:23]
Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm 
gpio]
CPU 2
Pid: 23, comm: migration/2 Tainted: G        WC O 3.9.4+ #70 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M.
RIP: 0010:[<ffffffff8109dac6>]  [<ffffffff8109dac6>] __do_softirq+0xae/0x23c
RSP: 0018:ffff88022bd03ef8  EFLAGS: 00000282
RAX: 0000000000000000 RBX: 0000000000000246 RCX: ffff88022bd0e506
RDX: ffff880222218010 RSI: ffff88022bd0e4f0 RDI: 0000000000000002
RBP: ffff88022bd03f78 R08: ffff88022bd0e770 R09: 0000000000000001
R10: ffff88022bd13de0 R11: 0000000000000002 R12: ffff88022bd03e68
R13: ffffffff815d15dd R14: ffff88022bd03f78 R15: ffff880222218000
FS:  0000000000000000(0000) GS:ffff88022bd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f22cc3e7000 CR3: 0000000001a0c000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process migration/2 (pid: 23, threadinfo ffff880222218000, task ffff880222210000)
Stack:
  ffff88022bd03f08 ffff880222218010 ffff880222219fd8 04208040810b79ef
  00000000fffc8ad1 000000022bd0e150 ffff880222218000 0000000000000028
  ffff880200000006 000000111671cb12 000000111671cb12 ffff88022bd0db00
Call Trace:
  <IRQ>
  [<ffffffff8109dce6>] irq_exit+0x4b/0xa8
  [<ffffffff815d26ff>] smp_apic_timer_interrupt+0x8b/0x99
  [<ffffffff815d15dd>] apic_timer_interrupt+0x6d/0x80
  <EOI>
  [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267
  [<ffffffff810f9afd>] ? stop_machine_cpu_stop+0x240/0x267
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30
  [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176
  [<ffffffff815c91e5>] ? __schedule+0x59f/0x5e7
  [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f
  [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11
  [<ffffffff810b4a09>] kthread+0xb5/0xbd
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
  [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
Code: 65 c7 04 25 40 13 01 00 00 00 00 00 fb 66 66 90 66 66 90 48 c7 c3 80 50 a0 81 48 c7 45 b8 00 00 00 00 65 4c 8b 2c 25 48 c6 00 00 <41> f6 c7 01 0f 84 ba 00 
00 0
BUG: soft lockup - CPU#2 stuck for 23s! [migration/2:23]
Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm 
gpio]
CPU 2
Pid: 23, comm: migration/2 Tainted: G        WC O 3.9.4+ #70 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M.
RIP: 0010:[<ffffffff8109d69c>]  [<ffffffff8109d69c>] tasklet_action+0x46/0xcc
RSP: 0018:ffff88022bd03ed8  EFLAGS: 00000282
RAX: ffff88022bd0e080 RBX: ffff880222185dc0 RCX: ffff88022bd0e506
RDX: ffff8802186f2fa8 RSI: ffff88022bd0e4f0 RDI: ffffffff81a050b0
RBP: ffff88022bd03ee8 R08: ffff88022bd0e770 R09: 0000000000000001
R10: ffff88022bd13de0 R11: 0000000000000002 R12: ffff88022bd03e48
R13: ffffffff815d15dd R14: ffff88022bd03ee8 R15: ffff880222218000
FS:  0000000000000000(0000) GS:ffff88022bd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f22cc3e7000 CR3: 0000000001a0c000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process migration/2 (pid: 23, threadinfo ffff880222218000, task ffff880222210000)
Stack:
  ffffffff81a050b0 ffff880222218000 ffff88022bd03f78 ffffffff8109db1f
  ffff88022bd03f08 ffff880222218010 ffff880222219fd8 04208040810b79ef
  00000000fffc8ad1 000000022bd0e150 ffff880222218000 0000000000000030
Call Trace:
  <IRQ>
  [<ffffffff8109db1f>] __do_softirq+0x107/0x23c
  [<ffffffff8109dce6>] irq_exit+0x4b/0xa8
  [<ffffffff815d26ff>] smp_apic_timer_interrupt+0x8b/0x99
  [<ffffffff815d15dd>] apic_timer_interrupt+0x6d/0x80
  <EOI>
  [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267
  [<ffffffff810f9afd>] ? stop_machine_cpu_stop+0x240/0x267
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30
  [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176
  [<ffffffff815c91e5>] ? __schedule+0x59f/0x5e7
  [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f
  [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11
  [<ffffffff810b4a09>] kthread+0xb5/0xbd
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
  [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
Code: 48 8b 14 25 80 e0 00 00 65 48 c7 04 25 80 e0 00 00 00 00 00 00 65 48 03 04 25 d8 da 00 00 65 48 89 04 25 88 e0 00 00 fb 66 66 90 <66> 66 90 eb 77 48 8b 1a 
4c 8
BUG: soft lockup - CPU#2 stuck for 23s! [migration/2:23]
Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm 
gpio]
CPU 2
Pid: 23, comm: migration/2 Tainted: G        WC O 3.9.4+ #70 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M.
RIP: 0010:[<ffffffff8109d712>]  [<ffffffff8109d712>] tasklet_action+0xbc/0xcc
RSP: 0018:ffff88022bd03ed8  EFLAGS: 00000202
RAX: 0000000000000040 RBX: ffff880222185dc0 RCX: ffff88022bd0e506
RDX: ffff8802186f2fa8 RSI: ffff88022bd0e4f0 RDI: 0000000000000006
RBP: ffff88022bd03ee8 R08: ffff88022bd0e770 R09: 0000000000000001
R10: ffff88022bd13de0 R11: 0000000000000002 R12: ffff88022bd03e48
R13: ffffffff815d15dd R14: ffff88022bd03ee8 R15: ffff8802186f2fb0
FS:  0000000000000000(0000) GS:ffff88022bd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f22cc3e7000 CR3: 0000000001a0c000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process migration/2 (pid: 23, threadinfo ffff880222218000, task ffff880222210000)
Stack:
  ffffffff81a050b0 ffff880222218000 ffff88022bd03f78 ffffffff8109db1f
  ffff88022bd03f08 ffff880222218010 ffff880222219fd8 04208040810b79ef
  00000000fffc8ad1 000000022bd0e150 ffff880222218000 0000000000000030
Call Trace:
  <IRQ>
  [<ffffffff8109db1f>] __do_softirq+0x107/0x23c
  [<ffffffff8109dce6>] irq_exit+0x4b/0xa8
  [<ffffffff815d26ff>] smp_apic_timer_interrupt+0x8b/0x99
  [<ffffffff815d15dd>] apic_timer_interrupt+0x6d/0x80
  <EOI>
  [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267
  [<ffffffff810f9afd>] ? stop_machine_cpu_stop+0x240/0x267
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30
  [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176
  [<ffffffff815c91e5>] ? __schedule+0x59f/0x5e7
  [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f
  [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11
  [<ffffffff810b4a09>] kthread+0xb5/0xbd
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
  [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
Code: 90 bf 06 00 00 00 48 c7 02 00 00 00 00 65 48 8b 04 25 88 e0 00 00 48 89 10 65 48 89 14 25 88 e0 00 00 e8 1c fe ff ff fb 66 66 90 <66> 66 90 48 89 da 48 85 
d2 7
BUG: soft lockup - CPU#2 stuck for 23s! [migration/2:23]
Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm 
gpio]
CPU 2
Pid: 23, comm: migration/2 Tainted: G        WC O 3.9.4+ #70 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M.
RIP: 0010:[<ffffffff8109d712>]  [<ffffffff8109d712>] tasklet_action+0xbc/0xcc
RSP: 0018:ffff88022bd03ed8  EFLAGS: 00000202
RAX: 0000000000000040 RBX: ffff880222185dc0 RCX: ffff88022bd0e506
RDX: ffff8802186f2fa8 RSI: ffff88022bd0e4f0 RDI: 0000000000000006
RBP: ffff88022bd03ee8 R08: ffff88022bd0e770 R09: 0000000000000001
R10: ffff88022bd13de0 R11: 0000000000000002 R12: ffff88022bd03e48
R13: ffffffff815d15dd R14: ffff88022bd03ee8 R15: ffff8802186f2fb0
FS:  0000000000000000(0000) GS:ffff88022bd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f22cc3e7000 CR3: 0000000001a0c000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process migration/2 (pid: 23, threadinfo ffff880222218000, task ffff880222210000)
Stack:
  ffffffff81a050b0 ffff880222218000 ffff88022bd03f78 ffffffff8109db1f
  ffff88022bd03f08 ffff880222218010 ffff880222219fd8 04208040810b79ef
  00000000fffc8ad1 000000022bd0e150 ffff880222218000 0000000000000030
Call Trace:
  <IRQ>
  [<ffffffff8109db1f>] __do_softirq+0x107/0x23c
  [<ffffffff8109dce6>] irq_exit+0x4b/0xa8
  [<ffffffff815d26ff>] smp_apic_timer_interrupt+0x8b/0x99
  [<ffffffff815d15dd>] apic_timer_interrupt+0x6d/0x80
  <EOI>
  [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267
  [<ffffffff810f9afd>] ? stop_machine_cpu_stop+0x240/0x267
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30
  [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176
  [<ffffffff815c91e5>] ? __schedule+0x59f/0x5e7
  [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f
  [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11
  [<ffffffff810b4a09>] kthread+0xb5/0xbd
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
  [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
Code: 90 bf 06 00 00 00 48 c7 02 00 00 00 00 65 48 8b 04 25 88 e0 00 00 48 89 10 65 48 89 14 25 88 e0 00 00 e8 1c fe ff ff fb 66 66 90 <66> 66 90 48 89 da 48 85 
d2 7
BUG: soft lockup - CPU#2 stuck for 23s! [migration/2:23]
Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm 
gpio]
CPU 2
Pid: 23, comm: migration/2 Tainted: G        WC O 3.9.4+ #70 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M.
RIP: 0010:[<ffffffff8109d6ae>]  [<ffffffff8109d6ae>] tasklet_action+0x58/0xcc
RSP: 0018:ffff88022bd03ed8  EFLAGS: 00000282
RAX: ffff88022bd0e080 RBX: ffff880222185dc0 RCX: ffff88022bd0e506
RDX: ffff8802186f2fa8 RSI: ffff88022bd0e4f0 RDI: ffffffff81a050b0
RBP: ffff88022bd03ee8 R08: ffff88022bd0e770 R09: 0000000000000001
R10: ffff88022bd13de0 R11: 0000000000000002 R12: ffff88022bd03e48
R13: ffffffff815d15dd R14: ffff88022bd03ee8 R15: ffff8802186f2fb0
FS:  0000000000000000(0000) GS:ffff88022bd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f22cc3e7000 CR3: 0000000001a0c000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process migration/2 (pid: 23, threadinfo ffff880222218000, task ffff880222210000)
Stack:
  ffffffff81a050b0 ffff880222218000 ffff88022bd03f78 ffffffff8109db1f
  ffff88022bd03f08 ffff880222218010 ffff880222219fd8 04208040810b79ef
  00000000fffc8ad1 000000022bd0e150 ffff880222218000 0000000000000030
Call Trace:
  <IRQ>
  [<ffffffff8109db1f>] __do_softirq+0x107/0x23c
  [<ffffffff8109dce6>] irq_exit+0x4b/0xa8
  [<ffffffff815d26ff>] smp_apic_timer_interrupt+0x8b/0x99
  [<ffffffff815d15dd>] apic_timer_interrupt+0x6d/0x80
  <EOI>
  [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267
  [<ffffffff810f9afd>] ? stop_machine_cpu_stop+0x240/0x267
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30
  [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176
  [<ffffffff815c91e5>] ? __schedule+0x59f/0x5e7
  [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f
  [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11
  [<ffffffff810b4a09>] kthread+0xb5/0xbd
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
  [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
Code: 00 00 00 65 48 03 04 25 d8 da 00 00 65 48 89 04 25 88 e0 00 00 fb 66 66 90 66 66 90 eb 77 48 8b 1a 4c 8d 62 08 f0 0f ba 6a 08 01 <19> c0 85 c0 75 2d 8b 42 
10 8
BUG: soft lockup - CPU#2 stuck for 22s! [migration/2:23]
Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm 
gpio]
CPU 2
Pid: 23, comm: migration/2 Tainted: G        WC O 3.9.4+ #70 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M.
RIP: 0010:[<ffffffff8110a66f>]  [<ffffffff8110a66f>] rcu_bh_qs+0x9/0x22
RSP: 0018:ffff88022bd03ee8  EFLAGS: 00000246
RAX: 0000000000000040 RBX: 0000000000000246 RCX: ffff88022bd0e506
RDX: 0000000000000000 RSI: ffff88022bd0e4f0 RDI: 0000000000000002
RBP: ffff88022bd03ee8 R08: ffff88022bd0e770 R09: 0000000000000001
R10: ffff88022bd13de0 R11: 0000000000000002 R12: ffff88022bd03e58
R13: ffffffff815d15dd R14: ffff88022bd03ee8 R15: ffff880222218000
FS:  0000000000000000(0000) GS:ffff88022bd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f22cc3e7000 CR3: 0000000001a0c000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process migration/2 (pid: 23, threadinfo ffff880222218000, task ffff880222210000)
Stack:
  ffff88022bd03f78 ffffffff8109db8a ffff88022bd03f08 ffff880222218010
  ffff880222219fd8 04208040810b79ef 00000000fffc8ad1 000000022bd0e150
  ffff880222218000 0000000000000030 ffff880200000006 000000111671cb12
Call Trace:
  <IRQ>
  [<ffffffff8109db8a>] __do_softirq+0x172/0x23c
  [<ffffffff8109dce6>] irq_exit+0x4b/0xa8
  [<ffffffff815d26ff>] smp_apic_timer_interrupt+0x8b/0x99
  [<ffffffff815d15dd>] apic_timer_interrupt+0x6d/0x80
  <EOI>
  [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267
  [<ffffffff810f9afd>] ? stop_machine_cpu_stop+0x240/0x267
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30
  [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176
  [<ffffffff815c91e5>] ? __schedule+0x59f/0x5e7
  [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f
  [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11
  [<ffffffff810b4a09>] kthread+0xb5/0xbd
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
  [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
Code: 55 48 89 e5 66 66 66 66 90 48 c7 c0 f0 e4 00 00 48 63 ff 48 8b 14 fd 50 b5 ad 81 c6 44 10 10 01 c9 c3 55 48 89 e5 66 66 66 66 90 <48> c7 c0 30 e6 00 00 48 
63 f
BUG: soft lockup - CPU#2 stuck for 22s! [migration/2:23]
Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm 
gpio]
CPU 2
Pid: 23, comm: migration/2 Tainted: G        WC O 3.9.4+ #70 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M.
RIP: 0010:[<ffffffff8109dac6>]  [<ffffffff8109dac6>] __do_softirq+0xae/0x23c
RSP: 0018:ffff88022bd03ef8  EFLAGS: 00000296
RAX: 0000000000000000 RBX: 0000000000000246 RCX: ffff88022bd0e506
RDX: ffff880222218010 RSI: ffff88022bd0e4f0 RDI: 0000000000000002
RBP: ffff88022bd03f78 R08: ffff88022bd0e770 R09: 0000000000000001
R10: ffff88022bd13de0 R11: 0000000000000002 R12: ffff88022bd03e68
R13: ffffffff815d15dd R14: ffff88022bd03f78 R15: ffff880222218000
FS:  0000000000000000(0000) GS:ffff88022bd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f22cc3e7000 CR3: 0000000001a0c000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process migration/2 (pid: 23, threadinfo ffff880222218000, task ffff880222210000)
Stack:
  ffff88022bd03f08 ffff880222218010 ffff880222219fd8 04208040810b79ef
  00000000fffc8ad1 000000022bd0e150 ffff880222218000 0000000000000020
  ffff880200000006 000000111671cb12 000000111671cb12 ffff88022bd0db00
Call Trace:
  <IRQ>
  [<ffffffff8109dce6>] irq_exit+0x4b/0xa8
  [<ffffffff815d26ff>] smp_apic_timer_interrupt+0x8b/0x99
  [<ffffffff815d15dd>] apic_timer_interrupt+0x6d/0x80
  <EOI>
  [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267
  [<ffffffff810f9afd>] ? stop_machine_cpu_stop+0x240/0x267
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30
  [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176
  [<ffffffff815c91e5>] ? __schedule+0x59f/0x5e7
  [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f
  [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11
  [<ffffffff810b4a09>] kthread+0xb5/0xbd
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
  [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
Code: 65 c7 04 25 40 13 01 00 00 00 00 00 fb 66 66 90 66 66 90 48 c7 c3 80 50 a0 81 48 c7 45 b8 00 00 00 00 65 4c 8b 2c 25 48 c6 00 00 <41> f6 c7 01 0f 84 ba 00 
00 0
BUG: soft lockup - CPU#2 stuck for 22s! [migration/2:23]
Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm 
gpio]
CPU 2
Pid: 23, comm: migration/2 Tainted: G        WC O 3.9.4+ #70 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M.
RIP: 0010:[<ffffffff8110a66f>]  [<ffffffff8110a66f>] rcu_bh_qs+0x9/0x22
RSP: 0018:ffff88022bd03ee8  EFLAGS: 00000246
RAX: 0000000000000040 RBX: 0000000000000246 RCX: ffff88022bd0e506
RDX: 0000000000000000 RSI: ffff88022bd0e4f0 RDI: 0000000000000002
RBP: ffff88022bd03ee8 R08: ffff88022bd0e770 R09: 0000000000000001
R10: ffff88022bd13de0 R11: 0000000000000002 R12: ffff88022bd03e58
R13: ffffffff815d15dd R14: ffff88022bd03ee8 R15: ffff880222218000
FS:  0000000000000000(0000) GS:ffff88022bd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f22cc3e7000 CR3: 0000000001a0c000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process migration/2 (pid: 23, threadinfo ffff880222218000, task ffff880222210000)
Stack:
  ffff88022bd03f78 ffffffff8109db8a ffff88022bd03f08 ffff880222218010
  ffff880222219fd8 04208040810b79ef 00000000fffc8ad1 000000022bd0e150
  ffff880222218000 0000000000000030 ffff880200000006 000000111671cb12
Call Trace:
  <IRQ>
  [<ffffffff8109db8a>] __do_softirq+0x172/0x23c
  [<ffffffff8109dce6>] irq_exit+0x4b/0xa8
  [<ffffffff815d26ff>] smp_apic_timer_interrupt+0x8b/0x99
  [<ffffffff815d15dd>] apic_timer_interrupt+0x6d/0x80
  <EOI>
  [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267
  [<ffffffff810f9afd>] ? stop_machine_cpu_stop+0x240/0x267
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30
  [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176
  [<ffffffff815c91e5>] ? __schedule+0x59f/0x5e7
  [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f
  [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11
  [<ffffffff810b4a09>] kthread+0xb5/0xbd
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
  [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
Code: 55 48 89 e5 66 66 66 66 90 48 c7 c0 f0 e4 00 00 48 63 ff 48 8b 14 fd 50 b5 ad 81 c6 44 10 10 01 c9 c3 55 48 89 e5 66 66 66 66 90 <48> c7 c0 30 e6 00 00 48 
63 f
BUG: soft lockup - CPU#2 stuck for 22s! [migration/2:23]
Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm 
gpio]
CPU 2
Pid: 23, comm: migration/2 Tainted: G        WC O 3.9.4+ #70 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M.
RIP: 0010:[<ffffffff8109d6ae>]  [<ffffffff8109d6ae>] tasklet_action+0x58/0xcc
RSP: 0018:ffff88022bd03ed8  EFLAGS: 00000282
RAX: ffff88022bd0e080 RBX: ffff880222185dc0 RCX: ffff88022bd0e506
RDX: ffff8802186f2fa8 RSI: ffff88022bd0e4f0 RDI: ffffffff81a050b0
RBP: ffff88022bd03ee8 R08: ffff88022bd0e770 R09: 0000000000000001
R10: ffff88022bd13de0 R11: 0000000000000002 R12: ffff88022bd03e48
R13: ffffffff815d15dd R14: ffff88022bd03ee8 R15: ffff8802186f2fb0
FS:  0000000000000000(0000) GS:ffff88022bd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f22cc3e7000 CR3: 0000000001a0c000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process migration/2 (pid: 23, threadinfo ffff880222218000, task ffff880222210000)
Stack:
  ffffffff81a050b0 ffff880222218000 ffff88022bd03f78 ffffffff8109db1f
  ffff88022bd03f08 ffff880222218010 ffff880222219fd8 04208040810b79ef
  00000000fffc8ad1 000000022bd0e150 ffff880222218000 0000000000000030
Call Trace:
  <IRQ>
  [<ffffffff8109db1f>] __do_softirq+0x107/0x23c
  [<ffffffff8109dce6>] irq_exit+0x4b/0xa8
  [<ffffffff815d26ff>] smp_apic_timer_interrupt+0x8b/0x99
  [<ffffffff815d15dd>] apic_timer_interrupt+0x6d/0x80
  <EOI>
  [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267
  [<ffffffff810f9afd>] ? stop_machine_cpu_stop+0x240/0x267
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30
  [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176
  [<ffffffff815c91e5>] ? __schedule+0x59f/0x5e7
  [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f
  [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11
  [<ffffffff810b4a09>] kthread+0xb5/0xbd
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
  [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
Code: 00 00 00 65 48 03 04 25 d8 da 00 00 65 48 89 04 25 88 e0 00 00 fb 66 66 90 66 66 90 eb 77 48 8b 1a 4c 8d 62 08 f0 0f ba 6a 08 01 <19> c0 85 c0 75 2d 8b 42 
10 8
BUG: soft lockup - CPU#2 stuck for 22s! [migration/2:23]
Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm 
gpio]
CPU 2
Pid: 23, comm: migration/2 Tainted: G        WC O 3.9.4+ #70 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M.
RIP: 0010:[<ffffffff8109daab>]  [<ffffffff8109daab>] __do_softirq+0x93/0x23c
RSP: 0018:ffff88022bd03ef8  EFLAGS: 00000246
RAX: 0000000000000000 RBX: 0000000000000246 RCX: ffff88022bd0e506
RDX: ffff880222218010 RSI: ffff88022bd0e4f0 RDI: 0000000000000002
RBP: ffff88022bd03f78 R08: ffff88022bd0e770 R09: 0000000000000001
R10: ffff88022bd13de0 R11: 0000000000000002 R12: ffff88022bd03e68
R13: ffffffff815d15dd R14: ffff88022bd03f78 R15: ffff880222218000
FS:  0000000000000000(0000) GS:ffff88022bd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f22cc3e7000 CR3: 0000000001a0c000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process migration/2 (pid: 23, threadinfo ffff880222218000, task ffff880222210000)
Stack:
  ffff88022bd03f08 ffff880222218010 ffff880222219fd8 04208040810b79ef
  00000000fffc8ad1 000000022bd0e150 ffff880222218000 0000000000000038
  ffff880200000006 000000111671cb12 000000111671cb12 ffff88022bd0db00
Call Trace:
  <IRQ>
  [<ffffffff8109dce6>] irq_exit+0x4b/0xa8
  [<ffffffff815d26ff>] smp_apic_timer_interrupt+0x8b/0x99
  [<ffffffff815d15dd>] apic_timer_interrupt+0x6d/0x80
  <EOI>
  [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267
  [<ffffffff810f9afd>] ? stop_machine_cpu_stop+0x240/0x267
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30
  [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176
  [<ffffffff815c91e5>] ? __schedule+0x59f/0x5e7
  [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f
  [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11
  [<ffffffff810b4a09>] kthread+0xb5/0xbd
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
  [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
Code: 89 45 90 49 8d 44 24 10 4c 89 65 b0 65 8b 14 25 20 b0 00 00 48 89 45 88 89 55 ac 65 c7 04 25 40 13 01 00 00 00 00 00 fb 66 66 90 <66> 66 90 48 c7 c3 80 50 
a0 8
BUG: soft lockup - CPU#2 stuck for 22s! [migration/2:23]
Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm 
gpio]
CPU 2
Pid: 23, comm: migration/2 Tainted: G        WC O 3.9.4+ #70 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M.
RIP: 0010:[<ffffffff8109d6e1>]  [<ffffffff8109d6e1>] tasklet_action+0x8b/0xcc
RSP: 0018:ffff88022bd03ed8  EFLAGS: 00000202
RAX: 0000000000000001 RBX: ffff880222185dc0 RCX: ffff88022bd0e506
RDX: ffff8802186f2fa8 RSI: ffff88022bd0e4f0 RDI: ffffffff81a050b0
RBP: ffff88022bd03ee8 R08: ffff88022bd0e770 R09: 0000000000000001
R10: ffff88022bd13de0 R11: 0000000000000002 R12: ffff88022bd03e48
R13: ffffffff815d15dd R14: ffff88022bd03ee8 R15: ffff8802186f2fb0
FS:  0000000000000000(0000) GS:ffff88022bd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f22cc3e7000 CR3: 0000000001a0c000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process migration/2 (pid: 23, threadinfo ffff880222218000, task ffff880222210000)
Stack:
  ffffffff81a050b0 ffff880222218000 ffff88022bd03f78 ffffffff8109db1f
  ffff88022bd03f08 ffff880222218010 ffff880222219fd8 04208040810b79ef
  00000000fffc8ad1 000000022bd0e150 ffff880222218000 0000000000000030
Call Trace:
  <IRQ>
  [<ffffffff8109db1f>] __do_softirq+0x107/0x23c
  [<ffffffff8109dce6>] irq_exit+0x4b/0xa8
  [<ffffffff815d26ff>] smp_apic_timer_interrupt+0x8b/0x99
  [<ffffffff815d15dd>] apic_timer_interrupt+0x6d/0x80
  <EOI>
  [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267
  [<ffffffff810f9afd>] ? stop_machine_cpu_stop+0x240/0x267
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30
  [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176
  [<ffffffff815c91e5>] ? __schedule+0x59f/0x5e7
  [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f
  [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11
  [<ffffffff810b4a09>] kthread+0xb5/0xbd
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
  [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
Code: 10 85 c0 75 20 f0 41 0f ba 34 24 00 19 c0 85 c0 75 04 0f 0b eb fe 48 8b 7a 20 ff 52 18 f0 41 80 24 24 fd eb 3a f0 41 80 24 24 fd <fa> 66 66 90 66 66 90 bf 
06 0
BUG: soft lockup - CPU#2 stuck for 22s! [migration/2:23]
Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm 
gpio]
CPU 2
Pid: 23, comm: migration/2 Tainted: G        WC O 3.9.4+ #70 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M.
RIP: 0010:[<ffffffff8109d712>]  [<ffffffff8109d712>] tasklet_action+0xbc/0xcc
RSP: 0018:ffff88022bd03ed8  EFLAGS: 00000202
RAX: 0000000000000040 RBX: ffff880222185dc0 RCX: ffff88022bd0e506
RDX: ffff8802186f2fa8 RSI: ffff88022bd0e4f0 RDI: 0000000000000006
RBP: ffff88022bd03ee8 R08: ffff88022bd0e770 R09: 0000000000000001
R10: ffff88022bd13de0 R11: 0000000000000002 R12: ffff88022bd03e48
R13: ffffffff815d15dd R14: ffff88022bd03ee8 R15: ffff8802186f2fb0
FS:  0000000000000000(0000) GS:ffff88022bd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f22cc3e7000 CR3: 0000000001a0c000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process migration/2 (pid: 23, threadinfo ffff880222218000, task ffff880222210000)
Stack:
  ffffffff81a050b0 ffff880222218000 ffff88022bd03f78 ffffffff8109db1f
  ffff88022bd03f08 ffff880222218010 ffff880222219fd8 04208040810b79ef
  00000000fffc8ad1 000000022bd0e150 ffff880222218000 0000000000000030
Call Trace:
  <IRQ>
  [<ffffffff8109db1f>] __do_softirq+0x107/0x23c
  [<ffffffff8109dce6>] irq_exit+0x4b/0xa8
  [<ffffffff815d26ff>] smp_apic_timer_interrupt+0x8b/0x99
  [<ffffffff815d15dd>] apic_timer_interrupt+0x6d/0x80
  <EOI>
  [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267
  [<ffffffff810f9afd>] ? stop_machine_cpu_stop+0x240/0x267
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30
  [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176
  [<ffffffff815c91e5>] ? __schedule+0x59f/0x5e7
  [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f
  [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11
  [<ffffffff810b4a09>] kthread+0xb5/0xbd
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
  [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
Code: 90 bf 06 00 00 00 48 c7 02 00 00 00 00 65 48 8b 04 25 88 e0 00 00 48 89 10 65 48 89 14 25 88 e0 00 00 e8 1c fe ff ff fb 66 66 90 <66> 66 90 48 89 da 48 85 
d2 7
BUG: soft lockup - CPU#2 stuck for 22s! [migration/2:23]
Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm 
gpio]
CPU 2
Pid: 23, comm: migration/2 Tainted: G        WC O 3.9.4+ #70 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M.
RIP: 0010:[<ffffffff8109daab>]  [<ffffffff8109daab>] __do_softirq+0x93/0x23c
RSP: 0018:ffff88022bd03ef8  EFLAGS: 00000246
RAX: 0000000000000000 RBX: 0000000000000246 RCX: ffff88022bd0e506
RDX: ffff880222218010 RSI: ffff88022bd0e4f0 RDI: 0000000000000002
RBP: ffff88022bd03f78 R08: ffff88022bd0e770 R09: 0000000000000001
R10: ffff88022bd13de0 R11: 0000000000000002 R12: ffff88022bd03e68
R13: ffffffff815d15dd R14: ffff88022bd03f78 R15: ffff880222218000
FS:  0000000000000000(0000) GS:ffff88022bd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f22cc3e7000 CR3: 0000000001a0c000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process migration/2 (pid: 23, threadinfo ffff880222218000, task ffff880222210000)
Stack:
  ffff88022bd03f08 ffff880222218010 ffff880222219fd8 04208040810b79ef
  00000000fffc8ad1 000000022bd0e150 ffff880222218000 0000000000000038
  ffff880200000006 000000111671cb12 000000111671cb12 ffff88022bd0db00
Call Trace:
  <IRQ>
  [<ffffffff8109dce6>] irq_exit+0x4b/0xa8
  [<ffffffff815d26ff>] smp_apic_timer_interrupt+0x8b/0x99
  [<ffffffff815d15dd>] apic_timer_interrupt+0x6d/0x80
  <EOI>
  [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267
  [<ffffffff810f9afd>] ? stop_machine_cpu_stop+0x240/0x267
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30
  [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176
  [<ffffffff815c91e5>] ? __schedule+0x59f/0x5e7
  [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f
  [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11
  [<ffffffff810b4a09>] kthread+0xb5/0xbd
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
  [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
Code: 89 45 90 49 8d 44 24 10 4c 89 65 b0 65 8b 14 25 20 b0 00 00 48 89 45 88 89 55 ac 65 c7 04 25 40 13 01 00 00 00 00 00 fb 66 66 90 <66> 66 90 48 c7 c3 80 50 
a0 8
------------[ cut here ]------------
WARNING: at /home/greearb/git/linux-3.9.dev.y/kernel/watchdog.c:245 watchdog_overflow_callback+0x9b/0xa6()
Hardware name: To be filled by O.E.M.
Watchdog detected hard LOCKUP on cpu 2
Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm 
gpio]
Pid: 23, comm: migration/2 Tainted: G        WC O 3.9.4+ #70
Call Trace:
  <NMI>  [<ffffffff810963e1>] warn_slowpath_common+0x85/0x9f
  [<ffffffff8109649e>] warn_slowpath_fmt+0x46/0x48
  [<ffffffff810c4dfb>] ? sched_clock_cpu+0x44/0xce
  [<ffffffff81103cec>] watchdog_overflow_callback+0x9b/0xa6
  [<ffffffff811336cb>] __perf_event_overflow+0x137/0x1cb
  [<ffffffff8101db3f>] ? x86_perf_event_set_period+0x107/0x113
  [<ffffffff81133c1a>] perf_event_overflow+0x14/0x16
  [<ffffffff810230dc>] intel_pmu_handle_irq+0x2b0/0x32d
  [<ffffffff815cbdd1>] perf_event_nmi_handler+0x19/0x1b
  [<ffffffff815cb64a>] nmi_handle+0x55/0x7e
  [<ffffffff815cb71b>] do_nmi+0xa8/0x2db
  [<ffffffff815cadb1>] end_repeat_nmi+0x1e/0x2e
  [<ffffffff8109dad4>] ? __do_softirq+0xbc/0x23c
  [<ffffffff8109dad4>] ? __do_softirq+0xbc/0x23c
  [<ffffffff8109dad4>] ? __do_softirq+0xbc/0x23c
  <<EOE>>  <IRQ>  [<ffffffff8109dce6>] irq_exit+0x4b/0xa8
  [<ffffffff815d26ff>] smp_apic_timer_interrupt+0x8b/0x99
  [<ffffffff815d15dd>] apic_timer_interrupt+0x6d/0x80
  <EOI>  [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267
  [<ffffffff810f9afd>] ? stop_machine_cpu_stop+0x240/0x267
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30
  [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176
  [<ffffffff815c91e5>] ? __schedule+0x59f/0x5e7
  [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f
  [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11
  [<ffffffff810b4a09>] kthread+0xb5/0xbd
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
  [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0
  [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60
---[ end trace 9767454fd3f66a82 ]---





>
> Thanks.
>


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Please add to stable:  module: don't unlink the module until we've removed all exposure.
  2013-06-05 16:59                         ` Ben Greear
@ 2013-06-05 18:48                           ` Tejun Heo
  2013-06-05 19:11                             ` Ben Greear
  0 siblings, 1 reply; 51+ messages in thread
From: Tejun Heo @ 2013-06-05 18:48 UTC (permalink / raw)
  To: Ben Greear; +Cc: Rusty Russell, Joe Lawrence, Linux Kernel Mailing List, stable

Hello, Ben.

On Wed, Jun 05, 2013 at 09:59:00AM -0700, Ben Greear wrote:
> One pattern I notice repeating for at least most of the hangs is that all but one
> CPU thread has irqs disabled and is in state 2.  But, there will be one thread
> in state 1 that still has IRQs enabled and it is reported to be in soft-lockup
> instead of hard-lockup.  In 'sysrq l' it always shows some IRQ processing,
> but typically that of the sysrq itself.  I added printk that would always
> print if the thread notices that smdata->state != curstate, and the soft-lockup
> thread (cpu 2 below) never shows that message.

It sounds like one of the cpus gets live-locked by IRQs.  I can't tell
why the situation is made worse by other CPUs being tied up.  Do you
ever see CPUs being live locked by IRQs during normal operation?

> I thought it might be because it was reading stale smdata->state, so I changed
> that to atomic_t hoping that would mitigate that.  I also tried adding smp_rmb()
> below the cpu_relax().  Neither had any effect, so I am left assuming that the

I looked at the code again and the memory accesses seem properly
interlocked.  It's a bit tricky and should probably have used spinlock
instead considering it's already a hugely expensive path anyway, but
it does seem correct to me.
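
For reference, the handshake being discussed can be modelled in userspace
roughly as below.  This is only a sketch under stated assumptions, not the
kernel code: the sm_* names are hypothetical, C11 atomics stand in for the
kernel's atomic_t and barriers, and the field names merely mirror the
state/thread_ack values printed in the debug output in this thread.

```c
#include <stdatomic.h>

/* Hypothetical userspace model; field names mirror the debug log only. */
struct sm_model {
	int num_threads;
	atomic_int thread_ack;	/* acks still outstanding for this state */
	atomic_int state;	/* current step of the state machine */
};

static void sm_init(struct sm_model *sm, int num_threads)
{
	sm->num_threads = num_threads;
	atomic_init(&sm->thread_ack, num_threads);
	atomic_init(&sm->state, 0);
}

/* Re-arm the ack counter first, then publish the new state. */
static void sm_set_state(struct sm_model *sm, int newstate)
{
	atomic_store_explicit(&sm->thread_ack, sm->num_threads,
			      memory_order_relaxed);
	atomic_store_explicit(&sm->state, newstate, memory_order_release);
}

/* The last CPU to ack (counter goes 1 -> 0) advances the machine. */
static void sm_ack_state(struct sm_model *sm)
{
	if (atomic_fetch_sub_explicit(&sm->thread_ack, 1,
				      memory_order_acq_rel) == 1)
		sm_set_state(sm, atomic_load_explicit(&sm->state,
						      memory_order_relaxed) + 1);
}

/*
 * One pass of the per-CPU spin loop: act only on a state this CPU has
 * not handled yet.  Returns 1 if a new state was seen and acked.
 */
static int sm_poll(struct sm_model *sm, int *curstate)
{
	int s = atomic_load_explicit(&sm->state, memory_order_acquire);

	if (s == *curstate)
		return 0;
	*curstate = s;
	sm_ack_state(sm);
	return 1;
}
```

With four pollers, three acks leave thread_ack at 1 and the state unchanged;
only the fourth ack moves the state forward.  So in this model a single CPU
that never gets back to its poll loop is enough to hold every other CPU in
the current state.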

> thread instead is stuck handling IRQs and never gets out of the IRQ handler.

Seems that way to me too.

> Maybe since I have 2 real cores, and 3 processes busy-spinning on their CPU cores,
> the remaining process can just never handle all the IRQs and get back to the
> cpu shutdown state machine?  The various soft-hang stacks below show at least slightly
> different stacks, so I assume that thread is doing at least something.

What's the source of all those IRQs tho?  I don't think the IRQs are
from actual events.  The system is quiesced.  Even if it's from
receiving packets, it's gonna quiet down pretty quickly.  The hang
doesn't go away if you disconnect the network cable while hung, right?

What could be happening is that IRQ handling is handled by a thread
but the IRQ handler itself doesn't clear the IRQ properly and depends
on the handling thread to clear the condition.  If no CPU is available
for scheduling, it might end up raising and re-raising IRQs for the
same condition without ever being handled.  If that's the case, such
lockup could happen on a normally functioning UP machine or if the IRQ
is pinned to a single CPU which happens to be running the handling
thread.  At any rate, it'd be a plain live-lock bug on the driver
side.

Can you please try to confirm the specific interrupt being
continuously raised?  Detecting the hang shouldn't be too difficult.
Just recording the starting jiffies and if progress hasn't been made
for, say, ten seconds, it can set a flag and then print the IRQs being
handled if the flag is set.  If it indeed is the ath device, we
probably wanna get the driver maintainer involved.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Please add to stable:  module: don't unlink the module until we've removed all exposure.
  2013-06-05 18:48                           ` Tejun Heo
@ 2013-06-05 19:11                             ` Ben Greear
  2013-06-05 19:31                               ` stop_machine lockup issue in 3.9.y Ben Greear
  0 siblings, 1 reply; 51+ messages in thread
From: Ben Greear @ 2013-06-05 19:11 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Rusty Russell, Joe Lawrence, Linux Kernel Mailing List, stable

On 06/05/2013 11:48 AM, Tejun Heo wrote:
> Hello, Ben.
>
> On Wed, Jun 05, 2013 at 09:59:00AM -0700, Ben Greear wrote:
>> One pattern I notice repeating for at least most of the hangs is that all but one
>> CPU thread has irqs disabled and is in state 2.  But, there will be one thread
>> in state 1 that still has IRQs enabled and it is reported to be in soft-lockup
>> instead of hard-lockup.  In 'sysrq l' it always shows some IRQ processing,
>> but typically that of the sysrq itself.  I added printk that would always
>> print if the thread notices that smdata->state != curstate, and the soft-lockup
>> thread (cpu 2 below) never shows that message.
>
> It sounds like one of the cpus gets live-locked by IRQs.  I can't tell
> why the situation is made worse by other CPUs being tied up.  Do you
> ever see CPUs being live locked by IRQs during normal operation?

No, I have not noticed any live locks aside from this, at least in
the 3.9 kernels.

>> I thought it might be because it was reading stale smdata->state, so I changed
>> that to atomic_t hoping that would mitigate that.  I also tried adding smp_rmb()
>> below the cpu_relax().  Neither had any effect, so I am left assuming that the
>
> I looked at the code again and the memory accesses seem properly
> interlocked.  It's a bit tricky and should probably have used spinlock
> instead considering it's already a hugely expensive path anyway, but
> it does seem correct to me.
>
>> thread instead is stuck handling IRQs and never gets out of the IRQ handler.
>
> Seems that way to me too.
>
>> Maybe since I have 2 real cores, and 3 processes busy-spinning on their CPU cores,
>> the remaining process can just never handle all the IRQs and get back to the
>> cpu shutdown state machine?  The various soft-hang stacks below show at least slightly
>> different stacks, so I assume that thread is doing at least something.
>
> What's the source of all those IRQs tho?  I don't think the IRQs are
> from actual events.  The system is quiesced.  Even if it's from
> receiving packets, it's gonna quiet down pretty quickly.  The hang
> doesn't go away if you disconnect the network cable while hung, right?
>
> What could be happening is that IRQ handling is handled by a thread
> but the IRQ handler itself doesn't clear the IRQ properly and depends
> on the handling thread to clear the condition.  If no CPU is available
> for scheduling, it might end up raising and re-raising IRQs for the
> same condition without ever being handled.  If that's the case, such
> lockup could happen on a normally functioning UP machine or if the IRQ
> is pinned to a single CPU which happens to be running the handling
> thread.  At any rate, it'd be a plain live-lock bug on the driver
> side.
>
> Can you please try to confirm the specific interrupt being
> continuously raised?  Detecting the hang shouldn't be too difficult.
> Just recording the starting jiffies and if progress hasn't been made
> for, say, ten seconds, it can set a flag and then print the IRQs being
> handled if the flag is set.  If it indeed is the ath device, we
> probably wanna get the driver maintainer involved.

I am not sure how to tell which IRQ is being handled.  Do the
stack traces (showing smp_apic_timer_interrupt, for instance)
indicate potential culprits, or is that more a symptom of just
when the soft-lockup check is called?


Where should I add code to print out irqs?  In the lockup state,
the thread (probably) stuck handling irqs isn't executing any code in
the stop_machine file as far as I can tell.

Maybe I need to instrument the __do_softirq or similar method?

For what it's worth, previous debugging appears to show that jiffies
stops incrementing in many of these lockups.

Also, I have been trying for 20+ minutes to reproduce the lockup
with the ath9k module removed (and my user-space app that uses it
stopped), and I have not reproduced it yet.  So, possibly it is
related to ath9k, but my user-space app pokes at lots of other
stuff and starts loads of dhcp client processes and such too,
so not sure yet.


Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply	[flat|nested] 51+ messages in thread

* stop_machine lockup issue in 3.9.y.
  2013-06-05 19:11                             ` Ben Greear
@ 2013-06-05 19:31                               ` Ben Greear
  2013-06-05 20:58                                 ` Ben Greear
  0 siblings, 1 reply; 51+ messages in thread
From: Ben Greear @ 2013-06-05 19:31 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Rusty Russell, Joe Lawrence, Linux Kernel Mailing List, stable

This is no longer really about the module unlink, so changing
subject.

On 06/05/2013 12:11 PM, Ben Greear wrote:
> On 06/05/2013 11:48 AM, Tejun Heo wrote:
>> Hello, Ben.
>>
>> On Wed, Jun 05, 2013 at 09:59:00AM -0700, Ben Greear wrote:
>>> One pattern I notice repeating for at least most of the hangs is that all but one
>>> CPU thread has irqs disabled and is in state 2.  But, there will be one thread
>>> in state 1 that still has IRQs enabled and it is reported to be in soft-lockup
>>> instead of hard-lockup.  In 'sysrq l' it always shows some IRQ processing,
>>> but typically that of the sysrq itself.  I added printk that would always
>>> print if the thread notices that smdata->state != curstate, and the soft-lockup
>>> thread (cpu 2 below) never shows that message.
>>
>> It sounds like one of the cpus gets live-locked by IRQs.  I can't tell
>> why the situation is made worse by other CPUs being tied up.  Do you
>> ever see CPUs being live locked by IRQs during normal operation?
>
> No, I have not noticed any live locks aside from this, at least in
> the 3.9 kernels.
>
>>> I thought it might be because it was reading stale smdata->state, so I changed
>>> that to atomic_t hoping that would mitigate that.  I also tried adding smp_rmb()
>>> below the cpu_relax().  Neither had any effect, so I am left assuming that the
>>
>> I looked at the code again and the memory accesses seem properly
>> interlocked.  It's a bit tricky and should probably have used spinlock
>> instead considering it's already a hugely expensive path anyway, but
>> it does seem correct to me.
>>
>>> thread instead is stuck handling IRQs and never gets out of the IRQ handler.
>>
>> Seems that way to me too.
>>
>>> Maybe since I have 2 real cores, and 3 processes busy-spinning on their CPU cores,
>>> the remaining process can just never handle all the IRQs and get back to the
>>> cpu shutdown state machine?  The various soft-hang stacks below show at least slightly
>>> different stacks, so I assume that thread is doing at least something.
>>
>> What's the source of all those IRQs tho?  I don't think the IRQs are
>> from actual events.  The system is quiesced.  Even if it's from
>> receiving packets, it's gonna quiet down pretty quickly.  The hang
>> doesn't go away if you disconnect the network cable while hung, right?
>>
>> What could be happening is that IRQ handling is handled by a thread
>> but the IRQ handler itself doesn't clear the IRQ properly and depends
>> on the handling thread to clear the condition.  If no CPU is available
>> for scheduling, it might end up raising and re-raising IRQs for the
>> same condition without ever being handled.  If that's the case, such
>> lockup could happen on a normally functioning UP machine or if the IRQ
>> is pinned to a single CPU which happens to be running the handling
>> thread.  At any rate, it'd be a plain live-lock bug on the driver
>> side.
>>
>> Can you please try to confirm the specific interrupt being
>> continuously raised?  Detecting the hang shouldn't be too difficult.
>> Just recording the starting jiffies and if progress hasn't been made
>> for, say, ten seconds, it can set a flag and then print the IRQs being
>> handled if the flag is set.  If it indeed is the ath device, we
>> probably wanna get the driver maintainer involved.
>
> I am not sure how to tell which IRQ is being handled.  Do the
> stack traces (showing smp_apic_timer_interrupt, for instance)
> indicate potential culprits, or is that more a symptom of just
> when the soft-lockup check is called?
>
>
> Where should I add code to print out irqs?  In the lockup state,
> the thread (probably) stuck handling irqs isn't executing any code in
> the stop_machine file as far as I can tell.
>
> Maybe I need to instrument the __do_softirq or similar method?
>
> For what it's worth, previous debugging appears to show that jiffies
> stops incrementing in many of these lockups.
>
> Also, I have been trying for 20+ minutes to reproduce the lockup
> with the ath9k module removed (and my user-space app that uses it
> stopped), and I have not reproduced it yet.  So, possibly it is
> related to ath9k, but my user-space app pokes at lots of other
> stuff and starts loads of dhcp client processes and such too,
> so not sure yet.


I re-added ath9k, turned on my app (to create 400 stations, etc),
re-started the module unload/load loop:
for i in `seq 10000`; do modprobe macvlan; rmmod macvlan; done
and hit the problem fairly quickly.

This is on stock 3.9.4, with Rusty's kobj patch, and some printk debugging
I added to the stop_machine.c file.

Perhaps interestingly, I do see an ath9k warning/error in this log as well.

Also, since lockdep is enabled, we get some irq printouts.  Does this mean
anything to you?

__stop_machine(upstream): num-threads: 4,  fn: __try_stop_module(ffffffff810f330a)  data: ffff8801c594bf28  smdata: ffff8801c594be58
set_state, cpu: 0  state: 0 newstate: 1 smdata: ffff8801c594be58  fn: ffffffff810f330a
cpu: 1 loops: 1 jiffies: 4297177162  timeout: 4297177161 curstate: 0  smdata->state: 1  thread_ack: 4  smdata: ffff8801c594be58 fn: ffffffff810f330a
cpu: 0 loops: 1 jiffies: 4297177162  timeout: 4297177161 curstate: 0  smdata->state: 1  thread_ack: 4  smdata: ffff8801c594be58 fn: ffffffff810f330a
state-change: cpu: 0 loops: 1 jiffies: 4297177162 curstate: 0  smdata->state: 1  thread_ack: 4  smdata: ffff8801c594be58 fn: ffffffff810f330a
cpu: 0 ack_state jiffies: 4297177162 smdata->state: 1  thread_ack: 4 smdata: ffff8801c594be58 fn: ffffffff810f330a
cpu: 3 loops: 1 jiffies: 4297177162  timeout: 4297177161 curstate: 0  smdata->state: 1  thread_ack: 4  smdata: ffff8801c594be58 fn: ffffffff810f330a
cpu: 0 ack_state end jiffies: 4297177162 smdata->state: 1  thread_ack: 3  smdata: ffff8801c594be58 fn: ffffffff810f330a
cpu: 2 loops: 1 jiffies: 4297177162  timeout: 4297177161 curstate: 0  smdata->state: 1  thread_ack: 3  smdata: ffff8801c594be58 fn: ffffffff810f330a
state-change: cpu: 3 loops: 1 jiffies: 4297177162 curstate: 0  smdata->state: 1  thread_ack: 3  smdata: ffff8801c594be58 fn: ffffffff810f330a
cpu: 3 ack_state jiffies: 4297177162 smdata->state: 1  thread_ack: 3 smdata: ffff8801c594be58 fn: ffffffff810f330a
state-change: cpu: 2 loops: 1 jiffies: 4297177162 curstate: 0  smdata->state: 1  thread_ack: 3  smdata: ffff8801c594be58 fn: ffffffff810f330a
cpu: 3 ack_state end jiffies: 4297177162 smdata->state: 1  thread_ack: 2  smdata: ffff8801c594be58 fn: ffffffff810f330a
cpu: 2 ack_state jiffies: 4297177162 smdata->state: 1  thread_ack: 2 smdata: ffff8801c594be58 fn: ffffffff810f330a
cpu: 2 ack_state end jiffies: 4297177162 smdata->state: 1  thread_ack: 1  smdata: ffff8801c594be58 fn: ffffffff810f330a
state-change: cpu: 1 loops: 1 jiffies: 4297177313 curstate: 0  smdata->state: 1  thread_ack: 1  smdata: ffff8801c594be58 fn: ffffffff810f330a
cpu: 1 ack_state jiffies: 4297177326 smdata->state: 1  thread_ack: 1 smdata: ffff8801c594be58 fn: ffffffff810f330a
cpu: 1  increment state, smdata: ffff8801c594be58 fn: ffffffff810f330a
set_state, cpu: 1  state: 1 newstate: 2 smdata: ffff8801c594be58  fn: ffffffff810f330a
cpu: 1 ack_state end jiffies: 4297177350 smdata->state: 2  thread_ack: 4  smdata: ffff8801c594be58 fn: ffffffff810f330a
state-change: cpu: 0 loops: 12226751 jiffies: 4297177350 curstate: 1  smdata->state: 2  thread_ack: 4  smdata: ffff8801c594be58 fn: ffffffff810f330a
state-change: cpu: 2 loops: 12286477 jiffies: 4297177350 curstate: 1  smdata->state: 2  thread_ack: 4  smdata: ffff8801c594be58 fn: ffffffff810f330a
cpu: 0 ack_state jiffies: 4297177350 smdata->state: 2  thread_ack: 4 smdata: ffff8801c594be58 fn: ffffffff810f330a
cpu: 0 ack_state end jiffies: 4297177350 smdata->state: 2  thread_ack: 3  smdata: ffff8801c594be58 fn: ffffffff810f330a
state-change: cpu: 3 loops: 25634226 jiffies: 4297177350 curstate: 1  smdata->state: 2  thread_ack: 4  smdata: ffff8801c594be58 fn: ffffffff810f330a
cpu: 3 ack_state jiffies: 4297177350 smdata->state: 2  thread_ack: 3 smdata: ffff8801c594be58 fn: ffffffff810f330a
cpu: 3 ack_state end jiffies: 4297177350 smdata->state: 2  thread_ack: 2  smdata: ffff8801c594be58 fn: ffffffff810f330a
cpu: 2 ack_state jiffies: 4297177350 smdata->state: 2  thread_ack: 2 smdata: ffff8801c594be58 fn: ffffffff810f330a
cpu: 2 ack_state end jiffies: 4297177350 smdata->state: 2  thread_ack: 1  smdata: ffff8801c594be58 fn: ffffffff810f330a
state-change: cpu: 1 loops: 2 jiffies: 4297177350 curstate: 1  smdata->state: 2  thread_ack: 1  smdata: ffff8801c594be58 fn: ffffffff810f330a
cpu: 1 ack_state jiffies: 4297177350 smdata->state: 2  thread_ack: 1 smdata: ffff8801c594be58 fn: ffffffff810f330a
cpu: 1  increment state, smdata: ffff8801c594be58 fn: ffffffff810f330a
set_state, cpu: 1  state: 2 newstate: 3 smdata: ffff8801c594be58  fn: ffffffff810f330a
cpu: 1 ack_state end jiffies: 4297177350 smdata->state: 3  thread_ack: 4  smdata: ffff8801c594be58 fn: ffffffff810f330a
state-change: cpu: 3 loops: 47817385 jiffies: 4297177350 curstate: 2  smdata->state: 3  thread_ack: 4  smdata: ffff8801c594be58 fn: ffffffff810f330a
state-change: cpu: 2 loops: 31933478 jiffies: 4297177350 curstate: 2  smdata->state: 3  thread_ack: 4  smdata: ffff8801c594be58 fn: ffffffff810f330a
state-change: cpu: 0 loops: 31875466 jiffies: 4297177350 curstate: 2  smdata->state: 3  thread_ack: 4  smdata: ffff8801c594be58 fn: ffffffff810f330a
cpu: 2 ack_state jiffies: 4297177350 smdata->state: 3  thread_ack: 4 smdata: ffff8801c594be58 fn: ffffffff810f330a
cpu: 3 ack_state jiffies: 4297177350 smdata->state: 3  thread_ack: 4 smdata: ffff8801c594be58 fn: ffffffff810f330a
calling fn: cpu: 0 loops: 31875466 curstate: 3  smdata->state: 3  thread_ack: 4  smdata: ffff8801c594be58  fn: ffffffff810f330a
cpu: 2 ack_state end jiffies: 4297177350 smdata->state: 3  thread_ack: 3  smdata: ffff8801c594be58 fn: ffffffff810f330a
cpu: 0 ack_state jiffies: 4297177350 smdata->state: 3  thread_ack: 2 smdata: ffff8801c594be58 fn: ffffffff810f330a
cpu: 3 ack_state end jiffies: 4297177350 smdata->state: 3  thread_ack: 2  smdata: ffff8801c594be58 fn: ffffffff810f330a
cpu: 0 ack_state end jiffies: 4297177350 smdata->state: 3  thread_ack: 1  smdata: ffff8801c594be58 fn: ffffffff810f330a
state-change: cpu: 1 loops: 3 jiffies: 4297177350 curstate: 2  smdata->state: 3  thread_ack: 1  smdata: ffff8801c594be58 fn: ffffffff810f330a
cpu: 1 ack_state jiffies: 4297177350 smdata->state: 3  thread_ack: 1 smdata: ffff8801c594be58 fn: ffffffff810f330a
cpu: 1  increment state, smdata: ffff8801c594be58 fn: ffffffff810f330a
set_state, cpu: 1  state: 3 newstate: 4 smdata: ffff8801c594be58  fn: ffffffff810f330a
cpu: 1 ack_state end jiffies: 4297177350 smdata->state: 4  thread_ack: 4  smdata: ffff8801c594be58 fn: ffffffff810f330a
state-change: cpu: 3 loops: 71662403 jiffies: 4297177350 curstate: 3  smdata->state: 4  thread_ack: 4  smdata: ffff8801c594be58 fn: ffffffff810f330a
state-change: cpu: 2 loops: 53246516 jiffies: 4297177350 curstate: 3  smdata->state: 4  thread_ack: 4  smdata: ffff8801c594be58 fn: ffffffff810f330a
state-change: cpu: 0 loops: 53187256 jiffies: 4297177350 curstate: 3  smdata->state: 4  thread_ack: 4  smdata: ffff8801c594be58 fn: ffffffff810f330a
cpu: 2 ack_state jiffies: 4297177350 smdata->state: 4  thread_ack: 4 smdata: ffff8801c594be58 fn: ffffffff810f330a
cpu: 0 ack_state jiffies: 4297177350 smdata->state: 4  thread_ack: 4 smdata: ffff8801c594be58 fn: ffffffff810f330a
cpu: 2 ack_state end jiffies: 4297177350 smdata->state: 4  thread_ack: 3  smdata: ffff8801c594be58 fn: ffffffff810f330a
cpu: 0 ack_state end jiffies: 4297177350 smdata->state: 4  thread_ack: 2  smdata: ffff8801c594be58 fn: ffffffff810f330a
cpu: 3 ack_state jiffies: 4297177350 smdata->state: 4  thread_ack: 4 smdata: ffff8801c594be58 fn: ffffffff810f330a
cpu: 3 ack_state end jiffies: 4297177656 smdata->state: 4  thread_ack: 1  smdata: ffff8801c594be58 fn: ffffffff810f330a
state-change: cpu: 1 loops: 4 jiffies: 4297177768 curstate: 3  smdata->state: 4  thread_ack: 1  smdata: ffff8801c594be58 fn: ffffffff810f330a
cpu: 1 ack_state jiffies: 4297177780 smdata->state: 4  thread_ack: 1 smdata: ffff8801c594be58 fn: ffffffff810f330a
cpu: 1  increment state, smdata: ffff8801c594be58 fn: ffffffff810f330a
set_state, cpu: 1  state: 4 newstate: 5 smdata: ffff8801c594be58  fn: ffffffff810f330a
cpu: 1 ack_state end jiffies: 4297177804 smdata->state: 5  thread_ack: 4  smdata: ffff8801c594be58 fn: ffffffff810f330a
ath: wiphy0: Failed to stop TX DMA, queues=0x005!
__stop_machine(upstream): num-threads: 4,  fn: __unlink_module(ffffffff810f2dab)  data: ffffffffa117aad0  smdata: ffff8801c594be18
set_state, cpu: 0  state: 0 newstate: 1 smdata: ffff8801c594be18  fn: ffffffff810f2dab
sta201: authenticate with 00:de:ad:1d:ea:01
cpu: 1 loops: 1 jiffies: 4297178887  timeout: 4297178886 curstate: 0  smdata->state: 1  thread_ack: 4  smdata: ffff8801c594be18 fn: ffffffff810f2dab
cpu: 2 loops: 1 jiffies: 4297178887  timeout: 4297178886 curstate: 0  smdata->state: 1  thread_ack: 4  smdata: ffff8801c594be18 fn: ffffffff810f2dab
cpu: 0 loops: 1 jiffies: 4297178887  timeout: 4297178886 curstate: 0  smdata->state: 1  thread_ack: 4  smdata: ffff8801c594be18 fn: ffffffff810f2dab
state-change: cpu: 2 loops: 1 jiffies: 4297178887 curstate: 0  smdata->state: 1  thread_ack: 4  smdata: ffff8801c594be18 fn: ffffffff810f2dab
cpu: 3 loops: 1 jiffies: 4297178887  timeout: 4297178886 curstate: 0  smdata->state: 1  thread_ack: 4  smdata: ffff8801c594be18 fn: ffffffff810f2dab
cpu: 2 ack_state jiffies: 4297178887 smdata->state: 1  thread_ack: 4 smdata: ffff8801c594be18 fn: ffffffff810f2dab
state-change: cpu: 0 loops: 1 jiffies: 4297178887 curstate: 0  smdata->state: 1  thread_ack: 4  smdata: ffff8801c594be18 fn: ffffffff810f2dab
cpu: 2 ack_state end jiffies: 4297178887 smdata->state: 1  thread_ack: 3  smdata: ffff8801c594be18 fn: ffffffff810f2dab
state-change: cpu: 3 loops: 1 jiffies: 4297178887 curstate: 0  smdata->state: 1  thread_ack: 4  smdata: ffff8801c594be18 fn: ffffffff810f2dab
cpu: 0 ack_state jiffies: 4297178887 smdata->state: 1  thread_ack: 3 smdata: ffff8801c594be18 fn: ffffffff810f2dab
cpu: 3 ack_state jiffies: 4297178887 smdata->state: 1  thread_ack: 3 smdata: ffff8801c594be18 fn: ffffffff810f2dab
cpu: 0 ack_state end jiffies: 4297178887 smdata->state: 1  thread_ack: 2  smdata: ffff8801c594be18 fn: ffffffff810f2dab
cpu: 3 ack_state end jiffies: 4297178887 smdata->state: 1  thread_ack: 1  smdata: ffff8801c594be18 fn: ffffffff810f2dab
state-change: cpu: 1 loops: 1 jiffies: 4297179040 curstate: 0  smdata->state: 1  thread_ack: 1  smdata: ffff8801c594be18 fn: ffffffff810f2dab
cpu: 1 ack_state jiffies: 4297179054 smdata->state: 1  thread_ack: 1 smdata: ffff8801c594be18 fn: ffffffff810f2dab
cpu: 1  increment state, smdata: ffff8801c594be18 fn: ffffffff810f2dab
set_state, cpu: 1  state: 1 newstate: 2 smdata: ffff8801c594be18  fn: ffffffff810f2dab
cpu: 1 ack_state end jiffies: 4297179083 smdata->state: 2  thread_ack: 4  smdata: ffff8801c594be18 fn: ffffffff810f2dab
state-change: cpu: 2 loops: 25037141 jiffies: 4297179083 curstate: 1  smdata->state: 2  thread_ack: 4  smdata: ffff8801c594be18 fn: ffffffff810f2dab
state-change: cpu: 3 loops: 28970837 jiffies: 4297179083 curstate: 1  smdata->state: 2  thread_ack: 4  smdata: ffff8801c594be18 fn: ffffffff810f2dab
cpu: 2 ack_state jiffies: 4297179083 smdata->state: 2  thread_ack: 4 smdata: ffff8801c594be18 fn: ffffffff810f2dab
cpu: 3 ack_state jiffies: 4297179083 smdata->state: 2  thread_ack: 4 smdata: ffff8801c594be18 fn: ffffffff810f2dab
state-change: cpu: 0 loops: 25898133 jiffies: 4297179083 curstate: 1  smdata->state: 2  thread_ack: 4  smdata: ffff8801c594be18 fn: ffffffff810f2dab
cpu: 2 ack_state end jiffies: 4297179083 smdata->state: 2  thread_ack: 3  smdata: ffff8801c594be18 fn: ffffffff810f2dab
cpu: 0 ack_state jiffies: 4297179083 smdata->state: 2  thread_ack: 2 smdata: ffff8801c594be18 fn: ffffffff810f2dab
cpu: 3 ack_state end jiffies: 4297179083 smdata->state: 2  thread_ack: 2  smdata: ffff8801c594be18 fn: ffffffff810f2dab
cpu: 0 ack_state end jiffies: 4297179083 smdata->state: 2  thread_ack: 1  smdata: ffff8801c594be18 fn: ffffffff810f2dab
------------[ cut here ]------------
WARNING: at /home/greearb/git/linux-2.6.linus/kernel/watchdog.c:245 watchdog_overflow_callback+0x9c/0xa7()
Hardware name: To be filled by O.E.M.
Watchdog detected hard LOCKUP on cpu 2
Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd 
sunrpc]
Pid: 23, comm: migration/2 Tainted: G         C   3.9.4+ #11
Call Trace:
  <NMI>  [<ffffffff810977f1>] warn_slowpath_common+0x85/0x9f
  [<ffffffff810978ae>] warn_slowpath_fmt+0x46/0x48
  [<ffffffff8110f42d>] watchdog_overflow_callback+0x9c/0xa7
  [<ffffffff8113feb6>] __perf_event_overflow+0x137/0x1cb
  [<ffffffff8101dff6>] ? x86_perf_event_set_period+0x103/0x10f
  [<ffffffff811403fa>] perf_event_overflow+0x14/0x16
  [<ffffffff81023730>] intel_pmu_handle_irq+0x2dc/0x359
  [<ffffffff815eee05>] perf_event_nmi_handler+0x19/0x1b
  [<ffffffff815ee5f3>] nmi_handle+0x7f/0xc2
  [<ffffffff815ee574>] ? oops_begin+0xa9/0xa9
  [<ffffffff815ee6f2>] do_nmi+0xbc/0x304
  [<ffffffff815edd81>] end_repeat_nmi+0x1e/0x2e
  [<ffffffff81099fce>] ? vprintk_emit+0x40a/0x444
  [<ffffffff81104ef8>] ? stop_machine_cpu_stop+0xd8/0x274
  [<ffffffff81104ef8>] ? stop_machine_cpu_stop+0xd8/0x274
  [<ffffffff81104ef8>] ? stop_machine_cpu_stop+0xd8/0x274
  <<EOE>>  [<ffffffff810f2dab>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810f2dab>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810f2dab>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff81104e20>] ? stop_one_cpu_nowait+0x30/0x30
  [<ffffffff81104b8d>] cpu_stopper_thread+0xae/0x162
  [<ffffffff815ebb1f>] ? __schedule+0x5ef/0x637
  [<ffffffff815ecf38>] ? _raw_spin_unlock_irqrestore+0x47/0x7e
  [<ffffffff810e92cc>] ? trace_hardirqs_on_caller+0x123/0x15a
  [<ffffffff810e9310>] ? trace_hardirqs_on+0xd/0xf
  [<ffffffff815ecf61>] ? _raw_spin_unlock_irqrestore+0x70/0x7e
  [<ffffffff810bef34>] smpboot_thread_fn+0x258/0x260
  [<ffffffff810becdc>] ? test_ti_thread_flag.clone.0+0x11/0x11
  [<ffffffff810b7c22>] kthread+0xc7/0xcf
  [<ffffffff810b7b5b>] ? __init_kthread_worker+0x5b/0x5b
  [<ffffffff815f3b6c>] ret_from_fork+0x7c/0xb0
  [<ffffffff810b7b5b>] ? __init_kthread_worker+0x5b/0x5b
---[ end trace 4947dfa9b0a4cec3 ]---
BUG: soft lockup - CPU#1 stuck for 22s! [migration/1:17]
Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd 
sunrpc]
irq event stamp: 835637905
hardirqs last  enabled at (835637904): [<ffffffff8109f4c1>] __do_softirq+0x9f/0x257
hardirqs last disabled at (835637905): [<ffffffff815f48ad>] apic_timer_interrupt+0x6d/0x80
softirqs last  enabled at (5654720): [<ffffffff8109f621>] __do_softirq+0x1ff/0x257
softirqs last disabled at (5654725): [<ffffffff8109f743>] irq_exit+0x5f/0xbb
CPU 1
Pid: 17, comm: migration/1 Tainted: G        WC   3.9.4+ #11 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M.
RIP: 0010:[<ffffffff8109ee72>]  [<ffffffff8109ee72>] tasklet_hi_action+0xf0/0xf0
RSP: 0018:ffff88022bc83ef0  EFLAGS: 00000212
RAX: 0000000000000006 RBX: ffff880217deb710 RCX: 0000000000000006
RDX: 0000000000000006 RSI: 0000000000000000 RDI: ffffffff81a050b0
RBP: ffff88022bc83f78 R08: ffffffff81a050b0 R09: ffff88022bc83cc8
R10: 00000000000005f2 R11: ffff8802203aaf50 R12: ffff88022bc83e68
R13: ffffffff815f48b2 R14: ffff88022bc83f78 R15: ffff88022230e000
FS:  0000000000000000(0000) GS:ffff88022bc80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000430070 CR3: 00000001cbc5d000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process migration/1 (pid: 17, threadinfo ffff88022230e000, task ffff8802223142c0)
Stack:
  ffffffff8109f539 ffff88022bc83f08 ffff88022230e010 042080402bc83f88
  000000010021bfcd 000000012bc83fa8 ffff88022230e000 ffff88022230ffd8
  0000000000000030 ffff880200000006 00000248d8cdab1c 1304da35fe841722
Call Trace:
  <IRQ>
  [<ffffffff8109f539>] ? __do_softirq+0x117/0x257
  [<ffffffff8109f743>] irq_exit+0x5f/0xbb
  [<ffffffff815f59fd>] smp_apic_timer_interrupt+0x8a/0x98
  [<ffffffff815f48b2>] apic_timer_interrupt+0x72/0x80
  <EOI>
  [<ffffffff81099fdb>] ? vprintk_emit+0x417/0x444
  [<ffffffff815e9fc0>] printk+0x4d/0x4f
  [<ffffffff81104b36>] ? cpu_stopper_thread+0x57/0x162
  [<ffffffff8110504c>] stop_machine_cpu_stop+0x22c/0x274
  [<ffffffff810f2dab>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810f2dab>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810f2dab>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff81104e20>] ? stop_one_cpu_nowait+0x30/0x30
  [<ffffffff81104b8d>] cpu_stopper_thread+0xae/0x162
  [<ffffffff815ebb1f>] ? __schedule+0x5ef/0x637
  [<ffffffff815ecf38>] ? _raw_spin_unlock_irqrestore+0x47/0x7e
  [<ffffffff810e92cc>] ? trace_hardirqs_on_caller+0x123/0x15a
  [<ffffffff810e9310>] ? trace_hardirqs_on+0xd/0xf
  [<ffffffff815ecf61>] ? _raw_spin_unlock_irqrestore+0x70/0x7e
  [<ffffffff810bef34>] smpboot_thread_fn+0x258/0x260
  [<ffffffff810becdc>] ? test_ti_thread_flag.clone.0+0x11/0x11
  [<ffffffff810b7c22>] kthread+0xc7/0xcf
  [<ffffffff810b7b5b>] ? __init_kthread_worker+0x5b/0x5b
  [<ffffffff815f3b6c>] ret_from_fork+0x7c/0xb0
  [<ffffffff810b7b5b>] ? __init_kthread_worker+0x5b/0x5b
Code: 1c 25 18 e2 00 00 e8 cd fe ff ff e8 ac a4 04 00 fb 66 66 90 66 66 90 4c 89 e3 48 85 db 0f 85 79 ff ff ff 5f 5b 41 5c 41 5d c9 c3 <55> 48 89 e5 41 55 41 54 
53 4
------------[ cut here ]------------
WARNING: at /home/greearb/git/linux-2.6.linus/kernel/watchdog.c:245 watchdog_overflow_callback+0x9c/0xa7()
Hardware name: To be filled by O.E.M.
Watchdog detected hard LOCKUP on cpu 0
Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc]
Pid: 8, comm: migration/0 Tainted: G        WC   3.9.4+ #11
Call Trace:
  <NMI>  [<ffffffff810977f1>] warn_slowpath_common+0x85/0x9f
  [<ffffffff810978ae>] warn_slowpath_fmt+0x46/0x48
  [<ffffffff8110f42d>] watchdog_overflow_callback+0x9c/0xa7
  [<ffffffff8113feb6>] __perf_event_overflow+0x137/0x1cb
  [<ffffffff8101dff6>] ? x86_perf_event_set_period+0x103/0x10f
  [<ffffffff811403fa>] perf_event_overflow+0x14/0x16
  [<ffffffff81023730>] intel_pmu_handle_irq+0x2dc/0x359
  [<ffffffff815eee05>] perf_event_nmi_handler+0x19/0x1b
  [<ffffffff815ee5f3>] nmi_handle+0x7f/0xc2
  [<ffffffff815ee574>] ? oops_begin+0xa9/0xa9
  [<ffffffff815ee6f2>] do_nmi+0xbc/0x304
  [<ffffffff815edd81>] end_repeat_nmi+0x1e/0x2e
  [<ffffffff81099fce>] ? vprintk_emit+0x40a/0x444
  [<ffffffff81104efa>] ? stop_machine_cpu_stop+0xda/0x274
  [<ffffffff81104efa>] ? stop_machine_cpu_stop+0xda/0x274
  [<ffffffff81104efa>] ? stop_machine_cpu_stop+0xda/0x274
  <<EOE>>  [<ffffffff810f2dab>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810f2dab>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810f2dab>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff81104e20>] ? stop_one_cpu_nowait+0x30/0x30
  [<ffffffff81104b8d>] cpu_stopper_thread+0xae/0x162
  [<ffffffff815ebb1f>] ? __schedule+0x5ef/0x637
  [<ffffffff815ecf38>] ? _raw_spin_unlock_irqrestore+0x47/0x7e
  [<ffffffff810e92cc>] ? trace_hardirqs_on_caller+0x123/0x15a
  [<ffffffff810e9310>] ? trace_hardirqs_on+0xd/0xf
  [<ffffffff815ecf61>] ? _raw_spin_unlock_irqrestore+0x70/0x7e
  [<ffffffff810bef34>] smpboot_thread_fn+0x258/0x260
  [<ffffffff810becdc>] ? test_ti_thread_flag.clone.0+0x11/0x11
  [<ffffffff810b7c22>] kthread+0xc7/0xcf
  [<ffffffff810b7b5b>] ? __init_kthread_worker+0x5b/0x5b
  [<ffffffff815f3b6c>] ret_from_fork+0x7c/0xb0
  [<ffffffff810b7b5b>] ? __init_kthread_worker+0x5b/0x5b
---[ end trace 4947dfa9b0a4cec4 ]---
------------[ cut here ]------------
WARNING: at /home/greearb/git/linux-2.6.linus/kernel/watchdog.c:245 watchdog_overflow_callback+0x9c/0xa7()
Hardware name: To be filled by O.E.M.
Watchdog detected hard LOCKUP on cpu 3
Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc]
Pid: 29, comm: migration/3 Tainted: G        WC   3.9.4+ #11
Call Trace:
  <NMI>  [<ffffffff810977f1>] warn_slowpath_common+0x85/0x9f
  [<ffffffff810978ae>] warn_slowpath_fmt+0x46/0x48
  [<ffffffff8110f42d>] watchdog_overflow_callback+0x9c/0xa7
  [<ffffffff8113feb6>] __perf_event_overflow+0x137/0x1cb
  [<ffffffff8101dff6>] ? x86_perf_event_set_period+0x103/0x10f
  [<ffffffff811403fa>] perf_event_overflow+0x14/0x16
  [<ffffffff81023730>] intel_pmu_handle_irq+0x2dc/0x359
  [<ffffffff815eee05>] perf_event_nmi_handler+0x19/0x1b
  [<ffffffff815ee5f3>] nmi_handle+0x7f/0xc2
  [<ffffffff815ee574>] ? oops_begin+0xa9/0xa9
  [<ffffffff815ee6f2>] do_nmi+0xbc/0x304
  [<ffffffff815edd81>] end_repeat_nmi+0x1e/0x2e
  [<ffffffff81099fce>] ? vprintk_emit+0x40a/0x444
  [<ffffffff81104efa>] ? stop_machine_cpu_stop+0xda/0x274
  [<ffffffff81104efa>] ? stop_machine_cpu_stop+0xda/0x274
  [<ffffffff81104efa>] ? stop_machine_cpu_stop+0xda/0x274
  <<EOE>>  [<ffffffff810f2dab>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810f2dab>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810f2dab>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff81104e20>] ? stop_one_cpu_nowait+0x30/0x30
  [<ffffffff81104b8d>] cpu_stopper_thread+0xae/0x162
  [<ffffffff815ebb1f>] ? __schedule+0x5ef/0x637
  [<ffffffff815ecf38>] ? _raw_spin_unlock_irqrestore+0x47/0x7e
  [<ffffffff810e92cc>] ? trace_hardirqs_on_caller+0x123/0x15a
  [<ffffffff810e9310>] ? trace_hardirqs_on+0xd/0xf
  [<ffffffff815ecf61>] ? _raw_spin_unlock_irqrestore+0x70/0x7e
  [<ffffffff810bef34>] smpboot_thread_fn+0x258/0x260
  [<ffffffff810becdc>] ? test_ti_thread_flag.clone.0+0x11/0x11
  [<ffffffff810b7c22>] kthread+0xc7/0xcf
  [<ffffffff810b7b5b>] ? __init_kthread_worker+0x5b/0x5b
  [<ffffffff815f3b6c>] ret_from_fork+0x7c/0xb0
  [<ffffffff810b7b5b>] ? __init_kthread_worker+0x5b/0x5b
---[ end trace 4947dfa9b0a4cec5 ]---
BUG: soft lockup - CPU#1 stuck for 22s! [migration/1:17]
Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc]
irq event stamp: 1774512131
hardirqs last  enabled at (1774512130): [<ffffffff8109f4c1>] __do_softirq+0x9f/0x257
hardirqs last disabled at (1774512131): [<ffffffff815f48ad>] apic_timer_interrupt+0x6d/0x80
softirqs last  enabled at (5654720): [<ffffffff8109f621>] __do_softirq+0x1ff/0x257
softirqs last disabled at (5654725): [<ffffffff8109f743>] irq_exit+0x5f/0xbb
CPU 1
Pid: 17, comm: migration/1 Tainted: G        WC   3.9.4+ #11 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M.
RIP: 0010:[<ffffffff8109f4c5>]  [<ffffffff8109f4c5>] __do_softirq+0xa3/0x257
RSP: 0018:ffff88022bc83ef8  EFLAGS: 00000202
RAX: ffff8802223142c0 RBX: ffff880217deb710 RCX: 0000000000000006
RDX: ffff88022230e010 RSI: 0000000000000000 RDI: ffff8802223142c0
RBP: ffff88022bc83f78 R08: ffffffff81a050b0 R09: ffff88022bc83cc8
R10: 00000000000005f2 R11: ffff8802203aaf50 R12: ffff88022bc83e68
R13: ffffffff815f48b2 R14: ffff88022bc83f78 R15: ffff88022230e000
FS:  0000000000000000(0000) GS:ffff88022bc80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000430070 CR3: 00000001cbc5d000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process migration/1 (pid: 17, threadinfo ffff88022230e000, task ffff8802223142c0)
Stack:
  ffff88022bc83f08 ffff88022230e010 042080402bc83f88 000000010021bfcd
  000000012bc83fa8 ffff88022230e000 ffff88022230ffd8 0000000000000038
  ffff880200000006 00000248d8cdab1c 1304da35fe841722 ffff88022bc8dc80
Call Trace:
  <IRQ>
  [<ffffffff8109f743>] irq_exit+0x5f/0xbb
  [<ffffffff815f59fd>] smp_apic_timer_interrupt+0x8a/0x98
  [<ffffffff815f48b2>] apic_timer_interrupt+0x72/0x80
  <EOI>
  [<ffffffff81099fdb>] ? vprintk_emit+0x417/0x444
  [<ffffffff815e9fc0>] printk+0x4d/0x4f
  [<ffffffff81104b36>] ? cpu_stopper_thread+0x57/0x162
  [<ffffffff8110504c>] stop_machine_cpu_stop+0x22c/0x274
  [<ffffffff810f2dab>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810f2dab>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810f2dab>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff81104e20>] ? stop_one_cpu_nowait+0x30/0x30
  [<ffffffff81104b8d>] cpu_stopper_thread+0xae/0x162
  [<ffffffff815ebb1f>] ? __schedule+0x5ef/0x637
  [<ffffffff815ecf38>] ? _raw_spin_unlock_irqrestore+0x47/0x7e
  [<ffffffff810e92cc>] ? trace_hardirqs_on_caller+0x123/0x15a
  [<ffffffff810e9310>] ? trace_hardirqs_on+0xd/0xf
  [<ffffffff815ecf61>] ? _raw_spin_unlock_irqrestore+0x70/0x7e
  [<ffffffff810bef34>] smpboot_thread_fn+0x258/0x260
  [<ffffffff810becdc>] ? test_ti_thread_flag.clone.0+0x11/0x11
  [<ffffffff810b7c22>] kthread+0xc7/0xcf
  [<ffffffff810b7b5b>] ? __init_kthread_worker+0x5b/0x5b
  [<ffffffff815f3b6c>] ret_from_fork+0x7c/0xb0
  [<ffffffff810b7b5b>] ? __init_kthread_worker+0x5b/0x5b
Code: 55 b0 49 81 ec d8 1f 00 00 49 8d 44 24 10 4c 89 65 a8 48 89 45 88 65 c7 04 25 80 1b 01 00 00 00 00 00 e8 42 9e 04 00 fb 66 66 90 <66> 66 90 48 c7 c3 80 50 a0 8
BUG: soft lockup - CPU#1 stuck for 22s! [migration/1:17]
Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc]
irq event stamp: 2713027507
hardirqs last  enabled at (2713027506): [<ffffffff8109f4c1>] __do_softirq+0x9f/0x257
hardirqs last disabled at (2713027507): [<ffffffff815f48ad>] apic_timer_interrupt+0x6d/0x80
softirqs last  enabled at (5654720): [<ffffffff8109f621>] __do_softirq+0x1ff/0x257
softirqs last disabled at (5654725): [<ffffffff8109f743>] irq_exit+0x5f/0xbb
CPU 1
Pid: 17, comm: migration/1 Tainted: G        WC   3.9.4+ #11 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M.
RIP: 0010:[<ffffffff8109f4c5>]  [<ffffffff8109f4c5>] __do_softirq+0xa3/0x257
RSP: 0018:ffff88022bc83ef8  EFLAGS: 00000286
RAX: ffff8802223142c0 RBX: ffff880217deb710 RCX: 0000000000000006
RDX: ffff88022230e010 RSI: 0000000000000000 RDI: ffff8802223142c0
RBP: ffff88022bc83f78 R08: ffffffff81a050b0 R09: ffff88022bc83cc8
R10: 00000000000005f2 R11: ffff88022bc83c38 R12: ffff88022bc83e68
R13: ffffffff815f48b2 R14: ffff88022bc83f78 R15: ffff88022230e000
FS:  0000000000000000(0000) GS:ffff88022bc80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000430070 CR3: 00000001cbc5d000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process migration/1 (pid: 17, threadinfo ffff88022230e000, task ffff8802223142c0)
Stack:
  ffff88022bc83f08 ffff88022230e010 042080402bc83f88 000000010021bfcd
  000000012bc83fa8 ffff88022230e000 ffff88022230ffd8 0000000000000038
  ffff880200000006 00000248d8cdab1c 1304da35fe841722 ffff88022bc8dc80
Call Trace:
  <IRQ>
  [<ffffffff8109f743>] irq_exit+0x5f/0xbb
  [<ffffffff815f59fd>] smp_apic_timer_interrupt+0x8a/0x98
  [<ffffffff815f48b2>] apic_timer_interrupt+0x72/0x80
  <EOI>
  [<ffffffff81099fdb>] ? vprintk_emit+0x417/0x444
  [<ffffffff815e9fc0>] printk+0x4d/0x4f
  [<ffffffff81104b36>] ? cpu_stopper_thread+0x57/0x162
  [<ffffffff8110504c>] stop_machine_cpu_stop+0x22c/0x274
  [<ffffffff810f2dab>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810f2dab>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff810f2dab>] ? copy_module_from_fd+0xe7/0xe7
  [<ffffffff81104e20>] ? stop_one_cpu_nowait+0x30/0x30
  [<ffffffff81104b8d>] cpu_stopper_thread+0xae/0x162
  [<ffffffff815ebb1f>] ? __schedule+0x5ef/0x637
  [<ffffffff815ecf38>] ? _raw_spin_unlock_irqrestore+0x47/0x7e
  [<ffffffff810e92cc>] ? trace_hardirqs_on_caller+0x123/0x15a
  [<ffffffff810e9310>] ? trace_hardirqs_on+0xd/0xf
  [<ffffffff815ecf61>] ? _raw_spin_unlock_irqrestore+0x70/0x7e
  [<ffffffff810bef34>] smpboot_thread_fn+0x258/0x260
  [<ffffffff810becdc>] ? test_ti_thread_flag.clone.0+0x11/0x11
  [<ffffffff810b7c22>] kthread+0xc7/0xcf
  [<ffffffff810b7b5b>] ? __init_kthread_worker+0x5b/0x5b
  [<ffffffff815f3b6c>] ret_from_fork+0x7c/0xb0
  [<ffffffff810b7b5b>] ? __init_kthread_worker+0x5b/0x5b
Code: 55 b0 49 81 ec d8 1f 00 00 49 8d 44 24 10 4c 89 65 a8 48 89 45 88 65 c7 04 25 80 1b 01 00 00 00 00 00 e8 42 9e 04 00 fb 66 66 90 <66> 66 90 48 c7 c3 80 50 a0 8

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: stop_machine lockup issue in 3.9.y.
  2013-06-05 19:31                               ` stop_machine lockup issue in 3.9.y Ben Greear
@ 2013-06-05 20:58                                 ` Ben Greear
  2013-06-05 21:11                                     ` Tejun Heo
  0 siblings, 1 reply; 51+ messages in thread
From: Ben Greear @ 2013-06-05 20:58 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Rusty Russell, Joe Lawrence, Linux Kernel Mailing List, stable

On 06/05/2013 12:31 PM, Ben Greear wrote:
> This is no longer really about the module unlink, so changing
> subject.
>
> On 06/05/2013 12:11 PM, Ben Greear wrote:
>> On 06/05/2013 11:48 AM, Tejun Heo wrote:
>>> Hello, Ben.
>>>
>>> On Wed, Jun 05, 2013 at 09:59:00AM -0700, Ben Greear wrote:
>>>> One pattern I notice repeating for at least most of the hangs is that all but one
>>>> CPU thread has irqs disabled and is in state 2.  But, there will be one thread
>>>> in state 1 that still has IRQs enabled and it is reported to be in soft-lockup
>>>> instead of hard-lockup.  In 'sysrq l' it always shows some IRQ processing,
>>>> but typically that of the sysrq itself.  I added printk that would always
>>>> print if the thread notices that smdata->state != curstate, and the soft-lockup
>>>> thread (cpu 2 below) never shows that message.
>>>
>>> It sounds like one of the CPUs gets live-locked by IRQs.  I can't tell
>>> why the situation is made worse by other CPUs being tied up.  Do you
>>> ever see CPUs being live-locked by IRQs during normal operation?

Hmm, wonder if I found it.  I previously saw times where it appears
jiffies does not increment.  __do_softirq has a break-out based on
jiffies timeout.  Maybe that is failing to get us out of __do_softirq
in my lockup case because for whatever reason the system cannot update
jiffies in this case?
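
To make the suspicion concrete, here is a minimal userspace sketch (not
kernel code) of a deadline check driven by a tick counter.  tick_before()
follows the same idea as the kernel's time_before(); polls_until_expiry()
and its parameters are hypothetical names made up for this illustration.
With step == 0 (a frozen counter) the check never turns false, so only
the max_polls guard ends the loop.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace copy of the idea behind time_before(a, b): true when a is
 * earlier than b, even if the counter wrapped around in between.
 */
bool tick_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/*
 * Hypothetical helper: count how many polls it takes until the deadline
 * check turns false when the counter advances by 'step' ticks per poll.
 * Gives up after 'max_polls' so that a frozen counter cannot hang us.
 */
unsigned long polls_until_expiry(unsigned long now, unsigned long budget,
				 unsigned long step, unsigned long max_polls)
{
	unsigned long end = now + budget;
	unsigned long polls = 0;

	assert(budget > 0);

	while (tick_before(now, end) && polls < max_polls) {
		now += step;	/* step == 0 models jiffies not incrementing */
		polls++;
	}
	return polls;
}
```

With a running counter the loop ends once the budget is used up; with
step == 0 it always runs until max_polls, which is the role the loop
counter plays in the hack below.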

I added this (probably whitespace damaged) hack and now I have not been
able to reproduce the problem.

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 14d7758..621ea3b 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -212,6 +212,7 @@ asmlinkage void __do_softirq(void)
         unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
         int cpu;
         unsigned long old_flags = current->flags;
+       unsigned long loops = 0;

         /*
          * Mask out PF_MEMALLOC s current task context is borrowed for the
@@ -241,6 +242,7 @@ restart:
                         unsigned int vec_nr = h - softirq_vec;
                         int prev_count = preempt_count();

+                       loops++;
                         kstat_incr_softirqs_this_cpu(vec_nr);

                         trace_softirq_entry(vec_nr);
@@ -265,7 +267,7 @@ restart:

         pending = local_softirq_pending();
         if (pending) {
-               if (time_before(jiffies, end) && !need_resched())
+               if (time_before(jiffies, end) && !need_resched() && (loops < 500))
                         goto restart;

                 wakeup_softirqd();

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: stop_machine lockup issue in 3.9.y.
  2013-06-05 20:58                                 ` Ben Greear
  2013-06-05 21:11                                     ` Tejun Heo
@ 2013-06-05 21:11                                     ` Tejun Heo
  0 siblings, 0 replies; 51+ messages in thread
From: Tejun Heo @ 2013-06-05 21:11 UTC (permalink / raw)
  To: Ben Greear
  Cc: Rusty Russell, Joe Lawrence, Linux Kernel Mailing List, stable,
	Luis R. Rodriguez, Jouni Malinen, Vasanthakumar Thiagarajan,
	Senthil Balasubramanian, linux-wireless, ath9k-devel,
	Thomas Gleixner, Ingo Molnar

(cc'ing wireless crowd, tglx and Ingo.  The original thread is at
 http://thread.gmane.org/gmane.linux.kernel/1500158/focus=55005 )

Hello, Ben.

On Wed, Jun 05, 2013 at 01:58:31PM -0700, Ben Greear wrote:
> Hmm, wonder if I found it.  I previously saw times where it appears
> jiffies does not increment.  __do_softirq has a break-out based on
> jiffies timeout.  Maybe that is failing to get us out of __do_softirq
> in my lockup case because for whatever reason the system cannot update
> jiffies in this case?
> 
> I added this (probably whitespace damaged) hack and now I have not been
> able to reproduce the problem.

Ah, nice catch. :)

> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index 14d7758..621ea3b 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -212,6 +212,7 @@ asmlinkage void __do_softirq(void)
>         unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
>         int cpu;
>         unsigned long old_flags = current->flags;
> +       unsigned long loops = 0;
> 
>         /*
>          * Mask out PF_MEMALLOC s current task context is borrowed for the
> @@ -241,6 +242,7 @@ restart:
>                         unsigned int vec_nr = h - softirq_vec;
>                         int prev_count = preempt_count();
> 
> +                       loops++;
>                         kstat_incr_softirqs_this_cpu(vec_nr);
> 
>                         trace_softirq_entry(vec_nr);
> @@ -265,7 +267,7 @@ restart:
> 
>         pending = local_softirq_pending();
>         if (pending) {
> -               if (time_before(jiffies, end) && !need_resched())
> +               if (time_before(jiffies, end) && !need_resched() && (loops < 500))
>                         goto restart;

So, a softirq most likely kicked off from ath9k is rescheduling itself
to the extent that it ends up locking out the CPU completely.  This is
usually harmless because the processing would break out after 2ms,
but as jiffies is stopped in this case with all other CPUs trapped in
stop_machine, the loop never breaks and the machine hangs.  While
adding the counter limit probably isn't a bad idea, a softirq requeueing
itself indefinitely sounds pretty buggy.

ath9k people, do you guys have any idea what's going on?  Why would
softirq repeat itself indefinitely?

Ingo, Thomas, we're seeing stop_machine hang because

* All other CPUs have entered the IRQ-disabled stage.  Jiffies is not
  being updated.

* The last CPU gets caught up executing softirqs indefinitely.  As
  jiffies doesn't get updated, it never breaks out of softirq
  handling.  This is a deadlock.  This CPU won't break out of softirq
  handling unless jiffies is updated and other CPUs can't do anything
  until this CPU enters the same stop_machine stage.

Ben found out that breaking out of softirq handling after a certain
number of repetitions makes the issue go away, which isn't a proper
fix but we might want it anyway.  What do you guys think?
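
The two conditions above can be modelled in userspace.  This is only a
rough sketch under stated assumptions: struct sim, the sim_* functions,
the SIM_* constants and the "one tick per 100 passes" rate are all
invented for illustration and are not kernel symbols, and a single
handler that always re-raises itself stands in for whatever softirq is
actually requeueing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical simulation state, not kernel data structures. */
struct sim {
	unsigned long jiffies;	/* simulated tick counter */
	bool ticks_running;	/* false: nobody is updating jiffies */
	unsigned long passes;	/* handler invocations so far */
};

#define SIM_MAX_TIME	2UL		/* time budget in simulated ticks */
#define SIM_HARD_STOP	1000000UL	/* simulation-only guard so the demo ends */

/* Wraparound-safe "a is before b", same idea as time_before(). */
bool sim_time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* One pass of a handler that always leaves work pending again. */
void sim_handle_pending(struct sim *s)
{
	s->passes++;
	if (s->ticks_running && s->passes % 100 == 0)
		s->jiffies++;	/* assumed rate: one tick per 100 passes */
}

/*
 * Restart loop with a time budget and an optional iteration cap.
 * max_loops == 0 means "no cap", i.e. the time check is the only exit.
 * Returns the number of passes executed.
 */
unsigned long sim_do_softirq(struct sim *s, unsigned long max_loops)
{
	unsigned long end;

	assert(s != NULL);
	end = s->jiffies + SIM_MAX_TIME;

	for (;;) {
		sim_handle_pending(s);
		if (!sim_time_before(s->jiffies, end))
			break;		/* time budget used up */
		if (max_loops && s->passes >= max_loops)
			break;		/* iteration cap, as in Ben's hack */
		if (s->passes >= SIM_HARD_STOP)
			break;		/* would spin forever in the real case */
	}
	return s->passes;
}
```

In this model a running clock ends the loop via the time budget, a
stopped clock without a cap only ends at the simulation guard, and a cap
of 500 ends it after 500 passes no matter what the clock does.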

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: stop_machine lockup issue in 3.9.y.
  2013-06-05 21:11                                     ` Tejun Heo
@ 2013-06-05 21:33                                       ` Ben Greear
  -1 siblings, 0 replies; 51+ messages in thread
From: Ben Greear @ 2013-06-05 21:33 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Rusty Russell, Joe Lawrence, Linux Kernel Mailing List, stable,
	Luis R. Rodriguez, Jouni Malinen, Vasanthakumar Thiagarajan,
	Senthil Balasubramanian, linux-wireless, ath9k-devel,
	Thomas Gleixner, Ingo Molnar

On 06/05/2013 02:11 PM, Tejun Heo wrote:
> (cc'ing wireless crowd, tglx and Ingo.  The original thread is at
>   http://thread.gmane.org/gmane.linux.kernel/1500158/focus=55005 )
>
> Hello, Ben.
>
> On Wed, Jun 05, 2013 at 01:58:31PM -0700, Ben Greear wrote:
>> Hmm, wonder if I found it.  I previously saw times where it appears
>> jiffies does not increment.  __do_softirq has a break-out based on
>> jiffies timeout.  Maybe that is failing to get us out of __do_softirq
>> in my lockup case because for whatever reason the system cannot update
>> jiffies in this case?
>>
>> I added this (probably whitespace damaged) hack and now I have not been
>> able to reproduce the problem.
>
> Ah, nice catch. :)
>
>> diff --git a/kernel/softirq.c b/kernel/softirq.c
>> index 14d7758..621ea3b 100644
>> --- a/kernel/softirq.c
>> +++ b/kernel/softirq.c
>> @@ -212,6 +212,7 @@ asmlinkage void __do_softirq(void)
>>          unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
>>          int cpu;
>>          unsigned long old_flags = current->flags;
>> +       unsigned long loops = 0;
>>
>>          /*
>>           * Mask out PF_MEMALLOC s current task context is borrowed for the
>> @@ -241,6 +242,7 @@ restart:
>>                          unsigned int vec_nr = h - softirq_vec;
>>                          int prev_count = preempt_count();
>>
>> +                       loops++;
>>                          kstat_incr_softirqs_this_cpu(vec_nr);
>>
>>                          trace_softirq_entry(vec_nr);
>> @@ -265,7 +267,7 @@ restart:
>>
>>          pending = local_softirq_pending();
>>          if (pending) {
>> -               if (time_before(jiffies, end) && !need_resched())
>> +               if (time_before(jiffies, end) && !need_resched() && (loops < 500))
>>                          goto restart;
>
> So, a softirq most likely kicked off from ath9k is rescheduling itself
> to the extent that it ends up locking out the CPU completely.  This is
> usually harmless because the processing would break out after 2ms,
> but as jiffies is stopped in this case with all other CPUs trapped in
> stop_machine, the loop never breaks and the machine hangs.  While
> adding the counter limit probably isn't a bad idea, a softirq requeueing
> itself indefinitely sounds pretty buggy.

Just to be clear on the ath9k part for the wifi folks:

This is basically un-patched 3.9.4, but I have 200 virtual stations
configured on each of two ath9k radios.  I cannot reproduce the problem
without ath9k, but I do not know for certain ath9k is the real
culprit.

In the case where I can most easily reproduce the lockup, ath9k virtual
stations would be trying to associate, so I'd expect a fair amount
of packet processing to be happening...

> ath9k people, do you guys have any idea what's going on?  Why would
> softirq repeat itself indefinitely?
>
> Ingo, Thomas, we're seeing stop_machine hang because
>
> * All other CPUs have entered the IRQ-disabled stage.  Jiffies is not
>    being updated.
>
> * The last CPU gets caught up executing softirqs indefinitely.  As
>    jiffies doesn't get updated, it never breaks out of softirq
>    handling.  This is a deadlock.  This CPU won't break out of softirq
>    handling unless jiffies is updated and other CPUs can't do anything
>    until this CPU enters the same stop_machine stage.
>
> Ben found out that breaking out of softirq handling after a certain
> number of repetitions makes the issue go away, which isn't a proper
> fix but we might want it anyway.  What do you guys think?

Thanks,
Ben



-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: stop_machine lockup issue in 3.9.y.
  2013-06-05 21:11                                     ` Tejun Heo
  (?)
@ 2013-06-06  1:34                                       ` Eric Dumazet
  -1 siblings, 0 replies; 51+ messages in thread
From: Eric Dumazet @ 2013-06-06  1:34 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ben Greear, Rusty Russell, Joe Lawrence,
	Linux Kernel Mailing List, stable, Luis R. Rodriguez,
	Jouni Malinen, Vasanthakumar Thiagarajan,
	Senthil Balasubramanian, linux-wireless, ath9k-devel,
	Thomas Gleixner, Ingo Molnar

On Wed, 2013-06-05 at 14:11 -0700, Tejun Heo wrote:
> (cc'ing wireless crowd, tglx and Ingo.  The original thread is at
>  http://thread.gmane.org/gmane.linux.kernel/1500158/focus=55005 )
> 
> Hello, Ben.
> 
> On Wed, Jun 05, 2013 at 01:58:31PM -0700, Ben Greear wrote:
> > Hmm, wonder if I found it.  I previously saw times where it appears
> > jiffies does not increment.  __do_softirq has a break-out based on
> > jiffies timeout.  Maybe that is failing to get us out of __do_softirq
> > in my lockup case because for whatever reason the system cannot update
> > jiffies in this case?
> > 
> > I added this (probably whitespace damaged) hack and now I have not been
> > able to reproduce the problem.
> 
> Ah, nice catch. :)
> 
> > diff --git a/kernel/softirq.c b/kernel/softirq.c
> > index 14d7758..621ea3b 100644
> > --- a/kernel/softirq.c
> > +++ b/kernel/softirq.c
> > @@ -212,6 +212,7 @@ asmlinkage void __do_softirq(void)
> >         unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
> >         int cpu;
> >         unsigned long old_flags = current->flags;
> > +       unsigned long loops = 0;
> > 
> >         /*
> >          * Mask out PF_MEMALLOC s current task context is borrowed for the
> > @@ -241,6 +242,7 @@ restart:
> >                         unsigned int vec_nr = h - softirq_vec;
> >                         int prev_count = preempt_count();
> > 
> > +                       loops++;
> >                         kstat_incr_softirqs_this_cpu(vec_nr);
> > 
> >                         trace_softirq_entry(vec_nr);
> > @@ -265,7 +267,7 @@ restart:
> > 
> >         pending = local_softirq_pending();
> >         if (pending) {
> > -               if (time_before(jiffies, end) && !need_resched())
> > +               if (time_before(jiffies, end) && !need_resched() && (loops < 500))
> >                         goto restart;
> 
> So, softirq most likely kicked off from ath9k is rescheduling itself
> to the extent where it ends up locking out the CPU completely.  The
> problem is usually okay because the processing would break out in 2ms
> but as jiffies is stopped in this case with all other CPUs trapped in
> stop_machine, the loop never breaks and the machine hangs.  While
> adding the counter limit probably isn't a bad idea, softirq requeueing
> itself indefinitely sounds pretty buggy.
> 
> ath9k people, do you guys have any idea what's going on?  Why would
> softirq repeat itself indefinitely?
> 
> Ingo, Thomas, we're seeing a stop_machine hanging because
> 
> * All other CPUs entered IRQ disabled stage.  Jiffies is not being
>   updated.
> 
> * The last CPU get caught up executing softirq indefinitely.  As
>   jiffies doesn't get updated, it never breaks out of softirq
>   handling.  This is a deadlock.  This CPU won't break out of softirq
>   handling unless jiffies is updated and other CPUs can't do anything
>   until this CPU enters the same stop_machine stage.
> 
> Ben found out that breaking out of softirq handling after certain
> number of repetitions makes the issue go away, which isn't a proper
> fix but we might want anyway.  What do you guys think?
> 

Interesting....

Before 3.9 and commit c10d73671ad30f5469
("softirq: reduce latencies") we used to limit the __do_softirq() loop
to 10.




^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: stop_machine lockup issue in 3.9.y.
  2013-06-06  1:34                                       ` Eric Dumazet
  (?)
@ 2013-06-06  3:14                                         ` Tejun Heo
  -1 siblings, 0 replies; 51+ messages in thread
From: Tejun Heo @ 2013-06-06  3:14 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ben Greear, Rusty Russell, Joe Lawrence,
	Linux Kernel Mailing List, stable, Luis R. Rodriguez,
	Jouni Malinen, Vasanthakumar Thiagarajan,
	Senthil Balasubramanian, linux-wireless, ath9k-devel,
	Thomas Gleixner, Ingo Molnar

Hello, Eric.

On Wed, Jun 05, 2013 at 06:34:52PM -0700, Eric Dumazet wrote:
> > Ingo, Thomas, we're seeing a stop_machine hanging because
> > 
> > * All other CPUs entered IRQ disabled stage.  Jiffies is not being
> >   updated.
> > 
> > * The last CPU get caught up executing softirq indefinitely.  As
> >   jiffies doesn't get updated, it never breaks out of softirq
> >   handling.  This is a deadlock.  This CPU won't break out of softirq
> >   handling unless jiffies is updated and other CPUs can't do anything
> >   until this CPU enters the same stop_machine stage.
> > 
> > Ben found out that breaking out of softirq handling after certain
> > number of repetitions makes the issue go away, which isn't a proper
> > fix but we might want anyway.  What do you guys think?
> > 
> 
> Interesting....
> 
> Before 3.9 and commit c10d73671ad30f5469
> ("softirq: reduce latencies") we used to limit the __do_softirq() loop
> to 10.

Ah, so, that's why it's showing up now.  We probably have had the same
issue all along but it used to be masked by the softirq limiting.  Do
you care to revive the 10 iterations limit so that it's limited by
both the count and timing?  We do wanna find out why softirq is
spinning indefinitely tho.
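
For illustration, the "count and timing" bound can be tried out as a small
userspace simulation.  This is only a sketch under stated assumptions, not
the kernel code: every sim_/SIM_ name is made up, the tick budget is
arbitrary, and the softirq is modelled as always re-raising itself.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel constants; values are illustrative. */
#define SIM_MAX_SOFTIRQ_TIME    2      /* ticks, plays the role of MAX_SOFTIRQ_TIME */
#define SIM_MAX_SOFTIRQ_RESTART 10     /* the 10 iterations limit discussed above */
#define SIM_SAFETY_CAP          100000 /* only there to end the simulation */

/*
 * Simulate the restart loop with a softirq that always re-raises itself.
 * tick_advances:   false models jiffies standing still (all other CPUs
 *                  waiting in stop_machine with IRQs disabled).
 * use_count_limit: whether the restart counter is checked as well.
 * Returns the number of passes made before the loop gave up.
 */
unsigned long sim_softirq_passes(bool tick_advances, bool use_count_limit)
{
	unsigned long sim_jiffies = 0;
	unsigned long end = sim_jiffies + SIM_MAX_SOFTIRQ_TIME;
	int max_restart = SIM_MAX_SOFTIRQ_RESTART;
	unsigned long passes = 0;

	for (;;) {
		passes++;		/* one pass over the pending softirqs */
		if (tick_advances)
			sim_jiffies++;	/* a timer tick arrived during this pass */

		/* Work is always pending here, so only the bounds can stop us. */
		if (passes >= SIM_SAFETY_CAP)
			break;		/* the real loop would have spun forever */
		if ((long)(sim_jiffies - end) >= 0)
			break;		/* time budget used up */
		if (use_count_limit && --max_restart == 0)
			break;		/* restart budget used up */
	}
	return passes;
}
```

With the tick frozen and only the time check, the simulation runs into the
safety cap; with the counter added it stops after 10 passes regardless of
whether time moves.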

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: stop_machine lockup issue in 3.9.y.
  2013-06-06  3:14                                         ` Tejun Heo
  (?)
@ 2013-06-06  3:26                                           ` Eric Dumazet
  -1 siblings, 0 replies; 51+ messages in thread
From: Eric Dumazet @ 2013-06-06  3:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ben Greear, Rusty Russell, Joe Lawrence,
	Linux Kernel Mailing List, stable, Luis R. Rodriguez,
	Jouni Malinen, Vasanthakumar Thiagarajan,
	Senthil Balasubramanian, linux-wireless, ath9k-devel,
	Thomas Gleixner, Ingo Molnar

On Wed, 2013-06-05 at 20:14 -0700, Tejun Heo wrote:

> 
> Ah, so, that's why it's showing up now.  We probably have had the same
> issue all along but it used to be masked by the softirq limiting.  Do
> you care to revive the 10 iterations limit so that it's limited by
> both the count and timing?  We do wanna find out why softirq is
> spinning indefinitely tho.

Yes, no problem, I can do that.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: stop_machine lockup issue in 3.9.y.
  2013-06-06  3:26                                           ` Eric Dumazet
@ 2013-06-06  3:41                                             ` Ben Greear
  -1 siblings, 0 replies; 51+ messages in thread
From: Ben Greear @ 2013-06-06  3:41 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tejun Heo, Rusty Russell, Joe Lawrence,
	Linux Kernel Mailing List, stable, Luis R. Rodriguez,
	Jouni Malinen, Vasanthakumar Thiagarajan,
	Senthil Balasubramanian, linux-wireless, ath9k-devel,
	Thomas Gleixner, Ingo Molnar

On 06/05/2013 08:26 PM, Eric Dumazet wrote:
> On Wed, 2013-06-05 at 20:14 -0700, Tejun Heo wrote:
>
>>
>> Ah, so, that's why it's showing up now.  We probably have had the same
>> issue all along but it used to be masked by the softirq limiting.  Do
>> you care to revive the 10 iterations limit so that it's limited by
>> both the count and timing?  We do wanna find out why softirq is
>> spinning indefinitely tho.
>
> Yes, no problem, I can do that.

Limiting it to 5000 fixes my problem, so if you wanted it larger than 10, that would
be fine by me.

I can send a version of my patch easily enough if we can agree on the max number of
loops (and if indeed my version of the patch is acceptable).

Thanks,
Ben


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: stop_machine lockup issue in 3.9.y.
  2013-06-06  3:41                                             ` [ath9k-devel] " Ben Greear
@ 2013-06-06  3:46                                               ` Eric Dumazet
  -1 siblings, 0 replies; 51+ messages in thread
From: Eric Dumazet @ 2013-06-06  3:46 UTC (permalink / raw)
  To: Ben Greear
  Cc: Tejun Heo, Rusty Russell, Joe Lawrence,
	Linux Kernel Mailing List, stable, Luis R. Rodriguez,
	Jouni Malinen, Vasanthakumar Thiagarajan,
	Senthil Balasubramanian, linux-wireless, ath9k-devel,
	Thomas Gleixner, Ingo Molnar

On Wed, 2013-06-05 at 20:41 -0700, Ben Greear wrote:
> On 06/05/2013 08:26 PM, Eric Dumazet wrote:
> > On Wed, 2013-06-05 at 20:14 -0700, Tejun Heo wrote:
> >
> >>
> >> Ah, so, that's why it's showing up now.  We probably have had the same
> >> issue all along but it used to be masked by the softirq limiting.  Do
> >> you care to revive the 10 iterations limit so that it's limited by
> >> both the count and timing?  We do wanna find out why softirq is
> >> spinning indefinitely tho.
> >
> > Yes, no problem, I can do that.
> 
> Limiting it to 5000 fixes my problem, so if you wanted it larger than 10, that would
> be fine by me.
> 
> I can send a version of my patch easily enough if we can agree on the max number of
> loops (and if indeed my version of the patch is acceptable).

Well, 10 was the prior limit and seems really fine.

Jiffies not being updated seems like quite an exceptional condition (I hope...)

At Google we use a patch that triggers a warning if a thread holds the CPU
without checking need_resched() for more than xx ms.




^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: stop_machine lockup issue in 3.9.y.
  2013-06-06  3:46                                               ` [ath9k-devel] " Eric Dumazet
@ 2013-06-06  3:50                                                 ` Ben Greear
  -1 siblings, 0 replies; 51+ messages in thread
From: Ben Greear @ 2013-06-06  3:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tejun Heo, Rusty Russell, Joe Lawrence,
	Linux Kernel Mailing List, stable, Luis R. Rodriguez,
	Jouni Malinen, Vasanthakumar Thiagarajan,
	Senthil Balasubramanian, linux-wireless, ath9k-devel,
	Thomas Gleixner, Ingo Molnar

On 06/05/2013 08:46 PM, Eric Dumazet wrote:
> On Wed, 2013-06-05 at 20:41 -0700, Ben Greear wrote:
>> On 06/05/2013 08:26 PM, Eric Dumazet wrote:
>>> On Wed, 2013-06-05 at 20:14 -0700, Tejun Heo wrote:
>>>
>>>>
>>>> Ah, so, that's why it's showing up now.  We probably have had the same
>>>> issue all along but it used to be masked by the softirq limiting.  Do
>>>> you care to revive the 10 iterations limit so that it's limited by
>>>> both the count and timing?  We do wanna find out why softirq is
>>>> spinning indefinitely tho.
>>>
>>> Yes, no problem, I can do that.
>>
>> Limiting it to 5000 fixes my problem, so if you wanted it larger than 10, that would
>> be fine by me.
>>
>> I can send a version of my patch easily enough if we can agree on the max number of
>> loops (and if indeed my version of the patch is acceptable).
>
> Well, 10 was the prior limit and seems really fine.
>
> Jiffies not being updated seems like quite an exceptional condition (I hope...)
>
> At Google we use a patch that triggers a warning if a thread holds the CPU
> without checking need_resched() for more than xx ms.

Well, I'm sure that patch works nicely until the clock stops moving
forward :)

I'll post a patch with limit of 10 shortly.

Thanks,
Ben



-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: stop_machine lockup issue in 3.9.y.
  2013-06-06  3:50                                                 ` [ath9k-devel] " Ben Greear
@ 2013-06-06  4:08                                                   ` Eric Dumazet
  -1 siblings, 0 replies; 51+ messages in thread
From: Eric Dumazet @ 2013-06-06  4:08 UTC (permalink / raw)
  To: Ben Greear
  Cc: Tejun Heo, Rusty Russell, Joe Lawrence,
	Linux Kernel Mailing List, stable, Luis R. Rodriguez,
	Jouni Malinen, Vasanthakumar Thiagarajan,
	Senthil Balasubramanian, linux-wireless, ath9k-devel,
	Thomas Gleixner, Ingo Molnar

On Wed, 2013-06-05 at 20:50 -0700, Ben Greear wrote:
> On 06/05/2013 08:46 PM, Eric Dumazet wrote:
> >
> > At Google we use a patch that triggers a warning if a thread holds the CPU
> > without checking need_resched() for more than xx ms.
> 
> Well, I'm sure that patch works nicely until the clock stops moving
> forward :)
> 

This is not using jiffies, but the clock used in kernel/sched/core.c,
with ns resolution ;)
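
To illustrate the difference, here is a small userspace model of the two
kinds of time source.  It is a sketch only and not the patch mentioned
above: all sim_/SIM_ names are hypothetical, the 1 ms tick is an assumption,
and the nanosecond counter is assumed to be read directly from hardware
rather than maintained by the tick.

```c
#include <stdbool.h>
#include <stdint.h>

#define SIM_NS_PER_TICK 1000000ULL /* assumed 1 ms tick, for illustration only */

/* Two simulated time sources on one CPU. */
struct sim_clocks {
	uint64_t hw_ns;		/* free-running counter, read directly */
	unsigned long ticks;	/* jiffies-like, bumped only by the tick handler */
	bool tick_running;	/* false models the tick not being serviced */
};

/* Let n_ticks tick periods of real time pass. */
void sim_advance(struct sim_clocks *c, unsigned long n_ticks)
{
	c->hw_ns += (uint64_t)n_ticks * SIM_NS_PER_TICK;
	if (c->tick_running)
		c->ticks += n_ticks;
}

/* Tick-based check: sees nothing while the tick counter is frozen. */
bool hog_seen_by_ticks(const struct sim_clocks *c, unsigned long start_ticks,
		       unsigned long limit_ticks)
{
	return c->ticks - start_ticks > limit_ticks;
}

/* Nanosecond check: keeps working as long as the counter itself runs. */
bool hog_seen_by_ns(const struct sim_clocks *c, uint64_t start_ns,
		    uint64_t limit_ns)
{
	return c->hw_ns - start_ns > limit_ns;
}
```

With tick_running set to false, 50 ms of simulated time trips the
nanosecond check but never the tick-based one.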

> I'll post a patch with limit of 10 shortly.

ok



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: stop_machine lockup issue in 3.9.y.
  2013-06-06  3:41                                             ` [ath9k-devel] " Ben Greear
@ 2013-06-06 20:55                                               ` Tejun Heo
  -1 siblings, 0 replies; 51+ messages in thread
From: Tejun Heo @ 2013-06-06 20:55 UTC (permalink / raw)
  To: Ben Greear
  Cc: Eric Dumazet, Rusty Russell, Joe Lawrence,
	Linux Kernel Mailing List, stable, Luis R. Rodriguez,
	Jouni Malinen, Vasanthakumar Thiagarajan,
	Senthil Balasubramanian, linux-wireless, ath9k-devel,
	Thomas Gleixner, Ingo Molnar

Hello, Ben.

On Wed, Jun 05, 2013 at 08:41:01PM -0700, Ben Greear wrote:
> On 06/05/2013 08:26 PM, Eric Dumazet wrote:
> >On Wed, 2013-06-05 at 20:14 -0700, Tejun Heo wrote:
> >>Ah, so, that's why it's showing up now.  We probably have had the same
> >>issue all along but it used to be masked by the softirq limiting.  Do
> >>you care to revive the 10 iterations limit so that it's limited by
> >>both the count and timing?  We do wanna find out why softirq is
> >>spinning indefinitely tho.
> >
> >Yes, no problem, I can do that.
> 
> Limiting it to 5000 fixes my problem, so if you wanted it larger than 10, that would
> be fine by me.

First of all, kudos for tracking the issue down.  While the removal of
the looping limit in softirq handling was the direct cause of the
problem becoming visible, it's very bothersome that we have a softirq
runaway.  Finding the perpetrator shouldn't be hard.  Something like
the following should work (untested).  Once we know which softirq it is
(prolly the network one), we can dig deeper.

Thanks.

diff --git a/kernel/softirq.c b/kernel/softirq.c
index b5197dc..5af3682 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -212,6 +212,7 @@ asmlinkage void __do_softirq(void)
 	unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
 	int cpu;
 	unsigned long old_flags = current->flags;
+	int cnt = 0;
 
 	/*
 	 * Mask out PF_MEMALLOC s current task context is borrowed for the
@@ -244,6 +245,9 @@ restart:
 			kstat_incr_softirqs_this_cpu(vec_nr);
 
 			trace_softirq_entry(vec_nr);
+			if (++cnt >= 5000 && cnt < 5010)
+				printk("XXX __do_softirq: stuck handling softirqs, cnt=%d action=%pf\n",
+				       cnt, h->action);
 			h->action(h);
 			trace_softirq_exit(vec_nr);
 			if (unlikely(prev_count != preempt_count())) {
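
As an illustration of the "limited by both the count and timing" idea
quoted earlier in this message, here is a minimal userspace sketch. It is
not the patch that Eric or Ben posted. MAX_RESTART, TIME_BUDGET_TICKS,
fake_ticks and process_pending() are hypothetical names, and fake_ticks
merely stands in for jiffies. The point it tries to show is that a time
check alone never ends the loop if the tick counter stops advancing,
while a restart count still does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_RESTART       10	/* hypothetical cap on loop passes */
#define TIME_BUDGET_TICKS 2	/* hypothetical time budget, in fake ticks */

unsigned long fake_ticks;	/* stand-in for jiffies */

/* Simulated handler that always leaves work pending: a runaway. */
bool runaway_handler(void)
{
	return true;
}

/* Simulated handler that finishes its work in one pass. */
bool idle_handler(void)
{
	return false;
}

/*
 * Run the handler, restarting while work is pending, but stop once
 * either the time budget or the restart count is used up.
 * clock_advances selects whether fake_ticks moves forward per pass.
 * Returns the number of passes made.
 */
int process_pending(bool (*handler)(void), bool clock_advances)
{
	unsigned long end = fake_ticks + TIME_BUDGET_TICKS;
	int max_restart = MAX_RESTART;
	int passes = 0;
	bool pending;

	assert(handler != NULL);

	do {
		pending = handler();
		passes++;
		if (clock_advances)
			fake_ticks++;
	} while (pending && fake_ticks < end && --max_restart);

	return passes;
}
```

With an advancing clock the runaway handler is cut off by the time
budget; with a stalled clock only the restart count stops it.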


-- 
tejun

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: stop_machine lockup issue in 3.9.y.
  2013-06-06 20:55                                               ` [ath9k-devel] " Tejun Heo
@ 2013-06-06 21:15                                                 ` Ben Greear
  -1 siblings, 0 replies; 51+ messages in thread
From: Ben Greear @ 2013-06-06 21:15 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Eric Dumazet, Rusty Russell, Joe Lawrence,
	Linux Kernel Mailing List, stable, Luis R. Rodriguez,
	Jouni Malinen, Vasanthakumar Thiagarajan,
	Senthil Balasubramanian, linux-wireless, ath9k-devel,
	Thomas Gleixner, Ingo Molnar

On 06/06/2013 01:55 PM, Tejun Heo wrote:
> Hello, Ben.
>
> On Wed, Jun 05, 2013 at 08:41:01PM -0700, Ben Greear wrote:
>> On 06/05/2013 08:26 PM, Eric Dumazet wrote:
>>> On Wed, 2013-06-05 at 20:14 -0700, Tejun Heo wrote:
>>>> Ah, so, that's why it's showing up now.  We probably have had the same
>>>> issue all along but it used to be masked by the softirq limiting.  Do
>>>> you care to revive the 10 iterations limit so that it's limited by
>>>> both the count and timing?  We do wanna find out why softirq is
>>>> spinning indefinitely tho.
>>>
>>> Yes, no problem, I can do that.
>>
>> Limiting it to 5000 fixes my problem, so if you wanted it larger than 10, that would
>> be fine by me.
>
> First of all, kudos for tracking the issue down.  While the removal of
> looping limit in softirq handling was the direct cause for making the
> problem visible, it's very bothering that we have softirq runaway.
> Finding out the perpetrator shouldn't be hard.  Something like the
> following should work (untested).  Once we know which softirq (prolly
> the network one), we can dig deeper.

The patch below assumes my fix is not in the code, right?

I'll work on this, but it will probably be next week before
I have time...gotta catch up on some other things first.

Thanks,
Ben


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: stop_machine lockup issue in 3.9.y.
  2013-06-06 21:15                                                 ` [ath9k-devel] " Ben Greear
@ 2013-06-06 21:17                                                   ` Tejun Heo
  -1 siblings, 0 replies; 51+ messages in thread
From: Tejun Heo @ 2013-06-06 21:17 UTC (permalink / raw)
  To: Ben Greear
  Cc: Eric Dumazet, Rusty Russell, Joe Lawrence,
	Linux Kernel Mailing List, stable, Luis R. Rodriguez,
	Jouni Malinen, Vasanthakumar Thiagarajan,
	Senthil Balasubramanian, linux-wireless, ath9k-devel,
	Thomas Gleixner, Ingo Molnar

On Thu, Jun 06, 2013 at 02:15:40PM -0700, Ben Greear wrote:
> >First of all, kudos for tracking the issue down.  While the removal of
> >looping limit in softirq handling was the direct cause for making the
> >problem visible, it's very bothering that we have softirq runaway.
> >Finding out the perpetrator shouldn't be hard.  Something like the
> >following should work (untested).  Once we know which softirq (prolly
> >the network one), we can dig deeper.
> 
> The patch below assumes my fix is not in the code, right?

Yeap.

> I'll work on this, but it will probably be next week before
> I have time...gotta catch up on some other things first.

Thanks a lot for hunting this down!

-- 
tejun

^ permalink raw reply	[flat|nested] 51+ messages in thread

end of thread, other threads:[~2013-06-06 21:17 UTC | newest]

Thread overview: 51+ messages
2013-05-31 18:14 Please add to stable: module: don't unlink the module until we've removed all exposure Ben Greear
2013-06-02  5:09 ` Rusty Russell
2013-06-03  3:46   ` Joe Lawrence
2013-06-03 11:25     ` Joe Lawrence
2013-06-03 14:17       ` Joe Lawrence
2013-06-03 15:59         ` Ben Greear
2013-06-03 16:36           ` Ben Greear
2013-06-04  4:37             ` Rusty Russell
2013-06-04  5:56             ` Rusty Russell
2013-06-04 14:07               ` Joe Lawrence
2013-06-04 16:50                 ` Joe Lawrence
2013-06-04 16:53                 ` Ben Greear
2013-06-04 17:45                   ` Ben Greear
2013-06-05  4:17                     ` Rusty Russell
2013-06-05  7:15                       ` Tejun Heo
2013-06-05 16:59                         ` Ben Greear
2013-06-05 18:48                           ` Tejun Heo
2013-06-05 19:11                             ` Ben Greear
2013-06-05 19:31                               ` stop_machine lockup issue in 3.9.y Ben Greear
2013-06-05 20:58                                 ` Ben Greear
2013-06-05 21:11                                   ` Tejun Heo
2013-06-05 21:11                                     ` [ath9k-devel] " Tejun Heo
2013-06-05 21:11                                     ` Tejun Heo
2013-06-05 21:33                                     ` Ben Greear
2013-06-05 21:33                                       ` [ath9k-devel] " Ben Greear
2013-06-06  1:34                                     ` Eric Dumazet
2013-06-06  1:34                                       ` [ath9k-devel] " Eric Dumazet
2013-06-06  1:34                                       ` Eric Dumazet
2013-06-06  3:14                                       ` Tejun Heo
2013-06-06  3:14                                         ` [ath9k-devel] " Tejun Heo
2013-06-06  3:14                                         ` Tejun Heo
2013-06-06  3:26                                         ` Eric Dumazet
2013-06-06  3:26                                           ` [ath9k-devel] " Eric Dumazet
2013-06-06  3:26                                           ` Eric Dumazet
2013-06-06  3:41                                           ` Ben Greear
2013-06-06  3:41                                             ` [ath9k-devel] " Ben Greear
2013-06-06  3:46                                             ` Eric Dumazet
2013-06-06  3:46                                               ` [ath9k-devel] " Eric Dumazet
2013-06-06  3:50                                               ` Ben Greear
2013-06-06  3:50                                                 ` [ath9k-devel] " Ben Greear
2013-06-06  4:08                                                 ` Eric Dumazet
2013-06-06  4:08                                                   ` [ath9k-devel] " Eric Dumazet
2013-06-06 20:55                                             ` Tejun Heo
2013-06-06 20:55                                               ` [ath9k-devel] " Tejun Heo
2013-06-06 21:15                                               ` Ben Greear
2013-06-06 21:15                                                 ` [ath9k-devel] " Ben Greear
2013-06-06 21:17                                                 ` Tejun Heo
2013-06-06 21:17                                                   ` [ath9k-devel] " Tejun Heo
2013-06-05  3:29                 ` Please add to stable: module: don't unlink the module until we've removed all exposure Rusty Russell
2013-06-05  5:07         ` Greg KH
2013-06-05  7:13           ` Rusty Russell
