* Please add to stable: module: don't unlink the module until we've removed all exposure. @ 2013-05-31 18:14 Ben Greear 2013-06-02 5:09 ` Rusty Russell 0 siblings, 1 reply; 51+ messages in thread From: Ben Greear @ 2013-05-31 18:14 UTC (permalink / raw) To: Linux Kernel Mailing List; +Cc: rusty, stable, Joe Lawrence It turns out, the bug I spent yesterday chasing in various 3.9 kernels is apparently fixed by the commit in the title (c9c390bb5535380d40614571894ef0c00bc026ff). Fortunately, Joe Lawrence somehow saw my email to lkml and pointed me to the bug report below, which mentions the commit... https://bugzilla.kernel.org/show_bug.cgi?id=58011 Please consider adding this patch to at least the 3.9 stable queue. I have a kernel config larded up with debugging options that reproduces the bug fairly quickly on stock 3.9.[01234] (Fedora-17, 64-bit, in case that matters) if someone wants it.... Thanks, Ben -- Ben Greear <greearb@candelatech.com> Candela Technologies Inc http://www.candelatech.com ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: Please add to stable: module: don't unlink the module until we've removed all exposure. 2013-05-31 18:14 Please add to stable: module: don't unlink the module until we've removed all exposure Ben Greear @ 2013-06-02 5:09 ` Rusty Russell 2013-06-03 3:46 ` Joe Lawrence 0 siblings, 1 reply; 51+ messages in thread From: Rusty Russell @ 2013-06-02 5:09 UTC (permalink / raw) To: Ben Greear, Linux Kernel Mailing List; +Cc: stable, Joe Lawrence Ben Greear <greearb@candelatech.com> writes: > It turns out, the bug I spent yesterday chasing in various 3.9 kernels is apparently > fixed by the commit in the title (c9c390bb5535380d40614571894ef0c00bc026ff). Apparently being the operative word. This commit avoids the entire "module insert failed due to sysfs race" path in the common case, it doesn't fix any actual problem. I think the real commit you want is Linus' kobject fix a49b7e82cab0f9b41f483359be83f44fbb6b4979 "kobject: fix kset_find_obj() race with concurrent last kobject_put()". Or is that already in stable? Cheers, Rusty. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: Please add to stable: module: don't unlink the module until we've removed all exposure. 2013-06-02 5:09 ` Rusty Russell @ 2013-06-03 3:46 ` Joe Lawrence 2013-06-03 11:25 ` Joe Lawrence 0 siblings, 1 reply; 51+ messages in thread From: Joe Lawrence @ 2013-06-03 3:46 UTC (permalink / raw) To: Rusty Russell; +Cc: Ben Greear, Linux Kernel Mailing List, stable, Joe Lawrence On Sun, 2 Jun 2013, Rusty Russell wrote: > Ben Greear <greearb@candelatech.com> writes: > > > It turns out, the bug I spent yesterday chasing in various 3.9 kernels is apparently > > fixed by the commit in the title (c9c390bb5535380d40614571894ef0c00bc026ff). > > Apparently being the operative word. > > This commit avoids the entire "module insert failed due to sysfs race" > path in the common case, it doesn't fix any actual problem. > > I think the real commit you want is Linus' kobject fix > a49b7e82cab0f9b41f483359be83f44fbb6b4979 "kobject: fix kset_find_obj() > race with concurrent last kobject_put()". > > Or is that already in stable? Hi Rusty, I had pointed Ben (offlist) to that bugzilla entry without realizing there were other earlier related fixes in this space. Reviewing bz-58011, it looks like it was opened against 3.8.12, while Ben and I had encountered module loading problems in versions 3.9 and 3.9.[1-3]. I can update the bugzilla entry to add a comment noting commit a49b7e82 "kobject: fix kset_find_obj() race with concurrent last kobject_put()". That said, it doesn't appear that commit 944a1fa "module: don't unlink the module until we've removed all exposure" has made it into any stable kernel. On my system, applying this on top of 3.9 resolved a module unload/load race that would occasionally occur on boot (two video adapters of the same make, the module unloads for whatever reason and I see "module is already loaded" and "sysfs: cannot create duplicate filename '/module/mgag200'" messages in 5-10% of instances.) 
I have logs if you are interested in these warnings/crashes. Hope this clarifies things. Regards, -- Joe ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: Please add to stable: module: don't unlink the module until we've removed all exposure. 2013-06-03 3:46 ` Joe Lawrence @ 2013-06-03 11:25 ` Joe Lawrence 2013-06-03 14:17 ` Joe Lawrence 0 siblings, 1 reply; 51+ messages in thread From: Joe Lawrence @ 2013-06-03 11:25 UTC (permalink / raw) To: Joe Lawrence; +Cc: Rusty Russell, Ben Greear, Linux Kernel Mailing List, stable [fixing Cc: stable@kernel.org address] On Sun, 2 Jun 2013, Joe Lawrence wrote: > On Sun, 2 Jun 2013, Rusty Russell wrote: > > > Ben Greear <greearb@candelatech.com> writes: > > > > > It turns out, the bug I spent yesterday chasing in various 3.9 kernels is apparently > > > fixed by the commit in the title (c9c390bb5535380d40614571894ef0c00bc026ff). > > > > Apparently being the operative word. > > > > This commit avoids the entire "module insert failed due to sysfs race" > > path in the common case, it doesn't fix any actual problem. > > > > I think the real commit you want is Linus' kobject fix > > a49b7e82cab0f9b41f483359be83f44fbb6b4979 "kobject: fix kset_find_obj() > > race with concurrent last kobject_put()". > > > > Or is that already in stable? > > Hi Rusty, > > I had pointed Ben (offlist) to that bugzilla entry without realizing > there were other earlier related fixes in this space. Re-viewing bz- > 58011, it looks like it was opened against 3.8.12, while Ben and myself > had encountered module loading problems in versions 3.9 and > 3.9.[1-3]. I can update the bugzilla entry to add a comment noting commit > a49b7e82 "kobject: fix kset_find_obj() race with concurrent last > kobject_put()". > > That said, it doesn't appear that commit 944a1fa "module: don't unlink the > module until we've removed all exposure" has not made it into any stable > kernel. 
On my system, applying this on top of 3.9 resolved a module > unload/load race that would occasionally occur on boot (two video adapters > of the same make, the module unloads for whatever reason and I see "module > is already loaded" and "sysfs: cannot create duplicate filename > '/module/mgag200'" messages every 5-10% instances.) I have logs if you > were interested in these warnings/crashes. > > Hope this clarifies things. > > Regards, > > -- Joe > ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: Please add to stable: module: don't unlink the module until we've removed all exposure. 2013-06-03 11:25 ` Joe Lawrence @ 2013-06-03 14:17 ` Joe Lawrence 2013-06-03 15:59 ` Ben Greear 2013-06-05 5:07 ` Greg KH 0 siblings, 2 replies; 51+ messages in thread From: Joe Lawrence @ 2013-06-03 14:17 UTC (permalink / raw) To: Joe Lawrence; +Cc: Rusty Russell, Ben Greear, Linux Kernel Mailing List, stable [Cc: stable@vger.kernel.org] Third time is a charm? The stable address was incorrect from the first msg in this thread, but the relevant bits remain quoted below... On Mon, 3 Jun 2013, Joe Lawrence wrote: > [fixing Cc: stable@kernel.org address] > > On Sun, 2 Jun 2013, Joe Lawrence wrote: > > > On Sun, 2 Jun 2013, Rusty Russell wrote: > > > > > Ben Greear <greearb@candelatech.com> writes: > > > > > > > It turns out, the bug I spent yesterday chasing in various 3.9 kernels is apparently > > > > fixed by the commit in the title (c9c390bb5535380d40614571894ef0c00bc026ff). > > > > > > Apparently being the operative word. > > > > > > This commit avoids the entire "module insert failed due to sysfs race" > > > path in the common case, it doesn't fix any actual problem. > > > > > > I think the real commit you want is Linus' kobject fix > > > a49b7e82cab0f9b41f483359be83f44fbb6b4979 "kobject: fix kset_find_obj() > > > race with concurrent last kobject_put()". > > > > > > Or is that already in stable? > > > > Hi Rusty, > > > > I had pointed Ben (offlist) to that bugzilla entry without realizing > > there were other earlier related fixes in this space. Re-viewing bz- > > 58011, it looks like it was opened against 3.8.12, while Ben and myself > > had encountered module loading problems in versions 3.9 and > > 3.9.[1-3]. I can update the bugzilla entry to add a comment noting commit > > a49b7e82 "kobject: fix kset_find_obj() race with concurrent last > > kobject_put()". 
> > > > That said, it doesn't appear that commit 944a1fa "module: don't unlink the > > module until we've removed all exposure" has not made it into any stable > > kernel. On my system, applying this on top of 3.9 resolved a module > > unload/load race that would occasionally occur on boot (two video adapters > > of the same make, the module unloads for whatever reason and I see "module > > is already loaded" and "sysfs: cannot create duplicate filename > > '/module/mgag200'" messages every 5-10% instances.) I have logs if you > > were interested in these warnings/crashes. > > > > Hope this clarifies things. > > > > Regards, > > > > -- Joe > > > ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: Please add to stable: module: don't unlink the module until we've removed all exposure. 2013-06-03 14:17 ` Joe Lawrence @ 2013-06-03 15:59 ` Ben Greear 2013-06-03 16:36 ` Ben Greear 2013-06-05 5:07 ` Greg KH 1 sibling, 1 reply; 51+ messages in thread From: Ben Greear @ 2013-06-03 15:59 UTC (permalink / raw) To: Joe Lawrence; +Cc: Rusty Russell, Linux Kernel Mailing List, stable On 06/03/2013 07:17 AM, Joe Lawrence wrote: >>> Hi Rusty, >>> >>> I had pointed Ben (offlist) to that bugzilla entry without realizing >>> there were other earlier related fixes in this space. Re-viewing bz- >>> 58011, it looks like it was opened against 3.8.12, while Ben and myself >>> had encountered module loading problems in versions 3.9 and >>> 3.9.[1-3]. I can update the bugzilla entry to add a comment noting commit >>> a49b7e82 "kobject: fix kset_find_obj() race with concurrent last >>> kobject_put()". >>> >>> That said, it doesn't appear that commit 944a1fa "module: don't unlink the >>> module until we've removed all exposure" has not made it into any stable >>> kernel. On my system, applying this on top of 3.9 resolved a module >>> unload/load race that would occasionally occur on boot (two video adapters >>> of the same make, the module unloads for whatever reason and I see "module >>> is already loaded" and "sysfs: cannot create duplicate filename >>> '/module/mgag200'" messages every 5-10% instances.) I have logs if you >>> were interested in these warnings/crashes. It at least works around the problem for me as well. But, a more rare migration/[0-3] (I think) related lockup still exists in 3.9.4 for me, so I will also try applying that other kobject patch and continue testing today... Thanks, Ben -- Ben Greear <greearb@candelatech.com> Candela Technologies Inc http://www.candelatech.com ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: Please add to stable: module: don't unlink the module until we've removed all exposure. 2013-06-03 15:59 ` Ben Greear @ 2013-06-03 16:36 ` Ben Greear 2013-06-04 4:37 ` Rusty Russell 2013-06-04 5:56 ` Rusty Russell 0 siblings, 2 replies; 51+ messages in thread From: Ben Greear @ 2013-06-03 16:36 UTC (permalink / raw) To: Joe Lawrence; +Cc: Rusty Russell, Linux Kernel Mailing List, stable On 06/03/2013 08:59 AM, Ben Greear wrote: > On 06/03/2013 07:17 AM, Joe Lawrence wrote: > >>>> Hi Rusty, >>>> >>>> I had pointed Ben (offlist) to that bugzilla entry without realizing >>>> there were other earlier related fixes in this space. Re-viewing bz- >>>> 58011, it looks like it was opened against 3.8.12, while Ben and myself >>>> had encountered module loading problems in versions 3.9 and >>>> 3.9.[1-3]. I can update the bugzilla entry to add a comment noting commit >>>> a49b7e82 "kobject: fix kset_find_obj() race with concurrent last >>>> kobject_put()". >>>> >>>> That said, it doesn't appear that commit 944a1fa "module: don't unlink the >>>> module until we've removed all exposure" has not made it into any stable >>>> kernel. On my system, applying this on top of 3.9 resolved a module >>>> unload/load race that would occasionally occur on boot (two video adapters >>>> of the same make, the module unloads for whatever reason and I see "module >>>> is already loaded" and "sysfs: cannot create duplicate filename >>>> '/module/mgag200'" messages every 5-10% instances.) I have logs if you >>>> were interested in these warnings/crashes. > > It at least works around the problem for me as well. But, a more rare > migration/[0-3] (I think) related lockup still exists in 3.9.4 for me, > so I will also try applying that other kobject patch and continue testing > today... Well, that other kobject patch is already in 3.9.4, so I think it's still a good idea to include the "module: don't unlink the module until we've removed all exposure." patch in stable. 
I have a decent test case to reproduce the crash, so if someone wants me to test other patches instead, then I will do so. Thanks, Ben -- Ben Greear <greearb@candelatech.com> Candela Technologies Inc http://www.candelatech.com ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: Please add to stable: module: don't unlink the module until we've removed all exposure. 2013-06-03 16:36 ` Ben Greear @ 2013-06-04 4:37 ` Rusty Russell 2013-06-04 5:56 ` Rusty Russell 1 sibling, 0 replies; 51+ messages in thread From: Rusty Russell @ 2013-06-04 4:37 UTC (permalink / raw) To: Ben Greear, Joe Lawrence; +Cc: Linux Kernel Mailing List, stable Ben Greear <greearb@candelatech.com> writes: > On 06/03/2013 08:59 AM, Ben Greear wrote: >> On 06/03/2013 07:17 AM, Joe Lawrence wrote: >> >>>>> Hi Rusty, >>>>> >>>>> I had pointed Ben (offlist) to that bugzilla entry without realizing >>>>> there were other earlier related fixes in this space. Re-viewing bz- >>>>> 58011, it looks like it was opened against 3.8.12, while Ben and myself >>>>> had encountered module loading problems in versions 3.9 and >>>>> 3.9.[1-3]. I can update the bugzilla entry to add a comment noting commit >>>>> a49b7e82 "kobject: fix kset_find_obj() race with concurrent last >>>>> kobject_put()". >>>>> >>>>> That said, it doesn't appear that commit 944a1fa "module: don't unlink the >>>>> module until we've removed all exposure" has not made it into any stable >>>>> kernel. On my system, applying this on top of 3.9 resolved a module >>>>> unload/load race that would occasionally occur on boot (two video adapters >>>>> of the same make, the module unloads for whatever reason and I see "module >>>>> is already loaded" and "sysfs: cannot create duplicate filename >>>>> '/module/mgag200'" messages every 5-10% instances.) I have logs if you >>>>> were interested in these warnings/crashes. >> >> It at least works around the problem for me as well. But, a more rare >> migration/[0-3] (I think) related lockup still exists in 3.9.4 for me, >> so I will also try applying that other kobject patch and continue testing >> today... 
> > Well, that other kobject patch is already in 3.9.4, so I think it's still > a good idea to include the > "module: don't unlink the module until we've removed all exposure." > patch in stable. I have a decent test case to reproduce the crash, so if someone > wants me to test other patches instead, then I will do so. I understand your eagerness to have this resolved, but we need to understand the problem. The fix you asked for in stable was supposed to be cosmetic, to avoid the sysfs warning. But it did serve to stress the cleanup path, which may still have lurking bugs! I reproduced the oops myself on 3.8. I will chase it on 3.9.4, too. Thanks, Rusty. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: Please add to stable: module: don't unlink the module until we've removed all exposure. 2013-06-03 16:36 ` Ben Greear 2013-06-04 4:37 ` Rusty Russell @ 2013-06-04 5:56 ` Rusty Russell 2013-06-04 14:07 ` Joe Lawrence 1 sibling, 1 reply; 51+ messages in thread From: Rusty Russell @ 2013-06-04 5:56 UTC (permalink / raw) To: Ben Greear, Joe Lawrence; +Cc: Linux Kernel Mailing List, stable Ben Greear <greearb@candelatech.com> writes: >> It at least works around the problem for me as well. But, a more rare >> migration/[0-3] (I think) related lockup still exists in 3.9.4 for me, >> so I will also try applying that other kobject patch and continue testing >> today... > > Well, that other kobject patch is already in 3.9.4, so I think it's still > a good idea to include the > "module: don't unlink the module until we've removed all exposure." > patch in stable. I have a decent test case to reproduce the crash, so if someone > wants me to test other patches instead, then I will do so. OK, I cannot reproduce on 3.9.4. I #if 0'd out the WARNs in sysfs and kobject, and did this (which reliably broke on 3.8):

# M=`modinfo -F filename e1000`
# for i in `seq 10000`; do insmod $M; rmmod e1000; done >/dev/null 2>&1 &
  for i in `seq 10000`; do insmod $M; rmmod e1000; done >/dev/null 2>&1 &
  for i in `seq 10000`; do insmod $M; rmmod e1000; done >/dev/null 2>&1 &
  for i in `seq 10000`; do insmod $M; rmmod e1000; done >/dev/null 2>&1 &

This was under kvm, 4-way SMP, init=/bin/bash. Do you have a backtrace of the 3.9.4 crash? You can add "CFLAGS_module.o = -O0" to get a clearer backtrace if you want... Thanks, Rusty. ^ permalink raw reply [flat|nested] 51+ messages in thread
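The parallel load/unload loop in the message above can be sketched as a small parametrized script. This is only an illustration under stated assumptions: the names MODULE, WORKERS, ITERATIONS, DRY_RUN, stress_worker and run_stress are made up here and do not come from the thread, and the dry-run path is a placeholder. It defaults to printing the commands rather than running them, since the real loop needs root and is meant to provoke a crash.

```shell
#!/bin/sh
# Hypothetical sketch of the insmod/rmmod stress loop; all names are assumptions.
MODULE=${MODULE:-e1000}          # module name, as in the message above
WORKERS=${WORKERS:-4}            # number of concurrent loops
ITERATIONS=${ITERATIONS:-10000}  # insmod/rmmod cycles per loop
DRY_RUN=${DRY_RUN:-1}            # 1 = only print commands; 0 = really run (root)

stress_worker() {
    # $1 = path to the .ko file, $2 = module name, $3 = iteration count
    i=0
    while [ "$i" -lt "$3" ]; do
        if [ "$DRY_RUN" = 1 ]; then
            echo "insmod $1; rmmod $2"
        else
            # Errors are expected while the loops race each other, so hide them.
            insmod "$1" >/dev/null 2>&1
            rmmod "$2" >/dev/null 2>&1
        fi
        i=$((i + 1))
    done
}

run_stress() {
    if [ "$DRY_RUN" = 1 ]; then
        path="/path/to/$MODULE.ko"   # placeholder, not a real location
    else
        path=$(modinfo -F filename "$MODULE") || return 1
    fi
    w=0
    while [ "$w" -lt "$WORKERS" ]; do
        stress_worker "$path" "$MODULE" "$ITERATIONS" &
        w=$((w + 1))
    done
    wait   # block until every background loop has finished
}
```

Sourcing the file and calling run_stress with DRY_RUN=0 as root would approximate the four background loops shown above; whether it reproduces the oops depends on the kernel under test.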
* Re: Please add to stable: module: don't unlink the module until we've removed all exposure. 2013-06-04 5:56 ` Rusty Russell @ 2013-06-04 14:07 ` Joe Lawrence 2013-06-04 16:50 ` Joe Lawrence ` (2 more replies) 0 siblings, 3 replies; 51+ messages in thread From: Joe Lawrence @ 2013-06-04 14:07 UTC (permalink / raw) To: Rusty Russell; +Cc: Ben Greear, Linux Kernel Mailing List, stable On Tue, 04 Jun 2013 15:26:28 +0930 Rusty Russell <rusty@rustcorp.com.au> wrote: > Do you have a backtrace of the 3.9.4 crash? You can add "CFLAGS_module.o > = -O0" to get a clearer backtrace if you want... Hi Rusty, See my 3.9 stack traces below, which may or may not be what Ben had been seeing. If you like, I can try a similar loop as the one you were testing in the other email. Regards, -- Joe *** First instance *** ------------[ cut here ]------------ WARNING: at fs/sysfs/dir.c:536 sysfs_add_one+0xd4/0x100() Hardware name: ftServer 6400 sysfs: cannot create duplicate filename '/module/mgag200' Modules linked in: enclosure(+) mgag200(+) ghash_clmulni_intel(+) pcspkr joydev osst vhost_net st tun macvtap macvlan uinput raid1 mpt2sas(OF) raid_class qla2xxx(OF) scsi_transport_fc scsi_transport_sas sd_mod(OF) usb_storage scsi_tgt scsi_hbas(OF) i2c_algo_bit drm_kms_helper ttm drm i2c_core Pid: 733, comm: systemd-udevd Tainted: GF O 3.9.0sra_new+ #1 Call Trace: [<ffffffff81061a9f>] warn_slowpath_common+0x7f/0xc0 [<ffffffff81061b96>] warn_slowpath_fmt+0x46/0x50 [<ffffffff81319875>] ? strlcat+0x65/0x90 [<ffffffff81222914>] sysfs_add_one+0xd4/0x100 [<ffffffff81222b38>] create_dir+0x78/0xd0 [<ffffffff81222e86>] sysfs_create_dir+0x86/0xe0 [<ffffffff81313588>] kobject_add_internal+0xa8/0x270 [<ffffffff81313ab3>] kobject_init_and_add+0x63/0x90 [<ffffffff810ca34d>] load_module+0x12dd/0x2890 [<ffffffff81331670>] ? 
ddebug_proc_open+0xc0/0xc0 [<ffffffff810cb9ea>] sys_init_module+0xea/0x140 [<ffffffff81680d19>] system_call_fastpath+0x16/0x1b ---[ end trace 247a5f5f82ef192d ]--- ------------[ cut here ]------------ WARNING: at lib/kobject.c:196 kobject_add_internal+0x204/0x270() Hardware name: ftServer 6400 kobject_add_internal failed for mgag200 with -EEXIST, don't try to register things with the same name in the same directory. Modules linked in:0m] Started Conf mdio(+) coretemp(+) crc32c_intel(+) dca(+) enclosure(+) mgag200(+) ghash_clmulni_intel pcspkr joydev osst vhost_net st tun macvtap macvlan uinput raid1 mpt2sas(OF) raid_class qla2xxx(OF) scsi_transport_fc scsi_transport_sas sd_mod(OF) usb_storage scsi_tgt scsi_hbas(OF) i2c_algo_bit drm_kms_helper ttm drm i2c_core Pid: 733, comm: systemd-udevd Tainted: GF W O 3.9.0sra_new+ #1 Call Trace: [<ffffffff81061a9f>] warn_slowpath_common+0x7f/0xc0 [<ffffffff81061b96>] warn_slowpath_fmt+0x46/0x50 [<ffffffff813136e4>] kobject_add_internal+0x204/0x270 [<ffffffff81313ab3>] kobject_init_and_add+0x63/0x90 [<ffffffff810ca34d>] load_module+0x12dd/0x2890 [<ffffffff81331670>] ? 
ddebug_proc_open+0xc0/0xc0 [<ffffffff810cb9ea>] sys_init_module+0xea/0x140 [<ffffffff81680d19>] system_call_fastpath+0x16/0x1b ---[ end trace 247a5f5f82ef192e ]--- *** Second instance *** mgag200: module is already loaded igb: Intel(R) Gigabit Ethernet Network Driver - version 4.1.2-k BUG: unable to handle kernel paging request at ffffffffa01d060c IP: [<ffffffff81313276>] kobject_del+0x16/0x40 PGD 1c0f067 PUD 1c10063 PMD 851372067 PTE 0 Oops: 0002 [#1] SMP Modules linked in: ixgbe(OF+) igb(OF+) mgag200(+) ptp pps_core mdio dca coretemp crc32c_intel pcspkr ghash_clmulni_intel vhost_net tun macvtap macvlan uinput raid1 usb_storage mpt2sas(OF) raid_class qla2xxx(OF) scsi_transport_fc scsi_transport_sas sd_mod(OF) scsi_tgt scsi_hbas(OF) i2c_algo_bit drm_kms_helper ttm drm i2c_core CPU 28 Pid: 719, comm: systemd-udevd Tainted: GF O 3.9.0sra_new+ #1 Stratus ftServer 6400/G7LAZ RIP: 0010:[<ffffffff81313276>] [<ffffffff81313276>] kobject_del+0x16/0x40 RSP: 0018:ffff88103814fd08 EFLAGS: 00010292 RAX: 0000000000000200 RBX: ffffffffa01d05d0 RCX: 0000000100250004 RDX: ffff88103814ffd8 RSI: 0000000000250004 RDI: 0000000000000246 RBP: ffff88103814fd18 R08: ffff88103814fa80 R09: 0000000000000000 R10: ffff88085f821d40 R11: 0000000000000025 R12: ffffffff81c412c0 R13: ffff880852c8cfc0 R14: ffffffffa01e0580 R15: ffffffffa01e0598 FS: 00007fc98fe6c840(0000) GS:ffff88107fd80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffa01d060c CR3: 0000001038137000 CR4: 00000000000407e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process systemd-udevd (pid: 719, threadinfo ffff88103814e000, task ffff8810380d98a0) Stack: ffff88103814fd18 ffffffffa01d05d0 ffff88103814fd48 ffffffff81313302 ffff88103814fd78 ffffffffa01d05d0 ffffffffa01d05d0 ffffffffffffffea ffff88103814fd68 ffffffff8131348b 00000000ffff8000 ffff88103814fee8 Call Trace: [<ffffffff81313302>] 
kobject_cleanup+0x62/0x1b0 [<ffffffff8131348b>] kobject_put+0x2b/0x60 [<ffffffff810cb8f1>] load_module+0x2881/0x2890 [<ffffffff81331670>] ? ddebug_proc_open+0xc0/0xc0 [<ffffffff810cb9ea>] sys_init_module+0xea/0x140 [<ffffffff81680d19>] system_call_fastpath+0x16/0x1b Code: 02 00 00 48 8b 5d f0 4c 8b 65 f8 c9 c3 0f 1f 84 00 00 00 00 00 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 85 ff 74 22 e8 7a fc f0 ff <80> 63 3c fd 48 89 df e8 6e ff ff ff 48 8b 7b 18 e8 d5 01 00 00 RIP [<ffffffff81313276>] kobject_del+0x16/0x40 RSP <ffff88103814fd08> CR2: ffffffffa01d060c ---[ end trace e320c2319820c81a ]--- Kernel panic - not syncing: Fatal exception ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: Please add to stable: module: don't unlink the module until we've removed all exposure. 2013-06-04 14:07 ` Joe Lawrence @ 2013-06-04 16:50 ` Joe Lawrence 2013-06-04 16:53 ` Ben Greear 2013-06-05 3:29 ` Please add to stable: module: don't unlink the module until we've removed all exposure Rusty Russell 2 siblings, 0 replies; 51+ messages in thread From: Joe Lawrence @ 2013-06-04 16:50 UTC (permalink / raw) To: Joe Lawrence; +Cc: Rusty Russell, Ben Greear, Linux Kernel Mailing List, stable On Tue, 4 Jun 2013, Joe Lawrence wrote: > Hi Rusty, > > See my 3.9 stack traces below, which may or may not be what Ben had > been seeing. If you like, I can try a similar loop as the one you were > testing in the other email. I tried a modified version of your module load/unload loop (only insmod was needed, as the module initialization routine returns -EINVAL to mimic mgag200 with an incorrect modeset value). This crashed right out of the chute on 3.9.4, but is still running OK with 3.9 + commit 944a1fa "module: don't unlink the module until we've removed all exposure". 
-- Joe

test_mod.c :

#include <linux/module.h>
#include <linux/delay.h>

MODULE_LICENSE("GPL");

/* Always fail, mimicking mgag200 loaded with an incorrect modeset value. */
static int test_mod_init(void)
{
	return -EINVAL;
}

static void test_mod_exit(void)
{
}

module_init(test_mod_init);
module_exit(test_mod_exit);

from the console log :

test_mod: module verification failed: signature and/or required key missing - tainting kernel ------------[ cut here ]------------ WARNING: at fs/sysfs/dir.c:536 sysfs_add_one+0xd4/0x100() Hardware name: ftServer 6400 sysfs: cannot create duplicate filename '/module/test_mod' Modules linked in: test_mod(OF+) ebtable_nat nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 bonding xt_conntrack nf_conntrack ib_iser rdma_cm ebtable_filter ib_addr ebtables iw_cm ib_cm ib_sa ib_mad ip6table_filter ib_core ip6_tables iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi dm_multipath coretemp crc32c_intel ghash_clmulni_intel pcspkr ixgbe joydev mdio igb ptp pps_core dca vhost_net tun macvtap macvlan uinput raid1 sd_mod i2c_algo_bit drm_kms_helper ttm drm usb_storage mpt2sas raid_class scsi_transport_sas i2c_core Pid: 8466, comm: insmod Tainted: GF O 3.9.4 #1 Call Trace: [<ffffffff8106159f>] warn_slowpath_common+0x7f/0xc0 [<ffffffff81061696>] warn_slowpath_fmt+0x46/0x50 [<ffffffff81319895>] ? strlcat+0x65/0x90 [<ffffffff81222784>] sysfs_add_one+0xd4/0x100 [<ffffffff812229a8>] create_dir+0x78/0xd0 [<ffffffff81222cf6>] sysfs_create_dir+0x86/0xe0 [<ffffffff813135a8>] kobject_add_internal+0xa8/0x270 [<ffffffff81313ad3>] kobject_init_and_add+0x63/0x90 [<ffffffff810c9f9d>] load_module+0x12dd/0x2890 [<ffffffff81331690>] ? 
ddebug_proc_open+0xc0/0xc0 [<ffffffff810cb63a>] sys_init_module+0xea/0x140 [<ffffffff81681119>] system_call_fastpath+0x16/0x1b ---[ end trace 54bd469258bec620 ]--- ------------[ cut here ]------------ WARNING: at lib/kobject.c:196 kobject_add_internal+0x204/0x270() Hardware name: ftServer 6400 kobject_add_internal failed for test_mod with -EEXIST, don't try to register things with the same name in the same directory. Modules linked in: test_mod(OF+) ebtable_nat nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 bonding xt_conntrack nf_conntrack ib_iser rdma_cm ebtable_filter ib_addr ebtables iw_cm ib_cm ib_sa ib_mad ip6table_filter ib_core ip6_tables iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi dm_multipath coretemp crc32c_intel ghash_clmulni_intel pcspkr ixgbe joydev mdio igb ptp pps_core dca vhost_net tun macvtap macvlan uinput raid1 sd_mod i2c_algo_bit drm_kms_helper ttm drm usb_storage mpt2sas raid_class scsi_transport_sas i2c_core Pid: 8466, comm: insmod Tainted: GF W O 3.9.4 #1 Call Trace: [<ffffffff8106159f>] warn_slowpath_common+0x7f/0xc0 [<ffffffff81061696>] warn_slowpath_fmt+0x46/0x50 [<ffffffff81313704>] kobject_add_internal+0x204/0x270 [<ffffffff81313ad3>] kobject_init_and_add+0x63/0x90 [<ffffffff810c9f9d>] load_module+0x12dd/0x2890 [<ffffffff81331690>] ? 
ddebug_proc_open+0xc0/0xc0 [<ffffffff810cb63a>] sys_init_module+0xea/0x140 [<ffffffff81681119>] system_call_fastpath+0x16/0x1b ---[ end trace 54bd469258bec621 ]--- test_mod: module is already loaded test_mod: module is already loaded BUG: unable to handle kernel paging request at ffffffffa02ed08c IP: [<ffffffff81313491>] kobject_put+0x11/0x60 PGD 1c0f067 PUD 1c10063 PMD 84dd68067 PTE 0 Oops: 0000 [#1] SMP Modules linked in: test_mod(OF+) ebtable_nat nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 bonding xt_conntrack nf_conntrack ib_iser rdma_cm ebtable_filter ib_addr ebtables iw_cm ib_cm ib_sa ib_mad ip6table_filter ib_core ip6_tables iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi dm_multipath coretemp crc32c_intel ghash_clmulni_intel pcspkr ixgbe joydev mdio igb ptp pps_core dca vhost_net tun macvtap macvlan uinput raid1 sd_mod i2c_algo_bit drm_kms_helper ttm drm usb_storage mpt2sas raid_class scsi_transport_sas i2c_core CPU 25 Pid: 8551, comm: insmod Tainted: GF W O 3.9.4 #1 Stratus ftServer 6400/G7LAZ RIP: 0010:[<ffffffff81313491>] [<ffffffff81313491>] kobject_put+0x11/0x60 RSP: 0018:ffff881050b95d58 EFLAGS: 00010286 RAX: 0000000000000022 RBX: ffffffffa02ed050 RCX: ffff88107fd2fba8 RDX: 0000000000000000 RSI: ffff88107fd2df58 RDI: ffffffffa02ed050 RBP: ffff881050b95d68 R08: ffffffff81ce2080 R09: 00000000000007c6 R10: 0000000000000000 R11: 00000000000007c5 R12: ffffffffa02ed050 R13: ffffffffffffffea R14: ffffffffa035c000 R15: ffffffffa035c018 FS: 00007fd0768a3740(0000) GS:ffff88107fd20000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffa02ed08c CR3: 0000001050bd7000 CR4: 00000000000407e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process insmod (pid: 8551, threadinfo 
ffff881050b94000, task ffff88105169c9e0) Stack: 00000000ffff8000 ffff881050b95ee8 ffff881050b95ed8 ffffffff810cb541 ffffffff81331690 ffffc90017037fff ffffc90017038000 ffffffff00000002 ffffc900170220e0 ffffc90000000003 ffffffffa02c1270 00000000000002a0 Call Trace: [<ffffffff810cb541>] load_module+0x2881/0x2890 [<ffffffff81331690>] ? ddebug_proc_open+0xc0/0xc0 [<ffffffff810cb63a>] sys_init_module+0xea/0x140 [<ffffffff81681119>] system_call_fastpath+0x16/0x1b Code: 01 00 e9 10 ff ff ff 0f 1f 00 55 48 83 ef 38 48 89 e5 e8 43 fe ff ff 5d c3 90 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 85 ff 74 1a <f6> 47 3c 01 74 21 f0 83 6b 38 01 0f 94 c0 84 c0 74 08 48 89 df RIP [<ffffffff81313491>] kobject_put+0x11/0x60 RSP <ffff881050b95d58> CR2: ffffffffa02ed08c ^ permalink raw reply [flat|nested] 51+ messages in thread
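When comparing runs such as the ones logged above, it can help to count how often each failure signature shows up in a saved console log. The following is a rough sketch, not something posted in the thread: the function name count_race_signatures and the idea of a saved log file are assumptions, while the four message strings are copied from the logs in this thread.

```shell
# Hypothetical helper: count the race signatures from this thread in a log file.
# Usage: count_race_signatures /path/to/saved-console.log
count_race_signatures() {
    log=$1
    for pattern in \
        "sysfs: cannot create duplicate filename '/module/" \
        "kobject_add_internal failed for" \
        "module is already loaded" \
        "BUG: unable to handle kernel paging request"
    do
        # grep -c prints the number of matching lines; -F matches the text literally.
        printf '%s\t%s\n' "$(grep -cF -- "$pattern" "$log")" "$pattern"
    done
}
```

Each output line is a count, a tab, and the signature. A non-zero count for the paging-request line would separate a real oops from the warning-only cases.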
* Re: Please add to stable: module: don't unlink the module until we've removed all exposure. 2013-06-04 14:07 ` Joe Lawrence 2013-06-04 16:50 ` Joe Lawrence @ 2013-06-04 16:53 ` Ben Greear 2013-06-04 17:45 ` Ben Greear 2013-06-05 3:29 ` Please add to stable: module: don't unlink the module until we've removed all exposure Rusty Russell 2 siblings, 1 reply; 51+ messages in thread From: Ben Greear @ 2013-06-04 16:53 UTC (permalink / raw) To: Joe Lawrence; +Cc: Rusty Russell, Linux Kernel Mailing List, stable On 06/04/2013 07:07 AM, Joe Lawrence wrote: > On Tue, 04 Jun 2013 15:26:28 +0930 > Rusty Russell <rusty@rustcorp.com.au> wrote: > >> Do you have a backtrace of the 3.9.4 crash? You can add "CFLAGS_module.o >> = -O0" to get a clearer backtrace if you want... > > Hi Rusty, > > See my 3.9 stack traces below, which may or may not be what Ben had > been seeing. If you like, I can try a similar loop as the one you were > testing in the other email. My stack traces are similar. I had better luck reproducing the problem once I enabled lots of debugging (slub memory poisoning, lockdep, object debugging, etc). I'm using Fedora 17 on a 2-core core-i7 (4 CPU threads total) for most of this testing. We reproduced on a dual-core Atom system as well (32-bit Fedora 14 and Fedora 17). Relatively standard hardware as far as I know. I'll run the insmod/rmmod stress test on my patched systems and see if I can reproduce with the patch in the title applied. Rusty: I'm also seeing lockups related to migration on stock 3.9.4+ (with and without the 'don't unlink the module...' patch). They are much harder to reproduce. But, that code appears to be mostly called during module load/unload, so it's possible it is related. The first traces are from a system with local patches applied, but a later post by me has traces from a clean upstream kernel. 
Further debugging showed that this could be a race, because it seems that all migration/ threads think they are done with their state machine, but the atomic thread counter sits at 1, so no progress is ever made. http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg443471.html Thanks, Ben -- Ben Greear <greearb@candelatech.com> Candela Technologies Inc http://www.candelatech.com ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: Please add to stable: module: don't unlink the module until we've removed all exposure. 2013-06-04 16:53 ` Ben Greear @ 2013-06-04 17:45 ` Ben Greear 2013-06-05 4:17 ` Rusty Russell 0 siblings, 1 reply; 51+ messages in thread From: Ben Greear @ 2013-06-04 17:45 UTC (permalink / raw) To: Joe Lawrence; +Cc: Rusty Russell, Linux Kernel Mailing List, stable On 06/04/2013 09:53 AM, Ben Greear wrote: > On 06/04/2013 07:07 AM, Joe Lawrence wrote: >> On Tue, 04 Jun 2013 15:26:28 +0930 >> Rusty Russell <rusty@rustcorp.com.au> wrote: >> >>> Do you have a backtrace of the 3.9.4 crash? You can add "CFLAGS_module.o >>> = -O0" to get a clearer backtrace if you want... >> >> Hi Rusty, >> >> See my 3.9 stack traces below, which may or may not be what Ben had >> been seeing. If you like, I can try a similar loop as the one you were >> testing in the other email. > > My stack traces are similar. I had better luck reproducing the problem > once I enabled lots of debugging (slub memory poisoning, lockdep, > object debugging, etc). > > I'm using Fedora 17 on 2-core core-i7 (4 CPU threads total) for most of this > testing. We reproduced on dual-core Atom system as well > (32-bit Fedora 14 and Fedora 17). Relatively standard hardware as far > as I know. > > I'll run the insmod/rmmod stress test on my patched systems > and see if I can reproduce with the patch in the title applied. > > Rusty: I'm also seeing lockups related to migration on stock 3.9.4+ > (with and without the 'don't unlink the module...' patch. Much harder > to reproduce. But, that code appears to be mostly called during > module load/unload, so it's possible it is related. The first > traces are from a system with local patches, applied, but a later > post by me has traces from clean upstream kernel. 
>
> Further debugging showed that this could be a race, because it seems
> that all migration/ threads think they are done with their state machine,
> but the atomic thread counter sits at 1, so no progress is ever made.
>
> http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg443471.html

I reproduced the migration deadlock after a while by loading and unloading
the macvlan module with this command:

for i in `seq 10000`; do modprobe macvlan; rmmod macvlan; done

I did not see the kobj crash, but this kernel was running your patch
that makes the problem go away for me, for whatever reason.

I have some printk debugging in (see bottom of email) and was using a serial console, so things
were probably running a bit slower than on most systems. Here is a trace
from my kernel with local patches and not so much debugging enabled
(this is NOT a clean upstream kernel, though I reproduced the same thing
with a clean upstream 3.9.4 kernel plus your module unlink patch yesterday).

__stop_machine, num-threads: 4, fn: __try_stop_module data: ffff8801c6ae7f28
cpu: 0 loops: 1 jiffies: 4299011449 timeout: 4299011448 curstate: 0 smdata->state: 1 thread_ack: 4
cpu: 1 loops: 1 jiffies: 4299011449 timeout: 4299011448 curstate: 0 smdata->state: 1 thread_ack: 4
cpu: 2 loops: 1 jiffies: 4299011449 timeout: 4299011448 curstate: 0 smdata->state: 1 thread_ack: 3
cpu: 3 loops: 1 jiffies: 4299011449 timeout: 4299011448 curstate: 0 smdata->state: 1 thread_ack: 2
__stop_machine, num-threads: 4, fn: __unlink_module data: ffffffffa0aeeab0
cpu: 0 loops: 1 jiffies: 4299011501 timeout: 4299011500 curstate: 0 smdata->state: 1 thread_ack: 4
cpu: 1 loops: 1 jiffies: 4299011501 timeout: 4299011500 curstate: 0 smdata->state: 1 thread_ack: 4
cpu: 3 loops: 1 jiffies: 4299011501 timeout: 4299011500 curstate: 0 smdata->state: 1 thread_ack: 3
ath: wiphy0: Failed to stop TX DMA, queues=0x005!
cpu: 2 loops: 1 jiffies: 4299011501 timeout: 4299011500 curstate: 0 smdata->state: 1 thread_ack: 2 BUG: soft lockup - CPU#3 stuck for 23s! [migration/3:29] Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_po] CPU 3 Pid: 29, comm: migration/3 Tainted: G C O 3.9.4+ #60 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E. RIP: 0010:[<ffffffff8109d69c>] [<ffffffff8109d69c>] tasklet_action+0x46/0xcc RSP: 0000:ffff88022bd83ed8 EFLAGS: 00000282 RAX: ffff88022bd8e080 RBX: ffff8802222a4000 RCX: ffffffff81a90f06 RDX: ffff88021f48afa8 RSI: 0000000000000000 RDI: ffffffff81a050b0 RBP: ffff88022bd83ee8 R08: 0000000000000000 R09: 0000000000000000 R10: 00000000000005f2 R11: 00000000fd010018 R12: ffff88022bd83e48 R13: ffffffff815d145d R14: ffff88022bd83ee8 R15: ffff880222282000 FS: 0000000000000000(0000) GS:ffff88022bd80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007fa8a06d5000 CR3: 00000002007ea000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process migration/3 (pid: 29, threadinfo ffff880222282000, task ffff880222279770) Stack: ffffffff81a050b0 ffff880222282000 ffff88022bd83f78 ffffffff8109db1f ffff88022bd83f08 ffff880222282010 ffff880222283fd8 04208040810b79ef 00000001003db5d8 000000032bd8e150 ffff880222282000 0000000000000030 Call Trace: <IRQ> [<ffffffff8109db1f>] __do_softirq+0x107/0x23c [<ffffffff8109dce6>] irq_exit+0x4b/0xa8 [<ffffffff815d257f>] smp_apic_timer_interrupt+0x8b/0x99 [<ffffffff815d145d>] apic_timer_interrupt+0x6d/0x80 <EOI> [<ffffffff810f9984>] ? stop_machine_cpu_stop+0xc7/0x145 [<ffffffff810f9974>] ? stop_machine_cpu_stop+0xb7/0x145 [<ffffffff81016445>] ? sched_clock+0x9/0xd [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30 [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176 [<ffffffff815c9065>] ? 
__schedule+0x59f/0x5e7 [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11 [<ffffffff810b4a09>] kthread+0xb5/0xbd [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 [<ffffffff815d07ac>] ret_from_fork+0x7c/0xb0 [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 Code: 48 8b 14 25 80 e0 00 00 65 48 c7 04 25 80 e0 00 00 00 00 00 00 65 48 03 04 25 d8 da 00 00 65 48 89 04 25 88 e0 00 00 fb ------------[ cut here ]------------ WARNING: at /home/greearb/git/linux-3.9.dev.y/kernel/watchdog.c:245 watchdog_overflow_callback+0x9b/0xa6() Hardware name: To be filled by O.E.M. Watchdog detected hard LOCKUP on cpu 1 Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_po] Pid: 17, comm: migration/1 Tainted: G C O 3.9.4+ #60 Call Trace: <NMI> [<ffffffff810963e1>] warn_slowpath_common+0x85/0x9f [<ffffffff8109649e>] warn_slowpath_fmt+0x46/0x48 [<ffffffff810c4dfb>] ? sched_clock_cpu+0x44/0xce [<ffffffff81103b70>] watchdog_overflow_callback+0x9b/0xa6 [<ffffffff8113354f>] __perf_event_overflow+0x137/0x1cb [<ffffffff8101db3f>] ? x86_perf_event_set_period+0x107/0x113 [<ffffffff81133a9e>] perf_event_overflow+0x14/0x16 [<ffffffff810230dc>] intel_pmu_handle_irq+0x2b0/0x32d [<ffffffff815cbc51>] perf_event_nmi_handler+0x19/0x1b [<ffffffff815cb4ca>] nmi_handle+0x55/0x7e [<ffffffff815cb59b>] do_nmi+0xa8/0x2db [<ffffffff815cac31>] end_repeat_nmi+0x1e/0x2e [<ffffffff810f9942>] ? stop_machine_cpu_stop+0x85/0x145 [<ffffffff810f9942>] ? stop_machine_cpu_stop+0x85/0x145 [<ffffffff810f9942>] ? stop_machine_cpu_stop+0x85/0x145 <<EOE>> [<ffffffff810c6b1c>] ? set_next_entity+0x28/0x7e [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30 [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176 [<ffffffff815c9065>] ? __schedule+0x59f/0x5e7 [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f [<ffffffff810bb21d>] ? 
test_ti_thread_flag.clone.0+0x11/0x11 [<ffffffff810b4a09>] kthread+0xb5/0xbd [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 [<ffffffff815d07ac>] ret_from_fork+0x7c/0xb0 [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 ---[ end trace dcd772d6fdf499cf ]--- ... SysRq : Show backtrace of all active CPUs sending NMI to all CPUs: NMI backtrace for cpu 2 CPU 2 Pid: 23, comm: migration/2 Tainted: G WC O 3.9.4+ #60 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E. RIP: 0010:[<ffffffff810f9987>] [<ffffffff810f9987>] stop_machine_cpu_stop+0xca/0x145 RSP: 0018:ffff880222219cf8 EFLAGS: 00000012 RAX: 00000001003db5d7 RBX: ffff8801c6ae7e18 RCX: 0000000000000097 RDX: 0000000000000002 RSI: 0000000000000006 RDI: 0000000000000246 RBP: ffff880222219d68 R08: 00000001003dc935 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801c6ae7e3c R13: 0000000000000000 R14: 0000000000000002 R15: 0000000000000002 FS: 0000000000000000(0000) GS:ffff88022bd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00000000027e2178 CR3: 000000021955d000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process migration/2 (pid: 23, threadinfo ffff880222218000, task ffff880222210000) Stack: ffff880200000001 ffff880200000002 ffff880222219d28 ffffffff810c6b1c ffff880222219d88 00ffffff8100f8a4 000000051058262e 0000000000000292 ffff880222219d58 ffff88022bd0e400 ffff8801c6ae7d08 ffff880222218000 Call Trace: [<ffffffff810c6b1c>] ? set_next_entity+0x28/0x7e [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30 [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176 [<ffffffff815c9065>] ? __schedule+0x59f/0x5e7 [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11 [<ffffffff810b4a09>] kthread+0xb5/0xbd [<ffffffff810b4954>] ? 
kthread_freezable_should_stop+0x60/0x60 [<ffffffff815d07ac>] ret_from_fork+0x7c/0xb0 [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 Code: 48 c7 c7 24 fa 80 81 89 44 24 08 8b 43 20 89 04 24 31 c0 e8 7a e1 4c 00 4c 8b 05 85 c6 9e 00 49 81 c0 88 13 00 00 f3 90 NMI backtrace for cpu 0 CPU 0 Pid: 8, comm: migration/0 Tainted: G WC O 3.9.4+ #60 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.. RIP: 0010:[<ffffffff810f9984>] [<ffffffff810f9984>] stop_machine_cpu_stop+0xc7/0x145 RSP: 0000:ffff880222145cf8 EFLAGS: 00000006 RAX: 00000001003db5d7 RBX: ffff8801c6ae7e18 RCX: 0000000000000004 RDX: 0000000000000002 RSI: ffff88022bc0de68 RDI: 0000000000000246 RBP: ffff880222145d68 R08: 00000001003dc95f R09: 0000000000000001 R10: ffff880222145bf8 R11: 0000000000000000 R12: ffff8801c6ae7e3c R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000002 FS: 0000000000000000(0000) GS:ffff88022bc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00000000006fb9a8 CR3: 000000021af43000 CR4: 00000000000007f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process migration/0 (pid: 8, threadinfo ffff880222144000, task ffff88022213aee0) Stack: ffff880200000001 ffff880200000004 ffff880222145d28 ffffffff810c6b1c ffff880222145d88 01ffffff8100f8a4 000000051023ce2c 0000000000000292 ffff880222145d98 ffff88022bc0e400 ffff8801c6ae7d08 ffff880222144000 Call Trace: [<ffffffff810c6b1c>] ? set_next_entity+0x28/0x7e [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30 [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176 [<ffffffff815c9065>] ? __schedule+0x59f/0x5e7 [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11 [<ffffffff810b4a09>] kthread+0xb5/0xbd [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 [<ffffffff815d07ac>] ret_from_fork+0x7c/0xb0 [<ffffffff810b4954>] ? 
kthread_freezable_should_stop+0x60/0x60 Code: 44 89 f6 48 c7 c7 24 fa 80 81 89 44 24 08 8b 43 20 89 04 24 31 c0 e8 7a e1 4c 00 4c 8b 05 85 c6 9e 00 49 81 c0 88 13 00 NMI backtrace for cpu 1 CPU 1 Pid: 17, comm: migration/1 Tainted: G WC O 3.9.4+ #60 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E. RIP: 0010:[<ffffffff810f9984>] [<ffffffff810f9984>] stop_machine_cpu_stop+0xc7/0x145 RSP: 0000:ffff88022217dcf8 EFLAGS: 00000012 RAX: 00000001003db5d7 RBX: ffff8801c6ae7e18 RCX: 0000000000000094 RDX: 0000000000000002 RSI: 0000000000000006 RDI: 0000000000000246 RBP: ffff88022217dd68 R08: 00000001003dc935 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801c6ae7e3c R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000002 FS: 0000000000000000(0000) GS:ffff88022bc80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007fbe801fd000 CR3: 00000001be9bb000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process migration/1 (pid: 17, threadinfo ffff88022217c000, task ffff88022216ddc0) Stack: ffff880200000001 ffff880200000004 ffff88022217dd28 ffffffff810c6b1c ffff88022217dd88 00ffffff8100f8a4 00000004c3cbfc0c 0000000000000292 ffff88022217dd58 ffff88022bc8e400 ffff8801c6ae7d08 ffff88022217c000 Call Trace: [<ffffffff810c6b1c>] ? set_next_entity+0x28/0x7e [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30 [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176 [<ffffffff815c9065>] ? __schedule+0x59f/0x5e7 [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11 [<ffffffff810b4a09>] kthread+0xb5/0xbd [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 [<ffffffff815d07ac>] ret_from_fork+0x7c/0xb0 [<ffffffff810b4954>] ? 
kthread_freezable_should_stop+0x60/0x60 Code: 44 89 f6 48 c7 c7 24 fa 80 81 89 44 24 08 8b 43 20 89 04 24 31 c0 e8 7a e1 4c 00 4c 8b 05 85 c6 9e 00 49 81 c0 88 13 00 NMI backtrace for cpu 3 CPU 3 Pid: 29, comm: migration/3 Tainted: G WC O 3.9.4+ #60 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E. RIP: 0010:[<ffffffff812f6975>] [<ffffffff812f6975>] delay_tsc+0x83/0xee RSP: 0000:ffff88022bd83b60 EFLAGS: 00000046 RAX: 00000b04f0fdf718 RBX: ffff880222282000 RCX: ffff880222282010 RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000289672 RBP: ffff88022bd83bb0 R08: 0000000000000040 R09: 0000000000000001 R10: ffff88022bd83ad0 R11: 0000000000000000 R12: 00000000f0fdf6e8 R13: 0000000000000003 R14: ffff880222282000 R15: 0000000000000001 FS: 0000000000000000(0000) GS:ffff88022bd80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007fa8a06d5000 CR3: 00000002007ea000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process migration/3 (pid: 29, threadinfo ffff880222282000, task ffff880222279770) Stack: 000000000000000f 0000000000000002 ffff880222282010 0028967200000082 ffff88022bd83b90 0000000000000001 000000000000006c 0000000000000007 0000000000000086 0000000000000001 ffff88022bd83bc0 ffffffff812f68c5 Call Trace: <IRQ> [<ffffffff812f68c5>] __const_udelay+0x28/0x2a [<ffffffff8102ff25>] arch_trigger_all_cpu_backtrace+0x66/0x7d [<ffffffff813ae154>] sysrq_handle_showallcpus+0xe/0x10 [<ffffffff813ae46b>] __handle_sysrq+0xbf/0x15b [<ffffffff813ae80e>] handle_sysrq+0x2c/0x2e [<ffffffff813c25a2>] serial8250_rx_chars+0x13c/0x1b9 [<ffffffff813c2691>] serial8250_handle_irq+0x72/0xa8 [<ffffffff813c2752>] serial8250_default_handle_irq+0x23/0x28 [<ffffffff813c148c>] serial8250_interrupt+0x4d/0xc6 [<ffffffff811046b8>] handle_irq_event_percpu+0x7a/0x1e5 [<ffffffff81104864>] handle_irq_event+0x41/0x61 [<ffffffff81107028>] 
handle_edge_irq+0xa6/0xcb [<ffffffff81011d9f>] handle_irq+0x24/0x2d [<ffffffff815d248d>] do_IRQ+0x4d/0xb4 [<ffffffff815ca5ad>] common_interrupt+0x6d/0x6d [<ffffffff810a47ee>] ? run_timer_softirq+0x24/0x1df [<ffffffff8109dabd>] ? __do_softirq+0xa5/0x23c [<ffffffff8109db8a>] ? __do_softirq+0x172/0x23c [<ffffffff8109dce6>] irq_exit+0x4b/0xa8 [<ffffffff815d257f>] smp_apic_timer_interrupt+0x8b/0x99 [<ffffffff815d145d>] apic_timer_interrupt+0x6d/0x80 <EOI> [<ffffffff810f9984>] ? stop_machine_cpu_stop+0xc7/0x145 [<ffffffff810f9974>] ? stop_machine_cpu_stop+0xb7/0x145 [<ffffffff81016445>] ? sched_clock+0x9/0xd [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30 [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176 [<ffffffff815c9065>] ? __schedule+0x59f/0x5e7 [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11 [<ffffffff810b4a09>] kthread+0xb5/0xbd [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 [<ffffffff815d07ac>] ret_from_fork+0x7c/0xb0 [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 Code: fa d1 ff 66 90 89 c2 44 29 e2 3b 55 cc 73 4a ff 4b 1c 48 8b 4d c0 48 8b 11 80 e2 08 74 0b 89 45 b8 e8 08 28 2d 00 8b 45 commit 5b783a651bb9941923cd8dbff6d0ce80d2617f97 Author: Ben Greear <greearb@candelatech.com> Date: Mon Jun 3 13:31:09 2013 -0700 debugging: Instrument the 'migration/[0-x]' process a bit. Hoping to figure out the deadlock we see occassionally due to migration processes all spinning in busy loop. 
Signed-off-by: Ben Greear <greearb@candelatech.com> diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c index c09f295..f7a1de5 100644 --- a/kernel/stop_machine.c +++ b/kernel/stop_machine.c @@ -408,6 +408,9 @@ static int stop_machine_cpu_stop(void *data) int cpu = smp_processor_id(), err = 0; unsigned long flags; bool is_active; + unsigned long loops = 0; + unsigned long start_at = jiffies; + unsigned long timeout = start_at - 1; /* * When called from stop_machine_from_inactive_cpu(), irq might @@ -422,6 +425,13 @@ static int stop_machine_cpu_stop(void *data) /* Simple state machine */ do { + loops++; + if (time_after(jiffies, timeout)) { + printk("cpu: %i loops: %lu jiffies: %lu timeout: %lu curstate: %i smdata->state: + cpu, loops, jiffies, timeout, curstate, smdata->state, + atomic_read(&smdata->thread_ack)); + timeout = jiffies + 5000; + } /* Chill out and ensure we re-read stopmachine_state. */ cpu_relax(); if (smdata->state != curstate) { @@ -473,6 +483,14 @@ int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus) /* Set the initial state and stop all online cpus. */ set_state(&smdata, STOPMACHINE_PREPARE); + { + char ksym_buf[KSYM_NAME_LEN]; + printk("__stop_machine, num-threads: %i, fn: %s data: %p\n", + smdata.num_threads, + kallsyms_lookup((unsigned long)fn, NULL, NULL, NULL, + ksym_buf), + data); + } return stop_cpus(cpu_online_mask, stop_machine_cpu_stop, &smdata); } @@ -517,10 +535,16 @@ int stop_machine_from_inactive_cpu(int (*fn)(void *), void *data, .active_cpus = cpus }; struct cpu_stop_done done; int ret; + char ksym_buf[KSYM_NAME_LEN]; /* Local CPU must be inactive and CPU hotplug in progress. 
*/ BUG_ON(cpu_active(raw_smp_processor_id())); smdata.num_threads = num_active_cpus() + 1; /* +1 for local */ + printk("stop-machine-from-inactive-cpu, num-threads: %i fn: %s(%p)\n", + smdata.num_threads, + kallsyms_lookup((unsigned long)fn, NULL, NULL, NULL, + ksym_buf), + data); /* No proper task established and can't sleep - busy wait for lock. */ while (!mutex_trylock(&stop_cpus_mutex)) Thanks, Ben -- Ben Greear <greearb@candelatech.com> Candela Technologies Inc http://www.candelatech.com ^ permalink raw reply related [flat|nested] 51+ messages in thread
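A note on the rate limiting in the debug patch above: it starts with timeout = start_at - 1, so the first pass through the loop always prints (hence "loops: 1" and a timeout one tick behind jiffies in the output), and after that it waits for jiffies to move past the new timeout. The following is only a user-space sketch of that logic, assuming a wraparound-safe comparison in the style of the kernel's time_after(); the names ticks_after, report_limiter, limiter_init and limiter_should_report are made up for illustration and are not kernel API.

```c
#include <assert.h>
#include <limits.h>

/* Wraparound-safe "a is later than b" for free-running tick counters.
 * Same idea as the kernel's time_after(); this helper is hypothetical. */
static int ticks_after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}

/* Hypothetical rate limiter mirroring the patch's start_at/timeout logic. */
struct report_limiter {
    unsigned long timeout;   /* report once 'now' is after this tick */
    unsigned long interval;  /* ticks to wait between reports */
};

static void limiter_init(struct report_limiter *r, unsigned long now,
                         unsigned long interval)
{
    assert(interval > 0);
    r->timeout = now - 1;    /* already expired, so the first check reports */
    r->interval = interval;
}

/* Returns 1 when a report is due and re-arms the limiter, else 0. */
static int limiter_should_report(struct report_limiter *r, unsigned long now)
{
    if (!ticks_after(now, r->timeout))
        return 0;
    r->timeout = now + r->interval;
    return 1;
}
```

The catch with this approach is that it depends on the tick counter advancing. If 'now' never changes after the first report, limiter_should_report() returns 0 forever, no matter how many times the loop spins.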
* Re: Please add to stable: module: don't unlink the module until we've removed all exposure. 2013-06-04 17:45 ` Ben Greear @ 2013-06-05 4:17 ` Rusty Russell 2013-06-05 7:15 ` Tejun Heo 0 siblings, 1 reply; 51+ messages in thread From: Rusty Russell @ 2013-06-05 4:17 UTC (permalink / raw) To: Ben Greear, Joe Lawrence; +Cc: Linux Kernel Mailing List, stable, Tejun Heo Ben Greear <greearb@candelatech.com> writes: > On 06/04/2013 09:53 AM, Ben Greear wrote: >> On 06/04/2013 07:07 AM, Joe Lawrence wrote: >>> On Tue, 04 Jun 2013 15:26:28 +0930 >>> Rusty Russell <rusty@rustcorp.com.au> wrote: >>> >>>> Do you have a backtrace of the 3.9.4 crash? You can add "CFLAGS_module.o >>>> = -O0" to get a clearer backtrace if you want... >>> >>> Hi Rusty, >>> >>> See my 3.9 stack traces below, which may or may not be what Ben had >>> been seeing. If you like, I can try a similar loop as the one you were >>> testing in the other email. >> >> My stack traces are similar. I had better luck reproducing the problem >> once I enabled lots of debugging (slub memory poisoning, lockdep, >> object debugging, etc). >> >> I'm using Fedora 17 on 2-core core-i7 (4 CPU threads total) for most of this >> testing. We reproduced on dual-core Atom system as well >> (32-bit Fedora 14 and Fedora 17). Relatively standard hardware as far >> as I know. >> >> I'll run the insmod/rmmod stress test on my patched systems >> and see if I can reproduce with the patch in the title applied. >> >> Rusty: I'm also seeing lockups related to migration on stock 3.9.4+ >> (with and without the 'don't unlink the module...' patch. Much harder >> to reproduce. But, that code appears to be mostly called during >> module load/unload, so it's possible it is related. The first >> traces are from a system with local patches, applied, but a later >> post by me has traces from clean upstream kernel. 
>> >> Further debugging showed that this could be a race, because it seems >> that all migration/ threads think they are done with their state machine, >> but the atomic thread counter sits at 1, so no progress is ever made. >> >> http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg443471.html > > I reproduced the migration deadlock after a while (loading and unloading > the macvlan module with this command: > > for i in `seq 10000`; do modprobe macvlan; rmmod macvlan; done > > I did not see the kobj crash, but this kernel was running your patch > that makes the problem go away for me, for whatever reason. > > I have some printk debugging in (see bottom of email) and was using a serial console, so things > were probably running a bit slower than on most systems. Here is trace > from my kernel with local patches and not so much debugging enabled > (this is NOT a clean upstream kernel, though I reproduced the same thing > with a clean upstream 3.9.4 kernel plus your module unlink patch yesterday). Tejun CC'd. We can't be running two stop machines in parallel, since there's a mutex (and there's also one in the module code). 
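For readers following along, the handshake being discussed can be modelled in user space. This is a simplified, single-threaded sketch based on my reading of kernel/stop_machine.c, not the kernel source itself: the names sm_data, sm_set_state and sm_ack_state are made up for illustration, and the memory barriers and real per-CPU stopper threads are left out. Each thread acks the current state by decrementing a shared counter; whichever thread drops the counter to zero resets it and moves everyone to the next state.

```c
#include <assert.h>
#include <stdatomic.h>

/* States modelled on the stop_machine state machine (names hypothetical). */
enum sm_state { SM_NONE, SM_PREPARE, SM_DISABLE_IRQ, SM_RUN, SM_EXIT };

struct sm_data {
    int num_threads;        /* threads that must ack each state */
    enum sm_state state;    /* state every thread is polling */
    atomic_int thread_ack;  /* acks still outstanding for 'state' */
};

/* Publish a new state and reset the ack counter for it. */
static void sm_set_state(struct sm_data *d, enum sm_state next)
{
    assert(d->num_threads > 0);
    atomic_store(&d->thread_ack, d->num_threads);
    d->state = next;
}

/* Ack the current state. The last thread to ack advances the machine.
 * Returns 1 if this call advanced the state, 0 otherwise. */
static int sm_ack_state(struct sm_data *d)
{
    if (atomic_fetch_sub(&d->thread_ack, 1) == 1) {
        sm_set_state(d, (enum sm_state)(d->state + 1));
        return 1;
    }
    return 0;
}
```

With four threads, three acks leave thread_ack at 1 and the state unchanged. If the fourth ack never arrives, every thread that has already acked keeps spinning on a state that will not change, which is the symptom described in the quoted text.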
> __stop_machine, num-threads: 4, fn: __try_stop_module data: ffff8801c6ae7f28 > cpu: 0 loops: 1 jiffies: 4299011449 timeout: 4299011448 curstate: 0 smdata->state: 1 thread_ack: 4 > cpu: 1 loops: 1 jiffies: 4299011449 timeout: 4299011448 curstate: 0 smdata->state: 1 thread_ack: 4 > cpu: 2 loops: 1 jiffies: 4299011449 timeout: 4299011448 curstate: 0 smdata->state: 1 thread_ack: 3 > cpu: 3 loops: 1 jiffies: 4299011449 timeout: 4299011448 curstate: 0 smdata->state: 1 thread_ack: 2 > __stop_machine, num-threads: 4, fn: __unlink_module data: ffffffffa0aeeab0 > cpu: 0 loops: 1 jiffies: 4299011501 timeout: 4299011500 curstate: 0 smdata->state: 1 thread_ack: 4 > cpu: 1 loops: 1 jiffies: 4299011501 timeout: 4299011500 curstate: 0 smdata->state: 1 thread_ack: 4 > cpu: 3 loops: 1 jiffies: 4299011501 timeout: 4299011500 curstate: 0 smdata->state: 1 thread_ack: 3 > ath: wiphy0: Failed to stop TX DMA, queues=0x005! > cpu: 2 loops: 1 jiffies: 4299011501 timeout: 4299011500 curstate: 0 smdata->state: 1 thread_ack: 2 What's the ath driver doing here? Or is that failure normal? Your patch was mangled (unfinished?) but this doesn't show any timeouts happening. I guess you hit STOPMACHINE_DISABLE_IRQ and turned interrupts off, so no jiffies increment. Try using the loop counter, with some value big enough that it doesn't trip normally. > BUG: soft lockup - CPU#3 stuck for 23s! [migration/3:29] > Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_po] > CPU 3 > Pid: 29, comm: migration/3 Tainted: G C O 3.9.4+ #60 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E. 
> RIP: 0010:[<ffffffff8109d69c>] [<ffffffff8109d69c>] tasklet_action+0x46/0xcc > RSP: 0000:ffff88022bd83ed8 EFLAGS: 00000282 > RAX: ffff88022bd8e080 RBX: ffff8802222a4000 RCX: ffffffff81a90f06 > RDX: ffff88021f48afa8 RSI: 0000000000000000 RDI: ffffffff81a050b0 > RBP: ffff88022bd83ee8 R08: 0000000000000000 R09: 0000000000000000 > R10: 00000000000005f2 R11: 00000000fd010018 R12: ffff88022bd83e48 > R13: ffffffff815d145d R14: ffff88022bd83ee8 R15: ffff880222282000 > FS: 0000000000000000(0000) GS:ffff88022bd80000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > CR2: 00007fa8a06d5000 CR3: 00000002007ea000 CR4: 00000000000007e0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > Process migration/3 (pid: 29, threadinfo ffff880222282000, task ffff880222279770) > Stack: > ffffffff81a050b0 ffff880222282000 ffff88022bd83f78 ffffffff8109db1f > ffff88022bd83f08 ffff880222282010 ffff880222283fd8 04208040810b79ef > 00000001003db5d8 000000032bd8e150 ffff880222282000 0000000000000030 > Call Trace: > <IRQ> > [<ffffffff8109db1f>] __do_softirq+0x107/0x23c > [<ffffffff8109dce6>] irq_exit+0x4b/0xa8 > [<ffffffff815d257f>] smp_apic_timer_interrupt+0x8b/0x99 > [<ffffffff815d145d>] apic_timer_interrupt+0x6d/0x80 > <EOI> > [<ffffffff810f9984>] ? stop_machine_cpu_stop+0xc7/0x145 > [<ffffffff810f9974>] ? stop_machine_cpu_stop+0xb7/0x145 > [<ffffffff81016445>] ? sched_clock+0x9/0xd > [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30 > [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176 > [<ffffffff815c9065>] ? __schedule+0x59f/0x5e7 > [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f > [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11 > [<ffffffff810b4a09>] kthread+0xb5/0xbd > [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 > [<ffffffff815d07ac>] ret_from_fork+0x7c/0xb0 > [<ffffffff810b4954>] ? 
kthread_freezable_should_stop+0x60/0x60 > Code: 48 8b 14 25 80 e0 00 00 65 48 c7 04 25 80 e0 00 00 00 00 00 00 65 48 03 04 25 d8 da 00 00 65 48 89 04 25 88 e0 00 00 fb > ------------[ cut here ]------------ > WARNING: at /home/greearb/git/linux-3.9.dev.y/kernel/watchdog.c:245 watchdog_overflow_callback+0x9b/0xa6() > Hardware name: To be filled by O.E.M. > Watchdog detected hard LOCKUP on cpu 1 > Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_po] > Pid: 17, comm: migration/1 Tainted: G C O 3.9.4+ #60 > Call Trace: > <NMI> [<ffffffff810963e1>] warn_slowpath_common+0x85/0x9f > [<ffffffff8109649e>] warn_slowpath_fmt+0x46/0x48 > [<ffffffff810c4dfb>] ? sched_clock_cpu+0x44/0xce > [<ffffffff81103b70>] watchdog_overflow_callback+0x9b/0xa6 > [<ffffffff8113354f>] __perf_event_overflow+0x137/0x1cb > [<ffffffff8101db3f>] ? x86_perf_event_set_period+0x107/0x113 > [<ffffffff81133a9e>] perf_event_overflow+0x14/0x16 > [<ffffffff810230dc>] intel_pmu_handle_irq+0x2b0/0x32d > [<ffffffff815cbc51>] perf_event_nmi_handler+0x19/0x1b > [<ffffffff815cb4ca>] nmi_handle+0x55/0x7e > [<ffffffff815cb59b>] do_nmi+0xa8/0x2db > [<ffffffff815cac31>] end_repeat_nmi+0x1e/0x2e > [<ffffffff810f9942>] ? stop_machine_cpu_stop+0x85/0x145 > [<ffffffff810f9942>] ? stop_machine_cpu_stop+0x85/0x145 > [<ffffffff810f9942>] ? stop_machine_cpu_stop+0x85/0x145 > <<EOE>> [<ffffffff810c6b1c>] ? set_next_entity+0x28/0x7e > [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30 > [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176 > [<ffffffff815c9065>] ? __schedule+0x59f/0x5e7 > [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f > [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11 > [<ffffffff810b4a09>] kthread+0xb5/0xbd > [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 > [<ffffffff815d07ac>] ret_from_fork+0x7c/0xb0 > [<ffffffff810b4954>] ? 
kthread_freezable_should_stop+0x60/0x60 > ---[ end trace dcd772d6fdf499cf ]--- > > ... > > > SysRq : Show backtrace of all active CPUs > sending NMI to all CPUs: > NMI backtrace for cpu 2 > CPU 2 > Pid: 23, comm: migration/2 Tainted: G WC O 3.9.4+ #60 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E. > RIP: 0010:[<ffffffff810f9987>] [<ffffffff810f9987>] stop_machine_cpu_stop+0xca/0x145 > RSP: 0018:ffff880222219cf8 EFLAGS: 00000012 > RAX: 00000001003db5d7 RBX: ffff8801c6ae7e18 RCX: 0000000000000097 > RDX: 0000000000000002 RSI: 0000000000000006 RDI: 0000000000000246 > RBP: ffff880222219d68 R08: 00000001003dc935 R09: 0000000000000000 > R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801c6ae7e3c > R13: 0000000000000000 R14: 0000000000000002 R15: 0000000000000002 > FS: 0000000000000000(0000) GS:ffff88022bd00000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > CR2: 00000000027e2178 CR3: 000000021955d000 CR4: 00000000000007e0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > Process migration/2 (pid: 23, threadinfo ffff880222218000, task ffff880222210000) > Stack: > ffff880200000001 ffff880200000002 ffff880222219d28 ffffffff810c6b1c > ffff880222219d88 00ffffff8100f8a4 000000051058262e 0000000000000292 > ffff880222219d58 ffff88022bd0e400 ffff8801c6ae7d08 ffff880222218000 > Call Trace: > [<ffffffff810c6b1c>] ? set_next_entity+0x28/0x7e > [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30 > [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176 > [<ffffffff815c9065>] ? __schedule+0x59f/0x5e7 > [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f > [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11 > [<ffffffff810b4a09>] kthread+0xb5/0xbd > [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 > [<ffffffff815d07ac>] ret_from_fork+0x7c/0xb0 > [<ffffffff810b4954>] ? 
kthread_freezable_should_stop+0x60/0x60 > Code: 48 c7 c7 24 fa 80 81 89 44 24 08 8b 43 20 89 04 24 31 c0 e8 7a e1 4c 00 4c 8b 05 85 c6 9e 00 49 81 c0 88 13 00 00 f3 90 > NMI backtrace for cpu 0 > CPU 0 > Pid: 8, comm: migration/0 Tainted: G WC O 3.9.4+ #60 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.. > RIP: 0010:[<ffffffff810f9984>] [<ffffffff810f9984>] stop_machine_cpu_stop+0xc7/0x145 > RSP: 0000:ffff880222145cf8 EFLAGS: 00000006 > RAX: 00000001003db5d7 RBX: ffff8801c6ae7e18 RCX: 0000000000000004 > RDX: 0000000000000002 RSI: ffff88022bc0de68 RDI: 0000000000000246 > RBP: ffff880222145d68 R08: 00000001003dc95f R09: 0000000000000001 > R10: ffff880222145bf8 R11: 0000000000000000 R12: ffff8801c6ae7e3c > R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000002 > FS: 0000000000000000(0000) GS:ffff88022bc00000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > CR2: 00000000006fb9a8 CR3: 000000021af43000 CR4: 00000000000007f0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > Process migration/0 (pid: 8, threadinfo ffff880222144000, task ffff88022213aee0) > Stack: > ffff880200000001 ffff880200000004 ffff880222145d28 ffffffff810c6b1c > ffff880222145d88 01ffffff8100f8a4 000000051023ce2c 0000000000000292 > ffff880222145d98 ffff88022bc0e400 ffff8801c6ae7d08 ffff880222144000 > Call Trace: > [<ffffffff810c6b1c>] ? set_next_entity+0x28/0x7e > [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30 > [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176 > [<ffffffff815c9065>] ? __schedule+0x59f/0x5e7 > [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f > [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11 > [<ffffffff810b4a09>] kthread+0xb5/0xbd > [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 > [<ffffffff815d07ac>] ret_from_fork+0x7c/0xb0 > [<ffffffff810b4954>] ? 
kthread_freezable_should_stop+0x60/0x60 > Code: 44 89 f6 48 c7 c7 24 fa 80 81 89 44 24 08 8b 43 20 89 04 24 31 c0 e8 7a e1 4c 00 4c 8b 05 85 c6 9e 00 49 81 c0 88 13 00 > NMI backtrace for cpu 1 > CPU 1 > Pid: 17, comm: migration/1 Tainted: G WC O 3.9.4+ #60 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E. > RIP: 0010:[<ffffffff810f9984>] [<ffffffff810f9984>] stop_machine_cpu_stop+0xc7/0x145 > RSP: 0000:ffff88022217dcf8 EFLAGS: 00000012 > RAX: 00000001003db5d7 RBX: ffff8801c6ae7e18 RCX: 0000000000000094 > RDX: 0000000000000002 RSI: 0000000000000006 RDI: 0000000000000246 > RBP: ffff88022217dd68 R08: 00000001003dc935 R09: 0000000000000000 > R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801c6ae7e3c > R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000002 > FS: 0000000000000000(0000) GS:ffff88022bc80000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > CR2: 00007fbe801fd000 CR3: 00000001be9bb000 CR4: 00000000000007e0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > Process migration/1 (pid: 17, threadinfo ffff88022217c000, task ffff88022216ddc0) > Stack: > ffff880200000001 ffff880200000004 ffff88022217dd28 ffffffff810c6b1c > ffff88022217dd88 00ffffff8100f8a4 00000004c3cbfc0c 0000000000000292 > ffff88022217dd58 ffff88022bc8e400 ffff8801c6ae7d08 ffff88022217c000 > Call Trace: > [<ffffffff810c6b1c>] ? set_next_entity+0x28/0x7e > [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30 > [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176 > [<ffffffff815c9065>] ? __schedule+0x59f/0x5e7 > [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f > [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11 > [<ffffffff810b4a09>] kthread+0xb5/0xbd > [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 > [<ffffffff815d07ac>] ret_from_fork+0x7c/0xb0 > [<ffffffff810b4954>] ? 
kthread_freezable_should_stop+0x60/0x60 > Code: 44 89 f6 48 c7 c7 24 fa 80 81 89 44 24 08 8b 43 20 89 04 24 31 c0 e8 7a e1 4c 00 4c 8b 05 85 c6 9e 00 49 81 c0 88 13 00 > NMI backtrace for cpu 3 > CPU 3 > Pid: 29, comm: migration/3 Tainted: G WC O 3.9.4+ #60 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E. > RIP: 0010:[<ffffffff812f6975>] [<ffffffff812f6975>] delay_tsc+0x83/0xee > RSP: 0000:ffff88022bd83b60 EFLAGS: 00000046 > RAX: 00000b04f0fdf718 RBX: ffff880222282000 RCX: ffff880222282010 > RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000289672 > RBP: ffff88022bd83bb0 R08: 0000000000000040 R09: 0000000000000001 > R10: ffff88022bd83ad0 R11: 0000000000000000 R12: 00000000f0fdf6e8 > R13: 0000000000000003 R14: ffff880222282000 R15: 0000000000000001 > FS: 0000000000000000(0000) GS:ffff88022bd80000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > CR2: 00007fa8a06d5000 CR3: 00000002007ea000 CR4: 00000000000007e0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > Process migration/3 (pid: 29, threadinfo ffff880222282000, task ffff880222279770) > Stack: > 000000000000000f 0000000000000002 ffff880222282010 0028967200000082 > ffff88022bd83b90 0000000000000001 000000000000006c 0000000000000007 > 0000000000000086 0000000000000001 ffff88022bd83bc0 ffffffff812f68c5 > Call Trace: > <IRQ> > [<ffffffff812f68c5>] __const_udelay+0x28/0x2a > [<ffffffff8102ff25>] arch_trigger_all_cpu_backtrace+0x66/0x7d > [<ffffffff813ae154>] sysrq_handle_showallcpus+0xe/0x10 > [<ffffffff813ae46b>] __handle_sysrq+0xbf/0x15b > [<ffffffff813ae80e>] handle_sysrq+0x2c/0x2e > [<ffffffff813c25a2>] serial8250_rx_chars+0x13c/0x1b9 > [<ffffffff813c2691>] serial8250_handle_irq+0x72/0xa8 > [<ffffffff813c2752>] serial8250_default_handle_irq+0x23/0x28 > [<ffffffff813c148c>] serial8250_interrupt+0x4d/0xc6 > [<ffffffff811046b8>] handle_irq_event_percpu+0x7a/0x1e5 > 
[<ffffffff81104864>] handle_irq_event+0x41/0x61 > [<ffffffff81107028>] handle_edge_irq+0xa6/0xcb > [<ffffffff81011d9f>] handle_irq+0x24/0x2d > [<ffffffff815d248d>] do_IRQ+0x4d/0xb4 > [<ffffffff815ca5ad>] common_interrupt+0x6d/0x6d > [<ffffffff810a47ee>] ? run_timer_softirq+0x24/0x1df > [<ffffffff8109dabd>] ? __do_softirq+0xa5/0x23c > [<ffffffff8109db8a>] ? __do_softirq+0x172/0x23c > [<ffffffff8109dce6>] irq_exit+0x4b/0xa8 > [<ffffffff815d257f>] smp_apic_timer_interrupt+0x8b/0x99 > [<ffffffff815d145d>] apic_timer_interrupt+0x6d/0x80 > <EOI> > [<ffffffff810f9984>] ? stop_machine_cpu_stop+0xc7/0x145 > [<ffffffff810f9974>] ? stop_machine_cpu_stop+0xb7/0x145 > [<ffffffff81016445>] ? sched_clock+0x9/0xd > [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30 > [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176 > [<ffffffff815c9065>] ? __schedule+0x59f/0x5e7 > [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f > [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11 > [<ffffffff810b4a09>] kthread+0xb5/0xbd > [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 > [<ffffffff815d07ac>] ret_from_fork+0x7c/0xb0 > [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 > Code: fa d1 ff 66 90 89 c2 44 29 e2 3b 55 cc 73 4a ff 4b 1c 48 8b 4d c0 48 8b 11 80 e2 08 74 0b 89 45 b8 e8 08 28 2d 00 8b 45 So everyone's doing stop_machine, but no one's making progress. I'd like to know why... Cheers, Rusty. ^ permalink raw reply [flat|nested] 51+ messages in thread
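The handshake that all four migration threads are spinning in can be modelled roughly as follows. This is only a toy, single-threaded Python model, assuming the 3.9-era scheme in which every CPU acknowledges each state, the last CPU to acknowledge advances the shared state, and the ack counter is re-armed to the thread count on every advance. The names `StopData`, `simulate` and `stuck_cpu` are hypothetical and exist only for this illustration; the model says nothing about why a CPU might fail to acknowledge.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# State order as assumed for the 3.9-era stop_machine code.
NONE, PREPARE, DISABLE_IRQ, RUN, EXIT = range(5)


@dataclass
class StopData:
    """Hypothetical stand-in for the shared per-call stop_machine data."""
    num_threads: int
    state: int = NONE
    thread_ack: int = 0

    def set_state(self, newstate: int) -> None:
        # Re-arm the ack counter, then publish the new state.
        self.thread_ack = self.num_threads
        self.state = newstate

    def ack_state(self) -> None:
        # The last CPU to acknowledge moves everyone to the next state.
        self.thread_ack -= 1
        if self.thread_ack == 0:
            self.set_state(self.state + 1)


def simulate(num_cpus: int, stuck_cpu: Optional[int] = None,
             max_rounds: int = 50) -> Tuple[int, int]:
    """Round-robin the per-CPU polling loops.

    Returns (final shared state, final thread_ack). `stuck_cpu` models a CPU
    that never returns to its polling loop once the shared state has reached
    DISABLE_IRQ, for whatever reason.
    """
    smdata = StopData(num_cpus)
    curstate = [NONE] * num_cpus
    smdata.set_state(PREPARE)
    for _ in range(max_rounds):
        for cpu in range(num_cpus):
            if cpu == stuck_cpu and smdata.state >= DISABLE_IRQ:
                continue  # this CPU never notices the new state
            if smdata.state != curstate[cpu] and smdata.state <= EXIT:
                curstate[cpu] = smdata.state
                smdata.ack_state()
    return smdata.state, smdata.thread_ack
```

With four CPUs and nobody stuck, `simulate(4)` runs through every state. With `simulate(4, stuck_cpu=2)` the model stops advancing at `DISABLE_IRQ` with one acknowledgement outstanding, and the other three CPUs have nothing left to do but spin.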
* Re: Please add to stable: module: don't unlink the module until we've removed all exposure. 2013-06-05 4:17 ` Rusty Russell @ 2013-06-05 7:15 ` Tejun Heo 2013-06-05 16:59 ` Ben Greear 0 siblings, 1 reply; 51+ messages in thread From: Tejun Heo @ 2013-06-05 7:15 UTC (permalink / raw) To: Rusty Russell; +Cc: Ben Greear, Joe Lawrence, Linux Kernel Mailing List, stable Hello, On Wed, Jun 05, 2013 at 01:47:43PM +0930, Rusty Russell wrote: > > I have some printk debugging in (see bottom of email) and was using a serial console, so things > > were probably running a bit slower than on most systems. Here is trace > > from my kernel with local patches and not so much debugging enabled > > (this is NOT a clean upstream kernel, though I reproduced the same thing > > with a clean upstream 3.9.4 kernel plus your module unlink patch yesterday). > > Tejun CC'd. We can't be running two stop machines in parallel, since > there's a mutex (and there's also one in the module code). > > > __stop_machine, num-threads: 4, fn: __try_stop_module data: ffff8801c6ae7f28 > > cpu: 0 loops: 1 jiffies: 4299011449 timeout: 4299011448 curstate: 0 smdata->state: 1 thread_ack: 4 > > cpu: 1 loops: 1 jiffies: 4299011449 timeout: 4299011448 curstate: 0 smdata->state: 1 thread_ack: 4 > > cpu: 2 loops: 1 jiffies: 4299011449 timeout: 4299011448 curstate: 0 smdata->state: 1 thread_ack: 3 > > cpu: 3 loops: 1 jiffies: 4299011449 timeout: 4299011448 curstate: 0 smdata->state: 1 thread_ack: 2 > > __stop_machine, num-threads: 4, fn: __unlink_module data: ffffffffa0aeeab0 > > cpu: 0 loops: 1 jiffies: 4299011501 timeout: 4299011500 curstate: 0 smdata->state: 1 thread_ack: 4 > > cpu: 1 loops: 1 jiffies: 4299011501 timeout: 4299011500 curstate: 0 smdata->state: 1 thread_ack: 4 > > cpu: 3 loops: 1 jiffies: 4299011501 timeout: 4299011500 curstate: 0 smdata->state: 1 thread_ack: 3 > > ath: wiphy0: Failed to stop TX DMA, queues=0x005! 
> > cpu: 2 loops: 1 jiffies: 4299011501 timeout: 4299011500 curstate: 0 smdata->state: 1 thread_ack: 2 A bit confused. I looked at the code again and it still seems properly interlocked. Can't see how the above scenario could happen. Actually, I'm not even sure what I'm looking at. Is the above from the lockup? It'd be helpful if we can get more traces from the locked-up state. Shouldn't be hard to detect. Dumb timeout-based detection should work fine. How easily can you reproduce the issue? Thanks. -- tejun ^ permalink raw reply [flat|nested] 51+ messages in thread
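The "dumb timeout-based detection" idea can be illustrated with a small sketch: poll a condition and report once if it has not become true within a deadline. This is a userspace Python illustration only, not kernel code; `spin_until`, `stall_after`, `give_up_after` and `report` are hypothetical names, and the give-up limit exists purely so the example terminates. In the kernel the comparable check would presumably compare jiffies against a deadline inside the spin loop and print the per-CPU state when it expires.

```python
import time
from typing import Callable


def spin_until(cond: Callable[[], bool], stall_after: float,
               report: Callable[[str], None], give_up_after: float) -> bool:
    """Poll cond() until it is true.

    Calls report() exactly once if cond() is still false after stall_after
    seconds, and returns False if it is still false after give_up_after
    seconds. Returns True as soon as cond() becomes true.
    """
    start = time.monotonic()
    reported = False
    while not cond():
        waited = time.monotonic() - start
        if not reported and waited >= stall_after:
            report(f"stalled for {waited:.3f}s waiting for state change")
            reported = True
        if waited >= give_up_after:
            return False
    return True
```

A caller would pass something like `lambda: shared_state != curstate` as `cond` and a logging function as `report`, so the first stall produces one message instead of flooding the console.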
* Re: Please add to stable: module: don't unlink the module until we've removed all exposure. 2013-06-05 7:15 ` Tejun Heo @ 2013-06-05 16:59 ` Ben Greear 2013-06-05 18:48 ` Tejun Heo 0 siblings, 1 reply; 51+ messages in thread From: Ben Greear @ 2013-06-05 16:59 UTC (permalink / raw) To: Tejun Heo; +Cc: Rusty Russell, Joe Lawrence, Linux Kernel Mailing List, stable On 06/05/2013 12:15 AM, Tejun Heo wrote: > Hello, > > On Wed, Jun 05, 2013 at 01:47:43PM +0930, Rusty Russell wrote: >>> I have some printk debugging in (see bottom of email) and was using a serial console, so things >>> were probably running a bit slower than on most systems. Here is trace >>> from my kernel with local patches and not so much debugging enabled >>> (this is NOT a clean upstream kernel, though I reproduced the same thing >>> with a clean upstream 3.9.4 kernel plus your module unlink patch yesterday). >> >> Tejun CC'd. We can't be running two stop machines in parallel, since >> there's a mutex (and there's also one in the module code). >> >>> __stop_machine, num-threads: 4, fn: __try_stop_module data: ffff8801c6ae7f28 >>> cpu: 0 loops: 1 jiffies: 4299011449 timeout: 4299011448 curstate: 0 smdata->state: 1 thread_ack: 4 >>> cpu: 1 loops: 1 jiffies: 4299011449 timeout: 4299011448 curstate: 0 smdata->state: 1 thread_ack: 4 >>> cpu: 2 loops: 1 jiffies: 4299011449 timeout: 4299011448 curstate: 0 smdata->state: 1 thread_ack: 3 >>> cpu: 3 loops: 1 jiffies: 4299011449 timeout: 4299011448 curstate: 0 smdata->state: 1 thread_ack: 2 >>> __stop_machine, num-threads: 4, fn: __unlink_module data: ffffffffa0aeeab0 >>> cpu: 0 loops: 1 jiffies: 4299011501 timeout: 4299011500 curstate: 0 smdata->state: 1 thread_ack: 4 >>> cpu: 1 loops: 1 jiffies: 4299011501 timeout: 4299011500 curstate: 0 smdata->state: 1 thread_ack: 4 >>> cpu: 3 loops: 1 jiffies: 4299011501 timeout: 4299011500 curstate: 0 smdata->state: 1 thread_ack: 3 >>> ath: wiphy0: Failed to stop TX DMA, queues=0x005! 
>>> cpu: 2 loops: 1 jiffies: 4299011501 timeout: 4299011500 curstate: 0 smdata->state: 1 thread_ack: 2 > > A bit confused. I looked at the code again and it still seems > properly interlocked. Can't see how the above scenario could happen. > Actually, I'm not even sure, what I'm looking at. Is the above from > the lockup? It'd be helpful if we can get more traces from the locked > up state. Shouldn't be hard to detect. Dumb timeout based detection > should work fine. How easily can you reproduce the issue? I can easily reproduce the problem. First, my machine info: dual-core i7, 4 CPU threads total; 8 GB RAM; 64-bit kernel, Fedora-17 64-bit OS; 2 ath9k NICs, each with 200 virtual stations configured. I don't think the virtual stations matter too much, as we reproduce easily enough with fewer, but I think that maybe the extra IRQ load helps trigger the bug. I added lots of printk debugging, which makes it easier to hit the lockup. There is an 'ath' message in the log above, but most traces have nothing related to ath in them, so I think that message is not related. (It has been a common ath error message for years and, at least on my hardware, does not appear to cause any serious problems.) Here below is another trace with even more debugging printouts. After a while, it stopped even printing out cpu-hung messages... not sure if that is an important clue or not. One pattern I notice repeating for at least most of the hangs is that all but one CPU thread has IRQs disabled and is in state 2. But there will be one thread in state 1 that still has IRQs enabled, and it is reported to be in soft-lockup instead of hard-lockup. In 'sysrq l' it always shows some IRQ processing, but typically that of the sysrq itself. I added a printk that would always print if the thread notices that smdata->state != curstate, and the soft-lockup thread (cpu 2 below) never shows that message.
I thought it might be because it was reading stale smdata->state, so I changed that to atomic_t hoping that would mitigate it. I also tried adding smp_rmb() below the cpu_relax(). Neither had any effect, so I am left assuming that the thread is instead stuck handling IRQs and never gets out of the IRQ handler. Maybe since I have 2 real cores, and 3 processes busy-spinning on their CPU cores, the remaining process can just never handle all the IRQs and get back to the cpu shutdown state machine? The various soft-hang stacks below show at least slightly different stacks, so I assume that thread is doing at least something. __stop_machine, num-threads: 4, fn: __try_stop_module(ffffffff810e86da) data: ffff8802202abf28 smdata: ffff8802202abe58 set_state, cpu: 3 state: 0 newstate: 1 smdata: ffff8802202abe58 fn: ffffffff810e86da cpu: 1 loops: 1 jiffies: 4294739830 timeout: 4294739829 curstate: 0 smdata->state: 1 thread_ack: 4 smdata: ffff8802202abe58 fn: ffffffff810e86da cpu: 3 loops: 1 jiffies: 4294739830 timeout: 4294739829 curstate: 0 smdata->state: 1 thread_ack: 4 smdata: ffff8802202abe58 fn: ffffffff810e86da state-change: cpu: 3 loops: 1 jiffies: 4294739830 curstate: 0 smdata->state: 1 thread_ack: 4 smdata: ffff8802202abe58 fn: ffffffff810e86da cpu: 2 loops: 1 jiffies: 4294739830 timeout: 4294739829 curstate: 0 smdata->state: 1 thread_ack: 4 smdata: ffff8802202abe58 fn: ffffffff810e86da cpu: 0 loops: 1 jiffies: 4294739830 timeout: 4294739829 curstate: 0 smdata->state: 1 thread_ack: 4 smdata: ffff8802202abe58 fn: ffffffff810e86da cpu: 3 ack_state jiffies: 4294739830 smdata->state: 1 thread_ack: 4 smdata: ffff8802202abe58 fn: ffffffff810e86da state-change: cpu: 2 loops: 1 jiffies: 4294739830 curstate: 0 smdata->state: 1 thread_ack: 4 smdata: ffff8802202abe58 fn: ffffffff810e86da state-change: cpu: 0 loops: 1 jiffies: 4294739830 curstate: 0 smdata->state: 1 thread_ack: 4 smdata: ffff8802202abe58 fn: ffffffff810e86da cpu: 3 ack_state end jiffies: 4294739830
smdata->state: 1 thread_ack: 3 smdata: ffff8802202abe58 fn: ffffffff810e86da cpu: 2 ack_state jiffies: 4294739830 smdata->state: 1 thread_ack: 3 smdata: ffff8802202abe58 fn: ffffffff810e86da cpu: 0 ack_state jiffies: 4294739830 smdata->state: 1 thread_ack: 3 smdata: ffff8802202abe58 fn: ffffffff810e86da cpu: 2 ack_state end jiffies: 4294739830 smdata->state: 1 thread_ack: 2 smdata: ffff8802202abe58 fn: ffffffff810e86da cpu: 0 ack_state end jiffies: 4294739830 smdata->state: 1 thread_ack: 1 smdata: ffff8802202abe58 fn: ffffffff810e86da state-change: cpu: 1 loops: 1 jiffies: 4294739981 curstate: 0 smdata->state: 1 thread_ack: 1 smdata: ffff8802202abe58 fn: ffffffff810e86da cpu: 1 ack_state jiffies: 4294739993 smdata->state: 1 thread_ack: 1 smdata: ffff8802202abe58 fn: ffffffff810e86da cpu: 1 increment state, smdata: ffff8802202abe58 fn: ffffffff810e86da set_state, cpu: 1 state: 1 newstate: 2 smdata: ffff8802202abe58 fn: ffffffff810e86da cpu: 1 ack_state end jiffies: 4294740017 smdata->state: 2 thread_ack: 4 smdata: ffff8802202abe58 fn: ffffffff810e86da state-change: cpu: 3 loops: 24479339 jiffies: 4294740017 curstate: 1 smdata->state: 2 thread_ack: 4 smdata: ffff8802202abe58 fn: ffffffff810e86da state-change: cpu: 0 loops: 23433125 jiffies: 4294740017 curstate: 1 smdata->state: 2 thread_ack: 4 smdata: ffff8802202abe58 fn: ffffffff810e86da state-change: cpu: 2 loops: 23440505 jiffies: 4294740017 curstate: 1 smdata->state: 2 thread_ack: 4 smdata: ffff8802202abe58 fn: ffffffff810e86da cpu: 3 ack_state jiffies: 4294740017 smdata->state: 2 thread_ack: 4 smdata: ffff8802202abe58 fn: ffffffff810e86da cpu: 0 ack_state jiffies: 4294740017 smdata->state: 2 thread_ack: 4 smdata: ffff8802202abe58 fn: ffffffff810e86da cpu: 2 ack_state jiffies: 4294740017 smdata->state: 2 thread_ack: 4 smdata: ffff8802202abe58 fn: ffffffff810e86da cpu: 3 ack_state end jiffies: 4294740017 smdata->state: 2 thread_ack: 3 smdata: ffff8802202abe58 fn: ffffffff810e86da cpu: 0 ack_state end jiffies: 
4294740017 smdata->state: 2 thread_ack: 2 smdata: ffff8802202abe58 fn: ffffffff810e86da cpu: 2 ack_state end jiffies: 4294740017 smdata->state: 2 thread_ack: 1 smdata: ffff8802202abe58 fn: ffffffff810e86da state-change: cpu: 1 loops: 2 jiffies: 4294740017 curstate: 1 smdata->state: 2 thread_ack: 1 smdata: ffff8802202abe58 fn: ffffffff810e86da cpu: 1 ack_state jiffies: 4294740017 smdata->state: 2 thread_ack: 1 smdata: ffff8802202abe58 fn: ffffffff810e86da cpu: 1 increment state, smdata: ffff8802202abe58 fn: ffffffff810e86da set_state, cpu: 1 state: 2 newstate: 3 smdata: ffff8802202abe58 fn: ffffffff810e86da cpu: 1 ack_state end jiffies: 4294740017 smdata->state: 3 thread_ack: 4 smdata: ffff8802202abe58 fn: ffffffff810e86da state-change: cpu: 0 loops: 42466689 jiffies: 4294740017 curstate: 2 smdata->state: 3 thread_ack: 4 smdata: ffff8802202abe58 fn: ffffffff810e86da state-change: cpu: 2 loops: 42473881 jiffies: 4294740017 curstate: 2 smdata->state: 3 thread_ack: 4 smdata: ffff8802202abe58 fn: ffffffff810e86da state-change: cpu: 3 loops: 43825073 jiffies: 4294740017 curstate: 2 smdata->state: 3 thread_ack: 4 smdata: ffff8802202abe58 fn: ffffffff810e86da calling fn: cpu: 0 loops: 42466689 curstate: 3 smdata->state: 3 thread_ack: 4 smdata: ffff8802202abe58 fn: ffffffff810e86da cpu: 2 ack_state jiffies: 4294740017 smdata->state: 3 thread_ack: 4 smdata: ffff8802202abe58 fn: ffffffff810e86da cpu: 3 ack_state jiffies: 4294740017 smdata->state: 3 thread_ack: 4 smdata: ffff8802202abe58 fn: ffffffff810e86da cpu: 0 ack_state jiffies: 4294740017 smdata->state: 3 thread_ack: 4 smdata: ffff8802202abe58 fn: ffffffff810e86da cpu: 2 ack_state end jiffies: 4294740017 smdata->state: 3 thread_ack: 3 smdata: ffff8802202abe58 fn: ffffffff810e86da cpu: 3 ack_state end jiffies: 4294740017 smdata->state: 3 thread_ack: 2 smdata: ffff8802202abe58 fn: ffffffff810e86da cpu: 0 ack_state end jiffies: 4294740017 smdata->state: 3 thread_ack: 1 smdata: ffff8802202abe58 fn: ffffffff810e86da 
state-change: cpu: 1 loops: 3 jiffies: 4294740017 curstate: 2 smdata->state: 3 thread_ack: 1 smdata: ffff8802202abe58 fn: ffffffff810e86da cpu: 1 ack_state jiffies: 4294740017 smdata->state: 3 thread_ack: 1 smdata: ffff8802202abe58 fn: ffffffff810e86da cpu: 1 increment state, smdata: ffff8802202abe58 fn: ffffffff810e86da set_state, cpu: 1 state: 3 newstate: 4 smdata: ffff8802202abe58 fn: ffffffff810e86da cpu: 1 ack_state end jiffies: 4294740017 smdata->state: 4 thread_ack: 4 smdata: ffff8802202abe58 fn: ffffffff810e86da state-change: cpu: 2 loops: 62146520 jiffies: 4294740017 curstate: 3 smdata->state: 4 thread_ack: 4 smdata: ffff8802202abe58 fn: ffffffff810e86da state-change: cpu: 0 loops: 62138953 jiffies: 4294740017 curstate: 3 smdata->state: 4 thread_ack: 4 smdata: ffff8802202abe58 fn: ffffffff810e86da state-change: cpu: 3 loops: 64625424 jiffies: 4294740017 curstate: 3 smdata->state: 4 thread_ack: 4 smdata: ffff8802202abe58 fn: ffffffff810e86da cpu: 2 ack_state jiffies: 4294740017 smdata->state: 4 thread_ack: 4 smdata: ffff8802202abe58 fn: ffffffff810e86da cpu: 0 ack_state jiffies: 4294740017 smdata->state: 4 thread_ack: 4 smdata: ffff8802202abe58 fn: ffffffff810e86da cpu: 3 ack_state jiffies: 4294740017 smdata->state: 4 thread_ack: 4 smdata: ffff8802202abe58 fn: ffffffff810e86da cpu: 2 ack_state end jiffies: 4294740017 smdata->state: 4 thread_ack: 3 smdata: ffff8802202abe58 fn: ffffffff810e86da cpu: 0 ack_state end jiffies: 4294740017 smdata->state: 4 thread_ack: 2 smdata: ffff8802202abe58 fn: ffffffff810e86da cpu: 3 ack_state end jiffies: 4294740017 smdata->state: 4 thread_ack: 1 smdata: ffff8802202abe58 fn: ffffffff810e86da state-change: cpu: 1 loops: 4 jiffies: 4294740435 curstate: 3 smdata->state: 4 thread_ack: 1 smdata: ffff8802202abe58 fn: ffffffff810e86da cpu: 1 ack_state jiffies: 4294740447 smdata->state: 4 thread_ack: 1 smdata: ffff8802202abe58 fn: ffffffff810e86da cpu: 1 increment state, smdata: ffff8802202abe58 fn: ffffffff810e86da set_state, cpu: 
1 state: 4 newstate: 5 smdata: ffff8802202abe58 fn: ffffffff810e86da cpu: 1 ack_state end jiffies: 4294740471 smdata->state: 5 thread_ack: 4 smdata: ffff8802202abe58 fn: ffffffff810e86da __stop_machine, num-threads: 4, fn: __unlink_module(ffffffff810e817b) data: ffffffffa0ac5ab0 smdata: ffff8802202abe18 set_state, cpu: 0 state: 0 newstate: 1 smdata: ffff8802202abe18 fn: ffffffff810e817b cpu: 0 loops: 1 jiffies: 4294740500 timeout: 4294740499 curstate: 0 smdata->state: 1 thread_ack: 4 smdata: ffff8802202abe18 fn: ffffffff810e817b cpu: 3 loops: 1 jiffies: 4294740500 timeout: 4294740499 curstate: 0 smdata->state: 1 thread_ack: 4 smdata: ffff8802202abe18 fn: ffffffff810e817b cpu: 1 loops: 1 jiffies: 4294740500 timeout: 4294740499 curstate: 0 smdata->state: 1 thread_ack: 4 smdata: ffff8802202abe18 fn: ffffffff810e817b state-change: cpu: 3 loops: 1 jiffies: 4294740500 curstate: 0 smdata->state: 1 thread_ack: 4 smdata: ffff8802202abe18 fn: ffffffff810e817b state-change: cpu: 1 loops: 1 jiffies: 4294740500 curstate: 0 smdata->state: 1 thread_ack: 4 smdata: ffff8802202abe18 fn: ffffffff810e817b cpu: 3 ack_state jiffies: 4294740500 smdata->state: 1 thread_ack: 4 smdata: ffff8802202abe18 fn: ffffffff810e817b cpu: 1 ack_state jiffies: 4294740500 smdata->state: 1 thread_ack: 4 smdata: ffff8802202abe18 fn: ffffffff810e817b cpu: 3 ack_state end jiffies: 4294740500 smdata->state: 1 thread_ack: 3 smdata: ffff8802202abe18 fn: ffffffff810e817b cpu: 1 ack_state end jiffies: 4294740500 smdata->state: 1 thread_ack: 2 smdata: ffff8802202abe18 fn: ffffffff810e817b cpu: 2 loops: 1 jiffies: 4294740501 timeout: 4294740500 curstate: 0 smdata->state: 1 thread_ack: 2 smdata: ffff8802202abe18 fn: ffffffff810e817b state-change: cpu: 2 loops: 1 jiffies: 4294740501 curstate: 0 smdata->state: 1 thread_ack: 2 smdata: ffff8802202abe18 fn: ffffffff810e817b cpu: 2 ack_state jiffies: 4294740501 smdata->state: 1 thread_ack: 2 smdata: ffff8802202abe18 fn: ffffffff810e817b cpu: 2 ack_state end jiffies: 
4294740501 smdata->state: 1 thread_ack: 1 smdata: ffff8802202abe18 fn: ffffffff810e817b state-change: cpu: 0 loops: 1 jiffies: 4294740651 curstate: 0 smdata->state: 1 thread_ack: 1 smdata: ffff8802202abe18 fn: ffffffff810e817b cpu: 0 ack_state jiffies: 4294740664 smdata->state: 1 thread_ack: 1 smdata: ffff8802202abe18 fn: ffffffff810e817b cpu: 0 increment state, smdata: ffff8802202abe18 fn: ffffffff810e817b set_state, cpu: 0 state: 1 newstate: 2 smdata: ffff8802202abe18 fn: ffffffff810e817b cpu: 0 ack_state end jiffies: 4294740688 smdata->state: 2 thread_ack: 4 smdata: ffff8802202abe18 fn: ffffffff810e817b state-change: cpu: 3 loops: 23659570 jiffies: 4294740688 curstate: 1 smdata->state: 2 thread_ack: 4 smdata: ffff8802202abe18 fn: ffffffff810e817b state-change: cpu: 1 loops: 23670374 jiffies: 4294740688 curstate: 1 smdata->state: 2 thread_ack: 4 smdata: ffff8802202abe18 fn: ffffffff810e817b cpu: 3 ack_state jiffies: 4294740688 smdata->state: 2 thread_ack: 4 smdata: ffff8802202abe18 fn: ffffffff810e817b cpu: 1 ack_state jiffies: 4294740688 smdata->state: 2 thread_ack: 4 smdata: ffff8802202abe18 fn: ffffffff810e817b cpu: 3 ack_state end jiffies: 4294740688 smdata->state: 2 thread_ack: 3 smdata: ffff8802202abe18 fn: ffffffff810e817b cpu: 1 ack_state end jiffies: 4294740688 smdata->state: 2 thread_ack: 2 smdata: ffff8802202abe18 fn: ffffffff810e817b state-change: cpu: 0 loops: 2 jiffies: 4294740688 curstate: 1 smdata->state: 2 thread_ack: 2 smdata: ffff8802202abe18 fn: ffffffff810e817b cpu: 0 ack_state jiffies: 4294740688 smdata->state: 2 thread_ack: 2 smdata: ffff8802202abe18 fn: ffffffff810e817b cpu: 0 ack_state end jiffies: 4294740688 smdata->state: 2 thread_ack: 1 smdata: ffff8802202abe18 fn: ffffffff810e817b ------------[ cut here ]------------ WARNING: at /home/greearb/git/linux-3.9.dev.y/kernel/watchdog.c:245 watchdog_overflow_callback+0x9b/0xa6() Hardware name: To be filled by O.E.M. 
Watchdog detected hard LOCKUP on cpu 0 Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm gpio] Pid: 8, comm: migration/0 Tainted: G C O 3.9.4+ #70 Call Trace: <NMI> [<ffffffff810963e1>] warn_slowpath_common+0x85/0x9f [<ffffffff8109649e>] warn_slowpath_fmt+0x46/0x48 [<ffffffff810c4dfb>] ? sched_clock_cpu+0x44/0xce [<ffffffff81103cec>] watchdog_overflow_callback+0x9b/0xa6 [<ffffffff811336cb>] __perf_event_overflow+0x137/0x1cb [<ffffffff8101db3f>] ? x86_perf_event_set_period+0x107/0x113 [<ffffffff81133c1a>] perf_event_overflow+0x14/0x16 [<ffffffff810230dc>] intel_pmu_handle_irq+0x2b0/0x32d [<ffffffff815cbdd1>] perf_event_nmi_handler+0x19/0x1b [<ffffffff815cb64a>] nmi_handle+0x55/0x7e [<ffffffff815cb71b>] do_nmi+0xa8/0x2db [<ffffffff815cadb1>] end_repeat_nmi+0x1e/0x2e [<ffffffff810f9945>] ? stop_machine_cpu_stop+0x88/0x267 [<ffffffff810f9945>] ? stop_machine_cpu_stop+0x88/0x267 [<ffffffff810f9945>] ? stop_machine_cpu_stop+0x88/0x267 <<EOE>> [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30 [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176 [<ffffffff815c91e5>] ? __schedule+0x59f/0x5e7 [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11 [<ffffffff810b4a09>] kthread+0xb5/0xbd [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0 [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 ---[ end trace 9767454fd3f66a7f ]--- BUG: soft lockup - CPU#2 stuck for 22s! 
[migration/2:23] Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm gpio] CPU 2 Pid: 23, comm: migration/2 Tainted: G WC O 3.9.4+ #70 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M. RIP: 0010:[<ffffffff8109daab>] [<ffffffff8109daab>] __do_softirq+0x93/0x23c RSP: 0018:ffff88022bd03ef8 EFLAGS: 00000246 RAX: 0000000000000000 RBX: 0000000000000246 RCX: ffff88022bd0e506 RDX: ffff880222218010 RSI: ffff88022bd0e4f0 RDI: 0000000000000002 RBP: ffff88022bd03f78 R08: ffff88022bd0e770 R09: 0000000000000001 R10: ffff88022bd13de0 R11: 0000000000000002 R12: ffff88022bd03e68 R13: ffffffff815d15dd R14: ffff88022bd03f78 R15: ffff880222218000 FS: 0000000000000000(0000) GS:ffff88022bd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007f22cc3e7000 CR3: 0000000001a0c000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process migration/2 (pid: 23, threadinfo ffff880222218000, task ffff880222210000) Stack: ffff88022bd03f08 ffff880222218010 ffff880222219fd8 04208040810b79ef 00000000fffc8ad1 000000022bd0e150 ffff880222218000 0000000000000038 ffff880200000006 000000111671cb12 000000111671cb12 ffff88022bd0db00 Call Trace: <IRQ> [<ffffffff8109dce6>] irq_exit+0x4b/0xa8 [<ffffffff815d26ff>] smp_apic_timer_interrupt+0x8b/0x99 [<ffffffff815d15dd>] apic_timer_interrupt+0x6d/0x80 <EOI> [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267 [<ffffffff810f9afd>] ? stop_machine_cpu_stop+0x240/0x267 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30 [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176 [<ffffffff815c91e5>] ? 
__schedule+0x59f/0x5e7 [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11 [<ffffffff810b4a09>] kthread+0xb5/0xbd [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0 [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 Code: 89 45 90 49 8d 44 24 10 4c 89 65 b0 65 8b 14 25 20 b0 00 00 48 89 45 88 89 55 ac 65 c7 04 25 40 13 01 00 00 00 00 00 fb 66 66 90 <66> 66 90 48 c7 c3 80 50 a0 8 ------------[ cut here ]------------ WARNING: at /home/greearb/git/linux-3.9.dev.y/kernel/watchdog.c:245 watchdog_overflow_callback+0x9b/0xa6() Hardware name: To be filled by O.E.M. Watchdog detected hard LOCKUP on cpu 3 Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm gpio] Pid: 29, comm: migration/3 Tainted: G WC O 3.9.4+ #70 Call Trace: <NMI> [<ffffffff810963e1>] warn_slowpath_common+0x85/0x9f [<ffffffff8109649e>] warn_slowpath_fmt+0x46/0x48 [<ffffffff810c4dfb>] ? sched_clock_cpu+0x44/0xce [<ffffffff81103cec>] watchdog_overflow_callback+0x9b/0xa6 [<ffffffff811336cb>] __perf_event_overflow+0x137/0x1cb [<ffffffff8101db3f>] ? x86_perf_event_set_period+0x107/0x113 [<ffffffff81133c1a>] perf_event_overflow+0x14/0x16 [<ffffffff810230dc>] intel_pmu_handle_irq+0x2b0/0x32d [<ffffffff815cbdd1>] perf_event_nmi_handler+0x19/0x1b [<ffffffff815cb64a>] nmi_handle+0x55/0x7e [<ffffffff815cb71b>] do_nmi+0xa8/0x2db [<ffffffff815cadb1>] end_repeat_nmi+0x1e/0x2e [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267 [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267 [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267 <<EOE>> [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810f98bd>] ? 
stop_one_cpu_nowait+0x30/0x30 [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176 [<ffffffff815c91e5>] ? __schedule+0x59f/0x5e7 [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11 [<ffffffff810b4a09>] kthread+0xb5/0xbd [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0 [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 ---[ end trace 9767454fd3f66a80 ]--- BUG: soft lockup - CPU#2 stuck for 22s! [migration/2:23] Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm gpio] CPU 2 Pid: 23, comm: migration/2 Tainted: G WC O 3.9.4+ #70 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M. RIP: 0010:[<ffffffff8109d6b7>] [<ffffffff8109d6b7>] tasklet_action+0x61/0xcc RSP: 0018:ffff88022bd03ed8 EFLAGS: 00000246 RAX: 0000000000000001 RBX: ffff880222185dc0 RCX: ffff88022bd0e506 RDX: ffff8802186f2fa8 RSI: ffff88022bd0e4f0 RDI: ffffffff81a050b0 RBP: ffff88022bd03ee8 R08: ffff88022bd0e770 R09: 0000000000000001 R10: ffff88022bd13de0 R11: 0000000000000002 R12: ffff88022bd03e48 R13: ffffffff815d15dd R14: ffff88022bd03ee8 R15: ffff8802186f2fb0 FS: 0000000000000000(0000) GS:ffff88022bd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007f22cc3e7000 CR3: 0000000001a0c000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process migration/2 (pid: 23, threadinfo ffff880222218000, task ffff880222210000) Stack: ffffffff81a050b0 ffff880222218000 ffff88022bd03f78 ffffffff8109db1f ffff88022bd03f08 ffff880222218010 ffff880222219fd8 04208040810b79ef 00000000fffc8ad1 000000022bd0e150 ffff880222218000 0000000000000030 Call Trace: <IRQ> [<ffffffff8109db1f>] __do_softirq+0x107/0x23c [<ffffffff8109dce6>] 
irq_exit+0x4b/0xa8 [<ffffffff815d26ff>] smp_apic_timer_interrupt+0x8b/0x99 [<ffffffff815d15dd>] apic_timer_interrupt+0x6d/0x80 <EOI> [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267 [<ffffffff810f9afd>] ? stop_machine_cpu_stop+0x240/0x267 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30 [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176 [<ffffffff815c91e5>] ? __schedule+0x59f/0x5e7 [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11 [<ffffffff810b4a09>] kthread+0xb5/0xbd [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0 [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 Code: da 00 00 65 48 89 04 25 88 e0 00 00 fb 66 66 90 66 66 90 eb 77 48 8b 1a 4c 8d 62 08 f0 0f ba 6a 08 01 19 c0 85 c0 75 2d 8b 42 10 <85> c0 75 20 f0 41 0f ba 34 2 ------------[ cut here ]------------ WARNING: at /home/greearb/git/linux-3.9.dev.y/kernel/watchdog.c:245 watchdog_overflow_callback+0x9b/0xa6() Hardware name: To be filled by O.E.M. Watchdog detected hard LOCKUP on cpu 1 Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm gpio] Pid: 17, comm: migration/1 Tainted: G WC O 3.9.4+ #70 Call Trace: <NMI> [<ffffffff810963e1>] warn_slowpath_common+0x85/0x9f [<ffffffff8109649e>] warn_slowpath_fmt+0x46/0x48 [<ffffffff810c4dfb>] ? sched_clock_cpu+0x44/0xce [<ffffffff81103cec>] watchdog_overflow_callback+0x9b/0xa6 [<ffffffff811336cb>] __perf_event_overflow+0x137/0x1cb [<ffffffff8101db3f>] ? 
x86_perf_event_set_period+0x107/0x113 [<ffffffff81133c1a>] perf_event_overflow+0x14/0x16 [<ffffffff810230dc>] intel_pmu_handle_irq+0x2b0/0x32d [<ffffffff815cbdd1>] perf_event_nmi_handler+0x19/0x1b [<ffffffff815cb64a>] nmi_handle+0x55/0x7e [<ffffffff815cb71b>] do_nmi+0xa8/0x2db [<ffffffff815cadb1>] end_repeat_nmi+0x1e/0x2e [<ffffffff810f99a8>] ? stop_machine_cpu_stop+0xeb/0x267 [<ffffffff810f99a8>] ? stop_machine_cpu_stop+0xeb/0x267 [<ffffffff810f99a8>] ? stop_machine_cpu_stop+0xeb/0x267 <<EOE>> [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff815ca1f4>] ? _raw_spin_unlock_irq+0x10/0x36 [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30 [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176 [<ffffffff815c91e5>] ? __schedule+0x59f/0x5e7 [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11 [<ffffffff810b4a09>] kthread+0xb5/0xbd [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0 [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 ---[ end trace 9767454fd3f66a81 ]--- BUG: soft lockup - CPU#2 stuck for 22s! [migration/2:23] Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm gpio] CPU 2 Pid: 23, comm: migration/2 Tainted: G WC O 3.9.4+ #70 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M. 
RIP: 0010:[<ffffffff8109daab>] [<ffffffff8109daab>] __do_softirq+0x93/0x23c RSP: 0018:ffff88022bd03ef8 EFLAGS: 00000246 RAX: 0000000000000000 RBX: 0000000000000246 RCX: ffff88022bd0e506 RDX: ffff880222218010 RSI: ffff88022bd0e4f0 RDI: 0000000000000002 RBP: ffff88022bd03f78 R08: ffff88022bd0e770 R09: 0000000000000001 R10: ffff88022bd13de0 R11: 0000000000000002 R12: ffff88022bd03e68 R13: ffffffff815d15dd R14: ffff88022bd03f78 R15: ffff880222218000 FS: 0000000000000000(0000) GS:ffff88022bd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007f22cc3e7000 CR3: 0000000001a0c000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process migration/2 (pid: 23, threadinfo ffff880222218000, task ffff880222210000) Stack: ffff88022bd03f08 ffff880222218010 ffff880222219fd8 04208040810b79ef 00000000fffc8ad1 000000022bd0e150 ffff880222218000 0000000000000038 ffff880200000006 000000111671cb12 000000111671cb12 ffff88022bd0db00 Call Trace: <IRQ> [<ffffffff8109dce6>] irq_exit+0x4b/0xa8 [<ffffffff815d26ff>] smp_apic_timer_interrupt+0x8b/0x99 [<ffffffff815d15dd>] apic_timer_interrupt+0x6d/0x80 <EOI> [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267 [<ffffffff810f9afd>] ? stop_machine_cpu_stop+0x240/0x267 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30 [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176 [<ffffffff815c91e5>] ? __schedule+0x59f/0x5e7 [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11 [<ffffffff810b4a09>] kthread+0xb5/0xbd [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0 [<ffffffff810b4954>] ? 
kthread_freezable_should_stop+0x60/0x60 Code: 89 45 90 49 8d 44 24 10 4c 89 65 b0 65 8b 14 25 20 b0 00 00 48 89 45 88 89 55 ac 65 c7 04 25 40 13 01 00 00 00 00 00 fb 66 66 90 <66> 66 90 48 c7 c3 80 50 a0 8 BUG: soft lockup - CPU#2 stuck for 22s! [migration/2:23] Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm gpio] CPU 2 Pid: 23, comm: migration/2 Tainted: G WC O 3.9.4+ #70 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M. RIP: 0010:[<ffffffff8109d712>] [<ffffffff8109d712>] tasklet_action+0xbc/0xcc RSP: 0018:ffff88022bd03ed8 EFLAGS: 00000202 RAX: 0000000000000040 RBX: ffff880222185dc0 RCX: ffff88022bd0e506 RDX: ffff8802186f2fa8 RSI: ffff88022bd0e4f0 RDI: 0000000000000006 RBP: ffff88022bd03ee8 R08: ffff88022bd0e770 R09: 0000000000000001 R10: ffff88022bd13de0 R11: 0000000000000002 R12: ffff88022bd03e48 R13: ffffffff815d15dd R14: ffff88022bd03ee8 R15: ffff8802186f2fb0 FS: 0000000000000000(0000) GS:ffff88022bd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007f22cc3e7000 CR3: 0000000001a0c000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process migration/2 (pid: 23, threadinfo ffff880222218000, task ffff880222210000) Stack: ffffffff81a050b0 ffff880222218000 ffff88022bd03f78 ffffffff8109db1f ffff88022bd03f08 ffff880222218010 ffff880222219fd8 04208040810b79ef 00000000fffc8ad1 000000022bd0e150 ffff880222218000 0000000000000030 Call Trace: <IRQ> [<ffffffff8109db1f>] __do_softirq+0x107/0x23c [<ffffffff8109dce6>] irq_exit+0x4b/0xa8 [<ffffffff815d26ff>] smp_apic_timer_interrupt+0x8b/0x99 [<ffffffff815d15dd>] apic_timer_interrupt+0x6d/0x80 <EOI> [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267 [<ffffffff810f9afd>] ? stop_machine_cpu_stop+0x240/0x267 [<ffffffff810e817b>] ? 
copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30 [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176 [<ffffffff815c91e5>] ? __schedule+0x59f/0x5e7 [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11 [<ffffffff810b4a09>] kthread+0xb5/0xbd [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0 [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 Code: 90 bf 06 00 00 00 48 c7 02 00 00 00 00 65 48 8b 04 25 88 e0 00 00 48 89 10 65 48 89 14 25 88 e0 00 00 e8 1c fe ff ff fb 66 66 90 <66> 66 90 48 89 da 48 85 d2 7 BUG: soft lockup - CPU#2 stuck for 22s! [migration/2:23] Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm gpio] CPU 2 Pid: 23, comm: migration/2 Tainted: G WC O 3.9.4+ #70 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M. 
RIP: 0010:[<ffffffff8109d6e1>] [<ffffffff8109d6e1>] tasklet_action+0x8b/0xcc RSP: 0018:ffff88022bd03ed8 EFLAGS: 00000202 RAX: 0000000000000001 RBX: ffff880222185dc0 RCX: ffff88022bd0e506 RDX: ffff8802186f2fa8 RSI: ffff88022bd0e4f0 RDI: ffffffff81a050b0 RBP: ffff88022bd03ee8 R08: ffff88022bd0e770 R09: 0000000000000001 R10: ffff88022bd13de0 R11: 0000000000000002 R12: ffff88022bd03e48 R13: ffffffff815d15dd R14: ffff88022bd03ee8 R15: ffff8802186f2fb0 FS: 0000000000000000(0000) GS:ffff88022bd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007f22cc3e7000 CR3: 0000000001a0c000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process migration/2 (pid: 23, threadinfo ffff880222218000, task ffff880222210000) Stack: ffffffff81a050b0 ffff880222218000 ffff88022bd03f78 ffffffff8109db1f ffff88022bd03f08 ffff880222218010 ffff880222219fd8 04208040810b79ef 00000000fffc8ad1 000000022bd0e150 ffff880222218000 0000000000000030 Call Trace: <IRQ> [<ffffffff8109db1f>] __do_softirq+0x107/0x23c [<ffffffff8109dce6>] irq_exit+0x4b/0xa8 [<ffffffff815d26ff>] smp_apic_timer_interrupt+0x8b/0x99 [<ffffffff815d15dd>] apic_timer_interrupt+0x6d/0x80 <EOI> [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267 [<ffffffff810f9afd>] ? stop_machine_cpu_stop+0x240/0x267 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30 [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176 [<ffffffff815c91e5>] ? __schedule+0x59f/0x5e7 [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11 [<ffffffff810b4a09>] kthread+0xb5/0xbd [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0 [<ffffffff810b4954>] ? 
kthread_freezable_should_stop+0x60/0x60 Code: 10 85 c0 75 20 f0 41 0f ba 34 24 00 19 c0 85 c0 75 04 0f 0b eb fe 48 8b 7a 20 ff 52 18 f0 41 80 24 24 fd eb 3a f0 41 80 24 24 fd <fa> 66 66 90 66 66 90 bf 06 0 BUG: soft lockup - CPU#2 stuck for 22s! [migration/2:23] Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm gpio] CPU 2 Pid: 23, comm: migration/2 Tainted: G WC O 3.9.4+ #70 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M. RIP: 0010:[<ffffffff8109d6e1>] [<ffffffff8109d6e1>] tasklet_action+0x8b/0xcc RSP: 0018:ffff88022bd03ed8 EFLAGS: 00000202 RAX: 0000000000000001 RBX: ffff880222185dc0 RCX: ffff88022bd0e506 RDX: ffff8802186f2fa8 RSI: ffff88022bd0e4f0 RDI: ffffffff81a050b0 RBP: ffff88022bd03ee8 R08: ffff88022bd0e770 R09: 0000000000000001 R10: ffff88022bd13de0 R11: 0000000000000002 R12: ffff88022bd03e48 R13: ffffffff815d15dd R14: ffff88022bd03ee8 R15: ffff8802186f2fb0 FS: 0000000000000000(0000) GS:ffff88022bd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007f22cc3e7000 CR3: 0000000001a0c000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process migration/2 (pid: 23, threadinfo ffff880222218000, task ffff880222210000) Stack: ffffffff81a050b0 ffff880222218000 ffff88022bd03f78 ffffffff8109db1f ffff88022bd03f08 ffff880222218010 ffff880222219fd8 04208040810b79ef 00000000fffc8ad1 000000022bd0e150 ffff880222218000 0000000000000030 Call Trace: <IRQ> [<ffffffff8109db1f>] __do_softirq+0x107/0x23c [<ffffffff8109dce6>] irq_exit+0x4b/0xa8 [<ffffffff815d26ff>] smp_apic_timer_interrupt+0x8b/0x99 [<ffffffff815d15dd>] apic_timer_interrupt+0x6d/0x80 <EOI> [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267 [<ffffffff810f9afd>] ? stop_machine_cpu_stop+0x240/0x267 [<ffffffff810e817b>] ? 
copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30 [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176 [<ffffffff815c91e5>] ? __schedule+0x59f/0x5e7 [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11 [<ffffffff810b4a09>] kthread+0xb5/0xbd [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0 [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 Code: 10 85 c0 75 20 f0 41 0f ba 34 24 00 19 c0 85 c0 75 04 0f 0b eb fe 48 8b 7a 20 ff 52 18 f0 41 80 24 24 fd eb 3a f0 41 80 24 24 fd <fa> 66 66 90 66 66 90 bf 06 0 BUG: soft lockup - CPU#2 stuck for 22s! [migration/2:23] Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm gpio] CPU 2 Pid: 23, comm: migration/2 Tainted: G WC O 3.9.4+ #70 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M. 
RIP: 0010:[<ffffffff8109dac6>] [<ffffffff8109dac6>] __do_softirq+0xae/0x23c RSP: 0018:ffff88022bd03ef8 EFLAGS: 00000282 RAX: 0000000000000000 RBX: 0000000000000246 RCX: ffff88022bd0e506 RDX: ffff880222218010 RSI: ffff88022bd0e4f0 RDI: 0000000000000002 RBP: ffff88022bd03f78 R08: ffff88022bd0e770 R09: 0000000000000001 R10: ffff88022bd13de0 R11: 0000000000000002 R12: ffff88022bd03e68 R13: ffffffff815d15dd R14: ffff88022bd03f78 R15: ffff880222218000 FS: 0000000000000000(0000) GS:ffff88022bd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007f22cc3e7000 CR3: 0000000001a0c000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process migration/2 (pid: 23, threadinfo ffff880222218000, task ffff880222210000) Stack: ffff88022bd03f08 ffff880222218010 ffff880222219fd8 04208040810b79ef 00000000fffc8ad1 000000022bd0e150 ffff880222218000 0000000000000028 ffff880200000006 000000111671cb12 000000111671cb12 ffff88022bd0db00 Call Trace: <IRQ> [<ffffffff8109dce6>] irq_exit+0x4b/0xa8 [<ffffffff815d26ff>] smp_apic_timer_interrupt+0x8b/0x99 [<ffffffff815d15dd>] apic_timer_interrupt+0x6d/0x80 <EOI> [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267 [<ffffffff810f9afd>] ? stop_machine_cpu_stop+0x240/0x267 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30 [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176 [<ffffffff815c91e5>] ? __schedule+0x59f/0x5e7 [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11 [<ffffffff810b4a09>] kthread+0xb5/0xbd [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0 [<ffffffff810b4954>] ? 
kthread_freezable_should_stop+0x60/0x60 Code: 65 c7 04 25 40 13 01 00 00 00 00 00 fb 66 66 90 66 66 90 48 c7 c3 80 50 a0 81 48 c7 45 b8 00 00 00 00 65 4c 8b 2c 25 48 c6 00 00 <41> f6 c7 01 0f 84 ba 00 00 0 BUG: soft lockup - CPU#2 stuck for 23s! [migration/2:23] Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm gpio] CPU 2 Pid: 23, comm: migration/2 Tainted: G WC O 3.9.4+ #70 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M. RIP: 0010:[<ffffffff8109d69c>] [<ffffffff8109d69c>] tasklet_action+0x46/0xcc RSP: 0018:ffff88022bd03ed8 EFLAGS: 00000282 RAX: ffff88022bd0e080 RBX: ffff880222185dc0 RCX: ffff88022bd0e506 RDX: ffff8802186f2fa8 RSI: ffff88022bd0e4f0 RDI: ffffffff81a050b0 RBP: ffff88022bd03ee8 R08: ffff88022bd0e770 R09: 0000000000000001 R10: ffff88022bd13de0 R11: 0000000000000002 R12: ffff88022bd03e48 R13: ffffffff815d15dd R14: ffff88022bd03ee8 R15: ffff880222218000 FS: 0000000000000000(0000) GS:ffff88022bd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007f22cc3e7000 CR3: 0000000001a0c000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process migration/2 (pid: 23, threadinfo ffff880222218000, task ffff880222210000) Stack: ffffffff81a050b0 ffff880222218000 ffff88022bd03f78 ffffffff8109db1f ffff88022bd03f08 ffff880222218010 ffff880222219fd8 04208040810b79ef 00000000fffc8ad1 000000022bd0e150 ffff880222218000 0000000000000030 Call Trace: <IRQ> [<ffffffff8109db1f>] __do_softirq+0x107/0x23c [<ffffffff8109dce6>] irq_exit+0x4b/0xa8 [<ffffffff815d26ff>] smp_apic_timer_interrupt+0x8b/0x99 [<ffffffff815d15dd>] apic_timer_interrupt+0x6d/0x80 <EOI> [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267 [<ffffffff810f9afd>] ? stop_machine_cpu_stop+0x240/0x267 [<ffffffff810e817b>] ? 
copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30 [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176 [<ffffffff815c91e5>] ? __schedule+0x59f/0x5e7 [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11 [<ffffffff810b4a09>] kthread+0xb5/0xbd [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0 [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 Code: 48 8b 14 25 80 e0 00 00 65 48 c7 04 25 80 e0 00 00 00 00 00 00 65 48 03 04 25 d8 da 00 00 65 48 89 04 25 88 e0 00 00 fb 66 66 90 <66> 66 90 eb 77 48 8b 1a 4c 8 BUG: soft lockup - CPU#2 stuck for 23s! [migration/2:23] Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm gpio] CPU 2 Pid: 23, comm: migration/2 Tainted: G WC O 3.9.4+ #70 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M. 
RIP: 0010:[<ffffffff8109d712>] [<ffffffff8109d712>] tasklet_action+0xbc/0xcc RSP: 0018:ffff88022bd03ed8 EFLAGS: 00000202 RAX: 0000000000000040 RBX: ffff880222185dc0 RCX: ffff88022bd0e506 RDX: ffff8802186f2fa8 RSI: ffff88022bd0e4f0 RDI: 0000000000000006 RBP: ffff88022bd03ee8 R08: ffff88022bd0e770 R09: 0000000000000001 R10: ffff88022bd13de0 R11: 0000000000000002 R12: ffff88022bd03e48 R13: ffffffff815d15dd R14: ffff88022bd03ee8 R15: ffff8802186f2fb0 FS: 0000000000000000(0000) GS:ffff88022bd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007f22cc3e7000 CR3: 0000000001a0c000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process migration/2 (pid: 23, threadinfo ffff880222218000, task ffff880222210000) Stack: ffffffff81a050b0 ffff880222218000 ffff88022bd03f78 ffffffff8109db1f ffff88022bd03f08 ffff880222218010 ffff880222219fd8 04208040810b79ef 00000000fffc8ad1 000000022bd0e150 ffff880222218000 0000000000000030 Call Trace: <IRQ> [<ffffffff8109db1f>] __do_softirq+0x107/0x23c [<ffffffff8109dce6>] irq_exit+0x4b/0xa8 [<ffffffff815d26ff>] smp_apic_timer_interrupt+0x8b/0x99 [<ffffffff815d15dd>] apic_timer_interrupt+0x6d/0x80 <EOI> [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267 [<ffffffff810f9afd>] ? stop_machine_cpu_stop+0x240/0x267 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30 [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176 [<ffffffff815c91e5>] ? __schedule+0x59f/0x5e7 [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11 [<ffffffff810b4a09>] kthread+0xb5/0xbd [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0 [<ffffffff810b4954>] ? 
kthread_freezable_should_stop+0x60/0x60 Code: 90 bf 06 00 00 00 48 c7 02 00 00 00 00 65 48 8b 04 25 88 e0 00 00 48 89 10 65 48 89 14 25 88 e0 00 00 e8 1c fe ff ff fb 66 66 90 <66> 66 90 48 89 da 48 85 d2 7 BUG: soft lockup - CPU#2 stuck for 23s! [migration/2:23] Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm gpio] CPU 2 Pid: 23, comm: migration/2 Tainted: G WC O 3.9.4+ #70 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M. RIP: 0010:[<ffffffff8109d712>] [<ffffffff8109d712>] tasklet_action+0xbc/0xcc RSP: 0018:ffff88022bd03ed8 EFLAGS: 00000202 RAX: 0000000000000040 RBX: ffff880222185dc0 RCX: ffff88022bd0e506 RDX: ffff8802186f2fa8 RSI: ffff88022bd0e4f0 RDI: 0000000000000006 RBP: ffff88022bd03ee8 R08: ffff88022bd0e770 R09: 0000000000000001 R10: ffff88022bd13de0 R11: 0000000000000002 R12: ffff88022bd03e48 R13: ffffffff815d15dd R14: ffff88022bd03ee8 R15: ffff8802186f2fb0 FS: 0000000000000000(0000) GS:ffff88022bd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007f22cc3e7000 CR3: 0000000001a0c000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process migration/2 (pid: 23, threadinfo ffff880222218000, task ffff880222210000) Stack: ffffffff81a050b0 ffff880222218000 ffff88022bd03f78 ffffffff8109db1f ffff88022bd03f08 ffff880222218010 ffff880222219fd8 04208040810b79ef 00000000fffc8ad1 000000022bd0e150 ffff880222218000 0000000000000030 Call Trace: <IRQ> [<ffffffff8109db1f>] __do_softirq+0x107/0x23c [<ffffffff8109dce6>] irq_exit+0x4b/0xa8 [<ffffffff815d26ff>] smp_apic_timer_interrupt+0x8b/0x99 [<ffffffff815d15dd>] apic_timer_interrupt+0x6d/0x80 <EOI> [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267 [<ffffffff810f9afd>] ? stop_machine_cpu_stop+0x240/0x267 [<ffffffff810e817b>] ? 
copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30 [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176 [<ffffffff815c91e5>] ? __schedule+0x59f/0x5e7 [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11 [<ffffffff810b4a09>] kthread+0xb5/0xbd [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0 [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 Code: 90 bf 06 00 00 00 48 c7 02 00 00 00 00 65 48 8b 04 25 88 e0 00 00 48 89 10 65 48 89 14 25 88 e0 00 00 e8 1c fe ff ff fb 66 66 90 <66> 66 90 48 89 da 48 85 d2 7 BUG: soft lockup - CPU#2 stuck for 23s! [migration/2:23] Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm gpio] CPU 2 Pid: 23, comm: migration/2 Tainted: G WC O 3.9.4+ #70 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M. 
RIP: 0010:[<ffffffff8109d6ae>] [<ffffffff8109d6ae>] tasklet_action+0x58/0xcc RSP: 0018:ffff88022bd03ed8 EFLAGS: 00000282 RAX: ffff88022bd0e080 RBX: ffff880222185dc0 RCX: ffff88022bd0e506 RDX: ffff8802186f2fa8 RSI: ffff88022bd0e4f0 RDI: ffffffff81a050b0 RBP: ffff88022bd03ee8 R08: ffff88022bd0e770 R09: 0000000000000001 R10: ffff88022bd13de0 R11: 0000000000000002 R12: ffff88022bd03e48 R13: ffffffff815d15dd R14: ffff88022bd03ee8 R15: ffff8802186f2fb0 FS: 0000000000000000(0000) GS:ffff88022bd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007f22cc3e7000 CR3: 0000000001a0c000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process migration/2 (pid: 23, threadinfo ffff880222218000, task ffff880222210000) Stack: ffffffff81a050b0 ffff880222218000 ffff88022bd03f78 ffffffff8109db1f ffff88022bd03f08 ffff880222218010 ffff880222219fd8 04208040810b79ef 00000000fffc8ad1 000000022bd0e150 ffff880222218000 0000000000000030 Call Trace: <IRQ> [<ffffffff8109db1f>] __do_softirq+0x107/0x23c [<ffffffff8109dce6>] irq_exit+0x4b/0xa8 [<ffffffff815d26ff>] smp_apic_timer_interrupt+0x8b/0x99 [<ffffffff815d15dd>] apic_timer_interrupt+0x6d/0x80 <EOI> [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267 [<ffffffff810f9afd>] ? stop_machine_cpu_stop+0x240/0x267 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30 [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176 [<ffffffff815c91e5>] ? __schedule+0x59f/0x5e7 [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11 [<ffffffff810b4a09>] kthread+0xb5/0xbd [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0 [<ffffffff810b4954>] ? 
kthread_freezable_should_stop+0x60/0x60 Code: 00 00 00 65 48 03 04 25 d8 da 00 00 65 48 89 04 25 88 e0 00 00 fb 66 66 90 66 66 90 eb 77 48 8b 1a 4c 8d 62 08 f0 0f ba 6a 08 01 <19> c0 85 c0 75 2d 8b 42 10 8 BUG: soft lockup - CPU#2 stuck for 22s! [migration/2:23] Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm gpio] CPU 2 Pid: 23, comm: migration/2 Tainted: G WC O 3.9.4+ #70 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M. RIP: 0010:[<ffffffff8110a66f>] [<ffffffff8110a66f>] rcu_bh_qs+0x9/0x22 RSP: 0018:ffff88022bd03ee8 EFLAGS: 00000246 RAX: 0000000000000040 RBX: 0000000000000246 RCX: ffff88022bd0e506 RDX: 0000000000000000 RSI: ffff88022bd0e4f0 RDI: 0000000000000002 RBP: ffff88022bd03ee8 R08: ffff88022bd0e770 R09: 0000000000000001 R10: ffff88022bd13de0 R11: 0000000000000002 R12: ffff88022bd03e58 R13: ffffffff815d15dd R14: ffff88022bd03ee8 R15: ffff880222218000 FS: 0000000000000000(0000) GS:ffff88022bd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007f22cc3e7000 CR3: 0000000001a0c000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process migration/2 (pid: 23, threadinfo ffff880222218000, task ffff880222210000) Stack: ffff88022bd03f78 ffffffff8109db8a ffff88022bd03f08 ffff880222218010 ffff880222219fd8 04208040810b79ef 00000000fffc8ad1 000000022bd0e150 ffff880222218000 0000000000000030 ffff880200000006 000000111671cb12 Call Trace: <IRQ> [<ffffffff8109db8a>] __do_softirq+0x172/0x23c [<ffffffff8109dce6>] irq_exit+0x4b/0xa8 [<ffffffff815d26ff>] smp_apic_timer_interrupt+0x8b/0x99 [<ffffffff815d15dd>] apic_timer_interrupt+0x6d/0x80 <EOI> [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267 [<ffffffff810f9afd>] ? stop_machine_cpu_stop+0x240/0x267 [<ffffffff810e817b>] ? 
copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30 [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176 [<ffffffff815c91e5>] ? __schedule+0x59f/0x5e7 [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11 [<ffffffff810b4a09>] kthread+0xb5/0xbd [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0 [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 Code: 55 48 89 e5 66 66 66 66 90 48 c7 c0 f0 e4 00 00 48 63 ff 48 8b 14 fd 50 b5 ad 81 c6 44 10 10 01 c9 c3 55 48 89 e5 66 66 66 66 90 <48> c7 c0 30 e6 00 00 48 63 f BUG: soft lockup - CPU#2 stuck for 22s! [migration/2:23] Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm gpio] CPU 2 Pid: 23, comm: migration/2 Tainted: G WC O 3.9.4+ #70 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M. 
RIP: 0010:[<ffffffff8109dac6>] [<ffffffff8109dac6>] __do_softirq+0xae/0x23c RSP: 0018:ffff88022bd03ef8 EFLAGS: 00000296 RAX: 0000000000000000 RBX: 0000000000000246 RCX: ffff88022bd0e506 RDX: ffff880222218010 RSI: ffff88022bd0e4f0 RDI: 0000000000000002 RBP: ffff88022bd03f78 R08: ffff88022bd0e770 R09: 0000000000000001 R10: ffff88022bd13de0 R11: 0000000000000002 R12: ffff88022bd03e68 R13: ffffffff815d15dd R14: ffff88022bd03f78 R15: ffff880222218000 FS: 0000000000000000(0000) GS:ffff88022bd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007f22cc3e7000 CR3: 0000000001a0c000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process migration/2 (pid: 23, threadinfo ffff880222218000, task ffff880222210000) Stack: ffff88022bd03f08 ffff880222218010 ffff880222219fd8 04208040810b79ef 00000000fffc8ad1 000000022bd0e150 ffff880222218000 0000000000000020 ffff880200000006 000000111671cb12 000000111671cb12 ffff88022bd0db00 Call Trace: <IRQ> [<ffffffff8109dce6>] irq_exit+0x4b/0xa8 [<ffffffff815d26ff>] smp_apic_timer_interrupt+0x8b/0x99 [<ffffffff815d15dd>] apic_timer_interrupt+0x6d/0x80 <EOI> [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267 [<ffffffff810f9afd>] ? stop_machine_cpu_stop+0x240/0x267 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30 [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176 [<ffffffff815c91e5>] ? __schedule+0x59f/0x5e7 [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11 [<ffffffff810b4a09>] kthread+0xb5/0xbd [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0 [<ffffffff810b4954>] ? 
kthread_freezable_should_stop+0x60/0x60 Code: 65 c7 04 25 40 13 01 00 00 00 00 00 fb 66 66 90 66 66 90 48 c7 c3 80 50 a0 81 48 c7 45 b8 00 00 00 00 65 4c 8b 2c 25 48 c6 00 00 <41> f6 c7 01 0f 84 ba 00 00 0 BUG: soft lockup - CPU#2 stuck for 22s! [migration/2:23] Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm gpio] CPU 2 Pid: 23, comm: migration/2 Tainted: G WC O 3.9.4+ #70 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M. RIP: 0010:[<ffffffff8110a66f>] [<ffffffff8110a66f>] rcu_bh_qs+0x9/0x22 RSP: 0018:ffff88022bd03ee8 EFLAGS: 00000246 RAX: 0000000000000040 RBX: 0000000000000246 RCX: ffff88022bd0e506 RDX: 0000000000000000 RSI: ffff88022bd0e4f0 RDI: 0000000000000002 RBP: ffff88022bd03ee8 R08: ffff88022bd0e770 R09: 0000000000000001 R10: ffff88022bd13de0 R11: 0000000000000002 R12: ffff88022bd03e58 R13: ffffffff815d15dd R14: ffff88022bd03ee8 R15: ffff880222218000 FS: 0000000000000000(0000) GS:ffff88022bd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007f22cc3e7000 CR3: 0000000001a0c000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process migration/2 (pid: 23, threadinfo ffff880222218000, task ffff880222210000) Stack: ffff88022bd03f78 ffffffff8109db8a ffff88022bd03f08 ffff880222218010 ffff880222219fd8 04208040810b79ef 00000000fffc8ad1 000000022bd0e150 ffff880222218000 0000000000000030 ffff880200000006 000000111671cb12 Call Trace: <IRQ> [<ffffffff8109db8a>] __do_softirq+0x172/0x23c [<ffffffff8109dce6>] irq_exit+0x4b/0xa8 [<ffffffff815d26ff>] smp_apic_timer_interrupt+0x8b/0x99 [<ffffffff815d15dd>] apic_timer_interrupt+0x6d/0x80 <EOI> [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267 [<ffffffff810f9afd>] ? stop_machine_cpu_stop+0x240/0x267 [<ffffffff810e817b>] ? 
copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30 [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176 [<ffffffff815c91e5>] ? __schedule+0x59f/0x5e7 [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11 [<ffffffff810b4a09>] kthread+0xb5/0xbd [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0 [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 Code: 55 48 89 e5 66 66 66 66 90 48 c7 c0 f0 e4 00 00 48 63 ff 48 8b 14 fd 50 b5 ad 81 c6 44 10 10 01 c9 c3 55 48 89 e5 66 66 66 66 90 <48> c7 c0 30 e6 00 00 48 63 f BUG: soft lockup - CPU#2 stuck for 22s! [migration/2:23] Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm gpio] CPU 2 Pid: 23, comm: migration/2 Tainted: G WC O 3.9.4+ #70 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M. 
RIP: 0010:[<ffffffff8109d6ae>] [<ffffffff8109d6ae>] tasklet_action+0x58/0xcc RSP: 0018:ffff88022bd03ed8 EFLAGS: 00000282 RAX: ffff88022bd0e080 RBX: ffff880222185dc0 RCX: ffff88022bd0e506 RDX: ffff8802186f2fa8 RSI: ffff88022bd0e4f0 RDI: ffffffff81a050b0 RBP: ffff88022bd03ee8 R08: ffff88022bd0e770 R09: 0000000000000001 R10: ffff88022bd13de0 R11: 0000000000000002 R12: ffff88022bd03e48 R13: ffffffff815d15dd R14: ffff88022bd03ee8 R15: ffff8802186f2fb0 FS: 0000000000000000(0000) GS:ffff88022bd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007f22cc3e7000 CR3: 0000000001a0c000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process migration/2 (pid: 23, threadinfo ffff880222218000, task ffff880222210000) Stack: ffffffff81a050b0 ffff880222218000 ffff88022bd03f78 ffffffff8109db1f ffff88022bd03f08 ffff880222218010 ffff880222219fd8 04208040810b79ef 00000000fffc8ad1 000000022bd0e150 ffff880222218000 0000000000000030 Call Trace: <IRQ> [<ffffffff8109db1f>] __do_softirq+0x107/0x23c [<ffffffff8109dce6>] irq_exit+0x4b/0xa8 [<ffffffff815d26ff>] smp_apic_timer_interrupt+0x8b/0x99 [<ffffffff815d15dd>] apic_timer_interrupt+0x6d/0x80 <EOI> [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267 [<ffffffff810f9afd>] ? stop_machine_cpu_stop+0x240/0x267 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30 [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176 [<ffffffff815c91e5>] ? __schedule+0x59f/0x5e7 [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11 [<ffffffff810b4a09>] kthread+0xb5/0xbd [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0 [<ffffffff810b4954>] ? 
kthread_freezable_should_stop+0x60/0x60 Code: 00 00 00 65 48 03 04 25 d8 da 00 00 65 48 89 04 25 88 e0 00 00 fb 66 66 90 66 66 90 eb 77 48 8b 1a 4c 8d 62 08 f0 0f ba 6a 08 01 <19> c0 85 c0 75 2d 8b 42 10 8 BUG: soft lockup - CPU#2 stuck for 22s! [migration/2:23] Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm gpio] CPU 2 Pid: 23, comm: migration/2 Tainted: G WC O 3.9.4+ #70 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M. RIP: 0010:[<ffffffff8109daab>] [<ffffffff8109daab>] __do_softirq+0x93/0x23c RSP: 0018:ffff88022bd03ef8 EFLAGS: 00000246 RAX: 0000000000000000 RBX: 0000000000000246 RCX: ffff88022bd0e506 RDX: ffff880222218010 RSI: ffff88022bd0e4f0 RDI: 0000000000000002 RBP: ffff88022bd03f78 R08: ffff88022bd0e770 R09: 0000000000000001 R10: ffff88022bd13de0 R11: 0000000000000002 R12: ffff88022bd03e68 R13: ffffffff815d15dd R14: ffff88022bd03f78 R15: ffff880222218000 FS: 0000000000000000(0000) GS:ffff88022bd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007f22cc3e7000 CR3: 0000000001a0c000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process migration/2 (pid: 23, threadinfo ffff880222218000, task ffff880222210000) Stack: ffff88022bd03f08 ffff880222218010 ffff880222219fd8 04208040810b79ef 00000000fffc8ad1 000000022bd0e150 ffff880222218000 0000000000000038 ffff880200000006 000000111671cb12 000000111671cb12 ffff88022bd0db00 Call Trace: <IRQ> [<ffffffff8109dce6>] irq_exit+0x4b/0xa8 [<ffffffff815d26ff>] smp_apic_timer_interrupt+0x8b/0x99 [<ffffffff815d15dd>] apic_timer_interrupt+0x6d/0x80 <EOI> [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267 [<ffffffff810f9afd>] ? stop_machine_cpu_stop+0x240/0x267 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? 
copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30 [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176 [<ffffffff815c91e5>] ? __schedule+0x59f/0x5e7 [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11 [<ffffffff810b4a09>] kthread+0xb5/0xbd [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0 [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 Code: 89 45 90 49 8d 44 24 10 4c 89 65 b0 65 8b 14 25 20 b0 00 00 48 89 45 88 89 55 ac 65 c7 04 25 40 13 01 00 00 00 00 00 fb 66 66 90 <66> 66 90 48 c7 c3 80 50 a0 8 BUG: soft lockup - CPU#2 stuck for 22s! [migration/2:23] Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm gpio] CPU 2 Pid: 23, comm: migration/2 Tainted: G WC O 3.9.4+ #70 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M. 
RIP: 0010:[<ffffffff8109d6e1>] [<ffffffff8109d6e1>] tasklet_action+0x8b/0xcc RSP: 0018:ffff88022bd03ed8 EFLAGS: 00000202 RAX: 0000000000000001 RBX: ffff880222185dc0 RCX: ffff88022bd0e506 RDX: ffff8802186f2fa8 RSI: ffff88022bd0e4f0 RDI: ffffffff81a050b0 RBP: ffff88022bd03ee8 R08: ffff88022bd0e770 R09: 0000000000000001 R10: ffff88022bd13de0 R11: 0000000000000002 R12: ffff88022bd03e48 R13: ffffffff815d15dd R14: ffff88022bd03ee8 R15: ffff8802186f2fb0 FS: 0000000000000000(0000) GS:ffff88022bd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007f22cc3e7000 CR3: 0000000001a0c000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process migration/2 (pid: 23, threadinfo ffff880222218000, task ffff880222210000) Stack: ffffffff81a050b0 ffff880222218000 ffff88022bd03f78 ffffffff8109db1f ffff88022bd03f08 ffff880222218010 ffff880222219fd8 04208040810b79ef 00000000fffc8ad1 000000022bd0e150 ffff880222218000 0000000000000030 Call Trace: <IRQ> [<ffffffff8109db1f>] __do_softirq+0x107/0x23c [<ffffffff8109dce6>] irq_exit+0x4b/0xa8 [<ffffffff815d26ff>] smp_apic_timer_interrupt+0x8b/0x99 [<ffffffff815d15dd>] apic_timer_interrupt+0x6d/0x80 <EOI> [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267 [<ffffffff810f9afd>] ? stop_machine_cpu_stop+0x240/0x267 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30 [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176 [<ffffffff815c91e5>] ? __schedule+0x59f/0x5e7 [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11 [<ffffffff810b4a09>] kthread+0xb5/0xbd [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0 [<ffffffff810b4954>] ? 
kthread_freezable_should_stop+0x60/0x60 Code: 10 85 c0 75 20 f0 41 0f ba 34 24 00 19 c0 85 c0 75 04 0f 0b eb fe 48 8b 7a 20 ff 52 18 f0 41 80 24 24 fd eb 3a f0 41 80 24 24 fd <fa> 66 66 90 66 66 90 bf 06 0 BUG: soft lockup - CPU#2 stuck for 22s! [migration/2:23] Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm gpio] CPU 2 Pid: 23, comm: migration/2 Tainted: G WC O 3.9.4+ #70 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M. RIP: 0010:[<ffffffff8109d712>] [<ffffffff8109d712>] tasklet_action+0xbc/0xcc RSP: 0018:ffff88022bd03ed8 EFLAGS: 00000202 RAX: 0000000000000040 RBX: ffff880222185dc0 RCX: ffff88022bd0e506 RDX: ffff8802186f2fa8 RSI: ffff88022bd0e4f0 RDI: 0000000000000006 RBP: ffff88022bd03ee8 R08: ffff88022bd0e770 R09: 0000000000000001 R10: ffff88022bd13de0 R11: 0000000000000002 R12: ffff88022bd03e48 R13: ffffffff815d15dd R14: ffff88022bd03ee8 R15: ffff8802186f2fb0 FS: 0000000000000000(0000) GS:ffff88022bd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007f22cc3e7000 CR3: 0000000001a0c000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process migration/2 (pid: 23, threadinfo ffff880222218000, task ffff880222210000) Stack: ffffffff81a050b0 ffff880222218000 ffff88022bd03f78 ffffffff8109db1f ffff88022bd03f08 ffff880222218010 ffff880222219fd8 04208040810b79ef 00000000fffc8ad1 000000022bd0e150 ffff880222218000 0000000000000030 Call Trace: <IRQ> [<ffffffff8109db1f>] __do_softirq+0x107/0x23c [<ffffffff8109dce6>] irq_exit+0x4b/0xa8 [<ffffffff815d26ff>] smp_apic_timer_interrupt+0x8b/0x99 [<ffffffff815d15dd>] apic_timer_interrupt+0x6d/0x80 <EOI> [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267 [<ffffffff810f9afd>] ? stop_machine_cpu_stop+0x240/0x267 [<ffffffff810e817b>] ? 
copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30 [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176 [<ffffffff815c91e5>] ? __schedule+0x59f/0x5e7 [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11 [<ffffffff810b4a09>] kthread+0xb5/0xbd [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0 [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 Code: 90 bf 06 00 00 00 48 c7 02 00 00 00 00 65 48 8b 04 25 88 e0 00 00 48 89 10 65 48 89 14 25 88 e0 00 00 e8 1c fe ff ff fb 66 66 90 <66> 66 90 48 89 da 48 85 d2 7 BUG: soft lockup - CPU#2 stuck for 22s! [migration/2:23] Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm gpio] CPU 2 Pid: 23, comm: migration/2 Tainted: G WC O 3.9.4+ #70 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M. 
RIP: 0010:[<ffffffff8109daab>] [<ffffffff8109daab>] __do_softirq+0x93/0x23c RSP: 0018:ffff88022bd03ef8 EFLAGS: 00000246 RAX: 0000000000000000 RBX: 0000000000000246 RCX: ffff88022bd0e506 RDX: ffff880222218010 RSI: ffff88022bd0e4f0 RDI: 0000000000000002 RBP: ffff88022bd03f78 R08: ffff88022bd0e770 R09: 0000000000000001 R10: ffff88022bd13de0 R11: 0000000000000002 R12: ffff88022bd03e68 R13: ffffffff815d15dd R14: ffff88022bd03f78 R15: ffff880222218000 FS: 0000000000000000(0000) GS:ffff88022bd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007f22cc3e7000 CR3: 0000000001a0c000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process migration/2 (pid: 23, threadinfo ffff880222218000, task ffff880222210000) Stack: ffff88022bd03f08 ffff880222218010 ffff880222219fd8 04208040810b79ef 00000000fffc8ad1 000000022bd0e150 ffff880222218000 0000000000000038 ffff880200000006 000000111671cb12 000000111671cb12 ffff88022bd0db00 Call Trace: <IRQ> [<ffffffff8109dce6>] irq_exit+0x4b/0xa8 [<ffffffff815d26ff>] smp_apic_timer_interrupt+0x8b/0x99 [<ffffffff815d15dd>] apic_timer_interrupt+0x6d/0x80 <EOI> [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267 [<ffffffff810f9afd>] ? stop_machine_cpu_stop+0x240/0x267 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30 [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176 [<ffffffff815c91e5>] ? __schedule+0x59f/0x5e7 [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11 [<ffffffff810b4a09>] kthread+0xb5/0xbd [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0 [<ffffffff810b4954>] ? 
kthread_freezable_should_stop+0x60/0x60 Code: 89 45 90 49 8d 44 24 10 4c 89 65 b0 65 8b 14 25 20 b0 00 00 48 89 45 88 89 55 ac 65 c7 04 25 40 13 01 00 00 00 00 00 fb 66 66 90 <66> 66 90 48 c7 c3 80 50 a0 8 ------------[ cut here ]------------ WARNING: at /home/greearb/git/linux-3.9.dev.y/kernel/watchdog.c:245 watchdog_overflow_callback+0x9b/0xa6() Hardware name: To be filled by O.E.M. Watchdog detected hard LOCKUP on cpu 2 Modules linked in: nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc wanlink(O) pktgen lockd sunrpc coretemp hwmon mperf intel_powerclamp snd_hda_codec_realtek kvm gpio] Pid: 23, comm: migration/2 Tainted: G WC O 3.9.4+ #70 Call Trace: <NMI> [<ffffffff810963e1>] warn_slowpath_common+0x85/0x9f [<ffffffff8109649e>] warn_slowpath_fmt+0x46/0x48 [<ffffffff810c4dfb>] ? sched_clock_cpu+0x44/0xce [<ffffffff81103cec>] watchdog_overflow_callback+0x9b/0xa6 [<ffffffff811336cb>] __perf_event_overflow+0x137/0x1cb [<ffffffff8101db3f>] ? x86_perf_event_set_period+0x107/0x113 [<ffffffff81133c1a>] perf_event_overflow+0x14/0x16 [<ffffffff810230dc>] intel_pmu_handle_irq+0x2b0/0x32d [<ffffffff815cbdd1>] perf_event_nmi_handler+0x19/0x1b [<ffffffff815cb64a>] nmi_handle+0x55/0x7e [<ffffffff815cb71b>] do_nmi+0xa8/0x2db [<ffffffff815cadb1>] end_repeat_nmi+0x1e/0x2e [<ffffffff8109dad4>] ? __do_softirq+0xbc/0x23c [<ffffffff8109dad4>] ? __do_softirq+0xbc/0x23c [<ffffffff8109dad4>] ? __do_softirq+0xbc/0x23c <<EOE>> <IRQ> [<ffffffff8109dce6>] irq_exit+0x4b/0xa8 [<ffffffff815d26ff>] smp_apic_timer_interrupt+0x8b/0x99 [<ffffffff815d15dd>] apic_timer_interrupt+0x6d/0x80 <EOI> [<ffffffff810f99a4>] ? stop_machine_cpu_stop+0xe7/0x267 [<ffffffff810f9afd>] ? stop_machine_cpu_stop+0x240/0x267 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810e817b>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810f98bd>] ? stop_one_cpu_nowait+0x30/0x30 [<ffffffff810f9624>] cpu_stopper_thread+0xbd/0x176 [<ffffffff815c91e5>] ? 
__schedule+0x59f/0x5e7 [<ffffffff810bb434>] smpboot_thread_fn+0x217/0x21f [<ffffffff810bb21d>] ? test_ti_thread_flag.clone.0+0x11/0x11 [<ffffffff810b4a09>] kthread+0xb5/0xbd [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 [<ffffffff815d092c>] ret_from_fork+0x7c/0xb0 [<ffffffff810b4954>] ? kthread_freezable_should_stop+0x60/0x60 ---[ end trace 9767454fd3f66a82 ]--- > > Thanks. > -- Ben Greear <greearb@candelatech.com> Candela Technologies Inc http://www.candelatech.com ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: Please add to stable: module: don't unlink the module until we've removed all exposure. 2013-06-05 16:59 ` Ben Greear @ 2013-06-05 18:48 ` Tejun Heo 2013-06-05 19:11 ` Ben Greear 0 siblings, 1 reply; 51+ messages in thread From: Tejun Heo @ 2013-06-05 18:48 UTC (permalink / raw) To: Ben Greear; +Cc: Rusty Russell, Joe Lawrence, Linux Kernel Mailing List, stable Hello, Ben. On Wed, Jun 05, 2013 at 09:59:00AM -0700, Ben Greear wrote: > One pattern I notice repeating for at least most of the hangs is that all but one > CPU thread has irqs disabled and is in state 2. But, there will be one thread > in state 1 that still has IRQs enabled and it is reported to be in soft-lockup > instead of hard-lockup. In 'sysrq l' it always shows some IRQ processing, > but typically that of the sysrq itself. I added printk that would always > print if the thread notices that smdata->state != curstate, and the soft-lockup > thread (cpu 2 below) never shows that message. It sounds like one of the cpus gets live-locked by IRQs. I can't tell why the situation is made worse by other CPUs being tied up. Do you ever see CPUs being live locked by IRQs during normal operation? > I thought it might be because it was reading stale smdata->state, so I changed > that to atomic_t hoping that would mitigate that. I also tried adding smp_rmb() > below the cpu_relax(). Neither had any effect, so I am left assuming that the I looked at the code again and the memory accesses seem properly interlocked. It's a bit tricky and should probably have used spinlock instead considering it's already a hugely expensive path anyway, but it does seem correct to me. > thread instead is stuck handling IRQs and never gets out of the IRQ handler. Seems that way to me too. > Maybe since I have 2 real cores, and 3 processes busy-spinning on their CPU cores, > the remaining process can just never handle all the IRQs and get back to the > cpu shutdown state machine? 
The various soft-hang stacks below show at least slightly > different stacks, so I assume that thread is doing at least something. What's the source of all those IRQs tho? I don't think the IRQs are from actual events. The system is quiesced. Even if it's from receiving packets, it's gonna quiet down pretty quickly. The hang doesn't go away if you disconnect the network cable while hung, right? What could be happening is that IRQ handling is handled by a thread but the IRQ handler itself doesn't clear the IRQ properly and depends on the handling thread to clear the condition. If no CPU is available for scheduling, it might end up raising and re-raising IRQs for the same condition without ever being handled. If that's the case, such lockup could happen on a normally functioning UP machine or if the IRQ is pinned to a single CPU which happens to be running the handling thread. At any rate, it'd be a plain live-lock bug on the driver side. Can you please try to confirm the specific interrupt being continuously raised? Detecting the hang shouldn't be too difficult. Just recording the starting jiffies and if progress hasn't been made for, say, ten seconds, it can set a flag and then print the IRQs being handled if the flag is set. If it indeed is the ath device, we probably wanna get the driver maintainer involved. Thanks. -- tejun ^ permalink raw reply [flat|nested] 51+ messages in thread
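The stall check suggested above (record the starting jiffies, set a flag once roughly ten seconds pass without progress, and print IRQs only while the flag is set) could look roughly like the following. This is only a userspace sketch under stated assumptions, not a patch against kernel/stop_machine.c: every name in it (stall_detector, stall_begin, stall_check, irq_trace, tick_after) is hypothetical, HZ is assumed to be 1000, and the tick value is passed in by the caller where kernel code would read jiffies.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical names throughout; nothing here is an existing kernel API. */
#define HZ 1000UL                 /* assumed tick rate */
#define STALL_TIMEOUT (10UL * HZ) /* "say, ten seconds" */

struct stall_detector {
	unsigned long start; /* tick count when the wait began */
	bool stalled;        /* set once the timeout has elapsed */
};

/* Wrap-safe "a is later than b", the same idea as the kernel's time_after(). */
static bool tick_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Record the starting tick when the wait loop is entered. */
static void stall_begin(struct stall_detector *d, unsigned long now)
{
	d->start = now;
	d->stalled = false;
}

/* Call on every pass of the wait loop; returns true once the wait has stalled. */
static bool stall_check(struct stall_detector *d, unsigned long now)
{
	if (!d->stalled && tick_after(now, d->start + STALL_TIMEOUT))
		d->stalled = true;
	return d->stalled;
}

/*
 * Stand-in for a hook at interrupt entry: stays silent until the flag
 * is set, then reports which IRQ is being handled. Returns 1 if it printed.
 */
static int irq_trace(const struct stall_detector *d, int irq)
{
	if (!d->stalled)
		return 0;
	printf("stalled: handling irq %d\n", irq);
	return 1;
}
```

Gating the printout on the flag keeps the hook quiet during normal operation, so the extra logging only shows up once the wait has already gone wrong.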
* Re: Please add to stable: module: don't unlink the module until we've removed all exposure. 2013-06-05 18:48 ` Tejun Heo @ 2013-06-05 19:11 ` Ben Greear 2013-06-05 19:31 ` stop_machine lockup issue in 3.9.y Ben Greear 0 siblings, 1 reply; 51+ messages in thread From: Ben Greear @ 2013-06-05 19:11 UTC (permalink / raw) To: Tejun Heo; +Cc: Rusty Russell, Joe Lawrence, Linux Kernel Mailing List, stable On 06/05/2013 11:48 AM, Tejun Heo wrote: > Hello, Ben. > > On Wed, Jun 05, 2013 at 09:59:00AM -0700, Ben Greear wrote: >> One pattern I notice repeating for at least most of the hangs is that all but one >> CPU thread has irqs disabled and is in state 2. But, there will be one thread >> in state 1 that still has IRQs enabled and it is reported to be in soft-lockup >> instead of hard-lockup. In 'sysrq l' it always shows some IRQ processing, >> but typically that of the sysrq itself. I added printk that would always >> print if the thread notices that smdata->state != curstate, and the soft-lockup >> thread (cpu 2 below) never shows that message. > > It sounds like one of the cpus gets live-locked by IRQs. I can't tell > why the situation is made worse by other CPUs being tied up. Do you > ever see CPUs being live locked by IRQs during normal operation? No, I have not noticed any live locks aside from this, at least in the 3.9 kernels. >> I thought it might be because it was reading stale smdata->state, so I changed >> that to atomic_t hoping that would mitigate that. I also tried adding smp_rmb() >> below the cpu_relax(). Neither had any effect, so I am left assuming that the > > I looked at the code again and the memory accesses seem properly > interlocked. It's a bit tricky and should probably have used spinlock > instead considering it's already a hugely expensive path anyway, but > it does seem correct to me. > >> thread instead is stuck handling IRQs and never gets out of the IRQ handler. > > Seems that way to me too. 
> >> Maybe since I have 2 real cores, and 3 processes busy-spinning on their CPU cores, >> the remaining process can just never handle all the IRQs and get back to the >> cpu shutdown state machine? The various soft-hang stacks below show at least slightly >> different stacks, so I assume that thread is doing at least something. > > What's the source of all those IRQs tho? I don't think the IRQs are > from actual events. The system is quiesced. Even if it's from > receiving packets, it's gonna quiet down pretty quickly. The hang > doesn't go away if you disconnect the network cable while hung, right? > > What could be happening is that IRQ handling is handled by a thread > but the IRQ handler itself doesn't clear the IRQ properly and depends > on the handling thread to clear the condition. If no CPU is available > for scheduling, it might end up raising and re-raising IRQs for the > same condition without ever being handled. If that's the case, such > lockup could happen on a normally functioning UP machine or if the IRQ > is pinned to a single CPU which happens to be running the handling > thread. At any rate, it'd be a plain live-lock bug on the driver > side. > > Can you please try to confirm the specific interrupt being > continuously raised? Detecting the hang shouldn't be too difficult. > Just recording the starting jiffies and if progress hasn't been made > for, say, ten seconds, it can set a flag and then print the IRQs being > handled if the flag is set. If it indeed is the ath device, we > probably wanna get the driver maintainer involved. I am not sure how to tell which IRQ is being handled. Do the stack traces (showing smp_apic_timer_interrupt, for instance) indicate potential culprits, or is that more a symptom of just when the soft-lockup check is called? Where should I add code to print out irqs? In the lockup state, the thread (probably) stuck handling irqs isn't executing any code in the stop_machine file as far as I can tell. 
Maybe I need to instrument the __do_softirq or similar method? For what it's worth, previous debugging appears to show that jiffies stops incrementing in many of these lockups. Also, I have been trying for 20+ minutes to reproduce the lockup with the ath9k module removed (and my user-space app that uses it stopped), and I have not reproduced it yet. So, possibly it is related to ath9k, but my user-space app pokes at lots of other stuff and starts loads of dhcp client processes and such too, so not sure yet. Thanks, Ben -- Ben Greear <greearb@candelatech.com> Candela Technologies Inc http://www.candelatech.com ^ permalink raw reply [flat|nested] 51+ messages in thread
* stop_machine lockup issue in 3.9.y. 2013-06-05 19:11 ` Ben Greear @ 2013-06-05 19:31 ` Ben Greear 2013-06-05 20:58 ` Ben Greear 0 siblings, 1 reply; 51+ messages in thread From: Ben Greear @ 2013-06-05 19:31 UTC (permalink / raw) To: Tejun Heo; +Cc: Rusty Russell, Joe Lawrence, Linux Kernel Mailing List, stable This is no longer really about the module unlink, so changing subject. On 06/05/2013 12:11 PM, Ben Greear wrote: > On 06/05/2013 11:48 AM, Tejun Heo wrote: >> Hello, Ben. >> >> On Wed, Jun 05, 2013 at 09:59:00AM -0700, Ben Greear wrote: >>> One pattern I notice repeating for at least most of the hangs is that all but one >>> CPU thread has irqs disabled and is in state 2. But, there will be one thread >>> in state 1 that still has IRQs enabled and it is reported to be in soft-lockup >>> instead of hard-lockup. In 'sysrq l' it always shows some IRQ processing, >>> but typically that of the sysrq itself. I added printk that would always >>> print if the thread notices that smdata->state != curstate, and the soft-lockup >>> thread (cpu 2 below) never shows that message. >> >> It sounds like one of the cpus gets live-locked by IRQs. I can't tell >> why the situation is made worse by other CPUs being tied up. Do you >> ever see CPUs being live locked by IRQs during normal operation? > > No, I have not noticed any live locks aside from this, at least in > the 3.9 kernels. > >>> I thought it might be because it was reading stale smdata->state, so I changed >>> that to atomic_t hoping that would mitigate that. I also tried adding smp_rmb() >>> below the cpu_relax(). Neither had any effect, so I am left assuming that the >> >> I looked at the code again and the memory accesses seem properly >> interlocked. It's a bit tricky and should probably have used spinlock >> instead considering it's already a hugely expensive path anyway, but >> it does seem correct to me. >> >>> thread instead is stuck handling IRQs and never gets out of the IRQ handler. 
>> >> Seems that way to me too. >> >>> Maybe since I have 2 real cores, and 3 processes busy-spinning on their CPU cores, >>> the remaining process can just never handle all the IRQs and get back to the >>> cpu shutdown state machine? The various soft-hang stacks below show at least slightly >>> different stacks, so I assume that thread is doing at least something. >> >> What's the source of all those IRQs tho? I don't think the IRQs are >> from actual events. The system is quiesced. Even if it's from >> receiving packets, it's gonna quiet down pretty quickly. The hang >> doesn't go away if you disconnect the network cable while hung, right? >> >> What could be happening is that IRQ handling is handled by a thread >> but the IRQ handler itself doesn't clear the IRQ properly and depends >> on the handling thread to clear the condition. If no CPU is available >> for scheduling, it might end up raising and re-raising IRQs for the >> same condition without ever being handled. If that's the case, such >> lockup could happen on a normally functioning UP machine or if the IRQ >> is pinned to a single CPU which happens to be running the handling >> thread. At any rate, it'd be a plain live-lock bug on the driver >> side. >> >> Can you please try to confirm the specific interrupt being >> continuously raised? Detecting the hang shouldn't be too difficult. >> Just recording the starting jiffies and if progress hasn't been made >> for, say, ten seconds, it can set a flag and then print the IRQs being >> handled if the flag is set. If it indeed is the ath device, we >> probably wanna get the driver maintainer involved. > > I am not sure how to tell which IRQ is being handled. Do the > stack traces (showing smp_apic_timer_interrupt, for instance) > indicate potential culprits, or is that more a symptom of just > when the soft-lockup check is called? > > > Where should I add code to print out irqs? 
In the lockup state, > the thread (probably) stuck handling irqs isn't executing any code in > the stop_machine file as far as I can tell. > > Maybe I need to instrument the __do_softirq or similar method? > > For what it's worth, previous debugging appears to show that jiffies > stops incrementing in many of these lockups. > > Also, I have been trying for 20+ minutes to reproduce the lockup > with the ath9k module removed (and my user-space app that uses it > stopped), and I have not reproduced it yet. So, possibly it is > related to ath9k, but my user-space app pokes at lots of other > stuff and starts loads of dhcp client processes and such too, > so not sure yet. I re-added ath9k, turned on my app (to create 400 stations, etc), re-started the module unload/load loop: for i in `seq 10000`; do modprobe macvlan; rmmod macvlan; done and hit the problem fairly quickly. This is on stock 3.9.4, with Rusty's kobj patch, and some printk debugging I added to the stop_machine.c file. Perhaps interestingly, I do see an ath9k warning/error in this log as well. Also, since lockdep is enabled, we get some irq printouts. Does this mean anything to you? 
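For reading the set_state / ack_state lines in the debug output that follows, here is a minimal userspace sketch of the acknowledgement handshake that stop_machine appears to use: each state change resets a counter to the number of threads, every thread decrements it once, and the last thread to acknowledge advances the state. The names (sm_data, sm_set_state, sm_ack_state, the SM_* states) are hypothetical, the mapping of the numeric states 0 to 4 in the log onto these names is an assumption, and C11 atomics stand in for the kernel's atomic_t and memory barriers.

```c
#include <stdatomic.h>

/* Hypothetical state names; assumed to line up with states 0..4 in the log. */
enum sm_state { SM_NONE, SM_PREPARE, SM_DISABLE_IRQ, SM_RUN, SM_EXIT };

struct sm_data {
	int num_threads;       /* threads taking part ("num-threads: 4") */
	atomic_int thread_ack; /* threads that still have to ack the current state */
	atomic_int state;      /* current state, polled by every thread */
};

/* Re-arm the ack counter first, then publish the new state. */
static void sm_set_state(struct sm_data *sm, int newstate)
{
	atomic_store(&sm->thread_ack, sm->num_threads);
	atomic_store(&sm->state, newstate);
}

/*
 * Each thread calls this once per state after doing that state's work.
 * Returns 1 if the caller was the last to ack and therefore advanced
 * the state, 0 otherwise.
 */
static int sm_ack_state(struct sm_data *sm)
{
	if (atomic_fetch_sub(&sm->thread_ack, 1) == 1) {
		sm_set_state(sm, atomic_load(&sm->state) + 1);
		return 1;
	}
	return 0;
}
```

Under this scheme a single thread that never reaches its ack leaves thread_ack stuck above zero, so the state never advances and the remaining threads keep spinning, which would be consistent with the large loops counts printed for the waiting CPUs.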
__stop_machine(upstream): num-threads: 4, fn: __try_stop_module(ffffffff810f330a) data: ffff8801c594bf28 smdata: ffff8801c594be58 set_state, cpu: 0 state: 0 newstate: 1 smdata: ffff8801c594be58 fn: ffffffff810f330a cpu: 1 loops: 1 jiffies: 4297177162 timeout: 4297177161 curstate: 0 smdata->state: 1 thread_ack: 4 smdata: ffff8801c594be58 fn: ffffffff810f330a cpu: 0 loops: 1 jiffies: 4297177162 timeout: 4297177161 curstate: 0 smdata->state: 1 thread_ack: 4 smdata: ffff8801c594be58 fn: ffffffff810f330a state-change: cpu: 0 loops: 1 jiffies: 4297177162 curstate: 0 smdata->state: 1 thread_ack: 4 smdata: ffff8801c594be58 fn: ffffffff810f330a cpu: 0 ack_state jiffies: 4297177162 smdata->state: 1 thread_ack: 4 smdata: ffff8801c594be58 fn: ffffffff810f330a cpu: 3 loops: 1 jiffies: 4297177162 timeout: 4297177161 curstate: 0 smdata->state: 1 thread_ack: 4 smdata: ffff8801c594be58 fn: ffffffff810f330a cpu: 0 ack_state end jiffies: 4297177162 smdata->state: 1 thread_ack: 3 smdata: ffff8801c594be58 fn: ffffffff810f330a cpu: 2 loops: 1 jiffies: 4297177162 timeout: 4297177161 curstate: 0 smdata->state: 1 thread_ack: 3 smdata: ffff8801c594be58 fn: ffffffff810f330a state-change: cpu: 3 loops: 1 jiffies: 4297177162 curstate: 0 smdata->state: 1 thread_ack: 3 smdata: ffff8801c594be58 fn: ffffffff810f330a cpu: 3 ack_state jiffies: 4297177162 smdata->state: 1 thread_ack: 3 smdata: ffff8801c594be58 fn: ffffffff810f330a state-change: cpu: 2 loops: 1 jiffies: 4297177162 curstate: 0 smdata->state: 1 thread_ack: 3 smdata: ffff8801c594be58 fn: ffffffff810f330a cpu: 3 ack_state end jiffies: 4297177162 smdata->state: 1 thread_ack: 2 smdata: ffff8801c594be58 fn: ffffffff810f330a cpu: 2 ack_state jiffies: 4297177162 smdata->state: 1 thread_ack: 2 smdata: ffff8801c594be58 fn: ffffffff810f330a cpu: 2 ack_state end jiffies: 4297177162 smdata->state: 1 thread_ack: 1 smdata: ffff8801c594be58 fn: ffffffff810f330a state-change: cpu: 1 loops: 1 jiffies: 4297177313 curstate: 0 smdata->state: 1 thread_ack: 
1 smdata: ffff8801c594be58 fn: ffffffff810f330a cpu: 1 ack_state jiffies: 4297177326 smdata->state: 1 thread_ack: 1 smdata: ffff8801c594be58 fn: ffffffff810f330a cpu: 1 increment state, smdata: ffff8801c594be58 fn: ffffffff810f330a set_state, cpu: 1 state: 1 newstate: 2 smdata: ffff8801c594be58 fn: ffffffff810f330a cpu: 1 ack_state end jiffies: 4297177350 smdata->state: 2 thread_ack: 4 smdata: ffff8801c594be58 fn: ffffffff810f330a state-change: cpu: 0 loops: 12226751 jiffies: 4297177350 curstate: 1 smdata->state: 2 thread_ack: 4 smdata: ffff8801c594be58 fn: ffffffff810f330a state-change: cpu: 2 loops: 12286477 jiffies: 4297177350 curstate: 1 smdata->state: 2 thread_ack: 4 smdata: ffff8801c594be58 fn: ffffffff810f330a cpu: 0 ack_state jiffies: 4297177350 smdata->state: 2 thread_ack: 4 smdata: ffff8801c594be58 fn: ffffffff810f330a cpu: 0 ack_state end jiffies: 4297177350 smdata->state: 2 thread_ack: 3 smdata: ffff8801c594be58 fn: ffffffff810f330a state-change: cpu: 3 loops: 25634226 jiffies: 4297177350 curstate: 1 smdata->state: 2 thread_ack: 4 smdata: ffff8801c594be58 fn: ffffffff810f330a cpu: 3 ack_state jiffies: 4297177350 smdata->state: 2 thread_ack: 3 smdata: ffff8801c594be58 fn: ffffffff810f330a cpu: 3 ack_state end jiffies: 4297177350 smdata->state: 2 thread_ack: 2 smdata: ffff8801c594be58 fn: ffffffff810f330a cpu: 2 ack_state jiffies: 4297177350 smdata->state: 2 thread_ack: 2 smdata: ffff8801c594be58 fn: ffffffff810f330a cpu: 2 ack_state end jiffies: 4297177350 smdata->state: 2 thread_ack: 1 smdata: ffff8801c594be58 fn: ffffffff810f330a state-change: cpu: 1 loops: 2 jiffies: 4297177350 curstate: 1 smdata->state: 2 thread_ack: 1 smdata: ffff8801c594be58 fn: ffffffff810f330a cpu: 1 ack_state jiffies: 4297177350 smdata->state: 2 thread_ack: 1 smdata: ffff8801c594be58 fn: ffffffff810f330a cpu: 1 increment state, smdata: ffff8801c594be58 fn: ffffffff810f330a set_state, cpu: 1 state: 2 newstate: 3 smdata: ffff8801c594be58 fn: ffffffff810f330a cpu: 1 ack_state end 
jiffies: 4297177350 smdata->state: 3 thread_ack: 4 smdata: ffff8801c594be58 fn: ffffffff810f330a state-change: cpu: 3 loops: 47817385 jiffies: 4297177350 curstate: 2 smdata->state: 3 thread_ack: 4 smdata: ffff8801c594be58 fn: ffffffff810f330a state-change: cpu: 2 loops: 31933478 jiffies: 4297177350 curstate: 2 smdata->state: 3 thread_ack: 4 smdata: ffff8801c594be58 fn: ffffffff810f330a state-change: cpu: 0 loops: 31875466 jiffies: 4297177350 curstate: 2 smdata->state: 3 thread_ack: 4 smdata: ffff8801c594be58 fn: ffffffff810f330a cpu: 2 ack_state jiffies: 4297177350 smdata->state: 3 thread_ack: 4 smdata: ffff8801c594be58 fn: ffffffff810f330a cpu: 3 ack_state jiffies: 4297177350 smdata->state: 3 thread_ack: 4 smdata: ffff8801c594be58 fn: ffffffff810f330a calling fn: cpu: 0 loops: 31875466 curstate: 3 smdata->state: 3 thread_ack: 4 smdata: ffff8801c594be58 fn: ffffffff810f330a cpu: 2 ack_state end jiffies: 4297177350 smdata->state: 3 thread_ack: 3 smdata: ffff8801c594be58 fn: ffffffff810f330a cpu: 0 ack_state jiffies: 4297177350 smdata->state: 3 thread_ack: 2 smdata: ffff8801c594be58 fn: ffffffff810f330a cpu: 3 ack_state end jiffies: 4297177350 smdata->state: 3 thread_ack: 2 smdata: ffff8801c594be58 fn: ffffffff810f330a cpu: 0 ack_state end jiffies: 4297177350 smdata->state: 3 thread_ack: 1 smdata: ffff8801c594be58 fn: ffffffff810f330a state-change: cpu: 1 loops: 3 jiffies: 4297177350 curstate: 2 smdata->state: 3 thread_ack: 1 smdata: ffff8801c594be58 fn: ffffffff810f330a cpu: 1 ack_state jiffies: 4297177350 smdata->state: 3 thread_ack: 1 smdata: ffff8801c594be58 fn: ffffffff810f330a cpu: 1 increment state, smdata: ffff8801c594be58 fn: ffffffff810f330a set_state, cpu: 1 state: 3 newstate: 4 smdata: ffff8801c594be58 fn: ffffffff810f330a cpu: 1 ack_state end jiffies: 4297177350 smdata->state: 4 thread_ack: 4 smdata: ffff8801c594be58 fn: ffffffff810f330a state-change: cpu: 3 loops: 71662403 jiffies: 4297177350 curstate: 3 smdata->state: 4 thread_ack: 4 smdata: 
ffff8801c594be58 fn: ffffffff810f330a state-change: cpu: 2 loops: 53246516 jiffies: 4297177350 curstate: 3 smdata->state: 4 thread_ack: 4 smdata: ffff8801c594be58 fn: ffffffff810f330a state-change: cpu: 0 loops: 53187256 jiffies: 4297177350 curstate: 3 smdata->state: 4 thread_ack: 4 smdata: ffff8801c594be58 fn: ffffffff810f330a cpu: 2 ack_state jiffies: 4297177350 smdata->state: 4 thread_ack: 4 smdata: ffff8801c594be58 fn: ffffffff810f330a cpu: 0 ack_state jiffies: 4297177350 smdata->state: 4 thread_ack: 4 smdata: ffff8801c594be58 fn: ffffffff810f330a cpu: 2 ack_state end jiffies: 4297177350 smdata->state: 4 thread_ack: 3 smdata: ffff8801c594be58 fn: ffffffff810f330a cpu: 0 ack_state end jiffies: 4297177350 smdata->state: 4 thread_ack: 2 smdata: ffff8801c594be58 fn: ffffffff810f330a cpu: 3 ack_state jiffies: 4297177350 smdata->state: 4 thread_ack: 4 smdata: ffff8801c594be58 fn: ffffffff810f330a cpu: 3 ack_state end jiffies: 4297177656 smdata->state: 4 thread_ack: 1 smdata: ffff8801c594be58 fn: ffffffff810f330a state-change: cpu: 1 loops: 4 jiffies: 4297177768 curstate: 3 smdata->state: 4 thread_ack: 1 smdata: ffff8801c594be58 fn: ffffffff810f330a cpu: 1 ack_state jiffies: 4297177780 smdata->state: 4 thread_ack: 1 smdata: ffff8801c594be58 fn: ffffffff810f330a cpu: 1 increment state, smdata: ffff8801c594be58 fn: ffffffff810f330a set_state, cpu: 1 state: 4 newstate: 5 smdata: ffff8801c594be58 fn: ffffffff810f330a cpu: 1 ack_state end jiffies: 4297177804 smdata->state: 5 thread_ack: 4 smdata: ffff8801c594be58 fn: ffffffff810f330a ath: wiphy0: Failed to stop TX DMA, queues=0x005! 
__stop_machine(upstream): num-threads: 4, fn: __unlink_module(ffffffff810f2dab) data: ffffffffa117aad0 smdata: ffff8801c594be18 set_state, cpu: 0 state: 0 newstate: 1 smdata: ffff8801c594be18 fn: ffffffff810f2dab sta201: authenticate with 00:de:ad:1d:ea:01 cpu: 1 loops: 1 jiffies: 4297178887 timeout: 4297178886 curstate: 0 smdata->state: 1 thread_ack: 4 smdata: ffff8801c594be18 fn: ffffffff810f2dab cpu: 2 loops: 1 jiffies: 4297178887 timeout: 4297178886 curstate: 0 smdata->state: 1 thread_ack: 4 smdata: ffff8801c594be18 fn: ffffffff810f2dab cpu: 0 loops: 1 jiffies: 4297178887 timeout: 4297178886 curstate: 0 smdata->state: 1 thread_ack: 4 smdata: ffff8801c594be18 fn: ffffffff810f2dab state-change: cpu: 2 loops: 1 jiffies: 4297178887 curstate: 0 smdata->state: 1 thread_ack: 4 smdata: ffff8801c594be18 fn: ffffffff810f2dab cpu: 3 loops: 1 jiffies: 4297178887 timeout: 4297178886 curstate: 0 smdata->state: 1 thread_ack: 4 smdata: ffff8801c594be18 fn: ffffffff810f2dab cpu: 2 ack_state jiffies: 4297178887 smdata->state: 1 thread_ack: 4 smdata: ffff8801c594be18 fn: ffffffff810f2dab state-change: cpu: 0 loops: 1 jiffies: 4297178887 curstate: 0 smdata->state: 1 thread_ack: 4 smdata: ffff8801c594be18 fn: ffffffff810f2dab cpu: 2 ack_state end jiffies: 4297178887 smdata->state: 1 thread_ack: 3 smdata: ffff8801c594be18 fn: ffffffff810f2dab state-change: cpu: 3 loops: 1 jiffies: 4297178887 curstate: 0 smdata->state: 1 thread_ack: 4 smdata: ffff8801c594be18 fn: ffffffff810f2dab cpu: 0 ack_state jiffies: 4297178887 smdata->state: 1 thread_ack: 3 smdata: ffff8801c594be18 fn: ffffffff810f2dab cpu: 3 ack_state jiffies: 4297178887 smdata->state: 1 thread_ack: 3 smdata: ffff8801c594be18 fn: ffffffff810f2dab cpu: 0 ack_state end jiffies: 4297178887 smdata->state: 1 thread_ack: 2 smdata: ffff8801c594be18 fn: ffffffff810f2dab cpu: 3 ack_state end jiffies: 4297178887 smdata->state: 1 thread_ack: 1 smdata: ffff8801c594be18 fn: ffffffff810f2dab state-change: cpu: 1 loops: 1 jiffies: 4297179040 
curstate: 0 smdata->state: 1 thread_ack: 1 smdata: ffff8801c594be18 fn: ffffffff810f2dab cpu: 1 ack_state jiffies: 4297179054 smdata->state: 1 thread_ack: 1 smdata: ffff8801c594be18 fn: ffffffff810f2dab cpu: 1 increment state, smdata: ffff8801c594be18 fn: ffffffff810f2dab set_state, cpu: 1 state: 1 newstate: 2 smdata: ffff8801c594be18 fn: ffffffff810f2dab cpu: 1 ack_state end jiffies: 4297179083 smdata->state: 2 thread_ack: 4 smdata: ffff8801c594be18 fn: ffffffff810f2dab state-change: cpu: 2 loops: 25037141 jiffies: 4297179083 curstate: 1 smdata->state: 2 thread_ack: 4 smdata: ffff8801c594be18 fn: ffffffff810f2dab state-change: cpu: 3 loops: 28970837 jiffies: 4297179083 curstate: 1 smdata->state: 2 thread_ack: 4 smdata: ffff8801c594be18 fn: ffffffff810f2dab cpu: 2 ack_state jiffies: 4297179083 smdata->state: 2 thread_ack: 4 smdata: ffff8801c594be18 fn: ffffffff810f2dab cpu: 3 ack_state jiffies: 4297179083 smdata->state: 2 thread_ack: 4 smdata: ffff8801c594be18 fn: ffffffff810f2dab state-change: cpu: 0 loops: 25898133 jiffies: 4297179083 curstate: 1 smdata->state: 2 thread_ack: 4 smdata: ffff8801c594be18 fn: ffffffff810f2dab cpu: 2 ack_state end jiffies: 4297179083 smdata->state: 2 thread_ack: 3 smdata: ffff8801c594be18 fn: ffffffff810f2dab cpu: 0 ack_state jiffies: 4297179083 smdata->state: 2 thread_ack: 2 smdata: ffff8801c594be18 fn: ffffffff810f2dab cpu: 3 ack_state end jiffies: 4297179083 smdata->state: 2 thread_ack: 2 smdata: ffff8801c594be18 fn: ffffffff810f2dab cpu: 0 ack_state end jiffies: 4297179083 smdata->state: 2 thread_ack: 1 smdata: ffff8801c594be18 fn: ffffffff810f2dab ------------[ cut here ]------------ WARNING: at /home/greearb/git/linux-2.6.linus/kernel/watchdog.c:245 watchdog_overflow_callback+0x9c/0xa7() Hardware name: To be filled by O.E.M. 
Watchdog detected hard LOCKUP on cpu 2 Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc] Pid: 23, comm: migration/2 Tainted: G C 3.9.4+ #11 Call Trace: <NMI> [<ffffffff810977f1>] warn_slowpath_common+0x85/0x9f [<ffffffff810978ae>] warn_slowpath_fmt+0x46/0x48 [<ffffffff8110f42d>] watchdog_overflow_callback+0x9c/0xa7 [<ffffffff8113feb6>] __perf_event_overflow+0x137/0x1cb [<ffffffff8101dff6>] ? x86_perf_event_set_period+0x103/0x10f [<ffffffff811403fa>] perf_event_overflow+0x14/0x16 [<ffffffff81023730>] intel_pmu_handle_irq+0x2dc/0x359 [<ffffffff815eee05>] perf_event_nmi_handler+0x19/0x1b [<ffffffff815ee5f3>] nmi_handle+0x7f/0xc2 [<ffffffff815ee574>] ? oops_begin+0xa9/0xa9 [<ffffffff815ee6f2>] do_nmi+0xbc/0x304 [<ffffffff815edd81>] end_repeat_nmi+0x1e/0x2e [<ffffffff81099fce>] ? vprintk_emit+0x40a/0x444 [<ffffffff81104ef8>] ? stop_machine_cpu_stop+0xd8/0x274 [<ffffffff81104ef8>] ? stop_machine_cpu_stop+0xd8/0x274 [<ffffffff81104ef8>] ? stop_machine_cpu_stop+0xd8/0x274 <<EOE>> [<ffffffff810f2dab>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810f2dab>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810f2dab>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff81104e20>] ? stop_one_cpu_nowait+0x30/0x30 [<ffffffff81104b8d>] cpu_stopper_thread+0xae/0x162 [<ffffffff815ebb1f>] ? __schedule+0x5ef/0x637 [<ffffffff815ecf38>] ? _raw_spin_unlock_irqrestore+0x47/0x7e [<ffffffff810e92cc>] ? trace_hardirqs_on_caller+0x123/0x15a [<ffffffff810e9310>] ? trace_hardirqs_on+0xd/0xf [<ffffffff815ecf61>] ? _raw_spin_unlock_irqrestore+0x70/0x7e [<ffffffff810bef34>] smpboot_thread_fn+0x258/0x260 [<ffffffff810becdc>] ? test_ti_thread_flag.clone.0+0x11/0x11 [<ffffffff810b7c22>] kthread+0xc7/0xcf [<ffffffff810b7b5b>] ? __init_kthread_worker+0x5b/0x5b [<ffffffff815f3b6c>] ret_from_fork+0x7c/0xb0 [<ffffffff810b7b5b>] ? 
__init_kthread_worker+0x5b/0x5b ---[ end trace 4947dfa9b0a4cec3 ]--- BUG: soft lockup - CPU#1 stuck for 22s! [migration/1:17] Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc] irq event stamp: 835637905 hardirqs last enabled at (835637904): [<ffffffff8109f4c1>] __do_softirq+0x9f/0x257 hardirqs last disabled at (835637905): [<ffffffff815f48ad>] apic_timer_interrupt+0x6d/0x80 softirqs last enabled at (5654720): [<ffffffff8109f621>] __do_softirq+0x1ff/0x257 softirqs last disabled at (5654725): [<ffffffff8109f743>] irq_exit+0x5f/0xbb CPU 1 Pid: 17, comm: migration/1 Tainted: G WC 3.9.4+ #11 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M. RIP: 0010:[<ffffffff8109ee72>] [<ffffffff8109ee72>] tasklet_hi_action+0xf0/0xf0 RSP: 0018:ffff88022bc83ef0 EFLAGS: 00000212 RAX: 0000000000000006 RBX: ffff880217deb710 RCX: 0000000000000006 RDX: 0000000000000006 RSI: 0000000000000000 RDI: ffffffff81a050b0 RBP: ffff88022bc83f78 R08: ffffffff81a050b0 R09: ffff88022bc83cc8 R10: 00000000000005f2 R11: ffff8802203aaf50 R12: ffff88022bc83e68 R13: ffffffff815f48b2 R14: ffff88022bc83f78 R15: ffff88022230e000 FS: 0000000000000000(0000) GS:ffff88022bc80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000430070 CR3: 00000001cbc5d000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process migration/1 (pid: 17, threadinfo ffff88022230e000, task ffff8802223142c0) Stack: ffffffff8109f539 ffff88022bc83f08 ffff88022230e010 042080402bc83f88 000000010021bfcd 000000012bc83fa8 ffff88022230e000 ffff88022230ffd8 0000000000000030 ffff880200000006 00000248d8cdab1c 1304da35fe841722 Call Trace: <IRQ> [<ffffffff8109f539>] ? 
__do_softirq+0x117/0x257 [<ffffffff8109f743>] irq_exit+0x5f/0xbb [<ffffffff815f59fd>] smp_apic_timer_interrupt+0x8a/0x98 [<ffffffff815f48b2>] apic_timer_interrupt+0x72/0x80 <EOI> [<ffffffff81099fdb>] ? vprintk_emit+0x417/0x444 [<ffffffff815e9fc0>] printk+0x4d/0x4f [<ffffffff81104b36>] ? cpu_stopper_thread+0x57/0x162 [<ffffffff8110504c>] stop_machine_cpu_stop+0x22c/0x274 [<ffffffff810f2dab>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810f2dab>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810f2dab>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff81104e20>] ? stop_one_cpu_nowait+0x30/0x30 [<ffffffff81104b8d>] cpu_stopper_thread+0xae/0x162 [<ffffffff815ebb1f>] ? __schedule+0x5ef/0x637 [<ffffffff815ecf38>] ? _raw_spin_unlock_irqrestore+0x47/0x7e [<ffffffff810e92cc>] ? trace_hardirqs_on_caller+0x123/0x15a [<ffffffff810e9310>] ? trace_hardirqs_on+0xd/0xf [<ffffffff815ecf61>] ? _raw_spin_unlock_irqrestore+0x70/0x7e [<ffffffff810bef34>] smpboot_thread_fn+0x258/0x260 [<ffffffff810becdc>] ? test_ti_thread_flag.clone.0+0x11/0x11 [<ffffffff810b7c22>] kthread+0xc7/0xcf [<ffffffff810b7b5b>] ? __init_kthread_worker+0x5b/0x5b [<ffffffff815f3b6c>] ret_from_fork+0x7c/0xb0 [<ffffffff810b7b5b>] ? __init_kthread_worker+0x5b/0x5b Code: 1c 25 18 e2 00 00 e8 cd fe ff ff e8 ac a4 04 00 fb 66 66 90 66 66 90 4c 89 e3 48 85 db 0f 85 79 ff ff ff 5f 5b 41 5c 41 5d c9 c3 <55> 48 89 e5 41 55 41 54 53 4 ------------[ cut here ]------------ WARNING: at /home/greearb/git/linux-2.6.linus/kernel/watchdog.c:245 watchdog_overflow_callback+0x9c/0xa7() Hardware name: To be filled by O.E.M. 
Watchdog detected hard LOCKUP on cpu 0 Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc] Pid: 8, comm: migration/0 Tainted: G WC 3.9.4+ #11 Call Trace: <NMI> [<ffffffff810977f1>] warn_slowpath_common+0x85/0x9f [<ffffffff810978ae>] warn_slowpath_fmt+0x46/0x48 [<ffffffff8110f42d>] watchdog_overflow_callback+0x9c/0xa7 [<ffffffff8113feb6>] __perf_event_overflow+0x137/0x1cb [<ffffffff8101dff6>] ? x86_perf_event_set_period+0x103/0x10f [<ffffffff811403fa>] perf_event_overflow+0x14/0x16 [<ffffffff81023730>] intel_pmu_handle_irq+0x2dc/0x359 [<ffffffff815eee05>] perf_event_nmi_handler+0x19/0x1b [<ffffffff815ee5f3>] nmi_handle+0x7f/0xc2 [<ffffffff815ee574>] ? oops_begin+0xa9/0xa9 [<ffffffff815ee6f2>] do_nmi+0xbc/0x304 [<ffffffff815edd81>] end_repeat_nmi+0x1e/0x2e [<ffffffff81099fce>] ? vprintk_emit+0x40a/0x444 [<ffffffff81104efa>] ? stop_machine_cpu_stop+0xda/0x274 [<ffffffff81104efa>] ? stop_machine_cpu_stop+0xda/0x274 [<ffffffff81104efa>] ? stop_machine_cpu_stop+0xda/0x274 <<EOE>> [<ffffffff810f2dab>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810f2dab>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810f2dab>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff81104e20>] ? stop_one_cpu_nowait+0x30/0x30 [<ffffffff81104b8d>] cpu_stopper_thread+0xae/0x162 [<ffffffff815ebb1f>] ? __schedule+0x5ef/0x637 [<ffffffff815ecf38>] ? _raw_spin_unlock_irqrestore+0x47/0x7e [<ffffffff810e92cc>] ? trace_hardirqs_on_caller+0x123/0x15a [<ffffffff810e9310>] ? trace_hardirqs_on+0xd/0xf [<ffffffff815ecf61>] ? _raw_spin_unlock_irqrestore+0x70/0x7e [<ffffffff810bef34>] smpboot_thread_fn+0x258/0x260 [<ffffffff810becdc>] ? test_ti_thread_flag.clone.0+0x11/0x11 [<ffffffff810b7c22>] kthread+0xc7/0xcf [<ffffffff810b7b5b>] ? __init_kthread_worker+0x5b/0x5b [<ffffffff815f3b6c>] ret_from_fork+0x7c/0xb0 [<ffffffff810b7b5b>] ? 
__init_kthread_worker+0x5b/0x5b ---[ end trace 4947dfa9b0a4cec4 ]--- ------------[ cut here ]------------ WARNING: at /home/greearb/git/linux-2.6.linus/kernel/watchdog.c:245 watchdog_overflow_callback+0x9c/0xa7() Hardware name: To be filled by O.E.M. Watchdog detected hard LOCKUP on cpu 3 Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc] Pid: 29, comm: migration/3 Tainted: G WC 3.9.4+ #11 Call Trace: <NMI> [<ffffffff810977f1>] warn_slowpath_common+0x85/0x9f [<ffffffff810978ae>] warn_slowpath_fmt+0x46/0x48 [<ffffffff8110f42d>] watchdog_overflow_callback+0x9c/0xa7 [<ffffffff8113feb6>] __perf_event_overflow+0x137/0x1cb [<ffffffff8101dff6>] ? x86_perf_event_set_period+0x103/0x10f [<ffffffff811403fa>] perf_event_overflow+0x14/0x16 [<ffffffff81023730>] intel_pmu_handle_irq+0x2dc/0x359 [<ffffffff815eee05>] perf_event_nmi_handler+0x19/0x1b [<ffffffff815ee5f3>] nmi_handle+0x7f/0xc2 [<ffffffff815ee574>] ? oops_begin+0xa9/0xa9 [<ffffffff815ee6f2>] do_nmi+0xbc/0x304 [<ffffffff815edd81>] end_repeat_nmi+0x1e/0x2e [<ffffffff81099fce>] ? vprintk_emit+0x40a/0x444 [<ffffffff81104efa>] ? stop_machine_cpu_stop+0xda/0x274 [<ffffffff81104efa>] ? stop_machine_cpu_stop+0xda/0x274 [<ffffffff81104efa>] ? stop_machine_cpu_stop+0xda/0x274 <<EOE>> [<ffffffff810f2dab>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810f2dab>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810f2dab>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff81104e20>] ? stop_one_cpu_nowait+0x30/0x30 [<ffffffff81104b8d>] cpu_stopper_thread+0xae/0x162 [<ffffffff815ebb1f>] ? __schedule+0x5ef/0x637 [<ffffffff815ecf38>] ? _raw_spin_unlock_irqrestore+0x47/0x7e [<ffffffff810e92cc>] ? trace_hardirqs_on_caller+0x123/0x15a [<ffffffff810e9310>] ? trace_hardirqs_on+0xd/0xf [<ffffffff815ecf61>] ? _raw_spin_unlock_irqrestore+0x70/0x7e [<ffffffff810bef34>] smpboot_thread_fn+0x258/0x260 [<ffffffff810becdc>] ? 
test_ti_thread_flag.clone.0+0x11/0x11 [<ffffffff810b7c22>] kthread+0xc7/0xcf [<ffffffff810b7b5b>] ? __init_kthread_worker+0x5b/0x5b [<ffffffff815f3b6c>] ret_from_fork+0x7c/0xb0 [<ffffffff810b7b5b>] ? __init_kthread_worker+0x5b/0x5b ---[ end trace 4947dfa9b0a4cec5 ]--- BUG: soft lockup - CPU#1 stuck for 22s! [migration/1:17] Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc] irq event stamp: 1774512131 hardirqs last enabled at (1774512130): [<ffffffff8109f4c1>] __do_softirq+0x9f/0x257 hardirqs last disabled at (1774512131): [<ffffffff815f48ad>] apic_timer_interrupt+0x6d/0x80 softirqs last enabled at (5654720): [<ffffffff8109f621>] __do_softirq+0x1ff/0x257 softirqs last disabled at (5654725): [<ffffffff8109f743>] irq_exit+0x5f/0xbb CPU 1 Pid: 17, comm: migration/1 Tainted: G WC 3.9.4+ #11 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M. RIP: 0010:[<ffffffff8109f4c5>] [<ffffffff8109f4c5>] __do_softirq+0xa3/0x257 RSP: 0018:ffff88022bc83ef8 EFLAGS: 00000202 RAX: ffff8802223142c0 RBX: ffff880217deb710 RCX: 0000000000000006 RDX: ffff88022230e010 RSI: 0000000000000000 RDI: ffff8802223142c0 RBP: ffff88022bc83f78 R08: ffffffff81a050b0 R09: ffff88022bc83cc8 R10: 00000000000005f2 R11: ffff8802203aaf50 R12: ffff88022bc83e68 R13: ffffffff815f48b2 R14: ffff88022bc83f78 R15: ffff88022230e000 FS: 0000000000000000(0000) GS:ffff88022bc80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000430070 CR3: 00000001cbc5d000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process migration/1 (pid: 17, threadinfo ffff88022230e000, task ffff8802223142c0) Stack: ffff88022bc83f08 ffff88022230e010 042080402bc83f88 000000010021bfcd 000000012bc83fa8 ffff88022230e000 ffff88022230ffd8 0000000000000038 ffff880200000006 
00000248d8cdab1c 1304da35fe841722 ffff88022bc8dc80 Call Trace: <IRQ> [<ffffffff8109f743>] irq_exit+0x5f/0xbb [<ffffffff815f59fd>] smp_apic_timer_interrupt+0x8a/0x98 [<ffffffff815f48b2>] apic_timer_interrupt+0x72/0x80 <EOI> [<ffffffff81099fdb>] ? vprintk_emit+0x417/0x444 [<ffffffff815e9fc0>] printk+0x4d/0x4f [<ffffffff81104b36>] ? cpu_stopper_thread+0x57/0x162 [<ffffffff8110504c>] stop_machine_cpu_stop+0x22c/0x274 [<ffffffff810f2dab>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810f2dab>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810f2dab>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff81104e20>] ? stop_one_cpu_nowait+0x30/0x30 [<ffffffff81104b8d>] cpu_stopper_thread+0xae/0x162 [<ffffffff815ebb1f>] ? __schedule+0x5ef/0x637 [<ffffffff815ecf38>] ? _raw_spin_unlock_irqrestore+0x47/0x7e [<ffffffff810e92cc>] ? trace_hardirqs_on_caller+0x123/0x15a [<ffffffff810e9310>] ? trace_hardirqs_on+0xd/0xf [<ffffffff815ecf61>] ? _raw_spin_unlock_irqrestore+0x70/0x7e [<ffffffff810bef34>] smpboot_thread_fn+0x258/0x260 [<ffffffff810becdc>] ? test_ti_thread_flag.clone.0+0x11/0x11 [<ffffffff810b7c22>] kthread+0xc7/0xcf [<ffffffff810b7b5b>] ? __init_kthread_worker+0x5b/0x5b [<ffffffff815f3b6c>] ret_from_fork+0x7c/0xb0 [<ffffffff810b7b5b>] ? __init_kthread_worker+0x5b/0x5b Code: 55 b0 49 81 ec d8 1f 00 00 49 8d 44 24 10 4c 89 65 a8 48 89 45 88 65 c7 04 25 80 1b 01 00 00 00 00 00 e8 42 9e 04 00 fb 66 66 90 <66> 66 90 48 c7 c3 80 50 a0 8 BUG: soft lockup - CPU#1 stuck for 22s! 
[migration/1:17] Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc] irq event stamp: 2713027507 hardirqs last enabled at (2713027506): [<ffffffff8109f4c1>] __do_softirq+0x9f/0x257 hardirqs last disabled at (2713027507): [<ffffffff815f48ad>] apic_timer_interrupt+0x6d/0x80 softirqs last enabled at (5654720): [<ffffffff8109f621>] __do_softirq+0x1ff/0x257 softirqs last disabled at (5654725): [<ffffffff8109f743>] irq_exit+0x5f/0xbb CPU 1 Pid: 17, comm: migration/1 Tainted: G WC 3.9.4+ #11 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M. RIP: 0010:[<ffffffff8109f4c5>] [<ffffffff8109f4c5>] __do_softirq+0xa3/0x257 RSP: 0018:ffff88022bc83ef8 EFLAGS: 00000286 RAX: ffff8802223142c0 RBX: ffff880217deb710 RCX: 0000000000000006 RDX: ffff88022230e010 RSI: 0000000000000000 RDI: ffff8802223142c0 RBP: ffff88022bc83f78 R08: ffffffff81a050b0 R09: ffff88022bc83cc8 R10: 00000000000005f2 R11: ffff88022bc83c38 R12: ffff88022bc83e68 R13: ffffffff815f48b2 R14: ffff88022bc83f78 R15: ffff88022230e000 FS: 0000000000000000(0000) GS:ffff88022bc80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000430070 CR3: 00000001cbc5d000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process migration/1 (pid: 17, threadinfo ffff88022230e000, task ffff8802223142c0) Stack: ffff88022bc83f08 ffff88022230e010 042080402bc83f88 000000010021bfcd 000000012bc83fa8 ffff88022230e000 ffff88022230ffd8 0000000000000038 ffff880200000006 00000248d8cdab1c 1304da35fe841722 ffff88022bc8dc80 Call Trace: <IRQ> [<ffffffff8109f743>] irq_exit+0x5f/0xbb [<ffffffff815f59fd>] smp_apic_timer_interrupt+0x8a/0x98 [<ffffffff815f48b2>] apic_timer_interrupt+0x72/0x80 <EOI> [<ffffffff81099fdb>] ? 
vprintk_emit+0x417/0x444 [<ffffffff815e9fc0>] printk+0x4d/0x4f [<ffffffff81104b36>] ? cpu_stopper_thread+0x57/0x162 [<ffffffff8110504c>] stop_machine_cpu_stop+0x22c/0x274 [<ffffffff810f2dab>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810f2dab>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff810f2dab>] ? copy_module_from_fd+0xe7/0xe7 [<ffffffff81104e20>] ? stop_one_cpu_nowait+0x30/0x30 [<ffffffff81104b8d>] cpu_stopper_thread+0xae/0x162 [<ffffffff815ebb1f>] ? __schedule+0x5ef/0x637 [<ffffffff815ecf38>] ? _raw_spin_unlock_irqrestore+0x47/0x7e [<ffffffff810e92cc>] ? trace_hardirqs_on_caller+0x123/0x15a [<ffffffff810e9310>] ? trace_hardirqs_on+0xd/0xf [<ffffffff815ecf61>] ? _raw_spin_unlock_irqrestore+0x70/0x7e [<ffffffff810bef34>] smpboot_thread_fn+0x258/0x260 [<ffffffff810becdc>] ? test_ti_thread_flag.clone.0+0x11/0x11 [<ffffffff810b7c22>] kthread+0xc7/0xcf [<ffffffff810b7b5b>] ? __init_kthread_worker+0x5b/0x5b [<ffffffff815f3b6c>] ret_from_fork+0x7c/0xb0 [<ffffffff810b7b5b>] ? __init_kthread_worker+0x5b/0x5b Code: 55 b0 49 81 ec d8 1f 00 00 49 8d 44 24 10 4c 89 65 a8 48 89 45 88 65 c7 04 25 80 1b 01 00 00 00 00 00 e8 42 9e 04 00 fb 66 66 90 <66> 66 90 48 c7 c3 80 50 a0 8 Thanks, Ben -- Ben Greear <greearb@candelatech.com> Candela Technologies Inc http://www.candelatech.com ^ permalink raw reply [flat|nested] 51+ messages in thread
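[Editor's sketch] The debug trace above follows a simple handshake: every CPU notices that smdata->state changed, acknowledges it by decrementing thread_ack, and the last CPU to acknowledge advances the state for everyone. The single-threaded C model below is only an illustration of that handshake under stated assumptions. It is not the kernel's stop_machine code: NCPUS, LAST_STATE, simulate() and the round-robin polling order are hypothetical, and only the names state, thread_ack, set_state and ack_state are borrowed from the trace.

```c
#include <assert.h>

#define NCPUS      4   /* the trace shows num-threads: 4 */
#define LAST_STATE 5   /* assumption: the trace ends with newstate: 5 */

/* Hypothetical, simplified stand-in for the shared stop_machine data. */
struct smdata {
    int state;
    int thread_ack;    /* CPUs that still have to acknowledge `state` */
};

static void set_state(struct smdata *sm, int newstate)
{
    sm->thread_ack = NCPUS;   /* everyone must ack the new state */
    sm->state = newstate;
}

/* The last CPU to acknowledge moves all CPUs on to the next state. */
static void ack_state(struct smdata *sm)
{
    assert(sm->thread_ack > 0);
    if (--sm->thread_ack == 0)
        set_state(sm, sm->state + 1);
}

/*
 * Poll the shared state round-robin for at most `rounds` rounds.
 * `stuck_cpu` never looks at smdata (pass -1 for "no CPU is stuck"),
 * which models a CPU that is busy elsewhere, e.g. in softirq handling.
 * Returns the state reached when the simulation stops.
 */
static int simulate(int stuck_cpu, int rounds)
{
    struct smdata sm;
    int curstate[NCPUS] = {0};

    assert(rounds > 0);
    set_state(&sm, 1);
    for (int r = 0; r < rounds && sm.state < LAST_STATE; r++) {
        for (int cpu = 0; cpu < NCPUS; cpu++) {
            if (cpu == stuck_cpu)
                continue;                  /* never notices the change */
            if (sm.state != curstate[cpu]) {
                curstate[cpu] = sm.state;  /* "state-change" in the trace */
                ack_state(&sm);
            }
        }
    }
    return sm.state;
}
```

In this model, simulate(-1, 100) runs through to LAST_STATE, while simulate(1, 100) stays in state 1 with thread_ack left at 1: three CPUs have acknowledged and keep spinning, waiting for the one CPU that never polls. That is the same shape as the tail of the trace, where thread_ack sits at 1 until cpu 1 finally shows up.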
* Re: stop_machine lockup issue in 3.9.y.
  2013-06-05 19:31 ` stop_machine lockup issue in 3.9.y Ben Greear
@ 2013-06-05 20:58 ` Ben Greear
  2013-06-05 21:11 ` Tejun Heo
  0 siblings, 1 reply; 51+ messages in thread
From: Ben Greear @ 2013-06-05 20:58 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Rusty Russell, Joe Lawrence, Linux Kernel Mailing List, stable

On 06/05/2013 12:31 PM, Ben Greear wrote:
> This is no longer really about the module unlink, so changing
> subject.
>
> On 06/05/2013 12:11 PM, Ben Greear wrote:
>> On 06/05/2013 11:48 AM, Tejun Heo wrote:
>>> Hello, Ben.
>>>
>>> On Wed, Jun 05, 2013 at 09:59:00AM -0700, Ben Greear wrote:
>>>> One pattern I notice repeating for at least most of the hangs is that all but one
>>>> CPU thread has irqs disabled and is in state 2. But, there will be one thread
>>>> in state 1 that still has IRQs enabled and it is reported to be in soft-lockup
>>>> instead of hard-lockup. In 'sysrq l' it always shows some IRQ processing,
>>>> but typically that of the sysrq itself. I added printk that would always
>>>> print if the thread notices that smdata->state != curstate, and the soft-lockup
>>>> thread (cpu 2 below) never shows that message.
>>>
>>> It sounds like one of the cpus get live-locked by IRQs. I can't tell
>>> why the situation is made worse by other CPUs being tied up. Do you
>>> ever see CPUs being live locked by IRQs during normal operation?

Hmm, wonder if I found it. I previously saw times where it appears
jiffies does not increment. __do_softirq has a break-out based on
jiffies timeout. Maybe that is failing to get us out of __do_softirq
in my lockup case because for whatever reason the system cannot update
jiffies in this case?

I added this (probably whitespace damaged) hack and now I have not been
able to reproduce the problem.
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 14d7758..621ea3b 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -212,6 +212,7 @@ asmlinkage void __do_softirq(void)
 	unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
 	int cpu;
 	unsigned long old_flags = current->flags;
+	unsigned long loops = 0;
 
 	/*
 	 * Mask out PF_MEMALLOC s current task context is borrowed for the
@@ -241,6 +242,7 @@ restart:
 			unsigned int vec_nr = h - softirq_vec;
 			int prev_count = preempt_count();
 
+			loops++;
 			kstat_incr_softirqs_this_cpu(vec_nr);
 
 			trace_softirq_entry(vec_nr);
@@ -265,7 +267,7 @@ restart:
 
 	pending = local_softirq_pending();
 	if (pending) {
-		if (time_before(jiffies, end) && !need_resched())
+		if (time_before(jiffies, end) && !need_resched() && (loops < 500))
 			goto restart;
 
 		wakeup_softirqd();

Thanks,
Ben

--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com

^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: stop_machine lockup issue in 3.9.y.
  2013-06-05 20:58 ` Ben Greear
  2013-06-05 21:11 ` Tejun Heo
@ 2013-06-05 21:11 ` Tejun Heo
  0 siblings, 0 replies; 51+ messages in thread
From: Tejun Heo @ 2013-06-05 21:11 UTC (permalink / raw)
  To: Ben Greear
  Cc: Rusty Russell, Joe Lawrence, Linux Kernel Mailing List, stable,
	Luis R. Rodriguez, Jouni Malinen, Vasanthakumar Thiagarajan,
	Senthil Balasubramanian, linux-wireless, ath9k-devel,
	Thomas Gleixner, Ingo Molnar

(cc'ing wireless crowd, tglx and Ingo. The original thread is at
 http://thread.gmane.org/gmane.linux.kernel/1500158/focus=55005 )

Hello, Ben.

On Wed, Jun 05, 2013 at 01:58:31PM -0700, Ben Greear wrote:
> Hmm, wonder if I found it. I previously saw times where it appears
> jiffies does not increment. __do_softirq has a break-out based on
> jiffies timeout. Maybe that is failing to get us out of __do_softirq
> in my lockup case because for whatever reason the system cannot update
> jiffies in this case?
>
> I added this (probably whitespace damaged) hack and now I have not been
> able to reproduce the problem.

Ah, nice catch. :)

> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index 14d7758..621ea3b 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -212,6 +212,7 @@ asmlinkage void __do_softirq(void)
>  	unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
>  	int cpu;
>  	unsigned long old_flags = current->flags;
> +	unsigned long loops = 0;
>
>  	/*
>  	 * Mask out PF_MEMALLOC s current task context is borrowed for the
> @@ -241,6 +242,7 @@ restart:
>  			unsigned int vec_nr = h - softirq_vec;
>  			int prev_count = preempt_count();
>
> +			loops++;
>  			kstat_incr_softirqs_this_cpu(vec_nr);
>
>  			trace_softirq_entry(vec_nr);
> @@ -265,7 +267,7 @@ restart:
>
>  	pending = local_softirq_pending();
>  	if (pending) {
> -		if (time_before(jiffies, end) && !need_resched())
> +		if (time_before(jiffies, end) && !need_resched() && (loops < 500))
>  			goto restart;

So, a softirq, most likely kicked off from ath9k, is rescheduling itself
to the extent that it ends up locking out the CPU completely. This is
usually okay because the processing would break out after 2ms, but as
jiffies is stopped in this case with all other CPUs trapped in
stop_machine, the loop never breaks and the machine hangs. While
adding the counter limit probably isn't a bad idea, a softirq requeueing
itself indefinitely sounds pretty buggy.

ath9k people, do you guys have any idea what's going on? Why would a
softirq repeat itself indefinitely?

Ingo, Thomas, we're seeing stop_machine hang because

* All other CPUs have entered the IRQ-disabled stage. Jiffies is not
  being updated.

* The last CPU gets caught up executing softirq indefinitely. As
  jiffies doesn't get updated, it never breaks out of softirq
  handling. This is a deadlock. This CPU won't break out of softirq
  handling unless jiffies is updated, and the other CPUs can't do
  anything until this CPU enters the same stop_machine stage.

Ben found out that breaking out of softirq handling after a certain
number of repetitions makes the issue go away, which isn't a proper
fix but which we might want anyway. What do you guys think?

Thanks.

--
tejun

^ permalink raw reply [flat|nested] 51+ messages in thread
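[Editor's sketch] The deadlock described above comes down to a loop whose only exit is a clock that has stopped. The small userspace C model below is a rough illustration of that point, not the kernel's __do_softirq(): struct fake_clock, before() and softirq_passes() are hypothetical names, the handler is assumed to re-raise itself on every pass, the need_resched() check is left out, and hard_stop exists only so the simulation itself terminates.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's jiffies counter. */
struct fake_clock {
    unsigned long jiffies;
    bool ticking;   /* false models "nobody is updating jiffies" */
};

/* Wrap-safe "a is before b", the same idea as time_before(). */
static bool before(unsigned long a, unsigned long b)
{
    return (long)(a - b) < 0;
}

/*
 * Model of the restart loop: the softirq always re-raises itself, so the
 * only ways out are the time budget and the optional pass cap.
 * max_loops == 0 means "no cap" (the unpatched behaviour).
 * hard_stop bounds the simulation; reaching it means "would spin forever".
 * Returns the number of handler passes that were executed.
 */
static unsigned long softirq_passes(struct fake_clock *clk,
                                    unsigned long budget,
                                    unsigned long max_loops,
                                    unsigned long hard_stop)
{
    unsigned long end = clk->jiffies + budget;
    unsigned long loops = 0;

    assert(hard_stop > 0);
    while (loops < hard_stop) {
        loops++;                  /* one handler run */
        if (clk->ticking)
            clk->jiffies++;       /* simplification: one tick per pass */

        /* Work is still pending; decide whether to go around again. */
        if (before(clk->jiffies, end) &&
            (max_loops == 0 || loops < max_loops))
            continue;             /* the "goto restart" case */
        break;                    /* hand the rest to ksoftirqd */
    }
    return loops;
}
```

With a ticking clock and a budget of 2, softirq_passes() returns after 2 passes. With a frozen clock and no cap it only stops at hard_stop, which is the hang. With a frozen clock and a cap of 500 it returns after 500 passes, which is why the loop counter hides the problem without explaining why the softirq keeps requeueing itself.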
* Re: stop_machine lockup issue in 3.9.y. 2013-06-05 21:11 ` Tejun Heo @ 2013-06-05 21:33 ` Ben Greear -1 siblings, 0 replies; 51+ messages in thread From: Ben Greear @ 2013-06-05 21:33 UTC (permalink / raw) To: Tejun Heo Cc: Rusty Russell, Joe Lawrence, Linux Kernel Mailing List, stable, Luis R. Rodriguez, Jouni Malinen, Vasanthakumar Thiagarajan, Senthil Balasubramanian, linux-wireless, ath9k-devel, Thomas Gleixner, Ingo Molnar On 06/05/2013 02:11 PM, Tejun Heo wrote: > (cc'ing wireless crowd, tglx and Ingo. The original thread is at > http://thread.gmane.org/gmane.linux.kernel/1500158/focus=55005 ) > > Hello, Ben. > > On Wed, Jun 05, 2013 at 01:58:31PM -0700, Ben Greear wrote: >> Hmm, wonder if I found it. I previously saw times where it appears >> jiffies does not increment. __do_softirq has a break-out based on >> jiffies timeout. Maybe that is failing to get us out of __do_softirq >> in my lockup case because for whatever reason the system cannot update >> jiffies in this case? >> >> I added this (probably whitespace damaged) hack and now I have not been >> able to reproduce the problem. > > Ah, nice catch. 
:) > >> diff --git a/kernel/softirq.c b/kernel/softirq.c >> index 14d7758..621ea3b 100644 >> --- a/kernel/softirq.c >> +++ b/kernel/softirq.c >> @@ -212,6 +212,7 @@ asmlinkage void __do_softirq(void) >> unsigned long end = jiffies + MAX_SOFTIRQ_TIME; >> int cpu; >> unsigned long old_flags = current->flags; >> + unsigned long loops = 0; >> >> /* >> * Mask out PF_MEMALLOC s current task context is borrowed for the >> @@ -241,6 +242,7 @@ restart: >> unsigned int vec_nr = h - softirq_vec; >> int prev_count = preempt_count(); >> >> + loops++; >> kstat_incr_softirqs_this_cpu(vec_nr); >> >> trace_softirq_entry(vec_nr); >> @@ -265,7 +267,7 @@ restart: >> >> pending = local_softirq_pending(); >> if (pending) { >> - if (time_before(jiffies, end) && !need_resched()) >> + if (time_before(jiffies, end) && !need_resched() && (loops < 500)) >> goto restart; > > So, softirq most likely kicked off from ath9k is rescheduling itself > to the extent where it ends up locking out the CPU completely. The > problem is usually okay because the processing would break out in 2ms > but as jiffies is stopped in this case with all other CPUs trapped in > stop_machine, the loop never breaks and the machine hangs. While > adding the counter limit probably isn't a bad idea, softirq requeueing > itself indefinitely sounds pretty buggy. Just to be clear on the ath9k part for the wifi folks: This is basically un-patched 3.9.4, but I have 200 virtual stations configured on each of two ath9k radios. I cannot reproduce the problem without ath9k, but I do not know for certain ath9k is the real culprit. In the case where I can most easily reproduce the lockup, ath9k virtual stations would be trying to associate, so I'd expect a fair amount of packet processing to be happening... > ath9k people, do you guys have any idea what's going on? Why would > softirq repeat itself indefinitely? > > Ingo, Thomas, we're seeing a stop_machine hanging because > > * All other CPUs entered IRQ disabled stage. 
Jiffies is not being > updated. > > * The last CPU get caught up executing softirq indefinitely. As > jiffies doesn't get updated, it never breaks out of softirq > handling. This is a deadlock. This CPU won't break out of softirq > handling unless jiffies is updated and other CPUs can't do anything > until this CPU enters the same stop_machine stage. > > Ben found out that breaking out of softirq handling after certain > number of repetitions makes the issue go away, which isn't a proper > fix but we might want anyway. What do you guys think? Thanks, Ben -- Ben Greear <greearb@candelatech.com> Candela Technologies Inc http://www.candelatech.com ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: stop_machine lockup issue in 3.9.y. 2013-06-05 21:11 ` Tejun Heo (?) @ 2013-06-06 1:34 ` Eric Dumazet -1 siblings, 0 replies; 51+ messages in thread From: Eric Dumazet @ 2013-06-06 1:34 UTC (permalink / raw) To: Tejun Heo Cc: Ben Greear, Rusty Russell, Joe Lawrence, Linux Kernel Mailing List, stable, Luis R. Rodriguez, Jouni Malinen, Vasanthakumar Thiagarajan, Senthil Balasubramanian, linux-wireless, ath9k-devel, Thomas Gleixner, Ingo Molnar On Wed, 2013-06-05 at 14:11 -0700, Tejun Heo wrote: > (cc'ing wireless crowd, tglx and Ingo. The original thread is at > http://thread.gmane.org/gmane.linux.kernel/1500158/focus=55005 ) > > Hello, Ben. > > On Wed, Jun 05, 2013 at 01:58:31PM -0700, Ben Greear wrote: > > Hmm, wonder if I found it. I previously saw times where it appears > > jiffies does not increment. __do_softirq has a break-out based on > > jiffies timeout. Maybe that is failing to get us out of __do_softirq > > in my lockup case because for whatever reason the system cannot update > > jiffies in this case? > > > > I added this (probably whitespace damaged) hack and now I have not been > > able to reproduce the problem. > > Ah, nice catch. 
:) > > > diff --git a/kernel/softirq.c b/kernel/softirq.c > > index 14d7758..621ea3b 100644 > > --- a/kernel/softirq.c > > +++ b/kernel/softirq.c > > @@ -212,6 +212,7 @@ asmlinkage void __do_softirq(void) > > unsigned long end = jiffies + MAX_SOFTIRQ_TIME; > > int cpu; > > unsigned long old_flags = current->flags; > > + unsigned long loops = 0; > > > > /* > > * Mask out PF_MEMALLOC s current task context is borrowed for the > > @@ -241,6 +242,7 @@ restart: > > unsigned int vec_nr = h - softirq_vec; > > int prev_count = preempt_count(); > > > > + loops++; > > kstat_incr_softirqs_this_cpu(vec_nr); > > > > trace_softirq_entry(vec_nr); > > @@ -265,7 +267,7 @@ restart: > > > > pending = local_softirq_pending(); > > if (pending) { > > - if (time_before(jiffies, end) && !need_resched()) > > + if (time_before(jiffies, end) && !need_resched() && (loops < 500)) > > goto restart; > > So, softirq most likely kicked off from ath9k is rescheduling itself > to the extent where it ends up locking out the CPU completely. The > problem is usually okay because the processing would break out in 2ms > but as jiffies is stopped in this case with all other CPUs trapped in > stop_machine, the loop never breaks and the machine hangs. While > adding the counter limit probably isn't a bad idea, softirq requeueing > itself indefinitely sounds pretty buggy. > > ath9k people, do you guys have any idea what's going on? Why would > softirq repeat itself indefinitely? > > Ingo, Thomas, we're seeing a stop_machine hanging because > > * All other CPUs entered IRQ disabled stage. Jiffies is not being > updated. > > * The last CPU get caught up executing softirq indefinitely. As > jiffies doesn't get updated, it never breaks out of softirq > handling. This is a deadlock. This CPU won't break out of softirq > handling unless jiffies is updated and other CPUs can't do anything > until this CPU enters the same stop_machine stage. 
> > Ben found out that breaking out of softirq handling after certain > number of repetitions makes the issue go away, which isn't a proper > fix but we might want anyway. What do you guys think? > Interesting.... Before 3.9 and commit c10d73671ad30f5469 ("softirq: reduce latencies") we used to limit the __do_softirq() loop to 10. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: stop_machine lockup issue in 3.9.y. 2013-06-06 1:34 ` Eric Dumazet (?) @ 2013-06-06 3:14 ` Tejun Heo -1 siblings, 0 replies; 51+ messages in thread From: Tejun Heo @ 2013-06-06 3:14 UTC (permalink / raw) To: Eric Dumazet Cc: Ben Greear, Rusty Russell, Joe Lawrence, Linux Kernel Mailing List, stable, Luis R. Rodriguez, Jouni Malinen, Vasanthakumar Thiagarajan, Senthil Balasubramanian, linux-wireless, ath9k-devel, Thomas Gleixner, Ingo Molnar Hello, Eric. On Wed, Jun 05, 2013 at 06:34:52PM -0700, Eric Dumazet wrote: > > Ingo, Thomas, we're seeing a stop_machine hanging because > > > > * All other CPUs entered IRQ disabled stage. Jiffies is not being > > updated. > > > > * The last CPU get caught up executing softirq indefinitely. As > > jiffies doesn't get updated, it never breaks out of softirq > > handling. This is a deadlock. This CPU won't break out of softirq > > handling unless jiffies is updated and other CPUs can't do anything > > until this CPU enters the same stop_machine stage. > > > > Ben found out that breaking out of softirq handling after certain > > number of repetitions makes the issue go away, which isn't a proper > > fix but we might want anyway. What do you guys think? > > > > Interesting.... > > Before 3.9 and commit c10d73671ad30f5469 > ("softirq: reduce latencies") we used to limit the __do_softirq() loop > to 10. Ah, so, that's why it's showing up now. We probably have had the same issue all along but it used to be masked by the softirq limiting. Do you care to revive the 10 iterations limit so that it's limited by both the count and timing? We do wanna find out why softirq is spinning indefinitely tho. Thanks. -- tejun ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: stop_machine lockup issue in 3.9.y. 2013-06-06 3:14 ` Tejun Heo (?) @ 2013-06-06 3:26 ` Eric Dumazet -1 siblings, 0 replies; 51+ messages in thread From: Eric Dumazet @ 2013-06-06 3:26 UTC (permalink / raw) To: Tejun Heo Cc: Ben Greear, Rusty Russell, Joe Lawrence, Linux Kernel Mailing List, stable, Luis R. Rodriguez, Jouni Malinen, Vasanthakumar Thiagarajan, Senthil Balasubramanian, linux-wireless, ath9k-devel, Thomas Gleixner, Ingo Molnar On Wed, 2013-06-05 at 20:14 -0700, Tejun Heo wrote: > > Ah, so, that's why it's showing up now. We probably have had the same > issue all along but it used to be masked by the softirq limiting. Do > you care to revive the 10 iterations limit so that it's limited by > both the count and timing? We do wanna find out why softirq is > spinning indefinitely tho. Yes, no problem, I can do that. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: stop_machine lockup issue in 3.9.y. 2013-06-06 3:26 ` Eric Dumazet @ 2013-06-06 3:41 ` Ben Greear -1 siblings, 0 replies; 51+ messages in thread From: Ben Greear @ 2013-06-06 3:41 UTC (permalink / raw) To: Eric Dumazet Cc: Tejun Heo, Rusty Russell, Joe Lawrence, Linux Kernel Mailing List, stable, Luis R. Rodriguez, Jouni Malinen, Vasanthakumar Thiagarajan, Senthil Balasubramanian, linux-wireless, ath9k-devel, Thomas Gleixner, Ingo Molnar On 06/05/2013 08:26 PM, Eric Dumazet wrote: > On Wed, 2013-06-05 at 20:14 -0700, Tejun Heo wrote: > >> >> Ah, so, that's why it's showing up now. We probably have had the same >> issue all along but it used to be masked by the softirq limiting. Do >> you care to revive the 10 iterations limit so that it's limited by >> both the count and timing? We do wanna find out why softirq is >> spinning indefinitely tho. > > Yes, no problem, I can do that. Limiting it to 5000 fixes my problem, so if you wanted it larger than 10, that would be fine by me. I can send a version of my patch easily enough if we can agree on the max number of loops (and if indeed my version of the patch is acceptable). Thanks, Ben -- Ben Greear <greearb@candelatech.com> Candela Technologies Inc http://www.candelatech.com ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: stop_machine lockup issue in 3.9.y. 2013-06-06 3:41 ` [ath9k-devel] " Ben Greear @ 2013-06-06 3:46 ` Eric Dumazet -1 siblings, 0 replies; 51+ messages in thread From: Eric Dumazet @ 2013-06-06 3:46 UTC (permalink / raw) To: Ben Greear Cc: Tejun Heo, Rusty Russell, Joe Lawrence, Linux Kernel Mailing List, stable, Luis R. Rodriguez, Jouni Malinen, Vasanthakumar Thiagarajan, Senthil Balasubramanian, linux-wireless, ath9k-devel, Thomas Gleixner, Ingo Molnar On Wed, 2013-06-05 at 20:41 -0700, Ben Greear wrote: > On 06/05/2013 08:26 PM, Eric Dumazet wrote: > > On Wed, 2013-06-05 at 20:14 -0700, Tejun Heo wrote: > > > >> > >> Ah, so, that's why it's showing up now. We probably have had the same > >> issue all along but it used to be masked by the softirq limiting. Do > >> you care to revive the 10 iterations limit so that it's limited by > >> both the count and timing? We do wanna find out why softirq is > >> spinning indefinitely tho. > > > > Yes, no problem, I can do that. > > Limiting it to 5000 fixes my problem, so if you wanted it larger than 10, that would > be fine by me. > > I can send a version of my patch easily enough if we can agree on the max number of > loops (and if indeed my version of the patch is acceptable). Well, 10 was the prior limit and seems fine. Jiffies not being updated seems like quite an exceptional condition (I hope...). At Google we use a patch that triggers a warning if a thread holds the CPU for more than xx ms without checking need_resched(). ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: stop_machine lockup issue in 3.9.y. 2013-06-06 3:46 ` [ath9k-devel] " Eric Dumazet @ 2013-06-06 3:50 ` Ben Greear -1 siblings, 0 replies; 51+ messages in thread From: Ben Greear @ 2013-06-06 3:50 UTC (permalink / raw) To: Eric Dumazet Cc: Tejun Heo, Rusty Russell, Joe Lawrence, Linux Kernel Mailing List, stable, Luis R. Rodriguez, Jouni Malinen, Vasanthakumar Thiagarajan, Senthil Balasubramanian, linux-wireless, ath9k-devel, Thomas Gleixner, Ingo Molnar On 06/05/2013 08:46 PM, Eric Dumazet wrote: > On Wed, 2013-06-05 at 20:41 -0700, Ben Greear wrote: >> On 06/05/2013 08:26 PM, Eric Dumazet wrote: >>> On Wed, 2013-06-05 at 20:14 -0700, Tejun Heo wrote: >>> >>>> >>>> Ah, so, that's why it's showing up now. We probably have had the same >>>> issue all along but it used to be masked by the softirq limiting. Do >>>> you care to revive the 10 iterations limit so that it's limited by >>>> both the count and timing? We do wanna find out why softirq is >>>> spinning indefinitely tho. >>> >>> Yes, no problem, I can do that. >> >> Limiting it to 5000 fixes my problem, so if you wanted it larger than 10, that would >> be fine by me. >> >> I can send a version of my patch easily enough if we can agree on the max number of >> loops (and if indeed my version of the patch is acceptable). > > Well, 10 was the prior limit and seems really fine. > > The non update on jiffies seems quite exceptional condition (I hope...) > > We use in Google a patch triggering warning is a thread holds the cpu > without taking care to need_resched() for more than xx ms Well, I'm sure that patch works nicely until the clock stops moving forward :) I'll post a patch with limit of 10 shortly. Thanks, Ben -- Ben Greear <greearb@candelatech.com> Candela Technologies Inc http://www.candelatech.com ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: stop_machine lockup issue in 3.9.y.
  2013-06-06 3:50 ` [ath9k-devel] " Ben Greear
@ 2013-06-06 4:08 ` Eric Dumazet
  -1 siblings, 0 replies; 51+ messages in thread
From: Eric Dumazet @ 2013-06-06 4:08 UTC (permalink / raw)
  To: Ben Greear
  Cc: Tejun Heo, Rusty Russell, Joe Lawrence, Linux Kernel Mailing List, stable, Luis R. Rodriguez, Jouni Malinen, Vasanthakumar Thiagarajan, Senthil Balasubramanian, linux-wireless, ath9k-devel, Thomas Gleixner, Ingo Molnar

On Wed, 2013-06-05 at 20:50 -0700, Ben Greear wrote:
> On 06/05/2013 08:46 PM, Eric Dumazet wrote:
> >
> > We use in Google a patch triggering warning is a thread holds the cpu
> > without taking care to need_resched() for more than xx ms
>
> Well, I'm sure that patch works nicely until the clock stops moving
> forward :)
>

This is not using jiffies, but the clock used in kernel/sched/core.c,
with ns resolution ;)

> I'll post a patch with limit of 10 shortly.

ok

^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: stop_machine lockup issue in 3.9.y.
  2013-06-06 3:41 ` [ath9k-devel] " Ben Greear
@ 2013-06-06 20:55 ` Tejun Heo
  -1 siblings, 0 replies; 51+ messages in thread
From: Tejun Heo @ 2013-06-06 20:55 UTC (permalink / raw)
  To: Ben Greear
  Cc: Eric Dumazet, Rusty Russell, Joe Lawrence, Linux Kernel Mailing List, stable, Luis R. Rodriguez, Jouni Malinen, Vasanthakumar Thiagarajan, Senthil Balasubramanian, linux-wireless, ath9k-devel, Thomas Gleixner, Ingo Molnar

Hello, Ben.

On Wed, Jun 05, 2013 at 08:41:01PM -0700, Ben Greear wrote:
> On 06/05/2013 08:26 PM, Eric Dumazet wrote:
> >On Wed, 2013-06-05 at 20:14 -0700, Tejun Heo wrote:
> >>Ah, so, that's why it's showing up now. We probably have had the same
> >>issue all along but it used to be masked by the softirq limiting. Do
> >>you care to revive the 10 iterations limit so that it's limited by
> >>both the count and timing? We do wanna find out why softirq is
> >>spinning indefinitely tho.
> >
> >Yes, no problem, I can do that.
>
> Limiting it to 5000 fixes my problem, so if you wanted it larger than 10, that would
> be fine by me.

First of all, kudos for tracking the issue down. While the removal of
looping limit in softirq handling was the direct cause for making the
problem visible, it's very bothering that we have softirq runaway.
Finding out the perpetrator shouldn't be hard. Something like the
following should work (untested). Once we know which softirq (prolly
the network one), we can dig deeper.

Thanks.

diff --git a/kernel/softirq.c b/kernel/softirq.c
index b5197dc..5af3682 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -212,6 +212,7 @@ asmlinkage void __do_softirq(void)
 	unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
 	int cpu;
 	unsigned long old_flags = current->flags;
+	int cnt = 0;
 
 	/*
 	 * Mask out PF_MEMALLOC s current task context is borrowed for the
@@ -244,6 +245,9 @@ restart:
 			kstat_incr_softirqs_this_cpu(vec_nr);
 
 			trace_softirq_entry(vec_nr);
+			if (++cnt >= 5000 && cnt < 5010)
+				printk("XXX __do_softirq: stuck handling softirqs, cnt=%d action=%pf\n",
+				       cnt, h->action);
 			h->action(h);
 			trace_softirq_exit(vec_nr);
 			if (unlikely(prev_count != preempt_count())) {

-- 
tejun

^ permalink raw reply related [flat|nested] 51+ messages in thread
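The idea discussed in this subthread, bounding the softirq restart loop by both an iteration count and a time budget so that a clock that stops advancing cannot keep the loop alive, can be sketched in plain userspace C. This is only an illustration under stated assumptions, not the kernel code and not Ben's patch: run_bounded(), pending_fn, clock_fn and the test doubles are hypothetical names, and the clock is passed in as a function so the "stuck jiffies" case can be simulated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callbacks; neither name comes from kernel/softirq.c. */
typedef bool (*pending_fn)(void *ctx);    /* is there still work queued? */
typedef unsigned long (*clock_fn)(void);  /* stand-in for reading jiffies */

/*
 * Keep making passes until nothing is pending, the time budget is spent,
 * or max_restart passes have run, whichever comes first. Returns the
 * number of passes made. The plain >= comparison ignores counter
 * wraparound, which the kernel's time_before() helper takes care of.
 */
int run_bounded(pending_fn pending, void *ctx, clock_fn now,
		unsigned long budget, int max_restart)
{
	unsigned long end;
	int passes = 0;

	assert(pending != NULL && now != NULL);
	end = now() + budget;

	while (pending(ctx)) {
		passes++;                   /* one pass over the pending work */
		if (now() >= end)           /* time limit */
			break;
		if (passes >= max_restart)  /* count limit, holds even if the clock is stuck */
			break;
	}
	return passes;
}

/* Test doubles, all hypothetical. */
unsigned long fake_ticks;
unsigned long stuck_clock(void)   { return 0; }            /* clock never advances */
unsigned long ticking_clock(void) { return fake_ticks++; } /* advances on every read */
bool always_pending(void *ctx)    { (void)ctx; return true; }
```

With stuck_clock() the time check never fires, so only the count limit ends the loop; with ticking_clock() and a generous count limit the time check ends it first. That matches the reasoning in the thread for keeping both limits.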
* Re: stop_machine lockup issue in 3.9.y.
  2013-06-06 20:55 ` [ath9k-devel] " Tejun Heo
@ 2013-06-06 21:15 ` Ben Greear
  -1 siblings, 0 replies; 51+ messages in thread
From: Ben Greear @ 2013-06-06 21:15 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Eric Dumazet, Rusty Russell, Joe Lawrence, Linux Kernel Mailing List, stable, Luis R. Rodriguez, Jouni Malinen, Vasanthakumar Thiagarajan, Senthil Balasubramanian, linux-wireless, ath9k-devel, Thomas Gleixner, Ingo Molnar

On 06/06/2013 01:55 PM, Tejun Heo wrote:
> Hello, Ben.
>
> On Wed, Jun 05, 2013 at 08:41:01PM -0700, Ben Greear wrote:
>> On 06/05/2013 08:26 PM, Eric Dumazet wrote:
>>> On Wed, 2013-06-05 at 20:14 -0700, Tejun Heo wrote:
>>>> Ah, so, that's why it's showing up now. We probably have had the same
>>>> issue all along but it used to be masked by the softirq limiting. Do
>>>> you care to revive the 10 iterations limit so that it's limited by
>>>> both the count and timing? We do wanna find out why softirq is
>>>> spinning indefinitely tho.
>>>
>>> Yes, no problem, I can do that.
>>
>> Limiting it to 5000 fixes my problem, so if you wanted it larger than 10, that would
>> be fine by me.
>
> First of all, kudos for tracking the issue down. While the removal of
> looping limit in softirq handling was the direct cause for making the
> problem visible, it's very bothering that we have softirq runaway.
> Finding out the perpetrator shouldn't be hard. Something like the
> following should work (untested). Once we know which softirq (prolly
> the network one), we can dig deeper.

The patch below assumes my fix is not in the code, right?

I'll work on this, but it will probably be next week before
I have time...gotta catch up on some other things first.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com

^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: stop_machine lockup issue in 3.9.y.
  2013-06-06 21:15 ` [ath9k-devel] " Ben Greear
@ 2013-06-06 21:17 ` Tejun Heo
  -1 siblings, 0 replies; 51+ messages in thread
From: Tejun Heo @ 2013-06-06 21:17 UTC (permalink / raw)
  To: Ben Greear
  Cc: Eric Dumazet, Rusty Russell, Joe Lawrence, Linux Kernel Mailing List, stable, Luis R. Rodriguez, Jouni Malinen, Vasanthakumar Thiagarajan, Senthil Balasubramanian, linux-wireless, ath9k-devel, Thomas Gleixner, Ingo Molnar

On Thu, Jun 06, 2013 at 02:15:40PM -0700, Ben Greear wrote:
> >First of all, kudos for tracking the issue down. While the removal of
> >looping limit in softirq handling was the direct cause for making the
> >problem visible, it's very bothering that we have softirq runaway.
> >Finding out the perpetrator shouldn't be hard. Something like the
> >following should work (untested). Once we know which softirq (prolly
> >the network one), we can dig deeper.
>
> The patch below assumes my fix is not in the code, right?

Yeap.

> I'll work on this, but it will probably be next week before
> I have time...gotta catch up on some other things first.

Thanks a lot for hunting this down!

-- 
tejun

^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: Please add to stable: module: don't unlink the module until we've removed all exposure.
  2013-06-04 14:07 ` Joe Lawrence
  2013-06-04 16:50 ` Joe Lawrence
  2013-06-04 16:53 ` Ben Greear
@ 2013-06-05 3:29 ` Rusty Russell
  2 siblings, 0 replies; 51+ messages in thread
From: Rusty Russell @ 2013-06-05 3:29 UTC (permalink / raw)
  To: Joe Lawrence; +Cc: Ben Greear, Linux Kernel Mailing List, stable

Joe Lawrence <joe.lawrence@stratus.com> writes:
> On Tue, 04 Jun 2013 15:26:28 +0930
> Rusty Russell <rusty@rustcorp.com.au> wrote:
>
>> Do you have a backtrace of the 3.9.4 crash? You can add "CFLAGS_module.o
>> = -O0" to get a clearer backtrace if you want...
>
> Hi Rusty,
>
> See my 3.9 stack traces below, which may or may not be what Ben had
> been seeing. If you like, I can try a similar loop as the one you were
> testing in the other email.
>
> Regards,
>
> -- Joe
>
> *** First instance ***
>
> ------------[ cut here ]------------
> WARNING: at fs/sysfs/dir.c:536 sysfs_add_one+0xd4/0x100()
> Hardware name: ftServer 6400
> sysfs: cannot create duplicate filename '/module/mgag200'
> Modules linked in: enclosure(+) mgag200(+) ghash_clmulni_intel(+) pcspkr joydev osst vhost_net st tun macvtap macvlan uinput raid1 mpt2sas(OF) raid_class qla2xxx(OF) scsi_transport_fc scsi_transport_sas sd_mod(OF) usb_storage scsi_tgt scsi_hbas(OF) i2c_algo_bit drm_kms_helper ttm drm i2c_core
> Pid: 733, comm: systemd-udevd Tainted: GF O 3.9.0sra_new+ #1
> Call Trace:
> [<ffffffff81061a9f>] warn_slowpath_common+0x7f/0xc0
> [<ffffffff81061b96>] warn_slowpath_fmt+0x46/0x50
> [<ffffffff81319875>] ? strlcat+0x65/0x90
> [<ffffffff81222914>] sysfs_add_one+0xd4/0x100
> [<ffffffff81222b38>] create_dir+0x78/0xd0
> [<ffffffff81222e86>] sysfs_create_dir+0x86/0xe0
> [<ffffffff81313588>] kobject_add_internal+0xa8/0x270
> [<ffffffff81313ab3>] kobject_init_and_add+0x63/0x90
> [<ffffffff810ca34d>] load_module+0x12dd/0x2890
> [<ffffffff81331670>] ? ddebug_proc_open+0xc0/0xc0
> [<ffffffff810cb9ea>] sys_init_module+0xea/0x140
> [<ffffffff81680d19>] system_call_fastpath+0x16/0x1b
> ---[ end trace 247a5f5f82ef192d ]---
> ------------[ cut here ]------------
> WARNING: at lib/kobject.c:196 kobject_add_internal+0x204/0x270()
> Hardware name: ftServer 6400
> kobject_add_internal failed for mgag200 with -EEXIST, don't try to register things with the same name in the same directory.
> Modules linked in:0m] Started Conf mdio(+) coretemp(+) crc32c_intel(+) dca(+) enclosure(+) mgag200(+) ghash_clmulni_intel pcspkr joydev osst vhost_net st tun macvtap macvlan uinput raid1 mpt2sas(OF) raid_class qla2xxx(OF) scsi_transport_fc scsi_transport_sas sd_mod(OF) usb_storage scsi_tgt scsi_hbas(OF) i2c_algo_bit drm_kms_helper ttm drm i2c_core
> Pid: 733, comm: systemd-udevd Tainted: GF W O 3.9.0sra_new+ #1
>
> Call Trace:
> [<ffffffff81061a9f>] warn_slowpath_common+0x7f/0xc0
> [<ffffffff81061b96>] warn_slowpath_fmt+0x46/0x50
> [<ffffffff813136e4>] kobject_add_internal+0x204/0x270
> [<ffffffff81313ab3>] kobject_init_and_add+0x63/0x90
> [<ffffffff810ca34d>] load_module+0x12dd/0x2890
> [<ffffffff81331670>] ? ddebug_proc_open+0xc0/0xc0
> [<ffffffff810cb9ea>] sys_init_module+0xea/0x140
> [<ffffffff81680d19>] system_call_fastpath+0x16/0x1b
> ---[ end trace 247a5f5f82ef192e ]---

That's a WARN_ON. It's harmless, but indeed is fixed by the patch
suggested.

> *** Second instance ***
>
> mgag200: module is already loaded
> igb: Intel(R) Gigabit Ethernet Network Driver - version 4.1.2-k
> BUG: unable to handle kernel paging request at ffffffffa01d060c
> IP: [<ffffffff81313276>] kobject_del+0x16/0x40
> PGD 1c0f067 PUD 1c10063 PMD 851372067 PTE 0
> Oops: 0002 [#1] SMP
> Modules linked in: ixgbe(OF+) igb(OF+) mgag200(+) ptp pps_core mdio dca coretemp crc32c_intel pcspkr ghash_clmulni_intel vhost_net tun macvtap macvlan uinput raid1 usb_storage mpt2sas(OF) raid_class qla2xxx(OF) scsi_transport_fc scsi_transport_sas sd_mod(OF) scsi_tgt scsi_hbas(OF) i2c_algo_bit drm_kms_helper ttm drm i2c_core
> CPU 28
> Pid: 719, comm: systemd-udevd Tainted: GF O 3.9.0sra_new+ #1 Stratus ftServer 6400/G7LAZ
> RIP: 0010:[<ffffffff81313276>] [<ffffffff81313276>] kobject_del+0x16/0x40
> RSP: 0018:ffff88103814fd08 EFLAGS: 00010292
> RAX: 0000000000000200 RBX: ffffffffa01d05d0 RCX: 0000000100250004
> RDX: ffff88103814ffd8 RSI: 0000000000250004 RDI: 0000000000000246
> RBP: ffff88103814fd18 R08: ffff88103814fa80 R09: 0000000000000000
> R10: ffff88085f821d40 R11: 0000000000000025 R12: ffffffff81c412c0
> R13: ffff880852c8cfc0 R14: ffffffffa01e0580 R15: ffffffffa01e0598
> FS: 00007fc98fe6c840(0000) GS:ffff88107fd80000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: ffffffffa01d060c CR3: 0000001038137000 CR4: 00000000000407e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process systemd-udevd (pid: 719, threadinfo ffff88103814e000, task ffff8810380d98a0)
> Stack:
> ffff88103814fd18 ffffffffa01d05d0 ffff88103814fd48 ffffffff81313302
> ffff88103814fd78 ffffffffa01d05d0 ffffffffa01d05d0 ffffffffffffffea
> ffff88103814fd68 ffffffff8131348b 00000000ffff8000 ffff88103814fee8
> Call Trace:
> [<ffffffff81313302>] kobject_cleanup+0x62/0x1b0
> [<ffffffff8131348b>] kobject_put+0x2b/0x60
> [<ffffffff810cb8f1>] load_module+0x2881/0x2890
> [<ffffffff81331670>] ? ddebug_proc_open+0xc0/0xc0
> [<ffffffff810cb9ea>] sys_init_module+0xea/0x140
> [<ffffffff81680d19>] system_call_fastpath+0x16/0x1b
> Code: 02 00 00 48 8b 5d f0 4c 8b 65 f8 c9 c3 0f 1f 84 00 00 00 00 00 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 85 ff 74 22 e8 7a fc f0 ff <80> 63 3c fd 48 89 df e8 6e ff ff ff 48 8b 7b 18 e8 d5 01 00 00
> RIP [<ffffffff81313276>] kobject_del+0x16/0x40
> RSP <ffff88103814fd08>
> CR2: ffffffffa01d060c
> ---[ end trace e320c2319820c81a ]---
> Kernel panic - not syncing: Fatal exception

This is the interesting one!

This is the kind of crash which Linus' a49b7e82 fixed, but 3.9 contains
that already.

For some reason, mgag200 is being unloaded (or failing to load) while
another one is being loaded. The new module does this:

	kobj = kset_find_obj(module_kset, mod->name);
	if (kobj) {
		printk(KERN_ERR "%s: module is already loaded\n", mod->name);
		kobject_put(kobj);

The old one does this:

	kobject_put(&mod->mkobj.kobj); /* in mod_sysfs_fini */
	...
	module_free(mod, mod->module_core); /* in free_module */

And it doesn't wait for the kobj count to hit zero, so the other
kobject_put() is on a kobject which is freed...

Normally the right answer would be to offload the freeing of the module
memory to module_ktype's release function.

The backported dont-remove-from-list fix prevents this race from
happening, since the duplicate module is now caught earlier. It does
not prevent the same problem with other kobj references; for example,
what about sysfs accesses to a module which fails init?

So the simplest thing is to backport the fix, as suggested. But now I'm
trying to trigger the same bug in other ways.

Thanks,
Rusty.

^ permalink raw reply [flat|nested] 51+ messages in thread
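The remedy Rusty describes as "normally the right answer", letting the release function free the memory once the last reference is dropped, can be illustrated with a small userspace sketch in C. It is not kernel code: struct obj, struct fake_module and every function below are hypothetical stand-ins for kobject, struct module and the kobject_get()/kobject_put() pair, and the counter is a plain int, so the sketch is single-threaded only (the kernel uses atomic reference counts).

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal stand-in for a kobject: a counter plus a release callback. */
struct obj {
	int refcount;                       /* not atomic, single-threaded sketch */
	void (*release)(struct obj *self);  /* runs when the last reference goes */
};

/* Hypothetical stand-in for a module whose memory embeds its kobject. */
struct fake_module {
	struct obj kobj;  /* must stay the first member for the cast below */
	int payload;
};

int release_count;  /* how many times the backing memory has been freed */

void obj_get(struct obj *o)
{
	o->refcount++;
}

void obj_put(struct obj *o)
{
	assert(o->refcount > 0);
	if (--o->refcount == 0)
		o->release(o);  /* free only here, never directly from the owner */
}

void fake_module_release(struct obj *o)
{
	struct fake_module *m = (struct fake_module *)o;

	release_count++;
	free(m);
}

struct fake_module *fake_module_new(void)
{
	struct fake_module *m = calloc(1, sizeof(*m));

	if (m == NULL)
		return NULL;
	m->kobj.refcount = 1;  /* the owner's reference */
	m->kobj.release = fake_module_release;
	return m;
}
```

In this sketch a second party that looked the object up (one obj_get()) keeps the memory alive after the owner's obj_put(); the free happens only on the final obj_put(). The crash in the trace above corresponds to the opposite ordering, where the owner frees the memory without waiting for the count to reach zero.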
* Re: Please add to stable: module: don't unlink the module until we've removed all exposure.
  2013-06-03 14:17 ` Joe Lawrence
  2013-06-03 15:59 ` Ben Greear
@ 2013-06-05 5:07 ` Greg KH
  2013-06-05 7:13 ` Rusty Russell
  1 sibling, 1 reply; 51+ messages in thread
From: Greg KH @ 2013-06-05 5:07 UTC (permalink / raw)
  To: Joe Lawrence; +Cc: Rusty Russell, Ben Greear, Linux Kernel Mailing List, stable

On Mon, Jun 03, 2013 at 10:17:17AM -0400, Joe Lawrence wrote:
> [Cc: stable@vger.kernel.org]
>
> Third time is a charm? The stable address was incorrect from the first
> msg in this thread, but the relevant bits remain quoted below...

Really? I'm totally confused...

> On Mon, 3 Jun 2013, Joe Lawrence wrote:
>
> > [fixing Cc: stable@kernel.org address]
> >
> > On Sun, 2 Jun 2013, Joe Lawrence wrote:
> >
> > > On Sun, 2 Jun 2013, Rusty Russell wrote:
> > >
> > > > Ben Greear <greearb@candelatech.com> writes:
> > > >
> > > > > It turns out, the bug I spent yesterday chasing in various 3.9 kernels is apparently
> > > > > fixed by the commit in the title (c9c390bb5535380d40614571894ef0c00bc026ff).
> > > >
> > > > Apparently being the operative word.
> > > >
> > > > This commit avoids the entire "module insert failed due to sysfs race"
> > > > path in the common case, it doesn't fix any actual problem.
> > > >
> > > > I think the real commit you want is Linus' kobject fix
> > > > a49b7e82cab0f9b41f483359be83f44fbb6b4979 "kobject: fix kset_find_obj()
> > > > race with concurrent last kobject_put()".
> > > >
> > > > Or is that already in stable?
> > >
> > > Hi Rusty,
> > >
> > > I had pointed Ben (offlist) to that bugzilla entry without realizing
> > > there were other earlier related fixes in this space. Re-viewing bz-
> > > 58011, it looks like it was opened against 3.8.12, while Ben and myself
> > > had encountered module loading problems in versions 3.9 and
> > > 3.9.[1-3]. I can update the bugzilla entry to add a comment noting commit
> > > a49b7e82 "kobject: fix kset_find_obj() race with concurrent last
> > > kobject_put()".
> > >
> > > That said, it doesn't appear that commit 944a1fa "module: don't unlink the
> > > module until we've removed all exposure" has not made it into any stable
> > > kernel. On my system, applying this on top of 3.9 resolved a module
> > > unload/load race that would occasionally occur on boot (two video adapters
> > > of the same make, the module unloads for whatever reason and I see "module
> > > is already loaded" and "sysfs: cannot create duplicate filename
> > > '/module/mgag200'" messages every 5-10% instances.) I have logs if you
> > > were interested in these warnings/crashes.
> > >
> > > Hope this clarifies things.

After this whole thread, what should I be doing for the 3.9-stable tree?
Add commit 944a1fa? Or something else?

confused,

greg k-h

^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: Please add to stable: module: don't unlink the module until we've removed all exposure.
  2013-06-05 5:07 ` Greg KH
@ 2013-06-05 7:13 ` Rusty Russell
  0 siblings, 0 replies; 51+ messages in thread
From: Rusty Russell @ 2013-06-05 7:13 UTC (permalink / raw)
  To: Greg KH, Joe Lawrence; +Cc: Ben Greear, Linux Kernel Mailing List, stable

Greg KH <gregkh@linuxfoundation.org> writes:
> On Mon, Jun 03, 2013 at 10:17:17AM -0400, Joe Lawrence wrote:
>> [Cc: stable@vger.kernel.org]
>>
>> Third time is a charm? The stable address was incorrect from the first
>> msg in this thread, but the relevant bits remain quoted below...
>
> Really? I'm totally confused...
>
>> On Mon, 3 Jun 2013, Joe Lawrence wrote:
>>
>> > [fixing Cc: stable@kernel.org address]
>> >
>> > On Sun, 2 Jun 2013, Joe Lawrence wrote:
>> >
>> > > On Sun, 2 Jun 2013, Rusty Russell wrote:
>> > >
>> > > > Ben Greear <greearb@candelatech.com> writes:
>> > > >
>> > > > > It turns out, the bug I spent yesterday chasing in various 3.9 kernels is apparently
>> > > > > fixed by the commit in the title (c9c390bb5535380d40614571894ef0c00bc026ff).
>> > > >
>> > > > Apparently being the operative word.
>> > > >
>> > > > This commit avoids the entire "module insert failed due to sysfs race"
>> > > > path in the common case, it doesn't fix any actual problem.
>> > > >
>> > > > I think the real commit you want is Linus' kobject fix
>> > > > a49b7e82cab0f9b41f483359be83f44fbb6b4979 "kobject: fix kset_find_obj()
>> > > > race with concurrent last kobject_put()".
>> > > >
>> > > > Or is that already in stable?
>> > >
>> > > Hi Rusty,
>> > >
>> > > I had pointed Ben (offlist) to that bugzilla entry without realizing
>> > > there were other earlier related fixes in this space. Re-viewing bz-
>> > > 58011, it looks like it was opened against 3.8.12, while Ben and myself
>> > > had encountered module loading problems in versions 3.9 and
>> > > 3.9.[1-3]. I can update the bugzilla entry to add a comment noting commit
>> > > a49b7e82 "kobject: fix kset_find_obj() race with concurrent last
>> > > kobject_put()".
>> > >
>> > > That said, it doesn't appear that commit 944a1fa "module: don't unlink the
>> > > module until we've removed all exposure" has not made it into any stable
>> > > kernel. On my system, applying this on top of 3.9 resolved a module
>> > > unload/load race that would occasionally occur on boot (two video adapters
>> > > of the same make, the module unloads for whatever reason and I see "module
>> > > is already loaded" and "sysfs: cannot create duplicate filename
>> > > '/module/mgag200'" messages every 5-10% instances.) I have logs if you
>> > > were interested in these warnings/crashes.
>> > >
>> > > Hope this clarifies things.
>
> After this whole thread, what should I be doing for the 3.9-stable tree?
> Add commit 944a1fa? Or something else?

Yes. It does fix an Oops unrelated to what it was intended to fix, so
it's the lowest pain path.

There may be other ways of triggering a similar oops, but so far the
obvious attempt has failed (holding a sysfs file open while a module
fails its init). I might patch it anyway, because it makes me
uncomfortable, but that's separate.

Thanks,
Rusty.

^ permalink raw reply [flat|nested] 51+ messages in thread