linux-kernel.vger.kernel.org archive mirror
* kernel BUG at kernel/workqueue.c:291
@ 2009-02-27 19:48 Carsten Aulbert
  2009-03-02 10:51 ` Carsten Aulbert
  0 siblings, 1 reply; 8+ messages in thread
From: Carsten Aulbert @ 2009-02-27 19:48 UTC (permalink / raw)
  To: linux-kernel

Hi all,

a few of our nodes suddenly showed this bug, possibly triggered by some
user jobs (we still need to look for correlations):

[...]
[228496.772509] nfs: server n0051 not responding, timed out
[228704.928037] ------------[ cut here ]------------
[228704.928224] kernel BUG at kernel/workqueue.c:291!
[228704.928404] invalid opcode: 0000 [1] SMP
[228704.928647] CPU 0
[228704.928852] Modules linked in: lm92 w83793 w83781d hwmon_vid hwmon nfs nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs autofs4 netconsole configfs ipmi_si ipmi_devintf ipmi_watchdog ipmi_poweroff ipmi_msghandler e1000e i2c_i801 8250_pnp 8250 serial_core i2c_core
[228704.930002] Pid: 1609, comm: rpciod/0 Not tainted 2.6.27.14-nodes #1
[228704.930002] RIP: 0010:[<ffffffff8023c6db>]  [<ffffffff8023c6db>] run_workqueue+0x6f/0x102
[228704.930002] RSP: 0018:ffff880214bcdec0  EFLAGS: 00010207
[228704.930002] RAX: 0000000000000000 RBX: ffff880214b82f40 RCX: ffff880215444418
[228704.930002] RDX: ffff880187d07d58 RSI: ffff880214bcdee0 RDI: ffff880215444410
[228704.930002] RBP: ffffffffa0077186 R08: ffff880214bcc000 R09: ffff88021491f808
[228704.930002] R10: 0000000000000246 R11: ffff880187d07d50 R12: ffff880214ad7d28
[228704.930002] R13: ffffffff806065a0 R14: ffffffff80607280 R15: 0000000000000000
[228704.930002] FS:  0000000000000000(0000) GS:ffffffff80636040(0000) knlGS:0000000000000000
[228704.930002] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[228704.930002] CR2: 00007fc056333fd8 CR3: 00000001ed270000 CR4: 00000000000006e0
[228704.930002] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[228704.930002] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[228704.930002] Process rpciod/0 (pid: 1609, threadinfo ffff880214bcc000, task ffff880217b08780)
[228704.930002] Stack:  ffff880214b82f40 ffff880214b82f40 ffff880214b82f58 ffffffff8023cff3
[228704.930002]  0000000000000000 ffff880217b08780 ffffffff8023f7d7 ffff880214bcdef8
[228704.930002]  ffff880214bcdef8 ffffffff806065a0 ffffffff80607280 ffff880214b82f40
[228704.930002] Call Trace:
[228704.930002]  [<ffffffff8023cff3>] ? worker_thread+0x90/0x9b
[228704.930002]  [<ffffffff8023f7d7>] ? autoremove_wake_function+0x0/0x2e
[228704.930002]  [<ffffffff8023cf63>] ? worker_thread+0x0/0x9b
[228704.930002]  [<ffffffff8023f6c2>] ? kthread+0x47/0x75
[228704.930002]  [<ffffffff8022afa8>] ? schedule_tail+0x27/0x5f
[228704.930002]  [<ffffffff8020ccb9>] ? child_rip+0xa/0x11
[228704.930002]  [<ffffffff8023f67b>] ? kthread+0x0/0x75
[228704.930002]  [<ffffffff8020ccaf>] ? child_rip+0x0/0x11
[228704.930002]
[228704.930002]
[228704.930002] Code: 6f 18 48 89 7b 30 48 8b 11 48 8b 41 08 48 89 42 08 48 89 10 48 89 49 08 48 89 09 fe 03 fb 48 8b 41 f8 48 83 e0 fc 48 39 d8 74 04 <0f> 0b eb fe f0 80 61 f8 fe ff d5 65 48 8b 04 25 10 00 00 00 8b
[228704.930002] RIP  [<ffffffff8023c6db>] run_workqueue+0x6f/0x102
[228704.930002]  RSP <ffff880214bcdec0>
[228704.941003] ---[ end trace deef6e5387b5a584 ]---
[229698.790015] nfs: server n0051 not responding, timed out
[229878.790016] nfs: server n0051 not responding, timed out
[...]

Is this a known bug? Google did not find anything relevant on the first try.
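
(For context: in 2.6.27 the BUG at kernel/workqueue.c:291 appears to correspond
to the sanity check in run_workqueue() that verifies a dequeued work item still
points back at the per-CPU workqueue processing it. A simplified sketch of the
surrounding code, not the exact source:

static void run_workqueue(struct cpu_workqueue_struct *cwq)
{
	spin_lock_irq(&cwq->lock);
	while (!list_empty(&cwq->worklist)) {
		struct work_struct *work = list_entry(cwq->worklist.next,
						struct work_struct, entry);
		work_func_t f = work->func;

		cwq->current_work = work;
		list_del_init(cwq->worklist.next);
		spin_unlock_irq(&cwq->lock);

		/* The work item's back-pointer must name this workqueue;
		 * tripping this usually means the work_struct was corrupted
		 * or queued/re-queued concurrently. */
		BUG_ON(get_wq_data(work) != cwq);
		work_clear_pending(work);
		f(work);

		spin_lock_irq(&cwq->lock);
		cwq->current_work = NULL;
	}
	spin_unlock_irq(&cwq->lock);
}

So the oops suggests rpciod's worklist contained an entry whose owning
workqueue pointer no longer matched the queue processing it.)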

Cheers

Carsten

PS: Please Cc me


* Re: kernel BUG at kernel/workqueue.c:291
  2009-02-27 19:48 kernel BUG at kernel/workqueue.c:291 Carsten Aulbert
@ 2009-03-02 10:51 ` Carsten Aulbert
  2009-03-03  7:26   ` Andrew Morton
  0 siblings, 1 reply; 8+ messages in thread
From: Carsten Aulbert @ 2009-03-02 10:51 UTC (permalink / raw)
  To: linux-kernel

Hi again,

in the meantime 43 of our nodes were struck by this error. It seems
that the jobs of a certain user can trigger this bug, but I have no
clue how to reproduce it manually.

My questions:
Is this a known bug in 2.6.27.14 (we can upgrade to .19 if necessary)?
As this file was not modified recently, I suspect there is no ready
fix for it.

Do you need any more info about our systems (Intel X3220 based Supermicro
systems), the kernel config (deadline scheduler in use, ...) or anything
else?

Carsten Aulbert schrieb:
> [228704.928037] ------------[ cut here ]------------
> [228704.928224] kernel BUG at kernel/workqueue.c:291!
> [228704.928404] invalid opcode: 0000 [1] SMP
> [228704.928647] CPU 0
> [228704.928852] Modules linked in: lm92 w83793 w83781d hwmon_vid hwmon nfs nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs autofs4 netconsole configfs ipmi_si ipmi_devintf ipmi_watchdog ipmi_poweroff ipmi_msghandler e1000e i2c_i801 8250_pnp 8250 serial_core i2c_core
> [228704.930002] Pid: 1609, comm: rpciod/0 Not tainted 2.6.27.14-nodes #1
> [228704.930002] RIP: 0010:[<ffffffff8023c6db>]  [<ffffffff8023c6db>] run_workqueue+0x6f/0x102
> [228704.930002] RSP: 0018:ffff880214bcdec0  EFLAGS: 00010207
> [228704.930002] RAX: 0000000000000000 RBX: ffff880214b82f40 RCX: ffff880215444418
> [228704.930002] RDX: ffff880187d07d58 RSI: ffff880214bcdee0 RDI: ffff880215444410
> [228704.930002] RBP: ffffffffa0077186 R08: ffff880214bcc000 R09: ffff88021491f808
> [228704.930002] R10: 0000000000000246 R11: ffff880187d07d50 R12: ffff880214ad7d28
> [228704.930002] R13: ffffffff806065a0 R14: ffffffff80607280 R15: 0000000000000000
> [228704.930002] FS:  0000000000000000(0000) GS:ffffffff80636040(0000) knlGS:0000000000000000
> [228704.930002] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> [228704.930002] CR2: 00007fc056333fd8 CR3: 00000001ed270000 CR4: 00000000000006e0
> [228704.930002] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [228704.930002] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [228704.930002] Process rpciod/0 (pid: 1609, threadinfo ffff880214bcc000, task ffff880217b08780)
> [228704.930002] Stack:  ffff880214b82f40 ffff880214b82f40 ffff880214b82f58 ffffffff8023cff3
> [228704.930002]  0000000000000000 ffff880217b08780 ffffffff8023f7d7 ffff880214bcdef8
> [228704.930002]  ffff880214bcdef8 ffffffff806065a0 ffffffff80607280 ffff880214b82f40
> [228704.930002] Call Trace:
> [228704.930002]  [<ffffffff8023cff3>] ? worker_thread+0x90/0x9b
> [228704.930002]  [<ffffffff8023f7d7>] ? autoremove_wake_function+0x0/0x2e
> [228704.930002]  [<ffffffff8023cf63>] ? worker_thread+0x0/0x9b
> [228704.930002]  [<ffffffff8023f6c2>] ? kthread+0x47/0x75
> [228704.930002]  [<ffffffff8022afa8>] ? schedule_tail+0x27/0x5f
> [228704.930002]  [<ffffffff8020ccb9>] ? child_rip+0xa/0x11
> [228704.930002]  [<ffffffff8023f67b>] ? kthread+0x0/0x75
> [228704.930002]  [<ffffffff8020ccaf>] ? child_rip+0x0/0x11
> [228704.930002]
> [228704.930002]
> [228704.930002] Code: 6f 18 48 89 7b 30 48 8b 11 48 8b 41 08 48 89 42 08 48 89 10 48 89 49 08 48 89 09 fe 03 fb 48 8b 41 f8 48 83 e0 fc 48 39 d8 74 04 <0f> 0b eb fe f0 80 61 f8 fe ff d5 65 48 8b 04 25 10 00 00 00 8b
> [228704.930002] RIP  [<ffffffff8023c6db>] run_workqueue+0x6f/0x102
> [228704.930002]  RSP <ffff880214bcdec0>
> [228704.941003] ---[ end trace deef6e5387b5a584 ]---

Thanks for any input; right now I'm quite helpless...

Cheers

Carsten


* Re: kernel BUG at kernel/workqueue.c:291
  2009-03-02 10:51 ` Carsten Aulbert
@ 2009-03-03  7:26   ` Andrew Morton
  2009-03-03  7:36     ` Carsten Aulbert
  2009-03-03 15:16     ` Trond Myklebust
  0 siblings, 2 replies; 8+ messages in thread
From: Andrew Morton @ 2009-03-03  7:26 UTC (permalink / raw)
  To: Carsten Aulbert; +Cc: linux-kernel, linux-nfs

On Mon, 02 Mar 2009 11:51:48 +0100 Carsten Aulbert <carsten.aulbert@aei.mpg.de> wrote:

> Hi again,
> 
> in the meantime 43 of our nodes were struck by this error. It seems
> that the jobs of a certain user can trigger this bug, but I have no
> clue how to reproduce it manually.

That's a lot of nodes.

> My questions:
> Is this a known bug in 2.6.27.14 (we can upgrade to .19 if necessary)?
> As this file was not modified recently, I suspect there is no ready
> fix for it.
> 
> Do you need any more info about our systems (Intel X3220 based Supermicro
> systems), the kernel config (deadline scheduler in use, ...) or anything
> else?

Let's cc the NFS developers, see if this rpciod crash is familiar to them?

> Carsten Aulbert schrieb:
> > [228704.928037] ------------[ cut here ]------------
> > [228704.928224] kernel BUG at kernel/workqueue.c:291!
> > [228704.928404] invalid opcode: 0000 [1] SMP
> > [228704.928647] CPU 0
> > [228704.928852] Modules linked in: lm92 w83793 w83781d hwmon_vid hwmon nfs nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs autofs4 netconsole configfs ipmi_si ipmi_devintf ipmi_watchdog ipmi_poweroff ipmi_msghandler e1000e i2c_i801 8250_pnp 8250 serial_core i2c_core
> > [228704.930002] Pid: 1609, comm: rpciod/0 Not tainted 2.6.27.14-nodes #1
> > [228704.930002] RIP: 0010:[<ffffffff8023c6db>]  [<ffffffff8023c6db>] run_workqueue+0x6f/0x102
> > [228704.930002] RSP: 0018:ffff880214bcdec0  EFLAGS: 00010207
> > [228704.930002] RAX: 0000000000000000 RBX: ffff880214b82f40 RCX: ffff880215444418
> > [228704.930002] RDX: ffff880187d07d58 RSI: ffff880214bcdee0 RDI: ffff880215444410
> > [228704.930002] RBP: ffffffffa0077186 R08: ffff880214bcc000 R09: ffff88021491f808
> > [228704.930002] R10: 0000000000000246 R11: ffff880187d07d50 R12: ffff880214ad7d28
> > [228704.930002] R13: ffffffff806065a0 R14: ffffffff80607280 R15: 0000000000000000
> > [228704.930002] FS:  0000000000000000(0000) GS:ffffffff80636040(0000) knlGS:0000000000000000
> > [228704.930002] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> > [228704.930002] CR2: 00007fc056333fd8 CR3: 00000001ed270000 CR4: 00000000000006e0
> > [228704.930002] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [228704.930002] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > [228704.930002] Process rpciod/0 (pid: 1609, threadinfo ffff880214bcc000, task ffff880217b08780)
> > [228704.930002] Stack:  ffff880214b82f40 ffff880214b82f40 ffff880214b82f58 ffffffff8023cff3
> > [228704.930002]  0000000000000000 ffff880217b08780 ffffffff8023f7d7 ffff880214bcdef8
> > [228704.930002]  ffff880214bcdef8 ffffffff806065a0 ffffffff80607280 ffff880214b82f40
> > [228704.930002] Call Trace:
> > [228704.930002]  [<ffffffff8023cff3>] ? worker_thread+0x90/0x9b
> > [228704.930002]  [<ffffffff8023f7d7>] ? autoremove_wake_function+0x0/0x2e
> > [228704.930002]  [<ffffffff8023cf63>] ? worker_thread+0x0/0x9b
> > [228704.930002]  [<ffffffff8023f6c2>] ? kthread+0x47/0x75
> > [228704.930002]  [<ffffffff8022afa8>] ? schedule_tail+0x27/0x5f
> > [228704.930002]  [<ffffffff8020ccb9>] ? child_rip+0xa/0x11
> > [228704.930002]  [<ffffffff8023f67b>] ? kthread+0x0/0x75
> > [228704.930002]  [<ffffffff8020ccaf>] ? child_rip+0x0/0x11
> > [228704.930002]
> > [228704.930002]
> > [228704.930002] Code: 6f 18 48 89 7b 30 48 8b 11 48 8b 41 08 48 89 42 08 48 89 10 48 89 49 08 48 89 09 fe 03 fb 48 8b 41 f8 48 83 e0 fc 48 39 d8 74 04 <0f> 0b eb fe f0 80 61 f8 fe ff d5 65 48 8b 04 25 10 00 00 00 8b
> > [228704.930002] RIP  [<ffffffff8023c6db>] run_workqueue+0x6f/0x102
> > [228704.930002]  RSP <ffff880214bcdec0>
> > [228704.941003] ---[ end trace deef6e5387b5a584 ]---
> 
> Thanks for any input; right now I'm quite helpless...



* Re: kernel BUG at kernel/workqueue.c:291
  2009-03-03  7:26   ` Andrew Morton
@ 2009-03-03  7:36     ` Carsten Aulbert
  2009-03-03 15:16     ` Trond Myklebust
  1 sibling, 0 replies; 8+ messages in thread
From: Carsten Aulbert @ 2009-03-03  7:36 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-nfs

Hi Andrew,

Andrew Morton schrieb:
>> in the meantime 43 of our nodes were struck by this error. It seems
>> that the jobs of a certain user can trigger this bug, but I have no
>> clue how to reproduce it manually.
> 
> That's a lot of nodes.
Quite, at least some percentage of the whole system.
> 
> Let's cc the NFS developers, see if this rpciod crash is familiar to them?

Good idea, I should have done that myself - sorry

I think we were able to pinpoint at least one user's jobs that "generate"
this, but I need to talk to him about what access patterns are used via NFS here.

The systems are running Debian Etch:

dpkg -l | awk '/(nfs|portmap)/ {print $2 "\t\t" $3}'
libnfsidmap2            0.18-0
mountnfs                1.1.3-2
nfs-common              1.0.10-6+etch.1
nfs-kernel-server               1.0.10-6+etch.1
portmap         5-26


If you need more, please let me know! So far the machines are 'on hold',
i.e. we have not yet rebooted them so that we can still find out a little
bit more. If you (or anyone) think we can reboot them and put them back into
our scheduling queue, please let me know; the users are waiting for more cycles.

Thanks a lot

Carsten


* Re: kernel BUG at kernel/workqueue.c:291
  2009-03-03  7:26   ` Andrew Morton
  2009-03-03  7:36     ` Carsten Aulbert
@ 2009-03-03 15:16     ` Trond Myklebust
  2009-03-03 15:23       ` Carsten Aulbert
  1 sibling, 1 reply; 8+ messages in thread
From: Trond Myklebust @ 2009-03-03 15:16 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Carsten Aulbert, linux-kernel, linux-nfs

On Mon, 2009-03-02 at 23:26 -0800, Andrew Morton wrote:
> On Mon, 02 Mar 2009 11:51:48 +0100 Carsten Aulbert <carsten.aulbert@aei.mpg.de> wrote:
> 
> > Hi again,
> > 
> > in the meantime 43 of our nodes were struck by this error. It seems
> > that the jobs of a certain user can trigger this bug, but I have no
> > clue how to reproduce it manually.
> 
> That's a lot of nodes.
> 
> > My questions:
> > Is this a known bug in 2.6.27.14 (we can upgrade to .19 if necessary)?
> > As this file was not modified recently, I suspect there is no ready
> > fix for it.
> > 
> > Do you need any more info about our systems (Intel X3220 based Supermicro
> > systems), the kernel config (deadline scheduler in use, ...) or anything
> > else?
> 
> Let's cc the NFS developers, see if this rpciod crash is familiar to them?

Nope. I've never seen it before.

struct rpc_task does admittedly share storage for the work queue and the
rpc wait queue links, but if that were to be causing the reported
corruption, then it would mean that an rpc_task is simultaneously on a
wait queue and trying to execute on a work queue. I have no history of
that ever having happened.
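
(For illustration, the shared storage referred to here is the union in struct
rpc_task; a simplified sketch along the lines of the 2.6.27-era
include/linux/sunrpc/sched.h, not the exact definition:

struct rpc_task {
	/* ... */
	/*
	 * An async rpc_task is either linked onto an rpc_wait_queue
	 * via u.tk_wait or queued for execution on rpciod's workqueue
	 * via u.tk_work; by design never both at once, which is why
	 * the two sets of links can share storage.
	 */
	union {
		struct work_struct	tk_work;	/* async task work queue */
		struct rpc_wait		tk_wait;	/* RPC wait */
	} u;
	/* ... */
};

If that invariant were violated, the wait-queue links could overwrite the
workqueue links while the task sits on the worklist, which would trip exactly
the kind of check seen in run_workqueue().)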

Trond

> > Carsten Aulbert schrieb:
> > > [228704.928037] ------------[ cut here ]------------
> > > [228704.928224] kernel BUG at kernel/workqueue.c:291!
> > > [228704.928404] invalid opcode: 0000 [1] SMP
> > > [228704.928647] CPU 0
> > > [228704.928852] Modules linked in: lm92 w83793 w83781d hwmon_vid hwmon nfs nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs autofs4 netconsole configfs ipmi_si ipmi_devintf ipmi_watchdog ipmi_poweroff ipmi_msghandler e1000e i2c_i801 8250_pnp 8250 serial_core i2c_core
> > > [228704.930002] Pid: 1609, comm: rpciod/0 Not tainted 2.6.27.14-nodes #1
> > > [228704.930002] RIP: 0010:[<ffffffff8023c6db>]  [<ffffffff8023c6db>] run_workqueue+0x6f/0x102
> > > [228704.930002] RSP: 0018:ffff880214bcdec0  EFLAGS: 00010207
> > > [228704.930002] RAX: 0000000000000000 RBX: ffff880214b82f40 RCX: ffff880215444418
> > > [228704.930002] RDX: ffff880187d07d58 RSI: ffff880214bcdee0 RDI: ffff880215444410
> > > [228704.930002] RBP: ffffffffa0077186 R08: ffff880214bcc000 R09: ffff88021491f808
> > > [228704.930002] R10: 0000000000000246 R11: ffff880187d07d50 R12: ffff880214ad7d28
> > > [228704.930002] R13: ffffffff806065a0 R14: ffffffff80607280 R15: 0000000000000000
> > > [228704.930002] FS:  0000000000000000(0000) GS:ffffffff80636040(0000) knlGS:0000000000000000
> > > [228704.930002] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> > > [228704.930002] CR2: 00007fc056333fd8 CR3: 00000001ed270000 CR4: 00000000000006e0
> > > [228704.930002] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > [228704.930002] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > > [228704.930002] Process rpciod/0 (pid: 1609, threadinfo ffff880214bcc000, task ffff880217b08780)
> > > [228704.930002] Stack:  ffff880214b82f40 ffff880214b82f40 ffff880214b82f58 ffffffff8023cff3
> > > [228704.930002]  0000000000000000 ffff880217b08780 ffffffff8023f7d7 ffff880214bcdef8
> > > [228704.930002]  ffff880214bcdef8 ffffffff806065a0 ffffffff80607280 ffff880214b82f40
> > > [228704.930002] Call Trace:
> > > [228704.930002]  [<ffffffff8023cff3>] ? worker_thread+0x90/0x9b
> > > [228704.930002]  [<ffffffff8023f7d7>] ? autoremove_wake_function+0x0/0x2e
> > > [228704.930002]  [<ffffffff8023cf63>] ? worker_thread+0x0/0x9b
> > > [228704.930002]  [<ffffffff8023f6c2>] ? kthread+0x47/0x75
> > > [228704.930002]  [<ffffffff8022afa8>] ? schedule_tail+0x27/0x5f
> > > [228704.930002]  [<ffffffff8020ccb9>] ? child_rip+0xa/0x11
> > > [228704.930002]  [<ffffffff8023f67b>] ? kthread+0x0/0x75
> > > [228704.930002]  [<ffffffff8020ccaf>] ? child_rip+0x0/0x11
> > > [228704.930002]
> > > [228704.930002]
> > > [228704.930002] Code: 6f 18 48 89 7b 30 48 8b 11 48 8b 41 08 48 89 42 08 48 89 10 48 89 49 08 48 89 09 fe 03 fb 48 8b 41 f8 48 83 e0 fc 48 39 d8 74 04 <0f> 0b eb fe f0 80 61 f8 fe ff d5 65 48 8b 04 25 10 00 00 00 8b
> > > [228704.930002] RIP  [<ffffffff8023c6db>] run_workqueue+0x6f/0x102
> > > [228704.930002]  RSP <ffff880214bcdec0>
> > > [228704.941003] ---[ end trace deef6e5387b5a584 ]---
> > 
> > Thanks for any input; right now I'm quite helpless...
> 



* Re: kernel BUG at kernel/workqueue.c:291
  2009-03-03 15:16     ` Trond Myklebust
@ 2009-03-03 15:23       ` Carsten Aulbert
  2009-03-03 20:41         ` Aaron Straus
  2009-03-03 21:21         ` Trond Myklebust
  0 siblings, 2 replies; 8+ messages in thread
From: Carsten Aulbert @ 2009-03-03 15:23 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Andrew Morton, linux-kernel, linux-nfs

Hi Trond,

Trond Myklebust schrieb:
> struct rpc_task does admittedly share storage for the work queue and the
> rpc wait queue links, but if that were to be causing the reported
> corruption, then it would mean that an rpc_task is simultaneously on a
> wait queue and trying to execute on a work queue. I have no history of
> that ever having happened.

Is there anything I can give you to help narrow this down somewhat? As
written above, I suspect a certain type of user job, but since literally
thousands of these jobs ran over the course of several days it might be
hard to trigger this reliably again.

Cheers

Carsten


* Re: kernel BUG at kernel/workqueue.c:291
  2009-03-03 15:23       ` Carsten Aulbert
@ 2009-03-03 20:41         ` Aaron Straus
  2009-03-03 21:21         ` Trond Myklebust
  1 sibling, 0 replies; 8+ messages in thread
From: Aaron Straus @ 2009-03-03 20:41 UTC (permalink / raw)
  To: Carsten Aulbert; +Cc: Trond Myklebust, Andrew Morton, linux-kernel, linux-nfs

Hi Carsten,

On Mar 03 04:23 PM, Carsten Aulbert wrote:
> > struct rpc_task does admittedly share storage for the work queue and the
> > rpc wait queue links, but if that were to be causing the reported
> > corruption, then it would mean that an rpc_task is simultaneously on a
> > wait queue and trying to execute on a work queue. I have no history of
> > that ever having happened.
> 
> Is there anything I can give you to help narrow this down somewhat? As
> written above, I suspect a certain type of user job, but since literally
> thousands of these jobs ran over the course of several days it might be
> hard to trigger this reliably again.

Out of curiosity, do you know (approximately) the last kernel you ran
that did not show the BUG?

I.e. we know 2.6.27.14 is bad; what was the last known good kernel?

Thanks!

				=a=


-- 
===================
Aaron Straus
aaron@merfinllc.com


* Re: kernel BUG at kernel/workqueue.c:291
  2009-03-03 15:23       ` Carsten Aulbert
  2009-03-03 20:41         ` Aaron Straus
@ 2009-03-03 21:21         ` Trond Myklebust
  1 sibling, 0 replies; 8+ messages in thread
From: Trond Myklebust @ 2009-03-03 21:21 UTC (permalink / raw)
  To: Carsten Aulbert; +Cc: Andrew Morton, linux-kernel, linux-nfs

On Tue, 2009-03-03 at 16:23 +0100, Carsten Aulbert wrote:
> Hi Trond,
> 
> Trond Myklebust schrieb:
> > struct rpc_task does admittedly share storage for the work queue and the
> > rpc wait queue links, but if that were to be causing the reported
> > corruption, then it would mean that an rpc_task is simultaneously on a
> > wait queue and trying to execute on a work queue. I have no history of
> > that ever having happened.
> 
> Is there anything I can give you to help narrow this down somewhat? As
> written above, I suspect a certain type of user job, but since literally
> thousands of these jobs ran over the course of several days it might be
> hard to trigger this reliably again.

Could you try triggering it after applying the following patch? I think
I've spotted a potential race condition.

Cheers
  Trond
---------------------------------------------------------------------------
From: Trond Myklebust <Trond.Myklebust@netapp.com>
SUNRPC: Tighten up the task locking rules in __rpc_execute()

We should probably not be testing any flags after we've cleared the
RPC_TASK_RUNNING flag, since rpc_make_runnable() is then free to assign the
rpc_task to another workqueue, which may then destroy it.

We can fix any races with rpc_make_runnable() by ensuring that we only
clear the RPC_TASK_RUNNING flag while holding the rpc_wait_queue->lock that
the task is supposed to be sleeping on (and then checking whether or not
the task really is sleeping).

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
---

 net/sunrpc/sched.c |   33 ++++++++++++++++++++-------------
 1 files changed, 20 insertions(+), 13 deletions(-)


diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 385f427..ff50a05 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -293,11 +293,6 @@ static void rpc_make_runnable(struct rpc_task *task)
 	rpc_clear_queued(task);
 	if (rpc_test_and_set_running(task))
 		return;
-	/* We might have raced */
-	if (RPC_IS_QUEUED(task)) {
-		rpc_clear_running(task);
-		return;
-	}
 	if (RPC_IS_ASYNC(task)) {
 		int status;
 
@@ -607,7 +602,9 @@ void rpc_release_calldata(const struct rpc_call_ops *ops, void *calldata)
  */
 static void __rpc_execute(struct rpc_task *task)
 {
-	int		status = 0;
+	struct rpc_wait_queue *queue;
+	int task_is_async = RPC_IS_ASYNC(task);
+	int status = 0;
 
 	dprintk("RPC: %5u __rpc_execute flags=0x%x\n",
 			task->tk_pid, task->tk_flags);
@@ -647,15 +644,25 @@ static void __rpc_execute(struct rpc_task *task)
 		 */
 		if (!RPC_IS_QUEUED(task))
 			continue;
-		rpc_clear_running(task);
-		if (RPC_IS_ASYNC(task)) {
-			/* Careful! we may have raced... */
-			if (RPC_IS_QUEUED(task))
-				return;
-			if (rpc_test_and_set_running(task))
-				return;
+		/*
+		 * The queue->lock protects against races with
+		 * rpc_make_runnable().
+		 *
+		 * Note that once we clear RPC_TASK_RUNNING on an asynchronous
+		 * rpc_task, rpc_make_runnable() can assign it to a
+		 * different workqueue. We therefore cannot assume that the
+		 * rpc_task pointer may still be dereferenced.
+		 */
+		queue = task->tk_waitqueue;
+		spin_lock_bh(&queue->lock);
+		if (!RPC_IS_QUEUED(task)) {
+			spin_unlock_bh(&queue->lock);
 			continue;
 		}
+		rpc_clear_running(task);
+		spin_unlock_bh(&queue->lock);
+		if (task_is_async)
+			return;
 
 		/* sync task: sleep here */
 		dprintk("RPC: %5u sync task going to sleep\n", task->tk_pid);



