* crash in active_task, 3.0-rc6+
@ 2011-07-11 18:01 Ben Greear
  2011-07-19 11:08 ` Peter Zijlstra
  0 siblings, 1 reply; 4+ messages in thread

From: Ben Greear @ 2011-07-11 18:01 UTC (permalink / raw)
To: Linux Kernel Mailing List

This is from a lightly patched kernel (nfs related patches).
The same system sees lockdep warnings related to scheduling that
I've been posting as well...

I haven't seen anything similar in older kernels, so this may
be a regression.  We tested .38 in this setup without seeing it,
but haven't tested .39.

[root@simech2 ~]# BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
IP: [<ffffffff81231bf8>] do_raw_spin_lock+0x14/0x13c
PGD 127082067 PUD 127081067 PMD 0
Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
CPU 2
Modules linked in: 8021q garp xt_addrtype xt_TPROXY nf_tproxy_core xt_socket nf_defrag_ipv6 xt_set ip_set nfnetlink xt_connlimit ip]

Pid: 2203, comm: gnuserver Not tainted 3.0.0-rc6+ #20 Supermicro X7DBU/X7DBU
RIP: 0010:[<ffffffff81231bf8>]  [<ffffffff81231bf8>] do_raw_spin_lock+0x14/0x13c
RSP: 0018:ffff8801104e3628  EFLAGS: 00010096
RAX: ffff880126b9e7e0 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff8801104e3668 R08: 0000000000000002 R09: 0000000000000001
R10: 000002bf7b96a212 R11: ffff880107481e30 R12: 0000000000000003
R13: 0000000000000000 R14: 0000000000000046 R15: 0000000000000002
FS:  00007f213f07c720(0000) GS:ffff88012fc80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000004 CR3: 0000000110519000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process gnuserver (pid: 2203, threadinfo ffff8801104e2000, task ffff880126b9e7e0)
Stack:
 0000000000000018 0000000000000046 00000000bf3d0800 0000000000000000
 0000000000000003 0000000000000000 0000000000000046 0000000000000002
 ffff8801104e3698 ffffffff8147e459 ffffffff810a7f25 ffffffff8103d2f7
Call Trace:
 [<ffffffff8147e459>] _raw_spin_lock+0x3e/0x45
 [<ffffffff810a7f25>] ? __rcu_read_unlock+0xdc/0x24e
 [<ffffffff8103d2f7>] ? activate_task+0x30/0x30
 [<ffffffff810a7f25>] __rcu_read_unlock+0xdc/0x24e
 [<ffffffff8103d34d>] rcu_read_unlock+0x21/0x23
 [<ffffffff8103efd7>] select_task_rq_fair+0x8cc/0x8e1
 [<ffffffff81042b59>] ? finish_task_switch+0x78/0xf1
 [<ffffffff81042b1b>] ? finish_task_switch+0x3a/0xf1
 [<ffffffff810468d0>] ? try_to_wake_up+0x29/0x1a0
 [<ffffffff810468d0>] ? try_to_wake_up+0x29/0x1a0
 [<ffffffff81040c52>] select_task_rq+0x13/0x44
 [<ffffffff8104697f>] try_to_wake_up+0xd8/0x1a0
 [<ffffffff8103c783>] ? __wake_up+0x1d/0x48
 [<ffffffff81046a54>] default_wake_function+0xd/0xf
 [<ffffffff8106751a>] autoremove_wake_function+0x13/0x38
 [<ffffffff810395d0>] __wake_up_common+0x49/0x7f
 [<ffffffff8103c79a>] __wake_up+0x34/0x48
 [<ffffffff810a74a9>] rcu_report_exp_rnp+0x50/0x89
 [<ffffffff810a7f25>] ? __rcu_read_unlock+0xdc/0x24e
 [<ffffffff810a8032>] __rcu_read_unlock+0x1e9/0x24e
 [<ffffffff813fce40>] rcu_read_unlock+0x21/0x23
 [<ffffffff813fd52d>] ip_queue_xmit+0x35e/0x3b1
 [<ffffffff813fd1cf>] ? ip_send_reply+0x247/0x247
 [<ffffffff8140f5f3>] tcp_transmit_skb+0x785/0x7c3
 [<ffffffff81411e23>] tcp_write_xmit+0x806/0x8f5
 [<ffffffff810e646f>] ? might_fault+0x4e/0x9e
 [<ffffffff81403e25>] ? copy_from_user+0x2a/0x2c
 [<ffffffff81411f63>] __tcp_push_pending_frames+0x20/0x4d
 [<ffffffff8140411f>] tcp_push+0x84/0x86
 [<ffffffff81406577>] tcp_sendmsg+0x674/0x775
 [<ffffffff81423d68>] inet_sendmsg+0x61/0x6a
 [<ffffffff813af67a>] __sock_sendmsg_nosec+0x58/0x61
 [<ffffffff813b0db5>] __sock_sendmsg+0x3d/0x48
 [<ffffffff813b1631>] sock_sendmsg+0xa3/0xbc
 [<ffffffff810ea6ae>] ? handle_pte_fault+0x7fc/0x84d
 [<ffffffff81110b14>] ? mem_cgroup_get_limit+0x45/0x45
 [<ffffffff8111a4e9>] ? fget_light+0x35/0xac
 [<ffffffff813b16b2>] ? sockfd_lookup_light+0x1b/0x53
 [<ffffffff813b1bf5>] sys_sendto+0xfa/0x11f
 [<ffffffff8147f0d8>] ? retint_restore_args+0x13/0x13
 [<ffffffff8109f0a1>] ? audit_syscall_entry+0x119/0x145
 [<ffffffff81484d52>] system_call_fastpath+0x16/0x1b
Code: ff 48 c7 43 10 ff ff ff ff c7 43 08 ff ff ff ff fe 03 41 58 5b c9 c3 55 48 89 e5 41 57 41 56 41 55 41 54 53 48 89 fb 48 83 ec 7f 04 ad 4e ad de 74 0c 48 c7 c6 47 32 7a 81 e8 a9 fe ff ff
RIP  [<ffffffff81231bf8>] do_raw_spin_lock+0x14/0x13c
 RSP <ffff8801104e3628>
CR2: 0000000000000004
---[ end trace bd9381b635ffe22e ]---

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com
* Re: crash in active_task, 3.0-rc6+
  2011-07-11 18:01 crash in active_task, 3.0-rc6+ Ben Greear
@ 2011-07-19 11:08 ` Peter Zijlstra
  2011-07-19 14:28   ` Paul E. McKenney
  0 siblings, 1 reply; 4+ messages in thread

From: Peter Zijlstra @ 2011-07-19 11:08 UTC (permalink / raw)
To: Ben Greear
Cc: Linux Kernel Mailing List, Andrew Morton, Paul E. McKenney, Ingo Molnar

[-- Attachment #1: Type: text/plain, Size: 4735 bytes --]

On Mon, 2011-07-11 at 11:01 -0700, Ben Greear wrote:
> This is from a lightly patched kernel (nfs related patches)
> Same system sees lockdep warnings related to scheduling that
> I've been posting as well...
>
> I haven't seen anything similar in older kernels, so this may
> be a regression. We tested .38 in this setup without seeing it,
> but haven't tested .39.
>
> [root@simech2 ~]# BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
> IP: [<ffffffff81231bf8>] do_raw_spin_lock+0x14/0x13c
> [ full oops trace snipped; identical to the first message in the thread ]

Fun, that's a recursion we've been poking at for the past few days, but
I haven't seen it do a NULL-ptr deref like this yet.

I'm also not quite sure the below patches will catch this particular
recursion; it might be you need the patch from
lkml.kernel.org/r/20110718151524.GA4236@linux.vnet.ibm.com as well,
since this is RCU recursing on itself without the aid of interrupts.

How reproducible is this? And if so, what's your .config and recipe for
making it go bang?
[-- Attachment #2: peter_zijlstra-re-info__possible_circular_locking_dependency_detected.patch --]
[-- Type: text/x-patch, Size: 2815 bytes --]

Subject: rcu, softirq: Avoid rcu_read_unlock_special vs softirq recursion
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Fri, 15 Jul 2011 17:04:27 +0200

  $task                 IRQ             SoftIRQ

  rcu_read_lock()

  /* do stuff */

  <preempt> |= UNLOCK_BLOCKED

  rcu_read_unlock()
    --t->rcu_read_lock_nesting

                        irq_enter();
                        /* do stuff, don't use RCU */
                        irq_exit();
                          sub_preempt_count(IRQ_EXIT_OFFSET);
                          invoke_softirq()

                                        ttwu();
                                          spin_lock_irq(&pi->lock)
                                          rcu_read_lock();
                                          /* do stuff */
                                          rcu_read_unlock();
                                            rcu_read_unlock_special()
                                              rcu_report_exp_rnp()
                                                ttwu()
                                                  spin_lock_irq(&pi->lock) /* deadlock */

    rcu_read_unlock_special(t);

Ed can simply trigger this 'easy' because invoke_softirq() immediately
does a ttwu() of ksoftirqd/# instead of doing the in-place softirq stuff
first, but even without that the above can happen.

Cure this by also excluding softirqs from the rcu_read_unlock_special()
handler and ensuring the force_irqthreads ksoftirqd/# wakeup is done
from full softirq context.

[ Alternatively, delaying the ->rcu_read_lock_nesting decrement until
  after the special handling would make the thing more robust in the
  face of interrupts as well. ]

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reported-and-tested-by: Ed Tomlinson <edt@aei.ca>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1310742267.2586.353.camel@twins
---
 kernel/rcutree_plugin.h |    2 +-
 kernel/softirq.c        |   12 ++++++++++--
 2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 14dc7dd..373c9c8 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -309,7 +309,7 @@ static void rcu_read_unlock_special(struct task_struct *t)
 	}
 
 	/* Hardware IRQ handlers cannot block. */
-	if (in_irq()) {
+	if (in_irq() || in_serving_softirq()) {
 		local_irq_restore(flags);
 		return;
 	}
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 40cf63d..fca82c3 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -315,16 +315,24 @@ static inline void invoke_softirq(void)
 {
 	if (!force_irqthreads)
 		__do_softirq();
-	else
+	else {
+		__local_bh_disable((unsigned long)__builtin_return_address(0),
+				   SOFTIRQ_OFFSET);
 		wakeup_softirqd();
+		__local_bh_enable(SOFTIRQ_OFFSET);
+	}
 }
 #else
 static inline void invoke_softirq(void)
 {
 	if (!force_irqthreads)
 		do_softirq();
-	else
+	else {
+		__local_bh_disable((unsigned long)__builtin_return_address(0),
+				   SOFTIRQ_OFFSET);
 		wakeup_softirqd();
+		__local_bh_enable(SOFTIRQ_OFFSET);
+	}
 }
 #endif

[-- Attachment #3: peter_zijlstra-sched-add_irq_enterexit_to_scheduler_ipi.patch --]
[-- Type: text/x-patch, Size: 2824 bytes --]

Subject: sched: Add irq_{enter,exit}() to scheduler_ipi()
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Mon, 18 Jul 2011 11:29:44 +0200

Ensure scheduler_ipi() calls irq_{enter,exit} when it does some actual
work. Traditionally we never did any actual work from the resched IPI
and all magic happened in the return from interrupt path.

Now that we do do some work, we need to ensure irq_{enter,exit} are
called so that we don't confuse things. This affects things like
timekeeping, NO_HZ and RCU, basically everything with a hook in
irq_enter/exit.

Explicit examples of things going wrong are:

  sched_clock_cpu() -- has a callback when leaving NO_HZ state to take
  a new reading from GTOD and TSC. Without this callback, time is
  stuck in the past.

  RCU -- needs in_irq() to work in order to avoid some nasty deadlocks

Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1310981384.13765.40.camel@twins
---
 kernel/sched.c |   40 ++++++++++++++++++++++++++++++++++------
 1 files changed, 34 insertions(+), 6 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 8fb4245..eb9cbe7 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2544,13 +2544,9 @@ static int ttwu_remote(struct task_struct *p, int wake_flags)
 }
 
 #ifdef CONFIG_SMP
-static void sched_ttwu_pending(void)
+static void sched_ttwu_do_pending(struct task_struct *list)
 {
 	struct rq *rq = this_rq();
-	struct task_struct *list = xchg(&rq->wake_list, NULL);
-
-	if (!list)
-		return;
 
 	raw_spin_lock(&rq->lock);
 
@@ -2563,9 +2559,41 @@ static void sched_ttwu_pending(void)
 	raw_spin_unlock(&rq->lock);
 }
 
+static void sched_ttwu_pending(void)
+{
+	struct rq *rq = this_rq();
+	struct task_struct *list = xchg(&rq->wake_list, NULL);
+
+	if (!list)
+		return;
+
+	sched_ttwu_do_pending(list);
+}
+
 void scheduler_ipi(void)
 {
-	sched_ttwu_pending();
+	struct rq *rq = this_rq();
+	struct task_struct *list = xchg(&rq->wake_list, NULL);
+
+	if (!list)
+		return;
+
+	/*
+	 * Not all reschedule IPI handlers call irq_enter/irq_exit, since
+	 * traditionally all their work was done from the interrupt return
+	 * path. Now that we actually do some work, we need to make sure
+	 * we do call them.
+	 *
+	 * Some archs already do call them, luckily irq_enter/exit nest
+	 * properly.
+	 *
+	 * Arguably we should visit all archs and update all handlers,
+	 * however a fair share of IPIs are still resched only so this would
+	 * somewhat pessimize the simple resched case.
+	 */
+	irq_enter();
+	sched_ttwu_do_pending(list);
+	irq_exit();
 }
 
 static void ttwu_queue_remote(struct task_struct *p, int cpu)
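The cycle in Peter's diagram can be modelled outside the kernel. The sketch below is illustrative C only; the names mimic kernel ones (in_serving_softirq, ttwu) but nothing here is kernel code. It shows why bailing out of the unlock slow path while serving a softirq breaks the recursion:

```c
#include <assert.h>

#define MAX_DEPTH 8 /* stand-in for "the stack blows up / pi->lock deadlocks" */

static int in_serving_softirq_flag; /* models in_serving_softirq() */
static int use_guard;               /* 0: unpatched, 1: patched */
static int depth;                   /* how deep the slow path re-entered */

/*
 * Models rcu_read_unlock_special(): the slow path performs a wakeup,
 * the wakeup path itself ends an RCU read-side critical section, and
 * that rcu_read_unlock() re-enters the slow path.
 */
static void unlock_special(void)
{
    if (use_guard && in_serving_softirq_flag)
        return;                 /* patched: defer, handle from a safe context */
    if (++depth >= MAX_DEPTH)
        return;                 /* cap the model instead of recursing forever */
    in_serving_softirq_flag = 1; /* the wakeup runs from softirq context */
    unlock_special();            /* ttwu() -> rcu_read_unlock() -> back here */
    in_serving_softirq_flag = 0;
}

/* Run the scenario starting inside invoke_softirq()'s wakeup. */
static int run(int guarded)
{
    use_guard = guarded;
    in_serving_softirq_flag = 1;
    depth = 0;
    unlock_special();
    in_serving_softirq_flag = 0;
    return depth;
}
```

With the guard disabled the model recurses until its depth cap; with the guard enabled the slow path bails out on the first call and never re-enters.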
* Re: crash in active_task, 3.0-rc6+
  2011-07-19 11:08 ` crash in active_task, 3.0-rc6+ Peter Zijlstra
@ 2011-07-19 14:28   ` Paul E. McKenney
  2011-07-19 16:09     ` Ben Greear
  0 siblings, 1 reply; 4+ messages in thread

From: Paul E. McKenney @ 2011-07-19 14:28 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ben Greear, Linux Kernel Mailing List, Andrew Morton, Ingo Molnar

On Tue, Jul 19, 2011 at 01:08:21PM +0200, Peter Zijlstra wrote:
> On Mon, 2011-07-11 at 11:01 -0700, Ben Greear wrote:
> > This is from a lightly patched kernel (nfs related patches)
> > Same system sees lockdep warnings related to scheduling that
> > I've been posting as well...
> >
> > I haven't seen anything similar in older kernels, so this may
> > be a regression. We tested .38 in this setup without seeing it,
> > but haven't tested .39.
> >
> > [ full oops trace snipped; see the first message in the thread ]
>
> Fun, that's a recursion we've been poking at for the past few days, but
> I haven't seen it do a NULL-ptr deref like this yet.
>
> I'm also not quite sure the below patches will catch this particular
> recursion; it might be you need the patch from
> lkml.kernel.org/r/20110718151524.GA4236@linux.vnet.ibm.com as well,
> since this is RCU recursing on itself without the aid of interrupts.
>
> How reproducible is this? And if so, what's your .config and recipe for
> making it go bang?

Yow!

I bet that the .config includes CONFIG_RCU_BOOST=y.  If so, the following
patch will be needed.  Otherwise, I have no idea how the above sequence
of events happened -- all of the ->rcu_read_unlock_special bits should
be cleared by the time that rcu_report_exp_rnp() is invoked.  Though
the patch that Peter pointed you at on lkml.kernel.org above would likely
make it go away in conjunction with the patch below.
							Thanx, Paul

------------------------------------------------------------------------

rcu: Fix RCU_BOOST race handling current->rcu_read_unlock_special

The RCU_BOOST commits for TREE_PREEMPT_RCU introduced an other-task
write to a new RCU_READ_UNLOCK_BOOSTED bit in the task_struct structure's
->rcu_read_unlock_special field, but, as noted by Steven Rostedt, without
correctly synchronizing all accesses to ->rcu_read_unlock_special.
This could result in bits in ->rcu_read_unlock_special being spuriously
set and cleared due to conflicting accesses, which in turn could result
in deadlocks between the rcu_node structure's ->lock and the scheduler's
rq and pi locks.  These deadlocks would result from RCU incorrectly
believing that the just-ended RCU read-side critical section had been
preempted and/or boosted.  If that RCU read-side critical section was
executed with either rq or pi locks held, RCU's ensuing (incorrect)
calls to the scheduler would cause the scheduler to attempt to once
again acquire the rq and pi locks, resulting in deadlock.  More complex
deadlock cycles are also possible, involving multiple rq and pi locks
as well as locks from multiple rcu_node structures.

This commit fixes synchronization by creating ->rcu_boosted field in
task_struct that is accessed and modified only when holding the ->lock
in the rcu_node structure on which the task is queued (on that rcu_node
structure's ->blkd_tasks list).  This results in tasks accessing only
their own current->rcu_read_unlock_special fields, making unsynchronized
access once again legal, and keeping the rcu_read_unlock() fastpath free
of atomic instructions and memory barriers.

The reason that the rcu_read_unlock() fastpath does not need to access
the new current->rcu_boosted field is that this new field cannot
be non-zero unless the RCU_READ_UNLOCK_BLOCKED bit is set in the
current->rcu_read_unlock_special field.  Therefore, rcu_read_unlock()
need only test current->rcu_read_unlock_special: if that is zero, then
current->rcu_boosted must also be zero.

This bug does not affect TINY_PREEMPT_RCU because this implementation
of RCU accesses current->rcu_read_unlock_special with irqs disabled,
thus preventing races on the !SMP systems that TINY_PREEMPT_RCU runs on.

Maybe-reported-by: Dave Jones <davej@redhat.com>
Maybe-reported-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 496770a..76676a4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1254,6 +1254,9 @@ struct task_struct {
 #ifdef CONFIG_PREEMPT_RCU
 	int rcu_read_lock_nesting;
 	char rcu_read_unlock_special;
+#if defined(CONFIG_RCU_BOOST) && defined(CONFIG_TREE_PREEMPT_RCU)
+	int rcu_boosted;
+#endif /* #if defined(CONFIG_RCU_BOOST) && defined(CONFIG_TREE_PREEMPT_RCU) */
 	struct list_head rcu_node_entry;
 #endif /* #ifdef CONFIG_PREEMPT_RCU */
 #ifdef CONFIG_TREE_PREEMPT_RCU
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 94a674e..82b3c58 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -342,6 +342,11 @@ static void rcu_read_unlock_special(struct task_struct *t)
 #ifdef CONFIG_RCU_BOOST
 		if (&t->rcu_node_entry == rnp->boost_tasks)
 			rnp->boost_tasks = np;
+		/* Snapshot and clear ->rcu_boosted with rcu_node lock held. */
+		if (t->rcu_boosted) {
+			special |= RCU_READ_UNLOCK_BOOSTED;
+			t->rcu_boosted = 0;
+		}
 #endif /* #ifdef CONFIG_RCU_BOOST */
 		t->rcu_blocked_node = NULL;
@@ -358,7 +363,6 @@ static void rcu_read_unlock_special(struct task_struct *t)
 #ifdef CONFIG_RCU_BOOST
 		/* Unboost if we were boosted. */
 		if (special & RCU_READ_UNLOCK_BOOSTED) {
-			t->rcu_read_unlock_special &= ~RCU_READ_UNLOCK_BOOSTED;
 			rt_mutex_unlock(t->rcu_boost_mutex);
 			t->rcu_boost_mutex = NULL;
 		}
@@ -1175,7 +1179,7 @@ static int rcu_boost(struct rcu_node *rnp)
 	t = container_of(tb, struct task_struct, rcu_node_entry);
 	rt_mutex_init_proxy_locked(&mtx, t);
 	t->rcu_boost_mutex = &mtx;
-	t->rcu_read_unlock_special |= RCU_READ_UNLOCK_BOOSTED;
+	t->rcu_boosted = 1;
 	raw_spin_unlock_irqrestore(&rnp->lock, flags);
 	rt_mutex_lock(&mtx);  /* Side effect: boosts task t's priority. */
 	rt_mutex_unlock(&mtx);  /* Keep lockdep happy. */
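The lost update that the commit message describes can be replayed by hand. The following is an illustrative C model, not kernel code: two writers do an unsynchronized read-modify-write on one shared byte, which is exactly the conflicting-access pattern the ->rcu_boosted split removes:

```c
#include <assert.h>

enum { RCU_READ_UNLOCK_BLOCKED = 1, RCU_READ_UNLOCK_BOOSTED = 2 };

/*
 * Replay one bad interleaving against a single shared field, as before
 * the fix: task T clears its BLOCKED bit while the boosting task sets
 * BOOSTED on the same byte.  Returns the field's final value.
 */
static unsigned char race_single_field(void)
{
    unsigned char special = RCU_READ_UNLOCK_BLOCKED;

    unsigned char t_copy = special;              /* T loads the field        */
    special |= RCU_READ_UNLOCK_BOOSTED;          /* booster sets its bit     */
    special = t_copy & ~RCU_READ_UNLOCK_BLOCKED; /* T stores its stale copy,
                                                    clobbering BOOSTED       */
    return special;
}

/*
 * Same interleaving after the fix: the booster writes only its own
 * field (modelling ->rcu_boosted), so T's store cannot clobber it.
 * Returns nonzero when the boost record survives.
 */
static int race_split_fields(void)
{
    unsigned char special = RCU_READ_UNLOCK_BLOCKED;
    int rcu_boosted = 0;

    unsigned char t_copy = special;
    rcu_boosted = 1;                              /* booster's write          */
    special = t_copy & ~RCU_READ_UNLOCK_BLOCKED;  /* T's write, other field   */
    return rcu_boosted && special == 0;
}
```

In the single-field replay the BOOSTED bit vanishes, so the task would never unboost itself, which matches the rq/pi-lock deadlocks described above; with the split field the booster's record survives regardless of how T's store is interleaved.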
* Re: crash in active_task, 3.0-rc6+
  2011-07-19 14:28 ` Paul E. McKenney
@ 2011-07-19 16:09   ` Ben Greear
  0 siblings, 0 replies; 4+ messages in thread

From: Ben Greear @ 2011-07-19 16:09 UTC (permalink / raw)
To: paulmck
Cc: Peter Zijlstra, Linux Kernel Mailing List, Andrew Morton, Ingo Molnar

On 07/19/2011 07:28 AM, Paul E. McKenney wrote:
> On Tue, Jul 19, 2011 at 01:08:21PM +0200, Peter Zijlstra wrote:
>> On Mon, 2011-07-11 at 11:01 -0700, Ben Greear wrote:
>>> [ original report and full oops trace snipped; see the first message ]
>>
>> Fun, that's a recursion we've been poking at for the past few days, but
>> I haven't seen it do a NULL-ptr deref like this yet.
>>
>> I'm also not quite sure the below patches will catch this particular
>> recursion; it might be you need the patch from
>> lkml.kernel.org/r/20110718151524.GA4236@linux.vnet.ibm.com as well,
>> since this is RCU recursing on itself without the aid of interrupts.
>>
>> How reproducible is this? And if so, what's your .config and recipe for
>> making it go bang?
>
> Yow!
>
> I bet that the .config includes CONFIG_RCU_BOOST=y.  If so, the following
> patch will be needed.  Otherwise, I have no idea how the above sequence
> of events happened -- all of the ->rcu_read_unlock_special bits should
> be cleared by the time that rcu_report_exp_rnp() is invoked.  Though
> the patch that Peter pointed you at on lkml.kernel.org above would likely
> make it go away in conjunction with the patch below.

Yes, I was running with boost enabled:

[greearb@fs3 linux-3.0.x64-nfs]$ grep BOOST .config
CONFIG_RCU_BOOST=y
CONFIG_RCU_BOOST_PRIO=1
CONFIG_RCU_BOOST_DELAY=500

The workload was mounting 200 NFS mounts, doing O_DIRECT I/O
(200 processes, one per mount) for 30 seconds or so, unmounting,
and repeating... for hours.  I could reliably crash the system on
overnight runs even after fixing the NFS bugs, but it crashed in
various ways.

I'll re-run the tests with the latest upstream tree plus these patches,
assuming they are not already in the tree.  It might be a day or two
before I can get this set up, however, as the machine in question is
running some other tests on an older kernel at the moment.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com
end of thread, other threads: [~2011-07-19 16:10 UTC | newest]

Thread overview: 4+ messages -- links below jump to the message on this page:
2011-07-11 18:01 crash in active_task, 3.0-rc6+ Ben Greear
2011-07-19 11:08 ` Peter Zijlstra
2011-07-19 14:28   ` Paul E. McKenney
2011-07-19 16:09     ` Ben Greear