Message-ID: <1370275337.26799.94.camel@gandalf.local.home>
Subject: Re: [PATCH 0/2] [GIT PULL] rcu/ftrace: Fix livelock from overhead of RCU debugging
From: Steven Rostedt
To: linux-kernel@vger.kernel.org
Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, "Paul E. McKenney"
Date: Mon, 03 Jun 2013 12:02:17 -0400
In-Reply-To: <20130529185300.403107512@goodmis.org>
References: <20130529185300.403107512@goodmis.org>
List-ID: X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, 2013-05-29 at 14:53 -0400, Steven Rostedt wrote:
> Linus,
>
> [ If you don't care to read the background of these patches,
>   just jump down to the "Please pull" line. I included this
>   description in the tag as well. ]

Hi Linus,

I noticed that this hasn't been pulled yet. Is there any issue with it,
or has it just been missed due to your travels?

I have a few more patches I'm currently testing that I built on top of
these commits. I'll have another pull request today or tomorrow. These
changes touch RCU but have been acked by Paul.
Thanks,

-- Steve

>
> I was testing a config for one bug and triggered a livelock lockup. An
> NMI watchdog dump showed what was happening:
>
> [ 65.972000] NMI backtrace for cpu 2
> [ 65.972000] CPU: 2 PID: 48 Comm: kworker/2:1 Not tainted 3.10.0-rc2-test+ #151
> [ 65.972000] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
> [ 65.972000] task: ffff880078dc2140 ti: ffff880078dc4000 task.ti: ffff880078dc4000
> [ 65.972000] RIP: 0010:[] [] ftrace_ops_list_func+0x144/0x151
> [ 65.972000] RSP: 0000:ffff880078dc5ac8 EFLAGS: 00000006
> [ 65.972000] RAX: 0000000000008000 RBX: 0000000000000001 RCX: 000000000000000f
> [ 65.972000] RDX: 0000000000000002 RSI: ffffffff810c0e9f RDI: ffffffff810b73f7
> [ 65.972000] RBP: ffff880078dc5af8 R08: ffff880078dc5be0 R09: ffff880078dc5bb0
> [ 65.972000] R10: ffffffff81a3c730 R11: 0000000000000001 R12: 0000000000008000
> [ 65.972000] R13: ffffffff810b73f7 R14: 0000000000000000 R15: ffffffff810c0e9f
> [ 65.972000] FS:  0000000000000000(0000) GS:ffff88007aa00000(0000) knlGS:0000000000000000
> [ 65.972000] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 65.972000] CR2: 0000000000000168 CR3: 0000000001a0b000 CR4: 00000000000007e0
> [ 65.972000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 65.972000] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 65.972000] Stack:
> [ 65.972000]  ffff880078d97480 0000000000000001 ffffffff810b9148 ffff880078d97480
> [ 65.972000]  0000000000000000 ffffffff810bb6ba ffff880078dc5bd0 ffffffff814fa7c0
> [ 65.972000]  0000000000000000 ffffffff810bb6ba ffff880078dc5c00 ffffffff814fa7c0
> [ 65.972000] Call Trace:
> [ 65.972000]  [] ? check_for_new_grace_period+0x115/0x115
> [ 65.972000]  [] ? rcu_note_context_switch+0x20e/0x260
> [ 65.972000]  [] ftrace_call+0x5/0x2f
> [ 65.972000]  [] ? rcu_note_context_switch+0x20e/0x260
> [ 65.972000]  [] ? ftrace_call+0x5/0x2f
> [ 65.972000]  [] ? rcu_is_cpu_idle+0x5/0x69
> [ 65.972000]  [] ? debug_lockdep_rcu_enabled+0x5/0x35
> [ 65.972000]  [] ? rcu_is_cpu_idle+0x5/0x69
> [ 65.972000]  [] ? rcu_read_lock_held+0x19/0x3c
> [ 65.972000]  [] ftrace_ops_test+0x47/0x93
> [ 65.972000]  [] ? check_for_new_grace_period+0x115/0x115
> [ 65.972000]  [] ftrace_ops_list_func+0xcc/0x151
> [ 65.972000]  [] ftrace_call+0x5/0x2f
> [ 65.972000]  [] ? ftrace_ops_list_func+0x11f/0x151
> [ 65.972000]  [] ? rcu_note_context_switch+0x206/0x260
> [ 65.972000]  [] ? rcu_preempt_qs+0x5/0x53
> [ 65.972000]  [] ? trace_hardirqs_off+0xd/0xf
> [ 65.972000]  [] ? rcu_preempt_qs+0x5/0x53
> [ 65.972000]  [] ? rcu_note_context_switch+0x20e/0x260
> [ 65.972000]  [] ? schedule+0x64/0x66
> [ 65.972000]  [] __schedule+0x67/0x710
> [ 65.972000]  [] ? worker_thread+0x1e9/0x1f3
> [ 65.972000]  [] ? worker_thread+0x1e9/0x1f3
> [ 65.972000]  [] schedule+0x64/0x66
> [ 65.972000]  [] worker_thread+0x1ee/0x1f3
> [ 65.972000]  [] ? rescuer_thread+0x26c/0x26c
> [ 65.972000]  [] kthread+0xb1/0xb9
> [ 65.972000]  [] ? finish_task_switch+0x7c/0xf8
> [ 65.972000]  [] ? __init_kthread_worker+0x59/0x59
> [ 65.972000]  [] ret_from_fork+0x7c/0xb0
> [ 65.972000]  [] ? __init_kthread_worker+0x59/0x59
>
> What happened was that ftrace's function tracer uses
> hlist_for_each_entry_rcu(), which uses rcu_dereference_raw(), and also
> calls rcu_dereference_raw() directly. As this macro is defined as
> rcu_dereference_check(), it calls rcu_read_lock_sched_held(), which
> does various RCU checks as well as lockdep and such. When that happens
> for *every* function call in the kernel, it slows the system down to
> the point where it can live lock (interrupts take too long to finish
> before the next interrupt is triggered).
>
> To give an idea of the abuse that function tracing is taking, every
> function that is traced can do the following:
>
> In __ftrace_ops_list_func() we have:
>
> 	do_for_each_ftrace_op(op, ftrace_ops_list) {
> 		if (ftrace_ops_test(op, ip))
> 			op->func(ip, parent_ip, op, regs);
> 	} while_for_each_ftrace_op(op);
>
> Where:
>
> #define do_for_each_ftrace_op(op, list) \
> 	op = rcu_dereference_raw(list); \
> 	do
>
> and
>
> #define while_for_each_ftrace_op(op) \
> 	while (likely(op = rcu_dereference_raw((op)->next)) && \
> 	       unlikely((op) != &ftrace_list_end))
>
> And ftrace_ops_test() has:
>
> 	filter_hash = rcu_dereference_raw(ops->filter_hash);
> 	notrace_hash = rcu_dereference_raw(ops->notrace_hash);
>
> 	if ((ftrace_hash_empty(filter_hash) ||
> 	     ftrace_lookup_ip(filter_hash, ip)) &&
> 	    (ftrace_hash_empty(notrace_hash) ||
> 	     !ftrace_lookup_ip(notrace_hash, ip)))
> 		ret = 1;
>
> Where ftrace_lookup_ip() has:
>
> 	hlist_for_each_entry_rcu(entry, hhd, hlist) {
>
> That is 6 calls to rcu_dereference_raw() for every single function that
> is traced! And this doesn't count more that may be done by the callbacks
> registered with the function tracer, or by multiple loops of
> while_for_each_ftrace_op(). When RCU debugging is on, that's 6 calls to
> rcu_read_lock_sched_held(), which not only makes several calls that are
> themselves traced by the function tracer (luckily the recursion
> protection catches those), but those calls also call
> preempt_disable/enable (which gets traced by the function tracer as
> well). Worse yet, lock_is_held() is called, which disables and then
> re-enables interrupts and searches the list of held locks.
>
> All this overhead is done for every function call that is traced (which
> could be most of the functions in the kernel), at least 6 times each.
> This can slow things down so drastically that we get a live lock in
> the system.
>
> Now, we could put a bunch of notrace annotations in the RCU code, but
> as RCU is a more critical subsystem than tracing, and is used more
> often, I find it advantageous to be able to trace RCU. RCU is also a
> more complex subsystem than tracing, and the ability to trace RCU is an
> asset for debugging it and gives us better confidence that it works.
> Thus, I believe it is better to remove some of the debuggability from
> the tracing infrastructure than to remove it from RCU. Not to mention,
> even after eliminating the recursive calls caused by tracing RCU, that
> doesn't stop the RCU functions or the lockdep functions from being
> called 6 times for every function traced.
>
> After discussing this with Paul McKenney, I added two new RCU
> interfaces:
>
>   rcu_dereference_raw_notrace()
>   hlist_for_each_entry_rcu_notrace()
>
> The function tracer now uses these, and this prevents the live lock.
>
>
> Please pull the latest trace-fixes-v3.10-rc3-v2 tree, which can be found at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git
> trace-fixes-v3.10-rc3-v2
>
> Tag SHA1: 066277682313adbca025f177cff8a211f2b4f46d
> Head SHA1: 1bb539ca36e21c2f4fce0865e11df384bc7b7656
>
>
> Steven Rostedt (2):
>       rcu: Add _notrace variation of rcu_dereference_raw() and hlist_for_each_entry_rcu()
>       ftrace: Use the rcu _notrace variants for rcu_dereference_raw() and friends
>
> ----
>  include/linux/rculist.h  | 20 ++++++++++++++++++++
>  include/linux/rcupdate.h |  9 +++++++++
>  kernel/trace/ftrace.c    | 18 +++++++++---------
>  3 files changed, 38 insertions(+), 9 deletions(-)