* [PATCH 0/2] [GIT PULL] rcu/ftrace: Fix livelock from overhead of RCU debugging
From: Steven Rostedt @ 2013-05-29 18:53 UTC
  To: linux-kernel; +Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, Paul E. McKenney

Linus,

[ If you don't care to read the background of these patches,
  just jump down to the "Please pull" line. I included this
  description in the tag as well. ]

I was testing a config for one bug and triggered a livelock. An NMI
watchdog dump showed what was happening:

[   65.972000] NMI backtrace for cpu 2
[   65.972000] CPU: 2 PID: 48 Comm: kworker/2:1 Not tainted 3.10.0-rc2-test+ #151
[   65.972000] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
[   65.972000] task: ffff880078dc2140 ti: ffff880078dc4000 task.ti: ffff880078dc4000
[   65.972000] RIP: 0010:[<ffffffff810c2259>]  [<ffffffff810c2259>] ftrace_ops_list_func+0x144/0x151
[   65.972000] RSP: 0000:ffff880078dc5ac8  EFLAGS: 00000006
[   65.972000] RAX: 0000000000008000 RBX: 0000000000000001 RCX: 000000000000000f
[   65.972000] RDX: 0000000000000002 RSI: ffffffff810c0e9f RDI: ffffffff810b73f7
[   65.972000] RBP: ffff880078dc5af8 R08: ffff880078dc5be0 R09: ffff880078dc5bb0
[   65.972000] R10: ffffffff81a3c730 R11: 0000000000000001 R12: 0000000000008000
[   65.972000] R13: ffffffff810b73f7 R14: 0000000000000000 R15: ffffffff810c0e9f
[   65.972000] FS:  0000000000000000(0000) GS:ffff88007aa00000(0000) knlGS:0000000000000000
[   65.972000] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   65.972000] CR2: 0000000000000168 CR3: 0000000001a0b000 CR4: 00000000000007e0
[   65.972000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   65.972000] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   65.972000] Stack:
[   65.972000]  ffff880078d97480 0000000000000001 ffffffff810b9148 ffff880078d97480
[   65.972000]  0000000000000000 ffffffff810bb6ba ffff880078dc5bd0 ffffffff814fa7c0
[   65.972000]  0000000000000000 ffffffff810bb6ba ffff880078dc5c00 ffffffff814fa7c0
[   65.972000] Call Trace:
[   65.972000]  [<ffffffff810b9148>] ? check_for_new_grace_period+0x115/0x115
[   65.972000]  [<ffffffff810bb6ba>] ? rcu_note_context_switch+0x20e/0x260
[   65.972000]  [<ffffffff814fa7c0>] ftrace_call+0x5/0x2f
[   65.972000]  [<ffffffff810bb6ba>] ? rcu_note_context_switch+0x20e/0x260
[   65.972000]  [<ffffffff814fa7c0>] ? ftrace_call+0x5/0x2f
[   65.972000]  [<ffffffff810b73fc>] ? rcu_is_cpu_idle+0x5/0x69
[   65.972000]  [<ffffffff810558d3>] ? debug_lockdep_rcu_enabled+0x5/0x35
[   65.972000]  [<ffffffff810b73fc>] ? rcu_is_cpu_idle+0x5/0x69
[   65.972000]  [<ffffffff810c0e9f>] ? rcu_read_lock_held+0x19/0x3c
[   65.972000]  [<ffffffff810c0f90>] ftrace_ops_test+0x47/0x93
[   65.972000]  [<ffffffff810b9148>] ? check_for_new_grace_period+0x115/0x115
[   65.972000]  [<ffffffff810c21e1>] ftrace_ops_list_func+0xcc/0x151
[   65.972000]  [<ffffffff814fa7c0>] ftrace_call+0x5/0x2f
[   65.972000]  [<ffffffff810c2234>] ? ftrace_ops_list_func+0x11f/0x151
[   65.972000]  [<ffffffff810bb6b2>] ? rcu_note_context_switch+0x206/0x260
[   65.972000]  [<ffffffff810b914d>] ? rcu_preempt_qs+0x5/0x53
[   65.972000]  [<ffffffff81086f41>] ? trace_hardirqs_off+0xd/0xf
[   65.972000]  [<ffffffff810b914d>] ? rcu_preempt_qs+0x5/0x53
[   65.972000]  [<ffffffff810bb6ba>] ? rcu_note_context_switch+0x20e/0x260
[   65.972000]  [<ffffffff814f2bca>] ? schedule+0x64/0x66
[   65.972000]  [<ffffffff814f246e>] __schedule+0x67/0x710
[   65.972000]  [<ffffffff81052e5e>] ? worker_thread+0x1e9/0x1f3
[   65.972000]  [<ffffffff81052e5e>] ? worker_thread+0x1e9/0x1f3
[   65.972000]  [<ffffffff814f2bca>] schedule+0x64/0x66
[   65.972000]  [<ffffffff81052e63>] worker_thread+0x1ee/0x1f3
[   65.972000]  [<ffffffff81052c75>] ? rescuer_thread+0x26c/0x26c
[   65.972000]  [<ffffffff8105a8bf>] kthread+0xb1/0xb9
[   65.972000]  [<ffffffff81064b3c>] ? finish_task_switch+0x7c/0xf8
[   65.972000]  [<ffffffff8105a80e>] ? __init_kthread_worker+0x59/0x59
[   65.972000]  [<ffffffff814faadc>] ret_from_fork+0x7c/0xb0
[   65.972000]  [<ffffffff8105a80e>] ? __init_kthread_worker+0x59/0x59

What happened was that ftrace's function tracer uses
hlist_for_each_entry_rcu(), which uses rcu_dereference_raw(), and it
also calls rcu_dereference_raw() directly. As this macro is defined as
rcu_dereference_check(), it will call rcu_read_lock_sched_held(),
which performs various RCU checks as well as lockdep checks. When this
happens for *every* function call in the kernel, it slows the system
down to the point where it can livelock (interrupts take too long to
finish before the next interrupt is triggered).
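
To make that chain concrete, here is a rough sketch of the macro
expansion (simplified from the 3.10 rcupdate.h, not the verbatim
definitions; the backtrace above shows the rcu_read_lock_held() and
debug_lockdep_rcu_enabled() calls this produces):

        /*
         * Sketch: the "raw" variant is just the checked variant with
         * the condition hard-wired to 1. Because rcu_read_lock_held()
         * sits on the left of the ||, lockdep is still consulted on
         * every single dereference, even a "raw" one.
         */
        #define rcu_dereference_check(p, c) \
                __rcu_dereference_check((p), rcu_read_lock_held() || (c), __rcu)

        #define rcu_dereference_raw(p) rcu_dereference_check(p, 1)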

To give an idea of the abuse that function tracing inflicts, every
function that is traced can do the following:

In __ftrace_ops_list_func() we have:

        do_for_each_ftrace_op(op, ftrace_ops_list) {
                if (ftrace_ops_test(op, ip))
                        op->func(ip, parent_ip, op, regs);
        } while_for_each_ftrace_op(op);

Where:

#define do_for_each_ftrace_op(op, list)                 \
        op = rcu_dereference_raw(list);                 \
        do

and

#define while_for_each_ftrace_op(op)                            \
        while (likely(op = rcu_dereference_raw((op)->next)) &&  \
               unlikely((op) != &ftrace_list_end))

And ftrace_ops_test() has:

        filter_hash = rcu_dereference_raw(ops->filter_hash);
        notrace_hash = rcu_dereference_raw(ops->notrace_hash);

        if ((ftrace_hash_empty(filter_hash) ||
             ftrace_lookup_ip(filter_hash, ip)) &&
            (ftrace_hash_empty(notrace_hash) ||
             !ftrace_lookup_ip(notrace_hash, ip)))
                ret = 1;

Where ftrace_lookup_ip() has:

        hlist_for_each_entry_rcu(entry, hhd, hlist) {

Now that is 6 calls to rcu_dereference_raw() for every single function
that is traced (one in do_for_each_ftrace_op(), one in
while_for_each_ftrace_op(), two in ftrace_ops_test(), and one in each
of the two ftrace_lookup_ip() calls)! And this doesn't count the
possibly many more done by the callbacks registered with the function
tracer, or by multiple loops of while_for_each_ftrace_op(). When RCU
debugging is on, that's 6 calls to rcu_read_lock_sched_held(), which
not only makes several calls that are traced by the function tracer
(luckily the recursion protection catches those), but those calls also
call preempt_disable/enable (which get traced by the function tracer).
Worse yet, lock_is_held() is called, which disables and then re-enables
interrupts, and searches the list of held locks.
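
For reference, the debug check itself looks roughly like this
(simplified from the 3.10 kernel/rcupdate.c, not the verbatim body):

        int rcu_read_lock_sched_held(void)
        {
                int lockdep_opinion = 0;

                /* Every helper below is itself a traceable function. */
                if (!debug_lockdep_rcu_enabled())
                        return 1;
                if (rcu_is_cpu_idle())  /* visible in the backtrace above */
                        return 0;
                if (debug_locks)        /* walks the held-lock list */
                        lockdep_opinion = lock_is_held(&rcu_sched_lock_map);
                return lockdep_opinion || preempt_count() != 0 || irqs_disabled();
        }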

All this overhead is incurred for every function call that is traced
(which could be most of the functions in the kernel), and at least 6
times per call. This can slow things down so drastically that the
system livelocks.

Now, we could put a bunch of notrace annotations in the RCU code, but
as RCU is a more critical subsystem than tracing, and used more often,
I find it advantageous to be able to trace RCU. RCU is also a more
complex subsystem than tracing, and the ability to trace RCU is an
asset for debugging it and gives us better confidence that it works.
Thus, I believe it is better to remove some of the debuggability from
the tracing infrastructure than to remove it from RCU. Not to mention
that even if we eliminated the recursive calls caused by tracing RCU,
that would not stop the RCU functions, or the lockdep functions, from
being called 6 times for every function traced.

After discussing this with Paul McKenney, I added two new RCU interfaces:

 rcu_dereference_raw_notrace()
 hlist_for_each_entry_rcu_notrace()

The function tracer now uses these, which prevents the livelock.
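
As a usage sketch (a hypothetical lookup modeled on ftrace_lookup_ip();
not code from the patches themselves), a hash-bucket walk on the
function-trace path now reads:

        /* Walk an RCU hlist from a traced path without triggering the
         * RCU/lockdep debug checks (and without being traced itself). */
        static notrace struct ftrace_func_entry *
        lookup_ip_notrace(struct hlist_head *hhd, unsigned long ip)
        {
                struct ftrace_func_entry *entry;

                hlist_for_each_entry_rcu_notrace(entry, hhd, hlist) {
                        if (entry->ip == ip)
                                return entry;
                }
                return NULL;
        }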


Please pull the latest trace-fixes-v3.10-rc3-v2 tree, which can be found at:

  git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git
trace-fixes-v3.10-rc3-v2

Tag SHA1: 066277682313adbca025f177cff8a211f2b4f46d
Head SHA1: 1bb539ca36e21c2f4fce0865e11df384bc7b7656


Steven Rostedt (2):
      rcu: Add _notrace variation of rcu_dereference_raw() and hlist_for_each_entry_rcu()
      ftrace: Use the rcu _notrace variants for rcu_dereference_raw() and friends

----
 include/linux/rculist.h  |   20 ++++++++++++++++++++
 include/linux/rcupdate.h |    9 +++++++++
 kernel/trace/ftrace.c    |   18 +++++++++---------
 3 files changed, 38 insertions(+), 9 deletions(-)


* [PATCH 1/2] rcu: Add _notrace variation of rcu_dereference_raw() and hlist_for_each_entry_rcu()
From: Steven Rostedt @ 2013-05-29 18:53 UTC
  To: linux-kernel; +Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, Paul E. McKenney

From: Steven Rostedt <rostedt@goodmis.org>

As rcu_dereference_raw() under RCU debug config options can add quite
a few checks, and tracing uses rcu_dereference_raw(), these checks
happen within the function tracer. The function tracer also happens to
trace these debug checks themselves. This added overhead can livelock
the system.

Add new interfaces to RCU for both rcu_dereference_raw_notrace() and
hlist_for_each_entry_rcu_notrace(), as the hlist iterator uses
rcu_dereference_raw() as well, and is used quite a bit by the function
tracer.

Link: http://lkml.kernel.org/r/20130528184209.304356745@goodmis.org

Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 include/linux/rculist.h  |   20 ++++++++++++++++++++
 include/linux/rcupdate.h |    9 +++++++++
 2 files changed, 29 insertions(+)

diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 8089e35..f4b1001 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -461,6 +461,26 @@ static inline void hlist_add_after_rcu(struct hlist_node *prev,
 			&(pos)->member)), typeof(*(pos)), member))
 
 /**
+ * hlist_for_each_entry_rcu_notrace - iterate over rcu list of given type (for tracing)
+ * @pos:	the type * to use as a loop cursor.
+ * @head:	the head for your list.
+ * @member:	the name of the hlist_node within the struct.
+ *
+ * This list-traversal primitive may safely run concurrently with
+ * the _rcu list-mutation primitives such as hlist_add_head_rcu()
+ * as long as the traversal is guarded by rcu_read_lock().
+ *
+ * This is the same as hlist_for_each_entry_rcu() except that it does
+ * not do any RCU debugging or tracing.
+ */
+#define hlist_for_each_entry_rcu_notrace(pos, head, member)			\
+	for (pos = hlist_entry_safe (rcu_dereference_raw_notrace(hlist_first_rcu(head)),\
+			typeof(*(pos)), member);			\
+		pos;							\
+		pos = hlist_entry_safe(rcu_dereference_raw_notrace(hlist_next_rcu(\
+			&(pos)->member)), typeof(*(pos)), member))
+
+/**
  * hlist_for_each_entry_rcu_bh - iterate over rcu list of given type
  * @pos:	the type * to use as a loop cursor.
  * @head:	the head for your list.
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 4ccd68e..ddcc782 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -640,6 +640,15 @@ static inline void rcu_preempt_sleep_check(void)
 
 #define rcu_dereference_raw(p) rcu_dereference_check(p, 1) /*@@@ needed? @@@*/
 
+/*
+ * The tracing infrastructure traces RCU (we want that), but unfortunately
+ * some of the RCU checks causes tracing to lock up the system.
+ *
+ * The tracing version of rcu_dereference_raw() must not call
+ * rcu_read_lock_held().
+ */
+#define rcu_dereference_raw_notrace(p) __rcu_dereference_check((p), 1, __rcu)
+
 /**
  * rcu_access_index() - fetch RCU index with no dereferencing
  * @p: The index to read
-- 
1.7.10.4




* [PATCH 2/2] ftrace: Use the rcu _notrace variants for rcu_dereference_raw() and friends
From: Steven Rostedt @ 2013-05-29 18:53 UTC
  To: linux-kernel; +Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, Paul E. McKenney

From: Steven Rostedt <rostedt@goodmis.org>

As rcu_dereference_raw() under RCU debug config options can add quite
a few checks, and tracing uses rcu_dereference_raw(), these checks
happen within the function tracer. The function tracer also happens to
trace these debug checks themselves. This added overhead can livelock
the system.

Have the function tracer use the new RCU _notrace equivalents, which
do not do the debug checks for RCU.

Link: http://lkml.kernel.org/r/20130528184209.467603904@goodmis.org

Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/trace/ftrace.c |   18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index b549b0f..6c508ff 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -120,22 +120,22 @@ static void ftrace_ops_no_ops(unsigned long ip, unsigned long parent_ip);
 
 /*
  * Traverse the ftrace_global_list, invoking all entries.  The reason that we
- * can use rcu_dereference_raw() is that elements removed from this list
+ * can use rcu_dereference_raw_notrace() is that elements removed from this list
  * are simply leaked, so there is no need to interact with a grace-period
- * mechanism.  The rcu_dereference_raw() calls are needed to handle
+ * mechanism.  The rcu_dereference_raw_notrace() calls are needed to handle
  * concurrent insertions into the ftrace_global_list.
  *
  * Silly Alpha and silly pointer-speculation compiler optimizations!
  */
 #define do_for_each_ftrace_op(op, list)			\
-	op = rcu_dereference_raw(list);			\
+	op = rcu_dereference_raw_notrace(list);			\
 	do
 
 /*
  * Optimized for just a single item in the list (as that is the normal case).
  */
 #define while_for_each_ftrace_op(op)				\
-	while (likely(op = rcu_dereference_raw((op)->next)) &&	\
+	while (likely(op = rcu_dereference_raw_notrace((op)->next)) &&	\
 	       unlikely((op) != &ftrace_list_end))
 
 static inline void ftrace_ops_init(struct ftrace_ops *ops)
@@ -779,7 +779,7 @@ ftrace_find_profiled_func(struct ftrace_profile_stat *stat, unsigned long ip)
 	if (hlist_empty(hhd))
 		return NULL;
 
-	hlist_for_each_entry_rcu(rec, hhd, node) {
+	hlist_for_each_entry_rcu_notrace(rec, hhd, node) {
 		if (rec->ip == ip)
 			return rec;
 	}
@@ -1165,7 +1165,7 @@ ftrace_lookup_ip(struct ftrace_hash *hash, unsigned long ip)
 
 	hhd = &hash->buckets[key];
 
-	hlist_for_each_entry_rcu(entry, hhd, hlist) {
+	hlist_for_each_entry_rcu_notrace(entry, hhd, hlist) {
 		if (entry->ip == ip)
 			return entry;
 	}
@@ -1422,8 +1422,8 @@ ftrace_ops_test(struct ftrace_ops *ops, unsigned long ip)
 	struct ftrace_hash *notrace_hash;
 	int ret;
 
-	filter_hash = rcu_dereference_raw(ops->filter_hash);
-	notrace_hash = rcu_dereference_raw(ops->notrace_hash);
+	filter_hash = rcu_dereference_raw_notrace(ops->filter_hash);
+	notrace_hash = rcu_dereference_raw_notrace(ops->notrace_hash);
 
 	if ((ftrace_hash_empty(filter_hash) ||
 	     ftrace_lookup_ip(filter_hash, ip)) &&
@@ -2920,7 +2920,7 @@ static void function_trace_probe_call(unsigned long ip, unsigned long parent_ip,
 	 * on the hash. rcu_read_lock is too dangerous here.
 	 */
 	preempt_disable_notrace();
-	hlist_for_each_entry_rcu(entry, hhd, node) {
+	hlist_for_each_entry_rcu_notrace(entry, hhd, node) {
 		if (entry->ip == ip)
 			entry->ops->func(ip, parent_ip, &entry->data);
 	}
-- 
1.7.10.4




* Re: [PATCH 0/2] [GIT PULL] rcu/ftrace: Fix livelock from overhead of RCU debugging
From: Steven Rostedt @ 2013-06-03 16:02 UTC
  To: linux-kernel; +Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, Paul E. McKenney

On Wed, 2013-05-29 at 14:53 -0400, Steven Rostedt wrote:
> Linus,
> 
> [ If you don't care to read the background of these patches,
>   just jump down to the "Please pull" line. I included this
>   description in the tag as well. ]

Hi Linus,

I noticed that this hasn't been pulled yet. Is there any issue with it,
or has it just been missed due to your travels? I have a few more
patches in testing that I built on top of these commits, so I'll have
another pull request today or tomorrow.

These changes touch RCU, but they have been acked by Paul.

Thanks,

-- Steve


* Re: [PATCH 2/2] ftrace: Use the rcu _notrace variants for rcu_dereference_raw() and friends
From: Paul E. McKenney @ 2013-05-29  1:53 UTC
  To: Steven Rostedt; +Cc: linux-kernel, Ingo Molnar, Andrew Morton

On Tue, May 28, 2013 at 02:38:43PM -0400, Steven Rostedt wrote:
> As rcu_dereference_raw() under RCU debug config options can add quite
> a few checks, and tracing uses rcu_dereference_raw(), these checks
> happen within the function tracer. The function tracer also happens to
> trace these debug checks themselves. This added overhead can livelock
> the system.
> 
> Have the function tracer use the new RCU _notrace equivalents, which
> do not do the debug checks for RCU.
> 
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

Looks good to me!

Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>


* [PATCH 2/2] ftrace: Use the rcu _notrace variants for rcu_dereference_raw() and friends
From: Steven Rostedt @ 2013-05-28 18:38 UTC
  To: linux-kernel; +Cc: Paul E. McKenney, Ingo Molnar, Andrew Morton


As rcu_dereference_raw() under RCU debug config options can add quite
a few checks, and tracing uses rcu_dereference_raw(), these checks
happen within the function tracer. The function tracer also happens to
trace these debug checks themselves. This added overhead can livelock
the system.

Have the function tracer use the new RCU _notrace equivalents, which
do not do the debug checks for RCU.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

Index: linux-trace.git/kernel/trace/ftrace.c
===================================================================
--- linux-trace.git.orig/kernel/trace/ftrace.c
+++ linux-trace.git/kernel/trace/ftrace.c
@@ -120,22 +120,22 @@ static void ftrace_ops_no_ops(unsigned l
 
 /*
  * Traverse the ftrace_global_list, invoking all entries.  The reason that we
- * can use rcu_dereference_raw() is that elements removed from this list
+ * can use rcu_dereference_raw_notrace() is that elements removed from this list
  * are simply leaked, so there is no need to interact with a grace-period
- * mechanism.  The rcu_dereference_raw() calls are needed to handle
+ * mechanism.  The rcu_dereference_raw_notrace() calls are needed to handle
  * concurrent insertions into the ftrace_global_list.
  *
  * Silly Alpha and silly pointer-speculation compiler optimizations!
  */
 #define do_for_each_ftrace_op(op, list)			\
-	op = rcu_dereference_raw(list);			\
+	op = rcu_dereference_raw_notrace(list);			\
 	do
 
 /*
  * Optimized for just a single item in the list (as that is the normal case).
  */
 #define while_for_each_ftrace_op(op)				\
-	while (likely(op = rcu_dereference_raw((op)->next)) &&	\
+	while (likely(op = rcu_dereference_raw_notrace((op)->next)) &&	\
 	       unlikely((op) != &ftrace_list_end))
 
 static inline void ftrace_ops_init(struct ftrace_ops *ops)
@@ -779,7 +779,7 @@ ftrace_find_profiled_func(struct ftrace_
 	if (hlist_empty(hhd))
 		return NULL;
 
-	hlist_for_each_entry_rcu(rec, hhd, node) {
+	hlist_for_each_entry_rcu_notrace(rec, hhd, node) {
 		if (rec->ip == ip)
 			return rec;
 	}
@@ -1165,7 +1165,7 @@ ftrace_lookup_ip(struct ftrace_hash *has
 
 	hhd = &hash->buckets[key];
 
-	hlist_for_each_entry_rcu(entry, hhd, hlist) {
+	hlist_for_each_entry_rcu_notrace(entry, hhd, hlist) {
 		if (entry->ip == ip)
 			return entry;
 	}
@@ -1422,8 +1422,8 @@ ftrace_ops_test(struct ftrace_ops *ops, 
 	struct ftrace_hash *notrace_hash;
 	int ret;
 
-	filter_hash = rcu_dereference_raw(ops->filter_hash);
-	notrace_hash = rcu_dereference_raw(ops->notrace_hash);
+	filter_hash = rcu_dereference_raw_notrace(ops->filter_hash);
+	notrace_hash = rcu_dereference_raw_notrace(ops->notrace_hash);
 
 	if ((ftrace_hash_empty(filter_hash) ||
 	     ftrace_lookup_ip(filter_hash, ip)) &&
@@ -2920,7 +2920,7 @@ static void function_trace_probe_call(un
 	 * on the hash. rcu_read_lock is too dangerous here.
 	 */
 	preempt_disable_notrace();
-	hlist_for_each_entry_rcu(entry, hhd, node) {
+	hlist_for_each_entry_rcu_notrace(entry, hhd, node) {
 		if (entry->ip == ip)
 			entry->ops->func(ip, parent_ip, &entry->data);
 	}


