All of lore.kernel.org
 help / color / mirror / Atom feed
From: Steven Rostedt <rostedt@goodmis.org>
To: linux-kernel@vger.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Ingo Molnar <mingo@kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Subject: [PATCH 0/2] [GIT PULL] rcu/ftrace: Fix livelock from overhead of RCU debugging
Date: Wed, 29 May 2013 14:53:00 -0400	[thread overview]
Message-ID: <20130529185300.403107512@goodmis.org> (raw)

[-- Attachment #1: Type: text/plain, Size: 8191 bytes --]


Linus,

[ If you don't care to read the background of these patches,
  just jump down to the "Please pull" line. I included this
  description in the tag as well. ]

I was testing a config for one bug and triggered a livelock lockup. A
NMI watchdog dump showed what was happening:

[   65.972000] NMI backtrace for cpu 2
[   65.972000] CPU: 2 PID: 48 Comm: kworker/2:1 Not tainted 3.10.0-rc2-test+ #151
[   65.972000] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
[   65.972000] task: ffff880078dc2140 ti: ffff880078dc4000 task.ti: ffff880078dc4000
[   65.972000] RIP: 0010:[<ffffffff810c2259>]  [<ffffffff810c2259>] ftrace_ops_list_func+0x144/0x151
[   65.972000] RSP: 0000:ffff880078dc5ac8  EFLAGS: 00000006
[   65.972000] RAX: 0000000000008000 RBX: 0000000000000001 RCX: 000000000000000f
[   65.972000] RDX: 0000000000000002 RSI: ffffffff810c0e9f RDI: ffffffff810b73f7
[   65.972000] RBP: ffff880078dc5af8 R08: ffff880078dc5be0 R09: ffff880078dc5bb0
[   65.972000] R10: ffffffff81a3c730 R11: 0000000000000001 R12: 0000000000008000
[   65.972000] R13: ffffffff810b73f7 R14: 0000000000000000 R15: ffffffff810c0e9f
[   65.972000] FS:  0000000000000000(0000) GS:ffff88007aa00000(0000) knlGS:0000000000000000
[   65.972000] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   65.972000] CR2: 0000000000000168 CR3: 0000000001a0b000 CR4: 00000000000007e0
[   65.972000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   65.972000] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   65.972000] Stack:
[   65.972000]  ffff880078d97480 0000000000000001 ffffffff810b9148 ffff880078d97480
[   65.972000]  0000000000000000 ffffffff810bb6ba ffff880078dc5bd0 ffffffff814fa7c0
[   65.972000]  0000000000000000 ffffffff810bb6ba ffff880078dc5c00 ffffffff814fa7c0
[   65.972000] Call Trace:
[   65.972000]  [<ffffffff810b9148>] ? check_for_new_grace_period+0x115/0x115
[   65.972000]  [<ffffffff810bb6ba>] ? rcu_note_context_switch+0x20e/0x260
[   65.972000]  [<ffffffff814fa7c0>] ftrace_call+0x5/0x2f
[   65.972000]  [<ffffffff810bb6ba>] ? rcu_note_context_switch+0x20e/0x260
[   65.972000]  [<ffffffff814fa7c0>] ? ftrace_call+0x5/0x2f
[   65.972000]  [<ffffffff810b73fc>] ? rcu_is_cpu_idle+0x5/0x69
[   65.972000]  [<ffffffff810558d3>] ? debug_lockdep_rcu_enabled+0x5/0x35
[   65.972000]  [<ffffffff810b73fc>] ? rcu_is_cpu_idle+0x5/0x69
[   65.972000]  [<ffffffff810c0e9f>] ? rcu_read_lock_held+0x19/0x3c
[   65.972000]  [<ffffffff810c0f90>] ftrace_ops_test+0x47/0x93
[   65.972000]  [<ffffffff810b9148>] ? check_for_new_grace_period+0x115/0x115
[   65.972000]  [<ffffffff810c21e1>] ftrace_ops_list_func+0xcc/0x151
[   65.972000]  [<ffffffff814fa7c0>] ftrace_call+0x5/0x2f
[   65.972000]  [<ffffffff810c2234>] ? ftrace_ops_list_func+0x11f/0x151
[   65.972000]  [<ffffffff810bb6b2>] ? rcu_note_context_switch+0x206/0x260
[   67.972000]  [<ffffffff810b914d>] ? rcu_preempt_qs+0x5/0x53
[   65.972000]  [<ffffffff81086f41>] ? trace_hardirqs_off+0xd/0xf
[   65.972000]  [<ffffffff810b914d>] ? rcu_preempt_qs+0x5/0x53
[   65.972000]  [<ffffffff810bb6ba>] ? rcu_note_context_switch+0x20e/0x260
[   65.972000]  [<ffffffff814f2bca>] ? schedule+0x64/0x66
[   65.972000]  [<ffffffff814f246e>] __schedule+0x67/0x710
[   65.972000]  [<ffffffff81052e5e>] ? worker_thread+0x1e9/0x1f3
[   65.972000]  [<ffffffff81052e5e>] ? worker_thread+0x1e9/0x1f3
[   65.972000]  [<ffffffff814f2bca>] schedule+0x64/0x66
[   65.972000]  [<ffffffff81052e63>] worker_thread+0x1ee/0x1f3
[   65.972000]  [<ffffffff81052c75>] ? rescuer_thread+0x26c/0x26c
[   65.972000]  [<ffffffff8105a8bf>] kthread+0xb1/0xb9
[   65.972000]  [<ffffffff81064b3c>] ? finish_task_switch+0x7c/0xf8
[   65.972000]  [<ffffffff8105a80e>] ? __init_kthread_worker+0x59/0x59
[   65.972000]  [<ffffffff814faadc>] ret_from_fork+0x7c/0xb0
[   65.972000]  [<ffffffff8105a80e>] ? __init_kthread_worker+0x59/0x59

What happened was that ftrace's function tracer uses
hlist_for_each_entry_rcu() which uses rcu_dereference_raw(), and also
calls rcu_dereference_raw() directly. As this macro is defined as
rcu_dereference_check() and will call rcu_read_lock_sched_held() which
does various rcu checks as well as lockdep and such, when it happens for
*every* function call in the kernel it slows the system down where it
can live lock (interrupts taking too long to finish, before another
interrupt is triggered).

To give an idea of the abuse that function tracing is taking, every
function that is traced can do the following:

in __ftrace_ops_list_func() we have:

        do_for_each_ftrace_op(op, ftrace_ops_list) {
                if (ftrace_ops_test(op, ip))
                        op->func(ip, parent_ip, op, regs);
        } while_for_each_ftrace_op(op);

Where:

#define do_for_each_ftrace_op(op, list)                 \
        op = rcu_dereference_raw(list);                 \
        do

and

#define while_for_each_ftrace_op(op)                            \
        while (likely(op = rcu_dereference_raw((op)->next)) &&  \
               unlikely((op) != &ftrace_list_end))

And ftrace_ops_test() has:

        filter_hash = rcu_dereference_raw(ops->filter_hash);
        notrace_hash = rcu_dereference_raw(ops->notrace_hash);

        if ((ftrace_hash_empty(filter_hash) ||
             ftrace_lookup_ip(filter_hash, ip)) &&
            (ftrace_hash_empty(notrace_hash) ||
             !ftrace_lookup_ip(notrace_hash, ip)))
                ret = 1;

Where ftrace_lookup_ip() has:

        hlist_for_each_entry_rcu(entry, hhd, hlist) {

Now that is 6 calls to rcu_dereference_raw() that happens for every
single function that is traced! And this doesn't count possibly more by
the callbacks that were registered by the function tracer, or multiple
loops by while_for_each_ftrace_op(). When rcu debugging is on, that's 6
calls to rcu_read_lock_sched_held() which not only does several calls
that is traced by the function tracer (luckily the recursion protection
catches those), but those calls also call preempt_disable/enable (that
gets traced by the function tracer). Worse yet, lock_is_held() is called
that disables and then re-enables interrupts, and does a search for
locks held.

All this overhead that is done for every function call that is traced
(which could be most of the functions in the kernel) and does it at
least 6 times. This can drastically slow things down so much that we get
a live lock in the system.

Now, we could put a bunch of notrace in the RCU code, but as RCU is a
more critical subsystem than tracing, and used more often, I find that
it is advantageous to be able to trace RCU. RCU is also a more complex
subsystem than tracing, and having that ability to trace RCU is an asset
to help debug it and give us better confidence that it works. Thus, I
believe it is better to remove some of the debugability from the tracing
infrastructure than to remove it from RCU. Not to mention, even
eliminating the recursive calls by the RCU being traced, it doesn't
remove the RCU functions being called or the lockdep function being
called 6 times for every function traced.

After discussing this with Paul McKenney, I added two new RCU interfaces:

 rcu_dereference_raw_notrace()
 hlist_for_each_entry_raw_notrace()

The function tracer now uses these and this prevents the live lock.


Please pull the latest trace-fixes-v3.10-rc3-v2 tree, which can be found at:

  git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git
trace-fixes-v3.10-rc3-v2

Tag SHA1: 066277682313adbca025f177cff8a211f2b4f46d
Head SHA1: 1bb539ca36e21c2f4fce0865e11df384bc7b7656


Steven Rostedt (2):
      rcu: Add _notrace variation of rcu_dereference_raw() and hlist_for_each_entry_rcu()
      ftrace: Use the rcu _notrace variants for rcu_dereference_raw() and friends

----
 include/linux/rculist.h  |   20 ++++++++++++++++++++
 include/linux/rcupdate.h |    9 +++++++++
 kernel/trace/ftrace.c    |   18 +++++++++---------
 3 files changed, 38 insertions(+), 9 deletions(-)

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 490 bytes --]

             reply	other threads:[~2013-05-29 18:54 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-05-29 18:53 Steven Rostedt [this message]
2013-05-29 18:53 ` [PATCH 1/2] rcu: Add _notrace variation of rcu_dereference_raw() and hlist_for_each_entry_rcu() Steven Rostedt
2013-05-29 18:53 ` [PATCH 2/2] ftrace: Use the rcu _notrace variants for rcu_dereference_raw() and friends Steven Rostedt
2013-06-03 16:02 ` [PATCH 0/2] [GIT PULL] rcu/ftrace: Fix livelock from overhead of RCU debugging Steven Rostedt

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20130529185300.403107512@goodmis.org \
    --to=rostedt@goodmis.org \
    --cc=akpm@linux-foundation.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@kernel.org \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.