* [RFC][PATCH] ftrace: Use schedule_on_each_cpu() as a heavy synchronize_sched()
@ 2013-05-29  0:01 Steven Rostedt
  2013-05-29  7:52 ` Peter Zijlstra
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Steven Rostedt @ 2013-05-29  0:01 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: LKML, Tejun Heo, Ingo Molnar, Frederic Weisbecker,
	Peter Zijlstra, Jiri Olsa

The function tracer uses preempt_disable/enable_notrace() for
synchronization between reading registered ftrace_ops and unregistering
them.

Most of the ftrace_ops are global permanent structures that do not
require this synchronization. That is, ops may be added to and removed
from the hlist but are never freed, and it won't hurt if a
synchronization is missed.

But this is not true for dynamically created ftrace_ops or control_ops,
which are used by perf function tracing.

The problem here is that the function tracer can be used to trace
kernel/user context switches as well as entering and leaving idle.
Basically, it can be used to trace blind spots of the RCU subsystem.
This means that even though preempt_disable() is done, a
synchronize_sched() will ignore CPUs that haven't made it out of user
space or idle. This can affect functions that are traced just before
entering or just after exiting those kernel sections.

To implement the RCU synchronization, schedule_on_each_cpu() is used
instead of synchronize_sched(). This means that when a dynamically
allocated ftrace_ops or a control ops is being unregistered, all CPUs
must be touched and must execute the ftrace_sync() stub function via
the work queues. This rips CPUs out of idle or dynamic tick mode. It
only happens when a user disables perf function tracing or another
dynamically allocated function tracer, but it allows us to continue to
debug RCU and context tracking with function tracing.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

Index: linux-trace.git/kernel/trace/ftrace.c
===================================================================
--- linux-trace.git.orig/kernel/trace/ftrace.c
+++ linux-trace.git/kernel/trace/ftrace.c
@@ -413,6 +413,17 @@ static int __register_ftrace_function(st
 	return 0;
 }
 
+static void ftrace_sync(struct work_struct *work)
+{
+	/*
+	 * This function is just a stub to implement a hard force
+	 * of synchronize_sched(). This requires synchronizing
+	 * tasks even in userspace and idle.
+	 *
+	 * Yes, function tracing is rude.
+	 */
+}
+
 static int __unregister_ftrace_function(struct ftrace_ops *ops)
 {
 	int ret;
@@ -440,8 +451,12 @@ static int __unregister_ftrace_function(
 			 * so there'll be no new users. We must ensure
 			 * all current users are done before we free
 			 * the control data.
+			 * Note synchronize_sched() is not enough, as we
+			 * use preempt_disable() to do RCU, but the function
+			 * tracer can be called where RCU is not active
+			 * (before user_exit()).
 			 */
-			synchronize_sched();
+			schedule_on_each_cpu(ftrace_sync);
 			control_ops_free(ops);
 		}
 	} else
@@ -456,9 +471,13 @@ static int __unregister_ftrace_function(
 	/*
 	 * Dynamic ops may be freed, we must make sure that all
 	 * callers are done before leaving this function.
+	 *
+	 * Again, normal synchronize_sched() is not good enough.
+	 * We need to do a hard force of sched synchronization.
 	 */
 	if (ops->flags & FTRACE_OPS_FL_DYNAMIC)
-		synchronize_sched();
+		schedule_on_each_cpu(ftrace_sync);
+
 
 	return 0;
 }
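
The blind spot the changelog describes can be sketched as a small
userspace C model (everything here, the two-CPU array, the
rcu_watching/in_read_section flags, and the model_* helpers, is an
invention of this sketch, not kernel code; it only encodes the argument
above):

```c
#include <assert.h>
#include <stdbool.h>

#define NCPUS 2

/* Is RCU watching this CPU?  CPU 1 models the gray area: it is still
 * executing traced kernel code just before user_exit() (or just after
 * rcu_idle_enter()), so RCU already believes it is quiesced. */
static bool rcu_watching[NCPUS] = { true, false };

/* Is this CPU actually inside ftrace's preempt-disabled read section? */
static bool in_read_section[NCPUS] = { false, false };

/* Model of synchronize_sched(): it only waits for CPUs that RCU is
 * watching, so a reader in the gray area slips through.  Returns
 * false when freeing the ops afterwards would be a use-after-free. */
static bool model_synchronize_sched_is_safe(void)
{
	for (int cpu = 0; cpu < NCPUS; cpu++)
		if (in_read_section[cpu] && !rcu_watching[cpu])
			return false;	/* reader in the blind spot */
	return true;
}

/* Model of schedule_on_each_cpu(ftrace_sync): the stub work item must
 * run on every CPU regardless of what RCU thinks, and it cannot run
 * until that CPU has left its read section, so no reader is missed. */
static bool model_schedule_on_each_cpu_is_safe(void)
{
	return true;
}
```

The asymmetry is the whole point: a grace period keyed off what RCU is
watching can end while an unwatched reader is still inside its read
section, whereas a work item forced onto every CPU cannot complete
until that CPU's reader is done.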



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC][PATCH] ftrace: Use schedule_on_each_cpu() as a heavy synchronize_sched()
  2013-05-29  0:01 [RFC][PATCH] ftrace: Use schedule_on_each_cpu() as a heavy synchronize_sched() Steven Rostedt
@ 2013-05-29  7:52 ` Peter Zijlstra
  2013-05-29 13:33   ` Paul E. McKenney
                     ` (2 more replies)
  2013-05-29  8:23 ` Paul E. McKenney
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 14+ messages in thread
From: Peter Zijlstra @ 2013-05-29  7:52 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Paul E. McKenney, LKML, Tejun Heo, Ingo Molnar,
	Frederic Weisbecker, Jiri Olsa

On Tue, May 28, 2013 at 08:01:16PM -0400, Steven Rostedt wrote:
> The function tracer uses preempt_disable/enable_notrace() for
> synchronization between reading registered ftrace_ops and unregistering
> them.
> 
> Most of the ftrace_ops are global permanent structures that do not
> require this synchronization. That is, ops may be added and removed from
> the hlist but are never freed, and wont hurt if a synchronization is
> missed.
> 
> But this is not true for dynamically created ftrace_ops or control_ops,
> which are used by the perf function tracing.
> 
> The problem here is that the function tracer can be used to trace
> kernel/user context switches as well as going to and from idle.
> Basically, it can be used to trace blind spots of the RCU subsystem.
> This means that even though preempt_disable() is done, a
> synchronize_sched() will ignore CPUs that haven't made it out of user
> space or idle. These can include functions that are being traced just
> before entering or exiting the kernel sections.

Just to be clear, it's the idle part that's a problem, right? Being stuck
in userspace isn't a problem since if that CPU is in userspace it
certainly hasn't got a reference to whatever list entry we're removing.

Now when the CPU really is idle, it's obviously not using tracing either;
so only the gray area where RCU thinks we're idle but we're not actually
idle is a problem?

Is there something a little smarter we can do? Could we use
on_each_cpu_cond() with a function that checks if the CPU really is
fully idle?

> To implement the RCU synchronization, instead of using
> synchronize_sched() the use of schedule_on_each_cpu() is performed. This
> means that when a dynamically allocated ftrace_ops, or a control ops is
> being unregistered, all CPUs must be touched and execute a ftrace_sync()
> stub function via the work queues. This will rip CPUs out from idle or
> in dynamic tick mode. This only happens when a user disables perf
> function tracing or other dynamically allocated function tracers, but it
> allows us to continue to debug RCU and context tracking with function
> tracing.

I don't suppose there's anything perf can do about this, right? Since
it's all on user demand we're kinda stuck with dynamic memory.



* Re: [RFC][PATCH] ftrace: Use schedule_on_each_cpu() as a heavy synchronize_sched()
  2013-05-29  0:01 [RFC][PATCH] ftrace: Use schedule_on_each_cpu() as a heavy synchronize_sched() Steven Rostedt
  2013-05-29  7:52 ` Peter Zijlstra
@ 2013-05-29  8:23 ` Paul E. McKenney
  2013-06-04 11:03 ` Frederic Weisbecker
  2013-06-05 11:51 ` Peter Zijlstra
  3 siblings, 0 replies; 14+ messages in thread
From: Paul E. McKenney @ 2013-05-29  8:23 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Tejun Heo, Ingo Molnar, Frederic Weisbecker,
	Peter Zijlstra, Jiri Olsa

On Tue, May 28, 2013 at 08:01:16PM -0400, Steven Rostedt wrote:
> The function tracer uses preempt_disable/enable_notrace() for
> synchronization between reading registered ftrace_ops and unregistering
> them.
> 
> Most of the ftrace_ops are global permanent structures that do not
> require this synchronization. That is, ops may be added and removed from
> the hlist but are never freed, and wont hurt if a synchronization is
> missed.
> 
> But this is not true for dynamically created ftrace_ops or control_ops,
> which are used by the perf function tracing.
> 
> The problem here is that the function tracer can be used to trace
> kernel/user context switches as well as going to and from idle.
> Basically, it can be used to trace blind spots of the RCU subsystem.
> This means that even though preempt_disable() is done, a
> synchronize_sched() will ignore CPUs that haven't made it out of user
> space or idle. These can include functions that are being traced just
> before entering or exiting the kernel sections.
> 
> To implement the RCU synchronization, instead of using
> synchronize_sched() the use of schedule_on_each_cpu() is performed. This
> means that when a dynamically allocated ftrace_ops, or a control ops is
> being unregistered, all CPUs must be touched and execute a ftrace_sync()
> stub function via the work queues. This will rip CPUs out from idle or
> in dynamic tick mode. This only happens when a user disables perf
> function tracing or other dynamically allocated function tracers, but it
> allows us to continue to debug RCU and context tracking with function
> tracing.
> 
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

> Index: linux-trace.git/kernel/trace/ftrace.c
> ===================================================================
> --- linux-trace.git.orig/kernel/trace/ftrace.c
> +++ linux-trace.git/kernel/trace/ftrace.c
> @@ -413,6 +413,17 @@ static int __register_ftrace_function(st
>  	return 0;
>  }
> 
> +static void ftrace_sync(struct work_struct *work)
> +{
> +	/*
> +	 * This function is just a stub to implement a hard force
> +	 * of synchronize_sched(). This requires synchronizing
> +	 * tasks even in userspace and idle.
> +	 *
> +	 * Yes, function tracing is rude.
> +	 */
> +}
> +
>  static int __unregister_ftrace_function(struct ftrace_ops *ops)
>  {
>  	int ret;
> @@ -440,8 +451,12 @@ static int __unregister_ftrace_function(
>  			 * so there'll be no new users. We must ensure
>  			 * all current users are done before we free
>  			 * the control data.
> +			 * Note synchronize_sched() is not enough, as we
> +			 * use preempt_disable() to do RCU, but the function
> +			 * tracer can be called where RCU is not active
> +			 * (before user_exit()).
>  			 */
> -			synchronize_sched();
> +			schedule_on_each_cpu(ftrace_sync);
>  			control_ops_free(ops);
>  		}
>  	} else
> @@ -456,9 +471,13 @@ static int __unregister_ftrace_function(
>  	/*
>  	 * Dynamic ops may be freed, we must make sure that all
>  	 * callers are done before leaving this function.
> +	 *
> +	 * Again, normal synchronize_sched() is not good enough.
> +	 * We need to do a hard force of sched synchronization.
>  	 */
>  	if (ops->flags & FTRACE_OPS_FL_DYNAMIC)
> -		synchronize_sched();
> +		schedule_on_each_cpu(ftrace_sync);
> +
> 
>  	return 0;
>  }
> 
> 



* Re: [RFC][PATCH] ftrace: Use schedule_on_each_cpu() as a heavy synchronize_sched()
  2013-05-29  7:52 ` Peter Zijlstra
@ 2013-05-29 13:33   ` Paul E. McKenney
  2013-05-29 13:55     ` Steven Rostedt
  2013-05-29 13:41   ` Steven Rostedt
  2014-06-19  1:56   ` Steven Rostedt
  2 siblings, 1 reply; 14+ messages in thread
From: Paul E. McKenney @ 2013-05-29 13:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, LKML, Tejun Heo, Ingo Molnar,
	Frederic Weisbecker, Jiri Olsa

On Wed, May 29, 2013 at 09:52:49AM +0200, Peter Zijlstra wrote:
> On Tue, May 28, 2013 at 08:01:16PM -0400, Steven Rostedt wrote:
> > The function tracer uses preempt_disable/enable_notrace() for
> > synchronization between reading registered ftrace_ops and unregistering
> > them.
> > 
> > Most of the ftrace_ops are global permanent structures that do not
> > require this synchronization. That is, ops may be added and removed from
> > the hlist but are never freed, and wont hurt if a synchronization is
> > missed.
> > 
> > But this is not true for dynamically created ftrace_ops or control_ops,
> > which are used by the perf function tracing.
> > 
> > The problem here is that the function tracer can be used to trace
> > kernel/user context switches as well as going to and from idle.
> > Basically, it can be used to trace blind spots of the RCU subsystem.
> > This means that even though preempt_disable() is done, a
> > synchronize_sched() will ignore CPUs that haven't made it out of user
> > space or idle. These can include functions that are being traced just
> > before entering or exiting the kernel sections.
> 
> Just to be clear, its the idle part that's a problem, right? Being stuck
> in userspace isn't a problem since if that CPU is in userspace its
> certainly not got a reference to whatever list entry we're removing.

You got it!  The problem is the exact definition of "idle".  The way that
it works now is that the idle loop tells RCU when idle starts and ends
by invoking rcu_idle_enter() and rcu_idle_exit(), respectively.  Right
now, these calls are in the top-level idle loop.  They could in principle
be moved down further, but last time I tried it, it got pretty ugly.

> Now when the CPU really is idle, its obviously not using tracing either;
> so only the gray area where RCU thinks we're idle but we're not actually
> idle is a problem?

Exactly.  And there always will be a grey area, just like the grey area
between being in an interrupt handler and in_irq() knowing about it.

> Is there something a little smarter we can do? Could we use
> on_each_cpu_cond() with a function that checks if the CPU really is
> fully idle?

One recent change that should help is making the _rcuidle variants of
the tracing functions callable from both idle and irq.  To make the
on_each_cpu_cond() approach work, event tracing would need to switch
from RCU (which might be preemptible RCU) to RCU-sched (whose read-side
critical sections can pair with on_each_cpu()).  I have to defer to Steven
on whether this is a good approach.

> > To implement the RCU synchronization, instead of using
> > synchronize_sched() the use of schedule_on_each_cpu() is performed. This
> > means that when a dynamically allocated ftrace_ops, or a control ops is
> > being unregistered, all CPUs must be touched and execute a ftrace_sync()
> > stub function via the work queues. This will rip CPUs out from idle or
> > in dynamic tick mode. This only happens when a user disables perf
> > function tracing or other dynamically allocated function tracers, but it
> > allows us to continue to debug RCU and context tracking with function
> > tracing.
> 
> I don't suppose there's anything perf can do to about this right? Since
> its all on user demand we're kinda stuck with dynamic memory.

I believe that Steven's earlier patch using on_each_cpu() solves this
problem.

							Thanx, Paul



* Re: [RFC][PATCH] ftrace: Use schedule_on_each_cpu() as a heavy synchronize_sched()
  2013-05-29  7:52 ` Peter Zijlstra
  2013-05-29 13:33   ` Paul E. McKenney
@ 2013-05-29 13:41   ` Steven Rostedt
  2014-06-19  1:56   ` Steven Rostedt
  2 siblings, 0 replies; 14+ messages in thread
From: Steven Rostedt @ 2013-05-29 13:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, LKML, Tejun Heo, Ingo Molnar,
	Frederic Weisbecker, Jiri Olsa

On Wed, 2013-05-29 at 09:52 +0200, Peter Zijlstra wrote:

> Just to be clear, its the idle part that's a problem, right?

Actually, it's the userspace part that's triggered the bugs.


>  Being stuck
> in userspace isn't a problem since if that CPU is in userspace its
> certainly not got a reference to whatever list entry we're removing.

The issue is coming out of and going into userspace. For example, we
trace the function user_exit(), which is the function that tells RCU
it's coming back into the kernel. At this point, RCU doesn't keep track
of the preempt disable/enable that's done there.

> 
> Now when the CPU really is idle, its obviously not using tracing either;
> so only the gray area where RCU thinks we're idle but we're not actually
> idle is a problem?

Right. It's going into and coming out of idle or userspace that's the
issue. There's a very small window where RCU is blind to this.

> 
> Is there something a little smarter we can do? Could we use
> on_each_cpu_cond() with a function that checks if the CPU really is
> fully idle?

One thing I thought about doing was both a synchronize_sched() and then
a msleep(5), as a single function trace should never last 5
milliseconds. But to me that's just hacky, and this is the real
solution.

> 
> > To implement the RCU synchronization, instead of using
> > synchronize_sched() the use of schedule_on_each_cpu() is performed. This
> > means that when a dynamically allocated ftrace_ops, or a control ops is
> > being unregistered, all CPUs must be touched and execute a ftrace_sync()
> > stub function via the work queues. This will rip CPUs out from idle or
> > in dynamic tick mode. This only happens when a user disables perf
> > function tracing or other dynamically allocated function tracers, but it
> > allows us to continue to debug RCU and context tracking with function
> > tracing.
> 
> I don't suppose there's anything perf can do to about this right? Since
> its all on user demand we're kinda stuck with dynamic memory.

Right, and perf isn't the only one. SystemTap, LTTng, and even other
parts of ftrace can have this problem. Anyone that does a dynamic
allocation needs a full synchronization.

-- Steve




* Re: [RFC][PATCH] ftrace: Use schedule_on_each_cpu() as a heavy synchronize_sched()
  2013-05-29 13:33   ` Paul E. McKenney
@ 2013-05-29 13:55     ` Steven Rostedt
  0 siblings, 0 replies; 14+ messages in thread
From: Steven Rostedt @ 2013-05-29 13:55 UTC (permalink / raw)
  To: paulmck
  Cc: Peter Zijlstra, LKML, Tejun Heo, Ingo Molnar,
	Frederic Weisbecker, Jiri Olsa

On Wed, 2013-05-29 at 06:33 -0700, Paul E. McKenney wrote:

> > Is there something a little smarter we can do? Could we use
> > on_each_cpu_cond() with a function that checks if the CPU really is
> > fully idle?
> 
> One recent change that should help is making the _rcuidle variants of
> the tracing functions callable from both idle and irq.  To make the
> on_each_cpu_cond() approach work, event tracing would need to switch
> from RCU (which might be preemptible RCU) to RCU-sched (whose read-side
> critical sections can pair with on_each_cpu().  I have to defer to Steven
> on whether this is a good approach.
> 

Just to be clear, the issue is only with the function tracer. This has
nothing to do with trace events, as we have the _rcuidle() variants to
deal with that.

We want the function tracer to trace pretty much everything it can,
especially a complex system like RCU. Thus, I would say that the burden
goes onto the tracing facility to solve this and not prevent tracing
critical parts of RCU.

As you stated with the problem of in_irq(), there's a point where we are
in an interrupt but in_irq() isn't set yet. And this even shows up
in function tracing:

  <idle>-0       0d...   141.555326: function:             smp_apic_timer_interrupt
  <idle>-0       0d...   141.555327: function:                native_apic_mem_write
  <idle>-0       0d...   141.555327: function:                exit_idle
  <idle>-0       0d...   141.555327: function:                irq_enter
  <idle>-0       0d...   141.555327: function:                   rcu_irq_enter
  <idle>-0       0d...   141.555328: function:                   idle_cpu
  <idle>-0       0d.h.   141.555328: function:                   tick_check_idle
  <idle>-0       0d.h.   141.555328: function:                      tick_check_oneshot_broadcast
  <idle>-0       0d.h.   141.555328: function:                      ktime_get

Notice that we traced smp_apic_timer_interrupt, native_apic_mem_write,
exit_idle, irq_enter, and rcu_irq_enter before RCU was even informed
that we were coming out of idle.

Then idle_cpu was also traced before the preempt_count was changed to
note that we are in an interrupt (the 'h' in "0d.h.").

-- Steve




* Re: [RFC][PATCH] ftrace: Use schedule_on_each_cpu() as a heavy synchronize_sched()
  2013-05-29  0:01 [RFC][PATCH] ftrace: Use schedule_on_each_cpu() as a heavy synchronize_sched() Steven Rostedt
  2013-05-29  7:52 ` Peter Zijlstra
  2013-05-29  8:23 ` Paul E. McKenney
@ 2013-06-04 11:03 ` Frederic Weisbecker
  2013-06-04 12:11   ` Steven Rostedt
  2013-06-05 11:51 ` Peter Zijlstra
  3 siblings, 1 reply; 14+ messages in thread
From: Frederic Weisbecker @ 2013-06-04 11:03 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Paul E. McKenney, LKML, Tejun Heo, Ingo Molnar, Peter Zijlstra,
	Jiri Olsa

On Tue, May 28, 2013 at 08:01:16PM -0400, Steven Rostedt wrote:
> The function tracer uses preempt_disable/enable_notrace() for
> synchronization between reading registered ftrace_ops and unregistering
> them.
> 
> Most of the ftrace_ops are global permanent structures that do not
> require this synchronization. That is, ops may be added and removed from
> the hlist but are never freed, and wont hurt if a synchronization is
> missed.
> 
> But this is not true for dynamically created ftrace_ops or control_ops,
> which are used by the perf function tracing.
> 
> The problem here is that the function tracer can be used to trace
> kernel/user context switches as well as going to and from idle.
> Basically, it can be used to trace blind spots of the RCU subsystem.
> This means that even though preempt_disable() is done, a
> synchronize_sched() will ignore CPUs that haven't made it out of user
> space or idle. These can include functions that are being traced just
> before entering or exiting the kernel sections.
> 
> To implement the RCU synchronization, instead of using
> synchronize_sched() the use of schedule_on_each_cpu() is performed. This
> means that when a dynamically allocated ftrace_ops, or a control ops is
> being unregistered, all CPUs must be touched and execute a ftrace_sync()
> stub function via the work queues. This will rip CPUs out from idle or
> in dynamic tick mode. This only happens when a user disables perf
> function tracing or other dynamically allocated function tracers, but it
> allows us to continue to debug RCU and context tracking with function
> tracing.
> 
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

Acked-by: Frederic Weisbecker <fweisbec@gmail.com>

If ftrace were to use rcu_dereference_sched() instead of rcu_dereference_raw(), I guess
the issue would have been detected before. But I suppose we want to avoid that to
prevent tracing recursion?

Thanks.


* Re: [RFC][PATCH] ftrace: Use schedule_on_each_cpu() as a heavy synchronize_sched()
  2013-06-04 11:03 ` Frederic Weisbecker
@ 2013-06-04 12:11   ` Steven Rostedt
  2013-06-04 12:30     ` Frederic Weisbecker
  0 siblings, 1 reply; 14+ messages in thread
From: Steven Rostedt @ 2013-06-04 12:11 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Paul E. McKenney, LKML, Tejun Heo, Ingo Molnar, Peter Zijlstra,
	Jiri Olsa

On Tue, 2013-06-04 at 13:03 +0200, Frederic Weisbecker wrote:

> If ftrace were to use rcu_dereference_sched() instead of rcu_dereference_raw(), I guess
> the issue would have been detected before. But I guess we want to avoid that for
> tracing recursion?

It's been detected before, just ignored, as most of ftrace function
tracing has permanent objects and the synchronization doesn't really
matter for them. But for perf and SystemTap, which use dynamically
created ftrace_ops and need to free them, it does make a difference.
It's something I knew needed to be fixed but never got around to, as
the race is extremely tight (and requires the root user to try to
trigger it). I haven't been able to trigger the race, but in theory
it's there.

Note, the checks that rcu_dereference_sched() performs to test for these
kinds of things would cause ftrace to live-lock the system if it used
them. In fact, I had to add rcu_dereference_raw_notrace() to prevent
ftrace from locking up the system when full RCU debugging is enabled.
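
That recursion can be sketched in userspace (tracer(), checked_deref(),
and the depth cap are inventions of this model; the real trigger is that
the debug checks inside rcu_dereference_sched() are themselves traceable
functions, so the tracer would re-enter itself without bound):

```c
#include <assert.h>

static int trace_calls;
static int depth, max_depth;

static int checked_deref(int *p);

/* Stand-in for the function tracer: every "traced" call lands here,
 * and the tracer itself dereferences an RCU-protected pointer. */
static void tracer(void)
{
	static int hook = 42;

	trace_calls++;
	if (depth < 3)			/* cap the otherwise unbounded recursion */
		checked_deref(&hook);
}

/* Model of rcu_dereference_sched(): its debug check is itself a
 * traceable call, so using it from within the tracer re-enters the
 * tracer. */
static int checked_deref(int *p)
{
	depth++;
	if (depth > max_depth)
		max_depth = depth;
	tracer();			/* the lockdep check would be traced too */
	depth--;
	return *p;
}

/* Model of rcu_dereference_raw_notrace(): no debug hook, no recursion. */
static int raw_deref_notrace(int *p)
{
	return *p;
}
```

With the cap removed, checked_deref() and tracer() would call each other
forever, which is the userspace analogue of the live lock; the unchecked
variant never re-enters the tracer.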

-- Steve



* Re: [RFC][PATCH] ftrace: Use schedule_on_each_cpu() as a heavy synchronize_sched()
  2013-06-04 12:11   ` Steven Rostedt
@ 2013-06-04 12:30     ` Frederic Weisbecker
  0 siblings, 0 replies; 14+ messages in thread
From: Frederic Weisbecker @ 2013-06-04 12:30 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Paul E. McKenney, LKML, Tejun Heo, Ingo Molnar, Peter Zijlstra,
	Jiri Olsa

On Tue, Jun 04, 2013 at 08:11:21AM -0400, Steven Rostedt wrote:
> On Tue, 2013-06-04 at 13:03 +0200, Frederic Weisbecker wrote:
> 
> > If ftrace were to use rcu_dereference_sched() instead of rcu_dereference_raw(), I guess
> > the issue would have been detected before. But I guess we want to avoid that for
> > tracing recursion?
> 
> It's been detected before, just ignored as most of ftrace function
> tracing has permanent objects and the synchronization doesn't really
> matter for them. But for perf and SystemTap that uses dynamically
> created ftrace_ops and needs to free them, it does make a difference.
> Something I knew needed to be fixed but never got around to it as the
> race is extremely tight (and requires root user trying to trigger it). I
> haven't been ably to trigger the race, but in theory it's there.
> 
> Note the checks that rcu_dereference_sched() has to test for these kinds
> of things would cause ftrace to live lock the system if it had used it.
> In fact, I had to make a rcu_dereference_raw_notrace() to prevent ftrace
> locking up the system when full rcu debugging is enabled.

I see.

Thanks!


* Re: [RFC][PATCH] ftrace: Use schedule_on_each_cpu() as a heavy synchronize_sched()
  2013-05-29  0:01 [RFC][PATCH] ftrace: Use schedule_on_each_cpu() as a heavy synchronize_sched() Steven Rostedt
                   ` (2 preceding siblings ...)
  2013-06-04 11:03 ` Frederic Weisbecker
@ 2013-06-05 11:51 ` Peter Zijlstra
  2013-06-05 13:36   ` Steven Rostedt
  3 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2013-06-05 11:51 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Paul E. McKenney, LKML, Tejun Heo, Ingo Molnar,
	Frederic Weisbecker, Jiri Olsa

On Tue, May 28, 2013 at 08:01:16PM -0400, Steven Rostedt wrote:
> The function tracer uses preempt_disable/enable_notrace() for
> synchronization between reading registered ftrace_ops and unregistering
> them.
> 
> Most of the ftrace_ops are global permanent structures that do not
> require this synchronization. That is, ops may be added and removed from
> the hlist but are never freed, and wont hurt if a synchronization is
> missed.
> 
> But this is not true for dynamically created ftrace_ops or control_ops,
> which are used by the perf function tracing.
> 
> The problem here is that the function tracer can be used to trace
> kernel/user context switches as well as going to and from idle.
> Basically, it can be used to trace blind spots of the RCU subsystem.
> This means that even though preempt_disable() is done, a
> synchronize_sched() will ignore CPUs that haven't made it out of user
> space or idle. These can include functions that are being traced just
> before entering or exiting the kernel sections.
> 
> To implement the RCU synchronization, instead of using
> synchronize_sched() the use of schedule_on_each_cpu() is performed. This
> means that when a dynamically allocated ftrace_ops, or a control ops is
> being unregistered, all CPUs must be touched and execute a ftrace_sync()
> stub function via the work queues. This will rip CPUs out from idle or
> in dynamic tick mode. This only happens when a user disables perf
> function tracing or other dynamically allocated function tracers, but it
> allows us to continue to debug RCU and context tracking with function
> tracing.
> 
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> 
> Index: linux-trace.git/kernel/trace/ftrace.c
> ===================================================================
> --- linux-trace.git.orig/kernel/trace/ftrace.c
> +++ linux-trace.git/kernel/trace/ftrace.c
> @@ -413,6 +413,17 @@ static int __register_ftrace_function(st
>  	return 0;
>  }
>  
> +static void ftrace_sync(struct work_struct *work)
> +{
> +	/*
> +	 * This function is just a stub to implement a hard force
> +	 * of synchronize_sched(). This requires synchronizing
> +	 * tasks even in userspace and idle.
> +	 *
> +	 * Yes, function tracing is rude.
> +	 */
> +}
> +
>  static int __unregister_ftrace_function(struct ftrace_ops *ops)
>  {
>  	int ret;
> @@ -440,8 +451,12 @@ static int __unregister_ftrace_function(
>  			 * so there'll be no new users. We must ensure
>  			 * all current users are done before we free
>  			 * the control data.
> +			 * Note synchronize_sched() is not enough, as we
> +			 * use preempt_disable() to do RCU, but the function
> +			 * tracer can be called where RCU is not active
> +			 * (before user_exit()).
>  			 */
> -			synchronize_sched();
> +			schedule_on_each_cpu(ftrace_sync);
>  			control_ops_free(ops);
>  		}
>  	} else
> @@ -456,9 +471,13 @@ static int __unregister_ftrace_function(
>  	/*
>  	 * Dynamic ops may be freed, we must make sure that all
>  	 * callers are done before leaving this function.
> +	 *
> +	 * Again, normal synchronize_sched() is not good enough.
> +	 * We need to do a hard force of sched synchronization.
>  	 */
>  	if (ops->flags & FTRACE_OPS_FL_DYNAMIC)
> -		synchronize_sched();
> +		schedule_on_each_cpu(ftrace_sync);
> +
>  
>  	return 0;
>  }
> 

So what happens if schedule_on_each_cpu() returns -ENOMEM? :-)


* Re: [RFC][PATCH] ftrace: Use schedule_on_each_cpu() as a heavy synchronize_sched()
  2013-06-05 11:51 ` Peter Zijlstra
@ 2013-06-05 13:36   ` Steven Rostedt
  0 siblings, 0 replies; 14+ messages in thread
From: Steven Rostedt @ 2013-06-05 13:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, LKML, Tejun Heo, Ingo Molnar,
	Frederic Weisbecker, Jiri Olsa

On Wed, 2013-06-05 at 13:51 +0200, Peter Zijlstra wrote:

> > @@ -456,9 +471,13 @@ static int __unregister_ftrace_function(
> >  	/*
> >  	 * Dynamic ops may be freed, we must make sure that all
> >  	 * callers are done before leaving this function.
> > +	 *
> > +	 * Again, normal synchronize_sched() is not good enough.
> > +	 * We need to do a hard force of sched synchronization.
> >  	 */
> >  	if (ops->flags & FTRACE_OPS_FL_DYNAMIC)
> > -		synchronize_sched();
> > +		schedule_on_each_cpu(ftrace_sync);
> > +
> >  
> >  	return 0;
> >  }
> > 
> 
> So what happens if schedule_on_each_cpu() returns -ENOMEM? :-)

Hah! I was thinking the same thing when I wrote it, but as this was
just an RFC, I wanted to hear more about the current approach before
adding more.

I was going to also add something like this:

	if (schedule_on_each_cpu(ftrace_sync) < 0) {
		/*
		 * Crap, memory pressure caused this to fail.
		 * Do a synchronize_sched() and then a msleep().
		 * The race that synchronize_sched() misses is really small,
		 * and a msleep() should clear it (though it's not guaranteed
		 * the way schedule_on_each_cpu() is).
		 */
		synchronize_sched();
		msleep(5);
	}
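
The shape of that fallback can be sketched in userspace (heavy_sync(),
the oom flag, and the helper names are made up for illustration; in the
kernel the strong path would be schedule_on_each_cpu(ftrace_sync) and
the weak path synchronize_sched() plus msleep(5)):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

static bool weak_sync_used;

/* Stand-in for schedule_on_each_cpu(ftrace_sync): it allocates work
 * items, so under memory pressure it can fail with -ENOMEM. */
static int strong_sync(bool oom)
{
	return oom ? -ENOMEM : 0;
}

/* Stand-in for synchronize_sched() + msleep(5): always "succeeds",
 * but only probabilistically closes the gray-area window. */
static void weak_sync_and_sleep(void)
{
	weak_sync_used = true;
}

/* The fallback pattern: prefer the guaranteed heavy sync, and degrade
 * to the best-effort combination only when allocation fails. */
static int heavy_sync(bool oom)
{
	if (strong_sync(oom) == 0)
		return 0;		/* every CPU was touched */
	weak_sync_and_sleep();		/* best effort under OOM */
	return -EAGAIN;			/* caller knows it was degraded */
}
```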

-- Steve




* Re: [RFC][PATCH] ftrace: Use schedule_on_each_cpu() as a heavy synchronize_sched()
  2013-05-29  7:52 ` Peter Zijlstra
  2013-05-29 13:33   ` Paul E. McKenney
  2013-05-29 13:41   ` Steven Rostedt
@ 2014-06-19  1:56   ` Steven Rostedt
  2014-06-19  2:28     ` Paul E. McKenney
  2 siblings, 1 reply; 14+ messages in thread
From: Steven Rostedt @ 2014-06-19  1:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, LKML, Tejun Heo, Ingo Molnar,
	Frederic Weisbecker, Jiri Olsa


Another blast from the past (from the book of cleaning out inbox)

On Wed, 29 May 2013 09:52:49 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, May 28, 2013 at 08:01:16PM -0400, Steven Rostedt wrote:
> > The function tracer uses preempt_disable/enable_notrace() for
> > synchronization between reading registered ftrace_ops and unregistering
> > them.
> > 
> > Most of the ftrace_ops are global permanent structures that do not
> > require this synchronization. That is, ops may be added and removed from
> > the hlist but are never freed, and it won't hurt if a synchronization is
> > missed.
> > 
> > But this is not true for dynamically created ftrace_ops or control_ops,
> > which are used by the perf function tracing.
> > 
> > The problem here is that the function tracer can be used to trace
> > kernel/user context switches as well as going to and from idle.
> > Basically, it can be used to trace blind spots of the RCU subsystem.
> > This means that even though preempt_disable() is done, a
> > synchronize_sched() will ignore CPUs that haven't made it out of user
> > space or idle. These can include functions that are being traced just
> > before entering or exiting the kernel sections.
> 
> Just to be clear, it's the idle part that's a problem, right? Being stuck
> in userspace isn't a problem, since if that CPU is in userspace it
> certainly doesn't have a reference to whatever list entry we're removing.
> 
> Now when the CPU really is idle, it's obviously not using tracing either;
> so only the gray area where RCU thinks we're idle but we're not actually
> idle is a problem?
> 
> Is there something a little smarter we can do? Could we use
> on_each_cpu_cond() with a function that checks if the CPU really is
> fully idle?
> 
> > To implement the RCU synchronization, instead of using
> > synchronize_sched() the use of schedule_on_each_cpu() is performed. This
> > means that when a dynamically allocated ftrace_ops, or a control ops is
> > being unregistered, all CPUs must be touched and execute a ftrace_sync()
> > stub function via the work queues. This will rip CPUs out from idle or
> > in dynamic tick mode. This only happens when a user disables perf
> > function tracing or other dynamically allocated function tracers, but it
> > allows us to continue to debug RCU and context tracking with function
> > tracing.
> 
> I don't suppose there's anything perf can do about this, right? Since
> it's all on user demand we're kinda stuck with dynamic memory.

If Paul finished his "synchronize_all_tasks_scheduled()" then we could
use that instead, where "synchronize_all_tasks_scheduled()" would
return after all tasks have either scheduled, are in userspace, or are
idle (that is, not on the run queue). And "scheduled" means a
non-preempted schedule, where the task itself actually called schedule().
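Those semantics can be sketched in userspace too: snapshot a per-task counter that each task bumps only at a voluntary quiescent point, then wait for every counter to advance past its snapshot. A hedged model only, not an actual implementation; pthreads stand in for tasks and all names (`voluntary_qs`, `synchronize_tasks`, `run_demo`) are invented for this illustration.

```c
/*
 * Userspace model of the proposed "synchronize_all_tasks_scheduled()":
 * the caller returns only after every task has voluntarily passed a
 * quiescent point since the call began.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <unistd.h>

#define NTASKS 4

static atomic_ulong voluntary_qs[NTASKS];  /* per-task "called schedule()" count */
static atomic_int stop_tasks;

static void *task(void *arg)
{
	int id = (int)(long)arg;

	while (!atomic_load(&stop_tasks)) {
		/* Voluntary quiescent point: the task itself "schedules". */
		atomic_fetch_add(&voluntary_qs[id], 1);
		usleep(1000);			/* do some work */
	}
	return NULL;
}

/* Snapshot every counter, then wait for each task to advance past its
 * snapshot. Only a task that crossed its quiescent point after the
 * snapshot can advance, so no task is still inside pre-call code when
 * this returns. */
static void synchronize_tasks(void)
{
	unsigned long snap[NTASKS];
	int i;

	for (i = 0; i < NTASKS; i++)
		snap[i] = atomic_load(&voluntary_qs[i]);
	for (i = 0; i < NTASKS; i++)
		while (atomic_load(&voluntary_qs[i]) == snap[i])
			usleep(1000);	/* task i hasn't scheduled yet */
}

/* Returns 1 once the model grace period has completed. */
int run_demo(void)
{
	pthread_t t[NTASKS];
	long i;

	for (i = 0; i < NTASKS; i++)
		pthread_create(&t[i], NULL, task, (void *)i);
	synchronize_tasks();
	atomic_store(&stop_tasks, 1);
	for (i = 0; i < NTASKS; i++)
		pthread_join(t[i], NULL);
	return 1;
}
```

Compile with `gcc -std=c11 -pthread`. The key difference from the stub-posting model above it is that nothing is forced onto the tasks; the updater merely waits for each one to volunteer, so a preempted task does not count until it schedules on its own.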

Paul, how you doing on that? You said you could have something by 3.17.
That's coming up quick :-)

-- Steve


* Re: [RFC][PATCH] ftrace: Use schedule_on_each_cpu() as a heavy synchronize_sched()
  2014-06-19  1:56   ` Steven Rostedt
@ 2014-06-19  2:28     ` Paul E. McKenney
  2014-06-19  7:18       ` Masami Hiramatsu
  0 siblings, 1 reply; 14+ messages in thread
From: Paul E. McKenney @ 2014-06-19  2:28 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, LKML, Tejun Heo, Ingo Molnar,
	Frederic Weisbecker, Jiri Olsa

On Wed, Jun 18, 2014 at 09:56:26PM -0400, Steven Rostedt wrote:
> 
> Another blast from the past (from the book of cleaning out inbox)
> 
> On Wed, 29 May 2013 09:52:49 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Tue, May 28, 2013 at 08:01:16PM -0400, Steven Rostedt wrote:
> > > The function tracer uses preempt_disable/enable_notrace() for
> > > synchronization between reading registered ftrace_ops and unregistering
> > > them.
> > > 
> > > Most of the ftrace_ops are global permanent structures that do not
> > > require this synchronization. That is, ops may be added and removed from
> > > the hlist but are never freed, and it won't hurt if a synchronization is
> > > missed.
> > > 
> > > But this is not true for dynamically created ftrace_ops or control_ops,
> > > which are used by the perf function tracing.
> > > 
> > > The problem here is that the function tracer can be used to trace
> > > kernel/user context switches as well as going to and from idle.
> > > Basically, it can be used to trace blind spots of the RCU subsystem.
> > > This means that even though preempt_disable() is done, a
> > > synchronize_sched() will ignore CPUs that haven't made it out of user
> > > space or idle. These can include functions that are being traced just
> > > before entering or exiting the kernel sections.
> > 
> > Just to be clear, it's the idle part that's a problem, right? Being stuck
> > in userspace isn't a problem, since if that CPU is in userspace it
> > certainly doesn't have a reference to whatever list entry we're removing.
> > 
> > Now when the CPU really is idle, it's obviously not using tracing either;
> > so only the gray area where RCU thinks we're idle but we're not actually
> > idle is a problem?
> > 
> > Is there something a little smarter we can do? Could we use
> > on_each_cpu_cond() with a function that checks if the CPU really is
> > fully idle?
> > 
> > > To implement the RCU synchronization, instead of using
> > > synchronize_sched() the use of schedule_on_each_cpu() is performed. This
> > > means that when a dynamically allocated ftrace_ops, or a control ops is
> > > being unregistered, all CPUs must be touched and execute a ftrace_sync()
> > > stub function via the work queues. This will rip CPUs out from idle or
> > > in dynamic tick mode. This only happens when a user disables perf
> > > function tracing or other dynamically allocated function tracers, but it
> > > allows us to continue to debug RCU and context tracking with function
> > > tracing.
> > 
> > I don't suppose there's anything perf can do about this, right? Since
> > it's all on user demand we're kinda stuck with dynamic memory.
> 
> If Paul finished his "synchronize_all_tasks_scheduled()" then we could
> use that instead, where "synchronize_all_tasks_scheduled()" would
> return after all tasks have either scheduled, are in userspace, or are
> idle (that is, not on the run queue). And "scheduled" means a
> non-preempted schedule, where the task itself actually called schedule().
> 
> Paul, how you doing on that? You said you could have something by 3.17.
> That's coming up quick :-)

I am still expecting to, despite my misadventures with performance
regressions.

							Thanx, Paul



* Re: Re: [RFC][PATCH] ftrace: Use schedule_on_each_cpu() as a heavy synchronize_sched()
  2014-06-19  2:28     ` Paul E. McKenney
@ 2014-06-19  7:18       ` Masami Hiramatsu
  0 siblings, 0 replies; 14+ messages in thread
From: Masami Hiramatsu @ 2014-06-19  7:18 UTC (permalink / raw)
  To: paulmck
  Cc: Steven Rostedt, Peter Zijlstra, LKML, Tejun Heo, Ingo Molnar,
	Frederic Weisbecker, Jiri Olsa

(2014/06/19 11:28), Paul E. McKenney wrote:
> On Wed, Jun 18, 2014 at 09:56:26PM -0400, Steven Rostedt wrote:
>>
>> Another blast from the past (from the book of cleaning out inbox)
>>
>> On Wed, 29 May 2013 09:52:49 +0200
>> Peter Zijlstra <peterz@infradead.org> wrote:
>>
>>> On Tue, May 28, 2013 at 08:01:16PM -0400, Steven Rostedt wrote:
>>>> The function tracer uses preempt_disable/enable_notrace() for
>>>> synchronization between reading registered ftrace_ops and unregistering
>>>> them.
>>>>
>>>> Most of the ftrace_ops are global permanent structures that do not
>>>> require this synchronization. That is, ops may be added and removed from
>>>> the hlist but are never freed, and it won't hurt if a synchronization is
>>>> missed.
>>>>
>>>> But this is not true for dynamically created ftrace_ops or control_ops,
>>>> which are used by the perf function tracing.
>>>>
>>>> The problem here is that the function tracer can be used to trace
>>>> kernel/user context switches as well as going to and from idle.
>>>> Basically, it can be used to trace blind spots of the RCU subsystem.
>>>> This means that even though preempt_disable() is done, a
>>>> synchronize_sched() will ignore CPUs that haven't made it out of user
>>>> space or idle. These can include functions that are being traced just
>>>> before entering or exiting the kernel sections.
>>>
>>> Just to be clear, it's the idle part that's a problem, right? Being stuck
>>> in userspace isn't a problem, since if that CPU is in userspace it
>>> certainly doesn't have a reference to whatever list entry we're removing.
>>>
>>> Now when the CPU really is idle, it's obviously not using tracing either;
>>> so only the gray area where RCU thinks we're idle but we're not actually
>>> idle is a problem?
>>>
>>> Is there something a little smarter we can do? Could we use
>>> on_each_cpu_cond() with a function that checks if the CPU really is
>>> fully idle?
>>>
>>>> To implement the RCU synchronization, instead of using
>>>> synchronize_sched() the use of schedule_on_each_cpu() is performed. This
>>>> means that when a dynamically allocated ftrace_ops, or a control ops is
>>>> being unregistered, all CPUs must be touched and execute a ftrace_sync()
>>>> stub function via the work queues. This will rip CPUs out from idle or
>>>> in dynamic tick mode. This only happens when a user disables perf
>>>> function tracing or other dynamically allocated function tracers, but it
>>>> allows us to continue to debug RCU and context tracking with function
>>>> tracing.
>>>
>>> I don't suppose there's anything perf can do about this, right? Since
>>> it's all on user demand we're kinda stuck with dynamic memory.
>>
>> If Paul finished his "synchronize_all_tasks_scheduled()" then we could
>> use that instead, where "synchronize_all_tasks_scheduled()" would
>> return after all tasks have either scheduled, are in userspace, or are
>> idle (that is, not on the run queue). And "scheduled" means a
>> non-preempted schedule, where the task itself actually called schedule().

This is also good for jump-optimized kprobes on a preemptive kernel.
Since there is no way to ensure that no one is running on a dynamically
allocated code buffer, the jump optimization is currently disabled
with CONFIG_PREEMPT. However, this new API would allow us to enable it.

>>
>> Paul, how you doing on that? You said you could have something by 3.17.
>> That's coming up quick :-)
> 
> I am still expecting to, despite my misadventures with performance
> regressions.

Nice. I look forward to that :-)

Thank you,

-- 
Masami HIRAMATSU
IT Management Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com




Thread overview: 14+ messages
2013-05-29  0:01 [RFC][PATCH] ftrace: Use schedule_on_each_cpu() as a heavy synchronize_sched() Steven Rostedt
2013-05-29  7:52 ` Peter Zijlstra
2013-05-29 13:33   ` Paul E. McKenney
2013-05-29 13:55     ` Steven Rostedt
2013-05-29 13:41   ` Steven Rostedt
2014-06-19  1:56   ` Steven Rostedt
2014-06-19  2:28     ` Paul E. McKenney
2014-06-19  7:18       ` Masami Hiramatsu
2013-05-29  8:23 ` Paul E. McKenney
2013-06-04 11:03 ` Frederic Weisbecker
2013-06-04 12:11   ` Steven Rostedt
2013-06-04 12:30     ` Frederic Weisbecker
2013-06-05 11:51 ` Peter Zijlstra
2013-06-05 13:36   ` Steven Rostedt
