* Q: perf_event && task->ptrace_bps[]
@ 2010-11-08 14:56 Oleg Nesterov
  2010-11-08 14:57 ` Q: sys_perf_event_open() && PF_EXITING Oleg Nesterov
                   ` (3 more replies)
  0 siblings, 4 replies; 91+ messages in thread
From: Oleg Nesterov @ 2010-11-08 14:56 UTC (permalink / raw)
  To: Alan Stern, Arnaldo Carvalho de Melo, Frederic Weisbecker,
	Ingo Molnar, Paul Mackerras, Peter Zijlstra, Prasad,
	Roland McGrath
  Cc: linux-kernel

Hello.

I am trying to understand the usage of hw-breakpoints in arch_ptrace().
ptrace_set_debugreg() and related code looks obviously racy. Nothing
protects us against flush_ptrace_hw_breakpoint() called by the dying
tracee. Afaics we can leak perf_event or use the already freed memory
or both.
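
Roughly the interleaving I have in mind (a simplified sketch, not the
exact call chain):

	tracer: sys_ptrace()			dying tracee: do_exit()

	ptrace_set_debugreg()
	  bp = thread->ptrace_bps[n];
						flush_ptrace_hw_breakpoint()
						  unregister_hw_breakpoint(bp);
						  /* bp is freed */
	  modify_user_hw_breakpoint(bp, ...);	/* use-after-free */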

Am I missing something?

Looking into the git history, I don't even know which patch should be
blamed (if I am right), there were too many changes. I noticed that
2ebd4ffb6d0cb877787b1e42be8485820158857e "perf events: Split out task
search into helper" moved the PF_EXITING check from find_get_context().
This check could help if sys_ptrace() races with SIGKILL, but it was
racy anyway.

It is not clear to me what should be done. Looking more, I do not
understand the scope of perf_event/ctx at all, sys_perf_event_open()
looks wrong too, see the next email I am going to send.

Oleg.



* Q: sys_perf_event_open() && PF_EXITING
  2010-11-08 14:56 Q: perf_event && task->ptrace_bps[] Oleg Nesterov
@ 2010-11-08 14:57 ` Oleg Nesterov
  2011-01-19 18:21   ` [PATCH 0/2] Was: " Oleg Nesterov
  2010-11-08 14:57 ` Q: perf_event && event->owner Oleg Nesterov
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 91+ messages in thread
From: Oleg Nesterov @ 2010-11-08 14:57 UTC (permalink / raw)
  To: Alan Stern, Arnaldo Carvalho de Melo, Frederic Weisbecker,
	Ingo Molnar, Paul Mackerras, Peter Zijlstra, Prasad,
	Roland McGrath
  Cc: linux-kernel

I am puzzled by the PF_EXITING check in find_lively_task_by_vpid().

How can it help? The task can call do_exit() right after the check.

And why do we need it? The comment only says "Can't attach events to
a dying task". Maybe it tries to protect sys_perf_event_open() against
perf_event_exit_task_context(), but it can't.
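
The window, as I see it (again just a sketch):

	sys_perf_event_open()			target task:

	find_lively_task_by_vpid()
	  /* PF_EXITING not set yet,
	     the check passes */
						do_exit()
						  exit_signals();	/* sets PF_EXITING */
						  perf_event_exit_task()
	find_get_context(task)
	  /* the new event goes into a
	     context which was already
	     "exited" */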

c93f7669 "perf_counter: Fix race in attaching counters to tasks and
exiting" says:

    There is also a race between perf_counter_exit_task and
    find_get_context; this solves the race by moving the get_ctx that
    was in perf_counter_alloc into the locked region in find_get_context,
    so that once find_get_context has got the context for a task, it
    won't get freed even if the task calls perf_counter_exit_task.

OK, the code was changed since that commit, but afaics "it won't be
freed" is still true.

However,

    It
    doesn't matter if new top-level (non-inherited) counters get attached
    to the context after perf_counter_exit_task has detached the context
    from the task.  They will just stay there and never get scheduled in
    until the counters' fds get closed, and then perf_release will remove
    them from the context and eventually free the context.

This looks wrong. perf_release() does free_event()->put_ctx(), and this
pairs with the get_ctx() after alloc_perf_context().

But __perf_event_init_context() sets ctx->refcount = 1, and I guess this
reference should be dropped by ctx->task? If yes, then it is not OK to
attach the event after sys_perf_event_open().

No?


Hmm. jump_label_inc/dec looks obviously racy too. Say, free_event() races
with perf_event_alloc(). There is a window between atomic_xxx() and
jump_label_update(); afaics it is possible to call jump_label_disable()
when perf_task_events/perf_swevent_enabled != 0.
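
For example (a sketch; the exact counter/key names don't matter):

	free_event():				perf_event_alloc():

	atomic_dec(&perf_task_events);
	/* the counter is 0 now */
						atomic_inc(&perf_task_events);
						jump_label_update();	/* enable */
	jump_label_update();	/* disable runs last, so the label
				   ends up off even though
				   perf_task_events != 0 */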

Oleg.



* Q: perf_event && event->owner
  2010-11-08 14:56 Q: perf_event && task->ptrace_bps[] Oleg Nesterov
  2010-11-08 14:57 ` Q: sys_perf_event_open() && PF_EXITING Oleg Nesterov
@ 2010-11-08 14:57 ` Oleg Nesterov
  2010-11-08 20:11   ` Frederic Weisbecker
  2010-11-08 18:41 ` Q: perf_event && task->ptrace_bps[] Frederic Weisbecker
  2011-01-17 20:34 ` Oleg Nesterov
  3 siblings, 1 reply; 91+ messages in thread
From: Oleg Nesterov @ 2010-11-08 14:57 UTC (permalink / raw)
  To: Alan Stern, Arnaldo Carvalho de Melo, Frederic Weisbecker,
	Ingo Molnar, Paul Mackerras, Peter Zijlstra, Prasad,
	Roland McGrath
  Cc: linux-kernel

Another thing I can't understand: event->owner/owner_entry.

Say, some thread calls sys_perf_event_open() and creates the event.
It becomes its owner. Now this thread exits, but fd/event are still
here, and event->owner refers to the dead task_struct.

ptrace looks even more strange. Debugger can attach the breakpoint
to the tracee and then exit/detach. ->ptrace_bps events still point
to the same (maybe dead) task. Even if another debugger attaches
and reuses these events.

And for what? Afaics, this is only used by PR_TASK_PERF_EVENTS_xxABLE.
It looks like tools/perf/ used prctl() in the past. Perhaps this API
can die now and we can kill ->owner/owner_entry?
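
For reference, the userspace side of that API is just (a minimal
sketch; PR_TASK_PERF_EVENTS_* come via <linux/prctl.h>):

	#include <sys/prctl.h>

	/* disable, then re-enable, every event on the caller's
	   ->perf_event_list */
	prctl(PR_TASK_PERF_EVENTS_DISABLE);
	prctl(PR_TASK_PERF_EVENTS_ENABLE);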

Oleg.



* Re: Q: perf_event && task->ptrace_bps[]
  2010-11-08 14:56 Q: perf_event && task->ptrace_bps[] Oleg Nesterov
  2010-11-08 14:57 ` Q: sys_perf_event_open() && PF_EXITING Oleg Nesterov
  2010-11-08 14:57 ` Q: perf_event && event->owner Oleg Nesterov
@ 2010-11-08 18:41 ` Frederic Weisbecker
  2010-11-08 19:18   ` Oleg Nesterov
  2011-01-17 20:34 ` Oleg Nesterov
  3 siblings, 1 reply; 91+ messages in thread
From: Frederic Weisbecker @ 2010-11-08 18:41 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Alan Stern, Arnaldo Carvalho de Melo, Ingo Molnar,
	Paul Mackerras, Peter Zijlstra, Prasad, Roland McGrath,
	linux-kernel

On Mon, Nov 08, 2010 at 03:56:47PM +0100, Oleg Nesterov wrote:
> Hello.
> 
> I am trying to understand the usage of hw-breakpoints in arch_ptrace().
> ptrace_set_debugreg() and related code looks obviously racy. Nothing
> protects us against flush_ptrace_hw_breakpoint() called by the dying
> tracee. Afaics we can leak perf_event or use the already freed memory
> or both.
> 
> Am I missing something?
> 
> Looking into the git history, I don't even know which patch should be
> blamed (if I am right), there were too many changes. I noticed that
> 2ebd4ffb6d0cb877787b1e42be8485820158857e "perf events: Split out task
> search into helper" moved the PF_EXITING check from find_get_context().
> This check could help if sys_ptrace() races with SIGKILL, but it was
> racy anyway.
> 
> It is not clear to me what should be done. Looking more, I do not
> understand the scope of perf_event/ctx at all, sys_perf_event_open()
> looks wrong too, see the next email I am going to send.
> 
> Oleg.
> 


But I don't understand how ptrace_set_debugreg() and flush_old_exec() can
happen at the same time. The parent can only do the ptrace request when
the child is stopped, right? But it can't be stopped in flush_old_exec()...?

Not sure how any race can happen here. I am certainly missing something obvious.

Thanks.



* Re: Q: perf_event && task->ptrace_bps[]
  2010-11-08 18:41 ` Q: perf_event && task->ptrace_bps[] Frederic Weisbecker
@ 2010-11-08 19:18   ` Oleg Nesterov
  2011-01-17 23:58     ` Frederic Weisbecker
  0 siblings, 1 reply; 91+ messages in thread
From: Oleg Nesterov @ 2010-11-08 19:18 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Alan Stern, Arnaldo Carvalho de Melo, Ingo Molnar,
	Paul Mackerras, Peter Zijlstra, Prasad, Roland McGrath,
	linux-kernel

On 11/08, Frederic Weisbecker wrote:
>
> On Mon, Nov 08, 2010 at 03:56:47PM +0100, Oleg Nesterov wrote:
> > Hello.
> >
> > I am trying to understand the usage of hw-breakpoints in arch_ptrace().
> > ptrace_set_debugreg() and related code looks obviously racy. Nothing
> > protects us against flush_ptrace_hw_breakpoint() called by the dying
> > tracee. Afaics we can leak perf_event or use the already freed memory
> > or both.
> >
> > Am I missing something?
> >
> > Looking into the git history, I don't even know which patch should be
> > blamed (if I am right), there were too many changes. I noticed that
> > 2ebd4ffb6d0cb877787b1e42be8485820158857e "perf events: Split out task
> > search into helper" moved the PF_EXITING check from find_get_context().
> > This check could help if sys_ptrace() races with SIGKILL, but it was
> > racy anyway.
> >
> > It is not clear to me what should be done. Looking more, I do not
> > understand the scope of perf_event/ctx at all, sys_perf_event_open()
> > looks wrong too, see the next email I am going to send.
> >
> > Oleg.
>
> But I don't understand how ptrace_set_debugreg() and flush_old_exec() can
> happen at the same time.

This can't happen. But I meant do_exit()->flush_ptrace_hw_breakpoint().

> The parent can only do the ptrace request when
> the child is stopped, right?

Yes. But nothing can "pin" TASK_TRACED.

We know that a) the tracee was stopped when sys_ptrace() was called
and b) its task_struct can't go away. That is all. The tracee can be
killed at any moment, and sys_ptrace() can race with
flush_ptrace_hw_breakpoint().

> I am certainly missing something obvious.

Perhaps ;) Or, it is quite possible I missed something; I never read
this code before, and it is certainly not trivial.

Oleg.



* Re: Q: perf_event && event->owner
  2010-11-08 14:57 ` Q: perf_event && event->owner Oleg Nesterov
@ 2010-11-08 20:11   ` Frederic Weisbecker
  2010-11-08 20:41     ` Peter Zijlstra
  2010-11-09 15:57     ` Oleg Nesterov
  0 siblings, 2 replies; 91+ messages in thread
From: Frederic Weisbecker @ 2010-11-08 20:11 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Alan Stern, Arnaldo Carvalho de Melo, Ingo Molnar,
	Paul Mackerras, Peter Zijlstra, Prasad, Roland McGrath,
	linux-kernel

On Mon, Nov 08, 2010 at 03:57:54PM +0100, Oleg Nesterov wrote:
> Another thing I can't understand, event->owner/owner_entry.
> 
> Say, some thread calls sys_perf_event_open() and creates the event.
> It becomes its owner. Now this thread exits, but fd/event are still
> here, and event->owner refers to the dead task_struct.



Hmm, it seems to me that the last reference to the event is
put in __perf_event_exit_task(), and then free_event() is called
there, which RCU-queues the event to be released.

Not sure where the issue is here.


 
> ptrace looks even more strange. Debugger can attach the breakpoint
> to the tracee and then exit/detach. ->ptrace_bps events still point
> to the same (maybe dead) task. Even if another debugger attaches
> and reuses these events.



Hmm, in this case ptrace_bps will continue to trigger on the task
to which they were applied.

On the other hand, you're right, I'm not sure that the debugger is
the correct owner for the breakpoints.
I think it works though, looking at perf_event_create_kernel_counter():

	event->owner = current;
	get_task_struct(current);

(current is the debugger)

On perf_event_release_kernel():

	put_task_struct(event->owner);

So even if the debugger dies, we keep a valid owner; it works, but makes
little sense as the debugger can change.
Perhaps the real owner should be the task to which we attach our breakpoint.

What do you think?



* Re: Q: perf_event && event->owner
  2010-11-08 20:11   ` Frederic Weisbecker
@ 2010-11-08 20:41     ` Peter Zijlstra
  2010-11-09 16:18       ` Oleg Nesterov
  2010-11-09 15:57     ` Oleg Nesterov
  1 sibling, 1 reply; 91+ messages in thread
From: Peter Zijlstra @ 2010-11-08 20:41 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Oleg Nesterov, Alan Stern, Arnaldo Carvalho de Melo, Ingo Molnar,
	Paul Mackerras, Prasad, Roland McGrath, linux-kernel

On Mon, 2010-11-08 at 21:11 +0100, Frederic Weisbecker wrote:
> Perhaps the real owner should be the task to which we attach our
> breakpoint.

No, the point of event->owner is to point to the task that creates the
event, not the task we possibly attach it to (that should be reachable
through event->ctx->task).

As to removing event->owner as Oleg suggests, it's a published ABI and
there might be people using it.

The use-case is a monitor thread wanting to stop all monitoring it
initiated, for example because it wants to synchronize various counters
attached to different tasks, etc.




* Re: Q: perf_event && event->owner
  2010-11-08 20:11   ` Frederic Weisbecker
  2010-11-08 20:41     ` Peter Zijlstra
@ 2010-11-09 15:57     ` Oleg Nesterov
  2010-11-09 16:56       ` Peter Zijlstra
  1 sibling, 1 reply; 91+ messages in thread
From: Oleg Nesterov @ 2010-11-09 15:57 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Alan Stern, Arnaldo Carvalho de Melo, Ingo Molnar,
	Paul Mackerras, Peter Zijlstra, Prasad, Roland McGrath,
	linux-kernel

On 11/08, Frederic Weisbecker wrote:
>
> On Mon, Nov 08, 2010 at 03:57:54PM +0100, Oleg Nesterov wrote:
> > Another thing I can't understand, event->owner/owner_entry.
> >
> > Say, some thread calls sys_perf_event_open() and creates the event.
> > It becomes its owner. Now this thread exits, but fd/event are still
> > here, and event->owner refers to the dead task_struct.
>
> Hmm, it seems to me that the last reference to the event is
> put in __perf_event_exit_task(),

I think no, in this case sys_perf_event_open() owns the event. IOW,
perf_release() frees this perf_event. But this doesn't matter.

> and then free_event() is called
> there, which RCU-queues the event to be released.
>
> Not sure where the issue is here.

I am not saying this is buggy. But it looks very strange to me.

If the creator of perf_event dies, nobody can use its ->perf_event_list
anyway. What is the point of keeping the reference to the dead task_struct
and preserving this ->perf_event_list?

And why do we need ->owner at all? Afaics, it is _only_ needed to find
->perf_event_mutex in perf_event_release_kernel(). And this mutex only
protects ->perf_event_list (mostly for prctl).

Of course, I understand that it is not completely trivial to change this.
The exiting creator can clear its ->perf_event_list and set
event->owner = NULL, but then perf_event_release_kernel() should
avoid the races with do_exit() somehow.

> > ptrace looks even more strange. Debugger can attach the breakpoint
> > to the tracee and then exit/detach. ->ptrace_bps events still point
> > to the same (maybe dead) task. Even if another debugger attaches
> > and reuses these events.
>
>
>
> Hmm, in this case ptrace_bps will continue to trigger on the task
> to which they were applied.
>
> On the other hand, you're right, I'm not sure that the debugger is
> the correct owner for the breakpoints.
> I think it works though, looking at perf_event_create_kernel_counter():
>
> 	event->owner = current;
> 	get_task_struct(current);
>
> (current is the debugger)
>
> On perf_event_release_kernel():
>
> 	put_task_struct(event->owner);
>
> So even if the debugger dies, we keep a valid owner; it works, but makes
> little sense as the debugger can change.

Yes, it works, but I am not sure about "valid" above ;) Even if the previous
debugger doesn't exit.

And. Suppose that the new debugger attaches and reuses ->ptrace_bps[],
everything works.

Now, the former debugger does prctl(PR_TASK_PERF_EVENTS_DISABLE) and
suddenly bps stop working.

Not to mention this looks racy. Can't prctl() doing perf_event_disable/enable
race with modify_user_hw_breakpoint/unregister_hw_breakpoint/etc.?

> Perhaps the real owner should be the task to which we attach our breakpoint.

Not sure... What for?

In any case, I don't think the tracee should "control" this event.

Oleg.



* Re: Q: perf_event && event->owner
  2010-11-08 20:41     ` Peter Zijlstra
@ 2010-11-09 16:18       ` Oleg Nesterov
  0 siblings, 0 replies; 91+ messages in thread
From: Oleg Nesterov @ 2010-11-09 16:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Alan Stern, Arnaldo Carvalho de Melo,
	Ingo Molnar, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On 11/08, Peter Zijlstra wrote:
>
> As to removing event->owner as Oleg suggests, it's a published ABI and
> there might be people using it.

This was my main question.

It's a pity ;) This ABI doesn't look very nice, but OK.

Oleg.



* Re: Q: perf_event && event->owner
  2010-11-09 15:57     ` Oleg Nesterov
@ 2010-11-09 16:56       ` Peter Zijlstra
  2010-11-09 16:58         ` Oleg Nesterov
  0 siblings, 1 reply; 91+ messages in thread
From: Peter Zijlstra @ 2010-11-09 16:56 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Frederic Weisbecker, Alan Stern, Arnaldo Carvalho de Melo,
	Ingo Molnar, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On Tue, 2010-11-09 at 16:57 +0100, Oleg Nesterov wrote:
> 
> If the creator of perf_event dies, nobody can use its ->perf_event_list
> anyway. What is the point of keeping the reference to the dead task_struct
> and preserving this ->perf_event_list?

But when the owner dies it will close all its fds, which means it will
clear its tsk->perf_event_list, no? (With the exception of the case where
the fd was passed through a unix-socket to another process.)



* Re: Q: perf_event && event->owner
  2010-11-09 16:56       ` Peter Zijlstra
@ 2010-11-09 16:58         ` Oleg Nesterov
  2010-11-09 17:07           ` Peter Zijlstra
  0 siblings, 1 reply; 91+ messages in thread
From: Oleg Nesterov @ 2010-11-09 16:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Alan Stern, Arnaldo Carvalho de Melo,
	Ingo Molnar, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On 11/09, Peter Zijlstra wrote:
>
> On Tue, 2010-11-09 at 16:57 +0100, Oleg Nesterov wrote:
> >
> > If the creator of perf_event dies, nobody can use its ->perf_event_list
> > anyway. What is the point of keeping the reference to the dead task_struct
> > and preserving this ->perf_event_list?
>
> But when the owner dies it will close all its fds, which means it will
> clear its tsk->perf_event_list, no? (With the exception of the case where
> the fd was passed through a unix-socket to another process.)

fork(), pthread_create(). Only __fput() calls ->release, when the last
reference to the file goes away.
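
For example (a userspace sketch; the parent creates the event and thus
becomes event->owner, while the forked child keeps the fd, and the
event, alive):

	#include <unistd.h>
	#include <string.h>
	#include <sys/syscall.h>
	#include <linux/perf_event.h>

	int main(void)
	{
		struct perf_event_attr attr;
		int fd;

		memset(&attr, 0, sizeof(attr));
		attr.size = sizeof(attr);
		attr.type = PERF_TYPE_SOFTWARE;
		attr.config = PERF_COUNT_SW_TASK_CLOCK;

		/* the creator becomes event->owner */
		fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
		(void)fd;	/* inherited across fork() below */

		if (fork() == 0)
			pause();	/* the child holds the fd open */

		/* the owner exits; the event stays alive until the
		   child's __fput(), with event->owner pointing to a
		   dead task */
		return 0;
	}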

And ptrace(), it doesn't use sys_perf_event_open() to create the event.

Oleg.



* Re: Q: perf_event && event->owner
  2010-11-09 16:58         ` Oleg Nesterov
@ 2010-11-09 17:07           ` Peter Zijlstra
  2010-11-09 17:42             ` Oleg Nesterov
  0 siblings, 1 reply; 91+ messages in thread
From: Peter Zijlstra @ 2010-11-09 17:07 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Frederic Weisbecker, Alan Stern, Arnaldo Carvalho de Melo,
	Ingo Molnar, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On Tue, 2010-11-09 at 17:58 +0100, Oleg Nesterov wrote:
> On 11/09, Peter Zijlstra wrote:
> >
> > On Tue, 2010-11-09 at 16:57 +0100, Oleg Nesterov wrote:
> > >
> > > If the creator of perf_event dies, nobody can use its ->perf_event_list
> > > anyway. What is the point of keeping the reference to the dead task_struct
> > > and preserving this ->perf_event_list?
> >
> > But when the owner dies it will close all its fds, which means it will
> > clear its tsk->perf_event_list, no? (With the exception of the case where
> > the fd was passed through a unix-socket to another process.)
> 
> fork(), pthread_create(). Only __fput() calls ->release, when the last
> reference to the file goes away.

Ah,.. quite so. So how about we explicitly destroy the list when the
task dies?

> And ptrace(), it doesn't use sys_perf_event_open() to create the event.

Right, I guess it uses kernel-based things, so we could simply not add
kernel-based counters to the list.


* Re: Q: perf_event && event->owner
  2010-11-09 17:07           ` Peter Zijlstra
@ 2010-11-09 17:42             ` Oleg Nesterov
  2010-11-09 18:01               ` Peter Zijlstra
  0 siblings, 1 reply; 91+ messages in thread
From: Oleg Nesterov @ 2010-11-09 17:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Alan Stern, Arnaldo Carvalho de Melo,
	Ingo Molnar, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On 11/09, Peter Zijlstra wrote:
>
> Ah,.. quite so. So how about we explicitly destroy the list when the
> task dies?

Yes, I think it makes sense to destroy the list and set ->owner = NULL.
If we reset the owner, we can also avoid get_task_struct().

The only problem is perf_event_release_kernel(); it can race with the
exiting event->owner. It can do get_task_struct() under the rcu lock
temporarily, just to take the mutex and remove the entry.

> > And ptrace(), it doesn't use sys_perf_event_open() to create the event.
>
> > Right, I guess it uses kernel-based things, so we could simply not add
> > kernel-based counters to the list.

Agreed, another case when event->owner should be NULL.



Hmm. With or without these changes, shouldn't perf_event_release_kernel()
remove the event from the list before anything else? Otherwise, afaics a
thread which does close(event_fd) can race with the creator doing
prctl(EVENTS_ENABLE), no?
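
Something like this interleaving (a sketch, assuming the current code
where the owner-list removal is done last):

	creator:				another thread:

	prctl(PR_TASK_PERF_EVENTS_ENABLE)	close(event_fd)
						  perf_event_release_kernel(event)
						    /* starts detaching the event
						       from its ctx */
	  mutex_lock(&current->perf_event_mutex);
	  list_for_each_entry(event, ...)
	    perf_event_enable(event);	/* the event is still on the list,
					   this races with the teardown */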

Oleg.



* Re: Q: perf_event && event->owner
  2010-11-09 17:42             ` Oleg Nesterov
@ 2010-11-09 18:01               ` Peter Zijlstra
  2010-11-09 18:57                 ` Oleg Nesterov
  0 siblings, 1 reply; 91+ messages in thread
From: Peter Zijlstra @ 2010-11-09 18:01 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Frederic Weisbecker, Alan Stern, Arnaldo Carvalho de Melo,
	Ingo Molnar, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On Tue, 2010-11-09 at 18:42 +0100, Oleg Nesterov wrote:
> On 11/09, Peter Zijlstra wrote:
> >
> > Ah,.. quite so. So how about we explicitly destroy the list when the
> > task dies?
> 
> Yes, I think it makes sense to destroy the list and set ->owner = NULL.
> If we reset the owner, we can also avoid get_task_struct().
> 
> The only problem is perf_event_release_kernel(); it can race with the
> exiting event->owner. It can do get_task_struct() under the rcu lock
> temporarily, just to take the mutex and remove the entry.
> 
> > > And ptrace(), it doesn't use sys_perf_event_open() to create the event.
> >
> > > Right, I guess it uses kernel-based things, so we could simply not add
> > > kernel-based counters to the list.
> 
> Agreed, another case when event->owner should be NULL.
> 
> 
> 
> Hmm. With or without these changes, shouldn't perf_event_release_kernel()
> remove the event from the list before anything else? Otherwise, afaics a
> thread which does close(event_fd) can race with the creator doing
> prctl(EVENTS_ENABLE), no?

I think you're right, how about something like this?

---
Index: linux-2.6/kernel/perf_event.c
===================================================================
--- linux-2.6.orig/kernel/perf_event.c
+++ linux-2.6/kernel/perf_event.c
@@ -2234,11 +2234,6 @@ int perf_event_release_kernel(struct per
 	raw_spin_unlock_irq(&ctx->lock);
 	mutex_unlock(&ctx->mutex);
 
-	mutex_lock(&event->owner->perf_event_mutex);
-	list_del_init(&event->owner_entry);
-	mutex_unlock(&event->owner->perf_event_mutex);
-	put_task_struct(event->owner);
-
 	free_event(event);
 
 	return 0;
@@ -2254,6 +2249,12 @@ static int perf_release(struct inode *in
 
 	file->private_data = NULL;
 
+	if (event->owner) {
+		mutex_lock(&event->owner->perf_event_mutex);
+		list_del_init(&event->owner_entry);
+		mutex_unlock(&event->owner->perf_event_mutex);
+	}
+
 	return perf_event_release_kernel(event);
 }
 
@@ -5677,7 +5678,7 @@ SYSCALL_DEFINE5(perf_event_open,
 	mutex_unlock(&ctx->mutex);
 
 	event->owner = current;
-	get_task_struct(current);
+
 	mutex_lock(&current->perf_event_mutex);
 	list_add_tail(&event->owner_entry, &current->perf_event_list);
 	mutex_unlock(&current->perf_event_mutex);
@@ -5745,12 +5746,6 @@ perf_event_create_kernel_counter(struct 
 	++ctx->generation;
 	mutex_unlock(&ctx->mutex);
 
-	event->owner = current;
-	get_task_struct(current);
-	mutex_lock(&current->perf_event_mutex);
-	list_add_tail(&event->owner_entry, &current->perf_event_list);
-	mutex_unlock(&current->perf_event_mutex);
-
 	return event;
 
 err_free:
@@ -5901,8 +5896,16 @@ static void perf_event_exit_task_context
  */
 void perf_event_exit_task(struct task_struct *child)
 {
+	struct perf_event *event, *tmp;
 	int ctxn;
 
+	mutex_lock(&child->perf_event_mutex);
+	list_for_each_entry_safe(event, tmp, &child->perf_event_list,
+				 owner_entry) {
+		list_del_init(&event->owner_entry);
+	}
+	mutex_unlock(&child->perf_event_mutex);
+
 	for_each_task_context_nr(ctxn)
 		perf_event_exit_task_context(child, ctxn);
 }



* Re: Q: perf_event && event->owner
  2010-11-09 18:01               ` Peter Zijlstra
@ 2010-11-09 18:57                 ` Oleg Nesterov
  2010-11-09 19:16                   ` Peter Zijlstra
  2010-11-10 15:17                   ` Peter Zijlstra
  0 siblings, 2 replies; 91+ messages in thread
From: Oleg Nesterov @ 2010-11-09 18:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Alan Stern, Arnaldo Carvalho de Melo,
	Ingo Molnar, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On 11/09, Peter Zijlstra wrote:
>
> I think you're right, how about something like this?

I need to read it with a fresh head ;)

At first glance,

> @@ -2254,6 +2249,12 @@ static int perf_release(struct inode *in
>
>  	file->private_data = NULL;
>
> +	if (event->owner) {
> +		mutex_lock(&event->owner->perf_event_mutex);
> +		list_del_init(&event->owner_entry);
> +		mutex_unlock(&event->owner->perf_event_mutex);
> +	}

Agreed, it is better to do this in perf_release().

But this can use the already freed task_struct, event->owner.

Either sys_perf_event_open() should do get_task_struct() like we currently
do, or perf_event_exit_task() should clear event->owner and then
perf_release() should do something like

	rcu_read_lock();
	owner = event->owner;
	if (owner)
		get_task_struct(owner);
	rcu_read_unlock();

	if (owner) {
		mutex_lock(&owner->perf_event_mutex);
		list_del_init(&event->owner_entry);
		mutex_unlock(&owner->perf_event_mutex);
		put_task_struct(owner);
	}

Probably this can be simplified...

Oleg.



* Re: Q: perf_event && event->owner
  2010-11-09 18:57                 ` Oleg Nesterov
@ 2010-11-09 19:16                   ` Peter Zijlstra
  2010-11-10 15:17                   ` Peter Zijlstra
  1 sibling, 0 replies; 91+ messages in thread
From: Peter Zijlstra @ 2010-11-09 19:16 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Frederic Weisbecker, Alan Stern, Arnaldo Carvalho de Melo,
	Ingo Molnar, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On Tue, 2010-11-09 at 19:57 +0100, Oleg Nesterov wrote:
> On 11/09, Peter Zijlstra wrote:
> >
> > I think you're right, how about something like this?
> 
> I need to read it with a fresh head ;)

You're right, and I seem to suffer from a similar problem, will respin
tomorrow.


* Re: Q: perf_event && event->owner
  2010-11-09 18:57                 ` Oleg Nesterov
  2010-11-09 19:16                   ` Peter Zijlstra
@ 2010-11-10 15:17                   ` Peter Zijlstra
  2010-11-10 15:44                     ` Oleg Nesterov
  1 sibling, 1 reply; 91+ messages in thread
From: Peter Zijlstra @ 2010-11-10 15:17 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Frederic Weisbecker, Alan Stern, Arnaldo Carvalho de Melo,
	Ingo Molnar, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On Tue, 2010-11-09 at 19:57 +0100, Oleg Nesterov wrote:
> Either sys_perf_event_open() should do get_task_struct() like we currently
> do, or perf_event_exit_task() should clear event->owner and then
> perf_release() should do something like
> 
>         rcu_read_lock();
>         owner = event->owner;
>         if (owner)
>                 get_task_struct(owner);
>         rcu_read_unlock();
> 
>         if (owner) {
>                 mutex_lock(&owner->perf_event_mutex);
>                 list_del_init(&event->owner_entry);
>                 mutex_unlock(&owner->perf_event_mutex);
>                 put_task_struct(owner);
>         }
> 
> Probably this can be simplified... 

I think that's still racy, suppose we do:

void perf_event_exit_task(struct task_struct *child)
{
	struct perf_event *event, *tmp;
	int ctxn;

	mutex_lock(&child->perf_event_mutex);
	list_for_each_entry_safe(event, tmp, &child->perf_event_list,
				 owner_entry) {
		event->owner = NULL;
		list_del_init(&event->owner_entry);
	}
	mutex_unlock(&child->perf_event_mutex);

	for_each_task_context_nr(ctxn)
		perf_event_exit_task_context(child, ctxn);
}


and the close() races with an exit, then couldn't we observe
event->owner after the last put_task_struct()? In which case a
get_task_struct() will result in a double-free.




* Re: Q: perf_event && event->owner
  2010-11-10 15:17                   ` Peter Zijlstra
@ 2010-11-10 15:44                     ` Oleg Nesterov
  2010-11-12 15:48                       ` Peter Zijlstra
  0 siblings, 1 reply; 91+ messages in thread
From: Oleg Nesterov @ 2010-11-10 15:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Alan Stern, Arnaldo Carvalho de Melo,
	Ingo Molnar, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On 11/10, Peter Zijlstra wrote:
>
> On Tue, 2010-11-09 at 19:57 +0100, Oleg Nesterov wrote:
> > Either sys_perf_event_open() should do get_task_struct() like we currently
> > do, or perf_event_exit_task() should clear event->owner and then
> > perf_release() should do something like
> >
> >         rcu_read_lock();
> >         owner = event->owner;
> >         if (owner)
> >                 get_task_struct(owner);
> >         rcu_read_unlock();
> >
> >         if (owner) {
> >                 mutex_lock(&owner->perf_event_mutex);
> >                 list_del_init(&event->owner_entry);
> >                 mutex_unlock(&owner->perf_event_mutex);
> >                 put_task_struct(owner);
> >         }
> >
> > Probably this can be simplified...
>
> I think that's still racy, suppose we do:
>
> void perf_event_exit_task(struct task_struct *child)
> {
> 	struct perf_event *event, *tmp;
> 	int ctxn;
>
> 	mutex_lock(&child->perf_event_mutex);
> 	list_for_each_entry_safe(event, tmp, &child->perf_event_list,
> 				 owner_entry) {
> 		event->owner = NULL;
> 		list_del_init(&event->owner_entry);
> 	}
> 	mutex_unlock(&child->perf_event_mutex);
>
> 	for_each_task_context_nr(ctxn)
> 		perf_event_exit_task_context(child, ctxn);
> }
>
>
> and the close() races with an exit, then couldn't we observe
> event->owner after the last put_task_struct()?

I think no. Note that we do not just free the task_struct via an rcu
callback. Instead, delayed_put_task_struct() drops the (maybe) last
reference.

But the code is racy, yes. The owner != NULL case is fine. But
perf_release() can see event->owner == NULL before the list_del() has
completed. perf_event_exit_task() needs a wmb() in between, I think.
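
I.e., in the exit path, something like:

	list_del_init(&event->owner_entry);
	/* make the list deletion visible before the owner is cleared */
	smp_wmb();
	event->owner = NULL;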

Oleg.



* Re: Q: perf_event && event->owner
  2010-11-10 15:44                     ` Oleg Nesterov
@ 2010-11-12 15:48                       ` Peter Zijlstra
  2010-11-12 18:49                         ` Oleg Nesterov
  2010-11-18 14:09                         ` [tip:perf/urgent] perf: Fix owner-list vs exit tip-bot for Peter Zijlstra
  0 siblings, 2 replies; 91+ messages in thread
From: Peter Zijlstra @ 2010-11-12 15:48 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Frederic Weisbecker, Alan Stern, Arnaldo Carvalho de Melo,
	Ingo Molnar, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On Wed, 2010-11-10 at 16:44 +0100, Oleg Nesterov wrote:
> 
> But the code is racy, yes. The owner != NULL case is fine. But
> perf_release() can see event->owner == NULL before the list_del() has
> completed. perf_event_exit_task() needs a wmb() in between, I think.
> 

Utter paranoia took over and I'm still not sure it's solid...


---
Subject: perf: Fix owner-list vs exit
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Tue, 09 Nov 2010 19:01:43 +0100

Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1289325703.2191.60.camel@laptop>
---
 kernel/perf_event.c |   63 ++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 51 insertions(+), 12 deletions(-)

Index: linux-2.6/kernel/perf_event.c
===================================================================
--- linux-2.6.orig/kernel/perf_event.c
+++ linux-2.6/kernel/perf_event.c
@@ -2234,11 +2234,6 @@ int perf_event_release_kernel(struct per
 	raw_spin_unlock_irq(&ctx->lock);
 	mutex_unlock(&ctx->mutex);
 
-	mutex_lock(&event->owner->perf_event_mutex);
-	list_del_init(&event->owner_entry);
-	mutex_unlock(&event->owner->perf_event_mutex);
-	put_task_struct(event->owner);
-
 	free_event(event);
 
 	return 0;
@@ -2251,9 +2246,43 @@ EXPORT_SYMBOL_GPL(perf_event_release_ker
 static int perf_release(struct inode *inode, struct file *file)
 {
 	struct perf_event *event = file->private_data;
+	struct task_struct *owner;
 
 	file->private_data = NULL;
 
+	rcu_read_lock();
+	owner = ACCESS_ONCE(event->owner);
+	/*
+	 * Matches the smp_wmb() in perf_event_exit_task(). If we observe
+	 * !owner it means the list deletion is complete and we can indeed
+	 * free this event, otherwise we need to serialize on
+	 * owner->perf_event_mutex.
+	 */
+	smp_read_barrier_depends();
+	if (owner) {
+		/*
+		 * Since delayed_put_task_struct() also drops the last
+		 * task reference we can safely take a new reference
+		 * while holding the rcu_read_lock().
+		 */
+		get_task_struct(owner);
+	}
+	rcu_read_unlock();
+
+	if (owner) {
+		mutex_lock(&owner->perf_event_mutex);
+		/*
+		 * We have to re-check the event->owner field, if it is cleared
+		 * we raced with perf_event_exit_task(), acquiring the mutex
+		 * ensured they're done, and we can proceed with freeing the
+		 * event.
+		 */
+		if (event->owner)
+			list_del_init(&event->owner_entry);
+		mutex_unlock(&owner->perf_event_mutex);
+		put_task_struct(owner);
+	}
+
 	return perf_event_release_kernel(event);
 }
 
@@ -5677,7 +5706,7 @@ SYSCALL_DEFINE5(perf_event_open,
 	mutex_unlock(&ctx->mutex);
 
 	event->owner = current;
-	get_task_struct(current);
+
 	mutex_lock(&current->perf_event_mutex);
 	list_add_tail(&event->owner_entry, &current->perf_event_list);
 	mutex_unlock(&current->perf_event_mutex);
@@ -5745,12 +5774,6 @@ perf_event_create_kernel_counter(struct
 	++ctx->generation;
 	mutex_unlock(&ctx->mutex);
 
-	event->owner = current;
-	get_task_struct(current);
-	mutex_lock(&current->perf_event_mutex);
-	list_add_tail(&event->owner_entry, &current->perf_event_list);
-	mutex_unlock(&current->perf_event_mutex);
-
 	return event;
 
 err_free:
@@ -5901,8 +5924,24 @@ static void perf_event_exit_task_context
  */
 void perf_event_exit_task(struct task_struct *child)
 {
+	struct perf_event *event, *tmp;
 	int ctxn;
 
+	mutex_lock(&child->perf_event_mutex);
+	list_for_each_entry_safe(event, tmp, &child->perf_event_list,
+				 owner_entry) {
+		list_del_init(&event->owner_entry);
+
+		/*
+		 * Ensure the list deletion is visible before we clear
+		 * the owner, closes a race against perf_release() where
+		 * we need to serialize on the owner->perf_event_mutex.
+		 */
+		smp_wmb();
+		event->owner = NULL;
+	}
+	mutex_unlock(&child->perf_event_mutex);
+
 	for_each_task_context_nr(ctxn)
 		perf_event_exit_task_context(child, ctxn);
 }



* Re: Q: perf_event && event->owner
  2010-11-12 15:48                       ` Peter Zijlstra
@ 2010-11-12 18:49                         ` Oleg Nesterov
  2010-11-18 14:09                         ` [tip:perf/urgent] perf: Fix owner-list vs exit tip-bot for Peter Zijlstra
  1 sibling, 0 replies; 91+ messages in thread
From: Oleg Nesterov @ 2010-11-12 18:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Alan Stern, Arnaldo Carvalho de Melo,
	Ingo Molnar, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On 11/12, Peter Zijlstra wrote:
>
> Utter paranoia took over and I'm still not sure it's solid...

Thanks, Peter!

I believe this all is correct.

> Subject: perf: Fix owner-list vs exit
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date: Tue, 09 Nov 2010 19:01:43 +0100
> 
> Reported-by: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> LKML-Reference: <1289325703.2191.60.camel@laptop>
> ---
>  kernel/perf_event.c |   63 ++++++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 51 insertions(+), 12 deletions(-)
> 
> Index: linux-2.6/kernel/perf_event.c
> ===================================================================
> --- linux-2.6.orig/kernel/perf_event.c
> +++ linux-2.6/kernel/perf_event.c
> @@ -2234,11 +2234,6 @@ int perf_event_release_kernel(struct per
>  	raw_spin_unlock_irq(&ctx->lock);
>  	mutex_unlock(&ctx->mutex);
>  
> -	mutex_lock(&event->owner->perf_event_mutex);
> -	list_del_init(&event->owner_entry);
> -	mutex_unlock(&event->owner->perf_event_mutex);
> -	put_task_struct(event->owner);
> -
>  	free_event(event);
>  
>  	return 0;
> @@ -2251,9 +2246,43 @@ EXPORT_SYMBOL_GPL(perf_event_release_ker
>  static int perf_release(struct inode *inode, struct file *file)
>  {
>  	struct perf_event *event = file->private_data;
> +	struct task_struct *owner;
>  
>  	file->private_data = NULL;
>  
> +	rcu_read_lock();
> +	owner = ACCESS_ONCE(event->owner);
> +	/*
> +	 * Matches the smp_wmb() in perf_event_exit_task(). If we observe
> +	 * !owner it means the list deletion is complete and we can indeed
> +	 * free this event, otherwise we need to serialize on
> +	 * owner->perf_event_mutex.
> +	 */
> +	smp_read_barrier_depends();
> +	if (owner) {
> +		/*
> +		 * Since delayed_put_task_struct() also drops the last
> +		 * task reference we can safely take a new reference
> +		 * while holding the rcu_read_lock().
> +		 */
> +		get_task_struct(owner);
> +	}
> +	rcu_read_unlock();
> +
> +	if (owner) {
> +		mutex_lock(&owner->perf_event_mutex);
> +		/*
> +		 * We have to re-check the event->owner field, if it is cleared
> +		 * we raced with perf_event_exit_task(), acquiring the mutex
> +		 * ensured they're done, and we can proceed with freeing the
> +		 * event.
> +		 */
> +		if (event->owner)
> +			list_del_init(&event->owner_entry);
> +		mutex_unlock(&owner->perf_event_mutex);
> +		put_task_struct(owner);
> +	}
> +
>  	return perf_event_release_kernel(event);
>  }
>  
> @@ -5677,7 +5706,7 @@ SYSCALL_DEFINE5(perf_event_open,
>  	mutex_unlock(&ctx->mutex);
>  
>  	event->owner = current;
> -	get_task_struct(current);
> +
>  	mutex_lock(&current->perf_event_mutex);
>  	list_add_tail(&event->owner_entry, &current->perf_event_list);
>  	mutex_unlock(&current->perf_event_mutex);
> @@ -5745,12 +5774,6 @@ perf_event_create_kernel_counter(struct
>  	++ctx->generation;
>  	mutex_unlock(&ctx->mutex);
>  
> -	event->owner = current;
> -	get_task_struct(current);
> -	mutex_lock(&current->perf_event_mutex);
> -	list_add_tail(&event->owner_entry, &current->perf_event_list);
> -	mutex_unlock(&current->perf_event_mutex);
> -
>  	return event;
>  
>  err_free:
> @@ -5901,8 +5924,24 @@ static void perf_event_exit_task_context
>   */
>  void perf_event_exit_task(struct task_struct *child)
>  {
> +	struct perf_event *event, *tmp;
>  	int ctxn;
>  
> +	mutex_lock(&child->perf_event_mutex);
> +	list_for_each_entry_safe(event, tmp, &child->perf_event_list,
> +				 owner_entry) {
> +		list_del_init(&event->owner_entry);
> +
> +		/*
> +		 * Ensure the list deletion is visible before we clear
> +		 * the owner, closes a race against perf_release() where
> +		 * we need to serialize on the owner->perf_event_mutex.
> +		 */
> +		smp_wmb();
> +		event->owner = NULL;
> +	}
> +	mutex_unlock(&child->perf_event_mutex);
> +
>  	for_each_task_context_nr(ctxn)
>  		perf_event_exit_task_context(child, ctxn);
>  }
> 



* [tip:perf/urgent] perf: Fix owner-list vs exit
  2010-11-12 15:48                       ` Peter Zijlstra
  2010-11-12 18:49                         ` Oleg Nesterov
@ 2010-11-18 14:09                         ` tip-bot for Peter Zijlstra
  1 sibling, 0 replies; 91+ messages in thread
From: tip-bot for Peter Zijlstra @ 2010-11-18 14:09 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, oleg, tglx, mingo

Commit-ID:  8882135bcd332f294df5455747ea43ba9e6f77ad
Gitweb:     http://git.kernel.org/tip/8882135bcd332f294df5455747ea43ba9e6f77ad
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Tue, 9 Nov 2010 19:01:43 +0100
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Thu, 18 Nov 2010 13:18:46 +0100

perf: Fix owner-list vs exit

Oleg noticed that a perf-fd keeping a reference on the creating task
leads to a few funny side effects.

There's two different aspects to this:

  - kernel based perf-events, these should not take out
    a reference on the creating task and appear on the task's
    event list since they're not bound to fds nor visible
    to userspace.

  - fork() and pthread_create(), these can lead to the creating
    task dying (and thus the task's event-list becoming useless)
    but keeping the list and ref alive until the event is closed.

Combined, they lead to malfunction of the ptrace hw breakpoints.

Cure this by not considering kernel based perf_events for the
owner-list and destroying the owner-list when the owner dies.

Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Oleg Nesterov <oleg@redhat.com>
LKML-Reference: <1289576883.2084.286.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/perf_event.c |   63 +++++++++++++++++++++++++++++++++++++++++---------
 1 files changed, 51 insertions(+), 12 deletions(-)

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index f818d9d..671f6c8 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -2235,11 +2235,6 @@ int perf_event_release_kernel(struct perf_event *event)
 	raw_spin_unlock_irq(&ctx->lock);
 	mutex_unlock(&ctx->mutex);
 
-	mutex_lock(&event->owner->perf_event_mutex);
-	list_del_init(&event->owner_entry);
-	mutex_unlock(&event->owner->perf_event_mutex);
-	put_task_struct(event->owner);
-
 	free_event(event);
 
 	return 0;
@@ -2252,9 +2247,43 @@ EXPORT_SYMBOL_GPL(perf_event_release_kernel);
 static int perf_release(struct inode *inode, struct file *file)
 {
 	struct perf_event *event = file->private_data;
+	struct task_struct *owner;
 
 	file->private_data = NULL;
 
+	rcu_read_lock();
+	owner = ACCESS_ONCE(event->owner);
+	/*
+	 * Matches the smp_wmb() in perf_event_exit_task(). If we observe
+	 * !owner it means the list deletion is complete and we can indeed
+	 * free this event, otherwise we need to serialize on
+	 * owner->perf_event_mutex.
+	 */
+	smp_read_barrier_depends();
+	if (owner) {
+		/*
+		 * Since delayed_put_task_struct() also drops the last
+		 * task reference we can safely take a new reference
+		 * while holding the rcu_read_lock().
+		 */
+		get_task_struct(owner);
+	}
+	rcu_read_unlock();
+
+	if (owner) {
+		mutex_lock(&owner->perf_event_mutex);
+		/*
+		 * We have to re-check the event->owner field, if it is cleared
+		 * we raced with perf_event_exit_task(), acquiring the mutex
+		 * ensured they're done, and we can proceed with freeing the
+		 * event.
+		 */
+		if (event->owner)
+			list_del_init(&event->owner_entry);
+		mutex_unlock(&owner->perf_event_mutex);
+		put_task_struct(owner);
+	}
+
 	return perf_event_release_kernel(event);
 }
 
@@ -5678,7 +5707,7 @@ SYSCALL_DEFINE5(perf_event_open,
 	mutex_unlock(&ctx->mutex);
 
 	event->owner = current;
-	get_task_struct(current);
+
 	mutex_lock(&current->perf_event_mutex);
 	list_add_tail(&event->owner_entry, &current->perf_event_list);
 	mutex_unlock(&current->perf_event_mutex);
@@ -5746,12 +5775,6 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
 	++ctx->generation;
 	mutex_unlock(&ctx->mutex);
 
-	event->owner = current;
-	get_task_struct(current);
-	mutex_lock(&current->perf_event_mutex);
-	list_add_tail(&event->owner_entry, &current->perf_event_list);
-	mutex_unlock(&current->perf_event_mutex);
-
 	return event;
 
 err_free:
@@ -5902,8 +5925,24 @@ again:
  */
 void perf_event_exit_task(struct task_struct *child)
 {
+	struct perf_event *event, *tmp;
 	int ctxn;
 
+	mutex_lock(&child->perf_event_mutex);
+	list_for_each_entry_safe(event, tmp, &child->perf_event_list,
+				 owner_entry) {
+		list_del_init(&event->owner_entry);
+
+		/*
+		 * Ensure the list deletion is visible before we clear
+		 * the owner, closes a race against perf_release() where
+		 * we need to serialize on the owner->perf_event_mutex.
+		 */
+		smp_wmb();
+		event->owner = NULL;
+	}
+	mutex_unlock(&child->perf_event_mutex);
+
 	for_each_task_context_nr(ctxn)
 		perf_event_exit_task_context(child, ctxn);
 }


* Re: Q: perf_event && task->ptrace_bps[]
  2010-11-08 14:56 Q: perf_event && task->ptrace_bps[] Oleg Nesterov
                   ` (2 preceding siblings ...)
  2010-11-08 18:41 ` Q: perf_event && task->ptrace_bps[] Frederic Weisbecker
@ 2011-01-17 20:34 ` Oleg Nesterov
  2011-01-17 20:52   ` Peter Zijlstra
  2011-01-18 18:42   ` Q: perf_event && task->ptrace_bps[] Frederic Weisbecker
  3 siblings, 2 replies; 91+ messages in thread
From: Oleg Nesterov @ 2011-01-17 20:34 UTC (permalink / raw)
  To: Alan Stern, Arnaldo Carvalho de Melo, Frederic Weisbecker,
	Ingo Molnar, Paul Mackerras, Peter Zijlstra, Prasad,
	Roland McGrath
  Cc: linux-kernel

On 11/08, Oleg Nesterov wrote:
>
> I am trying to understand the usage of hw-breakpoints in arch_ptrace().
> ptrace_set_debugreg() and related code looks obviously racy. Nothing
> protects us against flush_ptrace_hw_breakpoint() called by the dying
> tracee. Afaics we can leak perf_event or use the already freed memory
> or both.
>
> Am I missing something?
>
> Looking into the git history, I don't even know which patch should be
> blamed (if I am right), there were too many changes. I noticed that
> 2ebd4ffb6d0cb877787b1e42be8485820158857e "perf events: Split out task
> search into helper" moved the PF_EXITING check from find_get_context().
> This check could help if sys_ptrace() races with SIGKILL, but it was
> racy anyway.

Ping.

Any idea how to fix this cleanly? Maybe we can reuse perf_event_mutex,
but this looks soooo ugly. And do_exit()->flush_ptrace_hw_breakpoint()
has the strange "FIXME:" comment which doesn't help me to understand
what we can do.

Probably the best fix is to change this code so that the tracer owns
->ptrace_bps[], not the tracee. But this is not trivial, and needs a
lot of changes in ptrace code.


I am reading perf_event.c, but all I found so far is a couple of trivial
methods to crash the kernel via sys_perf_event_open(), will report
tomorrow...

Oleg.



* Re: Q: perf_event && task->ptrace_bps[]
  2011-01-17 20:34 ` Oleg Nesterov
@ 2011-01-17 20:52   ` Peter Zijlstra
  2011-01-17 21:01     ` Frederic Weisbecker
  2011-01-18 16:09     ` [PATCH 0/2] perf: event->cpu checking fixes Oleg Nesterov
  2011-01-18 18:42   ` Q: perf_event && task->ptrace_bps[] Frederic Weisbecker
  1 sibling, 2 replies; 91+ messages in thread
From: Peter Zijlstra @ 2011-01-17 20:52 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Alan Stern, Arnaldo Carvalho de Melo, Frederic Weisbecker,
	Ingo Molnar, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On Mon, 2011-01-17 at 21:34 +0100, Oleg Nesterov wrote:
> On 11/08, Oleg Nesterov wrote:
> >
> > I am trying to understand the usage of hw-breakpoints in arch_ptrace().
> > ptrace_set_debugreg() and related code looks obviously racy. Nothing
> > protects us against flush_ptrace_hw_breakpoint() called by the dying
> > tracee. Afaics we can leak perf_event or use the already freed memory
> > or both.
> >
> > Am I missing something?
> >
> > Looking into the git history, I don't even know which patch should be
> > blamed (if I am right), there were too many changes. I noticed that
> > 2ebd4ffb6d0cb877787b1e42be8485820158857e "perf events: Split out task
> > search into helper" moved the PF_EXITING check from find_get_context().
> > This check could help if sys_ptrace() races with SIGKILL, but it was
> > racy anyway.
> 
> Ping.
> 
> Any idea how to fix this cleanly? Maybe we can reuse perf_event_mutex,
> but this looks soooo ugly. And do_exit()->flush_ptrace_hw_breakpoint()
> has the strange "FIXME:" comment which doesn't help me to understand
> what we can do.
> 
> Probably the best fix is to change this code so that the tracer owns
> ->ptrace_bps[], not the tracee. But this is not trivial, and needs a
> lot of changes in ptrace code.

Wasn't this sorted by: 8882135bcd332f294df5455747ea43ba9e6f77ad?

Or is this purely related to the ptrace muck? In which case I guess
Frederic is your man; I never looked at the hw_breakpoint stuff in
general and the ptrace bits in particular.

> I am reading perf_event.c, but all I found so far is a couple of trivial
> methods to crash the kernel via sys_perf_event_open(), will report
> tomorrow...

Ow, that's not too pretty.. 


* Re: Q: perf_event && task->ptrace_bps[]
  2011-01-17 20:52   ` Peter Zijlstra
@ 2011-01-17 21:01     ` Frederic Weisbecker
  2011-01-18 16:09     ` [PATCH 0/2] perf: event->cpu checking fixes Oleg Nesterov
  1 sibling, 0 replies; 91+ messages in thread
From: Frederic Weisbecker @ 2011-01-17 21:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Alan Stern, Arnaldo Carvalho de Melo, Ingo Molnar,
	Paul Mackerras, Prasad, Roland McGrath, linux-kernel

On Mon, Jan 17, 2011 at 09:52:56PM +0100, Peter Zijlstra wrote:
> On Mon, 2011-01-17 at 21:34 +0100, Oleg Nesterov wrote:
> > On 11/08, Oleg Nesterov wrote:
> > >
> > > I am trying to understand the usage of hw-breakpoints in arch_ptrace().
> > > ptrace_set_debugreg() and related code looks obviously racy. Nothing
> > > protects us against flush_ptrace_hw_breakpoint() called by the dying
> > > tracee. Afaics we can leak perf_event or use the already freed memory
> > > or both.
> > >
> > > Am I missing something?
> > >
> > > Looking into the git history, I don't even know which patch should be
> > > blamed (if I am right), there were too many changes. I noticed that
> > > 2ebd4ffb6d0cb877787b1e42be8485820158857e "perf events: Split out task
> > > search into helper" moved the PF_EXITING check from find_get_context().
> > > This check could help if sys_ptrace() races with SIGKILL, but it was
> > > racy anyway.
> > 
> > Ping.
> > 
> > Any idea how to fix this cleanly? Maybe we can reuse perf_event_mutex,
> > but this looks soooo ugly. And do_exit()->flush_ptrace_hw_breakpoint()
> > has the strange "FIXME:" comment which doesn't help me to understand
> > what we can do.
> > 
> > Probably the best fix is to change this code so that the tracer owns
> > ->ptrace_bps[], not the tracee. But this is not trivial, and needs a
> > lot of changes in ptrace code.
> 
> Wasn't this sorted by: 8882135bcd332f294df5455747ea43ba9e6f77ad?
> 
> Or is this purely related to the ptrace muck? In which case I guess
> Frederic is your man; I never looked at the hw_breakpoint stuff in
> general and the ptrace bits in particular.

Yeah, sorry, I lost track of this and left it unanswered in the middle.
Just lemme rewalk the thread and I'm back :)


* Re: Q: perf_event && task->ptrace_bps[]
  2010-11-08 19:18   ` Oleg Nesterov
@ 2011-01-17 23:58     ` Frederic Weisbecker
  2011-01-18  1:16       ` Roland McGrath
  0 siblings, 1 reply; 91+ messages in thread
From: Frederic Weisbecker @ 2011-01-17 23:58 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Alan Stern, Arnaldo Carvalho de Melo, Ingo Molnar,
	Paul Mackerras, Peter Zijlstra, Prasad, Roland McGrath,
	linux-kernel

On Mon, Nov 08, 2010 at 08:18:13PM +0100, Oleg Nesterov wrote:
> On 11/08, Frederic Weisbecker wrote:
> >
> > On Mon, Nov 08, 2010 at 03:56:47PM +0100, Oleg Nesterov wrote:
> > > Hello.
> > >
> > > I am trying to understand the usage of hw-breakpoints in arch_ptrace().
> > > ptrace_set_debugreg() and related code looks obviously racy. Nothing
> > > protects us against flush_ptrace_hw_breakpoint() called by the dying
> > > tracee. Afaics we can leak perf_event or use the already freed memory
> > > or both.
> > >
> > > Am I missing something?
> > >
> > > Looking into the git history, I don't even know which patch should be
> > > blamed (if I am right), there were too many changes. I noticed that
> > > 2ebd4ffb6d0cb877787b1e42be8485820158857e "perf events: Split out task
> > > search into helper" moved the PF_EXITING check from find_get_context().
> > > This check could help if sys_ptrace() races with SIGKILL, but it was
> > > racy anyway.
> > >
> > > It is not clear to me what should be done. Looking more, I do not
> > > understand the scope of perf_event/ctx at all, sys_perf_event_open()
> > > looks wrong too, see the next email I am going to send.
> > >
> > > Oleg.
> >
> > But I don't understand how ptrace_set_debugreg() and flush_old_exec() can
> > happen at the same time.
> 
> This can't happen. But I meant do_exit()->flush_ptrace_hw_breakpoint().
> 
> > The parent can only do the ptrace request when
> > the child is stopped, right?
> 
> Yes. But nothing can "pin" TASK_TRACED.
> 
> We know that a) the tracee was stopped when sys_ptrace() was called
> and b) its task_struct can't go away. That is all. The tracee can be
> killed at any moment, and sys_ptrace() can race with
> flush_ptrace_hw_breakpoint().

Aah, so we check if the task is stopped when sys_ptrace() is called,
but right after we do this check, the tracee can be resumed at any time
(with either SIGCONT or SIGKILL), even if we are servicing the ptrace
request at the same time?

Seems to be so as I look at the code.


* Re: Q: perf_event && task->ptrace_bps[]
  2011-01-17 23:58     ` Frederic Weisbecker
@ 2011-01-18  1:16       ` Roland McGrath
  0 siblings, 0 replies; 91+ messages in thread
From: Roland McGrath @ 2011-01-18  1:16 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Oleg Nesterov, Alan Stern, Arnaldo Carvalho de Melo, Ingo Molnar,
	Paul Mackerras, Peter Zijlstra, Prasad, linux-kernel

> Aah, so we check if the task is stopped when sys_ptrace() is called,
> but right after we do this check, the tracee can be resumed at any time
> (with either SIGCONT or SIGKILL), even if we are servicing the ptrace
> request at the same time?

Only by SIGKILL.


* [PATCH 0/2] perf: event->cpu checking fixes
  2011-01-17 20:52   ` Peter Zijlstra
  2011-01-17 21:01     ` Frederic Weisbecker
@ 2011-01-18 16:09     ` Oleg Nesterov
  2011-01-18 16:10       ` [PATCH 1/2] perf: find_get_context: fix the per-cpu-counter check Oleg Nesterov
  2011-01-18 16:10       ` [PATCH 2/2] perf: validate cpu early in perf_event_alloc() Oleg Nesterov
  1 sibling, 2 replies; 91+ messages in thread
From: Oleg Nesterov @ 2011-01-18 16:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alan Stern, Arnaldo Carvalho de Melo, Frederic Weisbecker,
	Ingo Molnar, Paul Mackerras, Prasad, Roland McGrath, gregkh,
	linux-kernel, stable

On 01/17, Peter Zijlstra wrote:
>
> Wasn't this sorted by: 8882135bcd332f294df5455747ea43ba9e6f77ad?

No. This commit only answers my 3rd question, 1-2 are still waiting ;)

> > I am reading perf_event.c, but all I found so far is a couple of trivial
> > methods to crash the kernel via sys_perf_event_open(), will report
> > tomorrow...
>
> Ow, that's not too pretty..

Fortunately, this is trivial. Probably 2.6.37 needs these fixes too.

Oleg.



* [PATCH 1/2] perf: find_get_context: fix the per-cpu-counter check
  2011-01-18 16:09     ` [PATCH 0/2] perf: event->cpu checking fixes Oleg Nesterov
@ 2011-01-18 16:10       ` Oleg Nesterov
  2011-01-18 19:06         ` [tip:perf/urgent] perf: Find_get_context: " tip-bot for Oleg Nesterov
  2011-01-18 16:10       ` [PATCH 2/2] perf: validate cpu early in perf_event_alloc() Oleg Nesterov
  1 sibling, 1 reply; 91+ messages in thread
From: Oleg Nesterov @ 2011-01-18 16:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alan Stern, Arnaldo Carvalho de Melo, Frederic Weisbecker,
	Ingo Molnar, Paul Mackerras, Prasad, Roland McGrath, gregkh,
	linux-kernel, stable

If task == NULL, find_get_context() should always check that cpu
is correct.

Afaics, the bug was introduced by 38a81da2 "perf events: Clean up
pid passing", but even before that commit "&& cpu != -1" was not
exactly right: -ESRCH from find_task_by_vpid() is not accurate.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 kernel/perf_event.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- git/kernel/perf_event.c~1_find_get_context	2011-01-14 18:21:05.000000000 +0100
+++ git/kernel/perf_event.c	2011-01-18 16:56:40.000000000 +0100
@@ -2228,7 +2228,7 @@ find_get_context(struct pmu *pmu, struct
 	unsigned long flags;
 	int ctxn, err;
 
-	if (!task && cpu != -1) {
+	if (!task) {
 		/* Must be root to operate on a CPU event: */
 		if (perf_paranoid_cpu() && !capable(CAP_SYS_ADMIN))
 			return ERR_PTR(-EACCES);


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [PATCH 2/2] perf: validate cpu early in perf_event_alloc()
  2011-01-18 16:09     ` [PATCH 0/2] perf: event->cpu checking fixes Oleg Nesterov
  2011-01-18 16:10       ` [PATCH 1/2] perf: find_get_context: fix the per-cpu-counter check Oleg Nesterov
@ 2011-01-18 16:10       ` Oleg Nesterov
  2011-01-18 19:07         ` [tip:perf/urgent] perf: Validate " tip-bot for Oleg Nesterov
  1 sibling, 1 reply; 91+ messages in thread
From: Oleg Nesterov @ 2011-01-18 16:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alan Stern, Arnaldo Carvalho de Melo, Frederic Weisbecker,
	Ingo Molnar, Paul Mackerras, Prasad, Roland McGrath, gregkh,
	linux-kernel, stable

Starting from perf_event_alloc()->perf_init_event(), the kernel
assumes that event->cpu is either -1 or the valid CPU number.

Change perf_event_alloc() to validate this argument early. This
also means we can remove the similar check in find_get_context().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 kernel/perf_event.c |   10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

--- git/kernel/perf_event.c~2_perf_event_alloc	2011-01-18 16:56:40.000000000 +0100
+++ git/kernel/perf_event.c	2011-01-18 16:57:08.000000000 +0100
@@ -2233,9 +2233,6 @@ find_get_context(struct pmu *pmu, struct
 		if (perf_paranoid_cpu() && !capable(CAP_SYS_ADMIN))
 			return ERR_PTR(-EACCES);
 
-		if (cpu < 0 || cpu >= nr_cpumask_bits)
-			return ERR_PTR(-EINVAL);
-
 		/*
 		 * We could be clever and allow to attach a event to an
 		 * offline CPU and activate it when the CPU comes up, but
@@ -5541,6 +5538,11 @@ perf_event_alloc(struct perf_event_attr 
 	struct hw_perf_event *hwc;
 	long err;
 
+	if ((unsigned)cpu >= nr_cpu_ids) {
+		if (!task || cpu != -1)
+			return ERR_PTR(-EINVAL);
+	}
+
 	event = kzalloc(sizeof(*event), GFP_KERNEL);
 	if (!event)
 		return ERR_PTR(-ENOMEM);
@@ -5589,7 +5591,7 @@ perf_event_alloc(struct perf_event_attr 
 
 	if (!overflow_handler && parent_event)
 		overflow_handler = parent_event->overflow_handler;
-	
+
 	event->overflow_handler	= overflow_handler;
 
 	if (attr->disabled)


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: Q: perf_event && task->ptrace_bps[]
  2011-01-17 20:34 ` Oleg Nesterov
  2011-01-17 20:52   ` Peter Zijlstra
@ 2011-01-18 18:42   ` Frederic Weisbecker
  2011-01-19 15:37     ` Oleg Nesterov
  1 sibling, 1 reply; 91+ messages in thread
From: Frederic Weisbecker @ 2011-01-18 18:42 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Alan Stern, Arnaldo Carvalho de Melo, Ingo Molnar,
	Paul Mackerras, Peter Zijlstra, Prasad, Roland McGrath,
	linux-kernel

On Mon, Jan 17, 2011 at 09:34:59PM +0100, Oleg Nesterov wrote:
> On 11/08, Oleg Nesterov wrote:
> >
> > I am trying to understand the usage of hw-breakpoints in arch_ptrace().
> > ptrace_set_debugreg() and related code looks obviously racy. Nothing
> > protects us against flush_ptrace_hw_breakpoint() called by the dying
> > tracee. Afaics we can leak perf_event or use the already freed memory
> > or both.
> >
> > Am I missing something?
> >
> > Looking into the git history, I don't even know which patch should be
> > blamed (if I am right), there were too many changes. I noticed that
> > 2ebd4ffb6d0cb877787b1e42be8485820158857e "perf events: Split out task
> > search into helper" moved the PF_EXITING check from find_get_context().
> > This check could help if sys_ptrace() races with SIGKILL, but it was
> > racy anyway.
> 
> Ping.
> 
> Any idea how to fix this cleanly? Maybe we can reuse perf_event_mutex,
> but this looks soooo ugly. And do_exit()->flush_ptrace_hw_breakpoint()
> has the strange "FIXME:" comment which doesn't help me to understand
> what we can do.

Yeah forget about the FIXME, it's a stale thing I need to remove.

> 
> Probably the best fix is to change this code so that the tracer owns
> ->ptrace_bps[], not the tracee. But this is not trivial, and needs a
> lot of changes in ptrace code.


How much complicated would it be?

Because I see three solutions to solve this:

- Have a mutex inside thread->ptrace_bps. The contention must be
rare and only concern ptrace and tracee exit. That's the simplest
(see the sketch below).

- Have an atomic refcount inside thread->ptrace_bps so that the actual
flush can be delayed until necessary. Same as above, but exit and ptrace
can execute concurrently; the code must be a bit more complicated though.

- Your solution. I'm just not sure how much change it involves. Seems
like we need to notify the parent for it to flush the breakpoints
when a task exits. Likewise, when ptrace detaches we need to flush.

What do you guys think? At a glance it seems a mutex or a refcount
would take more memory for each thread, but I can make
->ptrace_bps a block that is only allocated when necessary. It would be
only a pointer if no breakpoint is queued.
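
For the first option, I mean something as simple as this (a sketch with
made-up names, assuming the exit-time flush takes the same mutex):

	/* hypothetical field next to ->ptrace_bps[] in thread_struct */
	struct mutex bps_mutex;

	/* in every ptrace path touching breakpoints, and in the flush: */
	mutex_lock(&tsk->thread.bps_mutex);
	/* safe to install/modify/flush tsk->thread.ptrace_bps[] here */
	mutex_unlock(&tsk->thread.bps_mutex);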

^ permalink raw reply	[flat|nested] 91+ messages in thread

* [tip:perf/urgent] perf: Find_get_context: fix the per-cpu-counter check
  2011-01-18 16:10       ` [PATCH 1/2] perf: find_get_context: fix the per-cpu-counter check Oleg Nesterov
@ 2011-01-18 19:06         ` tip-bot for Oleg Nesterov
  0 siblings, 0 replies; 91+ messages in thread
From: tip-bot for Oleg Nesterov @ 2011-01-18 19:06 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, paulus, acme, hpa, mingo, stern, a.p.zijlstra,
	roland, fweisbec, tglx, oleg, mingo, prasad

Commit-ID:  22a4ec729017ba613337a28f306f94ba5023fe2e
Gitweb:     http://git.kernel.org/tip/22a4ec729017ba613337a28f306f94ba5023fe2e
Author:     Oleg Nesterov <oleg@redhat.com>
AuthorDate: Tue, 18 Jan 2011 17:10:08 +0100
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Tue, 18 Jan 2011 19:34:23 +0100

perf: Find_get_context: fix the per-cpu-counter check

If task == NULL, find_get_context() should always check that cpu
is correct.

Afaics, the bug was introduced by 38a81da2 "perf events: Clean
up pid passing", but even before that commit "&& cpu != -1" was
not exactly right: -ESRCH from find_task_by_vpid() is not
accurate.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Prasad <prasad@linux.vnet.ibm.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: gregkh@suse.de
Cc: stable@kernel.org
LKML-Reference: <20110118161008.GB693@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/perf_event.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 76be4c7..a962b19 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -2228,7 +2228,7 @@ find_get_context(struct pmu *pmu, struct task_struct *task, int cpu)
 	unsigned long flags;
 	int ctxn, err;
 
-	if (!task && cpu != -1) {
+	if (!task) {
 		/* Must be root to operate on a CPU event: */
 		if (perf_paranoid_cpu() && !capable(CAP_SYS_ADMIN))
 			return ERR_PTR(-EACCES);

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [tip:perf/urgent] perf: Validate cpu early in perf_event_alloc()
  2011-01-18 16:10       ` [PATCH 2/2] perf: validate cpu early in perf_event_alloc() Oleg Nesterov
@ 2011-01-18 19:07         ` tip-bot for Oleg Nesterov
  0 siblings, 0 replies; 91+ messages in thread
From: tip-bot for Oleg Nesterov @ 2011-01-18 19:07 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, paulus, acme, hpa, mingo, stern, a.p.zijlstra,
	roland, fweisbec, tglx, oleg, mingo, prasad

Commit-ID:  66832eb4baaaa9abe4c993ddf9113a79e39b9915
Gitweb:     http://git.kernel.org/tip/66832eb4baaaa9abe4c993ddf9113a79e39b9915
Author:     Oleg Nesterov <oleg@redhat.com>
AuthorDate: Tue, 18 Jan 2011 17:10:32 +0100
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Tue, 18 Jan 2011 19:34:23 +0100

perf: Validate cpu early in perf_event_alloc()

Starting from perf_event_alloc()->perf_init_event(), the kernel
assumes that event->cpu is either -1 or the valid CPU number.

Change perf_event_alloc() to validate this argument early. This
also means we can remove the similar check in
find_get_context().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Prasad <prasad@linux.vnet.ibm.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: gregkh@suse.de
Cc: stable@kernel.org
LKML-Reference: <20110118161032.GC693@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/perf_event.c |   10 ++++++----
 1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index a962b19..67d9bd7 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -2233,9 +2233,6 @@ find_get_context(struct pmu *pmu, struct task_struct *task, int cpu)
 		if (perf_paranoid_cpu() && !capable(CAP_SYS_ADMIN))
 			return ERR_PTR(-EACCES);
 
-		if (cpu < 0 || cpu >= nr_cpumask_bits)
-			return ERR_PTR(-EINVAL);
-
 		/*
 		 * We could be clever and allow to attach a event to an
 		 * offline CPU and activate it when the CPU comes up, but
@@ -5541,6 +5538,11 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	struct hw_perf_event *hwc;
 	long err;
 
+	if ((unsigned)cpu >= nr_cpu_ids) {
+		if (!task || cpu != -1)
+			return ERR_PTR(-EINVAL);
+	}
+
 	event = kzalloc(sizeof(*event), GFP_KERNEL);
 	if (!event)
 		return ERR_PTR(-ENOMEM);
@@ -5589,7 +5591,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 
 	if (!overflow_handler && parent_event)
 		overflow_handler = parent_event->overflow_handler;
-	
+
 	event->overflow_handler	= overflow_handler;
 
 	if (attr->disabled)

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* Re: Q: perf_event && task->ptrace_bps[]
  2011-01-18 18:42   ` Q: perf_event && task->ptrace_bps[] Frederic Weisbecker
@ 2011-01-19 15:37     ` Oleg Nesterov
  2011-01-19 20:05       ` Frederic Weisbecker
  0 siblings, 1 reply; 91+ messages in thread
From: Oleg Nesterov @ 2011-01-19 15:37 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Alan Stern, Arnaldo Carvalho de Melo, Ingo Molnar,
	Paul Mackerras, Peter Zijlstra, Prasad, Roland McGrath,
	linux-kernel

On 01/18, Frederic Weisbecker wrote:
>
> On Mon, Jan 17, 2011 at 09:34:59PM +0100, Oleg Nesterov wrote:
> >
> > Any idea how to fix this cleanly? May be we can reuse perf_event_mutex,
> > but this looks soooo ugly. And do_exit()->flush_ptrace_hw_breakpoint()
> > has the strange "FIXME:" comment which doesn't help me to understand
> > what can we do.
>
> Yeah forget about the FIXME, it's a stale thing I need to remove.

OK, good.

> > Probably the best fix is to change this code so that the tracer owns
> > ->ptrace_bps[], not the tracee. But this is not trivial, and needs a
> > lot of changes in ptrace code.
>
> How much complicated would it be?

The problem is, ptrace_detach/release_task can't sleep currently.
The necessary changes are nasty.

> Because I see three solutions to solve this:
>
> - Have a mutex inside thread->ptrace_bps. The contention must be
> rare and only concern ptrace and tracee exit. That's the simplest.

I think we can reuse perf_event_mutex for this. Not very good either,
but simple. But this depends on what we can do under this mutex...
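
Something like this, mirroring the find_get_context() fix (just the
shape of it, assuming the exit side takes the same mutex around
flush_ptrace_hw_breakpoint()):

	mutex_lock(&task->perf_event_mutex);
	if (!(task->flags & PF_EXITING)) {
		/* ->ptrace_bps[] can't be flushed under us here */
	}
	mutex_unlock(&task->perf_event_mutex);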

I am going to report a couple more bugs (at least, they look like
bugs when I try to understand the code ;), probably they
should be fixed first.

Oleg.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [PATCH 0/2] Was: Q: sys_perf_event_open() && PF_EXITING
  2010-11-08 14:57 ` Q: sys_perf_event_open() && PF_EXITING Oleg Nesterov
@ 2011-01-19 18:21   ` Oleg Nesterov
  2011-01-19 18:22     ` [PATCH 1/2] perf: fix find_get_context() vs perf_event_exit_task() race Oleg Nesterov
                       ` (2 more replies)
  0 siblings, 3 replies; 91+ messages in thread
From: Oleg Nesterov @ 2011-01-19 18:21 UTC (permalink / raw)
  To: Alan Stern, Arnaldo Carvalho de Melo, Frederic Weisbecker,
	Ingo Molnar, Paul Mackerras, Peter Zijlstra, Prasad,
	Roland McGrath
  Cc: linux-kernel

On 11/08, Oleg Nesterov wrote:
>
> I am puzzled by PF_EXITING check in find_lively_task_by_vpid().
>
> How can it help? The task can call do_exit() right after the check.
>
> And why do we need it? The comment only says "Can't attach events to
> a dying task". Maybe it tries protect sys_perf_event_open() against
> perf_event_exit_task_context(), but it can't.

Yes.

Please see 1/2. Well, I can't say I really like the idea of reusing
task->perf_event_mutex, but I do not see a better fix.

Also, I have no idea how I can actually test the changes in code
I can hardly understand, please review.

Also, I believe there are more problems in perf_install_in_context(), but
I need to recheck.

> Hmm. jump_label_inc/dec looks obviously racy too. Say, free_event() races
> with perf_event_alloc(). There is a window between atomic_xxx() and
> jump_label_update(), afaics it is possible to call jump_label_disable()
> when perf_task_events/perf_swevent_enabled != 0.

Another issue...
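
Schematically, with the atomic op done first and jump_label_update()
done later (my reading of the current code):

	free_event()				perf_event_alloc()
	atomic_dec_and_test() -> true
						atomic_inc_return() -> 1
						jump_label_enable()
	jump_label_disable()
	/* the counter is 1, but the jump label ends up disabled */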

Oleg.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [PATCH 1/2] perf: fix find_get_context() vs perf_event_exit_task() race
  2011-01-19 18:21   ` [PATCH 0/2] Was: " Oleg Nesterov
@ 2011-01-19 18:22     ` Oleg Nesterov
  2011-01-19 18:49       ` Peter Zijlstra
  2011-01-19 19:18       ` [tip:perf/urgent] perf: Fix " tip-bot for Oleg Nesterov
  2011-01-19 18:22     ` [PATCH 2/2] perf: fix perf_event_init_task()/perf_event_free_task() interaction Oleg Nesterov
  2011-01-20 19:30     ` Q: perf_install_in_context/perf_event_enable are racy? Oleg Nesterov
  2 siblings, 2 replies; 91+ messages in thread
From: Oleg Nesterov @ 2011-01-19 18:22 UTC (permalink / raw)
  To: Alan Stern, Arnaldo Carvalho de Melo, Frederic Weisbecker,
	Ingo Molnar, Paul Mackerras, Peter Zijlstra, Prasad,
	Roland McGrath
  Cc: linux-kernel

find_get_context() must not install the new perf_event_context if the
task has already passed perf_event_exit_task().

If nothing else, this means a memory leak. Initially ctx->refcount == 2,
it is supposed that perf_event_exit_task_context() should participate and
do the necessary put_ctx().

find_lively_task_by_vpid() checks PF_EXITING but this buys nothing: by the
time we call find_get_context() this task can already be dead. To the point,
cmpxchg() can succeed when the task has already done the last schedule().

Change find_get_context() to populate task->perf_event_ctxp[] under
task->perf_event_mutex, this way we can trust PF_EXITING because
perf_event_exit_task() takes the same mutex.

Also, change perf_event_exit_task_context() to use rcu_dereference().
Probably this is not strictly needed, but with or without this change
find_get_context() can race with setup_new_exec()->perf_event_exit_task(),
rcu_dereference() looks better.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 kernel/perf_event.c |   34 ++++++++++++++++++++--------------
 1 file changed, 20 insertions(+), 14 deletions(-)

--- git/kernel/perf_event.c~3_find_get_context_vs_exit	2011-01-18 16:57:08.000000000 +0100
+++ git/kernel/perf_event.c	2011-01-19 17:41:16.000000000 +0100
@@ -2201,13 +2201,6 @@ find_lively_task_by_vpid(pid_t vpid)
 	if (!task)
 		return ERR_PTR(-ESRCH);
 
-	/*
-	 * Can't attach events to a dying task.
-	 */
-	err = -ESRCH;
-	if (task->flags & PF_EXITING)
-		goto errout;
-
 	/* Reuse ptrace permission checks for now. */
 	err = -EACCES;
 	if (!ptrace_may_access(task, PTRACE_MODE_READ))
@@ -2268,14 +2261,27 @@ retry:
 
 		get_ctx(ctx);
 
-		if (cmpxchg(&task->perf_event_ctxp[ctxn], NULL, ctx)) {
-			/*
-			 * We raced with some other task; use
-			 * the context they set.
-			 */
+		err = 0;
+		mutex_lock(&task->perf_event_mutex);
+		/*
+		 * If it has already passed perf_event_exit_task(),
+		 * we must see PF_EXITING: it takes this mutex too.
+		 */
+		if (task->flags & PF_EXITING)
+			err = -ESRCH;
+		else if (task->perf_event_ctxp[ctxn])
+			err = -EAGAIN;
+		else
+			rcu_assign_pointer(task->perf_event_ctxp[ctxn], ctx);
+		mutex_unlock(&task->perf_event_mutex);
+
+		if (unlikely(err)) {
 			put_task_struct(task);
 			kfree(ctx);
-			goto retry;
+
+			if (err == -EAGAIN)
+				goto retry;
+			goto errout;
 		}
 	}
 
@@ -6127,7 +6133,7 @@ static void perf_event_exit_task_context
 	 * scheduled, so we are now safe from rescheduling changing
 	 * our context.
 	 */
-	child_ctx = child->perf_event_ctxp[ctxn];
+	child_ctx = rcu_dereference(child->perf_event_ctxp[ctxn]);
 	task_ctx_sched_out(child_ctx, EVENT_ALL);
 
 	/*


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [PATCH 2/2] perf: fix perf_event_init_task()/perf_event_free_task() interaction
  2011-01-19 18:21   ` [PATCH 0/2] Was: " Oleg Nesterov
  2011-01-19 18:22     ` [PATCH 1/2] perf: fix find_get_context() vs perf_event_exit_task() race Oleg Nesterov
@ 2011-01-19 18:22     ` Oleg Nesterov
  2011-01-19 18:51       ` Peter Zijlstra
  2011-01-19 19:19       ` [tip:perf/urgent] perf: Fix " tip-bot for Oleg Nesterov
  2011-01-20 19:30     ` Q: perf_install_in_context/perf_event_enable are racy? Oleg Nesterov
  2 siblings, 2 replies; 91+ messages in thread
From: Oleg Nesterov @ 2011-01-19 18:22 UTC (permalink / raw)
  To: Alan Stern, Arnaldo Carvalho de Melo, Frederic Weisbecker,
	Ingo Molnar, Paul Mackerras, Peter Zijlstra, Prasad,
	Roland McGrath
  Cc: linux-kernel

perf_event_init_task() should clear child->perf_event_ctxp[] before
anything else. Otherwise, if perf_event_init_context(perf_hw_context)
fails, perf_event_free_task() can free perf_event_ctxp[perf_sw_context]
copied from parent->perf_event_ctxp[] by dup_task_struct().

Also move the initialization of perf_event_mutex and perf_event_list
from perf_event_init_context() to perf_event_init_task().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 kernel/perf_event.c |    9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

--- git/kernel/perf_event.c~4_perf_event_init_task	2011-01-19 17:41:16.000000000 +0100
+++ git/kernel/perf_event.c	2011-01-19 18:49:23.000000000 +0100
@@ -6446,11 +6446,6 @@ int perf_event_init_context(struct task_
 	unsigned long flags;
 	int ret = 0;
 
-	child->perf_event_ctxp[ctxn] = NULL;
-
-	mutex_init(&child->perf_event_mutex);
-	INIT_LIST_HEAD(&child->perf_event_list);
-
 	if (likely(!parent->perf_event_ctxp[ctxn]))
 		return 0;
 
@@ -6540,6 +6535,10 @@ int perf_event_init_task(struct task_str
 {
 	int ctxn, ret;
 
+	memset(child->perf_event_ctxp, 0, sizeof(child->perf_event_ctxp));
+	mutex_init(&child->perf_event_mutex);
+	INIT_LIST_HEAD(&child->perf_event_list);
+
 	for_each_task_context_nr(ctxn) {
 		ret = perf_event_init_context(child, ctxn);
 		if (ret)


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 1/2] perf: fix find_get_context() vs perf_event_exit_task() race
  2011-01-19 18:22     ` [PATCH 1/2] perf: fix find_get_context() vs perf_event_exit_task() race Oleg Nesterov
@ 2011-01-19 18:49       ` Peter Zijlstra
  2011-01-19 19:18       ` [tip:perf/urgent] perf: Fix " tip-bot for Oleg Nesterov
  1 sibling, 0 replies; 91+ messages in thread
From: Peter Zijlstra @ 2011-01-19 18:49 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Alan Stern, Arnaldo Carvalho de Melo, Frederic Weisbecker,
	Ingo Molnar, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On Wed, 2011-01-19 at 19:22 +0100, Oleg Nesterov wrote:
> find_get_context() must not install the new perf_event_context if the
> task has already passed perf_event_exit_task().
> 
> If nothing else, this means a memory leak. Initially ctx->refcount == 2,
> it is supposed that perf_event_exit_task_context() should participate and
> do the necessary put_ctx().
> 
> find_lively_task_by_vpid() checks PF_EXITING but this buys nothing: by the
> time we call find_get_context() this task can already be dead. To the point,
> cmpxchg() can succeed when the task has already done the last schedule().
> 
> Change find_get_context() to populate task->perf_event_ctxp[] under
> task->perf_event_mutex, this way we can trust PF_EXITING because
> perf_event_exit_task() takes the same mutex.
> 
> Also, change perf_event_exit_task_context() to use rcu_dereference().
> Probably this is not strictly needed, but with or without this change
> find_get_context() can race with setup_new_exec()->perf_event_exit_task(),
> rcu_dereference() looks better.

I think initially the idea was that this race couldn't happen because by
that time we would be unhashed from the pidhash and thus invisible for
new events; however, from what I can make of the exit path, we get
unhashed in exit_notify() which is _after_ perf_event_exit_task(), so
yes this looks to be a proper fix.
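
That is, roughly this ordering in the exit path (from my reading):

	do_exit()
	  perf_event_exit_task()	/* detaches the ctx here ... */
	  ...
	  exit_notify()			/* ... but only here do we become
					   invisible to find_task_by_vpid() */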

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 2/2] perf: fix perf_event_init_task()/perf_event_free_task() interaction
  2011-01-19 18:22     ` [PATCH 2/2] perf: fix perf_event_init_task()/perf_event_free_task() interaction Oleg Nesterov
@ 2011-01-19 18:51       ` Peter Zijlstra
  2011-01-19 19:19       ` [tip:perf/urgent] perf: Fix " tip-bot for Oleg Nesterov
  1 sibling, 0 replies; 91+ messages in thread
From: Peter Zijlstra @ 2011-01-19 18:51 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Alan Stern, Arnaldo Carvalho de Melo, Frederic Weisbecker,
	Ingo Molnar, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On Wed, 2011-01-19 at 19:22 +0100, Oleg Nesterov wrote:
> perf_event_init_task() should clear child->perf_event_ctxp[] before
> anything else. Otherwise, if perf_event_init_context(perf_hw_context)
> fails, perf_event_free_task() can free perf_event_ctxp[perf_sw_context]
> copied from parent->perf_event_ctxp[] by dup_task_struct().
> 
> Also move the initialization of perf_event_mutex and perf_event_list
> from perf_event_init_context() to perf_event_init_task().
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

Another fine find.

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* [tip:perf/urgent] perf: Fix find_get_context() vs perf_event_exit_task() race
  2011-01-19 18:22     ` [PATCH 1/2] perf: fix find_get_context() vs perf_event_exit_task() race Oleg Nesterov
  2011-01-19 18:49       ` Peter Zijlstra
@ 2011-01-19 19:18       ` tip-bot for Oleg Nesterov
  2011-01-21 15:29         ` Ingo Molnar
  1 sibling, 1 reply; 91+ messages in thread
From: tip-bot for Oleg Nesterov @ 2011-01-19 19:18 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, paulus, acme, hpa, mingo, stern, a.p.zijlstra,
	roland, fweisbec, tglx, oleg, mingo, prasad

Commit-ID:  dbe08d82ce3967ccdf459f7951d02589cf967300
Gitweb:     http://git.kernel.org/tip/dbe08d82ce3967ccdf459f7951d02589cf967300
Author:     Oleg Nesterov <oleg@redhat.com>
AuthorDate: Wed, 19 Jan 2011 19:22:07 +0100
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Wed, 19 Jan 2011 20:04:27 +0100

perf: Fix find_get_context() vs perf_event_exit_task() race

find_get_context() must not install the new perf_event_context
if the task has already passed perf_event_exit_task().

If nothing else, this means a memory leak. Initially
ctx->refcount == 2, it is supposed that
perf_event_exit_task_context() should participate and do the
necessary put_ctx().

find_lively_task_by_vpid() checks PF_EXITING but this buys
nothing: by the time we call find_get_context() this task can
already be dead. To the point, cmpxchg() can succeed when the task
has already done the last schedule().

Change find_get_context() to populate task->perf_event_ctxp[]
under task->perf_event_mutex, this way we can trust PF_EXITING
because perf_event_exit_task() takes the same mutex.

Also, change perf_event_exit_task_context() to use
rcu_dereference(). Probably this is not strictly needed, but
with or without this change find_get_context() can race with
setup_new_exec()->perf_event_exit_task(), rcu_dereference()
looks better.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Prasad <prasad@linux.vnet.ibm.com>
Cc: Roland McGrath <roland@redhat.com>
LKML-Reference: <20110119182207.GB12183@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/perf_event.c |   34 ++++++++++++++++++++--------------
 1 files changed, 20 insertions(+), 14 deletions(-)

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 84522c7..4ec55ef 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -2201,13 +2201,6 @@ find_lively_task_by_vpid(pid_t vpid)
 	if (!task)
 		return ERR_PTR(-ESRCH);
 
-	/*
-	 * Can't attach events to a dying task.
-	 */
-	err = -ESRCH;
-	if (task->flags & PF_EXITING)
-		goto errout;
-
 	/* Reuse ptrace permission checks for now. */
 	err = -EACCES;
 	if (!ptrace_may_access(task, PTRACE_MODE_READ))
@@ -2268,14 +2261,27 @@ retry:
 
 		get_ctx(ctx);
 
-		if (cmpxchg(&task->perf_event_ctxp[ctxn], NULL, ctx)) {
-			/*
-			 * We raced with some other task; use
-			 * the context they set.
-			 */
+		err = 0;
+		mutex_lock(&task->perf_event_mutex);
+		/*
+		 * If it has already passed perf_event_exit_task(),
+		 * we must see PF_EXITING: it takes this mutex too.
+		 */
+		if (task->flags & PF_EXITING)
+			err = -ESRCH;
+		else if (task->perf_event_ctxp[ctxn])
+			err = -EAGAIN;
+		else
+			rcu_assign_pointer(task->perf_event_ctxp[ctxn], ctx);
+		mutex_unlock(&task->perf_event_mutex);
+
+		if (unlikely(err)) {
 			put_task_struct(task);
 			kfree(ctx);
-			goto retry;
+
+			if (err == -EAGAIN)
+				goto retry;
+			goto errout;
 		}
 	}
 
@@ -6127,7 +6133,7 @@ static void perf_event_exit_task_context(struct task_struct *child, int ctxn)
 	 * scheduled, so we are now safe from rescheduling changing
 	 * our context.
 	 */
-	child_ctx = child->perf_event_ctxp[ctxn];
+	child_ctx = rcu_dereference(child->perf_event_ctxp[ctxn]);
 	task_ctx_sched_out(child_ctx, EVENT_ALL);
 
 	/*

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [tip:perf/urgent] perf: Fix perf_event_init_task()/perf_event_free_task() interaction
  2011-01-19 18:22     ` [PATCH 2/2] perf: fix perf_event_init_task()/perf_event_free_task() interaction Oleg Nesterov
  2011-01-19 18:51       ` Peter Zijlstra
@ 2011-01-19 19:19       ` tip-bot for Oleg Nesterov
  1 sibling, 0 replies; 91+ messages in thread
From: tip-bot for Oleg Nesterov @ 2011-01-19 19:19 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, paulus, acme, hpa, mingo, stern, a.p.zijlstra,
	roland, fweisbec, tglx, oleg, mingo, prasad

Commit-ID:  8550d7cb6ed6c89add49c3b6ad4c753ab8a3d7f9
Gitweb:     http://git.kernel.org/tip/8550d7cb6ed6c89add49c3b6ad4c753ab8a3d7f9
Author:     Oleg Nesterov <oleg@redhat.com>
AuthorDate: Wed, 19 Jan 2011 19:22:28 +0100
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Wed, 19 Jan 2011 20:04:28 +0100

perf: Fix perf_event_init_task()/perf_event_free_task() interaction

perf_event_init_task() should clear child->perf_event_ctxp[]
before anything else. Otherwise, if
perf_event_init_context(perf_hw_context) fails,
perf_event_free_task() can free perf_event_ctxp[perf_sw_context]
copied from parent->perf_event_ctxp[] by dup_task_struct().

Also move the initialization of perf_event_mutex and
perf_event_list from perf_event_init_context() to
perf_event_init_task().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Prasad <prasad@linux.vnet.ibm.com>
Cc: Roland McGrath <roland@redhat.com>
LKML-Reference: <20110119182228.GC12183@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/perf_event.c |    9 ++++-----
 1 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 4ec55ef..244ca3a 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -6446,11 +6446,6 @@ int perf_event_init_context(struct task_struct *child, int ctxn)
 	unsigned long flags;
 	int ret = 0;
 
-	child->perf_event_ctxp[ctxn] = NULL;
-
-	mutex_init(&child->perf_event_mutex);
-	INIT_LIST_HEAD(&child->perf_event_list);
-
 	if (likely(!parent->perf_event_ctxp[ctxn]))
 		return 0;
 
@@ -6539,6 +6534,10 @@ int perf_event_init_task(struct task_struct *child)
 {
 	int ctxn, ret;
 
+	memset(child->perf_event_ctxp, 0, sizeof(child->perf_event_ctxp));
+	mutex_init(&child->perf_event_mutex);
+	INIT_LIST_HEAD(&child->perf_event_list);
+
 	for_each_task_context_nr(ctxn) {
 		ret = perf_event_init_context(child, ctxn);
 		if (ret)

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* Re: Q: perf_event && task->ptrace_bps[]
  2011-01-19 15:37     ` Oleg Nesterov
@ 2011-01-19 20:05       ` Frederic Weisbecker
  2011-01-20 17:28         ` Oleg Nesterov
  0 siblings, 1 reply; 91+ messages in thread
From: Frederic Weisbecker @ 2011-01-19 20:05 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Alan Stern, Arnaldo Carvalho de Melo, Ingo Molnar,
	Paul Mackerras, Peter Zijlstra, Prasad, Roland McGrath,
	linux-kernel

On Wed, Jan 19, 2011 at 04:37:46PM +0100, Oleg Nesterov wrote:
> On 01/18, Frederic Weisbecker wrote:
> > > Probably the best fix is to change this code so that the tracer owns
> > > ->ptrace_bps[], not the tracee. But this is not trivial, and needs a
> > > lot of changes in ptrace code.
> >
> > How much complicated would it be?
> 
> The problem is, ptrace_detach/release_task can't sleep currently.
> The necessary changes are nasty.

Ok, let's forget that then :)
 
> > Because I see three solutions to solve this:
> >
> > - Have a mutex inside thread->ptrace_bps. The contention must be
> > rare and only concern ptrace and tracee exit. That's the simplest.
> 
> I think we can reuse perf_event_mutex for this. Not very good too,
> but simple. But this depends on what can we do under this mutex...

That could work. I feel a bit uncomfortable using a perf-related
mutex for that though. I can't figure out any deadlock with the current
state, but if we are going to use that solution, perf events will be
created/destroyed/disabled/enabled under that mutex. Maybe that large
coverage might create some dependency problems in the future. I don't
know...

Dunno, that doesn't seem to be a good use of perf_event_mutex.

I had another idea based on a refcount. Can you have a look?
The drawback is adding an entry to task_struct. OTOH I can drop
more of them for the no-running-breakpoint case from thread_struct
in a subsequent patch.

Note the problem touches more archs than x86. Basically every
arch that uses breakpoints uses a similar scheme that must be fixed.

(only compile tested yet)

diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 45892dc..08d79f9 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -608,6 +608,9 @@ static int ptrace_write_dr7(struct task_struct *tsk, unsigned long data)
 	unsigned len, type;
 	struct perf_event *bp;
 
+	if (ptrace_get_breakpoints(tsk) < 0)
+		return -ESRCH;
+
 	data &= ~DR_CONTROL_RESERVED;
 	old_dr7 = ptrace_get_dr7(thread->ptrace_bps);
 restore:
@@ -655,6 +658,9 @@ restore:
 		}
 		goto restore;
 	}
+
+	ptrace_put_breakpoints(tsk);
+
 	return ((orig_ret < 0) ? orig_ret : rc);
 }
 
@@ -668,10 +674,17 @@ static unsigned long ptrace_get_debugreg(struct task_struct *tsk, int n)
 
 	if (n < HBP_NUM) {
 		struct perf_event *bp;
+
+		if (ptrace_get_breakpoints(tsk) < 0)
+			return -ESRCH;
+
 		bp = thread->ptrace_bps[n];
 		if (!bp)
-			return 0;
-		val = bp->hw.info.address;
+			val = 0;
+		else
+			val = bp->hw.info.address;
+
+		ptrace_put_breakpoints(tsk);
 	} else if (n == 6) {
 		val = thread->debugreg6;
 	 } else if (n == 7) {
@@ -686,6 +699,10 @@ static int ptrace_set_breakpoint_addr(struct task_struct *tsk, int nr,
 	struct perf_event *bp;
 	struct thread_struct *t = &tsk->thread;
 	struct perf_event_attr attr;
+	int err = 0;
+
+	if (ptrace_get_breakpoints(tsk) < 0)
+		return -ESRCH;
 
 	if (!t->ptrace_bps[nr]) {
 		ptrace_breakpoint_init(&attr);
@@ -709,24 +726,23 @@ static int ptrace_set_breakpoint_addr(struct task_struct *tsk, int nr,
 		 * writing for the user. And anyway this is the previous
 		 * behaviour.
 		 */
-		if (IS_ERR(bp))
-			return PTR_ERR(bp);
+		if (IS_ERR(bp)) {
+			err = PTR_ERR(bp);
+			goto put;
+		}
 
 		t->ptrace_bps[nr] = bp;
 	} else {
-		int err;
-
 		bp = t->ptrace_bps[nr];
 
 		attr = bp->attr;
 		attr.bp_addr = addr;
 		err = modify_user_hw_breakpoint(bp, &attr);
-		if (err)
-			return err;
 	}
 
-
-	return 0;
+put:
+	ptrace_put_breakpoints(tsk);
+	return err;
 }
 
 /*
diff --git a/include/linux/ptrace.h b/include/linux/ptrace.h
index 092a04f..519f03c 100644
--- a/include/linux/ptrace.h
+++ b/include/linux/ptrace.h
@@ -192,6 +192,9 @@ static inline void ptrace_init_task(struct task_struct *child, bool ptrace)
 		child->ptrace = current->ptrace;
 		__ptrace_link(child, current->parent);
 	}
+#ifdef CONFIG_HAVE_HW_BREAKPOINT
+	atomic_set(&child->ptrace_bp_refcnt, 1);
+#endif
 }
 
 /**
@@ -352,7 +355,12 @@ static inline void user_single_step_siginfo(struct task_struct *tsk,
 extern int task_current_syscall(struct task_struct *target, long *callno,
 				unsigned long args[6], unsigned int maxargs,
 				unsigned long *sp, unsigned long *pc);
-
-#endif
+#ifdef CONFIG_HAVE_HW_BREAKPOINT
+extern int ptrace_get_breakpoints(struct task_struct *tsk);
+extern void ptrace_put_breakpoints(struct task_struct *tsk);
+#else
+static inline void ptrace_put_breakpoints(struct task_struct *tsk) { }
+#endif /* CONFIG_HAVE_HW_BREAKPOINT */
+#endif /* __KERNEL */
 
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 777cd01..523a1c5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1526,6 +1526,9 @@ struct task_struct {
 		unsigned long memsw_bytes; /* uncharged mem+swap usage */
 	} memcg_batch;
 #endif
+#ifdef CONFIG_HAVE_HW_BREAKPOINT
+	atomic_t ptrace_bp_refcnt;
+#endif
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/kernel/exit.c b/kernel/exit.c
index 8cb8904..5284464 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1015,7 +1015,7 @@ NORET_TYPE void do_exit(long code)
 	/*
 	 * FIXME: do that only when needed, using sched_exit tracepoint
 	 */
-	flush_ptrace_hw_breakpoint(tsk);
+	ptrace_put_breakpoints(tsk);
 
 	exit_notify(tsk, group_dead);
 #ifdef CONFIG_NUMA
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 99bbaa3..23394f1 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -22,6 +22,7 @@
 #include <linux/syscalls.h>
 #include <linux/uaccess.h>
 #include <linux/regset.h>
+#include <linux/hw_breakpoint.h>
 
 
 /*
@@ -876,3 +877,19 @@ asmlinkage long compat_sys_ptrace(compat_long_t request, compat_long_t pid,
 	return ret;
 }
 #endif	/* CONFIG_COMPAT */
+
+#ifdef CONFIG_HAVE_HW_BREAKPOINT
+int ptrace_get_breakpoints(struct task_struct *tsk)
+{
+	if (atomic_inc_not_zero(&tsk->ptrace_bp_refcnt))
+		return 0;
+
+	return -1;
+}
+
+void ptrace_put_breakpoints(struct task_struct *tsk)
+{
+	if (!atomic_dec_return(&tsk->ptrace_bp_refcnt))
+		flush_ptrace_hw_breakpoint(tsk);
+}
+#endif /* CONFIG_HAVE_HW_BREAKPOINT */

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* Re: Q: perf_event && task->ptrace_bps[]
  2011-01-19 20:05       ` Frederic Weisbecker
@ 2011-01-20 17:28         ` Oleg Nesterov
  2011-01-28 17:41           ` Frederic Weisbecker
  0 siblings, 1 reply; 91+ messages in thread
From: Oleg Nesterov @ 2011-01-20 17:28 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Alan Stern, Arnaldo Carvalho de Melo, Ingo Molnar,
	Paul Mackerras, Peter Zijlstra, Prasad, Roland McGrath,
	linux-kernel

On 01/19, Frederic Weisbecker wrote:
>
> On Wed, Jan 19, 2011 at 04:37:46PM +0100, Oleg Nesterov wrote:
> >
> > I think we can reuse perf_event_mutex for this. Not very good too,
> > but simple. But this depends on what can we do under this mutex...
>
> That could work. I feel a bit uncomfortable using a perf-related
> mutex for that though. I can't figure out any deadlock with the current
> state, but if we are going to use that solution, perf events will be
> created/destroyed/disabled/enabled under that mutex.

No, I didn't mean create/destroy under that mutex, but

> Dunno, that doesn't seem to be a good use of perf_event_mutex.

I agree anyway.

> OTOH I can drop
> more of them for the no-running-breakpoint case from thread_struct
> in a subsequent patch.

Hmm. I can't understand what you mean. Just curious, could you explain?

> Note the problem touches more archs than x86. Basically every
> arch that uses breakpoints uses a similar scheme that must be fixed.

Yes. Perhaps we should try to unify some code... Say, can't we move
->ptrace_bps[] to task_struct?


> +void ptrace_put_breakpoints(struct task_struct *tsk)
> +{
> +	if (!atomic_dec_return(&tsk->ptrace_bp_refcnt))
> +		flush_ptrace_hw_breakpoint(tsk);

(minor nit, atomic_dec_and_test() looks more natural)


I think the patch is correct and should fix the problem.

Thanks!

Oleg.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Q: perf_install_in_context/perf_event_enable are racy?
  2011-01-19 18:21   ` [PATCH 0/2] Was: " Oleg Nesterov
  2011-01-19 18:22     ` [PATCH 1/2] perf: fix find_get_context() vs perf_event_exit_task() race Oleg Nesterov
  2011-01-19 18:22     ` [PATCH 2/2] perf: fix perf_event_init_task()/perf_event_free_task() interaction Oleg Nesterov
@ 2011-01-20 19:30     ` Oleg Nesterov
  2011-01-21 12:11       ` Peter Zijlstra
  2 siblings, 1 reply; 91+ messages in thread
From: Oleg Nesterov @ 2011-01-20 19:30 UTC (permalink / raw)
  To: Alan Stern, Arnaldo Carvalho de Melo, Frederic Weisbecker,
	Ingo Molnar, Paul Mackerras, Peter Zijlstra, Prasad,
	Roland McGrath
  Cc: linux-kernel

On 01/19, Oleg Nesterov wrote:
>
> Also, I believe there are more problems in perf_install_in_context(), but
> I need to recheck.

Help! I can't believe it can be so trivially wrong, but otoh I can't
understand how this can be correct.

So, ignoring details and !task case, __perf_install_in_context() does:

	if (cpuctx->task_ctx || ctx->task != current)
		return;

	cpuctx->task_ctx = ctx;
	event_sched_in(event);

Stupid question, what if this task has already passed
perf_event_exit_task() and thus it doesn't have ->perf_event_ctxp[] ?
Given that perf_event_context_sched_out() does nothing if !ctx, who
will event_sched_out() this event?

OK, even if I am right this is trivial, we just need the additional
check.
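
Say, something along these lines (only a guess at the shape, not a
tested change):

	if (cpuctx->task_ctx || ctx->task != current)
		return;
	/* hypothetical extra check: refuse to install on a task
	 * which has already passed perf_event_exit_task() */
	if (ctx->task->flags & PF_EXITING)
		return;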



But, it seems, there is another problem. Forget about the exiting,
I can't understand why we can trust current in the code above.
With __ARCH_WANT_INTERRUPTS_ON_CTXSW schedule() does:

	// sets cpuctx->task_ctx = NULL
	perf_event_task_sched_out();

	// enables irqs
	prepare_lock_switch();


	// updates current_task
	switch_to();

What if IPI comes in the window before switch_to() ?

(the same questions for __perf_event_enable).

Oleg.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: Q: perf_install_in_context/perf_event_enable are racy?
  2011-01-20 19:30     ` Q: perf_install_in_context/perf_event_enable are racy? Oleg Nesterov
@ 2011-01-21 12:11       ` Peter Zijlstra
  2011-01-21 13:03         ` Ingo Molnar
  0 siblings, 1 reply; 91+ messages in thread
From: Peter Zijlstra @ 2011-01-21 12:11 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Alan Stern, Arnaldo Carvalho de Melo, Frederic Weisbecker,
	Ingo Molnar, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On Thu, 2011-01-20 at 20:30 +0100, Oleg Nesterov wrote:
> On 01/19, Oleg Nesterov wrote:
> >
> > Also, I believe there are more problems in perf_install_in_context(), but
> > I need to recheck.
> 
> Help! I can't believe it can be so trivially wrong, but otoh I can't
> understand how this can be correct.
> 
> So, ignoring details and !task case, __perf_install_in_context() does:
> 
> 	if (cpuctx->task_ctx || ctx->task != current)
> 		return;
> 
> 	cpuctx->task_ctx = ctx;
> 	event_sched_in(event);
> 
> Stupid question, what if this task has already passed
> perf_event_exit_task() and thus it doesn't have ->perf_event_ctxp[] ?
> Given that perf_event_context_sched_out() does nothing if !ctx, who
> will event_sched_out() this event?
> 
> OK, even if I am right this is trivial, we just need the additional
> check.

Indeed (or do the cleanup from put_ctx(), but that's too complex a
change I think).

> But, it seems, there is another problem. Forget about the exiting,
> I can't understand why we can trust current in the code above.
> With __ARCH_WANT_INTERRUPTS_ON_CTXSW schedule() does:
> 
> 	// sets cpuctx->task_ctx = NULL
> 	perf_event_task_sched_out();
> 
> 	// enables irqs
> 	prepare_lock_switch();
> 
> 
> 	// updates current_task
> 	switch_to();
> 
> What if IPI comes in the window before switch_to() ?
> 
> (the same questions for __perf_event_enable).

Ingo, do you have any insights in that, I think you wrote all that
initially?

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: Q: perf_install_in_context/perf_event_enable are racy?
  2011-01-21 12:11       ` Peter Zijlstra
@ 2011-01-21 13:03         ` Ingo Molnar
  2011-01-21 13:39           ` Peter Zijlstra
  0 siblings, 1 reply; 91+ messages in thread
From: Ingo Molnar @ 2011-01-21 13:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Alan Stern, Arnaldo Carvalho de Melo,
	Frederic Weisbecker, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Thu, 2011-01-20 at 20:30 +0100, Oleg Nesterov wrote:
> > On 01/19, Oleg Nesterov wrote:
> > >
> > > Also, I believe there are more problems in perf_install_in_context(), but
> > > I need to recheck.
> > 
> > Help! I can't believe it can be so trivially wrong, but otoh I can't
> > understand how this can be correct.
> > 
> > So, ignoring details and !task case, __perf_install_in_context() does:
> > 
> > 	if (cpuctx->task_ctx || ctx->task != current)
> > 		return;
> > 
> > 	cpuctx->task_ctx = ctx;
> > 	event_sched_in(event);
> > 
> > Stupid question, what if this task has already passed
> > perf_event_exit_task() and thus it doesn't have ->perf_event_ctxp[] ?
> > Given that perf_event_context_sched_out() does nothing if !ctx, who
> > will event_sched_out() this event?
> > 
> > OK, even if I am right this is trivial, we just need the additional
> > check.
> 
> Indeed (or do the cleanup from put_ctx(), but that's too complex a
> change I think).
> 
> > But, it seems, there is another problem. Forget about the exiting,
> > I can't understand why we can trust current in the code above.
> > With __ARCH_WANT_INTERRUPTS_ON_CTXSW schedule() does:
> > 
> > 	// sets cpuctx->task_ctx = NULL
> > 	perf_event_task_sched_out();
> > 
> > 	// enables irqs
> > 	prepare_lock_switch();
> > 
> > 
> > 	// updates current_task
> > 	switch_to();
> > 
> > What if IPI comes in the window before switch_to() ?
> > 
> > (the same questions for __perf_event_enable).
> 
> Ingo, do you have any insights in that, I think you wrote all that
> initially?

Not sure. Can an IPI come there - we have irqs disabled usually, don't we?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: Q: perf_install_in_context/perf_event_enable are racy?
  2011-01-21 13:03         ` Ingo Molnar
@ 2011-01-21 13:39           ` Peter Zijlstra
  2011-01-21 14:26             ` Oleg Nesterov
  0 siblings, 1 reply; 91+ messages in thread
From: Peter Zijlstra @ 2011-01-21 13:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Oleg Nesterov, Alan Stern, Arnaldo Carvalho de Melo,
	Frederic Weisbecker, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On Fri, 2011-01-21 at 14:03 +0100, Ingo Molnar wrote:
> > > But, it seems, there is another problem. Forget about the exiting,
> > > I can't understand why we can trust current in the code above.
> > > With __ARCH_WANT_INTERRUPTS_ON_CTXSW schedule() does:
> > > 
> > >     // sets cpuctx->task_ctx = NULL
> > >     perf_event_task_sched_out();
> > > 
> > >     // enables irqs
> > >     prepare_lock_switch();
> > > 
> > > 
> > >     // updates current_task
> > >     switch_to();
> > > 
> > > What if IPI comes in the window before switch_to() ?
> > > 
> > > (the same questions for __perf_event_enable).
> > 
> > Ingo, do you have any insights in that, I think you wrote all that
> > initially?
> 
> Not sure. Can an IPI come there - we have irqs disabled usually, don't we?

Ah, I think I see how that works:

  __perf_event_task_sched_out()
    perf_event_context_sched_out()
      if (do_switch)
        cpuctx->task_ctx = NULL;

vs

  __perf_install_in_context()
   if (cpu_ctx->task_ctx != ctx)



^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: Q: perf_install_in_context/perf_event_enable are racy?
  2011-01-21 13:39           ` Peter Zijlstra
@ 2011-01-21 14:26             ` Oleg Nesterov
  2011-01-21 15:05               ` Peter Zijlstra
  0 siblings, 1 reply; 91+ messages in thread
From: Oleg Nesterov @ 2011-01-21 14:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Alan Stern, Arnaldo Carvalho de Melo,
	Frederic Weisbecker, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On 01/21, Peter Zijlstra wrote:
>
> On Fri, 2011-01-21 at 14:03 +0100, Ingo Molnar wrote:
> > > > But, it seems, there is another problem. Forget about the exiting,
> > > > I can't understand why we can trust current in the code above.
> > > > With __ARCH_WANT_INTERRUPTS_ON_CTXSW schedule() does:
> > > >
> > > >     // sets cpuctx->task_ctx = NULL
> > > >     perf_event_task_sched_out();
> > > >
> > > >     // enables irqs
> > > >     prepare_lock_switch();
> > > >
> > > >
> > > >     // updates current_task
> > > >     switch_to();
> > > >
> > > > What if IPI comes in the window before switch_to() ?
> > > >
> > > > (the same questions for __perf_event_enable).
> > >
> > > Ingo, do you have any insights in that, I think you wrote all that
> > > initially?
> >
> > Not sure. Can an IPI come there - we have irqs disabled usually, don't we?

__ARCH_WANT_INTERRUPTS_ON_CTXSW enables irqs during prepare_task_switch()

> Ah, I think I see how that works:

Hmm. I don't...

>
>   __perf_event_task_sched_out()
>     perf_event_context_sched_out()
>       if (do_switch)
>         cpuctx->task_ctx = NULL;

exactly, this clears ->task_ctx

> vs
>
>   __perf_install_in_context()
>    if (cpu_ctx->task_ctx != ctx)

And then __perf_install_in_context() sets cpuctx->task_ctx = ctx,
because ctx->task == current && cpuctx->task_ctx == NULL.
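
So, putting it together, the interleaving I worry about (schematic):

	CPU0: schedule()			CPU1: sys_perf_event_open()
	perf_event_task_sched_out()
	  cpuctx->task_ctx = NULL
	prepare_lock_switch()
	  /* irqs on, current not yet switched */
						smp_call_function_single()
	__perf_install_in_context()
	  /* ctx->task == current and
	     cpuctx->task_ctx == NULL, so it
	     sets cpuctx->task_ctx = ctx for
	     a task which is switching out */
	switch_to()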

Oleg.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: Q: perf_install_in_context/perf_event_enable are racy?
  2011-01-21 14:26             ` Oleg Nesterov
@ 2011-01-21 15:05               ` Peter Zijlstra
  2011-01-21 20:40                 ` Frederic Weisbecker
  0 siblings, 1 reply; 91+ messages in thread
From: Peter Zijlstra @ 2011-01-21 15:05 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ingo Molnar, Alan Stern, Arnaldo Carvalho de Melo,
	Frederic Weisbecker, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On Fri, 2011-01-21 at 15:26 +0100, Oleg Nesterov wrote:
> 
> > Ah, I think I see how that works:
> 
> Hmm. I don't...
> 
> >
> >   __perf_event_task_sched_out()
> >     perf_event_context_sched_out()
> >       if (do_switch)
> >         cpuctx->task_ctx = NULL;
> 
> exactly, this clears ->task_ctx
> 
> > vs
> >
> >   __perf_install_in_context()
> >    if (cpu_ctx->task_ctx != ctx)
> 
> And then __perf_install_in_context() sets cpuctx->task_ctx = ctx,
> because ctx->task == current && cpuctx->task_ctx == NULL.

Hrm, right, so the comment suggests it should do what it doesn't :-)

It looks like Paul's a63eaf34ae60bd (perf_counter: Dynamically allocate
tasks' perf_counter_context struct), relevant hunk below, wrecked it:

@@ -568,11 +582,17 @@ static void __perf_install_in_context(void *info)
         * If this is a task context, we need to check whether it is
         * the current task context of this cpu. If not it has been
         * scheduled out before the smp call arrived.
+        * Or possibly this is the right context but it isn't
+        * on this cpu because it had no counters.
         */
-       if (ctx->task && cpuctx->task_ctx != ctx)
-               return;
+       if (ctx->task && cpuctx->task_ctx != ctx) {
+               if (cpuctx->task_ctx || ctx->task != current)
+                       return;
+               cpuctx->task_ctx = ctx;
+       }
 
        spin_lock_irqsave(&ctx->lock, flags);
+       ctx->is_active = 1;
        update_context_time(ctx);
 
        /*


I can't really seem to come up with a sane test that isn't racy with
something; my cold seems to have clogged not only my nose :/

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [tip:perf/urgent] perf: Fix find_get_context() vs perf_event_exit_task() race
  2011-01-19 19:18       ` [tip:perf/urgent] perf: Fix " tip-bot for Oleg Nesterov
@ 2011-01-21 15:29         ` Ingo Molnar
  2011-01-21 15:53           ` Oleg Nesterov
  0 siblings, 1 reply; 91+ messages in thread
From: Ingo Molnar @ 2011-01-21 15:29 UTC (permalink / raw)
  To: mingo, hpa, acme, paulus, linux-kernel, stern, a.p.zijlstra,
	fweisbec, roland, oleg, tglx, prasad
  Cc: linux-tip-commits


* tip-bot for Oleg Nesterov <oleg@redhat.com> wrote:

> Commit-ID:  dbe08d82ce3967ccdf459f7951d02589cf967300
> Gitweb:     http://git.kernel.org/tip/dbe08d82ce3967ccdf459f7951d02589cf967300
> Author:     Oleg Nesterov <oleg@redhat.com>
> AuthorDate: Wed, 19 Jan 2011 19:22:07 +0100
> Committer:  Ingo Molnar <mingo@elte.hu>
> CommitDate: Wed, 19 Jan 2011 20:04:27 +0100
> 
> perf: Fix find_get_context() vs perf_event_exit_task() race
> 
> find_get_context() must not install the new perf_event_context
> if the task has already passed perf_event_exit_task().
> 
> If nothing else, this means a memory leak. Initially
> ctx->refcount == 2, it is supposed that
> perf_event_exit_task_context() should participate and do the
> necessary put_ctx().
> 
> find_lively_task_by_vpid() checks PF_EXITING but this buys
> nothing: by the time we call find_get_context() this task can
> already be dead. To the point, cmpxchg() can succeed when the task
> has already done the last schedule().
> 
> Change find_get_context() to populate task->perf_event_ctxp[]
> under task->perf_event_mutex, this way we can trust PF_EXITING
> because perf_event_exit_task() takes the same mutex.
> 
> Also, change perf_event_exit_task_context() to use
> rcu_dereference(). Probably this is not strictly needed, but
> with or without this change find_get_context() can race with
> setup_new_exec()->perf_event_exit_task(), rcu_dereference()
> looks better.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Alan Stern <stern@rowland.harvard.edu>
> Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
> Cc: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: Prasad <prasad@linux.vnet.ibm.com>
> Cc: Roland McGrath <roland@redhat.com>
> LKML-Reference: <20110119182207.GB12183@redhat.com>
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> ---
>  kernel/perf_event.c |   34 ++++++++++++++++++++--------------
>  1 files changed, 20 insertions(+), 14 deletions(-)
> 
> diff --git a/kernel/perf_event.c b/kernel/perf_event.c
> index 84522c7..4ec55ef 100644
> --- a/kernel/perf_event.c
> +++ b/kernel/perf_event.c
> @@ -2201,13 +2201,6 @@ find_lively_task_by_vpid(pid_t vpid)
>  	if (!task)
>  		return ERR_PTR(-ESRCH);
>  
> -	/*
> -	 * Can't attach events to a dying task.
> -	 */
> -	err = -ESRCH;
> -	if (task->flags & PF_EXITING)
> -		goto errout;
> -
>  	/* Reuse ptrace permission checks for now. */
>  	err = -EACCES;
>  	if (!ptrace_may_access(task, PTRACE_MODE_READ))
> @@ -2268,14 +2261,27 @@ retry:
>  
>  		get_ctx(ctx);
>  
> -		if (cmpxchg(&task->perf_event_ctxp[ctxn], NULL, ctx)) {
> -			/*
> -			 * We raced with some other task; use
> -			 * the context they set.
> -			 */
> +		err = 0;
> +		mutex_lock(&task->perf_event_mutex);
> +		/*
> +		 * If it has already passed perf_event_exit_task(),
> +		 * we must see PF_EXITING: it takes this mutex too.
> +		 */
> +		if (task->flags & PF_EXITING)
> +			err = -ESRCH;
> +		else if (task->perf_event_ctxp[ctxn])
> +			err = -EAGAIN;
> +		else
> +			rcu_assign_pointer(task->perf_event_ctxp[ctxn], ctx);
> +		mutex_unlock(&task->perf_event_mutex);
> +
> +		if (unlikely(err)) {
>  			put_task_struct(task);
>  			kfree(ctx);
> -			goto retry;
> +
> +			if (err == -EAGAIN)
> +				goto retry;
> +			goto errout;
>  		}
>  	}
>  
> @@ -6127,7 +6133,7 @@ static void perf_event_exit_task_context(struct task_struct *child, int ctxn)
>  	 * scheduled, so we are now safe from rescheduling changing
>  	 * our context.
>  	 */
> -	child_ctx = child->perf_event_ctxp[ctxn];
> +	child_ctx = rcu_dereference(child->perf_event_ctxp[ctxn]);
>  	task_ctx_sched_out(child_ctx, EVENT_ALL);
>  
>  	/*

hm, this one's causing:

 [   25.557579] ===================================================
 [   25.561361] [ INFO: suspicious rcu_dereference_check() usage. ]
 [   25.561361] ---------------------------------------------------
 [   25.561361] kernel/perf_event.c:6136 invoked rcu_dereference_check() without protection!
 [   25.561361]
 [   25.561361] other info that might help us debug this:
 [   25.561361]
 [   25.561361]
 [   25.561361] rcu_scheduler_active = 1, debug_locks = 0
 [   25.561361] no locks held by true/1397.
 [   25.561361]
 [   25.561361] stack backtrace:
 [   25.561361] Pid: 1397, comm: true Not tainted 2.6.38-rc1-tip+ #86752
 [   25.561361] Call Trace:
 [   25.561361]  [<ffffffff8106cd98>] ? lockdep_rcu_dereference+0xaa/0xb3
 [   25.561361]  [<ffffffff810b34ee>] ? perf_event_exit_task+0x118/0x22a
 [   25.561361]  [<ffffffff811133b8>] ? free_fs_struct+0x44/0x48
 [   25.561361]  [<ffffffff810434ef>] ? do_exit+0x2c8/0x770
 [   25.561361]  [<ffffffff813a52ed>] ? retint_swapgs+0xe/0x13
 [   25.561361]  [<ffffffff81043c3c>] ? do_group_exit+0x82/0xad
 [   25.561361]  [<ffffffff81043c7e>] ? sys_exit_group+0x17/0x1b
 [   25.561361]  [<ffffffff81002acb>] ? system_call_fastpath+0x16/0x1b

Any ideas?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [tip:perf/urgent] perf: Fix find_get_context() vs perf_event_exit_task() race
  2011-01-21 15:29         ` Ingo Molnar
@ 2011-01-21 15:53           ` Oleg Nesterov
  2011-01-21 17:45             ` [PATCH] perf: perf_event_exit_task_context: s/rcu_dereference/rcu_dereference_raw/ Oleg Nesterov
  0 siblings, 1 reply; 91+ messages in thread
From: Oleg Nesterov @ 2011-01-21 15:53 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: mingo, hpa, acme, paulus, linux-kernel, stern, a.p.zijlstra,
	fweisbec, roland, tglx, prasad, linux-tip-commits

On 01/21, Ingo Molnar wrote:
>
> * tip-bot for Oleg Nesterov <oleg@redhat.com> wrote:
>
> > @@ -6127,7 +6133,7 @@ static void perf_event_exit_task_context(struct task_struct *child, int ctxn)
> >  	 * scheduled, so we are now safe from rescheduling changing
> >  	 * our context.
> >  	 */
> > -	child_ctx = child->perf_event_ctxp[ctxn];
> > +	child_ctx = rcu_dereference(child->perf_event_ctxp[ctxn]);
> >  	task_ctx_sched_out(child_ctx, EVENT_ALL);
> >
> >  	/*
>
> hm, this one's causing:
>
>  [   25.557579] ===================================================
>  [   25.561361] [ INFO: suspicious rcu_dereference_check() usage. ]


Oh, indeed, I am stupid!

I added rcu_dereference() because it has smp_read_barrier_depends(),
but I forgot about rcu_dereference_check().

I'll send the fix soon...

Oleg.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [PATCH] perf: perf_event_exit_task_context: s/rcu_dereference/rcu_dereference_raw/
  2011-01-21 15:53           ` Oleg Nesterov
@ 2011-01-21 17:45             ` Oleg Nesterov
  2011-01-21 17:53               ` Oleg Nesterov
  2011-01-21 22:12               ` [tip:perf/urgent] " tip-bot for Oleg Nesterov
  0 siblings, 2 replies; 91+ messages in thread
From: Oleg Nesterov @ 2011-01-21 17:45 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: mingo, hpa, acme, paulus, linux-kernel, stern, a.p.zijlstra,
	fweisbec, roland, tglx, prasad, linux-tip-commits,
	Paul E. McKenney

In theory, almost every user of task->child->perf_event_ctxp[]
is wrong. find_get_context() can install the new context at any
moment, we need read_barrier_depends().

dbe08d82ce3967ccdf459f7951d02589cf967300 "perf: Fix
find_get_context() vs perf_event_exit_task() race" added
rcu_dereference() into perf_event_exit_task_context() to make
the precedent, but this makes __rcu_dereference_check() unhappy.
Use rcu_dereference_raw() to shut up the warning.

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 kernel/perf_event.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- git/kernel/perf_event.c~6_rcu_check	2011-01-19 18:49:23.000000000 +0100
+++ git/kernel/perf_event.c	2011-01-21 18:41:02.000000000 +0100
@@ -6133,7 +6133,7 @@ static void perf_event_exit_task_context
 	 * scheduled, so we are now safe from rescheduling changing
 	 * our context.
 	 */
-	child_ctx = rcu_dereference(child->perf_event_ctxp[ctxn]);
+	child_ctx = rcu_dereference_raw(child->perf_event_ctxp[ctxn]);
 	task_ctx_sched_out(child_ctx, EVENT_ALL);
 
 	/*


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH] perf: perf_event_exit_task_context: s/rcu_dereference/rcu_dereference_raw/
  2011-01-21 17:45             ` [PATCH] perf: perf_event_exit_task_context: s/rcu_dereference/rcu_dereference_raw/ Oleg Nesterov
@ 2011-01-21 17:53               ` Oleg Nesterov
  2011-01-21 21:50                 ` Paul E. McKenney
  2011-01-21 22:12               ` [tip:perf/urgent] " tip-bot for Oleg Nesterov
  1 sibling, 1 reply; 91+ messages in thread
From: Oleg Nesterov @ 2011-01-21 17:53 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: mingo, hpa, acme, paulus, linux-kernel, stern, a.p.zijlstra,
	fweisbec, roland, tglx, prasad, linux-tip-commits,
	Paul E. McKenney

On 01/21, Oleg Nesterov wrote:
>
> In theory, almost every user of task->child->perf_event_ctxp[]
> is wrong. find_get_context() can install the new context at any
> moment, we need read_barrier_depends().

And perhaps it makes sense to fix them all, although the problem
is only theoretical.

> dbe08d82ce3967ccdf459f7951d02589cf967300 "perf: Fix
> find_get_context() vs perf_event_exit_task() race" added
> rcu_dereference() into perf_event_exit_task_context() to make
> the precedent, but this makes __rcu_dereference_check() unhappy.
> Use rcu_dereference_raw() to shut up the warning.

But rcu_dereference_raw() looks a bit confusing, and it is not
very convenient to use read_barrier_depends() directly.

Paul, maybe it makes sense to add a new trivial helper which
can be used instead?

Yes, this is only a cosmetic issue, I know ;)
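Something along these lines, perhaps (just a sketch; the name is
invented here, no such helper exists):

	/*
	 * Hypothetical helper: like rcu_dereference(), but for a pointer
	 * that can only be installed under us, never freed, so only the
	 * data-dependency barrier is needed and no RCU read lock.
	 */
	#define dereference_installed(p)			\
	({							\
		typeof(p) ____p = ACCESS_ONCE(p);		\
		smp_read_barrier_depends();			\
		____p;						\
	})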

Oleg.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: Q: perf_install_in_context/perf_event_enable are racy?
  2011-01-21 15:05               ` Peter Zijlstra
@ 2011-01-21 20:40                 ` Frederic Weisbecker
  2011-01-24 11:42                   ` Oleg Nesterov
  0 siblings, 1 reply; 91+ messages in thread
From: Frederic Weisbecker @ 2011-01-21 20:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Ingo Molnar, Alan Stern, Arnaldo Carvalho de Melo,
	Paul Mackerras, Prasad, Roland McGrath, linux-kernel

On Fri, Jan 21, 2011 at 04:05:04PM +0100, Peter Zijlstra wrote:
> On Fri, 2011-01-21 at 15:26 +0100, Oleg Nesterov wrote:
> > 
> > > Ah, I think I see how that works:
> > 
> > Hmm. I don't...
> > 
> > >
> > >   __perf_event_task_sched_out()
> > >     perf_event_context_sched_out()
> > >       if (do_switch)
> > >         cpuctx->task_ctx = NULL;
> > 
> > exactly, this clears ->task_ctx
> > 
> > > vs
> > >
> > >   __perf_install_in_context()
> > >    if (cpu_ctx->task_ctx != ctx)
> > 
> > And then __perf_install_in_context() sets cpuctx->task_ctx = ctx,
> > because ctx->task == current && cpuctx->task_ctx == NULL.
> 
> Hrm, right, so the comment suggests it should do what it doesn't :-)
> 
> It looks like Paul's a63eaf34ae60bd (perf_counter: Dynamically allocate
> tasks' perf_counter_context struct), relevant hunk below, wrecked it:
> 
> @@ -568,11 +582,17 @@ static void __perf_install_in_context(void *info)
>          * If this is a task context, we need to check whether it is
>          * the current task context of this cpu. If not it has been
>          * scheduled out before the smp call arrived.
> +        * Or possibly this is the right context but it isn't
> +        * on this cpu because it had no counters.
>          */
> -       if (ctx->task && cpuctx->task_ctx != ctx)
> -               return;
> +       if (ctx->task && cpuctx->task_ctx != ctx) {
> +               if (cpuctx->task_ctx || ctx->task != current)
> +                       return;
> +               cpuctx->task_ctx = ctx;
> +       }
>  
>         spin_lock_irqsave(&ctx->lock, flags);
> +       ctx->is_active = 1;
>         update_context_time(ctx);
>  
>         /*
> 
> 
> I can't really seem to come up with a sane test that isn't racy with
> something, my cold seems to have clogged not only my nose :/


What do you think about the following (only compile-tested yet)? It
probably needs more comments, factorizing of the checks between
perf_event_enable() and perf_install_in_context(), and a build-cond
against __ARCH_WANT_INTERRUPTS_ON_CTXSW, but the (good or bad) idea
is there.


diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index c5fa717..e97472b 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -928,6 +928,8 @@ static void add_event_to_ctx(struct perf_event *event,
 	event->tstamp_stopped = tstamp;
 }
 
+static DEFINE_PER_CPU(int, task_events_schedulable);
+
 /*
  * Cross CPU call to install and enable a performance event
  *
@@ -949,7 +951,8 @@ static void __perf_install_in_context(void *info)
 	 * on this cpu because it had no events.
 	 */
 	if (ctx->task && cpuctx->task_ctx != ctx) {
-		if (cpuctx->task_ctx || ctx->task != current)
+		if (cpuctx->task_ctx || ctx->task != current
+		    || !__get_cpu_var(task_events_schedulable))
 			return;
 		cpuctx->task_ctx = ctx;
 	}
@@ -1091,7 +1094,8 @@ static void __perf_event_enable(void *info)
 	 * event's task is the current task on this cpu.
 	 */
 	if (ctx->task && cpuctx->task_ctx != ctx) {
-		if (cpuctx->task_ctx || ctx->task != current)
+		if (cpuctx->task_ctx || ctx->task != current
+		    || !__get_cpu_var(task_events_schedulable))
 			return;
 		cpuctx->task_ctx = ctx;
 	}
@@ -1414,6 +1418,9 @@ void __perf_event_task_sched_out(struct task_struct *task,
 {
 	int ctxn;
 
+	__get_cpu_var(task_events_schedulable) = 0;
+	barrier(); /* Must be visible by enable/install_in_context IPI */
+
 	for_each_task_context_nr(ctxn)
 		perf_event_context_sched_out(task, ctxn, next);
 }
@@ -1587,6 +1594,8 @@ void __perf_event_task_sched_in(struct task_struct *task)
 	struct perf_event_context *ctx;
 	int ctxn;
 
+	__get_cpu_var(task_events_schedulable) = 1;
+
 	for_each_task_context_nr(ctxn) {
 		ctx = task->perf_event_ctxp[ctxn];
 		if (likely(!ctx))
@@ -5964,6 +5973,18 @@ SYSCALL_DEFINE5(perf_event_open,
 	WARN_ON_ONCE(ctx->parent_ctx);
 	mutex_lock(&ctx->mutex);
 
+	/*
+	 * Every pending sched switch must finish so that
+	 * we ensure every pending call to perf_event_task_sched_in/out has
+	 * finished. We ensure the next ones will correctly handle the
+	 * perf_task_events label and then the task_events_schedulable
+	 * state. So perf_install_in_context() won't install events
+	 * in the tiny race window between perf_event_task_sched_out()
+	 * and perf_event_task_sched_in() in the __ARCH_WANT_INTERRUPTS_ON_CTXSW
+	 * case.
+	 */
+	synchronize_sched();
+
 	if (move_group) {
 		perf_install_in_context(ctx, group_leader, cpu);
 		get_ctx(ctx);


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* Re: [PATCH] perf: perf_event_exit_task_context: s/rcu_dereference/rcu_dereference_raw/
  2011-01-21 17:53               ` Oleg Nesterov
@ 2011-01-21 21:50                 ` Paul E. McKenney
  2011-01-24 11:51                   ` Oleg Nesterov
  0 siblings, 1 reply; 91+ messages in thread
From: Paul E. McKenney @ 2011-01-21 21:50 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ingo Molnar, mingo, hpa, acme, paulus, linux-kernel, stern,
	a.p.zijlstra, fweisbec, roland, tglx, prasad, linux-tip-commits

On Fri, Jan 21, 2011 at 06:53:45PM +0100, Oleg Nesterov wrote:
> On 01/21, Oleg Nesterov wrote:
> >
> > In theory, almost every user of task->child->perf_event_ctxp[]
> > is wrong. find_get_context() can install the new context at any
> > moment, we need read_barrier_depends().
> 
> And perhaps it makes sense to fix them all, although the problem
> is only theoretical.
> 
> > dbe08d82ce3967ccdf459f7951d02589cf967300 "perf: Fix
> > find_get_context() vs perf_event_exit_task() race" added
> > rcu_dereference() into perf_event_exit_task_context() to make
> > the precedent, but this makes __rcu_dereference_check() unhappy.
> > Use rcu_dereference_raw() to shut up the warning.
> 
> But rcu_dereference_raw() looks a bit confusing, and it is not
> very convenient to use read_barrier_depends() directly.
> 
> Paul, maybe it makes sense to add a new trivial helper which
> can be used instead?
>
> Yes, this is only a cosmetic issue, I know ;)

Cosmetic issues can be pretty important to the poor guy trying to read
the code.  ;-)

What keeps the structure that rcu_dereference_raw() returns a pointer
to from going away?  Best would be if a lockdep condition could be
constructed from the answer to this question and added to the appropriate
rcu_dereference() primitive.
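For instance, something like this hypothetical sketch -- where the
lockdep condition is exactly the open question, assuming only the
child itself installs these pointers:

	child_ctx = rcu_dereference_check(child->perf_event_ctxp[ctxn],
					  child == current);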

							Thanx, Paul

^ permalink raw reply	[flat|nested] 91+ messages in thread

* [tip:perf/urgent] perf: perf_event_exit_task_context: s/rcu_dereference/rcu_dereference_raw/
  2011-01-21 17:45             ` [PATCH] perf: perf_event_exit_task_context: s/rcu_dereference/rcu_dereference_raw/ Oleg Nesterov
  2011-01-21 17:53               ` Oleg Nesterov
@ 2011-01-21 22:12               ` tip-bot for Oleg Nesterov
  1 sibling, 0 replies; 91+ messages in thread
From: tip-bot for Oleg Nesterov @ 2011-01-21 22:12 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, paulmck, hpa, mingo, oleg, tglx, mingo

Commit-ID:  806839b22cbda90176d7f8d421889bddd7826e93
Gitweb:     http://git.kernel.org/tip/806839b22cbda90176d7f8d421889bddd7826e93
Author:     Oleg Nesterov <oleg@redhat.com>
AuthorDate: Fri, 21 Jan 2011 18:45:47 +0100
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Fri, 21 Jan 2011 22:08:16 +0100

perf: perf_event_exit_task_context: s/rcu_dereference/rcu_dereference_raw/

In theory, almost every user of task->child->perf_event_ctxp[]
is wrong. find_get_context() can install the new context at any
moment, we need read_barrier_depends().

dbe08d82ce3967ccdf459f7951d02589cf967300 "perf: Fix
find_get_context() vs perf_event_exit_task() race" added
rcu_dereference() into perf_event_exit_task_context() to make
the precedent, but this makes __rcu_dereference_check() unhappy.
Use rcu_dereference_raw() to shut up the warning.

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: acme@redhat.com
Cc: paulus@samba.org
Cc: stern@rowland.harvard.edu
Cc: a.p.zijlstra@chello.nl
Cc: fweisbec@gmail.com
Cc: roland@redhat.com
Cc: prasad@linux.vnet.ibm.com
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
LKML-Reference: <20110121174547.GA8796@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/perf_event.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index c5fa717..126a302 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -6136,7 +6136,7 @@ static void perf_event_exit_task_context(struct task_struct *child, int ctxn)
 	 * scheduled, so we are now safe from rescheduling changing
 	 * our context.
 	 */
-	child_ctx = rcu_dereference(child->perf_event_ctxp[ctxn]);
+	child_ctx = rcu_dereference_raw(child->perf_event_ctxp[ctxn]);
 	task_ctx_sched_out(child_ctx, EVENT_ALL);
 
 	/*

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* Re: Q: perf_install_in_context/perf_event_enable are racy?
  2011-01-21 20:40                 ` Frederic Weisbecker
@ 2011-01-24 11:42                   ` Oleg Nesterov
  2011-01-26 17:53                     ` Oleg Nesterov
  0 siblings, 1 reply; 91+ messages in thread
From: Oleg Nesterov @ 2011-01-24 11:42 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, Ingo Molnar, Alan Stern,
	Arnaldo Carvalho de Melo, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On 01/21, Frederic Weisbecker wrote:
>
> +static DEFINE_PER_CPU(int, task_events_schedulable);

Yes, I think this can work. I thought about this too. The only problem
is that this doesn't make the whole code more understandable ;)

> @@ -1587,6 +1594,8 @@ void __perf_event_task_sched_in(struct task_struct *task)
>  	struct perf_event_context *ctx;
>  	int ctxn;
>
> +	__get_cpu_var(task_events_schedulable) = 1;
> +
>  	for_each_task_context_nr(ctxn) {
>  		ctx = task->perf_event_ctxp[ctxn];
>  		if (likely(!ctx))

This doesn't look right. We should set task_events_schedulable
_after_ perf_event_context_sched_in(), otherwise we have a similar
race with next.

rq->curr and current_task were already updated. __perf_install_in_context()
must not set cpuctx->task_ctx to next's ctx before perf_event_context_sched_in()
runs, because perf_event_context_sched_in() does nothing if it sees
cpuctx->task_ctx == ctx.

OTOH, if we set task_events_schedulable after for_each_task_context_nr(),
then we have another race with next, but this race is minor. If
find_get_context() + perf_install_in_context() happen in this window,
the new event won't be scheduled until next reschedules itself.
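IOW, something like this instead (sketch on top of your patch,
accepting the minor race above):

	void __perf_event_task_sched_in(struct task_struct *task)
	{
		struct perf_event_context *ctx;
		int ctxn;

		for_each_task_context_nr(ctxn) {
			ctx = task->perf_event_ctxp[ctxn];
			if (likely(!ctx))
				continue;

			perf_event_context_sched_in(ctx);
		}
		/* only now may __perf_install_in_context() set ->task_ctx */
		__get_cpu_var(task_events_schedulable) = 1;
	}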

> +	/*
> +	 * Every pending sched switch must finish so that
> +	 * we ensure every pending calls to perf_event_task_sched_in/out are
> +	 * finished. We ensure the next ones will correctly handle the
> +	 * perf_task_events label and then the task_events_schedulable
> +	 * state. So perf_install_in_context() won't install events
> +	 * in the tiny race window between perf_event_task_sched_out()
> +	 * and perf_event_task_sched_in() in the __ARCH_WANT_INTERRUPTS_ON_CTXSW
> +	 * case.
> +	 */
> +	synchronize_sched();

Yes, if perf_task_events was zero before perf_event_alloc(), then it
is possible that task_events_schedulable == 1 while schedule() is in
progress. perf_event_create_kernel_counter() needs this too.
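I mean, presumably it needs the same synchronize_sched() before the
install; something like (sketch only, the surrounding lines are from
memory, mirroring the sys_perf_event_open() hunk above):

	WARN_ON_ONCE(ctx->parent_ctx);
	mutex_lock(&ctx->mutex);
	synchronize_sched();	/* same reason as in sys_perf_event_open() */
	perf_install_in_context(ctx, event, cpu);
	mutex_unlock(&ctx->mutex);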



Frederic, All, can't we simplify this?

First, we modify __perf_install_in_context() so that it never tries
to install the event into !is_active context. IOW, it never tries
to set cpuctx->task_ctx = ctx.

Then we add a new trivial helper stop_resched_task(task) which
simply wakes up the stop thread on task_cpu(task), and thus forces
this task to reschedule.
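Something like this, I mean (untested sketch; rq->stop is the per-cpu
stop thread set up by sched_set_stop_task()):

	void stop_resched_task(struct task_struct *p)
	{
		struct rq *rq;

		preempt_disable();
		rq = task_rq(p);
		/* if p is current on its cpu, make it pass through schedule() */
		if (rq->curr == p)
			wake_up_process(rq->stop);
		preempt_enable();
	}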

Now,

	static void
	perf_install_in_context(struct perf_event_context *ctx,
				struct perf_event *event,
				int cpu)
	{
		struct task_struct *task = ctx->task;

		event->ctx = ctx;

		if (!task) {
			/*
			 * Per cpu events are installed via an smp call and
			 * the install is always successful.
			 */
			smp_call_function_single(cpu, __perf_install_in_context,
						 event, 1);
			return;
		}

		for (;;) {
			bool done, need_resched = false;

			raw_spin_lock_irq(&ctx->lock);
			done = !list_empty(&event->group_entry);
			if (!done && !ctx->is_active) {
				add_event_to_ctx(event, ctx);
				need_resched = task_running(task);
				done = true;
			}
			raw_spin_unlock_irq(&ctx->lock);

			if (done) {
				if (need_resched)
					stop_resched_task(task);
				break;
			}

			task_oncpu_function_call(task, __perf_install_in_context,
						event);
		}
	}

Yes, stop_resched_task() can't help if this task itself is the stop thread.
But the stop thread shouldn't run for a long time without rescheduling,
otherwise we already have problems.

Do you all think this makes any sense?

Oleg.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH] perf: perf_event_exit_task_context: s/rcu_dereference/rcu_dereference_raw/
  2011-01-21 21:50                 ` Paul E. McKenney
@ 2011-01-24 11:51                   ` Oleg Nesterov
  0 siblings, 0 replies; 91+ messages in thread
From: Oleg Nesterov @ 2011-01-24 11:51 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Ingo Molnar, mingo, hpa, acme, paulus, linux-kernel, stern,
	a.p.zijlstra, fweisbec, roland, tglx, prasad, linux-tip-commits

On 01/21, Paul E. McKenney wrote:
>
> On Fri, Jan 21, 2011 at 06:53:45PM +0100, Oleg Nesterov wrote:
> >
> > But rcu_dereference_raw() looks a bit confusing, and it is not
> > very convenient to use read_barrier_depends() directly.
> >
> > Paul, maybe it makes sense to add a new trivial helper which
> > can be used instead?
> >
> > Yes, this is only a cosmetic issue, I know ;)
>
> Cosmetic issues can be pretty important to the poor guy trying to read
> the code.  ;-)

Agreed!

> What keeps the structure that rcu_dereference_raw() returns a pointer
> to from going away?

It can't go away, current owns its ->perf_event_ctxp[] pointers. But
the pointer can be installed at any time by sys_perf_event_open().

Currently the code does

	ctx = current->perf_event_ctxp[ctxn];
	if (ctx)
		do_something(ctx);

and in theory we need smp_read_barrier_depends() in between.
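IOW, in theory it should be (same code with the barrier spelled out):

	ctx = current->perf_event_ctxp[ctxn];
	smp_read_barrier_depends();	/* pairs with the install in find_get_context() */
	if (ctx)
		do_something(ctx);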

> Best would be if a lockdep condition could be
> constructed from the answer to this question and added to the appropriate
> rcu_dereference() primitive.

In this case the condition is "true", so we can use rcu_dereference_raw().
The only problem is that it looks confusing, especially because you actually
need rcu_read_lock() if you look at not_current_task->perf_event_ctxp[].

Oleg.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: Q: perf_install_in_context/perf_event_enable are racy?
  2011-01-24 11:42                   ` Oleg Nesterov
@ 2011-01-26 17:53                     ` Oleg Nesterov
  2011-01-26 18:49                       ` Oleg Nesterov
  2011-01-27 13:14                       ` Peter Zijlstra
  0 siblings, 2 replies; 91+ messages in thread
From: Oleg Nesterov @ 2011-01-26 17:53 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, Ingo Molnar, Alan Stern,
	Arnaldo Carvalho de Melo, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On 01/24, Oleg Nesterov wrote:
>
> Frederic, All, can't we simplify this?

Well, to clarify, it looks simpler to me ;)

But if you don't like this approach, lets use task_events_schedulable flag.

> First, we modify __perf_install_in_context() so that it never tries
> to install the event into !is_active context. IOW, it never tries
> to set cpuctx->task_ctx = ctx.
>
> Then we add a new trivial helper stop_resched_task(task) which
> simply wakes up the stop thread on task_cpu(task), and thus forces
> this task to reschedule.
>
> ...
>
> Yes, stop_resched_task() can't help if this task itself is the stop thread.
> But the stop thread shouldn't run for a long time without rescheduling,
> otherwise we already have problems.

Please see the untested patch below. It doesn't change perf_event_enable(),
only perf_install_in_context(). Just for early review, to get your opinion.
To simplify reading, here is the code:

	void task_force_schedule(struct task_struct *p)
	{
		struct rq *rq;

		preempt_disable();
		rq = task_rq(p);
		if (rq->curr == p)
			wake_up_process(rq->stop);
		preempt_enable();
	}

	static void
	perf_install_in_context(struct perf_event_context *ctx,
				struct perf_event *event,
				int cpu)
	{
		struct task_struct *task = ctx->task;

		event->ctx = ctx;

		if (!task) {
			/*
			 * Per cpu events are installed via an smp call and
			 * the install is always successful.
			 */
			smp_call_function_single(cpu, __perf_install_in_context,
						 event, 1);
			return;
		}

		for (;;) {
			raw_spin_lock_irq(&ctx->lock);
			/*
			 * The lock prevents that this context is
			 * scheduled in, we can add the event safely.
			 */
			if (!ctx->is_active)
				add_event_to_ctx(event, ctx);
			raw_spin_unlock_irq(&ctx->lock);

			if (event->attach_state & PERF_ATTACH_CONTEXT) {
				task_force_schedule(task);
				break;
			}

			task_oncpu_function_call(task, __perf_install_in_context,
							event);
			if (event->attach_state & PERF_ATTACH_CONTEXT)
				break;
		}
	}

Oleg.

 include/linux/sched.h |    1 +
 kernel/sched.c        |   11 +++++++++++
 kernel/perf_event.c   |   49 +++++++++++++++++++++----------------------------
 3 files changed, 33 insertions(+), 28 deletions(-)

--- perf/include/linux/sched.h~1_force_resched	2011-01-14 18:21:04.000000000 +0100
+++ perf/include/linux/sched.h	2011-01-26 17:54:28.000000000 +0100
@@ -2584,6 +2584,7 @@ static inline void inc_syscw(struct task
 extern void task_oncpu_function_call(struct task_struct *p,
 				     void (*func) (void *info), void *info);
 
+extern void task_force_schedule(struct task_struct *p);
 
 #ifdef CONFIG_MM_OWNER
 extern void mm_update_next_owner(struct mm_struct *mm);
--- perf/kernel/sched.c~1_force_resched	2011-01-20 20:37:11.000000000 +0100
+++ perf/kernel/sched.c	2011-01-26 17:52:42.000000000 +0100
@@ -1968,6 +1968,17 @@ void sched_set_stop_task(int cpu, struct
 	}
 }
 
+void task_force_schedule(struct task_struct *p)
+{
+	struct rq *rq;
+
+	preempt_disable();
+	rq = task_rq(p);
+	if (rq->curr == p)
+		wake_up_process(rq->stop);
+	preempt_enable();
+}
+
 /*
  * __normal_prio - return the priority that is based on the static prio
  */
--- perf/kernel/perf_event.c~2_install_ctx_via_resched	2011-01-21 18:41:02.000000000 +0100
+++ perf/kernel/perf_event.c	2011-01-26 18:32:30.000000000 +0100
@@ -943,16 +943,10 @@ static void __perf_install_in_context(vo
 
 	/*
 	 * If this is a task context, we need to check whether it is
-	 * the current task context of this cpu. If not it has been
-	 * scheduled out before the smp call arrived.
-	 * Or possibly this is the right context but it isn't
-	 * on this cpu because it had no events.
+	 * the current task context of this cpu.
 	 */
-	if (ctx->task && cpuctx->task_ctx != ctx) {
-		if (cpuctx->task_ctx || ctx->task != current)
-			return;
-		cpuctx->task_ctx = ctx;
-	}
+	if (ctx->task && cpuctx->task_ctx != ctx)
+		return;
 
 	raw_spin_lock(&ctx->lock);
 	ctx->is_active = 1;
@@ -1030,27 +1024,26 @@ perf_install_in_context(struct perf_even
 		return;
 	}
 
-retry:
-	task_oncpu_function_call(task, __perf_install_in_context,
-				 event);
-
-	raw_spin_lock_irq(&ctx->lock);
-	/*
-	 * we need to retry the smp call.
-	 */
-	if (ctx->is_active && list_empty(&event->group_entry)) {
+	for (;;) {
+		raw_spin_lock_irq(&ctx->lock);
+		/*
+		 * The lock prevents that this context is
+		 * scheduled in, we can add the event safely.
+		 */
+		if (!ctx->is_active)
+			add_event_to_ctx(event, ctx);
 		raw_spin_unlock_irq(&ctx->lock);
-		goto retry;
-	}
 
-	/*
-	 * The lock prevents that this context is scheduled in so we
-	 * can add the event safely, if it the call above did not
-	 * succeed.
-	 */
-	if (list_empty(&event->group_entry))
-		add_event_to_ctx(event, ctx);
-	raw_spin_unlock_irq(&ctx->lock);
+		if (event->attach_state & PERF_ATTACH_CONTEXT) {
+			task_force_schedule(task);
+			break;
+		}
+
+		task_oncpu_function_call(task, __perf_install_in_context,
+						event);
+		if (event->attach_state & PERF_ATTACH_CONTEXT)
+			break;
+	}
 }
 
 /*


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: Q: perf_install_in_context/perf_event_enable are racy?
  2011-01-26 17:53                     ` Oleg Nesterov
@ 2011-01-26 18:49                       ` Oleg Nesterov
  2011-01-26 18:51                         ` [PATCH] fix the theoretical task_cpu/task_curr problem in kick_process/task_oncpu_function_call Oleg Nesterov
  2011-01-26 19:05                         ` Q: perf_install_in_context/perf_event_enable are racy? Peter Zijlstra
  2011-01-27 13:14                       ` Peter Zijlstra
  1 sibling, 2 replies; 91+ messages in thread
From: Oleg Nesterov @ 2011-01-26 18:49 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, Ingo Molnar, Alan Stern,
	Arnaldo Carvalho de Melo, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On 01/26, Oleg Nesterov wrote:
>
> Please see the untested patch below. It doesn't change perf_event_enable(),
> only perf_install_in_context().

Forgot to mention... Also, it doesn't try to fix the race with do_exit(),
this needs another change.

And, damn, can't resist. This is mostly a cosmetic issue, but I feel
discomfort every time I look at task_oncpu_function_call(). It _looks_
obviously wrong, even if the problem doesn't exist in practice. I'll
send the pedantic fix to keep the maintainers busy ;)

Oleg.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [PATCH] fix the theoretical task_cpu/task_curr problem in kick_process/task_oncpu_function_call
  2011-01-26 18:49                       ` Oleg Nesterov
@ 2011-01-26 18:51                         ` Oleg Nesterov
  2011-01-26 19:05                         ` Q: perf_install_in_context/perf_event_enable are racy? Peter Zijlstra
  1 sibling, 0 replies; 91+ messages in thread
From: Oleg Nesterov @ 2011-01-26 18:51 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Peter Zijlstra, Ingo Molnar, Alan Stern,
	Arnaldo Carvalho de Melo, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel, Frederic Weisbecker

kick_process() and task_oncpu_function_call() are not right: they
can use a dead CPU for smp_send_reschedule()/smp_call_function_single()
if try_to_wake_up() makes this task running after we read task_cpu().

Given that task_curr() is inline, this problem is purely theoretical;
the compiler doesn't read task_cpu() twice. But the code looks wrong.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 kernel/sched.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- perf/kernel/sched.c~task_cpu_vs_task_curr	2011-01-26 19:26:40.000000000 +0100
+++ perf/kernel/sched.c	2011-01-26 19:26:58.000000000 +0100
@@ -2269,7 +2269,7 @@ void kick_process(struct task_struct *p)
 
 	preempt_disable();
 	cpu = task_cpu(p);
-	if ((cpu != smp_processor_id()) && task_curr(p))
+	if ((cpu != smp_processor_id()) && (cpu_curr(cpu) == p))
 		smp_send_reschedule(cpu);
 	preempt_enable();
 }
@@ -2292,7 +2292,7 @@ void task_oncpu_function_call(struct tas
 
 	preempt_disable();
 	cpu = task_cpu(p);
-	if (task_curr(p))
+	if (cpu_curr(cpu) == p)
 		smp_call_function_single(cpu, func, info, 1);
 	preempt_enable();
 }


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: Q: perf_install_in_context/perf_event_enable are racy?
  2011-01-26 18:49                       ` Oleg Nesterov
  2011-01-26 18:51                         ` [PATCH] fix the theoretical task_cpu/task_curr problem in kick_process/task_oncpu_function_call Oleg Nesterov
@ 2011-01-26 19:05                         ` Peter Zijlstra
  2011-01-26 19:33                           ` Peter Zijlstra
  1 sibling, 1 reply; 91+ messages in thread
From: Peter Zijlstra @ 2011-01-26 19:05 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Frederic Weisbecker, Ingo Molnar, Alan Stern,
	Arnaldo Carvalho de Melo, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On Wed, 2011-01-26 at 19:49 +0100, Oleg Nesterov wrote:
> On 01/26, Oleg Nesterov wrote:
> >
> > Please see the untested patch below. It doesn't change perf_event_enable(),
> > only perf_install_in_context().
> 
> Forgot to mention... Also, it doesn't try to fix the race with do_exit(),
> this needs another change.
> 
> And, damn, can't resist. This is mostly a cosmetic issue, but I feel
> discomfort every time I look at task_oncpu_function_call(). It _looks_
> obviously wrong, even if the problem doesn't exist in practice. I'll
> send the pedantic fix to keep the maintainers busy ;)

I've been trying to sit down and work my way through it today, your last
suggestion very nearly seemed to make sense, but I kept getting
distracted.

FWIW I think perf_event_enable() has the very same issue...

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: Q: perf_install_in_context/perf_event_enable are racy?
  2011-01-26 19:05                         ` Q: perf_install_in_context/perf_event_enable are racy? Peter Zijlstra
@ 2011-01-26 19:33                           ` Peter Zijlstra
  2011-01-26 19:38                             ` Peter Zijlstra
  2011-01-26 21:19                             ` Oleg Nesterov
  0 siblings, 2 replies; 91+ messages in thread
From: Peter Zijlstra @ 2011-01-26 19:33 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Frederic Weisbecker, Ingo Molnar, Alan Stern,
	Arnaldo Carvalho de Melo, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On Wed, 2011-01-26 at 20:05 +0100, Peter Zijlstra wrote:
> On Wed, 2011-01-26 at 19:49 +0100, Oleg Nesterov wrote:
> > On 01/26, Oleg Nesterov wrote:
> > >
> > > Please see the untested patch below. It doesn't change perf_event_enable(),
> > > only perf_install_in_context().
> > 
> > Forgot to mention... Also, it doesn't try to fix the race with do_exit(),
> > this needs another change.
> > 
> > And, damn, can't resist. This is mostly a cosmetic issue, but I feel
> > discomfort every time I look at task_oncpu_function_call(). It _looks_
> > obviously wrong, even if the problem doesn't exist in practice. I'll
> > send the pedantic fix to keep the maintainers busy ;)
> 
> I've been trying to sit down and work my way through it today, your last
> suggestion very nearly seemed to make sense, but I kept getting
> distracted.
> 
> FWIW I think perf_event_enable() has the very same issue...

Wouldn't something like the below cure things too?


---
 kernel/sched.c |   23 ++++++++++++++++++++++-
 1 files changed, 22 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 18d38e4..7eadbcf 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2265,6 +2265,22 @@ void kick_process(struct task_struct *p)
 EXPORT_SYMBOL_GPL(kick_process);
 #endif /* CONFIG_SMP */
 
+struct task_function_call {
+	struct task_struct *p;
+	void (*func)(void *info);
+	void *info;
+};
+
+void task_function_trampoline(void *data)
+{
+	struct task_function_call *tfc = data;
+
+	if (this_rq()->curr != tfc->p)
+		return;
+
+	tfc->func(tfc->data);
+}
+
 /**
  * task_oncpu_function_call - call a function on the cpu on which a task runs
  * @p:		the task to evaluate
@@ -2278,11 +2294,16 @@ void task_oncpu_function_call(struct task_struct *p,
 			      void (*func) (void *info), void *info)
 {
 	int cpu;
+	struct task_function_call data = {
+		.p = p,
+		.func = func,
+		.info = info,
+	};
 
 	preempt_disable();
 	cpu = task_cpu(p);
 	if (task_curr(p))
-		smp_call_function_single(cpu, func, info, 1);
+		smp_call_function_single(cpu, task_function_trampoline, &data, 1);
 	preempt_enable();
 }
 


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* Re: Q: perf_install_in_context/perf_event_enable are racy?
  2011-01-26 19:33                           ` Peter Zijlstra
@ 2011-01-26 19:38                             ` Peter Zijlstra
  2011-01-26 21:19                             ` Oleg Nesterov
  1 sibling, 0 replies; 91+ messages in thread
From: Peter Zijlstra @ 2011-01-26 19:38 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Frederic Weisbecker, Ingo Molnar, Alan Stern,
	Arnaldo Carvalho de Melo, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On Wed, 2011-01-26 at 20:33 +0100, Peter Zijlstra wrote:

> Wouldn't something like the below cure things too?

> +struct task_function_call {
> +	struct task_struct *p;
> +	void (*func)(void *info);
> +	void *info;
> +};
> +
> +void task_function_trampoline(void *data)
> +{
> +	struct task_function_call *tfc = data;
> +
> +	if (this_rq()->curr != tfc->p)
> +		return;
> +
> +	tfc->func(tfc->data);
> +}

tfc->info of course ;-)

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: Q: perf_install_in_context/perf_event_enable are racy?
  2011-01-26 19:33                           ` Peter Zijlstra
  2011-01-26 19:38                             ` Peter Zijlstra
@ 2011-01-26 21:19                             ` Oleg Nesterov
  2011-01-26 21:33                               ` Oleg Nesterov
  1 sibling, 1 reply; 91+ messages in thread
From: Oleg Nesterov @ 2011-01-26 21:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Ingo Molnar, Alan Stern,
	Arnaldo Carvalho de Melo, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On 01/26, Peter Zijlstra wrote:
>
> On Wed, 2011-01-26 at 20:05 +0100, Peter Zijlstra wrote:
> > On Wed, 2011-01-26 at 19:49 +0100, Oleg Nesterov wrote:
> > > On 01/26, Oleg Nesterov wrote:
> > > >
> > > > Please see the untested patch below. It doesn't change perf_event_enable(),
> > > > only perf_install_in_context().
> > >
> > > Forgot to mention... Also, it doesn't try to fix the race with do_exit(),
> > > this needs another change.
> > >
> > > And, damn, can't resist. This is mostly a cosmetic issue, but I feel
> > > discomfort every time I look at task_oncpu_function_call(). It _looks_
> > > obviously wrong, even if the problem doesn't exist in practice. I'll
> > > send the pedantic fix to keep the maintainers busy ;)
> >
> > I've been trying to sit down and work my way through it today, your last
> > suggestion very nearly seemed to make sense, but I kept getting
> > distracted.
> >
> > FWIW I think perf_event_enable() has the very same issue...

Yes, yes, note the "doesn't change perf_event_enable()" above.

In fact, I _suspect_ perf_event_enable() has more problems, but
I need to recheck.

> +void task_function_trampoline(void *data)
> +{
> +	struct task_function_call *tfc = data;
> +
> +	if (this_rq()->curr != tfc->p)
> +		return;

Yes, I was thinking about checking rq->curr too, but this doesn't
really help. This closes the race with "prev", but we have a similar
race with "next".

__perf_install_in_context() should not set ->task_ctx before next
does perf_event_context_sched_in(); otherwise the latter will do
nothing, since it checks cpuctx->task_ctx == ctx.

Oleg.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: Q: perf_install_in_context/perf_event_enable are racy?
  2011-01-26 21:19                             ` Oleg Nesterov
@ 2011-01-26 21:33                               ` Oleg Nesterov
  2011-01-27 10:32                                 ` Peter Zijlstra
  0 siblings, 1 reply; 91+ messages in thread
From: Oleg Nesterov @ 2011-01-26 21:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Ingo Molnar, Alan Stern,
	Arnaldo Carvalho de Melo, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On 01/26, Oleg Nesterov wrote:
>
> > +void task_function_trampoline(void *data)
> > +{
> > +	struct task_function_call *tfc = data;
> > +
> > +	if (this_rq()->curr != tfc->p)
> > +		return;
>
> Yes, I was thinking about checking rq->curr too, but this doesn't
> really help. This closes the race with "prev", but we have a similar
> race with "next".
>
> __perf_install_in_context() should not set ->task_ctx before next
> does perf_event_context_sched_in(); otherwise the latter will do
> nothing, since it checks cpuctx->task_ctx == ctx.

But of course, we can add rq->in_context_switch or something. This
is more or less equal to Frederic's per-cpu task_events_schedulable
but simpler, because this doesn't depend on perf_task_events.

This is what I had in mind initially but I didn't dare to add the
new member into rq, it is only needed for perf.

Oleg.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: Q: perf_install_in_context/perf_event_enable are racy?
  2011-01-26 21:33                               ` Oleg Nesterov
@ 2011-01-27 10:32                                 ` Peter Zijlstra
  2011-01-27 12:29                                   ` Peter Zijlstra
  2011-01-27 15:52                                   ` Oleg Nesterov
  0 siblings, 2 replies; 91+ messages in thread
From: Peter Zijlstra @ 2011-01-27 10:32 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Frederic Weisbecker, Ingo Molnar, Alan Stern,
	Arnaldo Carvalho de Melo, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On Wed, 2011-01-26 at 22:33 +0100, Oleg Nesterov wrote:
> 
> This is what I had in mind initially but I didn't dare to add the
> new member into rq, it is only needed for perf. 

Right, but it's a weakness in the task_oncpu_function_call()
implementation; wouldn't any user run into this problem eventually?

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: Q: perf_install_in_context/perf_event_enable are racy?
  2011-01-27 10:32                                 ` Peter Zijlstra
@ 2011-01-27 12:29                                   ` Peter Zijlstra
  2011-01-27 16:10                                     ` Oleg Nesterov
  2011-01-27 15:52                                   ` Oleg Nesterov
  1 sibling, 1 reply; 91+ messages in thread
From: Peter Zijlstra @ 2011-01-27 12:29 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Frederic Weisbecker, Ingo Molnar, Alan Stern,
	Arnaldo Carvalho de Melo, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On Thu, 2011-01-27 at 11:32 +0100, Peter Zijlstra wrote:
> On Wed, 2011-01-26 at 22:33 +0100, Oleg Nesterov wrote:
> > 
> > This is what I had in mind initially but I didn't dare to add the
> > new member into rq, it is only needed for perf. 
> 
> Right, but it's a weakness in the task_oncpu_function_call()
> implementation; wouldn't any user run into this problem eventually?

I can't seem to avoid having to add this rq member, but like you said,
we only need to do that when __ARCH_WANT_INTERRUPTS_ON_CTXSW.

We still need to validate that p is actually current when the IPI happens;
the test might return true in task_oncpu_function_call() but be false by
the time we process the IPI.

So this should avoid calling @func when @p isn't (fully) running.

---
 kernel/sched.c |   46 ++++++++++++++++++++++++++++++++++++++++------
 1 files changed, 40 insertions(+), 6 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 18d38e4..fbff6a8 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -490,7 +490,10 @@ struct rq {
 	struct task_struct *curr, *idle, *stop;
 	unsigned long next_balance;
 	struct mm_struct *prev_mm;
-
+#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
+	int in_ctxsw;
+#endif
+	
 	u64 clock;
 	u64 clock_task;
 
@@ -2265,6 +2268,29 @@ void kick_process(struct task_struct *p)
 EXPORT_SYMBOL_GPL(kick_process);
 #endif /* CONFIG_SMP */
 
+struct task_function_call {
+	struct task_struct *p;
+	void (*func)(void *info);
+	void *info;
+};
+
+void task_function_trampoline(void *data)
+{
+	struct task_function_call *tfc = data;
+	struct task_struct *p = tfc->p;
+	struct rq *rq = this_rq();
+
+#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
+	if (rq->in_ctxsw)
+		return;
+#endif
+
+	if (rq->curr != p)
+		return;
+
+	tfc->func(tfc->info);
+}
+
 /**
  * task_oncpu_function_call - call a function on the cpu on which a task runs
  * @p:		the task to evaluate
@@ -2278,11 +2304,16 @@ void task_oncpu_function_call(struct task_struct *p,
 			      void (*func) (void *info), void *info)
 {
 	int cpu;
+	struct task_function_call data = {
+		.p = p,
+		.func = func,
+		.info = info,
+	};
 
 	preempt_disable();
 	cpu = task_cpu(p);
 	if (task_curr(p))
-		smp_call_function_single(cpu, func, info, 1);
+		smp_call_function_single(cpu, task_function_trampoline, &data, 1);
 	preempt_enable();
 }
 
@@ -2776,9 +2807,15 @@ static inline void
 prepare_task_switch(struct rq *rq, struct task_struct *prev,
 		    struct task_struct *next)
 {
+#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
+	rq->in_ctxsw = 1;
+#endif
+	sched_info_switch(prev, next);
+	perf_event_task_sched_out(prev, next);
 	fire_sched_out_preempt_notifiers(prev, next);
 	prepare_lock_switch(rq, next);
 	prepare_arch_switch(next);
+	trace_sched_switch(prev, next);
 }
 
 /**
@@ -2823,6 +2860,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	perf_event_task_sched_in(current);
 #ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
 	local_irq_enable();
+	rq->in_ctxsw = 0;
 #endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */
 	finish_lock_switch(rq, prev);
 
@@ -2911,7 +2949,6 @@ context_switch(struct rq *rq, struct task_struct *prev,
 	struct mm_struct *mm, *oldmm;
 
 	prepare_task_switch(rq, prev, next);
-	trace_sched_switch(prev, next);
 	mm = next->mm;
 	oldmm = prev->active_mm;
 	/*
@@ -3989,9 +4026,6 @@ need_resched_nonpreemptible:
 	rq->skip_clock_update = 0;
 
 	if (likely(prev != next)) {
-		sched_info_switch(prev, next);
-		perf_event_task_sched_out(prev, next);
-
 		rq->nr_switches++;
 		rq->curr = next;
 		++*switch_count;


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* Re: Q: perf_install_in_context/perf_event_enable are racy?
  2011-01-26 17:53                     ` Oleg Nesterov
  2011-01-26 18:49                       ` Oleg Nesterov
@ 2011-01-27 13:14                       ` Peter Zijlstra
  2011-01-27 14:28                         ` Peter Zijlstra
  2011-01-27 16:57                         ` Oleg Nesterov
  1 sibling, 2 replies; 91+ messages in thread
From: Peter Zijlstra @ 2011-01-27 13:14 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Frederic Weisbecker, Ingo Molnar, Alan Stern,
	Arnaldo Carvalho de Melo, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On Wed, 2011-01-26 at 18:53 +0100, Oleg Nesterov wrote:
>         void task_force_schedule(struct task_struct *p)
>         {
>                 struct rq *rq;
> 
>                 preempt_disable();
>                 rq = task_rq(p);
>                 if (rq->curr == p)
>                         wake_up_process(rq->stop);
>                 preempt_enable();
>         }
> 
>         static void
>         perf_install_in_context(struct perf_event_context *ctx,
>                                 struct perf_event *event,
>                                 int cpu)
>         {
>                 struct task_struct *task = ctx->task;
> 
>                 event->ctx = ctx;
> 
>                 if (!task) {
>                         /*
>                          * Per cpu events are installed via an smp call and
>                          * the install is always successful.
>                          */
>                         smp_call_function_single(cpu, __perf_install_in_context,
>                                                  event, 1);
>                         return;
>                 }
> 
>                 for (;;) {
>                         raw_spin_lock_irq(&ctx->lock);
>                         /*
>                          * The lock prevents that this context is
>                          * scheduled in, we can add the event safely.
>                          */
>                         if (!ctx->is_active)
>                                 add_event_to_ctx(event, ctx);
>                         raw_spin_unlock_irq(&ctx->lock);
> 
>                         if (event->attach_state & PERF_ATTACH_CONTEXT) {
>                                 task_force_schedule(task);
>                                 break;
>                         }
> 
>                         task_oncpu_function_call(task, __perf_install_in_context,
>                                                         event);
>                         if (event->attach_state & PERF_ATTACH_CONTEXT)
>                                 break;
>                 }
>         } 

Right, so the fact of introducing extra scheduling makes me feel
uncomfortable... the whole purpose is to observe without perturbing (as
much as possible).

So the whole crux of the matter is adding a ctx to a running process. If
the ctx exists, ->is_active will be tracked properly and much of the
problem goes away.

  rcu_assign_pointer(task->perf_event_ctx[n], new_ctx);
  task_oncpu_function_call(task, __perf_install_in_context, event);

That should, I think, suffice to get the ctx in sync with the task state.
We've got the following cases:
 1) task is in the middle of scheduling in
 2) task is in the middle of scheduling out
 3) task is running

Without __ARCH_WANT_INTERRUPTS_ON_CTXSW everything is boring and works;
in case 1 the IPI will be delayed until 3, and in case 2 the IPI finds
another task and the next schedule-in will sort things.

With it, however, things are more interesting. Case 2 seems to be adequately
covered by the patch I just sent: the IPI will bail and the next sched-in
of the relevant task will pick matters up. Case 1 otoh doesn't seem
covered: the IPI will bail, leaving us stranded.

To fix this it seems we need to make task_oncpu_function_call() wait
until the context switch is done, something like
while (cpu_rq(cpu)->in_ctxsw) cpu_relax(); before sending the IPI.
However, that would require adding a few memory barriers, I think...
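Something like this (sketch only; which barriers are needed is exactly
the question):

	preempt_disable();
	cpu = task_cpu(p);
	while (cpu_rq(cpu)->in_ctxsw)
		cpu_relax();
	smp_rmb();	/* see the effects of the finished switch */
	if (task_curr(p))
		smp_call_function_single(cpu, func, info, 1);
	preempt_enable();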

/me goes searching for implied barriers around there.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: Q: perf_install_in_context/perf_event_enable are racy?
  2011-01-27 13:14                       ` Peter Zijlstra
@ 2011-01-27 14:28                         ` Peter Zijlstra
  2011-01-27 14:58                           ` Peter Zijlstra
  2011-01-27 16:57                         ` Oleg Nesterov
  1 sibling, 1 reply; 91+ messages in thread
From: Peter Zijlstra @ 2011-01-27 14:28 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Frederic Weisbecker, Ingo Molnar, Alan Stern,
	Arnaldo Carvalho de Melo, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On Thu, 2011-01-27 at 14:14 +0100, Peter Zijlstra wrote:
> 
> With it, however, things are more interesting. Case 2 seems to be adequately
> covered by the patch I just sent: the IPI will bail and the next sched-in
> of the relevant task will pick matters up. Case 1 otoh doesn't seem
> covered: the IPI will bail, leaving us stranded.

blergh, so the race condition specific to perf can be cured by putting
the ->in_ctxsw = 0 under the local_irq_disable(). When we hit early,
perf_event_task_sched_in() will do the job and we can simply bail in
the IPI. If we hit late, the IPI will be delayed until after, and we'll
be in case 3 again.
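IOW, in finish_task_switch() (sketch):

	perf_event_task_sched_in(current);
#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
	rq->in_ctxsw = 0;	/* cleared before irqs come back on, so a late IPI sees it */
	local_irq_enable();
#endif
	finish_lock_switch(rq, prev);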

More generic task_oncpu_function_call() users, say those using preempt
notifiers, will have to deal with the fact that the sched_in notifier
runs after we unlock/enable irqs.

<crazy idea here> 

So I was contemplating if we could make things work by placing
rq->nr_switches++; _after_ context_switch() and use:

rq->curr != current
mb() /* implied by ctxsw? */
rq->nr_switches++

to do something like:

nr_switches = rq->nr_switches;
smp_rmb();
if (rq->curr != current) {
  smp_rmb();
  while (rq->nr_switches == nr_switches)
    cpu_relax();
}

to synchronize things, but then my head hurt.. mostly because you can
only use rq->curr != current on the local cpu, in which case spinning
will deadlock you.

The 'solution' seemed to be to do that test from an IPI, and return the
state in struct task_function_call, then spin on the other cpu..

So I've likely fallen off a cliff somewhere along the line, but just in
case, here's the patch:
---

 kernel/sched.c |   57 ++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 files changed, 49 insertions(+), 8 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 18d38e4..31f8d75 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2265,6 +2265,30 @@ void kick_process(struct task_struct *p)
 EXPORT_SYMBOL_GPL(kick_process);
 #endif /* CONFIG_SMP */
 
+struct task_function_call {
+	struct task_struct *p;
+	void (*func)(void *info);
+	void *info;
+	int ret;
+};
+
+void task_function_trampoline(void *data)
+{
+	struct task_function_call *tfc = data;
+	struct task_struct *p = tfc->p;
+	struct rq *rq = this_rq();
+
+	if (rq->curr != current) {
+		tfc->ret = 1;
+		return;
+	}
+
+	if (rq->curr != p)
+		return;
+
+	tfc->func(tfc->info);
+}
+
 /**
  * task_oncpu_function_call - call a function on the cpu on which a task runs
  * @p:		the task to evaluate
@@ -2277,12 +2301,30 @@ EXPORT_SYMBOL_GPL(kick_process);
 void task_oncpu_function_call(struct task_struct *p,
 			      void (*func) (void *info), void *info)
 {
+	struct task_function_call data = {
+		.p = p,
+		.func = func,
+		.info = info,
+	};
+	unsigned long nr_switches;
+	struct rq *rq;
 	int cpu;
 
 	preempt_disable();
-	cpu = task_cpu(p);
-	if (task_curr(p))
-		smp_call_function_single(cpu, func, info, 1);
+again:
+	data.ret = 0;
+	rq = task_rq(p);
+	nr_switches = rq->nr_switches;
+	smp_rmb();
+	if (task_curr(p)) {
+		smp_call_function_single(cpu_of(rq), 
+				task_function_trampoline, &data, 1);
+		if (data.ret == 1) {
+			while (rq->nr_switches == nr_switches)
+				cpu_relax();
+			goto again;
+		}
+	}
 	preempt_enable();
 }
 
@@ -2776,9 +2818,12 @@ static inline void
 prepare_task_switch(struct rq *rq, struct task_struct *prev,
 		    struct task_struct *next)
 {
+	sched_info_switch(prev, next);
+	perf_event_task_sched_out(prev, next);
 	fire_sched_out_preempt_notifiers(prev, next);
 	prepare_lock_switch(rq, next);
 	prepare_arch_switch(next);
+	trace_sched_switch(prev, next);
 }
 
 /**
@@ -2911,7 +2956,6 @@ context_switch(struct rq *rq, struct task_struct *prev,
 	struct mm_struct *mm, *oldmm;
 
 	prepare_task_switch(rq, prev, next);
-	trace_sched_switch(prev, next);
 	mm = next->mm;
 	oldmm = prev->active_mm;
 	/*
@@ -3989,10 +4033,6 @@ need_resched_nonpreemptible:
 	rq->skip_clock_update = 0;
 
 	if (likely(prev != next)) {
-		sched_info_switch(prev, next);
-		perf_event_task_sched_out(prev, next);
-
-		rq->nr_switches++;
 		rq->curr = next;
 		++*switch_count;
 
@@ -4005,6 +4045,7 @@ need_resched_nonpreemptible:
 		 */
 		cpu = smp_processor_id();
 		rq = cpu_rq(cpu);
+		rq->nr_switches++;
 	} else
 		raw_spin_unlock_irq(&rq->lock);
 


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* Re: Q: perf_install_in_context/perf_event_enable are racy?
  2011-01-27 14:28                         ` Peter Zijlstra
@ 2011-01-27 14:58                           ` Peter Zijlstra
  0 siblings, 0 replies; 91+ messages in thread
From: Peter Zijlstra @ 2011-01-27 14:58 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Frederic Weisbecker, Ingo Molnar, Alan Stern,
	Arnaldo Carvalho de Melo, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On Thu, 2011-01-27 at 15:28 +0100, Peter Zijlstra wrote:
> 
> <crazy idea here> 
> 

> So I've likely fallen off a cliff somewhere along the line, but just in
> case, here's the patch: 

It's completely broken, ignore this :-)

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: Q: perf_install_in_context/perf_event_enable are racy?
  2011-01-27 10:32                                 ` Peter Zijlstra
  2011-01-27 12:29                                   ` Peter Zijlstra
@ 2011-01-27 15:52                                   ` Oleg Nesterov
  1 sibling, 0 replies; 91+ messages in thread
From: Oleg Nesterov @ 2011-01-27 15:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Ingo Molnar, Alan Stern,
	Arnaldo Carvalho de Melo, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On 01/27, Peter Zijlstra wrote:
>
> On Wed, 2011-01-26 at 22:33 +0100, Oleg Nesterov wrote:
> >
> > This is what I had in mind initially but I didn't dare to add the
> > new member into rq, it is only needed for perf.
>
> Right, but its a weakness in the task_oncpu_function_call()
> implementation, wouldn't any user run into this problem eventually?

I think that other users are fine; they do not try to change ctx.

OTOH, probably your change in task_oncpu_function_call() makes sense
anyway; this way func() can never have some other subtle problems with
a context_switch in progress.

Oleg.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: Q: perf_install_in_context/perf_event_enable are racy?
  2011-01-27 12:29                                   ` Peter Zijlstra
@ 2011-01-27 16:10                                     ` Oleg Nesterov
  2011-01-27 16:27                                       ` Peter Zijlstra
  0 siblings, 1 reply; 91+ messages in thread
From: Oleg Nesterov @ 2011-01-27 16:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Ingo Molnar, Alan Stern,
	Arnaldo Carvalho de Melo, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On 01/27, Peter Zijlstra wrote:
>
> +void task_function_trampoline(void *data)
> +{
> +	struct task_function_call *tfc = data;
> +	struct task_struct *p = tfc->p;
> +	struct rq *rq = this_rq();
> +
> +#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
> +	if (rq->in_ctxsw)
> +		return;
> +#endif
> +
> +	if (rq->curr != p)
> +		return;

Yes, I think this should solve the problem.

>  prepare_task_switch(struct rq *rq, struct task_struct *prev,
>  		    struct task_struct *next)
>  {
> +#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
> +	rq->in_ctxsw = 1;
> +#endif
> +	sched_info_switch(prev, next);
> +	perf_event_task_sched_out(prev, next);
>  	fire_sched_out_preempt_notifiers(prev, next);
>  	prepare_lock_switch(rq, next);
>  	prepare_arch_switch(next);
> +	trace_sched_switch(prev, next);
>  }

Yes, I was wondering why schedule() calls perf_event_task_sched_out().
This way the code looks more symmetrical/understandable.

>  /**
> @@ -2823,6 +2860,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
>  	perf_event_task_sched_in(current);
>  #ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
>  	local_irq_enable();
> +	rq->in_ctxsw = 0;

If we think that context_switch() finishes here, it would probably be
cleaner to clear ->in_ctxsw before local_irq_enable().

>  #endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */
>  	finish_lock_switch(rq, prev);

But, otoh, maybe finish_lock_switch() can clear in_ctxsw, it already
checks __ARCH_WANT_INTERRUPTS_ON_CTXSW. Likewise, perhaps it can be
set in prepare_lock_switch() which enables irqs.

But this is cosmetic and up to you.

Oleg.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: Q: perf_install_in_context/perf_event_enable are racy?
  2011-01-27 16:10                                     ` Oleg Nesterov
@ 2011-01-27 16:27                                       ` Peter Zijlstra
  2011-01-27 16:59                                         ` Oleg Nesterov
  0 siblings, 1 reply; 91+ messages in thread
From: Peter Zijlstra @ 2011-01-27 16:27 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Frederic Weisbecker, Ingo Molnar, Alan Stern,
	Arnaldo Carvalho de Melo, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On Thu, 2011-01-27 at 17:10 +0100, Oleg Nesterov wrote:
> >  #ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
> >       local_irq_enable();
> > +     rq->in_ctxsw = 0;
> 
> If we think that context_switch() finishes here, it would probably be
> cleaner to clear ->in_ctxsw before local_irq_enable().

It must in fact be done before, otherwise there's a race: we set ctx
after perf_event_task_sched_in() runs and send the IPI; the IPI lands
after local_irq_enable() but before rq->in_ctxsw = 0, so the IPI is
ignored and nothing happens.

> >  #endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */
> >       finish_lock_switch(rq, prev);
> 
> But, otoh, maybe finish_lock_switch() can clear in_ctxsw, it already
> checks __ARCH_WANT_INTERRUPTS_ON_CTXSW. Likewise, perhaps it can be
> set in prepare_lock_switch() which enables irqs.
> 
> But this is cosmetic and up to you. 

Can't do because of the above thing..

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: Q: perf_install_in_context/perf_event_enable are racy?
  2011-01-27 13:14                       ` Peter Zijlstra
  2011-01-27 14:28                         ` Peter Zijlstra
@ 2011-01-27 16:57                         ` Oleg Nesterov
  2011-01-27 17:11                           ` Peter Zijlstra
  1 sibling, 1 reply; 91+ messages in thread
From: Oleg Nesterov @ 2011-01-27 16:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Ingo Molnar, Alan Stern,
	Arnaldo Carvalho de Melo, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On 01/27, Peter Zijlstra wrote:
>
> Right, so the fact of introducing extra scheduling makes me feel
> uncomfortable... the whole purpose is to observe without perturbing (as
> much as possible).

Yes, agreed.

Well, otoh the patch removes the code which sets ->task_ctx from
__perf_install_in_context() and __perf_event_enable(), and perhaps
we could simplify things further, but anyway I agree.

> That should, I think, suffice to get the ctx in sync with the task state; we've
> got the following cases:
>  1) task is in the middle of scheduling in
>  2) task is in the middle of scheduling out
>  3) task is running
>
> Without __ARCH_WANT_INTERRUPTS_ON_CTXSW everything is boring and works,
> 1: the IPI will be delayed until 3, 2: the IPI finds another task and
> the next schedule in will sort things.
>
> With, however, things are more interesting. 2 seems to be adequately
> covered by the patch I just sent; the IPI will bail and the next
> sched-in of the relevant task will pick matters up. 1 otoh doesn't seem
> covered, the IPI will bail, leaving us stranded.

Hmm, yes... Strangely, I missed that when I was thinking about in_ctxsw.

Perhaps, we can change task_oncpu_function_call() so that it returns
-EAGAIN in case it hits in_ctxsw != 0? If the caller sees -EAGAIN, it
should always retry even if !ctx->is_active.
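
Something like this, perhaps (just a sketch, ignoring the
__ARCH_WANT_INTERRUPTS_ON_CTXSW ifdefs; assumes the task_function_call
structure from your patch grows a ->ret member):

	static void task_function_trampoline(void *data)
	{
		struct task_function_call *tfc = data;
		struct rq *rq = this_rq();

		tfc->ret = -EAGAIN;
		if (rq->in_ctxsw || rq->curr != tfc->p)
			return;		/* tell the caller to retry */

		tfc->ret = 0;
		tfc->func(tfc->info);
	}

and then the caller does

	do {	/* retry even if !ctx->is_active */
		ret = task_oncpu_function_call(task, func, info);
	} while (ret == -EAGAIN);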

Oleg.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: Q: perf_install_in_context/perf_event_enable are racy?
  2011-01-27 16:27                                       ` Peter Zijlstra
@ 2011-01-27 16:59                                         ` Oleg Nesterov
  0 siblings, 0 replies; 91+ messages in thread
From: Oleg Nesterov @ 2011-01-27 16:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Ingo Molnar, Alan Stern,
	Arnaldo Carvalho de Melo, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On 01/27, Peter Zijlstra wrote:
>
> On Thu, 2011-01-27 at 17:10 +0100, Oleg Nesterov wrote:
> > >  #ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
> > >       local_irq_enable();
> > > +     rq->in_ctxsw = 0;
> >
> > If we think that context_switch() finishes here, it would probably be
> > cleaner to clear ->in_ctxsw before local_irq_enable().
>
> It must in fact be done before,

Yes, I already realized this when I was reading another email from you.

> > But, otoh, maybe finish_lock_switch() can clear in_ctxsw, it already
> > checks __ARCH_WANT_INTERRUPTS_ON_CTXSW. Likewise, perhaps it can be
> > set in prepare_lock_switch() which enables irqs.
> >
> > But this is cosmetic and up to you.
>
> Can't do because of the above thing..

Right.

Oleg.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: Q: perf_install_in_context/perf_event_enable are racy?
  2011-01-27 16:57                         ` Oleg Nesterov
@ 2011-01-27 17:11                           ` Peter Zijlstra
  2011-01-27 22:18                             ` Oleg Nesterov
  0 siblings, 1 reply; 91+ messages in thread
From: Peter Zijlstra @ 2011-01-27 17:11 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Frederic Weisbecker, Ingo Molnar, Alan Stern,
	Arnaldo Carvalho de Melo, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On Thu, 2011-01-27 at 17:57 +0100, Oleg Nesterov wrote:
> 
> > With, however, things are more interesting. 2 seems to be adequately
> > covered by the patch I just sent; the IPI will bail and the next
> > sched-in of the relevant task will pick matters up. 1 otoh doesn't seem
> > covered, the IPI will bail, leaving us stranded.
> 
> Hmm, yes... Strangely, I missed that when I was thinking about in_ctxsw.
> 
> Perhaps, we can change task_oncpu_function_call() so that it returns
> -EAGAIN in case it hits in_ctxsw != 0? If the caller sees -EAGAIN, it
> should always retry even if !ctx->is_active.

That would be very easy to do, we can pass a return value through the
task_function_call structure.
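
I.e. something like:

	struct task_function_call {
		struct task_struct *p;
		void (*func)(void *info);
		void *info;
		int ret;	/* set by the trampoline, read by the caller */
	};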

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: Q: perf_install_in_context/perf_event_enable are racy?
  2011-01-27 17:11                           ` Peter Zijlstra
@ 2011-01-27 22:18                             ` Oleg Nesterov
  2011-01-28 11:52                               ` Peter Zijlstra
  0 siblings, 1 reply; 91+ messages in thread
From: Oleg Nesterov @ 2011-01-27 22:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Ingo Molnar, Alan Stern,
	Arnaldo Carvalho de Melo, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On 01/27, Peter Zijlstra wrote:
>
> On Thu, 2011-01-27 at 17:57 +0100, Oleg Nesterov wrote:
> >
> > > With, however, things are more interesting. 2 seems to be adequately
> > > covered by the patch I just sent; the IPI will bail and the next
> > > sched-in of the relevant task will pick matters up. 1 otoh doesn't seem
> > > covered, the IPI will bail, leaving us stranded.
> >
> > Hmm, yes... Strangely, I missed that when I was thinking about in_ctxsw.
> >
> > Perhaps, we can change task_oncpu_function_call() so that it returns
> > -EAGAIN in case it hits in_ctxsw != 0? If the caller sees -EAGAIN, it
> > should always retry even if !ctx->is_active.
>
> That would be very easy to do, we can pass a return value through the
> task_function_call structure.

Yes.

Perhaps task_oncpu_function_call() should retry itself to simplify the
callers. I wonder if we should also retry if rq->curr != p...



Oh. You know, I am starting to think I will never understand this.
Forget about the problems with __ARCH_WANT_INTERRUPTS_ON_CTXSW.

perf_install_in_context() does task_oncpu_function_call() and then


	// ctx->is_active == 0

	/*
	 * The lock prevents that this context is scheduled in so we
	 * can add the event safely, if it the call above did not
	 * succeed.
	 */
	if (list_empty(&event->group_entry))
		add_event_to_ctx(event, ctx);

This assumes that the task is not running.

Why? Because (I guess) we assume that either task_oncpu_function_call()
should see task_curr() == T, or if the task becomes running after that
it should see the new ->perf_event_ctxp[ctxn] != NULL. And I do not see
how we can prove this.

If find_get_context() sets the new context, the only guarantee we have
is: perf_event_exit_task() can't miss this context. The task, however,
can be scheduled in and miss the new value in perf_event_ctxp[].
And, task_oncpu_function_call() can equally miss rq->curr == task.
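
Something like this pattern, where neither side is guaranteed to see
the other's write (purely hypothetical, just to spell out what I mean):

	CPU0 (sys_perf_event_open)		CPU1 (schedule() -> task)

	task->perf_event_ctxp[n] = ctx;		rq->curr = task;
	if (!task_curr(task))			ctx = task->perf_event_ctxp[n];
		/* old rq->curr, no IPI */	/* NULL, nothing to sched in */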

But. I think this all falls into the absolutely theoretical category,
and in the worst case nothing really bad can happen, just event_sched_in()
will be delayed until this task reschedules.


So, I think your patch should fix all problems with schedule(). It just
needs the couple of changes we already discussed:

	- finish_task_switch() should clear rq->in_ctxsw before
	  local_irq_enable()

	- task_oncpu_function_call() (or its callers) should always
	  retry the "if (task_curr(p))" code if ->in_ctxsw is true.

If you think we have other problems here please don't tell me,
I already got lost ;)

Oleg.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: Q: perf_install_in_context/perf_event_enable are racy?
  2011-01-27 22:18                             ` Oleg Nesterov
@ 2011-01-28 11:52                               ` Peter Zijlstra
  2011-01-28 14:57                                 ` Peter Zijlstra
  0 siblings, 1 reply; 91+ messages in thread
From: Peter Zijlstra @ 2011-01-28 11:52 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Frederic Weisbecker, Ingo Molnar, Alan Stern,
	Arnaldo Carvalho de Melo, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On Thu, 2011-01-27 at 23:18 +0100, Oleg Nesterov wrote:
> On 01/27, Peter Zijlstra wrote:
> >
> > On Thu, 2011-01-27 at 17:57 +0100, Oleg Nesterov wrote:
> > >
> > > > With, however, things are more interesting. 2 seems to be adequately
> > > > covered by the patch I just sent; the IPI will bail and the next
> > > > sched-in of the relevant task will pick matters up. 1 otoh doesn't seem
> > > > covered, the IPI will bail, leaving us stranded.
> > >
> > > Hmm, yes... Strangely, I missed that when I was thinking about in_ctxsw.
> > >
> > > Perhaps, we can change task_oncpu_function_call() so that it returns
> > > -EAGAIN in case it hits in_ctxsw != 0? If the caller sees -EAGAIN, it
> > > should always retry even if !ctx->is_active.
> >
> > That would be very easy to do, we can pass a return value through the
> > task_function_call structure.
> 
> Yes.
> 
> Perhaps task_oncpu_function_call() should retry itself to simplify the
> callers. I wonder if we should also retry if rq->curr != p...

Yes we should, the task could have been migrated and be running on
another cpu..

> Oh. You know, I am starting to think I will never understand this.

Oh, please don't give up, we shall persevere with this until it all
makes perfect sense (or we're both mental and get locked up), it can
only improve matters, right? :-)

> perf_install_in_context() does task_oncpu_function_call() and then
> 
> 
> 	// ctx->is_active == 0
> 
> 	/*
> 	 * The lock prevents that this context is scheduled in so we
> 	 * can add the event safely, if it the call above did not
> 	 * succeed.
> 	 */
> 	if (list_empty(&event->group_entry))
> 		add_event_to_ctx(event, ctx);
> 
> This assumes that the task is not running.
> 
> Why? Because (I guess) we assume that either task_oncpu_function_call()
> should see task_curr() == T, or if the task becomes running after that
> it should see the new ->perf_event_ctxp[ctxn] != NULL. And I do not see
> how we can prove this.

Right, that is the intended logic, let's see if I can make that be true.

So task_oncpu_function_call(), as per the below patch, will loop
until either:

 - the task isn't running, or
 - we executed the function on the cpu during the task's stay there

If it isn't running, it might have scheduled in by the time we've
acquired the ctx->lock; the ->is_active test catches that and retries
the task_oncpu_function_call(). If it's still not running, us holding the
ctx->lock ensures its possible schedule-in on another cpu will be held
up at perf_event_task_sched_in().

Now:

> If find_get_context() sets the new context, the only guarantee we have
> is: perf_event_exit_task() can't miss this context. The task, however,
> can be scheduled in and miss the new value in perf_event_ctxp[].
> And, task_oncpu_function_call() can equally miss rq->curr == task.

Right, so in case the perf_event_task_sched_in() missed the assignment
of ->perf_event_ctxp[n], then our above story falls flat on its face.

Because then we cannot rely on ->is_active being set for running tasks.

So we need a task_curr() test under that lock, which would need
perf_event_task_sched_out() to be done _before_ we set rq->curr = next,
I _think_.

> But. I think this all falls into the absolutely theoretical category,
> and in the worst case nothing really bad can happen, just event_sched_in()
> will be delayed until this task reschedules.

Still, it would be bad if some HPC workload (1 task per cpu, very sparse
syscalls, hardly any scheduling at all) went wonky once in a blue
moon.

More importantly, I think it would be best if this code were obvious; it
clearly isn't, so let's hang in here for a little while more.

> So, I think your patch should fix all problems with schedule(). It just
> needs the couple of changes we already discussed:
> 
> 	- finish_task_switch() should clear rq->in_ctxsw before
> 	  local_irq_enable()

check, although I should still add at least a little comment in
task_oncpu_function_call() explaining things.

> 	- task_oncpu_function_call() (or its callers) should always
> 	  retry the "if (task_curr(p))" code if ->in_ctxsw is true.

check.

> If you think we have other problems here please don't tell me,
> I already got lost ;)

Sorry to bother you more, but I think we're actually getting
somewhere...

---
 include/linux/sched.h |    4 +-
 kernel/sched.c        |   64 ++++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 58 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d747f94..b147d73 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2581,8 +2581,8 @@ static inline void inc_syscw(struct task_struct *tsk)
 /*
  * Call the function if the target task is executing on a CPU right now:
  */
-extern void task_oncpu_function_call(struct task_struct *p,
-				     void (*func) (void *info), void *info);
+extern int task_oncpu_function_call(struct task_struct *p,
+				    void (*func) (void *info), void *info);
 
 
 #ifdef CONFIG_MM_OWNER
diff --git a/kernel/sched.c b/kernel/sched.c
index 18d38e4..9ef760c 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -490,7 +490,10 @@ struct rq {
 	struct task_struct *curr, *idle, *stop;
 	unsigned long next_balance;
 	struct mm_struct *prev_mm;
-
+#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
+	int in_ctxsw;
+#endif
+	
 	u64 clock;
 	u64 clock_task;
 
@@ -2265,6 +2268,34 @@ void kick_process(struct task_struct *p)
 EXPORT_SYMBOL_GPL(kick_process);
 #endif /* CONFIG_SMP */
 
+struct task_function_call {
+	struct task_struct *p;
+	void (*func)(void *info);
+	void *info;
+	int ret;
+};
+
+void task_function_trampoline(void *data)
+{
+	struct task_function_call *tfc = data;
+	struct task_struct *p = tfc->p;
+	struct rq *rq = this_rq();
+
+	tfc->ret = -EAGAIN;
+
+#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
+	if (rq->in_ctxsw)
+		return;
+#endif
+
+	if (rq->curr != p)
+		return;
+
+	tfc->ret = 0;
+
+	tfc->func(tfc->info);
+}
+
 /**
  * task_oncpu_function_call - call a function on the cpu on which a task runs
  * @p:		the task to evaluate
@@ -2273,17 +2304,31 @@ EXPORT_SYMBOL_GPL(kick_process);
  *
  * Calls the function @func when the task is currently running. This might
  * be on the current CPU, which just calls the function directly
+ *
+ * returns: 0 when @func got called
  */
-void task_oncpu_function_call(struct task_struct *p,
+int task_oncpu_function_call(struct task_struct *p,
 			      void (*func) (void *info), void *info)
 {
+	struct task_function_call data = {
+		.p = p,
+		.func = func,
+		.info = info,
+	};
 	int cpu;
 
 	preempt_disable();
+again:
+	data.ret = -ESRCH; /* No such (running) process */
 	cpu = task_cpu(p);
-	if (task_curr(p))
-		smp_call_function_single(cpu, func, info, 1);
+	if (task_curr(p)) {
+		smp_call_function_single(cpu, task_function_trampoline, &data, 1);
+		if (data.ret == -EAGAIN)
+			goto again;
+	}
 	preempt_enable();
+
+	return data.ret;
 }
 
 #ifdef CONFIG_SMP
@@ -2776,9 +2821,15 @@ static inline void
 prepare_task_switch(struct rq *rq, struct task_struct *prev,
 		    struct task_struct *next)
 {
+#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
+	rq->in_ctxsw = 1;
+#endif
+	sched_info_switch(prev, next);
+	perf_event_task_sched_out(prev, next);
 	fire_sched_out_preempt_notifiers(prev, next);
 	prepare_lock_switch(rq, next);
 	prepare_arch_switch(next);
+	trace_sched_switch(prev, next);
 }
 
 /**
@@ -2822,6 +2873,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 #endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */
 	perf_event_task_sched_in(current);
 #ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
+	rq->in_ctxsw = 0;
 	local_irq_enable();
 #endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */
 	finish_lock_switch(rq, prev);
@@ -2911,7 +2963,6 @@ context_switch(struct rq *rq, struct task_struct *prev,
 	struct mm_struct *mm, *oldmm;
 
 	prepare_task_switch(rq, prev, next);
-	trace_sched_switch(prev, next);
 	mm = next->mm;
 	oldmm = prev->active_mm;
 	/*
@@ -3989,9 +4040,6 @@ need_resched_nonpreemptible:
 	rq->skip_clock_update = 0;
 
 	if (likely(prev != next)) {
-		sched_info_switch(prev, next);
-		perf_event_task_sched_out(prev, next);
-
 		rq->nr_switches++;
 		rq->curr = next;
 		++*switch_count;


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* Re: Q: perf_install_in_context/perf_event_enable are racy?
  2011-01-28 11:52                               ` Peter Zijlstra
@ 2011-01-28 14:57                                 ` Peter Zijlstra
  2011-01-28 16:28                                   ` Oleg Nesterov
  0 siblings, 1 reply; 91+ messages in thread
From: Peter Zijlstra @ 2011-01-28 14:57 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Frederic Weisbecker, Ingo Molnar, Alan Stern,
	Arnaldo Carvalho de Melo, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On Fri, 2011-01-28 at 12:52 +0100, Peter Zijlstra wrote:
> Right, so in case the perf_event_task_sched_in() missed the assignment
> of ->perf_event_ctxp[n], then our above story falls flat on its face.
> 
> Because then we cannot rely on ->is_active being set for running tasks.
> 
> So we need a task_curr() test under that lock, which would need
> perf_event_task_sched_out() to be done _before_ we set rq->curr = next,
> I _think_. 


Ok, so how about something like this:

if task_oncpu_function_call() managed to execute the function proper,
we're done. Otherwise, if while holding the lock task_curr() is true,
it means the task is now current and we should try again; if it's not, it
cannot become current because us holding ctx->lock blocks
perf_event_task_sched_in().

Hmm?

---
 include/linux/sched.h |    4 +-
 kernel/perf_event.c   |   23 ++++++++++-------
 kernel/sched.c        |   65 +++++++++++++++++++++++++++++++++++++++++++------
 3 files changed, 73 insertions(+), 19 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d747f94..b147d73 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2581,8 +2581,8 @@ static inline void inc_syscw(struct task_struct *tsk)
 /*
  * Call the function if the target task is executing on a CPU right now:
  */
-extern void task_oncpu_function_call(struct task_struct *p,
-				     void (*func) (void *info), void *info);
+extern int task_oncpu_function_call(struct task_struct *p,
+				    void (*func) (void *info), void *info);
 
 
 #ifdef CONFIG_MM_OWNER
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 852ae8c..0d988b8 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -1017,6 +1017,7 @@ perf_install_in_context(struct perf_event_context *ctx,
 			int cpu)
 {
 	struct task_struct *task = ctx->task;
+	int ret;
 
 	event->ctx = ctx;
 
@@ -1031,25 +1032,29 @@ perf_install_in_context(struct perf_event_context *ctx,
 	}
 
 retry:
-	task_oncpu_function_call(task, __perf_install_in_context,
-				 event);
+	ret = task_oncpu_function_call(task, 
+			__perf_install_in_context, event);
+
+	if (!ret)
+		return;
 
 	raw_spin_lock_irq(&ctx->lock);
+
 	/*
-	 * we need to retry the smp call.
+	 * If the task_oncpu_function_call() failed, re-check task_curr() now
+	 * that we hold ctx->lock; if it is running, retry the IPI.
 	 */
-	if (ctx->is_active && list_empty(&event->group_entry)) {
+	if (task_curr(task)) {
 		raw_spin_unlock_irq(&ctx->lock);
 		goto retry;
 	}
 
 	/*
-	 * The lock prevents that this context is scheduled in so we
-	 * can add the event safely, if it the call above did not
-	 * succeed.
+	 * Otherwise the lock prevents this context from being scheduled in,
+	 * so we can add the event safely if the call above did not succeed.
 	 */
-	if (list_empty(&event->group_entry))
-		add_event_to_ctx(event, ctx);
+	add_event_to_ctx(event, ctx);
+
 	raw_spin_unlock_irq(&ctx->lock);
 }
 
diff --git a/kernel/sched.c b/kernel/sched.c
index 18d38e4..3686dce 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -490,7 +490,10 @@ struct rq {
 	struct task_struct *curr, *idle, *stop;
 	unsigned long next_balance;
 	struct mm_struct *prev_mm;
-
+#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
+	int in_ctxsw;
+#endif
+	
 	u64 clock;
 	u64 clock_task;
 
@@ -2265,6 +2268,34 @@ void kick_process(struct task_struct *p)
 EXPORT_SYMBOL_GPL(kick_process);
 #endif /* CONFIG_SMP */
 
+struct task_function_call {
+	struct task_struct *p;
+	void (*func)(void *info);
+	void *info;
+	int ret;
+};
+
+void task_function_trampoline(void *data)
+{
+	struct task_function_call *tfc = data;
+	struct task_struct *p = tfc->p;
+	struct rq *rq = this_rq();
+
+	tfc->ret = -EAGAIN;
+
+#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
+	if (rq->in_ctxsw)
+		return;
+#endif
+
+	if (rq->curr != p)
+		return;
+
+	tfc->ret = 0;
+
+	tfc->func(tfc->info);
+}
+
 /**
  * task_oncpu_function_call - call a function on the cpu on which a task runs
  * @p:		the task to evaluate
@@ -2273,17 +2304,31 @@ EXPORT_SYMBOL_GPL(kick_process);
  *
  * Calls the function @func when the task is currently running. This might
  * be on the current CPU, which just calls the function directly
+ *
+ * returns: 0 when @func got called
  */
-void task_oncpu_function_call(struct task_struct *p,
+int task_oncpu_function_call(struct task_struct *p,
 			      void (*func) (void *info), void *info)
 {
+	struct task_function_call data = {
+		.p = p,
+		.func = func,
+		.info = info,
+	};
 	int cpu;
 
 	preempt_disable();
+again:
+	data.ret = -ESRCH; /* No such (running) process */
 	cpu = task_cpu(p);
-	if (task_curr(p))
-		smp_call_function_single(cpu, func, info, 1);
+	if (task_curr(p)) {
+		smp_call_function_single(cpu, task_function_trampoline, &data, 1);
+		if (data.ret == -EAGAIN)
+			goto again;
+	}
 	preempt_enable();
+
+	return data.ret;
 }
 
 #ifdef CONFIG_SMP
@@ -2776,9 +2821,15 @@ static inline void
 prepare_task_switch(struct rq *rq, struct task_struct *prev,
 		    struct task_struct *next)
 {
+#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
+	rq->in_ctxsw = 1;
+#endif
+	sched_info_switch(prev, next);
+	perf_event_task_sched_out(prev, next);
 	fire_sched_out_preempt_notifiers(prev, next);
 	prepare_lock_switch(rq, next);
 	prepare_arch_switch(next);
+	trace_sched_switch(prev, next);
 }
 
 /**
@@ -2822,6 +2873,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 #endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */
 	perf_event_task_sched_in(current);
 #ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
+	rq->in_ctxsw = 0;
 	local_irq_enable();
 #endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */
 	finish_lock_switch(rq, prev);
@@ -2911,7 +2963,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
 	struct mm_struct *mm, *oldmm;
 
 	prepare_task_switch(rq, prev, next);
-	trace_sched_switch(prev, next);
+
 	mm = next->mm;
 	oldmm = prev->active_mm;
 	/*
@@ -3989,9 +4041,6 @@ need_resched_nonpreemptible:
 	rq->skip_clock_update = 0;
 
 	if (likely(prev != next)) {
-		sched_info_switch(prev, next);
-		perf_event_task_sched_out(prev, next);
-
 		rq->nr_switches++;
 		rq->curr = next;
 		++*switch_count;



^ permalink raw reply related	[flat|nested] 91+ messages in thread

* Re: Q: perf_install_in_context/perf_event_enable are racy?
  2011-01-28 14:57                                 ` Peter Zijlstra
@ 2011-01-28 16:28                                   ` Oleg Nesterov
  2011-01-28 18:11                                     ` Peter Zijlstra
  0 siblings, 1 reply; 91+ messages in thread
From: Oleg Nesterov @ 2011-01-28 16:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Ingo Molnar, Alan Stern,
	Arnaldo Carvalho de Melo, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On 01/28, Peter Zijlstra wrote:
>
> On Fri, 2011-01-28 at 12:52 +0100, Peter Zijlstra wrote:
> > Right, so in case the perf_event_task_sched_in() missed the assignment
> > of ->perf_event_ctxp[n], then our above story falls flat on its face.
> >
> > Because then we cannot rely on ->is_active being set for running tasks.
> >
> > So we need a task_curr() test under that lock, which would need
> > perf_event_task_sched_out() to be done _before_ we set rq->curr = next,
> > I _think_.
>
> Ok, so how about something like this:
>
> if task_oncpu_function_call() managed to execute the function proper,
> we're done. Otherwise, if while holding the lock task_curr() is true,
> it means the task is now current and we should try again; if it's not, it
> cannot become current because us holding ctx->lock blocks
> perf_event_task_sched_in().
>
> Hmm?

I _feel_ this patch should be right. To me, this even makes the code
more understandable. But I'll try to re-read it once again, somehow
I can't concentrate today.

> @@ -1031,25 +1032,29 @@ perf_install_in_context(struct perf_event_context *ctx,
>  	}
>
>  retry:
> -	task_oncpu_function_call(task, __perf_install_in_context,
> -				 event);
> +	ret = task_oncpu_function_call(task,
> +			__perf_install_in_context, event);
> +
> +	if (!ret)
> +		return;
>
>  	raw_spin_lock_irq(&ctx->lock);
> +
>  	/*
> -	 * we need to retry the smp call.
> +	 * If the task_oncpu_function_call() failed, re-check task_curr() now
> +	 * that we hold ctx->lock; if it is running, retry the IPI.
>  	 */
> -	if (ctx->is_active && list_empty(&event->group_entry)) {
> +	if (task_curr(task)) {

Yes, but task_curr() should be exported.

One note. If this patch is correct (I think it is), then this check
in __perf_install_in_context() and __perf_event_enable()

		if (cpuctx->task_ctx || ctx->task != current)
			return;

should become unneeded. It should be removed or turned into WARN_ON()
imho, otherwise it looks confusing.
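
E.g. (sketch):

	if (WARN_ON_ONCE(cpuctx->task_ctx || ctx->task != current))
		return;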

Oleg.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: Q: perf_event && task->ptrace_bps[]
  2011-01-20 17:28         ` Oleg Nesterov
@ 2011-01-28 17:41           ` Frederic Weisbecker
  0 siblings, 0 replies; 91+ messages in thread
From: Frederic Weisbecker @ 2011-01-28 17:41 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Alan Stern, Arnaldo Carvalho de Melo, Ingo Molnar,
	Paul Mackerras, Peter Zijlstra, Prasad, Roland McGrath,
	linux-kernel

On Thu, Jan 20, 2011 at 06:28:10PM +0100, Oleg Nesterov wrote:
> On 01/19, Frederic Weisbecker wrote:
> > OTOH I can drop
> > more of them for the no-running-breakpoint case from thread_struct
> > in a subsequent task.
> 
> Hmm. Can't understand what you mean. Just curious, could you explain?

Indeed, now that I read that, it was completely not understandable :)

So I meant that currently we have this:

task->thread->ptrace_bps[BP_NUM]

Where ptrace_bps is:

struct perf_event *ptrace_bps[BP_NUM];

And we populate that with pointers when needed. Now this is a waste
of space; I'd better make it:

struct perf_event **ptrace_bps;

And only allocate the pointer space when needed.
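
Something like this (a rough sketch; the helper name is made up, and
BP_NUM is the per-arch register count from above):

	static struct perf_event **ptrace_alloc_bps(struct task_struct *tsk)
	{
		if (!tsk->thread.ptrace_bps)
			tsk->thread.ptrace_bps = kcalloc(BP_NUM,
					sizeof(struct perf_event *),
					GFP_KERNEL);
		return tsk->thread.ptrace_bps;
	}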

 
> > Note the problem touches more archs than x86. Basically every
> > arch that uses breakpoints uses a similar scheme that must be fixed.
> 
> Yes. Perhaps we should try to unify some code... Say, can't we move
> ->ptrace_bps[] to task_struct?

It seems that every arch that currently implements breakpoints has
this linear mapping of registers, even when physically they are not
linear: ARM has a separate register space for instruction and data
breakpoints, for example.

So yeah it seems we can store that in task_struct. I may try that
in a subsequent patch.

> 
> > +void ptrace_put_breakpoints(struct task_struct *tsk)
> > +{
> > +	if (!atomic_dec_return(&tsk->ptrace_bp_refcnt))
> > +		flush_ptrace_hw_breakpoint(tsk);
> 
> (minor nit, atomic_dec_and_test() looks more natural)

Indeed, will change that.
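
I.e.:

	void ptrace_put_breakpoints(struct task_struct *tsk)
	{
		if (atomic_dec_and_test(&tsk->ptrace_bp_refcnt))
			flush_ptrace_hw_breakpoint(tsk);
	}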

Thanks!

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: Q: perf_install_in_context/perf_event_enable are racy?
  2011-01-28 16:28                                   ` Oleg Nesterov
@ 2011-01-28 18:11                                     ` Peter Zijlstra
  2011-01-31 17:26                                       ` Oleg Nesterov
  0 siblings, 1 reply; 91+ messages in thread
From: Peter Zijlstra @ 2011-01-28 18:11 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Frederic Weisbecker, Ingo Molnar, Alan Stern,
	Arnaldo Carvalho de Melo, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On Fri, 2011-01-28 at 17:28 +0100, Oleg Nesterov wrote:
> 
> I _feel_ this patch should be right. To me, this even makes the code
> more understandable. But I'll try to re-read it once again, somehow
> I can't concentrate today.

Just to give you more food for thought, I couldn't help myself..

(compile tested only so far)

---
 include/linux/sched.h |    7 --
 kernel/perf_event.c   |  235 +++++++++++++++++++++++++++++++------------------
 kernel/sched.c        |   31 +------
 3 files changed, 156 insertions(+), 117 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d747f94..0b40ee3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2578,13 +2578,6 @@ static inline void inc_syscw(struct task_struct *tsk)
 #define TASK_SIZE_OF(tsk)	TASK_SIZE
 #endif
 
-/*
- * Call the function if the target task is executing on a CPU right now:
- */
-extern void task_oncpu_function_call(struct task_struct *p,
-				     void (*func) (void *info), void *info);
-
-
 #ifdef CONFIG_MM_OWNER
 extern void mm_update_next_owner(struct mm_struct *mm);
 extern void mm_init_owner(struct mm_struct *mm, struct task_struct *p);
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 852ae8c..cb62433 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -38,6 +38,83 @@
 
 #include <asm/irq_regs.h>
 
+struct remote_function_call {
+	struct task_struct *p;
+	int (*func)(void *info);
+	void *info;
+	int ret;
+};
+
+static void remote_function(void *data)
+{
+	struct remote_function_call *tfc = data;
+	struct task_struct *p = tfc->p;
+
+	if (p) {
+		tfc->ret = -EAGAIN;
+		if (task_cpu(p) != smp_processor_id() || !task_curr(p))
+			return;
+	}
+
+	tfc->ret = tfc->func(tfc->info);
+}
+
+/**
+ * task_function_call - call a function on the cpu on which a task runs
+ * @p:		the task to evaluate
+ * @func:	the function to be called
+ * @info:	the function call argument
+ *
+ * Calls the function @func when the task is currently running. This might
+ * be on the current CPU, which just calls the function directly
+ *
+ * returns: @func return value, or
+ * 	    -ESRCH  - when the process isn't running
+ * 	    -EAGAIN - when the process moved away
+ */
+static int
+task_function_call(struct task_struct *p, int (*func) (void *info), void *info)
+{
+	struct remote_function_call data = {
+		.p = p,
+		.func = func,
+		.info = info,
+		.ret = -ESRCH, /* No such (running) process */
+	};
+	int cpu;
+
+	preempt_disable();
+	cpu = task_cpu(p);
+	if (task_curr(p))
+		smp_call_function_single(cpu, remote_function, &data, 1);
+	preempt_enable();
+
+	return data.ret;
+}
+
+/**
+ * cpu_function_call - call a function on the cpu
+ * @func:	the function to be called
+ * @info:	the function call argument
+ *
+ * Calls the function @func on the remote cpu.
+ *
+ * returns: @func return value or -ENXIO when the cpu is offline
+ */
+static int cpu_function_call(int cpu, int (*func) (void *info), void *info)
+{
+	struct remote_function_call data = {
+		.p = NULL,
+		.func = func,
+		.info = info,
+		.ret = -ENXIO, /* No such CPU */
+	};
+
+	smp_call_function_single(cpu, remote_function, &data, 1);
+
+	return data.ret;
+}
+
 enum event_type_t {
 	EVENT_FLEXIBLE = 0x1,
 	EVENT_PINNED = 0x2,
@@ -618,27 +695,18 @@ __get_cpu_context(struct perf_event_context *ctx)
  * We disable the event on the hardware level first. After that we
  * remove it from the context list.
  */
-static void __perf_event_remove_from_context(void *info)
+static int __perf_remove_from_context(void *info)
 {
 	struct perf_event *event = info;
 	struct perf_event_context *ctx = event->ctx;
 	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
 
-	/*
-	 * If this is a task context, we need to check whether it is
-	 * the current task context of this cpu. If not it has been
-	 * scheduled out before the smp call arrived.
-	 */
-	if (ctx->task && cpuctx->task_ctx != ctx)
-		return;
-
 	raw_spin_lock(&ctx->lock);
-
 	event_sched_out(event, cpuctx, ctx);
-
 	list_del_event(event, ctx);
-
 	raw_spin_unlock(&ctx->lock);
+
+	return 0;
 }
 
 
@@ -657,7 +725,7 @@ static void __perf_event_remove_from_context(void *info)
  * When called from perf_event_exit_task, it's OK because the
  * context has been detached from its task.
  */
-static void perf_event_remove_from_context(struct perf_event *event)
+static void perf_remove_from_context(struct perf_event *event)
 {
 	struct perf_event_context *ctx = event->ctx;
 	struct task_struct *task = ctx->task;
@@ -667,39 +735,36 @@ static void perf_event_remove_from_context(struct perf_event *event)
 		 * Per cpu events are removed via an smp call and
 		 * the removal is always successful.
 		 */
-		smp_call_function_single(event->cpu,
-					 __perf_event_remove_from_context,
-					 event, 1);
+		cpu_function_call(event->cpu, __perf_remove_from_context, event);
 		return;
 	}
 
 retry:
-	task_oncpu_function_call(task, __perf_event_remove_from_context,
-				 event);
+	if (!task_function_call(task, __perf_remove_from_context, event))
+		return;
 
 	raw_spin_lock_irq(&ctx->lock);
 	/*
-	 * If the context is active we need to retry the smp call.
+	 * If we failed to find a running task, but find it running now that
+	 * we've acquired the ctx->lock, retry.
 	 */
-	if (ctx->nr_active && !list_empty(&event->group_entry)) {
+	if (task_curr(task)) {
 		raw_spin_unlock_irq(&ctx->lock);
 		goto retry;
 	}
 
 	/*
-	 * The lock prevents that this context is scheduled in so we
-	 * can remove the event safely, if the call above did not
-	 * succeed.
+	 * Since the task isn't running, it's safe to remove the event, us
+	 * holding the ctx->lock ensures the task won't get scheduled in.
 	 */
-	if (!list_empty(&event->group_entry))
-		list_del_event(event, ctx);
+	list_del_event(event, ctx);
 	raw_spin_unlock_irq(&ctx->lock);
 }
 
 /*
  * Cross CPU call to disable a performance event
  */
-static void __perf_event_disable(void *info)
+static int __perf_event_disable(void *info)
 {
 	struct perf_event *event = info;
 	struct perf_event_context *ctx = event->ctx;
@@ -710,7 +775,7 @@ static void __perf_event_disable(void *info)
 	 * event's task is the current task on this cpu.
 	 */
 	if (ctx->task && cpuctx->task_ctx != ctx)
-		return;
+		return -EINVAL;
 
 	raw_spin_lock(&ctx->lock);
 
@@ -729,6 +794,8 @@ static void __perf_event_disable(void *info)
 	}
 
 	raw_spin_unlock(&ctx->lock);
+
+	return 0;
 }
 
 /*
@@ -753,13 +820,13 @@ void perf_event_disable(struct perf_event *event)
 		/*
 		 * Disable the event on the cpu that it's on
 		 */
-		smp_call_function_single(event->cpu, __perf_event_disable,
-					 event, 1);
+		cpu_function_call(event->cpu, __perf_event_disable, event);
 		return;
 	}
 
 retry:
-	task_oncpu_function_call(task, __perf_event_disable, event);
+	if (!task_function_call(task, __perf_event_disable, event))
+		return;
 
 	raw_spin_lock_irq(&ctx->lock);
 	/*
@@ -767,6 +834,11 @@ retry:
 	 */
 	if (event->state == PERF_EVENT_STATE_ACTIVE) {
 		raw_spin_unlock_irq(&ctx->lock);
+		/*
+		 * Reload the task pointer, it might have been changed by
+		 * a concurrent perf_event_context_sched_out().
+		 */
+		task = ctx->task;
 		goto retry;
 	}
 
@@ -778,7 +850,6 @@ retry:
 		update_group_times(event);
 		event->state = PERF_EVENT_STATE_OFF;
 	}
-
 	raw_spin_unlock_irq(&ctx->lock);
 }
 
@@ -928,12 +999,14 @@ static void add_event_to_ctx(struct perf_event *event,
 	event->tstamp_stopped = tstamp;
 }
 
+static void perf_event_context_sched_in(struct perf_event_context *ctx);
+
 /*
  * Cross CPU call to install and enable a performance event
  *
  * Must be called with ctx->mutex held
  */
-static void __perf_install_in_context(void *info)
+static int  __perf_install_in_context(void *info)
 {
 	struct perf_event *event = info;
 	struct perf_event_context *ctx = event->ctx;
@@ -942,20 +1015,15 @@ static void __perf_install_in_context(void *info)
 	int err;
 
 	/*
-	 * If this is a task context, we need to check whether it is
-	 * the current task context of this cpu. If not it has been
-	 * scheduled out before the smp call arrived.
-	 * Or possibly this is the right context but it isn't
-	 * on this cpu because it had no events.
+	 * In case we're installing a new context to an already running task,
+	 * this could also happen before perf_event_task_sched_in() on architectures
+	 * which do context switches with IRQs enabled.
 	 */
-	if (ctx->task && cpuctx->task_ctx != ctx) {
-		if (cpuctx->task_ctx || ctx->task != current)
-			return;
-		cpuctx->task_ctx = ctx;
-	}
+	if (ctx->task && !cpuctx->task_ctx)
+		perf_event_context_sched_in(ctx);
 
 	raw_spin_lock(&ctx->lock);
-	ctx->is_active = 1;
+	WARN_ON_ONCE(!ctx->is_active);
 	update_context_time(ctx);
 
 	add_event_to_ctx(event, ctx);
@@ -997,6 +1065,8 @@ static void __perf_install_in_context(void *info)
 
 unlock:
 	raw_spin_unlock(&ctx->lock);
+
+	return 0;
 }
 
 /*
@@ -1025,31 +1095,29 @@ perf_install_in_context(struct perf_event_context *ctx,
 		 * Per cpu events are installed via an smp call and
 		 * the install is always successful.
 		 */
-		smp_call_function_single(cpu, __perf_install_in_context,
-					 event, 1);
+		cpu_function_call(cpu, __perf_install_in_context, event);
 		return;
 	}
 
 retry:
-	task_oncpu_function_call(task, __perf_install_in_context,
-				 event);
+	if (!task_function_call(task, __perf_install_in_context, event))
+		return;
 
 	raw_spin_lock_irq(&ctx->lock);
 	/*
-	 * we need to retry the smp call.
+	 * If we failed to find a running task, but find it running now that
+	 * we've acquired the ctx->lock, retry.
 	 */
-	if (ctx->is_active && list_empty(&event->group_entry)) {
+	if (task_curr(task)) {
 		raw_spin_unlock_irq(&ctx->lock);
 		goto retry;
 	}
 
 	/*
-	 * The lock prevents that this context is scheduled in so we
-	 * can add the event safely, if it the call above did not
-	 * succeed.
+	 * Since the task isn't running, it's safe to add the event, us holding
+	 * the ctx->lock ensures the task won't get scheduled in.
 	 */
-	if (list_empty(&event->group_entry))
-		add_event_to_ctx(event, ctx);
+	add_event_to_ctx(event, ctx);
 	raw_spin_unlock_irq(&ctx->lock);
 }
 
@@ -1078,7 +1146,7 @@ static void __perf_event_mark_enabled(struct perf_event *event,
 /*
  * Cross CPU call to enable a performance event
  */
-static void __perf_event_enable(void *info)
+static int __perf_event_enable(void *info)
 {
 	struct perf_event *event = info;
 	struct perf_event_context *ctx = event->ctx;
@@ -1086,18 +1154,10 @@ static void __perf_event_enable(void *info)
 	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
 	int err;
 
-	/*
-	 * If this is a per-task event, need to check whether this
-	 * event's task is the current task on this cpu.
-	 */
-	if (ctx->task && cpuctx->task_ctx != ctx) {
-		if (cpuctx->task_ctx || ctx->task != current)
-			return;
-		cpuctx->task_ctx = ctx;
-	}
+	if (WARN_ON_ONCE(!ctx->is_active))
+		return -EINVAL;
 
 	raw_spin_lock(&ctx->lock);
-	ctx->is_active = 1;
 	update_context_time(ctx);
 
 	if (event->state >= PERF_EVENT_STATE_INACTIVE)
@@ -1138,6 +1198,8 @@ static void __perf_event_enable(void *info)
 
 unlock:
 	raw_spin_unlock(&ctx->lock);
+
+	return 0;
 }
 
 /*
@@ -1158,8 +1220,7 @@ void perf_event_enable(struct perf_event *event)
 		/*
 		 * Enable the event on the cpu that it's on
 		 */
-		smp_call_function_single(event->cpu, __perf_event_enable,
-					 event, 1);
+		cpu_function_call(event->cpu, __perf_event_enable, event);
 		return;
 	}
 
@@ -1178,8 +1239,15 @@ void perf_event_enable(struct perf_event *event)
 		event->state = PERF_EVENT_STATE_OFF;
 
 retry:
+	if (!ctx->is_active) {
+		__perf_event_mark_enabled(event, ctx);
+		goto out;
+	}
+
 	raw_spin_unlock_irq(&ctx->lock);
-	task_oncpu_function_call(task, __perf_event_enable, event);
+
+	if (!task_function_call(task, __perf_event_enable, event))
+		return;
 
 	raw_spin_lock_irq(&ctx->lock);
 
@@ -1187,15 +1255,14 @@ retry:
 	 * If the context is active and the event is still off,
 	 * we need to retry the cross-call.
 	 */
-	if (ctx->is_active && event->state == PERF_EVENT_STATE_OFF)
+	if (ctx->is_active && event->state == PERF_EVENT_STATE_OFF) {
+		/*
+		 * task could have been flipped by a concurrent
+		 * perf_event_context_sched_out()
+		 */
+		task = ctx->task;
 		goto retry;
-
-	/*
-	 * Since we have the lock this context can't be scheduled
-	 * in, so we can change the state safely.
-	 */
-	if (event->state == PERF_EVENT_STATE_OFF)
-		__perf_event_mark_enabled(event, ctx);
+	}
 
 out:
 	raw_spin_unlock_irq(&ctx->lock);
@@ -1339,8 +1406,8 @@ static void perf_event_sync_stat(struct perf_event_context *ctx,
 	}
 }
 
-void perf_event_context_sched_out(struct task_struct *task, int ctxn,
-				  struct task_struct *next)
+static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
+					 struct task_struct *next)
 {
 	struct perf_event_context *ctx = task->perf_event_ctxp[ctxn];
 	struct perf_event_context *next_ctx;
@@ -1541,7 +1608,7 @@ static void task_ctx_sched_in(struct perf_event_context *ctx,
 	cpuctx->task_ctx = ctx;
 }
 
-void perf_event_context_sched_in(struct perf_event_context *ctx)
+static void perf_event_context_sched_in(struct perf_event_context *ctx)
 {
 	struct perf_cpu_context *cpuctx;
 
@@ -5949,10 +6016,10 @@ SYSCALL_DEFINE5(perf_event_open,
 		struct perf_event_context *gctx = group_leader->ctx;
 
 		mutex_lock(&gctx->mutex);
-		perf_event_remove_from_context(group_leader);
+		perf_remove_from_context(group_leader);
 		list_for_each_entry(sibling, &group_leader->sibling_list,
 				    group_entry) {
-			perf_event_remove_from_context(sibling);
+			perf_remove_from_context(sibling);
 			put_ctx(gctx);
 		}
 		mutex_unlock(&gctx->mutex);
@@ -6103,7 +6170,7 @@ __perf_event_exit_task(struct perf_event *child_event,
 {
 	struct perf_event *parent_event;
 
-	perf_event_remove_from_context(child_event);
+	perf_remove_from_context(child_event);
 
 	parent_event = child_event->parent;
 	/*
@@ -6594,9 +6661,9 @@ static void __perf_event_exit_context(void *__info)
 	perf_pmu_rotate_stop(ctx->pmu);
 
 	list_for_each_entry_safe(event, tmp, &ctx->pinned_groups, group_entry)
-		__perf_event_remove_from_context(event);
+		__perf_remove_from_context(event);
 	list_for_each_entry_safe(event, tmp, &ctx->flexible_groups, group_entry)
-		__perf_event_remove_from_context(event);
+		__perf_remove_from_context(event);
 }
 
 static void perf_event_exit_cpu_context(int cpu)
diff --git a/kernel/sched.c b/kernel/sched.c
index 18d38e4..01549b0 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -490,7 +490,7 @@ struct rq {
 	struct task_struct *curr, *idle, *stop;
 	unsigned long next_balance;
 	struct mm_struct *prev_mm;
-
+	
 	u64 clock;
 	u64 clock_task;
 
@@ -2265,27 +2265,6 @@ void kick_process(struct task_struct *p)
 EXPORT_SYMBOL_GPL(kick_process);
 #endif /* CONFIG_SMP */
 
-/**
- * task_oncpu_function_call - call a function on the cpu on which a task runs
- * @p:		the task to evaluate
- * @func:	the function to be called
- * @info:	the function call argument
- *
- * Calls the function @func when the task is currently running. This might
- * be on the current CPU, which just calls the function directly
- */
-void task_oncpu_function_call(struct task_struct *p,
-			      void (*func) (void *info), void *info)
-{
-	int cpu;
-
-	preempt_disable();
-	cpu = task_cpu(p);
-	if (task_curr(p))
-		smp_call_function_single(cpu, func, info, 1);
-	preempt_enable();
-}
-
 #ifdef CONFIG_SMP
 /*
  * ->cpus_allowed is protected by either TASK_WAKING or rq->lock held.
@@ -2776,9 +2755,12 @@ static inline void
 prepare_task_switch(struct rq *rq, struct task_struct *prev,
 		    struct task_struct *next)
 {
+	sched_info_switch(prev, next);
+	perf_event_task_sched_out(prev, next);
 	fire_sched_out_preempt_notifiers(prev, next);
 	prepare_lock_switch(rq, next);
 	prepare_arch_switch(next);
+	trace_sched_switch(prev, next);
 }
 
 /**
@@ -2911,7 +2893,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
 	struct mm_struct *mm, *oldmm;
 
 	prepare_task_switch(rq, prev, next);
-	trace_sched_switch(prev, next);
+
 	mm = next->mm;
 	oldmm = prev->active_mm;
 	/*
@@ -3989,9 +3971,6 @@ need_resched_nonpreemptible:
 	rq->skip_clock_update = 0;
 
 	if (likely(prev != next)) {
-		sched_info_switch(prev, next);
-		perf_event_task_sched_out(prev, next);
-
 		rq->nr_switches++;
 		rq->curr = next;
 		++*switch_count;



^ permalink raw reply related	[flat|nested] 91+ messages in thread

* Re: Q: perf_install_in_context/perf_event_enable are racy?
  2011-01-28 18:11                                     ` Peter Zijlstra
@ 2011-01-31 17:26                                       ` Oleg Nesterov
  2011-01-31 18:23                                         ` Peter Zijlstra
  0 siblings, 1 reply; 91+ messages in thread
From: Oleg Nesterov @ 2011-01-31 17:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Ingo Molnar, Alan Stern,
	Arnaldo Carvalho de Melo, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On 01/28, Peter Zijlstra wrote:
>
> Just to give you more food for thought, I couldn't help myself..

Hmm. So far I am only trying to understand the perf_install_in_context()
paths. And, after I spent almost 2 hours, I am starting to believe this
change is probably good ;)

I do not understand the point of cpu_function_call() though, it looks
equal to smp_call_function_single() ?

> -static void __perf_install_in_context(void *info)
> +static int  __perf_install_in_context(void *info)
>  {
>  	struct perf_event *event = info;
>  	struct perf_event_context *ctx = event->ctx;
> @@ -942,20 +1015,15 @@ static void __perf_install_in_context(void *info)
>  	int err;
>
>  	/*
> -	 * If this is a task context, we need to check whether it is
> -	 * the current task context of this cpu. If not it has been
> -	 * scheduled out before the smp call arrived.
> -	 * Or possibly this is the right context but it isn't
> -	 * on this cpu because it had no events.
> +	 * In case we're installing a new context to an already running task,
> > +	 * this could also happen before perf_event_task_sched_in() on architectures
> +	 * which do context switches with IRQs enabled.
>  	 */
> -	if (ctx->task && cpuctx->task_ctx != ctx) {
> -		if (cpuctx->task_ctx || ctx->task != current)
> -			return;
> -		cpuctx->task_ctx = ctx;
> -	}
> +	if (ctx->task && !cpuctx->task_ctx)
> +		perf_event_context_sched_in(ctx);

OK... This eliminates the 2nd race with __ARCH_WANT_INTERRUPTS_ON_CTXSW
(we must not set "cpuctx->task_ctx = ctx" in case "next" is going to
 do perf_event_context_sched_in() later). So it is enough to check
rq->curr in remote_function().

>  	raw_spin_lock(&ctx->lock);
> -	ctx->is_active = 1;
> +	WARN_ON_ONCE(!ctx->is_active);

This looks wrong if ctx->task == NULL.



So. With this patch it is possible that perf_event_context_sched_in()
is called right after prepare_lock_switch(). Stupid question, why
can't we always do this then? I mean, what if we change
prepare_task_switch() to do

	perf_event_task_sched_out();
	perf_event_task_sched_in();

?

Probably we can unify the COND_STMT(perf_task_events) check and simplify
things further.
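
I mean something like this (a sketch only; whether the sched_in hook
can really be called this early, and with which argument, is exactly
the question):

	static inline void
	prepare_task_switch(struct rq *rq, struct task_struct *prev,
			    struct task_struct *next)
	{
		sched_info_switch(prev, next);
		perf_event_task_sched_out(prev, next);
		perf_event_task_sched_in(next);
		fire_sched_out_preempt_notifiers(prev, next);
		prepare_lock_switch(rq, next);
		prepare_arch_switch(next);
		trace_sched_switch(prev, next);
	}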

Oleg.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: Q: perf_install_in_context/perf_event_enable are racy?
  2011-01-31 17:26                                       ` Oleg Nesterov
@ 2011-01-31 18:23                                         ` Peter Zijlstra
  2011-01-31 19:11                                           ` Oleg Nesterov
  0 siblings, 1 reply; 91+ messages in thread
From: Peter Zijlstra @ 2011-01-31 18:23 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Frederic Weisbecker, Ingo Molnar, Alan Stern,
	Arnaldo Carvalho de Melo, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On Mon, 2011-01-31 at 18:26 +0100, Oleg Nesterov wrote:
> On 01/28, Peter Zijlstra wrote:
> >
> > Just to give you more food for thought, I couldn't help myself..
> 
> Hmm. So far I am only trying to understand the perf_install_in_context()
> paths. And, after I spent almost 2 hours, I am starting to believe this
> change is probably good ;)
 
phew ;-)

> I do not understand the point of cpu_function_call() though, it looks
> equal to smp_call_function_single() ?

Very nearly so, except it takes a function that returns an int..

> > -static void __perf_install_in_context(void *info)
> > +static int  __perf_install_in_context(void *info)
> >  {
> >  	struct perf_event *event = info;
> >  	struct perf_event_context *ctx = event->ctx;
> > @@ -942,20 +1015,15 @@ static void __perf_install_in_context(void *info)
> >  	int err;
> >
> >  	/*
> > -	 * If this is a task context, we need to check whether it is
> > -	 * the current task context of this cpu. If not it has been
> > -	 * scheduled out before the smp call arrived.
> > -	 * Or possibly this is the right context but it isn't
> > -	 * on this cpu because it had no events.
> > +	 * In case we're installing a new context to an already running task,
> > > +	 * this could also happen before perf_event_task_sched_in() on architectures
> > +	 * which do context switches with IRQs enabled.
> >  	 */
> > -	if (ctx->task && cpuctx->task_ctx != ctx) {
> > -		if (cpuctx->task_ctx || ctx->task != current)
> > -			return;
> > -		cpuctx->task_ctx = ctx;
> > -	}
> > +	if (ctx->task && !cpuctx->task_ctx)
> > +		perf_event_context_sched_in(ctx);
> 
> OK... This eliminates the 2nd race with __ARCH_WANT_INTERRUPTS_ON_CTXSW
> (we must not set "cpuctx->task_ctx = ctx" in case "next" is going to
>  do perf_event_context_sched_in() later). So it is enough to check
> rq->curr in remote_function().

Right, but since I moved those functions into perf_event.c (they were
getting rather specific) I can no longer deref (or even obtain) a rq
structure. So it implements rq->curr == p in a somewhat round-about
fashion but it should be identical.
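
That is, on the target cpu the trampoline's

	if (task_cpu(p) != smp_processor_id() || !task_curr(p))
		return;

bails exactly when rq->curr != p, without needing the rq itself.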

> 
> >  	raw_spin_lock(&ctx->lock);
> > -	ctx->is_active = 1;
> > +	WARN_ON_ONCE(!ctx->is_active);
> 
> This looks wrong if ctx->task == NULL.

cpuctx->ctx should still have ->is_active = 1 I think.

> 
> So. With this patch it is possible that perf_event_context_sched_in()
> is called right after prepare_lock_switch(). Stupid question, why
> can't we always do this then? I mean, what if we change
> prepare_task_switch() to do
> 
> 	perf_event_task_sched_out();
> 	perf_event_task_sched_in();
> 
> ?
> 
> Probably we can unify the COND_STMT(perf_task_events) check and simplify
> things further.

That might work. Ingo, any reason we have a pre and post hook around the
context switch and not a single function?


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: Q: perf_install_in_context/perf_event_enable are racy?
  2011-01-31 18:23                                         ` Peter Zijlstra
@ 2011-01-31 19:11                                           ` Oleg Nesterov
  2011-01-31 19:29                                             ` Peter Zijlstra
  0 siblings, 1 reply; 91+ messages in thread
From: Oleg Nesterov @ 2011-01-31 19:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Ingo Molnar, Alan Stern,
	Arnaldo Carvalho de Melo, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On 01/31, Peter Zijlstra wrote:
>
> On Mon, 2011-01-31 at 18:26 +0100, Oleg Nesterov wrote:
>
> > I do not understand the point of cpu_function_call() though, it looks
> > equal to smp_call_function_single() ?
>
> Very nearly so, except it takes a function that returns an int..

Ah, indeed...

> > >  	raw_spin_lock(&ctx->lock);
> > > -	ctx->is_active = 1;
> > > +	WARN_ON_ONCE(!ctx->is_active);
> >
> > This looks wrong if ctx->task == NULL.
>
> cpuctx->ctx should still have ->is_active = 1 I think.

Unless this is the first cpu counter, no?

Oleg.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: Q: perf_install_in_context/perf_event_enable are racy?
  2011-01-31 19:11                                           ` Oleg Nesterov
@ 2011-01-31 19:29                                             ` Peter Zijlstra
  2011-02-01 14:03                                               ` [PATCH] perf: Cure task_oncpu_function_call() races Peter Zijlstra
  0 siblings, 1 reply; 91+ messages in thread
From: Peter Zijlstra @ 2011-01-31 19:29 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Frederic Weisbecker, Ingo Molnar, Alan Stern,
	Arnaldo Carvalho de Melo, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On Mon, 2011-01-31 at 20:11 +0100, Oleg Nesterov wrote:
> 
> > > >   raw_spin_lock(&ctx->lock);
> > > > - ctx->is_active = 1;
> > > > + WARN_ON_ONCE(!ctx->is_active);
> > >
> > > This looks wrong if ctx->task == NULL.
> >
> > cpuctx->ctx should still have ->is_active = 1 I think.
> 
> Unless this is the first cpu counter, no? 

Ah, indeed.. 


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [PATCH] perf: Cure task_oncpu_function_call() races
  2011-01-31 19:29                                             ` Peter Zijlstra
@ 2011-02-01 14:03                                               ` Peter Zijlstra
  2011-02-01 17:27                                                 ` Oleg Nesterov
  0 siblings, 1 reply; 91+ messages in thread
From: Peter Zijlstra @ 2011-02-01 14:03 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Frederic Weisbecker, Ingo Molnar, Alan Stern,
	Arnaldo Carvalho de Melo, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel


Oleg, I've actually run-tested the below and all seems well (clearly
I've never actually hit the races found before either, so in that
respect it's not a conclusive test).

Can you agree with this patch?

---
Oleg reported that on architectures with
__ARCH_WANT_INTERRUPTS_ON_CTXSW the IPI from task_oncpu_function_call()
can land before perf_event_task_sched_in() and cause interesting
situations for, e.g., perf_install_in_context().

This patch reworks the task_oncpu_function_call() interface to give a
more usable primitive, and reworks all its users to hopefully be
more obvious as well as race-free.

Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/sched.h |    7 --
 kernel/perf_event.c   |  240 +++++++++++++++++++++++++++++++------------------
 kernel/sched.c        |   29 +-----
 3 files changed, 157 insertions(+), 119 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3294f60..57ad0f9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2575,13 +2575,6 @@ static inline void inc_syscw(struct task_struct *tsk)
 #define TASK_SIZE_OF(tsk)	TASK_SIZE
 #endif
 
-/*
- * Call the function if the target task is executing on a CPU right now:
- */
-extern void task_oncpu_function_call(struct task_struct *p,
-				     void (*func) (void *info), void *info);
-
-
 #ifdef CONFIG_MM_OWNER
 extern void mm_update_next_owner(struct mm_struct *mm);
 extern void mm_init_owner(struct mm_struct *mm, struct task_struct *p);
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 852ae8c..b81f31f 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -38,6 +38,79 @@
 
 #include <asm/irq_regs.h>
 
+struct remote_function_call {
+	struct task_struct *p;
+	int (*func)(void *info);
+	void *info;
+	int ret;
+};
+
+static void remote_function(void *data)
+{
+	struct remote_function_call *tfc = data;
+	struct task_struct *p = tfc->p;
+
+	if (p) {
+		tfc->ret = -EAGAIN;
+		if (task_cpu(p) != smp_processor_id() || !task_curr(p))
+			return;
+	}
+
+	tfc->ret = tfc->func(tfc->info);
+}
+
+/**
+ * task_function_call - call a function on the cpu on which a task runs
+ * @p:		the task to evaluate
+ * @func:	the function to be called
+ * @info:	the function call argument
+ *
+ * Calls the function @func when the task is currently running. This might
+ * be on the current CPU, which just calls the function directly
+ *
+ * returns: @func return value, or
+ * 	    -ESRCH  - when the process isn't running
+ * 	    -EAGAIN - when the process moved away
+ */
+static int
+task_function_call(struct task_struct *p, int (*func) (void *info), void *info)
+{
+	struct remote_function_call data = {
+		.p = p,
+		.func = func,
+		.info = info,
+		.ret = -ESRCH, /* No such (running) process */
+	};
+
+	if (task_curr(p))
+		smp_call_function_single(task_cpu(p), remote_function, &data, 1);
+
+	return data.ret;
+}
+
+/**
+ * cpu_function_call - call a function on the cpu
+ * @func:	the function to be called
+ * @info:	the function call argument
+ *
+ * Calls the function @func on the remote cpu.
+ *
+ * returns: @func return value or -ENXIO when the cpu is offline
+ */
+static int cpu_function_call(int cpu, int (*func) (void *info), void *info)
+{
+	struct remote_function_call data = {
+		.p = NULL,
+		.func = func,
+		.info = info,
+		.ret = -ENXIO, /* No such CPU */
+	};
+
+	smp_call_function_single(cpu, remote_function, &data, 1);
+
+	return data.ret;
+}
+
 enum event_type_t {
 	EVENT_FLEXIBLE = 0x1,
 	EVENT_PINNED = 0x2,
@@ -618,35 +691,24 @@ __get_cpu_context(struct perf_event_context *ctx)
  * We disable the event on the hardware level first. After that we
  * remove it from the context list.
  */
-static void __perf_event_remove_from_context(void *info)
+static int __perf_remove_from_context(void *info)
 {
 	struct perf_event *event = info;
 	struct perf_event_context *ctx = event->ctx;
 	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
 
-	/*
-	 * If this is a task context, we need to check whether it is
-	 * the current task context of this cpu. If not it has been
-	 * scheduled out before the smp call arrived.
-	 */
-	if (ctx->task && cpuctx->task_ctx != ctx)
-		return;
-
 	raw_spin_lock(&ctx->lock);
-
 	event_sched_out(event, cpuctx, ctx);
-
 	list_del_event(event, ctx);
-
 	raw_spin_unlock(&ctx->lock);
+
+	return 0;
 }
 
 
 /*
  * Remove the event from a task's (or a CPU's) list of events.
  *
- * Must be called with ctx->mutex held.
- *
  * CPU events are removed with a smp call. For task events we only
  * call when the task is on a CPU.
  *
@@ -657,49 +719,48 @@ static void __perf_event_remove_from_context(void *info)
  * When called from perf_event_exit_task, it's OK because the
  * context has been detached from its task.
  */
-static void perf_event_remove_from_context(struct perf_event *event)
+static void perf_remove_from_context(struct perf_event *event)
 {
 	struct perf_event_context *ctx = event->ctx;
 	struct task_struct *task = ctx->task;
 
+	lockdep_assert_held(&ctx->mutex);
+
 	if (!task) {
 		/*
 		 * Per cpu events are removed via an smp call and
 		 * the removal is always successful.
 		 */
-		smp_call_function_single(event->cpu,
-					 __perf_event_remove_from_context,
-					 event, 1);
+		cpu_function_call(event->cpu, __perf_remove_from_context, event);
 		return;
 	}
 
 retry:
-	task_oncpu_function_call(task, __perf_event_remove_from_context,
-				 event);
+	if (!task_function_call(task, __perf_remove_from_context, event))
+		return;
 
 	raw_spin_lock_irq(&ctx->lock);
 	/*
-	 * If the context is active we need to retry the smp call.
+	 * If we failed to find a running task, but find it running now that
+	 * we've acquired the ctx->lock, retry.
 	 */
-	if (ctx->nr_active && !list_empty(&event->group_entry)) {
+	if (task_curr(task)) {
 		raw_spin_unlock_irq(&ctx->lock);
 		goto retry;
 	}
 
 	/*
-	 * The lock prevents that this context is scheduled in so we
-	 * can remove the event safely, if the call above did not
-	 * succeed.
+	 * Since the task isn't running, it's safe to remove the event; our
+	 * holding the ctx->lock ensures the task won't get scheduled in.
 	 */
-	if (!list_empty(&event->group_entry))
-		list_del_event(event, ctx);
+	list_del_event(event, ctx);
 	raw_spin_unlock_irq(&ctx->lock);
 }
 
 /*
  * Cross CPU call to disable a performance event
  */
-static void __perf_event_disable(void *info)
+static int __perf_event_disable(void *info)
 {
 	struct perf_event *event = info;
 	struct perf_event_context *ctx = event->ctx;
@@ -708,9 +769,12 @@ static void __perf_event_disable(void *info)
 	/*
 	 * If this is a per-task event, need to check whether this
 	 * event's task is the current task on this cpu.
+	 *
+	 * Can trigger due to concurrent perf_event_context_sched_out()
+	 * flipping contexts around.
 	 */
 	if (ctx->task && cpuctx->task_ctx != ctx)
-		return;
+		return -EINVAL;
 
 	raw_spin_lock(&ctx->lock);
 
@@ -729,6 +793,8 @@ static void __perf_event_disable(void *info)
 	}
 
 	raw_spin_unlock(&ctx->lock);
+
+	return 0;
 }
 
 /*
@@ -753,13 +819,13 @@ void perf_event_disable(struct perf_event *event)
 		/*
 		 * Disable the event on the cpu that it's on
 		 */
-		smp_call_function_single(event->cpu, __perf_event_disable,
-					 event, 1);
+		cpu_function_call(event->cpu, __perf_event_disable, event);
 		return;
 	}
 
 retry:
-	task_oncpu_function_call(task, __perf_event_disable, event);
+	if (!task_function_call(task, __perf_event_disable, event))
+		return;
 
 	raw_spin_lock_irq(&ctx->lock);
 	/*
@@ -767,6 +833,11 @@ retry:
 	 */
 	if (event->state == PERF_EVENT_STATE_ACTIVE) {
 		raw_spin_unlock_irq(&ctx->lock);
+		/*
+		 * Reload the task pointer, it might have been changed by
+		 * a concurrent perf_event_context_sched_out().
+		 */
+		task = ctx->task;
 		goto retry;
 	}
 
@@ -778,7 +849,6 @@ retry:
 		update_group_times(event);
 		event->state = PERF_EVENT_STATE_OFF;
 	}
-
 	raw_spin_unlock_irq(&ctx->lock);
 }
 
@@ -928,12 +998,14 @@ static void add_event_to_ctx(struct perf_event *event,
 	event->tstamp_stopped = tstamp;
 }
 
+static void perf_event_context_sched_in(struct perf_event_context *ctx);
+
 /*
  * Cross CPU call to install and enable a performance event
  *
  * Must be called with ctx->mutex held
  */
-static void __perf_install_in_context(void *info)
+static int __perf_install_in_context(void *info)
 {
 	struct perf_event *event = info;
 	struct perf_event_context *ctx = event->ctx;
@@ -942,17 +1014,12 @@ static void __perf_install_in_context(void *info)
 	int err;
 
 	/*
-	 * If this is a task context, we need to check whether it is
-	 * the current task context of this cpu. If not it has been
-	 * scheduled out before the smp call arrived.
-	 * Or possibly this is the right context but it isn't
-	 * on this cpu because it had no events.
+	 * In case we're installing a new context into an already running task,
+	 * this could also happen before perf_event_task_sched_in() on architectures
+	 * which do context switches with IRQs enabled.
 	 */
-	if (ctx->task && cpuctx->task_ctx != ctx) {
-		if (cpuctx->task_ctx || ctx->task != current)
-			return;
-		cpuctx->task_ctx = ctx;
-	}
+	if (ctx->task && !cpuctx->task_ctx)
+		perf_event_context_sched_in(ctx);
 
 	raw_spin_lock(&ctx->lock);
 	ctx->is_active = 1;
@@ -997,6 +1064,8 @@ static void __perf_install_in_context(void *info)
 
 unlock:
 	raw_spin_unlock(&ctx->lock);
+
+	return 0;
 }
 
 /*
@@ -1008,8 +1077,6 @@ unlock:
  * If the event is attached to a task which is on a CPU we use a smp
  * call to enable it in the task context. The task might have been
  * scheduled away, but we check this in the smp call again.
- *
- * Must be called with ctx->mutex held.
  */
 static void
 perf_install_in_context(struct perf_event_context *ctx,
@@ -1018,6 +1085,8 @@ perf_install_in_context(struct perf_event_context *ctx,
 {
 	struct task_struct *task = ctx->task;
 
+	lockdep_assert_held(&ctx->mutex);
+
 	event->ctx = ctx;
 
 	if (!task) {
@@ -1025,31 +1094,29 @@ perf_install_in_context(struct perf_event_context *ctx,
 		 * Per cpu events are installed via an smp call and
 		 * the install is always successful.
 		 */
-		smp_call_function_single(cpu, __perf_install_in_context,
-					 event, 1);
+		cpu_function_call(cpu, __perf_install_in_context, event);
 		return;
 	}
 
 retry:
-	task_oncpu_function_call(task, __perf_install_in_context,
-				 event);
+	if (!task_function_call(task, __perf_install_in_context, event))
+		return;
 
 	raw_spin_lock_irq(&ctx->lock);
 	/*
-	 * we need to retry the smp call.
+	 * If we failed to find a running task, but find it running now that
+	 * we've acquired the ctx->lock, retry.
 	 */
-	if (ctx->is_active && list_empty(&event->group_entry)) {
+	if (task_curr(task)) {
 		raw_spin_unlock_irq(&ctx->lock);
 		goto retry;
 	}
 
 	/*
-	 * The lock prevents that this context is scheduled in so we
-	 * can add the event safely, if it the call above did not
-	 * succeed.
+	 * Since the task isn't running, it's safe to add the event; our holding
+	 * the ctx->lock ensures the task won't get scheduled in.
 	 */
-	if (list_empty(&event->group_entry))
-		add_event_to_ctx(event, ctx);
+	add_event_to_ctx(event, ctx);
 	raw_spin_unlock_irq(&ctx->lock);
 }
 
@@ -1078,7 +1145,7 @@ static void __perf_event_mark_enabled(struct perf_event *event,
 /*
  * Cross CPU call to enable a performance event
  */
-static void __perf_event_enable(void *info)
+static int __perf_event_enable(void *info)
 {
 	struct perf_event *event = info;
 	struct perf_event_context *ctx = event->ctx;
@@ -1086,18 +1153,10 @@ static void __perf_event_enable(void *info)
 	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
 	int err;
 
-	/*
-	 * If this is a per-task event, need to check whether this
-	 * event's task is the current task on this cpu.
-	 */
-	if (ctx->task && cpuctx->task_ctx != ctx) {
-		if (cpuctx->task_ctx || ctx->task != current)
-			return;
-		cpuctx->task_ctx = ctx;
-	}
+	if (WARN_ON_ONCE(!ctx->is_active))
+		return -EINVAL;
 
 	raw_spin_lock(&ctx->lock);
-	ctx->is_active = 1;
 	update_context_time(ctx);
 
 	if (event->state >= PERF_EVENT_STATE_INACTIVE)
@@ -1138,6 +1197,8 @@ static void __perf_event_enable(void *info)
 
 unlock:
 	raw_spin_unlock(&ctx->lock);
+
+	return 0;
 }
 
 /*
@@ -1158,8 +1219,7 @@ void perf_event_enable(struct perf_event *event)
 		/*
 		 * Enable the event on the cpu that it's on
 		 */
-		smp_call_function_single(event->cpu, __perf_event_enable,
-					 event, 1);
+		cpu_function_call(event->cpu, __perf_event_enable, event);
 		return;
 	}
 
@@ -1178,8 +1238,15 @@ void perf_event_enable(struct perf_event *event)
 		event->state = PERF_EVENT_STATE_OFF;
 
 retry:
+	if (!ctx->is_active) {
+		__perf_event_mark_enabled(event, ctx);
+		goto out;
+	}
+
 	raw_spin_unlock_irq(&ctx->lock);
-	task_oncpu_function_call(task, __perf_event_enable, event);
+
+	if (!task_function_call(task, __perf_event_enable, event))
+		return;
 
 	raw_spin_lock_irq(&ctx->lock);
 
@@ -1187,15 +1254,14 @@ retry:
 	 * If the context is active and the event is still off,
 	 * we need to retry the cross-call.
 	 */
-	if (ctx->is_active && event->state == PERF_EVENT_STATE_OFF)
+	if (ctx->is_active && event->state == PERF_EVENT_STATE_OFF) {
+		/*
+		 * task could have been flipped by a concurrent
+		 * perf_event_context_sched_out()
+		 */
+		task = ctx->task;
 		goto retry;
-
-	/*
-	 * Since we have the lock this context can't be scheduled
-	 * in, so we can change the state safely.
-	 */
-	if (event->state == PERF_EVENT_STATE_OFF)
-		__perf_event_mark_enabled(event, ctx);
+	}
 
 out:
 	raw_spin_unlock_irq(&ctx->lock);
@@ -1339,8 +1405,8 @@ static void perf_event_sync_stat(struct perf_event_context *ctx,
 	}
 }
 
-void perf_event_context_sched_out(struct task_struct *task, int ctxn,
-				  struct task_struct *next)
+static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
+					 struct task_struct *next)
 {
 	struct perf_event_context *ctx = task->perf_event_ctxp[ctxn];
 	struct perf_event_context *next_ctx;
@@ -1541,7 +1607,7 @@ static void task_ctx_sched_in(struct perf_event_context *ctx,
 	cpuctx->task_ctx = ctx;
 }
 
-void perf_event_context_sched_in(struct perf_event_context *ctx)
+static void perf_event_context_sched_in(struct perf_event_context *ctx)
 {
 	struct perf_cpu_context *cpuctx;
 
@@ -5949,10 +6015,10 @@ SYSCALL_DEFINE5(perf_event_open,
 		struct perf_event_context *gctx = group_leader->ctx;
 
 		mutex_lock(&gctx->mutex);
-		perf_event_remove_from_context(group_leader);
+		perf_remove_from_context(group_leader);
 		list_for_each_entry(sibling, &group_leader->sibling_list,
 				    group_entry) {
-			perf_event_remove_from_context(sibling);
+			perf_remove_from_context(sibling);
 			put_ctx(gctx);
 		}
 		mutex_unlock(&gctx->mutex);
@@ -6103,7 +6169,7 @@ __perf_event_exit_task(struct perf_event *child_event,
 {
 	struct perf_event *parent_event;
 
-	perf_event_remove_from_context(child_event);
+	perf_remove_from_context(child_event);
 
 	parent_event = child_event->parent;
 	/*
@@ -6594,9 +6660,9 @@ static void __perf_event_exit_context(void *__info)
 	perf_pmu_rotate_stop(ctx->pmu);
 
 	list_for_each_entry_safe(event, tmp, &ctx->pinned_groups, group_entry)
-		__perf_event_remove_from_context(event);
+		__perf_remove_from_context(event);
 	list_for_each_entry_safe(event, tmp, &ctx->flexible_groups, group_entry)
-		__perf_event_remove_from_context(event);
+		__perf_remove_from_context(event);
 }
 
 static void perf_event_exit_cpu_context(int cpu)
diff --git a/kernel/sched.c b/kernel/sched.c
index 477e1bc..6daa737 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2297,27 +2297,6 @@ void kick_process(struct task_struct *p)
 EXPORT_SYMBOL_GPL(kick_process);
 #endif /* CONFIG_SMP */
 
-/**
- * task_oncpu_function_call - call a function on the cpu on which a task runs
- * @p:		the task to evaluate
- * @func:	the function to be called
- * @info:	the function call argument
- *
- * Calls the function @func when the task is currently running. This might
- * be on the current CPU, which just calls the function directly
- */
-void task_oncpu_function_call(struct task_struct *p,
-			      void (*func) (void *info), void *info)
-{
-	int cpu;
-
-	preempt_disable();
-	cpu = task_cpu(p);
-	if (task_curr(p))
-		smp_call_function_single(cpu, func, info, 1);
-	preempt_enable();
-}
-
 #ifdef CONFIG_SMP
 /*
  * ->cpus_allowed is protected by either TASK_WAKING or rq->lock held.
@@ -2809,9 +2788,12 @@ static inline void
 prepare_task_switch(struct rq *rq, struct task_struct *prev,
 		    struct task_struct *next)
 {
+	sched_info_switch(prev, next);
+	perf_event_task_sched_out(prev, next);
 	fire_sched_out_preempt_notifiers(prev, next);
 	prepare_lock_switch(rq, next);
 	prepare_arch_switch(next);
+	trace_sched_switch(prev, next);
 }
 
 /**
@@ -2944,7 +2926,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
 	struct mm_struct *mm, *oldmm;
 
 	prepare_task_switch(rq, prev, next);
-	trace_sched_switch(prev, next);
+
 	mm = next->mm;
 	oldmm = prev->active_mm;
 	/*
@@ -4116,9 +4098,6 @@ need_resched_nonpreemptible:
 	rq->skip_clock_update = 0;
 
 	if (likely(prev != next)) {
-		sched_info_switch(prev, next);
-		perf_event_task_sched_out(prev, next);
-
 		rq->nr_switches++;
 		rq->curr = next;
 		++*switch_count;

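To make the new calling convention concrete, here is a user-space toy of
the remote_function_call pattern above (a sketch only: the names mirror
the patch, but the cross-CPU IPI is faked with a flag, so this shows the
control flow rather than the concurrency). Bundling func, info and a
return slot in one struct is what lets task_function_call() report
-ESRCH ("never ran") and -EAGAIN ("target moved") distinctly, which the
callers' retry loops depend on:

/* Toy model of remote_function_call; not kernel code. */
#include <errno.h>
#include <stdio.h>

struct remote_function_call {
	int target_ok;			/* stands in for task_cpu()/task_curr() */
	int (*func)(void *info);
	void *info;
	int ret;
};

static void remote_function(void *data)
{
	struct remote_function_call *tfc = data;

	if (!tfc->target_ok) {
		tfc->ret = -EAGAIN;	/* task moved away, caller may retry */
		return;
	}
	tfc->ret = tfc->func(tfc->info);
}

static int do_remove(void *info)
{
	printf("removing %s\n", (const char *)info);
	return 0;
}

int main(void)
{
	static char ev[] = "event0";
	struct remote_function_call data = {
		.target_ok	= 0,		/* first attempt: target moved */
		.func		= do_remove,
		.info		= ev,
		.ret		= -ESRCH,	/* never ran at all */
	};

	remote_function(&data);
	while (data.ret) {			/* the retry loop the callers use */
		data.target_ok = 1;		/* pretend the re-check succeeded */
		remote_function(&data);
	}
	printf("ret = %d\n", data.ret);
	return 0;
}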



* Re: [PATCH] perf: Cure task_oncpu_function_call() races
  2011-02-01 14:03                                               ` [PATCH] perf: Cure task_oncpu_function_call() races Peter Zijlstra
@ 2011-02-01 17:27                                                 ` Oleg Nesterov
  2011-02-01 18:08                                                   ` Peter Zijlstra
  0 siblings, 1 reply; 91+ messages in thread
From: Oleg Nesterov @ 2011-02-01 17:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Ingo Molnar, Alan Stern,
	Arnaldo Carvalho de Melo, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On 02/01, Peter Zijlstra wrote:
>
> Oleg, I've actually run-tested the below and all seems well (clearly
> I've never actually hit the races found before either, so in that
> respect it's not a conclusive test).
>
> Can you agree with this patch?

You know, I already wrote the i-think-it-is-correct email. But then
I decided to read it once again.

> -static void __perf_event_remove_from_context(void *info)
> +static int __perf_remove_from_context(void *info)
>  {
>  	struct perf_event *event = info;
>  	struct perf_event_context *ctx = event->ctx;
>  	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
>
> -	/*
> -	 * If this is a task context, we need to check whether it is
> -	 * the current task context of this cpu. If not it has been
> -	 * scheduled out before the smp call arrived.
> -	 */
> -	if (ctx->task && cpuctx->task_ctx != ctx)
> -		return;

OK, I think this is right... event_sched_out() will see
PERF_EVENT_STATE_INACTIVE if perf_event_task_sched_in() was not
called yet.

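A tiny sketch of the property being relied on (assuming event_sched_out()
simply returns for events that are not ACTIVE, which is what "will see
PERF_EVENT_STATE_INACTIVE" relies on; toy model, not the kernel code):

#include <stdio.h>

enum ev_state { STATE_OFF, STATE_INACTIVE, STATE_ACTIVE };

/* models event_sched_out(): nothing to undo unless the event is ACTIVE */
static void event_sched_out(enum ev_state *state)
{
	if (*state != STATE_ACTIVE)
		return;
	*state = STATE_INACTIVE;
}

int main(void)
{
	/* the IPI lands before perf_event_task_sched_in() has run: */
	enum ev_state state = STATE_INACTIVE;

	event_sched_out(&state);	/* harmless no-op */
	printf("state = %d\n", state);	/* still STATE_INACTIVE */
	return 0;
}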
But,

> -static void perf_event_remove_from_context(struct perf_event *event)
> +static void perf_remove_from_context(struct perf_event *event)
>  {
> ...
>  	raw_spin_lock_irq(&ctx->lock);
>  	/*
> -	 * If the context is active we need to retry the smp call.
> +	 * If we failed to find a running task, but find it running now that
> +	 * we've acquired the ctx->lock, retry.
>  	 */
> -	if (ctx->nr_active && !list_empty(&event->group_entry)) {
> +	if (task_curr(task)) {
>  		raw_spin_unlock_irq(&ctx->lock);
>  		goto retry;
>  	}
>
>  	/*
> -	 * The lock prevents that this context is scheduled in so we
> -	 * can remove the event safely, if the call above did not
> -	 * succeed.
> > +	 * Since the task isn't running, it's safe to remove the event; our
> +	 * holding the ctx->lock ensures the task won't get scheduled in.
>  	 */
> -	if (!list_empty(&event->group_entry))
> -		list_del_event(event, ctx);
> +	list_del_event(event, ctx);

this looks suspicious (the same for perf_install_in_context).

Unlike the IPI handler, this can see schedule-in-progress in any state.
In particular, we can see rq->curr == next (so that task_curr() == F)
before "prev" has even called perf_event_task_sched_out().

So we have to check ctx->is_active, or schedule() should change rq->curr
after perf_event_task_sched_out().
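That window is easier to see re-enacted sequentially. A sketch, assuming
(as above) that schedule() now updates rq->curr before "prev" gets to run
perf_event_task_sched_out(); everything here is single-threaded, the
CPU0/CPU1 comments only mark who would do what:

#include <stdio.h>

struct task { const char *name; };

static struct task prev_task = { "prev" }, next_task = { "next" };
static struct task *rq_curr = &prev_task;	/* models rq->curr */
static int prev_ctx_active = 1;			/* models prev's ctx->is_active */

static int task_curr(struct task *t)
{
	return rq_curr == t;
}

int main(void)
{
	/* CPU1, in schedule(): rq->curr is flipped first ... */
	rq_curr = &next_task;

	/* CPU0, perf_remove_from_context() samples the state here: */
	printf("task_curr(prev) = %d, ctx->is_active = %d\n",
	       task_curr(&prev_task), prev_ctx_active);
	/*
	 * task_curr() already says "not running", yet the context is
	 * still active: list_del_event() at this point would race with
	 * the event being scheduled out on CPU1.  Checking
	 * ctx->is_active instead (as suggested above) closes the window.
	 */

	/* CPU1: ... and only later runs perf_event_task_sched_out() */
	prev_ctx_active = 0;
	return 0;
}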

> @@ -753,13 +819,13 @@ void perf_event_disable(struct perf_event *event)
> ...
>  	 */
>  	if (event->state == PERF_EVENT_STATE_ACTIVE) {
>  		raw_spin_unlock_irq(&ctx->lock);
> +		/*
> +		 * Reload the task pointer, it might have been changed by
> +		 * a concurrent perf_event_context_sched_out().
> +		 */
> +		task = ctx->task;
>  		goto retry;

I am wondering why only perf_event_disable() needs this...

Just curious, is this equally needed without this patch?

Oleg.



* Re: [PATCH] perf: Cure task_oncpu_function_call() races
  2011-02-01 17:27                                                 ` Oleg Nesterov
@ 2011-02-01 18:08                                                   ` Peter Zijlstra
  2011-02-01 18:18                                                     ` Peter Zijlstra
  2011-02-01 21:00                                                     ` Peter Zijlstra
  0 siblings, 2 replies; 91+ messages in thread
From: Peter Zijlstra @ 2011-02-01 18:08 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Frederic Weisbecker, Ingo Molnar, Alan Stern,
	Arnaldo Carvalho de Melo, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On Tue, 2011-02-01 at 18:27 +0100, Oleg Nesterov wrote:
> On 02/01, Peter Zijlstra wrote:
> >
> > Oleg, I've actually run-tested the below and all seems well (clearly
> > I've never actually hit the races found before either, so in that
> > respect it's not a conclusive test).
> >
> > Can you agree with this patch?
> 
> You know, I already wrote the i-think-it-is-correct email. But then
> I decided to read it once again.

:-)

> > -static void __perf_event_remove_from_context(void *info)
> > +static int __perf_remove_from_context(void *info)
> >  {
> >  	struct perf_event *event = info;
> >  	struct perf_event_context *ctx = event->ctx;
> >  	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
> >
> > -	/*
> > -	 * If this is a task context, we need to check whether it is
> > -	 * the current task context of this cpu. If not it has been
> > -	 * scheduled out before the smp call arrived.
> > -	 */
> > -	if (ctx->task && cpuctx->task_ctx != ctx)
> > -		return;
> 
> OK, I think this is right... event_sched_out() will see
> PERF_EVENT_STATE_INACTIVE if perf_event_task_sched_in() was not
> called yet.

Right

> But,
> 
> > -static void perf_event_remove_from_context(struct perf_event *event)
> > +static void perf_remove_from_context(struct perf_event *event)
> >  {
> > ...
> >  	raw_spin_lock_irq(&ctx->lock);
> >  	/*
> > -	 * If the context is active we need to retry the smp call.
> > +	 * If we failed to find a running task, but find it running now that
> > +	 * we've acquired the ctx->lock, retry.
> >  	 */
> > -	if (ctx->nr_active && !list_empty(&event->group_entry)) {
> > +	if (task_curr(task)) {
> >  		raw_spin_unlock_irq(&ctx->lock);
> >  		goto retry;
> >  	}
> >
> >  	/*
> > -	 * The lock prevents that this context is scheduled in so we
> > -	 * can remove the event safely, if the call above did not
> > -	 * succeed.
> > > +	 * Since the task isn't running, it's safe to remove the event; our
> > +	 * holding the ctx->lock ensures the task won't get scheduled in.
> >  	 */
> > -	if (!list_empty(&event->group_entry))
> > -		list_del_event(event, ctx);
> > +	list_del_event(event, ctx);
> 
> this looks suspicious (the same for perf_install_in_context).
> 
> Unlike the IPI handler, this can see schedule-in-progress in any state.
> In particular, we can see rq->curr == next (so that task_curr() == F)
> before "prev" has even called perf_event_task_sched_out().
> 
> So we have to check ctx->is_active, or schedule() should change rq->curr
> after perf_event_task_sched_out().

I only considered current == next in that case, not current == prev; let
me undo some of those sched.c bits and put a comment.

> > @@ -753,13 +819,13 @@ void perf_event_disable(struct perf_event *event)
> > ...
> >  	 */
> >  	if (event->state == PERF_EVENT_STATE_ACTIVE) {
> >  		raw_spin_unlock_irq(&ctx->lock);
> > +		/*
> > +		 * Reload the task pointer, it might have been changed by
> > +		 * a concurrent perf_event_context_sched_out().
> > +		 */
> > +		task = ctx->task;
> >  		goto retry;
> 
> I am wondering why only perf_event_disable() needs this...

perf_event_enable() also has it, but you made me re-assess all that and I
think there's more to it.

perf_install_in_context() works on a ctx obtained by find_get_context();
that context is either new (uncloned) or existing, in which case it
called unclone_ctx(). So I was thinking there was no race with the ctx
flipping in perf_event_context_sched_out(); _however_, since it only
acquires ctx->mutex after calling unclone_ctx(), there is a race window
with perf_event_init_task().

This race we should fix with perf_pin_task_context().

perf_remove_from_context() seems OK though; we only call that on the
install ctx, which we should fix above, or on a dying uncloned context
(no race with fork because exit doesn't clone).

> Just curious, is this equally needed without this patch?

Yes, I think it is a pre-existing problem.
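A toy model of the refcount/pincount pairing being proposed may help
(assumed semantics, matching the patch that follows: the reference keeps
the context's memory alive until put_ctx(), while the pin keeps it from
being swapped or uncloned until perf_unpin_context(); simplified, not
kernel code):

#include <assert.h>
#include <stdio.h>

struct ctx {
	int refcount;
	int pin_count;
};

static void get_ctx(struct ctx *c)   { c->refcount++; }
static void put_ctx(struct ctx *c)   { assert(--c->refcount >= 0); }
static void pin_ctx(struct ctx *c)   { c->pin_count++; }
static void unpin_ctx(struct ctx *c) { assert(--c->pin_count >= 0); }

/* models find_get_context(): returns with a reference and a pin */
static struct ctx *find_get_context(struct ctx *c)
{
	get_ctx(c);
	pin_ctx(c);
	return c;
}

int main(void)
{
	struct ctx context = { .refcount = 1 };	/* the task's own reference */
	struct ctx *c = find_get_context(&context);

	/* ... perf_install_in_context() would run here, race-free ... */

	unpin_ctx(c);		/* install done: flipping is allowed again */
	printf("ref=%d pin=%d\n", c->refcount, c->pin_count);

	/* the reference stays with the event; an error path would
	 * also put_ctx(c) here. */
	return 0;
}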




* Re: [PATCH] perf: Cure task_oncpu_function_call() races
  2011-02-01 18:08                                                   ` Peter Zijlstra
@ 2011-02-01 18:18                                                     ` Peter Zijlstra
  2011-02-01 21:00                                                     ` Peter Zijlstra
  1 sibling, 0 replies; 91+ messages in thread
From: Peter Zijlstra @ 2011-02-01 18:18 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Frederic Weisbecker, Ingo Molnar, Alan Stern,
	Arnaldo Carvalho de Melo, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On Tue, 2011-02-01 at 19:08 +0100, Peter Zijlstra wrote:
> > > +static void perf_remove_from_context(struct perf_event *event)
> > >  {
> > > ...
> > >     raw_spin_lock_irq(&ctx->lock);
> > >     /*
> > > +    * If we failed to find a running task, but find it running now that
> > > +    * we've acquired the ctx->lock, retry.
> > >      */
> > > +   if (task_curr(task)) {
> > >             raw_spin_unlock_irq(&ctx->lock);
> > >             goto retry;
> > >     }
> > >
> > >     /*
> > > +    * Since the task isn't running, it's safe to remove the event; our
> > > +    * holding the ctx->lock ensures the task won't get scheduled in.
> > >      */
> > > +   list_del_event(event, ctx);
> > 
> > this looks suspicious (the same for perf_install_in_context).
> > 
> > Unlike the IPI handler, this can see schedule-in-progress in any state.
> > In particular, we can see rq->curr == next (so that task_curr() == F)
> > before "prev" has even called perf_event_task_sched_out().
> > 
> > So we have to check ctx->is_active, or schedule() should change rq->curr
> > after perf_event_task_sched_out().
> 
> I only considered current == next in that case, not current == prev; let
> me undo some of those sched.c bits and put a comment.

On second thought, your proposed ->is_active check seems to result in
much nicer code in sched.c. Let me think through that.



* Re: [PATCH] perf: Cure task_oncpu_function_call() races
  2011-02-01 18:08                                                   ` Peter Zijlstra
  2011-02-01 18:18                                                     ` Peter Zijlstra
@ 2011-02-01 21:00                                                     ` Peter Zijlstra
  1 sibling, 0 replies; 91+ messages in thread
From: Peter Zijlstra @ 2011-02-01 21:00 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Frederic Weisbecker, Ingo Molnar, Alan Stern,
	Arnaldo Carvalho de Melo, Paul Mackerras, Prasad, Roland McGrath,
	linux-kernel

On Tue, 2011-02-01 at 19:08 +0100, Peter Zijlstra wrote:
> perf_install_in_context() works on a ctx obtained by find_get_context();
> that context is either new (uncloned) or existing, in which case it
> called unclone_ctx(). So I was thinking there was no race with the ctx
> flipping in perf_event_context_sched_out(); _however_, since it only
> acquires ctx->mutex after calling unclone_ctx(), there is a race window
> with perf_event_init_task().
> 
> This race we should fix with perf_pin_task_context().

I came up with the below. I'll give it some runtime tomorrow; my brain
just gave up for today.

---
Index: linux-2.6/kernel/perf_event.c
===================================================================
--- linux-2.6.orig/kernel/perf_event.c
+++ linux-2.6/kernel/perf_event.c
@@ -327,7 +327,6 @@ static void perf_unpin_context(struct pe
 	raw_spin_lock_irqsave(&ctx->lock, flags);
 	--ctx->pin_count;
 	raw_spin_unlock_irqrestore(&ctx->lock, flags);
-	put_ctx(ctx);
 }
 
 /*
@@ -741,10 +740,10 @@ static void perf_remove_from_context(str
 
 	raw_spin_lock_irq(&ctx->lock);
 	/*
-	 * If we failed to find a running task, but find it running now that
-	 * we've acquired the ctx->lock, retry.
+	 * If we failed to find a running task, but find the context active now
+	 * that we've acquired the ctx->lock, retry.
 	 */
-	if (task_curr(task)) {
+	if (ctx->is_active) {
 		raw_spin_unlock_irq(&ctx->lock);
 		goto retry;
 	}
@@ -1104,10 +1103,10 @@ perf_install_in_context(struct perf_even
 
 	raw_spin_lock_irq(&ctx->lock);
 	/*
-	 * If we failed to find a running task, but find it running now that
-	 * we've acquired the ctx->lock, retry.
+	 * If we failed to find a running task, but find the context active now
+	 * that we've acquired the ctx->lock, retry.
 	 */
-	if (task_curr(task)) {
+	if (ctx->is_active) {
 		raw_spin_unlock_irq(&ctx->lock);
 		goto retry;
 	}
@@ -2278,6 +2277,9 @@ find_lively_task_by_vpid(pid_t vpid)
 
 }
 
+/*
+ * Returns a matching context with refcount and pincount.
+ */
 static struct perf_event_context *
 find_get_context(struct pmu *pmu, struct task_struct *task, int cpu)
 {
@@ -2302,6 +2304,7 @@ find_get_context(struct pmu *pmu, struct
 		cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
 		ctx = &cpuctx->ctx;
 		get_ctx(ctx);
+		++ctx->pin_count;
 
 		return ctx;
 	}
@@ -2315,6 +2318,7 @@ find_get_context(struct pmu *pmu, struct
 	ctx = perf_lock_task_context(task, ctxn, &flags);
 	if (ctx) {
 		unclone_ctx(ctx);
+		++ctx->pin_count;
 		raw_spin_unlock_irqrestore(&ctx->lock, flags);
 	}
 
@@ -6041,6 +6045,7 @@ SYSCALL_DEFINE5(perf_event_open,
 
 	perf_install_in_context(ctx, event, cpu);
 	++ctx->generation;
+	perf_unpin_context(ctx);
 	mutex_unlock(&ctx->mutex);
 
 	event->owner = current;
@@ -6066,6 +6071,7 @@ SYSCALL_DEFINE5(perf_event_open,
 	return event_fd;
 
 err_context:
+	perf_unpin_context(ctx);
 	put_ctx(ctx);
 err_alloc:
 	free_event(event);
@@ -6116,6 +6122,7 @@ perf_event_create_kernel_counter(struct
 	mutex_lock(&ctx->mutex);
 	perf_install_in_context(ctx, event, cpu);
 	++ctx->generation;
+	perf_unpin_context(ctx);
 	mutex_unlock(&ctx->mutex);
 
 	return event;
@@ -6591,6 +6598,7 @@ int perf_event_init_context(struct task_
 	mutex_unlock(&parent_ctx->mutex);
 
 	perf_unpin_context(parent_ctx);
+	put_ctx(parent_ctx);
 
 	return ret;
 }
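
Condensed from the hunks above, the pairing perf_event_open() ends up
with can be sketched as a runnable stub (print-only stand-ins for the
real kernel calls, so this only encodes the ordering):

#include <stdio.h>

static void mutex_lock(void)              { puts("mutex_lock(&ctx->mutex)"); }
static void mutex_unlock(void)            { puts("mutex_unlock(&ctx->mutex)"); }
static void perf_install_in_context(void) { puts("perf_install_in_context()"); }
static void perf_unpin_context(void)      { puts("perf_unpin_context(ctx)"); }
static void put_ctx(void)                 { puts("put_ctx(ctx)"); }

int main(void)
{
	int err = 0;	/* flip to nonzero to walk the error path */

	/* find_get_context() returned ctx holding a ref and a pin */
	if (err)
		goto err_context;

	mutex_lock();
	perf_install_in_context();
	perf_unpin_context();	/* pin dropped once the install is done */
	mutex_unlock();
	/* the reference stays until free_event()/put_ctx() */
	return 0;

err_context:
	perf_unpin_context();	/* error path drops the pin ... */
	put_ctx();		/* ... and the reference */
	return 1;
}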





Thread overview: 91+ messages
2010-11-08 14:56 Q: perf_event && task->ptrace_bps[] Oleg Nesterov
2010-11-08 14:57 ` Q: sys_perf_event_open() && PF_EXITING Oleg Nesterov
2011-01-19 18:21   ` [PATCH 0/2] Was: " Oleg Nesterov
2011-01-19 18:22     ` [PATCH 1/2] perf: fix find_get_context() vs perf_event_exit_task() race Oleg Nesterov
2011-01-19 18:49       ` Peter Zijlstra
2011-01-19 19:18       ` [tip:perf/urgent] perf: Fix " tip-bot for Oleg Nesterov
2011-01-21 15:29         ` Ingo Molnar
2011-01-21 15:53           ` Oleg Nesterov
2011-01-21 17:45             ` [PATCH] perf: perf_event_exit_task_context: s/rcu_dereference/rcu_dereference_raw/ Oleg Nesterov
2011-01-21 17:53               ` Oleg Nesterov
2011-01-21 21:50                 ` Paul E. McKenney
2011-01-24 11:51                   ` Oleg Nesterov
2011-01-21 22:12               ` [tip:perf/urgent] " tip-bot for Oleg Nesterov
2011-01-19 18:22     ` [PATCH 2/2] perf: fix perf_event_init_task()/perf_event_free_task() interaction Oleg Nesterov
2011-01-19 18:51       ` Peter Zijlstra
2011-01-19 19:19       ` [tip:perf/urgent] perf: Fix " tip-bot for Oleg Nesterov
2011-01-20 19:30     ` Q: perf_install_in_context/perf_event_enable are racy? Oleg Nesterov
2011-01-21 12:11       ` Peter Zijlstra
2011-01-21 13:03         ` Ingo Molnar
2011-01-21 13:39           ` Peter Zijlstra
2011-01-21 14:26             ` Oleg Nesterov
2011-01-21 15:05               ` Peter Zijlstra
2011-01-21 20:40                 ` Frederic Weisbecker
2011-01-24 11:42                   ` Oleg Nesterov
2011-01-26 17:53                     ` Oleg Nesterov
2011-01-26 18:49                       ` Oleg Nesterov
2011-01-26 18:51                         ` [PATCH] fix the theoretical task_cpu/task_curr problem in kick_process/task_oncpu_function_call Oleg Nesterov
2011-01-26 19:05                         ` Q: perf_install_in_context/perf_event_enable are racy? Peter Zijlstra
2011-01-26 19:33                           ` Peter Zijlstra
2011-01-26 19:38                             ` Peter Zijlstra
2011-01-26 21:19                             ` Oleg Nesterov
2011-01-26 21:33                               ` Oleg Nesterov
2011-01-27 10:32                                 ` Peter Zijlstra
2011-01-27 12:29                                   ` Peter Zijlstra
2011-01-27 16:10                                     ` Oleg Nesterov
2011-01-27 16:27                                       ` Peter Zijlstra
2011-01-27 16:59                                         ` Oleg Nesterov
2011-01-27 15:52                                   ` Oleg Nesterov
2011-01-27 13:14                       ` Peter Zijlstra
2011-01-27 14:28                         ` Peter Zijlstra
2011-01-27 14:58                           ` Peter Zijlstra
2011-01-27 16:57                         ` Oleg Nesterov
2011-01-27 17:11                           ` Peter Zijlstra
2011-01-27 22:18                             ` Oleg Nesterov
2011-01-28 11:52                               ` Peter Zijlstra
2011-01-28 14:57                                 ` Peter Zijlstra
2011-01-28 16:28                                   ` Oleg Nesterov
2011-01-28 18:11                                     ` Peter Zijlstra
2011-01-31 17:26                                       ` Oleg Nesterov
2011-01-31 18:23                                         ` Peter Zijlstra
2011-01-31 19:11                                           ` Oleg Nesterov
2011-01-31 19:29                                             ` Peter Zijlstra
2011-02-01 14:03                                               ` [PATCH] perf: Cure task_oncpu_function_call() races Peter Zijlstra
2011-02-01 17:27                                                 ` Oleg Nesterov
2011-02-01 18:08                                                   ` Peter Zijlstra
2011-02-01 18:18                                                     ` Peter Zijlstra
2011-02-01 21:00                                                     ` Peter Zijlstra
2010-11-08 14:57 ` Q: perf_event && event->owner Oleg Nesterov
2010-11-08 20:11   ` Frederic Weisbecker
2010-11-08 20:41     ` Peter Zijlstra
2010-11-09 16:18       ` Oleg Nesterov
2010-11-09 15:57     ` Oleg Nesterov
2010-11-09 16:56       ` Peter Zijlstra
2010-11-09 16:58         ` Oleg Nesterov
2010-11-09 17:07           ` Peter Zijlstra
2010-11-09 17:42             ` Oleg Nesterov
2010-11-09 18:01               ` Peter Zijlstra
2010-11-09 18:57                 ` Oleg Nesterov
2010-11-09 19:16                   ` Peter Zijlstra
2010-11-10 15:17                   ` Peter Zijlstra
2010-11-10 15:44                     ` Oleg Nesterov
2010-11-12 15:48                       ` Peter Zijlstra
2010-11-12 18:49                         ` Oleg Nesterov
2010-11-18 14:09                         ` [tip:perf/urgent] perf: Fix owner-list vs exit tip-bot for Peter Zijlstra
2010-11-08 18:41 ` Q: perf_event && task->ptrace_bps[] Frederic Weisbecker
2010-11-08 19:18   ` Oleg Nesterov
2011-01-17 23:58     ` Frederic Weisbecker
2011-01-18  1:16       ` Roland McGrath
2011-01-17 20:34 ` Oleg Nesterov
2011-01-17 20:52   ` Peter Zijlstra
2011-01-17 21:01     ` Frederic Weisbecker
2011-01-18 16:09     ` [PATCH 0/2] perf: event->cpu checking fixes Oleg Nesterov
2011-01-18 16:10       ` [PATCH 1/2] perf: find_get_context: fix the per-cpu-counter check Oleg Nesterov
2011-01-18 19:06         ` [tip:perf/urgent] perf: Find_get_context: " tip-bot for Oleg Nesterov
2011-01-18 16:10       ` [PATCH 2/2] perf: validate cpu early in perf_event_alloc() Oleg Nesterov
2011-01-18 19:07         ` [tip:perf/urgent] perf: Validate " tip-bot for Oleg Nesterov
2011-01-18 18:42   ` Q: perf_event && task->ptrace_bps[] Frederic Weisbecker
2011-01-19 15:37     ` Oleg Nesterov
2011-01-19 20:05       ` Frederic Weisbecker
2011-01-20 17:28         ` Oleg Nesterov
2011-01-28 17:41           ` Frederic Weisbecker
