* [BUG] signal: sighand unprotected when accessed by /proc
@ 2014-06-03 17:02 Steven Rostedt
  2014-06-03 17:26 ` Oleg Nesterov
  0 siblings, 1 reply; 63+ messages in thread
From: Steven Rostedt @ 2014-06-03 17:02 UTC (permalink / raw)
  To: LKML
  Cc: Oleg Nesterov, Thomas Gleixner, Peter Zijlstra, Andrew Morton,
	Ingo Molnar, Clark Williams, Linus Torvalds

We were able to trigger this bug in -rt, and by review, I'm thinking
that this could very well be a mainline bug too. I had our QA team add
a trace patch to the kernel to prove my analysis, and it did.

Here's the patch:

 http://rostedt.homelinux.com/private/sighand-trace.patch

Let me try to explain the bug:


	CPU0				CPU1
	----				----
 [ read of /proc/<pid>/stat ]
  get_task_struct();
  [...]
				  [ <pid> exits ]
				  [ parent does wait on <pid> ]
				  wait_task_zombie()
				    release_task()
				      proc_flush_task()
				      /* the above removes new access
				         to the /proc system */
				      __exit_signal()
				        __cleanup_sighand(sighand);
					  atomic_dec_and_test(sighand->count);
  do_task_stat()
    lock_task_sighand(task);
      sighand = rcu_dereference(tsk->sighand);

					    kmem_cache_free(sighand);

      if (sighand != NULL)
        spin_lock(sighand->siglock);

       ** BOOM! use after free **


Seems there is no protection between reading the sighand from proc and
freeing it. The sighand->count is not updated, and the sighand is not
freed via rcu.  Not that it would matter, since the use of sighand is not
fully protected by rcu anyway.

We take a lot of care to free tsk->signal but we are a bit sloppy in
protecting sighand.

Here's the trace from our last crash:

      ps-23716   1....... 11201242344176: __lock_task_sighand: lock sighand=ffff8801022aee80 from sh:23718
    make-23672   1.....11 11201242409218: release_task: freeing sighand ffff8801022aee80 for sh:23718
    make-23672   1.....11 11201242409980: __cleanup_sighand: free sighand ffff8801022aee80
    make-23672   1.....11 11201242410380: __cleanup_sighand: sighand ffff8801022aee80 freed
      ps-23716   1d...211 11201243007052: __list_del_entry: BUG HIT!

Let me explain the differences between -rt and mainline here.

One, the spinlock in -rt is an rtmutex. The list_del_entry() bug is the
task trying to remove itself from sighand->lock->wait_list. As the lock
has been freed, the list head of the rtmutex is corrupted.

Another difference here is that all this ran on CPU 1. In mainline,
that could not happen because the spinlock prevents preemption. But
still, I don't see anything that could stop this from happening on
multiple CPUs.

Again, the bug does exist in mainline. I believe the only reason that
we do not trigger it (often) is that there's a lot that the reading of
the proc needs to do between opening the proc files and then reading
the sighand->lock. Much more than the clean up path between
removing /proc and freeing sighand. But with an interrupt going off at
the right time slowing down the fast path, we can still hit this race.

This is the beauty of the -rt patch. It can simulate hard to hit races
nicely, due to the preemptive nature of the spinlocks.

I do not see any trivial fix for this. I figured I'd show this to the
signal gurus before trying to redesign the way sighand gets freed or
used by /proc.

-- Steve

  


* Re: [BUG] signal: sighand unprotected when accessed by /proc
  2014-06-03 17:02 [BUG] signal: sighand unprotected when accessed by /proc Steven Rostedt
@ 2014-06-03 17:26 ` Oleg Nesterov
  2014-06-03 18:03   ` Linus Torvalds
  2014-06-03 18:05   ` Steven Rostedt
  0 siblings, 2 replies; 63+ messages in thread
From: Oleg Nesterov @ 2014-06-03 17:26 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Thomas Gleixner, Peter Zijlstra, Andrew Morton,
	Ingo Molnar, Clark Williams, Linus Torvalds

On 06/03, Steven Rostedt wrote:
>
> We were able to trigger this bug in -rt, and by review, I'm thinking
> that this could very well be a mainline bug too. I had our QA team add
> a trace patch to the kernel to prove my analysis, and it did.
>
> Here's the patch:
>
>  http://rostedt.homelinux.com/private/sighand-trace.patch
>
> Let me try to explain the bug:
>
>
> 	CPU0				CPU1
> 	----				----
>  [ read of /proc/<pid>/stat ]
>   get_task_struct();
>   [...]
> 				  [ <pid> exits ]
> 				  [ parent does wait on <pid> ]
> 				  wait_task_zombie()
> 				    release_task()
> 				      proc_flush_task()
> 				      /* the above removes new access
> 				         to the /proc system */
> 				      __exit_signal()
> 				        __cleanup_sighand(sighand);
> 					  atomic_dec_and_test(sighand->count);
>   do_task_stat()
>     lock_task_sighand(task);
>       sighand = rcu_dereference(tsk->sighand);
>
> 					    kmem_cache_free(sighand);
>
>       if (sighand != NULL)
>         spin_lock(sighand->siglock);
>
>        ** BOOM! use after free **

Yes, ->sighand can be already freed at this point, but this should be
fine because sighand_cachep is SLAB_DESTROY_BY_RCU.
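
For reference, the cache is created with that flag and with a constructor,
roughly like this (kernel/fork.c, flag list abbreviated):

	sighand_cachep = kmem_cache_create("sighand_cache",
			sizeof(struct sighand_struct), 0,
			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_DESTROY_BY_RCU,
			sighand_ctor);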

That is why lock_task_sighand() does rcu_read_lock() and re-checks
sighand == tsk->sighand after it takes ->siglock. It is fine if it was
already freed or even reallocated via kmem_cache_alloc(sighand_cachep).
We only need to ensure that (SLAB_DESTROY_BY_RCU should ensure this)
this memory won't be returned to the system, so this piece of memory must
be "struct sighand" with the properly initialized ->siglock until
rcu_read_unlock().

> Seems there is no protection between reading the sighand from proc and
> freeing it. The sighand->count is not updated, and the sighand is not
> freed via rcu.

See above.

> One, the spinlock in -rt is an rtmutex. The list_del_entry() bug is the
> task trying to remove itself from sighand->lock->wait_list. As the lock
> has been freed, the list head of the rtmutex is corrupted.

looks like, SLAB_DESTROY_BY_RCU logic is broken?

Oleg.



* Re: [BUG] signal: sighand unprotected when accessed by /proc
  2014-06-03 17:26 ` Oleg Nesterov
@ 2014-06-03 18:03   ` Linus Torvalds
  2014-06-03 20:01     ` Oleg Nesterov
  2014-06-03 20:05     ` [BUG] signal: sighand unprotected when accessed by /proc Steven Rostedt
  2014-06-03 18:05   ` Steven Rostedt
  1 sibling, 2 replies; 63+ messages in thread
From: Linus Torvalds @ 2014-06-03 18:03 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Steven Rostedt, LKML, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, Ingo Molnar, Clark Williams

On Tue, Jun 3, 2014 at 10:26 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>
> looks like, SLAB_DESTROY_BY_RCU logic is broken?

I haven't looked at the code yet, but SLAB_DESTROY_BY_RCU can be
subtle and very dangerous.

The danger is that the *slab* itself is free'd by RCU, but individual
allocations can (and do) get re-used FOR THE SAME OBJECT TYPE without
waiting for RCU!

This is subtle. It means that most people who think that "it's free'd
by RCU" get it wrong. Because individual allocations really aren't at
all RCU-free'd, it's just that the underlying memory is guaranteed to
not change type or be entirely thrown away until after a RCU grace
period.

That means that the only thing you can use rcu-delayed are:
 (a) pure optimistic reads
 (b) idempotent things that are initialized by the constructor, and
_not_ by the allocation.

Nothing else.

An example of (b) is having a lock that is initialized by the
constructor, and that everybody else uses properly: you may have
delayed people locking/unlocking it, but as long as they don't screw
anything else up they won't break even if the data structure was
re-used by a new allocation.
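
To make (b) concrete, here is a minimal sketch of the lookup pattern this
allows, with a hypothetical object type whose ->lock is initialized only by
the cache constructor and whose free path takes ->lock before clearing the
container's pointer:

	struct obj *obj_lock(struct container *c)
	{
		struct obj *obj;

		rcu_read_lock();
		obj = rcu_dereference(c->obj);
		if (obj) {
			/*
			 * Safe even if obj was already freed and reused: the
			 * memory stays a struct obj with a valid ->lock until
			 * rcu_read_unlock().
			 */
			spin_lock(&obj->lock);
			if (obj != c->obj) {
				/* Lost the race; it was freed (maybe reused). */
				spin_unlock(&obj->lock);
				obj = NULL;
			}
		}
		rcu_read_unlock();
		return obj;	/* NULL, or the object with ->lock held */
	}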

Without looking at the code, it sounds like somebody may be doing things
to "sighand->lock->wait_list" that they shouldn't do. We've had cases
like that before, and most of them have been changed to *not* use
SLAB_DESTROY_BY_RCU, and instead make each individual allocation be
RCU-free'd (which is a lot simpler to think about, because then you
don't have the whole re-use issue).

Of course, it could easily be lack of any RCU protection too. As
mentioned, I didn't really check the code. I just wanted to clarify
this subtlety, because while I think Oleg knows about it, maybe others
didn't quite catch that subtlety.

And this could easily be an RT issue, if the RT code does some
re-initialization of the rtmutex that replaces the spinlock we have.

               Linus


* Re: [BUG] signal: sighand unprotected when accessed by /proc
  2014-06-03 17:26 ` Oleg Nesterov
  2014-06-03 18:03   ` Linus Torvalds
@ 2014-06-03 18:05   ` Steven Rostedt
  2014-06-03 19:25     ` Oleg Nesterov
  1 sibling, 1 reply; 63+ messages in thread
From: Steven Rostedt @ 2014-06-03 18:05 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: LKML, Thomas Gleixner, Peter Zijlstra, Andrew Morton,
	Ingo Molnar, Clark Williams, Linus Torvalds

On Tue, 3 Jun 2014 19:26:32 +0200
Oleg Nesterov <oleg@redhat.com> wrote:

> On 06/03, Steven Rostedt wrote:
> >
> > We were able to trigger this bug in -rt, and by review, I'm thinking
> > that this could very well be a mainline bug too. I had our QA team add
> > a trace patch to the kernel to prove my analysis, and it did.
> >
> > Here's the patch:
> >
> >  http://rostedt.homelinux.com/private/sighand-trace.patch
> >
> > Let me try to explain the bug:
> >
> >
> > 	CPU0				CPU1
> > 	----				----
> >  [ read of /proc/<pid>/stat ]
> >   get_task_struct();
> >   [...]
> > 				  [ <pid> exits ]
> > 				  [ parent does wait on <pid> ]
> > 				  wait_task_zombie()
> > 				    release_task()
> > 				      proc_flush_task()
> > 				      /* the above removes new access
> > 				         to the /proc system */
> > 				      __exit_signal()
> > 				        __cleanup_sighand(sighand);
> > 					  atomic_dec_and_test(sighand->count);
> >   do_task_stat()
> >     lock_task_sighand(task);
> >       sighand = rcu_dereference(tsk->sighand);
> >
> > 					    kmem_cache_free(sighand);
> >
> >       if (sighand != NULL)
> >         spin_lock(sighand->siglock);
> >
> >        ** BOOM! use after free **
> 
> Yes, ->sighand can be already freed at this point, but this should be
> fine because sighand_cachep is SLAB_DESTROY_BY_RCU.

Ah, I didn't notice that. This makes it even more bizarre.

You know, this code could use some comments. I may send you a patch,
because that __lock_task_sighand() is doing a lot of subtle things and
there's not a single comment explaining it :-(



> 
> That is why lock_task_sighand() does rcu_read_lock() and re-checks
> sighand == tsk->sighand after it takes ->siglock. It is fine if it was
> already freed or even reallocated via kmem_cache_alloc(sighand_cachep).
> We only need to ensure that (SLAB_DESTROY_BY_RCU should ensure this)
> this memory won't be returned to the system, so this piece of memory must
> be "struct sighand" with the properly initialized ->siglock until
> rcu_read_unlock().

OK, this makes __lock_task_sighand() make some more sense.

> 
> > Seems there is no protection between reading the sighand from proc and
> > freeing it. The sighand->count is not updated, and the sighand is not
> > freed via rcu.
> 
> See above.
> 
> > One, the spinlock in -rt is an rtmutex. The list_del_entry() bug is the
> > task trying to remove itself from sighand->lock->wait_list. As the lock
> > has been freed, the list head of the rtmutex is corrupted.
> 
> looks like, SLAB_DESTROY_BY_RCU logic is broken?

Could be. I'll look to see if we didn't break something.

Thanks!

-- Steve


* Re: [BUG] signal: sighand unprotected when accessed by /proc
  2014-06-03 18:05   ` Steven Rostedt
@ 2014-06-03 19:25     ` Oleg Nesterov
  2014-06-04  1:16       ` Steven Rostedt
  0 siblings, 1 reply; 63+ messages in thread
From: Oleg Nesterov @ 2014-06-03 19:25 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Thomas Gleixner, Peter Zijlstra, Andrew Morton,
	Ingo Molnar, Clark Williams, Linus Torvalds

On 06/03, Steven Rostedt wrote:
>
> You know, this code could use some comments. I may send you a patch,
> because that __lock_task_sighand() is doing a lot of subtle things and
> there's not a single comment explaining it :-(

Yes, agreed. Not only SLAB_DESTROY_BY_RCU is not obvious, local_irq_save()
is not clear at all. The latter already has a doc patch from Paul, I'll try
to add more comments on top once I see that patch in Linus's tree.

But I would be happy if you send the patch ;)

And this reminds me... I still think that __lock_task_sighand() should be
de-uglified. I already sent the patch, probably I'll resend it.

Oleg.



* Re: [BUG] signal: sighand unprotected when accessed by /proc
  2014-06-03 18:03   ` Linus Torvalds
@ 2014-06-03 20:01     ` Oleg Nesterov
  2014-06-03 20:03       ` Oleg Nesterov
  2014-06-06 20:33       ` Paul E. McKenney
  2014-06-03 20:05     ` [BUG] signal: sighand unprotected when accessed by /proc Steven Rostedt
  1 sibling, 2 replies; 63+ messages in thread
From: Oleg Nesterov @ 2014-06-03 20:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, LKML, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, Ingo Molnar, Clark Williams

On 06/03, Linus Torvalds wrote:
>
> On Tue, Jun 3, 2014 at 10:26 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > looks like, SLAB_DESTROY_BY_RCU logic is broken?
>
> I haven't looked at the code yet, but SLAB_DESTROY_BY_RCU can be
> subtle and very dangerous.
>
> The danger is that the *slab* itself is free'd by RCU, but individual
> allocations can (and do) get re-used FOR THE SAME OBJECT TYPE without
> waiting for RCU!
>
> This is subtle. It means that most people who think that "it's free'd
> by RCU" get it wrong. Because individual allocations really aren't at
> all RCU-free'd, it's just that the underlying memory is guaranteed to
> not change type or be entirely thrown away until after a RCU grace
> period.

Yes, exactly. And unless you use current->sighand (which is obviously
stable) you need lock_task_sighand() which relies on ->siglock initialized
by sighand_ctor().
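
For reference, the constructor only sets up the state that must survive
reuse, roughly (kernel/fork.c):

	static void sighand_ctor(void *data)
	{
		struct sighand_struct *sighand = data;

		spin_lock_init(&sighand->siglock);
		init_waitqueue_head(&sighand->signalfd_wqh);
	}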

> Without looking at the code, it sounds like somebody may be doing things
> to "sighand->lock->wait_list" that they shouldn't do. We've had cases
> like that before, and most of them have been changed to *not* use
> SLAB_DESTROY_BY_RCU, and instead make each individual allocation be
> RCU-free'd (which is a lot simpler to think about, because then you
> don't have the whole re-use issue).

Sure, we only need to change __cleanup_sighand() to use call_rcu().
But I am not sure this makes sense; I do not think it would make
anything simpler or clearer.
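
(Such a change would look roughly like the sketch below; this is only an
illustration and assumes a struct rcu_head is added to struct sighand_struct
and that the cache stops using SLAB_DESTROY_BY_RCU.)

	static void sighand_free_rcu(struct rcu_head *rcu)
	{
		struct sighand_struct *sighand =
			container_of(rcu, struct sighand_struct, rcu);

		kmem_cache_free(sighand_cachep, sighand);
	}

	void __cleanup_sighand(struct sighand_struct *sighand)
	{
		if (atomic_dec_and_test(&sighand->count)) {
			signalfd_cleanup(sighand);
			call_rcu(&sighand->rcu, sighand_free_rcu);
		}
	}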

> And this could easily be an RT issue, if the RT code does some
> re-initialization of the rtmutex that replaces the spinlock we have.

Unlikely... this should be done by sighand_ctor() anyway.

I'll try to recheck rt_mutex_unlock() tomorrow. _Perhaps_ rcu_read_unlock()
should be shifted from lock_task_sighand() to unlock_task_sighand() to
ensure that rt_mutex_unlock() does nothing with this memory after it
makes another lock/unlock possible.

But if we need this (currently I do not think so), this doesn't depend on
SLAB_DESTROY_BY_RCU. And, at first glance, in this case rcu_read_unlock_special()
might be wrong too.

Oleg.



* Re: [BUG] signal: sighand unprotected when accessed by /proc
  2014-06-03 20:01     ` Oleg Nesterov
@ 2014-06-03 20:03       ` Oleg Nesterov
  2014-06-06 20:33       ` Paul E. McKenney
  1 sibling, 0 replies; 63+ messages in thread
From: Oleg Nesterov @ 2014-06-03 20:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, LKML, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, Ingo Molnar, Clark Williams

Damn, sorry for noise...

On 06/03, Oleg Nesterov wrote:
>
> I'll try to recheck rt_mutex_unlock() tomorrow. _Perhaps_ rcu_read_unlock()
> should be shifted from lock_task_sighand() to unlock_task_sighand() to
> ensure that rt_mutex_unlock() does nothing with this memory after it
> makes another lock/unlock possible.

I meant, we need to ensure it does nothing with this memory, otherwise we
need to delay rcu_read_unlock() till unlock_task_sighand().

Oleg.



* Re: [BUG] signal: sighand unprotected when accessed by /proc
  2014-06-03 18:03   ` Linus Torvalds
  2014-06-03 20:01     ` Oleg Nesterov
@ 2014-06-03 20:05     ` Steven Rostedt
  2014-06-03 20:09       ` Oleg Nesterov
  1 sibling, 1 reply; 63+ messages in thread
From: Steven Rostedt @ 2014-06-03 20:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, LKML, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, Ingo Molnar, Clark Williams

On Tue, 3 Jun 2014 11:03:49 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:
 
> An example of (b) is having a lock that is initialized by the
> constructor, and that everybody else uses properly: you may have
> delayed people locking/unlocking it, but as long as they don't screw
> anything else up they won't break even if the data structure was
> re-used by a new allocation.

I'm trying to wrap my head around this.

I guess it comes down to this code here in __lock_task_sighand:

	for (;;) {
		local_irq_save(*flags);
		rcu_read_lock();
		sighand = rcu_dereference(tsk->sighand);
		if (unlikely(sighand == NULL)) {
			rcu_read_unlock();
			local_irq_restore(*flags);
			break;
		}

   < Here's the critical point! >

		spin_lock(&sighand->siglock);
		if (likely(sighand == tsk->sighand)) {
			rcu_read_unlock();
			break;
		}
		spin_unlock(&sighand->siglock);
		rcu_read_unlock();
		local_irq_restore(*flags);
	}


We get the sighand with an rcu_dereference() in an rcu_read_lock()
protected section, as the page it's on in the slab cache can be freed
via rcu; thus we want to make sure what we get is a sighand and not
anything else. RCU protects us here.

Now, that sighand can be freed and reallocated as another sighand,
but because it's still a sighand, the lock we are spinning on is some
sighand->lock pointer, whether it is ours or a new one.

Thus when we hit spin_lock(&sighand->siglock) it can be the sighand for
the task we want or reallocated and initialized for a new sighand, and
we are now suddenly spinning on a new lock which we would probably take.

In that case, if we take it, then sighand != tsk->sighand and we
release the lock and try again. Wow, that's very subtle :-p  The lock
was switched out right from under us.


> 
> Without looking at the code, it sounds like somebody may be doing things
> to "sighand->lock->wait_list" that they shouldn't do. We've had cases
> like that before, and most of them have been changed to *not* use
> SLAB_DESTROY_BY_RCU, and instead make each individual allocation be
> RCU-free'd (which is a lot simpler to think about, because then you
> don't have the whole re-use issue).

Yes, this is exactly what happened. OK, this is an -rt only bug, not a
mainline one :-/

When we convert the spin_lock into a rtmutex, when we hit the lock and
it was contended (the task was in the process of exiting and it takes
the lock to set tsk->sighand to NULL), instead of spinning, the task
adds itself to the lock->wait_list and goes to sleep.

Now, if that lock is released and reused (I didn't trace other tasks
allocating these locks), it reinitializes the lock->wait_list. But when
the cleanup code released the lock, it woke up the task that was in
__lock_task_sighand() and that task now tries to take the lock. As the
lock is now free, it grabs it and removes itself from the
lock->wait_list. But because the lock->wait_list has been
re-initialized, it doesn't find itself on the list and we hit the bug.
This is consistent with what we saw: we never actually crashed, we just
triggered the list.h debug warnings.

> 
> Of course, it could easily be lack of any RCU protection too. As
> mentioned, I didn't really check the code. I just wanted to clarify
> this subtlety, because while I think Oleg knows about it, maybe others
> didn't quite catch that subtlety.

At least for -rt, it seems we have to convert it that way. Anything
more complex than a simple spinning lock will trigger this bug. Can't
use normal mutexes either.

> 
> And this could easily be an RT issue, if the RT code does some
> re-initialization of the rtmutex that replaces the spinlock we have.

Yep, I believe this currently is an RT issue. But I would suggest that
sighand be freed by a normal rcu call. This reuse looks really subtle
and prone to other bugs.

Thanks for the explanation!

-- Steve


* Re: [BUG] signal: sighand unprotected when accessed by /proc
  2014-06-03 20:05     ` [BUG] signal: sighand unprotected when accessed by /proc Steven Rostedt
@ 2014-06-03 20:09       ` Oleg Nesterov
  2014-06-03 20:15         ` Steven Rostedt
  2014-06-03 20:25         ` Steven Rostedt
  0 siblings, 2 replies; 63+ messages in thread
From: Oleg Nesterov @ 2014-06-03 20:09 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, LKML, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, Ingo Molnar, Clark Williams

Steven, I am already sleeping, probably I just need to re-read your
email tomorrow, but...

On 06/03, Steven Rostedt wrote:
>
> When we convert the spin_lock into a rtmutex, when we hit the lock and
> it was contended (the task was in the process of exiting and it takes
> the lock to set tsk->sighand to NULL), instead of spinning, the task
> adds itself to the lock->wait_list and goes to sleep.

This is clear,

> Now, if that lock is released and reused (I didn't trace other tasks
> allocating these locks), it reinitializes the lock->wait_list.

How? From where? This should be done by sighand_ctor() only?

Oleg.



* Re: [BUG] signal: sighand unprotected when accessed by /proc
  2014-06-03 20:09       ` Oleg Nesterov
@ 2014-06-03 20:15         ` Steven Rostedt
  2014-06-03 20:25         ` Steven Rostedt
  1 sibling, 0 replies; 63+ messages in thread
From: Steven Rostedt @ 2014-06-03 20:15 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, LKML, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, Ingo Molnar, Clark Williams,
	Luis Claudio R. Goncalves

On Tue, 3 Jun 2014 22:09:38 +0200
Oleg Nesterov <oleg@redhat.com> wrote:


> > Now, if that lock is released and reused (I didn't trace other tasks
> > allocating these locks), it reinitializes the lock->wait_list.
> 
> How? From where? This should be done by sighand_ctor() only?

heh, the more I look at this code, the deeper in the rabbit hole I go.

OK, I didn't understand this in the first place. I see that
sighand_ctor() is only called when the entire slab is created, correct?

That means, when you allocate a new sighand, it doesn't get
re-initialized, and the sighand->lock should still be the old lock.

That makes sense. I'll need to look at this some more :-/  Maybe
there's another bug that I'm missing and I'm barking up the wrong tree.

-- Steve


* Re: [BUG] signal: sighand unprotected when accessed by /proc
  2014-06-03 20:09       ` Oleg Nesterov
  2014-06-03 20:15         ` Steven Rostedt
@ 2014-06-03 20:25         ` Steven Rostedt
  2014-06-03 21:12           ` Thomas Gleixner
  1 sibling, 1 reply; 63+ messages in thread
From: Steven Rostedt @ 2014-06-03 20:25 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, LKML, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, Ingo Molnar, Clark Williams,
	Luis Claudio R. Goncalves

On Tue, 3 Jun 2014 22:09:38 +0200
Oleg Nesterov <oleg@redhat.com> wrote:
 
> > Now, if that lock is released and reused (I didn't trace other tasks
> > allocating these locks), it reinitializes the lock->wait_list.
> 
> How? From where? This should be done by sighand_ctor() only?

This definitely looks like an -rt-only bug, and an obvious one at
that :-p

Looking in mm/slub.c: slab_alloc_node() we have this:

        if (unlikely(gfpflags & __GFP_ZERO) && object)
                memset(object, 0, s->object_size);
#ifdef CONFIG_PREEMPT_RT_FULL
        if (unlikely(s->ctor) && object)
                s->ctor(object);
#endif  

        slab_post_alloc_hook(s, gfpflags, object);

        return object;
}


We call the ctor on the object when it is allocated, not when the page
is created.

Doh! I guess we now know why we shouldn't do that.
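
For contrast, mainline slub only runs the constructor when an object is
first set up as part of a newly allocated slab page, roughly (mm/slub.c,
called from new_slab(); details vary by version):

	static void setup_object(struct kmem_cache *s, struct page *page,
					void *object)
	{
		setup_object_debug(s, page, object);
		if (unlikely(s->ctor))
			s->ctor(object);
	}

So a re-allocated object keeps whatever state the constructor (and later
users) left in it, which is exactly what the SLAB_DESTROY_BY_RCU scheme
relies on.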

-- Steve


* Re: [BUG] signal: sighand unprotected when accessed by /proc
  2014-06-03 20:25         ` Steven Rostedt
@ 2014-06-03 21:12           ` Thomas Gleixner
  0 siblings, 0 replies; 63+ messages in thread
From: Thomas Gleixner @ 2014-06-03 21:12 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Oleg Nesterov, Linus Torvalds, LKML, Peter Zijlstra,
	Andrew Morton, Ingo Molnar, Clark Williams,
	Luis Claudio R. Goncalves



On Tue, 3 Jun 2014, Steven Rostedt wrote:

> On Tue, 3 Jun 2014 22:09:38 +0200
> Oleg Nesterov <oleg@redhat.com> wrote:
>  
> > > Now, if that lock is released and reused (I didn't trace other tasks
> > > allocating these locks), it reinitializes the lock->wait_list.
> > 
> > How? From where? This should be done by sighand_ctor() only?
> 
> This definitely looks like an -rt-only bug, and an obvious one at
> that :-p
> 
> Looking in mm/slub.c: slab_alloc_node() we have this:
> 
>         if (unlikely(gfpflags & __GFP_ZERO) && object)
>                 memset(object, 0, s->object_size);
> #ifdef CONFIG_PREEMPT_RT_FULL
>         if (unlikely(s->ctor) && object)
>                 s->ctor(object);
> #endif  

Looks like the usual git/quilt default artifact.
 
That's why I have 

       QUILT_PATCH_OPTS="--fuzz=0"

in my .quiltrc 

Thanks,

	tglx


* Re: [BUG] signal: sighand unprotected when accessed by /proc
  2014-06-03 19:25     ` Oleg Nesterov
@ 2014-06-04  1:16       ` Steven Rostedt
  2014-06-04 16:31         ` Oleg Nesterov
  0 siblings, 1 reply; 63+ messages in thread
From: Steven Rostedt @ 2014-06-04  1:16 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: LKML, Thomas Gleixner, Peter Zijlstra, Andrew Morton,
	Ingo Molnar, Clark Williams, Linus Torvalds

On Tue, 3 Jun 2014 21:25:25 +0200
Oleg Nesterov <oleg@redhat.com> wrote:

> On 06/03, Steven Rostedt wrote:
> >
> > You know, this code could use some comments. I may send you a patch,
> > because that __lock_task_sighand() is doing a lot of subtle things and
> > there's not a single comment explaining it :-(
> 
> Yes, agreed. Not only SLAB_DESTROY_BY_RCU is not obvious, local_irq_save()
> is not clear at all. The latter already has a doc patch from Paul, I'll try
> to add more comments on top once I see that patch in Linus's tree.
> 
> But I would be happy if you send the patch ;)
> 
> And this reminds me... I still think that __lock_task_sighand() should be
> de-uglified. I already sent the patch, probably I'll resend it.

I'd be happy to document the hell out of that function, but it sounds
like you have some updates to it. Is there a git repo somewhere with
your latest that I can base a patch against?

-- Steve


* Re: [BUG] signal: sighand unprotected when accessed by /proc
  2014-06-04  1:16       ` Steven Rostedt
@ 2014-06-04 16:31         ` Oleg Nesterov
  0 siblings, 0 replies; 63+ messages in thread
From: Oleg Nesterov @ 2014-06-04 16:31 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Thomas Gleixner, Peter Zijlstra, Andrew Morton,
	Ingo Molnar, Clark Williams, Linus Torvalds

On 06/03, Steven Rostedt wrote:
>
> On Tue, 3 Jun 2014 21:25:25 +0200
> Oleg Nesterov <oleg@redhat.com> wrote:
>
> > On 06/03, Steven Rostedt wrote:
> > >
> > > You know, this code could use some comments. I may send you a patch,
> > > because that __lock_task_sighand() is doing a lot of subtle things and
> > > there's not a single comment explaining it :-(
> >
> > Yes, agreed. Not only SLAB_DESTROY_BY_RCU is not obvious, local_irq_save()
> > is not clear at all. The latter already has a doc patch from Paul, I'll try
> > to add more comments on top once I see that patch in Linus's tree.
> >
> > But I would be happy if you send the patch ;)
> >
> > And this reminds me... I still think that __lock_task_sighand() should be
> > de-uglified. I already sent the patch, probably I'll resend it.
>
> I'd be happy to document the hell out of that function, but it sounds
> like you have some updates to it.

Paul has, I don't know where ;) That patch adds the comments to explain the
mysterious local_irq_disable() at the start. On top of another patch which
documents the subtle problems with preemption and rcu_read_unlock().

As for the cleanup I have, it is nowhere and I'll resend it later.

Oleg.



* Re: [BUG] signal: sighand unprotected when accessed by /proc
  2014-06-03 20:01     ` Oleg Nesterov
  2014-06-03 20:03       ` Oleg Nesterov
@ 2014-06-06 20:33       ` Paul E. McKenney
  2014-06-08 13:07         ` safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) Oleg Nesterov
  1 sibling, 1 reply; 63+ messages in thread
From: Paul E. McKenney @ 2014-06-06 20:33 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Steven Rostedt, LKML, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On Tue, Jun 03, 2014 at 10:01:25PM +0200, Oleg Nesterov wrote:
> On 06/03, Linus Torvalds wrote:
> >
> > On Tue, Jun 3, 2014 at 10:26 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> > >
> > > looks like, SLAB_DESTROY_BY_RCU logic is broken?
> >
> > I haven't looked at the code yet, but SLAB_DESTROY_BY_RCU can be
> > subtle and very dangerous.
> >
> > The danger is that the *slab* itself is free'd by RCU, but individual
> > allocations can (and do) get re-used FOR THE SAME OBJECT TYPE without
> > waiting for RCU!
> >
> > This is subtle. It means that most people who think that "it's free'd
> > by RCU" get it wrong. Because individual allocations really aren't at
> > all RCU-free'd, it's just that the underlying memory is guaranteed to
> > not change type or be entirely thrown away until after a RCU grace
> > period.
> 
> Yes, exactly. And unless you use current->sighand (which is obviously
> stable) you need lock_task_sighand() which relies on ->siglock initialized
> by sighand_ctor().
> 
> > Without looking at the code, it sounds like somebody may be doing things
> > to "sighand->lock->wait_list" that they shouldn't do. We've had cases
> > like that before, and most of them have been changed to *not* use
> > SLAB_DESTROY_BY_RCU, and instead make each individual allocation be
> > RCU-free'd (which is a lot simpler to think about, because then you
> > don't have the whole re-use issue).
> 
> Sure, we only need to change __cleanup_sighand() to use call_rcu().
> But I am not sure this makes sense; I do not think it would make
> anything simpler or clearer.
> 
> > And this could easily be an RT issue, if the RT code does some
> > re-initialization of the rtmutex that replaces the spinlock we have.
> 
> Unlikely... this should be done by sighand_ctor() anyway.
> 
> I'll try to recheck rt_mutex_unlock() tomorrow. _Perhaps_ rcu_read_unlock()
> should be shifted from lock_task_sighand() to unlock_task_sighand() to
> ensure that rt_mutex_unlock() does nothing with this memory after it
> makes another lock/unlock possible.
> 
> But if we need this (currently I do not think so), this doesn't depend on
> SLAB_DESTROY_BY_RCU. And, at first glance, in this case rcu_read_unlock_special()
> might be wrong too.

OK, I will bite...  What did I mess up in rcu_read_unlock_special()?

This function does not report leaving the RCU read-side critical section
until after its call to rt_mutex_unlock() has returned, so any RCU
read-side critical sections in rt_mutex_unlock() will be respected.

							Thanx, Paul



* safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-06 20:33       ` Paul E. McKenney
@ 2014-06-08 13:07         ` Oleg Nesterov
  2014-06-09 16:26           ` Paul E. McKenney
  0 siblings, 1 reply; 63+ messages in thread
From: Oleg Nesterov @ 2014-06-08 13:07 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Linus Torvalds, Steven Rostedt, LKML, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On 06/06, Paul E. McKenney wrote:
>
> On Tue, Jun 03, 2014 at 10:01:25PM +0200, Oleg Nesterov wrote:
> >
> > I'll try to recheck rt_mutex_unlock() tomorrow. _Perhaps_ rcu_read_unlock()
> > should be shifted from lock_task_sighand() to unlock_task_sighand() to
> > ensure that rt_mutex_unlock() does nothing with this memory after it
> > makes another lock/unlock possible.
> >
> > But if we need this (currently I do not think so), this doesn't depend on
> > SLAB_DESTROY_BY_RCU. And, at first glance, in this case rcu_read_unlock_special()
> > might be wrong too.
>
> OK, I will bite...  What did I mess up in rcu_read_unlock_special()?
>
> This function does not report leaving the RCU read-side critical section
> until after its call to rt_mutex_unlock() has returned, so any RCU
> read-side critical sections in rt_mutex_unlock() will be respected.

Sorry for confusion.

I only meant that afaics rcu_read_unlock_special() equally depends on the
fact that rt_mutex_unlock() does nothing with "struct rt_mutex" after it
makes another rt_mutex_lock() + rt_mutex_unlock() possible, otherwise this
code is wrong (and unlock_task_sighand() would be wrong too).

Just to simplify the discussion... suppose we add "atomic_t nr_slow_unlock"
into "struct rt_mutex" and change rt_mutex_slowunlock() to do
atomic_inc(&lock->nr_slow_unlock) after it drops ->wait_lock. Of course this
would be ugly, just for illustration.
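
In code, that purely hypothetical change would be roughly:

	static void rt_mutex_slowunlock(struct rt_mutex *lock)
	{
		/* ... existing unlock work under lock->wait_lock ... */

		raw_spin_unlock(&lock->wait_lock);

		/*
		 * Hypothetical: another task may already have done
		 * rt_mutex_lock() + rt_mutex_unlock() and freed the object
		 * embedding the lock, so this write would be a use-after-free.
		 */
		atomic_inc(&lock->nr_slow_unlock);
	}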

In this case atomic_inc() above can write to rcu_boost()'s stack after this
function returns to the caller. And unlock_task_sighand() would be wrong
too: atomic_inc() could write to memory which was already returned to the
system, because the "unlock" path runs outside of the rcu-protected section.

But it seems to me that currently we are safe: rt_mutex_unlock() doesn't do
anything like this, and a concurrent rt_mutex_lock() must always take
wait_lock too.


And while this is off-topic and I can be easily wrong, it seems that the
normal "struct mutex" is not safe in this respect. If nothing else, once
__mutex_unlock_common_slowpath()->__mutex_slowpath_needs_to_unlock() sets
lock->count = 1, a concurrent mutex_lock() can take and then release this
mutex before __mutex_unlock_common_slowpath() takes ->wait_lock.
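
In the CPU0/CPU1 form used earlier in this thread, and assuming a
hypothetical refcounted object embedding the mutex, the sequence would be
roughly:

	CPU0 (mutex_unlock slowpath)		CPU1
	----					----
	atomic_set(&lock->count, 1);
						mutex_lock(lock);   /* fastpath */
						kill_it = !--obj->refcount;
						mutex_unlock(lock); /* fastpath */
						if (kill_it)
							kfree(obj);
	spin_lock_mutex(&lock->wait_lock, flags);
	  ** use after free **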

So _perhaps_ we should not rely on this property of rt_mutex "too much".

Oleg.



* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-08 13:07         ` safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) Oleg Nesterov
@ 2014-06-09 16:26           ` Paul E. McKenney
  2014-06-09 18:15             ` Oleg Nesterov
  2014-06-10  8:37             ` Peter Zijlstra
  0 siblings, 2 replies; 63+ messages in thread
From: Paul E. McKenney @ 2014-06-09 16:26 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Steven Rostedt, LKML, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On Sun, Jun 08, 2014 at 03:07:18PM +0200, Oleg Nesterov wrote:
> On 06/06, Paul E. McKenney wrote:
> >
> > On Tue, Jun 03, 2014 at 10:01:25PM +0200, Oleg Nesterov wrote:
> > >
> > > I'll try to recheck rt_mutex_unlock() tomorrow. _Perhaps_ rcu_read_unlock()
> > > should be shifted from lock_task_sighand() to unlock_task_sighand() to
> > > ensure that rt_mutex_unlock() does nothing with this memory after it
> > > makes another lock/unlock possible.
> > >
> > > But if we need this (currently I do not think so), this doesn't depend on
> > > SLAB_DESTROY_BY_RCU. And, at first glance, in this case rcu_read_unlock_special()
> > > might be wrong too.
> >
> > OK, I will bite...  What did I mess up in rcu_read_unlock_special()?
> >
> > This function does not report leaving the RCU read-side critical section
> > until after its call to rt_mutex_unlock() has returned, so any RCU
> > read-side critical sections in rt_mutex_unlock() will be respected.
> 
> Sorry for confusion.
> 
> I only meant that afaics rcu_read_unlock_special() equally depends on the
> fact that rt_mutex_unlock() does nothing with "struct rt_mutex" after it
> makes another rt_mutex_lock() + rt_mutex_unlock() possible, otherwise this
> code is wrong (and unlock_task_sighand() would be wrong too).
> 
> Just to simplify the discussion... suppose we add "atomic_t nr_slow_unlock"
> into "struct rt_mutex" and change rt_mutex_slowunlock() to do
> atomic_inc(&lock->nr_slow_unlock) after it drops ->wait_lock. Of course this
> would be ugly, just for illustration.

That would indeed be a bad thing, as it could potentially lead to
use-after-free bugs.  Though one could argue that any code that resulted
in use-after-free would be quite aggressive.  But still...

> In this case atomic_inc() above can write to rcu_boost()'s stack after this
> function returns to the caller. And unlock_task_sighand() would be wrong
> too: atomic_inc() could write to memory which was already returned to the
> system, because the "unlock" path runs outside of the rcu-protected section.
> 
> But it seems to me that currently we are safe: rt_mutex_unlock() doesn't do
> anything like this, and a concurrent rt_mutex_lock() must always take
> wait_lock too.
> 
> 
> And while this is off-topic and I can be easily wrong, it seems that the
> normal "struct mutex" is not safe in this respect. If nothing else, once
> __mutex_unlock_common_slowpath()->__mutex_slowpath_needs_to_unlock() sets
> lock->count = 1, a concurrent mutex_lock() can take and then release this
> mutex before __mutex_unlock_common_slowpath() takes ->wait_lock.
> 
> So _perhaps_ we should not rely on this property of rt_mutex "too much".

Well, I could easily move the rt_mutex from rcu_boost()'s stack to the
rcu_node structure, if that would help.  That said, I still have my
use-after-free concern above.

							Thanx, Paul



* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-09 16:26           ` Paul E. McKenney
@ 2014-06-09 18:15             ` Oleg Nesterov
  2014-06-09 18:29               ` Steven Rostedt
  2014-06-10  8:37             ` Peter Zijlstra
  1 sibling, 1 reply; 63+ messages in thread
From: Oleg Nesterov @ 2014-06-09 18:15 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Linus Torvalds, Steven Rostedt, LKML, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On 06/09, Paul E. McKenney wrote:
>
> On Sun, Jun 08, 2014 at 03:07:18PM +0200, Oleg Nesterov wrote:
> >
> > I only meant that afaics rcu_read_unlock_special() equally depends on the
> > fact that rt_mutex_unlock() does nothing with "struct rt_mutex" after it
> > makes another rt_mutex_lock() + rt_mutex_unlock() possible, otherwise this
> > code is wrong (and unlock_task_sighand() would be wrong too).
> >
> > Just to simplify the discussion... suppose we add "atomic_t nr_slow_unlock"
> > into "struct rt_mutex" and change rt_mutex_slowunlock() to do
> > atomic_inc(&lock->nr_slow_unlock) after it drops ->wait_lock. Of course this
> > would be ugly, just for illustration.
>
> That would indeed be a bad thing, as it could potentially lead to
> use-after-free bugs.  Though one could argue that any code that resulted
> in use-after-free would be quite aggressive.  But still...

And once again, note that the normal mutex is already unsafe (unless I missed
something).

> > So _perhaps_ we should not rely on this property of rt_mutex "too much".
>
> Well, I could easily move the rt_mutex from rcu_boost()'s stack to the
> rcu_node structure, if that would help.  That said, I still have my
> use-after-free concern above.

Or we can document that rt_mutex is special and that rt_mutex_unlock() should
be as "atomic" and safe as spin_unlock() or complete().

Oleg.



* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-09 18:15             ` Oleg Nesterov
@ 2014-06-09 18:29               ` Steven Rostedt
  2014-06-09 18:51                 ` Linus Torvalds
  2014-06-09 19:04                 ` Oleg Nesterov
  0 siblings, 2 replies; 63+ messages in thread
From: Steven Rostedt @ 2014-06-09 18:29 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Paul E. McKenney, Linus Torvalds, LKML, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On Mon, 9 Jun 2014 20:15:53 +0200
Oleg Nesterov <oleg@redhat.com> wrote:


> > That would indeed be a bad thing, as it could potentially lead to
> > use-after-free bugs.  Though one could argue that any code that resulted
> > in use-after-free would be quite aggressive.  But still...
> 
> And once again, note that the normal mutex is already unsafe (unless I missed
> something).

Is it unsafe?

This thread was started because of a bug we triggered in -rt, which
ended up being caused by a change specific to -rt that modified the way
slub handled SLAB_DESTROY_BY_RCU. What else was wrong with it?

-- Steve


* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-09 18:29               ` Steven Rostedt
@ 2014-06-09 18:51                 ` Linus Torvalds
  2014-06-09 19:41                   ` Steven Rostedt
  2014-06-10 12:56                   ` Paul E. McKenney
  2014-06-09 19:04                 ` Oleg Nesterov
  1 sibling, 2 replies; 63+ messages in thread
From: Linus Torvalds @ 2014-06-09 18:51 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Oleg Nesterov, Paul E. McKenney, LKML, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On Mon, Jun 9, 2014 at 11:29 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
>>
>> And once again, note that the normal mutex is already unsafe (unless I missed
>> something).
>
> Is it unsafe?
>
> This thread was started because of a bug we triggered in -rt, which
> ended up being caused by a change specific to -rt that modified the way
> slub handled SLAB_DESTROY_BY_RCU. What else was wrong with it?

There's a different issue with freeing of mutexes, which is not a bug,
but "by design". Namely that mutexes aren't truly "atomic". They are
complex data structures, and they have issues that a spinlock does not
have.

When unlocking a mutex, the thread doing the unlocking will still
touch the mutex itself _after_ another thread could already have
successfully acquired the mutex. This is not a problem in any normal
use, since all of this is perfectly coherent in general, but it means
that code sequences like:

  mutex_lock(mem->mutex);
  kill_it = !--mem->refcount;
  mutex_unlock(mem->mutex);
  if (kill_it)
     free(mem);

are fundamentally buggy.

Note that if you think of mutexes as truly indivisible atomic
operations, the above is "obviously correct": the last person who got
the mutex marked it for killing. But the fact is, the *next-to-last*
mutex acquirer may still actively be in the *non-indivisible*
mutex_unlock() when the last person frees it, resulting in a
use-after-free. And yes, we've had this bug, and as far as I know it's
possible that the RT code *introduces* this bug when it changes
spinlocks into mutexes. Because we do exactly the above code sequence
with spinlocks. So just replacing spinlocks with mutexes is a very
subtly buggy thing to do in general.

Another example of this kind of situation is using a mutex as a
completion event: that's simply buggy. Again, it's because mutexes are
complex data structures, and you have a very similar use-after-free
issue. It's why the fundamental data structure for a "struct
completion" is a spinlock, not a mutex.

Again, in *theory*, a completion could be just a mutex that starts out
locked, and then the completer completes it by just unlocking it. The
person who waits for a completion does so by just asking for a lock.
Obvious, simple, and trivial, right? Not so, because of the *exact*
same issue above: the completer (who does an "unlock") may still be
accessing the completion data structure when the completion waiter has
successfully gotten the lock. So the standard thing of the completer
freeing the underlying completion memory (which is often on the stack,
so "freeing" is just he act of going out of scope of the liveness of
the completion data structure) would not work if the completion was a
mutex.

This is subtle, and it is basically unavoidable. If a mutex (or
counting semaphore) has a fast-path - and a mutex/semaphore without a
fast-path is shit - then this issue will exist. Exactly because the
fast-path will depend on just one part of the whole big mutex
structure, and the slow-path will have other pieces to it.

There might be reasonable ways to avoid this issue (having the
fastpath locking field share memory with the slow-path locking, for
example), but it's not how our semaphores and mutexes work, and I
suspect it cannot be the case in general (because it limits you too
badly in how to implement the mutex). As a result, this is all "by
design" as opposed to being a bug.

                Linus


* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-09 18:29               ` Steven Rostedt
  2014-06-09 18:51                 ` Linus Torvalds
@ 2014-06-09 19:04                 ` Oleg Nesterov
  1 sibling, 0 replies; 63+ messages in thread
From: Oleg Nesterov @ 2014-06-09 19:04 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Paul E. McKenney, Linus Torvalds, LKML, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On 06/09, Steven Rostedt wrote:
>
> On Mon, 9 Jun 2014 20:15:53 +0200
> Oleg Nesterov <oleg@redhat.com> wrote:
>
> > > That would indeed be a bad thing, as it could potentially lead to
> > > use-after-free bugs.  Though one could argue that any code that resulted
> > > in use-after-free would be quite aggressive.  But still...
> >
> > And once again, note that the normal mutex is already unsafe (unless I missed
> > something).
>
> Is it unsafe?

Only in a sense that UNLOCK is not atomic.

IOW, you can't, say, declare a mutex or semaphore on the stack, and use lock/unlock
to serialize with another thread.

But rt_mutex seems fine in this case, and for example rcu_boost() does this.
I do not know if this is by design or not, and can we rely on this or not.

> This thread was started because of a bug we triggered in -rt, which
> ended up being a change specific to -rt that modified the way slub
> handled SLAB_DESTROY_BY_RCU. What else was wrong with it?

And I specially changed the subject to avoid the confusion with
SLAB_DESTROY_BY_RCU bug we discussed before, but apparently I need to
apologize for confusion again ;)

But. Note that if rt_mutex is changed so that UNLOCK becomes non-atomic
in a sense above, then lock_task_sighand()/unlock_task_sighand() will be
buggy in -rt.
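
For reference, unlock_task_sighand() is just the plain unlock
(include/linux/sched.h), and it runs after __lock_task_sighand() has already
dropped rcu_read_lock():

	static inline void unlock_task_sighand(struct task_struct *tsk,
						unsigned long *flags)
	{
		spin_unlock_irqrestore(&tsk->sighand->siglock, *flags);
	}

So if the unlock were to touch the lock after another lock/unlock becomes
possible, nothing would keep the sighand memory around at that point.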

Oleg.



* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-09 18:51                 ` Linus Torvalds
@ 2014-06-09 19:41                   ` Steven Rostedt
  2014-06-10  8:53                     ` Thomas Gleixner
  2014-06-10 12:56                   ` Paul E. McKenney
  1 sibling, 1 reply; 63+ messages in thread
From: Steven Rostedt @ 2014-06-09 19:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Paul E. McKenney, LKML, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On Mon, 9 Jun 2014 11:51:09 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, Jun 9, 2014 at 11:29 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> >>
> >> And once again, note that the normal mutex is already unsafe (unless I missed
> >> something).
> >
> > Is it unsafe?
> >
> > This thread was started because of a bug we triggered in -rt, which
> > ended up being caused by a change specific to -rt that modified the way
> > slub handled SLAB_DESTROY_BY_RCU. What else was wrong with it?
> 
> There's a different issue with freeing of mutexes, which is not a bug,
> but "by design". Namely that mutexes aren't truly "atomic". They are
> complex data structures, and they have issues that a spinlock does not
> have.
> 
> When unlocking a mutex, the thread doing the unlocking will still
> touch the mutex itself _after_ another thread could already have
> successfully acquired the mutex. This is not a problem in any normal
> use, since all of this is perfectly coherent in general, but it means
> that code sequences like:
> 
>   mutex_lock(mem->mutex);
>   kill_it = !--mem->refcount;
>   mutex_unlock(mem->mutex);
>   if (kill_it)
>      free(mem);

Oh this again. I remember you having a discussion about it in the past.
I'm trying to wrap my head around it so let me make sure I understand
the spinlock (working) case, and then a bad mutex case.

For this to work, you have some item, let's say on a list. When it is
created, the refcount is 1 and it's added to the list. To remove it,
you remove it from the list and dec the refcount, and if it is zero
then you can free it. The issue is that you don't know what may have
seen it while it's on the list, and the last one to dec it to zero frees
it. Is this a pattern that is actually used?
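
A minimal sketch of that pattern, with a hypothetical "foo" object (this is
the shape of the code whose safety is discussed below):

	struct foo {
		struct list_head	list;
		int			refcount;	/* protected by ->lock */
		spinlock_t		lock;
	};

	static void foo_put(struct foo *foo)
	{
		bool kill_it;

		spin_lock(&foo->lock);
		kill_it = !--foo->refcount;
		spin_unlock(&foo->lock);

		/*
		 * Fine for a spinlock: spin_unlock() does not touch the lock
		 * after another CPU can acquire it, so the last foo_put() can
		 * safely free the object.
		 */
		if (kill_it)
			kfree(foo);
	}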

Because spinlocks are atomic, this is all fine, but if you have a
mutex, then this is where you can have issues. I think rtmutex has an
issue with it too. Specifically in the slow_unlock case:

	if (!rt_mutex_has_waiters(lock)) {
		lock->owner = NULL;
		raw_spin_unlock(&lock->wait_lock);
		return;
	}

That is, you can have:


	CPU0				CPU1
	----				----
				[refcount = 1]
				read foo from list
				[refcount = 2]
  [refcount 2]
  remove foo from list

  mutex_lock(foo->mutex);
  kill_it = !--mem->refcount;
  [refcount = 1]
  [kill_it == 0]
  mutex_unlock(foo->mutex)
    lock->owner = NULL;

				mutex_lock()
				 [ owner == NULL, fast path!]
				kill_it = !--mem->refcount;
				 [refcount = 0]
				 [kill_it = 1]
				mutex_unlock()
				 [ owner == current, fast path!]

				if (kill_it)
				   free(foo);

  raw_spin_unlock(&lock->wait_lock);
   [access freed foo's wait_lock, BUG!]

Is this the problem? I'm just trying to visualize it, and maybe even
visualize it for others that are reading this thread.

Thanks,

-- Steve


* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-09 16:26           ` Paul E. McKenney
  2014-06-09 18:15             ` Oleg Nesterov
@ 2014-06-10  8:37             ` Peter Zijlstra
  2014-06-10 12:52               ` Paul E. McKenney
  1 sibling, 1 reply; 63+ messages in thread
From: Peter Zijlstra @ 2014-06-10  8:37 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Oleg Nesterov, Linus Torvalds, Steven Rostedt, LKML,
	Thomas Gleixner, Andrew Morton, Ingo Molnar, Clark Williams


On Mon, Jun 09, 2014 at 09:26:13AM -0700, Paul E. McKenney wrote:
> That would indeed be a bad thing, as it could potentially lead to
> use-after-free bugs.  Though one could argue that any code that resulted
> in use-after-free would be quite aggressive.  But still...

Let me hijack this thread for yet another issue... So I had an RCU
related use-after-free the other day, and while Sasha was able to
trigger it quite easily, I had a multi-day struggle to reproduce.

Once I figured out what the exact problem was it was also clear to me
why it was so hard for me to reproduce.

So normally it's easier to trigger races on bigger machines, more cpus,
more concurrency, more races, all good.

_However_ with RCU the grace period machinery is slower the bigger the
machine, so bigger machine, slower grace period, slower RCU free, less
likely to hit use-after-free.

So I was thinking, and I know you all will go kick me for this because
the very last thing we need is what I'm about to propose: more RCU
flavours :-).

How about an rcu_read_unlock() reference counted RCU variant that's
ultra aggressive in doing the callbacks in order to better trigger such
issues?



* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-09 19:41                   ` Steven Rostedt
@ 2014-06-10  8:53                     ` Thomas Gleixner
  2014-06-10 16:57                       ` Oleg Nesterov
  2014-06-10 17:07                       ` Oleg Nesterov
  0 siblings, 2 replies; 63+ messages in thread
From: Thomas Gleixner @ 2014-06-10  8:53 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Oleg Nesterov, Paul E. McKenney, LKML,
	Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On Mon, 9 Jun 2014, Steven Rostedt wrote:
> On Mon, 9 Jun 2014 11:51:09 -0700
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Because spinlocks are atomic, this is all fine, but if you have a
> mutex, then this is where you can have issues. I think rtmutex has an
> issue with it too. Specifically in the slow_unlock case:
> 
> 	if (!rt_mutex_has_waiters(lock)) {
> 		lock->owner = NULL;
> 		raw_spin_unlock(&lock->wait_lock);
> 		return;
> 	}

Indeed. If the fast path is enabled we have that issue. Fortunately
there is a halfway reasonable solution for this.

Thanks,

	tglx
--------------------------->
Subject: rtmutex: Plug slow unlock race
From: Thomas Gleixner <tglx@linutronix.de>
Date: Tue, 10 Jun 2014 09:27:00 +0200

When the rtmutex fast path is enabled the slow unlock function can
create the following situation:

spin_lock(foo->m->wait_lock);
foo->m->owner = NULL;
	    			rt_mutex_lock(foo->m); <-- fast path
				free = atomic_dec_and_test(foo->refcnt);
				rt_mutex_unlock(foo->m); <-- fast path
				if (free)
				   kfree(foo);

spin_unlock(foo->m->wait_lock); <--- Use after free.

Plug the race by changing the slow unlock to the following scheme:

     while (!rt_mutex_has_waiters(m)) {
     	    /* Clear the waiters bit in m->owner */
	    clear_rt_mutex_waiters(m);
      	    owner = rt_mutex_owner(m);
      	    spin_unlock(m->wait_lock);
      	    if (cmpxchg(m->owner, owner, 0) == owner)
      	       return;
      	    spin_lock(m->wait_lock);
     }

So in case of a new waiter incoming while the owner tries the slow
path unlock we have two situations:

 unlock(wait_lock);
					lock(wait_lock);
 cmpxchg(p, owner, 0) == owner
 	    	   			mark_rt_mutex_waiters(lock);
	 				acquire(lock);

Or:

 unlock(wait_lock);
					lock(wait_lock);
	 				mark_rt_mutex_waiters(lock);
 cmpxchg(p, owner, 0) != owner
					enqueue_waiter();
					unlock(wait_lock);
 lock(wait_lock);
 wake waiter();
 unlock(wait_lock);
					lock(wait_lock);
					acquire(lock);

If the fast path is disabled, then the simple

   m->owner = NULL;
   unlock(m->wait_lock);

is sufficient as all access to m->owner is serialized via
m->wait_lock;


Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/locking/rtmutex.c |   94 +++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 90 insertions(+), 4 deletions(-)

Index: tip/kernel/locking/rtmutex.c
===================================================================
--- tip.orig/kernel/locking/rtmutex.c
+++ tip/kernel/locking/rtmutex.c
@@ -83,6 +83,48 @@ static inline void mark_rt_mutex_waiters
 		owner = *p;
 	} while (cmpxchg(p, owner, owner | RT_MUTEX_HAS_WAITERS) != owner);
 }
+
+/*
+ * Safe fastpath aware unlock:
+ * 1) Clear the waiters bit
+ * 2) Drop lock->wait_lock
+ * 3) Try to unlock the lock with cmpxchg
+ */
+static inline bool unlock_rt_mutex_safe(struct rt_mutex *lock)
+	__releases(lock->wait_lock)
+{
+	unsigned long owner, *p = (unsigned long *) &lock->owner;
+
+	owner = (unsigned long) rt_mutex_owner(lock);
+	clear_rt_mutex_waiters(lock);
+	raw_spin_unlock(&lock->wait_lock);
+	/*
+	 * If a new waiter comes in between the unlock and the cmpxchg
+	 * we have two situations:
+	 *
+	 * unlock(wait_lock);
+	 *					lock(wait_lock);
+	 * cmpxchg(p, owner, 0) == owner
+	 *					mark_rt_mutex_waiters(lock);
+	 *					acquire(lock);
+	 * or:
+	 *
+	 * unlock(wait_lock);
+	 *					lock(wait_lock);
+	 *					mark_rt_mutex_waiters(lock);
+	 *
+	 * cmpxchg(p, owner, 0) != owner
+	 *					enqueue_waiter();
+	 *					unlock(wait_lock);
+	 * lock(wait_lock);
+	 * wake waiter();
+	 * unlock(wait_lock);
+	 *					lock(wait_lock);
+	 *					acquire(lock);
+	 */
+	return rt_mutex_cmpxchg(p, owner, 0);
+}
+
 #else
 # define rt_mutex_cmpxchg(l,c,n)	(0)
 static inline void mark_rt_mutex_waiters(struct rt_mutex *lock)
@@ -90,6 +132,17 @@ static inline void mark_rt_mutex_waiters
 	lock->owner = (struct task_struct *)
 			((unsigned long)lock->owner | RT_MUTEX_HAS_WAITERS);
 }
+
+/*
+ * Simple slow path only version: owner is protected by wait_lock
+ */
+static inline bool unlock_rt_mutex_safe(struct rt_mutex *lock)
+	__releases(lock->wait_lock)
+{
+	lock->owner = NULL;
+	raw_spin_unlock(&lock->wait_lock);
+	return true;
+}
 #endif
 
 static inline int
@@ -928,10 +981,43 @@ rt_mutex_slowunlock(struct rt_mutex *loc
 
 	rt_mutex_deadlock_account_unlock(current);
 
-	if (!rt_mutex_has_waiters(lock)) {
-		lock->owner = NULL;
-		raw_spin_unlock(&lock->wait_lock);
-		return;
+	/*
+	 * We must be careful here if the fast path is enabled. If we
+	 * have no waiters queued we cannot set owner to NULL here
+	 * because of:
+	 *
+	 * foo->lock->owner = NULL;
+	 *			rtmutex_lock(foo->lock);   <- fast path
+	 *			free = atomic_dec_and_test(foo->refcnt);
+	 *			rtmutex_unlock(foo->lock); <- fast path
+	 *			if (free)
+	 *				kfree(foo);
+	 * raw_spin_unlock(foo->lock->wait_lock);
+	 *
+	 * So for the fastpath enabled kernel:
+	 *
+	 * Nothing can set the waiters bit as long as we hold
+	 * lock->wait_lock. So we do the following sequence:
+	 *
+	 *	owner = rt_mutex_owner(lock);
+	 *	clear_rt_mutex_waiters(lock);
+	 *	raw_spin_unlock(&lock->wait_lock);
+	 *	if (cmpxchg(&lock->owner, owner, 0) == owner)
+	 *		return;
+	 *	goto retry;
+	 *
+	 * The fastpath disabled variant is simple as all access to
+	 * lock->owner is serialized by lock->wait_lock:
+	 *
+	 *	lock->owner = NULL;
+	 *	raw_spin_unlock(&lock->wait_lock);
+	 */
+	while (!rt_mutex_has_waiters(lock)) {
+		/* Drops lock->wait_lock ! */
+		if (unlock_rt_mutex_safe(lock) == true)
+			return;
+		/* Relock the rtmutex and try again */
+		raw_spin_lock(&lock->wait_lock);
 	}
 
 	wakeup_next_waiter(lock);

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-10  8:37             ` Peter Zijlstra
@ 2014-06-10 12:52               ` Paul E. McKenney
  2014-06-10 13:01                 ` Peter Zijlstra
  0 siblings, 1 reply; 63+ messages in thread
From: Paul E. McKenney @ 2014-06-10 12:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Linus Torvalds, Steven Rostedt, LKML,
	Thomas Gleixner, Andrew Morton, Ingo Molnar, Clark Williams

On Tue, Jun 10, 2014 at 10:37:26AM +0200, Peter Zijlstra wrote:
> On Mon, Jun 09, 2014 at 09:26:13AM -0700, Paul E. McKenney wrote:
> > That would indeed be a bad thing, as it could potentially lead to
> > use-after-free bugs.  Though one could argue that any code that resulted
> > in use-after-free would be quite aggressive.  But still...
> 
> Let me hijack this thread for yet another issue... So I had an RCU
> related use-after-free the other day, and while Sasha was able to
> trigger it quite easily, I had a multi-day struggle to reproduce.
> 
> Once I figured out what the exact problem was it was also clear to me
> why it was so hard for me to reproduce.
> 
> So normally its easier to trigger races on bigger machines, more cpus,
> more concurrency, more races, all good.
> 
> _However_ with RCU the grace period machinery is slower the bigger the
> machine, so bigger machine, slower grace period, slower RCU free, less
> likely to hit use-after-free.
> 
> So I was thinking, and I know you all will go kick me for this because
> the very last thing we need is what I'm about to propose: more RCU
> flavours :-).
> 
> How about an rcu_read_unlock() reference counted RCU variant that's
> ultra aggressive in doing the callbacks in order to better trigger such
> issues?

If you are using synchronize_rcu() for the update side, then I suggest
rcutorture.gp_exp=1 to force use expediting throughout.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-09 18:51                 ` Linus Torvalds
  2014-06-09 19:41                   ` Steven Rostedt
@ 2014-06-10 12:56                   ` Paul E. McKenney
  2014-06-10 14:48                     ` Peter Zijlstra
  2014-06-10 15:35                     ` Linus Torvalds
  1 sibling, 2 replies; 63+ messages in thread
From: Paul E. McKenney @ 2014-06-10 12:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Oleg Nesterov, LKML, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On Mon, Jun 09, 2014 at 11:51:09AM -0700, Linus Torvalds wrote:
> On Mon, Jun 9, 2014 at 11:29 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> >>
> >> And once again, note that the normal mutex is already unsafe (unless I missed
> >> something).
> >
> > Is it unsafe?
> >
> > This thread was started because of a bug we triggered in -rt, which
> > ended up being a change specific to -rt that modified the way slub
> > handled SLAB_DESTROY_BY_RCU. What else was wrong with it?
> 
> There's a different issue with freeing of mutexes, which is not a bug,
> but "by design". Namely that mutexes aren't truly "atomic". They are
> complex data structures, and they have issues that a spinlock does not
> have.
> 
> When unlocking a mutex, the thread doing the unlocking will still
> touch the mutex itself _after_ another thread could already have
> successfully acquired the mutex. This is not a problem in any normal
> use, since all of this is perfectly coherent in general, but it means
> that code sequences like:
> 
>   mutex_lock(mem->mutex);
>   kill_it = !--mem->refcount;
>   mutex_unlock(mem->mutex);
>   if (kill_it)
>      free(mem);
> 
> are fundamentally buggy.
> 
> Note that if you think of mutexes as truly indivisible atomic
> operations, the above is "obviously correct": the last person who got
> the mutex marked it for killing. But the fact is, the *next-to-last*
> mutex acquirer may still actively be in the *non-indivisible*
> mutex_unlock() when the last person frees it, resulting in a
> use-after-free. And yes, we've had this bug, and as far as I know it's
> possible that the RT code *introduces* this bug when it changes
> spinlocks into mutexes. Because we do exactly the above code sequence
> with spinlocks. So just replacing spinlocks with mutexes is a very
> subtly buggy thing to do in general.
> 
> Another example of this kind of situation is using a mutex as a
> completion event: that's simply buggy. Again, it's because mutexes are
> complex data structures, and you have a very similar use-after-free
> issue. It's why the fundamental data structure for a "struct
> completion" is a spinlock, not a mutex.
> 
> Again, in *theory*, a completion could be just a mutex that starts out
> locked, and then the completer completes it by just unlocking it. The
> person who waits for a completion does so by just asking for a lock.
> Obvious, simple, and trivial, right? Not so, because of the *exact*
> same issue above: the completer (who does an "unlock") may still be
> accessing the completion data structure when the completion waiter has
> successfully gotten the lock. So the standard thing of the completer
> freeing the underlying completion memory (which is often on the stack,
> so "freeing" is just he act of going out of scope of the liveness of
> the completion data structure) would not work if the completion was a
> mutex.
> 
> This is subtle, and it is basically unavoidable. If a mutex (or
> counting semaphore) has a fast-path - and a mutex/semaphore without a
> fast-path is shit - then this issue will exist. Exactly because the
> fast-path will depend on just one part of the whole big mutex
> structure, and the slow-path will have other pieces to it.
> 
> There might be reasonable ways to avoid this issue (having the
> fastpath locking field share memory with the slow-path locking, for
> example), but it's not how our semaphores and mutexes work, and I
> suspect it cannot be the case in general (because it limits you too
> badly in how to implement the mutex). As a result, this is all "by
> design" as opposed to being a bug.

So to safely free a structure containing a mutex, is there some better
approach than the following?

	mutex_lock(mem->mutex);
	kill_it = !--mem->refcount;
	rcu_read_lock();
	mutex_unlock(mem->mutex);
	rcu_read_unlock();
	if (kill_it)
		kfree_rcu(mem, rh); /* rh is the rcu_head field in mem. */
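
For concreteness, the layout the snippet assumes would be something
like the following, where the field names are the ones used above and
everything else is illustrative:

	struct mem {
		struct mutex	mutex;		/* also guards ->refcount */
		int		refcount;	/* decremented under ->mutex */
		struct rcu_head	rh;		/* consumed by kfree_rcu() */
	};

The rcu_read_lock()/rcu_read_unlock() pair around the mutex_unlock()
is intended to keep the grace period behind kfree_rcu() from ending
before the unlocker has finished its post-release accesses to ->mutex.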

For example, is there some other way to know that all the prior lock
releases have finished their post-release accesses?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-10 12:52               ` Paul E. McKenney
@ 2014-06-10 13:01                 ` Peter Zijlstra
  2014-06-10 14:36                   ` Paul E. McKenney
  0 siblings, 1 reply; 63+ messages in thread
From: Peter Zijlstra @ 2014-06-10 13:01 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Oleg Nesterov, Linus Torvalds, Steven Rostedt, LKML,
	Thomas Gleixner, Andrew Morton, Ingo Molnar, Clark Williams

[-- Attachment #1: Type: text/plain, Size: 1726 bytes --]

On Tue, Jun 10, 2014 at 05:52:35AM -0700, Paul E. McKenney wrote:
> On Tue, Jun 10, 2014 at 10:37:26AM +0200, Peter Zijlstra wrote:
> > On Mon, Jun 09, 2014 at 09:26:13AM -0700, Paul E. McKenney wrote:
> > > That would indeed be a bad thing, as it could potentially lead to
> > > use-after-free bugs.  Though one could argue that any code that resulted
> > > in use-after-free would be quite aggressive.  But still...
> > 
> > Let me hijack this thread for yet another issue... So I had an RCU
> > related use-after-free the other day, and while Sasha was able to
> > trigger it quite easily, I had a multi-day struggle to reproduce.
> > 
> > Once I figured out what the exact problem was it was also clear to me
> > why it was so hard for me to reproduce.
> > 
> > So normally its easier to trigger races on bigger machines, more cpus,
> > more concurrency, more races, all good.
> > 
> > _However_ with RCU the grace period machinery is slower the bigger the
> > machine, so bigger machine, slower grace period, slower RCU free, less
> > likely to hit use-after-free.
> > 
> > So I was thinking, and I know you all will go kick me for this because
> > the very last thing we need is what I'm about to propose: more RCU
> > flavours :-).
> > 
> > How about an rcu_read_unlock() reference counted RCU variant that's
> > ultra aggressive in doing the callbacks in order to better trigger such
> > issues?
> 
> If you are using synchronize_rcu() for the update side, then I suggest
> rcutorture.gp_exp=1 to force use expediting throughout.

No such luck, this was regular kfree() from call_rcu(). And the callback
execution was typically delayed long enough to never 'see' the
use-after-free.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-10 13:01                 ` Peter Zijlstra
@ 2014-06-10 14:36                   ` Paul E. McKenney
  2014-06-10 15:20                     ` Paul E. McKenney
  0 siblings, 1 reply; 63+ messages in thread
From: Paul E. McKenney @ 2014-06-10 14:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Linus Torvalds, Steven Rostedt, LKML,
	Thomas Gleixner, Andrew Morton, Ingo Molnar, Clark Williams

On Tue, Jun 10, 2014 at 03:01:38PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 10, 2014 at 05:52:35AM -0700, Paul E. McKenney wrote:
> > On Tue, Jun 10, 2014 at 10:37:26AM +0200, Peter Zijlstra wrote:
> > > On Mon, Jun 09, 2014 at 09:26:13AM -0700, Paul E. McKenney wrote:
> > > > That would indeed be a bad thing, as it could potentially lead to
> > > > use-after-free bugs.  Though one could argue that any code that resulted
> > > > in use-after-free would be quite aggressive.  But still...
> > > 
> > > Let me hijack this thread for yet another issue... So I had an RCU
> > > related use-after-free the other day, and while Sasha was able to
> > > trigger it quite easily, I had a multi-day struggle to reproduce.
> > > 
> > > Once I figured out what the exact problem was it was also clear to me
> > > why it was so hard for me to reproduce.
> > > 
> > > So normally its easier to trigger races on bigger machines, more cpus,
> > > more concurrency, more races, all good.
> > > 
> > > _However_ with RCU the grace period machinery is slower the bigger the
> > > machine, so bigger machine, slower grace period, slower RCU free, less
> > > likely to hit use-after-free.
> > > 
> > > So I was thinking, and I know you all will go kick me for this because
> > > the very last thing we need is what I'm about to propose: more RCU
> > > flavours :-).
> > > 
> > > How about an rcu_read_unlock() reference counted RCU variant that's
> > > ultra aggressive in doing the callbacks in order to better trigger such
> > > issues?
> > 
> > If you are using synchronize_rcu() for the update side, then I suggest
> > rcutorture.gp_exp=1 to force use expediting throughout.
> 
> No such luck, this was regular kfree() from call_rcu(). And the callback
> execution was typically delayed long enough to never 'see' the
> use-after-free.

Figures.  ;-)

Well, there is always the approach of booting your big systems with most
of the CPUs turned off.  Another approach would be to set HZ=10000 or
some such, assuming the kernel can actually survive that kind of abuse.
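
For example (purely illustrative values), something like

	maxcpus=4

on the kernel command line caps the number of CPUs brought online at
boot, while HZ is a build-time choice (the CONFIG_HZ Kconfig options),
so getting to 10000 would mean patching the allowed values first.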

							Thanx, Paul


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-10 12:56                   ` Paul E. McKenney
@ 2014-06-10 14:48                     ` Peter Zijlstra
  2014-06-10 15:18                       ` Paul E. McKenney
  2014-06-10 15:35                     ` Linus Torvalds
  1 sibling, 1 reply; 63+ messages in thread
From: Peter Zijlstra @ 2014-06-10 14:48 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Linus Torvalds, Steven Rostedt, Oleg Nesterov, LKML,
	Thomas Gleixner, Andrew Morton, Ingo Molnar, Clark Williams

[-- Attachment #1: Type: text/plain, Size: 2160 bytes --]

On Tue, Jun 10, 2014 at 05:56:55AM -0700, Paul E. McKenney wrote:
> On Mon, Jun 09, 2014 at 11:51:09AM -0700, Linus Torvalds wrote:

> > This is subtle, and it is basically unavoidable. If a mutex (or
> > counting semaphore) has a fast-path - and a mutex/semaphore without a
> > fast-path is shit - then this issue will exist. Exactly because the
> > fast-path will depend on just one part of the whole big mutex
> > structure, and the slow-path will have other pieces to it.
> > 
> > There might be reasonable ways to avoid this issue (having the
> > fastpath locking field share memory with the slow-path locking, for
> > example), but it's not how our semaphores and mutexes work, and I
> > suspect it cannot be the case in general (because it limits you too
> > badly in how to implement the mutex). As a result, this is all "by
> > design" as opposed to being a bug.
> 
> So to safely free a structure containing a mutex, is there some better
> approach than the following?
> 
> 	mutex_lock(mem->mutex);
> 	kill_it = !--mem->refcount;
> 	rcu_read_lock();
> 	mutex_unlock(mem->mutex);
> 	rcu_read_unlock();
> 	if (kill_it)
> 		kfree_rcu(mem, rh); /* rh is the rcu_head field in mem. */
> 
> For example, is there some other way to know that all the prior lock
> releases have finished their post-release accesses?

So Thomas posted a patch curing rt_mutex, and for that we really _have_
to because it needs to replace a spinlock_t. But for the regular mutex
its better (from a performance pov) to not do this.

By releasing early and checking for pending waiters later we allow
earlier lock stealing, which is good for throughput.

Back to your example, I think your example is misleading in that it
states: 'a structure containing a mutex'. The problem only arises when
that mutex is used as part of the life-time management of said
structure.

If it has regular (atomic_t or atomic_long_t or spinlock_t) reference
counting, we know the mutex_unlock() must have completed by the time we
do put_*(), and if that put was the last one, there cannot have been
another, otherwise your reference counting is broken.
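
A minimal sketch of that arrangement, with the struct and function
names made up purely for illustration:

	struct foo {
		struct mutex	lock;	/* protects the contents of foo only */
		atomic_t	ref;	/* lifetime; not protected by ->lock */
	};

	void put_foo(struct foo *foo)
	{
		if (atomic_dec_and_test(&foo->ref))
			kfree(foo);
	}

	/* each user holds a reference across its whole critical section */
	mutex_lock(&foo->lock);
	/* ... touch foo ... */
	mutex_unlock(&foo->lock);
	put_foo(foo);	/* dropped only after our own unlock returned */

Every user drops its reference only after its own mutex_unlock() has
returned, so by the time the final put_foo() triggers the kfree() there
can be no unlock still in flight on that mutex.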

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-10 14:48                     ` Peter Zijlstra
@ 2014-06-10 15:18                       ` Paul E. McKenney
  0 siblings, 0 replies; 63+ messages in thread
From: Paul E. McKenney @ 2014-06-10 15:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Steven Rostedt, Oleg Nesterov, LKML,
	Thomas Gleixner, Andrew Morton, Ingo Molnar, Clark Williams

On Tue, Jun 10, 2014 at 04:48:30PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 10, 2014 at 05:56:55AM -0700, Paul E. McKenney wrote:
> > On Mon, Jun 09, 2014 at 11:51:09AM -0700, Linus Torvalds wrote:
> 
> > > This is subtle, and it is basically unavoidable. If a mutex (or
> > > counting semaphore) has a fast-path - and a mutex/semaphore without a
> > > fast-path is shit - then this issue will exist. Exactly because the
> > > fast-path will depend on just one part of the whole big mutex
> > > structure, and the slow-path will have other pieces to it.
> > > 
> > > There might be reasonable ways to avoid this issue (having the
> > > fastpath locking field share memory with the slow-path locking, for
> > > example), but it's not how our semaphores and mutexes work, and I
> > > suspect it cannot be the case in general (because it limits you too
> > > badly in how to implement the mutex). As a result, this is all "by
> > > design" as opposed to being a bug.
> > 
> > So to safely free a structure containing a mutex, is there some better
> > approach than the following?
> > 
> > 	mutex_lock(mem->mutex);
> > 	kill_it = !--mem->refcount;
> > 	rcu_read_lock();
> > 	mutex_unlock(mem->mutex);
> > 	rcu_read_unlock();
> > 	if (kill_it)
> > 		kfree_rcu(mem, rh); /* rh is the rcu_head field in mem. */
> > 
> > For example, is there some other way to know that all the prior lock
> > releases have finished their post-release accesses?
> 
> So Thomas posted a patch curing rt_mutex, and for that we really _have_
> to because it needs to replace a spinlock_t. But for the regular mutex
> its better (from a performance pov) to not do this.
> 
> By releasing early and checking for pending waiters later we allow
> earlier lock stealing, which is good for throughput.
> 
> Back to your example, I think your example is misleading in that it
> states: 'a structure containing a mutex'. The problem only arises when
> that mutex is used as part of the life-time management of said
> structure.

Hey, I just minimally modified the example from this thread!  ;-)

> If it has regular (atomic_t or atomic_long_t or spinlock_t) reference
> counting, we know the mutex_unlock() must have completed by the time we
> do put_*(), and if that put was the last one, there cannot have been
> another, otherwise your reference counting is broken.

So your point is that we need to have some other lifetime management
mechanism for the structure, and that whatever that is, we need to release
-after- our unlock completes, correct?  Which is in fact what I did with
the rcu_read_unlock() above, so we might actually be in agreement here.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-10 14:36                   ` Paul E. McKenney
@ 2014-06-10 15:20                     ` Paul E. McKenney
  0 siblings, 0 replies; 63+ messages in thread
From: Paul E. McKenney @ 2014-06-10 15:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Linus Torvalds, Steven Rostedt, LKML,
	Thomas Gleixner, Andrew Morton, Ingo Molnar, Clark Williams

On Tue, Jun 10, 2014 at 07:36:32AM -0700, Paul E. McKenney wrote:
> On Tue, Jun 10, 2014 at 03:01:38PM +0200, Peter Zijlstra wrote:
> > On Tue, Jun 10, 2014 at 05:52:35AM -0700, Paul E. McKenney wrote:
> > > On Tue, Jun 10, 2014 at 10:37:26AM +0200, Peter Zijlstra wrote:
> > > > On Mon, Jun 09, 2014 at 09:26:13AM -0700, Paul E. McKenney wrote:
> > > > > That would indeed be a bad thing, as it could potentially lead to
> > > > > use-after-free bugs.  Though one could argue that any code that resulted
> > > > > in use-after-free would be quite aggressive.  But still...
> > > > 
> > > > Let me hijack this thread for yet another issue... So I had an RCU
> > > > related use-after-free the other day, and while Sasha was able to
> > > > trigger it quite easily, I had a multi-day struggle to reproduce.
> > > > 
> > > > Once I figured out what the exact problem was it was also clear to me
> > > > why it was so hard for me to reproduce.
> > > > 
> > > > So normally its easier to trigger races on bigger machines, more cpus,
> > > > more concurrency, more races, all good.
> > > > 
> > > > _However_ with RCU the grace period machinery is slower the bigger the
> > > > machine, so bigger machine, slower grace period, slower RCU free, less
> > > > likely to hit use-after-free.
> > > > 
> > > > So I was thinking, and I know you all will go kick me for this because
> > > > the very last thing we need is what I'm about to propose: more RCU
> > > > flavours :-).
> > > > 
> > > > How about an rcu_read_unlock() reference counted RCU variant that's
> > > > ultra aggressive in doing the callbacks in order to better trigger such
> > > > issues?
> > > 
> > > If you are using synchronize_rcu() for the update side, then I suggest
> > > rcutorture.gp_exp=1 to force use expediting throughout.
> > 
> > No such luck, this was regular kfree() from call_rcu(). And the callback
> > execution was typically delayed long enough to never 'see' the
> > use-after-free.
> 
> Figures.  ;-)
> 
> Well, there is always the approach of booting your big systems with most
> of the CPUs turned off.  Another approach would be to set HZ=10000 or
> some such, assuming the kernel can actually survive that kind of abuse.

And yet another approach is to have a pair of low-priority processes
per CPU that context-switch back and forth to each other if that CPU
has nothing else to do.  This should get rid of most of the increase in
grace-period duration with increasing numbers of CPUs.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-10 12:56                   ` Paul E. McKenney
  2014-06-10 14:48                     ` Peter Zijlstra
@ 2014-06-10 15:35                     ` Linus Torvalds
  2014-06-10 16:15                       ` Paul E. McKenney
  1 sibling, 1 reply; 63+ messages in thread
From: Linus Torvalds @ 2014-06-10 15:35 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Steven Rostedt, Oleg Nesterov, LKML, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On Tue, Jun 10, 2014 at 5:56 AM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>
> So to safely free a structure containing a mutex, is there some better
> approach than the following?

Yes.

A data structure *containing* a mutex is fine. The problem only occurs
when that mutex also acts as the lock for the data structure. As a
result, there are two fixes for the "locks are not one single atomic
field":

 (a) don't do self-locking objects
 (b) use spinlocks for these "self-locking" objects

And quite frankly, (a) is the normal "solution" to the problem.

The fact is, having the data structure contain its own lifetime lock
is unusual, and generally broken. The *normal* sequence for freeing
something should be that the last access to it is the atomic reference
count access:

  .. do whatever, including "unlock(&mem->lock)" ..
  if (atomic_dec_and_test(&mem->refcount))
    .. we can now free it ..

and that's safe. It doesn't matter if "mem" had a mutex in it, it
doesn't matter if you walked around three times widdershins with a
dead chicken around your neck. You can do whatever you want, the above
is fine (and you shouldn't even need to worry about CPU memory
ordering, because the alloc/free had better have the barriers to make
sure that nothing can leak from a free to the next allocation,
although that is another discussion perhaps worth having).

The whole notion of having the lock that protects the lifetime of the
data structure inside the structure itself is pretty crazy. Because
while it is true that we sometimes have the refcount be non-atomic,
and instead protected with a lock, that lock is generally always
*outside* the object itself, because you want the lock for lookup etc.
So having the lock _and_ the refcount be inside the object really is
crazy.

That said, "crazy" has happened. We do occasionally do it. It's
generally a mistake (the last example of this was the pipe thing), but
sometimes we do it on purpose (the dcache, for example). You can do
lookups without holding a lock (generally using RCU), and taking the
lock and incrementing a refcount, but then the lock has to be a
spinlock *anyway*, so that's ok.
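
A rough sketch of that lookup pattern, with invented names (the real
dcache code is considerably more involved):

	rcu_read_lock();
	obj = rcu_dereference(table[hash]);	/* lockless lookup */
	if (obj) {
		spin_lock(&obj->lock);
		if (obj->dead) {		/* revalidate under the lock */
			spin_unlock(&obj->lock);
			obj = NULL;
		} else {
			obj->refcount++;	/* refcount guarded by ->lock */
			spin_unlock(&obj->lock);
		}
	}
	rcu_read_unlock();

and the object has to be freed via RCU (kfree_rcu() or equivalent) so
the memory stays valid for the duration of the read-side critical
section.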

The last case where we actually had this bug (the afore-mentioned pipe
thing), the mutex lock wasn't even needed - we had a real spinlock
protecting the reference count. The bug was that we did the mutex
unlock *after* the spinlock, for no good reason.

So it really isn't normally a problem. The RT spinlock conversion
people need to be aware of it, but normally you can't even screw this
up.

              Linus

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-10 15:35                     ` Linus Torvalds
@ 2014-06-10 16:15                       ` Paul E. McKenney
  0 siblings, 0 replies; 63+ messages in thread
From: Paul E. McKenney @ 2014-06-10 16:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Oleg Nesterov, LKML, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On Tue, Jun 10, 2014 at 08:35:53AM -0700, Linus Torvalds wrote:
> On Tue, Jun 10, 2014 at 5:56 AM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> >
> > So to safely free a structure containing a mutex, is there some better
> > approach than the following?
> 
> Yes.
> 
> A data structure *containing* a mutex is fine. The problem only occurs
> when that mutex also acts as the lock for the data structure. As a
> result, there are two fixes for the "locks are not one single atomic
> field":
> 
>  (a) don't do self-locking objects
>  (b) use spinlocks for these "self-locking" objects
> 
> And quite frankly, (a) is the normal "solution" to the problem.
> 
> The fact is, having the data structure contain its own lifetime lock
> is unusual, and generally broken. The *normal* sequence for freeing
> something should be that the last access to it is the atomic reference
> count access:
> 
>   .. do whatever, including "unlock(&mem->lock)" ..
>   if (atomic_dec_and_test(&mem->refcount))
>     .. we can now free it ..
> 
> and that's safe. It doesn't matter if "mem" had a mutex in it, it
> doesn't matter if you walked around three times widdershins with a
> dead chicken around your neck. You can do whatever you want, the above
> is fine (and you shouldn't even need to worry about CPU memory
> ordering, because the alloc/free had better have the barriers to make
> sure that nothing can leak from a free to the next allocation,
> although that is another discussion perhaps worth having).
> 
> The whole notion of having the lock that protects the lifetime of the
> data structure inside the structure itself is pretty crazy. Because
> while it is true that we sometimes have the refcount be non-atomic,
> and instead protected with a lock, that lock is generally always
> *outside* the object itself, because you want the lock for lookup etc.
> So having the lock _and_ the refcount be inside the object really is
> crazy.

Agreed!  You need to have something external to the object that holds the
object down long enough to safely acquire the object's lock, refcount,
or whatever.  Contention usually rules out that "something" being a
global lock or refcount, but per-CPU locks, per-CPU refcounts, RCU,
and so on can work.

> That said, "crazy" has happened. We do occasionally do it. It's
> generally a mistake (the last example of this was the pipe thing), but
> sometimes we do it on purpose (the dcache, for example). You can do
> lookups without holding a lock (generally using RCU), and taking the
> lock and incrementing a refcount, but then the lock has to be a
> spinlock *anyway*, so that's ok.

And of course you have to acquire the lock before doing your
rcu_read_unlock() in that case.

> The last case where we actually had this bug (the afore-mentioned pipe
> thing), the mutex lock wasn't even needed - we had a real spinlock
> protecting the reference count. The bug was that we did the mutex
> unlock *after* the spinlock, for no good reason.
> 
> So it really isn't normally a problem. The RT spinlock conversion
> people need to be aware of it, but normally you can't even screw this
> up.

And yes, I might need to add in a reference-count handoff around RCU's
use of rt_mutex.  Haven't yet come up with a reasonable way to use RCU
at that point instead.  I suppose I could use SRCU, but...  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-10  8:53                     ` Thomas Gleixner
@ 2014-06-10 16:57                       ` Oleg Nesterov
  2014-06-10 18:08                         ` Thomas Gleixner
  2014-06-10 17:07                       ` Oleg Nesterov
  1 sibling, 1 reply; 63+ messages in thread
From: Oleg Nesterov @ 2014-06-10 16:57 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Steven Rostedt, Linus Torvalds, Paul E. McKenney, LKML,
	Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On 06/10, Thomas Gleixner wrote:
>
> On Mon, 9 Jun 2014, Steven Rostedt wrote:
> > I think rtmutex has an
> > issue with it too. Specifically in the slow_unlock case:
> >
> > 	if (!rt_mutex_has_waiters(lock)) {
> > 		lock->owner = NULL;
> > 		raw_spin_unlock(&lock->wait_lock);
> > 		return;
> > 	}
>
> Indeed. If the fast path is enabled we have that issue. Fortunately
> there is a halfways reasonable solution for this.

Ah, yes, I missed that,

> +	while (!rt_mutex_has_waiters(lock)) {
> +		/* Drops lock->wait_lock ! */
> +		if (unlock_rt_mutex_safe(lock) == true)
> +			return;
> +		/* Relock the rtmutex and try again */
> +		raw_spin_lock(&lock->wait_lock);
>  	}

OK...

wakeup_next_waiter() does rt_mutex_set_owner(NULL) before we drop ->wait_lock,
but this looks fine: we know that rt_mutex_has_waiters() can not become false
until waiter->task takes this lock and does rt_mutex_dequeue(), so ->owner
can't be NULL, right?

Perhaps it could simply do ->owner = RT_MUTEX_HAS_WAITERS to make this more
clear...


Off-topic question. I simply can't understand why rt_mutex_slowtrylock() checks
rt_mutex_owner(lock) != current. This looks pointless, try_to_take_rt_mutex()
always fails (correctly) if rt_mutex_owner() != NULL ? IOW, can't we simply
remove this check or turn it into "if (!rt_mutex_owner(lock))" ?

Oleg.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-10  8:53                     ` Thomas Gleixner
  2014-06-10 16:57                       ` Oleg Nesterov
@ 2014-06-10 17:07                       ` Oleg Nesterov
  2014-06-10 17:51                         ` Thomas Gleixner
  1 sibling, 1 reply; 63+ messages in thread
From: Oleg Nesterov @ 2014-06-10 17:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Steven Rostedt, Linus Torvalds, Paul E. McKenney, LKML,
	Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On 06/10, Thomas Gleixner wrote:
>
> +static inline bool unlock_rt_mutex_safe(struct rt_mutex *lock)
> +	__releases(lock->wait_lock)
> +{
> +	unsigned long owner, *p = (unsigned long *) &lock->owner;
> +
> +	owner = (unsigned long) rt_mutex_owner(lock);
> +	clear_rt_mutex_waiters(lock);
> +	raw_spin_unlock(&lock->wait_lock);
> +	/*
> +	 * If a new waiter comes in between the unlock and the cmpxchg
> +	 * we have two situations:
> +	 *
> +	 * unlock(wait_lock);
> +	 *					lock(wait_lock);
> +	 * cmpxchg(p, owner, 0) == owner
> +	 *					mark_rt_mutex_waiters(lock);
> +	 *					acquire(lock);
> +	 * or:
> +	 *
> +	 * unlock(wait_lock);
> +	 *					lock(wait_lock);
> +	 *					mark_rt_mutex_waiters(lock);
> +	 *
> +	 * cmpxchg(p, owner, 0) != owner
> +	 *					enqueue_waiter();
> +	 *					unlock(wait_lock);
> +	 * lock(wait_lock);
> +	 * wake waiter();
> +	 * unlock(wait_lock);
> +	 *					lock(wait_lock);
> +	 *					acquire(lock);
> +	 */
> +	return rt_mutex_cmpxchg(p, owner, 0);

Wait, but this looks like a typo. rt_mutex_cmpxchg() needs "struct rt_mutex *",
not "long *". It seems that you should simply kill "*p" above.

Oleg.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-10 17:07                       ` Oleg Nesterov
@ 2014-06-10 17:51                         ` Thomas Gleixner
  0 siblings, 0 replies; 63+ messages in thread
From: Thomas Gleixner @ 2014-06-10 17:51 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Steven Rostedt, Linus Torvalds, Paul E. McKenney, LKML,
	Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On Tue, 10 Jun 2014, Oleg Nesterov wrote:

> On 06/10, Thomas Gleixner wrote:
> >
> > +static inline bool unlock_rt_mutex_safe(struct rt_mutex *lock)
> > +	__releases(lock->wait_lock)
> > +{
> > +	unsigned long owner, *p = (unsigned long *) &lock->owner;
> > +
> > +	owner = (unsigned long) rt_mutex_owner(lock);
> > +	clear_rt_mutex_waiters(lock);
> > +	raw_spin_unlock(&lock->wait_lock);
> > +	/*
> > +	 * If a new waiter comes in between the unlock and the cmpxchg
> > +	 * we have two situations:
> > +	 *
> > +	 * unlock(wait_lock);
> > +	 *					lock(wait_lock);
> > +	 * cmpxchg(p, owner, 0) == owner
> > +	 *					mark_rt_mutex_waiters(lock);
> > +	 *					acquire(lock);
> > +	 * or:
> > +	 *
> > +	 * unlock(wait_lock);
> > +	 *					lock(wait_lock);
> > +	 *					mark_rt_mutex_waiters(lock);
> > +	 *
> > +	 * cmpxchg(p, owner, 0) != owner
> > +	 *					enqueue_waiter();
> > +	 *					unlock(wait_lock);
> > +	 * lock(wait_lock);
> > +	 * wake waiter();
> > +	 * unlock(wait_lock);
> > +	 *					lock(wait_lock);
> > +	 *					acquire(lock);
> > +	 */
> > +	return rt_mutex_cmpxchg(p, owner, 0);
> 
> Wait, but this looks like a typo. rt_mutex_cmpxchg() needs "struct rt_mutex *",
> not "long *". It seems that you should simply kill "*p" above.

Uurgh. I had cmpxchg there first and then changed it. You're right, I
can kill the p magic. Bah, that's what you get sending out stuff just
before jumping into a car.




^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-10 16:57                       ` Oleg Nesterov
@ 2014-06-10 18:08                         ` Thomas Gleixner
  2014-06-10 18:13                           ` Steven Rostedt
  0 siblings, 1 reply; 63+ messages in thread
From: Thomas Gleixner @ 2014-06-10 18:08 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Steven Rostedt, Linus Torvalds, Paul E. McKenney, LKML,
	Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On Tue, 10 Jun 2014, Oleg Nesterov wrote:

> On 06/10, Thomas Gleixner wrote:
> >
> > On Mon, 9 Jun 2014, Steven Rostedt wrote:
> > > I think rtmutex has an
> > > issue with it too. Specifically in the slow_unlock case:
> > >
> > > 	if (!rt_mutex_has_waiters(lock)) {
> > > 		lock->owner = NULL;
> > > 		raw_spin_unlock(&lock->wait_lock);
> > > 		return;
> > > 	}
> >
> > Indeed. If the fast path is enabled we have that issue. Fortunately
> > there is a halfways reasonable solution for this.
> 
> Ah, yes, I missed that,
> 
> > +	while (!rt_mutex_has_waiters(lock)) {
> > +		/* Drops lock->wait_lock ! */
> > +		if (unlock_rt_mutex_safe(lock) == true)
> > +			return;
> > +		/* Relock the rtmutex and try again */
> > +		raw_spin_lock(&lock->wait_lock);
> >  	}
> 
> OK...
> 
> wakeup_next_waiter() does rt_mutex_set_owner(NULL) before we drop ->wait_lock,
> but this looks fine: we know that rt_mutex_has_waiters() can not become false
> until waiter->task takes this lock and does rt_mutex_dequeue(), so ->owner
> can't be NULL, right?

Correct.

> Perhaps it could simply do ->owner = RT_MUTEX_HAS_WAITERS to make this more
> clear...

Good point. The new owner can cleanup the mess.
 
> Off-topic question. I simply can't understand why rt_mutex_slowtrylock() checks
> rt_mutex_owner(lock) != current. This looks pointless, try_to_take_rt_mutex()
> always fails (correctly) if rt_mutex_owner() != NULL ? IOW, can't we simply
> remove this check or turn it into "if (!rt_mutex_owner(lock))" ?

Indeed.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-10 18:08                         ` Thomas Gleixner
@ 2014-06-10 18:13                           ` Steven Rostedt
  2014-06-10 20:05                             ` Thomas Gleixner
  0 siblings, 1 reply; 63+ messages in thread
From: Steven Rostedt @ 2014-06-10 18:13 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Oleg Nesterov, Linus Torvalds, Paul E. McKenney, LKML,
	Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On Tue, 10 Jun 2014 20:08:37 +0200 (CEST)
Thomas Gleixner <tglx@linutronix.de> wrote:


> > Perhaps it could simply do ->owner = RT_MUTEX_HAS_WAITERS to make this more
> > clear...
> 
> Good point. The new owner can cleanup the mess.
>  

I thought about this too. It should work with the added overhead that
every time we go into the unlock slow path, we guarantee that the next
lock will go into the lock slowpath.

As long as the new acquired lock does a fast unlock, then we get out of
this spiral.

-- Steve

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-10 18:13                           ` Steven Rostedt
@ 2014-06-10 20:05                             ` Thomas Gleixner
  2014-06-10 20:13                               ` Thomas Gleixner
  0 siblings, 1 reply; 63+ messages in thread
From: Thomas Gleixner @ 2014-06-10 20:05 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Oleg Nesterov, Linus Torvalds, Paul E. McKenney, LKML,
	Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On Tue, 10 Jun 2014, Steven Rostedt wrote:
> On Tue, 10 Jun 2014 20:08:37 +0200 (CEST)
> Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> 
> > > Perhaps it could simply do ->owner = RT_MUTEX_HAS_WAITERS to make this more
> > > clear...
> > 
> > Good point. The new owner can cleanup the mess.
> >  
> 
> I thought about this too. It should work with the added overhead that
> every time we go into the unlock slow path, we guarantee that the next
> lock will go into the lock slowpath.
> 
> As long as the new acquired lock does a fast unlock, then we get out of
> this spiral.

The alternative solution is to document WHY this is safe. I think I
prefer that one :)

Thanks,

	tglx



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-10 20:05                             ` Thomas Gleixner
@ 2014-06-10 20:13                               ` Thomas Gleixner
  2014-06-11 15:52                                 ` Paul E. McKenney
  0 siblings, 1 reply; 63+ messages in thread
From: Thomas Gleixner @ 2014-06-10 20:13 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Oleg Nesterov, Linus Torvalds, Paul E. McKenney, LKML,
	Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On Tue, 10 Jun 2014, Thomas Gleixner wrote:
> On Tue, 10 Jun 2014, Steven Rostedt wrote:
> > On Tue, 10 Jun 2014 20:08:37 +0200 (CEST)
> > Thomas Gleixner <tglx@linutronix.de> wrote:
> > 
> > 
> > > > Perhaps it could simply do ->owner = RT_MUTEX_HAS_WAITERS to make this more
> > > > clear...
> > > 
> > > Good point. The new owner can cleanup the mess.
> > >  
> > 
> > I thought about this too. It should work with the added overhead that
> > every time we go into the unlock slow path, we guarantee that the next
> > lock will go into the lock slowpath.
> > 
> > As long as the new acquired lock does a fast unlock, then we get out of
> > this spiral.
> 
> The alternative solution is to document WHY this is safe. I think I
> prefer that one :)

And actually we keep the waiter bit set in wakeup_next_waiter()
because we only dequeue the waiter from the lock owner's pi waiter
list, but not from the lock waiter list.

rt_mutex_set_owner() sets the waiters bit if the lock has waiters. I
agree with Oleg that this is not obvious from the code.

So I add both a comment and open code it.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-10 20:13                               ` Thomas Gleixner
@ 2014-06-11 15:52                                 ` Paul E. McKenney
  2014-06-11 17:07                                   ` Oleg Nesterov
  0 siblings, 1 reply; 63+ messages in thread
From: Paul E. McKenney @ 2014-06-11 15:52 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Steven Rostedt, Oleg Nesterov, Linus Torvalds, LKML,
	Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On Tue, Jun 10, 2014 at 10:13:20PM +0200, Thomas Gleixner wrote:
> On Tue, 10 Jun 2014, Thomas Gleixner wrote:
> > On Tue, 10 Jun 2014, Steven Rostedt wrote:
> > > On Tue, 10 Jun 2014 20:08:37 +0200 (CEST)
> > > Thomas Gleixner <tglx@linutronix.de> wrote:
> > > 
> > > 
> > > > > Perhaps it could simply do ->owner = RT_MUTEX_HAS_WAITERS to make this more
> > > > > clear...
> > > > 
> > > > Good point. The new owner can cleanup the mess.
> > > >  
> > > 
> > > I thought about this too. It should work with the added overhead that
> > > every time we go into the unlock slow path, we guarantee that the next
> > > lock will go into the lock slowpath.
> > > 
> > > As long as the new acquired lock does a fast unlock, then we get out of
> > > this spiral.
> > 
> > The alternative solution is to document WHY this is safe. I think I
> > prefer that one :)
> 
> And actually we keep the waiter bit set in wakeup_next_waiter()
> because we only dequeue the waiter from the lock owners pi waiter
> list, but not from the lock waiter list.
> 
> rt_mutex_set_owner() sets the waiters bit if the lock has waiters. I
> agree with Oleg that this is not obvious from the code.
> 
> So I add both a comment and open code it.

And for my part, I have queued the following, which should prevent RCU
from relying on rt_mutex_unlock() keeping hands off after unlock.
This passes light rcutorture testing.

Thoughts?

							Thanx, Paul

------------------------------------------------------------------------

rcu: Allow post-unlock reference for rt_mutex

The current approach to RCU priority boosting uses an rt_mutex strictly
for its priority-boosting side effects.  The rt_mutex_init_proxy_locked()
function is used by the booster to initialize the lock as held by the
boostee.  The booster then uses rt_mutex_lock() to acquire this rt_mutex,
which priority-boosts the boostee.  When the boostee reaches the end
of its outermost RCU read-side critical section, it checks a field in
its task structure to see whether it has been boosted, and, if so, uses
rt_mutex_unlock() to release the rt_mutex.  The booster can then go on
to boost the next task that is blocking the current RCU grace period.

But reasonable implementations of rt_mutex_unlock() might result in the
boostee referencing the rt_mutex's data after releasing it.  But the
booster might have re-initialized the rt_mutex between the time that the
boostee released it and the time that it later referenced it.  This is
clearly asking for trouble, so this commit introduces a completion that
forces the booster to wait until the boostee has completely finished with
the rt_mutex, thus avoiding the case where the booster is re-initializing
the rt_mutex before the last boostee's last reference to that rt_mutex.

This of course does introduce some overhead, but the priority-boosting
code paths are miles from any possible fastpath, and the overhead of
executing the completion will normally be quite small compared to the
overhead of priority boosting and deboosting, so this should be OK.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index bf2c1e669691..31194ee9dfa6 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -172,6 +172,11 @@ struct rcu_node {
 				/*  queued on this rcu_node structure that */
 				/*  are blocking the current grace period, */
 				/*  there can be no such task. */
+	struct completion boost_completion;
+				/* Used to ensure that the rt_mutex used */
+				/*  to carry out the boosting is fully */
+				/*  released with no future boostee accesses */
+				/*  before that rt_mutex is re-initialized. */
 	unsigned long boost_time;
 				/* When to start boosting (jiffies). */
 	struct task_struct *boost_kthread_task;
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index ec7627becaf0..99743e9ea8ed 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -427,8 +427,10 @@ void rcu_read_unlock_special(struct task_struct *t)
 
 #ifdef CONFIG_RCU_BOOST
 		/* Unboost if we were boosted. */
-		if (rbmp)
+		if (rbmp) {
 			rt_mutex_unlock(rbmp);
+			complete(&rnp->boost_completion);
+		}
 #endif /* #ifdef CONFIG_RCU_BOOST */
 
 		/*
@@ -1202,10 +1204,14 @@ static int rcu_boost(struct rcu_node *rnp)
 	t = container_of(tb, struct task_struct, rcu_node_entry);
 	rt_mutex_init_proxy_locked(&mtx, t);
 	t->rcu_boost_mutex = &mtx;
+	init_completion(&rnp->boost_completion);
 	raw_spin_unlock_irqrestore(&rnp->lock, flags);
 	rt_mutex_lock(&mtx);  /* Side effect: boosts task t's priority. */
 	rt_mutex_unlock(&mtx);  /* Keep lockdep happy. */
 
+	/* Wait until boostee is done accessing mtx before reinitializing. */
+	wait_for_completion(&rnp->boost_completion);
+
 	return ACCESS_ONCE(rnp->exp_tasks) != NULL ||
 	       ACCESS_ONCE(rnp->boost_tasks) != NULL;
 }


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-11 15:52                                 ` Paul E. McKenney
@ 2014-06-11 17:07                                   ` Oleg Nesterov
  2014-06-11 17:17                                     ` Oleg Nesterov
  2014-06-11 17:27                                     ` Paul E. McKenney
  0 siblings, 2 replies; 63+ messages in thread
From: Oleg Nesterov @ 2014-06-11 17:07 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Thomas Gleixner, Steven Rostedt, Linus Torvalds, LKML,
	Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On 06/11, Paul E. McKenney wrote:
>
> @@ -1202,10 +1204,14 @@ static int rcu_boost(struct rcu_node *rnp)
>  	t = container_of(tb, struct task_struct, rcu_node_entry);
>  	rt_mutex_init_proxy_locked(&mtx, t);
>  	t->rcu_boost_mutex = &mtx;
> +	init_completion(&rnp->boost_completion);

can't rcu_init_one() do this? but this is minor,

>  	raw_spin_unlock_irqrestore(&rnp->lock, flags);
>  	rt_mutex_lock(&mtx);  /* Side effect: boosts task t's priority. */
>  	rt_mutex_unlock(&mtx);  /* Keep lockdep happy. */
>
> +	/* Wait until boostee is done accessing mtx before reinitializing. */
> +	wait_for_completion(&rnp->boost_completion);
> +

I must have missed something, I dont understand why we need ->boost_completion.

What if you simply move that rt_mutex into rcu_node ?

Or. Given that rcu_boost_kthread() never exits, it can declare this mutex
on stack and pass the pointer to rcu_boost() ?

Oleg.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-11 17:07                                   ` Oleg Nesterov
@ 2014-06-11 17:17                                     ` Oleg Nesterov
  2014-06-11 17:29                                       ` Paul E. McKenney
  2014-06-11 17:27                                     ` Paul E. McKenney
  1 sibling, 1 reply; 63+ messages in thread
From: Oleg Nesterov @ 2014-06-11 17:17 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Thomas Gleixner, Steven Rostedt, Linus Torvalds, LKML,
	Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On 06/11, Oleg Nesterov wrote:
>
> On 06/11, Paul E. McKenney wrote:
>
> >  	raw_spin_unlock_irqrestore(&rnp->lock, flags);
> >  	rt_mutex_lock(&mtx);  /* Side effect: boosts task t's priority. */
> >  	rt_mutex_unlock(&mtx);  /* Keep lockdep happy. */
> >
> > +	/* Wait until boostee is done accessing mtx before reinitializing. */
> > +	wait_for_completion(&rnp->boost_completion);
> > +
>
> I must have missed something, I dont understand why we need ->boost_completion.
>
> What if you simply move that rt_mutex into rcu_node ?
>
> Or. Given that rcu_boost_kthread() never exits, it can declare this mutex
> on stack and pass the pointer to rcu_boost() ?

Ah, please ignore, I forgot about init_proxy_locked(). Although perhaps this
can be solved easily.

Oleg.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-11 17:07                                   ` Oleg Nesterov
  2014-06-11 17:17                                     ` Oleg Nesterov
@ 2014-06-11 17:27                                     ` Paul E. McKenney
  1 sibling, 0 replies; 63+ messages in thread
From: Paul E. McKenney @ 2014-06-11 17:27 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Thomas Gleixner, Steven Rostedt, Linus Torvalds, LKML,
	Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On Wed, Jun 11, 2014 at 07:07:05PM +0200, Oleg Nesterov wrote:
> On 06/11, Paul E. McKenney wrote:
> >
> > @@ -1202,10 +1204,14 @@ static int rcu_boost(struct rcu_node *rnp)
> >  	t = container_of(tb, struct task_struct, rcu_node_entry);
> >  	rt_mutex_init_proxy_locked(&mtx, t);
> >  	t->rcu_boost_mutex = &mtx;
> > +	init_completion(&rnp->boost_completion);
> 
> can't rcu_init_one() do this? but this is minor,

It could, but I would have to define yet another init-time function, one
variant under CONFIG_RCU_BOOST and one without.  Yeah, lazy...

> >  	raw_spin_unlock_irqrestore(&rnp->lock, flags);
> >  	rt_mutex_lock(&mtx);  /* Side effect: boosts task t's priority. */
> >  	rt_mutex_unlock(&mtx);  /* Keep lockdep happy. */
> >
> > +	/* Wait until boostee is done accessing mtx before reinitializing. */
> > +	wait_for_completion(&rnp->boost_completion);
> > +
> 
> I must have missed something, I dont understand why we need ->boost_completion.

Because rt_mutex_init_proxy_locked() stomps on mtx periodically.
Which might happen to work at the moment, but doesn't seem like a
particularly good thing.

> What if you simply move that rt_mutex into rcu_node ?
> 
> Or. Given that rcu_boost_kthread() never exits, it can declare this mutex
> on stack and pass the pointer to rcu_boost() ?

It is true that moving mtx to either the rcu_node structure or to
rcu_boost_kthread()'s stack frame would preserve type safety, but not
initialization safety.

Or maybe I am being too paranoid?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-11 17:17                                     ` Oleg Nesterov
@ 2014-06-11 17:29                                       ` Paul E. McKenney
  2014-06-11 17:59                                         ` Oleg Nesterov
  0 siblings, 1 reply; 63+ messages in thread
From: Paul E. McKenney @ 2014-06-11 17:29 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Thomas Gleixner, Steven Rostedt, Linus Torvalds, LKML,
	Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On Wed, Jun 11, 2014 at 07:17:34PM +0200, Oleg Nesterov wrote:
> On 06/11, Oleg Nesterov wrote:
> >
> > On 06/11, Paul E. McKenney wrote:
> >
> > >  	raw_spin_unlock_irqrestore(&rnp->lock, flags);
> > >  	rt_mutex_lock(&mtx);  /* Side effect: boosts task t's priority. */
> > >  	rt_mutex_unlock(&mtx);  /* Keep lockdep happy. */
> > >
> > > +	/* Wait until boostee is done accessing mtx before reinitializing. */
> > > +	wait_for_completion(&rnp->boost_completion);
> > > +
> >
> > I must have missed something, I dont understand why we need ->boost_completion.
> >
> > What if you simply move that rt_mutex into rcu_node ?
> >
> > Or. Given that rcu_boost_kthread() never exits, it can declare this mutex
> > on stack and pass the pointer to rcu_boost() ?
> 
> Ah, please ignore, I forgot about init_proxy_locked(). Although perhaps this
> can be solved easily.

You beat me to it.  ;-)

I was thinking of ->boost_completion as the way to solve it easily, but
what did you have in mind?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-11 17:29                                       ` Paul E. McKenney
@ 2014-06-11 17:59                                         ` Oleg Nesterov
  2014-06-11 19:56                                           ` Paul E. McKenney
  0 siblings, 1 reply; 63+ messages in thread
From: Oleg Nesterov @ 2014-06-11 17:59 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Thomas Gleixner, Steven Rostedt, Linus Torvalds, LKML,
	Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On 06/11, Paul E. McKenney wrote:
>
> On Wed, Jun 11, 2014 at 07:17:34PM +0200, Oleg Nesterov wrote:
> > On 06/11, Oleg Nesterov wrote:
> > >
> > > On 06/11, Paul E. McKenney wrote:
> > >
> > > >  	raw_spin_unlock_irqrestore(&rnp->lock, flags);
> > > >  	rt_mutex_lock(&mtx);  /* Side effect: boosts task t's priority. */
> > > >  	rt_mutex_unlock(&mtx);  /* Keep lockdep happy. */
> > > >
> > > > +	/* Wait until boostee is done accessing mtx before reinitializing. */
> > > > +	wait_for_completion(&rnp->boost_completion);
> > > > +
> > >
> > > I must have missed something, I dont understand why we need ->boost_completion.
> > >
> > > What if you simply move that rt_mutex into rcu_node ?
> > >
> > > Or. Given that rcu_boost_kthread() never exits, it can declare this mutex
> > > on stack and pass the pointer to rcu_boost() ?
> >
> > Ah, please ignore, I forgot about init_proxy_locked(). Although perhaps this
> > can be solved easily.
>
> You beat me to it.  ;-)
>
> I was thinking of ->boost_completion as the way to solve it easily, but
> what did you have in mind?

I meant, rcu_boost() could probably just do "mtx->owner = t", we know that
it was unlocked by us and nobody else can use it until we set
t->rcu_boost_mutex.

And if we move it into rcu_node, then we can probably kill ->rcu_boost_mutex,
rcu_read_unlock_special() could check rnp->boost_mutex->owner == current.

But you know, I also think that the dentist removed the rest of my brains
along with my tooth, so I am not sure if I actually have something in mind.

Oleg.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-11 17:59                                         ` Oleg Nesterov
@ 2014-06-11 19:56                                           ` Paul E. McKenney
  2014-06-12 17:28                                             ` Oleg Nesterov
  0 siblings, 1 reply; 63+ messages in thread
From: Paul E. McKenney @ 2014-06-11 19:56 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Thomas Gleixner, Steven Rostedt, Linus Torvalds, LKML,
	Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On Wed, Jun 11, 2014 at 07:59:34PM +0200, Oleg Nesterov wrote:
> On 06/11, Paul E. McKenney wrote:
> >
> > On Wed, Jun 11, 2014 at 07:17:34PM +0200, Oleg Nesterov wrote:
> > > On 06/11, Oleg Nesterov wrote:
> > > >
> > > > On 06/11, Paul E. McKenney wrote:
> > > >
> > > > >  	raw_spin_unlock_irqrestore(&rnp->lock, flags);
> > > > >  	rt_mutex_lock(&mtx);  /* Side effect: boosts task t's priority. */
> > > > >  	rt_mutex_unlock(&mtx);  /* Keep lockdep happy. */
> > > > >
> > > > > +	/* Wait until boostee is done accessing mtx before reinitializing. */
> > > > > +	wait_for_completion(&rnp->boost_completion);
> > > > > +
> > > >
> > > > I must have missed something, I dont understand why we need ->boost_completion.
> > > >
> > > > What if you simply move that rt_mutex into rcu_node ?
> > > >
> > > > Or. Given that rcu_boost_kthread() never exits, it can declare this mutex
> > > > on stack and pass the pointer to rcu_boost() ?
> > >
> > > Ah, please ignore, I forgot about init_proxy_locked(). Although perhaps this
> > > can be solved easily.
> >
> > You beat me to it.  ;-)
> >
> > I was thinking of ->boost_completion as the way to solve it easily, but
> > what did you have in mind?
> 
> I meant, rcu_boost() could probably just do "mtx->owner = t", we know that
> it was unlocked by us and nobody else can use it until we set
> t->rcu_boost_mutex.

My concern with this is that rcu_read_unlock_special() could hypothetically
get preempted (either by kernel or hypervisor), so that it might be a long
time until it makes its reference.  But maybe that reference would be
harmless in this case.

> And if we move it into rcu_node, then we can probably kill ->rcu_boost_mutex,
> rcu_read_unlock_special() could check rnp->boost_mutex->owner == current.

If this was anywhere near a hot code path, I would be sorely tempted.

> But you know, I also think that the dentist removed the rest of my brains
> along my tooth, so I am not sure if I actually have something in mind.

Ouch!!!  When that happens to me, most of my brains return some time
after the anesthetic wears off, but I know the feeling!

							Thanx, Paul


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-11 19:56                                           ` Paul E. McKenney
@ 2014-06-12 17:28                                             ` Oleg Nesterov
  2014-06-12 20:35                                               ` Paul E. McKenney
  0 siblings, 1 reply; 63+ messages in thread
From: Oleg Nesterov @ 2014-06-12 17:28 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Thomas Gleixner, Steven Rostedt, Linus Torvalds, LKML,
	Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On 06/11, Paul E. McKenney wrote:
>
> On Wed, Jun 11, 2014 at 07:59:34PM +0200, Oleg Nesterov wrote:
> > On 06/11, Paul E. McKenney wrote:
> > >
> > > I was thinking of ->boost_completion as the way to solve it easily, but
> > > what did you have in mind?
> >
> > I meant, rcu_boost() could probably just do "mtx->owner = t", we know that
> > it was unlocked by us and nobody else can use it until we set
> > t->rcu_boost_mutex.
>
> My concern with this is that rcu_read_unlock_special() could hypothetically
> get preempted (either by kernel or hypervisor), so that it might be a long
> time until it makes its reference.  But maybe that reference would be
> harmless in this case.

Confused... Not sure I understand what you meant, and I certainly do not
understand how this connects to the proxy-locking method.

Could you explain?

> > And if we move it into rcu_node, then we can probably kill ->rcu_boost_mutex,
> > rcu_read_unlock_special() could check rnp->boost_mutex->owner == current.
>
> If this was anywhere near a hot code path, I would be sorely tempted.

Ah, but I didn't mean performance. I think it is always good to try to remove
something from task_struct, it is huge. I do not mean sizeof() in the first
place; the very fact that I can hardly understand the purpose of half of its
members makes me sad ;)

Oleg.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-12 17:28                                             ` Oleg Nesterov
@ 2014-06-12 20:35                                               ` Paul E. McKenney
  2014-06-12 21:40                                                 ` Thomas Gleixner
  2014-06-13 14:52                                                 ` Oleg Nesterov
  0 siblings, 2 replies; 63+ messages in thread
From: Paul E. McKenney @ 2014-06-12 20:35 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Thomas Gleixner, Steven Rostedt, Linus Torvalds, LKML,
	Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On Thu, Jun 12, 2014 at 07:28:44PM +0200, Oleg Nesterov wrote:
> On 06/11, Paul E. McKenney wrote:
> >
> > On Wed, Jun 11, 2014 at 07:59:34PM +0200, Oleg Nesterov wrote:
> > > On 06/11, Paul E. McKenney wrote:
> > > >
> > > > I was thinking of ->boost_completion as the way to solve it easily, but
> > > > what did you have in mind?
> > >
> > > I meant, rcu_boost() could probably just do "mtx->owner = t", we know that
> > > it was unlocked by us and nobody else can use it until we set
> > > t->rcu_boost_mutex.
> >
> > My concern with this is that rcu_read_unlock_special() could hypothetically
> > get preempted (either by kernel or hypervisor), so that it might be a long
> > time until it makes its reference.  But maybe that reference would be
> > harmless in this case.
> 
> Confused... Not sure I understand what did you mean, and certainly I do not
> understand how this connects to the proxy-locking method.
> 
> Could you explain?

Here is the hypothetical sequence of events, which cannot happen unless
the CPU releasing the lock accesses the structure after releasing
the lock:

	CPU 0				CPU 1 (booster)

	releases boost_mutex

					acquires boost_mutex
					releases boost_mutex
					post-release boost_mutex access?
					Loops to next task to boost
					proxy-locks boost_mutex

	post-release boost_mutex access:
		confused due to proxy-lock
		operation?

Now maybe this ends up being safe, but it sure feels like an accident
waiting to happen.  Some bright developer comes up with a super-fast
handoff, and blam, RCU priority boosting takes it in the shorts.  ;-)
In contrast, using the completion prevents this.
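
Abridged (and combining this with the earlier ->boost_completion change), the
handoff that the completion enforces looks like this:

	/* rcu_boost(), the booster: */
	rt_mutex_init_proxy_locked(&rnp->boost_mtx, t);	/* under rnp->lock */
	init_completion(&rnp->boost_completion);
	raw_spin_unlock_irqrestore(&rnp->lock, flags);
	rt_mutex_lock(&rnp->boost_mtx);    /* side effect: boosts task t's priority */
	rt_mutex_unlock(&rnp->boost_mtx);
	wait_for_completion(&rnp->boost_completion);

	/* rcu_read_unlock_special(), the boostee t: */
	rt_mutex_unlock(&rnp->boost_mtx);
	complete(&rnp->boost_completion);  /* only now may the booster proxy-lock it again */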

> > > And if we move it into rcu_node, then we can probably kill ->rcu_boost_mutex,
> > > rcu_read_unlock_special() could check rnp->boost_mutex->owner == current.
> >
> > If this was anywhere near a hot code path, I would be sorely tempted.
> 
> Ah, but I didn't mean perfomance. I think it is always good to try to remove
> something from task_struct, it is huge. I do not mean sizeof() in the first
> place, the very fact that I can hardly understand the purpose of a half of its
> members makes me sad ;)

Now -that- just might make a huge amount of sense!  Let's see...

o	We hold the rcu_node structure's ->lock when checking the owner
	(looks like rt_mutex_owner() is the right API).

o	We hold the rcu_node structure's ->lock when doing
	rt_mutex_init_proxy_locked().

o	We -don't- hold ->lock when releasing the rt_mutex, but that
	should be OK: The owner is releasing it, and it is going to
	not-owned, so no other task can possibly see ownership moving
	to/from them.

o	The rcu_node structure grows a bit, but not enough to worry
	about, and on most systems, the decrease in task_struct size
	will more than outweigh the increase in rcu_node size.

Looks quite promising, how about the following?  (Hey, it builds, so it
must be correct, right?)

							Thanx, Paul

------------------------------------------------------------------------

rcu: Simplify priority boosting by putting rt_mutex in rcu_node

RCU priority boosting currently checks for boosting via a pointer in
task_struct.  However, this is not needed: As Oleg noted, if the
rt_mutex is placed in the rcu_node instead of on the booster's stack,
the boostee can simply check it to see if it owns the lock.  This commit
makes this change, shrinking task_struct by one pointer and the kernel
by three lines.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

 b/include/linux/sched.h    |    3 ---
 b/kernel/rcu/tree.h        |    3 +++
 b/kernel/rcu/tree_plugin.h |   19 ++++++++-----------
 3 files changed, 11 insertions(+), 14 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 25f54c79f757..6b90114764ff 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1222,9 +1222,6 @@ struct task_struct {
 #ifdef CONFIG_TREE_PREEMPT_RCU
 	struct rcu_node *rcu_blocked_node;
 #endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */
-#ifdef CONFIG_RCU_BOOST
-	struct rt_mutex *rcu_boost_mutex;
-#endif /* #ifdef CONFIG_RCU_BOOST */
 
 #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
 	struct sched_info sched_info;
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 31194ee9dfa6..db3f096ed80b 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -177,6 +177,9 @@ struct rcu_node {
 				/*  to carry out the boosting is fully */
 				/*  released with no future boostee accesses */
 				/*  before that rt_mutex is re-initialized. */
+	struct rt_mutex boost_mtx;
+				/* Used only for the priority-boosting */
+				/*  side effect, not as a lock. */
 	unsigned long boost_time;
 				/* When to start boosting (jiffies). */
 	struct task_struct *boost_kthread_task;
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 99743e9ea8ed..7628095f1c47 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -398,11 +398,9 @@ void rcu_read_unlock_special(struct task_struct *t)
 #ifdef CONFIG_RCU_BOOST
 		if (&t->rcu_node_entry == rnp->boost_tasks)
 			rnp->boost_tasks = np;
-		/* Snapshot/clear ->rcu_boost_mutex with rcu_node lock held. */
-		if (t->rcu_boost_mutex) {
-			rbmp = t->rcu_boost_mutex;
-			t->rcu_boost_mutex = NULL;
-		}
+		/* Snapshot/clear ->boost_mutex with rcu_node lock held. */
+		if (rt_mutex_owner(&rnp->boost_mtx) == t)
+			rbmp = &rnp->boost_mtx;
 #endif /* #ifdef CONFIG_RCU_BOOST */
 
 		/*
@@ -1151,7 +1149,6 @@ static void rcu_wake_cond(struct task_struct *t, int status)
 static int rcu_boost(struct rcu_node *rnp)
 {
 	unsigned long flags;
-	struct rt_mutex mtx;
 	struct task_struct *t;
 	struct list_head *tb;
 
@@ -1202,14 +1199,14 @@ static int rcu_boost(struct rcu_node *rnp)
 	 * section.
 	 */
 	t = container_of(tb, struct task_struct, rcu_node_entry);
-	rt_mutex_init_proxy_locked(&mtx, t);
-	t->rcu_boost_mutex = &mtx;
+	rt_mutex_init_proxy_locked(&rnp->boost_mtx, t);
 	init_completion(&rnp->boost_completion);
 	raw_spin_unlock_irqrestore(&rnp->lock, flags);
-	rt_mutex_lock(&mtx);  /* Side effect: boosts task t's priority. */
-	rt_mutex_unlock(&mtx);  /* Keep lockdep happy. */
+	/* Lock only for side effect: boosts task t's priority. */
+	rt_mutex_lock(&rnp->boost_mtx);
+	rt_mutex_unlock(&rnp->boost_mtx);  /* Then keep lockdep happy. */
 
-	/* Wait until boostee is done accessing mtx before reinitializing. */
+	/* Wait for boostee to be done w/boost_mtx before reinitializing. */
 	wait_for_completion(&rnp->boost_completion);
 
 	return ACCESS_ONCE(rnp->exp_tasks) != NULL ||


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-12 20:35                                               ` Paul E. McKenney
@ 2014-06-12 21:40                                                 ` Thomas Gleixner
  2014-06-12 22:27                                                   ` Paul E. McKenney
  2014-06-13 14:55                                                   ` Oleg Nesterov
  2014-06-13 14:52                                                 ` Oleg Nesterov
  1 sibling, 2 replies; 63+ messages in thread
From: Thomas Gleixner @ 2014-06-12 21:40 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Oleg Nesterov, Steven Rostedt, Linus Torvalds, LKML,
	Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On Thu, 12 Jun 2014, Paul E. McKenney wrote:
> On Thu, Jun 12, 2014 at 07:28:44PM +0200, Oleg Nesterov wrote:
> > On 06/11, Paul E. McKenney wrote:
> > >
> > > On Wed, Jun 11, 2014 at 07:59:34PM +0200, Oleg Nesterov wrote:
> > > > On 06/11, Paul E. McKenney wrote:
> > > > >
> > > > > I was thinking of ->boost_completion as the way to solve it easily, but
> > > > > what did you have in mind?
> > > >
> > > > I meant, rcu_boost() could probably just do "mtx->owner = t", we know that
> > > > it was unlocked by us and nobody else can use it until we set
> > > > t->rcu_boost_mutex.
> > >
> > > My concern with this is that rcu_read_unlock_special() could hypothetically
> > > get preempted (either by kernel or hypervisor), so that it might be a long
> > > time until it makes its reference.  But maybe that reference would be
> > > harmless in this case.
> > 
> > Confused... Not sure I understand what did you mean, and certainly I do not
> > understand how this connects to the proxy-locking method.
> > 
> > Could you explain?
> 
> Here is the hypothetical sequence of events, which cannot happen unless
> the CPU releasing the lock accesses the structure after releasing
> the lock:
> 
> 	CPU 0				CPU 1 (booster)
> 
> 	releases boost_mutex
> 
> 					acquires boost_mutex
> 					releases boost_mutex
> 					post-release boost_mutex access?
> 					Loops to next task to boost
> 					proxy-locks boost_mutex
> 
> 	post-release boost_mutex access:
> 		confused due to proxy-lock
> 		operation?
> 
> Now maybe this ends up being safe, but it sure feels like an accident
> waiting to happen.  Some bright developer comes up with a super-fast
> handoff, and blam, RCU priority boosting takes it in the shorts.  ;-)
> In contrast, using the completion prevents this.
> 
> > > > And if we move it into rcu_node, then we can probably kill ->rcu_boost_mutex,
> > > > rcu_read_unlock_special() could check rnp->boost_mutex->owner == current.
> > >
> > > If this was anywhere near a hot code path, I would be sorely tempted.
> > 
> > Ah, but I didn't mean perfomance. I think it is always good to try to remove
> > something from task_struct, it is huge. I do not mean sizeof() in the first
> > place, the very fact that I can hardly understand the purpose of a half of its
> > members makes me sad ;)
> 
> Now -that- just might make a huge amount of sense!  Let's see...
> 
> o	We hold the rcu_node structure's ->lock when checking the owner
> 	(looks like rt_mutex_owner() is the right API).
> 
> o	We hold the rcu_node structure's ->lock when doing
> 	rt_mutex_init_proxy_locked().
> 
> o	We -don't- hold ->lock when releasing the rt_mutex, but that
> 	should be OK: The owner is releasing it, and it is going to
> 	not-owned, so no other task can possibly see ownership moving
> 	to/from them.
> 
> o	The rcu_node structure grows a bit, but not enough to worry
> 	about, and on most systems, the decrease in task_struct size
> 	will more than outweigh the increase in rcu_node size.
> 
> Looks quite promising, how about the following?  (Hey, it builds, so it
> must be correct, right?)

True. Why should we have users if we would test the crap we produce?

Just FYI, I have a patch pending which makes the release safe.

      http://marc.info/?l=linux-kernel&m=140251240630730&w=2

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-12 21:40                                                 ` Thomas Gleixner
@ 2014-06-12 22:27                                                   ` Paul E. McKenney
  2014-06-12 23:19                                                     ` Paul E. McKenney
  2014-06-13 14:55                                                   ` Oleg Nesterov
  1 sibling, 1 reply; 63+ messages in thread
From: Paul E. McKenney @ 2014-06-12 22:27 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Oleg Nesterov, Steven Rostedt, Linus Torvalds, LKML,
	Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On Thu, Jun 12, 2014 at 11:40:07PM +0200, Thomas Gleixner wrote:
> On Thu, 12 Jun 2014, Paul E. McKenney wrote:
> > On Thu, Jun 12, 2014 at 07:28:44PM +0200, Oleg Nesterov wrote:
> > > On 06/11, Paul E. McKenney wrote:
> > > >
> > > > On Wed, Jun 11, 2014 at 07:59:34PM +0200, Oleg Nesterov wrote:
> > > > > On 06/11, Paul E. McKenney wrote:
> > > > > >
> > > > > > I was thinking of ->boost_completion as the way to solve it easily, but
> > > > > > what did you have in mind?
> > > > >
> > > > > I meant, rcu_boost() could probably just do "mtx->owner = t", we know that
> > > > > it was unlocked by us and nobody else can use it until we set
> > > > > t->rcu_boost_mutex.
> > > >
> > > > My concern with this is that rcu_read_unlock_special() could hypothetically
> > > > get preempted (either by kernel or hypervisor), so that it might be a long
> > > > time until it makes its reference.  But maybe that reference would be
> > > > harmless in this case.
> > > 
> > > Confused... Not sure I understand what did you mean, and certainly I do not
> > > understand how this connects to the proxy-locking method.
> > > 
> > > Could you explain?
> > 
> > Here is the hypothetical sequence of events, which cannot happen unless
> > the CPU releasing the lock accesses the structure after releasing
> > the lock:
> > 
> > 	CPU 0				CPU 1 (booster)
> > 
> > 	releases boost_mutex
> > 
> > 					acquires boost_mutex
> > 					releases boost_mutex
> > 					post-release boost_mutex access?
> > 					Loops to next task to boost
> > 					proxy-locks boost_mutex
> > 
> > 	post-release boost_mutex access:
> > 		confused due to proxy-lock
> > 		operation?
> > 
> > Now maybe this ends up being safe, but it sure feels like an accident
> > waiting to happen.  Some bright developer comes up with a super-fast
> > handoff, and blam, RCU priority boosting takes it in the shorts.  ;-)
> > In contrast, using the completion prevents this.
> > 
> > > > > And if we move it into rcu_node, then we can probably kill ->rcu_boost_mutex,
> > > > > rcu_read_unlock_special() could check rnp->boost_mutex->owner == current.
> > > >
> > > > If this was anywhere near a hot code path, I would be sorely tempted.
> > > 
> > > Ah, but I didn't mean perfomance. I think it is always good to try to remove
> > > something from task_struct, it is huge. I do not mean sizeof() in the first
> > > place, the very fact that I can hardly understand the purpose of a half of its
> > > members makes me sad ;)
> > 
> > Now -that- just might make a huge amount of sense!  Let's see...
> > 
> > o	We hold the rcu_node structure's ->lock when checking the owner
> > 	(looks like rt_mutex_owner() is the right API).
> > 
> > o	We hold the rcu_node structure's ->lock when doing
> > 	rt_mutex_init_proxy_locked().
> > 
> > o	We -don't- hold ->lock when releasing the rt_mutex, but that
> > 	should be OK: The owner is releasing it, and it is going to
> > 	not-owned, so no other task can possibly see ownership moving
> > 	to/from them.
> > 
> > o	The rcu_node structure grows a bit, but not enough to worry
> > 	about, and on most systems, the decrease in task_struct size
> > 	will more than outweigh the increase in rcu_node size.
> > 
> > Looks quite promising, how about the following?  (Hey, it builds, so it
> > must be correct, right?)
> 
> True. Why should we have users if we would test the crap we produce?

Well, it seems to be passing initial tests as well.  Must be my tests
need more work.

> Just FYI, I have a patch pending which makes the release safe.
> 
>       http://marc.info/?l=linux-kernel&m=140251240630730&w=2

Very good, belt -and- suspenders!  Might even work that way.  ;-)

I could argue for a cpu_relax() in the "while (!rt_mutex_has_waiters(lock))"
loop for the case in which the CPU enqueuing itself is preempted, possibly
by a hypervisor, but either way:

Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

							Thanx, Paul


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-12 22:27                                                   ` Paul E. McKenney
@ 2014-06-12 23:19                                                     ` Paul E. McKenney
  2014-06-13 15:08                                                       ` Oleg Nesterov
  0 siblings, 1 reply; 63+ messages in thread
From: Paul E. McKenney @ 2014-06-12 23:19 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Oleg Nesterov, Steven Rostedt, Linus Torvalds, LKML,
	Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On Thu, Jun 12, 2014 at 03:27:48PM -0700, Paul E. McKenney wrote:
> On Thu, Jun 12, 2014 at 11:40:07PM +0200, Thomas Gleixner wrote:

[ . . . ]

> > True. Why should we have users if we would test the crap we produce?
> 
> Well, it seems to be passing initial tests as well.  Must be my tests
> need more work.

Or, as in this case, must be that I should have my test machine using
the right git-tree branch and correct kernel config.  :-/  Updated patch
below, FWIW.

							Thanx, Paul

------------------------------------------------------------------------

rcu: Simplify priority boosting by putting rt_mutex in rcu_node

RCU priority boosting currently checks for boosting via a pointer in
task_struct.  However, this is not needed: As Oleg noted, if the
rt_mutex is placed in the rcu_node instead of on the booster's stack,
the boostee can simply check it to see if it owns the lock.  This commit
makes this change, shrinking task_struct by one pointer and the kernel
by twelve lines.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

 b/include/linux/init_task.h |    9 +--------
 b/include/linux/sched.h     |    6 ------
 b/kernel/rcu/tree.h         |    3 +++
 b/kernel/rcu/tree_plugin.h  |   20 +++++++++-----------
 4 files changed, 13 insertions(+), 25 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 6df7f9fe0d01..2bb4c4f3531a 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -102,12 +102,6 @@ extern struct group_info init_groups;
 #define INIT_IDS
 #endif
 
-#ifdef CONFIG_RCU_BOOST
-#define INIT_TASK_RCU_BOOST()						\
-	.rcu_boost_mutex = NULL,
-#else
-#define INIT_TASK_RCU_BOOST()
-#endif
 #ifdef CONFIG_TREE_PREEMPT_RCU
 #define INIT_TASK_RCU_TREE_PREEMPT()					\
 	.rcu_blocked_node = NULL,
@@ -119,8 +113,7 @@ extern struct group_info init_groups;
 	.rcu_read_lock_nesting = 0,					\
 	.rcu_read_unlock_special = 0,					\
 	.rcu_node_entry = LIST_HEAD_INIT(tsk.rcu_node_entry),		\
-	INIT_TASK_RCU_TREE_PREEMPT()					\
-	INIT_TASK_RCU_BOOST()
+	INIT_TASK_RCU_TREE_PREEMPT()
 #else
 #define INIT_TASK_RCU_PREEMPT(tsk)
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 25f54c79f757..1ffb275976da 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1222,9 +1222,6 @@ struct task_struct {
 #ifdef CONFIG_TREE_PREEMPT_RCU
 	struct rcu_node *rcu_blocked_node;
 #endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */
-#ifdef CONFIG_RCU_BOOST
-	struct rt_mutex *rcu_boost_mutex;
-#endif /* #ifdef CONFIG_RCU_BOOST */
 
 #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
 	struct sched_info sched_info;
@@ -1961,9 +1958,6 @@ static inline void rcu_copy_process(struct task_struct *p)
 #ifdef CONFIG_TREE_PREEMPT_RCU
 	p->rcu_blocked_node = NULL;
 #endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */
-#ifdef CONFIG_RCU_BOOST
-	p->rcu_boost_mutex = NULL;
-#endif /* #ifdef CONFIG_RCU_BOOST */
 	INIT_LIST_HEAD(&p->rcu_node_entry);
 }
 
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 31194ee9dfa6..db3f096ed80b 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -177,6 +177,9 @@ struct rcu_node {
 				/*  to carry out the boosting is fully */
 				/*  released with no future boostee accesses */
 				/*  before that rt_mutex is re-initialized. */
+	struct rt_mutex boost_mtx;
+				/* Used only for the priority-boosting */
+				/*  side effect, not as a lock. */
 	unsigned long boost_time;
 				/* When to start boosting (jiffies). */
 	struct task_struct *boost_kthread_task;
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 99743e9ea8ed..c93c525b71fe 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -29,6 +29,7 @@
 #include <linux/oom.h>
 #include <linux/smpboot.h>
 #include "../time/tick-internal.h"
+#include "../locking/rtmutex_common.h"
 
 #define RCU_KTHREAD_PRIO 1
 
@@ -398,11 +399,9 @@ void rcu_read_unlock_special(struct task_struct *t)
 #ifdef CONFIG_RCU_BOOST
 		if (&t->rcu_node_entry == rnp->boost_tasks)
 			rnp->boost_tasks = np;
-		/* Snapshot/clear ->rcu_boost_mutex with rcu_node lock held. */
-		if (t->rcu_boost_mutex) {
-			rbmp = t->rcu_boost_mutex;
-			t->rcu_boost_mutex = NULL;
-		}
+		/* Snapshot/clear ->boost_mutex with rcu_node lock held. */
+		if (rt_mutex_owner(&rnp->boost_mtx) == t)
+			rbmp = &rnp->boost_mtx;
 #endif /* #ifdef CONFIG_RCU_BOOST */
 
 		/*
@@ -1151,7 +1150,6 @@ static void rcu_wake_cond(struct task_struct *t, int status)
 static int rcu_boost(struct rcu_node *rnp)
 {
 	unsigned long flags;
-	struct rt_mutex mtx;
 	struct task_struct *t;
 	struct list_head *tb;
 
@@ -1202,14 +1200,14 @@ static int rcu_boost(struct rcu_node *rnp)
 	 * section.
 	 */
 	t = container_of(tb, struct task_struct, rcu_node_entry);
-	rt_mutex_init_proxy_locked(&mtx, t);
-	t->rcu_boost_mutex = &mtx;
+	rt_mutex_init_proxy_locked(&rnp->boost_mtx, t);
 	init_completion(&rnp->boost_completion);
 	raw_spin_unlock_irqrestore(&rnp->lock, flags);
-	rt_mutex_lock(&mtx);  /* Side effect: boosts task t's priority. */
-	rt_mutex_unlock(&mtx);  /* Keep lockdep happy. */
+	/* Lock only for side effect: boosts task t's priority. */
+	rt_mutex_lock(&rnp->boost_mtx);
+	rt_mutex_unlock(&rnp->boost_mtx);  /* Then keep lockdep happy. */
 
-	/* Wait until boostee is done accessing mtx before reinitializing. */
+	/* Wait for boostee to be done w/boost_mtx before reinitializing. */
 	wait_for_completion(&rnp->boost_completion);
 
 	return ACCESS_ONCE(rnp->exp_tasks) != NULL ||


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-12 20:35                                               ` Paul E. McKenney
  2014-06-12 21:40                                                 ` Thomas Gleixner
@ 2014-06-13 14:52                                                 ` Oleg Nesterov
  1 sibling, 0 replies; 63+ messages in thread
From: Oleg Nesterov @ 2014-06-13 14:52 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Thomas Gleixner, Steven Rostedt, Linus Torvalds, LKML,
	Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On 06/12, Paul E. McKenney wrote:
>
> On Thu, Jun 12, 2014 at 07:28:44PM +0200, Oleg Nesterov wrote:
> > On 06/11, Paul E. McKenney wrote:
> > >
> > > On Wed, Jun 11, 2014 at 07:59:34PM +0200, Oleg Nesterov wrote:
> > > > On 06/11, Paul E. McKenney wrote:
> > > > >
> > > > > I was thinking of ->boost_completion as the way to solve it easily, but
> > > > > what did you have in mind?
> > > >
> > > > I meant, rcu_boost() could probably just do "mtx->owner = t", we know that
> > > > it was unlocked by us and nobody else can use it until we set
> > > > t->rcu_boost_mutex.
> > >
> > > My concern with this is that rcu_read_unlock_special() could hypothetically
> > > get preempted (either by kernel or hypervisor), so that it might be a long
> > > time until it makes its reference.  But maybe that reference would be
> > > harmless in this case.
> >
> > Confused... Not sure I understand what did you mean, and certainly I do not
> > understand how this connects to the proxy-locking method.
> >
> > Could you explain?
>
> Here is the hypothetical sequence of events, which cannot happen unless
> the CPU releasing the lock accesses the structure after releasing
> the lock:
>
> 	CPU 0				CPU 1 (booster)
>
> 	releases boost_mutex
>
> 					acquires boost_mutex
> 					releases boost_mutex
> 					post-release boost_mutex access?
> 					Loops to next task to boost
> 					proxy-locks boost_mutex
>
> 	post-release boost_mutex access:
> 		confused due to proxy-lock
> 		operation?

But this is the same problem I was worried about. It should be fixed by the
patch from Thomas, but let's ignore this; let's assume that we should not rely
on it (and that is why you added the completion).

My point was, in this case we can separate "init" and "proxy_locked" and solve
the problem. If we initialize this mutex once, then "->owner = t" should
always be safe: whatever the "post-release" access above does, we were able to
lock (and unlock) this mutex, so nobody else (including "post-release") can
play with ->owner.
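
IOW, something like this sketch. (Note: I am using rt_mutex_set_owner() as a
hypothetical helper here; today it is static in kernel/locking/rtmutex.c, so
it would need to be exposed or open-coded.)

	/* done once, when the rcu_node is set up: */
	rt_mutex_init(&rnp->boost_mtx);

	/* rcu_boost(), under rnp->lock, instead of re-initializing the mutex: */
	rt_mutex_set_owner(&rnp->boost_mtx, t);	/* we just unlocked it ourselves, so
						   nobody else can play with ->owner now */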

But given that Thomas fixed rt_mutex, I think this all doesn't matter. And
I obviously like your last patch which moves boost_mtx into rcu_node ;)

> Now maybe this ends up being safe, but it sure feels like an accident
> waiting to happen.  Some bright developer comes up with a super-fast
> handoff,

Yes, initially I thought the same, but then I changed my mind. The -rt kernel
uses rt_mutex instead of spinlock_t, and this means that rt_mutex_unlock()
must be atomic just like spin_unlock().
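
I mean the usual pattern which relies on that, say (just a sketch):

	/* the final-reference pattern which needs an "atomic" unlock: */
	spin_lock(&obj->lock);
	free = (--obj->refcount == 0);
	spin_unlock(&obj->lock);
	/*
	 * If we did not drop the last reference, another CPU can lock,
	 * drop it, and kfree(obj) as soon as our unlock above makes its
	 * lock() possible, so the unlock must not touch obj->lock after
	 * that point.
	 */
	if (free)
		kfree(obj);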

> o	We -don't- hold ->lock when releasing the rt_mutex, but that
> 	should be OK: The owner is releasing it, and it is going to
> 	not-owned, so no other task can possibly see ownership moving
> 	to/from them.

Yes. I have a very minor comment here; I'll reply to the last version.

Oleg.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-12 21:40                                                 ` Thomas Gleixner
  2014-06-12 22:27                                                   ` Paul E. McKenney
@ 2014-06-13 14:55                                                   ` Oleg Nesterov
  2014-06-13 16:10                                                     ` Thomas Gleixner
  1 sibling, 1 reply; 63+ messages in thread
From: Oleg Nesterov @ 2014-06-13 14:55 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Paul E. McKenney, Steven Rostedt, Linus Torvalds, LKML,
	Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On 06/12, Thomas Gleixner wrote:
>
> Just FYI, I have a patch pending which makes the release safe.
>
>       http://marc.info/?l=linux-kernel&m=140251240630730&w=2

Aha, great. I didn't see this version, and just in case I think it is fine.

Perhaps you can look at http://marc.info/?t=140207370200003 ?

Oleg.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-12 23:19                                                     ` Paul E. McKenney
@ 2014-06-13 15:08                                                       ` Oleg Nesterov
  2014-06-15  5:40                                                         ` Paul E. McKenney
  0 siblings, 1 reply; 63+ messages in thread
From: Oleg Nesterov @ 2014-06-13 15:08 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Thomas Gleixner, Steven Rostedt, Linus Torvalds, LKML,
	Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On 06/12, Paul E. McKenney wrote:
>
> @@ -398,11 +399,9 @@ void rcu_read_unlock_special(struct task_struct *t)
>  #ifdef CONFIG_RCU_BOOST
>  		if (&t->rcu_node_entry == rnp->boost_tasks)
>  			rnp->boost_tasks = np;
> -		/* Snapshot/clear ->rcu_boost_mutex with rcu_node lock held. */
> -		if (t->rcu_boost_mutex) {
> -			rbmp = t->rcu_boost_mutex;
> -			t->rcu_boost_mutex = NULL;
> -		}
> +		/* Snapshot/clear ->boost_mutex with rcu_node lock held. */
> +		if (rt_mutex_owner(&rnp->boost_mtx) == t)
> +			rbmp = &rnp->boost_mtx;

The comment above looks confusing after this change ;) We do not clear it,
and it doesn't explain "with rcu_node lock held".

And, with or without this change, it is not obvious why we need "rbmp";
after this patch this becomes even less obvious.

This is subjective of course, but perhaps it would be more understandable
to do

	bool xxx;

	...

	// Check this under rcu_node lock to ensure that unlock below
	// can't race with rt_mutex_init_proxy_locked() in progress.
	xxx = rt_mutex_owner(&rnp->boost_mtx) == t;

	...

	// rnp->lock was dropped
	if (xxx)
		rt_mutex_unlock(&rnp->boost_mtx);


But this is very minor, I won't insist of course. Mostly I am just trying
to check my understanding.

Oleg.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-13 14:55                                                   ` Oleg Nesterov
@ 2014-06-13 16:10                                                     ` Thomas Gleixner
  2014-06-13 16:19                                                       ` Oleg Nesterov
  0 siblings, 1 reply; 63+ messages in thread
From: Thomas Gleixner @ 2014-06-13 16:10 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Paul E. McKenney, Steven Rostedt, Linus Torvalds, LKML,
	Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On Fri, 13 Jun 2014, Oleg Nesterov wrote:

> On 06/12, Thomas Gleixner wrote:
> >
> > Just FYI, I have a patch pending which makes the release safe.
> >
> >       http://marc.info/?l=linux-kernel&m=140251240630730&w=2
> 
> Aha, great. I didn't see this version, and just in case I think it is fine.
> 
> Perhaps you can look at http://marc.info/?t=140207370200003 ?

Looks good. Should I queue it with the other rtmutex stuff?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-13 16:10                                                     ` Thomas Gleixner
@ 2014-06-13 16:19                                                       ` Oleg Nesterov
  0 siblings, 0 replies; 63+ messages in thread
From: Oleg Nesterov @ 2014-06-13 16:19 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Paul E. McKenney, Steven Rostedt, Linus Torvalds, LKML,
	Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On 06/13, Thomas Gleixner wrote:
>
> On Fri, 13 Jun 2014, Oleg Nesterov wrote:
>
> > Perhaps you can look at http://marc.info/?t=140207370200003 ?
>
> Looks good. Should I queue it with the other rtmutex stuff?

That would be great, thanks ;)

Oleg.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-13 15:08                                                       ` Oleg Nesterov
@ 2014-06-15  5:40                                                         ` Paul E. McKenney
  2014-06-17 18:57                                                           ` Paul E. McKenney
  0 siblings, 1 reply; 63+ messages in thread
From: Paul E. McKenney @ 2014-06-15  5:40 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Thomas Gleixner, Steven Rostedt, Linus Torvalds, LKML,
	Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On Fri, Jun 13, 2014 at 05:08:30PM +0200, Oleg Nesterov wrote:
> On 06/12, Paul E. McKenney wrote:
> >
> > @@ -398,11 +399,9 @@ void rcu_read_unlock_special(struct task_struct *t)
> >  #ifdef CONFIG_RCU_BOOST
> >  		if (&t->rcu_node_entry == rnp->boost_tasks)
> >  			rnp->boost_tasks = np;
> > -		/* Snapshot/clear ->rcu_boost_mutex with rcu_node lock held. */
> > -		if (t->rcu_boost_mutex) {
> > -			rbmp = t->rcu_boost_mutex;
> > -			t->rcu_boost_mutex = NULL;
> > -		}
> > +		/* Snapshot/clear ->boost_mutex with rcu_node lock held. */
> > +		if (rt_mutex_owner(&rnp->boost_mtx) == t)
> > +			rbmp = &rnp->boost_mtx;
> 
> The comment above looks confusing after this change ;) We do not clear it,
> and it doesn't explain "with rcu_node lock held".
> 
> And, with or without this change it is not obvious why do we need "rbmp",
> after this patch this becomes even more unobvious.
> 
> This is subjective of course, but perhaps it would be more understandable
> to do
> 
> 	bool xxx;
> 
> 	...
> 
> 	// Check this under rcu_node lock to ensure that unlock below
> 	// can't race with rt_mutex_init_proxy_locked() in progress.
> 	xxx = rt_mutex_owner(&rnp->boost_mtx) == t;
> 
> 	...
> 
> 	// rnp->lock was dropped
> 	if (xxx)
> 		rt_mutex_unlock(&rnp->boost_mtx);
> 
> 
> But this is very minor, I won't insist of course. Mostly I am just trying
> to check my understanding.

No, this is good, and I will update accordingly.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-15  5:40                                                         ` Paul E. McKenney
@ 2014-06-17 18:57                                                           ` Paul E. McKenney
  2014-06-18 16:43                                                             ` Oleg Nesterov
  0 siblings, 1 reply; 63+ messages in thread
From: Paul E. McKenney @ 2014-06-17 18:57 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Thomas Gleixner, Steven Rostedt, Linus Torvalds, LKML,
	Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On Sat, Jun 14, 2014 at 10:40:58PM -0700, Paul E. McKenney wrote:
> On Fri, Jun 13, 2014 at 05:08:30PM +0200, Oleg Nesterov wrote:
> > On 06/12, Paul E. McKenney wrote:
> > >
> > > @@ -398,11 +399,9 @@ void rcu_read_unlock_special(struct task_struct *t)
> > >  #ifdef CONFIG_RCU_BOOST
> > >  		if (&t->rcu_node_entry == rnp->boost_tasks)
> > >  			rnp->boost_tasks = np;
> > > -		/* Snapshot/clear ->rcu_boost_mutex with rcu_node lock held. */
> > > -		if (t->rcu_boost_mutex) {
> > > -			rbmp = t->rcu_boost_mutex;
> > > -			t->rcu_boost_mutex = NULL;
> > > -		}
> > > +		/* Snapshot/clear ->boost_mutex with rcu_node lock held. */
> > > +		if (rt_mutex_owner(&rnp->boost_mtx) == t)
> > > +			rbmp = &rnp->boost_mtx;
> > 
> > The comment above looks confusing after this change ;) We do not clear it,
> > and it doesn't explain "with rcu_node lock held".
> > 
> > And, with or without this change it is not obvious why do we need "rbmp",
> > after this patch this becomes even more unobvious.
> > 
> > This is subjective of course, but perhaps it would be more understandable
> > to do
> > 
> > 	bool xxx;
> > 
> > 	...
> > 
> > 	// Check this under rcu_node lock to ensure that unlock below
> > 	// can't race with rt_mutex_init_proxy_locked() in progress.
> > 	xxx = rt_mutex_owner(&rnp->boost_mtx) == t;
> > 
> > 	...
> > 
> > 	// rnp->lock was dropped
> > 	if (xxx)
> > 		rt_mutex_unlock(&rnp->boost_mtx);
> > 
> > 
> > But this is very minor, I won't insist of course. Mostly I am just trying
> > to check my understanding.
> 
> No, this is good, and I will update accordingly.

I suppose I could have included the patch...

							Thanx, Paul

------------------------------------------------------------------------

rcu: Simplify priority boosting by putting rt_mutex in rcu_node

RCU priority boosting currently checks for boosting via a pointer in
task_struct.  However, this is not needed: As Oleg noted, if the
rt_mutex is placed in the rcu_node instead of on the booster's stack,
the boostee can simply check it to see if it owns the lock.  This commit
makes this change, shrinking task_struct by one pointer and the kernel
by thirteen lines.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 6df7f9fe0d01..2bb4c4f3531a 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -102,12 +102,6 @@ extern struct group_info init_groups;
 #define INIT_IDS
 #endif
 
-#ifdef CONFIG_RCU_BOOST
-#define INIT_TASK_RCU_BOOST()						\
-	.rcu_boost_mutex = NULL,
-#else
-#define INIT_TASK_RCU_BOOST()
-#endif
 #ifdef CONFIG_TREE_PREEMPT_RCU
 #define INIT_TASK_RCU_TREE_PREEMPT()					\
 	.rcu_blocked_node = NULL,
@@ -119,8 +113,7 @@ extern struct group_info init_groups;
 	.rcu_read_lock_nesting = 0,					\
 	.rcu_read_unlock_special = 0,					\
 	.rcu_node_entry = LIST_HEAD_INIT(tsk.rcu_node_entry),		\
-	INIT_TASK_RCU_TREE_PREEMPT()					\
-	INIT_TASK_RCU_BOOST()
+	INIT_TASK_RCU_TREE_PREEMPT()
 #else
 #define INIT_TASK_RCU_PREEMPT(tsk)
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 306f4f0c987a..3cfbc05e66e6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1270,9 +1270,6 @@ struct task_struct {
 #ifdef CONFIG_TREE_PREEMPT_RCU
 	struct rcu_node *rcu_blocked_node;
 #endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */
-#ifdef CONFIG_RCU_BOOST
-	struct rt_mutex *rcu_boost_mutex;
-#endif /* #ifdef CONFIG_RCU_BOOST */
 
 #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
 	struct sched_info sched_info;
@@ -2009,9 +2006,6 @@ static inline void rcu_copy_process(struct task_struct *p)
 #ifdef CONFIG_TREE_PREEMPT_RCU
 	p->rcu_blocked_node = NULL;
 #endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */
-#ifdef CONFIG_RCU_BOOST
-	p->rcu_boost_mutex = NULL;
-#endif /* #ifdef CONFIG_RCU_BOOST */
 	INIT_LIST_HEAD(&p->rcu_node_entry);
 }
 
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 31194ee9dfa6..db3f096ed80b 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -177,6 +177,9 @@ struct rcu_node {
 				/*  to carry out the boosting is fully */
 				/*  released with no future boostee accesses */
 				/*  before that rt_mutex is re-initialized. */
+	struct rt_mutex boost_mtx;
+				/* Used only for the priority-boosting */
+				/*  side effect, not as a lock. */
 	unsigned long boost_time;
 				/* When to start boosting (jiffies). */
 	struct task_struct *boost_kthread_task;
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index dc98cacfef21..d8ae20f5ca87 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -29,6 +29,7 @@
 #include <linux/oom.h>
 #include <linux/smpboot.h>
 #include "../time/tick-internal.h"
+#include "../locking/rtmutex_common.h"
 
 #define RCU_KTHREAD_PRIO 1
 
@@ -336,7 +337,7 @@ void rcu_read_unlock_special(struct task_struct *t)
 	unsigned long flags;
 	struct list_head *np;
 #ifdef CONFIG_RCU_BOOST
-	struct rt_mutex *rbmp = NULL;
+	bool drop_boost_mutex = false;
 #endif /* #ifdef CONFIG_RCU_BOOST */
 	struct rcu_node *rnp;
 	int special;
@@ -398,11 +399,8 @@ void rcu_read_unlock_special(struct task_struct *t)
 #ifdef CONFIG_RCU_BOOST
 		if (&t->rcu_node_entry == rnp->boost_tasks)
 			rnp->boost_tasks = np;
-		/* Snapshot/clear ->rcu_boost_mutex with rcu_node lock held. */
-		if (t->rcu_boost_mutex) {
-			rbmp = t->rcu_boost_mutex;
-			t->rcu_boost_mutex = NULL;
-		}
+		/* Snapshot ->boost_mtx ownership with rcu_node lock held. */
+		drop_boost_mutex = rt_mutex_owner(&rnp->boost_mtx) == t;
 #endif /* #ifdef CONFIG_RCU_BOOST */
 
 		/*
@@ -427,8 +425,8 @@ void rcu_read_unlock_special(struct task_struct *t)
 
 #ifdef CONFIG_RCU_BOOST
 		/* Unboost if we were boosted. */
-		if (rbmp) {
-			rt_mutex_unlock(rbmp);
+		if (drop_boost_mutex) {
+			rt_mutex_unlock(&rnp->boost_mtx);
 			complete(&rnp->boost_completion);
 		}
 #endif /* #ifdef CONFIG_RCU_BOOST */
@@ -1151,7 +1149,6 @@ static void rcu_wake_cond(struct task_struct *t, int status)
 static int rcu_boost(struct rcu_node *rnp)
 {
 	unsigned long flags;
-	struct rt_mutex mtx;
 	struct task_struct *t;
 	struct list_head *tb;
 
@@ -1202,14 +1199,14 @@ static int rcu_boost(struct rcu_node *rnp)
 	 * section.
 	 */
 	t = container_of(tb, struct task_struct, rcu_node_entry);
-	rt_mutex_init_proxy_locked(&mtx, t);
-	t->rcu_boost_mutex = &mtx;
+	rt_mutex_init_proxy_locked(&rnp->boost_mtx, t);
 	init_completion(&rnp->boost_completion);
 	raw_spin_unlock_irqrestore(&rnp->lock, flags);
-	rt_mutex_lock(&mtx);  /* Side effect: boosts task t's priority. */
-	rt_mutex_unlock(&mtx);  /* Keep lockdep happy. */
+	/* Lock only for side effect: boosts task t's priority. */
+	rt_mutex_lock(&rnp->boost_mtx);
+	rt_mutex_unlock(&rnp->boost_mtx);  /* Then keep lockdep happy. */
 
-	/* Wait until boostee is done accessing mtx before reinitializing. */
+	/* Wait for boostee to be done w/boost_mtx before reinitializing. */
 	wait_for_completion(&rnp->boost_completion);
 
 	return ACCESS_ONCE(rnp->exp_tasks) != NULL ||


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-17 18:57                                                           ` Paul E. McKenney
@ 2014-06-18 16:43                                                             ` Oleg Nesterov
  2014-06-18 16:53                                                               ` Steven Rostedt
  2014-06-18 17:00                                                               ` Paul E. McKenney
  0 siblings, 2 replies; 63+ messages in thread
From: Oleg Nesterov @ 2014-06-18 16:43 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Thomas Gleixner, Steven Rostedt, Linus Torvalds, LKML,
	Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On 06/17, Paul E. McKenney wrote:
>
> +		if (drop_boost_mutex) {
> +			rt_mutex_unlock(&rnp->boost_mtx);
>  			complete(&rnp->boost_completion);

Well, I still do not understand this ->boost_completion...

> -	/* Wait until boostee is done accessing mtx before reinitializing. */
> +	/* Wait for boostee to be done w/boost_mtx before reinitializing. */
>  	wait_for_completion(&rnp->boost_completion);

OK, at least we have a comment.

But let me repeat. Thomas has already fixed rt_mutex, unlock is atomic.
It doesn't touch this memory after it makes another lock() possible.

And (contrary to what I said initially) we can rely on this because -rt
converts spinlock_t into rt_mutex ?

Oleg.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-18 16:43                                                             ` Oleg Nesterov
@ 2014-06-18 16:53                                                               ` Steven Rostedt
  2014-06-21 19:54                                                                 ` Thomas Gleixner
  2014-06-18 17:00                                                               ` Paul E. McKenney
  1 sibling, 1 reply; 63+ messages in thread
From: Steven Rostedt @ 2014-06-18 16:53 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Paul E. McKenney, Thomas Gleixner, Linus Torvalds, LKML,
	Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On Wed, 18 Jun 2014 18:43:59 +0200
Oleg Nesterov <oleg@redhat.com> wrote:

 
> And (contrary to what I said initially) we can rely on this because -rt
> converts spinlock_t into rt_mutex ?

Correct. Because if spinlock_t has this behavior, rt_mutex must have it
too, otherwise -rt will suffer greatly from that. Who knows, maybe this
will fix some strange bug reports that we have had in the past.

-- Steve

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-18 16:43                                                             ` Oleg Nesterov
  2014-06-18 16:53                                                               ` Steven Rostedt
@ 2014-06-18 17:00                                                               ` Paul E. McKenney
  1 sibling, 0 replies; 63+ messages in thread
From: Paul E. McKenney @ 2014-06-18 17:00 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Thomas Gleixner, Steven Rostedt, Linus Torvalds, LKML,
	Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On Wed, Jun 18, 2014 at 06:43:59PM +0200, Oleg Nesterov wrote:
> On 06/17, Paul E. McKenney wrote:
> >
> > +		if (drop_boost_mutex) {
> > +			rt_mutex_unlock(&rnp->boost_mtx);
> >  			complete(&rnp->boost_completion);
> 
> Well, I still do not understand this ->boost_completion...
> 
> > -	/* Wait until boostee is done accessing mtx before reinitializing. */
> > +	/* Wait for boostee to be done w/boost_mtx before reinitializing. */
> >  	wait_for_completion(&rnp->boost_completion);
> 
> OK, at least we have a comment.
> 
> But let me repeat. Thomas has already fixed rt_mutex, unlock is atomic.
> It doesn't touch this memory after it makes another lock() possible.
> 
> And (contrary to what I said initially) we can rely on this because -rt
> converts spinlock_t into rt_mutex ?

Well, perhaps I should rein in my paranoia on this one.  That said, the
cost of my paranoia is minimal in this slowpath.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-18 16:53                                                               ` Steven Rostedt
@ 2014-06-21 19:54                                                                 ` Thomas Gleixner
  0 siblings, 0 replies; 63+ messages in thread
From: Thomas Gleixner @ 2014-06-21 19:54 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Oleg Nesterov, Paul E. McKenney, Linus Torvalds, LKML,
	Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On Wed, 18 Jun 2014, Steven Rostedt wrote:

> On Wed, 18 Jun 2014 18:43:59 +0200
> Oleg Nesterov <oleg@redhat.com> wrote:
> 
>  
> > And (contrary to what I said initially) we can rely on this because -rt
> > converts spinlock_t into rt_mutex ?
> 
> Correct. Because if spinlock_t has this behavior, rt_mutex must have it
> too, otherwise -rt will suffer greatly from that. Who knows, maybe this
> will fix some strange bug reports that we have had in the past.

Indeed. I found a few backtraces from Carsten's test farm, where stuff
explodes in the slowpath raw_spin_unlock call. Happens once a year or
never ...

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 63+ messages in thread

end of thread, other threads:[~2014-06-21 19:55 UTC | newest]

Thread overview: 63+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-06-03 17:02 [BUG] signal: sighand unprotected when accessed by /proc Steven Rostedt
2014-06-03 17:26 ` Oleg Nesterov
2014-06-03 18:03   ` Linus Torvalds
2014-06-03 20:01     ` Oleg Nesterov
2014-06-03 20:03       ` Oleg Nesterov
2014-06-06 20:33       ` Paul E. McKenney
2014-06-08 13:07         ` safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) Oleg Nesterov
2014-06-09 16:26           ` Paul E. McKenney
2014-06-09 18:15             ` Oleg Nesterov
2014-06-09 18:29               ` Steven Rostedt
2014-06-09 18:51                 ` Linus Torvalds
2014-06-09 19:41                   ` Steven Rostedt
2014-06-10  8:53                     ` Thomas Gleixner
2014-06-10 16:57                       ` Oleg Nesterov
2014-06-10 18:08                         ` Thomas Gleixner
2014-06-10 18:13                           ` Steven Rostedt
2014-06-10 20:05                             ` Thomas Gleixner
2014-06-10 20:13                               ` Thomas Gleixner
2014-06-11 15:52                                 ` Paul E. McKenney
2014-06-11 17:07                                   ` Oleg Nesterov
2014-06-11 17:17                                     ` Oleg Nesterov
2014-06-11 17:29                                       ` Paul E. McKenney
2014-06-11 17:59                                         ` Oleg Nesterov
2014-06-11 19:56                                           ` Paul E. McKenney
2014-06-12 17:28                                             ` Oleg Nesterov
2014-06-12 20:35                                               ` Paul E. McKenney
2014-06-12 21:40                                                 ` Thomas Gleixner
2014-06-12 22:27                                                   ` Paul E. McKenney
2014-06-12 23:19                                                     ` Paul E. McKenney
2014-06-13 15:08                                                       ` Oleg Nesterov
2014-06-15  5:40                                                         ` Paul E. McKenney
2014-06-17 18:57                                                           ` Paul E. McKenney
2014-06-18 16:43                                                             ` Oleg Nesterov
2014-06-18 16:53                                                               ` Steven Rostedt
2014-06-21 19:54                                                                 ` Thomas Gleixner
2014-06-18 17:00                                                               ` Paul E. McKenney
2014-06-13 14:55                                                   ` Oleg Nesterov
2014-06-13 16:10                                                     ` Thomas Gleixner
2014-06-13 16:19                                                       ` Oleg Nesterov
2014-06-13 14:52                                                 ` Oleg Nesterov
2014-06-11 17:27                                     ` Paul E. McKenney
2014-06-10 17:07                       ` Oleg Nesterov
2014-06-10 17:51                         ` Thomas Gleixner
2014-06-10 12:56                   ` Paul E. McKenney
2014-06-10 14:48                     ` Peter Zijlstra
2014-06-10 15:18                       ` Paul E. McKenney
2014-06-10 15:35                     ` Linus Torvalds
2014-06-10 16:15                       ` Paul E. McKenney
2014-06-09 19:04                 ` Oleg Nesterov
2014-06-10  8:37             ` Peter Zijlstra
2014-06-10 12:52               ` Paul E. McKenney
2014-06-10 13:01                 ` Peter Zijlstra
2014-06-10 14:36                   ` Paul E. McKenney
2014-06-10 15:20                     ` Paul E. McKenney
2014-06-03 20:05     ` [BUG] signal: sighand unprotected when accessed by /proc Steven Rostedt
2014-06-03 20:09       ` Oleg Nesterov
2014-06-03 20:15         ` Steven Rostedt
2014-06-03 20:25         ` Steven Rostedt
2014-06-03 21:12           ` Thomas Gleixner
2014-06-03 18:05   ` Steven Rostedt
2014-06-03 19:25     ` Oleg Nesterov
2014-06-04  1:16       ` Steven Rostedt
2014-06-04 16:31         ` Oleg Nesterov
