* [BUG] signal: sighand unprotected when accessed by /proc
@ 2014-06-03 17:02 Steven Rostedt
  2014-06-03 17:26 ` Oleg Nesterov
  0 siblings, 1 reply; 63+ messages in thread

From: Steven Rostedt @ 2014-06-03 17:02 UTC (permalink / raw)
To: LKML
Cc: Oleg Nesterov, Thomas Gleixner, Peter Zijlstra, Andrew Morton,
    Ingo Molnar, Clark Williams, Linus Torvalds

We were able to trigger this bug in -rt, and by review, I'm thinking
that this could very well be a mainline bug too. I had our QA team add
a trace patch to the kernel to prove my analysis, and it did.

Here's the patch:

http://rostedt.homelinux.com/private/sighand-trace.patch

Let me try to explain the bug:

	CPU0				CPU1
	----				----
 [ read of /proc/<pid>/stat ]
 get_task_struct();
 [...]
				[ <pid> exits ]
				[ parent does wait on <pid> ]
				wait_task_zombie()
				  release_task()
				    proc_flush_task()
		/* the above removes new access
		   to the /proc system */
				    __exit_signal()
				      __cleanup_sighand(sighand);
				        atomic_dec_and_test(sighand->count);
 do_task_stat()
   lock_task_sighand(task);
     sighand = rcu_dereference(tsk->sighand);
				        kmem_cache_free(sighand);
     if (sighand != NULL)
       spin_lock(sighand->siglock);

	** BOOM! use after free **

Seems there is no protection between reading the sighand from proc and
freeing it. The sighand->count is not updated, and the sighand is not
freed via rcu. It wouldn't matter anyway, because the use of sighand is
not totally protected by rcu. We take a lot of care to free tsk->signal,
but we are a bit sloppy in protecting sighand.

Here's the trace from our last crash:

  ps-23716     1.......  11201242344176: __lock_task_sighand: lock sighand=ffff8801022aee80 from sh:23718
  make-23672   1.....11  11201242409218: release_task: freeing sighand ffff8801022aee80 for sh:23718
  make-23672   1.....11  11201242409980: __cleanup_sighand: free sighand ffff8801022aee80
  make-23672   1.....11  11201242410380: __cleanup_sighand: sighand ffff8801022aee80 freed
  ps-23716     1d...211  11201243007052: __list_del_entry: BUG HIT!
Let me explain the differences between -rt and mainline here.

One, the spinlock in -rt is an rtmutex. The list_del_entry() bug is the
task trying to remove itself from sighand->lock->wait_list. As the lock
has been freed, the list head of the rtmutex is corrupted.

Another difference here is that all of this ran on CPU 1. In mainline,
that could not happen because the spinlock prevents preemption. But
still, I don't see anything that could stop this from happening on
multiple CPUs. Again, the bug does exist in mainline. I believe the only
reason that we do not trigger it (often) is that there's a lot that the
reading of /proc needs to do between opening the proc files and then
reading the sighand->lock. Much more than the cleanup path does between
removing /proc and freeing sighand. But with an interrupt going off at
the right time slowing down the fast path, we can still hit this race.

This is the beauty of the -rt patch. It can simulate hard-to-hit races
nicely, due to the preemptive nature of the spinlocks.

I do not see any trivial fix for this. I figured I'd show this to the
signal gurus before trying to redesign the way sighand gets freed or
used by /proc.

-- Steve

^ permalink raw reply	[flat|nested] 63+ messages in thread
* Re: [BUG] signal: sighand unprotected when accessed by /proc
  2014-06-03 17:02 [BUG] signal: sighand unprotected when accessed by /proc Steven Rostedt
@ 2014-06-03 17:26 ` Oleg Nesterov
  2014-06-03 18:03   ` Linus Torvalds
  2014-06-03 18:05   ` Steven Rostedt
  0 siblings, 2 replies; 63+ messages in thread

From: Oleg Nesterov @ 2014-06-03 17:26 UTC (permalink / raw)
To: Steven Rostedt
Cc: LKML, Thomas Gleixner, Peter Zijlstra, Andrew Morton, Ingo Molnar,
    Clark Williams, Linus Torvalds

On 06/03, Steven Rostedt wrote:
>
> We were able to trigger this bug in -rt, and by review, I'm thinking
> that this could very well be a mainline bug too. I had our QA team add
> a trace patch to the kernel to prove my analysis, and it did.
>
> Here's the patch:
>
> http://rostedt.homelinux.com/private/sighand-trace.patch
>
> Let me try to explain the bug:
>
>	CPU0				CPU1
>	----				----
> [ read of /proc/<pid>/stat ]
> get_task_struct();
> [...]
>				[ <pid> exits ]
>				[ parent does wait on <pid> ]
>				wait_task_zombie()
>				  release_task()
>				    proc_flush_task()
>		/* the above removes new access
>		   to the /proc system */
>				    __exit_signal()
>				      __cleanup_sighand(sighand);
>				        atomic_dec_and_test(sighand->count);
> do_task_stat()
>   lock_task_sighand(task);
>     sighand = rcu_dereference(tsk->sighand);
>				        kmem_cache_free(sighand);
>     if (sighand != NULL)
>       spin_lock(sighand->siglock);
>
>	** BOOM! use after free **

Yes, ->sighand can be already freed at this point, but this should be
fine because sighand_cachep is SLAB_DESTROY_BY_RCU. That is why
lock_task_sighand() does rcu_read_lock() and re-checks
sighand == tsk->sighand after it takes ->siglock. It is fine if it was
already freed or even reallocated via kmem_cache_alloc(sighand_cachep).

We only need to ensure (and SLAB_DESTROY_BY_RCU should ensure this) that
this memory won't be returned to the system, so this piece of memory
must be a "struct sighand" with a properly initialized ->siglock until
rcu_read_unlock().
> Seems there is no protection between reading the sighand from proc and
> freeing it. The sighand->count is not updated, and the sighand is not
> freed via rcu.

See above.

> One, the spinlock in -rt is an rtmutex. The list_del_entry() bug is the
> task trying to remove itself from sighand->lock->wait_list. As the lock
> has been freed, the list head of the rtmutex is corrupted.

Looks like the SLAB_DESTROY_BY_RCU logic is broken?

Oleg.

^ permalink raw reply	[flat|nested] 63+ messages in thread
* Re: [BUG] signal: sighand unprotected when accessed by /proc
  2014-06-03 17:26 ` Oleg Nesterov
@ 2014-06-03 18:03   ` Linus Torvalds
  2014-06-03 20:01     ` Oleg Nesterov
  2014-06-03 20:05     ` [BUG] signal: sighand unprotected when accessed by /proc Steven Rostedt
  1 sibling, 2 replies; 63+ messages in thread

From: Linus Torvalds @ 2014-06-03 18:03 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Steven Rostedt, LKML, Thomas Gleixner, Peter Zijlstra,
    Andrew Morton, Ingo Molnar, Clark Williams

On Tue, Jun 3, 2014 at 10:26 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>
> looks like, SLAB_DESTROY_BY_RCU logic is broken?

I haven't looked at the code yet, but SLAB_DESTROY_BY_RCU can be subtle
and very dangerous.

The danger is that the *slab* itself is free'd by RCU, but individual
allocations can (and do) get re-used FOR THE SAME OBJECT TYPE without
waiting for RCU!

This is subtle. It means that most people who think that "it's free'd
by RCU" get it wrong. Because individual allocations really aren't at
all RCU-free'd, it's just that the underlying memory is guaranteed to
not change type or be entirely thrown away until after an RCU grace
period.

That means that the only things you can use rcu-delayed are:

 (a) pure optimistic reads

 (b) idempotent things that are initialized by the constructor, and
     _not_ by the allocation.

Nothing else.

An example of (b) is having a lock that is initialized by the
constructor, and that everybody else uses properly: you may have
delayed people locking/unlocking it, but as long as they don't screw
anything else up they won't break even if the data structure was
re-used by a new allocation.

Without looking at the code, it sounds like somebody may be doing things
to "sighand->lock->wait_list" that they shouldn't do.
We've had cases like that before, and most of them have been changed to *not* use SLAB_DESTROY_BY_RCU, and instead make each individual allocation be RCU-free'd (which is a lot simpler to think about, because then you don't have the whole re-use issue). Of course, it could easily be lack of any RCU protection too. As mentioned, I didn't really check the code. I just wanted to clarify this subtlety, because while I think Oleg knows about it, maybe others didn't quite catch that subtlety. And this could easily be an RT issue, if the RT code does some re-initialization of the rtmutex that replaces the spinlock we have. Linus ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [BUG] signal: sighand unprotected when accessed by /proc 2014-06-03 18:03 ` Linus Torvalds @ 2014-06-03 20:01 ` Oleg Nesterov 2014-06-03 20:03 ` Oleg Nesterov 2014-06-06 20:33 ` Paul E. McKenney 2014-06-03 20:05 ` [BUG] signal: sighand unprotected when accessed by /proc Steven Rostedt 1 sibling, 2 replies; 63+ messages in thread From: Oleg Nesterov @ 2014-06-03 20:01 UTC (permalink / raw) To: Linus Torvalds Cc: Steven Rostedt, LKML, Thomas Gleixner, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams On 06/03, Linus Torvalds wrote: > > On Tue, Jun 3, 2014 at 10:26 AM, Oleg Nesterov <oleg@redhat.com> wrote: > > > > looks like, SLAB_DESTROY_BY_RCU logic is broken? > > I haven't looked at the code yet, but SLAB_DESTROY_BY_RCU can be > subtle and very dangerous. > > The danger is that the *slab* itself is free'd by RCU, but individual > allocations can (and do) get re-used FOR THE SAME OBJECT TYPE without > waiting for RCU! > > This is subtle. It means that most people who think that "it's free'd > by RCU" get it wrong. Because individual allocations really aren't at > all RCU-free'd, it's just that the underlying memory is guaranteed to > not change type or be entirely thrown away until after a RCU grace > period. Yes, exactly. And unless you use current->sighand (which is obviously stable) you need lock_task_sighand() which relies on ->siglock initialized by sighand_ctor(). > Without looking at the code, it sounds like somebody may doing things > to "sighand->lock->wait_list" that they shouldn't do. We've had cases > like that before, and most of them have been changed to *not* use > SLAB_DESTROY_BY_RCU, and instead make each individual allocation be > RCU-free'd (which is a lot simpler to think about, because then you > don't have the whole re-use issue). Sure, we only need to change __cleanup_sighand() to use call_rcu(). But I am not sure this makes sense, I mean, I do not think this can make something more simple/clear. 
> And this could easily be an RT issue, if the RT code does some
> re-initialization of the rtmutex that replaces the spinlock we have.

Unlikely... this should be done by sighand_ctor() anyway.

I'll try to recheck rt_mutex_unlock() tomorrow. _Perhaps_ rcu_read_unlock()
should be shifted from lock_task_sighand() to unlock_task_sighand() to
ensure that rt_mutex_unlock() does nothing with this memory after it
makes another lock/unlock possible.

But if we need this (currently I do not think so), this doesn't depend
on SLAB_DESTROY_BY_RCU. And, at first glance, in this case
rcu_read_unlock_special() might be wrong too.

Oleg.

^ permalink raw reply	[flat|nested] 63+ messages in thread
* Re: [BUG] signal: sighand unprotected when accessed by /proc
  2014-06-03 20:01 ` Oleg Nesterov
@ 2014-06-03 20:03   ` Oleg Nesterov
  2014-06-06 20:33   ` Paul E. McKenney
  1 sibling, 0 replies; 63+ messages in thread

From: Oleg Nesterov @ 2014-06-03 20:03 UTC (permalink / raw)
To: Linus Torvalds
Cc: Steven Rostedt, LKML, Thomas Gleixner, Peter Zijlstra,
    Andrew Morton, Ingo Molnar, Clark Williams

Damn, sorry for noise...

On 06/03, Oleg Nesterov wrote:
>
> I'll try to recheck rt_mutex_unlock() tomorrow. _Perhaps_ rcu_read_unlock()
> should be shifted from lock_task_sighand() to unlock_task_sighand() to
> ensure that rt_mutex_unlock() does nothing with this memory after it
> makes another lock/unlock possible.

I meant, we need to ensure it does nothing with this memory, otherwise
we need to delay rcu_read_unlock() till unlock_task_sighand().

Oleg.

^ permalink raw reply	[flat|nested] 63+ messages in thread
* Re: [BUG] signal: sighand unprotected when accessed by /proc 2014-06-03 20:01 ` Oleg Nesterov 2014-06-03 20:03 ` Oleg Nesterov @ 2014-06-06 20:33 ` Paul E. McKenney 2014-06-08 13:07 ` safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) Oleg Nesterov 1 sibling, 1 reply; 63+ messages in thread From: Paul E. McKenney @ 2014-06-06 20:33 UTC (permalink / raw) To: Oleg Nesterov Cc: Linus Torvalds, Steven Rostedt, LKML, Thomas Gleixner, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams On Tue, Jun 03, 2014 at 10:01:25PM +0200, Oleg Nesterov wrote: > On 06/03, Linus Torvalds wrote: > > > > On Tue, Jun 3, 2014 at 10:26 AM, Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > looks like, SLAB_DESTROY_BY_RCU logic is broken? > > > > I haven't looked at the code yet, but SLAB_DESTROY_BY_RCU can be > > subtle and very dangerous. > > > > The danger is that the *slab* itself is free'd by RCU, but individual > > allocations can (and do) get re-used FOR THE SAME OBJECT TYPE without > > waiting for RCU! > > > > This is subtle. It means that most people who think that "it's free'd > > by RCU" get it wrong. Because individual allocations really aren't at > > all RCU-free'd, it's just that the underlying memory is guaranteed to > > not change type or be entirely thrown away until after a RCU grace > > period. > > Yes, exactly. And unless you use current->sighand (which is obviously > stable) you need lock_task_sighand() which relies on ->siglock initialized > by sighand_ctor(). > > > Without looking at the code, it sounds like somebody may doing things > > to "sighand->lock->wait_list" that they shouldn't do. We've had cases > > like that before, and most of them have been changed to *not* use > > SLAB_DESTROY_BY_RCU, and instead make each individual allocation be > > RCU-free'd (which is a lot simpler to think about, because then you > > don't have the whole re-use issue). 
> > Sure, we only need to change __cleanup_sighand() to use call_rcu(). > But I am not sure this makes sense, I mean, I do not think this can > make something more simple/clear. > > > And this could easily be an RT issue, if the RT code does some > > re-initialization of the rtmutex that replaces the spinlock we have. > > Unlikely... this should be done by sighand_ctor() anyway. > > I'll try to recheck rt_mutex_unlock() tomorrow. _Perhaps_ rcu_read_unlock() > should be shifted from lock_task_sighand() to unlock_task_sighand() to > ensure that rt_mutex_unlock() does nothihg with this memory after it > makes another lock/unlock possible. > > But if we need this (currently I do not think so), this doesn't depend on > SLAB_DESTROY_BY_RCU. And, at first glance, in this case rcu_read_unlock_special() > might be wrong too. OK, I will bite... What did I mess up in rcu_read_unlock_special()? This function does not report leaving the RCU read-side critical section until after its call to rt_mutex_unlock() has returned, so any RCU read-side critical sections in rt_mutex_unlock() will be respected. Thanx, Paul ^ permalink raw reply [flat|nested] 63+ messages in thread
* safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-06 20:33 ` Paul E. McKenney
@ 2014-06-08 13:07   ` Oleg Nesterov
  2014-06-09 16:26     ` Paul E. McKenney
  0 siblings, 1 reply; 63+ messages in thread

From: Oleg Nesterov @ 2014-06-08 13:07 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Linus Torvalds, Steven Rostedt, LKML, Thomas Gleixner,
    Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On 06/06, Paul E. McKenney wrote:
>
> On Tue, Jun 03, 2014 at 10:01:25PM +0200, Oleg Nesterov wrote:
> >
> > I'll try to recheck rt_mutex_unlock() tomorrow. _Perhaps_ rcu_read_unlock()
> > should be shifted from lock_task_sighand() to unlock_task_sighand() to
> > ensure that rt_mutex_unlock() does nothing with this memory after it
> > makes another lock/unlock possible.
> >
> > But if we need this (currently I do not think so), this doesn't depend
> > on SLAB_DESTROY_BY_RCU. And, at first glance, in this case
> > rcu_read_unlock_special() might be wrong too.
>
> OK, I will bite... What did I mess up in rcu_read_unlock_special()?
>
> This function does not report leaving the RCU read-side critical section
> until after its call to rt_mutex_unlock() has returned, so any RCU
> read-side critical sections in rt_mutex_unlock() will be respected.

Sorry for confusion.

I only meant that afaics rcu_read_unlock_special() equally depends on the
fact that rt_mutex_unlock() does nothing with "struct rt_mutex" after it
makes another rt_mutex_lock() + rt_mutex_unlock() possible, otherwise
this code is wrong (and unlock_task_sighand() would be wrong too).

Just to simplify the discussion... suppose we add "atomic_t nr_slow_unlock"
into "struct rt_mutex" and change rt_mutex_slowunlock() to do
atomic_inc(&lock->nr_slow_unlock) after it drops ->wait_lock. Of course
this would be ugly, just for illustration.

In this case atomic_inc() above can write to rcu_boost()'s stack after
this function returns to the caller.
And unlock_task_sighand() would be wrong too, atomic_inc() could write
to the memory which was already returned to the system because the
"unlock" path runs outside of the rcu-protected section.

But it seems to me that currently we are safe, rt_mutex_unlock() doesn't
do something like this, a concurrent rt_mutex_lock() must always take
wait_lock too.

And while this is off-topic and I can be easily wrong, it seems that the
normal "struct mutex" is not safe in this respect. If nothing else, once
__mutex_unlock_common_slowpath()->__mutex_slowpath_needs_to_unlock()
sets lock->count = 1, a concurrent mutex_lock() can take and then
release this mutex before __mutex_unlock_common_slowpath() takes
->wait_lock.

So _perhaps_ we should not rely on this property of rt_mutex "too much".

Oleg.

^ permalink raw reply	[flat|nested] 63+ messages in thread
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) 2014-06-08 13:07 ` safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) Oleg Nesterov @ 2014-06-09 16:26 ` Paul E. McKenney 2014-06-09 18:15 ` Oleg Nesterov 2014-06-10 8:37 ` Peter Zijlstra 0 siblings, 2 replies; 63+ messages in thread From: Paul E. McKenney @ 2014-06-09 16:26 UTC (permalink / raw) To: Oleg Nesterov Cc: Linus Torvalds, Steven Rostedt, LKML, Thomas Gleixner, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams On Sun, Jun 08, 2014 at 03:07:18PM +0200, Oleg Nesterov wrote: > On 06/06, Paul E. McKenney wrote: > > > > On Tue, Jun 03, 2014 at 10:01:25PM +0200, Oleg Nesterov wrote: > > > > > > I'll try to recheck rt_mutex_unlock() tomorrow. _Perhaps_ rcu_read_unlock() > > > should be shifted from lock_task_sighand() to unlock_task_sighand() to > > > ensure that rt_mutex_unlock() does nothihg with this memory after it > > > makes another lock/unlock possible. > > > > > > But if we need this (currently I do not think so), this doesn't depend on > > > SLAB_DESTROY_BY_RCU. And, at first glance, in this case rcu_read_unlock_special() > > > might be wrong too. > > > > OK, I will bite... What did I mess up in rcu_read_unlock_special()? > > > > This function does not report leaving the RCU read-side critical section > > until after its call to rt_mutex_unlock() has returned, so any RCU > > read-side critical sections in rt_mutex_unlock() will be respected. > > Sorry for confusion. > > I only meant that afaics rcu_read_unlock_special() equally depends on the > fact that rt_mutex_unlock() does nothing with "struct rt_mutex" after it > makes another rt_mutex_lock() + rt_mutex_unlock() possible, otherwise this > code is wrong (and unlock_task_sighand() would be wrong too). > > Just to simplify the discussion... 
suppose we add "atomic_t nr_slow_unlock" > into "struct rt_mutex" and change rt_mutex_slowunlock() to do > atomic_inc(&lock->nr_slow_unlock) after it drops ->wait_lock. Of course this > would be ugly, just for illustration. That would indeed be a bad thing, as it could potentially lead to use-after-free bugs. Though one could argue that any code that resulted in use-after-free would be quite aggressive. But still... > In this case atomic_inc() above can write to rcu_boost()'s stack after this > functions returns to the caller. And unlock_task_sighand() would be wrong > too, atomic_inc() could write to the memory which was already returned to > system because "unlock" path runs outside of rcu-protected section. > > But it seems to me that currently we are safe, rt_mutex_unlock() doesn't do > something like this, a concurrent rt_mutex_lock() must always take wait_lock > too. > > > And while this is off-topic and I can be easily wrong, it seems that the > normal "struct mutex" is not safe in this respect. If nothing else, once > __mutex_unlock_common_slowpath()->__mutex_slowpath_needs_to_unlock() sets > lock->count = 1, a concurent mutex_lock() can take and then release this > mutex before __mutex_unlock_common_slowpath() takes ->wait_lock. > > So _perhaps_ we should not rely on this property of rt_mutex "too much". Well, I could easily move the rt_mutex from rcu_boost()'s stack to the rcu_node structure, if that would help. That said, I still have my use-after-free concern above. Thanx, Paul ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) 2014-06-09 16:26 ` Paul E. McKenney @ 2014-06-09 18:15 ` Oleg Nesterov 2014-06-09 18:29 ` Steven Rostedt 2014-06-10 8:37 ` Peter Zijlstra 1 sibling, 1 reply; 63+ messages in thread From: Oleg Nesterov @ 2014-06-09 18:15 UTC (permalink / raw) To: Paul E. McKenney Cc: Linus Torvalds, Steven Rostedt, LKML, Thomas Gleixner, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams On 06/09, Paul E. McKenney wrote: > > On Sun, Jun 08, 2014 at 03:07:18PM +0200, Oleg Nesterov wrote: > > > > I only meant that afaics rcu_read_unlock_special() equally depends on the > > fact that rt_mutex_unlock() does nothing with "struct rt_mutex" after it > > makes another rt_mutex_lock() + rt_mutex_unlock() possible, otherwise this > > code is wrong (and unlock_task_sighand() would be wrong too). > > > > Just to simplify the discussion... suppose we add "atomic_t nr_slow_unlock" > > into "struct rt_mutex" and change rt_mutex_slowunlock() to do > > atomic_inc(&lock->nr_slow_unlock) after it drops ->wait_lock. Of course this > > would be ugly, just for illustration. > > That would indeed be a bad thing, as it could potentially lead to > use-after-free bugs. Though one could argue that any code that resulted > in use-after-free would be quite aggressive. But still... And once again, note that the normal mutex is already unsafe (unless I missed something). > > So _perhaps_ we should not rely on this property of rt_mutex "too much". > > Well, I could easily move the rt_mutex from rcu_boost()'s stack to the > rcu_node structure, if that would help. That said, I still have my > use-after-free concern above. Or we can document that rt_mutex is special and rt_mutex_unlock() should be "atomic" and safe as spin_unlock() or complete(). Oleg. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) 2014-06-09 18:15 ` Oleg Nesterov @ 2014-06-09 18:29 ` Steven Rostedt 2014-06-09 18:51 ` Linus Torvalds 2014-06-09 19:04 ` Oleg Nesterov 0 siblings, 2 replies; 63+ messages in thread From: Steven Rostedt @ 2014-06-09 18:29 UTC (permalink / raw) To: Oleg Nesterov Cc: Paul E. McKenney, Linus Torvalds, LKML, Thomas Gleixner, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams On Mon, 9 Jun 2014 20:15:53 +0200 Oleg Nesterov <oleg@redhat.com> wrote: > > That would indeed be a bad thing, as it could potentially lead to > > use-after-free bugs. Though one could argue that any code that resulted > > in use-after-free would be quite aggressive. But still... > > And once again, note that the normal mutex is already unsafe (unless I missed > something). Is it unsafe? This thread was started because of a bug we triggered in -rt, which ended up being a change specific to -rt that modified the way slub handled SLAB_DESTROY_BY_RCU. What else was wrong with it? -- Steve ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-09 18:29 ` Steven Rostedt
@ 2014-06-09 18:51   ` Linus Torvalds
  2014-06-09 19:41     ` Steven Rostedt
  2014-06-10 12:56     ` Paul E. McKenney
  1 sibling, 2 replies; 63+ messages in thread

From: Linus Torvalds @ 2014-06-09 18:51 UTC (permalink / raw)
To: Steven Rostedt
Cc: Oleg Nesterov, Paul E. McKenney, LKML, Thomas Gleixner,
    Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On Mon, Jun 9, 2014 at 11:29 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
>>
>> And once again, note that the normal mutex is already unsafe (unless I missed
>> something).
>
> Is it unsafe?
>
> This thread was started because of a bug we triggered in -rt, which
> ended up being a change specific to -rt that modified the way slub
> handled SLAB_DESTROY_BY_RCU. What else was wrong with it?

There's a different issue with freeing of mutexes, which is not a bug,
but "by design". Namely that mutexes aren't truly "atomic". They are
complex data structures, and they have issues that a spinlock does not
have.

When unlocking a mutex, the thread doing the unlocking will still touch
the mutex itself _after_ another thread could already have successfully
acquired the mutex. This is not a problem in any normal use, since all
of this is perfectly coherent in general, but it means that code
sequences like:

	mutex_lock(mem->mutex);
	kill_it = !--mem->refcount;
	mutex_unlock(mem->mutex);
	if (kill_it)
		free(mem);

are fundamentally buggy.

Note that if you think of mutexes as truly indivisible atomic
operations, the above is "obviously correct": the last person who got
the mutex marked it for killing. But the fact is, the *next-to-last*
mutex acquirer may still actively be in the *non-indivisible*
mutex_unlock() when the last person frees it, resulting in a
use-after-free.
And yes, we've had this bug, and as far as I know it's possible that the
RT code *introduces* this bug when it changes spinlocks into mutexes.
Because we do exactly the above code sequence with spinlocks. So just
replacing spinlocks with mutexes is a very subtly buggy thing to do in
general.

Another example of this kind of situation is using a mutex as a
completion event: that's simply buggy. Again, it's because mutexes are
complex data structures, and you have a very similar use-after-free
issue. It's why the fundamental data structure for a "struct completion"
is a spinlock, not a mutex.

Again, in *theory*, a completion could be just a mutex that starts out
locked, and then the completer completes it by just unlocking it. The
person who waits for a completion does so by just asking for a lock.
Obvious, simple, and trivial, right?

Not so, because of the *exact* same issue above: the completer (who does
an "unlock") may still be accessing the completion data structure when
the completion waiter has successfully gotten the lock. So the standard
thing of the completer freeing the underlying completion memory (which
is often on the stack, so "freeing" is just the act of going out of
scope of the liveness of the completion data structure) would not work
if the completion was a mutex.

This is subtle, and it is basically unavoidable. If a mutex (or counting
semaphore) has a fast-path - and a mutex/semaphore without a fast-path
is shit - then this issue will exist. Exactly because the fast-path will
depend on just one part of the whole big mutex structure, and the
slow-path will have other pieces to it.

There might be reasonable ways to avoid this issue (having the fastpath
locking field share memory with the slow-path locking, for example),
but it's not how our semaphores and mutexes work, and I suspect it
cannot be the case in general (because it limits you too badly in how
to implement the mutex). As a result, this is all "by design" as
opposed to being a bug.
Linus

^ permalink raw reply	[flat|nested] 63+ messages in thread
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-09 18:51 ` Linus Torvalds
@ 2014-06-09 19:41   ` Steven Rostedt
  2014-06-10  8:53     ` Thomas Gleixner
  0 siblings, 1 reply; 63+ messages in thread

From: Steven Rostedt @ 2014-06-09 19:41 UTC (permalink / raw)
To: Linus Torvalds
Cc: Oleg Nesterov, Paul E. McKenney, LKML, Thomas Gleixner,
    Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On Mon, 9 Jun 2014 11:51:09 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, Jun 9, 2014 at 11:29 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> >>
> >> And once again, note that the normal mutex is already unsafe (unless I missed
> >> something).
> >
> > Is it unsafe?
> >
> > This thread was started because of a bug we triggered in -rt, which
> > ended up being a change specific to -rt that modified the way slub
> > handled SLAB_DESTROY_BY_RCU. What else was wrong with it?
>
> There's a different issue with freeing of mutexes, which is not a bug,
> but "by design". Namely that mutexes aren't truly "atomic". They are
> complex data structures, and they have issues that a spinlock does not
> have.
>
> When unlocking a mutex, the thread doing the unlocking will still
> touch the mutex itself _after_ another thread could already have
> successfully acquired the mutex. This is not a problem in any normal
> use. since all of this is perfectly coherent in general, but it means
> that code sequences like:
>
>	mutex_lock(mem->mutex);
>	kill_it = !--mem->refcount;
>	mutex_unlock(mem->mutex);
>	if (kill_it)
>		free(mem);

Oh this again. I remember you having a discussion about it in the past.
I'm trying to wrap my head around it so let me make sure I understand
the spinlock (working) case, and then a bad mutex case.

For this to work, you have some item, let's say on a list. When it is
created, the refcount is 1 and it's added to the list.
To remove it, you remove it from the list and dec the refcount, and if
it is zero then you can free it. The issue is that you don't know what
may have seen it while it's on the list, and the last one to dec it to
zero frees it. Is this a situation that is used?

Because spinlocks are atomic, this is all fine, but if you have a
mutex, then this is where you can have issues. I think rtmutex has an
issue with it too. Specifically in the slow_unlock case:

	if (!rt_mutex_has_waiters(lock)) {
		lock->owner = NULL;
		raw_spin_unlock(&lock->wait_lock);
		return;
	}

That is, you can have:

	CPU0				CPU1
	----				----
				[refcount = 1]
  read foo from list
  [refcount = 2]
				[refcount 2]
				remove foo from list
				mutex_lock(foo->mutex);
				kill_it = !--mem->refcount;
				[refcount = 1] [kill_it == 0]
				mutex_unlock(foo->mutex)
				  lock->owner = NULL;
  mutex_lock()
  [ owner == NULL, fast path! ]
  kill_it = !--mem->refcount;
  [refcount = 0] [kill_it = 1]
  mutex_unlock()
  [ owner == current, fast path! ]
  if (kill_it)
	free(foo);
				  raw_spin_unlock(&lock->wait_lock);
				  [ access freed foo's wait_lock, BUG! ]

Is this the problem? I'm just trying to visualize it, and maybe even
visualize it for others that are reading this thread.

Thanks,

-- Steve

^ permalink raw reply	[flat|nested] 63+ messages in thread
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc)
  2014-06-09 19:41 ` Steven Rostedt
@ 2014-06-10  8:53   ` Thomas Gleixner
  2014-06-10 16:57     ` Oleg Nesterov
  2014-06-10 17:07     ` Oleg Nesterov
  0 siblings, 2 replies; 63+ messages in thread

From: Thomas Gleixner @ 2014-06-10 8:53 UTC (permalink / raw)
To: Steven Rostedt
Cc: Linus Torvalds, Oleg Nesterov, Paul E. McKenney, LKML,
    Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams

On Mon, 9 Jun 2014, Steven Rostedt wrote:
> On Mon, 9 Jun 2014 11:51:09 -0700
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Because spinlocks are atomic, this is all fine, but if you have a
> mutex, then this is where you can have issues. I think rtmutex has an
> issue with it too. Specifically in the slow_unlock case:
>
>	if (!rt_mutex_has_waiters(lock)) {
>		lock->owner = NULL;
>		raw_spin_unlock(&lock->wait_lock);
>		return;
>	}

Indeed. If the fast path is enabled we have that issue. Fortunately
there is a halfways reasonable solution for this.

Thanks,

	tglx

--------------------------->

Subject: rtmutex: Plug slow unlock race
From: Thomas Gleixner <tglx@linutronix.de>
Date: Tue, 10 Jun 2014 09:27:00 +0200

When the rtmutex fast path is enabled the slow unlock function can
create the following situation:

    spin_lock(foo->m->wait_lock);
    foo->m->owner = NULL;
			rt_mutex_lock(foo->m); <-- fast path
			free = atomic_dec_and_test(foo->refcnt);
			rt_mutex_unlock(foo->m); <-- fast path
			if (free)
				kfree(foo);
    spin_unlock(foo->m->wait_lock); <--- Use after free.
Plug the race by changing the slow unlock to the following scheme: while (!rt_mutex_has_waiters(m)) { /* Clear the waiters bit in m->owner */ clear_rt_mutex_waiters(m); owner = rt_mutex_owner(m); spin_unlock(m->wait_lock); if (cmpxchg(m->owner, owner, 0) == owner) return; spin_lock(m->wait_lock); } So in case of a new waiter incoming while the owner tries the slow path unlock we have two situations: unlock(wait_lock); lock(wait_lock); cmpxchg(p, owner, 0) == owner mark_rt_mutex_waiters(lock); acquire(lock); Or: unlock(wait_lock); lock(wait_lock); mark_rt_mutex_waiters(lock); cmpxchg(p, owner, 0) != owner enqueue_waiter(); unlock(wait_lock); lock(wait_lock); wake waiter(); unlock(wait_lock); lock(wait_lock); acquire(lock); If the fast path is disabled, then the simple m->owner = NULL; unlock(m->wait_lock); is sufficient as all access to m->owner is serialized via m->wait_lock; Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- kernel/locking/rtmutex.c | 94 +++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 90 insertions(+), 4 deletions(-) Index: tip/kernel/locking/rtmutex.c =================================================================== --- tip.orig/kernel/locking/rtmutex.c +++ tip/kernel/locking/rtmutex.c @@ -83,6 +83,48 @@ static inline void mark_rt_mutex_waiters owner = *p; } while (cmpxchg(p, owner, owner | RT_MUTEX_HAS_WAITERS) != owner); } + +/* + * Safe fastpath aware unlock: + * 1) Clear the waiters bit + * 2) Drop lock->wait_lock + * 3) Try to unlock the lock with cmpxchg + */ +static inline bool unlock_rt_mutex_safe(struct rt_mutex *lock) + __releases(lock->wait_lock) +{ + unsigned long owner, *p = (unsigned long *) &lock->owner; + + owner = (unsigned long) rt_mutex_owner(lock); + clear_rt_mutex_waiters(lock); + raw_spin_unlock(&lock->wait_lock); + /* + * If a new waiter comes in between the unlock and the cmpxchg + * we have two situations: + * + * unlock(wait_lock); + * 
lock(wait_lock); + * cmpxchg(p, owner, 0) == owner + * mark_rt_mutex_waiters(lock); + * acquire(lock); + * or: + * + * unlock(wait_lock); + * lock(wait_lock); + * mark_rt_mutex_waiters(lock); + * + * cmpxchg(p, owner, 0) != owner + * enqueue_waiter(); + * unlock(wait_lock); + * lock(wait_lock); + * wake waiter(); + * unlock(wait_lock); + * lock(wait_lock); + * acquire(lock); + */ + return rt_mutex_cmpxchg(p, owner, 0); +} + #else # define rt_mutex_cmpxchg(l,c,n) (0) static inline void mark_rt_mutex_waiters(struct rt_mutex *lock) @@ -90,6 +132,17 @@ static inline void mark_rt_mutex_waiters lock->owner = (struct task_struct *) ((unsigned long)lock->owner | RT_MUTEX_HAS_WAITERS); } + +/* + * Simple slow path only version: owner is protected by wait_lock + */ +static inline bool unlock_rt_mutex_safe(struct rt_mutex *lock) + __releases(lock->wait_lock) +{ + lock->owner = NULL; + raw_spin_unlock(&lock->wait_lock); + return true; +} #endif static inline int @@ -928,10 +981,43 @@ rt_mutex_slowunlock(struct rt_mutex *loc rt_mutex_deadlock_account_unlock(current); - if (!rt_mutex_has_waiters(lock)) { - lock->owner = NULL; - raw_spin_unlock(&lock->wait_lock); - return; + /* + * We must be careful here if the fast path is enabled. If we + * have no waiters queued we cannot set owner to NULL here + * because of: + * + * foo->lock->owner = NULL; + * rtmutex_lock(foo->lock); <- fast path + * free = atomic_dec_and_test(foo->refcnt); + * rtmutex_unlock(foo->lock); <- fast path + * if (free) + * kfree(foo); + * raw_spin_unlock(foo->lock->wait_lock); + * + * So for the fastpath enabled kernel: + * + * Nothing can set the waiters bit as long as we hold + * lock->wait_lock. 
So we do the following sequence: + * + * owner = rt_mutex_owner(lock); + * clear_rt_mutex_waiters(lock); + * raw_spin_unlock(&lock->wait_lock); + * if (cmpxchg(&lock->owner, owner, 0) == owner) + * return; + * goto retry; + * + * The fastpath disabled variant is simple as all access to + * lock->owner is serialized by lock->wait_lock: + * + * lock->owner = NULL; + * raw_spin_unlock(&lock->wait_lock); + */ + while (!rt_mutex_has_waiters(lock)) { + /* Drops lock->wait_lock ! */ + if (unlock_rt_mutex_safe(lock) == true) + return; + /* Relock the rtmutex and try again */ + raw_spin_lock(&lock->wait_lock); } wakeup_next_waiter(lock); ^ permalink raw reply [flat|nested] 63+ messages in thread
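The retry scheme in the patch above can be modelled with a plain word standing in for `lock->owner`. This single-threaded sketch (`HAS_WAITERS`, `try_safe_unlock` and the `waiter_arrives` flag are stand-ins, not the kernel code) shows both outcomes: the cmpxchg hands the lock off when no waiter appeared in the window, and fails, leaving the waiters bit set for the retry, when one did:

```c
#include <assert.h>

#define HAS_WAITERS 1UL		/* low bit of ->owner, as in rtmutex */

static unsigned long owner_word;	/* stands in for lock->owner */
static int waiter_arrives;		/* inject a waiter into the window */

/* Models unlock_rt_mutex_safe(): clear the waiters bit, drop
 * wait_lock (the race window), then try the cmpxchg hand-off. */
static int try_safe_unlock(void)
{
	unsigned long owner = owner_word & ~HAS_WAITERS;

	owner_word = owner;		/* clear_rt_mutex_waiters() */
	/* raw_spin_unlock(&lock->wait_lock): window opens here */
	if (waiter_arrives)
		owner_word |= HAS_WAITERS;	/* mark_rt_mutex_waiters() */
	/* cmpxchg(&lock->owner, owner, 0) */
	if (owner_word == owner) {
		owner_word = 0;
		return 1;		/* unlocked, caller may return */
	}
	return 0;			/* caller relocks wait_lock, retries */
}

int demo(void)
{
	unsigned long me = 0x100;

	/* No waiter in the window: hand-off succeeds, lock is free. */
	waiter_arrives = 0;
	owner_word = me;
	if (!try_safe_unlock() || owner_word != 0)
		return 0;

	/* A waiter slips in after wait_lock is dropped: the cmpxchg
	 * fails, the waiters bit survives, and the slow path retries
	 * and will see rt_mutex_has_waiters() == true. */
	waiter_arrives = 1;
	owner_word = me;
	if (try_safe_unlock() || owner_word != (me | HAS_WAITERS))
		return 0;

	return 1;
}
```

The second case is the "cmpxchg(p, owner, 0) != owner" branch of the comment block in the patch.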
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) 2014-06-10 8:53 ` Thomas Gleixner @ 2014-06-10 16:57 ` Oleg Nesterov 2014-06-10 18:08 ` Thomas Gleixner 2014-06-10 17:07 ` Oleg Nesterov 1 sibling, 1 reply; 63+ messages in thread From: Oleg Nesterov @ 2014-06-10 16:57 UTC (permalink / raw) To: Thomas Gleixner Cc: Steven Rostedt, Linus Torvalds, Paul E. McKenney, LKML, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams On 06/10, Thomas Gleixner wrote: > > On Mon, 9 Jun 2014, Steven Rostedt wrote: > > I think rtmutex has an > > issue with it too. Specifically in the slow_unlock case: > > > > if (!rt_mutex_has_waiters(lock)) { > > lock->owner = NULL; > > raw_spin_unlock(&lock->wait_lock); > > return; > > } > > Indeed. If the fast path is enabled we have that issue. Fortunately > there is a halfways reasonable solution for this. Ah, yes, I missed that, > + while (!rt_mutex_has_waiters(lock)) { > + /* Drops lock->wait_lock ! */ > + if (unlock_rt_mutex_safe(lock) == true) > + return; > + /* Relock the rtmutex and try again */ > + raw_spin_lock(&lock->wait_lock); > } OK... wakeup_next_waiter() does rt_mutex_set_owner(NULL) before we drop ->wait_lock, but this looks fine: we know that rt_mutex_has_waiters() can not become false until waiter->task takes this lock and does rt_mutex_dequeue(), so ->owner can't be NULL, right? Perhaps it could simply do ->owner = RT_MUTEX_HAS_WAITERS to make this more clear... Off-topic question. I simply can't understand why rt_mutex_slowtrylock() checks rt_mutex_owner(lock) != current. This looks pointless, try_to_take_rt_mutex() always fails (correctly) if rt_mutex_owner() != NULL ? IOW, can't we simply remove this check or turn it into "if (!rt_mutex_owner(lock))" ? Oleg. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) 2014-06-10 16:57 ` Oleg Nesterov @ 2014-06-10 18:08 ` Thomas Gleixner 2014-06-10 18:13 ` Steven Rostedt 0 siblings, 1 reply; 63+ messages in thread From: Thomas Gleixner @ 2014-06-10 18:08 UTC (permalink / raw) To: Oleg Nesterov Cc: Steven Rostedt, Linus Torvalds, Paul E. McKenney, LKML, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams On Tue, 10 Jun 2014, Oleg Nesterov wrote: > On 06/10, Thomas Gleixner wrote: > > > > On Mon, 9 Jun 2014, Steven Rostedt wrote: > > > I think rtmutex has an > > > issue with it too. Specifically in the slow_unlock case: > > > > > > if (!rt_mutex_has_waiters(lock)) { > > > lock->owner = NULL; > > > raw_spin_unlock(&lock->wait_lock); > > > return; > > > } > > > > Indeed. If the fast path is enabled we have that issue. Fortunately > > there is a halfways reasonable solution for this. > > Ah, yes, I missed that, > > > + while (!rt_mutex_has_waiters(lock)) { > > + /* Drops lock->wait_lock ! */ > > + if (unlock_rt_mutex_safe(lock) == true) > > + return; > > + /* Relock the rtmutex and try again */ > > + raw_spin_lock(&lock->wait_lock); > > } > > OK... > > wakeup_next_waiter() does rt_mutex_set_owner(NULL) before we drop ->wait_lock, > but this looks fine: we know that rt_mutex_has_waiters() can not become false > until waiter->task takes this lock and does rt_mutex_dequeue(), so ->owner > can't be NULL, right? Correct. > Perhaps it could simply do ->owner = RT_MUTEX_HAS_WAITERS to make this more > clear... Good point. The new owner can cleanup the mess. > Off-topic question. I simply can't understand why rt_mutex_slowtrylock() checks > rt_mutex_owner(lock) != current. This looks pointless, try_to_take_rt_mutex() > always fails (correctly) if rt_mutex_owner() != NULL ? IOW, can't we simply > remove this check or turn it into "if (!rt_mutex_owner(lock))" ? Indeed. Thanks, tglx ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) 2014-06-10 18:08 ` Thomas Gleixner @ 2014-06-10 18:13 ` Steven Rostedt 2014-06-10 20:05 ` Thomas Gleixner 0 siblings, 1 reply; 63+ messages in thread From: Steven Rostedt @ 2014-06-10 18:13 UTC (permalink / raw) To: Thomas Gleixner Cc: Oleg Nesterov, Linus Torvalds, Paul E. McKenney, LKML, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams On Tue, 10 Jun 2014 20:08:37 +0200 (CEST) Thomas Gleixner <tglx@linutronix.de> wrote: > > Perhaps it could simply do ->owner = RT_MUTEX_HAS_WAITERS to make this more > > clear... > > Good point. The new owner can cleanup the mess. > I thought about this too. It should work with the added overhead that every time we go into the unlock slow path, we guarantee that the next lock will go into the lock slowpath. As long as the new acquired lock does a fast unlock, then we get out of this spiral. -- Steve ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) 2014-06-10 18:13 ` Steven Rostedt @ 2014-06-10 20:05 ` Thomas Gleixner 2014-06-10 20:13 ` Thomas Gleixner 0 siblings, 1 reply; 63+ messages in thread From: Thomas Gleixner @ 2014-06-10 20:05 UTC (permalink / raw) To: Steven Rostedt Cc: Oleg Nesterov, Linus Torvalds, Paul E. McKenney, LKML, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams On Tue, 10 Jun 2014, Steven Rostedt wrote: > On Tue, 10 Jun 2014 20:08:37 +0200 (CEST) > Thomas Gleixner <tglx@linutronix.de> wrote: > > > > > Perhaps it could simply do ->owner = RT_MUTEX_HAS_WAITERS to make this more > > > clear... > > > > Good point. The new owner can cleanup the mess. > > > > I thought about this too. It should work with the added overhead that > every time we go into the unlock slow path, we guarantee that the next > lock will go into the lock slowpath. > > As long as the new acquired lock does a fast unlock, then we get out of > this spiral. The alternative solution is to document WHY this is safe. I think I prefer that one :) Thanks, tglx ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) 2014-06-10 20:05 ` Thomas Gleixner @ 2014-06-10 20:13 ` Thomas Gleixner 2014-06-11 15:52 ` Paul E. McKenney 0 siblings, 1 reply; 63+ messages in thread From: Thomas Gleixner @ 2014-06-10 20:13 UTC (permalink / raw) To: Steven Rostedt Cc: Oleg Nesterov, Linus Torvalds, Paul E. McKenney, LKML, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams On Tue, 10 Jun 2014, Thomas Gleixner wrote: > On Tue, 10 Jun 2014, Steven Rostedt wrote: > > On Tue, 10 Jun 2014 20:08:37 +0200 (CEST) > > Thomas Gleixner <tglx@linutronix.de> wrote: > > > > > > > > Perhaps it could simply do ->owner = RT_MUTEX_HAS_WAITERS to make this more > > > > clear... > > > > > > Good point. The new owner can cleanup the mess. > > > > > > > I thought about this too. It should work with the added overhead that > > every time we go into the unlock slow path, we guarantee that the next > > lock will go into the lock slowpath. > > > > As long as the new acquired lock does a fast unlock, then we get out of > > this spiral. > > The alternative solution is to document WHY this is safe. I think I > prefer that one :) And actually we keep the waiter bit set in wakeup_next_waiter() because we only dequeue the waiter from the lock owners pi waiter list, but not from the lock waiter list. rt_mutex_set_owner() sets the waiters bit if the lock has waiters. I agree with Oleg that this is not obvious from the code. So I add both a comment and open code it. Thanks, tglx ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) 2014-06-10 20:13 ` Thomas Gleixner @ 2014-06-11 15:52 ` Paul E. McKenney 2014-06-11 17:07 ` Oleg Nesterov 0 siblings, 1 reply; 63+ messages in thread From: Paul E. McKenney @ 2014-06-11 15:52 UTC (permalink / raw) To: Thomas Gleixner Cc: Steven Rostedt, Oleg Nesterov, Linus Torvalds, LKML, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams On Tue, Jun 10, 2014 at 10:13:20PM +0200, Thomas Gleixner wrote: > On Tue, 10 Jun 2014, Thomas Gleixner wrote: > > On Tue, 10 Jun 2014, Steven Rostedt wrote: > > > On Tue, 10 Jun 2014 20:08:37 +0200 (CEST) > > > Thomas Gleixner <tglx@linutronix.de> wrote: > > > > > > > > > > > Perhaps it could simply do ->owner = RT_MUTEX_HAS_WAITERS to make this more > > > > > clear... > > > > > > > > Good point. The new owner can cleanup the mess. > > > > > > > > > > I thought about this too. It should work with the added overhead that > > > every time we go into the unlock slow path, we guarantee that the next > > > lock will go into the lock slowpath. > > > > > > As long as the new acquired lock does a fast unlock, then we get out of > > > this spiral. > > > > The alternative solution is to document WHY this is safe. I think I > > prefer that one :) > > And actually we keep the waiter bit set in wakeup_next_waiter() > because we only dequeue the waiter from the lock owners pi waiter > list, but not from the lock waiter list. > > rt_mutex_set_owner() sets the waiters bit if the lock has waiters. I > agree with Oleg that this is not obvious from the code. > > So I add both a comment and open code it. And for my part, I have queued the following, which should prevent RCU from relying on rt_mutex_unlock() keeping hands off after unlock. This passes light rcutorture testing. Thoughts? 
Thanx, Paul ------------------------------------------------------------------------ rcu: Allow post-unlock reference for rt_mutex The current approach to RCU priority boosting uses an rt_mutex strictly for its priority-boosting side effects. The rt_mutex_init_proxy_locked() function is used by the booster to initialize the lock as held by the boostee. The booster then uses rt_mutex_lock() to acquire this rt_mutex, which priority-boosts the boostee. When the boostee reaches the end of its outermost RCU read-side critical section, it checks a field in its task structure to see whether it has been boosted, and, if so, uses rt_mutex_unlock() to release the rt_mutex. The booster can then go on to boost the next task that is blocking the current RCU grace period. But reasonable implementations of rt_mutex_unlock() might result in the boostee referencing the rt_mutex's data after releasing it. But the booster might have re-initialized the rt_mutex between the time that the boostee released it and the time that it later referenced it. This is clearly asking for trouble, so this commit introduces a completion that forces the booster to wait until the boostee has completely finished with the rt_mutex, thus avoiding the case where the booster is re-initializing the rt_mutex before the last boostee's last reference to that rt_mutex. This of course does introduce some overhead, but the priority-boosting code paths are miles from any possible fastpath, and the overhead of executing the completion will normally be quite small compared to the overhead of priority boosting and deboosting, so this should be OK. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index bf2c1e669691..31194ee9dfa6 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -172,6 +172,11 @@ struct rcu_node { /* queued on this rcu_node structure that */ /* are blocking the current grace period, */ /* there can be no such task. 
*/ + struct completion boost_completion; + /* Used to ensure that the rt_mutex used */ + /* to carry out the boosting is fully */ + /* released with no future boostee accesses */ + /* before that rt_mutex is re-initialized. */ unsigned long boost_time; /* When to start boosting (jiffies). */ struct task_struct *boost_kthread_task; diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index ec7627becaf0..99743e9ea8ed 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -427,8 +427,10 @@ void rcu_read_unlock_special(struct task_struct *t) #ifdef CONFIG_RCU_BOOST /* Unboost if we were boosted. */ - if (rbmp) + if (rbmp) { rt_mutex_unlock(rbmp); + complete(&rnp->boost_completion); + } #endif /* #ifdef CONFIG_RCU_BOOST */ /* @@ -1202,10 +1204,14 @@ static int rcu_boost(struct rcu_node *rnp) t = container_of(tb, struct task_struct, rcu_node_entry); rt_mutex_init_proxy_locked(&mtx, t); t->rcu_boost_mutex = &mtx; + init_completion(&rnp->boost_completion); raw_spin_unlock_irqrestore(&rnp->lock, flags); rt_mutex_lock(&mtx); /* Side effect: boosts task t's priority. */ rt_mutex_unlock(&mtx); /* Keep lockdep happy. */ + /* Wait until boostee is done accessing mtx before reinitializing. */ + wait_for_completion(&rnp->boost_completion); + return ACCESS_ONCE(rnp->exp_tasks) != NULL || ACCESS_ONCE(rnp->boost_tasks) != NULL; } ^ permalink raw reply related [flat|nested] 63+ messages in thread
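The completion-based hand-off in the patch can be sketched in user space with POSIX threads. Everything below (the `completion` struct, `boostee_fn`, `boost_once`) is an illustrative stand-in for the kernel primitives; the only property it demonstrates is that the booster cannot reach the "reinitialize" point before the boostee's last access to the mutex:

```c
#include <assert.h>
#include <pthread.h>

struct completion {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	int done;
};

static void complete(struct completion *c)
{
	pthread_mutex_lock(&c->lock);
	c->done = 1;
	pthread_cond_signal(&c->cond);
	pthread_mutex_unlock(&c->lock);
}

static void wait_for_completion(struct completion *c)
{
	pthread_mutex_lock(&c->lock);
	while (!c->done)
		pthread_cond_wait(&c->cond, &c->lock);
	pthread_mutex_unlock(&c->lock);
}

static struct completion boost_done;
static int boostee_done_with_mtx;

static void *boostee_fn(void *arg)
{
	(void)arg;
	/* ...the boostee's final references to the boost rt_mutex,
	 * including any post-unlock access, would go here... */
	boostee_done_with_mtx = 1;
	complete(&boost_done);	/* only now may the booster reuse it */
	return NULL;
}

int boost_once(void)
{
	pthread_t t;
	int safe;

	pthread_mutex_init(&boost_done.lock, NULL);
	pthread_cond_init(&boost_done.cond, NULL);
	boost_done.done = 0;
	boostee_done_with_mtx = 0;

	pthread_create(&t, NULL, boostee_fn, NULL);
	wait_for_completion(&boost_done);
	/* Here rt_mutex_init_proxy_locked() on the same mutex would be
	 * safe: the boostee has signalled that it is finished. */
	safe = boostee_done_with_mtx;
	pthread_join(t, NULL);
	return safe;
}
```

The mutex inside `complete()`/`wait_for_completion()` provides the ordering, so `boost_once()` always observes the boostee's flag as set.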
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) 2014-06-11 15:52 ` Paul E. McKenney @ 2014-06-11 17:07 ` Oleg Nesterov 2014-06-11 17:17 ` Oleg Nesterov 2014-06-11 17:27 ` Paul E. McKenney 0 siblings, 2 replies; 63+ messages in thread From: Oleg Nesterov @ 2014-06-11 17:07 UTC (permalink / raw) To: Paul E. McKenney Cc: Thomas Gleixner, Steven Rostedt, Linus Torvalds, LKML, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams On 06/11, Paul E. McKenney wrote: > > @@ -1202,10 +1204,14 @@ static int rcu_boost(struct rcu_node *rnp) > t = container_of(tb, struct task_struct, rcu_node_entry); > rt_mutex_init_proxy_locked(&mtx, t); > t->rcu_boost_mutex = &mtx; > + init_completion(&rnp->boost_completion); can't rcu_init_one() do this? but this is minor, > raw_spin_unlock_irqrestore(&rnp->lock, flags); > rt_mutex_lock(&mtx); /* Side effect: boosts task t's priority. */ > rt_mutex_unlock(&mtx); /* Keep lockdep happy. */ > > + /* Wait until boostee is done accessing mtx before reinitializing. */ > + wait_for_completion(&rnp->boost_completion); > + I must have missed something, I dont understand why we need ->boost_completion. What if you simply move that rt_mutex into rcu_node ? Or. Given that rcu_boost_kthread() never exits, it can declare this mutex on stack and pass the pointer to rcu_boost() ? Oleg. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) 2014-06-11 17:07 ` Oleg Nesterov @ 2014-06-11 17:17 ` Oleg Nesterov 2014-06-11 17:29 ` Paul E. McKenney 2014-06-11 17:27 ` Paul E. McKenney 1 sibling, 1 reply; 63+ messages in thread From: Oleg Nesterov @ 2014-06-11 17:17 UTC (permalink / raw) To: Paul E. McKenney Cc: Thomas Gleixner, Steven Rostedt, Linus Torvalds, LKML, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams On 06/11, Oleg Nesterov wrote: > > On 06/11, Paul E. McKenney wrote: > > > raw_spin_unlock_irqrestore(&rnp->lock, flags); > > rt_mutex_lock(&mtx); /* Side effect: boosts task t's priority. */ > > rt_mutex_unlock(&mtx); /* Keep lockdep happy. */ > > > > + /* Wait until boostee is done accessing mtx before reinitializing. */ > > + wait_for_completion(&rnp->boost_completion); > > + > > I must have missed something, I dont understand why we need ->boost_completion. > > What if you simply move that rt_mutex into rcu_node ? > > Or. Given that rcu_boost_kthread() never exits, it can declare this mutex > on stack and pass the pointer to rcu_boost() ? Ah, please ignore, I forgot about init_proxy_locked(). Although perhaps this can be solved easily. Oleg. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) 2014-06-11 17:17 ` Oleg Nesterov @ 2014-06-11 17:29 ` Paul E. McKenney 2014-06-11 17:59 ` Oleg Nesterov 0 siblings, 1 reply; 63+ messages in thread From: Paul E. McKenney @ 2014-06-11 17:29 UTC (permalink / raw) To: Oleg Nesterov Cc: Thomas Gleixner, Steven Rostedt, Linus Torvalds, LKML, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams On Wed, Jun 11, 2014 at 07:17:34PM +0200, Oleg Nesterov wrote: > On 06/11, Oleg Nesterov wrote: > > > > On 06/11, Paul E. McKenney wrote: > > > > > raw_spin_unlock_irqrestore(&rnp->lock, flags); > > > rt_mutex_lock(&mtx); /* Side effect: boosts task t's priority. */ > > > rt_mutex_unlock(&mtx); /* Keep lockdep happy. */ > > > > > > + /* Wait until boostee is done accessing mtx before reinitializing. */ > > > + wait_for_completion(&rnp->boost_completion); > > > + > > > > I must have missed something, I dont understand why we need ->boost_completion. > > > > What if you simply move that rt_mutex into rcu_node ? > > > > Or. Given that rcu_boost_kthread() never exits, it can declare this mutex > > on stack and pass the pointer to rcu_boost() ? > > Ah, please ignore, I forgot about init_proxy_locked(). Although perhaps this > can be solved easily. You beat me to it. ;-) I was thinking of ->boost_completion as the way to solve it easily, but what did you have in mind? Thanx, Paul ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) 2014-06-11 17:29 ` Paul E. McKenney @ 2014-06-11 17:59 ` Oleg Nesterov 2014-06-11 19:56 ` Paul E. McKenney 0 siblings, 1 reply; 63+ messages in thread From: Oleg Nesterov @ 2014-06-11 17:59 UTC (permalink / raw) To: Paul E. McKenney Cc: Thomas Gleixner, Steven Rostedt, Linus Torvalds, LKML, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams On 06/11, Paul E. McKenney wrote: > > On Wed, Jun 11, 2014 at 07:17:34PM +0200, Oleg Nesterov wrote: > > On 06/11, Oleg Nesterov wrote: > > > > > > On 06/11, Paul E. McKenney wrote: > > > > > > > raw_spin_unlock_irqrestore(&rnp->lock, flags); > > > > rt_mutex_lock(&mtx); /* Side effect: boosts task t's priority. */ > > > > rt_mutex_unlock(&mtx); /* Keep lockdep happy. */ > > > > > > > > + /* Wait until boostee is done accessing mtx before reinitializing. */ > > > > + wait_for_completion(&rnp->boost_completion); > > > > + > > > > > > I must have missed something, I dont understand why we need ->boost_completion. > > > > > > What if you simply move that rt_mutex into rcu_node ? > > > > > > Or. Given that rcu_boost_kthread() never exits, it can declare this mutex > > > on stack and pass the pointer to rcu_boost() ? > > > > Ah, please ignore, I forgot about init_proxy_locked(). Although perhaps this > > can be solved easily. > > You beat me to it. ;-) > > I was thinking of ->boost_completion as the way to solve it easily, but > what did you have in mind? I meant, rcu_boost() could probably just do "mtx->owner = t", we know that it was unlocked by us and nobody else can use it until we set t->rcu_boost_mutex. And if we move it into rcu_node, then we can probably kill ->rcu_boost_mutex, rcu_read_unlock_special() could check rnp->boost_mutex->owner == current. But you know, I also think that the dentist removed the rest of my brains along my tooth, so I am not sure if I actually have something in mind. Oleg. 
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) 2014-06-11 17:59 ` Oleg Nesterov @ 2014-06-11 19:56 ` Paul E. McKenney 2014-06-12 17:28 ` Oleg Nesterov 0 siblings, 1 reply; 63+ messages in thread From: Paul E. McKenney @ 2014-06-11 19:56 UTC (permalink / raw) To: Oleg Nesterov Cc: Thomas Gleixner, Steven Rostedt, Linus Torvalds, LKML, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams On Wed, Jun 11, 2014 at 07:59:34PM +0200, Oleg Nesterov wrote: > On 06/11, Paul E. McKenney wrote: > > > > On Wed, Jun 11, 2014 at 07:17:34PM +0200, Oleg Nesterov wrote: > > > On 06/11, Oleg Nesterov wrote: > > > > > > > > On 06/11, Paul E. McKenney wrote: > > > > > > > > > raw_spin_unlock_irqrestore(&rnp->lock, flags); > > > > > rt_mutex_lock(&mtx); /* Side effect: boosts task t's priority. */ > > > > > rt_mutex_unlock(&mtx); /* Keep lockdep happy. */ > > > > > > > > > > + /* Wait until boostee is done accessing mtx before reinitializing. */ > > > > > + wait_for_completion(&rnp->boost_completion); > > > > > + > > > > > > > > I must have missed something, I dont understand why we need ->boost_completion. > > > > > > > > What if you simply move that rt_mutex into rcu_node ? > > > > > > > > Or. Given that rcu_boost_kthread() never exits, it can declare this mutex > > > > on stack and pass the pointer to rcu_boost() ? > > > > > > Ah, please ignore, I forgot about init_proxy_locked(). Although perhaps this > > > can be solved easily. > > > > You beat me to it. ;-) > > > > I was thinking of ->boost_completion as the way to solve it easily, but > > what did you have in mind? > > I meant, rcu_boost() could probably just do "mtx->owner = t", we know that > it was unlocked by us and nobody else can use it until we set > t->rcu_boost_mutex. My concern with this is that rcu_read_unlock_special() could hypothetically get preempted (either by kernel or hypervisor), so that it might be a long time until it makes its reference. 
But maybe that reference would be harmless in this case. > And if we move it into rcu_node, then we can probably kill ->rcu_boost_mutex, > rcu_read_unlock_special() could check rnp->boost_mutex->owner == current. If this was anywhere near a hot code path, I would be sorely tempted. > But you know, I also think that the dentist removed the rest of my brains > along my tooth, so I am not sure if I actually have something in mind. Ouch!!! When that happens to me, most of my brains return some time after the anesthetic wears off, but I know the feeling! Thanx, Paul ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) 2014-06-11 19:56 ` Paul E. McKenney @ 2014-06-12 17:28 ` Oleg Nesterov 2014-06-12 20:35 ` Paul E. McKenney 0 siblings, 1 reply; 63+ messages in thread From: Oleg Nesterov @ 2014-06-12 17:28 UTC (permalink / raw) To: Paul E. McKenney Cc: Thomas Gleixner, Steven Rostedt, Linus Torvalds, LKML, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams On 06/11, Paul E. McKenney wrote: > > On Wed, Jun 11, 2014 at 07:59:34PM +0200, Oleg Nesterov wrote: > > On 06/11, Paul E. McKenney wrote: > > > > > > I was thinking of ->boost_completion as the way to solve it easily, but > > > what did you have in mind? > > > > I meant, rcu_boost() could probably just do "mtx->owner = t", we know that > > it was unlocked by us and nobody else can use it until we set > > t->rcu_boost_mutex. > > My concern with this is that rcu_read_unlock_special() could hypothetically > get preempted (either by kernel or hypervisor), so that it might be a long > time until it makes its reference. But maybe that reference would be > harmless in this case. Confused... Not sure I understand what did you mean, and certainly I do not understand how this connects to the proxy-locking method. Could you explain? > > And if we move it into rcu_node, then we can probably kill ->rcu_boost_mutex, > > rcu_read_unlock_special() could check rnp->boost_mutex->owner == current. > > If this was anywhere near a hot code path, I would be sorely tempted. Ah, but I didn't mean perfomance. I think it is always good to try to remove something from task_struct, it is huge. I do not mean sizeof() in the first place, the very fact that I can hardly understand the purpose of a half of its members makes me sad ;) Oleg. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) 2014-06-12 17:28 ` Oleg Nesterov @ 2014-06-12 20:35 ` Paul E. McKenney 2014-06-12 21:40 ` Thomas Gleixner 2014-06-13 14:52 ` Oleg Nesterov 0 siblings, 2 replies; 63+ messages in thread From: Paul E. McKenney @ 2014-06-12 20:35 UTC (permalink / raw) To: Oleg Nesterov Cc: Thomas Gleixner, Steven Rostedt, Linus Torvalds, LKML, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams On Thu, Jun 12, 2014 at 07:28:44PM +0200, Oleg Nesterov wrote: > On 06/11, Paul E. McKenney wrote: > > > > On Wed, Jun 11, 2014 at 07:59:34PM +0200, Oleg Nesterov wrote: > > > On 06/11, Paul E. McKenney wrote: > > > > > > > > I was thinking of ->boost_completion as the way to solve it easily, but > > > > what did you have in mind? > > > > > > I meant, rcu_boost() could probably just do "mtx->owner = t", we know that > > > it was unlocked by us and nobody else can use it until we set > > > t->rcu_boost_mutex. > > > > My concern with this is that rcu_read_unlock_special() could hypothetically > > get preempted (either by kernel or hypervisor), so that it might be a long > > time until it makes its reference. But maybe that reference would be > > harmless in this case. > > Confused... Not sure I understand what did you mean, and certainly I do not > understand how this connects to the proxy-locking method. > > Could you explain? Here is the hypothetical sequence of events, which cannot happen unless the CPU releasing the lock accesses the structure after releasing the lock: CPU 0 CPU 1 (booster) releases boost_mutex acquires boost_mutex releases boost_mutex post-release boost_mutex access? Loops to next task to boost proxy-locks boost_mutex post-release boost_mutex access: confused due to proxy-lock operation? Now maybe this ends up being safe, but it sure feels like an accident waiting to happen. 
Some bright developer comes up with a super-fast handoff, and blam, RCU priority boosting takes it in the shorts. ;-) In contrast, using the completion prevents this. > > > And if we move it into rcu_node, then we can probably kill ->rcu_boost_mutex, > > > rcu_read_unlock_special() could check rnp->boost_mutex->owner == current. > > > > If this was anywhere near a hot code path, I would be sorely tempted. > > Ah, but I didn't mean perfomance. I think it is always good to try to remove > something from task_struct, it is huge. I do not mean sizeof() in the first > place, the very fact that I can hardly understand the purpose of a half of its > members makes me sad ;) Now -that- just might make a huge amount of sense! Let's see... o We hold the rcu_node structure's ->lock when checking the owner (looks like rt_mutex_owner() is the right API). o We hold the rcu_node structure's ->lock when doing rt_mutex_init_proxy_locked(). o We -don't- hold ->lock when releasing the rt_mutex, but that should be OK: The owner is releasing it, and it is going to not-owned, so no other task can possibly see ownership moving to/from them. o The rcu_node structure grows a bit, but not enough to worry about, and on most systems, the decrease in task_struct size will more than outweigh the increase in rcu_node size. Looks quite promising, how about the following? (Hey, it builds, so it must be correct, right?) Thanx, Paul ------------------------------------------------------------------------ rcu: Simplify priority boosting by putting rt_mutex in rcu_node RCU priority boosting currently checks for boosting via a pointer in task_struct. However, this is not needed: As Oleg noted, if the rt_mutex is placed in the rcu_node instead of on the booster's stack, the boostee can simply check it see if it owns the lock. This commit makes this change, shrinking task_struct by one pointer and the kernel by three lines. Suggested-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Paul E. 
McKenney <paulmck@linux.vnet.ibm.com> b/include/linux/sched.h | 3 --- b/kernel/rcu/tree.h | 3 +++ b/kernel/rcu/tree_plugin.h | 19 ++++++++----------- 3 files changed, 11 insertions(+), 14 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 25f54c79f757..6b90114764ff 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1222,9 +1222,6 @@ struct task_struct { #ifdef CONFIG_TREE_PREEMPT_RCU struct rcu_node *rcu_blocked_node; #endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */ -#ifdef CONFIG_RCU_BOOST - struct rt_mutex *rcu_boost_mutex; -#endif /* #ifdef CONFIG_RCU_BOOST */ #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) struct sched_info sched_info; diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index 31194ee9dfa6..db3f096ed80b 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -177,6 +177,9 @@ struct rcu_node { /* to carry out the boosting is fully */ /* released with no future boostee accesses */ /* before that rt_mutex is re-initialized. */ + struct rt_mutex boost_mtx; + /* Used only for the priority-boosting */ + /* side effect, not as a lock. */ unsigned long boost_time; /* When to start boosting (jiffies). */ struct task_struct *boost_kthread_task; diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 99743e9ea8ed..7628095f1c47 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -398,11 +398,9 @@ void rcu_read_unlock_special(struct task_struct *t) #ifdef CONFIG_RCU_BOOST if (&t->rcu_node_entry == rnp->boost_tasks) rnp->boost_tasks = np; - /* Snapshot/clear ->rcu_boost_mutex with rcu_node lock held. */ - if (t->rcu_boost_mutex) { - rbmp = t->rcu_boost_mutex; - t->rcu_boost_mutex = NULL; - } + /* Snapshot/clear ->boost_mutex with rcu_node lock held. 
*/ + if (rt_mutex_owner(&rnp->boost_mtx) == t) + rbmp = &rnp->boost_mtx; #endif /* #ifdef CONFIG_RCU_BOOST */ /* @@ -1151,7 +1149,6 @@ static void rcu_wake_cond(struct task_struct *t, int status) static int rcu_boost(struct rcu_node *rnp) { unsigned long flags; - struct rt_mutex mtx; struct task_struct *t; struct list_head *tb; @@ -1202,14 +1199,14 @@ static int rcu_boost(struct rcu_node *rnp) * section. */ t = container_of(tb, struct task_struct, rcu_node_entry); - rt_mutex_init_proxy_locked(&mtx, t); - t->rcu_boost_mutex = &mtx; + rt_mutex_init_proxy_locked(&rnp->boost_mtx, t); init_completion(&rnp->boost_completion); raw_spin_unlock_irqrestore(&rnp->lock, flags); - rt_mutex_lock(&mtx); /* Side effect: boosts task t's priority. */ - rt_mutex_unlock(&mtx); /* Keep lockdep happy. */ + /* Lock only for side effect: boosts task t's priority. */ + rt_mutex_lock(&rnp->boost_mtx); + rt_mutex_unlock(&rnp->boost_mtx); /* Then keep lockdep happy. */ - /* Wait until boostee is done accessing mtx before reinitializing. */ + /* Wait for boostee to be done w/boost_mtx before reinitializing. */ wait_for_completion(&rnp->boost_completion); return ACCESS_ONCE(rnp->exp_tasks) != NULL || ^ permalink raw reply related [flat|nested] 63+ messages in thread
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) 2014-06-12 20:35 ` Paul E. McKenney @ 2014-06-12 21:40 ` Thomas Gleixner 2014-06-12 22:27 ` Paul E. McKenney 2014-06-13 14:55 ` Oleg Nesterov 2014-06-13 14:52 ` Oleg Nesterov 1 sibling, 2 replies; 63+ messages in thread From: Thomas Gleixner @ 2014-06-12 21:40 UTC (permalink / raw) To: Paul E. McKenney Cc: Oleg Nesterov, Steven Rostedt, Linus Torvalds, LKML, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams On Thu, 12 Jun 2014, Paul E. McKenney wrote: > On Thu, Jun 12, 2014 at 07:28:44PM +0200, Oleg Nesterov wrote: > > On 06/11, Paul E. McKenney wrote: > > > > > > On Wed, Jun 11, 2014 at 07:59:34PM +0200, Oleg Nesterov wrote: > > > > On 06/11, Paul E. McKenney wrote: > > > > > > > > > > I was thinking of ->boost_completion as the way to solve it easily, but > > > > > what did you have in mind? > > > > > > > > I meant, rcu_boost() could probably just do "mtx->owner = t", we know that > > > > it was unlocked by us and nobody else can use it until we set > > > > t->rcu_boost_mutex. > > > > > > My concern with this is that rcu_read_unlock_special() could hypothetically > > > get preempted (either by kernel or hypervisor), so that it might be a long > > > time until it makes its reference. But maybe that reference would be > > > harmless in this case. > > > > Confused... Not sure I understand what did you mean, and certainly I do not > > understand how this connects to the proxy-locking method. > > > > Could you explain? > > Here is the hypothetical sequence of events, which cannot happen unless > the CPU releasing the lock accesses the structure after releasing > the lock: > > CPU 0 CPU 1 (booster) > > releases boost_mutex > > acquires boost_mutex > releases boost_mutex > post-release boost_mutex access? > Loops to next task to boost > proxy-locks boost_mutex > > post-release boost_mutex access: > confused due to proxy-lock > operation? 
> > Now maybe this ends up being safe, but it sure feels like an accident > waiting to happen. Some bright developer comes up with a super-fast > handoff, and blam, RCU priority boosting takes it in the shorts. ;-) > In contrast, using the completion prevents this. > > > > > And if we move it into rcu_node, then we can probably kill ->rcu_boost_mutex, > > > > rcu_read_unlock_special() could check rnp->boost_mutex->owner == current. > > > > > > If this was anywhere near a hot code path, I would be sorely tempted. > > > > Ah, but I didn't mean perfomance. I think it is always good to try to remove > > something from task_struct, it is huge. I do not mean sizeof() in the first > > place, the very fact that I can hardly understand the purpose of a half of its > > members makes me sad ;) > > Now -that- just might make a huge amount of sense! Let's see... > > o We hold the rcu_node structure's ->lock when checking the owner > (looks like rt_mutex_owner() is the right API). > > o We hold the rcu_node structure's ->lock when doing > rt_mutex_init_proxy_locked(). > > o We -don't- hold ->lock when releasing the rt_mutex, but that > should be OK: The owner is releasing it, and it is going to > not-owned, so no other task can possibly see ownership moving > to/from them. > > o The rcu_node structure grows a bit, but not enough to worry > about, and on most systems, the decrease in task_struct size > will more than outweigh the increase in rcu_node size. > > Looks quite promising, how about the following? (Hey, it builds, so it > must be correct, right?) True. Why should we have users if we would test the crap we produce? Just FYI, I have a patch pending which makes the release safe. http://marc.info/?l=linux-kernel&m=140251240630730&w=2 Thanks, tglx ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) 2014-06-12 21:40 ` Thomas Gleixner @ 2014-06-12 22:27 ` Paul E. McKenney 2014-06-12 23:19 ` Paul E. McKenney 2014-06-13 14:55 ` Oleg Nesterov 1 sibling, 1 reply; 63+ messages in thread From: Paul E. McKenney @ 2014-06-12 22:27 UTC (permalink / raw) To: Thomas Gleixner Cc: Oleg Nesterov, Steven Rostedt, Linus Torvalds, LKML, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams On Thu, Jun 12, 2014 at 11:40:07PM +0200, Thomas Gleixner wrote: > On Thu, 12 Jun 2014, Paul E. McKenney wrote: > > On Thu, Jun 12, 2014 at 07:28:44PM +0200, Oleg Nesterov wrote: > > > On 06/11, Paul E. McKenney wrote: > > > > > > > > On Wed, Jun 11, 2014 at 07:59:34PM +0200, Oleg Nesterov wrote: > > > > > On 06/11, Paul E. McKenney wrote: > > > > > > > > > > > > I was thinking of ->boost_completion as the way to solve it easily, but > > > > > > what did you have in mind? > > > > > > > > > > I meant, rcu_boost() could probably just do "mtx->owner = t", we know that > > > > > it was unlocked by us and nobody else can use it until we set > > > > > t->rcu_boost_mutex. > > > > > > > > My concern with this is that rcu_read_unlock_special() could hypothetically > > > > get preempted (either by kernel or hypervisor), so that it might be a long > > > > time until it makes its reference. But maybe that reference would be > > > > harmless in this case. > > > > > > Confused... Not sure I understand what did you mean, and certainly I do not > > > understand how this connects to the proxy-locking method. > > > > > > Could you explain? > > > > Here is the hypothetical sequence of events, which cannot happen unless > > the CPU releasing the lock accesses the structure after releasing > > the lock: > > > > CPU 0 CPU 1 (booster) > > > > releases boost_mutex > > > > acquires boost_mutex > > releases boost_mutex > > post-release boost_mutex access? 
> > Loops to next task to boost > > proxy-locks boost_mutex > > > > post-release boost_mutex access: > > confused due to proxy-lock > > operation? > > > > Now maybe this ends up being safe, but it sure feels like an accident > > waiting to happen. Some bright developer comes up with a super-fast > > handoff, and blam, RCU priority boosting takes it in the shorts. ;-) > > In contrast, using the completion prevents this. > > > > > > > And if we move it into rcu_node, then we can probably kill ->rcu_boost_mutex, > > > > > rcu_read_unlock_special() could check rnp->boost_mutex->owner == current. > > > > > > > > If this was anywhere near a hot code path, I would be sorely tempted. > > > > > > Ah, but I didn't mean perfomance. I think it is always good to try to remove > > > something from task_struct, it is huge. I do not mean sizeof() in the first > > > place, the very fact that I can hardly understand the purpose of a half of its > > > members makes me sad ;) > > > > Now -that- just might make a huge amount of sense! Let's see... > > > > o We hold the rcu_node structure's ->lock when checking the owner > > (looks like rt_mutex_owner() is the right API). > > > > o We hold the rcu_node structure's ->lock when doing > > rt_mutex_init_proxy_locked(). > > > > o We -don't- hold ->lock when releasing the rt_mutex, but that > > should be OK: The owner is releasing it, and it is going to > > not-owned, so no other task can possibly see ownership moving > > to/from them. > > > > o The rcu_node structure grows a bit, but not enough to worry > > about, and on most systems, the decrease in task_struct size > > will more than outweigh the increase in rcu_node size. > > > > Looks quite promising, how about the following? (Hey, it builds, so it > > must be correct, right?) > > True. Why should we have users if we would test the crap we produce? Well, it seems to be passing initial tests as well. Must be my tests need more work. 
> Just FYI, I have a patch pending which makes the release safe. > > http://marc.info/?l=linux-kernel&m=140251240630730&w=2 Very good, belt -and- suspenders! Might even work that way. ;-) I could argue for a cpu_relax() in the "while (!rt_mutex_has_waiters(lock))" loop for the case in which the CPU enqueuing itself is preempted, possibly by a hypervisor, but either way: Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Thanx, Paul ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) 2014-06-12 22:27 ` Paul E. McKenney @ 2014-06-12 23:19 ` Paul E. McKenney 2014-06-13 15:08 ` Oleg Nesterov 0 siblings, 1 reply; 63+ messages in thread From: Paul E. McKenney @ 2014-06-12 23:19 UTC (permalink / raw) To: Thomas Gleixner Cc: Oleg Nesterov, Steven Rostedt, Linus Torvalds, LKML, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams On Thu, Jun 12, 2014 at 03:27:48PM -0700, Paul E. McKenney wrote: > On Thu, Jun 12, 2014 at 11:40:07PM +0200, Thomas Gleixner wrote: [ . . . ] > > True. Why should we have users if we would test the crap we produce? > > Well, it seems to be passing initial tests as well. Must be my tests > need more work. Or, as in this case, must be that I should have my test machine using the right git-tree branch and correct kernel config. :-/ Updated patch below, FWIW. Thanx, Paul ------------------------------------------------------------------------ rcu: Simplify priority boosting by putting rt_mutex in rcu_node RCU priority boosting currently checks for boosting via a pointer in task_struct. However, this is not needed: As Oleg noted, if the rt_mutex is placed in the rcu_node instead of on the booster's stack, the boostee can simply check it to see if it owns the lock. This commit makes this change, shrinking task_struct by one pointer and the kernel by twelve lines. Suggested-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Paul E.
McKenney <paulmck@linux.vnet.ibm.com> b/include/linux/init_task.h | 9 +-------- b/include/linux/sched.h | 6 ------ b/kernel/rcu/tree.h | 3 +++ b/kernel/rcu/tree_plugin.h | 20 +++++++++----------- 4 files changed, 13 insertions(+), 25 deletions(-) diff --git a/include/linux/init_task.h b/include/linux/init_task.h index 6df7f9fe0d01..2bb4c4f3531a 100644 --- a/include/linux/init_task.h +++ b/include/linux/init_task.h @@ -102,12 +102,6 @@ extern struct group_info init_groups; #define INIT_IDS #endif -#ifdef CONFIG_RCU_BOOST -#define INIT_TASK_RCU_BOOST() \ - .rcu_boost_mutex = NULL, -#else -#define INIT_TASK_RCU_BOOST() -#endif #ifdef CONFIG_TREE_PREEMPT_RCU #define INIT_TASK_RCU_TREE_PREEMPT() \ .rcu_blocked_node = NULL, @@ -119,8 +113,7 @@ extern struct group_info init_groups; .rcu_read_lock_nesting = 0, \ .rcu_read_unlock_special = 0, \ .rcu_node_entry = LIST_HEAD_INIT(tsk.rcu_node_entry), \ - INIT_TASK_RCU_TREE_PREEMPT() \ - INIT_TASK_RCU_BOOST() + INIT_TASK_RCU_TREE_PREEMPT() #else #define INIT_TASK_RCU_PREEMPT(tsk) #endif diff --git a/include/linux/sched.h b/include/linux/sched.h index 25f54c79f757..1ffb275976da 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1222,9 +1222,6 @@ struct task_struct { #ifdef CONFIG_TREE_PREEMPT_RCU struct rcu_node *rcu_blocked_node; #endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */ -#ifdef CONFIG_RCU_BOOST - struct rt_mutex *rcu_boost_mutex; -#endif /* #ifdef CONFIG_RCU_BOOST */ #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) struct sched_info sched_info; @@ -1961,9 +1958,6 @@ static inline void rcu_copy_process(struct task_struct *p) #ifdef CONFIG_TREE_PREEMPT_RCU p->rcu_blocked_node = NULL; #endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */ -#ifdef CONFIG_RCU_BOOST - p->rcu_boost_mutex = NULL; -#endif /* #ifdef CONFIG_RCU_BOOST */ INIT_LIST_HEAD(&p->rcu_node_entry); } diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index 31194ee9dfa6..db3f096ed80b 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ 
-177,6 +177,9 @@ struct rcu_node { /* to carry out the boosting is fully */ /* released with no future boostee accesses */ /* before that rt_mutex is re-initialized. */ + struct rt_mutex boost_mtx; + /* Used only for the priority-boosting */ + /* side effect, not as a lock. */ unsigned long boost_time; /* When to start boosting (jiffies). */ struct task_struct *boost_kthread_task; diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 99743e9ea8ed..c93c525b71fe 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -29,6 +29,7 @@ #include <linux/oom.h> #include <linux/smpboot.h> #include "../time/tick-internal.h" +#include "../locking/rtmutex_common.h" #define RCU_KTHREAD_PRIO 1 @@ -398,11 +399,9 @@ void rcu_read_unlock_special(struct task_struct *t) #ifdef CONFIG_RCU_BOOST if (&t->rcu_node_entry == rnp->boost_tasks) rnp->boost_tasks = np; - /* Snapshot/clear ->rcu_boost_mutex with rcu_node lock held. */ - if (t->rcu_boost_mutex) { - rbmp = t->rcu_boost_mutex; - t->rcu_boost_mutex = NULL; - } + /* Snapshot/clear ->boost_mutex with rcu_node lock held. */ + if (rt_mutex_owner(&rnp->boost_mtx) == t) + rbmp = &rnp->boost_mtx; #endif /* #ifdef CONFIG_RCU_BOOST */ /* @@ -1151,7 +1150,6 @@ static void rcu_wake_cond(struct task_struct *t, int status) static int rcu_boost(struct rcu_node *rnp) { unsigned long flags; - struct rt_mutex mtx; struct task_struct *t; struct list_head *tb; @@ -1202,14 +1200,14 @@ static int rcu_boost(struct rcu_node *rnp) * section. */ t = container_of(tb, struct task_struct, rcu_node_entry); - rt_mutex_init_proxy_locked(&mtx, t); - t->rcu_boost_mutex = &mtx; + rt_mutex_init_proxy_locked(&rnp->boost_mtx, t); init_completion(&rnp->boost_completion); raw_spin_unlock_irqrestore(&rnp->lock, flags); - rt_mutex_lock(&mtx); /* Side effect: boosts task t's priority. */ - rt_mutex_unlock(&mtx); /* Keep lockdep happy. */ + /* Lock only for side effect: boosts task t's priority. 
*/ + rt_mutex_lock(&rnp->boost_mtx); + rt_mutex_unlock(&rnp->boost_mtx); /* Then keep lockdep happy. */ - /* Wait until boostee is done accessing mtx before reinitializing. */ + /* Wait for boostee to be done w/boost_mtx before reinitializing. */ wait_for_completion(&rnp->boost_completion); return ACCESS_ONCE(rnp->exp_tasks) != NULL || ^ permalink raw reply related [flat|nested] 63+ messages in thread
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) 2014-06-12 23:19 ` Paul E. McKenney @ 2014-06-13 15:08 ` Oleg Nesterov 2014-06-15 5:40 ` Paul E. McKenney 0 siblings, 1 reply; 63+ messages in thread From: Oleg Nesterov @ 2014-06-13 15:08 UTC (permalink / raw) To: Paul E. McKenney Cc: Thomas Gleixner, Steven Rostedt, Linus Torvalds, LKML, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams On 06/12, Paul E. McKenney wrote: > > @@ -398,11 +399,9 @@ void rcu_read_unlock_special(struct task_struct *t) > #ifdef CONFIG_RCU_BOOST > if (&t->rcu_node_entry == rnp->boost_tasks) > rnp->boost_tasks = np; > - /* Snapshot/clear ->rcu_boost_mutex with rcu_node lock held. */ > - if (t->rcu_boost_mutex) { > - rbmp = t->rcu_boost_mutex; > - t->rcu_boost_mutex = NULL; > - } > + /* Snapshot/clear ->boost_mutex with rcu_node lock held. */ > + if (rt_mutex_owner(&rnp->boost_mtx) == t) > + rbmp = &rnp->boost_mtx; The comment above looks confusing after this change ;) We do not clear it, and it doesn't explain "with rcu_node lock held". And, with or without this change it is not obvious why do we need "rbmp", after this patch this becomes even more unobvious. This is subjective of course, but perhaps it would be more understandable to do bool xxx; ... // Check this under rcu_node lock to ensure that unlock below // can't race with rt_mutex_init_proxy_locked() in progress. xxx = rt_mutex_owner(&rnp->boost_mtx) == t; ... // rnp->lock was dropped if (xxx) rt_mutex_unlock(&rnp->boost_mtx); But this is very minor, I won't insist of course. Mostly I am just trying to check my understanding. Oleg. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) 2014-06-13 15:08 ` Oleg Nesterov @ 2014-06-15 5:40 ` Paul E. McKenney 2014-06-17 18:57 ` Paul E. McKenney 0 siblings, 1 reply; 63+ messages in thread From: Paul E. McKenney @ 2014-06-15 5:40 UTC (permalink / raw) To: Oleg Nesterov Cc: Thomas Gleixner, Steven Rostedt, Linus Torvalds, LKML, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams On Fri, Jun 13, 2014 at 05:08:30PM +0200, Oleg Nesterov wrote: > On 06/12, Paul E. McKenney wrote: > > > > @@ -398,11 +399,9 @@ void rcu_read_unlock_special(struct task_struct *t) > > #ifdef CONFIG_RCU_BOOST > > if (&t->rcu_node_entry == rnp->boost_tasks) > > rnp->boost_tasks = np; > > - /* Snapshot/clear ->rcu_boost_mutex with rcu_node lock held. */ > > - if (t->rcu_boost_mutex) { > > - rbmp = t->rcu_boost_mutex; > > - t->rcu_boost_mutex = NULL; > > - } > > + /* Snapshot/clear ->boost_mutex with rcu_node lock held. */ > > + if (rt_mutex_owner(&rnp->boost_mtx) == t) > > + rbmp = &rnp->boost_mtx; > > The comment above looks confusing after this change ;) We do not clear it, > and it doesn't explain "with rcu_node lock held". > > And, with or without this change it is not obvious why do we need "rbmp", > after this patch this becomes even more unobvious. > > This is subjective of course, but perhaps it would be more understandable > to do > > bool xxx; > > ... > > // Check this under rcu_node lock to ensure that unlock below > // can't race with rt_mutex_init_proxy_locked() in progress. > xxx = rt_mutex_owner(&rnp->boost_mtx) == t; > > ... > > // rnp->lock was dropped > if (xxx) > rt_mutex_unlock(&rnp->boost_mtx); > > > But this is very minor, I won't insist of course. Mostly I am just trying > to check my understanding. No, this is good, and I will update accordingly. Thanx, Paul ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) 2014-06-15 5:40 ` Paul E. McKenney @ 2014-06-17 18:57 ` Paul E. McKenney 2014-06-18 16:43 ` Oleg Nesterov 0 siblings, 1 reply; 63+ messages in thread From: Paul E. McKenney @ 2014-06-17 18:57 UTC (permalink / raw) To: Oleg Nesterov Cc: Thomas Gleixner, Steven Rostedt, Linus Torvalds, LKML, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams On Sat, Jun 14, 2014 at 10:40:58PM -0700, Paul E. McKenney wrote: > On Fri, Jun 13, 2014 at 05:08:30PM +0200, Oleg Nesterov wrote: > > On 06/12, Paul E. McKenney wrote: > > > > > > @@ -398,11 +399,9 @@ void rcu_read_unlock_special(struct task_struct *t) > > > #ifdef CONFIG_RCU_BOOST > > > if (&t->rcu_node_entry == rnp->boost_tasks) > > > rnp->boost_tasks = np; > > > - /* Snapshot/clear ->rcu_boost_mutex with rcu_node lock held. */ > > > - if (t->rcu_boost_mutex) { > > > - rbmp = t->rcu_boost_mutex; > > > - t->rcu_boost_mutex = NULL; > > > - } > > > + /* Snapshot/clear ->boost_mutex with rcu_node lock held. */ > > > + if (rt_mutex_owner(&rnp->boost_mtx) == t) > > > + rbmp = &rnp->boost_mtx; > > > > The comment above looks confusing after this change ;) We do not clear it, > > and it doesn't explain "with rcu_node lock held". > > > > And, with or without this change it is not obvious why do we need "rbmp", > > after this patch this becomes even more unobvious. > > > > This is subjective of course, but perhaps it would be more understandable > > to do > > > > bool xxx; > > > > ... > > > > // Check this under rcu_node lock to ensure that unlock below > > // can't race with rt_mutex_init_proxy_locked() in progress. > > xxx = rt_mutex_owner(&rnp->boost_mtx) == t; > > > > ... > > > > // rnp->lock was dropped > > if (xxx) > > rt_mutex_unlock(&rnp->boost_mtx); > > > > > > But this is very minor, I won't insist of course. Mostly I am just trying > > to check my understanding. > > No, this is good, and I will update accordingly. 
I suppose I could have included the patch... Thanx, Paul ------------------------------------------------------------------------ rcu: Simplify priority boosting by putting rt_mutex in rcu_node RCU priority boosting currently checks for boosting via a pointer in task_struct. However, this is not needed: As Oleg noted, if the rt_mutex is placed in the rcu_node instead of on the booster's stack, the boostee can simply check it to see if it owns the lock. This commit makes this change, shrinking task_struct by one pointer and the kernel by thirteen lines. Suggested-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> diff --git a/include/linux/init_task.h b/include/linux/init_task.h index 6df7f9fe0d01..2bb4c4f3531a 100644 --- a/include/linux/init_task.h +++ b/include/linux/init_task.h @@ -102,12 +102,6 @@ extern struct group_info init_groups; #define INIT_IDS #endif -#ifdef CONFIG_RCU_BOOST -#define INIT_TASK_RCU_BOOST() \ - .rcu_boost_mutex = NULL, -#else -#define INIT_TASK_RCU_BOOST() -#endif #ifdef CONFIG_TREE_PREEMPT_RCU #define INIT_TASK_RCU_TREE_PREEMPT() \ .rcu_blocked_node = NULL, @@ -119,8 +113,7 @@ extern struct group_info init_groups; .rcu_read_lock_nesting = 0, \ .rcu_read_unlock_special = 0, \ .rcu_node_entry = LIST_HEAD_INIT(tsk.rcu_node_entry), \ - INIT_TASK_RCU_TREE_PREEMPT() \ - INIT_TASK_RCU_BOOST() + INIT_TASK_RCU_TREE_PREEMPT() #else #define INIT_TASK_RCU_PREEMPT(tsk) #endif diff --git a/include/linux/sched.h b/include/linux/sched.h index 306f4f0c987a..3cfbc05e66e6 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1270,9 +1270,6 @@ struct task_struct { #ifdef CONFIG_TREE_PREEMPT_RCU struct rcu_node *rcu_blocked_node; #endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */ -#ifdef CONFIG_RCU_BOOST - struct rt_mutex *rcu_boost_mutex; -#endif /* #ifdef CONFIG_RCU_BOOST */ #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) struct sched_info sched_info; @@ -2009,9 +2006,6 @@ static inline void
rcu_copy_process(struct task_struct *p) #ifdef CONFIG_TREE_PREEMPT_RCU p->rcu_blocked_node = NULL; #endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */ -#ifdef CONFIG_RCU_BOOST - p->rcu_boost_mutex = NULL; -#endif /* #ifdef CONFIG_RCU_BOOST */ INIT_LIST_HEAD(&p->rcu_node_entry); } diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index 31194ee9dfa6..db3f096ed80b 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -177,6 +177,9 @@ struct rcu_node { /* to carry out the boosting is fully */ /* released with no future boostee accesses */ /* before that rt_mutex is re-initialized. */ + struct rt_mutex boost_mtx; + /* Used only for the priority-boosting */ + /* side effect, not as a lock. */ unsigned long boost_time; /* When to start boosting (jiffies). */ struct task_struct *boost_kthread_task; diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index dc98cacfef21..d8ae20f5ca87 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -29,6 +29,7 @@ #include <linux/oom.h> #include <linux/smpboot.h> #include "../time/tick-internal.h" +#include "../locking/rtmutex_common.h" #define RCU_KTHREAD_PRIO 1 @@ -336,7 +337,7 @@ void rcu_read_unlock_special(struct task_struct *t) unsigned long flags; struct list_head *np; #ifdef CONFIG_RCU_BOOST - struct rt_mutex *rbmp = NULL; + bool drop_boost_mutex = false; #endif /* #ifdef CONFIG_RCU_BOOST */ struct rcu_node *rnp; int special; @@ -398,11 +399,8 @@ void rcu_read_unlock_special(struct task_struct *t) #ifdef CONFIG_RCU_BOOST if (&t->rcu_node_entry == rnp->boost_tasks) rnp->boost_tasks = np; - /* Snapshot/clear ->rcu_boost_mutex with rcu_node lock held. */ - if (t->rcu_boost_mutex) { - rbmp = t->rcu_boost_mutex; - t->rcu_boost_mutex = NULL; - } + /* Snapshot ->boost_mtx ownership with rcu_node lock held. 
*/ + drop_boost_mutex = rt_mutex_owner(&rnp->boost_mtx) == t; #endif /* #ifdef CONFIG_RCU_BOOST */ /* @@ -427,8 +425,8 @@ void rcu_read_unlock_special(struct task_struct *t) #ifdef CONFIG_RCU_BOOST /* Unboost if we were boosted. */ - if (rbmp) { - rt_mutex_unlock(rbmp); + if (drop_boost_mutex) { + rt_mutex_unlock(&rnp->boost_mtx); complete(&rnp->boost_completion); } #endif /* #ifdef CONFIG_RCU_BOOST */ @@ -1151,7 +1149,6 @@ static void rcu_wake_cond(struct task_struct *t, int status) static int rcu_boost(struct rcu_node *rnp) { unsigned long flags; - struct rt_mutex mtx; struct task_struct *t; struct list_head *tb; @@ -1202,14 +1199,14 @@ static int rcu_boost(struct rcu_node *rnp) * section. */ t = container_of(tb, struct task_struct, rcu_node_entry); - rt_mutex_init_proxy_locked(&mtx, t); - t->rcu_boost_mutex = &mtx; + rt_mutex_init_proxy_locked(&rnp->boost_mtx, t); init_completion(&rnp->boost_completion); raw_spin_unlock_irqrestore(&rnp->lock, flags); - rt_mutex_lock(&mtx); /* Side effect: boosts task t's priority. */ - rt_mutex_unlock(&mtx); /* Keep lockdep happy. */ + /* Lock only for side effect: boosts task t's priority. */ + rt_mutex_lock(&rnp->boost_mtx); + rt_mutex_unlock(&rnp->boost_mtx); /* Then keep lockdep happy. */ - /* Wait until boostee is done accessing mtx before reinitializing. */ + /* Wait for boostee to be done w/boost_mtx before reinitializing. */ wait_for_completion(&rnp->boost_completion); return ACCESS_ONCE(rnp->exp_tasks) != NULL || ^ permalink raw reply related [flat|nested] 63+ messages in thread
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) 2014-06-17 18:57 ` Paul E. McKenney @ 2014-06-18 16:43 ` Oleg Nesterov 2014-06-18 16:53 ` Steven Rostedt 2014-06-18 17:00 ` Paul E. McKenney 0 siblings, 2 replies; 63+ messages in thread From: Oleg Nesterov @ 2014-06-18 16:43 UTC (permalink / raw) To: Paul E. McKenney Cc: Thomas Gleixner, Steven Rostedt, Linus Torvalds, LKML, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams On 06/17, Paul E. McKenney wrote: > > + if (drop_boost_mutex) { > + rt_mutex_unlock(&rnp->boost_mtx); > complete(&rnp->boost_completion); Well, I still do not understand this ->boost_completion... > - /* Wait until boostee is done accessing mtx before reinitializing. */ > + /* Wait for boostee to be done w/boost_mtx before reinitializing. */ > wait_for_completion(&rnp->boost_completion); OK, at least we have a comment. But let me repeat. Thomas has already fixed rt_mutex, unlock is atomic. It doesn't touch this memory after it makes another lock() possible. And (contrary to what I said initially) we can rely on this because -rt converts spinlock_t into rt_mutex ? Oleg. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) 2014-06-18 16:43 ` Oleg Nesterov @ 2014-06-18 16:53 ` Steven Rostedt 2014-06-21 19:54 ` Thomas Gleixner 2014-06-18 17:00 ` Paul E. McKenney 1 sibling, 1 reply; 63+ messages in thread From: Steven Rostedt @ 2014-06-18 16:53 UTC (permalink / raw) To: Oleg Nesterov Cc: Paul E. McKenney, Thomas Gleixner, Linus Torvalds, LKML, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams On Wed, 18 Jun 2014 18:43:59 +0200 Oleg Nesterov <oleg@redhat.com> wrote: > And (contrary to what I said initially) we can rely on this because -rt > converts spinlock_t into rt_mutex ? Correct. Because if spinlock_t has this behavior, rt_mutex must have it too, otherwise -rt will suffer greatly from that. Who knows, maybe this will fix some strange bug reports that we have had in the past. -- Steve ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) 2014-06-18 16:53 ` Steven Rostedt @ 2014-06-21 19:54 ` Thomas Gleixner 0 siblings, 0 replies; 63+ messages in thread From: Thomas Gleixner @ 2014-06-21 19:54 UTC (permalink / raw) To: Steven Rostedt Cc: Oleg Nesterov, Paul E. McKenney, Linus Torvalds, LKML, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams On Wed, 18 Jun 2014, Steven Rostedt wrote: > On Wed, 18 Jun 2014 18:43:59 +0200 > Oleg Nesterov <oleg@redhat.com> wrote: > > > > And (contrary to what I said initially) we can rely on this because -rt > > converts spinlock_t into rt_mutex ? > > Correct. Because if spinlock_t has this behavior, rt_mutex must have it > too, otherwise -rt will suffer greatly from that. Who knows, maybe this > will fix some strange bug reports that we have had in the past. Indeed. I found a few backtraces from Carstens test farm, where stuff explodes in the slowpath raw_spin_unlock call. Happens once a year or never ... Thanks, tglx ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) 2014-06-18 16:43 ` Oleg Nesterov 2014-06-18 16:53 ` Steven Rostedt @ 2014-06-18 17:00 ` Paul E. McKenney 1 sibling, 0 replies; 63+ messages in thread From: Paul E. McKenney @ 2014-06-18 17:00 UTC (permalink / raw) To: Oleg Nesterov Cc: Thomas Gleixner, Steven Rostedt, Linus Torvalds, LKML, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams On Wed, Jun 18, 2014 at 06:43:59PM +0200, Oleg Nesterov wrote: > On 06/17, Paul E. McKenney wrote: > > > > + if (drop_boost_mutex) { > > + rt_mutex_unlock(&rnp->boost_mtx); > > complete(&rnp->boost_completion); > > Well, I still do not understand this ->boost_completion... > > > - /* Wait until boostee is done accessing mtx before reinitializing. */ > > + /* Wait for boostee to be done w/boost_mtx before reinitializing. */ > > wait_for_completion(&rnp->boost_completion); > > OK, at least we have a comment. > > But let me repeat. Thomas has already fixed rt_mutex, unlock is atomic. > It doesn't touch this memory after it makes another lock() possible. > > And (contrary to what I said initially) we can rely on this because -rt > converts spinlock_t into rt_mutex ? Well, perhaps I should rein in my paranoia on this one. That said, the cost of my paranoia is minimal in this slowpath. Thanx, Paul ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) 2014-06-12 21:40 ` Thomas Gleixner 2014-06-12 22:27 ` Paul E. McKenney @ 2014-06-13 14:55 ` Oleg Nesterov 2014-06-13 16:10 ` Thomas Gleixner 1 sibling, 1 reply; 63+ messages in thread From: Oleg Nesterov @ 2014-06-13 14:55 UTC (permalink / raw) To: Thomas Gleixner Cc: Paul E. McKenney, Steven Rostedt, Linus Torvalds, LKML, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams On 06/12, Thomas Gleixner wrote: > > Just FYI, I have a patch pending which makes the release safe. > > http://marc.info/?l=linux-kernel&m=140251240630730&w=2 Aha, great. I didn't see this version, and just in case I think it is fine. Perhaps you can look at http://marc.info/?t=140207370200003 ? Oleg. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) 2014-06-13 14:55 ` Oleg Nesterov @ 2014-06-13 16:10 ` Thomas Gleixner 2014-06-13 16:19 ` Oleg Nesterov 0 siblings, 1 reply; 63+ messages in thread From: Thomas Gleixner @ 2014-06-13 16:10 UTC (permalink / raw) To: Oleg Nesterov Cc: Paul E. McKenney, Steven Rostedt, Linus Torvalds, LKML, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams On Fri, 13 Jun 2014, Oleg Nesterov wrote: > On 06/12, Thomas Gleixner wrote: > > > > Just FYI, I have a patch pending which makes the release safe. > > > > http://marc.info/?l=linux-kernel&m=140251240630730&w=2 > > Aha, great. I didn't see this version, and just in case I think it is fine. > > Perhaps you can look at http://marc.info/?t=140207370200003 ? Looks good. Should I queue it with the other rtmutex stuff? Thanks, tglx ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) 2014-06-13 16:10 ` Thomas Gleixner @ 2014-06-13 16:19 ` Oleg Nesterov 0 siblings, 0 replies; 63+ messages in thread From: Oleg Nesterov @ 2014-06-13 16:19 UTC (permalink / raw) To: Thomas Gleixner Cc: Paul E. McKenney, Steven Rostedt, Linus Torvalds, LKML, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams On 06/13, Thomas Gleixner wrote: > > On Fri, 13 Jun 2014, Oleg Nesterov wrote: > > > Perhaps you can look at http://marc.info/?t=140207370200003 ? > > Looks good. Should I queue it with the other rtmutex stuff? That would be great, thanks ;) Oleg. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) 2014-06-12 20:35 ` Paul E. McKenney 2014-06-12 21:40 ` Thomas Gleixner @ 2014-06-13 14:52 ` Oleg Nesterov 1 sibling, 0 replies; 63+ messages in thread From: Oleg Nesterov @ 2014-06-13 14:52 UTC (permalink / raw) To: Paul E. McKenney Cc: Thomas Gleixner, Steven Rostedt, Linus Torvalds, LKML, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams On 06/12, Paul E. McKenney wrote: > > On Thu, Jun 12, 2014 at 07:28:44PM +0200, Oleg Nesterov wrote: > > On 06/11, Paul E. McKenney wrote: > > > > > > On Wed, Jun 11, 2014 at 07:59:34PM +0200, Oleg Nesterov wrote: > > > > On 06/11, Paul E. McKenney wrote: > > > > > > > > > > I was thinking of ->boost_completion as the way to solve it easily, but > > > > > what did you have in mind? > > > > > > > > I meant, rcu_boost() could probably just do "mtx->owner = t", we know that > > > > it was unlocked by us and nobody else can use it until we set > > > > t->rcu_boost_mutex. > > > > > > My concern with this is that rcu_read_unlock_special() could hypothetically > > > get preempted (either by kernel or hypervisor), so that it might be a long > > > time until it makes its reference. But maybe that reference would be > > > harmless in this case. > > > > Confused... Not sure I understand what did you mean, and certainly I do not > > understand how this connects to the proxy-locking method. > > > > Could you explain? > > Here is the hypothetical sequence of events, which cannot happen unless > the CPU releasing the lock accesses the structure after releasing > the lock: > > CPU 0 CPU 1 (booster) > > releases boost_mutex > > acquires boost_mutex > releases boost_mutex > post-release boost_mutex access? > Loops to next task to boost > proxy-locks boost_mutex > > post-release boost_mutex access: > confused due to proxy-lock > operation? But this is the same problem I was worried about. 
Should be fixed by the patch from Thomas, but let's ignore this; let's assume that we should not rely on this (and that is why you added the completion). My point was, in this case we can separate "init" and "proxy_locked" and solve the problem. If we initialize this mutex once, then "->owner = t" should always be safe: whatever "post-release" above does, we were able to lock (and unlock) this mutex, so nobody else (including "post-release") can play with ->owner. But given that Thomas fixed rt_mutex, I think this all doesn't matter. And I obviously like your last patch which moves boost_mtx into rcu_node ;) > Now maybe this ends up being safe, but it sure feels like an accident > waiting to happen. Some bright developer comes up with a super-fast > handoff, Yes, initially I thought the same, but then I changed my mind. The rt kernel uses rt_mutex instead of spinlock_t, and this means that rt_mutex_unlock() must be atomic just as spin_unlock() is. > o We -don't- hold ->lock when releasing the rt_mutex, but that > should be OK: The owner is releasing it, and it is going to > not-owned, so no other task can possibly see ownership moving > to/from them. Yes. I have a very minor nit here; I'll reply to the last version. Oleg. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) 2014-06-11 17:07 ` Oleg Nesterov 2014-06-11 17:17 ` Oleg Nesterov @ 2014-06-11 17:27 ` Paul E. McKenney 1 sibling, 0 replies; 63+ messages in thread From: Paul E. McKenney @ 2014-06-11 17:27 UTC (permalink / raw) To: Oleg Nesterov Cc: Thomas Gleixner, Steven Rostedt, Linus Torvalds, LKML, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams On Wed, Jun 11, 2014 at 07:07:05PM +0200, Oleg Nesterov wrote: > On 06/11, Paul E. McKenney wrote: > > > > @@ -1202,10 +1204,14 @@ static int rcu_boost(struct rcu_node *rnp) > > t = container_of(tb, struct task_struct, rcu_node_entry); > > rt_mutex_init_proxy_locked(&mtx, t); > > t->rcu_boost_mutex = &mtx; > > + init_completion(&rnp->boost_completion); > > can't rcu_init_one() do this? but this is minor, It could, but I would have to define yet another init-time function under CONFIG_RCU_BOOST and not. Yeah, lazy... > > raw_spin_unlock_irqrestore(&rnp->lock, flags); > > rt_mutex_lock(&mtx); /* Side effect: boosts task t's priority. */ > > rt_mutex_unlock(&mtx); /* Keep lockdep happy. */ > > > > + /* Wait until boostee is done accessing mtx before reinitializing. */ > > + wait_for_completion(&rnp->boost_completion); > > + > > I must have missed something, I dont understand why we need ->boost_completion. Because rt_mutex_init_proxy_locked() stomps on mtx periodically. Which might happen to be work at the moment, but doesn't seem like a particularly good thing. > What if you simply move that rt_mutex into rcu_node ? > > Or. Given that rcu_boost_kthread() never exits, it can declare this mutex > on stack and pass the pointer to rcu_boost() ? It is true that moving mtx to either the rcu_node structure or to rcu_boost_kthread()'s stack frame would preserve type safety, but not initialization safety. Or maybe I am being too paranoid? Thanx, Paul ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) 2014-06-10 8:53 ` Thomas Gleixner 2014-06-10 16:57 ` Oleg Nesterov @ 2014-06-10 17:07 ` Oleg Nesterov 2014-06-10 17:51 ` Thomas Gleixner 1 sibling, 1 reply; 63+ messages in thread From: Oleg Nesterov @ 2014-06-10 17:07 UTC (permalink / raw) To: Thomas Gleixner Cc: Steven Rostedt, Linus Torvalds, Paul E. McKenney, LKML, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams On 06/10, Thomas Gleixner wrote: > > +static inline bool unlock_rt_mutex_safe(struct rt_mutex *lock) > + __releases(lock->wait_lock) > +{ > + unsigned long owner, *p = (unsigned long *) &lock->owner; > + > + owner = (unsigned long) rt_mutex_owner(lock); > + clear_rt_mutex_waiters(lock); > + raw_spin_unlock(&lock->wait_lock); > + /* > + * If a new waiter comes in between the unlock and the cmpxchg > + * we have two situations: > + * > + * unlock(wait_lock); > + * lock(wait_lock); > + * cmpxchg(p, owner, 0) == owner > + * mark_rt_mutex_waiters(lock); > + * acquire(lock); > + * or: > + * > + * unlock(wait_lock); > + * lock(wait_lock); > + * mark_rt_mutex_waiters(lock); > + * > + * cmpxchg(p, owner, 0) != owner > + * enqueue_waiter(); > + * unlock(wait_lock); > + * lock(wait_lock); > + * wake waiter(); > + * unlock(wait_lock); > + * lock(wait_lock); > + * acquire(lock); > + */ > + return rt_mutex_cmpxchg(p, owner, 0); Wait, but this looks like a typo. rt_mutex_cmpxchg() needs "struct rt_mutex *", not "long *". It seems that you should simply kill "*p" above. Oleg. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) 2014-06-10 17:07 ` Oleg Nesterov @ 2014-06-10 17:51 ` Thomas Gleixner 0 siblings, 0 replies; 63+ messages in thread From: Thomas Gleixner @ 2014-06-10 17:51 UTC (permalink / raw) To: Oleg Nesterov Cc: Steven Rostedt, Linus Torvalds, Paul E. McKenney, LKML, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams On Tue, 10 Jun 2014, Oleg Nesterov wrote: > On 06/10, Thomas Gleixner wrote: > > > > +static inline bool unlock_rt_mutex_safe(struct rt_mutex *lock) > > + __releases(lock->wait_lock) > > +{ > > + unsigned long owner, *p = (unsigned long *) &lock->owner; > > + > > + owner = (unsigned long) rt_mutex_owner(lock); > > + clear_rt_mutex_waiters(lock); > > + raw_spin_unlock(&lock->wait_lock); > > + /* > > + * If a new waiter comes in between the unlock and the cmpxchg > > + * we have two situations: > > + * > > + * unlock(wait_lock); > > + * lock(wait_lock); > > + * cmpxchg(p, owner, 0) == owner > > + * mark_rt_mutex_waiters(lock); > > + * acquire(lock); > > + * or: > > + * > > + * unlock(wait_lock); > > + * lock(wait_lock); > > + * mark_rt_mutex_waiters(lock); > > + * > > + * cmpxchg(p, owner, 0) != owner > > + * enqueue_waiter(); > > + * unlock(wait_lock); > > + * lock(wait_lock); > > + * wake waiter(); > > + * unlock(wait_lock); > > + * lock(wait_lock); > > + * acquire(lock); > > + */ > > + return rt_mutex_cmpxchg(p, owner, 0); > > Wait, but this looks like a typo. rt_mutex_cmpxchg() needs "struct rt_mutex *", > not "long *". It seems that you should simply kill "*p" above. Uurgh. I had cmpxchg there first and then changed it. You're right, I can kill the p magic. Bah, thats what you get sending out stuff just before jumping into a car. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) 2014-06-09 18:51 ` Linus Torvalds 2014-06-09 19:41 ` Steven Rostedt @ 2014-06-10 12:56 ` Paul E. McKenney 2014-06-10 14:48 ` Peter Zijlstra 2014-06-10 15:35 ` Linus Torvalds 1 sibling, 2 replies; 63+ messages in thread From: Paul E. McKenney @ 2014-06-10 12:56 UTC (permalink / raw) To: Linus Torvalds Cc: Steven Rostedt, Oleg Nesterov, LKML, Thomas Gleixner, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams On Mon, Jun 09, 2014 at 11:51:09AM -0700, Linus Torvalds wrote: > On Mon, Jun 9, 2014 at 11:29 AM, Steven Rostedt <rostedt@goodmis.org> wrote: > >> > >> And once again, note that the normal mutex is already unsafe (unless I missed > >> something). > > > > Is it unsafe? > > > > This thread was started because of a bug we triggered in -rt, which > > ended up being a change specific to -rt that modified the way slub > > handled SLAB_DESTROY_BY_RCU. What else was wrong with it? > > There's a different issue with freeing of mutexes, which is not a bug, > but "by design". Namely that mutexes aren't truly "atomic". They are > complex data structures, and they have issues that a spinlock does not > have. > > When unlocking a mutex, the thread doing the unlocking will still > touch the mutex itself _after_ another thread could already have > successfully acquired the mutex. This is not a problem in any normal > use. since all of this is perfectly coherent in general, but it means > that code sequences like: > > mutex_lock(mem->mutex); > kill_it = !--mem->refcount; > mutex_unlock(mem->mutex); > if (kill_it) > free(mem); > > are fundamentally buggy. > > Note that if you think of mutexes as truly indivisible atomic > operations, the above is "obviously correct": the last person who got > the mutex marked it for killing. 
But the fact is, the *next-to-last* > mutex acquirer may still actively be in the *non-indivisible* > mutex_unlock() when the last person frees it, resulting in a > use-after-free. And yes, we've had this bug, and as far as I know it's > possible that the RT code *introduces* this bug when it changes > spinlocks into mutexes. Because we do exactly the above code sequence > with spinlocks. So just replacing spinlocks with mutexes is a very > subtly buggy thing to do in general. > > Another example of this kind of situation is using a mutex as a > completion event: that's simply buggy. Again, it's because mutexes are > complex data structures, and you have a very similar use-after-free > issue. It's why the fundamental data structure for a "struct > completion" is a spinlock, not a mutex. > > Again, in *theory*, a completion could be just a mutex that starts out > locked, and then the completer completes it by just unlocking it. The > person who waits for a completion does so by just asking for a lock. > Obvious, simple, and trivial, right? Not so, because of the *exact* > same issue above: the completer (who does an "unlock") may still be > accessing the completion data structure when the completion waiter has > successfully gotten the lock. So the standard thing of the completer > freeing the underlying completion memory (which is often on the stack, > so "freeing" is just he act of going out of scope of the liveness of > the completion data structure) would not work if the completion was a > mutex. > > This is subtle, and it is basically unavoidable. If a mutex (or > counting semaphore) has a fast-path - and a mutex/semaphore without a > fast-path is shit - then this issue will exist. Exactly because the > fast-path will depend on just one part of the whole big mutex > structure, and the slow-path will have other pieces to it. 
> > There might be reasonable ways to avoid this issue (having the > fastpath locking field share memory with the slow-path locking, for > example), but it's not how our semaphores and mutexes work, and I > suspect it cannot be the case in general (because it limits you too > badly in how to implement the mutex). As a result, this is all "by > design" as opposed to being a bug. So to safely free a structure containing a mutex, is there some better approach than the following? mutex_lock(mem->mutex); kill_it = !--mem->refcount; rcu_read_lock(); mutex_unlock(mem->mutex); rcu_read_unlock(); if (kill_it) kfree_rcu(mem, rh); /* rh is the rcu_head field in mem. */ For example, is there some other way to know that all the prior lock releases have finished their post-release accesses? Thanx, Paul ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) 2014-06-10 12:56 ` Paul E. McKenney @ 2014-06-10 14:48 ` Peter Zijlstra 2014-06-10 15:18 ` Paul E. McKenney 2014-06-10 15:35 ` Linus Torvalds 1 sibling, 1 reply; 63+ messages in thread From: Peter Zijlstra @ 2014-06-10 14:48 UTC (permalink / raw) To: Paul E. McKenney Cc: Linus Torvalds, Steven Rostedt, Oleg Nesterov, LKML, Thomas Gleixner, Andrew Morton, Ingo Molnar, Clark Williams [-- Attachment #1: Type: text/plain, Size: 2160 bytes --] On Tue, Jun 10, 2014 at 05:56:55AM -0700, Paul E. McKenney wrote: > On Mon, Jun 09, 2014 at 11:51:09AM -0700, Linus Torvalds wrote: > > This is subtle, and it is basically unavoidable. If a mutex (or > > counting semaphore) has a fast-path - and a mutex/semaphore without a > > fast-path is shit - then this issue will exist. Exactly because the > > fast-path will depend on just one part of the whole big mutex > > structure, and the slow-path will have other pieces to it. > > > > There might be reasonable ways to avoid this issue (having the > > fastpath locking field share memory with the slow-path locking, for > > example), but it's not how our semaphores and mutexes work, and I > > suspect it cannot be the case in general (because it limits you too > > badly in how to implement the mutex). As a result, this is all "by > > design" as opposed to being a bug. > > So to safely free a structure containing a mutex, is there some better > approach than the following? > > mutex_lock(mem->mutex); > kill_it = !--mem->refcount; > rcu_read_lock(); > mutex_unlock(mem->mutex); > rcu_read_unlock(); > if (kill_it) > kfree_rcu(mem, rh); /* rh is the rcu_head field in mem. */ > > For example, is there some other way to know that all the prior lock > releases have finished their post-release accesses? So Thomas posted a patch curing rt_mutex, and for that we really _have_ to because it needs to replace a spinlock_t. 
But for the regular mutex it's better (from a performance pov) not to do this. By releasing early and checking for pending waiters later we allow earlier lock stealing, which is good for throughput. Back to your example, I think your example is misleading in that it states: 'a structure containing a mutex'. The problem only arises when that mutex is used as part of the life-time management of said structure. If it has regular (atomic_t or atomic_long_t or spinlock_t) reference counting, we know the mutex_unlock() must have completed by the time we do put_*(), and if that put was the last one, there cannot have been another, otherwise your reference counting is broken. [-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 63+ messages in thread

* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) 2014-06-10 14:48 ` Peter Zijlstra @ 2014-06-10 15:18 ` Paul E. McKenney 0 siblings, 0 replies; 63+ messages in thread From: Paul E. McKenney @ 2014-06-10 15:18 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Steven Rostedt, Oleg Nesterov, LKML, Thomas Gleixner, Andrew Morton, Ingo Molnar, Clark Williams On Tue, Jun 10, 2014 at 04:48:30PM +0200, Peter Zijlstra wrote: > On Tue, Jun 10, 2014 at 05:56:55AM -0700, Paul E. McKenney wrote: > > On Mon, Jun 09, 2014 at 11:51:09AM -0700, Linus Torvalds wrote: > > > > This is subtle, and it is basically unavoidable. If a mutex (or > > > counting semaphore) has a fast-path - and a mutex/semaphore without a > > > fast-path is shit - then this issue will exist. Exactly because the > > > fast-path will depend on just one part of the whole big mutex > > > structure, and the slow-path will have other pieces to it. > > > > > > There might be reasonable ways to avoid this issue (having the > > > fastpath locking field share memory with the slow-path locking, for > > > example), but it's not how our semaphores and mutexes work, and I > > > suspect it cannot be the case in general (because it limits you too > > > badly in how to implement the mutex). As a result, this is all "by > > > design" as opposed to being a bug. > > > > So to safely free a structure containing a mutex, is there some better > > approach than the following? > > > > mutex_lock(mem->mutex); > > kill_it = !--mem->refcount; > > rcu_read_lock(); > > mutex_unlock(mem->mutex); > > rcu_read_unlock(); > > if (kill_it) > > kfree_rcu(mem, rh); /* rh is the rcu_head field in mem. */ > > > > For example, is there some other way to know that all the prior lock > > releases have finished their post-release accesses? > > So Thomas posted a patch curing rt_mutex, and for that we really _have_ > to because it needs to replace a spinlock_t. 
But for the regular mutex > its better (from a performance pov) to not do this. > > By releasing early and checking for pending waiters later we allow > earlier lock stealing, which is good for throughput. > > Back to your example, I think your example is misleading in that it > states: 'a structure containing a mutex'. The problem only arises when > that mutex is used as part of the life-time management of said > structure. Hey, I just minimally modified the example from this thread! ;-) > If it has regular (atomic_t or atomic_long_t or spinlock_t) reference > counting, we know the mutex_unlock() must have competed by the time we > do put_*(), and if that put was the last one, there cannot have been > another, otherwise your reference counting is broken. So your point is that we need to have some other lifetime management mechanism for the structure, and that whatever that is, we need to release -after- our unlock completes, correct? Which is in fact what I did with the rcu_read_unlock() above, so we might actually be in agreement here. Thanx, Paul ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) 2014-06-10 12:56 ` Paul E. McKenney 2014-06-10 14:48 ` Peter Zijlstra @ 2014-06-10 15:35 ` Linus Torvalds 2014-06-10 16:15 ` Paul E. McKenney 1 sibling, 1 reply; 63+ messages in thread From: Linus Torvalds @ 2014-06-10 15:35 UTC (permalink / raw) To: Paul McKenney Cc: Steven Rostedt, Oleg Nesterov, LKML, Thomas Gleixner, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams On Tue, Jun 10, 2014 at 5:56 AM, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: > > So to safely free a structure containing a mutex, is there some better > approach than the following? Yes. A data structure *containing* a mutex is fine. The problem only occurs when that mutex also acts as the lock for the data structure. As a result, there are two fixes for the "locks are not one single atomic field": (a) don't do self-locking objects (b) use spinlocks for these "self-locking" objects And quite frankly, (a) is the normal "solution" to the problem. The fact is, having the data structure contain its own lifetime lock is unusual, and generally broken. The *normal* sequence for freeing something should be that the last access to it is the atomic reference count access: .. do whatever, including "unlock(&mem->lock)" .. if (atomic_dec_and_test(&mem->refcount)) .. we can now free it .. and that's safe. It doesn't matter if "mem" had a mutex in it, it doesn't matter if you walked around three times widdershins with a dead chicken around your neck. You can do whatever you want, the above is fine (and you shouldn't even need to worry about CPU memory ordering, because the alloc/free had better have the barriers to make sure that nothing can leak from a free to the next allocation, although that is another discussion perhaps worth having). The whole notion of having the lock that protects the lifetime of the data structure inside the structure itself is pretty crazy. 
Because while it is true that we sometimes have the refcount be non-atomic, and instead protected with a lock, that lock is generally always *outside* the object itself, because you want the lock for lookup etc. So having the lock _and_ the refcount be inside the object really is crazy. That said, "crazy" has happened. We do occasionally do it. It's generally a mistake (the last example of this was the pipe thing), but sometimes we do it on purpose (the dcache, for example). You can do lookups without holding a lock (generally using RCU), and taking the lock and incrementing a refcount, but then the lock has to be a spinlock *anyway*, so that's ok. The last case where we actually had this bug (the afore-mentioned pipe thing), the mutex lock wasn't even needed - we had a real spinlock protecting the reference count. The bug was that we did the mutex unlock *after* the spinlock, for no good reason. So it really isn't normally a problem. The RT spinlock conversion people need to be aware of it, but normally you can't even screw this up. Linus ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) 2014-06-10 15:35 ` Linus Torvalds @ 2014-06-10 16:15 ` Paul E. McKenney 0 siblings, 0 replies; 63+ messages in thread From: Paul E. McKenney @ 2014-06-10 16:15 UTC (permalink / raw) To: Linus Torvalds Cc: Steven Rostedt, Oleg Nesterov, LKML, Thomas Gleixner, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams On Tue, Jun 10, 2014 at 08:35:53AM -0700, Linus Torvalds wrote: > On Tue, Jun 10, 2014 at 5:56 AM, Paul E. McKenney > <paulmck@linux.vnet.ibm.com> wrote: > > > > So to safely free a structure containing a mutex, is there some better > > approach than the following? > > Yes. > > A data structure *containing* a mutex is fine. The problem only occurs > when that mutex also acts the lock for the data structure. As a > result, there are two fixes for the "locks are not one single atomic > field": > > (a) don't do self-locking objects > (b) use spinlocks for these "self-locking" objects > > And quite frankly, (a) is the normal "solution" to the problem. > > The fact is, having the data structure contain its own lifetime lock > is unusual, and generally broken. The *normal* sequence for freeing > something should be that the last access to it is the atomic referenc > count access: > > .. do whatever, including "unlock(&mem->lock)" .. > if (atomic_dec_and_test(&mem->refcount)) > .. we can now free it .. > > and that's safe. It doesn't matter if "mem" had a mutex in it, it > doesn't matter if you walked around three times widdershins with a > dead chicken around your neck. You can do whatever you want, the above > is fine (and you shouldn't even need to worry about CPU memory > ordering, because the alloc/free had better have the barriers to make > sure) that nothing can leak from a free to the next allocation, > although that is another discussion perhaps worth having). 
> > The whole notion of having the lock that protects the lifetime of the > data structure inside the structure itself is pretty crazy. Because > while it is true that we sometimes have the refcount be non-atomic, > and instead protected with a lock, that lock is generally always > *outside* the object itself, because you want the lock for lookup etc. > So having the lock _and_ the refcount be inside the object really is > crazy. Agreed! You need to have something external to the object that holds the object down long enough to safely acquire the object's lock, refcount, or whatever. Contention usually rules out that "something" being a global lock or refcount, but per-CPU locks, per-CPU refcounts, RCU, and so on can work. > That said, "crazy" has happened. We do occasionally do it. It's > generally a mistake (the last example of this was the pipe thing), but > sometimes we do it on purpose (the dcache, for example). You can do > lookups without holding a lock (generally using RCU), and taking the > lock and incrementing a refcount, but then the lock has to be a > spinlock *anyway*, so that's ok. And of course you have to acquire the lock before doing your rcu_read_unlock() in that case. > The last case where we actually had this bug (the afore-mentioned pipe > thing), the mutex lock wasn't even needed - we had a real spinlock > protecting the reference count. The bug was that we did the mutex > unlock *after* the spinlock, for no good reason. > > So it really isn't normally a problem. The RT spinlock conversion > people need to be aware of it, but normally you can't even screw this > up. And yes, I might need to add in a reference-count handoff around RCU's use of rt_mutex. Haven't yet come up with a reasonable way to use RCU at that point instead. I suppose I could use SRCU, but... ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) 2014-06-09 18:29 ` Steven Rostedt 2014-06-09 18:51 ` Linus Torvalds @ 2014-06-09 19:04 ` Oleg Nesterov 1 sibling, 0 replies; 63+ messages in thread From: Oleg Nesterov @ 2014-06-09 19:04 UTC (permalink / raw) To: Steven Rostedt Cc: Paul E. McKenney, Linus Torvalds, LKML, Thomas Gleixner, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams On 06/09, Steven Rostedt wrote: > > On Mon, 9 Jun 2014 20:15:53 +0200 > Oleg Nesterov <oleg@redhat.com> wrote: > > > > That would indeed be a bad thing, as it could potentially lead to > > > use-after-free bugs. Though one could argue that any code that resulted > > > in use-after-free would be quite aggressive. But still... > > > > And once again, note that the normal mutex is already unsafe (unless I missed > > something). > > Is it unsafe? Only in the sense that UNLOCK is not atomic. IOW, you can't, say, declare a mutex or semaphore on the stack and use lock/unlock to serialize with another thread. But rt_mutex seems fine in this case, and for example rcu_boost() does this. I do not know whether this is by design, or whether we can rely on it. > This thread was started because of a bug we triggered in -rt, which > ended up being a change specific to -rt that modified the way slub > handled SLAB_DESTROY_BY_RCU. What else was wrong with it? And I specifically changed the subject to avoid confusion with the SLAB_DESTROY_BY_RCU bug we discussed before, but apparently I need to apologize for the confusion again ;) But note that if rt_mutex is changed so that UNLOCK becomes non-atomic in the sense above, then lock_task_sighand()/unlock_task_sighand() will be buggy in -rt. Oleg. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) 2014-06-09 16:26 ` Paul E. McKenney 2014-06-09 18:15 ` Oleg Nesterov @ 2014-06-10 8:37 ` Peter Zijlstra 2014-06-10 12:52 ` Paul E. McKenney 1 sibling, 1 reply; 63+ messages in thread From: Peter Zijlstra @ 2014-06-10 8:37 UTC (permalink / raw) To: Paul E. McKenney Cc: Oleg Nesterov, Linus Torvalds, Steven Rostedt, LKML, Thomas Gleixner, Andrew Morton, Ingo Molnar, Clark Williams [-- Attachment #1: Type: text/plain, Size: 1168 bytes --] On Mon, Jun 09, 2014 at 09:26:13AM -0700, Paul E. McKenney wrote: > That would indeed be a bad thing, as it could potentially lead to > use-after-free bugs. Though one could argue that any code that resulted > in use-after-free would be quite aggressive. But still... Let me hijack this thread for yet another issue... So I had an RCU related use-after-free the other day, and while Sasha was able to trigger it quite easily, I had a multi-day struggle to reproduce. Once I figured out what the exact problem was it was also clear to me why it was so hard for me to reproduce. So normally its easier to trigger races on bigger machines, more cpus, more concurrency, more races, all good. _However_ with RCU the grace period machinery is slower the bigger the machine, so bigger machine, slower grace period, slower RCU free, less likely to hit use-after-free. So I was thinking, and I know you all will go kick me for this because the very last thing we need is what I'm about to propose: more RCU flavours :-). How about an rcu_read_unlock() reference counted RCU variant that's ultra aggressive in doing the callbacks in order to better trigger such issues? [-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) 2014-06-10 8:37 ` Peter Zijlstra @ 2014-06-10 12:52 ` Paul E. McKenney 2014-06-10 13:01 ` Peter Zijlstra 0 siblings, 1 reply; 63+ messages in thread From: Paul E. McKenney @ 2014-06-10 12:52 UTC (permalink / raw) To: Peter Zijlstra Cc: Oleg Nesterov, Linus Torvalds, Steven Rostedt, LKML, Thomas Gleixner, Andrew Morton, Ingo Molnar, Clark Williams On Tue, Jun 10, 2014 at 10:37:26AM +0200, Peter Zijlstra wrote: > On Mon, Jun 09, 2014 at 09:26:13AM -0700, Paul E. McKenney wrote: > > That would indeed be a bad thing, as it could potentially lead to > > use-after-free bugs. Though one could argue that any code that resulted > > in use-after-free would be quite aggressive. But still... > > Let me hijack this thread for yet another issue... So I had an RCU > related use-after-free the other day, and while Sasha was able to > trigger it quite easily, I had a multi-day struggle to reproduce. > > Once I figured out what the exact problem was it was also clear to me > why it was so hard for me to reproduce. > > So normally its easier to trigger races on bigger machines, more cpus, > more concurrency, more races, all good. > > _However_ with RCU the grace period machinery is slower the bigger the > machine, so bigger machine, slower grace period, slower RCU free, less > likely to hit use-after-free. > > So I was thinking, and I know you all will go kick me for this because > the very last thing we need is what I'm about to propose: more RCU > flavours :-). > > How about an rcu_read_unlock() reference counted RCU variant that's > ultra aggressive in doing the callbacks in order to better trigger such > issues? If you are using synchronize_rcu() for the update side, then I suggest rcutorture.gp_exp=1 to force use expediting throughout. Thanx, Paul ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) 2014-06-10 12:52 ` Paul E. McKenney @ 2014-06-10 13:01 ` Peter Zijlstra 2014-06-10 14:36 ` Paul E. McKenney 0 siblings, 1 reply; 63+ messages in thread From: Peter Zijlstra @ 2014-06-10 13:01 UTC (permalink / raw) To: Paul E. McKenney Cc: Oleg Nesterov, Linus Torvalds, Steven Rostedt, LKML, Thomas Gleixner, Andrew Morton, Ingo Molnar, Clark Williams [-- Attachment #1: Type: text/plain, Size: 1726 bytes --] On Tue, Jun 10, 2014 at 05:52:35AM -0700, Paul E. McKenney wrote: > On Tue, Jun 10, 2014 at 10:37:26AM +0200, Peter Zijlstra wrote: > > On Mon, Jun 09, 2014 at 09:26:13AM -0700, Paul E. McKenney wrote: > > > That would indeed be a bad thing, as it could potentially lead to > > > use-after-free bugs. Though one could argue that any code that resulted > > > in use-after-free would be quite aggressive. But still... > > > > Let me hijack this thread for yet another issue... So I had an RCU > > related use-after-free the other day, and while Sasha was able to > > trigger it quite easily, I had a multi-day struggle to reproduce. > > > > Once I figured out what the exact problem was it was also clear to me > > why it was so hard for me to reproduce. > > > > So normally its easier to trigger races on bigger machines, more cpus, > > more concurrency, more races, all good. > > > > _However_ with RCU the grace period machinery is slower the bigger the > > machine, so bigger machine, slower grace period, slower RCU free, less > > likely to hit use-after-free. > > > > So I was thinking, and I know you all will go kick me for this because > > the very last thing we need is what I'm about to propose: more RCU > > flavours :-). > > > > How about an rcu_read_unlock() reference counted RCU variant that's > > ultra aggressive in doing the callbacks in order to better trigger such > > issues? 
> > If you are using synchronize_rcu() for the update side, then I suggest > rcutorture.gp_exp=1 to force use expediting throughout. No such luck, this was regular kfree() from call_rcu(). And the callback execution was typically delayed long enough to never 'see' the use-after-free. [-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) 2014-06-10 13:01 ` Peter Zijlstra @ 2014-06-10 14:36 ` Paul E. McKenney 2014-06-10 15:20 ` Paul E. McKenney 0 siblings, 1 reply; 63+ messages in thread From: Paul E. McKenney @ 2014-06-10 14:36 UTC (permalink / raw) To: Peter Zijlstra Cc: Oleg Nesterov, Linus Torvalds, Steven Rostedt, LKML, Thomas Gleixner, Andrew Morton, Ingo Molnar, Clark Williams On Tue, Jun 10, 2014 at 03:01:38PM +0200, Peter Zijlstra wrote: > On Tue, Jun 10, 2014 at 05:52:35AM -0700, Paul E. McKenney wrote: > > On Tue, Jun 10, 2014 at 10:37:26AM +0200, Peter Zijlstra wrote: > > > On Mon, Jun 09, 2014 at 09:26:13AM -0700, Paul E. McKenney wrote: > > > > That would indeed be a bad thing, as it could potentially lead to > > > > use-after-free bugs. Though one could argue that any code that resulted > > > > in use-after-free would be quite aggressive. But still... > > > > > > Let me hijack this thread for yet another issue... So I had an RCU > > > related use-after-free the other day, and while Sasha was able to > > > trigger it quite easily, I had a multi-day struggle to reproduce. > > > > > > Once I figured out what the exact problem was it was also clear to me > > > why it was so hard for me to reproduce. > > > > > > So normally its easier to trigger races on bigger machines, more cpus, > > > more concurrency, more races, all good. > > > > > > _However_ with RCU the grace period machinery is slower the bigger the > > > machine, so bigger machine, slower grace period, slower RCU free, less > > > likely to hit use-after-free. > > > > > > So I was thinking, and I know you all will go kick me for this because > > > the very last thing we need is what I'm about to propose: more RCU > > > flavours :-). > > > > > > How about an rcu_read_unlock() reference counted RCU variant that's > > > ultra aggressive in doing the callbacks in order to better trigger such > > > issues? 
> > > > If you are using synchronize_rcu() for the update side, then I suggest > > rcutorture.gp_exp=1 to force use expediting throughout. > > No such luck, this was regular kfree() from call_rcu(). And the callback > execution was typically delayed long enough to never 'see' the > use-after-free. Figures. ;-) Well, there is always the approach of booting your big systems with most of the CPUs turned off. Another approach would be to set HZ=10000 or some such, assuming the kernel can actually survive that kind of abuse. Thanx, Paul ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) 2014-06-10 14:36 ` Paul E. McKenney @ 2014-06-10 15:20 ` Paul E. McKenney 0 siblings, 0 replies; 63+ messages in thread From: Paul E. McKenney @ 2014-06-10 15:20 UTC (permalink / raw) To: Peter Zijlstra Cc: Oleg Nesterov, Linus Torvalds, Steven Rostedt, LKML, Thomas Gleixner, Andrew Morton, Ingo Molnar, Clark Williams On Tue, Jun 10, 2014 at 07:36:32AM -0700, Paul E. McKenney wrote: > On Tue, Jun 10, 2014 at 03:01:38PM +0200, Peter Zijlstra wrote: > > On Tue, Jun 10, 2014 at 05:52:35AM -0700, Paul E. McKenney wrote: > > > On Tue, Jun 10, 2014 at 10:37:26AM +0200, Peter Zijlstra wrote: > > > > On Mon, Jun 09, 2014 at 09:26:13AM -0700, Paul E. McKenney wrote: > > > > > That would indeed be a bad thing, as it could potentially lead to > > > > > use-after-free bugs. Though one could argue that any code that resulted > > > > > in use-after-free would be quite aggressive. But still... > > > > > > > > Let me hijack this thread for yet another issue... So I had an RCU > > > > related use-after-free the other day, and while Sasha was able to > > > > trigger it quite easily, I had a multi-day struggle to reproduce. > > > > > > > > Once I figured out what the exact problem was it was also clear to me > > > > why it was so hard for me to reproduce. > > > > > > > > So normally its easier to trigger races on bigger machines, more cpus, > > > > more concurrency, more races, all good. > > > > > > > > _However_ with RCU the grace period machinery is slower the bigger the > > > > machine, so bigger machine, slower grace period, slower RCU free, less > > > > likely to hit use-after-free. > > > > > > > > So I was thinking, and I know you all will go kick me for this because > > > > the very last thing we need is what I'm about to propose: more RCU > > > > flavours :-). 
> > > > > > > > How about an rcu_read_unlock() reference counted RCU variant that's > > > > ultra aggressive in doing the callbacks in order to better trigger such > > > > issues? > > > > > > If you are using synchronize_rcu() for the update side, then I suggest > > > rcutorture.gp_exp=1 to force use expediting throughout. > > > > No such luck, this was regular kfree() from call_rcu(). And the callback > > execution was typically delayed long enough to never 'see' the > > use-after-free. > > Figures. ;-) > > Well, there is always the approach of booting your big systems with most > of the CPUs turned off. Another approach would be to set HZ=10000 or > some such, assuming the kernel can actually survive that kind of abuse. And yet another approach is to have a pair of low-priority processes per CPU that context-switch back and forth to each other if that CPU has nothing else to do. This should get rid of most of the increase in grace-period duration with increasing numbers of CPUs. Thanx, Paul ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [BUG] signal: sighand unprotected when accessed by /proc 2014-06-03 18:03 ` Linus Torvalds 2014-06-03 20:01 ` Oleg Nesterov @ 2014-06-03 20:05 ` Steven Rostedt 2014-06-03 20:09 ` Oleg Nesterov 1 sibling, 1 reply; 63+ messages in thread From: Steven Rostedt @ 2014-06-03 20:05 UTC (permalink / raw) To: Linus Torvalds Cc: Oleg Nesterov, LKML, Thomas Gleixner, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams On Tue, 3 Jun 2014 11:03:49 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote: > An example of (b) is having a lock that is initialized by the > constructor, and that everybody else uses properly: you may have > delayed people locking/unlocking it, but as long as they don't screw > anything else up they won't break even if the data structure was > re-used by new allocations. I'm trying to wrap my head around this. I guess it comes down to this code here in __lock_task_sighand: for (;;) { local_irq_save(*flags); rcu_read_lock(); sighand = rcu_dereference(tsk->sighand); if (unlikely(sighand == NULL)) { rcu_read_unlock(); local_irq_restore(*flags); break; } < Here's the critical point! > spin_lock(&sighand->siglock); if (likely(sighand == tsk->sighand)) { rcu_read_unlock(); break; } spin_unlock(&sighand->siglock); rcu_read_unlock(); local_irq_restore(*flags); } We get the sighand with an rcu_dereference in an rcu_read_lock() protected section, as the page it's on in the slab cache can be freed with rcu, thus we want to make sure what we get is a sighand and not anything else. RCU protects us for this. Now, that sighand can be freed and reallocated as another sighand, but because it's still a sighand, the lock we are spinning on is some sighand->lock pointer, whether it is ours or a new one. Thus when we hit spin_lock(&sighand->siglock) it can be the sighand for the task we want or reallocated and initialized for a new sighand, and we are now suddenly spinning on a new lock which we would probably take. 
In that case, if we take it, then sighand != tsk->sighand and we release the lock and try again. Wow, that's very subtle :-p The lock was switched out right from under us. > > Without looking at the code, it sounds like somebody may be doing things > to "sighand->lock->wait_list" that they shouldn't do. We've had cases > like that before, and most of them have been changed to *not* use > SLAB_DESTROY_BY_RCU, and instead make each individual allocation be > RCU-free'd (which is a lot simpler to think about, because then you > don't have the whole re-use issue). Yes, this is exactly what happened. OK, this is an -rt only bug, not a mainline one :-/ When we convert the spin_lock into an rtmutex, when we hit the lock and it was contended (the task was in the process of exiting and it takes the lock to set tsk->sighand to NULL), instead of spinning, the task adds itself to the lock->wait_list and goes to sleep. Now, if that lock is released and reused (I didn't trace other tasks allocating these locks), it reinitializes the lock->wait_list. But when the cleanup code released the lock, it woke up the task that was in __lock_task_sighand() and that task now tries to take the lock. As the lock is now free, it grabs it and removes itself from the lock->wait_list. But because the lock->wait_list has been re-initialized, it doesn't find itself on the list and we hit the bug. This makes sense as we never crashed, we just triggered the list.h debug warnings. > > Of course, it could easily be lack of any RCU protection too. As > mentioned, I didn't really check the code. I just wanted to clarify > this subtlety, because while I think Oleg knows about it, maybe others > didn't quite catch that subtlety. At least for -rt, it seems we have to convert it that way. Anything more complex than a simple spinning lock will trigger this bug. Can't use normal mutexes either. 
> > And this could easily be an RT issue, if the RT code does some > re-initialization of the rtmutex that replaces the spinlock we have. Yep, I believe this currently is an RT issue. But I would suggest that sighand be freed by a normal rcu call. This reuse looks really subtle and prone for other bugs. Thanks for the explanation! -- Steve ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [BUG] signal: sighand unprotected when accessed by /proc 2014-06-03 20:05 ` [BUG] signal: sighand unprotected when accessed by /proc Steven Rostedt @ 2014-06-03 20:09 ` Oleg Nesterov 2014-06-03 20:15 ` Steven Rostedt 2014-06-03 20:25 ` Steven Rostedt 0 siblings, 2 replies; 63+ messages in thread From: Oleg Nesterov @ 2014-06-03 20:09 UTC (permalink / raw) To: Steven Rostedt Cc: Linus Torvalds, LKML, Thomas Gleixner, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams Steven, I am already sleeping, probably I just need to re-read your email tomorrow, but... On 06/03, Steven Rostedt wrote: > > When we convert the spin_lock into a rtmutex, when we hit the lock and > it was contended (the task was in the process of exiting and it takes > the lock to set tsk->sighand to NULL), instead of spinning, the task > adds itself to the lock->wait_list and goes to sleep. This is clear, > Now, if that lock is released and reused (I didn't trace other tasks > allocating these locks), it reinitializes the lock->wait_list. How? From where? This should be done by sighand_ctor() only? Oleg. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [BUG] signal: sighand unprotected when accessed by /proc 2014-06-03 20:09 ` Oleg Nesterov @ 2014-06-03 20:15 ` Steven Rostedt 2014-06-03 20:25 ` Steven Rostedt 1 sibling, 0 replies; 63+ messages in thread From: Steven Rostedt @ 2014-06-03 20:15 UTC (permalink / raw) To: Oleg Nesterov Cc: Linus Torvalds, LKML, Thomas Gleixner, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams, Luis Claudio R. Goncalves On Tue, 3 Jun 2014 22:09:38 +0200 Oleg Nesterov <oleg@redhat.com> wrote: > > Now, if that lock is released and reused (I didn't trace other tasks > > allocating these locks), it reinitializes the lock->wait_list. > > How? From where? This should be done by sighand_ctor() only? heh, the more I look at this code, the deeper in the rabbit hole I go. OK, I didn't understand this in the first place. I see that sighand_ctor() is only called when the entire slab is created, correct? That means, when you allocate a new sighand, it doesn't get re-initialized, and the sighand->lock should still be the old lock. That makes sense. I'll need to look at this some more :-/ Maybe there's another bug that I'm missing and I'm climbing the wrong tree. -- Steve ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [BUG] signal: sighand unprotected when accessed by /proc 2014-06-03 20:09 ` Oleg Nesterov 2014-06-03 20:15 ` Steven Rostedt @ 2014-06-03 20:25 ` Steven Rostedt 2014-06-03 21:12 ` Thomas Gleixner 1 sibling, 1 reply; 63+ messages in thread From: Steven Rostedt @ 2014-06-03 20:25 UTC (permalink / raw) To: Oleg Nesterov Cc: Linus Torvalds, LKML, Thomas Gleixner, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams, Luis Claudio R. Goncalves On Tue, 3 Jun 2014 22:09:38 +0200 Oleg Nesterov <oleg@redhat.com> wrote: > > Now, if that lock is released and reused (I didn't trace other tasks > > allocating these locks), it reinitializes the lock->wait_list. > > How? From where? This should be done by sighand_ctor() only? This looks definitely like an -rt only bug and it's an obvious one at that :-p Looking in mm/slub.c: slab_alloc_node() we have this: if (unlikely(gfpflags & __GFP_ZERO) && object) memset(object, 0, s->object_size); #ifdef CONFIG_PREEMPT_RT_FULL if (unlikely(s->ctor) && object) s->ctor(object); #endif slab_post_alloc_hook(s, gfpflags, object); return object; } We call the ctor on the object when it is allocated, not when the page is created. Doh! I guess we now know why we shouldn't do that. -- Steve ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [BUG] signal: sighand unprotected when accessed by /proc 2014-06-03 20:25 ` Steven Rostedt @ 2014-06-03 21:12 ` Thomas Gleixner 0 siblings, 0 replies; 63+ messages in thread From: Thomas Gleixner @ 2014-06-03 21:12 UTC (permalink / raw) To: Steven Rostedt Cc: Oleg Nesterov, Linus Torvalds, LKML, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams, Luis Claudio R. Goncalves On Tue, 3 Jun 2014, Steven Rostedt wrote: > On Tue, 3 Jun 2014 22:09:38 +0200 > Oleg Nesterov <oleg@redhat.com> wrote: > > > > Now, if that lock is released and reused (I didn't trace other tasks > > > allocating these locks), it reinitializes the lock->wait_list. > > > > How? From where? This should be done by sighand_ctor() only? > > This looks definitely like an -rt only bug and it's an obvious one at > that :-p > > Looking in mm/slub.c: slab_alloc_node() we have this: > > if (unlikely(gfpflags & __GFP_ZERO) && object) > memset(object, 0, s->object_size); > #ifdef CONFIG_PREEMPT_RT_FULL > if (unlikely(s->ctor) && object) > s->ctor(object); > #endif Looks like the usual git/quilt default artifact. That's why I have QUILT_PATCH_OPTS="--fuzz=0" in my .quiltrc Thanks, tglx ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [BUG] signal: sighand unprotected when accessed by /proc 2014-06-03 17:26 ` Oleg Nesterov 2014-06-03 18:03 ` Linus Torvalds @ 2014-06-03 18:05 ` Steven Rostedt 2014-06-03 19:25 ` Oleg Nesterov 1 sibling, 1 reply; 63+ messages in thread From: Steven Rostedt @ 2014-06-03 18:05 UTC (permalink / raw) To: Oleg Nesterov Cc: LKML, Thomas Gleixner, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams, Linus Torvalds On Tue, 3 Jun 2014 19:26:32 +0200 Oleg Nesterov <oleg@redhat.com> wrote: > On 06/03, Steven Rostedt wrote: > > > > We were able to trigger this bug in -rt, and by review, I'm thinking > > that this could very well be a mainline bug too. I had our QA team add > > a trace patch to the kernel to prove my analysis, and it did. > > > > Here's the patch: > > > > http://rostedt.homelinux.com/private/sighand-trace.patch > > > > Let me try to explain the bug: > > > > > > CPU0 CPU1 > > ---- ---- > > [ read of /proc/<pid>/stat ] > > get_task_struct(); > > [...] > > [ <pid> exits ] > > [ parent does wait on <pid> ] > > wait_task_zombie() > > release_task() > > proc_flush_task() > > /* the above removes new access > > to the /proc system */ > > __exit_signal() > > __cleanup_sighand(sighand); > > atomic_dec_and_test(sighand->count); > > do_task_stat() > > lock_task_sighand(task); > > sighand = rcu_dereference(tsk->sighand); > > > > kmem_cache_free(sighand); > > > > if (sighand != NULL) > > spin_lock(sighand->siglock); > > > > ** BOOM! use after free ** Yes, ->sighand can be already freed at this point, but this should be fine because sighand_cachep is SLAB_DESTROY_BY_RCU. Ah, I didn't notice that. This makes this even more bizarre. You know, this code could use some comments. I may send you a patch, because that __lock_task_sighand() is doing a lot of subtle things and there's not a single comment explaining it :-( > > That is why lock_task_sighand() does rcu_read_lock() and re-checks > sighand == tsk->sighand after it takes ->siglock. 
It is fine if it was already freed or even reallocated via kmem_cache_alloc(sighand_cachep). > We only need to ensure that (SLAB_DESTROY_BY_RCU should ensure this) > this memory won't be returned to the system, so this piece of memory must > be "struct sighand" with the properly initialized ->siglock until > rcu_read_unlock(). OK, this makes __lock_task_sighand() make some more sense. > > > Seems there is no protection between reading the sighand from proc and > > freeing it. The sighand->count is not updated, and the sighand is not > > freed via rcu. > > See above. > > > One, the spinlock in -rt is an rtmutex. The list_del_entry() bug is the > > task trying to remove itself from sighand->lock->wait_list. As the lock > > has been freed, the list head of the rtmutex is corrupted. > > looks like, SLAB_DESTROY_BY_RCU logic is broken? Could be. I'll look to see if we didn't break something. Thanks! -- Steve ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [BUG] signal: sighand unprotected when accessed by /proc 2014-06-03 18:05 ` Steven Rostedt @ 2014-06-03 19:25 ` Oleg Nesterov 2014-06-04 1:16 ` Steven Rostedt 0 siblings, 1 reply; 63+ messages in thread From: Oleg Nesterov @ 2014-06-03 19:25 UTC (permalink / raw) To: Steven Rostedt Cc: LKML, Thomas Gleixner, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams, Linus Torvalds On 06/03, Steven Rostedt wrote: > > You know, this code could use some comments. I may send you a patch, > because that __lock_task_sighand() is doing a lot of subtle things and > there's not a single comment explaining it :-( Yes, agreed. Not only SLAB_DESTROY_BY_RCU is not obvious, local_irq_save() is not clear at all. The latter already has a doc patch from Paul, I'll try to add more comments on top once I see that patch in Linus's tree. But I would be happy if you send the patch ;) And this reminds me... I still think that __lock_task_sighand() should be de-uglified. I already sent the patch, probably I'll resend it. Oleg. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [BUG] signal: sighand unprotected when accessed by /proc 2014-06-03 19:25 ` Oleg Nesterov @ 2014-06-04 1:16 ` Steven Rostedt 2014-06-04 16:31 ` Oleg Nesterov 0 siblings, 1 reply; 63+ messages in thread From: Steven Rostedt @ 2014-06-04 1:16 UTC (permalink / raw) To: Oleg Nesterov Cc: LKML, Thomas Gleixner, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams, Linus Torvalds On Tue, 3 Jun 2014 21:25:25 +0200 Oleg Nesterov <oleg@redhat.com> wrote: > On 06/03, Steven Rostedt wrote: > > > > You know, this code could use some comments. I may send you a patch, > > because that __lock_task_sighand() is doing a lot of subtle things and > > there's not a single comment explaining it :-( > > Yes, agreed. Not only SLAB_DESTROY_BY_RCU is not obvious, local_irq_save() > is not clear at all. The latter already has a doc patch from Paul, I'll try > to add more comments on top once I see that patch in Linus's tree. > > But I would be happy if you send the patch ;) > > And this reminds me... I still think that __lock_task_sighand() should be > de-uglified. I already sent the patch, probably I'll resend it. I'd be happy to document the hell out of that function, but it sounds like you have some updates to it. Is there a git repo somewhere with your latest that I can base a patch against? -- Steve ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [BUG] signal: sighand unprotected when accessed by /proc 2014-06-04 1:16 ` Steven Rostedt @ 2014-06-04 16:31 ` Oleg Nesterov 0 siblings, 0 replies; 63+ messages in thread From: Oleg Nesterov @ 2014-06-04 16:31 UTC (permalink / raw) To: Steven Rostedt Cc: LKML, Thomas Gleixner, Peter Zijlstra, Andrew Morton, Ingo Molnar, Clark Williams, Linus Torvalds On 06/03, Steven Rostedt wrote: > > On Tue, 3 Jun 2014 21:25:25 +0200 > Oleg Nesterov <oleg@redhat.com> wrote: > > > On 06/03, Steven Rostedt wrote: > > > > > > You know, this code could use some comments. I may send you a patch, > > > because that __lock_task_sighand() is doing a lot of subtle things and > > > there's not a single comment explaining it :-( > > > > Yes, agreed. Not only SLAB_DESTROY_BY_RCU is not obvious, local_irq_save() > > is not clear at all. The latter already has a doc patch from Paul, I'll try > > to add more comments on top once I see that patch in Linus's tree. > > > > But I would be happy if you send the patch ;) > > > > And this reminds me... I still think that __lock_task_sighand() should be > > de-uglified. I already sent the patch, probably I'll resend it. > > I'd be happy to document the hell out of that function, but it sounds > like you have some updates to it. Paul has, I don't know where ;) That patch adds the comments to explain the mysterious local_irq_disable() at the start. On top of another patch which documents the subtle problems with preemption and rcu_read_unlock(). As for the cleanup I have, it is nowhere and I'll resend it later. Oleg. ^ permalink raw reply [flat|nested] 63+ messages in thread
end of thread, other threads:[~2014-06-21 19:55 UTC | newest] Thread overview: 63+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2014-06-03 17:02 [BUG] signal: sighand unprotected when accessed by /proc Steven Rostedt 2014-06-03 17:26 ` Oleg Nesterov 2014-06-03 18:03 ` Linus Torvalds 2014-06-03 20:01 ` Oleg Nesterov 2014-06-03 20:03 ` Oleg Nesterov 2014-06-06 20:33 ` Paul E. McKenney 2014-06-08 13:07 ` safety of *mutex_unlock() (Was: [BUG] signal: sighand unprotected when accessed by /proc) Oleg Nesterov 2014-06-09 16:26 ` Paul E. McKenney 2014-06-09 18:15 ` Oleg Nesterov 2014-06-09 18:29 ` Steven Rostedt 2014-06-09 18:51 ` Linus Torvalds 2014-06-09 19:41 ` Steven Rostedt 2014-06-10 8:53 ` Thomas Gleixner 2014-06-10 16:57 ` Oleg Nesterov 2014-06-10 18:08 ` Thomas Gleixner 2014-06-10 18:13 ` Steven Rostedt 2014-06-10 20:05 ` Thomas Gleixner 2014-06-10 20:13 ` Thomas Gleixner 2014-06-11 15:52 ` Paul E. McKenney 2014-06-11 17:07 ` Oleg Nesterov 2014-06-11 17:17 ` Oleg Nesterov 2014-06-11 17:29 ` Paul E. McKenney 2014-06-11 17:59 ` Oleg Nesterov 2014-06-11 19:56 ` Paul E. McKenney 2014-06-12 17:28 ` Oleg Nesterov 2014-06-12 20:35 ` Paul E. McKenney 2014-06-12 21:40 ` Thomas Gleixner 2014-06-12 22:27 ` Paul E. McKenney 2014-06-12 23:19 ` Paul E. McKenney 2014-06-13 15:08 ` Oleg Nesterov 2014-06-15 5:40 ` Paul E. McKenney 2014-06-17 18:57 ` Paul E. McKenney 2014-06-18 16:43 ` Oleg Nesterov 2014-06-18 16:53 ` Steven Rostedt 2014-06-21 19:54 ` Thomas Gleixner 2014-06-18 17:00 ` Paul E. McKenney 2014-06-13 14:55 ` Oleg Nesterov 2014-06-13 16:10 ` Thomas Gleixner 2014-06-13 16:19 ` Oleg Nesterov 2014-06-13 14:52 ` Oleg Nesterov 2014-06-11 17:27 ` Paul E. McKenney 2014-06-10 17:07 ` Oleg Nesterov 2014-06-10 17:51 ` Thomas Gleixner 2014-06-10 12:56 ` Paul E. McKenney 2014-06-10 14:48 ` Peter Zijlstra 2014-06-10 15:18 ` Paul E. McKenney 2014-06-10 15:35 ` Linus Torvalds 2014-06-10 16:15 ` Paul E. 
McKenney 2014-06-09 19:04 ` Oleg Nesterov 2014-06-10 8:37 ` Peter Zijlstra 2014-06-10 12:52 ` Paul E. McKenney 2014-06-10 13:01 ` Peter Zijlstra 2014-06-10 14:36 ` Paul E. McKenney 2014-06-10 15:20 ` Paul E. McKenney 2014-06-03 20:05 ` [BUG] signal: sighand unprotected when accessed by /proc Steven Rostedt 2014-06-03 20:09 ` Oleg Nesterov 2014-06-03 20:15 ` Steven Rostedt 2014-06-03 20:25 ` Steven Rostedt 2014-06-03 21:12 ` Thomas Gleixner 2014-06-03 18:05 ` Steven Rostedt 2014-06-03 19:25 ` Oleg Nesterov 2014-06-04 1:16 ` Steven Rostedt 2014-06-04 16:31 ` Oleg Nesterov