linux-kernel.vger.kernel.org archive mirror
* Found it! (was Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds())
@ 2013-12-02 16:00 Linus Torvalds
  2013-12-02 16:27 ` Ingo Molnar
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Linus Torvalds @ 2013-12-02 16:00 UTC (permalink / raw)
  To: Simon Kirby
  Cc: Ian Applegate, Al Viro, Christoph Lameter, Pekka Enberg, LKML,
	Chris Mason

[-- Attachment #1: Type: text/plain, Size: 2185 bytes --]

On Sat, Nov 30, 2013 at 1:08 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I still don't see what could be wrong with the pipe_inode_info thing,
> but the fact that it's been so consistent in your traces does make me
> suspect it really is *that* particular slab.

I think I finally found it.

I've spent waaayy too much time looking at and thinking about that
code without seeing anything wrong, but this morning I woke up and
thought to myself "What if.."

And looking at the code again, I went "BINGO".

All our reference counting etc seems right, but we have one very
subtle bug: on the freeing path, we have a pattern like this:

        spin_lock(&inode->i_lock);
        if (!--pipe->files) {
                inode->i_pipe = NULL;
                kill = 1;
        }
        spin_unlock(&inode->i_lock);
        __pipe_unlock(pipe);
        if (kill)
                free_pipe_info(pipe);

which on the face of it is trying to be very careful in not accessing
the pipe-info after it is released by having that "kill" flag, and
doing the release last.

And it's complete garbage.

Why?

Because the thread that decrements "pipe->files" *without* releasing
it, will very much access it after it has been dropped: that
"__pipe_unlock(pipe)" happens *after* we've decremented the pipe
reference count and dropped the inode lock. So another CPU can come in
and free the structure concurrently with that __pipe_unlock(pipe).

This happens in two places, and we don't actually need or want the
pipe lock for the pipe->files accesses (since pipe->files is protected
by inode->i_lock, not the pipe lock), so the solution is to just do
the __pipe_unlock() before the whole dance about the pipe->files
reference count.

Patch appended. And no wonder nobody has ever seen it, because the
race is unlikely as hell to ever happen. Simon, I assume it will be
another few months before we can say "yeah, that fixed it", but I
really think this is it. It explains all the symptoms, including
"DEBUG_PAGEALLOC didn't catch it" (because the access happens just as
it is released, and DEBUG_PAGEALLOC takes too long to actually
free/unmap the page etc).

                     Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 859 bytes --]

 fs/pipe.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index d2c45e14e6d8..18f1a4b2dbbc 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -743,13 +743,14 @@ pipe_release(struct inode *inode, struct file *file)
 		kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
 		kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
 	}
+	__pipe_unlock(pipe);
+
 	spin_lock(&inode->i_lock);
 	if (!--pipe->files) {
 		inode->i_pipe = NULL;
 		kill = 1;
 	}
 	spin_unlock(&inode->i_lock);
-	__pipe_unlock(pipe);
 
 	if (kill)
 		free_pipe_info(pipe);
@@ -1130,13 +1131,14 @@ err_wr:
 	goto err;
 
 err:
+	__pipe_unlock(pipe);
+
 	spin_lock(&inode->i_lock);
 	if (!--pipe->files) {
 		inode->i_pipe = NULL;
 		kill = 1;
 	}
 	spin_unlock(&inode->i_lock);
-	__pipe_unlock(pipe);
 	if (kill)
 		free_pipe_info(pipe);
 	return ret;


* Re: Found it! (was Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds())
  2013-12-02 16:00 Found it! (was Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds()) Linus Torvalds
@ 2013-12-02 16:27 ` Ingo Molnar
  2013-12-02 16:46   ` Al Viro
  2013-12-02 17:06 ` Al Viro
  2013-12-03  2:58 ` Linus Torvalds
  2 siblings, 1 reply; 16+ messages in thread
From: Ingo Molnar @ 2013-12-02 16:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Simon Kirby, Ian Applegate, Al Viro, Christoph Lameter,
	Pekka Enberg, LKML, Chris Mason


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sat, Nov 30, 2013 at 1:08 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > I still don't see what could be wrong with the pipe_inode_info thing,
> > but the fact that it's been so consistent in your traces does make me
> > suspect it really is *that* particular slab.
> 
> I think I finally found it.
> 
> I've spent waaayy too much time looking at and thinking about that
> code without seeing anything wrong, but this morning I woke up and
> thought to myself "What if.."
> 
> And looking at the code again, I went "BINGO".
> 
> All our reference counting etc seems right, but we have one very
> subtle bug: on the freeing path, we have a pattern like this:
> 
>         spin_lock(&inode->i_lock);
>         if (!--pipe->files) {
>                 inode->i_pipe = NULL;
>                 kill = 1;
>         }
>         spin_unlock(&inode->i_lock);
>         __pipe_unlock(pipe);
>         if (kill)
>                 free_pipe_info(pipe);
> 
> which on the face of it is trying to be very careful in not accessing
> the pipe-info after it is released by having that "kill" flag, and
> doing the release last.
> 
> And it's complete garbage.
> 
> Why?
> 
> Because the thread that decrements "pipe->files" *without* releasing 
> it, will very much access it after it has been dropped: that 
> "__pipe_unlock(pipe)" happens *after* we've decremented the pipe 
> reference count and dropped the inode lock. So another CPU can come 
> in and free the structure concurrently with that 
> __pipe_unlock(pipe).
> 
> This happens in two places, and we don't actually need or want the 
> pipe lock for the pipe->files accesses (since pipe->files is 
> protected by inode->i_lock, not the pipe lock), so the solution is 
> to just do the __pipe_unlock() before the whole dance about the 
> pipe->files reference count.
> 
> Patch appended. And no wonder nobody has ever seen it, because the 
> race is unlikely as hell to ever happen. Simon, I assume it will be 
> another few months before we can say "yeah, that fixed it", but I 
> really think this is it. It explains all the symptoms, including 
> "DEBUG_PAGEALLOC didn't catch it" (because the access happens just 
> as it is released, and DEBUG_PAGEALLOC takes too long to actually 
> free/unmap the page etc).
> 
>                      Linus
>  fs/pipe.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/pipe.c b/fs/pipe.c
> index d2c45e14e6d8..18f1a4b2dbbc 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -743,13 +743,14 @@ pipe_release(struct inode *inode, struct file *file)
>  		kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
>  		kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
>  	}
> +	__pipe_unlock(pipe);
> +
>  	spin_lock(&inode->i_lock);
>  	if (!--pipe->files) {
>  		inode->i_pipe = NULL;
>  		kill = 1;
>  	}
>  	spin_unlock(&inode->i_lock);
> -	__pipe_unlock(pipe);

I'm wondering why pipe-mutex was introduced. It was done fairly 
recently, with no justification given:

  From 72b0d9aacb89f3759931ec440e1b535671145bb4 Mon Sep 17 00:00:00 2001
  From: Al Viro <viro@zeniv.linux.org.uk>
  Date: Thu, 21 Mar 2013 02:32:24 -0400
  Subject: [PATCH] pipe: don't use ->i_mutex

  now it can be done - put mutex into pipe_inode_info, use it instead
  of ->i_mutex

  Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  ---
   fs/ocfs2/file.c           | 6 ++----
   fs/pipe.c                 | 5 +++--
   include/linux/pipe_fs_i.h | 2 ++
   3 files changed, 7 insertions(+), 6 deletions(-)

It's not like there should be many (any?) VFS operations where a pipe 
is used via i_mutex and pipe->mutex in parallel, which would improve 
scalability - so I don't see the scalability advantage. (But I might 
be missing something)

Barring such kind of workload the extra mutex just adds extra 
micro-costs because now two locks have to be taken on 
creation/destruction, plus it adds extra complexity and races.

So unless I'm missing something obvious, another good fix would be to 
just revert pipe->mutex and rely on i_mutex as before?

Thanks,

	Ingo


* Re: Found it! (was Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds())
  2013-12-02 16:27 ` Ingo Molnar
@ 2013-12-02 16:46   ` Al Viro
  2013-12-02 17:05     ` Ingo Molnar
  0 siblings, 1 reply; 16+ messages in thread
From: Al Viro @ 2013-12-02 16:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Simon Kirby, Ian Applegate, Christoph Lameter,
	Pekka Enberg, LKML, Chris Mason

On Mon, Dec 02, 2013 at 05:27:55PM +0100, Ingo Molnar wrote:

> It's not like there should be many (any?) VFS operations where a pipe 
> is used via i_mutex and pipe->mutex in parallel, which would improve 
> scalability - so I don't see the scalability advantage. (But I might 
> be missing something)
> 
> Barring such kind of workload the extra mutex just adds extra 
> micro-costs because now two locks have to be taken on 
> creation/destruction, plus it adds extra complexity and races.
> 
> So unless I'm missing something obvious, another good fix would be to 
> just revert pipe->mutex and rely on i_mutex as before?

You are missing the extra shitloads of complexity in ->i_mutex ordering,
and ->i_mutex is already used for too many things...


* Re: Found it! (was Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds())
  2013-12-02 16:46   ` Al Viro
@ 2013-12-02 17:05     ` Ingo Molnar
  0 siblings, 0 replies; 16+ messages in thread
From: Ingo Molnar @ 2013-12-02 17:05 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Simon Kirby, Ian Applegate, Christoph Lameter,
	Pekka Enberg, LKML, Chris Mason


* Al Viro <viro@ZenIV.linux.org.uk> wrote:

> On Mon, Dec 02, 2013 at 05:27:55PM +0100, Ingo Molnar wrote:
> 
> > It's not like there should be many (any?) VFS operations where a pipe 
> > is used via i_mutex and pipe->mutex in parallel, which would improve 
> > scalability - so I don't see the scalability advantage. (But I might 
> > be missing something)
> > 
> > Barring such kind of workload the extra mutex just adds extra 
> > micro-costs because now two locks have to be taken on 
> > creation/destruction, plus it adds extra complexity and races.
> > 
> > So unless I'm missing something obvious, another good fix would be to 
> > just revert pipe->mutex and rely on i_mutex as before?
> 
> You are missing the extra shitloads of complexity in ->i_mutex ordering,
> and ->i_mutex is already used for too many things...

Well, AFAICS the split-out did not reduce ordering complexity but 
increased it, at least in the short term: pipe->mutex now has to be 
taken in the right order with i_mutex, the subject of the bug here.

Plus AFAICS where i_mutex was used for pipe-internal purposes we used 
pretty generic facilities like user-copy, signal-sending, wakeups, 
etc. - none of which is really adding complexity to i_mutex ordering, 
as those are always expected to be facilities independent of the VFS 
in the future as well.

Anyway, it's your call obviously.

In any case, what prompted my reply was the overly terse nature of the
changelog; would it make sense to put more verbose reasoning into
changelogs, especially where such a change has a seemingly non-obvious 
(to me) cost/benefit balance?

Thanks,

	Ingo


* Re: Found it! (was Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds())
  2013-12-02 16:00 Found it! (was Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds()) Linus Torvalds
  2013-12-02 16:27 ` Ingo Molnar
@ 2013-12-02 17:06 ` Al Viro
  2013-12-03  2:58 ` Linus Torvalds
  2 siblings, 0 replies; 16+ messages in thread
From: Al Viro @ 2013-12-02 17:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Simon Kirby, Ian Applegate, Christoph Lameter, Pekka Enberg,
	LKML, Chris Mason

On Mon, Dec 02, 2013 at 08:00:38AM -0800, Linus Torvalds wrote:
> On Sat, Nov 30, 2013 at 1:08 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > I still don't see what could be wrong with the pipe_inode_info thing,
> > but the fact that it's been so consistent in your traces does make me
> > suspect it really is *that* particular slab.
> 
> I think I finally found it.
> 
> I've spent waaayy too much time looking at and thinking about that
> code without seeing anything wrong, but this morning I woke up and
> thought to myself "What if.."
> 
> And looking at the code again, I went "BINGO".
> 
> All our reference counting etc seems right, but we have one very
> subtle bug: on the freeing path, we have a pattern like this:
> 
>         spin_lock(&inode->i_lock);
>         if (!--pipe->files) {
>                 inode->i_pipe = NULL;
>                 kill = 1;
>         }
>         spin_unlock(&inode->i_lock);
>         __pipe_unlock(pipe);
>         if (kill)
>                 free_pipe_info(pipe);
> 
> which on the face of it is trying to be very careful in not accessing
> the pipe-info after it is released by having that "kill" flag, and
> doing the release last.
> 
> And it's complete garbage.
> 
> Why?
> 
> Because the thread that decrements "pipe->files" *without* releasing
> it, will very much access it after it has been dropped: that
> "__pipe_unlock(pipe)" happens *after* we've decremented the pipe
> reference count and dropped the inode lock. So another CPU can come in
> and free the structure concurrently with that __pipe_unlock(pipe).
> 
> This happens in two places, and we don't actually need or want the
> pipe lock for the pipe->files accesses (since pipe->files is protected
> by inode->i_lock, not the pipe lock), so the solution is to just do
> the __pipe_unlock() before the whole dance about the pipe->files
> reference count.
> 
> Patch appended. And no wonder nobody has ever seen it, because the
> race is unlikely as hell to ever happen. Simon, I assume it will be
> another few months before we can say "yeah, that fixed it", but I
> really think this is it. It explains all the symptoms, including
> "DEBUG_PAGEALLOC didn't catch it" (because the access happens just as
> it is released, and DEBUG_PAGEALLOC takes too long to actually
> free/unmap the page etc).

Nice catch.  I'd probably add

/* we are holding inode->i_pipe->files */
static void drop_pipe(struct inode *inode, struct pipe_inode_info *pipe)
{
	int kill = 0;
        spin_lock(&inode->i_lock);
	WARN_ON(pipe != inode->i_pipe);
        if (!--pipe->files) {
                inode->i_pipe = NULL;
                kill = 1;
        }
        spin_unlock(&inode->i_lock);
        if (kill)
                free_pipe_info(pipe);
}

and use it in both places...


* Re: Found it! (was Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds())
  2013-12-02 16:00 Found it! (was Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds()) Linus Torvalds
  2013-12-02 16:27 ` Ingo Molnar
  2013-12-02 17:06 ` Al Viro
@ 2013-12-03  2:58 ` Linus Torvalds
  2013-12-03  4:28   ` Al Viro
  2013-12-03  8:52   ` [PATCH] mutexes: Add CONFIG_DEBUG_MUTEX_FASTPATH=y debug variant to debug SMP races Ingo Molnar
  2 siblings, 2 replies; 16+ messages in thread
From: Linus Torvalds @ 2013-12-03  2:58 UTC (permalink / raw)
  To: Simon Kirby, Ingo Molnar, Peter Zijlstra, Waiman Long
  Cc: Ian Applegate, Al Viro, Christoph Lameter, Pekka Enberg, LKML,
	Chris Mason

Once more on this issue, because I've been unable to stop thinking
about it, and noticed that it's actually even subtler than I thought
initially.

On Mon, Dec 2, 2013 at 8:00 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> All our reference counting etc seems right, but we have one very
> subtle bug: on the freeing path, we have a pattern like this:
>
>         spin_lock(&inode->i_lock);
>         if (!--pipe->files) {
>                 inode->i_pipe = NULL;
>                 kill = 1;
>         }
>         spin_unlock(&inode->i_lock);
>         __pipe_unlock(pipe);
>         if (kill)
>                 free_pipe_info(pipe);
>
> which on the face of it is trying to be very careful in not accessing
> the pipe-info after it is released by having that "kill" flag, and
> doing the release last.
>
> And it's complete garbage.
>
> Why?
>
> Because the thread that decrements "pipe->files" *without* releasing
> it, will very much access it after it has been dropped: that
> "__pipe_unlock(pipe)" happens *after* we've decremented the pipe
> reference count and dropped the inode lock. So another CPU can come in
> and free the structure concurrently with that __pipe_unlock(pipe).

So I still think this was the bug, and the problems Simon and Ian saw
should be fixed, but it ends up being even more interesting than just
the above explanation.

Yes, the inode->i_lock is what protects the "pipe->files" counter, and
we do in fact on the allocation side have cases where we update the
counter with just the spinlock held, without holding the pipe mutex.
So when we decrement the 'files' counter and release the i_lock
spinlock, technically it's now freeable.

BUT.

It turns out that on the *releasing* side, the old code actually held
the pipe->mutex in all situations where it was decrementing the
pipe->files counter, so the race we saw happen *shouldn't* have
happened, because even though the i_lock spinlock was the real lock,
the pipe->mutex lock *should* have serialized the two threads
releasing the pipe, and thus the bug shouldn't have been visible.

So we have two threads serialized by that pipe->mutex, and yet
one of them writes to the structure *after* the other one (remember:
serialized!) decides that it can free it.

In other words: while the fix for pipes was simply to move the mutex
unlock up, this actually shows a big fundamental issue with mutexes.
That issue being that you cannot use them (at least as is) to protect
an object reference count. And I wonder if we do that somewhere else.

And I kind of knew about this, because we historically had very strict
rules about the whole "don't use semaphores to implement completions",
which has the exact same issue: a semaphore isn't "safe" wrt
de-allocation, and we have some places that use completions exactly
because completions guarantee that the very last access to them on the
waiter is on the waiter side (who can then free the memory), and all
other users are "safe".

And mutexes and semaphores don't have those guarantees: they just
serialize what is *inside* the mutex/semaphore, but not necessarily
the lock *ITSELF*.

So this "obviously correct" pattern is in fact fundamentally buggy:

   mutex_lock(obj->lock);
   dead = !--obj->refcount;
   mutex_unlock(obj->lock);
   if (dead)
      free(obj);

even though it *looks* safe. Because by definition only the last
thread to get the object mutex will be the one freeing it, right?

Wrong. Because what can happen is (roughly):

    CPU1:

        mutex_lock(obj->lock);
        dead = !--obj->refcount;
        // refcount was 2, is now 1, dead = 0.

.. at this point CPU2 comes in, tries to do the same thing:

        mutex_lock(obj->lock);
        // blocks on obj->lock, goes to slowpath
        // mutex is negative, CPU2 is in optimistic
        //  spinning mode in __mutex_lock_common
        // under #ifdef CONFIG_MUTEX_SPIN_ON_OWNER

-- CPU1 again:

        mutex_unlock(obj->lock);
          __mutex_fastpath_unlock()
            fastpath fails (because mutex is nonpositive)
              __mutex_unlock_slowpath:
                if (__mutex_slowpath_needs_to_unlock())
                        atomic_set(&lock->count, 1);

.. CPU2 - who is spinning, sees this, and does

                if ((atomic_read(&lock->count) == 1) &&
                    (atomic_cmpxchg(&lock->count, 1, 0) == 1)) {

.. and now CPU2 owns the mutex, and goes on to do

        dead = !--obj->refcount;
        // refcount was 1, is now 0, dead = 1.
        mutex_unlock(obj->lock);
        if (dead)
                free(obj);

 - but in the meantime, CPU1 is busy still unlocking:

                if (!list_empty(&lock->wait_list)) {
                        /* get the first entry from the wait-list: */
                        struct mutex_waiter *waiter =
                                        list_entry(lock->wait_list.next,
                                                   struct mutex_waiter, list);

                        debug_mutex_wake_waiter(lock, waiter);

                        wake_up_process(waiter->task);
        }
        spin_unlock_mutex(&lock->wait_lock, flags);

and in particular notice how CPU1 wrote to the lock *after* CPU2 had
gotten the lock and free'd the data structure. In practice, it's only
really that "spin_unlock_mutex(&lock->wait_lock, flags);" that happens
(and this is how we got the single byte being incremented).

In other words, it's unsafe to protect reference counts inside objects
with anything but spinlocks and/or atomic refcounts. Or you have to
have the lock *outside* the object you're protecting (which is often
what you want for other reasons anyway, notably lookup).
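For contrast, here is what the two in-object variants that *are* safe
look like - minimal sketches, using the same hypothetical "obj" as above:

   /* 1) non-atomic count protected by a spinlock, which is what the
    *    pipe code effectively does now via inode->i_lock */
   spin_lock(&obj->lock);
   dead = !--obj->refcount;
   spin_unlock(&obj->lock);
   if (dead)
      free(obj);

   /* 2) plain atomic refcount, no in-object lock needed at all */
   if (atomic_dec_and_test(&obj->refcount))
      free(obj);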

So having a non-atomic refcount protected inside a sleeping lock is a
bug, and that's really the bug that ended up biting us for pipes.

Now, the question is: do we have other such cases? How do we document
this? Do we try to make mutexes and other locks safe to use for things
like this?

              Linus


* Re: Found it! (was Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds())
  2013-12-03  2:58 ` Linus Torvalds
@ 2013-12-03  4:28   ` Al Viro
  2013-12-05  8:12     ` gfs2 deadlock (was Re: Found it) Al Viro
  2013-12-03  8:52   ` [PATCH] mutexes: Add CONFIG_DEBUG_MUTEX_FASTPATH=y debug variant to debug SMP races Ingo Molnar
  1 sibling, 1 reply; 16+ messages in thread
From: Al Viro @ 2013-12-03  4:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Simon Kirby, Ingo Molnar, Peter Zijlstra, Waiman Long,
	Ian Applegate, Christoph Lameter, Pekka Enberg, LKML,
	Chris Mason

On Mon, Dec 02, 2013 at 06:58:57PM -0800, Linus Torvalds wrote:

> In other words, it's unsafe to protect reference counts inside objects
> with anything but spinlocks and/or atomic refcounts. Or you have to
> have the lock *outside* the object you're protecting (which is often
> what you want for other reasons anyway, notably lookup).
> 
> So having a non-atomic refcount protected inside a sleeping lock is a
> bug, and that's really the bug that ended up biting us for pipes.
> 
> Now, the question is: do we have other such cases? How do we document
> this? Do we try to make mutexes and other locks safe to use for things
> like this?

Umm...  AFAICS, in VFS proper we have
	files_struct - atomic_dec_and_test
	fs_struct - spinlock + int
	file - atomic_long_dec_and_test (with delays after that, including
RCU).
	super_block - global spinlock + int (s_count); the mutex in there
(->s_umount) can be taken by anybody who holds an active ref *or* has
bumped ->s_count while holding sb_lock.  Exactly to prevent that kind of
unpleasantness.  Freeing RCU-delayed.
	vfsmount - percpu counter + flag + global seqlock, with quite a bit of
contortions for the sake of avoiding cross-CPU stores on fastpath; discussed
back in October, concluded to be safe.  Freeing RCU-delayed.
	dentry - lockref, with RCU-delayed actual freeing.
	file_system_type, nls_table, linux_binfmt - module refcount of "owner";
search structures protected by global spinlocks or rwlocks, exiting module
is responsible for unregistering first.
	inode - atomic_dec_and_lock, with actual freeing RCU-delayed (and
evicting code waiting for pagecache references to be gone, with the rest
being responsibility of fs method called before we free the sucker)
	block_device - part of bdevfs inode
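For illustration, the inode-style combination from that list (atomic
count, lock taken only on the final put, RCU-delayed freeing) looks
roughly like this - all names below are made up, only for sketching:

	struct foo {
		atomic_t		count;
		struct list_head	hash_entry;
		struct rcu_head		rcu;
	};

	static DEFINE_SPINLOCK(foo_lock);

	static void put_foo(struct foo *f)
	{
		/* take foo_lock only if the count actually hits zero */
		if (!atomic_dec_and_lock(&f->count, &foo_lock))
			return;
		list_del(&f->hash_entry);	/* unhash under the lock */
		spin_unlock(&foo_lock);
		kfree_rcu(f, rcu);		/* actual freeing RCU-delayed */
	}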

These should be safe, but damnit, we really need the lifecycle documented for
all objects - the above is only a part of it (note that for e.g. superblocks
we have additional rules re "->s_active can't be incremented for any reason
once it drops to zero, it can't be incremented until superblock had been
marked 'born' and it crosses over to zero only with ->s_umount held"; there's
6 stages in life cycle of struct super_block and we had interesting bugs due
to messing the transitions up).  The trouble is, attempt to write those down
tends to stray into massive grep session, with usual results - some other
crap gets found (e.g. in some odd driver) and needs to be dealt with ;-/
Sigh...


* [PATCH] mutexes: Add CONFIG_DEBUG_MUTEX_FASTPATH=y debug variant to debug SMP races
  2013-12-03  2:58 ` Linus Torvalds
  2013-12-03  4:28   ` Al Viro
@ 2013-12-03  8:52   ` Ingo Molnar
  2013-12-03 18:10     ` Linus Torvalds
                       ` (2 more replies)
  1 sibling, 3 replies; 16+ messages in thread
From: Ingo Molnar @ 2013-12-03  8:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Simon Kirby, Peter Zijlstra, Waiman Long, Ian Applegate, Al Viro,
	Christoph Lameter, Pekka Enberg, LKML, Chris Mason,
	Thomas Gleixner


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

>  - but in the meantime, CPU1 is busy still unlocking:
> 
>                 if (!list_empty(&lock->wait_list)) {
>                         /* get the first entry from the wait-list: */
>                         struct mutex_waiter *waiter =
>                                         list_entry(lock->wait_list.next,
>                                                    struct mutex_waiter, list);
> 
>                         debug_mutex_wake_waiter(lock, waiter);
> 
>                         wake_up_process(waiter->task);
>         }
>         spin_unlock_mutex(&lock->wait_lock, flags);
> 
> and in particular notice how CPU1 wrote to the lock *after* CPU2 had 
> gotten the lock and free'd the data structure. In practice, it's 
> only really that "spin_unlock_mutex(&lock->wait_lock, flags);" that 
> happens (and this is how we got the single byte being incremented).
> 
> In other words, it's unsafe to protect reference counts inside 
> objects with anything but spinlocks and/or atomic refcounts. Or you 
> have to have the lock *outside* the object you're protecting (which 
> is often what you want for other reasons anyway, notably lookup).

Indeed: this comes from mutex->count being separate from 
mutex->wait_lock, and this should affect every architecture that has a 
mutex->count fast-path implemented (essentially every architecture 
that matters).

Such bugs should also magically go away with mutex debugging enabled.

I'd expect such bugs to be more prominent with unlucky object 
size/alignment: if mutex->count lies on a separate cache line from 
mutex->wait_lock.

Side note: this might be a valid lightweight debugging technique - we 
could add padding between the two fields to force them into separate 
cache lines, without slowing it down.

Simon, would you be willing to try the fairly trivial patch below? 
Please enable CONFIG_DEBUG_MUTEX_FASTPATH=y. Does your kernel fail 
faster that way?

> So having a non-atomic refcount protected inside a sleeping lock is 
> a bug, and that's really the bug that ended up biting us for pipes.
> 
> Now, the question is: do we have other such cases? How do we 
> document this? [...]

I'd not be surprised if we had such cases, especially where 
spinlock->mutex conversions were done. About 20% of the kernel's 
critical sections use mutexes, 60% use spinlocks. (the remaining 20% 
is shared between semaphores, rwlocks and rwsems, in that order of 
frequency.)

> [...] Do we try to make mutexes and other locks safe to use for 
> things like this?

I think we should try that, to make mutexes safer in general.

Can you see a way to do that fairly cheaply?

I can see two approaches, both rather radical:

1)

Eliminate mutex->count and (ab-)use mutex->wait_lock as 'the' mutex 
lock: with TICKET_SLOWPATH_FLAG used to denote waiters or so and care 
taken to not use it as a 'real' spinlock but use the raw accessors.

This would still allow a good mutex_lock() fastpath, as it would 
essentially become spin_trylock() with an asm goto slow path helper 
perhaps.

Doing this would have various advantages:

 - we'd eliminate much (all?) of per arch mutex code
 - we'd share spinlock and mutex low level implementations
 - we'd reduce struct mutex size by 4 bytes

It's still early in the morning so I might be missing something 
trivial though - this sounds suspiciously too easy ;-) Having a proper 
mutex slowpath might not be so easy without changing the spinlock 
code.

2)

Another method would be to do the opposite: eliminate mutex->wait_lock 
[for the non-debug case] and do everything via mutex->count and 
mutex->owner.

This saves even more space and potentially enables a tighter slowpath.

It probably won't hurt the massively parallel case, as we already do 
smart MCS locking via mutex->spin_mlock.

So I'd argue for #2. (Assuming it addresses the problem)

Thanks,

	Ingo

=======================>
Subject: mutexes: Add CONFIG_DEBUG_MUTEX_FASTPATH=y debug variant to debug SMP races
From: Ingo Molnar <mingo@kernel.org>

Help debug SMP ordering issues by forcing mutex->count on a separate 
cache line with mutex->wait_lock. This will make certain classes of races 
more prominent, without slowing down other parts of the mutex code.

Signed-off-by: Ingo Molnar <mingo@kernel.org>

diff --git a/include/linux/mutex.h b/include/linux/mutex.h
index d318193..6952fc4 100644
--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -49,6 +49,15 @@
 struct mutex {
 	/* 1: unlocked, 0: locked, negative: locked, possible waiters */
 	atomic_t		count;
+#ifdef CONFIG_DEBUG_MUTEX_FASTPATH
+	/*
+	 * Debug SMP ordering issues by forcing mutex->count on a separate
+	 * cache line with mutex->wait_lock. This will make certain classes
+	 * of races more prominent, without slowing down other parts of the
+	 * mutex code.
+	 */
+	u8			__cache_padding[64];
+#endif
 	spinlock_t		wait_lock;
 	struct list_head	wait_list;
 #if defined(CONFIG_DEBUG_MUTEXES) || defined(CONFIG_SMP)
@@ -58,7 +67,7 @@ struct mutex {
 	void			*spin_mlock;	/* Spinner MCS lock */
 #endif
 #ifdef CONFIG_DEBUG_MUTEXES
-	const char 		*name;
+	const char		*name;
 	void			*magic;
 #endif
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 6982094..08618a9 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -845,6 +845,18 @@ config DEBUG_SPINLOCK
 	  best used in conjunction with the NMI watchdog so that spinlock
 	  deadlocks are also debuggable.
 
+config DEBUG_MUTEX_FASTPATH
+	bool "Mutex debugging: lightweight mutex race check"
+	depends on DEBUG_KERNEL && !DEBUG_MUTEXES
+	help
+	  This debug feature simply increases struct mutex by 64 bytes and
+	  thus allows certain mutex related SMP races to be detected more
+	  easily.
+
+	  (This feature is disabled if DEBUG_MUTEXES is enabled.)
+
+	  If unsure, say N.
+
 config DEBUG_MUTEXES
 	bool "Mutex debugging: basic checks"
 	depends on DEBUG_KERNEL


* Re: [PATCH] mutexes: Add CONFIG_DEBUG_MUTEX_FASTPATH=y debug variant to debug SMP races
  2013-12-03  8:52   ` [PATCH] mutexes: Add CONFIG_DEBUG_MUTEX_FASTPATH=y debug variant to debug SMP races Ingo Molnar
@ 2013-12-03 18:10     ` Linus Torvalds
  2013-12-04  9:19       ` Simon Kirby
  2013-12-05  6:57     ` Simon Kirby
  2013-12-11 15:03     ` Waiman Long
  2 siblings, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2013-12-03 18:10 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Simon Kirby, Peter Zijlstra, Waiman Long, Ian Applegate, Al Viro,
	Christoph Lameter, Pekka Enberg, LKML, Chris Mason,
	Thomas Gleixner

On Tue, Dec 3, 2013 at 12:52 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> I'd expect such bugs to be more prominent with unlucky object
> size/alignment: if mutex->count lies on a separate cache line from
> mutex->wait_lock.

I doubt that makes much of a difference. It's still just "CPU cycles"
away, and the window will be tiny unless you have multi-socket
machines and/or are just very unlucky.

For stress-testing, it would be much better to use some hack like

    /* udelay a bit if the spinlock isn't contended */
    if (mutex->wait_lock.ticket.head+1 == mutex->wait_lock.ticket.tail)
        udelay(1);

in __mutex_unlock_common() just before the spin_unlock(). Make the
window really *big*.

That said:

>> So having a non-atomic refcount protected inside a sleeping lock is
>> a bug, and that's really the bug that ended up biting us for pipes.
>>
>> Now, the question is: do we have other such cases? How do we
>> document this? [...]
>
> I'd not be surprised if we had such cases, especially where
> spinlock->mutex conversions were done

So I actually don't think we do. Why? Having a sleeping lock inside
the object that protects the refcount is actually hard to do.

You need to increment the refcount when finding the object, but that
in turn tends to imply holding the lock *before* you find it. Or you
have to find it locklessly, which in turn implies RCU, which in turn
implies non-sleeping lock.

And quite frankly, anybody who uses SRCU and a sleeping lock is just
broken to begin with. That's just an abomination.  If the RT codepaths
do something like that, don't even tell me. It's broken and stupid.

So protecting a refcount with a mutex in the same object is really
quite hard. Even the pipe code didn't actually do that, it just
happened to nest the real lock inside the sleeping lock (but that's
also why it was so easy to fix - the sleeping lock wasn't actually
necessary or helpful in the refcounting path).

So my *gut* feel is that nobody does this. But people have been known
to do insane things just because they use too many sleeping locks and
think it's "better".

                Linus


* Re: [PATCH] mutexes: Add CONFIG_DEBUG_MUTEX_FASTPATH=y debug variant to debug SMP races
  2013-12-03 18:10     ` Linus Torvalds
@ 2013-12-04  9:19       ` Simon Kirby
  2013-12-04 21:14         ` Linus Torvalds
  0 siblings, 1 reply; 16+ messages in thread
From: Simon Kirby @ 2013-12-04  9:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Peter Zijlstra, Waiman Long, Ian Applegate, Al Viro,
	Christoph Lameter, Pekka Enberg, LKML, Chris Mason,
	Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 1729 bytes --]

On Tue, Dec 03, 2013 at 10:10:29AM -0800, Linus Torvalds wrote:

> On Tue, Dec 3, 2013 at 12:52 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > I'd expect such bugs to be more prominent with unlucky object
> > size/alignment: if mutex->count lies on a separate cache line from
> > mutex->wait_lock.
> 
> I doubt that makes much of a difference. It's still just "CPU cycles"
> away, and the window will be tiny unless you have multi-socket
> machines and/or are just very unlucky.
> 
> For stress-testing, it would be much better to use some hack like
> 
>     /* udelay a bit if the spinlock isn't contended */
>     if (mutex->wait_lock.ticket.head+1 == mutex->wait_lock.ticket.tail)
>         udelay(1);
> 
> in __mutex_unlock_common() just before the spin_unlock(). Make the
> window really *big*.

I haven't had a chance yet to do much testing of the proposed race fix
and race enlarging patches, but I did write a tool to reproduce the race.
I started it running and left for dinner, and sure enough, it actually
seems to work on plain 3.12 on a Dell PowerEdge r410 w/dual E5520,
reproducing "Poison overwritten" at a rate of about once every 15 minutes
(running 6 in parallel, booted with "slub_debug").

I have no idea if actually relying on tsc alignment between cores and
sockets is a reasonable idea these days, but it seems to work. I first
used a read() on a pipe close()d by the other process to synchronize
them, but this didn't seem to work as well as busy-waiting until the
timestamp counters pass a previously-decided-upon start time.

Meanwhile, I still don't understand how moving the unlock _up_ to cover
less of the code can solve the race, but I will stare at your long
explanation more tomorrow.

Simon-

[-- Attachment #2: piperace.c --]
[-- Type: text/x-csrc, Size: 1652 bytes --]

#include <sys/wait.h>
#include <sys/resource.h>	/* for setrlimit(), struct rlimit, RLIMIT_NOFILE */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>

#define MAX_PIPES 450
#define FORK_CYCLES 2000000
#define CLOSE_CYCLES 100000
#define STAT_SHIFT 6

static inline unsigned long readtsc()
{
	unsigned int low, high;
	asm volatile("rdtsc" : "=a" (low), "=d" (high));
	return low | ((unsigned long)(high) << 32);
}

static int pipefd[MAX_PIPES][2];

int main(int argc, char *argv[]) {
	unsigned long loop, race_start, miss;
	unsigned long misses = 0, races = 0;
	int i;
	pid_t pid;
	struct rlimit rlim = {
		.rlim_cur = MAX_PIPES * 2 + 96,
		.rlim_max = MAX_PIPES * 2 + 96,
	};

	if (setrlimit(RLIMIT_NOFILE, &rlim) != 0)
		perror("setrlimit(RLIMIT_NOFILE)");

	for (loop = 1;;loop++) {
		/*
		 * Make a bunch of pipes
		 */
		for (i = 0;i < MAX_PIPES;i++) {
			if (pipe(pipefd[i]) == -1) {
				perror("pipe()");
				exit(EXIT_FAILURE);
			}
		}

		race_start = readtsc() + FORK_CYCLES;

		asm("":::"memory");

		pid = fork();
		if (pid == -1) {
			perror("fork()");
			exit(EXIT_FAILURE);
		}
		pid = !!pid;

		/*
		 * Close one pipe descriptor per process
		 */
		for (i = 0;i < MAX_PIPES;i++)
			close(pipefd[i][!pid]);

		for (i = 0;i < MAX_PIPES;i++) {
			/*
			 * Line up and try to close at the same time
			 */
			miss = 1;
			while (readtsc() < race_start)
				miss = 0;

			close(pipefd[i][pid]);

			misses+= miss;
			race_start+= CLOSE_CYCLES;
		}
		races+= MAX_PIPES;

		if (!(loop & (~(~0UL << STAT_SHIFT))))
			fprintf(stderr, "%c %lu (%.2f%% false starts)\n",
				"CP"[pid], readtsc(), misses * 100. / races);

		if (pid)
			wait(NULL);	/* Parent */
		else
			exit(0);	/* Child */
	}
}



* Re: [PATCH] mutexes: Add CONFIG_DEBUG_MUTEX_FASTPATH=y debug variant to debug SMP races
  2013-12-04  9:19       ` Simon Kirby
@ 2013-12-04 21:14         ` Linus Torvalds
  2013-12-05  8:06           ` Simon Kirby
  0 siblings, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2013-12-04 21:14 UTC (permalink / raw)
  To: Simon Kirby
  Cc: Ingo Molnar, Peter Zijlstra, Waiman Long, Ian Applegate, Al Viro,
	Christoph Lameter, Pekka Enberg, LKML, Chris Mason,
	Thomas Gleixner

On Wed, Dec 4, 2013 at 1:19 AM, Simon Kirby <sim@hostway.ca> wrote:
>
> Meanwhile, I still don't understand how moving the unlock _up_ to cover
> less of the code can solve the race, but I will stare at your long
> explanation more tomorrow.

The lock we're moving up isn't the lock that actually protects the
whole allocation logic (it's the lock that then protects the pipe
contents when a pipe is *used*). So it's a useless lock, and moving it
up is a good idea regardless (because it makes the locks only protect
the parts they are actually *supposed* to protect).

And while an extraneous lock wouldn't normally hurt, the sleeping locks
(both mutexes and semaphores) aren't actually safe wrt de-allocation -
they protect anything *inside* the lock, but the lock data structure
itself is accessed racily wrt other lockers (in a way that still
leaves the locked region protected, but not the lock itself). If you
care about details, you can walk through my example.

         Linus


* Re: [PATCH] mutexes: Add CONFIG_DEBUG_MUTEX_FASTPATH=y debug variant to debug SMP races
  2013-12-03  8:52   ` [PATCH] mutexes: Add CONFIG_DEBUG_MUTEX_FASTPATH=y debug variant to debug SMP races Ingo Molnar
  2013-12-03 18:10     ` Linus Torvalds
@ 2013-12-05  6:57     ` Simon Kirby
  2013-12-11 15:03     ` Waiman Long
  2 siblings, 0 replies; 16+ messages in thread
From: Simon Kirby @ 2013-12-05  6:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Peter Zijlstra, Waiman Long, Ian Applegate,
	Al Viro, Christoph Lameter, Pekka Enberg, LKML, Chris Mason,
	Thomas Gleixner

On Tue, Dec 03, 2013 at 09:52:33AM +0100, Ingo Molnar wrote:

> Indeed: this comes from mutex->count being separate from 
> mutex->wait_lock, and this should affect every architecture that has a 
> mutex->count fast-path implemented (essentially every architecture 
> that matters).
> 
> Such bugs should also magically go away with mutex debugging enabled.

Confirmed: I ran the reproducer with CONFIG_DEBUG_MUTEXES for a few
hours, and never got a single poison overwritten notice.

> I'd expect such bugs to be more prominent with unlucky object 
> size/alignment: if mutex->count lies on a separate cache line from 
> mutex->wait_lock.
> 
> Side note: this might be a valid light weight debugging technique, we 
> could add padding between the two fields to force them into separate 
> cache lines, without slowing it down.
> 
> Simon, would you be willing to try the fairly trivial patch below? 
> Please enable CONFIG_DEBUG_MUTEX_FASTPATH=y. Does your kernel fail 
> faster that way?

I didn't see much of a change other than the incremented poison byte is
now further in due to the padding, and it shows up in kmalloc-256.

I also tried with Linus' udelay() suggestion, below. With this, there
were many occurrences per second.

Simon-

diff --git a/kernel/mutex.c b/kernel/mutex.c
index d24105b..f65e735 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -25,6 +25,7 @@
 #include <linux/spinlock.h>
 #include <linux/interrupt.h>
 #include <linux/debug_locks.h>
+#include <linux/delay.h>
 
 /*
  * In the DEBUG case we are using the "NULL fastpath" for mutexes,
@@ -740,6 +741,11 @@ __mutex_unlock_common_slowpath(atomic_t *lock_count, int nested)
 		wake_up_process(waiter->task);
 	}
 
+	/* udelay a bit if the spinlock isn't contended */
+	if (lock->wait_lock.rlock.raw_lock.tickets.head + 1 ==
+	    lock->wait_lock.rlock.raw_lock.tickets.tail)
+		udelay(1);
+
 	spin_unlock_mutex(&lock->wait_lock, flags);
 }
 


* Re: [PATCH] mutexes: Add CONFIG_DEBUG_MUTEX_FASTPATH=y debug variant to debug SMP races
  2013-12-04 21:14         ` Linus Torvalds
@ 2013-12-05  8:06           ` Simon Kirby
  0 siblings, 0 replies; 16+ messages in thread
From: Simon Kirby @ 2013-12-05  8:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Peter Zijlstra, Waiman Long, Ian Applegate, Al Viro,
	Christoph Lameter, Pekka Enberg, LKML, Chris Mason,
	Thomas Gleixner

On Wed, Dec 04, 2013 at 01:14:56PM -0800, Linus Torvalds wrote:

> The lock we're moving up isn't the lock that actually protects the
> whole allocation logic (it's the lock that then protects the pipe
> contents when a pipe is *used*). So it's a useless lock, and moving it
> up is a good idea regardless (because it makes the locks only protect
> the parts they are actually *supposed* to protect).
> 
> And while an extraneous lock wouldn't normally hurt, the sleeping locks
> (both mutexes and semaphores) aren't actually safe wrt de-allocation -
> they protect anything *inside* the lock, but the lock data structure
> itself is accessed racily wrt other lockers (in a way that still
> leaves the locked region protected, but not the lock itself). If you
> care about details, you can walk through my example.

Yes, this makes sense now. It was spin_unlock_mutex() on the pipe lock
that itself was already freed and poisoned by another cpu. This
explicit poison check also fires:

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index bf156de..ae425d0 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -159,6 +159,7 @@ static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
 			__ticket_unlock_slowpath(lock, prev);
 	} else
 		__add(&lock->tickets.head, TICKET_LOCK_INC, UNLOCK_LOCK_PREFIX);
+	WARN_ON(*(unsigned int *)&lock->tickets.head == 0x6b6b6b6c);
 }
 
 static inline int arch_spin_is_locked(arch_spinlock_t *lock)

It warns only as often as the poison checking already did, with a stack
of warn_*, __mutex_unlock_slowpath(), mutex_unlock(), pipe_release().

Trying to prove a negative, of course, but I tested with your first fix
overnight and got no errors. Current git (with b0d8d2292160bb63de) also
looks good. I will leave it running for a few days.

Thanks for getting stuck on this one. It was educational, at least!

Simon-


* gfs2 deadlock (was Re: Found it)
  2013-12-03  4:28   ` Al Viro
@ 2013-12-05  8:12     ` Al Viro
  2013-12-05 10:19       ` Steven Whitehouse
  0 siblings, 1 reply; 16+ messages in thread
From: Al Viro @ 2013-12-05  8:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Simon Kirby, Ingo Molnar, Peter Zijlstra, Waiman Long,
	Ian Applegate, Christoph Lameter, Pekka Enberg, LKML,
	Chris Mason, Steven Whitehouse

On Tue, Dec 03, 2013 at 04:28:30AM +0000, Al Viro wrote:

> These should be safe, but damnit, we really need the lifecycle documented for
> all objects - the above is only a part of it (note that for e.g. superblocks
> we have additional rules re "->s_active can't be incremented for any reason
> once it drops to zero, it can't be incremented until superblock had been
> marked 'born' and it crosses over to zero only with ->s_umount held"; there's
> 6 stages in life cycle of struct super_block and we had interesting bugs due
> to messing the transitions up).  The trouble is, attempt to write those down
> tends to stray into massive grep session, with usual results - some other
> crap gets found (e.g. in some odd driver) and needs to be dealt with ;-/
> Sigh...

... and sure enough, this time is no different - gfs2 sysfs-related code
cheerfully violates lifetime rules for superblocks, which would've
caused a major mess later, if it had not immediately caused a deadlock
on the same superblock ;-/

Watch: gfs2 creates a bunch of files in sysfs (/sys/fs/gfs2/<devname>/*).
Said bunch gets removed from ->put_super().  Which is called under
->s_umount.  Guess what happens if somebody tries to write "1" to
/sys/fs/gfs2/.../freeze just as we enter that ->put_super() (or at any
point starting from the moment when deactivate_locked_super() has dropped
the last active reference)?  This:
static ssize_t freeze_store(struct gfs2_sbd *sdp, const char *buf, size_t len)
{
        int error;
        int n = simple_strtol(buf, NULL, 0);

        if (!capable(CAP_SYS_ADMIN))
                return -EPERM;

        switch (n) {
[snip]
        case 1:
                error = freeze_super(sdp->sd_vfs);

And freeze_super(sb) assumes that caller has an active reference to
sb:
int freeze_super(struct super_block *sb)
{
        int ret;

        atomic_inc(&sb->s_active);

... which is not legitimate when ->s_active has already reached zero.
And right after that we hit this:
        down_write(&sb->s_umount);

Voila - write(2) is waiting for ->s_umount, while umount(2) is holding
->s_umount and waits for write(2) to get past freeze_store().

Hell knows what to do here - atomic_inc_not_zero() in freeze_super()
(and failing if it fails) would've worked, but it doesn't help with
the deadlock - just write "0" instead and we hit thaw_super(), which
starts with grabbing ->s_umount.  atomic_inc_not_zero()/deactivate_super()
around that call of thaw_super() would probably work, but I'll need to
look at that after I get some sleep...
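A rough, untested sketch of that atomic_inc_not_zero()/deactivate_super()
idea, assuming the thaw branch is the "case 0:" arm of the same switch
(the -EBUSY value is arbitrary):

	case 0:
		/* bail out if the last active reference is already gone */
		if (!atomic_inc_not_zero(&sdp->sd_vfs->s_active))
			return -EBUSY;
		error = thaw_super(sdp->sd_vfs);
		deactivate_super(sdp->sd_vfs);	/* drop the reference taken above */
		break;

Whether that is actually free of the same deadlock is exactly the part
that needs the closer look mentioned above.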

Why bother with sysfs, anyway?  What's wrong with putting those same files
on gfs2meta, seeing that _this_ would have no problems with object lifetimes?
Too late by now, of course, but...


* Re: gfs2 deadlock (was Re: Found it)
  2013-12-05  8:12     ` gfs2 deadlock (was Re: Found it) Al Viro
@ 2013-12-05 10:19       ` Steven Whitehouse
  0 siblings, 0 replies; 16+ messages in thread
From: Steven Whitehouse @ 2013-12-05 10:19 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Simon Kirby, Ingo Molnar, Peter Zijlstra,
	Waiman Long, Ian Applegate, Christoph Lameter, Pekka Enberg,
	LKML, Chris Mason, cluster-devel

Hi,

On Thu, 2013-12-05 at 08:12 +0000, Al Viro wrote:
> On Tue, Dec 03, 2013 at 04:28:30AM +0000, Al Viro wrote:
> 
> > These should be safe, but damnit, we really need the lifecycle documented for
> > all objects - the above is only a part of it (note that for e.g. superblocks
> > we have additional rules re "->s_active can't be incremented for any reason
> > once it drops to zero, it can't be incremented until superblock had been
> > marked 'born' and it crosses over to zero only with ->s_umount held"; there's
> > 6 stages in life cycle of struct super_block and we had interesting bugs due
> > to messing the transitions up).  The trouble is, attempt to write those down
> > tends to stray into massive grep session, with usual results - some other
> > crap gets found (e.g. in some odd driver) and needs to be dealt with ;-/
> > Sigh...
> 
> ... and sure enough, this time is no different - gfs2 sysfs-related code
> cheerfully violates lifetime rules for superblocks, which would've
> caused a major mess later, if it had not immediately caused a deadlock
> on the same superblock ;-/
> 
> Watch: gfs2 creates a bunch of files in sysfs (/sys/fs/gfs2/<devname>/*).
> Said bunch gets removed from ->put_super().  Which is called under
> ->s_umount.  Guess what happens if somebody tries to write "1" to
> /sys/fs/gfs2/.../freeze just as we enter that ->put_super() (or at any
> point starting from the moment when deactivate_locked_super() has dropped
> the last active reference)?  This:
> static ssize_t freeze_store(struct gfs2_sbd *sdp, const char *buf, size_t len)
> {
>         int error;
>         int n = simple_strtol(buf, NULL, 0);
> 
>         if (!capable(CAP_SYS_ADMIN))
>                 return -EPERM;
> 
>         switch (n) {
> [snip]
>         case 1:
>                 error = freeze_super(sdp->sd_vfs);
> 
> And freeze_super(sb) assumes that caller has an active reference to
> sb:
> int freeze_super(struct super_block *sb)
> {
>         int ret;
> 
>         atomic_inc(&sb->s_active);
> 
> ... which is not legitimate when ->s_active has already reached zero.
> And right after that we hit this:
>         down_write(&sb->s_umount);
> 
> Voila - write(2) is waiting for ->s_umount, while umount(2) is holding
> ->s_umount and waits for write(2) to get past freeze_store().
> 
> Hell knows what to do here - atomic_inc_not_zero() in freeze_super()
> (and failing if it fails) would've worked, but it doesn't help with
> the deadlock - just write "0" instead and we hit thaw_super(), which
> starts with grabbing ->s_umount.  atomic_inc_not_zero()/deactivate_super()
> around that call of thaw_super() would probably work, but I'll need to
> look at that after I get some sleep...
> 
> Why bother with sysfs, anyway?  What's wrong with putting those same files
> on gfs2meta, seeing that _this_ would have no problems with object lifetimes?
> Too late by now, of course, but...

Well now. That is a bit of a can of worms :(

Short answer is that the sysfs stuff was largely inherited from GFS
unchanged. The metafs was added in GFS2 as a way to access files which
are otherwise hidden from the user (e.g. the journals). In retrospect,
both were probably a mistake, for different reasons.

Quite a large number of the interfaces in sysfs were added to do what
should really have been the job of mount options. Over time we've now
got to the point that anything that can be done with sysfs that makes
sense as a mount option, is now a mount option and that's the recommended
way to do things.

Freeze can be done via dmsetup suspend, and that's what we've recommended
that people use for a long time now. The only thing in userspace that
used the sysfs interface was gfs2_tool (an obsolete and no longer
shipping userspace tool). So the chances are that nothing uses that
interface any more, but it has been left for backward compatibility
purposes. Maybe we could remove it now?

In any case freeze in GFS2 is due for an overhaul and Ben Marzinski has
patches which he is currently working on as we speak, that are a big
step forward in this area, even though they won't (I think) solve the
issue that you've identified here.

What we do need to preserve are the uevents which are generated though,
since those are used by the older cluster suite programs, and are also
useful for debugging mount/umount/recovery issues, even with the latest
pcs based cluster suite.

Returning briefly to the metafs, the locking there has also been a pain
to deal with. Over time I hope to transition to using "proper"
interfaces rather than trying to directly touch the internal files.
We've already done that with quota - that means that extending the fs
and adding journals are the only two operations left which still use the
metafs as an interface. Eventually I'd like those to have fs independent
(if possible!) interfaces too and then userspace will no longer require
direct access to the metafs. Extending the fs is fairly easy to fix I
think, adding journals will be more tricky to do.

So although it probably doesn't help a lot - that's the background behind
where we are today,

Steve.





* Re: [PATCH] mutexes: Add CONFIG_DEBUG_MUTEX_FASTPATH=y debug variant to debug SMP races
  2013-12-03  8:52   ` [PATCH] mutexes: Add CONFIG_DEBUG_MUTEX_FASTPATH=y debug variant to debug SMP races Ingo Molnar
  2013-12-03 18:10     ` Linus Torvalds
  2013-12-05  6:57     ` Simon Kirby
@ 2013-12-11 15:03     ` Waiman Long
  2 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2013-12-11 15:03 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Simon Kirby, Peter Zijlstra, Ian Applegate,
	Al Viro, Christoph Lameter, Pekka Enberg, LKML, Chris Mason,
	Thomas Gleixner

On 12/03/2013 03:52 AM, Ingo Molnar wrote:
> *
> I think we should try that, to make mutexes safer in general.
>
> Can you see a way to do that fairly cheaply?
>
> I can see two approaches, both rather radical:
>
> 1)
>
> Eliminate mutex->count and (ab-)use mutex->wait_lock as 'the' mutex
> lock: with TICKET_SLOWPATH_FLAG used to denote waiters or so and care
> taken to not use it as a 'real' spinlock but use the raw accessors.
>
> This would still allow a good mutex_lock() fastpath, as it would
> essentially become spin_trylock() with an asm goto slow path helper
> perhaps.
>
> Doing this would have various advantages:
>
>   - we'd eliminate much (all?) of per arch mutex code
>   - we'd share spinlock and mutex low level implementations
>   - we'd reduce struct mutex size by 4 bytes
>
> It's still early in the morning so I might be missing something
> trivial though - this sounds suspiciously too easy ;-) Having a proper
> mutex slowpath might not be so easy without changing the spinlock
> code.
>
> 2)
>
> Another method would be to do the opposite: eliminate mutex->wait_lock
> [for the non-debug case] and do everything via mutex->count and
> mutex->owner.
>
> This saves even more space and potentially enables a tighter slowpath.
>
> It probably won't hurt the massively parallel case, as we already do
> smart MCS locking via mutex->spin_mlock.
>
> So I'd argue for #2. (Assuming it addresses the problem)
>
> Thanks,
>
> 	Ingo
>
>

I also think that #2 is safer as messing with spinlock code can be 
risky. However, #2 probably won't work for architectures that use the 
generic mutex-xchg.h fastpath. Currently the following architectures use 
mutex-xchg.h - unicore32, arc, arm and hexagon. Is there a reason why 
they cannot be converted to use mutex-dec.h instead?

-Longman



