linux-kernel.vger.kernel.org archive mirror
* [BUG] fs/super: a possible sleep-in-atomic bug in put_super
@ 2017-10-06  8:59 Jia-Ju Bai
  2017-10-06  9:06 ` Michal Hocko
  2017-10-06 12:19 ` Al Viro
  0 siblings, 2 replies; 11+ messages in thread
From: Jia-Ju Bai @ 2017-10-06  8:59 UTC (permalink / raw)
  To: viro, torbjorn.lindh, rgooch; +Cc: linux-fsdevel, linux-kernel

According to fs/super.c, the kernel may sleep under a spinlock.
The function call path is:
put_super (acquire the spinlock)
   __put_super
     destroy_super
       list_lru_destroy
         list_lru_unregister
           mutex_lock --> may sleep
         memcg_get_cache_ids
           down_read --> may sleep

This bug was found by my static analysis tool and my code review.
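
The reported chain can be modelled in a few lines of userspace C (all
names here are stand-ins, not kernel code; the counters mimic what a
might_sleep()-style debug check would flag):

```c
#include <assert.h>

/* Userspace model of the reported chain.  All names are stand-ins for
 * the kernel functions; the counters mimic what a might_sleep()-style
 * debug check would flag. */
static int atomic_depth;        /* models preempt_count() raised by spinlocks */
static int sleep_in_atomic;     /* set if a sleeping lock is taken while atomic */

static void spin_lock_model(void)   { atomic_depth++; }
static void spin_unlock_model(void) { atomic_depth--; }

static void mutex_lock_model(void)
{
    if (atomic_depth > 0)       /* sleeping primitive under a spinlock */
        sleep_in_atomic = 1;
}

static void list_lru_destroy_model(void)
{
    mutex_lock_model();         /* list_lru_unregister() -> mutex_lock() */
}

static void put_super_model(void)
{
    spin_lock_model();          /* put_super() runs under sb_lock */
    list_lru_destroy_model();   /* __put_super() -> destroy_super() -> ... */
    spin_unlock_model();
}
```

(The replies below conclude that this particular path is a false
positive, since the lru lists are already destroyed before
destroy_super() runs; the model only shows the shape of the report.)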

Thanks,
Jia-Ju Bai

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [BUG] fs/super: a possible sleep-in-atomic bug in put_super
  2017-10-06  8:59 [BUG] fs/super: a possible sleep-in-atomic bug in put_super Jia-Ju Bai
@ 2017-10-06  9:06 ` Michal Hocko
  2017-10-07 11:56   ` Vladimir Davydov
  2017-10-06 12:19 ` Al Viro
  1 sibling, 1 reply; 11+ messages in thread
From: Michal Hocko @ 2017-10-06  9:06 UTC (permalink / raw)
  To: Jia-Ju Bai
  Cc: viro, torbjorn.lindh, rgooch, linux-fsdevel, linux-kernel,
	Vladimir Davydov

[CC Vladimir]

On Fri 06-10-17 16:59:18, Jia-Ju Bai wrote:
> According to fs/super.c, the kernel may sleep under a spinlock.
> The function call path is:
> put_super (acquire the spinlock)
>   __put_super
>     destroy_super
>       list_lru_destroy
>         list_lru_unregister
>           mutex_lock --> may sleep
>         memcg_get_cache_ids
>           down_read --> may sleep
> 
> This bug was found by my static analysis tool and my code review.
> 
> Thanks,
> Jia-Ju Bai

-- 
Michal Hocko
SUSE Labs


* Re: [BUG] fs/super: a possible sleep-in-atomic bug in put_super
  2017-10-06  8:59 [BUG] fs/super: a possible sleep-in-atomic bug in put_super Jia-Ju Bai
  2017-10-06  9:06 ` Michal Hocko
@ 2017-10-06 12:19 ` Al Viro
  1 sibling, 0 replies; 11+ messages in thread
From: Al Viro @ 2017-10-06 12:19 UTC (permalink / raw)
  To: Jia-Ju Bai; +Cc: torbjorn.lindh, rgooch, linux-fsdevel, linux-kernel

On Fri, Oct 06, 2017 at 04:59:18PM +0800, Jia-Ju Bai wrote:
> According to fs/super.c, the kernel may sleep under a spinlock.
> The function call path is:
> put_super (acquire the spinlock)
>   __put_super
>     destroy_super
>       list_lru_destroy
>         list_lru_unregister
>           mutex_lock --> may sleep
>         memcg_get_cache_ids
>           down_read --> may sleep
> 
> This bug was found by my static analysis tool and my code review.

The invariant to watch is this: s->s_active > 0 => s->s_count > 0.  In other
words, anything that passes from __put_super() to destroy_super() has
already been through deactivate_locked_super() to list_lru_destroy().

And list_lru_destroy() called twice without list_lru_init() between
those will quietly do nothing on the second call.
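
In a simplified userspace model (hypothetical names; the actual check in
mm/list_lru.c is the NULL test on lru->node):

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified model: destroy frees ->node and clears it, so a second
 * call (e.g. from destroy_super() after deactivate_locked_super()
 * already did the work) sees NULL and returns before touching any
 * locks. */
struct list_lru_node { long nr_items; };
struct list_lru { struct list_lru_node *node; };

int list_lru_init_model(struct list_lru *lru, int nr_nodes)
{
    lru->node = calloc(nr_nodes, sizeof(*lru->node));
    return lru->node ? 0 : -1;
}

void list_lru_destroy_model(struct list_lru *lru)
{
    if (!lru->node)
        return;                 /* already destroyed: quietly do nothing */
    /* this is where the real code would take list_lrus_mutex etc. */
    free(lru->node);
    lru->node = NULL;
}
```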

All other callers of destroy_super() are from alloc_super()/sget_userns().
The former is the only place where instances are created, the latter is
the only caller of the former.  Direct calls of destroy_super() in there
	a) happen only to instances that had not been visible in any
shared data structures yet and
	b) are done with no spinlocks held.

struct super_block has probably the most complex lifecycle in the entire VFS.
These days the nastiness is fortunately limited to fs/super.c guts, but
it's very definitely still there.  I've posted sketches of description on
fsdevel several times, but never managed to turn that into coherent text ;-/


* Re: [BUG] fs/super: a possible sleep-in-atomic bug in put_super
  2017-10-06  9:06 ` Michal Hocko
@ 2017-10-07 11:56   ` Vladimir Davydov
  2017-10-07 17:06     ` Al Viro
  0 siblings, 1 reply; 11+ messages in thread
From: Vladimir Davydov @ 2017-10-07 11:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jia-Ju Bai, viro, torbjorn.lindh, rgooch, linux-fsdevel, linux-kernel

Hello,

On Fri, Oct 06, 2017 at 11:06:04AM +0200, Michal Hocko wrote:
> On Fri 06-10-17 16:59:18, Jia-Ju Bai wrote:
> > According to fs/super.c, the kernel may sleep under a spinlock.
> > The function call path is:
> > put_super (acquire the spinlock)
> >   __put_super
> >     destroy_super
> >       list_lru_destroy
> >         list_lru_unregister
> >           mutex_lock --> may sleep
> >         memcg_get_cache_ids
> >           down_read --> may sleep
> > 
> > This bug was found by my static analysis tool and my code review.

This is a false positive: by the time we get to destroy_super(), the lru
lists have already been destroyed - see deactivate_locked_super() - so
list_lru_destroy() will return right away without attempting to take any
locks. That's why there are no lockdep warnings regarding this issue.

I think we can move list_lru_destroy() to destroy_super_work() to
suppress this warning. Not sure if it's really worth the trouble though.

Thanks,
Vladimir


* Re: [BUG] fs/super: a possible sleep-in-atomic bug in put_super
  2017-10-07 11:56   ` Vladimir Davydov
@ 2017-10-07 17:06     ` Al Viro
  2017-10-07 21:14       ` Al Viro
  0 siblings, 1 reply; 11+ messages in thread
From: Al Viro @ 2017-10-07 17:06 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Michal Hocko, Jia-Ju Bai, torbjorn.lindh, rgooch, linux-fsdevel,
	linux-kernel

On Sat, Oct 07, 2017 at 02:56:40PM +0300, Vladimir Davydov wrote:
> Hello,
> 
> On Fri, Oct 06, 2017 at 11:06:04AM +0200, Michal Hocko wrote:
> > On Fri 06-10-17 16:59:18, Jia-Ju Bai wrote:
> > > According to fs/super.c, the kernel may sleep under a spinlock.
> > > The function call path is:
> > > put_super (acquire the spinlock)
> > >   __put_super
> > >     destroy_super
> > >       list_lru_destroy
> > >         list_lru_unregister
> > >           mutex_lock --> may sleep
> > >         memcg_get_cache_ids
> > >           down_read --> may sleep
> > > 
> > > This bug was found by my static analysis tool and my code review.
> 
> This is a false positive: by the time we get to destroy_super(), the lru
> lists have already been destroyed - see deactivate_locked_super() - so
> list_lru_destroy() will return right away without attempting to take any
> locks. That's why there are no lockdep warnings regarding this issue.
> 
> I think we can move list_lru_destroy() to destroy_super_work() to
> suppress this warning. Not sure if it's really worth the trouble though.

It's a bit trickier than that (for callers of destroy_super() that run before
the superblock becomes reachable via shared data structures, that
list_lru_destroy() is not a no-op - but those callers hold no spinlocks).

Locking in mm/list_lru.c looks excessive, but then I'm not well familiar with
that code.


* Re: [BUG] fs/super: a possible sleep-in-atomic bug in put_super
  2017-10-07 17:06     ` Al Viro
@ 2017-10-07 21:14       ` Al Viro
  2017-10-08  0:56         ` Al Viro
  0 siblings, 1 reply; 11+ messages in thread
From: Al Viro @ 2017-10-07 21:14 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Michal Hocko, Jia-Ju Bai, torbjorn.lindh, linux-fsdevel, linux-kernel

On Sat, Oct 07, 2017 at 06:06:51PM +0100, Al Viro wrote:
> On Sat, Oct 07, 2017 at 02:56:40PM +0300, Vladimir Davydov wrote:
> > Hello,
> > 
> > On Fri, Oct 06, 2017 at 11:06:04AM +0200, Michal Hocko wrote:
> > > On Fri 06-10-17 16:59:18, Jia-Ju Bai wrote:
> > > > According to fs/super.c, the kernel may sleep under a spinlock.
> > > > The function call path is:
> > > > put_super (acquire the spinlock)
> > > >   __put_super
> > > >     destroy_super
> > > >       list_lru_destroy
> > > >         list_lru_unregister
> > > >           mutex_lock --> may sleep
> > > >         memcg_get_cache_ids
> > > >           down_read --> may sleep
> > > > 
> > > > This bug was found by my static analysis tool and my code review.
> > 
> > This is a false positive: by the time we get to destroy_super(), the lru
> > lists have already been destroyed - see deactivate_locked_super() - so
> > list_lru_destroy() will return right away without attempting to take any
> > locks. That's why there are no lockdep warnings regarding this issue.
> > 
> > I think we can move list_lru_destroy() to destroy_super_work() to
> > suppress this warning. Not sure if it's really worth the trouble though.
> 
> It's a bit trickier than that (for callers of destroy_super() that run before
> the superblock becomes reachable via shared data structures, that
> list_lru_destroy() is not a no-op - but those callers hold no spinlocks).
> 
> Locking in mm/list_lru.c looks excessive, but then I'm not well familiar with
> that code.

It *is* excessive.

	1) co-allocate struct list_lru and the array of struct list_lru_node
hanging off it.  Turn all existing variables and struct members of that
type into pointers.  init would allocate and return a pointer, destroy
would free (and leave it for callers to clear their pointers, of course).

	2) store the value of memcg_nr_cache_ids as of the last creation
or resize in list_lru.  Pass that through to memcg_destroy_list_lru_node().
Result: no need for memcg_get_cache_ids() in list_lru_destroy().

	3) add static list_lru *currently_resized, protected by list_lru_mutex.
NULL when memcg_update_all_list_lrus() is not run, points to currently
resized list_lru when it is.

	4) have list_lru_destroy() check (under list_lru_mutex) whether it's
being asked to kill the currently resized one.  If it is, do
	victim->list.prev->next = victim->list.next;
	victim->list.next->prev = victim->list.prev;
	victim->list.prev = NULL;
and bugger off, otherwise act as now.  Turn the loop in
memcg_update_all_list_lrus() into
	mutex_lock(&list_lrus_mutex);
	lru = list_lrus.next;
	while (lru != &list_lrus) {
		currently_resized = list_entry(lru, struct list_lru, list);
		mutex_unlock(&list_lrus_mutex);
		ret = memcg_update_list_lru(lru, old_size, new_size);
		mutex_lock(&list_lrus_mutex);
		if (unlikely(!lru->prev)) {
			lru = lru->next;
			free currently_resized as list_lru_destroy() would have
			continue;
		}
		if (ret)
			goto fail;
		lru = lru->next;
	}
	currently_resized = NULL;
	mutex_unlock(&list_lrus_mutex);
	
	5) replace list_lrus_mutex with a spinlock.

At that point list_lru_destroy() becomes non-blocking.  I think it should work,
but that's from several hours of looking through that code, so I might be
missing something subtle here...
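
Step 2 of the sketch can be shown in a few lines of userspace C (field
names are made up): stashing the array size at init/resize time means
destroy never has to take memcg_get_cache_ids():

```c
#include <assert.h>
#include <stdlib.h>

struct lru_model {
    int nr_memcg;       /* stashed memcg_nr_cache_ids from last init/resize */
    void **memcg_lrus;
};

int lru_init_model(struct lru_model *lru, int nr)
{
    lru->memcg_lrus = calloc(nr, sizeof(void *));
    if (!lru->memcg_lrus)
        return -1;
    lru->nr_memcg = nr;     /* remember the size we allocated with */
    return 0;
}

void lru_destroy_model(struct lru_model *lru)
{
    /* no need to look up the current global id count under a lock:
     * the stashed value bounds the array we actually own */
    for (int i = 0; i < lru->nr_memcg; i++)
        free(lru->memcg_lrus[i]);
    free(lru->memcg_lrus);
    lru->memcg_lrus = NULL;
    lru->nr_memcg = 0;
}
```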


* Re: [BUG] fs/super: a possible sleep-in-atomic bug in put_super
  2017-10-07 21:14       ` Al Viro
@ 2017-10-08  0:56         ` Al Viro
  2017-10-08  2:03           ` Al Viro
  0 siblings, 1 reply; 11+ messages in thread
From: Al Viro @ 2017-10-08  0:56 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Michal Hocko, Jia-Ju Bai, torbjorn.lindh, linux-fsdevel, linux-kernel

On Sat, Oct 07, 2017 at 10:14:44PM +0100, Al Viro wrote:

> 	1) co-allocate struct list_lru and the array of struct list_lru_node
> hanging off it.  Turn all existing variables and struct members of that
> type into pointers.  init would allocate and return a pointer, destroy
> would free (and leave it for callers to clear their pointers, of course).

Better yet, keep list_lru containing just the pointer to list_lru_node
array.  And put that array into the tail of struct list_lru_nodes.  That
way normal accesses are kept exactly as-is and we don't need to update
the users of that thing at all.
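
One way to read this (a hypothetical userspace sketch, not the actual
patch): the backing allocation carries the node array in a flexible-array
tail, and struct list_lru keeps only the pointer, so existing lru->node
accesses are untouched:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct list_lru_node { long nr_items; };

/* users of list_lru keep dereferencing ->node exactly as before */
struct list_lru { struct list_lru_node *node; };

/* one allocation: header plus the node array in its tail */
struct lru_backing {
    int nr_nodes;
    struct list_lru_node node[];    /* C99 flexible array member */
};

int lru_init_model(struct list_lru *lru, int nr_nodes)
{
    struct lru_backing *b =
        calloc(1, sizeof(*b) + nr_nodes * sizeof(b->node[0]));
    if (!b)
        return -1;
    b->nr_nodes = nr_nodes;
    lru->node = b->node;            /* callers only ever see the array */
    return 0;
}

void lru_destroy_model(struct list_lru *lru)
{
    if (!lru->node)
        return;
    /* recover the backing header from the array pointer */
    struct lru_backing *b = (struct lru_backing *)
        ((char *)lru->node - offsetof(struct lru_backing, node));
    free(b);
    lru->node = NULL;
}
```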

> 	4) have list_lru_destroy() check (under list_lru_mutex) whether it's
> being asked to kill the currently resized one.  If it is, do
> 	victim->list.prev->next = victim->list.next;
> 	victim->list.next->prev = victim->list.prev;
> 	victim->list.prev = NULL;

Doesn't work, unfortunately - it needs to stay on the list and be marked
in some other way.

> and bugger off, otherwise act as now.  Turn the loop in
> memcg_update_all_list_lrus() into
> 	mutex_lock(&list_lrus_mutex);
> 	lru = list_lrus.next;
> 	while (lru != &list_lrus) {
> 		currently_resized = list_entry(lru, struct list_lru, list);
> 		mutex_unlock(&list_lrus_mutex);
> 		ret = memcg_update_list_lru(lru, old_size, new_size);
> 		mutex_lock(&list_lrus_mutex);
> 		if (unlikely(!lru->prev)) {
> 			lru = lru->next;

... because this might very well be pointing to an already freed object.

> 			free currently_resized as list_lru_destroy() would have
> 			continue;

What's more, we need to be careful about resize vs. drain.  Right now it's
on list_lrus_mutex, but if we drop that around actual resize of an individual
list_lru, we'll need something else.  Would there be any problem if we
took memcg_cache_ids_sem shared in memcg_offline_kmem()?

The first problem is not fatal - we can e.g. use the sign of the field used
to store the number of ->memcg_lrus elements (i.e. stashed value of
memcg_nr_cache_ids at allocation or last resize) to indicate that actual
freeing is left for resizer...


* Re: [BUG] fs/super: a possible sleep-in-atomic bug in put_super
  2017-10-08  0:56         ` Al Viro
@ 2017-10-08  2:03           ` Al Viro
  2017-10-08 15:47             ` Vladimir Davydov
  0 siblings, 1 reply; 11+ messages in thread
From: Al Viro @ 2017-10-08  2:03 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Michal Hocko, Jia-Ju Bai, torbjorn.lindh, linux-fsdevel, linux-kernel

On Sun, Oct 08, 2017 at 01:56:08AM +0100, Al Viro wrote:

> What's more, we need to be careful about resize vs. drain.  Right now it's
> on list_lrus_mutex, but if we drop that around actual resize of an individual
> list_lru, we'll need something else.  Would there be any problem if we
> took memcg_cache_ids_sem shared in memcg_offline_kmem()?
> 
> The first problem is not fatal - we can e.g. use the sign of the field used
> to store the number of ->memcg_lrus elements (i.e. stashed value of
> memcg_nr_cache_ids at allocation or last resize) to indicate that actual
> freeing is left for resizer...

Ugh.  That spinlock would have to be held over too much work, or bounced back
and forth a lot on memcg shutdowns ;-/  Gets especially nasty if we want
list_lru_destroy() callable from rcu callbacks.  Oh, well...

I still suspect that locking there is too heavy, but it looks like I don't have
a better replacement.

What are the realistic numbers of memcg on a big system?


* Re: [BUG] fs/super: a possible sleep-in-atomic bug in put_super
  2017-10-08  2:03           ` Al Viro
@ 2017-10-08 15:47             ` Vladimir Davydov
  2017-10-08 21:13               ` Al Viro
  0 siblings, 1 reply; 11+ messages in thread
From: Vladimir Davydov @ 2017-10-08 15:47 UTC (permalink / raw)
  To: Al Viro
  Cc: Michal Hocko, Jia-Ju Bai, torbjorn.lindh, linux-fsdevel, linux-kernel

On Sun, Oct 08, 2017 at 03:03:32AM +0100, Al Viro wrote:
> On Sun, Oct 08, 2017 at 01:56:08AM +0100, Al Viro wrote:
> 
> > What's more, we need to be careful about resize vs. drain.  Right now it's
> > on list_lrus_mutex, but if we drop that around actual resize of an individual
> > list_lru, we'll need something else.  Would there be any problem if we
> > took memcg_cache_ids_sem shared in memcg_offline_kmem()?
> > 
> > The first problem is not fatal - we can e.g. use the sign of the field used
> > to store the number of ->memcg_lrus elements (i.e. stashed value of
> > memcg_nr_cache_ids at allocation or last resize) to indicate that actual
> > freeing is left for resizer...
> 
> Ugh.  That spinlock would have to be held over too much work, or bounced back
> and forth a lot on memcg shutdowns ;-/  Gets especially nasty if we want
> list_lru_destroy() callable from rcu callbacks.  Oh, well...
> 
> I still suspect that locking there is too heavy, but it looks like I don't have
> a better replacement.
> 
> What are the realistic numbers of memcg on a big system?

Several thousand. I guess we could turn list_lrus_mutex into a spin lock
by making resize/drain procedures handle list_lru destruction as you
suggested above, but list_lru_destroy() would still have to iterate over
all elements of list_lru_node->memcg_lrus array to free per-memcg
objects, which is too heavy to be performed under sb_lock IMHO.

Thanks,
Vladimir


* Re: [BUG] fs/super: a possible sleep-in-atomic bug in put_super
  2017-10-08 15:47             ` Vladimir Davydov
@ 2017-10-08 21:13               ` Al Viro
  2017-10-09  8:43                 ` Vladimir Davydov
  0 siblings, 1 reply; 11+ messages in thread
From: Al Viro @ 2017-10-08 21:13 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Michal Hocko, Jia-Ju Bai, torbjorn.lindh, linux-fsdevel, linux-kernel

On Sun, Oct 08, 2017 at 06:47:46PM +0300, Vladimir Davydov wrote:
> On Sun, Oct 08, 2017 at 03:03:32AM +0100, Al Viro wrote:
> > On Sun, Oct 08, 2017 at 01:56:08AM +0100, Al Viro wrote:
> > 
> > > What's more, we need to be careful about resize vs. drain.  Right now it's
> > > on list_lrus_mutex, but if we drop that around actual resize of an individual
> > > list_lru, we'll need something else.  Would there be any problem if we
> > > took memcg_cache_ids_sem shared in memcg_offline_kmem()?
> > > 
> > > The first problem is not fatal - we can e.g. use the sign of the field used
> > > to store the number of ->memcg_lrus elements (i.e. stashed value of
> > > memcg_nr_cache_ids at allocation or last resize) to indicate that actual
> > > freeing is left for resizer...
> > 
> > Ugh.  That spinlock would have to be held over too much work, or bounced back
> > and forth a lot on memcg shutdowns ;-/  Gets especially nasty if we want
> > list_lru_destroy() callable from rcu callbacks.  Oh, well...
> > 
> > I still suspect that locking there is too heavy, but it looks like I don't have
> > a better replacement.
> > 
> > What are the realistic numbers of memcg on a big system?
> 
> Several thousand. I guess we could turn list_lrus_mutex into a spin lock
> by making resize/drain procedures handle list_lru destruction as you
> suggested above, but list_lru_destroy() would still have to iterate over
> all elements of list_lru_node->memcg_lrus array to free per-memcg
> objects, which is too heavy to be performed under sb_lock IMHO.

Hmm...  Some observations:
	* struct list_lru_one is fairly small and it doesn't pack well - you
get 25% overhead on it.  A dedicated kmem_cache might be worth doing.
	* in addition to list_lru_one (all allocated one by one), you have
an array of pointers to them.  And that array never shrinks.  What's more,
the objects pointed to are only 3 times bigger than pointers...
	* how hot is the memcg destruction codepath?  The only real argument
in favour of list instead of hlist is list_splice(); if that's not that
critical, hlist would do just fine.  And that drives the size of
struct list_lru_one down to two words, at which point getting rid of
that array of pointers becomes rather attractive.  Amount of cacheline
bouncing might be an issue, but then it might be an overall win; can't
tell without experiments.  It certainly would've simplified the things
a whole lot, including the rollback on allocation failure during resize,
etc.  And list_lru_destroy() would get several orders of magnitude cheaper...
	* co-allocating ->list with ->node[] is almost certainly worth doing -
AFAICS, it's a clear optimization, no matter whether we do anything else
or not.  Loops by list_lrus would be better off without fetching lru->node
for every entry.  _And_ the objects containing list_lru wouldn't be
reachable via list_lrus.
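
For the size arithmetic in these bullets, a quick userspace check
(these are re-declarations modelled on the kernel's list types, not the
kernel headers themselves; they assume long and pointers are the same
size, as on LP64 and ILP32):

```c
#include <assert.h>

/* re-declarations of the kernel's list head shapes */
struct list_head  { struct list_head *next, *prev; };
struct hlist_node { struct hlist_node *next, **pprev; };
struct hlist_head { struct hlist_node *first; };

/* current shape: two pointers for the head plus the counter = 3 words */
struct list_lru_one_list {
    struct list_head list;
    long nr_items;
};

/* hlist variant: the head is a single pointer, so 2 words total */
struct list_lru_one_hlist {
    struct hlist_head list;
    long nr_items;
};
```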

As for fs/super.c side...  IMO destroy_super() ought to WARN_ON when
it sees non-NULL ->node.  Let alloc_super()/sget_userns() do those
list_lru_destroy() directly for instances that get killed before
becoming reachable via shared data structures; everything else must
go through deactivate_locked_super().  The only reason for list_lru_destroy()
in destroy_super() is the use of the latter for disposal of never-seen-by-anyone
struct super_block instances.  Come to think of that, something like this
might be a good approach:

diff --git a/fs/super.c b/fs/super.c
index 166c4ee0d0ed..8ca15415351a 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -154,21 +154,19 @@ static void destroy_super_rcu(struct rcu_head *head)
 	schedule_work(&s->destroy_work);
 }
 
-/**
- *	destroy_super	-	frees a superblock
- *	@s: superblock to free
- *
- *	Frees a superblock.
- */
-static void destroy_super(struct super_block *s)
+/* Free a superblock that has never been seen by anyone */
+static void destroy_unused_super(struct super_block *s)
 {
+	if (!s)
+		return;
+	up_write(&s->s_umount);
 	list_lru_destroy(&s->s_dentry_lru);
 	list_lru_destroy(&s->s_inode_lru);
 	security_sb_free(s);
-	WARN_ON(!list_empty(&s->s_mounts));
 	put_user_ns(s->s_user_ns);
 	kfree(s->s_subtype);
-	call_rcu(&s->rcu, destroy_super_rcu);
+	/* no delays needed */
+	destroy_super_work(&s->destroy_work);
 }
 
 /**
@@ -256,7 +254,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
 	return s;
 
 fail:
-	destroy_super(s);
+	destroy_unused_super(s);
 	return NULL;
 }
 
@@ -265,11 +263,17 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
 /*
  * Drop a superblock's refcount.  The caller must hold sb_lock.
  */
-static void __put_super(struct super_block *sb)
+static void __put_super(struct super_block *s)
 {
-	if (!--sb->s_count) {
-		list_del_init(&sb->s_list);
-		destroy_super(sb);
+	if (!--s->s_count) {
+		list_del_init(&s->s_list);
+		WARN_ON(s->s_dentry_lru.node);
+		WARN_ON(s->s_inode_lru.node);
+		WARN_ON(!list_empty(&s->s_mounts));
+		security_sb_free(s);
+		put_user_ns(s->s_user_ns);
+		kfree(s->s_subtype);
+		call_rcu(&s->rcu, destroy_super_rcu);
 	}
 }
 
@@ -484,19 +488,12 @@ struct super_block *sget_userns(struct file_system_type *type,
 				continue;
 			if (user_ns != old->s_user_ns) {
 				spin_unlock(&sb_lock);
-				if (s) {
-					up_write(&s->s_umount);
-					destroy_super(s);
-				}
+				destroy_unused_super(s);
 				return ERR_PTR(-EBUSY);
 			}
 			if (!grab_super(old))
 				goto retry;
-			if (s) {
-				up_write(&s->s_umount);
-				destroy_super(s);
-				s = NULL;
-			}
+			destroy_unused_super(s);
 			return old;
 		}
 	}
@@ -511,8 +508,7 @@ struct super_block *sget_userns(struct file_system_type *type,
 	err = set(s, data);
 	if (err) {
 		spin_unlock(&sb_lock);
-		up_write(&s->s_umount);
-		destroy_super(s);
+		destroy_unused_super(s);
 		return ERR_PTR(err);
 	}
 	s->s_type = type;


* Re: [BUG] fs/super: a possible sleep-in-atomic bug in put_super
  2017-10-08 21:13               ` Al Viro
@ 2017-10-09  8:43                 ` Vladimir Davydov
  0 siblings, 0 replies; 11+ messages in thread
From: Vladimir Davydov @ 2017-10-09  8:43 UTC (permalink / raw)
  To: Al Viro
  Cc: Michal Hocko, Jia-Ju Bai, torbjorn.lindh, linux-fsdevel, linux-kernel

On Sun, Oct 08, 2017 at 10:13:57PM +0100, Al Viro wrote:
> On Sun, Oct 08, 2017 at 06:47:46PM +0300, Vladimir Davydov wrote:
> > On Sun, Oct 08, 2017 at 03:03:32AM +0100, Al Viro wrote:
> > > On Sun, Oct 08, 2017 at 01:56:08AM +0100, Al Viro wrote:
> > > 
> > > > What's more, we need to be careful about resize vs. drain.  Right now it's
> > > > on list_lrus_mutex, but if we drop that around actual resize of an individual
> > > > list_lru, we'll need something else.  Would there be any problem if we
> > > > took memcg_cache_ids_sem shared in memcg_offline_kmem()?
> > > > 
> > > > The first problem is not fatal - we can e.g. use the sign of the field used
> > > > to store the number of ->memcg_lrus elements (i.e. stashed value of
> > > > memcg_nr_cache_ids at allocation or last resize) to indicate that actual
> > > > freeing is left for resizer...
> > > 
> > > Ugh.  That spinlock would have to be held over too much work, or bounced back
> > > and forth a lot on memcg shutdowns ;-/  Gets especially nasty if we want
> > > list_lru_destroy() callable from rcu callbacks.  Oh, well...
> > > 
> > > I still suspect that locking there is too heavy, but it looks like I don't have
> > > a better replacement.
> > > 
> > > What are the realistic numbers of memcg on a big system?
> > 
> > Several thousand. I guess we could turn list_lrus_mutex into a spin lock
> > by making resize/drain procedures handle list_lru destruction as you
> > suggested above, but list_lru_destroy() would still have to iterate over
> > all elements of list_lru_node->memcg_lrus array to free per-memcg
> > objects, which is too heavy to be performed under sb_lock IMHO.
> 
> Hmm...  Some observations:
> 	* struct list_lru_one is fairly small and it doesn't pack well - you
> get 25% overhead on it.  A dedicated kmem_cache might be worth doing.
> 	* in addition to list_lru_one (all allocated one by one), you have
> an array of pointers to them.  And that array never shrinks.  What's more,
> the objects pointed to are only 3 times bigger than pointers...
> 	* how hot is the memcg destruction codepath?  The only real argument
> in favour of list instead of hlist is list_splice(); if that's not that

This is an LRU list so we add objects to the tail and remove them from
the head (or vice versa). AFAICS hlist wouldn't be enough here as it
doesn't store a pointer to the list tail.

Anyway, no matter which list implementation we use, list or hlist,
we have to synchronize resizing of memcg_lrus against concurrent
list modifications. AFAIU we can achieve that either by holding
list_lru_node->lock while copying the array, or by using an array of
pointers, in which case we can copy the array without holding the
lock and only take the lock to swap pointers.  Since the array may
grow quite big, I decided to use an array of pointers to avoid
holding the lock for too long.
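
That trade-off sketches as follows (names are illustrative; an
atomic_flag stands in for the kernel spinlock): the possibly long copy
runs outside the lock, and the lock covers only the pointer swap:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

/* a trivial spinlock stand-in, so the sketch stays self-contained */
static atomic_flag node_lock = ATOMIC_FLAG_INIT;
static void spin_lock_model(void)   { while (atomic_flag_test_and_set(&node_lock)) ; }
static void spin_unlock_model(void) { atomic_flag_clear(&node_lock); }

struct node_model {
    void **memcg_lrus;      /* array of pointers to per-memcg lists */
};

static int memcg_update_node_model(struct node_model *n, int old_sz, int new_sz)
{
    void **new_arr = calloc(new_sz, sizeof(void *));
    if (!new_arr)
        return -1;
    /* copy outside the lock: the entries are pointers, and the lists
     * they point to are only ever modified through those pointers */
    memcpy(new_arr, n->memcg_lrus, old_sz * sizeof(void *));

    spin_lock_model();      /* short critical section: just the swap */
    void **old_arr = n->memcg_lrus;
    n->memcg_lrus = new_arr;
    spin_unlock_model();

    free(old_arr);
    return 0;
}
```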

> critical, hlist would do just fine.  And that drives the size of
> struct list_lru_one down to two words, at which point getting rid of
> that array of pointers becomes rather attractive.  Amount of cacheline
> bouncing might be an issue, but then it might be an overall win; can't
> tell without experiments.  It certainly would've simplified the things
> a whole lot, including the rollback on allocation failure during resize,
> etc.  And list_lru_destroy() would get several orders of magnitude cheaper...

IMHO at the same time it would complicate list_lru_walk(), because the
isolate() callback may release the list_lru_node->lock, which opens a
time window for the list_lru_one to be reallocated. Currently, we don't
have to care about this, because once we dereferenced a list_lru_one, we
can be sure it won't go away.

> 	* co-allocating ->list with ->node[] is almost certainly worth doing -
> AFAICS, it's a clear optimization, no matter whether we do anything else
> or not.  Loops by list_lrus would be better off without fetching lru->node
> for every entry.  _And_ the objects containing list_lru wouldn't be
> reachable via list_lrus.

By co-allocating ->node[] with list_lru we would gain nothing, because
then we wouldn't be able to embed list_lru in super_block and would have
to use a pointer there.

BTW, I think we need to remove alignment from super_block->s_dentry_lru
and s_inode_lru definitions. It seems to be left there from the time
when list_lru wasn't numa aware. Now, the alignment is guaranteed by the
definition of struct list_lru_node.

> 
> As for fs/super.c side...  IMO destroy_super() ought to WARN_ON when
> it sees non-NULL ->node.  Let alloc_super()/sget_userns() do those
> list_lru_destroy() directly for instances that get killed before
> becoming reachable via shared data structures; everything else must
> go through deactivate_locked_super().  The only reason for list_lru_destroy()
> in destroy_super() is the use of the latter for disposal of never-seen-by-anyone
> struct super_block instances.  Come to think of that, something like this
> might be a good approach:

Makes sense to me.

> 
> diff --git a/fs/super.c b/fs/super.c
> index 166c4ee0d0ed..8ca15415351a 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -154,21 +154,19 @@ static void destroy_super_rcu(struct rcu_head *head)
>  	schedule_work(&s->destroy_work);
>  }
>  
> -/**
> - *	destroy_super	-	frees a superblock
> - *	@s: superblock to free
> - *
> - *	Frees a superblock.
> - */
> -static void destroy_super(struct super_block *s)
> +/* Free a superblock that has never been seen by anyone */
> +static void destroy_unused_super(struct super_block *s)
>  {
> +	if (!s)
> +		return;
> +	up_write(&s->s_umount);
>  	list_lru_destroy(&s->s_dentry_lru);
>  	list_lru_destroy(&s->s_inode_lru);
>  	security_sb_free(s);
> -	WARN_ON(!list_empty(&s->s_mounts));
>  	put_user_ns(s->s_user_ns);
>  	kfree(s->s_subtype);
> -	call_rcu(&s->rcu, destroy_super_rcu);
> +	/* no delays needed */
> +	destroy_super_work(&s->destroy_work);
>  }
>  
>  /**
> @@ -256,7 +254,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
>  	return s;
>  
>  fail:
> -	destroy_super(s);
> +	destroy_unused_super(s);
>  	return NULL;
>  }
>  
> @@ -265,11 +263,17 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
>  /*
>   * Drop a superblock's refcount.  The caller must hold sb_lock.
>   */
> -static void __put_super(struct super_block *sb)
> +static void __put_super(struct super_block *s)
>  {
> -	if (!--sb->s_count) {
> -		list_del_init(&sb->s_list);
> -		destroy_super(sb);
> +	if (!--s->s_count) {
> +		list_del_init(&s->s_list);
> +		WARN_ON(s->s_dentry_lru.node);
> +		WARN_ON(s->s_inode_lru.node);
> +		WARN_ON(!list_empty(&s->s_mounts));
> +		security_sb_free(s);
> +		put_user_ns(s->s_user_ns);
> +		kfree(s->s_subtype);
> +		call_rcu(&s->rcu, destroy_super_rcu);
>  	}
>  }
>  
> @@ -484,19 +488,12 @@ struct super_block *sget_userns(struct file_system_type *type,
>  				continue;
>  			if (user_ns != old->s_user_ns) {
>  				spin_unlock(&sb_lock);
> -				if (s) {
> -					up_write(&s->s_umount);
> -					destroy_super(s);
> -				}
> +				destroy_unused_super(s);
>  				return ERR_PTR(-EBUSY);
>  			}
>  			if (!grab_super(old))
>  				goto retry;
> -			if (s) {
> -				up_write(&s->s_umount);
> -				destroy_super(s);
> -				s = NULL;
> -			}
> +			destroy_unused_super(s);
>  			return old;
>  		}
>  	}
> @@ -511,8 +508,7 @@ struct super_block *sget_userns(struct file_system_type *type,
>  	err = set(s, data);
>  	if (err) {
>  		spin_unlock(&sb_lock);
> -		up_write(&s->s_umount);
> -		destroy_super(s);
> +		destroy_unused_super(s);
>  		return ERR_PTR(err);
>  	}
>  	s->s_type = type;
> 

