All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] proc/sysctl: drop unregistered stale dentries as soon as possible
@ 2017-02-08 10:48 Konstantin Khlebnikov
  2017-02-08 21:48 ` Andrew Morton
  0 siblings, 1 reply; 14+ messages in thread
From: Konstantin Khlebnikov @ 2017-02-08 10:48 UTC (permalink / raw)
  To: Andrew Morton, Eric W. Biederman, linux-kernel, Alexander Viro

Currently unregistering sysctl does not prune its dentries.
Stale sysctl dentries could slowdown sysctl operations significantly.

For example, command:

# for i in {1..100000} ; do unshare -n -- sysctl -a &> /dev/null ; done

creates a millions of stale denties around sysctls of loopback interface:

# sysctl fs.dentry-state
fs.dentry-state = 25812579	24724135	45	0	0	0

All of them have matching names thus lookup have to scan though whole
hash chain and call d_compare (proc_sys_compare) which checks them
under system-wide spinlock (sysctl_lock).

# time sysctl -a > /dev/null
real    1m12.806s
user    0m0.016s
sys     1m12.400s

Currently only memory reclaimer could remove this garbage.
But without significant memory pressure this never happens.

This patch detects stale dentry in proc_sys_compare and pretends that
it has matching name - revalidation will kill it and lookup restarts.
As a result each stale dentry will be seen only once and will not
contaminate hash endlessly.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 fs/proc/proc_sysctl.c |   10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index d4e37acd4821..1af7230c2c9e 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -852,11 +852,19 @@ static int proc_sys_compare(const struct dentry *dentry,
 	inode = d_inode_rcu(dentry);
 	if (!inode)
 		return 1;
+
+	/*
+	 * Stale dentry: we cannot invalidate it right here, instead we
+	 * pretend that it matches and revalidation will kill it later.
+	 */
+	head = rcu_dereference(PROC_I(inode)->sysctl);
+	if (head && head->unregistering)
+		return 0;
+
 	if (name->len != len)
 		return 1;
 	if (memcmp(name->name, str, len))
 		return 1;
-	head = rcu_dereference(PROC_I(inode)->sysctl);
 	return !head || !sysctl_is_seen(head);
 }
 

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] proc/sysctl: drop unregistered stale dentries as soon as possible
  2017-02-08 10:48 [PATCH] proc/sysctl: drop unregistered stale dentries as soon as possible Konstantin Khlebnikov
@ 2017-02-08 21:48 ` Andrew Morton
  2017-02-09  3:53   ` Al Viro
  0 siblings, 1 reply; 14+ messages in thread
From: Andrew Morton @ 2017-02-08 21:48 UTC (permalink / raw)
  To: Konstantin Khlebnikov; +Cc: Eric W. Biederman, linux-kernel, Alexander Viro

On Wed, 08 Feb 2017 13:48:24 +0300 Konstantin Khlebnikov <khlebnikov@yandex-team.ru> wrote:

> Currently unregistering sysctl does not prune its dentries.
> Stale sysctl dentries could slowdown sysctl operations significantly.
> 
> For example, command:
> 
> # for i in {1..100000} ; do unshare -n -- sysctl -a &> /dev/null ; done
> 
> creates a millions of stale denties around sysctls of loopback interface:
> 
> # sysctl fs.dentry-state
> fs.dentry-state = 25812579	24724135	45	0	0	0
> 
> All of them have matching names thus lookup have to scan though whole
> hash chain and call d_compare (proc_sys_compare) which checks them
> under system-wide spinlock (sysctl_lock).
> 
> # time sysctl -a > /dev/null
> real    1m12.806s
> user    0m0.016s
> sys     1m12.400s
> 
> Currently only memory reclaimer could remove this garbage.
> But without significant memory pressure this never happens.
> 
> This patch detects stale dentry in proc_sys_compare and pretends that
> it has matching name - revalidation will kill it and lookup restarts.
> As a result each stale dentry will be seen only once and will not
> contaminate hash endlessly.
> 

What are "stale" dentries?  Unused dentries?  If so, why doesn't the
creation of a new dentry immediately invalidate the old dentry with a
matching path?  What do other filesystems do to prevent this issue?

IOW I'm wondering if this should be fixed in some other place.  Al?

> --- a/fs/proc/proc_sysctl.c
> +++ b/fs/proc/proc_sysctl.c
> @@ -852,11 +852,19 @@ static int proc_sys_compare(const struct dentry *dentry,
>  	inode = d_inode_rcu(dentry);
>  	if (!inode)
>  		return 1;
> +
> +	/*
> +	 * Stale dentry: we cannot invalidate it right here, instead we
> +	 * pretend that it matches and revalidation will kill it later.
> +	 */
> +	head = rcu_dereference(PROC_I(inode)->sysctl);
> +	if (head && head->unregistering)
> +		return 0;
> +
>  	if (name->len != len)
>  		return 1;
>  	if (memcmp(name->name, str, len))
>  		return 1;
> -	head = rcu_dereference(PROC_I(inode)->sysctl);
>  	return !head || !sysctl_is_seen(head);
>  }
>  

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] proc/sysctl: drop unregistered stale dentries as soon as possible
  2017-02-08 21:48 ` Andrew Morton
@ 2017-02-09  3:53   ` Al Viro
  2017-02-09  7:36     ` Konstantin Khlebnikov
  0 siblings, 1 reply; 14+ messages in thread
From: Al Viro @ 2017-02-09  3:53 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Konstantin Khlebnikov, Eric W. Biederman, linux-kernel

On Wed, Feb 08, 2017 at 01:48:04PM -0800, Andrew Morton wrote:

> > This patch detects stale dentry in proc_sys_compare and pretends that
> > it has matching name - revalidation will kill it and lookup restarts.
> > As a result each stale dentry will be seen only once and will not
> > contaminate hash endlessly.
> > 
> 
> What are "stale" dentries?  Unused dentries?  If so, why doesn't the
> creation of a new dentry immediately invalidate the old dentry with a
> matching path?  What do other filesystems do to prevent this issue?

The whole point is that it's *NOT* a matching path.  Currently ->d_compare()
for /proc/sys tries to make sure that sysctl getting unregistered means
that no extra references will be added to dentries of the stuff we are
trying to kick out.  If it's getting unregistered, ->d_compare() won't be
seeing them and from that point on dentry refcount can only go down -
no new lookups will increase it.

This kludge tries to have _any_ lookup in the same hash chain pick the
first dentry of such stuff, no matter what name/parent it has.  Then
it relies upon ->d_revalidate() refusing to accept that sucker, so that
it gets unhashed and we (hopefully) repeat the lookup.

This is complete garbage.  Lookups won't be repeated indefinitely -
if there are several such dentries in the hash chain we search, syscall
will end up failing with ESTALE on thus buggered ->d_compare(), even though
none of those dentries are anywhere near the path we are trying to resolve.
No other filesystem attempts that kind of insanity, and for a good reason.

The problem it tries to address is that sysctl unregistration doesn't
unhash the now-stale dentries.  Before the unregistration we kept them
even with refcount 0, until memory pressure evicts the suckers.  After
unregistration we make sure that refcount reaching 0 will cause the
instant eviction.  The problem is with the case when they had refcount 0
to start with.  Then the eviction rule does not get triggered - it would have
happened when dropping the last reference, but we don't have any.

The kludge proposed in that patch is nowhere near being a sane way to deal
with that.  Having ->d_compare() notice such dentries and quietly kick
them out would be borderline saner, but
	* it's a potentially blocking operation and ->d_compare() is called
in non-blocking contexts, including deep under rcu_read_lock().
	* it's done when walking a hash chain; having that chain modified
by ->d_compare() itself would require some modifications of callers and
those are very hot codepaths.

I agree that the problem is real, but this is no way to deal with it.
What we want is something along the lines of d_prune_aliases() done for all
inodes corresponding to given sysctl.  Done just before erase_header()
in start_unregistering().  That would require maintaining the list of
inodes in question (e.g. anchored in ctl_table_header) and a bit of care
in traversing it (use of igrab(), etc.)

In the current form - NAK.  Sorry.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] proc/sysctl: drop unregistered stale dentries as soon as possible
  2017-02-09  3:53   ` Al Viro
@ 2017-02-09  7:36     ` Konstantin Khlebnikov
  2017-02-09  8:40       ` Al Viro
  0 siblings, 1 reply; 14+ messages in thread
From: Konstantin Khlebnikov @ 2017-02-09  7:36 UTC (permalink / raw)
  To: Al Viro
  Cc: Andrew Morton, Konstantin Khlebnikov, Eric W. Biederman,
	Linux Kernel Mailing List

On Thu, Feb 9, 2017 at 6:53 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Wed, Feb 08, 2017 at 01:48:04PM -0800, Andrew Morton wrote:
>
>> > This patch detects stale dentry in proc_sys_compare and pretends that
>> > it has matching name - revalidation will kill it and lookup restarts.
>> > As a result each stale dentry will be seen only once and will not
>> > contaminate hash endlessly.
>> >
>>
>> What are "stale" dentries?  Unused dentries?  If so, why doesn't the
>> creation of a new dentry immediately invalidate the old dentry with a
>> matching path?  What do other filesystems do to prevent this issue?
>
> The whole point is that it's *NOT* a matching path.  Currently ->d_compare()
> for /proc/sys tries to make sure that sysctl getting unregistered means
> that no extra references will be added to dentries of the stuff we are
> trying to kick out.  If it's getting unregistered, ->d_compare() won't be
> seeing them and from that point on dentry refcount can only go down -
> no new lookups will increase it.
>
> This kludge tries to have _any_ lookup in the same hash chain pick the
> first dentry of such stuff, no matter what name/parent it has.  Then
> it relies upon ->d_revalidate() refusing to accept that sucker, so that
> it gets unhashed and we (hopefully) repeat the lookup.
>
> This is complete garbage.  Lookups won't be repeated indefinitely -
> if there are several such dentries in the hash chain we search, syscall
> will end up failing with ESTALE on thus buggered ->d_compare(), even though
> none of those dentries are anywhere near the path we are trying to resolve.
> No other filesystem attempts that kind of insanity, and for a good reason.
>
> The problem it tries to address is that sysctl unregistration doesn't
> unhash the now-stale dentries.  Before the unregistration we kept them
> even with refcount 0, until memory pressure evicts the suckers.  After
> unregistration we make sure that refcount reaching 0 will cause the
> instant eviction.  The problem is with the case when they had refcount 0
> to start with.  Then the eviction rule does not get triggered - it would have
> happened when dropping the last reference, but we don't have any.
>
> The kludge proposed in that patch is nowhere near being a sane way to deal
> with that.  Having ->d_compare() notice such dentries and quietly kick
> them out would be borderline saner, but
>         * it's a potentially blocking operation and ->d_compare() is called
> in non-blocking contexts, including deep under rcu_read_lock().
>         * it's done when walking a hash chain; having that chain modified
> by ->d_compare() itself would require some modifications of callers and
> those are very hot codepaths.
>
> I agree that the problem is real, but this is no way to deal with it.
> What we want is something along the lines of d_prune_aliases() done for all
> inodes corresponding to given sysctl.  Done just before erase_header()
> in start_unregistering().  That would require maintaining the list of
> inodes in question (e.g. anchored in ctl_table_header) and a bit of care
> in traversing it (use of igrab(), etc.)
>
> In the current form - NAK.  Sorry.

Ok, Thank you. I've expected that this fix isn't sane,

Maybe we could minimize changes for now. For example: keep these
stale dentries in memory but silently unhash them in ->d_compare().
Memory processure and reclaimer will kill them later.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] proc/sysctl: drop unregistered stale dentries as soon as possible
  2017-02-09  7:36     ` Konstantin Khlebnikov
@ 2017-02-09  8:40       ` Al Viro
  2017-02-10  7:35         ` [PATCH] proc/sysctl: prune stale dentries during unregistering Konstantin Khlebnikov
  0 siblings, 1 reply; 14+ messages in thread
From: Al Viro @ 2017-02-09  8:40 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Andrew Morton, Konstantin Khlebnikov, Eric W. Biederman,
	Linux Kernel Mailing List

On Thu, Feb 09, 2017 at 10:36:15AM +0300, Konstantin Khlebnikov wrote:

> Ok, Thank you. I've expected that this fix isn't sane,
> 
> Maybe we could minimize changes for now. For example: keep these
> stale dentries in memory but silently unhash them in ->d_compare().
> Memory processure and reclaimer will kill them later.

->d_compare() is called by the code walking the hash chains.  What's worse,
in the most common case all we have is rcu_read_lock().  Modifying the chain
in rcu reader is no-go.  Turning __d_lookup_rcu() into a writer on the
off-chance that we'll walk onto a visibly stale sysctl dentry - even more so.

If you want to deal with that, do it right, please.  Have sysctl inodes
on a list of some kind anchored in struct ctl_table_header; insert them
there in proc_sys_make_inode(), remove - in proc_evict_inode() (or
have it pass the inode to sysctl_head_put() and do the removal there).
Use sysctl_lock for serialization.

In start_unregistering(), just before the erase_header() call, check
if the list is non-empty and if it is -
	grab sysctl_lock
	last = NULL
	walk the list
		igrab(inode we are looking at)
		if succeeded
			drop sysctl_lock
			iput(last)
			last = that inode
			d_prune_aliases(last)
			retake sysctl_lock
			// inode is still not evicted, so it's still on the list
	drop sysctl_lock
	iput(last)

list would pass through struct proc_inode, and I would probably use
hlist rather than the normal one; might be more convenient to initialize
that way.  Getting from containing struct proc_inode to inode - &ei->vfs_inode.

It's not that much work; if you have time - go for it, or remind me after
-rc1...

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH] proc/sysctl: prune stale dentries during unregistering
  2017-02-09  8:40       ` Al Viro
@ 2017-02-10  7:35         ` Konstantin Khlebnikov
  2017-02-10  7:47           ` Al Viro
  2017-02-18 18:55           ` Konstantin Khlebnikov
  0 siblings, 2 replies; 14+ messages in thread
From: Konstantin Khlebnikov @ 2017-02-10  7:35 UTC (permalink / raw)
  To: linux-kernel, Al Viro; +Cc: Andrew Morton, Eric W. Biederman

Currently unregistering sysctl table does not prune its dentries.
Stale dentries could slowdown sysctl operations significantly.

For example, command:

# for i in {1..100000} ; do unshare -n -- sysctl -a &> /dev/null ; done

creates a millions of stale denties around sysctls of loopback interface:

# sysctl fs.dentry-state
fs.dentry-state = 25812579  24724135        45      0       0       0

All of them have matching names thus lookup have to scan though whole
hash chain and call d_compare (proc_sys_compare) which checks them
under system-wide spinlock (sysctl_lock).

# time sysctl -a > /dev/null
real    1m12.806s
user    0m0.016s
sys     1m12.400s

Currently only memory reclaimer could remove this garbage.
But without significant memory pressure this never happens.

This patch collects sysctl inodes into list on sysctl table header and
prunes all their dentries once that table unregisters.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/proc/inode.c        |    3 ++
 fs/proc/internal.h     |    7 ++++--
 fs/proc/proc_sysctl.c  |   59 +++++++++++++++++++++++++++++++++++-------------
 include/linux/sysctl.h |    1 +
 4 files changed, 51 insertions(+), 19 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 842a5ff5b85c..7ad9ed7958af 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -43,10 +43,11 @@ static void proc_evict_inode(struct inode *inode)
 	de = PDE(inode);
 	if (de)
 		pde_put(de);
+
 	head = PROC_I(inode)->sysctl;
 	if (head) {
 		RCU_INIT_POINTER(PROC_I(inode)->sysctl, NULL);
-		sysctl_head_put(head);
+		proc_sys_evict_inode(inode, head);
 	}
 }
 
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 2de5194ba378..ed1d762160e6 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -65,6 +65,7 @@ struct proc_inode {
 	struct proc_dir_entry *pde;
 	struct ctl_table_header *sysctl;
 	struct ctl_table *sysctl_entry;
+	struct list_head sysctl_inodes;
 	const struct proc_ns_operations *ns_ops;
 	struct inode vfs_inode;
 };
@@ -249,10 +250,12 @@ extern void proc_thread_self_init(void);
  */
 #ifdef CONFIG_PROC_SYSCTL
 extern int proc_sys_init(void);
-extern void sysctl_head_put(struct ctl_table_header *);
+extern void proc_sys_evict_inode(struct inode *inode,
+				 struct ctl_table_header *head);
 #else
 static inline void proc_sys_init(void) { }
-static inline void sysctl_head_put(struct ctl_table_header *head) { }
+static inline void proc_sys_evict_inode(struct  inode *inode,
+					struct ctl_table_header *head) { }
 #endif
 
 /*
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index d4e37acd4821..8efb1e10b025 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -190,6 +190,7 @@ static void init_header(struct ctl_table_header *head,
 	head->set = set;
 	head->parent = NULL;
 	head->node = node;
+	INIT_LIST_HEAD(&head->inodes);
 	if (node) {
 		struct ctl_table *entry;
 		for (entry = table; entry->procname; entry++, node++)
@@ -259,6 +260,29 @@ static void unuse_table(struct ctl_table_header *p)
 			complete(p->unregistering);
 }
 
+/* called under sysctl_lock */
+static void proc_sys_prune_dcache(struct ctl_table_header *head)
+{
+	struct inode *inode, *prev = NULL;
+	struct proc_inode *ei;
+
+	list_for_each_entry(ei, &head->inodes, sysctl_inodes) {
+		inode = igrab(&ei->vfs_inode);
+		if (inode) {
+			spin_unlock(&sysctl_lock);
+			iput(prev);
+			prev = inode;
+			d_prune_aliases(inode);
+			spin_lock(&sysctl_lock);
+		}
+	}
+	if (prev) {
+		spin_unlock(&sysctl_lock);
+		iput(prev);
+		spin_lock(&sysctl_lock);
+	}
+}
+
 /* called under sysctl_lock, will reacquire if has to wait */
 static void start_unregistering(struct ctl_table_header *p)
 {
@@ -278,27 +302,17 @@ static void start_unregistering(struct ctl_table_header *p)
 		p->unregistering = ERR_PTR(-EINVAL);
 	}
 	/*
+	 * Prune dentries for unregistered sysctls: namespaced sysctls
+	 * can have duplicate names and contaminate dcache very badly.
+	 */
+	proc_sys_prune_dcache(p);
+	/*
 	 * do not remove from the list until nobody holds it; walking the
 	 * list in do_sysctl() relies on that.
 	 */
 	erase_header(p);
 }
 
-static void sysctl_head_get(struct ctl_table_header *head)
-{
-	spin_lock(&sysctl_lock);
-	head->count++;
-	spin_unlock(&sysctl_lock);
-}
-
-void sysctl_head_put(struct ctl_table_header *head)
-{
-	spin_lock(&sysctl_lock);
-	if (!--head->count)
-		kfree_rcu(head, rcu);
-	spin_unlock(&sysctl_lock);
-}
-
 static struct ctl_table_header *sysctl_head_grab(struct ctl_table_header *head)
 {
 	BUG_ON(!head);
@@ -440,11 +454,15 @@ static struct inode *proc_sys_make_inode(struct super_block *sb,
 
 	inode->i_ino = get_next_ino();
 
-	sysctl_head_get(head);
 	ei = PROC_I(inode);
 	ei->sysctl = head;
 	ei->sysctl_entry = table;
 
+	spin_lock(&sysctl_lock);
+	list_add(&ei->sysctl_inodes, &head->inodes);
+	head->count++;
+	spin_unlock(&sysctl_lock);
+
 	inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
 	inode->i_mode = table->mode;
 	if (!S_ISDIR(table->mode)) {
@@ -466,6 +484,15 @@ static struct inode *proc_sys_make_inode(struct super_block *sb,
 	return inode;
 }
 
+void proc_sys_evict_inode(struct inode *inode, struct ctl_table_header *head)
+{
+	spin_lock(&sysctl_lock);
+	list_del(&PROC_I(inode)->sysctl_inodes);
+	if (!--head->count)
+		kfree_rcu(head, rcu);
+	spin_unlock(&sysctl_lock);
+}
+
 static struct ctl_table_header *grab_header(struct inode *inode)
 {
 	struct ctl_table_header *head = PROC_I(inode)->sysctl;
diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index adf4e51cf597..b7e82049fec7 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -143,6 +143,7 @@ struct ctl_table_header
 	struct ctl_table_set *set;
 	struct ctl_dir *parent;
 	struct ctl_node *node;
+	struct list_head inodes; /* head for proc_inode->sysctl_inodes */
 };
 
 struct ctl_dir {

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] proc/sysctl: prune stale dentries during unregistering
  2017-02-10  7:35         ` [PATCH] proc/sysctl: prune stale dentries during unregistering Konstantin Khlebnikov
@ 2017-02-10  7:47           ` Al Viro
  2017-02-10  7:54             ` Konstantin Khlebnikov
  2017-02-18 18:55           ` Konstantin Khlebnikov
  1 sibling, 1 reply; 14+ messages in thread
From: Al Viro @ 2017-02-10  7:47 UTC (permalink / raw)
  To: Konstantin Khlebnikov; +Cc: linux-kernel, Andrew Morton, Eric W. Biederman

On Fri, Feb 10, 2017 at 10:35:02AM +0300, Konstantin Khlebnikov wrote:

> # time sysctl -a > /dev/null
> real    1m12.806s
> user    0m0.016s
> sys     1m12.400s
> 
> Currently only memory reclaimer could remove this garbage.
> But without significant memory pressure this never happens.
> 
> This patch collects sysctl inodes into list on sysctl table header and
> prunes all their dentries once that table unregisters.

I'd probably go for hlist, but that's mostly cosmetic difference; how about
the matching stats *after* that patch?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] proc/sysctl: prune stale dentries during unregistering
  2017-02-10  7:47           ` Al Viro
@ 2017-02-10  7:54             ` Konstantin Khlebnikov
  2017-02-13  9:54               ` Eric W. Biederman
  0 siblings, 1 reply; 14+ messages in thread
From: Konstantin Khlebnikov @ 2017-02-10  7:54 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-kernel, Andrew Morton, Eric W. Biederman

On 10.02.2017 10:47, Al Viro wrote:
> On Fri, Feb 10, 2017 at 10:35:02AM +0300, Konstantin Khlebnikov wrote:
>
>> # time sysctl -a > /dev/null
>> real    1m12.806s
>> user    0m0.016s
>> sys     1m12.400s
>>
>> Currently only memory reclaimer could remove this garbage.
>> But without significant memory pressure this never happens.
>>
>> This patch collects sysctl inodes into list on sysctl table header and
>> prunes all their dentries once that table unregisters.
>
> I'd probably go for hlist, but that's mostly cosmetic difference; how about
> the matching stats *after* that patch?
>

dcache size doesn't grow endlessly, so stats are fine

# sysctl fs.dentry-state
fs.dentry-state = 92712	58376	45	0	0	0

# time sysctl -a &>/dev/null

real	0m0.013s
user	0m0.004s
sys	0m0.008s

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] proc/sysctl: prune stale dentries during unregistering
  2017-02-10  7:54             ` Konstantin Khlebnikov
@ 2017-02-13  9:54               ` Eric W. Biederman
  0 siblings, 0 replies; 14+ messages in thread
From: Eric W. Biederman @ 2017-02-13  9:54 UTC (permalink / raw)
  To: Konstantin Khlebnikov; +Cc: Al Viro, linux-kernel, Andrew Morton

Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:

> On 10.02.2017 10:47, Al Viro wrote:
>> On Fri, Feb 10, 2017 at 10:35:02AM +0300, Konstantin Khlebnikov wrote:
>>
>>> # time sysctl -a > /dev/null
>>> real    1m12.806s
>>> user    0m0.016s
>>> sys     1m12.400s
>>>
>>> Currently only memory reclaimer could remove this garbage.
>>> But without significant memory pressure this never happens.
>>>
>>> This patch collects sysctl inodes into list on sysctl table header and
>>> prunes all their dentries once that table unregisters.
>>
>> I'd probably go for hlist, but that's mostly cosmetic difference; how about
>> the matching stats *after* that patch?
>>
>
> dcache size doesn't grow endlessly, so stats are fine
>
> # sysctl fs.dentry-state
> fs.dentry-state = 92712	58376	45	0	0	0
>
> # time sysctl -a &>/dev/null
>
> real	0m0.013s
> user	0m0.004s
> sys	0m0.008s

Applied thanks,

Eric

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] proc/sysctl: prune stale dentries during unregistering
  2017-02-10  7:35         ` [PATCH] proc/sysctl: prune stale dentries during unregistering Konstantin Khlebnikov
  2017-02-10  7:47           ` Al Viro
@ 2017-02-18 18:55           ` Konstantin Khlebnikov
  2017-02-19  8:42             ` Al Viro
  1 sibling, 1 reply; 14+ messages in thread
From: Konstantin Khlebnikov @ 2017-02-18 18:55 UTC (permalink / raw)
  To: linux-kernel, Al Viro; +Cc: Andrew Morton, Eric W. Biederman

This patch has locking problem. I've got lockdep splat under LTP.

[ 6633.115456] ======================================================
[ 6633.115502] [ INFO: possible circular locking dependency detected ]
[ 6633.115553] 4.9.10-debug+ #9 Tainted: G             L
[ 6633.115584] -------------------------------------------------------
[ 6633.115627] ksm02/284980 is trying to acquire lock:
[ 6633.115659]  (&sb->s_type->i_lock_key#4){+.+...}, at: [<ffffffff816bc1ce>] igrab+0x1e/0x80
[ 6633.115834] but task is already holding lock:
[ 6633.115882]  (sysctl_lock){+.+...}, at: [<ffffffff817e379b>] unregister_sysctl_table+0x6b/0x110
[ 6633.116026] which lock already depends on the new lock.
[ 6633.116026]
[ 6633.116080]
[ 6633.116080] the existing dependency chain (in reverse order) is:
[ 6633.116117]
-> #2 (sysctl_lock){+.+...}:
-> #1 (&(&dentry->d_lockref.lock)->rlock){+.+...}:
-> #0 (&sb->s_type->i_lock_key#4){+.+...}:

d_lock nests inside i_lock
sysctl_lock nests inside d_lock in d_compare

This patch adds i_lock nesting inside sysctl_lock.

On 10.02.2017 10:35, Konstantin Khlebnikov wrote:
> Currently unregistering sysctl table does not prune its dentries.
> Stale dentries could slowdown sysctl operations significantly.
>
> For example, command:
>
> # for i in {1..100000} ; do unshare -n -- sysctl -a &> /dev/null ; done
>
> creates a millions of stale denties around sysctls of loopback interface:
>
> # sysctl fs.dentry-state
> fs.dentry-state = 25812579  24724135        45      0       0       0
>
> All of them have matching names thus lookup have to scan though whole
> hash chain and call d_compare (proc_sys_compare) which checks them
> under system-wide spinlock (sysctl_lock).
>
> # time sysctl -a > /dev/null
> real    1m12.806s
> user    0m0.016s
> sys     1m12.400s
>
> Currently only memory reclaimer could remove this garbage.
> But without significant memory pressure this never happens.
>
> This patch collects sysctl inodes into list on sysctl table header and
> prunes all their dentries once that table unregisters.
>
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  fs/proc/inode.c        |    3 ++
>  fs/proc/internal.h     |    7 ++++--
>  fs/proc/proc_sysctl.c  |   59 +++++++++++++++++++++++++++++++++++-------------
>  include/linux/sysctl.h |    1 +
>  4 files changed, 51 insertions(+), 19 deletions(-)
>
> diff --git a/fs/proc/inode.c b/fs/proc/inode.c
> index 842a5ff5b85c..7ad9ed7958af 100644
> --- a/fs/proc/inode.c
> +++ b/fs/proc/inode.c
> @@ -43,10 +43,11 @@ static void proc_evict_inode(struct inode *inode)
>  	de = PDE(inode);
>  	if (de)
>  		pde_put(de);
> +
>  	head = PROC_I(inode)->sysctl;
>  	if (head) {
>  		RCU_INIT_POINTER(PROC_I(inode)->sysctl, NULL);
> -		sysctl_head_put(head);
> +		proc_sys_evict_inode(inode, head);
>  	}
>  }
>
> diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> index 2de5194ba378..ed1d762160e6 100644
> --- a/fs/proc/internal.h
> +++ b/fs/proc/internal.h
> @@ -65,6 +65,7 @@ struct proc_inode {
>  	struct proc_dir_entry *pde;
>  	struct ctl_table_header *sysctl;
>  	struct ctl_table *sysctl_entry;
> +	struct list_head sysctl_inodes;
>  	const struct proc_ns_operations *ns_ops;
>  	struct inode vfs_inode;
>  };
> @@ -249,10 +250,12 @@ extern void proc_thread_self_init(void);
>   */
>  #ifdef CONFIG_PROC_SYSCTL
>  extern int proc_sys_init(void);
> -extern void sysctl_head_put(struct ctl_table_header *);
> +extern void proc_sys_evict_inode(struct inode *inode,
> +				 struct ctl_table_header *head);
>  #else
>  static inline void proc_sys_init(void) { }
> -static inline void sysctl_head_put(struct ctl_table_header *head) { }
> +static inline void proc_sys_evict_inode(struct  inode *inode,
> +					struct ctl_table_header *head) { }
>  #endif
>
>  /*
> diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
> index d4e37acd4821..8efb1e10b025 100644
> --- a/fs/proc/proc_sysctl.c
> +++ b/fs/proc/proc_sysctl.c
> @@ -190,6 +190,7 @@ static void init_header(struct ctl_table_header *head,
>  	head->set = set;
>  	head->parent = NULL;
>  	head->node = node;
> +	INIT_LIST_HEAD(&head->inodes);
>  	if (node) {
>  		struct ctl_table *entry;
>  		for (entry = table; entry->procname; entry++, node++)
> @@ -259,6 +260,29 @@ static void unuse_table(struct ctl_table_header *p)
>  			complete(p->unregistering);
>  }
>
> +/* called under sysctl_lock */
> +static void proc_sys_prune_dcache(struct ctl_table_header *head)
> +{
> +	struct inode *inode, *prev = NULL;
> +	struct proc_inode *ei;
> +
> +	list_for_each_entry(ei, &head->inodes, sysctl_inodes) {
> +		inode = igrab(&ei->vfs_inode);
> +		if (inode) {
> +			spin_unlock(&sysctl_lock);
> +			iput(prev);
> +			prev = inode;
> +			d_prune_aliases(inode);
> +			spin_lock(&sysctl_lock);
> +		}
> +	}
> +	if (prev) {
> +		spin_unlock(&sysctl_lock);
> +		iput(prev);
> +		spin_lock(&sysctl_lock);
> +	}
> +}
> +
>  /* called under sysctl_lock, will reacquire if has to wait */
>  static void start_unregistering(struct ctl_table_header *p)
>  {
> @@ -278,27 +302,17 @@ static void start_unregistering(struct ctl_table_header *p)
>  		p->unregistering = ERR_PTR(-EINVAL);
>  	}
>  	/*
> +	 * Prune dentries for unregistered sysctls: namespaced sysctls
> +	 * can have duplicate names and contaminate dcache very badly.
> +	 */
> +	proc_sys_prune_dcache(p);
> +	/*
>  	 * do not remove from the list until nobody holds it; walking the
>  	 * list in do_sysctl() relies on that.
>  	 */
>  	erase_header(p);
>  }
>
> -static void sysctl_head_get(struct ctl_table_header *head)
> -{
> -	spin_lock(&sysctl_lock);
> -	head->count++;
> -	spin_unlock(&sysctl_lock);
> -}
> -
> -void sysctl_head_put(struct ctl_table_header *head)
> -{
> -	spin_lock(&sysctl_lock);
> -	if (!--head->count)
> -		kfree_rcu(head, rcu);
> -	spin_unlock(&sysctl_lock);
> -}
> -
>  static struct ctl_table_header *sysctl_head_grab(struct ctl_table_header *head)
>  {
>  	BUG_ON(!head);
> @@ -440,11 +454,15 @@ static struct inode *proc_sys_make_inode(struct super_block *sb,
>
>  	inode->i_ino = get_next_ino();
>
> -	sysctl_head_get(head);
>  	ei = PROC_I(inode);
>  	ei->sysctl = head;
>  	ei->sysctl_entry = table;
>
> +	spin_lock(&sysctl_lock);
> +	list_add(&ei->sysctl_inodes, &head->inodes);
> +	head->count++;
> +	spin_unlock(&sysctl_lock);
> +
>  	inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
>  	inode->i_mode = table->mode;
>  	if (!S_ISDIR(table->mode)) {
> @@ -466,6 +484,15 @@ static struct inode *proc_sys_make_inode(struct super_block *sb,
>  	return inode;
>  }
>
> +void proc_sys_evict_inode(struct inode *inode, struct ctl_table_header *head)
> +{
> +	spin_lock(&sysctl_lock);
> +	list_del(&PROC_I(inode)->sysctl_inodes);
> +	if (!--head->count)
> +		kfree_rcu(head, rcu);
> +	spin_unlock(&sysctl_lock);
> +}
> +
>  static struct ctl_table_header *grab_header(struct inode *inode)
>  {
>  	struct ctl_table_header *head = PROC_I(inode)->sysctl;
> diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
> index adf4e51cf597..b7e82049fec7 100644
> --- a/include/linux/sysctl.h
> +++ b/include/linux/sysctl.h
> @@ -143,6 +143,7 @@ struct ctl_table_header
>  	struct ctl_table_set *set;
>  	struct ctl_dir *parent;
>  	struct ctl_node *node;
> +	struct list_head inodes; /* head for proc_inode->sysctl_inodes */
>  };
>
>  struct ctl_dir {
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] proc/sysctl: prune stale dentries during unregistering
  2017-02-18 18:55           ` Konstantin Khlebnikov
@ 2017-02-19  8:42             ` Al Viro
  2017-02-21  1:41               ` [REVIEW][PATCH] proc/sysctl: Don't grab i_lock under sysctl_lock Eric W. Biederman
  0 siblings, 1 reply; 14+ messages in thread
From: Al Viro @ 2017-02-19  8:42 UTC (permalink / raw)
  To: Konstantin Khlebnikov; +Cc: linux-kernel, Andrew Morton, Eric W. Biederman

On Sat, Feb 18, 2017 at 09:55:28PM +0300, Konstantin Khlebnikov wrote:
> This patch has locking problem. I've got lockdep splat under LTP.
> 
> d_lock nests inside i_lock
> sysctl_lock nests inside d_lock in d_compare
> 
> This patch adds i_lock nesting inside sysctl_lock.

Once ->unregistering is set, you can drop sysctl_lock just fine.  So I'd
try something like this - use rcu_read_lock() in proc_sys_prune_dcache(),
drop sysctl_lock() before it and regain after.  Make sure that no inodes
are added to the list ones ->unregistering has been set and use RCU list
primitives for modifying the inode list, with sysctl_lock still used to
serialize its modifications.

Freeing struct inode is RCU-delayed (see proc_destroy_inode()), so doing
igrab() is safe there.  Since we don't drop inode reference until after we'd
passed beyond it in the list, list_for_each_entry_rcu() should be fine,
AFAICS.  Below is a completely untested modification of your patch along
those lines:

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 842a5ff5b85c..7ad9ed7958af 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -43,10 +43,11 @@ static void proc_evict_inode(struct inode *inode)
 	de = PDE(inode);
 	if (de)
 		pde_put(de);
+
 	head = PROC_I(inode)->sysctl;
 	if (head) {
 		RCU_INIT_POINTER(PROC_I(inode)->sysctl, NULL);
-		sysctl_head_put(head);
+		proc_sys_evict_inode(inode, head);
 	}
 }
 
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 2de5194ba378..ed1d762160e6 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -65,6 +65,7 @@ struct proc_inode {
 	struct proc_dir_entry *pde;
 	struct ctl_table_header *sysctl;
 	struct ctl_table *sysctl_entry;
+	struct list_head sysctl_inodes;
 	const struct proc_ns_operations *ns_ops;
 	struct inode vfs_inode;
 };
@@ -249,10 +250,12 @@ extern void proc_thread_self_init(void);
  */
 #ifdef CONFIG_PROC_SYSCTL
 extern int proc_sys_init(void);
-extern void sysctl_head_put(struct ctl_table_header *);
+extern void proc_sys_evict_inode(struct inode *inode,
+				 struct ctl_table_header *head);
 #else
 static inline void proc_sys_init(void) { }
-static inline void sysctl_head_put(struct ctl_table_header *head) { }
+static inline void proc_sys_evict_inode(struct  inode *inode,
+					struct ctl_table_header *head) { }
 #endif
 
 /*
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 55313d994895..6477c4a2dc6c 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -190,6 +190,7 @@ static void init_header(struct ctl_table_header *head,
 	head->set = set;
 	head->parent = NULL;
 	head->node = node;
+	INIT_LIST_HEAD(&head->inodes);
 	if (node) {
 		struct ctl_table *entry;
 		for (entry = table; entry->procname; entry++, node++)
@@ -259,6 +260,26 @@ static void unuse_table(struct ctl_table_header *p)
 			complete(p->unregistering);
 }
 
+static void proc_sys_prune_dcache(struct ctl_table_header *head)
+{
+	struct inode *inode, *prev = NULL;
+	struct proc_inode *ei;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(ei, &head->inodes, sysctl_inodes) {
+		inode = igrab(&ei->vfs_inode);
+		if (inode) {
+			rcu_read_unlock();
+			iput(prev);
+			prev = inode;
+			d_prune_aliases(inode);
+			rcu_read_lock();
+		}
+	}
+	rcu_read_unlock();
+	iput(prev);
+}
+
 /* called under sysctl_lock, will reacquire if has to wait */
 static void start_unregistering(struct ctl_table_header *p)
 {
@@ -272,31 +293,22 @@ static void start_unregistering(struct ctl_table_header *p)
 		p->unregistering = &wait;
 		spin_unlock(&sysctl_lock);
 		wait_for_completion(&wait);
-		spin_lock(&sysctl_lock);
 	} else {
 		/* anything non-NULL; we'll never dereference it */
 		p->unregistering = ERR_PTR(-EINVAL);
+		spin_unlock(&sysctl_lock);
 	}
 	/*
+	 * Prune dentries for unregistered sysctls: namespaced sysctls
+	 * can have duplicate names and contaminate dcache very badly.
+	 */
+	proc_sys_prune_dcache(p);
+	/*
 	 * do not remove from the list until nobody holds it; walking the
 	 * list in do_sysctl() relies on that.
 	 */
-	erase_header(p);
-}
-
-static void sysctl_head_get(struct ctl_table_header *head)
-{
-	spin_lock(&sysctl_lock);
-	head->count++;
-	spin_unlock(&sysctl_lock);
-}
-
-void sysctl_head_put(struct ctl_table_header *head)
-{
 	spin_lock(&sysctl_lock);
-	if (!--head->count)
-		kfree_rcu(head, rcu);
-	spin_unlock(&sysctl_lock);
+	erase_header(p);
 }
 
 static struct ctl_table_header *sysctl_head_grab(struct ctl_table_header *head)
@@ -440,10 +452,20 @@ static struct inode *proc_sys_make_inode(struct super_block *sb,
 
 	inode->i_ino = get_next_ino();
 
-	sysctl_head_get(head);
 	ei = PROC_I(inode);
+
+	spin_lock(&sysctl_lock);
+	if (unlikely(head->unregistering)) {
+		spin_unlock(&sysctl_lock);
+		iput(inode);
+		inode = NULL;
+		goto out;
+	}
 	ei->sysctl = head;
 	ei->sysctl_entry = table;
+	list_add_rcu(&ei->sysctl_inodes, &head->inodes);
+	head->count++;
+	spin_unlock(&sysctl_lock);
 
 	inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
 	inode->i_mode = table->mode;
@@ -466,6 +488,15 @@ static struct inode *proc_sys_make_inode(struct super_block *sb,
 	return inode;
 }
 
+void proc_sys_evict_inode(struct inode *inode, struct ctl_table_header *head)
+{
+	spin_lock(&sysctl_lock);
+	list_del_rcu(&PROC_I(inode)->sysctl_inodes);
+	if (!--head->count)
+		kfree_rcu(head, rcu);
+	spin_unlock(&sysctl_lock);
+}
+
 static struct ctl_table_header *grab_header(struct inode *inode)
 {
 	struct ctl_table_header *head = PROC_I(inode)->sysctl;
diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index adf4e51cf597..b7e82049fec7 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -143,6 +143,7 @@ struct ctl_table_header
 	struct ctl_table_set *set;
 	struct ctl_dir *parent;
 	struct ctl_node *node;
+	struct list_head inodes; /* head for proc_inode->sysctl_inodes */
 };
 
 struct ctl_dir {

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [REVIEW][PATCH] proc/sysctl: Don't grab i_lock under sysctl_lock.
  2017-02-19  8:42             ` Al Viro
@ 2017-02-21  1:41               ` Eric W. Biederman
  2017-02-21  8:40                 ` Konstantin Khlebnikov
  0 siblings, 1 reply; 14+ messages in thread
From: Eric W. Biederman @ 2017-02-21  1:41 UTC (permalink / raw)
  To: Al Viro; +Cc: Konstantin Khlebnikov, linux-kernel, Andrew Morton


Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:
> This patch has locking problem. I've got lockdep splat under LTP.
>
> [ 6633.115456] ======================================================
> [ 6633.115502] [ INFO: possible circular locking dependency detected ]
> [ 6633.115553] 4.9.10-debug+ #9 Tainted: G             L
> [ 6633.115584] -------------------------------------------------------
> [ 6633.115627] ksm02/284980 is trying to acquire lock:
> [ 6633.115659]  (&sb->s_type->i_lock_key#4){+.+...}, at: [<ffffffff816bc1ce>] igrab+0x1e/0x80
> [ 6633.115834] but task is already holding lock:
> [ 6633.115882]  (sysctl_lock){+.+...}, at: [<ffffffff817e379b>] unregister_sysctl_table+0x6b/0x110
> [ 6633.116026] which lock already depends on the new lock.
> [ 6633.116026]
> [ 6633.116080]
> [ 6633.116080] the existing dependency chain (in reverse order) is:
> [ 6633.116117]
> -> #2 (sysctl_lock){+.+...}:
> -> #1 (&(&dentry->d_lockref.lock)->rlock){+.+...}:
> -> #0 (&sb->s_type->i_lock_key#4){+.+...}:
>
> d_lock nests inside i_lock
> sysctl_lock nests inside d_lock in d_compare
>
> This patch adds i_lock nesting inside sysctl_lock.

Al Viro <viro@ZenIV.linux.org.uk> replied:
> Once ->unregistering is set, you can drop sysctl_lock just fine.  So I'd
> try something like this - use rcu_read_lock() in proc_sys_prune_dcache(),
> drop sysctl_lock() before it and regain after.  Make sure that no inodes
> are added to the list ones ->unregistering has been set and use RCU list
> primitives for modifying the inode list, with sysctl_lock still used to
> serialize its modifications.
>
> Freeing struct inode is RCU-delayed (see proc_destroy_inode()), so doing
> igrab() is safe there.  Since we don't drop inode reference until after we'd
> passed beyond it in the list, list_for_each_entry_rcu() should be fine.

I agree with Al Viro's analsysis of the situtation.

Fixes: 802e348c6b77 ("proc/sysctl: prune stale dentries during unregistering")
Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---

This is my cleaned up version of Al Viro's proposed fix.
I have tested it and the lockdep warnings go away, and
I have fixed a few trivial to ensure things work as intended.

Unless someone sees a problem I am going to add this fix to my tree and
then send a pull request to Linus.

 fs/proc/proc_sysctl.c | 31 ++++++++++++++++++-------------
 1 file changed, 18 insertions(+), 13 deletions(-)

diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 8efb1e10b025..3e64c6502dc8 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -266,21 +266,19 @@ static void proc_sys_prune_dcache(struct ctl_table_header *head)
 	struct inode *inode, *prev = NULL;
 	struct proc_inode *ei;
 
-	list_for_each_entry(ei, &head->inodes, sysctl_inodes) {
+	rcu_read_lock();
+	list_for_each_entry_rcu(ei, &head->inodes, sysctl_inodes) {
 		inode = igrab(&ei->vfs_inode);
 		if (inode) {
-			spin_unlock(&sysctl_lock);
+			rcu_read_unlock();
 			iput(prev);
 			prev = inode;
 			d_prune_aliases(inode);
-			spin_lock(&sysctl_lock);
+			rcu_read_lock();
 		}
 	}
-	if (prev) {
-		spin_unlock(&sysctl_lock);
-		iput(prev);
-		spin_lock(&sysctl_lock);
-	}
+	rcu_read_unlock();
+	iput(prev);
 }
 
 /* called under sysctl_lock, will reacquire if has to wait */
@@ -296,10 +294,10 @@ static void start_unregistering(struct ctl_table_header *p)
 		p->unregistering = &wait;
 		spin_unlock(&sysctl_lock);
 		wait_for_completion(&wait);
-		spin_lock(&sysctl_lock);
 	} else {
 		/* anything non-NULL; we'll never dereference it */
 		p->unregistering = ERR_PTR(-EINVAL);
+		spin_unlock(&sysctl_lock);
 	}
 	/*
 	 * Prune dentries for unregistered sysctls: namespaced sysctls
@@ -310,6 +308,7 @@ static void start_unregistering(struct ctl_table_header *p)
 	 * do not remove from the list until nobody holds it; walking the
 	 * list in do_sysctl() relies on that.
 	 */
+	spin_lock(&sysctl_lock);
 	erase_header(p);
 }
 
@@ -455,11 +454,17 @@ static struct inode *proc_sys_make_inode(struct super_block *sb,
 	inode->i_ino = get_next_ino();
 
 	ei = PROC_I(inode);
-	ei->sysctl = head;
-	ei->sysctl_entry = table;
 
 	spin_lock(&sysctl_lock);
-	list_add(&ei->sysctl_inodes, &head->inodes);
+	if (unlikely(head->unregistering)) {
+		spin_unlock(&sysctl_lock);
+		iput(inode);
+		inode = NULL;
+		goto out;
+	}
+	ei->sysctl = head;
+	ei->sysctl_entry = table;
+	list_add_rcu(&ei->sysctl_inodes, &head->inodes);
 	head->count++;
 	spin_unlock(&sysctl_lock);
 
@@ -487,7 +492,7 @@ static struct inode *proc_sys_make_inode(struct super_block *sb,
 void proc_sys_evict_inode(struct inode *inode, struct ctl_table_header *head)
 {
 	spin_lock(&sysctl_lock);
-	list_del(&PROC_I(inode)->sysctl_inodes);
+	list_del_rcu(&PROC_I(inode)->sysctl_inodes);
 	if (!--head->count)
 		kfree_rcu(head, rcu);
 	spin_unlock(&sysctl_lock);
-- 
2.10.1

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [REVIEW][PATCH] proc/sysctl: Don't grab i_lock under sysctl_lock.
  2017-02-21  1:41               ` [REVIEW][PATCH] proc/sysctl: Don't grab i_lock under sysctl_lock Eric W. Biederman
@ 2017-02-21  8:40                 ` Konstantin Khlebnikov
  2017-02-21 19:29                   ` Eric W. Biederman
  0 siblings, 1 reply; 14+ messages in thread
From: Konstantin Khlebnikov @ 2017-02-21  8:40 UTC (permalink / raw)
  To: Eric W. Biederman, Al Viro; +Cc: linux-kernel, Andrew Morton



On 21.02.2017 04:41, Eric W. Biederman wrote:
>
> Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:
>> This patch has locking problem. I've got lockdep splat under LTP.
>>
>> [ 6633.115456] ======================================================
>> [ 6633.115502] [ INFO: possible circular locking dependency detected ]
>> [ 6633.115553] 4.9.10-debug+ #9 Tainted: G             L
>> [ 6633.115584] -------------------------------------------------------
>> [ 6633.115627] ksm02/284980 is trying to acquire lock:
>> [ 6633.115659]  (&sb->s_type->i_lock_key#4){+.+...}, at: [<ffffffff816bc1ce>] igrab+0x1e/0x80
>> [ 6633.115834] but task is already holding lock:
>> [ 6633.115882]  (sysctl_lock){+.+...}, at: [<ffffffff817e379b>] unregister_sysctl_table+0x6b/0x110
>> [ 6633.116026] which lock already depends on the new lock.
>> [ 6633.116026]
>> [ 6633.116080]
>> [ 6633.116080] the existing dependency chain (in reverse order) is:
>> [ 6633.116117]
>> -> #2 (sysctl_lock){+.+...}:
>> -> #1 (&(&dentry->d_lockref.lock)->rlock){+.+...}:
>> -> #0 (&sb->s_type->i_lock_key#4){+.+...}:
>>
>> d_lock nests inside i_lock
>> sysctl_lock nests inside d_lock in d_compare
>>
>> This patch adds i_lock nesting inside sysctl_lock.
>
> Al Viro <viro@ZenIV.linux.org.uk> replied:
>> Once ->unregistering is set, you can drop sysctl_lock just fine.  So I'd
>> try something like this - use rcu_read_lock() in proc_sys_prune_dcache(),
>> drop sysctl_lock() before it and regain after.  Make sure that no inodes
>> are added to the list ones ->unregistering has been set and use RCU list
>> primitives for modifying the inode list, with sysctl_lock still used to
>> serialize its modifications.
>>
>> Freeing struct inode is RCU-delayed (see proc_destroy_inode()), so doing
>> igrab() is safe there.  Since we don't drop inode reference until after we'd
>> passed beyond it in the list, list_for_each_entry_rcu() should be fine.
>
> I agree with Al Viro's analsysis of the situtation.
>
> Fixes: 802e348c6b77 ("proc/sysctl: prune stale dentries during unregistering")
> Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>
> This is my cleaned up version of Al Viro's proposed fix.
> I have tested it and the lockdep warnings go away, and
> I have fixed a few trivial to ensure things work as intended.
>
> Unless someone sees a problem I am going to add this fix to my tree and
> then send a pull request to Linus.

I've tested the same patch and found no problems.

Except proc_sys_prune_dcache() is no longer called under sysctl_lock like says comment above it.

>
>  fs/proc/proc_sysctl.c | 31 ++++++++++++++++++-------------
>  1 file changed, 18 insertions(+), 13 deletions(-)
>
> diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
> index 8efb1e10b025..3e64c6502dc8 100644
> --- a/fs/proc/proc_sysctl.c
> +++ b/fs/proc/proc_sysctl.c
> @@ -266,21 +266,19 @@ static void proc_sys_prune_dcache(struct ctl_table_header *head)
>  	struct inode *inode, *prev = NULL;
>  	struct proc_inode *ei;
>
> -	list_for_each_entry(ei, &head->inodes, sysctl_inodes) {
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(ei, &head->inodes, sysctl_inodes) {
>  		inode = igrab(&ei->vfs_inode);
>  		if (inode) {
> -			spin_unlock(&sysctl_lock);
> +			rcu_read_unlock();
>  			iput(prev);
>  			prev = inode;
>  			d_prune_aliases(inode);
> -			spin_lock(&sysctl_lock);
> +			rcu_read_lock();
>  		}
>  	}
> -	if (prev) {
> -		spin_unlock(&sysctl_lock);
> -		iput(prev);
> -		spin_lock(&sysctl_lock);
> -	}
> +	rcu_read_unlock();
> +	iput(prev);
>  }
>
>  /* called under sysctl_lock, will reacquire if has to wait */
> @@ -296,10 +294,10 @@ static void start_unregistering(struct ctl_table_header *p)
>  		p->unregistering = &wait;
>  		spin_unlock(&sysctl_lock);
>  		wait_for_completion(&wait);
> -		spin_lock(&sysctl_lock);
>  	} else {
>  		/* anything non-NULL; we'll never dereference it */
>  		p->unregistering = ERR_PTR(-EINVAL);
> +		spin_unlock(&sysctl_lock);
>  	}
>  	/*
>  	 * Prune dentries for unregistered sysctls: namespaced sysctls
> @@ -310,6 +308,7 @@ static void start_unregistering(struct ctl_table_header *p)
>  	 * do not remove from the list until nobody holds it; walking the
>  	 * list in do_sysctl() relies on that.
>  	 */
> +	spin_lock(&sysctl_lock);
>  	erase_header(p);
>  }
>
> @@ -455,11 +454,17 @@ static struct inode *proc_sys_make_inode(struct super_block *sb,
>  	inode->i_ino = get_next_ino();
>
>  	ei = PROC_I(inode);
> -	ei->sysctl = head;
> -	ei->sysctl_entry = table;
>
>  	spin_lock(&sysctl_lock);
> -	list_add(&ei->sysctl_inodes, &head->inodes);
> +	if (unlikely(head->unregistering)) {
> +		spin_unlock(&sysctl_lock);
> +		iput(inode);
> +		inode = NULL;
> +		goto out;
> +	}
> +	ei->sysctl = head;
> +	ei->sysctl_entry = table;
> +	list_add_rcu(&ei->sysctl_inodes, &head->inodes);
>  	head->count++;
>  	spin_unlock(&sysctl_lock);
>
> @@ -487,7 +492,7 @@ static struct inode *proc_sys_make_inode(struct super_block *sb,
>  void proc_sys_evict_inode(struct inode *inode, struct ctl_table_header *head)
>  {
>  	spin_lock(&sysctl_lock);
> -	list_del(&PROC_I(inode)->sysctl_inodes);
> +	list_del_rcu(&PROC_I(inode)->sysctl_inodes);
>  	if (!--head->count)
>  		kfree_rcu(head, rcu);
>  	spin_unlock(&sysctl_lock);
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [REVIEW][PATCH] proc/sysctl: Don't grab i_lock under sysctl_lock.
  2017-02-21  8:40                 ` Konstantin Khlebnikov
@ 2017-02-21 19:29                   ` Eric W. Biederman
  0 siblings, 0 replies; 14+ messages in thread
From: Eric W. Biederman @ 2017-02-21 19:29 UTC (permalink / raw)
  To: Konstantin Khlebnikov; +Cc: Al Viro, linux-kernel, Andrew Morton

Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:

> On 21.02.2017 04:41, Eric W. Biederman wrote:
>>
>> Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:
>>> This patch has locking problem. I've got lockdep splat under LTP.
>>>
>>> [ 6633.115456] ======================================================
>>> [ 6633.115502] [ INFO: possible circular locking dependency detected ]
>>> [ 6633.115553] 4.9.10-debug+ #9 Tainted: G             L
>>> [ 6633.115584] -------------------------------------------------------
>>> [ 6633.115627] ksm02/284980 is trying to acquire lock:
>>> [ 6633.115659]  (&sb->s_type->i_lock_key#4){+.+...}, at: [<ffffffff816bc1ce>] igrab+0x1e/0x80
>>> [ 6633.115834] but task is already holding lock:
>>> [ 6633.115882]  (sysctl_lock){+.+...}, at: [<ffffffff817e379b>] unregister_sysctl_table+0x6b/0x110
>>> [ 6633.116026] which lock already depends on the new lock.
>>> [ 6633.116026]
>>> [ 6633.116080]
>>> [ 6633.116080] the existing dependency chain (in reverse order) is:
>>> [ 6633.116117]
>>> -> #2 (sysctl_lock){+.+...}:
>>> -> #1 (&(&dentry->d_lockref.lock)->rlock){+.+...}:
>>> -> #0 (&sb->s_type->i_lock_key#4){+.+...}:
>>>
>>> d_lock nests inside i_lock
>>> sysctl_lock nests inside d_lock in d_compare
>>>
>>> This patch adds i_lock nesting inside sysctl_lock.
>>
>> Al Viro <viro@ZenIV.linux.org.uk> replied:
>>> Once ->unregistering is set, you can drop sysctl_lock just fine.  So I'd
>>> try something like this - use rcu_read_lock() in proc_sys_prune_dcache(),
>>> drop sysctl_lock() before it and regain after.  Make sure that no inodes
>>> are added to the list ones ->unregistering has been set and use RCU list
>>> primitives for modifying the inode list, with sysctl_lock still used to
>>> serialize its modifications.
>>>
>>> Freeing struct inode is RCU-delayed (see proc_destroy_inode()), so doing
>>> igrab() is safe there.  Since we don't drop inode reference until after we'd
>>> passed beyond it in the list, list_for_each_entry_rcu() should be fine.
>>
>> I agree with Al Viro's analsysis of the situtation.
>>
>> Fixes: 802e348c6b77 ("proc/sysctl: prune stale dentries during unregistering")
>> Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
>> Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>>
>> This is my cleaned up version of Al Viro's proposed fix.
>> I have tested it and the lockdep warnings go away, and
>> I have fixed a few trivial to ensure things work as intended.
>>
>> Unless someone sees a problem I am going to add this fix to my tree and
>> then send a pull request to Linus.
>
> I've tested the same patch and found no problems.
>
> Except proc_sys_prune_dcache() is no longer called under sysctl_lock
> like says comment above it.

Thank you.  I will add your Tested-by line to the patch.

Eric

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2017-02-21 19:34 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-08 10:48 [PATCH] proc/sysctl: drop unregistered stale dentries as soon as possible Konstantin Khlebnikov
2017-02-08 21:48 ` Andrew Morton
2017-02-09  3:53   ` Al Viro
2017-02-09  7:36     ` Konstantin Khlebnikov
2017-02-09  8:40       ` Al Viro
2017-02-10  7:35         ` [PATCH] proc/sysctl: prune stale dentries during unregistering Konstantin Khlebnikov
2017-02-10  7:47           ` Al Viro
2017-02-10  7:54             ` Konstantin Khlebnikov
2017-02-13  9:54               ` Eric W. Biederman
2017-02-18 18:55           ` Konstantin Khlebnikov
2017-02-19  8:42             ` Al Viro
2017-02-21  1:41               ` [REVIEW][PATCH] proc/sysctl: Don't grab i_lock under sysctl_lock Eric W. Biederman
2017-02-21  8:40                 ` Konstantin Khlebnikov
2017-02-21 19:29                   ` Eric W. Biederman

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.