Linux-Fsdevel Archive on lore.kernel.org
* [PATCH v3 0/4] kernfs: proposed locking and concurrency improvement
@ 2021-04-09  1:14 Ian Kent
  2021-04-09  1:14 ` [PATCH v3 1/4] kernfs: move revalidate to be near lookup Ian Kent
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: Ian Kent @ 2021-04-09  1:14 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Tejun Heo
  Cc: Brice Goglin, Fox Chen, Rick Lindsley, Al Viro, Miklos Szeredi,
	David Howells, Eric Sandeen, Kernel Mailing List, linux-fsdevel

There have been a few instances of contention on the kernfs_mutex during
path walks: a case on very large IBM systems seen by myself, a report by
Brice Goglin followed up by Fox Chen, and I've since seen a couple of
other reports from CoreOS users.

The common thread is a large number of kernfs path walks leading to
slowness of path walks due to kernfs_mutex contention.

The problem is that changes to the VFS over time have increased its
concurrency capabilities to the point that kernfs's use of a mutex is
no longer appropriate. There's also a less common problem: walks for
non-existent paths can cause contention if there are enough of them.

This patch series is relatively straightforward.

All it does is take advantage of VFS negative dentry caching to avoid
needless dentry alloc/free cycles for lookups of paths that don't exist,
and change the kernfs_mutex to a read/write semaphore.

The patch that tried to stay in VFS rcu-walk mode during path walks has
been dropped for two reasons. First, it doesn't actually give very much
improvement and, second, if there's a place where mistakes could go
unnoticed it would be in that path. This makes the patch series simpler
to review and reduces the likelihood of problems going unnoticed and
popping up later.

The patch that used a revision to identify whether a directory has
changed has also been dropped. If the directory has changed, the dentry
revision needs to be updated to avoid subsequent rb tree searches, and
after changing to a read/write semaphore that update also requires a
lock. But d_lock is the only lock available at this point and it might
itself be contended.

Changes since v2:
- actually fix the inode attribute update locking.
- drop the patch that tried to stay in rcu-walk mode.
- drop the use a revision to identify if a directory has changed patch.

Changes since v1:
- fix locking in .permission() and .getattr() by re-factoring the attribute
  handling code.

---

Ian Kent (4):
      kernfs: move revalidate to be near lookup
      kernfs: use VFS negative dentry caching
      kernfs: switch kernfs to use an rwsem
      kernfs: use i_lock to protect concurrent inode updates


 fs/kernfs/dir.c             |  240 +++++++++++++++++++++++--------------------
 fs/kernfs/file.c            |    4 -
 fs/kernfs/inode.c           |   18 ++-
 fs/kernfs/kernfs-internal.h |    5 +
 fs/kernfs/mount.c           |   12 +-
 fs/kernfs/symlink.c         |    4 -
 include/linux/kernfs.h      |    2 
 7 files changed, 155 insertions(+), 130 deletions(-)

--



* [PATCH v3 1/4] kernfs: move revalidate to be near lookup
  2021-04-09  1:14 [PATCH v3 0/4] kernfs: proposed locking and concurrency improvement Ian Kent
@ 2021-04-09  1:14 ` Ian Kent
  2021-04-09  1:15 ` [PATCH v3 2/4] kernfs: use VFS negative dentry caching Ian Kent
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: Ian Kent @ 2021-04-09  1:14 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Tejun Heo
  Cc: Brice Goglin, Fox Chen, Rick Lindsley, Al Viro, Miklos Szeredi,
	David Howells, Eric Sandeen, Kernel Mailing List, linux-fsdevel

While the dentry operation kernfs_dop_revalidate() is grouped with the
other dentry-related functions, it also has a strong affinity to the
inode operation ->lookup().

In order to take advantage of the VFS negative dentry caching, which
can be used to reduce path lookup overhead for non-existent paths, it
will need to call kernfs_find_ns(). So, to avoid a forward declaration,
move it to be near kernfs_iop_lookup().

There's no functional change from this patch.
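
For illustration, the alternative (not taken here) would have been a
forward declaration ahead of the dentry_operations definition, roughly:

/* Hypothetical alternative: leave kernfs_dop_revalidate() where it is
 * and forward-declare the helper it will need in the following patch.
 */
static struct kernfs_node *kernfs_find_ns(struct kernfs_node *parent,
					  const unsigned char *name,
					  const void *ns);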

Signed-off-by: Ian Kent <raven@themaw.net>
---
 fs/kernfs/dir.c |   86 ++++++++++++++++++++++++++++---------------------------
 1 file changed, 43 insertions(+), 43 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 7e0e62deab53c..4c69e2af82dac 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -548,49 +548,6 @@ void kernfs_put(struct kernfs_node *kn)
 }
 EXPORT_SYMBOL_GPL(kernfs_put);
 
-static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
-{
-	struct kernfs_node *kn;
-
-	if (flags & LOOKUP_RCU)
-		return -ECHILD;
-
-	/* Always perform fresh lookup for negatives */
-	if (d_really_is_negative(dentry))
-		goto out_bad_unlocked;
-
-	kn = kernfs_dentry_node(dentry);
-	mutex_lock(&kernfs_mutex);
-
-	/* The kernfs node has been deactivated */
-	if (!kernfs_active(kn))
-		goto out_bad;
-
-	/* The kernfs node has been moved? */
-	if (kernfs_dentry_node(dentry->d_parent) != kn->parent)
-		goto out_bad;
-
-	/* The kernfs node has been renamed */
-	if (strcmp(dentry->d_name.name, kn->name) != 0)
-		goto out_bad;
-
-	/* The kernfs node has been moved to a different namespace */
-	if (kn->parent && kernfs_ns_enabled(kn->parent) &&
-	    kernfs_info(dentry->d_sb)->ns != kn->ns)
-		goto out_bad;
-
-	mutex_unlock(&kernfs_mutex);
-	return 1;
-out_bad:
-	mutex_unlock(&kernfs_mutex);
-out_bad_unlocked:
-	return 0;
-}
-
-const struct dentry_operations kernfs_dops = {
-	.d_revalidate	= kernfs_dop_revalidate,
-};
-
 /**
  * kernfs_node_from_dentry - determine kernfs_node associated with a dentry
  * @dentry: the dentry in question
@@ -1073,6 +1030,49 @@ struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
 	return ERR_PTR(rc);
 }
 
+static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
+{
+	struct kernfs_node *kn;
+
+	if (flags & LOOKUP_RCU)
+		return -ECHILD;
+
+	/* Always perform fresh lookup for negatives */
+	if (d_really_is_negative(dentry))
+		goto out_bad_unlocked;
+
+	kn = kernfs_dentry_node(dentry);
+	mutex_lock(&kernfs_mutex);
+
+	/* The kernfs node has been deactivated */
+	if (!kernfs_active(kn))
+		goto out_bad;
+
+	/* The kernfs node has been moved? */
+	if (kernfs_dentry_node(dentry->d_parent) != kn->parent)
+		goto out_bad;
+
+	/* The kernfs node has been renamed */
+	if (strcmp(dentry->d_name.name, kn->name) != 0)
+		goto out_bad;
+
+	/* The kernfs node has been moved to a different namespace */
+	if (kn->parent && kernfs_ns_enabled(kn->parent) &&
+	    kernfs_info(dentry->d_sb)->ns != kn->ns)
+		goto out_bad;
+
+	mutex_unlock(&kernfs_mutex);
+	return 1;
+out_bad:
+	mutex_unlock(&kernfs_mutex);
+out_bad_unlocked:
+	return 0;
+}
+
+const struct dentry_operations kernfs_dops = {
+	.d_revalidate	= kernfs_dop_revalidate,
+};
+
 static struct dentry *kernfs_iop_lookup(struct inode *dir,
 					struct dentry *dentry,
 					unsigned int flags)




* [PATCH v3 2/4] kernfs: use VFS negative dentry caching
  2021-04-09  1:14 [PATCH v3 0/4] kernfs: proposed locking and concurrency improvement Ian Kent
  2021-04-09  1:14 ` [PATCH v3 1/4] kernfs: move revalidate to be near lookup Ian Kent
@ 2021-04-09  1:15 ` Ian Kent
  2021-04-09  1:35   ` Al Viro
  2021-04-09  1:15 ` [PATCH v3 3/4] kernfs: switch kernfs to use an rwsem Ian Kent
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 10+ messages in thread
From: Ian Kent @ 2021-04-09  1:15 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Tejun Heo
  Cc: Brice Goglin, Fox Chen, Rick Lindsley, Al Viro, Miklos Szeredi,
	David Howells, Eric Sandeen, Kernel Mailing List, linux-fsdevel

If there are many lookups for non-existent paths, these negative
lookups can add a lot of overhead during path walks.

The VFS allows dentries to be created as negative and hashed, and caches
them so they can be used to avoid the fairly high-overhead alloc/free
cycle that these lookups otherwise incur.
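
For context, the VFS behaviour relied on is that d_splice_alias() with a
NULL inode instantiates the dentry as negative and hashes it, so later
walks can be answered from the dcache. A minimal sketch of that pattern
(illustrative only; the example_* helpers are made up, the real kernfs
change is in the diff below):

/* Sketch: a ->lookup() that leaves a hashed negative dentry behind
 * when the name doesn't exist, instead of returning NULL and letting
 * the unhashed dentry be freed and re-allocated on every miss.
 */
static struct dentry *example_iop_lookup(struct inode *dir,
					 struct dentry *dentry,
					 unsigned int flags)
{
	struct inode *inode = NULL;

	if (example_name_exists(dir, &dentry->d_name))	/* made-up helper */
		inode = example_get_inode(dir, dentry);	/* made-up helper */

	/* NULL inode -> dentry is instantiated negative and hashed;
	 * a valid inode instantiates it as usual.
	 */
	return d_splice_alias(inode, dentry);
}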

Signed-off-by: Ian Kent <raven@themaw.net>
---
 fs/kernfs/dir.c |   55 +++++++++++++++++++++++++++++++++----------------------
 1 file changed, 33 insertions(+), 22 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 4c69e2af82dac..edfeee1bf38ec 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1037,12 +1037,33 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
 	if (flags & LOOKUP_RCU)
 		return -ECHILD;
 
-	/* Always perform fresh lookup for negatives */
-	if (d_really_is_negative(dentry))
-		goto out_bad_unlocked;
+	mutex_lock(&kernfs_mutex);
 
 	kn = kernfs_dentry_node(dentry);
-	mutex_lock(&kernfs_mutex);
+
+	/* Negative hashed dentry? */
+	if (!kn) {
+		struct kernfs_node *parent;
+
+		/* If the kernfs node can be found this is a stale negative
+		 * hashed dentry so it must be discarded and the lookup redone.
+		 */
+		parent = kernfs_dentry_node(dentry->d_parent);
+		if (parent) {
+			const void *ns = NULL;
+
+			if (kernfs_ns_enabled(parent))
+				ns = kernfs_info(dentry->d_parent->d_sb)->ns;
+			kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
+			if (kn)
+				goto out_bad;
+		}
+
+		/* The kernfs node doesn't exist, leave the dentry negative
+		 * and return success.
+		 */
+		goto out;
+	}
 
 	/* The kernfs node has been deactivated */
 	if (!kernfs_active(kn))
@@ -1060,12 +1081,11 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
 	if (kn->parent && kernfs_ns_enabled(kn->parent) &&
 	    kernfs_info(dentry->d_sb)->ns != kn->ns)
 		goto out_bad;
-
+out:
 	mutex_unlock(&kernfs_mutex);
 	return 1;
 out_bad:
 	mutex_unlock(&kernfs_mutex);
-out_bad_unlocked:
 	return 0;
 }
 
@@ -1080,33 +1100,24 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
 	struct dentry *ret;
 	struct kernfs_node *parent = dir->i_private;
 	struct kernfs_node *kn;
-	struct inode *inode;
+	struct inode *inode = NULL;
 	const void *ns = NULL;
 
 	mutex_lock(&kernfs_mutex);
-
 	if (kernfs_ns_enabled(parent))
 		ns = kernfs_info(dir->i_sb)->ns;
 
 	kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
-
-	/* no such entry */
-	if (!kn || !kernfs_active(kn)) {
-		ret = NULL;
-		goto out_unlock;
-	}
-
 	/* attach dentry and inode */
-	inode = kernfs_get_inode(dir->i_sb, kn);
-	if (!inode) {
-		ret = ERR_PTR(-ENOMEM);
-		goto out_unlock;
+	if (kn && kernfs_active(kn)) {
+		inode = kernfs_get_inode(dir->i_sb, kn);
+		if (!inode)
+			inode = ERR_PTR(-ENOMEM);
 	}
-
-	/* instantiate and hash dentry */
+	/* instantiate and hash (possibly negative) dentry */
 	ret = d_splice_alias(inode, dentry);
- out_unlock:
 	mutex_unlock(&kernfs_mutex);
+
 	return ret;
 }
 




* [PATCH v3 3/4] kernfs: switch kernfs to use an rwsem
  2021-04-09  1:14 [PATCH v3 0/4] kernfs: proposed locking and concurrency improvement Ian Kent
  2021-04-09  1:14 ` [PATCH v3 1/4] kernfs: move revalidate to be near lookup Ian Kent
  2021-04-09  1:15 ` [PATCH v3 2/4] kernfs: use VFS negative dentry caching Ian Kent
@ 2021-04-09  1:15 ` Ian Kent
  2021-04-09  1:15 ` [PATCH v3 4/4] kernfs: use i_lock to protect concurrent inode updates Ian Kent
  2021-04-19  7:56 ` [PATCH v3 0/4] kernfs: proposed locking and concurrency improvement Fox Chen
  4 siblings, 0 replies; 10+ messages in thread
From: Ian Kent @ 2021-04-09  1:15 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Tejun Heo
  Cc: Brice Goglin, Fox Chen, Rick Lindsley, Al Viro, Miklos Szeredi,
	David Howells, Eric Sandeen, Kernel Mailing List, linux-fsdevel

The kernfs global lock restricts the ability to perform kernfs node
lookup operations in parallel during path walks.

Change the kernfs mutex to an rwsem so that, when the opportunity
arises, node searches can be done in parallel with path walk lookups.
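
In other words, read-only users of the node tree (lookup, revalidate,
readdir) can hold the lock shared while modifications keep it exclusive.
A minimal sketch of that pattern, using an example semaphore (the real
conversion is in the diff below):

static DECLARE_RWSEM(example_rwsem);	/* stand-in for kernfs_rwsem */

/* Read side: several path walks may search the tree concurrently. */
static void example_lookup(void)
{
	down_read(&example_rwsem);
	/* ... rb tree search, take references ... */
	up_read(&example_rwsem);
}

/* Write side: node creation/removal/rename still excludes everyone. */
static void example_modify(void)
{
	down_write(&example_rwsem);
	/* ... link/unlink nodes, update parent timestamps ... */
	up_write(&example_rwsem);
}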

Signed-off-by: Ian Kent <raven@themaw.net>
---
 fs/kernfs/dir.c             |  117 ++++++++++++++++++++++++-------------------
 fs/kernfs/file.c            |    4 +
 fs/kernfs/inode.c           |   16 +++---
 fs/kernfs/kernfs-internal.h |    5 +-
 fs/kernfs/mount.c           |   12 ++--
 fs/kernfs/symlink.c         |    4 +
 include/linux/kernfs.h      |    2 -
 7 files changed, 86 insertions(+), 74 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index edfeee1bf38ec..9bea235f2ec66 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -17,7 +17,7 @@
 
 #include "kernfs-internal.h"
 
-DEFINE_MUTEX(kernfs_mutex);
+DECLARE_RWSEM(kernfs_rwsem);
 static DEFINE_SPINLOCK(kernfs_rename_lock);	/* kn->parent and ->name */
 static char kernfs_pr_cont_buf[PATH_MAX];	/* protected by rename_lock */
 static DEFINE_SPINLOCK(kernfs_idr_lock);	/* root->ino_idr */
@@ -26,10 +26,21 @@ static DEFINE_SPINLOCK(kernfs_idr_lock);	/* root->ino_idr */
 
 static bool kernfs_active(struct kernfs_node *kn)
 {
-	lockdep_assert_held(&kernfs_mutex);
 	return atomic_read(&kn->active) >= 0;
 }
 
+static bool kernfs_active_write(struct kernfs_node *kn)
+{
+	lockdep_assert_held_write(&kernfs_rwsem);
+	return kernfs_active(kn);
+}
+
+static bool kernfs_active_read(struct kernfs_node *kn)
+{
+	lockdep_assert_held_read(&kernfs_rwsem);
+	return kernfs_active(kn);
+}
+
 static bool kernfs_lockdep(struct kernfs_node *kn)
 {
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
@@ -340,7 +351,7 @@ static int kernfs_sd_compare(const struct kernfs_node *left,
  *	@kn->parent->dir.children.
  *
  *	Locking:
- *	mutex_lock(kernfs_mutex)
+ *	kernfs_rwsem held exclusive
  *
  *	RETURNS:
  *	0 on susccess -EEXIST on failure.
@@ -385,7 +396,7 @@ static int kernfs_link_sibling(struct kernfs_node *kn)
  *	removed, %false if @kn wasn't on the rbtree.
  *
  *	Locking:
- *	mutex_lock(kernfs_mutex)
+ *	kernfs_rwsem held exclusive
  */
 static bool kernfs_unlink_sibling(struct kernfs_node *kn)
 {
@@ -455,14 +466,14 @@ void kernfs_put_active(struct kernfs_node *kn)
  * return after draining is complete.
  */
 static void kernfs_drain(struct kernfs_node *kn)
-	__releases(&kernfs_mutex) __acquires(&kernfs_mutex)
+	__releases(&kernfs_rwsem) __acquires(&kernfs_rwsem)
 {
 	struct kernfs_root *root = kernfs_root(kn);
 
-	lockdep_assert_held(&kernfs_mutex);
+	lockdep_assert_held_write(&kernfs_rwsem);
 	WARN_ON_ONCE(kernfs_active(kn));
 
-	mutex_unlock(&kernfs_mutex);
+	up_write(&kernfs_rwsem);
 
 	if (kernfs_lockdep(kn)) {
 		rwsem_acquire(&kn->dep_map, 0, 0, _RET_IP_);
@@ -481,7 +492,7 @@ static void kernfs_drain(struct kernfs_node *kn)
 
 	kernfs_drain_open_files(kn);
 
-	mutex_lock(&kernfs_mutex);
+	down_write(&kernfs_rwsem);
 }
 
 /**
@@ -720,7 +731,7 @@ int kernfs_add_one(struct kernfs_node *kn)
 	bool has_ns;
 	int ret;
 
-	mutex_lock(&kernfs_mutex);
+	down_write(&kernfs_rwsem);
 
 	ret = -EINVAL;
 	has_ns = kernfs_ns_enabled(parent);
@@ -735,7 +746,7 @@ int kernfs_add_one(struct kernfs_node *kn)
 	if (parent->flags & KERNFS_EMPTY_DIR)
 		goto out_unlock;
 
-	if ((parent->flags & KERNFS_ACTIVATED) && !kernfs_active(parent))
+	if ((parent->flags & KERNFS_ACTIVATED) && !kernfs_active_write(parent))
 		goto out_unlock;
 
 	kn->hash = kernfs_name_hash(kn->name, kn->ns);
@@ -751,7 +762,7 @@ int kernfs_add_one(struct kernfs_node *kn)
 		ps_iattr->ia_mtime = ps_iattr->ia_ctime;
 	}
 
-	mutex_unlock(&kernfs_mutex);
+	up_write(&kernfs_rwsem);
 
 	/*
 	 * Activate the new node unless CREATE_DEACTIVATED is requested.
@@ -765,7 +776,7 @@ int kernfs_add_one(struct kernfs_node *kn)
 	return 0;
 
 out_unlock:
-	mutex_unlock(&kernfs_mutex);
+	up_write(&kernfs_rwsem);
 	return ret;
 }
 
@@ -786,7 +797,7 @@ static struct kernfs_node *kernfs_find_ns(struct kernfs_node *parent,
 	bool has_ns = kernfs_ns_enabled(parent);
 	unsigned int hash;
 
-	lockdep_assert_held(&kernfs_mutex);
+	lockdep_assert_held(&kernfs_rwsem);
 
 	if (has_ns != (bool)ns) {
 		WARN(1, KERN_WARNING "kernfs: ns %s in '%s' for '%s'\n",
@@ -818,7 +829,7 @@ static struct kernfs_node *kernfs_walk_ns(struct kernfs_node *parent,
 	size_t len;
 	char *p, *name;
 
-	lockdep_assert_held(&kernfs_mutex);
+	lockdep_assert_held_read(&kernfs_rwsem);
 
 	/* grab kernfs_rename_lock to piggy back on kernfs_pr_cont_buf */
 	spin_lock_irq(&kernfs_rename_lock);
@@ -858,10 +869,10 @@ struct kernfs_node *kernfs_find_and_get_ns(struct kernfs_node *parent,
 {
 	struct kernfs_node *kn;
 
-	mutex_lock(&kernfs_mutex);
+	down_read(&kernfs_rwsem);
 	kn = kernfs_find_ns(parent, name, ns);
 	kernfs_get(kn);
-	mutex_unlock(&kernfs_mutex);
+	up_read(&kernfs_rwsem);
 
 	return kn;
 }
@@ -882,10 +893,10 @@ struct kernfs_node *kernfs_walk_and_get_ns(struct kernfs_node *parent,
 {
 	struct kernfs_node *kn;
 
-	mutex_lock(&kernfs_mutex);
+	down_read(&kernfs_rwsem);
 	kn = kernfs_walk_ns(parent, path, ns);
 	kernfs_get(kn);
-	mutex_unlock(&kernfs_mutex);
+	up_read(&kernfs_rwsem);
 
 	return kn;
 }
@@ -1037,7 +1048,7 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
 	if (flags & LOOKUP_RCU)
 		return -ECHILD;
 
-	mutex_lock(&kernfs_mutex);
+	down_read(&kernfs_rwsem);
 
 	kn = kernfs_dentry_node(dentry);
 
@@ -1082,10 +1093,10 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
 	    kernfs_info(dentry->d_sb)->ns != kn->ns)
 		goto out_bad;
 out:
-	mutex_unlock(&kernfs_mutex);
+	up_read(&kernfs_rwsem);
 	return 1;
 out_bad:
-	mutex_unlock(&kernfs_mutex);
+	up_read(&kernfs_rwsem);
 	return 0;
 }
 
@@ -1103,7 +1114,7 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
 	struct inode *inode = NULL;
 	const void *ns = NULL;
 
-	mutex_lock(&kernfs_mutex);
+	down_read(&kernfs_rwsem);
 	if (kernfs_ns_enabled(parent))
 		ns = kernfs_info(dir->i_sb)->ns;
 
@@ -1116,7 +1127,7 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
 	}
 	/* instantiate and hash (possibly negative) dentry */
 	ret = d_splice_alias(inode, dentry);
-	mutex_unlock(&kernfs_mutex);
+	up_read(&kernfs_rwsem);
 
 	return ret;
 }
@@ -1238,7 +1249,7 @@ static struct kernfs_node *kernfs_next_descendant_post(struct kernfs_node *pos,
 {
 	struct rb_node *rbn;
 
-	lockdep_assert_held(&kernfs_mutex);
+	lockdep_assert_held_write(&kernfs_rwsem);
 
 	/* if first iteration, visit leftmost descendant which may be root */
 	if (!pos)
@@ -1274,7 +1285,7 @@ void kernfs_activate(struct kernfs_node *kn)
 {
 	struct kernfs_node *pos;
 
-	mutex_lock(&kernfs_mutex);
+	down_write(&kernfs_rwsem);
 
 	pos = NULL;
 	while ((pos = kernfs_next_descendant_post(pos, kn))) {
@@ -1288,14 +1299,14 @@ void kernfs_activate(struct kernfs_node *kn)
 		pos->flags |= KERNFS_ACTIVATED;
 	}
 
-	mutex_unlock(&kernfs_mutex);
+	up_write(&kernfs_rwsem);
 }
 
 static void __kernfs_remove(struct kernfs_node *kn)
 {
 	struct kernfs_node *pos;
 
-	lockdep_assert_held(&kernfs_mutex);
+	lockdep_assert_held_write(&kernfs_rwsem);
 
 	/*
 	 * Short-circuit if non-root @kn has already finished removal.
@@ -1310,7 +1321,7 @@ static void __kernfs_remove(struct kernfs_node *kn)
 	/* prevent any new usage under @kn by deactivating all nodes */
 	pos = NULL;
 	while ((pos = kernfs_next_descendant_post(pos, kn)))
-		if (kernfs_active(pos))
+		if (kernfs_active_write(pos))
 			atomic_add(KN_DEACTIVATED_BIAS, &pos->active);
 
 	/* deactivate and unlink the subtree node-by-node */
@@ -1318,7 +1329,7 @@ static void __kernfs_remove(struct kernfs_node *kn)
 		pos = kernfs_leftmost_descendant(kn);
 
 		/*
-		 * kernfs_drain() drops kernfs_mutex temporarily and @pos's
+		 * kernfs_drain() drops kernfs_rwsem temporarily and @pos's
 		 * base ref could have been put by someone else by the time
 		 * the function returns.  Make sure it doesn't go away
 		 * underneath us.
@@ -1365,9 +1376,9 @@ static void __kernfs_remove(struct kernfs_node *kn)
  */
 void kernfs_remove(struct kernfs_node *kn)
 {
-	mutex_lock(&kernfs_mutex);
+	down_write(&kernfs_rwsem);
 	__kernfs_remove(kn);
-	mutex_unlock(&kernfs_mutex);
+	up_write(&kernfs_rwsem);
 }
 
 /**
@@ -1454,17 +1465,17 @@ bool kernfs_remove_self(struct kernfs_node *kn)
 {
 	bool ret;
 
-	mutex_lock(&kernfs_mutex);
+	down_write(&kernfs_rwsem);
 	kernfs_break_active_protection(kn);
 
 	/*
 	 * SUICIDAL is used to arbitrate among competing invocations.  Only
 	 * the first one will actually perform removal.  When the removal
 	 * is complete, SUICIDED is set and the active ref is restored
-	 * while holding kernfs_mutex.  The ones which lost arbitration
-	 * waits for SUICDED && drained which can happen only after the
-	 * enclosing kernfs operation which executed the winning instance
-	 * of kernfs_remove_self() finished.
+	 * while kernfs_rwsem is held exclusive.  The ones which lost
+	 * arbitration wait for SUICIDED && drained which can happen only
+	 * after the enclosing kernfs operation which executed the winning
+	 * instance of kernfs_remove_self() has finished.
 	 */
 	if (!(kn->flags & KERNFS_SUICIDAL)) {
 		kn->flags |= KERNFS_SUICIDAL;
@@ -1482,9 +1493,9 @@ bool kernfs_remove_self(struct kernfs_node *kn)
 			    atomic_read(&kn->active) == KN_DEACTIVATED_BIAS)
 				break;
 
-			mutex_unlock(&kernfs_mutex);
+			up_write(&kernfs_rwsem);
 			schedule();
-			mutex_lock(&kernfs_mutex);
+			down_write(&kernfs_rwsem);
 		}
 		finish_wait(waitq, &wait);
 		WARN_ON_ONCE(!RB_EMPTY_NODE(&kn->rb));
@@ -1492,12 +1503,12 @@ bool kernfs_remove_self(struct kernfs_node *kn)
 	}
 
 	/*
-	 * This must be done while holding kernfs_mutex; otherwise, waiting
-	 * for SUICIDED && deactivated could finish prematurely.
+	 * This must be done while kernfs_rwsem is held exclusive; otherwise,
+	 * waiting for SUICIDED && deactivated could finish prematurely.
 	 */
 	kernfs_unbreak_active_protection(kn);
 
-	mutex_unlock(&kernfs_mutex);
+	up_write(&kernfs_rwsem);
 	return ret;
 }
 
@@ -1521,13 +1532,13 @@ int kernfs_remove_by_name_ns(struct kernfs_node *parent, const char *name,
 		return -ENOENT;
 	}
 
-	mutex_lock(&kernfs_mutex);
+	down_write(&kernfs_rwsem);
 
 	kn = kernfs_find_ns(parent, name, ns);
 	if (kn)
 		__kernfs_remove(kn);
 
-	mutex_unlock(&kernfs_mutex);
+	up_write(&kernfs_rwsem);
 
 	if (kn)
 		return 0;
@@ -1553,10 +1564,10 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
 	if (!kn->parent)
 		return -EINVAL;
 
-	mutex_lock(&kernfs_mutex);
+	down_write(&kernfs_rwsem);
 
 	error = -ENOENT;
-	if (!kernfs_active(kn) || !kernfs_active(new_parent) ||
+	if (!kernfs_active_write(kn) || !kernfs_active_write(new_parent) ||
 	    (new_parent->flags & KERNFS_EMPTY_DIR))
 		goto out;
 
@@ -1607,7 +1618,7 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
 
 	error = 0;
  out:
-	mutex_unlock(&kernfs_mutex);
+	up_write(&kernfs_rwsem);
 	return error;
 }
 
@@ -1627,7 +1638,7 @@ static struct kernfs_node *kernfs_dir_pos(const void *ns,
 	struct kernfs_node *parent, loff_t hash, struct kernfs_node *pos)
 {
 	if (pos) {
-		int valid = kernfs_active(pos) &&
+		int valid = kernfs_active_read(pos) &&
 			pos->parent == parent && hash == pos->hash;
 		kernfs_put(pos);
 		if (!valid)
@@ -1647,7 +1658,7 @@ static struct kernfs_node *kernfs_dir_pos(const void *ns,
 		}
 	}
 	/* Skip over entries which are dying/dead or in the wrong namespace */
-	while (pos && (!kernfs_active(pos) || pos->ns != ns)) {
+	while (pos && (!kernfs_active_read(pos) || pos->ns != ns)) {
 		struct rb_node *node = rb_next(&pos->rb);
 		if (!node)
 			pos = NULL;
@@ -1668,7 +1679,7 @@ static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
 				pos = NULL;
 			else
 				pos = rb_to_kn(node);
-		} while (pos && (!kernfs_active(pos) || pos->ns != ns));
+		} while (pos && (!kernfs_active_read(pos) || pos->ns != ns));
 	}
 	return pos;
 }
@@ -1682,7 +1693,7 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
 
 	if (!dir_emit_dots(file, ctx))
 		return 0;
-	mutex_lock(&kernfs_mutex);
+	down_read(&kernfs_rwsem);
 
 	if (kernfs_ns_enabled(parent))
 		ns = kernfs_info(dentry->d_sb)->ns;
@@ -1699,12 +1710,12 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
 		file->private_data = pos;
 		kernfs_get(pos);
 
-		mutex_unlock(&kernfs_mutex);
+		up_read(&kernfs_rwsem);
 		if (!dir_emit(ctx, name, len, ino, type))
 			return 0;
-		mutex_lock(&kernfs_mutex);
+		down_read(&kernfs_rwsem);
 	}
-	mutex_unlock(&kernfs_mutex);
+	up_read(&kernfs_rwsem);
 	file->private_data = NULL;
 	ctx->pos = INT_MAX;
 	return 0;
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index c757193121475..60e2a86c535eb 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -860,7 +860,7 @@ static void kernfs_notify_workfn(struct work_struct *work)
 	spin_unlock_irq(&kernfs_notify_lock);
 
 	/* kick fsnotify */
-	mutex_lock(&kernfs_mutex);
+	down_write(&kernfs_rwsem);
 
 	list_for_each_entry(info, &kernfs_root(kn)->supers, node) {
 		struct kernfs_node *parent;
@@ -898,7 +898,7 @@ static void kernfs_notify_workfn(struct work_struct *work)
 		iput(inode);
 	}
 
-	mutex_unlock(&kernfs_mutex);
+	up_write(&kernfs_rwsem);
 	kernfs_put(kn);
 	goto repeat;
 }
diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index d73950fc3d57d..3b01e9e61f14e 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -106,9 +106,9 @@ int kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr)
 {
 	int ret;
 
-	mutex_lock(&kernfs_mutex);
+	down_write(&kernfs_rwsem);
 	ret = __kernfs_setattr(kn, iattr);
-	mutex_unlock(&kernfs_mutex);
+	up_write(&kernfs_rwsem);
 	return ret;
 }
 
@@ -122,7 +122,7 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
 	if (!kn)
 		return -EINVAL;
 
-	mutex_lock(&kernfs_mutex);
+	down_write(&kernfs_rwsem);
 	error = setattr_prepare(&init_user_ns, dentry, iattr);
 	if (error)
 		goto out;
@@ -135,7 +135,7 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
 	setattr_copy(&init_user_ns, inode, iattr);
 
 out:
-	mutex_unlock(&kernfs_mutex);
+	up_write(&kernfs_rwsem);
 	return error;
 }
 
@@ -191,9 +191,9 @@ int kernfs_iop_getattr(struct user_namespace *mnt_userns,
 	struct inode *inode = d_inode(path->dentry);
 	struct kernfs_node *kn = inode->i_private;
 
-	mutex_lock(&kernfs_mutex);
+	down_write(&kernfs_rwsem);
 	kernfs_refresh_inode(kn, inode);
-	mutex_unlock(&kernfs_mutex);
+	up_write(&kernfs_rwsem);
 
 	generic_fillattr(&init_user_ns, inode, stat);
 	return 0;
@@ -284,9 +284,9 @@ int kernfs_iop_permission(struct user_namespace *mnt_userns,
 
 	kn = inode->i_private;
 
-	mutex_lock(&kernfs_mutex);
+	down_write(&kernfs_rwsem);
 	kernfs_refresh_inode(kn, inode);
-	mutex_unlock(&kernfs_mutex);
+	up_write(&kernfs_rwsem);
 
 	return generic_permission(&init_user_ns, inode, mask);
 }
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index ccc3b44f6306f..cbd4789fac0f5 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -13,6 +13,7 @@
 #include <linux/lockdep.h>
 #include <linux/fs.h>
 #include <linux/mutex.h>
+#include <linux/rwsem.h>
 #include <linux/xattr.h>
 
 #include <linux/kernfs.h>
@@ -69,7 +70,7 @@ struct kernfs_super_info {
 	 */
 	const void		*ns;
 
-	/* anchored at kernfs_root->supers, protected by kernfs_mutex */
+	/* anchored at kernfs_root->supers, protected by kernfs_rwsem */
 	struct list_head	node;
 };
 #define kernfs_info(SB) ((struct kernfs_super_info *)(SB->s_fs_info))
@@ -102,7 +103,7 @@ int __kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr);
 /*
  * dir.c
  */
-extern struct mutex kernfs_mutex;
+extern struct rw_semaphore kernfs_rwsem;
 extern const struct dentry_operations kernfs_dops;
 extern const struct file_operations kernfs_dir_fops;
 extern const struct inode_operations kernfs_dir_iops;
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 9dc7e7a64e10f..baa4155ba2edf 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -255,9 +255,9 @@ static int kernfs_fill_super(struct super_block *sb, struct kernfs_fs_context *k
 	sb->s_shrink.seeks = 0;
 
 	/* get root inode, initialize and unlock it */
-	mutex_lock(&kernfs_mutex);
+	down_write(&kernfs_rwsem);
 	inode = kernfs_get_inode(sb, info->root->kn);
-	mutex_unlock(&kernfs_mutex);
+	up_write(&kernfs_rwsem);
 	if (!inode) {
 		pr_debug("kernfs: could not get root inode\n");
 		return -ENOMEM;
@@ -344,9 +344,9 @@ int kernfs_get_tree(struct fs_context *fc)
 		}
 		sb->s_flags |= SB_ACTIVE;
 
-		mutex_lock(&kernfs_mutex);
+		down_write(&kernfs_rwsem);
 		list_add(&info->node, &info->root->supers);
-		mutex_unlock(&kernfs_mutex);
+		up_write(&kernfs_rwsem);
 	}
 
 	fc->root = dget(sb->s_root);
@@ -372,9 +372,9 @@ void kernfs_kill_sb(struct super_block *sb)
 {
 	struct kernfs_super_info *info = kernfs_info(sb);
 
-	mutex_lock(&kernfs_mutex);
+	down_write(&kernfs_rwsem);
 	list_del(&info->node);
-	mutex_unlock(&kernfs_mutex);
+	up_write(&kernfs_rwsem);
 
 	/*
 	 * Remove the superblock from fs_supers/s_instances
diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
index 5432883d819f2..c8f8e41b84110 100644
--- a/fs/kernfs/symlink.c
+++ b/fs/kernfs/symlink.c
@@ -116,9 +116,9 @@ static int kernfs_getlink(struct inode *inode, char *path)
 	struct kernfs_node *target = kn->symlink.target_kn;
 	int error;
 
-	mutex_lock(&kernfs_mutex);
+	down_read(&kernfs_rwsem);
 	error = kernfs_get_target_path(parent, target, path);
-	mutex_unlock(&kernfs_mutex);
+	up_read(&kernfs_rwsem);
 
 	return error;
 }
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 9e8ca8743c268..1adb6f0c5f836 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -188,7 +188,7 @@ struct kernfs_root {
 	u32			id_highbits;
 	struct kernfs_syscall_ops *syscall_ops;
 
-	/* list of kernfs_super_info of this root, protected by kernfs_mutex */
+	/* list of kernfs_super_info of this root, protected by kernfs_rwsem */
 	struct list_head	supers;
 
 	wait_queue_head_t	deactivate_waitq;




* [PATCH v3 4/4] kernfs: use i_lock to protect concurrent inode updates
  2021-04-09  1:14 [PATCH v3 0/4] kernfs: proposed locking and concurrency improvement Ian Kent
                   ` (2 preceding siblings ...)
  2021-04-09  1:15 ` [PATCH v3 3/4] kernfs: switch kernfs to use an rwsem Ian Kent
@ 2021-04-09  1:15 ` Ian Kent
  2021-04-19  7:56 ` [PATCH v3 0/4] kernfs: proposed locking and concurrency improvement Fox Chen
  4 siblings, 0 replies; 10+ messages in thread
From: Ian Kent @ 2021-04-09  1:15 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Tejun Heo
  Cc: Brice Goglin, Fox Chen, Rick Lindsley, Al Viro, Miklos Szeredi,
	David Howells, Eric Sandeen, Kernel Mailing List, linux-fsdevel

The inode operations .permission() and .getattr() currently take the
kernfs node write lock, but all that's needed is to keep the rb tree
stable while the inode attributes are updated and to protect the update
itself against concurrent changes.

Also, .permission() is called frequently during path walks and can
cause quite a bit of contention between kernfs node operations and path
walks when the number of concurrent walks is high.

To change kernfs_iop_getattr() and kernfs_iop_permission() to take
the rwsem read lock instead of the write lock, an additional lock is
needed to protect against multiple processes concurrently updating
the inode attributes and link count in kernfs_refresh_inode().

The inode i_lock seems like the sensible thing to use to protect these
inode attribute updates, so use it in kernfs_refresh_inode().
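
The resulting shape is roughly the sketch below (a simplified
illustration, not the literal change, which follows in the diff): the
rwsem held shared keeps the kernfs node stable, and i_lock serialises
the copy into the inode.

/* Sketch of the intended locking: shared kernfs_rwsem for stability
 * of @kn, i_lock for the inode fields being written.
 */
static void example_refresh_inode(struct kernfs_node *kn, struct inode *inode)
{
	down_read(&kernfs_rwsem);

	spin_lock(&inode->i_lock);
	inode->i_mode = kn->mode;
	/* ... copy uid/gid/times from kn->iattr, update nlink ... */
	spin_unlock(&inode->i_lock);

	up_read(&kernfs_rwsem);
}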

Signed-off-by: Ian Kent <raven@themaw.net>
---
 fs/kernfs/inode.c |   10 ++++++----
 fs/kernfs/mount.c |    4 ++--
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index 3b01e9e61f14e..6728ecd81eb37 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -172,6 +172,7 @@ static void kernfs_refresh_inode(struct kernfs_node *kn, struct inode *inode)
 {
 	struct kernfs_iattrs *attrs = kn->iattr;
 
+	spin_lock(&inode->i_lock);
 	inode->i_mode = kn->mode;
 	if (attrs)
 		/*
@@ -182,6 +183,7 @@ static void kernfs_refresh_inode(struct kernfs_node *kn, struct inode *inode)
 
 	if (kernfs_type(kn) == KERNFS_DIR)
 		set_nlink(inode, kn->dir.subdirs + 2);
+	spin_unlock(&inode->i_lock);
 }
 
 int kernfs_iop_getattr(struct user_namespace *mnt_userns,
@@ -191,9 +193,9 @@ int kernfs_iop_getattr(struct user_namespace *mnt_userns,
 	struct inode *inode = d_inode(path->dentry);
 	struct kernfs_node *kn = inode->i_private;
 
-	down_write(&kernfs_rwsem);
+	down_read(&kernfs_rwsem);
 	kernfs_refresh_inode(kn, inode);
-	up_write(&kernfs_rwsem);
+	up_read(&kernfs_rwsem);
 
 	generic_fillattr(&init_user_ns, inode, stat);
 	return 0;
@@ -284,9 +286,9 @@ int kernfs_iop_permission(struct user_namespace *mnt_userns,
 
 	kn = inode->i_private;
 
-	down_write(&kernfs_rwsem);
+	down_read(&kernfs_rwsem);
 	kernfs_refresh_inode(kn, inode);
-	up_write(&kernfs_rwsem);
+	up_read(&kernfs_rwsem);
 
 	return generic_permission(&init_user_ns, inode, mask);
 }
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index baa4155ba2edf..f2f909d09f522 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -255,9 +255,9 @@ static int kernfs_fill_super(struct super_block *sb, struct kernfs_fs_context *k
 	sb->s_shrink.seeks = 0;
 
 	/* get root inode, initialize and unlock it */
-	down_write(&kernfs_rwsem);
+	down_read(&kernfs_rwsem);
 	inode = kernfs_get_inode(sb, info->root->kn);
-	up_write(&kernfs_rwsem);
+	up_read(&kernfs_rwsem);
 	if (!inode) {
 		pr_debug("kernfs: could not get root inode\n");
 		return -ENOMEM;




* Re: [PATCH v3 2/4] kernfs: use VFS negative dentry caching
  2021-04-09  1:15 ` [PATCH v3 2/4] kernfs: use VFS negative dentry caching Ian Kent
@ 2021-04-09  1:35   ` Al Viro
  2021-04-09  8:26     ` Ian Kent
  0 siblings, 1 reply; 10+ messages in thread
From: Al Viro @ 2021-04-09  1:35 UTC (permalink / raw)
  To: Ian Kent
  Cc: Greg Kroah-Hartman, Tejun Heo, Brice Goglin, Fox Chen,
	Rick Lindsley, Miklos Szeredi, David Howells, Eric Sandeen,
	Kernel Mailing List, linux-fsdevel

On Fri, Apr 09, 2021 at 09:15:06AM +0800, Ian Kent wrote:
> +		parent = kernfs_dentry_node(dentry->d_parent);
> +		if (parent) {
> +			const void *ns = NULL;
> +
> +			if (kernfs_ns_enabled(parent))
> +				ns = kernfs_info(dentry->d_parent->d_sb)->ns;

	For any dentry d, we have d->d_parent->d_sb == d->d_sb.  All the time.
If you ever run into the case where that would not be true, you've found
a critical bug.

> +			kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
> +			if (kn)
> +				goto out_bad;
> +		}

Umm...  What's to prevent a race with successful rename(2)?  IOW, what's
there to stabilize ->d_parent and ->d_name while we are in that function?


* Re: [PATCH v3 2/4] kernfs: use VFS negative dentry caching
  2021-04-09  1:35   ` Al Viro
@ 2021-04-09  8:26     ` Ian Kent
  2021-04-09  9:34       ` Ian Kent
  0 siblings, 1 reply; 10+ messages in thread
From: Ian Kent @ 2021-04-09  8:26 UTC (permalink / raw)
  To: Al Viro
  Cc: Greg Kroah-Hartman, Tejun Heo, Brice Goglin, Fox Chen,
	Rick Lindsley, Miklos Szeredi, David Howells, Eric Sandeen,
	Kernel Mailing List, linux-fsdevel

On Fri, 2021-04-09 at 01:35 +0000, Al Viro wrote:
> On Fri, Apr 09, 2021 at 09:15:06AM +0800, Ian Kent wrote:
> > +		parent = kernfs_dentry_node(dentry->d_parent);
> > +		if (parent) {
> > +			const void *ns = NULL;
> > +
> > +			if (kernfs_ns_enabled(parent))
> > +				ns = kernfs_info(dentry->d_parent-
> > >d_sb)->ns;
> 
> 	For any dentry d, we have d->d_parent->d_sb == d->d_sb.  All
> the time.
> If you ever run into the case where that would not be true, you've
> found
> a critical bug.

Right, yes.

> 
> > +			kn = kernfs_find_ns(parent, dentry-
> > >d_name.name, ns);
> > +			if (kn)
> > +				goto out_bad;
> > +		}
> 
> Umm...  What's to prevent a race with successful rename(2)?  IOW,
> what's
> there to stabilize ->d_parent and ->d_name while we are in that
> function?

Indeed, glad you looked at this.

Now I'm wondering how kernfs_iop_rename() protects itself from
concurrent kernfs_rename_ns() ... 



* Re: [PATCH v3 2/4] kernfs: use VFS negative dentry caching
  2021-04-09  8:26     ` Ian Kent
@ 2021-04-09  9:34       ` Ian Kent
  0 siblings, 0 replies; 10+ messages in thread
From: Ian Kent @ 2021-04-09  9:34 UTC (permalink / raw)
  To: Al Viro
  Cc: Greg Kroah-Hartman, Tejun Heo, Brice Goglin, Fox Chen,
	Rick Lindsley, Miklos Szeredi, David Howells, Eric Sandeen,
	Kernel Mailing List, linux-fsdevel

On Fri, 2021-04-09 at 16:26 +0800, Ian Kent wrote:
> On Fri, 2021-04-09 at 01:35 +0000, Al Viro wrote:
> > On Fri, Apr 09, 2021 at 09:15:06AM +0800, Ian Kent wrote:
> > > +		parent = kernfs_dentry_node(dentry->d_parent);
> > > +		if (parent) {
> > > +			const void *ns = NULL;
> > > +
> > > +			if (kernfs_ns_enabled(parent))
> > > +				ns = kernfs_info(dentry->d_parent-
> > > > d_sb)->ns;
> > 
> > 	For any dentry d, we have d->d_parent->d_sb == d->d_sb.  All
> > the time.
> > If you ever run into the case where that would not be true, you've
> > found
> > a critical bug.
> 
> Right, yes.
> 
> > > +			kn = kernfs_find_ns(parent, dentry-
> > > > d_name.name, ns);
> > > +			if (kn)
> > > +				goto out_bad;
> > > +		}
> > 
> > Umm...  What's to prevent a race with successful rename(2)?  IOW,
> > what's
> > there to stabilize ->d_parent and ->d_name while we are in that
> > function?
> 
> Indeed, glad you looked at this.
> 
> Now I'm wondering how kerfs_iop_rename() protects itself from
> concurrent kernfs_rename_ns() ... 

As I thought ... I haven't done an exhaustive search but I can't find
any file system that doesn't call back into kernfs from
kernfs_syscall_ops (if provided at kernfs root creation).

I don't see anything that uses kernfs that defines a .rename() op,
but if there were one it would be expected to call back into kernfs,
at which point it would block on kernfs_mutex (kernfs_rwsem) until
it's released.
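
For illustration, such a (purely hypothetical) .rename op would look
something like the sketch below, and the kernfs_rename_ns() call is
where it would serialise on the kernfs lock:

/* Hypothetical kernfs user: its ->rename() is expected to call back
 * into kernfs_rename_ns(), which takes the kernfs lock exclusively.
 */
static int example_rename(struct kernfs_node *kn,
			  struct kernfs_node *new_parent,
			  const char *new_name)
{
	/* ... subsystem specific checks ... */
	return kernfs_rename_ns(kn, new_parent, new_name, NULL);
}

static struct kernfs_syscall_ops example_syscall_ops = {
	.rename		= example_rename,
};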

So I don't think there can be changes in this case due to the lock
taken just above the code you're questioning.

I need to think a bit about whether the dentry being negative (i.e.
not having a kernfs node) could allow bad things to happen ...

Or am I misunderstanding the race you're pointing out here?

Ian



* Re: [PATCH v3 0/4] kernfs: proposed locking and concurrency improvement
  2021-04-09  1:14 [PATCH v3 0/4] kernfs: proposed locking and concurrency improvement Ian Kent
                   ` (3 preceding siblings ...)
  2021-04-09  1:15 ` [PATCH v3 4/4] kernfs: use i_lock to protect concurrent inode updates Ian Kent
@ 2021-04-19  7:56 ` Fox Chen
  2021-04-19 12:25   ` Ian Kent
  4 siblings, 1 reply; 10+ messages in thread
From: Fox Chen @ 2021-04-19  7:56 UTC (permalink / raw)
  To: Ian Kent
  Cc: Greg Kroah-Hartman, Tejun Heo, Brice Goglin, Rick Lindsley,
	Al Viro, Miklos Szeredi, David Howells, Eric Sandeen,
	Kernel Mailing List, linux-fsdevel


On Fri, Apr 9, 2021 at 9:14 AM Ian Kent <raven@themaw.net> wrote:
>
> There have been a few instances of contention on the kernfs_mutex during
> path walks, a case on very large IBM systems seen by myself, a report by
> Brice Goglin and followed up by Fox Chen, and I've since seen a couple
> of other reports by CoreOS users.
>
> The common thread is a large number of kernfs path walks leading to
> slowness of path walks due to kernfs_mutex contention.
>
> The problem being that changes to the VFS over some time have increased
> its concurrency capabilities to an extent that kernfs's use of a mutex
> is no longer appropriate. There's also an issue of walks for non-existent
> paths causing contention if there are quite a few of them which is a less
> common problem.
>
> This patch series is relatively straightforward.
>
> All it does is add the ability to take advantage of VFS negative dentry
> caching to avoid needless dentry alloc/free cycles for lookups of paths
> that don't exist and change the kernfs_mutex to a read/write semaphore.
>
> The patch that tried to stay in VFS rcu-walk mode during path walks has
> been dropped for two reasons. First, it doesn't actually give very much
> improvement and, second, if there's a place where mistakes could go
> unnoticed it would be in that path. This makes the patch series simpler
> to review and reduces the likelihood of problems going unnoticed and
> popping up later.
>
> The patch to use a revision to identify if a directory has changed has
> also been dropped. If the directory has changed the dentry revision
> needs to be updated to avoid subsequent rb tree searches and after
> changing to use a read/write semaphore the update also requires a lock.
> But the d_lock is the only lock available at this point which might
> itself be contended.
>
> Changes since v2:
> - actually fix the inode attribute update locking.
> - drop the patch that tried to stay in rcu-walk mode.
> - drop the use a revision to identify if a directory has changed patch.
>
> Changes since v1:
> - fix locking in .permission() and .getattr() by re-factoring the attribute
>   handling code.
>
> ---
>
> Ian Kent (4):
>       kernfs: move revalidate to be near lookup
>       kernfs: use VFS negative dentry caching
>       kernfs: switch kernfs to use an rwsem
>       kernfs: use i_lock to protect concurrent inode updates
>
>
>  fs/kernfs/dir.c             |  240 +++++++++++++++++++++++--------------------
>  fs/kernfs/file.c            |    4 -
>  fs/kernfs/inode.c           |   18 ++-
>  fs/kernfs/kernfs-internal.h |    5 +
>  fs/kernfs/mount.c           |   12 +-
>  fs/kernfs/symlink.c         |    4 -
>  include/linux/kernfs.h      |    2
>  7 files changed, 155 insertions(+), 130 deletions(-)
>
> --
>

Hi Ian,

I tested this patchset with my benchmark
(https://github.com/foxhlchen/sysfs_benchmark) on a 96-CPU (AWS c5)
machine.

The result was promising:
Before, one open+read+close cycle took 500us without much variation.
With this patch, the fastest one takes only 30us, though the slowest
is still around 100us (due to the spinlock). perf report shows no more
significant mutex contention.
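
For reference, the cycle being timed is essentially the following (a
minimal sketch only; the real harness is in the repository above, and
the sysfs path is just an example):

/* One open+read+close cycle against a sysfs attribute. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[256];
	int fd = open("/sys/devices/system/cpu/cpu0/online", O_RDONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (read(fd, buf, sizeof(buf)) < 0)
		perror("read");
	close(fd);
	return 0;
}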

FYR, I put outputs in the attachment.


thanks,
fox

[-- Attachment #2: result.after --]
[-- Type: application/octet-stream, Size: 4927 bytes --]

[-- Attachment #3: result.before --]
[-- Type: application/octet-stream, Size: 5055 bytes --]

[-- Attachment #4: perf_report --]
[-- Type: application/octet-stream, Size: 230766 bytes --]


* Re: [PATCH v3 0/4] kernfs: proposed locking and concurrency improvement
  2021-04-19  7:56 ` [PATCH v3 0/4] kernfs: proposed locking and concurrency improvement Fox Chen
@ 2021-04-19 12:25   ` Ian Kent
  0 siblings, 0 replies; 10+ messages in thread
From: Ian Kent @ 2021-04-19 12:25 UTC (permalink / raw)
  To: Fox Chen
  Cc: Greg Kroah-Hartman, Tejun Heo, Brice Goglin, Rick Lindsley,
	Al Viro, Miklos Szeredi, David Howells, Eric Sandeen,
	Kernel Mailing List, linux-fsdevel

On Mon, 2021-04-19 at 15:56 +0800, Fox Chen wrote:
> On Fri, Apr 9, 2021 at 9:14 AM Ian Kent <raven@themaw.net> wrote:
> > There have been a few instances of contention on the kernfs_mutex
> > during
> > path walks, a case on very large IBM systems seen by myself, a
> > report by
> > Brice Goglin and followed up by Fox Chen, and I've since seen a
> > couple
> > of other reports by CoreOS users.
> > 
> > The common thread is a large number of kernfs path walks leading to
> > slowness of path walks due to kernfs_mutex contention.
> > 
> > The problem being that changes to the VFS over some time have
> > increased
> > its concurrency capabilities to an extent that kernfs's use of a
> > mutex
> > is no longer appropriate. There's also an issue of walks for non-
> > existent
> > paths causing contention if there are quite a few of them which is
> > a less
> > common problem.
> > 
> > This patch series is relatively straightforward.
> > 
> > All it does is add the ability to take advantage of VFS negative
> > dentry
> > caching to avoid needless dentry alloc/free cycles for lookups of
> > paths
> > that don't exist and change the kernfs_mutex to a read/write
> > semaphore.
> > 
> > The patch that tried to stay in VFS rcu-walk mode during path walks
> > has
> > been dropped for two reasons. First, it doesn't actually give very
> > much
> > improvement and, second, if there's a place where mistakes could go
> > unnoticed it would be in that path. This makes the patch series
> > simpler
> > to review and reduces the likelihood of problems going unnoticed
> > and
> > popping up later.
> > 
> > The patch to use a revision to identify if a directory has changed
> > has
> > also been dropped. If the directory has changed the dentry revision
> > needs to be updated to avoid subsequent rb tree searches and after
> > changing to use a read/write semaphore the update also requires a
> > lock.
> > But the d_lock is the only lock available at this point which might
> > itself be contended.
> > 
> > Changes since v2:
> > - actually fix the inode attribute update locking.
> > - drop the patch that tried to stay in rcu-walk mode.
> > - drop the use a revision to identify if a directory has changed
> > patch.
> > 
> > Changes since v1:
> > - fix locking in .permission() and .getattr() by re-factoring the
> > attribute
> >   handling code.
> > 
> > ---
> > 
> > Ian Kent (4):
> >       kernfs: move revalidate to be near lookup
> >       kernfs: use VFS negative dentry caching
> >       kernfs: switch kernfs to use an rwsem
> >       kernfs: use i_lock to protect concurrent inode updates
> > 
> > 
> >  fs/kernfs/dir.c             |  240 +++++++++++++++++++++++------
> > --------------
> >  fs/kernfs/file.c            |    4 -
> >  fs/kernfs/inode.c           |   18 ++-
> >  fs/kernfs/kernfs-internal.h |    5 +
> >  fs/kernfs/mount.c           |   12 +-
> >  fs/kernfs/symlink.c         |    4 -
> >  include/linux/kernfs.h      |    2
> >  7 files changed, 155 insertions(+), 130 deletions(-)
> > 
> > --
> > 
> 
> Hi Ian,
> 
> I tested this patchset with my
> benchmark(https://github.com/foxhlchen/sysfs_benchmark) on a 96 CPUs
> (aws c5) machine.
> 
> The result was promising:
> Before, one open+read+close cycle took 500us without much variation.
> With this patch, the fastest one only takes 30us, though the slowest
> is still around 100us(due to the spinlock). perf report shows no more
> significant mutex contention.

Thanks for this Fox.
I'll have a look through the data a bit later.

For now, I'd like to keep the series as simple as possible.

But there shouldn't be a problem reading and comparing those
attributes between the kernfs node and the inode without taking
the additional lock. So a check could be done and the lock only
taken if an update is needed.

That may well improve that worst case quite a bit but, as I say, it
would need to be a follow-up change.

Ian


