* [PATCH v5 0/2] kernfs: use hashed mutex and spinlock in place of global ones.
@ 2022-02-06  1:09 Imran Khan
  2022-02-06  1:09 ` [PATCH v5 1/2] " Imran Khan
  2022-02-06  1:09 ` [PATCH v5 2/2] kernfs: Replace per-fs global rwsem with per-fs hashed rwsem Imran Khan
  0 siblings, 2 replies; 10+ messages in thread
From: Imran Khan @ 2022-02-06  1:09 UTC (permalink / raw)
  To: tj, gregkh; +Cc: linux-kernel

Reduce contention around global locks used in kernfs.

The current kernfs design makes use of 3 global locks to synchronize
various operations. There are a few other global locks as well but
they are used for specific cases and hence don't cause severe contention.

The 3 main global locks used in kernfs are:

1. A global mutex, kernfs_open_file_mutex, to protect the list of
kernfs_open_file instances corresponding to a sysfs attribute.

2. A global spinlock, kernfs_open_node_lock, to protect
kernfs_node->attr.open, which points to the kernfs_open_node instance
corresponding to a kernfs_node.

3. A global per-fs rw semaphore, root->kernfs_rwsem, to synchronize most
of the other operations, such as adding, removing or renaming a file
or directory.

Since these locks are global, they can cause contention when multiple
(for example, a few hundred) applications try to access sysfs (or other
kernfs based file system) files in parallel, even if the applications
are accessing different and unrelated files.

For example, on a system with 384 cores, if I run 200 instances of an
application that mostly executes the following loop:

  for (int loop = 0; loop < 100; loop++)
  {
    for (int port_num = 1; port_num < 2; port_num++)
    {
      for (int gid_index = 0; gid_index < 254; gid_index++)
      {
        char ret_buf[64];
        char gid_file_path[1024];

        int      ret_fd;
        ssize_t  ret_rd;

        memset(ret_buf, 0, sizeof(ret_buf));
        memset(gid_file_path, 0, sizeof(gid_file_path));

        snprintf(gid_file_path, sizeof(gid_file_path),
                 "/sys/class/infiniband/%s/ports/%d/gids/%d",
                 dev_name,
                 port_num,
                 gid_index);

        ret_fd = open(gid_file_path, O_RDONLY | O_CLOEXEC);
        if (ret_fd < 0)
        {
          printf("Failed to open %s\n", gid_file_path);
          continue;
        }

        /* Read the GID */
        ret_rd = read(ret_fd, ret_buf, 40);

        if (ret_rd == -1)
        {
          printf("Failed to read from file %s, errno: %d\n",
                 gid_file_path, errno);
          close(ret_fd);  /* don't leak the fd on a failed read */
          continue;
        }

        close(ret_fd);
      }
    }
  }

I can see contention around the above mentioned locks as follows:

-   54.07%    53.60%  showgids         [kernel.kallsyms]       [k] osq_lock
   - 53.60% __libc_start_main
      - 32.29% __GI___libc_open
           entry_SYSCALL_64_after_hwframe
           do_syscall_64
           sys_open
           do_sys_open
           do_filp_open
           path_openat
           vfs_open
           do_dentry_open
           kernfs_fop_open
           mutex_lock
         - __mutex_lock_slowpath
            - 32.23% __mutex_lock.isra.5
                 osq_lock
      - 21.31% __GI___libc_close
           entry_SYSCALL_64_after_hwframe
           do_syscall_64
           exit_to_usermode_loop
           task_work_run
           ____fput
           __fput
           kernfs_fop_release
           kernfs_put_open_node.isra.8
           mutex_lock
         - __mutex_lock_slowpath
            - 21.28% __mutex_lock.isra.5
                 osq_lock

-   10.49%    10.39%  showgids         [kernel.kallsyms]      [k] down_read
     10.39% __libc_start_main
        __GI___libc_open
        entry_SYSCALL_64_after_hwframe
        do_syscall_64
        sys_open
        do_sys_open
        do_filp_open
      - path_openat
         - 9.72% link_path_walk
            - 5.21% inode_permission
               - __inode_permission
                  - 5.21% kernfs_iop_permission
                       down_read
            - 4.08% walk_component
                 lookup_fast
               - d_revalidate.part.24
                  - 4.08% kernfs_dop_revalidate

-    7.48%     7.41%  showgids         [kernel.kallsyms]       [k] up_read
     7.41% __libc_start_main
        __GI___libc_open
        entry_SYSCALL_64_after_hwframe
        do_syscall_64
        sys_open
        do_sys_open
        do_filp_open
      - path_openat
         - 7.01% link_path_walk
            - 4.12% inode_permission
               - __inode_permission
                  - 4.12% kernfs_iop_permission
                       up_read
            - 2.61% walk_component
                 lookup_fast
               - d_revalidate.part.24
                  - 2.61% kernfs_dop_revalidate

Moreover, this run of 200 application instances takes 32-34 secs to
complete.

This patch set reduces the above mentioned contention by replacing
these global locks with hashed locks.

For example, with the patched kernel on the same test setup, we no
longer see contention around osq_lock (i.e. kernfs_open_file_mutex),
and contention around the per-fs kernfs_rwsem has reduced significantly
as well. This can be seen in the following perf snippet:

-    1.66%     1.65%  showgids         [kernel.kallsyms]      [k] down_read
     1.65% __libc_start_main
        __GI___libc_open
        entry_SYSCALL_64_after_hwframe
        do_syscall_64
        sys_open
        do_sys_open
        do_filp_open
      - path_openat
         - 1.62% link_path_walk
            - 0.98% inode_permission
               - __inode_permission
                  + 0.98% kernfs_iop_permission
            - 0.52% walk_component
                 lookup_fast
               - d_revalidate.part.24
                  - 0.52% kernfs_dop_revalidate

-    1.12%     1.11%  showgids         [kernel.kallsyms]      [k] up_read
     1.11% __libc_start_main
        __GI___libc_open
        entry_SYSCALL_64_after_hwframe
        do_syscall_64
        sys_open
        do_sys_open
        do_filp_open
      - path_openat
         - 1.11% link_path_walk
            - 0.69% inode_permission
               - __inode_permission
                  - 0.69% kernfs_iop_permission
                       up_read

Moreover, the test execution time has reduced from 32-34 secs to 18-19 secs.

The patches in this patchset introduce the following changes:

PATCH-1: Make the global kernfs_open_file_mutex and kernfs_open_node_lock
	 per-fs hashed locks, with the address of a kernfs_node acting as
	 the hash key. As a result, kernfs_node objects whose addresses
	 hash to different values use different kernfs_open_file_mutex and
	 kernfs_open_node_lock instances, rather than all kernfs_node
	 objects sharing the same pair of locks as was the case earlier.

PATCH-2: Replace the single per-fs rw semaphore with per-fs hashed rw
	 semaphores, where the address of a kernfs_node again acts as the
	 hash key, to reduce contention around the single per-fs rwsem.
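
In both cases the lock lookup boils down to hashing the address of a
kernfs_node into a small per-fs array of locks, so that unrelated nodes
usually end up taking different locks. Roughly (a simplified sketch; the
real helpers are in the patches below):

	int idx = hash_ptr(kn, NR_KERNFS_LOCK_BITS); /* 0..NR_KERNFS_LOCKS-1 */
	struct mutex *lock = &kernfs_root(kn)->open_file_mutex[idx].lock;

	mutex_lock(lock); /* only nodes hashing to the same idx contend */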

------------------------------------------------------------------
Changes since v4:
 - Removed compilation warnings reported by the 0-day bot.

Changes since v3:
 - Make open_file_mutex and open_node_lock per-fs.
 - Replace per-fs rwsem with per-fs hashed rwsem.
   (Suggested by Tejun Heo <tj@kernel.org>)


Imran Khan (2):
  kernfs: use hashed mutex and spinlock in place of global ones.
  kernfs: Replace per-fs global rwsem with per-fs hashed rwsem.

 fs/kernfs/dir.c             | 281 +++++++++++++++++++++++++-----------
 fs/kernfs/file.c            |  67 ++++-----
 fs/kernfs/inode.c           |  22 +--
 fs/kernfs/kernfs-internal.h | 163 +++++++++++++++++++++
 fs/kernfs/mount.c           |  13 +-
 fs/kernfs/symlink.c         |   5 +-
 include/linux/kernfs.h      |  44 +++++-
 7 files changed, 445 insertions(+), 150 deletions(-)


base-commit: a70bf4a85b43cb952bd39dd948b103b1b3eb2cf8
-- 
2.30.2



* [PATCH v5 1/2] kernfs: use hashed mutex and spinlock in place of global ones.
  2022-02-06  1:09 [PATCH v5 0/2] kernfs: use hashed mutex and spinlock in place of global ones Imran Khan
@ 2022-02-06  1:09 ` Imran Khan
  2022-02-08 11:27   ` Greg KH
  2022-02-08 17:22   ` Tejun Heo
  2022-02-06  1:09 ` [PATCH v5 2/2] kernfs: Replace per-fs global rwsem with per-fs hashed rwsem Imran Khan
  1 sibling, 2 replies; 10+ messages in thread
From: Imran Khan @ 2022-02-06  1:09 UTC (permalink / raw)
  To: tj, gregkh; +Cc: linux-kernel

Right now a global mutex (kernfs_open_file_mutex) protects the list of
kernfs_open_file instances corresponding to a sysfs attribute. So even
if different tasks are opening or closing different sysfs files, they
can contend on the osq_lock of this mutex. The contention is more
apparent on large-scale systems with a few hundred CPUs, where most of
the CPUs have running tasks that are opening, accessing or closing sysfs
files at any point of time. Since each list of kernfs_open_file belongs
to a kernfs_open_node instance, which in turn corresponds to one
kernfs_node, moving the global kernfs_open_file_mutex into kernfs_node
might look like a fix for this contention, but it has the unwanted side
effect of bloating kernfs_node and hence kobject memory usage.

Also, since kernfs_node->attr.open points to the kernfs_open_node instance
corresponding to the kernfs_node, we could use a kernfs_node-specific
spinlock in place of the current global spinlock, i.e. kernfs_open_node_lock.
But this approach would increase kobject memory usage as well.

Use per-fs hashed locks in place of the above mentioned global locks to
reduce kernfs access contention without increasing kobject memory usage.

Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
---
 fs/kernfs/dir.c             |  5 +++
 fs/kernfs/file.c            | 61 ++++++++++++++++---------------------
 fs/kernfs/kernfs-internal.h | 51 +++++++++++++++++++++++++++++++
 include/linux/kernfs.h      | 39 ++++++++++++++++++++++++
 4 files changed, 122 insertions(+), 34 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index e6d9772ddb4ca..d26fb3bffda92 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -909,6 +909,7 @@ struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
 {
 	struct kernfs_root *root;
 	struct kernfs_node *kn;
+	int lock_count;
 
 	root = kzalloc(sizeof(*root), GFP_KERNEL);
 	if (!root)
@@ -916,6 +917,10 @@ struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
 
 	idr_init(&root->ino_idr);
 	init_rwsem(&root->kernfs_rwsem);
+	for (lock_count = 0; lock_count < NR_KERNFS_LOCKS; lock_count++) {
+		spin_lock_init(&root->open_node_locks[lock_count].lock);
+		mutex_init(&root->open_file_mutex[lock_count].lock);
+	}
 	INIT_LIST_HEAD(&root->supers);
 
 	/*
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 9414a7a60a9f4..018d038b72fdd 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -18,20 +18,6 @@
 
 #include "kernfs-internal.h"
 
-/*
- * There's one kernfs_open_file for each open file and one kernfs_open_node
- * for each kernfs_node with one or more open files.
- *
- * kernfs_node->attr.open points to kernfs_open_node.  attr.open is
- * protected by kernfs_open_node_lock.
- *
- * filp->private_data points to seq_file whose ->private points to
- * kernfs_open_file.  kernfs_open_files are chained at
- * kernfs_open_node->files, which is protected by kernfs_open_file_mutex.
- */
-static DEFINE_SPINLOCK(kernfs_open_node_lock);
-static DEFINE_MUTEX(kernfs_open_file_mutex);
-
 struct kernfs_open_node {
 	atomic_t		refcnt;
 	atomic_t		event;
@@ -524,10 +510,11 @@ static int kernfs_get_open_node(struct kernfs_node *kn,
 				struct kernfs_open_file *of)
 {
 	struct kernfs_open_node *on, *new_on = NULL;
-
+	struct mutex *mutex = NULL;
+	spinlock_t *lock = NULL;
  retry:
-	mutex_lock(&kernfs_open_file_mutex);
-	spin_lock_irq(&kernfs_open_node_lock);
+	mutex = kernfs_open_file_mutex_lock(kn);
+	lock = kernfs_open_node_lock(kn);
 
 	if (!kn->attr.open && new_on) {
 		kn->attr.open = new_on;
@@ -540,8 +527,8 @@ static int kernfs_get_open_node(struct kernfs_node *kn,
 		list_add_tail(&of->list, &on->files);
 	}
 
-	spin_unlock_irq(&kernfs_open_node_lock);
-	mutex_unlock(&kernfs_open_file_mutex);
+	spin_unlock_irq(lock);
+	mutex_unlock(mutex);
 
 	if (on) {
 		kfree(new_on);
@@ -575,10 +562,14 @@ static void kernfs_put_open_node(struct kernfs_node *kn,
 				 struct kernfs_open_file *of)
 {
 	struct kernfs_open_node *on = kn->attr.open;
+	struct mutex *mutex = NULL;
+	spinlock_t *lock = NULL;
 	unsigned long flags;
 
-	mutex_lock(&kernfs_open_file_mutex);
-	spin_lock_irqsave(&kernfs_open_node_lock, flags);
+	mutex = kernfs_open_file_mutex_lock(kn);
+	lock = kernfs_open_node_lock_ptr(kn);
+
+	spin_lock_irqsave(lock, flags);
 
 	if (of)
 		list_del(&of->list);
@@ -588,8 +579,8 @@ static void kernfs_put_open_node(struct kernfs_node *kn,
 	else
 		on = NULL;
 
-	spin_unlock_irqrestore(&kernfs_open_node_lock, flags);
-	mutex_unlock(&kernfs_open_file_mutex);
+	spin_unlock_irqrestore(lock, flags);
+	mutex_unlock(mutex);
 
 	kfree(on);
 }
@@ -729,11 +720,11 @@ static void kernfs_release_file(struct kernfs_node *kn,
 	/*
 	 * @of is guaranteed to have no other file operations in flight and
 	 * we just want to synchronize release and drain paths.
-	 * @kernfs_open_file_mutex is enough.  @of->mutex can't be used
+	 * @open_file_mutex is enough.  @of->mutex can't be used
 	 * here because drain path may be called from places which can
 	 * cause circular dependency.
 	 */
-	lockdep_assert_held(&kernfs_open_file_mutex);
+	lockdep_assert_held(kernfs_open_file_mutex_ptr(kn));
 
 	if (!of->released) {
 		/*
@@ -750,11 +741,12 @@ static int kernfs_fop_release(struct inode *inode, struct file *filp)
 {
 	struct kernfs_node *kn = inode->i_private;
 	struct kernfs_open_file *of = kernfs_of(filp);
+	struct mutex *lock = NULL;
 
 	if (kn->flags & KERNFS_HAS_RELEASE) {
-		mutex_lock(&kernfs_open_file_mutex);
+		lock = kernfs_open_file_mutex_lock(kn);
 		kernfs_release_file(kn, of);
-		mutex_unlock(&kernfs_open_file_mutex);
+		mutex_unlock(lock);
 	}
 
 	kernfs_put_open_node(kn, of);
@@ -769,19 +761,21 @@ void kernfs_drain_open_files(struct kernfs_node *kn)
 {
 	struct kernfs_open_node *on;
 	struct kernfs_open_file *of;
+	struct mutex *mutex = NULL;
+	spinlock_t *lock = NULL;
 
 	if (!(kn->flags & (KERNFS_HAS_MMAP | KERNFS_HAS_RELEASE)))
 		return;
 
-	spin_lock_irq(&kernfs_open_node_lock);
+	lock = kernfs_open_node_lock(kn);
 	on = kn->attr.open;
 	if (on)
 		atomic_inc(&on->refcnt);
-	spin_unlock_irq(&kernfs_open_node_lock);
+	spin_unlock_irq(lock);
 	if (!on)
 		return;
 
-	mutex_lock(&kernfs_open_file_mutex);
+	mutex = kernfs_open_file_mutex_lock(kn);
 
 	list_for_each_entry(of, &on->files, list) {
 		struct inode *inode = file_inode(of->file);
@@ -793,8 +787,7 @@ void kernfs_drain_open_files(struct kernfs_node *kn)
 			kernfs_release_file(kn, of);
 	}
 
-	mutex_unlock(&kernfs_open_file_mutex);
-
+	mutex_unlock(mutex);
 	kernfs_put_open_node(kn, NULL);
 }
 
@@ -922,13 +915,13 @@ void kernfs_notify(struct kernfs_node *kn)
 		return;
 
 	/* kick poll immediately */
-	spin_lock_irqsave(&kernfs_open_node_lock, flags);
+	spin_lock_irqsave(kernfs_open_node_lock_ptr(kn), flags);
 	on = kn->attr.open;
 	if (on) {
 		atomic_inc(&on->event);
 		wake_up_interruptible(&on->poll);
 	}
-	spin_unlock_irqrestore(&kernfs_open_node_lock, flags);
+	spin_unlock_irqrestore(kernfs_open_node_lock_ptr(kn), flags);
 
 	/* schedule work to kick fsnotify */
 	spin_lock_irqsave(&kernfs_notify_lock, flags);
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index f9cc912c31e1b..cc49a6cd94154 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -31,6 +31,7 @@ struct kernfs_iattrs {
 	atomic_t		user_xattr_size;
 };
 
+
 /* +1 to avoid triggering overflow warning when negating it */
 #define KN_DEACTIVATED_BIAS		(INT_MIN + 1)
 
@@ -147,4 +148,54 @@ void kernfs_drain_open_files(struct kernfs_node *kn);
  */
 extern const struct inode_operations kernfs_symlink_iops;
 
+static inline spinlock_t *kernfs_open_node_lock_ptr(struct kernfs_node *kn)
+{
+	struct kernfs_root *root;
+	int idx = hash_ptr(kn, NR_KERNFS_LOCK_BITS);
+
+	root = kernfs_root(kn);
+
+	return &root->open_node_locks[idx].lock;
+}
+
+static inline spinlock_t *kernfs_open_node_lock(struct kernfs_node *kn)
+{
+	struct kernfs_root *root;
+	spinlock_t *lock;
+	int idx = hash_ptr(kn, NR_KERNFS_LOCK_BITS);
+
+	root = kernfs_root(kn);
+
+	lock = &root->open_node_locks[idx].lock;
+
+	spin_lock_irq(lock);
+
+	return lock;
+}
+
+static inline struct mutex *kernfs_open_file_mutex_ptr(struct kernfs_node *kn)
+{
+	struct kernfs_root *root;
+	int idx = hash_ptr(kn, NR_KERNFS_LOCK_BITS);
+
+	root = kernfs_root(kn);
+
+	return &root->open_file_mutex[idx].lock;
+}
+
+static inline struct mutex *kernfs_open_file_mutex_lock(struct kernfs_node *kn)
+{
+	struct kernfs_root *root;
+	struct mutex *lock;
+	int idx = hash_ptr(kn, NR_KERNFS_LOCK_BITS);
+
+	root = kernfs_root(kn);
+
+	lock = &root->open_file_mutex[idx].lock;
+
+	mutex_lock(lock);
+
+	return lock;
+}
+
 #endif	/* __KERNFS_INTERNAL_H */
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 861c4f0f8a29f..5bf9f02ce9dce 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -18,6 +18,8 @@
 #include <linux/uidgid.h>
 #include <linux/wait.h>
 #include <linux/rwsem.h>
+#include <linux/spinlock.h>
+#include <linux/cache.h>
 
 struct file;
 struct dentry;
@@ -34,6 +36,40 @@ struct kernfs_fs_context;
 struct kernfs_open_node;
 struct kernfs_iattrs;
 
+/*
+ * NR_KERNFS_LOCK_BITS determines size (NR_KERNFS_LOCKS) of hash
+ * table of locks.
+ * Having a small hash table would impact scalability, since
+ * more and more kernfs_node objects will end up using the same lock,
+ * and having a very large hash table would waste memory.
+ *
+ * At the moment size of hash table of locks is being set based on
+ * the number of CPUs as follows:
+ *
+ * NR_CPU      NR_KERNFS_LOCK_BITS      NR_KERNFS_LOCKS
+ *   1                  1                       2
+ *  2-3                 2                       4
+ *  4-7                 4                       16
+ *  8-15                6                       64
+ *  16-31               8                       256
+ *  32 and more         10                      1024
+ */
+#ifdef CONFIG_SMP
+#define NR_KERNFS_LOCK_BITS (2 * (ilog2(NR_CPUS < 32 ? NR_CPUS : 32)))
+#else
+#define NR_KERNFS_LOCK_BITS     1
+#endif
+
+#define NR_KERNFS_LOCKS     (1 << NR_KERNFS_LOCK_BITS)
+
+struct kernfs_open_node_lock {
+	spinlock_t lock;
+} ____cacheline_aligned_in_smp;
+
+struct kernfs_open_file_mutex {
+	struct mutex lock;
+} ____cacheline_aligned_in_smp;
+
 enum kernfs_node_type {
 	KERNFS_DIR		= 0x0001,
 	KERNFS_FILE		= 0x0002,
@@ -90,6 +126,7 @@ enum kernfs_root_flag {
 	KERNFS_ROOT_SUPPORT_USER_XATTR		= 0x0008,
 };
 
+
 /* type-specific structures for kernfs_node union members */
 struct kernfs_elem_dir {
 	unsigned long		subdirs;
@@ -201,6 +238,8 @@ struct kernfs_root {
 
 	wait_queue_head_t	deactivate_waitq;
 	struct rw_semaphore	kernfs_rwsem;
+	struct kernfs_open_node_lock open_node_locks[NR_KERNFS_LOCKS];
+	struct kernfs_open_file_mutex open_file_mutex[NR_KERNFS_LOCKS];
 };
 
 struct kernfs_open_file {
-- 
2.30.2



* [PATCH v5 2/2] kernfs: Replace per-fs global rwsem with per-fs hashed rwsem.
  2022-02-06  1:09 [PATCH v5 0/2] kernfs: use hashed mutex and spinlock in place of global ones Imran Khan
  2022-02-06  1:09 ` [PATCH v5 1/2] " Imran Khan
@ 2022-02-06  1:09 ` Imran Khan
  2022-02-08 18:26   ` Tejun Heo
  1 sibling, 1 reply; 10+ messages in thread
From: Imran Khan @ 2022-02-06  1:09 UTC (permalink / raw)
  To: tj, gregkh; +Cc: linux-kernel

Having a single rwsem to synchronize all operations across a kernfs
based file system (cgroup, sysfs etc.) does not scale well. Replace
it with hashed rwsems to reduce contention around the single per-fs
rwsem.
Also introduce a per-fs rwsem to protect the per-fs list of
kernfs_super_info.

Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
---
 fs/kernfs/dir.c             | 276 ++++++++++++++++++++++++------------
 fs/kernfs/file.c            |   6 +-
 fs/kernfs/inode.c           |  22 ++-
 fs/kernfs/kernfs-internal.h | 112 +++++++++++++++
 fs/kernfs/mount.c           |  13 +-
 fs/kernfs/symlink.c         |   5 +-
 include/linux/kernfs.h      |   5 +-
 7 files changed, 323 insertions(+), 116 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index d26fb3bffda92..ec1fff78c25a9 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -25,7 +25,9 @@ static DEFINE_SPINLOCK(kernfs_idr_lock);	/* root->ino_idr */
 
 static bool kernfs_active(struct kernfs_node *kn)
 {
-	lockdep_assert_held(&kernfs_root(kn)->kernfs_rwsem);
+	int idx = hash_ptr(kn, NR_KERNFS_LOCK_BITS);
+
+	lockdep_assert_held(&kernfs_root(kn)->kernfs_rwsem[idx]);
 	return atomic_read(&kn->active) >= 0;
 }
 
@@ -450,40 +452,42 @@ void kernfs_put_active(struct kernfs_node *kn)
 /**
  * kernfs_drain - drain kernfs_node
  * @kn: kernfs_node to drain
+ * @anc: ancestor of kernfs_node to drain
  *
  * Drain existing usages and nuke all existing mmaps of @kn.  Mutiple
  * removers may invoke this function concurrently on @kn and all will
  * return after draining is complete.
  */
-static void kernfs_drain(struct kernfs_node *kn)
-	__releases(&kernfs_root(kn)->kernfs_rwsem)
-	__acquires(&kernfs_root(kn)->kernfs_rwsem)
+static void kernfs_drain(struct kernfs_node *kn, struct kernfs_node *anc)
+	__releases(&kernfs_root(anc)->kernfs_rwsem[a_idx])
+	__acquires(&kernfs_root(anc)->kernfs_rwsem[a_idx])
 {
 	struct kernfs_root *root = kernfs_root(kn);
+	int a_idx = hash_ptr(anc, NR_KERNFS_LOCK_BITS);
 
-	lockdep_assert_held_write(&root->kernfs_rwsem);
-	WARN_ON_ONCE(kernfs_active(kn));
+	lockdep_assert_held_write(&root->kernfs_rwsem[a_idx]);
+	WARN_ON_ONCE(atomic_read(&kn->active) >= 0);
 
-	up_write(&root->kernfs_rwsem);
+	up_write_kernfs_rwsem(anc);
 
-	if (kernfs_lockdep(kn)) {
-		rwsem_acquire(&kn->dep_map, 0, 0, _RET_IP_);
-		if (atomic_read(&kn->active) != KN_DEACTIVATED_BIAS)
-			lock_contended(&kn->dep_map, _RET_IP_);
+	if (kernfs_lockdep(anc)) {
+		rwsem_acquire(&anc->dep_map, 0, 0, _RET_IP_);
+		if (atomic_read(&anc->active) != KN_DEACTIVATED_BIAS)
+			lock_contended(&anc->dep_map, _RET_IP_);
 	}
 
 	/* but everyone should wait for draining */
 	wait_event(root->deactivate_waitq,
 		   atomic_read(&kn->active) == KN_DEACTIVATED_BIAS);
 
-	if (kernfs_lockdep(kn)) {
-		lock_acquired(&kn->dep_map, _RET_IP_);
-		rwsem_release(&kn->dep_map, _RET_IP_);
+	if (kernfs_lockdep(anc)) {
+		lock_acquired(&anc->dep_map, _RET_IP_);
+		rwsem_release(&anc->dep_map, _RET_IP_);
 	}
 
 	kernfs_drain_open_files(kn);
 
-	down_write(&root->kernfs_rwsem);
+	down_write_kernfs_rwsem(anc, LOCK_SELF, 0);
 }
 
 /**
@@ -718,12 +722,11 @@ struct kernfs_node *kernfs_find_and_get_node_by_id(struct kernfs_root *root,
 int kernfs_add_one(struct kernfs_node *kn)
 {
 	struct kernfs_node *parent = kn->parent;
-	struct kernfs_root *root = kernfs_root(parent);
 	struct kernfs_iattrs *ps_iattr;
 	bool has_ns;
 	int ret;
 
-	down_write(&root->kernfs_rwsem);
+	down_write_kernfs_rwsem(parent, LOCK_SELF, 0);
 
 	ret = -EINVAL;
 	has_ns = kernfs_ns_enabled(parent);
@@ -754,7 +757,7 @@ int kernfs_add_one(struct kernfs_node *kn)
 		ps_iattr->ia_mtime = ps_iattr->ia_ctime;
 	}
 
-	up_write(&root->kernfs_rwsem);
+	up_write_kernfs_rwsem(parent);
 
 	/*
 	 * Activate the new node unless CREATE_DEACTIVATED is requested.
@@ -768,7 +771,7 @@ int kernfs_add_one(struct kernfs_node *kn)
 	return 0;
 
 out_unlock:
-	up_write(&root->kernfs_rwsem);
+	up_write_kernfs_rwsem(parent);
 	return ret;
 }
 
@@ -788,8 +791,9 @@ static struct kernfs_node *kernfs_find_ns(struct kernfs_node *parent,
 	struct rb_node *node = parent->dir.children.rb_node;
 	bool has_ns = kernfs_ns_enabled(parent);
 	unsigned int hash;
+	int idx = hash_ptr(parent, NR_KERNFS_LOCK_BITS);
 
-	lockdep_assert_held(&kernfs_root(parent)->kernfs_rwsem);
+	lockdep_assert_held(&kernfs_root(parent)->kernfs_rwsem[idx]);
 
 	if (has_ns != (bool)ns) {
 		WARN(1, KERN_WARNING "kernfs: ns %s in '%s' for '%s'\n",
@@ -820,8 +824,9 @@ static struct kernfs_node *kernfs_walk_ns(struct kernfs_node *parent,
 {
 	size_t len;
 	char *p, *name;
+	int idx = hash_ptr(parent, NR_KERNFS_LOCK_BITS);
 
-	lockdep_assert_held_read(&kernfs_root(parent)->kernfs_rwsem);
+	lockdep_assert_held_read(&kernfs_root(parent)->kernfs_rwsem[idx]);
 
 	/* grab kernfs_rename_lock to piggy back on kernfs_pr_cont_buf */
 	spin_lock_irq(&kernfs_rename_lock);
@@ -860,12 +865,11 @@ struct kernfs_node *kernfs_find_and_get_ns(struct kernfs_node *parent,
 					   const char *name, const void *ns)
 {
 	struct kernfs_node *kn;
-	struct kernfs_root *root = kernfs_root(parent);
 
-	down_read(&root->kernfs_rwsem);
+	down_read_kernfs_rwsem(parent, LOCK_SELF, 0);
 	kn = kernfs_find_ns(parent, name, ns);
 	kernfs_get(kn);
-	up_read(&root->kernfs_rwsem);
+	up_read_kernfs_rwsem(parent);
 
 	return kn;
 }
@@ -885,12 +889,11 @@ struct kernfs_node *kernfs_walk_and_get_ns(struct kernfs_node *parent,
 					   const char *path, const void *ns)
 {
 	struct kernfs_node *kn;
-	struct kernfs_root *root = kernfs_root(parent);
 
-	down_read(&root->kernfs_rwsem);
+	down_read_kernfs_rwsem(parent, LOCK_SELF, 0);
 	kn = kernfs_walk_ns(parent, path, ns);
 	kernfs_get(kn);
-	up_read(&root->kernfs_rwsem);
+	up_read_kernfs_rwsem(parent);
 
 	return kn;
 }
@@ -916,11 +919,12 @@ struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
 		return ERR_PTR(-ENOMEM);
 
 	idr_init(&root->ino_idr);
-	init_rwsem(&root->kernfs_rwsem);
 	for (lock_count = 0; lock_count < NR_KERNFS_LOCKS; lock_count++) {
 		spin_lock_init(&root->open_node_locks[lock_count].lock);
 		mutex_init(&root->open_file_mutex[lock_count].lock);
+		init_rwsem(&root->kernfs_rwsem[lock_count]);
 	}
+	init_rwsem(&root->supers_rwsem);
 	INIT_LIST_HEAD(&root->supers);
 
 	/*
@@ -1050,7 +1054,6 @@ struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
 static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
 {
 	struct kernfs_node *kn;
-	struct kernfs_root *root;
 
 	if (flags & LOOKUP_RCU)
 		return -ECHILD;
@@ -1066,13 +1069,12 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
 		parent = kernfs_dentry_node(dentry->d_parent);
 		if (parent) {
 			spin_unlock(&dentry->d_lock);
-			root = kernfs_root(parent);
-			down_read(&root->kernfs_rwsem);
+			down_read_kernfs_rwsem(parent, LOCK_SELF, 0);
 			if (kernfs_dir_changed(parent, dentry)) {
-				up_read(&root->kernfs_rwsem);
+				up_read_kernfs_rwsem(parent);
 				return 0;
 			}
-			up_read(&root->kernfs_rwsem);
+			up_read_kernfs_rwsem(parent);
 		} else
 			spin_unlock(&dentry->d_lock);
 
@@ -1083,8 +1085,7 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
 	}
 
 	kn = kernfs_dentry_node(dentry);
-	root = kernfs_root(kn);
-	down_read(&root->kernfs_rwsem);
+	down_read_kernfs_rwsem(kn, LOCK_SELF, 0);
 
 	/* The kernfs node has been deactivated */
 	if (!kernfs_active(kn))
@@ -1103,10 +1104,10 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
 	    kernfs_info(dentry->d_sb)->ns != kn->ns)
 		goto out_bad;
 
-	up_read(&root->kernfs_rwsem);
+	up_read_kernfs_rwsem(kn);
 	return 1;
 out_bad:
-	up_read(&root->kernfs_rwsem);
+	up_read_kernfs_rwsem(kn);
 	return 0;
 }
 
@@ -1120,28 +1121,30 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
 {
 	struct kernfs_node *parent = dir->i_private;
 	struct kernfs_node *kn;
-	struct kernfs_root *root;
 	struct inode *inode = NULL;
 	const void *ns = NULL;
 
-	root = kernfs_root(parent);
-	down_read(&root->kernfs_rwsem);
+	down_read_kernfs_rwsem(parent, LOCK_SELF, 0);
 	if (kernfs_ns_enabled(parent))
 		ns = kernfs_info(dir->i_sb)->ns;
 
 	kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
+	up_read_kernfs_rwsem(parent);
 	/* attach dentry and inode */
 	if (kn) {
 		/* Inactive nodes are invisible to the VFS so don't
 		 * create a negative.
 		 */
+		down_read_kernfs_rwsem(kn, LOCK_SELF, 0);
 		if (!kernfs_active(kn)) {
-			up_read(&root->kernfs_rwsem);
+			/* Unlock both node and parent before returning */
+			up_read_kernfs_rwsem(kn);
 			return NULL;
 		}
 		inode = kernfs_get_inode(dir->i_sb, kn);
 		if (!inode)
 			inode = ERR_PTR(-ENOMEM);
+		up_read_kernfs_rwsem(kn);
 	}
 	/*
 	 * Needed for negative dentry validation.
@@ -1149,9 +1152,10 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
 	 * or transforms from positive dentry in dentry_unlink_inode()
 	 * called from vfs_rmdir().
 	 */
+	down_read_kernfs_rwsem(parent, LOCK_SELF, 0);
 	if (!IS_ERR(inode))
 		kernfs_set_rev(parent, dentry);
-	up_read(&root->kernfs_rwsem);
+	up_read_kernfs_rwsem(parent);
 
 	/* instantiate and hash (possibly negative) dentry */
 	return d_splice_alias(inode, dentry);
@@ -1273,8 +1277,9 @@ static struct kernfs_node *kernfs_next_descendant_post(struct kernfs_node *pos,
 						       struct kernfs_node *root)
 {
 	struct rb_node *rbn;
+	int idx = hash_ptr(root, NR_KERNFS_LOCK_BITS);
 
-	lockdep_assert_held_write(&kernfs_root(root)->kernfs_rwsem);
+	lockdep_assert_held_write(&kernfs_root(root)->kernfs_rwsem[idx]);
 
 	/* if first iteration, visit leftmost descendant which may be root */
 	if (!pos)
@@ -1309,9 +1314,8 @@ static struct kernfs_node *kernfs_next_descendant_post(struct kernfs_node *pos,
 void kernfs_activate(struct kernfs_node *kn)
 {
 	struct kernfs_node *pos;
-	struct kernfs_root *root = kernfs_root(kn);
 
-	down_write(&root->kernfs_rwsem);
+	down_write_kernfs_rwsem(kn, LOCK_SELF, 0);
 
 	pos = NULL;
 	while ((pos = kernfs_next_descendant_post(pos, kn))) {
@@ -1325,14 +1329,15 @@ void kernfs_activate(struct kernfs_node *kn)
 		pos->flags |= KERNFS_ACTIVATED;
 	}
 
-	up_write(&root->kernfs_rwsem);
+	up_write_kernfs_rwsem(kn);
 }
 
 static void __kernfs_remove(struct kernfs_node *kn)
 {
 	struct kernfs_node *pos;
+	int idx = hash_ptr(kn, NR_KERNFS_LOCK_BITS);
 
-	lockdep_assert_held_write(&kernfs_root(kn)->kernfs_rwsem);
+	lockdep_assert_held_write(&kernfs_root(kn)->kernfs_rwsem[idx]);
 
 	/*
 	 * Short-circuit if non-root @kn has already finished removal.
@@ -1346,9 +1351,16 @@ static void __kernfs_remove(struct kernfs_node *kn)
 
 	/* prevent any new usage under @kn by deactivating all nodes */
 	pos = NULL;
-	while ((pos = kernfs_next_descendant_post(pos, kn)))
+	while ((pos = kernfs_next_descendant_post(pos, kn))) {
+		int n_idx = hash_ptr(pos, NR_KERNFS_LOCK_BITS);
+
+		if (n_idx != idx)
+			down_write_kernfs_rwsem(pos, LOCK_SELF, 1);
 		if (kernfs_active(pos))
 			atomic_add(KN_DEACTIVATED_BIAS, &pos->active);
+		if (n_idx != idx)
+			up_write_kernfs_rwsem(pos);
+	}
 
 	/* deactivate and unlink the subtree node-by-node */
 	do {
@@ -1369,7 +1381,7 @@ static void __kernfs_remove(struct kernfs_node *kn)
 		 * error paths without worrying about draining.
 		 */
 		if (kn->flags & KERNFS_ACTIVATED)
-			kernfs_drain(pos);
+			kernfs_drain(pos, kn);
 		else
 			WARN_ON_ONCE(atomic_read(&kn->active) != KN_DEACTIVATED_BIAS);
 
@@ -1402,11 +1414,9 @@ static void __kernfs_remove(struct kernfs_node *kn)
  */
 void kernfs_remove(struct kernfs_node *kn)
 {
-	struct kernfs_root *root = kernfs_root(kn);
-
-	down_write(&root->kernfs_rwsem);
+	down_write_kernfs_rwsem(kn, LOCK_SELF, 0);
 	__kernfs_remove(kn);
-	up_write(&root->kernfs_rwsem);
+	up_write_kernfs_rwsem(kn);
 }
 
 /**
@@ -1492,9 +1502,8 @@ void kernfs_unbreak_active_protection(struct kernfs_node *kn)
 bool kernfs_remove_self(struct kernfs_node *kn)
 {
 	bool ret;
-	struct kernfs_root *root = kernfs_root(kn);
 
-	down_write(&root->kernfs_rwsem);
+	down_write_kernfs_rwsem(kn, LOCK_SELF, 0);
 	kernfs_break_active_protection(kn);
 
 	/*
@@ -1522,9 +1531,9 @@ bool kernfs_remove_self(struct kernfs_node *kn)
 			    atomic_read(&kn->active) == KN_DEACTIVATED_BIAS)
 				break;
 
-			up_write(&root->kernfs_rwsem);
+			up_write_kernfs_rwsem(kn);
 			schedule();
-			down_write(&root->kernfs_rwsem);
+			down_write_kernfs_rwsem(kn, LOCK_SELF, 0);
 		}
 		finish_wait(waitq, &wait);
 		WARN_ON_ONCE(!RB_EMPTY_NODE(&kn->rb));
@@ -1537,7 +1546,7 @@ bool kernfs_remove_self(struct kernfs_node *kn)
 	 */
 	kernfs_unbreak_active_protection(kn);
 
-	up_write(&root->kernfs_rwsem);
+	up_write_kernfs_rwsem(kn);
 	return ret;
 }
 
@@ -1554,7 +1563,6 @@ int kernfs_remove_by_name_ns(struct kernfs_node *parent, const char *name,
 			     const void *ns)
 {
 	struct kernfs_node *kn;
-	struct kernfs_root *root;
 
 	if (!parent) {
 		WARN(1, KERN_WARNING "kernfs: can not remove '%s', no directory\n",
@@ -1562,14 +1570,15 @@ int kernfs_remove_by_name_ns(struct kernfs_node *parent, const char *name,
 		return -ENOENT;
 	}
 
-	root = kernfs_root(parent);
-	down_write(&root->kernfs_rwsem);
+	down_write_kernfs_rwsem(parent, LOCK_SELF, 0);
 
 	kn = kernfs_find_ns(parent, name, ns);
-	if (kn)
+	up_write_kernfs_rwsem(parent);
+	if (kn) {
+		down_write_kernfs_rwsem(kn, LOCK_SELF, 0);
 		__kernfs_remove(kn);
-
-	up_write(&root->kernfs_rwsem);
+		up_write_kernfs_rwsem(kn);
+	}
 
 	if (kn)
 		return 0;
@@ -1588,37 +1597,65 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
 		     const char *new_name, const void *new_ns)
 {
 	struct kernfs_node *old_parent;
-	struct kernfs_root *root;
 	const char *old_name = NULL;
-	int error;
+	int error, idx, np_idx, p_idx;
 
 	/* can't move or rename root */
 	if (!kn->parent)
 		return -EINVAL;
 
-	root = kernfs_root(kn);
-	down_write(&root->kernfs_rwsem);
+	/*
+	 * Take lock of node's old (current) parent.
+	 * If new parent has a different lock, then take that
+	 * lock as well.
+	 */
+	idx = hash_ptr(kn, NR_KERNFS_LOCK_BITS);
+	p_idx = hash_ptr(kn->parent, NR_KERNFS_LOCK_BITS);
+	np_idx = hash_ptr(new_parent, NR_KERNFS_LOCK_BITS);
+
+	/*
+	 * Take only kn's lock. The subsequent kernfs_put
+	 * may free up old_parent so if old_parent has a
+	 * different lock, we will explicitly release that.
+	 */
+	down_write_kernfs_rwsem(kn, LOCK_SELF, 0);
+
+	if (idx != np_idx) /* new parent hashes to different lock */
+		down_write_kernfs_rwsem(new_parent, LOCK_SELF, 1);
+
+	/* old_parent hashes to a different lock */
+	if (idx != p_idx && p_idx != np_idx)
+		down_write_kernfs_rwsem(kn->parent, LOCK_SELF, 2);
 
 	error = -ENOENT;
 	if (!kernfs_active(kn) || !kernfs_active(new_parent) ||
-	    (new_parent->flags & KERNFS_EMPTY_DIR))
+	    (new_parent->flags & KERNFS_EMPTY_DIR)) {
+		if (idx != p_idx && p_idx != np_idx)
+			up_write_kernfs_rwsem(kn->parent);
 		goto out;
-
+	}
 	error = 0;
 	if ((kn->parent == new_parent) && (kn->ns == new_ns) &&
-	    (strcmp(kn->name, new_name) == 0))
+	    (strcmp(kn->name, new_name) == 0)) {
+		if (idx != p_idx && p_idx != np_idx)
+			up_write_kernfs_rwsem(kn->parent);
 		goto out;	/* nothing to rename */
-
+	}
 	error = -EEXIST;
-	if (kernfs_find_ns(new_parent, new_name, new_ns))
+	if (kernfs_find_ns(new_parent, new_name, new_ns)) {
+		if (idx != p_idx && p_idx != np_idx)
+			up_write_kernfs_rwsem(kn->parent);
 		goto out;
-
+	}
 	/* rename kernfs_node */
 	if (strcmp(kn->name, new_name) != 0) {
 		error = -ENOMEM;
 		new_name = kstrdup_const(new_name, GFP_KERNEL);
-		if (!new_name)
+		if (!new_name) {
+			if (idx != p_idx && p_idx != np_idx)
+				up_write_kernfs_rwsem(kn->parent);
 			goto out;
+		}
 	} else {
 		new_name = NULL;
 	}
@@ -1646,12 +1683,22 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
 	kn->hash = kernfs_name_hash(kn->name, kn->ns);
 	kernfs_link_sibling(kn);
 
+	/* Release old_parent's lock, if it is different */
+	if (idx != p_idx && p_idx != np_idx)
+		up_write_kernfs_rwsem(old_parent);
 	kernfs_put(old_parent);
 	kfree_const(old_name);
 
 	error = 0;
  out:
-	up_write(&root->kernfs_rwsem);
+	/*
+	 * If the new parent's lock has been taken, release it.
+	 * Lastly, release the node's lock.
+	 */
+	if (idx != np_idx) /* new parent hashes to different lock */
+		up_write_kernfs_rwsem(new_parent);
+
+	up_write_kernfs_rwsem(kn);
 	return error;
 }
 
@@ -1670,9 +1717,20 @@ static int kernfs_dir_fop_release(struct inode *inode, struct file *filp)
 static struct kernfs_node *kernfs_dir_pos(const void *ns,
 	struct kernfs_node *parent, loff_t hash, struct kernfs_node *pos)
 {
+	int idx, p_idx;
+
+	p_idx = hash_ptr(parent, NR_KERNFS_LOCK_BITS);
+	lockdep_assert_held(&kernfs_root(parent)->kernfs_rwsem[p_idx]);
 	if (pos) {
-		int valid = kernfs_active(pos) &&
+		int valid = 0;
+
+		idx = hash_ptr(pos, NR_KERNFS_LOCK_BITS);
+		if (idx != p_idx)
+			down_read_kernfs_rwsem(pos, LOCK_SELF, 1);
+		valid = kernfs_active(pos) &&
 			pos->parent == parent && hash == pos->hash;
+		if (idx != p_idx)
+			up_read_kernfs_rwsem(pos);
 		kernfs_put(pos);
 		if (!valid)
 			pos = NULL;
@@ -1681,18 +1739,37 @@ static struct kernfs_node *kernfs_dir_pos(const void *ns,
 		struct rb_node *node = parent->dir.children.rb_node;
 		while (node) {
 			pos = rb_to_kn(node);
-
+			idx = hash_ptr(pos, NR_KERNFS_LOCK_BITS);
+			if (idx != p_idx)
+				down_read_kernfs_rwsem(pos, LOCK_SELF, 1);
 			if (hash < pos->hash)
 				node = node->rb_left;
 			else if (hash > pos->hash)
 				node = node->rb_right;
-			else
+			else {
+				if (idx != p_idx)
+					up_read_kernfs_rwsem(pos);
 				break;
+			}
+			if (idx != p_idx)
+				up_read_kernfs_rwsem(pos);
 		}
 	}
 	/* Skip over entries which are dying/dead or in the wrong namespace */
-	while (pos && (!kernfs_active(pos) || pos->ns != ns)) {
-		struct rb_node *node = rb_next(&pos->rb);
+	while (pos) {
+		struct rb_node *node;
+
+		idx = hash_ptr(pos, NR_KERNFS_LOCK_BITS);
+		if (idx != p_idx)
+			down_read_kernfs_rwsem(pos, LOCK_SELF, 1);
+		if (kernfs_active(pos) && pos->ns == ns) {
+			if (idx != p_idx)
+				up_read_kernfs_rwsem(pos);
+			break;
+		}
+		node = rb_next(&pos->rb);
+		if (idx != p_idx)
+			up_read_kernfs_rwsem(pos);
 		if (!node)
 			pos = NULL;
 		else
@@ -1704,16 +1781,41 @@ static struct kernfs_node *kernfs_dir_pos(const void *ns,
 static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
 	struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos)
 {
+	int idx, p_idx;
+	int unlock_node = 0;
+
+	p_idx = hash_ptr(parent, NR_KERNFS_LOCK_BITS);
+	lockdep_assert_held(&kernfs_root(parent)->kernfs_rwsem[p_idx]);
 	pos = kernfs_dir_pos(ns, parent, ino, pos);
 	if (pos) {
+		idx = hash_ptr(pos, NR_KERNFS_LOCK_BITS);
+		if (idx != p_idx)
+			down_read_kernfs_rwsem(pos, LOCK_SELF, 1);
 		do {
 			struct rb_node *node = rb_next(&pos->rb);
+			if (idx != p_idx) {
+				up_read_kernfs_rwsem(pos);
+				unlock_node = 0;
+			}
 			if (!node)
 				pos = NULL;
-			else
+			else {
 				pos = rb_to_kn(node);
+				if (pos != NULL) {
+					idx =  hash_ptr(pos,
+						NR_KERNFS_LOCK_BITS);
+					if (idx != p_idx) {
+						down_read_kernfs_rwsem(pos,
+								LOCK_SELF,
+									1);
+						unlock_node = 1;
+					}
+				}
+			}
 		} while (pos && (!kernfs_active(pos) || pos->ns != ns));
 	}
+	if (unlock_node)
+		up_read_kernfs_rwsem(pos);
 	return pos;
 }
 
@@ -1722,14 +1824,12 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
 	struct dentry *dentry = file->f_path.dentry;
 	struct kernfs_node *parent = kernfs_dentry_node(dentry);
 	struct kernfs_node *pos = file->private_data;
-	struct kernfs_root *root;
 	const void *ns = NULL;
 
 	if (!dir_emit_dots(file, ctx))
 		return 0;
 
-	root = kernfs_root(parent);
-	down_read(&root->kernfs_rwsem);
+	down_read_kernfs_rwsem(parent, LOCK_SELF, 0);
 
 	if (kernfs_ns_enabled(parent))
 		ns = kernfs_info(dentry->d_sb)->ns;
@@ -1746,12 +1846,12 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
 		file->private_data = pos;
 		kernfs_get(pos);
 
-		up_read(&root->kernfs_rwsem);
+		up_read_kernfs_rwsem(parent);
 		if (!dir_emit(ctx, name, len, ino, type))
 			return 0;
-		down_read(&root->kernfs_rwsem);
+		down_read_kernfs_rwsem(parent, LOCK_SELF, 0);
 	}
-	up_read(&root->kernfs_rwsem);
+	up_read_kernfs_rwsem(parent);
 	file->private_data = NULL;
 	ctx->pos = INT_MAX;
 	return 0;
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 018d038b72fdd..5124add292582 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -855,8 +855,9 @@ static void kernfs_notify_workfn(struct work_struct *work)
 
 	root = kernfs_root(kn);
 	/* kick fsnotify */
-	down_write(&root->kernfs_rwsem);
+	down_write_kernfs_rwsem(kn, LOCK_SELF, 0);
 
+	down_write(&root->supers_rwsem);
 	list_for_each_entry(info, &kernfs_root(kn)->supers, node) {
 		struct kernfs_node *parent;
 		struct inode *p_inode = NULL;
@@ -892,8 +893,9 @@ static void kernfs_notify_workfn(struct work_struct *work)
 
 		iput(inode);
 	}
+	up_write(&root->supers_rwsem);
 
-	up_write(&root->kernfs_rwsem);
+	up_write_kernfs_rwsem(kn);
 	kernfs_put(kn);
 	goto repeat;
 }
diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index 3d783d80f5daa..fa9a6a48119c0 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -99,11 +99,10 @@ int __kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr)
 int kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr)
 {
 	int ret;
-	struct kernfs_root *root = kernfs_root(kn);
 
-	down_write(&root->kernfs_rwsem);
+	down_write_kernfs_rwsem(kn, LOCK_SELF, 0);
 	ret = __kernfs_setattr(kn, iattr);
-	up_write(&root->kernfs_rwsem);
+	up_write_kernfs_rwsem(kn);
 	return ret;
 }
 
@@ -112,14 +111,12 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
 {
 	struct inode *inode = d_inode(dentry);
 	struct kernfs_node *kn = inode->i_private;
-	struct kernfs_root *root;
 	int error;
 
 	if (!kn)
 		return -EINVAL;
 
-	root = kernfs_root(kn);
-	down_write(&root->kernfs_rwsem);
+	down_write_kernfs_rwsem(kn, LOCK_SELF, 0);
 	error = setattr_prepare(&init_user_ns, dentry, iattr);
 	if (error)
 		goto out;
@@ -132,7 +129,7 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
 	setattr_copy(&init_user_ns, inode, iattr);
 
 out:
-	up_write(&root->kernfs_rwsem);
+	up_write_kernfs_rwsem(kn);
 	return error;
 }
 
@@ -187,14 +184,13 @@ int kernfs_iop_getattr(struct user_namespace *mnt_userns,
 {
 	struct inode *inode = d_inode(path->dentry);
 	struct kernfs_node *kn = inode->i_private;
-	struct kernfs_root *root = kernfs_root(kn);
 
-	down_read(&root->kernfs_rwsem);
+	down_read_kernfs_rwsem(kn, LOCK_SELF, 0);
 	spin_lock(&inode->i_lock);
 	kernfs_refresh_inode(kn, inode);
 	generic_fillattr(&init_user_ns, inode, stat);
 	spin_unlock(&inode->i_lock);
-	up_read(&root->kernfs_rwsem);
+	up_read_kernfs_rwsem(kn);
 
 	return 0;
 }
@@ -278,21 +274,19 @@ int kernfs_iop_permission(struct user_namespace *mnt_userns,
 			  struct inode *inode, int mask)
 {
 	struct kernfs_node *kn;
-	struct kernfs_root *root;
 	int ret;
 
 	if (mask & MAY_NOT_BLOCK)
 		return -ECHILD;
 
 	kn = inode->i_private;
-	root = kernfs_root(kn);
 
-	down_read(&root->kernfs_rwsem);
+	down_read_kernfs_rwsem(kn, LOCK_SELF, 0);
 	spin_lock(&inode->i_lock);
 	kernfs_refresh_inode(kn, inode);
 	ret = generic_permission(&init_user_ns, inode, mask);
 	spin_unlock(&inode->i_lock);
-	up_read(&root->kernfs_rwsem);
+	up_read_kernfs_rwsem(kn);
 
 	return ret;
 }
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index cc49a6cd94154..3f011b323173c 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -19,6 +19,9 @@
 #include <linux/kernfs.h>
 #include <linux/fs_context.h>
 
+#define LOCK_SELF 0
+#define LOCK_SELF_AND_PARENT 1
+
 struct kernfs_iattrs {
 	kuid_t			ia_uid;
 	kgid_t			ia_gid;
@@ -102,6 +105,115 @@ static inline bool kernfs_dir_changed(struct kernfs_node *parent,
 	return false;
 }
 
+/*
+ * If both a node and its parent need locking,
+ * lock the child first so that kernfs_rename_ns
+ * does not change the parent and leave us
+ * with a stale parent here.
+ */
+static inline void down_write_kernfs_rwsem(struct kernfs_node *kn,
+					   u8 lock_parent,
+					   u8 nesting)
+{
+	int idx, p_idx;
+	struct kernfs_root *root;
+
+	idx = hash_ptr(kn, NR_KERNFS_LOCK_BITS);
+	root = kernfs_root(kn);
+
+	down_write_nested(&root->kernfs_rwsem[idx], nesting);
+
+	kernfs_get(kn);
+
+	if (kn->parent)
+		p_idx = hash_ptr(kn->parent, NR_KERNFS_LOCK_BITS);
+
+	if (kn->parent && lock_parent && p_idx != idx) {
+		/*
+		 * Node and parent hash to different locks.
+		 * node's lock has already been taken.
+		 * Take parent's lock and update token.
+		 */
+		down_write_nested(&root->kernfs_rwsem[p_idx],
+				  nesting + 1);
+
+		kernfs_get(kn->parent);
+		kn->unlock_parent = 1;
+	}
+}
+
+static inline void up_write_kernfs_rwsem(struct kernfs_node *kn)
+{
+	int p_idx, idx;
+	struct kernfs_root *root;
+
+	/* node lock is already taken in down_xxx so kn->parent is safe */
+	p_idx = hash_ptr(kn->parent, NR_KERNFS_LOCK_BITS);
+	idx = hash_ptr(kn, NR_KERNFS_LOCK_BITS);
+	root = kernfs_root(kn);
+
+	if (kn->unlock_parent) {
+		kn->unlock_parent = 0;
+		up_write(&root->kernfs_rwsem[p_idx]);
+		kernfs_put(kn->parent);
+	}
+
+	up_write(&root->kernfs_rwsem[idx]);
+	kernfs_put(kn);
+}
+
+static inline void down_read_kernfs_rwsem(struct kernfs_node *kn,
+					  u8 lock_parent,
+					  u8 nesting)
+{
+	int idx, p_idx;
+	struct kernfs_root *root;
+
+	idx = hash_ptr(kn, NR_KERNFS_LOCK_BITS);
+	root = kernfs_root(kn);
+
+	down_read_nested(&root->kernfs_rwsem[idx], nesting);
+
+	kernfs_get(kn);
+
+	if (kn->parent)
+		p_idx = hash_ptr(kn->parent, NR_KERNFS_LOCK_BITS);
+
+	if (kn->parent && lock_parent && p_idx != idx) {
+		/*
+		 * Node and parent hash to different locks.
+		 * node's lock has already been taken.
+		 * Take parent's lock and update token.
+		 */
+		down_read_nested(&root->kernfs_rwsem[p_idx],
+				 nesting + 1);
+
+		kernfs_get(kn->parent);
+
+		kn->unlock_parent = 1;
+	}
+}
+
+static inline void up_read_kernfs_rwsem(struct kernfs_node *kn)
+{
+	int p_idx, idx;
+	struct kernfs_root *root;
+
+	/* node lock is already taken in down_xxx so kn->parent is safe */
+	p_idx = hash_ptr(kn->parent, NR_KERNFS_LOCK_BITS);
+	idx = hash_ptr(kn, NR_KERNFS_LOCK_BITS);
+	root = kernfs_root(kn);
+
+	if (kn->unlock_parent) {
+		kn->unlock_parent = 0;
+		up_read(&root->kernfs_rwsem[p_idx]);
+		kernfs_put(kn->parent);
+	}
+
+	up_read(&root->kernfs_rwsem[idx]);
+	kernfs_put(kn);
+}
+
 extern const struct super_operations kernfs_sops;
 extern struct kmem_cache *kernfs_node_cache, *kernfs_iattrs_cache;
 
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index cfa79715fc1a7..ebb7d9a10f47e 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -236,7 +236,6 @@ struct dentry *kernfs_node_dentry(struct kernfs_node *kn,
 static int kernfs_fill_super(struct super_block *sb, struct kernfs_fs_context *kfc)
 {
 	struct kernfs_super_info *info = kernfs_info(sb);
-	struct kernfs_root *kf_root = kfc->root;
 	struct inode *inode;
 	struct dentry *root;
 
@@ -256,9 +255,9 @@ static int kernfs_fill_super(struct super_block *sb, struct kernfs_fs_context *k
 	sb->s_shrink.seeks = 0;
 
 	/* get root inode, initialize and unlock it */
-	down_read(&kf_root->kernfs_rwsem);
+	down_read_kernfs_rwsem(info->root->kn, 0, 0);
 	inode = kernfs_get_inode(sb, info->root->kn);
-	up_read(&kf_root->kernfs_rwsem);
+	up_read_kernfs_rwsem(info->root->kn);
 	if (!inode) {
 		pr_debug("kernfs: could not get root inode\n");
 		return -ENOMEM;
@@ -346,9 +345,9 @@ int kernfs_get_tree(struct fs_context *fc)
 		}
 		sb->s_flags |= SB_ACTIVE;
 
-		down_write(&root->kernfs_rwsem);
+		down_write(&root->supers_rwsem);
 		list_add(&info->node, &info->root->supers);
-		up_write(&root->kernfs_rwsem);
+		up_write(&root->supers_rwsem);
 	}
 
 	fc->root = dget(sb->s_root);
@@ -375,9 +374,9 @@ void kernfs_kill_sb(struct super_block *sb)
 	struct kernfs_super_info *info = kernfs_info(sb);
 	struct kernfs_root *root = info->root;
 
-	down_write(&root->kernfs_rwsem);
+	down_write(&root->supers_rwsem);
 	list_del(&info->node);
-	up_write(&root->kernfs_rwsem);
+	up_write(&root->supers_rwsem);
 
 	/*
 	 * Remove the superblock from fs_supers/s_instances
diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
index 0ab13824822f7..5d4a769e2ab1e 100644
--- a/fs/kernfs/symlink.c
+++ b/fs/kernfs/symlink.c
@@ -113,12 +113,11 @@ static int kernfs_getlink(struct inode *inode, char *path)
 	struct kernfs_node *kn = inode->i_private;
 	struct kernfs_node *parent = kn->parent;
 	struct kernfs_node *target = kn->symlink.target_kn;
-	struct kernfs_root *root = kernfs_root(parent);
 	int error;
 
-	down_read(&root->kernfs_rwsem);
+	down_read_kernfs_rwsem(parent, LOCK_SELF, 0);
 	error = kernfs_get_target_path(parent, target, path);
-	up_read(&root->kernfs_rwsem);
+	up_read_kernfs_rwsem(parent);
 
 	return error;
 }
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 5bf9f02ce9dce..3b3c3e0b44083 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -179,6 +179,7 @@ struct kernfs_node {
 	 */
 	struct kernfs_node	*parent;
 	const char		*name;
+	u8			unlock_parent; /* release parent's rwsem */
 
 	struct rb_node		rb;
 
@@ -237,9 +238,10 @@ struct kernfs_root {
 	struct list_head	supers;
 
 	wait_queue_head_t	deactivate_waitq;
-	struct rw_semaphore	kernfs_rwsem;
 	struct kernfs_open_node_lock open_node_locks[NR_KERNFS_LOCKS];
 	struct kernfs_open_file_mutex open_file_mutex[NR_KERNFS_LOCKS];
+	struct rw_semaphore	supers_rwsem;
+	struct rw_semaphore	kernfs_rwsem[NR_KERNFS_LOCKS];
 };
 
 struct kernfs_open_file {
@@ -619,5 +621,4 @@ static inline int kernfs_rename(struct kernfs_node *kn,
 {
 	return kernfs_rename_ns(kn, new_parent, new_name, NULL);
 }
-
 #endif	/* __LINUX_KERNFS_H */
-- 
2.30.2



* Re: [PATCH v5 1/2] kernfs: use hashed mutex and spinlock in place of global ones.
  2022-02-06  1:09 ` [PATCH v5 1/2] " Imran Khan
@ 2022-02-08 11:27   ` Greg KH
  2022-02-14 12:13     ` Imran Khan
  2022-02-08 17:22   ` Tejun Heo
  1 sibling, 1 reply; 10+ messages in thread
From: Greg KH @ 2022-02-08 11:27 UTC (permalink / raw)
  To: Imran Khan; +Cc: tj, linux-kernel

On Sun, Feb 06, 2022 at 12:09:24PM +1100, Imran Khan wrote:
> diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
> index f9cc912c31e1b..cc49a6cd94154 100644
> --- a/fs/kernfs/kernfs-internal.h
> +++ b/fs/kernfs/kernfs-internal.h
> @@ -31,6 +31,7 @@ struct kernfs_iattrs {
>  	atomic_t		user_xattr_size;
>  };
>  
> +
>  /* +1 to avoid triggering overflow warning when negating it */
>  #define KN_DEACTIVATED_BIAS		(INT_MIN + 1)
>  

Nit, the above change isn't needed :)

> @@ -147,4 +148,54 @@ void kernfs_drain_open_files(struct kernfs_node *kn);
>   */
>  extern const struct inode_operations kernfs_symlink_iops;
>  
> +static inline spinlock_t *kernfs_open_node_lock_ptr(struct kernfs_node *kn)
> +{
> +	struct kernfs_root *root;
> +	int idx = hash_ptr(kn, NR_KERNFS_LOCK_BITS);
> +
> +	root = kernfs_root(kn);
> +
> +	return &root->open_node_locks[idx].lock;
> +}
> +
> +static inline spinlock_t *kernfs_open_node_lock(struct kernfs_node *kn)
> +{
> +	struct kernfs_root *root;
> +	spinlock_t *lock;
> +	int idx = hash_ptr(kn, NR_KERNFS_LOCK_BITS);
> +
> +	root = kernfs_root(kn);
> +
> +	lock = &root->open_node_locks[idx].lock;
> +
> +	spin_lock_irq(lock);
> +
> +	return lock;
> +}

Can't you use kernfs_open_node_lock_ptr() in kernfs_open_node_lock() to
make this use less duplicated code?
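
For example, something like this (untested):

	static inline spinlock_t *kernfs_open_node_lock(struct kernfs_node *kn)
	{
		spinlock_t *lock = kernfs_open_node_lock_ptr(kn);

		spin_lock_irq(lock);

		return lock;
	}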


> +
> +static inline struct mutex *kernfs_open_file_mutex_ptr(struct kernfs_node *kn)
> +{
> +	struct kernfs_root *root;
> +	int idx = hash_ptr(kn, NR_KERNFS_LOCK_BITS);
> +
> +	root = kernfs_root(kn);
> +
> +	return &root->open_file_mutex[idx].lock;
> +}
> +
> +static inline struct mutex *kernfs_open_file_mutex_lock(struct kernfs_node *kn)
> +{
> +	struct kernfs_root *root;
> +	struct mutex *lock;
> +	int idx = hash_ptr(kn, NR_KERNFS_LOCK_BITS);
> +
> +	root = kernfs_root(kn);
> +
> +	lock = &root->open_file_mutex[idx].lock;
> +
> +	mutex_lock(lock);
> +
> +	return lock;
> +}

Same thing here.



> +
>  #endif	/* __KERNFS_INTERNAL_H */
> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> index 861c4f0f8a29f..5bf9f02ce9dce 100644
> --- a/include/linux/kernfs.h
> +++ b/include/linux/kernfs.h
> @@ -18,6 +18,8 @@
>  #include <linux/uidgid.h>
>  #include <linux/wait.h>
>  #include <linux/rwsem.h>
> +#include <linux/spinlock.h>
> +#include <linux/cache.h>
>  
>  struct file;
>  struct dentry;
> @@ -34,6 +36,40 @@ struct kernfs_fs_context;
>  struct kernfs_open_node;
>  struct kernfs_iattrs;
>  
> +/*
> + * NR_KERNFS_LOCK_BITS determines size (NR_KERNFS_LOCKS) of hash
> + * table of locks.
> + * Having a small hash table would impact scalability, since
> + * more and more kernfs_node objects will end up using the same lock,
> + * and having a very large hash table would waste memory.
> + *
> + * At the moment size of hash table of locks is being set based on
> + * the number of CPUs as follows:
> + *
> + * NR_CPU      NR_KERNFS_LOCK_BITS      NR_KERNFS_LOCKS
> + *   1                  1                       2
> + *  2-3                 2                       4
> + *  4-7                 4                       16
> + *  8-15                6                       64
> + *  16-31               8                       256
> + *  32 and more         10                      1024
> + */
> +#ifdef CONFIG_SMP
> +#define NR_KERNFS_LOCK_BITS (2 * (ilog2(NR_CPUS < 32 ? NR_CPUS : 32)))
> +#else
> +#define NR_KERNFS_LOCK_BITS     1
> +#endif
> +
> +#define NR_KERNFS_LOCKS     (1 << NR_KERNFS_LOCK_BITS)
> +
> +struct kernfs_open_node_lock {
> +	spinlock_t lock;
> +} ____cacheline_aligned_in_smp;
> +
> +struct kernfs_open_file_mutex {
> +	struct mutex lock;
> +} ____cacheline_aligned_in_smp;
> +
>  enum kernfs_node_type {
>  	KERNFS_DIR		= 0x0001,
>  	KERNFS_FILE		= 0x0002,
> @@ -90,6 +126,7 @@ enum kernfs_root_flag {
>  	KERNFS_ROOT_SUPPORT_USER_XATTR		= 0x0008,
>  };
>  
> +
>  /* type-specific structures for kernfs_node union members */
>  struct kernfs_elem_dir {
>  	unsigned long		subdirs;
> @@ -201,6 +238,8 @@ struct kernfs_root {
>  
>  	wait_queue_head_t	deactivate_waitq;
>  	struct rw_semaphore	kernfs_rwsem;
> +	struct kernfs_open_node_lock open_node_locks[NR_KERNFS_LOCKS];
> +	struct kernfs_open_file_mutex open_file_mutex[NR_KERNFS_LOCKS];
>  };

I think struct kernfs_root can be declared locally inside fs/kernfs/ to
keep other subsystems/files from having to see this structure, right?
That would make this a bit less of a "rebuild the world" type of change
and could be done in a patch before this one.
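
For example (rough sketch), keep only an opaque declaration in the
public header and move the full definition into the internal one:

	/* include/linux/kernfs.h */
	struct kernfs_root;	/* opaque outside fs/kernfs/ */

	/* fs/kernfs/kernfs-internal.h */
	struct kernfs_root {
		struct kernfs_node	*kn;
		/* ... rest of the definition, including the lock arrays ... */
	};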

Overall, this looks sane to me, nice work.

Tejun, any comments?

thanks,

greg k-h


* Re: [PATCH v5 1/2] kernfs: use hashed mutex and spinlock in place of global ones.
  2022-02-06  1:09 ` [PATCH v5 1/2] " Imran Khan
  2022-02-08 11:27   ` Greg KH
@ 2022-02-08 17:22   ` Tejun Heo
  2022-02-14 12:19     ` Imran Khan
  1 sibling, 1 reply; 10+ messages in thread
From: Tejun Heo @ 2022-02-08 17:22 UTC (permalink / raw)
  To: Imran Khan; +Cc: gregkh, linux-kernel

On Sun, Feb 06, 2022 at 12:09:24PM +1100, Imran Khan wrote:
> +/*
> + * NR_KERNFS_LOCK_BITS determines size (NR_KERNFS_LOCKS) of hash
> + * table of locks.
> + * Having a small hash table would impact scalability, since
> + * more and more kernfs_node objects will end up using the same lock,
> + * and having a very large hash table would waste memory.
> + *
> + * At the moment size of hash table of locks is being set based on
> + * the number of CPUs as follows:
> + *
> + * NR_CPU      NR_KERNFS_LOCK_BITS      NR_KERNFS_LOCKS
> + *   1                  1                       2
> + *  2-3                 2                       4
> + *  4-7                 4                       16
> + *  8-15                6                       64
> + *  16-31               8                       256
> + *  32 and more         10                      1024
> + */
> +#ifdef CONFIG_SMP
> +#define NR_KERNFS_LOCK_BITS (2 * (ilog2(NR_CPUS < 32 ? NR_CPUS : 32)))
> +#else
> +#define NR_KERNFS_LOCK_BITS     1
> +#endif
> +
> +#define NR_KERNFS_LOCKS     (1 << NR_KERNFS_LOCK_BITS)

I have a couple questions:

* How did you come up with the above numbers? Are they based on some
  experimentation? It'd be nice if the comment explains why the numbers are
  like that.

* IIRC, we split these locks per kernfs instance recently as a way to
  mitigate lock contention occurring across kernfs instances. I don't think
  it's beneficial to keep these hashed locks separate. It'd be both simpler
  and less contended to double one shared hashtable than to split the table
  into two separate half-sized ones. So, maybe switch to global hashtables
  and increase the size?
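
IOW, roughly something like this (a sketch only; names illustrative):

  /* fs/kernfs/file.c: one shared table for all kernfs instances */
  static struct kernfs_open_file_mutex kernfs_open_file_mutex[NR_KERNFS_LOCKS];

  static struct mutex *kernfs_open_file_mutex_ptr(struct kernfs_node *kn)
  {
          int idx = hash_ptr(kn, NR_KERNFS_LOCK_BITS);

          return &kernfs_open_file_mutex[idx].lock;
  }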

Thanks.

-- 
tejun


* Re: [PATCH v5 2/2] kernfs: Replace per-fs global rwsem with per-fs hashed rwsem.
  2022-02-06  1:09 ` [PATCH v5 2/2] kernfs: Replace per-fs global rwsem with per-fs hashed rwsem Imran Khan
@ 2022-02-08 18:26   ` Tejun Heo
  2022-02-14 12:27     ` Imran Khan
  0 siblings, 1 reply; 10+ messages in thread
From: Tejun Heo @ 2022-02-08 18:26 UTC (permalink / raw)
  To: Imran Khan; +Cc: gregkh, linux-kernel

Hello,

On Sun, Feb 06, 2022 at 12:09:25PM +1100, Imran Khan wrote:
> Having a single rwsem to synchronize all operations across a kernfs
> based file system (cgroup, sysfs etc.) does not scale well. Replace
> it with a hashed rwsem to reduce contention around the single per-fs
> rwsem.
> Also introduce a per-fs rwsem to protect the per-fs list of
> kernfs_super_info.

Can you split the two conversions into separate patches? Also, I'm not sure
about the per-fs hashtable as before.

>  static bool kernfs_active(struct kernfs_node *kn)
>  {
> -	lockdep_assert_held(&kernfs_root(kn)->kernfs_rwsem);
> +	int idx = hash_ptr(kn, NR_KERNFS_LOCK_BITS);
> +
> +	lockdep_assert_held(&kernfs_root(kn)->kernfs_rwsem[idx]);

Please encapsulate this into a function. If possible, it'd be ideal if the
conversion can be done in steps - e.g. first introduce the lock
encapsulation interface while leaving the locking itself alone, and then
switch the actual locking.
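
For example, something along these lines (a sketch; names are illustrative,
not from the patch):

  static inline struct rw_semaphore *kernfs_rwsem_ptr(struct kernfs_node *kn)
  {
          int idx = hash_ptr(kn, NR_KERNFS_LOCK_BITS);

          return &kernfs_root(kn)->kernfs_rwsem[idx];
  }

  static inline void kernfs_rwsem_assert_held(struct kernfs_node *kn)
  {
          lockdep_assert_held(kernfs_rwsem_ptr(kn));
  }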

> -static void kernfs_drain(struct kernfs_node *kn)
> -	__releases(&kernfs_root(kn)->kernfs_rwsem)
> -	__acquires(&kernfs_root(kn)->kernfs_rwsem)
> +static void kernfs_drain(struct kernfs_node *kn, struct kernfs_node *anc)
> +	__releases(&kernfs_root(anc)->kernfs_rwsem[a_idx])
> +	__acquires(&kernfs_root(anc)->kernfs_rwsem[a_idx])
>  {
>  	struct kernfs_root *root = kernfs_root(kn);
> +	int a_idx = hash_ptr(anc, NR_KERNFS_LOCK_BITS);
>  
> -	lockdep_assert_held_write(&root->kernfs_rwsem);
> -	WARN_ON_ONCE(kernfs_active(kn));
> +	lockdep_assert_held_write(&root->kernfs_rwsem[a_idx]);
> +	WARN_ON_ONCE(atomic_read(&kn->active) >= 0);

Ditto, I'd much prefer to see the lock lookup and accompanying operations
encapsulated somehow.
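
With a lookup helper like the hypothetical kernfs_rwsem_ptr() above, the
accompanying operations can be wrapped as well, e.g. (sketch):

  static inline void down_write_kernfs_rwsem_self(struct kernfs_node *kn)
  {
          down_write(kernfs_rwsem_ptr(kn));
  }

  static inline void up_write_kernfs_rwsem_self(struct kernfs_node *kn)
  {
          up_write(kernfs_rwsem_ptr(kn));
  }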

> @@ -1588,37 +1597,65 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
>  		     const char *new_name, const void *new_ns)
>  {
>  	struct kernfs_node *old_parent;
> -	struct kernfs_root *root;
>  	const char *old_name = NULL;
> -	int error;
> +	int error, idx, np_idx, p_idx;
>  
>  	/* can't move or rename root */
>  	if (!kn->parent)
>  		return -EINVAL;
>  
> -	root = kernfs_root(kn);
> -	down_write(&root->kernfs_rwsem);
> +	/*
> +	 * Take lock of node's old (current) parent.
> +	 * If new parent has a different lock, then take that
> +	 * lock as well.
> +	 */
> +	idx = hash_ptr(kn, NR_KERNFS_LOCK_BITS);
> +	p_idx = hash_ptr(kn->parent, NR_KERNFS_LOCK_BITS);
> +	np_idx = hash_ptr(new_parent, NR_KERNFS_LOCK_BITS);
> +
> +	/*
> +	 * Take only kn's lock. The subsequent kernfs_put
> +	 * may free up old_parent so if old_parent has a
> +	 * different lock, we will explicitly release that.
> +	 */
> +	down_write_kernfs_rwsem(kn, LOCK_SELF, 0);
> +
> +	if (idx != np_idx) /* new parent hashes to different lock */
> +		down_write_kernfs_rwsem(new_parent, LOCK_SELF, 1);
> +
> +	/* old_parent hashes to a different lock */
> +	if (idx != p_idx && p_idx != np_idx)
> +		down_write_kernfs_rwsem(kn->parent, LOCK_SELF, 2);

Can't this lead to an ABBA deadlock? When double locking, the locking order
should always be consistent. If we were doing per-kernfs_node locks, the
child -> parent ordering would work, but we're hashing locks, so that
doesn't work anymore - one child-parent combo can lock A then B while
another child-parent combo hashes the other way around and locks B then A.
The only order we can define is in terms of the locks themselves - e.g. if
the address (or index) of lock A < lock B, then we lock A first whether
that maps to the child or the parent.

Also, please encapsulate double locking in a set of functions. We really
don't wanna see all the details in the users.
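
For instance, the address-ordered double locking could be hidden behind
something like this (sketch only, reusing the hypothetical
kernfs_rwsem_ptr() from above):

  static void down_write_kernfs_rwsem_double(struct kernfs_node *kn1,
                                             struct kernfs_node *kn2)
  {
          struct rw_semaphore *rwsem1 = kernfs_rwsem_ptr(kn1);
          struct rw_semaphore *rwsem2 = kernfs_rwsem_ptr(kn2);

          if (rwsem1 == rwsem2) {
                  down_write(rwsem1);
                  return;
          }

          /* always take the lower-addressed lock first to avoid ABBA */
          if (rwsem1 > rwsem2)
                  swap(rwsem1, rwsem2);

          down_write(rwsem1);
          down_write_nested(rwsem2, SINGLE_DEPTH_NESTING);
  }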

> --- a/fs/kernfs/kernfs-internal.h
> +++ b/fs/kernfs/kernfs-internal.h
> @@ -19,6 +19,9 @@
>  #include <linux/kernfs.h>
>  #include <linux/fs_context.h>
>  
> +#define LOCK_SELF 0
> +#define LOCK_SELF_AND_PARENT 1

I get that this is a private header, but can you please add some
identifying prefix and make them enums? That makes it way easier for
debuggers, bpf and tracing.
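
E.g. something like (the prefix is just an example):

  enum kernfs_rwsem_lock_pattern {
          KERNFS_RWSEM_LOCK_SELF,
          KERNFS_RWSEM_LOCK_SELF_AND_PARENT,
  };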

> +/*
> + * If both a node and its parent need locking,
> + * lock the child first so that kernfs_rename_ns
> + * does not change the parent, leaving us
> + * with the old parent here.
> + */

Please reflow it close to 80 chars and ditto with above. We can't follow
child -> parent order with hashed locks.

Thanks.

-- 
tejun


* Re: [PATCH v5 1/2] kernfs: use hashed mutex and spinlock in place of global ones.
  2022-02-08 11:27   ` Greg KH
@ 2022-02-14 12:13     ` Imran Khan
  0 siblings, 0 replies; 10+ messages in thread
From: Imran Khan @ 2022-02-14 12:13 UTC (permalink / raw)
  To: Greg KH; +Cc: tj, linux-kernel

Hello Greg,
Thanks for reviewing this.

On 8/2/22 10:27 pm, Greg KH wrote:
> On Sun, Feb 06, 2022 at 12:09:24PM +1100, Imran Khan wrote:
>> diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
>> index f9cc912c31e1b..cc49a6cd94154 100644
>> --- a/fs/kernfs/kernfs-internal.h
>> +++ b/fs/kernfs/kernfs-internal.h
>> @@ -31,6 +31,7 @@ struct kernfs_iattrs {
>>  	atomic_t		user_xattr_size;
>>  };
>>  
>> +
>>  /* +1 to avoid triggering overflow warning when negating it */
>>  #define KN_DEACTIVATED_BIAS		(INT_MIN + 1)
>>  
> 
> Nit, the above change isn't needed :)

This has been removed in version 6 of the patch at [1].
> 
>> @@ -147,4 +148,54 @@ void kernfs_drain_open_files(struct kernfs_node *kn);
>>   */
>>  extern const struct inode_operations kernfs_symlink_iops;
>>  
[...]
>> +}
> 
> Can't you use kernfs_open_node_lock_ptr() in kernfs_open_node_lock() to
> avoid the duplicated code?
> 

Yes. I have removed the duplicated code in version 6 of the patch at [1].
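
With the hash table made global as Tejun suggested, the deduplicated
helpers look roughly like this (illustrative; see [1] for the actual code):

  static inline spinlock_t *kernfs_open_node_lock_ptr(struct kernfs_node *kn)
  {
          int idx = hash_ptr(kn, NR_KERNFS_LOCK_BITS);

          return &kernfs_open_node_locks[idx].lock;
  }

  static inline spinlock_t *kernfs_open_node_lock(struct kernfs_node *kn)
  {
          spinlock_t *lock = kernfs_open_node_lock_ptr(kn);

          spin_lock_irq(lock);
          return lock;
  }
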
> 
>> +

[...]
>> +	return lock;
>> +}
> 
> Same thing here.
> 
> 
Done.

> 
>> +
>>  #endif	/* __KERNFS_INTERNAL_H */
>> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
>> index 861c4f0f8a29f..5bf9f02ce9dce 100644
>> --- a/include/linux/kernfs.h
>> +++ b/include/linux/kernfs.h
>> @@ -18,6 +18,8 @@
>>  #include <linux/uidgid.h>
>>  #include <linux/wait.h>
>>  #include <linux/rwsem.h>
>> +#include <linux/spinlock.h>
>> +#include <linux/cache.h>
>>  
>>  struct file;
>>  struct dentry;
>> @@ -34,6 +36,40 @@ struct kernfs_fs_context;
>>  struct kernfs_open_node;
>>  struct kernfs_iattrs;
>>  
>> +/*
[...]
>> @@ -90,6 +126,7 @@ enum kernfs_root_flag {
>>  	KERNFS_ROOT_SUPPORT_USER_XATTR		= 0x0008,
>>  };
>>  
>> +
>>  /* type-specific structures for kernfs_node union members */
>>  struct kernfs_elem_dir {
>>  	unsigned long		subdirs;
>> @@ -201,6 +238,8 @@ struct kernfs_root {
>>  
>>  	wait_queue_head_t	deactivate_waitq;
>>  	struct rw_semaphore	kernfs_rwsem;
>> +	struct kernfs_open_node_lock open_node_locks[NR_KERNFS_LOCKS];
>> +	struct kernfs_open_file_mutex open_file_mutex[NR_KERNFS_LOCKS];
>>  };
> 
> I think struct kernfs_root can be declared locally inside fs/kernfs/ to
> keep other subsystems/files from having to see this structure, right?
> That would make this a bit less of a "rebuild the world" type of change
> and could be done in a patch before this one.

Tejun has suggested avoiding the per-fs hash tables and taking them
outside kernfs_root. So this patch no longer changes kernfs_root, and I
have left it in its current place.
Please let me know if moving kernfs_root is still needed.

Thanks,
-- Imran

[1]:
https://lore.kernel.org/lkml/20220214120322.2402628-1-imran.f.khan@oracle.com/

> 
> Overall, this looks sane to me, nice work.
> 
> Tejun, any comments?
> 
> thanks,
> 
> greg k-h


* Re: [PATCH v5 1/2] kernfs: use hashed mutex and spinlock in place of global ones.
  2022-02-08 17:22   ` Tejun Heo
@ 2022-02-14 12:19     ` Imran Khan
  2022-02-14 17:39       ` Tejun Heo
  0 siblings, 1 reply; 10+ messages in thread
From: Imran Khan @ 2022-02-14 12:19 UTC (permalink / raw)
  To: Tejun Heo; +Cc: gregkh, linux-kernel

Hello Tejun,

Thanks for your feedback.

On 9/2/22 4:22 am, Tejun Heo wrote:
> On Sun, Feb 06, 2022 at 12:09:24PM +1100, Imran Khan wrote:
>> +/*
>> + * NR_KERNFS_LOCK_BITS determines the size (NR_KERNFS_LOCKS) of the
>> + * hash table of locks.
>> + * Having a small hash table would impact scalability, since more and
>> + * more kernfs_node objects would end up using the same lock, while a
>> + * very large hash table would waste memory.
>> + *
>> + * At the moment the size of the lock hash table is set based on the
>> + * number of CPUs as follows:
>> + *
>> + * NR_CPU      NR_KERNFS_LOCK_BITS      NR_KERNFS_LOCKS
>> + *   1                  1                       2
>> + *  2-3                 2                       4
>> + *  4-7                 4                       16
>> + *  8-15                6                       64
>> + *  16-31               8                       256
>> + *  32 and more         10                      1024
>> + */
>> +#ifdef CONFIG_SMP
>> +#define NR_KERNFS_LOCK_BITS (2 * (ilog2(NR_CPUS < 32 ? NR_CPUS : 32)))
>> +#else
>> +#define NR_KERNFS_LOCK_BITS     1
>> +#endif
>> +
>> +#define NR_KERNFS_LOCKS     (1 << NR_KERNFS_LOCK_BITS)
> 
> I have a couple questions:
> 
> * How did you come up with the above numbers? Are they based on some
>   experimentation? It'd be nice if the comment explains why the numbers are
>   like that.
> 

I did some testing using different numbers of CPUs with qemu, and the
above numbers were collected from there. They may not be optimal, but
this is what I could come up with during those internal tests.

Do you think it would be better to make this configurable via Kconfig?
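
If we went that route, I am thinking of something along these lines
(purely illustrative):

  config KERNFS_LOCK_BITS
          int "Bits used to size the kernfs lock hash tables"
          depends on SMP
          range 1 12
          default 10
          help
            Each kernfs lock hash table contains 2^KERNFS_LOCK_BITS locks.
            A larger table reduces contention at the cost of memory.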

> * IIRC, we split these locks per kernfs instance recently as a way to
>   mitigate lock contention occurring across kernfs instances. I don't think
>   it's beneficial to keep these hashed locks separate. It'd be both simpler
>   and less contended to double one shared hashtable than to split the table
>   into two separate half-sized ones. So, maybe switch to global hashtables
>   and increase the size?

I have removed the hash table from kernfs_root. Doubling the size did
not give any improvement, so I have kept the hash table size the same
as before. This change is present in v6 of the patch at [1].

Thanks,
-- Imran

[1]:
https://lore.kernel.org/lkml/20220214120322.2402628-1-imran.f.khan@oracle.com/


> 
> Thanks.
> 


* Re: [PATCH v5 2/2] kernfs: Replace per-fs global rwsem with per-fs hashed rwsem.
  2022-02-08 18:26   ` Tejun Heo
@ 2022-02-14 12:27     ` Imran Khan
  0 siblings, 0 replies; 10+ messages in thread
From: Imran Khan @ 2022-02-14 12:27 UTC (permalink / raw)
  To: Tejun Heo; +Cc: gregkh, linux-kernel

Hi Tejun,

On 9/2/22 5:26 am, Tejun Heo wrote:
> Hello,
> 
> On Sun, Feb 06, 2022 at 12:09:25PM +1100, Imran Khan wrote:
>> Having a single rwsem to synchronize all operations across a kernfs
>> based file system (cgroup, sysfs etc.) does not scale well. Replace
>> it with a hashed rwsem to reduce contention around the single per-fs
>> rwsem.
>> Also introduce a per-fs rwsem to protect the per-fs list of
>> kernfs_super_info.
> 
> Can you split the two conversions into separate patches? Also, I'm not sure
> about the per-fs hashtable as before.
> 

Yes. I have done this in v6 of the patchset at [1].

>>  static bool kernfs_active(struct kernfs_node *kn)
>>  {
>> -	lockdep_assert_held(&kernfs_root(kn)->kernfs_rwsem);
>> +	int idx = hash_ptr(kn, NR_KERNFS_LOCK_BITS);
>> +
>> +	lockdep_assert_held(&kernfs_root(kn)->kernfs_rwsem[idx]);
> 
> Please encapsulate this into a function. If possible, it'd be ideal if the
> conversion can be done in steps - e.g. first introduce the lock
> encapsulation interface while leaving the locking itself alone, and then
> switch the actual locking.

Sure. I have split the interface and its usage into separate patches in
v6 of the patchset at [1].

> 
>> -static void kernfs_drain(struct kernfs_node *kn)
>> -	__releases(&kernfs_root(kn)->kernfs_rwsem)
[...]
>> +	lockdep_assert_held_write(&root->kernfs_rwsem[a_idx]);
>> +	WARN_ON_ONCE(atomic_read(&kn->active) >= 0);
> 
> Ditto, I'd much prefer to see the lock lookup and accompanying operations
> encapsulated somehow.

Done.
> 
>> @@ -1588,37 +1597,65 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
>>  		     const char *new_name, const void *new_ns)
>>  {
>>  	struct kernfs_node *old_parent;
>> -	struct kernfs_root *root;
>>  	const char *old_name = NULL;
>> -	int error;
>> +	int error, idx, np_idx, p_idx;
>>  
>>  	/* can't move or rename root */
>>  	if (!kn->parent)
>>  		return -EINVAL;
>>  
>> -	root = kernfs_root(kn);
>> -	down_write(&root->kernfs_rwsem);
>> +	/*
>> +	 * Take lock of node's old (current) parent.
>> +	 * If new parent has a different lock, then take that
>> +	 * lock as well.
>> +	 */
>> +	idx = hash_ptr(kn, NR_KERNFS_LOCK_BITS);
>> +	p_idx = hash_ptr(kn->parent, NR_KERNFS_LOCK_BITS);
>> +	np_idx = hash_ptr(new_parent, NR_KERNFS_LOCK_BITS);
>> +
>> +	/*
>> +	 * Take only kn's lock. The subsequent kernfs_put
>> +	 * may free up old_parent so if old_parent has a
>> +	 * different lock, we will explicitly release that.
>> +	 */
>> +	down_write_kernfs_rwsem(kn, LOCK_SELF, 0);
>> +
>> +	if (idx != np_idx) /* new parent hashes to different lock */
>> +		down_write_kernfs_rwsem(new_parent, LOCK_SELF, 1);
>> +
>> +	/* old_parent hashes to a different lock */
>> +	if (idx != p_idx && p_idx != np_idx)
>> +		down_write_kernfs_rwsem(kn->parent, LOCK_SELF, 2);
> 
> Can't this lead to an ABBA deadlock? When double locking, the locking order
> should always be consistent. If we were doing per-kernfs_node locks, the
> child -> parent ordering would work, but we're hashing locks, so that
> doesn't work anymore - one child-parent combo can lock A then B while
> another child-parent combo hashes the other way around and locks B then A.
> The only order we can define is in terms of the locks themselves - e.g. if
> the address (or index) of lock A < lock B, then we lock A first whether
> that maps to the child or the parent.
> 

Indeed. In v6 of this change I am no longer relying on the parent<->child
relationship for locking order. As suggested, the new interface uses lock
addresses to determine the locking order in nested locking cases.

> Also, please encapsulate double locking in a set of functions. We really
> don't wanna see all the details in the users.
> 

Sure. This has been taken care of in v6 of the patch at [1].
>> --- a/fs/kernfs/kernfs-internal.h
>> +++ b/fs/kernfs/kernfs-internal.h
>> @@ -19,6 +19,9 @@
>>  #include <linux/kernfs.h>
>>  #include <linux/fs_context.h>
>>  
>> +#define LOCK_SELF 0
>> +#define LOCK_SELF_AND_PARENT 1
> 
> I get that this is a private header, but can you please add some
> identifying prefix and make them enums? That makes it way easier for
> debuggers, bpf and tracing.
> 

Understood. These have been replaced with a properly prefixed enum.
>> +/*
>> + * If both a node and its parent need locking,
>> + * lock the child first so that kernfs_rename_ns
>> + * does not change the parent, leaving us
>> + * with the old parent here.
>> + */
> 
> Please reflow it close to 80 chars and ditto with above. We can't follow
> child -> parent order with hashed locks.

Done as mentioned earlier.

Thanks,
-- Imran

[1]:
https://lore.kernel.org/lkml/20220214120322.2402628-1-imran.f.khan@oracle.com/



* Re: [PATCH v5 1/2] kernfs: use hashed mutex and spinlock in place of global ones.
  2022-02-14 12:19     ` Imran Khan
@ 2022-02-14 17:39       ` Tejun Heo
  0 siblings, 0 replies; 10+ messages in thread
From: Tejun Heo @ 2022-02-14 17:39 UTC (permalink / raw)
  To: Imran Khan; +Cc: gregkh, linux-kernel

Hello,

On Mon, Feb 14, 2022 at 11:19:38PM +1100, Imran Khan wrote:
> > * How did you come up with the above numbers? Are they based on some
> >   experimentation? It'd be nice if the comment explains why the numbers are
> >   like that.
> > 
> 
> I did some testing using different numbers of CPUs with qemu, and the
> above numbers were collected from there. They may not be optimal, but
> this is what I could come up with during those internal tests.
> 
> Do you think it would be better to make this configurable via Kconfig?

I think it's good as-is for now, and it's super easy and cheap to increase
the hashtable size when some cases see high contention.

Thanks.

-- 
tejun

