linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v6 0/7] kernfs: Introduce hashed mutexes to replace global kernfs_open_file_mutex.
@ 2022-02-14 12:03 Imran Khan
  2022-02-14 12:03 ` [PATCH v6 1/7] " Imran Khan
                   ` (7 more replies)
  0 siblings, 8 replies; 19+ messages in thread
From: Imran Khan @ 2022-02-14 12:03 UTC (permalink / raw)
  To: tj, gregkh; +Cc: linux-kernel

Reduce contention around global locks used in kernfs.

The current kernfs design makes use of 3 global locks to synchronize
various operations. There are a few other global locks as well but
they are used for specific cases and hence don't cause severe contention.

The 3 main global locks used in kernfs are:

1. A global mutex, kernfs_open_file_mutex, to protect the list of
kernfs_open_file instances corresponding to a sysfs attribute.

2. A global spinlock, kernfs_open_node_lock, to protect
kernfs_node->attr.open which points to kernfs_open_node instance
corresponding to a kernfs_node.

3. A global per-fs rw semaphore, root->kernfs_rwsem, to synchronize most
of the other operations like adding, removing, renaming etc. of a file
or directory.

Since these locks are global, they can cause contention when multiple
(for example, a few hundred) applications try to access sysfs (or other
kernfs based file system) files in parallel, even if the applications are
accessing different and unrelated files.

For example, on a system with 384 cores, if I run 200 instances of an
application that mostly executes the following loop:

  for (int loop = 0; loop < 100; loop++)
  {
    for (int port_num = 1; port_num < 2; port_num++)
    {
      for (int gid_index = 0; gid_index < 254; gid_index++)
      {
        char ret_buf[64];
        char gid_file_path[1024];

        int      ret_len;
        int      ret_fd;
        ssize_t  ret_rd;
        int      saved_errno;

        memset(ret_buf, 0, sizeof(ret_buf));
        memset(gid_file_path, 0, sizeof(gid_file_path));

        ret_len = snprintf(gid_file_path, sizeof(gid_file_path),
                           "/sys/class/infiniband/%s/ports/%d/gids/%d",
                           dev_name,
                           port_num,
                           gid_index);

        ret_fd = open(gid_file_path, O_RDONLY | O_CLOEXEC);
        if (ret_fd < 0)
        {
          printf("Failed to open %s\n", gid_file_path);
          continue;
        }

        /* Read the GID */
        ret_rd = read(ret_fd, ret_buf, 40);
        saved_errno = errno;

        close(ret_fd);

        if (ret_rd == -1)
        {
          printf("Failed to read from file %s, errno: %d\n",
                 gid_file_path, saved_errno);
          continue;
        }
      }
    }
  }

I can see contention around the above-mentioned locks as follows:

-   54.07%    53.60%  showgids         [kernel.kallsyms]       [k] osq_lock
   - 53.60% __libc_start_main
      - 32.29% __GI___libc_open
           entry_SYSCALL_64_after_hwframe
           do_syscall_64
           sys_open
           do_sys_open
           do_filp_open
           path_openat
           vfs_open
           do_dentry_open
           kernfs_fop_open
           mutex_lock
         - __mutex_lock_slowpath
            - 32.23% __mutex_lock.isra.5
                 osq_lock
      - 21.31% __GI___libc_close
           entry_SYSCALL_64_after_hwframe
           do_syscall_64
           exit_to_usermode_loop
           task_work_run
           ____fput
           __fput
           kernfs_fop_release
           kernfs_put_open_node.isra.8
           mutex_lock
         - __mutex_lock_slowpath
            - 21.28% __mutex_lock.isra.5
                 osq_lock

-   10.49%    10.39%  showgids         [kernel.kallsyms]      [k] down_read
     10.39% __libc_start_main
        __GI___libc_open
        entry_SYSCALL_64_after_hwframe
        do_syscall_64
        sys_open
        do_sys_open
        do_filp_open
      - path_openat
         - 9.72% link_path_walk
            - 5.21% inode_permission
               - __inode_permission
                  - 5.21% kernfs_iop_permission
                       down_read
            - 4.08% walk_component
                 lookup_fast
               - d_revalidate.part.24
                  - 4.08% kernfs_dop_revalidate

-    7.48%     7.41%  showgids         [kernel.kallsyms]       [k] up_read
     7.41% __libc_start_main
        __GI___libc_open
        entry_SYSCALL_64_after_hwframe
        do_syscall_64
        sys_open
        do_sys_open
        do_filp_open
      - path_openat
         - 7.01% link_path_walk
            - 4.12% inode_permission
               - __inode_permission
                  - 4.12% kernfs_iop_permission
                       up_read
            - 2.61% walk_component
                 lookup_fast
               - d_revalidate.part.24
                  - 2.61% kernfs_dop_revalidate

Moreover, this run of 200 application instances takes 32-34 secs to
complete.

This patch set reduces the above-mentioned contention by replacing
these global locks with hashed locks.

For example, with the patched kernel on the same test setup, we no longer
see contention around osq_lock (i.e. kernfs_open_file_mutex), and
contention around the per-fs kernfs_rwsem has also reduced significantly.
This can be seen in the following perf snippet:

-    1.66%     1.65%  showgids         [kernel.kallsyms]      [k] down_read
     1.65% __libc_start_main
        __GI___libc_open
        entry_SYSCALL_64_after_hwframe
        do_syscall_64
        sys_open
        do_sys_open
        do_filp_open
      - path_openat
         - 1.62% link_path_walk
            - 0.98% inode_permission
               - __inode_permission
                  + 0.98% kernfs_iop_permission
            - 0.52% walk_component
                 lookup_fast
               - d_revalidate.part.24
                  - 0.52% kernfs_dop_revalidate

-    1.12%     1.11%  showgids         [kernel.kallsyms]      [k] up_read
     1.11% __libc_start_main
        __GI___libc_open
        entry_SYSCALL_64_after_hwframe
        do_syscall_64
        sys_open
        do_sys_open
        do_filp_open
      - path_openat
         - 1.11% link_path_walk
            - 0.69% inode_permission
               - __inode_permission
                  - 0.69% kernfs_iop_permission
                       up_read

Moreover, the test execution time has reduced from 32-34 secs to 18-19 secs.

The patches of this patchset introduce the following changes:

PATCH-1: Introduce hashed mutexes to replace global kernfs_open_file_mutex.

PATCH-2: Replace global kernfs_open_file_mutex with hashed mutexes.

PATCH-3: Introduce hashed spinlocks to replace global kernfs_open_node_lock.

PATCH-4: Replace global kernfs_open_node_lock with hashed spinlocks.

PATCH-5: Use a per-fs rwsem to protect list of kernfs_super_info.

PATCH-6: Introduce hashed rw-sem to replace per-fs kernfs_rwsem.

PATCH-7: Replace per-fs rwsem with hashed ones.

------------------------------------------------------------------
Changes since v5:
 - Addressed review comments from Greg
	- Clean up duplicate code.
 - Addressed review comments from Tejun
	- Explain how the current value of NR_KERNFS_LOCKS was obtained for
	  different values of NR_CPUS.
	- Use single hash table for locks in place of per-fs hash table
	  for locks.
	- Move introduction of supers_rwsem to a separate patch.
	- Separate interface and usage part of hashed rwsem implementation
	  into 2 patches.
	- Use address of rwsems to determine locking order in case of
	  nested locking. This approach avoids ABBA deadlock possibility.
	- Change some #define macros into enum, with proper prefix.
 - Taking a cue from Tejun's feedback about separating hashed rwsem
   interface and usage into 2 patches, I have also divided the patch
   that introduced hashed mutex and spinlock, into separate patches. 	  
 - Rebase on linux-next tag: next-20220211

Changes since v4:
 - Removed compilation warnings reported by the 0-day bot.

Changes since v3:
 - Make open_file_mutex and open_node_lock per-fs.
 - Replace per-fs rwsem with per-fs hashed rwsem.
   (Suggested by Tejun Heo <tj@kernel.org>)

Imran Khan (7):
  kernfs: Introduce hashed mutexes to replace global
    kernfs_open_file_mutex.
  kernfs: Replace global kernfs_open_file_mutex with hashed mutexes.
  kernfs: Introduce hashed spinlocks to replace global
    kernfs_open_node_lock.
  kernfs: Replace global kernfs_open_node_lock with hashed spinlocks.
  kernfs: Use a per-fs rwsem to protect per-fs list of
    kernfs_super_info.
  kernfs: Introduce hashed rw-sem to replace per-fs kernfs_rwsem.
  kernfs: Replace per-fs rwsem with hashed ones.

 fs/kernfs/dir.c             | 132 ++++----
 fs/kernfs/file.c            |  65 ++--
 fs/kernfs/inode.c           |  22 +-
 fs/kernfs/kernfs-internal.h | 604 +++++++++++++++++++++++++++++++++++-
 fs/kernfs/mount.c           |  29 +-
 fs/kernfs/symlink.c         |   5 +-
 include/linux/kernfs.h      |  69 ++++
 7 files changed, 802 insertions(+), 124 deletions(-)


base-commit: 6d9bd4ad4ca08b1114e814c2c42383b8b13be631
-- 
2.30.2


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v6 1/7] kernfs: Introduce hashed mutexes to replace global kernfs_open_file_mutex.
  2022-02-14 12:03 [PATCH v6 0/7] kernfs: Introduce hashed mutexes to replace global kernfs_open_file_mutex Imran Khan
@ 2022-02-14 12:03 ` Imran Khan
  2022-02-14 17:50   ` Tejun Heo
  2022-02-14 12:03 ` [PATCH v6 2/7] kernfs: Replace global kernfs_open_file_mutex with hashed mutexes Imran Khan
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 19+ messages in thread
From: Imran Khan @ 2022-02-14 12:03 UTC (permalink / raw)
  To: tj, gregkh; +Cc: linux-kernel

In the current kernfs design a single mutex, kernfs_open_file_mutex,
protects the list of kernfs_open_file instances corresponding to a sysfs
attribute. So even if different tasks are opening or closing different
sysfs files, they can contend on the osq_lock of this mutex. The contention
is more apparent in large-scale systems with a few hundred CPUs, where most
of the CPUs have running tasks that are opening, accessing or closing sysfs
files at any point of time.

Using hashed mutexes in place of a single global mutex can significantly
reduce contention around the global mutex and hence provide better
scalability. Moreover, as these hashed mutexes are not part of kernfs_node
objects, we will not see any significant change in memory utilization of
kernfs based file systems like sysfs, cgroupfs etc.

This patch introduces hashed mutexes that can be used in place of the
above-mentioned global mutex. It also provides the interfaces needed to use
hashed mutexes. The next patch makes use of these interfaces and replaces
the global mutex with hashed ones.

Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
---
 fs/kernfs/kernfs-internal.h | 23 +++++++++++++++
 fs/kernfs/mount.c           | 13 ++++++++
 include/linux/kernfs.h      | 59 +++++++++++++++++++++++++++++++++++++
 3 files changed, 95 insertions(+)

diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index f9cc912c31e1b..03e983953eda4 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -147,4 +147,27 @@ void kernfs_drain_open_files(struct kernfs_node *kn);
  */
 extern const struct inode_operations kernfs_symlink_iops;
 
+/*
+ * kernfs locks
+ */
+extern struct kernfs_global_locks *kernfs_locks;
+
+static inline struct mutex *kernfs_open_file_mutex_ptr(struct kernfs_node *kn)
+{
+	int idx = hash_ptr(kn, NR_KERNFS_LOCK_BITS);
+
+	return &kernfs_locks->open_file_mutex[idx].lock;
+}
+
+static inline struct mutex *kernfs_open_file_mutex_lock(struct kernfs_node *kn)
+{
+	struct mutex *lock;
+
+	lock = kernfs_open_file_mutex_ptr(kn);
+
+	mutex_lock(lock);
+
+	return lock;
+}
+
 #endif	/* __KERNFS_INTERNAL_H */
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index cfa79715fc1a7..fa3fa22c95b21 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -20,6 +20,7 @@
 #include "kernfs-internal.h"
 
 struct kmem_cache *kernfs_node_cache, *kernfs_iattrs_cache;
+struct kernfs_global_locks *kernfs_locks;
 
 static int kernfs_sop_show_options(struct seq_file *sf, struct dentry *dentry)
 {
@@ -387,6 +388,17 @@ void kernfs_kill_sb(struct super_block *sb)
 	kfree(info);
 }
 
+void __init kernfs_lock_init(void)
+{
+	int count;
+
+	kernfs_locks = kmalloc(sizeof(struct kernfs_global_locks), GFP_KERNEL);
+	WARN_ON(!kernfs_locks);
+
+	for (count = 0; count < NR_KERNFS_LOCKS; count++)
+		mutex_init(&kernfs_locks->open_file_mutex[count].lock);
+}
+
 void __init kernfs_init(void)
 {
 	kernfs_node_cache = kmem_cache_create("kernfs_node_cache",
@@ -397,4 +409,5 @@ void __init kernfs_init(void)
 	kernfs_iattrs_cache  = kmem_cache_create("kernfs_iattrs_cache",
 					      sizeof(struct kernfs_iattrs),
 					      0, SLAB_PANIC, NULL);
+	kernfs_lock_init();
 }
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 861c4f0f8a29f..3f72d38d48e31 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -18,6 +18,7 @@
 #include <linux/uidgid.h>
 #include <linux/wait.h>
 #include <linux/rwsem.h>
+#include <linux/cache.h>
 
 struct file;
 struct dentry;
@@ -34,6 +35,62 @@ struct kernfs_fs_context;
 struct kernfs_open_node;
 struct kernfs_iattrs;
 
+/*
+ * NR_KERNFS_LOCK_BITS determines size (NR_KERNFS_LOCKS) of hash
+ * table of locks.
+ * Having a small hash table would impact scalability, since
+ * more and more kernfs_node objects will end up using same lock
+ * and having a very large hash table would waste memory.
+ *
+ * At the moment size of hash table of locks is being set based on
+ * the number of CPUs as follows:
+ *
+ * NR_CPU      NR_KERNFS_LOCK_BITS      NR_KERNFS_LOCKS
+ *   1                  1                       2
+ *  2-3                 2                       4
+ *  4-7                 4                       16
+ *  8-15                6                       64
+ *  16-31               8                       256
+ *  32 and more         10                      1024
+ *
+ * The above relation between NR_CPU and number of locks is based
+ * on some internal experimentation which involved booting qemu
+ * with different values of smp, performing some sysfs operations
+ * on all CPUs and observing how increase in number of locks impacts
+ * completion time of these sysfs operations on each CPU.
+ */
+#ifdef CONFIG_SMP
+#define NR_KERNFS_LOCK_BITS (2 * (ilog2(NR_CPUS < 32 ? NR_CPUS : 32)))
+#else
+#define NR_KERNFS_LOCK_BITS     1
+#endif
+
+#define NR_KERNFS_LOCKS     (1 << NR_KERNFS_LOCK_BITS)
+
+/*
+ * There's one kernfs_open_file for each open file and one kernfs_open_node
+ * for each kernfs_node with one or more open files.
+ *
+ * filp->private_data points to seq_file whose ->private points to
+ * kernfs_open_file.
+ * kernfs_open_files are chained at kernfs_open_node->files, which is
+ * protected by kernfs_open_file_mutex.lock.
+ */
+
+struct kernfs_open_file_mutex {
+	struct mutex lock;
+} ____cacheline_aligned_in_smp;
+
+/*
+ * To reduce possible contention in sysfs access, arising due to single
+ * locks, use an array of locks and use kernfs_node object address as
+ * hash keys to get the index of these locks.
+ */
+
+struct kernfs_global_locks {
+	struct kernfs_open_file_mutex open_file_mutex[NR_KERNFS_LOCKS];
+};
+
 enum kernfs_node_type {
 	KERNFS_DIR		= 0x0001,
 	KERNFS_FILE		= 0x0002,
@@ -413,6 +470,8 @@ void kernfs_kill_sb(struct super_block *sb);
 
 void kernfs_init(void);
 
+void kernfs_lock_init(void);
+
 struct kernfs_node *kernfs_find_and_get_node_by_id(struct kernfs_root *root,
 						   u64 id);
 #else	/* CONFIG_KERNFS */
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v6 2/7] kernfs: Replace global kernfs_open_file_mutex with hashed mutexes.
  2022-02-14 12:03 [PATCH v6 0/7] kernfs: Introduce hashed mutexes to replace global kernfs_open_file_mutex Imran Khan
  2022-02-14 12:03 ` [PATCH v6 1/7] " Imran Khan
@ 2022-02-14 12:03 ` Imran Khan
  2022-02-14 12:03 ` [PATCH v6 3/7] kernfs: Introduce hashed spinlocks to replace global kernfs_open_node_lock Imran Khan
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: Imran Khan @ 2022-02-14 12:03 UTC (permalink / raw)
  To: tj, gregkh; +Cc: linux-kernel

Remove global kernfs_open_file_mutex, using hashed mutex and corresponding
interface introduced in previous patch.

Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
---
 fs/kernfs/file.c | 31 +++++++++++++++----------------
 1 file changed, 15 insertions(+), 16 deletions(-)

diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 9414a7a60a9f4..295fe67950346 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -19,18 +19,13 @@
 #include "kernfs-internal.h"
 
 /*
- * There's one kernfs_open_file for each open file and one kernfs_open_node
- * for each kernfs_node with one or more open files.
- *
  * kernfs_node->attr.open points to kernfs_open_node.  attr.open is
  * protected by kernfs_open_node_lock.
  *
  * filp->private_data points to seq_file whose ->private points to
- * kernfs_open_file.  kernfs_open_files are chained at
- * kernfs_open_node->files, which is protected by kernfs_open_file_mutex.
+ * kernfs_open_file.
  */
 static DEFINE_SPINLOCK(kernfs_open_node_lock);
-static DEFINE_MUTEX(kernfs_open_file_mutex);
 
 struct kernfs_open_node {
 	atomic_t		refcnt;
@@ -524,9 +519,10 @@ static int kernfs_get_open_node(struct kernfs_node *kn,
 				struct kernfs_open_file *of)
 {
 	struct kernfs_open_node *on, *new_on = NULL;
+	struct mutex *mutex = NULL;
 
  retry:
-	mutex_lock(&kernfs_open_file_mutex);
+	mutex = kernfs_open_file_mutex_lock(kn);
 	spin_lock_irq(&kernfs_open_node_lock);
 
 	if (!kn->attr.open && new_on) {
@@ -541,7 +537,7 @@ static int kernfs_get_open_node(struct kernfs_node *kn,
 	}
 
 	spin_unlock_irq(&kernfs_open_node_lock);
-	mutex_unlock(&kernfs_open_file_mutex);
+	mutex_unlock(mutex);
 
 	if (on) {
 		kfree(new_on);
@@ -575,9 +571,10 @@ static void kernfs_put_open_node(struct kernfs_node *kn,
 				 struct kernfs_open_file *of)
 {
 	struct kernfs_open_node *on = kn->attr.open;
+	struct mutex *mutex = NULL;
 	unsigned long flags;
 
-	mutex_lock(&kernfs_open_file_mutex);
+	mutex = kernfs_open_file_mutex_lock(kn);
 	spin_lock_irqsave(&kernfs_open_node_lock, flags);
 
 	if (of)
@@ -589,7 +586,7 @@ static void kernfs_put_open_node(struct kernfs_node *kn,
 		on = NULL;
 
 	spin_unlock_irqrestore(&kernfs_open_node_lock, flags);
-	mutex_unlock(&kernfs_open_file_mutex);
+	mutex_unlock(mutex);
 
 	kfree(on);
 }
@@ -729,11 +726,11 @@ static void kernfs_release_file(struct kernfs_node *kn,
 	/*
 	 * @of is guaranteed to have no other file operations in flight and
 	 * we just want to synchronize release and drain paths.
-	 * @kernfs_open_file_mutex is enough.  @of->mutex can't be used
+	 * @open_file_mutex is enough.  @of->mutex can't be used
 	 * here because drain path may be called from places which can
 	 * cause circular dependency.
 	 */
-	lockdep_assert_held(&kernfs_open_file_mutex);
+	lockdep_assert_held(kernfs_open_file_mutex_ptr(kn));
 
 	if (!of->released) {
 		/*
@@ -750,11 +747,12 @@ static int kernfs_fop_release(struct inode *inode, struct file *filp)
 {
 	struct kernfs_node *kn = inode->i_private;
 	struct kernfs_open_file *of = kernfs_of(filp);
+	struct mutex *lock = NULL;
 
 	if (kn->flags & KERNFS_HAS_RELEASE) {
-		mutex_lock(&kernfs_open_file_mutex);
+		lock = kernfs_open_file_mutex_lock(kn);
 		kernfs_release_file(kn, of);
-		mutex_unlock(&kernfs_open_file_mutex);
+		mutex_unlock(lock);
 	}
 
 	kernfs_put_open_node(kn, of);
@@ -769,6 +767,7 @@ void kernfs_drain_open_files(struct kernfs_node *kn)
 {
 	struct kernfs_open_node *on;
 	struct kernfs_open_file *of;
+	struct mutex *mutex = NULL;
 
 	if (!(kn->flags & (KERNFS_HAS_MMAP | KERNFS_HAS_RELEASE)))
 		return;
@@ -781,7 +780,7 @@ void kernfs_drain_open_files(struct kernfs_node *kn)
 	if (!on)
 		return;
 
-	mutex_lock(&kernfs_open_file_mutex);
+	mutex = kernfs_open_file_mutex_lock(kn);
 
 	list_for_each_entry(of, &on->files, list) {
 		struct inode *inode = file_inode(of->file);
@@ -793,7 +792,7 @@ void kernfs_drain_open_files(struct kernfs_node *kn)
 			kernfs_release_file(kn, of);
 	}
 
-	mutex_unlock(&kernfs_open_file_mutex);
+	mutex_unlock(mutex);
 
 	kernfs_put_open_node(kn, NULL);
 }
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v6 3/7] kernfs: Introduce hashed spinlocks to replace global kernfs_open_node_lock.
  2022-02-14 12:03 [PATCH v6 0/7] kernfs: Introduce hashed mutexes to replace global kernfs_open_file_mutex Imran Khan
  2022-02-14 12:03 ` [PATCH v6 1/7] " Imran Khan
  2022-02-14 12:03 ` [PATCH v6 2/7] kernfs: Replace global kernfs_open_file_mutex with hashed mutexes Imran Khan
@ 2022-02-14 12:03 ` Imran Khan
  2022-02-14 12:03 ` [PATCH v6 4/7] kernfs: Replace global kernfs_open_node_lock with hashed spinlocks Imran Khan
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: Imran Khan @ 2022-02-14 12:03 UTC (permalink / raw)
  To: tj, gregkh; +Cc: linux-kernel

In the current kernfs design a single spinlock, kernfs_open_node_lock,
protects kernfs_node->attr.open, i.e. the kernfs_open_node instance
corresponding to a sysfs attribute. So even if different tasks are opening
or closing different sysfs files, they can contend on this spinlock. The
contention is more apparent in large-scale systems with a few hundred CPUs,
where most of the CPUs have running tasks that are opening, accessing or
closing sysfs files at any point of time.

Using hashed spinlocks in place of a single global spinlock can
significantly reduce contention around the global spinlock and hence
provide better scalability. Moreover, as these hashed spinlocks are not
part of kernfs_node objects, we will not see any significant change in
memory utilization of kernfs based file systems like sysfs, cgroupfs etc.

This patch introduces hashed spinlocks that can be used in place of the
above-mentioned global spinlock. It also provides the interfaces needed to
use hashed spinlocks. The next patch makes use of these interfaces and
replaces the global spinlock with hashed ones.

Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
---
 fs/kernfs/kernfs-internal.h | 20 ++++++++++++++++++++
 fs/kernfs/mount.c           |  4 +++-
 include/linux/kernfs.h      | 11 +++++++++--
 3 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 03e983953eda4..593395f325a18 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -170,4 +170,24 @@ static inline struct mutex *kernfs_open_file_mutex_lock(struct kernfs_node *kn)
 	return lock;
 }
 
+static inline spinlock_t *
+kernfs_open_node_spinlock_ptr(struct kernfs_node *kn)
+{
+	int idx = hash_ptr(kn, NR_KERNFS_LOCK_BITS);
+
+	return &kernfs_locks->open_node_locks[idx].lock;
+}
+
+static inline spinlock_t *
+kernfs_open_node_spinlock(struct kernfs_node *kn)
+{
+	spinlock_t *lock;
+
+	lock = kernfs_open_node_spinlock_ptr(kn);
+
+	spin_lock_irq(lock);
+
+	return lock;
+}
+
 #endif	/* __KERNFS_INTERNAL_H */
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index fa3fa22c95b21..809b738739b18 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -395,8 +395,10 @@ void __init kernfs_lock_init(void)
 	kernfs_locks = kmalloc(sizeof(struct kernfs_global_locks), GFP_KERNEL);
 	WARN_ON(!kernfs_locks);
 
-	for (count = 0; count < NR_KERNFS_LOCKS; count++)
+	for (count = 0; count < NR_KERNFS_LOCKS; count++) {
 		mutex_init(&kernfs_locks->open_file_mutex[count].lock);
+		spin_lock_init(&kernfs_locks->open_node_locks[count].lock);
+	}
 }
 
 void __init kernfs_init(void)
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 3f72d38d48e31..7ee0595b315a2 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -19,6 +19,7 @@
 #include <linux/wait.h>
 #include <linux/rwsem.h>
 #include <linux/cache.h>
+#include <linux/spinlock.h>
 
 struct file;
 struct dentry;
@@ -75,20 +76,26 @@ struct kernfs_iattrs;
  * kernfs_open_file.
  * kernfs_open_files are chained at kernfs_open_node->files, which is
  * protected by kernfs_open_file_mutex.lock.
+ *
+ * kernfs_node->attr.open points to kernfs_open_node.  attr.open is
+ * protected by kernfs_open_node_lock.lock.
  */
-
 struct kernfs_open_file_mutex {
 	struct mutex lock;
 } ____cacheline_aligned_in_smp;
 
+struct kernfs_open_node_lock {
+	spinlock_t lock;
+} ____cacheline_aligned_in_smp;
+
 /*
  * To reduce possible contention in sysfs access, arising due to single
  * locks, use an array of locks and use kernfs_node object address as
  * hash keys to get the index of these locks.
  */
-
 struct kernfs_global_locks {
 	struct kernfs_open_file_mutex open_file_mutex[NR_KERNFS_LOCKS];
+	struct kernfs_open_node_lock open_node_locks[NR_KERNFS_LOCKS];
 };
 
 enum kernfs_node_type {
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v6 4/7] kernfs: Replace global kernfs_open_node_lock with hashed spinlocks.
  2022-02-14 12:03 [PATCH v6 0/7] kernfs: Introduce hashed mutexes to replace global kernfs_open_file_mutex Imran Khan
                   ` (2 preceding siblings ...)
  2022-02-14 12:03 ` [PATCH v6 3/7] kernfs: Introduce hashed spinlocks to replace global kernfs_open_node_lock Imran Khan
@ 2022-02-14 12:03 ` Imran Khan
  2022-02-14 12:03 ` [PATCH v6 5/7] kernfs: Use a per-fs rwsem to protect per-fs list of kernfs_super_info Imran Khan
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: Imran Khan @ 2022-02-14 12:03 UTC (permalink / raw)
  To: tj, gregkh; +Cc: linux-kernel

Remove global kernfs_open_node_lock, using hashed spinlock and
corresponding interface introduced in previous patch.

Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
---
 fs/kernfs/file.c | 30 +++++++++++++-----------------
 1 file changed, 13 insertions(+), 17 deletions(-)

diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 295fe67950346..f3ecc6fe8aedc 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -18,15 +18,6 @@
 
 #include "kernfs-internal.h"
 
-/*
- * kernfs_node->attr.open points to kernfs_open_node.  attr.open is
- * protected by kernfs_open_node_lock.
- *
- * filp->private_data points to seq_file whose ->private points to
- * kernfs_open_file.
- */
-static DEFINE_SPINLOCK(kernfs_open_node_lock);
-
 struct kernfs_open_node {
 	atomic_t		refcnt;
 	atomic_t		event;
@@ -520,10 +511,11 @@ static int kernfs_get_open_node(struct kernfs_node *kn,
 {
 	struct kernfs_open_node *on, *new_on = NULL;
 	struct mutex *mutex = NULL;
+	spinlock_t *lock = NULL;
 
  retry:
 	mutex = kernfs_open_file_mutex_lock(kn);
-	spin_lock_irq(&kernfs_open_node_lock);
+	lock = kernfs_open_node_spinlock(kn);
 
 	if (!kn->attr.open && new_on) {
 		kn->attr.open = new_on;
@@ -536,7 +528,7 @@ static int kernfs_get_open_node(struct kernfs_node *kn,
 		list_add_tail(&of->list, &on->files);
 	}
 
-	spin_unlock_irq(&kernfs_open_node_lock);
+	spin_unlock_irq(lock);
 	mutex_unlock(mutex);
 
 	if (on) {
@@ -572,10 +564,13 @@ static void kernfs_put_open_node(struct kernfs_node *kn,
 {
 	struct kernfs_open_node *on = kn->attr.open;
 	struct mutex *mutex = NULL;
+	spinlock_t *lock = NULL;
 	unsigned long flags;
 
 	mutex = kernfs_open_file_mutex_lock(kn);
-	spin_lock_irqsave(&kernfs_open_node_lock, flags);
+	lock = kernfs_open_node_spinlock_ptr(kn);
+
+	spin_lock_irqsave(lock, flags);
 
 	if (of)
 		list_del(&of->list);
@@ -585,7 +580,7 @@ static void kernfs_put_open_node(struct kernfs_node *kn,
 	else
 		on = NULL;
 
-	spin_unlock_irqrestore(&kernfs_open_node_lock, flags);
+	spin_unlock_irqrestore(lock, flags);
 	mutex_unlock(mutex);
 
 	kfree(on);
@@ -768,15 +763,16 @@ void kernfs_drain_open_files(struct kernfs_node *kn)
 	struct kernfs_open_node *on;
 	struct kernfs_open_file *of;
 	struct mutex *mutex = NULL;
+	spinlock_t *lock = NULL;
 
 	if (!(kn->flags & (KERNFS_HAS_MMAP | KERNFS_HAS_RELEASE)))
 		return;
 
-	spin_lock_irq(&kernfs_open_node_lock);
+	lock = kernfs_open_node_spinlock(kn);
 	on = kn->attr.open;
 	if (on)
 		atomic_inc(&on->refcnt);
-	spin_unlock_irq(&kernfs_open_node_lock);
+	spin_unlock_irq(lock);
 	if (!on)
 		return;
 
@@ -921,13 +917,13 @@ void kernfs_notify(struct kernfs_node *kn)
 		return;
 
 	/* kick poll immediately */
-	spin_lock_irqsave(&kernfs_open_node_lock, flags);
+	spin_lock_irqsave(kernfs_open_node_spinlock_ptr(kn), flags);
 	on = kn->attr.open;
 	if (on) {
 		atomic_inc(&on->event);
 		wake_up_interruptible(&on->poll);
 	}
-	spin_unlock_irqrestore(&kernfs_open_node_lock, flags);
+	spin_unlock_irqrestore(kernfs_open_node_spinlock_ptr(kn), flags);
 
 	/* schedule work to kick fsnotify */
 	spin_lock_irqsave(&kernfs_notify_lock, flags);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v6 5/7] kernfs: Use a per-fs rwsem to protect per-fs list of kernfs_super_info.
  2022-02-14 12:03 [PATCH v6 0/7] kernfs: Introduce hashed mutexes to replace global kernfs_open_file_mutex Imran Khan
                   ` (3 preceding siblings ...)
  2022-02-14 12:03 ` [PATCH v6 4/7] kernfs: Replace global kernfs_open_node_lock with hashed spinlocks Imran Khan
@ 2022-02-14 12:03 ` Imran Khan
  2022-02-14 12:03 ` [PATCH v6 6/7] kernfs: Introduce hashed rw-sem to replace per-fs kernfs_rwsem Imran Khan
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: Imran Khan @ 2022-02-14 12:03 UTC (permalink / raw)
  To: tj, gregkh; +Cc: linux-kernel

Right now the per-fs kernfs_rwsem protects the list of kernfs_super_info
instances for a kernfs_root. Since kernfs_rwsem is used to synchronize
several other operations across kernfs, and since most of these operations
don't impact kernfs_super_info, we can use a separate per-fs rwsem to
synchronize access to the list of kernfs_super_info.
This helps in reducing contention around kernfs_rwsem and also allows
operations that change/access the list of kernfs_super_info to proceed
without contending for kernfs_rwsem.

Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
---
 fs/kernfs/dir.c        | 1 +
 fs/kernfs/file.c       | 2 ++
 fs/kernfs/mount.c      | 8 ++++----
 include/linux/kernfs.h | 1 +
 4 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index e6d9772ddb4ca..dc769301ac96b 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -917,6 +917,7 @@ struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
 	idr_init(&root->ino_idr);
 	init_rwsem(&root->kernfs_rwsem);
 	INIT_LIST_HEAD(&root->supers);
+	init_rwsem(&root->supers_rwsem);
 
 	/*
 	 * On 64bit ino setups, id is ino.  On 32bit, low 32bits are ino.
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index f3ecc6fe8aedc..af2046bc63aa1 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -859,6 +859,7 @@ static void kernfs_notify_workfn(struct work_struct *work)
 	/* kick fsnotify */
 	down_write(&root->kernfs_rwsem);
 
+	down_write(&root->supers_rwsem);
 	list_for_each_entry(info, &kernfs_root(kn)->supers, node) {
 		struct kernfs_node *parent;
 		struct inode *p_inode = NULL;
@@ -894,6 +895,7 @@ static void kernfs_notify_workfn(struct work_struct *work)
 
 		iput(inode);
 	}
+	up_write(&root->supers_rwsem);
 
 	up_write(&root->kernfs_rwsem);
 	kernfs_put(kn);
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 809b738739b18..d35142226c340 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -347,9 +347,9 @@ int kernfs_get_tree(struct fs_context *fc)
 		}
 		sb->s_flags |= SB_ACTIVE;
 
-		down_write(&root->kernfs_rwsem);
+		down_write(&root->supers_rwsem);
 		list_add(&info->node, &info->root->supers);
-		up_write(&root->kernfs_rwsem);
+		up_write(&root->supers_rwsem);
 	}
 
 	fc->root = dget(sb->s_root);
@@ -376,9 +376,9 @@ void kernfs_kill_sb(struct super_block *sb)
 	struct kernfs_super_info *info = kernfs_info(sb);
 	struct kernfs_root *root = info->root;
 
-	down_write(&root->kernfs_rwsem);
+	down_write(&root->supers_rwsem);
 	list_del(&info->node);
-	up_write(&root->kernfs_rwsem);
+	up_write(&root->supers_rwsem);
 
 	/*
 	 * Remove the superblock from fs_supers/s_instances
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 7ee0595b315a2..84653c609a5c0 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -265,6 +265,7 @@ struct kernfs_root {
 
 	wait_queue_head_t	deactivate_waitq;
 	struct rw_semaphore	kernfs_rwsem;
+	struct rw_semaphore	supers_rwsem;
 };
 
 struct kernfs_open_file {
-- 
2.30.2



* [PATCH v6 6/7] kernfs: Introduce hashed rw-sem to replace per-fs kernfs_rwsem.
  2022-02-14 12:03 [PATCH v6 0/7] kernfs: Introduce hashed mutexes to replace global kernfs_open_file_mutex Imran Khan
                   ` (4 preceding siblings ...)
  2022-02-14 12:03 ` [PATCH v6 5/7] kernfs: Use a per-fs rwsem to protect per-fs list of kernfs_super_info Imran Khan
@ 2022-02-14 12:03 ` Imran Khan
  2022-02-14 18:10   ` Tejun Heo
  2022-02-16  1:46   ` Al Viro
  2022-02-14 12:03 ` [PATCH v6 7/7] kernfs: Replace per-fs rwsem with hashed ones Imran Khan
  2022-02-14 18:15 ` [PATCH v6 0/7] kernfs: Introduce hashed mutexes to replace global kernfs_open_file_mutex Tejun Heo
  7 siblings, 2 replies; 19+ messages in thread
From: Imran Khan @ 2022-02-14 12:03 UTC (permalink / raw)
  To: tj, gregkh; +Cc: linux-kernel

Having a single rwsem to synchronize all operations across a kernfs
based file system (cgroup, sysfs etc.) does not scale well. The contention
around this single rwsem becomes more apparent in large systems with a
few hundred CPUs, where most of the CPUs have running tasks that are
opening, accessing or closing sysfs files at any point of time.

Using hashed rwsems in place of a per-fs rwsem can significantly reduce
contention around the per-fs rwsem and hence provide better scalability.
Moreover, as these hashed rwsems are not part of kernfs_node objects, we
will not see any significant change in the memory utilization of kernfs
based file systems like sysfs, cgroupfs etc.

This patch introduces hashed rwsems that can be used in place of the
per-fs rwsem. It also provides interfaces for using the hashed rwsems,
as well as interfaces for lockdep annotation. The next patch makes use
of these interfaces and replaces the per-fs rwsem with hashed ones.

Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
---
 fs/kernfs/dir.c             |   1 -
 fs/kernfs/kernfs-internal.h | 560 ++++++++++++++++++++++++++++++++++++
 fs/kernfs/mount.c           |   1 +
 include/linux/kernfs.h      |   2 +
 4 files changed, 563 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index dc769301ac96b..0dac58f8091c9 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -25,7 +25,6 @@ static DEFINE_SPINLOCK(kernfs_idr_lock);	/* root->ino_idr */
 
 static bool kernfs_active(struct kernfs_node *kn)
 {
-	lockdep_assert_held(&kernfs_root(kn)->kernfs_rwsem);
 	return atomic_read(&kn->active) >= 0;
 }
 
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 593395f325a18..ba89de378f240 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -19,6 +19,17 @@
 #include <linux/kernfs.h>
 #include <linux/fs_context.h>
 
+/*
+ * kernfs_rwsem locking pattern:
+ *
+ * KERNFS_RWSEM_LOCK_SELF: lock kernfs_node only.
+ * KERNFS_RWSEM_LOCK_SELF_AND_PARENT: lock kernfs_node and its parent.
+ */
+enum kernfs_rwsem_lock_pattern {
+	KERNFS_RWSEM_LOCK_SELF,
+	KERNFS_RWSEM_LOCK_SELF_AND_PARENT
+};
+
 struct kernfs_iattrs {
 	kuid_t			ia_uid;
 	kgid_t			ia_gid;
@@ -190,4 +201,553 @@ kernfs_open_node_spinlock(struct kernfs_node *kn)
 	return lock;
 }
 
+static inline struct rw_semaphore *kernfs_rwsem_ptr(struct kernfs_node *kn)
+{
+	int idx = hash_ptr(kn, NR_KERNFS_LOCK_BITS);
+
+	return &kernfs_locks->kernfs_rwsem[idx];
+}
+
+static inline void kernfs_rwsem_assert_held(struct kernfs_node *kn)
+{
+	lockdep_assert_held(kernfs_rwsem_ptr(kn));
+}
+
+static inline void kernfs_rwsem_assert_held_write(struct kernfs_node *kn)
+{
+	lockdep_assert_held_write(kernfs_rwsem_ptr(kn));
+}
+
+static inline void kernfs_rwsem_assert_held_read(struct kernfs_node *kn)
+{
+	lockdep_assert_held_read(kernfs_rwsem_ptr(kn));
+}
+
+/**
+ * down_write_kernfs_rwsem_for_two_nodes() - Acquire hashed rwsem for 2 nodes
+ *
+ * @kn1: kernfs_node for which hashed rwsem needs to be taken
+ * @kn2: kernfs_node for which hashed rwsem needs to be taken
+ *
+ * In certain cases we need to acquire hashed rwsems for 2 nodes that don't have
+ * a parent-child relationship. This is a case of nested locking involving hashed
+ * rwsems, and the rwsem with the lower address is acquired first.
+ */
+static inline void down_write_kernfs_rwsem_for_two_nodes(struct kernfs_node *kn1,
+							 struct kernfs_node *kn2)
+{
+	struct rw_semaphore *rwsem1 = kernfs_rwsem_ptr(kn1);
+	struct rw_semaphore *rwsem2 = kernfs_rwsem_ptr(kn2);
+
+	if (rwsem1 == rwsem2)
+		down_write_nested(rwsem1, 0);
+	else {
+		if (rwsem1 < rwsem2) {
+			down_write_nested(rwsem1, 0);
+			down_write_nested(rwsem2, 1);
+		} else {
+			down_write_nested(rwsem2, 0);
+			down_write_nested(rwsem1, 1);
+		}
+	}
+	kernfs_get(kn1);
+	kernfs_get(kn2);
+}
+
+/**
+ * up_write_kernfs_rwsem_for_two_nodes() - Release hashed rwsem for 2 nodes
+ *
+ * @kn1: kernfs_node for which hashed rwsem needs to be released
+ * @kn2: kernfs_node for which hashed rwsem needs to be released
+ *
+ * In case of nested locking, rwsem with higher address is released first.
+ */
+static inline void up_write_kernfs_rwsem_for_two_nodes(struct kernfs_node *kn1,
+						       struct kernfs_node *kn2)
+{
+	struct rw_semaphore *rwsem1 = kernfs_rwsem_ptr(kn1);
+	struct rw_semaphore *rwsem2 = kernfs_rwsem_ptr(kn2);
+
+	if (rwsem1 == rwsem2)
+		up_write(rwsem1);
+	else {
+		if (rwsem1 > rwsem2) {
+			up_write(rwsem1);
+			up_write(rwsem2);
+		} else {
+			up_write(rwsem2);
+			up_write(rwsem1);
+		}
+	}
+
+	kernfs_put(kn1);
+	kernfs_put(kn2);
+}
+
+/**
+ * down_read_kernfs_rwsem_for_two_nodes() - Acquire hashed rwsem for 2 nodes
+ *
+ * @kn1: kernfs_node for which hashed rwsem needs to be taken
+ * @kn2: kernfs_node for which hashed rwsem needs to be taken
+ *
+ * In certain cases we need to acquire hashed rwsems for 2 nodes that don't have
+ * a parent-child relationship. This is a case of nested locking involving hashed
+ * rwsems, and the rwsem with the lower address is acquired first.
+ */
+static inline void down_read_kernfs_rwsem_for_two_nodes(struct kernfs_node *kn1,
+							struct kernfs_node *kn2)
+{
+	struct rw_semaphore *rwsem1 = kernfs_rwsem_ptr(kn1);
+	struct rw_semaphore *rwsem2 = kernfs_rwsem_ptr(kn2);
+
+	if (rwsem1 == rwsem2)
+		down_read_nested(rwsem1, 0);
+	else {
+		if (rwsem1 < rwsem2) {
+			down_read_nested(rwsem1, 0);
+			down_read_nested(rwsem2, 1);
+		} else {
+			down_read_nested(rwsem2, 0);
+			down_read_nested(rwsem1, 1);
+		}
+	}
+	kernfs_get(kn1);
+	kernfs_get(kn2);
+}
+
+/**
+ * up_read_kernfs_rwsem_for_two_nodes() - Release hashed rwsem for 2 nodes
+ *
+ * @kn1: kernfs_node for which hashed rwsem needs to be released
+ * @kn2: kernfs_node for which hashed rwsem needs to be released
+ *
+ * In case of nested locking, rwsem with higher address is released first.
+ */
+static inline void up_read_kernfs_rwsem_for_two_nodes(struct kernfs_node *kn1,
+						       struct kernfs_node *kn2)
+{
+	struct rw_semaphore *rwsem1 = kernfs_rwsem_ptr(kn1);
+	struct rw_semaphore *rwsem2 = kernfs_rwsem_ptr(kn2);
+
+	if (rwsem1 == rwsem2)
+		up_read(rwsem1);
+	else {
+		if (rwsem1 > rwsem2) {
+			up_read(rwsem1);
+			up_read(rwsem2);
+		} else {
+			up_read(rwsem2);
+			up_read(rwsem1);
+		}
+	}
+
+	kernfs_put(kn1);
+	kernfs_put(kn2);
+}
+
+/**
+ * down_write_kernfs_rwsem() - Acquire hashed rwsem
+ *
+ * @kn: kernfs_node for which hashed rwsem needs to be taken
+ * @ptrn: locking pattern, i.e. whether to lock only the given node or the
+ *	  node and its parent as well
+ *
+ * In case of nested locking, rwsem with lower address is acquired first.
+ *
+ * Return: void
+ */
+static inline void down_write_kernfs_rwsem(struct kernfs_node *kn,
+				      enum kernfs_rwsem_lock_pattern ptrn)
+{
+	struct rw_semaphore *p_rwsem = NULL;
+	struct rw_semaphore *rwsem = kernfs_rwsem_ptr(kn);
+	int lock_parent = 0;
+
+	if (ptrn == KERNFS_RWSEM_LOCK_SELF_AND_PARENT && kn->parent)
+		lock_parent = 1;
+
+	if (lock_parent)
+		p_rwsem = kernfs_rwsem_ptr(kn->parent);
+
+	if (!lock_parent || rwsem == p_rwsem) {
+		down_write_nested(rwsem, 0);
+		kernfs_get(kn);
+		kn->unlock_parent = 0;
+	} else {
+		/*
+		 * In case of nested locking, locks are taken in order of their
+		 * addresses. The lock with the lower address is taken first,
+		 * followed by the lock with the higher address.
+		 */
+		if (rwsem < p_rwsem) {
+			down_write_nested(rwsem, 0);
+			down_write_nested(p_rwsem, 1);
+		} else {
+			down_write_nested(p_rwsem, 0);
+			down_write_nested(rwsem, 1);
+		}
+		kernfs_get(kn);
+		kernfs_get(kn->parent);
+		kn->unlock_parent = 1;
+	}
+}
+
+/**
+ * up_write_kernfs_rwsem() - Release hashed rwsem
+ *
+ * @kn: kernfs_node for which hashed rwsem was taken
+ *
+ * In case of nested locking, rwsem with higher address is released first.
+ *
+ * Return: void
+ */
+static inline void up_write_kernfs_rwsem(struct kernfs_node *kn)
+{
+	struct rw_semaphore *p_rwsem = NULL;
+	struct rw_semaphore *rwsem = kernfs_rwsem_ptr(kn);
+
+	if (kn->unlock_parent) {
+		kn->unlock_parent = 0;
+		p_rwsem = kernfs_rwsem_ptr(kn->parent);
+		if (rwsem > p_rwsem) {
+			up_write(rwsem);
+			up_write(p_rwsem);
+		} else {
+			up_write(p_rwsem);
+			up_write(rwsem);
+		}
+		kernfs_put(kn->parent);
+	} else
+		up_write(rwsem);
+
+	kernfs_put(kn);
+}
+
+/**
+ * down_read_kernfs_rwsem() - Acquire hashed rwsem
+ *
+ * @kn: kernfs_node for which hashed rwsem needs to be taken
+ * @ptrn: locking pattern, i.e. whether to lock only the given node or the
+ *	  node and its parent as well
+ *
+ * In case of nested locking, rwsem with lower address is acquired first.
+ *
+ * Return: void
+ */
+static inline void down_read_kernfs_rwsem(struct kernfs_node *kn,
+				      enum kernfs_rwsem_lock_pattern ptrn)
+{
+	struct rw_semaphore *p_rwsem = NULL;
+	struct rw_semaphore *rwsem = kernfs_rwsem_ptr(kn);
+	int lock_parent = 0;
+
+	if (ptrn == KERNFS_RWSEM_LOCK_SELF_AND_PARENT && kn->parent)
+		lock_parent = 1;
+
+	if (lock_parent)
+		p_rwsem = kernfs_rwsem_ptr(kn->parent);
+
+	if (!lock_parent || rwsem == p_rwsem) {
+		down_read_nested(rwsem, 0);
+		kernfs_get(kn);
+		kn->unlock_parent = 0;
+	} else {
+		/*
+		 * In case of nested locking, locks are taken in order of their
+		 * addresses. The lock with the lower address is taken first,
+		 * followed by the lock with the higher address.
+		 */
+		if (rwsem < p_rwsem) {
+			down_read_nested(rwsem, 0);
+			down_read_nested(p_rwsem, 1);
+		} else {
+			down_read_nested(p_rwsem, 0);
+			down_read_nested(rwsem, 1);
+		}
+		kernfs_get(kn);
+		kernfs_get(kn->parent);
+		kn->unlock_parent = 1;
+	}
+}
+
+/**
+ * up_read_kernfs_rwsem() - Release hashed rwsem
+ *
+ * @kn: kernfs_node for which hashed rwsem was taken
+ *
+ * In case of nested locking, rwsem with higher address is released first.
+ *
+ * Return: void
+ */
+static inline void up_read_kernfs_rwsem(struct kernfs_node *kn)
+{
+	struct rw_semaphore *p_rwsem = NULL;
+	struct rw_semaphore *rwsem = kernfs_rwsem_ptr(kn);
+
+	if (kn->unlock_parent) {
+		kn->unlock_parent = 0;
+		p_rwsem = kernfs_rwsem_ptr(kn->parent);
+		if (rwsem > p_rwsem) {
+			up_read(rwsem);
+			up_read(p_rwsem);
+		} else {
+			up_read(p_rwsem);
+			up_read(rwsem);
+		}
+		kernfs_put(kn->parent);
+	} else
+		up_read(rwsem);
+
+	kernfs_put(kn);
+}
+
+static inline void kernfs_swap_rwsems(struct rw_semaphore **array, int i, int j)
+{
+	struct rw_semaphore *tmp;
+
+	tmp = array[i];
+	array[i] = array[j];
+	array[j] = tmp;
+}
+
+static inline void kernfs_sort_rwsems(struct rw_semaphore **array)
+{
+	if (array[0] > array[1])
+		kernfs_swap_rwsems(array, 0, 1);
+
+	if (array[0] > array[2])
+		kernfs_swap_rwsems(array, 0, 2);
+
+	if (array[1] > array[2])
+		kernfs_swap_rwsems(array, 1, 2);
+}
+
+/**
+ * down_write_kernfs_rwsem_rename_ns() - take hashed rwsem during
+ * rename or similar operations.
+ *
+ * @kn: kernfs_node of interest
+ * @current_parent: current parent of kernfs_node of interest
+ * @new_parent: about to become new parent of kernfs_node
+ *
+ * During rename or similar operations the parent of a node changes,
+ * and this means we will see different parents of a kernfs_node at
+ * the time of taking and releasing its or its parent's hashed rwsem.
+ * This function separately takes locks corresponding to node, and
+ * corresponding to its current and future parents (if needed).
+ *
+ * Return: void
+ */
+static inline void down_write_kernfs_rwsem_rename_ns(struct kernfs_node *kn,
+					struct kernfs_node *current_parent,
+					struct kernfs_node *new_parent)
+{
+	struct rw_semaphore *array[3];
+
+	array[0] = kernfs_rwsem_ptr(kn);
+	array[1] = kernfs_rwsem_ptr(current_parent);
+	array[2] = kernfs_rwsem_ptr(new_parent);
+
+	if (array[0] == array[1] && array[0] == array[2]) {
+		/* All 3 nodes hash to same rwsem */
+		down_write_nested(array[0], 0);
+	} else {
+		/*
+		 * Not all 3 nodes hash to the same rwsem, so sort the
+		 * array.
+		 */
+		kernfs_sort_rwsems(array);
+
+		if (array[0] == array[1] || array[1] == array[2]) {
+			/**
+			 * Two nodes hash to same rwsem, and these
+			 * will occupy consecutive places in array after
+			 * sorting.
+			 */
+			down_write_nested(array[0], 0);
+			down_write_nested(array[2], 1);
+		} else {
+			/* All 3 nodes hash to different rwsems */
+			down_write_nested(array[0], 0);
+			down_write_nested(array[1], 1);
+			down_write_nested(array[2], 2);
+		}
+	}
+
+	kernfs_get(kn);
+	kernfs_get(current_parent);
+	kernfs_get(new_parent);
+}
+
+/**
+ * up_write_kernfs_rwsem_rename_ns() - release hashed rwsem during
+ * rename or similar operations.
+ *
+ * @kn: kernfs_node of interest
+ * @current_parent: current parent of kernfs_node of interest
+ * @old_parent: old parent of kernfs_node
+ *
+ * During rename or similar operations the parent of a node changes,
+ * and this means we will see different parents of a kernfs_node at
+ * the time of taking and releasing its or its parent's hashed rwsem.
+ * This function separately releases locks corresponding to node, and
+ * corresponding to its current and old parents (if needed).
+ *
+ * Return: void
+ */
+static inline void up_write_kernfs_rwsem_rename_ns(struct kernfs_node *kn,
+					struct kernfs_node *current_parent,
+					struct kernfs_node *old_parent)
+{
+	struct rw_semaphore *array[3];
+
+	array[0] = kernfs_rwsem_ptr(kn);
+	array[1] = kernfs_rwsem_ptr(current_parent);
+	array[2] = kernfs_rwsem_ptr(old_parent);
+
+	if (array[0] == array[1] && array[0] == array[2]) {
+		/* All 3 nodes hash to same rwsem */
+		up_write(array[0]);
+	} else {
+		/*
+		 * Not all 3 nodes hash to the same rwsem, so sort the
+		 * array.
+		 */
+		kernfs_sort_rwsems(array);
+
+		if (array[0] == array[1] || array[1] == array[2]) {
+			/**
+			 * Two nodes hash to same rwsem, and these
+			 * will occupy consecutive places in array after
+			 * sorting.
+			 */
+			up_write(array[2]);
+			up_write(array[0]);
+		} else {
+			/* All 3 nodes hash to different rwsems */
+			up_write(array[2]);
+			up_write(array[1]);
+			up_write(array[0]);
+		}
+	}
+
+	kernfs_put(old_parent);
+	kernfs_put(current_parent);
+	kernfs_put(kn);
+}
+
+/**
+ * down_read_kernfs_rwsem_rename_ns() - take hashed rwsem during
+ * rename or similar operations.
+ *
+ * @kn: kernfs_node of interest
+ * @current_parent: current parent of kernfs_node of interest
+ * @new_parent: about to become new parent of kernfs_node
+ *
+ * During rename or similar operations the parent of a node changes,
+ * and this means we will see different parents of a kernfs_node at
+ * the time of taking and releasing its or its parent's hashed rwsem.
+ * This function separately takes locks corresponding to node, and
+ * corresponding to its current and future parents (if needed).
+ *
+ * Return: void
+ */
+static inline void down_read_kernfs_rwsem_rename_ns(struct kernfs_node *kn,
+					struct kernfs_node *current_parent,
+					struct kernfs_node *new_parent)
+{
+	struct rw_semaphore *array[3];
+
+	array[0] = kernfs_rwsem_ptr(kn);
+	array[1] = kernfs_rwsem_ptr(current_parent);
+	array[2] = kernfs_rwsem_ptr(new_parent);
+
+	if (array[0] == array[1] && array[0] == array[2]) {
+		/* All 3 nodes hash to same rwsem */
+		down_read_nested(array[0], 0);
+	} else {
+		/*
+		 * Not all 3 nodes hash to the same rwsem, so sort the
+		 * array.
+		 */
+		kernfs_sort_rwsems(array);
+
+		if (array[0] == array[1] || array[1] == array[2]) {
+			/**
+			 * Two nodes hash to same rwsem, and these
+			 * will occupy consecutive places in array after
+			 * sorting.
+			 */
+			down_read_nested(array[0], 0);
+			down_read_nested(array[2], 1);
+		} else {
+			/* All 3 nodes hash to different rwsems */
+			down_read_nested(array[0], 0);
+			down_read_nested(array[1], 1);
+			down_read_nested(array[2], 2);
+		}
+	}
+
+	kernfs_get(kn);
+	kernfs_get(current_parent);
+	kernfs_get(new_parent);
+}
+
+/**
+ * up_read_kernfs_rwsem_rename_ns() - release hashed rwsem during
+ * rename or similar operations.
+ *
+ * @kn: kernfs_node of interest
+ * @current_parent: current parent of kernfs_node of interest
+ * @old_parent: old parent of kernfs_node
+ *
+ * During rename or similar operations the parent of a node changes,
+ * and this means we will see different parents of a kernfs_node at
+ * the time of taking and releasing its or its parent's hashed rwsem.
+ * This function separately releases locks corresponding to node, and
+ * corresponding to its current and old parents (if needed).
+ *
+ * Return: void
+ */
+static inline void up_read_kernfs_rwsem_rename_ns(struct kernfs_node *kn,
+					struct kernfs_node *current_parent,
+					struct kernfs_node *old_parent)
+{
+	struct rw_semaphore *array[3];
+
+	array[0] = kernfs_rwsem_ptr(kn);
+	array[1] = kernfs_rwsem_ptr(current_parent);
+	array[2] = kernfs_rwsem_ptr(old_parent);
+
+	if (array[0] == array[1] && array[0] == array[2]) {
+		/* All 3 nodes hash to same rwsem */
+		up_read(array[0]);
+	} else {
+		/*
+		 * Not all 3 nodes hash to the same rwsem, so sort the
+		 * array.
+		 */
+		kernfs_sort_rwsems(array);
+
+		if (array[0] == array[1] || array[1] == array[2]) {
+			/**
+			 * Two nodes hash to same rwsem, and these
+			 * will occupy consecutive places in array after
+			 * sorting.
+			 */
+			up_read(array[2]);
+			up_read(array[0]);
+		} else {
+			/* All 3 nodes hash to different rwsems */
+			up_read(array[2]);
+			up_read(array[1]);
+			up_read(array[0]);
+		}
+	}
+
+	kernfs_put(old_parent);
+	kernfs_put(current_parent);
+	kernfs_put(kn);
+}
+
 #endif	/* __KERNFS_INTERNAL_H */
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index d35142226c340..d28f8a3eeb215 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -398,6 +398,7 @@ void __init kernfs_lock_init(void)
 	for (count = 0; count < NR_KERNFS_LOCKS; count++) {
 		mutex_init(&kernfs_locks->open_file_mutex[count].lock);
 		spin_lock_init(&kernfs_locks->open_node_locks[count].lock);
+		init_rwsem(&kernfs_locks->kernfs_rwsem[count]);
 	}
 }
 
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 84653c609a5c0..8451f153b2291 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -96,6 +96,7 @@ struct kernfs_open_node_lock {
 struct kernfs_global_locks {
 	struct kernfs_open_file_mutex open_file_mutex[NR_KERNFS_LOCKS];
 	struct kernfs_open_node_lock open_node_locks[NR_KERNFS_LOCKS];
+	struct rw_semaphore kernfs_rwsem[NR_KERNFS_LOCKS];
 };
 
 enum kernfs_node_type {
@@ -206,6 +207,7 @@ struct kernfs_node {
 	 */
 	struct kernfs_node	*parent;
 	const char		*name;
+	u8			unlock_parent; /* release parent's rwsem */
 
 	struct rb_node		rb;
 
-- 
2.30.2



* [PATCH v6 7/7] kernfs: Replace per-fs rwsem with hashed ones.
  2022-02-14 12:03 [PATCH v6 0/7] kernfs: Introduce hashed mutexes to replace global kernfs_open_file_mutex Imran Khan
                   ` (5 preceding siblings ...)
  2022-02-14 12:03 ` [PATCH v6 6/7] kernfs: Introduce hashed rw-sem to replace per-fs kernfs_rwsem Imran Khan
@ 2022-02-14 12:03 ` Imran Khan
  2022-02-14 17:49   ` Nathan Chancellor
  2022-02-14 18:15 ` [PATCH v6 0/7] kernfs: Introduce hashed mutexes to replace global kernfs_open_file_mutex Tejun Heo
  7 siblings, 1 reply; 19+ messages in thread
From: Imran Khan @ 2022-02-14 12:03 UTC (permalink / raw)
  To: tj, gregkh; +Cc: linux-kernel

Remove the per-fs rwsem, using the hashed rwsems and corresponding
interfaces introduced in the previous patch. As we are removing the use
of the per-fs rwsem, we no longer need the lockdep check in
kernfs_active(). Wherever needed, lockdep checks can be made on a
per-node basis using the interfaces provided in the previous patch.

Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
---
 fs/kernfs/dir.c             | 130 ++++++++++++++++++------------------
 fs/kernfs/file.c            |   4 +-
 fs/kernfs/inode.c           |  22 +++---
 fs/kernfs/kernfs-internal.h |   1 -
 fs/kernfs/mount.c           |   5 +-
 fs/kernfs/symlink.c         |   5 +-
 6 files changed, 80 insertions(+), 87 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 0dac58f8091c9..42404f8bc6acd 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -449,21 +449,22 @@ void kernfs_put_active(struct kernfs_node *kn)
 /**
  * kernfs_drain - drain kernfs_node
  * @kn: kernfs_node to drain
+ * @anc: ancestor of kernfs_node to drain
  *
 * Drain existing usages and nuke all existing mmaps of @kn.  Multiple
  * removers may invoke this function concurrently on @kn and all will
  * return after draining is complete.
  */
-static void kernfs_drain(struct kernfs_node *kn)
-	__releases(&kernfs_root(kn)->kernfs_rwsem)
-	__acquires(&kernfs_root(kn)->kernfs_rwsem)
+static void kernfs_drain(struct kernfs_node *kn, struct kernfs_node *anc)
+	__releases(kernfs_rwsem_ptr(anc))
+	__acquires(kernfs_rwsem_ptr(anc))
 {
 	struct kernfs_root *root = kernfs_root(kn);
 
-	lockdep_assert_held_write(&root->kernfs_rwsem);
+	kernfs_rwsem_assert_held_write(anc);
 	WARN_ON_ONCE(kernfs_active(kn));
 
-	up_write(&root->kernfs_rwsem);
+	up_write_kernfs_rwsem(anc);
 
 	if (kernfs_lockdep(kn)) {
 		rwsem_acquire(&kn->dep_map, 0, 0, _RET_IP_);
@@ -482,7 +483,7 @@ static void kernfs_drain(struct kernfs_node *kn)
 
 	kernfs_drain_open_files(kn);
 
-	down_write(&root->kernfs_rwsem);
+	down_write_kernfs_rwsem(anc, KERNFS_RWSEM_LOCK_SELF);
 }
 
 /**
@@ -717,12 +718,16 @@ struct kernfs_node *kernfs_find_and_get_node_by_id(struct kernfs_root *root,
 int kernfs_add_one(struct kernfs_node *kn)
 {
 	struct kernfs_node *parent = kn->parent;
-	struct kernfs_root *root = kernfs_root(parent);
 	struct kernfs_iattrs *ps_iattr;
 	bool has_ns;
 	int ret;
 
-	down_write(&root->kernfs_rwsem);
+	/*
+	 * The node being added is not active at this point and may be
+	 * activated later depending on the CREATE_DEACTIVATED flag. So
+	 * at this point just locking the parent is enough.
+	 */
+	down_write_kernfs_rwsem(parent, KERNFS_RWSEM_LOCK_SELF);
 
 	ret = -EINVAL;
 	has_ns = kernfs_ns_enabled(parent);
@@ -753,7 +758,7 @@ int kernfs_add_one(struct kernfs_node *kn)
 		ps_iattr->ia_mtime = ps_iattr->ia_ctime;
 	}
 
-	up_write(&root->kernfs_rwsem);
+	up_write_kernfs_rwsem(parent);
 
 	/*
 	 * Activate the new node unless CREATE_DEACTIVATED is requested.
@@ -767,7 +772,7 @@ int kernfs_add_one(struct kernfs_node *kn)
 	return 0;
 
 out_unlock:
-	up_write(&root->kernfs_rwsem);
+	up_write_kernfs_rwsem(parent);
 	return ret;
 }
 
@@ -788,7 +793,7 @@ static struct kernfs_node *kernfs_find_ns(struct kernfs_node *parent,
 	bool has_ns = kernfs_ns_enabled(parent);
 	unsigned int hash;
 
-	lockdep_assert_held(&kernfs_root(parent)->kernfs_rwsem);
+	kernfs_rwsem_assert_held(parent);
 
 	if (has_ns != (bool)ns) {
 		WARN(1, KERN_WARNING "kernfs: ns %s in '%s' for '%s'\n",
@@ -820,7 +825,7 @@ static struct kernfs_node *kernfs_walk_ns(struct kernfs_node *parent,
 	size_t len;
 	char *p, *name;
 
-	lockdep_assert_held_read(&kernfs_root(parent)->kernfs_rwsem);
+	kernfs_rwsem_assert_held_read(parent);
 
 	/* grab kernfs_rename_lock to piggy back on kernfs_pr_cont_buf */
 	spin_lock_irq(&kernfs_rename_lock);
@@ -859,12 +864,11 @@ struct kernfs_node *kernfs_find_and_get_ns(struct kernfs_node *parent,
 					   const char *name, const void *ns)
 {
 	struct kernfs_node *kn;
-	struct kernfs_root *root = kernfs_root(parent);
 
-	down_read(&root->kernfs_rwsem);
+	down_read_kernfs_rwsem(parent, KERNFS_RWSEM_LOCK_SELF);
 	kn = kernfs_find_ns(parent, name, ns);
 	kernfs_get(kn);
-	up_read(&root->kernfs_rwsem);
+	up_read_kernfs_rwsem(parent);
 
 	return kn;
 }
@@ -884,12 +888,11 @@ struct kernfs_node *kernfs_walk_and_get_ns(struct kernfs_node *parent,
 					   const char *path, const void *ns)
 {
 	struct kernfs_node *kn;
-	struct kernfs_root *root = kernfs_root(parent);
 
-	down_read(&root->kernfs_rwsem);
+	down_read_kernfs_rwsem(parent, KERNFS_RWSEM_LOCK_SELF);
 	kn = kernfs_walk_ns(parent, path, ns);
 	kernfs_get(kn);
-	up_read(&root->kernfs_rwsem);
+	up_read_kernfs_rwsem(parent);
 
 	return kn;
 }
@@ -914,7 +917,6 @@ struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
 		return ERR_PTR(-ENOMEM);
 
 	idr_init(&root->ino_idr);
-	init_rwsem(&root->kernfs_rwsem);
 	INIT_LIST_HEAD(&root->supers);
 	init_rwsem(&root->supers_rwsem);
 
@@ -1045,7 +1047,6 @@ struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
 static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
 {
 	struct kernfs_node *kn;
-	struct kernfs_root *root;
 
 	if (flags & LOOKUP_RCU)
 		return -ECHILD;
@@ -1061,13 +1062,12 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
 		parent = kernfs_dentry_node(dentry->d_parent);
 		if (parent) {
 			spin_unlock(&dentry->d_lock);
-			root = kernfs_root(parent);
-			down_read(&root->kernfs_rwsem);
+			down_read_kernfs_rwsem(parent, KERNFS_RWSEM_LOCK_SELF);
 			if (kernfs_dir_changed(parent, dentry)) {
-				up_read(&root->kernfs_rwsem);
+				up_read_kernfs_rwsem(parent);
 				return 0;
 			}
-			up_read(&root->kernfs_rwsem);
+			up_read_kernfs_rwsem(parent);
 		} else
 			spin_unlock(&dentry->d_lock);
 
@@ -1078,8 +1078,7 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
 	}
 
 	kn = kernfs_dentry_node(dentry);
-	root = kernfs_root(kn);
-	down_read(&root->kernfs_rwsem);
+	down_read_kernfs_rwsem(kn, KERNFS_RWSEM_LOCK_SELF);
 
 	/* The kernfs node has been deactivated */
 	if (!kernfs_active(kn))
@@ -1098,10 +1097,10 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
 	    kernfs_info(dentry->d_sb)->ns != kn->ns)
 		goto out_bad;
 
-	up_read(&root->kernfs_rwsem);
+	up_read_kernfs_rwsem(kn);
 	return 1;
 out_bad:
-	up_read(&root->kernfs_rwsem);
+	up_read_kernfs_rwsem(kn);
 	return 0;
 }
 
@@ -1115,28 +1114,29 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
 {
 	struct kernfs_node *parent = dir->i_private;
 	struct kernfs_node *kn;
-	struct kernfs_root *root;
 	struct inode *inode = NULL;
 	const void *ns = NULL;
 
-	root = kernfs_root(parent);
-	down_read(&root->kernfs_rwsem);
+	down_read_kernfs_rwsem(parent, KERNFS_RWSEM_LOCK_SELF);
 	if (kernfs_ns_enabled(parent))
 		ns = kernfs_info(dir->i_sb)->ns;
 
 	kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
+	up_read_kernfs_rwsem(parent);
 	/* attach dentry and inode */
 	if (kn) {
 		/* Inactive nodes are invisible to the VFS so don't
 		 * create a negative.
 		 */
+		down_read_kernfs_rwsem(kn, KERNFS_RWSEM_LOCK_SELF);
 		if (!kernfs_active(kn)) {
-			up_read(&root->kernfs_rwsem);
+			up_read_kernfs_rwsem(kn);
 			return NULL;
 		}
 		inode = kernfs_get_inode(dir->i_sb, kn);
 		if (!inode)
 			inode = ERR_PTR(-ENOMEM);
+		up_read_kernfs_rwsem(kn);
 	}
 	/*
 	 * Needed for negative dentry validation.
@@ -1144,9 +1144,10 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
 	 * or transforms from positive dentry in dentry_unlink_inode()
 	 * called from vfs_rmdir().
 	 */
+	down_read_kernfs_rwsem(parent, KERNFS_RWSEM_LOCK_SELF);
 	if (!IS_ERR(inode))
 		kernfs_set_rev(parent, dentry);
-	up_read(&root->kernfs_rwsem);
+	up_read_kernfs_rwsem(parent);
 
 	/* instantiate and hash (possibly negative) dentry */
 	return d_splice_alias(inode, dentry);
@@ -1269,7 +1270,7 @@ static struct kernfs_node *kernfs_next_descendant_post(struct kernfs_node *pos,
 {
 	struct rb_node *rbn;
 
-	lockdep_assert_held_write(&kernfs_root(root)->kernfs_rwsem);
+	kernfs_rwsem_assert_held_write(root);
 
 	/* if first iteration, visit leftmost descendant which may be root */
 	if (!pos)
@@ -1304,9 +1305,8 @@ static struct kernfs_node *kernfs_next_descendant_post(struct kernfs_node *pos,
 void kernfs_activate(struct kernfs_node *kn)
 {
 	struct kernfs_node *pos;
-	struct kernfs_root *root = kernfs_root(kn);
 
-	down_write(&root->kernfs_rwsem);
+	down_write_kernfs_rwsem(kn, KERNFS_RWSEM_LOCK_SELF);
 
 	pos = NULL;
 	while ((pos = kernfs_next_descendant_post(pos, kn))) {
@@ -1320,15 +1320,13 @@ void kernfs_activate(struct kernfs_node *kn)
 		pos->flags |= KERNFS_ACTIVATED;
 	}
 
-	up_write(&root->kernfs_rwsem);
+	up_write_kernfs_rwsem(kn);
 }
 
 static void __kernfs_remove(struct kernfs_node *kn)
 {
 	struct kernfs_node *pos;
 
-	lockdep_assert_held_write(&kernfs_root(kn)->kernfs_rwsem);
-
 	/*
 	 * Short-circuit if non-root @kn has already finished removal.
 	 * This is for kernfs_remove_self() which plays with active ref
@@ -1341,12 +1339,16 @@ static void __kernfs_remove(struct kernfs_node *kn)
 
 	/* prevent any new usage under @kn by deactivating all nodes */
 	pos = NULL;
+
+	down_write_kernfs_rwsem(kn, KERNFS_RWSEM_LOCK_SELF);
 	while ((pos = kernfs_next_descendant_post(pos, kn)))
 		if (kernfs_active(pos))
 			atomic_add(KN_DEACTIVATED_BIAS, &pos->active);
+	up_write_kernfs_rwsem(kn);
 
 	/* deactivate and unlink the subtree node-by-node */
 	do {
+		down_write_kernfs_rwsem(kn, KERNFS_RWSEM_LOCK_SELF);
 		pos = kernfs_leftmost_descendant(kn);
 
 		/*
@@ -1364,10 +1366,15 @@ static void __kernfs_remove(struct kernfs_node *kn)
 		 * error paths without worrying about draining.
 		 */
 		if (kn->flags & KERNFS_ACTIVATED)
-			kernfs_drain(pos);
+			kernfs_drain(pos, kn);
 		else
 			WARN_ON_ONCE(atomic_read(&kn->active) != KN_DEACTIVATED_BIAS);
 
+		up_write_kernfs_rwsem(kn);
+
+		if (pos->parent)
+			down_write_kernfs_rwsem(pos->parent, KERNFS_RWSEM_LOCK_SELF);
+
 		/*
 		 * kernfs_unlink_sibling() succeeds once per node.  Use it
 		 * to decide who's responsible for cleanups.
@@ -1385,6 +1392,9 @@ static void __kernfs_remove(struct kernfs_node *kn)
 			kernfs_put(pos);
 		}
 
+		if (pos->parent)
+			up_write_kernfs_rwsem(pos->parent);
+
 		kernfs_put(pos);
 	} while (pos != kn);
 }
@@ -1397,11 +1407,7 @@ static void __kernfs_remove(struct kernfs_node *kn)
  */
 void kernfs_remove(struct kernfs_node *kn)
 {
-	struct kernfs_root *root = kernfs_root(kn);
-
-	down_write(&root->kernfs_rwsem);
 	__kernfs_remove(kn);
-	up_write(&root->kernfs_rwsem);
 }
 
 /**
@@ -1487,9 +1493,8 @@ void kernfs_unbreak_active_protection(struct kernfs_node *kn)
 bool kernfs_remove_self(struct kernfs_node *kn)
 {
 	bool ret;
-	struct kernfs_root *root = kernfs_root(kn);
 
-	down_write(&root->kernfs_rwsem);
+	down_write_kernfs_rwsem(kn, KERNFS_RWSEM_LOCK_SELF);
 	kernfs_break_active_protection(kn);
 
 	/*
@@ -1503,9 +1508,11 @@ bool kernfs_remove_self(struct kernfs_node *kn)
 	 */
 	if (!(kn->flags & KERNFS_SUICIDAL)) {
 		kn->flags |= KERNFS_SUICIDAL;
+		up_write_kernfs_rwsem(kn);
 		__kernfs_remove(kn);
 		kn->flags |= KERNFS_SUICIDED;
 		ret = true;
+		down_write_kernfs_rwsem(kn, KERNFS_RWSEM_LOCK_SELF);
 	} else {
 		wait_queue_head_t *waitq = &kernfs_root(kn)->deactivate_waitq;
 		DEFINE_WAIT(wait);
@@ -1517,9 +1524,9 @@ bool kernfs_remove_self(struct kernfs_node *kn)
 			    atomic_read(&kn->active) == KN_DEACTIVATED_BIAS)
 				break;
 
-			up_write(&root->kernfs_rwsem);
+			up_write_kernfs_rwsem(kn);
 			schedule();
-			down_write(&root->kernfs_rwsem);
+			down_write_kernfs_rwsem(kn, KERNFS_RWSEM_LOCK_SELF_AND_PARENT);
 		}
 		finish_wait(waitq, &wait);
 		WARN_ON_ONCE(!RB_EMPTY_NODE(&kn->rb));
@@ -1532,7 +1539,7 @@ bool kernfs_remove_self(struct kernfs_node *kn)
 	 */
 	kernfs_unbreak_active_protection(kn);
 
-	up_write(&root->kernfs_rwsem);
+	up_write_kernfs_rwsem(kn);
 	return ret;
 }
 
@@ -1549,7 +1556,6 @@ int kernfs_remove_by_name_ns(struct kernfs_node *parent, const char *name,
 			     const void *ns)
 {
 	struct kernfs_node *kn;
-	struct kernfs_root *root;
 
 	if (!parent) {
 		WARN(1, KERN_WARNING "kernfs: can not remove '%s', no directory\n",
@@ -1557,15 +1563,15 @@ int kernfs_remove_by_name_ns(struct kernfs_node *parent, const char *name,
 		return -ENOENT;
 	}
 
-	root = kernfs_root(parent);
-	down_write(&root->kernfs_rwsem);
+	down_write_kernfs_rwsem(parent, KERNFS_RWSEM_LOCK_SELF);
 
 	kn = kernfs_find_ns(parent, name, ns);
+
+	up_write_kernfs_rwsem(parent);
+
 	if (kn)
 		__kernfs_remove(kn);
 
-	up_write(&root->kernfs_rwsem);
-
 	if (kn)
 		return 0;
 	else
@@ -1583,7 +1589,6 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
 		     const char *new_name, const void *new_ns)
 {
 	struct kernfs_node *old_parent;
-	struct kernfs_root *root;
 	const char *old_name = NULL;
 	int error;
 
@@ -1591,8 +1596,7 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
 	if (!kn->parent)
 		return -EINVAL;
 
-	root = kernfs_root(kn);
-	down_write(&root->kernfs_rwsem);
+	down_write_kernfs_rwsem_rename_ns(kn, kn->parent, new_parent);
 
 	error = -ENOENT;
 	if (!kernfs_active(kn) || !kernfs_active(new_parent) ||
@@ -1646,7 +1650,7 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
 
 	error = 0;
  out:
-	up_write(&root->kernfs_rwsem);
+	up_write_kernfs_rwsem_rename_ns(kn, new_parent, old_parent);
 	return error;
 }
 
@@ -1717,14 +1721,12 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
 	struct dentry *dentry = file->f_path.dentry;
 	struct kernfs_node *parent = kernfs_dentry_node(dentry);
 	struct kernfs_node *pos = file->private_data;
-	struct kernfs_root *root;
 	const void *ns = NULL;
 
 	if (!dir_emit_dots(file, ctx))
 		return 0;
 
-	root = kernfs_root(parent);
-	down_read(&root->kernfs_rwsem);
+	down_read_kernfs_rwsem(parent, KERNFS_RWSEM_LOCK_SELF);
 
 	if (kernfs_ns_enabled(parent))
 		ns = kernfs_info(dentry->d_sb)->ns;
@@ -1741,12 +1743,12 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
 		file->private_data = pos;
 		kernfs_get(pos);
 
-		up_read(&root->kernfs_rwsem);
+		up_read_kernfs_rwsem(parent);
 		if (!dir_emit(ctx, name, len, ino, type))
 			return 0;
-		down_read(&root->kernfs_rwsem);
+		down_read_kernfs_rwsem(parent, KERNFS_RWSEM_LOCK_SELF);
 	}
-	up_read(&root->kernfs_rwsem);
+	up_read_kernfs_rwsem(parent);
 	file->private_data = NULL;
 	ctx->pos = INT_MAX;
 	return 0;
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index af2046bc63aa1..03d8f0d087cb8 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -857,7 +857,7 @@ static void kernfs_notify_workfn(struct work_struct *work)
 
 	root = kernfs_root(kn);
 	/* kick fsnotify */
-	down_write(&root->kernfs_rwsem);
+	down_write_kernfs_rwsem(kn, KERNFS_RWSEM_LOCK_SELF);
 
 	down_write(&root->supers_rwsem);
 	list_for_each_entry(info, &kernfs_root(kn)->supers, node) {
@@ -897,7 +897,7 @@ static void kernfs_notify_workfn(struct work_struct *work)
 	}
 	up_write(&root->supers_rwsem);
 
-	up_write(&root->kernfs_rwsem);
+	up_write_kernfs_rwsem(kn);
 	kernfs_put(kn);
 	goto repeat;
 }
diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index 3d783d80f5daa..4de65f9c21d85 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -99,11 +99,10 @@ int __kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr)
 int kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr)
 {
 	int ret;
-	struct kernfs_root *root = kernfs_root(kn);
 
-	down_write(&root->kernfs_rwsem);
+	down_write_kernfs_rwsem(kn, KERNFS_RWSEM_LOCK_SELF);
 	ret = __kernfs_setattr(kn, iattr);
-	up_write(&root->kernfs_rwsem);
+	up_write_kernfs_rwsem(kn);
 	return ret;
 }
 
@@ -112,14 +111,12 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
 {
 	struct inode *inode = d_inode(dentry);
 	struct kernfs_node *kn = inode->i_private;
-	struct kernfs_root *root;
 	int error;
 
 	if (!kn)
 		return -EINVAL;
 
-	root = kernfs_root(kn);
-	down_write(&root->kernfs_rwsem);
+	down_write_kernfs_rwsem(kn, KERNFS_RWSEM_LOCK_SELF);
 	error = setattr_prepare(&init_user_ns, dentry, iattr);
 	if (error)
 		goto out;
@@ -132,7 +129,7 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
 	setattr_copy(&init_user_ns, inode, iattr);
 
 out:
-	up_write(&root->kernfs_rwsem);
+	up_write_kernfs_rwsem(kn);
 	return error;
 }
 
@@ -187,14 +184,13 @@ int kernfs_iop_getattr(struct user_namespace *mnt_userns,
 {
 	struct inode *inode = d_inode(path->dentry);
 	struct kernfs_node *kn = inode->i_private;
-	struct kernfs_root *root = kernfs_root(kn);
 
-	down_read(&root->kernfs_rwsem);
+	down_read_kernfs_rwsem(kn, KERNFS_RWSEM_LOCK_SELF);
 	spin_lock(&inode->i_lock);
 	kernfs_refresh_inode(kn, inode);
 	generic_fillattr(&init_user_ns, inode, stat);
 	spin_unlock(&inode->i_lock);
-	up_read(&root->kernfs_rwsem);
+	up_read_kernfs_rwsem(kn);
 
 	return 0;
 }
@@ -278,21 +274,19 @@ int kernfs_iop_permission(struct user_namespace *mnt_userns,
 			  struct inode *inode, int mask)
 {
 	struct kernfs_node *kn;
-	struct kernfs_root *root;
 	int ret;
 
 	if (mask & MAY_NOT_BLOCK)
 		return -ECHILD;
 
 	kn = inode->i_private;
-	root = kernfs_root(kn);
 
-	down_read(&root->kernfs_rwsem);
+	down_read_kernfs_rwsem(kn, KERNFS_RWSEM_LOCK_SELF);
 	spin_lock(&inode->i_lock);
 	kernfs_refresh_inode(kn, inode);
 	ret = generic_permission(&init_user_ns, inode, mask);
 	spin_unlock(&inode->i_lock);
-	up_read(&root->kernfs_rwsem);
+	up_read_kernfs_rwsem(kn);
 
 	return ret;
 }
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index ba89de378f240..f19b180557559 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -133,7 +133,6 @@ int __kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr);
 /*
  * dir.c
  */
-extern struct rw_semaphore kernfs_rwsem;
 extern const struct dentry_operations kernfs_dops;
 extern const struct file_operations kernfs_dir_fops;
 extern const struct inode_operations kernfs_dir_iops;
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index d28f8a3eeb215..2816750f798e2 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -237,7 +237,6 @@ struct dentry *kernfs_node_dentry(struct kernfs_node *kn,
 static int kernfs_fill_super(struct super_block *sb, struct kernfs_fs_context *kfc)
 {
 	struct kernfs_super_info *info = kernfs_info(sb);
-	struct kernfs_root *kf_root = kfc->root;
 	struct inode *inode;
 	struct dentry *root;
 
@@ -257,9 +256,9 @@ static int kernfs_fill_super(struct super_block *sb, struct kernfs_fs_context *k
 	sb->s_shrink.seeks = 0;
 
 	/* get root inode, initialize and unlock it */
-	down_read(&kf_root->kernfs_rwsem);
+	down_read_kernfs_rwsem(info->root->kn, KERNFS_RWSEM_LOCK_SELF);
 	inode = kernfs_get_inode(sb, info->root->kn);
-	up_read(&kf_root->kernfs_rwsem);
+	up_read_kernfs_rwsem(info->root->kn);
 	if (!inode) {
 		pr_debug("kernfs: could not get root inode\n");
 		return -ENOMEM;
diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
index 0ab13824822f7..24d0f64460bda 100644
--- a/fs/kernfs/symlink.c
+++ b/fs/kernfs/symlink.c
@@ -113,12 +113,11 @@ static int kernfs_getlink(struct inode *inode, char *path)
 	struct kernfs_node *kn = inode->i_private;
 	struct kernfs_node *parent = kn->parent;
 	struct kernfs_node *target = kn->symlink.target_kn;
-	struct kernfs_root *root = kernfs_root(parent);
 	int error;
 
-	down_read(&root->kernfs_rwsem);
+	down_read_kernfs_rwsem(parent, KERNFS_RWSEM_LOCK_SELF);
 	error = kernfs_get_target_path(parent, target, path);
-	up_read(&root->kernfs_rwsem);
+	up_read_kernfs_rwsem(parent);
 
 	return error;
 }
-- 
2.30.2



* Re: [PATCH v6 7/7] kernfs: Replace per-fs rwsem with hashed ones.
  2022-02-14 12:03 ` [PATCH v6 7/7] kernfs: Replace per-fs rwsem with hashed ones Imran Khan
@ 2022-02-14 17:49   ` Nathan Chancellor
  2022-02-16  8:57     ` [kbuild-all] " Chen, Rong A
  0 siblings, 1 reply; 19+ messages in thread
From: Nathan Chancellor @ 2022-02-14 17:49 UTC (permalink / raw)
  To: Imran Khan, kernel test robot; +Cc: tj, gregkh, linux-kernel, llvm, kbuild-all

On Tue, Feb 15, 2022 at 01:19:23AM +0800, kernel test robot wrote:
> Hi Imran,
> 
> Thank you for the patch! Perhaps something to improve:
> 
> [auto build test WARNING on 6d9bd4ad4ca08b1114e814c2c42383b8b13be631]
> 
> url:    https://github.com/0day-ci/linux/commits/Imran-Khan/kernfs-Introduce-hashed-mutexes-to-replace-global-kernfs_open_file_mutex/20220214-200700
> base:   6d9bd4ad4ca08b1114e814c2c42383b8b13be631
> config: hexagon-randconfig-r012-20220213 (https://download.01.org/0day-ci/archive/20220214/202202142322.OezS0EtW-lkp@intel.com/config)
> compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project ea071884b0cc7210b3cc5fe858f0e892a779a23b)
> reproduce (this is a W=1 build):
>         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # https://github.com/0day-ci/linux/commit/48eab913f80f1555d57b964ef7cdba17f5ec5d1f
>         git remote add linux-review https://github.com/0day-ci/linux
>         git fetch --no-tags linux-review Imran-Khan/kernfs-Introduce-hashed-mutexes-to-replace-global-kernfs_open_file_mutex/20220214-200700
>         git checkout 48eab913f80f1555d57b964ef7cdba17f5ec5d1f
>         # save the config file to linux build tree
>         mkdir build_dir
>         COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=hexagon SHELL=/bin/bash fs/kernfs/
> 
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot <lkp@intel.com>
> 
> All warnings (new ones prefixed by >>):
> 
> >> fs/kernfs/dir.c:1619:7: warning: variable 'old_parent' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
>                    if (!new_name)
>                        ^~~~~~~~~
>    fs/kernfs/dir.c:1653:50: note: uninitialized use occurs here
>            up_write_kernfs_rwsem_rename_ns(kn, new_parent, old_parent);
>                                                            ^~~~~~~~~~
>    fs/kernfs/dir.c:1619:3: note: remove the 'if' if its condition is always false
>                    if (!new_name)
>                    ^~~~~~~~~~~~~~
>    fs/kernfs/dir.c:1612:6: warning: variable 'old_parent' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
>            if (kernfs_find_ns(new_parent, new_name, new_ns))
>                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>    fs/kernfs/dir.c:1653:50: note: uninitialized use occurs here
>            up_write_kernfs_rwsem_rename_ns(kn, new_parent, old_parent);
>                                                            ^~~~~~~~~~
>    fs/kernfs/dir.c:1612:2: note: remove the 'if' if its condition is always false
>            if (kernfs_find_ns(new_parent, new_name, new_ns))
>            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>    fs/kernfs/dir.c:1607:6: warning: variable 'old_parent' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
>            if ((kn->parent == new_parent) && (kn->ns == new_ns) &&
>                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>    fs/kernfs/dir.c:1653:50: note: uninitialized use occurs here
>            up_write_kernfs_rwsem_rename_ns(kn, new_parent, old_parent);
>                                                            ^~~~~~~~~~
>    fs/kernfs/dir.c:1607:2: note: remove the 'if' if its condition is always false
>            if ((kn->parent == new_parent) && (kn->ns == new_ns) &&
>            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>    fs/kernfs/dir.c:1602:6: warning: variable 'old_parent' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
>            if (!kernfs_active(kn) || !kernfs_active(new_parent) ||
>                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>    fs/kernfs/dir.c:1653:50: note: uninitialized use occurs here
>            up_write_kernfs_rwsem_rename_ns(kn, new_parent, old_parent);
>                                                            ^~~~~~~~~~
>    fs/kernfs/dir.c:1602:2: note: remove the 'if' if its condition is always false
>            if (!kernfs_active(kn) || !kernfs_active(new_parent) ||
>            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >> fs/kernfs/dir.c:1602:6: warning: variable 'old_parent' is used uninitialized whenever '||' condition is true [-Wsometimes-uninitialized]
>            if (!kernfs_active(kn) || !kernfs_active(new_parent) ||
>                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>    fs/kernfs/dir.c:1653:50: note: uninitialized use occurs here
>            up_write_kernfs_rwsem_rename_ns(kn, new_parent, old_parent);
>                                                            ^~~~~~~~~~
>    fs/kernfs/dir.c:1602:6: note: remove the '||' if its condition is always false
>            if (!kernfs_active(kn) || !kernfs_active(new_parent) ||
>                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >> fs/kernfs/dir.c:1602:6: warning: variable 'old_parent' is used uninitialized whenever '||' condition is true [-Wsometimes-uninitialized]
>            if (!kernfs_active(kn) || !kernfs_active(new_parent) ||
>                ^~~~~~~~~~~~~~~~~~
>    fs/kernfs/dir.c:1653:50: note: uninitialized use occurs here
>            up_write_kernfs_rwsem_rename_ns(kn, new_parent, old_parent);
>                                                            ^~~~~~~~~~
>    fs/kernfs/dir.c:1602:6: note: remove the '||' if its condition is always false
>            if (!kernfs_active(kn) || !kernfs_active(new_parent) ||
>                ^~~~~~~~~~~~~~~~~~~~~
>    fs/kernfs/dir.c:1591:32: note: initialize the variable 'old_parent' to silence this warning
>            struct kernfs_node *old_parent;
>                                          ^
>                                           = NULL
>    6 warnings generated.
> 
> 
> vim +1619 fs/kernfs/dir.c
> 
> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1580  
> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1581  /**
> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1582   * kernfs_rename_ns - move and rename a kernfs_node
> 324a56e16e44ba Tejun Heo          2013-12-11  1583   * @kn: target node
> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1584   * @new_parent: new parent to put @sd under
> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1585   * @new_name: new name
> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1586   * @new_ns: new namespace tag
> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1587   */
> 324a56e16e44ba Tejun Heo          2013-12-11  1588  int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1589  		     const char *new_name, const void *new_ns)
> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1590  {
> 3eef34ad7dc369 Tejun Heo          2014-02-07  1591  	struct kernfs_node *old_parent;
> 3eef34ad7dc369 Tejun Heo          2014-02-07  1592  	const char *old_name = NULL;
> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1593  	int error;
> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1594  
> 3eef34ad7dc369 Tejun Heo          2014-02-07  1595  	/* can't move or rename root */
> 3eef34ad7dc369 Tejun Heo          2014-02-07  1596  	if (!kn->parent)
> 3eef34ad7dc369 Tejun Heo          2014-02-07  1597  		return -EINVAL;
> 3eef34ad7dc369 Tejun Heo          2014-02-07  1598  
> 48eab913f80f15 Imran Khan         2022-02-14  1599  	down_write_kernfs_rwsem_rename_ns(kn, kn->parent, new_parent);
> 798c75a0d44cdb Greg Kroah-Hartman 2014-01-13  1600  
> d0ae3d4347ee02 Tejun Heo          2013-12-11  1601  	error = -ENOENT;
> ea015218f2f7ac Eric W. Biederman  2015-05-13 @1602  	if (!kernfs_active(kn) || !kernfs_active(new_parent) ||
> ea015218f2f7ac Eric W. Biederman  2015-05-13  1603  	    (new_parent->flags & KERNFS_EMPTY_DIR))
> d0ae3d4347ee02 Tejun Heo          2013-12-11  1604  		goto out;
> d0ae3d4347ee02 Tejun Heo          2013-12-11  1605  
> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1606  	error = 0;
> adc5e8b58f4886 Tejun Heo          2013-12-11  1607  	if ((kn->parent == new_parent) && (kn->ns == new_ns) &&
> adc5e8b58f4886 Tejun Heo          2013-12-11  1608  	    (strcmp(kn->name, new_name) == 0))
> 798c75a0d44cdb Greg Kroah-Hartman 2014-01-13  1609  		goto out;	/* nothing to rename */
> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1610  
> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1611  	error = -EEXIST;
> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1612  	if (kernfs_find_ns(new_parent, new_name, new_ns))
> 798c75a0d44cdb Greg Kroah-Hartman 2014-01-13  1613  		goto out;
> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1614  
> 324a56e16e44ba Tejun Heo          2013-12-11  1615  	/* rename kernfs_node */
> adc5e8b58f4886 Tejun Heo          2013-12-11  1616  	if (strcmp(kn->name, new_name) != 0) {
> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1617  		error = -ENOMEM;
> 75287a677ba1be Andrzej Hajda      2015-02-13  1618  		new_name = kstrdup_const(new_name, GFP_KERNEL);
> fd7b9f7b9776b1 Tejun Heo          2013-11-28 @1619  		if (!new_name)
> 798c75a0d44cdb Greg Kroah-Hartman 2014-01-13  1620  			goto out;
> 3eef34ad7dc369 Tejun Heo          2014-02-07  1621  	} else {
> 3eef34ad7dc369 Tejun Heo          2014-02-07  1622  		new_name = NULL;
> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1623  	}
> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1624  
> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1625  	/*
> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1626  	 * Move to the appropriate place in the appropriate directories rbtree.
> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1627  	 */
> c637b8acbe079e Tejun Heo          2013-12-11  1628  	kernfs_unlink_sibling(kn);
> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1629  	kernfs_get(new_parent);
> 3eef34ad7dc369 Tejun Heo          2014-02-07  1630  
> 3eef34ad7dc369 Tejun Heo          2014-02-07  1631  	/* rename_lock protects ->parent and ->name accessors */
> 3eef34ad7dc369 Tejun Heo          2014-02-07  1632  	spin_lock_irq(&kernfs_rename_lock);
> 3eef34ad7dc369 Tejun Heo          2014-02-07  1633  
> 3eef34ad7dc369 Tejun Heo          2014-02-07  1634  	old_parent = kn->parent;
> adc5e8b58f4886 Tejun Heo          2013-12-11  1635  	kn->parent = new_parent;
> 3eef34ad7dc369 Tejun Heo          2014-02-07  1636  
> 3eef34ad7dc369 Tejun Heo          2014-02-07  1637  	kn->ns = new_ns;
> 3eef34ad7dc369 Tejun Heo          2014-02-07  1638  	if (new_name) {
> 3eef34ad7dc369 Tejun Heo          2014-02-07  1639  		old_name = kn->name;
> 3eef34ad7dc369 Tejun Heo          2014-02-07  1640  		kn->name = new_name;
> 3eef34ad7dc369 Tejun Heo          2014-02-07  1641  	}
> 3eef34ad7dc369 Tejun Heo          2014-02-07  1642  
> 3eef34ad7dc369 Tejun Heo          2014-02-07  1643  	spin_unlock_irq(&kernfs_rename_lock);
> 3eef34ad7dc369 Tejun Heo          2014-02-07  1644  
> 9561a8961c708f Tejun Heo          2014-02-10  1645  	kn->hash = kernfs_name_hash(kn->name, kn->ns);
> c637b8acbe079e Tejun Heo          2013-12-11  1646  	kernfs_link_sibling(kn);
> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1647  
> 3eef34ad7dc369 Tejun Heo          2014-02-07  1648  	kernfs_put(old_parent);
> 75287a677ba1be Andrzej Hajda      2015-02-13  1649  	kfree_const(old_name);
> 3eef34ad7dc369 Tejun Heo          2014-02-07  1650  
> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1651  	error = 0;
> ae34372eb8408b Tejun Heo          2014-01-10  1652   out:
> 48eab913f80f15 Imran Khan         2022-02-14  1653  	up_write_kernfs_rwsem_rename_ns(kn, new_parent, old_parent);
> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1654  	return error;
> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1655  }
> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1656  
> 
> ---
> 0-DAY CI Kernel Test Service, Intel Corporation
> https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
> 

I am forwarding this along as a reply to the original posting so that
this patch series is not picked up in its current form. This warning
looks legitimate to me and it will break allmodconfig due to -Werror.

Intel folks, why did this get sent to the author privately, rather than
as a reply to the original posting, given the patch is still in review?

Cheers,
Nathan


* Re: [PATCH v6 1/7] kernfs: Introduce hashed mutexes to replace global kernfs_open_file_mutex.
  2022-02-14 12:03 ` [PATCH v6 1/7] " Imran Khan
@ 2022-02-14 17:50   ` Tejun Heo
  0 siblings, 0 replies; 19+ messages in thread
From: Tejun Heo @ 2022-02-14 17:50 UTC (permalink / raw)
  To: Imran Khan; +Cc: gregkh, linux-kernel

Hello,

On Mon, Feb 14, 2022 at 11:03:16PM +1100, Imran Khan wrote:
> +extern struct kernfs_global_locks *kernfs_locks;
> +
> +static inline struct mutex *kernfs_open_file_mutex_ptr(struct kernfs_node *kn)
> +{
> +	int idx = hash_ptr(kn, NR_KERNFS_LOCK_BITS);
> +
> +	return &kernfs_locks->open_file_mutex[idx].lock;
> +}
> +
> +static inline struct mutex *kernfs_open_file_mutex_lock(struct kernfs_node *kn)
> +{
> +	struct mutex *lock;
> +
> +	lock = kernfs_open_file_mutex_ptr(kn);
> +
> +	mutex_lock(lock);
> +
> +	return lock;
> +}

So, splitting the patches this way doesn't really help, because this patch
introduces code which isn't used and the second patch does all the
meaningful changes. It'd be better if the first patch introduced the
interface without changing the actual locking - ie. introduce and convert to
use kernfs_open_file_mutex*() but make it return the same old global mutex -
and the second patch then added the hashed locks and updated
kernfs_open_file_mutex*() to actually return them. This way, the
meaningful changes are split into two patches which can be verified
independently.
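The suggested two-step split could look roughly like the following hedged
userspace sketch, where POSIX mutexes stand in for the kernel's struct mutex
and all names are illustrative analogues of the kernfs ones, not the actual
kernel code:

```c
#include <assert.h>
#include <pthread.h>

/*
 * Step 1 of the suggested split: the new accessor exists and callers are
 * converted to it, but it still returns the one global mutex, so locking
 * behaviour is unchanged and the conversion can be reviewed on its own.
 * Step 2 would then only change kernfs_open_file_mutex_ptr() to hash @kn
 * into an array of mutexes.
 */
struct kernfs_node;			/* opaque for this sketch */

static pthread_mutex_t open_file_mutex = PTHREAD_MUTEX_INITIALIZER;

static pthread_mutex_t *kernfs_open_file_mutex_ptr(struct kernfs_node *kn)
{
	(void)kn;			/* step 2: hash kn here instead */
	return &open_file_mutex;
}

static pthread_mutex_t *kernfs_open_file_mutex_lock(struct kernfs_node *kn)
{
	pthread_mutex_t *lock = kernfs_open_file_mutex_ptr(kn);

	pthread_mutex_lock(lock);
	return lock;
}
```

With this shape, step 2 is a local change to one function rather than a
rewrite of every call site.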

Thanks.

-- 
tejun


* Re: [PATCH v6 6/7] kernfs: Introduce hashed rw-sem to replace per-fs kernfs_rwsem.
  2022-02-14 12:03 ` [PATCH v6 6/7] kernfs: Introduce hashed rw-sem to replace per-fs kernfs_rwsem Imran Khan
@ 2022-02-14 18:10   ` Tejun Heo
  2022-02-16  1:46   ` Al Viro
  1 sibling, 0 replies; 19+ messages in thread
From: Tejun Heo @ 2022-02-14 18:10 UTC (permalink / raw)
  To: Imran Khan; +Cc: gregkh, linux-kernel

On Mon, Feb 14, 2022 at 11:03:21PM +1100, Imran Khan wrote:
> +/**
> + * up_write_kernfs_rwsem_for_two_nodes() - Release hashed rwsem for 2 nodes
> + *
> + * @kn1: kernfs_node for which hashed rwsem needs to be released
> + * @kn2: kernfs_node for which hashed rwsem needs to be released
> + *
> + * In case of nested locking, rwsem with higher address is released first.
> + */
> +static inline void up_write_kernfs_rwsem_for_two_nodes(struct kernfs_node *kn1,
> +						       struct kernfs_node *kn2)
> +{
> +	struct rw_semaphore *rwsem1 = kernfs_rwsem_ptr(kn1);
> +	struct rw_semaphore *rwsem2 = kernfs_rwsem_ptr(kn2);
> +
> +	if (rwsem1 == rwsem2)
> +		up_write(rwsem1);
> +	else {
> +		if (rwsem1 > rwsem2) {
> +			up_write(rwsem1);
> +			up_write(rwsem2);
> +		} else {
> +			up_write(rwsem2);
> +			up_write(rwsem1);
> +		}
> +	}
> +
> +	kernfs_put(kn1);
> +	kernfs_put(kn2);
> +}

You don't need to order unlocks.
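The point - that only the *acquisition* order of the two locks matters for
deadlock avoidance, while releases can happen in any order - can be
illustrated with a small userspace analogue using POSIX mutexes (a hedged
sketch, not kernfs code):

```c
#include <assert.h>
#include <pthread.h>

/* Take two locks in a consistent order (lower address first). */
static void lock_pair_ordered(pthread_mutex_t *a, pthread_mutex_t *b)
{
	if (a == b) {
		pthread_mutex_lock(a);
		return;
	}
	if (a > b) {			/* always lower address first */
		pthread_mutex_t *t = a;
		a = b;
		b = t;
	}
	pthread_mutex_lock(a);
	pthread_mutex_lock(b);
}

/* Release order is irrelevant, so no address comparison is needed. */
static void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
	pthread_mutex_unlock(a);
	if (a != b)
		pthread_mutex_unlock(b);
}
```

Dropping the ordering from the unlock path removes one of the two branches
in the quoted helper without changing correctness.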

> +/**
> + * down_read_kernfs_rwsem_for_two_nodes() - Acquire hashed rwsem for 2 nodes
> + *
> + * @kn1: kernfs_node for which hashed rwsem needs to be taken
> + * @kn2: kernfs_node for which hashed rwsem needs to be taken
> + *
> + * In certain cases we need to acquire hashed rwsem for 2 nodes that don't have a
> + * parent child relationship. This is one of the cases of nested locking involving
> + * hashed rwsem and rwsem with lower address is acquired first.
> + */
> +static inline void down_read_kernfs_rwsem_for_two_nodes(struct kernfs_node *kn1,
> +							struct kernfs_node *kn2)

Maybe something like kernfs_down_read_double_nodes() is enough as the name?
up/down already imply rwsem.

> +static inline void down_read_kernfs_rwsem(struct kernfs_node *kn,
> +				      enum kernfs_rwsem_lock_pattern ptrn)
> +{
> +	struct rw_semaphore *p_rwsem = NULL;
> +	struct rw_semaphore *rwsem = kernfs_rwsem_ptr(kn);
> +	int lock_parent = 0;

bool?

> +
> +	if (ptrn == KERNFS_RWSEM_LOCK_SELF_AND_PARENT && kn->parent)

I wonder whether it'd be clearer to separate the double lock case into its
own function. The backend implementation being shared is fine but if we had
e.g. kernfs_down_read() and kernfs_down_read_double(), wouldn't that be
simpler?

> +		lock_parent = 1;
> +
> +	if (lock_parent)
> +		p_rwsem = kernfs_rwsem_ptr(kn->parent);
> +
> +	if (!lock_parent || rwsem == p_rwsem) {
> +		down_read_nested(rwsem, 0);
> +		kernfs_get(kn);
> +		kn->unlock_parent = 0;
> +	} else {
> +		/**
> +		 * In case of nested locking, locks are taken in order of their
> +		 * addresses. lock with lower address is taken first, followed
> +		 * by lock with higher address.
> +		 */
> +		if (rwsem < p_rwsem) {
> +			down_read_nested(rwsem, 0);
> +			down_read_nested(p_rwsem, 1);
> +		} else {
> +			down_read_nested(p_rwsem, 0);
> +			down_read_nested(rwsem, 1);
> +		}
> +		kernfs_get(kn);
> +		kernfs_get(kn->parent);
> +		kn->unlock_parent = 1;

I wouldn't put this inside kernfs_node. Either make the same decision
(whether it has a parent) in up() or return something which can be passed to
up() by the caller.

> +/**
> + * down_write_kernfs_rwsem_rename_ns() - take hashed rwsem during

kernfs_down_write_triple()?

> +static inline void up_write_kernfs_rwsem_rename_ns(struct kernfs_node *kn,
> +					struct kernfs_node *current_parent,
> +					struct kernfs_node *old_parent)
> +{
> +	struct rw_semaphore *array[3];
> +
> +	array[0] = kernfs_rwsem_ptr(kn);
> +	array[1] = kernfs_rwsem_ptr(current_parent);
> +	array[2] = kernfs_rwsem_ptr(old_parent);

So, if we had something like the following:

struct kernfs_rwsem_token {
       struct kernfs_node *held[3];
};

which the down functions return (probably as an out argument), wouldn't we be
able to share up() for all variants and make the code simpler?

> +static inline void down_read_kernfs_rwsem_rename_ns(struct kernfs_node *kn,
> +					struct kernfs_node *current_parent,
> +					struct kernfs_node *new_parent)
> +{
> +	struct rw_semaphore *array[3];
> +
> +	array[0] = kernfs_rwsem_ptr(kn);
> +	array[1] = kernfs_rwsem_ptr(current_parent);
> +	array[2] = kernfs_rwsem_ptr(new_parent);
> +
> +	if (array[0] == array[1] && array[0] == array[2]) {
> +		/* All 3 nodes hash to same rwsem */
> +		down_read_nested(array[0], 0);
> +	} else {
> +		/**
> +		 * All 3 nodes are not hashing to the same rwsem, so sort the
> +		 * array.
> +		 */
> +		kernfs_sort_rwsems(array);
> +
> +		if (array[0] == array[1] || array[1] == array[2]) {
> +			/**
> +			 * Two nodes hash to same rwsem, and these
> +			 * will occupy consecutive places in array after
> +			 * sorting.
> +			 */
> +			down_read_nested(array[0], 0);
> +			down_read_nested(array[2], 1);
> +		} else {
> +			/* All 3 nodes hashe to different rwsems */
> +			down_read_nested(array[0], 0);
> +			down_read_nested(array[1], 1);
> +			down_read_nested(array[2], 2);
> +		}
> +	}

How about factoring out "am I locking one, two or three?" into a function -
e.g. the sort function takes the array, sorts & uniqs it into a locking
token so that the down functions (for both double and triple) just do what
the token says.
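The sort & uniq step itself is plain pointer manipulation; a hedged sketch
(hypothetical helper name, not actual kernfs code) could look like this,
after which the down path would simply loop over the result with an
increasing lockdep subclass:

```c
#include <assert.h>

/*
 * Given up to three lock pointers, produce the distinct ones in ascending
 * address order and return how many remain (1, 2 or 3).
 */
static int sort_uniq_ptrs(const void *in[3], const void *out[3])
{
	int i, j, n = 0;

	/* uniq: skip pointers already seen */
	for (i = 0; i < 3; i++) {
		for (j = 0; j < n; j++)
			if (out[j] == in[i])
				break;
		if (j == n)
			out[n++] = in[i];
	}
	/* insertion sort by address; n <= 3 so this is trivial */
	for (i = 1; i < n; i++)
		for (j = i; j > 0 && out[j] < out[j - 1]; j--) {
			const void *t = out[j];
			out[j] = out[j - 1];
			out[j - 1] = t;
		}
	return n;
}
```

With the deduplicated, ordered array in hand, the three special cases in the
quoted code collapse into a single loop.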

Thanks.

-- 
tejun


* Re: [PATCH v6 0/7] kernfs: Introduce hashed mutexes to replace global kernfs_open_file_mutex.
  2022-02-14 12:03 [PATCH v6 0/7] kernfs: Introduce hashed mutexes to replace global kernfs_open_file_mutex Imran Khan
                   ` (6 preceding siblings ...)
  2022-02-14 12:03 ` [PATCH v6 7/7] kernfs: Replace per-fs rwsem with hashed ones Imran Khan
@ 2022-02-14 18:15 ` Tejun Heo
  2022-02-25  5:43   ` Imran Khan
  7 siblings, 1 reply; 19+ messages in thread
From: Tejun Heo @ 2022-02-14 18:15 UTC (permalink / raw)
  To: Imran Khan; +Cc: gregkh, linux-kernel

Hello,

The patchset generally looks good to me but I think it can be split a bit
better and the locking helper code can be made more compact. Also, it'd be
great to include the benchmark method and results in the commit log so that
they can be looked up easily later. They just have to be in one of the patch
descriptions - whether the first or the last of the series - and the rest
can refer to it.

Thanks.

-- 
tejun


* Re: [PATCH v6 6/7] kernfs: Introduce hashed rw-sem to replace per-fs kernfs_rwsem.
  2022-02-14 12:03 ` [PATCH v6 6/7] kernfs: Introduce hashed rw-sem to replace per-fs kernfs_rwsem Imran Khan
  2022-02-14 18:10   ` Tejun Heo
@ 2022-02-16  1:46   ` Al Viro
  2022-02-16  4:57     ` Imran Khan
  1 sibling, 1 reply; 19+ messages in thread
From: Al Viro @ 2022-02-16  1:46 UTC (permalink / raw)
  To: Imran Khan; +Cc: tj, gregkh, linux-kernel

On Mon, Feb 14, 2022 at 11:03:21PM +1100, Imran Khan wrote:

> +static inline void down_write_kernfs_rwsem_for_two_nodes(struct kernfs_node *kn1,
> +							 struct kernfs_node *kn2)
> +{
> +	struct rw_semaphore *rwsem1 = kernfs_rwsem_ptr(kn1);
> +	struct rw_semaphore *rwsem2 = kernfs_rwsem_ptr(kn2);
> +
> +	if (rwsem1 == rwsem2)
> +		down_write_nested(rwsem1, 0);
> +	else {
> +		if (rwsem1 < rwsem2) {
> +			down_write_nested(rwsem1, 0);
> +			down_write_nested(rwsem2, 1);
> +		} else {
> +			down_write_nested(rwsem2, 0);
> +			down_write_nested(rwsem1, 1);
> +		}
> +	}
> +	kernfs_get(kn1);
> +	kernfs_get(kn2);

	What's the last bit for?  Do you expect the caller to grab that,
then drop references, then pass the same references to up_write_.....?
Explain, please.  Is there any caller that would *not* have the matching
up_write_... in the same function, BTW?

Just in case - "defensive programming" is not a good answer (to just about any
question, TBH)...

> +static inline void down_write_kernfs_rwsem(struct kernfs_node *kn,
> +				      enum kernfs_rwsem_lock_pattern ptrn)
> +{
> +	struct rw_semaphore *p_rwsem = NULL;
> +	struct rw_semaphore *rwsem = kernfs_rwsem_ptr(kn);
> +	int lock_parent = 0;
> +
> +	if (ptrn == KERNFS_RWSEM_LOCK_SELF_AND_PARENT && kn->parent)
> +		lock_parent = 1;
> +
> +	if (lock_parent)
> +		p_rwsem = kernfs_rwsem_ptr(kn->parent);

Er...  And what's to prevent ->parent changing right under you?  For that
matter, why bother with that at all - what's wrong with your "lock two
nodes" primitive?  AFAICS, your current series has only one caller passing
that flag, so...

Note that the ->unlock_parent thing is also dubious - the matching up_write_...
for that sole caller is in the same function, so it could bloody well
choose between single-up and double-up on its own.  That would make the whole
thing simpler...

> +/**
> + * down_read_kernfs_rwsem() - Acquire hashed rwsem
> + *
> + * @kn: kernfs_node for which hashed rwsem needs to be taken
> + * @ptrn: locking pattern i.e whether to lock only given node or to lock
> + *	  node and its parent as well
> + *
> + * In case of nested locking, rwsem with lower address is acquired first.
> + *
> + * Return: void
> + */

... and this one, AFAICS, doesn't need ptrn at all - no callers that would ask to
lock parent.

> +static inline void kernfs_swap_rwsems(struct rw_semaphore **array, int i, int j)
> +{
> +	struct rw_semaphore *tmp;
> +
> +	tmp = array[i];
> +	array[i] = array[j];
> +	array[j] = tmp;
> +}
> +
> +static inline void kernfs_sort_rwsems(struct rw_semaphore **array)
> +{
> +	if (array[0] > array[1])
> +		kernfs_swap_rwsems(array, 0, 1);
> +
> +	if (array[0] > array[2])
> +		kernfs_swap_rwsems(array, 0, 2);
> +
> +	if (array[1] > array[2])
> +		kernfs_swap_rwsems(array, 1, 2);
> +}
> +
> +/**
> + * down_write_kernfs_rwsem_rename_ns() - take hashed rwsem during
> + * rename or similar operations.
> + *
> + * @kn: kernfs_node of interest
> + * @current_parent: current parent of kernfs_node of interest
> + * @new_parent: about to become new parent of kernfs_node
> + *
> + * During rename or similar operations the parent of a node changes,
> + * and this means we will see different parents of a kernfs_node at
> + * the time of taking and releasing its or its parent's hashed rwsem.
> + * This function separately takes locks corresponding to node, and
> + * corresponding to its current and future parents (if needed).
> + *
> + * Return: void
> + */
> +static inline void down_write_kernfs_rwsem_rename_ns(struct kernfs_node *kn,
> +					struct kernfs_node *current_parent,
> +					struct kernfs_node *new_parent)
> +{
> +	struct rw_semaphore *array[3];
> +
> +	array[0] = kernfs_rwsem_ptr(kn);
> +	array[1] = kernfs_rwsem_ptr(current_parent);
> +	array[2] = kernfs_rwsem_ptr(new_parent);
> +
> +	if (array[0] == array[1] && array[0] == array[2]) {
> +		/* All 3 nodes hash to same rwsem */
> +		down_write_nested(array[0], 0);
> +	} else {
> +		/**
> +		 * All 3 nodes are not hashing to the same rwsem, so sort the
> +		 * array.
> +		 */
> +		kernfs_sort_rwsems(array);
> +
> +		if (array[0] == array[1] || array[1] == array[2]) {
> +			/**
> +			 * Two nodes hash to same rwsem, and these
> +			 * will occupy consecutive places in array after
> +			 * sorting.
> +			 */
> +			down_write_nested(array[0], 0);
> +			down_write_nested(array[2], 1);
> +		} else {
> +			/* All 3 nodes hash to different rwsems */
> +			down_write_nested(array[0], 0);
> +			down_write_nested(array[1], 1);
> +			down_write_nested(array[2], 2);
> +		}
> +	}
> +
> +	kernfs_get(kn);
> +	kernfs_get(current_parent);
> +	kernfs_get(new_parent);
> +}

Holy shit...  And _that_ is supposed to be inlined?  Complete with sorting the
array, etc.?

<looks at the callers in the next patch>

-       root = kernfs_root(kn);
-       down_write(&root->kernfs_rwsem);
+       down_write_kernfs_rwsem_rename_ns(kn, kn->parent, new_parent);

This is rename.  The very thing that changes kn->parent.  Just what would
stop *ANOTHER* rename from modifying said kn->parent while you'd been
sleeping waiting for that rwsem?  It's not even a narrow race window...

I could believe that similar question about stability of ->parent in
case of KERNFS_RWSEM_LOCK_SELF_AND_PARENT handling might be dealt with
by something along the lines of "nothing is supposed to rename a node
while it's fed to kernfs_remove_self(), which is the only user of that",
but here we definitely *can* have racing renames.

> +/**
> + * down_read_kernfs_rwsem_rename_ns() - take hashed rwsem during
> + * rename or similar operations.
> + *
> + * @kn: kernfs_node of interest
> + * @current_parent: current parent of kernfs_node of interest
> + * @new_parent: about to become new parent of kernfs_node
> + *
> + * During rename or similar operations the parent of a node changes,
> + * and this means we will see different parents of a kernfs_node at
> + * the time of taking and releasing its or its parent's hashed rwsem.
> + * This function separately takes locks corresponding to node, and
> + * corresponding to its current and future parents (if needed).
> + *
> + * Return: void
> + */
> +static inline void down_read_kernfs_rwsem_rename_ns(struct kernfs_node *kn,
> +					struct kernfs_node *current_parent,
> +					struct kernfs_node *new_parent)

<tries to guess what use could *that* possibly have>
<fails>
<looks for callers (in a different mail, damnit)>
<none>
...

TBH, this is rather uninspiring.  I can easily believe that you've got
contention visibly reduced on a benchmark.  That'd only be interesting
if we had a clear proof of correctness.  And you don't even bother to
discuss your locking rules anywhere I can see in the series, cover letter
included...

I agree that sysfs locking had always been an atrocity - "serialize
everything, we can't be arsed to think of that" approach had been there
from the day one.  However, that's precisely the reason why replacement
needs a very careful analysis.  For all I know, you might've done just
that, but you are asking reviewers to reproduce that work independently.
Or to hope that you've gotten it right and nod through the entire thing.
Pardon me, but I know how easy it is to miss something while doing that
kind of work - been there, done that.

*That* is the part that needs review.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v6 6/7] kernfs: Introduce hashed rw-sem to replace per-fs kernfs_rwsem.
  2022-02-16  1:46   ` Al Viro
@ 2022-02-16  4:57     ` Imran Khan
  2022-02-18  3:25       ` Al Viro
  0 siblings, 1 reply; 19+ messages in thread
From: Imran Khan @ 2022-02-16  4:57 UTC (permalink / raw)
  To: Al Viro; +Cc: tj, gregkh, linux-kernel

Hello Al Viro,

Thanks for reviewing this.

On 16/2/22 12:46 pm, Al Viro wrote:
> On Mon, Feb 14, 2022 at 11:03:21PM +1100, Imran Khan wrote:
> 
>> +static inline void down_write_kernfs_rwsem_for_two_nodes(struct kernfs_node *kn1,
>> +							 struct kernfs_node *kn2)
>> +{
>> +	struct rw_semaphore *rwsem1 = kernfs_rwsem_ptr(kn1);
>> +	struct rw_semaphore *rwsem2 = kernfs_rwsem_ptr(kn2);
>> +
>> +	if (rwsem1 == rwsem2)
>> +		down_write_nested(rwsem1, 0);
>> +	else {
>> +		if (rwsem1 < rwsem2) {
>> +			down_write_nested(rwsem1, 0);
>> +			down_write_nested(rwsem2, 1);
>> +		} else {
>> +			down_write_nested(rwsem2, 0);
>> +			down_write_nested(rwsem1, 1);
>> +		}
>> +	}
>> +	kernfs_get(kn1);
>> +	kernfs_get(kn2);
> 
> 	What's the last bit for?  Do you expect the caller to grab that,
> then drop references, then pass the same references to up_write_.....?
> Explain, please.  Is there any caller that would *not* have the matching
> up_write_... in the same function, BTW?

Every caller of down_write_ has a matching up_write_. I am grabbing the
references for cases where the node gets deleted between down_write_ and
up_write_. For example, kernfs_remove currently invokes
down_write/up_write on the per-fs rwsem. This change uses the rwsem
corresponding to the kernfs_node being removed, so we may get a
situation where kernfs_remove --> __kernfs_remove --> kernfs_put results
in deletion of the kernfs_node, but the subsequent up_write_ from
kernfs_remove ends up using a freed kernfs_node.
> 
> Just in case - "defensive programming" is not a good answer (to just about any
> question, TBH)...
> 
>> +static inline void down_write_kernfs_rwsem(struct kernfs_node *kn,
>> +				      enum kernfs_rwsem_lock_pattern ptrn)
>> +{
>> +	struct rw_semaphore *p_rwsem = NULL;
>> +	struct rw_semaphore *rwsem = kernfs_rwsem_ptr(kn);
>> +	int lock_parent = 0;
>> +
>> +	if (ptrn == KERNFS_RWSEM_LOCK_SELF_AND_PARENT && kn->parent)
>> +		lock_parent = 1;
>> +
>> +	if (lock_parent)
>> +		p_rwsem = kernfs_rwsem_ptr(kn->parent);
> 
> Er...  And what's to prevent ->parent changing right under you?  For that
> matter, why bother with that at all - what's wrong with your "lock two
> nodes" primitive?  AFAICS, your current series has only one caller passing
> that flag, so...
> 

Agreed. The "lock two nodes" primitive could have been used here to make
sure that I work with the same parent. I will make this change in the
next version.


> Note that ->unlock_parent thing is also dubious - the matching up_write_...
> for that sole caller is in the same function, so it could bloody well
> choose between single-up and double-up on its own.  Would make the whole
> thing simpler...

Tejun raised a similar concern in his last feedback, and I am working
towards using the "double node locking" primitives for cases where both
the parent and the node need locking.
> 
>> +/**
>> + * down_read_kernfs_rwsem() - Acquire hashed rwsem
>> + *
>> + * @kn: kernfs_node for which hashed rwsem needs to be taken
>> + * @ptrn: locking pattern i.e whether to lock only given node or to lock
>> + *	  node and its parent as well
>> + *
>> + * In case of nested locking, rwsem with lower address is acquired first.
>> + *
>> + * Return: void
>> + */
> 
> ... and this one, AFAICS, doesn't need ptrn at all - no callers that would ask to
> lock parent.

Sure. I will get rid of this in the next version.
> 
>> +static inline void kernfs_swap_rwsems(struct rw_semaphore **array, int i, int j)
[...]
>> +}
>> +
>> +/**
>> + * down_write_kernfs_rwsem_rename_ns() - take hashed rwsem during
>> + * rename or similar operations.
>> + *
>> + * @kn: kernfs_node of interest
>> + * @current_parent: current parent of kernfs_node of interest
>> + * @new_parent: about to become new parent of kernfs_node
>> + *
>> + * During rename or similar operations the parent of a node changes,
>> + * and this means we will see different parents of a kernfs_node at
>> + * the time of taking and releasing its or its parent's hashed rwsem.
>> + * This function separately takes locks corresponding to node, and
>> + * corresponding to its current and future parents (if needed).
>> + *
>> + * Return: void
>> + */
>> +static inline void down_write_kernfs_rwsem_rename_ns(struct kernfs_node *kn,
>> +					struct kernfs_node *current_parent,
>> +					struct kernfs_node *new_parent)
>> +{
>> +	struct rw_semaphore *array[3];
>> +
>> +	array[0] = kernfs_rwsem_ptr(kn);
>> +	array[1] = kernfs_rwsem_ptr(current_parent);
>> +	array[2] = kernfs_rwsem_ptr(new_parent);
>> +
>> +	if (array[0] == array[1] && array[0] == array[2]) {
>> +		/* All 3 nodes hash to same rwsem */
>> +		down_write_nested(array[0], 0);
>> +	} else {
>> +		/**
>> +		 * All 3 nodes are not hashing to the same rwsem, so sort the
>> +		 * array.
>> +		 */
>> +		kernfs_sort_rwsems(array);
>> +
>> +		if (array[0] == array[1] || array[1] == array[2]) {
>> +			/**
>> +			 * Two nodes hash to same rwsem, and these
>> +			 * will occupy consecutive places in array after
>> +			 * sorting.
>> +			 */
>> +			down_write_nested(array[0], 0);
>> +			down_write_nested(array[2], 1);
>> +		} else {
>> +			/* All 3 nodes hash to different rwsems */
>> +			down_write_nested(array[0], 0);
>> +			down_write_nested(array[1], 1);
>> +			down_write_nested(array[2], 2);
>> +		}
>> +	}
>> +
>> +	kernfs_get(kn);
>> +	kernfs_get(current_parent);
>> +	kernfs_get(new_parent);
>> +}
> 
> Holy shit...  And _that_ is supposed to be inlined?  Complete with sorting the
> array, etc.?
> 

Sorry about this. I will correct it in the next version. I have received
some suggestions from Tejun about making these interfaces more compact,
and I will make these multi-node interfaces non-inline as well.

> <looks at the callers in the next patch>
> 
> -       root = kernfs_root(kn);
> -       down_write(&root->kernfs_rwsem);
> +       down_write_kernfs_rwsem_rename_ns(kn, kn->parent, new_parent);
> 
> This is rename.  The very thing that changes kn->parent.  Just what would
> stop *ANOTHER* rename from modifying said kn->parent while you'd been
> sleeping waiting for that rwsem?  It's not even a narrow race window...
> 
> I could believe that similar question about stability of ->parent in
> case of KERNFS_RWSEM_LOCK_SELF_AND_PARENT handling might be dealt with
> by something along the lines of "nothing is supposed to rename a node
> while it's fed to kernfs_remove_self(), which is the only user of that",
> but here we definitely *can* have racing renames.
> 

Agreed. I missed that kn->parent can change while we wait for the rwsem.
Could you please let me know if the following approach is acceptable in
this case:

1. Note kn->parent.
2. Invoke down_write_.
3. After acquiring the rwsems, check whether the current kn->parent is
the same as the one noted earlier; if so, proceed, otherwise release the
rwsems and retry the down_write_ as per the current kn->parent.

Once we have taken the rwsems and found that kn->parent did not change
during the wait, we can proceed safely until the corresponding up_write_.

Regarding kernfs_remove_self, yes, that was the idea, but as mentioned
earlier, I will use the "double node locking" primitives in the next
version.


>> +/**
>> + * down_read_kernfs_rwsem_rename_ns() - take hashed rwsem during
>> + * rename or similar operations.
>> + *
>> + * @kn: kernfs_node of interest
>> + * @current_parent: current parent of kernfs_node of interest
>> + * @new_parent: about to become new parent of kernfs_node
>> + *
>> + * During rename or similar operations the parent of a node changes,
>> + * and this means we will see different parents of a kernfs_node at
>> + * the time of taking and releasing its or its parent's hashed rwsem.
>> + * This function separately takes locks corresponding to node, and
>> + * corresponding to its current and future parents (if needed).
>> + *
>> + * Return: void
>> + */
>> +static inline void down_read_kernfs_rwsem_rename_ns(struct kernfs_node *kn,
>> +					struct kernfs_node *current_parent,
>> +					struct kernfs_node *new_parent)
> 
> <tries to guess what use could *that* possibly have>
> <fails>
> <looks for callers (in a different mail, damnit)>
> <none>
> ...
> 
> TBH, this is rather uninspiring.  I can easily believe that you've got
> contention visibly reduced on a benchmark.  That'd only be interesting
> if we had a clear proof of correctness.  And you don't even bother to
> discuss your locking rules anywhere I can see in the series, cover letter
> included...
> 

I have run my tests with lockdep enabled as well. Could you please let
me know if there is something more that can be done as proof of
correctness? As sanity tests for the patches, my current tests include
the LTP sysfs tests, CPU hotplugging, and cgroup access/creation/removal.
I am also booting Oracle Linux, which involves cgroupfs access (via
systemd). If you could suggest some other tests, I can execute those as
well.

Also, regarding the locking rules, I am not sure where to describe them.
If I put accompanying comments at all the places where I take the hashed
rwsems, explaining why the lock corresponding to a node or to its parent
is being taken, will that be enough as a description of the locking
rules?

> I agree that sysfs locking had always been an atrocity - "serialize
> everything, we can't be arsed to think of that" approach had been there
> from the day one.  However, that's precisely the reason why replacement
> needs a very careful analysis.  For all I know, you might've done just
> that, but you are asking reviewers to reproduce that work independently.
> Or to hope that you've gotten it right and nod through the entire thing.
> Pardon me, but I know how easy it is to miss something while doing that
> kind of work - been there, done that.

I understand your concern, but I am not asking reviewers to reproduce
this work independently :). If I can find out what further things can or
should be tested in this regard, I will run those tests. Since the start
of this work I have also been running my tests with lockdep and KASAN
enabled, and those runs have flagged some issues which I have addressed.

I will certainly address your review comments in the next version, but
in the meantime, if you have any suggestions regarding testing or the
description of the locking rules, please let me know.

Thanks a lot for having a look.
 -- Imran

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [kbuild-all] Re: [PATCH v6 7/7] kernfs: Replace per-fs rwsem with hashed ones.
  2022-02-14 17:49   ` Nathan Chancellor
@ 2022-02-16  8:57     ` Chen, Rong A
  0 siblings, 0 replies; 19+ messages in thread
From: Chen, Rong A @ 2022-02-16  8:57 UTC (permalink / raw)
  To: Nathan Chancellor, Imran Khan, kernel test robot
  Cc: tj, gregkh, linux-kernel, llvm, kbuild-all



On 2/15/2022 1:49 AM, Nathan Chancellor wrote:
> On Tue, Feb 15, 2022 at 01:19:23AM +0800, kernel test robot wrote:
>> Hi Imran,
>>
>> Thank you for the patch! Perhaps something to improve:
>>
>> [auto build test WARNING on 6d9bd4ad4ca08b1114e814c2c42383b8b13be631]
>>
>> url:    https://github.com/0day-ci/linux/commits/Imran-Khan/kernfs-Introduce-hashed-mutexes-to-replace-global-kernfs_open_file_mutex/20220214-200700
>> base:   6d9bd4ad4ca08b1114e814c2c42383b8b13be631
>> config: hexagon-randconfig-r012-20220213 (https://download.01.org/0day-ci/archive/20220214/202202142322.OezS0EtW-lkp@intel.com/config)
>> compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project ea071884b0cc7210b3cc5fe858f0e892a779a23b)
>> reproduce (this is a W=1 build):
>>          wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>>          chmod +x ~/bin/make.cross
>>          # https://github.com/0day-ci/linux/commit/48eab913f80f1555d57b964ef7cdba17f5ec5d1f
>>          git remote add linux-review https://github.com/0day-ci/linux
>>          git fetch --no-tags linux-review Imran-Khan/kernfs-Introduce-hashed-mutexes-to-replace-global-kernfs_open_file_mutex/20220214-200700
>>          git checkout 48eab913f80f1555d57b964ef7cdba17f5ec5d1f
>>          # save the config file to linux build tree
>>          mkdir build_dir
>>          COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=hexagon SHELL=/bin/bash fs/kernfs/
>>
>> If you fix the issue, kindly add following tag as appropriate
>> Reported-by: kernel test robot <lkp@intel.com>
>>
>> All warnings (new ones prefixed by >>):
>>
>>>> fs/kernfs/dir.c:1619:7: warning: variable 'old_parent' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
>>                     if (!new_name)
>>                         ^~~~~~~~~
>>     fs/kernfs/dir.c:1653:50: note: uninitialized use occurs here
>>             up_write_kernfs_rwsem_rename_ns(kn, new_parent, old_parent);
>>                                                             ^~~~~~~~~~
>>     fs/kernfs/dir.c:1619:3: note: remove the 'if' if its condition is always false
>>                     if (!new_name)
>>                     ^~~~~~~~~~~~~~
>>     fs/kernfs/dir.c:1612:6: warning: variable 'old_parent' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
>>             if (kernfs_find_ns(new_parent, new_name, new_ns))
>>                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>     fs/kernfs/dir.c:1653:50: note: uninitialized use occurs here
>>             up_write_kernfs_rwsem_rename_ns(kn, new_parent, old_parent);
>>                                                             ^~~~~~~~~~
>>     fs/kernfs/dir.c:1612:2: note: remove the 'if' if its condition is always false
>>             if (kernfs_find_ns(new_parent, new_name, new_ns))
>>             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>     fs/kernfs/dir.c:1607:6: warning: variable 'old_parent' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
>>             if ((kn->parent == new_parent) && (kn->ns == new_ns) &&
>>                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>     fs/kernfs/dir.c:1653:50: note: uninitialized use occurs here
>>             up_write_kernfs_rwsem_rename_ns(kn, new_parent, old_parent);
>>                                                             ^~~~~~~~~~
>>     fs/kernfs/dir.c:1607:2: note: remove the 'if' if its condition is always false
>>             if ((kn->parent == new_parent) && (kn->ns == new_ns) &&
>>             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>     fs/kernfs/dir.c:1602:6: warning: variable 'old_parent' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
>>             if (!kernfs_active(kn) || !kernfs_active(new_parent) ||
>>                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>     fs/kernfs/dir.c:1653:50: note: uninitialized use occurs here
>>             up_write_kernfs_rwsem_rename_ns(kn, new_parent, old_parent);
>>                                                             ^~~~~~~~~~
>>     fs/kernfs/dir.c:1602:2: note: remove the 'if' if its condition is always false
>>             if (!kernfs_active(kn) || !kernfs_active(new_parent) ||
>>             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>> fs/kernfs/dir.c:1602:6: warning: variable 'old_parent' is used uninitialized whenever '||' condition is true [-Wsometimes-uninitialized]
>>             if (!kernfs_active(kn) || !kernfs_active(new_parent) ||
>>                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>     fs/kernfs/dir.c:1653:50: note: uninitialized use occurs here
>>             up_write_kernfs_rwsem_rename_ns(kn, new_parent, old_parent);
>>                                                             ^~~~~~~~~~
>>     fs/kernfs/dir.c:1602:6: note: remove the '||' if its condition is always false
>>             if (!kernfs_active(kn) || !kernfs_active(new_parent) ||
>>                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>> fs/kernfs/dir.c:1602:6: warning: variable 'old_parent' is used uninitialized whenever '||' condition is true [-Wsometimes-uninitialized]
>>             if (!kernfs_active(kn) || !kernfs_active(new_parent) ||
>>                 ^~~~~~~~~~~~~~~~~~
>>     fs/kernfs/dir.c:1653:50: note: uninitialized use occurs here
>>             up_write_kernfs_rwsem_rename_ns(kn, new_parent, old_parent);
>>                                                             ^~~~~~~~~~
>>     fs/kernfs/dir.c:1602:6: note: remove the '||' if its condition is always false
>>             if (!kernfs_active(kn) || !kernfs_active(new_parent) ||
>>                 ^~~~~~~~~~~~~~~~~~~~~
>>     fs/kernfs/dir.c:1591:32: note: initialize the variable 'old_parent' to silence this warning
>>             struct kernfs_node *old_parent;
>>                                           ^
>>                                            = NULL
>>     6 warnings generated.
>>
>>
>> vim +1619 fs/kernfs/dir.c
>>
>> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1580
>> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1581  /**
>> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1582   * kernfs_rename_ns - move and rename a kernfs_node
>> 324a56e16e44ba Tejun Heo          2013-12-11  1583   * @kn: target node
>> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1584   * @new_parent: new parent to put @sd under
>> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1585   * @new_name: new name
>> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1586   * @new_ns: new namespace tag
>> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1587   */
>> 324a56e16e44ba Tejun Heo          2013-12-11  1588  int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
>> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1589  		     const char *new_name, const void *new_ns)
>> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1590  {
>> 3eef34ad7dc369 Tejun Heo          2014-02-07  1591  	struct kernfs_node *old_parent;
>> 3eef34ad7dc369 Tejun Heo          2014-02-07  1592  	const char *old_name = NULL;
>> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1593  	int error;
>> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1594
>> 3eef34ad7dc369 Tejun Heo          2014-02-07  1595  	/* can't move or rename root */
>> 3eef34ad7dc369 Tejun Heo          2014-02-07  1596  	if (!kn->parent)
>> 3eef34ad7dc369 Tejun Heo          2014-02-07  1597  		return -EINVAL;
>> 3eef34ad7dc369 Tejun Heo          2014-02-07  1598
>> 48eab913f80f15 Imran Khan         2022-02-14  1599  	down_write_kernfs_rwsem_rename_ns(kn, kn->parent, new_parent);
>> 798c75a0d44cdb Greg Kroah-Hartman 2014-01-13  1600
>> d0ae3d4347ee02 Tejun Heo          2013-12-11  1601  	error = -ENOENT;
>> ea015218f2f7ac Eric W. Biederman  2015-05-13 @1602  	if (!kernfs_active(kn) || !kernfs_active(new_parent) ||
>> ea015218f2f7ac Eric W. Biederman  2015-05-13  1603  	    (new_parent->flags & KERNFS_EMPTY_DIR))
>> d0ae3d4347ee02 Tejun Heo          2013-12-11  1604  		goto out;
>> d0ae3d4347ee02 Tejun Heo          2013-12-11  1605
>> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1606  	error = 0;
>> adc5e8b58f4886 Tejun Heo          2013-12-11  1607  	if ((kn->parent == new_parent) && (kn->ns == new_ns) &&
>> adc5e8b58f4886 Tejun Heo          2013-12-11  1608  	    (strcmp(kn->name, new_name) == 0))
>> 798c75a0d44cdb Greg Kroah-Hartman 2014-01-13  1609  		goto out;	/* nothing to rename */
>> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1610
>> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1611  	error = -EEXIST;
>> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1612  	if (kernfs_find_ns(new_parent, new_name, new_ns))
>> 798c75a0d44cdb Greg Kroah-Hartman 2014-01-13  1613  		goto out;
>> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1614
>> 324a56e16e44ba Tejun Heo          2013-12-11  1615  	/* rename kernfs_node */
>> adc5e8b58f4886 Tejun Heo          2013-12-11  1616  	if (strcmp(kn->name, new_name) != 0) {
>> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1617  		error = -ENOMEM;
>> 75287a677ba1be Andrzej Hajda      2015-02-13  1618  		new_name = kstrdup_const(new_name, GFP_KERNEL);
>> fd7b9f7b9776b1 Tejun Heo          2013-11-28 @1619  		if (!new_name)
>> 798c75a0d44cdb Greg Kroah-Hartman 2014-01-13  1620  			goto out;
>> 3eef34ad7dc369 Tejun Heo          2014-02-07  1621  	} else {
>> 3eef34ad7dc369 Tejun Heo          2014-02-07  1622  		new_name = NULL;
>> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1623  	}
>> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1624
>> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1625  	/*
>> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1626  	 * Move to the appropriate place in the appropriate directories rbtree.
>> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1627  	 */
>> c637b8acbe079e Tejun Heo          2013-12-11  1628  	kernfs_unlink_sibling(kn);
>> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1629  	kernfs_get(new_parent);
>> 3eef34ad7dc369 Tejun Heo          2014-02-07  1630
>> 3eef34ad7dc369 Tejun Heo          2014-02-07  1631  	/* rename_lock protects ->parent and ->name accessors */
>> 3eef34ad7dc369 Tejun Heo          2014-02-07  1632  	spin_lock_irq(&kernfs_rename_lock);
>> 3eef34ad7dc369 Tejun Heo          2014-02-07  1633
>> 3eef34ad7dc369 Tejun Heo          2014-02-07  1634  	old_parent = kn->parent;
>> adc5e8b58f4886 Tejun Heo          2013-12-11  1635  	kn->parent = new_parent;
>> 3eef34ad7dc369 Tejun Heo          2014-02-07  1636
>> 3eef34ad7dc369 Tejun Heo          2014-02-07  1637  	kn->ns = new_ns;
>> 3eef34ad7dc369 Tejun Heo          2014-02-07  1638  	if (new_name) {
>> 3eef34ad7dc369 Tejun Heo          2014-02-07  1639  		old_name = kn->name;
>> 3eef34ad7dc369 Tejun Heo          2014-02-07  1640  		kn->name = new_name;
>> 3eef34ad7dc369 Tejun Heo          2014-02-07  1641  	}
>> 3eef34ad7dc369 Tejun Heo          2014-02-07  1642
>> 3eef34ad7dc369 Tejun Heo          2014-02-07  1643  	spin_unlock_irq(&kernfs_rename_lock);
>> 3eef34ad7dc369 Tejun Heo          2014-02-07  1644
>> 9561a8961c708f Tejun Heo          2014-02-10  1645  	kn->hash = kernfs_name_hash(kn->name, kn->ns);
>> c637b8acbe079e Tejun Heo          2013-12-11  1646  	kernfs_link_sibling(kn);
>> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1647
>> 3eef34ad7dc369 Tejun Heo          2014-02-07  1648  	kernfs_put(old_parent);
>> 75287a677ba1be Andrzej Hajda      2015-02-13  1649  	kfree_const(old_name);
>> 3eef34ad7dc369 Tejun Heo          2014-02-07  1650
>> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1651  	error = 0;
>> ae34372eb8408b Tejun Heo          2014-01-10  1652   out:
>> 48eab913f80f15 Imran Khan         2022-02-14  1653  	up_write_kernfs_rwsem_rename_ns(kn, new_parent, old_parent);
>> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1654  	return error;
>> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1655  }
>> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1656
>>
>> ---
>> 0-DAY CI Kernel Test Service, Intel Corporation
>> https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
>>
> 
> I am forwarding this along as a reply to the original posting so that
> this patch series is not picked up in its current form. This warning
> looks legitimate to me and it will break allmodconfig due to -Werror.
> 
> Intel folks, why did this get sent to the author privately, rather than
> as a reply to the original posting, given the patch is still in review?

Hi Nathan,

Thanks for the notice; it seems to be a bug in our system, and we will fix it asap.

Best Regards,
Rong Chen

> 
> Cheers,
> Nathan
> _______________________________________________
> kbuild-all mailing list -- kbuild-all@lists.01.org
> To unsubscribe send an email to kbuild-all-leave@lists.01.org
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v6 6/7] kernfs: Introduce hashed rw-sem to replace per-fs kernfs_rwsem.
  2022-02-16  4:57     ` Imran Khan
@ 2022-02-18  3:25       ` Al Viro
  2022-02-22 18:09         ` Tejun Heo
  2022-02-25  5:52         ` Imran Khan
  0 siblings, 2 replies; 19+ messages in thread
From: Al Viro @ 2022-02-18  3:25 UTC (permalink / raw)
  To: Imran Khan; +Cc: tj, gregkh, linux-kernel

On Wed, Feb 16, 2022 at 03:57:09PM +1100, Imran Khan wrote:

> Agree. I missed the point of changing parent during wait of rwsem. Could
> you please let me know if following approach is acceptable in this case:
> 
> 1. Note kn->parent
> 2. down_write_
> 3. After acquiring the rwsems check if current kn->parent is same as the
> earlier noted one and if so proceed otherwise invoke down_write_ again
> as per current kn->parent.
> 
> Once we have taken the rwsems and found that kn->parent did not change
> during wait we can proceed safely till corresponding up_write_

Maybe...  TBH, kernfs_rename_ns() calling conventions suck; "move this
(wherever it is) there under that name", especially combined with
topology-modifying moves, makes life very unpleasant for callers who
want to use it safely.  And I'm not at all sure they manage (or care to)
use it safely...

There are very few sources of cross-directory moves in the entire system.
One is cross-directory cgroup rename(2) (already serialized on per-fs basis
on VFS level), another is device_move().  Which is uncommon (5 callers
in the entire system, one in drivers/media/usb/pvrusb2/, another in
drivers/s390/cio, 3 more in net/bluetooth) and AFAICS unsafe wrt e.g.
kobject_get_path().  Not for kernfs data structures - unsafe use of
kobject->parent pointers.  I could be wrong - that's just from some grepping
around, but it looks like a use-after-free race in the best case.

So changes of kn->parent can be considered a very cold path.  Just make sure
you pin the damn thing, so it won't disappear from under you while you
are grabbing the locks.

> I have run my tests with lockdep enabled as well. Could you please let
> me know if there is something that can be done as proof of correctness.
> For sanity of patches, my current tests include LTP sysfs tests, CPU
> hotplugging and cgroup access/creation/removal. I am booting oracle
> linux as well which involves cgroupfs access(via systemd). If you could
> suggest some other tests I can execute those as well.
> 
> Also, regarding locking rules, I am not sure where to document them.
> If I put accompanying comments, at all the places where I am taking
> hashed rwsems, to explain why the lock corresponding to a node or to
> its parent is being taken, will that be enough as a description of
> the locking rules?

A separate text, along the lines of "foo->bar is modified only when we
are holding this, this and that; any of those is sufficient to stabilize
it.  Locking order is such-and-such.  Such-and-such predicate is guaranteed
to be true when you acquire this lock and must be true again by the time
you drop it.", etc.  Some of that might go into the comments somewhere
in the source (e.g. when it's about data structures dealt with in only
one file, or in the header where the data structures are declared), and
some into a separate text, perhaps in fs/kernfs/, perhaps in Documentation/filesystems/.

And, of course, there's the cover letter of series and commit messages.

> > I agree that sysfs locking had always been an atrocity - "serialize
> > everything, we can't be arsed to think of that" approach had been there
> > from the day one.  However, that's precisely the reason why replacement
> > needs a very careful analysis.  For all I know, you might've done just
> > that, but you are asking reviewers to reproduce that work independently.
> > Or to hope that you've gotten it right and nod through the entire thing.
> > Pardon me, but I know how easy it is to miss something while doing that
> > kind of work - been there, done that.
> 
> I understand your concern but I am not asking reviewers to reproduce
> this work independently :). If I can get to know what else can or
> should be tested in this regard, I can do those tests.
> Since the start of this work, I have been also running my tests with
> lockdep and KASAN enabled as well and those tests have flagged some
> issues which I have addressed.
> 
> I will certainly address your review comments in next version but in the
> meantime if you have some suggestions regarding testing or description
> of locking rules, please let me know.

Tests are useful, but if we can't reason about the behaviour of code...

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v6 6/7] kernfs: Introduce hashed rw-sem to replace per-fs kernfs_rwsem.
  2022-02-18  3:25       ` Al Viro
@ 2022-02-22 18:09         ` Tejun Heo
  2022-02-25  5:52         ` Imran Khan
  1 sibling, 0 replies; 19+ messages in thread
From: Tejun Heo @ 2022-02-22 18:09 UTC (permalink / raw)
  To: Al Viro; +Cc: Imran Khan, gregkh, linux-kernel

On Fri, Feb 18, 2022 at 03:25:31AM +0000, Al Viro wrote:
> There are very few sources of cross-directory moves in the entire system.
> One is cross-directory cgroup rename(2) (already serialized on per-fs basis
> on VFS level), another is device_move().  Which is uncommon (5 callers

FWIW, cgroup rename(2) doesn't allow changing the parent.

Thanks.

-- 
tejun


* Re: [PATCH v6 0/7] kernfs: Introduce hashed mutexes to replace global kernfs_open_file_mutex.
  2022-02-14 18:15 ` [PATCH v6 0/7] kernfs: Introduce hashed mutexes to replace global kernfs_open_file_mutex Tejun Heo
@ 2022-02-25  5:43   ` Imran Khan
  0 siblings, 0 replies; 19+ messages in thread
From: Imran Khan @ 2022-02-25  5:43 UTC (permalink / raw)
  To: Tejun Heo; +Cc: gregkh, linux-kernel

Hello Tejun,

Thanks again for reviewing this. I have refactored the code.
I have also added a document to describe lock usage and proof
of correctness in the new patch set at [1]. This took some time,
but it should highlight anything I have misunderstood while
making these changes.

[1]
https://lore.kernel.org/lkml/20220225052116.1243150-1-imran.f.khan@oracle.com/

Thanks
 -- Imran

On 15/2/22 5:15 am, Tejun Heo wrote:
> Hello,
> 
> The patchset generally looks good to me but I think it can be split a bit
> better and the locking helper code can be more compacted. Also, it'd be
> great to include the benchmark method and result in the commit log so that it
> can be looked up easily later. It just has to be in one of the patch
> descriptions whether the first or last of the series and the rest can refer
> to it.
> 
> Thanks.
> 


* Re: [PATCH v6 6/7] kernfs: Introduce hashed rw-sem to replace per-fs kernfs_rwsem.
  2022-02-18  3:25       ` Al Viro
  2022-02-22 18:09         ` Tejun Heo
@ 2022-02-25  5:52         ` Imran Khan
  1 sibling, 0 replies; 19+ messages in thread
From: Imran Khan @ 2022-02-25  5:52 UTC (permalink / raw)
  To: Al Viro; +Cc: tj, gregkh, linux-kernel

Hello Al,

Thanks for the feedback and suggestions. I have addressed these in v7 of
the patchset at [1].

On 18/2/22 2:25 pm, Al Viro wrote:
> On Wed, Feb 16, 2022 at 03:57:09PM +1100, Imran Khan wrote:
> 
[...]
> 
> Maybe...  TBH, kernfs_remove_ns() calling conventions suck; "move this
> (wherever it is) there under that name", especially combined with
> topology-modifying moves, makes life very unpleasant for callers who
> want to use it safely.  And I'm not at all sure they manage to (or care to)
> use it safely...
> 
> There are very few sources of cross-directory moves in the entire system.
> One is cross-directory cgroup rename(2) (already serialized on per-fs basis
> on VFS level), another is device_move().  Which is uncommon (5 callers
> in the entire system, one in drivers/media/usb/pvrusb2/, another in
> drivers/s390/cio, 3 more in net/bluetooth) and AFAICS unsafe wrt e.g.
> kobject_get_path().  Not for kernfs data structures - unsafe use of
> kobject->parent pointers.  I could be wrong - that's just from some grepping
> around, but it looks like a use-after-free race in the best case.
> 
> So changes of kn->parent can be considered a very cold path.  Just make sure
> you pin the damn thing, so it won't disappear from under you while you
> are grabbing the locks.
> 

Done.

>> I have run my tests with lockdep enabled as well. Could you please let
>> me know if there is something that can be done as proof of correctness.
>> For sanity of patches, my current tests include LTP sysfs tests, CPU
>> hotplugging and cgroup access/creation/removal. I am booting Oracle
>> Linux as well, which involves cgroupfs access (via systemd). If you could
>> suggest some other tests I can execute those as well.
>>
>> Also, regarding locking rules, I am not sure where to document them.
>> If I put accompanying comments, at all the places where I am taking
>> hashed rwsems, to explain why the lock corresponding to a node or to
>> its parent is being taken, will that be enough as a description of
>> the locking rules?
> 
> A separate text, along the lines of "foo->bar is modified only when we
> are holding this, this and that; any of those is sufficient to stabilize
> it.  Locking order is such-and-such.  Such-and-such predicate is guaranteed
> to be true when you acquire this lock and must be true again by the time
> you drop it.", etc.  Some of that might go into the comments somewhere
> in the source (e.g. when it's about data structures dealt with in only
> one file, or in the header where the data structures are declared), and
> some into a separate text, perhaps in fs/kernfs/, perhaps in Documentation/filesystems/.
> 
> And, of course, there's the cover letter of series and commit messages.
> 

I have added a document to describe usage and proof of correctness for
hashed locks. This should highlight if I have misunderstood something or
if I have overlooked some scenarios while making these changes. I will
await your feedback regarding this.

Thanks
 -- Imran

[1]:
https://lore.kernel.org/lkml/20220225052116.1243150-1-imran.f.khan@oracle.com/


end of thread, other threads:[~2022-02-25  5:53 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-14 12:03 [PATCH v6 0/7] kernfs: Introduce hashed mutexes to replace global kernfs_open_file_mutex Imran Khan
2022-02-14 12:03 ` [PATCH v6 1/7] " Imran Khan
2022-02-14 17:50   ` Tejun Heo
2022-02-14 12:03 ` [PATCH v6 2/7] kernfs: Replace global kernfs_open_file_mutex with hashed mutexes Imran Khan
2022-02-14 12:03 ` [PATCH v6 3/7] kernfs: Introduce hashed spinlocks to replace global kernfs_open_node_lock Imran Khan
2022-02-14 12:03 ` [PATCH v6 4/7] kernfs: Replace global kernfs_open_node_lock with hashed spinlocks Imran Khan
2022-02-14 12:03 ` [PATCH v6 5/7] kernfs: Use a per-fs rwsem to protect per-fs list of kernfs_super_info Imran Khan
2022-02-14 12:03 ` [PATCH v6 6/7] kernfs: Introduce hashed rw-sem to replace per-fs kernfs_rwsem Imran Khan
2022-02-14 18:10   ` Tejun Heo
2022-02-16  1:46   ` Al Viro
2022-02-16  4:57     ` Imran Khan
2022-02-18  3:25       ` Al Viro
2022-02-22 18:09         ` Tejun Heo
2022-02-25  5:52         ` Imran Khan
2022-02-14 12:03 ` [PATCH v6 7/7] kernfs: Replace per-fs rwsem with hashed ones Imran Khan
2022-02-14 17:49   ` Nathan Chancellor
2022-02-16  8:57     ` [kbuild-all] " Chen, Rong A
2022-02-14 18:15 ` [PATCH v6 0/7] kernfs: Introduce hashed mutexes to replace global kernfs_open_file_mutex Tejun Heo
2022-02-25  5:43   ` Imran Khan
