* [PATCH v8 00/10] kernfs: Remove reference counting for kernfs_open_node.
From: Imran Khan @ 2022-04-10  2:37 UTC
  To: tj, viro, gregkh, ebiederm; +Cc: linux-kernel

Reduce contention around global locks used in kernfs.

The current kernfs design makes use of 3 global locks to synchronize
various operations. There are a few other global locks as well but
they are used for specific cases and hence don't cause severe contention.

The above mentioned 3 main global locks used in kernfs are:

1. A global mutex, kernfs_open_file_mutex, to protect the list of
kernfs_open_file instances corresponding to a sysfs attribute.

2. A global spinlock, kernfs_open_node_lock, to protect
kernfs_node->attr.open which points to kernfs_open_node instance
corresponding to a kernfs_node.

3. A per-fs rw semaphore, root->kernfs_rwsem, to synchronize most of
the other operations, such as adding, removing or renaming a file or
directory.
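
For reference, in the unpatched tree these look roughly as follows (a
simplified sketch; the same declarations are visible in the diffs later
in this series):

	/* fs/kernfs/file.c */
	static DEFINE_SPINLOCK(kernfs_open_node_lock);
	static DEFINE_MUTEX(kernfs_open_file_mutex);

	/* fs/kernfs/kernfs-internal.h, one instance per kernfs root */
	struct kernfs_root {
		/* ... */
		struct rw_semaphore	kernfs_rwsem;
	};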

Since these locks are global, they can cause contention when multiple
(for example, a few hundred) applications try to access sysfs (or other
kernfs-based file system) files in parallel, even if the applications
are accessing different and unrelated files.

For example, on a system with 384 cores, if I run 200 instances of an
application that mostly executes the following loop:

  /* fragment; needs <errno.h>, <fcntl.h>, <stdio.h>, <string.h>, <unistd.h> */
  for (int loop = 0; loop < 100; loop++)
  {
    for (int port_num = 1; port_num < 2; port_num++)
    {
      for (int gid_index = 0; gid_index < 254; gid_index++)
      {
        char ret_buf[64];
        char gid_file_path[1024];

        int      ret_fd;
        ssize_t  ret_rd;

        memset(ret_buf, 0, sizeof(ret_buf));
        memset(gid_file_path, 0, sizeof(gid_file_path));

        snprintf(gid_file_path, sizeof(gid_file_path),
                 "/sys/class/infiniband/%s/ports/%d/gids/%d",
                 dev_name,
                 port_num,
                 gid_index);

        ret_fd = open(gid_file_path, O_RDONLY | O_CLOEXEC);
        if (ret_fd < 0)
        {
          printf("Failed to open %s\n", gid_file_path);
          continue;
        }

        /* Read the GID */
        ret_rd = read(ret_fd, ret_buf, 40);

        if (ret_rd == -1)
        {
          printf("Failed to read from file %s, errno: %d\n",
                 gid_file_path, errno);
          close(ret_fd);
          continue;
        }

        close(ret_fd);
      }
    }
  }

I can see contention around the above mentioned locks as follows:

-   54.07%    53.60%  showgids         [kernel.kallsyms]       [k] osq_lock
   - 53.60% __libc_start_main
      - 32.29% __GI___libc_open
           entry_SYSCALL_64_after_hwframe
           do_syscall_64
           sys_open
           do_sys_open
           do_filp_open
           path_openat
           vfs_open
           do_dentry_open
           kernfs_fop_open
           mutex_lock
         - __mutex_lock_slowpath
            - 32.23% __mutex_lock.isra.5
                 osq_lock
      - 21.31% __GI___libc_close
           entry_SYSCALL_64_after_hwframe
           do_syscall_64
           exit_to_usermode_loop
           task_work_run
           ____fput
           __fput
           kernfs_fop_release
           kernfs_put_open_node.isra.8
           mutex_lock
         - __mutex_lock_slowpath
            - 21.28% __mutex_lock.isra.5
                 osq_lock

-   10.49%    10.39%  showgids         [kernel.kallsyms]      [k] down_read
     10.39% __libc_start_main
        __GI___libc_open
        entry_SYSCALL_64_after_hwframe
        do_syscall_64
        sys_open
        do_sys_open
        do_filp_open
      - path_openat
         - 9.72% link_path_walk
            - 5.21% inode_permission
               - __inode_permission
                  - 5.21% kernfs_iop_permission
                       down_read
            - 4.08% walk_component
                 lookup_fast
               - d_revalidate.part.24
                  - 4.08% kernfs_dop_revalidate

-    7.48%     7.41%  showgids         [kernel.kallsyms]       [k] up_read
     7.41% __libc_start_main
        __GI___libc_open
        entry_SYSCALL_64_after_hwframe
        do_syscall_64
        sys_open
        do_sys_open
        do_filp_open
      - path_openat
         - 7.01% link_path_walk
            - 4.12% inode_permission
               - __inode_permission
                  - 4.12% kernfs_iop_permission
                       up_read
            - 2.61% walk_component
                 lookup_fast
               - d_revalidate.part.24
                  - 2.61% kernfs_dop_revalidate

Moreover this run of 200 application instances takes 32-34 secs to
complete.

This patch set reduces the above mentioned contention by replacing
these global locks with hashed locks.

For example, with the patched kernel and on the same test setup, we no
longer see contention around osq_lock (i.e. kernfs_open_file_mutex), and
contention around the per-fs kernfs_rwsem has reduced significantly as
well.
This can be seen in the following perf snippet:

-    1.66%     1.65%  showgids         [kernel.kallsyms]      [k] down_read
     1.65% __libc_start_main
        __GI___libc_open
        entry_SYSCALL_64_after_hwframe
        do_syscall_64
        sys_open
        do_sys_open
        do_filp_open
      - path_openat
         - 1.62% link_path_walk
            - 0.98% inode_permission
               - __inode_permission
                  + 0.98% kernfs_iop_permission
            - 0.52% walk_component
                 lookup_fast
               - d_revalidate.part.24
                  - 0.52% kernfs_dop_revalidate

-    1.12%     1.11%  showgids         [kernel.kallsyms]      [k] up_read
     1.11% __libc_start_main
        __GI___libc_open
        entry_SYSCALL_64_after_hwframe
        do_syscall_64
        sys_open
        do_sys_open
        do_filp_open
      - path_openat
         - 1.11% link_path_walk
            - 0.69% inode_permission
               - __inode_permission
                  - 0.69% kernfs_iop_permission
                       up_read

Moreover the test execution time has reduced from 32-34 secs to 15-16 secs.

The patches in this patchset introduce the following changes:

PATCH-1: Remove reference counting from kernfs_open_node.

PATCH-2: Make kernfs_node->attr.open RCU protected.

PATCH-3: Change kernfs_notify_list to llist.

PATCH-4: Introduce interface to access kernfs_open_file_mutex.

PATCH-5: Replace global kernfs_open_file_mutex with hashed mutexes.

PATCH-6: Use a per-fs rwsem to protect per-fs list of kernfs_super_info.

PATCH-7: Change kernfs_rename_lock into a read-write lock.

PATCH-8: Introduce interface to access per-fs rwsem.

PATCH-9: Replace per-fs rwsem with hashed rwsems.

PATCH-10: Add a document to describe hashed locks used in kernfs.

------------------------------------------------------------------

Changes since v7:
 - Addressed review comments from Al Viro
	- Remove reference counting for kernfs_open_node
	- Remove kernfs_open_node_lock and make ->attr.open RCU protected
	- Change kernfs_rename_lock to a read-write lock
	- Lock involved nodes before invoking kernfs_find_ns from kernfs_walk_ns
	- Change kernfs_notify_list to llist

 - Addressed review comments from Eric
	- Move interface to access kernfs_open_file_mutex to file.c
 
 - Update document about hashed locks
	
 - Rebase on tag next-20220407

Changes since v6:
 - Addressed review comments from Tejun
	- Make locking helpers compact and remove unlock_parent from node.
 
 - Addressed review comments from Al Viro
	- Make multi node lock helpers non-inline
	- Account for change of parent while waiting on semaphores during
	  rename operation
	- Add a document to describe hashed locks introduced in this patch
	  and to provide proof of correctness 
 
 - Fixes for some issues that I observed while preparing the lock
   correctness document
	- Lock both parent and target while looking for symlink
	- Introduce a per-fs mutex to synchronize lookup and removal of a node
	- Avoid locking parent in remove_self, because only the first instance
	  does actual removal so other invocations of remove_self don't need
	  to lock the parent
 
 - Remove refcounting from lock helpers
	- Refcounting was present in lock helpers for cases where an execution
	  path acquires a node's rwsem and then deletes the node. For such
	  cases release helpers should not try to acquire the semaphore of
	  this already freed node. Refcounting ensured that release helpers
	  could still get an existing node.
	  I have modified the locking helpers such that the helpers that
	  acquire a rwsem return its address, which can later be used by
	  the release helpers.

Changes since v5:
 - Addressed review comments from Greg
	- Clean up duplicate code.
 - Addressed review comments from Tejun
	- Explain how the current values of NR_KERNFS_LOCKS were obtained
	  for different values of NR_CPUS.
	- Use single hash table for locks in place of per-fs hash table
	  for locks.
	- Move introduction of supers_rwsem to a separate patch.
	- Separate interface and usage part of hashed rwsem implementation
	  into 2 patches.
	- Use the address of each rwsem to determine the locking order in
	  case of nested locking; this avoids the possibility of ABBA
	  deadlock (see the sketch after this list).
	- Change some #define macros into enum, with proper prefix.
 - Taking a cue from Tejun's feedback about separating the hashed rwsem
   interface and usage into 2 patches, I have also divided the patch
   that introduced hashed mutexes and spinlocks into separate patches.
 - Rebase on linux-next tag: next-20220211
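
The address-ordering rule mentioned above works as in the sketch below
(the helper name and exact shape are illustrative, not the code of this
series):

	static void kernfs_down_write_double(struct rw_semaphore *a,
					     struct rw_semaphore *b)
	{
		if (a == b) {
			down_write(a);
			return;
		}
		if (a > b)
			swap(a, b);	/* always take the lower address first */
		down_write(a);
		down_write_nested(b, SINGLE_DEPTH_NESTING);
	}

Since every path takes the rwsem at the lower address first, two tasks
that need the same pair of hashed rwsems can never block on them in
opposite orders.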

Changes since v4:
 - Removed compilation warnings reported by the 0-day bot.

Changes since v3:
 - Make open_file_mutex and open_node_lock per-fs.
 - Replace per-fs rwsem with per-fs hashed rwsem.
   (Suggested by Tejun Heo <tj@kernel.org>)

Imran Khan (10):
  kernfs: Remove reference counting for kernfs_open_node.
  kernfs: make ->attr.open RCU protected.
  kernfs: Change kernfs_notify_list to llist.
  kernfs: Introduce interface to access global kernfs_open_file_mutex.
  kernfs: Replace global kernfs_open_file_mutex with hashed mutexes.
  kernfs: Use a per-fs rwsem to protect per-fs list of
    kernfs_super_info.
  kernfs: Change kernfs_rename_lock into a read-write lock.
  kernfs: Introduce interface to access per-fs rwsem.
  kernfs: Replace per-fs rwsem with hashed rwsems.
  kernfs: Add a document to describe hashed locks used in kernfs.

 .../filesystems/kernfs-hashed-locks.rst       | 214 ++++++++++++++
 fs/kernfs/Makefile                            |   2 +-
 fs/kernfs/dir.c                               | 272 ++++++++++++------
 fs/kernfs/file.c                              | 228 ++++++++-------
 fs/kernfs/inode.c                             |  48 +++-
 fs/kernfs/kernfs-internal.c                   | 259 +++++++++++++++++
 fs/kernfs/kernfs-internal.h                   | 127 +++++++-
 fs/kernfs/mount.c                             |  30 +-
 fs/kernfs/symlink.c                           |  11 +-
 include/linux/kernfs.h                        |  62 +++-
 10 files changed, 1041 insertions(+), 212 deletions(-)
 create mode 100644 Documentation/filesystems/kernfs-hashed-locks.rst
 create mode 100644 fs/kernfs/kernfs-internal.c


base-commit: 2e9a9857569ec27e64d2ddd01294bbe3c736acb1
-- 
2.30.2


* [PATCH v8 01/10] kernfs: Remove reference counting for kernfs_open_node.
From: Imran Khan @ 2022-04-10  2:37 UTC
  To: tj, viro, gregkh, ebiederm; +Cc: linux-kernel

The decision to free the kernfs_open_node object in kernfs_put_open_node
can be taken based on whether the kernfs_open_node->files list is empty
or not. As far as kernfs_drain_open_files is concerned, it can't overlap
with kernfs_fop_open and hence can check ->attr.open optimistically
(when ->attr.open is NULL) or under kernfs_open_file_mutex (when it
needs to traverse the ->files list). Thus kernfs_drain_open_files can
work without refcounting the involved kernfs_open_node as well.
So remove ->refcnt and modify the above mentioned users accordingly.
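
With that, kernfs_put_open_node() becomes (reconstructed from the hunks
below and the context visible in the next patch):

	static void kernfs_put_open_node(struct kernfs_node *kn,
					 struct kernfs_open_file *of)
	{
		struct kernfs_open_node *on = kn->attr.open;
		unsigned long flags;

		mutex_lock(&kernfs_open_file_mutex);
		spin_lock_irqsave(&kernfs_open_node_lock, flags);

		if (of)
			list_del(&of->list);

		/* emptiness of ->files, not a refcount, decides freeing */
		if (list_empty(&on->files))
			kn->attr.open = NULL;
		else
			on = NULL;

		spin_unlock_irqrestore(&kernfs_open_node_lock, flags);
		mutex_unlock(&kernfs_open_file_mutex);

		kfree(on);
	}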

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
---
 fs/kernfs/file.c | 23 +++++++++--------------
 1 file changed, 9 insertions(+), 14 deletions(-)

diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 88423069407c..aea6968c979e 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -33,7 +33,6 @@ static DEFINE_SPINLOCK(kernfs_open_node_lock);
 static DEFINE_MUTEX(kernfs_open_file_mutex);
 
 struct kernfs_open_node {
-	atomic_t		refcnt;
 	atomic_t		event;
 	wait_queue_head_t	poll;
 	struct list_head	files; /* goes through kernfs_open_file.list */
@@ -530,10 +529,8 @@ static int kernfs_get_open_node(struct kernfs_node *kn,
 	}
 
 	on = kn->attr.open;
-	if (on) {
-		atomic_inc(&on->refcnt);
+	if (on)
 		list_add_tail(&of->list, &on->files);
-	}
 
 	spin_unlock_irq(&kernfs_open_node_lock);
 	mutex_unlock(&kernfs_open_file_mutex);
@@ -548,7 +545,6 @@ static int kernfs_get_open_node(struct kernfs_node *kn,
 	if (!new_on)
 		return -ENOMEM;
 
-	atomic_set(&new_on->refcnt, 0);
 	atomic_set(&new_on->event, 1);
 	init_waitqueue_head(&new_on->poll);
 	INIT_LIST_HEAD(&new_on->files);
@@ -557,11 +553,12 @@ static int kernfs_get_open_node(struct kernfs_node *kn,
 
 /**
  *	kernfs_put_open_node - put kernfs_open_node
- *	@kn: target kernfs_nodet
+ *	@kn: target kernfs_node
  *	@of: associated kernfs_open_file
  *
  *	Put @kn->attr.open and unlink @of from the files list.  If
- *	reference count reaches zero, disassociate and free it.
+ *	list of associated open files becomes empty, disassociate and
+ *	free kernfs_open_node.
  *
  *	LOCKING:
  *	None.
@@ -578,7 +575,7 @@ static void kernfs_put_open_node(struct kernfs_node *kn,
 	if (of)
 		list_del(&of->list);
 
-	if (atomic_dec_and_test(&on->refcnt))
+	if (list_empty(&on->files))
 		kn->attr.open = NULL;
 	else
 		on = NULL;
@@ -768,15 +765,15 @@ void kernfs_drain_open_files(struct kernfs_node *kn)
 	if (!(kn->flags & (KERNFS_HAS_MMAP | KERNFS_HAS_RELEASE)))
 		return;
 
-	spin_lock_irq(&kernfs_open_node_lock);
 	on = kn->attr.open;
-	if (on)
-		atomic_inc(&on->refcnt);
-	spin_unlock_irq(&kernfs_open_node_lock);
 	if (!on)
 		return;
 
 	mutex_lock(&kernfs_open_file_mutex);
+	if (!kn->attr.open) {
+		mutex_unlock(&kernfs_open_file_mutex);
+		return;
+	}
 
 	list_for_each_entry(of, &on->files, list) {
 		struct inode *inode = file_inode(of->file);
@@ -789,8 +786,6 @@ void kernfs_drain_open_files(struct kernfs_node *kn)
 	}
 
 	mutex_unlock(&kernfs_open_file_mutex);
-
-	kernfs_put_open_node(kn, NULL);
 }
 
 /*
-- 
2.30.2


* [PATCH v8 02/10] kernfs: make ->attr.open RCU protected.
From: Imran Khan @ 2022-04-10  2:37 UTC
  To: tj, viro, gregkh, ebiederm; +Cc: linux-kernel

After removal of kernfs_open_node->refcnt in the previous patch,
kernfs_open_node_lock can be removed as well by making ->attr.open
RCU protected. kernfs_put_open_node can delegate freeing of ->attr.open
to RCU, and other readers of ->attr.open can access it under
rcu_read_(un)lock.

So make ->attr.open RCU protected and remove global kernfs_open_node_lock.
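
For instance, the poll kick in kernfs_notify() turns into a plain RCU
read-side section (as in the hunk below):

	rcu_read_lock();
	on = rcu_dereference(kn->attr.open);
	if (on) {
		atomic_inc(&on->event);
		wake_up_interruptible(&on->poll);
	}
	rcu_read_unlock();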

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
---
 fs/kernfs/file.c       | 101 ++++++++++++++++++++++-------------------
 include/linux/kernfs.h |   2 +-
 2 files changed, 55 insertions(+), 48 deletions(-)

diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index aea6968c979e..bc393dcf4efa 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -23,16 +23,16 @@
  * for each kernfs_node with one or more open files.
  *
  * kernfs_node->attr.open points to kernfs_open_node.  attr.open is
- * protected by kernfs_open_node_lock.
+ * RCU protected.
  *
  * filp->private_data points to seq_file whose ->private points to
  * kernfs_open_file.  kernfs_open_files are chained at
  * kernfs_open_node->files, which is protected by kernfs_open_file_mutex.
  */
-static DEFINE_SPINLOCK(kernfs_open_node_lock);
 static DEFINE_MUTEX(kernfs_open_file_mutex);
 
 struct kernfs_open_node {
+	struct rcu_head         rcu_head;
 	atomic_t		event;
 	wait_queue_head_t	poll;
 	struct list_head	files; /* goes through kernfs_open_file.list */
@@ -156,8 +156,9 @@ static void kernfs_seq_stop(struct seq_file *sf, void *v)
 static int kernfs_seq_show(struct seq_file *sf, void *v)
 {
 	struct kernfs_open_file *of = sf->private;
+	struct kernfs_open_node *on = rcu_dereference_raw(of->kn->attr.open);
 
-	of->event = atomic_read(&of->kn->attr.open->event);
+	of->event = atomic_read(&on->event);
 
 	return of->kn->attr.ops->seq_show(sf, v);
 }
@@ -180,6 +181,7 @@ static ssize_t kernfs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 	struct kernfs_open_file *of = kernfs_of(iocb->ki_filp);
 	ssize_t len = min_t(size_t, iov_iter_count(iter), PAGE_SIZE);
 	const struct kernfs_ops *ops;
+	struct kernfs_open_node *on;
 	char *buf;
 
 	buf = of->prealloc_buf;
@@ -201,7 +203,8 @@ static ssize_t kernfs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 		goto out_free;
 	}
 
-	of->event = atomic_read(&of->kn->attr.open->event);
+	on = rcu_dereference_raw(of->kn->attr.open);
+	of->event = atomic_read(&on->event);
 	ops = kernfs_ops(of->kn);
 	if (ops->read)
 		len = ops->read(of, buf, len, iocb->ki_pos);
@@ -519,36 +522,34 @@ static int kernfs_get_open_node(struct kernfs_node *kn,
 {
 	struct kernfs_open_node *on, *new_on = NULL;
 
- retry:
 	mutex_lock(&kernfs_open_file_mutex);
-	spin_lock_irq(&kernfs_open_node_lock);
-
-	if (!kn->attr.open && new_on) {
-		kn->attr.open = new_on;
-		new_on = NULL;
-	}
-
-	on = kn->attr.open;
-	if (on)
-		list_add_tail(&of->list, &on->files);
-
-	spin_unlock_irq(&kernfs_open_node_lock);
-	mutex_unlock(&kernfs_open_file_mutex);
 
+	/**
+	 * ->attr.open changes under kernfs_open_file_mutex so we don't
+	 * need rcu_read_lock to ensure its existence.
+	 */
+	on = rcu_dereference_protected(kn->attr.open,
+				   lockdep_is_held(&kernfs_open_file_mutex));
 	if (on) {
-		kfree(new_on);
+		list_add_tail(&of->list, &on->files);
+		mutex_unlock(&kernfs_open_file_mutex);
 		return 0;
+	} else {
+		/* not there, initialize a new one and retry */
+		new_on = kmalloc(sizeof(*new_on), GFP_KERNEL);
+		if (!new_on) {
+			mutex_unlock(&kernfs_open_file_mutex);
+			return -ENOMEM;
+		}
+		atomic_set(&new_on->event, 1);
+		init_waitqueue_head(&new_on->poll);
+		INIT_LIST_HEAD(&new_on->files);
+		list_add_tail(&of->list, &new_on->files);
+		rcu_assign_pointer(kn->attr.open, new_on);
 	}
+	mutex_unlock(&kernfs_open_file_mutex);
 
-	/* not there, initialize a new one and retry */
-	new_on = kmalloc(sizeof(*new_on), GFP_KERNEL);
-	if (!new_on)
-		return -ENOMEM;
-
-	atomic_set(&new_on->event, 1);
-	init_waitqueue_head(&new_on->poll);
-	INIT_LIST_HEAD(&new_on->files);
-	goto retry;
+	return 0;
 }
 
 /**
@@ -566,24 +567,30 @@ static int kernfs_get_open_node(struct kernfs_node *kn,
 static void kernfs_put_open_node(struct kernfs_node *kn,
 				 struct kernfs_open_file *of)
 {
-	struct kernfs_open_node *on = kn->attr.open;
-	unsigned long flags;
+	struct kernfs_open_node *on;
+
+	/* ->attr.open NULL means there are no more open files */
+	if (rcu_dereference_raw(kn->attr.open) == NULL)
+		return;
 
 	mutex_lock(&kernfs_open_file_mutex);
-	spin_lock_irqsave(&kernfs_open_node_lock, flags);
+
+	on = rcu_dereference_protected(kn->attr.open,
+				       lockdep_is_held(&kernfs_open_file_mutex));
+
+	if (!on) {
+		mutex_unlock(&kernfs_open_file_mutex);
+		return;
+	}
 
 	if (of)
 		list_del(&of->list);
 
-	if (list_empty(&on->files))
-		kn->attr.open = NULL;
-	else
-		on = NULL;
-
-	spin_unlock_irqrestore(&kernfs_open_node_lock, flags);
+	if (list_empty(&on->files)) {
+		rcu_assign_pointer(kn->attr.open, NULL);
+		kfree_rcu(on, rcu_head);
+	}
 	mutex_unlock(&kernfs_open_file_mutex);
-
-	kfree(on);
 }
 
 static int kernfs_fop_open(struct inode *inode, struct file *file)
@@ -765,12 +772,13 @@ void kernfs_drain_open_files(struct kernfs_node *kn)
 	if (!(kn->flags & (KERNFS_HAS_MMAP | KERNFS_HAS_RELEASE)))
 		return;
 
-	on = kn->attr.open;
-	if (!on)
+	if (rcu_dereference_raw(kn->attr.open) == NULL)
 		return;
 
 	mutex_lock(&kernfs_open_file_mutex);
-	if (!kn->attr.open) {
+	on = rcu_dereference_check(kn->attr.open,
+				   lockdep_is_held(&kernfs_open_file_mutex));
+	if (!on) {
 		mutex_unlock(&kernfs_open_file_mutex);
 		return;
 	}
@@ -805,7 +813,7 @@ void kernfs_drain_open_files(struct kernfs_node *kn)
 __poll_t kernfs_generic_poll(struct kernfs_open_file *of, poll_table *wait)
 {
 	struct kernfs_node *kn = kernfs_dentry_node(of->file->f_path.dentry);
-	struct kernfs_open_node *on = kn->attr.open;
+	struct kernfs_open_node *on = rcu_dereference_raw(kn->attr.open);
 
 	poll_wait(of->file, &on->poll, wait);
 
@@ -912,14 +920,13 @@ void kernfs_notify(struct kernfs_node *kn)
 		return;
 
 	/* kick poll immediately */
-	spin_lock_irqsave(&kernfs_open_node_lock, flags);
-	on = kn->attr.open;
+	rcu_read_lock();
+	on = rcu_dereference(kn->attr.open);
 	if (on) {
 		atomic_inc(&on->event);
 		wake_up_interruptible(&on->poll);
 	}
-	spin_unlock_irqrestore(&kernfs_open_node_lock, flags);
-
+	rcu_read_unlock();
 	/* schedule work to kick fsnotify */
 	spin_lock_irqsave(&kernfs_notify_lock, flags);
 	if (!kn->attr.notify_next) {
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index e2ae15a6225e..13f54f078a52 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -114,7 +114,7 @@ struct kernfs_elem_symlink {
 
 struct kernfs_elem_attr {
 	const struct kernfs_ops	*ops;
-	struct kernfs_open_node	*open;
+	struct kernfs_open_node __rcu	*open;
 	loff_t			size;
 	struct kernfs_node	*notify_next;	/* for kernfs_notify() */
 };
-- 
2.30.2


* [PATCH v8 03/10] kernfs: Change kernfs_notify_list to llist.
From: Imran Khan @ 2022-04-10  2:37 UTC
  To: tj, viro, gregkh, ebiederm; +Cc: linux-kernel

At present kernfs_notify_list is implemented as a singly linked
list of kernfs_node(s), where the last element points to itself and
the value of ->attr.notify_next tells whether a node is present on the
list or not. Both addition and deletion to the list happen under
kernfs_notify_lock.

Change kernfs_notify_list to an llist so that additions to the list can
happen locklessly. We still need kernfs_notify_lock for consumers
(kernfs_notify_workfn) because there can be multiple concurrent work
items.

Also with this approach we no longer check whether a kernfs_node is
already present in kernfs_notify_list, but that should be fine as it
allows serial processing of all events submitted for the same node while
the node was still on the event notification list. With the earlier
approach, as long as a node was on the event notification list (i.e.
->attr.notify_next != NULL) the subsequent events were ignored.
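
The producer side of kernfs_notify() then reduces to a lockless push
plus a work kick (as in the hunk below):

	kernfs_get(kn);
	llist_add(&kn->attr.notify_next, &kernfs_notify_list);
	schedule_work(&kernfs_notify_work);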

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
---
 fs/kernfs/file.c       | 49 ++++++++++++++++++++++--------------------
 include/linux/kernfs.h |  2 +-
 2 files changed, 27 insertions(+), 24 deletions(-)

diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index bc393dcf4efa..c89220dcfdc1 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -38,18 +38,17 @@ struct kernfs_open_node {
 	struct list_head	files; /* goes through kernfs_open_file.list */
 };
 
-/*
- * kernfs_notify() may be called from any context and bounces notifications
- * through a work item.  To minimize space overhead in kernfs_node, the
- * pending queue is implemented as a singly linked list of kernfs_nodes.
- * The list is terminated with the self pointer so that whether a
- * kernfs_node is on the list or not can be determined by testing the next
- * pointer for NULL.
+/**
+ * attribute_to_node - get kernfs_node object corresponding to a kernfs attribute
+ * @ptr:	&struct kernfs_elem_attr
+ * @type:	struct kernfs_node
+ * @member:	name of member (i.e attr)
  */
-#define KERNFS_NOTIFY_EOL			((void *)&kernfs_notify_list)
+#define attribute_to_node(ptr, type, member)	\
+	container_of(ptr, type, member)
 
 static DEFINE_SPINLOCK(kernfs_notify_lock);
-static struct kernfs_node *kernfs_notify_list = KERNFS_NOTIFY_EOL;
+static LLIST_HEAD(kernfs_notify_list);
 
 static struct kernfs_open_file *kernfs_of(struct file *file)
 {
@@ -846,18 +845,25 @@ static void kernfs_notify_workfn(struct work_struct *work)
 	struct kernfs_node *kn;
 	struct kernfs_super_info *info;
 	struct kernfs_root *root;
+	struct llist_node *free;
+	struct kernfs_elem_attr *attr;
 repeat:
-	/* pop one off the notify_list */
+	/**
+	 * pop one off the notify_list.
+	 * There can be multiple concurrent work items.
+	 * Use kernfs_notify_lock to synchronize between multiple consumers.
+	 */
 	spin_lock_irq(&kernfs_notify_lock);
-	kn = kernfs_notify_list;
-	if (kn == KERNFS_NOTIFY_EOL) {
+	if (llist_empty(&kernfs_notify_list)) {
 		spin_unlock_irq(&kernfs_notify_lock);
 		return;
 	}
-	kernfs_notify_list = kn->attr.notify_next;
-	kn->attr.notify_next = NULL;
+
+	free = llist_del_first(&kernfs_notify_list);
 	spin_unlock_irq(&kernfs_notify_lock);
 
+	attr = llist_entry(free, struct kernfs_elem_attr, notify_next);
+	kn = attribute_to_node(attr, struct kernfs_node, attr);
 	root = kernfs_root(kn);
 	/* kick fsnotify */
 	down_write(&root->kernfs_rwsem);
@@ -913,12 +919,14 @@ static void kernfs_notify_workfn(struct work_struct *work)
 void kernfs_notify(struct kernfs_node *kn)
 {
 	static DECLARE_WORK(kernfs_notify_work, kernfs_notify_workfn);
-	unsigned long flags;
 	struct kernfs_open_node *on;
 
 	if (WARN_ON(kernfs_type(kn) != KERNFS_FILE))
 		return;
 
+	/* Because we are using llist for kernfs_notify_list */
+	WARN_ON_ONCE(in_nmi());
+
 	/* kick poll immediately */
 	rcu_read_lock();
 	on = rcu_dereference(kn->attr.open);
@@ -928,14 +936,9 @@ void kernfs_notify(struct kernfs_node *kn)
 	}
 	rcu_read_unlock();
 	/* schedule work to kick fsnotify */
-	spin_lock_irqsave(&kernfs_notify_lock, flags);
-	if (!kn->attr.notify_next) {
-		kernfs_get(kn);
-		kn->attr.notify_next = kernfs_notify_list;
-		kernfs_notify_list = kn;
-		schedule_work(&kernfs_notify_work);
-	}
-	spin_unlock_irqrestore(&kernfs_notify_lock, flags);
+	kernfs_get(kn);
+	llist_add(&kn->attr.notify_next, &kernfs_notify_list);
+	schedule_work(&kernfs_notify_work);
 }
 EXPORT_SYMBOL_GPL(kernfs_notify);
 
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 13f54f078a52..2dd9c8df0f4f 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -116,7 +116,7 @@ struct kernfs_elem_attr {
 	const struct kernfs_ops	*ops;
 	struct kernfs_open_node __rcu	*open;
 	loff_t			size;
-	struct kernfs_node	*notify_next;	/* for kernfs_notify() */
+	struct llist_node	notify_next;	/* for kernfs_notify() */
 };
 
 /*
-- 
2.30.2


* [PATCH v8 04/10] kernfs: Introduce interface to access global kernfs_open_file_mutex.
From: Imran Khan @ 2022-04-10  2:37 UTC
  To: tj, viro, gregkh, ebiederm; +Cc: linux-kernel

This allows changing the underlying mutex locking without needing to
change the users of the lock. For example, the next patch modifies this
interface to use hashed mutexes in place of the single global
kernfs_open_file_mutex.
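
With this interface every lock user follows the same pattern (a sketch
of the shape visible in the hunks below):

	struct mutex *mutex;

	mutex = kernfs_open_file_mutex_lock(kn);  /* pick and lock */
	/* ... work on kn->attr.open and the ->files list ... */
	mutex_unlock(mutex);                      /* drop what was taken */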

Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
---
 fs/kernfs/file.c | 42 +++++++++++++++++++++++++++++++-----------
 1 file changed, 31 insertions(+), 11 deletions(-)

diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index c89220dcfdc1..214b48d59148 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -47,6 +47,22 @@ struct kernfs_open_node {
 #define attribute_to_node(ptr, type, member)	\
 	container_of(ptr, type, member)
 
+static inline struct mutex *kernfs_open_file_mutex_ptr(struct kernfs_node *kn)
+{
+	return &kernfs_open_file_mutex;
+}
+
+static inline struct mutex *kernfs_open_file_mutex_lock(struct kernfs_node *kn)
+{
+	struct mutex *lock;
+
+	lock = kernfs_open_file_mutex_ptr(kn);
+
+	mutex_lock(lock);
+
+	return lock;
+}
+
 static DEFINE_SPINLOCK(kernfs_notify_lock);
 static LLIST_HEAD(kernfs_notify_list);
 
@@ -520,8 +536,9 @@ static int kernfs_get_open_node(struct kernfs_node *kn,
 				struct kernfs_open_file *of)
 {
 	struct kernfs_open_node *on, *new_on = NULL;
+	struct mutex *mutex = NULL;
 
-	mutex_lock(&kernfs_open_file_mutex);
+	mutex = kernfs_open_file_mutex_lock(kn);
 
 	/**
 	 * ->attr.open changes under kernfs_open_file_mutex so we don't
@@ -531,13 +548,13 @@ static int kernfs_get_open_node(struct kernfs_node *kn,
 				   lockdep_is_held(&kernfs_open_file_mutex));
 	if (on) {
 		list_add_tail(&of->list, &on->files);
-		mutex_unlock(&kernfs_open_file_mutex);
+		mutex_unlock(mutex);
 		return 0;
 	} else {
 		/* not there, initialize a new one and retry */
 		new_on = kmalloc(sizeof(*new_on), GFP_KERNEL);
 		if (!new_on) {
-			mutex_unlock(&kernfs_open_file_mutex);
+			mutex_unlock(mutex);
 			return -ENOMEM;
 		}
 		atomic_set(&new_on->event, 1);
@@ -546,7 +563,7 @@ static int kernfs_get_open_node(struct kernfs_node *kn,
 		list_add_tail(&of->list, &new_on->files);
 		rcu_assign_pointer(kn->attr.open, new_on);
 	}
-	mutex_unlock(&kernfs_open_file_mutex);
+	mutex_unlock(mutex);
 
 	return 0;
 }
@@ -567,12 +584,13 @@ static void kernfs_put_open_node(struct kernfs_node *kn,
 				 struct kernfs_open_file *of)
 {
 	struct kernfs_open_node *on;
+	struct mutex *mutex = NULL;
 
 	/* ->attr.open NULL means there are no more open files */
 	if (rcu_dereference_raw(kn->attr.open) == NULL)
 		return;
 
-	mutex_lock(&kernfs_open_file_mutex);
+	mutex = kernfs_open_file_mutex_lock(kn);
 
 	on = rcu_dereference_protected(kn->attr.open,
 				       lockdep_is_held(&kernfs_open_file_mutex));
@@ -589,7 +607,7 @@ static void kernfs_put_open_node(struct kernfs_node *kn,
 		rcu_assign_pointer(kn->attr.open, NULL);
 		kfree_rcu(on, rcu_head);
 	}
-	mutex_unlock(&kernfs_open_file_mutex);
+	mutex_unlock(mutex);
 }
 
 static int kernfs_fop_open(struct inode *inode, struct file *file)
@@ -731,7 +749,7 @@ static void kernfs_release_file(struct kernfs_node *kn,
 	 * here because drain path may be called from places which can
 	 * cause circular dependency.
 	 */
-	lockdep_assert_held(&kernfs_open_file_mutex);
+	lockdep_assert_held(kernfs_open_file_mutex_ptr(kn));
 
 	if (!of->released) {
 		/*
@@ -748,11 +766,12 @@ static int kernfs_fop_release(struct inode *inode, struct file *filp)
 {
 	struct kernfs_node *kn = inode->i_private;
 	struct kernfs_open_file *of = kernfs_of(filp);
+	struct mutex *mutex = NULL;
 
 	if (kn->flags & KERNFS_HAS_RELEASE) {
-		mutex_lock(&kernfs_open_file_mutex);
+		mutex = kernfs_open_file_mutex_lock(kn);
 		kernfs_release_file(kn, of);
-		mutex_unlock(&kernfs_open_file_mutex);
+		mutex_unlock(mutex);
 	}
 
 	kernfs_put_open_node(kn, of);
@@ -767,6 +786,7 @@ void kernfs_drain_open_files(struct kernfs_node *kn)
 {
 	struct kernfs_open_node *on;
 	struct kernfs_open_file *of;
+	struct mutex *mutex = NULL;
 
 	if (!(kn->flags & (KERNFS_HAS_MMAP | KERNFS_HAS_RELEASE)))
 		return;
@@ -774,7 +794,7 @@ void kernfs_drain_open_files(struct kernfs_node *kn)
 	if (rcu_dereference_raw(kn->attr.open) == NULL)
 		return;
 
-	mutex_lock(&kernfs_open_file_mutex);
+	mutex = kernfs_open_file_mutex_lock(kn);
 	on = rcu_dereference_check(kn->attr.open,
 				   lockdep_is_held(&kernfs_open_file_mutex));
 	if (!on) {
@@ -792,7 +812,7 @@ void kernfs_drain_open_files(struct kernfs_node *kn)
 			kernfs_release_file(kn, of);
 	}
 
-	mutex_unlock(&kernfs_open_file_mutex);
+	mutex_unlock(mutex);
 }
 
 /*
-- 
2.30.2


* [PATCH v8 05/10] kernfs: Replace global kernfs_open_file_mutex with hashed mutexes.
From: Imran Khan @ 2022-04-10  2:37 UTC
  To: tj, viro, gregkh, ebiederm; +Cc: linux-kernel

In the current kernfs design a single mutex, kernfs_open_file_mutex,
protects the list of kernfs_open_file instances corresponding to a sysfs
attribute. So even if different tasks are opening or closing different
sysfs files, they can contend on the osq_lock of this mutex. The
contention is more apparent in large scale systems with a few hundred
CPUs, where most of the CPUs have running tasks that are opening,
accessing or closing sysfs files at any point of time.

Using hashed mutexes in place of a single global mutex can significantly
reduce contention around the global mutex and hence provide better
scalability. Moreover, as these hashed mutexes are not part of
kernfs_node objects, we will not see any significant change in memory
utilization of kernfs based file systems like sysfs, cgroupfs etc.

Modify the interface introduced in the previous patch to make use of
hashed mutexes. Use the kernfs_node address as the hashing key.
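
To illustrate the mapping from a kernfs_node address to a mutex index,
here is a minimal standalone sketch; NR_KERNFS_LOCK_BITS and the
golden-ratio multiplier match the kernel's definitions for NR_CPUS >= 32,
and open_file_mutex_idx() mirrors hash_ptr() on 64-bit builds (the
function name is illustrative only):

	#include <stdint.h>
	#include <stdio.h>

	#define NR_KERNFS_LOCK_BITS	10
	#define GOLDEN_RATIO_64		0x61C8864680B583EBull

	static unsigned int open_file_mutex_idx(const void *kn)
	{
		uint64_t val = (uint64_t)(uintptr_t)kn;

		/* multiplicative hash, as in the kernel's hash_64() */
		return (unsigned int)((val * GOLDEN_RATIO_64) >>
				      (64 - NR_KERNFS_LOCK_BITS));
	}

	int main(void)
	{
		int nodes[4];
		int i;

		for (i = 0; i < 4; i++)
			printf("node %p -> open_file_mutex[%u]\n",
			       (void *)&nodes[i],
			       open_file_mutex_idx(&nodes[i]));
		return 0;
	}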

Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
---
 fs/kernfs/file.c       | 32 +++++++++++++-----------
 fs/kernfs/mount.c      | 14 +++++++++++
 include/linux/kernfs.h | 57 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 88 insertions(+), 15 deletions(-)

diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 214b48d59148..0946ab341ce4 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -19,18 +19,14 @@
 #include "kernfs-internal.h"
 
 /*
- * There's one kernfs_open_file for each open file and one kernfs_open_node
- * for each kernfs_node with one or more open files.
- *
+ * kernfs locks
+ */
+extern struct kernfs_global_locks *kernfs_locks;
+
+/*
  * kernfs_node->attr.open points to kernfs_open_node.  attr.open is
  * RCU protected.
- *
- * filp->private_data points to seq_file whose ->private points to
- * kernfs_open_file.  kernfs_open_files are chained at
- * kernfs_open_node->files, which is protected by kernfs_open_file_mutex.
  */
-static DEFINE_MUTEX(kernfs_open_file_mutex);
-
 struct kernfs_open_node {
 	struct rcu_head         rcu_head;
 	atomic_t		event;
@@ -49,7 +45,9 @@ struct kernfs_open_node {
 
 static inline struct mutex *kernfs_open_file_mutex_ptr(struct kernfs_node *kn)
 {
-	return &kernfs_open_file_mutex;
+	int idx = hash_ptr(kn, NR_KERNFS_LOCK_BITS);
+
+	return &kernfs_locks->open_file_mutex[idx].lock;
 }
 
 static inline struct mutex *kernfs_open_file_mutex_lock(struct kernfs_node *kn)
@@ -545,7 +543,7 @@ static int kernfs_get_open_node(struct kernfs_node *kn,
 	 * need rcu_read_lock to ensure its existence.
 	 */
 	on = rcu_dereference_protected(kn->attr.open,
-				   lockdep_is_held(&kernfs_open_file_mutex));
+				   lockdep_is_held(mutex));
 	if (on) {
 		list_add_tail(&of->list, &on->files);
 		mutex_unlock(mutex);
@@ -593,10 +591,10 @@ static void kernfs_put_open_node(struct kernfs_node *kn,
 	mutex = kernfs_open_file_mutex_lock(kn);
 
 	on = rcu_dereference_protected(kn->attr.open,
-				       lockdep_is_held(&kernfs_open_file_mutex));
+				       lockdep_is_held(mutex));
 
 	if (!on) {
-		mutex_unlock(&kernfs_open_file_mutex);
+		mutex_unlock(mutex);
 		return;
 	}
 
@@ -769,6 +767,10 @@ static int kernfs_fop_release(struct inode *inode, struct file *filp)
 	struct mutex *mutex = NULL;
 
 	if (kn->flags & KERNFS_HAS_RELEASE) {
+		/**
+		 * Callers of kernfs_fop_release don't need global
+		 * exclusion, so using a hashed mutex here is safe.
+		 */
 		mutex = kernfs_open_file_mutex_lock(kn);
 		kernfs_release_file(kn, of);
 		mutex_unlock(mutex);
@@ -796,9 +798,9 @@ void kernfs_drain_open_files(struct kernfs_node *kn)
 
 	mutex = kernfs_open_file_mutex_lock(kn);
 	on = rcu_dereference_check(kn->attr.open,
-				   lockdep_is_held(&kernfs_open_file_mutex));
+				   lockdep_is_held(mutex));
 	if (!on) {
-		mutex_unlock(&kernfs_open_file_mutex);
+		mutex_unlock(mutex);
 		return;
 	}
 
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index cfa79715fc1a..a64a04efc9be 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -20,6 +20,7 @@
 #include "kernfs-internal.h"
 
 struct kmem_cache *kernfs_node_cache, *kernfs_iattrs_cache;
+struct kernfs_global_locks *kernfs_locks;
 
 static int kernfs_sop_show_options(struct seq_file *sf, struct dentry *dentry)
 {
@@ -387,6 +388,17 @@ void kernfs_kill_sb(struct super_block *sb)
 	kfree(info);
 }
 
+void __init kernfs_lock_init(void)
+{
+	int count;
+
+	kernfs_locks = kmalloc(sizeof(struct kernfs_global_locks), GFP_KERNEL);
+	WARN_ON(!kernfs_locks);
+
+	for (count = 0; count < NR_KERNFS_LOCKS; count++)
+		mutex_init(&kernfs_locks->open_file_mutex[count].lock);
+}
+
 void __init kernfs_init(void)
 {
 	kernfs_node_cache = kmem_cache_create("kernfs_node_cache",
@@ -397,4 +409,6 @@ void __init kernfs_init(void)
 	kernfs_iattrs_cache  = kmem_cache_create("kernfs_iattrs_cache",
 					      sizeof(struct kernfs_iattrs),
 					      0, SLAB_PANIC, NULL);
+
+	kernfs_lock_init();
 }
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 2dd9c8df0f4f..cc514bda0ae7 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -18,6 +18,7 @@
 #include <linux/uidgid.h>
 #include <linux/wait.h>
 #include <linux/rwsem.h>
+#include <linux/cache.h>
 
 struct file;
 struct dentry;
@@ -34,6 +35,60 @@ struct kernfs_fs_context;
 struct kernfs_open_node;
 struct kernfs_iattrs;
 
+/*
+ * NR_KERNFS_LOCK_BITS determines size (NR_KERNFS_LOCKS) of hash
+ * table of locks.
+ * Having a small hash table would impact scalability, since
+ * more and more kernfs_node objects will end up using same lock
+ * and having a very large hash table would waste memory.
+ *
+ * At the moment size of hash table of locks is being set based on
+ * the number of CPUs as follows:
+ *
+ * NR_CPU      NR_KERNFS_LOCK_BITS      NR_KERNFS_LOCKS
+ *   1                  1                       2
+ *  2-3                 2                       4
+ *  4-7                 4                       16
+ *  8-15                6                       64
+ *  16-31               8                       256
+ *  32 and more         10                      1024
+ *
+ * The above relation between NR_CPU and number of locks is based
+ * on some internal experimentation which involved booting qemu
+ * with different values of smp, performing some sysfs operations
+ * on all CPUs and observing how increase in number of locks impacts
+ * completion time of these sysfs operations on each CPU.
+ */
+#ifdef CONFIG_SMP
+#define NR_KERNFS_LOCK_BITS (2 * (ilog2(NR_CPUS < 32 ? NR_CPUS : 32)))
+#else
+#define NR_KERNFS_LOCK_BITS     1
+#endif
+
+#define NR_KERNFS_LOCKS     (1 << NR_KERNFS_LOCK_BITS)
+
+/*
+ * There's one kernfs_open_file for each open file and one kernfs_open_node
+ * for each kernfs_node with one or more open files.
+ *
+ * filp->private_data points to seq_file whose ->private points to
+ * kernfs_open_file.
+ * kernfs_open_files are chained at kernfs_open_node->files, which is
+ * protected by kernfs_open_file_mutex.lock.
+ */
+struct kernfs_open_file_mutex {
+	struct mutex lock;
+} ____cacheline_aligned_in_smp;
+
+/*
+ * To reduce possible contention in sysfs access, arising due to single
+ * locks, use an array of locks and use kernfs_node object address as
+ * hash keys to get the index of these locks.
+ */
+struct kernfs_global_locks {
+	struct kernfs_open_file_mutex open_file_mutex[NR_KERNFS_LOCKS];
+};
+
 enum kernfs_node_type {
 	KERNFS_DIR		= 0x0001,
 	KERNFS_FILE		= 0x0002,
@@ -397,6 +452,8 @@ void kernfs_kill_sb(struct super_block *sb);
 
 void kernfs_init(void);
 
+void kernfs_lock_init(void);
+
 struct kernfs_node *kernfs_find_and_get_node_by_id(struct kernfs_root *root,
 						   u64 id);
 #else	/* CONFIG_KERNFS */
-- 
2.30.2


* [PATCH v8 06/10] kernfs: Use a per-fs rwsem to protect per-fs list of kernfs_super_info.
From: Imran Khan @ 2022-04-10  2:37 UTC
  To: tj, viro, gregkh, ebiederm; +Cc: linux-kernel

Right now the per-fs kernfs_rwsem protects the list of kernfs_super_info
instances for a kernfs_root. Since kernfs_rwsem is used to synchronize
several other operations across kernfs, and since most of these
operations don't impact kernfs_super_info, we can use a separate per-fs
rwsem to synchronize access to the list of kernfs_super_info.
This helps in reducing contention around kernfs_rwsem and also allows
operations that change or access the list of kernfs_super_info to
proceed without contending for kernfs_rwsem.

Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
---
 fs/kernfs/dir.c             | 1 +
 fs/kernfs/file.c            | 2 ++
 fs/kernfs/kernfs-internal.h | 1 +
 fs/kernfs/mount.c           | 8 ++++----
 4 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 61a8edc4ba8b..17b438498c0b 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -917,6 +917,7 @@ struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
 	idr_init(&root->ino_idr);
 	init_rwsem(&root->kernfs_rwsem);
 	INIT_LIST_HEAD(&root->supers);
+	init_rwsem(&root->supers_rwsem);
 
 	/*
 	 * On 64bit ino setups, id is ino.  On 32bit, low 32bits are ino.
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 0946ab341ce4..0bffe5d0f510 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -890,6 +890,7 @@ static void kernfs_notify_workfn(struct work_struct *work)
 	/* kick fsnotify */
 	down_write(&root->kernfs_rwsem);
 
+	down_write(&root->supers_rwsem);
 	list_for_each_entry(info, &kernfs_root(kn)->supers, node) {
 		struct kernfs_node *parent;
 		struct inode *p_inode = NULL;
@@ -925,6 +926,7 @@ static void kernfs_notify_workfn(struct work_struct *work)
 
 		iput(inode);
 	}
+	up_write(&root->supers_rwsem);
 
 	up_write(&root->kernfs_rwsem);
 	kernfs_put(kn);
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index eeaa779b929c..82c6b16645bc 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -47,6 +47,7 @@ struct kernfs_root {
 
 	wait_queue_head_t	deactivate_waitq;
 	struct rw_semaphore	kernfs_rwsem;
+	struct rw_semaphore     supers_rwsem;
 };
 
 /* +1 to avoid triggering overflow warning when negating it */
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index a64a04efc9be..1ac36b2a89ab 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -347,9 +347,9 @@ int kernfs_get_tree(struct fs_context *fc)
 		}
 		sb->s_flags |= SB_ACTIVE;
 
-		down_write(&root->kernfs_rwsem);
+		down_write(&root->supers_rwsem);
 		list_add(&info->node, &info->root->supers);
-		up_write(&root->kernfs_rwsem);
+		up_write(&root->supers_rwsem);
 	}
 
 	fc->root = dget(sb->s_root);
@@ -376,9 +376,9 @@ void kernfs_kill_sb(struct super_block *sb)
 	struct kernfs_super_info *info = kernfs_info(sb);
 	struct kernfs_root *root = info->root;
 
-	down_write(&root->kernfs_rwsem);
+	down_write(&root->supers_rwsem);
 	list_del(&info->node);
-	up_write(&root->kernfs_rwsem);
+	up_write(&root->supers_rwsem);
 
 	/*
 	 * Remove the superblock from fs_supers/s_instances
-- 
2.30.2


* [PATCH v8 07/10] kernfs: Change kernfs_rename_lock into a read-write lock.
From: Imran Khan @ 2022-04-10  2:37 UTC
  To: tj, viro, gregkh, ebiederm; +Cc: linux-kernel

kernfs_rename_lock protects a node's ->parent and ->name, and its
current users can easily be divided into readers and writers of ->parent
and ->name. kernfs_rename_lock also protects a static buffer
(kernfs_pr_cont_buf) which is used to hold attributes (name, path etc.)
of a kernfs_node while they are being read; we can maintain this
functionality by accessing the buffer under the write lock.
So change kernfs_rename_lock into a read-write lock for better
scalability.
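
The resulting reader/writer split looks like this (fragments taken from
the hunks below):

	/* reader of ->parent / ->name, e.g. kernfs_get_parent() */
	read_lock_irqsave(&kernfs_rename_lock, flags);
	parent = kn->parent;
	kernfs_get(parent);
	read_unlock_irqrestore(&kernfs_rename_lock, flags);

	/* writer: rename, or any user of the shared kernfs_pr_cont_buf */
	write_lock_irq(&kernfs_rename_lock);
	/* ... update ->parent/->name or format into kernfs_pr_cont_buf ... */
	write_unlock_irq(&kernfs_rename_lock);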

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
---
 fs/kernfs/dir.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 17b438498c0b..8e8c8b2c350d 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -17,7 +17,7 @@
 
 #include "kernfs-internal.h"
 
-static DEFINE_SPINLOCK(kernfs_rename_lock);	/* kn->parent and ->name */
+static DEFINE_RWLOCK(kernfs_rename_lock);	/* kn->parent and ->name */
 static char kernfs_pr_cont_buf[PATH_MAX];	/* protected by rename_lock */
 static DEFINE_SPINLOCK(kernfs_idr_lock);	/* root->ino_idr */
 
@@ -184,9 +184,9 @@ int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen)
 	unsigned long flags;
 	int ret;
 
-	spin_lock_irqsave(&kernfs_rename_lock, flags);
+	read_lock_irqsave(&kernfs_rename_lock, flags);
 	ret = kernfs_name_locked(kn, buf, buflen);
-	spin_unlock_irqrestore(&kernfs_rename_lock, flags);
+	read_unlock_irqrestore(&kernfs_rename_lock, flags);
 	return ret;
 }
 
@@ -212,9 +212,9 @@ int kernfs_path_from_node(struct kernfs_node *to, struct kernfs_node *from,
 	unsigned long flags;
 	int ret;
 
-	spin_lock_irqsave(&kernfs_rename_lock, flags);
+	read_lock_irqsave(&kernfs_rename_lock, flags);
 	ret = kernfs_path_from_node_locked(to, from, buf, buflen);
-	spin_unlock_irqrestore(&kernfs_rename_lock, flags);
+	read_unlock_irqrestore(&kernfs_rename_lock, flags);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(kernfs_path_from_node);
@@ -229,12 +229,12 @@ void pr_cont_kernfs_name(struct kernfs_node *kn)
 {
 	unsigned long flags;
 
-	spin_lock_irqsave(&kernfs_rename_lock, flags);
+	write_lock_irqsave(&kernfs_rename_lock, flags);
 
 	kernfs_name_locked(kn, kernfs_pr_cont_buf, sizeof(kernfs_pr_cont_buf));
 	pr_cont("%s", kernfs_pr_cont_buf);
 
-	spin_unlock_irqrestore(&kernfs_rename_lock, flags);
+	write_unlock_irqrestore(&kernfs_rename_lock, flags);
 }
 
 /**
@@ -248,7 +248,7 @@ void pr_cont_kernfs_path(struct kernfs_node *kn)
 	unsigned long flags;
 	int sz;
 
-	spin_lock_irqsave(&kernfs_rename_lock, flags);
+	write_lock_irqsave(&kernfs_rename_lock, flags);
 
 	sz = kernfs_path_from_node_locked(kn, NULL, kernfs_pr_cont_buf,
 					  sizeof(kernfs_pr_cont_buf));
@@ -265,7 +265,7 @@ void pr_cont_kernfs_path(struct kernfs_node *kn)
 	pr_cont("%s", kernfs_pr_cont_buf);
 
 out:
-	spin_unlock_irqrestore(&kernfs_rename_lock, flags);
+	write_unlock_irqrestore(&kernfs_rename_lock, flags);
 }
 
 /**
@@ -280,10 +280,10 @@ struct kernfs_node *kernfs_get_parent(struct kernfs_node *kn)
 	struct kernfs_node *parent;
 	unsigned long flags;
 
-	spin_lock_irqsave(&kernfs_rename_lock, flags);
+	read_lock_irqsave(&kernfs_rename_lock, flags);
 	parent = kn->parent;
 	kernfs_get(parent);
-	spin_unlock_irqrestore(&kernfs_rename_lock, flags);
+	read_unlock_irqrestore(&kernfs_rename_lock, flags);
 
 	return parent;
 }
@@ -824,12 +824,12 @@ static struct kernfs_node *kernfs_walk_ns(struct kernfs_node *parent,
 	lockdep_assert_held_read(&kernfs_root(parent)->kernfs_rwsem);
 
 	/* grab kernfs_rename_lock to piggy back on kernfs_pr_cont_buf */
-	spin_lock_irq(&kernfs_rename_lock);
+	write_lock_irq(&kernfs_rename_lock);
 
 	len = strlcpy(kernfs_pr_cont_buf, path, sizeof(kernfs_pr_cont_buf));
 
 	if (len >= sizeof(kernfs_pr_cont_buf)) {
-		spin_unlock_irq(&kernfs_rename_lock);
+		write_unlock_irq(&kernfs_rename_lock);
 		return NULL;
 	}
 
@@ -841,7 +841,7 @@ static struct kernfs_node *kernfs_walk_ns(struct kernfs_node *parent,
 		parent = kernfs_find_ns(parent, name, ns);
 	}
 
-	spin_unlock_irq(&kernfs_rename_lock);
+	write_unlock_irq(&kernfs_rename_lock);
 
 	return parent;
 }
@@ -1635,7 +1635,7 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
 	kernfs_get(new_parent);
 
 	/* rename_lock protects ->parent and ->name accessors */
-	spin_lock_irq(&kernfs_rename_lock);
+	write_lock_irq(&kernfs_rename_lock);
 
 	old_parent = kn->parent;
 	kn->parent = new_parent;
@@ -1646,7 +1646,7 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
 		kn->name = new_name;
 	}
 
-	spin_unlock_irq(&kernfs_rename_lock);
+	write_unlock_irq(&kernfs_rename_lock);
 
 	kn->hash = kernfs_name_hash(kn->name, kn->ns);
 	kernfs_link_sibling(kn);
-- 
2.30.2


* [PATCH v8 08/10] kernfs: Introduce interface to access per-fs rwsem.
From: Imran Khan @ 2022-04-10  2:37 UTC
  To: tj, viro, gregkh, ebiederm; +Cc: linux-kernel

The per-fs rwsem is used across kernfs for synchronization purposes.
Having an interface to access it not only avoids code duplication, it
can also help in changing the underlying locking mechanism without
needing to change the lock users. For example, the next patch modifies
this interface to make use of hashed rwsems in place of the per-fs
rwsem.
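
The read-side caller pattern then looks like this (a sketch of the shape
visible in the hunks below; the helper returns the rwsem it acquired so
the matching release helper does not have to recompute it):

	struct rw_semaphore *rwsem;

	rwsem = kernfs_down_read(kn);
	/* ... read-side work on kn ... */
	kernfs_up_read(rwsem);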

Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
---
 fs/kernfs/dir.c             | 114 ++++++++++++++++++------------------
 fs/kernfs/file.c            |   5 +-
 fs/kernfs/inode.c           |  26 ++++----
 fs/kernfs/kernfs-internal.h |  78 ++++++++++++++++++++++++
 fs/kernfs/mount.c           |   6 +-
 fs/kernfs/symlink.c         |   6 +-
 6 files changed, 156 insertions(+), 79 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 8e8c8b2c350d..f8520d842b39 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -25,7 +25,7 @@ static DEFINE_SPINLOCK(kernfs_idr_lock);	/* root->ino_idr */
 
 static bool kernfs_active(struct kernfs_node *kn)
 {
-	lockdep_assert_held(&kernfs_root(kn)->kernfs_rwsem);
+	kernfs_rwsem_assert_held(kn);
 	return atomic_read(&kn->active) >= 0;
 }
 
@@ -461,10 +461,16 @@ static void kernfs_drain(struct kernfs_node *kn)
 {
 	struct kernfs_root *root = kernfs_root(kn);
 
-	lockdep_assert_held_write(&root->kernfs_rwsem);
+	/**
+	 * kn has the same root as its ancestor, so it can be used to get
+	 * per-fs rwsem.
+	 */
+	struct rw_semaphore *rwsem = kernfs_rwsem_ptr(kn);
+
+	kernfs_rwsem_assert_held_write(kn);
 	WARN_ON_ONCE(kernfs_active(kn));
 
-	up_write(&root->kernfs_rwsem);
+	kernfs_up_write(rwsem);
 
 	if (kernfs_lockdep(kn)) {
 		rwsem_acquire(&kn->dep_map, 0, 0, _RET_IP_);
@@ -483,7 +489,7 @@ static void kernfs_drain(struct kernfs_node *kn)
 
 	kernfs_drain_open_files(kn);
 
-	down_write(&root->kernfs_rwsem);
+	kernfs_down_write(kn);
 }
 
 /**
@@ -718,12 +724,12 @@ struct kernfs_node *kernfs_find_and_get_node_by_id(struct kernfs_root *root,
 int kernfs_add_one(struct kernfs_node *kn)
 {
 	struct kernfs_node *parent = kn->parent;
-	struct kernfs_root *root = kernfs_root(parent);
 	struct kernfs_iattrs *ps_iattr;
+	struct rw_semaphore *rwsem;
 	bool has_ns;
 	int ret;
 
-	down_write(&root->kernfs_rwsem);
+	rwsem = kernfs_down_write(parent);
 
 	ret = -EINVAL;
 	has_ns = kernfs_ns_enabled(parent);
@@ -754,7 +760,7 @@ int kernfs_add_one(struct kernfs_node *kn)
 		ps_iattr->ia_mtime = ps_iattr->ia_ctime;
 	}
 
-	up_write(&root->kernfs_rwsem);
+	kernfs_up_write(rwsem);
 
 	/*
 	 * Activate the new node unless CREATE_DEACTIVATED is requested.
@@ -768,7 +774,7 @@ int kernfs_add_one(struct kernfs_node *kn)
 	return 0;
 
 out_unlock:
-	up_write(&root->kernfs_rwsem);
+	kernfs_up_write(rwsem);
 	return ret;
 }
 
@@ -789,7 +795,7 @@ static struct kernfs_node *kernfs_find_ns(struct kernfs_node *parent,
 	bool has_ns = kernfs_ns_enabled(parent);
 	unsigned int hash;
 
-	lockdep_assert_held(&kernfs_root(parent)->kernfs_rwsem);
+	kernfs_rwsem_assert_held(parent);
 
 	if (has_ns != (bool)ns) {
 		WARN(1, KERN_WARNING "kernfs: ns %s in '%s' for '%s'\n",
@@ -821,7 +827,7 @@ static struct kernfs_node *kernfs_walk_ns(struct kernfs_node *parent,
 	size_t len;
 	char *p, *name;
 
-	lockdep_assert_held_read(&kernfs_root(parent)->kernfs_rwsem);
+	kernfs_rwsem_assert_held_read(parent);
 
 	/* grab kernfs_rename_lock to piggy back on kernfs_pr_cont_buf */
 	write_lock_irq(&kernfs_rename_lock);
@@ -860,12 +866,12 @@ struct kernfs_node *kernfs_find_and_get_ns(struct kernfs_node *parent,
 					   const char *name, const void *ns)
 {
 	struct kernfs_node *kn;
-	struct kernfs_root *root = kernfs_root(parent);
+	struct rw_semaphore *rwsem;
 
-	down_read(&root->kernfs_rwsem);
+	rwsem = kernfs_down_read(parent);
 	kn = kernfs_find_ns(parent, name, ns);
 	kernfs_get(kn);
-	up_read(&root->kernfs_rwsem);
+	kernfs_up_read(rwsem);
 
 	return kn;
 }
@@ -885,12 +891,12 @@ struct kernfs_node *kernfs_walk_and_get_ns(struct kernfs_node *parent,
 					   const char *path, const void *ns)
 {
 	struct kernfs_node *kn;
-	struct kernfs_root *root = kernfs_root(parent);
+	struct rw_semaphore *rwsem;
 
-	down_read(&root->kernfs_rwsem);
+	rwsem = kernfs_down_read(parent);
 	kn = kernfs_walk_ns(parent, path, ns);
 	kernfs_get(kn);
-	up_read(&root->kernfs_rwsem);
+	kernfs_up_read(rwsem);
 
 	return kn;
 }
@@ -1055,7 +1061,7 @@ struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
 static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
 {
 	struct kernfs_node *kn;
-	struct kernfs_root *root;
+	struct rw_semaphore *rwsem;
 
 	if (flags & LOOKUP_RCU)
 		return -ECHILD;
@@ -1071,13 +1077,12 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
 		parent = kernfs_dentry_node(dentry->d_parent);
 		if (parent) {
 			spin_unlock(&dentry->d_lock);
-			root = kernfs_root(parent);
-			down_read(&root->kernfs_rwsem);
+			rwsem = kernfs_down_read(parent);
 			if (kernfs_dir_changed(parent, dentry)) {
-				up_read(&root->kernfs_rwsem);
+				kernfs_up_read(rwsem);
 				return 0;
 			}
-			up_read(&root->kernfs_rwsem);
+			kernfs_up_read(rwsem);
 		} else
 			spin_unlock(&dentry->d_lock);
 
@@ -1088,8 +1093,7 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
 	}
 
 	kn = kernfs_dentry_node(dentry);
-	root = kernfs_root(kn);
-	down_read(&root->kernfs_rwsem);
+	rwsem = kernfs_down_read(kn);
 
 	/* The kernfs node has been deactivated */
 	if (!kernfs_active(kn))
@@ -1108,10 +1112,10 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
 	    kernfs_info(dentry->d_sb)->ns != kn->ns)
 		goto out_bad;
 
-	up_read(&root->kernfs_rwsem);
+	kernfs_up_read(rwsem);
 	return 1;
 out_bad:
-	up_read(&root->kernfs_rwsem);
+	kernfs_up_read(rwsem);
 	return 0;
 }
 
@@ -1125,12 +1129,11 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
 {
 	struct kernfs_node *parent = dir->i_private;
 	struct kernfs_node *kn;
-	struct kernfs_root *root;
 	struct inode *inode = NULL;
 	const void *ns = NULL;
+	struct rw_semaphore *rwsem;
 
-	root = kernfs_root(parent);
-	down_read(&root->kernfs_rwsem);
+	rwsem = kernfs_down_read(parent);
 	if (kernfs_ns_enabled(parent))
 		ns = kernfs_info(dir->i_sb)->ns;
 
@@ -1141,7 +1144,7 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
 		 * create a negative.
 		 */
 		if (!kernfs_active(kn)) {
-			up_read(&root->kernfs_rwsem);
+			kernfs_up_read(rwsem);
 			return NULL;
 		}
 		inode = kernfs_get_inode(dir->i_sb, kn);
@@ -1156,7 +1159,7 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
 	 */
 	if (!IS_ERR(inode))
 		kernfs_set_rev(parent, dentry);
-	up_read(&root->kernfs_rwsem);
+	kernfs_up_read(rwsem);
 
 	/* instantiate and hash (possibly negative) dentry */
 	return d_splice_alias(inode, dentry);
@@ -1279,7 +1282,7 @@ static struct kernfs_node *kernfs_next_descendant_post(struct kernfs_node *pos,
 {
 	struct rb_node *rbn;
 
-	lockdep_assert_held_write(&kernfs_root(root)->kernfs_rwsem);
+	kernfs_rwsem_assert_held_write(root);
 
 	/* if first iteration, visit leftmost descendant which may be root */
 	if (!pos)
@@ -1314,9 +1317,9 @@ static struct kernfs_node *kernfs_next_descendant_post(struct kernfs_node *pos,
 void kernfs_activate(struct kernfs_node *kn)
 {
 	struct kernfs_node *pos;
-	struct kernfs_root *root = kernfs_root(kn);
+	struct rw_semaphore *rwsem;
 
-	down_write(&root->kernfs_rwsem);
+	rwsem = kernfs_down_write(kn);
 
 	pos = NULL;
 	while ((pos = kernfs_next_descendant_post(pos, kn))) {
@@ -1330,14 +1333,14 @@ void kernfs_activate(struct kernfs_node *kn)
 		pos->flags |= KERNFS_ACTIVATED;
 	}
 
-	up_write(&root->kernfs_rwsem);
+	kernfs_up_write(rwsem);
 }
 
 static void __kernfs_remove(struct kernfs_node *kn)
 {
 	struct kernfs_node *pos;
 
-	lockdep_assert_held_write(&kernfs_root(kn)->kernfs_rwsem);
+	kernfs_rwsem_assert_held_write(kn);
 
 	/*
 	 * Short-circuit if non-root @kn has already finished removal.
@@ -1407,11 +1410,11 @@ static void __kernfs_remove(struct kernfs_node *kn)
  */
 void kernfs_remove(struct kernfs_node *kn)
 {
-	struct kernfs_root *root = kernfs_root(kn);
+	struct rw_semaphore *rwsem;
 
-	down_write(&root->kernfs_rwsem);
+	rwsem = kernfs_down_write(kn);
 	__kernfs_remove(kn);
-	up_write(&root->kernfs_rwsem);
+	kernfs_up_write(rwsem);
 }
 
 /**
@@ -1497,9 +1500,9 @@ void kernfs_unbreak_active_protection(struct kernfs_node *kn)
 bool kernfs_remove_self(struct kernfs_node *kn)
 {
 	bool ret;
-	struct kernfs_root *root = kernfs_root(kn);
+	struct rw_semaphore *rwsem;
 
-	down_write(&root->kernfs_rwsem);
+	rwsem = kernfs_down_write(kn);
 	kernfs_break_active_protection(kn);
 
 	/*
@@ -1527,9 +1530,9 @@ bool kernfs_remove_self(struct kernfs_node *kn)
 			    atomic_read(&kn->active) == KN_DEACTIVATED_BIAS)
 				break;
 
-			up_write(&root->kernfs_rwsem);
+			kernfs_up_write(rwsem);
 			schedule();
-			down_write(&root->kernfs_rwsem);
+			rwsem = kernfs_down_write(kn);
 		}
 		finish_wait(waitq, &wait);
 		WARN_ON_ONCE(!RB_EMPTY_NODE(&kn->rb));
@@ -1542,7 +1545,7 @@ bool kernfs_remove_self(struct kernfs_node *kn)
 	 */
 	kernfs_unbreak_active_protection(kn);
 
-	up_write(&root->kernfs_rwsem);
+	kernfs_up_write(rwsem);
 	return ret;
 }
 
@@ -1559,7 +1562,7 @@ int kernfs_remove_by_name_ns(struct kernfs_node *parent, const char *name,
 			     const void *ns)
 {
 	struct kernfs_node *kn;
-	struct kernfs_root *root;
+	struct rw_semaphore *rwsem;
 
 	if (!parent) {
 		WARN(1, KERN_WARNING "kernfs: can not remove '%s', no directory\n",
@@ -1567,14 +1570,13 @@ int kernfs_remove_by_name_ns(struct kernfs_node *parent, const char *name,
 		return -ENOENT;
 	}
 
-	root = kernfs_root(parent);
-	down_write(&root->kernfs_rwsem);
+	rwsem = kernfs_down_write(parent);
 
 	kn = kernfs_find_ns(parent, name, ns);
 	if (kn)
 		__kernfs_remove(kn);
 
-	up_write(&root->kernfs_rwsem);
+	kernfs_up_write(rwsem);
 
 	if (kn)
 		return 0;
@@ -1593,16 +1595,15 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
 		     const char *new_name, const void *new_ns)
 {
 	struct kernfs_node *old_parent;
-	struct kernfs_root *root;
 	const char *old_name = NULL;
+	struct rw_semaphore *rwsem;
 	int error;
 
 	/* can't move or rename root */
 	if (!kn->parent)
 		return -EINVAL;
 
-	root = kernfs_root(kn);
-	down_write(&root->kernfs_rwsem);
+	rwsem = kernfs_down_write(kn);
 
 	error = -ENOENT;
 	if (!kernfs_active(kn) || !kernfs_active(new_parent) ||
@@ -1656,7 +1657,7 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
 
 	error = 0;
  out:
-	up_write(&root->kernfs_rwsem);
+	kernfs_up_write(rwsem);
 	return error;
 }
 
@@ -1727,14 +1728,13 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
 	struct dentry *dentry = file->f_path.dentry;
 	struct kernfs_node *parent = kernfs_dentry_node(dentry);
 	struct kernfs_node *pos = file->private_data;
-	struct kernfs_root *root;
 	const void *ns = NULL;
+	struct rw_semaphore *rwsem;
 
 	if (!dir_emit_dots(file, ctx))
 		return 0;
 
-	root = kernfs_root(parent);
-	down_read(&root->kernfs_rwsem);
+	rwsem = kernfs_down_read(parent);
 
 	if (kernfs_ns_enabled(parent))
 		ns = kernfs_info(dentry->d_sb)->ns;
@@ -1751,12 +1751,12 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
 		file->private_data = pos;
 		kernfs_get(pos);
 
-		up_read(&root->kernfs_rwsem);
+		kernfs_up_read(rwsem);
 		if (!dir_emit(ctx, name, len, ino, type))
 			return 0;
-		down_read(&root->kernfs_rwsem);
+		rwsem = kernfs_down_read(parent);
 	}
-	up_read(&root->kernfs_rwsem);
+	kernfs_up_read(rwsem);
 	file->private_data = NULL;
 	ctx->pos = INT_MAX;
 	return 0;
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 0bffe5d0f510..03700388baa9 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -869,6 +869,7 @@ static void kernfs_notify_workfn(struct work_struct *work)
 	struct kernfs_root *root;
 	struct llist_node *free;
 	struct kernfs_elem_attr *attr;
+	struct rw_semaphore *rwsem;
 repeat:
 	/**
 	 * pop one off the notify_list.
@@ -888,7 +889,7 @@ static void kernfs_notify_workfn(struct work_struct *work)
 	kn = attribute_to_node(attr, struct kernfs_node, attr);
 	root = kernfs_root(kn);
 	/* kick fsnotify */
-	down_write(&root->kernfs_rwsem);
+	rwsem = kernfs_down_write(kn);
 
 	down_write(&root->supers_rwsem);
 	list_for_each_entry(info, &kernfs_root(kn)->supers, node) {
@@ -928,7 +929,7 @@ static void kernfs_notify_workfn(struct work_struct *work)
 	}
 	up_write(&root->supers_rwsem);
 
-	up_write(&root->kernfs_rwsem);
+	kernfs_up_write(rwsem);
 	kernfs_put(kn);
 	goto repeat;
 }
diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index 3d783d80f5da..efe5ae98abf4 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -99,11 +99,11 @@ int __kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr)
 int kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr)
 {
 	int ret;
-	struct kernfs_root *root = kernfs_root(kn);
+	struct rw_semaphore *rwsem;
 
-	down_write(&root->kernfs_rwsem);
+	rwsem = kernfs_down_write(kn);
 	ret = __kernfs_setattr(kn, iattr);
-	up_write(&root->kernfs_rwsem);
+	kernfs_up_write(rwsem);
 	return ret;
 }
 
@@ -112,14 +112,13 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
 {
 	struct inode *inode = d_inode(dentry);
 	struct kernfs_node *kn = inode->i_private;
-	struct kernfs_root *root;
+	struct rw_semaphore *rwsem;
 	int error;
 
 	if (!kn)
 		return -EINVAL;
 
-	root = kernfs_root(kn);
-	down_write(&root->kernfs_rwsem);
+	rwsem = kernfs_down_write(kn);
 	error = setattr_prepare(&init_user_ns, dentry, iattr);
 	if (error)
 		goto out;
@@ -132,7 +131,7 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
 	setattr_copy(&init_user_ns, inode, iattr);
 
 out:
-	up_write(&root->kernfs_rwsem);
+	kernfs_up_write(rwsem);
 	return error;
 }
 
@@ -187,14 +186,14 @@ int kernfs_iop_getattr(struct user_namespace *mnt_userns,
 {
 	struct inode *inode = d_inode(path->dentry);
 	struct kernfs_node *kn = inode->i_private;
-	struct kernfs_root *root = kernfs_root(kn);
+	struct rw_semaphore *rwsem;
 
-	down_read(&root->kernfs_rwsem);
+	rwsem = kernfs_down_read(kn);
 	spin_lock(&inode->i_lock);
 	kernfs_refresh_inode(kn, inode);
 	generic_fillattr(&init_user_ns, inode, stat);
 	spin_unlock(&inode->i_lock);
-	up_read(&root->kernfs_rwsem);
+	kernfs_up_read(rwsem);
 
 	return 0;
 }
@@ -277,22 +276,21 @@ void kernfs_evict_inode(struct inode *inode)
 int kernfs_iop_permission(struct user_namespace *mnt_userns,
 			  struct inode *inode, int mask)
 {
+	struct rw_semaphore *rwsem;
 	struct kernfs_node *kn;
-	struct kernfs_root *root;
 	int ret;
 
 	if (mask & MAY_NOT_BLOCK)
 		return -ECHILD;
 
 	kn = inode->i_private;
-	root = kernfs_root(kn);
 
-	down_read(&root->kernfs_rwsem);
+	rwsem = kernfs_down_read(kn);
 	spin_lock(&inode->i_lock);
 	kernfs_refresh_inode(kn, inode);
 	ret = generic_permission(&init_user_ns, inode, mask);
 	spin_unlock(&inode->i_lock);
-	up_read(&root->kernfs_rwsem);
+	kernfs_up_read(rwsem);
 
 	return ret;
 }
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 82c6b16645bc..0c49cf57f80f 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -165,4 +165,82 @@ void kernfs_drain_open_files(struct kernfs_node *kn);
  */
 extern const struct inode_operations kernfs_symlink_iops;
 
+static inline struct rw_semaphore *kernfs_rwsem_ptr(struct kernfs_node *kn)
+{
+	struct kernfs_root *root = kernfs_root(kn);
+
+	return &root->kernfs_rwsem;
+}
+
+static inline void kernfs_rwsem_assert_held(struct kernfs_node *kn)
+{
+	lockdep_assert_held(kernfs_rwsem_ptr(kn));
+}
+
+static inline void kernfs_rwsem_assert_held_write(struct kernfs_node *kn)
+{
+	lockdep_assert_held_write(kernfs_rwsem_ptr(kn));
+}
+
+static inline void kernfs_rwsem_assert_held_read(struct kernfs_node *kn)
+{
+	lockdep_assert_held_read(kernfs_rwsem_ptr(kn));
+}
+
+/**
+ * kernfs_down_write() - Acquire kernfs rwsem
+ *
+ * @kn: kernfs_node for which rwsem needs to be taken
+ *
+ * Return: pointer to acquired rwsem
+ */
+static inline struct rw_semaphore *kernfs_down_write(struct kernfs_node *kn)
+{
+	struct rw_semaphore *rwsem = kernfs_rwsem_ptr(kn);
+
+	down_write(rwsem);
+
+	return rwsem;
+}
+
+/**
+ * kernfs_up_write() - Release kernfs rwsem
+ *
+ * @rwsem: address of rwsem to release
+ *
+ * Return: void
+ */
+static inline void kernfs_up_write(struct rw_semaphore *rwsem)
+{
+	up_write(rwsem);
+}
+
+/**
+ * kernfs_down_read() - Acquire kernfs rwsem
+ *
+ * @kn: kernfs_node for which rwsem needs to be taken
+ *
+ * Return: pointer to acquired rwsem
+ */
+static inline struct rw_semaphore *kernfs_down_read(struct kernfs_node *kn)
+{
+	struct rw_semaphore *rwsem = kernfs_rwsem_ptr(kn);
+
+	down_read(rwsem);
+
+	return rwsem;
+}
+
+/**
+ * kernfs_up_read() - Release kernfs rwsem
+ *
+ * @rwsem: address of rwsem to release
+ *
+ * Return: void
+ */
+static inline void kernfs_up_read(struct rw_semaphore *rwsem)
+{
+	up_read(rwsem);
+}
+
 #endif	/* __KERNFS_INTERNAL_H */
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 1ac36b2a89ab..0e872824b7db 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -237,9 +237,9 @@ struct dentry *kernfs_node_dentry(struct kernfs_node *kn,
 static int kernfs_fill_super(struct super_block *sb, struct kernfs_fs_context *kfc)
 {
 	struct kernfs_super_info *info = kernfs_info(sb);
-	struct kernfs_root *kf_root = kfc->root;
 	struct inode *inode;
 	struct dentry *root;
+	struct rw_semaphore *rwsem;
 
 	info->sb = sb;
 	/* Userspace would break if executables or devices appear on sysfs */
@@ -257,9 +257,9 @@ static int kernfs_fill_super(struct super_block *sb, struct kernfs_fs_context *k
 	sb->s_shrink.seeks = 0;
 
 	/* get root inode, initialize and unlock it */
-	down_read(&kf_root->kernfs_rwsem);
+	rwsem = kernfs_down_read(info->root->kn);
 	inode = kernfs_get_inode(sb, info->root->kn);
-	up_read(&kf_root->kernfs_rwsem);
+	kernfs_up_read(rwsem);
 	if (!inode) {
 		pr_debug("kernfs: could not get root inode\n");
 		return -ENOMEM;
diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
index 0ab13824822f..9d4103602554 100644
--- a/fs/kernfs/symlink.c
+++ b/fs/kernfs/symlink.c
@@ -113,12 +113,12 @@ static int kernfs_getlink(struct inode *inode, char *path)
 	struct kernfs_node *kn = inode->i_private;
 	struct kernfs_node *parent = kn->parent;
 	struct kernfs_node *target = kn->symlink.target_kn;
-	struct kernfs_root *root = kernfs_root(parent);
+	struct rw_semaphore *rwsem;
 	int error;
 
-	down_read(&root->kernfs_rwsem);
+	rwsem = kernfs_down_read(parent);
 	error = kernfs_get_target_path(parent, target, path);
-	up_read(&root->kernfs_rwsem);
+	kernfs_up_read(rwsem);
 
 	return error;
 }
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v8 09/10] kernfs: Replace per-fs rwsem with hashed rwsems.
  2022-04-10  2:37 [PATCH v8 00/10] kernfs: Remove reference counting for kernfs_open_node Imran Khan
                   ` (7 preceding siblings ...)
  2022-04-10  2:37 ` [PATCH v8 08/10] kernfs: Introduce interface to access per-fs rwsem Imran Khan
@ 2022-04-10  2:37 ` Imran Khan
  2022-04-10  2:37 ` [PATCH v8 10/10] kernfs: Add a document to describe hashed locks used in kernfs Imran Khan
  2022-04-22 17:03 ` [PATCH v8 00/10] kernfs: Remove reference counting for kernfs_open_node Tejun Heo
  10 siblings, 0 replies; 25+ messages in thread
From: Imran Khan @ 2022-04-10  2:37 UTC (permalink / raw)
  To: tj, viro, gregkh, ebiederm; +Cc: linux-kernel

Having a single rwsem to synchronize all operations across a kernfs
based file system (cgroup, sysfs etc.) does not scale well. The contention
around this single rwsem becomes more apparent in large scale systems with
a few hundred CPUs, where most of the CPUs have running tasks that are
opening, accessing or closing sysfs files at any point of time.

Using hashed rwsems in place of a per-fs rwsem can significantly reduce
contention around the per-fs rwsem and hence provide better scalability.
Moreover, as these hashed rwsems are not part of kernfs_node objects, we
will not see any significant change in memory utilization of kernfs based
file systems like sysfs, cgroupfs etc.

Modify the interface introduced in the previous patch to make use of hashed
rwsems. Just like the earlier change, use the kernfs_node address as the
hashing key. Since we are getting rid of the per-fs lock, in certain cases
we may need to acquire locks corresponding to multiple nodes, and in such
cases of nested locking the locks are taken in order of their addresses.
Introduce helpers to acquire rwsems corresponding to multiple nodes for
such cases; a sketch of the idea follows.
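
To make the ordering rule concrete, here is a minimal sketch (for
illustration only: kernfs_down_write_pair is a hypothetical example helper,
while kernfs_rwsem_ptr and the hashed lock table match the code added by
this patch):

  /* Hash the node address to pick one of the NR_KERNFS_LOCKS rwsems. */
  static inline struct rw_semaphore *kernfs_rwsem_ptr(struct kernfs_node *kn)
  {
          int idx = hash_ptr(kn, NR_KERNFS_LOCK_BITS);

          return &kernfs_locks->kernfs_rwsem[idx];
  }

  /* Nested locking: always acquire the lower-addressed rwsem first. */
  static void kernfs_down_write_pair(struct kernfs_node *kn1,
                                     struct kernfs_node *kn2)
  {
          struct rw_semaphore *a = kernfs_rwsem_ptr(kn1);
          struct rw_semaphore *b = kernfs_rwsem_ptr(kn2);

          if (a == b) {
                  /* Both nodes hash to the same rwsem; take it once. */
                  down_write_nested(a, 0);
          } else {
                  if (a > b)
                          swap(a, b);
                  down_write_nested(a, 0);
                  down_write_nested(b, 1);
          }
  }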

For operations that involve finding the node first and then operating on it
(for example operations involving find_and_get_ns), acquiring the rwsem of
the node being searched for is not possible up front. Such operations need
to make sure that the found node is not removed in parallel before it can
be pinned. Introduce a per-fs mutex that can be used to synchronize these
operations against parallel removal of the involved node.
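
As a rough sketch of that pattern (illustrative only: lookup_and_pin is a
hypothetical caller, the other helpers are the ones added by this series):

  struct kernfs_node *lookup_and_pin(struct kernfs_node *parent,
                                     const char *name, const void *ns)
  {
          struct kernfs_root *root = kernfs_root(parent);
          struct rw_semaphore *rwsem;
          struct kernfs_node *kn;

          /* Block a parallel __kernfs_remove() of the found node. */
          mutex_lock(&root->kernfs_rm_mutex);
          rwsem = kernfs_down_read(parent);
          kn = kernfs_find_ns(parent, name, ns);
          kernfs_get(kn);         /* pin before dropping the locks */
          kernfs_up_read(rwsem);
          mutex_unlock(&root->kernfs_rm_mutex);

          return kn;
  }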

Replacing the global mutex and spinlocks with hashed ones (as mentioned in
previous changes) and the global rwsem with hashed rwsems (as done in this
change) reduces contention around kernfs and results in better performance
numbers.

For example on a system with 384 cores, if I run 200 instances of an
application which is mostly executing the following loop:

  for (int loop = 0; loop < 100; loop++)
  {
    for (int port_num = 1; port_num < 2; port_num++)
    {
      for (int gid_index = 0; gid_index < 254; gid_index++)
      {
        char ret_buf[64];
        char gid_file_path[1024];

        int      ret_len;
        int      ret_fd;
        ssize_t  ret_rd;
        int      saved_errno;

        memset(ret_buf, 0, sizeof(ret_buf));
        memset(gid_file_path, 0, sizeof(gid_file_path));

        ret_len = snprintf(gid_file_path, sizeof(gid_file_path),
                           "/sys/class/infiniband/%s/ports/%d/gids/%d",
                           dev_name,
                           port_num,
                           gid_index);

        ret_fd = open(gid_file_path, O_RDONLY | O_CLOEXEC);
        if (ret_fd < 0)
        {
          printf("Failed to open %s\n", gid_file_path);
          continue;
        }

        /* Read the GID */
        ret_rd = read(ret_fd, ret_buf, 40);

        if (ret_rd == -1)
        {
          saved_errno = errno;
          printf("Failed to read from file %s, errno: %u\n",
                 gid_file_path, saved_errno);
          close(ret_fd);
          continue;
        }

        close(ret_fd);
      }
    }
  }

I can see contention around above mentioned locks as follows:

-   54.07%    53.60%  showgids         [kernel.kallsyms]       [k] osq_lock
   - 53.60% __libc_start_main
      - 32.29% __GI___libc_open
           entry_SYSCALL_64_after_hwframe
           do_syscall_64
           sys_open
           do_sys_open
           do_filp_open
           path_openat
           vfs_open
           do_dentry_open
           kernfs_fop_open
           mutex_lock
         - __mutex_lock_slowpath
            - 32.23% __mutex_lock.isra.5
                 osq_lock
      - 21.31% __GI___libc_close
           entry_SYSCALL_64_after_hwframe
           do_syscall_64
           exit_to_usermode_loop
           task_work_run
           ____fput
           __fput
           kernfs_fop_release
           kernfs_put_open_node.isra.8
           mutex_lock
         - __mutex_lock_slowpath
            - 21.28% __mutex_lock.isra.5
                 osq_lock

-   10.49%    10.39%  showgids         [kernel.kallsyms]      [k] down_read
     10.39% __libc_start_main
        __GI___libc_open
        entry_SYSCALL_64_after_hwframe
        do_syscall_64
        sys_open
        do_sys_open
        do_filp_open
      - path_openat
         - 9.72% link_path_walk
            - 5.21% inode_permission
               - __inode_permission
                  - 5.21% kernfs_iop_permission
                       down_read
            - 4.08% walk_component
                 lookup_fast
               - d_revalidate.part.24
                  - 4.08% kernfs_dop_revalidate

-    7.48%     7.41%  showgids         [kernel.kallsyms]       [k] up_read
     7.41% __libc_start_main
        __GI___libc_open
        entry_SYSCALL_64_after_hwframe
        do_syscall_64
        sys_open
        do_sys_open
        do_filp_open
      - path_openat
         - 7.01% link_path_walk
            - 4.12% inode_permission
               - __inode_permission
                  - 4.12% kernfs_iop_permission
                       up_read
            - 2.61% walk_component
                 lookup_fast
               - d_revalidate.part.24
                  - 2.61% kernfs_dop_revalidate

Moreover, this run of 200 application instances takes 32-34 secs to
complete.

With the patched kernel on the same test setup, we no longer see contention
around osq_lock (i.e. kernfs_open_file_mutex), and contention around the
per-fs kernfs_rwsem has also reduced significantly. This can be seen in the
following perf snippet:

-    1.66%     1.65%  showgids         [kernel.kallsyms]      [k] down_read
     1.65% __libc_start_main
        __GI___libc_open
        entry_SYSCALL_64_after_hwframe
        do_syscall_64
        sys_open
        do_sys_open
        do_filp_open
      - path_openat
         - 1.62% link_path_walk
            - 0.98% inode_permission
               - __inode_permission
                  + 0.98% kernfs_iop_permission
            - 0.52% walk_component
                 lookup_fast
               - d_revalidate.part.24
                  - 0.52% kernfs_dop_revalidate

-    1.12%     1.11%  showgids         [kernel.kallsyms]      [k] up_read
     1.11% __libc_start_main
        __GI___libc_open
        entry_SYSCALL_64_after_hwframe
        do_syscall_64
        sys_open
        do_sys_open
        do_filp_open
      - path_openat
         - 1.11% link_path_walk
            - 0.69% inode_permission
               - __inode_permission
                  - 0.69% kernfs_iop_permission
                       up_read

Moreover, the test execution time has reduced from 32-34 secs to 18-19 secs.

Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
---
 fs/kernfs/Makefile          |   2 +-
 fs/kernfs/dir.c             | 171 +++++++++++++++++++-----
 fs/kernfs/inode.c           |  20 +++
 fs/kernfs/kernfs-internal.c | 259 ++++++++++++++++++++++++++++++++++++
 fs/kernfs/kernfs-internal.h |  52 +++++++-
 fs/kernfs/mount.c           |   4 +-
 fs/kernfs/symlink.c         |  11 +-
 include/linux/kernfs.h      |   1 +
 8 files changed, 478 insertions(+), 42 deletions(-)
 create mode 100644 fs/kernfs/kernfs-internal.c

diff --git a/fs/kernfs/Makefile b/fs/kernfs/Makefile
index 4ca54ff54c98..778da6b118e9 100644
--- a/fs/kernfs/Makefile
+++ b/fs/kernfs/Makefile
@@ -3,4 +3,4 @@
 # Makefile for the kernfs pseudo filesystem
 #
 
-obj-y		:= mount.o inode.o dir.o file.o symlink.o
+obj-y		:= mount.o inode.o dir.o file.o symlink.o kernfs-internal.o
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index f8520d842b39..bdc355143735 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -17,7 +17,7 @@
 
 #include "kernfs-internal.h"
 
-static DEFINE_RWLOCK(kernfs_rename_lock);	/* kn->parent and ->name */
+DEFINE_RWLOCK(kernfs_rename_lock);		/* kn->parent and ->name */
 static char kernfs_pr_cont_buf[PATH_MAX];	/* protected by rename_lock */
 static DEFINE_SPINLOCK(kernfs_idr_lock);	/* root->ino_idr */
 
@@ -25,7 +25,6 @@ static DEFINE_SPINLOCK(kernfs_idr_lock);	/* root->ino_idr */
 
 static bool kernfs_active(struct kernfs_node *kn)
 {
-	kernfs_rwsem_assert_held(kn);
 	return atomic_read(&kn->active) >= 0;
 }
 
@@ -450,26 +449,24 @@ void kernfs_put_active(struct kernfs_node *kn)
 /**
  * kernfs_drain - drain kernfs_node
  * @kn: kernfs_node to drain
+ * @anc: ancestor of @kn whose rwsem is held across the drain
  *
  * Drain existing usages and nuke all existing mmaps of @kn.  Mutiple
  * removers may invoke this function concurrently on @kn and all will
  * return after draining is complete.
  */
-static void kernfs_drain(struct kernfs_node *kn)
-	__releases(&kernfs_root(kn)->kernfs_rwsem)
-	__acquires(&kernfs_root(kn)->kernfs_rwsem)
+static void kernfs_drain(struct kernfs_node *kn, struct kernfs_node *anc)
+	__releases(kernfs_rwsem_ptr(anc))
+	__acquires(kernfs_rwsem_ptr(anc))
 {
 	struct kernfs_root *root = kernfs_root(kn);
 
-	/**
-	 * kn has the same root as its ancestor, so it can be used to get
-	 * per-fs rwsem.
-	 */
-	struct rw_semaphore *rwsem = kernfs_rwsem_ptr(kn);
+	struct rw_semaphore *rwsem;
 
-	kernfs_rwsem_assert_held_write(kn);
+	kernfs_rwsem_assert_held_write(anc);
 	WARN_ON_ONCE(kernfs_active(kn));
 
+	rwsem = kernfs_rwsem_ptr(anc);
 	kernfs_up_write(rwsem);
 
 	if (kernfs_lockdep(kn)) {
@@ -489,7 +486,7 @@ static void kernfs_drain(struct kernfs_node *kn)
 
 	kernfs_drain_open_files(kn);
 
-	kernfs_down_write(kn);
+	kernfs_down_write(anc);
 }
 
 /**
@@ -729,6 +726,11 @@ int kernfs_add_one(struct kernfs_node *kn)
 	bool has_ns;
 	int ret;
 
+	/*
+	 * The node being added is not active at this point and may be
+	 * activated later depending on the CREATE_DEACTIVATED flag, so
+	 * locking just the parent is enough here.
+	 */
 	rwsem = kernfs_down_write(parent);
 
 	ret = -EINVAL;
@@ -826,28 +828,39 @@ static struct kernfs_node *kernfs_walk_ns(struct kernfs_node *parent,
 {
 	size_t len;
 	char *p, *name;
+	struct rw_semaphore *rwsem;
+	char *buf;
 
 	kernfs_rwsem_assert_held_read(parent);
 
-	/* grab kernfs_rename_lock to piggy back on kernfs_pr_cont_buf */
-	write_lock_irq(&kernfs_rename_lock);
+	buf = p = kzalloc(PATH_MAX, GFP_KERNEL);
+	if (!buf)
+		return NULL;
 
-	len = strlcpy(kernfs_pr_cont_buf, path, sizeof(kernfs_pr_cont_buf));
+	/* Caller holds kernfs_rm_mutex, so the topology will not change */
+	len = strlcpy(p, path, PATH_MAX);
 
-	if (len >= sizeof(kernfs_pr_cont_buf)) {
-		write_unlock_irq(&kernfs_rename_lock);
+	if (len >= PATH_MAX) {
+		kfree(buf);
 		return NULL;
 	}
 
-	p = kernfs_pr_cont_buf;
+	rwsem = kernfs_rwsem_ptr(parent);
 
 	while ((name = strsep(&p, "/")) && parent) {
 		if (*name == '\0')
 			continue;
+
 		parent = kernfs_find_ns(parent, name, ns);
+		/*
+		 * Release rwsem for node whose child RB tree has been
+		 * traversed.
+		 */
+		kernfs_up_read(rwsem);
+		if (parent) /* Acquire rwsem before traversing child RB tree */
+			rwsem = kernfs_down_read(parent);
 	}
 
-	write_unlock_irq(&kernfs_rename_lock);
+	/* strsep() has advanced @p, so free via the saved pointer */
+	kfree(buf);
 
 	return parent;
 }
@@ -867,11 +880,20 @@ struct kernfs_node *kernfs_find_and_get_ns(struct kernfs_node *parent,
 {
 	struct kernfs_node *kn;
 	struct rw_semaphore *rwsem;
+	struct kernfs_root *root = kernfs_root(parent);
 
+	/*
+	 * We don't have the address of the kernfs_node being searched
+	 * for yet. Acquiring root->kernfs_rm_mutex and releasing it after
+	 * pinning the found kernfs_node ensures that the found node will
+	 * not disappear due to a parallel remove operation.
+	 */
+	mutex_lock(&root->kernfs_rm_mutex);
 	rwsem = kernfs_down_read(parent);
 	kn = kernfs_find_ns(parent, name, ns);
 	kernfs_get(kn);
 	kernfs_up_read(rwsem);
+	mutex_unlock(&root->kernfs_rm_mutex);
 
 	return kn;
 }
@@ -892,11 +914,26 @@ struct kernfs_node *kernfs_walk_and_get_ns(struct kernfs_node *parent,
 {
 	struct kernfs_node *kn;
 	struct rw_semaphore *rwsem;
+	struct kernfs_root *root = kernfs_root(parent);
 
+	/*
+	 * We don't have the address of the kernfs_node being searched
+	 * for yet. Acquiring root->kernfs_rm_mutex and releasing it after
+	 * pinning the found kernfs_node ensures that the found node will
+	 * not disappear due to a parallel remove operation.
+	 */
+	mutex_lock(&root->kernfs_rm_mutex);
 	rwsem = kernfs_down_read(parent);
 	kn = kernfs_walk_ns(parent, path, ns);
 	kernfs_get(kn);
-	kernfs_up_read(rwsem);
+	if (kn)
+		/* Release lock taken under kernfs_walk_ns */
+		kernfs_up_read(kernfs_rwsem_ptr(kn));
+	else
+		/* Release parent lock because walk_ns bailed out early */
+		kernfs_up_read(rwsem);
+
+	mutex_unlock(&root->kernfs_rm_mutex);
 
 	return kn;
 }
@@ -921,9 +958,9 @@ struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
 		return ERR_PTR(-ENOMEM);
 
 	idr_init(&root->ino_idr);
-	init_rwsem(&root->kernfs_rwsem);
 	INIT_LIST_HEAD(&root->supers);
 	init_rwsem(&root->supers_rwsem);
+	mutex_init(&root->kernfs_rm_mutex);
 
 	/*
 	 * On 64bit ino setups, id is ino.  On 32bit, low 32bits are ino.
@@ -1093,6 +1130,11 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
 	}
 
 	kn = kernfs_dentry_node(dentry);
+	/*
+	 * For dentry revalidation, just acquiring the kernfs_node's rwsem
+	 * for reading should be enough. If a competing rename or remove
+	 * wins, one of the checks below will fail.
+	 */
 	rwsem = kernfs_down_read(kn);
 
 	/* The kernfs node has been deactivated */
@@ -1132,24 +1174,35 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
 	struct inode *inode = NULL;
 	const void *ns = NULL;
 	struct rw_semaphore *rwsem;
+	struct kernfs_root *root = kernfs_root(parent);
 
+	/*
+	 * We don't have the address of the kernfs_node being searched
+	 * for yet, so take root->kernfs_rm_mutex to avoid parallel
+	 * removal of the found kernfs_node.
+	 */
+	mutex_lock(&root->kernfs_rm_mutex);
 	rwsem = kernfs_down_read(parent);
 	if (kernfs_ns_enabled(parent))
 		ns = kernfs_info(dir->i_sb)->ns;
 
 	kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
+	kernfs_up_read(rwsem);
 	/* attach dentry and inode */
 	if (kn) {
 		/* Inactive nodes are invisible to the VFS so don't
 		 * create a negative.
 		 */
+		rwsem = kernfs_down_read(kn);
 		if (!kernfs_active(kn)) {
 			kernfs_up_read(rwsem);
+			mutex_unlock(&root->kernfs_rm_mutex);
 			return NULL;
 		}
 		inode = kernfs_get_inode(dir->i_sb, kn);
 		if (!inode)
 			inode = ERR_PTR(-ENOMEM);
+		kernfs_up_read(rwsem);
 	}
 	/*
 	 * Needed for negative dentry validation.
@@ -1157,9 +1210,11 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
 	 * or transforms from positive dentry in dentry_unlink_inode()
 	 * called from vfs_rmdir().
 	 */
+	rwsem = kernfs_down_read(parent);
 	if (!IS_ERR(inode))
 		kernfs_set_rev(parent, dentry);
 	kernfs_up_read(rwsem);
+	mutex_unlock(&root->kernfs_rm_mutex);
 
 	/* instantiate and hash (possibly negative) dentry */
 	return d_splice_alias(inode, dentry);
@@ -1339,27 +1394,40 @@ void kernfs_activate(struct kernfs_node *kn)
 static void __kernfs_remove(struct kernfs_node *kn)
 {
 	struct kernfs_node *pos;
+	struct rw_semaphore *rwsem;
+	struct kernfs_root *root;
+
+	if (!kn)
+		return;
 
-	kernfs_rwsem_assert_held_write(kn);
+	root = kernfs_root(kn);
 
 	/*
 	 * Short-circuit if non-root @kn has already finished removal.
 	 * This is for kernfs_remove_self() which plays with active ref
 	 * after removal.
 	 */
-	if (!kn || (kn->parent && RB_EMPTY_NODE(&kn->rb)))
+	mutex_lock(&root->kernfs_rm_mutex);
+	rwsem = kernfs_down_write(kn);
+	if (kn->parent && RB_EMPTY_NODE(&kn->rb)) {
+		kernfs_up_write(rwsem);
+		mutex_unlock(&root->kernfs_rm_mutex);
 		return;
+	}
 
 	pr_debug("kernfs %s: removing\n", kn->name);
 
 	/* prevent any new usage under @kn by deactivating all nodes */
 	pos = NULL;
+
 	while ((pos = kernfs_next_descendant_post(pos, kn)))
 		if (kernfs_active(pos))
 			atomic_add(KN_DEACTIVATED_BIAS, &pos->active);
+	kernfs_up_write(rwsem);
 
 	/* deactivate and unlink the subtree node-by-node */
 	do {
+		rwsem = kernfs_down_write(kn);
 		pos = kernfs_leftmost_descendant(kn);
 
 		/*
@@ -1377,10 +1445,25 @@ static void __kernfs_remove(struct kernfs_node *kn)
 		 * error paths without worrying about draining.
 		 */
 		if (kn->flags & KERNFS_ACTIVATED)
-			kernfs_drain(pos);
+			kernfs_drain(pos, kn);
 		else
 			WARN_ON_ONCE(atomic_read(&kn->active) != KN_DEACTIVATED_BIAS);
 
+		kernfs_up_write(rwsem);
+
+		/*
+		 * By now the node and all of its descendants have been
+		 * deactivated. Once a descendant has been drained, acquire
+		 * its parent's lock and unlink it from the parent's children
+		 * RB tree. We drop kn's lock before acquiring pos->parent's
+		 * lock to avoid the deadlock that would happen if pos->parent
+		 * and kn hash to the same lock. Dropping kn's lock is safe
+		 * because kn is in a deactivated state. Further,
+		 * root->kernfs_rm_mutex ensures that there are no concurrent
+		 * instances of __kernfs_remove.
+		 */
+		if (pos->parent)
+			rwsem = kernfs_down_write(pos->parent);
+
 		/*
 		 * kernfs_unlink_sibling() succeeds once per node.  Use it
 		 * to decide who's responsible for cleanups.
@@ -1398,8 +1481,12 @@ static void __kernfs_remove(struct kernfs_node *kn)
 			kernfs_put(pos);
 		}
 
+		if (pos->parent)
+			kernfs_up_write(rwsem);
 		kernfs_put(pos);
 	} while (pos != kn);
+
+	mutex_unlock(&root->kernfs_rm_mutex);
 }
 
 /**
@@ -1410,11 +1497,7 @@ static void __kernfs_remove(struct kernfs_node *kn)
  */
 void kernfs_remove(struct kernfs_node *kn)
 {
-	struct rw_semaphore *rwsem;
-
-	rwsem = kernfs_down_write(kn);
 	__kernfs_remove(kn);
-	kernfs_up_write(rwsem);
 }
 
 /**
@@ -1516,9 +1599,11 @@ bool kernfs_remove_self(struct kernfs_node *kn)
 	 */
 	if (!(kn->flags & KERNFS_SUICIDAL)) {
 		kn->flags |= KERNFS_SUICIDAL;
+		kernfs_up_write(rwsem);
 		__kernfs_remove(kn);
 		kn->flags |= KERNFS_SUICIDED;
 		ret = true;
+		rwsem = kernfs_down_write(kn);
 	} else {
 		wait_queue_head_t *waitq = &kernfs_root(kn)->deactivate_waitq;
 		DEFINE_WAIT(wait);
@@ -1572,11 +1657,17 @@ int kernfs_remove_by_name_ns(struct kernfs_node *parent, const char *name,
 
 	rwsem = kernfs_down_write(parent);
 
+	/*
+	 * Since the node being searched for will be removed eventually,
+	 * we don't need to take root->kernfs_rm_mutex.
+	 * Even if a parallel remove succeeds, the subsequent __kernfs_remove
+	 * will detect it and bail out early.
+	 */
 	kn = kernfs_find_ns(parent, name, ns);
-	if (kn)
-		__kernfs_remove(kn);
 
 	kernfs_up_write(rwsem);
+	if (kn)
+		__kernfs_remove(kn);
 
 	if (kn)
 		return 0;
@@ -1596,14 +1687,26 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
 {
 	struct kernfs_node *old_parent;
 	const char *old_name = NULL;
-	struct rw_semaphore *rwsem;
+	struct kernfs_rwsem_token token;
 	int error;
+	struct kernfs_root *root = kernfs_root(kn);
 
 	/* can't move or rename root */
 	if (!kn->parent)
 		return -EINVAL;
 
-	rwsem = kernfs_down_write(kn);
+	mutex_lock(&root->kernfs_rm_mutex);
+	old_parent = kn->parent;
+	kernfs_get(old_parent);
+	kernfs_down_write_triple_nodes(kn, old_parent, new_parent, &token);
+	while (old_parent != kn->parent) {
+		kernfs_put(old_parent);
+		kernfs_up_write_triple_nodes(kn, old_parent, new_parent, &token);
+		old_parent = kn->parent;
+		kernfs_get(old_parent);
+		kernfs_down_write_triple_nodes(kn, old_parent, new_parent, &token);
+	}
+	kernfs_put(old_parent);
 
 	error = -ENOENT;
 	if (!kernfs_active(kn) || !kernfs_active(new_parent) ||
@@ -1638,7 +1741,6 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
 	/* rename_lock protects ->parent and ->name accessors */
 	write_lock_irq(&kernfs_rename_lock);
 
-	old_parent = kn->parent;
 	kn->parent = new_parent;
 
 	kn->ns = new_ns;
@@ -1657,7 +1759,8 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
 
 	error = 0;
  out:
-	kernfs_up_write(rwsem);
+	mutex_unlock(&root->kernfs_rm_mutex);
+	kernfs_up_write_triple_nodes(kn, new_parent, old_parent, &token);
 	return error;
 }
 
diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index efe5ae98abf4..36a40b08b97f 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -101,6 +101,12 @@ int kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr)
 	int ret;
 	struct rw_semaphore *rwsem;
 
+	/*
+	 * Since we are only modifying the inode attributes, we just need
+	 * to lock the involved node. Operations that add or remove a node
+	 * acquire the parent's lock before changing the inode attributes,
+	 * so such operations are also in sync with this interface.
+	 */
 	rwsem = kernfs_down_write(kn);
 	ret = __kernfs_setattr(kn, iattr);
 	kernfs_up_write(rwsem);
@@ -118,6 +124,12 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
 	if (!kn)
 		return -EINVAL;
 
+	/*
+	 * Since we are only modifying the inode attributes, we just need
+	 * to lock the involved node. Operations that add or remove a node
+	 * acquire the parent's lock before changing the inode attributes,
+	 * so such operations are also in sync with the .setattr backend.
+	 */
 	rwsem = kernfs_down_write(kn);
 	error = setattr_prepare(&init_user_ns, dentry, iattr);
 	if (error)
@@ -188,6 +200,10 @@ int kernfs_iop_getattr(struct user_namespace *mnt_userns,
 	struct kernfs_node *kn = inode->i_private;
 	struct rw_semaphore *rwsem;
 
+	/*
+	 * As we are only reading ->iattr, acquiring kn's rwsem for
+	 * reading is enough.
+	 */
 	rwsem = kernfs_down_read(kn);
 	spin_lock(&inode->i_lock);
 	kernfs_refresh_inode(kn, inode);
@@ -285,6 +301,10 @@ int kernfs_iop_permission(struct user_namespace *mnt_userns,
 
 	kn = inode->i_private;
 
+	/*
+	 * As we are only reading ->iattr, acquiring kn's rwsem for
+	 * reading is enough.
+	 */
 	rwsem = kernfs_down_read(kn);
 	spin_lock(&inode->i_lock);
 	kernfs_refresh_inode(kn, inode);
diff --git a/fs/kernfs/kernfs-internal.c b/fs/kernfs/kernfs-internal.c
new file mode 100644
index 000000000000..80d7d64532fe
--- /dev/null
+++ b/fs/kernfs/kernfs-internal.c
@@ -0,0 +1,259 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * This file provides internal helpers for kernfs.
+ */
+
+#include "kernfs-internal.h"
+
+static void kernfs_swap_rwsems(struct rw_semaphore **array, int i, int j)
+{
+	struct rw_semaphore *tmp;
+
+	tmp = array[i];
+	array[i] = array[j];
+	array[j] = tmp;
+}
+
+static void kernfs_sort_rwsems(struct kernfs_rwsem_token *token)
+{
+	struct rw_semaphore **array = &token->rwsems[0];
+
+	if (token->count == 2) {
+		if (array[0] == array[1])
+			token->count = 1;
+		else if (array[0] > array[1])
+			kernfs_swap_rwsems(array, 0, 1);
+	} else {
+		if (array[0] == array[1] && array[0] == array[2])
+			token->count = 1;
+		else {
+			if (array[0] > array[1])
+				kernfs_swap_rwsems(array, 0, 1);
+
+			if (array[0] > array[2])
+				kernfs_swap_rwsems(array, 0, 2);
+
+			if (array[1] > array[2])
+				kernfs_swap_rwsems(array, 1, 2);
+
+			if (array[0] == array[1] || array[1] == array[2])
+				token->count = 2;
+		}
+	}
+}
+
+/**
+ * kernfs_down_write_double_nodes() - take hashed rwsem for 2 nodes
+ *
+ * @kn1: first kernfs_node of interest
+ * @kn2: second kernfs_node of interest
+ * @token: token to pass unlocking information to caller
+ *
+ * Acquire hashed rwsems for 2 nodes. Some operations may need to acquire
+ * hashed rwsems for 2 nodes (for example for a node and its parent).
+ * This function can be used in such cases.
+ *
+ * Return: void
+ */
+void kernfs_down_write_double_nodes(struct kernfs_node *kn1,
+				    struct kernfs_node *kn2,
+				    struct kernfs_rwsem_token *token)
+{
+	struct rw_semaphore **array = &token->rwsems[0];
+
+	array[0] = kernfs_rwsem_ptr(kn1);
+	array[1] = kernfs_rwsem_ptr(kn2);
+	token->count = 2;
+
+	kernfs_sort_rwsems(token);
+
+	if (token->count == 1) {
+		/* Both nodes hash to same rwsem */
+		down_write_nested(array[0], 0);
+	} else {
+		/* Both nodes hash to different rwsems */
+		down_write_nested(array[0], 0);
+		down_write_nested(array[1], 1);
+	}
+}
+
+/**
+ * kernfs_up_write_double_nodes() - release hashed rwsem for 2 nodes
+ *
+ * @kn1: first kernfs_node of interest
+ * @kn2: second kernfs_node of interest
+ * @token: token to indicate unlocking information
+ *		->rwsems is a sorted list of rwsem addresses
+ *		->count contains number of unique locks
+ *
+ * Release hashed rwsems for 2 nodes
+ *
+ * Return: void
+ */
+void kernfs_up_write_double_nodes(struct kernfs_node *kn1,
+				  struct kernfs_node *kn2,
+				  struct kernfs_rwsem_token *token)
+{
+	struct rw_semaphore **array = &token->rwsems[0];
+
+	if (token->count == 1) {
+		/* Both nodes hash to same rwsem */
+		up_write(array[0]);
+	} else {
+		/* Both nodes hash to different rwsems */
+		up_write(array[0]);
+		up_write(array[1]);
+	}
+}
+
+/**
+ * kernfs_down_read_double_nodes() - take hashed rwsem for 2 nodes
+ *
+ * @kn1: first kernfs_node of interest
+ * @kn2: second kernfs_node of interest
+ * @token: token to pass unlocking information to caller
+ *
+ * Acquire hashed rwsems for 2 nodes. Some operations may need to acquire
+ * hashed rwsems for 2 nodes (for example for a node and its parent).
+ * This function can be used in such cases.
+ *
+ * Return: void
+ */
+void kernfs_down_read_double_nodes(struct kernfs_node *kn1,
+				    struct kernfs_node *kn2,
+				    struct kernfs_rwsem_token *token)
+{
+	struct rw_semaphore **array = &token->rwsems[0];
+
+	array[0] = kernfs_rwsem_ptr(kn1);
+	array[1] = kernfs_rwsem_ptr(kn2);
+	token->count = 2;
+
+	kernfs_sort_rwsems(token);
+
+	if (token->count == 1) {
+		/* Both nodes hash to same rwsem */
+		down_read_nested(array[0], 0);
+	} else {
+		/* Both nodes hash to different rwsems */
+		down_read_nested(array[0], 0);
+		down_read_nested(array[1], 1);
+	}
+}
+
+/**
+ * kernfs_up_read_double_nodes() - release hashed rwsem for 2 nodes
+ *
+ * @kn1: first kernfs_node of interest
+ * @kn2: second kernfs_node of interest
+ * @token: token to indicate unlocking information
+ *		->rwsems is a sorted list of rwsem addresses
+ *		->count contains number of unique locks
+ *
+ * Release hashed rwsems for 2 nodes
+ *
+ * Return: void
+ */
+void kernfs_up_read_double_nodes(struct kernfs_node *kn1,
+				  struct kernfs_node *kn2,
+				  struct kernfs_rwsem_token *token)
+{
+	struct rw_semaphore **array = &token->rwsems[0];
+
+	if (token->count == 1) {
+		/* Both nodes hash to same rwsem */
+		up_read(array[0]);
+	} else {
+		/* Both nodes hash to different rwsems */
+		up_read(array[0]);
+		up_read(array[1]);
+	}
+}
+
+/**
+ * kernfs_down_write_triple_nodes() - take hashed rwsem for 3 nodes
+ *
+ * @kn1: first kernfs_node of interest
+ * @kn2: second kernfs_node of interest
+ * @kn3: third kernfs_node of interest
+ * @token: token to pass unlocking information to caller
+ *
+ * Acquire hashed rwsems for 3 nodes. Some operations may need to acquire
+ * hashed rwsems for 3 nodes (for example a rename operation needs to
+ * acquire the rwsems corresponding to a node, its current parent and its
+ * future parent). This function can be used in such cases.
+ *
+ * Return: void
+ */
+void kernfs_down_write_triple_nodes(struct kernfs_node *kn1,
+				    struct kernfs_node *kn2,
+				    struct kernfs_node *kn3,
+				    struct kernfs_rwsem_token *token)
+{
+	struct rw_semaphore **array = &token->rwsems[0];
+
+	array[0] = kernfs_rwsem_ptr(kn1);
+	array[1] = kernfs_rwsem_ptr(kn2);
+	array[2] = kernfs_rwsem_ptr(kn3);
+	token->count = 3;
+
+	kernfs_sort_rwsems(token);
+
+	if (token->count == 1) {
+		/* All 3 nodes hash to same rwsem */
+		down_write_nested(array[0], 0);
+	} else if (token->count == 2) {
+		/**
+		 * Two nodes hash to same rwsem, and these
+		 * will occupy consecutive places in array after
+		 * sorting.
+		 */
+		down_write_nested(array[0], 0);
+		down_write_nested(array[2], 1);
+	} else {
+		/* All 3 nodes hash to different rwsems */
+		down_write_nested(array[0], 0);
+		down_write_nested(array[1], 1);
+		down_write_nested(array[2], 2);
+	}
+}
+
+/**
+ * kernfs_up_write_triple_nodes() - release hashed rwsem for 3 nodes
+ *
+ * @kn1: first kernfs_node of interest
+ * @kn2: second kernfs_node of interest
+ * @kn3: third kernfs_node of interest
+ * @token: token to indicate unlocking information
+ *		->rwsems is a sorted list of rwsem addresses
+ *		->count contains number of unique locks
+ *
+ * Release hashed rwsems for 3 nodes
+ *
+ * Return: void
+ */
+void kernfs_up_write_triple_nodes(struct kernfs_node *kn1,
+				  struct kernfs_node *kn2,
+				  struct kernfs_node *kn3,
+				  struct kernfs_rwsem_token *token)
+{
+	struct rw_semaphore **array = &token->rwsems[0];
+
+	if (token->count == 1) {
+		/* All 3 nodes hash to same rwsem */
+		up_write(array[0]);
+	} else if (token->count == 2) {
+		/**
+		 * Two nodes hash to same rwsem, and these
+		 * will occupy consecutive places in array after
+		 * sorting.
+		 */
+		up_write(array[0]);
+		up_write(array[2]);
+	} else {
+		/* All 3 nodes hash to different rwsems */
+		up_write(array[0]);
+		up_write(array[1]);
+		up_write(array[2]);
+	}
+}
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 0c49cf57f80f..4e630abe4a18 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -19,6 +19,20 @@
 #include <linux/kernfs.h>
 #include <linux/fs_context.h>
 
+/**
+ * struct kernfs_rwsem_token - token for nested locking interfaces
+ *
+ * @rwsems: array of rwsems to acquire
+ * @count: has 2 uses:
+ *	   as input argument it specifies the size of the ->rwsems array;
+ *	   as return argument it specifies the number of unique rwsems
+ *	   present in the ->rwsems array
+ */
+struct kernfs_rwsem_token {
+	struct rw_semaphore *rwsems[3];
+	int count;
+};
+
 struct kernfs_iattrs {
 	kuid_t			ia_uid;
 	kgid_t			ia_gid;
@@ -46,8 +60,8 @@ struct kernfs_root {
 	struct list_head	supers;
 
 	wait_queue_head_t	deactivate_waitq;
-	struct rw_semaphore	kernfs_rwsem;
 	struct rw_semaphore     supers_rwsem;
+	struct mutex            kernfs_rm_mutex;
 };
 
 /* +1 to avoid triggering overflow warning when negating it */
@@ -165,11 +179,17 @@ void kernfs_drain_open_files(struct kernfs_node *kn);
  */
 extern const struct inode_operations kernfs_symlink_iops;
 
+/*
+ * kernfs locks
+ */
+extern struct kernfs_global_locks *kernfs_locks;
+extern rwlock_t kernfs_rename_lock;
+
 static inline struct rw_semaphore *kernfs_rwsem_ptr(struct kernfs_node *kn)
 {
-	struct kernfs_root *root = kernfs_root(kn);
+	int idx = hash_ptr(kn, NR_KERNFS_LOCK_BITS);
 
-	return &root->kernfs_rwsem;
+	return &kernfs_locks->kernfs_rwsem[idx];
 }
 
 static inline void kernfs_rwsem_assert_held(struct kernfs_node *kn)
@@ -243,4 +263,30 @@ static inline void kernfs_up_read(struct rw_semaphore *rwsem)
 	up_read(rwsem);
 }
 
+
+void kernfs_down_write_double_nodes(struct kernfs_node *kn1,
+				    struct kernfs_node *kn2,
+				    struct kernfs_rwsem_token *token);
+
+void kernfs_up_write_double_nodes(struct kernfs_node *kn1,
+				  struct kernfs_node *kn2,
+				  struct kernfs_rwsem_token *token);
+
+void kernfs_down_read_double_nodes(struct kernfs_node *kn1,
+				    struct kernfs_node *kn2,
+				    struct kernfs_rwsem_token *token);
+
+void kernfs_up_read_double_nodes(struct kernfs_node *kn1,
+				  struct kernfs_node *kn2,
+				  struct kernfs_rwsem_token *token);
+
+void kernfs_down_write_triple_nodes(struct kernfs_node *kn1,
+				    struct kernfs_node *kn2,
+				    struct kernfs_node *kn3,
+				    struct kernfs_rwsem_token *token);
+
+void kernfs_up_write_triple_nodes(struct kernfs_node *kn1,
+				  struct kernfs_node *kn2,
+				  struct kernfs_node *kn3,
+				  struct kernfs_rwsem_token *token);
 #endif	/* __KERNFS_INTERNAL_H */
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 0e872824b7db..5efa0f4ca209 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -395,8 +395,10 @@ void __init kernfs_lock_init(void)
 	kernfs_locks = kmalloc(sizeof(struct kernfs_global_locks), GFP_KERNEL);
 	WARN_ON(!kernfs_locks);
 
-	for (count = 0; count < NR_KERNFS_LOCKS; count++)
+	for (count = 0; count < NR_KERNFS_LOCKS; count++) {
 		mutex_init(&kernfs_locks->open_file_mutex[count].lock);
+		init_rwsem(&kernfs_locks->kernfs_rwsem[count]);
+	}
 }
 
 void __init kernfs_init(void)
diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
index 9d4103602554..5e404ea455bd 100644
--- a/fs/kernfs/symlink.c
+++ b/fs/kernfs/symlink.c
@@ -110,15 +110,20 @@ static int kernfs_get_target_path(struct kernfs_node *parent,
 
 static int kernfs_getlink(struct inode *inode, char *path)
 {
+	unsigned long flags;
 	struct kernfs_node *kn = inode->i_private;
 	struct kernfs_node *parent = kn->parent;
 	struct kernfs_node *target = kn->symlink.target_kn;
-	struct rw_semaphore *rwsem;
 	int error;
 
-	rwsem = kernfs_down_read(parent);
+	/*
+	 * kernfs_get_target_path() requires that none of the nodes in the
+	 * path undergo a parent change in the middle of it. Since ->parent
+	 * changes happen under kernfs_rename_lock, acquire that lock.
+	 */
+	read_lock_irqsave(&kernfs_rename_lock, flags);
 	error = kernfs_get_target_path(parent, target, path);
-	kernfs_up_read(rwsem);
+	read_unlock_irqrestore(&kernfs_rename_lock, flags);
 
 	return error;
 }
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index cc514bda0ae7..19506bdb6d2b 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -87,6 +87,7 @@ struct kernfs_open_file_mutex {
  */
 struct kernfs_global_locks {
 	struct kernfs_open_file_mutex open_file_mutex[NR_KERNFS_LOCKS];
+	struct rw_semaphore kernfs_rwsem[NR_KERNFS_LOCKS];
 };
 
 enum kernfs_node_type {
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v8 10/10] kernfs: Add a document to describe hashed locks used in kernfs.
  2022-04-10  2:37 [PATCH v8 00/10] kernfs: Remove reference counting for kernfs_open_node Imran Khan
                   ` (8 preceding siblings ...)
  2022-04-10  2:37 ` [PATCH v8 09/10] kernfs: Replace per-fs rwsem with hashed rwsems Imran Khan
@ 2022-04-10  2:37 ` Imran Khan
  2022-04-22 17:03 ` [PATCH v8 00/10] kernfs: Remove reference counting for kernfs_open_node Tejun Heo
  10 siblings, 0 replies; 25+ messages in thread
From: Imran Khan @ 2022-04-10  2:37 UTC (permalink / raw)
  To: tj, viro, gregkh, ebiederm; +Cc: linux-kernel

This document describes the usage and proof of the various hashed locks
introduced in this patch set.

Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
---
 .../filesystems/kernfs-hashed-locks.rst       | 214 ++++++++++++++++++
 1 file changed, 214 insertions(+)
 create mode 100644 Documentation/filesystems/kernfs-hashed-locks.rst

diff --git a/Documentation/filesystems/kernfs-hashed-locks.rst b/Documentation/filesystems/kernfs-hashed-locks.rst
new file mode 100644
index 000000000000..8c3542e38e04
--- /dev/null
+++ b/Documentation/filesystems/kernfs-hashed-locks.rst
@@ -0,0 +1,214 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+
+===================
+kernfs hashed locks
+===================
+
+kernfs uses the following hashed locks:
+
+1. Hashed mutexes
+2. Hashed rwsem
+
+In certain cases a hashed rwsem needs to work in conjunction with a per-fs
+mutex (described further below), so this document describes this mutex as
+well.
+
+A kernfs_global_locks object (defined below) provides hashed mutexes and
+hashed rwsems.
+
+	struct kernfs_global_locks {
+		struct kernfs_open_file_mutex open_file_mutex[NR_KERNFS_LOCKS];
+		struct rw_semaphore kernfs_rwsem[NR_KERNFS_LOCKS];
+	};
+
+The hashed mutexes are encapsulated in kernfs_open_file_mutex as shown below:
+
+struct kernfs_open_file_mutex {
+	struct mutex lock;
+} ____cacheline_aligned_in_smp;
+
+For all hashed locks, the address of a kernfs_node object acts as the
+hashing key.
+
+For the remainder of this document a node means a kernfs_node object. The
+node can refer to a file, directory or symlink of a kernfs based file system.
+Also, a node's mutex or rwsem refers to the hashed mutex or hashed rwsem
+corresponding to the node, not to any locking construct embedded in the
+kernfs_node itself.
+
+What is protected by hashed locks
+=================================
+
+(1) There's one kernfs_open_file for each open file, and all kernfs_open_file
+    instances corresponding to a kernfs_node are maintained in a list.
+    The hashed mutex, i.e. kernfs_global_locks.open_file_mutex[index].lock,
+    protects this list.
+
+(2) The hashed rwsem, i.e. kernfs_global_locks.kernfs_rwsem[index], protects
+    a node's state and synchronizes operations that change the state of a
+    node or depend on it.
+
+(3) The per-fs mutex (mentioned earlier) provides synchronization between
+    lookup and remove operations.
+    While looking for a node we do not yet have the address of the
+    corresponding node, so we can't acquire the node's rwsem right from the
+    beginning. On the other hand, a parallel remove operation for the same
+    node can acquire the corresponding rwsem and go ahead with removal. So
+    it may happen that a search operation finds a node and returns it, but
+    before the node can be pinned or used, the parallel remove operation
+    removes it and hence makes any future use of it invalid.
+    The per-fs mutex ensures that of competing search and remove operations
+    only one proceeds at a time, and since the object returned by the search
+    is pinned before the per-fs mutex is released, it remains available for
+    subsequent use.
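+
+    A simplified interleaving of the race that this mutex prevents
+    (illustrative only)::
+
+	searcher				remover
+	--------				-------
+	kn = kernfs_find_ns(parent, ...)
+	release parent's rwsem
+						acquire kn's rwsem
+						__kernfs_remove(kn)
+						kernfs_put(kn) /* kn freed */
+	kernfs_get(kn) /* too late */
+
+    With the per-fs mutex held across the search and the kernfs_get(),
+    the remove operation cannot run inside this window, so the returned
+    node remains valid.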
+
+
+Lock usage and proof
+=======================
+
+(1) Hashed mutexes
+
+    Since hashed mutexes protect the list of kernfs_open_file instances
+    corresponding to a kernfs_node, the ->open and ->release backends of
+    file_operations need to acquire the hashed mutex corresponding to the
+    kernfs_node. Also, when a kernfs_node is removed, all of its
+    kernfs_open_file instances are drained after deactivating the node.
+    This drain operation acquires the hashed mutex to traverse the list of
+    kernfs_open_file instances. So addition (via ->open), deletion (via
+    ->release) and traversal (during kernfs_drain) of the kernfs_open_file
+    list occur in a synchronized manner, as sketched below.
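+
+    A sketch of this protocol (illustrative pseudo-code only; the index
+    computation mirrors the hashed rwsem case)::
+
+	int idx = hash_ptr(kn, NR_KERNFS_LOCK_BITS);
+	struct mutex *lock = &kernfs_locks->open_file_mutex[idx].lock;
+
+	mutex_lock(lock);
+	/* add (->open), delete (->release) or walk (kernfs_drain) the
+	   list of kn's kernfs_open_file instances */
+	mutex_unlock(lock);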
+
+(2) Hashed rwsems
+
+	3.1. A node's rwsem protects its state and needs to be acquired to:
+		3.1.a. Remove the node
+		3.1.b. Move the node
+		3.1.c. Traverse or modify a node's children RB tree (for
+		       directories), i.e. to add/remove files/subdirectories
+		       within/from a directory.
+		3.1.d. Modify or access a node's inode attributes
+
+	3.2. Hashed rwsems are used in the following operations:
+
+		3.2.a. Addition of a new node
+
+		While adding a new kernfs_node under a kernfs directory,
+		kernfs_add_one acquires the directory node's rwsem for
+		writing. Clause 3.1.a ensures that the directory exists
+		throughout the operation. Clause 3.1.c ensures a proper
+		update of the children RB tree (i.e. ->dir.children).
+		Clause 3.1.d ensures correct modification of the inode
+		attributes to reflect the timestamp of this operation.
+		If the directory gets removed while waiting for the
+		semaphore, the subsequent checks in kernfs_add_one will
+		fail, resulting in an early bail out from kernfs_add_one.
+
+		3.2.b. Removal of a node
+
+		Removal of a node involves recursive removal of all of its
+		descendants as well. The per-fs mutex (i.e. kernfs_rm_mutex)
+		avoids concurrent node removals even if the nodes are
+		different.
+
+		At first the node's rwsem is acquired. Clause 3.1.c avoids
+		parallel modification of the descendant tree, and while
+		holding this rwsem each of the descendants is deactivated.
+
+		Once a descendant has been deactivated and drained, its
+		parent's rwsem is taken. Clause 3.1.c ensures proper unlinking
+		of this descendant from its siblings. Clause 3.1.d ensures
+		that the parent's inode attributes are correctly updated to
+		record the timestamp of removal.
+
+		3.2.c. Movement of a node
+
+		Moving or renaming a node (kernfs_rename_ns) acquires the
+		rwsems of the node and of its old and new parents. Clauses
+		3.1.b and 3.1.c avoid concurrent move operations for the same
+		node. Also, if the old parent of a node changes while waiting
+		for the rwsems, the acquisition of the rwsems for the 3
+		involved nodes is attempted again (see the sketch below). It
+		is always ensured that the rwsem acquired as the old parent's
+		is the one corresponding to the node's current parent.
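+
+		A simplified sketch of this retry loop (mirroring the
+		kernfs_rename_ns code in this series)::
+
+			old_parent = kn->parent;
+			kernfs_get(old_parent);
+			kernfs_down_write_triple_nodes(kn, old_parent,
+						       new_parent, &token);
+			while (old_parent != kn->parent) {
+				/* kn moved; retry with its current parent */
+				kernfs_put(old_parent);
+				kernfs_up_write_triple_nodes(kn, old_parent,
+							     new_parent, &token);
+				old_parent = kn->parent;
+				kernfs_get(old_parent);
+				kernfs_down_write_triple_nodes(kn, old_parent,
+							       new_parent, &token);
+			}
+			kernfs_put(old_parent);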
+
+		3.2.d. Reading a directory
+
+		For directory reading, kernfs_fop_readdir acquires the
+		directory node's rwsem for reading. Clause 3.1.c ensures a
+		consistent view of the children RB tree.
+		As far as the directory being read is concerned, if it gets
+		removed while waiting for the semaphore, the for loop that
+		iterates through the children will be ineffective. So for this
+		operation acquiring the directory node's rwsem for reading is
+		enough.
+
+		3.2.e. Dentry revalidation
+
+		A dentry revalidation (kernfs_dop_revalidate) can happen for a
+		negative or for a normal dentry.
+		For negative dentries we just need to check for a parent
+		change, so in this case acquiring the parent kernfs_node's
+		rwsem for reading is enough.
+		For a normal dentry, acquiring the node's rwsem for reading is
+		enough (Clauses 3.1.a and 3.1.b).
+		If the node gets removed while waiting for the lock, the
+		subsequent checks in kernfs_dop_revalidate will fail and
+		kernfs_dop_revalidate will exit early.
+
+		3.2.f. kernfs_node lookup
+
+		While searching for a node under a given parent
+		(kernfs_find_and_get_ns, kernfs_walk_and_get_ns) the parent
+		node's rwsem is acquired for reading. Clause 3.1.c ensures a
+		consistent view of the parent's children RB tree. To avoid
+		parallel removal of the found node before it gets pinned,
+		these operations make use of the per-fs mutex (kernfs_rm_mutex)
+		as explained earlier. This per-fs mutex is also taken during
+		kernfs_node removal (__kernfs_remove).
+
+		If the node being searched for gets removed while waiting for
+		the mutex or rwsem, the subsequent kernfs_find_ns or
+		kernfs_walk_ns will fail.
+
+		3.2.g. kernfs_node's inode lookup
+
+		Looking up inode instances via kernfs_iop_lookup involves a
+		node lookup, so the locks acquired are the same as the ones
+		required in 3.2.f. Also, once the node lookup is complete,
+		the parent's rwsem is released and the rwsem of the found
+		node is acquired to get the corresponding inode. Since we
+		are operating under the per-fs kernfs_rm_mutex, the found
+		node will not disappear in the middle.
+
+		3.2.h. Updating or reading inode attributes
+
+		Interfaces that change inode attributes
+		(i.e. kernfs_setattr and kernfs_iop_setattr) acquire the
+		node's rwsem for writing.
+		If the kernfs_node gets removed while waiting for the
+		semaphore, the subsequent __kernfs_setattr will fail.
+		From 3.2.a and 3.2.b we know that updates due to
+		addition or removal of nodes will not happen in
+		parallel. So just locking the kernfs_node in these cases
+		is enough to guarantee correct modification of inode
+		attributes.
+		Similarly, the interfaces that read inode attributes
+		(i.e. kernfs_iop_getattr and kernfs_iop_permission) just
+		need to acquire the involved node's rwsem for reading.
+
+		3.2.i. kernfs file event generation
+
+		kernfs_notify pins the involved node before scheduling
+		kernfs_notify_work, and kernfs_notify_workfn acquires
+		the node's rwsem. The clauses in 3.1 ensure a consistent
+		view of the node state throughout the execution of the
+		work handler.
+
+		3.2.j. Mount
+
+		kernfs_fill_super, invoked during a mount operation,
+		acquires the root node's rwsem. During the mount process
+		there can't be other execution contexts trying to move
+		or delete the node, so just locking the involved node
+		(i.e. the root node) is enough.
+
+		3.2.k. Activating a node
+
+		For a node that started as deactivated, kernfs_activate
+		activates the node. In this case acquiring the node's
+		rwsem is enough. Since the node is not active yet, any
+		parallel removal that wins the race for the rwsem will
+		skip this node and its descendants. Also, user space
+		can't see a deactivated node, so we don't have any
+		parallel access originating from there either.
+
+	3.3  For operations that involve locking multiple nodes at the
+	     same time, locks are acquired in order of their addresses,
+	     as sketched below.
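+
+	     A minimal sketch of this ordering rule (illustrative only:
+	     kernfs_down_write_pair is a hypothetical name, and the
+	     hashed rwsem is shown as if it were embedded in the node):
+
+	     static void kernfs_down_write_pair(struct kernfs_node *a,
+						struct kernfs_node *b)
+	     {
+		     if (a > b)
+			     swap(a, b);	/* lower address first */
+		     down_write(&a->rwsem);
+		     if (a != b)
+			     down_write_nested(&b->rwsem,
+					       SINGLE_DEPTH_NESTING);
+	     }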
-- 
2.30.2


* Re: [PATCH v8 01/10] kernfs: Remove reference counting for kernfs_open_node.
  2022-04-10  2:37 ` [PATCH v8 01/10] " Imran Khan
@ 2022-04-22 16:03   ` Tejun Heo
  2022-04-26  1:43     ` Imran Khan
  0 siblings, 1 reply; 25+ messages in thread
From: Tejun Heo @ 2022-04-22 16:03 UTC (permalink / raw)
  To: Imran Khan; +Cc: viro, gregkh, ebiederm, linux-kernel

On Sun, Apr 10, 2022 at 12:37:10PM +1000, Imran Khan wrote:
> @@ -768,15 +765,15 @@ void kernfs_drain_open_files(struct kernfs_node *kn)
>  	if (!(kn->flags & (KERNFS_HAS_MMAP | KERNFS_HAS_RELEASE)))
>  		return;
>  
> -	spin_lock_irq(&kernfs_open_node_lock);
>  	on = kn->attr.open;
> -	if (on)
> -		atomic_inc(&on->refcnt);
> -	spin_unlock_irq(&kernfs_open_node_lock);
>  	if (!on)
>  		return;
>
>  	mutex_lock(&kernfs_open_file_mutex);
> +	if (!kn->attr.open) {
> +		mutex_unlock(&kernfs_open_file_mutex);
> +		return;
> +	}

What if @on got freed and new one got allocated between the lockless check
and the locked check? Is there a reason to keep the lockless check at all?

>  	list_for_each_entry(of, &on->files, list) {
>  		struct inode *inode = file_inode(of->file);

Thanks.

-- 
tejun

* Re: [PATCH v8 02/10] kernfs: make ->attr.open RCU protected.
  2022-04-10  2:37 ` [PATCH v8 02/10] kernfs: make ->attr.open RCU protected Imran Khan
@ 2022-04-22 16:19   ` Tejun Heo
  2022-04-26  1:54     ` Imran Khan
  0 siblings, 1 reply; 25+ messages in thread
From: Tejun Heo @ 2022-04-22 16:19 UTC (permalink / raw)
  To: Imran Khan; +Cc: viro, gregkh, ebiederm, linux-kernel

On Sun, Apr 10, 2022 at 12:37:11PM +1000, Imran Khan wrote:
>  static int kernfs_seq_show(struct seq_file *sf, void *v)
>  {
>  	struct kernfs_open_file *of = sf->private;
> +	struct kernfs_open_node *on = rcu_dereference_raw(of->kn->attr.open);

I suppose raw deref is safe because @on can't go away while @of is alive,
right? If that's the case, please factor out of -> on dereferencing into a
helper and put a comment there explaining why the raw deref is safe.

...
> @@ -201,7 +203,8 @@ static ssize_t kernfs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
>  		goto out_free;
>  	}
>  
> -	of->event = atomic_read(&of->kn->attr.open->event);
> +	on = rcu_dereference_raw(of->kn->attr.open);

cuz we don't wanna sprinkle raw derefs in multiple places without
explanation like this.

...
>  /**
> @@ -566,24 +567,30 @@ static int kernfs_get_open_node(struct kernfs_node *kn,
>  static void kernfs_put_open_node(struct kernfs_node *kn,
>  				 struct kernfs_open_file *of)
>  {
> -	struct kernfs_open_node *on = kn->attr.open;
> -	unsigned long flags;
> +	struct kernfs_open_node *on;
> +
> +	/* ->attr.open NULL means there are no more open files */
> +	if (rcu_dereference_raw(kn->attr.open) == NULL)
> +		return;

For pointer value check, what you want is rcu_access_pointer(). That said,
tho, why is this being called if no one is linked on it? Before removing the
refcnt, that'd be the same as trying to put a 0 ref. How does that happen?
Also, can you please rename it to unlink or something of the sort? It's
confusing to call it put when there's no refcnt.

>  
>  	mutex_lock(&kernfs_open_file_mutex);
> -	spin_lock_irqsave(&kernfs_open_node_lock, flags);
> +
> +	on = rcu_dereference_protected(kn->attr.open,
> +				       lockdep_is_held(&kernfs_open_file_mutex));

Again, a better way to do it would be defining a kn -> on accessor which
encodes the safe way to deref and use it. The deref rule is tied to the
deref itself not the callsite.
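
Something along these lines, perhaps (sketch only, name it however
you like):

	static struct kernfs_open_node *
	kernfs_deref_on_protected(struct kernfs_node *kn)
	{
		return rcu_dereference_protected(kn->attr.open,
				lockdep_is_held(&kernfs_open_file_mutex));
	}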

> +
> +	if (!on) {
> +		mutex_unlock(&kernfs_open_file_mutex);
> +		return;
> +	}
>  
>  	if (of)
>  		list_del(&of->list);
>  
> -	if (list_empty(&on->files))
> -		kn->attr.open = NULL;
> -	else
> -		on = NULL;
> -
> -	spin_unlock_irqrestore(&kernfs_open_node_lock, flags);
> +	if (list_empty(&on->files)) {
> +		rcu_assign_pointer(kn->attr.open, NULL);
> +		kfree_rcu(on, rcu_head);
> +	}
>  	mutex_unlock(&kernfs_open_file_mutex);
> -
> -	kfree(on);
>  }
>  
>  static int kernfs_fop_open(struct inode *inode, struct file *file)
> @@ -765,12 +772,13 @@ void kernfs_drain_open_files(struct kernfs_node *kn)
>  	if (!(kn->flags & (KERNFS_HAS_MMAP | KERNFS_HAS_RELEASE)))
>  		return;
>  
> -	on = kn->attr.open;
> -	if (!on)
> +	if (rcu_dereference_raw(kn->attr.open) == NULL)
>  		return;

rcu_access_pointer again and you gotta explain why the lockless check is
safe.

>  	mutex_lock(&kernfs_open_file_mutex);
> -	if (!kn->attr.open) {
> +	on = rcu_dereference_check(kn->attr.open,
> +				   lockdep_is_held(&kernfs_open_file_mutex));
> +	if (!on) {
>  		mutex_unlock(&kernfs_open_file_mutex);
>  		return;
>  	}
...
> @@ -912,14 +920,13 @@ void kernfs_notify(struct kernfs_node *kn)
>  		return;
>  
>  	/* kick poll immediately */
> -	spin_lock_irqsave(&kernfs_open_node_lock, flags);
> -	on = kn->attr.open;
> +	rcu_read_lock();
> +	on = rcu_dereference(kn->attr.open);
>  	if (on) {
>  		atomic_inc(&on->event);
>  		wake_up_interruptible(&on->poll);
>  	}
> -	spin_unlock_irqrestore(&kernfs_open_node_lock, flags);
> -
> +	rcu_read_unlock();

An explanation of why this is safe in terms of event ordering would be great
here.

Thanks.

-- 
tejun

* Re: [PATCH v8 03/10] kernfs: Change kernfs_notify_list to llist.
  2022-04-10  2:37 ` [PATCH v8 03/10] kernfs: Change kernfs_notify_list to llist Imran Khan
@ 2022-04-22 16:41   ` Tejun Heo
  0 siblings, 0 replies; 25+ messages in thread
From: Tejun Heo @ 2022-04-22 16:41 UTC (permalink / raw)
  To: Imran Khan; +Cc: viro, gregkh, ebiederm, linux-kernel

Hello,

On Sun, Apr 10, 2022 at 12:37:12PM +1000, Imran Khan wrote:
> @@ -846,18 +845,25 @@ static void kernfs_notify_workfn(struct work_struct *work)
>  	struct kernfs_node *kn;
>  	struct kernfs_super_info *info;
>  	struct kernfs_root *root;
> +	struct llist_node *free;
> +	struct kernfs_elem_attr *attr;
>  repeat:
> -	/* pop one off the notify_list */
> +	/**
> +	 * pop one off the notify_list.
> +	 * There can be multiple concurrent work items.
> +	 * Use kernfs_notify_lock to synchronize between multiple consumers.
> +	 */

This is running off of a single work item, so there can only be one instance
of this running at any given time.

>  	spin_lock_irq(&kernfs_notify_lock);
> -	kn = kernfs_notify_list;
> -	if (kn == KERNFS_NOTIFY_EOL) {
> +	if (llist_empty(&kernfs_notify_list)) {
>  		spin_unlock_irq(&kernfs_notify_lock);
>  		return;
>  	}
> -	kernfs_notify_list = kn->attr.notify_next;
> -	kn->attr.notify_next = NULL;
> +
> +	free = llist_del_first(&kernfs_notify_list);

Why not just test whether the returned pointer is NULL here instead of doing
a separate empty check above?
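
IOW, something like the following (untested sketch):

	spin_lock_irq(&kernfs_notify_lock);
	free = llist_del_first(&kernfs_notify_list);
	spin_unlock_irq(&kernfs_notify_lock);
	if (!free)
		return;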

Thanks.

-- 
tejun

* Re: [PATCH v8 05/10] kernfs: Replace global kernfs_open_file_mutex with hashed mutexes.
  2022-04-10  2:37 ` [PATCH v8 05/10] kernfs: Replace global kernfs_open_file_mutex with hashed mutexes Imran Khan
@ 2022-04-22 17:00   ` Tejun Heo
  0 siblings, 0 replies; 25+ messages in thread
From: Tejun Heo @ 2022-04-22 17:00 UTC (permalink / raw)
  To: Imran Khan; +Cc: viro, gregkh, ebiederm, linux-kernel

On Sun, Apr 10, 2022 at 12:37:14PM +1000, Imran Khan wrote:
> diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
> index 214b48d59148..0946ab341ce4 100644
> --- a/fs/kernfs/file.c
> +++ b/fs/kernfs/file.c
> @@ -19,18 +19,14 @@
>  #include "kernfs-internal.h"
>  
>  /*
> - * There's one kernfs_open_file for each open file and one kernfs_open_node
> - * for each kernfs_node with one or more open files.
> - *
> + * kernfs locks
> + */
> +extern struct kernfs_global_locks *kernfs_locks;

This should be in a header file, right?

> @@ -545,7 +543,7 @@ static int kernfs_get_open_node(struct kernfs_node *kn,
>  	 * need rcu_read_lock to ensure its existence.
>  	 */
>  	on = rcu_dereference_protected(kn->attr.open,
> -				   lockdep_is_held(&kernfs_open_file_mutex));
> +				   lockdep_is_held(mutex));
>  	if (on) {
>  		list_add_tail(&of->list, &on->files);
>  		mutex_unlock(mutex);
> @@ -593,10 +591,10 @@ static void kernfs_put_open_node(struct kernfs_node *kn,
>  	mutex = kernfs_open_file_mutex_lock(kn);
>  
>  	on = rcu_dereference_protected(kn->attr.open,
> -				       lockdep_is_held(&kernfs_open_file_mutex));
> +				       lockdep_is_held(mutex));
>  
>  	if (!on) {
> -		mutex_unlock(&kernfs_open_file_mutex);
> +		mutex_unlock(mutex);
>  		return;
>  	}

Don't the above chunks belong to the previous patch?

> @@ -769,6 +767,10 @@ static int kernfs_fop_release(struct inode *inode, struct file *filp)
>  	struct mutex *mutex = NULL;
>  
>  	if (kn->flags & KERNFS_HAS_RELEASE) {
> +		/**
> +		 * Callers of kernfs_fop_release don't need global
> +		 * exclusion, so using a hashed mutex here is safe.
> +		 */

I don't think we use /** for in-line comments. Just do balanced wings.

 /*
  * ....
  */

besides, I'm not sure the comment is helping that much. It'd be better to
have a comment describing the locking rules comprehensively around the lock
definition and/or helpers.

>  		mutex = kernfs_open_file_mutex_lock(kn);
>  		kernfs_release_file(kn, of);
>  		mutex_unlock(mutex);
> @@ -796,9 +798,9 @@ void kernfs_drain_open_files(struct kernfs_node *kn)
>  
>  	mutex = kernfs_open_file_mutex_lock(kn);
>  	on = rcu_dereference_check(kn->attr.open,
> -				   lockdep_is_held(&kernfs_open_file_mutex));
> +				   lockdep_is_held(mutex));
>  	if (!on) {
> -		mutex_unlock(&kernfs_open_file_mutex);
> +		mutex_unlock(mutex);

Again, should be in the previous patch.

> +void __init kernfs_lock_init(void)
> +{
> +	int count;
> +
> +	kernfs_locks = kmalloc(sizeof(struct kernfs_global_locks), GFP_KERNEL);
> +	WARN_ON(!kernfs_locks);
> +
> +	for (count = 0; count < NR_KERNFS_LOCKS; count++)
> +		mutex_init(&kernfs_locks->open_file_mutex[count].lock);
> +}

Does this need to be a separate function?

>  void __init kernfs_init(void)
>  {
>  	kernfs_node_cache = kmem_cache_create("kernfs_node_cache",
> @@ -397,4 +409,6 @@ void __init kernfs_init(void)
>  	kernfs_iattrs_cache  = kmem_cache_create("kernfs_iattrs_cache",
>  					      sizeof(struct kernfs_iattrs),
>  					      0, SLAB_PANIC, NULL);
> +
> +	kernfs_lock_init();
>  }
> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> index 2dd9c8df0f4f..cc514bda0ae7 100644
> --- a/include/linux/kernfs.h
> +++ b/include/linux/kernfs.h
...
> + * filp->private_data points to seq_file whose ->private points to
> + * kernfs_open_file.
> + * kernfs_open_files are chained at kernfs_open_node->files, which is
> + * protected by kernfs_open_file_mutex.lock.

Can you either flow the sentences together or have a blank line in between
if they're supposed to be in separate paragraphs?

> + */
> +struct kernfs_open_file_mutex {
> +	struct mutex lock;
> +} ____cacheline_aligned_in_smp;

Is there a reason to define the outer struct?

> +/*
> + * To reduce possible contention in sysfs access, arising due to single
> + * locks, use an array of locks and use kernfs_node object address as
> + * hash keys to get the index of these locks.
> + */
> +struct kernfs_global_locks {
> +	struct kernfs_open_file_mutex open_file_mutex[NR_KERNFS_LOCKS];
> +};

Ditto, why not just define the array directly?
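
i.e. something like the following (untested sketch), instead of the
kmalloc'd wrapper struct:

	static struct kernfs_open_file_mutex
		kernfs_open_file_mutexes[NR_KERNFS_LOCKS];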

> +void kernfs_lock_init(void);

Does this function need to be exposed?

Thanks.

-- 
tejun

* Re: [PATCH v8 00/10] kernfs: Remove reference counting for kernfs_open_node.
  2022-04-10  2:37 [PATCH v8 00/10] kernfs: Remove reference counting for kernfs_open_node Imran Khan
                   ` (9 preceding siblings ...)
  2022-04-10  2:37 ` [PATCH v8 10/10] kernfs: Add a document to describe hashed locks used in kernfs Imran Khan
@ 2022-04-22 17:03 ` Tejun Heo
  2022-04-23  8:49   ` Imran Khan
  10 siblings, 1 reply; 25+ messages in thread
From: Tejun Heo @ 2022-04-22 17:03 UTC (permalink / raw)
  To: Imran Khan; +Cc: viro, gregkh, ebiederm, linux-kernel

Hello, Imran.

I took a look at patch 5 and it looks like there's still some work left to
do. Maybe it'd be easier to concentrate on the first two parts - the notify
list conversion and open_file_mutex conversion first and then get to the
rest later?

Thanks.

-- 
tejun

* Re: [PATCH v8 00/10] kernfs: Remove reference counting for kernfs_open_node.
  2022-04-22 17:03 ` [PATCH v8 00/10] kernfs: Remove reference counting for kernfs_open_node Tejun Heo
@ 2022-04-23  8:49   ` Imran Khan
  2022-04-25 17:21     ` Tejun Heo
  0 siblings, 1 reply; 25+ messages in thread
From: Imran Khan @ 2022-04-23  8:49 UTC (permalink / raw)
  To: Tejun Heo; +Cc: viro, gregkh, ebiederm, linux-kernel

Hello Tejun,


On 23/4/22 3:03 am, Tejun Heo wrote:
> Hello, Imran.
> 
> I took a look at patch 5 and it looks like there's still some work left to
> do. Maybe it'd be easier to concentrate on the first two parts - the notify
> list conversion and open_file_mutex conversion first and then get to the
> rest later?
> 

Thanks again for reviewing this and I agree with your suggestion. So far
most of the concerns have been around the usage of kernfs_rwsem and those
can be addressed independently of the first 5 (even first 7) changes here.
Just one question in this regard: should I send the new patch set
(addressing open_file_mutex and list conversion) as a separate patch set
or should I send it as v9 of the ongoing change set? I guess the first
option is better but thought of confirming before proceeding.

Thanks
-- Imran
> Thanks.
> 

* Re: [PATCH v8 00/10] kernfs: Remove reference counting for kernfs_open_node.
  2022-04-23  8:49   ` Imran Khan
@ 2022-04-25 17:21     ` Tejun Heo
  2022-04-28 17:28       ` Imran Khan
  0 siblings, 1 reply; 25+ messages in thread
From: Tejun Heo @ 2022-04-25 17:21 UTC (permalink / raw)
  To: Imran Khan; +Cc: viro, gregkh, ebiederm, linux-kernel

Hello,

On Sat, Apr 23, 2022 at 06:49:49PM +1000, Imran Khan wrote:
> Thanks again for reviewing this and I agree with your suggestion. So far
> most of the concerns have been around the usage of kernfs_rwsem and those
> can be addressed independently of the first 5 (even first 7) changes here.
> Just one question in this regard: should I send the new patch set
> (addressing open_file_mutex and list conversion) as a separate patch set
> or should I send it as v9 of the ongoing change set? I guess the first
> option is better but thought of confirming before proceeding.

Either way is okay but I like the first one too. Let's break it up into
smaller pieces so that we can make progress piece by piece.

Thanks.

-- 
tejun

* Re: [PATCH v8 01/10] kernfs: Remove reference counting for kernfs_open_node.
  2022-04-22 16:03   ` Tejun Heo
@ 2022-04-26  1:43     ` Imran Khan
  2022-04-26 18:29       ` Tejun Heo
  0 siblings, 1 reply; 25+ messages in thread
From: Imran Khan @ 2022-04-26  1:43 UTC (permalink / raw)
  To: Tejun Heo; +Cc: viro, gregkh, ebiederm, linux-kernel

Hello Tejun,

On 23/4/22 2:03 am, Tejun Heo wrote:
> On Sun, Apr 10, 2022 at 12:37:10PM +1000, Imran Khan wrote:
>> @@ -768,15 +765,15 @@ void kernfs_drain_open_files(struct kernfs_node *kn)
>>  	if (!(kn->flags & (KERNFS_HAS_MMAP | KERNFS_HAS_RELEASE)))
>>  		return;
>>  
>> -	spin_lock_irq(&kernfs_open_node_lock);
>>  	on = kn->attr.open;
>> -	if (on)
>> -		atomic_inc(&on->refcnt);
>> -	spin_unlock_irq(&kernfs_open_node_lock);
>>  	if (!on)
>>  		return;
>>
>>  	mutex_lock(&kernfs_open_file_mutex);
>> +	if (!kn->attr.open) {
>> +		mutex_unlock(&kernfs_open_file_mutex);
>> +		return;
>> +	}
> 
> What if @on got freed and new one got allocated between the lockless check
> and the locked check? Is there a reason to keep the lockless check at all?
> 

The only reason for the lockless check is to opportunistically check and
return if ->attr.open is already NULL, without waiting to acquire the
mutex. This is because no one will be adding to ->attr.open at this
point in time.
But we can live with just the locked check as well.
Please let me know if you think the lockless check is overkill in this
case.

Thanks
 -- Imran

* Re: [PATCH v8 02/10] kernfs: make ->attr.open RCU protected.
  2022-04-22 16:19   ` Tejun Heo
@ 2022-04-26  1:54     ` Imran Khan
  2022-04-26 18:37       ` Tejun Heo
  0 siblings, 1 reply; 25+ messages in thread
From: Imran Khan @ 2022-04-26  1:54 UTC (permalink / raw)
  To: Tejun Heo; +Cc: viro, gregkh, ebiederm, linux-kernel

Hello Tejun,

On 23/4/22 2:19 am, Tejun Heo wrote:
> On Sun, Apr 10, 2022 at 12:37:11PM +1000, Imran Khan wrote:
>>  static int kernfs_seq_show(struct seq_file *sf, void *v)
>>  {
>>  	struct kernfs_open_file *of = sf->private;
>> +	struct kernfs_open_node *on = rcu_dereference_raw(of->kn->attr.open);
> 
> I suppose raw deref is safe because @on can't go away while @of is alive,
> right? 

Yes.

> If that's the case, please factor out of -> on dereferencing into a
> helper and put a comment there explaining why the raw deref is safe.
> 

Sure, will put the dereferencing in a separate helper in the next version.


>> @@ -201,7 +203,8 @@ static ssize_t kernfs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
>>  		goto out_free;
>>  	}
>>  
>> -	of->event = atomic_read(&of->kn->attr.open->event);
>> +	on = rcu_dereference_raw(of->kn->attr.open);
> 
> cuz we don't wanna sprinkle raw derefs in multiple places without
> explanation like this.
> 
Agree.

> ...
>>  /**
>> @@ -566,24 +567,30 @@ static int kernfs_get_open_node(struct kernfs_node *kn,
>>  static void kernfs_put_open_node(struct kernfs_node *kn,
>>  				 struct kernfs_open_file *of)
>>  {
>> -	struct kernfs_open_node *on = kn->attr.open;
>> -	unsigned long flags;
>> +	struct kernfs_open_node *on;
>> +
>> +	/* ->attr.open NULL means there are no more open files */
>> +	if (rcu_dereference_raw(kn->attr.open) == NULL)
>> +		return;
> 
> For pointer value check, what you want is rcu_access_pointer(). That said,
> tho, why is this being called if no one is linked on it? Before removing the
> refcnt, that'd be the same as trying to put a 0 ref. How does that happen?

Yeah, this check surely should not be needed. I will remove it in the
next version.

> Also, can you please rename it to unlink or something of the sort? It's
> confusing to call it put when there's no refcnt.
> 

Sure, I will rename _put_ to _unlink_.

>>  
>>  	mutex_lock(&kernfs_open_file_mutex);
>> -	spin_lock_irqsave(&kernfs_open_node_lock, flags);
>> +
>> +	on = rcu_dereference_protected(kn->attr.open,
>> +				       lockdep_is_held(&kernfs_open_file_mutex));
> 
> Again, a better way to do it would be defining a kn -> on accessor which
> encodes the safe way to deref and use it. The deref rule is tied to the
> deref itself not the callsite.
>

Okay I will factor this out in a separate helper.

[...]

>>  static int kernfs_fop_open(struct inode *inode, struct file *file)
>> @@ -765,12 +772,13 @@ void kernfs_drain_open_files(struct kernfs_node *kn)
>>  	if (!(kn->flags & (KERNFS_HAS_MMAP | KERNFS_HAS_RELEASE)))
>>  		return;
>>  
>> -	on = kn->attr.open;
>> -	if (!on)
>> +	if (rcu_dereference_raw(kn->attr.open) == NULL)
>>  		return;
> 
> rcu_access_pointer again and you gotta explain why the lockless check is
> safe.
>

The lockless check is safe because no one will be adding to ->attr.open
at this point in time. This allows an early bail-out if ->attr.open is
already NULL. And if kernfs_put_open_node makes ->attr.open NULL, it
does this under open_file_mutex, so the subsequent check under
open_file_mutex will make sure to bail out if kernfs_put_open_node won
the race.
I will put an explanatory comment in the code to this effect.

[...]

>> @@ -912,14 +920,13 @@ void kernfs_notify(struct kernfs_node *kn)
>>  		return;
>>  
>>  	/* kick poll immediately */
>> -	spin_lock_irqsave(&kernfs_open_node_lock, flags);
>> -	on = kn->attr.open;
>> +	rcu_read_lock();
>> +	on = rcu_dereference(kn->attr.open);
>>  	if (on) {
>>  		atomic_inc(&on->event);
>>  		wake_up_interruptible(&on->poll);
>>  	}
>> -	spin_unlock_irqrestore(&kernfs_open_node_lock, flags);
>> -
>> +	rcu_read_unlock();
> 
> An explanation of why this is safe in terms of event ordering would be great
> here.
> 

This is safe because we don't need to refcount @on in this case. If the
writer (kernfs_put_open_node) has already made ->attr.open NULL, we will
bail out. If kernfs_notify got an old ->attr.open, we can still safely
process the event, even if kernfs_put_open_node updates ->attr.open to
NULL in parallel.
In both cases the behaviour/ordering will be the same as with the
earlier code that used kernfs_open_node_lock.
Please let me know if this answers your query or if something is still
missing.


Thanks
 -- Imran

* Re: [PATCH v8 01/10] kernfs: Remove reference counting for kernfs_open_node.
  2022-04-26  1:43     ` Imran Khan
@ 2022-04-26 18:29       ` Tejun Heo
  2022-04-26 20:13         ` Al Viro
  0 siblings, 1 reply; 25+ messages in thread
From: Tejun Heo @ 2022-04-26 18:29 UTC (permalink / raw)
  To: Imran Khan; +Cc: viro, gregkh, ebiederm, linux-kernel

On Tue, Apr 26, 2022 at 11:43:38AM +1000, Imran Khan wrote:
> Hello Tejun,
> 
> On 23/4/22 2:03 am, Tejun Heo wrote:
> > On Sun, Apr 10, 2022 at 12:37:10PM +1000, Imran Khan wrote:
> >> @@ -768,15 +765,15 @@ void kernfs_drain_open_files(struct kernfs_node *kn)
> >>  	if (!(kn->flags & (KERNFS_HAS_MMAP | KERNFS_HAS_RELEASE)))
> >>  		return;
> >>  
> >> -	spin_lock_irq(&kernfs_open_node_lock);
> >>  	on = kn->attr.open;
> >> -	if (on)
> >> -		atomic_inc(&on->refcnt);
> >> -	spin_unlock_irq(&kernfs_open_node_lock);
> >>  	if (!on)
> >>  		return;
> >>
> >>  	mutex_lock(&kernfs_open_file_mutex);
> >> +	if (!kn->attr.open) {
> >> +		mutex_unlock(&kernfs_open_file_mutex);
> >> +		return;
> >> +	}
> > 
> > What if @on got freed and new one got allocated between the lockless check
> > and the locked check? Is there a reason to keep the lockless check at all?
> 
> The only reason for the lockless check is to opportunistically check and
> return if ->attr.open is already NULL, without waiting to acquire the
> mutex. This is because no one will be adding to ->attr.open at this
> point in time.
> But we can live with just the locked check as well.
> Please let me know if you think the lockless check is overkill in this
> case.

The code is just wrong. You can end up:

        on = kn->attr.open;
        if (!on)
                return;

        // we get preempted here and someone else puts @on and then
        // recreates it

        mutex_lock();
        if (!kn->attr.open) {
                mutex_unlock();
                return;
        }

        // we're here but @on is a pointer to an already freed memory

Thanks.

-- 
tejun

* Re: [PATCH v8 02/10] kernfs: make ->attr.open RCU protected.
  2022-04-26  1:54     ` Imran Khan
@ 2022-04-26 18:37       ` Tejun Heo
  0 siblings, 0 replies; 25+ messages in thread
From: Tejun Heo @ 2022-04-26 18:37 UTC (permalink / raw)
  To: Imran Khan; +Cc: viro, gregkh, ebiederm, linux-kernel

On Tue, Apr 26, 2022 at 11:54:36AM +1000, Imran Khan wrote:
> >> @@ -912,14 +920,13 @@ void kernfs_notify(struct kernfs_node *kn)
> >>  		return;
> >>  
> >>  	/* kick poll immediately */
> >> -	spin_lock_irqsave(&kernfs_open_node_lock, flags);
> >> -	on = kn->attr.open;
> >> +	rcu_read_lock();
> >> +	on = rcu_dereference(kn->attr.open);
> >>  	if (on) {
> >>  		atomic_inc(&on->event);
> >>  		wake_up_interruptible(&on->poll);
> >>  	}
> >> -	spin_unlock_irqrestore(&kernfs_open_node_lock, flags);
> >> -
> >> +	rcu_read_unlock();
> > 
> > An explanation of why this is safe in terms of event ordering would be great
> > here.
> > 
> 
> This is safe because we don't need to refcount @on in this case. If the
> writer (kernfs_put_open_node) has already made ->attr.open NULL, we will
> bail out. If kernfs_notify got an old ->attr.open, we can still safely
> process the event, even if kernfs_put_open_node updates ->attr.open to
> NULL in parallel.
> In both cases the behaviour/ordering will be the same as with the
> earlier code that used kernfs_open_node_lock.
> Please let me know if this answers your query or if something is still
> missing.

I was more wondering whether if there's someone waiting for an event which
should arrive considering the sequence of events, whether the lockless code
would have the same ordering properties. I think it does given that the
event sequence can only be defined in memory visibility terms anyway.

Thanks.

-- 
tejun

* Re: [PATCH v8 01/10] kernfs: Remove reference counting for kernfs_open_node.
  2022-04-26 18:29       ` Tejun Heo
@ 2022-04-26 20:13         ` Al Viro
  2022-04-26 20:16           ` Tejun Heo
  0 siblings, 1 reply; 25+ messages in thread
From: Al Viro @ 2022-04-26 20:13 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Imran Khan, gregkh, ebiederm, linux-kernel

On Tue, Apr 26, 2022 at 08:29:21AM -1000, Tejun Heo wrote:

> The code is just wrong. You can end up:
> 
>         on = kn->attr.open;
>         if (!on)
>                 return;
> 
>         // we get preempted here and someone else puts @on and then
>         // recreates it

Can't happen - the caller has guaranteed that no new openers will be
coming.  Look at the call chain of kernfs_drain_open_files()...


Al, back to life and staring at the pile of mail to sift through...

* Re: [PATCH v8 01/10] kernfs: Remove reference counting for kernfs_open_node.
  2022-04-26 20:13         ` Al Viro
@ 2022-04-26 20:16           ` Tejun Heo
  0 siblings, 0 replies; 25+ messages in thread
From: Tejun Heo @ 2022-04-26 20:16 UTC (permalink / raw)
  To: Al Viro; +Cc: Imran Khan, gregkh, ebiederm, linux-kernel

On Tue, Apr 26, 2022 at 08:13:22PM +0000, Al Viro wrote:
> On Tue, Apr 26, 2022 at 08:29:21AM -1000, Tejun Heo wrote:
> 
> > The code is just wrong. You can end up:
> > 
> >         on = kn->attr.open;
> >         if (!on)
> >                 return;
> > 
> >         // we get preempted here and someone else puts @on and then
> >         // recreates it
> 
> Can't happen - the caller has guaranteed that no new openers will be
> coming.  Look at the call chain of kernfs_drain_open_files()...

Ah, okay. It can only transition to NULL from there. It's still a weird way
to go about it tho.

Thanks.

-- 
tejun

* Re: [PATCH v8 00/10] kernfs: Remove reference counting for kernfs_open_node.
  2022-04-25 17:21     ` Tejun Heo
@ 2022-04-28 17:28       ` Imran Khan
  0 siblings, 0 replies; 25+ messages in thread
From: Imran Khan @ 2022-04-28 17:28 UTC (permalink / raw)
  To: Tejun Heo; +Cc: viro, gregkh, ebiederm, linux-kernel

Hello Tejun,

On 26/4/22 3:21 am, Tejun Heo wrote:
> Hello,
> 
> On Sat, Apr 23, 2022 at 06:49:49PM +1000, Imran Khan wrote:
>> Thanks again for reviewing this and I agree with your suggestion. So far
>> most of the concerns have been around usage of kernfs_rwsem and those
>> can be addressed independent of first 5 (even first 7 )changes here.
>> Just one question in this regard, should I send the new patch set
>> (addressing open_file_mutex and list conversion) as a separate patch set
>> or should I sent it as v9 of ongoing change set. I guess first option is
>> better but thought of confirming once before proceeding.
> 
> Either way is okay but I like the first one too. Let's break it up into
> smaller pieces so that we can make progress piece by piece.
> 

I have sent the first 5 patches after addressing your review comments.
This patch set can be seen at [1].

Could you please have a look and let me know if it looks okay?

Thanks
 -- Imran

[1]:
https://lore.kernel.org/lkml/20220428055431.3826852-1-imran.f.khan@oracle.com/
