* [PATCH v2 0/3] vfs: Use per-cpu list for SB's s_inodes list
@ 2016-02-19 21:10 Waiman Long
  2016-02-19 21:10 ` [PATCH v2 1/3] lib/percpu-list: Per-cpu list with associated per-cpu locks Waiman Long
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Waiman Long @ 2016-02-19 21:10 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Scott J Norton, Douglas Hatch,
	Waiman Long

v1->v2:
 - Use separate structures for list head and nodes & provide a
   cleaner interface.
 - Use the existing list_for_each_entry() or list_for_each_entry_safe()
   macros, as appropriate, for each of the sb's s_inodes iteration
   functions instead of using list_for_each_entry_safe() for all of
   them, which may not be safe in some cases.
 - Use an iterator interface to access all the nodes of a group of
   per-cpu lists. This approach is cleaner than the previous double-for
   macro, which was kind of hacky. However, it does require more lines
   of code changes.
 - Add a preparatory patch 2 that extracts the per-inode code from
   the superblock s_inodes list iteration functions to minimize the
   code changes needed in patch 3.

This patchset is a replacement for my previous list batching patch -
https://lwn.net/Articles/674105/. Compared with the previous patch,
this one provides better performance and fairness. However, it also
requires a few more changes in the VFS layer.

This patchset is a derivative of Andi Kleen's patch "Initial per
cpu list for the per sb inode list":

https://git.kernel.org/cgit/linux/kernel/git/ak/linux-misc.git/commit/?h=hle315/combined&id=f1cf9e715a40f44086662ae3b29f123cf059cbf4

Patch 1 introduces the per-cpu list.

Patch 2 extracts the per-inode code from the superblock s_inodes
list iteration functions to minimize the code changes needed in the
following patch.

Patch 3 modifies the superblock and inode structures to use the
per-cpu list. The functions that reference those structures are
modified accordingly.

Waiman Long (3):
  lib/percpu-list: Per-cpu list with associated per-cpu locks
  vfs: Refactor sb->s_inodes iteration functions
  vfs: Use per-cpu list for superblock's inode list

 fs/block_dev.c              |   71 +++++++++++--------
 fs/drop_caches.c            |   48 ++++++++-----
 fs/fs-writeback.c           |   85 ++++++++++++----------
 fs/inode.c                  |  163 ++++++++++++++++++++++++-------------------
 fs/notify/inode_mark.c      |  145 ++++++++++++++++++++------------------
 fs/quota/dquot.c            |  134 +++++++++++++++++++++--------------
 fs/super.c                  |    7 +-
 include/linux/fs.h          |   37 +++++++++-
 include/linux/percpu-list.h |   94 +++++++++++++++++++++++++
 lib/Makefile                |    2 +-
 lib/percpu-list.c           |  144 ++++++++++++++++++++++++++++++++++++++
 11 files changed, 643 insertions(+), 287 deletions(-)
 create mode 100644 include/linux/percpu-list.h
 create mode 100644 lib/percpu-list.c


* [PATCH v2 1/3] lib/percpu-list: Per-cpu list with associated per-cpu locks
  2016-02-19 21:10 [PATCH v2 0/3] vfs: Use per-cpu list for SB's s_inodes list Waiman Long
@ 2016-02-19 21:10 ` Waiman Long
  2016-02-19 21:10 ` [PATCH v2 2/3] vfs: Refactor sb->s_inodes iteration functions Waiman Long
  2016-02-19 21:10 ` [PATCH v2 3/3] vfs: Use per-cpu list for superblock's inode list Waiman Long
  2 siblings, 0 replies; 13+ messages in thread
From: Waiman Long @ 2016-02-19 21:10 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Scott J Norton, Douglas Hatch,
	Waiman Long

Linked lists are used everywhere in the Linux kernel. However, if many
threads are trying to add entries to or delete entries from the same
linked list, the list can become a performance bottleneck.

This patch introduces a new per-cpu list subsystem with associated
per-cpu locks for protecting each of the lists individually. This
allows list entry insertion and deletion operations to happen in
parallel instead of being serialized by a global list and lock.

List entry insertion is strictly per-cpu. List deletion, however, can
happen on a CPU other than the one that did the insertion, so we still
need a lock to protect each list. Because of that, there may still be
a small amount of contention when deletions are being done.
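
The two data structures involved pair each per-cpu list with its own
lock, while each node remembers which per-cpu lock currently protects
it (definitions as in the patch below; the lockptr comment is
editorial):

	struct pcpu_list_head {
		struct list_head list;
		spinlock_t lock;
	};

	struct pcpu_list_node {
		struct list_head list;
		spinlock_t *lockptr;	/* lock of the list this node is on */
	};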

A new header file include/linux/percpu-list.h is added with the
associated pcpu_list_head and pcpu_list_node structures. The following
functions are provided to manage the per-cpu list:

 1. int init_pcpu_list_head(struct pcpu_list_head **ppcpu_head)
 2. void pcpu_list_add(struct pcpu_list_node *node,
		       struct pcpu_list_head *head)
 3. void pcpu_list_del(struct pcpu_list_node *node)
 4. void pcpu_list_iterate(struct pcpu_list_head *pcpu_head,
			   bool safe,
			   struct list_head **phead,
			   int (*iter)(struct pcpu_list_node *,
				       struct pcpu_list_node **,
				       spinlock_t *, void *),
			   void *arg);

The last one is used to iterate over all the list nodes in a group of
per-cpu lists. An iterator function has to be provided to access
each list node one by one.
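
For illustration, here is a minimal usage sketch (hypothetical -
struct foo, count_iter() and pcpu_list_demo() below are examples only,
not part of this patch):

	#include <linux/percpu-list.h>
	#include <linux/slab.h>

	struct foo {
		struct pcpu_list_node link;
	};

	/* Iterator callback: count the nodes; returning 0 continues */
	static int count_iter(struct pcpu_list_node *node,
			      struct pcpu_list_node **pnext,
			      spinlock_t *lock, void *arg)
	{
		(*(int *)arg)++;
		return 0;
	}

	static int pcpu_list_demo(void)
	{
		struct pcpu_list_head *head;
		struct foo *f;
		int count = 0;

		if (init_pcpu_list_head(&head))
			return -ENOMEM;

		f = kmalloc(sizeof(*f), GFP_KERNEL);
		if (!f) {
			free_pcpu_list_head(&head);
			return -ENOMEM;
		}
		init_pcpu_list_node(&f->link);
		pcpu_list_add(&f->link, head);	/* lands on this CPU's list */

		pcpu_list_iterate(head, false, NULL, count_iter, &count);

		pcpu_list_del(&f->link);	/* may be called from any CPU */
		kfree(f);
		free_pcpu_list_head(&head);
		return count;
	}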

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 include/linux/percpu-list.h |   94 ++++++++++++++++++++++++++++
 lib/Makefile                |    2 +-
 lib/percpu-list.c           |  144 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 239 insertions(+), 1 deletions(-)
 create mode 100644 include/linux/percpu-list.h
 create mode 100644 lib/percpu-list.c

diff --git a/include/linux/percpu-list.h b/include/linux/percpu-list.h
new file mode 100644
index 0000000..9c100cb
--- /dev/null
+++ b/include/linux/percpu-list.h
@@ -0,0 +1,94 @@
+/*
+ * Per-cpu list
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2016 Hewlett-Packard Enterprise Development LP
+ *
+ * Authors: Waiman Long <waiman.long@hpe.com>
+ */
+#ifndef __LINUX_PERCPU_LIST_H
+#define __LINUX_PERCPU_LIST_H
+
+#include <linux/spinlock.h>
+#include <linux/list.h>
+#include <linux/percpu.h>
+
+/*
+ * include/linux/percpu-list.h
+ *
+ * A per-cpu list protected by a per-cpu spinlock.
+ *
+ * The pcpu_list_head structure contains the spinlock; the
+ * pcpu_list_node structures each contain only a pointer to the
+ * spinlock in the corresponding pcpu_list_head.
+ */
+struct pcpu_list_head {
+	struct list_head list;
+	spinlock_t lock;
+};
+
+struct pcpu_list_node {
+	struct list_head list;
+	spinlock_t *lockptr;
+};
+
+#define PCPU_LIST_HEAD_INIT(name)				\
+	{							\
+		.list.prev = &name.list,			\
+		.list.next = &name.list,			\
+		.lock = __SPIN_LOCK_UNLOCKED(name),		\
+	}
+
+#define PCPU_LIST_NODE_INIT(name)		\
+	{					\
+		.list.prev = &name.list,	\
+		.list.next = &name.list,	\
+		.lockptr = NULL			\
+	}
+
+#define pcpu_list_next_entry(pos, member) list_next_entry(pos, member.list)
+
+static inline void init_pcpu_list_node(struct pcpu_list_node *node)
+{
+	INIT_LIST_HEAD(&node->list);
+	node->lockptr = NULL;
+}
+
+static inline void free_pcpu_list_head(struct pcpu_list_head **pcpu_head_handle)
+{
+	free_percpu(*pcpu_head_handle);
+	*pcpu_head_handle = NULL;
+}
+
+static inline bool pcpu_list_empty(struct pcpu_list_head *pcpu_head)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		if (!list_empty(&per_cpu_ptr(pcpu_head, cpu)->list))
+			return false;
+	return true;
+}
+
+extern int  init_pcpu_list_head(struct pcpu_list_head **ppcpu_head);
+extern void pcpu_list_add(struct pcpu_list_node *node,
+			  struct pcpu_list_head *head);
+extern void pcpu_list_del(struct pcpu_list_node *node);
+extern void pcpu_list_iterate(struct pcpu_list_head *pcpu_head,
+			      bool safe,
+			      struct list_head **phead,
+			      int (*iter)(struct pcpu_list_node *,
+					  struct pcpu_list_node **,
+					  spinlock_t *, void *),
+			      void *arg);
+
+#endif /* __LINUX_PERCPU_LIST_H */
diff --git a/lib/Makefile b/lib/Makefile
index a7c26a4..71a25d4 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -27,7 +27,7 @@ obj-y += bcd.o div64.o sort.o parser.o halfmd4.o debug_locks.o random32.o \
 	 gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \
 	 bsearch.o find_bit.o llist.o memweight.o kfifo.o \
 	 percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \
-	 once.o
+	 once.o percpu-list.o
 obj-y += string_helpers.o
 obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o
 obj-y += hexdump.o
diff --git a/lib/percpu-list.c b/lib/percpu-list.c
new file mode 100644
index 0000000..498b381
--- /dev/null
+++ b/lib/percpu-list.c
@@ -0,0 +1,144 @@
+/*
+ * Per-cpu list
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2016 Hewlett-Packard Enterprise Development LP
+ *
+ * Authors: Waiman Long <waiman.long@hpe.com>
+ */
+#include <linux/percpu-list.h>
+
+/*
+ * Initialize the per-cpu list
+ */
+int init_pcpu_list_head(struct pcpu_list_head **ppcpu_head)
+{
+	struct pcpu_list_head *pcpu_head = alloc_percpu(struct pcpu_list_head);
+	int cpu;
+
+	if (!pcpu_head)
+		return -ENOMEM;
+
+	for_each_possible_cpu(cpu) {
+		struct pcpu_list_head *head = per_cpu_ptr(pcpu_head, cpu);
+
+		INIT_LIST_HEAD(&head->list);
+		head->lock = __SPIN_LOCK_UNLOCKED(&head->lock);
+	}
+
+	*ppcpu_head = pcpu_head;
+	return 0;
+}
+
+/*
+ * List selection is based on the CPU being used when the pcpu_list_add()
+ * function is called. However, deletion may be done by a different CPU.
+ * So we still need to use a lock to protect the content of the list.
+ */
+void pcpu_list_add(struct pcpu_list_node *node, struct pcpu_list_head *head)
+{
+	spinlock_t *lock;
+
+	/*
+	 * There is a very slight chance that the cpu will change
+	 * (by preemption) before spin_lock() is called. That is harmless,
+	 * as we only need to put the node in one of the per-cpu lists;
+	 * it need not be the list of the current cpu.
+	 */
+	lock = this_cpu_ptr(&head->lock);
+	spin_lock(lock);
+	node->lockptr = lock;
+	list_add(&node->list, this_cpu_ptr(&head->list));
+	spin_unlock(lock);
+}
+
+/*
+ * Delete a node from a percpu list
+ *
+ * We need to check the lock pointer again after taking the lock to guard
+ * against a concurrent delete of the same node. If the lock pointer
+ * changes (becomes NULL or points to a different lock), we assume that
+ * the deletion was done elsewhere.
+ */
+void pcpu_list_del(struct pcpu_list_node *node)
+{
+	spinlock_t *lock = READ_ONCE(node->lockptr);
+
+	if (unlikely(!lock))
+		return;
+
+	spin_lock(lock);
+	if (likely(lock == node->lockptr)) {
+		list_del_init(&node->list);
+		node->lockptr = NULL;
+	}
+	spin_unlock(lock);
+}
+
+/**
+ * pcpu_list_iterate - iterate over all the nodes of a group of per-cpu lists
+ * @pcpu_head:	the per-cpu list head
+ * @safe:	true if the iteration has to be safe against removal of nodes
+ * @phead:	a handle to save the per-cpu list head pointer
+ * @iter:	the iterator function to be called for each node
+ * @arg:	the argument (structure pointer) to be passed to the iterator
+ *
+ * This function does not return a value. Any result from the iterator
+ * function has to be passed back via the structure pointed to by arg.
+ *
+ * The supplied iterator function should have the following prototype:
+ * int iter(struct pcpu_list_node *node, struct pcpu_list_node **pnext,
+ *	    spinlock_t *lock, void *arg)
+ * where
+ *  node:  the current per-cpu list node
+ *  pnext: a handle to the next per-cpu list node (valid only if safe=true)
+ *  lock:  the spinlock that is currently held
+ *  arg:   the structure pointer passed into pcpu_list_iterate()
+ *
+ * A non-zero return value from the iterator stops the iteration and makes
+ * pcpu_list_iterate() return immediately. The iterator function must
+ * release the lock before returning a non-zero value.
+ */
+void pcpu_list_iterate(struct pcpu_list_head *pcpu_head,
+		       bool safe,
+		       struct list_head **phead,
+		       int (*iter)(struct pcpu_list_node *,
+				   struct pcpu_list_node **,
+				   spinlock_t *, void *),
+		       void *arg)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		spinlock_t	 *lock = &per_cpu_ptr(pcpu_head, cpu)->lock;
+		struct list_head *head = &per_cpu_ptr(pcpu_head, cpu)->list;
+		struct pcpu_list_node *node, *next;
+
+		if (list_empty(head))
+			continue;
+
+		if (phead)
+			*phead = head;	/* Save current list head */
+
+		spin_lock(lock);
+		if (safe) {
+			list_for_each_entry_safe(node, next, head, list)
+				if (iter(node, &next, lock, arg))
+					return;
+		} else {
+			list_for_each_entry(node, head, list)
+				if (iter(node, NULL, lock, arg))
+					return;
+		}
+		spin_unlock(lock);
+	}
+}
-- 
1.7.1


* [PATCH v2 2/3] vfs: Refactor sb->s_inodes iteration functions
  2016-02-19 21:10 [PATCH v2 0/3] vfs: Use per-cpu list for SB's s_inodes list Waiman Long
  2016-02-19 21:10 ` [PATCH v2 1/3] lib/percpu-list: Per-cpu list with associated per-cpu locks Waiman Long
@ 2016-02-19 21:10 ` Waiman Long
  2016-02-19 21:10 ` [PATCH v2 3/3] vfs: Use per-cpu list for superblock's inode list Waiman Long
  2 siblings, 0 replies; 13+ messages in thread
From: Waiman Long @ 2016-02-19 21:10 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Scott J Norton, Douglas Hatch,
	Waiman Long

This patch refactors the following superblock inode list (sb->s_inodes)
iteration functions in vfs:

 1. iterate_bdevs()
 2. drop_pagecache_sb()
 3. wait_sb_inodes()
 4. evict_inodes()
 5. invalidate_inodes()
 6. fsnotify_unmount_inodes()
 7. add_dquot_ref()
 8. remove_dquot_ref()

The per-inode processing code of the above functions is extracted
into inline functions to ease the conversion to use the per-cpu
list. There is no functional change.
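
After the refactoring, each of these functions takes the following
shape (a schematic sketch; __process_one_inode() and
example_iteration() are placeholder names, not functions added by
this patch):

	/* Extracted per-inode processing */
	static inline void
	__process_one_inode(spinlock_t *lock, struct inode *inode)
	{
		/* per-inode work; may unlock *lock, block, then relock it */
	}

	static void example_iteration(struct super_block *sb)
	{
		struct inode *inode;

		spin_lock(&sb->s_inode_list_lock);
		list_for_each_entry(inode, &sb->s_inodes, i_sb_list)
			__process_one_inode(&sb->s_inode_list_lock, inode);
		spin_unlock(&sb->s_inode_list_lock);
	}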

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 fs/block_dev.c         |   59 +++++++++++---------
 fs/drop_caches.c       |   39 ++++++++-----
 fs/fs-writeback.c      |   73 +++++++++++++-----------
 fs/inode.c             |  108 ++++++++++++++++++++---------------
 fs/notify/inode_mark.c |  146 ++++++++++++++++++++++++++----------------------
 fs/quota/dquot.c       |  105 ++++++++++++++++++++--------------
 6 files changed, 298 insertions(+), 232 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 39b3a17..6eaeedf 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1862,38 +1862,45 @@ int __invalidate_device(struct block_device *bdev, bool kill_dirty)
 }
 EXPORT_SYMBOL(__invalidate_device);
 
+static inline void
+__iterate_bdev(spinlock_t *lock, struct inode *inode, struct inode **old_inode,
+	       void (*func)(struct block_device *, void *), void *arg)
+{
+	struct address_space *mapping = inode->i_mapping;
+
+	spin_lock(&inode->i_lock);
+	if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
+	    mapping->nrpages == 0) {
+		spin_unlock(&inode->i_lock);
+		return;
+	}
+	__iget(inode);
+	spin_unlock(&inode->i_lock);
+	spin_unlock(lock);
+	/*
+	 * We hold a reference to 'inode' so it couldn't have been
+	 * removed from s_inodes list while we dropped the
+	 * pcpu_lock. We cannot iput the inode now as we can
+	 * be holding the last reference and we cannot iput it under
+	 * pcpu_lock. So we keep the reference and iput it later.
+	 */
+	iput(*old_inode);
+	*old_inode = inode;
+
+	func(I_BDEV(inode), arg);
+
+	spin_lock(lock);
+}
+
 void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
 {
 	struct inode *inode, *old_inode = NULL;
 
 	spin_lock(&blockdev_superblock->s_inode_list_lock);
-	list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
-		struct address_space *mapping = inode->i_mapping;
-
-		spin_lock(&inode->i_lock);
-		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
-		    mapping->nrpages == 0) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-		__iget(inode);
-		spin_unlock(&inode->i_lock);
-		spin_unlock(&blockdev_superblock->s_inode_list_lock);
-		/*
-		 * We hold a reference to 'inode' so it couldn't have been
-		 * removed from s_inodes list while we dropped the
-		 * s_inode_list_lock  We cannot iput the inode now as we can
-		 * be holding the last reference and we cannot iput it under
-		 * s_inode_list_lock. So we keep the reference and iput it
-		 * later.
-		 */
-		iput(old_inode);
-		old_inode = inode;
+	list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list)
+		__iterate_bdev(&blockdev_superblock->s_inode_list_lock,
+			       inode, &old_inode, func, arg);
 
-		func(I_BDEV(inode), arg);
-
-		spin_lock(&blockdev_superblock->s_inode_list_lock);
-	}
 	spin_unlock(&blockdev_superblock->s_inode_list_lock);
 	iput(old_inode);
 }
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index d72d52b..d3449d5 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -13,28 +13,35 @@
 /* A global variable is a bit ugly, but it keeps the code simple */
 int sysctl_drop_caches;
 
+static inline void __drop_pagecache_sb(spinlock_t *lock, struct inode *inode,
+				       struct inode **toput_inode)
+{
+	spin_lock(&inode->i_lock);
+	if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
+	    (inode->i_mapping->nrpages == 0)) {
+		spin_unlock(&inode->i_lock);
+		return;
+	}
+	__iget(inode);
+	spin_unlock(&inode->i_lock);
+	spin_unlock(lock);
+
+	invalidate_mapping_pages(inode->i_mapping, 0, -1);
+	iput(*toput_inode);
+	*toput_inode = inode;
+
+	spin_lock(lock);
+}
+
 static void drop_pagecache_sb(struct super_block *sb, void *unused)
 {
 	struct inode *inode, *toput_inode = NULL;
 
 	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		spin_lock(&inode->i_lock);
-		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
-		    (inode->i_mapping->nrpages == 0)) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-		__iget(inode);
-		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list)
+		__drop_pagecache_sb(&sb->s_inode_list_lock, inode,
+				    &toput_inode);
 
-		invalidate_mapping_pages(inode->i_mapping, 0, -1);
-		iput(toput_inode);
-		toput_inode = inode;
-
-		spin_lock(&sb->s_inode_list_lock);
-	}
 	spin_unlock(&sb->s_inode_list_lock);
 	iput(toput_inode);
 }
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 6915c95..5ad6eda 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -2095,6 +2095,43 @@ out_unlock_inode:
 }
 EXPORT_SYMBOL(__mark_inode_dirty);
 
+static inline void __wait_sb_inode(spinlock_t *lock, struct inode *inode,
+				   struct inode **old_inode)
+{
+	struct address_space *mapping = inode->i_mapping;
+
+	spin_lock(&inode->i_lock);
+	if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
+	    (mapping->nrpages == 0)) {
+		spin_unlock(&inode->i_lock);
+		return;
+	}
+	__iget(inode);
+	spin_unlock(&inode->i_lock);
+	spin_unlock(lock);
+
+	/*
+	 * We hold a reference to 'inode' so it couldn't have been
+	 * removed from s_inodes list while we dropped the
+	 * pcpu_lock.  We cannot iput the inode now as we can
+	 * be holding the last reference and we cannot iput it under
+	 * pcpu_lock. So we keep the reference and iput it later.
+	 */
+	iput(*old_inode);
+	*old_inode = inode;
+
+	/*
+	 * We keep the error status of individual mapping so that
+	 * applications can catch the writeback error using fsync(2).
+	 * See filemap_fdatawait_keep_errors() for details.
+	 */
+	filemap_fdatawait_keep_errors(mapping);
+
+	cond_resched();
+
+	spin_lock(lock);
+}
+
 /*
  * The @s_sync_lock is used to serialise concurrent sync operations
  * to avoid lock contention problems with concurrent wait_sb_inodes() calls.
@@ -2124,41 +2161,9 @@ static void wait_sb_inodes(struct super_block *sb)
 	 * In which case, the inode may not be on the dirty list, but
 	 * we still have to wait for that writeout.
 	 */
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		struct address_space *mapping = inode->i_mapping;
-
-		spin_lock(&inode->i_lock);
-		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
-		    (mapping->nrpages == 0)) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-		__iget(inode);
-		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list)
+		__wait_sb_inode(&sb->s_inode_list_lock, inode, &old_inode);
 
-		/*
-		 * We hold a reference to 'inode' so it couldn't have been
-		 * removed from s_inodes list while we dropped the
-		 * s_inode_list_lock.  We cannot iput the inode now as we can
-		 * be holding the last reference and we cannot iput it under
-		 * s_inode_list_lock. So we keep the reference and iput it
-		 * later.
-		 */
-		iput(old_inode);
-		old_inode = inode;
-
-		/*
-		 * We keep the error status of individual mapping so that
-		 * applications can catch the writeback error using fsync(2).
-		 * See filemap_fdatawait_keep_errors() for details.
-		 */
-		filemap_fdatawait_keep_errors(mapping);
-
-		cond_resched();
-
-		spin_lock(&sb->s_inode_list_lock);
-	}
 	spin_unlock(&sb->s_inode_list_lock);
 	iput(old_inode);
 	mutex_unlock(&sb->s_sync_lock);
diff --git a/fs/inode.c b/fs/inode.c
index 9f62db3..6dd609e 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -579,6 +579,37 @@ static void dispose_list(struct list_head *head)
 	}
 }
 
+static inline int __evict_inode(spinlock_t *lock, struct inode *inode,
+				struct list_head *dispose)
+{
+	if (atomic_read(&inode->i_count))
+		return 0;
+
+	spin_lock(&inode->i_lock);
+	if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
+		spin_unlock(&inode->i_lock);
+		return 0;
+	}
+
+	inode->i_state |= I_FREEING;
+	inode_lru_list_del(inode);
+	spin_unlock(&inode->i_lock);
+	list_add(&inode->i_lru, dispose);
+
+	/*
+	 * We can have a ton of inodes to evict at unmount time given
+	 * enough memory, check to see if we need to go to sleep for a
+	 * bit so we don't livelock.
+	 */
+	if (need_resched()) {
+		spin_unlock(lock);
+		cond_resched();
+		dispose_list(dispose);
+		return 1;	/* Restart the scan */
+	}
+	return 0;
+}
+
 /**
  * evict_inodes	- evict all evictable inodes for a superblock
  * @sb:		superblock to operate on
@@ -596,35 +627,39 @@ void evict_inodes(struct super_block *sb)
 again:
 	spin_lock(&sb->s_inode_list_lock);
 	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
-		if (atomic_read(&inode->i_count))
-			continue;
+		if (__evict_inode(&sb->s_inode_list_lock, inode, &dispose))
+			goto again;
+	}
+	spin_unlock(&sb->s_inode_list_lock);
 
-		spin_lock(&inode->i_lock);
-		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
+	dispose_list(&dispose);
+}
 
-		inode->i_state |= I_FREEING;
-		inode_lru_list_del(inode);
+static inline void __invalidate_inode(struct inode *inode, bool kill_dirty,
+				      struct list_head *dispose, int *busy)
+{
+	spin_lock(&inode->i_lock);
+	if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
 		spin_unlock(&inode->i_lock);
-		list_add(&inode->i_lru, &dispose);
+		return;
+	}
 
-		/*
-		 * We can have a ton of inodes to evict at unmount time given
-		 * enough memory, check to see if we need to go to sleep for a
-		 * bit so we don't livelock.
-		 */
-		if (need_resched()) {
-			spin_unlock(&sb->s_inode_list_lock);
-			cond_resched();
-			dispose_list(&dispose);
-			goto again;
-		}
+	if (inode->i_state & I_DIRTY_ALL && !kill_dirty) {
+		spin_unlock(&inode->i_lock);
+		*busy = 1;
+		return;
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 
-	dispose_list(&dispose);
+	if (atomic_read(&inode->i_count)) {
+		spin_unlock(&inode->i_lock);
+		*busy = 1;
+		return;
+	}
+
+	inode->i_state |= I_FREEING;
+	inode_lru_list_del(inode);
+	spin_unlock(&inode->i_lock);
+	list_add(&inode->i_lru, dispose);
 }
 
 /**
@@ -644,28 +679,9 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 	LIST_HEAD(dispose);
 
 	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
-		spin_lock(&inode->i_lock);
-		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-		if (inode->i_state & I_DIRTY_ALL && !kill_dirty) {
-			spin_unlock(&inode->i_lock);
-			busy = 1;
-			continue;
-		}
-		if (atomic_read(&inode->i_count)) {
-			spin_unlock(&inode->i_lock);
-			busy = 1;
-			continue;
-		}
+	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list)
+		__invalidate_inode(inode, kill_dirty, &dispose, &busy);
 
-		inode->i_state |= I_FREEING;
-		inode_lru_list_del(inode);
-		spin_unlock(&inode->i_lock);
-		list_add(&inode->i_lru, &dispose);
-	}
 	spin_unlock(&sb->s_inode_list_lock);
 
 	dispose_list(&dispose);
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index 741077d..ec52dcb 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -141,86 +141,98 @@ int fsnotify_add_inode_mark(struct fsnotify_mark *mark,
 	return ret;
 }
 
-/**
- * fsnotify_unmount_inodes - an sb is unmounting.  handle any watched inodes.
- * @sb: superblock being unmounted.
- *
- * Called during unmount with no locks held, so needs to be safe against
- * concurrent modifiers. We temporarily drop sb->s_inode_list_lock and CAN block.
- */
-void fsnotify_unmount_inodes(struct super_block *sb)
+static inline void
+__fsnotify_unmount_inode(spinlock_t *lock, struct inode *inode,
+			 struct list_head *head, struct inode **pnext,
+			 struct inode **need_iput)
 {
-	struct inode *inode, *next_i, *need_iput = NULL;
-
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next_i, &sb->s_inodes, i_sb_list) {
-		struct inode *need_iput_tmp;
+	struct inode *need_iput_tmp;
+	struct inode *next_i = *pnext;
 
-		/*
-		 * We cannot __iget() an inode in state I_FREEING,
-		 * I_WILL_FREE, or I_NEW which is fine because by that point
-		 * the inode cannot have any associated watches.
-		 */
-		spin_lock(&inode->i_lock);
-		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
+	/*
+	 * We cannot __iget() an inode in state I_FREEING,
+	 * I_WILL_FREE, or I_NEW which is fine because by that point
+	 * the inode cannot have any associated watches.
+	 */
+	spin_lock(&inode->i_lock);
+	if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
+		spin_unlock(&inode->i_lock);
+		return;
+	}
 
-		/*
-		 * If i_count is zero, the inode cannot have any watches and
-		 * doing an __iget/iput with MS_ACTIVE clear would actually
-		 * evict all inodes with zero i_count from icache which is
-		 * unnecessarily violent and may in fact be illegal to do.
-		 */
-		if (!atomic_read(&inode->i_count)) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
+	/*
+	 * If i_count is zero, the inode cannot have any watches and
+	 * doing an __iget/iput with MS_ACTIVE clear would actually
+	 * evict all inodes with zero i_count from icache which is
+	 * unnecessarily violent and may in fact be illegal to do.
+	 */
+	if (!atomic_read(&inode->i_count)) {
+		spin_unlock(&inode->i_lock);
+		return;
+	}
 
-		need_iput_tmp = need_iput;
-		need_iput = NULL;
+	need_iput_tmp = *need_iput;
+	*need_iput = NULL;
 
-		/* In case fsnotify_inode_delete() drops a reference. */
-		if (inode != need_iput_tmp)
-			__iget(inode);
-		else
-			need_iput_tmp = NULL;
-		spin_unlock(&inode->i_lock);
+	/* In case fsnotify_inode_delete() drops a reference. */
+	if (inode != need_iput_tmp)
+		__iget(inode);
+	else
+		need_iput_tmp = NULL;
+	spin_unlock(&inode->i_lock);
 
-		/* In case the dropping of a reference would nuke next_i. */
-		while (&next_i->i_sb_list != &sb->s_inodes) {
-			spin_lock(&next_i->i_lock);
-			if (!(next_i->i_state & (I_FREEING | I_WILL_FREE)) &&
-						atomic_read(&next_i->i_count)) {
-				__iget(next_i);
-				need_iput = next_i;
-				spin_unlock(&next_i->i_lock);
-				break;
-			}
+	/* In case the dropping of a reference would nuke next_i. */
+	while (&next_i->i_sb_list != head) {
+		spin_lock(&next_i->i_lock);
+		if (!(next_i->i_state & (I_FREEING | I_WILL_FREE)) &&
+					atomic_read(&next_i->i_count)) {
+			__iget(next_i);
+			*need_iput = next_i;
 			spin_unlock(&next_i->i_lock);
-			next_i = list_next_entry(next_i, i_sb_list);
+			break;
 		}
+		spin_unlock(&next_i->i_lock);
+		next_i = list_next_entry(next_i, i_sb_list);
+	}
+	*pnext = next_i;
 
-		/*
-		 * We can safely drop s_inode_list_lock here because either
-		 * we actually hold references on both inode and next_i or
-		 * end of list.  Also no new inodes will be added since the
-		 * umount has begun.
-		 */
-		spin_unlock(&sb->s_inode_list_lock);
+	/*
+	 * We can safely drop pcpu_lock here because either
+	 * we actually hold references on both inode and next_i or
+	 * end of list.  Also no new inodes will be added since the
+	 * umount has begun.
+	 */
+	spin_unlock(lock);
 
-		if (need_iput_tmp)
-			iput(need_iput_tmp);
+	if (need_iput_tmp)
+		iput(need_iput_tmp);
 
-		/* for each watch, send FS_UNMOUNT and then remove it */
-		fsnotify(inode, FS_UNMOUNT, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
+	/* for each watch, send FS_UNMOUNT and then remove it */
+	fsnotify(inode, FS_UNMOUNT, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
 
-		fsnotify_inode_delete(inode);
+	fsnotify_inode_delete(inode);
 
-		iput(inode);
+	iput(inode);
+
+	spin_lock(lock);
+}
+
+/**
+ * fsnotify_unmount_inodes - an sb is unmounting.  handle any watched inodes.
+ * @sb: superblock being unmounted.
+ *
+ * Called during unmount with no locks held, so needs to be safe against
+ * concurrent modifiers. We temporarily drop sb->s_inodes->lock and CAN
+ * block.
+ */
+void fsnotify_unmount_inodes(struct super_block *sb)
+{
+	struct inode *inode, *next_i, *need_iput = NULL;
+
+	spin_lock(&sb->s_inode_list_lock);
+	list_for_each_entry_safe(inode, next_i, &sb->s_inodes, i_sb_list)
+		__fsnotify_unmount_inode(&sb->s_inode_list_lock, inode,
+					 &sb->s_inodes, &next_i, &need_iput);
 
-		spin_lock(&sb->s_inode_list_lock);
-	}
 	spin_unlock(&sb->s_inode_list_lock);
 }
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 3c3b81b..143183b 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -920,6 +920,42 @@ static int dqinit_needed(struct inode *inode, int type)
 	return 0;
 }
 
+static inline void
+__add_dquot_ref(spinlock_t *lock, struct inode *inode, int type,
+#ifdef CONFIG_QUOTA_DEBUG
+		int *reserved,
+#endif
+		struct inode **old_inode)
+{
+	spin_lock(&inode->i_lock);
+	if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
+	    !atomic_read(&inode->i_writecount) ||
+	    !dqinit_needed(inode, type)) {
+		spin_unlock(&inode->i_lock);
+		return;
+	}
+	__iget(inode);
+	spin_unlock(&inode->i_lock);
+	spin_unlock(lock);
+
+#ifdef CONFIG_QUOTA_DEBUG
+	if (unlikely(inode_get_rsv_space(inode) > 0))
+		*reserved = 1;
+#endif
+	iput(*old_inode);
+	__dquot_initialize(inode, type);
+
+	/*
+	 * We hold a reference to 'inode' so it couldn't have been
+	 * removed from s_inodes list while we dropped the
+	 * pcpu_lock. We cannot iput the inode now as we can be
+	 * holding the last reference and we cannot iput it under
+	 * pcpu_lock. So we keep the reference and iput it later.
+	 */
+	*old_inode = inode;
+	spin_lock(lock);
+}
+
 /* This routine is guarded by dqonoff_mutex mutex */
 static void add_dquot_ref(struct super_block *sb, int type)
 {
@@ -929,36 +965,12 @@ static void add_dquot_ref(struct super_block *sb, int type)
 #endif
 
 	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		spin_lock(&inode->i_lock);
-		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
-		    !atomic_read(&inode->i_writecount) ||
-		    !dqinit_needed(inode, type)) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-		__iget(inode);
-		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
-
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list)
+		__add_dquot_ref(&sb->s_inode_list_lock, inode, type,
 #ifdef CONFIG_QUOTA_DEBUG
-		if (unlikely(inode_get_rsv_space(inode) > 0))
-			reserved = 1;
+				&reserved,
 #endif
-		iput(old_inode);
-		__dquot_initialize(inode, type);
-
-		/*
-		 * We hold a reference to 'inode' so it couldn't have been
-		 * removed from s_inodes list while we dropped the
-		 * s_inode_list_lock. We cannot iput the inode now as we can be
-		 * holding the last reference and we cannot iput it under
-		 * s_inode_list_lock. So we keep the reference and iput it
-		 * later.
-		 */
-		old_inode = inode;
-		spin_lock(&sb->s_inode_list_lock);
-	}
+				&old_inode);
 	spin_unlock(&sb->s_inode_list_lock);
 	iput(old_inode);
 
@@ -1022,6 +1034,25 @@ static void put_dquot_list(struct list_head *tofree_head)
 	}
 }
 
+static inline void
+__remove_dquot_ref(struct inode *inode, int type,
+		   struct list_head *tofree_head, int *reserved)
+{
+	/*
+	 *  We have to scan also I_NEW inodes because they can already
+	 *  have quota pointer initialized. Luckily, we need to touch
+	 *  only quota pointers and these have separate locking
+	 *  (dq_data_lock).
+	 */
+	spin_lock(&dq_data_lock);
+	if (!IS_NOQUOTA(inode)) {
+		if (unlikely(inode_get_rsv_space(inode) > 0))
+			*reserved = 1;
+		remove_inode_dquot_ref(inode, type, tofree_head);
+	}
+	spin_unlock(&dq_data_lock);
+}
+
 static void remove_dquot_ref(struct super_block *sb, int type,
 		struct list_head *tofree_head)
 {
@@ -1029,21 +1060,9 @@ static void remove_dquot_ref(struct super_block *sb, int type,
 	int reserved = 0;
 
 	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		/*
-		 *  We have to scan also I_NEW inodes because they can already
-		 *  have quota pointer initialized. Luckily, we need to touch
-		 *  only quota pointers and these have separate locking
-		 *  (dq_data_lock).
-		 */
-		spin_lock(&dq_data_lock);
-		if (!IS_NOQUOTA(inode)) {
-			if (unlikely(inode_get_rsv_space(inode) > 0))
-				reserved = 1;
-			remove_inode_dquot_ref(inode, type, tofree_head);
-		}
-		spin_unlock(&dq_data_lock);
-	}
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list)
+		__remove_dquot_ref(inode, type, tofree_head, &reserved);
+
 	spin_unlock(&sb->s_inode_list_lock);
 #ifdef CONFIG_QUOTA_DEBUG
 	if (reserved) {
-- 
1.7.1


* [PATCH v2 3/3] vfs: Use per-cpu list for superblock's inode list
  2016-02-19 21:10 [PATCH v2 0/3] vfs: Use per-cpu list for SB's s_inodes list Waiman Long
  2016-02-19 21:10 ` [PATCH v2 1/3] lib/percpu-list: Per-cpu list with associated per-cpu locks Waiman Long
  2016-02-19 21:10 ` [PATCH v2 2/3] vfs: Refactor sb->s_inodes iteration functions Waiman Long
@ 2016-02-19 21:10 ` Waiman Long
  2016-02-21 21:34   ` Dave Chinner
  2 siblings, 1 reply; 13+ messages in thread
From: Waiman Long @ 2016-02-19 21:10 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Scott J Norton, Douglas Hatch,
	Waiman Long

When many threads are trying to add inodes to or delete inodes from
a superblock's s_inodes list, spinlock contention on the list can
become a performance bottleneck.

This patch changes the s_inodes field to become a per-cpu list with
per-cpu spinlocks.
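
The structural change (taken from the include/linux/fs.h hunk below)
boils down to:

	/* Before */
	spinlock_t		s_inode_list_lock;	/* protects s_inodes */
	struct list_head	s_inodes;	/* all inodes, in super_block */
	struct list_head	i_sb_list;	/* list node, in inode */

	/* After */
	struct pcpu_list_head __percpu *s_inodes;	/* all inodes */
	struct pcpu_list_node	i_sb_list;	/* list node, in inode */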

An exit microbenchmark creates a large number of threads, attaches
many inodes to them and then exits. The runtimes of that
microbenchmark with 1000 threads before and after the patch on a
4-socket Intel E7-4820 v3 system (40 cores, 80 threads) were as
follows:

  Kernel            Elapsed Time    System Time
  ------            ------------    -----------
  Vanilla 4.5-rc4      65.29s         82m14s
  Patched 4.5-rc4      22.81s         23m03s

Before the patch, spinlock contention in the inode_sb_list_add()
function during the startup phase and in the inode_sb_list_del()
function during the exit phase accounted for about 79% and 93% of
total CPU time respectively (as measured by perf). After the patch,
the pcpu_list_add() function consumed only about 0.04% of CPU time
during the startup phase and the pcpu_list_del() function about 0.4%
during the exit phase. There was still some spinlock contention, but
it happened elsewhere.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 fs/block_dev.c         |   40 +++++++++++--------
 fs/drop_caches.c       |   31 ++++++++-------
 fs/fs-writeback.c      |   30 ++++++++------
 fs/inode.c             |   99 ++++++++++++++++++++++++-----------------------
 fs/notify/inode_mark.c |   43 ++++++++++-----------
 fs/quota/dquot.c       |   83 ++++++++++++++++++++++------------------
 fs/super.c             |    7 ++-
 include/linux/fs.h     |   37 ++++++++++++++++--
 8 files changed, 211 insertions(+), 159 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 6eaeedf..5992a1f 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1862,21 +1862,27 @@ int __invalidate_device(struct block_device *bdev, bool kill_dirty)
 }
 EXPORT_SYMBOL(__invalidate_device);
 
-static inline void
-__iterate_bdev(spinlock_t *lock, struct inode *inode, struct inode **old_inode,
-	       void (*func)(struct block_device *, void *), void *arg)
+/*
+ * iterate_bdev_iter - iteration function for each inode of a block
+ *			device superblock
+ */
+SB_INODES_ITER_FUNC(iterate_bdev, pcpu_lock,
+		    struct inode *old_inode;
+		    void (*func)(struct block_device *, void *);
+		    void *arg)
 {
+	SB_INODES_ITER_ARGS(iterate_bdev, inode, arg);
 	struct address_space *mapping = inode->i_mapping;
 
 	spin_lock(&inode->i_lock);
 	if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
 	    mapping->nrpages == 0) {
 		spin_unlock(&inode->i_lock);
-		return;
+		return 0;
 	}
 	__iget(inode);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(lock);
+	spin_unlock(pcpu_lock);
 	/*
 	 * We hold a reference to 'inode' so it couldn't have been
 	 * removed from s_inodes list while we dropped the
@@ -1884,23 +1890,23 @@ __iterate_bdev(spinlock_t *lock, struct inode *inode, struct inode **old_inode,
 	 * be holding the last reference and we cannot iput it under
 	 * pcpu_lock. So we keep the reference and iput it later.
 	 */
-	iput(*old_inode);
-	*old_inode = inode;
+	iput(arg->old_inode);
+	arg->old_inode = inode;
 
-	func(I_BDEV(inode), arg);
+	arg->func(I_BDEV(inode), arg->arg);
 
-	spin_lock(lock);
+	spin_lock(pcpu_lock);
+	return 0;
 }
 
-void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
+void iterate_bdevs(void (*func)(struct block_device *, void *), void *f_arg)
 {
-	struct inode *inode, *old_inode = NULL;
+	struct iterate_bdev_arg arg;
 
-	spin_lock(&blockdev_superblock->s_inode_list_lock);
-	list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list)
-		__iterate_bdev(&blockdev_superblock->s_inode_list_lock,
-			       inode, &old_inode, func, arg);
+	arg.arg = f_arg;
+	arg.func = func;
+	arg.old_inode = NULL;
 
-	spin_unlock(&blockdev_superblock->s_inode_list_lock);
-	iput(old_inode);
+	SB_INODES_ITER_CALL(iterate_bdev, blockdev_superblock);
+	iput(arg.old_inode);
 }
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index d3449d5..63b1842 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -13,37 +13,40 @@
 /* A global variable is a bit ugly, but it keeps the code simple */
 int sysctl_drop_caches;
 
-static inline void __drop_pagecache_sb(spinlock_t *lock, struct inode *inode,
-				       struct inode **toput_inode)
+/*
+ * drop_pagecache_iter - iteration function for each inode of a superblock
+ */
+SB_INODES_ITER_FUNC(drop_pagecache, pcpu_lock,
+		    struct inode *toput_inode)
 {
+	SB_INODES_ITER_ARGS(drop_pagecache, inode, arg);
+
 	spin_lock(&inode->i_lock);
 	if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
 	    (inode->i_mapping->nrpages == 0)) {
 		spin_unlock(&inode->i_lock);
-		return;
+		return 0;
 	}
 	__iget(inode);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(lock);
+	spin_unlock(pcpu_lock);
 
 	invalidate_mapping_pages(inode->i_mapping, 0, -1);
-	iput(*toput_inode);
-	*toput_inode = inode;
+	iput(arg->toput_inode);
+	arg->toput_inode = inode;
 
-	spin_lock(lock);
+	spin_lock(pcpu_lock);
+	return 0;
 }
 
 static void drop_pagecache_sb(struct super_block *sb, void *unused)
 {
-	struct inode *inode, *toput_inode = NULL;
+	struct drop_pagecache_arg arg;
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list)
-		__drop_pagecache_sb(&sb->s_inode_list_lock, inode,
-				    &toput_inode);
+	arg.toput_inode = NULL;
 
-	spin_unlock(&sb->s_inode_list_lock);
-	iput(toput_inode);
+	SB_INODES_ITER_CALL(drop_pagecache, sb);
+	iput(arg.toput_inode);
 }
 
 int drop_caches_sysctl_handler(struct ctl_table *table, int write,
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 5ad6eda..080bae5 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -2095,20 +2095,24 @@ out_unlock_inode:
 }
 EXPORT_SYMBOL(__mark_inode_dirty);
 
-static inline void __wait_sb_inode(spinlock_t *lock, struct inode *inode,
-				   struct inode **old_inode)
+/*
+ * wait_sb_inode_iter - iteration function for each inode of a superblock
+ */
+SB_INODES_ITER_FUNC(wait_sb_inode, pcpu_lock,
+		    struct inode *old_inode)
 {
+	SB_INODES_ITER_ARGS(wait_sb_inode, inode, arg);
 	struct address_space *mapping = inode->i_mapping;
 
 	spin_lock(&inode->i_lock);
 	if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
 	    (mapping->nrpages == 0)) {
 		spin_unlock(&inode->i_lock);
-		return;
+		return 0;
 	}
 	__iget(inode);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(lock);
+	spin_unlock(pcpu_lock);
 
 	/*
 	 * We hold a reference to 'inode' so it couldn't have been
@@ -2117,8 +2121,8 @@ static inline void __wait_sb_inode(spinlock_t *lock, struct inode *inode,
 	 * be holding the last reference and we cannot iput it under
 	 * pcpu_lock. So we keep the reference and iput it later.
 	 */
-	iput(*old_inode);
-	*old_inode = inode;
+	iput(arg->old_inode);
+	arg->old_inode = inode;
 
 	/*
 	 * We keep the error status of individual mapping so that
@@ -2129,7 +2133,8 @@ static inline void __wait_sb_inode(spinlock_t *lock, struct inode *inode,
 
 	cond_resched();
 
-	spin_lock(lock);
+	spin_lock(pcpu_lock);
+	return 0;
 }
 
 /*
@@ -2143,7 +2148,9 @@ static inline void __wait_sb_inode(spinlock_t *lock, struct inode *inode,
  */
 static void wait_sb_inodes(struct super_block *sb)
 {
-	struct inode *inode, *old_inode = NULL;
+	struct wait_sb_inode_arg arg;
+
+	arg.old_inode = NULL;
 
 	/*
 	 * We need to be protected against the filesystem going from
@@ -2152,7 +2159,6 @@ static void wait_sb_inodes(struct super_block *sb)
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
 	mutex_lock(&sb->s_sync_lock);
-	spin_lock(&sb->s_inode_list_lock);
 
 	/*
 	 * Data integrity sync. Must wait for all pages under writeback,
@@ -2161,11 +2167,9 @@ static void wait_sb_inodes(struct super_block *sb)
 	 * In which case, the inode may not be on the dirty list, but
 	 * we still have to wait for that writeout.
 	 */
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list)
-		__wait_sb_inode(&sb->s_inode_list_lock, inode, &old_inode);
+	SB_INODES_ITER_CALL(wait_sb_inode, sb);
 
-	spin_unlock(&sb->s_inode_list_lock);
-	iput(old_inode);
+	iput(arg.old_inode);
 	mutex_unlock(&sb->s_sync_lock);
 }
 
diff --git a/fs/inode.c b/fs/inode.c
index 6dd609e..0f2fba4 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -28,7 +28,7 @@
  *   inode->i_state, inode->i_hash, __iget()
  * Inode LRU list locks protect:
  *   inode->i_sb->s_inode_lru, inode->i_lru
- * inode->i_sb->s_inode_list_lock protects:
+ * inode->i_sb->s_inodes->lock protects:
  *   inode->i_sb->s_inodes, inode->i_sb_list
  * bdi->wb.list_lock protects:
  *   bdi->wb.b_{dirty,io,more_io,dirty_time}, inode->i_io_list
@@ -37,7 +37,7 @@
  *
  * Lock ordering:
  *
- * inode->i_sb->s_inode_list_lock
+ * inode->i_sb->s_inodes->lock
  *   inode->i_lock
  *     Inode LRU list locks
  *
@@ -45,7 +45,7 @@
  *   inode->i_lock
  *
  * inode_hash_lock
- *   inode->i_sb->s_inode_list_lock
+ *   inode->i_sb->s_inodes->lock
  *   inode->i_lock
  *
  * iunique_lock
@@ -424,19 +424,14 @@ static void inode_lru_list_del(struct inode *inode)
  */
 void inode_sb_list_add(struct inode *inode)
 {
-	spin_lock(&inode->i_sb->s_inode_list_lock);
-	list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
-	spin_unlock(&inode->i_sb->s_inode_list_lock);
+	pcpu_list_add(&inode->i_sb_list, inode->i_sb->s_inodes);
 }
 EXPORT_SYMBOL_GPL(inode_sb_list_add);
 
 static inline void inode_sb_list_del(struct inode *inode)
 {
-	if (!list_empty(&inode->i_sb_list)) {
-		spin_lock(&inode->i_sb->s_inode_list_lock);
-		list_del_init(&inode->i_sb_list);
-		spin_unlock(&inode->i_sb->s_inode_list_lock);
-	}
+	if (!list_empty(&inode->i_sb_list.list))
+		pcpu_list_del(&inode->i_sb_list);
 }
 
 static unsigned long hash(struct super_block *sb, unsigned long hashval)
@@ -579,9 +574,15 @@ static void dispose_list(struct list_head *head)
 	}
 }
 
-static inline int __evict_inode(spinlock_t *lock, struct inode *inode,
-				struct list_head *dispose)
+/*
+ * evict_inode_iter - iteration function for each inode of a superblock
+ */
+SB_INODES_ITER_FUNC(evict_inode, pcpu_lock,
+		    struct list_head dispose;
+		    bool iter_again)
 {
+	SB_INODES_ITER_ARGS(evict_inode, inode, arg);
+
 	if (atomic_read(&inode->i_count))
 		return 0;
 
@@ -594,7 +595,7 @@ static inline int __evict_inode(spinlock_t *lock, struct inode *inode,
 	inode->i_state |= I_FREEING;
 	inode_lru_list_del(inode);
 	spin_unlock(&inode->i_lock);
-	list_add(&inode->i_lru, dispose);
+	list_add(&inode->i_lru, &arg->dispose);
 
 	/*
 	 * We can have a ton of inodes to evict at unmount time given
@@ -602,9 +603,10 @@ static inline int __evict_inode(spinlock_t *lock, struct inode *inode,
 	 * bit so we don't livelock.
 	 */
 	if (need_resched()) {
-		spin_unlock(lock);
+		spin_unlock(pcpu_lock);
 		cond_resched();
-		dispose_list(dispose);
+		dispose_list(&arg->dispose);
+		arg->iter_again = true;
 		return 1;	/* Restart the scan */
 	}
 	return 0;
@@ -621,47 +623,53 @@ static inline int __evict_inode(spinlock_t *lock, struct inode *inode,
  */
 void evict_inodes(struct super_block *sb)
 {
-	struct inode *inode, *next;
-	LIST_HEAD(dispose);
+	struct evict_inode_arg arg;
+
+	INIT_LIST_HEAD(&arg.dispose);
 
 again:
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
-		if (__evict_inode(&sb->s_inode_list_lock, inode, &dispose))
-			goto again;
-	}
-	spin_unlock(&sb->s_inode_list_lock);
+	arg.iter_again = false;
+	SB_INODES_ITER_CALL_SAFE(evict_inode, sb, NULL);
+	if (arg.iter_again)
+		goto again;
 
-	dispose_list(&dispose);
+	dispose_list(&arg.dispose);
 }
 
-static inline void __invalidate_inode(struct inode *inode, bool kill_dirty,
-				      struct list_head *dispose, int *busy)
+/*
+ * invalidate_inode_iter - attempt to free an inode on a superblock
+ */
+SB_INODES_ITER_FUNC(invalidate_inode, pcpu_lock,
+		    struct list_head dispose;
+		    bool busy;
+		    bool kill_dirty)
 {
+	SB_INODES_ITER_ARGS(invalidate_inode, inode, arg);
+
 	spin_lock(&inode->i_lock);
 	if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
 		spin_unlock(&inode->i_lock);
-		return;
+		return 0;
 	}
 
-	if (inode->i_state & I_DIRTY_ALL && !kill_dirty) {
+	if (inode->i_state & I_DIRTY_ALL && !arg->kill_dirty) {
 		spin_unlock(&inode->i_lock);
-		*busy = 1;
-		return;
+		arg->busy = 1;
+		return 0;
 	}
 
 	if (atomic_read(&inode->i_count)) {
 		spin_unlock(&inode->i_lock);
-		*busy = 1;
-		return;
+		arg->busy = 1;
+		return 0;
 	}
 
 	inode->i_state |= I_FREEING;
 	inode_lru_list_del(inode);
 	spin_unlock(&inode->i_lock);
-	list_add(&inode->i_lru, dispose);
+	list_add(&inode->i_lru, &arg->dispose);
+	return 0;
 }
-
 /**
  * invalidate_inodes	- attempt to free all inodes on a superblock
  * @sb:		superblock to operate on
@@ -674,19 +682,16 @@ static inline void __invalidate_inode(struct inode *inode, bool kill_dirty,
  */
 int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 {
-	int busy = 0;
-	struct inode *inode, *next;
-	LIST_HEAD(dispose);
+	struct invalidate_inode_arg arg;
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list)
-		__invalidate_inode(inode, kill_dirty, &dispose, &busy);
+	arg.kill_dirty = kill_dirty;
+	arg.busy = 0;
+	INIT_LIST_HEAD(&arg.dispose);
 
-	spin_unlock(&sb->s_inode_list_lock);
+	SB_INODES_ITER_CALL_SAFE(invalidate_inode, sb, NULL);
+	dispose_list(&arg.dispose);
 
-	dispose_list(&dispose);
-
-	return busy;
+	return arg.busy;
 }
 
 /*
@@ -897,7 +902,7 @@ struct inode *new_inode_pseudo(struct super_block *sb)
 		spin_lock(&inode->i_lock);
 		inode->i_state = 0;
 		spin_unlock(&inode->i_lock);
-		INIT_LIST_HEAD(&inode->i_sb_list);
+		init_pcpu_list_node(&inode->i_sb_list);
 	}
 	return inode;
 }
@@ -918,8 +923,6 @@ struct inode *new_inode(struct super_block *sb)
 {
 	struct inode *inode;
 
-	spin_lock_prefetch(&sb->s_inode_list_lock);
-
 	inode = new_inode_pseudo(sb);
 	if (inode)
 		inode_sb_list_add(inode);
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index ec52dcb..1fd1daf 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -141,13 +141,15 @@ int fsnotify_add_inode_mark(struct fsnotify_mark *mark,
 	return ret;
 }
 
-static inline void
-__fsnotify_unmount_inode(spinlock_t *lock, struct inode *inode,
-			 struct list_head *head, struct inode **pnext,
-			 struct inode **need_iput)
+/*
+ * unmount_inode_iter - iteration function for each inode of a superblock
+ */
+SB_INODES_ITER_FUNC(unmount_inode, pcpu_lock,
+		    struct inode *need_iput;
+		    struct list_head *percpu_head)
 {
+	SB_INODES_ITER_ARGS_SAFE(unmount_inode, inode, next_i, arg);
 	struct inode *need_iput_tmp;
-	struct inode *next_i = *pnext;
 
 	/*
 	 * We cannot __iget() an inode in state I_FREEING,
@@ -157,7 +159,7 @@ __fsnotify_unmount_inode(spinlock_t *lock, struct inode *inode,
 	spin_lock(&inode->i_lock);
 	if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
 		spin_unlock(&inode->i_lock);
-		return;
+		return 0;
 	}
 
 	/*
@@ -168,11 +170,11 @@ __fsnotify_unmount_inode(spinlock_t *lock, struct inode *inode,
 	 */
 	if (!atomic_read(&inode->i_count)) {
 		spin_unlock(&inode->i_lock);
-		return;
+		return 0;
 	}
 
-	need_iput_tmp = *need_iput;
-	*need_iput = NULL;
+	need_iput_tmp = arg->need_iput;
+	arg->need_iput = NULL;
 
 	/* In case fsnotify_inode_delete() drops a reference. */
 	if (inode != need_iput_tmp)
@@ -182,19 +184,19 @@ __fsnotify_unmount_inode(spinlock_t *lock, struct inode *inode,
 	spin_unlock(&inode->i_lock);
 
 	/* In case the dropping of a reference would nuke next_i. */
-	while (&next_i->i_sb_list != head) {
+	while (&next_i->i_sb_list.list != arg->percpu_head) {
 		spin_lock(&next_i->i_lock);
 		if (!(next_i->i_state & (I_FREEING | I_WILL_FREE)) &&
 					atomic_read(&next_i->i_count)) {
 			__iget(next_i);
-			*need_iput = next_i;
+			arg->need_iput = next_i;
 			spin_unlock(&next_i->i_lock);
 			break;
 		}
 		spin_unlock(&next_i->i_lock);
-		next_i = list_next_entry(next_i, i_sb_list);
+		next_i = pcpu_list_next_entry(next_i, i_sb_list);
 	}
-	*pnext = next_i;
+	SB_INODES_ITER_SET_PCPU_LIST_NEXT(next_i);
 
 	/*
 	 * We can safely drop pcpu_lock here because either
@@ -202,7 +204,7 @@ __fsnotify_unmount_inode(spinlock_t *lock, struct inode *inode,
 	 * end of list.  Also no new inodes will be added since the
 	 * umount has begun.
 	 */
-	spin_unlock(lock);
+	spin_unlock(pcpu_lock);
 
 	if (need_iput_tmp)
 		iput(need_iput_tmp);
@@ -214,7 +216,8 @@ __fsnotify_unmount_inode(spinlock_t *lock, struct inode *inode,
 
 	iput(inode);
 
-	spin_lock(lock);
+	spin_lock(pcpu_lock);
+	return 0;
 }
 
 /**
@@ -227,12 +230,8 @@ __fsnotify_unmount_inode(spinlock_t *lock, struct inode *inode,
  */
 void fsnotify_unmount_inodes(struct super_block *sb)
 {
-	struct inode *inode, *next_i, *need_iput = NULL;
-
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next_i, &sb->s_inodes, i_sb_list)
-		__fsnotify_unmount_inode(&sb->s_inode_list_lock, inode,
-					 &sb->s_inodes, &next_i, &need_iput);
+	struct unmount_inode_arg arg;
 
-	spin_unlock(&sb->s_inode_list_lock);
+	arg.need_iput = NULL;
+	SB_INODES_ITER_CALL_SAFE(unmount_inode, sb, &arg.percpu_head);
 }
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 143183b..6aa593e 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -920,30 +920,33 @@ static int dqinit_needed(struct inode *inode, int type)
 	return 0;
 }
 
-static inline void
-__add_dquot_ref(spinlock_t *lock, struct inode *inode, int type,
-#ifdef CONFIG_QUOTA_DEBUG
-		int *reserved,
-#endif
-		struct inode **old_inode)
+/*
+ * add_dquot_iter - iteration function for each inode of a superblock
+ */
+SB_INODES_ITER_FUNC(add_dquot, pcpu_lock,
+		    struct inode *old_inode;
+		    int type;
+		    int reserved)
 {
+	SB_INODES_ITER_ARGS(add_dquot, inode, arg);
+
 	spin_lock(&inode->i_lock);
 	if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
 	    !atomic_read(&inode->i_writecount) ||
-	    !dqinit_needed(inode, type)) {
+	    !dqinit_needed(inode, arg->type)) {
 		spin_unlock(&inode->i_lock);
-		return;
+		return 0;
 	}
 	__iget(inode);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(lock);
+	spin_unlock(pcpu_lock);
 
 #ifdef CONFIG_QUOTA_DEBUG
 	if (unlikely(inode_get_rsv_space(inode) > 0))
-		*reserved = 1;
+		arg->reserved = 1;
 #endif
-	iput(*old_inode);
-	__dquot_initialize(inode, type);
+	iput(arg->old_inode);
+	__dquot_initialize(inode, arg->type);
 
 	/*
 	 * We hold a reference to 'inode' so it couldn't have been
@@ -952,30 +955,27 @@ __add_dquot_ref(spinlock_t *lock, struct inode *inode, int type,
 	 * holding the last reference and we cannot iput it under
 	 * pcpu_lock. So we keep the reference and iput it later.
 	 */
-	*old_inode = inode;
-	spin_lock(lock);
+	arg->old_inode = inode;
+	spin_lock(pcpu_lock);
+	return 0;
 }
 
 /* This routine is guarded by dqonoff_mutex mutex */
 static void add_dquot_ref(struct super_block *sb, int type)
 {
-	struct inode *inode, *old_inode = NULL;
-#ifdef CONFIG_QUOTA_DEBUG
-	int reserved = 0;
-#endif
+	struct add_dquot_arg arg;
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list)
-		__add_dquot_ref(&sb->s_inode_list_lock, inode, type,
 #ifdef CONFIG_QUOTA_DEBUG
-				&reserved,
+	arg.reserved = 0;
 #endif
-				&old_inode);
-	spin_unlock(&sb->s_inode_list_lock);
-	iput(old_inode);
+	arg.old_inode = NULL;
+	arg.type = type;
+
+	SB_INODES_ITER_CALL(add_dquot, sb);
+	iput(arg.old_inode);
 
 #ifdef CONFIG_QUOTA_DEBUG
-	if (reserved) {
+	if (arg.reserved) {
 		quota_error(sb, "Writes happened before quota was turned on "
 			"thus quota information is probably inconsistent. "
 			"Please run quotacheck(8)");
@@ -1034,10 +1034,16 @@ static void put_dquot_list(struct list_head *tofree_head)
 	}
 }
 
-static inline void
-__remove_dquot_ref(struct inode *inode, int type,
-		   struct list_head *tofree_head, int *reserved)
+/*
+ * remove_dquot_iter - iteration function for each inode of a superblock
+ */
+SB_INODES_ITER_FUNC(remove_dquot, pcpu_lock,
+		    struct list_head *tofree_head;
+		    int type;
+		    int reserved)
 {
+	SB_INODES_ITER_ARGS(remove_dquot, inode, arg);
+
 	/*
 	 *  We have to scan also I_NEW inodes because they can already
 	 *  have quota pointer initialized. Luckily, we need to touch
@@ -1047,25 +1053,26 @@ __remove_dquot_ref(struct inode *inode, int type,
 	spin_lock(&dq_data_lock);
 	if (!IS_NOQUOTA(inode)) {
 		if (unlikely(inode_get_rsv_space(inode) > 0))
-			*reserved = 1;
-		remove_inode_dquot_ref(inode, type, tofree_head);
+			arg->reserved = 1;
+		remove_inode_dquot_ref(inode, arg->type, arg->tofree_head);
 	}
 	spin_unlock(&dq_data_lock);
+	return 0;
 }
 
 static void remove_dquot_ref(struct super_block *sb, int type,
 		struct list_head *tofree_head)
 {
-	struct inode *inode;
-	int reserved = 0;
+	struct remove_dquot_arg arg;
+
+	arg.reserved = 0;
+	arg.type = type;
+	arg.tofree_head = tofree_head;
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list)
-		__remove_dquot_ref(inode, type, tofree_head, &reserved);
+	SB_INODES_ITER_CALL(remove_dquot, sb);
 
-	spin_unlock(&sb->s_inode_list_lock);
 #ifdef CONFIG_QUOTA_DEBUG
-	if (reserved) {
+	if (arg.reserved) {
 		printk(KERN_WARNING "VFS (%s): Writes happened after quota"
 			" was disabled thus quota information is probably "
 			"inconsistent. Please run quotacheck(8).\n", sb->s_id);
diff --git a/fs/super.c b/fs/super.c
index 1182af8..7d44fad 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -163,6 +163,7 @@ static void destroy_super(struct super_block *s)
 {
 	list_lru_destroy(&s->s_dentry_lru);
 	list_lru_destroy(&s->s_inode_lru);
+	free_pcpu_list_head(&s->s_inodes);
 	security_sb_free(s);
 	WARN_ON(!list_empty(&s->s_mounts));
 	kfree(s->s_subtype);
@@ -204,9 +205,9 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 	INIT_HLIST_NODE(&s->s_instances);
 	INIT_HLIST_BL_HEAD(&s->s_anon);
 	mutex_init(&s->s_sync_lock);
-	INIT_LIST_HEAD(&s->s_inodes);
-	spin_lock_init(&s->s_inode_list_lock);
 
+	if (init_pcpu_list_head(&s->s_inodes))
+		goto fail;
 	if (list_lru_init_memcg(&s->s_dentry_lru))
 		goto fail;
 	if (list_lru_init_memcg(&s->s_inode_lru))
@@ -426,7 +427,7 @@ void generic_shutdown_super(struct super_block *sb)
 		if (sop->put_super)
 			sop->put_super(sb);
 
-		if (!list_empty(&sb->s_inodes)) {
+		if (!pcpu_list_empty(sb->s_inodes)) {
 			printk("VFS: Busy inodes after unmount of %s. "
 			   "Self-destruct in 5 seconds.  Have a nice day...\n",
 			   sb->s_id);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ae68100..c30cdb6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -27,6 +27,7 @@
 #include <linux/migrate_mode.h>
 #include <linux/uidgid.h>
 #include <linux/lockdep.h>
+#include <linux/percpu-list.h>
 #include <linux/percpu-rwsem.h>
 #include <linux/blk_types.h>
 #include <linux/workqueue.h>
@@ -648,7 +649,7 @@ struct inode {
 	u16			i_wb_frn_history;
 #endif
 	struct list_head	i_lru;		/* inode LRU list */
-	struct list_head	i_sb_list;
+	struct pcpu_list_node	i_sb_list;
 	union {
 		struct hlist_head	i_dentry;
 		struct rcu_head		i_rcu;
@@ -1397,11 +1398,39 @@ struct super_block {
 	 */
 	int s_stack_depth;
 
-	/* s_inode_list_lock protects s_inodes */
-	spinlock_t		s_inode_list_lock ____cacheline_aligned_in_smp;
-	struct list_head	s_inodes;	/* all inodes */
+	/* The percpu locks protect s_inodes */
+	struct pcpu_list_head __percpu *s_inodes;	/* all inodes */
 };
 
+/*
+ * Superblock's inode list iterator function and arguments macros
+ */
+#define SB_INODES_ITER_FUNC(name, lock, struct_fields)			\
+	struct name ## _arg {						\
+		struct_fields;						\
+	};								\
+	static int name ## _iter(struct pcpu_list_node *_node,		\
+				 struct pcpu_list_node **_pnext,	\
+				 spinlock_t *lock, void *_arg)
+
+#define SB_INODES_ITER_ARGS(name, i, a)					\
+	struct inode *i = container_of(_node, struct inode, i_sb_list);	\
+	struct name ## _arg *a = (struct name ## _arg *)_arg
+
+#define SB_INODES_ITER_ARGS_SAFE(name, i, n, a)				\
+	struct inode *i = container_of(_node, struct inode, i_sb_list);	\
+	struct inode *n = container_of(*_pnext, struct inode, i_sb_list);\
+	struct name ## _arg *a = (struct name ## _arg *)_arg
+
+#define SB_INODES_ITER_SET_PCPU_LIST_NEXT(n)				\
+	{ *_pnext = &(n)->i_sb_list; }
+
+#define SB_INODES_ITER_CALL(name, sb)					\
+	pcpu_list_iterate(sb->s_inodes, false, NULL, name ## _iter, &arg)
+
+#define SB_INODES_ITER_CALL_SAFE(name, sb, phead)			\
+	pcpu_list_iterate(sb->s_inodes, true, phead, name ## _iter, &arg)
+
 extern struct timespec current_fs_time(struct super_block *sb);
 
 /*
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 3/3] vfs: Use per-cpu list for superblock's inode list
  2016-02-19 21:10 ` [PATCH v2 3/3] vfs: Use per-cpu list for superblock's inode list Waiman Long
@ 2016-02-21 21:34   ` Dave Chinner
  2016-02-22  9:18     ` Peter Zijlstra
  2016-02-23 18:56     ` Waiman Long
  0 siblings, 2 replies; 13+ messages in thread
From: Dave Chinner @ 2016-02-21 21:34 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Scott J Norton, Douglas Hatch

On Fri, Feb 19, 2016 at 04:10:45PM -0500, Waiman Long wrote:
> +/*
> + * Superblock's inode list iterator function and arguments macros
> + */
> +#define SB_INODES_ITER_FUNC(name, lock, struct_fields)			\
> +	struct name ## _arg {						\
> +		struct_fields;						\
> +	};								\
> +	static int name ## _iter(struct pcpu_list_node *_node,		\
> +				 struct pcpu_list_node **_pnext,	\
> +				 spinlock_t *lock, void *_arg)
> +
> +#define SB_INODES_ITER_ARGS(name, i, a)					\
> +	struct inode *i = container_of(_node, struct inode, i_sb_list);	\
> +	struct name ## _arg *a = (struct name ## _arg *)_arg
> +
> +#define SB_INODES_ITER_ARGS_SAFE(name, i, n, a)				\
> +	struct inode *i = container_of(_node, struct inode, i_sb_list);	\
> +	struct inode *n = container_of(*_pnext, struct inode, i_sb_list);\
> +	struct name ## _arg *a = (struct name ## _arg *)_arg
> +
> +#define SB_INODES_ITER_SET_PCPU_LIST_NEXT(n)				\
> +	{ *_pnext = &(n)->i_sb_list; }
> +
> +#define SB_INODES_ITER_CALL(name, sb)					\
> +	pcpu_list_iterate(sb->s_inodes, false, NULL, name ## _iter, &arg)
> +
> +#define SB_INODES_ITER_CALL_SAFE(name, sb, phead)			\
> +	pcpu_list_iterate(sb->s_inodes, true, phead, name ## _iter, &arg)
> +

No, just no.

Ungreppable, breaks cscope, obfuscates everything, shouts a lot,
code using the API looks completely broken (e.g. semi-colons in
"function declarations"), and it reminds me of the worst of the
worst unmaintainable code in an exceedingly buggy and undebuggable
proprietary filesystem I've previously had the "joy" of working
with.

Just fix the bug in the previous version; it's so much cleaner than
this .... mess.
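
[ For reference, here is roughly what the SB_INODES_ITER_FUNC(add_dquot,
  ...) call site in the dquot.c hunk expands to once the preprocessor is
  done with it -- a sketch, not actual compiler output.  The
  struct_fields parameter is a semicolon-separated field list smuggled
  through a macro argument, which is why the "function declaration" at
  the call site contains semicolons: ]

	struct add_dquot_arg {
		struct inode *old_inode;
		int type;
		int reserved;
	};
	static int add_dquot_iter(struct pcpu_list_node *_node,
				  struct pcpu_list_node **_pnext,
				  spinlock_t *pcpu_lock, void *_arg)
	{
		struct inode *inode =
			container_of(_node, struct inode, i_sb_list);
		struct add_dquot_arg *arg = (struct add_dquot_arg *)_arg;
		/* ... iterator body as written in the patch ... */
	}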

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 3/3] vfs: Use per-cpu list for superblock's inode list
  2016-02-21 21:34   ` Dave Chinner
@ 2016-02-22  9:18     ` Peter Zijlstra
  2016-02-22 11:54       ` Jan Kara
  2016-02-23 18:56     ` Waiman Long
  1 sibling, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2016-02-22  9:18 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Waiman Long, Alexander Viro, Jan Kara, Jeff Layton,
	J. Bruce Fields, Tejun Heo, Christoph Lameter, linux-fsdevel,
	linux-kernel, Ingo Molnar, Andi Kleen, Dave Chinner,
	Scott J Norton, Douglas Hatch

On Mon, Feb 22, 2016 at 08:34:19AM +1100, Dave Chinner wrote:
> On Fri, Feb 19, 2016 at 04:10:45PM -0500, Waiman Long wrote:
> > +/*
> > + * Superblock's inode list iterator function and arguments macros
> > + */
> > +#define SB_INODES_ITER_FUNC(name, lock, struct_fields)			\
> > +	struct name ## _arg {						\
> > +		struct_fields;						\
> > +	};								\
> > +	static int name ## _iter(struct pcpu_list_node *_node,		\
> > +				 struct pcpu_list_node **_pnext,	\
> > +				 spinlock_t *lock, void *_arg)
> > +
> > +#define SB_INODES_ITER_ARGS(name, i, a)					\
> > +	struct inode *i = container_of(_node, struct inode, i_sb_list);	\
> > +	struct name ## _arg *a = (struct name ## _arg *)_arg
> > +
> > +#define SB_INODES_ITER_ARGS_SAFE(name, i, n, a)				\
> > +	struct inode *i = container_of(_node, struct inode, i_sb_list);	\
> > +	struct inode *n = container_of(*_pnext, struct inode, i_sb_list);\
> > +	struct name ## _arg *a = (struct name ## _arg *)_arg
> > +
> > +#define SB_INODES_ITER_SET_PCPU_LIST_NEXT(n)				\
> > +	{ *_pnext = &(n)->i_sb_list; }
> > +
> > +#define SB_INODES_ITER_CALL(name, sb)					\
> > +	pcpu_list_iterate(sb->s_inodes, false, NULL, name ## _iter, &arg)
> > +
> > +#define SB_INODES_ITER_CALL_SAFE(name, sb, phead)			\
> > +	pcpu_list_iterate(sb->s_inodes, true, phead, name ## _iter, &arg)
> > +
> 
> No, just no.

Ha! See I was thinking of something more along the lines of the below.

Which is completely untested and incomplete, but does illustrate the
point.

Also, I think fsnotify_unmount_inodes() (as per mainline) is missing a
final iput(need_iput) at the very end, but I could be mistaken, that
code hurts my brain.

---
 fs/block_dev.c         |   51 +++++++++++---------------
 fs/drop_caches.c       |   32 +++++-----------
 fs/fs-writeback.c      |   65 ++++++++--------------------------
 fs/notify/inode_mark.c |   93 ++++++++++---------------------------------------
 fs/quota/dquot.c       |   63 ++++++++-------------------------
 fs/super.c             |   65 +++++++++++++++++++++++++++++++++-
 include/linux/fs.h     |    9 ++--
 7 files changed, 154 insertions(+), 224 deletions(-)

--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1862,38 +1862,29 @@ int __invalidate_device(struct block_dev
 }
 EXPORT_SYMBOL(__invalidate_device);
 
-void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
-{
-	struct inode *inode, *old_inode = NULL;
+struct iterate_bdevs_data {
+	void (*func)(struct block_device *, void *);
+	void *arg;
+};
 
-	spin_lock(&blockdev_superblock->s_inode_list_lock);
-	list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
-		struct address_space *mapping = inode->i_mapping;
+static void __iterate_bdevs(struct inode *inode, void *arg)
+{
+	struct iterate_bdevs_data *ibd = arg;
 
-		spin_lock(&inode->i_lock);
-		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
-		    mapping->nrpages == 0) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-		__iget(inode);
-		spin_unlock(&inode->i_lock);
-		spin_unlock(&blockdev_superblock->s_inode_list_lock);
-		/*
-		 * We hold a reference to 'inode' so it couldn't have been
-		 * removed from s_inodes list while we dropped the
-		 * s_inode_list_lock  We cannot iput the inode now as we can
-		 * be holding the last reference and we cannot iput it under
-		 * s_inode_list_lock. So we keep the reference and iput it
-		 * later.
-		 */
-		iput(old_inode);
-		old_inode = inode;
+	ibd->func(I_BDEV(inode), ibd->arg);
+}
 
-		func(I_BDEV(inode), arg);
+static bool __iterate_bdevs_cond(struct inode *inode, void *arg)
+{
+	return inode->i_mapping->nrpages != 0;
+}
 
-		spin_lock(&blockdev_superblock->s_inode_list_lock);
-	}
-	spin_unlock(&blockdev_superblock->s_inode_list_lock);
-	iput(old_inode);
+void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
+{
+	struct iterate_bdevs_data ibd = {
+		.func = func,
+		.arg = arg,
+	};
+	iterate_inodes_super(blockdev_superblock, &__iterate_bdevs_cond,
+			     &__iterate_bdevs, &ibd);
 }
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -13,30 +13,20 @@
 /* A global variable is a bit ugly, but it keeps the code simple */
 int sysctl_drop_caches;
 
-static void drop_pagecache_sb(struct super_block *sb, void *unused)
+static void __drop_pagecache_sb(struct inode *inode, void *unused)
 {
-	struct inode *inode, *toput_inode = NULL;
-
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		spin_lock(&inode->i_lock);
-		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
-		    (inode->i_mapping->nrpages == 0)) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-		__iget(inode);
-		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
+	invalidate_mapping_pages(inode->i_mapping, 0, -1);
+}
 
-		invalidate_mapping_pages(inode->i_mapping, 0, -1);
-		iput(toput_inode);
-		toput_inode = inode;
+static bool __drop_pagecache_sb_cond(struct inode *inode, void *unused)
+{
+	return inode->i_mapping->nrpages != 0;
+}
 
-		spin_lock(&sb->s_inode_list_lock);
-	}
-	spin_unlock(&sb->s_inode_list_lock);
-	iput(toput_inode);
+static void drop_pagecache_sb(struct super_block *sb, void *unused)
+{
+	iterate_inodes_super(sb, &__drop_pagecache_sb_cond,
+			     &__drop_pagecache_sb, NULL);
 }
 
 int drop_caches_sysctl_handler(struct ctl_table *table, int write,
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -2102,6 +2102,21 @@ void __mark_inode_dirty(struct inode *in
 }
 EXPORT_SYMBOL(__mark_inode_dirty);
 
+static void __wait_sb_inodes(struct inode *inode, void *arg)
+{
+	/*
+	 * We keep the error status of individual mapping so that
+	 * applications can catch the writeback error using fsync(2).
+	 * See filemap_fdatawait_keep_errors() for details.
+	 */
+	filemap_fdatawait_keep_errors(inode->i_mapping);
+}
+
+static bool __wait_sb_inodes_cond(struct inode *inode, void *arg)
+{
+	return inode->i_mapping->nrpages != 0;
+}
+
 /*
  * The @s_sync_lock is used to serialise concurrent sync operations
  * to avoid lock contention problems with concurrent wait_sb_inodes() calls.
@@ -2113,8 +2128,6 @@ EXPORT_SYMBOL(__mark_inode_dirty);
  */
 static void wait_sb_inodes(struct super_block *sb)
 {
-	struct inode *inode, *old_inode = NULL;
-
 	/*
 	 * We need to be protected against the filesystem going from
 	 * r/o to r/w or vice versa.
@@ -2122,52 +2135,8 @@ static void wait_sb_inodes(struct super_
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
 	mutex_lock(&sb->s_sync_lock);
-	spin_lock(&sb->s_inode_list_lock);
-
-	/*
-	 * Data integrity sync. Must wait for all pages under writeback,
-	 * because there may have been pages dirtied before our sync
-	 * call, but which had writeout started before we write it out.
-	 * In which case, the inode may not be on the dirty list, but
-	 * we still have to wait for that writeout.
-	 */
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		struct address_space *mapping = inode->i_mapping;
-
-		spin_lock(&inode->i_lock);
-		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
-		    (mapping->nrpages == 0)) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-		__iget(inode);
-		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
-
-		/*
-		 * We hold a reference to 'inode' so it couldn't have been
-		 * removed from s_inodes list while we dropped the
-		 * s_inode_list_lock.  We cannot iput the inode now as we can
-		 * be holding the last reference and we cannot iput it under
-		 * s_inode_list_lock. So we keep the reference and iput it
-		 * later.
-		 */
-		iput(old_inode);
-		old_inode = inode;
-
-		/*
-		 * We keep the error status of individual mapping so that
-		 * applications can catch the writeback error using fsync(2).
-		 * See filemap_fdatawait_keep_errors() for details.
-		 */
-		filemap_fdatawait_keep_errors(mapping);
-
-		cond_resched();
-
-		spin_lock(&sb->s_inode_list_lock);
-	}
-	spin_unlock(&sb->s_inode_list_lock);
-	iput(old_inode);
+	iterate_inodes_super(sb, &__wait_sb_inodes_cond,
+			     &__wait_sb_inodes, NULL);
 	mutex_unlock(&sb->s_sync_lock);
 }
 
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -141,6 +141,24 @@ int fsnotify_add_inode_mark(struct fsnot
 	return ret;
 }
 
+static void __fsnotify_umount_inodes(struct inode *inode, void *arg)
+{
+	/* for each watch, send FS_UNMOUNT and then remove it */
+	fsnotify(inode, FS_UNMOUNT, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
+	fsnotify_inode_delete(inode);
+}
+
+static bool __fsnotify_umount_cond(struct inode *inode, void *arg)
+{
+	/*
+	 * If i_count is zero, the inode cannot have any watches and
+	 * doing an __iget/iput with MS_ACTIVE clear would actually
+	 * evict all inodes with zero i_count from icache which is
+	 * unnecessarily violent and may in fact be illegal to do.
+	 */
+	return atomic_read(&inode->i_count);
+}
+
 /**
  * fsnotify_unmount_inodes - an sb is unmounting.  handle any watched inodes.
  * @sb: superblock being unmounted.
@@ -150,77 +168,6 @@ int fsnotify_add_inode_mark(struct fsnot
  */
 void fsnotify_unmount_inodes(struct super_block *sb)
 {
-	struct inode *inode, *next_i, *need_iput = NULL;
-
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next_i, &sb->s_inodes, i_sb_list) {
-		struct inode *need_iput_tmp;
-
-		/*
-		 * We cannot __iget() an inode in state I_FREEING,
-		 * I_WILL_FREE, or I_NEW which is fine because by that point
-		 * the inode cannot have any associated watches.
-		 */
-		spin_lock(&inode->i_lock);
-		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-
-		/*
-		 * If i_count is zero, the inode cannot have any watches and
-		 * doing an __iget/iput with MS_ACTIVE clear would actually
-		 * evict all inodes with zero i_count from icache which is
-		 * unnecessarily violent and may in fact be illegal to do.
-		 */
-		if (!atomic_read(&inode->i_count)) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-
-		need_iput_tmp = need_iput;
-		need_iput = NULL;
-
-		/* In case fsnotify_inode_delete() drops a reference. */
-		if (inode != need_iput_tmp)
-			__iget(inode);
-		else
-			need_iput_tmp = NULL;
-		spin_unlock(&inode->i_lock);
-
-		/* In case the dropping of a reference would nuke next_i. */
-		while (&next_i->i_sb_list != &sb->s_inodes) {
-			spin_lock(&next_i->i_lock);
-			if (!(next_i->i_state & (I_FREEING | I_WILL_FREE)) &&
-						atomic_read(&next_i->i_count)) {
-				__iget(next_i);
-				need_iput = next_i;
-				spin_unlock(&next_i->i_lock);
-				break;
-			}
-			spin_unlock(&next_i->i_lock);
-			next_i = list_next_entry(next_i, i_sb_list);
-		}
-
-		/*
-		 * We can safely drop s_inode_list_lock here because either
-		 * we actually hold references on both inode and next_i or
-		 * end of list.  Also no new inodes will be added since the
-		 * umount has begun.
-		 */
-		spin_unlock(&sb->s_inode_list_lock);
-
-		if (need_iput_tmp)
-			iput(need_iput_tmp);
-
-		/* for each watch, send FS_UNMOUNT and then remove it */
-		fsnotify(inode, FS_UNMOUNT, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
-
-		fsnotify_inode_delete(inode);
-
-		iput(inode);
-
-		spin_lock(&sb->s_inode_list_lock);
-	}
-	spin_unlock(&sb->s_inode_list_lock);
+	iterate_inodes_super(sb, &__fsnotify_umount_cond,
+			     &__fsnotify_umount_inodes, NULL);
 }
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -920,55 +920,26 @@ static int dqinit_needed(struct inode *i
 	return 0;
 }
 
+static void __add_dquot_ref(struct inode *inode, void *arg)
+{
+	int type = (int)(long)(arg);
+
+	__dquot_initialize(inode, type);
+}
+
+static bool __add_dquot_ref_cond(struct inode *inode, void *arg)
+{
+	int type = (int)(long)(arg);
+
+	return atomic_read(&inode->i_writecount) && dqinit_needed(inode, type);
+}
+
 /* This routine is guarded by dqonoff_mutex mutex */
 static void add_dquot_ref(struct super_block *sb, int type)
 {
-	struct inode *inode, *old_inode = NULL;
-#ifdef CONFIG_QUOTA_DEBUG
-	int reserved = 0;
-#endif
-
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		spin_lock(&inode->i_lock);
-		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
-		    !atomic_read(&inode->i_writecount) ||
-		    !dqinit_needed(inode, type)) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-		__iget(inode);
-		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
-
-#ifdef CONFIG_QUOTA_DEBUG
-		if (unlikely(inode_get_rsv_space(inode) > 0))
-			reserved = 1;
-#endif
-		iput(old_inode);
-		__dquot_initialize(inode, type);
-
-		/*
-		 * We hold a reference to 'inode' so it couldn't have been
-		 * removed from s_inodes list while we dropped the
-		 * s_inode_list_lock. We cannot iput the inode now as we can be
-		 * holding the last reference and we cannot iput it under
-		 * s_inode_list_lock. So we keep the reference and iput it
-		 * later.
-		 */
-		old_inode = inode;
-		spin_lock(&sb->s_inode_list_lock);
-	}
-	spin_unlock(&sb->s_inode_list_lock);
-	iput(old_inode);
-
-#ifdef CONFIG_QUOTA_DEBUG
-	if (reserved) {
-		quota_error(sb, "Writes happened before quota was turned on "
-			"thus quota information is probably inconsistent. "
-			"Please run quotacheck(8)");
-	}
-#endif
+	/* XXX we lost debug code */
+	iterate_inodes_super(sb, &__add_dquot_ref_cond,
+			     &__add_dquot_ref, (void *)type);
 }
 
 /*
--- a/fs/super.c
+++ b/fs/super.c
@@ -204,7 +204,7 @@ static struct super_block *alloc_super(s
 	INIT_HLIST_NODE(&s->s_instances);
 	INIT_HLIST_BL_HEAD(&s->s_anon);
 	mutex_init(&s->s_sync_lock);
-	INIT_LIST_HEAD(&s->s_inodes);
+	PCPU_LIST_HEAD_INIT(&s->s_inodes_cpu);
 	spin_lock_init(&s->s_inode_list_lock);
 
 	if (list_lru_init_memcg(&s->s_dentry_lru))
@@ -253,6 +253,67 @@ static struct super_block *alloc_super(s
 	return NULL;
 }
 
+/*
+ * Iterate over all the inodes of this superblock.
+ */
+void iterate_inodes_super(struct super_block *sb, 
+			  bool (*cond)(struct inode *, void *),
+			  void (*func)(struct inode *, void *), void *arg)
+{
+	struct inode *inode, *n_inode, *old_inode = NULL;
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		struct list_head *head = &per_cpu_ptr(&sb->s_inodes_cpu, cpu)->list;
+		spinlock_t       *lock = &per_cpu_ptr(&sb->s_inodes_cpu, cpu)->lock;
+
+		spin_lock(lock);
+		list_for_each_entry_safe(inode, n_inode, head, i_sb_list) {
+			struct inode *put_inode;
+
+			spin_lock(&inode->i_lock);
+			if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
+			    !cond(inode, arg)) {
+				spin_unlock(&inode->i_lock);
+				continue;
+			}
+
+			put_inode = old_inode;
+			old_inode = NULL;
+
+			if (inode != put_inode)
+				__iget(inode);
+			else
+				put_inode = NULL;
+			spin_unlock(&inode->i_lock);
+
+			while (&n_inode->i_sb_list != head) {
+				spin_lock(&n_inode->i_lock);
+				if (!((n_inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
+				     !cond(n_inode, arg))) {
+					__iget(n_inode);
+					old_inode = n_inode;
+					spin_unlock(&n_inode->i_lock);
+					break;
+				}
+				spin_unlock(&n_inode->i_lock);
+				n_inode = list_next_entry(n_inode, i_sb_list);
+			}
+
+			spin_unlock(lock);
+
+			iput(put_inode);
+			func(inode, arg);
+			iput(inode);
+			cond_resched();
+
+			spin_lock(lock);
+		}
+		spin_unlock(lock);
+	}
+	iput(old_inode);
+}
+
 /* Superblock refcounting  */
 
 /*
@@ -426,7 +487,7 @@ void generic_shutdown_super(struct super
 		if (sop->put_super)
 			sop->put_super(sb);
 
-		if (!list_empty(&sb->s_inodes)) {
+		if (!pcpu_list_empty(&sb->s_inodes_cpu)) {
 			printk("VFS: Busy inodes after unmount of %s. "
 			   "Self-destruct in 5 seconds.  Have a nice day...\n",
 			   sb->s_id);
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -648,7 +648,7 @@ struct inode {
 	u16			i_wb_frn_history;
 #endif
 	struct list_head	i_lru;		/* inode LRU list */
-	struct list_head	i_sb_list;
+	struct pcpu_list_node	i_sb_list;
 	union {
 		struct hlist_head	i_dentry;
 		struct rcu_head		i_rcu;
@@ -1397,11 +1397,12 @@ struct super_block {
 	 */
 	int s_stack_depth;
 
-	/* s_inode_list_lock protects s_inodes */
-	spinlock_t		s_inode_list_lock ____cacheline_aligned_in_smp;
-	struct list_head	s_inodes;	/* all inodes */
+	struct pcpu_list_head	s_inodes_cpu;
 };
 
+extern void iterate_inodes_super(struct super_block *sb, bool (*cond)(struct inode *, void *),
+			  void (*func)(struct inode *, void *), void *arg);
+
 extern struct timespec current_fs_time(struct super_block *sb);
 
 /*

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 3/3] vfs: Use per-cpu list for superblock's inode list
  2016-02-22  9:18     ` Peter Zijlstra
@ 2016-02-22 11:54       ` Jan Kara
  2016-02-22 12:12         ` Peter Zijlstra
  0 siblings, 1 reply; 13+ messages in thread
From: Jan Kara @ 2016-02-22 11:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Chinner, Waiman Long, Alexander Viro, Jan Kara, Jeff Layton,
	J. Bruce Fields, Tejun Heo, Christoph Lameter, linux-fsdevel,
	linux-kernel, Ingo Molnar, Andi Kleen, Dave Chinner,
	Scott J Norton, Douglas Hatch

On Mon 22-02-16 10:18:44, Peter Zijlstra wrote:
> On Mon, Feb 22, 2016 at 08:34:19AM +1100, Dave Chinner wrote:
> > On Fri, Feb 19, 2016 at 04:10:45PM -0500, Waiman Long wrote:
> > > +/*
> > > + * Superblock's inode list iterator function and arguments macros
> > > + */
> > > +#define SB_INODES_ITER_FUNC(name, lock, struct_fields)			\
> > > +	struct name ## _arg {						\
> > > +		struct_fields;						\
> > > +	};								\
> > > +	static int name ## _iter(struct pcpu_list_node *_node,		\
> > > +				 struct pcpu_list_node **_pnext,	\
> > > +				 spinlock_t *lock, void *_arg)
> > > +
> > > +#define SB_INODES_ITER_ARGS(name, i, a)					\
> > > +	struct inode *i = container_of(_node, struct inode, i_sb_list);	\
> > > +	struct name ## _arg *a = (struct name ## _arg *)_arg
> > > +
> > > +#define SB_INODES_ITER_ARGS_SAFE(name, i, n, a)				\
> > > +	struct inode *i = container_of(_node, struct inode, i_sb_list);	\
> > > +	struct inode *n = container_of(*_pnext, struct inode, i_sb_list);\
> > > +	struct name ## _arg *a = (struct name ## _arg *)_arg
> > > +
> > > +#define SB_INODES_ITER_SET_PCPU_LIST_NEXT(n)				\
> > > +	{ *_pnext = &(n)->i_sb_list; }
> > > +
> > > +#define SB_INODES_ITER_CALL(name, sb)					\
> > > +	pcpu_list_iterate(sb->s_inodes, false, NULL, name ## _iter, &arg)
> > > +
> > > +#define SB_INODES_ITER_CALL_SAFE(name, sb, phead)			\
> > > +	pcpu_list_iterate(sb->s_inodes, true, phead, name ## _iter, &arg)
> > > +
> > 
> > No, just no.
> 
> Ha! See I was thinking of something more along the lines of the below.
> 
> Which is completely untested and incomplete, but does illustrate the
> point.
> 
> Also, I think fsnotify_unmount_inodes() (as per mainline) is missing a
> final iput(need_iput) at the very end, but I could be mistaken, that
> code hurts my brain.

I think the code is actually correct since need_iput contains "inode
further in the list than the current inode". Thus we will always go through
another iteration of the loop which will drop the reference. And inode
cannot change state to I_FREEING or I_WILL_FREE because we hold inode
reference. But it is subtle as hell so I agree that code needs rewrite.
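
[ A stripped-down skeleton of the loop in question (simplified from
  mainline fs/notify/inode_mark.c, locking and state checks elided)
  may make that invariant easier to see: ]

	struct inode *inode, *next_i, *need_iput = NULL;

	list_for_each_entry_safe(inode, next_i, &sb->s_inodes, i_sb_list) {
		struct inode *need_iput_tmp = need_iput;

		need_iput = NULL;
		if (inode != need_iput_tmp)
			__iget(inode);		/* take our own reference */
		else
			need_iput_tmp = NULL;	/* reuse the earlier pin */

		/* walk forward and __iget() the first live inode after
		 * this one into need_iput, so next_i cannot be freed
		 * while s_inode_list_lock is dropped */

		iput(need_iput_tmp);	/* the old pin, never the new one */
		/* fsnotify work on inode, then: */
		iput(inode);
	}
	/*
	 * No trailing iput(need_iput): a non-NULL need_iput always
	 * points past the current inode, so the loop necessarily runs
	 * at least one more iteration, and that iteration drops the
	 * reference (either as "inode" itself or as need_iput_tmp).
	 */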

								Honza
> 
> ---
>  fs/block_dev.c         |   51 +++++++++++---------------
>  fs/drop_caches.c       |   32 +++++-----------
>  fs/fs-writeback.c      |   65 ++++++++--------------------------
>  fs/notify/inode_mark.c |   93 ++++++++++---------------------------------------
>  fs/quota/dquot.c       |   63 ++++++++-------------------------
>  fs/super.c             |   65 +++++++++++++++++++++++++++++++++-
>  include/linux/fs.h     |    9 ++--
>  7 files changed, 154 insertions(+), 224 deletions(-)
> 
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1862,38 +1862,29 @@ int __invalidate_device(struct block_dev
>  }
>  EXPORT_SYMBOL(__invalidate_device);
>  
> -void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
> -{
> -	struct inode *inode, *old_inode = NULL;
> +struct iterate_bdevs_data {
> +	void (*func)(struct block_device *, void *);
> +	void *arg;
> +};
>  
> -	spin_lock(&blockdev_superblock->s_inode_list_lock);
> -	list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
> -		struct address_space *mapping = inode->i_mapping;
> +static void __iterate_bdevs(struct inode *inode, void *arg)
> +{
> +	struct iterate_bdevs_data *ibd = arg;
>  
> -		spin_lock(&inode->i_lock);
> -		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
> -		    mapping->nrpages == 0) {
> -			spin_unlock(&inode->i_lock);
> -			continue;
> -		}
> -		__iget(inode);
> -		spin_unlock(&inode->i_lock);
> -		spin_unlock(&blockdev_superblock->s_inode_list_lock);
> -		/*
> -		 * We hold a reference to 'inode' so it couldn't have been
> -		 * removed from s_inodes list while we dropped the
> -		 * s_inode_list_lock  We cannot iput the inode now as we can
> -		 * be holding the last reference and we cannot iput it under
> -		 * s_inode_list_lock. So we keep the reference and iput it
> -		 * later.
> -		 */
> -		iput(old_inode);
> -		old_inode = inode;
> +	ibd->func(I_BDEV(inode), ibd->arg);
> +}
>  
> -		func(I_BDEV(inode), arg);
> +static bool __iterate_bdevs_cond(struct inode *inode, void *arg)
> +{
> +	return inode->i_mapping->nrpages != 0;
> +}
>  
> -		spin_lock(&blockdev_superblock->s_inode_list_lock);
> -	}
> -	spin_unlock(&blockdev_superblock->s_inode_list_lock);
> -	iput(old_inode);
> +void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
> +{
> +	struct iterate_bdevs_data ibd = {
> +		.func = func,
> +		.arg = arg,
> +	};
> +	iterate_inodes_super(blockdev_superblock, &__iterate_bdevs_cond,
> +			     &__iterate_bdevs, &ibd);
>  }
> --- a/fs/drop_caches.c
> +++ b/fs/drop_caches.c
> @@ -13,30 +13,20 @@
>  /* A global variable is a bit ugly, but it keeps the code simple */
>  int sysctl_drop_caches;
>  
> -static void drop_pagecache_sb(struct super_block *sb, void *unused)
> +static void __drop_pagecache_sb(struct inode *inode, void *unused)
>  {
> -	struct inode *inode, *toput_inode = NULL;
> -
> -	spin_lock(&sb->s_inode_list_lock);
> -	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> -		spin_lock(&inode->i_lock);
> -		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> -		    (inode->i_mapping->nrpages == 0)) {
> -			spin_unlock(&inode->i_lock);
> -			continue;
> -		}
> -		__iget(inode);
> -		spin_unlock(&inode->i_lock);
> -		spin_unlock(&sb->s_inode_list_lock);
> +	invalidate_mapping_pages(inode->i_mapping, 0, -1);
> +}
>  
> -		invalidate_mapping_pages(inode->i_mapping, 0, -1);
> -		iput(toput_inode);
> -		toput_inode = inode;
> +static bool __drop_pagecache_sb_cond(struct inode *inode, void *unused)
> +{
> +	return inode->i_mapping->nrpages != 0;
> +}
>  
> -		spin_lock(&sb->s_inode_list_lock);
> -	}
> -	spin_unlock(&sb->s_inode_list_lock);
> -	iput(toput_inode);
> +static void drop_pagecache_sb(struct super_block *sb, void *unused)
> +{
> +	iterate_inodes_super(sb, &__drop_pagecache_sb_cond,
> +			     &__drop_pagecache_sb, NULL);
>  }
>  
>  int drop_caches_sysctl_handler(struct ctl_table *table, int write,
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -2102,6 +2102,21 @@ void __mark_inode_dirty(struct inode *in
>  }
>  EXPORT_SYMBOL(__mark_inode_dirty);
>  
> +static void __wait_sb_inodes(struct inode *inode, void *arg)
> +{
> +	/*
> +	 * We keep the error status of individual mapping so that
> +	 * applications can catch the writeback error using fsync(2).
> +	 * See filemap_fdatawait_keep_errors() for details.
> +	 */
> +	filemap_fdatawait_keep_errors(inode->i_mapping);
> +}
> +
> +static bool __wait_sb_inodes_cond(struct inode *inode, void *arg)
> +{
> +	return inode->i_mapping->nrpages != 0;
> +}
> +
>  /*
>   * The @s_sync_lock is used to serialise concurrent sync operations
>   * to avoid lock contention problems with concurrent wait_sb_inodes() calls.
> @@ -2113,8 +2128,6 @@ EXPORT_SYMBOL(__mark_inode_dirty);
>   */
>  static void wait_sb_inodes(struct super_block *sb)
>  {
> -	struct inode *inode, *old_inode = NULL;
> -
>  	/*
>  	 * We need to be protected against the filesystem going from
>  	 * r/o to r/w or vice versa.
> @@ -2122,52 +2135,8 @@ static void wait_sb_inodes(struct super_
>  	WARN_ON(!rwsem_is_locked(&sb->s_umount));
>  
>  	mutex_lock(&sb->s_sync_lock);
> -	spin_lock(&sb->s_inode_list_lock);
> -
> -	/*
> -	 * Data integrity sync. Must wait for all pages under writeback,
> -	 * because there may have been pages dirtied before our sync
> -	 * call, but which had writeout started before we write it out.
> -	 * In which case, the inode may not be on the dirty list, but
> -	 * we still have to wait for that writeout.
> -	 */
> -	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> -		struct address_space *mapping = inode->i_mapping;
> -
> -		spin_lock(&inode->i_lock);
> -		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> -		    (mapping->nrpages == 0)) {
> -			spin_unlock(&inode->i_lock);
> -			continue;
> -		}
> -		__iget(inode);
> -		spin_unlock(&inode->i_lock);
> -		spin_unlock(&sb->s_inode_list_lock);
> -
> -		/*
> -		 * We hold a reference to 'inode' so it couldn't have been
> -		 * removed from s_inodes list while we dropped the
> -		 * s_inode_list_lock.  We cannot iput the inode now as we can
> -		 * be holding the last reference and we cannot iput it under
> -		 * s_inode_list_lock. So we keep the reference and iput it
> -		 * later.
> -		 */
> -		iput(old_inode);
> -		old_inode = inode;
> -
> -		/*
> -		 * We keep the error status of individual mapping so that
> -		 * applications can catch the writeback error using fsync(2).
> -		 * See filemap_fdatawait_keep_errors() for details.
> -		 */
> -		filemap_fdatawait_keep_errors(mapping);
> -
> -		cond_resched();
> -
> -		spin_lock(&sb->s_inode_list_lock);
> -	}
> -	spin_unlock(&sb->s_inode_list_lock);
> -	iput(old_inode);
> +	iterate_inodes_super(sb, &__wait_sb_inodes_cond,
> +			     &__wait_sb_inodes, NULL);
>  	mutex_unlock(&sb->s_sync_lock);
>  }
>  
> --- a/fs/notify/inode_mark.c
> +++ b/fs/notify/inode_mark.c
> @@ -141,6 +141,24 @@ int fsnotify_add_inode_mark(struct fsnot
>  	return ret;
>  }
>  
> +static void __fsnotify_umount_inodes(struct inode *inode, void *arg)
> +{
> +	/* for each watch, send FS_UNMOUNT and then remove it */
> +	fsnotify(inode, FS_UNMOUNT, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
> +	fsnotify_inode_delete(inode);
> +}
> +
> +static bool __fsnotify_umount_cond(struct inode *inode, void *arg)
> +{
> +	/*
> +	 * If i_count is zero, the inode cannot have any watches and
> +	 * doing an __iget/iput with MS_ACTIVE clear would actually
> +	 * evict all inodes with zero i_count from icache which is
> +	 * unnecessarily violent and may in fact be illegal to do.
> +	 */
> +	return atomic_read(&inode->i_count);
> +}
> +
>  /**
>   * fsnotify_unmount_inodes - an sb is unmounting.  handle any watched inodes.
>   * @sb: superblock being unmounted.
> @@ -150,77 +168,6 @@ int fsnotify_add_inode_mark(struct fsnot
>   */
>  void fsnotify_unmount_inodes(struct super_block *sb)
>  {
> -	struct inode *inode, *next_i, *need_iput = NULL;
> -
> -	spin_lock(&sb->s_inode_list_lock);
> -	list_for_each_entry_safe(inode, next_i, &sb->s_inodes, i_sb_list) {
> -		struct inode *need_iput_tmp;
> -
> -		/*
> -		 * We cannot __iget() an inode in state I_FREEING,
> -		 * I_WILL_FREE, or I_NEW which is fine because by that point
> -		 * the inode cannot have any associated watches.
> -		 */
> -		spin_lock(&inode->i_lock);
> -		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
> -			spin_unlock(&inode->i_lock);
> -			continue;
> -		}
> -
> -		/*
> -		 * If i_count is zero, the inode cannot have any watches and
> -		 * doing an __iget/iput with MS_ACTIVE clear would actually
> -		 * evict all inodes with zero i_count from icache which is
> -		 * unnecessarily violent and may in fact be illegal to do.
> -		 */
> -		if (!atomic_read(&inode->i_count)) {
> -			spin_unlock(&inode->i_lock);
> -			continue;
> -		}
> -
> -		need_iput_tmp = need_iput;
> -		need_iput = NULL;
> -
> -		/* In case fsnotify_inode_delete() drops a reference. */
> -		if (inode != need_iput_tmp)
> -			__iget(inode);
> -		else
> -			need_iput_tmp = NULL;
> -		spin_unlock(&inode->i_lock);
> -
> -		/* In case the dropping of a reference would nuke next_i. */
> -		while (&next_i->i_sb_list != &sb->s_inodes) {
> -			spin_lock(&next_i->i_lock);
> -			if (!(next_i->i_state & (I_FREEING | I_WILL_FREE)) &&
> -						atomic_read(&next_i->i_count)) {
> -				__iget(next_i);
> -				need_iput = next_i;
> -				spin_unlock(&next_i->i_lock);
> -				break;
> -			}
> -			spin_unlock(&next_i->i_lock);
> -			next_i = list_next_entry(next_i, i_sb_list);
> -		}
> -
> -		/*
> -		 * We can safely drop s_inode_list_lock here because either
> -		 * we actually hold references on both inode and next_i or
> -		 * end of list.  Also no new inodes will be added since the
> -		 * umount has begun.
> -		 */
> -		spin_unlock(&sb->s_inode_list_lock);
> -
> -		if (need_iput_tmp)
> -			iput(need_iput_tmp);
> -
> -		/* for each watch, send FS_UNMOUNT and then remove it */
> -		fsnotify(inode, FS_UNMOUNT, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
> -
> -		fsnotify_inode_delete(inode);
> -
> -		iput(inode);
> -
> -		spin_lock(&sb->s_inode_list_lock);
> -	}
> -	spin_unlock(&sb->s_inode_list_lock);
> +	iterate_inodes_super(sb, &__fsnotify_umount_cond,
> +			     &__fsnotify_umount_inodes, NULL);
>  }
> --- a/fs/quota/dquot.c
> +++ b/fs/quota/dquot.c
> @@ -920,55 +920,26 @@ static int dqinit_needed(struct inode *i
>  	return 0;
>  }
>  
> +static void __add_dquot_ref(struct inode *inode, void *arg)
> +{
> +	int type = (int)(long)(arg);
> +
> +	__dquot_initialize(inode, type);
> +}
> +
> +static bool __add_dquot_ref_cond(struct inode *inode, void *arg)
> +{
> +	int type = (int)(long)(arg);
> +
> +	return atomic_read(&inode->i_writecount) && dqinit_needed(inode, type);
> +}
> +
>  /* This routine is guarded by dqonoff_mutex mutex */
>  static void add_dquot_ref(struct super_block *sb, int type)
>  {
> -	struct inode *inode, *old_inode = NULL;
> -#ifdef CONFIG_QUOTA_DEBUG
> -	int reserved = 0;
> -#endif
> -
> -	spin_lock(&sb->s_inode_list_lock);
> -	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> -		spin_lock(&inode->i_lock);
> -		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> -		    !atomic_read(&inode->i_writecount) ||
> -		    !dqinit_needed(inode, type)) {
> -			spin_unlock(&inode->i_lock);
> -			continue;
> -		}
> -		__iget(inode);
> -		spin_unlock(&inode->i_lock);
> -		spin_unlock(&sb->s_inode_list_lock);
> -
> -#ifdef CONFIG_QUOTA_DEBUG
> -		if (unlikely(inode_get_rsv_space(inode) > 0))
> -			reserved = 1;
> -#endif
> -		iput(old_inode);
> -		__dquot_initialize(inode, type);
> -
> -		/*
> -		 * We hold a reference to 'inode' so it couldn't have been
> -		 * removed from s_inodes list while we dropped the
> -		 * s_inode_list_lock. We cannot iput the inode now as we can be
> -		 * holding the last reference and we cannot iput it under
> -		 * s_inode_list_lock. So we keep the reference and iput it
> -		 * later.
> -		 */
> -		old_inode = inode;
> -		spin_lock(&sb->s_inode_list_lock);
> -	}
> -	spin_unlock(&sb->s_inode_list_lock);
> -	iput(old_inode);
> -
> -#ifdef CONFIG_QUOTA_DEBUG
> -	if (reserved) {
> -		quota_error(sb, "Writes happened before quota was turned on "
> -			"thus quota information is probably inconsistent. "
> -			"Please run quotacheck(8)");
> -	}
> -#endif
> +	/* XXX we lost debug code */
> +	iterate_inodes_super(sb, &__add_dquot_ref_cond,
> +			     &__add_dquot_ref, (void *)type);
>  }
>  
>  /*
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -204,7 +204,7 @@ static struct super_block *alloc_super(s
>  	INIT_HLIST_NODE(&s->s_instances);
>  	INIT_HLIST_BL_HEAD(&s->s_anon);
>  	mutex_init(&s->s_sync_lock);
> -	INIT_LIST_HEAD(&s->s_inodes);
> +	PCPU_LIST_HEAD_INIT(&s->s_inodes_cpu);
>  	spin_lock_init(&s->s_inode_list_lock);
>  
>  	if (list_lru_init_memcg(&s->s_dentry_lru))
> @@ -253,6 +253,67 @@ static struct super_block *alloc_super(s
>  	return NULL;
>  }
>  
> +/*
> + * Iterate over all the inodes of this superblock.
> + */
> +void iterate_inodes_super(struct super_block *sb, 
> +			  bool (*cond)(struct inode *, void *),
> +			  void (*func)(struct inode *, void *), void *arg)
> +{
> +	struct inode *inode, *n_inode, *old_inode = NULL;
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu) {
> +		struct list_head *head = &per_cpu_ptr(&sb->s_inodes_cpu, cpu)->list;
> +		spinlock_t       *lock = &per_cpu_ptr(&sb->s_inodes_cpu, cpu)->lock;
> +
> +		spin_lock(lock);
> +		list_for_each_entry_safe(inode, n_inode, head, i_sb_list) {
> +			struct inode *put_inode;
> +
> +			spin_lock(&inode->i_lock);
> +			if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> +			    !cond(inode, arg)) {
> +				spin_unlock(&inode->i_lock);
> +				continue;
> +			}
> +
> +			put_inode = old_inode;
> +			old_inode = NULL;
> +
> +			if (inode != put_inode)
> +				__iget(inode);
> +			else
> +				put_inode = NULL;
> +			spin_unlock(&inode->i_lock);
> +
> +			while (&n_inode->i_sb_list != head) {
> +				spin_lock(&n_inode->i_lock);
> +				if (!((n_inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> +				     !cond(n_inode, arg))) {
> +					__iget(n_inode);
> +					old_inode = n_inode;
> +					spin_unlock(&n_inode->i_lock);
> +					break;
> +				}
> +				spin_unlock(&n_inode->i_lock);
> +				n_inode = list_next_entry(n_inode, i_sb_list);
> +			}
> +
> +			spin_unlock(lock);
> +
> +			iput(put_inode);
> +			func(inode, arg);
> +			iput(inode);
> +			cond_resched();
> +
> +			spin_lock(lock);
> +		}
> +		spin_unlock(lock);
> +	}
> +	iput(old_inode);
> +}
> +
>  /* Superblock refcounting  */
>  
>  /*
> @@ -426,7 +487,7 @@ void generic_shutdown_super(struct super
>  		if (sop->put_super)
>  			sop->put_super(sb);
>  
> -		if (!list_empty(&sb->s_inodes)) {
> +		if (!pcpu_list_empty(&sb->s_inodes_cpu)) {
>  			printk("VFS: Busy inodes after unmount of %s. "
>  			   "Self-destruct in 5 seconds.  Have a nice day...\n",
>  			   sb->s_id);
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -648,7 +648,7 @@ struct inode {
>  	u16			i_wb_frn_history;
>  #endif
>  	struct list_head	i_lru;		/* inode LRU list */
> -	struct list_head	i_sb_list;
> +	struct pcpu_list_node	i_sb_list;
>  	union {
>  		struct hlist_head	i_dentry;
>  		struct rcu_head		i_rcu;
> @@ -1397,11 +1397,12 @@ struct super_block {
>  	 */
>  	int s_stack_depth;
>  
> -	/* s_inode_list_lock protects s_inodes */
> -	spinlock_t		s_inode_list_lock ____cacheline_aligned_in_smp;
> -	struct list_head	s_inodes;	/* all inodes */
> +	struct pcpu_list_head	s_inodes_cpu;
>  };
>  
> +extern void iterate_inodes_super(struct super_block *sb, bool (*cond)(struct inode *, void *),
> +			  void (*func)(struct inode *, void *), void *arg);
> +
>  extern struct timespec current_fs_time(struct super_block *sb);
>  
>  /*
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 3/3] vfs: Use per-cpu list for superblock's inode list
  2016-02-22 11:54       ` Jan Kara
@ 2016-02-22 12:12         ` Peter Zijlstra
  2016-02-22 13:04           ` Jan Kara
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2016-02-22 12:12 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, Waiman Long, Alexander Viro, Jan Kara, Jeff Layton,
	J. Bruce Fields, Tejun Heo, Christoph Lameter, linux-fsdevel,
	linux-kernel, Ingo Molnar, Andi Kleen, Dave Chinner,
	Scott J Norton, Douglas Hatch

On Mon, Feb 22, 2016 at 12:54:35PM +0100, Jan Kara wrote:
> > Also, I think fsnotify_unmount_inodes() (as per mainline) is missing a
> > final iput(need_iput) at the very end, but I could be mistaken, that
> > code hurts my brain.
> 
> I think the code is actually correct since need_iput contains "inode
> further in the list than the current inode". Thus we will always go through
> another iteration of the loop which will drop the reference. And inode
> cannot change state to I_FREEING or I_WILL_FREE because we hold inode
> reference. But it is subtle as hell so I agree that code needs rewrite.

So while talking to dchinner, he doubted fsnotify will actually remove
inodes from the sb-list, but wasn't sure and too tired to check now.

(I got lost in the fsnotify code real quick and gave up, for I was
mostly trying to make a point that we don't need the CPP magic and can
do with 'readable' code).

If it doesn't, it doesn't need to do this extra special magic dance and
can use the 'normal' iterator pattern used in all the other functions,
greatly reducing complexity.
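
[ The 'normal' iterator pattern referred to here is the one used by
  wait_sb_inodes(), drop_pagecache_sb() and friends, quoted elsewhere
  in this thread -- roughly, with the per-inode work left abstract: ]

	struct inode *inode, *old_inode = NULL;

	spin_lock(&sb->s_inode_list_lock);
	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
		spin_lock(&inode->i_lock);
		if (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) {
			spin_unlock(&inode->i_lock);
			continue;
		}
		__iget(inode);
		spin_unlock(&inode->i_lock);
		spin_unlock(&sb->s_inode_list_lock);

		/* per-inode work, done without the list lock held */

		iput(old_inode);	/* safe: list lock not held */
		old_inode = inode;	/* keeps inode on the s_inodes list */

		spin_lock(&sb->s_inode_list_lock);
	}
	spin_unlock(&sb->s_inode_list_lock);
	iput(old_inode);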

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 3/3] vfs: Use per-cpu list for superblock's inode list
  2016-02-22 12:12         ` Peter Zijlstra
@ 2016-02-22 13:04           ` Jan Kara
  2016-02-22 21:08             ` Dave Chinner
  2016-02-23 19:01             ` Waiman Long
  0 siblings, 2 replies; 13+ messages in thread
From: Jan Kara @ 2016-02-22 13:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jan Kara, Dave Chinner, Waiman Long, Alexander Viro, Jan Kara,
	Jeff Layton, J. Bruce Fields, Tejun Heo, Christoph Lameter,
	linux-fsdevel, linux-kernel, Ingo Molnar, Andi Kleen,
	Dave Chinner, Scott J Norton, Douglas Hatch

[-- Attachment #1: Type: text/plain, Size: 1996 bytes --]

On Mon 22-02-16 13:12:22, Peter Zijlstra wrote:
> On Mon, Feb 22, 2016 at 12:54:35PM +0100, Jan Kara wrote:
> > > Also, I think fsnotify_unmount_inodes() (as per mainline) is missing a
> > > final iput(need_iput) at the very end, but I could be mistaken, that
> > > code hurts my brain.
> > 
> > I think the code is actually correct since need_iput contains "inode
> > further in the list than the current inode". Thus we will always go through
> > another iteration of the loop which will drop the reference. And inode
> > cannot change state to I_FREEING or I_WILL_FREE because we hold inode
> > reference. But it is subtle as hell so I agree that code needs rewrite.
> 
> So while talking to dchinner, he doubted fsnotify will actually remove
> inodes from the sb-list, but wasn't sure and too tired to check now.
> 
> (I got lost in the fsnotify code real quick and gave up, for I was
> mostly trying to make a point that we don't need the CPP magic and can
> do with 'readable' code).
> 
> If it doesn't, it doesn't need to do this extra special magic dance and
> can use the 'normal' iterator pattern used in all the other functions,
> greatly reducing complexity.

Yeah, that would be nice. But fsnotify code needs to iterate over all
inodes, drop sb_list_lock and do some fsnotify magic with the inode which
is not substantial for our discussion. Now that fsnotify magic may actually
drop all the remaining inode references so once we drop our reference
pinning the inode, it can just disappear. We don't want to restart the scan
for each inode we have to process so that is the reason why we play ugly
tricks with pinning the next inode in the list.

But I agree it should be possible to just use list_for_each_entry() instead
of list_for_each_entry_safe() and keep current inode pinned till the next
iteration to make it stick in the sb->s_inodes list. That would make the
iteration more standard. Lightly tested patch attached.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[-- Attachment #2: 0001-fsnotify-Simplify-inode-iteration-on-umount.patch --]
[-- Type: text/x-patch, Size: 2801 bytes --]

From b73ae63fff14dea2afac34523d5ebfc5f030eff6 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Mon, 22 Feb 2016 13:54:32 +0100
Subject: [PATCH] fsnotify: Simplify inode iteration on umount

fsnotify_unmount_inodes() played complex tricks to pin next inode in the
sb->s_inodes list when iterating over all inodes. If we switch to
keeping current inode pinned somewhat longer, we can make the code much
simpler and standard.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/notify/inode_mark.c | 45 +++++++++------------------------------------
 1 file changed, 9 insertions(+), 36 deletions(-)

diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index 741077deef3b..a3645249f7ec 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -150,12 +150,10 @@ int fsnotify_add_inode_mark(struct fsnotify_mark *mark,
  */
 void fsnotify_unmount_inodes(struct super_block *sb)
 {
-	struct inode *inode, *next_i, *need_iput = NULL;
+	struct inode *inode, *iput_inode = NULL;
 
 	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next_i, &sb->s_inodes, i_sb_list) {
-		struct inode *need_iput_tmp;
-
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		/*
 		 * We cannot __iget() an inode in state I_FREEING,
 		 * I_WILL_FREE, or I_NEW which is fine because by that point
@@ -178,49 +176,24 @@ void fsnotify_unmount_inodes(struct super_block *sb)
 			continue;
 		}
 
-		need_iput_tmp = need_iput;
-		need_iput = NULL;
-
-		/* In case fsnotify_inode_delete() drops a reference. */
-		if (inode != need_iput_tmp)
-			__iget(inode);
-		else
-			need_iput_tmp = NULL;
+		__iget(inode);
 		spin_unlock(&inode->i_lock);
-
-		/* In case the dropping of a reference would nuke next_i. */
-		while (&next_i->i_sb_list != &sb->s_inodes) {
-			spin_lock(&next_i->i_lock);
-			if (!(next_i->i_state & (I_FREEING | I_WILL_FREE)) &&
-						atomic_read(&next_i->i_count)) {
-				__iget(next_i);
-				need_iput = next_i;
-				spin_unlock(&next_i->i_lock);
-				break;
-			}
-			spin_unlock(&next_i->i_lock);
-			next_i = list_next_entry(next_i, i_sb_list);
-		}
-
-		/*
-		 * We can safely drop s_inode_list_lock here because either
-		 * we actually hold references on both inode and next_i or
-		 * end of list.  Also no new inodes will be added since the
-		 * umount has begun.
-		 */
 		spin_unlock(&sb->s_inode_list_lock);
 
-		if (need_iput_tmp)
-			iput(need_iput_tmp);
+		if (iput_inode)
+			iput(iput_inode);
 
 		/* for each watch, send FS_UNMOUNT and then remove it */
 		fsnotify(inode, FS_UNMOUNT, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
 
 		fsnotify_inode_delete(inode);
 
-		iput(inode);
+		iput_inode = inode;
 
 		spin_lock(&sb->s_inode_list_lock);
 	}
 	spin_unlock(&sb->s_inode_list_lock);
+
+	if (iput_inode)
+		iput(iput_inode);
 }
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 3/3] vfs: Use per-cpu list for superblock's inode list
  2016-02-22 13:04           ` Jan Kara
@ 2016-02-22 21:08             ` Dave Chinner
  2016-02-22 22:18               ` Jan Kara
  2016-02-23 19:01             ` Waiman Long
  1 sibling, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2016-02-22 21:08 UTC (permalink / raw)
  To: Jan Kara
  Cc: Peter Zijlstra, Waiman Long, Alexander Viro, Jan Kara,
	Jeff Layton, J. Bruce Fields, Tejun Heo, Christoph Lameter,
	linux-fsdevel, linux-kernel, Ingo Molnar, Andi Kleen,
	Dave Chinner, Scott J Norton, Douglas Hatch

On Mon, Feb 22, 2016 at 02:04:35PM +0100, Jan Kara wrote:
> On Mon 22-02-16 13:12:22, Peter Zijlstra wrote:
> > On Mon, Feb 22, 2016 at 12:54:35PM +0100, Jan Kara wrote:
> > > > Also, I think fsnotify_unmount_inodes() (as per mainline) is missing a
> > > > final iput(need_iput) at the very end, but I could be mistaken, that
> > > > code hurts my brain.
> > > 
> > > I think the code is actually correct since need_iput contains "inode
> > > further in the list than the current inode". Thus we will always go through
> > > another iteration of the loop which will drop the reference. And inode
> > > cannot change state to I_FREEING or I_WILL_FREE because we hold inode
> > > reference. But it is subtle as hell so I agree that code needs rewrite.
> > 
> > So while talking to dchinner, he doubted fsnotify will actually remove
> > inodes from the sb-list, but wasn't sure and too tired to check now.
> > 
> > (I got lost in the fsnotify code real quick and gave up, for I was
> > mostly trying to make a point that we don't need the CPP magic and can
> > do with 'readable' code).
> > 
> > If it doesn't, it doesn't need to do this extra special magic dance and
> > can use the 'normal' iterator pattern used in all the other functions,
> > greatly reducing complexity.
> 
> Yeah, that would be nice. But fsnotify code needs to iterate over all
> inodes, drop sb_list_lock and do some fsnotify magic with the inode which
> is not substantial for our discussion. Now that fsnotify magic may actually
> drop all the remaining inode references so once we drop our reference
> pinning the inode, it can just disappear. We don't want to restart the scan
> for each inode we have to process so that is the reason why we play ugly
> tricks with pinning the next inode in the list.
> 
> But I agree it should be possible to just use list_for_each_entry() instead
> of list_for_each_entry_safe() and keep current inode pinned till the next
> iteration to make it stick in the sb->s_inodes list. That would make the
> iteration more standard. Lightly tested patch attached.

That's exactly what I was thinking. Patch looks ok from a quick
reading of it, but I haven't got anything here to test it
at all. Perhaps we need some xfstests coverage of this code....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 3/3] vfs: Use per-cpu list for superblock's inode list
  2016-02-22 21:08             ` Dave Chinner
@ 2016-02-22 22:18               ` Jan Kara
  0 siblings, 0 replies; 13+ messages in thread
From: Jan Kara @ 2016-02-22 22:18 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Peter Zijlstra, Waiman Long, Alexander Viro, Jan Kara,
	Jeff Layton, J. Bruce Fields, Tejun Heo, Christoph Lameter,
	linux-fsdevel, linux-kernel, Ingo Molnar, Andi Kleen,
	Dave Chinner, Scott J Norton, Douglas Hatch

On Tue 23-02-16 08:08:14, Dave Chinner wrote:
> On Mon, Feb 22, 2016 at 02:04:35PM +0100, Jan Kara wrote:
> > On Mon 22-02-16 13:12:22, Peter Zijlstra wrote:
> > > On Mon, Feb 22, 2016 at 12:54:35PM +0100, Jan Kara wrote:
> > > > > Also, I think fsnotify_unmount_inodes() (as per mainline) is missing a
> > > > > final iput(need_iput) at the very end, but I could be mistaken, that
> > > > > code hurts my brain.
> > > > 
> > > > I think the code is actually correct since need_iput contains "inode
> > > > further in the list than the current inode". Thus we will always go through
> > > > another iteration of the loop which will drop the reference. And inode
> > > > cannot change state to I_FREEING or I_WILL_FREE because we hold inode
> > > > reference. But it is subtle as hell so I agree that code needs rewrite.
> > > 
> > > So while talking to dchinner, he doubted fsnotify will actually remove
> > > inodes from the sb-list, but wasn't sure and too tired to check now.
> > > 
> > > (I got lost in the fsnotify code real quick and gave up, for I was
> > > mostly trying to make a point that we don't need the CPP magic and can
> > > do with 'readable' code).
> > > 
> > > If it doesn't, it doesn't need to do this extra special magic dance and
> > > can use the 'normal' iterator pattern used in all the other functions,
> > > greatly reducing complexity.
> > 
> > Yeah, that would be nice. But fsnotify code needs to iterate over all
> > inodes, drop sb_list_lock and do some fsnotify magic with the inode which
> > is not substantial for our discussion. Now that fsnotify magic may actually
> > drop all the remaining inode references so once we drop our reference
> > pinning the inode, it can just disappear. We don't want to restart the scan
> > for each inode we have to process so that is the reason why we play ugly
> > tricks with pinning the next inode in the list.
> > 
> > But I agree it should be possible to just use list_for_each_entry() instead
> > of list_for_each_entry_safe() and keep current inode pinned till the next
> > iteration to make it stick in the sb->s_inodes list. That would make the
> > iteration more standard. Lightly tested patch attached.
> 
> That's exactly what I was thinking. Patch looks ok from a quick
> reading of it, but I haven't got anything here to test it
> at all. Perhaps we need some xfstests coverage of this code....

I've tested it by adding watches to some files and then unmounting the
filesystem. That should give basic testing to the code. There's some
reasonable inotify coverage (including unmount events) in LTP but most of
the testing happens in tmpdir so it is not particularly useful for
stressing this code.
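
[ A minimal userspace reproducer along those lines -- /mnt/test and
  the file name "f" are placeholders; it assumes /mnt/test is a
  separately mounted filesystem.  Run it, then "umount /mnt/test"
  from another shell: ]

	#include <stdio.h>
	#include <unistd.h>
	#include <sys/inotify.h>

	int main(void)
	{
		char buf[4096]
			__attribute__((aligned(__alignof__(struct inotify_event))));
		ssize_t len;
		int fd = inotify_init1(0);

		if (fd < 0 ||
		    inotify_add_watch(fd, "/mnt/test/f", IN_ALL_EVENTS) < 0) {
			perror("inotify");
			return 1;
		}
		/* blocks until an event arrives; unmounting the watched
		 * filesystem delivers IN_UNMOUNT via
		 * fsnotify_unmount_inodes() */
		len = read(fd, buf, sizeof(buf));
		if (len > 0 &&
		    (((struct inotify_event *)buf)->mask & IN_UNMOUNT))
			printf("got IN_UNMOUNT\n");
		return 0;
	}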

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 3/3] vfs: Use per-cpu list for superblock's inode list
  2016-02-21 21:34   ` Dave Chinner
  2016-02-22  9:18     ` Peter Zijlstra
@ 2016-02-23 18:56     ` Waiman Long
  1 sibling, 0 replies; 13+ messages in thread
From: Waiman Long @ 2016-02-23 18:56 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Scott J Norton, Douglas Hatch

On 02/21/2016 04:34 PM, Dave Chinner wrote:
> On Fri, Feb 19, 2016 at 04:10:45PM -0500, Waiman Long wrote:
>> +/*
>> + * Superblock's inode list iterator function and arguments macros
>> + */
>> +#define SB_INODES_ITER_FUNC(name, lock, struct_fields)			\
>> +	struct name ## _arg {						\
>> +		struct_fields;						\
>> +	};								\
>> +	static int name ## _iter(struct pcpu_list_node *_node,		\
>> +				 struct pcpu_list_node **_pnext,	\
>> +				 spinlock_t *lock, void *_arg)
>> +
>> +#define SB_INODES_ITER_ARGS(name, i, a)					\
>> +	struct inode *i = container_of(_node, struct inode, i_sb_list);	\
>> +	struct name ## _arg *a = (struct name ## _arg *)_arg
>> +
>> +#define SB_INODES_ITER_ARGS_SAFE(name, i, n, a)				\
>> +	struct inode *i = container_of(_node, struct inode, i_sb_list);	\
>> +	struct inode *n = container_of(*_pnext, struct inode, i_sb_list);\
>> +	struct name ## _arg *a = (struct name ## _arg *)_arg
>> +
>> +#define SB_INODES_ITER_SET_PCPU_LIST_NEXT(n)				\
>> +	{ *_pnext = &(n)->i_sb_list; }
>> +
>> +#define SB_INODES_ITER_CALL(name, sb)					\
>> +	pcpu_list_iterate(sb->s_inodes, false, NULL, name ## _iter, &arg)
>> +
>> +#define SB_INODES_ITER_CALL_SAFE(name, sb, phead)			\
>> +	pcpu_list_iterate(sb->s_inodes, true, phead, name ## _iter, &arg)
>> +
> No, just no.
>
> Ungreppable, breaks cscope, obfuscates everything, shouts a lot,
> code using the API looks completely broken (e.g. semi-colons in
> "function declarations"), and it reminds me of the worst of the
> worst unmaintainable code in an exceedingly buggy and undebuggable
> proprietary filesystem I've previously had the "joy" of working
> with.
>
> Just fix the bug in the previous version; it's so much cleaner than
> this .... mess.
>
> Cheers,
>
> Dave.

Sorry about that. I will scrap the current approach and use a different 
way to iterate the list. I will send out an updated patch soon.
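
[For illustration only: hand-expanding the quoted macros shows what a
macro-free iterator would look like. All names below are hypothetical,
and the actual v3 interface may differ.]

	/*
	 * Hand-expanded equivalent of SB_INODES_ITER_FUNC() plus
	 * SB_INODES_ITER_ARGS() for a hypothetical "drop_caches"
	 * iteration -- plain code that grep and cscope can find.
	 */
	struct drop_caches_arg {
		int nr_scanned;
	};

	static int drop_caches_iter(struct pcpu_list_node *node,
				    struct pcpu_list_node **pnext,
				    spinlock_t *lock, void *arg)
	{
		struct inode *inode = container_of(node, struct inode,
						   i_sb_list);
		struct drop_caches_arg *a = arg;

		/* per-inode work would go here */
		a->nr_scanned++;
		return 0;	/* nonzero would presumably stop iteration */
	}

	/*
	 * Call site, replacing SB_INODES_ITER_CALL(drop_caches, sb):
	 *
	 *	struct drop_caches_arg arg = { 0 };
	 *
	 *	pcpu_list_iterate(sb->s_inodes, false, NULL,
	 *			  drop_caches_iter, &arg);
	 */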

Cheers,
Longman

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 3/3] vfs: Use per-cpu list for superblock's inode list
  2016-02-22 13:04           ` Jan Kara
  2016-02-22 21:08             ` Dave Chinner
@ 2016-02-23 19:01             ` Waiman Long
  1 sibling, 0 replies; 13+ messages in thread
From: Waiman Long @ 2016-02-23 19:01 UTC (permalink / raw)
  To: Jan Kara
  Cc: Peter Zijlstra, Dave Chinner, Alexander Viro, Jan Kara,
	Jeff Layton, J. Bruce Fields, Tejun Heo, Christoph Lameter,
	linux-fsdevel, linux-kernel, Ingo Molnar, Andi Kleen,
	Dave Chinner, Scott J Norton, Douglas Hatch

On 02/22/2016 08:04 AM, Jan Kara wrote:
> On Mon 22-02-16 13:12:22, Peter Zijlstra wrote:
>> On Mon, Feb 22, 2016 at 12:54:35PM +0100, Jan Kara wrote:
>>>> Also, I think fsnotify_unmount_inodes() (as per mainline) is missing a
>>>> final iput(need_iput) at the very end, but I could be mistaken; that
>>>> code hurts my brain.
>>> I think the code is actually correct, since need_iput contains "inode
>>> further in the list than the current inode". Thus we will always go through
>>> another iteration of the loop, which will drop the reference. And the inode
>>> cannot change state to I_FREEING or I_WILL_FREE because we hold an inode
>>> reference. But it is subtle as hell, so I agree that code needs a rewrite.
>> So while talking to dchinner, he doubted fsnotify will actually remove
>> inodes from the sb-list, but wasn't sure and was too tired to check now.
>>
>> (I got lost in the fsnotify code real quick and gave up, for I was
>> mostly trying to make the point that we don't need the CPP magic and
>> can make do with 'readable' code).
>>
>> If it doesn't, it doesn't need to do this extra special magic dance and
>> can use the 'normal' iterator pattern used in all the other functions,
>> greatly reducing complexity.
> Yeah, that would be nice. But the fsnotify code needs to iterate over all
> inodes, drop sb_list_lock, and do some fsnotify magic with the inode,
> which is not essential to our discussion. Now, that fsnotify magic may
> actually drop all the remaining inode references, so once we drop our
> reference pinning the inode, it can just disappear. We don't want to
> restart the scan for each inode we have to process, which is why we play
> ugly tricks with pinning the next inode in the list.
>
> But I agree it should be possible to just use list_for_each_entry()
> instead of list_for_each_entry_safe() and keep the current inode pinned
> till the next iteration so that it stays in the sb->s_inodes list. That
> would make the iteration more standard. Lightly tested patch attached.
>
> 								Honza

Your patch looks good to me. I would like to include it in my per-cpu 
list patchset, if you don't mind.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2016-02-23 19:02 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-02-19 21:10 [PATCH v2 0/3] vfs: Use per-cpu list for SB's s_inodes list Waiman Long
2016-02-19 21:10 ` [PATCH v2 1/3] lib/percpu-list: Per-cpu list with associated per-cpu locks Waiman Long
2016-02-19 21:10 ` [PATCH v2 2/3] vfs: Refactor sb->s_inodes iteration functions Waiman Long
2016-02-19 21:10 ` [PATCH v2 3/3] vfs: Use per-cpu list for superblock's inode list Waiman Long
2016-02-21 21:34   ` Dave Chinner
2016-02-22  9:18     ` Peter Zijlstra
2016-02-22 11:54       ` Jan Kara
2016-02-22 12:12         ` Peter Zijlstra
2016-02-22 13:04           ` Jan Kara
2016-02-22 21:08             ` Dave Chinner
2016-02-22 22:18               ` Jan Kara
2016-02-23 19:01             ` Waiman Long
2016-02-23 18:56     ` Waiman Long
