linux-kernel.vger.kernel.org archive mirror
* [RESEND PATCH 0/5] vfs: Use dlock list for SB's s_inodes list
@ 2016-06-07 19:35 Waiman Long
  2016-06-07 19:35 ` [RESEND PATCH 1/5] lib/dlock-list: Distributed and lock-protected lists Waiman Long
                   ` (4 more replies)
  0 siblings, 5 replies; 9+ messages in thread
From: Waiman Long @ 2016-06-07 19:35 UTC
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng, Scott J Norton,
	Douglas Hatch, Waiman Long

This is a follow-up to the following patchset:

  [PATCH v7 0/4] vfs: Use per-cpu list for SB's s_inodes list
  https://lkml.org/lkml/2016/4/12/1009

The main change is the renaming of percpu list to dlock list, as
suggested by Christoph Lameter. It also adds a new patch from Boqun
Feng to add the __percpu modifier for parameters.

Patch 1 introduces the dlock list.

Patch 2 adds the __percpu modifier to the appropriate parameters.

Patch 3 cleans up the fsnotify_unmount_inodes() function by making
the code simpler and more standard.

Patch 4 replaces the use of list_for_each_entry_safe() in
evict_inodes() and invalidate_inodes() by list_for_each_entry().

Patch 5 modifies the superblock and inode structures to use the dlock
list. The corresponding functions that reference those structures
are modified.

Boqun Feng (1):
  lib/dlock-list: Add __percpu modifier for parameters

Jan Kara (2):
  fsnotify: Simplify inode iteration on umount
  vfs: Remove unnecessary list_for_each_entry_safe() variants

Waiman Long (2):
  lib/dlock-list: Distributed and lock-protected lists
  vfs: Use dlock list for superblock's inode list

 fs/block_dev.c             |   13 ++-
 fs/drop_caches.c           |   10 +-
 fs/fs-writeback.c          |   13 ++-
 fs/inode.c                 |   40 +++-----
 fs/notify/inode_mark.c     |   53 +++--------
 fs/quota/dquot.c           |   16 ++--
 fs/super.c                 |    7 +-
 include/linux/dlock-list.h |  235 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h         |    8 +-
 lib/Makefile               |    2 +-
 lib/dlock-list.c           |  101 +++++++++++++++++++
 11 files changed, 402 insertions(+), 96 deletions(-)
 create mode 100644 include/linux/dlock-list.h
 create mode 100644 lib/dlock-list.c


* [RESEND PATCH 1/5] lib/dlock-list: Distributed and lock-protected lists
  2016-06-07 19:35 [RESEND PATCH 0/5] vfs: Use dlock list for SB's s_inodes list Waiman Long
@ 2016-06-07 19:35 ` Waiman Long
  2016-06-07 20:13   ` Andi Kleen
  2016-06-07 19:35 ` [RESEND PATCH 2/5] lib/dlock-list: Add __percpu modifier for parameters Waiman Long
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 9+ messages in thread
From: Waiman Long @ 2016-06-07 19:35 UTC
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng, Scott J Norton,
	Douglas Hatch, Waiman Long

Linked lists are used everywhere in the Linux kernel. However, if many
threads are trying to add entries to or delete entries from the same
linked list, it can become a performance bottleneck.

This patch introduces a new set of list APIs that provide a set of
distributed lists (one per CPU), each of which is protected by its own
spinlock. To the callers, however, the set of lists acts like a single
consolidated list.  This allows list entry insertion and deletion
operations to happen in parallel instead of being serialized with a
global list and lock.

List entry insertion is strictly per cpu. List deletion, however, can
happen on a cpu other than the one that did the insertion. So we still
need a lock to protect each list. Because of that, there may still be
a small amount of contention when deletion is being done.

A new header file include/linux/dlock-list.h will be added with the
associated dlock_list_head and dlock_list_node structures. The following
functions are provided to manage the per-cpu list:

 1. int init_dlock_list_head(struct dlock_list_head **pdlock_head)
 2. void dlock_list_add(struct dlock_list_node *node,
		        struct dlock_list_head *head)
 3. void dlock_list_del(struct dlock_list_node *node)

Iteration of all the list entries within a group of per-cpu
lists is done by calling either the dlock_list_iterate() or
dlock_list_iterate_safe() function in a while loop. They correspond
to the list_for_each_entry() and list_for_each_entry_safe() macros
respectively. The iteration state is kept in a dlock_list_state
structure that is passed to the iteration functions.
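
The scheme described above can be approximated in userspace as a rough
sketch. This is not the kernel code from this patch: pthread mutexes
stand in for spinlocks, a caller-supplied index stands in for the
current CPU, and every name below (NR_LISTS, dl_head, dl_node, dl_state,
dl_init, dl_add, dl_del, dl_iterate) is hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_LISTS 4	/* hypothetical stand-in for the number of possible CPUs */

struct list_head { struct list_head *prev, *next; };

struct dl_head {			/* plays the role of dlock_list_head */
	struct list_head list;
	pthread_mutex_t lock;
};

struct dl_node {			/* plays the role of dlock_list_node */
	struct list_head list;		/* must stay the first member (see casts) */
	pthread_mutex_t *lockptr;	/* lock of the sub-list the node is on */
};

struct dl_state {			/* plays the role of dlock_list_state */
	int idx;			/* start at -1 */
	pthread_mutex_t *lock;
	struct list_head *head;
	struct dl_node *curr;
};

static void dl_init(struct dl_head heads[NR_LISTS])
{
	for (int i = 0; i < NR_LISTS; i++) {
		heads[i].list.prev = heads[i].list.next = &heads[i].list;
		pthread_mutex_init(&heads[i].lock, NULL);
	}
}

/* Insert on the sub-list picked by idx (the patch picks the current CPU). */
static void dl_add(struct dl_node *node, struct dl_head heads[NR_LISTS], int idx)
{
	struct dl_head *h;

	assert(idx >= 0);
	h = &heads[idx % NR_LISTS];
	pthread_mutex_lock(&h->lock);
	node->lockptr = &h->lock;
	node->list.next = h->list.next;
	node->list.prev = &h->list;
	h->list.next->prev = &node->list;
	h->list.next = &node->list;
	pthread_mutex_unlock(&h->lock);
}

/* Delete from whichever sub-list holds the node; any thread may call this. */
static void dl_del(struct dl_node *node)
{
	pthread_mutex_t *lock = node->lockptr;

	if (!lock)
		return;
	pthread_mutex_lock(lock);
	if (lock == node->lockptr) {	/* re-check under the lock */
		node->list.prev->next = node->list.next;
		node->list.next->prev = node->list.prev;
		node->list.prev = node->list.next = &node->list;
		node->lockptr = NULL;
	}
	pthread_mutex_unlock(lock);
}

/* Advance to the next entry; the current sub-list stays locked while true. */
static bool dl_iterate(struct dl_head heads[NR_LISTS], struct dl_state *s)
{
	if (s->curr && s->curr->list.next != s->head) {
		s->curr = (struct dl_node *)s->curr->list.next;
		return true;
	}
	if (s->lock)
		pthread_mutex_unlock(s->lock);	/* done with this sub-list */
	s->lock = NULL;
	s->curr = NULL;
	while (++s->idx < NR_LISTS) {
		s->head = &heads[s->idx].list;
		s->lock = &heads[s->idx].lock;
		pthread_mutex_lock(s->lock);
		if (s->head->next != s->head) {
			s->curr = (struct dl_node *)s->head->next;
			return true;
		}
		pthread_mutex_unlock(s->lock);	/* empty sub-list, skip it */
		s->lock = NULL;
	}
	return false;
}
```

A caller would initialise a state with `{ .idx = -1 }` and loop with
`while (dl_iterate(heads, &state))`, reading `state.curr` in the loop
body, much as the dlock_list_iterate() callers in this series do.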

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 include/linux/dlock-list.h |  233 ++++++++++++++++++++++++++++++++++++++++++++
 lib/Makefile               |    2 +-
 lib/dlock-list.c           |  100 +++++++++++++++++++
 3 files changed, 334 insertions(+), 1 deletions(-)
 create mode 100644 include/linux/dlock-list.h
 create mode 100644 lib/dlock-list.c

diff --git a/include/linux/dlock-list.h b/include/linux/dlock-list.h
new file mode 100644
index 0000000..43355f8
--- /dev/null
+++ b/include/linux/dlock-list.h
@@ -0,0 +1,233 @@
+/*
+ * Distributed/locked list
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2016 Hewlett-Packard Enterprise Development LP
+ *
+ * Authors: Waiman Long <waiman.long@hpe.com>
+ */
+#ifndef __LINUX_DLOCK_LIST_H
+#define __LINUX_DLOCK_LIST_H
+
+#include <linux/spinlock.h>
+#include <linux/list.h>
+#include <linux/percpu.h>
+
+/*
+ * include/linux/dlock-list.h
+ *
+ * A distributed (per-cpu) set of lists each of which is protected by its
+ * own spinlock, but acts like a single consolidated list to the callers.
+ *
+ * The dlock_list_head structure contains the spinlock; each
+ * dlock_list_node structure only contains a pointer to the spinlock in
+ * its dlock_list_head.
+ */
+struct dlock_list_head {
+	struct list_head list;
+	spinlock_t lock;
+};
+
+#define DLOCK_LIST_HEAD_INIT(name)				\
+	{							\
+		.list.prev = &name.list,			\
+		.list.next = &name.list,			\
+		.lock = __SPIN_LOCK_UNLOCKED(name.lock),	\
+	}
+
+/*
+ * Per-cpu list iteration state
+ */
+struct dlock_list_state {
+	int			 cpu;
+	spinlock_t		*lock;
+	struct list_head	*head;	/* List head of current per-cpu list */
+	struct dlock_list_node	*curr;
+	struct dlock_list_node	*next;
+};
+
+#define DLOCK_LIST_STATE_INIT()			\
+	{					\
+		.cpu  = -1,			\
+		.lock = NULL,			\
+		.head = NULL,			\
+		.curr = NULL,			\
+		.next = NULL,			\
+	}
+
+#define DEFINE_DLOCK_LIST_STATE(s)		\
+	struct dlock_list_state s = DLOCK_LIST_STATE_INIT()
+
+static inline void init_dlock_list_state(struct dlock_list_state *state)
+{
+	state->cpu  = -1;
+	state->lock = NULL;
+	state->head = NULL;
+	state->curr = NULL;
+	state->next = NULL;
+}
+
+#ifdef CONFIG_DEBUG_SPINLOCK
+#define DLOCK_LIST_WARN_ON(x)	WARN_ON(x)
+#else
+#define DLOCK_LIST_WARN_ON(x)
+#endif
+
+/*
+ * Next per-cpu list entry
+ */
+#define dlock_list_next_entry(pos, member) list_next_entry(pos, member.list)
+
+/*
+ * Per-cpu node data structure
+ */
+struct dlock_list_node {
+	struct list_head list;
+	spinlock_t *lockptr;
+};
+
+#define DLOCK_LIST_NODE_INIT(name)		\
+	{					\
+		.list.prev = &name.list,	\
+		.list.next = &name.list,	\
+		.lockptr = NULL			\
+	}
+
+static inline void init_dlock_list_node(struct dlock_list_node *node)
+{
+	INIT_LIST_HEAD(&node->list);
+	node->lockptr = NULL;
+}
+
+static inline void free_dlock_list_head(struct dlock_list_head **pdlock_head)
+{
+	free_percpu(*pdlock_head);
+	*pdlock_head = NULL;
+}
+
+/*
+ * Check if all the per-cpu lists are empty
+ */
+static inline bool dlock_list_empty(struct dlock_list_head *dlock_head)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		if (!list_empty(&per_cpu_ptr(dlock_head, cpu)->list))
+			return false;
+	return true;
+}
+
+/*
+ * Helper function to find the first entry of the next per-cpu list
+ * It works somewhat like for_each_possible_cpu(cpu).
+ *
+ * Return: true if the entry is found, false if all the lists exhausted
+ */
+static __always_inline bool
+__dlock_list_next_cpu(struct dlock_list_head *head,
+		      struct dlock_list_state *state)
+{
+	if (state->lock)
+		spin_unlock(state->lock);
+next_cpu:
+	/*
+	 * for_each_possible_cpu(cpu)
+	 */
+	state->cpu = cpumask_next(state->cpu, cpu_possible_mask);
+	if (state->cpu >= nr_cpu_ids)
+		return false;	/* All the per-cpu lists iterated */
+
+	state->head = &per_cpu_ptr(head, state->cpu)->list;
+	if (list_empty(state->head))
+		goto next_cpu;
+
+	state->lock = &per_cpu_ptr(head, state->cpu)->lock;
+	spin_lock(state->lock);
+	/*
+	 * There is a slight chance that the list may become empty just
+	 * before the lock is acquired. So an additional check is
+	 * needed to make sure that state->curr points to a valid entry.
+	 */
+	if (list_empty(state->head)) {
+		spin_unlock(state->lock);
+		goto next_cpu;
+	}
+	state->curr = list_entry(state->head->next,
+				 struct dlock_list_node, list);
+	return true;
+}
+
+/*
+ * Iterate to the next entry of the group of per-cpu lists
+ *
+ * Return: true if the next entry is found, false if all the entries iterated
+ */
+static inline bool dlock_list_iterate(struct dlock_list_head *head,
+				      struct dlock_list_state *state)
+{
+	/*
+	 * Find next entry
+	 */
+	if (state->curr)
+		state->curr = list_next_entry(state->curr, list);
+
+	if (!state->curr || (&state->curr->list == state->head)) {
+		/*
+		 * The current per-cpu list has been exhausted, try the next
+		 * per-cpu list.
+		 */
+		if (!__dlock_list_next_cpu(head, state))
+			return false;
+	}
+
+	DLOCK_LIST_WARN_ON(state->curr->lockptr != state->lock);
+	return true;	/* Continue the iteration */
+}
+
+/*
+ * Iterate to the next entry of the group of per-cpu lists and safe
+ * against removal of list_entry
+ *
+ * Return: true if the next entry is found, false if all the entries iterated
+ */
+static inline bool dlock_list_iterate_safe(struct dlock_list_head *head,
+					   struct dlock_list_state *state)
+{
+	/*
+	 * Find next entry
+	 */
+	if (state->curr) {
+		state->curr = state->next;
+		state->next = list_next_entry(state->next, list);
+	}
+
+	if (!state->curr || (&state->curr->list == state->head)) {
+		/*
+		 * The current per-cpu list has been exhausted, try the next
+		 * per-cpu list.
+		 */
+		if (!__dlock_list_next_cpu(head, state))
+			return false;
+		state->next = list_next_entry(state->curr, list);
+	}
+
+	DLOCK_LIST_WARN_ON(state->curr->lockptr != state->lock);
+	return true;	/* Continue the iteration */
+}
+
+extern void dlock_list_add(struct dlock_list_node *node,
+			  struct dlock_list_head *head);
+extern void dlock_list_del(struct dlock_list_node *node);
+extern int  init_dlock_list_head(struct dlock_list_head **pdlock_head);
+
+#endif /* __LINUX_DLOCK_LIST_H */
diff --git a/lib/Makefile b/lib/Makefile
index a65e9a8..415fe5f 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -40,7 +40,7 @@ obj-y += bcd.o div64.o sort.o parser.o halfmd4.o debug_locks.o random32.o \
 	 gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \
 	 bsearch.o find_bit.o llist.o memweight.o kfifo.o \
 	 percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \
-	 once.o
+	 once.o dlock-list.o
 obj-y += string_helpers.o
 obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o
 obj-y += hexdump.o
diff --git a/lib/dlock-list.c b/lib/dlock-list.c
new file mode 100644
index 0000000..84d4623
--- /dev/null
+++ b/lib/dlock-list.c
@@ -0,0 +1,100 @@
+/*
+ * Distributed/locked list
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2016 Hewlett-Packard Enterprise Development LP
+ *
+ * Authors: Waiman Long <waiman.long@hpe.com>
+ */
+#include <linux/dlock-list.h>
+#include <linux/lockdep.h>
+
+/*
+ * The dlock list lock needs its own class to avoid warning and stack
+ * trace when lockdep is enabled.
+ */
+static struct lock_class_key dlock_list_key;
+
+/*
+ * Initialize the per-cpu list head
+ */
+int init_dlock_list_head(struct dlock_list_head **pdlock_head)
+{
+	struct dlock_list_head *dlock_head;
+	int cpu;
+
+	dlock_head = alloc_percpu(struct dlock_list_head);
+	if (!dlock_head)
+		return -ENOMEM;
+
+	for_each_possible_cpu(cpu) {
+		struct dlock_list_head *head = per_cpu_ptr(dlock_head, cpu);
+
+		INIT_LIST_HEAD(&head->list);
+		head->lock = __SPIN_LOCK_UNLOCKED(&head->lock);
+		lockdep_set_class(&head->lock, &dlock_list_key);
+	}
+
+	*pdlock_head = dlock_head;
+	return 0;
+}
+
+/*
+ * List selection is based on the CPU being used when the dlock_list_add()
+ * function is called. However, deletion may be done by a different CPU.
+ * So we still need to use a lock to protect the content of the list.
+ */
+void dlock_list_add(struct dlock_list_node *node, struct dlock_list_head *head)
+{
+	struct dlock_list_head *myhead;
+
+	/*
+	 * Disable preemption to make sure that the CPU won't change.
+	 */
+	myhead = get_cpu_ptr(head);
+	spin_lock(&myhead->lock);
+	node->lockptr = &myhead->lock;
+	list_add(&node->list, &myhead->list);
+	spin_unlock(&myhead->lock);
+	put_cpu_ptr(head);
+}
+
+/*
+ * Delete a node from a dlock list
+ *
+ * We need to check the lock pointer again after taking the lock to guard
+ * against concurrent delete of the same node. If the lock pointer changes
+ * (becomes NULL or to a different one), we assume that the deletion was done
+ * elsewhere.
+ */
+void dlock_list_del(struct dlock_list_node *node)
+{
+	spinlock_t *lock = READ_ONCE(node->lockptr);
+
+	if (unlikely(!lock)) {
+		WARN(1, "dlock_list_del: node 0x%lx has no associated lock\n",
+			(unsigned long)node);
+		return;
+	}
+
+	spin_lock(lock);
+	if (likely(lock == node->lockptr)) {
+		list_del_init(&node->list);
+		node->lockptr = NULL;
+	} else {
+		/*
+		 * This path should never be executed.
+		 */
+		WARN_ON(1);
+	}
+	spin_unlock(lock);
+}
-- 
1.7.1


* [RESEND PATCH 2/5] lib/dlock-list: Add __percpu modifier for parameters
  2016-06-07 19:35 [RESEND PATCH 0/5] vfs: Use dlock list for SB's s_inodes list Waiman Long
  2016-06-07 19:35 ` [RESEND PATCH 1/5] lib/dlock-list: Distributed and lock-protected lists Waiman Long
@ 2016-06-07 19:35 ` Waiman Long
  2016-06-07 19:35 ` [RESEND PATCH 3/5] fsnotify: Simplify inode iteration on umount Waiman Long
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Waiman Long @ 2016-06-07 19:35 UTC
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng, Scott J Norton,
	Douglas Hatch, Waiman Long

From: Boqun Feng <boqun.feng@gmail.com>

Add __percpu modifier properly to help:

1.	Distinguish pointers to actual structures from those to percpu
	structures, which could improve readability.

2. 	Prevent sparse from complaining about "different address spaces"

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 include/linux/dlock-list.h |   18 ++++++++++--------
 lib/dlock-list.c           |    5 +++--
 2 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/include/linux/dlock-list.h b/include/linux/dlock-list.h
index 43355f8..a8e1fd2 100644
--- a/include/linux/dlock-list.h
+++ b/include/linux/dlock-list.h
@@ -108,7 +108,8 @@ static inline void init_dlock_list_node(struct dlock_list_node *node)
 	node->lockptr = NULL;
 }
 
-static inline void free_dlock_list_head(struct dlock_list_head **pdlock_head)
+static inline void
+free_dlock_list_head(struct dlock_list_head __percpu **pdlock_head)
 {
 	free_percpu(*pdlock_head);
 	*pdlock_head = NULL;
@@ -117,7 +118,7 @@ static inline void free_dlock_list_head(struct dlock_list_head **pdlock_head)
 /*
  * Check if all the per-cpu lists are empty
  */
-static inline bool dlock_list_empty(struct dlock_list_head *dlock_head)
+static inline bool dlock_list_empty(struct dlock_list_head __percpu *dlock_head)
 {
 	int cpu;
 
@@ -134,7 +135,7 @@ static inline bool dlock_list_empty(struct dlock_list_head *dlock_head)
  * Return: true if the entry is found, false if all the lists exhausted
  */
 static __always_inline bool
-__dlock_list_next_cpu(struct dlock_list_head *head,
+__dlock_list_next_cpu(struct dlock_list_head __percpu *head,
 		      struct dlock_list_state *state)
 {
 	if (state->lock)
@@ -172,7 +173,7 @@ next_cpu:
  *
  * Return: true if the next entry is found, false if all the entries iterated
  */
-static inline bool dlock_list_iterate(struct dlock_list_head *head,
+static inline bool dlock_list_iterate(struct dlock_list_head __percpu *head,
 				      struct dlock_list_state *state)
 {
 	/*
@@ -200,8 +201,9 @@ static inline bool dlock_list_iterate(struct dlock_list_head *head,
  *
  * Return: true if the next entry is found, false if all the entries iterated
  */
-static inline bool dlock_list_iterate_safe(struct dlock_list_head *head,
-					   struct dlock_list_state *state)
+static inline bool
+dlock_list_iterate_safe(struct dlock_list_head __percpu *head,
+			struct dlock_list_state *state)
 {
 	/*
 	 * Find next entry
@@ -226,8 +228,8 @@ static inline bool dlock_list_iterate_safe(struct dlock_list_head *head,
 }
 
 extern void dlock_list_add(struct dlock_list_node *node,
-			  struct dlock_list_head *head);
+			   struct dlock_list_head __percpu *head);
 extern void dlock_list_del(struct dlock_list_node *node);
-extern int  init_dlock_list_head(struct dlock_list_head **pdlock_head);
+extern int  init_dlock_list_head(struct dlock_list_head __percpu **pdlock_head);
 
 #endif /* __LINUX_DLOCK_LIST_H */
diff --git a/lib/dlock-list.c b/lib/dlock-list.c
index 84d4623..e1a1930 100644
--- a/lib/dlock-list.c
+++ b/lib/dlock-list.c
@@ -27,7 +27,7 @@ static struct lock_class_key dlock_list_key;
 /*
  * Initialize the per-cpu list head
  */
-int init_dlock_list_head(struct dlock_list_head **pdlock_head)
+int init_dlock_list_head(struct dlock_list_head __percpu **pdlock_head)
 {
 	struct dlock_list_head *dlock_head;
 	int cpu;
@@ -53,7 +53,8 @@ int init_dlock_list_head(struct dlock_list_head **pdlock_head)
  * function is called. However, deletion may be done by a different CPU.
  * So we still need to use a lock to protect the content of the list.
  */
-void dlock_list_add(struct dlock_list_node *node, struct dlock_list_head *head)
+void dlock_list_add(struct dlock_list_node *node,
+		    struct dlock_list_head __percpu *head)
 {
 	struct dlock_list_head *myhead;
 
-- 
1.7.1


* [RESEND PATCH 3/5] fsnotify: Simplify inode iteration on umount
  2016-06-07 19:35 [RESEND PATCH 0/5] vfs: Use dlock list for SB's s_inodes list Waiman Long
  2016-06-07 19:35 ` [RESEND PATCH 1/5] lib/dlock-list: Distributed and lock-protected lists Waiman Long
  2016-06-07 19:35 ` [RESEND PATCH 2/5] lib/dlock-list: Add __percpu modifier for parameters Waiman Long
@ 2016-06-07 19:35 ` Waiman Long
  2016-06-07 19:35 ` [RESEND PATCH 4/5] vfs: Remove unnecessary list_for_each_entry_safe() variants Waiman Long
  2016-06-07 19:35 ` [RESEND PATCH 5/5] vfs: Use dlock list for superblock's inode list Waiman Long
  4 siblings, 0 replies; 9+ messages in thread
From: Waiman Long @ 2016-06-07 19:35 UTC
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng, Scott J Norton,
	Douglas Hatch, Jan Kara, Waiman Long

From: Jan Kara <jack@suse.cz>

fsnotify_unmount_inodes() plays complex tricks to pin the next inode in
the sb->s_inodes list when iterating over all inodes. If we switch to
keeping the current inode pinned somewhat longer, we can make the code
much simpler and more standard.
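
As a rough userspace illustration of the "keep the current inode pinned
somewhat longer" idea (not the fsnotify code itself), the loop below
takes a reference on the current object, drops the list lock to do the
work, and releases the previous object's reference only after the lock
has been dropped. The names obj, obj_get, obj_put, list_lock and
visit_all are hypothetical, and a pthread mutex stands in for
s_inode_list_lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct obj {				/* hypothetical stand-in for struct inode */
	struct obj *next;		/* singly linked for brevity */
	int refcount;			/* protected by list_lock in this sketch */
	int visited;
};

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Take a reference; the caller must hold list_lock. */
static void obj_get(struct obj *o)
{
	o->refcount++;
}

/* Drop a reference; takes list_lock itself, so never call it with the lock held. */
static void obj_put(struct obj *o)
{
	pthread_mutex_lock(&list_lock);
	assert(o->refcount > 0);
	o->refcount--;
	pthread_mutex_unlock(&list_lock);
}

static void visit_all(struct obj *head)
{
	struct obj *o, *to_put = NULL;

	pthread_mutex_lock(&list_lock);
	for (o = head; o; o = o->next) {
		obj_get(o);			/* pin the current object */
		pthread_mutex_unlock(&list_lock);

		if (to_put)
			obj_put(to_put);	/* previous object is no longer needed */

		o->visited++;			/* work done without the list lock */
		to_put = o;			/* stay pinned until o->next has been read */

		pthread_mutex_lock(&list_lock);
	}
	pthread_mutex_unlock(&list_lock);

	if (to_put)
		obj_put(to_put);		/* release the last pinned object */
}
```

Because the current object is still referenced when the lock is retaken,
its next pointer can be followed with a plain iterator, and no
look-ahead pinning of the next entry is needed.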

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 fs/notify/inode_mark.c |   45 +++++++++------------------------------------
 1 files changed, 9 insertions(+), 36 deletions(-)

diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index 741077d..a364524 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -150,12 +150,10 @@ int fsnotify_add_inode_mark(struct fsnotify_mark *mark,
  */
 void fsnotify_unmount_inodes(struct super_block *sb)
 {
-	struct inode *inode, *next_i, *need_iput = NULL;
+	struct inode *inode, *iput_inode = NULL;
 
 	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next_i, &sb->s_inodes, i_sb_list) {
-		struct inode *need_iput_tmp;
-
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		/*
 		 * We cannot __iget() an inode in state I_FREEING,
 		 * I_WILL_FREE, or I_NEW which is fine because by that point
@@ -178,49 +176,24 @@ void fsnotify_unmount_inodes(struct super_block *sb)
 			continue;
 		}
 
-		need_iput_tmp = need_iput;
-		need_iput = NULL;
-
-		/* In case fsnotify_inode_delete() drops a reference. */
-		if (inode != need_iput_tmp)
-			__iget(inode);
-		else
-			need_iput_tmp = NULL;
+		__iget(inode);
 		spin_unlock(&inode->i_lock);
-
-		/* In case the dropping of a reference would nuke next_i. */
-		while (&next_i->i_sb_list != &sb->s_inodes) {
-			spin_lock(&next_i->i_lock);
-			if (!(next_i->i_state & (I_FREEING | I_WILL_FREE)) &&
-						atomic_read(&next_i->i_count)) {
-				__iget(next_i);
-				need_iput = next_i;
-				spin_unlock(&next_i->i_lock);
-				break;
-			}
-			spin_unlock(&next_i->i_lock);
-			next_i = list_next_entry(next_i, i_sb_list);
-		}
-
-		/*
-		 * We can safely drop s_inode_list_lock here because either
-		 * we actually hold references on both inode and next_i or
-		 * end of list.  Also no new inodes will be added since the
-		 * umount has begun.
-		 */
 		spin_unlock(&sb->s_inode_list_lock);
 
-		if (need_iput_tmp)
-			iput(need_iput_tmp);
+		if (iput_inode)
+			iput(iput_inode);
 
 		/* for each watch, send FS_UNMOUNT and then remove it */
 		fsnotify(inode, FS_UNMOUNT, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
 
 		fsnotify_inode_delete(inode);
 
-		iput(inode);
+		iput_inode = inode;
 
 		spin_lock(&sb->s_inode_list_lock);
 	}
 	spin_unlock(&sb->s_inode_list_lock);
+
+	if (iput_inode)
+		iput(iput_inode);
 }
-- 
1.7.1


* [RESEND PATCH 4/5] vfs: Remove unnecessary list_for_each_entry_safe() variants
  2016-06-07 19:35 [RESEND PATCH 0/5] vfs: Use dlock list for SB's s_inodes list Waiman Long
                   ` (2 preceding siblings ...)
  2016-06-07 19:35 ` [RESEND PATCH 3/5] fsnotify: Simplify inode iteration on umount Waiman Long
@ 2016-06-07 19:35 ` Waiman Long
  2016-06-07 19:35 ` [RESEND PATCH 5/5] vfs: Use dlock list for superblock's inode list Waiman Long
  4 siblings, 0 replies; 9+ messages in thread
From: Waiman Long @ 2016-06-07 19:35 UTC
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng, Scott J Norton,
	Douglas Hatch, Jan Kara, Waiman Long

From: Jan Kara <jack@suse.cz>

evict_inodes() and invalidate_inodes() use list_for_each_entry_safe()
to iterate the sb->s_inodes list. However, since we use the i_lru list
entry for our local temporary list of inodes to destroy, the inode is
guaranteed to stay in the sb->s_inodes list while we hold
sb->s_inode_list_lock. So there is no real need for the safe iteration
variant and we can use list_for_each_entry() just fine.
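
The argument can be seen in a small userspace sketch, assuming a
hand-rolled list in place of <linux/list.h>. Each item carries two
links: one for the list being walked and one reused for the dispose
list. Queueing an item for disposal only touches the second link, so
the walk never has its next pointer pulled out from under it. The names
item, sb_link, lru_link, item_of and collect_idle are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *prev, *next; };

static void list_init(struct list_head *h)
{
	h->prev = h->next = h;
}

static void list_add(struct list_head *n, struct list_head *h)
{
	n->next = h->next;
	n->prev = h;
	h->next->prev = n;
	h->next = n;
}

struct item {				/* hypothetical stand-in for struct inode */
	struct list_head sb_link;	/* like i_sb_list: the list being walked */
	struct list_head lru_link;	/* like i_lru: reused for the dispose list */
	int busy;			/* like a non-zero i_count */
};

#define item_of(ptr, member) \
	((struct item *)((char *)(ptr) - offsetof(struct item, member)))

/* Queue every idle item on 'dispose'; returns how many were queued. */
static int collect_idle(struct list_head *all, struct list_head *dispose)
{
	struct list_head *p;
	int n = 0;

	/* Plain iteration: nothing below modifies p->prev or p->next. */
	for (p = all->next; p != all; p = p->next) {
		struct item *it = item_of(p, sb_link);

		if (it->busy)
			continue;
		list_add(&it->lru_link, dispose);	/* sb_link is untouched */
		n++;
	}
	return n;
}
```

The safe variant would only be needed if the loop body unlinked sb_link
itself, which is deferred until the dispose list is processed.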

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 fs/inode.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 69b8b52..c9cbea8 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -596,12 +596,12 @@ static void dispose_list(struct list_head *head)
  */
 void evict_inodes(struct super_block *sb)
 {
-	struct inode *inode, *next;
+	struct inode *inode;
 	LIST_HEAD(dispose);
 
 again:
 	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		if (atomic_read(&inode->i_count))
 			continue;
 
@@ -646,11 +646,11 @@ again:
 int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 {
 	int busy = 0;
-	struct inode *inode, *next;
+	struct inode *inode;
 	LIST_HEAD(dispose);
 
 	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
 			spin_unlock(&inode->i_lock);
-- 
1.7.1


* [RESEND PATCH 5/5] vfs: Use dlock list for superblock's inode list
  2016-06-07 19:35 [RESEND PATCH 0/5] vfs: Use dlock list for SB's s_inodes list Waiman Long
                   ` (3 preceding siblings ...)
  2016-06-07 19:35 ` [RESEND PATCH 4/5] vfs: Remove unnecessary list_for_each_entry_safe() variants Waiman Long
@ 2016-06-07 19:35 ` Waiman Long
  4 siblings, 0 replies; 9+ messages in thread
From: Waiman Long @ 2016-06-07 19:35 UTC
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng, Scott J Norton,
	Douglas Hatch, Waiman Long

When many threads are trying to add inodes to or delete inodes from
a superblock's s_inodes list, spinlock contention on the list can
become a performance bottleneck.

This patch changes the s_inodes field to become a dlock list which
is a distributed set of lists with per-list spinlocks.  As a result,
the following superblock inode list (sb->s_inodes) iteration functions
in vfs are also being modified:

 1. iterate_bdevs()
 2. drop_pagecache_sb()
 3. wait_sb_inodes()
 4. evict_inodes()
 5. invalidate_inodes()
 6. fsnotify_unmount_inodes()
 7. add_dquot_ref()
 8. remove_dquot_ref()

The patch was tested with an exit microbenchmark that creates a large
number of threads, attaches many inodes to them and then exits. The
runtimes of that microbenchmark with 1000 threads before and after the
patch on a 4-socket Intel E7-4820 v3 system (40 cores, 80 threads) were
as follows:

  Kernel            Elapsed Time    System Time
  ------            ------------    -----------
  Vanilla 4.5-rc4      65.29s         82m14s
  Patched 4.5-rc4      22.81s         23m03s

Before the patch, spinlock contention in the inode_sb_list_add()
function during the startup phase and the inode_sb_list_del() function
during the exit phase accounted for about 79% and 93% of total CPU time
respectively (as measured by perf). After the patch, the
dlock_list_add() function consumed only about 0.04% of CPU time during
the startup phase. The dlock_list_del() function consumed about 0.4% of
CPU time during the exit phase. There was still some spinlock
contention, but it happened elsewhere.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/block_dev.c         |   13 +++++++------
 fs/drop_caches.c       |   10 +++++-----
 fs/fs-writeback.c      |   13 +++++++------
 fs/inode.c             |   36 +++++++++++++++---------------------
 fs/notify/inode_mark.c |   10 +++++-----
 fs/quota/dquot.c       |   16 ++++++++--------
 fs/super.c             |    7 ++++---
 include/linux/fs.h     |    8 ++++----
 8 files changed, 55 insertions(+), 58 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 20a2c02..967d746 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1884,11 +1884,13 @@ EXPORT_SYMBOL(__invalidate_device);
 void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
 {
 	struct inode *inode, *old_inode = NULL;
+	DEFINE_DLOCK_LIST_STATE(state);
 
-	spin_lock(&blockdev_superblock->s_inode_list_lock);
-	list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
-		struct address_space *mapping = inode->i_mapping;
+	while (dlock_list_iterate(blockdev_superblock->s_inodes, &state)) {
+		struct address_space *mapping;
 
+		inode   = list_entry(state.curr, struct inode, i_sb_list);
+		mapping = inode->i_mapping;
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
 		    mapping->nrpages == 0) {
@@ -1897,7 +1899,7 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&blockdev_superblock->s_inode_list_lock);
+		spin_unlock(state.lock);
 		/*
 		 * We hold a reference to 'inode' so it couldn't have been
 		 * removed from s_inodes list while we dropped the
@@ -1911,8 +1913,7 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
 
 		func(I_BDEV(inode), arg);
 
-		spin_lock(&blockdev_superblock->s_inode_list_lock);
+		spin_lock(state.lock);
 	}
-	spin_unlock(&blockdev_superblock->s_inode_list_lock);
 	iput(old_inode);
 }
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index d72d52b..26b6c68 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -16,9 +16,10 @@ int sysctl_drop_caches;
 static void drop_pagecache_sb(struct super_block *sb, void *unused)
 {
 	struct inode *inode, *toput_inode = NULL;
+	DEFINE_DLOCK_LIST_STATE(state);
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	while (dlock_list_iterate(sb->s_inodes, &state)) {
+		inode = list_entry(state.curr, struct inode, i_sb_list);
 		spin_lock(&inode->i_lock);
 		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
 		    (inode->i_mapping->nrpages == 0)) {
@@ -27,15 +28,14 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
+		spin_unlock(state.lock);
 
 		invalidate_mapping_pages(inode->i_mapping, 0, -1);
 		iput(toput_inode);
 		toput_inode = inode;
 
-		spin_lock(&sb->s_inode_list_lock);
+		spin_lock(state.lock);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 	iput(toput_inode);
 }
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 592cea5..3378ea9 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -2154,6 +2154,7 @@ EXPORT_SYMBOL(__mark_inode_dirty);
 static void wait_sb_inodes(struct super_block *sb)
 {
 	struct inode *inode, *old_inode = NULL;
+	DEFINE_DLOCK_LIST_STATE(state);
 
 	/*
 	 * We need to be protected against the filesystem going from
@@ -2162,7 +2163,6 @@ static void wait_sb_inodes(struct super_block *sb)
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
 	mutex_lock(&sb->s_sync_lock);
-	spin_lock(&sb->s_inode_list_lock);
 
 	/*
 	 * Data integrity sync. Must wait for all pages under writeback,
@@ -2171,9 +2171,11 @@ static void wait_sb_inodes(struct super_block *sb)
 	 * In which case, the inode may not be on the dirty list, but
 	 * we still have to wait for that writeout.
 	 */
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		struct address_space *mapping = inode->i_mapping;
+	while (dlock_list_iterate(sb->s_inodes, &state)) {
+		struct address_space *mapping;
 
+		inode   = list_entry(state.curr, struct inode, i_sb_list);
+		mapping = inode->i_mapping;
 		spin_lock(&inode->i_lock);
 		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
 		    (mapping->nrpages == 0)) {
@@ -2182,7 +2184,7 @@ static void wait_sb_inodes(struct super_block *sb)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
+		spin_unlock(state.lock);
 
 		/*
 		 * We hold a reference to 'inode' so it couldn't have been
@@ -2204,9 +2206,8 @@ static void wait_sb_inodes(struct super_block *sb)
 
 		cond_resched();
 
-		spin_lock(&sb->s_inode_list_lock);
+		spin_lock(state.lock);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 	iput(old_inode);
 	mutex_unlock(&sb->s_sync_lock);
 }
diff --git a/fs/inode.c b/fs/inode.c
index c9cbea8..9b6084d 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -28,7 +28,7 @@
  *   inode->i_state, inode->i_hash, __iget()
  * Inode LRU list locks protect:
  *   inode->i_sb->s_inode_lru, inode->i_lru
- * inode->i_sb->s_inode_list_lock protects:
+ * inode->i_sb->s_inodes->lock protects:
  *   inode->i_sb->s_inodes, inode->i_sb_list
  * bdi->wb.list_lock protects:
  *   bdi->wb.b_{dirty,io,more_io,dirty_time}, inode->i_io_list
@@ -37,7 +37,7 @@
  *
  * Lock ordering:
  *
- * inode->i_sb->s_inode_list_lock
+ * inode->i_sb->s_inodes->lock
  *   inode->i_lock
  *     Inode LRU list locks
  *
@@ -45,7 +45,7 @@
  *   inode->i_lock
  *
  * inode_hash_lock
- *   inode->i_sb->s_inode_list_lock
+ *   inode->i_sb->s_inodes->lock
  *   inode->i_lock
  *
  * iunique_lock
@@ -430,19 +430,14 @@ static void inode_lru_list_del(struct inode *inode)
  */
 void inode_sb_list_add(struct inode *inode)
 {
-	spin_lock(&inode->i_sb->s_inode_list_lock);
-	list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
-	spin_unlock(&inode->i_sb->s_inode_list_lock);
+	dlock_list_add(&inode->i_sb_list, inode->i_sb->s_inodes);
 }
 EXPORT_SYMBOL_GPL(inode_sb_list_add);
 
 static inline void inode_sb_list_del(struct inode *inode)
 {
-	if (!list_empty(&inode->i_sb_list)) {
-		spin_lock(&inode->i_sb->s_inode_list_lock);
-		list_del_init(&inode->i_sb_list);
-		spin_unlock(&inode->i_sb->s_inode_list_lock);
-	}
+	if (!list_empty(&inode->i_sb_list.list))
+		dlock_list_del(&inode->i_sb_list);
 }
 
 static unsigned long hash(struct super_block *sb, unsigned long hashval)
@@ -597,11 +592,13 @@ static void dispose_list(struct list_head *head)
 void evict_inodes(struct super_block *sb)
 {
 	struct inode *inode;
+	struct dlock_list_state state;
 	LIST_HEAD(dispose);
 
 again:
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	init_dlock_list_state(&state);
+	while (dlock_list_iterate(sb->s_inodes, &state)) {
+		inode = list_entry(state.curr, struct inode, i_sb_list);
 		if (atomic_read(&inode->i_count))
 			continue;
 
@@ -622,13 +619,12 @@ again:
 		 * bit so we don't livelock.
 		 */
 		if (need_resched()) {
-			spin_unlock(&sb->s_inode_list_lock);
+			spin_unlock(state.lock);
 			cond_resched();
 			dispose_list(&dispose);
 			goto again;
 		}
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 
 	dispose_list(&dispose);
 }
@@ -648,9 +644,10 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 	int busy = 0;
 	struct inode *inode;
 	LIST_HEAD(dispose);
+	DEFINE_DLOCK_LIST_STATE(state);
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	while (dlock_list_iterate(sb->s_inodes, &state)) {
+		inode = list_entry(state.curr, struct inode, i_sb_list);
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
 			spin_unlock(&inode->i_lock);
@@ -672,7 +669,6 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 		spin_unlock(&inode->i_lock);
 		list_add(&inode->i_lru, &dispose);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 
 	dispose_list(&dispose);
 
@@ -887,7 +883,7 @@ struct inode *new_inode_pseudo(struct super_block *sb)
 		spin_lock(&inode->i_lock);
 		inode->i_state = 0;
 		spin_unlock(&inode->i_lock);
-		INIT_LIST_HEAD(&inode->i_sb_list);
+		init_dlock_list_node(&inode->i_sb_list);
 	}
 	return inode;
 }
@@ -908,8 +904,6 @@ struct inode *new_inode(struct super_block *sb)
 {
 	struct inode *inode;
 
-	spin_lock_prefetch(&sb->s_inode_list_lock);
-
 	inode = new_inode_pseudo(sb);
 	if (inode)
 		inode_sb_list_add(inode);
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index a364524..2639522 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -151,14 +151,15 @@ int fsnotify_add_inode_mark(struct fsnotify_mark *mark,
 void fsnotify_unmount_inodes(struct super_block *sb)
 {
 	struct inode *inode, *iput_inode = NULL;
+	DEFINE_DLOCK_LIST_STATE(state);
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	while (dlock_list_iterate(sb->s_inodes, &state)) {
 		/*
 		 * We cannot __iget() an inode in state I_FREEING,
 		 * I_WILL_FREE, or I_NEW which is fine because by that point
 		 * the inode cannot have any associated watches.
 		 */
+		inode = list_entry(state.curr, struct inode, i_sb_list);
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
 			spin_unlock(&inode->i_lock);
@@ -178,7 +179,7 @@ void fsnotify_unmount_inodes(struct super_block *sb)
 
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
+		spin_unlock(state.lock);
 
 		if (iput_inode)
 			iput(iput_inode);
@@ -190,9 +191,8 @@ void fsnotify_unmount_inodes(struct super_block *sb)
 
 		iput_inode = inode;
 
-		spin_lock(&sb->s_inode_list_lock);
+		spin_lock(state.lock);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 
 	if (iput_inode)
 		iput(iput_inode);
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index ff21980..cb13619 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -936,12 +936,13 @@ static int dqinit_needed(struct inode *inode, int type)
 static void add_dquot_ref(struct super_block *sb, int type)
 {
 	struct inode *inode, *old_inode = NULL;
+	DEFINE_DLOCK_LIST_STATE(state);
 #ifdef CONFIG_QUOTA_DEBUG
 	int reserved = 0;
 #endif
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	while (dlock_list_iterate(sb->s_inodes, &state)) {
+		inode = list_entry(state.curr, struct inode, i_sb_list);
 		spin_lock(&inode->i_lock);
 		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
 		    !atomic_read(&inode->i_writecount) ||
@@ -951,7 +952,7 @@ static void add_dquot_ref(struct super_block *sb, int type)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
+		spin_unlock(state.lock);
 
 #ifdef CONFIG_QUOTA_DEBUG
 		if (unlikely(inode_get_rsv_space(inode) > 0))
@@ -969,9 +970,8 @@ static void add_dquot_ref(struct super_block *sb, int type)
 		 * later.
 		 */
 		old_inode = inode;
-		spin_lock(&sb->s_inode_list_lock);
+		spin_lock(state.lock);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 	iput(old_inode);
 
 #ifdef CONFIG_QUOTA_DEBUG
@@ -1039,15 +1039,16 @@ static void remove_dquot_ref(struct super_block *sb, int type,
 {
 	struct inode *inode;
 	int reserved = 0;
+	DEFINE_DLOCK_LIST_STATE(state);
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	while (dlock_list_iterate(sb->s_inodes, &state)) {
 		/*
 		 *  We have to scan also I_NEW inodes because they can already
 		 *  have quota pointer initialized. Luckily, we need to touch
 		 *  only quota pointers and these have separate locking
 		 *  (dq_data_lock).
 		 */
+		inode = list_entry(state.curr, struct inode, i_sb_list);
 		spin_lock(&dq_data_lock);
 		if (!IS_NOQUOTA(inode)) {
 			if (unlikely(inode_get_rsv_space(inode) > 0))
@@ -1056,7 +1057,6 @@ static void remove_dquot_ref(struct super_block *sb, int type,
 		}
 		spin_unlock(&dq_data_lock);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 #ifdef CONFIG_QUOTA_DEBUG
 	if (reserved) {
 		printk(KERN_WARNING "VFS (%s): Writes happened after quota"
diff --git a/fs/super.c b/fs/super.c
index 74914b1..6c167d3 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -163,6 +163,7 @@ static void destroy_super(struct super_block *s)
 {
 	list_lru_destroy(&s->s_dentry_lru);
 	list_lru_destroy(&s->s_inode_lru);
+	free_dlock_list_head(&s->s_inodes);
 	security_sb_free(s);
 	WARN_ON(!list_empty(&s->s_mounts));
 	kfree(s->s_subtype);
@@ -204,9 +205,9 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 	INIT_HLIST_NODE(&s->s_instances);
 	INIT_HLIST_BL_HEAD(&s->s_anon);
 	mutex_init(&s->s_sync_lock);
-	INIT_LIST_HEAD(&s->s_inodes);
-	spin_lock_init(&s->s_inode_list_lock);
 
+	if (init_dlock_list_head(&s->s_inodes))
+		goto fail;
 	if (list_lru_init_memcg(&s->s_dentry_lru))
 		goto fail;
 	if (list_lru_init_memcg(&s->s_inode_lru))
@@ -427,7 +428,7 @@ void generic_shutdown_super(struct super_block *sb)
 		if (sop->put_super)
 			sop->put_super(sb);
 
-		if (!list_empty(&sb->s_inodes)) {
+		if (!dlock_list_empty(sb->s_inodes)) {
 			printk("VFS: Busy inodes after unmount of %s. "
 			   "Self-destruct in 5 seconds.  Have a nice day...\n",
 			   sb->s_id);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 70e61b5..d22b213 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -32,6 +32,7 @@
 #include <linux/workqueue.h>
 #include <linux/percpu-rwsem.h>
 #include <linux/delayed_call.h>
+#include <linux/dlock-list.h>
 
 #include <asm/byteorder.h>
 #include <uapi/linux/fs.h>
@@ -651,7 +652,7 @@ struct inode {
 	u16			i_wb_frn_history;
 #endif
 	struct list_head	i_lru;		/* inode LRU list */
-	struct list_head	i_sb_list;
+	struct dlock_list_node	i_sb_list;
 	union {
 		struct hlist_head	i_dentry;
 		struct rcu_head		i_rcu;
@@ -1416,9 +1417,8 @@ struct super_block {
 	 */
 	int s_stack_depth;
 
-	/* s_inode_list_lock protects s_inodes */
-	spinlock_t		s_inode_list_lock ____cacheline_aligned_in_smp;
-	struct list_head	s_inodes;	/* all inodes */
+	/* The percpu locks protect s_inodes */
+	struct dlock_list_head __percpu *s_inodes;	/* all inodes */
 };
 
 extern struct timespec current_fs_time(struct super_block *sb);
-- 
1.7.1
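
The converted loops above all follow one pattern: dlock_list_iterate()
returns with state.lock held and state.curr pointing at the next entry,
the loop body may drop and retake state.lock, and no unlock follows the
loop. The sketch below is a single-threaded userspace model of that
hand-off, inferred only from the call sites in this patch. It is not the
lib/dlock-list implementation from patch 1: NR_LISTS, the toy lock (a
flag that only lets pairing be asserted) and every dlist_* and toy_*
name are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_LISTS 4		/* assumed stand-in for the number of per-CPU lists */

/* Toy lock: records only whether it is held, so pairing can be asserted. */
struct toy_lock { bool held; };

static void toy_acquire(struct toy_lock *l) { assert(!l->held); l->held = true; }
static void toy_release(struct toy_lock *l) { assert(l->held); l->held = false; }

struct node {
	struct node *next;
	int value;
};

struct dlist_head {
	struct toy_lock lock;	/* protects 'first' and the chain behind it */
	struct node *first;
};

struct dlist_state {
	int idx;		/* list being walked; -1 before the first call */
	struct toy_lock *lock;	/* lock held while curr is valid, else NULL */
	struct node *curr;	/* current entry, NULL between lists */
};

static void dlist_add(struct dlist_head *heads, int idx, struct node *n)
{
	assert(idx >= 0 && idx < NR_LISTS);
	toy_acquire(&heads[idx].lock);
	n->next = heads[idx].first;
	heads[idx].first = n;
	toy_release(&heads[idx].lock);
}

/*
 * Advance to the next entry. Returns true with st->lock held and st->curr
 * set; returns false with no lock held once every list has been walked.
 */
static bool dlist_iterate(struct dlist_head *heads, struct dlist_state *st)
{
	if (st->curr && st->curr->next) {
		st->curr = st->curr->next;	/* same list, lock still held */
		return true;
	}
	if (st->lock) {				/* finished a list: drop its lock */
		toy_release(st->lock);
		st->lock = NULL;
		st->curr = NULL;
	}
	while (st->idx + 1 < NR_LISTS) {
		struct dlist_head *h = &heads[++st->idx];

		toy_acquire(&h->lock);
		if (h->first) {
			st->lock = &h->lock;
			st->curr = h->first;
			return true;
		}
		toy_release(&h->lock);
	}
	return false;
}

/* Walk every entry, dropping the list lock around the per-entry work. */
static int dlist_sum(struct dlist_head *heads)
{
	struct dlist_state st = { .idx = -1, .lock = NULL, .curr = NULL };
	int sum = 0;

	while (dlist_iterate(heads, &st)) {
		int v = st.curr->value;

		toy_release(st.lock);	/* mirrors spin_unlock(state.lock) */
		sum += v;		/* work done without the list lock */
		toy_acquire(st.lock);	/* mirrors spin_lock(state.lock) */
	}
	return sum;	/* no unlock: the final iterate call already dropped it */
}

/* Report whether any per-list lock is still held. */
static bool dlist_any_locked(const struct dlist_head *heads)
{
	for (int i = 0; i < NR_LISTS; i++)
		if (heads[i].lock.held)
			return true;
	return false;
}
```

In the model nothing is removed while the lock is dropped, so st.curr
stays valid. In the patch the callers take a reference with __iget()
before unlocking, which is what keeps the current inode on its list.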

* Re: [RESEND PATCH 1/5] lib/dlock-list: Distributed and lock-protected lists
  2016-06-07 19:35 ` [RESEND PATCH 1/5] lib/dlock-list: Distributed and lock-protected lists Waiman Long
@ 2016-06-07 20:13   ` Andi Kleen
  2016-06-07 23:53     ` Waiman Long
  2016-07-11 17:37     ` Waiman Long
  0 siblings, 2 replies; 9+ messages in thread
From: Andi Kleen @ 2016-06-07 20:13 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Boqun Feng, Scott J Norton, Douglas Hatch

On Tue, Jun 07, 2016 at 03:35:51PM -0400, Waiman Long wrote:
> Linked list is used everywhere in the Linux kernel. However, if many
> threads are trying to add or delete entries into the same linked list,
> it can create a performance bottleneck.
> 
> This patch introduces new list APIs that provide a set of distributed
> lists (one per CPU), each of which is protected by its own spinlock.

One thing I don't like is that it is per CPU. One per CPU is almost
certainly overkill and not needed for true scalability, especially
on systems using SMT. Also it makes the case where everything has to
be walked more and more expensive, because all these locks have to
be taken. Even when not contended this will add up.

It would be better to do this per every Nth CPU. Now I don't have
a clear answer what the best N is, but I'm pretty sure it's > 1.
For example at least on SMT systems only per core instead of per
thread. Likely even more coarse grained, although per socket
may be not good enough.

-Andi
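
The "one list per N CPUs" idea can be sketched as plain index
arithmetic. This is only an illustration: nr_lists_for() and
list_index_for() are hypothetical names, not part of the posted
patches, and dividing the CPU number assumes that the CPUs meant to
share a list are numbered consecutively. SMT siblings are not always
numbered that way, so real code would more likely consult the kernel's
topology information.

```c
#include <assert.h>

/* Hypothetical helper: number of lists needed when N CPUs share one list. */
static int nr_lists_for(int nr_cpus, int cpus_per_list)
{
	assert(nr_cpus > 0 && cpus_per_list > 0);
	return (nr_cpus + cpus_per_list - 1) / cpus_per_list;	/* round up */
}

/* Hypothetical helper: which list a given CPU adds its entries to. */
static int list_index_for(int cpu, int cpus_per_list)
{
	assert(cpu >= 0 && cpus_per_list > 0);
	return cpu / cpus_per_list;	/* assumes consecutive numbering per group */
}
```

With cpus_per_list == 1 this degenerates to the per-CPU layout of the
patch. Larger values reduce the number of locks a full walk has to
take, at the price of more CPUs contending on each list.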

* Re: [RESEND PATCH 1/5] lib/dlock-list: Distributed and lock-protected lists
  2016-06-07 20:13   ` Andi Kleen
@ 2016-06-07 23:53     ` Waiman Long
  2016-07-11 17:37     ` Waiman Long
  1 sibling, 0 replies; 9+ messages in thread
From: Waiman Long @ 2016-06-07 23:53 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Dave Chinner, Boqun Feng,
	Scott J Norton, Douglas Hatch

On 06/07/2016 04:13 PM, Andi Kleen wrote:
> On Tue, Jun 07, 2016 at 03:35:51PM -0400, Waiman Long wrote:
>> Linked list is used everywhere in the Linux kernel. However, if many
>> threads are trying to add or delete entries into the same linked list,
>> it can create a performance bottleneck.
>>
>> This patch introduces new list APIs that provide a set of distributed
>> lists (one per CPU), each of which is protected by its own spinlock.
> One thing I don't like is that it is per CPU. One per CPU is almost
> certainly overkill and not needed for true scalability, especially
> on systems using SMT. Also it makes the case where everything has to
> be walked more and more expensive, because all these locks have to
> be taken. Even when not contended this will add up.
>
> It would be better to do this per every Nth CPU. Now I don't have
> a clear answer what the best N is, but I'm pretty sure it's > 1.
> For example at least on SMT systems only per core instead of per
> thread. Likely even more coarse grained, although per socket
> may be not good enough.
>
> -Andi

Thanks for the comment. That will need a new per group of cpus construct 
somewhere between per-cpu and per-node. I will think about this a bit to 
see how to move forward.

Cheers,
Longman

* Re: [RESEND PATCH 1/5] lib/dlock-list: Distributed and lock-protected lists
  2016-06-07 20:13   ` Andi Kleen
  2016-06-07 23:53     ` Waiman Long
@ 2016-07-11 17:37     ` Waiman Long
  1 sibling, 0 replies; 9+ messages in thread
From: Waiman Long @ 2016-07-11 17:37 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Dave Chinner, Boqun Feng,
	Scott J Norton, Douglas Hatch

On 06/07/2016 04:13 PM, Andi Kleen wrote:
> On Tue, Jun 07, 2016 at 03:35:51PM -0400, Waiman Long wrote:
>> Linked list is used everywhere in the Linux kernel. However, if many
>> threads are trying to add or delete entries into the same linked list,
>> it can create a performance bottleneck.
>>
>> This patch introduces new list APIs that provide a set of distributed
>> lists (one per CPU), each of which is protected by its own spinlock.
> One thing I don't like is that it is per CPU. One per CPU is almost
> certainly overkill and not needed for true scalability, especially
> on systems using SMT. Also it makes the case where everything has to
> be walked more and more expensive, because all these locks have to
> be taken. Even when not contended this will add up.

When iterating the lists, the lock shouldn't be taken when a list is empty.
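
A minimal sketch of that point, as a single-threaded userspace model
rather than the dlock-list code itself: an unlocked emptiness check lets
the walk skip a list without touching its lock. struct toy_list,
lock_count and walk_skip_empty() are assumptions made for illustration;
lock_count stands in for spin_lock() calls so the saving can be
observed.

```c
#include <assert.h>
#include <stddef.h>

struct toy_list {
	int nr_entries;		/* stand-in for a list; 0 means empty */
	int lock_count;		/* how many times this list's lock was taken */
};

/* Walk all lists, locking only those that look non-empty. Returns entries seen. */
static int walk_skip_empty(struct toy_list *lists, size_t nr_lists)
{
	int seen = 0;

	assert(lists != NULL);
	for (size_t i = 0; i < nr_lists; i++) {
		if (lists[i].nr_entries == 0)	/* unlocked check, like list_empty() */
			continue;
		lists[i].lock_count++;		/* stands in for spin_lock() */
		seen += lists[i].nr_entries;	/* entries visited under the lock */
		/* the matching spin_unlock() would go here */
	}
	return seen;
}
```

The check is racy by design: an entry added to a list just after it was
found empty is missed, much like an entry added to a list the walk has
already passed.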

> It would be better to do this per every Nth CPU. Now I don't have
> a clear answer what the best N is, but I'm pretty sure it's > 1.
> For example at least on SMT systems only per core instead of per
> thread. Likely even more coarse grained, although per socket
> may be not good enough.
>
> -Andi

I have just sent out an updated patch to map 2 cores to each list.
Maybe you can take a look to see if that is good enough from your point 
of view.

Cheers,
Longman
