* [PATCH v5 0/5] vfs: Use dlock list for SB's s_inodes list
@ 2016-08-09 16:52 Waiman Long
  2016-08-09 16:52 ` [PATCH v5 1/5] lib/dlock-list: Distributed and lock-protected lists Waiman Long
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: Waiman Long @ 2016-08-09 16:52 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng, Scott J Norton,
	Douglas Hatch, Waiman Long

v4->v5:
 - Rebased the patch to 4.8-rc1 (the changes to fs/fs-writeback.c were
   dropped).
 - Use kcalloc() instead of percpu_alloc() to allocate the dlock list
   heads structure as suggested by Christoph Lameter.
 - Replaced patch 5 with another one that makes sibling CPUs use the
   same dlock list head, thus reducing the number of list heads that
   need to be maintained.

v3->v4:
 - As suggested by Al, encapsulate the dlock list mechanism in the
   dlist_for_each_entry() and dlist_for_each_entry_safe() macros,
   which are the equivalents of list_for_each_entry() and
   list_for_each_entry_safe() for regular linked lists. That simplifies
   the changes in the call sites that perform dlock list iterations.
 - Add a new patch to make the percpu head structure cacheline aligned
   to prevent cacheline contention from disrupting the performance
   of nearby percpu variables.

v2->v3:
 - Remove the 2 per-subnode API patches.
 - Merge __percpu tag patch 2 into patch 1.
 - As suggested by Tejun Heo, restructure the dlock_list_head data
   structure to hide the __percpu tag and rename some of the functions
   and structures.
 - Move most of the code from dlock_list.h to dlock_list.c and export
   the symbols.

v1->v2:
 - Add a set of simple per-subnode APIs whose granularity is between
   percpu and per-node.
 - Make the dlock list use the per-subnode APIs so as to reduce the
   total number of separate linked lists that need to be managed
   and iterated.
 - There is no change in patches 1-5.

This is a follow up of the following patchset:

  [PATCH v7 0/4] vfs: Use per-cpu list for SB's s_inodes list
  https://lkml.org/lkml/2016/4/12/1009

Patch 1 introduces the dlock list. The list heads are allocated
by kcalloc() instead of percpu_alloc(). This may slightly increase
cacheline contention when multiple CPUs are accessing the dlock list,
but it improves performance when the whole dlock list needs to be
iterated.

Patch 2 cleans up the fsnotify_unmount_inodes() function by making
the code simpler and more standard.

Patch 3 replaces the use of list_for_each_entry_safe() in
evict_inodes() and invalidate_inodes() by list_for_each_entry().

Patch 4 modifies the superblock and inode structures to use the dlock
list. The corresponding functions that reference those structures
are modified.

Patch 5 makes the sibling CPUs use the same dlock list head to reduce
the number of list heads that need to be iterated.

Jan Kara (2):
  fsnotify: Simplify inode iteration on umount
  vfs: Remove unnecessary list_for_each_entry_safe() variants

Waiman Long (3):
  lib/dlock-list: Distributed and lock-protected lists
  vfs: Use dlock list for superblock's inode list
  lib/dlock-list: Make sibling CPUs share the same linked list

 fs/block_dev.c             |    9 +-
 fs/drop_caches.c           |    9 +-
 fs/inode.c                 |   38 +++----
 fs/notify/inode_mark.c     |   52 ++-------
 fs/quota/dquot.c           |   14 +--
 fs/super.c                 |    7 +-
 include/linux/dlock-list.h |  230 +++++++++++++++++++++++++++++++++++++
 include/linux/fs.h         |    8 +-
 lib/Makefile               |    2 +-
 lib/dlock-list.c           |  268 ++++++++++++++++++++++++++++++++++++++++++++
 10 files changed, 548 insertions(+), 89 deletions(-)
 create mode 100644 include/linux/dlock-list.h
 create mode 100644 lib/dlock-list.c


* [PATCH v5 1/5] lib/dlock-list: Distributed and lock-protected lists
  2016-08-09 16:52 [PATCH v5 0/5] vfs: Use dlock list for SB's s_inodes list Waiman Long
@ 2016-08-09 16:52 ` Waiman Long
  2016-08-09 16:52 ` [PATCH v5 2/5] fsnotify: Simplify inode iteration on umount Waiman Long
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: Waiman Long @ 2016-08-09 16:52 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng, Scott J Norton,
	Douglas Hatch, Waiman Long

Linked lists are used everywhere in the Linux kernel. However, if many
threads are trying to add or delete entries in the same linked list,
it can create a performance bottleneck.

This patch introduces a new set of list APIs that provide a group of
distributed lists (one per CPU), each of which is protected by its own
spinlock. To the callers, however, the set of lists acts like a single
consolidated list. This allows list entry insertion and deletion
operations to happen in parallel instead of being serialized by a
global list and lock.

List entry insertion is strictly per-CPU. List deletion, however, can
happen on a CPU other than the one that did the insertion. So we still
need a lock to protect the list. Because of that, there may still be
a small amount of contention when deletions are being done.

A new header file include/linux/dlock-list.h will be added with the
associated dlock_list_head and dlock_list_node structures. The following
functions are provided to manage the per-CPU lists:

 1. int alloc_dlock_list_heads(struct dlock_list_heads *dlist)
 2. void free_dlock_list_heads(struct dlock_list_heads *dlist)
 3. void dlock_list_add(struct dlock_list_node *node,
		        struct dlock_list_heads *dlist)
 4. void dlock_list_del(struct dlock_list_node *node)

Iteration of all the list entries within a dlock list array
is done by calling either the dlist_for_each_entry() or
dlist_for_each_entry_safe() macros. They correspond to the
list_for_each_entry() and list_for_each_entry_safe() macros
respectively. The iteration state is kept in a dlock_list_iter
structure that is passed to the iteration macros.
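
As an illustration, here is a minimal usage sketch of the API above and
the iterator helpers defined in the new header (struct foo, foo_list
and the fn callback are hypothetical and not part of this patch):

	struct foo {
		int data;
		struct dlock_list_node fnode;	/* links foo into foo_list */
	};

	static struct dlock_list_heads foo_list;

	static int foo_init(void)
	{
		/* allocate one list head per CPU */
		return alloc_dlock_list_heads(&foo_list);
	}

	static void foo_insert(struct foo *f)
	{
		init_dlock_list_node(&f->fnode);
		dlock_list_add(&f->fnode, &foo_list);	/* current CPU's list */
	}

	static void foo_remove(struct foo *f)
	{
		dlock_list_del(&f->fnode);	/* may run on any CPU */
	}

	static void foo_walk_all(void (*fn)(struct foo *))
	{
		struct foo *f;
		DEFINE_DLOCK_LIST_ITER(iter, &foo_list);

		/* takes one per-list lock at a time; not safe vs. removal */
		dlist_for_each_entry(f, &iter, fnode)
			fn(f);
	}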

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 include/linux/dlock-list.h |  227 ++++++++++++++++++++++++++++++++++++++++++++
 lib/Makefile               |    2 +-
 lib/dlock-list.c           |  199 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 427 insertions(+), 1 deletions(-)
 create mode 100644 include/linux/dlock-list.h
 create mode 100644 lib/dlock-list.c

diff --git a/include/linux/dlock-list.h b/include/linux/dlock-list.h
new file mode 100644
index 0000000..19f897e
--- /dev/null
+++ b/include/linux/dlock-list.h
@@ -0,0 +1,227 @@
+/*
+ * Distributed and locked list
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2016 Hewlett-Packard Enterprise Development LP
+ *
+ * Authors: Waiman Long <waiman.long@hpe.com>
+ */
+#ifndef __LINUX_DLOCK_LIST_H
+#define __LINUX_DLOCK_LIST_H
+
+#include <linux/spinlock.h>
+#include <linux/list.h>
+
+/*
+ * include/linux/dlock-list.h
+ *
+ * A distributed set of lists each of which is protected by its own
+ * spinlock, but acts like a single consolidated list to the callers.
+ * The number of lists used is equal to the number of CPUs available in
+ * the system to minimize contention.
+ *
+ * The dlock_list_head structure contains the spinlock, while the
+ * dlock_list_node structure contains a pointer to the head entry instead.
+ */
+struct dlock_list_head {
+	struct list_head list;
+	spinlock_t lock;
+};
+
+struct dlock_list_heads {
+	struct dlock_list_head *heads;
+};
+
+/*
+ * dlock list node data structure
+ */
+struct dlock_list_node {
+	struct list_head list;
+	struct dlock_list_head *head;
+};
+
+/*
+ * dlock list iteration state
+ *
+ * This is an opaque data structure that may change. Users of this structure
+ * should not access the structure members directly other than using the
+ * helper functions and macros provided in this header file.
+ */
+struct dlock_list_iter {
+	int index;
+	struct dlock_list_head *head, *entry;
+};
+
+#define DLOCK_LIST_ITER_INIT(dlist)		\
+	{					\
+		.index = -1,			\
+		.head = (dlist)->heads,		\
+	}
+
+#define DEFINE_DLOCK_LIST_ITER(s, heads)	\
+	struct dlock_list_iter s = DLOCK_LIST_ITER_INIT(heads)
+
+static inline void init_dlock_list_iter(struct dlock_list_iter *iter,
+					struct dlock_list_heads *heads)
+{
+	*iter = (struct dlock_list_iter)DLOCK_LIST_ITER_INIT(heads);
+}
+
+#define DLOCK_LIST_NODE_INIT(name)		\
+	{					\
+		.list = LIST_HEAD_INIT(name)	\
+	}
+
+static inline void init_dlock_list_node(struct dlock_list_node *node)
+{
+	*node = (struct dlock_list_node)DLOCK_LIST_NODE_INIT(node->list);
+}
+
+/**
+ * dlock_list_unlock - unlock the spinlock that protects the current list
+ * @iter: Pointer to the dlock list iterator structure
+ */
+static inline void dlock_list_unlock(struct dlock_list_iter *iter)
+{
+	spin_unlock(&iter->entry->lock);
+}
+
+/**
+ * dlock_list_relock - lock the spinlock that protects the current list
+ * @iter: Pointer to the dlock list iterator structure
+ */
+static inline void dlock_list_relock(struct dlock_list_iter *iter)
+{
+	spin_lock(&iter->entry->lock);
+}
+
+/*
+ * Allocation and freeing of dlock list
+ */
+extern int  alloc_dlock_list_heads(struct dlock_list_heads *dlist);
+extern void free_dlock_list_heads(struct dlock_list_heads *dlist);
+
+/*
+ * Check if a dlock list is empty or not.
+ */
+extern bool dlock_list_empty(struct dlock_list_heads *dlist);
+
+/*
+ * The dlock list addition and deletion functions here are not irq-safe.
+ * Special irq-safe variants will have to be added if we need them.
+ */
+extern void dlock_list_add(struct dlock_list_node *node,
+			   struct dlock_list_heads *dlist);
+extern void dlock_list_del(struct dlock_list_node *node);
+
+/*
+ * Find the first entry of the next available list.
+ */
+extern struct dlock_list_node *
+__dlock_list_next_list(struct dlock_list_iter *iter);
+
+/**
+ * __dlock_list_next_entry - Iterate to the next entry of the dlock list
+ * @curr : Pointer to the current dlock_list_node structure
+ * @iter : Pointer to the dlock list iterator structure
+ * Return: Pointer to the next entry or NULL if all the entries are iterated
+ *
+ * The iterator has to be properly initialized before calling this function.
+ */
+static inline struct dlock_list_node *
+__dlock_list_next_entry(struct dlock_list_node *curr,
+			struct dlock_list_iter *iter)
+{
+	/*
+	 * Find next entry
+	 */
+	if (curr)
+		curr = list_next_entry(curr, list);
+
+	if (!curr || (&curr->list == &iter->entry->list)) {
+		/*
+		 * The current list has been exhausted, try the next available
+		 * list.
+		 */
+		curr = __dlock_list_next_list(iter);
+	}
+
+	return curr;	/* Continue the iteration */
+}
+
+/**
+ * dlock_list_first_entry - get the first element from a list
+ * @iter  : The dlock list iterator.
+ * @type  : The type of the struct this is embedded in.
+ * @member: The name of the dlock_list_node within the struct.
+ * Return : Pointer to the first entry or NULL if the dlock list is empty.
+ */
+#define dlock_list_first_entry(iter, type, member)			\
+	({								\
+		struct dlock_list_node *_n;				\
+		_n = __dlock_list_next_entry(NULL, iter);		\
+		_n ? list_entry(_n, type, member) : NULL;		\
+	})
+
+/**
+ * dlock_list_next_entry - iterate to the next entry of the list
+ * @pos   : The type * to cursor
+ * @iter  : The dlock list iterator.
+ * @member: The name of the dlock_list_node within the struct.
+ * Return : Pointer to the next entry or NULL if all the entries are iterated.
+ *
+ * Note that pos can't be NULL.
+ */
+#define dlock_list_next_entry(pos, iter, member)			\
+	({								\
+		struct dlock_list_node *_n;				\
+		_n = __dlock_list_next_entry(&(pos)->member, iter);	\
+		_n ? list_entry(_n, typeof(*(pos)), member) : NULL;	\
+	})
+
+/**
+ * dlist_for_each_entry - iterate over the dlock list
+ * @pos   : Type * to use as a loop cursor
+ * @iter  : The dlock list iterator
+ * @member: The name of the dlock_list_node within the struct
+ *
+ * This iteration macro isn't safe with respect to list entry removal, but
+ * it can correctly iterate newly added entries right after the current one.
+ * This iteration function is designed to be used in a while loop.
+ */
+#define dlist_for_each_entry(pos, iter, member)				\
+	for (pos = dlock_list_first_entry(iter, typeof(*(pos)), member);\
+	     pos != NULL;						\
+	     pos = dlock_list_next_entry(pos, iter, member))
+
+/**
+ * dlist_for_each_entry_safe - iterate over the dlock list & safe over removal
+ * @pos   : Type * to use as a loop cursor
+ * @n	  : Another type * to use as temporary storage
+ * @iter  : The dlock list iterator
+ * @member: The name of the dlock_list_node within the struct
+ *
+ * This iteration macro is safe with respect to list entry removal.
+ * However, it cannot correctly iterate newly added entries right after the
+ * current one.
+ */
+#define dlist_for_each_entry_safe(pos, n, iter, member)			\
+	for (pos = dlock_list_first_entry(iter, typeof(*(pos)), member);\
+	    ({								\
+		bool _b = (pos != NULL);				\
+		if (_b)							\
+			n = dlock_list_next_entry(pos, iter, member);	\
+		_b;							\
+	    });								\
+	    pos = n)
+
+#endif /* __LINUX_DLOCK_LIST_H */
diff --git a/lib/Makefile b/lib/Makefile
index cfa68eb..14c20fc 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -37,7 +37,7 @@ obj-y += bcd.o div64.o sort.o parser.o halfmd4.o debug_locks.o random32.o \
 	 gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \
 	 bsearch.o find_bit.o llist.o memweight.o kfifo.o \
 	 percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \
-	 once.o
+	 once.o dlock-list.o
 obj-y += string_helpers.o
 obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o
 obj-y += hexdump.o
diff --git a/lib/dlock-list.c b/lib/dlock-list.c
new file mode 100644
index 0000000..9d4950c
--- /dev/null
+++ b/lib/dlock-list.c
@@ -0,0 +1,199 @@
+/*
+ * Distributed and locked list
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2016 Hewlett-Packard Enterprise Development LP
+ *
+ * Authors: Waiman Long <waiman.long@hpe.com>
+ */
+#include <linux/dlock-list.h>
+#include <linux/lockdep.h>
+#include <linux/slab.h>
+#include <linux/cpumask.h>
+
+/*
+ * As all the locks in the dlock list are dynamically allocated, they need
+ * to belong to their own special lock class to avoid warnings and stack
+ * traces in the kernel log when lockdep is enabled. Statically allocated locks
+ * don't have this problem.
+ */
+static struct lock_class_key dlock_list_key;
+
+/**
+ * alloc_dlock_list_heads - Initialize and allocate the list of head entries
+ * @dlist: Pointer to the dlock_list_heads structure to be initialized
+ * Return: 0 if successful, -ENOMEM if memory allocation error
+ *
+ * This function does not allocate the dlock_list_heads structure itself. The
+ * callers will have to do their own memory allocation, if necessary. However,
+ * this allows embedding the dlock_list_heads structure directly into other
+ * structures.
+ */
+int alloc_dlock_list_heads(struct dlock_list_heads *dlist)
+{
+	int idx;
+
+	dlist->heads = kcalloc(nr_cpu_ids, sizeof(struct dlock_list_head),
+			       GFP_KERNEL);
+
+	if (!dlist->heads)
+		return -ENOMEM;
+
+	for (idx = 0; idx < nr_cpu_ids; idx++) {
+		struct dlock_list_head *head = &dlist->heads[idx];
+
+		INIT_LIST_HEAD(&head->list);
+		head->lock = __SPIN_LOCK_UNLOCKED(&head->lock);
+		lockdep_set_class(&head->lock, &dlock_list_key);
+	}
+	return 0;
+}
+
+/**
+ * free_dlock_list_heads - Free all the heads entries of the dlock list
+ * @dlist: Pointer of the dlock_list_heads structure to be freed
+ *
+ * This function doesn't free the dlock_list_heads structure itself. So
+ * the caller will have to do it, if necessary.
+ */
+void free_dlock_list_heads(struct dlock_list_heads *dlist)
+{
+	kfree(dlist->heads);
+	dlist->heads = NULL;
+}
+
+/**
+ * dlock_list_empty - Check if all the dlock lists are empty
+ * @dlist: Pointer to the dlock_list_heads structure
+ * Return: true if list is empty, false otherwise.
+ *
+ * This can be a pretty expensive function call. If this function is required
+ * in a performance critical path, we may have to maintain a global count
+ * of the list entries in the global dlock_list_heads structure instead.
+ */
+bool dlock_list_empty(struct dlock_list_heads *dlist)
+{
+	int idx;
+
+	for (idx = 0; idx < nr_cpu_ids; idx++)
+		if (!list_empty(&dlist->heads[idx].list))
+			return false;
+	return true;
+}
+
+/**
+ * dlock_list_add - Adds a node to the given dlock list
+ * @node : The node to be added
+ * @dlist: The dlock list where the node is to be added
+ *
+ * List selection is based on the CPU being used when the dlock_list_add()
+ * function is called. However, deletion may be done by a different CPU.
+ * So we still need to use a lock to protect the content of the list.
+ */
+void dlock_list_add(struct dlock_list_node *node,
+		    struct dlock_list_heads *dlist)
+{
+	struct dlock_list_head *head = &dlist->heads[smp_processor_id()];
+
+	/*
+	 * There is no need to disable preemption
+	 */
+	spin_lock(&head->lock);
+	node->head = head;
+	list_add(&node->list, &head->list);
+	spin_unlock(&head->lock);
+}
+
+/**
+ * dlock_list_del - Delete a node from a dlock list
+ * @node : The node to be deleted
+ *
+ * We need to check the lock pointer again after taking the lock to guard
+ * against concurrent deletion of the same node. If the lock pointer changes
+ * (becomes NULL or to a different one), we assume that the deletion was done
+ * elsewhere. A warning will be printed if this happens as it is likely to be
+ * a bug.
+ */
+void dlock_list_del(struct dlock_list_node *node)
+{
+	struct dlock_list_head *head;
+	bool retry;
+
+	do {
+		head = READ_ONCE(node->head);
+		if (WARN_ONCE(!head,
+			"dlock_list_del: node 0x%lx has no associated head\n",
+			(unsigned long)node))
+			return;
+
+		spin_lock(&head->lock);
+		if (likely(head == node->head)) {
+			list_del_init(&node->list);
+			node->head = NULL;
+			retry = false;
+		} else {
+			/*
+			 * The lock has somehow changed. Retry again if it is
+			 * not NULL. Otherwise, just ignore the delete
+			 * operation.
+			 */
+			retry = (node->head != NULL);
+		}
+		spin_unlock(&head->lock);
+	} while (retry);
+}
+
+/**
+ * __dlock_list_next_list: Find the first entry of the next available list
+ * @iter : Pointer to the dlock list iterator structure
+ * Return: Pointer to the first entry of the next non-empty list, or NULL
+ *	   if all the lists have been exhausted
+ *
+ * The information about the next available list will be put into the iterator.
+ */
+struct dlock_list_node *__dlock_list_next_list(struct dlock_list_iter *iter)
+{
+	struct dlock_list_node *next;
+	struct dlock_list_head *head;
+
+restart:
+	if (iter->entry) {
+		spin_unlock(&iter->entry->lock);
+		iter->entry = NULL;
+	}
+
+next_list:
+	/*
+	 * Try next list
+	 */
+	if (++iter->index >= nr_cpu_ids)
+		return NULL;	/* All the entries iterated */
+
+	if (list_empty(&iter->head[iter->index].list))
+		goto next_list;
+
+	head = iter->entry = &iter->head[iter->index];
+	spin_lock(&head->lock);
+	/*
+	 * There is a slight chance that the list may become empty just
+	 * before the lock is acquired. So an additional check is
+	 * needed to make sure that a valid node will be returned.
+	 */
+	if (list_empty(&head->list))
+		goto restart;
+
+	next = list_entry(head->list.next, struct dlock_list_node,
+			  list);
+	WARN_ON_ONCE(next->head != head);
+
+	return next;
+}
-- 
1.7.1


* [PATCH v5 2/5] fsnotify: Simplify inode iteration on umount
  2016-08-09 16:52 [PATCH v5 0/5] vfs: Use dlock list for SB's s_inodes list Waiman Long
  2016-08-09 16:52 ` [PATCH v5 1/5] lib/dlock-list: Distributed and lock-protected lists Waiman Long
@ 2016-08-09 16:52 ` Waiman Long
  2016-08-09 16:52 ` [PATCH v5 3/5] vfs: Remove unnecessary list_for_each_entry_safe() variants Waiman Long
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: Waiman Long @ 2016-08-09 16:52 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng, Scott J Norton,
	Douglas Hatch, Jan Kara, Waiman Long

From: Jan Kara <jack@suse.cz>

fsnotify_unmount_inodes() played complex tricks to pin the next inode
in the sb->s_inodes list when iterating over all inodes. If we switch
to keeping the current inode pinned somewhat longer, we can make the
code much simpler and more standard.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 fs/notify/inode_mark.c |   45 +++++++++------------------------------------
 1 files changed, 9 insertions(+), 36 deletions(-)

diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index 741077d..a364524 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -150,12 +150,10 @@ int fsnotify_add_inode_mark(struct fsnotify_mark *mark,
  */
 void fsnotify_unmount_inodes(struct super_block *sb)
 {
-	struct inode *inode, *next_i, *need_iput = NULL;
+	struct inode *inode, *iput_inode = NULL;
 
 	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next_i, &sb->s_inodes, i_sb_list) {
-		struct inode *need_iput_tmp;
-
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		/*
 		 * We cannot __iget() an inode in state I_FREEING,
 		 * I_WILL_FREE, or I_NEW which is fine because by that point
@@ -178,49 +176,24 @@ void fsnotify_unmount_inodes(struct super_block *sb)
 			continue;
 		}
 
-		need_iput_tmp = need_iput;
-		need_iput = NULL;
-
-		/* In case fsnotify_inode_delete() drops a reference. */
-		if (inode != need_iput_tmp)
-			__iget(inode);
-		else
-			need_iput_tmp = NULL;
+		__iget(inode);
 		spin_unlock(&inode->i_lock);
-
-		/* In case the dropping of a reference would nuke next_i. */
-		while (&next_i->i_sb_list != &sb->s_inodes) {
-			spin_lock(&next_i->i_lock);
-			if (!(next_i->i_state & (I_FREEING | I_WILL_FREE)) &&
-						atomic_read(&next_i->i_count)) {
-				__iget(next_i);
-				need_iput = next_i;
-				spin_unlock(&next_i->i_lock);
-				break;
-			}
-			spin_unlock(&next_i->i_lock);
-			next_i = list_next_entry(next_i, i_sb_list);
-		}
-
-		/*
-		 * We can safely drop s_inode_list_lock here because either
-		 * we actually hold references on both inode and next_i or
-		 * end of list.  Also no new inodes will be added since the
-		 * umount has begun.
-		 */
 		spin_unlock(&sb->s_inode_list_lock);
 
-		if (need_iput_tmp)
-			iput(need_iput_tmp);
+		if (iput_inode)
+			iput(iput_inode);
 
 		/* for each watch, send FS_UNMOUNT and then remove it */
 		fsnotify(inode, FS_UNMOUNT, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
 
 		fsnotify_inode_delete(inode);
 
-		iput(inode);
+		iput_inode = inode;
 
 		spin_lock(&sb->s_inode_list_lock);
 	}
 	spin_unlock(&sb->s_inode_list_lock);
+
+	if (iput_inode)
+		iput(iput_inode);
 }
-- 
1.7.1


* [PATCH v5 3/5] vfs: Remove unnecessary list_for_each_entry_safe() variants
  2016-08-09 16:52 [PATCH v5 0/5] vfs: Use dlock list for SB's s_inodes list Waiman Long
  2016-08-09 16:52 ` [PATCH v5 1/5] lib/dlock-list: Distributed and lock-protected lists Waiman Long
  2016-08-09 16:52 ` [PATCH v5 2/5] fsnotify: Simplify inode iteration on umount Waiman Long
@ 2016-08-09 16:52 ` Waiman Long
  2016-08-09 16:52 ` [PATCH v5 4/5] vfs: Use dlock list for superblock's inode list Waiman Long
  2016-08-09 16:52 ` [PATCH v5 5/5] lib/dlock-list: Make sibling CPUs share the same linked list Waiman Long
  4 siblings, 0 replies; 10+ messages in thread
From: Waiman Long @ 2016-08-09 16:52 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng, Scott J Norton,
	Douglas Hatch, Jan Kara, Waiman Long

From: Jan Kara <jack@suse.cz>

evict_inodes() and invalidate_inodes() use list_for_each_entry_safe()
to iterate the sb->s_inodes list. However, since we use the i_lru list
entry for our local temporary list of inodes to destroy, the inode is
guaranteed to stay in the sb->s_inodes list while we hold
sb->s_inode_list_lock. So there is no real need for the safe iteration
variant and we can use list_for_each_entry() just fine.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 fs/inode.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 7e3ef3a..640d4f8 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -599,12 +599,12 @@ static void dispose_list(struct list_head *head)
  */
 void evict_inodes(struct super_block *sb)
 {
-	struct inode *inode, *next;
+	struct inode *inode;
 	LIST_HEAD(dispose);
 
 again:
 	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		if (atomic_read(&inode->i_count))
 			continue;
 
@@ -649,11 +649,11 @@ again:
 int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 {
 	int busy = 0;
-	struct inode *inode, *next;
+	struct inode *inode;
 	LIST_HEAD(dispose);
 
 	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
 			spin_unlock(&inode->i_lock);
-- 
1.7.1


* [PATCH v5 4/5] vfs: Use dlock list for superblock's inode list
  2016-08-09 16:52 [PATCH v5 0/5] vfs: Use dlock list for SB's s_inodes list Waiman Long
                   ` (2 preceding siblings ...)
  2016-08-09 16:52 ` [PATCH v5 3/5] vfs: Remove unnecessary list_for_each_entry_safe() variants Waiman Long
@ 2016-08-09 16:52 ` Waiman Long
  2016-08-09 16:52 ` [PATCH v5 5/5] lib/dlock-list: Make sibling CPUs share the same linked list Waiman Long
  4 siblings, 0 replies; 10+ messages in thread
From: Waiman Long @ 2016-08-09 16:52 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng, Scott J Norton,
	Douglas Hatch, Waiman Long

When many threads are trying to add inodes to or delete inodes from
a superblock's s_inodes list, spinlock contention on the list can
become a performance bottleneck.

This patch changes the s_inodes field into a dlock list, which is
a distributed set of lists with per-list spinlocks. As a result,
the following superblock inode list (sb->s_inodes) iteration functions
in the vfs are also modified:

 1. iterate_bdevs()
 2. drop_pagecache_sb()
 3. evict_inodes()
 4. invalidate_inodes()
 5. fsnotify_unmount_inodes()
 6. add_dquot_ref()
 7. remove_dquot_ref()
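
In each of these functions, the conversion follows the same basic
pattern, shown here only as an outline (the exact changes are in the
diff that follows):

	/* before: one global lock around the whole walk */
	spin_lock(&sb->s_inode_list_lock);
	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
		/* per-inode work, possibly dropping/retaking the lock */
	}
	spin_unlock(&sb->s_inode_list_lock);

	/* after: the iterator takes/releases one per-list lock at a time */
	DEFINE_DLOCK_LIST_ITER(iter, &sb->s_inodes);

	dlist_for_each_entry(inode, &iter, i_sb_list) {
		/*
		 * per-inode work; dlock_list_unlock(&iter) and
		 * dlock_list_relock(&iter) replace the old unlock/relock
		 */
	}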

The performance impact was measured with an exit microbenchmark that
creates a large number of threads, attaches many inodes to them and
then exits. The runtimes of that microbenchmark with 1000 threads
before and after the patch on a 4-socket Intel E7-4820 v3 system
(40 cores, 80 threads) were as follows:

  Kernel            Elapsed Time    System Time
  ------            ------------    -----------
  Vanilla 4.5-rc4      65.29s         82m14s
  Patched 4.5-rc4      22.81s         23m03s

Before the patch, spinlock contention in the inode_sb_list_add()
function during the startup phase and in the inode_sb_list_del()
function during the exit phase accounted for about 79% and 93% of
total CPU time respectively (as measured by perf). After the patch,
the dlock_list_add() function consumed only about 0.04% of CPU time
in the startup phase, and the dlock_list_del() function consumed
about 0.4% of CPU time in the exit phase. There was still some
spinlock contention, but it happened elsewhere.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/block_dev.c         |    9 ++++-----
 fs/drop_caches.c       |    9 ++++-----
 fs/inode.c             |   34 +++++++++++++---------------------
 fs/notify/inode_mark.c |    9 ++++-----
 fs/quota/dquot.c       |   14 ++++++--------
 fs/super.c             |    7 ++++---
 include/linux/fs.h     |    8 ++++----
 7 files changed, 39 insertions(+), 51 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index c3cdde8..ca88a3e 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1880,9 +1880,9 @@ EXPORT_SYMBOL(__invalidate_device);
 void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
 {
 	struct inode *inode, *old_inode = NULL;
+	DEFINE_DLOCK_LIST_ITER(iter, &blockdev_superblock->s_inodes);
 
-	spin_lock(&blockdev_superblock->s_inode_list_lock);
-	list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
+	dlist_for_each_entry(inode, &iter, i_sb_list) {
 		struct address_space *mapping = inode->i_mapping;
 
 		spin_lock(&inode->i_lock);
@@ -1893,7 +1893,7 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&blockdev_superblock->s_inode_list_lock);
+		dlock_list_unlock(&iter);
 		/*
 		 * We hold a reference to 'inode' so it couldn't have been
 		 * removed from s_inodes list while we dropped the
@@ -1907,8 +1907,7 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
 
 		func(I_BDEV(inode), arg);
 
-		spin_lock(&blockdev_superblock->s_inode_list_lock);
+		dlock_list_relock(&iter);
 	}
-	spin_unlock(&blockdev_superblock->s_inode_list_lock);
 	iput(old_inode);
 }
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index d72d52b..8db1374 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -16,9 +16,9 @@ int sysctl_drop_caches;
 static void drop_pagecache_sb(struct super_block *sb, void *unused)
 {
 	struct inode *inode, *toput_inode = NULL;
+	DEFINE_DLOCK_LIST_ITER(iter, &sb->s_inodes);
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	dlist_for_each_entry(inode, &iter, i_sb_list) {
 		spin_lock(&inode->i_lock);
 		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
 		    (inode->i_mapping->nrpages == 0)) {
@@ -27,15 +27,14 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
+		dlock_list_unlock(&iter);
 
 		invalidate_mapping_pages(inode->i_mapping, 0, -1);
 		iput(toput_inode);
 		toput_inode = inode;
 
-		spin_lock(&sb->s_inode_list_lock);
+		dlock_list_relock(&iter);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 	iput(toput_inode);
 }
 
diff --git a/fs/inode.c b/fs/inode.c
index 640d4f8..e2c261e 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -28,7 +28,7 @@
  *   inode->i_state, inode->i_hash, __iget()
  * Inode LRU list locks protect:
  *   inode->i_sb->s_inode_lru, inode->i_lru
- * inode->i_sb->s_inode_list_lock protects:
+ * inode->i_sb->s_inodes->head->lock protects:
  *   inode->i_sb->s_inodes, inode->i_sb_list
  * bdi->wb.list_lock protects:
  *   bdi->wb.b_{dirty,io,more_io,dirty_time}, inode->i_io_list
@@ -37,7 +37,7 @@
  *
  * Lock ordering:
  *
- * inode->i_sb->s_inode_list_lock
+ * inode->i_sb->s_inodes->head->lock
  *   inode->i_lock
  *     Inode LRU list locks
  *
@@ -45,7 +45,7 @@
  *   inode->i_lock
  *
  * inode_hash_lock
- *   inode->i_sb->s_inode_list_lock
+ *   inode->i_sb->s_inodes->head->lock
  *   inode->i_lock
  *
  * iunique_lock
@@ -432,19 +432,14 @@ static void inode_lru_list_del(struct inode *inode)
  */
 void inode_sb_list_add(struct inode *inode)
 {
-	spin_lock(&inode->i_sb->s_inode_list_lock);
-	list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
-	spin_unlock(&inode->i_sb->s_inode_list_lock);
+	dlock_list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
 }
 EXPORT_SYMBOL_GPL(inode_sb_list_add);
 
 static inline void inode_sb_list_del(struct inode *inode)
 {
-	if (!list_empty(&inode->i_sb_list)) {
-		spin_lock(&inode->i_sb->s_inode_list_lock);
-		list_del_init(&inode->i_sb_list);
-		spin_unlock(&inode->i_sb->s_inode_list_lock);
-	}
+	if (!list_empty(&inode->i_sb_list.list))
+		dlock_list_del(&inode->i_sb_list);
 }
 
 static unsigned long hash(struct super_block *sb, unsigned long hashval)
@@ -600,11 +595,12 @@ static void dispose_list(struct list_head *head)
 void evict_inodes(struct super_block *sb)
 {
 	struct inode *inode;
+	struct dlock_list_iter iter;
 	LIST_HEAD(dispose);
 
 again:
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	init_dlock_list_iter(&iter, &sb->s_inodes);
+	dlist_for_each_entry(inode, &iter, i_sb_list) {
 		if (atomic_read(&inode->i_count))
 			continue;
 
@@ -625,13 +621,12 @@ again:
 		 * bit so we don't livelock.
 		 */
 		if (need_resched()) {
-			spin_unlock(&sb->s_inode_list_lock);
+			dlock_list_unlock(&iter);
 			cond_resched();
 			dispose_list(&dispose);
 			goto again;
 		}
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 
 	dispose_list(&dispose);
 }
@@ -651,9 +646,9 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 	int busy = 0;
 	struct inode *inode;
 	LIST_HEAD(dispose);
+	DEFINE_DLOCK_LIST_ITER(iter, &sb->s_inodes);
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	dlist_for_each_entry(inode, &iter, i_sb_list) {
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
 			spin_unlock(&inode->i_lock);
@@ -675,7 +670,6 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 		spin_unlock(&inode->i_lock);
 		list_add(&inode->i_lru, &dispose);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 
 	dispose_list(&dispose);
 
@@ -890,7 +884,7 @@ struct inode *new_inode_pseudo(struct super_block *sb)
 		spin_lock(&inode->i_lock);
 		inode->i_state = 0;
 		spin_unlock(&inode->i_lock);
-		INIT_LIST_HEAD(&inode->i_sb_list);
+		init_dlock_list_node(&inode->i_sb_list);
 	}
 	return inode;
 }
@@ -911,8 +905,6 @@ struct inode *new_inode(struct super_block *sb)
 {
 	struct inode *inode;
 
-	spin_lock_prefetch(&sb->s_inode_list_lock);
-
 	inode = new_inode_pseudo(sb);
 	if (inode)
 		inode_sb_list_add(inode);
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index a364524..ae3dd1d 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -151,9 +151,9 @@ int fsnotify_add_inode_mark(struct fsnotify_mark *mark,
 void fsnotify_unmount_inodes(struct super_block *sb)
 {
 	struct inode *inode, *iput_inode = NULL;
+	DEFINE_DLOCK_LIST_ITER(iter, &sb->s_inodes);
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	dlist_for_each_entry(inode, &iter, i_sb_list) {
 		/*
 		 * We cannot __iget() an inode in state I_FREEING,
 		 * I_WILL_FREE, or I_NEW which is fine because by that point
@@ -178,7 +178,7 @@ void fsnotify_unmount_inodes(struct super_block *sb)
 
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
+		dlock_list_unlock(&iter);
 
 		if (iput_inode)
 			iput(iput_inode);
@@ -190,9 +190,8 @@ void fsnotify_unmount_inodes(struct super_block *sb)
 
 		iput_inode = inode;
 
-		spin_lock(&sb->s_inode_list_lock);
+		dlock_list_relock(&iter);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 
 	if (iput_inode)
 		iput(iput_inode);
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 1bfac28..5082d28 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -939,12 +939,12 @@ static int dqinit_needed(struct inode *inode, int type)
 static void add_dquot_ref(struct super_block *sb, int type)
 {
 	struct inode *inode, *old_inode = NULL;
+	DEFINE_DLOCK_LIST_ITER(iter, &sb->s_inodes);
 #ifdef CONFIG_QUOTA_DEBUG
 	int reserved = 0;
 #endif
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	dlist_for_each_entry(inode, &iter, i_sb_list) {
 		spin_lock(&inode->i_lock);
 		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
 		    !atomic_read(&inode->i_writecount) ||
@@ -954,7 +954,7 @@ static void add_dquot_ref(struct super_block *sb, int type)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
+		dlock_list_unlock(&iter);
 
 #ifdef CONFIG_QUOTA_DEBUG
 		if (unlikely(inode_get_rsv_space(inode) > 0))
@@ -972,9 +972,8 @@ static void add_dquot_ref(struct super_block *sb, int type)
 		 * later.
 		 */
 		old_inode = inode;
-		spin_lock(&sb->s_inode_list_lock);
+		dlock_list_relock(&iter);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 	iput(old_inode);
 
 #ifdef CONFIG_QUOTA_DEBUG
@@ -1042,9 +1041,9 @@ static void remove_dquot_ref(struct super_block *sb, int type,
 {
 	struct inode *inode;
 	int reserved = 0;
+	DEFINE_DLOCK_LIST_ITER(iter, &sb->s_inodes);
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	dlist_for_each_entry(inode, &iter, i_sb_list) {
 		/*
 		 *  We have to scan also I_NEW inodes because they can already
 		 *  have quota pointer initialized. Luckily, we need to touch
@@ -1059,7 +1058,6 @@ static void remove_dquot_ref(struct super_block *sb, int type,
 		}
 		spin_unlock(&dq_data_lock);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 #ifdef CONFIG_QUOTA_DEBUG
 	if (reserved) {
 		printk(KERN_WARNING "VFS (%s): Writes happened after quota"
diff --git a/fs/super.c b/fs/super.c
index c2ff475..fcc70d7 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -164,6 +164,7 @@ static void destroy_super(struct super_block *s)
 {
 	list_lru_destroy(&s->s_dentry_lru);
 	list_lru_destroy(&s->s_inode_lru);
+	free_dlock_list_heads(&s->s_inodes);
 	security_sb_free(s);
 	WARN_ON(!list_empty(&s->s_mounts));
 	put_user_ns(s->s_user_ns);
@@ -211,11 +212,11 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
 	INIT_HLIST_NODE(&s->s_instances);
 	INIT_HLIST_BL_HEAD(&s->s_anon);
 	mutex_init(&s->s_sync_lock);
-	INIT_LIST_HEAD(&s->s_inodes);
-	spin_lock_init(&s->s_inode_list_lock);
 	INIT_LIST_HEAD(&s->s_inodes_wb);
 	spin_lock_init(&s->s_inode_wblist_lock);
 
+	if (alloc_dlock_list_heads(&s->s_inodes))
+		goto fail;
 	if (list_lru_init_memcg(&s->s_dentry_lru))
 		goto fail;
 	if (list_lru_init_memcg(&s->s_inode_lru))
@@ -436,7 +437,7 @@ void generic_shutdown_super(struct super_block *sb)
 		if (sop->put_super)
 			sop->put_super(sb);
 
-		if (!list_empty(&sb->s_inodes)) {
+		if (!dlock_list_empty(&sb->s_inodes)) {
 			printk("VFS: Busy inodes after unmount of %s. "
 			   "Self-destruct in 5 seconds.  Have a nice day...\n",
 			   sb->s_id);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3523bf6..fd718a5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -32,6 +32,7 @@
 #include <linux/workqueue.h>
 #include <linux/percpu-rwsem.h>
 #include <linux/delayed_call.h>
+#include <linux/dlock-list.h>
 
 #include <asm/byteorder.h>
 #include <uapi/linux/fs.h>
@@ -661,7 +662,7 @@ struct inode {
 	u16			i_wb_frn_history;
 #endif
 	struct list_head	i_lru;		/* inode LRU list */
-	struct list_head	i_sb_list;
+	struct dlock_list_node	i_sb_list;
 	struct list_head	i_wb_list;	/* backing dev writeback list */
 	union {
 		struct hlist_head	i_dentry;
@@ -1424,9 +1425,8 @@ struct super_block {
 	 */
 	int s_stack_depth;
 
-	/* s_inode_list_lock protects s_inodes */
-	spinlock_t		s_inode_list_lock ____cacheline_aligned_in_smp;
-	struct list_head	s_inodes;	/* all inodes */
+	/* The internal per-list locks protect s_inodes */
+	struct dlock_list_heads s_inodes;	/* all inodes */
 
 	spinlock_t		s_inode_wblist_lock;
 	struct list_head	s_inodes_wb;	/* writeback inodes */
-- 
1.7.1


* [PATCH v5 5/5] lib/dlock-list: Make sibling CPUs share the same linked list
  2016-08-09 16:52 [PATCH v5 0/5] vfs: Use dlock list for SB's s_inodes list Waiman Long
                   ` (3 preceding siblings ...)
  2016-08-09 16:52 ` [PATCH v5 4/5] vfs: Use dlock list for superblock's inode list Waiman Long
@ 2016-08-09 16:52 ` Waiman Long
  2016-08-09 18:23   ` kbuild test robot
                     ` (2 more replies)
  4 siblings, 3 replies; 10+ messages in thread
From: Waiman Long @ 2016-08-09 16:52 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng, Scott J Norton,
	Douglas Hatch, Waiman Long

The dlock list needs one list for each of the CPUs available. However,
sibling CPUs share the L2 and probably the L1 caches too. As a result,
there is not much to gain in terms of avoiding cacheline contention,
while the cacheline footprint of the L1/L2 caches grows because the
separate lists may all need to be in the cache.

This patch makes all the sibling CPUs share the same list, thus
reducing the number of lists that need to be maintained in each
dlock list without having any noticeable impact on performance. It
also improves dlock list iteration performance as fewer lists need
to be iterated.
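
Conceptually, list selection in dlock_list_add() then becomes the
following (a sketch only, with a hypothetical 4-core/8-thread topology;
the real mapping table is built from topology_sibling_cpumask() in the
diff below):

	/*
	 * Example mapping on a 4-core/8-thread system where CPU n and
	 * CPU n+4 are siblings:
	 *
	 *	cpu:      0 1 2 3 4 5 6 7
	 *	cpu2list: 0 1 2 3 0 1 2 3
	 *
	 * Only 4 list heads are allocated instead of 8.
	 */
	int idx = per_cpu(cpu2list, smp_processor_id());
	struct dlock_list_head *head = &dlist->heads[idx];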

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 include/linux/dlock-list.h |    3 ++
 lib/dlock-list.c           |   79 +++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 77 insertions(+), 5 deletions(-)

diff --git a/include/linux/dlock-list.h b/include/linux/dlock-list.h
index 19f897e..f5d4251 100644
--- a/include/linux/dlock-list.h
+++ b/include/linux/dlock-list.h
@@ -39,6 +39,7 @@ struct dlock_list_head {
 
 struct dlock_list_heads {
 	struct dlock_list_head *heads;
+	int nhead;
 };
 
 /*
@@ -58,6 +59,7 @@ struct dlock_list_node {
  */
 struct dlock_list_iter {
 	int index;
+	int nhead;
 	struct dlock_list_head *head, *entry;
 };
 
@@ -65,6 +67,7 @@ struct dlock_list_iter {
 	{					\
 		.index = -1,			\
 		.head = (dlist)->heads,		\
+		.nhead = (dlist)->nhead,	\
 	}
 
 #define DEFINE_DLOCK_LIST_ITER(s, heads)	\
diff --git a/lib/dlock-list.c b/lib/dlock-list.c
index 9d4950c..b8b3524 100644
--- a/lib/dlock-list.c
+++ b/lib/dlock-list.c
@@ -28,6 +28,71 @@
  */
 static struct lock_class_key dlock_list_key;
 
+/*
+ * Mapping CPU number to dlock list index.
+ */
+static DEFINE_PER_CPU_READ_MOSTLY(int, cpu2list);
+static int nr_dlists;
+
+/*
+ * Initialize cpu2list mapping table & nr_dlists;
+ *
+ * All the sibling CPUs of a sibling group will map to the same dlock list so
+ * as to reduce the number of dlock lists to be maintained while minimizing
+ * cacheline contention.
+ *
+ * As the sibling masks are set up in the core initcall phase, this function
+ * has to be done in the postcore phase to get the right data. An alloc can
+ * be called before init. In this case, we just do a simple 1-1 mapping
+ * between CPU and dlock list head. After init, multiple CPUs may map to the
+ * same dlock list head.
+ */
+static int __init cpu2list_init(void)
+{
+	int idx, cpu;
+	int nr_siblings;
+	cpumask_var_t sibling_mask;
+	static struct cpumask mask __initdata;
+
+	/*
+	 * Check # of sibling CPUs for CPU 0
+	 */
+	sibling_mask = topology_sibling_cpumask(0);
+	if (!sibling_mask)
+		goto done;
+	nr_siblings = cpumask_weight(sibling_mask);
+	if (nr_siblings == 1)
+		goto done;
+
+	cpumask_setall(&mask);
+	idx = 0;
+	for_each_possible_cpu(cpu) {
+		int scpu;
+
+		if (!cpumask_test_cpu(cpu, &mask))
+			continue;
+		sibling_mask = topology_sibling_cpumask(cpu);
+		for_each_cpu(scpu, sibling_mask) {
+			per_cpu(cpu2list, scpu) = idx;
+			cpumask_clear_cpu(scpu, &mask);
+		}
+		idx++;
+	}
+
+	/*
+	 * nr_dlists can only be set after cpu2list_map is properly initialized.
+	 */
+	smp_mb();
+	nr_dlists = nr_cpu_ids/nr_siblings;
+
+	WARN_ON(cpumask_weight(&mask) != 0);
+	WARN_ON(idx > nr_dlists);
+	pr_info("dlock-list: %d head entries per dlock list.\n", nr_dlists);
+done:
+	return 0;
+}
+postcore_initcall(cpu2list_init);
+
 /**
  * alloc_dlock_list_heads - Initialize and allocate the list of head entries
  * @dlist: Pointer to the dlock_list_heads structure to be initialized
@@ -42,13 +107,14 @@ int alloc_dlock_list_heads(struct dlock_list_heads *dlist)
 {
 	int idx;
 
-	dlist->heads = kcalloc(nr_cpu_ids, sizeof(struct dlock_list_head),
+	dlist->nhead = nr_dlists ? nr_dlists : nr_cpu_ids;
+	dlist->heads = kcalloc(dlist->nhead, sizeof(struct dlock_list_head),
 			       GFP_KERNEL);
 
 	if (!dlist->heads)
 		return -ENOMEM;
 
-	for (idx = 0; idx < nr_cpu_ids; idx++) {
+	for (idx = 0; idx < dlist->nhead; idx++) {
 		struct dlock_list_head *head = &dlist->heads[idx];
 
 		INIT_LIST_HEAD(&head->list);
@@ -69,6 +135,7 @@ void free_dlock_list_heads(struct dlock_list_heads *dlist)
 {
 	kfree(dlist->heads);
 	dlist->heads = NULL;
+	dlist->nhead = 0;
 }
 
 /**
@@ -84,7 +151,7 @@ bool dlock_list_empty(struct dlock_list_heads *dlist)
 {
 	int idx;
 
-	for (idx = 0; idx < nr_cpu_ids; idx++)
+	for (idx = 0; idx < dlist->nhead; idx++)
 		if (!list_empty(&dlist->heads[idx].list))
 			return false;
 	return true;
@@ -102,7 +169,9 @@ bool dlock_list_empty(struct dlock_list_heads *dlist)
 void dlock_list_add(struct dlock_list_node *node,
 		    struct dlock_list_heads *dlist)
 {
-	struct dlock_list_head *head = &dlist->heads[smp_processor_id()];
+	int cpu = smp_processor_id();
+	int idx = (dlist->nhead < nr_cpu_ids) ? per_cpu(cpu2list, cpu) : cpu;
+	struct dlock_list_head *head = &dlist->heads[idx];
 
 	/*
 	 * There is no need to disable preemption
@@ -175,7 +244,7 @@ next_list:
 	/*
 	 * Try next list
 	 */
-	if (++iter->index >= nr_cpu_ids)
+	if (++iter->index >= iter->nhead)
 		return NULL;	/* All the entries iterated */
 
 	if (list_empty(&iter->head[iter->index].list))
-- 
1.7.1


* Re: [PATCH v5 5/5] lib/dlock-list: Make sibling CPUs share the same linked list
  2016-08-09 16:52 ` [PATCH v5 5/5] lib/dlock-list: Make sibling CPUs share the same linked list Waiman Long
@ 2016-08-09 18:23   ` kbuild test robot
  2016-08-09 19:46     ` Waiman Long
  2016-08-09 18:44   ` kbuild test robot
  2016-08-09 19:00   ` kbuild test robot
  2 siblings, 1 reply; 10+ messages in thread
From: kbuild test robot @ 2016-08-09 18:23 UTC (permalink / raw)
  To: Waiman Long
  Cc: kbuild-all, Alexander Viro, Jan Kara, Jeff Layton,
	J. Bruce Fields, Tejun Heo, Christoph Lameter, linux-fsdevel,
	linux-kernel, Ingo Molnar, Peter Zijlstra, Andi Kleen,
	Dave Chinner, Boqun Feng, Scott J Norton, Douglas Hatch,
	Waiman Long

[-- Attachment #1: Type: text/plain, Size: 1577 bytes --]

Hi Waiman,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.8-rc1 next-20160809]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Waiman-Long/vfs-Use-dlock-list-for-SB-s-s_inodes-list/20160810-012457
config: i386-randconfig-s0-201632 (attached as .config)
compiler: gcc-6 (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All error/warnings (new ones prefixed by >>):

   lib/dlock-list.c: In function 'cpu2list_init':
>> lib/dlock-list.c:60:15: error: assignment to expression with array type
     sibling_mask = topology_sibling_cpumask(0);
                  ^
>> lib/dlock-list.c:61:6: warning: the address of 'sibling_mask' will always evaluate as 'true' [-Waddress]
     if (!sibling_mask)
         ^
   lib/dlock-list.c:74:16: error: assignment to expression with array type
      sibling_mask = topology_sibling_cpumask(cpu);
                   ^

vim +60 lib/dlock-list.c

    54		cpumask_var_t sibling_mask;
    55		static struct cpumask mask __initdata;
    56	
    57		/*
    58		 * Check # of sibling CPUs for CPU 0
    59		 */
  > 60		sibling_mask = topology_sibling_cpumask(0);
  > 61		if (!sibling_mask)
    62			goto done;
    63		nr_siblings = cpumask_weight(sibling_mask);
    64		if (nr_siblings == 1)

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 34099 bytes --]


* Re: [PATCH v5 5/5] lib/dlock-list: Make sibling CPUs share the same linked list
  2016-08-09 16:52 ` [PATCH v5 5/5] lib/dlock-list: Make sibling CPUs share the same linked list Waiman Long
  2016-08-09 18:23   ` kbuild test robot
@ 2016-08-09 18:44   ` kbuild test robot
  2016-08-09 19:00   ` kbuild test robot
  2 siblings, 0 replies; 10+ messages in thread
From: kbuild test robot @ 2016-08-09 18:44 UTC (permalink / raw)
  To: Waiman Long
  Cc: kbuild-all, Alexander Viro, Jan Kara, Jeff Layton,
	J. Bruce Fields, Tejun Heo, Christoph Lameter, linux-fsdevel,
	linux-kernel, Ingo Molnar, Peter Zijlstra, Andi Kleen,
	Dave Chinner, Boqun Feng, Scott J Norton, Douglas Hatch,
	Waiman Long

[-- Attachment #1: Type: text/plain, Size: 1433 bytes --]

Hi Waiman,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.8-rc1 next-20160809]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Waiman-Long/vfs-Use-dlock-list-for-SB-s-s_inodes-list/20160810-012457
config: x86_64-randconfig-s2-08100126 (attached as .config)
compiler: gcc-4.4 (Debian 4.4.7-8) 4.4.7
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

   lib/dlock-list.c: In function 'cpu2list_init':
>> lib/dlock-list.c:60: error: incompatible types when assigning to type 'cpumask_var_t' from type 'struct cpumask *'
   lib/dlock-list.c:61: warning: the address of 'sibling_mask' will always evaluate as 'true'
   lib/dlock-list.c:74: error: incompatible types when assigning to type 'cpumask_var_t' from type 'struct cpumask *'

vim +60 lib/dlock-list.c

    54		cpumask_var_t sibling_mask;
    55		static struct cpumask mask __initdata;
    56	
    57		/*
    58		 * Check # of sibling CPUs for CPU 0
    59		 */
  > 60		sibling_mask = topology_sibling_cpumask(0);
    61		if (!sibling_mask)
    62			goto done;
    63		nr_siblings = cpumask_weight(sibling_mask);

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 25603 bytes --]


* Re: [PATCH v5 5/5] lib/dlock-list: Make sibling CPUs share the same linked list
  2016-08-09 16:52 ` [PATCH v5 5/5] lib/dlock-list: Make sibling CPUs share the same linked list Waiman Long
  2016-08-09 18:23   ` kbuild test robot
  2016-08-09 18:44   ` kbuild test robot
@ 2016-08-09 19:00   ` kbuild test robot
  2 siblings, 0 replies; 10+ messages in thread
From: kbuild test robot @ 2016-08-09 19:00 UTC (permalink / raw)
  To: Waiman Long
  Cc: kbuild-all, Alexander Viro, Jan Kara, Jeff Layton,
	J. Bruce Fields, Tejun Heo, Christoph Lameter, linux-fsdevel,
	linux-kernel, Ingo Molnar, Peter Zijlstra, Andi Kleen,
	Dave Chinner, Boqun Feng, Scott J Norton, Douglas Hatch,
	Waiman Long

[-- Attachment #1: Type: text/plain, Size: 1445 bytes --]

Hi Waiman,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.8-rc1 next-20160809]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Waiman-Long/vfs-Use-dlock-list-for-SB-s-s_inodes-list/20160810-012457
config: x86_64-randconfig-s0-08100126 (attached as .config)
compiler: gcc-4.4 (Debian 4.4.7-8) 4.4.7
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

   lib/dlock-list.c: In function 'cpu2list_init':
>> lib/dlock-list.c:60: error: incompatible types when assigning to type 'cpumask_var_t' from type 'const struct cpumask *'
   lib/dlock-list.c:61: warning: the address of 'sibling_mask' will always evaluate as 'true'
   lib/dlock-list.c:74: error: incompatible types when assigning to type 'cpumask_var_t' from type 'const struct cpumask *'

vim +60 lib/dlock-list.c

    54		cpumask_var_t sibling_mask;
    55		static struct cpumask mask __initdata;
    56	
    57		/*
    58		 * Check # of sibling CPUs for CPU 0
    59		 */
  > 60		sibling_mask = topology_sibling_cpumask(0);
    61		if (!sibling_mask)
    62			goto done;
    63		nr_siblings = cpumask_weight(sibling_mask);

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 24305 bytes --]


* Re: [PATCH v5 5/5] lib/dlock-list: Make sibling CPUs share the same linked list
  2016-08-09 18:23   ` kbuild test robot
@ 2016-08-09 19:46     ` Waiman Long
  0 siblings, 0 replies; 10+ messages in thread
From: Waiman Long @ 2016-08-09 19:46 UTC (permalink / raw)
  To: kbuild test robot
  Cc: kbuild-all, Alexander Viro, Jan Kara, Jeff Layton,
	J. Bruce Fields, Tejun Heo, Christoph Lameter, linux-fsdevel,
	linux-kernel, Ingo Molnar, Peter Zijlstra, Andi Kleen,
	Dave Chinner, Boqun Feng, Scott J Norton, Douglas Hatch

On 08/09/2016 02:23 PM, kbuild test robot wrote:
> Hi Waiman,
>
> [auto build test ERROR on linus/master]
> [also build test ERROR on v4.8-rc1 next-20160809]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
>
> url:    https://github.com/0day-ci/linux/commits/Waiman-Long/vfs-Use-dlock-list-for-SB-s-s_inodes-list/20160810-012457
> config: i386-randconfig-s0-201632 (attached as .config)
> compiler: gcc-6 (Debian 6.1.1-9) 6.1.1 20160705
> reproduce:
>          # save the attached .config to linux build tree
>          make ARCH=i386
>
> All error/warnings (new ones prefixed by>>):
>
>     lib/dlock-list.c: In function 'cpu2list_init':
>>> lib/dlock-list.c:60:15: error: assignment to expression with array type
>       sibling_mask = topology_sibling_cpumask(0);
>                    ^
>>> lib/dlock-list.c:61:6: warning: the address of 'sibling_mask' will always evaluate as 'true' [-Waddress]
>       if (!sibling_mask)
>           ^
>     lib/dlock-list.c:74:16: error: assignment to expression with array type
>        sibling_mask = topology_sibling_cpumask(cpu);
>                     ^
>
> vim +60 lib/dlock-list.c
>
>      54		cpumask_var_t sibling_mask;
>      55		static struct cpumask mask __initdata;
>      56	
>      57		/*
>      58		 * Check # of sibling CPUs for CPU 0
>      59		 */
>    >  60		sibling_mask = topology_sibling_cpumask(0);
>    >  61		if (!sibling_mask)
>      62			goto done;
>      63		nr_siblings = cpumask_weight(sibling_mask);
>      64		if (nr_siblings == 1)
>
> ---
> 0-DAY kernel test infrastructure                Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

I have updated and re-sent patch 5 that should fix the non-SMP build error.

Cheers,
Longman
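
The compile errors above come from assigning the pointer returned by
topology_sibling_cpumask() to a cpumask_var_t, which is an array type
on !CPUMASK_OFFSTACK configurations. A minimal sketch of one possible
fix (not necessarily the change that was re-sent) is to declare the
variable as a plain pointer in cpu2list_init():

	-	cpumask_var_t sibling_mask;
	+	const struct cpumask *sibling_mask;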

