linux-fsdevel.vger.kernel.org archive mirror
* [PATCH v3 0/4] vfs: Use dlock list for SB's s_inodes list
@ 2016-07-15 17:39 Waiman Long
  2016-07-15 17:39 ` [PATCH v3 1/4] lib/dlock-list: Distributed and lock-protected lists Waiman Long
                   ` (3 more replies)
  0 siblings, 4 replies; 20+ messages in thread
From: Waiman Long @ 2016-07-15 17:39 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng, Scott J Norton,
	Douglas Hatch, Waiman Long

v2->v3:
 - Remove the 2 per-subnode API patches.
 - Merge __percpu tag patch 2 into patch 1.
 - As suggested by Tejun Heo, restructure the dlock_list_head data
   structure to hide the __percpu tag and rename some of the functions
   and structures.
 - Move most of the code from dlock_list.h to dlock_list.c and export
   the symbols.

v1->v2:
 - Add a set of simple per-subnode APIs that is between percpu and
   per-node in granularity.
 - Make the dlock list use the per-subnode APIs so as to reduce the
   total number of separate linked lists that need to be managed
   and iterated.
 - There is no change in patches 1-5.

This is a follow up of the following patchset:

  [PATCH v7 0/4] vfs: Use per-cpu list for SB's s_inodes list
  https://lkml.org/lkml/2016/4/12/1009

The main change is the renaming of percpu list to dlock list, as
suggested by Christoph Lameter. It also adds a new patch from Boqun
Feng to add the __percpu modifier for parameters.

Patch 1 introduces the dlock list.

Patch 2 cleans up the fsnotify_unmount_inodes() function by making
the code simpler and more standard.

Patch 3 replaces the use of list_for_each_entry_safe() in
evict_inodes() and invalidate_inodes() by list_for_each_entry().

Patch 4 modifies the superblock and inode structures to use the dlock
list. The corresponding functions that reference those structures
are modified.

Jan Kara (2):
  fsnotify: Simplify inode iteration on umount
  vfs: Remove unnecessary list_for_each_entry_safe() variants

Waiman Long (2):
  lib/dlock-list: Distributed and lock-protected lists
  vfs: Use dlock list for superblock's inode list

 fs/block_dev.c             |   13 ++-
 fs/drop_caches.c           |   10 +-
 fs/fs-writeback.c          |   13 ++-
 fs/inode.c                 |   40 +++----
 fs/notify/inode_mark.c     |   53 +++-------
 fs/quota/dquot.c           |   16 ++--
 fs/super.c                 |    7 +-
 include/linux/dlock-list.h |  135 +++++++++++++++++++++++
 include/linux/fs.h         |    8 +-
 lib/Makefile               |    2 +-
 lib/dlock-list.c           |  254 ++++++++++++++++++++++++++++++++++++++++++++
 11 files changed, 455 insertions(+), 96 deletions(-)
 create mode 100644 include/linux/dlock-list.h
 create mode 100644 lib/dlock-list.c



* [PATCH v3 1/4] lib/dlock-list: Distributed and lock-protected lists
  2016-07-15 17:39 [PATCH v3 0/4] vfs: Use dlock list for SB's s_inodes list Waiman Long
@ 2016-07-15 17:39 ` Waiman Long
  2016-07-18 23:38   ` Tejun Heo
  2016-07-19  5:00   ` Al Viro
  2016-07-15 17:39 ` [PATCH v3 2/4] fsnotify: Simplify inode iteration on umount Waiman Long
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 20+ messages in thread
From: Waiman Long @ 2016-07-15 17:39 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng, Scott J Norton,
	Douglas Hatch, Waiman Long

Linked lists are used everywhere in the Linux kernel. However, if many
threads are trying to add or delete entries in the same linked list,
the list can become a performance bottleneck.

This patch introduces a new set of list APIs that provides a group of
distributed lists (one per CPU), each of which is protected by its own
spinlock.  To the callers, however, the set of lists acts like a single
consolidated list.  This allows list entry insertion and deletion
operations to happen in parallel instead of being serialized by a
global list and lock.

List entry insertion is strictly per-CPU. List deletion, however, can
happen on a CPU other than the one that did the insertion, so a lock
is still needed to protect each list. Because of that, there may still
be a small amount of contention when deletions are being done.

A new header file, include/linux/dlock-list.h, is added with the
associated dlock_list_head and dlock_list_node structures. The
following functions are provided to manage the per-cpu lists:

 1. int alloc_dlock_list_head(struct dlock_list_head *dlist)
 2. void free_dlock_list_head(struct dlock_list_head *dlist)
 3. void dlock_list_add(struct dlock_list_node *node,
		        struct dlock_list_head *dlist)
 4. void dlock_list_del(struct dlock_list_node *node)

Iteration over all the list entries within a group of per-cpu lists is
done by calling either the dlock_list_next() or the
dlock_list_next_safe() function in a while loop. They correspond to the
list_for_each_entry() and list_for_each_entry_safe() macros
respectively. The iteration state is kept in a dlock_list_iter
structure that is passed to the iteration functions.
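
A minimal usage sketch (struct foo, foo_list and the foo_* helpers are
made-up names, used only to illustrate the calling convention):

  #include <linux/dlock-list.h>

  struct foo {
          int data;
          struct dlock_list_node link;    /* entry in the dlock list */
  };

  static struct dlock_list_head foo_list;

  static int foo_list_init(void)
  {
          /* allocate and initialize the per-cpu list heads */
          return alloc_dlock_list_head(&foo_list);
  }

  static void foo_add(struct foo *f)
  {
          init_dlock_list_node(&f->link);
          dlock_list_add(&f->link, &foo_list);    /* current CPU's sub-list */
  }

  static void foo_scan(void)
  {
          DEFINE_DLOCK_LIST_ITER(iter);

          while (dlock_list_next(&foo_list, &iter)) {
                  struct foo *f = list_entry(iter.curr, struct foo, link);

                  /* iter.lock, the current sub-list's lock, is held here */
                  pr_info("foo data %d\n", f->data);
          }
  }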

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 include/linux/dlock-list.h |  135 +++++++++++++++++++++++
 lib/Makefile               |    2 +-
 lib/dlock-list.c           |  254 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 390 insertions(+), 1 deletions(-)
 create mode 100644 include/linux/dlock-list.h
 create mode 100644 lib/dlock-list.c

diff --git a/include/linux/dlock-list.h b/include/linux/dlock-list.h
new file mode 100644
index 0000000..2647b7d
--- /dev/null
+++ b/include/linux/dlock-list.h
@@ -0,0 +1,135 @@
+/*
+ * Distributed and locked list
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2016 Hewlett-Packard Enterprise Development LP
+ *
+ * Authors: Waiman Long <waiman.long@hpe.com>
+ */
+#ifndef __LINUX_DLOCK_LIST_H
+#define __LINUX_DLOCK_LIST_H
+
+#include <linux/spinlock.h>
+#include <linux/list.h>
+#include <linux/percpu.h>
+
+/*
+ * include/linux/dlock-list.h
+ *
+ * A distributed (per-cpu) set of lists each of which is protected by its
+ * own spinlock, but acts like a single consolidated list to the callers.
+ *
+ * The dlock_list_head_percpu structure contains the spinlock, the other
+ * dlock_list_node structures only contains a pointer to the spinlock in
+ * dlock_list_head_percpu.
+ */
+struct dlock_list_head_percpu {
+	struct list_head list;
+	spinlock_t lock;
+};
+
+struct dlock_list_head {
+	struct dlock_list_head_percpu __percpu *head;
+};
+
+/*
+ * dlock list node data structure
+ */
+struct dlock_list_node {
+	struct list_head list;
+	spinlock_t *lockptr;
+};
+
+#define DLOCK_LIST_HEAD_PERCPU_INIT(name)			\
+	{							\
+		.list.prev = &name.list,			\
+		.list.next = &name.list,			\
+		.list.lock = __SPIN_LOCK_UNLOCKED(name),	\
+	}
+
+/*
+ * dlock list iteration state
+ */
+struct dlock_list_iter {
+	int			 cpu;
+	spinlock_t		*lock;
+	struct list_head	*head;	/* List head of current per-cpu list */
+	struct dlock_list_node	*curr;
+	struct dlock_list_node	*next;
+};
+
+#define DLOCK_LIST_ITER_INIT()			\
+	{					\
+		.cpu  = -1,			\
+	}
+
+#define DEFINE_DLOCK_LIST_ITER(s)		\
+	struct dlock_list_iter s = DLOCK_LIST_ITER_INIT()
+
+static inline void init_dlock_list_iter(struct dlock_list_iter *iter)
+{
+	*iter = (struct dlock_list_iter)DLOCK_LIST_ITER_INIT();
+}
+
+#define DLOCK_LIST_NODE_INIT(name)		\
+	{					\
+		.list.prev = &name.list,	\
+		.list.next = &name.list,	\
+	}
+
+static inline void init_dlock_list_node(struct dlock_list_node *node)
+{
+	INIT_LIST_HEAD(&node->list);
+	node->lockptr = NULL;
+}
+
+/*
+ * Check if all the dlock lists are empty
+ *
+ * This can be a pretty expensive function call. If this function is required
+ * in a performance critical path, we may have to maintain a global count
+ * of the list entries in the global dlock_list_head structure instead.
+ */
+static inline bool dlock_list_empty(struct dlock_list_head *dlist)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		if (!list_empty(&per_cpu_ptr(dlist->head, cpu)->list))
+			return false;
+	return true;
+}
+
+/*
+ * Allocation and freeing of dlock list
+ */
+extern int  alloc_dlock_list_head(struct dlock_list_head *dlist);
+extern void free_dlock_list_head(struct dlock_list_head *dlist);
+
+/*
+ * The dlock list iteration functions which return true if iteration has
+ * to be continued.
+ */
+extern bool dlock_list_next(struct dlock_list_head *dlist,
+			    struct dlock_list_iter *iter);
+extern bool dlock_list_next_safe(struct dlock_list_head *dlist,
+				 struct dlock_list_iter *iter);
+
+/*
+ * The dlock list addition and deletion functions here are not irq-safe.
+ * Special irq-safe variants will have to be added if we need them.
+ */
+extern void dlock_list_add(struct dlock_list_node *node,
+			   struct dlock_list_head *dlist);
+extern void dlock_list_del(struct dlock_list_node *node);
+
+#endif /* __LINUX_DLOCK_LIST_H */
diff --git a/lib/Makefile b/lib/Makefile
index 499fb35..92e8c38 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -40,7 +40,7 @@ obj-y += bcd.o div64.o sort.o parser.o halfmd4.o debug_locks.o random32.o \
 	 gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \
 	 bsearch.o find_bit.o llist.o memweight.o kfifo.o \
 	 percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \
-	 once.o
+	 once.o dlock-list.o
 obj-y += string_helpers.o
 obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o
 obj-y += hexdump.o
diff --git a/lib/dlock-list.c b/lib/dlock-list.c
new file mode 100644
index 0000000..af4a9f3
--- /dev/null
+++ b/lib/dlock-list.c
@@ -0,0 +1,254 @@
+/*
+ * Distributed and locked list
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2016 Hewlett-Packard Enterprise Development LP
+ *
+ * Authors: Waiman Long <waiman.long@hpe.com>
+ */
+#include <linux/dlock-list.h>
+#include <linux/lockdep.h>
+#include <linux/export.h>
+
+/*
+ * As all the locks in the dlock list are dynamically allocated, they need
+ * to belong to their own special lock class to avoid warning and stack
+ * trace in kernel log when lockdep is enabled. Statically allocated locks
+ * don't have this problem.
+ */
+static struct lock_class_key dlock_list_key;
+
+/**
+ * alloc_dlock_list_head - Initialize and allocate the per-cpu list head
+ * @dlist: Pointer to the dlock_list_head structure to be initialized
+ * Return: 0 if successful, -ENOMEM if memory allocation error
+ *
+ * This function does not allocate the dlock_list_head structure itself. The
+ * callers will have to do their own memory allocation, if necessary. However,
+ * this allows embedding the dlock_list_head structure directly into other
+ * structures.
+ */
+int alloc_dlock_list_head(struct dlock_list_head *dlist)
+{
+	struct dlock_list_head dlist_tmp;
+	int cpu;
+
+	dlist_tmp.head = alloc_percpu(struct dlock_list_head_percpu);
+	if (!dlist_tmp.head)
+		return -ENOMEM;
+
+	for_each_possible_cpu(cpu) {
+		struct dlock_list_head_percpu *head;
+
+		head = per_cpu_ptr(dlist_tmp.head, cpu);
+		INIT_LIST_HEAD(&head->list);
+		head->lock = __SPIN_LOCK_UNLOCKED(&head->lock);
+		lockdep_set_class(&head->lock, &dlock_list_key);
+	}
+
+	dlist->head = dlist_tmp.head;
+	return 0;
+}
+EXPORT_SYMBOL(alloc_dlock_list_head);
+
+/**
+ * free_dlock_list_head - Free the per-cpu list head of dlock list
+ * @dlist: Pointer of the dlock_list_head structure to be freed
+ *
+ * This function doesn't free the dlock_list_head structure itself. So
+ * the caller will have to do it, if necessary.
+ */
+void free_dlock_list_head(struct dlock_list_head *dlist)
+{
+	free_percpu(dlist->head);
+	dlist->head = NULL;
+}
+EXPORT_SYMBOL(free_dlock_list_head);
+
+/**
+ * dlock_list_add - Adds a node to the given dlock list
+ * @node : The node to be added
+ * @dlist: The dlock list where the node is to be added
+ *
+ * List selection is based on the CPU being used when the dlock_list_add()
+ * function is called. However, deletion may be done by a different CPU.
+ * So we still need to use a lock to protect the content of the list.
+ */
+void dlock_list_add(struct dlock_list_node *node,
+		    struct dlock_list_head *dlist)
+{
+	struct dlock_list_head_percpu *head;
+
+	/*
+	 * Disable preemption to make sure that CPU won't gets changed.
+	 */
+	head = get_cpu_ptr(dlist->head);
+	spin_lock(&head->lock);
+	node->lockptr = &head->lock;
+	list_add(&node->list, &head->list);
+	spin_unlock(&head->lock);
+	put_cpu_ptr(dlist->head);
+}
+EXPORT_SYMBOL(dlock_list_add);
+
+/**
+ * dlock_list_del - Delete a node from a dlock list
+ * @node : The node to be deleted
+ *
+ * We need to check the lock pointer again after taking the lock to guard
+ * against concurrent deletion of the same node. If the lock pointer changes
+ * (becomes NULL or to a different one), we assume that the deletion was done
+ * elsewhere. A warning will be printed if this happens as it is likely to be
+ * a bug.
+ */
+void dlock_list_del(struct dlock_list_node *node)
+{
+	spinlock_t *lock = READ_ONCE(node->lockptr);
+
+	if (unlikely(!lock)) {
+		WARN_ONCE(1,
+			"dlock_list_del: node 0x%lx has no associated lock\n",
+			(unsigned long)node);
+		return;
+	}
+
+	spin_lock(lock);
+	if (likely(lock == node->lockptr)) {
+		list_del_init(&node->list);
+		node->lockptr = NULL;
+	} else {
+		/*
+		 * This path should never be executed.
+		 */
+		WARN_ON_ONCE(1);
+	}
+	spin_unlock(lock);
+}
+EXPORT_SYMBOL(dlock_list_del);
+
+/*
+ * Helper function to find the first entry of the next per-cpu list
+ * It works somewhat like for_each_possible_cpu(cpu).
+ *
+ * Return: true if the entry is found, false if all the lists exhausted
+ *
+ */
+static inline bool dlock_list_next_cpu(struct dlock_list_head *dlist,
+				       struct dlock_list_iter *iter)
+{
+	if (iter->lock)
+		spin_unlock(iter->lock);
+next_cpu:
+	/*
+	 * for_each_possible_cpu(cpu)
+	 */
+	iter->cpu = cpumask_next(iter->cpu, cpu_possible_mask);
+	if (iter->cpu >= nr_cpu_ids)
+		return false;	/* All the per-cpu lists iterated */
+
+	iter->head = &per_cpu_ptr(dlist->head, iter->cpu)->list;
+	if (list_empty(iter->head))
+		goto next_cpu;
+
+	iter->lock = &per_cpu_ptr(dlist->head, iter->cpu)->lock;
+	spin_lock(iter->lock);
+	/*
+	 * There is a slight chance that the list may become empty just
+	 * before the lock is acquired. So an additional check is
+	 * needed to make sure that iter->curr points to a valid entry.
+	 */
+	if (list_empty(iter->head)) {
+		spin_unlock(iter->lock);
+		goto next_cpu;
+	}
+	iter->curr = list_entry(iter->head->next,
+				 struct dlock_list_node, list);
+	return true;
+}
+
+/**
+ * dlock_list_next - Iterate to the next entry of the dlock list
+ * @dlist: Pointer to the dlock_list_head structure
+ * @iter : Pointer to the dlock list iterator structure
+ * Return: true if the next entry is found, false if all the entries iterated
+ *
+ * The iterator has to be properly initialized before calling this function.
+ * This iteration function isn't safe with respect to list entry removal, but
+ * it can correctly iterate newly added entries right after the current one.
+ * This iteration function is designed to be used in a while loop.
+ *
+ * Usage example:
+ *
+ * DEFINE_DLOCK_LIST_ITER(iter);
+ * while (dlock_list_next(dlist, &iter)) {
+ *	...
+ * }
+ */
+bool dlock_list_next(struct dlock_list_head *dlist,
+		     struct dlock_list_iter *iter)
+{
+	/*
+	 * Find next entry
+	 */
+	if (iter->curr)
+		iter->curr = list_next_entry(iter->curr, list);
+
+	if (!iter->curr || (&iter->curr->list == iter->head)) {
+		/*
+		 * The current per-cpu list has been exhausted, try the next
+		 * per-cpu list.
+		 */
+		if (!dlock_list_next_cpu(dlist, iter))
+			return false;
+	}
+
+	WARN_ON_ONCE(iter->curr->lockptr != iter->lock);
+	return true;	/* Continue the iteration */
+}
+EXPORT_SYMBOL(dlock_list_next);
+
+/**
+ * dlock_list_next_safe - Removal-safe iterator of dlock list
+ * @dlist: Pointer to the dlock_list_head structure
+ * @iter : Pointer to the dlock list iterator structure
+ * Return: true if the next entry is found, false if all the entries iterated
+ *
+ * The iterator has to be properly initialized before calling this function.
+ * This iteration function is safe with respect to list entry removal.
+ * However, it cannot correctly iterate newly added entries right after the
+ * current one.
+ */
+bool dlock_list_next_safe(struct dlock_list_head *dlist,
+			  struct dlock_list_iter *iter)
+{
+	/*
+	 * Find next entry
+	 */
+	if (iter->curr) {
+		iter->curr = iter->next;
+		iter->next = list_next_entry(iter->next, list);
+	}
+
+	if (!iter->curr || (&iter->curr->list == iter->head)) {
+		/*
+		 * The current per-cpu list has been exhausted, try the next
+		 * per-cpu list.
+		 */
+		if (!dlock_list_next_cpu(dlist, iter))
+			return false;
+		iter->next = list_next_entry(iter->curr, list);
+	}
+
+	WARN_ON_ONCE(iter->curr->lockptr != iter->lock);
+	return true;	/* Continue the iteration */
+}
+EXPORT_SYMBOL(dlock_list_next_safe);
-- 
1.7.1


* [PATCH v3 2/4] fsnotify: Simplify inode iteration on umount
  2016-07-15 17:39 [PATCH v3 0/4] vfs: Use dlock list for SB's s_inodes list Waiman Long
  2016-07-15 17:39 ` [PATCH v3 1/4] lib/dlock-list: Distributed and lock-protected lists Waiman Long
@ 2016-07-15 17:39 ` Waiman Long
  2016-07-15 17:39 ` [PATCH v3 3/4] vfs: Remove unnecessary list_for_each_entry_safe() variants Waiman Long
  2016-07-15 17:39 ` [PATCH v3 4/4] vfs: Use dlock list for superblock's inode list Waiman Long
  3 siblings, 0 replies; 20+ messages in thread
From: Waiman Long @ 2016-07-15 17:39 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng, Scott J Norton,
	Douglas Hatch, Jan Kara, Waiman Long

From: Jan Kara <jack@suse.cz>

fsnotify_unmount_inodes() played complex tricks to pin the next inode
in the sb->s_inodes list when iterating over all inodes. If we switch
to keeping the current inode pinned somewhat longer, we can make the
code much simpler and more standard.
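
In outline, the patched loop looks like this (condensed from the diff
below):

  spin_lock(&sb->s_inode_list_lock);
  list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
          spin_lock(&inode->i_lock);
          if (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW) ||
              !atomic_read(&inode->i_count)) {
                  spin_unlock(&inode->i_lock);
                  continue;
          }
          __iget(inode);                  /* pin the current inode */
          spin_unlock(&inode->i_lock);
          spin_unlock(&sb->s_inode_list_lock);

          if (iput_inode)
                  iput(iput_inode);       /* the previous inode is dropped
                                             only now, outside the locks */

          fsnotify(inode, FS_UNMOUNT, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
          fsnotify_inode_delete(inode);

          iput_inode = inode;             /* stays pinned for the next pass */
          spin_lock(&sb->s_inode_list_lock);
  }
  spin_unlock(&sb->s_inode_list_lock);

  if (iput_inode)
          iput(iput_inode);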

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 fs/notify/inode_mark.c |   45 +++++++++------------------------------------
 1 files changed, 9 insertions(+), 36 deletions(-)

diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index 741077d..a364524 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -150,12 +150,10 @@ int fsnotify_add_inode_mark(struct fsnotify_mark *mark,
  */
 void fsnotify_unmount_inodes(struct super_block *sb)
 {
-	struct inode *inode, *next_i, *need_iput = NULL;
+	struct inode *inode, *iput_inode = NULL;
 
 	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next_i, &sb->s_inodes, i_sb_list) {
-		struct inode *need_iput_tmp;
-
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		/*
 		 * We cannot __iget() an inode in state I_FREEING,
 		 * I_WILL_FREE, or I_NEW which is fine because by that point
@@ -178,49 +176,24 @@ void fsnotify_unmount_inodes(struct super_block *sb)
 			continue;
 		}
 
-		need_iput_tmp = need_iput;
-		need_iput = NULL;
-
-		/* In case fsnotify_inode_delete() drops a reference. */
-		if (inode != need_iput_tmp)
-			__iget(inode);
-		else
-			need_iput_tmp = NULL;
+		__iget(inode);
 		spin_unlock(&inode->i_lock);
-
-		/* In case the dropping of a reference would nuke next_i. */
-		while (&next_i->i_sb_list != &sb->s_inodes) {
-			spin_lock(&next_i->i_lock);
-			if (!(next_i->i_state & (I_FREEING | I_WILL_FREE)) &&
-						atomic_read(&next_i->i_count)) {
-				__iget(next_i);
-				need_iput = next_i;
-				spin_unlock(&next_i->i_lock);
-				break;
-			}
-			spin_unlock(&next_i->i_lock);
-			next_i = list_next_entry(next_i, i_sb_list);
-		}
-
-		/*
-		 * We can safely drop s_inode_list_lock here because either
-		 * we actually hold references on both inode and next_i or
-		 * end of list.  Also no new inodes will be added since the
-		 * umount has begun.
-		 */
 		spin_unlock(&sb->s_inode_list_lock);
 
-		if (need_iput_tmp)
-			iput(need_iput_tmp);
+		if (iput_inode)
+			iput(iput_inode);
 
 		/* for each watch, send FS_UNMOUNT and then remove it */
 		fsnotify(inode, FS_UNMOUNT, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
 
 		fsnotify_inode_delete(inode);
 
-		iput(inode);
+		iput_inode = inode;
 
 		spin_lock(&sb->s_inode_list_lock);
 	}
 	spin_unlock(&sb->s_inode_list_lock);
+
+	if (iput_inode)
+		iput(iput_inode);
 }
-- 
1.7.1



* [PATCH v3 3/4] vfs: Remove unnecessary list_for_each_entry_safe() variants
  2016-07-15 17:39 [PATCH v3 0/4] vfs: Use dlock list for SB's s_inodes list Waiman Long
  2016-07-15 17:39 ` [PATCH v3 1/4] lib/dlock-list: Distributed and lock-protected lists Waiman Long
  2016-07-15 17:39 ` [PATCH v3 2/4] fsnotify: Simplify inode iteration on umount Waiman Long
@ 2016-07-15 17:39 ` Waiman Long
  2016-07-15 17:39 ` [PATCH v3 4/4] vfs: Use dlock list for superblock's inode list Waiman Long
  3 siblings, 0 replies; 20+ messages in thread
From: Waiman Long @ 2016-07-15 17:39 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng, Scott J Norton,
	Douglas Hatch, Jan Kara, Waiman Long

From: Jan Kara <jack@suse.cz>

evict_inodes() and invalidate_inodes() use list_for_each_entry_safe()
to iterate the sb->s_inodes list. However, since we use the i_lru list
entry for our local temporary list of inodes to destroy, the inode is
guaranteed to stay on the sb->s_inodes list while we hold
sb->s_inode_list_lock. So there is no real need for the safe iteration
variant and we can use list_for_each_entry() just fine.
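
The key point is that an inode picked for disposal is queued on the
local list via its i_lru linkage while it stays linked on sb->s_inodes,
so the i_sb_list cursor remains valid. Roughly, from evict_inodes()
(with the need_resched()/retry handling trimmed):

  spin_lock(&sb->s_inode_list_lock);
  list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
          if (atomic_read(&inode->i_count))
                  continue;

          spin_lock(&inode->i_lock);
          if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
                  spin_unlock(&inode->i_lock);
                  continue;
          }

          inode->i_state |= I_FREEING;
          inode_lru_list_del(inode);
          spin_unlock(&inode->i_lock);

          /* queued via i_lru; the inode is still on i_sb_list */
          list_add(&inode->i_lru, &dispose);
  }
  spin_unlock(&sb->s_inode_list_lock);

  dispose_list(&dispose);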

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 fs/inode.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 4ccbc21..8204813 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -597,12 +597,12 @@ static void dispose_list(struct list_head *head)
  */
 void evict_inodes(struct super_block *sb)
 {
-	struct inode *inode, *next;
+	struct inode *inode;
 	LIST_HEAD(dispose);
 
 again:
 	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		if (atomic_read(&inode->i_count))
 			continue;
 
@@ -647,11 +647,11 @@ again:
 int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 {
 	int busy = 0;
-	struct inode *inode, *next;
+	struct inode *inode;
 	LIST_HEAD(dispose);
 
 	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
 			spin_unlock(&inode->i_lock);
-- 
1.7.1



* [PATCH v3 4/4] vfs: Use dlock list for superblock's inode list
  2016-07-15 17:39 [PATCH v3 0/4] vfs: Use dlock list for SB's s_inodes list Waiman Long
                   ` (2 preceding siblings ...)
  2016-07-15 17:39 ` [PATCH v3 3/4] vfs: Remove unnecessary list_for_each_entry_safe() variants Waiman Long
@ 2016-07-15 17:39 ` Waiman Long
  2016-07-19  5:23   ` Al Viro
  3 siblings, 1 reply; 20+ messages in thread
From: Waiman Long @ 2016-07-15 17:39 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng, Scott J Norton,
	Douglas Hatch, Waiman Long

When many threads are trying to add inodes to or delete inodes from
a superblock's s_inodes list, spinlock contention on the list can
become a performance bottleneck.

This patch changes the s_inodes field into a dlock list, which is a
distributed set of lists with per-list spinlocks.  As a result, the
following superblock inode list (sb->s_inodes) iteration functions in
the VFS are modified accordingly (the common conversion pattern is
sketched after the list):

 1. iterate_bdevs()
 2. drop_pagecache_sb()
 3. wait_sb_inodes()
 4. evict_inodes()
 5. invalidate_inodes()
 6. fsnotify_unmount_inodes()
 7. add_dquot_ref()
 8. remove_dquot_ref()
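
The conversion follows the same pattern in each of them; the sketch
below is condensed from drop_pagecache_sb() in the diff (the
skipped-inode check is abbreviated to a comment):

  DEFINE_DLOCK_LIST_ITER(iter);

  while (dlock_list_next(&sb->s_inodes, &iter)) {
          inode = list_entry(iter.curr, struct inode, i_sb_list);
          spin_lock(&inode->i_lock);
          /* skip I_FREEING/I_WILL_FREE/I_NEW inodes as before */
          __iget(inode);
          spin_unlock(&inode->i_lock);
          spin_unlock(iter.lock);         /* was s_inode_list_lock */

          invalidate_mapping_pages(inode->i_mapping, 0, -1);
          iput(toput_inode);
          toput_inode = inode;

          spin_lock(iter.lock);           /* re-take before continuing */
  }
  iput(toput_inode);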

An exit microbenchmark creates a large number of threads, attaches
many inodes to them and then exits. The runtimes of that
microbenchmark with 1000 threads before and after the patch on a
4-socket Intel E7-4820 v3 system (40 cores, 80 threads) were as
follows:

  Kernel            Elapsed Time    System Time
  ------            ------------    -----------
  Vanilla 4.5-rc4      65.29s         82m14s
  Patched 4.5-rc4      22.81s         23m03s

Before the patch, spinlock contention in the inode_sb_list_add()
function during the startup phase and in the inode_sb_list_del()
function during the exit phase accounted for about 79% and 93% of
total CPU time respectively (as measured by perf). After the patch,
the dlock_list_add() function consumed only about 0.04% of CPU time
in the startup phase and the dlock_list_del() function about 0.4% in
the exit phase. There was still some spinlock contention, but it
happened elsewhere.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/block_dev.c         |   13 +++++++------
 fs/drop_caches.c       |   10 +++++-----
 fs/fs-writeback.c      |   13 +++++++------
 fs/inode.c             |   36 +++++++++++++++---------------------
 fs/notify/inode_mark.c |   10 +++++-----
 fs/quota/dquot.c       |   16 ++++++++--------
 fs/super.c             |    7 ++++---
 include/linux/fs.h     |    8 ++++----
 8 files changed, 55 insertions(+), 58 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 71ccab1..f410c62 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1896,11 +1896,13 @@ EXPORT_SYMBOL(__invalidate_device);
 void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
 {
 	struct inode *inode, *old_inode = NULL;
+	DEFINE_DLOCK_LIST_ITER(iter);
 
-	spin_lock(&blockdev_superblock->s_inode_list_lock);
-	list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
-		struct address_space *mapping = inode->i_mapping;
+	while (dlock_list_next(&blockdev_superblock->s_inodes, &iter)) {
+		struct address_space *mapping;
 
+		inode   = list_entry(iter.curr, struct inode, i_sb_list);
+		mapping = inode->i_mapping;
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
 		    mapping->nrpages == 0) {
@@ -1909,7 +1911,7 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&blockdev_superblock->s_inode_list_lock);
+		spin_unlock(iter.lock);
 		/*
 		 * We hold a reference to 'inode' so it couldn't have been
 		 * removed from s_inodes list while we dropped the
@@ -1923,8 +1925,7 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
 
 		func(I_BDEV(inode), arg);
 
-		spin_lock(&blockdev_superblock->s_inode_list_lock);
+		spin_lock(iter.lock);
 	}
-	spin_unlock(&blockdev_superblock->s_inode_list_lock);
 	iput(old_inode);
 }
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index d72d52b..2fcd19d 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -16,9 +16,10 @@ int sysctl_drop_caches;
 static void drop_pagecache_sb(struct super_block *sb, void *unused)
 {
 	struct inode *inode, *toput_inode = NULL;
+	DEFINE_DLOCK_LIST_ITER(iter);
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	while (dlock_list_next(&sb->s_inodes, &iter)) {
+		inode = list_entry(iter.curr, struct inode, i_sb_list);
 		spin_lock(&inode->i_lock);
 		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
 		    (inode->i_mapping->nrpages == 0)) {
@@ -27,15 +28,14 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
+		spin_unlock(iter.lock);
 
 		invalidate_mapping_pages(inode->i_mapping, 0, -1);
 		iput(toput_inode);
 		toput_inode = inode;
 
-		spin_lock(&sb->s_inode_list_lock);
+		spin_lock(iter.lock);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 	iput(toput_inode);
 }
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 989a2ce..7258aab 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -2155,6 +2155,7 @@ EXPORT_SYMBOL(__mark_inode_dirty);
 static void wait_sb_inodes(struct super_block *sb)
 {
 	struct inode *inode, *old_inode = NULL;
+	DEFINE_DLOCK_LIST_ITER(iter);
 
 	/*
 	 * We need to be protected against the filesystem going from
@@ -2163,7 +2164,6 @@ static void wait_sb_inodes(struct super_block *sb)
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
 	mutex_lock(&sb->s_sync_lock);
-	spin_lock(&sb->s_inode_list_lock);
 
 	/*
 	 * Data integrity sync. Must wait for all pages under writeback,
@@ -2172,9 +2172,11 @@ static void wait_sb_inodes(struct super_block *sb)
 	 * In which case, the inode may not be on the dirty list, but
 	 * we still have to wait for that writeout.
 	 */
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		struct address_space *mapping = inode->i_mapping;
+	while (dlock_list_next(&sb->s_inodes, &iter)) {
+		struct address_space *mapping;
 
+		inode   = list_entry(iter.curr, struct inode, i_sb_list);
+		mapping = inode->i_mapping;
 		spin_lock(&inode->i_lock);
 		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
 		    (mapping->nrpages == 0)) {
@@ -2183,7 +2185,7 @@ static void wait_sb_inodes(struct super_block *sb)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
+		spin_unlock(iter.lock);
 
 		/*
 		 * We hold a reference to 'inode' so it couldn't have been
@@ -2205,9 +2207,8 @@ static void wait_sb_inodes(struct super_block *sb)
 
 		cond_resched();
 
-		spin_lock(&sb->s_inode_list_lock);
+		spin_lock(iter.lock);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 	iput(old_inode);
 	mutex_unlock(&sb->s_sync_lock);
 }
diff --git a/fs/inode.c b/fs/inode.c
index 8204813..c048ab3 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -28,7 +28,7 @@
  *   inode->i_state, inode->i_hash, __iget()
  * Inode LRU list locks protect:
  *   inode->i_sb->s_inode_lru, inode->i_lru
- * inode->i_sb->s_inode_list_lock protects:
+ * inode->i_sb->s_inodes->head->lock protects:
  *   inode->i_sb->s_inodes, inode->i_sb_list
  * bdi->wb.list_lock protects:
  *   bdi->wb.b_{dirty,io,more_io,dirty_time}, inode->i_io_list
@@ -37,7 +37,7 @@
  *
  * Lock ordering:
  *
- * inode->i_sb->s_inode_list_lock
+ * inode->i_sb->s_inodes->head->lock
  *   inode->i_lock
  *     Inode LRU list locks
  *
@@ -45,7 +45,7 @@
  *   inode->i_lock
  *
  * inode_hash_lock
- *   inode->i_sb->s_inode_list_lock
+ *   inode->i_sb->s_inodes->head->lock
  *   inode->i_lock
  *
  * iunique_lock
@@ -431,19 +431,14 @@ static void inode_lru_list_del(struct inode *inode)
  */
 void inode_sb_list_add(struct inode *inode)
 {
-	spin_lock(&inode->i_sb->s_inode_list_lock);
-	list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
-	spin_unlock(&inode->i_sb->s_inode_list_lock);
+	dlock_list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
 }
 EXPORT_SYMBOL_GPL(inode_sb_list_add);
 
 static inline void inode_sb_list_del(struct inode *inode)
 {
-	if (!list_empty(&inode->i_sb_list)) {
-		spin_lock(&inode->i_sb->s_inode_list_lock);
-		list_del_init(&inode->i_sb_list);
-		spin_unlock(&inode->i_sb->s_inode_list_lock);
-	}
+	if (!list_empty(&inode->i_sb_list.list))
+		dlock_list_del(&inode->i_sb_list);
 }
 
 static unsigned long hash(struct super_block *sb, unsigned long hashval)
@@ -598,11 +593,13 @@ static void dispose_list(struct list_head *head)
 void evict_inodes(struct super_block *sb)
 {
 	struct inode *inode;
+	struct dlock_list_iter iter;
 	LIST_HEAD(dispose);
 
 again:
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	init_dlock_list_iter(&iter);
+	while (dlock_list_next(&sb->s_inodes, &iter)) {
+		inode = list_entry(iter.curr, struct inode, i_sb_list);
 		if (atomic_read(&inode->i_count))
 			continue;
 
@@ -623,13 +620,12 @@ again:
 		 * bit so we don't livelock.
 		 */
 		if (need_resched()) {
-			spin_unlock(&sb->s_inode_list_lock);
+			spin_unlock(iter.lock);
 			cond_resched();
 			dispose_list(&dispose);
 			goto again;
 		}
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 
 	dispose_list(&dispose);
 }
@@ -649,9 +645,10 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 	int busy = 0;
 	struct inode *inode;
 	LIST_HEAD(dispose);
+	DEFINE_DLOCK_LIST_ITER(iter);
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	while (dlock_list_next(&sb->s_inodes, &iter)) {
+		inode = list_entry(iter.curr, struct inode, i_sb_list);
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
 			spin_unlock(&inode->i_lock);
@@ -673,7 +670,6 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 		spin_unlock(&inode->i_lock);
 		list_add(&inode->i_lru, &dispose);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 
 	dispose_list(&dispose);
 
@@ -888,7 +884,7 @@ struct inode *new_inode_pseudo(struct super_block *sb)
 		spin_lock(&inode->i_lock);
 		inode->i_state = 0;
 		spin_unlock(&inode->i_lock);
-		INIT_LIST_HEAD(&inode->i_sb_list);
+		init_dlock_list_node(&inode->i_sb_list);
 	}
 	return inode;
 }
@@ -909,8 +905,6 @@ struct inode *new_inode(struct super_block *sb)
 {
 	struct inode *inode;
 
-	spin_lock_prefetch(&sb->s_inode_list_lock);
-
 	inode = new_inode_pseudo(sb);
 	if (inode)
 		inode_sb_list_add(inode);
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index a364524..6e4e8b8 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -151,14 +151,15 @@ int fsnotify_add_inode_mark(struct fsnotify_mark *mark,
 void fsnotify_unmount_inodes(struct super_block *sb)
 {
 	struct inode *inode, *iput_inode = NULL;
+	DEFINE_DLOCK_LIST_ITER(iter);
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	while (dlock_list_next(&sb->s_inodes, &iter)) {
 		/*
 		 * We cannot __iget() an inode in state I_FREEING,
 		 * I_WILL_FREE, or I_NEW which is fine because by that point
 		 * the inode cannot have any associated watches.
 		 */
+		inode = list_entry(iter.curr, struct inode, i_sb_list);
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
 			spin_unlock(&inode->i_lock);
@@ -178,7 +179,7 @@ void fsnotify_unmount_inodes(struct super_block *sb)
 
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
+		spin_unlock(iter.lock);
 
 		if (iput_inode)
 			iput(iput_inode);
@@ -190,9 +191,8 @@ void fsnotify_unmount_inodes(struct super_block *sb)
 
 		iput_inode = inode;
 
-		spin_lock(&sb->s_inode_list_lock);
+		spin_lock(iter.lock);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 
 	if (iput_inode)
 		iput(iput_inode);
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index ff21980..b55f396 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -936,12 +936,13 @@ static int dqinit_needed(struct inode *inode, int type)
 static void add_dquot_ref(struct super_block *sb, int type)
 {
 	struct inode *inode, *old_inode = NULL;
+	DEFINE_DLOCK_LIST_ITER(iter);
 #ifdef CONFIG_QUOTA_DEBUG
 	int reserved = 0;
 #endif
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	while (dlock_list_next(&sb->s_inodes, &iter)) {
+		inode = list_entry(iter.curr, struct inode, i_sb_list);
 		spin_lock(&inode->i_lock);
 		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
 		    !atomic_read(&inode->i_writecount) ||
@@ -951,7 +952,7 @@ static void add_dquot_ref(struct super_block *sb, int type)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
+		spin_unlock(iter.lock);
 
 #ifdef CONFIG_QUOTA_DEBUG
 		if (unlikely(inode_get_rsv_space(inode) > 0))
@@ -969,9 +970,8 @@ static void add_dquot_ref(struct super_block *sb, int type)
 		 * later.
 		 */
 		old_inode = inode;
-		spin_lock(&sb->s_inode_list_lock);
+		spin_lock(iter.lock);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 	iput(old_inode);
 
 #ifdef CONFIG_QUOTA_DEBUG
@@ -1039,15 +1039,16 @@ static void remove_dquot_ref(struct super_block *sb, int type,
 {
 	struct inode *inode;
 	int reserved = 0;
+	DEFINE_DLOCK_LIST_ITER(iter);
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	while (dlock_list_next(&sb->s_inodes, &iter)) {
 		/*
 		 *  We have to scan also I_NEW inodes because they can already
 		 *  have quota pointer initialized. Luckily, we need to touch
 		 *  only quota pointers and these have separate locking
 		 *  (dq_data_lock).
 		 */
+		inode = list_entry(iter.curr, struct inode, i_sb_list);
 		spin_lock(&dq_data_lock);
 		if (!IS_NOQUOTA(inode)) {
 			if (unlikely(inode_get_rsv_space(inode) > 0))
@@ -1056,7 +1057,6 @@ static void remove_dquot_ref(struct super_block *sb, int type,
 		}
 		spin_unlock(&dq_data_lock);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 #ifdef CONFIG_QUOTA_DEBUG
 	if (reserved) {
 		printk(KERN_WARNING "VFS (%s): Writes happened after quota"
diff --git a/fs/super.c b/fs/super.c
index d78b984..4c33204 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -163,6 +163,7 @@ static void destroy_super(struct super_block *s)
 {
 	list_lru_destroy(&s->s_dentry_lru);
 	list_lru_destroy(&s->s_inode_lru);
+	free_dlock_list_head(&s->s_inodes);
 	security_sb_free(s);
 	WARN_ON(!list_empty(&s->s_mounts));
 	kfree(s->s_subtype);
@@ -204,9 +205,9 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 	INIT_HLIST_NODE(&s->s_instances);
 	INIT_HLIST_BL_HEAD(&s->s_anon);
 	mutex_init(&s->s_sync_lock);
-	INIT_LIST_HEAD(&s->s_inodes);
-	spin_lock_init(&s->s_inode_list_lock);
 
+	if (alloc_dlock_list_head(&s->s_inodes))
+		goto fail;
 	if (list_lru_init_memcg(&s->s_dentry_lru))
 		goto fail;
 	if (list_lru_init_memcg(&s->s_inode_lru))
@@ -427,7 +428,7 @@ void generic_shutdown_super(struct super_block *sb)
 		if (sop->put_super)
 			sop->put_super(sb);
 
-		if (!list_empty(&sb->s_inodes)) {
+		if (!dlock_list_empty(&sb->s_inodes)) {
 			printk("VFS: Busy inodes after unmount of %s. "
 			   "Self-destruct in 5 seconds.  Have a nice day...\n",
 			   sb->s_id);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index dd28814..d5233ec 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -32,6 +32,7 @@
 #include <linux/workqueue.h>
 #include <linux/percpu-rwsem.h>
 #include <linux/delayed_call.h>
+#include <linux/dlock-list.h>
 
 #include <asm/byteorder.h>
 #include <uapi/linux/fs.h>
@@ -664,7 +665,7 @@ struct inode {
 	u16			i_wb_frn_history;
 #endif
 	struct list_head	i_lru;		/* inode LRU list */
-	struct list_head	i_sb_list;
+	struct dlock_list_node	i_sb_list;
 	union {
 		struct hlist_head	i_dentry;
 		struct rcu_head		i_rcu;
@@ -1445,9 +1446,8 @@ struct super_block {
 	 */
 	int s_stack_depth;
 
-	/* s_inode_list_lock protects s_inodes */
-	spinlock_t		s_inode_list_lock ____cacheline_aligned_in_smp;
-	struct list_head	s_inodes;	/* all inodes */
+	/* The internal percpu locks protect s_inodes */
+	struct dlock_list_head s_inodes;	/* all inodes */
 };
 
 extern struct timespec current_fs_time(struct super_block *sb);
-- 
1.7.1



* Re: [PATCH v3 1/4] lib/dlock-list: Distributed and lock-protected lists
  2016-07-15 17:39 ` [PATCH v3 1/4] lib/dlock-list: Distributed and lock-protected lists Waiman Long
@ 2016-07-18 23:38   ` Tejun Heo
  2016-07-19 18:42     ` Waiman Long
  2016-07-19  5:00   ` Al Viro
  1 sibling, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2016-07-18 23:38 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Christoph Lameter, linux-fsdevel, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Andi Kleen, Dave Chinner, Boqun Feng,
	Scott J Norton, Douglas Hatch

Hello, Waiman.

On Fri, Jul 15, 2016 at 01:39:40PM -0400, Waiman Long wrote:
> Suggested-by: Tejun Heo <tj@kernel.org>

Not sure I should be on suggested-by given that this wasn't my idea at
all.

> +/*
> + * include/linux/dlock-list.h
> + *
> + * A distributed (per-cpu) set of lists each of which is protected by its
> + * own spinlock, but acts like a single consolidated list to the callers.
> + *
> + * The dlock_list_head_percpu structure contains the spinlock, the other
> + * dlock_list_node structures only contains a pointer to the spinlock in
> + * dlock_list_head_percpu.
> + */

The more I think about it, the more bothered I am by the dlock_list
name.  For the most part, this isn't different from other percpu data
structures in the kernel.  Sure, it might benefit from doing Nth cpu,
but so are other percpu data structures and it's not just "distributed
lock" list either.  The list itself is percpu, not just locking.  Can
we please go back to percpu_list?  Christoph, what do you think?

> +struct dlock_list_node {
> +	struct list_head list;
> +	spinlock_t *lockptr;
> +};

Wouldn't it be better to point to dlock_list_percpu?

> +#define DLOCK_LIST_HEAD_PERCPU_INIT(name)			\
> +	{							\
> +		.list.prev = &name.list,			\
> +		.list.next = &name.list,			\

Use LIST_HEAD_INIT()?  Also, why do we even need the initializers if
the data structure can only be dynamically allocated.  In fact, do
the definitions even need to be exposed in the header?

> +		.list.lock = __SPIN_LOCK_UNLOCKED(name),	\
> +	}
> +
> +/*
> + * dlock list iteration state
> + */
> +struct dlock_list_iter {
> +	int			 cpu;
                                ^
	I'm not sure lining up with space here is common in kernel.

> +	spinlock_t		*lock;
> +	struct list_head	*head;	/* List head of current per-cpu list */
> +	struct dlock_list_node	*curr;
> +	struct dlock_list_node	*next;
> +};
> +
> +#define DLOCK_LIST_ITER_INIT()			\
                               ^
		Don't we usually omit () in these cases?
> +	{					\
> +		.cpu  = -1,			\
> +	}
> +
> +#define DEFINE_DLOCK_LIST_ITER(s)		\
> +	struct dlock_list_iter s = DLOCK_LIST_ITER_INIT()
> +
> +static inline void init_dlock_list_iter(struct dlock_list_iter *iter)
> +{
> +	*iter = (struct dlock_list_iter)DLOCK_LIST_ITER_INIT();
> +}
> +
> +#define DLOCK_LIST_NODE_INIT(name)		\
> +	{					\
> +		.list.prev = &name.list,	\
> +		.list.next = &name.list,	\
		^
		LIST_HEAD_INIT()?
> +	}
> +
> +static inline void init_dlock_list_node(struct dlock_list_node *node)
> +{
> +	INIT_LIST_HEAD(&node->list);
> +	node->lockptr = NULL;

Why not use DLOCK_LIST_NODE_INIT()?

> +}
> +
> +/*
> + * Check if all the dlock lists are empty
> + *
> + * This can be a pretty expensive function call. If this function is required
> + * in a performance critical path, we may have to maintain a global count
> + * of the list entries in the global dlock_list_head structure instead.
> + */

/** function comment please.

> +static inline bool dlock_list_empty(struct dlock_list_head *dlist)
> +{
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu)
> +		if (!list_empty(&per_cpu_ptr(dlist->head, cpu)->list))
> +			return false;
> +	return true;
> +}
> +
> +/*
> + * Allocation and freeing of dlock list
> + */
> +extern int  alloc_dlock_list_head(struct dlock_list_head *dlist);
              ^
	      ditto with alignment

> +extern void free_dlock_list_head(struct dlock_list_head *dlist);
> +
> +/*
> + * The dlock list iteration functions which return true if iteration has
> + * to be continued.
> + */
> +extern bool dlock_list_next(struct dlock_list_head *dlist,
> +			    struct dlock_list_iter *iter);
> +extern bool dlock_list_next_safe(struct dlock_list_head *dlist,
> +				 struct dlock_list_iter *iter);

Why not return dlock_list_node * for the current node?  That'd be more
conventional and allows dlock_list_iter to be opaque.
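
E.g. something along these lines (just a sketch of the shape I mean):

  extern struct dlock_list_node *
  dlock_list_next(struct dlock_list_head *dlist,
                  struct dlock_list_iter *iter);

  struct dlock_list_node *node;
  DEFINE_DLOCK_LIST_ITER(iter);

  while ((node = dlock_list_next(dlist, &iter))) {
          struct inode *inode = list_entry(node, struct inode, i_sb_list);

          /* work on inode while the sub-list lock is held */
  }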

> diff --git a/lib/dlock-list.c b/lib/dlock-list.c
> new file mode 100644
> index 0000000..af4a9f3
> --- /dev/null
> +++ b/lib/dlock-list.c
...
> +int alloc_dlock_list_head(struct dlock_list_head *dlist)
> +{
> +	struct dlock_list_head dlist_tmp;
> +	int cpu;
> +
> +	dlist_tmp.head = alloc_percpu(struct dlock_list_head_percpu);
> +	if (!dlist_tmp.head)
> +		return -ENOMEM;
> +
> +	for_each_possible_cpu(cpu) {
> +		struct dlock_list_head_percpu *head;
> +
> +		head = per_cpu_ptr(dlist_tmp.head, cpu);
> +		INIT_LIST_HEAD(&head->list);
> +		head->lock = __SPIN_LOCK_UNLOCKED(&head->lock);
> +		lockdep_set_class(&head->lock, &dlock_list_key);
> +	}
> +
> +	dlist->head = dlist_tmp.head;

Just use dlist->head directly or use local __perpcu head pointer?

> +	return 0;
> +}
> +EXPORT_SYMBOL(alloc_dlock_list_head);

Does this actually need to be exported?  If so, it might be a better
idea to start with EXPORT_SYMBOL_GPL().

> +void dlock_list_add(struct dlock_list_node *node,
> +		    struct dlock_list_head *dlist)
> +{
> +	struct dlock_list_head_percpu *head;
                                      ^

This probably requires __percpu annotation.  Have you run it through
sparse and checked for address space warnings?

> +
> +	/*
> +	 * Disable preemption to make sure that CPU won't gets changed.
> +	 */
> +	head = get_cpu_ptr(dlist->head);
> +	spin_lock(&head->lock);
> +	node->lockptr = &head->lock;
> +	list_add(&node->list, &head->list);
> +	spin_unlock(&head->lock);
> +	put_cpu_ptr(dlist->head);
> +}
> +EXPORT_SYMBOL(dlock_list_add);
> +
> +/**
> + * dlock_list_del - Delete a node from a dlock list
> + * @node : The node to be deleted
> + *
> + * We need to check the lock pointer again after taking the lock to guard
> + * against concurrent deletion of the same node. If the lock pointer changes
> + * (becomes NULL or to a different one), we assume that the deletion was done
> + * elsewhere. A warning will be printed if this happens as it is likely to be
> + * a bug.
> + */
> +void dlock_list_del(struct dlock_list_node *node)
> +{
> +	spinlock_t *lock = READ_ONCE(node->lockptr);
> +
> +	if (unlikely(!lock)) {
> +		WARN_ONCE(1,
> +			"dlock_list_del: node 0x%lx has no associated lock\n",
> +			(unsigned long)node);

Maybe "if (WARN_ONCE(!lock...)"?  WARN_ONCE implies unlikely.

> +		return;
> +	}
> +
> +	spin_lock(lock);
> +	if (likely(lock == node->lockptr)) {
> +		list_del_init(&node->list);
> +		node->lockptr = NULL;
> +	} else {
> +		/*
> +		 * This path should never be executed.
> +		 */
> +		WARN_ON_ONCE(1);
> +	}

This still kinda bothers me because this pretty much requires the
users to have strong synchronization around the operations and makes
it unusable in situations where opportunistic behaviors are
acceptable.  It negates the usefulness quite a bit.

> +	spin_unlock(lock);
> +}
> +EXPORT_SYMBOL(dlock_list_del);
> +
> +/*
> + * Helper function to find the first entry of the next per-cpu list
> + * It works somewhat like for_each_possible_cpu(cpu).
> + *
> + * Return: true if the entry is found, false if all the lists exhausted
> + *
> + */
> +static inline bool dlock_list_next_cpu(struct dlock_list_head *dlist,
              ^
	Just let the compiler figure it out.

> +				       struct dlock_list_iter *iter)
> +{
> +	if (iter->lock)
> +		spin_unlock(iter->lock);
> +next_cpu:
> +	/*
> +	 * for_each_possible_cpu(cpu)
> +	 */
> +	iter->cpu = cpumask_next(iter->cpu, cpu_possible_mask);
> +	if (iter->cpu >= nr_cpu_ids)
> +		return false;	/* All the per-cpu lists iterated */
> +
> +	iter->head = &per_cpu_ptr(dlist->head, iter->cpu)->list;
> +	if (list_empty(iter->head))
> +		goto next_cpu;
> +
> +	iter->lock = &per_cpu_ptr(dlist->head, iter->cpu)->lock;
> +	spin_lock(iter->lock);
> +	/*
> +	 * There is a slight chance that the list may become empty just
> +	 * before the lock is acquired. So an additional check is
> +	 * needed to make sure that iter->curr points to a valid entry.
> +	 */
> +	if (list_empty(iter->head)) {
> +		spin_unlock(iter->lock);
> +		goto next_cpu;
> +	}
> +	iter->curr = list_entry(iter->head->next,
> +				 struct dlock_list_node, list);
> +	return true;
> +}
...
> +/**
> + * dlock_list_next_safe - Removal-safe iterator of dlock list
> + * @dlist: Pointer to the dlock_list_head structure
> + * @iter : Pointer to the dlock list iterator structure
> + * Return: true if the next entry is found, false if all the entries iterated
> + *
> + * The iterator has to be properly initialized before calling this function.
> + * This iteration function is safe with respect to list entry removal.
> + * However, it cannot correctly iterate newly added entries right after the
> + * current one.
> + */

This still looks wrong to me.  If you want to provide the two variants
of iterations, can't you just implement one next function and build
the two types of iterations on top of it?

Thanks.

-- 
tejun


* Re: [PATCH v3 1/4] lib/dlock-list: Distributed and lock-protected lists
  2016-07-15 17:39 ` [PATCH v3 1/4] lib/dlock-list: Distributed and lock-protected lists Waiman Long
  2016-07-18 23:38   ` Tejun Heo
@ 2016-07-19  5:00   ` Al Viro
  2016-07-19 19:01     ` Waiman Long
  1 sibling, 1 reply; 20+ messages in thread
From: Al Viro @ 2016-07-19  5:00 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jan Kara, Jeff Layton, J. Bruce Fields, Tejun Heo,
	Christoph Lameter, linux-fsdevel, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Andi Kleen, Dave Chinner, Boqun Feng,
	Scott J Norton, Douglas Hatch

On Fri, Jul 15, 2016 at 01:39:40PM -0400, Waiman Long wrote:

> +struct dlock_list_head_percpu {
> +	struct list_head list;
> +	spinlock_t lock;
> +};

> +#define DLOCK_LIST_HEAD_PERCPU_INIT(name)			\
> +	{							\
> +		.list.prev = &name.list,			\
> +		.list.next = &name.list,			\
> +		.list.lock = __SPIN_LOCK_UNLOCKED(name),	\

What's .list.lock and how does that even compile?

> +extern bool dlock_list_next(struct dlock_list_head *dlist,
> +			    struct dlock_list_iter *iter);

Ugh...  Why not dlist_for_each_entry(), seeing that all users end up with
the same boilerplate?


* Re: [PATCH v3 4/4] vfs: Use dlock list for superblock's inode list
  2016-07-15 17:39 ` [PATCH v3 4/4] vfs: Use dlock list for superblock's inode list Waiman Long
@ 2016-07-19  5:23   ` Al Viro
  2016-07-19 18:35     ` Tejun Heo
  2016-07-19 19:07     ` Waiman Long
  0 siblings, 2 replies; 20+ messages in thread
From: Al Viro @ 2016-07-19  5:23 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jan Kara, Jeff Layton, J. Bruce Fields, Tejun Heo,
	Christoph Lameter, linux-fsdevel, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Andi Kleen, Dave Chinner, Boqun Feng,
	Scott J Norton, Douglas Hatch

On Fri, Jul 15, 2016 at 01:39:43PM -0400, Waiman Long wrote:
>  void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
>  {
>  	struct inode *inode, *old_inode = NULL;
> +	DEFINE_DLOCK_LIST_ITER(iter);
>  
> -	spin_lock(&blockdev_superblock->s_inode_list_lock);
> -	list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
> -		struct address_space *mapping = inode->i_mapping;
> +	while (dlock_list_next(&blockdev_superblock->s_inodes, &iter)) {
> +		struct address_space *mapping;
>  
> +		inode   = list_entry(iter.curr, struct inode, i_sb_list);
> +		mapping = inode->i_mapping;

TBH, I would very much prefer something like
	DEFINE_DLOCK_LIST_ITER(iter, &blockdev_superblock->s_inodes);

	dlist_for_each_entry(inode, &iter, i_sb_list) {
		mapping = inode->i_mapping;

> -		spin_unlock(&blockdev_superblock->s_inode_list_lock);
> +		spin_unlock(iter.lock);

... and this might be worth dlist_{un,re}lock(&iter);


* Re: [PATCH v3 4/4] vfs: Use dlock list for superblock's inode list
  2016-07-19  5:23   ` Al Viro
@ 2016-07-19 18:35     ` Tejun Heo
  2016-07-19 19:07     ` Waiman Long
  1 sibling, 0 replies; 20+ messages in thread
From: Tejun Heo @ 2016-07-19 18:35 UTC (permalink / raw)
  To: Al Viro
  Cc: Waiman Long, Jan Kara, Jeff Layton, J. Bruce Fields,
	Christoph Lameter, linux-fsdevel, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Andi Kleen, Dave Chinner, Boqun Feng,
	Scott J Norton, Douglas Hatch

Hello,

On Tue, Jul 19, 2016 at 06:23:42AM +0100, Al Viro wrote:
> On Fri, Jul 15, 2016 at 01:39:43PM -0400, Waiman Long wrote:
> > +		inode   = list_entry(iter.curr, struct inode, i_sb_list);
> > +		mapping = inode->i_mapping;
> 
> TBH, I would very much prefer something like
> 	DEFINE_DLOCK_LIST_ITER(iter, &blockdev_superblock->s_inodes);
> 
> 	dlist_for_each_entry(inode, &iter, i_sb_list) {
> 		mapping = inode->i_mapping;
> 
> > -		spin_unlock(&blockdev_superblock->s_inode_list_lock);
> > +		spin_unlock(iter.lock);
> 
> ... and this might be worth dlist_{un,re}lock(&iter);

If iterating requires prep / cleanup stages, it might as well do

	DEFINE_DLOCK_LIST_ITER(iter);

	dlock_list_iter_start(&iter, &blockdev_superblock->s_inodes);

	// iterate

	dlock_list_iter_end(&iter);

And probably provide dlist_for_each_entry() macro which handles
everything.

Thanks.

-- 
tejun


* Re: [PATCH v3 1/4] lib/dlock-list: Distributed and lock-protected lists
  2016-07-18 23:38   ` Tejun Heo
@ 2016-07-19 18:42     ` Waiman Long
  2016-07-19 19:23       ` Tejun Heo
                         ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Waiman Long @ 2016-07-19 18:42 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Christoph Lameter, linux-fsdevel, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Andi Kleen, Dave Chinner, Boqun Feng,
	Scott J Norton, Douglas Hatch

On 07/18/2016 07:38 PM, Tejun Heo wrote:
> Hello, Waiman.
>
> On Fri, Jul 15, 2016 at 01:39:40PM -0400, Waiman Long wrote:
>> Suggested-by: Tejun Heo<tj@kernel.org>
> Not sure I should be on suggested-by given that this wasn't my idea at
> all.

I put the tag there because of your suggestion to restructure the data 
structure which did make the patch better. I can remove that tag if you 
think it is not appropriate.

>> +/*
>> + * include/linux/dlock-list.h
>> + *
>> + * A distributed (per-cpu) set of lists each of which is protected by its
>> + * own spinlock, but acts like a single consolidated list to the callers.
>> + *
>> + * The dlock_list_head_percpu structure contains the spinlock, the other
>> + * dlock_list_node structures only contains a pointer to the spinlock in
>> + * dlock_list_head_percpu.
>> + */
> The more I think about it, the more bothered I am by the dlock_list
> name.  For the most part, this isn't different from other percpu data
> structures in the kernel.  Sure, it might benefit from doing Nth cpu,
> but so are other percpu data structures and it's not just "distributed
> lock" list either.  The list itself is percpu, not just locking.  Can
> we please go back to percpu_list?  Christoph, what do you think?
>

As I said before, I don't mind reverting the name back to percpu_list. I 
am just waiting for a final agreement.

>> +struct dlock_list_node {
>> +	struct list_head list;
>> +	spinlock_t *lockptr;
>> +};
> Wouldn't it be better to point to dlock_list_percpu?

I could. However, the only thing that matters is the spinlock that
protects the list entry.

>> +#define DLOCK_LIST_HEAD_PERCPU_INIT(name)			\
>> +	{							\
>> +		.list.prev = &name.list,			\
>> +		.list.next = &name.list,			\
> Use LIST_HEAD_INIT()?  Also, why do we even need the initializers if
> the data structure can only be dynamically allocated.  In fact, do
> the definitions even need to be exposed in the header?

I put it there for completeness' sake. You are right that the macro isn't
currently used. I will remove it in the next iteration of the patch.

>> +		.list.lock = __SPIN_LOCK_UNLOCKED(name),	\
>> +	}
>> +
>> +/*
>> + * dlock list iteration state
>> + */
>> +struct dlock_list_iter {
>> +	int			 cpu;
>                                  ^
> 	I'm not sure lining up with space here is common in kernel.

OK, I will remove the extra spaces so that it conforms to the usual
kernel style.

>> +	spinlock_t		*lock;
>> +	struct list_head	*head;	/* List head of current per-cpu list */
>> +	struct dlock_list_node	*curr;
>> +	struct dlock_list_node	*next;
>> +};
>> +
>> +#define DLOCK_LIST_ITER_INIT()			\
>                                 ^
> 		Don't we usually omit () in these cases?

Good point. Will remove the unneeded ().

>> +	{					\
>> +		.cpu  = -1,			\
>> +	}
>> +
>> +#define DEFINE_DLOCK_LIST_ITER(s)		\
>> +	struct dlock_list_iter s = DLOCK_LIST_ITER_INIT()
>> +
>> +static inline void init_dlock_list_iter(struct dlock_list_iter *iter)
>> +{
>> +	*iter = (struct dlock_list_iter)DLOCK_LIST_ITER_INIT();
>> +}
>> +
>> +#define DLOCK_LIST_NODE_INIT(name)		\
>> +	{					\
>> +		.list.prev = &name.list,	\
>> +		.list.next = &name.list,	\
> 		^
> 		LIST_HEAD_INIT()?

Will make the change.

>> +	}
>> +
>> +static inline void init_dlock_list_node(struct dlock_list_node *node)
>> +{
>> +	INIT_LIST_HEAD(&node->list);
>> +	node->lockptr = NULL;
> Why not use DLOCK_LIST_NODE_INIT()?
>

Yes, I can make the change.

>> +}
>> +
>> +/*
>> + * Check if all the dlock lists are empty
>> + *
>> + * This can be a pretty expensive function call. If this function is required
>> + * in a performance critical path, we may have to maintain a global count
>> + * of the list entries in the global dlock_list_head structure instead.
>> + */
> /** function comment please.

Sure.

>> +static inline bool dlock_list_empty(struct dlock_list_head *dlist)
>> +{
>> +	int cpu;
>> +
>> +	for_each_possible_cpu(cpu)
>> +		if (!list_empty(&per_cpu_ptr(dlist->head, cpu)->list))
>> +			return false;
>> +	return true;
>> +}
>> +
>> +/*
>> + * Allocation and freeing of dlock list
>> + */
>> +extern int  alloc_dlock_list_head(struct dlock_list_head *dlist);
>                ^
> 	      ditto with alignment

Will remove the extra space.

>> +extern void free_dlock_list_head(struct dlock_list_head *dlist);
>> +
>> +/*
>> + * The dlock list iteration functions which return true if iteration has
>> + * to be continued.
>> + */
>> +extern bool dlock_list_next(struct dlock_list_head *dlist,
>> +			    struct dlock_list_iter *iter);
>> +extern bool dlock_list_next_safe(struct dlock_list_head *dlist,
>> +				 struct dlock_list_iter *iter);
> Why not return dlock_list_node * for the current node?  That'd more
> conventional and allows dlock_list_iter to be opaque.

Yes, I can make it return dlock_list_node *.

However, to make dlock_list_iter opaque, I will have to dynamically
allocate the structure. That will add extra memory allocation and free
calls as well as handling of the out-of-memory error case. I don't
think that is worth doing at this point.
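
Just as a usage sketch (assuming the dlock-list conversion of the
superblock's s_inodes list from patch 4), the on-stack iterator keeps
the common path free of allocations and -ENOMEM handling:

	DEFINE_DLOCK_LIST_ITER(iter);	/* lives on the stack */
	struct inode *inode;

	while (dlock_list_next(&sb->s_inodes, &iter)) {
		inode = list_entry(iter.curr, struct inode, i_sb_list);
		/* ... use inode; iter.lock is held here ... */
	}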

>> diff --git a/lib/dlock-list.c b/lib/dlock-list.c
>> new file mode 100644
>> index 0000000..af4a9f3
>> --- /dev/null
>> +++ b/lib/dlock-list.c
> ...
>> +int alloc_dlock_list_head(struct dlock_list_head *dlist)
>> +{
>> +	struct dlock_list_head dlist_tmp;
>> +	int cpu;
>> +
>> +	dlist_tmp.head = alloc_percpu(struct dlock_list_head_percpu);
>> +	if (!dlist_tmp.head)
>> +		return -ENOMEM;
>> +
>> +	for_each_possible_cpu(cpu) {
>> +		struct dlock_list_head_percpu *head;
>> +
>> +		head = per_cpu_ptr(dlist_tmp.head, cpu);
>> +		INIT_LIST_HEAD(&head->list);
>> +		head->lock = __SPIN_LOCK_UNLOCKED(&head->lock);
>> +		lockdep_set_class(&head->lock, &dlock_list_key);
>> +	}
>> +
>> +	dlist->head = dlist_tmp.head;
> Just use dlist->head directly or use local __perpcu head pointer?

I just don't want to expose the structure to the world until it is fully
initialized. If you think I am being over-cautious, I can use dlist->head
as suggested.

>> +	return 0;
>> +}
>> +EXPORT_SYMBOL(alloc_dlock_list_head);
> Does this actually need to be exported?  If so, it might be a better
> idea to start with EXPORT_SYMBOL_GPL().

For the current use case, we probably don't need to export the symbols.
Other use cases may require that. I will change it to use the _GPL
version instead.

>> +void dlock_list_add(struct dlock_list_node *node,
>> +		    struct dlock_list_head *dlist)
>> +{
>> +	struct dlock_list_head_percpu *head;
>                                        ^
>
> This probably requires __percpu annotation.  Have you run it through
> sparse and checked for address space warnings?

You are right. I probably missed the __percpu annotation. I haven't run it
through sparse yet. I will do that and fix any warnings found.

>> +
>> +	/*
>> +	 * Disable preemption to make sure that the CPU won't get changed.
>> +	 */
>> +	head = get_cpu_ptr(dlist->head);
>> +	spin_lock(&head->lock);
>> +	node->lockptr = &head->lock;
>> +	list_add(&node->list, &head->list);
>> +	spin_unlock(&head->lock);
>> +	put_cpu_ptr(dlist->head);
>> +}
>> +EXPORT_SYMBOL(dlock_list_add);
>> +
>> +/**
>> + * dlock_list_del - Delete a node from a dlock list
>> + * @node : The node to be deleted
>> + *
>> + * We need to check the lock pointer again after taking the lock to guard
>> + * against concurrent deletion of the same node. If the lock pointer changes
>> + * (becomes NULL or to a different one), we assume that the deletion was done
>> + * elsewhere. A warning will be printed if this happens as it is likely to be
>> + * a bug.
>> + */
>> +void dlock_list_del(struct dlock_list_node *node)
>> +{
>> +	spinlock_t *lock = READ_ONCE(node->lockptr);
>> +
>> +	if (unlikely(!lock)) {
>> +		WARN_ONCE(1,
>> +			"dlock_list_del: node 0x%lx has no associated lock\n",
>> +			(unsigned long)node);
> Maybe "if (WARN_ONCE(!lock...)"?  WARN_ONCE implies unlikely.

OK, will do that.

>> +		return;
>> +	}
>> +
>> +	spin_lock(lock);
>> +	if (likely(lock == node->lockptr)) {
>> +		list_del_init(&node->list);
>> +		node->lockptr = NULL;
>> +	} else {
>> +		/*
>> +		 * This path should never be executed.
>> +		 */
>> +		WARN_ON_ONCE(1);
>> +	}
> This still kinda bothers me because this pretty much requires the
> users to have strong synchronization around the operations and makes
> it unusable in situations where opportunistic behaviors are
> acceptable.  It negates the usefulness quite a bit.

I understand your concern. I will make it retry with the new lock.

>> +	spin_unlock(lock);
>> +}
>> +EXPORT_SYMBOL(dlock_list_del);
>> +
>> +/*
>> + * Helper function to find the first entry of the next per-cpu list
>> + * It works somewhat like for_each_possible_cpu(cpu).
>> + *
>> + * Return: true if the entry is found, false if all the lists exhausted
>> + *
>> + */
>> +static inline bool dlock_list_next_cpu(struct dlock_list_head *dlist,
>                ^
> 	Just let the compiler figure it out.

Sure. Will remove the inline tag.

>> +				       struct dlock_list_iter *iter)
>> +{
>> +	if (iter->lock)
>> +		spin_unlock(iter->lock);
>> +next_cpu:
>> +	/*
>> +	 * for_each_possible_cpu(cpu)
>> +	 */
>> +	iter->cpu = cpumask_next(iter->cpu, cpu_possible_mask);
>> +	if (iter->cpu >= nr_cpu_ids)
>> +		return false;	/* All the per-cpu lists iterated */
>> +
>> +	iter->head = &per_cpu_ptr(dlist->head, iter->cpu)->list;
>> +	if (list_empty(iter->head))
>> +		goto next_cpu;
>> +
>> +	iter->lock = &per_cpu_ptr(dlist->head, iter->cpu)->lock;
>> +	spin_lock(iter->lock);
>> +	/*
>> +	 * There is a slight chance that the list may become empty just
>> +	 * before the lock is acquired. So an additional check is
>> +	 * needed to make sure that iter->curr points to a valid entry.
>> +	 */
>> +	if (list_empty(iter->head)) {
>> +		spin_unlock(iter->lock);
>> +		goto next_cpu;
>> +	}
>> +	iter->curr = list_entry(iter->head->next,
>> +				 struct dlock_list_node, list);
>> +	return true;
>> +}
> ...
>> +/**
>> + * dlock_list_next_safe - Removal-safe iterator of dlock list
>> + * @dlist: Pointer to the dlock_list_head structure
>> + * @iter : Pointer to the dlock list iterator structure
>> + * Return: true if the next entry is found, false if all the entries iterated
>> + *
>> + * The iterator has to be properly initialized before calling this function.
>> + * This iteration function is safe with respect to list entry removal.
>> + * However, it cannot correctly iterate newly added entries right after the
>> + * current one.
>> + */
> This still looks wrong to me.  If you want to provide the two variants
> of iterations, can't you just implement one next function and build
> the two types of iterations on top of it?

I have been thinking about making dlock_list_next_cpu() the real
external function and having two inline functions that implement
dlock_list_next() and dlock_list_next_safe(). That may strike a better
balance between performance and code abstraction. I will do so if you
have no objection to that.
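
A rough sketch of that split (illustrative only; the _safe variant
would additionally save iter->next before returning to the caller):

	/* external: locks and enters the next non-empty per-cpu list */
	extern bool dlock_list_next_cpu(struct dlock_list_head *dlist,
					struct dlock_list_iter *iter);

	static inline bool dlock_list_next(struct dlock_list_head *dlist,
					   struct dlock_list_iter *iter)
	{
		struct dlock_list_node *next;

		if (iter->curr) {
			next = list_next_entry(iter->curr, list);
			if (&next->list != iter->head) {
				iter->curr = next;
				return true;
			}
		}
		/* first call, or the current per-cpu list is exhausted */
		return dlock_list_next_cpu(dlist, iter);
	}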

Cheers,
Longman

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 1/4] lib/dlock-list: Distributed and lock-protected lists
  2016-07-19  5:00   ` Al Viro
@ 2016-07-19 19:01     ` Waiman Long
  0 siblings, 0 replies; 20+ messages in thread
From: Waiman Long @ 2016-07-19 19:01 UTC (permalink / raw)
  To: Al Viro
  Cc: Jan Kara, Jeff Layton, J. Bruce Fields, Tejun Heo,
	Christoph Lameter, linux-fsdevel, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Andi Kleen, Dave Chinner, Boqun Feng,
	Scott J Norton, Douglas Hatch

On 07/19/2016 01:00 AM, Al Viro wrote:
> On Fri, Jul 15, 2016 at 01:39:40PM -0400, Waiman Long wrote:
>
>> +struct dlock_list_head_percpu {
>> +	struct list_head list;
>> +	spinlock_t lock;
>> +};
>> +#define DLOCK_LIST_HEAD_PERCPU_INIT(name)			\
>> +	{							\
>> +		.list.prev = &name.list,			\
>> +		.list.next = &name.list,			\
>> +		.list.lock = __SPIN_LOCK_UNLOCKED(name),	\
> What's .list.lock and how does that even compile?

Yes, it is a typo. This macro is not used. That is why there is no 
compilation error. I will remove it from the patch.

>> +extern bool dlock_list_next(struct dlock_list_head *dlist,
>> +			    struct dlock_list_iter *iter);
> Ugh...  Why not dlist_for_each_entry(), seeing that all users end up with
> the same boilerplate?

Right, I could make a dlock_list_for_each_entry() that encapsulates the
boilerplate. I will work on that.

Thanks,
Longman

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 4/4] vfs: Use dlock list for superblock's inode list
  2016-07-19  5:23   ` Al Viro
  2016-07-19 18:35     ` Tejun Heo
@ 2016-07-19 19:07     ` Waiman Long
  1 sibling, 0 replies; 20+ messages in thread
From: Waiman Long @ 2016-07-19 19:07 UTC (permalink / raw)
  To: Al Viro
  Cc: Jan Kara, Jeff Layton, J. Bruce Fields, Tejun Heo,
	Christoph Lameter, linux-fsdevel, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Andi Kleen, Dave Chinner, Boqun Feng,
	Scott J Norton, Douglas Hatch

On 07/19/2016 01:23 AM, Al Viro wrote:
> On Fri, Jul 15, 2016 at 01:39:43PM -0400, Waiman Long wrote:
>>   void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
>>   {
>>   	struct inode *inode, *old_inode = NULL;
>> +	DEFINE_DLOCK_LIST_ITER(iter);
>>
>> -	spin_lock(&blockdev_superblock->s_inode_list_lock);
>> -	list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
>> -		struct address_space *mapping = inode->i_mapping;
>> +	while (dlock_list_next(&blockdev_superblock->s_inodes, &iter)) {
>> +		struct address_space *mapping;
>>
>> +		inode   = list_entry(iter.curr, struct inode, i_sb_list);
>> +		mapping = inode->i_mapping;
> TBH, I would very much prefer something like
> 	DEFINE_DLOCK_LIST_ITER(iter, &blockdev_superblock->s_inodes);
>
> 	dlist_for_each_entry(inode, &iter, i_sb_list) {
> 		mapping = inode->i_mapping;
>

Sure. I will make the necessary changes to do it that way. Thanks for
the suggestion.

>> -		spin_unlock(&blockdev_superblock->s_inode_list_lock);
>> +		spin_unlock(iter.lock);
> ... and this might be worth dlist_{un,re}lock(&iter);

Good point. Will change the code accordingly.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 1/4] lib/dlock-list: Distributed and lock-protected lists
  2016-07-19 18:42     ` Waiman Long
@ 2016-07-19 19:23       ` Tejun Heo
  2016-07-20 19:53         ` Waiman Long
  2016-07-20 22:02         ` Waiman Long
  2016-07-20 22:15       ` Waiman Long
  2016-07-22 20:43       ` Waiman Long
  2 siblings, 2 replies; 20+ messages in thread
From: Tejun Heo @ 2016-07-19 19:23 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Christoph Lameter, linux-fsdevel, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Andi Kleen, Dave Chinner, Boqun Feng,
	Scott J Norton, Douglas Hatch

Hello,

On Tue, Jul 19, 2016 at 02:42:31PM -0400, Waiman Long wrote:
> On 07/18/2016 07:38 PM, Tejun Heo wrote:
> > > +struct dlock_list_node {
> > > +	struct list_head list;
> > > +	spinlock_t *lockptr;
> > > +};
> > Wouldn't it be better to point to dlock_list_percpu?
> 
> I could. However, the only thing that matter is the spinlock that protects
> the list entry.

Yeah, we can get back to this when it's actually necessary.  It just
looked a bit weird to me.

> > > +/*
> > > + * The dlock list iteration functions which return true if iteration has
> > > + * to be continued.
> > > + */
> > > +extern bool dlock_list_next(struct dlock_list_head *dlist,
> > > +			    struct dlock_list_iter *iter);
> > > +extern bool dlock_list_next_safe(struct dlock_list_head *dlist,
> > > +				 struct dlock_list_iter *iter);
> > Why not return dlock_list_node * for the current node?  That'd more
> > conventional and allows dlock_list_iter to be opaque.
> 
> Yes, I can make it return dlock_list_node *.
> 
> However, to make dlock_list_iter opaque, I will have to dynamically allocate
> the structure. That will add an extra memory allocation and free calls as
> well as handling the error case of running out of memory. I don't think that
> is worth doing at this point.

Sure, keep it defined in the header file.  Just don't require users to
reach into it and add a comment saying that the struct is opaque to
its users.

> > > +int alloc_dlock_list_head(struct dlock_list_head *dlist)
> > > +{
> > > +	struct dlock_list_head dlist_tmp;
> > > +	int cpu;
> > > +
> > > +	dlist_tmp.head = alloc_percpu(struct dlock_list_head_percpu);
> > > +	if (!dlist_tmp.head)
> > > +		return -ENOMEM;
> > > +
> > > +	for_each_possible_cpu(cpu) {
> > > +		struct dlock_list_head_percpu *head;
> > > +
> > > +		head = per_cpu_ptr(dlist_tmp.head, cpu);
> > > +		INIT_LIST_HEAD(&head->list);
> > > +		head->lock = __SPIN_LOCK_UNLOCKED(&head->lock);
> > > +		lockdep_set_class(&head->lock,&dlock_list_key);
> > > +	}
> > > +
> > > +	dlist->head = dlist_tmp.head;
> > Just use dlist->head directly or use local __perpcu head pointer?
> 
> I just don't want to expose the structure to world until it is fully
> initialized. If you think I am over-cautious, I can use dlist->head as
> suggested.

I don't think it makes any actual difference.  No strong opinion
either way.  Just use local __percpu head pointer then?

> > > +	return 0;
> > > +}
> > > +EXPORT_SYMBOL(alloc_dlock_list_head);
> > Does this actually need to be exported?  If so, it might be a better
> > idea to start with EXPORT_SYMBOL_GPL().
> 
> For the current use case, we probably don't need to export the symbols.
> Other use cases may require that. I will change it to use the version
> instead.

If it's not immediately necessary, it's best to not export at all.

> > > +void dlock_list_del(struct dlock_list_node *node)
> > > +{
> > > +	spinlock_t *lock = READ_ONCE(node->lockptr);
> > > +
> > > +	if (unlikely(!lock)) {
> > > +		WARN_ONCE(1,
> > > +			"dlock_list_del: node 0x%lx has no associated lock\n",
> > > +			(unsigned long)node);
> > Maybe "if (WARN_ONCE(!lock...)"?  WARN_ONCE implies unlikely.
> 
> OK, will do that.
> 
> > > +		return;
> > > +	}
> > > +
> > > +	spin_lock(lock);
> > > +	if (likely(lock == node->lockptr)) {
> > > +		list_del_init(&node->list);
> > > +		node->lockptr = NULL;
> > > +	} else {
> > > +		/*
> > > +		 * This path should never be executed.
> > > +		 */
> > > +		WARN_ON_ONCE(1);
> > > +	}
> > This still kinda bothers me because this pretty much requires the
> > users to have strong synchronization around the operations and makes
> > it unusable in situations where opportunistic behaviors are
> > acceptable.  It negates the usefulness quite a bit.
> 
> I understand your concern. I will make it retry again with the new lock.

It doesn't necessarily have to retry, but it shouldn't break down when
used in an opportunistic, racy way - e.g. if adds and removes race, the
order of operations isn't clearly defined, so any outcome is fine as
long as the list maintains its integrity.

> > > +/**
> > > + * dlock_list_next_safe - Removal-safe iterator of dlock list
> > > + * @dlist: Pointer to the dlock_list_head structure
> > > + * @iter : Pointer to the dlock list iterator structure
> > > + * Return: true if the next entry is found, false if all the entries iterated
> > > + *
> > > + * The iterator has to be properly initialized before calling this function.
> > > + * This iteration function is safe with respect to list entry removal.
> > > + * However, it cannot correctly iterate newly added entries right after the
> > > + * current one.
> > > + */
> > This still looks wrong to me.  If you want to provide the two variants
> > of iterations, can't you just implement one next function and build
> > the two types of iterations on top of it?
> 
> I have been thinking about making dlock_list_next_cpu()  the real external
> function and have 2 inline functions that implement dlock_list_next() and
> dlock_list_next_safe(). That may strike a better balance between performance
> and code abstraction. I will do so if you have no objection to that.

Yeah, please give it a try.  As mentioned in another reply, it'd
probably be best to provide an iteration macro which encapsulates the
whole thing.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 1/4] lib/dlock-list: Distributed and lock-protected lists
  2016-07-19 19:23       ` Tejun Heo
@ 2016-07-20 19:53         ` Waiman Long
  2016-07-20 22:02         ` Waiman Long
  1 sibling, 0 replies; 20+ messages in thread
From: Waiman Long @ 2016-07-20 19:53 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Christoph Lameter, linux-fsdevel, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Andi Kleen, Dave Chinner, Boqun Feng,
	Scott J Norton, Douglas Hatch

On 07/19/2016 03:23 PM, Tejun Heo wrote:
> Hello,
>
> On Tue, Jul 19, 2016 at 02:42:31PM -0400, Waiman Long wrote:
>> On 07/18/2016 07:38 PM, Tejun Heo wrote:
>>>> +struct dlock_list_node {
>>>> +	struct list_head list;
>>>> +	spinlock_t *lockptr;
>>>> +};
>>> Wouldn't it be better to point to dlock_list_percpu?
>> I could. However, the only thing that matter is the spinlock that protects
>> the list entry.
> Yeah, we can get back to this when it's actually necessary.  It just
> looked a bit weird to me.
>
>>>> +/*
>>>> + * The dlock list iteration functions which return true if iteration has
>>>> + * to be continued.
>>>> + */
>>>> +extern bool dlock_list_next(struct dlock_list_head *dlist,
>>>> +			    struct dlock_list_iter *iter);
>>>> +extern bool dlock_list_next_safe(struct dlock_list_head *dlist,
>>>> +				 struct dlock_list_iter *iter);
>>> Why not return dlock_list_node * for the current node?  That'd more
>>> conventional and allows dlock_list_iter to be opaque.
>> Yes, I can make it return dlock_list_node *.
>>
>> However, to make dlock_list_iter opaque, I will have to dynamically allocate
>> the structure. That will add an extra memory allocation and free calls as
>> well as handling the error case of running out of memory. I don't think that
>> is worth doing at this point.
> Sure, keep it defined in the header file.  Just don't require users to
> reach into it and add a comment saying that the struct is opaque to
> its users.

Will do so.

>>>> +int alloc_dlock_list_head(struct dlock_list_head *dlist)
>>>> +{
>>>> +	struct dlock_list_head dlist_tmp;
>>>> +	int cpu;
>>>> +
>>>> +	dlist_tmp.head = alloc_percpu(struct dlock_list_head_percpu);
>>>> +	if (!dlist_tmp.head)
>>>> +		return -ENOMEM;
>>>> +
>>>> +	for_each_possible_cpu(cpu) {
>>>> +		struct dlock_list_head_percpu *head;
>>>> +
>>>> +		head = per_cpu_ptr(dlist_tmp.head, cpu);
>>>> +		INIT_LIST_HEAD(&head->list);
>>>> +		head->lock = __SPIN_LOCK_UNLOCKED(&head->lock);
>>>> +		lockdep_set_class(&head->lock,&dlock_list_key);
>>>> +	}
>>>> +
>>>> +	dlist->head = dlist_tmp.head;
>>> Just use dlist->head directly or use local __perpcu head pointer?
>> I just don't want to expose the structure to world until it is fully
>> initialized. If you think I am over-cautious, I can use dlist->head as
>> suggested.
> I don't think it makes any actual difference.  No strong opinion
> either way.  Just use local __percpu head pointer then?
>
>>>> +	return 0;
>>>> +}
>>>> +EXPORT_SYMBOL(alloc_dlock_list_head);
>>> Does this actually need to be exported?  If so, it might be a better
>>> idea to start with EXPORT_SYMBOL_GPL().
>> For the current use case, we probably don't need to export the symbols.
>> Other use cases may require that. I will change it to use the version
>> instead.
> If it's not immediately necessary, it's best to not export at all.

I will remove the EXPORT statements.

>>>> +void dlock_list_del(struct dlock_list_node *node)
>>>> +{
>>>> +	spinlock_t *lock = READ_ONCE(node->lockptr);
>>>> +
>>>> +	if (unlikely(!lock)) {
>>>> +		WARN_ONCE(1,
>>>> +			"dlock_list_del: node 0x%lx has no associated lock\n",
>>>> +			(unsigned long)node);
>>> Maybe "if (WARN_ONCE(!lock...)"?  WARN_ONCE implies unlikely.
>> OK, will do that.
>>
>>>> +		return;
>>>> +	}
>>>> +
>>>> +	spin_lock(lock);
>>>> +	if (likely(lock == node->lockptr)) {
>>>> +		list_del_init(&node->list);
>>>> +		node->lockptr = NULL;
>>>> +	} else {
>>>> +		/*
>>>> +		 * This path should never be executed.
>>>> +		 */
>>>> +		WARN_ON_ONCE(1);
>>>> +	}
>>> This still kinda bothers me because this pretty much requires the
>>> users to have strong synchronization around the operations and makes
>>> it unusable in situations where opportunistic behaviors are
>>> acceptable.  It negates the usefulness quite a bit.
>> I understand your concern. I will make it retry again with the new lock.
> It doesn't necessarily have to retry but shouldn't break down when
> used in an opportunistic racy way - e.g. if adds and removes race, the
> order of operations isn't clearly defined as such any outcome is fine
> as long as the list maintains its integrity.

I am going to retry it if the lock changes to a different one and
ignore it if it becomes NULL.
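
For illustration only, the delete path could then look roughly like
this (a sketch of the intended behavior, not the final code):

	void dlock_list_del(struct dlock_list_node *node)
	{
		spinlock_t *lock;
		bool retry;

		do {
			lock = READ_ONCE(node->lockptr);
			if (!lock)
				return;	/* deleted elsewhere, ignore */

			spin_lock(lock);
			if (likely(lock == node->lockptr)) {
				list_del_init(&node->list);
				node->lockptr = NULL;
				retry = false;
			} else {
				/* lock changed under us, retry with new one */
				retry = true;
			}
			spin_unlock(lock);
		} while (retry);
	}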

>>>> +/**
>>>> + * dlock_list_next_safe - Removal-safe iterator of dlock list
>>>> + * @dlist: Pointer to the dlock_list_head structure
>>>> + * @iter : Pointer to the dlock list iterator structure
>>>> + * Return: true if the next entry is found, false if all the entries iterated
>>>> + *
>>>> + * The iterator has to be properly initialized before calling this function.
>>>> + * This iteration function is safe with respect to list entry removal.
>>>> + * However, it cannot correctly iterate newly added entries right after the
>>>> + * current one.
>>>> + */
>>> This still looks wrong to me.  If you want to provide the two variants
>>> of iterations, can't you just implement one next function and build
>>> the two types of iterations on top of it?
>> I have been thinking about making dlock_list_next_cpu()  the real external
>> function and have 2 inline functions that implement dlock_list_next() and
>> dlock_list_next_safe(). That may strike a better balance between performance
>> and code abstraction. I will do so if you have no objection to that.
> Yeah, please give it a try.  As mentioned in another reply, it'd
> probably be best to provide an iteration macro which encapsulates the
> whole thing.
>
> Thanks.
>

Yes, I am going to add the dlist_for_each_entry macro as suggested by Al.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 1/4] lib/dlock-list: Distributed and lock-protected lists
  2016-07-19 19:23       ` Tejun Heo
  2016-07-20 19:53         ` Waiman Long
@ 2016-07-20 22:02         ` Waiman Long
  1 sibling, 0 replies; 20+ messages in thread
From: Waiman Long @ 2016-07-20 22:02 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Christoph Lameter, linux-fsdevel, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Andi Kleen, Dave Chinner, Boqun Feng,
	Scott J Norton, Douglas Hatch

On 07/19/2016 03:23 PM, Tejun Heo wrote:
> Hello,
>
> On Tue, Jul 19, 2016 at 02:42:31PM -0400, Waiman Long wrote:
>>>> +int alloc_dlock_list_head(struct dlock_list_head *dlist)
>>>> +{
>>>> +	struct dlock_list_head dlist_tmp;
>>>> +	int cpu;
>>>> +
>>>> +	dlist_tmp.head = alloc_percpu(struct dlock_list_head_percpu);
>>>> +	if (!dlist_tmp.head)
>>>> +		return -ENOMEM;
>>>> +
>>>> +	for_each_possible_cpu(cpu) {
>>>> +		struct dlock_list_head_percpu *head;
>>>> +
>>>> +		head = per_cpu_ptr(dlist_tmp.head, cpu);
>>>> +		INIT_LIST_HEAD(&head->list);
>>>> +		head->lock = __SPIN_LOCK_UNLOCKED(&head->lock);
>>>> +		lockdep_set_class(&head->lock,&dlock_list_key);
>>>> +	}
>>>> +
>>>> +	dlist->head = dlist_tmp.head;
>>> Just use dlist->head directly or use local __perpcu head pointer?
>> I just don't want to expose the structure to world until it is fully
>> initialized. If you think I am over-cautious, I can use dlist->head as
>> suggested.
> I don't think it makes any actual difference.  No strong opinion
> either way.  Just use local __percpu head pointer then?

I have run sparse on dlock_list.c. There is no need to use the __percpu
tag here. The head gets assigned the result of per_cpu_ptr(), which has
no __percpu annotation. I actually got a sparse warning when I used the
__percpu tag.
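
In other words (declarations sketched from the patch; the __percpu
placement on the struct member is my reading, not verbatim code):

	struct dlock_list_head {
		struct dlock_list_head_percpu __percpu *head;
	};

	/*
	 * per_cpu_ptr() turns the __percpu pointer into a plain kernel
	 * pointer, so the local variable must not carry the tag.
	 */
	struct dlock_list_head_percpu *head = per_cpu_ptr(dlist->head, cpu);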

Cheers,
Longman



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 1/4] lib/dlock-list: Distributed and lock-protected lists
  2016-07-19 18:42     ` Waiman Long
  2016-07-19 19:23       ` Tejun Heo
@ 2016-07-20 22:15       ` Waiman Long
  2016-07-21  0:48         ` Christoph Lameter
  2016-07-22 20:43       ` Waiman Long
  2 siblings, 1 reply; 20+ messages in thread
From: Waiman Long @ 2016-07-20 22:15 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tejun Heo, Alexander Viro, Jan Kara, Jeff Layton,
	J. Bruce Fields, linux-fsdevel, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Andi Kleen, Dave Chinner, Boqun Feng,
	Scott J Norton, Douglas Hatch

On 07/19/2016 02:42 PM, Waiman Long wrote:
> On 07/18/2016 07:38 PM, Tejun Heo wrote:
>
>>> +/*
>>> + * include/linux/dlock-list.h
>>> + *
>>> + * A distributed (per-cpu) set of lists each of which is protected 
>>> by its
>>> + * own spinlock, but acts like a single consolidated list to the 
>>> callers.
>>> + *
>>> + * The dlock_list_head_percpu structure contains the spinlock, the 
>>> other
>>> + * dlock_list_node structures only contains a pointer to the 
>>> spinlock in
>>> + * dlock_list_head_percpu.
>>> + */
>> The more I think about it, the more bothered I'm about the dlock_list
>> name.  For the most part, this isn't different from other percpu data
>> structures in the kernel.  Sure, it might benefit from doing Nth cpu,
>> but so are other percpu data structures and it's not just "distributed
>> lock" list either.  The list itself is percpu, not just locking.  Can
>> we please go back to percpu_list?  Christoph, what do you think?
>>
>
> As I said before, I don't mind reverting the name back to percpu_list. 
> I am just waiting for a final agreement.
>

Christoph, are you OK with Tejun's request to revert the name back to 
percpu_list? Or do you still think the current name is better?

I am almost done with my next version of the patch. This is the only 
thing that is still outstanding.

Thanks,
Longman

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 1/4] lib/dlock-list: Distributed and lock-protected lists
  2016-07-20 22:15       ` Waiman Long
@ 2016-07-21  0:48         ` Christoph Lameter
  2016-07-21  1:36           ` Waiman Long
  2016-07-21  1:49           ` Dave Chinner
  0 siblings, 2 replies; 20+ messages in thread
From: Christoph Lameter @ 2016-07-21  0:48 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Alexander Viro, Jan Kara, Jeff Layton,
	J. Bruce Fields, linux-fsdevel, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Andi Kleen, Dave Chinner, Boqun Feng,
	Scott J Norton, Douglas Hatch

On Wed, 20 Jul 2016, Waiman Long wrote:

> Christoph, are you OK with Tejun's request to revert the name back to
> percpu_list? Or do you still think the current name is better?

The percpu structure contains a spinlock and may be remotely accessed? You
are aware that other percpu variables that share the same cacheline will
be negatively impacted by accesses from other processors?

The role of percpu areas is to provide memory areas where the code can
expect that the cachelines are exclusively there for that processor.

How frequent are the remote accesses? If this is rare then ok.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 1/4] lib/dlock-list: Distributed and lock-protected lists
  2016-07-21  0:48         ` Christoph Lameter
@ 2016-07-21  1:36           ` Waiman Long
  2016-07-21  1:49           ` Dave Chinner
  1 sibling, 0 replies; 20+ messages in thread
From: Waiman Long @ 2016-07-21  1:36 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tejun Heo, Alexander Viro, Jan Kara, Jeff Layton,
	J. Bruce Fields, linux-fsdevel, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Andi Kleen, Dave Chinner, Boqun Feng,
	Scott J Norton, Douglas Hatch

On 07/20/2016 08:48 PM, Christoph Lameter wrote:
> On Wed, 20 Jul 2016, Waiman Long wrote:
>
>> Christoph, are you OK with Tejun's request to revert the name back to
>> percpu_list? Or do you still think the current name is better?
> The percpu structure contains a spinlock and may be remotely accessed? You
> are aware that other percpu variables that share the same cacheline will
> be negatively impacted by accesses from other processors?

The spinlock can be remotely accessed during deletion as well as during
iteration of all the percpu lists. Iterating over all the percpu data is
not a new thing; percpu_counter does the same when percpu_counter_sum()
is called.

Yes, remote access can have a negative performance impact on accesses
to other percpu data that happen to reside in the same cacheline.

> The role of percpu areas are to have memory areas where the code can
> expect that cachelines are exclusively there for that processor.
>
> How frequent are the remote accesses? If this is rare then ok.

I can't say how often that will happen for the dlock list. If the thread
that creates the inodes does not migrate to another CPU, the deletion of
the inodes should also happen on the same CPU.

One way to reduce the performance impact is to make the percpu head
structure cacheline aligned at the expense of some wasted space. I could
add an additional parameter to alloc_dlock_list_head() to force
cacheline alignment if the caller wishes to do so. What do you think
about that?
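
A quick sketch of that option (the extra parameter is hypothetical and
only meant to show the idea):

	int alloc_dlock_list_head(struct dlock_list_head *dlist, bool cl_align)
	{
		size_t align = cl_align ? cache_line_size()
				: __alignof__(struct dlock_list_head_percpu);
		size_t size  = ALIGN(sizeof(struct dlock_list_head_percpu),
				     align);

		dlist->head = __alloc_percpu(size, align);
		if (!dlist->head)
			return -ENOMEM;

		/* per-cpu INIT_LIST_HEAD()/spin_lock_init() loop as before */
		return 0;
	}

Alternatively, the per-cpu head structure could simply be tagged
____cacheline_aligned_in_smp unconditionally.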

Cheers,
Longman

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 1/4] lib/dlock-list: Distributed and lock-protected lists
  2016-07-21  0:48         ` Christoph Lameter
  2016-07-21  1:36           ` Waiman Long
@ 2016-07-21  1:49           ` Dave Chinner
  1 sibling, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2016-07-21  1:49 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Waiman Long, Tejun Heo, Alexander Viro, Jan Kara, Jeff Layton,
	J. Bruce Fields, linux-fsdevel, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Andi Kleen, Dave Chinner, Boqun Feng,
	Scott J Norton, Douglas Hatch

On Wed, Jul 20, 2016 at 07:48:13PM -0500, Christoph Lameter wrote:
> On Wed, 20 Jul 2016, Waiman Long wrote:
> 
> > Christoph, are you OK with Tejun's request to revert the name back to
> > percpu_list? Or do you still think the current name is better?
> 
> The percpu structure contains a spinlock and may be remotely accessed? You
> are aware that other percpu variables that share the same cacheline will
> be negatively impacted by accesses from other processors?
> 
> The role of percpu areas are to have memory areas where the code can
> expect that cachelines are exclusively there for that processor.
> 
> How frequent are the remote accesses? If this is rare then ok.

Remote access will be the common case for traversal of and removal from
the superblock inode list.

Under memory reclaim, the access should at least be from
a CPU on the same node (as inode reclaim is NUMA aware). However,
any other inode eviction event (e.g. inode unlink) removing it from
the sb list will essentially be from a random CPU.

Traversals (such as from the sync code, or cache invalidations) will
run on a single CPU, so almost all accesses from them will be
remote.

So it's really only a per-cpu structure for list addition....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 1/4] lib/dlock-list: Distributed and lock-protected lists
  2016-07-19 18:42     ` Waiman Long
  2016-07-19 19:23       ` Tejun Heo
  2016-07-20 22:15       ` Waiman Long
@ 2016-07-22 20:43       ` Waiman Long
  2 siblings, 0 replies; 20+ messages in thread
From: Waiman Long @ 2016-07-22 20:43 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Christoph Lameter, linux-fsdevel, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Andi Kleen, Dave Chinner, Boqun Feng,
	Scott J Norton, Douglas Hatch

On 07/19/2016 02:42 PM, Waiman Long wrote:
> On 07/18/2016 07:38 PM, Tejun Heo wrote:
>
>
>>> +/*
>>> + * include/linux/dlock-list.h
>>> + *
>>> + * A distributed (per-cpu) set of lists each of which is protected 
>>> by its
>>> + * own spinlock, but acts like a single consolidated list to the 
>>> callers.
>>> + *
>>> + * The dlock_list_head_percpu structure contains the spinlock, the 
>>> other
>>> + * dlock_list_node structures only contains a pointer to the 
>>> spinlock in
>>> + * dlock_list_head_percpu.
>>> + */
>> The more I think about it, the more bothered I'm about the dlock_list
>> name.  For the most part, this isn't different from other percpu data
>> structures in the kernel.  Sure, it might benefit from doing Nth cpu,
>> but so are other percpu data structures and it's not just "distributed
>> lock" list either.  The list itself is percpu, not just locking.  Can
>> we please go back to percpu_list?  Christoph, what do you think?
>>
>
> As I said before, I don't mind reverting the name back to percpu_list. 
> I am just waiting for a final agreement.
>

I have just sent out an updated dlock-list patch that incorporates all
the feedback that I have gotten so far except for the name change. I will
be on vacation next week. After I come back, we can continue our
discussion on whether the name should be reverted to percpu_list or not.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2016-07-22 20:43 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-15 17:39 [PATCH v3 0/4] vfs: Use dlock list for SB's s_inodes list Waiman Long
2016-07-15 17:39 ` [PATCH v3 1/4] lib/dlock-list: Distributed and lock-protected lists Waiman Long
2016-07-18 23:38   ` Tejun Heo
2016-07-19 18:42     ` Waiman Long
2016-07-19 19:23       ` Tejun Heo
2016-07-20 19:53         ` Waiman Long
2016-07-20 22:02         ` Waiman Long
2016-07-20 22:15       ` Waiman Long
2016-07-21  0:48         ` Christoph Lameter
2016-07-21  1:36           ` Waiman Long
2016-07-21  1:49           ` Dave Chinner
2016-07-22 20:43       ` Waiman Long
2016-07-19  5:00   ` Al Viro
2016-07-19 19:01     ` Waiman Long
2016-07-15 17:39 ` [PATCH v3 2/4] fsnotify: Simplify inode iteration on umount Waiman Long
2016-07-15 17:39 ` [PATCH v3 3/4] vfs: Remove unnecessary list_for_each_entry_safe() variants Waiman Long
2016-07-15 17:39 ` [PATCH v3 4/4] vfs: Use dlock list for superblock's inode list Waiman Long
2016-07-19  5:23   ` Al Viro
2016-07-19 18:35     ` Tejun Heo
2016-07-19 19:07     ` Waiman Long
