* [PATCH v2 0/7] vfs: Use dlock list for SB's s_inodes list
@ 2016-07-11 17:32 Waiman Long
  2016-07-11 17:32 ` [PATCH v2 1/7] lib/dlock-list: Distributed and lock-protected lists Waiman Long
                   ` (6 more replies)
  0 siblings, 7 replies; 24+ messages in thread
From: Waiman Long @ 2016-07-11 17:32 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng, Scott J Norton,
	Douglas Hatch, Waiman Long

v1->v2:
 - Add a set of simple per-subnode APIs whose granularity is between
   percpu and per-node.
 - Make the dlock list use the per-subnode APIs so as to reduce the
   total number of separate linked lists that need to be managed
   and iterated.
 - There is no change in patches 1-5.

This is a follow up of the following patchset:

  [PATCH v7 0/4] vfs: Use per-cpu list for SB's s_inodes list
  https://lkml.org/lkml/2016/4/12/1009

The main change is the renaming of percpu list to dlock list, as
suggested by Christoph Lameter. It also adds a new patch from Boqun
Feng to add the __percpu modifier for parameters.

Patch 1 introduces the dlock list.

Patch 2 adds the __percpu modifier to the appropriate parameters.

Patch 3 cleans up the fsnotify_unmount_inodes() function by making
the code simpler and more standard.

Patch 4 replaces the use of list_for_each_entry_safe() in
evict_inodes() and invalidate_inodes() by list_for_each_entry().

Patch 5 modifies the superblock and inode structures to use the dlock
list. The corresponding functions that reference those structures
are modified.

Patch 6 adds a new set of per-subnode APIs that allow distributed
resources to be managed at a granularity in between that of per-cpu
and per-node allocations.

Patch 7 modifies the dlock list to use the new per-subnode APIs so
that it has fewer linked lists to manage and iterate compared with
the per-cpu version.

Boqun Feng (1):
  lib/dlock-list: Add __percpu modifier for parameters

Jan Kara (2):
  fsnotify: Simplify inode iteration on umount
  vfs: Remove unnecessary list_for_each_entry_safe() variants

Waiman Long (4):
  lib/dlock-list: Distributed and lock-protected lists
  vfs: Use dlock list for superblock's inode list
  lib/persubnode: Introducing a simple per-subnode APIs
  lib/dlock-list: Use the per-subnode APIs for managing lists

 fs/block_dev.c             |   13 ++-
 fs/drop_caches.c           |   10 +-
 fs/fs-writeback.c          |   13 ++-
 fs/inode.c                 |   40 +++-----
 fs/notify/inode_mark.c     |   53 +++--------
 fs/quota/dquot.c           |   16 ++--
 fs/super.c                 |    7 +-
 include/linux/dlock-list.h |  234 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h         |    8 +-
 include/linux/persubnode.h |   80 +++++++++++++++
 init/main.c                |    2 +
 lib/Makefile               |    4 +-
 lib/dlock-list.c           |  102 +++++++++++++++++++
 lib/persubnode.c           |  119 ++++++++++++++++++++++
 14 files changed, 605 insertions(+), 96 deletions(-)
 create mode 100644 include/linux/dlock-list.h
 create mode 100644 include/linux/persubnode.h
 create mode 100644 lib/dlock-list.c
 create mode 100644 lib/persubnode.c

* [PATCH v2 1/7] lib/dlock-list: Distributed and lock-protected lists
  2016-07-11 17:32 [PATCH v2 0/7] vfs: Use dlock list for SB's s_inodes list Waiman Long
@ 2016-07-11 17:32 ` Waiman Long
  2016-07-13 16:08   ` Tejun Heo
  2016-07-11 17:32 ` [PATCH v2 2/7] lib/dlock-list: Add __percpu modifier for parameters Waiman Long
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 24+ messages in thread
From: Waiman Long @ 2016-07-11 17:32 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng, Scott J Norton,
	Douglas Hatch, Waiman Long

Linked lists are used everywhere in the Linux kernel. However, if many
threads are trying to add or delete entries to or from the same linked
list, the list can become a performance bottleneck.

This patch introduces a new set of list APIs that provides a group of
distributed lists (one per CPU), each of which is protected by its own
spinlock.  To the callers, however, the set of lists acts like a single
consolidated list.  This allows list entry insertion and deletion
operations to happen in parallel instead of being serialized by a
single global list and lock.

List entry insertion is strictly per-CPU. List deletion, however, can
happen on a CPU other than the one that did the insertion, so we still
need a lock to protect each list. Because of that, there may still be
a small amount of contention when deletions are being done.

A new header file include/linux/dlock-list.h will be added with the
associated dlock_list_head and dlock_list_node structures. The following
functions are provided to manage the per-cpu lists:

 1. int init_dlock_list_head(struct dlock_list_head **pdlock_head)
 2. void dlock_list_add(struct dlock_list_node *node,
		        struct dlock_list_head *head)
 3. void dlock_list_del(struct dlock_list_node *node)

Iteration of all the list entries within a group of per-cpu
lists is done by calling either the dlock_list_iterate() or
dlock_list_iterate_safe() functions in a while loop. They correspond
to the list_for_each_entry() and list_for_each_entry_safe() macros
respectively. The iteration state is kept in a dlock_list_state
structure that is passed to the iteration functions.
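
To illustrate how these pieces fit together, here is a minimal usage
sketch (the foo structure, foo_list variable and foo_* functions are
made up for this example and are not part of the patch; the __percpu
annotation on the head pointer only comes in with the next patch):

struct foo {
	int data;
	struct dlock_list_node node;
};

/* Allocated once with init_dlock_list_head(&foo_list) */
static struct dlock_list_head *foo_list;

static void foo_add(struct foo *f)
{
	init_dlock_list_node(&f->node);
	dlock_list_add(&f->node, foo_list);	/* goes to this CPU's list */
}

static void foo_print_all(void)
{
	struct foo *f;
	DEFINE_DLOCK_LIST_STATE(state);

	while (dlock_list_iterate(foo_list, &state)) {
		/* state.lock (the current per-cpu list's lock) is held here */
		f = list_entry(state.curr, struct foo, node);
		pr_info("foo data = %d\n", f->data);
	}
}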

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 include/linux/dlock-list.h |  233 ++++++++++++++++++++++++++++++++++++++++++++
 lib/Makefile               |    2 +-
 lib/dlock-list.c           |  100 +++++++++++++++++++
 3 files changed, 334 insertions(+), 1 deletions(-)
 create mode 100644 include/linux/dlock-list.h
 create mode 100644 lib/dlock-list.c

diff --git a/include/linux/dlock-list.h b/include/linux/dlock-list.h
new file mode 100644
index 0000000..43355f8
--- /dev/null
+++ b/include/linux/dlock-list.h
@@ -0,0 +1,233 @@
+/*
+ * Distributed/locked list
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2016 Hewlett-Packard Enterprise Development LP
+ *
+ * Authors: Waiman Long <waiman.long@hpe.com>
+ */
+#ifndef __LINUX_DLOCK_LIST_H
+#define __LINUX_DLOCK_LIST_H
+
+#include <linux/spinlock.h>
+#include <linux/list.h>
+#include <linux/percpu.h>
+
+/*
+ * include/linux/dlock-list.h
+ *
+ * A distributed (per-cpu) set of lists each of which is protected by its
+ * own spinlock, but acts like a single consolidated list to the callers.
+ *
+ * The dlock_list_head structure contains the spinlock, the other
+ * dlock_list_node structures only contains a pointer to the spinlock in
+ * dlock_list_head.
+ */
+struct dlock_list_head {
+	struct list_head list;
+	spinlock_t lock;
+};
+
+#define DLOCK_LIST_HEAD_INIT(name)				\
+	{							\
+		.list.prev = &name.list,			\
+		.list.next = &name.list,			\
+		.lock = __SPIN_LOCK_UNLOCKED(name),		\
+	}
+
+/*
+ * Per-cpu list iteration state
+ */
+struct dlock_list_state {
+	int			 cpu;
+	spinlock_t		*lock;
+	struct list_head	*head;	/* List head of current per-cpu list */
+	struct dlock_list_node	*curr;
+	struct dlock_list_node	*next;
+};
+
+#define DLOCK_LIST_STATE_INIT()			\
+	{					\
+		.cpu  = -1,			\
+		.lock = NULL,			\
+		.head = NULL,			\
+		.curr = NULL,			\
+		.next = NULL,			\
+	}
+
+#define DEFINE_DLOCK_LIST_STATE(s)		\
+	struct dlock_list_state s = DLOCK_LIST_STATE_INIT()
+
+static inline void init_dlock_list_state(struct dlock_list_state *state)
+{
+	state->cpu  = -1;
+	state->lock = NULL;
+	state->head = NULL;
+	state->curr = NULL;
+	state->next = NULL;
+}
+
+#ifdef CONFIG_DEBUG_SPINLOCK
+#define DLOCK_LIST_WARN_ON(x)	WARN_ON(x)
+#else
+#define DLOCK_LIST_WARN_ON(x)
+#endif
+
+/*
+ * Next per-cpu list entry
+ */
+#define dlock_list_next_entry(pos, member) list_next_entry(pos, member.list)
+
+/*
+ * Per-cpu node data structure
+ */
+struct dlock_list_node {
+	struct list_head list;
+	spinlock_t *lockptr;
+};
+
+#define DLOCK_LIST_NODE_INIT(name)		\
+	{					\
+		.list.prev = &name.list,	\
+		.list.next = &name.list,	\
+		.lockptr = NULL			\
+	}
+
+static inline void init_dlock_list_node(struct dlock_list_node *node)
+{
+	INIT_LIST_HEAD(&node->list);
+	node->lockptr = NULL;
+}
+
+static inline void free_dlock_list_head(struct dlock_list_head **pdlock_head)
+{
+	free_percpu(*pdlock_head);
+	*pdlock_head = NULL;
+}
+
+/*
+ * Check if all the per-cpu lists are empty
+ */
+static inline bool dlock_list_empty(struct dlock_list_head *dlock_head)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		if (!list_empty(&per_cpu_ptr(dlock_head, cpu)->list))
+			return false;
+	return true;
+}
+
+/*
+ * Helper function to find the first entry of the next per-cpu list
+ * It works somewhat like for_each_possible_cpu(cpu).
+ *
+ * Return: true if the entry is found, false if all the lists exhausted
+ */
+static __always_inline bool
+__dlock_list_next_cpu(struct dlock_list_head *head,
+		      struct dlock_list_state *state)
+{
+	if (state->lock)
+		spin_unlock(state->lock);
+next_cpu:
+	/*
+	 * for_each_possible_cpu(cpu)
+	 */
+	state->cpu = cpumask_next(state->cpu, cpu_possible_mask);
+	if (state->cpu >= nr_cpu_ids)
+		return false;	/* All the per-cpu lists iterated */
+
+	state->head = &per_cpu_ptr(head, state->cpu)->list;
+	if (list_empty(state->head))
+		goto next_cpu;
+
+	state->lock = &per_cpu_ptr(head, state->cpu)->lock;
+	spin_lock(state->lock);
+	/*
+	 * There is a slight chance that the list may become empty just
+	 * before the lock is acquired. So an additional check is
+	 * needed to make sure that state->curr points to a valid entry.
+	 */
+	if (list_empty(state->head)) {
+		spin_unlock(state->lock);
+		goto next_cpu;
+	}
+	state->curr = list_entry(state->head->next,
+				 struct dlock_list_node, list);
+	return true;
+}
+
+/*
+ * Iterate to the next entry of the group of per-cpu lists
+ *
+ * Return: true if the next entry is found, false if all the entries iterated
+ */
+static inline bool dlock_list_iterate(struct dlock_list_head *head,
+				      struct dlock_list_state *state)
+{
+	/*
+	 * Find next entry
+	 */
+	if (state->curr)
+		state->curr = list_next_entry(state->curr, list);
+
+	if (!state->curr || (&state->curr->list == state->head)) {
+		/*
+		 * The current per-cpu list has been exhausted, try the next
+		 * per-cpu list.
+		 */
+		if (!__dlock_list_next_cpu(head, state))
+			return false;
+	}
+
+	DLOCK_LIST_WARN_ON(state->curr->lockptr != state->lock);
+	return true;	/* Continue the iteration */
+}
+
+/*
+ * Iterate to the next entry of the group of per-cpu lists and safe
+ * against removal of list_entry
+ *
+ * Return: true if the next entry is found, false if all the entries iterated
+ */
+static inline bool dlock_list_iterate_safe(struct dlock_list_head *head,
+					   struct dlock_list_state *state)
+{
+	/*
+	 * Find next entry
+	 */
+	if (state->curr) {
+		state->curr = state->next;
+		state->next = list_next_entry(state->next, list);
+	}
+
+	if (!state->curr || (&state->curr->list == state->head)) {
+		/*
+		 * The current per-cpu list has been exhausted, try the next
+		 * per-cpu list.
+		 */
+		if (!__dlock_list_next_cpu(head, state))
+			return false;
+		state->next = list_next_entry(state->curr, list);
+	}
+
+	DLOCK_LIST_WARN_ON(state->curr->lockptr != state->lock);
+	return true;	/* Continue the iteration */
+}
+
+extern void dlock_list_add(struct dlock_list_node *node,
+			  struct dlock_list_head *head);
+extern void dlock_list_del(struct dlock_list_node *node);
+extern int  init_dlock_list_head(struct dlock_list_head **pdlock_head);
+
+#endif /* __LINUX_DLOCK_LIST_H */
diff --git a/lib/Makefile b/lib/Makefile
index 499fb35..92e8c38 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -40,7 +40,7 @@ obj-y += bcd.o div64.o sort.o parser.o halfmd4.o debug_locks.o random32.o \
 	 gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \
 	 bsearch.o find_bit.o llist.o memweight.o kfifo.o \
 	 percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \
-	 once.o
+	 once.o dlock-list.o
 obj-y += string_helpers.o
 obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o
 obj-y += hexdump.o
diff --git a/lib/dlock-list.c b/lib/dlock-list.c
new file mode 100644
index 0000000..84d4623
--- /dev/null
+++ b/lib/dlock-list.c
@@ -0,0 +1,100 @@
+/*
+ * Distributed/locked list
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2016 Hewlett-Packard Enterprise Development LP
+ *
+ * Authors: Waiman Long <waiman.long@hpe.com>
+ */
+#include <linux/dlock-list.h>
+#include <linux/lockdep.h>
+
+/*
+ * The dlock list lock needs its own class to avoid warning and stack
+ * trace when lockdep is enabled.
+ */
+static struct lock_class_key dlock_list_key;
+
+/*
+ * Initialize the per-cpu list head
+ */
+int init_dlock_list_head(struct dlock_list_head **pdlock_head)
+{
+	struct dlock_list_head *dlock_head;
+	int cpu;
+
+	dlock_head = alloc_percpu(struct dlock_list_head);
+	if (!dlock_head)
+		return -ENOMEM;
+
+	for_each_possible_cpu(cpu) {
+		struct dlock_list_head *head = per_cpu_ptr(dlock_head, cpu);
+
+		INIT_LIST_HEAD(&head->list);
+		head->lock = __SPIN_LOCK_UNLOCKED(&head->lock);
+		lockdep_set_class(&head->lock, &dlock_list_key);
+	}
+
+	*pdlock_head = dlock_head;
+	return 0;
+}
+
+/*
+ * List selection is based on the CPU being used when the dlock_list_add()
+ * function is called. However, deletion may be done by a different CPU.
+ * So we still need to use a lock to protect the content of the list.
+ */
+void dlock_list_add(struct dlock_list_node *node, struct dlock_list_head *head)
+{
+	struct dlock_list_head *myhead;
+
+	/*
+	 * Disable preemption to make sure that the CPU won't get changed.
+	 */
+	myhead = get_cpu_ptr(head);
+	spin_lock(&myhead->lock);
+	node->lockptr = &myhead->lock;
+	list_add(&node->list, &myhead->list);
+	spin_unlock(&myhead->lock);
+	put_cpu_ptr(head);
+}
+
+/*
+ * Delete a node from a dlock list
+ *
+ * We need to check the lock pointer again after taking the lock to guard
+ * against concurrent delete of the same node. If the lock pointer changes
+ * (becomes NULL or to a different one), we assume that the deletion was done
+ * elsewhere.
+ */
+void dlock_list_del(struct dlock_list_node *node)
+{
+	spinlock_t *lock = READ_ONCE(node->lockptr);
+
+	if (unlikely(!lock)) {
+		WARN(1, "dlock_list_del: node 0x%lx has no associated lock\n",
+			(unsigned long)node);
+		return;
+	}
+
+	spin_lock(lock);
+	if (likely(lock == node->lockptr)) {
+		list_del_init(&node->list);
+		node->lockptr = NULL;
+	} else {
+		/*
+		 * This path should never be executed.
+		 */
+		WARN_ON(1);
+	}
+	spin_unlock(lock);
+}
-- 
1.7.1

* [PATCH v2 2/7] lib/dlock-list: Add __percpu modifier for parameters
  2016-07-11 17:32 [PATCH v2 0/7] vfs: Use dlock list for SB's s_inodes list Waiman Long
  2016-07-11 17:32 ` [PATCH v2 1/7] lib/dlock-list: Distributed and lock-protected lists Waiman Long
@ 2016-07-11 17:32 ` Waiman Long
  2016-07-13 16:08   ` Tejun Heo
  2016-07-11 17:32 ` [PATCH v2 3/7] fsnotify: Simplify inode iteration on umount Waiman Long
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 24+ messages in thread
From: Waiman Long @ 2016-07-11 17:32 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng, Scott J Norton,
	Douglas Hatch, Waiman Long

From: Boqun Feng <boqun.feng@gmail.com>

Add the __percpu modifier properly to help:

1.	Differentiate pointers to actual structures from pointers to
	percpu structures, which improves readability.

2.	Prevent sparse from complaining about "different address spaces"
	(see the sketch below).
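
As a rough illustration (not part of this patch; the helper below is
made up), an annotated pointer has to be accessed through the percpu
accessors such as per_cpu_ptr(), and sparse will now flag a direct
dereference:

/* Illustrative only: walk all the per-cpu list heads. */
static bool any_list_nonempty(struct dlock_list_head __percpu *dlock_head)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		/* per_cpu_ptr() strips the __percpu address space marker */
		struct dlock_list_head *head = per_cpu_ptr(dlock_head, cpu);

		if (!list_empty(&head->list))
			return true;
	}
	return false;
}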

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 include/linux/dlock-list.h |   18 ++++++++++--------
 lib/dlock-list.c           |    5 +++--
 2 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/include/linux/dlock-list.h b/include/linux/dlock-list.h
index 43355f8..a8e1fd2 100644
--- a/include/linux/dlock-list.h
+++ b/include/linux/dlock-list.h
@@ -108,7 +108,8 @@ static inline void init_dlock_list_node(struct dlock_list_node *node)
 	node->lockptr = NULL;
 }
 
-static inline void free_dlock_list_head(struct dlock_list_head **pdlock_head)
+static inline void
+free_dlock_list_head(struct dlock_list_head __percpu **pdlock_head)
 {
 	free_percpu(*pdlock_head);
 	*pdlock_head = NULL;
@@ -117,7 +118,7 @@ static inline void free_dlock_list_head(struct dlock_list_head **pdlock_head)
 /*
  * Check if all the per-cpu lists are empty
  */
-static inline bool dlock_list_empty(struct dlock_list_head *dlock_head)
+static inline bool dlock_list_empty(struct dlock_list_head __percpu *dlock_head)
 {
 	int cpu;
 
@@ -134,7 +135,7 @@ static inline bool dlock_list_empty(struct dlock_list_head *dlock_head)
  * Return: true if the entry is found, false if all the lists exhausted
  */
 static __always_inline bool
-__dlock_list_next_cpu(struct dlock_list_head *head,
+__dlock_list_next_cpu(struct dlock_list_head __percpu *head,
 		      struct dlock_list_state *state)
 {
 	if (state->lock)
@@ -172,7 +173,7 @@ next_cpu:
  *
  * Return: true if the next entry is found, false if all the entries iterated
  */
-static inline bool dlock_list_iterate(struct dlock_list_head *head,
+static inline bool dlock_list_iterate(struct dlock_list_head __percpu *head,
 				      struct dlock_list_state *state)
 {
 	/*
@@ -200,8 +201,9 @@ static inline bool dlock_list_iterate(struct dlock_list_head *head,
  *
  * Return: true if the next entry is found, false if all the entries iterated
  */
-static inline bool dlock_list_iterate_safe(struct dlock_list_head *head,
-					   struct dlock_list_state *state)
+static inline bool
+dlock_list_iterate_safe(struct dlock_list_head __percpu *head,
+			struct dlock_list_state *state)
 {
 	/*
 	 * Find next entry
@@ -226,8 +228,8 @@ static inline bool dlock_list_iterate_safe(struct dlock_list_head *head,
 }
 
 extern void dlock_list_add(struct dlock_list_node *node,
-			  struct dlock_list_head *head);
+			   struct dlock_list_head __percpu *head);
 extern void dlock_list_del(struct dlock_list_node *node);
-extern int  init_dlock_list_head(struct dlock_list_head **pdlock_head);
+extern int  init_dlock_list_head(struct dlock_list_head __percpu **pdlock_head);
 
 #endif /* __LINUX_DLOCK_LIST_H */
diff --git a/lib/dlock-list.c b/lib/dlock-list.c
index 84d4623..e1a1930 100644
--- a/lib/dlock-list.c
+++ b/lib/dlock-list.c
@@ -27,7 +27,7 @@ static struct lock_class_key dlock_list_key;
 /*
  * Initialize the per-cpu list head
  */
-int init_dlock_list_head(struct dlock_list_head **pdlock_head)
+int init_dlock_list_head(struct dlock_list_head __percpu **pdlock_head)
 {
 	struct dlock_list_head *dlock_head;
 	int cpu;
@@ -53,7 +53,8 @@ int init_dlock_list_head(struct dlock_list_head **pdlock_head)
  * function is called. However, deletion may be done by a different CPU.
  * So we still need to use a lock to protect the content of the list.
  */
-void dlock_list_add(struct dlock_list_node *node, struct dlock_list_head *head)
+void dlock_list_add(struct dlock_list_node *node,
+		    struct dlock_list_head __percpu *head)
 {
 	struct dlock_list_head *myhead;
 
-- 
1.7.1

* [PATCH v2 3/7] fsnotify: Simplify inode iteration on umount
  2016-07-11 17:32 [PATCH v2 0/7] vfs: Use dlock list for SB's s_inodes list Waiman Long
  2016-07-11 17:32 ` [PATCH v2 1/7] lib/dlock-list: Distributed and lock-protected lists Waiman Long
  2016-07-11 17:32 ` [PATCH v2 2/7] lib/dlock-list: Add __percpu modifier for parameters Waiman Long
@ 2016-07-11 17:32 ` Waiman Long
  2016-07-11 17:32 ` [PATCH v2 4/7] vfs: Remove unnecessary list_for_each_entry_safe() variants Waiman Long
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: Waiman Long @ 2016-07-11 17:32 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng, Scott J Norton,
	Douglas Hatch, Jan Kara, Waiman Long

From: Jan Kara <jack@suse.cz>

fsnotify_unmount_inodes() played complex tricks to pin the next inode
in the sb->s_inodes list when iterating over all inodes. If we switch
to keeping the current inode pinned somewhat longer, we can make the
code much simpler and more standard.
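
For reference, the resulting loop is sketched below in condensed form
(illustrative only; the real function in the diff below also skips
inodes with a zero reference count and sends the FS_UNMOUNT event):

static void fsnotify_unmount_inodes_sketch(struct super_block *sb)
{
	struct inode *inode, *iput_inode = NULL;

	spin_lock(&sb->s_inode_list_lock);
	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
		spin_lock(&inode->i_lock);
		if (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) {
			spin_unlock(&inode->i_lock);
			continue;
		}
		__iget(inode);			/* pin the current inode */
		spin_unlock(&inode->i_lock);
		spin_unlock(&sb->s_inode_list_lock);

		if (iput_inode)			/* release the previous inode */
			iput(iput_inode);

		fsnotify_inode_delete(inode);	/* per-inode work, no list lock held */

		iput_inode = inode;		/* defer this inode's iput() */
		spin_lock(&sb->s_inode_list_lock);
	}
	spin_unlock(&sb->s_inode_list_lock);

	if (iput_inode)
		iput(iput_inode);
}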

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 fs/notify/inode_mark.c |   45 +++++++++------------------------------------
 1 files changed, 9 insertions(+), 36 deletions(-)

diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index 741077d..a364524 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -150,12 +150,10 @@ int fsnotify_add_inode_mark(struct fsnotify_mark *mark,
  */
 void fsnotify_unmount_inodes(struct super_block *sb)
 {
-	struct inode *inode, *next_i, *need_iput = NULL;
+	struct inode *inode, *iput_inode = NULL;
 
 	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next_i, &sb->s_inodes, i_sb_list) {
-		struct inode *need_iput_tmp;
-
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		/*
 		 * We cannot __iget() an inode in state I_FREEING,
 		 * I_WILL_FREE, or I_NEW which is fine because by that point
@@ -178,49 +176,24 @@ void fsnotify_unmount_inodes(struct super_block *sb)
 			continue;
 		}
 
-		need_iput_tmp = need_iput;
-		need_iput = NULL;
-
-		/* In case fsnotify_inode_delete() drops a reference. */
-		if (inode != need_iput_tmp)
-			__iget(inode);
-		else
-			need_iput_tmp = NULL;
+		__iget(inode);
 		spin_unlock(&inode->i_lock);
-
-		/* In case the dropping of a reference would nuke next_i. */
-		while (&next_i->i_sb_list != &sb->s_inodes) {
-			spin_lock(&next_i->i_lock);
-			if (!(next_i->i_state & (I_FREEING | I_WILL_FREE)) &&
-						atomic_read(&next_i->i_count)) {
-				__iget(next_i);
-				need_iput = next_i;
-				spin_unlock(&next_i->i_lock);
-				break;
-			}
-			spin_unlock(&next_i->i_lock);
-			next_i = list_next_entry(next_i, i_sb_list);
-		}
-
-		/*
-		 * We can safely drop s_inode_list_lock here because either
-		 * we actually hold references on both inode and next_i or
-		 * end of list.  Also no new inodes will be added since the
-		 * umount has begun.
-		 */
 		spin_unlock(&sb->s_inode_list_lock);
 
-		if (need_iput_tmp)
-			iput(need_iput_tmp);
+		if (iput_inode)
+			iput(iput_inode);
 
 		/* for each watch, send FS_UNMOUNT and then remove it */
 		fsnotify(inode, FS_UNMOUNT, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
 
 		fsnotify_inode_delete(inode);
 
-		iput(inode);
+		iput_inode = inode;
 
 		spin_lock(&sb->s_inode_list_lock);
 	}
 	spin_unlock(&sb->s_inode_list_lock);
+
+	if (iput_inode)
+		iput(iput_inode);
 }
-- 
1.7.1

* [PATCH v2 4/7] vfs: Remove unnecessary list_for_each_entry_safe() variants
  2016-07-11 17:32 [PATCH v2 0/7] vfs: Use dlock list for SB's s_inodes list Waiman Long
                   ` (2 preceding siblings ...)
  2016-07-11 17:32 ` [PATCH v2 3/7] fsnotify: Simplify inode iteration on umount Waiman Long
@ 2016-07-11 17:32 ` Waiman Long
  2016-07-11 17:32 ` [PATCH v2 5/7] vfs: Use dlock list for superblock's inode list Waiman Long
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: Waiman Long @ 2016-07-11 17:32 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng, Scott J Norton,
	Douglas Hatch, Jan Kara, Waiman Long

From: Jan Kara <jack@suse.cz>

evict_inodes() and invalidate_inodes() use list_for_each_entry_safe()
to iterate the sb->s_inodes list. However, since we use the i_lru list
entry for our local temporary list of inodes to destroy, the inode is
guaranteed to stay in the sb->s_inodes list while we hold
sb->s_inode_list_lock. So there is no real need for the safe iteration
variant and we can use list_for_each_entry() just fine.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 fs/inode.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 4ccbc21..8204813 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -597,12 +597,12 @@ static void dispose_list(struct list_head *head)
  */
 void evict_inodes(struct super_block *sb)
 {
-	struct inode *inode, *next;
+	struct inode *inode;
 	LIST_HEAD(dispose);
 
 again:
 	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		if (atomic_read(&inode->i_count))
 			continue;
 
@@ -647,11 +647,11 @@ again:
 int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 {
 	int busy = 0;
-	struct inode *inode, *next;
+	struct inode *inode;
 	LIST_HEAD(dispose);
 
 	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
 			spin_unlock(&inode->i_lock);
-- 
1.7.1

* [PATCH v2 5/7] vfs: Use dlock list for superblock's inode list
  2016-07-11 17:32 [PATCH v2 0/7] vfs: Use dlock list for SB's s_inodes list Waiman Long
                   ` (3 preceding siblings ...)
  2016-07-11 17:32 ` [PATCH v2 4/7] vfs: Remove unnecessary list_for_each_entry_safe() variants Waiman Long
@ 2016-07-11 17:32 ` Waiman Long
  2016-07-11 17:32 ` [RFC PATCH v2 6/7] lib/persubnode: Introducing a simple per-subnode APIs Waiman Long
  2016-07-11 17:32 ` [RFC PATCH v2 7/7] lib/dlock-list: Use the per-subnode APIs for managing lists Waiman Long
  6 siblings, 0 replies; 24+ messages in thread
From: Waiman Long @ 2016-07-11 17:32 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng, Scott J Norton,
	Douglas Hatch, Waiman Long

When many threads are trying to add or delete inodes to or from
a superblock's s_inodes list, spinlock contention on the list can
become a performance bottleneck.

This patch changes the s_inodes field to become a dlock list, which
is a distributed set of lists with per-list spinlocks.  As a result,
the following superblock inode list (sb->s_inodes) iteration functions
in vfs are also modified (a sketch of the common conversion pattern
follows the list):

 1. iterate_bdevs()
 2. drop_pagecache_sb()
 3. wait_sb_inodes()
 4. evict_inodes()
 5. invalidate_inodes()
 6. fsnotify_unmount_inodes()
 7. add_dquot_ref()
 8. remove_dquot_ref()
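
The conversion follows the same pattern in all of these functions; an
illustrative sketch (the counting function below is made up and is not
part of the patch) is:

static void count_busy_inodes_sketch(struct super_block *sb)
{
	struct inode *inode;
	unsigned long busy = 0;
	DEFINE_DLOCK_LIST_STATE(state);

	/*
	 * Replaces: spin_lock(&sb->s_inode_list_lock);
	 *           list_for_each_entry(inode, &sb->s_inodes, i_sb_list)
	 */
	while (dlock_list_iterate(sb->s_inodes, &state)) {
		inode = list_entry(state.curr, struct inode, i_sb_list);
		if (atomic_read(&inode->i_count))
			busy++;
		/*
		 * state.lock is held here.  Code that must sleep drops it
		 * with spin_unlock(state.lock) and re-takes it with
		 * spin_lock(state.lock) before the next iteration, just as
		 * the old code did with s_inode_list_lock.
		 */
	}
	pr_info("%s: %lu busy inodes\n", sb->s_id, busy);
}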

An exit microbenchmark was used that creates a large number of
threads, attaches many inodes to them and then exits. The runtimes of
that microbenchmark with 1000 threads before and after the patch on a
4-socket Intel E7-4820 v3 system (40 cores, 80 threads) were as
follows:

  Kernel            Elapsed Time    System Time
  ------            ------------    -----------
  Vanilla 4.5-rc4      65.29s         82m14s
  Patched 4.5-rc4      22.81s         23m03s

Before the patch, spinlock contention in the inode_sb_list_add()
function at the startup phase and in the inode_sb_list_del() function
at the exit phase accounted for about 79% and 93% of total CPU time
respectively (as measured by perf). After the patch, the
dlock_list_add() function consumed only about 0.04% of CPU time at the
startup phase and the dlock_list_del() function consumed about 0.4%
of CPU time at the exit phase. There was still some spinlock
contention, but it happened elsewhere.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/block_dev.c         |   13 +++++++------
 fs/drop_caches.c       |   10 +++++-----
 fs/fs-writeback.c      |   13 +++++++------
 fs/inode.c             |   36 +++++++++++++++---------------------
 fs/notify/inode_mark.c |   10 +++++-----
 fs/quota/dquot.c       |   16 ++++++++--------
 fs/super.c             |    7 ++++---
 include/linux/fs.h     |    8 ++++----
 8 files changed, 55 insertions(+), 58 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 71ccab1..21e9064 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1896,11 +1896,13 @@ EXPORT_SYMBOL(__invalidate_device);
 void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
 {
 	struct inode *inode, *old_inode = NULL;
+	DEFINE_DLOCK_LIST_STATE(state);
 
-	spin_lock(&blockdev_superblock->s_inode_list_lock);
-	list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
-		struct address_space *mapping = inode->i_mapping;
+	while (dlock_list_iterate(blockdev_superblock->s_inodes, &state)) {
+		struct address_space *mapping;
 
+		inode   = list_entry(state.curr, struct inode, i_sb_list);
+		mapping = inode->i_mapping;
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
 		    mapping->nrpages == 0) {
@@ -1909,7 +1911,7 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&blockdev_superblock->s_inode_list_lock);
+		spin_unlock(state.lock);
 		/*
 		 * We hold a reference to 'inode' so it couldn't have been
 		 * removed from s_inodes list while we dropped the
@@ -1923,8 +1925,7 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
 
 		func(I_BDEV(inode), arg);
 
-		spin_lock(&blockdev_superblock->s_inode_list_lock);
+		spin_lock(state.lock);
 	}
-	spin_unlock(&blockdev_superblock->s_inode_list_lock);
 	iput(old_inode);
 }
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index d72d52b..26b6c68 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -16,9 +16,10 @@ int sysctl_drop_caches;
 static void drop_pagecache_sb(struct super_block *sb, void *unused)
 {
 	struct inode *inode, *toput_inode = NULL;
+	DEFINE_DLOCK_LIST_STATE(state);
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	while (dlock_list_iterate(sb->s_inodes, &state)) {
+		inode = list_entry(state.curr, struct inode, i_sb_list);
 		spin_lock(&inode->i_lock);
 		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
 		    (inode->i_mapping->nrpages == 0)) {
@@ -27,15 +28,14 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
+		spin_unlock(state.lock);
 
 		invalidate_mapping_pages(inode->i_mapping, 0, -1);
 		iput(toput_inode);
 		toput_inode = inode;
 
-		spin_lock(&sb->s_inode_list_lock);
+		spin_lock(state.lock);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 	iput(toput_inode);
 }
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 989a2ce..bbb7ff2 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -2155,6 +2155,7 @@ EXPORT_SYMBOL(__mark_inode_dirty);
 static void wait_sb_inodes(struct super_block *sb)
 {
 	struct inode *inode, *old_inode = NULL;
+	DEFINE_DLOCK_LIST_STATE(state);
 
 	/*
 	 * We need to be protected against the filesystem going from
@@ -2163,7 +2164,6 @@ static void wait_sb_inodes(struct super_block *sb)
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
 	mutex_lock(&sb->s_sync_lock);
-	spin_lock(&sb->s_inode_list_lock);
 
 	/*
 	 * Data integrity sync. Must wait for all pages under writeback,
@@ -2172,9 +2172,11 @@ static void wait_sb_inodes(struct super_block *sb)
 	 * In which case, the inode may not be on the dirty list, but
 	 * we still have to wait for that writeout.
 	 */
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		struct address_space *mapping = inode->i_mapping;
+	while (dlock_list_iterate(sb->s_inodes, &state)) {
+		struct address_space *mapping;
 
+		inode   = list_entry(state.curr, struct inode, i_sb_list);
+		mapping = inode->i_mapping;
 		spin_lock(&inode->i_lock);
 		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
 		    (mapping->nrpages == 0)) {
@@ -2183,7 +2185,7 @@ static void wait_sb_inodes(struct super_block *sb)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
+		spin_unlock(state.lock);
 
 		/*
 		 * We hold a reference to 'inode' so it couldn't have been
@@ -2205,9 +2207,8 @@ static void wait_sb_inodes(struct super_block *sb)
 
 		cond_resched();
 
-		spin_lock(&sb->s_inode_list_lock);
+		spin_lock(state.lock);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 	iput(old_inode);
 	mutex_unlock(&sb->s_sync_lock);
 }
diff --git a/fs/inode.c b/fs/inode.c
index 8204813..7e93160 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -28,7 +28,7 @@
  *   inode->i_state, inode->i_hash, __iget()
  * Inode LRU list locks protect:
  *   inode->i_sb->s_inode_lru, inode->i_lru
- * inode->i_sb->s_inode_list_lock protects:
+ * inode->i_sb->s_inodes->lock protects:
  *   inode->i_sb->s_inodes, inode->i_sb_list
  * bdi->wb.list_lock protects:
  *   bdi->wb.b_{dirty,io,more_io,dirty_time}, inode->i_io_list
@@ -37,7 +37,7 @@
  *
  * Lock ordering:
  *
- * inode->i_sb->s_inode_list_lock
+ * inode->i_sb->s_inodes->lock
  *   inode->i_lock
  *     Inode LRU list locks
  *
@@ -45,7 +45,7 @@
  *   inode->i_lock
  *
  * inode_hash_lock
- *   inode->i_sb->s_inode_list_lock
+ *   inode->i_sb->s_inodes->lock
  *   inode->i_lock
  *
  * iunique_lock
@@ -431,19 +431,14 @@ static void inode_lru_list_del(struct inode *inode)
  */
 void inode_sb_list_add(struct inode *inode)
 {
-	spin_lock(&inode->i_sb->s_inode_list_lock);
-	list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
-	spin_unlock(&inode->i_sb->s_inode_list_lock);
+	dlock_list_add(&inode->i_sb_list, inode->i_sb->s_inodes);
 }
 EXPORT_SYMBOL_GPL(inode_sb_list_add);
 
 static inline void inode_sb_list_del(struct inode *inode)
 {
-	if (!list_empty(&inode->i_sb_list)) {
-		spin_lock(&inode->i_sb->s_inode_list_lock);
-		list_del_init(&inode->i_sb_list);
-		spin_unlock(&inode->i_sb->s_inode_list_lock);
-	}
+	if (!list_empty(&inode->i_sb_list.list))
+		dlock_list_del(&inode->i_sb_list);
 }
 
 static unsigned long hash(struct super_block *sb, unsigned long hashval)
@@ -598,11 +593,13 @@ static void dispose_list(struct list_head *head)
 void evict_inodes(struct super_block *sb)
 {
 	struct inode *inode;
+	struct dlock_list_state state;
 	LIST_HEAD(dispose);
 
 again:
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	init_dlock_list_state(&state);
+	while (dlock_list_iterate(sb->s_inodes, &state)) {
+		inode = list_entry(state.curr, struct inode, i_sb_list);
 		if (atomic_read(&inode->i_count))
 			continue;
 
@@ -623,13 +620,12 @@ again:
 		 * bit so we don't livelock.
 		 */
 		if (need_resched()) {
-			spin_unlock(&sb->s_inode_list_lock);
+			spin_unlock(state.lock);
 			cond_resched();
 			dispose_list(&dispose);
 			goto again;
 		}
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 
 	dispose_list(&dispose);
 }
@@ -649,9 +645,10 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 	int busy = 0;
 	struct inode *inode;
 	LIST_HEAD(dispose);
+	DEFINE_DLOCK_LIST_STATE(state);
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	while (dlock_list_iterate(sb->s_inodes, &state)) {
+		inode = list_entry(state.curr, struct inode, i_sb_list);
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
 			spin_unlock(&inode->i_lock);
@@ -673,7 +670,6 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 		spin_unlock(&inode->i_lock);
 		list_add(&inode->i_lru, &dispose);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 
 	dispose_list(&dispose);
 
@@ -888,7 +884,7 @@ struct inode *new_inode_pseudo(struct super_block *sb)
 		spin_lock(&inode->i_lock);
 		inode->i_state = 0;
 		spin_unlock(&inode->i_lock);
-		INIT_LIST_HEAD(&inode->i_sb_list);
+		init_dlock_list_node(&inode->i_sb_list);
 	}
 	return inode;
 }
@@ -909,8 +905,6 @@ struct inode *new_inode(struct super_block *sb)
 {
 	struct inode *inode;
 
-	spin_lock_prefetch(&sb->s_inode_list_lock);
-
 	inode = new_inode_pseudo(sb);
 	if (inode)
 		inode_sb_list_add(inode);
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index a364524..2639522 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -151,14 +151,15 @@ int fsnotify_add_inode_mark(struct fsnotify_mark *mark,
 void fsnotify_unmount_inodes(struct super_block *sb)
 {
 	struct inode *inode, *iput_inode = NULL;
+	DEFINE_DLOCK_LIST_STATE(state);
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	while (dlock_list_iterate(sb->s_inodes, &state)) {
 		/*
 		 * We cannot __iget() an inode in state I_FREEING,
 		 * I_WILL_FREE, or I_NEW which is fine because by that point
 		 * the inode cannot have any associated watches.
 		 */
+		inode = list_entry(state.curr, struct inode, i_sb_list);
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
 			spin_unlock(&inode->i_lock);
@@ -178,7 +179,7 @@ void fsnotify_unmount_inodes(struct super_block *sb)
 
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
+		spin_unlock(state.lock);
 
 		if (iput_inode)
 			iput(iput_inode);
@@ -190,9 +191,8 @@ void fsnotify_unmount_inodes(struct super_block *sb)
 
 		iput_inode = inode;
 
-		spin_lock(&sb->s_inode_list_lock);
+		spin_lock(state.lock);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 
 	if (iput_inode)
 		iput(iput_inode);
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index ff21980..cb13619 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -936,12 +936,13 @@ static int dqinit_needed(struct inode *inode, int type)
 static void add_dquot_ref(struct super_block *sb, int type)
 {
 	struct inode *inode, *old_inode = NULL;
+	DEFINE_DLOCK_LIST_STATE(state);
 #ifdef CONFIG_QUOTA_DEBUG
 	int reserved = 0;
 #endif
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	while (dlock_list_iterate(sb->s_inodes, &state)) {
+		inode = list_entry(state.curr, struct inode, i_sb_list);
 		spin_lock(&inode->i_lock);
 		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
 		    !atomic_read(&inode->i_writecount) ||
@@ -951,7 +952,7 @@ static void add_dquot_ref(struct super_block *sb, int type)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
+		spin_unlock(state.lock);
 
 #ifdef CONFIG_QUOTA_DEBUG
 		if (unlikely(inode_get_rsv_space(inode) > 0))
@@ -969,9 +970,8 @@ static void add_dquot_ref(struct super_block *sb, int type)
 		 * later.
 		 */
 		old_inode = inode;
-		spin_lock(&sb->s_inode_list_lock);
+		spin_lock(state.lock);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 	iput(old_inode);
 
 #ifdef CONFIG_QUOTA_DEBUG
@@ -1039,15 +1039,16 @@ static void remove_dquot_ref(struct super_block *sb, int type,
 {
 	struct inode *inode;
 	int reserved = 0;
+	DEFINE_DLOCK_LIST_STATE(state);
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	while (dlock_list_iterate(sb->s_inodes, &state)) {
 		/*
 		 *  We have to scan also I_NEW inodes because they can already
 		 *  have quota pointer initialized. Luckily, we need to touch
 		 *  only quota pointers and these have separate locking
 		 *  (dq_data_lock).
 		 */
+		inode = list_entry(state.curr, struct inode, i_sb_list);
 		spin_lock(&dq_data_lock);
 		if (!IS_NOQUOTA(inode)) {
 			if (unlikely(inode_get_rsv_space(inode) > 0))
@@ -1056,7 +1057,6 @@ static void remove_dquot_ref(struct super_block *sb, int type,
 		}
 		spin_unlock(&dq_data_lock);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 #ifdef CONFIG_QUOTA_DEBUG
 	if (reserved) {
 		printk(KERN_WARNING "VFS (%s): Writes happened after quota"
diff --git a/fs/super.c b/fs/super.c
index d78b984..b2db824 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -163,6 +163,7 @@ static void destroy_super(struct super_block *s)
 {
 	list_lru_destroy(&s->s_dentry_lru);
 	list_lru_destroy(&s->s_inode_lru);
+	free_dlock_list_head(&s->s_inodes);
 	security_sb_free(s);
 	WARN_ON(!list_empty(&s->s_mounts));
 	kfree(s->s_subtype);
@@ -204,9 +205,9 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 	INIT_HLIST_NODE(&s->s_instances);
 	INIT_HLIST_BL_HEAD(&s->s_anon);
 	mutex_init(&s->s_sync_lock);
-	INIT_LIST_HEAD(&s->s_inodes);
-	spin_lock_init(&s->s_inode_list_lock);
 
+	if (init_dlock_list_head(&s->s_inodes))
+		goto fail;
 	if (list_lru_init_memcg(&s->s_dentry_lru))
 		goto fail;
 	if (list_lru_init_memcg(&s->s_inode_lru))
@@ -427,7 +428,7 @@ void generic_shutdown_super(struct super_block *sb)
 		if (sop->put_super)
 			sop->put_super(sb);
 
-		if (!list_empty(&sb->s_inodes)) {
+		if (!dlock_list_empty(sb->s_inodes)) {
 			printk("VFS: Busy inodes after unmount of %s. "
 			   "Self-destruct in 5 seconds.  Have a nice day...\n",
 			   sb->s_id);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index dd28814..a7442a6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -32,6 +32,7 @@
 #include <linux/workqueue.h>
 #include <linux/percpu-rwsem.h>
 #include <linux/delayed_call.h>
+#include <linux/dlock-list.h>
 
 #include <asm/byteorder.h>
 #include <uapi/linux/fs.h>
@@ -664,7 +665,7 @@ struct inode {
 	u16			i_wb_frn_history;
 #endif
 	struct list_head	i_lru;		/* inode LRU list */
-	struct list_head	i_sb_list;
+	struct dlock_list_node	i_sb_list;
 	union {
 		struct hlist_head	i_dentry;
 		struct rcu_head		i_rcu;
@@ -1445,9 +1446,8 @@ struct super_block {
 	 */
 	int s_stack_depth;
 
-	/* s_inode_list_lock protects s_inodes */
-	spinlock_t		s_inode_list_lock ____cacheline_aligned_in_smp;
-	struct list_head	s_inodes;	/* all inodes */
+	/* The percpu locks protect s_inodes */
+	struct dlock_list_head __percpu *s_inodes;	/* all inodes */
 };
 
 extern struct timespec current_fs_time(struct super_block *sb);
-- 
1.7.1

* [RFC PATCH v2 6/7] lib/persubnode: Introducing a simple per-subnode APIs
  2016-07-11 17:32 [PATCH v2 0/7] vfs: Use dlock list for SB's s_inodes list Waiman Long
                   ` (4 preceding siblings ...)
  2016-07-11 17:32 ` [PATCH v2 5/7] vfs: Use dlock list for superblock's inode list Waiman Long
@ 2016-07-11 17:32 ` Waiman Long
  2016-07-12  3:14   ` Boqun Feng
  2016-07-12 14:27   ` Tejun Heo
  2016-07-11 17:32 ` [RFC PATCH v2 7/7] lib/dlock-list: Use the per-subnode APIs for managing lists Waiman Long
  6 siblings, 2 replies; 24+ messages in thread
From: Waiman Long @ 2016-07-11 17:32 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng, Scott J Norton,
	Douglas Hatch, Waiman Long

The percpu APIs are extensively used in the Linux kernel to reduce
cacheline contention and improve performance. For some use cases, the
percpu APIs may be too fine-grained for distributed resources, whereas
a per-node allocation may be too coarse as we can have dozens of CPUs
in a NUMA node on some high-end systems.

This patch introduces a simple set of per-subnode APIs where each of
the distributed resources is shared by only a handful of CPUs within
a NUMA node. The per-subnode APIs are built on top of the percpu APIs
and hence require the same amount of memory as if the percpu APIs
were used. However, they help to reduce the total number of separate
resources that need to be managed. As a result, they can speed up code
that needs to iterate over all the resources compared with using the
percpu APIs. Cacheline contention, however, will increase slightly as
each resource is shared by more than one CPU. As long as the number of
CPUs in each subnode is small, the performance impact won't be
significant.

In this patch, at most 2 sibling groups can be put into a subnode. For
an x86-64 CPU, at most 4 CPUs will be in a subnode when HT is enabled
and 2 when it is not.
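
As an illustrative sketch (the snode_counter structure and the names
below are made up and not part of the patch), a per-subnode resource
is used much like a percpu one, except that each slot may be touched
by several CPUs at once and therefore still needs its own lock, which
is exactly the model the dlock list adopts in the last patch of this
series:

struct snode_counter {
	spinlock_t lock;	/* spin_lock_init() for each subnode at setup */
	unsigned long count;
};

/* Allocated once with counters = alloc_persubnode(struct snode_counter) */
static struct snode_counter __persubnode *counters;

static void counter_inc(void)
{
	struct snode_counter *c = get_subnode_ptr(counters);	/* disables preemption */

	spin_lock(&c->lock);	/* slot is shared by all CPUs in this subnode */
	c->count++;
	spin_unlock(&c->lock);
	put_subnode_ptr(counters);
}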

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 include/linux/persubnode.h |   80 +++++++++++++++++++++++++++++
 init/main.c                |    2 +
 lib/Makefile               |    2 +
 lib/persubnode.c           |  119 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 203 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/persubnode.h
 create mode 100644 lib/persubnode.c

diff --git a/include/linux/persubnode.h b/include/linux/persubnode.h
new file mode 100644
index 0000000..b777daa
--- /dev/null
+++ b/include/linux/persubnode.h
@@ -0,0 +1,80 @@
+/*
+ * Per-subnode definitions
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2016 Hewlett-Packard Enterprise Development LP
+ *
+ * Authors: Waiman Long <waiman.long@hpe.com>
+ */
+#ifndef __LINUX_PERSUBNODE_H
+#define __LINUX_PERSUBNODE_H
+
+#include <linux/percpu.h>
+#include <linux/topology.h>
+
+/*
+ * Per-subnode APIs
+ */
+#define __persubnode			__percpu
+#define nr_subnode_ids			nr_cpu_ids
+#define alloc_persubnode(type)		alloc_percpu(type)
+#define free_persubnode(var)		free_percpu(var)
+#define for_each_subnode(snode)		for_each_cpu(snode, subnode_mask)
+#define per_subnode_ptr(ptr, subnode)	per_cpu_ptr(ptr, subnode)
+#define per_subnode(var, subnode)	per_cpu(var, subnode)
+
+#ifdef CONFIG_SMP
+
+extern struct cpumask __subnode_mask __read_mostly;
+DECLARE_PER_CPU_READ_MOSTLY(int, cpu_subnode_id);
+
+#define subnode_mask		(&__subnode_mask)
+
+static inline int this_cpu_to_subnode(void)
+{
+	return *this_cpu_ptr(&cpu_subnode_id);
+}
+
+/*
+ * For safety, preemption should be disabled before using this_subnode_ptr().
+ */
+#define this_subnode_ptr(ptr)			\
+({						\
+	int _snid = this_cpu_to_subnode();	\
+	per_cpu_ptr(ptr, _snid);		\
+})
+
+#define get_subnode_ptr(ptr)			\
+({						\
+	preempt_disable();			\
+	this_subnode_ptr(ptr);			\
+})
+
+#define put_subnode_ptr(ptr)			\
+do {						\
+	(void)(ptr);				\
+	preempt_enable();			\
+} while (0)
+
+extern void __init subnode_early_init(void);
+
+#else /* CONFIG_SMP */
+
+#define subnode_mask		cpu_possible_mask
+#define this_subnode_ptr(ptr)	this_cpu_ptr(ptr)
+#define get_subnode_ptr(ptr)	get_cpu_ptr(ptr)
+#define put_subnode_ptr(ptr)	put_cpu_ptr(ptr)
+
+static inline void subnode_early_init(void) { }
+
+#endif /* CONFIG_SMP */
+#endif /* __LINUX_PERSUBNODE_H */
diff --git a/init/main.c b/init/main.c
index 4c17fda..28e4425 100644
--- a/init/main.c
+++ b/init/main.c
@@ -81,6 +81,7 @@
 #include <linux/integrity.h>
 #include <linux/proc_ns.h>
 #include <linux/io.h>
+#include <linux/persubnode.h>
 
 #include <asm/io.h>
 #include <asm/bugs.h>
@@ -524,6 +525,7 @@ asmlinkage __visible void __init start_kernel(void)
 			   NULL, set_init_arg);
 
 	jump_label_init();
+	subnode_early_init();
 
 	/*
 	 * These use large bootmem allocations and must precede
diff --git a/lib/Makefile b/lib/Makefile
index 92e8c38..440152c 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -232,3 +232,5 @@ obj-$(CONFIG_UCS2_STRING) += ucs2_string.o
 obj-$(CONFIG_UBSAN) += ubsan.o
 
 UBSAN_SANITIZE_ubsan.o := n
+
+obj-$(CONFIG_SMP) += persubnode.o
diff --git a/lib/persubnode.c b/lib/persubnode.c
new file mode 100644
index 0000000..9febe7c
--- /dev/null
+++ b/lib/persubnode.c
@@ -0,0 +1,119 @@
+/*
+ * Per-subnode APIs
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2016 Hewlett-Packard Enterprise Development LP
+ *
+ * Authors: Waiman Long <waiman.long@hpe.com>
+ */
+
+/*
+ * The per-subnode APIs work on top of the per-cpu APIs. Instead of
+ * having to manage n separate resources on a n-cpu system, users can
+ * now manage only n/m separate resources where m is the number of CPUs
+ * in each subnode. This cuts down the time needed to traverse all the
+ * resources but at the expense of some wasted per-cpu memory as part of
+ * the per-cpu memory will not be used.
+ *
+ * All the CPUs in the same subnode must come from the same NUMA node.
+ * However, there can be more than one subnode in each NUMA node.
+ *
+ * As the per-subnode APIs can be used early in the bootup process while
+ * not all the information needed for initialization may be available
+ * at that time, the initialization is separated into two steps:
+ *  1) an early init that is called directly from start_kernel(); and
+ *  2) a postcore init.
+ *
+ * Before initialization, all the subnode IDs of the CPUs will be zero. So
+ * they will all use the same resource at subnode 0. The early init copies
+ * the cpu_possible_mask to subnode_mask causing resource initialization
+ * to be done for all the per-cpu resources allocated. At the postcore
+ * init, some bits of the subnode_mask will be cleared and the corresponding
+ * cpu to subnode ID mapping will be set accordingly.
+ */
+#include <linux/persubnode.h>
+
+DEFINE_PER_CPU_READ_MOSTLY(int, cpu_subnode_id);
+EXPORT_PER_CPU_SYMBOL(cpu_subnode_id);
+
+struct cpumask __subnode_mask __read_mostly;
+EXPORT_SYMBOL(__subnode_mask);
+
+/*
+ * Iterates all the CPUs from the given starting CPU
+ */
+#define for_each_cpu_from(from, cpu, mask)		\
+	for ((cpu) = (from); (cpu) < nr_cpu_ids;	\
+	     (cpu) = cpumask_next((cpu), (mask)))
+
+/*
+ * Early subnode initialization to be called early in the boot process.
+ */
+void __init subnode_early_init(void)
+{
+	/* Make subnode_mask the same as cpu_possible_mask */
+	cpumask_copy(subnode_mask, cpu_possible_mask);
+}
+
+/*
+ * Initialize the subnodes
+ *
+ * All the sibling CPUs will be in the same subnode. On top of that, we will
+ * put at most 2 sibling groups into the same subnode. The percpu
+ * topology_sibling_cpumask() and topology_core_cpumask() are used for
+ * grouping CPUs into subnodes. The subnode ID is the CPU number of the
+ * first CPU in the subnode.
+ */
+static int __init subnode_init(void)
+{
+	int cpu;
+	int nr_subnodes = 0;
+	const int subnode_nr_cpus = 2;
+
+	/*
+	 * Some of the bits in the subnode_mask will be cleared as we proceed.
+	 */
+	for_each_cpu(cpu, subnode_mask) {
+		int ccpu, scpu;
+		int cpucnt = 0;
+
+		const struct cpumask *core_mask = topology_core_cpumask(cpu);
+		const struct cpumask *sibling_mask;
+
+		/*
+		 * Put subnode_nr_cpus of CPUs and their siblings into each
+		 * subnode.
+		 */
+		for_each_cpu_from(cpu, ccpu, core_mask) {
+			sibling_mask = topology_sibling_cpumask(ccpu);
+			for_each_cpu_from(ccpu, scpu, sibling_mask) {
+				/*
+				 * Clear the bits of the higher CPUs.
+				 */
+				if (scpu > cpu)
+					cpumask_clear_cpu(scpu, subnode_mask);
+				per_cpu(cpu_subnode_id, scpu) = cpu;
+			}
+			if (++cpucnt == subnode_nr_cpus)
+				break;	/* One subnode formed */
+		}
+		nr_subnodes++;
+	}
+	pr_info("Number of subnodes initialized = %d\n", nr_subnodes);
+
+	/*
+	 * CPU 0 must be mapped to subnode 0
+	 */
+	BUG_ON(per_cpu(cpu_subnode_id, 0) != 0);
+	return 0;
+}
+postcore_initcall(subnode_init);
-- 
1.7.1

* [RFC PATCH v2 7/7] lib/dlock-list: Use the per-subnode APIs for managing lists
  2016-07-11 17:32 [PATCH v2 0/7] vfs: Use dlock list for SB's s_inodes list Waiman Long
                   ` (5 preceding siblings ...)
  2016-07-11 17:32 ` [RFC PATCH v2 6/7] lib/persubnode: Introducing a simple per-subnode APIs Waiman Long
@ 2016-07-11 17:32 ` Waiman Long
  6 siblings, 0 replies; 24+ messages in thread
From: Waiman Long @ 2016-07-11 17:32 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng, Scott J Norton,
	Douglas Hatch, Waiman Long

This patch modifies the dlock-list to use the per-subnode APIs to
manage the distributed lists. As a result, the number of lists that
need to be iterated in dlock_list_iterate() is reduced by at least
half, making the iteration a bit faster.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 include/linux/dlock-list.h |   81 +++++++++++++++++++++----------------------
 lib/dlock-list.c           |   19 +++++-----
 2 files changed, 50 insertions(+), 50 deletions(-)

diff --git a/include/linux/dlock-list.h b/include/linux/dlock-list.h
index a8e1fd2..01667fc 100644
--- a/include/linux/dlock-list.h
+++ b/include/linux/dlock-list.h
@@ -20,12 +20,12 @@
 
 #include <linux/spinlock.h>
 #include <linux/list.h>
-#include <linux/percpu.h>
+#include <linux/persubnode.h>
 
 /*
  * include/linux/dlock-list.h
  *
- * A distributed (per-cpu) set of lists each of which is protected by its
+ * A distributed (per-subnode) set of lists each of which is protected by its
  * own spinlock, but acts like a single consolidated list to the callers.
  *
  * The dlock_list_head structure contains the spinlock, the other
@@ -45,19 +45,19 @@ struct dlock_list_head {
 	}
 
 /*
- * Per-cpu list iteration state
+ * Per-subnode list iteration state
  */
 struct dlock_list_state {
-	int			 cpu;
+	int			 snid;	/* Subnode ID */
 	spinlock_t		*lock;
-	struct list_head	*head;	/* List head of current per-cpu list */
+	struct list_head	*head;	/* List head of current per-subnode list */
 	struct dlock_list_node	*curr;
 	struct dlock_list_node	*next;
 };
 
 #define DLOCK_LIST_STATE_INIT()			\
 	{					\
-		.cpu  = -1,			\
+		.snid  = -1,			\
 		.lock = NULL,			\
 		.head = NULL,			\
 		.curr = NULL,			\
@@ -69,7 +69,7 @@ struct dlock_list_state {
 
 static inline void init_dlock_list_state(struct dlock_list_state *state)
 {
-	state->cpu  = -1;
+	state->snid  = -1;
 	state->lock = NULL;
 	state->head = NULL;
 	state->curr = NULL;
@@ -83,12 +83,12 @@ static inline void init_dlock_list_state(struct dlock_list_state *state)
 #endif
 
 /*
- * Next per-cpu list entry
+ * Next per-subnode list entry
  */
 #define dlock_list_next_entry(pos, member) list_next_entry(pos, member.list)
 
 /*
- * Per-cpu node data structure
+ * Per-subnode node data structure
  */
 struct dlock_list_node {
 	struct list_head list;
@@ -109,50 +109,50 @@ static inline void init_dlock_list_node(struct dlock_list_node *node)
 }
 
 static inline void
-free_dlock_list_head(struct dlock_list_head __percpu **pdlock_head)
+free_dlock_list_head(struct dlock_list_head __persubnode **pdlock_head)
 {
-	free_percpu(*pdlock_head);
+	free_persubnode(*pdlock_head);
 	*pdlock_head = NULL;
 }
 
 /*
- * Check if all the per-cpu lists are empty
+ * Check if all the per-subnode lists are empty
  */
-static inline bool dlock_list_empty(struct dlock_list_head __percpu *dlock_head)
+static inline bool dlock_list_empty(struct dlock_list_head __persubnode *dlock_head)
 {
-	int cpu;
+	int snid;
 
-	for_each_possible_cpu(cpu)
-		if (!list_empty(&per_cpu_ptr(dlock_head, cpu)->list))
+	for_each_subnode(snid)
+		if (!list_empty(&per_subnode_ptr(dlock_head, snid)->list))
 			return false;
 	return true;
 }
 
 /*
- * Helper function to find the first entry of the next per-cpu list
- * It works somewhat like for_each_possible_cpu(cpu).
+ * Helper function to find the first entry of the next per-subnode list
+ * It works somewhat like for_each_subnode(snid).
  *
  * Return: true if the entry is found, false if all the lists exhausted
  */
 static __always_inline bool
-__dlock_list_next_cpu(struct dlock_list_head __percpu *head,
+__dlock_list_next_subnode(struct dlock_list_head __persubnode *head,
 		      struct dlock_list_state *state)
 {
 	if (state->lock)
 		spin_unlock(state->lock);
-next_cpu:
+next_subnode:
 	/*
-	 * for_each_possible_cpu(cpu)
+	 * for_each_subnode(snid)
 	 */
-	state->cpu = cpumask_next(state->cpu, cpu_possible_mask);
-	if (state->cpu >= nr_cpu_ids)
-		return false;	/* All the per-cpu lists iterated */
+	state->snid = cpumask_next(state->snid, subnode_mask);
+	if (state->snid >= nr_subnode_ids)
+		return false;	/* All the per-subnode lists iterated */
 
-	state->head = &per_cpu_ptr(head, state->cpu)->list;
+	state->head = &per_subnode_ptr(head, state->snid)->list;
 	if (list_empty(state->head))
-		goto next_cpu;
+		goto next_subnode;
 
-	state->lock = &per_cpu_ptr(head, state->cpu)->lock;
+	state->lock = &per_subnode_ptr(head, state->snid)->lock;
 	spin_lock(state->lock);
 	/*
 	 * There is a slight chance that the list may become empty just
@@ -161,7 +161,7 @@ next_cpu:
 	 */
 	if (list_empty(state->head)) {
 		spin_unlock(state->lock);
-		goto next_cpu;
+		goto next_subnode;
 	}
 	state->curr = list_entry(state->head->next,
 				 struct dlock_list_node, list);
@@ -169,11 +169,11 @@ next_cpu:
 }
 
 /*
- * Iterate to the next entry of the group of per-cpu lists
+ * Iterate to the next entry of the group of per-subnode lists
  *
  * Return: true if the next entry is found, false if all the entries iterated
  */
-static inline bool dlock_list_iterate(struct dlock_list_head __percpu *head,
+static inline bool dlock_list_iterate(struct dlock_list_head __persubnode *head,
 				      struct dlock_list_state *state)
 {
 	/*
@@ -184,10 +184,10 @@ static inline bool dlock_list_iterate(struct dlock_list_head __percpu *head,
 
 	if (!state->curr || (&state->curr->list == state->head)) {
 		/*
-		 * The current per-cpu list has been exhausted, try the next
-		 * per-cpu list.
+		 * The current per-subnode list has been exhausted, try the next
+		 * per-subnode list.
 		 */
-		if (!__dlock_list_next_cpu(head, state))
+		if (!__dlock_list_next_subnode(head, state))
 			return false;
 	}
 
@@ -196,13 +196,13 @@ static inline bool dlock_list_iterate(struct dlock_list_head __percpu *head,
 }
 
 /*
- * Iterate to the next entry of the group of per-cpu lists and safe
+ * Iterate to the next entry of the group of per-subnode lists and safe
  * against removal of list_entry
  *
  * Return: true if the next entry is found, false if all the entries iterated
  */
 static inline bool
-dlock_list_iterate_safe(struct dlock_list_head __percpu *head,
+dlock_list_iterate_safe(struct dlock_list_head __persubnode *head,
 			struct dlock_list_state *state)
 {
 	/*
@@ -215,10 +215,10 @@ dlock_list_iterate_safe(struct dlock_list_head __percpu *head,
 
 	if (!state->curr || (&state->curr->list == state->head)) {
 		/*
-		 * The current per-cpu list has been exhausted, try the next
-		 * per-cpu list.
+		 * The current per-subnode list has been exhausted, try the next
+		 * per-subnode list.
 		 */
-		if (!__dlock_list_next_cpu(head, state))
+		if (!__dlock_list_next_subnode(head, state))
 			return false;
 		state->next = list_next_entry(state->curr, list);
 	}
@@ -228,8 +228,7 @@ dlock_list_iterate_safe(struct dlock_list_head __percpu *head,
 }
 
 extern void dlock_list_add(struct dlock_list_node *node,
-			   struct dlock_list_head __percpu *head);
+			   struct dlock_list_head __persubnode *head);
 extern void dlock_list_del(struct dlock_list_node *node);
-extern int  init_dlock_list_head(struct dlock_list_head __percpu **pdlock_head);
-
+extern int  init_dlock_list_head(struct dlock_list_head __persubnode **pdlock_head);
 #endif /* __LINUX_DLOCK_LIST_H */
diff --git a/lib/dlock-list.c b/lib/dlock-list.c
index e1a1930..05bbf45 100644
--- a/lib/dlock-list.c
+++ b/lib/dlock-list.c
@@ -25,20 +25,21 @@
 static struct lock_class_key dlock_list_key;
 
 /*
- * Initialize the per-cpu list head
+ * Initialize the per-subnode list head
  */
-int init_dlock_list_head(struct dlock_list_head __percpu **pdlock_head)
+int init_dlock_list_head(struct dlock_list_head __persubnode **pdlock_head)
 {
 	struct dlock_list_head *dlock_head;
-	int cpu;
+	int snid;
 
-	dlock_head = alloc_percpu(struct dlock_list_head);
+	dlock_head = alloc_persubnode(struct dlock_list_head);
 	if (!dlock_head)
 		return -ENOMEM;
 
-	for_each_possible_cpu(cpu) {
-		struct dlock_list_head *head = per_cpu_ptr(dlock_head, cpu);
+	for_each_subnode(snid) {
+		struct dlock_list_head *head;
 
+		head = per_subnode_ptr(dlock_head, snid);
 		INIT_LIST_HEAD(&head->list);
 		head->lock = __SPIN_LOCK_UNLOCKED(&head->lock);
 		lockdep_set_class(&head->lock, &dlock_list_key);
@@ -54,19 +55,19 @@ int init_dlock_list_head(struct dlock_list_head __percpu **pdlock_head)
  * So we still need to use a lock to protect the content of the list.
  */
 void dlock_list_add(struct dlock_list_node *node,
-		    struct dlock_list_head __percpu *head)
+		    struct dlock_list_head __persubnode *head)
 {
 	struct dlock_list_head *myhead;
 
 	/*
 	 * Disable preemption to make sure that CPU won't gets changed.
 	 */
-	myhead = get_cpu_ptr(head);
+	myhead = get_subnode_ptr(head);
 	spin_lock(&myhead->lock);
 	node->lockptr = &myhead->lock;
 	list_add(&node->list, &myhead->list);
 	spin_unlock(&myhead->lock);
-	put_cpu_ptr(head);
+	put_subnode_ptr(head);
 }
 
 /*
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 6/7] lib/persubnode: Introducing a simple per-subnode APIs
  2016-07-11 17:32 ` [RFC PATCH v2 6/7] lib/persubnode: Introducing a simple per-subnode APIs Waiman Long
@ 2016-07-12  3:14   ` Boqun Feng
  2016-07-12 18:37     ` Waiman Long
  2016-07-12 14:27   ` Tejun Heo
  1 sibling, 1 reply; 24+ messages in thread
From: Boqun Feng @ 2016-07-12  3:14 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Scott J Norton, Douglas Hatch

On Mon, Jul 11, 2016 at 01:32:11PM -0400, Waiman Long wrote:
> The percpu APIs are extensively used in the Linux kernel to reduce
> cacheline contention and improve performance. For some use cases, the
> percpu APIs may be too fine-grain for distributed resources whereas
> a per-node based allocation may be too coarse as we can have dozens
> of CPUs in a NUMA node in some high-end systems.
> 
> This patch introduces a set of simple per-subnode APIs where each of the
> distributed resources will be shared by only a handful of CPUs within
> a NUMA node. The per-subnode APIs are built on top of the percpu APIs
> and hence require the same amount of memory as if the percpu APIs
> are used. However, it helps to reduce the total number of separate
> resources that need to be managed. As a result, it can speed up code
> that needs to iterate all the resources compared with using the percpu
> APIs. Cacheline contention, however, will increase slightly as each
> resource is shared by more than one CPU. As long as the number of CPUs
> in each subnode is small, the performance impact won't be significant.
> 
> In this patch, at most 2 sibling groups can be put into a subnode. For
> an x86-64 CPU, at most 4 CPUs will be in a subnode when HT is enabled
> and 2 when it is not.
> 
> Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
> ---
>  include/linux/persubnode.h |   80 +++++++++++++++++++++++++++++
>  init/main.c                |    2 +
>  lib/Makefile               |    2 +
>  lib/persubnode.c           |  119 ++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 203 insertions(+), 0 deletions(-)
>  create mode 100644 include/linux/persubnode.h
>  create mode 100644 lib/persubnode.c
> 
> diff --git a/include/linux/persubnode.h b/include/linux/persubnode.h
> new file mode 100644
> index 0000000..b777daa
> --- /dev/null
> +++ b/include/linux/persubnode.h
> @@ -0,0 +1,80 @@
> +/*
> + * Per-subnode definitions
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * (C) Copyright 2016 Hewlett-Packard Enterprise Development LP
> + *
> + * Authors: Waiman Long <waiman.long@hpe.com>
> + */
> +#ifndef __LINUX_PERSUBNODE_H
> +#define __LINUX_PERSUBNODE_H
> +
> +#include <linux/percpu.h>
> +#include <linux/topology.h>
> +
> +/*
> + * Per-subnode APIs
> + */
> +#define __persubnode			__percpu
> +#define nr_subnode_ids			nr_cpu_ids
> +#define alloc_persubnode(type)		alloc_percpu(type)
> +#define free_persubnode(var)		free_percpu(var)
> +#define for_each_subnode(snode)		for_each_cpu(snode, subnode_mask)
> +#define per_subnode_ptr(ptr, subnode)	per_cpu_ptr(ptr, subnode)
> +#define per_subnode(var, subnode)	per_cpu(var, subnode)
> +
> +#ifdef CONFIG_SMP
> +
> +extern struct cpumask __subnode_mask __read_mostly;
> +DECLARE_PER_CPU_READ_MOSTLY(int, cpu_subnode_id);
> +
> +#define subnode_mask		(&__subnode_mask)
> +
> +static inline int this_cpu_to_subnode(void)
> +{
> +	return *this_cpu_ptr(&cpu_subnode_id);
> +}
> +
> +/*
> + * For safety, preemption should be disabled before using this_subnode_ptr().
> + */
> +#define this_subnode_ptr(ptr)			\
> +({						\
> +	int _snid = this_cpu_to_subnode();	\
> +	per_cpu_ptr(ptr, _snid);		\
> +})
> +
> +#define get_subnode_ptr(ptr)			\
> +({						\
> +	preempt_disable();			\
> +	this_subnode_ptr(ptr);			\
> +})
> +
> +#define put_subnode_ptr(ptr)			\
> +do {						\
> +	(void)(ptr);				\
> +	preempt_enable();			\
> +} while (0)
> +
> +extern void __init subnode_early_init(void);
> +
> +#else /* CONFIG_SMP */
> +
> +#define subnode_mask		cpu_possible_mask
> +#define this_subnode_ptr(ptr)	this_cpu_ptr(ptr)
> +#define get_subnode_ptr(ptr)	get_cpu_ptr(ptr)
> +#define put_subnode_ptr(ptr)	put_cpu_ptr(ptr)
> +
> +static inline void subnode_early_init(void) { }
> +
> +#endif /* CONFIG_SMP */
> +#endif /* __LINUX_PERSUBNODE_H */
> diff --git a/init/main.c b/init/main.c
> index 4c17fda..28e4425 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -81,6 +81,7 @@
>  #include <linux/integrity.h>
>  #include <linux/proc_ns.h>
>  #include <linux/io.h>
> +#include <linux/persubnode.h>
>  
>  #include <asm/io.h>
>  #include <asm/bugs.h>
> @@ -524,6 +525,7 @@ asmlinkage __visible void __init start_kernel(void)
>  			   NULL, set_init_arg);
>  
>  	jump_label_init();
> +	subnode_early_init();
>  
>  	/*
>  	 * These use large bootmem allocations and must precede
> diff --git a/lib/Makefile b/lib/Makefile
> index 92e8c38..440152c 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -232,3 +232,5 @@ obj-$(CONFIG_UCS2_STRING) += ucs2_string.o
>  obj-$(CONFIG_UBSAN) += ubsan.o
>  
>  UBSAN_SANITIZE_ubsan.o := n
> +
> +obj-$(CONFIG_SMP) += persubnode.o
> diff --git a/lib/persubnode.c b/lib/persubnode.c
> new file mode 100644
> index 0000000..9febe7c
> --- /dev/null
> +++ b/lib/persubnode.c
> @@ -0,0 +1,119 @@
> +/*
> + * Per-subnode APIs
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * (C) Copyright 2016 Hewlett-Packard Enterprise Development LP
> + *
> + * Authors: Waiman Long <waiman.long@hpe.com>
> + */
> +
> +/*
> + * The per-subnode APIs work on top of the per-cpu APIs. Instead of
> + * having to manage n separate resources on a n-cpu system, users can
> + * now manage only n/m separate resources where m is the number of CPUs
> + * in each subnode. This cuts down the time needed to traverse all the
> + * resources but at the expense of some wasted per-cpu memory as part of
> + * the per-cpu memory will not be used.
> + *
> + * All the CPUs in the same subnode must come from the same NUMA node.
> + * However, there can be more than one subnode in each NUMA node.
> + *
> + * As the per-subnode APIs can be used early in the bootup process while
> + * not all the information needed for initialization may be available
> + * at that time, the initialization is separated into two steps:
> + *  1) an early init that is called directly from start_kernel(); and
> + *  2) a postcore init.
> + *
> + * Before initialization, all the subnode IDs of the CPUs will be zero. So
> + * they will all use the same resource at subnode 0. The early init copies
> + * the cpu_possible_mask to subnode_mask causing resource initialization
> + * to be done for all the per-cpu resources allocated. At the postcore
> + * init, some bits of the subnode_mask will be cleared and the corresponding
> + * cpu to subnode ID mapping will be set accordingly.
> + */
> +#include <linux/persubnode.h>
> +
> +DEFINE_PER_CPU_READ_MOSTLY(int, cpu_subnode_id);
> +EXPORT_PER_CPU_SYMBOL(cpu_subnode_id);
> +
> +struct cpumask __subnode_mask __read_mostly;
> +EXPORT_SYMBOL(__subnode_mask);
> +
> +/*
> + * Iterates all the CPUs from the given starting CPU
> + */
> +#define for_each_cpu_from(from, cpu, mask)		\
> +	for ((cpu) = (from); (cpu) < nr_cpu_ids;	\
> +	     (cpu) = cpumask_next((cpu), (mask)))
> +
> +/*
> + * Early subnode initialization to be called early in the boot process.
> + */
> +void __init subnode_early_init(void)
> +{
> +	/* Make subnode_mask the same as cpu_possible_mask */
> +	cpumask_copy(subnode_mask, cpu_possible_mask);
> +}
> +
> +/*
> + * Initialize the subnodes
> + *
> + * All the sibling CPUs will be in the same subnode. On top of that, we will
> + * put at most 2 sibling groups into the same subnode. The percpu
> + * topology_sibling_cpumask() and topology_core_cpumask() are used for
> + * grouping CPUs into subnodes. The subnode ID is the CPU number of the
> + * first CPU in the subnode.
> + */
> +static int __init subnode_init(void)
> +{
> +	int cpu;
> +	int nr_subnodes = 0;
> +	const int subnode_nr_cpus = 2;
> +
> +	/*
> +	 * Some of the bits in the subnode_mask will be cleared as we proceed.
> +	 */
> +	for_each_cpu(cpu, subnode_mask) {
> +		int ccpu, scpu;
> +		int cpucnt = 0;
> +
> +		cpumask_var_t core_mask = topology_core_cpumask(cpu);
> +		cpumask_var_t sibling_mask;
> +
> +		/*
> +		 * Put subnode_nr_cpus of CPUs and their siblings into each
> +		 * subnode.
> +		 */
> +		for_each_cpu_from(cpu, ccpu, core_mask) {
> +			sibling_mask = topology_sibling_cpumask(ccpu);
> +			for_each_cpu_from(ccpu, scpu, sibling_mask) {
> +				/*
> +				 * Clear the bits of the higher CPUs.
> +				 */
> +				if (scpu > cpu)
> +					cpumask_clear_cpu(scpu, subnode_mask);

Do we also need to clear the 'core_mask' here? Consider a core consist
of two sibling groups and each sibling group consist of two cpus. At the
beginning of the outer loop(for_each_cpu_from(cpu, ccpu, core_mask)):

'core_mask' is 0b1111

so at the beginning of the inner loop first time:

'ccpu' is 0, therefore 'sibling_mask' is 0b1100, in this loop we set the
'cpu_subnode_id' of cpu 0 and 1 to 0.

at the beginning of the inner loop second time:

'ccpu' is 1 because we don't clear cpu 1 from 'core_mask'. Therefore
sibling_mask is still 0b1100, so in this loop we still do the setting on
'cpu_subnode_id' of cpu 0 and 1.

Am I missing something here?

Regards,
Boqun

> +				per_cpu(cpu_subnode_id, scpu) = cpu;
> +			}
> +			if (++cpucnt == subnode_nr_cpus)
> +				break;	/* One subnode formed */
> +		}
> +		nr_subnodes++;
> +	}
> +	pr_info("Number of subnodes initialized = %d\n", nr_subnodes);
> +
> +	/*
> +	 * CPU 0 must be mapped to subnode 0
> +	 */
> +	BUG_ON(per_cpu(cpu_subnode_id, 0) != 0);
> +	return 0;
> +}
> +postcore_initcall(subnode_init);
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 6/7] lib/persubnode: Introducing a simple per-subnode APIs
  2016-07-11 17:32 ` [RFC PATCH v2 6/7] lib/persubnode: Introducing a simple per-subnode APIs Waiman Long
  2016-07-12  3:14   ` Boqun Feng
@ 2016-07-12 14:27   ` Tejun Heo
  2016-07-12 18:51     ` Waiman Long
  1 sibling, 1 reply; 24+ messages in thread
From: Tejun Heo @ 2016-07-12 14:27 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Christoph Lameter, linux-fsdevel, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Andi Kleen, Dave Chinner, Boqun Feng,
	Scott J Norton, Douglas Hatch

Hello,

On Mon, Jul 11, 2016 at 01:32:11PM -0400, Waiman Long wrote:
> The percpu APIs are extensively used in the Linux kernel to reduce
> cacheline contention and improve performance. For some use cases, the
> percpu APIs may be too fine-grain for distributed resources whereas
> a per-node based allocation may be too coarse as we can have dozens
> of CPUs in a NUMA node in some high-end systems.
> 
> This patch introduces a set of simple per-subnode APIs where each of the
> distributed resources will be shared by only a handful of CPUs within
> a NUMA node. The per-subnode APIs are built on top of the percpu APIs
> and hence require the same amount of memory as if the percpu APIs
> are used. However, it helps to reduce the total number of separate
> resources that need to be managed. As a result, it can speed up code
> that needs to iterate all the resources compared with using the percpu
> APIs. Cacheline contention, however, will increase slightly as each
> resource is shared by more than one CPU. As long as the number of CPUs
> in each subnode is small, the performance impact won't be significant.
> 
> In this patch, at most 2 sibling groups can be put into a subnode. For
> an x86-64 CPU, at most 4 CPUs will be in a subnode when HT is enabled
> and 2 when it is not.

I understand that there's a trade-off between local access and global
traversing and you're trying to find a sweet spot between the two, but
this seems pretty arbitrary.  What's the use case?  What are the
numbers?  Why are global traversals often enough to matter so much?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 6/7] lib/persubnode: Introducing a simple per-subnode APIs
  2016-07-12  3:14   ` Boqun Feng
@ 2016-07-12 18:37     ` Waiman Long
  0 siblings, 0 replies; 24+ messages in thread
From: Waiman Long @ 2016-07-12 18:37 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Scott J Norton, Douglas Hatch

On 07/11/2016 11:14 PM, Boqun Feng wrote:
> On Mon, Jul 11, 2016 at 01:32:11PM -0400, Waiman Long wrote:
>> +/*
>> + * Initialize the subnodes
>> + *
>> + * All the sibling CPUs will be in the same subnode. On top of that, we will
>> + * put at most 2 sibling groups into the same subnode. The percpu
>> + * topology_sibling_cpumask() and topology_core_cpumask() are used for
>> + * grouping CPUs into subnodes. The subnode ID is the CPU number of the
>> + * first CPU in the subnode.
>> + */
>> +static int __init subnode_init(void)
>> +{
>> +	int cpu;
>> +	int nr_subnodes = 0;
>> +	const int subnode_nr_cpus = 2;
>> +
>> +	/*
>> +	 * Some of the bits in the subnode_mask will be cleared as we proceed.
>> +	 */
>> +	for_each_cpu(cpu, subnode_mask) {
>> +		int ccpu, scpu;
>> +		int cpucnt = 0;
>> +
>> +		cpumask_var_t core_mask = topology_core_cpumask(cpu);
>> +		cpumask_var_t sibling_mask;
>> +
>> +		/*
>> +		 * Put subnode_nr_cpus of CPUs and their siblings into each
>> +		 * subnode.
>> +		 */
>> +		for_each_cpu_from(cpu, ccpu, core_mask) {
>> +			sibling_mask = topology_sibling_cpumask(ccpu);
>> +			for_each_cpu_from(ccpu, scpu, sibling_mask) {
>> +				/*
>> +				 * Clear the bits of the higher CPUs.
>> +				 */
>> +				if (scpu>  cpu)
>> +					cpumask_clear_cpu(scpu, subnode_mask);
> Do we also need to clear the 'core_mask' here? Consider a core consist
> of two sibling groups and each sibling group consist of two cpus. At the
> beginning of the outer loop(for_each_cpu_from(cpu, ccpu, core_mask)):
>
> 'core_mask' is 0b1111
>
> so at the beginning of the inner loop first time:
>
> 'ccpu' is 0, therefore 'sibling_mask' is 0b1100, in this loop we set the
> 'cpu_subnode_id' of cpu 0 and 1 to 0.
>
> at the beginning of the inner loop second time:
>
> 'ccpu' is 1 because we don't clear cpu 1 from 'core_mask'. Therefore
> sibling_mask is still 0b1100, so in this loop we still do the setting on
> 'cpu_subnode_id' of cpu 0 and 1.
>
> Am I missing something here?
>

You are right. The current code works in my test because the 2 sibling
CPUs occupy a lower and a higher number, like (0, 72) on a 72-core
system. It may not work for other sibling CPU assignments.

The core_mask, however, is a global data variable and we cannot modify 
it. I will make the following change instead:

diff --git a/lib/persubnode.c b/lib/persubnode.c
index 9febe7c..d1c8c29 100644
--- a/lib/persubnode.c
+++ b/lib/persubnode.c
@@ -94,6 +94,8 @@ static int __init subnode_init(void)
                  * subnode.
                  */
                 for_each_cpu_from(cpu, ccpu, core_mask) {
+                       if (!cpumask_test_cpu(ccpu, subnode_mask))
+                               continue;       /* Skip allocated CPU */
                         sibling_mask = topology_sibling_cpumask(ccpu);
                         for_each_cpu_from(ccpu, scpu, sibling_mask) {
                                 /*

Thanks for catching this bug.

Cheers,
Longman

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 6/7] lib/persubnode: Introducing a simple per-subnode APIs
  2016-07-12 14:27   ` Tejun Heo
@ 2016-07-12 18:51     ` Waiman Long
  2016-07-12 18:57       ` Tejun Heo
  0 siblings, 1 reply; 24+ messages in thread
From: Waiman Long @ 2016-07-12 18:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Christoph Lameter, linux-fsdevel, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Andi Kleen, Dave Chinner, Boqun Feng,
	Scott J Norton, Douglas Hatch

On 07/12/2016 10:27 AM, Tejun Heo wrote:
> Hello,
>
> On Mon, Jul 11, 2016 at 01:32:11PM -0400, Waiman Long wrote:
>> The percpu APIs are extensively used in the Linux kernel to reduce
>> cacheline contention and improve performance. For some use cases, the
>> percpu APIs may be too fine-grain for distributed resources whereas
>> a per-node based allocation may be too coarse as we can have dozens
>> of CPUs in a NUMA node in some high-end systems.
>>
>> This patch introduces a set of simple per-subnode APIs where each of the
>> distributed resources will be shared by only a handful of CPUs within
>> a NUMA node. The per-subnode APIs are built on top of the percpu APIs
>> and hence require the same amount of memory as if the percpu APIs
>> are used. However, it helps to reduce the total number of separate
>> resources that need to be managed. As a result, it can speed up code
>> that needs to iterate all the resources compared with using the percpu
>> APIs. Cacheline contention, however, will increase slightly as each
>> resource is shared by more than one CPU. As long as the number of CPUs
>> in each subnode is small, the performance impact won't be significant.
>>
>> In this patch, at most 2 sibling groups can be put into a subnode. For
>> an x86-64 CPU, at most 4 CPUs will be in a subnode when HT is enabled
>> and 2 when it is not.
> I understand that there's a trade-off between local access and global
> traversing and you're trying to find a sweet spot between the two, but
> this seems pretty arbitrary.  What's the use case?  What are the
> numbers?  Why are global traversals often enough to matter so much?

The last 2 RFC patches were created in response to Andi's comment to 
have coarser granularity than per-cpu. In this particular use case, I 
don't think global list traversals are frequent enough to really have 
any noticeable performance impact. So I don't have any benchmark number 
to support this change. However, it may not be true for other future use 
cases.

These 2 patches were created to gauge if using a per-subnode API for 
this use case is a good idea or not. I am perfectly happy to keep it as 
per-cpu and scrap the last 2 RFC patches. My main goal is to make this 
patchset more acceptable to be moved forward instead of staying in limbo.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 6/7] lib/persubnode: Introducing a simple per-subnode APIs
  2016-07-12 18:51     ` Waiman Long
@ 2016-07-12 18:57       ` Tejun Heo
  2016-07-12 19:42         ` Waiman Long
  0 siblings, 1 reply; 24+ messages in thread
From: Tejun Heo @ 2016-07-12 18:57 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Christoph Lameter, linux-fsdevel, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Andi Kleen, Dave Chinner, Boqun Feng,
	Scott J Norton, Douglas Hatch

Hello,

On Tue, Jul 12, 2016 at 02:51:31PM -0400, Waiman Long wrote:
> The last 2 RFC patches were created in response to Andi's comment to have
> coarser granularity than per-cpu. In this particular use case, I don't think
> global list traversals are frequent enough to really have any noticeable
> performance impact. So I don't have any benchmark number to support this
> change. However, it may not be true for other future use cases.
> 
> These 2 patches were created to gauge if using a per-subnode API for this
> use case is a good idea or not. I am perfectly happy to keep it as per-cpu
> and scrap the last 2 RFC patches. My main goal is to make this patchset more
> acceptable to be moved forward instead of staying in limbo.

I see.  I don't think it makes sense to add a whole new API for a use
case which doesn't really need it without any backing data.  It
probably would be best to revisit this when we're dealing with an
actually problematic case.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 6/7] lib/persubnode: Introducing a simple per-subnode APIs
  2016-07-12 18:57       ` Tejun Heo
@ 2016-07-12 19:42         ` Waiman Long
  0 siblings, 0 replies; 24+ messages in thread
From: Waiman Long @ 2016-07-12 19:42 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Christoph Lameter, linux-fsdevel, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Andi Kleen, Dave Chinner, Boqun Feng,
	Scott J Norton, Douglas Hatch

On 07/12/2016 02:57 PM, Tejun Heo wrote:
> Hello,
>
> On Tue, Jul 12, 2016 at 02:51:31PM -0400, Waiman Long wrote:
>> The last 2 RFC patches were created in response to Andi's comment to have
>> coarser granularity than per-cpu. In this particular use case, I don't think
>> global list traversals are frequent enough to really have any noticeable
>> performance impact. So I don't have any benchmark number to support this
>> change. However, it may not be true for other future use cases.
>>
>> These 2 patches were created to gauge if using a per-subnode API for this
>> use case is a good idea or not. I am perfectly happy to keep it as per-cpu
>> and scrap the last 2 RFC patches. My main goal is to make this patchset more
>> acceptable to be moved forward instead of staying in limbo.
> I see.  I don't think it makes sense to add a whole new API for a use
> case which doesn't really need it without any backing data.  It
> probably would be best to revisit this when we're dealing with an
> actually problematic case.
>
> Thanks.
>

I am fine with that. BTW, do you think patches 1-5 are good enough to be 
merged in a future release or is there further improvement that needs to 
be made?

Thanks,
Longman

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 1/7] lib/dlock-list: Distributed and lock-protected lists
  2016-07-11 17:32 ` [PATCH v2 1/7] lib/dlock-list: Distributed and lock-protected lists Waiman Long
@ 2016-07-13 16:08   ` Tejun Heo
  2016-07-14  2:54     ` Waiman Long
  0 siblings, 1 reply; 24+ messages in thread
From: Tejun Heo @ 2016-07-13 16:08 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Christoph Lameter, linux-fsdevel, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Andi Kleen, Dave Chinner, Boqun Feng,
	Scott J Norton, Douglas Hatch

Hello,

On Mon, Jul 11, 2016 at 01:32:06PM -0400, Waiman Long wrote:
...
> A new header file include/linux/dlock-list.h will be added with the

Heh, I think percpu_list was the better name but suppose I'm too late.

> associated dlock_list_head and dlock_list_node structures. The following
> functions are provided to manage the per-cpu list:
> 
>  1. int init_dlock_list_head(struct dlock_list_head **pdlock_head)
>  2. void dlock_list_add(struct dlock_list_node *node,
> 		        struct dlock_list_head *head)
>  3. void dlock_list_del(struct dlock_list *node)
> 
> Iteration of all the list entries within a group of per-cpu
> lists is done by calling either the dlock_list_iterate() or
> dlock_list_iterate_safe() functions in a while loop. They correspond
> to the list_for_each_entry() and list_for_each_entry_safe() macros
> respectively. The iteration states are kept in a dlock_list_state
> structure that is passed to the iteration functions.

Why do we need two variants of this?  We need a state variable to walk
the list anyway.  Why not make dlock_list_iterate() safe against
removal and get rid of the second variant?  Also, dlock_list_next()
probably is a better name.

> +/*
> + * include/linux/dlock-list.h
> + *
> + * A distributed (per-cpu) set of lists each of which is protected by its
> + * own spinlock, but acts like a single consolidated list to the callers.
> + *
> + * The dlock_list_head structure contains the spinlock, the other
> + * dlock_list_node structures only contains a pointer to the spinlock in
> + * dlock_list_head.
> + */
> +struct dlock_list_head {
> +	struct list_head list;
> +	spinlock_t lock;
> +};
> +
> +#define DLOCK_LIST_HEAD_INIT(name)				\
> +	{							\
> +		.list.prev = &name.list,			\
> +		.list.next = &name.list,			\
> +		.list.lock = __SPIN_LOCK_UNLOCKED(name),	\
> +	}

This is confusing.  It looks like dlock_list_head and
DLOCK_LIST_HEAD_INIT() can be used to define and initialize static
dlock_lists but that isn't true.  It's weird to require the user to
deal with percpu declaration of the data type.  Shouldn't it be more
like the following?

struct dlock_list_head_cpu {
	struct list_head list;
	spinlock_t lock;
};

struct dlock_list_head {
	struct dlock_list_head_percpu *head_cpu;
};

> +/*
> + * Per-cpu list iteration state
> + */
> +struct dlock_list_state {
> +	int			 cpu;
> +	spinlock_t		*lock;
> +	struct list_head	*head;	/* List head of current per-cpu list */
> +	struct dlock_list_node	*curr;
> +	struct dlock_list_node	*next;
> +};

Maybe dlock_list_iter[ator] is a better name?

> +#define DLOCK_LIST_STATE_INIT()			\
> +	{					\
> +		.cpu  = -1,			\
> +		.lock = NULL,			\
> +		.head = NULL,			\
> +		.curr = NULL,			\
> +		.next = NULL,			\
> +	}

The NULL inits are unnecessary and prone to being left behind.

> +#define DEFINE_DLOCK_LIST_STATE(s)		\
> +	struct dlock_list_state s = DLOCK_LIST_STATE_INIT()
> +
> +static inline void init_dlock_list_state(struct dlock_list_state *state)
> +{
> +	state->cpu  = -1;
> +	state->lock = NULL;
> +	state->head = NULL;
> +	state->curr = NULL;
> +	state->next = NULL;
> +}

Why not "state = (struct dlock_list_state)DLOCK_LIST_STATE_INIT;"?

> +#ifdef CONFIG_DEBUG_SPINLOCK
> +#define DLOCK_LIST_WARN_ON(x)	WARN_ON(x)
> +#else
> +#define DLOCK_LIST_WARN_ON(x)
> +#endif

I'd just use WARN_ON_ONCE() without the CONFIG guard.

> +/*
> + * Next per-cpu list entry
> + */
> +#define dlock_list_next_entry(pos, member) list_next_entry(pos, member.list)

Why does this need to be exposed?

> +/*
> + * Per-cpu node data structure
> + */
> +struct dlock_list_node {
> +	struct list_head list;
> +	spinlock_t *lockptr;
> +};
> +
> +#define DLOCK_LIST_NODE_INIT(name)		\
> +	{					\
> +		.list.prev = &name.list,	\
> +		.list.next = &name.list,	\
> +		.list.lockptr = NULL		\
> +	}

Ditto with NULL init.

> +static inline void init_dlock_list_node(struct dlock_list_node *node)
> +{
> +	INIT_LIST_HEAD(&node->list);
> +	node->lockptr = NULL;
> +}

Ditto with init.

> +static inline void free_dlock_list_head(struct dlock_list_head **pdlock_head)
> +{
> +	free_percpu(*pdlock_head);
> +	*pdlock_head = NULL;
> +}

Why does this need to be inlined?

> +/*
> + * Check if all the per-cpu lists are empty
> + */

Please use proper function comments.

> +static inline bool dlock_list_empty(struct dlock_list_head *dlock_head)
> +{
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu)
> +		if (!list_empty(&per_cpu_ptr(dlock_head, cpu)->list))
> +			return false;
> +	return true;
> +}
> +
> +/*
> + * Helper function to find the first entry of the next per-cpu list
> + * It works somewhat like for_each_possible_cpu(cpu).
> + *
> + * Return: true if the entry is found, false if all the lists exhausted

Ditto about the comment.

> + */
> +static __always_inline bool
> +__dlock_list_next_cpu(struct dlock_list_head *head,
> +		      struct dlock_list_state *state)
> +{
...
> +static inline bool dlock_list_iterate_safe(struct dlock_list_head *head,
> +					   struct dlock_list_state *state)
> +{

Inlining these doesn't make senes to me.

> diff --git a/lib/Makefile b/lib/Makefile
> index 499fb35..92e8c38 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> +/*
> + * The dlock list lock needs its own class to avoid warning and stack
> + * trace when lockdep is enabled.
> + */

Can you please elaborate on this?

> +static struct lock_class_key dlock_list_key;
> +
> +/*
> + * Initialize the per-cpu list head
> + */
> +int init_dlock_list_head(struct dlock_list_head **pdlock_head)
> +{
> +	struct dlock_list_head *dlock_head;
> +	int cpu;
> +
> +	dlock_head = alloc_percpu(struct dlock_list_head);
> +	if (!dlock_head)
> +		return -ENOMEM;
> +
> +	for_each_possible_cpu(cpu) {
> +		struct dlock_list_head *head = per_cpu_ptr(dlock_head, cpu);
> +
> +		INIT_LIST_HEAD(&head->list);
> +		head->lock = __SPIN_LOCK_UNLOCKED(&head->lock);
> +		lockdep_set_class(&head->lock, &dlock_list_key);
> +	}
> +
> +	*pdlock_head = dlock_head;
> +	return 0;
> +}

Why is this called init?  Why not do the following?

struct dlock_list_head *alloc_dlock_list_head(void);

Also, the pointer type needs to include __percpu annotation.

> +/*
> + * List selection is based on the CPU being used when the dlock_list_add()
> + * function is called. However, deletion may be done by a different CPU.
> + * So we still need to use a lock to protect the content of the list.
> + */
> +void dlock_list_add(struct dlock_list_node *node, struct dlock_list_head *head)
> +{
> +	struct dlock_list_head *myhead;
> +
> +	/*
> +	 * Disable preemption to make sure that CPU won't gets changed.
> +	 */
> +	myhead = get_cpu_ptr(head);
> +	spin_lock(&myhead->lock);
> +	node->lockptr = &myhead->lock;
> +	list_add(&node->list, &myhead->list);
> +	spin_unlock(&myhead->lock);
> +	put_cpu_ptr(head);
> +}

I wonder whether it'd be better to use irqsafe operations.  lists tend
to be often used from irq contexts.

> +/*
> + * Delete a node from a dlock list
> + *
> + * We need to check the lock pointer again after taking the lock to guard
> + * against concurrent delete of the same node. If the lock pointer changes
> + * (becomes NULL or to a different one), we assume that the deletion was done
> + * elsewhere.
> + */
> +void dlock_list_del(struct dlock_list_node *node)
> +{
> +	spinlock_t *lock = READ_ONCE(node->lockptr);
> +
> +	if (unlikely(!lock)) {
> +		WARN(1, "dlock_list_del: node 0x%lx has no associated lock\n",
> +			(unsigned long)node);
> +		return;
> +	}
> +
> +	spin_lock(lock);
> +	if (likely(lock == node->lockptr)) {
> +		list_del_init(&node->list);
> +		node->lockptr = NULL;
> +	} else {
> +		/*
> +		 * This path should never be executed.
> +		 */

What if it races against someone else removing and adding back?
Shouldn't it retry on those cases?
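
For illustration, a retrying delete might look roughly like the sketch
below (an assumption about one possible change, not code from the
posted series):

void dlock_list_del(struct dlock_list_node *node)
{
	spinlock_t *lock;

retry:
	lock = READ_ONCE(node->lockptr);
	if (!lock)
		return;		/* already deleted elsewhere */

	spin_lock(lock);
	if (likely(lock == node->lockptr)) {
		list_del_init(&node->list);
		node->lockptr = NULL;
		spin_unlock(lock);
		return;
	}

	/* Re-added under a different sublist lock; try again. */
	spin_unlock(lock);
	goto retry;
}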

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 2/7] lib/dlock-list: Add __percpu modifier for parameters
  2016-07-11 17:32 ` [PATCH v2 2/7] lib/dlock-list: Add __percpu modifier for parameters Waiman Long
@ 2016-07-13 16:08   ` Tejun Heo
  2016-07-14  2:54     ` Waiman Long
  0 siblings, 1 reply; 24+ messages in thread
From: Tejun Heo @ 2016-07-13 16:08 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Christoph Lameter, linux-fsdevel, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Andi Kleen, Dave Chinner, Boqun Feng,
	Scott J Norton, Douglas Hatch

On Mon, Jul 11, 2016 at 01:32:07PM -0400, Waiman Long wrote:
> From: Boqun Feng <boqun.feng@gmail.com>
> 
> Add __percpu modifier properly to help:
> 
> 1.	Differ pointers to actual structures with those to percpu
> 	structures, which could improve readability.
> 
> 2. 	Prevent sparse from complaining about "different address spaces"
> 
> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> Signed-off-by: Waiman Long <Waiman.Long@hpe.com>

Please do this in the same patch.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 1/7] lib/dlock-list: Distributed and lock-protected lists
  2016-07-13 16:08   ` Tejun Heo
@ 2016-07-14  2:54     ` Waiman Long
  2016-07-14 11:50       ` Tejun Heo
  0 siblings, 1 reply; 24+ messages in thread
From: Waiman Long @ 2016-07-14  2:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Christoph Lameter, linux-fsdevel, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Andi Kleen, Dave Chinner, Boqun Feng,
	Scott J Norton, Douglas Hatch

On 07/13/2016 12:08 PM, Tejun Heo wrote:
> Hello,
>
> On Mon, Jul 11, 2016 at 01:32:06PM -0400, Waiman Long wrote:
> ...
>> A new header file include/linux/dlock-list.h will be added with the
> Heh, I think percpu_list was the better name but suppose I'm too late.

Given the fact that we may change the level of lock granularity in the 
future, it is too late to change it back to percpu_list.

>> associated dlock_list_head and dlock_list_node structures. The following
>> functions are provided to manage the per-cpu list:
>>
>>   1. int init_dlock_list_head(struct dlock_list_head **pdlock_head)
>>   2. void dlock_list_add(struct dlock_list_node *node,
>> 		        struct dlock_list_head *head)
>>   3. void dlock_list_del(struct dlock_list *node)
>>
>> Iteration of all the list entries within a group of per-cpu
>> lists is done by calling either the dlock_list_iterate() or
>> dlock_list_iterate_safe() functions in a while loop. They correspond
>> to the list_for_each_entry() and list_for_each_entry_safe() macros
>> respectively. The iteration states are kept in a dlock_list_state
>> structure that is passed to the iteration functions.
> Why do we need two variants of this?  We need a state variable to walk
> the list anyway.  Why not make dlock_list_iterate() safe against
> removal and get rid of the second variant?  Also, dlock_list_next()
> probably is a better name.

The two variants are the equivalent of list_for_each_entry() and 
list_for_each_entry_safe(). There are scenarios where 
list_for_each_entry_safe() isn't really safe when the next pointer of 
the current node is modified, for example. In this particular use case, 
the safe version isn't needed as all the list_for_each_entry_safe() call 
sites were modified to use list_for_each_entry() instead. I keep the 
dlock_list_iterate_safe() for future use cases that may need the safe 
version.

As it is a macro, it won't consume any resource if it is not used. 
However, I can remove it if you think we shouldn't carry any code that 
is not currently used.

Yes, I can change the name to dlock_list_next().

>> +/*
>> + * include/linux/dlock-list.h
>> + *
>> + * A distributed (per-cpu) set of lists each of which is protected by its
>> + * own spinlock, but acts like a single consolidated list to the callers.
>> + *
>> + * The dlock_list_head structure contains the spinlock, the other
>> + * dlock_list_node structures only contains a pointer to the spinlock in
>> + * dlock_list_head.
>> + */
>> +struct dlock_list_head {
>> +	struct list_head list;
>> +	spinlock_t lock;
>> +};
>> +
>> +#define DLOCK_LIST_HEAD_INIT(name)				\
>> +	{							\
>> +		.list.prev =&name.list,			\
>> +		.list.next =&name.list,			\
>> +		.list.lock = __SPIN_LOCK_UNLOCKED(name),	\
>> +	}
> This is confusing.  It looks like dlock_list_head and
> DLOCK_LIST_HEAD_INIT() can be used to define and initialize static
> dlock_lists but that isn't true.  It's weird to require the user to
> deal with percpu declaration of the data type.  Shouldn't it be more
> like the following?
>
> struct dlock_list_head_cpu {
> 	struct list_head list;
> 	spinlock_t lock;
> };
>
> struct dlock_list_head {
> 	struct dlock_list_head_percpu *head_cpu;
> };

I think I know what you want. I will update the code accordingly.

>> +/*
>> + * Per-cpu list iteration state
>> + */
>> +struct dlock_list_state {
>> +	int			 cpu;
>> +	spinlock_t		*lock;
>> +	struct list_head	*head;	/* List head of current per-cpu list */
>> +	struct dlock_list_node	*curr;
>> +	struct dlock_list_node	*next;
>> +};
> Maybe dlock_list_iter[ator] is a better name?

Sure, I can rename it.

>
>> +#define DLOCK_LIST_STATE_INIT()			\
>> +	{					\
>> +		.cpu  = -1,			\
>> +		.lock = NULL,			\
>> +		.head = NULL,			\
>> +		.curr = NULL,			\
>> +		.next = NULL,			\
>> +	}
> The NULL inits are unnecessary and prone to being left behind.

Will remove those NULL inits.

>
>> +#define DEFINE_DLOCK_LIST_STATE(s)		\
>> +	struct dlock_list_state s = DLOCK_LIST_STATE_INIT()
>> +
>> +static inline void init_dlock_list_state(struct dlock_list_state *state)
>> +{
>> +	state->cpu  = -1;
>> +	state->lock = NULL;
>> +	state->head = NULL;
>> +	state->curr = NULL;
>> +	state->next = NULL;
>> +}
> Why not "state = (struct dlock_list_state)DLOCK_LIST_STATE_INIT;"?

Yes, that should work too.

>
>> +#ifdef CONFIG_DEBUG_SPINLOCK
>> +#define DLOCK_LIST_WARN_ON(x)	WARN_ON(x)
>> +#else
>> +#define DLOCK_LIST_WARN_ON(x)
>> +#endif
> I'd just use WARN_ON_ONCE() without the CONFIG guard.

I'd do so if it is used in a real function. However, it is used in the 
dlock_list_iterate() macro which will duplicate the WARN_ON call on all 
the call sites. That is why I am hesitant to do that.

>
>> +/*
>> + * Next per-cpu list entry
>> + */
>> +#define dlock_list_next_entry(pos, member) list_next_entry(pos, member.list)
> Why does this need to be exposed?

It was used in one of the modified files. I can change it to an inline
function. Or do you mean making it a real function and putting it in
the .c file?

>
>> +/*
>> + * Per-cpu node data structure
>> + */
>> +struct dlock_list_node {
>> +	struct list_head list;
>> +	spinlock_t *lockptr;
>> +};
>> +
>> +#define DLOCK_LIST_NODE_INIT(name)		\
>> +	{					\
>> +		.list.prev =&name.list,	\
>> +		.list.next =&name.list,	\
>> +		.list.lockptr = NULL		\
>> +	}
> Ditto with NULL init.

Will remove it.

>> +static inline void init_dlock_list_node(struct dlock_list_node *node)
>> +{
>> +	INIT_LIST_HEAD(&node->list);
>> +	node->lockptr = NULL;
>> +}
> Ditto with init.
>
>> +static inline void free_dlock_list_head(struct dlock_list_head **pdlock_head)
>> +{
>> +	free_percpu(*pdlock_head);
>> +	*pdlock_head = NULL;
>> +}
> Why does this need to be inlined?

It was inlined because it is very simple. I can move it to the .c file
to have more data abstraction.

>> +/*
>> + * Check if all the per-cpu lists are empty
>> + */
> Please use proper function comments.
>

Will do.

>> +static inline bool dlock_list_empty(struct dlock_list_head *dlock_head)
>> +{
>> +	int cpu;
>> +
>> +	for_each_possible_cpu(cpu)
>> +		if (!list_empty(&per_cpu_ptr(dlock_head, cpu)->list))
>> +			return false;
>> +	return true;
>> +}
>> +
>> +/*
>> + * Helper function to find the first entry of the next per-cpu list
>> + * It works somewhat like for_each_possible_cpu(cpu).
>> + *
>> + * Return: true if the entry is found, false if all the lists exhausted
> Ditto about the comment.
>
>> + */
>> +static __always_inline bool
>> +__dlock_list_next_cpu(struct dlock_list_head *head,
>> +		      struct dlock_list_state *state)
>> +{
> ...
>> +static inline bool dlock_list_iterate_safe(struct dlock_list_head *head,
>> +					   struct dlock_list_state *state)
>> +{
> Inlining these doesn't make senes to me.

They are inlined for performance reasons. I can make them real functions
if you think it is better.

>> diff --git a/lib/Makefile b/lib/Makefile
>> index 499fb35..92e8c38 100644
>> --- a/lib/Makefile
>> +++ b/lib/Makefile
>> +/*
>> + * The dlock list lock needs its own class to avoid warning and stack
>> + * trace when lockdep is enabled.
>> + */
> Can you please elaborate on this?

Without that, a kernel warning was given out and I think something got
disabled when the kernel booted up with lockdep enabled. I think it is
because the locks were dynamically allocated. It will not be a problem
if they are statically allocated.

>> +static struct lock_class_key dlock_list_key;
>> +
>> +/*
>> + * Initialize the per-cpu list head
>> + */
>> +int init_dlock_list_head(struct dlock_list_head **pdlock_head)
>> +{
>> +	struct dlock_list_head *dlock_head;
>> +	int cpu;
>> +
>> +	dlock_head = alloc_percpu(struct dlock_list_head);
>> +	if (!dlock_head)
>> +		return -ENOMEM;
>> +
>> +	for_each_possible_cpu(cpu) {
>> +		struct dlock_list_head *head = per_cpu_ptr(dlock_head, cpu);
>> +
>> +		INIT_LIST_HEAD(&head->list);
>> +		head->lock = __SPIN_LOCK_UNLOCKED(&head->lock);
>> +		lockdep_set_class(&head->lock,&dlock_list_key);
>> +	}
>> +
>> +	*pdlock_head = dlock_head;
>> +	return 0;
>> +}
> Why is this called init?  Why not do the following?
>
> struct dlock_list_head *alloc_dlock_list_head(void);
>
> Also, the pointer type needs to include __percpu annotation.

Thanks for the suggestion. Will change it.
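
A sketch of what the renamed allocator could look like (an assumption
about the follow-up change, not posted code):

struct dlock_list_head __percpu *alloc_dlock_list_head(void)
{
	struct dlock_list_head __percpu *dlock_head;
	int cpu;

	dlock_head = alloc_percpu(struct dlock_list_head);
	if (!dlock_head)
		return NULL;

	for_each_possible_cpu(cpu) {
		struct dlock_list_head *head = per_cpu_ptr(dlock_head, cpu);

		INIT_LIST_HEAD(&head->list);
		head->lock = __SPIN_LOCK_UNLOCKED(&head->lock);
		lockdep_set_class(&head->lock, &dlock_list_key);
	}
	return dlock_head;
}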

>> +/*
>> + * List selection is based on the CPU being used when the dlock_list_add()
>> + * function is called. However, deletion may be done by a different CPU.
>> + * So we still need to use a lock to protect the content of the list.
>> + */
>> +void dlock_list_add(struct dlock_list_node *node, struct dlock_list_head *head)
>> +{
>> +	struct dlock_list_head *myhead;
>> +
>> +	/*
>> +	 * Disable preemption to make sure that CPU won't gets changed.
>> +	 */
>> +	myhead = get_cpu_ptr(head);
>> +	spin_lock(&myhead->lock);
>> +	node->lockptr =&myhead->lock;
>> +	list_add(&node->list,&myhead->list);
>> +	spin_unlock(&myhead->lock);
>> +	put_cpu_ptr(head);
>> +}
> I wonder whether it'd be better to use irqsafe operations.  lists tend
> to be often used from irq contexts.

The current use case only needs to use the regular lock functions. You 
are right that future use cases may require an irqsafe version of locks. 
I can either modify the code now to allow lock type selection at init 
time, for example, or defer it as a future enhancement when the need 
arises. What do you think?

>
>> +/*
>> + * Delete a node from a dlock list
>> + *
>> + * We need to check the lock pointer again after taking the lock to guard
>> + * against concurrent delete of the same node. If the lock pointer changes
>> + * (becomes NULL or to a different one), we assume that the deletion was done
>> + * elsewhere.
>> + */
>> +void dlock_list_del(struct dlock_list_node *node)
>> +{
>> +	spinlock_t *lock = READ_ONCE(node->lockptr);
>> +
>> +	if (unlikely(!lock)) {
>> +		WARN(1, "dlock_list_del: node 0x%lx has no associated lock\n",
>> +			(unsigned long)node);
>> +		return;
>> +	}
>> +
>> +	spin_lock(lock);
>> +	if (likely(lock == node->lockptr)) {
>> +		list_del_init(&node->list);
>> +		node->lockptr = NULL;
>> +	} else {
>> +		/*
>> +		 * This path should never be executed.
>> +		 */
> What if it races against someone else removing and adding back?
> Shouldn't it retry on those cases?

If it is racing with another thread removing and adding back, retrying
the deletion may not be the right thing to do. It really depends on what
kind of deletion the callers are doing. I will clarify in the comment
that further changes may be needed in the future if callers are doing
complicated deletions that require more elaborate exception handling.

Thanks for the review comments.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 2/7] lib/dlock-list: Add __percpu modifier for parameters
  2016-07-13 16:08   ` Tejun Heo
@ 2016-07-14  2:54     ` Waiman Long
  0 siblings, 0 replies; 24+ messages in thread
From: Waiman Long @ 2016-07-14  2:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Christoph Lameter, linux-fsdevel, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Andi Kleen, Dave Chinner, Boqun Feng,
	Scott J Norton, Douglas Hatch

On 07/13/2016 12:08 PM, Tejun Heo wrote:
> On Mon, Jul 11, 2016 at 01:32:07PM -0400, Waiman Long wrote:
>> From: Boqun Feng<boqun.feng@gmail.com>
>>
>> Add __percpu modifier properly to help:
>>
>> 1.	Differ pointers to actual structures with those to percpu
>> 	structures, which could improve readability.
>>
>> 2. 	Prevent sparse from complaining about "different address spaces"
>>
>> Signed-off-by: Boqun Feng<boqun.feng@gmail.com>
>> Signed-off-by: Waiman Long<Waiman.Long@hpe.com>
> Please do this in the same patch.
>
> Thanks.
>

Will merge it into patch 1.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 1/7] lib/dlock-list: Distributed and lock-protected lists
  2016-07-14  2:54     ` Waiman Long
@ 2016-07-14 11:50       ` Tejun Heo
  2016-07-14 14:35         ` Jan Kara
  2016-07-14 17:13         ` Waiman Long
  0 siblings, 2 replies; 24+ messages in thread
From: Tejun Heo @ 2016-07-14 11:50 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Christoph Lameter, linux-fsdevel, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Andi Kleen, Dave Chinner, Boqun Feng,
	Scott J Norton, Douglas Hatch

Hello,

On Wed, Jul 13, 2016 at 10:54:19PM -0400, Waiman Long wrote:
> On 07/13/2016 12:08 PM, Tejun Heo wrote:
> > On Mon, Jul 11, 2016 at 01:32:06PM -0400, Waiman Long wrote:
> > ...
> > > A new header file include/linux/dlock-list.h will be added with the
> > Heh, I think percpu_list was the better name but suppose I'm too late.
> 
> Given the fact that we may change the level of lock granularity in the
> future, it is too late to change it back to percpu_list.

It's more about the consistency.  We have other data structures which
are in a similar situation.  If we end up doing Nth CPU for dlist,
we'll probably end up doing the same thing for other percpu data
structures.

> > Why do we need two variants of this?  We need a state variable to walk
> > the list anyway.  Why not make dlock_list_iterate() safe against
> > removal and get rid of the second variant?  Also, dlock_list_next()
> > probably is a better name.
> 
> The two variants are the equivalent of list_for_each_entry() and
> list_for_each_entry_safe(). There are scenarios where
> list_for_each_entry_safe() isn't really safe when the next pointer of the
> current node is modified, for example. In this particular use case, the safe
> version isn't needed as all the list_for_each_entry_safe() call sites were
> modified to use list_for_each_entry() instead. I keep the
> dlock_list_iterate_safe() for future use cases that may need the safe
> version.

I don't think it makes sense to worry about the cases where the next
entry to iterate may be removed by the iterator.  What I'm trying to
say is just make the iteration always safe and don't worry about the
distinction.  For list_for_each_entry(), it makes the difference of
requiring and not requiring a separate state variable.  Here, we need
it anyway.
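
A sketch of that single always-safe iterator (the dlock_list_next()
name and the details are assumptions, not posted code; it just
prefetches the next node the way the current _safe variant already
does):

static bool dlock_list_next(struct dlock_list_head __percpu *head,
			    struct dlock_list_state *state)
{
	/* advance to the node prefetched on the previous call */
	state->curr = state->next;

	if (!state->curr || &state->curr->list == state->head) {
		/* current sublist exhausted, move on to the next one */
		if (!__dlock_list_next_cpu(head, state))
			return false;
	}

	/* prefetch the follower so removal of state->curr is tolerated */
	state->next = list_next_entry(state->curr, list);
	return true;
}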

> > > +#ifdef CONFIG_DEBUG_SPINLOCK
> > > +#define DLOCK_LIST_WARN_ON(x)	WARN_ON(x)
> > > +#else
> > > +#define DLOCK_LIST_WARN_ON(x)
> > > +#endif
> > I'd just use WARN_ON_ONCE() without the CONFIG guard.
> 
> I'd do so if it is used in a real function. However, it is used in the
> dlock_list_iterate() macro which will duplicate the WARN_ON call on all the
> call sites. That is why I am hesitant to do that.

They're used in inline functions but not macros.  Just uninline the
functions?

> > > +/*
> > > + * Next per-cpu list entry
> > > + */
> > > +#define dlock_list_next_entry(pos, member) list_next_entry(pos, member.list)
> > Why does this need to be exposed?
> 
> It was used in one of the modified files. I can change it to an inline
> function. Or do you mean making it a real function and putting it in
> the .c file?

If it's used, never mind.

> > > +static __always_inline bool
> > > +__dlock_list_next_cpu(struct dlock_list_head *head,
> > > +		      struct dlock_list_state *state)
> > > +{
> > ...
> > > +static inline bool dlock_list_iterate_safe(struct dlock_list_head *head,
> > > +					   struct dlock_list_state *state)
> > > +{
> > Inlining these doesn't make senes to me.
> 
> They are inlined for performance reasons. I can make them real functions
> if you think it is better.

The functions aren't that small, and inlining functions which aren't
very small tends to become more expensive pretty quickly with multiple
users.  I'd just make them proper functions.

> > > diff --git a/lib/Makefile b/lib/Makefile
> > > index 499fb35..92e8c38 100644
> > > --- a/lib/Makefile
> > > +++ b/lib/Makefile
> > > +/*
> > > + * The dlock list lock needs its own class to avoid warning and stack
> > > + * trace when lockdep is enabled.
> > > + */
> > Can you please elaborate on this?
> 
> Without that, a kernel warning was given out and I think something got
> disabled when the kernel booted up with lockdep enabled. I think it is
> because the locks were dynamically allocated. It will not be a problem
> if they are statically allocated.

Ah, okay.  Can you update the comment to include the above
information?  It doesn't have to be long.  Just mention that this is
the lockdep key for the dlock data structure, which is dynamically
allocated.
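
For example, something along these lines (wording is only a
suggestion):

/*
 * The dlock list heads and hence their spinlocks are dynamically
 * allocated.  Give them a static lockdep class key of their own so
 * that lockdep does not warn (and disable itself) at boot.
 */
static struct lock_class_key dlock_list_key;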

> > > +void dlock_list_add(struct dlock_list_node *node, struct dlock_list_head *head)
> > > +{
> > > +	struct dlock_list_head *myhead;
> > > +
> > > +	/*
> > > +	 * Disable preemption to make sure that CPU won't gets changed.
> > > +	 */
> > > +	myhead = get_cpu_ptr(head);
> > > +	spin_lock(&myhead->lock);
> > > +	node->lockptr =&myhead->lock;
> > > +	list_add(&node->list,&myhead->list);
> > > +	spin_unlock(&myhead->lock);
> > > +	put_cpu_ptr(head);
> > > +}
> > I wonder whether it'd be better to use irqsafe operations.  lists tend
> > to be often used from irq contexts.
> 
> The current use case only needs to use the regular lock functions. You are
> right that future use cases may require an irqsafe version of locks. I can
> either modify the code now to allow lock type selection at init time, for
> example, or defer it as a future enhancement when the need arises. What do
> you think?

The bulk of performance gain of dlist would come from being per-cpu
and I don't think it's likely that we'd see any noticeable difference
between irq and preempt safe operations.  Given that what's being
implemented is really low level operations, I'd suggest going with
irqsafe from the get-go.

> > > +void dlock_list_del(struct dlock_list_node *node)
> > > +{
> > > +	spinlock_t *lock = READ_ONCE(node->lockptr);
> > > +
> > > +	if (unlikely(!lock)) {
> > > +		WARN(1, "dlock_list_del: node 0x%lx has no associated lock\n",
> > > +			(unsigned long)node);
> > > +		return;
> > > +	}
> > > +
> > > +	spin_lock(lock);
> > > +	if (likely(lock == node->lockptr)) {
> > > +		list_del_init(&node->list);
> > > +		node->lockptr = NULL;
> > > +	} else {
> > > +		/*
> > > +		 * This path should never be executed.
> > > +		 */
> > What if it races against someone else removing and adding back?
> > Shouldn't it retry on those cases?
> 
> If it is racing with another thread removing and adding back, retrying the
> deletion may not be the right thing to do. It really depends on what kind of
> deletion the callers are doing. I will clarify in the comment that further
> changes may be needed in the future if callers start doing complicated
> deletions that require more elaborate exception handling.

I haven't really thought about it but it could be that in the racing
cases the order doesn't matter and just skipping the operation is
fine.  I'm not sure triggering a warning is the right answer tho.
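
If retrying does turn out to be the right behavior, it could be as
simple as something like this (rough sketch only, not a request for
this exact code):

/* same includes/types as the posted lib/dlock-list.c */
void dlock_list_del_retry(struct dlock_list_node *node)
{
	spinlock_t *lock;

	for (;;) {
		lock = READ_ONCE(node->lockptr);
		if (!lock)
			return;		/* already deleted */

		spin_lock(lock);
		if (likely(lock == READ_ONCE(node->lockptr))) {
			list_del_init(&node->list);
			WRITE_ONCE(node->lockptr, NULL);
			spin_unlock(lock);
			return;
		}
		/* node was re-added to another list -- retry */
		spin_unlock(lock);
	}
}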

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 1/7] lib/dlock-list: Distributed and lock-protected lists
  2016-07-14 11:50       ` Tejun Heo
@ 2016-07-14 14:35         ` Jan Kara
  2016-07-14 14:51           ` Tejun Heo
  2016-07-14 17:13         ` Waiman Long
  1 sibling, 1 reply; 24+ messages in thread
From: Jan Kara @ 2016-07-14 14:35 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Waiman Long, Alexander Viro, Jan Kara, Jeff Layton,
	J. Bruce Fields, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Boqun Feng, Scott J Norton, Douglas Hatch

On Thu 14-07-16 07:50:43, Tejun Heo wrote:
> > > > +void dlock_list_add(struct dlock_list_node *node, struct dlock_list_head *head)
> > > > +{
> > > > +	struct dlock_list_head *myhead;
> > > > +
> > > > +	/*
> > > > +	 * Disable preemption to make sure that the CPU won't get changed.
> > > > +	 */
> > > > +	myhead = get_cpu_ptr(head);
> > > > +	spin_lock(&myhead->lock);
> > > > +	node->lockptr = &myhead->lock;
> > > > +	list_add(&node->list, &myhead->list);
> > > > +	spin_unlock(&myhead->lock);
> > > > +	put_cpu_ptr(head);
> > > > +}
> > > I wonder whether it'd be better to use irqsafe operations.  lists tend
> > > to be often used from irq contexts.
> > 
> > The current use case only needs to use the regular lock functions. You are
> > right that future use cases may require an irqsafe version of locks. I can
> > either modify the code now to allow lock type selection at init time, for
> > example, or defer it as a future enhancement when the need arises. What do
> > you think?
> 
> The bulk of performance gain of dlist would come from being per-cpu
> and I don't think it's likely that we'd see any noticeable difference
> between irq and preempt safe operations.  Given that what's being
> implemented is really low level operations, I'd suggest going with
> irqsafe from the get-go.

I'm not sure here. i_sb_list for which percpu lists will be used is bashed
pretty heavily under some workloads and the cost of additional interrupt
disabling & enabling may be visible under those loads. Probably not in the
cases where you get a boost from percpu lists but if the workload is mostly
single-threaded, additional cpu cost may be measurable. So IMO we should
check whether a load which creates tons of empty inodes in tmpfs from a
single process doesn't regress with this change.
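
Something as dumb as the following single-threaded loop on a tmpfs
mount should be enough to see it (userspace sketch only; paths and
counts are made up, adjust as needed):

/* gcc -O2 -o inode_loop inode_loop.c
 * run against an existing tmpfs directory, e.g.:
 *   ./inode_loop /dev/shm/test 1000000 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *dir = argc > 1 ? argv[1] : "/dev/shm/test";
	long i, n = argc > 2 ? atol(argv[2]) : 1000000;
	struct timespec t0, t1;
	char path[4096];

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < n; i++) {
		int fd;

		snprintf(path, sizeof(path), "%s/f%ld", dir, i);
		fd = open(path, O_CREAT | O_WRONLY, 0600);
		if (fd < 0) {
			perror("open");
			return 1;
		}
		close(fd);
		unlink(path);	/* create + destroy one empty inode */
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("%ld inodes in %.3f s\n", n,
	       (t1.tv_sec - t0.tv_sec) +
	       (t1.tv_nsec - t0.tv_nsec) / 1e9);
	return 0;
}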

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 1/7] lib/dlock-list: Distributed and lock-protected lists
  2016-07-14 14:35         ` Jan Kara
@ 2016-07-14 14:51           ` Tejun Heo
  2016-07-14 16:16             ` Waiman Long
  0 siblings, 1 reply; 24+ messages in thread
From: Tejun Heo @ 2016-07-14 14:51 UTC (permalink / raw)
  To: Jan Kara
  Cc: Waiman Long, Alexander Viro, Jan Kara, Jeff Layton,
	J. Bruce Fields, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Boqun Feng, Scott J Norton, Douglas Hatch

Hello, Jan.

On Thu, Jul 14, 2016 at 04:35:47PM +0200, Jan Kara wrote:
> > > The current use case only needs to use the regular lock functions. You are
> > > right that future use cases may require an irqsafe version of locks. I can
> > > either modify the code now to allow lock type selection at init time, for
> > > example, or defer it as a future enhancement when the need arises. What do
> > > you think?
> > 
> > The bulk of performance gain of dlist would come from being per-cpu
> > and I don't think it's likely that we'd see any noticeable difference
> > between irq and preempt safe operations.  Given that what's being
> > implemented is really low level operations, I'd suggest going with
> > irqsafe from the get-go.
> 
> I'm not sure here. i_sb_list for which percpu lists will be used is bashed
> pretty heavily under some workloads and the cost of additional interrupt
> disabling & enabling may be visible under those loads. Probably not in the
> cases where you get a boost from percpu lists but if the workload is mostly
> single-threaded, additional cpu cost may be measurable. So IMO we should
> check whether a load which creates tons of empty inodes in tmpfs from a
> single process doesn't regress with this change.

Sure, if it actually matters, we can always create separate preempt /
irq variants.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 1/7] lib/dlock-list: Distributed and lock-protected lists
  2016-07-14 14:51           ` Tejun Heo
@ 2016-07-14 16:16             ` Waiman Long
  0 siblings, 0 replies; 24+ messages in thread
From: Waiman Long @ 2016-07-14 16:16 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Kara, Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Christoph Lameter, linux-fsdevel, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Andi Kleen, Dave Chinner, Boqun Feng,
	Scott J Norton, Douglas Hatch

On 07/14/2016 10:51 AM, Tejun Heo wrote:
> Hello, Jan.
>
> On Thu, Jul 14, 2016 at 04:35:47PM +0200, Jan Kara wrote:
>>>> The current use case only needs to use the regular lock functions. You are
>>>> right that future use cases may require an irqsafe version of locks. I can
>>>> either modify the code now to allow lock type selection at init time, for
>>>> example, or defer it as a future enhancement when the need arises. What do
>>>> you think?
>>> The bulk of performance gain of dlist would come from being per-cpu
>>> and I don't think it's likely that we'd see any noticeable difference
>>> between irq and preempt safe operations.  Given that what's being
>>> implemented is really low level operations, I'd suggest going with
>>> irqsafe from the get-go.
>> I'm not sure here. i_sb_list for which percpu lists will be used is bashed
>> pretty heavily under some workloads and the cost of additional interrupt
>> disabling & enabling may be visible under those loads. Probably not in the
>> cases where you get a boost from percpu lists but if the workload is mostly
>> single-threaded, additional cpu cost may be measurable. So IMO we should
>> check whether a load which creates tons of empty inodes in tmpfs from a
>> single process doesn't regress with this change.
> Sure, if it actually matters, we can always create separate preempt /
> irq variants.
>
> Thanks.
>

I think having irqsafe variants of the add and delete functions is the 
way to go, to ensure that we don't cause a performance regression for 
code that doesn't need the overhead of irqsave and irqrestore.
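
Roughly something like the following (sketch only; the final naming
and details are still to be worked out):

/* same includes/types as the posted lib/dlock-list.c */
void dlock_list_add_irqsafe(struct dlock_list_node *node,
			    struct dlock_list_head *head)
{
	struct dlock_list_head *myhead;
	unsigned long flags;

	myhead = get_cpu_ptr(head);
	spin_lock_irqsave(&myhead->lock, flags);
	node->lockptr = &myhead->lock;
	list_add(&node->list, &myhead->list);
	spin_unlock_irqrestore(&myhead->lock, flags);
	put_cpu_ptr(head);
}

void dlock_list_del_irqsafe(struct dlock_list_node *node)
{
	spinlock_t *lock = READ_ONCE(node->lockptr);
	unsigned long flags;

	if (unlikely(!lock))
		return;

	spin_lock_irqsave(lock, flags);
	if (likely(lock == node->lockptr)) {
		list_del_init(&node->list);
		node->lockptr = NULL;
	}
	spin_unlock_irqrestore(lock, flags);
}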

Cheers,
Longman

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 1/7] lib/dlock-list: Distributed and lock-protected lists
  2016-07-14 11:50       ` Tejun Heo
  2016-07-14 14:35         ` Jan Kara
@ 2016-07-14 17:13         ` Waiman Long
  2016-07-14 17:41           ` Tejun Heo
  1 sibling, 1 reply; 24+ messages in thread
From: Waiman Long @ 2016-07-14 17:13 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Christoph Lameter, linux-fsdevel, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Andi Kleen, Dave Chinner, Boqun Feng,
	Scott J Norton, Douglas Hatch

On 07/14/2016 07:50 AM, Tejun Heo wrote:
> Hello,
>
> On Wed, Jul 13, 2016 at 10:54:19PM -0400, Waiman Long wrote:
>> On 07/13/2016 12:08 PM, Tejun Heo wrote:
>>> On Mon, Jul 11, 2016 at 01:32:06PM -0400, Waiman Long wrote:
>>> ...
>>>> A new header file include/linux/dlock-list.h will be added with the
>>> Heh, I think percpu_list was the better name but suppose I'm too late.
>> Given the fact that we may change the level of lock granularity in the
>> future, it is too late to change it back to percpu_list.
> It's more about the consistency.  We have other data structures which
> are in a similar situation.  If we end up doing Nth CPU for dlist,
> we'll probably end up doing the same thing for other percpu data
> structures.

I see.

I got a comment related to the percpu-list name from Christoph Lameter a 
while ago. His argument was that since deletion can happen from any 
CPU, it was not really per-cpu like the other percpu structures. That 
prompted me to change the name to the current form. I am fine with 
either name, but I would like to keep the current one unless there is 
a strong rationale for switching back.


>>> Why do we need two variants of this?  We need a state variable to walk
>>> the list anyway.  Why not make dlock_list_iterate() safe against
>>> removal and get rid of the second variant?  Also, dlock_list_next()
>>> probably is a better name.
>> The two variants are the equivalent of list_for_each_entry() and
>> list_for_each_entry_safe(). There are scenarios where
>> list_for_each_entry_safe() isn't really safe when the next pointer of the
>> current node is modified, for example. In this particular use case, the safe
>> version isn't needed as all the list_for_each_entry_safe() call sites were
>> modified to use list_for_each_entry() instead. I keep the
>> dlock_list_iterate_safe() for future use cases that may need the safe
>> version.
> I don't think it makes sense to worry about the cases where the next
> entry to iterate may be removed by the iterator.  What I'm trying to
> say is just make the iteration always safe and don't worry about the
> distinction.  For list_for_each_entry(), it makes the difference of
> requiring and not requiring a separate state variable.  Here, we need
> it anyway.

A lot of those functions that need to iterate the list will release the 
lock in the middle, do some work, reacquire the lock and move on to the 
next entry. So it is entirely possible that new entries will be inserted 
between the current entry and the next one while the lock is released. 
Using the safe version will skip those newly added entries, which is a 
change in behavior for the current code. That is my main concern about 
making it deletion safe by default.
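
The pattern in question looks roughly like this.  It is a sketch only:
the state.curr/state.lock fields, init_dlock_list_state() and
do_slow_work() are assumptions for illustration, not the actual API:

/* "head" is the per-cpu dlock_list_head pointer from the patch */
static void walk_all_entries(struct dlock_list_head *head)
{
	struct dlock_list_state state;

	init_dlock_list_state(&state);
	while (dlock_list_iterate(head, &state)) {
		struct dlock_list_node *node = state.curr;

		spin_unlock(state.lock);  /* drop the per-cpu list lock */
		do_slow_work(node);       /* may sleep; list may change */
		spin_lock(state.lock);    /* re-lock, continue the walk */
	}
}

With the safe variant the next entry is sampled before the lock is
dropped, so anything inserted after node while unlocked is skipped;
the plain variant picks those entries up after re-locking.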


>
>>>> +#ifdef CONFIG_DEBUG_SPINLOCK
>>>> +#define DLOCK_LIST_WARN_ON(x)	WARN_ON(x)
>>>> +#else
>>>> +#define DLOCK_LIST_WARN_ON(x)
>>>> +#endif
>>> I'd just use WARN_ON_ONCE() without the CONFIG guard.
>> I'd do so if it is used in a real function. However, it is used in the
>> dlock_list_iterate() macro which will duplicate the WARN_ON call on all the
>> call sites. That is why I am hesitant to do that.
> They're used in inline functions but not macros.  Just uninline the
> functions?

Yes, by uninlining those functions, I can eliminate that macro.

>>>> +/*
>>>> + * Next per-cpu list entry
>>>> + */
>>>> +#define dlock_list_next_entry(pos, member) list_next_entry(pos, member.list)
>>> Why does this need to be exposed?
>> It was used in one of the modified files. I can change it to an inline
>> function. Or do you mean making it a real function and put it in the .c
>> file?
> If it's used, never mind.
>
>>>> +static __always_inline bool
>>>> +__dlock_list_next_cpu(struct dlock_list_head *head,
>>>> +		      struct dlock_list_state *state)
>>>> +{
>>> ...
>>>> +static inline bool dlock_list_iterate_safe(struct dlock_list_head *head,
>>>> +					   struct dlock_list_state *state)
>>>> +{
>>> Inlining these doesn't make sense to me.
>> They are inlined for performance reasons. I can make them real functions if
>> you think it is better.
> The functions aren't that small and inlining functions which aren't
> very small tends to become more expensive pretty quickly with multiple
> users.  I'd just make them proper functions.

Yes, will do so.

>>>> diff --git a/lib/Makefile b/lib/Makefile
>>>> index 499fb35..92e8c38 100644
>>>> --- a/lib/Makefile
>>>> +++ b/lib/Makefile
>>>> +/*
>>>> + * The dlock list lock needs its own class to avoid warning and stack
>>>> + * trace when lockdep is enabled.
>>>> + */
>>> Can you please elaborate on this?
>> A kernel warning was given out, and I think something got disabled, when
>> the kernel booted up with lockdep enabled but without the separate lock
>> class. I think it is because the locks are dynamically allocated. It will
>> not be a problem if they are statically allocated.
> Ah, okay.  Can you update the comment to include the above
> information?  It doesn't have to be long.  Just mention that this is
> the lockdep key for the dlock data structure, which is dynamically allocated.

I will update the comment.

>>>> +void dlock_list_add(struct dlock_list_node *node, struct dlock_list_head *head)
>>>> +{
>>>> +	struct dlock_list_head *myhead;
>>>> +
>>>> +	/*
>>>> +	 * Disable preemption to make sure that the CPU won't get changed.
>>>> +	 */
>>>> +	myhead = get_cpu_ptr(head);
>>>> +	spin_lock(&myhead->lock);
>>>> +	node->lockptr = &myhead->lock;
>>>> +	list_add(&node->list, &myhead->list);
>>>> +	spin_unlock(&myhead->lock);
>>>> +	put_cpu_ptr(head);
>>>> +}
>>> I wonder whether it'd be better to use irqsafe operations.  lists tend
>>> to be often used from irq contexts.
>> The current use case only needs to use the regular lock functions. You are
>> right that future use cases may require an irqsafe version of locks. I can
>> either modify the code now to allow lock type selection at init time, for
>> example, or defer it as a future enhancement when the need arises. What do
>> you think?
> The bulk of performance gain of dlist would come from being per-cpu
> and I don't think it's likely that we'd see any noticeable difference
> between irq and preempt safe operations.  Given that what's being
> implemented is really low level operations, I'd suggest going with
> irqsafe from the get-go.
>
>>>> +void dlock_list_del(struct dlock_list_node *node)
>>>> +{
>>>> +	spinlock_t *lock = READ_ONCE(node->lockptr);
>>>> +
>>>> +	if (unlikely(!lock)) {
>>>> +		WARN(1, "dlock_list_del: node 0x%lx has no associated lock\n",
>>>> +			(unsigned long)node);
>>>> +		return;
>>>> +	}
>>>> +
>>>> +	spin_lock(lock);
>>>> +	if (likely(lock == node->lockptr)) {
>>>> +		list_del_init(&node->list);
>>>> +		node->lockptr = NULL;
>>>> +	} else {
>>>> +		/*
>>>> +		 * This path should never be executed.
>>>> +		 */
>>> What if it races against someone else removing and adding back?
>>> Shouldn't it retry on those cases?
>> If it is racing with another thread removing and adding back, retrying the
>> deletion may not be the right thing to do. It really depends on what kind of
>> deletion the callers are doing. I will clarify in the comment that further
>> changes may be needed in the future if callers start doing complicated
>> deletions that require more elaborate exception handling.
> I haven't really thought about it but it could be that in the racing
> cases the order doesn't matter and just skipping the operation is
> fine.  I'm not sure triggering a warning is the right answer tho.
>
> Thanks.

I don't think it is normal to have concurrent deletion of the same 
entry. Most likely it is a bug if this happens. Having the warning 
message in the kernel log will help to catch those errors.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 1/7] lib/dlock-list: Distributed and lock-protected lists
  2016-07-14 17:13         ` Waiman Long
@ 2016-07-14 17:41           ` Tejun Heo
  0 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2016-07-14 17:41 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Christoph Lameter, linux-fsdevel, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Andi Kleen, Dave Chinner, Boqun Feng,
	Scott J Norton, Douglas Hatch

Hello,

On Thu, Jul 14, 2016 at 01:13:24PM -0400, Waiman Long wrote:
> On 07/14/2016 07:50 AM, Tejun Heo wrote:
> I got a comment related to the percpu-list name from Christoph Lameter a
> while ago. His argument was that since deletion can happen from any CPU, it
> was not really per-cpu like the other percpu structures. That prompted me to
> change the name to the current form. I am fine with either name, but I
> would like to keep the current one unless there is a strong rationale for
> switching back.

Yeah, I don't know.  It's probably gonna stick out as a percpu data
structure with a weird name.  No biggies.  If it actually matters, we
can rename it later.  Christoph, what do you think?  Do you still
think dlist is a better name?

> > I don't think it makes sense to worry about the cases where the next
> > entry to iterate may be removed by the iterator.  What I'm trying to
> > say is just make the iteration always safe and don't worry about the
> > distinction.  For list_for_each_entry(), it makes the difference of
> > requiring and not requiring a separate state variable.  Here, we need
> > it anyway.
> 
> A lot of those functions that need to iterate the list will release the lock
> in the middle, do some work, reacquire the lock and move on to the next
> entry. So it is entirely possible that new entries will be inserted between
> the current entry and the next one while the lock is released. Using the
> safe version will skip those newly added entries, which is a change in
> behavior for the current code. That is my main concern about making it
> deletion safe by default.

I see.  The distinction between unsafe and safe versions is pretty
subtle.  :(

> I don't think it is normal to have concurrent deletion of the same entry.
> Most likely it is a bug if this happens. Having the warning message in the
> kernel log will help to catch those errors.

Yeah, maybe.  I was thinking more along the lines of list_del_init().  dlist
having its own locking embedded makes it a bit murky which parts of
synchronization belong where.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2016-07-14 17:41 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-11 17:32 [PATCH v2 0/7] vfs: Use dlock list for SB's s_inodes list Waiman Long
2016-07-11 17:32 ` [PATCH v2 1/7] lib/dlock-list: Distributed and lock-protected lists Waiman Long
2016-07-13 16:08   ` Tejun Heo
2016-07-14  2:54     ` Waiman Long
2016-07-14 11:50       ` Tejun Heo
2016-07-14 14:35         ` Jan Kara
2016-07-14 14:51           ` Tejun Heo
2016-07-14 16:16             ` Waiman Long
2016-07-14 17:13         ` Waiman Long
2016-07-14 17:41           ` Tejun Heo
2016-07-11 17:32 ` [PATCH v2 2/7] lib/dlock-list: Add __percpu modifier for parameters Waiman Long
2016-07-13 16:08   ` Tejun Heo
2016-07-14  2:54     ` Waiman Long
2016-07-11 17:32 ` [PATCH v2 3/7] fsnotify: Simplify inode iteration on umount Waiman Long
2016-07-11 17:32 ` [PATCH v2 4/7] vfs: Remove unnecessary list_for_each_entry_safe() variants Waiman Long
2016-07-11 17:32 ` [PATCH v2 5/7] vfs: Use dlock list for superblock's inode list Waiman Long
2016-07-11 17:32 ` [RFC PATCH v2 6/7] lib/persubnode: Introducing a simple per-subnode APIs Waiman Long
2016-07-12  3:14   ` Boqun Feng
2016-07-12 18:37     ` Waiman Long
2016-07-12 14:27   ` Tejun Heo
2016-07-12 18:51     ` Waiman Long
2016-07-12 18:57       ` Tejun Heo
2016-07-12 19:42         ` Waiman Long
2016-07-11 17:32 ` [RFC PATCH v2 7/7] lib/dlock-list: Use the per-subnode APIs for managing lists Waiman Long
