* [PATCH v7 0/6] vfs: Use dlock list for SB's s_inodes list
@ 2017-10-05 18:43 Waiman Long
  2017-10-05 18:43 ` [PATCH v7 1/6] lib/dlock-list: Distributed and lock-protected lists Waiman Long
                   ` (9 more replies)
  0 siblings, 10 replies; 31+ messages in thread
From: Waiman Long @ 2017-10-05 18:43 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng, Davidlohr Bueso,
	Waiman Long

v6->v7:
 - Fix outdated email address.
 - Add a comment to patch 4 to explain allocation issue & fix a
   compilation problem with cpumask.
 - Replace patch 6 with another one that adds an irqsafe mode argument 
   in alloc_dlock_list_heads() instead of adding new APIs.

v5->v6:
 - Rebased the patch to 4.14-rc3.
 - Drop the fsnotify patch as it had been merged somehow.
 - Add a new patch 5 with an alternative way of selecting a list by
   hashing instead of by cpu #.
 - Add a new patch 6 to provide a set of irq-safe APIs to be used in
   interrupt context.
 - Update the CPU to index mapping code.

v4->v5:
 - Rebased the patch to 4.8-rc1 (changes to fs/fs-writeback.c was
   dropped).
 - Use kcalloc() instead of percpu_alloc() to allocate the dlock list
   heads structure as suggested by Christoph Lameter.
 - Replaced patch 5 by another one that made sibling CPUs use the same
   dlock list head thus reducing the number of list heads that needed
   to be maintained.

v3->v4:
 - As suggested by Al, encapsulate the dlock list mechanism into
   the dlist_for_each_entry() and dlist_for_each_entry_safe()
   which are the equivalents of list_for_each_entry() and
   list_for_each_entry_safe() for regular linked lists. That simplifies
   the changes in the call sites that perform dlock list iterations.
 - Add a new patch to make the percpu head structure cacheline aligned
   to prevent cacheline contention from disrupting the performance
   of nearby percpu variables.

v2->v3:
 - Remove the 2 per-subnode API patches.
 - Merge __percpu tag patch 2 into patch 1.
 - As suggested by Tejun Heo, restructure the dlock_list_head data
   structure to hide the __percpu tag and rename some of the functions
   and structures.
 - Move most of the code from dlock_list.h to dlock_list.c and export
   the symbols.

v1->v2:
 - Add a set of simple per-subnode APIs whose granularity is between
   percpu and per-node.
 - Make dlock list use the per-subnode APIs so as to reduce the
   total number of separate linked lists that need to be managed
   and iterated.
 - There is no change in patches 1-5.

This is a follow up of the following patchset:

  [PATCH v7 0/4] vfs: Use per-cpu list for SB's s_inodes list
  https://lkml.org/lkml/2016/4/12/1009

This patchset provides new APIs for a set of distributed locked lists
(one per CPU core) to minimize lock and cacheline contention. Insertion
into and deletion from the lists are cheap and relatively contention
free. Lookup, on the other hand, may be a bit more costly as there are
multiple lists to iterate. This is not really a problem for the
replacement of the superblock's inode list by a dlock list included in
this patchset, as lookup isn't needed there.

For use cases that need to do lookup, the dlock list can also be
treated as a set of hashed lists that scales with the number of CPU
cores in the system.

Both patches 5 and 6 are added to support other use cases, such as the
epoll nested callbacks, which could use the dlock-list to reduce lock
contention.

Patch 1 introduces the dlock list. The list heads are allocated
by kcalloc() instead of percpu_alloc(). Each list head entry is
cacheline aligned to minimize contention.

Patch 2 replaces the use of list_for_each_entry_safe() in
evict_inodes() and invalidate_inodes() by list_for_each_entry().

Patch 3 modifies the superblock and inode structures to use the dlock
list. The corresponding functions that reference those structures
are modified.

Patch 4 makes the sibling CPUs use the same dlock list head to reduce
the number of list heads that need to be iterated.

Patch 5 enables an alternative use case as a set of hashed lists.

Patch 6 provides an irq-safe mode specified at dlock-list allocation
time so that the list can be used within interrupt context.
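
For reference, a minimal usage sketch of the core APIs is shown below.
The foo structure and the helpers around the dlock-list calls are made
up for illustration; only the dlock-list APIs come from this patchset
(and patch 6 later adds an extra irqsafe argument to
alloc_dlock_list_heads()).

  struct foo {
          int data;
          struct dlock_list_node link;    /* embedded dlock-list node */
  };

  static struct dlock_list_heads foo_lists;

  static int __init foo_lists_init(void)
  {
          /* one lock-protected list per CPU (per core after patch 4) */
          return alloc_dlock_list_heads(&foo_lists);
  }

  static void foo_add(struct foo *f)
  {
          init_dlock_list_node(&f->link);
          dlock_lists_add(&f->link, &foo_lists);  /* list picked by current CPU */
  }

  static void foo_del(struct foo *f)
  {
          dlock_lists_del(&f->link);              /* may run on another CPU */
  }

  static void foo_walk_all(void)
  {
          struct foo *f;
          DEFINE_DLOCK_LIST_ITER(iter, &foo_lists);

          /* walks every per-CPU list; each list's lock is taken in turn */
          dlist_for_each_entry(f, &iter, link)
                  pr_info("foo data = %d\n", f->data);
  }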

Jan Kara (1):
  vfs: Remove unnecessary list_for_each_entry_safe() variants

Waiman Long (5):
  lib/dlock-list: Distributed and lock-protected lists
  vfs: Use dlock list for superblock's inode list
  lib/dlock-list: Make sibling CPUs share the same linked list
  lib/dlock-list: Enable faster lookup with hashing
  lib/dlock-list: Add an IRQ-safe mode to be used in interrupt handler

 fs/block_dev.c             |   9 +-
 fs/drop_caches.c           |   9 +-
 fs/inode.c                 |  38 +++---
 fs/notify/fsnotify.c       |   9 +-
 fs/quota/dquot.c           |  14 +-
 fs/super.c                 |   7 +-
 include/linux/dlock-list.h | 245 ++++++++++++++++++++++++++++++++++
 include/linux/fs.h         |   8 +-
 lib/Makefile               |   2 +-
 lib/dlock-list.c           | 322 +++++++++++++++++++++++++++++++++++++++++++++
 10 files changed, 609 insertions(+), 54 deletions(-)
 create mode 100644 include/linux/dlock-list.h
 create mode 100644 lib/dlock-list.c

-- 
1.8.3.1

* [PATCH v7 1/6] lib/dlock-list: Distributed and lock-protected lists
  2017-10-05 18:43 [PATCH v7 0/6] vfs: Use dlock list for SB's s_inodes list Waiman Long
@ 2017-10-05 18:43 ` Waiman Long
  2017-10-10  5:35   ` Boqun Feng
  2017-10-18  8:55   ` Boqun Feng
  2017-10-05 18:43 ` [PATCH v7 2/6] vfs: Remove unnecessary list_for_each_entry_safe() variants Waiman Long
                   ` (8 subsequent siblings)
  9 siblings, 2 replies; 31+ messages in thread
From: Waiman Long @ 2017-10-05 18:43 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng, Davidlohr Bueso,
	Waiman Long

Linked lists are used everywhere in the Linux kernel. However, if many
threads are trying to add or delete entries in the same linked list,
it can become a performance bottleneck.

This patch introduces new list APIs that provide a set of distributed
lists (one per CPU), each of which is protected by its own spinlock.
To the callers, however, the set of lists acts like a single
consolidated list.  This allows list entry insertion and deletion
operations to happen in parallel instead of being serialized on a
single global list and lock.

List entry insertion is strictly per-CPU. List deletion, however, can
happen on a CPU other than the one that did the insertion. So we still
need a lock to protect the list. Because of that, there may still be
a small amount of contention when deletions are being done.

A new header file include/linux/dlock-list.h will be added with the
associated dlock_list_head and dlock_list_node structures. The following
functions are provided to manage the per-cpu list:

 1. int alloc_dlock_list_heads(struct dlock_list_heads *dlist)
 2. void free_dlock_list_heads(struct dlock_list_heads *dlist)
 3. void dlock_list_add(struct dlock_list_node *node,
		        struct dlock_list_heads *dlist)
 4. void dlock_list_del(struct dlock_list *node)

Iteration of all the list entries within a dlock list array
is done by calling either the dlist_for_each_entry() or
dlist_for_each_entry_safe() macros. They correspond to the
list_for_each_entry() and list_for_each_entry_safe() macros
respectively. The iteration state is kept in a dlock_list_iter
structure that is passed to the iteration macros.
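
As an illustration, a walker that has to drop the per-list lock around
a blocking operation (similar to what the converted s_inodes iterators
in a later patch do) could look like the sketch below. The item
structure and the take_ref/put_ref/do_blocking_work helpers are
hypothetical; only the dlock-list calls are from this patch.

  struct item {
          struct dlock_list_node node;
          /* ... payload ... */
  };

  static void walk_items(struct dlock_list_heads *dlist)
  {
          struct item *it, *prev = NULL;
          DEFINE_DLOCK_LIST_ITER(iter, dlist);

          dlist_for_each_entry(it, &iter, node) {
                  /*
                   * The lock of the per-CPU list currently being walked
                   * is held here. Pin 'it' before dropping the lock so
                   * that it stays on the list, and defer releasing the
                   * previous reference until the lock has been dropped.
                   */
                  take_ref(it);                   /* hypothetical */
                  dlock_list_unlock(&iter);

                  do_blocking_work(it);           /* hypothetical, may sleep */
                  if (prev)
                          put_ref(prev);          /* hypothetical */
                  prev = it;

                  dlock_list_relock(&iter);
          }
          if (prev)
                  put_ref(prev);
          /* no per-list lock is held once the iteration is exhausted */
  }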

Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 include/linux/dlock-list.h | 224 +++++++++++++++++++++++++++++++++++++++++++
 lib/Makefile               |   2 +-
 lib/dlock-list.c           | 231 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 456 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/dlock-list.h
 create mode 100644 lib/dlock-list.c

diff --git a/include/linux/dlock-list.h b/include/linux/dlock-list.h
new file mode 100644
index 0000000..7940e524
--- /dev/null
+++ b/include/linux/dlock-list.h
@@ -0,0 +1,224 @@
+/*
+ * Distributed and locked list
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2016 Hewlett-Packard Enterprise Development LP
+ * (C) Copyright 2017 Red Hat, Inc.
+ *
+ * Authors: Waiman Long <longman@redhat.com>
+ */
+#ifndef __LINUX_DLOCK_LIST_H
+#define __LINUX_DLOCK_LIST_H
+
+#include <linux/spinlock.h>
+#include <linux/list.h>
+
+/*
+ * include/linux/dlock-list.h
+ *
+ * The dlock_list_head structure contains the spinlock. It is cacheline
+ * aligned to reduce contention among different CPUs. Each
+ * dlock_list_node structure contains a pointer to its head entry instead.
+ */
+struct dlock_list_head {
+	struct list_head list;
+	spinlock_t lock;
+} ____cacheline_aligned_in_smp;
+
+struct dlock_list_heads {
+	struct dlock_list_head *heads;
+};
+
+/*
+ * dlock list node data structure
+ */
+struct dlock_list_node {
+	struct list_head list;
+	struct dlock_list_head *head;
+};
+
+/*
+ * dlock list iteration state
+ *
+ * This is an opaque data structure that may change. Users of this structure
+ * should not access the structure members directly other than using the
+ * helper functions and macros provided in this header file.
+ */
+struct dlock_list_iter {
+	int index;
+	struct dlock_list_head *head, *entry;
+};
+
+#define DLOCK_LIST_ITER_INIT(dlist)		\
+	{					\
+		.index = -1,			\
+		.head = (dlist)->heads,		\
+	}
+
+#define DEFINE_DLOCK_LIST_ITER(s, heads)	\
+	struct dlock_list_iter s = DLOCK_LIST_ITER_INIT(heads)
+
+static inline void init_dlock_list_iter(struct dlock_list_iter *iter,
+					struct dlock_list_heads *heads)
+{
+	*iter = (struct dlock_list_iter)DLOCK_LIST_ITER_INIT(heads);
+}
+
+#define DLOCK_LIST_NODE_INIT(name)		\
+	{					\
+		.list = LIST_HEAD_INIT(name)	\
+	}
+
+static inline void init_dlock_list_node(struct dlock_list_node *node)
+{
+	*node = (struct dlock_list_node)DLOCK_LIST_NODE_INIT(node->list);
+}
+
+/**
+ * dlock_list_unlock - unlock the spinlock that protects the current list
+ * @iter: Pointer to the dlock list iterator structure
+ */
+static inline void dlock_list_unlock(struct dlock_list_iter *iter)
+{
+	spin_unlock(&iter->entry->lock);
+}
+
+/**
+ * dlock_list_relock - lock the spinlock that protects the current list
+ * @iter: Pointer to the dlock list iterator structure
+ */
+static inline void dlock_list_relock(struct dlock_list_iter *iter)
+{
+	spin_lock(&iter->entry->lock);
+}
+
+/*
+ * Allocation and freeing of dlock list
+ */
+extern int  alloc_dlock_list_heads(struct dlock_list_heads *dlist);
+extern void free_dlock_list_heads(struct dlock_list_heads *dlist);
+
+/*
+ * Check if a dlock list is empty or not.
+ */
+extern bool dlock_lists_empty(struct dlock_list_heads *dlist);
+
+/*
+ * The dlock list addition and deletion functions here are not irq-safe.
+ * Special irq-safe variants will have to be added if we need them.
+ */
+extern void dlock_lists_add(struct dlock_list_node *node,
+			    struct dlock_list_heads *dlist);
+extern void dlock_lists_del(struct dlock_list_node *node);
+
+/*
+ * Find the first entry of the next available list.
+ */
+extern struct dlock_list_node *
+__dlock_list_next_list(struct dlock_list_iter *iter);
+
+/**
+ * __dlock_list_next_entry - Iterate to the next entry of the dlock list
+ * @curr : Pointer to the current dlock_list_node structure
+ * @iter : Pointer to the dlock list iterator structure
+ * Return: Pointer to the next entry or NULL if all the entries are iterated
+ *
+ * The iterator has to be properly initialized before calling this function.
+ */
+static inline struct dlock_list_node *
+__dlock_list_next_entry(struct dlock_list_node *curr,
+			struct dlock_list_iter *iter)
+{
+	/*
+	 * Find next entry
+	 */
+	if (curr)
+		curr = list_next_entry(curr, list);
+
+	if (!curr || (&curr->list == &iter->entry->list)) {
+		/*
+		 * The current list has been exhausted, try the next available
+		 * list.
+		 */
+		curr = __dlock_list_next_list(iter);
+	}
+
+	return curr;	/* Continue the iteration */
+}
+
+/**
+ * dlock_list_first_entry - get the first element from a list
+ * @iter  : The dlock list iterator.
+ * @type  : The type of the struct this is embedded in.
+ * @member: The name of the dlock_list_node within the struct.
+ * Return : Pointer to the next entry or NULL if all the entries are iterated.
+ */
+#define dlock_list_first_entry(iter, type, member)			\
+	({								\
+		struct dlock_list_node *_n;				\
+		_n = __dlock_list_next_entry(NULL, iter);		\
+		_n ? list_entry(_n, type, member) : NULL;		\
+	})
+
+/**
+ * dlock_list_next_entry - iterate to the next entry of the list
+ * @pos   : The type * to cursor
+ * @iter  : The dlock list iterator.
+ * @member: The name of the dlock_list_node within the struct.
+ * Return : Pointer to the next entry or NULL if all the entries are iterated.
+ *
+ * Note that pos can't be NULL.
+ */
+#define dlock_list_next_entry(pos, iter, member)			\
+	({								\
+		struct dlock_list_node *_n;				\
+		_n = __dlock_list_next_entry(&(pos)->member, iter);	\
+		_n ? list_entry(_n, typeof(*(pos)), member) : NULL;	\
+	})
+
+/**
+ * dlist_for_each_entry - iterate over the dlock list
+ * @pos   : Type * to use as a loop cursor
+ * @iter  : The dlock list iterator
+ * @member: The name of the dlock_list_node within the struct
+ *
+ * This iteration macro isn't safe with respect to list entry removal, but
+ * it can correctly iterate newly added entries right after the current one.
+ * This iteration function is designed to be used in a while loop.
+ */
+#define dlist_for_each_entry(pos, iter, member)				\
+	for (pos = dlock_list_first_entry(iter, typeof(*(pos)), member);\
+	     pos != NULL;						\
+	     pos = dlock_list_next_entry(pos, iter, member))
+
+/**
+ * dlist_for_each_entry_safe - iterate over the dlock list & safe over removal
+ * @pos   : Type * to use as a loop cursor
+ * @n	  : Another type * to use as temporary storage
+ * @iter  : The dlock list iterator
+ * @member: The name of the dlock_list_node within the struct
+ *
+ * This iteration macro is safe with respect to list entry removal.
+ * However, it cannot correctly iterate newly added entries right after the
+ * current one.
+ */
+#define dlist_for_each_entry_safe(pos, n, iter, member)			\
+	for (pos = dlock_list_first_entry(iter, typeof(*(pos)), member);\
+	    ({								\
+		bool _b = (pos != NULL);				\
+		if (_b)							\
+			n = dlock_list_next_entry(pos, iter, member);	\
+		_b;							\
+	    });								\
+	    pos = n)
+
+#endif /* __LINUX_DLOCK_LIST_H */
diff --git a/lib/Makefile b/lib/Makefile
index dafa796..0536cd3 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -38,7 +38,7 @@ obj-y += bcd.o div64.o sort.o parser.o debug_locks.o random32.o \
 	 gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \
 	 bsearch.o find_bit.o llist.o memweight.o kfifo.o \
 	 percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \
-	 once.o refcount.o usercopy.o errseq.o
+	 once.o refcount.o usercopy.o errseq.o dlock-list.o
 obj-y += string_helpers.o
 obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o
 obj-y += hexdump.o
diff --git a/lib/dlock-list.c b/lib/dlock-list.c
new file mode 100644
index 0000000..2779e3e
--- /dev/null
+++ b/lib/dlock-list.c
@@ -0,0 +1,231 @@
+/*
+ * Distributed and locked list
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2016 Hewlett-Packard Enterprise Development LP
+ * (C) Copyright 2017 Red Hat, Inc.
+ *
+ * Authors: Waiman Long <longman@redhat.com>
+ */
+#include <linux/dlock-list.h>
+#include <linux/lockdep.h>
+#include <linux/slab.h>
+#include <linux/cpumask.h>
+
+/*
+ * The distributed and locked list is a distributed set of lists each of
+ * which is protected by its own spinlock, but acts like a single
+ * consolidated list to the callers. For scaling purpose, the number of
+ * lists used is equal to the number of possible CPUs in the system to
+ * minimize contention.
+ *
+ * However, it is possible that individual CPU numbers may be equal to
+ * or greater than the number of possible CPUs when there are holes in
+ * the CPU number list. As a result, we need to map the CPU number to a
+ * list index.
+ */
+static DEFINE_PER_CPU_READ_MOSTLY(int, cpu2idx);
+
+/*
+ * As all the locks in the dlock list are dynamically allocated, they need
+ * to belong to their own special lock class to avoid warning and stack
+ * trace in kernel log when lockdep is enabled. Statically allocated locks
+ * don't have this problem.
+ */
+static struct lock_class_key dlock_list_key;
+
+/*
+ * Initialize cpu2idx mapping table
+ *
+ * It is possible that a dlock-list can be allocated before the cpu2idx is
+ * initialized. In this case, all the cpus are mapped to the first entry
+ * before initialization.
+ *
+ */
+static int __init cpu2idx_init(void)
+{
+	int idx, cpu;
+
+	idx = 0;
+	for_each_possible_cpu(cpu)
+		per_cpu(cpu2idx, cpu) = idx++;
+	return 0;
+}
+postcore_initcall(cpu2idx_init);
+
+/**
+ * alloc_dlock_list_heads - Initialize and allocate the list of head entries
+ * @dlist: Pointer to the dlock_list_heads structure to be initialized
+ * Return: 0 if successful, -ENOMEM if memory allocation error
+ *
+ * This function does not allocate the dlock_list_heads structure itself. The
+ * callers will have to do their own memory allocation, if necessary. However,
+ * this allows embedding the dlock_list_heads structure directly into other
+ * structures.
+ */
+int alloc_dlock_list_heads(struct dlock_list_heads *dlist)
+{
+	int idx;
+
+	dlist->heads = kcalloc(nr_cpu_ids, sizeof(struct dlock_list_head),
+			       GFP_KERNEL);
+
+	if (!dlist->heads)
+		return -ENOMEM;
+
+	for (idx = 0; idx < nr_cpu_ids; idx++) {
+		struct dlock_list_head *head = &dlist->heads[idx];
+
+		INIT_LIST_HEAD(&head->list);
+		head->lock = __SPIN_LOCK_UNLOCKED(&head->lock);
+		lockdep_set_class(&head->lock, &dlock_list_key);
+	}
+	return 0;
+}
+
+/**
+ * free_dlock_list_heads - Free all the heads entries of the dlock list
+ * @dlist: Pointer of the dlock_list_heads structure to be freed
+ *
+ * This function doesn't free the dlock_list_heads structure itself. So
+ * the caller will have to do it, if necessary.
+ */
+void free_dlock_list_heads(struct dlock_list_heads *dlist)
+{
+	kfree(dlist->heads);
+	dlist->heads = NULL;
+}
+
+/**
+ * dlock_lists_empty - Check if all the dlock lists are empty
+ * @dlist: Pointer to the dlock_list_heads structure
+ * Return: true if list is empty, false otherwise.
+ *
+ * This can be a pretty expensive function call. If this function is required
+ * in a performance critical path, we may have to maintain a global count
+ * of the list entries in the global dlock_list_heads structure instead.
+ */
+bool dlock_lists_empty(struct dlock_list_heads *dlist)
+{
+	int idx;
+
+	for (idx = 0; idx < nr_cpu_ids; idx++)
+		if (!list_empty(&dlist->heads[idx].list))
+			return false;
+	return true;
+}
+
+/**
+ * dlock_lists_add - Adds a node to the given dlock list
+ * @node : The node to be added
+ * @dlist: The dlock list where the node is to be added
+ *
+ * List selection is based on the CPU being used when the dlock_lists_add()
+ * function is called. However, deletion may be done by a different CPU.
+ */
+void dlock_lists_add(struct dlock_list_node *node,
+		     struct dlock_list_heads *dlist)
+{
+	struct dlock_list_head *head = &dlist->heads[this_cpu_read(cpu2idx)];
+
+	/*
+	 * There is no need to disable preemption
+	 */
+	spin_lock(&head->lock);
+	node->head = head;
+	list_add(&node->list, &head->list);
+	spin_unlock(&head->lock);
+}
+
+/**
+ * dlock_lists_del - Delete a node from a dlock list
+ * @node : The node to be deleted
+ *
+ * We need to check the lock pointer again after taking the lock to guard
+ * against concurrent deletion of the same node. If the lock pointer changes
+ * (becomes NULL or to a different one), we assume that the deletion was done
+ * elsewhere. A warning will be printed if this happens as it is likely to be
+ * a bug.
+ */
+void dlock_lists_del(struct dlock_list_node *node)
+{
+	struct dlock_list_head *head;
+	bool retry;
+
+	do {
+		head = READ_ONCE(node->head);
+		if (WARN_ONCE(!head, "%s: node 0x%lx has no associated head\n",
+			      __func__, (unsigned long)node))
+			return;
+
+		spin_lock(&head->lock);
+		if (likely(head == node->head)) {
+			list_del_init(&node->list);
+			node->head = NULL;
+			retry = false;
+		} else {
+			/*
+			 * The lock has somehow changed. Retry again if it is
+			 * not NULL. Otherwise, just ignore the delete
+			 * operation.
+			 */
+			retry = (node->head != NULL);
+		}
+		spin_unlock(&head->lock);
+	} while (retry);
+}
+
+/**
+ * __dlock_list_next_list: Find the first entry of the next available list
+ * @iter : Pointer to the dlock list iterator structure
+ * Return: Pointer to the first entry of the next non-empty list, or NULL
+ *         if all the lists have been exhausted
+ *
+ * The information about the next available list will be put into the iterator.
+ */
+struct dlock_list_node *__dlock_list_next_list(struct dlock_list_iter *iter)
+{
+	struct dlock_list_node *next;
+	struct dlock_list_head *head;
+
+restart:
+	if (iter->entry) {
+		spin_unlock(&iter->entry->lock);
+		iter->entry = NULL;
+	}
+
+next_list:
+	/*
+	 * Try next list
+	 */
+	if (++iter->index >= nr_cpu_ids)
+		return NULL;	/* All the entries iterated */
+
+	if (list_empty(&iter->head[iter->index].list))
+		goto next_list;
+
+	head = iter->entry = &iter->head[iter->index];
+	spin_lock(&head->lock);
+	/*
+	 * There is a slight chance that the list may become empty just
+	 * before the lock is acquired. So an additional check is
+	 * needed to make sure that a valid node will be returned.
+	 */
+	if (list_empty(&head->list))
+		goto restart;
+
+	next = list_entry(head->list.next, struct dlock_list_node,
+			  list);
+	WARN_ON_ONCE(next->head != head);
+
+	return next;
+}
-- 
1.8.3.1

* [PATCH v7 2/6] vfs: Remove unnecessary list_for_each_entry_safe() variants
  2017-10-05 18:43 [PATCH v7 0/6] vfs: Use dlock list for SB's s_inodes list Waiman Long
  2017-10-05 18:43 ` [PATCH v7 1/6] lib/dlock-list: Distributed and lock-protected lists Waiman Long
@ 2017-10-05 18:43 ` Waiman Long
  2017-10-05 18:43 ` [PATCH v7 3/6] vfs: Use dlock list for superblock's inode list Waiman Long
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 31+ messages in thread
From: Waiman Long @ 2017-10-05 18:43 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng, Davidlohr Bueso, Jan Kara,
	Waiman Long

From: Jan Kara <jack@suse.cz>

evict_inodes() and invalidate_inodes() use list_for_each_entry_safe()
to iterate sb->s_inodes list. However, since we use i_lru list entry for
our local temporary list of inodes to destroy, the inode is guaranteed
to stay in sb->s_inodes list while we hold sb->s_inode_list_lock. So
there is no real need for safe iteration variant and we can use
list_for_each_entry() just fine.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 fs/inode.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index d1e35b5..e70e5fcb 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -601,12 +601,12 @@ static void dispose_list(struct list_head *head)
  */
 void evict_inodes(struct super_block *sb)
 {
-	struct inode *inode, *next;
+	struct inode *inode;
 	LIST_HEAD(dispose);
 
 again:
 	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		if (atomic_read(&inode->i_count))
 			continue;
 
@@ -652,11 +652,11 @@ void evict_inodes(struct super_block *sb)
 int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 {
 	int busy = 0;
-	struct inode *inode, *next;
+	struct inode *inode;
 	LIST_HEAD(dispose);
 
 	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
 			spin_unlock(&inode->i_lock);
-- 
1.8.3.1

* [PATCH v7 3/6] vfs: Use dlock list for superblock's inode list
  2017-10-05 18:43 [PATCH v7 0/6] vfs: Use dlock list for SB's s_inodes list Waiman Long
  2017-10-05 18:43 ` [PATCH v7 1/6] lib/dlock-list: Distributed and lock-protected lists Waiman Long
  2017-10-05 18:43 ` [PATCH v7 2/6] vfs: Remove unnecessary list_for_each_entry_safe() variants Waiman Long
@ 2017-10-05 18:43 ` Waiman Long
  2017-10-05 18:43 ` [PATCH v7 4/6] lib/dlock-list: Make sibling CPUs share the same linked list Waiman Long
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 31+ messages in thread
From: Waiman Long @ 2017-10-05 18:43 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng, Davidlohr Bueso,
	Waiman Long

When many threads are trying to add or delete inode to or from
a superblock's s_inodes list, spinlock contention on the list can
become a performance bottleneck.

This patch changes the s_inodes field to become a dlock list which
is a distributed set of lists with per-list spinlocks.  As a result,
the following superblock inode list (sb->s_inodes) iteration functions
in vfs are also being modified:

 1. iterate_bdevs()
 2. drop_pagecache_sb()
 3. evict_inodes()
 4. invalidate_inodes()
 5. fsnotify_unmount_inodes()
 6. add_dquot_ref()
 7. remove_dquot_ref()

An exit microbenchmark creates a large number of threads, attaches
many inodes to them and then exits. The runtimes of that
microbenchmark with 1200 threads before and after the patch on a
2-socket Intel E5-2699 v3 system (36 cores, 72 threads) were as
follows:

  Kernel            Elapsed Time    System Time
  ------            ------------    -----------
  Vanilla 4.14-rc3     58.40s         65m20s
  Patched 4.14-rc3     42.17s         45m32s

Before the patch, spinlock contention in the inode_sb_list_add()
function during the startup phase and in the evict() function during
the teardown phase accounted for about 68% and 97% of total CPU time
respectively (as measured by perf). With the patched kernel, the
inode_sb_list_add() function takes less than 1% and the evict()
function about 2.5% respectively. There was still quite a lot of lock
contention elsewhere that limited the gain. For example, the nlru lock
contributed most of the lock contention in the teardown phase.

Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/block_dev.c       |  9 ++++-----
 fs/drop_caches.c     |  9 ++++-----
 fs/inode.c           | 34 +++++++++++++---------------------
 fs/notify/fsnotify.c |  9 ++++-----
 fs/quota/dquot.c     | 14 ++++++--------
 fs/super.c           |  7 ++++---
 include/linux/fs.h   |  8 ++++----
 7 files changed, 39 insertions(+), 51 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 93d088f..6a885ef 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -2125,9 +2125,9 @@ int __invalidate_device(struct block_device *bdev, bool kill_dirty)
 void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
 {
 	struct inode *inode, *old_inode = NULL;
+	DEFINE_DLOCK_LIST_ITER(iter, &blockdev_superblock->s_inodes);
 
-	spin_lock(&blockdev_superblock->s_inode_list_lock);
-	list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
+	dlist_for_each_entry(inode, &iter, i_sb_list) {
 		struct address_space *mapping = inode->i_mapping;
 		struct block_device *bdev;
 
@@ -2139,7 +2139,7 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&blockdev_superblock->s_inode_list_lock);
+		dlock_list_unlock(&iter);
 		/*
 		 * We hold a reference to 'inode' so it couldn't have been
 		 * removed from s_inodes list while we dropped the
@@ -2157,8 +2157,7 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
 			func(bdev, arg);
 		mutex_unlock(&bdev->bd_mutex);
 
-		spin_lock(&blockdev_superblock->s_inode_list_lock);
+		dlock_list_relock(&iter);
 	}
-	spin_unlock(&blockdev_superblock->s_inode_list_lock);
 	iput(old_inode);
 }
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index d72d52b..8db1374 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -16,9 +16,9 @@
 static void drop_pagecache_sb(struct super_block *sb, void *unused)
 {
 	struct inode *inode, *toput_inode = NULL;
+	DEFINE_DLOCK_LIST_ITER(iter, &sb->s_inodes);
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	dlist_for_each_entry(inode, &iter, i_sb_list) {
 		spin_lock(&inode->i_lock);
 		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
 		    (inode->i_mapping->nrpages == 0)) {
@@ -27,15 +27,14 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
+		dlock_list_unlock(&iter);
 
 		invalidate_mapping_pages(inode->i_mapping, 0, -1);
 		iput(toput_inode);
 		toput_inode = inode;
 
-		spin_lock(&sb->s_inode_list_lock);
+		dlock_list_relock(&iter);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 	iput(toput_inode);
 }
 
diff --git a/fs/inode.c b/fs/inode.c
index e70e5fcb..7bfbe6b 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -28,7 +28,7 @@
  *   inode->i_state, inode->i_hash, __iget()
  * Inode LRU list locks protect:
  *   inode->i_sb->s_inode_lru, inode->i_lru
- * inode->i_sb->s_inode_list_lock protects:
+ * inode->i_sb->s_inodes->head->lock protects:
  *   inode->i_sb->s_inodes, inode->i_sb_list
  * bdi->wb.list_lock protects:
  *   bdi->wb.b_{dirty,io,more_io,dirty_time}, inode->i_io_list
@@ -37,7 +37,7 @@
  *
  * Lock ordering:
  *
- * inode->i_sb->s_inode_list_lock
+ * inode->i_sb->s_inodes->head->lock
  *   inode->i_lock
  *     Inode LRU list locks
  *
@@ -45,7 +45,7 @@
  *   inode->i_lock
  *
  * inode_hash_lock
- *   inode->i_sb->s_inode_list_lock
+ *   inode->i_sb->s_inodes->head->lock
  *   inode->i_lock
  *
  * iunique_lock
@@ -434,19 +434,14 @@ static void inode_lru_list_del(struct inode *inode)
  */
 void inode_sb_list_add(struct inode *inode)
 {
-	spin_lock(&inode->i_sb->s_inode_list_lock);
-	list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
-	spin_unlock(&inode->i_sb->s_inode_list_lock);
+	dlock_lists_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
 }
 EXPORT_SYMBOL_GPL(inode_sb_list_add);
 
 static inline void inode_sb_list_del(struct inode *inode)
 {
-	if (!list_empty(&inode->i_sb_list)) {
-		spin_lock(&inode->i_sb->s_inode_list_lock);
-		list_del_init(&inode->i_sb_list);
-		spin_unlock(&inode->i_sb->s_inode_list_lock);
-	}
+	if (!list_empty(&inode->i_sb_list.list))
+		dlock_lists_del(&inode->i_sb_list);
 }
 
 static unsigned long hash(struct super_block *sb, unsigned long hashval)
@@ -602,11 +597,12 @@ static void dispose_list(struct list_head *head)
 void evict_inodes(struct super_block *sb)
 {
 	struct inode *inode;
+	struct dlock_list_iter iter;
 	LIST_HEAD(dispose);
 
 again:
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	init_dlock_list_iter(&iter, &sb->s_inodes);
+	dlist_for_each_entry(inode, &iter, i_sb_list) {
 		if (atomic_read(&inode->i_count))
 			continue;
 
@@ -627,13 +623,12 @@ void evict_inodes(struct super_block *sb)
 		 * bit so we don't livelock.
 		 */
 		if (need_resched()) {
-			spin_unlock(&sb->s_inode_list_lock);
+			dlock_list_unlock(&iter);
 			cond_resched();
 			dispose_list(&dispose);
 			goto again;
 		}
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 
 	dispose_list(&dispose);
 }
@@ -654,9 +649,9 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 	int busy = 0;
 	struct inode *inode;
 	LIST_HEAD(dispose);
+	DEFINE_DLOCK_LIST_ITER(iter, &sb->s_inodes);
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	dlist_for_each_entry(inode, &iter, i_sb_list) {
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
 			spin_unlock(&inode->i_lock);
@@ -678,7 +673,6 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 		spin_unlock(&inode->i_lock);
 		list_add(&inode->i_lru, &dispose);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 
 	dispose_list(&dispose);
 
@@ -893,7 +887,7 @@ struct inode *new_inode_pseudo(struct super_block *sb)
 		spin_lock(&inode->i_lock);
 		inode->i_state = 0;
 		spin_unlock(&inode->i_lock);
-		INIT_LIST_HEAD(&inode->i_sb_list);
+		init_dlock_list_node(&inode->i_sb_list);
 	}
 	return inode;
 }
@@ -914,8 +908,6 @@ struct inode *new_inode(struct super_block *sb)
 {
 	struct inode *inode;
 
-	spin_lock_prefetch(&sb->s_inode_list_lock);
-
 	inode = new_inode_pseudo(sb);
 	if (inode)
 		inode_sb_list_add(inode);
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 0c4583b..136a68a 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -51,9 +51,9 @@ void __fsnotify_vfsmount_delete(struct vfsmount *mnt)
 void fsnotify_unmount_inodes(struct super_block *sb)
 {
 	struct inode *inode, *iput_inode = NULL;
+	DEFINE_DLOCK_LIST_ITER(iter, &sb->s_inodes);
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	dlist_for_each_entry(inode, &iter, i_sb_list) {
 		/*
 		 * We cannot __iget() an inode in state I_FREEING,
 		 * I_WILL_FREE, or I_NEW which is fine because by that point
@@ -78,7 +78,7 @@ void fsnotify_unmount_inodes(struct super_block *sb)
 
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
+		dlock_list_unlock(&iter);
 
 		if (iput_inode)
 			iput(iput_inode);
@@ -90,9 +90,8 @@ void fsnotify_unmount_inodes(struct super_block *sb)
 
 		iput_inode = inode;
 
-		spin_lock(&sb->s_inode_list_lock);
+		dlock_list_relock(&iter);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 
 	if (iput_inode)
 		iput(iput_inode);
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 50b0556..983c7be 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -936,12 +936,12 @@ static int dqinit_needed(struct inode *inode, int type)
 static void add_dquot_ref(struct super_block *sb, int type)
 {
 	struct inode *inode, *old_inode = NULL;
+	DEFINE_DLOCK_LIST_ITER(iter, &sb->s_inodes);
 #ifdef CONFIG_QUOTA_DEBUG
 	int reserved = 0;
 #endif
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	dlist_for_each_entry(inode, &iter, i_sb_list) {
 		spin_lock(&inode->i_lock);
 		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
 		    !atomic_read(&inode->i_writecount) ||
@@ -951,7 +951,7 @@ static void add_dquot_ref(struct super_block *sb, int type)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
+		dlock_list_unlock(&iter);
 
 #ifdef CONFIG_QUOTA_DEBUG
 		if (unlikely(inode_get_rsv_space(inode) > 0))
@@ -969,9 +969,8 @@ static void add_dquot_ref(struct super_block *sb, int type)
 		 * later.
 		 */
 		old_inode = inode;
-		spin_lock(&sb->s_inode_list_lock);
+		dlock_list_relock(&iter);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 	iput(old_inode);
 
 #ifdef CONFIG_QUOTA_DEBUG
@@ -1039,9 +1038,9 @@ static void remove_dquot_ref(struct super_block *sb, int type,
 {
 	struct inode *inode;
 	int reserved = 0;
+	DEFINE_DLOCK_LIST_ITER(iter, &sb->s_inodes);
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	dlist_for_each_entry(inode, &iter, i_sb_list) {
 		/*
 		 *  We have to scan also I_NEW inodes because they can already
 		 *  have quota pointer initialized. Luckily, we need to touch
@@ -1056,7 +1055,6 @@ static void remove_dquot_ref(struct super_block *sb, int type,
 		}
 		spin_unlock(&dq_data_lock);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 #ifdef CONFIG_QUOTA_DEBUG
 	if (reserved) {
 		printk(KERN_WARNING "VFS (%s): Writes happened after quota"
diff --git a/fs/super.c b/fs/super.c
index 166c4ee..a90a070 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -164,6 +164,7 @@ static void destroy_super(struct super_block *s)
 {
 	list_lru_destroy(&s->s_dentry_lru);
 	list_lru_destroy(&s->s_inode_lru);
+	free_dlock_list_heads(&s->s_inodes);
 	security_sb_free(s);
 	WARN_ON(!list_empty(&s->s_mounts));
 	put_user_ns(s->s_user_ns);
@@ -210,11 +211,11 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
 	INIT_HLIST_NODE(&s->s_instances);
 	INIT_HLIST_BL_HEAD(&s->s_anon);
 	mutex_init(&s->s_sync_lock);
-	INIT_LIST_HEAD(&s->s_inodes);
-	spin_lock_init(&s->s_inode_list_lock);
 	INIT_LIST_HEAD(&s->s_inodes_wb);
 	spin_lock_init(&s->s_inode_wblist_lock);
 
+	if (alloc_dlock_list_heads(&s->s_inodes))
+		goto fail;
 	if (list_lru_init_memcg(&s->s_dentry_lru))
 		goto fail;
 	if (list_lru_init_memcg(&s->s_inode_lru))
@@ -434,7 +435,7 @@ void generic_shutdown_super(struct super_block *sb)
 		if (sop->put_super)
 			sop->put_super(sb);
 
-		if (!list_empty(&sb->s_inodes)) {
+		if (!dlock_lists_empty(&sb->s_inodes)) {
 			printk("VFS: Busy inodes after unmount of %s. "
 			   "Self-destruct in 5 seconds.  Have a nice day...\n",
 			   sb->s_id);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 339e737..643f4d8 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -34,6 +34,7 @@
 #include <linux/delayed_call.h>
 #include <linux/uuid.h>
 #include <linux/errseq.h>
+#include <linux/dlock-list.h>
 
 #include <asm/byteorder.h>
 #include <uapi/linux/fs.h>
@@ -632,7 +633,7 @@ struct inode {
 	u16			i_wb_frn_history;
 #endif
 	struct list_head	i_lru;		/* inode LRU list */
-	struct list_head	i_sb_list;
+	struct dlock_list_node	i_sb_list;
 	struct list_head	i_wb_list;	/* backing dev writeback list */
 	union {
 		struct hlist_head	i_dentry;
@@ -1434,9 +1435,8 @@ struct super_block {
 	 */
 	int s_stack_depth;
 
-	/* s_inode_list_lock protects s_inodes */
-	spinlock_t		s_inode_list_lock ____cacheline_aligned_in_smp;
-	struct list_head	s_inodes;	/* all inodes */
+	/* The internal per-list locks protect s_inodes */
+	struct dlock_list_heads s_inodes;	/* all inodes */
 
 	spinlock_t		s_inode_wblist_lock;
 	struct list_head	s_inodes_wb;	/* writeback inodes */
-- 
1.8.3.1

* [PATCH v7 4/6] lib/dlock-list: Make sibling CPUs share the same linked list
  2017-10-05 18:43 [PATCH v7 0/6] vfs: Use dlock list for SB's s_inodes list Waiman Long
                   ` (2 preceding siblings ...)
  2017-10-05 18:43 ` [PATCH v7 3/6] vfs: Use dlock list for superblock's inode list Waiman Long
@ 2017-10-05 18:43 ` Waiman Long
  2017-10-09 15:40   ` Jan Kara
  2017-10-05 18:43 ` [PATCH v7 5/6] lib/dlock-list: Enable faster lookup with hashing Waiman Long
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 31+ messages in thread
From: Waiman Long @ 2017-10-05 18:43 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng, Davidlohr Bueso,
	Waiman Long

The dlock list needs one list for each of the available CPUs. Sibling
CPUs, however, share the L2 and probably the L1 caches too. As a
result, there is not much to gain in terms of avoiding cacheline
contention, while the cacheline footprint of the L1/L2 caches grows
because the separate lists may all need to be in the cache.

This patch makes all the sibling CPUs share the same list, thus
reducing the number of lists that need to be maintained in each
dlock list without any noticeable impact on performance. It also
improves dlock list iteration performance as fewer lists need to
be iterated.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 lib/dlock-list.c | 68 +++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 53 insertions(+), 15 deletions(-)

diff --git a/lib/dlock-list.c b/lib/dlock-list.c
index 2779e3e..a045fd7 100644
--- a/lib/dlock-list.c
+++ b/lib/dlock-list.c
@@ -25,15 +25,14 @@
  * The distributed and locked list is a distributed set of lists each of
  * which is protected by its own spinlock, but acts like a single
  * consolidated list to the callers. For scaling purpose, the number of
- * lists used is equal to the number of possible CPUs in the system to
- * minimize contention.
+ * lists used is equal to the number of possible cores in the system to
+ * minimize contention. All threads of the same CPU core will share the
+ * same list.
  *
- * However, it is possible that individual CPU numbers may be equal to
- * or greater than the number of possible CPUs when there are holes in
- * the CPU number list. As a result, we need to map the CPU number to a
- * list index.
+ * We need to map each CPU number to a list index.
  */
 static DEFINE_PER_CPU_READ_MOSTLY(int, cpu2idx);
+static int nr_dlock_lists __read_mostly;
 
 /*
  * As all the locks in the dlock list are dynamically allocated, they need
@@ -44,20 +43,55 @@
 static struct lock_class_key dlock_list_key;
 
 /*
- * Initialize cpu2idx mapping table
+ * Initialize cpu2idx mapping table & nr_dlock_lists.
  *
  * It is possible that a dlock-list can be allocated before the cpu2idx is
  * initialized. In this case, all the cpus are mapped to the first entry
  * before initialization.
  *
+ * All the sibling CPUs of a sibling group will map to the same dlock list so
+ * as to reduce the number of dlock lists to be maintained while minimizing
+ * cacheline contention.
+ *
+ * As the sibling masks are set up in the core initcall phase, this function
+ * has to be done in the postcore phase to get the right data.
  */
 static int __init cpu2idx_init(void)
 {
 	int idx, cpu;
+	struct cpumask *sibling_mask;
+	static struct cpumask mask __initdata;
 
+	cpumask_clear(&mask);
 	idx = 0;
-	for_each_possible_cpu(cpu)
-		per_cpu(cpu2idx, cpu) = idx++;
+	for_each_possible_cpu(cpu) {
+		int scpu;
+
+		if (cpumask_test_cpu(cpu, &mask))
+			continue;
+		per_cpu(cpu2idx, cpu) = idx;
+		cpumask_set_cpu(cpu, &mask);
+
+		sibling_mask = topology_sibling_cpumask(cpu);
+		if (sibling_mask) {
+			for_each_cpu(scpu, sibling_mask) {
+				per_cpu(cpu2idx, scpu) = idx;
+				cpumask_set_cpu(scpu, &mask);
+			}
+		}
+		idx++;
+	}
+
+	/*
+	 * nr_dlock_lists can only be set after cpu2idx is properly
+	 * initialized.
+	 */
+	smp_mb();
+	nr_dlock_lists = idx;
+	WARN_ON(nr_dlock_lists > nr_cpu_ids);
+
+	pr_info("dlock-list: %d head entries per dlock list.\n",
+		nr_dlock_lists);
 	return 0;
 }
 postcore_initcall(cpu2idx_init);
@@ -71,18 +105,22 @@ static int __init cpu2idx_init(void)
  * callers will have to do their own memory allocation, if necessary. However,
  * this allows embedding the dlock_list_heads structure directly into other
  * structures.
+ *
+ * Since nr_dlock_lists will always be <= nr_cpu_ids, having more lists
+ * than necessary allocated is not a problem other than some wasted memory.
+ * The extra lists will not be ever used as all the cpu2idx entries will be
+ * 0 before initialization.
  */
 int alloc_dlock_list_heads(struct dlock_list_heads *dlist)
 {
-	int idx;
+	int idx, cnt = nr_dlock_lists ? nr_dlock_lists : nr_cpu_ids;
 
-	dlist->heads = kcalloc(nr_cpu_ids, sizeof(struct dlock_list_head),
-			       GFP_KERNEL);
+	dlist->heads = kcalloc(cnt, sizeof(struct dlock_list_head), GFP_KERNEL);
 
 	if (!dlist->heads)
 		return -ENOMEM;
 
-	for (idx = 0; idx < nr_cpu_ids; idx++) {
+	for (idx = 0; idx < cnt; idx++) {
 		struct dlock_list_head *head = &dlist->heads[idx];
 
 		INIT_LIST_HEAD(&head->list);
@@ -118,7 +156,7 @@ bool dlock_lists_empty(struct dlock_list_heads *dlist)
 {
 	int idx;
 
-	for (idx = 0; idx < nr_cpu_ids; idx++)
+	for (idx = 0; idx < nr_dlock_lists; idx++)
 		if (!list_empty(&dlist->heads[idx].list))
 			return false;
 	return true;
@@ -207,7 +245,7 @@ struct dlock_list_node *__dlock_list_next_list(struct dlock_list_iter *iter)
 	/*
 	 * Try next list
 	 */
-	if (++iter->index >= nr_cpu_ids)
+	if (++iter->index >= nr_dlock_lists)
 		return NULL;	/* All the entries iterated */
 
 	if (list_empty(&iter->head[iter->index].list))
-- 
1.8.3.1

* [PATCH v7 5/6] lib/dlock-list: Enable faster lookup with hashing
  2017-10-05 18:43 [PATCH v7 0/6] vfs: Use dlock list for SB's s_inodes list Waiman Long
                   ` (3 preceding siblings ...)
  2017-10-05 18:43 ` [PATCH v7 4/6] lib/dlock-list: Make sibling CPUs share the same linked list Waiman Long
@ 2017-10-05 18:43 ` Waiman Long
  2017-10-09 13:08   ` Davidlohr Bueso
  2017-10-05 18:43 ` [PATCH v7 6/6] lib/dlock-list: Add an IRQ-safe mode to be used in interrupt handler Waiman Long
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 31+ messages in thread
From: Waiman Long @ 2017-10-05 18:43 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng, Davidlohr Bueso,
	Waiman Long

Insertion and deletion are relatively cheap and mostly contention
free for dlock-list. Lookup, on the other hand, can be rather costly
because all the lists in a dlock-list will have to be iterated.

Currently dlock-list insertion is based on the cpu that the task is
running on. So a given object can be inserted into any one of the
lists depending on what the current cpu is.

This patch provides an alternative way of list selection. The caller
can provide an object context which will be hashed to one of the lists
in a dlock-list. The object can then be added into that particular
list. Lookup can be done by iterating elements in the provided list
only instead of all the lists in a dlock-list.

The new APIs are:

struct dlock_list_head *dlock_list_hash(struct dlock_list_heads *, void *);
void dlock_list_add(struct dlock_list_node *, struct dlock_list_head *);
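
As an illustration of the intended lookup pattern, a sketch is shown
below. The obj structure and its key handling are made up; and since
this series does not provide a single-list iterator, the sketch takes
the selected head's lock directly, the same way dlock_list_add() does
internally.

  struct obj {
          unsigned long key;
          struct dlock_list_node node;
  };

  static void obj_insert(struct dlock_list_heads *dlist, struct obj *o)
  {
          struct dlock_list_head *head = dlock_list_hash(dlist, (void *)o->key);

          dlock_list_add(&o->node, head);         /* always lands on 'head' */
  }

  static struct obj *obj_lookup(struct dlock_list_heads *dlist,
                                unsigned long key)
  {
          struct dlock_list_head *head = dlock_list_hash(dlist, (void *)key);
          struct dlock_list_node *n;
          struct obj *found = NULL;

          spin_lock(&head->lock);
          list_for_each_entry(n, &head->list, list) {
                  struct obj *o = container_of(n, struct obj, node);

                  if (o->key == key) {
                          found = o;
                          break;
                  }
          }
          spin_unlock(&head->lock);
          return found;           /* only one of the lists was scanned */
  }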

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/dlock-list.h |  9 +++++++++
 lib/dlock-list.c           | 49 +++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 51 insertions(+), 7 deletions(-)

diff --git a/include/linux/dlock-list.h b/include/linux/dlock-list.h
index 7940e524..16474ae 100644
--- a/include/linux/dlock-list.h
+++ b/include/linux/dlock-list.h
@@ -121,6 +121,15 @@ extern void dlock_lists_add(struct dlock_list_node *node,
 extern void dlock_lists_del(struct dlock_list_node *node);
 
 /*
+ * Instead of individual list mapping by CPU number, it can be based on
+ * a given context to speed up lookup performance.
+ */
+extern struct dlock_list_head *dlock_list_hash(struct dlock_list_heads *dlist,
+					       void *context);
+extern void dlock_list_add(struct dlock_list_node *node,
+			   struct dlock_list_head *head);
+
+/*
  * Find the first entry of the next available list.
  */
 extern struct dlock_list_node *
diff --git a/lib/dlock-list.c b/lib/dlock-list.c
index a045fd7..8cd0876 100644
--- a/lib/dlock-list.c
+++ b/lib/dlock-list.c
@@ -20,6 +20,7 @@
 #include <linux/lockdep.h>
 #include <linux/slab.h>
 #include <linux/cpumask.h>
+#include <linux/jhash.h>
 
 /*
  * The distributed and locked list is a distributed set of lists each of
@@ -163,6 +164,46 @@ bool dlock_lists_empty(struct dlock_list_heads *dlist)
 }
 
 /**
+ * dlock_list_hash - Hash the given context to a particular list
+ * @dlist: The dlock list
+ * @ctx  : The context for hashing
+ */
+struct dlock_list_head *dlock_list_hash(struct dlock_list_heads *dlist,
+					void *ctx)
+{
+	unsigned long val = (unsigned long)ctx;
+	u32 hash;
+
+	if (unlikely(!nr_dlock_lists)) {
+		WARN_ON_ONCE(1);
+		return &dlist->heads[0];
+	}
+	if (val < nr_dlock_lists)
+		hash = val;
+	else
+		hash = jhash2((u32 *)&ctx, sizeof(ctx)/sizeof(u32), 0)
+				% nr_dlock_lists;
+	return &dlist->heads[hash];
+}
+
+/**
+ * dlock_list_add - Add a node to a particular head of dlock list
+ * @node: The node to be added
+ * @head: The dlock list head where the node is to be added
+ */
+void dlock_list_add(struct dlock_list_node *node,
+		    struct dlock_list_head *head)
+{
+	/*
+	 * There is no need to disable preemption
+	 */
+	spin_lock(&head->lock);
+	node->head = head;
+	list_add(&node->list, &head->list);
+	spin_unlock(&head->lock);
+}
+
+/**
  * dlock_lists_add - Adds a node to the given dlock list
  * @node : The node to be added
  * @dlist: The dlock list where the node is to be added
@@ -175,13 +216,7 @@ void dlock_lists_add(struct dlock_list_node *node,
 {
 	struct dlock_list_head *head = &dlist->heads[this_cpu_read(cpu2idx)];
 
-	/*
-	 * There is no need to disable preemption
-	 */
-	spin_lock(&head->lock);
-	node->head = head;
-	list_add(&node->list, &head->list);
-	spin_unlock(&head->lock);
+	dlock_list_add(node, head);
 }
 
 /**
-- 
1.8.3.1

* [PATCH v7 6/6] lib/dlock-list: Add an IRQ-safe mode to be used in interrupt handler
  2017-10-05 18:43 [PATCH v7 0/6] vfs: Use dlock list for SB's s_inodes list Waiman Long
                   ` (4 preceding siblings ...)
  2017-10-05 18:43 ` [PATCH v7 5/6] lib/dlock-list: Enable faster lookup with hashing Waiman Long
@ 2017-10-05 18:43 ` Waiman Long
  2017-10-13 15:45 ` [PATCH v7 7/6] fs/epoll: scale nested callbacks Davidlohr Bueso
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 31+ messages in thread
From: Waiman Long @ 2017-10-05 18:43 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng, Davidlohr Bueso,
	Waiman Long

To enable the use of dlock-list in an interrupt handler, a new
irqsafe mode can now be specified at dlock-list allocation time as
an additional argument to alloc_dlock_list_heads(). With that mode
specified, the spin_lock_irqsave/spin_unlock_irqrestore pair will be
used instead of the regular lock and unlock calls.
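
A rough sketch of a user of the new mode (the list and its callers are
hypothetical; only the two dlock-list calls are from this series):

  static struct dlock_list_heads irq_list;

  static int __init irq_list_init(void)
  {
          /* true selects the irq-safe locking mode for all the lists */
          return alloc_dlock_list_heads(&irq_list, true);
  }

  /* may be called from an interrupt handler */
  static void irq_list_insert(struct dlock_list_node *node)
  {
          /* uses spin_lock_irqsave()/spin_unlock_irqrestore() internally */
          dlock_lists_add(node, &irq_list);
  }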

Signed-off-by: Waiman Long <longman@redhat.com>
---
 fs/super.c                 |  2 +-
 include/linux/dlock-list.h | 18 +++++++++++++++---
 lib/dlock-list.c           | 44 +++++++++++++++++++++++++++++++-------------
 3 files changed, 47 insertions(+), 17 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index a90a070..0840e54 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -214,7 +214,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
 	INIT_LIST_HEAD(&s->s_inodes_wb);
 	spin_lock_init(&s->s_inode_wblist_lock);
 
-	if (alloc_dlock_list_heads(&s->s_inodes))
+	if (alloc_dlock_list_heads(&s->s_inodes, false))
 		goto fail;
 	if (list_lru_init_memcg(&s->s_dentry_lru))
 		goto fail;
diff --git a/include/linux/dlock-list.h b/include/linux/dlock-list.h
index 16474ae..2ba7b4f 100644
--- a/include/linux/dlock-list.h
+++ b/include/linux/dlock-list.h
@@ -32,6 +32,8 @@
 struct dlock_list_head {
 	struct list_head list;
 	spinlock_t lock;
+	int irqsafe;	/* IRQ safe mode */
+	unsigned long flags;
 } ____cacheline_aligned_in_smp;
 
 struct dlock_list_heads {
@@ -89,7 +91,12 @@ static inline void init_dlock_list_node(struct dlock_list_node *node)
  */
 static inline void dlock_list_unlock(struct dlock_list_iter *iter)
 {
-	spin_unlock(&iter->entry->lock);
+	struct dlock_list_head *h = iter->entry;
+
+	if (h->irqsafe)
+		spin_unlock_irqrestore(&h->lock, h->flags);
+	else
+		spin_unlock(&h->lock);
 }
 
 /**
@@ -98,13 +105,18 @@ static inline void dlock_list_unlock(struct dlock_list_iter *iter)
  */
 static inline void dlock_list_relock(struct dlock_list_iter *iter)
 {
-	spin_lock(&iter->entry->lock);
+	struct dlock_list_head *h = iter->entry;
+
+	if (h->irqsafe)
+		spin_lock_irqsave(&h->lock, h->flags);
+	else
+		spin_lock(&h->lock);
 }
 
 /*
  * Allocation and freeing of dlock list
  */
-extern int  alloc_dlock_list_heads(struct dlock_list_heads *dlist);
+extern int  alloc_dlock_list_heads(struct dlock_list_heads *dlist, int irqsafe);
 extern void free_dlock_list_heads(struct dlock_list_heads *dlist);
 
 /*
diff --git a/lib/dlock-list.c b/lib/dlock-list.c
index 8cd0876..4fded20 100644
--- a/lib/dlock-list.c
+++ b/lib/dlock-list.c
@@ -99,7 +99,8 @@ static int __init cpu2idx_init(void)
 
 /**
  * alloc_dlock_list_heads - Initialize and allocate the list of head entries
- * @dlist: Pointer to the dlock_list_heads structure to be initialized
+ * @dlist  : Pointer to the dlock_list_heads structure to be initialized
+ * @irqsafe: IRQ safe mode flag
  * Return: 0 if successful, -ENOMEM if memory allocation error
  *
  * This function does not allocate the dlock_list_heads structure itself. The
@@ -112,7 +113,7 @@ static int __init cpu2idx_init(void)
  * The extra lists will not be ever used as all the cpu2idx entries will be
  * 0 before initialization.
  */
-int alloc_dlock_list_heads(struct dlock_list_heads *dlist)
+int alloc_dlock_list_heads(struct dlock_list_heads *dlist, int irqsafe)
 {
 	int idx, cnt = nr_dlock_lists ? nr_dlock_lists : nr_cpu_ids;
 
@@ -126,6 +127,7 @@ int alloc_dlock_list_heads(struct dlock_list_heads *dlist)
 
 		INIT_LIST_HEAD(&head->list);
 		head->lock = __SPIN_LOCK_UNLOCKED(&head->lock);
+		head->irqsafe = irqsafe;
 		lockdep_set_class(&head->lock, &dlock_list_key);
 	}
 	return 0;
@@ -194,13 +196,19 @@ struct dlock_list_head *dlock_list_hash(struct dlock_list_heads *dlist,
 void dlock_list_add(struct dlock_list_node *node,
 		    struct dlock_list_head *head)
 {
-	/*
-	 * There is no need to disable preemption
-	 */
-	spin_lock(&head->lock);
-	node->head = head;
-	list_add(&node->list, &head->list);
-	spin_unlock(&head->lock);
+	unsigned long flags;
+
+	if (head->irqsafe) {
+		spin_lock_irqsave(&head->lock, flags);
+		node->head = head;
+		list_add(&node->list, &head->list);
+		spin_unlock_irqrestore(&head->lock, flags);
+	} else {
+		spin_lock(&head->lock);
+		node->head = head;
+		list_add(&node->list, &head->list);
+		spin_unlock(&head->lock);
+	}
 }
 
 /**
@@ -232,6 +240,7 @@ void dlock_lists_add(struct dlock_list_node *node,
 void dlock_lists_del(struct dlock_list_node *node)
 {
 	struct dlock_list_head *head;
+	unsigned long flags;
 	bool retry;
 
 	do {
@@ -240,7 +249,11 @@ void dlock_lists_del(struct dlock_list_node *node)
 			      __func__, (unsigned long)node))
 			return;
 
-		spin_lock(&head->lock);
+		if (head->irqsafe)
+			spin_lock_irqsave(&head->lock, flags);
+		else
+			spin_lock(&head->lock);
+
 		if (likely(head == node->head)) {
 			list_del_init(&node->list);
 			node->head = NULL;
@@ -253,7 +266,11 @@ void dlock_lists_del(struct dlock_list_node *node)
 			 */
 			retry = (node->head != NULL);
 		}
-		spin_unlock(&head->lock);
+
+		if (head->irqsafe)
+			spin_unlock_irqrestore(&head->lock, flags);
+		else
+			spin_unlock(&head->lock);
 	} while (retry);
 }
 
@@ -272,7 +289,7 @@ struct dlock_list_node *__dlock_list_next_list(struct dlock_list_iter *iter)
 
 restart:
 	if (iter->entry) {
-		spin_unlock(&iter->entry->lock);
+		dlock_list_unlock(iter);
 		iter->entry = NULL;
 	}
 
@@ -287,7 +304,8 @@ struct dlock_list_node *__dlock_list_next_list(struct dlock_list_iter *iter)
 		goto next_list;
 
 	head = iter->entry = &iter->head[iter->index];
-	spin_lock(&head->lock);
+	dlock_list_relock(iter);
+
 	/*
 	 * There is a slight chance that the list may become empty just
 	 * before the lock is acquired. So an additional check is
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 5/6] lib/dlock-list: Enable faster lookup with hashing
  2017-10-05 18:43 ` [PATCH v7 5/6] lib/dlock-list: Enable faster lookup with hashing Waiman Long
@ 2017-10-09 13:08   ` Davidlohr Bueso
  2017-10-09 14:16     ` Waiman Long
  0 siblings, 1 reply; 31+ messages in thread
From: Davidlohr Bueso @ 2017-10-09 13:08 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Boqun Feng

On Thu, 05 Oct 2017, Waiman Long wrote:

>Insertion and deletion is relatively cheap and mostly contention
>free for dlock-list. Lookup, on the other hand, can be rather costly
>because all the lists in a dlock-list will have to be iterated.
>
>Currently dlock-list insertion is based on the cpu that the task is
>running on. So a given object can be inserted into any one of the
>lists depending on what the current cpu is.
>
>This patch provides an alternative way of list selection. The caller
>can provide a object context which will be hashed to one of the list
>in a dlock-list. The object can then be added into that particular
>list. Lookup can be done by iterating elements in the provided list
>only instead of all the lists in a dlock-list.

Unless I'm misusing the api, I could not find a standard way of
iterating a _particular_ list head (the one dlock_list_hash()
returned). This is because the iterators always want all the heads.

Also, in my particular epoll case I'd need the head->lock _not_ to
be dropped after the iteration, which makes the usage pretty ad hoc.
Currently we do:

dlist_for_each_entry() {
	// acquire head->lock for each list
}
// no locks held
dlist_add()

I'm thinking perhaps we could have a dlist_check_add() which takes a
callback to decide whether we actually want to add the node. The function
could acquire the head->lock and not release it until the very end.
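
Roughly something like the following -- just a sketch, where the name,
signature and exact locking policy are guesses on my part:

/*
 * Hypothetical helper (illustrative only): walk the hashed head under
 * head->lock and add the node only if the callback approves of every
 * entry already on that list.  Assumes the dlock_list_head and
 * dlock_list_node layout from this series; an irqsafe user would take
 * the lock with spin_lock_irqsave() instead.
 */
static inline int dlist_check_add(struct dlock_list_node *node,
				  struct dlock_list_head *head,
				  bool (*check)(struct dlock_list_node *cur,
						void *arg),
				  void *arg)
{
	struct dlock_list_node *cur;
	int ret = 0;

	spin_lock(&head->lock);
	list_for_each_entry(cur, &head->list, list) {
		if (!check(cur, arg)) {
			ret = -1;	/* callback vetoed the add */
			goto out_unlock;
		}
	}
	node->head = head;
	list_add(&node->list, &head->list);
out_unlock:
	spin_unlock(&head->lock);
	return ret;
}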

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 5/6] lib/dlock-list: Enable faster lookup with hashing
  2017-10-09 13:08   ` Davidlohr Bueso
@ 2017-10-09 14:16     ` Waiman Long
  2017-10-09 16:03       ` Davidlohr Bueso
  0 siblings, 1 reply; 31+ messages in thread
From: Waiman Long @ 2017-10-09 14:16 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Boqun Feng

On 10/09/2017 09:08 AM, Davidlohr Bueso wrote:
> On Thu, 05 Oct 2017, Waiman Long wrote:
>
>> Insertion and deletion is relatively cheap and mostly contention
>> free for dlock-list. Lookup, on the other hand, can be rather costly
>> because all the lists in a dlock-list will have to be iterated.
>>
>> Currently dlock-list insertion is based on the cpu that the task is
>> running on. So a given object can be inserted into any one of the
>> lists depending on what the current cpu is.
>>
>> This patch provides an alternative way of list selection. The caller
>> can provide a object context which will be hashed to one of the list
>> in a dlock-list. The object can then be added into that particular
>> list. Lookup can be done by iterating elements in the provided list
>> only instead of all the lists in a dlock-list.
>
> Unless I'm misusing the api, I could not find a standard way of
> iterating a _particular_ list head (the one the dlock_list_hash()
> returned). This is because iterators always want the all the heads.
>
> Also, in my particular epoll case I'd need the head->lock _not_ to
> be dropped after the iteration, and therefore is pretty adhoc.
> Currently we do:
>
> dlist_for_each_entry() {
>     // acquire head->lock for each list
> }
> // no locks held
> dlist_add()
>
> I'm thinking perhaps we could have dlist_check_add() which passes a
> callback to ensure we want to add the node. The function could acquire
> the head->lock and not release it until the very end. 

With dlock_list_hash(), the dlock-list degenerates into a pool of
lists where one is chosen by hashing. The regular list iteration
macros like list_for_each_entry() can then be used on it. Of course,
you have to do the lock and unlock operations explicitly.
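
For example, something along these lines (only a sketch; 'struct foo',
'dlist', 'key' and 'new_item' are made-up names for illustration):

	/* 'dlist' is a struct dlock_list_heads; 'foo' embeds a dlock_list_node 'dnode' */
	struct dlock_list_head *head = dlock_list_hash(&dlist, key);
	struct foo *f;

	spin_lock(&head->lock);		/* spin_lock_irqsave() if allocated irqsafe */
	list_for_each_entry(f, &head->list, dnode.list) {
		/* lookup confined to the single hashed list */
	}
	/* open-coded add while the lock is still held */
	new_item->dnode.head = head;
	list_add(&new_item->dnode.list, &head->list);
	spin_unlock(&head->lock);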

I could also encapsulate it a bit with an inline function like

dlock_list_single_iter_init(iter, dlist, head, flags)

It would set up the iterator to iterate only one list. The flags
argument could indicate that the lock should be kept held after the
iteration; in that case, dlock_list_unlock(iter) will have to be called
to do the unlock. I could add a patch to do that if you prefer that route.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 4/6] lib/dlock-list: Make sibling CPUs share the same linked list
  2017-10-05 18:43 ` [PATCH v7 4/6] lib/dlock-list: Make sibling CPUs share the same linked list Waiman Long
@ 2017-10-09 15:40   ` Jan Kara
  2017-10-09 16:14     ` Waiman Long
  0 siblings, 1 reply; 31+ messages in thread
From: Jan Kara @ 2017-10-09 15:40 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Boqun Feng, Davidlohr Bueso

On Thu 05-10-17 14:43:26, Waiman Long wrote:
> The dlock list needs one list for each of the CPUs available. However,
> for sibling CPUs, they are sharing the L2 and probably L1 caches
> too. As a result, there is not much to gain in term of avoiding
> cacheline contention while increasing the cacheline footprint of the
> L1/L2 caches as separate lists may need to be in the cache.
> 
> This patch makes all the sibling CPUs share the same list, thus
> reducing the number of lists that need to be maintained in each
> dlock list without having any noticeable impact on performance. It
> also improves dlock list iteration performance as fewer lists need
> to be iterated.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>

...

> @@ -118,7 +156,7 @@ bool dlock_lists_empty(struct dlock_list_heads *dlist)
>  {
>  	int idx;
>  
> -	for (idx = 0; idx < nr_cpu_ids; idx++)
> +	for (idx = 0; idx < nr_dlock_lists; idx++)
>  		if (!list_empty(&dlist->heads[idx].list))
>  			return false;
>  	return true;
> @@ -207,7 +245,7 @@ struct dlock_list_node *__dlock_list_next_list(struct dlock_list_iter *iter)
>  	/*
>  	 * Try next list
>  	 */
> -	if (++iter->index >= nr_cpu_ids)
> +	if (++iter->index >= nr_dlock_lists)
>  		return NULL;	/* All the entries iterated */
>  
>  	if (list_empty(&iter->head[iter->index].list))

Why do these two not need a similar treatment as alloc_dlock_list_heads()?

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 5/6] lib/dlock-list: Enable faster lookup with hashing
  2017-10-09 14:16     ` Waiman Long
@ 2017-10-09 16:03       ` Davidlohr Bueso
  2017-10-09 16:11         ` Waiman Long
  0 siblings, 1 reply; 31+ messages in thread
From: Davidlohr Bueso @ 2017-10-09 16:03 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Boqun Feng

On Mon, 09 Oct 2017, Waiman Long wrote:

>On 10/09/2017 09:08 AM, Davidlohr Bueso wrote:
>> On Thu, 05 Oct 2017, Waiman Long wrote:
>>
>>> This patch provides an alternative way of list selection. The caller
>>> can provide a object context which will be hashed to one of the list
>>> in a dlock-list. The object can then be added into that particular
>>> list. Lookup can be done by iterating elements in the provided list
>>> only instead of all the lists in a dlock-list.
>>
>> Unless I'm misusing the api, I could not find a standard way of
>> iterating a _particular_ list head (the one the dlock_list_hash()
>> returned). This is because iterators always want the all the heads.
>>
>> Also, in my particular epoll case I'd need the head->lock _not_ to
>> be dropped after the iteration, and therefore is pretty adhoc.
>> Currently we do:
>>
>> dlist_for_each_entry() {
>>     // acquire head->lock for each list
>> }
>> // no locks held
>> dlist_add()
>>
>> I'm thinking perhaps we could have dlist_check_add() which passes a
>> callback to ensure we want to add the node. The function could acquire
>> the head->lock and not release it until the very end.
>
>With the dlock_list_hash(), dlock-list is degenerated into a pool of
>list where one is chosen by hashing. So the regular list iteration
>macros like list_for_each_entry() can then be used. Of course, you have
>to explicitly do the lock and unlock operation.

Right, which seemed rather asymmetric and fragile, albeit ad hoc to epoll.
In particular, not having the iter structure makes users deal with the
spinlock directly, and there goes irq awareness out the window.

>I could also encapsulate it a bit with inlined function like

Probably not worth it unless some other use cases come up. This just
bulks up the api even more.

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 5/6] lib/dlock-list: Enable faster lookup with hashing
  2017-10-09 16:03       ` Davidlohr Bueso
@ 2017-10-09 16:11         ` Waiman Long
  0 siblings, 0 replies; 31+ messages in thread
From: Waiman Long @ 2017-10-09 16:11 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Boqun Feng

On 10/09/2017 12:03 PM, Davidlohr Bueso wrote:
> On Mon, 09 Oct 2017, Waiman Long wrote:
>
>> On 10/09/2017 09:08 AM, Davidlohr Bueso wrote:
>>> On Thu, 05 Oct 2017, Waiman Long wrote:
>>>
>>>> This patch provides an alternative way of list selection. The caller
>>>> can provide a object context which will be hashed to one of the list
>>>> in a dlock-list. The object can then be added into that particular
>>>> list. Lookup can be done by iterating elements in the provided list
>>>> only instead of all the lists in a dlock-list.
>>>
>>> Unless I'm misusing the api, I could not find a standard way of
>>> iterating a _particular_ list head (the one the dlock_list_hash()
>>> returned). This is because iterators always want the all the heads.
>>>
>>> Also, in my particular epoll case I'd need the head->lock _not_ to
>>> be dropped after the iteration, and therefore is pretty adhoc.
>>> Currently we do:
>>>
>>> dlist_for_each_entry() {
>>>     // acquire head->lock for each list
>>> }
>>> // no locks held
>>> dlist_add()
>>>
>>> I'm thinking perhaps we could have dlist_check_add() which passes a
>>> callback to ensure we want to add the node. The function could acquire
>>> the head->lock and not release it until the very end.
>>
>> With the dlock_list_hash(), dlock-list is degenerated into a pool of
>> list where one is chosen by hashing. So the regular list iteration
>> macros like list_for_each_entry() can then be used. Of course, you have
>> to explicitly do the lock and unlock operation.
>
> Right, which seemed rather asymmetric and fragile, albeit adhoc to epoll.
> Particularly not having the iter structure, makes use directly have to
> deal with the spinlock, and there goes irq awareness out the window.
>

Right. We don't need the irq safety support in this case.

>> I could also encapsulate it a bit with inlined function like
>
> Probably not worth it, unless some other use cases came up. This just
> bulks the api even more. 

I am thinking of adding only one more init API. So it is not a
significant bulking-up at all.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 4/6] lib/dlock-list: Make sibling CPUs share the same linked list
  2017-10-09 15:40   ` Jan Kara
@ 2017-10-09 16:14     ` Waiman Long
  0 siblings, 0 replies; 31+ messages in thread
From: Waiman Long @ 2017-10-09 16:14 UTC (permalink / raw)
  To: Jan Kara
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Boqun Feng, Davidlohr Bueso

On 10/09/2017 11:40 AM, Jan Kara wrote:
> On Thu 05-10-17 14:43:26, Waiman Long wrote:
>> The dlock list needs one list for each of the CPUs available. However,
>> for sibling CPUs, they are sharing the L2 and probably L1 caches
>> too. As a result, there is not much to gain in term of avoiding
>> cacheline contention while increasing the cacheline footprint of the
>> L1/L2 caches as separate lists may need to be in the cache.
>>
>> This patch makes all the sibling CPUs share the same list, thus
>> reducing the number of lists that need to be maintained in each
>> dlock list without having any noticeable impact on performance. It
>> also improves dlock list iteration performance as fewer lists need
>> to be iterated.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
> ...
>
>> @@ -118,7 +156,7 @@ bool dlock_lists_empty(struct dlock_list_heads *dlist)
>>  {
>>  	int idx;
>>  
>> -	for (idx = 0; idx < nr_cpu_ids; idx++)
>> +	for (idx = 0; idx < nr_dlock_lists; idx++)
>>  		if (!list_empty(&dlist->heads[idx].list))
>>  			return false;
>>  	return true;
>> @@ -207,7 +245,7 @@ struct dlock_list_node *__dlock_list_next_list(struct dlock_list_iter *iter)
>>  	/*
>>  	 * Try next list
>>  	 */
>> -	if (++iter->index >= nr_cpu_ids)
>> +	if (++iter->index >= nr_dlock_lists)
>>  		return NULL;	/* All the entries iterated */
>>  
>>  	if (list_empty(&iter->head[iter->index].list))
> Why these two do not need a similar treatment as alloc_dlist_heads()?
>
> 								Honza
>
I am aware that there is one vfs superblock allocation before
nr_dlock_lists is initialized. However, I don't believe a list emptiness
test or an iteration will happen that early in the boot sequence. Maybe I
should put a BUG_ON(!nr_dlock_lists) test in those functions just to be sure.
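
Something as simple as this would do, e.g. for the emptiness test (shown
here with WARN_ON_ONCE() as a milder alternative to a full BUG_ON()):

bool dlock_lists_empty(struct dlock_list_heads *dlist)
{
	int idx;

	/* Shouldn't be called before cpu2idx_init() has set nr_dlock_lists */
	WARN_ON_ONCE(!nr_dlock_lists);

	for (idx = 0; idx < nr_dlock_lists; idx++)
		if (!list_empty(&dlist->heads[idx].list))
			return false;
	return true;
}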

Cheers,
Longman

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 1/6] lib/dlock-list: Distributed and lock-protected lists
  2017-10-05 18:43 ` [PATCH v7 1/6] lib/dlock-list: Distributed and lock-protected lists Waiman Long
@ 2017-10-10  5:35   ` Boqun Feng
  2017-10-13 21:10     ` Waiman Long
  2017-10-18  8:55   ` Boqun Feng
  1 sibling, 1 reply; 31+ messages in thread
From: Boqun Feng @ 2017-10-10  5:35 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Davidlohr Bueso

On Thu, Oct 05, 2017 at 06:43:23PM +0000, Waiman Long wrote:
[...]
> +/*
> + * As all the locks in the dlock list are dynamically allocated, they need
> + * to belong to their own special lock class to avoid warning and stack
> + * trace in kernel log when lockdep is enabled. Statically allocated locks
> + * don't have this problem.
> + */
> +static struct lock_class_key dlock_list_key;
> +

So in this way, you make all dlock_lists share the same lock_class_key,
which means if there are two structures:

	struct some_a {
		...
		struct dlock_list_heads dlists;
	};

	struct some_b {
		...
		struct dlock_list_heads dlists;
	};

some_a::dlists and some_b::dlists are going to have the same lockdep
key. Is this what you want? If not, you may want to do something like
what init_srcu_struct() does.

> +/*
> + * Initialize cpu2idx mapping table
> + *
> + * It is possible that a dlock-list can be allocated before the cpu2idx is
> + * initialized. In this case, all the cpus are mapped to the first entry
> + * before initialization.
> + *
> + */
> +static int __init cpu2idx_init(void)
> +{
> +	int idx, cpu;
> +
> +	idx = 0;
> +	for_each_possible_cpu(cpu)
> +		per_cpu(cpu2idx, cpu) = idx++;
> +	return 0;
> +}
> +postcore_initcall(cpu2idx_init);
> +
> +/**
> + * alloc_dlock_list_heads - Initialize and allocate the list of head entries
> + * @dlist: Pointer to the dlock_list_heads structure to be initialized
> + * Return: 0 if successful, -ENOMEM if memory allocation error
> + *
> + * This function does not allocate the dlock_list_heads structure itself. The
> + * callers will have to do their own memory allocation, if necessary. However,
> + * this allows embedding the dlock_list_heads structure directly into other
> + * structures.
> + */
> +int alloc_dlock_list_heads(struct dlock_list_heads *dlist)
> +{
> +	int idx;
> +
> +	dlist->heads = kcalloc(nr_cpu_ids, sizeof(struct dlock_list_head),
> +			       GFP_KERNEL);
> +
> +	if (!dlist->heads)
> +		return -ENOMEM;
> +
> +	for (idx = 0; idx < nr_cpu_ids; idx++) {
> +		struct dlock_list_head *head = &dlist->heads[idx];
> +
> +		INIT_LIST_HEAD(&head->list);
> +		head->lock = __SPIN_LOCK_UNLOCKED(&head->lock);
> +		lockdep_set_class(&head->lock, &dlock_list_key);
> +	}
> +	return 0;
> +}
> +
> +/**
> + * free_dlock_list_heads - Free all the heads entries of the dlock list
> + * @dlist: Pointer of the dlock_list_heads structure to be freed
> + *
> + * This function doesn't free the dlock_list_heads structure itself. So
> + * the caller will have to do it, if necessary.
> + */
> +void free_dlock_list_heads(struct dlock_list_heads *dlist)
> +{
> +	kfree(dlist->heads);
> +	dlist->heads = NULL;
> +}
> +
> +/**
> + * dlock_lists_empty - Check if all the dlock lists are empty
> + * @dlist: Pointer to the dlock_list_heads structure
> + * Return: true if list is empty, false otherwise.
> + *
> + * This can be a pretty expensive function call. If this function is required
> + * in a performance critical path, we may have to maintain a global count
> + * of the list entries in the global dlock_list_heads structure instead.
> + */
> +bool dlock_lists_empty(struct dlock_list_heads *dlist)
> +{
> +	int idx;
> +
> +	for (idx = 0; idx < nr_cpu_ids; idx++)
> +		if (!list_empty(&dlist->heads[idx].list))
> +			return false;
> +	return true;
> +}
> +
> +/**
> + * dlock_lists_add - Adds a node to the given dlock list
> + * @node : The node to be added
> + * @dlist: The dlock list where the node is to be added
> + *
> + * List selection is based on the CPU being used when the dlock_list_add()
> + * function is called. However, deletion may be done by a different CPU.
> + */
> +void dlock_lists_add(struct dlock_list_node *node,
> +		     struct dlock_list_heads *dlist)
> +{
> +	struct dlock_list_head *head = &dlist->heads[this_cpu_read(cpu2idx)];
> +
> +	/*
> +	 * There is no need to disable preemption
> +	 */
> +	spin_lock(&head->lock);
> +	node->head = head;
> +	list_add(&node->list, &head->list);
> +	spin_unlock(&head->lock);
> +}
> +
> +/**
> + * dlock_lists_del - Delete a node from a dlock list
> + * @node : The node to be deleted
> + *
> + * We need to check the lock pointer again after taking the lock to guard
> + * against concurrent deletion of the same node. If the lock pointer changes
> + * (becomes NULL or to a different one), we assume that the deletion was done
> + * elsewhere. A warning will be printed if this happens as it is likely to be
> + * a bug.
> + */
> +void dlock_lists_del(struct dlock_list_node *node)
> +{
> +	struct dlock_list_head *head;
> +	bool retry;
> +
> +	do {
> +		head = READ_ONCE(node->head);

Since we read node->head locklessly here, I think we should use
WRITE_ONCE() for all the stores of node->head, to avoid store tearing?

Regards,
Boqun

> +		if (WARN_ONCE(!head, "%s: node 0x%lx has no associated head\n",
> +			      __func__, (unsigned long)node))
> +			return;
> +
> +		spin_lock(&head->lock);
> +		if (likely(head == node->head)) {
> +			list_del_init(&node->list);
> +			node->head = NULL;
> +			retry = false;
> +		} else {
> +			/*
> +			 * The lock has somehow changed. Retry again if it is
> +			 * not NULL. Otherwise, just ignore the delete
> +			 * operation.
> +			 */
> +			retry = (node->head != NULL);
> +		}
> +		spin_unlock(&head->lock);
> +	} while (retry);
> +}
> +
[...]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v7 7/6] fs/epoll: scale nested callbacks
  2017-10-05 18:43 [PATCH v7 0/6] vfs: Use dlock list for SB's s_inodes list Waiman Long
                   ` (5 preceding siblings ...)
  2017-10-05 18:43 ` [PATCH v7 6/6] lib/dlock-list: Add an IRQ-safe mode to be used in interrupt handler Waiman Long
@ 2017-10-13 15:45 ` Davidlohr Bueso
  2017-10-16 19:30   ` Jason Baron
  2017-10-17 19:36 ` [PATCH v7 8/9] lib/dlock-list: Export symbols and add warnings Waiman Long
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 31+ messages in thread
From: Davidlohr Bueso @ 2017-10-13 15:45 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Boqun Feng, akpm

A customer reported massive contention on the ncalls->lock in a
workload designed around nested epolls (where the polled fd is itself
an epoll).

 83.49%  [kernel]               [k] __pv_queued_spin_lock_slowpath
  2.70%  [kernel]               [k] ep_call_nested.constprop.13
  1.94%  [kernel]               [k] _raw_spin_lock_irqsave
  1.83%  [kernel]               [k] __raw_callee_save___pv_queued_spin_unlock
  1.45%  [kernel]               [k] _raw_spin_unlock_irqrestore
  0.41%  [kernel]               [k] ep_scan_ready_list.isra.8
  0.36%  [kernel]               [k] pvclock_clocksource_read

The application, running on kvm, uses shared memory IPC communication
with a pipe wakeup mechanism, and a two-level dispatcher both built around
'epoll_wait'. There is one process per physical core and a full mesh of pipes
between them, so on an 18-core system (the actual case), there are 18*18 pipes
for the IPCs alone.

This patch proposes replacing the nested-calls global linked list with a dlock
interface, so that we can benefit from pcpu lists when doing ep_poll_safewake()
and from hashing on the current task, which provides two benefits:

1. Speeds up the loop and max-depth checks from O(N) lookups to O(1)
   (modulo possible collisions, which we have to iterate); and,

2. Provides smaller lock granularity.

cpus	before		after	   diff
1	1409370		1344804     -4.58%
2	1015690		1337053     31.63%
3	 721009		1273254     76.59%
4	 380959		1128931    196.33%
5	 287603		1028362    257.56%
6	 221104		 894943    304.76%
7	 173592		 976395    462.46%
8	 145860		 922584    532.51%
9	 127877		 925900    624.05%
10	 112603		 791456    602.87%
11	  97926		 724296    639.63%
12	  80732		 730485    804.82%

With the exception of a single cpu (where the overhead of jhashing shows), we
get a pretty good raw throughput increase. Similarly, the amount of time spent
decreases immensely as well.

Passes ltp related testcases.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
 fs/eventpoll.c | 88 +++++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 53 insertions(+), 35 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 2fabd19cdeea..675c97fdc5da 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -22,7 +22,7 @@
 #include <linux/slab.h>
 #include <linux/poll.h>
 #include <linux/string.h>
-#include <linux/list.h>
+#include <linux/dlock-list.h>
 #include <linux/hash.h>
 #include <linux/spinlock.h>
 #include <linux/syscalls.h>
@@ -119,7 +119,7 @@ struct epoll_filefd {
  * and loop cycles.
  */
 struct nested_call_node {
-	struct list_head llink;
+	struct dlock_list_node llink;
 	void *cookie;
 	void *ctx;
 };
@@ -129,8 +129,7 @@ struct nested_call_node {
  * maximum recursion dept and loop cycles.
  */
 struct nested_calls {
-	struct list_head tasks_call_list;
-	spinlock_t lock;
+	struct dlock_list_heads tasks_call_list;
 };
 
 /*
@@ -371,10 +370,14 @@ static inline int ep_op_has_event(int op)
 }
 
 /* Initialize the poll safe wake up structure */
-static void ep_nested_calls_init(struct nested_calls *ncalls)
+static inline int ep_nested_calls_init(struct nested_calls *ncalls)
+{
+	return alloc_dlock_list_heads(&ncalls->tasks_call_list, true);
+}
+
+static inline void ep_nested_calls_free(struct nested_calls *ncalls)
 {
-	INIT_LIST_HEAD(&ncalls->tasks_call_list);
-	spin_lock_init(&ncalls->lock);
+	free_dlock_list_heads(&ncalls->tasks_call_list);
 }
 
 /**
@@ -483,45 +486,50 @@ static int ep_call_nested(struct nested_calls *ncalls, int max_nests,
 {
 	int error, call_nests = 0;
 	unsigned long flags;
-	struct list_head *lsthead = &ncalls->tasks_call_list;
-	struct nested_call_node *tncur;
-	struct nested_call_node tnode;
+	struct dlock_list_head *head;
+	struct nested_call_node *tncur, tnode;
 
-	spin_lock_irqsave(&ncalls->lock, flags);
+	head = dlock_list_hash(&ncalls->tasks_call_list, ctx);
 
+	spin_lock_irqsave(&head->lock, flags);
 	/*
 	 * Try to see if the current task is already inside this wakeup call.
-	 * We use a list here, since the population inside this set is always
-	 * very much limited.
+	 *
+	 * We use a chained hash table with linked lists here, and take the
+	 * lock to avoid racing when collisions (where ctx pointer is the key).
+	 * Calls for which context is the cpu number, avoid hashing and act as
+	 * pcpu add/removal.
+	 *
+	 * Since the population inside this set is always very much limited,
+	 * list scanning should be short.
 	 */
-	list_for_each_entry(tncur, lsthead, llink) {
-		if (tncur->ctx == ctx &&
-		    (tncur->cookie == cookie || ++call_nests > max_nests)) {
-			/*
-			 * Ops ... loop detected or maximum nest level reached.
-			 * We abort this wake by breaking the cycle itself.
-			 */
-			error = -1;
-			goto out_unlock;
-		}
-	}
+	list_for_each_entry(tncur, &head->list, llink.list) {
+	       if (tncur->ctx == ctx &&
+		   (tncur->cookie == cookie || ++call_nests > EP_MAX_NESTS)) {
+		       /*
+			* Ops ... loop detected or maximum nest level reached.
+			* We abort this wake by breaking the cycle itself.
+			*/
+		       error = -1;
+		       spin_unlock_irqrestore(&head->lock, flags);
+		       goto done;
+	       }
+       }
+
 
 	/* Add the current task and cookie to the list */
 	tnode.ctx = ctx;
 	tnode.cookie = cookie;
-	list_add(&tnode.llink, lsthead);
-
-	spin_unlock_irqrestore(&ncalls->lock, flags);
+	tnode.llink.head = head;
+	list_add(&tnode.llink.list, &head->list);
+	spin_unlock_irqrestore(&head->lock, flags);
 
 	/* Call the nested function */
 	error = (*nproc)(priv, cookie, call_nests);
 
 	/* Remove the current task from the list */
-	spin_lock_irqsave(&ncalls->lock, flags);
-	list_del(&tnode.llink);
-out_unlock:
-	spin_unlock_irqrestore(&ncalls->lock, flags);
-
+	dlock_lists_del(&tnode.llink);
+done:
 	return error;
 }
 
@@ -2313,13 +2321,16 @@ static int __init eventpoll_init(void)
 	 * Initialize the structure used to perform epoll file descriptor
 	 * inclusion loops checks.
 	 */
-	ep_nested_calls_init(&poll_loop_ncalls);
+	if (ep_nested_calls_init(&poll_loop_ncalls))
+		goto err;
 
 	/* Initialize the structure used to perform safe poll wait head wake ups */
-	ep_nested_calls_init(&poll_safewake_ncalls);
+	if (ep_nested_calls_init(&poll_safewake_ncalls))
+		goto err_free1;
 
 	/* Initialize the structure used to perform file's f_op->poll() calls */
-	ep_nested_calls_init(&poll_readywalk_ncalls);
+	if (ep_nested_calls_init(&poll_readywalk_ncalls))
+		goto err_free0;
 
 	/*
 	 * We can have many thousands of epitems, so prevent this from
@@ -2336,5 +2347,12 @@ static int __init eventpoll_init(void)
 			sizeof(struct eppoll_entry), 0, SLAB_PANIC, NULL);
 
 	return 0;
+
+err_free0:
+	ep_nested_calls_free(&poll_safewake_ncalls);
+err_free1:
+	ep_nested_calls_free(&poll_loop_ncalls);
+err:
+	return -ENOMEM;
 }
 fs_initcall(eventpoll_init);
-- 
2.12.0

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 1/6] lib/dlock-list: Distributed and lock-protected lists
  2017-10-10  5:35   ` Boqun Feng
@ 2017-10-13 21:10     ` Waiman Long
  0 siblings, 0 replies; 31+ messages in thread
From: Waiman Long @ 2017-10-13 21:10 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Davidlohr Bueso

On 10/10/2017 01:35 AM, Boqun Feng wrote:
> On Thu, Oct 05, 2017 at 06:43:23PM +0000, Waiman Long wrote:
> [...]
>> +/*
>> + * As all the locks in the dlock list are dynamically allocated, they need
>> + * to belong to their own special lock class to avoid warning and stack
>> + * trace in kernel log when lockdep is enabled. Statically allocated locks
>> + * don't have this problem.
>> + */
>> +static struct lock_class_key dlock_list_key;
>> +
> So in this way, you make all dlock_lists share the same lock_class_key,
> which means if there are two structures:
>
> 	struct some_a {
> 		...
> 		struct dlock_list_heads dlists;
> 	};
>
> 	struct some_b {
> 		...
> 		struct dlock_list_heads dlists;
> 	};
>
> some_a::dlists and some_b::dlists are going to have the same lockdep
> key, is this what you want? If not, you may want to do something like
> init_srcu_struct() does.

I think it will be a problem only if a task acquires a lock in one
dlock-list and then acquires another lock from another dlock-list. The
way the dlock-list is used, no more than one lock will be acquired at
any time, so there won't be nested locking within the same dlock-list.
It is not a problem with the current use cases that Davidlohr and I
have, but it may be a problem if dlock-list becomes more widely used. I
will take a look at what init_srcu_struct() does, and maybe update the
patch accordingly. Thanks for the suggestion.
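
The rough direction would be to keep the real allocator internal and hand
it a per-call-site key from a wrapper macro, e.g. (sketch only, final
names may differ):

extern int __alloc_dlock_list_heads(struct dlock_list_heads *dlist,
				    struct lock_class_key *key);

/* Each call site gets its own static lock class key, like init_srcu_struct() */
#define alloc_dlock_list_heads(dlist)					\
({									\
	static struct lock_class_key _key;				\
	__alloc_dlock_list_heads(dlist, &_key);				\
})

with __alloc_dlock_list_heads() doing lockdep_set_class(&head->lock, key)
instead of using the shared dlock_list_key.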

>
>> + * dlock_lists_del - Delete a node from a dlock list
>> + * @node : The node to be deleted
>> + *
>> + * We need to check the lock pointer again after taking the lock to guard
>> + * against concurrent deletion of the same node. If the lock pointer changes
>> + * (becomes NULL or to a different one), we assume that the deletion was done
>> + * elsewhere. A warning will be printed if this happens as it is likely to be
>> + * a bug.
>> + */
>> +void dlock_lists_del(struct dlock_list_node *node)
>> +{
>> +	struct dlock_list_head *head;
>> +	bool retry;
>> +
>> +	do {
>> +		head = READ_ONCE(node->head);
> Since we read node->head locklessly here, I think we should use
> WRITE_ONCE() for all the stores of node->head, to avoid store tearings?

Yes, you are right. I will use WRITE_ONCE() in my next version.
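
Roughly, the stores would then become (sketch of the intended change):

	/* dlock_lists_add(), under head->lock */
	WRITE_ONCE(node->head, head);
	list_add(&node->list, &head->list);

	/* dlock_lists_del(), under head->lock */
	if (likely(head == node->head)) {
		list_del_init(&node->list);
		WRITE_ONCE(node->head, NULL);
		retry = false;
	}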

Cheers,
Longman

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 7/6] fs/epoll: scale nested callbacks
  2017-10-13 15:45 ` [PATCH v7 7/6] fs/epoll: scale nested callbacks Davidlohr Bueso
@ 2017-10-16 19:30   ` Jason Baron
  2017-10-17 15:53     ` Davidlohr Bueso
  0 siblings, 1 reply; 31+ messages in thread
From: Jason Baron @ 2017-10-16 19:30 UTC (permalink / raw)
  To: Davidlohr Bueso, Waiman Long, akpm
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Boqun Feng

Hi,

I posted a patch that completely removes the lock from the
ep_poll_safewake() path here:

http://lkml.iu.edu/hypermail/linux/kernel/1710.1/05834.html

So these are going to conflict. The reason it's safe to remove the lock
is that there are now loop and depth checks performed during
EPOLL_CTL_ADD, specifically in ep_loop_check(). I would prefer to do these
checks once at add time as opposed to at each wakeup (even if they can
be made to scale better).

I also have a pending patch to do something similar for
poll_readywalk_ncalls, but I think that poll_safewake_ncalls is the most
egregious one here?

Thanks,

-Jason

On 10/13/2017 11:45 AM, Davidlohr Bueso wrote:
> [...]
^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 7/6] fs/epoll: scale nested callbacks
  2017-10-16 19:30   ` Jason Baron
@ 2017-10-17 15:53     ` Davidlohr Bueso
  2017-10-18 14:06       ` Jason Baron
  0 siblings, 1 reply; 31+ messages in thread
From: Davidlohr Bueso @ 2017-10-17 15:53 UTC (permalink / raw)
  To: Jason Baron
  Cc: Waiman Long, akpm, Alexander Viro, Jan Kara, Jeff Layton,
	J. Bruce Fields, Tejun Heo, Christoph Lameter, linux-fsdevel,
	linux-kernel, Ingo Molnar, Peter Zijlstra, Andi Kleen,
	Dave Chinner, Boqun Feng

On Mon, 16 Oct 2017, Jason Baron wrote:

>Hi,
>
>I posted a patch to completely remove the lock here from the
>ep_poll_safewake() patch here:
>
>http://lkml.iu.edu/hypermail/linux/kernel/1710.1/05834.html

Kernel development never ceases to amaze me. Two complementary
fixes to an 8+ y/o performance issue on the same day - not that
nested epolls are that common, but they also come from two real
workloads...

Getting rid of the lock altogether is always the best way.

>
>So these are going to conflict. The reason its safe to remove the lock
>is that there are loop and depth checks now that are performed during
>EPOLL_CTL_ADD. Specifically, ep_loop_check(). I would prefer to these
>checks once add add time as opposed to at each wakeup (even if they can
>be scaled better).

Wrt conflicts, no worries, I'll rebase -- and hopefully we can get
the dlock stuff in for v4.15 as well.

>
>I also have a pending patch to do something similar for
>poll_readywalk_ncalls, but I think that poll_safewake_ncalls is the most
>egregious one here?

The customer's workload issues are with the loop_ncalls and readywalk_ncalls
lists, so I'd be interested in seeing your patch for the latter. The reason
your patch above is unlikely to help my scenario is that most of the time
is spent at the dispatcher level doing epoll_wait, with not too many
EPOLL_CTL_ADDs going on afaict.

Thanks,
Davidlohr

>
>Thanks,
>
>-Jason
>
>On 10/13/2017 11:45 AM, Davidlohr Bueso wrote:
>> [...]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v7 8/9] lib/dlock-list: Export symbols and add warnings
  2017-10-05 18:43 [PATCH v7 0/6] vfs: Use dlock list for SB's s_inodes list Waiman Long
                   ` (6 preceding siblings ...)
  2017-10-13 15:45 ` [PATCH v7 7/6] fs/epoll: scale nested callbacks Davidlohr Bueso
@ 2017-10-17 19:36 ` Waiman Long
  2017-10-17 19:36   ` [PATCH v7 9/9] lib/dlock-list: Unique lock class key for each allocation call site Waiman Long
  2017-10-26 18:28 ` [PATCH v7 0/6] vfs: Use dlock list for SB's s_inodes list Waiman Long
  2017-10-27 20:10 ` [PATCH v7 10/10] lib/dlock-list: Fix use-after-unlock problem in dlist_for_each_entry_safe() Waiman Long
  9 siblings, 1 reply; 31+ messages in thread
From: Waiman Long @ 2017-10-17 19:36 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng, Davidlohr Bueso,
	Waiman Long

The EXPORT_SYMBOL() macro is now used in lib/dlock-list.c to enable
kernel modules to use dlock-list. Some warning messages are also added.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 lib/dlock-list.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/lib/dlock-list.c b/lib/dlock-list.c
index 4fded20..6ce5c7193 100644
--- a/lib/dlock-list.c
+++ b/lib/dlock-list.c
@@ -132,6 +132,7 @@ int alloc_dlock_list_heads(struct dlock_list_heads *dlist, int irqsafe)
 	}
 	return 0;
 }
+EXPORT_SYMBOL(alloc_dlock_list_heads);
 
 /**
  * free_dlock_list_heads - Free all the heads entries of the dlock list
@@ -145,6 +146,7 @@ void free_dlock_list_heads(struct dlock_list_heads *dlist)
 	kfree(dlist->heads);
 	dlist->heads = NULL;
 }
+EXPORT_SYMBOL(free_dlock_list_heads);
 
 /**
  * dlock_lists_empty - Check if all the dlock lists are empty
@@ -159,11 +161,15 @@ bool dlock_lists_empty(struct dlock_list_heads *dlist)
 {
 	int idx;
 
+	/* Shouldn't be called before nr_dlock_lists is initialized */
+	WARN_ON_ONCE(!nr_dlock_lists);
+
 	for (idx = 0; idx < nr_dlock_lists; idx++)
 		if (!list_empty(&dlist->heads[idx].list))
 			return false;
 	return true;
 }
+EXPORT_SYMBOL(dlock_lists_empty);
 
 /**
  * dlock_list_hash - Hash the given context to a particular list
@@ -187,6 +193,7 @@ struct dlock_list_head *dlock_list_hash(struct dlock_list_heads *dlist,
 				% nr_dlock_lists;
 	return &dlist->heads[hash];
 }
+EXPORT_SYMBOL(dlock_list_hash);
 
 /**
  * dlock_list_add - Add a node to a particular head of dlock list
@@ -210,6 +217,7 @@ void dlock_list_add(struct dlock_list_node *node,
 		spin_unlock(&head->lock);
 	}
 }
+EXPORT_SYMBOL(dlock_list_add);
 
 /**
  * dlock_lists_add - Adds a node to the given dlock list
@@ -226,6 +234,7 @@ void dlock_lists_add(struct dlock_list_node *node,
 
 	dlock_list_add(node, head);
 }
+EXPORT_SYMBOL(dlock_lists_add);
 
 /**
  * dlock_lists_del - Delete a node from a dlock list
@@ -273,6 +282,7 @@ void dlock_lists_del(struct dlock_list_node *node)
 			spin_unlock(&head->lock);
 	} while (retry);
 }
+EXPORT_SYMBOL(dlock_lists_del);
 
 /**
  * __dlock_list_next_list: Find the first entry of the next available list
@@ -287,6 +297,9 @@ struct dlock_list_node *__dlock_list_next_list(struct dlock_list_iter *iter)
 	struct dlock_list_node *next;
 	struct dlock_list_head *head;
 
+	/* Shouldn't be called before nr_dlock_lists is initialized */
+	WARN_ON_ONCE(!nr_dlock_lists);
+
 restart:
 	if (iter->entry) {
 		dlock_list_unlock(iter);
@@ -320,3 +333,4 @@ struct dlock_list_node *__dlock_list_next_list(struct dlock_list_iter *iter)
 
 	return next;
 }
+EXPORT_SYMBOL(__dlock_list_next_list);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v7 9/9] lib/dlock-list: Unique lock class key for each allocation call site
  2017-10-17 19:36 ` [PATCH v7 8/9] lib/dlock-list: Export symbols and add warnings Waiman Long
@ 2017-10-17 19:36   ` Waiman Long
  0 siblings, 0 replies; 31+ messages in thread
From: Waiman Long @ 2017-10-17 19:36 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng, Davidlohr Bueso,
	Waiman Long

Boqun Feng has kindly pointed out that the same lock class key is
used for all dlock-list allocations. That can be a problem in case
a task needs to acquire the locks of more than one dlock-list at the
same time with lockdep enabled.

To avoid this problem, the alloc_dlock_list_heads() function is
changed to use a different lock class key for each of its call sites
in the kernel.

Reported-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/dlock-list.h | 16 +++++++++++++++-
 lib/dlock-list.c           | 21 +++++++++------------
 2 files changed, 24 insertions(+), 13 deletions(-)

diff --git a/include/linux/dlock-list.h b/include/linux/dlock-list.h
index 2ba7b4f..02c5f4d 100644
--- a/include/linux/dlock-list.h
+++ b/include/linux/dlock-list.h
@@ -116,9 +116,23 @@ static inline void dlock_list_relock(struct dlock_list_iter *iter)
 /*
  * Allocation and freeing of dlock list
  */
-extern int  alloc_dlock_list_heads(struct dlock_list_heads *dlist, int irqsafe);
+extern int  __alloc_dlock_list_heads(struct dlock_list_heads *dlist,
+				     int irqsafe, struct lock_class_key *key);
 extern void free_dlock_list_heads(struct dlock_list_heads *dlist);
 
+/**
+ * alloc_dlock_list_heads - Initialize and allocate the list of head entries.
+ * @dlist  : Pointer to the dlock_list_heads structure to be initialized
+ * @irqsafe: IRQ safe mode flag
+ * Return  : 0 if successful, -ENOMEM if memory allocation error
+ */
+#define alloc_dlock_list_heads(dlist, irqsafe)				\
+({									\
+	static struct lock_class_key _key;				\
+	int _ret = __alloc_dlock_list_heads(dlist, irqsafe, &_key);	\
+	_ret;								\
+})
+
 /*
  * Check if a dlock list is empty or not.
  */
diff --git a/lib/dlock-list.c b/lib/dlock-list.c
index 6ce5c7193..17e182b 100644
--- a/lib/dlock-list.c
+++ b/lib/dlock-list.c
@@ -36,14 +36,6 @@
 static int nr_dlock_lists __read_mostly;
 
 /*
- * As all the locks in the dlock list are dynamically allocated, they need
- * to belong to their own special lock class to avoid warning and stack
- * trace in kernel log when lockdep is enabled. Statically allocated locks
- * don't have this problem.
- */
-static struct lock_class_key dlock_list_key;
-
-/*
  * Initialize cpu2idx mapping table & nr_dlock_lists.
  *
  * It is possible that a dlock-list can be allocated before the cpu2idx is
@@ -98,9 +90,10 @@ static int __init cpu2idx_init(void)
 postcore_initcall(cpu2idx_init);
 
 /**
- * alloc_dlock_list_heads - Initialize and allocate the list of head entries
+ * __alloc_dlock_list_heads - Initialize and allocate the list of head entries
  * @dlist  : Pointer to the dlock_list_heads structure to be initialized
  * @irqsafe: IRQ safe mode flag
+ * @key    : The lock class key to be used for lockdep
  * Return: 0 if successful, -ENOMEM if memory allocation error
  *
  * This function does not allocate the dlock_list_heads structure itself. The
@@ -112,8 +105,12 @@ static int __init cpu2idx_init(void)
  * than necessary allocated is not a problem other than some wasted memory.
  * The extra lists will not be ever used as all the cpu2idx entries will be
  * 0 before initialization.
+ *
+ * Dynamically allocated locks need to have their own special lock class
+ * to avoid lockdep warning.
  */
-int alloc_dlock_list_heads(struct dlock_list_heads *dlist, int irqsafe)
+int __alloc_dlock_list_heads(struct dlock_list_heads *dlist, int irqsafe,
+			     struct lock_class_key *key)
 {
 	int idx, cnt = nr_dlock_lists ? nr_dlock_lists : nr_cpu_ids;
 
@@ -128,11 +125,11 @@ int alloc_dlock_list_heads(struct dlock_list_heads *dlist, int irqsafe)
 		INIT_LIST_HEAD(&head->list);
 		head->lock = __SPIN_LOCK_UNLOCKED(&head->lock);
 		head->irqsafe = irqsafe;
-		lockdep_set_class(&head->lock, &dlock_list_key);
+		lockdep_set_class(&head->lock, key);
 	}
 	return 0;
 }
-EXPORT_SYMBOL(alloc_dlock_list_heads);
+EXPORT_SYMBOL(__alloc_dlock_list_heads);
 
 /**
  * free_dlock_list_heads - Free all the heads entries of the dlock list
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 1/6] lib/dlock-list: Distributed and lock-protected lists
  2017-10-05 18:43 ` [PATCH v7 1/6] lib/dlock-list: Distributed and lock-protected lists Waiman Long
  2017-10-10  5:35   ` Boqun Feng
@ 2017-10-18  8:55   ` Boqun Feng
  1 sibling, 0 replies; 31+ messages in thread
From: Boqun Feng @ 2017-10-18  8:55 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Davidlohr Bueso

[-- Attachment #1: Type: text/plain, Size: 6105 bytes --]

On Thu, Oct 05, 2017 at 06:43:23PM +0000, Waiman Long wrote:
[...]
> +/*
> + * Find the first entry of the next available list.
> + */
> +extern struct dlock_list_node *
> +__dlock_list_next_list(struct dlock_list_iter *iter);
> +
> +/**
> + * __dlock_list_next_entry - Iterate to the next entry of the dlock list
> + * @curr : Pointer to the current dlock_list_node structure
> + * @iter : Pointer to the dlock list iterator structure
> + * Return: Pointer to the next entry or NULL if all the entries are iterated
> + *
> + * The iterator has to be properly initialized before calling this function.
> + */
> +static inline struct dlock_list_node *
> +__dlock_list_next_entry(struct dlock_list_node *curr,
> +			struct dlock_list_iter *iter)
> +{
> +	/*
> +	 * Find next entry
> +	 */
> +	if (curr)
> +		curr = list_next_entry(curr, list);
> +
> +	if (!curr || (&curr->list == &iter->entry->list)) {
> +		/*
> +		 * The current list has been exhausted, try the next available
> +		 * list.
> +		 */
> +		curr = __dlock_list_next_list(iter);
> +	}
> +
> +	return curr;	/* Continue the iteration */
> +}
> +
> +/**
> + * dlock_list_first_entry - get the first element from a list
> + * @iter  : The dlock list iterator.
> + * @type  : The type of the struct this is embedded in.
> + * @member: The name of the dlock_list_node within the struct.
> + * Return : Pointer to the next entry or NULL if all the entries are iterated.
> + */
> +#define dlock_list_first_entry(iter, type, member)			\
> +	({								\
> +		struct dlock_list_node *_n;				\
> +		_n = __dlock_list_next_entry(NULL, iter);		\
> +		_n ? list_entry(_n, type, member) : NULL;		\
> +	})
> +
> +/**
> + * dlock_list_next_entry - iterate to the next entry of the list
> + * @pos   : The type * to cursor
> + * @iter  : The dlock list iterator.
> + * @member: The name of the dlock_list_node within the struct.
> + * Return : Pointer to the next entry or NULL if all the entries are iterated.
> + *
> + * Note that pos can't be NULL.
> + */
> +#define dlock_list_next_entry(pos, iter, member)			\
> +	({								\
> +		struct dlock_list_node *_n;				\
> +		_n = __dlock_list_next_entry(&(pos)->member, iter);	\
> +		_n ? list_entry(_n, typeof(*(pos)), member) : NULL;	\
> +	})
> +

[...]

> +/**
> + * dlist_for_each_entry_safe - iterate over the dlock list & safe over removal
> + * @pos   : Type * to use as a loop cursor
> + * @n	  : Another type * to use as temporary storage
> + * @iter  : The dlock list iterator
> + * @member: The name of the dlock_list_node within the struct
> + *
> + * This iteration macro is safe with respect to list entry removal.
> + * However, it cannot correctly iterate newly added entries right after the
> + * current one.
> + */
> +#define dlist_for_each_entry_safe(pos, n, iter, member)			\

So I missed something interesting here ;-)

> +	for (pos = dlock_list_first_entry(iter, typeof(*(pos)), member);\
> +	    ({								\
> +		bool _b = (pos != NULL);				\
> +		if (_b)							\
> +			n = dlock_list_next_entry(pos, iter, member);	\

If @pos is the last item of the list for the current index/cpu,
dlock_list_next_entry() will eventually call __dlock_list_next_list(),
which drops the lock for the current list and grabs the lock for the
next list, leaving @pos unprotected. In the meanwhile, another thread
could delete @pos via dlock_lists_del() and free it. This is a
use-after-free.
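
To make the window concrete, a rough interleaving sketch (obj is a
made-up name for whatever structure embeds the dlock_list_node, and the
kfree() just stands for however the deleter frees it):

	CPU 0 (iterator)                      CPU 1
	----------------                      -----
	n = dlock_list_next_entry(pos, ...)
	  __dlock_list_next_list(iter)
	    spin_unlock(current list lock)    /* @pos now unprotected */
	                                      dlock_lists_del(&obj->node)
	                                      kfree(obj)  /* obj == @pos */
	    spin_lock(next list lock)
	loop body dereferences @pos           /* use-after-free */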

I think we can have something like:

	(by adding a ->prev_entry in dlock_list_iter and several helper
	functions.)

	bool dlist_is_last_perlist(struct dlock_list_node *n)
	{
		return list_is_last(&n->list, &n->head->list);
	}

	void dlock_list_release_prev(struct dlock_list_iter *iter)
	{
		spin_unlock(iter->prev_entry->lock);
		iter->prev_entry = NULL;
	}

	#define dlist_for_each_entry_safe(pos, n, iter, member)		\
		for (pos = dlock_list_first_entry(iter, typeof(*(pos)), member);	\
		    ({									\
			bool _b = (pos != NULL);					\
			if (_b) {							\
				if (dlist_is_last_perlist(&(pos)->member)) {		\
					iter->prev_entry = iter->entry;			\
					iter->entry = NULL;				\
					n = dlock_list_first_entry(NULL, iter, member);	\
				}							\
				else							\
					n = dlock_list_next_entry(pos, iter, member);	\
			}								\
			_b;								\
		    });									\
		    pos = n, iter->prev_entry && dlock_list_release_prev(iter))

Of course, the dlock_list_first_entry() here may need a better name ;-)

Thoughts?

Regards,
Boqun

> +		_b;							\
> +	    });								\
> +	    pos = n)
> +
> +#endif /* __LINUX_DLOCK_LIST_H */

[...]

> +/**
> + * __dlock_list_next_list: Find the first entry of the next available list
> + * @dlist: Pointer to the dlock_list_heads structure
> + * @iter : Pointer to the dlock list iterator structure
> + * Return: true if the entry is found, false if all the lists exhausted
> + *
> + * The information about the next available list will be put into the iterator.
> + */
> +struct dlock_list_node *__dlock_list_next_list(struct dlock_list_iter *iter)
> +{
> +	struct dlock_list_node *next;
> +	struct dlock_list_head *head;
> +
> +restart:
> +	if (iter->entry) {
> +		spin_unlock(&iter->entry->lock);
> +		iter->entry = NULL;
> +	}
> +
> +next_list:
> +	/*
> +	 * Try next list
> +	 */
> +	if (++iter->index >= nr_cpu_ids)
> +		return NULL;	/* All the entries iterated */
> +
> +	if (list_empty(&iter->head[iter->index].list))
> +		goto next_list;
> +
> +	head = iter->entry = &iter->head[iter->index];
> +	spin_lock(&head->lock);
> +	/*
> +	 * There is a slight chance that the list may become empty just
> +	 * before the lock is acquired. So an additional check is
> +	 * needed to make sure that a valid node will be returned.
> +	 */
> +	if (list_empty(&head->list))
> +		goto restart;
> +
> +	next = list_entry(head->list.next, struct dlock_list_node,
> +			  list);
> +	WARN_ON_ONCE(next->head != head);
> +
> +	return next;
> +}
> -- 
> 1.8.3.1
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 7/6] fs/epoll: scale nested callbacks
  2017-10-17 15:53     ` Davidlohr Bueso
@ 2017-10-18 14:06       ` Jason Baron
  2017-10-18 15:44         ` Davidlohr Bueso
  0 siblings, 1 reply; 31+ messages in thread
From: Jason Baron @ 2017-10-18 14:06 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Waiman Long, akpm, Alexander Viro, Jan Kara, Jeff Layton,
	J. Bruce Fields, Tejun Heo, Christoph Lameter, linux-fsdevel,
	linux-kernel, Ingo Molnar, Peter Zijlstra, Andi Kleen,
	Dave Chinner, Boqun Feng



On 10/17/2017 11:53 AM, Davidlohr Bueso wrote:
> On Mon, 16 Oct 2017, Jason Baron wrote:
> 
>> Hi,
>>
>> I posted a patch to completely remove the lock here from the
>> ep_poll_safewake() patch here:
>>
>> http://lkml.iu.edu/hypermail/linux/kernel/1710.1/05834.html
> 
> Kernel development never ceases to amaze me. Two complementary
> fixes to an 8+ y/o performance issue on the same day - not that
> nested epolls are that common, but they also come from two real
> workloads...
> 
> Getting rid of the lock altogether is always the best way.
> 
>>
>> So these are going to conflict. The reason it's safe to remove the lock
>> is that there are loop and depth checks that are now performed during
>> EPOLL_CTL_ADD, specifically ep_loop_check(). I would prefer to do these
>> checks once at add time as opposed to at each wakeup (even if they can
>> be scaled better).
> 
> Wrt conflicts, no worries, I'll rebase -- and hopefully we can get
> the dlock stuff in for v4.15 as well.
> 
>>
>> I also have a pending patch to do something similar for
>> poll_readywalk_ncalls, but I think that poll_safewake_ncalls is the most
>> egregious one here?
> 
> The customer's workload issues are with the loop_ncalls and readywalk_ncalls
> lists, so I'd be interested in seeing your patch for the latter. The reason
> your patch above is unlikely to help my scenario is that most of the time
> is spent at the dispatcher level doing epoll_wait, with not too many
> EPOLL_CTL_ADDs going on afaict.

If there are not many EPOLL_CTL_ADDs, then I wouldn't think loop_ncalls
would be highly contended (since it should only be used from the add
path)?

Thanks,

-Jason


> 
> Thanks,
> Davidlohr
> 
>>
>> Thanks,
>>
>> -Jason
>>
>> On 10/13/2017 11:45 AM, Davidlohr Bueso wrote:
>>> A customer reported massive contention on the ncalls->lock in which
>>> the workload is designed around nested epolls (where the fd is already
>>> an epoll).
>>>
>>> 83.49%  [kernel]               [k] __pv_queued_spin_lock_slowpath
>>>  2.70%  [kernel]               [k] ep_call_nested.constprop.13
>>>  1.94%  [kernel]               [k] _raw_spin_lock_irqsave
>>>  1.83%  [kernel]               [k] __raw_callee_save___pv_queued_spin_unlock
>>>  1.45%  [kernel]               [k] _raw_spin_unlock_irqrestore
>>>  0.41%  [kernel]               [k] ep_scan_ready_list.isra.8
>>>  0.36%  [kernel]               [k] pvclock_clocksource_read
>>>
>>> The application, running on kvm, is using shared memory IPC communication
>>> with a pipe wakeup mechanism, and a two level dispatcher both built around
>>> 'epoll_wait'. There is one process per physical core and a full mesh of
>>> pipes between them, so on an 18 core system (the actual case) there are
>>> 18*18 pipes for the IPCs alone.
>>>
>>> This patch proposes replacing the nested calls global linked list with a
>>> dlock interface, for which we can benefit from pcpu lists when doing
>>> ep_poll_safewake(), and hashing for the current task, which provides two
>>> benefits:
>>>
>>> 1. Speeds up the process of loop and max-depth checks from O(N) lookups
>>>    to O(1) (albeit possible collisions, which we have to iterate); and,
>>>
>>> 2. Provides smaller lock granularity.
>>>
>>> cpus    before        after       diff
>>> 1    1409370        1344804     -4.58%
>>> 2    1015690        1337053     31.63%
>>> 3     721009        1273254     76.59%
>>> 4     380959        1128931    196.33%
>>> 5     287603        1028362    257.56%
>>> 6     221104         894943    304.76%
>>> 7     173592         976395    462.46%
>>> 8     145860         922584    532.51%
>>> 9     127877         925900    624.05%
>>> 10     112603         791456    602.87%
>>> 11      97926         724296    639.63%
>>> 12      80732         730485    804.82%
>>>
>>> With the exception of a single cpu (where the overhead of jhashing
>>> influences the results), we get some pretty good raw throughput increase.
>>> Similarly, the amount of time spent decreases immensely as well.
>>>
>>> Passes ltp related testcases.
>>>
>>> Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
>>> ---
>>> fs/eventpoll.c | 88 +++++++++++++++++++++++++++++++++++-----------------------
>>> 1 file changed, 53 insertions(+), 35 deletions(-)
>>>
>>> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
>>> index 2fabd19cdeea..675c97fdc5da 100644
>>> --- a/fs/eventpoll.c
>>> +++ b/fs/eventpoll.c
>>> @@ -22,7 +22,7 @@
>>> #include <linux/slab.h>
>>> #include <linux/poll.h>
>>> #include <linux/string.h>
>>> -#include <linux/list.h>
>>> +#include <linux/dlock-list.h>
>>> #include <linux/hash.h>
>>> #include <linux/spinlock.h>
>>> #include <linux/syscalls.h>
>>> @@ -119,7 +119,7 @@ struct epoll_filefd {
>>>  * and loop cycles.
>>>  */
>>> struct nested_call_node {
>>> -    struct list_head llink;
>>> +    struct dlock_list_node llink;
>>>     void *cookie;
>>>     void *ctx;
>>> };
>>> @@ -129,8 +129,7 @@ struct nested_call_node {
>>>  * maximum recursion dept and loop cycles.
>>>  */
>>> struct nested_calls {
>>> -    struct list_head tasks_call_list;
>>> -    spinlock_t lock;
>>> +    struct dlock_list_heads tasks_call_list;
>>> };
>>>
>>> /*
>>> @@ -371,10 +370,14 @@ static inline int ep_op_has_event(int op)
>>> }
>>>
>>> /* Initialize the poll safe wake up structure */
>>> -static void ep_nested_calls_init(struct nested_calls *ncalls)
>>> +static inline int ep_nested_calls_init(struct nested_calls *ncalls)
>>> +{
>>> +    return alloc_dlock_list_heads(&ncalls->tasks_call_list, true);
>>> +}
>>> +
>>> +static inline void ep_nested_calls_free(struct nested_calls *ncalls)
>>> {
>>> -    INIT_LIST_HEAD(&ncalls->tasks_call_list);
>>> -    spin_lock_init(&ncalls->lock);
>>> +    free_dlock_list_heads(&ncalls->tasks_call_list);
>>> }
>>>
>>> /**
>>> @@ -483,45 +486,50 @@ static int ep_call_nested(struct nested_calls *ncalls, int max_nests,
>>> {
>>>     int error, call_nests = 0;
>>>     unsigned long flags;
>>> -    struct list_head *lsthead = &ncalls->tasks_call_list;
>>> -    struct nested_call_node *tncur;
>>> -    struct nested_call_node tnode;
>>> +    struct dlock_list_head *head;
>>> +    struct nested_call_node *tncur, tnode;
>>>
>>> -    spin_lock_irqsave(&ncalls->lock, flags);
>>> +    head = dlock_list_hash(&ncalls->tasks_call_list, ctx);
>>>
>>> +    spin_lock_irqsave(&head->lock, flags);
>>>     /*
>>>      * Try to see if the current task is already inside this wakeup call.
>>> -     * We use a list here, since the population inside this set is always
>>> -     * very much limited.
>>> +     *
>>> +     * We use a chained hash table with linked lists here, and take the
>>> +     * lock to avoid racing when collisions (where ctx pointer is the key).
>>> +     * Calls for which context is the cpu number, avoid hashing and act as
>>> +     * pcpu add/removal.
>>> +     *
>>> +     * Since the population inside this set is always very much limited,
>>> +     * list scanning should be short.
>>>      */
>>> -    list_for_each_entry(tncur, lsthead, llink) {
>>> -        if (tncur->ctx == ctx &&
>>> -            (tncur->cookie == cookie || ++call_nests > max_nests)) {
>>> -            /*
>>> -             * Ops ... loop detected or maximum nest level reached.
>>> -             * We abort this wake by breaking the cycle itself.
>>> -             */
>>> -            error = -1;
>>> -            goto out_unlock;
>>> -        }
>>> -    }
>>> +    list_for_each_entry(tncur, &head->list, llink.list) {
>>> +           if (tncur->ctx == ctx &&
>>> +           (tncur->cookie == cookie || ++call_nests > EP_MAX_NESTS)) {
>>> +               /*
>>> +            * Ops ... loop detected or maximum nest level reached.
>>> +            * We abort this wake by breaking the cycle itself.
>>> +            */
>>> +               error = -1;
>>> +               spin_unlock_irqrestore(&head->lock, flags);
>>> +               goto done;
>>> +           }
>>> +       }
>>> +
>>>
>>>     /* Add the current task and cookie to the list */
>>>     tnode.ctx = ctx;
>>>     tnode.cookie = cookie;
>>> -    list_add(&tnode.llink, lsthead);
>>> -
>>> -    spin_unlock_irqrestore(&ncalls->lock, flags);
>>> +    tnode.llink.head = head;
>>> +    list_add(&tnode.llink.list, &head->list);
>>> +    spin_unlock_irqrestore(&head->lock, flags);
>>>
>>>     /* Call the nested function */
>>>     error = (*nproc)(priv, cookie, call_nests);
>>>
>>>     /* Remove the current task from the list */
>>> -    spin_lock_irqsave(&ncalls->lock, flags);
>>> -    list_del(&tnode.llink);
>>> -out_unlock:
>>> -    spin_unlock_irqrestore(&ncalls->lock, flags);
>>> -
>>> +    dlock_lists_del(&tnode.llink);
>>> +done:
>>>     return error;
>>> }
>>>
>>> @@ -2313,13 +2321,16 @@ static int __init eventpoll_init(void)
>>>      * Initialize the structure used to perform epoll file descriptor
>>>      * inclusion loops checks.
>>>      */
>>> -    ep_nested_calls_init(&poll_loop_ncalls);
>>> +    if (ep_nested_calls_init(&poll_loop_ncalls))
>>> +        goto err;
>>>
>>>     /* Initialize the structure used to perform safe poll wait head wake ups */
>>> -    ep_nested_calls_init(&poll_safewake_ncalls);
>>> +    if (ep_nested_calls_init(&poll_safewake_ncalls))
>>> +        goto err_free1;
>>>
>>>     /* Initialize the structure used to perform file's f_op->poll() calls */
>>> -    ep_nested_calls_init(&poll_readywalk_ncalls);
>>> +    if (ep_nested_calls_init(&poll_readywalk_ncalls))
>>> +        goto err_free0;
>>>
>>>     /*
>>>      * We can have many thousands of epitems, so prevent this from
>>> @@ -2336,5 +2347,12 @@ static int __init eventpoll_init(void)
>>>             sizeof(struct eppoll_entry), 0, SLAB_PANIC, NULL);
>>>
>>>     return 0;
>>> +
>>> +err_free0:
>>> +    ep_nested_calls_free(&poll_safewake_ncalls);
>>> +err_free1:
>>> +    ep_nested_calls_free(&poll_loop_ncalls);
>>> +err:
>>> +    return -ENOMEM;
>>> }
>>> fs_initcall(eventpoll_init);

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 7/6] fs/epoll: scale nested callbacks
  2017-10-18 14:06       ` Jason Baron
@ 2017-10-18 15:44         ` Davidlohr Bueso
  0 siblings, 0 replies; 31+ messages in thread
From: Davidlohr Bueso @ 2017-10-18 15:44 UTC (permalink / raw)
  To: Jason Baron
  Cc: Waiman Long, akpm, Alexander Viro, Jan Kara, Jeff Layton,
	J. Bruce Fields, Tejun Heo, Christoph Lameter, linux-fsdevel,
	linux-kernel, Ingo Molnar, Peter Zijlstra, Andi Kleen,
	Dave Chinner, Boqun Feng

On Wed, 18 Oct 2017, Jason Baron wrote:

>If there are not many EPOLL_CTL_ADDs, then I wouldn't think loop_ncalls
>would be highly contented (since it should only be called from the add
>path)?

So yeah it's all about the readywalks.

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 0/6] vfs: Use dlock list for SB's s_inodes list
  2017-10-05 18:43 [PATCH v7 0/6] vfs: Use dlock list for SB's s_inodes list Waiman Long
                   ` (7 preceding siblings ...)
  2017-10-17 19:36 ` [PATCH v7 8/9] lib/dlock-list: Export symbols and add warnings Waiman Long
@ 2017-10-26 18:28 ` Waiman Long
  2017-10-27  0:58   ` Boqun Feng
  2017-10-27 20:10 ` [PATCH v7 10/10] lib/dlock-list: Fix use-after-unlock problem in dlist_for_each_entry_safe() Waiman Long
  9 siblings, 1 reply; 31+ messages in thread
From: Waiman Long @ 2017-10-26 18:28 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng, Davidlohr Bueso

On 10/05/2017 02:43 PM, Waiman Long wrote:
>
> This is a follow up of the following patchset:
>
>   [PATCH v7 0/4] vfs: Use per-cpu list for SB's s_inodes list
>   https://lkml.org/lkml/2016/4/12/1009
>
> This patchset provides new APIs for a set of distributed locked lists
> (one/CPU core) to minimize lock and cacheline contention. Insertion
> and deletion to the list will be cheap and relatively contention free.
> Lookup, on the other hand, may be a bit more costly as there are
> multiple lists to iterate. This is not really a problem for the
> replacement of superblock's inode list by dlock list included in
> the patchset as lookup isn't needed.
>
> For use cases that need to do lookup, the dlock list can also be
> treated as a set of hashed lists that scales with the number of CPU
> cores in the system.
>
> Both patches 5 and 6 are added to support other use cases like epoll
> nested callbacks, for example, which could use the dlock-list to
> reduce lock contention problem.
>
> Patch 1 introduces the dlock list. The list heads are allocated
> by kcalloc() instead of percpu_alloc(). Each list head entry is
> cacheline aligned to minimize contention.
>
> Patch 2 replaces the use of list_for_each_entry_safe() in
> evict_inodes() and invalidate_inodes() by list_for_each_entry().
>
> Patch 3 modifies the superblock and inode structures to use the dlock
> list. The corresponding functions that reference those structures
> are modified.
>
> Patch 4 makes the sibling CPUs use the same dlock list head to reduce
> the number of list heads that need to be iterated.
>
> Patch 5 enables an alternative use case as a set of hashed lists.
>
> Patch 6 provides an irq safe mode specified at dlock-list allocation
> time so that it can be used within interrupt context.
>
> Jan Kara (1):
>   vfs: Remove unnecessary list_for_each_entry_safe() variants
>
> Waiman Long (5):
>   lib/dlock-list: Distributed and lock-protected lists
>   vfs: Use dlock list for superblock's inode list
>   lib/dlock-list: Make sibling CPUs share the same linked list
>   lib/dlock-list: Enable faster lookup with hashing
>   lib/dlock-list: Add an IRQ-safe mode to be used in interrupt handler
>
>  fs/block_dev.c             |   9 +-
>  fs/drop_caches.c           |   9 +-
>  fs/inode.c                 |  38 +++---
>  fs/notify/fsnotify.c       |   9 +-
>  fs/quota/dquot.c           |  14 +-
>  fs/super.c                 |   7 +-
>  include/linux/dlock-list.h | 245 ++++++++++++++++++++++++++++++++++
>  include/linux/fs.h         |   8 +-
>  lib/Makefile               |   2 +-
>  lib/dlock-list.c           | 322 +++++++++++++++++++++++++++++++++++++++++++++
>  10 files changed, 609 insertions(+), 54 deletions(-)
>  create mode 100644 include/linux/dlock-list.h
>  create mode 100644 lib/dlock-list.c
>
Are there any other objections to merging this patch series? With the
additional patches 8 & 9 that I sent out on Oct 17, I think I have
addressed all the concerns received so far. Please let me know what
else I need to do to make these patches mergeable.

Thanks,
Longman

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 0/6] vfs: Use dlock list for SB's s_inodes list
  2017-10-26 18:28 ` [PATCH v7 0/6] vfs: Use dlock list for SB's s_inodes list Waiman Long
@ 2017-10-27  0:58   ` Boqun Feng
  2017-10-27 20:19     ` Waiman Long
  0 siblings, 1 reply; 31+ messages in thread
From: Boqun Feng @ 2017-10-27  0:58 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Davidlohr Bueso

[-- Attachment #1: Type: text/plain, Size: 3534 bytes --]

On Thu, Oct 26, 2017 at 02:28:55PM -0400, Waiman Long wrote:
> On 10/05/2017 02:43 PM, Waiman Long wrote:
> >
> > This is a follow up of the following patchset:
> >
> >   [PATCH v7 0/4] vfs: Use per-cpu list for SB's s_inodes list
> >   https://lkml.org/lkml/2016/4/12/1009
> >
> > This patchset provides new APIs for a set of distributed locked lists
> > (one/CPU core) to minimize lock and cacheline contention. Insertion
> > and deletion to the list will be cheap and relatively contention free.
> > Lookup, on the other hand, may be a bit more costly as there are
> > multiple lists to iterate. This is not really a problem for the
> > replacement of superblock's inode list by dlock list included in
> > the patchset as lookup isn't needed.
> >
> > For use cases that need to do lookup, the dlock list can also be
> > treated as a set of hashed lists that scales with the number of CPU
> > cores in the system.
> >
> > Both patches 5 and 6 are added to support other use cases like epoll
> > nested callbacks, for example, which could use the dlock-list to
> > reduce lock contention problem.
> >
> > Patch 1 introduces the dlock list. The list heads are allocated
> > by kcalloc() instead of percpu_alloc(). Each list head entry is
> > cacheline aligned to minimize contention.
> >
> > Patch 2 replaces the use of list_for_each_entry_safe() in
> > evict_inodes() and invalidate_inodes() by list_for_each_entry().
> >
> > Patch 3 modifies the superblock and inode structures to use the dlock
> > list. The corresponding functions that reference those structures
> > are modified.
> >
> > Patch 4 makes the sibling CPUs use the same dlock list head to reduce
> > the number of list heads that need to be iterated.
> >
> > Patch 5 enables alternative use case of as a set of hashed lists.
> >
> > Patch 6 provides an irq safe mode specified at dlock-list allocation
> > time so that it can be within interrupt context.
> >
> > Jan Kara (1):
> >   vfs: Remove unnecessary list_for_each_entry_safe() variants
> >
> > Waiman Long (5):
> >   lib/dlock-list: Distributed and lock-protected lists
> >   vfs: Use dlock list for superblock's inode list
> >   lib/dlock-list: Make sibling CPUs share the same linked list
> >   lib/dlock-list: Enable faster lookup with hashing
> >   lib/dlock-list: Add an IRQ-safe mode to be used in interrupt handler
> >
> >  fs/block_dev.c             |   9 +-
> >  fs/drop_caches.c           |   9 +-
> >  fs/inode.c                 |  38 +++---
> >  fs/notify/fsnotify.c       |   9 +-
> >  fs/quota/dquot.c           |  14 +-
> >  fs/super.c                 |   7 +-
> >  include/linux/dlock-list.h | 245 ++++++++++++++++++++++++++++++++++
> >  include/linux/fs.h         |   8 +-
> >  lib/Makefile               |   2 +-
> >  lib/dlock-list.c           | 322 +++++++++++++++++++++++++++++++++++++++++++++
> >  10 files changed, 609 insertions(+), 54 deletions(-)
> >  create mode 100644 include/linux/dlock-list.h
> >  create mode 100644 lib/dlock-list.c
> >
> Is there other objections about merging this patch series? With the
> additional patches 8 & 9 that I sent out on Oct 17, I think I had
> addressed all the concerns that I received so far. Please let me know
> what else do I need to do to make these patches mergeable?
> 

Hi Waiman,

Have you read my email about the dlist_for_each_entry_safe():

	https://marc.info/?l=linux-kernel&m=150831690725964&w=2

?

Regards,
Boqun

> Thanks,
> Longman
> 
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v7 10/10] lib/dlock-list: Fix use-after-unlock problem in dlist_for_each_entry_safe()
  2017-10-05 18:43 [PATCH v7 0/6] vfs: Use dlock list for SB's s_inodes list Waiman Long
                   ` (8 preceding siblings ...)
  2017-10-26 18:28 ` [PATCH v7 0/6] vfs: Use dlock list for SB's s_inodes list Waiman Long
@ 2017-10-27 20:10 ` Waiman Long
  2017-10-30  9:06   ` Jan Kara
  2017-10-30 14:11   ` Davidlohr Bueso
  9 siblings, 2 replies; 31+ messages in thread
From: Waiman Long @ 2017-10-27 20:10 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng, Davidlohr Bueso,
	Waiman Long

The dlist_for_each_entry_safe() macro in include/linux/dlock-list.h has
a use-after-unlock problem where a race condition can happen because
of a lack of spinlock protection.  Fortunately, this macro is not
currently being used in the kernel.

This patch changes the dlist_for_each_entry_safe() macro so that the
call to __dlock_list_next_list() is deferred until the next entry is
actually needed. That should eliminate the use-after-unlock problem.
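
For clarity, a simplified, hand-written expansion of the new macro
(illustrative only, not the literal preprocessor output):

	pos = NULL;
	for (;;) {
		/*
		 * Hop to the next non-empty list only when the previous
		 * one has been fully walked (or on the very first pass).
		 */
		if (!pos || &pos->member.list == &iter->entry->list)
			pos = dlock_list_next_list_entry(iter, pos, member);
		if (!pos)
			break;			/* all lists exhausted */
		n = list_next_entry(pos, member.list);	/* under pos's lock */

		/* ... loop body runs here, with pos still protected ... */

		pos = n;
	}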

Reported-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/dlock-list.h | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/include/linux/dlock-list.h b/include/linux/dlock-list.h
index 02c5f4d..f4b7657 100644
--- a/include/linux/dlock-list.h
+++ b/include/linux/dlock-list.h
@@ -191,17 +191,17 @@ extern void dlock_list_add(struct dlock_list_node *node,
 }
 
 /**
- * dlock_list_first_entry - get the first element from a list
+ * dlock_list_next_list_entry - get first element from next list in iterator
  * @iter  : The dlock list iterator.
- * @type  : The type of the struct this is embedded in.
+ * @pos   : A variable of the struct that is embedded in.
  * @member: The name of the dlock_list_node within the struct.
- * Return : Pointer to the next entry or NULL if all the entries are iterated.
+ * Return : Pointer to first entry or NULL if all the lists are iterated.
  */
-#define dlock_list_first_entry(iter, type, member)			\
+#define dlock_list_next_list_entry(iter, pos, member)			\
 	({								\
 		struct dlock_list_node *_n;				\
 		_n = __dlock_list_next_entry(NULL, iter);		\
-		_n ? list_entry(_n, type, member) : NULL;		\
+		_n ? list_entry(_n, typeof(*pos), member) : NULL;	\
 	})
 
 /**
@@ -231,7 +231,7 @@ extern void dlock_list_add(struct dlock_list_node *node,
  * This iteration function is designed to be used in a while loop.
  */
 #define dlist_for_each_entry(pos, iter, member)				\
-	for (pos = dlock_list_first_entry(iter, typeof(*(pos)), member);\
+	for (pos = dlock_list_next_list_entry(iter, pos, member);	\
 	     pos != NULL;						\
 	     pos = dlock_list_next_entry(pos, iter, member))
 
@@ -245,14 +245,20 @@ extern void dlock_list_add(struct dlock_list_node *node,
  * This iteration macro is safe with respect to list entry removal.
  * However, it cannot correctly iterate newly added entries right after the
  * current one.
+ *
+ * The call to __dlock_list_next_list() is deferred until the next entry
+ * is being iterated to avoid use-after-unlock problem.
  */
 #define dlist_for_each_entry_safe(pos, n, iter, member)			\
-	for (pos = dlock_list_first_entry(iter, typeof(*(pos)), member);\
+	for (pos = NULL;						\
 	    ({								\
-		bool _b = (pos != NULL);				\
-		if (_b)							\
-			n = dlock_list_next_entry(pos, iter, member);	\
-		_b;							\
+		if (!pos ||						\
+		   (&(pos)->member.list == &(iter)->entry->list))	\
+			pos = dlock_list_next_list_entry(iter, pos,	\
+							 member);	\
+		if (pos)						\
+			n = list_next_entry(pos, member.list);		\
+		pos;							\
 	    });								\
 	    pos = n)
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 0/6] vfs: Use dlock list for SB's s_inodes list
  2017-10-27  0:58   ` Boqun Feng
@ 2017-10-27 20:19     ` Waiman Long
  0 siblings, 0 replies; 31+ messages in thread
From: Waiman Long @ 2017-10-27 20:19 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Davidlohr Bueso

On 10/26/2017 08:58 PM, Boqun Feng wrote:
>> Is there other objections about merging this patch series? With the
>> additional patches 8 & 9 that I sent out on Oct 17, I think I had
>> addressed all the concerns that I received so far. Please let me know
>> what else do I need to do to make these patches mergeable?
>>
> Hi Waiman,
>
> Have you read my email about the dlist_for_each_entry_safe():
>
> 	https://marc.info/?l=linux-kernel&m=150831690725964&w=2
>
> ?
>
> Regards,
> Boqun

I am sorry that I somehow forgot to respond to this email. Anyway,
dlist_for_each_entry_safe() is not currently used and so was not that
well tested. I just sent out another patch to fix the use-after-unlock
problem that you found. The fix is somewhat different from what you
proposed, but it should still fix the problem. I converted some
dlist_for_each_entry() call sites to dlist_for_each_entry_safe(), then
compiled and booted the kernel. I haven't seen any problems so far.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 10/10] lib/dlock-list: Fix use-after-unlock problem in dlist_for_each_entry_safe()
  2017-10-27 20:10 ` [PATCH v7 10/10] lib/dlock-list: Fix use-after-unlock problem in dlist_for_each_entry_safe() Waiman Long
@ 2017-10-30  9:06   ` Jan Kara
  2017-10-30 14:06     ` Boqun Feng
  2017-10-30 14:11   ` Davidlohr Bueso
  1 sibling, 1 reply; 31+ messages in thread
From: Jan Kara @ 2017-10-30  9:06 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Boqun Feng, Davidlohr Bueso

On Fri 27-10-17 16:10:53, Waiman Long wrote:
> The dlist_for_each_entry_safe() macro in include/linux/dlock-list has
> a use-after-unlock problem where racing condition can happen because
> of a lack of spinlock protection.  Fortunately, this macro is not
> currently being used in the kernel.
> 
> This patch changes the dlist_for_each_entry_safe() macro so that the
> call to __dlock_list_next_list() is deferred until the next entry is
> being used. That should eliminate the use-after-unlock problem.
> 
> Reported-by: Boqun Feng <boqun.feng@gmail.com>
> Signed-off-by: Waiman Long <longman@redhat.com>

Looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza


> ---
>  include/linux/dlock-list.h | 28 +++++++++++++++++-----------
>  1 file changed, 17 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/dlock-list.h b/include/linux/dlock-list.h
> index 02c5f4d..f4b7657 100644
> --- a/include/linux/dlock-list.h
> +++ b/include/linux/dlock-list.h
> @@ -191,17 +191,17 @@ extern void dlock_list_add(struct dlock_list_node *node,
>  }
>  
>  /**
> - * dlock_list_first_entry - get the first element from a list
> + * dlock_list_next_list_entry - get first element from next list in iterator
>   * @iter  : The dlock list iterator.
> - * @type  : The type of the struct this is embedded in.
> + * @pos   : A variable of the struct that is embedded in.
>   * @member: The name of the dlock_list_node within the struct.
> - * Return : Pointer to the next entry or NULL if all the entries are iterated.
> + * Return : Pointer to first entry or NULL if all the lists are iterated.
>   */
> -#define dlock_list_first_entry(iter, type, member)			\
> +#define dlock_list_next_list_entry(iter, pos, member)			\
>  	({								\
>  		struct dlock_list_node *_n;				\
>  		_n = __dlock_list_next_entry(NULL, iter);		\
> -		_n ? list_entry(_n, type, member) : NULL;		\
> +		_n ? list_entry(_n, typeof(*pos), member) : NULL;	\
>  	})
>  
>  /**
> @@ -231,7 +231,7 @@ extern void dlock_list_add(struct dlock_list_node *node,
>   * This iteration function is designed to be used in a while loop.
>   */
>  #define dlist_for_each_entry(pos, iter, member)				\
> -	for (pos = dlock_list_first_entry(iter, typeof(*(pos)), member);\
> +	for (pos = dlock_list_next_list_entry(iter, pos, member);	\
>  	     pos != NULL;						\
>  	     pos = dlock_list_next_entry(pos, iter, member))
>  
> @@ -245,14 +245,20 @@ extern void dlock_list_add(struct dlock_list_node *node,
>   * This iteration macro is safe with respect to list entry removal.
>   * However, it cannot correctly iterate newly added entries right after the
>   * current one.
> + *
> + * The call to __dlock_list_next_list() is deferred until the next entry
> + * is being iterated to avoid use-after-unlock problem.
>   */
>  #define dlist_for_each_entry_safe(pos, n, iter, member)			\
> -	for (pos = dlock_list_first_entry(iter, typeof(*(pos)), member);\
> +	for (pos = NULL;						\
>  	    ({								\
> -		bool _b = (pos != NULL);				\
> -		if (_b)							\
> -			n = dlock_list_next_entry(pos, iter, member);	\
> -		_b;							\
> +		if (!pos ||						\
> +		   (&(pos)->member.list == &(iter)->entry->list))	\
> +			pos = dlock_list_next_list_entry(iter, pos,	\
> +							 member);	\
> +		if (pos)						\
> +			n = list_next_entry(pos, member.list);		\
> +		pos;							\
>  	    });								\
>  	    pos = n)
>  
> -- 
> 1.8.3.1
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 10/10] lib/dlock-list: Fix use-after-unlock problem in dlist_for_each_entry_safe()
  2017-10-30  9:06   ` Jan Kara
@ 2017-10-30 14:06     ` Boqun Feng
  0 siblings, 0 replies; 31+ messages in thread
From: Boqun Feng @ 2017-10-30 14:06 UTC (permalink / raw)
  To: Jan Kara
  Cc: Waiman Long, Alexander Viro, Jan Kara, Jeff Layton,
	J. Bruce Fields, Tejun Heo, Christoph Lameter, linux-fsdevel,
	linux-kernel, Ingo Molnar, Peter Zijlstra, Andi Kleen,
	Dave Chinner, Davidlohr Bueso

[-- Attachment #1: Type: text/plain, Size: 3958 bytes --]

On Mon, Oct 30, 2017 at 10:06:40AM +0100, Jan Kara wrote:
> On Fri 27-10-17 16:10:53, Waiman Long wrote:
> > The dlist_for_each_entry_safe() macro in include/linux/dlock-list has
> > a use-after-unlock problem where racing condition can happen because
> > of a lack of spinlock protection.  Fortunately, this macro is not
> > currently being used in the kernel.
> > 
> > This patch changes the dlist_for_each_entry_safe() macro so that the
> > call to __dlock_list_next_list() is deferred until the next entry is
> > being used. That should eliminate the use-after-unlock problem.
> > 

Better than what I proposed, thanks for looking into this!

> > Reported-by: Boqun Feng <boqun.feng@gmail.com>
> > Signed-off-by: Waiman Long <longman@redhat.com>
> 
> Looks good to me. You can add:
> 
> Reviewed-by: Jan Kara <jack@suse.cz>
> 

Also mine FWIW:

Reviewed-by: Boqun Feng <boqun.feng@gmail.com>

Regards,
Boqun

> 								Honza
> 
> 
> > ---
> >  include/linux/dlock-list.h | 28 +++++++++++++++++-----------
> >  1 file changed, 17 insertions(+), 11 deletions(-)
> > 
> > diff --git a/include/linux/dlock-list.h b/include/linux/dlock-list.h
> > index 02c5f4d..f4b7657 100644
> > --- a/include/linux/dlock-list.h
> > +++ b/include/linux/dlock-list.h
> > @@ -191,17 +191,17 @@ extern void dlock_list_add(struct dlock_list_node *node,
> >  }
> >  
> >  /**
> > - * dlock_list_first_entry - get the first element from a list
> > + * dlock_list_next_list_entry - get first element from next list in iterator
> >   * @iter  : The dlock list iterator.
> > - * @type  : The type of the struct this is embedded in.
> > + * @pos   : A variable of the struct that is embedded in.
> >   * @member: The name of the dlock_list_node within the struct.
> > - * Return : Pointer to the next entry or NULL if all the entries are iterated.
> > + * Return : Pointer to first entry or NULL if all the lists are iterated.
> >   */
> > -#define dlock_list_first_entry(iter, type, member)			\
> > +#define dlock_list_next_list_entry(iter, pos, member)			\
> >  	({								\
> >  		struct dlock_list_node *_n;				\
> >  		_n = __dlock_list_next_entry(NULL, iter);		\
> > -		_n ? list_entry(_n, type, member) : NULL;		\
> > +		_n ? list_entry(_n, typeof(*pos), member) : NULL;	\
> >  	})
> >  
> >  /**
> > @@ -231,7 +231,7 @@ extern void dlock_list_add(struct dlock_list_node *node,
> >   * This iteration function is designed to be used in a while loop.
> >   */
> >  #define dlist_for_each_entry(pos, iter, member)				\
> > -	for (pos = dlock_list_first_entry(iter, typeof(*(pos)), member);\
> > +	for (pos = dlock_list_next_list_entry(iter, pos, member);	\
> >  	     pos != NULL;						\
> >  	     pos = dlock_list_next_entry(pos, iter, member))
> >  
> > @@ -245,14 +245,20 @@ extern void dlock_list_add(struct dlock_list_node *node,
> >   * This iteration macro is safe with respect to list entry removal.
> >   * However, it cannot correctly iterate newly added entries right after the
> >   * current one.
> > + *
> > + * The call to __dlock_list_next_list() is deferred until the next entry
> > + * is being iterated to avoid use-after-unlock problem.
> >   */
> >  #define dlist_for_each_entry_safe(pos, n, iter, member)			\
> > -	for (pos = dlock_list_first_entry(iter, typeof(*(pos)), member);\
> > +	for (pos = NULL;						\
> >  	    ({								\
> > -		bool _b = (pos != NULL);				\
> > -		if (_b)							\
> > -			n = dlock_list_next_entry(pos, iter, member);	\
> > -		_b;							\
> > +		if (!pos ||						\
> > +		   (&(pos)->member.list == &(iter)->entry->list))	\
> > +			pos = dlock_list_next_list_entry(iter, pos,	\
> > +							 member);	\
> > +		if (pos)						\
> > +			n = list_next_entry(pos, member.list);		\
> > +		pos;							\
> >  	    });								\
> >  	    pos = n)
> >  
> > -- 
> > 1.8.3.1
> > 
> > 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 10/10] lib/dlock-list: Fix use-after-unlock problem in dlist_for_each_entry_safe()
  2017-10-27 20:10 ` [PATCH v7 10/10] lib/dlock-list: Fix use-after-unlock problem in dlist_for_each_entry_safe() Waiman Long
  2017-10-30  9:06   ` Jan Kara
@ 2017-10-30 14:11   ` Davidlohr Bueso
  2017-10-30 14:15     ` Waiman Long
  1 sibling, 1 reply; 31+ messages in thread
From: Davidlohr Bueso @ 2017-10-30 14:11 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Boqun Feng

On Fri, 27 Oct 2017, Waiman Long wrote:

>The dlist_for_each_entry_safe() macro in include/linux/dlock-list has
>a use-after-unlock problem where racing condition can happen because
>of a lack of spinlock protection.  Fortunately, this macro is not
>currently being used in the kernel.
>
>This patch changes the dlist_for_each_entry_safe() macro so that the
>call to __dlock_list_next_list() is deferred until the next entry is
>being used. That should eliminate the use-after-unlock problem.
>
>Reported-by: Boqun Feng <boqun.feng@gmail.com>
>Signed-off-by: Waiman Long <longman@redhat.com>

Reviewed-by: Davidlohr Bueso <dbueso@suse.de>

But would it not be better to merge this patch (among others) into 1/N?
Specifically the newer patches 7-10 should be in the original dlock
implementation instead of adding fixes to incorrect code in the original
commit. Also less of a pita for backporting.

Thanks,
Davidlohr


>---
> include/linux/dlock-list.h | 28 +++++++++++++++++-----------
> 1 file changed, 17 insertions(+), 11 deletions(-)
>
>diff --git a/include/linux/dlock-list.h b/include/linux/dlock-list.h
>index 02c5f4d..f4b7657 100644
>--- a/include/linux/dlock-list.h
>+++ b/include/linux/dlock-list.h
>@@ -191,17 +191,17 @@ extern void dlock_list_add(struct dlock_list_node *node,
> }
>
> /**
>- * dlock_list_first_entry - get the first element from a list
>+ * dlock_list_next_list_entry - get first element from next list in iterator
>  * @iter  : The dlock list iterator.
>- * @type  : The type of the struct this is embedded in.
>+ * @pos   : A variable of the struct that is embedded in.
>  * @member: The name of the dlock_list_node within the struct.
>- * Return : Pointer to the next entry or NULL if all the entries are iterated.
>+ * Return : Pointer to first entry or NULL if all the lists are iterated.
>  */
>-#define dlock_list_first_entry(iter, type, member)			\
>+#define dlock_list_next_list_entry(iter, pos, member)			\
> 	({								\
> 		struct dlock_list_node *_n;				\
> 		_n = __dlock_list_next_entry(NULL, iter);		\
>-		_n ? list_entry(_n, type, member) : NULL;		\
>+		_n ? list_entry(_n, typeof(*pos), member) : NULL;	\
> 	})
>
> /**
>@@ -231,7 +231,7 @@ extern void dlock_list_add(struct dlock_list_node *node,
>  * This iteration function is designed to be used in a while loop.
>  */
> #define dlist_for_each_entry(pos, iter, member)				\
>-	for (pos = dlock_list_first_entry(iter, typeof(*(pos)), member);\
>+	for (pos = dlock_list_next_list_entry(iter, pos, member);	\
> 	     pos != NULL;						\
> 	     pos = dlock_list_next_entry(pos, iter, member))
>
>@@ -245,14 +245,20 @@ extern void dlock_list_add(struct dlock_list_node *node,
>  * This iteration macro is safe with respect to list entry removal.
>  * However, it cannot correctly iterate newly added entries right after the
>  * current one.
>+ *
>+ * The call to __dlock_list_next_list() is deferred until the next entry
>+ * is being iterated to avoid use-after-unlock problem.
>  */
> #define dlist_for_each_entry_safe(pos, n, iter, member)			\
>-	for (pos = dlock_list_first_entry(iter, typeof(*(pos)), member);\
>+	for (pos = NULL;						\
> 	    ({								\
>-		bool _b = (pos != NULL);				\
>-		if (_b)							\
>-			n = dlock_list_next_entry(pos, iter, member);	\
>-		_b;							\
>+		if (!pos ||						\
>+		   (&(pos)->member.list == &(iter)->entry->list))	\
>+			pos = dlock_list_next_list_entry(iter, pos,	\
>+							 member);	\
>+		if (pos)						\
>+			n = list_next_entry(pos, member.list);		\
>+		pos;							\
> 	    });								\
> 	    pos = n)
>
>-- 
>1.8.3.1
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 10/10] lib/dlock-list: Fix use-after-unlock problem in dlist_for_each_entry_safe()
  2017-10-30 14:11   ` Davidlohr Bueso
@ 2017-10-30 14:15     ` Waiman Long
  0 siblings, 0 replies; 31+ messages in thread
From: Waiman Long @ 2017-10-30 14:15 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Boqun Feng

On 10/30/2017 10:11 AM, Davidlohr Bueso wrote:
> On Fri, 27 Oct 2017, Waiman Long wrote:
>
>> The dlist_for_each_entry_safe() macro in include/linux/dlock-list has
>> a use-after-unlock problem where racing condition can happen because
>> of a lack of spinlock protection.  Fortunately, this macro is not
>> currently being used in the kernel.
>>
>> This patch changes the dlist_for_each_entry_safe() macro so that the
>> call to __dlock_list_next_list() is deferred until the next entry is
>> being used. That should eliminate the use-after-unlock problem.
>>
>> Reported-by: Boqun Feng <boqun.feng@gmail.com>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>
> Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
>
> But would it not be better to merge this patch (among others) into 1/N?
> Specifically the newer patches 7-10 should be in the original dlock
> implementation instead of adding fixes to incorrect code in the original
> commit. Also less of a pita for backporting.
>
> Thanks,
> Davidlohr +191,17 @@ extern void dlock_list_add(struct dlock_list_node
> *node,

Yes, that is true. I will send out a new version with all the fixes
integrated later this week.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2017-10-30 14:15 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-05 18:43 [PATCH v7 0/6] vfs: Use dlock list for SB's s_inodes list Waiman Long
2017-10-05 18:43 ` [PATCH v7 1/6] lib/dlock-list: Distributed and lock-protected lists Waiman Long
2017-10-10  5:35   ` Boqun Feng
2017-10-13 21:10     ` Waiman Long
2017-10-18  8:55   ` Boqun Feng
2017-10-05 18:43 ` [PATCH v7 2/6] vfs: Remove unnecessary list_for_each_entry_safe() variants Waiman Long
2017-10-05 18:43 ` [PATCH v7 3/6] vfs: Use dlock list for superblock's inode list Waiman Long
2017-10-05 18:43 ` [PATCH v7 4/6] lib/dlock-list: Make sibling CPUs share the same linked list Waiman Long
2017-10-09 15:40   ` Jan Kara
2017-10-09 16:14     ` Waiman Long
2017-10-05 18:43 ` [PATCH v7 5/6] lib/dlock-list: Enable faster lookup with hashing Waiman Long
2017-10-09 13:08   ` Davidlohr Bueso
2017-10-09 14:16     ` Waiman Long
2017-10-09 16:03       ` Davidlohr Bueso
2017-10-09 16:11         ` Waiman Long
2017-10-05 18:43 ` [PATCH v7 6/6] lib/dlock-list: Add an IRQ-safe mode to be used in interrupt handler Waiman Long
2017-10-13 15:45 ` [PATCH v7 7/6] fs/epoll: scale nested callbacks Davidlohr Bueso
2017-10-16 19:30   ` Jason Baron
2017-10-17 15:53     ` Davidlohr Bueso
2017-10-18 14:06       ` Jason Baron
2017-10-18 15:44         ` Davidlohr Bueso
2017-10-17 19:36 ` [PATCH v7 8/9] lib/dlock-list: Export symbols and add warnings Waiman Long
2017-10-17 19:36   ` [PATCH v7 9/9] lib/dlock-list: Unique lock class key for each allocation call site Waiman Long
2017-10-26 18:28 ` [PATCH v7 0/6] vfs: Use dlock list for SB's s_inodes list Waiman Long
2017-10-27  0:58   ` Boqun Feng
2017-10-27 20:19     ` Waiman Long
2017-10-27 20:10 ` [PATCH v7 10/10] lib/dlock-list: Fix use-after-unlock problem in dlist_for_each_entry_safe() Waiman Long
2017-10-30  9:06   ` Jan Kara
2017-10-30 14:06     ` Boqun Feng
2017-10-30 14:11   ` Davidlohr Bueso
2017-10-30 14:15     ` Waiman Long

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).