* [PATCH v8 0/6] vfs: Use dlock list for SB's s_inodes list
@ 2017-10-31 18:50 Waiman Long
  2017-10-31 18:50 ` [PATCH v8 1/6] lib/dlock-list: Distributed and lock-protected lists Waiman Long
                   ` (7 more replies)
  0 siblings, 8 replies; 32+ messages in thread
From: Waiman Long @ 2017-10-31 18:50 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng, Davidlohr Bueso,
	Waiman Long

v7->v8:
 - Integrate the additional patches 8, 9 and 10 sent to fix issues in
   the original v7 patchset into patch 1 and adjust the other patches
   accordingly.

v6->v7:
 - Fix outdated email address.
 - Add a comment to patch 4 to explain allocation issue & fix a
   compilation problem with cpumask.
 - Replace patch 6 with another one that adds an irqsafe mode argument 
   in alloc_dlock_list_heads() instead of adding new APIs.

v5->v6:
 - Rebased the patch to 4.14-rc3.
 - Drop the fsnotify patch as it had been merged somehow.
 - Add a new patch 5 with an alternative way of selecting a list by
   hashing instead of by CPU number.
 - Add a new patch 6 to provide a set of irq-safe APIs to be used in
   interrupt context.
 - Update the CPU to index mapping code.

v4->v5:
 - Rebased the patch to 4.8-rc1 (changes to fs/fs-writeback.c were
   dropped).
 - Use kcalloc() instead of percpu_alloc() to allocate the dlock list
   heads structure as suggested by Christoph Lameter.
 - Replaced patch 5 with another one that makes sibling CPUs use the
   same dlock list head, thus reducing the number of list heads that
   need to be maintained.

v3->v4:
 - As suggested by Al, encapsulate the dlock list mechanism into
   the dlist_for_each_entry() and dlist_for_each_entry_safe()
   macros, which are the equivalents of list_for_each_entry() and
   list_for_each_entry_safe() for regular linked lists. That simplifies
   the changes in the call sites that perform dlock list iterations.
 - Add a new patch to make the percpu head structure cacheline aligned
   to prevent cacheline contention from disrupting the performance
   of nearby percpu variables.

v2->v3:
 - Remove the 2 per-subnode API patches.
 - Merge __percpu tag patch 2 into patch 1.
 - As suggested by Tejun Heo, restructure the dlock_list_head data
   structure to hide the __percpu tag and rename some of the functions
   and structures.
 - Move most of the code from dlock_list.h to dlock_list.c and export
   the symbols.

v1->v2:
 - Add a set of simple per-subnode APIs that are between percpu and
   per-node in granularity.
 - Make the dlock list use the per-subnode APIs so as to reduce the
   total number of separate linked lists that need to be managed
   and iterated.
 - There is no change in patches 1-5.

This patchset provides new APIs for a set of distributed locked lists
(one per CPU core) to minimize lock and cacheline contention. Insertion
into and deletion from the lists are cheap and relatively contention
free. Lookup, on the other hand, may be a bit more costly as there are
multiple lists to iterate. This is not really a problem for the
replacement of the superblock's inode list by a dlock list included in
the patchset, as lookup isn't needed there.

For use cases that need to do lookup, the dlock list can also be
treated as a set of hashed lists that scales with the number of CPU
cores in the system.
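
As a rough illustration of the intended usage, here is a minimal,
hypothetical sketch (struct foo, foo_setup(), foo_insert() and
foo_count() are made up for illustration; the irqsafe argument to
alloc_dlock_list_heads() is the one added by patch 6):

struct foo {
	int data;
	struct dlock_list_node node;	/* embedded dlock-list linkage */
};

static struct dlock_list_heads foo_lists;

static int foo_setup(void)
{
	/* false: the lists are not used from interrupt context */
	return alloc_dlock_list_heads(&foo_lists, false);
}

static void foo_insert(struct foo *f)
{
	init_dlock_list_node(&f->node);
	dlock_lists_add(&f->node, &foo_lists);	/* current CPU's list */
}

static long foo_count(void)
{
	struct foo *f;
	long n = 0;
	DEFINE_DLOCK_LIST_ITER(iter, &foo_lists);

	dlist_for_each_entry(f, &iter, node)	/* walks all the lists */
		n++;
	return n;
}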

Patches 5 and 6 are added to support other use cases, such as epoll
nested callbacks, which could use the dlock-list to reduce their own
lock contention problems.

Patch 1 introduces the dlock list. The list heads are allocated
by kcalloc() instead of percpu_alloc(). Each list head entry is
cacheline aligned to minimize contention.

Patch 2 replaces the use of list_for_each_entry_safe() in
evict_inodes() and invalidate_inodes() by list_for_each_entry().

Patch 3 modifies the superblock and inode structures to use the dlock
list. The corresponding functions that reference those structures
are modified.

Patch 4 makes the sibling CPUs use the same dlock list head to reduce
the number of list heads that need to be iterated.

Patch 5 enables the alternative use case as a set of hashed lists.

Patch 6 provides an irq safe mode specified at dlock-list allocation
time so that it can be used within interrupt context.

Jan Kara (1):
  vfs: Remove unnecessary list_for_each_entry_safe() variants

Waiman Long (5):
  lib/dlock-list: Distributed and lock-protected lists
  vfs: Use dlock list for superblock's inode list
  lib/dlock-list: Make sibling CPUs share the same linked list
  lib/dlock-list: Enable faster lookup with hashing
  lib/dlock-list: Add an IRQ-safe mode to be used in interrupt handler

 fs/block_dev.c             |   9 +-
 fs/drop_caches.c           |   9 +-
 fs/inode.c                 |  38 ++----
 fs/notify/fsnotify.c       |   9 +-
 fs/quota/dquot.c           |  14 +-
 fs/super.c                 |   7 +-
 include/linux/dlock-list.h | 263 +++++++++++++++++++++++++++++++++++
 include/linux/fs.h         |   8 +-
 lib/Makefile               |   2 +-
 lib/dlock-list.c           | 333 +++++++++++++++++++++++++++++++++++++++++++++
 10 files changed, 638 insertions(+), 54 deletions(-)
 create mode 100644 include/linux/dlock-list.h
 create mode 100644 lib/dlock-list.c

-- 
1.8.3.1


* [PATCH v8 1/6] lib/dlock-list: Distributed and lock-protected lists
  2017-10-31 18:50 [PATCH v8 0/6] vfs: Use dlock list for SB's s_inodes list Waiman Long
@ 2017-10-31 18:50 ` Waiman Long
  2017-10-31 21:37   ` Davidlohr Bueso
                     ` (2 more replies)
  2017-10-31 18:50 ` [PATCH v8 2/6] vfs: Remove unnecessary list_for_each_entry_safe() variants Waiman Long
                   ` (6 subsequent siblings)
  7 siblings, 3 replies; 32+ messages in thread
From: Waiman Long @ 2017-10-31 18:50 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng, Davidlohr Bueso,
	Waiman Long

Linked lists are used everywhere in the Linux kernel. However, if many
threads are trying to add or delete entries in the same linked list,
it can become a performance bottleneck.

This patch introduces new list APIs that provide a set of distributed
lists (one per CPU), each of which is protected by its own spinlock.
To the callers, however, the set of lists acts like a single
consolidated list.  This allows list entry insertion and deletion
operations to happen in parallel instead of being serialized by a
global list and lock.

List entry insertion is strictly per CPU. List deletion, however, can
happen on a CPU other than the one that did the insertion, so we still
need a lock to protect the list. Because of that, there may still be
a small amount of contention when deletions are done.

A new header file include/linux/dlock-list.h will be added with the
associated dlock_list_head and dlock_list_node structures. The following
functions are provided to manage the per-CPU lists:

 1. int alloc_dlock_list_heads(struct dlock_list_heads *dlist)
 2. void free_dlock_list_heads(struct dlock_list_heads *dlist)
 3. void dlock_lists_add(struct dlock_list_node *node,
		        struct dlock_list_heads *dlist)
 4. void dlock_lists_del(struct dlock_list_node *node)

Iteration of all the list entries within a dlock list array
is done by calling either the dlist_for_each_entry() or
dlist_for_each_entry_safe() macros. They correspond to the
list_for_each_entry() and list_for_each_entry_safe() macros
respectively. The iteration state is kept in a dlock_list_iter
structure that is passed to the iteration macros.
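
A minimal usage sketch follows; struct foo and the foo_get()/foo_put()/
foo_flush() helpers are hypothetical and only illustrate the intended
locking pattern (the VFS conversion later in the series does the same
thing with __iget()/iput()):

struct foo {
	struct dlock_list_node node;
	/* ... payload and a reference count ... */
};

static void foo_walk_all(struct dlock_list_heads *dlist)
{
	struct foo *f, *prev = NULL;
	DEFINE_DLOCK_LIST_ITER(iter, dlist);

	dlist_for_each_entry(f, &iter, node) {
		foo_get(f);			/* pin f so it stays on the list */
		dlock_list_unlock(&iter);	/* drop the current per-list lock */

		foo_flush(f);			/* blocking work is allowed here */

		if (prev)
			foo_put(prev);		/* drop the previous pin */
		prev = f;			/* release f only after advancing */
		dlock_list_relock(&iter);
	}
	if (prev)
		foo_put(prev);
}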

Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 include/linux/dlock-list.h | 242 +++++++++++++++++++++++++++++++++++++++++++++
 lib/Makefile               |   2 +-
 lib/dlock-list.c           | 234 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 477 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/dlock-list.h
 create mode 100644 lib/dlock-list.c

diff --git a/include/linux/dlock-list.h b/include/linux/dlock-list.h
new file mode 100644
index 0000000..c00c7f9
--- /dev/null
+++ b/include/linux/dlock-list.h
@@ -0,0 +1,242 @@
+/*
+ * Distributed and locked list
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2016 Hewlett-Packard Enterprise Development LP
+ * (C) Copyright 2017 Red Hat, Inc.
+ *
+ * Authors: Waiman Long <longman@redhat.com>
+ */
+#ifndef __LINUX_DLOCK_LIST_H
+#define __LINUX_DLOCK_LIST_H
+
+#include <linux/spinlock.h>
+#include <linux/list.h>
+
+/*
+ * include/linux/dlock-list.h
+ *
+ * The dlock_list_head structure contains the spinlock. It is cacheline
+ * aligned to reduce contention among different CPUs. The other
+ * dlock_list_node structures contain a pointer to the head entry instead.
+ */
+struct dlock_list_head {
+	struct list_head list;
+	spinlock_t lock;
+} ____cacheline_aligned_in_smp;
+
+struct dlock_list_heads {
+	struct dlock_list_head *heads;
+};
+
+/*
+ * dlock list node data structure
+ */
+struct dlock_list_node {
+	struct list_head list;
+	struct dlock_list_head *head;
+};
+
+/*
+ * dlock list iteration state
+ *
+ * This is an opaque data structure that may change. Users of this structure
+ * should not access the structure members directly other than using the
+ * helper functions and macros provided in this header file.
+ */
+struct dlock_list_iter {
+	int index;
+	struct dlock_list_head *head, *entry;
+};
+
+#define DLOCK_LIST_ITER_INIT(dlist)		\
+	{					\
+		.index = -1,			\
+		.head = (dlist)->heads,		\
+	}
+
+#define DEFINE_DLOCK_LIST_ITER(s, heads)	\
+	struct dlock_list_iter s = DLOCK_LIST_ITER_INIT(heads)
+
+static inline void init_dlock_list_iter(struct dlock_list_iter *iter,
+					struct dlock_list_heads *heads)
+{
+	*iter = (struct dlock_list_iter)DLOCK_LIST_ITER_INIT(heads);
+}
+
+#define DLOCK_LIST_NODE_INIT(name)		\
+	{					\
+		.list = LIST_HEAD_INIT(name)	\
+	}
+
+static inline void init_dlock_list_node(struct dlock_list_node *node)
+{
+	*node = (struct dlock_list_node)DLOCK_LIST_NODE_INIT(node->list);
+}
+
+/**
+ * dlock_list_unlock - unlock the spinlock that protects the current list
+ * @iter: Pointer to the dlock list iterator structure
+ */
+static inline void dlock_list_unlock(struct dlock_list_iter *iter)
+{
+	spin_unlock(&iter->entry->lock);
+}
+
+/**
+ * dlock_list_relock - lock the spinlock that protects the current list
+ * @iter: Pointer to the dlock list iterator structure
+ */
+static inline void dlock_list_relock(struct dlock_list_iter *iter)
+{
+	spin_lock(&iter->entry->lock);
+}
+
+/*
+ * Allocation and freeing of dlock list
+ */
+extern int  __alloc_dlock_list_heads(struct dlock_list_heads *dlist,
+				     struct lock_class_key *key);
+extern void free_dlock_list_heads(struct dlock_list_heads *dlist);
+
+/**
+ * alloc_dlock_list_heads - Initialize and allocate the list of head entries.
+ * @dlist  : Pointer to the dlock_list_heads structure to be initialized
+ * Return  : 0 if successful, -ENOMEM if memory allocation error
+ */
+#define alloc_dlock_list_heads(dlist)					\
+({									\
+	static struct lock_class_key _key;				\
+	__alloc_dlock_list_heads(dlist, &_key);				\
+})
+
+/*
+ * Check if a dlock list is empty or not.
+ */
+extern bool dlock_lists_empty(struct dlock_list_heads *dlist);
+
+/*
+ * The dlock list addition and deletion functions here are not irq-safe.
+ * Special irq-safe variants will have to be added if we need them.
+ */
+extern void dlock_lists_add(struct dlock_list_node *node,
+			    struct dlock_list_heads *dlist);
+extern void dlock_lists_del(struct dlock_list_node *node);
+
+/*
+ * Find the first entry of the next available list.
+ */
+extern struct dlock_list_node *
+__dlock_list_next_list(struct dlock_list_iter *iter);
+
+/**
+ * __dlock_list_next_entry - Iterate to the next entry of the dlock list
+ * @curr : Pointer to the current dlock_list_node structure
+ * @iter : Pointer to the dlock list iterator structure
+ * Return: Pointer to the next entry or NULL if all the entries are iterated
+ *
+ * The iterator has to be properly initialized before calling this function.
+ */
+static inline struct dlock_list_node *
+__dlock_list_next_entry(struct dlock_list_node *curr,
+			struct dlock_list_iter *iter)
+{
+	/*
+	 * Find next entry
+	 */
+	if (curr)
+		curr = list_next_entry(curr, list);
+
+	if (!curr || (&curr->list == &iter->entry->list)) {
+		/*
+		 * The current list has been exhausted, try the next available
+		 * list.
+		 */
+		curr = __dlock_list_next_list(iter);
+	}
+
+	return curr;	/* Continue the iteration */
+}
+
+/**
+ * _dlock_list_next_list_entry - get first element from next list in iterator
+ * @iter  : The dlock list iterator.
+ * @pos   : A variable of the struct that is embedded in.
+ * @member: The name of the dlock_list_node within the struct.
+ * Return : Pointer to first entry or NULL if all the lists are iterated.
+ */
+#define _dlock_list_next_list_entry(iter, pos, member)			\
+	({								\
+		struct dlock_list_node *_n;				\
+		_n = __dlock_list_next_entry(NULL, iter);		\
+		_n ? list_entry(_n, typeof(*pos), member) : NULL;	\
+	})
+
+/**
+ * _dlock_list_next_entry - iterate to the next entry of the list
+ * @pos   : The type * to cursor
+ * @iter  : The dlock list iterator.
+ * @member: The name of the dlock_list_node within the struct.
+ * Return : Pointer to the next entry or NULL if all the entries are iterated.
+ *
+ * Note that pos can't be NULL.
+ */
+#define _dlock_list_next_entry(pos, iter, member)			\
+	({								\
+		struct dlock_list_node *_n;				\
+		_n = __dlock_list_next_entry(&(pos)->member, iter);	\
+		_n ? list_entry(_n, typeof(*(pos)), member) : NULL;	\
+	})
+
+/**
+ * dlist_for_each_entry - iterate over the dlock list
+ * @pos   : Type * to use as a loop cursor
+ * @iter  : The dlock list iterator
+ * @member: The name of the dlock_list_node within the struct
+ *
+ * This iteration macro isn't safe with respect to list entry removal, but
+ * it can correctly iterate newly added entries right after the current one.
+ * This iteration function is designed to be used in a while loop.
+ */
+#define dlist_for_each_entry(pos, iter, member)				\
+	for (pos = _dlock_list_next_list_entry(iter, pos, member);	\
+	     pos != NULL;						\
+	     pos = _dlock_list_next_entry(pos, iter, member))
+
+/**
+ * dlist_for_each_entry_safe - iterate over the dlock list & safe over removal
+ * @pos   : Type * to use as a loop cursor
+ * @n	  : Another type * to use as temporary storage
+ * @iter  : The dlock list iterator
+ * @member: The name of the dlock_list_node within the struct
+ *
+ * This iteration macro is safe with respect to list entry removal.
+ * However, it cannot correctly iterate newly added entries right after the
+ * current one.
+ *
+ * The call to __dlock_list_next_list() is deferred until the next entry
+ * is being iterated to avoid use-after-unlock problem.
+ */
+#define dlist_for_each_entry_safe(pos, n, iter, member)			\
+	for (pos = NULL;						\
+	    ({								\
+		if (!pos ||						\
+		   (&(pos)->member.list == &(iter)->entry->list))	\
+			pos = _dlock_list_next_list_entry(iter, pos,	\
+							  member);	\
+		if (pos)						\
+			n = list_next_entry(pos, member.list);		\
+		pos;							\
+	    });								\
+	    pos = n)
+
+#endif /* __LINUX_DLOCK_LIST_H */
diff --git a/lib/Makefile b/lib/Makefile
index dafa796..0536cd3 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -38,7 +38,7 @@ obj-y += bcd.o div64.o sort.o parser.o debug_locks.o random32.o \
 	 gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \
 	 bsearch.o find_bit.o llist.o memweight.o kfifo.o \
 	 percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \
-	 once.o refcount.o usercopy.o errseq.o
+	 once.o refcount.o usercopy.o errseq.o dlock-list.o
 obj-y += string_helpers.o
 obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o
 obj-y += hexdump.o
diff --git a/lib/dlock-list.c b/lib/dlock-list.c
new file mode 100644
index 0000000..17ced06
--- /dev/null
+++ b/lib/dlock-list.c
@@ -0,0 +1,234 @@
+/*
+ * Distributed and locked list
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2016 Hewlett-Packard Enterprise Development LP
+ * (C) Copyright 2017 Red Hat, Inc.
+ *
+ * Authors: Waiman Long <longman@redhat.com>
+ */
+#include <linux/dlock-list.h>
+#include <linux/lockdep.h>
+#include <linux/slab.h>
+#include <linux/cpumask.h>
+
+/*
+ * The distributed and locked list is a distributed set of lists each of
+ * which is protected by its own spinlock, but acts like a single
+ * consolidated list to the callers. For scaling purpose, the number of
+ * lists used is equal to the number of possible CPUs in the system to
+ * minimize contention.
+ *
+ * However, it is possible that individual CPU numbers may be equal to
+ * or greater than the number of possible CPUs when there are holes in
+ * the CPU number list. As a result, we need to map the CPU number to a
+ * list index.
+ */
+static DEFINE_PER_CPU_READ_MOSTLY(int, cpu2idx);
+
+/*
+ * Initialize cpu2idx mapping table
+ *
+ * It is possible that a dlock-list can be allocated before the cpu2idx is
+ * initialized. In this case, all the cpus are mapped to the first entry
+ * before initialization.
+ *
+ */
+static int __init cpu2idx_init(void)
+{
+	int idx, cpu;
+
+	idx = 0;
+	for_each_possible_cpu(cpu)
+		per_cpu(cpu2idx, cpu) = idx++;
+	return 0;
+}
+postcore_initcall(cpu2idx_init);
+
+/**
+ * __alloc_dlock_list_heads - Initialize and allocate the list of head entries
+ * @dlist: Pointer to the dlock_list_heads structure to be initialized
+ * @key  : The lock class key to be used for lockdep
+ * Return: 0 if successful, -ENOMEM if memory allocation error
+ *
+ * This function does not allocate the dlock_list_heads structure itself. The
+ * callers will have to do their own memory allocation, if necessary. However,
+ * this allows embedding the dlock_list_heads structure directly into other
+ * structures.
+ *
+ * Dynamically allocated locks need to have their own special lock class
+ * to avoid lockdep warning.
+ */
+int __alloc_dlock_list_heads(struct dlock_list_heads *dlist,
+			     struct lock_class_key *key)
+{
+	int idx;
+
+	dlist->heads = kcalloc(nr_cpu_ids, sizeof(struct dlock_list_head),
+			       GFP_KERNEL);
+
+	if (!dlist->heads)
+		return -ENOMEM;
+
+	for (idx = 0; idx < nr_cpu_ids; idx++) {
+		struct dlock_list_head *head = &dlist->heads[idx];
+
+		INIT_LIST_HEAD(&head->list);
+		head->lock = __SPIN_LOCK_UNLOCKED(&head->lock);
+		lockdep_set_class(&head->lock, key);
+	}
+	return 0;
+}
+EXPORT_SYMBOL(__alloc_dlock_list_heads);
+
+/**
+ * free_dlock_list_heads - Free all the heads entries of the dlock list
+ * @dlist: Pointer of the dlock_list_heads structure to be freed
+ *
+ * This function doesn't free the dlock_list_heads structure itself. So
+ * the caller will have to do it, if necessary.
+ */
+void free_dlock_list_heads(struct dlock_list_heads *dlist)
+{
+	kfree(dlist->heads);
+	dlist->heads = NULL;
+}
+EXPORT_SYMBOL(free_dlock_list_heads);
+
+/**
+ * dlock_lists_empty - Check if all the dlock lists are empty
+ * @dlist: Pointer to the dlock_list_heads structure
+ * Return: true if list is empty, false otherwise.
+ *
+ * This can be a pretty expensive function call. If this function is required
+ * in a performance critical path, we may have to maintain a global count
+ * of the list entries in the global dlock_list_heads structure instead.
+ */
+bool dlock_lists_empty(struct dlock_list_heads *dlist)
+{
+	int idx;
+
+	for (idx = 0; idx < nr_cpu_ids; idx++)
+		if (!list_empty(&dlist->heads[idx].list))
+			return false;
+	return true;
+}
+EXPORT_SYMBOL(dlock_lists_empty);
+
+/**
+ * dlock_lists_add - Adds a node to the given dlock list
+ * @node : The node to be added
+ * @dlist: The dlock list where the node is to be added
+ *
+ * List selection is based on the CPU being used when the dlock_list_add()
+ * function is called. However, deletion may be done by a different CPU.
+ */
+void dlock_lists_add(struct dlock_list_node *node,
+		     struct dlock_list_heads *dlist)
+{
+	struct dlock_list_head *head = &dlist->heads[this_cpu_read(cpu2idx)];
+
+	/*
+	 * There is no need to disable preemption
+	 */
+	spin_lock(&head->lock);
+	node->head = head;
+	list_add(&node->list, &head->list);
+	spin_unlock(&head->lock);
+}
+EXPORT_SYMBOL(dlock_lists_add);
+
+/**
+ * dlock_lists_del - Delete a node from a dlock list
+ * @node : The node to be deleted
+ *
+ * We need to check the lock pointer again after taking the lock to guard
+ * against concurrent deletion of the same node. If the lock pointer changes
+ * (becomes NULL or to a different one), we assume that the deletion was done
+ * elsewhere. A warning will be printed if this happens as it is likely to be
+ * a bug.
+ */
+void dlock_lists_del(struct dlock_list_node *node)
+{
+	struct dlock_list_head *head;
+	bool retry;
+
+	do {
+		head = READ_ONCE(node->head);
+		if (WARN_ONCE(!head, "%s: node 0x%lx has no associated head\n",
+			      __func__, (unsigned long)node))
+			return;
+
+		spin_lock(&head->lock);
+		if (likely(head == node->head)) {
+			list_del_init(&node->list);
+			node->head = NULL;
+			retry = false;
+		} else {
+			/*
+			 * The lock has somehow changed. Retry again if it is
+			 * not NULL. Otherwise, just ignore the delete
+			 * operation.
+			 */
+			retry = (node->head != NULL);
+		}
+		spin_unlock(&head->lock);
+	} while (retry);
+}
+EXPORT_SYMBOL(dlock_lists_del);
+
+/**
+ * __dlock_list_next_list: Find the first entry of the next available list
+ * @iter : Pointer to the dlock list iterator structure
+ * Return: Pointer to the first entry of the next available list, or NULL
+ *         if all the lists have been exhausted
+ *
+ * The information about the next available list will be put into the iterator.
+ */
+struct dlock_list_node *__dlock_list_next_list(struct dlock_list_iter *iter)
+{
+	struct dlock_list_node *next;
+	struct dlock_list_head *head;
+
+restart:
+	if (iter->entry) {
+		spin_unlock(&iter->entry->lock);
+		iter->entry = NULL;
+	}
+
+next_list:
+	/*
+	 * Try next list
+	 */
+	if (++iter->index >= nr_cpu_ids)
+		return NULL;	/* All the entries iterated */
+
+	if (list_empty(&iter->head[iter->index].list))
+		goto next_list;
+
+	head = iter->entry = &iter->head[iter->index];
+	spin_lock(&head->lock);
+	/*
+	 * There is a slight chance that the list may become empty just
+	 * before the lock is acquired. So an additional check is
+	 * needed to make sure that a valid node will be returned.
+	 */
+	if (list_empty(&head->list))
+		goto restart;
+
+	next = list_entry(head->list.next, struct dlock_list_node,
+			  list);
+	WARN_ON_ONCE(next->head != head);
+
+	return next;
+}
+EXPORT_SYMBOL(__dlock_list_next_list);
-- 
1.8.3.1


* [PATCH v8 2/6] vfs: Remove unnecessary list_for_each_entry_safe() variants
  2017-10-31 18:50 [PATCH v8 0/6] vfs: Use dlock list for SB's s_inodes list Waiman Long
  2017-10-31 18:50 ` [PATCH v8 1/6] lib/dlock-list: Distributed and lock-protected lists Waiman Long
@ 2017-10-31 18:50 ` Waiman Long
  2017-10-31 18:50 ` [PATCH v8 3/6] vfs: Use dlock list for superblock's inode list Waiman Long
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 32+ messages in thread
From: Waiman Long @ 2017-10-31 18:50 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng, Davidlohr Bueso, Jan Kara,
	Waiman Long

From: Jan Kara <jack@suse.cz>

evict_inodes() and invalidate_inodes() use list_for_each_entry_safe()
to iterate the sb->s_inodes list. However, since we use the i_lru list
entry for our local temporary list of inodes to destroy, the inode is
guaranteed to stay in the sb->s_inodes list while we hold
sb->s_inode_list_lock. So there is no real need for the safe iteration
variant and we can use list_for_each_entry() just fine.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 fs/inode.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index fd40102..a9a38c0 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -601,12 +601,12 @@ static void dispose_list(struct list_head *head)
  */
 void evict_inodes(struct super_block *sb)
 {
-	struct inode *inode, *next;
+	struct inode *inode;
 	LIST_HEAD(dispose);
 
 again:
 	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		if (atomic_read(&inode->i_count))
 			continue;
 
@@ -652,11 +652,11 @@ void evict_inodes(struct super_block *sb)
 int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 {
 	int busy = 0;
-	struct inode *inode, *next;
+	struct inode *inode;
 	LIST_HEAD(dispose);
 
 	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
 			spin_unlock(&inode->i_lock);
-- 
1.8.3.1


* [PATCH v8 3/6] vfs: Use dlock list for superblock's inode list
  2017-10-31 18:50 [PATCH v8 0/6] vfs: Use dlock list for SB's s_inodes list Waiman Long
  2017-10-31 18:50 ` [PATCH v8 1/6] lib/dlock-list: Distributed and lock-protected lists Waiman Long
  2017-10-31 18:50 ` [PATCH v8 2/6] vfs: Remove unnecessary list_for_each_entry_safe() variants Waiman Long
@ 2017-10-31 18:50 ` Waiman Long
  2017-10-31 18:50 ` [PATCH v8 4/6] lib/dlock-list: Make sibling CPUs share the same linked list Waiman Long
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 32+ messages in thread
From: Waiman Long @ 2017-10-31 18:50 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng, Davidlohr Bueso,
	Waiman Long

When many threads are trying to add inodes to or delete inodes from
a superblock's s_inodes list, spinlock contention on the list can
become a performance bottleneck.

This patch changes the s_inodes field to become a dlock list which
is a distributed set of lists with per-list spinlocks.  As a result,
the following superblock inode list (sb->s_inodes) iteration functions
in vfs are also being modified:

 1. iterate_bdevs()
 2. drop_pagecache_sb()
 3. evict_inodes()
 4. invalidate_inodes()
 5. fsnotify_unmount_inodes()
 6. add_dquot_ref()
 7. remove_dquot_ref()

An exit microbenchmark creates a large number of threads, attaches
many inodes to them and then exits. The runtimes of that
microbenchmark with 1200 threads before and after the patch on a
2-socket Intel E5-2699 v3 system (36 cores, 72 threads) were as
follows:

  Kernel            Elapsed Time    System Time
  ------            ------------    -----------
  Vanilla 4.14-rc3     58.40s         65m20s
  Patched 4.14-rc3     42.17s         45m32s

Before the patch, spinlock contention in the inode_sb_list_add()
function during the startup phase and in the evict() function during
the teardown phase accounted for about 68% and 97% of total CPU time
respectively (as measured by perf). With the patched kernel, the
inode_sb_list_add() function takes less than 1% and the evict()
function about 2.5% respectively. There was still quite a lot of lock
contention elsewhere that limited the gain; for example, the nlru lock
contributed most of the lock contention in the teardown phase.

Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/block_dev.c       |  9 ++++-----
 fs/drop_caches.c     |  9 ++++-----
 fs/inode.c           | 34 +++++++++++++---------------------
 fs/notify/fsnotify.c |  9 ++++-----
 fs/quota/dquot.c     | 14 ++++++--------
 fs/super.c           |  7 ++++---
 include/linux/fs.h   |  8 ++++----
 7 files changed, 39 insertions(+), 51 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 789f55e..9bbeb14 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -2127,9 +2127,9 @@ int __invalidate_device(struct block_device *bdev, bool kill_dirty)
 void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
 {
 	struct inode *inode, *old_inode = NULL;
+	DEFINE_DLOCK_LIST_ITER(iter, &blockdev_superblock->s_inodes);
 
-	spin_lock(&blockdev_superblock->s_inode_list_lock);
-	list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
+	dlist_for_each_entry(inode, &iter, i_sb_list) {
 		struct address_space *mapping = inode->i_mapping;
 		struct block_device *bdev;
 
@@ -2141,7 +2141,7 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&blockdev_superblock->s_inode_list_lock);
+		dlock_list_unlock(&iter);
 		/*
 		 * We hold a reference to 'inode' so it couldn't have been
 		 * removed from s_inodes list while we dropped the
@@ -2159,8 +2159,7 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
 			func(bdev, arg);
 		mutex_unlock(&bdev->bd_mutex);
 
-		spin_lock(&blockdev_superblock->s_inode_list_lock);
+		dlock_list_relock(&iter);
 	}
-	spin_unlock(&blockdev_superblock->s_inode_list_lock);
 	iput(old_inode);
 }
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index d72d52b..8db1374 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -16,9 +16,9 @@
 static void drop_pagecache_sb(struct super_block *sb, void *unused)
 {
 	struct inode *inode, *toput_inode = NULL;
+	DEFINE_DLOCK_LIST_ITER(iter, &sb->s_inodes);
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	dlist_for_each_entry(inode, &iter, i_sb_list) {
 		spin_lock(&inode->i_lock);
 		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
 		    (inode->i_mapping->nrpages == 0)) {
@@ -27,15 +27,14 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
+		dlock_list_unlock(&iter);
 
 		invalidate_mapping_pages(inode->i_mapping, 0, -1);
 		iput(toput_inode);
 		toput_inode = inode;
 
-		spin_lock(&sb->s_inode_list_lock);
+		dlock_list_relock(&iter);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 	iput(toput_inode);
 }
 
diff --git a/fs/inode.c b/fs/inode.c
index a9a38c0..04bcc9a 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -28,7 +28,7 @@
  *   inode->i_state, inode->i_hash, __iget()
  * Inode LRU list locks protect:
  *   inode->i_sb->s_inode_lru, inode->i_lru
- * inode->i_sb->s_inode_list_lock protects:
+ * inode->i_sb->s_inodes->head->lock protects:
  *   inode->i_sb->s_inodes, inode->i_sb_list
  * bdi->wb.list_lock protects:
  *   bdi->wb.b_{dirty,io,more_io,dirty_time}, inode->i_io_list
@@ -37,7 +37,7 @@
  *
  * Lock ordering:
  *
- * inode->i_sb->s_inode_list_lock
+ * inode->i_sb->s_inodes->head->lock
  *   inode->i_lock
  *     Inode LRU list locks
  *
@@ -45,7 +45,7 @@
  *   inode->i_lock
  *
  * inode_hash_lock
- *   inode->i_sb->s_inode_list_lock
+ *   inode->i_sb->s_inodes->head->lock
  *   inode->i_lock
  *
  * iunique_lock
@@ -434,19 +434,14 @@ static void inode_lru_list_del(struct inode *inode)
  */
 void inode_sb_list_add(struct inode *inode)
 {
-	spin_lock(&inode->i_sb->s_inode_list_lock);
-	list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
-	spin_unlock(&inode->i_sb->s_inode_list_lock);
+	dlock_lists_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
 }
 EXPORT_SYMBOL_GPL(inode_sb_list_add);
 
 static inline void inode_sb_list_del(struct inode *inode)
 {
-	if (!list_empty(&inode->i_sb_list)) {
-		spin_lock(&inode->i_sb->s_inode_list_lock);
-		list_del_init(&inode->i_sb_list);
-		spin_unlock(&inode->i_sb->s_inode_list_lock);
-	}
+	if (!list_empty(&inode->i_sb_list.list))
+		dlock_lists_del(&inode->i_sb_list);
 }
 
 static unsigned long hash(struct super_block *sb, unsigned long hashval)
@@ -602,11 +597,12 @@ static void dispose_list(struct list_head *head)
 void evict_inodes(struct super_block *sb)
 {
 	struct inode *inode;
+	struct dlock_list_iter iter;
 	LIST_HEAD(dispose);
 
 again:
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	init_dlock_list_iter(&iter, &sb->s_inodes);
+	dlist_for_each_entry(inode, &iter, i_sb_list) {
 		if (atomic_read(&inode->i_count))
 			continue;
 
@@ -627,13 +623,12 @@ void evict_inodes(struct super_block *sb)
 		 * bit so we don't livelock.
 		 */
 		if (need_resched()) {
-			spin_unlock(&sb->s_inode_list_lock);
+			dlock_list_unlock(&iter);
 			cond_resched();
 			dispose_list(&dispose);
 			goto again;
 		}
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 
 	dispose_list(&dispose);
 }
@@ -654,9 +649,9 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 	int busy = 0;
 	struct inode *inode;
 	LIST_HEAD(dispose);
+	DEFINE_DLOCK_LIST_ITER(iter, &sb->s_inodes);
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	dlist_for_each_entry(inode, &iter, i_sb_list) {
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
 			spin_unlock(&inode->i_lock);
@@ -678,7 +673,6 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 		spin_unlock(&inode->i_lock);
 		list_add(&inode->i_lru, &dispose);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 
 	dispose_list(&dispose);
 
@@ -893,7 +887,7 @@ struct inode *new_inode_pseudo(struct super_block *sb)
 		spin_lock(&inode->i_lock);
 		inode->i_state = 0;
 		spin_unlock(&inode->i_lock);
-		INIT_LIST_HEAD(&inode->i_sb_list);
+		init_dlock_list_node(&inode->i_sb_list);
 	}
 	return inode;
 }
@@ -914,8 +908,6 @@ struct inode *new_inode(struct super_block *sb)
 {
 	struct inode *inode;
 
-	spin_lock_prefetch(&sb->s_inode_list_lock);
-
 	inode = new_inode_pseudo(sb);
 	if (inode)
 		inode_sb_list_add(inode);
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 0c4583b..136a68a 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -51,9 +51,9 @@ void __fsnotify_vfsmount_delete(struct vfsmount *mnt)
 void fsnotify_unmount_inodes(struct super_block *sb)
 {
 	struct inode *inode, *iput_inode = NULL;
+	DEFINE_DLOCK_LIST_ITER(iter, &sb->s_inodes);
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	dlist_for_each_entry(inode, &iter, i_sb_list) {
 		/*
 		 * We cannot __iget() an inode in state I_FREEING,
 		 * I_WILL_FREE, or I_NEW which is fine because by that point
@@ -78,7 +78,7 @@ void fsnotify_unmount_inodes(struct super_block *sb)
 
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
+		dlock_list_unlock(&iter);
 
 		if (iput_inode)
 			iput(iput_inode);
@@ -90,9 +90,8 @@ void fsnotify_unmount_inodes(struct super_block *sb)
 
 		iput_inode = inode;
 
-		spin_lock(&sb->s_inode_list_lock);
+		dlock_list_relock(&iter);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 
 	if (iput_inode)
 		iput(iput_inode);
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 52ad151..30e0211 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -936,12 +936,12 @@ static int dqinit_needed(struct inode *inode, int type)
 static void add_dquot_ref(struct super_block *sb, int type)
 {
 	struct inode *inode, *old_inode = NULL;
+	DEFINE_DLOCK_LIST_ITER(iter, &sb->s_inodes);
 #ifdef CONFIG_QUOTA_DEBUG
 	int reserved = 0;
 #endif
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	dlist_for_each_entry(inode, &iter, i_sb_list) {
 		spin_lock(&inode->i_lock);
 		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
 		    !atomic_read(&inode->i_writecount) ||
@@ -951,7 +951,7 @@ static void add_dquot_ref(struct super_block *sb, int type)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
+		dlock_list_unlock(&iter);
 
 #ifdef CONFIG_QUOTA_DEBUG
 		if (unlikely(inode_get_rsv_space(inode) > 0))
@@ -969,9 +969,8 @@ static void add_dquot_ref(struct super_block *sb, int type)
 		 * later.
 		 */
 		old_inode = inode;
-		spin_lock(&sb->s_inode_list_lock);
+		dlock_list_relock(&iter);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 	iput(old_inode);
 
 #ifdef CONFIG_QUOTA_DEBUG
@@ -1039,9 +1038,9 @@ static void remove_dquot_ref(struct super_block *sb, int type,
 {
 	struct inode *inode;
 	int reserved = 0;
+	DEFINE_DLOCK_LIST_ITER(iter, &sb->s_inodes);
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	dlist_for_each_entry(inode, &iter, i_sb_list) {
 		/*
 		 *  We have to scan also I_NEW inodes because they can already
 		 *  have quota pointer initialized. Luckily, we need to touch
@@ -1056,7 +1055,6 @@ static void remove_dquot_ref(struct super_block *sb, int type,
 		}
 		spin_unlock(&dq_data_lock);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 #ifdef CONFIG_QUOTA_DEBUG
 	if (reserved) {
 		printk(KERN_WARNING "VFS (%s): Writes happened after quota"
diff --git a/fs/super.c b/fs/super.c
index 166c4ee..a90a070 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -164,6 +164,7 @@ static void destroy_super(struct super_block *s)
 {
 	list_lru_destroy(&s->s_dentry_lru);
 	list_lru_destroy(&s->s_inode_lru);
+	free_dlock_list_heads(&s->s_inodes);
 	security_sb_free(s);
 	WARN_ON(!list_empty(&s->s_mounts));
 	put_user_ns(s->s_user_ns);
@@ -210,11 +211,11 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
 	INIT_HLIST_NODE(&s->s_instances);
 	INIT_HLIST_BL_HEAD(&s->s_anon);
 	mutex_init(&s->s_sync_lock);
-	INIT_LIST_HEAD(&s->s_inodes);
-	spin_lock_init(&s->s_inode_list_lock);
 	INIT_LIST_HEAD(&s->s_inodes_wb);
 	spin_lock_init(&s->s_inode_wblist_lock);
 
+	if (alloc_dlock_list_heads(&s->s_inodes))
+		goto fail;
 	if (list_lru_init_memcg(&s->s_dentry_lru))
 		goto fail;
 	if (list_lru_init_memcg(&s->s_inode_lru))
@@ -434,7 +435,7 @@ void generic_shutdown_super(struct super_block *sb)
 		if (sop->put_super)
 			sop->put_super(sb);
 
-		if (!list_empty(&sb->s_inodes)) {
+		if (!dlock_lists_empty(&sb->s_inodes)) {
 			printk("VFS: Busy inodes after unmount of %s. "
 			   "Self-destruct in 5 seconds.  Have a nice day...\n",
 			   sb->s_id);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 13dab19..d687ad8 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -34,6 +34,7 @@
 #include <linux/delayed_call.h>
 #include <linux/uuid.h>
 #include <linux/errseq.h>
+#include <linux/dlock-list.h>
 
 #include <asm/byteorder.h>
 #include <uapi/linux/fs.h>
@@ -632,7 +633,7 @@ struct inode {
 	u16			i_wb_frn_history;
 #endif
 	struct list_head	i_lru;		/* inode LRU list */
-	struct list_head	i_sb_list;
+	struct dlock_list_node	i_sb_list;
 	struct list_head	i_wb_list;	/* backing dev writeback list */
 	union {
 		struct hlist_head	i_dentry;
@@ -1434,9 +1435,8 @@ struct super_block {
 	 */
 	int s_stack_depth;
 
-	/* s_inode_list_lock protects s_inodes */
-	spinlock_t		s_inode_list_lock ____cacheline_aligned_in_smp;
-	struct list_head	s_inodes;	/* all inodes */
+	/* The internal per-list locks protect s_inodes */
+	struct dlock_list_heads s_inodes;	/* all inodes */
 
 	spinlock_t		s_inode_wblist_lock;
 	struct list_head	s_inodes_wb;	/* writeback inodes */
-- 
1.8.3.1


* [PATCH v8 4/6] lib/dlock-list: Make sibling CPUs share the same linked list
  2017-10-31 18:50 [PATCH v8 0/6] vfs: Use dlock list for SB's s_inodes list Waiman Long
                   ` (2 preceding siblings ...)
  2017-10-31 18:50 ` [PATCH v8 3/6] vfs: Use dlock list for superblock's inode list Waiman Long
@ 2017-10-31 18:50 ` Waiman Long
  2017-11-01  8:38   ` Jan Kara
  2017-10-31 18:50 ` [PATCH v8 5/6] lib/dlock-list: Enable faster lookup with hashing Waiman Long
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 32+ messages in thread
From: Waiman Long @ 2017-10-31 18:50 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng, Davidlohr Bueso,
	Waiman Long

The dlock list currently uses one list for each available CPU. Sibling
CPUs, however, share the L2 and probably the L1 caches too. As a
result, there is not much to gain in terms of avoiding cacheline
contention, while the cacheline footprint of the L1/L2 caches grows
because separate lists may need to be in the cache.

This patch makes all the sibling CPUs share the same list, thus
reducing the number of lists that need to be maintained in each
dlock list without having any noticeable impact on performance. It
also improves dlock list iteration performance as fewer lists need
to be iterated.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 lib/dlock-list.c | 74 ++++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 59 insertions(+), 15 deletions(-)

diff --git a/lib/dlock-list.c b/lib/dlock-list.c
index 17ced06..a4ddecc 100644
--- a/lib/dlock-list.c
+++ b/lib/dlock-list.c
@@ -25,31 +25,65 @@
  * The distributed and locked list is a distributed set of lists each of
  * which is protected by its own spinlock, but acts like a single
  * consolidated list to the callers. For scaling purpose, the number of
- * lists used is equal to the number of possible CPUs in the system to
- * minimize contention.
+ * lists used is equal to the number of possible cores in the system to
+ * minimize contention. All threads of the same CPU core will share the
+ * same list.
  *
- * However, it is possible that individual CPU numbers may be equal to
- * or greater than the number of possible CPUs when there are holes in
- * the CPU number list. As a result, we need to map the CPU number to a
- * list index.
+ * We need to map each CPU number to a list index.
  */
 static DEFINE_PER_CPU_READ_MOSTLY(int, cpu2idx);
+static int nr_dlock_lists __read_mostly;
 
 /*
- * Initialize cpu2idx mapping table
+ * Initialize cpu2idx mapping table & nr_dlock_lists.
  *
  * It is possible that a dlock-list can be allocated before the cpu2idx is
  * initialized. In this case, all the cpus are mapped to the first entry
  * before initialization.
  *
+ * All the sibling CPUs of a sibling group will map to the same dlock list so
+ * as to reduce the number of dlock lists to be maintained while minimizing
+ * cacheline contention.
+ *
+ * As the sibling masks are set up in the core initcall phase, this function
+ * has to be done in the postcore phase to get the right data.
  */
 static int __init cpu2idx_init(void)
 {
 	int idx, cpu;
+	struct cpumask *sibling_mask;
+	static struct cpumask mask __initdata;
 
+	cpumask_clear(&mask);
 	idx = 0;
-	for_each_possible_cpu(cpu)
-		per_cpu(cpu2idx, cpu) = idx++;
+	for_each_possible_cpu(cpu) {
+		int scpu;
+
+		if (cpumask_test_cpu(cpu, &mask))
+			continue;
+		per_cpu(cpu2idx, cpu) = idx;
+		cpumask_set_cpu(cpu, &mask);
+
+		sibling_mask = topology_sibling_cpumask(cpu);
+		if (sibling_mask) {
+			for_each_cpu(scpu, sibling_mask) {
+				per_cpu(cpu2idx, scpu) = idx;
+				cpumask_set_cpu(scpu, &mask);
+			}
+		}
+		idx++;
+	}
+
+	/*
+	 * nr_dlock_lists can only be set after cpu2idx is properly
+	 * initialized.
+	 */
+	smp_mb();
+	nr_dlock_lists = idx;
+	WARN_ON(nr_dlock_lists > nr_cpu_ids);
+
+	pr_info("dlock-list: %d head entries per dlock list.\n",
+		nr_dlock_lists);
 	return 0;
 }
 postcore_initcall(cpu2idx_init);
@@ -67,19 +101,23 @@ static int __init cpu2idx_init(void)
  *
  * Dynamically allocated locks need to have their own special lock class
  * to avoid lockdep warning.
+ *
+ * Since nr_dlock_lists will always be <= nr_cpu_ids, having more lists
+ * than necessary allocated is not a problem other than some wasted memory.
+ * The extra lists will not be ever used as all the cpu2idx entries will be
+ * 0 before initialization.
  */
 int __alloc_dlock_list_heads(struct dlock_list_heads *dlist,
 			     struct lock_class_key *key)
 {
-	int idx;
+	int idx, cnt = nr_dlock_lists ? nr_dlock_lists : nr_cpu_ids;
 
-	dlist->heads = kcalloc(nr_cpu_ids, sizeof(struct dlock_list_head),
-			       GFP_KERNEL);
+	dlist->heads = kcalloc(cnt, sizeof(struct dlock_list_head), GFP_KERNEL);
 
 	if (!dlist->heads)
 		return -ENOMEM;
 
-	for (idx = 0; idx < nr_cpu_ids; idx++) {
+	for (idx = 0; idx < cnt; idx++) {
 		struct dlock_list_head *head = &dlist->heads[idx];
 
 		INIT_LIST_HEAD(&head->list);
@@ -117,7 +155,10 @@ bool dlock_lists_empty(struct dlock_list_heads *dlist)
 {
 	int idx;
 
-	for (idx = 0; idx < nr_cpu_ids; idx++)
+	/* Shouldn't be called before nr_dlock_lists is initialized */
+	WARN_ON_ONCE(!nr_dlock_lists);
+
+	for (idx = 0; idx < nr_dlock_lists; idx++)
 		if (!list_empty(&dlist->heads[idx].list))
 			return false;
 	return true;
@@ -199,6 +240,9 @@ struct dlock_list_node *__dlock_list_next_list(struct dlock_list_iter *iter)
 	struct dlock_list_node *next;
 	struct dlock_list_head *head;
 
+	/* Shouldn't be called before nr_dlock_lists is initialized */
+	WARN_ON_ONCE(!nr_dlock_lists);
+
 restart:
 	if (iter->entry) {
 		spin_unlock(&iter->entry->lock);
@@ -209,7 +253,7 @@ struct dlock_list_node *__dlock_list_next_list(struct dlock_list_iter *iter)
 	/*
 	 * Try next list
 	 */
-	if (++iter->index >= nr_cpu_ids)
+	if (++iter->index >= nr_dlock_lists)
 		return NULL;	/* All the entries iterated */
 
 	if (list_empty(&iter->head[iter->index].list))
-- 
1.8.3.1


* [PATCH v8 5/6] lib/dlock-list: Enable faster lookup with hashing
  2017-10-31 18:50 [PATCH v8 0/6] vfs: Use dlock list for SB's s_inodes list Waiman Long
                   ` (3 preceding siblings ...)
  2017-10-31 18:50 ` [PATCH v8 4/6] lib/dlock-list: Make sibling CPUs share the same linked list Waiman Long
@ 2017-10-31 18:50 ` Waiman Long
  2017-11-01  8:40   ` Jan Kara
  2017-10-31 18:51 ` [PATCH v8 6/6] lib/dlock-list: Add an IRQ-safe mode to be used in interrupt handler Waiman Long
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 32+ messages in thread
From: Waiman Long @ 2017-10-31 18:50 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng, Davidlohr Bueso,
	Waiman Long

Insertion and deletion are relatively cheap and mostly contention
free for a dlock-list. Lookup, on the other hand, can be rather costly
because all the lists in a dlock-list have to be iterated.

Currently dlock-list insertion is based on the CPU that the task is
running on. So a given object can be inserted into any one of the
lists depending on what the current CPU is.

This patch provides an alternative way of list selection. The caller
can provide an object context which will be hashed to one of the lists
in a dlock-list. The object can then be added into that particular
list. Lookup can then be done by iterating only the elements in the
selected list instead of all the lists in the dlock-list.

The new APIs are:

struct dlock_list_head *dlock_list_hash(struct dlock_list_heads *, void *);
void dlock_list_add(struct dlock_list_node *, struct dlock_list_head *);
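
As a hypothetical sketch of how a user might combine the two (struct
foo and its key field are made up; walking the selected list directly
under its lock is an assumption, since this patch does not add a
per-list lookup helper):

struct foo {
	void *key;			/* hashing context */
	struct dlock_list_node node;
};

static void foo_insert_hashed(struct dlock_list_heads *dlist,
			      struct foo *f)
{
	struct dlock_list_head *head = dlock_list_hash(dlist, f->key);

	dlock_list_add(&f->node, head);	/* same list for the same key */
}

static struct foo *foo_lookup(struct dlock_list_heads *dlist, void *key)
{
	struct dlock_list_head *head = dlock_list_hash(dlist, key);
	struct foo *f, *found = NULL;

	/* Only the list the key hashes to needs to be searched. */
	spin_lock(&head->lock);
	list_for_each_entry(f, &head->list, node.list) {
		if (f->key == key) {
			found = f;
			break;
		}
	}
	spin_unlock(&head->lock);
	return found;
}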

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/dlock-list.h |  9 ++++++++
 lib/dlock-list.c           | 51 +++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 53 insertions(+), 7 deletions(-)

diff --git a/include/linux/dlock-list.h b/include/linux/dlock-list.h
index c00c7f9..b374101 100644
--- a/include/linux/dlock-list.h
+++ b/include/linux/dlock-list.h
@@ -133,6 +133,15 @@ extern void dlock_lists_add(struct dlock_list_node *node,
 extern void dlock_lists_del(struct dlock_list_node *node);
 
 /*
+ * Instead of individual list mapping by CPU number, it can be based on
+ * a given context to speed up lookup performance.
+ */
+extern struct dlock_list_head *dlock_list_hash(struct dlock_list_heads *dlist,
+					       void *context);
+extern void dlock_list_add(struct dlock_list_node *node,
+			   struct dlock_list_head *head);
+
+/*
  * Find the first entry of the next available list.
  */
 extern struct dlock_list_node *
diff --git a/lib/dlock-list.c b/lib/dlock-list.c
index a4ddecc..f3667aa 100644
--- a/lib/dlock-list.c
+++ b/lib/dlock-list.c
@@ -20,6 +20,7 @@
 #include <linux/lockdep.h>
 #include <linux/slab.h>
 #include <linux/cpumask.h>
+#include <linux/jhash.h>
 
 /*
  * The distributed and locked list is a distributed set of lists each of
@@ -166,6 +167,48 @@ bool dlock_lists_empty(struct dlock_list_heads *dlist)
 EXPORT_SYMBOL(dlock_lists_empty);
 
 /**
+ * dlock_list_hash - Hash the given context to a particular list
+ * @dlist: The dlock list
+ * @ctx  : The context for hashing
+ */
+struct dlock_list_head *dlock_list_hash(struct dlock_list_heads *dlist,
+					void *ctx)
+{
+	unsigned long val = (unsigned long)ctx;
+	u32 hash;
+
+	if (unlikely(!nr_dlock_lists)) {
+		WARN_ON_ONCE(1);
+		return &dlist->heads[0];
+	}
+	if (val < nr_dlock_lists)
+		hash = val;
+	else
+		hash = jhash2((u32 *)&ctx, sizeof(ctx)/sizeof(u32), 0)
+				% nr_dlock_lists;
+	return &dlist->heads[hash];
+}
+EXPORT_SYMBOL(dlock_list_hash);
+
+/**
+ * dlock_list_add - Add a node to a particular head of dlock list
+ * @node: The node to be added
+ * @head: The dlock list head where the node is to be added
+ */
+void dlock_list_add(struct dlock_list_node *node,
+		    struct dlock_list_head *head)
+{
+	/*
+	 * There is no need to disable preemption
+	 */
+	spin_lock(&head->lock);
+	node->head = head;
+	list_add(&node->list, &head->list);
+	spin_unlock(&head->lock);
+}
+EXPORT_SYMBOL(dlock_list_add);
+
+/**
  * dlock_lists_add - Adds a node to the given dlock list
  * @node : The node to be added
  * @dlist: The dlock list where the node is to be added
@@ -178,13 +221,7 @@ void dlock_lists_add(struct dlock_list_node *node,
 {
 	struct dlock_list_head *head = &dlist->heads[this_cpu_read(cpu2idx)];
 
-	/*
-	 * There is no need to disable preemption
-	 */
-	spin_lock(&head->lock);
-	node->head = head;
-	list_add(&node->list, &head->list);
-	spin_unlock(&head->lock);
+	dlock_list_add(node, head);
 }
 EXPORT_SYMBOL(dlock_lists_add);
 
-- 
1.8.3.1


* [PATCH v8 6/6] lib/dlock-list: Add an IRQ-safe mode to be used in interrupt handler
  2017-10-31 18:50 [PATCH v8 0/6] vfs: Use dlock list for SB's s_inodes list Waiman Long
                   ` (4 preceding siblings ...)
  2017-10-31 18:50 ` [PATCH v8 5/6] lib/dlock-list: Enable faster lookup with hashing Waiman Long
@ 2017-10-31 18:51 ` Waiman Long
  2017-10-31 21:29   ` Davidlohr Bueso
  2017-11-29 15:26 ` [PATCH v8 0/6] vfs: Use dlock list for SB's s_inodes list Davidlohr Bueso
  2018-02-26  2:47 ` Dave Chinner
  7 siblings, 1 reply; 32+ messages in thread
From: Waiman Long @ 2017-10-31 18:51 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng, Davidlohr Bueso,
	Waiman Long

To enable the use of dlock-list in an interrupt handler, a new
irqsafe mode can now be specified at dlock-list allocation time as
an additional argument to alloc_dlock_list_heads(). With that mode
specified, the spin_lock_irqsave/spin_unlock_irqrestore pair will be
used instead of the regular lock and unlock calls.
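
A minimal sketch of the intended usage (struct foo, foo_setup() and
foo_add_from_irq() are hypothetical):

struct foo {
	struct dlock_list_node node;
};

static struct dlock_list_heads foo_lists;

static int foo_setup(void)
{
	/* true: entries will be added/removed from interrupt context */
	return alloc_dlock_list_heads(&foo_lists, true);
}

/* Called from a hard interrupt handler. */
static void foo_add_from_irq(struct foo *f)
{
	/*
	 * With irqsafe set, dlock_lists_add() and dlock_lists_del() use
	 * spin_lock_irqsave()/spin_unlock_irqrestore() internally.
	 */
	init_dlock_list_node(&f->node);
	dlock_lists_add(&f->node, &foo_lists);
}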

Signed-off-by: Waiman Long <longman@redhat.com>
---
 fs/super.c                 |  2 +-
 include/linux/dlock-list.h | 22 +++++++++++++++++-----
 lib/dlock-list.c           | 46 ++++++++++++++++++++++++++++++++--------------
 3 files changed, 50 insertions(+), 20 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index a90a070..0840e54 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -214,7 +214,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
 	INIT_LIST_HEAD(&s->s_inodes_wb);
 	spin_lock_init(&s->s_inode_wblist_lock);
 
-	if (alloc_dlock_list_heads(&s->s_inodes))
+	if (alloc_dlock_list_heads(&s->s_inodes, false))
 		goto fail;
 	if (list_lru_init_memcg(&s->s_dentry_lru))
 		goto fail;
diff --git a/include/linux/dlock-list.h b/include/linux/dlock-list.h
index b374101..fcb2b71 100644
--- a/include/linux/dlock-list.h
+++ b/include/linux/dlock-list.h
@@ -32,6 +32,8 @@
 struct dlock_list_head {
 	struct list_head list;
 	spinlock_t lock;
+	int irqsafe;	/* IRQ safe mode */
+	unsigned long flags;
 } ____cacheline_aligned_in_smp;
 
 struct dlock_list_heads {
@@ -89,7 +91,12 @@ static inline void init_dlock_list_node(struct dlock_list_node *node)
  */
 static inline void dlock_list_unlock(struct dlock_list_iter *iter)
 {
-	spin_unlock(&iter->entry->lock);
+	struct dlock_list_head *h = iter->entry;
+
+	if (h->irqsafe)
+		spin_unlock_irqrestore(&h->lock, h->flags);
+	else
+		spin_unlock(&h->lock);
 }
 
 /**
@@ -98,14 +105,19 @@ static inline void dlock_list_unlock(struct dlock_list_iter *iter)
  */
 static inline void dlock_list_relock(struct dlock_list_iter *iter)
 {
-	spin_lock(&iter->entry->lock);
+	struct dlock_list_head *h = iter->entry;
+
+	if (h->irqsafe)
+		spin_lock_irqsave(&h->lock, h->flags);
+	else
+		spin_lock(&h->lock);
 }
 
 /*
  * Allocation and freeing of dlock list
  */
 extern int  __alloc_dlock_list_heads(struct dlock_list_heads *dlist,
-				     struct lock_class_key *key);
+				     int irqsafe, struct lock_class_key *key);
 extern void free_dlock_list_heads(struct dlock_list_heads *dlist);
 
 /**
@@ -113,10 +125,10 @@ extern int  __alloc_dlock_list_heads(struct dlock_list_heads *dlist,
  * @dlist  : Pointer to the dlock_list_heads structure to be initialized
  * Return  : 0 if successful, -ENOMEM if memory allocation error
  */
-#define alloc_dlock_list_heads(dlist)					\
+#define alloc_dlock_list_heads(dlist, irqsafe)				\
 ({									\
 	static struct lock_class_key _key;				\
-	__alloc_dlock_list_heads(dlist, &_key);				\
+	__alloc_dlock_list_heads(dlist, irqsafe, &_key);		\
 })
 
 /*
diff --git a/lib/dlock-list.c b/lib/dlock-list.c
index f3667aa..c1197f5 100644
--- a/lib/dlock-list.c
+++ b/lib/dlock-list.c
@@ -91,8 +91,9 @@ static int __init cpu2idx_init(void)
 
 /**
  * __alloc_dlock_list_heads - Initialize and allocate the list of head entries
- * @dlist: Pointer to the dlock_list_heads structure to be initialized
- * @key  : The lock class key to be used for lockdep
+ * @dlist  : Pointer to the dlock_list_heads structure to be initialized
+ * @irqsafe: IRQ safe mode flag
+ * @key    : The lock class key to be used for lockdep
  * Return: 0 if successful, -ENOMEM if memory allocation error
  *
  * This function does not allocate the dlock_list_heads structure itself. The
@@ -108,7 +109,7 @@ static int __init cpu2idx_init(void)
  * The extra lists will not be ever used as all the cpu2idx entries will be
  * 0 before initialization.
  */
-int __alloc_dlock_list_heads(struct dlock_list_heads *dlist,
+int __alloc_dlock_list_heads(struct dlock_list_heads *dlist, int irqsafe,
 			     struct lock_class_key *key)
 {
 	int idx, cnt = nr_dlock_lists ? nr_dlock_lists : nr_cpu_ids;
@@ -123,6 +124,7 @@ int __alloc_dlock_list_heads(struct dlock_list_heads *dlist,
 
 		INIT_LIST_HEAD(&head->list);
 		head->lock = __SPIN_LOCK_UNLOCKED(&head->lock);
+		head->irqsafe = irqsafe;
 		lockdep_set_class(&head->lock, key);
 	}
 	return 0;
@@ -198,13 +200,19 @@ struct dlock_list_head *dlock_list_hash(struct dlock_list_heads *dlist,
 void dlock_list_add(struct dlock_list_node *node,
 		    struct dlock_list_head *head)
 {
-	/*
-	 * There is no need to disable preemption
-	 */
-	spin_lock(&head->lock);
-	node->head = head;
-	list_add(&node->list, &head->list);
-	spin_unlock(&head->lock);
+	unsigned long flags;
+
+	if (head->irqsafe) {
+		spin_lock_irqsave(&head->lock, flags);
+		node->head = head;
+		list_add(&node->list, &head->list);
+		spin_unlock_irqrestore(&head->lock, flags);
+	} else {
+		spin_lock(&head->lock);
+		node->head = head;
+		list_add(&node->list, &head->list);
+		spin_unlock(&head->lock);
+	}
 }
 EXPORT_SYMBOL(dlock_list_add);
 
@@ -238,6 +246,7 @@ void dlock_lists_add(struct dlock_list_node *node,
 void dlock_lists_del(struct dlock_list_node *node)
 {
 	struct dlock_list_head *head;
+	unsigned long flags = 0;
 	bool retry;
 
 	do {
@@ -246,7 +255,11 @@ void dlock_lists_del(struct dlock_list_node *node)
 			      __func__, (unsigned long)node))
 			return;
 
-		spin_lock(&head->lock);
+		if (head->irqsafe)
+			spin_lock_irqsave(&head->lock, flags);
+		else
+			spin_lock(&head->lock);
+
 		if (likely(head == node->head)) {
 			list_del_init(&node->list);
 			node->head = NULL;
@@ -259,7 +272,11 @@ void dlock_lists_del(struct dlock_list_node *node)
 			 */
 			retry = (node->head != NULL);
 		}
-		spin_unlock(&head->lock);
+
+		if (head->irqsafe)
+			spin_unlock_irqrestore(&head->lock, flags);
+		else
+			spin_unlock(&head->lock);
 	} while (retry);
 }
 EXPORT_SYMBOL(dlock_lists_del);
@@ -282,7 +299,7 @@ struct dlock_list_node *__dlock_list_next_list(struct dlock_list_iter *iter)
 
 restart:
 	if (iter->entry) {
-		spin_unlock(&iter->entry->lock);
+		dlock_list_unlock(iter);
 		iter->entry = NULL;
 	}
 
@@ -297,7 +314,8 @@ struct dlock_list_node *__dlock_list_next_list(struct dlock_list_iter *iter)
 		goto next_list;
 
 	head = iter->entry = &iter->head[iter->index];
-	spin_lock(&head->lock);
+	dlock_list_relock(iter);
+
 	/*
 	 * There is a slight chance that the list may become empty just
 	 * before the lock is acquired. So an additional check is
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 32+ messages in thread
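
As a usage illustration of the irqsafe mode above, here is a minimal sketch
of a driver that fills a dlock list from its interrupt handler. Only the
dlock-list calls follow the v8 interfaces as posted; the foo_* names and the
way a request is obtained are hypothetical.

#include <linux/dlock-list.h>
#include <linux/interrupt.h>
#include <linux/slab.h>

/* Hypothetical driver-side structures, for illustration only */
struct foo_req {
	struct dlock_list_node node;
	/* ... request data ... */
};

struct foo_dev {
	struct dlock_list_heads pending;	/* updated from hard-IRQ context */
};

static int foo_dev_init(struct foo_dev *dev)
{
	/* irqsafe = true: per-list locks are taken with irqsave/irqrestore */
	return alloc_dlock_list_heads(&dev->pending, true);
}

/* Hard-IRQ context: with irqsafe mode, dlock_lists_add() disables
 * interrupts around the per-list spinlock internally. */
static irqreturn_t foo_irq(int irq, void *data)
{
	struct foo_dev *dev = data;
	struct foo_req *req = kzalloc(sizeof(*req), GFP_ATOMIC);

	if (req) {
		init_dlock_list_node(&req->node);
		dlock_lists_add(&req->node, &dev->pending);
	}
	return IRQ_HANDLED;
}

static void foo_dev_exit(struct foo_dev *dev)
{
	/* All requests should have been removed with dlock_lists_del() */
	WARN_ON(!dlock_lists_empty(&dev->pending));
	free_dlock_list_heads(&dev->pending);
}

With irqsafe set to false at allocation time, the same calls take the plain
spin_lock() paths shown in the hunks above, as fs/super.c does for the
s_inodes list.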

* Re: [PATCH v8 6/6] lib/dlock-list: Add an IRQ-safe mode to be used in interrupt handler
  2017-10-31 18:51 ` [PATCH v8 6/6] lib/dlock-list: Add an IRQ-safe mode to be used in interrupt handler Waiman Long
@ 2017-10-31 21:29   ` Davidlohr Bueso
  0 siblings, 0 replies; 32+ messages in thread
From: Davidlohr Bueso @ 2017-10-31 21:29 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Boqun Feng

I would drop this patch for being too ugly and, if nothing else, for lack
of users (epoll will no longer need dlock).

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v8 1/6] lib/dlock-list: Distributed and lock-protected lists
  2017-10-31 18:50 ` [PATCH v8 1/6] lib/dlock-list: Distributed and lock-protected lists Waiman Long
@ 2017-10-31 21:37   ` Davidlohr Bueso
  2017-11-01 18:44     ` Waiman Long
  2017-11-02 17:04   ` Davidlohr Bueso
  2017-11-29 15:29   ` [PATCH v8 1/6] lib/dlock-list: Distributed and lock-protected lists Davidlohr Bueso
  2 siblings, 1 reply; 32+ messages in thread
From: Davidlohr Bueso @ 2017-10-31 21:37 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Boqun Feng

On Tue, 31 Oct 2017, Waiman Long wrote:

>+void dlock_lists_del(struct dlock_list_node *node)
>+{
>+	struct dlock_list_head *head;
>+	bool retry;
>+
>+	do {
>+		head = READ_ONCE(node->head);

Boqun had previously pointed this out; you need to WRITE_ONCE() node->head too.

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v8 4/6] lib/dlock-list: Make sibling CPUs share the same linked list
  2017-10-31 18:50 ` [PATCH v8 4/6] lib/dlock-list: Make sibling CPUs share the same linked list Waiman Long
@ 2017-11-01  8:38   ` Jan Kara
  0 siblings, 0 replies; 32+ messages in thread
From: Jan Kara @ 2017-11-01  8:38 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Boqun Feng, Davidlohr Bueso

On Tue 31-10-17 14:50:58, Waiman Long wrote:
> The dlock list needs one list for each of the CPUs available. However,
> for sibling CPUs, they are sharing the L2 and probably L1 caches
> too. As a result, there is not much to gain in terms of avoiding
> cacheline contention while increasing the cacheline footprint of the
> L1/L2 caches as separate lists may need to be in the cache.
> 
> This patch makes all the sibling CPUs share the same list, thus
> reducing the number of lists that need to be maintained in each
> dlock list without having any noticeable impact on performance. It
> also improves dlock list iteration performance as fewer lists need
> to be iterated.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>

The patch looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  lib/dlock-list.c | 74 ++++++++++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 59 insertions(+), 15 deletions(-)
> 
> diff --git a/lib/dlock-list.c b/lib/dlock-list.c
> index 17ced06..a4ddecc 100644
> --- a/lib/dlock-list.c
> +++ b/lib/dlock-list.c
> @@ -25,31 +25,65 @@
>   * The distributed and locked list is a distributed set of lists each of
>   * which is protected by its own spinlock, but acts like a single
>   * consolidated list to the callers. For scaling purpose, the number of
> - * lists used is equal to the number of possible CPUs in the system to
> - * minimize contention.
> + * lists used is equal to the number of possible cores in the system to
> + * minimize contention. All threads of the same CPU core will share the
> + * same list.
>   *
> - * However, it is possible that individual CPU numbers may be equal to
> - * or greater than the number of possible CPUs when there are holes in
> - * the CPU number list. As a result, we need to map the CPU number to a
> - * list index.
> + * We need to map each CPU number to a list index.
>   */
>  static DEFINE_PER_CPU_READ_MOSTLY(int, cpu2idx);
> +static int nr_dlock_lists __read_mostly;
>  
>  /*
> - * Initialize cpu2idx mapping table
> + * Initialize cpu2idx mapping table & nr_dlock_lists.
>   *
>   * It is possible that a dlock-list can be allocated before the cpu2idx is
>   * initialized. In this case, all the cpus are mapped to the first entry
>   * before initialization.
>   *
> + * All the sibling CPUs of a sibling group will map to the same dlock list so
> + * as to reduce the number of dlock lists to be maintained while minimizing
> + * cacheline contention.
> + *
> + * As the sibling masks are set up in the core initcall phase, this function
> + * has to be done in the postcore phase to get the right data.
>   */
>  static int __init cpu2idx_init(void)
>  {
>  	int idx, cpu;
> +	struct cpumask *sibling_mask;
> +	static struct cpumask mask __initdata;
>  
> +	cpumask_clear(&mask);
>  	idx = 0;
> -	for_each_possible_cpu(cpu)
> -		per_cpu(cpu2idx, cpu) = idx++;
> +	for_each_possible_cpu(cpu) {
> +		int scpu;
> +
> +		if (cpumask_test_cpu(cpu, &mask))
> +			continue;
> +		per_cpu(cpu2idx, cpu) = idx;
> +		cpumask_set_cpu(cpu, &mask);
> +
> +		sibling_mask = topology_sibling_cpumask(cpu);
> +		if (sibling_mask) {
> +			for_each_cpu(scpu, sibling_mask) {
> +				per_cpu(cpu2idx, scpu) = idx;
> +				cpumask_set_cpu(scpu, &mask);
> +			}
> +		}
> +		idx++;
> +	}
> +
> +	/*
> +	 * nr_dlock_lists can only be set after cpu2idx is properly
> +	 * initialized.
> +	 */
> +	smp_mb();
> +	nr_dlock_lists = idx;
> +	WARN_ON(nr_dlock_lists > nr_cpu_ids);
> +
> +	pr_info("dlock-list: %d head entries per dlock list.\n",
> +		nr_dlock_lists);
>  	return 0;
>  }
>  postcore_initcall(cpu2idx_init);
> @@ -67,19 +101,23 @@ static int __init cpu2idx_init(void)
>   *
>   * Dynamically allocated locks need to have their own special lock class
>   * to avoid lockdep warning.
> + *
> + * Since nr_dlock_lists will always be <= nr_cpu_ids, having more lists
> + * than necessary allocated is not a problem other than some wasted memory.
> + * The extra lists will never be used as all the cpu2idx entries will be
> + * 0 before initialization.
>   */
>  int __alloc_dlock_list_heads(struct dlock_list_heads *dlist,
>  			     struct lock_class_key *key)
>  {
> -	int idx;
> +	int idx, cnt = nr_dlock_lists ? nr_dlock_lists : nr_cpu_ids;
>  
> -	dlist->heads = kcalloc(nr_cpu_ids, sizeof(struct dlock_list_head),
> -			       GFP_KERNEL);
> +	dlist->heads = kcalloc(cnt, sizeof(struct dlock_list_head), GFP_KERNEL);
>  
>  	if (!dlist->heads)
>  		return -ENOMEM;
>  
> -	for (idx = 0; idx < nr_cpu_ids; idx++) {
> +	for (idx = 0; idx < cnt; idx++) {
>  		struct dlock_list_head *head = &dlist->heads[idx];
>  
>  		INIT_LIST_HEAD(&head->list);
> @@ -117,7 +155,10 @@ bool dlock_lists_empty(struct dlock_list_heads *dlist)
>  {
>  	int idx;
>  
> -	for (idx = 0; idx < nr_cpu_ids; idx++)
> +	/* Shouldn't be called before nr_dlock_lists is initialized */
> +	WARN_ON_ONCE(!nr_dlock_lists);
> +
> +	for (idx = 0; idx < nr_dlock_lists; idx++)
>  		if (!list_empty(&dlist->heads[idx].list))
>  			return false;
>  	return true;
> @@ -199,6 +240,9 @@ struct dlock_list_node *__dlock_list_next_list(struct dlock_list_iter *iter)
>  	struct dlock_list_node *next;
>  	struct dlock_list_head *head;
>  
> +	/* Shouldn't be called before nr_dlock_lists is initialized */
> +	WARN_ON_ONCE(!nr_dlock_lists);
> +
>  restart:
>  	if (iter->entry) {
>  		spin_unlock(&iter->entry->lock);
> @@ -209,7 +253,7 @@ struct dlock_list_node *__dlock_list_next_list(struct dlock_list_iter *iter)
>  	/*
>  	 * Try next list
>  	 */
> -	if (++iter->index >= nr_cpu_ids)
> +	if (++iter->index >= nr_dlock_lists)
>  		return NULL;	/* All the entries iterated */
>  
>  	if (list_empty(&iter->head[iter->index].list))
> -- 
> 1.8.3.1
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v8 5/6] lib/dlock-list: Enable faster lookup with hashing
  2017-10-31 18:50 ` [PATCH v8 5/6] lib/dlock-list: Enable faster lookup with hashing Waiman Long
@ 2017-11-01  8:40   ` Jan Kara
  2017-11-01 13:16     ` Waiman Long
  0 siblings, 1 reply; 32+ messages in thread
From: Jan Kara @ 2017-11-01  8:40 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Boqun Feng, Davidlohr Bueso

On Tue 31-10-17 14:50:59, Waiman Long wrote:
> Insertion and deletion are relatively cheap and mostly contention
> free for dlock-list. Lookup, on the other hand, can be rather costly
> because all the lists in a dlock-list will have to be iterated.
> 
> Currently dlock-list insertion is based on the cpu that the task is
> running on. So a given object can be inserted into any one of the
> lists depending on what the current cpu is.
> 
> This patch provides an alternative way of list selection. The caller
> can provide an object context which will be hashed to one of the lists
> in a dlock-list. The object can then be added into that particular
> list. Lookup can be done by iterating elements in the provided list
> only instead of all the lists in a dlock-list.
> 
> The new APIs are:
> 
> struct dlock_list_head *dlock_list_hash(struct dlock_list_heads *, void *);
> void dlock_list_add(struct dlock_list_node *, struct dlock_list_head *);
> 
> Signed-off-by: Waiman Long <longman@redhat.com>

Hum, do we have any users for this API? And wouldn't they also need to
control how many lists are allocated then?

								Honza

> ---
>  include/linux/dlock-list.h |  9 ++++++++
>  lib/dlock-list.c           | 51 +++++++++++++++++++++++++++++++++++++++-------
>  2 files changed, 53 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/dlock-list.h b/include/linux/dlock-list.h
> index c00c7f9..b374101 100644
> --- a/include/linux/dlock-list.h
> +++ b/include/linux/dlock-list.h
> @@ -133,6 +133,15 @@ extern void dlock_lists_add(struct dlock_list_node *node,
>  extern void dlock_lists_del(struct dlock_list_node *node);
>  
>  /*
> + * Instead of individual list mapping by CPU number, it can be based on
> + * a given context to speed up lookup performance.
> + */
> +extern struct dlock_list_head *dlock_list_hash(struct dlock_list_heads *dlist,
> +					       void *context);
> +extern void dlock_list_add(struct dlock_list_node *node,
> +			   struct dlock_list_head *head);
> +
> +/*
>   * Find the first entry of the next available list.
>   */
>  extern struct dlock_list_node *
> diff --git a/lib/dlock-list.c b/lib/dlock-list.c
> index a4ddecc..f3667aa 100644
> --- a/lib/dlock-list.c
> +++ b/lib/dlock-list.c
> @@ -20,6 +20,7 @@
>  #include <linux/lockdep.h>
>  #include <linux/slab.h>
>  #include <linux/cpumask.h>
> +#include <linux/jhash.h>
>  
>  /*
>   * The distributed and locked list is a distributed set of lists each of
> @@ -166,6 +167,48 @@ bool dlock_lists_empty(struct dlock_list_heads *dlist)
>  EXPORT_SYMBOL(dlock_lists_empty);
>  
>  /**
> + * dlock_list_hash - Hash the given context to a particular list
> + * @dlist: The dlock list
> + * @ctx  : The context for hashing
> + */
> +struct dlock_list_head *dlock_list_hash(struct dlock_list_heads *dlist,
> +					void *ctx)
> +{
> +	unsigned long val = (unsigned long)ctx;
> +	u32 hash;
> +
> +	if (unlikely(!nr_dlock_lists)) {
> +		WARN_ON_ONCE(1);
> +		return &dlist->heads[0];
> +	}
> +	if (val < nr_dlock_lists)
> +		hash = val;
> +	else
> +		hash = jhash2((u32 *)&ctx, sizeof(ctx)/sizeof(u32), 0)
> +				% nr_dlock_lists;
> +	return &dlist->heads[hash];
> +}
> +EXPORT_SYMBOL(dlock_list_hash);
> +
> +/**
> + * dlock_list_add - Add a node to a particular head of dlock list
> + * @node: The node to be added
> + * @head: The dlock list head where the node is to be added
> + */
> +void dlock_list_add(struct dlock_list_node *node,
> +		    struct dlock_list_head *head)
> +{
> +	/*
> +	 * There is no need to disable preemption
> +	 */
> +	spin_lock(&head->lock);
> +	node->head = head;
> +	list_add(&node->list, &head->list);
> +	spin_unlock(&head->lock);
> +}
> +EXPORT_SYMBOL(dlock_list_add);
> +
> +/**
>   * dlock_lists_add - Adds a node to the given dlock list
>   * @node : The node to be added
>   * @dlist: The dlock list where the node is to be added
> @@ -178,13 +221,7 @@ void dlock_lists_add(struct dlock_list_node *node,
>  {
>  	struct dlock_list_head *head = &dlist->heads[this_cpu_read(cpu2idx)];
>  
> -	/*
> -	 * There is no need to disable preemption
> -	 */
> -	spin_lock(&head->lock);
> -	node->head = head;
> -	list_add(&node->list, &head->list);
> -	spin_unlock(&head->lock);
> +	dlock_list_add(node, head);
>  }
>  EXPORT_SYMBOL(dlock_lists_add);
>  
> -- 
> 1.8.3.1
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v8 5/6] lib/dlock-list: Enable faster lookup with hashing
  2017-11-01  8:40   ` Jan Kara
@ 2017-11-01 13:16     ` Waiman Long
  0 siblings, 0 replies; 32+ messages in thread
From: Waiman Long @ 2017-11-01 13:16 UTC (permalink / raw)
  To: Jan Kara
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Boqun Feng, Davidlohr Bueso

On 11/01/2017 04:40 AM, Jan Kara wrote:
> On Tue 31-10-17 14:50:59, Waiman Long wrote:
>> Insertion and deletion are relatively cheap and mostly contention
>> free for dlock-list. Lookup, on the other hand, can be rather costly
>> because all the lists in a dlock-list will have to be iterated.
>>
>> Currently dlock-list insertion is based on the cpu that the task is
>> running on. So a given object can be inserted into any one of the
>> lists depending on what the current cpu is.
>>
>> This patch provides an alternative way of list selection. The caller
>> can provide an object context which will be hashed to one of the lists
>> in a dlock-list. The object can then be added into that particular
>> list. Lookup can be done by iterating elements in the provided list
>> only instead of all the lists in a dlock-list.
>>
>> The new APIs are:
>>
>> struct dlock_list_head *dlock_list_hash(struct dlock_list_heads *, void *);
>> void dlock_list_add(struct dlock_list_node *, struct dlock_list_head *);
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
> Hum, do we have any users for this API? And wouldn't they also need to
> control how many lists are allocated then?

This patch is supposed to be used by the epoll patch from Davidlohr. As
he has retracted the patch, I can drop this patch also. The number of
lists scales with the number of CPU cores in the system whether it is
used one way or the other.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 32+ messages in thread
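
To make the intended use concrete, a hypothetical sketch of hashed insertion
plus a bounded lookup, assuming the dlock_list_hash()/dlock_list_add() pair
from patch 5. The obj/owner names are made up, and locking the selected
sub-list directly through head->lock for the lookup is an assumption for
illustration; the series itself does not provide a lookup helper.

#include <linux/dlock-list.h>
#include <linux/list.h>
#include <linux/spinlock.h>

struct obj {
	struct dlock_list_node node;
	void *owner;
};

static void obj_insert(struct dlock_list_heads *dlist, struct obj *obj)
{
	/* Objects with the same owner always hash to the same sub-list */
	struct dlock_list_head *head = dlock_list_hash(dlist, obj->owner);

	dlock_list_add(&obj->node, head);
}

static struct obj *obj_lookup(struct dlock_list_heads *dlist, void *owner)
{
	struct dlock_list_head *head = dlock_list_hash(dlist, owner);
	struct obj *obj, *found = NULL;

	/* Only one of the nr_dlock_lists sub-lists has to be walked */
	spin_lock(&head->lock);
	list_for_each_entry(obj, &head->list, node.list) {
		if (obj->owner == owner) {
			found = obj;
			break;
		}
	}
	spin_unlock(&head->lock);
	return found;
}

Because every object hashed from the same context lands on the same sub-list,
a lookup walks one list instead of all of them, which is the difference this
patch was meant to make for use cases such as epoll.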

* Re: [PATCH v8 1/6] lib/dlock-list: Distributed and lock-protected lists
  2017-10-31 21:37   ` Davidlohr Bueso
@ 2017-11-01 18:44     ` Waiman Long
  0 siblings, 0 replies; 32+ messages in thread
From: Waiman Long @ 2017-11-01 18:44 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Boqun Feng

On 10/31/2017 05:37 PM, Davidlohr Bueso wrote:
> On Tue, 31 Oct 2017, Waiman Long wrote:
>
>> +void dlock_lists_del(struct dlock_list_node *node)
>> +{
>> +    struct dlock_list_head *head;
>> +    bool retry;
>> +
>> +    do {
>> +        head = READ_ONCE(node->head);
>
> Boqun had previously pointed this out; you need to WRITE_ONCE()
> node->head too.

Right, I will get that into the next version of the patch.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v8 1/6] lib/dlock-list: Distributed and lock-protected lists
  2017-10-31 18:50 ` [PATCH v8 1/6] lib/dlock-list: Distributed and lock-protected lists Waiman Long
  2017-10-31 21:37   ` Davidlohr Bueso
@ 2017-11-02 17:04   ` Davidlohr Bueso
  2017-11-02 17:30     ` Waiman Long
  2017-11-29 15:29   ` [PATCH v8 1/6] lib/dlock-list: Distributed and lock-protected lists Davidlohr Bueso
  2 siblings, 1 reply; 32+ messages in thread
From: Davidlohr Bueso @ 2017-11-02 17:04 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Boqun Feng

On Tue, 31 Oct 2017, Waiman Long wrote:

>+/**
>+ * dlock_lists_empty - Check if all the dlock lists are empty
>+ * @dlist: Pointer to the dlock_list_heads structure
>+ * Return: true if list is empty, false otherwise.
>+ *
>+ * This can be a pretty expensive function call. If this function is required
>+ * in a performance critical path, we may have to maintain a global count
>+ * of the list entries in the global dlock_list_heads structure instead.
>+ */

I vote for doing this in the original version. How about the following?

>+bool dlock_lists_empty(struct dlock_list_heads *dlist)
>+{
>+	int idx;
>+
>+	for (idx = 0; idx < nr_cpu_ids; idx++)
>+		if (!list_empty(&dlist->heads[idx].list))
>+			return false;
>+	return true;
>+}
>+EXPORT_SYMBOL(dlock_lists_empty);

----------8<-----------------------------------------------
From: Davidlohr Bueso <dave@stgolabs.net>
Subject: [PATCH] lib/dlock-list: Scale dlock_lists_empty()

Instead of the current O(N) implementation, convert the call to
an atomic_read(), at the cost of adding an atomic counter. We also
need to add a heads pointer to the node structure such that we can
unaccount a thread doing list_del().

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
 include/linux/dlock-list.h |  2 ++
 lib/dlock-list.c           | 40 ++++++++++++++++++++++++++++------------
 2 files changed, 30 insertions(+), 12 deletions(-)

diff --git a/include/linux/dlock-list.h b/include/linux/dlock-list.h
index c00c7f92ada4..dd73d5787885 100644
--- a/include/linux/dlock-list.h
+++ b/include/linux/dlock-list.h
@@ -36,6 +36,7 @@ struct dlock_list_head {
 
 struct dlock_list_heads {
 	struct dlock_list_head *heads;
+	atomic_t waiters;
 };
 
 /*
@@ -44,6 +45,7 @@ struct dlock_list_heads {
 struct dlock_list_node {
 	struct list_head list;
 	struct dlock_list_head *head;
+	struct dlock_list_heads *heads;
 };
 
 /*
diff --git a/lib/dlock-list.c b/lib/dlock-list.c
index a4ddecc01b12..bd11fc0da254 100644
--- a/lib/dlock-list.c
+++ b/lib/dlock-list.c
@@ -124,6 +124,8 @@ int __alloc_dlock_list_heads(struct dlock_list_heads *dlist,
 		head->lock = __SPIN_LOCK_UNLOCKED(&head->lock);
 		lockdep_set_class(&head->lock, key);
 	}
+
+	atomic_set(&dlist->waiters, 0);
 	return 0;
 }
 EXPORT_SYMBOL(__alloc_dlock_list_heads);
@@ -139,29 +141,23 @@ void free_dlock_list_heads(struct dlock_list_heads *dlist)
 {
 	kfree(dlist->heads);
 	dlist->heads = NULL;
+	atomic_set(&dlist->waiters, 0);
 }
 EXPORT_SYMBOL(free_dlock_list_heads);
 
 /**
  * dlock_lists_empty - Check if all the dlock lists are empty
  * @dlist: Pointer to the dlock_list_heads structure
- * Return: true if list is empty, false otherwise.
  *
- * This can be a pretty expensive function call. If this function is required
- * in a performance critical path, we may have to maintain a global count
- * of the list entries in the global dlock_list_heads structure instead.
+ * Return: true if all dlock lists are empty, false otherwise.
  */
 bool dlock_lists_empty(struct dlock_list_heads *dlist)
 {
-	int idx;
-
 	/* Shouldn't be called before nr_dlock_lists is initialized */
 	WARN_ON_ONCE(!nr_dlock_lists);
 
-	for (idx = 0; idx < nr_dlock_lists; idx++)
-		if (!list_empty(&dlist->heads[idx].list))
-			return false;
-	return true;
+	smp_mb__before_atomic();
+	return !atomic_read(&dlist->waiters);
 }
 EXPORT_SYMBOL(dlock_lists_empty);
 
@@ -179,10 +175,30 @@ void dlock_lists_add(struct dlock_list_node *node,
 	struct dlock_list_head *head = &dlist->heads[this_cpu_read(cpu2idx)];
 
 	/*
+	 * Serialize dlist->waiters such that a 0->1 transition is not missed,
+	 * by another thread checking if any of the dlock lists are used.
+	 *
+	 * CPU0				    CPU1
+	 * dlock_list_add()                 dlock_lists_empty()
+	 *   [S] atomic_inc(waiters);
+	 *	 smp_mb__after_atomic();
+	 *				      smp_mb__before_atomic();
+	 *				      [L] atomic_read(waiters)
+	 *       list_add()
+	 *
+	 * Bump the waiters counter _before_ taking the head->lock such that we
+	 * don't miss a thread adding itself to a list while spinning for the
+	 * lock.
+	 */
+	atomic_inc(&dlist->waiters);
+	smp_mb__after_atomic();
+
+	/*
 	 * There is no need to disable preemption
 	 */
 	spin_lock(&head->lock);
 	node->head = head;
+	node->heads = dlist;
 	list_add(&node->list, &head->list);
 	spin_unlock(&head->lock);
 }
@@ -199,8 +215,7 @@ EXPORT_SYMBOL(dlock_lists_add);
  * a bug.
  */
 void dlock_lists_del(struct dlock_list_node *node)
-{
-	struct dlock_list_head *head;
+{	struct dlock_list_head *head;
 	bool retry;
 
 	do {
@@ -214,6 +229,7 @@ void dlock_lists_del(struct dlock_list_node *node)
 			list_del_init(&node->list);
 			node->head = NULL;
 			retry = false;
+			atomic_dec(&node->heads->waiters);
 		} else {
 			/*
 			 * The lock has somehow changed. Retry again if it is
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v8 1/6] lib/dlock-list: Distributed and lock-protected lists
  2017-11-02 17:04   ` Davidlohr Bueso
@ 2017-11-02 17:30     ` Waiman Long
  2017-11-03 13:34       ` Davidlohr Bueso
  0 siblings, 1 reply; 32+ messages in thread
From: Waiman Long @ 2017-11-02 17:30 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Boqun Feng

On 11/02/2017 01:04 PM, Davidlohr Bueso wrote:
> On Tue, 31 Oct 2017, Waiman Long wrote:
>
>> +/**
>> + * dlock_lists_empty - Check if all the dlock lists are empty
>> + * @dlist: Pointer to the dlock_list_heads structure
>> + * Return: true if list is empty, false otherwise.
>> + *
>> + * This can be a pretty expensive function call. If this function is required
>> + * in a performance critical path, we may have to maintain a global count
>> + * of the list entries in the global dlock_list_heads structure instead.
>> + */
>
> I vote for doing this in the original version. How about the following?
>
>> +bool dlock_lists_empty(struct dlock_list_heads *dlist)
>> +{
>> +    int idx;
>> +
>> +    for (idx = 0; idx < nr_cpu_ids; idx++)
>> +        if (!list_empty(&dlist->heads[idx].list))
>> +            return false;
>> +    return true;
>> +}
>> +EXPORT_SYMBOL(dlock_lists_empty);
>
> ----------8<-----------------------------------------------
> From: Davidlohr Bueso <dave@stgolabs.net>
> Subject: [PATCH] lib/dlock-list: Scale dlock_lists_empty()
>
> Instead of the current O(N) implementation; at the cost
> of adding an atomic counter. We also need to add a heads
> pointer to the node structure such that we can unaccount
> a thread doing list_del().
>

The counter will then become the single contention point for all
concurrent updates to the dlock-list. So it will have a big impact on
performance. On the other hand, instead of being a counter of # of
items, we can make that a counter of # of non-empty lists. So its value
will only be changed when a list goes from empty to non-empty and vice
versa. That will greatly reduce the number of updates to that counter.


> Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
> ---
> include/linux/dlock-list.h |  2 ++
> lib/dlock-list.c           | 40 ++++++++++++++++++++++++++++------------
> 2 files changed, 30 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/dlock-list.h b/include/linux/dlock-list.h
> index c00c7f92ada4..dd73d5787885 100644
> --- a/include/linux/dlock-list.h
> +++ b/include/linux/dlock-list.h
> @@ -36,6 +36,7 @@ struct dlock_list_head {
>
> struct dlock_list_heads {
>     struct dlock_list_head *heads;
> +    atomic_t waiters;
> };
>
> /*
> @@ -44,6 +45,7 @@ struct dlock_list_heads {
> struct dlock_list_node {
>     struct list_head list;
>     struct dlock_list_head *head;
> +    struct dlock_list_heads *heads;
> };
>

I don't want to add a new data item into dlock_list_node as there can be
thousands or even millions of them in the system. Instead, I prefer
increasing the size of dlock_list_head, of which there are only a limited
number and which have unused space because they are cacheline aligned.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 32+ messages in thread
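
A distilled sketch of the "counter of non-empty lists" idea described above,
kept separate from the actual patches that follow; the demo_* names and the
nonempty_lists field are made up here, and the memory-ordering details are
left to the follow-up patches below.

#include <linux/dlock-list.h>
#include <linux/atomic.h>
#include <linux/list.h>
#include <linux/spinlock.h>

struct demo_heads {
	struct dlock_list_head *heads;
	atomic_t nonempty_lists;	/* hypothetical counter name */
};

static void demo_add(struct demo_heads *dlist, struct dlock_list_head *head,
		     struct dlock_list_node *node)
{
	spin_lock(&head->lock);
	if (list_empty(&head->list))	/* 0 -> 1 entries on this sub-list */
		atomic_inc(&dlist->nonempty_lists);
	node->head = head;
	list_add(&node->list, &head->list);
	spin_unlock(&head->lock);
}

static bool demo_empty(struct demo_heads *dlist)
{
	return !atomic_read(&dlist->nonempty_lists);
}

The shared atomic only moves on an empty to non-empty transition of a
sub-list (and the reverse on deletion), so steady-state insertions and
deletions stay contention free while the emptiness check becomes a single
atomic_read().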

* Re: [PATCH v8 1/6] lib/dlock-list: Distributed and lock-protected lists
  2017-11-02 17:30     ` Waiman Long
@ 2017-11-03 13:34       ` Davidlohr Bueso
  2017-11-03 14:22         ` [PATCH v3] lib/dlock-list: Scale dlock_lists_empty() Davidlohr Bueso
  0 siblings, 1 reply; 32+ messages in thread
From: Davidlohr Bueso @ 2017-11-03 13:34 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Boqun Feng

On Thu, 02 Nov 2017, Waiman Long wrote:

>> Instead of the current O(N) implementation; at the cost
>> of adding an atomic counter. We also need to add a heads
>> pointer to the node structure such that we can unaccount
>> a thread doing list_del().
>>
>
>The counter will then become the single contention point for all
>concurrent updates to the dlock-list. So it will have a big impact on
>performance. On the other hand, instead of being a counter of # of
>items, we can make that a counter of # of non-empty lists. So its value
>will only be changed when a list go from empty to non-empty and vice
>versa. That will greatly reduce the number of updates to that counter.
>
>
>> Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
>> ---
>> include/linux/dlock-list.h |  2 ++
>> lib/dlock-list.c           | 40 ++++++++++++++++++++++++++++------------
>> 2 files changed, 30 insertions(+), 12 deletions(-)
>>
>> diff --git a/include/linux/dlock-list.h b/include/linux/dlock-list.h
>> index c00c7f92ada4..dd73d5787885 100644
>> --- a/include/linux/dlock-list.h
>> +++ b/include/linux/dlock-list.h
>> @@ -36,6 +36,7 @@ struct dlock_list_head {
>>
>> struct dlock_list_heads {
>>     struct dlock_list_head *heads;
>> +    atomic_t waiters;
>> };
>>
>> /*
>> @@ -44,6 +45,7 @@ struct dlock_list_heads {
>> struct dlock_list_node {
>>     struct list_head list;
>>     struct dlock_list_head *head;
>> +    struct dlock_list_heads *heads;
>> };
>>
>
>I don't want to add a new data item into dlock_list_node as there can be
>thousands or even of them in the system. Instead, I prefer increasing the
>size of dlock_list_head which only have a limited number of them and
>they have unused space because they are cacheline aligned.

Both are good points. Thanks.

----8<--------------------------------------------------------
Subject: [PATCH] [PATCH v2] lib/dlock-list: Scale dlock_lists_empty()

Instead of the current O(N) implementation, at the cost
of adding an atomic counter, we can convert the call to
an atomic_read(). The counter only serves for accounting
empty to non-empty transitions, and vice versa; therefore
only modified twice for each of the lists, during the
lifetime of the dlock -- thus 2*nr_dlock_lists.

In addition, to be able to unaccount a list_del(), we
add a dlist pointer to each head, thus minimizing the
overall memory footprint.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
 include/linux/dlock-list.h |  2 ++
 lib/dlock-list.c           | 56 +++++++++++++++++++++++++++++++++++-----------
 2 files changed, 45 insertions(+), 13 deletions(-)

diff --git a/include/linux/dlock-list.h b/include/linux/dlock-list.h
index c00c7f92ada4..d176a2d00cd1 100644
--- a/include/linux/dlock-list.h
+++ b/include/linux/dlock-list.h
@@ -32,10 +32,12 @@
 struct dlock_list_head {
 	struct list_head list;
 	spinlock_t lock;
+	struct dlock_list_heads *dlist;
 } ____cacheline_aligned_in_smp;
 
 struct dlock_list_heads {
 	struct dlock_list_head *heads;
+	atomic_t waiters;
 };
 
 /*
diff --git a/lib/dlock-list.c b/lib/dlock-list.c
index a4ddecc01b12..a84f42e800d5 100644
--- a/lib/dlock-list.c
+++ b/lib/dlock-list.c
@@ -122,8 +122,11 @@ int __alloc_dlock_list_heads(struct dlock_list_heads *dlist,
 
 		INIT_LIST_HEAD(&head->list);
 		head->lock = __SPIN_LOCK_UNLOCKED(&head->lock);
+		head->dlist = dlist;
 		lockdep_set_class(&head->lock, key);
 	}
+
+	atomic_set(&dlist->waiters, 0);
 	return 0;
 }
 EXPORT_SYMBOL(__alloc_dlock_list_heads);
@@ -138,30 +141,36 @@ EXPORT_SYMBOL(__alloc_dlock_list_heads);
 void free_dlock_list_heads(struct dlock_list_heads *dlist)
 {
 	kfree(dlist->heads);
-	dlist->heads = NULL;
+	atomic_set(&dlist->waiters, 0);
 }
 EXPORT_SYMBOL(free_dlock_list_heads);
 
 /**
  * dlock_lists_empty - Check if all the dlock lists are empty
  * @dlist: Pointer to the dlock_list_heads structure
- * Return: true if list is empty, false otherwise.
  *
- * This can be a pretty expensive function call. If this function is required
- * in a performance critical path, we may have to maintain a global count
- * of the list entries in the global dlock_list_heads structure instead.
+ * Return: true if all dlock lists are empty, false otherwise.
  */
 bool dlock_lists_empty(struct dlock_list_heads *dlist)
 {
-	int idx;
-
 	/* Shouldn't be called before nr_dlock_lists is initialized */
 	WARN_ON_ONCE(!nr_dlock_lists);
 
-	for (idx = 0; idx < nr_dlock_lists; idx++)
-		if (!list_empty(&dlist->heads[idx].list))
-			return false;
-	return true;
+	/*
+	 * Serialize dlist->waiters such that a 0->1 transition is not missed
+	 * by another thread checking if any of the dlock lists are used.
+	 *
+	 * CPU0				    CPU1
+	 * dlock_list_add()                 dlock_lists_empty()
+	 *   [S] atomic_inc(waiters);
+	 *       smp_mb__after_atomic();
+	 *					  smp_mb__before_atomic();
+	 *				      [L] atomic_read(waiters)
+	 *       list_add()
+	 *
+	 */
+	smp_mb__before_atomic();
+	return !atomic_read(&dlist->waiters);
 }
 EXPORT_SYMBOL(dlock_lists_empty);
 
@@ -179,6 +188,16 @@ void dlock_lists_add(struct dlock_list_node *node,
 	struct dlock_list_head *head = &dlist->heads[this_cpu_read(cpu2idx)];
 
 	/*
+	 * Bump the waiters counter _before_ taking the head->lock
+	 * such that we don't miss a thread adding itself to a list
+	 * while spinning for the lock.
+	 */
+	if (list_empty_careful(&head->list)) {
+		atomic_inc(&dlist->waiters);
+		smp_mb__after_atomic();
+	}
+
+	/*
 	 * There is no need to disable preemption
 	 */
 	spin_lock(&head->lock);
@@ -199,8 +218,7 @@ EXPORT_SYMBOL(dlock_lists_add);
  * a bug.
  */
 void dlock_lists_del(struct dlock_list_node *node)
-{
-	struct dlock_list_head *head;
+{	struct dlock_list_head *head;
 	bool retry;
 
 	do {
@@ -212,6 +230,18 @@ void dlock_lists_del(struct dlock_list_node *node)
 		spin_lock(&head->lock);
 		if (likely(head == node->head)) {
 			list_del_init(&node->list);
+			/*
+			 * We still hold the head->lock, a normal list_empty()
+			 * check will do.
+			 */
+			if (list_empty(&head->list)) {
+				struct dlock_list_heads *dlist;
+				dlist = node->head->dlist;
+
+				atomic_dec(&dlist->waiters);
+				smp_mb__after_atomic();
+			}
+
 			node->head = NULL;
 			retry = false;
 		} else {
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v3] lib/dlock-list: Scale dlock_lists_empty()
  2017-11-03 13:34       ` Davidlohr Bueso
@ 2017-11-03 14:22         ` Davidlohr Bueso
  2017-11-03 16:33           ` Waiman Long
  0 siblings, 1 reply; 32+ messages in thread
From: Davidlohr Bueso @ 2017-11-03 14:22 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Boqun Feng

Instead of the current O(N) implementation, at the cost
of adding an atomic counter, we can convert the call to
an atomic_read(). The counter only serves for accounting
empty to non-empty transitions, and vice versa; therefore
only modified twice for each of the lists during the
lifetime of the dlock (while used).

In addition, to be able to unaccount a list_del(), we
add a dlist pointer to each head, thus minimizing the
overall memory footprint.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---

Changes from v2: Removed some bogus lines and an overoptimistic
                changelog.

 include/linux/dlock-list.h |  2 ++
 lib/dlock-list.c           | 52 +++++++++++++++++++++++++++++++++++++---------
 2 files changed, 44 insertions(+), 10 deletions(-)

diff --git a/include/linux/dlock-list.h b/include/linux/dlock-list.h
index c00c7f92ada4..d176a2d00cd1 100644
--- a/include/linux/dlock-list.h
+++ b/include/linux/dlock-list.h
@@ -32,10 +32,12 @@
 struct dlock_list_head {
 	struct list_head list;
 	spinlock_t lock;
+	struct dlock_list_heads *dlist;
 } ____cacheline_aligned_in_smp;
 
 struct dlock_list_heads {
 	struct dlock_list_head *heads;
+	atomic_t waiters;
 };
 
 /*
diff --git a/lib/dlock-list.c b/lib/dlock-list.c
index a4ddecc01b12..5814e42c5b81 100644
--- a/lib/dlock-list.c
+++ b/lib/dlock-list.c
@@ -122,8 +122,11 @@ int __alloc_dlock_list_heads(struct dlock_list_heads *dlist,
 
 		INIT_LIST_HEAD(&head->list);
 		head->lock = __SPIN_LOCK_UNLOCKED(&head->lock);
+		head->dlist = dlist;
 		lockdep_set_class(&head->lock, key);
 	}
+
+	atomic_set(&dlist->waiters, 0);
 	return 0;
 }
 EXPORT_SYMBOL(__alloc_dlock_list_heads);
@@ -139,29 +142,36 @@ void free_dlock_list_heads(struct dlock_list_heads *dlist)
 {
 	kfree(dlist->heads);
 	dlist->heads = NULL;
+	atomic_set(&dlist->waiters, 0);
 }
 EXPORT_SYMBOL(free_dlock_list_heads);
 
 /**
  * dlock_lists_empty - Check if all the dlock lists are empty
  * @dlist: Pointer to the dlock_list_heads structure
- * Return: true if list is empty, false otherwise.
  *
- * This can be a pretty expensive function call. If this function is required
- * in a performance critical path, we may have to maintain a global count
- * of the list entries in the global dlock_list_heads structure instead.
+ * Return: true if all dlock lists are empty, false otherwise.
  */
 bool dlock_lists_empty(struct dlock_list_heads *dlist)
 {
-	int idx;
-
 	/* Shouldn't be called before nr_dlock_lists is initialized */
 	WARN_ON_ONCE(!nr_dlock_lists);
 
-	for (idx = 0; idx < nr_dlock_lists; idx++)
-		if (!list_empty(&dlist->heads[idx].list))
-			return false;
-	return true;
+	/*
+	 * Serialize dlist->waiters such that a 0->1 transition is not missed
+	 * by another thread checking if any of the dlock lists are used.
+	 *
+	 * CPU0				    CPU1
+	 * dlock_list_add()                 dlock_lists_empty()
+	 *   [S] atomic_inc(waiters);
+	 *       smp_mb__after_atomic();
+	 *					  smp_mb__before_atomic();
+	 *				      [L] atomic_read(waiters)
+	 *       list_add()
+	 *
+	 */
+	smp_mb__before_atomic();
+	return !atomic_read(&dlist->waiters);
 }
 EXPORT_SYMBOL(dlock_lists_empty);
 
@@ -179,6 +189,16 @@ void dlock_lists_add(struct dlock_list_node *node,
 	struct dlock_list_head *head = &dlist->heads[this_cpu_read(cpu2idx)];
 
 	/*
+	 * Bump the waiters counter _before_ taking the head->lock
+	 * such that we don't miss a thread adding itself to a list
+	 * while spinning for the lock.
+	 */
+	if (list_empty_careful(&head->list)) {
+		atomic_inc(&dlist->waiters);
+		smp_mb__after_atomic();
+	}
+
+	/*
 	 * There is no need to disable preemption
 	 */
 	spin_lock(&head->lock);
@@ -212,6 +232,18 @@ void dlock_lists_del(struct dlock_list_node *node)
 		spin_lock(&head->lock);
 		if (likely(head == node->head)) {
 			list_del_init(&node->list);
+			/*
+			 * We still hold the head->lock, a normal list_empty()
+			 * check will do.
+			 */
+			if (list_empty(&head->list)) {
+				struct dlock_list_heads *dlist;
+				dlist = node->head->dlist;
+
+				atomic_dec(&dlist->waiters);
+				smp_mb__after_atomic();
+			}
+
 			node->head = NULL;
 			retry = false;
 		} else {
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] lib/dlock-list: Scale dlock_lists_empty()
  2017-11-03 14:22         ` [PATCH v3] lib/dlock-list: Scale dlock_lists_empty() Davidlohr Bueso
@ 2017-11-03 16:33           ` Waiman Long
  2017-11-06 18:47             ` [PATCH v4] " Davidlohr Bueso
  0 siblings, 1 reply; 32+ messages in thread
From: Waiman Long @ 2017-11-03 16:33 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Boqun Feng

On 11/03/2017 10:22 AM, Davidlohr Bueso wrote:
> Instead of the current O(N) implementation, at the cost
> of adding an atomic counter, we can convert the call to
> an atomic_read(). The counter only serves for accounting
> empty to non-empty transitions, and vice versa; therefore
> only modified twice for each of the lists during the
> lifetime of the dlock (while used).
>
> In addition, to be able to unaccount a list_del(), we
> add a dlist pointer to each head, thus minimizing the
> overall memory footprint.
>
> Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
> ---
>
> Changes from v2: Removed some bogus lines and an overoptimistic
>                changelog.
>
> include/linux/dlock-list.h |  2 ++
> lib/dlock-list.c           | 52 +++++++++++++++++++++++++++++++++++++---------
> 2 files changed, 44 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/dlock-list.h b/include/linux/dlock-list.h
> index c00c7f92ada4..d176a2d00cd1 100644
> --- a/include/linux/dlock-list.h
> +++ b/include/linux/dlock-list.h
> @@ -32,10 +32,12 @@
> struct dlock_list_head {
>     struct list_head list;
>     spinlock_t lock;
> +    struct dlock_list_heads *dlist;
> } ____cacheline_aligned_in_smp;
>
> struct dlock_list_heads {
>     struct dlock_list_head *heads;
> +    atomic_t waiters;

We are tracking the number of non-empty lists here. So I think we need a
better name and maybe some documentation of what it is.

> };
>
> /*
> diff --git a/lib/dlock-list.c b/lib/dlock-list.c
> index a4ddecc01b12..5814e42c5b81 100644
> --- a/lib/dlock-list.c
> +++ b/lib/dlock-list.c
> @@ -122,8 +122,11 @@ int __alloc_dlock_list_heads(struct dlock_list_heads *dlist,
>
>         INIT_LIST_HEAD(&head->list);
>         head->lock = __SPIN_LOCK_UNLOCKED(&head->lock);
> +        head->dlist = dlist;
>         lockdep_set_class(&head->lock, key);
>     }
> +
> +    atomic_set(&dlist->waiters, 0);
>     return 0;
> }
> EXPORT_SYMBOL(__alloc_dlock_list_heads);
> @@ -139,29 +142,36 @@ void free_dlock_list_heads(struct dlock_list_heads *dlist)
> {
>     kfree(dlist->heads);
>     dlist->heads = NULL;
> +    atomic_set(&dlist->waiters, 0);
> }
> EXPORT_SYMBOL(free_dlock_list_heads);
>
> /**
>  * dlock_lists_empty - Check if all the dlock lists are empty
>  * @dlist: Pointer to the dlock_list_heads structure
> - * Return: true if list is empty, false otherwise.
>  *
> - * This can be a pretty expensive function call. If this function is required
> - * in a performance critical path, we may have to maintain a global count
> - * of the list entries in the global dlock_list_heads structure instead.
> + * Return: true if all dlock lists are empty, false otherwise.
>  */
> bool dlock_lists_empty(struct dlock_list_heads *dlist)
> {
> -    int idx;
> -
>     /* Shouldn't be called before nr_dlock_lists is initialized */
>     WARN_ON_ONCE(!nr_dlock_lists);
>
> -    for (idx = 0; idx < nr_dlock_lists; idx++)
> -        if (!list_empty(&dlist->heads[idx].list))
> -            return false;
> -    return true;
> +    /*
> +     * Serialize dlist->waiters such that a 0->1 transition is not missed
> +     * by another thread checking if any of the dlock lists are used.
> +     *
> +     * CPU0                    CPU1
> +     * dlock_list_add()                 dlock_lists_empty()
> +     *   [S] atomic_inc(waiters);
> +     *       smp_mb__after_atomic();
> +     *                      smp_mb__before_atomic();
> +     *                      [L] atomic_read(waiters)
> +     *       list_add()
> +     *
> +     */
> +    smp_mb__before_atomic();
> +    return !atomic_read(&dlist->waiters);
> }
> EXPORT_SYMBOL(dlock_lists_empty);
>
> @@ -179,6 +189,16 @@ void dlock_lists_add(struct dlock_list_node *node,
>     struct dlock_list_head *head = &dlist->heads[this_cpu_read(cpu2idx)];
>
>     /*
> +     * Bump the waiters counter _before_ taking the head->lock
> +     * such that we don't miss a thread adding itself to a list
> +     * while spinning for the lock.
> +     */
> +    if (list_empty_careful(&head->list)) {
> +        atomic_inc(&dlist->waiters);
> +        smp_mb__after_atomic();
> +    }

That is racy. You will need to remember that you have opportunistically
incremented the count. After acquiring the spinlock, the code will have
to decrement it if the list is no longer empty. You will also have to
increment it after taking the lock if the list is empty now but was not previously.

>
> +
> +    /*
>      * There is no need to disable preemption
>      */
>     spin_lock(&head->lock);
> @@ -212,6 +232,18 @@ void dlock_lists_del(struct dlock_list_node *node)
>         spin_lock(&head->lock);
>         if (likely(head == node->head)) {
>             list_del_init(&node->list);
> +            /*
> +             * We still hold the head->lock, a normal list_empty()
> +             * check will do.
> +             */
> +            if (list_empty(&head->list)) {
> +                struct dlock_list_heads *dlist;
> +                dlist = node->head->dlist;
> +
> +                atomic_dec(&dlist->waiters);
> +                smp_mb__after_atomic();

Do you need that when the code will do a spin_unlock soon?

Cheers,
Longman

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v4] lib/dlock-list: Scale dlock_lists_empty()
  2017-11-03 16:33           ` Waiman Long
@ 2017-11-06 18:47             ` Davidlohr Bueso
  2017-11-06 19:06               ` Waiman Long
  2017-11-07 11:59               ` Jan Kara
  0 siblings, 2 replies; 32+ messages in thread
From: Davidlohr Bueso @ 2017-11-06 18:47 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Boqun Feng

Instead of the current O(N) implementation, at the cost
of adding an atomic counter, we can convert the call to
an atomic_read(). The counter only serves for accounting
empty to non-empty transitions, and vice versa; therefore
only modified twice for each of the lists during the
lifetime of the dlock (while used).

In addition, to be able to unaccount a list_del(), we
add a dlist pointer to each head, thus minimizing the
overall memory footprint.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
Changes from v3:
 - s/waiters/used_lists, more doc around the counter.
 - fixed racy scenario when the list empty/non-empty
   condition changes after taking the lock.
 - sprinkled unlikely() around all checks, these are
   only corner cases in the lifetime of the lock.

 include/linux/dlock-list.h |  8 ++++++
 lib/dlock-list.c           | 67 +++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 65 insertions(+), 10 deletions(-)

diff --git a/include/linux/dlock-list.h b/include/linux/dlock-list.h
index c00c7f92ada4..e18690a9bba6 100644
--- a/include/linux/dlock-list.h
+++ b/include/linux/dlock-list.h
@@ -32,10 +32,18 @@
 struct dlock_list_head {
 	struct list_head list;
 	spinlock_t lock;
+	struct dlock_list_heads *dlist;
 } ____cacheline_aligned_in_smp;
 
+/*
+ * This is the main dlist data structure, with the array of heads
+ * and a counter that atomically tracks if any of the lists are
+ * being used. That is, empty to non-empty (and vice versa)
+ * head->list transitions.
+ */
 struct dlock_list_heads {
 	struct dlock_list_head *heads;
+	atomic_t used_lists;
 };
 
 /*
diff --git a/lib/dlock-list.c b/lib/dlock-list.c
index a4ddecc01b12..a9c855d492b8 100644
--- a/lib/dlock-list.c
+++ b/lib/dlock-list.c
@@ -122,8 +122,11 @@ int __alloc_dlock_list_heads(struct dlock_list_heads *dlist,
 
 		INIT_LIST_HEAD(&head->list);
 		head->lock = __SPIN_LOCK_UNLOCKED(&head->lock);
+		head->dlist = dlist;
 		lockdep_set_class(&head->lock, key);
 	}
+
+	atomic_set(&dlist->used_lists, 0);
 	return 0;
 }
 EXPORT_SYMBOL(__alloc_dlock_list_heads);
@@ -139,29 +142,36 @@ void free_dlock_list_heads(struct dlock_list_heads *dlist)
 {
 	kfree(dlist->heads);
 	dlist->heads = NULL;
+	atomic_set(&dlist->used_lists, 0);
 }
 EXPORT_SYMBOL(free_dlock_list_heads);
 
 /**
  * dlock_lists_empty - Check if all the dlock lists are empty
  * @dlist: Pointer to the dlock_list_heads structure
- * Return: true if list is empty, false otherwise.
  *
- * This can be a pretty expensive function call. If this function is required
- * in a performance critical path, we may have to maintain a global count
- * of the list entries in the global dlock_list_heads structure instead.
+ * Return: true if all dlock lists are empty, false otherwise.
  */
 bool dlock_lists_empty(struct dlock_list_heads *dlist)
 {
-	int idx;
-
 	/* Shouldn't be called before nr_dlock_lists is initialized */
 	WARN_ON_ONCE(!nr_dlock_lists);
 
-	for (idx = 0; idx < nr_dlock_lists; idx++)
-		if (!list_empty(&dlist->heads[idx].list))
-			return false;
-	return true;
+	/*
+	 * Serialize dlist->used_lists such that a 0->1 transition is not
+	 * missed by another thread checking if any of the dlock lists are
+	 * used.
+	 *
+	 * CPU0				    CPU1
+	 * dlock_list_add()                 dlock_lists_empty()
+	 *   [S] atomic_inc(used_lists);
+	 *       smp_mb__after_atomic();
+	 *					  smp_mb__before_atomic();
+	 *				      [L] atomic_read(used_lists)
+	 *       list_add()
+	 */
+	smp_mb__before_atomic();
+	return !atomic_read(&dlist->used_lists);
 }
 EXPORT_SYMBOL(dlock_lists_empty);
 
@@ -177,11 +187,39 @@ void dlock_lists_add(struct dlock_list_node *node,
 		     struct dlock_list_heads *dlist)
 {
 	struct dlock_list_head *head = &dlist->heads[this_cpu_read(cpu2idx)];
+	bool list_empty_before_lock = false;
+
+	/*
+	 * Optimistically bump the used_lists counter _before_ taking
+	 * the head->lock such that we don't miss a thread adding itself
+	 * to a list while spinning for the lock.
+	 *
+	 * Then, after taking the lock, recheck if the empty to non-empty
+	 * transition changed and (un)account for ourselves, accordingly.
+	 * Note that all these scenarios are corner cases, and not the
+	 * common scenario, where the lists are actually populated most
+	 * of the time.
+	 */
+	if (unlikely(list_empty_careful(&head->list))) {
+		list_empty_before_lock = true;
+		atomic_inc(&dlist->used_lists);
+		smp_mb__after_atomic();
+	}
 
 	/*
 	 * There is no need to disable preemption
 	 */
 	spin_lock(&head->lock);
+
+	if (unlikely(!list_empty_before_lock && list_empty(&head->list))) {
+		atomic_inc(&dlist->used_lists);
+		smp_mb__after_atomic();
+	}
+	if (unlikely(list_empty_before_lock && !list_empty(&head->list))) {
+		atomic_dec(&dlist->used_lists);
+		smp_mb__after_atomic();
+	}
+
 	node->head = head;
 	list_add(&node->list, &head->list);
 	spin_unlock(&head->lock);
@@ -212,6 +250,15 @@ void dlock_lists_del(struct dlock_list_node *node)
 		spin_lock(&head->lock);
 		if (likely(head == node->head)) {
 			list_del_init(&node->list);
+
+			if (unlikely(list_empty(&head->list))) {
+				struct dlock_list_heads *dlist;
+				dlist = node->head->dlist;
+
+				atomic_dec(&dlist->used_lists);
+				smp_mb__after_atomic();
+			}
+
 			node->head = NULL;
 			retry = false;
 		} else {
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v4] lib/dlock-list: Scale dlock_lists_empty()
  2017-11-06 18:47             ` [PATCH v4] " Davidlohr Bueso
@ 2017-11-06 19:06               ` Waiman Long
  2017-11-07 11:59               ` Jan Kara
  1 sibling, 0 replies; 32+ messages in thread
From: Waiman Long @ 2017-11-06 19:06 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Boqun Feng

On 11/06/2017 01:47 PM, Davidlohr Bueso wrote:
> Instead of the current O(N) implementation, at the cost
> of adding an atomic counter, we can convert the call to
> an atomic_read(). The counter only serves for accounting
> empty to non-empty transitions, and vice versa; therefore
> only modified twice for each of the lists during the
> lifetime of the dlock (while used).
>
> In addition, to be able to unaccount a list_del(), we
> add a dlist pointer to each head, thus minimizing the
> overall memory footprint.
>
> Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
> ---
> Changes from v3:
> - s/waiters/used_lists, more doc around the counter.
> - fixed racy scenario when the list empty/non-empty
>   condition changes after taking the lock.
> - sprinkled unlikely() around all checks, these are
>   only corner cases in the lifetime of the lock.
>
> include/linux/dlock-list.h |  8 ++++++
> lib/dlock-list.c           | 67 +++++++++++++++++++++++++++++++++++++++-------
> 2 files changed, 65 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/dlock-list.h b/include/linux/dlock-list.h
> index c00c7f92ada4..e18690a9bba6 100644
> --- a/include/linux/dlock-list.h
> +++ b/include/linux/dlock-list.h
> @@ -32,10 +32,18 @@
> struct dlock_list_head {
>     struct list_head list;
>     spinlock_t lock;
> +    struct dlock_list_heads *dlist;
> } ____cacheline_aligned_in_smp;
>
> +/*
> + * This is the main dlist data structure, with the array of heads
> + * and a counter that atomically tracks if any of the lists are
> + * being used. That is, empty to non-empty (and vice versa)
> + * head->list transitions.
> + */
> struct dlock_list_heads {
>     struct dlock_list_head *heads;
> +    atomic_t used_lists;
> };
>
> /*
> diff --git a/lib/dlock-list.c b/lib/dlock-list.c
> index a4ddecc01b12..a9c855d492b8 100644
> --- a/lib/dlock-list.c
> +++ b/lib/dlock-list.c
> @@ -122,8 +122,11 @@ int __alloc_dlock_list_heads(struct dlock_list_heads *dlist,
>
>         INIT_LIST_HEAD(&head->list);
>         head->lock = __SPIN_LOCK_UNLOCKED(&head->lock);
> +        head->dlist = dlist;
>         lockdep_set_class(&head->lock, key);
>     }
> +
> +    atomic_set(&dlist->used_lists, 0);
>     return 0;
> }
> EXPORT_SYMBOL(__alloc_dlock_list_heads);
> @@ -139,29 +142,36 @@ void free_dlock_list_heads(struct dlock_list_heads *dlist)
> {
>     kfree(dlist->heads);
>     dlist->heads = NULL;
> +    atomic_set(&dlist->used_lists, 0);
> }
> EXPORT_SYMBOL(free_dlock_list_heads);
>
> /**
>  * dlock_lists_empty - Check if all the dlock lists are empty
>  * @dlist: Pointer to the dlock_list_heads structure
> - * Return: true if list is empty, false otherwise.
>  *
> - * This can be a pretty expensive function call. If this function is required
> - * in a performance critical path, we may have to maintain a global count
> - * of the list entries in the global dlock_list_heads structure instead.
> + * Return: true if all dlock lists are empty, false otherwise.
>  */
> bool dlock_lists_empty(struct dlock_list_heads *dlist)
> {
> -    int idx;
> -
>     /* Shouldn't be called before nr_dlock_lists is initialized */
>     WARN_ON_ONCE(!nr_dlock_lists);
>
> -    for (idx = 0; idx < nr_dlock_lists; idx++)
> -        if (!list_empty(&dlist->heads[idx].list))
> -            return false;
> -    return true;
> +    /*
> +     * Serialize dlist->used_lists such that a 0->1 transition is not
> +     * missed by another thread checking if any of the dlock lists are
> +     * used.
> +     *
> +     * CPU0                    CPU1
> +     * dlock_list_add()                 dlock_lists_empty()
> +     *   [S] atomic_inc(used_lists);
> +     *       smp_mb__after_atomic();
> +     *                      smp_mb__before_atomic();
> +     *                      [L] atomic_read(used_lists)
> +     *       list_add()
> +     */
> +    smp_mb__before_atomic();
> +    return !atomic_read(&dlist->used_lists);
> }
> EXPORT_SYMBOL(dlock_lists_empty);
>
> @@ -177,11 +187,39 @@ void dlock_lists_add(struct dlock_list_node *node,
>              struct dlock_list_heads *dlist)
> {
>     struct dlock_list_head *head = &dlist->heads[this_cpu_read(cpu2idx)];
> +    bool list_empty_before_lock = false;
> +
> +    /*
> +     * Optimistically bump the used_lists counter _before_ taking
> +     * the head->lock such that we don't miss a thread adding itself
> +     * to a list while spinning for the lock.
> +     *
> +     * Then, after taking the lock, recheck if the empty to non-empty
> +     * transition changed and (un)account for ourselves, accordingly.
> +     * Note that all these scenarios are corner cases, and not the
> +     * common scenario, where the lists are actually populated most
> +     * of the time.
> +     */
> +    if (unlikely(list_empty_careful(&head->list))) {
> +        list_empty_before_lock = true;
> +        atomic_inc(&dlist->used_lists);
> +        smp_mb__after_atomic();
> +    }
>
>     /*
>      * There is no need to disable preemption
>      */
>     spin_lock(&head->lock);
> +
> +    if (unlikely(!list_empty_before_lock && list_empty(&head->list))) {
> +        atomic_inc(&dlist->used_lists);
> +        smp_mb__after_atomic();
> +    }
> +    if (unlikely(list_empty_before_lock && !list_empty(&head->list))) {
> +        atomic_dec(&dlist->used_lists);
> +        smp_mb__after_atomic();
> +    }
> +
>     node->head = head;
>     list_add(&node->list, &head->list);
>     spin_unlock(&head->lock);
> @@ -212,6 +250,15 @@ void dlock_lists_del(struct dlock_list_node *node)
>         spin_lock(&head->lock);
>         if (likely(head == node->head)) {
>             list_del_init(&node->list);
> +
> +            if (unlikely(list_empty(&head->list))) {
> +                struct dlock_list_heads *dlist;
> +                dlist = node->head->dlist;
> +
> +                atomic_dec(&dlist->used_lists);
> +                smp_mb__after_atomic();
> +            }
> +
>             node->head = NULL;
>             retry = false;
>         } else {


This patch looks good to me.

Acked-by: Waiman Long <longman@redhat.com>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4] lib/dlock-list: Scale dlock_lists_empty()
  2017-11-06 18:47             ` [PATCH v4] " Davidlohr Bueso
  2017-11-06 19:06               ` Waiman Long
@ 2017-11-07 11:59               ` Jan Kara
  2017-11-07 17:59                 ` Andreas Dilger
  1 sibling, 1 reply; 32+ messages in thread
From: Jan Kara @ 2017-11-07 11:59 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Waiman Long, Alexander Viro, Jan Kara, Jeff Layton,
	J. Bruce Fields, Tejun Heo, Christoph Lameter, linux-fsdevel,
	linux-kernel, Ingo Molnar, Peter Zijlstra, Andi Kleen,
	Dave Chinner, Boqun Feng

On Mon 06-11-17 10:47:08, Davidlohr Bueso wrote:
> Instead of the current O(N) implementation, at the cost
> of adding an atomic counter, we can convert the call to
> an atomic_read(). The counter only serves for accounting
> empty to non-empty transitions, and vice versa; therefore
> only modified twice for each of the lists during the
> lifetime of the dlock (while used).
> 
> In addition, to be able to unaccount a list_del(), we
> add a dlist pointer to each head, thus minimizing the
> overall memory footprint.
> 
> Signed-off-by: Davidlohr Bueso <dbueso@suse.de>

Looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
> Changes from v3:
> - s/waiters/used_lists, more doc around the counter.
> - fixed racy scenario when the list empty/non-empty
>   condition changes after taking the lock.
> - sprinkled unlikely() around all checks, these are
>   only corner cases in the lifetime of the lock.
> 
> include/linux/dlock-list.h |  8 ++++++
> lib/dlock-list.c           | 67 +++++++++++++++++++++++++++++++++++++++-------
> 2 files changed, 65 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/dlock-list.h b/include/linux/dlock-list.h
> index c00c7f92ada4..e18690a9bba6 100644
> --- a/include/linux/dlock-list.h
> +++ b/include/linux/dlock-list.h
> @@ -32,10 +32,18 @@
> struct dlock_list_head {
> 	struct list_head list;
> 	spinlock_t lock;
> +	struct dlock_list_heads *dlist;
> } ____cacheline_aligned_in_smp;
> 
> +/*
> + * This is the main dlist data structure, with the array of heads
> + * and a counter that atomically tracks if any of the lists are
> + * being used. That is, empty to non-empty (and vice versa)
> + * head->list transitions.
> + */
> struct dlock_list_heads {
> 	struct dlock_list_head *heads;
> +	atomic_t used_lists;
> };
> 
> /*
> diff --git a/lib/dlock-list.c b/lib/dlock-list.c
> index a4ddecc01b12..a9c855d492b8 100644
> --- a/lib/dlock-list.c
> +++ b/lib/dlock-list.c
> @@ -122,8 +122,11 @@ int __alloc_dlock_list_heads(struct dlock_list_heads *dlist,
> 
> 		INIT_LIST_HEAD(&head->list);
> 		head->lock = __SPIN_LOCK_UNLOCKED(&head->lock);
> +		head->dlist = dlist;
> 		lockdep_set_class(&head->lock, key);
> 	}
> +
> +	atomic_set(&dlist->used_lists, 0);
> 	return 0;
> }
> EXPORT_SYMBOL(__alloc_dlock_list_heads);
> @@ -139,29 +142,36 @@ void free_dlock_list_heads(struct dlock_list_heads *dlist)
> {
> 	kfree(dlist->heads);
> 	dlist->heads = NULL;
> +	atomic_set(&dlist->used_lists, 0);
> }
> EXPORT_SYMBOL(free_dlock_list_heads);
> 
> /**
>  * dlock_lists_empty - Check if all the dlock lists are empty
>  * @dlist: Pointer to the dlock_list_heads structure
> - * Return: true if list is empty, false otherwise.
>  *
> - * This can be a pretty expensive function call. If this function is required
> - * in a performance critical path, we may have to maintain a global count
> - * of the list entries in the global dlock_list_heads structure instead.
> + * Return: true if all dlock lists are empty, false otherwise.
>  */
> bool dlock_lists_empty(struct dlock_list_heads *dlist)
> {
> -	int idx;
> -
> 	/* Shouldn't be called before nr_dlock_lists is initialized */
> 	WARN_ON_ONCE(!nr_dlock_lists);
> 
> -	for (idx = 0; idx < nr_dlock_lists; idx++)
> -		if (!list_empty(&dlist->heads[idx].list))
> -			return false;
> -	return true;
> +	/*
> +	 * Serialize dlist->used_lists such that a 0->1 transition is not
> +	 * missed by another thread checking if any of the dlock lists are
> +	 * used.
> +	 *
> +	 * CPU0				    CPU1
> +	 * dlock_list_add()                 dlock_lists_empty()
> +	 *   [S] atomic_inc(used_lists);
> +	 *       smp_mb__after_atomic();
> +	 *					  smp_mb__before_atomic();
> +	 *				      [L] atomic_read(used_lists)
> +	 *       list_add()
> +	 */
> +	smp_mb__before_atomic();
> +	return !atomic_read(&dlist->used_lists);
> }
> EXPORT_SYMBOL(dlock_lists_empty);
> 
> @@ -177,11 +187,39 @@ void dlock_lists_add(struct dlock_list_node *node,
> 		     struct dlock_list_heads *dlist)
> {
> 	struct dlock_list_head *head = &dlist->heads[this_cpu_read(cpu2idx)];
> +	bool list_empty_before_lock = false;
> +
> +	/*
> +	 * Optimistically bump the used_lists counter _before_ taking
> +	 * the head->lock such that we don't miss a thread adding itself
> +	 * to a list while spinning for the lock.
> +	 *
> +	 * Then, after taking the lock, recheck if the empty to non-empty
> +	 * transition changed and (un)account for ourselves, accordingly.
> +	 * Note that all these scenarios are corner cases, and not the
> +	 * common scenario, where the lists are actually populated most
> +	 * of the time.
> +	 */
> +	if (unlikely(list_empty_careful(&head->list))) {
> +		list_empty_before_lock = true;
> +		atomic_inc(&dlist->used_lists);
> +		smp_mb__after_atomic();
> +	}
> 
> 	/*
> 	 * There is no need to disable preemption
> 	 */
> 	spin_lock(&head->lock);
> +
> +	if (unlikely(!list_empty_before_lock && list_empty(&head->list))) {
> +		atomic_inc(&dlist->used_lists);
> +		smp_mb__after_atomic();
> +	}
> +	if (unlikely(list_empty_before_lock && !list_empty(&head->list))) {
> +		atomic_dec(&dlist->used_lists);
> +		smp_mb__after_atomic();
> +	}
> +
> 	node->head = head;
> 	list_add(&node->list, &head->list);
> 	spin_unlock(&head->lock);
> @@ -212,6 +250,15 @@ void dlock_lists_del(struct dlock_list_node *node)
> 		spin_lock(&head->lock);
> 		if (likely(head == node->head)) {
> 			list_del_init(&node->list);
> +
> +			if (unlikely(list_empty(&head->list))) {
> +				struct dlock_list_heads *dlist;
> +				dlist = node->head->dlist;
> +
> +				atomic_dec(&dlist->used_lists);
> +				smp_mb__after_atomic();
> +			}
> +
> 			node->head = NULL;
> 			retry = false;
> 		} else {
> -- 
> 2.13.6
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4] lib/dlock-list: Scale dlock_lists_empty()
  2017-11-07 11:59               ` Jan Kara
@ 2017-11-07 17:59                 ` Andreas Dilger
  2017-11-07 18:57                   ` Waiman Long
  2017-11-08  2:08                   ` Boqun Feng
  0 siblings, 2 replies; 32+ messages in thread
From: Andreas Dilger @ 2017-11-07 17:59 UTC (permalink / raw)
  To: Jan Kara
  Cc: Davidlohr Bueso, Waiman Long, Alexander Viro, Jan Kara,
	Jeff Layton, J. Bruce Fields, Tejun Heo, Christoph Lameter,
	linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng


On Nov 7, 2017, at 4:59 AM, Jan Kara <jack@suse.cz> wrote:
> On Mon 06-11-17 10:47:08, Davidlohr Bueso wrote:
>> +	/*
>> +	 * Serialize dlist->used_lists such that a 0->1 transition is not
>> +	 * missed by another thread checking if any of the dlock lists are
>> +	 * used.
>> +	 *
>> +	 * CPU0				    CPU1
>> +	 * dlock_list_add()                 dlock_lists_empty()
>> +	 *   [S] atomic_inc(used_lists);
>> +	 *       smp_mb__after_atomic();
>> +	 *					  smp_mb__before_atomic();
>> +	 *				      [L] atomic_read(used_lists)
>> +	 *       list_add()
>> +	 */
>> +	smp_mb__before_atomic();
>> +	return !atomic_read(&dlist->used_lists);

Just a general kernel programming question here - I thought the whole point
of atomics is that they are, well, atomic across all CPUs so there is no
need for a memory barrier?  If there is a need for a memory barrier for
each atomic access (assuming it isn't accessed under another lock, which would
make the use of atomic types pointless, IMHO) then I'd think there is a lot
of code in the kernel that isn't doing this properly.

What am I missing here?

I don't see how this helps if the operations are executed like:

	 * CPU0				    CPU1
	 * dlock_list_add()                 dlock_lists_empty()
	 *   [S] atomic_inc(used_lists);
	 *					  smp_mb__before_atomic();
	 *       smp_mb__after_atomic();
	 *				      [L] atomic_read(used_lists)

or alternately like:

	 * CPU0				    CPU1
	 * dlock_list_add()                 dlock_lists_empty()
	 *					  smp_mb__before_atomic();
	 *   [S] atomic_inc(used_lists);
	 *       smp_mb__after_atomic();
	 *				      [L] atomic_read(used_lists)

then the same problem would exist, unless those functions/macros are somehow
bound to the atomic operations themselves?  In that case, what makes the use
of atomic_{inc,dec,read}() in other parts of the code safe without them?

Cheers, Andreas







^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4] lib/dlock-list: Scale dlock_lists_empty()
  2017-11-07 17:59                 ` Andreas Dilger
@ 2017-11-07 18:57                   ` Waiman Long
  2017-11-07 19:36                     ` James Bottomley
  2017-11-08  2:08                   ` Boqun Feng
  1 sibling, 1 reply; 32+ messages in thread
From: Waiman Long @ 2017-11-07 18:57 UTC (permalink / raw)
  To: Andreas Dilger, Jan Kara
  Cc: Davidlohr Bueso, Alexander Viro, Jan Kara, Jeff Layton,
	J. Bruce Fields, Tejun Heo, Christoph Lameter, linux-fsdevel,
	linux-kernel, Ingo Molnar, Peter Zijlstra, Andi Kleen,
	Dave Chinner, Boqun Feng

On 11/07/2017 12:59 PM, Andreas Dilger wrote:
> On Nov 7, 2017, at 4:59 AM, Jan Kara <jack@suse.cz> wrote:
>> On Mon 06-11-17 10:47:08, Davidlohr Bueso wrote:
>>> +	/*
>>> +	 * Serialize dlist->used_lists such that a 0->1 transition is not
>>> +	 * missed by another thread checking if any of the dlock lists are
>>> +	 * used.
>>> +	 *
>>> +	 * CPU0				    CPU1
>>> +	 * dlock_list_add()                 dlock_lists_empty()
>>> +	 *   [S] atomic_inc(used_lists);
>>> +	 *       smp_mb__after_atomic();
>>> +	 *					  smp_mb__before_atomic();
>>> +	 *				      [L] atomic_read(used_lists)
>>> +	 *       list_add()
>>> +	 */
>>> +	smp_mb__before_atomic();
>>> +	return !atomic_read(&dlist->used_lists);
> Just a general kernel programming question here - I thought the whole point
> of atomics is that they are, well, atomic across all CPUs so there is no
> need for a memory barrier?  If there is a need for a memory barrier for
> each atomic access (assuming it isn't accessed under another lock, which would
> make the use of atomic types pointless, IMHO) then I'd think there is a lot
> of code in the kernel that isn't doing this properly.
>
> What am I missing here?

Atomic update and memory barrier are 2 different things. Atomic update
means other CPUs see either the value before or after the update. They
won't see anything in between. For a counter, it means we won't miss any
counts. However, not all atomic operations give an ordering guarantee.
The atomic_read() and atomic_inc() are examples that do not provide
memory ordering guarantee. See Documentation/memory-barriers.txt for
more information about it.

A CPU can perform atomic operations 1 & 2 in program order, but other
CPUs may see operation 2 first before operation 1. Here memory barrier
can be used to guarantee that other CPUs see the memory updates in
certain order.
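
As a minimal illustration (a sketch only; this is not code from the
patch and the names are made up), the barriers turn the counter update
into the usual message-passing pattern:

	static atomic_t nr_items = ATOMIC_INIT(0);
	static int ready;

	/* CPU0 */
	static void producer(void)
	{
		atomic_inc(&nr_items);	/* atomic, but no ordering by itself */
		smp_mb__after_atomic();	/* order the inc before the store below */
		WRITE_ONCE(ready, 1);
	}

	/* CPU1 */
	static void consumer(void)
	{
		if (READ_ONCE(ready)) {
			smp_rmb();	/* pairs with the writer's barrier */
			pr_info("items=%d\n", atomic_read(&nr_items));
		}
	}

Without the barriers, CPU1 could see ready == 1 and still read a stale
nr_items even though the increment itself was atomic.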

Hope this helps.
Longman

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4] lib/dlock-list: Scale dlock_lists_empty()
  2017-11-07 18:57                   ` Waiman Long
@ 2017-11-07 19:36                     ` James Bottomley
  0 siblings, 0 replies; 32+ messages in thread
From: James Bottomley @ 2017-11-07 19:36 UTC (permalink / raw)
  To: Waiman Long, Andreas Dilger, Jan Kara
  Cc: Davidlohr Bueso, Alexander Viro, Jan Kara, Jeff Layton,
	J. Bruce Fields, Tejun Heo, Christoph Lameter, linux-fsdevel,
	linux-kernel, Ingo Molnar, Peter Zijlstra, Andi Kleen,
	Dave Chinner, Boqun Feng

On Tue, 2017-11-07 at 13:57 -0500, Waiman Long wrote:
> On 11/07/2017 12:59 PM, Andreas Dilger wrote:
> > 
> > On Nov 7, 2017, at 4:59 AM, Jan Kara <jack@suse.cz> wrote:
> > > 
> > > On Mon 06-11-17 10:47:08, Davidlohr Bueso wrote:
> > > > 
> > > > +	/*
> > > > +	 * Serialize dlist->used_lists such that a 0->1 transition is not
> > > > +	 * missed by another thread checking if any of the dlock lists are
> > > > +	 * used.
> > > > +	 *
> > > > +	 * CPU0				    CPU1
> > > > +	 * dlock_list_add()                 dlock_lists_empty()
> > > > +	 *   [S] atomic_inc(used_lists);
> > > > +	 *       smp_mb__after_atomic();
> > > > +	 *					  smp_mb__before_atomic();
> > > > +	 *				      [L] atomic_read(used_lists)
> > > > +	 *       list_add()
> > > > +	 */
> > > > +	smp_mb__before_atomic();
> > > > +	return !atomic_read(&dlist->used_lists);
> > Just a general kernel programming question here - I thought the
> > whole point of atomics is that they are, well, atomic across all
> > CPUs so there is no need for a memory barrier?  If there is a need
> > for a memory barrier for each atomic access (assuming it isn't
> > accessed under another lock, which would make the use of atomic
> > types pointless, IMHO) then I'd think there is a lot of code in the
> > kernel that isn't doing this properly.
> > 
> > What am I missing here?
> 
> Atomic update and memory barrier are 2 different things. Atomic
> update means other CPUs see either the value before or after the
> update. They won't see anything in between. For a counter, it means
> we won't miss any counts. However, not all atomic operations give an
> ordering guarantee. The atomic_read() and atomic_inc() are examples
> that do not provide memory ordering guarantee. See
> Documentation/memory-barriers.txt for more information about it.
> 
> A CPU can perform atomic operations 1 & 2 in program order, but other
> CPUs may see operation 2 first before operation 1. Here memory
> barrier can be used to guarantee that other CPUs see the memory
> updates in certain order.

There's an omission here which I think Andreas may have been referring
to:  atomic_inc/dec operations *are* strongly ordered with respect to
each other, so if two CPUs are simultaneously executing atomic_inc, the
order in which they execute isn't guaranteed, but it is guaranteed that
the losing atomic_inc will not begin until the winning one is
completed, so after both are done the value will have increased by 2.  So although
atomic_read and atomic_inc have no ordering guarantee at all (the point
of the barrier above), if you're looking at the return values of
atomic_inc/dec you don't need a barrier because regardless of which
order the CPUs go in, they'll see distinct values (we use this for
reference counting).
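
As a concrete example of that property, the classic reference-count
pattern (a schematic sketch, not code from this patchset) needs no
extra barriers around the counter itself:

	struct foo {
		atomic_t refcount;
	};

	static void foo_get(struct foo *f)
	{
		atomic_inc(&f->refcount);
	}

	static void foo_put(struct foo *f)
	{
		/*
		 * Concurrent puts are atomic with respect to each other, so
		 * exactly one caller sees the count reach zero.  Note that
		 * atomic_dec_and_test() returns a value and is therefore
		 * fully ordered, unlike a bare atomic_dec().
		 */
		if (atomic_dec_and_test(&f->refcount))
			kfree(f);	/* assumes f came from kmalloc() */
	}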

James

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4] lib/dlock-list: Scale dlock_lists_empty()
  2017-11-07 17:59                 ` Andreas Dilger
  2017-11-07 18:57                   ` Waiman Long
@ 2017-11-08  2:08                   ` Boqun Feng
  2017-11-09 17:24                     ` Davidlohr Bueso
  1 sibling, 1 reply; 32+ messages in thread
From: Boqun Feng @ 2017-11-08  2:08 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Jan Kara, Davidlohr Bueso, Waiman Long, Alexander Viro, Jan Kara,
	Jeff Layton, J. Bruce Fields, Tejun Heo, Christoph Lameter,
	linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner


On Tue, Nov 07, 2017 at 10:59:29AM -0700, Andreas Dilger wrote:
> On Nov 7, 2017, at 4:59 AM, Jan Kara <jack@suse.cz> wrote:
> > On Mon 06-11-17 10:47:08, Davidlohr Bueso wrote:
> >> +	/*
> >> +	 * Serialize dlist->used_lists such that a 0->1 transition is not
> >> +	 * missed by another thread checking if any of the dlock lists are
> >> +	 * used.
> >> +	 *
> >> +	 * CPU0				    CPU1
> >> +	 * dlock_list_add()                 dlock_lists_empty()
> >> +	 *   [S] atomic_inc(used_lists);
> >> +	 *       smp_mb__after_atomic();
> >> +	 *					  smp_mb__before_atomic();
> >> +	 *				      [L] atomic_read(used_lists)
> >> +	 *       list_add()
> >> +	 */
> >> +	smp_mb__before_atomic();
> >> +	return !atomic_read(&dlist->used_lists);
> 
> Just a general kernel programming question here - I thought the whole point
> of atomics is that they are, well, atomic across all CPUs so there is no
> need for a memory barrier?  If there is a need for a memory barrier for
> each atomic access (assuming it isn't accessed under another lock, which would
> make the use of atomic types pointless, IMHO) then I'd think there is a lot
> of code in the kernel that isn't doing this properly.
> 
> What am I missing here?
> 
> I don't see how this helps if the operations are executed like:
> 
> 	 * CPU0				    CPU1
> 	 * dlock_list_add()                 dlock_lists_empty()
> 	 *   [S] atomic_inc(used_lists);
> 	 *					  smp_mb__before_atomic();
> 	 *       smp_mb__after_atomic();
> 	 *				      [L] atomic_read(used_lists)
> 
> or alternately like:
> 
> 	 * CPU0				    CPU1
> 	 * dlock_list_add()                 dlock_lists_empty()
> 	 *					  smp_mb__before_atomic();
> 	 *   [S] atomic_inc(used_lists);
> 	 *       smp_mb__after_atomic();
> 	 *				      [L] atomic_read(used_lists)
> 

Or worse:

 	 * CPU0				    CPU1
 	 * dlock_list_add()                 dlock_lists_empty()
 	 *					  smp_mb__before_atomic();
 	 *				      [L] atomic_read(used_lists)
 	 *   [S] atomic_inc(used_lists);
 	 *       smp_mb__after_atomic();

, in which case dlock_lists_empty() misses an increment of used_lists.

That said, this may be fine for the usage of dlock_list. But I think the
comments are confusing or wrong. The reason is you cannot use barriers
to order two memory ops on different CPUs, and usually we add comments
like:

	[S] ...			[S] ...
	<barrier>		<barrier>
	[L] ...			[L] ...
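
In kernel C that canonical shape is the store-buffering pattern, e.g.
(a generic sketch; x, y, r0 and r1 are placeholders, not dlock-list
fields):

	static int x, y, r0, r1;

	static void cpu0(void)
	{
		WRITE_ONCE(x, 1);	/* [S] */
		smp_mb();		/* <barrier> */
		r0 = READ_ONCE(y);	/* [L] */
	}

	static void cpu1(void)
	{
		WRITE_ONCE(y, 1);	/* [S] */
		smp_mb();		/* <barrier> */
		r1 = READ_ONCE(x);	/* [L] */
	}

With a store before and a load after the barrier on *both* sides, the
outcome r0 == 0 && r1 == 0 is forbidden; drop either barrier and it
becomes possible.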

Davidlohr, could you try to improve the comments by finding both memory
ops preceding and following the barriers? Maybe you will find those
barriers are not necessary in the end.

Regards,
Boqun

> then the same problem would exist, unless those functions/macros are somehow
> bound to the atomic operations themselves?  In that case, what makes the use
> of atomic_{inc,dec,read}() in other parts of the code safe without them?
> 
> Cheers, Andreas
> 
> 
> 
> 
> 




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4] lib/dlock-list: Scale dlock_lists_empty()
  2017-11-08  2:08                   ` Boqun Feng
@ 2017-11-09 17:24                     ` Davidlohr Bueso
  2017-11-09 17:30                       ` Peter Zijlstra
  0 siblings, 1 reply; 32+ messages in thread
From: Davidlohr Bueso @ 2017-11-09 17:24 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Andreas Dilger, Jan Kara, Waiman Long, Alexander Viro, Jan Kara,
	Jeff Layton, J. Bruce Fields, Tejun Heo, Christoph Lameter,
	linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner


On Wed, 08 Nov 2017, Boqun Feng wrote:
>Or worse:
>
> 	 * CPU0				    CPU1
> 	 * dlock_list_add()                 dlock_lists_empty()
> 	 *					  smp_mb__before_atomic();
> 	 *				      [L] atomic_read(used_lists)
> 	 *   [S] atomic_inc(used_lists);
> 	 *       smp_mb__after_atomic();
>
>, in which case dlock_lists_empty() misses a increment of used_lists.
>
>That said, this may be fine for the usage of dlock_list. But I think the
>comments are confusing or wrong. The reason is you can not use barriers
>to order two memory ops on different CPUs, and usually we add comments
>like:

So I think that case is OK as CPU1 legitimately reads a 0; the problem
would be if we missed the inc because we are doing spin_lock().

>
>	[S] ...			[S] ...
>	<barrier>		<barrier>
>	[L] ...			[L] ...

That is true.

>
>Davidlohr, could you try to improve the comments by finding both memory
>ops preceding and following the barriers? Maybe you will find those
>barriers are not necessary in the end.

Ok, so for dlock_list_add(), the first barrier is there so that the atomic_inc()
is not done inside the critical region, after the head->lock is taken.
The other barriers that follow were put in so that the respective atomic op
is done before the list_add(); however, thinking about it, I don't think
they are really needed.

Regarding dlock_list_empty(), the smp_mb__before_atomic() is mainly for
pairing reasons; but at this point I don't see a respective counterpart
for the above diagram so I'm unclear.
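
If they can indeed all go, the check would collapse to just the read,
something like this stripped-down sketch (not necessarily what a v9
would carry):

	bool dlock_lists_empty(struct dlock_list_heads *dlist)
	{
		/* Shouldn't be called before nr_dlock_lists is initialized */
		WARN_ON_ONCE(!nr_dlock_lists);

		/*
		 * Rely only on the atomicity of used_lists; a concurrent
		 * add or del can still change the answer right after the
		 * read, as with any unlocked emptiness check.
		 */
		return !atomic_read(&dlist->used_lists);
	}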

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4] lib/dlock-list: Scale dlock_lists_empty()
  2017-11-09 17:24                     ` Davidlohr Bueso
@ 2017-11-09 17:30                       ` Peter Zijlstra
  0 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2017-11-09 17:30 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Boqun Feng, Andreas Dilger, Jan Kara, Waiman Long,
	Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Andi Kleen, Dave Chinner

On Thu, Nov 09, 2017 at 09:24:08AM -0800, Davidlohr Bueso wrote:
> On Wed, 08 Nov 2017, Boqun Feng wrote:
> > Or worse:
> > 
> > 	 * CPU0				    CPU1
> > 	 * dlock_list_add()                 dlock_lists_empty()
> > 	 *					  smp_mb__before_atomic();
> > 	 *				      [L] atomic_read(used_lists)

Note that this is broken; smp_mb__before_atomic() is not valid on
atomic_read().
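
That is, per Documentation/memory-barriers.txt, smp_mb__before_atomic()
and smp_mb__after_atomic() only attach to an atomic RMW operation that
does not already imply a barrier; ordering around a plain atomic_read()
needs a full smp_mb().  Roughly (a sketch, not the patch):

	static atomic_t v = ATOMIC_INIT(0);

	static int example(void)
	{
		atomic_inc(&v);
		smp_mb__after_atomic();	/* valid: attaches to the RMW op above */

		smp_mb();	/* atomic_read() is a plain load; use a full barrier */
		return atomic_read(&v);
	}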

> > 	 *   [S] atomic_inc(used_lists);
> > 	 *       smp_mb__after_atomic();
> > 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v8 0/6] vfs: Use dlock list for SB's s_inodes list
  2017-10-31 18:50 [PATCH v8 0/6] vfs: Use dlock list for SB's s_inodes list Waiman Long
                   ` (5 preceding siblings ...)
  2017-10-31 18:51 ` [PATCH v8 6/6] lib/dlock-list: Add an IRQ-safe mode to be used in interrupt handler Waiman Long
@ 2017-11-29 15:26 ` Davidlohr Bueso
  2017-11-29 15:31   ` Waiman Long
  2018-02-26  2:47 ` Dave Chinner
  7 siblings, 1 reply; 32+ messages in thread
From: Davidlohr Bueso @ 2017-11-29 15:26 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Boqun Feng

Are you planning on sending a v9 with the discussed changes? afaict:

- Drop last two patches
- Fix tearing (WRITE/READ_ONCE())
- Reduce barrier usage for dlock_lists_empty() -- which I'll be sending
  you shortly.

Thanks,
Davidlohr

On Tue, 31 Oct 2017, Waiman Long wrote:

>v7->v8:
> - Integrate the additional patches 8, 9 and 10 sent to fix issues in
>   the original v7 patchset into patch 1 and adjust the other patches
>   accordingly.
>
>v6->v7:
> - Fix outdated email address.
> - Add a comment to patch 4 to explain allocation issue & fix a
>   compilation problem with cpumask.
> - Replace patch 6 with another one that adds an irqsafe mode argument
>   in alloc_dlock_list_heads() instead of adding new APIs.
>
>v5->v6:
> - Rebased the patch to 4.14-rc3.
> - Drop the fsnotify patch as it had been merged somehow.
> - Add a new patch 5 with alternative way of selecting list by hashing
>   instead of cpu #.
> - Add a new patch 6 to provide a set of irq safe APIs to be used in
>   interrupt context.
> - Update the CPU to index mapping code.
>
>v4->v5:
> - Rebased the patch to 4.8-rc1 (changes to fs/fs-writeback.c was
>   dropped).
> - Use kcalloc() instead of percpu_alloc() to allocate the dlock list
>   heads structure as suggested by Christoph Lameter.
> - Replaced patch 5 by another one that made sibling CPUs use the same
>   dlock list head thus reducing the number of list heads that needed
>   to be maintained.
>
>v3->v4:
> - As suggested by Al, encapsulate the dlock list mechanism into
>   the dlist_for_each_entry() and dlist_for_each_entry_safe()
>   which are the equivalent of list_for_each_entry() and
>   list_for_each_entry_safe() for regular linked list. That simplifies
>   the changes in the call sites that perform dlock list iterations.
> - Add a new patch to make the percpu head structure cacheline aligned
>   to prevent cacheline contention from disrupting the performance
>   of nearby percpu variables.
>
>v2->v3:
> - Remove the 2 persubnode API patches.
> - Merge __percpu tag patch 2 into patch 1.
> - As suggested by Tejun Heo, restructure the dlock_list_head data
>   structure to hide the __percpu tag and rename some of the functions
>   and structures.
> - Move most of the code from dlock_list.h to dlock_list.c and export
>   the symbols.
>
>v1->v2:
> - Add a set of simple per-subnode APIs that is between percpu and
>   per-node in granularity.
> - Make dlock list to use the per-subnode APIs so as to reduce the
>   total number of separate linked list that needs to be managed
>   and iterated.
> - There is no change in patches 1-5.
>
>This patchset provides new APIs for a set of distributed locked lists
>(one/CPU core) to minimize lock and cacheline contention. Insertion
>and deletion to the list will be cheap and relatively contention free.
>Lookup, on the other hand, may be a bit more costly as there are
>multiple lists to iterate. This is not really a problem for the
>replacement of superblock's inode list by dlock list included in
>the patchset as lookup isn't needed.
>
>For use cases that need to do lookup, the dlock list can also be
>treated as a set of hashed lists that scales with the number of CPU
>cores in the system.
>
>Both patches 5 and 6 are added to support other use cases like epoll
>nested callbacks, for example, which could use the dlock-list to
>reduce lock contention problem.
>
>Patch 1 introduces the dlock list. The list heads are allocated
>by kcalloc() instead of percpu_alloc(). Each list head entry is
>cacheline aligned to minimize contention.
>
>Patch 2 replaces the use of list_for_each_entry_safe() in
>evict_inodes() and invalidate_inodes() by list_for_each_entry().
>
>Patch 3 modifies the superblock and inode structures to use the dlock
>list. The corresponding functions that reference those structures
>are modified.
>
>Patch 4 makes the sibling CPUs use the same dlock list head to reduce
>the number of list heads that need to be iterated.
>
>Patch 5 enables an alternative use case as a set of hashed lists.
>
>Patch 6 provides an irq safe mode specified at dlock-list allocation
>time so that it can be used within interrupt context.
>
>Jan Kara (1):
>  vfs: Remove unnecessary list_for_each_entry_safe() variants
>
>Waiman Long (5):
>  lib/dlock-list: Distributed and lock-protected lists
>  vfs: Use dlock list for superblock's inode list
>  lib/dlock-list: Make sibling CPUs share the same linked list
>  lib/dlock-list: Enable faster lookup with hashing
>  lib/dlock-list: Add an IRQ-safe mode to be used in interrupt handler
>
> fs/block_dev.c             |   9 +-
> fs/drop_caches.c           |   9 +-
> fs/inode.c                 |  38 ++----
> fs/notify/fsnotify.c       |   9 +-
> fs/quota/dquot.c           |  14 +-
> fs/super.c                 |   7 +-
> include/linux/dlock-list.h | 263 +++++++++++++++++++++++++++++++++++
> include/linux/fs.h         |   8 +-
> lib/Makefile               |   2 +-
> lib/dlock-list.c           | 333 +++++++++++++++++++++++++++++++++++++++++++++
> 10 files changed, 638 insertions(+), 54 deletions(-)
> create mode 100644 include/linux/dlock-list.h
> create mode 100644 lib/dlock-list.c
>
>-- 
>1.8.3.1
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v8 1/6] lib/dlock-list: Distributed and lock-protected lists
  2017-10-31 18:50 ` [PATCH v8 1/6] lib/dlock-list: Distributed and lock-protected lists Waiman Long
  2017-10-31 21:37   ` Davidlohr Bueso
  2017-11-02 17:04   ` Davidlohr Bueso
@ 2017-11-29 15:29   ` Davidlohr Bueso
  2 siblings, 0 replies; 32+ messages in thread
From: Davidlohr Bueso @ 2017-11-29 15:29 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Boqun Feng

On Tue, 31 Oct 2017, Waiman Long wrote:

>Linked lists are used everywhere in the Linux kernel. However, if many
>threads are trying to add or delete entries in the same linked list,
>it can create a performance bottleneck.
>
>This patch introduces new list APIs that provide a set of distributed
>lists (one per CPU), each of which is protected by its own spinlock.
>To the callers, however, the set of lists acts like a single
>consolidated list.  This allows list entries insertion and deletion
>operations to happen in parallel instead of being serialized with a
>global list and lock.
>
>List entry insertion is strictly per-CPU. List deletion, however, can
>happen on a CPU other than the one that did the insertion. So we still
>need a lock to protect the list. Because of that, there may still be
>a small amount of contention when deletion is being done.
>
>A new header file include/linux/dlock-list.h will be added with the
>associated dlock_list_head and dlock_list_node structures. The following
>functions are provided to manage the per-cpu list:
>
> 1. int alloc_dlock_list_heads(struct dlock_list_heads *dlist)
> 2. void free_dlock_list_heads(struct dlock_list_heads *dlist)
> 3. void dlock_list_add(struct dlock_list_node *node,
>		        struct dlock_list_heads *dlist)
> 4. void dlock_list_del(struct dlock_list *node)
>
>Iteration of all the list entries within a dlock list array
>is done by calling either the dlist_for_each_entry() or
>dlist_for_each_entry_safe() macros. They correspond to the
>list_for_each_entry() and list_for_each_entry_safe() macros
>respectively. The iteration states are kept in a dlock_list_iter
>structure that is passed to the iteration macros.
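
For reference, a rough usage sketch of the API described above
(hypothetical caller; node initialization, if any, and the iterator
setup are omitted since they are not quoted here, and the add/del names
follow the code, i.e. dlock_lists_add()/dlock_lists_del()):

	/* In the patchset the node is embedded in a longer-lived object. */
	static int dlock_example(struct dlock_list_node *node)
	{
		struct dlock_list_heads dlist;

		if (alloc_dlock_list_heads(&dlist))
			return -ENOMEM;

		dlock_lists_add(node, &dlist);	/* goes on this CPU's list */
		dlock_lists_del(node);		/* may happen on another CPU */

		WARN_ON_ONCE(!dlock_lists_empty(&dlist));
		free_dlock_list_heads(&dlist);
		return 0;
	}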
>
>Signed-off-by: Waiman Long <longman@redhat.com>
>Reviewed-by: Jan Kara <jack@suse.cz>

Reviewed-by: Davidlohr Bueso <dbueso@suse.de>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v8 0/6] vfs: Use dlock list for SB's s_inodes list
  2017-11-29 15:26 ` [PATCH v8 0/6] vfs: Use dlock list for SB's s_inodes list Davidlohr Bueso
@ 2017-11-29 15:31   ` Waiman Long
  0 siblings, 0 replies; 32+ messages in thread
From: Waiman Long @ 2017-11-29 15:31 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Boqun Feng

On 11/29/2017 10:26 AM, Davidlohr Bueso wrote:
> Are you planning on sending a v9 with the discussed changes? afaict:
>
> - Drop last two patches
> - Fix tearing (WRITE/READ_ONCE())
> - Reduce barrier usage for dlock_lists_empty() -- which I'll be sending
>  you shortly.
>
> Thanks,
> Davidlohr 

Yes, I am planning to do so when I have some free time as I am working
on a high-priority task at the moment.

Regards,
Longman

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v8 0/6] vfs: Use dlock list for SB's s_inodes list
  2017-10-31 18:50 [PATCH v8 0/6] vfs: Use dlock list for SB's s_inodes list Waiman Long
                   ` (6 preceding siblings ...)
  2017-11-29 15:26 ` [PATCH v8 0/6] vfs: Use dlock list for SB's s_inodes list Davidlohr Bueso
@ 2018-02-26  2:47 ` Dave Chinner
  2018-02-26  4:05   ` Waiman Long
  7 siblings, 1 reply; 32+ messages in thread
From: Dave Chinner @ 2018-02-26  2:47 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Boqun Feng, Davidlohr Bueso

Hi Waiman,

What's happened to this patchset? Any plans to repost a more recent
version?

FYI, I just ran a workload that hit 60% CPU usage on sb inode list
lock contention - a multithreaded bulkstat scan of an XFS filesystem
with millions of inodes on SSDs. last time I ran this (about 18
months ago now!) I saw rates of about 600,000 inodes/s being scanned
from userspace. The run I did earlier today made 300,000 inodes/s on
the same 16p machine and was completely CPU bound....

Cheers,

Dave.

On Tue, Oct 31, 2017 at 02:50:54PM -0400, Waiman Long wrote:
> v7->v8:
>  - Integrate the additional patches 8, 9 and 10 sent to fix issues in
>    the original v7 patchset into patch 1 and adjust the other patches
>    accordingly.
> 
> v6->v7:
>  - Fix outdated email address.
>  - Add a comment to patch 4 to explain allocation issue & fix a
>    compilation problem with cpumask.
>  - Replace patch 6 with another one that adds an irqsafe mode argument 
>    in alloc_dlock_list_heads() instead of adding new APIs.
> 
> v5->v6:
>  - Rebased the patch to 4.14-rc3.
>  - Drop the fsnotify patch as it had been merged somehow.
>  - Add a new patch 5 with alternative way of selecting list by hashing
>    instead of cpu #.
>  - Add a new patch 6 to provide a set of irq safe APIs to be used in
>    interrupt context.
>  - Update the CPU to index mapping code.
> 
> v4->v5:
>  - Rebased the patch to 4.8-rc1 (changes to fs/fs-writeback.c was
>    dropped).
>  - Use kcalloc() instead of percpu_alloc() to allocate the dlock list
>    heads structure as suggested by Christoph Lameter.
>  - Replaced patch 5 by another one that made sibling CPUs use the same
>    dlock list head thus reducing the number of list heads that needed
>    to be maintained.
> 
> v3->v4:
>  - As suggested by Al, encapsulate the dlock list mechanism into
>    the dlist_for_each_entry() and dlist_for_each_entry_safe()
>    which are the equivalent of list_for_each_entry() and
>    list_for_each_entry_safe() for regular linked list. That simplifies
>    the changes in the call sites that perform dlock list iterations.
>  - Add a new patch to make the percpu head structure cacheline aligned
>    to prevent cacheline contention from disrupting the performance
>    of nearby percpu variables.
> 
> v2->v3:
>  - Remove the 2 persubnode API patches.
>  - Merge __percpu tag patch 2 into patch 1.
>  - As suggested by Tejun Heo, restructure the dlock_list_head data
>    structure to hide the __percpu tag and rename some of the functions
>    and structures.
>  - Move most of the code from dlock_list.h to dlock_list.c and export
>    the symbols.
> 
> v1->v2:
>  - Add a set of simple per-subnode APIs that is between percpu and
>    per-node in granularity.
>  - Make dlock list to use the per-subnode APIs so as to reduce the
>    total number of separate linked list that needs to be managed
>    and iterated.
>  - There is no change in patches 1-5.
> 
> This patchset provides new APIs for a set of distributed locked lists
> (one/CPU core) to minimize lock and cacheline contention. Insertion
> and deletion to the list will be cheap and relatively contention free.
> Lookup, on the other hand, may be a bit more costly as there are
> multiple lists to iterate. This is not really a problem for the
> replacement of superblock's inode list by dlock list included in
> the patchset as lookup isn't needed.
> 
> For use cases that need to do lookup, the dlock list can also be
> treated as a set of hashed lists that scales with the number of CPU
> cores in the system.
> 
> Both patches 5 and 6 are added to support other use cases like epoll
> nested callbacks, for example, which could use the dlock-list to
> reduce lock contention problem.
> 
> Patch 1 introduces the dlock list. The list heads are allocated
> by kcalloc() instead of percpu_alloc(). Each list head entry is
> cacheline aligned to minimize contention.
> 
> Patch 2 replaces the use of list_for_each_entry_safe() in
> evict_inodes() and invalidate_inodes() by list_for_each_entry().
> 
> Patch 3 modifies the superblock and inode structures to use the dlock
> list. The corresponding functions that reference those structures
> are modified.
> 
> Patch 4 makes the sibling CPUs use the same dlock list head to reduce
> the number of list heads that need to be iterated.
> 
> Patch 5 enables an alternative use case as a set of hashed lists.
> 
> Patch 6 provides an irq safe mode specified at dlock-list allocation
> time so that it can be used within interrupt context.
> 
> Jan Kara (1):
>   vfs: Remove unnecessary list_for_each_entry_safe() variants
> 
> Waiman Long (5):
>   lib/dlock-list: Distributed and lock-protected lists
>   vfs: Use dlock list for superblock's inode list
>   lib/dlock-list: Make sibling CPUs share the same linked list
>   lib/dlock-list: Enable faster lookup with hashing
>   lib/dlock-list: Add an IRQ-safe mode to be used in interrupt handler
> 
>  fs/block_dev.c             |   9 +-
>  fs/drop_caches.c           |   9 +-
>  fs/inode.c                 |  38 ++----
>  fs/notify/fsnotify.c       |   9 +-
>  fs/quota/dquot.c           |  14 +-
>  fs/super.c                 |   7 +-
>  include/linux/dlock-list.h | 263 +++++++++++++++++++++++++++++++++++
>  include/linux/fs.h         |   8 +-
>  lib/Makefile               |   2 +-
>  lib/dlock-list.c           | 333 +++++++++++++++++++++++++++++++++++++++++++++
>  10 files changed, 638 insertions(+), 54 deletions(-)
>  create mode 100644 include/linux/dlock-list.h
>  create mode 100644 lib/dlock-list.c
> 
> -- 
> 1.8.3.1
> 
> 

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v8 0/6] vfs: Use dlock list for SB's s_inodes list
  2018-02-26  2:47 ` Dave Chinner
@ 2018-02-26  4:05   ` Waiman Long
  0 siblings, 0 replies; 32+ messages in thread
From: Waiman Long @ 2018-02-26  4:05 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Boqun Feng, Davidlohr Bueso

On 02/25/2018 09:47 PM, Dave Chinner wrote:
> Hi Waiman,
>
> What's happened to this patchset? Any plans to repost a more recent
> version?
>
> FYI, I just ran a workload that hit 60% CPU usage on sb inode list
> lock contention - a multithreaded bulkstat scan of an XFS filesystem
> with millions of inodes on SSDs. last time I ran this (about 18
> months ago now!) I saw rates of about 600,000 inodes/s being scanned
> from userspace. The run I did earlier today made 300,000 inodes/s on
> the same 16p machine and was completely CPU bound....
>
> Cheers,
>
> Dave.

I was planning to repost the patchset with the latest change last
November and then Meltdown/Spectre happened. I was drafted into
backporting fixes to RHEL6. Hopefully, I can finish up the work in early
March and work on my upstream patches again.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2018-02-26  4:05 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-31 18:50 [PATCH v8 0/6] vfs: Use dlock list for SB's s_inodes list Waiman Long
2017-10-31 18:50 ` [PATCH v8 1/6] lib/dlock-list: Distributed and lock-protected lists Waiman Long
2017-10-31 21:37   ` Davidlohr Bueso
2017-11-01 18:44     ` Waiman Long
2017-11-02 17:04   ` Davidlohr Bueso
2017-11-02 17:30     ` Waiman Long
2017-11-03 13:34       ` Davidlohr Bueso
2017-11-03 14:22         ` [PATCH v3] lib/dlock-list: Scale dlock_lists_empty() Davidlohr Bueso
2017-11-03 16:33           ` Waiman Long
2017-11-06 18:47             ` [PATCH v4] " Davidlohr Bueso
2017-11-06 19:06               ` Waiman Long
2017-11-07 11:59               ` Jan Kara
2017-11-07 17:59                 ` Andreas Dilger
2017-11-07 18:57                   ` Waiman Long
2017-11-07 19:36                     ` James Bottomley
2017-11-08  2:08                   ` Boqun Feng
2017-11-09 17:24                     ` Davidlohr Bueso
2017-11-09 17:30                       ` Peter Zijlstra
2017-11-29 15:29   ` [PATCH v8 1/6] lib/dlock-list: Distributed and lock-protected lists Davidlohr Bueso
2017-10-31 18:50 ` [PATCH v8 2/6] vfs: Remove unnecessary list_for_each_entry_safe() variants Waiman Long
2017-10-31 18:50 ` [PATCH v8 3/6] vfs: Use dlock list for superblock's inode list Waiman Long
2017-10-31 18:50 ` [PATCH v8 4/6] lib/dlock-list: Make sibling CPUs share the same linked list Waiman Long
2017-11-01  8:38   ` Jan Kara
2017-10-31 18:50 ` [PATCH v8 5/6] lib/dlock-list: Enable faster lookup with hashing Waiman Long
2017-11-01  8:40   ` Jan Kara
2017-11-01 13:16     ` Waiman Long
2017-10-31 18:51 ` [PATCH v8 6/6] lib/dlock-list: Add an IRQ-safe mode to be used in interrupt handler Waiman Long
2017-10-31 21:29   ` Davidlohr Bueso
2017-11-29 15:26 ` [PATCH v8 0/6] vfs: Use dlock list for SB's s_inodes list Davidlohr Bueso
2017-11-29 15:31   ` Waiman Long
2018-02-26  2:47 ` Dave Chinner
2018-02-26  4:05   ` Waiman Long
