linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 0/2] vfs: Use per-cpu list for SB's s_inodes list
@ 2016-02-17  1:31 Waiman Long
  2016-02-17  1:31 ` [RFC PATCH 1/2] lib/percpu-list: Per-cpu list with associated per-cpu locks Waiman Long
                   ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Waiman Long @ 2016-02-17  1:31 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Scott J Norton, Douglas Hatch,
	Waiman Long

This patch is a replacement of my previous list batching patch -
https://lwn.net/Articles/674105/. Compared with the previous patch,
this one provides better performance and fairness. However, it also
requires a bit more changes in the VFS layer.

This patchset is a derivative of Andi Kleen's patch on "Initial per
cpu list for the per sb inode list"

https://git.kernel.org/cgit/linux/kernel/git/ak/linux-misc.git/commit/?h=hle315/combined&id=f1cf9e715a40f44086662ae3b29f123cf059cbf4

Patch 1 introduces the per-cpu list.

Patch 2 modifies the superblock and inode structures to use the per-cpu
list. The corresponding functions that reference those structures are
modified.

Waiman Long (2):
  lib/percpu-list: Per-cpu list with associated per-cpu locks
  vfs: Use per-cpu list for superblock's inode list

 fs/block_dev.c              |   16 +++---
 fs/drop_caches.c            |   11 ++--
 fs/fs-writeback.c           |   16 +++---
 fs/inode.c                  |   43 ++++++----------
 fs/notify/inode_mark.c      |   22 ++++----
 fs/quota/dquot.c            |   23 ++++-----
 fs/super.c                  |    7 ++-
 include/linux/fs.h          |    8 ++--
 include/linux/percpu-list.h |  117 +++++++++++++++++++++++++++++++++++++++++++
 lib/Makefile                |    2 +-
 lib/percpu-list.c           |   80 +++++++++++++++++++++++++++++
 11 files changed, 263 insertions(+), 82 deletions(-)
 create mode 100644 include/linux/percpu-list.h
 create mode 100644 lib/percpu-list.c

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [RFC PATCH 1/2] lib/percpu-list: Per-cpu list with associated per-cpu locks
  2016-02-17  1:31 [RFC PATCH 0/2] vfs: Use per-cpu list for SB's s_inodes list Waiman Long
@ 2016-02-17  1:31 ` Waiman Long
  2016-02-17  9:53   ` Dave Chinner
  2016-02-17 15:13   ` Christoph Lameter
  2016-02-17  1:31 ` [RRC PATCH 2/2] vfs: Use per-cpu list for superblock's inode list Waiman Long
  2016-02-18 23:58 ` [RFC PATCH 0/2] vfs: Use per-cpu list for SB's s_inodes list Dave Chinner
  2 siblings, 2 replies; 27+ messages in thread
From: Waiman Long @ 2016-02-17  1:31 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Scott J Norton, Douglas Hatch,
	Waiman Long

Linked list is used everywhere in the Linux kernel. However, if many
threads are trying to add or delete entries into the same linked list,
it can create a performance bottleneck.

This patch introduces a new per-cpu list subystem with associated
per-cpu locks for protecting each of the lists individually. This
allows list entries insertion and deletion operations to happen in
parallel instead of being serialized with a global list and lock.

List entry insertion is strictly per cpu. List deletion, however, can
happen in a cpu other than the one that did the insertion. So we still
need lock to protect the list. Because of that, there may still be
a small amount of contention when deletion is being done.

A new header file include/linux/percpu-list.h will be added with the
associated percpu_list structure. The following functions are used
to manage the per-cpu list:

 1. int init_percpu_list_head(struct percpu_list **pclist_handle)
 2. void percpu_list_add(struct percpu_list *new,
			 struct percpu_list *head)
 3. void percpu_list_del(struct percpu_list *entry)

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 include/linux/percpu-list.h |  117 +++++++++++++++++++++++++++++++++++++++++++
 lib/Makefile                |    2 +-
 lib/percpu-list.c           |   80 +++++++++++++++++++++++++++++
 3 files changed, 198 insertions(+), 1 deletions(-)
 create mode 100644 include/linux/percpu-list.h
 create mode 100644 lib/percpu-list.c

diff --git a/include/linux/percpu-list.h b/include/linux/percpu-list.h
new file mode 100644
index 0000000..94be520
--- /dev/null
+++ b/include/linux/percpu-list.h
@@ -0,0 +1,117 @@
+/*
+ * Per-cpu list
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2016 Hewlett-Packard Enterprise Development LP
+ *
+ * Authors: Waiman Long <waiman.long@hpe.com>
+ */
+#ifndef __LINUX_PERCPU_LIST_H
+#define __LINUX_PERCPU_LIST_H
+
+#include <linux/spinlock.h>
+#include <linux/list.h>
+#include <linux/percpu.h>
+
+/*
+ * include/linux/percpu-list.h
+ *
+ * A per-cpu list protected by a per-cpu spinlock.
+ *
+ * The list head percpu_list structure contains the spinlock, the other
+ * entries in the list contain the spinlock pointer.
+ */
+struct percpu_list {
+	struct list_head list;
+	union {
+		spinlock_t lock;	/* For list head */
+		spinlock_t *lockptr;	/* For other entries */
+	};
+};
+
+/*
+ * A simplified for_all_percpu_list_entries macro without the next and pchead
+ * parameters.
+ */
+#define for_all_percpu_list_entries_simple(pos, pclock, head, member)	\
+	for_all_percpu_list_entries(pos, next, pchead, pclock, head, member)
+
+#define PERCPU_LIST_HEAD_INIT(name)				\
+	{							\
+		.list.prev = &name.list,			\
+		.list.next = &name.list,			\
+		.list.lock = __SPIN_LOCK_UNLOCKED(name),	\
+	}
+
+#define PERCPU_LIST_ENTRY_INIT(name)		\
+	{					\
+		.list.prev = &name.list,	\
+		.list.next = &name.list,	\
+		.list.lockptr = NULL		\
+	}
+
+static inline void INIT_PERCPU_LIST_HEAD(struct percpu_list *pcpu_list)
+{
+	INIT_LIST_HEAD(&pcpu_list->list);
+	pcpu_list->lock = __SPIN_LOCK_UNLOCKED(&pcpu_list->lock);
+}
+
+static inline void INIT_PERCPU_LIST_ENTRY(struct percpu_list *pcpu_list)
+{
+	INIT_LIST_HEAD(&pcpu_list->list);
+	pcpu_list->lockptr = NULL;
+}
+
+#define PERCPU_LIST_HEAD(name)	struct percpu_list __percpu *name
+
+static inline void free_percpu_list_head(struct percpu_list **pclist_handle)
+{
+	free_percpu(*pclist_handle);
+	*pclist_handle = NULL;
+}
+
+static inline bool percpu_list_empty(struct percpu_list *pcpu_list)
+{
+	int cpu;
+
+	for_each_possible_cpu (cpu)
+		if (!list_empty(&per_cpu_ptr(pcpu_list, cpu)->list))
+			return false;
+	return true;
+}
+
+/**
+ * for_all_percpu_list_entries - iterate over all the per-cpu list with locking
+ * @pos:	the type * to use as a loop cursor for the current .
+ * @next:	an internal type * variable pointing to the next entry
+ * @pchead:	an internal struct list * of percpu list head
+ * @pclock:	an internal variable for the current per-cpu spinlock
+ * @head:	the head of the per-cpu list
+ * @member:	the name of the per-cpu list within the struct
+ */
+#define for_all_percpu_list_entries(pos, next, pchead, pclock, head, member)\
+	{								 \
+	int cpu;							 \
+	for_each_possible_cpu (cpu) {					 \
+		typeof(*pos) *next;					 \
+		spinlock_t *pclock = per_cpu_ptr(&(head)->lock, cpu);	 \
+		struct list_head *pchead = &per_cpu_ptr(head, cpu)->list;\
+		spin_lock(pclock);					 \
+		list_for_each_entry_safe(pos, next, pchead, member.list)
+
+#define end_all_percpu_list_entries(pclock)	spin_unlock(pclock); } }
+
+extern int init_percpu_list_head(struct percpu_list **pclist_handle);
+extern void percpu_list_add(struct percpu_list *new, struct percpu_list *head);
+extern void percpu_list_del(struct percpu_list *entry);
+
+#endif /* __LINUX_PERCPU_LIST_H */
diff --git a/lib/Makefile b/lib/Makefile
index a7c26a4..71a25d4 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -27,7 +27,7 @@ obj-y += bcd.o div64.o sort.o parser.o halfmd4.o debug_locks.o random32.o \
 	 gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \
 	 bsearch.o find_bit.o llist.o memweight.o kfifo.o \
 	 percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \
-	 once.o
+	 once.o percpu-list.o
 obj-y += string_helpers.o
 obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o
 obj-y += hexdump.o
diff --git a/lib/percpu-list.c b/lib/percpu-list.c
new file mode 100644
index 0000000..e5c04bf
--- /dev/null
+++ b/lib/percpu-list.c
@@ -0,0 +1,80 @@
+/*
+ * Per-cpu list
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2016 Hewlett-Packard Enterprise Development LP
+ *
+ * Authors: Waiman Long <waiman.long@hpe.com>
+ */
+#include <linux/percpu-list.h>
+
+/*
+ * Initialize the per-cpu list
+ */
+int init_percpu_list_head(struct percpu_list **pclist_handle)
+{
+	struct percpu_list *pclist = alloc_percpu(struct percpu_list);
+	int cpu;
+
+	if (!pclist)
+		return -ENOMEM;
+
+	for_each_possible_cpu (cpu)
+		INIT_PERCPU_LIST_HEAD(per_cpu_ptr(pclist, cpu));
+
+	*pclist_handle = pclist;
+	return 0;
+}
+
+/*
+ * List selection is based on the CPU being used when the percpu_list_add()
+ * function is called. However, deletion may be done by a different CPU.
+ * So we still need to use a lock to protect the content of the list.
+ */
+void percpu_list_add(struct percpu_list *new, struct percpu_list *head)
+{
+	spinlock_t *lock;
+
+	/*
+	 * We need to disable preemption before accessing the per-cpu data
+	 * to make sure that the cpu won't be changed because of preemption.
+	 */
+	preempt_disable();
+	lock = this_cpu_ptr(&head->lock);
+	spin_lock(lock);
+	new->lockptr = lock;
+	list_add(&new->list, this_cpu_ptr(&head->list));
+	spin_unlock(lock);
+	preempt_enable();
+}
+
+/*
+ * Delete an entry from a percpu list
+ *
+ * We need to check the lock pointer again after taking the lock to guard
+ * against concurrent delete of the same entry. If the lock pointer changes
+ * or becomes NULL, we assume that the deletion was done elsewhere.
+ */
+void percpu_list_del(struct percpu_list *entry)
+{
+	spinlock_t *lock = READ_ONCE(entry->lockptr);
+
+	if (unlikely(!lock))
+		return;
+
+	spin_lock(lock);
+	if (likely(entry->lockptr && (lock == entry->lockptr))) {
+		list_del_init(&entry->list);
+		entry->lockptr = NULL;
+	}
+	spin_unlock(lock);
+}
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RRC PATCH 2/2] vfs: Use per-cpu list for superblock's inode list
  2016-02-17  1:31 [RFC PATCH 0/2] vfs: Use per-cpu list for SB's s_inodes list Waiman Long
  2016-02-17  1:31 ` [RFC PATCH 1/2] lib/percpu-list: Per-cpu list with associated per-cpu locks Waiman Long
@ 2016-02-17  1:31 ` Waiman Long
  2016-02-17  7:16   ` Ingo Molnar
  2016-02-17 10:37   ` Dave Chinner
  2016-02-18 23:58 ` [RFC PATCH 0/2] vfs: Use per-cpu list for SB's s_inodes list Dave Chinner
  2 siblings, 2 replies; 27+ messages in thread
From: Waiman Long @ 2016-02-17  1:31 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Scott J Norton, Douglas Hatch,
	Waiman Long

When many threads are trying to add or delete inode to or from
a superblock's s_inodes list, spinlock contention on the list can
become a performance bottleneck.

This patch changes the s_inodes field to become a per-cpu list with
per-cpu spinlocks.

With an exit microbenchmark that creates a large number of threads,
attachs many inodes to them and then exits. The runtimes of that
microbenchmark with 1000 threads before and after the patch on a
4-socket Intel E7-4820 v3 system (40 cores, 80 threads) were as
follows:

  Kernel            Elapsed Time    System Time
  ------            ------------    -----------
  Vanilla 4.5-rc4      65.29s         82m14s
  Patched 4.5-rc4      22.81s         23m03s

Before the patch, spinlock contention at the inode_sb_list_add()
function at the startup phase and the inode_sb_list_del() function at
the exit phase were about 79% and 93% of total CPU time respectively
(as measured by perf). After the patch, the percpu_list_add()
function consumed only about 0.04% of CPU time at startup phase. The
percpu_list_del() function consumed about 0.4% of CPU time at exit
phase. There were still some spinlock contention, but they happened
elsewhere.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 fs/block_dev.c         |   16 +++++++---------
 fs/drop_caches.c       |   11 +++++------
 fs/fs-writeback.c      |   16 +++++++---------
 fs/inode.c             |   43 +++++++++++++++++--------------------------
 fs/notify/inode_mark.c |   22 +++++++++++-----------
 fs/quota/dquot.c       |   23 ++++++++++-------------
 fs/super.c             |    7 ++++---
 include/linux/fs.h     |    8 ++++----
 8 files changed, 65 insertions(+), 81 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 39b3a17..30b12cb 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1866,8 +1866,8 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
 {
 	struct inode *inode, *old_inode = NULL;
 
-	spin_lock(&blockdev_superblock->s_inode_list_lock);
-	list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
+	for_all_percpu_list_entries_simple(inode, percpu_lock,
+			blockdev_superblock->s_inodes_cpu, i_sb_list) {
 		struct address_space *mapping = inode->i_mapping;
 
 		spin_lock(&inode->i_lock);
@@ -1878,22 +1878,20 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&blockdev_superblock->s_inode_list_lock);
+		spin_unlock(percpu_lock);
 		/*
 		 * We hold a reference to 'inode' so it couldn't have been
 		 * removed from s_inodes list while we dropped the
-		 * s_inode_list_lock  We cannot iput the inode now as we can
+		 * percpu_lock. We cannot iput the inode now as we can
 		 * be holding the last reference and we cannot iput it under
-		 * s_inode_list_lock. So we keep the reference and iput it
-		 * later.
+		 * percpu_lock. So we keep the reference and iput it later.
 		 */
 		iput(old_inode);
 		old_inode = inode;
 
 		func(I_BDEV(inode), arg);
 
-		spin_lock(&blockdev_superblock->s_inode_list_lock);
-	}
-	spin_unlock(&blockdev_superblock->s_inode_list_lock);
+		spin_lock(percpu_lock);
+	} end_all_percpu_list_entries(percpu_lock)
 	iput(old_inode);
 }
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index d72d52b..c091d91 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -17,8 +17,8 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
 {
 	struct inode *inode, *toput_inode = NULL;
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	for_all_percpu_list_entries_simple(inode, percpu_lock,
+				    sb->s_inodes_cpu, i_sb_list) {
 		spin_lock(&inode->i_lock);
 		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
 		    (inode->i_mapping->nrpages == 0)) {
@@ -27,15 +27,14 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
+		spin_unlock(percpu_lock);
 
 		invalidate_mapping_pages(inode->i_mapping, 0, -1);
 		iput(toput_inode);
 		toput_inode = inode;
 
-		spin_lock(&sb->s_inode_list_lock);
-	}
-	spin_unlock(&sb->s_inode_list_lock);
+		spin_lock(percpu_lock);
+	} end_all_percpu_list_entries(percpu_lock)
 	iput(toput_inode);
 }
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 6915c95..3fcacb5 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -2115,7 +2115,6 @@ static void wait_sb_inodes(struct super_block *sb)
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
 	mutex_lock(&sb->s_sync_lock);
-	spin_lock(&sb->s_inode_list_lock);
 
 	/*
 	 * Data integrity sync. Must wait for all pages under writeback,
@@ -2124,7 +2123,8 @@ static void wait_sb_inodes(struct super_block *sb)
 	 * In which case, the inode may not be on the dirty list, but
 	 * we still have to wait for that writeout.
 	 */
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	for_all_percpu_list_entries_simple(inode, percpu_lock,
+					   sb->s_inodes_cpu, i_sb_list) {
 		struct address_space *mapping = inode->i_mapping;
 
 		spin_lock(&inode->i_lock);
@@ -2135,15 +2135,14 @@ static void wait_sb_inodes(struct super_block *sb)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
+		spin_unlock(percpu_lock);
 
 		/*
 		 * We hold a reference to 'inode' so it couldn't have been
 		 * removed from s_inodes list while we dropped the
-		 * s_inode_list_lock.  We cannot iput the inode now as we can
+		 * percpu_lock.  We cannot iput the inode now as we can
 		 * be holding the last reference and we cannot iput it under
-		 * s_inode_list_lock. So we keep the reference and iput it
-		 * later.
+		 * percpu_lock. So we keep the reference and iput it later.
 		 */
 		iput(old_inode);
 		old_inode = inode;
@@ -2157,9 +2156,8 @@ static void wait_sb_inodes(struct super_block *sb)
 
 		cond_resched();
 
-		spin_lock(&sb->s_inode_list_lock);
-	}
-	spin_unlock(&sb->s_inode_list_lock);
+		spin_lock(percpu_lock);
+	} end_all_percpu_list_entries(percpu_lock)
 	iput(old_inode);
 	mutex_unlock(&sb->s_sync_lock);
 }
diff --git a/fs/inode.c b/fs/inode.c
index 9f62db3..15c82e7 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -28,8 +28,8 @@
  *   inode->i_state, inode->i_hash, __iget()
  * Inode LRU list locks protect:
  *   inode->i_sb->s_inode_lru, inode->i_lru
- * inode->i_sb->s_inode_list_lock protects:
- *   inode->i_sb->s_inodes, inode->i_sb_list
+ * inode->i_sb->s_inodes_cpu->lock protects:
+ *   inode->i_sb->s_inodes_cpu, inode->i_sb_list
  * bdi->wb.list_lock protects:
  *   bdi->wb.b_{dirty,io,more_io,dirty_time}, inode->i_io_list
  * inode_hash_lock protects:
@@ -37,7 +37,7 @@
  *
  * Lock ordering:
  *
- * inode->i_sb->s_inode_list_lock
+ * inode->i_sb->s_inodes_cpu->lock
  *   inode->i_lock
  *     Inode LRU list locks
  *
@@ -45,7 +45,7 @@
  *   inode->i_lock
  *
  * inode_hash_lock
- *   inode->i_sb->s_inode_list_lock
+ *   inode->i_sb->s_inodes_cpu->lock
  *   inode->i_lock
  *
  * iunique_lock
@@ -424,19 +424,14 @@ static void inode_lru_list_del(struct inode *inode)
  */
 void inode_sb_list_add(struct inode *inode)
 {
-	spin_lock(&inode->i_sb->s_inode_list_lock);
-	list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
-	spin_unlock(&inode->i_sb->s_inode_list_lock);
+	percpu_list_add(&inode->i_sb_list, inode->i_sb->s_inodes_cpu);
 }
 EXPORT_SYMBOL_GPL(inode_sb_list_add);
 
 static inline void inode_sb_list_del(struct inode *inode)
 {
-	if (!list_empty(&inode->i_sb_list)) {
-		spin_lock(&inode->i_sb->s_inode_list_lock);
-		list_del_init(&inode->i_sb_list);
-		spin_unlock(&inode->i_sb->s_inode_list_lock);
-	}
+	if (!list_empty(&inode->i_sb_list.list))
+		percpu_list_del(&inode->i_sb_list);
 }
 
 static unsigned long hash(struct super_block *sb, unsigned long hashval)
@@ -590,12 +585,12 @@ static void dispose_list(struct list_head *head)
  */
 void evict_inodes(struct super_block *sb)
 {
-	struct inode *inode, *next;
+	struct inode *inode;
 	LIST_HEAD(dispose);
 
 again:
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
+	for_all_percpu_list_entries_simple(inode, percpu_lock,
+					   sb->s_inodes_cpu, i_sb_list) {
 		if (atomic_read(&inode->i_count))
 			continue;
 
@@ -616,13 +611,12 @@ again:
 		 * bit so we don't livelock.
 		 */
 		if (need_resched()) {
-			spin_unlock(&sb->s_inode_list_lock);
+			spin_unlock(percpu_lock);
 			cond_resched();
 			dispose_list(&dispose);
 			goto again;
 		}
-	}
-	spin_unlock(&sb->s_inode_list_lock);
+	} end_all_percpu_list_entries(percpu_lock)
 
 	dispose_list(&dispose);
 }
@@ -640,11 +634,11 @@ again:
 int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 {
 	int busy = 0;
-	struct inode *inode, *next;
+	struct inode *inode;
 	LIST_HEAD(dispose);
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
+	for_all_percpu_list_entries_simple(inode, percpu_lock,
+					   sb->s_inodes_cpu, i_sb_list) {
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
 			spin_unlock(&inode->i_lock);
@@ -665,8 +659,7 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 		inode_lru_list_del(inode);
 		spin_unlock(&inode->i_lock);
 		list_add(&inode->i_lru, &dispose);
-	}
-	spin_unlock(&sb->s_inode_list_lock);
+	} end_all_percpu_list_entries(percpu_lock)
 
 	dispose_list(&dispose);
 
@@ -881,7 +874,7 @@ struct inode *new_inode_pseudo(struct super_block *sb)
 		spin_lock(&inode->i_lock);
 		inode->i_state = 0;
 		spin_unlock(&inode->i_lock);
-		INIT_LIST_HEAD(&inode->i_sb_list);
+		INIT_PERCPU_LIST_ENTRY(&inode->i_sb_list);
 	}
 	return inode;
 }
@@ -902,8 +895,6 @@ struct inode *new_inode(struct super_block *sb)
 {
 	struct inode *inode;
 
-	spin_lock_prefetch(&sb->s_inode_list_lock);
-
 	inode = new_inode_pseudo(sb);
 	if (inode)
 		inode_sb_list_add(inode);
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index 741077d..96bcd4a 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -146,14 +146,15 @@ int fsnotify_add_inode_mark(struct fsnotify_mark *mark,
  * @sb: superblock being unmounted.
  *
  * Called during unmount with no locks held, so needs to be safe against
- * concurrent modifiers. We temporarily drop sb->s_inode_list_lock and CAN block.
+ * concurrent modifiers. We temporarily drop sb->s_inodes_cpu->lock and CAN
+ * block.
  */
 void fsnotify_unmount_inodes(struct super_block *sb)
 {
-	struct inode *inode, *next_i, *need_iput = NULL;
+	struct inode *inode, *need_iput = NULL;
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next_i, &sb->s_inodes, i_sb_list) {
+	for_all_percpu_list_entries(inode, next_i, percpu_head, percpu_lock,
+				    sb->s_inodes_cpu, i_sb_list) {
 		struct inode *need_iput_tmp;
 
 		/*
@@ -189,7 +190,7 @@ void fsnotify_unmount_inodes(struct super_block *sb)
 		spin_unlock(&inode->i_lock);
 
 		/* In case the dropping of a reference would nuke next_i. */
-		while (&next_i->i_sb_list != &sb->s_inodes) {
+		while (&next_i->i_sb_list.list != percpu_head) {
 			spin_lock(&next_i->i_lock);
 			if (!(next_i->i_state & (I_FREEING | I_WILL_FREE)) &&
 						atomic_read(&next_i->i_count)) {
@@ -199,16 +200,16 @@ void fsnotify_unmount_inodes(struct super_block *sb)
 				break;
 			}
 			spin_unlock(&next_i->i_lock);
-			next_i = list_next_entry(next_i, i_sb_list);
+			next_i = list_next_entry(next_i, i_sb_list.list);
 		}
 
 		/*
-		 * We can safely drop s_inode_list_lock here because either
+		 * We can safely drop percpu_lock  here because either
 		 * we actually hold references on both inode and next_i or
 		 * end of list.  Also no new inodes will be added since the
 		 * umount has begun.
 		 */
-		spin_unlock(&sb->s_inode_list_lock);
+		spin_unlock(percpu_lock);
 
 		if (need_iput_tmp)
 			iput(need_iput_tmp);
@@ -220,7 +221,6 @@ void fsnotify_unmount_inodes(struct super_block *sb)
 
 		iput(inode);
 
-		spin_lock(&sb->s_inode_list_lock);
-	}
-	spin_unlock(&sb->s_inode_list_lock);
+		spin_lock(percpu_lock);
+	} end_all_percpu_list_entries(percpu_lock)
 }
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 3c3b81b..0123ef4 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -928,8 +928,8 @@ static void add_dquot_ref(struct super_block *sb, int type)
 	int reserved = 0;
 #endif
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	for_all_percpu_list_entries_simple(inode, percpu_lock,
+					   sb->s_inodes_cpu, i_sb_list) {
 		spin_lock(&inode->i_lock);
 		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
 		    !atomic_read(&inode->i_writecount) ||
@@ -939,7 +939,7 @@ static void add_dquot_ref(struct super_block *sb, int type)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
+		spin_unlock(percpu_lock);
 
 #ifdef CONFIG_QUOTA_DEBUG
 		if (unlikely(inode_get_rsv_space(inode) > 0))
@@ -951,15 +951,13 @@ static void add_dquot_ref(struct super_block *sb, int type)
 		/*
 		 * We hold a reference to 'inode' so it couldn't have been
 		 * removed from s_inodes list while we dropped the
-		 * s_inode_list_lock. We cannot iput the inode now as we can be
+		 * percpu_lock. We cannot iput the inode now as we can be
 		 * holding the last reference and we cannot iput it under
-		 * s_inode_list_lock. So we keep the reference and iput it
-		 * later.
+		 * percpu_lock. So we keep the reference and iput it later.
 		 */
 		old_inode = inode;
-		spin_lock(&sb->s_inode_list_lock);
-	}
-	spin_unlock(&sb->s_inode_list_lock);
+		spin_lock(percpu_lock);
+	} end_all_percpu_list_entries(percpu_lock)
 	iput(old_inode);
 
 #ifdef CONFIG_QUOTA_DEBUG
@@ -1028,8 +1026,8 @@ static void remove_dquot_ref(struct super_block *sb, int type,
 	struct inode *inode;
 	int reserved = 0;
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	for_all_percpu_list_entries_simple(inode, lock, sb->s_inodes_cpu,
+					   i_sb_list) {
 		/*
 		 *  We have to scan also I_NEW inodes because they can already
 		 *  have quota pointer initialized. Luckily, we need to touch
@@ -1043,8 +1041,7 @@ static void remove_dquot_ref(struct super_block *sb, int type,
 			remove_inode_dquot_ref(inode, type, tofree_head);
 		}
 		spin_unlock(&dq_data_lock);
-	}
-	spin_unlock(&sb->s_inode_list_lock);
+	} end_all_percpu_list_entries(lock)
 #ifdef CONFIG_QUOTA_DEBUG
 	if (reserved) {
 		printk(KERN_WARNING "VFS (%s): Writes happened after quota"
diff --git a/fs/super.c b/fs/super.c
index 1182af8..1db4f37 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -163,6 +163,7 @@ static void destroy_super(struct super_block *s)
 {
 	list_lru_destroy(&s->s_dentry_lru);
 	list_lru_destroy(&s->s_inode_lru);
+	free_percpu_list_head(&s->s_inodes_cpu);
 	security_sb_free(s);
 	WARN_ON(!list_empty(&s->s_mounts));
 	kfree(s->s_subtype);
@@ -204,9 +205,9 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 	INIT_HLIST_NODE(&s->s_instances);
 	INIT_HLIST_BL_HEAD(&s->s_anon);
 	mutex_init(&s->s_sync_lock);
-	INIT_LIST_HEAD(&s->s_inodes);
-	spin_lock_init(&s->s_inode_list_lock);
 
+	if (init_percpu_list_head(&s->s_inodes_cpu))
+		goto fail;
 	if (list_lru_init_memcg(&s->s_dentry_lru))
 		goto fail;
 	if (list_lru_init_memcg(&s->s_inode_lru))
@@ -426,7 +427,7 @@ void generic_shutdown_super(struct super_block *sb)
 		if (sop->put_super)
 			sop->put_super(sb);
 
-		if (!list_empty(&sb->s_inodes)) {
+		if (!percpu_list_empty(sb->s_inodes_cpu)) {
 			printk("VFS: Busy inodes after unmount of %s. "
 			   "Self-destruct in 5 seconds.  Have a nice day...\n",
 			   sb->s_id);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ae68100..efedc59 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -27,6 +27,7 @@
 #include <linux/migrate_mode.h>
 #include <linux/uidgid.h>
 #include <linux/lockdep.h>
+#include <linux/percpu-list.h>
 #include <linux/percpu-rwsem.h>
 #include <linux/blk_types.h>
 #include <linux/workqueue.h>
@@ -648,7 +649,7 @@ struct inode {
 	u16			i_wb_frn_history;
 #endif
 	struct list_head	i_lru;		/* inode LRU list */
-	struct list_head	i_sb_list;
+	struct percpu_list	i_sb_list;
 	union {
 		struct hlist_head	i_dentry;
 		struct rcu_head		i_rcu;
@@ -1397,9 +1398,8 @@ struct super_block {
 	 */
 	int s_stack_depth;
 
-	/* s_inode_list_lock protects s_inodes */
-	spinlock_t		s_inode_list_lock ____cacheline_aligned_in_smp;
-	struct list_head	s_inodes;	/* all inodes */
+	/* The percpu locks protect s_inodes_cpu */
+	PERCPU_LIST_HEAD(s_inodes_cpu);	/* all inodes */
 };
 
 extern struct timespec current_fs_time(struct super_block *sb);
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [RRC PATCH 2/2] vfs: Use per-cpu list for superblock's inode list
  2016-02-17  1:31 ` [RRC PATCH 2/2] vfs: Use per-cpu list for superblock's inode list Waiman Long
@ 2016-02-17  7:16   ` Ingo Molnar
  2016-02-17 15:40     ` Waiman Long
  2016-02-17 10:37   ` Dave Chinner
  1 sibling, 1 reply; 27+ messages in thread
From: Ingo Molnar @ 2016-02-17  7:16 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Scott J Norton, Douglas Hatch, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Thomas Gleixner


* Waiman Long <Waiman.Long@hpe.com> wrote:

> When many threads are trying to add or delete inode to or from
> a superblock's s_inodes list, spinlock contention on the list can
> become a performance bottleneck.
> 
> This patch changes the s_inodes field to become a per-cpu list with
> per-cpu spinlocks.
> 
> With an exit microbenchmark that creates a large number of threads,
> attachs many inodes to them and then exits. The runtimes of that
> microbenchmark with 1000 threads before and after the patch on a
> 4-socket Intel E7-4820 v3 system (40 cores, 80 threads) were as
> follows:
> 
>   Kernel            Elapsed Time    System Time
>   ------            ------------    -----------
>   Vanilla 4.5-rc4      65.29s         82m14s
>   Patched 4.5-rc4      22.81s         23m03s
> 
> Before the patch, spinlock contention at the inode_sb_list_add()
> function at the startup phase and the inode_sb_list_del() function at
> the exit phase were about 79% and 93% of total CPU time respectively
> (as measured by perf). After the patch, the percpu_list_add()
> function consumed only about 0.04% of CPU time at startup phase. The
> percpu_list_del() function consumed about 0.4% of CPU time at exit
> phase. There were still some spinlock contention, but they happened
> elsewhere.

Pretty impressive IMHO!

Just for the record, here's your former 'batched list' number inserted into the 
above table:

   Kernel                       Elapsed Time    System Time
   ------                       ------------    -----------
   Vanilla      [v4.5-rc4]      65.29s          82m14s
   batched list [v4.4]          45.69s          49m44s
   percpu list  [v4.5-rc4]      22.81s          23m03s

i.e. the proper per CPU data structure and the resulting improvement in cache 
locality gave another doubling in performance.

Just out of curiosity, could you post the profile of the latest patches - is there 
any (bigger) SMP overhead left, or is the profile pretty flat now?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 1/2] lib/percpu-list: Per-cpu list with associated per-cpu locks
  2016-02-17  1:31 ` [RFC PATCH 1/2] lib/percpu-list: Per-cpu list with associated per-cpu locks Waiman Long
@ 2016-02-17  9:53   ` Dave Chinner
  2016-02-17 11:00     ` Peter Zijlstra
  2016-02-17 15:56     ` Waiman Long
  2016-02-17 15:13   ` Christoph Lameter
  1 sibling, 2 replies; 27+ messages in thread
From: Dave Chinner @ 2016-02-17  9:53 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Scott J Norton, Douglas Hatch

On Tue, Feb 16, 2016 at 08:31:19PM -0500, Waiman Long wrote:
> Linked list is used everywhere in the Linux kernel. However, if many
> threads are trying to add or delete entries into the same linked list,
> it can create a performance bottleneck.
> 
> This patch introduces a new per-cpu list subystem with associated
> per-cpu locks for protecting each of the lists individually. This
> allows list entries insertion and deletion operations to happen in
> parallel instead of being serialized with a global list and lock.
> 
> List entry insertion is strictly per cpu. List deletion, however, can
> happen in a cpu other than the one that did the insertion. So we still
> need lock to protect the list. Because of that, there may still be
> a small amount of contention when deletion is being done.
> 
> A new header file include/linux/percpu-list.h will be added with the
> associated percpu_list structure. The following functions are used
> to manage the per-cpu list:
> 
>  1. int init_percpu_list_head(struct percpu_list **pclist_handle)
>  2. void percpu_list_add(struct percpu_list *new,
> 			 struct percpu_list *head)
>  3. void percpu_list_del(struct percpu_list *entry)

A few comments on the code

> + * A per-cpu list protected by a per-cpu spinlock.
> + *
> + * The list head percpu_list structure contains the spinlock, the other
> + * entries in the list contain the spinlock pointer.
> + */
> +struct percpu_list {
> +	struct list_head list;
> +	union {
> +		spinlock_t lock;	/* For list head */
> +		spinlock_t *lockptr;	/* For other entries */
> +	};
> +};

This union is bad for kernels running spinlock debugging - the size
of the spinlock can blow out, and that increases the size of any
object that has a percpu_list in it. I've only got basic spinlock
debugging turned on, and the spinlock_t is 24 bytes with that
config. If I turn on lockdep, it gets much larger again....

So it might be best to separate the list head and list entry
structures, similar to a hash list?

> +static inline void INIT_PERCPU_LIST_HEAD(struct percpu_list *pcpu_list)
> +{
> +	INIT_LIST_HEAD(&pcpu_list->list);
> +	pcpu_list->lock = __SPIN_LOCK_UNLOCKED(&pcpu_list->lock);
> +}
> +
> +static inline void INIT_PERCPU_LIST_ENTRY(struct percpu_list *pcpu_list)
> +{
> +	INIT_LIST_HEAD(&pcpu_list->list);
> +	pcpu_list->lockptr = NULL;
> +}

These function names don't need to shout.

> +/**
> + * for_all_percpu_list_entries - iterate over all the per-cpu list with locking
> + * @pos:	the type * to use as a loop cursor for the current .
> + * @next:	an internal type * variable pointing to the next entry
> + * @pchead:	an internal struct list * of percpu list head
> + * @pclock:	an internal variable for the current per-cpu spinlock
> + * @head:	the head of the per-cpu list
> + * @member:	the name of the per-cpu list within the struct
> + */
> +#define for_all_percpu_list_entries(pos, next, pchead, pclock, head, member)\
> +	{								 \
> +	int cpu;							 \
> +	for_each_possible_cpu (cpu) {					 \
> +		typeof(*pos) *next;					 \
> +		spinlock_t *pclock = per_cpu_ptr(&(head)->lock, cpu);	 \
> +		struct list_head *pchead = &per_cpu_ptr(head, cpu)->list;\
> +		spin_lock(pclock);					 \
> +		list_for_each_entry_safe(pos, next, pchead, member.list)
> +
> +#define end_all_percpu_list_entries(pclock)	spin_unlock(pclock); } }

This is a bit of a landmine - the code inside he iteration is under
a spinlock hidden in the macros. People are going to forget about
that, and it's needs documenting about how it needs to be treated
w.r.t. dropping and regaining the lock so sleeping operations can be
performed on objects on the list being iterated.

Can we also think up some shorter names - names that are 30-40
characters long are getting out out of hand given we're supposed
tobe sticking to 80 character widths and we lost 8 of them in the
first indent...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RRC PATCH 2/2] vfs: Use per-cpu list for superblock's inode list
  2016-02-17  1:31 ` [RRC PATCH 2/2] vfs: Use per-cpu list for superblock's inode list Waiman Long
  2016-02-17  7:16   ` Ingo Molnar
@ 2016-02-17 10:37   ` Dave Chinner
  2016-02-17 16:08     ` Waiman Long
  1 sibling, 1 reply; 27+ messages in thread
From: Dave Chinner @ 2016-02-17 10:37 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Scott J Norton, Douglas Hatch

On Tue, Feb 16, 2016 at 08:31:20PM -0500, Waiman Long wrote:
> When many threads are trying to add or delete inode to or from
> a superblock's s_inodes list, spinlock contention on the list can
> become a performance bottleneck.
> 
> This patch changes the s_inodes field to become a per-cpu list with
> per-cpu spinlocks.
> 
> With an exit microbenchmark that creates a large number of threads,
> attachs many inodes to them and then exits. The runtimes of that
> microbenchmark with 1000 threads before and after the patch on a
> 4-socket Intel E7-4820 v3 system (40 cores, 80 threads) were as
> follows:
> 
>   Kernel            Elapsed Time    System Time
>   ------            ------------    -----------
>   Vanilla 4.5-rc4      65.29s         82m14s
>   Patched 4.5-rc4      22.81s         23m03s

Pretty good :)

My fsmark tests usually show up a fair bit of contention - moving
250k inodes through the cache every second over 16p does generate a
bit of load on the list. The patch makes the inode list add/del
operations disappear completely from the perf profiles, and there's
a marginal decrease in runtime (~4m40s vs 4m30s). I think the global
lock is right on the edge of breakdown under this load, though, so
if I was testing on a larger system I think the difference would be
much bigger.

I'll run some more testing on it, see if anything breaks.

A few comments on the code follow.

> @@ -1866,8 +1866,8 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
>  {
>  	struct inode *inode, *old_inode = NULL;
>  
> -	spin_lock(&blockdev_superblock->s_inode_list_lock);
> -	list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
> +	for_all_percpu_list_entries_simple(inode, percpu_lock,
> +			blockdev_superblock->s_inodes_cpu, i_sb_list) {

This is kind what I meant about names getting way too long. How
about something like:

#define walk_sb_inodes(inode, sb, pcpu_lock)	\
	for_all_percpu_list_entries_simple(inode, pcpu_lock,	\
					   sb->s_inodes_list, i_sb_list)

#define walk_sb_inodes_end(pcpu_lock) end_all_percpu_list_entries(pcpu_lock)

for brevity?

> @@ -189,7 +190,7 @@ void fsnotify_unmount_inodes(struct super_block *sb)
>  		spin_unlock(&inode->i_lock);
>  
>  		/* In case the dropping of a reference would nuke next_i. */
> -		while (&next_i->i_sb_list != &sb->s_inodes) {
> +		while (&next_i->i_sb_list.list != percpu_head) {
>  			spin_lock(&next_i->i_lock);
>  			if (!(next_i->i_state & (I_FREEING | I_WILL_FREE)) &&
>  						atomic_read(&next_i->i_count)) {
> @@ -199,16 +200,16 @@ void fsnotify_unmount_inodes(struct super_block *sb)
>  				break;
>  			}
>  			spin_unlock(&next_i->i_lock);
> -			next_i = list_next_entry(next_i, i_sb_list);
> +			next_i = list_next_entry(next_i, i_sb_list.list);

pcpu_list_next_entry(next_i, i_sb_list)?

> @@ -1397,9 +1398,8 @@ struct super_block {
>  	 */
>  	int s_stack_depth;
>  
> -	/* s_inode_list_lock protects s_inodes */
> -	spinlock_t		s_inode_list_lock ____cacheline_aligned_in_smp;
> -	struct list_head	s_inodes;	/* all inodes */
> +	/* The percpu locks protect s_inodes_cpu */
> +	PERCPU_LIST_HEAD(s_inodes_cpu);	/* all inodes */

There is no need to encode the type of list into the name.
i.e. drop the "_cpu" suffix - we can see it's a percpu list from the
declaration.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 1/2] lib/percpu-list: Per-cpu list with associated per-cpu locks
  2016-02-17  9:53   ` Dave Chinner
@ 2016-02-17 11:00     ` Peter Zijlstra
  2016-02-17 11:05       ` Peter Zijlstra
  2016-02-17 11:10       ` Dave Chinner
  2016-02-17 15:56     ` Waiman Long
  1 sibling, 2 replies; 27+ messages in thread
From: Peter Zijlstra @ 2016-02-17 11:00 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Waiman Long, Alexander Viro, Jan Kara, Jeff Layton,
	J. Bruce Fields, Tejun Heo, Christoph Lameter, linux-fsdevel,
	linux-kernel, Ingo Molnar, Andi Kleen, Dave Chinner,
	Scott J Norton, Douglas Hatch

On Wed, Feb 17, 2016 at 08:53:18PM +1100, Dave Chinner wrote:
> > +/**
> > + * for_all_percpu_list_entries - iterate over all the per-cpu list with locking
> > + * @pos:	the type * to use as a loop cursor for the current .
> > + * @next:	an internal type * variable pointing to the next entry
> > + * @pchead:	an internal struct list * of percpu list head
> > + * @pclock:	an internal variable for the current per-cpu spinlock
> > + * @head:	the head of the per-cpu list
> > + * @member:	the name of the per-cpu list within the struct
> > + */
> > +#define for_all_percpu_list_entries(pos, next, pchead, pclock, head, member)\
> > +	{								 \
> > +	int cpu;							 \
> > +	for_each_possible_cpu (cpu) {					 \
> > +		typeof(*pos) *next;					 \
> > +		spinlock_t *pclock = per_cpu_ptr(&(head)->lock, cpu);	 \
> > +		struct list_head *pchead = &per_cpu_ptr(head, cpu)->list;\
> > +		spin_lock(pclock);					 \
> > +		list_for_each_entry_safe(pos, next, pchead, member.list)
> > +
> > +#define end_all_percpu_list_entries(pclock)	spin_unlock(pclock); } }
> 
> This is a bit of a landmine 

Yeah, that is pretty terrible. Maybe a visitor interface is advisable?

visit_percpu_list_entries(struct percpu_list *head, void (*visitor)(struct list_head *pos, void *data), void *data)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		spinlock_t *lock = per_cpu_ptr(&head->lock, cpu);
		struct list_head *head = per_cpu_ptr(&head->list, cpu);
		struct list_head *pos, *tmp;

		spin_lock(lock);
		for (pos = head->next, tmp = pos->next; pos != head; pos = tmp)
			visitor(pos, data);
		spin_unlock(lock);
	}
}

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 1/2] lib/percpu-list: Per-cpu list with associated per-cpu locks
  2016-02-17 11:00     ` Peter Zijlstra
@ 2016-02-17 11:05       ` Peter Zijlstra
  2016-02-17 16:16         ` Waiman Long
  2016-02-17 11:10       ` Dave Chinner
  1 sibling, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2016-02-17 11:05 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Waiman Long, Alexander Viro, Jan Kara, Jeff Layton,
	J. Bruce Fields, Tejun Heo, Christoph Lameter, linux-fsdevel,
	linux-kernel, Ingo Molnar, Andi Kleen, Dave Chinner,
	Scott J Norton, Douglas Hatch

On Wed, Feb 17, 2016 at 12:00:40PM +0100, Peter Zijlstra wrote:
> On Wed, Feb 17, 2016 at 08:53:18PM +1100, Dave Chinner wrote:
> > > +/**
> > > + * for_all_percpu_list_entries - iterate over all the per-cpu list with locking

But the locking is 'pointless'. You only lock one per-cpu sublist at a
time, therefore the list can be modified concurrently anyhow.

So why not use RCU for the list iteration and avoid potentially large
lock hold times?

> > > + * @pos:	the type * to use as a loop cursor for the current .
> > > + * @next:	an internal type * variable pointing to the next entry
> > > + * @pchead:	an internal struct list * of percpu list head
> > > + * @pclock:	an internal variable for the current per-cpu spinlock
> > > + * @head:	the head of the per-cpu list
> > > + * @member:	the name of the per-cpu list within the struct
> > > + */

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 1/2] lib/percpu-list: Per-cpu list with associated per-cpu locks
  2016-02-17 11:00     ` Peter Zijlstra
  2016-02-17 11:05       ` Peter Zijlstra
@ 2016-02-17 11:10       ` Dave Chinner
  2016-02-17 11:26         ` Peter Zijlstra
  1 sibling, 1 reply; 27+ messages in thread
From: Dave Chinner @ 2016-02-17 11:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Waiman Long, Alexander Viro, Jan Kara, Jeff Layton,
	J. Bruce Fields, Tejun Heo, Christoph Lameter, linux-fsdevel,
	linux-kernel, Ingo Molnar, Andi Kleen, Dave Chinner,
	Scott J Norton, Douglas Hatch

On Wed, Feb 17, 2016 at 12:00:40PM +0100, Peter Zijlstra wrote:
> On Wed, Feb 17, 2016 at 08:53:18PM +1100, Dave Chinner wrote:
> > > +/**
> > > + * for_all_percpu_list_entries - iterate over all the per-cpu list with locking
> > > + * @pos:	the type * to use as a loop cursor for the current .
> > > + * @next:	an internal type * variable pointing to the next entry
> > > + * @pchead:	an internal struct list * of percpu list head
> > > + * @pclock:	an internal variable for the current per-cpu spinlock
> > > + * @head:	the head of the per-cpu list
> > > + * @member:	the name of the per-cpu list within the struct
> > > + */
> > > +#define for_all_percpu_list_entries(pos, next, pchead, pclock, head, member)\
> > > +	{								 \
> > > +	int cpu;							 \
> > > +	for_each_possible_cpu (cpu) {					 \
> > > +		typeof(*pos) *next;					 \
> > > +		spinlock_t *pclock = per_cpu_ptr(&(head)->lock, cpu);	 \
> > > +		struct list_head *pchead = &per_cpu_ptr(head, cpu)->list;\
> > > +		spin_lock(pclock);					 \
> > > +		list_for_each_entry_safe(pos, next, pchead, member.list)
> > > +
> > > +#define end_all_percpu_list_entries(pclock)	spin_unlock(pclock); } }
> > 
> > This is a bit of a landmine 
> 
> Yeah, that is pretty terrible. Maybe a visitor interface is advisable?
> 
> visit_percpu_list_entries(struct percpu_list *head, void (*visitor)(struct list_head *pos, void *data), void *data)
> {
> 	int cpu;
> 
> 	for_each_possible_cpu(cpu) {
> 		spinlock_t *lock = per_cpu_ptr(&head->lock, cpu);
> 		struct list_head *head = per_cpu_ptr(&head->list, cpu);
> 		struct list_head *pos, *tmp;
> 
> 		spin_lock(lock);
> 		for (pos = head->next, tmp = pos->next; pos != head; pos = tmp)
> 			visitor(pos, data);

I thought about this - it's the same problem as the list_lru walking
functions. That is, the visitor has to be able to drop the list lock
to do blocking operations, so the lock has to be passed to the
visitor/internal loop context somehow, and the way the callers can
use it need to be documented.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 1/2] lib/percpu-list: Per-cpu list with associated per-cpu locks
  2016-02-17 11:10       ` Dave Chinner
@ 2016-02-17 11:26         ` Peter Zijlstra
  2016-02-17 11:36           ` Peter Zijlstra
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2016-02-17 11:26 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Waiman Long, Alexander Viro, Jan Kara, Jeff Layton,
	J. Bruce Fields, Tejun Heo, Christoph Lameter, linux-fsdevel,
	linux-kernel, Ingo Molnar, Andi Kleen, Dave Chinner,
	Scott J Norton, Douglas Hatch

On Wed, Feb 17, 2016 at 10:10:02PM +1100, Dave Chinner wrote:
> On Wed, Feb 17, 2016 at 12:00:40PM +0100, Peter Zijlstra wrote:

> > Yeah, that is pretty terrible. Maybe a visitor interface is advisable?
> > 
> > visit_percpu_list_entries(struct percpu_list *head, void (*visitor)(struct list_head *pos, void *data), void *data)
> > {
> > 	int cpu;
> > 
> > 	for_each_possible_cpu(cpu) {
> > 		spinlock_t *lock = per_cpu_ptr(&head->lock, cpu);
> > 		struct list_head *head = per_cpu_ptr(&head->list, cpu);
> > 		struct list_head *pos, *tmp;
> > 
> > 		spin_lock(lock);
> > 		for (pos = head->next, tmp = pos->next; pos != head; pos = tmp)
> > 			visitor(pos, data);
> 
> I thought about this - it's the same problem as the list_lru walking
> functions. That is, the visitor has to be able to drop the list lock
> to do blocking operations, so the lock has to be passed to the
> visitor/internal loop context somehow, and the way the callers can
> use it need to be documented.

But you cannot drop the lock and guarantee fwd progress. The moment you
drop the lock, you have to restart the iteration from the head, since
any iterator you had might now be pointing into space.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 1/2] lib/percpu-list: Per-cpu list with associated per-cpu locks
  2016-02-17 11:26         ` Peter Zijlstra
@ 2016-02-17 11:36           ` Peter Zijlstra
  0 siblings, 0 replies; 27+ messages in thread
From: Peter Zijlstra @ 2016-02-17 11:36 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Waiman Long, Alexander Viro, Jan Kara, Jeff Layton,
	J. Bruce Fields, Tejun Heo, Christoph Lameter, linux-fsdevel,
	linux-kernel, Ingo Molnar, Andi Kleen, Dave Chinner,
	Scott J Norton, Douglas Hatch

On Wed, Feb 17, 2016 at 12:26:54PM +0100, Peter Zijlstra wrote:
> On Wed, Feb 17, 2016 at 10:10:02PM +1100, Dave Chinner wrote:
> > On Wed, Feb 17, 2016 at 12:00:40PM +0100, Peter Zijlstra wrote:
> 
> > > Yeah, that is pretty terrible. Maybe a visitor interface is advisable?
> > > 
> > > visit_percpu_list_entries(struct percpu_list *head, void (*visitor)(struct list_head *pos, void *data), void *data)
> > > {
> > > 	int cpu;
> > > 
> > > 	for_each_possible_cpu(cpu) {
> > > 		spinlock_t *lock = per_cpu_ptr(&head->lock, cpu);
> > > 		struct list_head *head = per_cpu_ptr(&head->list, cpu);
> > > 		struct list_head *pos, *tmp;
> > > 
> > > 		spin_lock(lock);
> > > 		for (pos = head->next, tmp = pos->next; pos != head; pos = tmp)
> > > 			visitor(pos, data);
> > 
> > I thought about this - it's the same problem as the list_lru walking
> > functions. That is, the visitor has to be able to drop the list lock
> > to do blocking operations, so the lock has to be passed to the
> > visitor/internal loop context somehow, and the way the callers can
> > use it need to be documented.
> 
> But you cannot drop the lock and guarantee fwd progress. The moment you
> drop the lock, you have to restart the iteration from the head, since
> any iterator you had might now be pointing into space.

Ah, I see what iterate_bdevs() does. Yes, that's somewhat 'special'. Not
sure it makes sense to craft a generic 'interface' for that.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 1/2] lib/percpu-list: Per-cpu list with associated per-cpu locks
  2016-02-17  1:31 ` [RFC PATCH 1/2] lib/percpu-list: Per-cpu list with associated per-cpu locks Waiman Long
  2016-02-17  9:53   ` Dave Chinner
@ 2016-02-17 15:13   ` Christoph Lameter
  1 sibling, 0 replies; 27+ messages in thread
From: Christoph Lameter @ 2016-02-17 15:13 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, linux-fsdevel, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Andi Kleen, Dave Chinner, Scott J Norton,
	Douglas Hatch

On Tue, 16 Feb 2016, Waiman Long wrote:

> List entry insertion is strictly per cpu. List deletion, however, can
> happen in a cpu other than the one that did the insertion. So we still
> need lock to protect the list. Because of that, there may still be
> a small amount of contention when deletion is being done.

Is there a way to avoid locking completely? You could use cmpxchg_double
to swap both list head and tail in an atomic fashion with some work. There is
even a this_cpu_cmpxchg_double available.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RRC PATCH 2/2] vfs: Use per-cpu list for superblock's inode list
  2016-02-17  7:16   ` Ingo Molnar
@ 2016-02-17 15:40     ` Waiman Long
  0 siblings, 0 replies; 27+ messages in thread
From: Waiman Long @ 2016-02-17 15:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Scott J Norton, Douglas Hatch, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Thomas Gleixner

On 02/17/2016 02:16 AM, Ingo Molnar wrote:
> * Waiman Long<Waiman.Long@hpe.com>  wrote:
>
>> When many threads are trying to add or delete inode to or from
>> a superblock's s_inodes list, spinlock contention on the list can
>> become a performance bottleneck.
>>
>> This patch changes the s_inodes field to become a per-cpu list with
>> per-cpu spinlocks.
>>
>> With an exit microbenchmark that creates a large number of threads,
>> attachs many inodes to them and then exits. The runtimes of that
>> microbenchmark with 1000 threads before and after the patch on a
>> 4-socket Intel E7-4820 v3 system (40 cores, 80 threads) were as
>> follows:
>>
>>    Kernel            Elapsed Time    System Time
>>    ------            ------------    -----------
>>    Vanilla 4.5-rc4      65.29s         82m14s
>>    Patched 4.5-rc4      22.81s         23m03s
>>
>> Before the patch, spinlock contention at the inode_sb_list_add()
>> function at the startup phase and the inode_sb_list_del() function at
>> the exit phase were about 79% and 93% of total CPU time respectively
>> (as measured by perf). After the patch, the percpu_list_add()
>> function consumed only about 0.04% of CPU time at startup phase. The
>> percpu_list_del() function consumed about 0.4% of CPU time at exit
>> phase. There were still some spinlock contention, but they happened
>> elsewhere.
> Pretty impressive IMHO!
>
> Just for the record, here's your former 'batched list' number inserted into the
> above table:
>
>     Kernel                       Elapsed Time    System Time
>     ------                       ------------    -----------
>     Vanilla      [v4.5-rc4]      65.29s          82m14s
>     batched list [v4.4]          45.69s          49m44s
>     percpu list  [v4.5-rc4]      22.81s          23m03s
>
> i.e. the proper per CPU data structure and the resulting improvement in cache
> locality gave another doubling in performance.
>
> Just out of curiosity, could you post the profile of the latest patches - is there
> any (bigger) SMP overhead left, or is the profile pretty flat now?
>
> Thanks,
>
> 	Ingo

Yes, there were still spinlock contention elsewhere in the exit path. 
Now the bulk of the CPU times was in:

-   79.23%    79.23%         a.out  [kernel.kallsyms]    [k] 
native_queued_spin
    - native_queued_spin_lock_slowpath
       - 99.99% queued_spin_lock_slowpath
          - 100.00% _raw_spin_lock
             - 99.98% list_lru_del
                - d_lru_del
                   - 100.00% select_collect
                        detach_and_collect
                        d_walk
                        d_invalidate
                        proc_flush_task
                        release_task
                        do_exit
                        do_group_exit
                        get_signal
                        do_signal
                        exit_to_usermode_loop
                        syscall_return_slowpath
                        int_ret_from_sys_call

The locks that were being contended were nlru->lock. For a 4-node system 
that I used, there will be four of those.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 1/2] lib/percpu-list: Per-cpu list with associated per-cpu locks
  2016-02-17  9:53   ` Dave Chinner
  2016-02-17 11:00     ` Peter Zijlstra
@ 2016-02-17 15:56     ` Waiman Long
  2016-02-17 16:02       ` Peter Zijlstra
  1 sibling, 1 reply; 27+ messages in thread
From: Waiman Long @ 2016-02-17 15:56 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Scott J Norton, Douglas Hatch

On 02/17/2016 04:53 AM, Dave Chinner wrote:
> On Tue, Feb 16, 2016 at 08:31:19PM -0500, Waiman Long wrote:
>> Linked list is used everywhere in the Linux kernel. However, if many
>> threads are trying to add or delete entries into the same linked list,
>> it can create a performance bottleneck.
>>
>> This patch introduces a new per-cpu list subystem with associated
>> per-cpu locks for protecting each of the lists individually. This
>> allows list entries insertion and deletion operations to happen in
>> parallel instead of being serialized with a global list and lock.
>>
>> List entry insertion is strictly per cpu. List deletion, however, can
>> happen in a cpu other than the one that did the insertion. So we still
>> need lock to protect the list. Because of that, there may still be
>> a small amount of contention when deletion is being done.
>>
>> A new header file include/linux/percpu-list.h will be added with the
>> associated percpu_list structure. The following functions are used
>> to manage the per-cpu list:
>>
>>   1. int init_percpu_list_head(struct percpu_list **pclist_handle)
>>   2. void percpu_list_add(struct percpu_list *new,
>> 			 struct percpu_list *head)
>>   3. void percpu_list_del(struct percpu_list *entry)
> A few comments on the code
>
>> + * A per-cpu list protected by a per-cpu spinlock.
>> + *
>> + * The list head percpu_list structure contains the spinlock, the other
>> + * entries in the list contain the spinlock pointer.
>> + */
>> +struct percpu_list {
>> +	struct list_head list;
>> +	union {
>> +		spinlock_t lock;	/* For list head */
>> +		spinlock_t *lockptr;	/* For other entries */
>> +	};
>> +};
> This union is bad for kernels running spinlock debugging - the size
> of the spinlock can blow out, and that increases the size of any
> object that has a percpu_list in it. I've only got basic spinlock
> debugging turned on, and the spinlock_t is 24 bytes with that
> config. If I turn on lockdep, it gets much larger again....
>
> So it might be best to separate the list head and list entry
> structures, similar to a hash list?

Right. I will split it into 2 separate structure in the next iteration 
of the patch.

>> +static inline void INIT_PERCPU_LIST_HEAD(struct percpu_list *pcpu_list)
>> +{
>> +	INIT_LIST_HEAD(&pcpu_list->list);
>> +	pcpu_list->lock = __SPIN_LOCK_UNLOCKED(&pcpu_list->lock);
>> +}
>> +
>> +static inline void INIT_PERCPU_LIST_ENTRY(struct percpu_list *pcpu_list)
>> +{
>> +	INIT_LIST_HEAD(&pcpu_list->list);
>> +	pcpu_list->lockptr = NULL;
>> +}
> These function names don't need to shout.

I was just following the convention used in list init functions. I can 
certainly change them to lowercase.

>
>> +/**
>> + * for_all_percpu_list_entries - iterate over all the per-cpu list with locking
>> + * @pos:	the type * to use as a loop cursor for the current .
>> + * @next:	an internal type * variable pointing to the next entry
>> + * @pchead:	an internal struct list * of percpu list head
>> + * @pclock:	an internal variable for the current per-cpu spinlock
>> + * @head:	the head of the per-cpu list
>> + * @member:	the name of the per-cpu list within the struct
>> + */
>> +#define for_all_percpu_list_entries(pos, next, pchead, pclock, head, member)\
>> +	{								 \
>> +	int cpu;							 \
>> +	for_each_possible_cpu (cpu) {					 \
>> +		typeof(*pos) *next;					 \
>> +		spinlock_t *pclock = per_cpu_ptr(&(head)->lock, cpu);	 \
>> +		struct list_head *pchead =&per_cpu_ptr(head, cpu)->list;\
>> +		spin_lock(pclock);					 \
>> +		list_for_each_entry_safe(pos, next, pchead, member.list)
>> +
>> +#define end_all_percpu_list_entries(pclock)	spin_unlock(pclock); } }
> This is a bit of a landmine - the code inside he iteration is under
> a spinlock hidden in the macros. People are going to forget about
> that, and it's needs documenting about how it needs to be treated
> w.r.t. dropping and regaining the lock so sleeping operations can be
> performed on objects on the list being iterated.
>
> Can we also think up some shorter names - names that are 30-40
> characters long are getting out out of hand given we're supposed
> tobe sticking to 80 character widths and we lost 8 of them in the
> first indent...
>
> Cheers,
>
> Dave.

I will try to shorten the name and better document the macro. This is 
probably the most tricky part of the whole part.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 1/2] lib/percpu-list: Per-cpu list with associated per-cpu locks
  2016-02-17 15:56     ` Waiman Long
@ 2016-02-17 16:02       ` Peter Zijlstra
  0 siblings, 0 replies; 27+ messages in thread
From: Peter Zijlstra @ 2016-02-17 16:02 UTC (permalink / raw)
  To: Waiman Long
  Cc: Dave Chinner, Alexander Viro, Jan Kara, Jeff Layton,
	J. Bruce Fields, Tejun Heo, Christoph Lameter, linux-fsdevel,
	linux-kernel, Ingo Molnar, Andi Kleen, Dave Chinner,
	Scott J Norton, Douglas Hatch

On Wed, Feb 17, 2016 at 10:56:10AM -0500, Waiman Long wrote:
> >>+/**
> >>+ * for_all_percpu_list_entries - iterate over all the per-cpu list with locking
> >>+ * @pos:	the type * to use as a loop cursor for the current .
> >>+ * @next:	an internal type * variable pointing to the next entry
> >>+ * @pchead:	an internal struct list * of percpu list head
> >>+ * @pclock:	an internal variable for the current per-cpu spinlock
> >>+ * @head:	the head of the per-cpu list
> >>+ * @member:	the name of the per-cpu list within the struct
> >>+ */
> >>+#define for_all_percpu_list_entries(pos, next, pchead, pclock, head, member)\
> >>+	{								 \
> >>+	int cpu;							 \
> >>+	for_each_possible_cpu (cpu) {					 \
> >>+		typeof(*pos) *next;					 \
> >>+		spinlock_t *pclock = per_cpu_ptr(&(head)->lock, cpu);	 \
> >>+		struct list_head *pchead =&per_cpu_ptr(head, cpu)->list;\
> >>+		spin_lock(pclock);					 \
> >>+		list_for_each_entry_safe(pos, next, pchead, member.list)
> >>+
> >>+#define end_all_percpu_list_entries(pclock)	spin_unlock(pclock); } }

> I will try to shorten the name and better document the macro. This is
> probably the most tricky part of the whole part.

Note that your use of _safe() here actually makes the usage in
iterate_bdevs() unsafe!

Because that function relies on __iget() pinning the current position,
which means that once you re-acquire the list lock, pos->next is valid.

Howveer, _safe() takes pos->next before dropping the lock, and that
object is not actually pinned and can go away.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RRC PATCH 2/2] vfs: Use per-cpu list for superblock's inode list
  2016-02-17 10:37   ` Dave Chinner
@ 2016-02-17 16:08     ` Waiman Long
  0 siblings, 0 replies; 27+ messages in thread
From: Waiman Long @ 2016-02-17 16:08 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Scott J Norton, Douglas Hatch

On 02/17/2016 05:37 AM, Dave Chinner wrote:
> On Tue, Feb 16, 2016 at 08:31:20PM -0500, Waiman Long wrote:
>> When many threads are trying to add or delete inode to or from
>> a superblock's s_inodes list, spinlock contention on the list can
>> become a performance bottleneck.
>>
>> This patch changes the s_inodes field to become a per-cpu list with
>> per-cpu spinlocks.
>>
>> With an exit microbenchmark that creates a large number of threads,
>> attachs many inodes to them and then exits. The runtimes of that
>> microbenchmark with 1000 threads before and after the patch on a
>> 4-socket Intel E7-4820 v3 system (40 cores, 80 threads) were as
>> follows:
>>
>>    Kernel            Elapsed Time    System Time
>>    ------            ------------    -----------
>>    Vanilla 4.5-rc4      65.29s         82m14s
>>    Patched 4.5-rc4      22.81s         23m03s
> Pretty good :)
>
> My fsmark tests usually show up a fair bit of contention - moving
> 250k inodes through the cache every second over 16p does generate a
> bit of load on the list. The patch makes the inode list add/del
> operations disappear completely from the perf profiles, and there's
> a marginal decrease in runtime (~4m40s vs 4m30s). I think the global
> lock is right on the edge of breakdown under this load, though, so
> if I was testing on a larger system I think the difference would be
> much bigger.
>
> I'll run some more testing on it, see if anything breaks.
>
> A few comments on the code follow.
>
>> @@ -1866,8 +1866,8 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
>>   {
>>   	struct inode *inode, *old_inode = NULL;
>>
>> -	spin_lock(&blockdev_superblock->s_inode_list_lock);
>> -	list_for_each_entry(inode,&blockdev_superblock->s_inodes, i_sb_list) {
>> +	for_all_percpu_list_entries_simple(inode, percpu_lock,
>> +			blockdev_superblock->s_inodes_cpu, i_sb_list) {
> This is kind what I meant about names getting way too long. How
> about something like:
>
> #define walk_sb_inodes(inode, sb, pcpu_lock)	\
> 	for_all_percpu_list_entries_simple(inode, pcpu_lock,	\
> 					   sb->s_inodes_list, i_sb_list)
>
> #define walk_sb_inodes_end(pcpu_lock) end_all_percpu_list_entries(pcpu_lock)
>
> for brevity?

Yes, I think adding some inode specific macros in fs.h will help to make 
the patch easier to read.

>> @@ -189,7 +190,7 @@ void fsnotify_unmount_inodes(struct super_block *sb)
>>   		spin_unlock(&inode->i_lock);
>>
>>   		/* In case the dropping of a reference would nuke next_i. */
>> -		while (&next_i->i_sb_list !=&sb->s_inodes) {
>> +		while (&next_i->i_sb_list.list != percpu_head) {
>>   			spin_lock(&next_i->i_lock);
>>   			if (!(next_i->i_state&  (I_FREEING | I_WILL_FREE))&&
>>   						atomic_read(&next_i->i_count)) {
>> @@ -199,16 +200,16 @@ void fsnotify_unmount_inodes(struct super_block *sb)
>>   				break;
>>   			}
>>   			spin_unlock(&next_i->i_lock);
>> -			next_i = list_next_entry(next_i, i_sb_list);
>> +			next_i = list_next_entry(next_i, i_sb_list.list);
> pcpu_list_next_entry(next_i, i_sb_list)?

Will add that.

>> @@ -1397,9 +1398,8 @@ struct super_block {
>>   	 */
>>   	int s_stack_depth;
>>
>> -	/* s_inode_list_lock protects s_inodes */
>> -	spinlock_t		s_inode_list_lock ____cacheline_aligned_in_smp;
>> -	struct list_head	s_inodes;	/* all inodes */
>> +	/* The percpu locks protect s_inodes_cpu */
>> +	PERCPU_LIST_HEAD(s_inodes_cpu);	/* all inodes */
> There is no need to encode the type of list into the name.
> i.e. drop the "_cpu" suffix - we can see it's a percpu list from the
> declaration.

Will remove that macro.

Thanks for the review.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 1/2] lib/percpu-list: Per-cpu list with associated per-cpu locks
  2016-02-17 11:05       ` Peter Zijlstra
@ 2016-02-17 16:16         ` Waiman Long
  2016-02-17 16:22           ` Peter Zijlstra
  2016-02-17 16:27           ` Christoph Lameter
  0 siblings, 2 replies; 27+ messages in thread
From: Waiman Long @ 2016-02-17 16:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Chinner, Alexander Viro, Jan Kara, Jeff Layton,
	J. Bruce Fields, Tejun Heo, Christoph Lameter, linux-fsdevel,
	linux-kernel, Ingo Molnar, Andi Kleen, Dave Chinner,
	Scott J Norton, Douglas Hatch

On 02/17/2016 06:05 AM, Peter Zijlstra wrote:
> On Wed, Feb 17, 2016 at 12:00:40PM +0100, Peter Zijlstra wrote:
>> On Wed, Feb 17, 2016 at 08:53:18PM +1100, Dave Chinner wrote:
>>>> +/**
>>>> + * for_all_percpu_list_entries - iterate over all the per-cpu list with locking
> But the locking is 'pointless'. You only lock one per-cpu sublist at a
> time, therefore the list can be modified concurrently anyhow.

For each per-cpu list, there can be only 1 task allowed to make changes. 
If you are talking about all the per-cpu lists within the group, 
operations can happen in parallel.

> So why not use RCU for the list iteration and avoid potentially large
> lock hold times?
>

I know we can use RCU for singly linked list, but I don't think we can 
use that for doubly linked list as there is no easy way to make atomic 
changes to both prev and next pointers simultaneously unless you are 
taking about 16b cmpxchg which is only supported in some architecture.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 1/2] lib/percpu-list: Per-cpu list with associated per-cpu locks
  2016-02-17 16:16         ` Waiman Long
@ 2016-02-17 16:22           ` Peter Zijlstra
  2016-02-17 16:27           ` Christoph Lameter
  1 sibling, 0 replies; 27+ messages in thread
From: Peter Zijlstra @ 2016-02-17 16:22 UTC (permalink / raw)
  To: Waiman Long
  Cc: Dave Chinner, Alexander Viro, Jan Kara, Jeff Layton,
	J. Bruce Fields, Tejun Heo, Christoph Lameter, linux-fsdevel,
	linux-kernel, Ingo Molnar, Andi Kleen, Dave Chinner,
	Scott J Norton, Douglas Hatch

On Wed, Feb 17, 2016 at 11:16:10AM -0500, Waiman Long wrote:
> >So why not use RCU for the list iteration and avoid potentially large
> >lock hold times?
> >
> 
> I know we can use RCU for singly linked list, but I don't think we can use
> that for doubly linked list as there is no easy way to make atomic changes
> to both prev and next pointers simultaneously unless you are taking about
> 16b cmpxchg which is only supported in some architecture.

See include/linux/rculist.h, the only constraint is that reverse
iteration is not allowed, because of what you say.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 1/2] lib/percpu-list: Per-cpu list with associated per-cpu locks
  2016-02-17 16:16         ` Waiman Long
  2016-02-17 16:22           ` Peter Zijlstra
@ 2016-02-17 16:27           ` Christoph Lameter
  2016-02-17 17:12             ` Waiman Long
  1 sibling, 1 reply; 27+ messages in thread
From: Christoph Lameter @ 2016-02-17 16:27 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Dave Chinner, Alexander Viro, Jan Kara,
	Jeff Layton, J. Bruce Fields, Tejun Heo, linux-fsdevel,
	linux-kernel, Ingo Molnar, Andi Kleen, Dave Chinner,
	Scott J Norton, Douglas Hatch

On Wed, 17 Feb 2016, Waiman Long wrote:

> I know we can use RCU for singly linked list, but I don't think we can use
> that for doubly linked list as there is no easy way to make atomic changes to
> both prev and next pointers simultaneously unless you are taking about 16b
> cmpxchg which is only supported in some architecture.

But its supported in the most important architecutes. You can fall back to
spinlocks on the ones that do not support it.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 1/2] lib/percpu-list: Per-cpu list with associated per-cpu locks
  2016-02-17 16:27           ` Christoph Lameter
@ 2016-02-17 17:12             ` Waiman Long
  2016-02-17 17:18               ` Peter Zijlstra
  0 siblings, 1 reply; 27+ messages in thread
From: Waiman Long @ 2016-02-17 17:12 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, Dave Chinner, Alexander Viro, Jan Kara,
	Jeff Layton, J. Bruce Fields, Tejun Heo, linux-fsdevel,
	linux-kernel, Ingo Molnar, Andi Kleen, Dave Chinner,
	Scott J Norton, Douglas Hatch

On 02/17/2016 11:27 AM, Christoph Lameter wrote:
> On Wed, 17 Feb 2016, Waiman Long wrote:
>
>> I know we can use RCU for singly linked list, but I don't think we can use
>> that for doubly linked list as there is no easy way to make atomic changes to
>> both prev and next pointers simultaneously unless you are taking about 16b
>> cmpxchg which is only supported in some architecture.
> But its supported in the most important architecutes. You can fall back to
> spinlocks on the ones that do not support it.
>

I guess with some limitations on how the lists can be traversed, we may 
be able to do that with RCU without lock. However, that will make the 
code more complex and harder to verify. Given that in both my and Dave's 
testing that contentions with list insertion and deletion are almost 
gone from the perf profile when they used to be a bottleneck, is it 
really worth the effort to do such a conversion?

Cheers,
Longman

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 1/2] lib/percpu-list: Per-cpu list with associated per-cpu locks
  2016-02-17 17:12             ` Waiman Long
@ 2016-02-17 17:18               ` Peter Zijlstra
  2016-02-17 17:41                 ` Waiman Long
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2016-02-17 17:18 UTC (permalink / raw)
  To: Waiman Long
  Cc: Christoph Lameter, Dave Chinner, Alexander Viro, Jan Kara,
	Jeff Layton, J. Bruce Fields, Tejun Heo, linux-fsdevel,
	linux-kernel, Ingo Molnar, Andi Kleen, Dave Chinner,
	Scott J Norton, Douglas Hatch

On Wed, Feb 17, 2016 at 12:12:57PM -0500, Waiman Long wrote:
> On 02/17/2016 11:27 AM, Christoph Lameter wrote:
> >On Wed, 17 Feb 2016, Waiman Long wrote:
> >
> >>I know we can use RCU for singly linked list, but I don't think we can use
> >>that for doubly linked list as there is no easy way to make atomic changes to
> >>both prev and next pointers simultaneously unless you are taking about 16b
> >>cmpxchg which is only supported in some architecture.
> >But its supported in the most important architecutes. You can fall back to
> >spinlocks on the ones that do not support it.
> >
> 
> I guess with some limitations on how the lists can be traversed, we may be
> able to do that with RCU without lock. However, that will make the code more
> complex and harder to verify. Given that in both my and Dave's testing that
> contentions with list insertion and deletion are almost gone from the perf
> profile when they used to be a bottleneck, is it really worth the effort to
> do such a conversion?

My initial concern was the preempt disable delay introduced by holding
the spinlock over the entire iteration.

There is no saying how many elements are on that list and there is no
lock break.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 1/2] lib/percpu-list: Per-cpu list with associated per-cpu locks
  2016-02-17 17:18               ` Peter Zijlstra
@ 2016-02-17 17:41                 ` Waiman Long
  2016-02-17 18:22                   ` Peter Zijlstra
  0 siblings, 1 reply; 27+ messages in thread
From: Waiman Long @ 2016-02-17 17:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Christoph Lameter, Dave Chinner, Alexander Viro, Jan Kara,
	Jeff Layton, J. Bruce Fields, Tejun Heo, linux-fsdevel,
	linux-kernel, Ingo Molnar, Andi Kleen, Dave Chinner,
	Scott J Norton, Douglas Hatch

On 02/17/2016 12:18 PM, Peter Zijlstra wrote:
> On Wed, Feb 17, 2016 at 12:12:57PM -0500, Waiman Long wrote:
>> On 02/17/2016 11:27 AM, Christoph Lameter wrote:
>>> On Wed, 17 Feb 2016, Waiman Long wrote:
>>>
>>>> I know we can use RCU for singly linked list, but I don't think we can use
>>>> that for doubly linked list as there is no easy way to make atomic changes to
>>>> both prev and next pointers simultaneously unless you are taking about 16b
>>>> cmpxchg which is only supported in some architecture.
>>> But its supported in the most important architecutes. You can fall back to
>>> spinlocks on the ones that do not support it.
>>>
>> I guess with some limitations on how the lists can be traversed, we may be
>> able to do that with RCU without lock. However, that will make the code more
>> complex and harder to verify. Given that in both my and Dave's testing that
>> contentions with list insertion and deletion are almost gone from the perf
>> profile when they used to be a bottleneck, is it really worth the effort to
>> do such a conversion?
> My initial concern was the preempt disable delay introduced by holding
> the spinlock over the entire iteration.
>
> There is no saying how many elements are on that list and there is no
> lock break.

But preempt_disable() is called at the beginning of the spin_lock() 
call. So the additional preempt_disable() in percpu_list_add() is just 
to cover the this_cpu_ptr() call to make sure that the cpu number 
doesn't change. So we are talking about a few ns at most here.

Actually, I think I can remove the preempt_disable() and 
preempt_enable() calls as we just need to put list entry in one of the 
per-cpu lists. It doesn't need to be the same CPU of the current task.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 1/2] lib/percpu-list: Per-cpu list with associated per-cpu locks
  2016-02-17 17:41                 ` Waiman Long
@ 2016-02-17 18:22                   ` Peter Zijlstra
  2016-02-17 18:45                     ` Waiman Long
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2016-02-17 18:22 UTC (permalink / raw)
  To: Waiman Long
  Cc: Christoph Lameter, Dave Chinner, Alexander Viro, Jan Kara,
	Jeff Layton, J. Bruce Fields, Tejun Heo, linux-fsdevel,
	linux-kernel, Ingo Molnar, Andi Kleen, Dave Chinner,
	Scott J Norton, Douglas Hatch

On Wed, Feb 17, 2016 at 12:41:53PM -0500, Waiman Long wrote:
> On 02/17/2016 12:18 PM, Peter Zijlstra wrote:
> >On Wed, Feb 17, 2016 at 12:12:57PM -0500, Waiman Long wrote:
> >>On 02/17/2016 11:27 AM, Christoph Lameter wrote:
> >>>On Wed, 17 Feb 2016, Waiman Long wrote:
> >>>
> >>>>I know we can use RCU for singly linked list, but I don't think we can use
> >>>>that for doubly linked list as there is no easy way to make atomic changes to
> >>>>both prev and next pointers simultaneously unless you are taking about 16b
> >>>>cmpxchg which is only supported in some architecture.
> >>>But its supported in the most important architecutes. You can fall back to
> >>>spinlocks on the ones that do not support it.
> >>>
> >>I guess with some limitations on how the lists can be traversed, we may be
> >>able to do that with RCU without lock. However, that will make the code more
> >>complex and harder to verify. Given that in both my and Dave's testing that
> >>contentions with list insertion and deletion are almost gone from the perf
> >>profile when they used to be a bottleneck, is it really worth the effort to
> >>do such a conversion?
> >My initial concern was the preempt disable delay introduced by holding
> >the spinlock over the entire iteration.
> >
> >There is no saying how many elements are on that list and there is no
> >lock break.
> 
> But preempt_disable() is called at the beginning of the spin_lock() call. So
> the additional preempt_disable() in percpu_list_add() is just to cover the
> this_cpu_ptr() call to make sure that the cpu number doesn't change. So we
> are talking about a few ns at most here.
> 

I'm talking about the list iteration, there is no preempt_disable() in
there, just the spin_lock, which you hold over the entire list, which
can be many, many element.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 1/2] lib/percpu-list: Per-cpu list with associated per-cpu locks
  2016-02-17 18:22                   ` Peter Zijlstra
@ 2016-02-17 18:45                     ` Waiman Long
  2016-02-17 19:39                       ` Peter Zijlstra
  0 siblings, 1 reply; 27+ messages in thread
From: Waiman Long @ 2016-02-17 18:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Christoph Lameter, Dave Chinner, Alexander Viro, Jan Kara,
	Jeff Layton, J. Bruce Fields, Tejun Heo, linux-fsdevel,
	linux-kernel, Ingo Molnar, Andi Kleen, Dave Chinner,
	Scott J Norton, Douglas Hatch

On 02/17/2016 01:22 PM, Peter Zijlstra wrote:
> On Wed, Feb 17, 2016 at 12:41:53PM -0500, Waiman Long wrote:
>> On 02/17/2016 12:18 PM, Peter Zijlstra wrote:
>>> On Wed, Feb 17, 2016 at 12:12:57PM -0500, Waiman Long wrote:
>>>> On 02/17/2016 11:27 AM, Christoph Lameter wrote:
>>>>> On Wed, 17 Feb 2016, Waiman Long wrote:
>>>>>
>>>>>> I know we can use RCU for singly linked list, but I don't think we can use
>>>>>> that for doubly linked list as there is no easy way to make atomic changes to
>>>>>> both prev and next pointers simultaneously unless you are taking about 16b
>>>>>> cmpxchg which is only supported in some architecture.
>>>>> But its supported in the most important architecutes. You can fall back to
>>>>> spinlocks on the ones that do not support it.
>>>>>
>>>> I guess with some limitations on how the lists can be traversed, we may be
>>>> able to do that with RCU without lock. However, that will make the code more
>>>> complex and harder to verify. Given that in both my and Dave's testing that
>>>> contentions with list insertion and deletion are almost gone from the perf
>>>> profile when they used to be a bottleneck, is it really worth the effort to
>>>> do such a conversion?
>>> My initial concern was the preempt disable delay introduced by holding
>>> the spinlock over the entire iteration.
>>>
>>> There is no saying how many elements are on that list and there is no
>>> lock break.
>> But preempt_disable() is called at the beginning of the spin_lock() call. So
>> the additional preempt_disable() in percpu_list_add() is just to cover the
>> this_cpu_ptr() call to make sure that the cpu number doesn't change. So we
>> are talking about a few ns at most here.
>>
> I'm talking about the list iteration, there is no preempt_disable() in
> there, just the spin_lock, which you hold over the entire list, which
> can be many, many element.

Sorry for the misunderstanding.

The original code has one global lock and one single list that covers 
all the inodes in the filesystem. This patch essentially breaks it up 
into multiple smaller lists with one lock for each. So the lock hold 
time should have been greatly reduced unless we are unfortunately enough 
that most of the inodes are in one single list.

If lock hold time is a concern, I think in some cases we can set the an 
upper limit on how many inodes we want to process, release the lock, 
reacquire it and continue. I am just worry that using RCU and 16b 
cmpxchg will introduce too much complexity with no performance gain to show.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 1/2] lib/percpu-list: Per-cpu list with associated per-cpu locks
  2016-02-17 18:45                     ` Waiman Long
@ 2016-02-17 19:39                       ` Peter Zijlstra
  0 siblings, 0 replies; 27+ messages in thread
From: Peter Zijlstra @ 2016-02-17 19:39 UTC (permalink / raw)
  To: Waiman Long
  Cc: Christoph Lameter, Dave Chinner, Alexander Viro, Jan Kara,
	Jeff Layton, J. Bruce Fields, Tejun Heo, linux-fsdevel,
	linux-kernel, Ingo Molnar, Andi Kleen, Dave Chinner,
	Scott J Norton, Douglas Hatch

On Wed, Feb 17, 2016 at 01:45:35PM -0500, Waiman Long wrote:
> The original code has one global lock and one single list that covers all
> the inodes in the filesystem. This patch essentially breaks it up into
> multiple smaller lists with one lock for each. So the lock hold time should
> have been greatly reduced unless we are unfortunately enough that most of
> the inodes are in one single list.

Most of the inode code has lock breaks in, but in general you cannot do
that.

The more I look at that inode code, the more I think you want an inode
specific visitor interface and not bother provide something generic.

iterate_bdevs(), drop_pagecache_sb(), wait_sb_inodes(), add_dquot_ref()
all have the same pattern. And maybe fsnotify_unmount_inodes() can be
man-handled into the same form.

Afaict only invalidate_inodes() really doesn't do a lock-break, but its
very similar in form to evict_inodes() which does.


> If lock hold time is a concern, I think in some cases we can set the an
> upper limit on how many inodes we want to process, release the lock,
> reacquire it and continue. I am just worry that using RCU and 16b cmpxchg
> will introduce too much complexity with no performance gain to show.

You don't actually need cmpxchg16b in order to use RCU. But given the
users of this, you don't actually need RCU either.

Just don't try and provide a for_each_list_entry() like construct.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/2] vfs: Use per-cpu list for SB's s_inodes list
  2016-02-17  1:31 [RFC PATCH 0/2] vfs: Use per-cpu list for SB's s_inodes list Waiman Long
  2016-02-17  1:31 ` [RFC PATCH 1/2] lib/percpu-list: Per-cpu list with associated per-cpu locks Waiman Long
  2016-02-17  1:31 ` [RRC PATCH 2/2] vfs: Use per-cpu list for superblock's inode list Waiman Long
@ 2016-02-18 23:58 ` Dave Chinner
  2016-02-19 21:04   ` Long, Wai Man
  2 siblings, 1 reply; 27+ messages in thread
From: Dave Chinner @ 2016-02-18 23:58 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Scott J Norton, Douglas Hatch

On Tue, Feb 16, 2016 at 08:31:18PM -0500, Waiman Long wrote:
> This patch is a replacement of my previous list batching patch -
> https://lwn.net/Articles/674105/. Compared with the previous patch,
> this one provides better performance and fairness. However, it also
> requires a bit more changes in the VFS layer.
> 
> This patchset is a derivative of Andi Kleen's patch on "Initial per
> cpu list for the per sb inode list"
> 
> https://git.kernel.org/cgit/linux/kernel/git/ak/linux-misc.git/commit/?h=hle315/combined&id=f1cf9e715a40f44086662ae3b29f123cf059cbf4
> 
> Patch 1 introduces the per-cpu list.
> 
> Patch 2 modifies the superblock and inode structures to use the per-cpu
> list. The corresponding functions that reference those structures are
> modified.
> 
> Waiman Long (2):
>   lib/percpu-list: Per-cpu list with associated per-cpu locks
>   vfs: Use per-cpu list for superblock's inode list

xfstests:xfs/013 deadlocks (running on 4GB ram disks on a 16p VM):

[135478.644495] run fstests generic/013 at 2016-02-19 10:54:51
[135479.058833] XFS (ram0): Unmounting Filesystem
[135479.149472] XFS (ram0): Mounting V5 Filesystem
[135479.154056] XFS (ram0): Ending clean mount
[135479.461571] XFS (ram0): Unmounting Filesystem
[135479.548060] XFS (ram0): Mounting V5 Filesystem
[135479.553103] XFS (ram0): Ending clean mount
[135507.633377] NMI watchdog: BUG: soft lockup - CPU#4 stuck for 23s! [fsstress:3390]
[135507.634572] Modules linked in:
[135507.635059] CPU: 4 PID: 3390 Comm: fsstress Not tainted 4.5.0-rc2-dgc+ #683
[135507.636108] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
[135507.637442] task: ffff88023a420000 ti: ffff88023758c000 task.ti: ffff88023758c000
[135507.638566] RIP: 0010:[<ffffffff810f4982>]  [<ffffffff810f4982>] do_raw_spin_lock+0x52/0x120
[135507.639863] RSP: 0018:ffff88023758fe10  EFLAGS: 00000246
[135507.640669] RAX: 0000000000000000 RBX: ffff880306e05ca8 RCX: 0000000000000024
[135507.641751] RDX: 0000000000000001 RSI: 00010f15431142d8 RDI: ffff880306e05ca8
[135507.642825] RBP: ffff88023758fe30 R08: 0000000000000004 R09: 0000000000000007
[135507.643897] R10: 000000000002e400 R11: 0000000000000001 R12: ffffe8fefbd01800
[135507.644966] R13: ffff880306e05db8 R14: ffff880306e05ca8 R15: ffff880306e05c20
[135507.646051] FS:  00007f3f4244f700(0000) GS:ffff88013bc80000(0000) knlGS:0000000000000000
[135507.647269] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[135507.648140] CR2: 00007f3f415cb008 CR3: 000000018c9ab000 CR4: 00000000000006e0
[135507.649232] Stack:
[135507.649559]  ffff880306e05c20 ffffe8fefbd01800 ffff880306e05db8 ffff880306e05ca8
[135507.650740]  ffff88023758fe40 ffffffff81e01d75 ffff88023758fee8 ffffffff8121727f
[135507.651924]  ffff8802b2fb76b8 ffff8802b2fb7000 0000000a00000246 ffffe8fefbd01810
[135507.653108] Call Trace:
[135507.653504]  [<ffffffff81e01d75>] _raw_spin_lock+0x15/0x20
[135507.654343]  [<ffffffff8121727f>] sync_inodes_sb+0x1af/0x280
[135507.655204]  [<ffffffff8121d8f0>] ? SyS_tee+0x3d0/0x3d0
[135507.656003]  [<ffffffff8121d905>] sync_inodes_one_sb+0x15/0x20
[135507.656891]  [<ffffffff811efede>] iterate_supers+0xae/0x100
[135507.657748]  [<ffffffff8121dc35>] sys_sync+0x35/0x90
[135507.658507]  [<ffffffff81e0232e>] entry_SYSCALL_64_fastpath+0x12/0x71
[135507.659483] Code: 00 00 48 39 43 10 0f 84 b7 00 00 00 65 8b 05 de 57 f1 7e 39 43 08 0f 84 bb 00 00 00 8b 03 85 c0 75 0d ba 01 00 00 00 f0 0f b1 13 <85> c0 74 63 4c
[135507.733374] NMI watchdog: BUG: soft lockup - CPU#10 stuck for 22s! [fsstress:3380]
[135507.734665] Modules linked in:
[135507.735195] CPU: 10 PID: 3380 Comm: fsstress Tainted: G             L  4.5.0-rc2-dgc+ #683
[135507.736525] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
[135507.737966] task: ffff8803814a23c0 ti: ffff880239264000 task.ti: ffff880239264000
[135507.739176] RIP: 0010:[<ffffffff817da7b9>]  [<ffffffff817da7b9>] delay_tsc+0x39/0x80
[135507.740457] RSP: 0018:ffff880239267a00  EFLAGS: 00000206
[135507.741338] RAX: 00010f23910bf4f8 RBX: ffffe8fefbd01810 RCX: 0000000000000024
[135507.742523] RDX: 00010f2300000000 RSI: 00010f23910bf4d4 RDI: 0000000000000001
[135507.743708] RBP: ffff880239267a00 R08: 000000000000000a R09: ffff880269028460
[135507.744896] R10: 00000000c383dd1a R11: 00000000001f07e0 R12: 00000000334293fd
[135507.746087] R13: 0000000000000001 R14: 0000000083214e30 R15: ffff880265b51900
[135507.747271] FS:  00007f3f4244f700(0000) GS:ffff88033bd00000(0000) knlGS:0000000000000000
[135507.748604] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[135507.749568] CR2: 00007f3f4159d008 CR3: 0000000143196000 CR4: 00000000000006e0
[135507.750753] Stack:
[135507.751111]  ffff880239267a10 ffffffff817da6bf ffff880239267a40 ffffffff810f49bc
[135507.752400]  000060fbc0001800 ffffe8fefbd01810 ffff880306e07058 0000000000000000
[135507.753697]  ffff880239267a50 ffffffff81e01d75 ffff880239267a78 ffffffff817e529a
[135507.754991] Call Trace:
[135507.755424]  [<ffffffff817da6bf>] __delay+0xf/0x20
[135507.756234]  [<ffffffff810f49bc>] do_raw_spin_lock+0x8c/0x120
[135507.757204]  [<ffffffff81e01d75>] _raw_spin_lock+0x15/0x20
[135507.758130]  [<ffffffff817e529a>] percpu_list_add+0x2a/0x70
[135507.759064]  [<ffffffff81206830>] inode_sb_list_add+0x20/0x30
[135507.760025]  [<ffffffff814f9744>] xfs_setup_inode+0x34/0x240
[135507.760972]  [<ffffffff814fb22b>] xfs_ialloc+0x36b/0x550
[135507.761874]  [<ffffffff814fb49d>] xfs_dir_ialloc+0x8d/0x260
[135507.762805]  [<ffffffff814fb92f>] xfs_create+0x25f/0x6a0
[135507.763700]  [<ffffffff814f881d>] xfs_generic_create+0xcd/0x2a0
[135507.764686]  [<ffffffff81e01dce>] ? _raw_spin_unlock+0xe/0x20
[135507.765652]  [<ffffffff814f8a26>] xfs_vn_create+0x16/0x20
[135507.766556]  [<ffffffff811f8ff2>] vfs_create+0xc2/0x120
[135507.767430]  [<ffffffff811fb6f9>] path_openat+0x1239/0x1370
[135507.768364]  [<ffffffff812b973c>] ? ext4_file_write_iter+0x21c/0x420
[135507.769430]  [<ffffffff810d2a89>] ? __might_sleep+0x49/0x80
[135507.770362]  [<ffffffff811fc8ee>] do_filp_open+0x7e/0xe0
[135507.771256]  [<ffffffff811e41f2>] ? kmem_cache_alloc+0x42/0x170
[135507.772250]  [<ffffffff81e01dce>] ? _raw_spin_unlock+0xe/0x20
[135507.773225]  [<ffffffff8120a2dc>] ? __alloc_fd+0xbc/0x170
[135507.774135]  [<ffffffff811eb2e6>] do_sys_open+0x116/0x1f0
[135507.775043]  [<ffffffff811eb41e>] SyS_creat+0x1e/0x20
[135507.775895]  [<ffffffff81e0232e>] entry_SYSCALL_64_fastpath+0x12/0x71

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/2] vfs: Use per-cpu list for SB's s_inodes list
  2016-02-18 23:58 ` [RFC PATCH 0/2] vfs: Use per-cpu list for SB's s_inodes list Dave Chinner
@ 2016-02-19 21:04   ` Long, Wai Man
  0 siblings, 0 replies; 27+ messages in thread
From: Long, Wai Man @ 2016-02-19 21:04 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Scott J Norton, Douglas Hatch

On 02/18/2016 06:58 PM, Dave Chinner wrote:
> On Tue, Feb 16, 2016 at 08:31:18PM -0500, Waiman Long wrote:
>> This patch is a replacement of my previous list batching patch -
>> https://lwn.net/Articles/674105/. Compared with the previous patch,
>> this one provides better performance and fairness. However, it also
>> requires a bit more changes in the VFS layer.
>>
>> This patchset is a derivative of Andi Kleen's patch on "Initial per
>> cpu list for the per sb inode list"
>>
>> https://git.kernel.org/cgit/linux/kernel/git/ak/linux-misc.git/commit/?h=hle315/combined&id=f1cf9e715a40f44086662ae3b29f123cf059cbf4
>>
>> Patch 1 introduces the per-cpu list.
>>
>> Patch 2 modifies the superblock and inode structures to use the per-cpu
>> list. The corresponding functions that reference those structures are
>> modified.
>>
>> Waiman Long (2):
>>    lib/percpu-list: Per-cpu list with associated per-cpu locks
>>    vfs: Use per-cpu list for superblock's inode list
> xfstests:xfs/013 deadlocks (running on 4GB ram disks on a 16p VM):
>
> [135478.644495] run fstests generic/013 at 2016-02-19 10:54:51
> [135479.058833] XFS (ram0): Unmounting Filesystem
> [135479.149472] XFS (ram0): Mounting V5 Filesystem
> [135479.154056] XFS (ram0): Ending clean mount
> [135479.461571] XFS (ram0): Unmounting Filesystem
> [135479.548060] XFS (ram0): Mounting V5 Filesystem
> [135479.553103] XFS (ram0): Ending clean mount
> [135507.633377] NMI watchdog: BUG: soft lockup - CPU#4 stuck for 23s! [fsstress:3390]
> [135507.634572] Modules linked in:
> [135507.635059] CPU: 4 PID: 3390 Comm: fsstress Not tainted 4.5.0-rc2-dgc+ #683
> [135507.636108] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
> [135507.637442] task: ffff88023a420000 ti: ffff88023758c000 task.ti: ffff88023758c000
> [135507.638566] RIP: 0010:[<ffffffff810f4982>]  [<ffffffff810f4982>] do_raw_spin_lock+0x52/0x120
> [135507.639863] RSP: 0018:ffff88023758fe10  EFLAGS: 00000246
> [135507.640669] RAX: 0000000000000000 RBX: ffff880306e05ca8 RCX: 0000000000000024
> [135507.641751] RDX: 0000000000000001 RSI: 00010f15431142d8 RDI: ffff880306e05ca8
> [135507.642825] RBP: ffff88023758fe30 R08: 0000000000000004 R09: 0000000000000007
> [135507.643897] R10: 000000000002e400 R11: 0000000000000001 R12: ffffe8fefbd01800
> [135507.644966] R13: ffff880306e05db8 R14: ffff880306e05ca8 R15: ffff880306e05c20
> [135507.646051] FS:  00007f3f4244f700(0000) GS:ffff88013bc80000(0000) knlGS:0000000000000000
> [135507.647269] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [135507.648140] CR2: 00007f3f415cb008 CR3: 000000018c9ab000 CR4: 00000000000006e0
> [135507.649232] Stack:
> [135507.649559]  ffff880306e05c20 ffffe8fefbd01800 ffff880306e05db8 ffff880306e05ca8
> [135507.650740]  ffff88023758fe40 ffffffff81e01d75 ffff88023758fee8 ffffffff8121727f
> [135507.651924]  ffff8802b2fb76b8 ffff8802b2fb7000 0000000a00000246 ffffe8fefbd01810
> [135507.653108] Call Trace:
> [135507.653504]  [<ffffffff81e01d75>] _raw_spin_lock+0x15/0x20
> [135507.654343]  [<ffffffff8121727f>] sync_inodes_sb+0x1af/0x280
> [135507.655204]  [<ffffffff8121d8f0>] ? SyS_tee+0x3d0/0x3d0
> [135507.656003]  [<ffffffff8121d905>] sync_inodes_one_sb+0x15/0x20
> [135507.656891]  [<ffffffff811efede>] iterate_supers+0xae/0x100
> [135507.657748]  [<ffffffff8121dc35>] sys_sync+0x35/0x90
> [135507.658507]  [<ffffffff81e0232e>] entry_SYSCALL_64_fastpath+0x12/0x71
> [135507.659483] Code: 00 00 48 39 43 10 0f 84 b7 00 00 00 65 8b 05 de 57 f1 7e 39 43 08 0f 84 bb 00 00 00 8b 03 85 c0 75 0d ba 01 00 00 00 f0 0f b1 13<85>  c0 74 63 4c
> [135507.733374] NMI watchdog: BUG: soft lockup - CPU#10 stuck for 22s! [fsstress:3380]
> [135507.734665] Modules linked in:
> [135507.735195] CPU: 10 PID: 3380 Comm: fsstress Tainted: G             L  4.5.0-rc2-dgc+ #683
> [135507.736525] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
> [135507.737966] task: ffff8803814a23c0 ti: ffff880239264000 task.ti: ffff880239264000
> [135507.739176] RIP: 0010:[<ffffffff817da7b9>]  [<ffffffff817da7b9>] delay_tsc+0x39/0x80
> [135507.740457] RSP: 0018:ffff880239267a00  EFLAGS: 00000206
> [135507.741338] RAX: 00010f23910bf4f8 RBX: ffffe8fefbd01810 RCX: 0000000000000024
> [135507.742523] RDX: 00010f2300000000 RSI: 00010f23910bf4d4 RDI: 0000000000000001
> [135507.743708] RBP: ffff880239267a00 R08: 000000000000000a R09: ffff880269028460
> [135507.744896] R10: 00000000c383dd1a R11: 00000000001f07e0 R12: 00000000334293fd
> [135507.746087] R13: 0000000000000001 R14: 0000000083214e30 R15: ffff880265b51900
> [135507.747271] FS:  00007f3f4244f700(0000) GS:ffff88033bd00000(0000) knlGS:0000000000000000
> [135507.748604] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [135507.749568] CR2: 00007f3f4159d008 CR3: 0000000143196000 CR4: 00000000000006e0
> [135507.750753] Stack:
> [135507.751111]  ffff880239267a10 ffffffff817da6bf ffff880239267a40 ffffffff810f49bc
> [135507.752400]  000060fbc0001800 ffffe8fefbd01810 ffff880306e07058 0000000000000000
> [135507.753697]  ffff880239267a50 ffffffff81e01d75 ffff880239267a78 ffffffff817e529a
> [135507.754991] Call Trace:
> [135507.755424]  [<ffffffff817da6bf>] __delay+0xf/0x20
> [135507.756234]  [<ffffffff810f49bc>] do_raw_spin_lock+0x8c/0x120
> [135507.757204]  [<ffffffff81e01d75>] _raw_spin_lock+0x15/0x20
> [135507.758130]  [<ffffffff817e529a>] percpu_list_add+0x2a/0x70
> [135507.759064]  [<ffffffff81206830>] inode_sb_list_add+0x20/0x30
> [135507.760025]  [<ffffffff814f9744>] xfs_setup_inode+0x34/0x240
> [135507.760972]  [<ffffffff814fb22b>] xfs_ialloc+0x36b/0x550
> [135507.761874]  [<ffffffff814fb49d>] xfs_dir_ialloc+0x8d/0x260
> [135507.762805]  [<ffffffff814fb92f>] xfs_create+0x25f/0x6a0
> [135507.763700]  [<ffffffff814f881d>] xfs_generic_create+0xcd/0x2a0
> [135507.764686]  [<ffffffff81e01dce>] ? _raw_spin_unlock+0xe/0x20
> [135507.765652]  [<ffffffff814f8a26>] xfs_vn_create+0x16/0x20
> [135507.766556]  [<ffffffff811f8ff2>] vfs_create+0xc2/0x120
> [135507.767430]  [<ffffffff811fb6f9>] path_openat+0x1239/0x1370
> [135507.768364]  [<ffffffff812b973c>] ? ext4_file_write_iter+0x21c/0x420
> [135507.769430]  [<ffffffff810d2a89>] ? __might_sleep+0x49/0x80
> [135507.770362]  [<ffffffff811fc8ee>] do_filp_open+0x7e/0xe0
> [135507.771256]  [<ffffffff811e41f2>] ? kmem_cache_alloc+0x42/0x170
> [135507.772250]  [<ffffffff81e01dce>] ? _raw_spin_unlock+0xe/0x20
> [135507.773225]  [<ffffffff8120a2dc>] ? __alloc_fd+0xbc/0x170
> [135507.774135]  [<ffffffff811eb2e6>] do_sys_open+0x116/0x1f0
> [135507.775043]  [<ffffffff811eb41e>] SyS_creat+0x1e/0x20
> [135507.775895]  [<ffffffff81e0232e>] entry_SYSCALL_64_fastpath+0x12/0x71
>
> Cheers,
>
> Dave.

Thanks for reporting this problem.

PeterZ had actually noticed an issue in my patch that I used 
list_for_each_entry_safe() for all the modified s_inodes iteration 
functions which originally used list_for_each_entry(). That may not be 
safe in some cases. The hangup you saw may be caused by this problem. I 
am going to send out an updated patch to use the correct 
list_for_each_entry() macro for those iteration functions. Please try 
that out to see if you see the same hangup problem again.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2016-02-19 21:04 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-02-17  1:31 [RFC PATCH 0/2] vfs: Use per-cpu list for SB's s_inodes list Waiman Long
2016-02-17  1:31 ` [RFC PATCH 1/2] lib/percpu-list: Per-cpu list with associated per-cpu locks Waiman Long
2016-02-17  9:53   ` Dave Chinner
2016-02-17 11:00     ` Peter Zijlstra
2016-02-17 11:05       ` Peter Zijlstra
2016-02-17 16:16         ` Waiman Long
2016-02-17 16:22           ` Peter Zijlstra
2016-02-17 16:27           ` Christoph Lameter
2016-02-17 17:12             ` Waiman Long
2016-02-17 17:18               ` Peter Zijlstra
2016-02-17 17:41                 ` Waiman Long
2016-02-17 18:22                   ` Peter Zijlstra
2016-02-17 18:45                     ` Waiman Long
2016-02-17 19:39                       ` Peter Zijlstra
2016-02-17 11:10       ` Dave Chinner
2016-02-17 11:26         ` Peter Zijlstra
2016-02-17 11:36           ` Peter Zijlstra
2016-02-17 15:56     ` Waiman Long
2016-02-17 16:02       ` Peter Zijlstra
2016-02-17 15:13   ` Christoph Lameter
2016-02-17  1:31 ` [RRC PATCH 2/2] vfs: Use per-cpu list for superblock's inode list Waiman Long
2016-02-17  7:16   ` Ingo Molnar
2016-02-17 15:40     ` Waiman Long
2016-02-17 10:37   ` Dave Chinner
2016-02-17 16:08     ` Waiman Long
2016-02-18 23:58 ` [RFC PATCH 0/2] vfs: Use per-cpu list for SB's s_inodes list Dave Chinner
2016-02-19 21:04   ` Long, Wai Man

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).