* [PATCH v3 0/3] vfs: Use per-cpu list for SB's s_inodes list
@ 2016-02-23 19:04 Waiman Long
  2016-02-23 19:04 ` [PATCH v3 1/3] lib/percpu-list: Per-cpu list with associated per-cpu locks Waiman Long
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Waiman Long @ 2016-02-23 19:04 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Scott J Norton, Douglas Hatch,
	Waiman Long

v2->v3:
 - Directly replace list_for_each_entry() and
   list_for_each_entry_safe() with pcpu_list_iterate() and
   pcpu_list_iterate_safe() respectively. These two functions
   provide a stateful per-cpu list iteration interface.
 - Include Jan Kara's patch to clean up the fsnotify_unmount_inodes()
   function.

v1->v2:
 - Use separate structures for list head and nodes & provide a
   cleaner interface.
 - Use the existing list_for_each_entry() or list_for_each_entry_safe()
   macros for each of the sb's s_inodes iteration functions instead
   of using list_for_each_entry_safe() for all of them, which may not
   be safe in some cases.
 - Use an iterator interface to access all the nodes of a group of
   per-cpu lists. This approach is cleaner than the previous double-for
   macro, which was somewhat hacky, but it does require changing more
   lines of code.
 - Add a preparatory patch 2 to extract the per-inode code from
   the superblock s_inodes list iteration functions, minimizing the
   code changes needed in patch 3.

This patchset is a replacement for my previous list batching patch -
https://lwn.net/Articles/674105/. Compared with the previous patch,
this one provides better performance and fairness, but it also
requires somewhat more extensive changes in the VFS layer.

This patchset is a derivative of Andi Kleen's patch on "Initial per
cpu list for the per sb inode list"

https://git.kernel.org/cgit/linux/kernel/git/ak/linux-misc.git/commit/?h=hle315/combined&id=f1cf9e715a40f44086662ae3b29f123cf059cbf4

Patch 1 introduces the per-cpu list.

Patch 2 cleans up the fsnotify_unmount_inodes() function by making
the code simpler and more standard.

Patch 3 modifies the superblock and inode structures to use the per-cpu
list. The corresponding functions that reference those structures
are modified.

Jan Kara (1):
  fsnotify: Simplify inode iteration on umount

Waiman Long (2):
  lib/percpu-list: Per-cpu list with associated per-cpu locks
  vfs: Use per-cpu list for superblock's inode list

 fs/block_dev.c              |   13 ++-
 fs/drop_caches.c            |   10 +-
 fs/fs-writeback.c           |   13 ++-
 fs/inode.c                  |   40 +++----
 fs/notify/inode_mark.c      |   53 +++--------
 fs/quota/dquot.c            |   16 ++--
 fs/super.c                  |    7 +-
 include/linux/fs.h          |    8 +-
 include/linux/percpu-list.h |  235 +++++++++++++++++++++++++++++++++++++++++++
 lib/Makefile                |    2 +-
 lib/percpu-list.c           |   85 ++++++++++++++++
 11 files changed, 386 insertions(+), 96 deletions(-)
 create mode 100644 include/linux/percpu-list.h
 create mode 100644 lib/percpu-list.c


* [PATCH v3 1/3] lib/percpu-list: Per-cpu list with associated per-cpu locks
  2016-02-23 19:04 [PATCH v3 0/3] vfs: Use per-cpu list for SB's s_inodes list Waiman Long
@ 2016-02-23 19:04 ` Waiman Long
  2016-02-24  2:00   ` Boqun Feng
  2016-02-24  7:56   ` Jan Kara
  2016-02-23 19:04 ` [PATCH v3 2/3] fsnotify: Simplify inode iteration on umount Waiman Long
  2016-02-23 19:04 ` [PATCH v3 3/3] vfs: Use per-cpu list for superblock's inode list Waiman Long
  2 siblings, 2 replies; 15+ messages in thread
From: Waiman Long @ 2016-02-23 19:04 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Scott J Norton, Douglas Hatch,
	Waiman Long

Linked lists are used everywhere in the Linux kernel. However, if many
threads are trying to add or delete entries in the same linked list,
the list can become a performance bottleneck.

This patch introduces a new per-cpu list subsystem with associated
per-cpu locks that protect each of the lists individually. This
allows list entry insertion and deletion operations to happen in
parallel instead of being serialized by a global list and lock.

List entry insertion is strictly per-cpu. List deletion, however, can
happen on a cpu other than the one that did the insertion, so we still
need a lock to protect the list. Because of that, there may still be
a small amount of contention when deletions are being done.

A new header file include/linux/percpu-list.h will be added with the
associated pcpu_list_head and pcpu_list_node structures. The following
functions are provided to manage the per-cpu list:

 1. int init_pcpu_list_head(struct pcpu_list_head **ppcpu_head)
 2. void pcpu_list_add(struct pcpu_list_node *node,
		       struct pcpu_list_head *head)
 3. void pcpu_list_del(struct pcpu_list_node *node)
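
As an illustration only (struct foo, foo_list and the simplified error
handling below are hypothetical and not part of this patch), a minimal
usage sketch of these three functions would look like:

	struct foo {
		int data;
		struct pcpu_list_node pcpu_node; /* links this foo into a per-cpu list */
	};

	static struct pcpu_list_head __percpu *foo_list;

	/* one-time setup, e.g. at module init */
	if (init_pcpu_list_head(&foo_list))
		return -ENOMEM;

	/* insertion goes to the list of the cpu doing the add */
	init_pcpu_list_node(&f->pcpu_node);	/* f is an allocated struct foo */
	pcpu_list_add(&f->pcpu_node, foo_list);

	/* deletion may run on any cpu; the node remembers which lock to take */
	pcpu_list_del(&f->pcpu_node);

	/* teardown */
	free_pcpu_list_head(&foo_list);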

Iteration of all the list entries within a group of per-cpu
lists is done by calling either the pcpu_list_iterate() or
pcpu_list_iterate_safe() function in a while loop. They correspond
to the list_for_each_entry() and list_for_each_entry_safe() macros
respectively. The iteration state is kept in a pcpu_list_state
structure that is passed to the iteration functions.
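
For example (reusing the hypothetical struct foo above, with
do_something() as a placeholder), a caller would walk every entry
roughly as follows:

	DEFINE_PCPU_LIST_STATE(state);
	struct foo *f;

	while (pcpu_list_iterate(foo_list, &state)) {
		/* state.lock is held for the per-cpu list currently being walked */
		f = list_entry(state.curr, struct foo, pcpu_node);
		do_something(f);
	}
	/* once the loop terminates, no per-cpu lock is held */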

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 include/linux/percpu-list.h |  235 +++++++++++++++++++++++++++++++++++++++++++
 lib/Makefile                |    2 +-
 lib/percpu-list.c           |   85 ++++++++++++++++
 3 files changed, 321 insertions(+), 1 deletions(-)
 create mode 100644 include/linux/percpu-list.h
 create mode 100644 lib/percpu-list.c

diff --git a/include/linux/percpu-list.h b/include/linux/percpu-list.h
new file mode 100644
index 0000000..8759fec
--- /dev/null
+++ b/include/linux/percpu-list.h
@@ -0,0 +1,235 @@
+/*
+ * Per-cpu list
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2016 Hewlett-Packard Enterprise Development LP
+ *
+ * Authors: Waiman Long <waiman.long@hpe.com>
+ */
+#ifndef __LINUX_PERCPU_LIST_H
+#define __LINUX_PERCPU_LIST_H
+
+#include <linux/spinlock.h>
+#include <linux/list.h>
+#include <linux/percpu.h>
+
+/*
+ * include/linux/percpu-list.h
+ *
+ * A per-cpu list protected by a per-cpu spinlock.
+ *
+ * The pcpu_list_head structure contains the spinlock; the
+ * pcpu_list_node structures only contain a pointer to the spinlock in
+ * pcpu_list_head.
+ */
+struct pcpu_list_head {
+	struct list_head list;
+	spinlock_t lock;
+};
+
+struct pcpu_list_node {
+	struct list_head list;
+	spinlock_t *lockptr;
+};
+
+/*
+ * Per-cpu list iteration state
+ */
+struct pcpu_list_state {
+	int			 cpu;
+	spinlock_t		*lock;
+	struct list_head	*head;	/* List head of current per-cpu list */
+	struct pcpu_list_node	*curr;
+	struct pcpu_list_node	*next;
+};
+
+#define PCPU_LIST_HEAD_INIT(name)				\
+	{							\
+		.list.prev = &name.list,			\
+		.list.next = &name.list,			\
+		.lock = __SPIN_LOCK_UNLOCKED(name),		\
+	}
+
+#define PCPU_LIST_NODE_INIT(name)		\
+	{					\
+		.list.prev = &name.list,	\
+		.list.next = &name.list,	\
+		.lockptr = NULL			\
+	}
+
+#define PCPU_LIST_STATE_INIT()			\
+	{					\
+		.cpu  = -1,			\
+		.lock = NULL,			\
+		.head = NULL,			\
+		.curr = NULL,			\
+		.next = NULL,			\
+	}
+
+#define DEFINE_PCPU_LIST_STATE(s)		\
+	struct pcpu_list_state s = PCPU_LIST_STATE_INIT()
+
+#define pcpu_list_next_entry(pos, member) list_next_entry(pos, member.list)
+
+static inline void init_pcpu_list_node(struct pcpu_list_node *node)
+{
+	INIT_LIST_HEAD(&node->list);
+	node->lockptr = NULL;
+}
+
+static inline void free_pcpu_list_head(struct pcpu_list_head **ppcpu_head)
+{
+	free_percpu(*ppcpu_head);
+	*ppcpu_head = NULL;
+}
+
+static inline void init_pcpu_list_state(struct pcpu_list_state *state)
+{
+	state->cpu  = -1;
+	state->lock = NULL;
+	state->head = NULL;
+	state->curr = NULL;
+	state->next = NULL;
+}
+
+#if NR_CPUS == 1
+/*
+ * For uniprocessor, the list head and lock in struct pcpu_list_head are
+ * used directly.
+ */
+static inline bool pcpu_list_empty(struct pcpu_list_head *pcpu_head)
+{
+	return list_empty(&pcpu_head->list);
+}
+
+static __always_inline bool
+__pcpu_list_next_cpu(struct pcpu_list_head *head, struct pcpu_list_state *state)
+{
+	if (state->lock)
+		spin_unlock(state->lock);
+
+	if (state->cpu++ >= 0)
+		return false;
+
+	state->head = &head->list;
+	if (list_empty(state->head))
+		return false;
+	state->curr = list_entry(state->head->next,
+				 struct pcpu_list_node, list);
+	state->lock = &head->lock;
+	spin_lock(state->lock);
+	return true;
+
+}
+#else /* NR_CPUS == 1 */
+/*
+ * Multiprocessor
+ */
+static inline bool pcpu_list_empty(struct pcpu_list_head *pcpu_head)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		if (!list_empty(&per_cpu_ptr(pcpu_head, cpu)->list))
+			return false;
+	return true;
+}
+
+/*
+ * Helper function to find the first entry of the next per-cpu list.
+ * It works somewhat like for_each_possible_cpu(cpu).
+ *
+ * Return: true if the entry is found, false if all the lists exhausted
+ */
+static __always_inline bool
+__pcpu_list_next_cpu(struct pcpu_list_head *head, struct pcpu_list_state *state)
+{
+	if (state->lock)
+		spin_unlock(state->lock);
+next_cpu:
+	/*
+	 * for_each_possible_cpu(cpu)
+	 */
+	state->cpu = cpumask_next(state->cpu, cpu_possible_mask);
+	if (state->cpu >= nr_cpu_ids)
+		return false;	/* All the per-cpu lists iterated */
+
+	state->head = &per_cpu_ptr(head, state->cpu)->list;
+	state->lock = &per_cpu_ptr(head, state->cpu)->lock;
+	state->curr = list_entry(state->head->next,
+				 struct pcpu_list_node, list);
+	if (&state->curr->list == state->head)
+		goto next_cpu;
+
+	spin_lock(state->lock);
+	return true;
+}
+#endif /* NR_CPUS == 1 */
+
+/*
+ * Iterate to the next entry of the group of per-cpu lists
+ *
+ * Return: true if the next entry is found, false if all the entries iterated
+ */
+static inline bool pcpu_list_iterate(struct pcpu_list_head *head,
+				     struct pcpu_list_state *state)
+{
+	/*
+	 * Find next entry
+	 */
+	if (state->curr)
+		state->curr = list_next_entry(state->curr, list);
+
+	if (!state->curr || (&state->curr->list == state->head)) {
+		/*
+		 * The current per-cpu list has been exhausted, try the next
+		 * per-cpu list.
+		 */
+		if (!__pcpu_list_next_cpu(head, state))
+			return false;
+	}
+	return true;	/* Continue the iteration */
+}
+
+/*
+ * Iterate to the next entry of the group of per-cpu lists and is safe
+ * against removal of the current entry.
+ *
+ * Return: true if the next entry is found, false if all the entries iterated
+ */
+static inline bool pcpu_list_iterate_safe(struct pcpu_list_head *head,
+					  struct pcpu_list_state *state)
+{
+	/*
+	 * Find next entry
+	 */
+	if (state->curr) {
+		state->curr = state->next;
+		state->next = list_next_entry(state->next, list);
+	}
+
+	if (!state->curr || (&state->curr->list == state->head)) {
+		/*
+		 * The current per-cpu list has been exhausted, try the next
+		 * per-cpu list.
+		 */
+		if (!__pcpu_list_next_cpu(head, state))
+			return false;
+		state->next = list_next_entry(state->curr, list);
+	}
+	return true;	/* Continue the iteration */
+}
+
+extern int  init_pcpu_list_head(struct pcpu_list_head **ppcpu_head);
+extern void pcpu_list_add(struct pcpu_list_node *node,
+			  struct pcpu_list_head *head);
+extern void pcpu_list_del(struct pcpu_list_node *node);
+
+#endif /* __LINUX_PERCPU_LIST_H */
diff --git a/lib/Makefile b/lib/Makefile
index a7c26a4..71a25d4 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -27,7 +27,7 @@ obj-y += bcd.o div64.o sort.o parser.o halfmd4.o debug_locks.o random32.o \
 	 gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \
 	 bsearch.o find_bit.o llist.o memweight.o kfifo.o \
 	 percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \
-	 once.o
+	 once.o percpu-list.o
 obj-y += string_helpers.o
 obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o
 obj-y += hexdump.o
diff --git a/lib/percpu-list.c b/lib/percpu-list.c
new file mode 100644
index 0000000..45bbb2a
--- /dev/null
+++ b/lib/percpu-list.c
@@ -0,0 +1,85 @@
+/*
+ * Per-cpu list
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2016 Hewlett-Packard Enterprise Development LP
+ *
+ * Authors: Waiman Long <waiman.long@hpe.com>
+ */
+#include <linux/percpu-list.h>
+
+/*
+ * Initialize the per-cpu list
+ */
+int init_pcpu_list_head(struct pcpu_list_head **ppcpu_head)
+{
+	struct pcpu_list_head *pcpu_head = alloc_percpu(struct pcpu_list_head);
+	int cpu;
+
+	if (!pcpu_head)
+		return -ENOMEM;
+
+	for_each_possible_cpu(cpu) {
+		struct pcpu_list_head *head = per_cpu_ptr(pcpu_head, cpu);
+
+		INIT_LIST_HEAD(&head->list);
+		head->lock = __SPIN_LOCK_UNLOCKED(&head->lock);
+	}
+
+	*ppcpu_head = pcpu_head;
+	return 0;
+}
+
+/*
+ * List selection is based on the CPU being used when the pcpu_list_add()
+ * function is called. However, deletion may be done by a different CPU.
+ * So we still need to use a lock to protect the content of the list.
+ */
+void pcpu_list_add(struct pcpu_list_node *node, struct pcpu_list_head *head)
+{
+	spinlock_t *lock;
+
+	/*
+	 * There is a very slight chance the cpu will be changed
+	 * (by preemption) before calling spin_lock(). We only need to put
+	 * the node in one of the per-cpu lists. It may not need to be
+	 * that of the current cpu.
+	 */
+	lock = this_cpu_ptr(&head->lock);
+	spin_lock(lock);
+	node->lockptr = lock;
+	list_add(&node->list, this_cpu_ptr(&head->list));
+	spin_unlock(lock);
+}
+
+/*
+ * Delete a node from a percpu list
+ *
+ * We need to check the lock pointer again after taking the lock to guard
+ * against concurrent delete of the same node. If the lock pointer changes
+ * (becomes NULL or points to a different lock), we assume that the deletion was done
+ * elsewhere.
+ */
+void pcpu_list_del(struct pcpu_list_node *node)
+{
+	spinlock_t *lock = READ_ONCE(node->lockptr);
+
+	if (unlikely(!lock))
+		return;
+
+	spin_lock(lock);
+	if (likely(lock == node->lockptr)) {
+		list_del_init(&node->list);
+		node->lockptr = NULL;
+	}
+	spin_unlock(lock);
+}
-- 
1.7.1


* [PATCH v3 2/3] fsnotify: Simplify inode iteration on umount
  2016-02-23 19:04 [PATCH v3 0/3] vfs: Use per-cpu list for SB's s_inodes list Waiman Long
  2016-02-23 19:04 ` [PATCH v3 1/3] lib/percpu-list: Per-cpu list with associated per-cpu locks Waiman Long
@ 2016-02-23 19:04 ` Waiman Long
  2016-02-23 19:04 ` [PATCH v3 3/3] vfs: Use per-cpu list for superblock's inode list Waiman Long
  2 siblings, 0 replies; 15+ messages in thread
From: Waiman Long @ 2016-02-23 19:04 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Scott J Norton, Douglas Hatch,
	Jan Kara, Waiman Long

From: Jan Kara <jack@suse.cz>

fsnotify_unmount_inodes() played complex tricks to pin the next inode in
the sb->s_inodes list when iterating over all inodes. If we switch to
keeping the current inode pinned somewhat longer, we can make the code
much simpler and more standard.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 fs/notify/inode_mark.c |   45 +++++++++------------------------------------
 1 files changed, 9 insertions(+), 36 deletions(-)

diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index 741077d..a364524 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -150,12 +150,10 @@ int fsnotify_add_inode_mark(struct fsnotify_mark *mark,
  */
 void fsnotify_unmount_inodes(struct super_block *sb)
 {
-	struct inode *inode, *next_i, *need_iput = NULL;
+	struct inode *inode, *iput_inode = NULL;
 
 	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next_i, &sb->s_inodes, i_sb_list) {
-		struct inode *need_iput_tmp;
-
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		/*
 		 * We cannot __iget() an inode in state I_FREEING,
 		 * I_WILL_FREE, or I_NEW which is fine because by that point
@@ -178,49 +176,24 @@ void fsnotify_unmount_inodes(struct super_block *sb)
 			continue;
 		}
 
-		need_iput_tmp = need_iput;
-		need_iput = NULL;
-
-		/* In case fsnotify_inode_delete() drops a reference. */
-		if (inode != need_iput_tmp)
-			__iget(inode);
-		else
-			need_iput_tmp = NULL;
+		__iget(inode);
 		spin_unlock(&inode->i_lock);
-
-		/* In case the dropping of a reference would nuke next_i. */
-		while (&next_i->i_sb_list != &sb->s_inodes) {
-			spin_lock(&next_i->i_lock);
-			if (!(next_i->i_state & (I_FREEING | I_WILL_FREE)) &&
-						atomic_read(&next_i->i_count)) {
-				__iget(next_i);
-				need_iput = next_i;
-				spin_unlock(&next_i->i_lock);
-				break;
-			}
-			spin_unlock(&next_i->i_lock);
-			next_i = list_next_entry(next_i, i_sb_list);
-		}
-
-		/*
-		 * We can safely drop s_inode_list_lock here because either
-		 * we actually hold references on both inode and next_i or
-		 * end of list.  Also no new inodes will be added since the
-		 * umount has begun.
-		 */
 		spin_unlock(&sb->s_inode_list_lock);
 
-		if (need_iput_tmp)
-			iput(need_iput_tmp);
+		if (iput_inode)
+			iput(iput_inode);
 
 		/* for each watch, send FS_UNMOUNT and then remove it */
 		fsnotify(inode, FS_UNMOUNT, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
 
 		fsnotify_inode_delete(inode);
 
-		iput(inode);
+		iput_inode = inode;
 
 		spin_lock(&sb->s_inode_list_lock);
 	}
 	spin_unlock(&sb->s_inode_list_lock);
+
+	if (iput_inode)
+		iput(iput_inode);
 }
-- 
1.7.1


* [PATCH v3 3/3] vfs: Use per-cpu list for superblock's inode list
  2016-02-23 19:04 [PATCH v3 0/3] vfs: Use per-cpu list for SB's s_inodes list Waiman Long
  2016-02-23 19:04 ` [PATCH v3 1/3] lib/percpu-list: Per-cpu list with associated per-cpu locks Waiman Long
  2016-02-23 19:04 ` [PATCH v3 2/3] fsnotify: Simplify inode iteration on umount Waiman Long
@ 2016-02-23 19:04 ` Waiman Long
  2016-02-24  8:28   ` Jan Kara
  2 siblings, 1 reply; 15+ messages in thread
From: Waiman Long @ 2016-02-23 19:04 UTC (permalink / raw)
  To: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter
  Cc: linux-fsdevel, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Scott J Norton, Douglas Hatch,
	Waiman Long

When many threads are trying to add inodes to or delete inodes from
a superblock's s_inodes list, spinlock contention on the list can
become a performance bottleneck.

This patch changes the s_inodes field into a per-cpu list with
per-cpu spinlocks. As a result, the following superblock inode list
(sb->s_inodes) iteration functions in the VFS are also modified:

 1. iterate_bdevs()
 2. drop_pagecache_sb()
 3. wait_sb_inodes()
 4. evict_inodes()
 5. invalidate_inodes()
 6. fsnotify_unmount_inodes()
 7. add_dquot_ref()
 8. remove_dquot_ref()
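
Each of these sites follows essentially the same conversion pattern,
sketched here with the per-inode details omitted:

	/* before */
	spin_lock(&sb->s_inode_list_lock);
	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
		...
	}
	spin_unlock(&sb->s_inode_list_lock);

	/* after */
	DEFINE_PCPU_LIST_STATE(state);

	while (pcpu_list_iterate(sb->s_inodes, &state)) {
		inode = list_entry(state.curr, struct inode, i_sb_list);
		...
	}

Functions that temporarily drop the list lock around blocking work now
release and re-acquire state.lock instead of the global
s_inode_list_lock.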

An exit microbenchmark creates a large number of threads, attaches
many inodes to them and then exits. The runtimes of that
microbenchmark with 1000 threads before and after the patch on a
4-socket Intel E7-4820 v3 system (40 cores, 80 threads) were as
follows:

  Kernel            Elapsed Time    System Time
  ------            ------------    -----------
  Vanilla 4.5-rc4      65.29s         82m14s
  Patched 4.5-rc4      22.81s         23m03s

Before the patch, spinlock contention in the inode_sb_list_add()
function during the startup phase and the inode_sb_list_del() function
during the exit phase accounted for about 79% and 93% of total CPU time
respectively (as measured by perf). After the patch, the pcpu_list_add()
function consumed only about 0.04% of CPU time during the startup phase
and the pcpu_list_del() function about 0.4% during the exit phase. There
was still some spinlock contention, but it happened elsewhere.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 fs/block_dev.c         |   13 +++++++------
 fs/drop_caches.c       |   10 +++++-----
 fs/fs-writeback.c      |   13 +++++++------
 fs/inode.c             |   40 +++++++++++++++++-----------------------
 fs/notify/inode_mark.c |   10 +++++-----
 fs/quota/dquot.c       |   16 ++++++++--------
 fs/super.c             |    7 ++++---
 include/linux/fs.h     |    8 ++++----
 8 files changed, 57 insertions(+), 60 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 39b3a17..759d9b6 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1865,11 +1865,13 @@ EXPORT_SYMBOL(__invalidate_device);
 void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
 {
 	struct inode *inode, *old_inode = NULL;
+	DEFINE_PCPU_LIST_STATE(state);
 
-	spin_lock(&blockdev_superblock->s_inode_list_lock);
-	list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
-		struct address_space *mapping = inode->i_mapping;
+	while (pcpu_list_iterate(blockdev_superblock->s_inodes, &state)) {
+		struct address_space *mapping;
 
+		inode   = list_entry(state.curr, struct inode, i_sb_list);
+		mapping = inode->i_mapping;
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
 		    mapping->nrpages == 0) {
@@ -1878,7 +1880,7 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&blockdev_superblock->s_inode_list_lock);
+		spin_unlock(state.lock);
 		/*
 		 * We hold a reference to 'inode' so it couldn't have been
 		 * removed from s_inodes list while we dropped the
@@ -1892,8 +1894,7 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
 
 		func(I_BDEV(inode), arg);
 
-		spin_lock(&blockdev_superblock->s_inode_list_lock);
+		spin_lock(state.lock);
 	}
-	spin_unlock(&blockdev_superblock->s_inode_list_lock);
 	iput(old_inode);
 }
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index d72d52b..ec272ed 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -16,9 +16,10 @@ int sysctl_drop_caches;
 static void drop_pagecache_sb(struct super_block *sb, void *unused)
 {
 	struct inode *inode, *toput_inode = NULL;
+	DEFINE_PCPU_LIST_STATE(state);
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	while (pcpu_list_iterate(sb->s_inodes, &state)) {
+		inode = list_entry(state.curr, struct inode, i_sb_list);
 		spin_lock(&inode->i_lock);
 		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
 		    (inode->i_mapping->nrpages == 0)) {
@@ -27,15 +28,14 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
+		spin_unlock(state.lock);
 
 		invalidate_mapping_pages(inode->i_mapping, 0, -1);
 		iput(toput_inode);
 		toput_inode = inode;
 
-		spin_lock(&sb->s_inode_list_lock);
+		spin_lock(state.lock);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 	iput(toput_inode);
 }
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 6915c95..05b3f85 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -2107,6 +2107,7 @@ EXPORT_SYMBOL(__mark_inode_dirty);
 static void wait_sb_inodes(struct super_block *sb)
 {
 	struct inode *inode, *old_inode = NULL;
+	DEFINE_PCPU_LIST_STATE(state);
 
 	/*
 	 * We need to be protected against the filesystem going from
@@ -2115,7 +2116,6 @@ static void wait_sb_inodes(struct super_block *sb)
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
 	mutex_lock(&sb->s_sync_lock);
-	spin_lock(&sb->s_inode_list_lock);
 
 	/*
 	 * Data integrity sync. Must wait for all pages under writeback,
@@ -2124,9 +2124,11 @@ static void wait_sb_inodes(struct super_block *sb)
 	 * In which case, the inode may not be on the dirty list, but
 	 * we still have to wait for that writeout.
 	 */
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		struct address_space *mapping = inode->i_mapping;
+	while (pcpu_list_iterate(sb->s_inodes, &state)) {
+		struct address_space *mapping;
 
+		inode   = list_entry(state.curr, struct inode, i_sb_list);
+		mapping = inode->i_mapping;
 		spin_lock(&inode->i_lock);
 		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
 		    (mapping->nrpages == 0)) {
@@ -2135,7 +2137,7 @@ static void wait_sb_inodes(struct super_block *sb)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
+		spin_unlock(state.lock);
 
 		/*
 		 * We hold a reference to 'inode' so it couldn't have been
@@ -2157,9 +2159,8 @@ static void wait_sb_inodes(struct super_block *sb)
 
 		cond_resched();
 
-		spin_lock(&sb->s_inode_list_lock);
+		spin_lock(state.lock);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 	iput(old_inode);
 	mutex_unlock(&sb->s_sync_lock);
 }
diff --git a/fs/inode.c b/fs/inode.c
index 9f62db3..e6e41ef 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -28,7 +28,7 @@
  *   inode->i_state, inode->i_hash, __iget()
  * Inode LRU list locks protect:
  *   inode->i_sb->s_inode_lru, inode->i_lru
- * inode->i_sb->s_inode_list_lock protects:
+ * inode->i_sb->s_inodes->lock protects:
  *   inode->i_sb->s_inodes, inode->i_sb_list
  * bdi->wb.list_lock protects:
  *   bdi->wb.b_{dirty,io,more_io,dirty_time}, inode->i_io_list
@@ -37,7 +37,7 @@
  *
  * Lock ordering:
  *
- * inode->i_sb->s_inode_list_lock
+ * inode->i_sb->s_inodes->lock
  *   inode->i_lock
  *     Inode LRU list locks
  *
@@ -45,7 +45,7 @@
  *   inode->i_lock
  *
  * inode_hash_lock
- *   inode->i_sb->s_inode_list_lock
+ *   inode->i_sb->s_inodes->lock
  *   inode->i_lock
  *
  * iunique_lock
@@ -424,19 +424,14 @@ static void inode_lru_list_del(struct inode *inode)
  */
 void inode_sb_list_add(struct inode *inode)
 {
-	spin_lock(&inode->i_sb->s_inode_list_lock);
-	list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
-	spin_unlock(&inode->i_sb->s_inode_list_lock);
+	pcpu_list_add(&inode->i_sb_list, inode->i_sb->s_inodes);
 }
 EXPORT_SYMBOL_GPL(inode_sb_list_add);
 
 static inline void inode_sb_list_del(struct inode *inode)
 {
-	if (!list_empty(&inode->i_sb_list)) {
-		spin_lock(&inode->i_sb->s_inode_list_lock);
-		list_del_init(&inode->i_sb_list);
-		spin_unlock(&inode->i_sb->s_inode_list_lock);
-	}
+	if (!list_empty(&inode->i_sb_list.list))
+		pcpu_list_del(&inode->i_sb_list);
 }
 
 static unsigned long hash(struct super_block *sb, unsigned long hashval)
@@ -590,12 +585,14 @@ static void dispose_list(struct list_head *head)
  */
 void evict_inodes(struct super_block *sb)
 {
-	struct inode *inode, *next;
+	struct inode *inode;
+	struct pcpu_list_state state;
 	LIST_HEAD(dispose);
 
 again:
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
+	init_pcpu_list_state(&state);
+	while (pcpu_list_iterate_safe(sb->s_inodes, &state)) {
+		inode = list_entry(state.curr, struct inode, i_sb_list);
 		if (atomic_read(&inode->i_count))
 			continue;
 
@@ -616,13 +613,12 @@ again:
 		 * bit so we don't livelock.
 		 */
 		if (need_resched()) {
-			spin_unlock(&sb->s_inode_list_lock);
+			spin_unlock(state.lock);
 			cond_resched();
 			dispose_list(&dispose);
 			goto again;
 		}
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 
 	dispose_list(&dispose);
 }
@@ -640,11 +636,12 @@ again:
 int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 {
 	int busy = 0;
-	struct inode *inode, *next;
+	struct inode *inode;
 	LIST_HEAD(dispose);
+	DEFINE_PCPU_LIST_STATE(state);
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
+	while (pcpu_list_iterate_safe(sb->s_inodes, &state)) {
+		inode = list_entry(state.curr, struct inode, i_sb_list);
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
 			spin_unlock(&inode->i_lock);
@@ -666,7 +663,6 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 		spin_unlock(&inode->i_lock);
 		list_add(&inode->i_lru, &dispose);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 
 	dispose_list(&dispose);
 
@@ -881,7 +877,7 @@ struct inode *new_inode_pseudo(struct super_block *sb)
 		spin_lock(&inode->i_lock);
 		inode->i_state = 0;
 		spin_unlock(&inode->i_lock);
-		INIT_LIST_HEAD(&inode->i_sb_list);
+		init_pcpu_list_node(&inode->i_sb_list);
 	}
 	return inode;
 }
@@ -902,8 +898,6 @@ struct inode *new_inode(struct super_block *sb)
 {
 	struct inode *inode;
 
-	spin_lock_prefetch(&sb->s_inode_list_lock);
-
 	inode = new_inode_pseudo(sb);
 	if (inode)
 		inode_sb_list_add(inode);
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index a364524..12515a4 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -151,14 +151,15 @@ int fsnotify_add_inode_mark(struct fsnotify_mark *mark,
 void fsnotify_unmount_inodes(struct super_block *sb)
 {
 	struct inode *inode, *iput_inode = NULL;
+	DEFINE_PCPU_LIST_STATE(state);
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	while (pcpu_list_iterate(sb->s_inodes, &state)) {
 		/*
 		 * We cannot __iget() an inode in state I_FREEING,
 		 * I_WILL_FREE, or I_NEW which is fine because by that point
 		 * the inode cannot have any associated watches.
 		 */
+		inode = list_entry(state.curr, struct inode, i_sb_list);
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
 			spin_unlock(&inode->i_lock);
@@ -178,7 +179,7 @@ void fsnotify_unmount_inodes(struct super_block *sb)
 
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
+		spin_unlock(state.lock);
 
 		if (iput_inode)
 			iput(iput_inode);
@@ -190,9 +191,8 @@ void fsnotify_unmount_inodes(struct super_block *sb)
 
 		iput_inode = inode;
 
-		spin_lock(&sb->s_inode_list_lock);
+		spin_lock(state.lock);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 
 	if (iput_inode)
 		iput(iput_inode);
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 3c3b81b..2564271 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -924,12 +924,13 @@ static int dqinit_needed(struct inode *inode, int type)
 static void add_dquot_ref(struct super_block *sb, int type)
 {
 	struct inode *inode, *old_inode = NULL;
+	DEFINE_PCPU_LIST_STATE(state);
 #ifdef CONFIG_QUOTA_DEBUG
 	int reserved = 0;
 #endif
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	while (pcpu_list_iterate(sb->s_inodes, &state)) {
+		inode = list_entry(state.curr, struct inode, i_sb_list);
 		spin_lock(&inode->i_lock);
 		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
 		    !atomic_read(&inode->i_writecount) ||
@@ -939,7 +940,7 @@ static void add_dquot_ref(struct super_block *sb, int type)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
+		spin_unlock(state.lock);
 
 #ifdef CONFIG_QUOTA_DEBUG
 		if (unlikely(inode_get_rsv_space(inode) > 0))
@@ -957,9 +958,8 @@ static void add_dquot_ref(struct super_block *sb, int type)
 		 * later.
 		 */
 		old_inode = inode;
-		spin_lock(&sb->s_inode_list_lock);
+		spin_lock(state.lock);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 	iput(old_inode);
 
 #ifdef CONFIG_QUOTA_DEBUG
@@ -1027,15 +1027,16 @@ static void remove_dquot_ref(struct super_block *sb, int type,
 {
 	struct inode *inode;
 	int reserved = 0;
+	DEFINE_PCPU_LIST_STATE(state);
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	while (pcpu_list_iterate(sb->s_inodes, &state)) {
 		/*
 		 *  We have to scan also I_NEW inodes because they can already
 		 *  have quota pointer initialized. Luckily, we need to touch
 		 *  only quota pointers and these have separate locking
 		 *  (dq_data_lock).
 		 */
+		inode = list_entry(state.curr, struct inode, i_sb_list);
 		spin_lock(&dq_data_lock);
 		if (!IS_NOQUOTA(inode)) {
 			if (unlikely(inode_get_rsv_space(inode) > 0))
@@ -1044,7 +1045,6 @@ static void remove_dquot_ref(struct super_block *sb, int type,
 		}
 		spin_unlock(&dq_data_lock);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
 #ifdef CONFIG_QUOTA_DEBUG
 	if (reserved) {
 		printk(KERN_WARNING "VFS (%s): Writes happened after quota"
diff --git a/fs/super.c b/fs/super.c
index 1182af8..7d44fad 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -163,6 +163,7 @@ static void destroy_super(struct super_block *s)
 {
 	list_lru_destroy(&s->s_dentry_lru);
 	list_lru_destroy(&s->s_inode_lru);
+	free_pcpu_list_head(&s->s_inodes);
 	security_sb_free(s);
 	WARN_ON(!list_empty(&s->s_mounts));
 	kfree(s->s_subtype);
@@ -204,9 +205,9 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 	INIT_HLIST_NODE(&s->s_instances);
 	INIT_HLIST_BL_HEAD(&s->s_anon);
 	mutex_init(&s->s_sync_lock);
-	INIT_LIST_HEAD(&s->s_inodes);
-	spin_lock_init(&s->s_inode_list_lock);
 
+	if (init_pcpu_list_head(&s->s_inodes))
+		goto fail;
 	if (list_lru_init_memcg(&s->s_dentry_lru))
 		goto fail;
 	if (list_lru_init_memcg(&s->s_inode_lru))
@@ -426,7 +427,7 @@ void generic_shutdown_super(struct super_block *sb)
 		if (sop->put_super)
 			sop->put_super(sb);
 
-		if (!list_empty(&sb->s_inodes)) {
+		if (!pcpu_list_empty(sb->s_inodes)) {
 			printk("VFS: Busy inodes after unmount of %s. "
 			   "Self-destruct in 5 seconds.  Have a nice day...\n",
 			   sb->s_id);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ae68100..b533bda 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -28,6 +28,7 @@
 #include <linux/uidgid.h>
 #include <linux/lockdep.h>
 #include <linux/percpu-rwsem.h>
+#include <linux/percpu-list.h>
 #include <linux/blk_types.h>
 #include <linux/workqueue.h>
 #include <linux/percpu-rwsem.h>
@@ -648,7 +649,7 @@ struct inode {
 	u16			i_wb_frn_history;
 #endif
 	struct list_head	i_lru;		/* inode LRU list */
-	struct list_head	i_sb_list;
+	struct pcpu_list_node	i_sb_list;
 	union {
 		struct hlist_head	i_dentry;
 		struct rcu_head		i_rcu;
@@ -1397,9 +1398,8 @@ struct super_block {
 	 */
 	int s_stack_depth;
 
-	/* s_inode_list_lock protects s_inodes */
-	spinlock_t		s_inode_list_lock ____cacheline_aligned_in_smp;
-	struct list_head	s_inodes;	/* all inodes */
+	/* The percpu locks protect s_inodes */
+	struct pcpu_list_head __percpu *s_inodes;	/* all inodes */
 };
 
 extern struct timespec current_fs_time(struct super_block *sb);
-- 
1.7.1


* Re: [PATCH v3 1/3] lib/percpu-list: Per-cpu list with associated per-cpu locks
  2016-02-23 19:04 ` [PATCH v3 1/3] lib/percpu-list: Per-cpu list with associated per-cpu locks Waiman Long
@ 2016-02-24  2:00   ` Boqun Feng
  2016-02-24  4:01     ` Waiman Long
  2016-02-24  7:56   ` Jan Kara
  1 sibling, 1 reply; 15+ messages in thread
From: Boqun Feng @ 2016-02-24  2:00 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Scott J Norton, Douglas Hatch


Hi Waiman,

On Tue, Feb 23, 2016 at 02:04:30PM -0500, Waiman Long wrote:
> Linked list is used everywhere in the Linux kernel. However, if many
> threads are trying to add or delete entries into the same linked list,
> it can create a performance bottleneck.
> 
> This patch introduces a new per-cpu list subsystem with associated
> per-cpu locks for protecting each of the lists individually. This
> allows list entries insertion and deletion operations to happen in
> parallel instead of being serialized with a global list and lock.
> 
> List entry insertion is strictly per cpu. List deletion, however, can
> happen in a cpu other than the one that did the insertion. So we still
> need lock to protect the list. Because of that, there may still be
> a small amount of contention when deletion is being done.
> 
> A new header file include/linux/percpu-list.h will be added with the
> associated pcpu_list_head and pcpu_list_node structures. The following
> functions are provided to manage the per-cpu list:
> 
>  1. int init_pcpu_list_head(struct pcpu_list_head **ppcpu_head)
>  2. void pcpu_list_add(struct pcpu_list_node *node,
> 		       struct pcpu_list_head *head)
>  3. void pcpu_list_del(struct pcpu_list_node *node)
> 
> Iteration of all the list entries within a group of per-cpu
> lists is done by calling either the pcpu_list_iterate() or
> pcpu_list_iterate_safe() functions in a while loop. They correspond
> to the list_for_each_entry() and list_for_each_entry_safe() macros
> respectively. The iteration state is kept in a pcpu_list_state
> structure that is passed to the iteration functions.
> 
> Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
> ---
>  include/linux/percpu-list.h |  235 +++++++++++++++++++++++++++++++++++++++++++
>  lib/Makefile                |    2 +-
>  lib/percpu-list.c           |   85 ++++++++++++++++
>  3 files changed, 321 insertions(+), 1 deletions(-)
>  create mode 100644 include/linux/percpu-list.h
>  create mode 100644 lib/percpu-list.c
> 
> diff --git a/include/linux/percpu-list.h b/include/linux/percpu-list.h
> new file mode 100644
> index 0000000..8759fec
> --- /dev/null
> +++ b/include/linux/percpu-list.h
> @@ -0,0 +1,235 @@
> +/*
> + * Per-cpu list
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * (C) Copyright 2016 Hewlett-Packard Enterprise Development LP
> + *
> + * Authors: Waiman Long <waiman.long@hpe.com>
> + */
> +#ifndef __LINUX_PERCPU_LIST_H
> +#define __LINUX_PERCPU_LIST_H
> +
> +#include <linux/spinlock.h>
> +#include <linux/list.h>
> +#include <linux/percpu.h>
> +
> +/*
> + * include/linux/percpu-list.h
> + *
> + * A per-cpu list protected by a per-cpu spinlock.
> + *
> + * The pcpu_list_head structure contains the spinlock, the other
> + * pcpu_list_node structures only contains a pointer to the spinlock in
> + * pcpu_list_head.
> + */
> +struct pcpu_list_head {
> +	struct list_head list;
> +	spinlock_t lock;
> +};
> +
> +struct pcpu_list_node {
> +	struct list_head list;
> +	spinlock_t *lockptr;
> +};
> +
> +/*
> + * Per-cpu list iteration state
> + */
> +struct pcpu_list_state {
> +	int			 cpu;
> +	spinlock_t		*lock;
> +	struct list_head	*head;	/* List head of current per-cpu list */
> +	struct pcpu_list_node	*curr;
> +	struct pcpu_list_node	*next;
> +};
> +
> +#define PCPU_LIST_HEAD_INIT(name)				\
> +	{							\
> +		.list.prev = &name.list,			\
> +		.list.next = &name.list,			\
> +		.list.lock = __SPIN_LOCK_UNLOCKED(name),	\
> +	}
> +
> +#define PCPU_LIST_NODE_INIT(name)		\
> +	{					\
> +		.list.prev = &name.list,	\
> +		.list.next = &name.list,	\
> +		.list.lockptr = NULL		\
> +	}
> +
> +#define PCPU_LIST_STATE_INIT()			\
> +	{					\
> +		.cpu  = -1,			\
> +		.lock = NULL,			\
> +		.head = NULL,			\
> +		.curr = NULL,			\
> +		.next = NULL,			\
> +	}
> +
> +#define DEFINE_PCPU_LIST_STATE(s)		\
> +	struct pcpu_list_state s = PCPU_LIST_STATE_INIT()
> +
> +#define pcpu_list_next_entry(pos, member) list_next_entry(pos, member.list)
> +
> +static inline void init_pcpu_list_node(struct pcpu_list_node *node)
> +{
> +	INIT_LIST_HEAD(&node->list);
> +	node->lockptr = NULL;
> +}
> +
> +static inline void free_pcpu_list_head(struct pcpu_list_head **ppcpu_head)
> +{
> +	free_percpu(*ppcpu_head);
> +	*ppcpu_head = NULL;
> +}
> +
> +static inline void init_pcpu_list_state(struct pcpu_list_state *state)
> +{
> +	state->cpu  = -1;
> +	state->lock = NULL;
> +	state->head = NULL;
> +	state->curr = NULL;
> +	state->next = NULL;
> +}
> +
> +#if NR_CPUS == 1
> +/*
> + * For uniprocessor, the list head and lock in struct pcpu_list_head are
> + * used directly.
> + */
> +static inline bool pcpu_list_empty(struct pcpu_list_head *pcpu_head)
> +{
> +	return list_empty(&pcpu_head->list);
> +}
> +
> +static __always_inline bool
> +__pcpu_list_next_cpu(struct pcpu_list_head *head, struct pcpu_list_state *state)
> +{
> +	if (state->lock)
> +		spin_unlock(state->lock);
> +
> +	if (state->cpu++ >= 0)
> +		return false;
> +
> +	state->curr = list_entry(head->list.next, struct pcpu_list_node, list);
> +	if (list_empty(&state->curr->list))
> +		return false;
> +	state->lock = &head->lock;
> +	spin_lock(state->lock);
> +	return true;
> +
> +}
> +#else /* NR_CPUS == 1 */
> +/*
> + * Multiprocessor
> + */
> +static inline bool pcpu_list_empty(struct pcpu_list_head *pcpu_head)
> +{
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu)
> +		if (!list_empty(&per_cpu_ptr(pcpu_head, cpu)->list))
> +			return false;
> +	return true;
> +}
> +
> +/*
> + * Helper function to find the first entry of the next per-cpu list
> + * It works somewhat like for_each_possible_cpu(cpu).
> + *
> + * Return: true if the entry is found, false if all the lists exhausted
> + */
> +static __always_inline bool
> +__pcpu_list_next_cpu(struct pcpu_list_head *head, struct pcpu_list_state *state)
> +{
> +	if (state->lock)
> +		spin_unlock(state->lock);
> +next_cpu:
> +	/*
> +	 * for_each_possible_cpu(cpu)
> +	 */
> +	state->cpu = cpumask_next(state->cpu, cpu_possible_mask);
> +	if (state->cpu >= nr_cpu_ids)
> +		return false;	/* All the per-cpu lists iterated */
> +
> +	state->head = &per_cpu_ptr(head, state->cpu)->list;
> +	state->lock = &per_cpu_ptr(head, state->cpu)->lock;
> +	state->curr = list_entry(state->head->next,
> +				 struct pcpu_list_node, list);
> +	if (&state->curr->list == state->head)
> +		goto next_cpu;
> +
> +	spin_lock(state->lock);
> +	return true;
> +}
> +#endif /* NR_CPUS == 1 */
> +
> +/*
> + * Iterate to the next entry of the group of per-cpu lists
> + *
> + * Return: true if the next entry is found, false if all the entries iterated
> + */
> +static inline bool pcpu_list_iterate(struct pcpu_list_head *head,
> +				     struct pcpu_list_state *state)
> +{
> +	/*
> +	 * Find next entry
> +	 */
> +	if (state->curr)
> +		state->curr = list_next_entry(state->curr, list);
> +
> +	if (!state->curr || (&state->curr->list == state->head)) {
> +		/*
> +		 * The current per-cpu list has been exhausted, try the next
> +		 * per-cpu list.
> +		 */
> +		if (!__pcpu_list_next_cpu(head, state))
> +			return false;
> +	}
> +	return true;	/* Continue the iteration */
> +}
> +
> +/*
> + * Iterate to the next entry of the group of per-cpu lists and safe
> + * against removal of list_entry
> + *
> + * Return: true if the next entry is found, false if all the entries iterated
> + */
> +static inline bool pcpu_list_iterate_safe(struct pcpu_list_head *head,
> +					  struct pcpu_list_state *state)
> +{
> +	/*
> +	 * Find next entry
> +	 */
> +	if (state->curr) {
> +		state->curr = state->next;
> +		state->next = list_next_entry(state->next, list);
> +	}
> +
> +	if (!state->curr || (&state->curr->list == state->head)) {
> +		/*
> +		 * The current per-cpu list has been exhausted, try the next
> +		 * per-cpu list.
> +		 */
> +		if (!__pcpu_list_next_cpu(head, state))
> +			return false;
> +		state->next = list_next_entry(state->curr, list);
> +	}
> +	return true;	/* Continue the iteration */
> +}
> +
> +extern int  init_pcpu_list_head(struct pcpu_list_head **ppcpu_head);
> +extern void pcpu_list_add(struct pcpu_list_node *node,
> +			  struct pcpu_list_head *head);
> +extern void pcpu_list_del(struct pcpu_list_node *node);
> +
> +#endif /* __LINUX_PERCPU_LIST_H */
> diff --git a/lib/Makefile b/lib/Makefile
> index a7c26a4..71a25d4 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -27,7 +27,7 @@ obj-y += bcd.o div64.o sort.o parser.o halfmd4.o debug_locks.o random32.o \
>  	 gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \
>  	 bsearch.o find_bit.o llist.o memweight.o kfifo.o \
>  	 percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \
> -	 once.o
> +	 once.o percpu-list.o
>  obj-y += string_helpers.o
>  obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o
>  obj-y += hexdump.o
> diff --git a/lib/percpu-list.c b/lib/percpu-list.c
> new file mode 100644
> index 0000000..45bbb2a
> --- /dev/null
> +++ b/lib/percpu-list.c
> @@ -0,0 +1,85 @@
> +/*
> + * Per-cpu list
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * (C) Copyright 2016 Hewlett-Packard Enterprise Development LP
> + *
> + * Authors: Waiman Long <waiman.long@hpe.com>
> + */
> +#include <linux/percpu-list.h>
> +
> +/*
> + * Initialize the per-cpu list
> + */
> +int init_pcpu_list_head(struct pcpu_list_head **ppcpu_head)
> +{
> +	struct pcpu_list_head *pcpu_head = alloc_percpu(struct pcpu_list_head);
> +	int cpu;
> +
> +	if (!pcpu_head)
> +		return -ENOMEM;
> +
> +	for_each_possible_cpu(cpu) {
> +		struct pcpu_list_head *head = per_cpu_ptr(pcpu_head, cpu);
> +
> +		INIT_LIST_HEAD(&head->list);
> +		head->lock = __SPIN_LOCK_UNLOCKED(&head->lock);
> +	}
> +
> +	*ppcpu_head = pcpu_head;
> +	return 0;
> +}
> +
> +/*
> + * List selection is based on the CPU being used when the pcpu_list_add()
> + * function is called. However, deletion may be done by a different CPU.
> + * So we still need to use a lock to protect the content of the list.
> + */
> +void pcpu_list_add(struct pcpu_list_node *node, struct pcpu_list_head *head)
> +{
> +	spinlock_t *lock;
> +
> +	/*
> +	 * There is a very slight chance the cpu will be changed
> +	 * (by preemption) before calling spin_lock(). We only need to put
> +	 * the node in one of the per-cpu lists. It may not need to be
> +	 * that of the current cpu.
> +	 */

Just curious about the comment here, what if the following happens:

	CPU 0				CPU 1
	=====================		=====================
	task_1:

	lock = this_cpu_ptr(&head->lock); // head->lock is on CPU0
	<preempted>
					continue to task_1:
					spin_lock(lock);
					node->lockptr = lock;
					// head->list is on CPU1
					list_add(&node->list, this_cpu_ptr(&head->list));
					spin_unlock(lock);

, which ends up with the node on CPU1's list while its ->lockptr points
to the lock on CPU0.

If there is another node whose ->lockptr points to the lock on CPU1 and
that node is in the list on CPU1, what will happen if these two nodes
get deleted simultaneously?

Regards,
Boqun

> +	lock = this_cpu_ptr(&head->lock);
> +	spin_lock(lock);
> +	node->lockptr = lock;
> +	list_add(&node->list, this_cpu_ptr(&head->list));
> +	spin_unlock(lock);
> +}
> +
> +/*
> + * Delete a node from a percpu list
> + *
> + * We need to check the lock pointer again after taking the lock to guard
> + * against concurrent delete of the same node. If the lock pointer changes
> + * (becomes NULL or to a different one), we assume that the deletion was done
> + * elsewhere.
> + */
> +void pcpu_list_del(struct pcpu_list_node *node)
> +{
> +	spinlock_t *lock = READ_ONCE(node->lockptr);
> +
> +	if (unlikely(!lock))
> +		return;
> +
> +	spin_lock(lock);
> +	if (likely(lock == node->lockptr)) {
> +		list_del_init(&node->list);
> +		node->lockptr = NULL;
> +	}
> +	spin_unlock(lock);
> +}
> -- 
> 1.7.1
> 



* Re: [PATCH v3 1/3] lib/percpu-list: Per-cpu list with associated per-cpu locks
  2016-02-24  2:00   ` Boqun Feng
@ 2016-02-24  4:01     ` Waiman Long
  0 siblings, 0 replies; 15+ messages in thread
From: Waiman Long @ 2016-02-24  4:01 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Scott J Norton, Douglas Hatch

On 02/23/2016 09:00 PM, Boqun Feng wrote:
> Hi Waiman,
>
> On Tue, Feb 23, 2016 at 02:04:30PM -0500, Waiman Long wrote:
>> }
>> +
>> +/*
>> + * List selection is based on the CPU being used when the pcpu_list_add()
>> + * function is called. However, deletion may be done by a different CPU.
>> + * So we still need to use a lock to protect the content of the list.
>> + */
>> +void pcpu_list_add(struct pcpu_list_node *node, struct pcpu_list_head *head)
>> +{
>> +	spinlock_t *lock;
>> +
>> +	/*
>> +	 * There is a very slight chance the cpu will be changed
>> +	 * (by preemption) before calling spin_lock(). We only need to put
>> +	 * the node in one of the per-cpu lists. It may not need to be
>> +	 * that of the current cpu.
>> +	 */
> Just curious about the comment here, what if the following happens:
>
> 	CPU 0				CPU 1
> 	=====================		=====================
> 	task_1:
>
> 	lock = this_cpu_ptr(&head->lock); // head->lock is on CPU0
> 	<preempted>
> 					continue to task_1:
> 					spin_lock(lock);
> 					node->lockptr = lock;
> 					// head->list is on CPU1
> 					list_add(&node->list, this_cpu_ptr(&head->list));
> 					spin_unlock(lock);
>
> , which ends up the node is in the list on CPU1 while ->lockptr pointing
> to the lock on CPU0.
>
> If there is another node whose ->lockptr points to the lock on CPU1 and
> the node is in list on CPU1, what will happen if these two nodes get
> deleted simultaneously?
>
> Regards,
> Boqun
>

Yes, you are right. I should have acquired the per-cpu head pointer 
first and used it onward instead of accessing the lock and list in 2 
separate operations. I will fix that in the next update.
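
For reference, a sketch of that restructuring (the intended fix as
described above, not code that has been posted) could look like:

	void pcpu_list_add(struct pcpu_list_node *node, struct pcpu_list_head *head)
	{
		struct pcpu_list_head *myhead;

		/*
		 * Pin one per-cpu head and use both its lock and its list so
		 * that a preemption-induced cpu switch cannot pair the lock
		 * of one cpu with the list of another.
		 */
		myhead = this_cpu_ptr(head);
		spin_lock(&myhead->lock);
		node->lockptr = &myhead->lock;
		list_add(&node->list, &myhead->list);
		spin_unlock(&myhead->lock);
	}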

Thanks for finding that.

Cheers,
Longman


* Re: [PATCH v3 1/3] lib/percpu-list: Per-cpu list with associated per-cpu locks
  2016-02-23 19:04 ` [PATCH v3 1/3] lib/percpu-list: Per-cpu list with associated per-cpu locks Waiman Long
  2016-02-24  2:00   ` Boqun Feng
@ 2016-02-24  7:56   ` Jan Kara
  2016-02-24 19:51     ` Waiman Long
  1 sibling, 1 reply; 15+ messages in thread
From: Jan Kara @ 2016-02-24  7:56 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Scott J Norton, Douglas Hatch

On Tue 23-02-16 14:04:30, Waiman Long wrote:
> Linked list is used everywhere in the Linux kernel. However, if many
> threads are trying to add or delete entries into the same linked list,
> it can create a performance bottleneck.
> 
> This patch introduces a new per-cpu list subsystem with associated
> per-cpu locks for protecting each of the lists individually. This
> allows list entries insertion and deletion operations to happen in
> parallel instead of being serialized with a global list and lock.
> 
> List entry insertion is strictly per cpu. List deletion, however, can
> happen in a cpu other than the one that did the insertion. So we still
> need lock to protect the list. Because of that, there may still be
> a small amount of contention when deletion is being done.
> 
> A new header file include/linux/percpu-list.h will be added with the
> associated pcpu_list_head and pcpu_list_node structures. The following
> functions are provided to manage the per-cpu list:
> 
>  1. int init_pcpu_list_head(struct pcpu_list_head **ppcpu_head)
>  2. void pcpu_list_add(struct pcpu_list_node *node,
> 		       struct pcpu_list_head *head)
>  3. void pcpu_list_del(struct pcpu_list_node *node)
> 
> Iteration of all the list entries within a group of per-cpu
> lists is done by calling either the pcpu_list_iterate() or
> pcpu_list_iterate_safe() functions in a while loop. They correspond
> to the list_for_each_entry() and list_for_each_entry_safe() macros
> respectively. The iteration state is kept in a pcpu_list_state
> structure that is passed to the iteration functions.
> 
> Signed-off-by: Waiman Long <Waiman.Long@hpe.com>

Two comments below.

> +/*
> + * Helper function to find the first entry of the next per-cpu list
> + * It works somewhat like for_each_possible_cpu(cpu).
> + *
> + * Return: true if the entry is found, false if all the lists exhausted
> + */
> +static __always_inline bool
> +__pcpu_list_next_cpu(struct pcpu_list_head *head, struct pcpu_list_state *state)
> +{
> +	if (state->lock)
> +		spin_unlock(state->lock);
> +next_cpu:
> +	/*
> +	 * for_each_possible_cpu(cpu)
> +	 */
> +	state->cpu = cpumask_next(state->cpu, cpu_possible_mask);
> +	if (state->cpu >= nr_cpu_ids)
> +		return false;	/* All the per-cpu lists iterated */
> +
> +	state->head = &per_cpu_ptr(head, state->cpu)->list;
> +	state->lock = &per_cpu_ptr(head, state->cpu)->lock;
> +	state->curr = list_entry(state->head->next,
> +				 struct pcpu_list_node, list);
> +	if (&state->curr->list == state->head)
> +		goto next_cpu;

This might be more comprehensible as:

	if (list_empty(state->head))
		goto next_cpu;

and you can do it just after updating state->head (no need to init
state->lock & state->curr if the list is empty).

Another note: Initialization of state->curr is IMO racy - you need to hold
state->lock to grab state->curr reliably, don't you? Otherwise someone can
remove the entry while you are working with it. So you need to move that
down just before returning.

> +
> +	spin_lock(state->lock);
> +	return true;
> +}
> +#endif /* NR_CPUS == 1 */
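
For illustration, a possible reshuffle of __pcpu_list_next_cpu() along
the lines suggested above (a sketch only, not code from this series)
would be:

	static __always_inline bool
	__pcpu_list_next_cpu(struct pcpu_list_head *head,
			     struct pcpu_list_state *state)
	{
		if (state->lock)
			spin_unlock(state->lock);
	next_cpu:
		state->cpu = cpumask_next(state->cpu, cpu_possible_mask);
		if (state->cpu >= nr_cpu_ids)
			return false;	/* all the per-cpu lists iterated */

		state->head = &per_cpu_ptr(head, state->cpu)->list;
		if (list_empty(state->head))	/* cheap unlocked skip of empty lists */
			goto next_cpu;

		state->lock = &per_cpu_ptr(head, state->cpu)->lock;
		spin_lock(state->lock);
		if (list_empty(state->head)) {	/* re-check under the lock */
			spin_unlock(state->lock);
			goto next_cpu;
		}
		/* grab the first entry only while the lock is held */
		state->curr = list_entry(state->head->next,
					 struct pcpu_list_node, list);
		return true;
	}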

...

> +/*
> + * Delete a node from a percpu list
> + *
> + * We need to check the lock pointer again after taking the lock to guard
> + * against concurrent delete of the same node. If the lock pointer changes
> + * (becomes NULL or to a different one), we assume that the deletion was done
> + * elsewhere.
> + */
> +void pcpu_list_del(struct pcpu_list_node *node)
> +{
> +	spinlock_t *lock = READ_ONCE(node->lockptr);
> +
> +	if (unlikely(!lock))
> +		return;
> +
> +	spin_lock(lock);
> +	if (likely(lock == node->lockptr)) {
> +		list_del_init(&node->list);
> +		node->lockptr = NULL;
> +	}

But someone changing lockptr under your hands would mean that there are
two processes racing to remove entries and that would generally point to a
problem (and likely use-after-free) in the caller, won't it? Or do you have
some particular usecase in mind?

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v3 3/3] vfs: Use per-cpu list for superblock's inode list
  2016-02-23 19:04 ` [PATCH v3 3/3] vfs: Use per-cpu list for superblock's inode list Waiman Long
@ 2016-02-24  8:28   ` Jan Kara
  2016-02-24  8:36     ` Ingo Molnar
  2016-02-24 20:23     ` Waiman Long
  0 siblings, 2 replies; 15+ messages in thread
From: Jan Kara @ 2016-02-24  8:28 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Scott J Norton, Douglas Hatch


On Tue 23-02-16 14:04:32, Waiman Long wrote:
> When many threads are trying to add or delete inodes to or from
> a superblock's s_inodes list, spinlock contention on the list can
> become a performance bottleneck.
> 
> This patch changes the s_inodes field to become a per-cpu list with
> per-cpu spinlocks. As a result, the following superblock inode list
> (sb->s_inodes) iteration functions in vfs are also being modified:
> 
>  1. iterate_bdevs()
>  2. drop_pagecache_sb()
>  3. wait_sb_inodes()
>  4. evict_inodes()
>  5. invalidate_inodes()
>  6. fsnotify_unmount_inodes()
>  7. add_dquot_ref()
>  8. remove_dquot_ref()
> 
> An exit microbenchmark creates a large number of threads,
> attaches many inodes to them and then exits. The runtimes of that
> microbenchmark with 1000 threads before and after the patch on a
> 4-socket Intel E7-4820 v3 system (40 cores, 80 threads) were as
> follows:
> 
>   Kernel            Elapsed Time    System Time
>   ------            ------------    -----------
>   Vanilla 4.5-rc4      65.29s         82m14s
>   Patched 4.5-rc4      22.81s         23m03s
> 
> Before the patch, spinlock contention at the inode_sb_list_add()
> function at the startup phase and the inode_sb_list_del() function at
> the exit phase were about 79% and 93% of total CPU time respectively
> (as measured by perf). After the patch, the percpu_list_add()
> function consumed only about 0.04% of CPU time at startup phase. The
> percpu_list_del() function consumed about 0.4% of CPU time at exit
> phase. There was still some spinlock contention, but it happened
> elsewhere.

While looking through this patch, I have noticed that the
list_for_each_entry_safe() iterations in evict_inodes() and
invalidate_inodes() are actually unnecessary. So if you first apply the
attached patch, you don't have to implement safe iteration variants at all.

As a second comment, I'd note that this patch grows struct inode by 1
pointer. It is probably acceptable for large machines given the speedup but
it should be noted in the changelog. Furthermore for UP or even small SMP
systems this is IMHO undesired bloat since the speedup won't be noticeable.

So for these small systems it would be good if per-cpu list magic would just
fall back to single linked list with a spinlock. Do you think that is
reasonably doable?

								Honza

> 
> Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
> ---
>  fs/block_dev.c         |   13 +++++++------
>  fs/drop_caches.c       |   10 +++++-----
>  fs/fs-writeback.c      |   13 +++++++------
>  fs/inode.c             |   40 +++++++++++++++++-----------------------
>  fs/notify/inode_mark.c |   10 +++++-----
>  fs/quota/dquot.c       |   16 ++++++++--------
>  fs/super.c             |    7 ++++---
>  include/linux/fs.h     |    8 ++++----
>  8 files changed, 57 insertions(+), 60 deletions(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 39b3a17..759d9b6 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1865,11 +1865,13 @@ EXPORT_SYMBOL(__invalidate_device);
>  void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
>  {
>  	struct inode *inode, *old_inode = NULL;
> +	DEFINE_PCPU_LIST_STATE(state);
>  
> -	spin_lock(&blockdev_superblock->s_inode_list_lock);
> -	list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
> -		struct address_space *mapping = inode->i_mapping;
> +	while (pcpu_list_iterate(blockdev_superblock->s_inodes, &state)) {
> +		struct address_space *mapping;
>  
> +		inode   = list_entry(state.curr, struct inode, i_sb_list);
> +		mapping = inode->i_mapping;
>  		spin_lock(&inode->i_lock);
>  		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
>  		    mapping->nrpages == 0) {
> @@ -1878,7 +1880,7 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
>  		}
>  		__iget(inode);
>  		spin_unlock(&inode->i_lock);
> -		spin_unlock(&blockdev_superblock->s_inode_list_lock);
> +		spin_unlock(state.lock);
>  		/*
>  		 * We hold a reference to 'inode' so it couldn't have been
>  		 * removed from s_inodes list while we dropped the
> @@ -1892,8 +1894,7 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
>  
>  		func(I_BDEV(inode), arg);
>  
> -		spin_lock(&blockdev_superblock->s_inode_list_lock);
> +		spin_lock(state.lock);
>  	}
> -	spin_unlock(&blockdev_superblock->s_inode_list_lock);
>  	iput(old_inode);
>  }
> diff --git a/fs/drop_caches.c b/fs/drop_caches.c
> index d72d52b..ec272ed 100644
> --- a/fs/drop_caches.c
> +++ b/fs/drop_caches.c
> @@ -16,9 +16,10 @@ int sysctl_drop_caches;
>  static void drop_pagecache_sb(struct super_block *sb, void *unused)
>  {
>  	struct inode *inode, *toput_inode = NULL;
> +	DEFINE_PCPU_LIST_STATE(state);
>  
> -	spin_lock(&sb->s_inode_list_lock);
> -	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> +	while (pcpu_list_iterate(sb->s_inodes, &state)) {
> +		inode = list_entry(state.curr, struct inode, i_sb_list);
>  		spin_lock(&inode->i_lock);
>  		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
>  		    (inode->i_mapping->nrpages == 0)) {
> @@ -27,15 +28,14 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
>  		}
>  		__iget(inode);
>  		spin_unlock(&inode->i_lock);
> -		spin_unlock(&sb->s_inode_list_lock);
> +		spin_unlock(state.lock);
>  
>  		invalidate_mapping_pages(inode->i_mapping, 0, -1);
>  		iput(toput_inode);
>  		toput_inode = inode;
>  
> -		spin_lock(&sb->s_inode_list_lock);
> +		spin_lock(state.lock);
>  	}
> -	spin_unlock(&sb->s_inode_list_lock);
>  	iput(toput_inode);
>  }
>  
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 6915c95..05b3f85 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -2107,6 +2107,7 @@ EXPORT_SYMBOL(__mark_inode_dirty);
>  static void wait_sb_inodes(struct super_block *sb)
>  {
>  	struct inode *inode, *old_inode = NULL;
> +	DEFINE_PCPU_LIST_STATE(state);
>  
>  	/*
>  	 * We need to be protected against the filesystem going from
> @@ -2115,7 +2116,6 @@ static void wait_sb_inodes(struct super_block *sb)
>  	WARN_ON(!rwsem_is_locked(&sb->s_umount));
>  
>  	mutex_lock(&sb->s_sync_lock);
> -	spin_lock(&sb->s_inode_list_lock);
>  
>  	/*
>  	 * Data integrity sync. Must wait for all pages under writeback,
> @@ -2124,9 +2124,11 @@ static void wait_sb_inodes(struct super_block *sb)
>  	 * In which case, the inode may not be on the dirty list, but
>  	 * we still have to wait for that writeout.
>  	 */
> -	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> -		struct address_space *mapping = inode->i_mapping;
> +	while (pcpu_list_iterate(sb->s_inodes, &state)) {
> +		struct address_space *mapping;
>  
> +		inode   = list_entry(state.curr, struct inode, i_sb_list);
> +		mapping = inode->i_mapping;
>  		spin_lock(&inode->i_lock);
>  		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
>  		    (mapping->nrpages == 0)) {
> @@ -2135,7 +2137,7 @@ static void wait_sb_inodes(struct super_block *sb)
>  		}
>  		__iget(inode);
>  		spin_unlock(&inode->i_lock);
> -		spin_unlock(&sb->s_inode_list_lock);
> +		spin_unlock(state.lock);
>  
>  		/*
>  		 * We hold a reference to 'inode' so it couldn't have been
> @@ -2157,9 +2159,8 @@ static void wait_sb_inodes(struct super_block *sb)
>  
>  		cond_resched();
>  
> -		spin_lock(&sb->s_inode_list_lock);
> +		spin_lock(state.lock);
>  	}
> -	spin_unlock(&sb->s_inode_list_lock);
>  	iput(old_inode);
>  	mutex_unlock(&sb->s_sync_lock);
>  }
> diff --git a/fs/inode.c b/fs/inode.c
> index 9f62db3..e6e41ef 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -28,7 +28,7 @@
>   *   inode->i_state, inode->i_hash, __iget()
>   * Inode LRU list locks protect:
>   *   inode->i_sb->s_inode_lru, inode->i_lru
> - * inode->i_sb->s_inode_list_lock protects:
> + * inode->i_sb->s_inodes->lock protects:
>   *   inode->i_sb->s_inodes, inode->i_sb_list
>   * bdi->wb.list_lock protects:
>   *   bdi->wb.b_{dirty,io,more_io,dirty_time}, inode->i_io_list
> @@ -37,7 +37,7 @@
>   *
>   * Lock ordering:
>   *
> - * inode->i_sb->s_inode_list_lock
> + * inode->i_sb->s_inodes->lock
>   *   inode->i_lock
>   *     Inode LRU list locks
>   *
> @@ -45,7 +45,7 @@
>   *   inode->i_lock
>   *
>   * inode_hash_lock
> - *   inode->i_sb->s_inode_list_lock
> + *   inode->i_sb->s_inodes->lock
>   *   inode->i_lock
>   *
>   * iunique_lock
> @@ -424,19 +424,14 @@ static void inode_lru_list_del(struct inode *inode)
>   */
>  void inode_sb_list_add(struct inode *inode)
>  {
> -	spin_lock(&inode->i_sb->s_inode_list_lock);
> -	list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
> -	spin_unlock(&inode->i_sb->s_inode_list_lock);
> +	pcpu_list_add(&inode->i_sb_list, inode->i_sb->s_inodes);
>  }
>  EXPORT_SYMBOL_GPL(inode_sb_list_add);
>  
>  static inline void inode_sb_list_del(struct inode *inode)
>  {
> -	if (!list_empty(&inode->i_sb_list)) {
> -		spin_lock(&inode->i_sb->s_inode_list_lock);
> -		list_del_init(&inode->i_sb_list);
> -		spin_unlock(&inode->i_sb->s_inode_list_lock);
> -	}
> +	if (!list_empty(&inode->i_sb_list.list))
> +		pcpu_list_del(&inode->i_sb_list);
>  }
>  
>  static unsigned long hash(struct super_block *sb, unsigned long hashval)
> @@ -590,12 +585,14 @@ static void dispose_list(struct list_head *head)
>   */
>  void evict_inodes(struct super_block *sb)
>  {
> -	struct inode *inode, *next;
> +	struct inode *inode;
> +	struct pcpu_list_state state;
>  	LIST_HEAD(dispose);
>  
>  again:
> -	spin_lock(&sb->s_inode_list_lock);
> -	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
> +	init_pcpu_list_state(&state);
> +	while (pcpu_list_iterate_safe(sb->s_inodes, &state)) {
> +		inode = list_entry(state.curr, struct inode, i_sb_list);
>  		if (atomic_read(&inode->i_count))
>  			continue;
>  
> @@ -616,13 +613,12 @@ again:
>  		 * bit so we don't livelock.
>  		 */
>  		if (need_resched()) {
> -			spin_unlock(&sb->s_inode_list_lock);
> +			spin_unlock(state.lock);
>  			cond_resched();
>  			dispose_list(&dispose);
>  			goto again;
>  		}
>  	}
> -	spin_unlock(&sb->s_inode_list_lock);
>  
>  	dispose_list(&dispose);
>  }
> @@ -640,11 +636,12 @@ again:
>  int invalidate_inodes(struct super_block *sb, bool kill_dirty)
>  {
>  	int busy = 0;
> -	struct inode *inode, *next;
> +	struct inode *inode;
>  	LIST_HEAD(dispose);
> +	DEFINE_PCPU_LIST_STATE(state);
>  
> -	spin_lock(&sb->s_inode_list_lock);
> -	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
> +	while (pcpu_list_iterate_safe(sb->s_inodes, &state)) {
> +		inode = list_entry(state.curr, struct inode, i_sb_list);
>  		spin_lock(&inode->i_lock);
>  		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
>  			spin_unlock(&inode->i_lock);
> @@ -666,7 +663,6 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
>  		spin_unlock(&inode->i_lock);
>  		list_add(&inode->i_lru, &dispose);
>  	}
> -	spin_unlock(&sb->s_inode_list_lock);
>  
>  	dispose_list(&dispose);
>  
> @@ -881,7 +877,7 @@ struct inode *new_inode_pseudo(struct super_block *sb)
>  		spin_lock(&inode->i_lock);
>  		inode->i_state = 0;
>  		spin_unlock(&inode->i_lock);
> -		INIT_LIST_HEAD(&inode->i_sb_list);
> +		init_pcpu_list_node(&inode->i_sb_list);
>  	}
>  	return inode;
>  }
> @@ -902,8 +898,6 @@ struct inode *new_inode(struct super_block *sb)
>  {
>  	struct inode *inode;
>  
> -	spin_lock_prefetch(&sb->s_inode_list_lock);
> -
>  	inode = new_inode_pseudo(sb);
>  	if (inode)
>  		inode_sb_list_add(inode);
> diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
> index a364524..12515a4 100644
> --- a/fs/notify/inode_mark.c
> +++ b/fs/notify/inode_mark.c
> @@ -151,14 +151,15 @@ int fsnotify_add_inode_mark(struct fsnotify_mark *mark,
>  void fsnotify_unmount_inodes(struct super_block *sb)
>  {
>  	struct inode *inode, *iput_inode = NULL;
> +	DEFINE_PCPU_LIST_STATE(state);
>  
> -	spin_lock(&sb->s_inode_list_lock);
> -	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> +	while (pcpu_list_iterate(sb->s_inodes, &state)) {
>  		/*
>  		 * We cannot __iget() an inode in state I_FREEING,
>  		 * I_WILL_FREE, or I_NEW which is fine because by that point
>  		 * the inode cannot have any associated watches.
>  		 */
> +		inode = list_entry(state.curr, struct inode, i_sb_list);
>  		spin_lock(&inode->i_lock);
>  		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
>  			spin_unlock(&inode->i_lock);
> @@ -178,7 +179,7 @@ void fsnotify_unmount_inodes(struct super_block *sb)
>  
>  		__iget(inode);
>  		spin_unlock(&inode->i_lock);
> -		spin_unlock(&sb->s_inode_list_lock);
> +		spin_unlock(state.lock);
>  
>  		if (iput_inode)
>  			iput(iput_inode);
> @@ -190,9 +191,8 @@ void fsnotify_unmount_inodes(struct super_block *sb)
>  
>  		iput_inode = inode;
>  
> -		spin_lock(&sb->s_inode_list_lock);
> +		spin_lock(state.lock);
>  	}
> -	spin_unlock(&sb->s_inode_list_lock);
>  
>  	if (iput_inode)
>  		iput(iput_inode);
> diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
> index 3c3b81b..2564271 100644
> --- a/fs/quota/dquot.c
> +++ b/fs/quota/dquot.c
> @@ -924,12 +924,13 @@ static int dqinit_needed(struct inode *inode, int type)
>  static void add_dquot_ref(struct super_block *sb, int type)
>  {
>  	struct inode *inode, *old_inode = NULL;
> +	DEFINE_PCPU_LIST_STATE(state);
>  #ifdef CONFIG_QUOTA_DEBUG
>  	int reserved = 0;
>  #endif
>  
> -	spin_lock(&sb->s_inode_list_lock);
> -	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> +	while (pcpu_list_iterate(sb->s_inodes, &state)) {
> +		inode = list_entry(state.curr, struct inode, i_sb_list);
>  		spin_lock(&inode->i_lock);
>  		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
>  		    !atomic_read(&inode->i_writecount) ||
> @@ -939,7 +940,7 @@ static void add_dquot_ref(struct super_block *sb, int type)
>  		}
>  		__iget(inode);
>  		spin_unlock(&inode->i_lock);
> -		spin_unlock(&sb->s_inode_list_lock);
> +		spin_unlock(state.lock);
>  
>  #ifdef CONFIG_QUOTA_DEBUG
>  		if (unlikely(inode_get_rsv_space(inode) > 0))
> @@ -957,9 +958,8 @@ static void add_dquot_ref(struct super_block *sb, int type)
>  		 * later.
>  		 */
>  		old_inode = inode;
> -		spin_lock(&sb->s_inode_list_lock);
> +		spin_lock(state.lock);
>  	}
> -	spin_unlock(&sb->s_inode_list_lock);
>  	iput(old_inode);
>  
>  #ifdef CONFIG_QUOTA_DEBUG
> @@ -1027,15 +1027,16 @@ static void remove_dquot_ref(struct super_block *sb, int type,
>  {
>  	struct inode *inode;
>  	int reserved = 0;
> +	DEFINE_PCPU_LIST_STATE(state);
>  
> -	spin_lock(&sb->s_inode_list_lock);
> -	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> +	while (pcpu_list_iterate(sb->s_inodes, &state)) {
>  		/*
>  		 *  We have to scan also I_NEW inodes because they can already
>  		 *  have quota pointer initialized. Luckily, we need to touch
>  		 *  only quota pointers and these have separate locking
>  		 *  (dq_data_lock).
>  		 */
> +		inode = list_entry(state.curr, struct inode, i_sb_list);
>  		spin_lock(&dq_data_lock);
>  		if (!IS_NOQUOTA(inode)) {
>  			if (unlikely(inode_get_rsv_space(inode) > 0))
> @@ -1044,7 +1045,6 @@ static void remove_dquot_ref(struct super_block *sb, int type,
>  		}
>  		spin_unlock(&dq_data_lock);
>  	}
> -	spin_unlock(&sb->s_inode_list_lock);
>  #ifdef CONFIG_QUOTA_DEBUG
>  	if (reserved) {
>  		printk(KERN_WARNING "VFS (%s): Writes happened after quota"
> diff --git a/fs/super.c b/fs/super.c
> index 1182af8..7d44fad 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -163,6 +163,7 @@ static void destroy_super(struct super_block *s)
>  {
>  	list_lru_destroy(&s->s_dentry_lru);
>  	list_lru_destroy(&s->s_inode_lru);
> +	free_pcpu_list_head(&s->s_inodes);
>  	security_sb_free(s);
>  	WARN_ON(!list_empty(&s->s_mounts));
>  	kfree(s->s_subtype);
> @@ -204,9 +205,9 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
>  	INIT_HLIST_NODE(&s->s_instances);
>  	INIT_HLIST_BL_HEAD(&s->s_anon);
>  	mutex_init(&s->s_sync_lock);
> -	INIT_LIST_HEAD(&s->s_inodes);
> -	spin_lock_init(&s->s_inode_list_lock);
>  
> +	if (init_pcpu_list_head(&s->s_inodes))
> +		goto fail;
>  	if (list_lru_init_memcg(&s->s_dentry_lru))
>  		goto fail;
>  	if (list_lru_init_memcg(&s->s_inode_lru))
> @@ -426,7 +427,7 @@ void generic_shutdown_super(struct super_block *sb)
>  		if (sop->put_super)
>  			sop->put_super(sb);
>  
> -		if (!list_empty(&sb->s_inodes)) {
> +		if (!pcpu_list_empty(sb->s_inodes)) {
>  			printk("VFS: Busy inodes after unmount of %s. "
>  			   "Self-destruct in 5 seconds.  Have a nice day...\n",
>  			   sb->s_id);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index ae68100..b533bda 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -28,6 +28,7 @@
>  #include <linux/uidgid.h>
>  #include <linux/lockdep.h>
>  #include <linux/percpu-rwsem.h>
> +#include <linux/percpu-list.h>
>  #include <linux/blk_types.h>
>  #include <linux/workqueue.h>
>  #include <linux/percpu-rwsem.h>
> @@ -648,7 +649,7 @@ struct inode {
>  	u16			i_wb_frn_history;
>  #endif
>  	struct list_head	i_lru;		/* inode LRU list */
> -	struct list_head	i_sb_list;
> +	struct pcpu_list_node	i_sb_list;
>  	union {
>  		struct hlist_head	i_dentry;
>  		struct rcu_head		i_rcu;
> @@ -1397,9 +1398,8 @@ struct super_block {
>  	 */
>  	int s_stack_depth;
>  
> -	/* s_inode_list_lock protects s_inodes */
> -	spinlock_t		s_inode_list_lock ____cacheline_aligned_in_smp;
> -	struct list_head	s_inodes;	/* all inodes */
> +	/* The percpu locks protect s_inodes */
> +	struct pcpu_list_head __percpu *s_inodes;	/* all inodes */
>  };
>  
>  extern struct timespec current_fs_time(struct super_block *sb);
> -- 
> 1.7.1
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[-- Attachment #2: 0001-vfs-Remove-unnecessary-list_for_each_entry_safe-vari.patch --]
[-- Type: text/x-patch, Size: 1736 bytes --]

>From ede070a2159d4c49c6a29be601594f7119872417 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Wed, 24 Feb 2016 09:05:32 +0100
Subject: [PATCH] vfs: Remove unnecessary list_for_each_entry_safe() variants

evict_inodes() and invalidate_inodes() use list_for_each_entry_safe()
to iterate sb->s_inodes list. However, since we use i_lru list entry for
our local temporary list of inodes to destroy, the inode is guaranteed
to stay in sb->s_inodes list while we hold sb->s_inode_list_lock. So
there is no real need for safe iteration variant and we can use
list_for_each_entry() just fine.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/inode.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 9f62db3bcc3e..efb1bea53695 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -590,12 +590,12 @@ static void dispose_list(struct list_head *head)
  */
 void evict_inodes(struct super_block *sb)
 {
-	struct inode *inode, *next;
+	struct inode *inode;
 	LIST_HEAD(dispose);
 
 again:
 	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		if (atomic_read(&inode->i_count))
 			continue;
 
@@ -640,11 +640,11 @@ again:
 int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 {
 	int busy = 0;
-	struct inode *inode, *next;
+	struct inode *inode;
 	LIST_HEAD(dispose);
 
 	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
 			spin_unlock(&inode->i_lock);
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 3/3] vfs: Use per-cpu list for superblock's inode list
  2016-02-24  8:28   ` Jan Kara
@ 2016-02-24  8:36     ` Ingo Molnar
  2016-02-24  8:58       ` Jan Kara
  2016-02-24 20:23     ` Waiman Long
  1 sibling, 1 reply; 15+ messages in thread
From: Ingo Molnar @ 2016-02-24  8:36 UTC (permalink / raw)
  To: Jan Kara
  Cc: Waiman Long, Alexander Viro, Jan Kara, Jeff Layton,
	J. Bruce Fields, Tejun Heo, Christoph Lameter, linux-fsdevel,
	linux-kernel, Ingo Molnar, Peter Zijlstra, Andi Kleen,
	Dave Chinner, Scott J Norton, Douglas Hatch


* Jan Kara <jack@suse.cz> wrote:

> On Tue 23-02-16 14:04:32, Waiman Long wrote:
> > When many threads are trying to add or delete inodes to or from
> > a superblock's s_inodes list, spinlock contention on the list can
> > become a performance bottleneck.
> > 
> > This patch changes the s_inodes field to become a per-cpu list with
> > per-cpu spinlocks. As a result, the following superblock inode list
> > (sb->s_inodes) iteration functions in vfs are also being modified:
> > 
> >  1. iterate_bdevs()
> >  2. drop_pagecache_sb()
> >  3. wait_sb_inodes()
> >  4. evict_inodes()
> >  5. invalidate_inodes()
> >  6. fsnotify_unmount_inodes()
> >  7. add_dquot_ref()
> >  8. remove_dquot_ref()
> > 
> > An exit microbenchmark creates a large number of threads,
> > attaches many inodes to them and then exits. The runtimes of that
> > microbenchmark with 1000 threads before and after the patch on a
> > 4-socket Intel E7-4820 v3 system (40 cores, 80 threads) were as
> > follows:
> > 
> >   Kernel            Elapsed Time    System Time
> >   ------            ------------    -----------
> >   Vanilla 4.5-rc4      65.29s         82m14s
> >   Patched 4.5-rc4      22.81s         23m03s
> > 
> > Before the patch, spinlock contention at the inode_sb_list_add()
> > function at the startup phase and the inode_sb_list_del() function at
> > the exit phase were about 79% and 93% of total CPU time respectively
> > (as measured by perf). After the patch, the percpu_list_add()
> > function consumed only about 0.04% of CPU time at startup phase. The
> > percpu_list_del() function consumed about 0.4% of CPU time at exit
> > phase. There was still some spinlock contention, but it happened
> > elsewhere.
> 
> While looking through this patch, I have noticed that the
> list_for_each_entry_safe() iterations in evict_inodes() and
> invalidate_inodes() are actually unnecessary. So if you first apply the
> attached patch, you don't have to implement safe iteration variants at all.
> 
> As a second comment, I'd note that this patch grows struct inode by 1 pointer. 
> It is probably acceptable for large machines given the speedup but it should be 
> noted in the changelog. Furthermore for UP or even small SMP systems this is 
> IMHO undesired bloat since the speedup won't be noticeable.
> 
> So for these small systems it would be good if per-cpu list magic would just 
> fall back to single linked list with a spinlock. Do you think that is reasonably 
> doable?

Even many 'small' systems tend to be SMP these days.

If you do this then please keep it as a separate add-on patch, so that the 'UP cost' 
becomes apparent. Uniprocessor #ifdeffery is really painful in places and we might 
be better off with a single extra pointer. Furthermore, UP kernels are tested a lot 
less stringently than SMP kernels. It's just 4 bytes for a truly small 32-bit 
system.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 3/3] vfs: Use per-cpu list for superblock's inode list
  2016-02-24  8:36     ` Ingo Molnar
@ 2016-02-24  8:58       ` Jan Kara
  2016-02-25  8:06         ` Ingo Molnar
  0 siblings, 1 reply; 15+ messages in thread
From: Jan Kara @ 2016-02-24  8:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jan Kara, Waiman Long, Alexander Viro, Jan Kara, Jeff Layton,
	J. Bruce Fields, Tejun Heo, Christoph Lameter, linux-fsdevel,
	linux-kernel, Ingo Molnar, Peter Zijlstra, Andi Kleen,
	Dave Chinner, Scott J Norton, Douglas Hatch

On Wed 24-02-16 09:36:30, Ingo Molnar wrote:
> 
> * Jan Kara <jack@suse.cz> wrote:
> 
> > On Tue 23-02-16 14:04:32, Waiman Long wrote:
> > > When many threads are trying to add or delete inodes to or from
> > > a superblock's s_inodes list, spinlock contention on the list can
> > > become a performance bottleneck.
> > > 
> > > This patch changes the s_inodes field to become a per-cpu list with
> > > per-cpu spinlocks. As a result, the following superblock inode list
> > > (sb->s_inodes) iteration functions in vfs are also being modified:
> > > 
> > >  1. iterate_bdevs()
> > >  2. drop_pagecache_sb()
> > >  3. wait_sb_inodes()
> > >  4. evict_inodes()
> > >  5. invalidate_inodes()
> > >  6. fsnotify_unmount_inodes()
> > >  7. add_dquot_ref()
> > >  8. remove_dquot_ref()
> > > 
> > > An exit microbenchmark creates a large number of threads,
> > > attaches many inodes to them and then exits. The runtimes of that
> > > microbenchmark with 1000 threads before and after the patch on a
> > > 4-socket Intel E7-4820 v3 system (40 cores, 80 threads) were as
> > > follows:
> > > 
> > >   Kernel            Elapsed Time    System Time
> > >   ------            ------------    -----------
> > >   Vanilla 4.5-rc4      65.29s         82m14s
> > >   Patched 4.5-rc4      22.81s         23m03s
> > > 
> > > Before the patch, spinlock contention at the inode_sb_list_add()
> > > function at the startup phase and the inode_sb_list_del() function at
> > > the exit phase were about 79% and 93% of total CPU time respectively
> > > (as measured by perf). After the patch, the percpu_list_add()
> > > function consumed only about 0.04% of CPU time at startup phase. The
> > > percpu_list_del() function consumed about 0.4% of CPU time at exit
> > > phase. There was still some spinlock contention, but it happened
> > > elsewhere.
> > 
> > While looking through this patch, I have noticed that the
> > list_for_each_entry_safe() iterations in evict_inodes() and
> > invalidate_inodes() are actually unnecessary. So if you first apply the
> > attached patch, you don't have to implement safe iteration variants at all.
> > 
> > As a second comment, I'd note that this patch grows struct inode by 1 pointer. 
> > It is probably acceptable for large machines given the speedup but it should be 
> > noted in the changelog. Furthermore for UP or even small SMP systems this is 
> > IMHO undesired bloat since the speedup won't be noticeable.
> > 
> > So for these small systems it would be good if per-cpu list magic would just 
> > fall back to single linked list with a spinlock. Do you think that is reasonably 
> > doable?
> 
> Even many 'small' systems tend to be SMP these days.

Yes, I know. But my tablet with 4 ARM cores is unlikely to benefit from
this change either. It will just have to pay the memory cost. And frankly I
don't care that much myself but if there is some reasonably easy way to
avoid the cost, it would be welcome.

> If you do this then please keep it as a separate add-on patch, so that the
> 'UP cost' becomes apparent. Uniprocessor #ifdeffery is really painful in
> places and we might be better off with a single extra pointer.
> Furthermore, UP kernels are tested a lot less stringently than SMP
> kernels. It's just 4 bytes for a truly small 32-bit system.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 1/3] lib/percpu-list: Per-cpu list with associated per-cpu locks
  2016-02-24  7:56   ` Jan Kara
@ 2016-02-24 19:51     ` Waiman Long
  0 siblings, 0 replies; 15+ messages in thread
From: Waiman Long @ 2016-02-24 19:51 UTC (permalink / raw)
  To: Jan Kara
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Scott J Norton, Douglas Hatch

On 02/24/2016 02:56 AM, Jan Kara wrote:
> On Tue 23-02-16 14:04:30, Waiman Long wrote:
>> Linked list is used everywhere in the Linux kernel. However, if many
>> threads are trying to add or delete entries into the same linked list,
>> it can create a performance bottleneck.
>>
>> This patch introduces a new per-cpu list subystem with associated
>> per-cpu locks for protecting each of the lists individually. This
>> allows list entries insertion and deletion operations to happen in
>> parallel instead of being serialized with a global list and lock.
>>
>> List entry insertion is strictly per cpu. List deletion, however, can
>> happen in a cpu other than the one that did the insertion. So we still
>> need lock to protect the list. Because of that, there may still be
>> a small amount of contention when deletion is being done.
>>
>> A new header file include/linux/percpu-list.h will be added with the
>> associated pcpu_list_head and pcpu_list_node structures. The following
>> functions are provided to manage the per-cpu list:
>>
>>   1. int init_pcpu_list_head(struct pcpu_list_head **ppcpu_head)
>>   2. void pcpu_list_add(struct pcpu_list_node *node,
>> 		       struct pcpu_list_head *head)
>>   3. void pcpu_list_del(struct pcpu_list *node)
>>
>> Iteration of all the list entries within a group of per-cpu
>> lists is done by calling either the pcpu_list_iterate() or
>> pcpu_list_iterate_safe() functions in a while loop. They correspond
>> to the list_for_each_entry() and list_for_each_entry_safe() macros
>> respectively. The iteration states are kept in a pcpu_list_state
>> structure that is passed to the iteration functions.
>>
>> Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
> Two comments below.
>
>> +/*
>> + * Helper function to find the first entry of the next per-cpu list
>> + * It works somewhat like for_each_possible_cpu(cpu).
>> + *
>> + * Return: true if the entry is found, false if all the lists exhausted
>> + */
>> +static __always_inline bool
>> +__pcpu_list_next_cpu(struct pcpu_list_head *head, struct pcpu_list_state *state)
>> +{
>> +	if (state->lock)
>> +		spin_unlock(state->lock);
>> +next_cpu:
>> +	/*
>> +	 * for_each_possible_cpu(cpu)
>> +	 */
>> +	state->cpu = cpumask_next(state->cpu, cpu_possible_mask);
>> +	if (state->cpu >= nr_cpu_ids)
>> +		return false;	/* All the per-cpu lists iterated */
>> +
>> +	state->head = &per_cpu_ptr(head, state->cpu)->list;
>> +	state->lock = &per_cpu_ptr(head, state->cpu)->lock;
>> +	state->curr = list_entry(state->head->next,
>> +				 struct pcpu_list_node, list);
>> +	if (&state->curr->list == state->head)
>> +		goto next_cpu;
> This might be more comprehensible as:
>
> 	if (list_empty(state->head))
> 		goto next_cpu;
>
> and you can do it just after updating state->head (no need to init
> state->lock & state->curr if the list is empty).

Thanks for the suggestion. Will change the code accordingly.
> Another note: Initialization of state->curr is IMO racy - you need to hold
> state->lock to grab state->curr reliably, don't you? Otherwise someone can
> remove the entry while you are working with it. So you need to move that
> down just before returning.

Right. I will move the initialization of state->curr after the spin_lock().

>> +
>> +	spin_lock(state->lock);
>> +	return true;
>> +}
>> +#endif /* NR_CPUS == 1 */
> ...
>
>> +/*
>> + * Delete a node from a percpu list
>> + *
>> + * We need to check the lock pointer again after taking the lock to guard
>> + * against concurrent delete of the same node. If the lock pointer changes
>> + * (becomes NULL or to a different one), we assume that the deletion was done
>> + * elsewhere.
>> + */
>> +void pcpu_list_del(struct pcpu_list_node *node)
>> +{
>> +	spinlock_t *lock = READ_ONCE(node->lockptr);
>> +
>> +	if (unlikely(!lock))
>> +		return;
>> +
>> +	spin_lock(lock);
>> +	if (likely(lock == node->lockptr)) {
>> +		list_del_init(&node->list);
>> +		node->lockptr = NULL;
>> +	}
> But someone changing lockptr under your hands would mean that there are
> two processes racing to remove entries and that would generally point to a
> problem (and likely use-after-free) in the caller, wouldn't it? Or do you have
> some particular usecase in mind?
>
> 								Honza
>

This is just defensive programming to guard against unforeseen cases. I 
don't have any particular use case in mind that would make that happen. 
Maybe I should put a WARN_ON there if it really happens.
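
Something along these lines, perhaps (just a sketch of that idea, not
code from any posted revision):

	void pcpu_list_del(struct pcpu_list_node *node)
	{
		spinlock_t *lock = READ_ONCE(node->lockptr);

		if (unlikely(!lock))
			return;

		spin_lock(lock);
		if (likely(lock == node->lockptr)) {
			list_del_init(&node->list);
			node->lockptr = NULL;
		} else {
			/* Flag the unexpected racing delete instead of ignoring it */
			WARN(1, "pcpu_list_del: racing delete of node %p\n", node);
		}
		spin_unlock(lock);
	}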

Cheers,
Longman

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 3/3] vfs: Use per-cpu list for superblock's inode list
  2016-02-24  8:28   ` Jan Kara
  2016-02-24  8:36     ` Ingo Molnar
@ 2016-02-24 20:23     ` Waiman Long
  2016-02-25 14:50       ` Waiman Long
  1 sibling, 1 reply; 15+ messages in thread
From: Waiman Long @ 2016-02-24 20:23 UTC (permalink / raw)
  To: Jan Kara
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Scott J Norton, Douglas Hatch

On 02/24/2016 03:28 AM, Jan Kara wrote:
> On Tue 23-02-16 14:04:32, Waiman Long wrote:
>> When many threads are trying to add or delete inodes to or from
>> a superblock's s_inodes list, spinlock contention on the list can
>> become a performance bottleneck.
>>
>> This patch changes the s_inodes field to become a per-cpu list with
>> per-cpu spinlocks. As a result, the following superblock inode list
>> (sb->s_inodes) iteration functions in vfs are also being modified:
>>
>>   1. iterate_bdevs()
>>   2. drop_pagecache_sb()
>>   3. wait_sb_inodes()
>>   4. evict_inodes()
>>   5. invalidate_inodes()
>>   6. fsnotify_unmount_inodes()
>>   7. add_dquot_ref()
>>   8. remove_dquot_ref()
>>
>> An exit microbenchmark creates a large number of threads,
>> attaches many inodes to them and then exits. The runtimes of that
>> microbenchmark with 1000 threads before and after the patch on a
>> 4-socket Intel E7-4820 v3 system (40 cores, 80 threads) were as
>> follows:
>>
>>    Kernel            Elapsed Time    System Time
>>    ------            ------------    -----------
>>    Vanilla 4.5-rc4      65.29s         82m14s
>>    Patched 4.5-rc4      22.81s         23m03s
>>
>> Before the patch, spinlock contention at the inode_sb_list_add()
>> function at the startup phase and the inode_sb_list_del() function at
>> the exit phase were about 79% and 93% of total CPU time respectively
>> (as measured by perf). After the patch, the percpu_list_add()
>> function consumed only about 0.04% of CPU time at startup phase. The
>> percpu_list_del() function consumed about 0.4% of CPU time at exit
>> phase. There was still some spinlock contention, but it happened
>> elsewhere.
> While looking through this patch, I have noticed that the
> list_for_each_entry_safe() iterations in evict_inodes() and
> invalidate_inodes() are actually unnecessary. So if you first apply the
> attached patch, you don't have to implement safe iteration variants at all.

Thanks for the patch. I will apply that in my next update. As for the 
safe iteration variant, I think I will keep it since I have implemented 
it already, in case it is needed in some other places.

> As a second comment, I'd note that this patch grows struct inode by 1
> pointer. It is probably acceptable for large machines given the speedup but
> it should be noted in the changelog. Furthermore for UP or even small SMP
> systems this is IMHO undesired bloat since the speedup won't be noticeable.
>
> So for these small systems it would be good if per-cpu list magic would just
> fall back to single linked list with a spinlock. Do you think that is
> reasonably doable?
>

I already have a somewhat separate code path for UP. So I can remove the 
lock pointer for that. For small SMP systems, however, the only way to 
avoid the extra pointer is to add a config parameter to turn this 
feature off. That can be added as a separate patch, if necessary.
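
Roughly, a config-gated fallback could look like the sketch below (the
CONFIG_PERCPU_LIST name is made up for illustration, and the __percpu
annotation plus the allocation details are glossed over):

	#ifndef CONFIG_PERCPU_LIST
	/* Small-system fallback: one list, one lock, no per-node lock pointer */
	struct pcpu_list_head {
		struct list_head	list;
		spinlock_t		lock;
	};

	struct pcpu_list_node {
		struct list_head	list;
	};

	static inline void pcpu_list_add(struct pcpu_list_node *node,
					 struct pcpu_list_head *head)
	{
		spin_lock(&head->lock);
		list_add(&node->list, &head->list);
		spin_unlock(&head->lock);
	}
	#endif

The catch is the deletion side: pcpu_list_del() only gets the node and
relies on node->lockptr to find the lock, so a pointer-less fallback
would also need the deletion API to take the head (or lock) explicitly.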

BTW, I think the current inode structure is already pretty big, so adding 
one more pointer shouldn't have too much impact on its overall size.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 3/3] vfs: Use per-cpu list for superblock's inode list
  2016-02-24  8:58       ` Jan Kara
@ 2016-02-25  8:06         ` Ingo Molnar
  2016-02-25 14:43           ` Waiman Long
  0 siblings, 1 reply; 15+ messages in thread
From: Ingo Molnar @ 2016-02-25  8:06 UTC (permalink / raw)
  To: Jan Kara
  Cc: Waiman Long, Alexander Viro, Jan Kara, Jeff Layton,
	J. Bruce Fields, Tejun Heo, Christoph Lameter, linux-fsdevel,
	linux-kernel, Ingo Molnar, Peter Zijlstra, Andi Kleen,
	Dave Chinner, Scott J Norton, Douglas Hatch


* Jan Kara <jack@suse.cz> wrote:

> > > > An exit microbenchmark creates a large number of threads, attaches 
> > > > many inodes to them and then exits. The runtimes of that 
> > > > microbenchmark with 1000 threads before and after the patch on a 4-socket 
> > > > Intel E7-4820 v3 system (40 cores, 80 threads) were as follows:
> > > > 
> > > >   Kernel            Elapsed Time    System Time
> > > >   ------            ------------    -----------
> > > >   Vanilla 4.5-rc4      65.29s         82m14s
> > > >   Patched 4.5-rc4      22.81s         23m03s
> > > > 
> > > > Before the patch, spinlock contention at the inode_sb_list_add() function 
> > > > at the startup phase and the inode_sb_list_del() function at the exit 
> > > > phase were about 79% and 93% of total CPU time respectively (as measured 
> > > > by perf). After the patch, the percpu_list_add() function consumed only 
> > > > about 0.04% of CPU time at startup phase. The percpu_list_del() function 
> > > > consumed about 0.4% of CPU time at exit phase. There was still some 
> > > > spinlock contention, but it happened elsewhere.
> > > 
> > > While looking through this patch, I have noticed that the 
> > > list_for_each_entry_safe() iterations in evict_inodes() and 
> > > invalidate_inodes() are actually unnecessary. So if you first apply the 
> > > attached patch, you don't have to implement safe iteration variants at all.
> > > 
> > > As a second comment, I'd note that this patch grows struct inode by 1 
> > > pointer. It is probably acceptable for large machines given the speedup but 
> > > it should be noted in the changelog. Furthermore for UP or even small SMP 
> > > systems this is IMHO undesired bloat since the speedup won't be noticeable.
> > > 
> > > So for these small systems it would be good if per-cpu list magic would just 
> > > fall back to single linked list with a spinlock. Do you think that is 
> > > reasonably doable?
> > 
> > Even many 'small' systems tend to be SMP these days.
> 
> Yes, I know. But my tablet with 4 ARM cores is unlikely to benefit from this 
> change either. [...]

I'm not sure about that at all; the above numbers are showing a 3x-4x speedup in 
system time, which ought to be noticeable on smaller SMP systems as well.

Waiman, could you please post the microbenchmark?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 3/3] vfs: Use per-cpu list for superblock's inode list
  2016-02-25  8:06         ` Ingo Molnar
@ 2016-02-25 14:43           ` Waiman Long
  0 siblings, 0 replies; 15+ messages in thread
From: Waiman Long @ 2016-02-25 14:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jan Kara, Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Scott J Norton, Douglas Hatch

[-- Attachment #1: Type: text/plain, Size: 3052 bytes --]

On 02/25/2016 03:06 AM, Ingo Molnar wrote:
* Jan Kara <jack@suse.cz> wrote:
>
>>>>> An exit microbenchmark creates a large number of threads, attaches
>>>>> many inodes to them and then exits. The runtimes of that
>>>>> microbenchmark with 1000 threads before and after the patch on a 4-socket
>>>>> Intel E7-4820 v3 system (40 cores, 80 threads) were as follows:
>>>>>
>>>>>    Kernel            Elapsed Time    System Time
>>>>>    ------            ------------    -----------
>>>>>    Vanilla 4.5-rc4      65.29s         82m14s
>>>>>    Patched 4.5-rc4      22.81s         23m03s
>>>>>
>>>>> Before the patch, spinlock contention at the inode_sb_list_add() function
>>>>> at the startup phase and the inode_sb_list_del() function at the exit
>>>>> phase were about 79% and 93% of total CPU time respectively (as measured
>>>>> by perf). After the patch, the percpu_list_add() function consumed only
>>>>> about 0.04% of CPU time at startup phase. The percpu_list_del() function
>>>>> consumed about 0.4% of CPU time at exit phase. There was still some
>>>>> spinlock contention, but it happened elsewhere.
>>>> While looking through this patch, I have noticed that the
>>>> list_for_each_entry_safe() iterations in evict_inodes() and
>>>> invalidate_inodes() are actually unnecessary. So if you first apply the
>>>> attached patch, you don't have to implement safe iteration variants at all.
>>>>
>>>> As a second comment, I'd note that this patch grows struct inode by 1
>>>> pointer. It is probably acceptable for large machines given the speedup but
>>>> it should be noted in the changelog. Furthermore for UP or even small SMP
>>>> systems this is IMHO undesired bloat since the speedup won't be noticeable.
>>>>
>>>> So for these small systems it would be good if per-cpu list magic would just
>>>> fall back to single linked list with a spinlock. Do you think that is
>>>> reasonably doable?
>>> Even many 'small' systems tend to be SMP these days.
>> Yes, I know. But my tablet with 4 ARM cores is unlikely to benefit from this
>> change either. [...]
> I'm not sure about that at all; the above numbers are showing a 3x-4x speedup in
> system time, which ought to be noticeable on smaller SMP systems as well.
>
> Waiman, could you please post the microbenchmark?
>
> Thanks,
>
> 	Ingo

The microbenchmark that I used is attached.

I do agree that the performance benefit will decrease as the number of 
CPUs gets smaller. The system that I used for testing has 4 sockets with 
40 cores (80 threads). Dave Chinner ran his fstests on a 16-core system 
(probably 2-socket), which showed a modest improvement in performance 
(~4m40s vs 4m30s in runtime).

This patch enables parallel insertion into and deletion from the inode 
list, which used to be serialized operations. So if those list operations 
are a bottleneck, you will see a significant improvement; if they are not, 
the difference may not be noticeable. For a single-socket 4-core system, 
I agree that the performance benefit, if any, will be limited.

Cheers,
Longman


[-- Attachment #2: exit_test.c --]
[-- Type: text/plain, Size: 2665 bytes --]

/*
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * Authors: Waiman Long <waiman.long@hp.com>
 */
/*
 * This is an exit test
 */
#include <ctype.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/syscall.h>


#define do_exit()	syscall(SYS_exit)
#define	gettid()	syscall(SYS_gettid)
#define	MAX_THREADS	2048

static inline void cpu_relax(void)
{
        __asm__ __volatile__("rep;nop": : :"memory");
}

static inline void atomic_inc(volatile int *v)
{
	__asm__ __volatile__("lock incl %0": "+m" (*v));
}

static volatile int exit_now  = 0;
static volatile int threadcnt = 0;

/*
 * Walk the /proc/<tid> filesystem so each thread populates the dentry cache
 */
static void walk_procfs(void)
{
	char cmdbuf[256];
	pid_t tid = gettid();

	snprintf(cmdbuf, sizeof(cmdbuf), "find /proc/%d > /dev/null 2>&1", tid);
	if (system(cmdbuf) < 0)
		perror("system() failed!");
}

static void *exit_thread(void *dummy)
{
	long tid = (long)dummy;

	walk_procfs();
	atomic_inc(&threadcnt);
	/*
	 * Busy wait until the do_exit flag is set and then call exit
	 */
	while (!exit_now)
		sleep(1);
	do_exit();
}

static void exit_test(int threads)
{
	pthread_t thread[threads];
	long i = 0, finish;
	time_t start = time(NULL);

	while (i++ < threads) {
		if (pthread_create(thread + i - 1, NULL, exit_thread,
				  (void *)i)) {
			perror("pthread_create");
			exit(1);
		}
#if 0
		/*
		 * Pipelining to reduce contention & improve speed
		 */
		if ((i & 0xf) == 0)
			 while (i - threadcnt > 12)
				usleep(1);
#endif
	}
	while (threadcnt != threads)
		usleep(1);
	walk_procfs();
	printf("Setup time = %lus\n", time(NULL) - start);
	printf("Process ready to exit!\n");
	kill(0, SIGKILL);
	exit(0);
}

int main(int argc, char *argv[])
{
	int   tcnt;	/* Thread counts */
	char *cmd = argv[0];

	if ((argc != 2) || !isdigit(argv[1][0])) {
		fprintf(stderr, "Usage: %s <thread count>\n", cmd);
		exit(1);
	}
	tcnt = strtoul(argv[1], NULL, 10);
	if (tcnt > MAX_THREADS) {
		fprintf(stderr, "Error: thread count should be <= %d\n",
			MAX_THREADS);
		exit(1);
	}
	exit_test(tcnt);
	return 0;	/* Not reachable */
}

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 3/3] vfs: Use per-cpu list for superblock's inode list
  2016-02-24 20:23     ` Waiman Long
@ 2016-02-25 14:50       ` Waiman Long
  0 siblings, 0 replies; 15+ messages in thread
From: Waiman Long @ 2016-02-25 14:50 UTC (permalink / raw)
  To: Jan Kara
  Cc: Alexander Viro, Jan Kara, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Christoph Lameter, linux-fsdevel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner,
	Scott J Norton, Douglas Hatch

On 02/24/2016 03:23 PM, Waiman Long wrote:
> On 02/24/2016 03:28 AM, Jan Kara wrote:
>> On Tue 23-02-16 14:04:32, Waiman Long wrote:
>>> When many threads are trying to add or delete inodes to or from
>>> a superblock's s_inodes list, spinlock contention on the list can
>>> become a performance bottleneck.
>>>
>>> This patch changes the s_inodes field to become a per-cpu list with
>>> per-cpu spinlocks. As a result, the following superblock inode list
>>> (sb->s_inodes) iteration functions in vfs are also being modified:
>>>
>>>   1. iterate_bdevs()
>>>   2. drop_pagecache_sb()
>>>   3. wait_sb_inodes()
>>>   4. evict_inodes()
>>>   5. invalidate_inodes()
>>>   6. fsnotify_unmount_inodes()
>>>   7. add_dquot_ref()
>>>   8. remove_dquot_ref()
>>>
>>> An exit microbenchmark creates a large number of threads,
>>> attaches many inodes to them and then exits. The runtimes of that
>>> microbenchmark with 1000 threads before and after the patch on a
>>> 4-socket Intel E7-4820 v3 system (40 cores, 80 threads) were as
>>> follows:
>>>
>>>    Kernel            Elapsed Time    System Time
>>>    ------            ------------    -----------
>>>    Vanilla 4.5-rc4      65.29s         82m14s
>>>    Patched 4.5-rc4      22.81s         23m03s
>>>
>>> Before the patch, spinlock contention at the inode_sb_list_add()
>>> function at the startup phase and the inode_sb_list_del() function at
>>> the exit phase were about 79% and 93% of total CPU time respectively
>>> (as measured by perf). After the patch, the percpu_list_add()
>>> function consumed only about 0.04% of CPU time at startup phase. The
>>> percpu_list_del() function consumed about 0.4% of CPU time at exit
>>> phase. There was still some spinlock contention, but it happened
>>> elsewhere.
>> While looking through this patch, I have noticed that the
>> list_for_each_entry_safe() iterations in evict_inodes() and
>> invalidate_inodes() are actually unnecessary. So if you first apply the
>> attached patch, you don't have to implement safe iteration variants 
>> at all.
>
> Thanks for the patch. I will apply that in my next update. As for the 
> safe iteration variant, I think I will keep it since I have implemented 
> it already, in case it is needed in some other places.
>
>> As a second comment, I'd note that this patch grows struct inode by 1
>> pointer. It is probably acceptable for large machines given the 
>> speedup but
>> it should be noted in the changelog. Furthermore for UP or even small 
>> SMP
>> systems this is IMHO undesired bloat since the speedup won't be 
>> noticeable.
>>
>> So for these small systems it would be good if per-cpu list magic 
>> would just
>> fall back to single linked list with a spinlock. Do you think that is
>> reasonably doable?
>>
>
> I already have a somewhat separate code path for UP. So I can remove 
> the lock pointer for that. For small SMP systems, however, the only way 
> to avoid the extra pointer is to add a config parameter to turn this 
> feature off. That can be added as a separate patch, if necessary.

I am sorry that I need to retract this promise for UP. Removing the 
lock pointer will require a change to the list deletion API to pass in 
the lock information. So I am not going to change it for the time being.
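
To illustrate, assuming the kind of single-list fallback discussed
earlier (head carrying the lock), the pointer-less deletion would have
to look something like this hypothetical sketch:

	/* Hypothetical variant: the caller supplies the head */
	static inline void pcpu_list_del(struct pcpu_list_node *node,
					 struct pcpu_list_head *head)
	{
		spin_lock(&head->lock);
		list_del_init(&node->list);
		spin_unlock(&head->lock);
	}

and every caller would change from pcpu_list_del(node) to
pcpu_list_del(node, head), e.g. inode_sb_list_del() would have to pass
the superblock's s_inodes head along with &inode->i_sb_list.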

Cheers,
Longman

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2016-02-25 14:50 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-02-23 19:04 [PATCH v3 0/3] vfs: Use per-cpu list for SB's s_inodes list Waiman Long
2016-02-23 19:04 ` [PATCH v3 1/3] lib/percpu-list: Per-cpu list with associated per-cpu locks Waiman Long
2016-02-24  2:00   ` Boqun Feng
2016-02-24  4:01     ` Waiman Long
2016-02-24  7:56   ` Jan Kara
2016-02-24 19:51     ` Waiman Long
2016-02-23 19:04 ` [PATCH v3 2/3] fsnotify: Simplify inode iteration on umount Waiman Long
2016-02-23 19:04 ` [PATCH v3 3/3] vfs: Use per-cpu list for superblock's inode list Waiman Long
2016-02-24  8:28   ` Jan Kara
2016-02-24  8:36     ` Ingo Molnar
2016-02-24  8:58       ` Jan Kara
2016-02-25  8:06         ` Ingo Molnar
2016-02-25 14:43           ` Waiman Long
2016-02-24 20:23     ` Waiman Long
2016-02-25 14:50       ` Waiman Long
