* [PATCH v2 0/3] lib/list_batch: A simple list insertion/deletion batching facility
@ 2016-01-29 19:30 Waiman Long
  2016-01-29 19:30 ` [PATCH v2 1/3] " Waiman Long
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Waiman Long @ 2016-01-29 19:30 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Alexander Viro
  Cc: linux-fsdevel, x86, linux-kernel, Peter Zijlstra, Andi Kleen,
	Scott J Norton, Douglas Hatch, Waiman Long

v1->v2:
  - Address feedback from PeterZ and Andi Kleen.

This patchset introduces a simple list insertion/deletion batching
facility that groups multiple list insertion and deletion operations
into a single lock/unlock critical section.

Patch 1 introduces this new facility.

Patch 2 enables it for the x86 architecture.

Patch 3 converts the insertion and deletion of entries on the VFS
superblock's inode list to the new list batching functions.
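
For reference, here is a minimal usage sketch of the API introduced in
patch 1. Only the list_batch types and calls come from the patches; the
container structure around them is hypothetical:

#include <linux/list_batch.h>
#include <linux/spinlock.h>

/* Hypothetical container embedding a batched list */
struct foo_container {
	spinlock_t		lock;
	struct list_head	head;
	struct list_batch	batch;
};

static void foo_container_init(struct foo_container *c)
{
	spin_lock_init(&c->lock);
	INIT_LIST_HEAD(&c->head);
	list_batch_init(&c->batch, &c->head);
}

/* Each call replaces a "spin_lock(); list_add(); spin_unlock();" sequence */
static void foo_add(struct foo_container *c, struct list_head *entry)
{
	do_list_batch(&c->lock, lb_cmd_add, &c->batch, entry);
}

static void foo_del(struct foo_container *c, struct list_head *entry)
{
	do_list_batch(&c->lock, lb_cmd_del_init, &c->batch, entry);
}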

On x86-64, the generated code (gcc v4.8.5) of the inode_sb_list_add()
function without the patch was:

   0x0000000000001030 <+0>:     callq  0x1035 <inode_sb_list_add+5>
   0x0000000000001035 <+5>:     push   %rbp
   0x0000000000001036 <+6>:     mov    %rsp,%rbp
   0x0000000000001039 <+9>:     push   %rbx
   0x000000000000103a <+10>:    mov    0x28(%rdi),%rax
   0x000000000000103e <+14>:    mov    %rdi,%rbx
   0x0000000000001041 <+17>:    lea    0x600(%rax),%rdi
   0x0000000000001048 <+24>:    callq  0x104d <inode_sb_list_add+29>
   0x000000000000104d <+29>:    mov    0x28(%rbx),%rsi
   0x0000000000001051 <+33>:    lea    0x120(%rbx),%rdi
   0x0000000000001058 <+40>:    mov    0x608(%rsi),%rdx
   0x000000000000105f <+47>:    add    $0x608,%rsi
   0x0000000000001066 <+54>:    callq  0x106b <inode_sb_list_add+59>
   0x000000000000106b <+59>:    mov    0x28(%rbx),%rdi
   0x000000000000106f <+63>:    add    $0x600,%rdi
   0x0000000000001076 <+70>:    callq  *0x0
   0x000000000000107d <+77>:    pop    %rbx
   0x000000000000107e <+78>:    pop    %rbp
   0x000000000000107f <+79>:    retq

With the patch, the function became:

   0x0000000000001030 <+0>:     callq  0x1035 <inode_sb_list_add+5>
   0x0000000000001035 <+5>:     push   %rbp
   0x0000000000001036 <+6>:     mov    %rsp,%rbp
   0x0000000000001039 <+9>:     push   %r13
   0x000000000000103b <+11>:    lea    0x120(%rdi),%r13
   0x0000000000001042 <+18>:    push   %r12
   0x0000000000001044 <+20>:    push   %rbx
   0x0000000000001045 <+21>:    mov    0x28(%rdi),%r12
   0x0000000000001049 <+25>:    lea    0x600(%r12),%rbx
   0x0000000000001051 <+33>:    mov    %rbx,%rdi
   0x0000000000001054 <+36>:    callq  0x1059 <inode_sb_list_add+41>
   0x0000000000001059 <+41>:    test   %eax,%eax
   0x000000000000105b <+43>:    je     0x1081 <inode_sb_list_add+81>
   0x000000000000105d <+45>:    mov    0x618(%r12),%rsi
   0x0000000000001065 <+53>:    mov    %r13,%rdi
   0x0000000000001068 <+56>:    mov    (%rsi),%rdx
   0x000000000000106b <+59>:    callq  0x1070 <inode_sb_list_add+64>
   0x0000000000001070 <+64>:    mov    %rbx,%rdi
   0x0000000000001073 <+67>:    callq  *0x0
   0x000000000000107a <+74>:    pop    %rbx
   0x000000000000107b <+75>:    pop    %r12
   0x000000000000107d <+77>:    pop    %r13
   0x000000000000107f <+79>:    pop    %rbp
   0x0000000000001080 <+80>:    retq
   0x0000000000001081 <+81>:    lea    0x618(%r12),%rdx
   0x0000000000001089 <+89>:    mov    %r13,%rcx
   0x000000000000108c <+92>:    mov    %rbx,%rsi
   0x000000000000108f <+95>:    xor    %edi,%edi
   0x0000000000001091 <+97>:    callq  0x1096 <inode_sb_list_add+102>
   0x0000000000001096 <+102>:   jmp    0x107a <inode_sb_list_add+74>

For the fast path, the only overheads are the test of the spin_trylock()
return value and the push/pop of two more registers. Other than that,
the two versions look very similar.

Waiman Long (3):
  lib/list_batch: A simple list insertion/deletion batching facility
  lib/list_batch, x86: Enable list insertion/deletion batching for x86
  vfs: Enable list batching for the superblock's inode list

 arch/x86/Kconfig           |    1 +
 fs/inode.c                 |   13 ++---
 fs/super.c                 |    1 +
 include/linux/fs.h         |    2 +
 include/linux/list_batch.h |  133 ++++++++++++++++++++++++++++++++++++++++++++
 lib/Kconfig                |    7 ++
 lib/Makefile               |    1 +
 lib/list_batch.c           |  125 +++++++++++++++++++++++++++++++++++++++++
 8 files changed, 275 insertions(+), 8 deletions(-)
 create mode 100644 include/linux/list_batch.h
 create mode 100644 lib/list_batch.c


* [PATCH v2 1/3] lib/list_batch: A simple list insertion/deletion batching facility
  2016-01-29 19:30 [PATCH v2 0/3] lib/list_batch: A simple list insertion/deletion batching facility Waiman Long
@ 2016-01-29 19:30 ` Waiman Long
  2016-02-01  0:47   ` Dave Chinner
  2016-01-29 19:30 ` [PATCH v2 2/3] lib/list_batch, x86: Enable list insertion/deletion batching for x86 Waiman Long
  2016-01-29 19:30 ` [PATCH v2 3/3] vfs: Enable list batching for the superblock's inode list Waiman Long
  2 siblings, 1 reply; 16+ messages in thread
From: Waiman Long @ 2016-01-29 19:30 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Alexander Viro
  Cc: linux-fsdevel, x86, linux-kernel, Peter Zijlstra, Andi Kleen,
	Scott J Norton, Douglas Hatch, Waiman Long

Linked list insertion or deletion under lock is a very common activity
in the Linux kernel. If this is the only activity under lock, the
locking overhead can be pretty large compared with the actual time
spent on the insertion or deletion operation itself, especially on a
large system with many CPUs.

This patch introduces a simple list insertion/deletion batching
facility where a group of list insertion and deletion operations are
grouped together in a single batch under lock. This can reduce the
locking overhead and improve overall system performance.

The fast path of this batching facility will be similar in performance
to the "lock; listop; unlock;" sequence of the existing code. If
the lock is not available, it will enter the slowpath, where the
batching happens.

A new config option LIST_BATCHING is added so that we can control on
which architectures this facility is enabled.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 include/linux/list_batch.h |  133 ++++++++++++++++++++++++++++++++++++++++++++
 lib/Kconfig                |    7 ++
 lib/Makefile               |    1 +
 lib/list_batch.c           |  125 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 266 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/list_batch.h
 create mode 100644 lib/list_batch.c

diff --git a/include/linux/list_batch.h b/include/linux/list_batch.h
new file mode 100644
index 0000000..a445a2e
--- /dev/null
+++ b/include/linux/list_batch.h
@@ -0,0 +1,133 @@
+/*
+ * List insertion/deletion batching facility
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2016 Hewlett-Packard Enterprise Development LP
+ *
+ * Authors: Waiman Long <waiman.long@hpe.com>
+ */
+#ifndef __LINUX_LIST_BATCH_H
+#define __LINUX_LIST_BATCH_H
+
+#include <linux/spinlock.h>
+#include <linux/list.h>
+
+/*
+ * include/linux/list_batch.h
+ *
+ * Inserting or deleting an entry from a linked list under a spinlock is a
+ * very common operation in the Linux kernel. If many CPUs are trying to
+ * grab the lock and manipulate the linked list, it can lead to significant
+ * lock contention and slow operation.
+ *
+ * This list operation batching facility is used to batch multiple list
+ * operations under one lock/unlock critical section, thus reducing the
+ * locking and cacheline bouncing overhead and improving overall performance.
+ */
+enum list_batch_cmd {
+	lb_cmd_add,
+	lb_cmd_del,
+	lb_cmd_del_init
+};
+
+enum list_batch_state {
+	lb_state_waiting,	/* Node is waiting */
+	lb_state_batch,		/* Queue head to perform batch processing */
+	lb_state_done		/* Job is done */
+};
+
+struct list_batch_qnode {
+	struct list_batch_qnode	*next;
+	struct list_head	*entry;
+	enum list_batch_cmd	cmd;
+	enum list_batch_state	state;
+};
+
+struct list_batch {
+	struct list_head	*list;
+	struct list_batch_qnode *tail;
+};
+
+#define LIST_BATCH_INIT(_list)	\
+	{			\
+		.list = _list,	\
+		.tail = NULL	\
+	}
+
+static inline void list_batch_init(struct list_batch *batch,
+				   struct list_head *list)
+{
+	batch->list = list;
+	batch->tail = NULL;
+}
+
+static __always_inline void _list_batch_cmd(enum list_batch_cmd cmd,
+					    struct list_head *head,
+					    struct list_head *entry)
+{
+	switch (cmd) {
+	case lb_cmd_add:
+		list_add(entry, head);
+		break;
+
+	case lb_cmd_del:
+		list_del(entry);
+		break;
+
+	case lb_cmd_del_init:
+		list_del_init(entry);
+		break;
+	}
+}
+
+#ifdef CONFIG_LIST_BATCHING
+
+extern void do_list_batch_slowpath(spinlock_t *lock, enum list_batch_cmd cmd,
+				   struct list_batch *batch,
+				   struct list_head *entry);
+
+/*
+ * The caller is expected to pass in a constant cmd parameter. As a
+ * result, most of the unneeded code in the switch statement of _list_batch_cmd()
+ * will be optimized away. This should make the fast path almost as fast
+ * as the "lock; listop; unlock;" sequence it replaces.
+ */
+static inline void do_list_batch(spinlock_t *lock, enum list_batch_cmd cmd,
+				   struct list_batch *batch,
+				   struct list_head *entry)
+{
+	/*
+	 * Fast path
+	 */
+	if (likely(spin_trylock(lock))) {
+		_list_batch_cmd(cmd, batch->list, entry);
+		spin_unlock(lock);
+		return;
+	}
+	do_list_batch_slowpath(lock, cmd, batch, entry);
+}
+
+
+#else /* CONFIG_LIST_BATCHING */
+
+static inline void do_list_batch(spinlock_t *lock, enum list_batch_cmd cmd,
+				   struct list_batch *batch,
+				   struct list_head *entry)
+{
+	spin_lock(lock);
+	_list_batch_cmd(cmd, batch->list, entry);
+	spin_unlock(lock);
+}
+
+#endif /* CONFIG_LIST_BATCHING */
+
+#endif /* __LINUX_LIST_BATCH_H */
diff --git a/lib/Kconfig b/lib/Kconfig
index 133ebc0..d75ce19 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -514,6 +514,13 @@ config OID_REGISTRY
 config UCS2_STRING
         tristate
 
+config LIST_BATCHING
+	def_bool y if ARCH_USE_LIST_BATCHING
+	depends on SMP
+
+config ARCH_USE_LIST_BATCHING
+	bool
+
 source "lib/fonts/Kconfig"
 
 config SG_SPLIT
diff --git a/lib/Makefile b/lib/Makefile
index a7c26a4..2791262 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -210,6 +210,7 @@ quiet_cmd_build_OID_registry = GEN     $@
 clean-files	+= oid_registry_data.c
 
 obj-$(CONFIG_UCS2_STRING) += ucs2_string.o
+obj-$(CONFIG_LIST_BATCHING) += list_batch.o
 obj-$(CONFIG_UBSAN) += ubsan.o
 
 UBSAN_SANITIZE_ubsan.o := n
diff --git a/lib/list_batch.c b/lib/list_batch.c
new file mode 100644
index 0000000..174f4ba
--- /dev/null
+++ b/lib/list_batch.c
@@ -0,0 +1,125 @@
+/*
+ * List insertion/deletion batching facility
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2016 Hewlett-Packard Enterprise Development LP
+ *
+ * Authors: Waiman Long <waiman.long@hpe.com>
+ */
+#include <linux/list_batch.h>
+
+/*
+ * List processing batch size = 128
+ *
+ * The batch size shouldn't be too large. Otherwise, it will be too unfair
+ * to the task doing the batch processing. It shouldn't be too small
+ * either, as the performance benefit will be reduced.
+ */
+#define LB_BATCH_SIZE	(1 << 7)
+
+/*
+ * Inserting or deleting an entry from a linked list under a spinlock is a
+ * very common operation in the Linux kernel. If many CPUs are trying to
+ * grab the lock and manipulate the linked list, it can lead to significant
+ * lock contention and slow operation.
+ *
+ * This list operation batching facility is used to batch multiple list
+ * operations under one lock/unlock critical section, thus reducing the
+ * locking overhead and improving overall performance.
+ */
+void do_list_batch_slowpath(spinlock_t *lock, enum list_batch_cmd cmd,
+			    struct list_batch *batch, struct list_head *entry)
+{
+	struct list_batch_qnode node, *prev, *next, *nptr;
+	int loop;
+
+	/*
+	 * Put itself into the list_batch queue
+	 */
+	node.next  = NULL;
+	node.entry = entry;
+	node.cmd   = cmd;
+	node.state = lb_state_waiting;
+
+	/*
+	 * We rely on the implicit memory barrier of xchg() to make sure
+	 * that node initialization will be done before its content is being
+	 * accessed by other CPUs.
+	 */
+	prev = xchg(&batch->tail, &node);
+
+	if (prev) {
+		WRITE_ONCE(prev->next, &node);
+		while (READ_ONCE(node.state) == lb_state_waiting)
+			cpu_relax();
+		if (node.state == lb_state_done)
+			return;
+		WARN_ON(node.state != lb_state_batch);
+	}
+
+	/*
+	 * We are now the queue head, we should acquire the lock and
+	 * process a batch of qnodes.
+	 */
+	loop = LB_BATCH_SIZE;
+	next = &node;
+	spin_lock(lock);
+
+do_list_again:
+	do {
+		nptr = next;
+		_list_batch_cmd(nptr->cmd, batch->list, nptr->entry);
+		next = READ_ONCE(nptr->next);
+		/*
+		 * As soon as the state is marked lb_state_done, we
+		 * can no longer assume the content of *nptr as valid.
+		 * So we have to hold off marking it done until we no
+		 * longer need its content.
+		 *
+		 * The release barrier here is to make sure that we
+		 * won't access its content after marking it done.
+		 */
+		if (next)
+			smp_store_release(&nptr->state, lb_state_done);
+	} while (--loop && next);
+	if (!next) {
+		/*
+		 * The queue tail should be equal to nptr, so clear it to
+		 * mark the queue as empty.
+		 */
+		if (cmpxchg_relaxed(&batch->tail, nptr, NULL) != nptr) {
+			/*
+			 * Queue not empty, wait until the next pointer is
+			 * initialized.
+			 */
+			while (!(next = READ_ONCE(nptr->next)))
+				cpu_relax();
+		}
+		/*
+		 * The release barrier is required to make sure that
+		 * setting the done state is the last operation.
+		 */
+		smp_store_release(&nptr->state, lb_state_done);
+	}
+	if (next) {
+		if (loop)
+			goto do_list_again;	/* More qnodes to process */
+		/*
+		 * Mark the next qnode as head to process the next batch
+		 * of qnodes. The new queue head cannot proceed until we
+		 * release the lock.
+		 */
+		WRITE_ONCE(next->state, lb_state_batch);
+	}
+	spin_unlock(lock);
+}
+EXPORT_SYMBOL_GPL(do_list_batch_slowpath);
-- 
1.7.1


* [PATCH v2 2/3] lib/list_batch, x86: Enable list insertion/deletion batching for x86
  2016-01-29 19:30 [PATCH v2 0/3] lib/list_batch: A simple list insertion/deletion batching facility Waiman Long
  2016-01-29 19:30 ` [PATCH v2 1/3] " Waiman Long
@ 2016-01-29 19:30 ` Waiman Long
  2016-01-29 19:30 ` [PATCH v2 3/3] vfs: Enable list batching for the superblock's inode list Waiman Long
  2 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2016-01-29 19:30 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Alexander Viro
  Cc: linux-fsdevel, x86, linux-kernel, Peter Zijlstra, Andi Kleen,
	Scott J Norton, Douglas Hatch, Waiman Long

Systems with a large number of CPUs will benefit from the list batching
facility when the list lock is contended. If the lock isn't contended,
the performance of the list batching function should be similar to the
"lock; listop; unlock;" sequence that it replaces.

This patch enables it for the x86 architecture.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 arch/x86/Kconfig |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 330e738..5b35be7 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -42,6 +42,7 @@ config X86
 	select ARCH_SUPPORTS_NUMA_BALANCING	if X86_64
 	select ARCH_USE_BUILTIN_BSWAP
 	select ARCH_USE_CMPXCHG_LOCKREF		if X86_64
+	select ARCH_USE_LIST_BATCHING
 	select ARCH_USE_QUEUED_RWLOCKS
 	select ARCH_USE_QUEUED_SPINLOCKS
 	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH if SMP
-- 
1.7.1


* [PATCH v2 3/3] vfs: Enable list batching for the superblock's inode list
  2016-01-29 19:30 [PATCH v2 0/3] lib/list_batch: A simple list insertion/deletion batching facility Waiman Long
  2016-01-29 19:30 ` [PATCH v2 1/3] " Waiman Long
  2016-01-29 19:30 ` [PATCH v2 2/3] lib/list_batch, x86: Enable list insertion/deletion batching for x86 Waiman Long
@ 2016-01-29 19:30 ` Waiman Long
  2016-01-30  8:35   ` Ingo Molnar
  2016-02-01  0:04   ` Dave Chinner
  2 siblings, 2 replies; 16+ messages in thread
From: Waiman Long @ 2016-01-29 19:30 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Alexander Viro
  Cc: linux-fsdevel, x86, linux-kernel, Peter Zijlstra, Andi Kleen,
	Scott J Norton, Douglas Hatch, Waiman Long

The inode_sb_list_add() and inode_sb_list_del() functions in the vfs
layer just perform list addition and deletion under lock. So they can
use the new list batching facility to speed up the list operations
when many CPUs are trying to do it simultaneously.

In particular, the inode_sb_list_del() function can be a performance
bottleneck when large applications with many threads and associated
inodes exit. This was measured with an exit microbenchmark that creates
a large number of threads, attaches many inodes to them and then exits.
The runtimes
of that microbenchmark with 1000 threads before and after the patch
on a 4-socket Intel E7-4820 v3 system (48 cores, 96 threads) were
as follows:

  Kernel        Elapsed Time    System Time
  ------        ------------    -----------
  Vanilla 4.4      65.29s         82m14s
  Patched 4.4      45.69s         49m44s

The elapsed time and the reported system time were reduced by 30%
and 40% respectively.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 fs/inode.c         |   13 +++++--------
 fs/super.c         |    1 +
 include/linux/fs.h |    2 ++
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 9f62db3..870de8c 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -424,19 +424,16 @@ static void inode_lru_list_del(struct inode *inode)
  */
 void inode_sb_list_add(struct inode *inode)
 {
-	spin_lock(&inode->i_sb->s_inode_list_lock);
-	list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
-	spin_unlock(&inode->i_sb->s_inode_list_lock);
+	do_list_batch(&inode->i_sb->s_inode_list_lock, lb_cmd_add,
+			&inode->i_sb->s_list_batch, &inode->i_sb_list);
 }
 EXPORT_SYMBOL_GPL(inode_sb_list_add);
 
 static inline void inode_sb_list_del(struct inode *inode)
 {
-	if (!list_empty(&inode->i_sb_list)) {
-		spin_lock(&inode->i_sb->s_inode_list_lock);
-		list_del_init(&inode->i_sb_list);
-		spin_unlock(&inode->i_sb->s_inode_list_lock);
-	}
+	if (!list_empty(&inode->i_sb_list))
+		do_list_batch(&inode->i_sb->s_inode_list_lock, lb_cmd_del_init,
+				&inode->i_sb->s_list_batch, &inode->i_sb_list);
 }
 
 static unsigned long hash(struct super_block *sb, unsigned long hashval)
diff --git a/fs/super.c b/fs/super.c
index 1182af8..b0e8540 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -206,6 +206,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 	mutex_init(&s->s_sync_lock);
 	INIT_LIST_HEAD(&s->s_inodes);
 	spin_lock_init(&s->s_inode_list_lock);
+	list_batch_init(&s->s_list_batch, &s->s_inodes);
 
 	if (list_lru_init_memcg(&s->s_dentry_lru))
 		goto fail;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1a20462..11d8b77 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -9,6 +9,7 @@
 #include <linux/stat.h>
 #include <linux/cache.h>
 #include <linux/list.h>
+#include <linux/list_batch.h>
 #include <linux/list_lru.h>
 #include <linux/llist.h>
 #include <linux/radix-tree.h>
@@ -1403,6 +1404,7 @@ struct super_block {
 	/* s_inode_list_lock protects s_inodes */
 	spinlock_t		s_inode_list_lock ____cacheline_aligned_in_smp;
 	struct list_head	s_inodes;	/* all inodes */
+	struct list_batch	s_list_batch;
 };
 
 extern struct timespec current_fs_time(struct super_block *sb);
-- 
1.7.1


* Re: [PATCH v2 3/3] vfs: Enable list batching for the superblock's inode list
  2016-01-29 19:30 ` [PATCH v2 3/3] vfs: Enable list batching for the superblock's inode list Waiman Long
@ 2016-01-30  8:35   ` Ingo Molnar
  2016-02-01 17:45     ` Andi Kleen
  2016-02-01 21:44     ` Waiman Long
  2016-02-01  0:04   ` Dave Chinner
  1 sibling, 2 replies; 16+ messages in thread
From: Ingo Molnar @ 2016-01-30  8:35 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Alexander Viro,
	linux-fsdevel, x86, linux-kernel, Peter Zijlstra, Andi Kleen,
	Scott J Norton, Douglas Hatch


* Waiman Long <Waiman.Long@hpe.com> wrote:

> The inode_sb_list_add() and inode_sb_list_del() functions in the vfs
> layer just perform list addition and deletion under lock. So they can
> use the new list batching facility to speed up the list operations
> when many CPUs are trying to do it simultaneously.
> 
> In particular, the inode_sb_list_del() function can be a performance
> bottleneck when large applications with many threads and associated
> inodes exit. With an exit microbenchmark that creates a large number
> of threads, attachs many inodes to them and then exits. The runtimes
> of that microbenchmark with 1000 threads before and after the patch
> on a 4-socket Intel E7-4820 v3 system (48 cores, 96 threads) were
> as follows:
> 
>   Kernel        Elapsed Time    System Time
>   ------        ------------    -----------
>   Vanilla 4.4      65.29s         82m14s
>   Patched 4.4      45.69s         49m44s
> 
> The elapsed time and the reported system time were reduced by 30%
> and 40% respectively.

That's pretty impressive!

I'm wondering, why are inode_sb_list_add()/del() even called for a presumably 
reasonably well cached benchmark running on a system with enough RAM? Are these 
perhaps thousands of temporary files, already deleted, and released when all the 
file descriptors are closed as part of sys_exit()?

If that's the case then I suspect an even bigger win would be not just to batch 
the (sb-)global list fiddling, but to potentially turn the sb list into a 
percpu_alloc() managed set of per CPU lists? It's a bigger change, but it could 
speed up a lot of other temporary file intensive usecases as well, not just 
batched delete.

Thanks,

	Ingo


* Re: [PATCH v2 3/3] vfs: Enable list batching for the superblock's inode list
  2016-01-29 19:30 ` [PATCH v2 3/3] vfs: Enable list batching for the superblock's inode list Waiman Long
  2016-01-30  8:35   ` Ingo Molnar
@ 2016-02-01  0:04   ` Dave Chinner
  2016-02-03 23:01     ` Waiman Long
  1 sibling, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2016-02-01  0:04 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Alexander Viro,
	linux-fsdevel, x86, linux-kernel, Peter Zijlstra, Andi Kleen,
	Scott J Norton, Douglas Hatch

On Fri, Jan 29, 2016 at 02:30:46PM -0500, Waiman Long wrote:
> The inode_sb_list_add() and inode_sb_list_del() functions in the vfs
> layer just perform list addition and deletion under lock. So they can
> use the new list batching facility to speed up the list operations
> when many CPUs are trying to do it simultaneously.
> 
> In particular, the inode_sb_list_del() function can be a performance
> bottleneck when large applications with many threads and associated
> inodes exit. With an exit microbenchmark that creates a large number
> of threads, attachs many inodes to them and then exits. The runtimes
> of that microbenchmark with 1000 threads before and after the patch

I've never seen sb inode list contention in typical workloads in
exit processing. Can you post the test script you are using?

The inode sb list contention I usually see is, more often than not, in
workloads that turn over the inode cache quickly (i.e. instantiating
lots of inodes through concurrent directory traversal or create
workloads). These are often latency sensitive, so I'm wondering what
spinning and waiting for batch processing on every contended add is
going to do to lookup performance...

> on a 4-socket Intel E7-4820 v3 system (48 cores, 96 threads) were
> as follows:
> 
>   Kernel        Elapsed Time    System Time
>   ------        ------------    -----------
>   Vanilla 4.4      65.29s         82m14s
>   Patched 4.4      45.69s         49m44s

I wonder if you'd get the same results on such a benchmark simply by
making the spin lock a mutex, thereby reducing the number of CPUs
spinning on a single lock cacheline at any one point in time.
Certainly the system time will plummet....

> The elapsed time and the reported system time were reduced by 30%
> and 40% respectively.
> 
> Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
> ---
>  fs/inode.c         |   13 +++++--------
>  fs/super.c         |    1 +
>  include/linux/fs.h |    2 ++
>  3 files changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index 9f62db3..870de8c 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -424,19 +424,16 @@ static void inode_lru_list_del(struct inode *inode)
>   */
>  void inode_sb_list_add(struct inode *inode)
>  {
> -	spin_lock(&inode->i_sb->s_inode_list_lock);
> -	list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
> -	spin_unlock(&inode->i_sb->s_inode_list_lock);
> +	do_list_batch(&inode->i_sb->s_inode_list_lock, lb_cmd_add,
> +			&inode->i_sb->s_list_batch, &inode->i_sb_list);

I don't like the API. This should simply be:

void inode_sb_list_add(struct inode *inode)
{
	list_batch_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
}

void inode_sb_list_del(struct inode *inode)
{
	list_batch_del(&inode->i_sb_list, &inode->i_sb->s_inodes);
}

And all the locks, lists and batch commands are internal to the
struct list_batch and the API implementation.
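
A rough sketch of what that encapsulated structure could look like (only
one possible reading of the suggestion, not code from the posted patches;
&inode->i_sb->s_inodes would then be a struct list_batch rather than a
bare list_head):

/* Sketch: struct list_batch owns the lock and the list head itself */
struct list_batch {
	spinlock_t		lock;
	struct list_head	list;
	struct list_batch_qnode	*tail;
};

static inline void list_batch_init(struct list_batch *batch)
{
	spin_lock_init(&batch->lock);
	INIT_LIST_HEAD(&batch->list);
	batch->tail = NULL;
}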

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v2 1/3] lib/list_batch: A simple list insertion/deletion batching facility
  2016-01-29 19:30 ` [PATCH v2 1/3] " Waiman Long
@ 2016-02-01  0:47   ` Dave Chinner
  2016-02-03 23:11     ` Waiman Long
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2016-02-01  0:47 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Alexander Viro,
	linux-fsdevel, x86, linux-kernel, Peter Zijlstra, Andi Kleen,
	Scott J Norton, Douglas Hatch

On Fri, Jan 29, 2016 at 02:30:44PM -0500, Waiman Long wrote:
> Linked list insertion or deletion under lock is a very common activity
> in the Linux kernel. If this is the only activity under lock, the
> locking overhead can be pretty large compared with the actual time
> spent on the insertion or deletion operation itself especially on a
> large system with many CPUs.
> 
> This patch introduces a simple list insertion/deletion batching
> facility where a group of list insertion and deletion operations are
> grouped together in a single batch under lock. This can reduce the
> locking overhead and improve overall system performance.
> 
> The fast path of this batching facility will be similar in performance
> to the "lock; listop; unlock;" sequence of the existing code. If
> the lock is not available, it will enter slowpath where the batching
> happens.
> 
> A new config option LIST_BATCHING is added so that we can control on
> which architecture do we want to have this facility enabled.
> 
> Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
....
> +#ifdef CONFIG_LIST_BATCHING
> +
> +extern void do_list_batch_slowpath(spinlock_t *lock, enum list_batch_cmd cmd,
> +				   struct list_batch *batch,
> +				   struct list_head *entry);
> +
> +/*
> + * The caller is expected to pass in a constant cmd parameter. As a
> + * result, most of unneeded code in the switch statement of _list_batch_cmd()
> + * will be optimized away. This should make the fast path almost as fast
> + * as the "lock; listop; unlock;" sequence it replaces.
> + */

This strikes me as needlessly complex. Simple inline functions are
much easier to read and verify correct, and we don't have to rely on
the compiler to optimise out dead code:

static inline void list_batch_add(struct list_head *entry,
				  struct list_batch *batch)
{
	if (!spin_trylock(&batch->lock))
		return do_list_batch_slowpath(entry, batch, lb_cmd_add);

	list_add(entry, &batch->list)
	spin_unlock(&batch->lock);
}

> +#include <linux/list_batch.h>
> +
> +/*
> + * List processing batch size = 128
> + *
> + * The batch size shouldn't be too large. Otherwise, it will be too unfair
> + * to the task doing the batch processing. It shouldn't be too small neither
> + * as the performance benefit will be reduced.
> + */
> +#define LB_BATCH_SIZE	(1 << 7)

Ok, so arbitrary operations are going to see longer delays when they
are selected as the batch processor. I'm not sure I really like this
idea, as it will be the first in the queue that sees contention
that takes the delay, which reduces the fairness of the operations.
i.e. the spinlock uses fair queuing, but now we can be grossly unfair
to the first spinner...

> +	/*
> +	 * We rely on the implictit memory barrier of xchg() to make sure
> +	 * that node initialization will be done before its content is being
> +	 * accessed by other CPUs.
> +	 */
> +	prev = xchg(&batch->tail, &node);
> +
> +	if (prev) {
> +		WRITE_ONCE(prev->next, &node);
> +		while (READ_ONCE(node.state) == lb_state_waiting)
> +			cpu_relax();
> +		if (node.state == lb_state_done)
> +			return;

So we spin waiting for the batch processor to process the
list, or

> +		WARN_ON(node.state != lb_state_batch);

tell us we are not the batch processor.

So, effectively, the reduction in runtime is due to the fact that the
list operations spin on their own cache line rather than the spin
lock cacheline until they have been processed and/or made the batch
processor?

> +	}
> +
> +	/*
> +	 * We are now the queue head, we should acquire the lock and
> +	 * process a batch of qnodes.
> +	 */
> +	loop = LB_BATCH_SIZE;
> +	next = &node;
> +	spin_lock(lock);
> +
> +do_list_again:
> +	do {

While we are batch processing, all operations will fail the
trylock and add themselves to the tail of the queue, and spin on
their own cacheline at that point. So it doesn't reduce the amount
of spinning, just removes the cacheline contention that slows the
spinning.

Hmmm - there's another point of unfairness - when switching batch
processors, other add/delete operations can get the list lock and
perform their operations directly, thereby jumping the batch
queue....

So at what point does simply replacing the list_head with a list_lru
become more efficient than this batch processing (i.e.
https://lkml.org/lkml/2015/3/10/660)?  The list_lru isn't a great
fit for the inode list (doesn't need any of the special LRU/memcg
stuff https://lkml.org/lkml/2015/3/16/261) but it will tell us if,
like Ingo suggested, moving more towards a generic per-cpu list
would provide better overall performance...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v2 3/3] vfs: Enable list batching for the superblock's inode list
  2016-01-30  8:35   ` Ingo Molnar
@ 2016-02-01 17:45     ` Andi Kleen
  2016-02-01 22:03       ` Waiman Long
  2016-02-01 21:44     ` Waiman Long
  1 sibling, 1 reply; 16+ messages in thread
From: Andi Kleen @ 2016-02-01 17:45 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Waiman Long, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Alexander Viro, linux-fsdevel, x86, linux-kernel, Peter Zijlstra,
	Andi Kleen, Scott J Norton, Douglas Hatch

> I'm wondering, why are inode_sb_list_add()/del() even called for a presumably 
> reasonably well cached benchmark running on a system with enough RAM? Are these 
> perhaps thousands of temporary files, already deleted, and released when all the 
> file descriptors are closed as part of sys_exit()?
> 
> If that's the case then I suspect an even bigger win would be not just to batch 
> the (sb-)global list fiddling, but to potentially turn the sb list into a 
> percpu_alloc() managed set of per CPU lists? It's a bigger change, but it could 

We had such a patch in the lock elision patchkit (it avoided a lot
of cache line bouncing leading to aborts):

https://git.kernel.org/cgit/linux/kernel/git/ak/linux-misc.git/commit/?h=hle315/combined&id=f1cf9e715a40f44086662ae3b29f123cf059cbf4

-Andi


* Re: [PATCH v2 3/3] vfs: Enable list batching for the superblock's inode list
  2016-01-30  8:35   ` Ingo Molnar
  2016-02-01 17:45     ` Andi Kleen
@ 2016-02-01 21:44     ` Waiman Long
  1 sibling, 0 replies; 16+ messages in thread
From: Waiman Long @ 2016-02-01 21:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Alexander Viro,
	linux-fsdevel, x86, linux-kernel, Peter Zijlstra, Andi Kleen,
	Scott J Norton, Douglas Hatch

[-- Attachment #1: Type: text/plain, Size: 2280 bytes --]

On 01/30/2016 03:35 AM, Ingo Molnar wrote:
> * Waiman Long<Waiman.Long@hpe.com>  wrote:
>
>> The inode_sb_list_add() and inode_sb_list_del() functions in the vfs
>> layer just perform list addition and deletion under lock. So they can
>> use the new list batching facility to speed up the list operations
>> when many CPUs are trying to do it simultaneously.
>>
>> In particular, the inode_sb_list_del() function can be a performance
>> bottleneck when large applications with many threads and associated
>> inodes exit. With an exit microbenchmark that creates a large number
>> of threads, attachs many inodes to them and then exits. The runtimes
>> of that microbenchmark with 1000 threads before and after the patch
>> on a 4-socket Intel E7-4820 v3 system (48 cores, 96 threads) were
>> as follows:
>>
>>    Kernel        Elapsed Time    System Time
>>    ------        ------------    -----------
>>    Vanilla 4.4      65.29s         82m14s
>>    Patched 4.4      45.69s         49m44s
>>
>> The elapsed time and the reported system time were reduced by 30%
>> and 40% respectively.
> That's pretty impressive!
>
> I'm wondering, why are inode_sb_list_add()/del() even called for a presumably
> reasonably well cached benchmark running on a system with enough RAM? Are these
> perhaps thousands of temporary files, already deleted, and released when all the
> file descriptors are closed as part of sys_exit()?

The inodes that need to be deleted were actually procfs files which have
to go away when the processes/threads exit. I encountered this problem
when running the SPECjbb2013 benchmark on a large machine, where it would
sometimes seem to hang for 30 minutes or so after the benchmark completed.
I wrote a simple microbenchmark to simulate this situation; it is in the
attachment.


> If that's the case then I suspect an even bigger win would be not just to batch
> the (sb-)global list fiddling, but to potentially turn the sb list into a
> percpu_alloc() managed set of per CPU lists? It's a bigger change, but it could
> speed up a lot of other temporary file intensive usecases as well, not just
> batched delete.
>
> Thanks,
>
> 	Ingo

Yes, that is another possibility. I will investigate that further.
Thanks for the suggestion.

Cheers,
Longman


[-- Attachment #2: exit_test.c --]
[-- Type: text/plain, Size: 2665 bytes --]

/*
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * Authors: Waiman Long <waiman.long@hp.com>
 */
/*
 * This is an exit test
 */
#include <ctype.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/syscall.h>


#define do_exit()	syscall(SYS_exit)
#define	gettid()	syscall(SYS_gettid)
#define	MAX_THREADS	2048

static inline void cpu_relax(void)
{
        __asm__ __volatile__("rep;nop": : :"memory");
}

static inline void atomic_inc(volatile int *v)
{
	__asm__ __volatile__("lock incl %0": "+m" (*v));
}

static volatile int exit_now  = 0;
static volatile int threadcnt = 0;

/*
 * Walk the /proc/<pid> filesystem to make them fill the dentry cache
 */
static void walk_procfs(void)
{
	char cmdbuf[256];
	pid_t tid = gettid();

	snprintf(cmdbuf, sizeof(cmdbuf), "find /proc/%d > /dev/null 2>&1", tid);
	if (system(cmdbuf) < 0)
		perror("system() failed!");
}

static void *exit_thread(void *dummy)
{
	long tid = (long)dummy;

	walk_procfs();
	atomic_inc(&threadcnt);
	/*
	 * Wait until the exit_now flag is set and then call do_exit()
	 */
	while (!exit_now)
		sleep(1);
	do_exit();
}

static void exit_test(int threads)
{
	pthread_t thread[threads];
	long i = 0;
	time_t start = time(NULL);

	while (i++ < threads) {
		if (pthread_create(thread + i - 1, NULL, exit_thread,
				  (void *)i)) {
			perror("pthread_create");
			exit(1);
		}
#if 0
		/*
		 * Pipelining to reduce contention & improve speed
		 */
		if ((i & 0xf) == 0)
			 while (i - threadcnt > 12)
				usleep(1);
#endif
	}
	while (threadcnt != threads)
		usleep(1);
	walk_procfs();
	printf("Setup time = %lus\n", time(NULL) - start);
	printf("Process ready to exit!\n");
	kill(0, SIGKILL);
	exit(0);
}

int main(int argc, char *argv[])
{
	int   tcnt;	/* Thread counts */
	char *cmd = argv[0];

	if ((argc != 2) || !isdigit(argv[1][0])) {
		fprintf(stderr, "Usage: %s <thread count>\n", cmd);
		exit(1);
	}
	tcnt = strtoul(argv[1], NULL, 10);
	if (tcnt > MAX_THREADS) {
		fprintf(stderr, "Error: thread count should be <= %d\n",
			MAX_THREADS);
		exit(1);
	}
	exit_test(tcnt);
	return 0;	/* Not reachable */
}


* Re: [PATCH v2 3/3] vfs: Enable list batching for the superblock's inode list
  2016-02-01 17:45     ` Andi Kleen
@ 2016-02-01 22:03       ` Waiman Long
  2016-02-03 22:59         ` Waiman Long
  0 siblings, 1 reply; 16+ messages in thread
From: Waiman Long @ 2016-02-01 22:03 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Alexander Viro, linux-fsdevel, x86, linux-kernel, Peter Zijlstra,
	Scott J Norton, Douglas Hatch

On 02/01/2016 12:45 PM, Andi Kleen wrote:
>> I'm wondering, why are inode_sb_list_add()/del() even called for a presumably
>> reasonably well cached benchmark running on a system with enough RAM? Are these
>> perhaps thousands of temporary files, already deleted, and released when all the
>> file descriptors are closed as part of sys_exit()?
>>
>> If that's the case then I suspect an even bigger win would be not just to batch
>> the (sb-)global list fiddling, but to potentially turn the sb list into a
>> percpu_alloc() managed set of per CPU lists? It's a bigger change, but it could
> We had such a patch in the lock elision patchkit (It avoided a lot
> of cache line bouncing leading to aborts)
>
> https://git.kernel.org/cgit/linux/kernel/git/ak/linux-misc.git/commit/?h=hle315/combined&id=f1cf9e715a40f44086662ae3b29f123cf059cbf4
>
> -Andi
>
>

I like your patch, though it cannot be applied cleanly to the current
upstream kernel. I will port it to the current kernel and run my
microbenchmark to see what performance gain I can get.

Cheers,
Longman


* Re: [PATCH v2 3/3] vfs: Enable list batching for the superblock's inode list
  2016-02-01 22:03       ` Waiman Long
@ 2016-02-03 22:59         ` Waiman Long
  2016-02-06 23:51           ` Dave Chinner
  0 siblings, 1 reply; 16+ messages in thread
From: Waiman Long @ 2016-02-03 22:59 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Alexander Viro, linux-fsdevel, x86, linux-kernel, Peter Zijlstra,
	Scott J Norton, Douglas Hatch

On 02/01/2016 05:03 PM, Waiman Long wrote:
> On 02/01/2016 12:45 PM, Andi Kleen wrote:
>>> I'm wondering, why are inode_sb_list_add()/del() even called for a 
>>> presumably
>>> reasonably well cached benchmark running on a system with enough 
>>> RAM? Are these
>>> perhaps thousands of temporary files, already deleted, and released 
>>> when all the
>>> file descriptors are closed as part of sys_exit()?
>>>
>>> If that's the case then I suspect an even bigger win would be not 
>>> just to batch
>>> the (sb-)global list fiddling, but to potentially turn the sb list 
>>> into a
>>> percpu_alloc() managed set of per CPU lists? It's a bigger change, 
>>> but it could
>> We had such a patch in the lock elision patchkit (It avoided a lot
>> of cache line bouncing leading to aborts)
>>
>> https://git.kernel.org/cgit/linux/kernel/git/ak/linux-misc.git/commit/?h=hle315/combined&id=f1cf9e715a40f44086662ae3b29f123cf059cbf4 
>>
>>
>> -Andi
>>
>>
>
> I like your patch though it cannot be applied cleanly for the current 
> upstream kernel. I will port it to the current kernel and run my 
> microbenchmark to see what performance gain I can get.
>

Unfortunately, using a per-cpu list didn't have the performance benefit
that I expected. I saw maybe a 1% or 2% performance increase, but
nothing significant. I guess the bulk of the performance improvement in 
my patch is in the elimination of most of the cacheline transfer 
latencies when the lock ownership is passed from one CPU to another. 
Those latencies are still there even if we use the per-cpu list.

Cheers,
Longman


* Re: [PATCH v2 3/3] vfs: Enable list batching for the superblock's inode list
  2016-02-01  0:04   ` Dave Chinner
@ 2016-02-03 23:01     ` Waiman Long
  0 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2016-02-03 23:01 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Alexander Viro,
	linux-fsdevel, x86, linux-kernel, Peter Zijlstra, Andi Kleen,
	Scott J Norton, Douglas Hatch

On 01/31/2016 07:04 PM, Dave Chinner wrote:
> On Fri, Jan 29, 2016 at 02:30:46PM -0500, Waiman Long wrote:
>> The inode_sb_list_add() and inode_sb_list_del() functions in the vfs
>> layer just perform list addition and deletion under lock. So they can
>> use the new list batching facility to speed up the list operations
>> when many CPUs are trying to do it simultaneously.
>>
>> In particular, the inode_sb_list_del() function can be a performance
>> bottleneck when large applications with many threads and associated
>> inodes exit. With an exit microbenchmark that creates a large number
>> of threads, attachs many inodes to them and then exits. The runtimes
>> of that microbenchmark with 1000 threads before and after the patch
> I've never seen sb inode list contention in typical workloads in
> exit processing. Can you post the test script you are using?

I have posted it in one of my earlier emails.


> The inode sb list contention I usually often than not, it's
> workloads that turn over the inode cache quickly (i.e. instantiating
> lots of inodes through concurrent directory traversal or create
> workloads). These are often latency sensitive, so I'm wondering what
> the effect of spinning waiting for batch processing on every
> contended add is going to do to lookup performance...

I think the batch processor will see a higher latency, but the others
will see a shorter one. If each CPU has a more or less equal chance of
becoming the batch processor, the overall impact on system performance
should not be that significant.

>> on a 4-socket Intel E7-4820 v3 system (48 cores, 96 threads) were
>> as follows:
>>
>>    Kernel        Elapsed Time    System Time
>>    ------        ------------    -----------
>>    Vanilla 4.4      65.29s         82m14s
>>    Patched 4.4      45.69s         49m44s
> I wonder if you'd get the same results on such a benchmark simply by
> making the spin lock a mutex, thereby reducing the number of CPUs
> spinning on a single lock cacheline at any one point in time.
> Certainly the system time will plummet....

I don't think it is a good idea to use a mutex, as we can't sleep.

>> The elapsed time and the reported system time were reduced by 30%
>> and 40% respectively.
>>
>> Signed-off-by: Waiman Long<Waiman.Long@hpe.com>
>> ---
>>   fs/inode.c         |   13 +++++--------
>>   fs/super.c         |    1 +
>>   include/linux/fs.h |    2 ++
>>   3 files changed, 8 insertions(+), 8 deletions(-)
>>
>> diff --git a/fs/inode.c b/fs/inode.c
>> index 9f62db3..870de8c 100644
>> --- a/fs/inode.c
>> +++ b/fs/inode.c
>> @@ -424,19 +424,16 @@ static void inode_lru_list_del(struct inode *inode)
>>    */
>>   void inode_sb_list_add(struct inode *inode)
>>   {
>> -	spin_lock(&inode->i_sb->s_inode_list_lock);
>> -	list_add(&inode->i_sb_list,&inode->i_sb->s_inodes);
>> -	spin_unlock(&inode->i_sb->s_inode_list_lock);
>> +	do_list_batch(&inode->i_sb->s_inode_list_lock, lb_cmd_add,
>> +			&inode->i_sb->s_list_batch,&inode->i_sb_list);
> I don't like the API. This should simply be:
>
> void inode_sb_list_add(struct inode *inode)
> {
> 	list_batch_add(&inode->i_sb_list,&inode->i_sb->s_inodes);
> }
>
> void inode_sb_list_del(struct inode *inode)
> {
> 	list_batch_del(&inode->i_sb_list,&inode->i_sb->s_inodes);
> }
>
> And all the locks, lists and batch commands are internal to the
> struct list_batch and the API implementation.
>

Points taken. I will update the patch to do that. Thanks for the review.

Cheers,
Longman


* Re: [PATCH v2 1/3] lib/list_batch: A simple list insertion/deletion batching facility
  2016-02-01  0:47   ` Dave Chinner
@ 2016-02-03 23:11     ` Waiman Long
  2016-02-06 23:57       ` Dave Chinner
  0 siblings, 1 reply; 16+ messages in thread
From: Waiman Long @ 2016-02-03 23:11 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Alexander Viro,
	linux-fsdevel, x86, linux-kernel, Peter Zijlstra, Andi Kleen,
	Scott J Norton, Douglas Hatch

On 01/31/2016 07:47 PM, Dave Chinner wrote:
> On Fri, Jan 29, 2016 at 02:30:44PM -0500, Waiman Long wrote:
>> Linked list insertion or deletion under lock is a very common activity
>> in the Linux kernel. If this is the only activity under lock, the
>> locking overhead can be pretty large compared with the actual time
>> spent on the insertion or deletion operation itself especially on a
>> large system with many CPUs.
>>
>> This patch introduces a simple list insertion/deletion batching
>> facility where a group of list insertion and deletion operations are
>> grouped together in a single batch under lock. This can reduce the
>> locking overhead and improve overall system performance.
>>
>> The fast path of this batching facility will be similar in performance
>> to the "lock; listop; unlock;" sequence of the existing code. If
>> the lock is not available, it will enter slowpath where the batching
>> happens.
>>
>> A new config option LIST_BATCHING is added so that we can control on
>> which architecture do we want to have this facility enabled.
>>
>> Signed-off-by: Waiman Long<Waiman.Long@hpe.com>
> ....
>> +#ifdef CONFIG_LIST_BATCHING
>> +
>> +extern void do_list_batch_slowpath(spinlock_t *lock, enum list_batch_cmd cmd,
>> +				   struct list_batch *batch,
>> +				   struct list_head *entry);
>> +
>> +/*
>> + * The caller is expected to pass in a constant cmd parameter. As a
>> + * result, most of unneeded code in the switch statement of _list_batch_cmd()
>> + * will be optimized away. This should make the fast path almost as fast
>> + * as the "lock; listop; unlock;" sequence it replaces.
>> + */
> This strikes me as needlessly complex. Simple inline functions are
> much easier to read and verify correct, and we don't have to rely on
> the compiler to optimise out dead code:
>
> static inline void list_batch_add(struct list_head *entry,
> 				  struct list_batch *batch)
> {
> 	if (!spin_trylock(&batch->lock))
> 		return do_list_batch_slowpath(entry, batch, lb_cmd_add);
>
> 	list_add(entry,&batch->list)
> 	spin_unlock(&batch->lock);
> }

Will do so.

>> +#include<linux/list_batch.h>
>> +
>> +/*
>> + * List processing batch size = 128
>> + *
>> + * The batch size shouldn't be too large. Otherwise, it will be too unfair
>> + * to the task doing the batch processing. It shouldn't be too small neither
>> + * as the performance benefit will be reduced.
>> + */
>> +#define LB_BATCH_SIZE	(1<<  7)
> Ok, so arbitrary operations are going to see longer delays when they
> are selected as the batch processor. I'm not sure I really like this
> idea, as it will be the first in the queue that sees contention
> that takes the delay which reduces the fairness of the operations.
> i.e. the spinlock uses fair queuing, but now we can be grossly unfair
> the to the first spinner...
>

That is certainly true. It is well known that a bit of unfairness can
often improve overall system performance. Interrupt handling, for
example, is also unfair to the process currently running on the CPU. The
amount of unfairness is controlled by the batch size parameter. Maybe we
can make this parameter a read-mostly variable set up at boot time whose
value depends on the number of CPUs in the system, so that smaller
systems can have a smaller batch size. That will reduce the unfairness,
at least on smaller systems.
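
A minimal sketch of that idea, assuming LB_BATCH_SIZE in lib/list_batch.c
becomes a variable set from an early initcall (the scaling factor and the
bounds below are made-up placeholders):

/* Would replace the LB_BATCH_SIZE constant in lib/list_batch.c */
static unsigned int lb_batch_size __read_mostly = 128;

static int __init lb_batch_size_setup(void)
{
	/*
	 * Scale the batch size with the number of possible CPUs,
	 * clamped to [16, 128]. The factor of 2 is arbitrary.
	 */
	lb_batch_size = clamp(2 * num_possible_cpus(), 16U, 128U);
	return 0;
}
early_initcall(lb_batch_size_setup);

The slowpath would then start with "loop = lb_batch_size;" instead of
"loop = LB_BATCH_SIZE;".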

>> +	/*
>> +	 * We rely on the implictit memory barrier of xchg() to make sure
>> +	 * that node initialization will be done before its content is being
>> +	 * accessed by other CPUs.
>> +	 */
>> +	prev = xchg(&batch->tail,&node);
>> +
>> +	if (prev) {
>> +		WRITE_ONCE(prev->next,&node);
>> +		while (READ_ONCE(node.state) == lb_state_waiting)
>> +			cpu_relax();
>> +		if (node.state == lb_state_done)
>> +			return;
> So we spin waiting for the batch processor to process the
> list, or
>
>> +		WARN_ON(node.state != lb_state_batch);
> tell us we are not the batch processor.
>
> So, effectively, the reduction in runtime is due to the fact the
> list operations spin on their own cache line rather than the spin
> lock cacheline until they have been processed and/or made the batch
> processor?

Yes, that can be a major part of it.

>> +	}
>> +
>> +	/*
>> +	 * We are now the queue head, we should acquire the lock and
>> +	 * process a batch of qnodes.
>> +	 */
>> +	loop = LB_BATCH_SIZE;
>> +	next =&node;
>> +	spin_lock(lock);
>> +
>> +do_list_again:
>> +	do {
> While we are batch processing, all operations will fail the
> trylock and add themselves to the tail of the queue, and spin on
> their own cacheline at that point. So it doesn't reduce the amount
> of spinning, just removes the cacheline contention that slows the
> spinning.
>
> Hmmm - there's another point of unfairness - when switching batch
> processors, other add/delete operations can get the list lock and
> perform their operations directly, thereby jumping the batch
> queue....

That is true too.

> So at what point does simply replacing the list_head with a list_lru
> become more efficient than this batch processing (i.e.
> https://lkml.org/lkml/2015/3/10/660)?  The list_lru isn't a great
> fit for the inode list (doesn't need any of the special LRU/memcg
> stuff https://lkml.org/lkml/2015/3/16/261) but it will tell us if,
> like Ingo suggested, moving more towards a generic per-cpu list
> would provide better overall performance...

I will take a look at the list_lru patch to see if that helps. As for the
per-cpu list, I tried that and it didn't quite work out.

Thanks,
Longman


* Re: [PATCH v2 3/3] vfs: Enable list batching for the superblock's inode list
  2016-02-03 22:59         ` Waiman Long
@ 2016-02-06 23:51           ` Dave Chinner
  0 siblings, 0 replies; 16+ messages in thread
From: Dave Chinner @ 2016-02-06 23:51 UTC (permalink / raw)
  To: Waiman Long
  Cc: Andi Kleen, Ingo Molnar, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, linux-fsdevel, x86, linux-kernel,
	Peter Zijlstra, Scott J Norton, Douglas Hatch

On Wed, Feb 03, 2016 at 05:59:17PM -0500, Waiman Long wrote:
> On 02/01/2016 05:03 PM, Waiman Long wrote:
> >On 02/01/2016 12:45 PM, Andi Kleen wrote:
> >>>I'm wondering, why are inode_sb_list_add()/del() even called
> >>>for a presumably
> >>>reasonably well cached benchmark running on a system with
> >>>enough RAM? Are these
> >>>perhaps thousands of temporary files, already deleted, and
> >>>released when all the
> >>>file descriptors are closed as part of sys_exit()?
> >>>
> >>>If that's the case then I suspect an even bigger win would be
> >>>not just to batch
> >>>the (sb-)global list fiddling, but to potentially turn the sb
> >>>list into a
> >>>percpu_alloc() managed set of per CPU lists? It's a bigger
> >>>change, but it could
> >>We had such a patch in the lock elision patchkit (It avoided a lot
> >>of cache line bouncing leading to aborts)
> >>
> >>https://git.kernel.org/cgit/linux/kernel/git/ak/linux-misc.git/commit/?h=hle315/combined&id=f1cf9e715a40f44086662ae3b29f123cf059cbf4
> >>
> >>
> >>-Andi
> >>
> >>
> >
> >I like your patch though it cannot be applied cleanly for the
> >current upstream kernel. I will port it to the current kernel and
> >run my microbenchmark to see what performance gain I can get.
> >
> 
> Unfortunately, using per-cpu list didn't have the performance
> benefit that I expected. I saw maybe 1 or 2% of performance
> increase, but nothing significant. I guess the bulk of the
> performance improvement in my patch is in the elimination of most of
> the cacheline transfer latencies when the lock ownership is passed
> from one CPU to another. Those latencies are still there even if we
> use the per-cpu list.

Now that I look at the above patch, it doesn't get rid of the global
list lock. Hence it won't change any of the existing global lock
cacheline contention. The list structure change to per-cpu is
completely irrelevant because it doesn't address the problem being
seen.

A proper per-cpu list implementation will provide either per-cpu
locks or some other mechanism to protect each list and eliminate a
large amount of global cacheline bouncing. Given this, I would not
use the above patch and results as a reason for saying this approach
will not work, or be a better solution....
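
For illustration, a minimal sketch of the kind of structure being
described here (all names are invented, and deletion would additionally
require each entry to remember which per-cpu list it lives on):

/* Sketch: one list and one lock per CPU */
struct pcpu_list_head {
	struct list_head	list;
	spinlock_t		lock;
} ____cacheline_aligned_in_smp;

/* e.g. heads = alloc_percpu(struct pcpu_list_head); */
static void pcpu_list_add(struct pcpu_list_head __percpu *heads,
			  struct list_head *entry)
{
	/* get_cpu_ptr() pins us to a CPU while we pick its list */
	struct pcpu_list_head *h = get_cpu_ptr(heads);

	spin_lock(&h->lock);
	list_add(entry, &h->list);
	spin_unlock(&h->lock);
	put_cpu_ptr(heads);
}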

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v2 1/3] lib/list_batch: A simple list insertion/deletion batching facility
  2016-02-03 23:11     ` Waiman Long
@ 2016-02-06 23:57       ` Dave Chinner
  2016-02-17  1:37         ` Waiman Long
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2016-02-06 23:57 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Alexander Viro,
	linux-fsdevel, x86, linux-kernel, Peter Zijlstra, Andi Kleen,
	Scott J Norton, Douglas Hatch

On Wed, Feb 03, 2016 at 06:11:56PM -0500, Waiman Long wrote:
> On 01/31/2016 07:47 PM, Dave Chinner wrote:
> >So at what point does simply replacing the list_head with a list_lru
> >become more efficient than this batch processing (i.e.
> >https://lkml.org/lkml/2015/3/10/660)?  The list_lru isn't a great
> >fit for the inode list (doesn't need any of the special LRU/memcg
> >stuff https://lkml.org/lkml/2015/3/16/261) but it will tell us if,
> >like Ingo suggested, moving more towards a generic per-cpu list
> >would provide better overall performance...
> 
> I will take a look at the list_lru patch to see if that help. As for
> the per-cpu list, I tried that and it didn't quite work out.

OK, see my last email as to why Andi's patch didn't change anything.
The list_lru implementation has a list per node, a lock per node,
and each item is placed on the list for the node it is physically
allocated from. Hence for local workloads, the list/lock that is
accessed for add/remove should be local to the node and hence should
reduce cache line contention mostly to within a single node.
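
For reference, a rough sketch of what the add/del side of such a
conversion could look like (s_inodes_lru is a hypothetical field in
struct super_block, and the existing s_inodes walkers would also have to
be converted to list_lru_walk(), which is not shown):

#include <linux/list_lru.h>

void inode_sb_list_add(struct inode *inode)
{
	list_lru_add(&inode->i_sb->s_inodes_lru, &inode->i_sb_list);
}

static inline void inode_sb_list_del(struct inode *inode)
{
	if (!list_empty(&inode->i_sb_list))
		list_lru_del(&inode->i_sb->s_inodes_lru, &inode->i_sb_list);
}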

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v2 1/3] lib/list_batch: A simple list insertion/deletion batching facility
  2016-02-06 23:57       ` Dave Chinner
@ 2016-02-17  1:37         ` Waiman Long
  0 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2016-02-17  1:37 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Alexander Viro,
	linux-fsdevel, x86, linux-kernel, Peter Zijlstra, Andi Kleen,
	Scott J Norton, Douglas Hatch

On 02/06/2016 06:57 PM, Dave Chinner wrote:
> On Wed, Feb 03, 2016 at 06:11:56PM -0500, Waiman Long wrote:
>> On 01/31/2016 07:47 PM, Dave Chinner wrote:
>>> So at what point does simply replacing the list_head with a list_lru
>>> become more efficient than this batch processing (i.e.
>>> https://lkml.org/lkml/2015/3/10/660)?  The list_lru isn't a great
>>> fit for the inode list (doesn't need any of the special LRU/memcg
>>> stuff https://lkml.org/lkml/2015/3/16/261) but it will tell us if,
>>> like Ingo suggested, moving more towards a generic per-cpu list
>>> would provide better overall performance...
>> I will take a look at the list_lru patch to see if that help. As for
>> the per-cpu list, I tried that and it didn't quite work out.
> OK, see my last email as to why Andi's patch didn't change anything.
> The list_lru implementation has a list per node, a lock per node,
> and each item is placed on the list for the node it is physically
> allocated from. Hence for local workloads, the list/lock that is
> accessed for add/remove should be local to the node and hence should
> reduce cache line contention mostly to within a single node.
>
> Cheers,
>
> Dave.

I have just sent out a new patchset using per-cpu lists with per-cpu
locks. I used the per-cpu list approach as the changes are simpler and
easier to review. Please let me know your thoughts on that.

Thanks,
Longman


Thread overview: 16+ messages
2016-01-29 19:30 [PATCH v2 0/3] lib/list_batch: A simple list insertion/deletion batching facility Waiman Long
2016-01-29 19:30 ` [PATCH v2 1/3] " Waiman Long
2016-02-01  0:47   ` Dave Chinner
2016-02-03 23:11     ` Waiman Long
2016-02-06 23:57       ` Dave Chinner
2016-02-17  1:37         ` Waiman Long
2016-01-29 19:30 ` [PATCH v2 2/3] lib/list_batch, x86: Enable list insertion/deletion batching for x86 Waiman Long
2016-01-29 19:30 ` [PATCH v2 3/3] vfs: Enable list batching for the superblock's inode list Waiman Long
2016-01-30  8:35   ` Ingo Molnar
2016-02-01 17:45     ` Andi Kleen
2016-02-01 22:03       ` Waiman Long
2016-02-03 22:59         ` Waiman Long
2016-02-06 23:51           ` Dave Chinner
2016-02-01 21:44     ` Waiman Long
2016-02-01  0:04   ` Dave Chinner
2016-02-03 23:01     ` Waiman Long
