* [PATCH v2 0/3] lib/list_batch: A simple list insertion/deletion batching facility

From: Waiman Long @ 2016-01-29 19:30 UTC
To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Alexander Viro
Cc: linux-fsdevel, x86, linux-kernel, Peter Zijlstra, Andi Kleen,
    Scott J Norton, Douglas Hatch, Waiman Long

v1->v2:
 - Address feedback from PeterZ and Andi Kleen.

This patchset introduces a simple list insertion/deletion batching
facility that batches multiple list insertion and deletion operations
into a single lock/unlock critical section.

Patch 1 introduces this new facility.

Patch 2 enables it for the x86 architecture.

Patch 3 makes insertion into and deletion from the VFS superblock's
inode list use the new list batching functions.

On x86-64, the generated code (gcc v4.8.5) of the inode_sb_list_add()
function without the patch was:

   0x0000000000001030 <+0>:     callq  0x1035 <inode_sb_list_add+5>
   0x0000000000001035 <+5>:     push   %rbp
   0x0000000000001036 <+6>:     mov    %rsp,%rbp
   0x0000000000001039 <+9>:     push   %rbx
   0x000000000000103a <+10>:    mov    0x28(%rdi),%rax
   0x000000000000103e <+14>:    mov    %rdi,%rbx
   0x0000000000001041 <+17>:    lea    0x600(%rax),%rdi
   0x0000000000001048 <+24>:    callq  0x104d <inode_sb_list_add+29>
   0x000000000000104d <+29>:    mov    0x28(%rbx),%rsi
   0x0000000000001051 <+33>:    lea    0x120(%rbx),%rdi
   0x0000000000001058 <+40>:    mov    0x608(%rsi),%rdx
   0x000000000000105f <+47>:    add    $0x608,%rsi
   0x0000000000001066 <+54>:    callq  0x106b <inode_sb_list_add+59>
   0x000000000000106b <+59>:    mov    0x28(%rbx),%rdi
   0x000000000000106f <+63>:    add    $0x600,%rdi
   0x0000000000001076 <+70>:    callq  *0x0
   0x000000000000107d <+77>:    pop    %rbx
   0x000000000000107e <+78>:    pop    %rbp
   0x000000000000107f <+79>:    retq

With the patch, the function became:

   0x0000000000001030 <+0>:     callq  0x1035 <inode_sb_list_add+5>
   0x0000000000001035 <+5>:     push   %rbp
   0x0000000000001036 <+6>:     mov    %rsp,%rbp
   0x0000000000001039 <+9>:     push   %r13
   0x000000000000103b <+11>:    lea    0x120(%rdi),%r13
   0x0000000000001042 <+18>:    push   %r12
   0x0000000000001044 <+20>:    push   %rbx
   0x0000000000001045 <+21>:    mov    0x28(%rdi),%r12
   0x0000000000001049 <+25>:    lea    0x600(%r12),%rbx
   0x0000000000001051 <+33>:    mov    %rbx,%rdi
   0x0000000000001054 <+36>:    callq  0x1059 <inode_sb_list_add+41>
   0x0000000000001059 <+41>:    test   %eax,%eax
   0x000000000000105b <+43>:    je     0x1081 <inode_sb_list_add+81>
   0x000000000000105d <+45>:    mov    0x618(%r12),%rsi
   0x0000000000001065 <+53>:    mov    %r13,%rdi
   0x0000000000001068 <+56>:    mov    (%rsi),%rdx
   0x000000000000106b <+59>:    callq  0x1070 <inode_sb_list_add+64>
   0x0000000000001070 <+64>:    mov    %rbx,%rdi
   0x0000000000001073 <+67>:    callq  *0x0
   0x000000000000107a <+74>:    pop    %rbx
   0x000000000000107b <+75>:    pop    %r12
   0x000000000000107d <+77>:    pop    %r13
   0x000000000000107f <+79>:    pop    %rbp
   0x0000000000001080 <+80>:    retq
   0x0000000000001081 <+81>:    lea    0x618(%r12),%rdx
   0x0000000000001089 <+89>:    mov    %r13,%rcx
   0x000000000000108c <+92>:    mov    %rbx,%rsi
   0x000000000000108f <+95>:    xor    %edi,%edi
   0x0000000000001091 <+97>:    callq  0x1096 <inode_sb_list_add+102>
   0x0000000000001096 <+102>:   jmp    0x107a <inode_sb_list_add+74>

For the fastpath, the only overheads are the test of the spin_trylock()
return value and the push/pop of two more registers. Other than that,
the two versions look very similar.
Waiman Long (3):
  lib/list_batch: A simple list insertion/deletion batching facility
  lib/list_batch, x86: Enable list insertion/deletion batching for x86
  vfs: Enable list batching for the superblock's inode list

 arch/x86/Kconfig           |    1 +
 fs/inode.c                 |   13 ++---
 fs/super.c                 |    1 +
 include/linux/fs.h         |    2 +
 include/linux/list_batch.h |  133 ++++++++++++++++++++++++++++++++++++++++++++
 lib/Kconfig                |    7 ++
 lib/Makefile               |    1 +
 lib/list_batch.c           |  125 +++++++++++++++++++++++++++++++++++++++++
 8 files changed, 275 insertions(+), 8 deletions(-)
 create mode 100644 include/linux/list_batch.h
 create mode 100644 lib/list_batch.c
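To connect the assembly comparison above back to source, the C shape of
the patched fast path is roughly the following - a sketch pieced
together from patches 1 and 3 below, with the slowpath internals
elided; see the patches themselves for the authoritative definitions:

	void inode_sb_list_add(struct inode *inode)
	{
		spinlock_t *lock = &inode->i_sb->s_inode_list_lock;
		struct list_batch *batch = &inode->i_sb->s_list_batch;

		/* Fast path: uncontended lock, do the list op directly */
		if (likely(spin_trylock(lock))) {
			list_add(&inode->i_sb_list, batch->list);
			spin_unlock(lock);
			return;
		}
		/* Slow path: queue the op; one CPU processes a whole batch */
		do_list_batch_slowpath(lock, lb_cmd_add, batch,
				       &inode->i_sb_list);
	}

The slowpath call only happens when spin_trylock() fails, which is why
the uncontended case adds nothing beyond the trylock test and two extra
register saves visible in the assembly.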
* [PATCH v2 1/3] lib/list_batch: A simple list insertion/deletion batching facility

From: Waiman Long @ 2016-01-29 19:30 UTC
To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Alexander Viro
Cc: linux-fsdevel, x86, linux-kernel, Peter Zijlstra, Andi Kleen,
    Scott J Norton, Douglas Hatch, Waiman Long

Linked list insertion or deletion under lock is a very common activity
in the Linux kernel. If this is the only activity under lock, the
locking overhead can be pretty large compared with the actual time
spent on the insertion or deletion operation itself, especially on a
large system with many CPUs.

This patch introduces a simple list insertion/deletion batching
facility where a group of list insertion and deletion operations are
grouped together in a single batch under lock. This can reduce the
locking overhead and improve overall system performance.

The fast path of this batching facility will be similar in performance
to the "lock; listop; unlock;" sequence of the existing code. If the
lock is not available, it will enter a slowpath where the batching
happens.

A new config option LIST_BATCHING is added so that we can control on
which architectures this facility is enabled.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 include/linux/list_batch.h |  133 ++++++++++++++++++++++++++++++++++++++++++++
 lib/Kconfig                |    7 ++
 lib/Makefile               |    1 +
 lib/list_batch.c           |  125 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 266 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/list_batch.h
 create mode 100644 lib/list_batch.c

diff --git a/include/linux/list_batch.h b/include/linux/list_batch.h
new file mode 100644
index 0000000..a445a2e
--- /dev/null
+++ b/include/linux/list_batch.h
@@ -0,0 +1,133 @@
+/*
+ * List insertion/deletion batching facility
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2016 Hewlett-Packard Enterprise Development LP
+ *
+ * Authors: Waiman Long <waiman.long@hpe.com>
+ */
+#ifndef __LINUX_LIST_BATCH_H
+#define __LINUX_LIST_BATCH_H
+
+#include <linux/spinlock.h>
+#include <linux/list.h>
+
+/*
+ * include/linux/list_batch.h
+ *
+ * Inserting or deleting an entry from a linked list under a spinlock is a
+ * very common operation in the Linux kernel. If many CPUs are trying to
+ * grab the lock and manipulate the linked list, it can lead to significant
+ * lock contention and slow operation.
+ *
+ * This list operation batching facility is used to batch multiple list
+ * operations under one lock/unlock critical section, thus reducing the
+ * locking and cacheline bouncing overhead and improving overall performance.
+ */
+enum list_batch_cmd {
+	lb_cmd_add,
+	lb_cmd_del,
+	lb_cmd_del_init
+};
+
+enum list_batch_state {
+	lb_state_waiting,	/* Node is waiting */
+	lb_state_batch,		/* Queue head to perform batch processing */
+	lb_state_done		/* Job is done */
+};
+
+struct list_batch_qnode {
+	struct list_batch_qnode	*next;
+	struct list_head	*entry;
+	enum list_batch_cmd	cmd;
+	enum list_batch_state	state;
+};
+
+struct list_batch {
+	struct list_head	*list;
+	struct list_batch_qnode	*tail;
+};
+
+#define LIST_BATCH_INIT(_list)	\
+	{			\
+		.list = _list,	\
+		.tail = NULL	\
+	}
+
+static inline void list_batch_init(struct list_batch *batch,
+				   struct list_head *list)
+{
+	batch->list = list;
+	batch->tail = NULL;
+}
+
+static __always_inline void _list_batch_cmd(enum list_batch_cmd cmd,
+					    struct list_head *head,
+					    struct list_head *entry)
+{
+	switch (cmd) {
+	case lb_cmd_add:
+		list_add(entry, head);
+		break;
+
+	case lb_cmd_del:
+		list_del(entry);
+		break;
+
+	case lb_cmd_del_init:
+		list_del_init(entry);
+		break;
+	}
+}
+
+#ifdef CONFIG_LIST_BATCHING
+
+extern void do_list_batch_slowpath(spinlock_t *lock, enum list_batch_cmd cmd,
+				   struct list_batch *batch,
+				   struct list_head *entry);
+
+/*
+ * The caller is expected to pass in a constant cmd parameter. As a
+ * result, most of the unneeded code in the switch statement of
+ * _list_batch_cmd() will be optimized away. This should make the fast
+ * path almost as fast as the "lock; listop; unlock;" sequence it replaces.
+ */
+static inline void do_list_batch(spinlock_t *lock, enum list_batch_cmd cmd,
+				 struct list_batch *batch,
+				 struct list_head *entry)
+{
+	/*
+	 * Fast path
+	 */
+	if (likely(spin_trylock(lock))) {
+		_list_batch_cmd(cmd, batch->list, entry);
+		spin_unlock(lock);
+		return;
+	}
+	do_list_batch_slowpath(lock, cmd, batch, entry);
+}
+
+#else /* CONFIG_LIST_BATCHING */
+
+static inline void do_list_batch(spinlock_t *lock, enum list_batch_cmd cmd,
+				 struct list_batch *batch,
+				 struct list_head *entry)
+{
+	spin_lock(lock);
+	_list_batch_cmd(cmd, batch->list, entry);
+	spin_unlock(lock);
+}
+
+#endif /* CONFIG_LIST_BATCHING */
+
+#endif /* __LINUX_LIST_BATCH_H */
diff --git a/lib/Kconfig b/lib/Kconfig
index 133ebc0..d75ce19 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -514,6 +514,13 @@ config OID_REGISTRY
 config UCS2_STRING
 	tristate
 
+config LIST_BATCHING
+	def_bool y if ARCH_USE_LIST_BATCHING
+	depends on SMP
+
+config ARCH_USE_LIST_BATCHING
+	bool
+
 source "lib/fonts/Kconfig"
 
 config SG_SPLIT
diff --git a/lib/Makefile b/lib/Makefile
index a7c26a4..2791262 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -210,6 +210,7 @@ quiet_cmd_build_OID_registry = GEN     $@
 clean-files	+= oid_registry_data.c
 
 obj-$(CONFIG_UCS2_STRING) += ucs2_string.o
+obj-$(CONFIG_LIST_BATCHING) += list_batch.o
 
 obj-$(CONFIG_UBSAN) += ubsan.o
 UBSAN_SANITIZE_ubsan.o := n
diff --git a/lib/list_batch.c b/lib/list_batch.c
new file mode 100644
index 0000000..174f4ba
--- /dev/null
+++ b/lib/list_batch.c
@@ -0,0 +1,125 @@
+/*
+ * List insertion/deletion batching facility
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2016 Hewlett-Packard Enterprise Development LP
+ *
+ * Authors: Waiman Long <waiman.long@hpe.com>
+ */
+#include <linux/list_batch.h>
+
+/*
+ * List processing batch size = 128
+ *
+ * The batch size shouldn't be too large. Otherwise, it will be too unfair
+ * to the task doing the batch processing. It shouldn't be too small either,
+ * as the performance benefit will be reduced.
+ */
+#define LB_BATCH_SIZE	(1 << 7)
+
+/*
+ * Inserting or deleting an entry from a linked list under a spinlock is a
+ * very common operation in the Linux kernel. If many CPUs are trying to
+ * grab the lock and manipulate the linked list, it can lead to significant
+ * lock contention and slow operation.
+ *
+ * This list operation batching facility is used to batch multiple list
+ * operations under one lock/unlock critical section, thus reducing the
+ * locking overhead and improving overall performance.
+ */
+void do_list_batch_slowpath(spinlock_t *lock, enum list_batch_cmd cmd,
+			    struct list_batch *batch, struct list_head *entry)
+{
+	struct list_batch_qnode node, *prev, *next, *nptr;
+	int loop;
+
+	/*
+	 * Put itself into the list_batch queue
+	 */
+	node.next  = NULL;
+	node.entry = entry;
+	node.cmd   = cmd;
+	node.state = lb_state_waiting;
+
+	/*
+	 * We rely on the implicit memory barrier of xchg() to make sure
+	 * that node initialization will be done before its content is being
+	 * accessed by other CPUs.
+	 */
+	prev = xchg(&batch->tail, &node);
+
+	if (prev) {
+		WRITE_ONCE(prev->next, &node);
+		while (READ_ONCE(node.state) == lb_state_waiting)
+			cpu_relax();
+		if (node.state == lb_state_done)
+			return;
+		WARN_ON(node.state != lb_state_batch);
+	}
+
+	/*
+	 * We are now the queue head, we should acquire the lock and
+	 * process a batch of qnodes.
+	 */
+	loop = LB_BATCH_SIZE;
+	next = &node;
+	spin_lock(lock);
+
+do_list_again:
+	do {
+		nptr = next;
+		_list_batch_cmd(nptr->cmd, batch->list, nptr->entry);
+		next = READ_ONCE(nptr->next);
+		/*
+		 * As soon as the state is marked lb_state_done, we
+		 * can no longer assume the content of *nptr as valid.
+		 * So we have to hold off marking it done until we no
+		 * longer need its content.
+		 *
+		 * The release barrier here is to make sure that we
+		 * won't access its content after marking it done.
+		 */
+		if (next)
+			smp_store_release(&nptr->state, lb_state_done);
+	} while (--loop && next);
+	if (!next) {
+		/*
+		 * The queue tail should be equal to nptr, so clear it to
+		 * mark the queue as empty.
+		 */
+		if (cmpxchg_relaxed(&batch->tail, nptr, NULL) != nptr) {
+			/*
+			 * Queue not empty, wait until the next pointer is
+			 * initialized.
+			 */
+			while (!(next = READ_ONCE(nptr->next)))
+				cpu_relax();
+		}
+		/*
+		 * The release barrier is required to make sure that
+		 * setting the done state is the last operation.
+		 */
+		smp_store_release(&nptr->state, lb_state_done);
+	}
+	if (next) {
+		if (loop)
+			goto do_list_again;	/* More qnodes to process */
+		/*
+		 * Mark the next qnode as head to process the next batch
+		 * of qnodes. The new queue head cannot proceed until we
+		 * release the lock.
+		 */
+		WRITE_ONCE(next->state, lb_state_batch);
+	}
+	spin_unlock(lock);
}
+EXPORT_SYMBOL_GPL(do_list_batch_slowpath);
-- 
1.7.1
* Re: [PATCH v2 1/3] lib/list_batch: A simple list insertion/deletion batching facility

From: Dave Chinner @ 2016-02-01  0:47 UTC
To: Waiman Long
Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Alexander Viro,
    linux-fsdevel, x86, linux-kernel, Peter Zijlstra, Andi Kleen,
    Scott J Norton, Douglas Hatch

On Fri, Jan 29, 2016 at 02:30:44PM -0500, Waiman Long wrote:
> Linked list insertion or deletion under lock is a very common activity
> in the Linux kernel. If this is the only activity under lock, the
> locking overhead can be pretty large compared with the actual time
> spent on the insertion or deletion operation itself, especially on a
> large system with many CPUs.
>
> This patch introduces a simple list insertion/deletion batching
> facility where a group of list insertion and deletion operations are
> grouped together in a single batch under lock. This can reduce the
> locking overhead and improve overall system performance.
>
> The fast path of this batching facility will be similar in performance
> to the "lock; listop; unlock;" sequence of the existing code. If the
> lock is not available, it will enter a slowpath where the batching
> happens.
>
> A new config option LIST_BATCHING is added so that we can control on
> which architectures this facility is enabled.
>
> Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
....
> +#ifdef CONFIG_LIST_BATCHING
> +
> +extern void do_list_batch_slowpath(spinlock_t *lock, enum list_batch_cmd cmd,
> +				   struct list_batch *batch,
> +				   struct list_head *entry);
> +
> +/*
> + * The caller is expected to pass in a constant cmd parameter. As a
> + * result, most of the unneeded code in the switch statement of
> + * _list_batch_cmd() will be optimized away. This should make the fast
> + * path almost as fast as the "lock; listop; unlock;" sequence it replaces.
> + */

This strikes me as needlessly complex. Simple inline functions are
much easier to read and verify correct, and we don't have to rely on
the compiler to optimise out dead code:

static inline void list_batch_add(struct list_head *entry,
				  struct list_batch *batch)
{
	if (!spin_trylock(&batch->lock))
		return do_list_batch_slowpath(entry, batch, lb_cmd_add);

	list_add(entry, &batch->list);
	spin_unlock(&batch->lock);
}

> +#include <linux/list_batch.h>
> +
> +/*
> + * List processing batch size = 128
> + *
> + * The batch size shouldn't be too large. Otherwise, it will be too unfair
> + * to the task doing the batch processing. It shouldn't be too small either,
> + * as the performance benefit will be reduced.
> + */
> +#define LB_BATCH_SIZE	(1 << 7)

Ok, so arbitrary operations are going to see longer delays when they
are selected as the batch processor. I'm not sure I really like this
idea, as it will be the first in the queue that sees contention that
takes the delay, which reduces the fairness of the operations. i.e.
the spinlock uses fair queuing, but now we can be grossly unfair to
the first spinner...

> +	/*
> +	 * We rely on the implicit memory barrier of xchg() to make sure
> +	 * that node initialization will be done before its content is being
> +	 * accessed by other CPUs.
> +	 */
> +	prev = xchg(&batch->tail, &node);
> +
> +	if (prev) {
> +		WRITE_ONCE(prev->next, &node);
> +		while (READ_ONCE(node.state) == lb_state_waiting)
> +			cpu_relax();
> +		if (node.state == lb_state_done)
> +			return;

So we spin waiting for the batch processor to process the
list, or

> +		WARN_ON(node.state != lb_state_batch);

tell us we are not the batch processor.

So, effectively, the reduction in runtime is due to the fact that the
list operations spin on their own cache line rather than the spin
lock cacheline until they have been processed and/or made the batch
processor?

> +	}
> +
> +	/*
> +	 * We are now the queue head, we should acquire the lock and
> +	 * process a batch of qnodes.
> +	 */
> +	loop = LB_BATCH_SIZE;
> +	next = &node;
> +	spin_lock(lock);
> +
> +do_list_again:
> +	do {

While we are batch processing, all operations will fail the
trylock and add themselves to the tail of the queue, and spin on
their own cacheline at that point. So it doesn't reduce the amount
of spinning, just removes the cacheline contention that slows the
spinning.

Hmmm - there's another point of unfairness - when switching batch
processors, other add/delete operations can get the list lock and
perform their operations directly, thereby jumping the batch
queue....

So at what point does simply replacing the list_head with a list_lru
become more efficient than this batch processing (i.e.
https://lkml.org/lkml/2015/3/10/660)? The list_lru isn't a great
fit for the inode list (doesn't need any of the special LRU/memcg
stuff - https://lkml.org/lkml/2015/3/16/261) but it will tell us if,
like Ingo suggested, moving more towards a generic per-cpu list
would provide better overall performance...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: [PATCH v2 1/3] lib/list_batch: A simple list insertion/deletion batching facility

From: Waiman Long @ 2016-02-03 23:11 UTC
To: Dave Chinner
Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Alexander Viro,
    linux-fsdevel, x86, linux-kernel, Peter Zijlstra, Andi Kleen,
    Scott J Norton, Douglas Hatch

On 01/31/2016 07:47 PM, Dave Chinner wrote:
> On Fri, Jan 29, 2016 at 02:30:44PM -0500, Waiman Long wrote:
>> Linked list insertion or deletion under lock is a very common activity
>> in the Linux kernel. If this is the only activity under lock, the
>> locking overhead can be pretty large compared with the actual time
>> spent on the insertion or deletion operation itself, especially on a
>> large system with many CPUs.
>>
>> This patch introduces a simple list insertion/deletion batching
>> facility where a group of list insertion and deletion operations are
>> grouped together in a single batch under lock. This can reduce the
>> locking overhead and improve overall system performance.
>>
>> The fast path of this batching facility will be similar in performance
>> to the "lock; listop; unlock;" sequence of the existing code. If the
>> lock is not available, it will enter a slowpath where the batching
>> happens.
>>
>> A new config option LIST_BATCHING is added so that we can control on
>> which architectures this facility is enabled.
>>
>> Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
> ....
>> +#ifdef CONFIG_LIST_BATCHING
>> +
>> +extern void do_list_batch_slowpath(spinlock_t *lock, enum list_batch_cmd cmd,
>> +				    struct list_batch *batch,
>> +				    struct list_head *entry);
>> +
>> +/*
>> + * The caller is expected to pass in a constant cmd parameter. As a
>> + * result, most of the unneeded code in the switch statement of
>> + * _list_batch_cmd() will be optimized away. This should make the fast
>> + * path almost as fast as the "lock; listop; unlock;" sequence it replaces.
>> + */
> This strikes me as needlessly complex. Simple inline functions are
> much easier to read and verify correct, and we don't have to rely on
> the compiler to optimise out dead code:
>
> static inline void list_batch_add(struct list_head *entry,
> 				   struct list_batch *batch)
> {
> 	if (!spin_trylock(&batch->lock))
> 		return do_list_batch_slowpath(entry, batch, lb_cmd_add);
>
> 	list_add(entry, &batch->list);
> 	spin_unlock(&batch->lock);
> }

Will do so.

>> +#include <linux/list_batch.h>
>> +
>> +/*
>> + * List processing batch size = 128
>> + *
>> + * The batch size shouldn't be too large. Otherwise, it will be too unfair
>> + * to the task doing the batch processing. It shouldn't be too small either,
>> + * as the performance benefit will be reduced.
>> + */
>> +#define LB_BATCH_SIZE	(1 << 7)
> Ok, so arbitrary operations are going to see longer delays when they
> are selected as the batch processor. I'm not sure I really like this
> idea, as it will be the first in the queue that sees contention that
> takes the delay, which reduces the fairness of the operations.
> i.e. the spinlock uses fair queuing, but now we can be grossly unfair
> to the first spinner...

That is certainly true. It is well known that a bit of unfairness can
often improve overall system performance. Interrupt handling, for
example, is also unfair to the process currently running on the CPU.
The amount of unfairness is controlled by the batch size parameter.
Maybe we can make this parameter a read-mostly constant set up at boot
time, with a value that depends on the number of CPUs in the system,
so that smaller systems can have a smaller batch size. That will
reduce the unfairness, at least on smaller systems.

>> +	/*
>> +	 * We rely on the implicit memory barrier of xchg() to make sure
>> +	 * that node initialization will be done before its content is being
>> +	 * accessed by other CPUs.
>> +	 */
>> +	prev = xchg(&batch->tail, &node);
>> +
>> +	if (prev) {
>> +		WRITE_ONCE(prev->next, &node);
>> +		while (READ_ONCE(node.state) == lb_state_waiting)
>> +			cpu_relax();
>> +		if (node.state == lb_state_done)
>> +			return;
> So we spin waiting for the batch processor to process the
> list, or
>
>> +		WARN_ON(node.state != lb_state_batch);
> tell us we are not the batch processor.
>
> So, effectively, the reduction in runtime is due to the fact that the
> list operations spin on their own cache line rather than the spin
> lock cacheline until they have been processed and/or made the batch
> processor?

Yes, that can be a major part of it.

>> +	}
>> +
>> +	/*
>> +	 * We are now the queue head, we should acquire the lock and
>> +	 * process a batch of qnodes.
>> +	 */
>> +	loop = LB_BATCH_SIZE;
>> +	next = &node;
>> +	spin_lock(lock);
>> +
>> +do_list_again:
>> +	do {
> While we are batch processing, all operations will fail the
> trylock and add themselves to the tail of the queue, and spin on
> their own cacheline at that point. So it doesn't reduce the amount
> of spinning, just removes the cacheline contention that slows the
> spinning.
>
> Hmmm - there's another point of unfairness - when switching batch
> processors, other add/delete operations can get the list lock and
> perform their operations directly, thereby jumping the batch
> queue....

That is true too.

> So at what point does simply replacing the list_head with a list_lru
> become more efficient than this batch processing (i.e.
> https://lkml.org/lkml/2015/3/10/660)? The list_lru isn't a great
> fit for the inode list (doesn't need any of the special LRU/memcg
> stuff - https://lkml.org/lkml/2015/3/16/261) but it will tell us if,
> like Ingo suggested, moving more towards a generic per-cpu list
> would provide better overall performance...

I will take a look at the list_lru patch to see if that helps. As for
the per-cpu list, I tried that and it didn't quite work out.

Thanks,
Longman
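A boot-time scaled batch size of the kind Longman describes could look
roughly like the sketch below. This is purely illustrative: the
lb_batch_size variable, the two-ops-per-CPU factor and the 8..128
clamp bounds are invented here and are not part of the posted patch.

	/* Hypothetical replacement for the fixed LB_BATCH_SIZE constant. */
	static int lb_batch_size __read_mostly = 1 << 7;

	static int __init lb_batch_size_setup(void)
	{
		/* Scale with CPU count; small systems get a smaller batch. */
		lb_batch_size = clamp_t(int, 2 * num_possible_cpus(), 8, 128);
		return 0;
	}
	early_initcall(lb_batch_size_setup);

The slowpath would then start with "loop = lb_batch_size" instead of
"loop = LB_BATCH_SIZE".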
* Re: [PATCH v2 1/3] lib/list_batch: A simple list insertion/deletion batching facility

From: Dave Chinner @ 2016-02-06 23:57 UTC
To: Waiman Long
Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Alexander Viro,
    linux-fsdevel, x86, linux-kernel, Peter Zijlstra, Andi Kleen,
    Scott J Norton, Douglas Hatch

On Wed, Feb 03, 2016 at 06:11:56PM -0500, Waiman Long wrote:
> On 01/31/2016 07:47 PM, Dave Chinner wrote:
>> So at what point does simply replacing the list_head with a list_lru
>> become more efficient than this batch processing (i.e.
>> https://lkml.org/lkml/2015/3/10/660)? The list_lru isn't a great
>> fit for the inode list (doesn't need any of the special LRU/memcg
>> stuff - https://lkml.org/lkml/2015/3/16/261) but it will tell us if,
>> like Ingo suggested, moving more towards a generic per-cpu list
>> would provide better overall performance...
>
> I will take a look at the list_lru patch to see if that helps. As for
> the per-cpu list, I tried that and it didn't quite work out.

OK, see my last email as to why Andi's patch didn't change anything.

The list_lru implementation has a list per node, a lock per node,
and each item is placed on the list for the node it is physically
allocated from. Hence for local workloads, the list/lock that is
accessed for add/remove should be local to the node and hence should
reduce cache line contention mostly to within a single node.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: [PATCH v2 1/3] lib/list_batch: A simple list insertion/deletion batching facility

From: Waiman Long @ 2016-02-17  1:37 UTC
To: Dave Chinner
Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Alexander Viro,
    linux-fsdevel, x86, linux-kernel, Peter Zijlstra, Andi Kleen,
    Scott J Norton, Douglas Hatch

On 02/06/2016 06:57 PM, Dave Chinner wrote:
> On Wed, Feb 03, 2016 at 06:11:56PM -0500, Waiman Long wrote:
>> On 01/31/2016 07:47 PM, Dave Chinner wrote:
>>> So at what point does simply replacing the list_head with a list_lru
>>> become more efficient than this batch processing (i.e.
>>> https://lkml.org/lkml/2015/3/10/660)? The list_lru isn't a great
>>> fit for the inode list (doesn't need any of the special LRU/memcg
>>> stuff - https://lkml.org/lkml/2015/3/16/261) but it will tell us if,
>>> like Ingo suggested, moving more towards a generic per-cpu list
>>> would provide better overall performance...
>> I will take a look at the list_lru patch to see if that helps. As for
>> the per-cpu list, I tried that and it didn't quite work out.
> OK, see my last email as to why Andi's patch didn't change anything.
> The list_lru implementation has a list per node, a lock per node,
> and each item is placed on the list for the node it is physically
> allocated from. Hence for local workloads, the list/lock that is
> accessed for add/remove should be local to the node and hence should
> reduce cache line contention mostly to within a single node.
>
> Cheers,
>
> Dave.

I have just sent out a new patchset using per-cpu lists with per-cpu
locks. I used the per-cpu list as the changes will be simpler and
easier to review. Please let me know your thoughts on that.

Thanks,
Longman
* [PATCH v2 2/3] lib/list_batch, x86: Enable list insertion/deletion batching for x86

From: Waiman Long @ 2016-01-29 19:30 UTC
To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Alexander Viro
Cc: linux-fsdevel, x86, linux-kernel, Peter Zijlstra, Andi Kleen,
    Scott J Norton, Douglas Hatch, Waiman Long

Systems with a large number of CPUs will benefit from the list
batching facility when the list lock is contended. If the lock isn't
contended, the performance of the list batching function should be
similar to the "lock; listop; unlock;" sequence that it replaces. This
patch enables it for the x86 architecture.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 arch/x86/Kconfig |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 330e738..5b35be7 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -42,6 +42,7 @@ config X86
 	select ARCH_SUPPORTS_NUMA_BALANCING	if X86_64
 	select ARCH_USE_BUILTIN_BSWAP
 	select ARCH_USE_CMPXCHG_LOCKREF		if X86_64
+	select ARCH_USE_LIST_BATCHING
 	select ARCH_USE_QUEUED_RWLOCKS
 	select ARCH_USE_QUEUED_SPINLOCKS
 	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH if SMP
-- 
1.7.1
* [PATCH v2 3/3] vfs: Enable list batching for the superblock's inode list

From: Waiman Long @ 2016-01-29 19:30 UTC
To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Alexander Viro
Cc: linux-fsdevel, x86, linux-kernel, Peter Zijlstra, Andi Kleen,
    Scott J Norton, Douglas Hatch, Waiman Long

The inode_sb_list_add() and inode_sb_list_del() functions in the vfs
layer just perform list addition and deletion under lock. So they can
use the new list batching facility to speed up the list operations
when many CPUs are trying to do it simultaneously.

In particular, the inode_sb_list_del() function can be a performance
bottleneck when large applications with many threads and associated
inodes exit. This was measured with an exit microbenchmark that
creates a large number of threads, attaches many inodes to them and
then exits. The runtimes of that microbenchmark with 1000 threads
before and after the patch on a 4-socket Intel E7-4820 v3 system
(48 cores, 96 threads) were as follows:

  Kernel        Elapsed Time    System Time
  ------        ------------    -----------
  Vanilla 4.4      65.29s          82m14s
  Patched 4.4      45.69s          49m44s

The elapsed time and the reported system time were reduced by 30%
and 40% respectively.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 fs/inode.c         |   13 +++++--------
 fs/super.c         |    1 +
 include/linux/fs.h |    2 ++
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 9f62db3..870de8c 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -424,19 +424,16 @@ static void inode_lru_list_del(struct inode *inode)
  */
 void inode_sb_list_add(struct inode *inode)
 {
-	spin_lock(&inode->i_sb->s_inode_list_lock);
-	list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
-	spin_unlock(&inode->i_sb->s_inode_list_lock);
+	do_list_batch(&inode->i_sb->s_inode_list_lock, lb_cmd_add,
+		      &inode->i_sb->s_list_batch, &inode->i_sb_list);
 }
 EXPORT_SYMBOL_GPL(inode_sb_list_add);
 
 static inline void inode_sb_list_del(struct inode *inode)
 {
-	if (!list_empty(&inode->i_sb_list)) {
-		spin_lock(&inode->i_sb->s_inode_list_lock);
-		list_del_init(&inode->i_sb_list);
-		spin_unlock(&inode->i_sb->s_inode_list_lock);
-	}
+	if (!list_empty(&inode->i_sb_list))
+		do_list_batch(&inode->i_sb->s_inode_list_lock, lb_cmd_del_init,
+			      &inode->i_sb->s_list_batch, &inode->i_sb_list);
 }
 
 static unsigned long hash(struct super_block *sb, unsigned long hashval)
diff --git a/fs/super.c b/fs/super.c
index 1182af8..b0e8540 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -206,6 +206,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 	mutex_init(&s->s_sync_lock);
 	INIT_LIST_HEAD(&s->s_inodes);
 	spin_lock_init(&s->s_inode_list_lock);
+	list_batch_init(&s->s_list_batch, &s->s_inodes);
 
 	if (list_lru_init_memcg(&s->s_dentry_lru))
 		goto fail;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1a20462..11d8b77 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -9,6 +9,7 @@
 #include <linux/stat.h>
 #include <linux/cache.h>
 #include <linux/list.h>
+#include <linux/list_batch.h>
 #include <linux/list_lru.h>
 #include <linux/llist.h>
 #include <linux/radix-tree.h>
@@ -1403,6 +1404,7 @@ struct super_block {
 	/* s_inode_list_lock protects s_inodes */
 	spinlock_t		s_inode_list_lock ____cacheline_aligned_in_smp;
 	struct list_head	s_inodes;	/* all inodes */
+	struct list_batch	s_list_batch;
 };
 
 extern struct timespec current_fs_time(struct super_block *sb);
-- 
1.7.1
* Re: [PATCH v2 3/3] vfs: Enable list batching for the superblock's inode list

From: Ingo Molnar @ 2016-01-30  8:35 UTC
To: Waiman Long
Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Alexander Viro,
    linux-fsdevel, x86, linux-kernel, Peter Zijlstra, Andi Kleen,
    Scott J Norton, Douglas Hatch

* Waiman Long <Waiman.Long@hpe.com> wrote:

> The inode_sb_list_add() and inode_sb_list_del() functions in the vfs
> layer just perform list addition and deletion under lock. So they can
> use the new list batching facility to speed up the list operations
> when many CPUs are trying to do it simultaneously.
>
> In particular, the inode_sb_list_del() function can be a performance
> bottleneck when large applications with many threads and associated
> inodes exit. This was measured with an exit microbenchmark that
> creates a large number of threads, attaches many inodes to them and
> then exits. The runtimes of that microbenchmark with 1000 threads
> before and after the patch on a 4-socket Intel E7-4820 v3 system
> (48 cores, 96 threads) were as follows:
>
>   Kernel        Elapsed Time    System Time
>   ------        ------------    -----------
>   Vanilla 4.4      65.29s          82m14s
>   Patched 4.4      45.69s          49m44s
>
> The elapsed time and the reported system time were reduced by 30%
> and 40% respectively.

That's pretty impressive!

I'm wondering, why are inode_sb_list_add()/del() even called for a
presumably reasonably well cached benchmark running on a system with
enough RAM? Are these perhaps thousands of temporary files, already
deleted, and released when all the file descriptors are closed as
part of sys_exit()?

If that's the case then I suspect an even bigger win would be not
just to batch the (sb-)global list fiddling, but to potentially turn
the sb list into a percpu_alloc() managed set of per CPU lists? It's
a bigger change, but it could speed up a lot of other temporary file
intensive usecases as well, not just batched delete.

Thanks,

	Ingo
* Re: [PATCH v2 3/3] vfs: Enable list batching for the superblock's inode list

From: Andi Kleen @ 2016-02-01 17:45 UTC
To: Ingo Molnar
Cc: Waiman Long, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
    Alexander Viro, linux-fsdevel, x86, linux-kernel, Peter Zijlstra,
    Andi Kleen, Scott J Norton, Douglas Hatch

> I'm wondering, why are inode_sb_list_add()/del() even called for a
> presumably reasonably well cached benchmark running on a system with
> enough RAM? Are these perhaps thousands of temporary files, already
> deleted, and released when all the file descriptors are closed as
> part of sys_exit()?
>
> If that's the case then I suspect an even bigger win would be not
> just to batch the (sb-)global list fiddling, but to potentially turn
> the sb list into a percpu_alloc() managed set of per CPU lists? It's
> a bigger change, but it could

We had such a patch in the lock elision patchkit (It avoided a lot
of cache line bouncing leading to aborts)

https://git.kernel.org/cgit/linux/kernel/git/ak/linux-misc.git/commit/?h=hle315/combined&id=f1cf9e715a40f44086662ae3b29f123cf059cbf4

-Andi
* Re: [PATCH v2 3/3] vfs: Enable list batching for the superblock's inode list

From: Waiman Long @ 2016-02-01 22:03 UTC
To: Andi Kleen
Cc: Ingo Molnar, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
    Alexander Viro, linux-fsdevel, x86, linux-kernel, Peter Zijlstra,
    Scott J Norton, Douglas Hatch

On 02/01/2016 12:45 PM, Andi Kleen wrote:
>> I'm wondering, why are inode_sb_list_add()/del() even called for a
>> presumably reasonably well cached benchmark running on a system with
>> enough RAM? Are these perhaps thousands of temporary files, already
>> deleted, and released when all the file descriptors are closed as
>> part of sys_exit()?
>>
>> If that's the case then I suspect an even bigger win would be not
>> just to batch the (sb-)global list fiddling, but to potentially turn
>> the sb list into a percpu_alloc() managed set of per CPU lists? It's
>> a bigger change, but it could
> We had such a patch in the lock elision patchkit (It avoided a lot
> of cache line bouncing leading to aborts)
>
> https://git.kernel.org/cgit/linux/kernel/git/ak/linux-misc.git/commit/?h=hle315/combined&id=f1cf9e715a40f44086662ae3b29f123cf059cbf4
>
> -Andi

I like your patch, though it cannot be applied cleanly to the current
upstream kernel. I will port it to the current kernel and run my
microbenchmark to see what performance gain I can get.

Cheers,
Longman
* Re: [PATCH v2 3/3] vfs: Enable list batching for the superblock's inode list

From: Waiman Long @ 2016-02-03 22:59 UTC
To: Andi Kleen
Cc: Ingo Molnar, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
    Alexander Viro, linux-fsdevel, x86, linux-kernel, Peter Zijlstra,
    Scott J Norton, Douglas Hatch

On 02/01/2016 05:03 PM, Waiman Long wrote:
> On 02/01/2016 12:45 PM, Andi Kleen wrote:
>>> I'm wondering, why are inode_sb_list_add()/del() even called for a
>>> presumably reasonably well cached benchmark running on a system
>>> with enough RAM? Are these perhaps thousands of temporary files,
>>> already deleted, and released when all the file descriptors are
>>> closed as part of sys_exit()?
>>>
>>> If that's the case then I suspect an even bigger win would be not
>>> just to batch the (sb-)global list fiddling, but to potentially
>>> turn the sb list into a percpu_alloc() managed set of per CPU
>>> lists? It's a bigger change, but it could
>> We had such a patch in the lock elision patchkit (It avoided a lot
>> of cache line bouncing leading to aborts)
>>
>> https://git.kernel.org/cgit/linux/kernel/git/ak/linux-misc.git/commit/?h=hle315/combined&id=f1cf9e715a40f44086662ae3b29f123cf059cbf4
>>
>> -Andi
>
> I like your patch, though it cannot be applied cleanly to the current
> upstream kernel. I will port it to the current kernel and run my
> microbenchmark to see what performance gain I can get.

Unfortunately, using a per-cpu list didn't have the performance
benefit that I expected. I saw maybe 1 or 2% of performance increase,
but nothing significant. I guess the bulk of the performance
improvement in my patch is in the elimination of most of the cacheline
transfer latencies when the lock ownership is passed from one CPU to
another. Those latencies are still there even if we use the per-cpu
list.

Cheers,
Longman
* Re: [PATCH v2 3/3] vfs: Enable list batching for the superblock's inode list

From: Dave Chinner @ 2016-02-06 23:51 UTC
To: Waiman Long
Cc: Andi Kleen, Ingo Molnar, Thomas Gleixner, Ingo Molnar,
    H. Peter Anvin, Alexander Viro, linux-fsdevel, x86, linux-kernel,
    Peter Zijlstra, Scott J Norton, Douglas Hatch

On Wed, Feb 03, 2016 at 05:59:17PM -0500, Waiman Long wrote:
> On 02/01/2016 05:03 PM, Waiman Long wrote:
>> On 02/01/2016 12:45 PM, Andi Kleen wrote:
>>>> I'm wondering, why are inode_sb_list_add()/del() even called for a
>>>> presumably reasonably well cached benchmark running on a system
>>>> with enough RAM? Are these perhaps thousands of temporary files,
>>>> already deleted, and released when all the file descriptors are
>>>> closed as part of sys_exit()?
>>>>
>>>> If that's the case then I suspect an even bigger win would be not
>>>> just to batch the (sb-)global list fiddling, but to potentially
>>>> turn the sb list into a percpu_alloc() managed set of per CPU
>>>> lists? It's a bigger change, but it could
>>> We had such a patch in the lock elision patchkit (It avoided a lot
>>> of cache line bouncing leading to aborts)
>>>
>>> https://git.kernel.org/cgit/linux/kernel/git/ak/linux-misc.git/commit/?h=hle315/combined&id=f1cf9e715a40f44086662ae3b29f123cf059cbf4
>>>
>>> -Andi
>>
>> I like your patch, though it cannot be applied cleanly to the
>> current upstream kernel. I will port it to the current kernel and
>> run my microbenchmark to see what performance gain I can get.
>
> Unfortunately, using a per-cpu list didn't have the performance
> benefit that I expected. I saw maybe 1 or 2% of performance
> increase, but nothing significant. I guess the bulk of the
> performance improvement in my patch is in the elimination of most of
> the cacheline transfer latencies when the lock ownership is passed
> from one CPU to another. Those latencies are still there even if we
> use the per-cpu list.

Now that I look at the above patch, it doesn't get rid of the global
list lock, hence it won't change any of the existing global lock
cacheline contention. The list structure change to per-cpu is
completely irrelevant because it doesn't address the problem being
seen. A proper per-cpu list implementation will provide either
per-cpu locks or some other mechanism to protect each list and
eliminate a large amount of global cacheline bouncing.

Given this, I would not use the above patch and results as a reason
for saying this approach will not work, or be a better solution....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
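To picture the "per-cpu locks to protect each list" structure Dave is
describing, a minimal sketch is shown below. This is a hypothetical
illustration only - the names and layout are invented here and do not
come from any patch in this thread; it also assumes the usual inode
lifecycle (an entry is added before it is deleted, and the two never
race on the same node):

	/* Hypothetical per-CPU list with one lock per per-CPU sublist. */
	struct pcpu_list_head {
		struct list_head list;
		spinlock_t	 lock;
	};

	struct pcpu_list_node {
		struct list_head node;
		spinlock_t	*lock;	/* lock of the sublist we are on */
	};

	/* Add to (preferably) the local CPU's sublist. */
	static void pcpu_list_add(struct pcpu_list_node *n,
				  struct pcpu_list_head __percpu *heads)
	{
		/*
		 * raw_cpu_ptr(): any CPU's sublist would be correct, the
		 * local one is just cheaper; the lock serializes everything.
		 */
		struct pcpu_list_head *h = raw_cpu_ptr(heads);

		spin_lock(&h->lock);
		n->lock = &h->lock;
		list_add(&n->node, &h->list);
		spin_unlock(&h->lock);
	}

	/* Delete from whichever sublist the node was placed on. */
	static void pcpu_list_del(struct pcpu_list_node *n)
	{
		spin_lock(n->lock);
		list_del_init(&n->node);
		spin_unlock(n->lock);
	}

With this arrangement, add/delete operations from different CPUs
mostly touch different locks and different list cachelines, which is
the contention reduction both Ingo and Dave are after; the cost is
that iterating all entries must now walk every per-CPU sublist.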
* Re: [PATCH v2 3/3] vfs: Enable list batching for the superblock's inode list

From: Waiman Long @ 2016-02-01 21:44 UTC
To: Ingo Molnar
Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Alexander Viro,
    linux-fsdevel, x86, linux-kernel, Peter Zijlstra, Andi Kleen,
    Scott J Norton, Douglas Hatch

[-- Attachment #1: Type: text/plain, Size: 2280 bytes --]

On 01/30/2016 03:35 AM, Ingo Molnar wrote:
> * Waiman Long <Waiman.Long@hpe.com> wrote:
>
>> The inode_sb_list_add() and inode_sb_list_del() functions in the vfs
>> layer just perform list addition and deletion under lock. So they can
>> use the new list batching facility to speed up the list operations
>> when many CPUs are trying to do it simultaneously.
>>
>> In particular, the inode_sb_list_del() function can be a performance
>> bottleneck when large applications with many threads and associated
>> inodes exit. This was measured with an exit microbenchmark that
>> creates a large number of threads, attaches many inodes to them and
>> then exits. The runtimes of that microbenchmark with 1000 threads
>> before and after the patch on a 4-socket Intel E7-4820 v3 system
>> (48 cores, 96 threads) were as follows:
>>
>>   Kernel        Elapsed Time    System Time
>>   ------        ------------    -----------
>>   Vanilla 4.4      65.29s          82m14s
>>   Patched 4.4      45.69s          49m44s
>>
>> The elapsed time and the reported system time were reduced by 30%
>> and 40% respectively.
> That's pretty impressive!
>
> I'm wondering, why are inode_sb_list_add()/del() even called for a
> presumably reasonably well cached benchmark running on a system with
> enough RAM? Are these perhaps thousands of temporary files, already
> deleted, and released when all the file descriptors are closed as
> part of sys_exit()?

The inodes that need to be deleted were actually procfs files which
have to go away when the processes/threads exit. I encountered this
problem when running the SPECjbb2013 benchmark on a large machine,
where it might sometimes seem to hang for 30 minutes or so after the
benchmark completes. I wrote a simple microbenchmark to simulate this
situation, which is in the attachment.

> If that's the case then I suspect an even bigger win would be not
> just to batch the (sb-)global list fiddling, but to potentially turn
> the sb list into a percpu_alloc() managed set of per CPU lists? It's
> a bigger change, but it could speed up a lot of other temporary file
> intensive usecases as well, not just batched delete.
>
> Thanks,
>
> 	Ingo

Yes, that is another possibility. I will investigate that one further.
Thanks for the suggestion.

Cheers,
Longman

[-- Attachment #2: exit_test.c --]
[-- Type: text/plain, Size: 2665 bytes --]

/*
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * Authors: Waiman Long <waiman.long@hp.com>
 */

/*
 * This is an exit test
 */
#include <ctype.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/syscall.h>

#define do_exit()	syscall(SYS_exit)
#define gettid()	syscall(SYS_gettid)
#define MAX_THREADS	2048

static inline void cpu_relax(void)
{
	__asm__ __volatile__("rep;nop": : :"memory");
}

static inline void atomic_inc(volatile int *v)
{
	__asm__ __volatile__("lock incl %0": "+m" (*v));
}

static volatile int exit_now  = 0;
static volatile int threadcnt = 0;

/*
 * Walk the /proc/<pid> filesystem to make them fill the dentry cache
 */
static void walk_procfs(void)
{
	char cmdbuf[256];
	pid_t tid = gettid();

	snprintf(cmdbuf, sizeof(cmdbuf), "find /proc/%d > /dev/null 2>&1", tid);
	if (system(cmdbuf) < 0)
		perror("system() failed!");
}

static void *exit_thread(void *dummy)
{
	long tid = (long)dummy;

	walk_procfs();
	atomic_inc(&threadcnt);

	/*
	 * Busy wait until the do_exit flag is set and then call exit
	 */
	while (!exit_now)
		sleep(1);
	do_exit();
}

static void exit_test(int threads)
{
	pthread_t thread[threads];
	long i = 0, finish;
	time_t start = time(NULL);

	while (i++ < threads) {
		if (pthread_create(thread + i - 1, NULL, exit_thread,
				   (void *)i)) {
			perror("pthread_create");
			exit(1);
		}
#if 0
		/*
		 * Pipelining to reduce contention & improve speed
		 */
		if ((i & 0xf) == 0)
			while (i - threadcnt > 12)
				usleep(1);
#endif
	}
	while (threadcnt != threads)
		usleep(1);
	walk_procfs();
	printf("Setup time = %lus\n", time(NULL) - start);
	printf("Process ready to exit!\n");
	kill(0, SIGKILL);
	exit(0);
}

int main(int argc, char *argv[])
{
	int tcnt;	/* Thread counts */
	char *cmd = argv[0];

	if ((argc != 2) || !isdigit(argv[1][0])) {
		fprintf(stderr, "Usage: %s <thread count>\n", cmd);
		exit(1);
	}
	tcnt = strtoul(argv[1], NULL, 10);
	if (tcnt > MAX_THREADS) {
		fprintf(stderr, "Error: thread count should be <= %d\n",
			MAX_THREADS);
		exit(1);
	}
	exit_test(tcnt);
	return 0;	/* Not reachable */
}
* Re: [PATCH v2 3/3] vfs: Enable list batching for the superblock's inode list

From: Dave Chinner @ 2016-02-01  0:04 UTC
To: Waiman Long
Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Alexander Viro,
    linux-fsdevel, x86, linux-kernel, Peter Zijlstra, Andi Kleen,
    Scott J Norton, Douglas Hatch

On Fri, Jan 29, 2016 at 02:30:46PM -0500, Waiman Long wrote:
> The inode_sb_list_add() and inode_sb_list_del() functions in the vfs
> layer just perform list addition and deletion under lock. So they can
> use the new list batching facility to speed up the list operations
> when many CPUs are trying to do it simultaneously.
>
> In particular, the inode_sb_list_del() function can be a performance
> bottleneck when large applications with many threads and associated
> inodes exit. This was measured with an exit microbenchmark that
> creates a large number of threads, attaches many inodes to them and
> then exits. The runtimes of that microbenchmark with 1000 threads
> before and after the patch

I've never seen sb inode list contention in typical workloads in
exit processing. Can you post the test script you are using?

The inode sb list contention I see is, more often than not, from
workloads that turn over the inode cache quickly (i.e. instantiating
lots of inodes through concurrent directory traversal or create
workloads). These are often latency sensitive, so I'm wondering what
the effect of spinning waiting for batch processing on every
contended add is going to do to lookup performance...

> on a 4-socket Intel E7-4820 v3 system (48 cores, 96 threads) were
> as follows:
>
>   Kernel        Elapsed Time    System Time
>   ------        ------------    -----------
>   Vanilla 4.4      65.29s          82m14s
>   Patched 4.4      45.69s          49m44s

I wonder if you'd get the same results on such a benchmark simply by
making the spin lock a mutex, thereby reducing the number of CPUs
spinning on a single lock cacheline at any one point in time.
Certainly the system time will plummet....

> The elapsed time and the reported system time were reduced by 30%
> and 40% respectively.
>
> Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
> ---
>  fs/inode.c         |   13 +++++--------
>  fs/super.c         |    1 +
>  include/linux/fs.h |    2 ++
>  3 files changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index 9f62db3..870de8c 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -424,19 +424,16 @@ static void inode_lru_list_del(struct inode *inode)
>   */
>  void inode_sb_list_add(struct inode *inode)
>  {
> -	spin_lock(&inode->i_sb->s_inode_list_lock);
> -	list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
> -	spin_unlock(&inode->i_sb->s_inode_list_lock);
> +	do_list_batch(&inode->i_sb->s_inode_list_lock, lb_cmd_add,
> +		      &inode->i_sb->s_list_batch, &inode->i_sb_list);

I don't like the API. This should simply be:

void inode_sb_list_add(struct inode *inode)
{
	list_batch_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
}

void inode_sb_list_del(struct inode *inode)
{
	list_batch_del(&inode->i_sb_list, &inode->i_sb->s_inodes);
}

And all the locks, lists and batch commands are internal to the
struct list_batch and the API implementation.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: [PATCH v2 3/3] vfs: Enable list batching for the superblock's inode list

From: Waiman Long @ 2016-02-03 23:01 UTC
To: Dave Chinner
Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Alexander Viro,
    linux-fsdevel, x86, linux-kernel, Peter Zijlstra, Andi Kleen,
    Scott J Norton, Douglas Hatch

On 01/31/2016 07:04 PM, Dave Chinner wrote:
> On Fri, Jan 29, 2016 at 02:30:46PM -0500, Waiman Long wrote:
>> The inode_sb_list_add() and inode_sb_list_del() functions in the vfs
>> layer just perform list addition and deletion under lock. So they can
>> use the new list batching facility to speed up the list operations
>> when many CPUs are trying to do it simultaneously.
>>
>> In particular, the inode_sb_list_del() function can be a performance
>> bottleneck when large applications with many threads and associated
>> inodes exit. This was measured with an exit microbenchmark that
>> creates a large number of threads, attaches many inodes to them and
>> then exits. The runtimes of that microbenchmark with 1000 threads
>> before and after the patch
> I've never seen sb inode list contention in typical workloads in
> exit processing. Can you post the test script you are using?

I have posted it in one of my earlier emails.

> The inode sb list contention I see is, more often than not, from
> workloads that turn over the inode cache quickly (i.e. instantiating
> lots of inodes through concurrent directory traversal or create
> workloads). These are often latency sensitive, so I'm wondering what
> the effect of spinning waiting for batch processing on every
> contended add is going to do to lookup performance...

I think the batch processor will see a higher latency, but the others
will see a shorter one. If each CPU has a more or less equal chance of
becoming the batch processor, the overall impact on system performance
should not be that significant.

>> on a 4-socket Intel E7-4820 v3 system (48 cores, 96 threads) were
>> as follows:
>>
>>   Kernel        Elapsed Time    System Time
>>   ------        ------------    -----------
>>   Vanilla 4.4      65.29s          82m14s
>>   Patched 4.4      45.69s          49m44s
> I wonder if you'd get the same results on such a benchmark simply by
> making the spin lock a mutex, thereby reducing the number of CPUs
> spinning on a single lock cacheline at any one point in time.
> Certainly the system time will plummet....

I don't think it is a good idea to use a mutex, as we can't sleep.

>> The elapsed time and the reported system time were reduced by 30%
>> and 40% respectively.
>>
>> Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
>> ---
>>  fs/inode.c         |   13 +++++--------
>>  fs/super.c         |    1 +
>>  include/linux/fs.h |    2 ++
>>  3 files changed, 8 insertions(+), 8 deletions(-)
>>
>> diff --git a/fs/inode.c b/fs/inode.c
>> index 9f62db3..870de8c 100644
>> --- a/fs/inode.c
>> +++ b/fs/inode.c
>> @@ -424,19 +424,16 @@ static void inode_lru_list_del(struct inode *inode)
>>   */
>>  void inode_sb_list_add(struct inode *inode)
>>  {
>> -	spin_lock(&inode->i_sb->s_inode_list_lock);
>> -	list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
>> -	spin_unlock(&inode->i_sb->s_inode_list_lock);
>> +	do_list_batch(&inode->i_sb->s_inode_list_lock, lb_cmd_add,
>> +		      &inode->i_sb->s_list_batch, &inode->i_sb_list);
> I don't like the API.
> This should simply be:
>
> void inode_sb_list_add(struct inode *inode)
> {
> 	list_batch_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
> }
>
> void inode_sb_list_del(struct inode *inode)
> {
> 	list_batch_del(&inode->i_sb_list, &inode->i_sb->s_inodes);
> }
>
> And all the locks, lists and batch commands are internal to the
> struct list_batch and the API implementation.

Points taken. I will update the patch to do that. Thanks for the
review.

Cheers,
Longman