* [rfc] [patch 1/2 ] Process private hash tables for private futexes
@ 2009-03-21  4:46 Ravikiran G Thirumalai
  2009-03-21  4:52 ` [rfc] [patch 2/2 ] Sysctl to turn on/off private futex " Ravikiran G Thirumalai
  ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread

From: Ravikiran G Thirumalai @ 2009-03-21 4:46 UTC (permalink / raw)
To: linux-kernel; +Cc: Ingo Molnar, shai

Patch to have a process private hash table for 'PRIVATE' futexes.

On large core count systems, running multiple threaded processes causes
false sharing on the global futex hash table.  The global futex hash
table is an array of struct futex_hash_bucket, which is defined as:

	struct futex_hash_bucket {
		spinlock_t lock;
		struct plist_head chain;
	};

	static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS];

Needless to say, this will cause multiple spinlocks to reside on the
same cacheline, which is very bad when multiple unrelated processes
hash onto adjacent hash buckets.  The probability of unrelated futexes
ending up on adjacent hash buckets increases with the number of cores
in the system (more cores available translates to more processes/more
threads being run on a system).  The effects of false sharing are
tangible on machines with more than 32 cores.  We have noticed this
with the workload of certain multithreaded FEA (Finite Element
Analysis) solvers.  We reported this problem a couple of years ago,
which eventually resulted in a new api for private futexes to avoid
mmap_sem.  The false sharing on the global futex hash was put off
pending glibc changes to accommodate the private futex apis.  Now that
the glibc changes are in, and multicore is more prevalent, maybe it is
time to fix this problem.

The root cause of the problem is a global futex hash table even for
process private futexes.  Process private futexes can be hashed on
process private hash tables, avoiding the global hash and a longer
hash table walk when there are a lot more futexes in the workload.
However, this results in the addition of one extra pointer to the
mm_struct.  Hence, this implementation of a process private hash table
is based on a config option, which can be turned off for smaller core
count systems.  Furthermore, a subsequent patch will introduce a sysctl
to dynamically turn on private futex hash tables.

We found this patch to improve the runtime of a certain FEA solver by
about 15% on a 32 core vSMP system.

Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>

Index: linux-2.6.28.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.28.6.orig/include/linux/mm_types.h	2009-03-11 16:52:06.000000000 -0800
+++ linux-2.6.28.6/include/linux/mm_types.h	2009-03-11 16:52:23.000000000 -0800
@@ -256,6 +256,10 @@ struct mm_struct {
 #ifdef CONFIG_MMU_NOTIFIER
 	struct mmu_notifier_mm *mmu_notifier_mm;
 #endif
+#ifdef CONFIG_PROCESS_PRIVATE_FUTEX
+	/* Process private futex hash table */
+	struct futex_hash_bucket *htb;
+#endif
 };

 #endif /* _LINUX_MM_TYPES_H */
Index: linux-2.6.28.6/init/Kconfig
===================================================================
--- linux-2.6.28.6.orig/init/Kconfig	2009-03-11 16:52:06.000000000 -0800
+++ linux-2.6.28.6/init/Kconfig	2009-03-18 17:06:23.000000000 -0800
@@ -672,6 +672,14 @@ config FUTEX
 	  support for "fast userspace mutexes".  The resulting kernel may not
 	  run glibc-based applications correctly.

+config PROCESS_PRIVATE_FUTEX
+	bool "Process private futexes" if FUTEX
+	default n
+	help
+	  This option enables ability to have per-process hashtables for private
+	  futexes. This makes sense on large core-count systems (more than
+	  32 cores)
+
 config ANON_INODES
 	bool
Index: linux-2.6.28.6/kernel/fork.c
===================================================================
--- linux-2.6.28.6.orig/kernel/fork.c	2009-02-17 09:29:27.000000000 -0800
+++ linux-2.6.28.6/kernel/fork.c	2009-03-12 17:12:40.000000000 -0800
@@ -424,6 +424,7 @@ static struct mm_struct * mm_init(struct
 		return mm;
 	}

+	free_futex_htb(mm);
 	free_mm(mm);
 	return NULL;
 }
Index: linux-2.6.28.6/kernel/futex.c
===================================================================
--- linux-2.6.28.6.orig/kernel/futex.c	2009-03-11 16:52:13.000000000 -0800
+++ linux-2.6.28.6/kernel/futex.c	2009-03-18 17:36:04.000000000 -0800
@@ -140,15 +140,84 @@ static inline void futex_unlock_mm(struc
 	up_read(fshared);
 }

+#ifdef CONFIG_PROCESS_PRIVATE_FUTEX
+static void free_htb(struct futex_hash_bucket *htb)
+{
+	if (htb != futex_queues)
+		kfree(htb);
+}
+
+void free_futex_htb(struct mm_struct *mm)
+{
+	free_htb(mm->htb);
+}
+
+static void alloc_htb(struct mm_struct *mm)
+{
+	struct futex_hash_bucket *htb;
+	int i;
+	/*
+	 * Allocate and install a private hash table of the
+	 * same size as the global hash table. We fall
+	 * back onto the global hash on allocation failure
+	 */
+	htb = kmalloc(sizeof(futex_queues), GFP_KERNEL);
+	if (!htb)
+		htb = futex_queues;
+	else {
+		for (i = 0; i < ARRAY_SIZE(futex_queues); i++) {
+			plist_head_init(&htb[i].chain, &htb[i].lock);
+			spin_lock_init(&htb[i].lock);
+		}
+	}
+	/* Install the hash table */
+	spin_lock(&mm->page_table_lock);
+	if (mm->htb) {
+		/* Another thread installed the hash table */
+		spin_unlock(&mm->page_table_lock);
+		free_htb(htb);
+	} else {
+		mm->htb = htb;
+		spin_unlock(&mm->page_table_lock);
+	}
+
+}
+
+static struct futex_hash_bucket *get_futex_hashtable(union futex_key *key)
+{
+	struct mm_struct *mm;
+	if (key->both.offset & FUT_OFF_INODE)
+		/* Shared inode based mapping uses global hash */
+		return futex_queues;
+	/*
+	 * Private futexes -- This covers both FUTEX_PRIVATE_FLAG
+	 * and 'mm' only private futexes
+	 */
+
+	mm = current->mm;
+	if (unlikely(!mm->htb))
+		alloc_htb(mm);
+	return mm->htb;
+}
+#else
+static inline
+struct futex_hash_bucket *get_futex_hashtable(union futex_key *key)
+{
+	return futex_queues;
+}
+#endif
 /*
  * We hash on the keys returned from get_futex_key (see below).
  */
 static struct futex_hash_bucket *hash_futex(union futex_key *key)
 {
-	u32 hash = jhash2((u32*)&key->both.word,
+	struct futex_hash_bucket *htb;
+	u32 hash;
+
+	htb = get_futex_hashtable(key);
+	hash = jhash2((u32 *)&key->both.word,
 			  (sizeof(key->both.word)+sizeof(key->both.ptr))/4,
 			  key->both.offset);
-	return &futex_queues[hash & ((1 << FUTEX_HASHBITS)-1)];
+	return &htb[hash & ((1 << FUTEX_HASHBITS)-1)];
 }

 /*
Index: linux-2.6.28.6/include/linux/futex.h
===================================================================
--- linux-2.6.28.6.orig/include/linux/futex.h	2009-02-17 09:29:27.000000000 -0800
+++ linux-2.6.28.6/include/linux/futex.h	2009-03-18 16:59:27.000000000 -0800
@@ -176,6 +176,15 @@ static inline void exit_pi_state_list(st
 {
 }
 #endif
+
+#ifdef CONFIG_PROCESS_PRIVATE_FUTEX
+extern void free_futex_htb(struct mm_struct *mm);
+#else
+static inline void free_futex_htb(struct mm_struct *mm)
+{
+	return;
+}
+#endif
 #endif /* __KERNEL__ */

 #define FUTEX_OP_SET		0	/* *(int *)UADDR2 = OPARG; */

^ permalink raw reply	[flat|nested] 16+ messages in thread
* [rfc] [patch 2/2 ] Sysctl to turn on/off private futex hash tables for private futexes
  2009-03-21  4:46 [rfc] [patch 1/2 ] Process private hash tables for private futexes Ravikiran G Thirumalai
@ 2009-03-21  4:52 ` Ravikiran G Thirumalai
  2009-03-21  9:07 ` [rfc] [patch 1/2 ] Process private " Eric Dumazet
  2009-03-21 11:35 ` Andrew Morton
  2 siblings, 0 replies; 16+ messages in thread

From: Ravikiran G Thirumalai @ 2009-03-21 4:52 UTC (permalink / raw)
To: linux-kernel; +Cc: Ingo Molnar, shai

The following patch introduces a sysctl to control whether a process
private hashtable will be used for process private futexes.

Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>

Index: linux-2.6.28.6/include/linux/futex.h
===================================================================
--- linux-2.6.28.6.orig/include/linux/futex.h	2009-03-18 16:59:27.000000000 -0800
+++ linux-2.6.28.6/include/linux/futex.h	2009-03-18 17:49:02.000000000 -0800
@@ -179,6 +179,7 @@ static inline void exit_pi_state_list(st

 #ifdef CONFIG_PROCESS_PRIVATE_FUTEX
 extern void free_futex_htb(struct mm_struct *mm);
+extern int sysctl_private_hash;
 #else
 static inline void free_futex_htb(struct mm_struct *mm)
 {
Index: linux-2.6.28.6/kernel/futex.c
===================================================================
--- linux-2.6.28.6.orig/kernel/futex.c	2009-03-18 17:36:04.000000000 -0800
+++ linux-2.6.28.6/kernel/futex.c	2009-03-18 17:57:13.000000000 -0800
@@ -154,14 +154,16 @@ void free_futex_htb(struct mm_struct *mm

 static void alloc_htb(struct mm_struct *mm)
 {
-	struct futex_hash_bucket *htb;
 	int i;
+	struct futex_hash_bucket *htb = NULL;
 	/*
 	 * Allocate and install a private hash table of the
 	 * same size as the global hash table. We fall
-	 * back onto the global hash on allocation failure
+	 * back onto the global hash on allocation failure, or
+	 * if private futexes are disabled.
 	 */
-	htb = kmalloc(sizeof(futex_queues), GFP_KERNEL);
+	if (sysctl_private_hash)
+		htb = kmalloc(sizeof(futex_queues), GFP_KERNEL);
 	if (!htb)
 		htb = futex_queues;
 	else {
@@ -183,6 +185,8 @@ static void alloc_htb(struct mm_struct *

 }

+int sysctl_private_hash;
+
 static struct futex_hash_bucket *get_futex_hashtable(union futex_key *key)
 {
 	struct mm_struct *mm;
Index: linux-2.6.28.6/kernel/sysctl.c
===================================================================
--- linux-2.6.28.6.orig/kernel/sysctl.c	2009-03-18 16:59:27.000000000 -0800
+++ linux-2.6.28.6/kernel/sysctl.c	2009-03-18 17:49:02.000000000 -0800
@@ -48,6 +48,7 @@
 #include <linux/acpi.h>
 #include <linux/reboot.h>
 #include <linux/ftrace.h>
+#include <linux/futex.h>

 #include <asm/uaccess.h>
 #include <asm/processor.h>
@@ -799,6 +800,19 @@ static struct ctl_table kern_table[] = {
 		.strategy	= &sysctl_intvec,
 	},
 #endif
+#ifdef CONFIG_PROCESS_PRIVATE_FUTEX
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "private_futex_hashtable",
+		.data		= &sysctl_private_hash,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+#endif
 #ifdef CONFIG_COMPAT
 	{
 		.ctl_name	= KERN_COMPAT_LOG,
Index: linux-2.6.28.6/Documentation/sysctl/kernel.txt
===================================================================
--- linux-2.6.28.6.orig/Documentation/sysctl/kernel.txt	2009-02-17 09:29:27.000000000 -0800
+++ linux-2.6.28.6/Documentation/sysctl/kernel.txt	2009-03-20 12:14:15.000000000 -0800
@@ -281,6 +281,25 @@ send before ratelimiting kicks in.

 ==============================================================

+private_futex_hashtable:
+
+This sysctl can be used to enable processes to use a per-process
+private hash table for private futexes.  Private futexes
+are typically used in threaded workloads.  When this option is
+off, which is the default, threads waiting on futexes get
+hashed onto a global hash table.  On large core count machines
+this turns out to be bad for performance if the workload
+consists of multiple unrelated threaded processes.  Turning
+this option on will enable processes to use a private hash
+for private futexes, improving performance.
+
+Valid options are 0 and 1.  A value of 0 turns off process private
+futex hash tables and a value of 1 enables private futex hash
+tables.  This option will only affect processes that get created
+after the option gets changed.
+
+==============================================================
+
 randomize-va-space:

 This option can be used to select the type of process address

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [rfc] [patch 1/2 ] Process private hash tables for private futexes
  2009-03-21  4:46 [rfc] [patch 1/2 ] Process private hash tables for private futexes Ravikiran G Thirumalai
  2009-03-21  4:52 ` [rfc] [patch 2/2 ] Sysctl to turn on/off private futex " Ravikiran G Thirumalai
@ 2009-03-21  9:07 ` Eric Dumazet
  2009-03-21 11:55   ` [PATCH] futex: Dynamically size futexes hash table Eric Dumazet
  2009-03-22  4:54   ` [rfc] [patch 1/2 ] Process private hash tables for private futexes Ravikiran G Thirumalai
  2009-03-21 11:35 ` Andrew Morton
  2 siblings, 2 replies; 16+ messages in thread

From: Eric Dumazet @ 2009-03-21 9:07 UTC (permalink / raw)
To: Ravikiran G Thirumalai; +Cc: linux-kernel, Ingo Molnar, shai

Ravikiran G Thirumalai wrote:
> Patch to have a process private hash table for 'PRIVATE' futexes.
>
> On large core count systems, running multiple threaded processes causes
> false sharing on the global futex hash table.  The global futex hash
> table is an array of struct futex_hash_bucket, which is defined as:
>
> 	struct futex_hash_bucket {
> 		spinlock_t lock;
> 		struct plist_head chain;
> 	};
>
> 	static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS];
>
> Needless to say, this will cause multiple spinlocks to reside on the
> same cacheline, which is very bad when multiple unrelated processes
> hash onto adjacent hash buckets.  The probability of unrelated futexes
> ending up on adjacent hash buckets increases with the number of cores
> in the system (more cores available translates to more processes/more
> threads being run on a system).  The effects of false sharing are
> tangible on machines with more than 32 cores.  We have noticed this
> with the workload of certain multithreaded FEA (Finite Element
> Analysis) solvers.  We reported this problem a couple of years ago,
> which eventually resulted in a new api for private futexes to avoid
> mmap_sem.  The false sharing on the global futex hash was put off
> pending glibc changes to accommodate the private futex apis.  Now that
> the glibc changes are in, and multicore is more prevalent, maybe it is
> time to fix this problem.
>
> The root cause of the problem is a global futex hash table even for
> process private futexes.  Process private futexes can be hashed on
> process private hash tables, avoiding the global hash and a longer
> hash table walk when there are a lot more futexes in the workload.
> However, this results in the addition of one extra pointer to the
> mm_struct.  Hence, this implementation of a process private hash table
> is based on a config option, which can be turned off for smaller core
> count systems.  Furthermore, a subsequent patch will introduce a
> sysctl to dynamically turn on private futex hash tables.
>
> We found this patch to improve the runtime of a certain FEA solver by
> about 15% on a 32 core vSMP system.
>
> Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
> Signed-off-by: Shai Fultheim <shai@scalex86.org>
>

The first incantation of PRIVATE_FUTEXES had a process private hash table:

http://lkml.org/lkml/2007/3/15/230

I don't remember the objections at that time; maybe it was going to slow
down small users of these PRIVATE_FUTEXES, ie processes that will maybe
use one futex_wait() in their existence, because they'll have to allocate
their private hash table and populate it.

So I dropped the parts about NUMA and private hash tables to get
PRIVATE_FUTEXES into mainline.

http://lwn.net/Articles/229668/

Did you try to change FUTEX_HASHBITS instead, since the current value is
really really ridiculous?

You could also try to adapt this patch to current kernels:

http://linux.derkeiler.com/Mailing-Lists/Kernel/2007-03/msg06504.html

[PATCH 3/3] FUTEX : NUMA friendly global hashtable

On NUMA machines, we should get better performance using a big futex
hashtable, allocated with vmalloc() so that it is spread over several
nodes.

I chose a static size of four pages. (Very big NUMA machines have 64k
page size)

^ permalink raw reply	[flat|nested] 16+ messages in thread
* [PATCH] futex: Dynamically size futexes hash table
  2009-03-21  9:07 ` [rfc] [patch 1/2 ] Process private " Eric Dumazet
@ 2009-03-21 11:55 ` Eric Dumazet
  2009-03-21 16:28   ` Ingo Molnar
  0 siblings, 1 reply; 16+ messages in thread

From: Eric Dumazet @ 2009-03-21 11:55 UTC (permalink / raw)
To: Ravikiran G Thirumalai; +Cc: linux-kernel, Ingo Molnar, shai, Andrew Morton

Eric Dumazet wrote:
> Ravikiran G Thirumalai wrote:
>> Patch to have a process private hash table for 'PRIVATE' futexes.
>>
>> On large core count systems, running multiple threaded processes causes
>> false sharing on the global futex hash table.  The global futex hash
>> table is an array of struct futex_hash_bucket, which is defined as:
>>
>> 	struct futex_hash_bucket {
>> 		spinlock_t lock;
>> 		struct plist_head chain;
>> 	};
>>
>> 	static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS];
>>
>> Needless to say, this will cause multiple spinlocks to reside on the
>> same cacheline, which is very bad when multiple unrelated processes
>> hash onto adjacent hash buckets.  The probability of unrelated futexes
>> ending up on adjacent hash buckets increases with the number of cores
>> in the system (more cores available translates to more processes/more
>> threads being run on a system).  The effects of false sharing are
>> tangible on machines with more than 32 cores.  We have noticed this
>> with the workload of certain multithreaded FEA (Finite Element
>> Analysis) solvers.  We reported this problem a couple of years ago,
>> which eventually resulted in a new api for private futexes to avoid
>> mmap_sem.  The false sharing on the global futex hash was put off
>> pending glibc changes to accommodate the private futex apis.  Now that
>> the glibc changes are in, and multicore is more prevalent, maybe it is
>> time to fix this problem.
>>
>> The root cause of the problem is a global futex hash table even for
>> process private futexes.  Process private futexes can be hashed on
>> process private hash tables, avoiding the global hash and a longer
>> hash table walk when there are a lot more futexes in the workload.
>> However, this results in the addition of one extra pointer to the
>> mm_struct.  Hence, this implementation of a process private hash table
>> is based on a config option, which can be turned off for smaller core
>> count systems.  Furthermore, a subsequent patch will introduce a
>> sysctl to dynamically turn on private futex hash tables.
>>
>> We found this patch to improve the runtime of a certain FEA solver by
>> about 15% on a 32 core vSMP system.
>>
>> Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
>> Signed-off-by: Shai Fultheim <shai@scalex86.org>
>>
>
> The first incantation of PRIVATE_FUTEXES had a process private hash table:
>
> http://lkml.org/lkml/2007/3/15/230
>
> I don't remember the objections at that time; maybe it was going to slow
> down small users of these PRIVATE_FUTEXES, ie processes that will maybe
> use one futex_wait() in their existence, because they'll have to allocate
> their private hash table and populate it.
>
> So I dropped the parts about NUMA and private hash tables to get
> PRIVATE_FUTEXES into mainline.
>
> http://lwn.net/Articles/229668/
>
> Did you try to change FUTEX_HASHBITS instead, since the current value is
> really really ridiculous?
>
> You could also try to adapt this patch to current kernels:
>
> http://linux.derkeiler.com/Mailing-Lists/Kernel/2007-03/msg06504.html
>
> [PATCH 3/3] FUTEX : NUMA friendly global hashtable
>
> On NUMA machines, we should get better performance using a big futex
> hashtable, allocated with vmalloc() so that it is spread over several
> nodes.
>
> I chose a static size of four pages. (Very big NUMA machines have 64k
> page size)
>

Here is my suggested patch.
It should help both private and shared futexes usage on large machines.

[PATCH] futex: Dynamically size futexes hash table

As noticed by Ravikiran Thirumalai and Shai Fultheim, the global futexes
hash table is too small with current hardware.  With 256 slots, the
probability of false sharing is too high.

This patch makes its size not known at compile time, but boot time
computed, depending on various factors, like the number of possible cpus.

Using vmalloc() also has better NUMA properties, since at boot time this
hash table will be spread over many nodes, instead of only one.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 kernel/futex.c |   34 ++++++++++++++++++++++++++++++----
 1 files changed, 30 insertions(+), 4 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 438701a..f636e5c 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -62,7 +62,6 @@

 int __read_mostly futex_cmpxchg_enabled;

-#define FUTEX_HASHBITS (CONFIG_BASE_SMALL ? 4 : 8)

 /*
  * Priority Inheritance state:
@@ -121,7 +120,13 @@ struct futex_hash_bucket {
 	struct plist_head chain;
 };

-static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS];
+#if defined(CONFIG_BASE_SMALL)
+const unsigned int futex_hash_mask = 15;
+static struct futex_hash_bucket futex_queues[16];
+#else
+unsigned int futex_hash_mask __read_mostly;
+static struct futex_hash_bucket *futex_queues __read_mostly;
+#endif

 /*
  * We hash on the keys returned from get_futex_key (see below).
@@ -131,7 +136,7 @@ static struct futex_hash_bucket *hash_futex(union futex_key *key)
 	u32 hash = jhash2((u32*)&key->both.word,
 			  (sizeof(key->both.word)+sizeof(key->both.ptr))/4,
 			  key->both.offset);
-	return &futex_queues[hash & ((1 << FUTEX_HASHBITS)-1)];
+	return &futex_queues[hash & futex_hash_mask];
 }

 /*
@@ -2016,7 +2021,28 @@ static int __init futex_init(void)
 {
 	u32 curval;
 	int i;
+#if !defined(CONFIG_BASE_SMALL)
+#if defined(CONFIG_PROVE_LOCKING)
+	unsigned int nb_slots = 256;
+#else
+	unsigned int nb_slots = roundup_pow_of_two(num_possible_cpus()) * 256;
+#endif
+	size_t sz;

+retry:
+	sz = nb_slots * sizeof(struct futex_hash_bucket);
+	if (sz > PAGE_SIZE)
+		futex_queues = vmalloc(sz);
+	else
+		futex_queues = kmalloc(sz, GFP_KERNEL);
+	if (!futex_queues) {
+		nb_slots >>= 1;
+		if (!nb_slots)
+			panic("Could not allocate futex hash table");
+		goto retry;
+	}
+	futex_hash_mask = nb_slots - 1;
+#endif
 	/*
 	 * This will fail and we want it. Some arch implementations do
 	 * runtime detection of the futex_atomic_cmpxchg_inatomic()
@@ -2031,7 +2057,7 @@ static int __init futex_init(void)
 	if (curval == -EFAULT)
 		futex_cmpxchg_enabled = 1;

-	for (i = 0; i < ARRAY_SIZE(futex_queues); i++) {
+	for (i = 0; i <= futex_hash_mask; i++) {
 		plist_head_init(&futex_queues[i].chain, &futex_queues[i].lock);
 		spin_lock_init(&futex_queues[i].lock);
 	}

^ permalink raw reply related	[flat|nested] 16+ messages in thread
* Re: [PATCH] futex: Dynamically size futexes hash table
  2009-03-21 11:55 ` [PATCH] futex: Dynamically size futexes hash table Eric Dumazet
@ 2009-03-21 16:28 ` Ingo Molnar
  0 siblings, 0 replies; 16+ messages in thread

From: Ingo Molnar @ 2009-03-21 16:28 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Ravikiran G Thirumalai, linux-kernel, shai, Andrew Morton

* Eric Dumazet <dada1@cosmosbay.com> wrote:

> @@ -2016,7 +2021,28 @@ static int __init futex_init(void)
>  {
>  	u32 curval;
>  	int i;
> +#if !defined(CONFIG_BASE_SMALL)
> +#if defined(CONFIG_PROVE_LOCKING)
> +	unsigned int nb_slots = 256;
> +#else
> +	unsigned int nb_slots = roundup_pow_of_two(num_possible_cpus()) * 256;
> +#endif
> +	size_t sz;
>
> +retry:
> +	sz = nb_slots * sizeof(struct futex_hash_bucket);
> +	if (sz > PAGE_SIZE)
> +		futex_queues = vmalloc(sz);
> +	else
> +		futex_queues = kmalloc(sz, GFP_KERNEL);
> +	if (!futex_queues) {
> +		nb_slots >>= 1;
> +		if (!nb_slots)
> +			panic("Could not allocate futex hash table");
> +		goto retry;
> +	}
> +	futex_hash_mask = nb_slots - 1;
> +#endif

No fundamental objections, and the hash sizing problem is causing real
problems and needs to be fixed, but this code sequence is too ugly to
live.

Is there really no big-hash-alloc framework in existence?  I have some
vague memories of networking having met such problems in the past -
maybe they have some goodness there we could reuse shamelessly in the
futex code?

	Ingo

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [rfc] [patch 1/2 ] Process private hash tables for private futexes
  2009-03-21  9:07 ` [rfc] [patch 1/2 ] Process private " Eric Dumazet
  2009-03-21 11:55   ` [PATCH] futex: Dynamically size futexes hash table Eric Dumazet
@ 2009-03-22  4:54 ` Ravikiran G Thirumalai
  2009-03-22  8:17   ` Eric Dumazet
  1 sibling, 1 reply; 16+ messages in thread

From: Ravikiran G Thirumalai @ 2009-03-22 4:54 UTC (permalink / raw)
To: Eric Dumazet; +Cc: linux-kernel, Ingo Molnar, shai

On Sat, Mar 21, 2009 at 10:07:48AM +0100, Eric Dumazet wrote:
>Ravikiran G Thirumalai wrote:
>>
>> We found this patch to improve the runtime of a certain FEA solver by
>> about 15% on a 32 core vSMP system.
>>
>> Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
>> Signed-off-by: Shai Fultheim <shai@scalex86.org>
>>
>
>The first incantation of PRIVATE_FUTEXES had a process private hash table:
>
>http://lkml.org/lkml/2007/3/15/230
>
>I don't remember the objections at that time; maybe it was going to slow
>down small users of these PRIVATE_FUTEXES, ie processes that will maybe
>use one futex_wait() in their existence, because they'll have to allocate
>their private hash table and populate it.
>

With the current proposal, we can still use the global futex hashes for
such workloads (with the sysctl setting).

>So I dropped the parts about NUMA and private hash tables to get
>PRIVATE_FUTEXES into mainline.
>
>http://lwn.net/Articles/229668/
>
>Did you try to change FUTEX_HASHBITS instead, since the current value is
>really really ridiculous?

We tried it in the past and I remember on a 16 core machine, we had to
use 32k hash slots to avoid false sharing.

>
>You could also try to adapt this patch to current kernels:
>
>http://linux.derkeiler.com/Mailing-Lists/Kernel/2007-03/msg06504.html
>
>[PATCH 3/3] FUTEX : NUMA friendly global hashtable
>
>On NUMA machines, we should get better performance using a big futex
>hashtable, allocated with vmalloc() so that it is spread over several
>nodes.
>
>I chose a static size of four pages. (Very big NUMA machines have 64k
>page size)

Yes, dynamically sizing the hash table is better (looking at the patch
you have posted), but there are still no locality guarantees here.  A
process pinned to node X may still end up accessing remote memory
locations while accessing the hash table.  A process private table, on
the other hand, should not have this problem.  I think using a global
hash for entirely process local objects is bad design-wise here.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [rfc] [patch 1/2 ] Process private hash tables for private futexes
  2009-03-22  4:54 ` [rfc] [patch 1/2 ] Process private hash tables for private futexes Ravikiran G Thirumalai
@ 2009-03-22  8:17 ` Eric Dumazet
  2009-03-23 20:28   ` Ravikiran G Thirumalai
  0 siblings, 1 reply; 16+ messages in thread

From: Eric Dumazet @ 2009-03-22 8:17 UTC (permalink / raw)
To: Ravikiran G Thirumalai; +Cc: linux-kernel, Ingo Molnar, shai

Ravikiran G Thirumalai wrote:
>>
>> Did you try to change FUTEX_HASHBITS instead, since the current value is
>> really really ridiculous?
>
> We tried it in the past and I remember on a 16 core machine, we had to
> use 32k hash slots to avoid false sharing.
>
>
> Yes, dynamically sizing the hash table is better (looking at the patch
> you have posted), but there are still no locality guarantees here.  A
> process pinned to node X may still end up accessing remote memory
> locations while accessing the hash table.  A process private table, on
> the other hand, should not have this problem.  I think using a global
> hash for entirely process local objects is bad design-wise here.
>
>

Bad design, or bad luck... considering the kernel already uses such
global tables everywhere (dentries, inodes, tcp, ip route cache, ...).

The problem is how to size this hash table, private or not.  You said
you had to use 32768 slots to avoid false sharing on a 16 core machine.
This seems strange to me, given we use jhash.  What is the size of the
cache line on your platforms???

Say we have 32768 slots for the global hash table, and 256 slots for a
private one: you probably can have a program running slowly with this
private 256 slots table, if this program uses all available cores.

If we use a large private hash table, the setup cost is higher (need to
initialize all spinlocks and plist heads at each program startup),
unless we use a dedicated kmem_cache to keep a pool of preinitialized
private hash tables...

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [rfc] [patch 1/2 ] Process private hash tables for private futexes
  2009-03-22  8:17 ` Eric Dumazet
@ 2009-03-23 20:28 ` Ravikiran G Thirumalai
  2009-03-23 21:57   ` Eric Dumazet
  0 siblings, 1 reply; 16+ messages in thread

From: Ravikiran G Thirumalai @ 2009-03-23 20:28 UTC (permalink / raw)
To: Eric Dumazet; +Cc: linux-kernel, Ingo Molnar, shai, Andrew Morton

On Sun, Mar 22, 2009 at 09:17:01AM +0100, Eric Dumazet wrote:
>Ravikiran G Thirumalai wrote:
>>>
>>> Did you try to change FUTEX_HASHBITS instead, since the current value is
>>> really really ridiculous?
>>
>> We tried it in the past and I remember on a 16 core machine, we had to
>> use 32k hash slots to avoid false sharing.
>>
>>
>> Yes, dynamically sizing the hash table is better (looking at the patch
>> you have posted), but there are still no locality guarantees here.  A
>> process pinned to node X may still end up accessing remote memory
>> locations while accessing the hash table.  A process private table, on
>> the other hand, should not have this problem.  I think using a global
>> hash for entirely process local objects is bad design-wise here.
>>
>>
>
>Bad design, or bad luck... considering the kernel already uses such
>global tables everywhere (dentries, inodes, tcp, ip route cache, ...).

Not necessarily.  The dentry/inode/route caches need to be shared by
processes, so a global cache makes sense there -- the private futexes
need only be shared between threads of the process rather than the
entire world.

>
>The problem is how to size this hash table, private or not.  You said
>you had to use 32768 slots to avoid false sharing on a 16 core machine.
>This seems strange to me, given we use jhash.  What is the size of the
>cache line on your platforms???

It is large, and true, these bad effects get magnified with larger cache
lines.  However, this does forewarn other architectures of problems such
as these.  Access to the below virtual addresses was seen to cause cache
thrashing between nodes on a vSMP system.  The eip corresponds to the
spin_lock on a 2.6.27 kernel at 'futex_wake' (see end of email).
Obviously these addresses correspond to the spinlocks on the hash
buckets, and this was a threaded FEA solver workload on a 32 core
machine.  As can be seen, this is a problem even on a machine with a 64b
cacheline.

>
>Say we have 32768 slots for the global hash table, and 256 slots for a
>private one: you probably can have a program running slowly with this
>private 256 slots table, if this program uses all available cores.

True, as I replied to akpm in this thread: if a workload happens to be
one multi-threaded process with a zillion threads, the workload will
have bigger overheads due to the sharing of the process address space
and mmap_sem.  At least that has been our experience so far.  Private
futex hashes solve the problem at hand.

>
>If we use a large private hash table, the setup cost is higher (need to
>initialize all spinlocks and plist heads at each program startup),
>unless we use a dedicated kmem_cache to keep a pool of preinitialized
>private hash tables...
>

Hmm!  How about

a) Reduce the hash table size for the private futex hash and increase
   the hash table size for the global hash?

OR, better,

b) Since it is multiple spinlocks on the same cacheline which is a PITA
   here, how about keeping the global table, but just adding a
   dereference to each hash slot, and interleaving the adjacent hash
   buckets between nodes/cpus?  So even without needing to lose memory
   to padding, we avoid false sharing on cachelines due to unrelated
   futexes hashing onto adjacent hash buckets?

Thanks,
Kiran

Cache misses at futex_wake due to access to the following addresses:
-------------------------------------------------------------------
fffff819cc180 fffff819cc1d0 fffff819cc248 fffff819cc310
fffff819cc3b0 fffff819cc400 fffff819cc568 fffff819cc5b8
fffff819cc658 fffff819cc770 fffff819cc798 fffff819cc838
fffff819cc8d8 fffff819cc9c8 fffff819cc9f0 fffff819cca90
fffff819ccae0 fffff819ccb08 fffff819ccd38 fffff819ccd88
fffff819ccdb0 fffff819cce78 fffff819ccf18 fffff819ccfb8
fffff819cd030 fffff819cd058 fffff819cd148 fffff819cd210
fffff819cd260 fffff819cd288 fffff819cd2b0 fffff819cd300
fffff819cd350 fffff819cd3f0 fffff819cd440 fffff819cd558
fffff819cd580 fffff819cd620 fffff819cd738 fffff819cd7b0
fffff819cd7d8 fffff819cd828 fffff819cd8c8 fffff819cd8f0
fffff819cd968 fffff819cd9b8 fffff819cd9e0 fffff819cda08
fffff819cda58 fffff819cdad0 fffff819cdbc0 fffff819cdc10
fffff819cdc60 fffff819cddf0 fffff819cde68 fffff819cdfa8
fffff819cdfd0 fffff819ce020 fffff819ce048 fffff819ce070
fffff819ce098 fffff819ce0c0 fffff819ce0e8 fffff819ce110
fffff819ce1d8 fffff819ce200 fffff819ce228 fffff819ce250
fffff819ce3b8 fffff819ce430 fffff819ce480 fffff819ce5e8
fffff819ce660 fffff819ce728 fffff819ce750 fffff819ce868

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [rfc] [patch 1/2 ] Process private hash tables for private futexes
  2009-03-23 20:28 ` Ravikiran G Thirumalai
@ 2009-03-23 21:57 ` Eric Dumazet
  2009-03-24  3:19 ` Ravikiran G Thirumalai
  2009-03-24  7:04 ` Eric Dumazet
  0 siblings, 2 replies; 16+ messages in thread
From: Eric Dumazet @ 2009-03-23 21:57 UTC (permalink / raw)
To: Ravikiran G Thirumalai; +Cc: linux-kernel, Ingo Molnar, shai, Andrew Morton

Ravikiran G Thirumalai a écrit :
> On Sun, Mar 22, 2009 at 09:17:01AM +0100, Eric Dumazet wrote:
>> Ravikiran G Thirumalai a écrit :
>>>> Did you try to change FUTEX_HASHBITS instead, since the current value is really really
>>>> ridiculous ?
>>> We tried it in the past and I remember on a 16 core machine, we had to
>>> use 32k hash slots to avoid false sharing.
>>>
>>>
>>> Yes, dynamically changing the hash table is better (looking at the patch you
>>> have posted), but still there are no locality guarantees here. A process
>>> pinned to node X may still end up accessing remote memory locations while
>>> accessing the hash table. A process private table on the other hand should
>>> not have this problem. I think using a global hash for entirely process local
>>> objects is bad design wise here.
>>>
>>>
>>
>> Bad design, or bad luck... considering all kernels already use such global tables
>> (dentries, inodes, tcp, ip route cache, ...).
>
> Not necessarily. The dentry/inode/route caches need to be shared by
> processes so a global cache makes sense there -- the private futexes need to
> be only shared between threads of the process rather than the entire world.
>
>> Problem is to size this hash table, being private or not. You said you had
>> to have 32768 slots to avoid false sharing on a 16 core machine. This seems
>> strange to me, given we use jhash. What is the size of the cache line on your
>> platforms ???
>
> It is large and true these bad effects get magnified with larger cache lines.
> However, this does forewarn other architectures of problems such as these.
> Access to the below virtual addresses were seen to cause cache thrashing
> between nodes on a vSMP system. The eip corresponds to the spin_lock
> on a 2.6.27 kernel at 'futex_wake' (see end of email).
> Obviously these addresses correspond to the spinlock on the hash buckets,
> and this was a threaded FEA solver workload on a 32 core machine.
> As can be seen, this is a problem even on a machine with a 64b cacheline.
>
>
>> Say we have 32768 slots for the global hash table, and 256 slots for a private one,
>> you probably can have a program running slowly with this private 256 slots table,
>> if this program uses all available cores.
>
> True, as I replied to akpm in this thread, if a workload happens to be one
> multi-threaded process with a zillion threads, the workload will have bigger
> overheads due to the sharing of the process address space and mmap_sem. At least
> that has been our experience so far. Private futex hashes solve the
> problem at hand.
>
>> If we use a large private hash table, the setup cost is higher (need to initialize
>> all spinlocks and plist heads at each program startup), unless we use a dedicated
>> kmem_cache to keep a pool of preinitialized priv hash tables...
>>
>
>
> Hmm! How about
> a) Reduce the hash table size for the private futex hash and increase the
>    hash table size for the global hash?
>
> OR, better,
>
> b) Since it is multiple spinlocks on the same cacheline which is a PITA
>    here, how about keeping the global table, but just add a dereference
>    to each hash slot, and interleave the adjacent hash buckets between
>    nodes/cpus? So even without needing to lose out memory from padding,
>    we avoid false sharing on cachelines due to unrelated futexes hashing
>    onto adjacent hash buckets?
>

Because of jhash(), futex slots are almost random. No need to try to interleave
them.
Since you have a "cache line" of 4096 bytes, you need almost 4 pages
per cpu to avoid, in a statistical way, false sharing.

Unfortunately a small private futex hash will have the same scalability
problem, so it's not a general solution.

Could you try this patch please ?

[PATCH] futex: Dynamically size futexes hash table

As noticed by Ravikiran Thirumalai and Shai Fultheim, the global futexes hash
table is too small with current hardware.
With 256 slots, the probability of false sharing is too high.

This patch makes its size not known at compile time, but boot time computed,
depending on various factors, like number of possible cpus, and VSMP settings.

Using alloc_large_system_hash() also has better NUMA properties, since at boot
time the hash table will be spread over many nodes, instead of only one.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---

diff --git a/kernel/futex.c b/kernel/futex.c
index 438701a..bd9a052 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -62,7 +62,6 @@
 
 int __read_mostly futex_cmpxchg_enabled;
 
-#define FUTEX_HASHBITS (CONFIG_BASE_SMALL ? 4 : 8)
 
 /*
  * Priority Inheritance state:
@@ -121,7 +120,13 @@ struct futex_hash_bucket {
 	struct plist_head chain;
 };
 
-static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS];
+#if defined(CONFIG_BASE_SMALL)
+const unsigned int futex_hash_mask = 15;
+static struct futex_hash_bucket futex_queues[16];
+#else
+unsigned int futex_hash_mask __read_mostly;
+static struct futex_hash_bucket *futex_queues __read_mostly;
+#endif
 
 /*
  * We hash on the keys returned from get_futex_key (see below).
@@ -131,7 +136,7 @@ static struct futex_hash_bucket *hash_futex(union futex_key *key)
 	u32 hash = jhash2((u32*)&key->both.word,
 			  (sizeof(key->both.word)+sizeof(key->both.ptr))/4,
 			  key->both.offset);
-	return &futex_queues[hash & ((1 << FUTEX_HASHBITS)-1)];
+	return &futex_queues[hash & futex_hash_mask];
 }
 
 /*
@@ -2016,7 +2021,25 @@ static int __init futex_init(void)
 {
 	u32 curval;
 	int i;
+#if !defined(CONFIG_BASE_SMALL)
+#if defined(CONFIG_PROVE_LOCKING)
+	unsigned int nr_slots = 256;
+#else
+	/* 4 nodes per cpu on VSMP, or 256 slots per cpu */
+	unsigned int nr_slots = max(256, (4 << INTERNODE_CACHE_SHIFT) /
+				    sizeof(struct futex_hash_bucket));
+	nr_slots = roundup_pow_of_two(num_possible_cpus() * nr_slots);
+#endif
+	futex_queues = alloc_large_system_hash("futexes",
+					       sizeof(struct futex_hash_bucket),
+					       nr_slots,
+					       0, /* scale : unused */
+					       0, /* flags */
+					       NULL, /* shift */
+					       &futex_hash_mask,
+					       nr_slots);
+#endif
 
 	/*
 	 * This will fail and we want it. Some arch implementations do
 	 * runtime detection of the futex_atomic_cmpxchg_inatomic()
@@ -2031,7 +2054,7 @@ static int __init futex_init(void)
 	if (curval == -EFAULT)
 		futex_cmpxchg_enabled = 1;
 
-	for (i = 0; i < ARRAY_SIZE(futex_queues); i++) {
+	for (i = 0; i <= futex_hash_mask; i++) {
 		plist_head_init(&futex_queues[i].chain, &futex_queues[i].lock);
 		spin_lock_init(&futex_queues[i].lock);
 	}

^ permalink raw reply related	[flat|nested] 16+ messages in thread
* Re: [rfc] [patch 1/2 ] Process private hash tables for private futexes 2009-03-23 21:57 ` Eric Dumazet @ 2009-03-24 3:19 ` Ravikiran G Thirumalai 2009-03-24 3:33 ` Ravikiran G Thirumalai 2009-03-24 5:31 ` Eric Dumazet 2009-03-24 7:04 ` Eric Dumazet 1 sibling, 2 replies; 16+ messages in thread From: Ravikiran G Thirumalai @ 2009-03-24 3:19 UTC (permalink / raw) To: Eric Dumazet; +Cc: linux-kernel, Ingo Molnar, shai, Andrew Morton On Mon, Mar 23, 2009 at 10:57:55PM +0100, Eric Dumazet wrote: >Ravikiran G Thirumalai a écrit : >> [...] >> Hmm! How about >> a) Reduce hash table size for private futex hash and increase hash table >> size for the global hash? >> >> OR, better, >> >> b) Since it is multiple spinlocks on the same cacheline which is a PITA >> here, how about keeping the global table, but just add a dereference >> to each hash slot, and interleave the adjacent hash buckets between >> nodes/cpus? So even without needing to lose out memory from padding, >> we avoid false sharing on cachelines due to unrelated futexes hashing >> onto adjacent hash buckets? >> > >Because of jhash(), futex slots are almost random. No need to try to interleave >them. Since you have a "cache line" of 4096 bytes, you need almost 4 pages >per cpu to avoid in a statistical way false sharing. How did you come up with that number? So there is no way adjacent cachelines will never ever be used in the global hash?? ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [rfc] [patch 1/2 ] Process private hash tables for private futexes 2009-03-24 3:19 ` Ravikiran G Thirumalai @ 2009-03-24 3:33 ` Ravikiran G Thirumalai 2009-03-24 5:31 ` Eric Dumazet 1 sibling, 0 replies; 16+ messages in thread From: Ravikiran G Thirumalai @ 2009-03-24 3:33 UTC (permalink / raw) To: Eric Dumazet; +Cc: linux-kernel, Ingo Molnar, shai, Andrew Morton On Mon, Mar 23, 2009 at 08:19:25PM -0700, Ravikiran G Thirumalai wrote: >>Ravikiran G Thirumalai a écrit : >>> [...] > >How did you come up with that number? So there is no way adjacent >cachelines will never ever be used in the global hash?? s/cachelines/buckets ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [rfc] [patch 1/2 ] Process private hash tables for private futexes
  2009-03-24  3:19 ` Ravikiran G Thirumalai
  2009-03-24  3:33 ` Ravikiran G Thirumalai
@ 2009-03-24  5:31 ` Eric Dumazet
  1 sibling, 0 replies; 16+ messages in thread
From: Eric Dumazet @ 2009-03-24  5:31 UTC (permalink / raw)
To: Ravikiran G Thirumalai; +Cc: linux-kernel, Ingo Molnar, shai, Andrew Morton

Ravikiran G Thirumalai a écrit :
> On Mon, Mar 23, 2009 at 10:57:55PM +0100, Eric Dumazet wrote:
>> Ravikiran G Thirumalai a écrit :
> [...]
>>> Hmm! How about
>>> a) Reduce the hash table size for the private futex hash and increase the
>>>    hash table size for the global hash?
>>>
>>> OR, better,
>>>
>>> b) Since it is multiple spinlocks on the same cacheline which is a PITA
>>>    here, how about keeping the global table, but just add a dereference
>>>    to each hash slot, and interleave the adjacent hash buckets between
>>>    nodes/cpus? So even without needing to lose out memory from padding,
>>>    we avoid false sharing on cachelines due to unrelated futexes hashing
>>>    onto adjacent hash buckets?
>>>
>> Because of jhash(), futex slots are almost random. No need to try to interleave
>> them. Since you have a "cache line" of 4096 bytes, you need almost 4 pages
>> per cpu to avoid in a statistical way false sharing.
>
> How did you come up with that number? So there is no way adjacent
> cachelines will never ever be used in the global hash??
>

There is no way, unless you use one chain per 4096-byte block, and that
would be a waste of memory.

It's about statistics (assuming jhash gives us normally distributed values),
not hard and guaranteed constraints.

http://en.wikipedia.org/wiki/Standard_deviation

For a particular process, you can carefully choose (knowing the hash function
in use by the kernel) user-land futexes that will *all* be hashed in a
*single* hash chain, protected by a single spinlock. This process will then
suffer from contention, using private futexes and a private or shared hash
table.
Even without knowing the hash function, there is still a small chance that it
can happen.

Please note that the hash table size has several purposes: the first one being
to store potentially many futexes (say 30000 threads are waiting on different
futexes), and then to avoid cache line ping pongs or misses. This is why I
chose at least 256 slots per cpu, and not 4 cache lines per cpu.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [rfc] [patch 1/2 ] Process private hash tables for private futexes 2009-03-23 21:57 ` Eric Dumazet 2009-03-24 3:19 ` Ravikiran G Thirumalai @ 2009-03-24 7:04 ` Eric Dumazet 2009-04-23 17:30 ` Darren Hart 1 sibling, 1 reply; 16+ messages in thread From: Eric Dumazet @ 2009-03-24 7:04 UTC (permalink / raw) To: Ravikiran G Thirumalai; +Cc: linux-kernel, Ingo Molnar, shai, Andrew Morton Eric Dumazet a écrit : > Ravikiran G Thirumalai a écrit : >> On Sun, Mar 22, 2009 at 09:17:01AM +0100, Eric Dumazet wrote: >>> Ravikiran G Thirumalai a écrit : >>>>> Did you tried to change FUTEX_HASHBITS instead, since current value is really really >>>>> ridiculous ? >>>> We tried it in the past and I remember on a 16 core machine, we had to >>>> use 32k hash slots to avoid false sharing. >>>> >>>> >>>> Yes, dynamically changing the hash table is better (looking at the patch you >>>> have posted), but still there are no locality guarantees here. A process >>>> pinned to node X may still end up accessing remote memory locations while >>>> accessing the hash table. A process private table on the other hand should >>>> not have this problem. I think using a global hash for entirely process local >>>> objects is bad design wise here. >>>> >>>> >>> Bad design, or bad luck... considering all kernel already use such global tables >>> (dentries, inodes, tcp, ip route cache, ...). >> Not necessarily. The dentry/inode/route caches need to be shared by >> processes so a global cache makes sense there -- the private futexes need to >> be only shared between threads of the process rather than the entire world. >> >>> Problem is to size this hash table, being private or not. You said hou had >>> to have a 32768 slots to avoid false sharing on a 16 core machine. This seems >>> strange to me, given we use jhash. What is the size of the cache line on your >>> platforms ??? >> It is large and true these bad effects get magnified with larger cache lines. 
>> However, this does forewarn other architectures of problems such as these. >> Access to the below virtual addresses were seen to cause cache trashing >> between nodes on a vSMP system. The eip corresponds to the spin_lock >> on a 2.6.27 kernel at 'futex_wake' (see end of email). >> Obviously these addresses correspond to the spinlock on the hash buckets, >> and this was a threaded FEA solver workload on a 32 core machine. >> As can be seen, this is a problem even on a machine with 64b cacheline. >> >> >>> Say we have 32768 slots for the global hash table, and 256 slots for a private one, >>> you probably can have a program running slowly with this private 256 slots table, >>> if this program uses all available cores. >> True, as I replied to akpm in this thread, if a workload happens to be one >> multi-threaded process with a zillion threads, the workload will have bigger >> overheads due to the sharing of process address space and mmap_sem. Atleast >> that has been our experience so far. Private futex hashes solve the >> problem on hand. >> >>> If we use large private hash table, the setup cost is higher (need to initialize >>> all spinlocks and plist heads at each program startup), unless we use a dedicate >>> kmem_cache to keep a pool of preinitialized priv hash tables... >>> >> >> Hmm! How about >> a) Reduce hash table size for private futex hash and increase hash table >> size for the global hash? >> >> OR, better, >> >> b) Since it is multiple spinlocks on the same cacheline which is a PITA >> here, how about keeping the global table, but just add a dereference >> to each hash slot, and interleave the adjacent hash buckets between >> nodes/cpus? So even without needing to lose out memory from padding, >> we avoid false sharing on cachelines due to unrelated futexes hashing >> onto adjacent hash buckets? >> > > Because of jhash(), futex slots are almost random. No need to try to interleave > them. 
Since you have a "cache line" of 4096 bytes, you need almost 4 pages
> per cpu to avoid in a statistical way false sharing.
>
> Unfortunately a small private futex hash will have the same scalability problem,
> so it's not a general solution.
>
> Could you try this patch please ?
>
> [PATCH] futex: Dynamically size futexes hash table
>
> As noticed by Ravikiran Thirumalai and Shai Fultheim, the global futexes hash
> table is too small with current hardware.
> With 256 slots, the probability of false sharing is too high.
>
> This patch makes its size not known at compile time, but boot time computed,
> depending on various factors, like number of possible cpus, and VSMP settings.
>
> Using alloc_large_system_hash() also has better NUMA properties, since at boot
> time the hash table will be spread over many nodes, instead of only one.
>
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> ---
>
> diff --git a/kernel/futex.c b/kernel/futex.c
> index 438701a..bd9a052 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -62,7 +62,6 @@
>
>  int __read_mostly futex_cmpxchg_enabled;
>
> -#define FUTEX_HASHBITS (CONFIG_BASE_SMALL ? 4 : 8)
>
>  /*
>   * Priority Inheritance state:
> @@ -121,7 +120,13 @@ struct futex_hash_bucket {
> 	struct plist_head chain;
>  };
>
> -static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS];
> +#if defined(CONFIG_BASE_SMALL)

Oops, should be

	#if CONFIG_BASE_SMALL

(CONFIG_BASE_SMALL being defined to 0 or !0)

I tested this last patch on a NUMA machine (8 cores), and checked dmesg output :

futexes hash table entries: 2048 (order: 4, 90112 bytes)

[PATCH] futex: Dynamically size futexes hash table

As noticed by Ravikiran Thirumalai and Shai Fultheim, the global futexes hash
table is too small with current hardware.
With 256 slots, the probability of false sharing is too high.

This patch makes its size not known at compile time, but boot time computed,
depending on various factors, like number of possible cpus, and VSMP settings.
Using alloc_large_system_hash() also has better NUMA properties, since at boot
time the hash table will be spread over many nodes, instead of only one.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---

diff --git a/kernel/futex.c b/kernel/futex.c
index 438701a..f8bd4c0 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -55,6 +55,7 @@
 #include <linux/magic.h>
 #include <linux/pid.h>
 #include <linux/nsproxy.h>
+#include <linux/bootmem.h>
 
 #include <asm/futex.h>
 
@@ -62,7 +63,6 @@
 
 int __read_mostly futex_cmpxchg_enabled;
 
-#define FUTEX_HASHBITS (CONFIG_BASE_SMALL ? 4 : 8)
 
 /*
  * Priority Inheritance state:
@@ -121,7 +121,13 @@ struct futex_hash_bucket {
 	struct plist_head chain;
 };
 
-static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS];
+#if CONFIG_BASE_SMALL
+const unsigned int futex_hash_mask = 15;
+static struct futex_hash_bucket futex_queues[16];
+#else
+unsigned int futex_hash_mask __read_mostly;
+static struct futex_hash_bucket *futex_queues __read_mostly;
+#endif
 
 /*
  * We hash on the keys returned from get_futex_key (see below).
@@ -131,7 +137,7 @@ static struct futex_hash_bucket *hash_futex(union futex_key *key)
 	u32 hash = jhash2((u32*)&key->both.word,
 			  (sizeof(key->both.word)+sizeof(key->both.ptr))/4,
 			  key->both.offset);
-	return &futex_queues[hash & ((1 << FUTEX_HASHBITS)-1)];
+	return &futex_queues[hash & futex_hash_mask];
 }
 
 /*
@@ -2016,7 +2022,25 @@ static int __init futex_init(void)
 {
 	u32 curval;
 	int i;
+#if !CONFIG_BASE_SMALL
+#if defined(CONFIG_PROVE_LOCKING)
+	unsigned int nr_slots = 256;
+#else
+	/* 4 nodes per cpu on VSMP, or 256 slots per cpu */
+	unsigned int nr_slots = max(256U, (4U << INTERNODE_CACHE_SHIFT) /
+				    sizeof(struct futex_hash_bucket));
+	nr_slots = roundup_pow_of_two(num_possible_cpus() * nr_slots);
+#endif
+	futex_queues = alloc_large_system_hash("futexes",
+					       sizeof(struct futex_hash_bucket),
+					       nr_slots,
+					       0, /* scale : unused */
+					       0, /* flags */
+					       NULL, /* shift */
+					       &futex_hash_mask,
+					       nr_slots);
+#endif
 
 	/*
 	 * This will fail and we want it. Some arch implementations do
 	 * runtime detection of the futex_atomic_cmpxchg_inatomic()
@@ -2031,7 +2055,7 @@ static int __init futex_init(void)
 	if (curval == -EFAULT)
 		futex_cmpxchg_enabled = 1;
 
-	for (i = 0; i < ARRAY_SIZE(futex_queues); i++) {
+	for (i = 0; i <= futex_hash_mask; i++) {
 		plist_head_init(&futex_queues[i].chain, &futex_queues[i].lock);
 		spin_lock_init(&futex_queues[i].lock);
 	}

^ permalink raw reply related	[flat|nested] 16+ messages in thread
* Re: [rfc] [patch 1/2 ] Process private hash tables for private futexes 2009-03-24 7:04 ` Eric Dumazet @ 2009-04-23 17:30 ` Darren Hart 0 siblings, 0 replies; 16+ messages in thread From: Darren Hart @ 2009-04-23 17:30 UTC (permalink / raw) To: Eric Dumazet Cc: Ravikiran G Thirumalai, linux-kernel, Ingo Molnar, shai, Andrew Morton Eric Dumazet wrote: > > + futex_queues = alloc_large_system_hash("futexes", > + sizeof(struct futex_hash_bucket), > + nr_slots, > + 0, /* scale : unused */ > + 0, /* flags */ > + NULL, /* shift */ > + &futex_hash_mask, > + nr_slots); OK, so I'm a little late to the party, apologies. Minor nit (from someone who has been spending a lot of time in futex.c trying to clean things up recently :-). Let's not comment each argument individually. It needlessly lengthens the code. The interested user can easily look up alloc_large_system_hash. If you feel a comment is really needed, consider a single line before the call... I gave this some thought and couldn't think of anything you could put there that wouldn't still send the interested reader over to page_alloc.c. /* Allocate futex hashtable with... <whatever is special about this> */ futex_queues = alloc_large_system_hash("futexes", sizeof(struct futex_hash_bucket), nr_slots, 0, 0, NULL, &futex_hash_mask, nr_slots); -- Darren Hart IBM Linux Technology Center Real-Time Linux Team ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [rfc] [patch 1/2 ] Process private hash tables for private futexes
  2009-03-21  4:46 [rfc] [patch 1/2 ] Process private hash tables for private futexes Ravikiran G Thirumalai
  2009-03-21  4:52 ` [rfc] [patch 2/2 ] Sysctl to turn on/off private futex " Ravikiran G Thirumalai
  2009-03-21  9:07 ` [rfc] [patch 1/2 ] Process private " Eric Dumazet
@ 2009-03-21 11:35 ` Andrew Morton
  2009-03-22  4:15 ` Ravikiran G Thirumalai
  2 siblings, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2009-03-21 11:35 UTC (permalink / raw)
To: Ravikiran G Thirumalai; +Cc: linux-kernel, Ingo Molnar, shai

On Fri, 20 Mar 2009 21:46:37 -0700 Ravikiran G Thirumalai <kiran@scalex86.org> wrote:

> Patch to have a process private hash table for 'PRIVATE' futexes.
>
> On large core count systems, running multiple threaded processes causes
> false sharing on the global futex hash table. The global futex hash
> table is an array of struct futex_hash_bucket which is defined as:
>
> struct futex_hash_bucket {
> 	spinlock_t lock;
> 	struct plist_head chain;
> };
>
> static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS];
>
> Needless to say, this will cause multiple spinlocks to reside on the
> same cacheline, which is very bad when multiple un-related processes
> hash onto adjacent hash buckets. The probability of unrelated futexes
> ending on adjacent hash buckets increases with the number of cores in the
> system (more cores available translates to more processes/more threads
> being run on a system). The effects of false sharing are tangible on
> machines with more than 32 cores. We have noticed this with the workload
> of certain multi-threaded FEA (Finite Element Analysis) solvers.
> We reported this problem a couple of years ago, which eventually resulted in
> a new api for private futexes to avoid mmap_sem. The false sharing on
> the global futex hash was put off pending glibc changes to accommodate
Now that the glibc changes are in, and > multicore is more prevalent, maybe it is time to fix this problem. > > The root cause of the problem is a global futex hash table even for process > private futexes. Process private futexes can be hashed on process private > hash tables, avoiding the global hash and a longer hash table walk when > there are a lot more futexes in the workload. However, this results in an > addition of one extra pointer to the mm_struct. Hence, this implementation > of a process private hash table is based off a config option, which can be > turned off for smaller core count systems. Furthermore, a subsequent patch > will introduce a sysctl to dynamically turn on private futex hash tables. > > We found this patch to improve the runtime of a certain FEA solver by about > 15% on a 32 core vSMP system. > > Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> > Signed-off-by: Shai Fultheim <shai@scalex86.org> > > Index: linux-2.6.28.6/include/linux/mm_types.h > =================================================================== > --- linux-2.6.28.6.orig/include/linux/mm_types.h 2009-03-11 16:52:06.000000000 -0800 > +++ linux-2.6.28.6/include/linux/mm_types.h 2009-03-11 16:52:23.000000000 -0800 > @@ -256,6 +256,10 @@ struct mm_struct { > #ifdef CONFIG_MMU_NOTIFIER > struct mmu_notifier_mm *mmu_notifier_mm; > #endif > +#ifdef CONFIG_PROCESS_PRIVATE_FUTEX > + /* Process private futex hash table */ > + struct futex_hash_bucket *htb; > +#endif So we're effectively improving the hashing operation by splitting the single hash table into multiple ones. But was that the best way of speeding up the hashing operation? I'd have thought that for some workloads, there will still be tremendous amounts of contention for the per-mm hashtable? In which case it is but a partial fix for certain workloads. Whereas a more general hashing optimisation (if we can come up with it) would benefit both types of workload? 
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [rfc] [patch 1/2 ] Process private hash tables for private futexes 2009-03-21 11:35 ` Andrew Morton @ 2009-03-22 4:15 ` Ravikiran G Thirumalai 0 siblings, 0 replies; 16+ messages in thread From: Ravikiran G Thirumalai @ 2009-03-22 4:15 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel, Ingo Molnar, shai On Sat, Mar 21, 2009 at 04:35:14AM -0700, Andrew Morton wrote: >On Fri, 20 Mar 2009 21:46:37 -0700 Ravikiran G Thirumalai <kiran@scalex86.org> wrote: > >> >> Index: linux-2.6.28.6/include/linux/mm_types.h >> =================================================================== >> --- linux-2.6.28.6.orig/include/linux/mm_types.h 2009-03-11 16:52:06.000000000 -0800 >> +++ linux-2.6.28.6/include/linux/mm_types.h 2009-03-11 16:52:23.000000000 -0800 >> @@ -256,6 +256,10 @@ struct mm_struct { >> #ifdef CONFIG_MMU_NOTIFIER >> struct mmu_notifier_mm *mmu_notifier_mm; >> #endif >> +#ifdef CONFIG_PROCESS_PRIVATE_FUTEX >> + /* Process private futex hash table */ >> + struct futex_hash_bucket *htb; >> +#endif > >So we're effectively improving the hashing operation by splitting the >single hash table into multiple ones. > >But was that the best way of speeding up the hashing operation? I'd have >thought that for some workloads, there will still be tremendous amounts of >contention for the per-mm hashtable? In which case it is but a partial fix >for certain workloads. If there is tremendous contention on the per-mm hashtable, then workload suffers from userspace lock contention to begin with, and the right approach would be to fix the lock contention in userspace/workload, no? True, if a workload happens to be one process on a large core count machine with a zillion threads, using a zillion futexes, hashing might still be bad, but such a workload has bigger hurdles like the mmap_sem which is still a bigger lock than the per-bucket locks of the private hash table. (Even with use of the FUTEX_PRIVATE_FLAG, mmap_sem gets contended for page faults and the like). 
The fundamental problem we are facing right now is private futexes hashing
onto a global hash table, and the futex subsystem then wading through a table
that contains private futexes from other *unrelated* processes. Even with
better hashing/a larger hash table, we do not eliminate the possibility of
private futexes from two unrelated processes hashing on to hash buckets close
enough to be on the same cache line -- if we use one global hash table for
private futexes. (I remember doing some tests and instrumentation in the past
with a different hash function as well as more hash slots.) Hence, it seems
like a private hash table for private futexes is the right solution.

Perhaps we can reduce the hash slots for the private hash table and increase
the hash slots on the global hash (to help non-private futex hashing)?

Thanks,
Kiran

^ permalink raw reply	[flat|nested] 16+ messages in thread
end of thread, other threads:[~2009-04-23 17:30 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-03-21  4:46 [rfc] [patch 1/2 ] Process private hash tables for private futexes Ravikiran G Thirumalai
2009-03-21  4:52 ` [rfc] [patch 2/2 ] Sysctl to turn on/off private futex " Ravikiran G Thirumalai
2009-03-21  9:07 ` [rfc] [patch 1/2 ] Process private " Eric Dumazet
2009-03-21 11:55   ` [PATCH] futex: Dynamically size futexes hash table Eric Dumazet
2009-03-21 16:28     ` Ingo Molnar
2009-03-22  4:54   ` [rfc] [patch 1/2 ] Process private hash tables for private futexes Ravikiran G Thirumalai
2009-03-22  8:17     ` Eric Dumazet
2009-03-23 20:28       ` Ravikiran G Thirumalai
2009-03-23 21:57         ` Eric Dumazet
2009-03-24  3:19           ` Ravikiran G Thirumalai
2009-03-24  3:33             ` Ravikiran G Thirumalai
2009-03-24  5:31             ` Eric Dumazet
2009-03-24  7:04           ` Eric Dumazet
2009-04-23 17:30             ` Darren Hart
2009-03-21 11:35 ` Andrew Morton
2009-03-22  4:15   ` Ravikiran G Thirumalai