From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Davidlohr Bueso <davidlohr@hp.com>
Cc: linux-kernel@vger.kernel.org, mingo@kernel.org,
dvhart@linux.intel.com, peterz@infradead.org, tglx@linutronix.de,
efault@gmx.de, jeffm@suse.com, torvalds@linux-foundation.org,
jason.low2@hp.com, Waiman.Long@hp.com, tom.vaden@hp.com,
scott.norton@hp.com, aswin@hp.com
Subject: Re: [PATCH v5 2/4] futex: Larger hash table
Date: Fri, 10 Jan 2014 23:37:24 -0800
Message-ID: <20140111073724.GA10038@linux.vnet.ibm.com>
In-Reply-To: <1388675120-8017-3-git-send-email-davidlohr@hp.com>
On Thu, Jan 02, 2014 at 07:05:18AM -0800, Davidlohr Bueso wrote:
> From: Davidlohr Bueso <davidlohr@hp.com>
>
> Currently, the futex global hash table suffers from its fixed, smallish
> (by today's standards) size of 256 entries, as well as its lack of NUMA
> awareness. Large systems using many futexes can be prone to a high number
> of collisions, where these futexes hash to the same bucket and lead to
> extra contention on the same hb->lock. Furthermore, cacheline bouncing is a
> reality when we have multiple hb->locks residing on the same cacheline and
> different futexes hash to adjacent buckets.
>
> This patch keeps the current static size of 16 entries for small systems,
> or otherwise, 256 * ncpus (or larger as we need to round the number to a
> power of 2). Note that this number of CPUs accounts for all CPUs that can
> ever be available in the system, taking into consideration things like
> hotplugging. While we do impose extra overhead at bootup by making the hash
> table larger, this is a one-time cost, and does not overshadow the benefits
> of this patch.
>
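For illustration, here is a minimal user-space sketch of the sizing rule
described above. The names roundup_pow2() and futex_table_size() are
hypothetical stand-ins, not kernel functions; the patch itself uses
roundup_pow_of_two() and num_possible_cpus().

```c
#include <assert.h>

/*
 * User-space stand-in for the kernel's roundup_pow_of_two():
 * returns the smallest power of two that is >= n.
 * Assumes n is nonzero and small enough that the result fits
 * in an unsigned long.
 */
static unsigned long roundup_pow2(unsigned long n)
{
	unsigned long p = 1;

	assert(n != 0);
	while (p < n)
		p <<= 1;
	return p;
}

/*
 * Hypothetical helper mirroring the rule in the commit message:
 * 16 buckets for CONFIG_BASE_SMALL builds, otherwise 256 buckets
 * per possible CPU, rounded up to a power of two.
 */
static unsigned long futex_table_size(int base_small,
				      unsigned long possible_cpus)
{
	if (base_small)
		return 16;
	return roundup_pow2(256 * possible_cpus);
}
```

If num_possible_cpus() were to return 80, as the core count of the
benchmark machine suggests it might, futex_table_size(0, 80) would give
32768 buckets, since 256 * 80 = 20480 is not itself a power of two.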
> Furthermore, as suggested by tglx, by cache aligning the hash buckets we can
> avoid access across cacheline boundaries and also avoid massive cache line
> bouncing if multiple cpus are hammering away at different hash buckets which
> happen to reside in the same cache line.
>
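The effect of cache-aligning the buckets can be sketched in user-space C11
as follows. This is only an approximation under stated assumptions: a
64-byte cache line (ASSUMED_CACHELINE), an array whose base is itself
cache-line aligned, and toy bucket types standing in for
struct futex_hash_bucket. The kernel uses ____cacheline_aligned_in_smp
rather than alignas.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define ASSUMED_CACHELINE 64	/* assumption: 64-byte cache lines */

/* Toy bucket without alignment: several of these fit in one line. */
struct bucket_packed {
	int lock;
	void *chain[2];
};

/* Toy bucket padded and aligned so each one owns a full cache line. */
struct bucket_aligned {
	alignas(ASSUMED_CACHELINE) int lock;
	void *chain[2];
};

/*
 * Index of the cache line holding the first byte of element idx,
 * for an array with the given element stride whose base address is
 * cache-line aligned.
 */
static size_t cache_line_of(size_t stride, size_t idx)
{
	assert(stride != 0);
	return (idx * stride) / ASSUMED_CACHELINE;
}
```

With the packed layout, elements 0 and 1 start in the same cache line, so
two CPUs taking different bucket locks can still bounce that line. With
the aligned layout, each element starts on its own line, at the cost of a
larger table.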
> Also, similar to other core kernel components (pid, dcache, tcp), by using
> alloc_large_system_hash() we benefit from its NUMA awareness and thus the
> table is distributed among the nodes instead of in a single one.
>
> For a custom microbenchmark that pounds on the uaddr hashing -- making the wait
> path fail at futex_wait_setup() returning -EWOULDBLOCK for large amounts of
> futexes, we can see the following benefits on a 80-core, 8-socket 1Tb server:
>
> +---------+--------------------+------------------------+-----------------------+-------------------------------+
> | threads | baseline (ops/sec) | aligned-only (ops/sec) | large table (ops/sec) | large table+aligned (ops/sec) |
> +---------+--------------------+------------------------+-----------------------+-------------------------------+
> | 512 | 32426 | 50531 (+55.8%) | 255274 (+687.2%) | 292553 (+802.2%) |
> | 256 | 65360 | 99588 (+52.3%) | 443563 (+578.6%) | 508088 (+677.3%) |
> | 128 | 125635 | 200075 (+59.2%) | 742613 (+491.1%) | 835452 (+564.9%) |
> | 80 | 193559 | 323425 (+67.1%) | 1028147 (+431.1%) | 1130304 (+483.9%) |
> | 64 | 247667 | 443740 (+79.1%) | 997300 (+302.6%) | 1145494 (+362.5%) |
> | 32 | 628412 | 721401 (+14.7%) | 965996 (+53.7%) | 1122115 (+78.5%) |
> +---------+--------------------+------------------------+-----------------------+-------------------------------+
>
> Cc: Ingo Molnar <mingo@kernel.org>
> Reviewed-by: Darren Hart <dvhart@linux.intel.com>
> Acked-by: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Mike Galbraith <efault@gmx.de>
> Cc: Jeff Mahoney <jeffm@suse.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Scott Norton <scott.norton@hp.com>
> Cc: Tom Vaden <tom.vaden@hp.com>
> Cc: Aswin Chandramouleeswaran <aswin@hp.com>
> Reviewed-by: Waiman Long <Waiman.Long@hp.com>
> Reviewed-and-tested-by: Jason Low <jason.low2@hp.com>
> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> ---
> kernel/futex.c | 26 +++++++++++++++++++-------
> 1 file changed, 19 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/futex.c b/kernel/futex.c
> index 085f5fa..577481d 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -63,6 +63,7 @@
> #include <linux/sched/rt.h>
> #include <linux/hugetlb.h>
> #include <linux/freezer.h>
> +#include <linux/bootmem.h>
>
> #include <asm/futex.h>
>
> @@ -70,8 +71,6 @@
>
> int __read_mostly futex_cmpxchg_enabled;
>
> -#define FUTEX_HASHBITS (CONFIG_BASE_SMALL ? 4 : 8)
> -
> /*
> * Futex flags used to encode options to functions and preserve them across
> * restarts.
> @@ -149,9 +148,11 @@ static const struct futex_q futex_q_init = {
>  struct futex_hash_bucket {
>  	spinlock_t lock;
>  	struct plist_head chain;
> -};
> +} ____cacheline_aligned_in_smp;
>
> -static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS];
> +static unsigned long __read_mostly futex_hashsize;
> +
> +static struct futex_hash_bucket *futex_queues;
>
>  /*
>   * We hash on the keys returned from get_futex_key (see below).
> @@ -161,7 +162,7 @@ static struct futex_hash_bucket *hash_futex(union futex_key *key)
>  	u32 hash = jhash2((u32*)&key->both.word,
>  			  (sizeof(key->both.word)+sizeof(key->both.ptr))/4,
>  			  key->both.offset);
> -	return &futex_queues[hash & ((1 << FUTEX_HASHBITS)-1)];
> +	return &futex_queues[hash & (futex_hashsize - 1)];
>  }
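The mask in this hunk selects a valid bucket only because the table size
is a power of two. A minimal user-space sketch of that property follows;
is_pow2() and bucket_index() are hypothetical names used for illustration
and do not appear in kernel/futex.c.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if n is a power of two (exactly one bit set). */
static int is_pow2(unsigned long n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Map a 32-bit hash to a bucket index by masking.  For a power-of-two
 * table_size, hash & (table_size - 1) equals hash % table_size, so the
 * result is always in [0, table_size).  For any other size the mask
 * would skip some buckets, hence the assertion.
 */
static unsigned long bucket_index(uint32_t hash, unsigned long table_size)
{
	assert(is_pow2(table_size));
	return hash & (table_size - 1);
}
```

This is why the sizing code rounds 256 * ncpus up with
roundup_pow_of_two() before the value is ever used as a mask.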
>
> /*
> @@ -2719,7 +2720,18 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
>  static int __init futex_init(void)
>  {
>  	u32 curval;
> -	int i;
> +	unsigned long i;
> +
> +#if CONFIG_BASE_SMALL
> +	futex_hashsize = 16;
> +#else
> +	futex_hashsize = roundup_pow_of_two(256 * num_possible_cpus());
> +#endif
> +
> +	futex_queues = alloc_large_system_hash("futex", sizeof(*futex_queues),
> +					       futex_hashsize, 0,
> +					       futex_hashsize < 256 ? HASH_SMALL : 0,
> +					       NULL, NULL, futex_hashsize, futex_hashsize);
>
>  	/*
>  	 * This will fail and we want it. Some arch implementations do
> @@ -2734,7 +2746,7 @@ static int __init futex_init(void)
>  	if (cmpxchg_futex_value_locked(&curval, NULL, 0, 0) == -EFAULT)
>  		futex_cmpxchg_enabled = 1;
>
> -	for (i = 0; i < ARRAY_SIZE(futex_queues); i++) {
> +	for (i = 0; i < futex_hashsize; i++) {
>  		plist_head_init(&futex_queues[i].chain);
>  		spin_lock_init(&futex_queues[i].lock);
>  	}
> --
> 1.8.1.4
>