linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v5 0/4] futex: Wakeup optimizations
@ 2014-01-02 15:05 Davidlohr Bueso
  2014-01-02 15:05 ` [PATCH v5 1/4] futex: Misc cleanups Davidlohr Bueso
                   ` (5 more replies)
  0 siblings, 6 replies; 23+ messages in thread
From: Davidlohr Bueso @ 2014-01-02 15:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, dvhart, peterz, tglx, paulmck, efault, jeffm, torvalds,
	jason.low2, Waiman.Long, tom.vaden, scott.norton, aswin,
	davidlohr

Changes from v3/v4 [http://lkml.org/lkml/2013/12/19/627]:
 - Almost completely redid patch 4, based on suggestions
   by Linus. Instead of adding an atomic counter to keep
   track of the plist size, couple the list's head empty
   call with a check to see if the hb lock is locked.
   This solves the race that motivated the use of the new
   atomic field.

 - Fix grammar in patch 3

 - Fix SOB tags.

Changes from v2 [http://lwn.net/Articles/575449/]:
 - Reordered SOB tags to reflect me as primary author.

 - Improved ordering guarantee comments for patch 4.

 - Rebased patch 4 against Linus' tree (this patch didn't
   apply after the recent futex changes/fixes).

Changes from v1 [https://lkml.org/lkml/2013/11/22/525]:
 - Removed patch "futex: Check for pi futex_q only once".

 - Cleaned up ifdefs for larger hash table.

 - Added a doc patch from tglx that describes the futex 
   ordering guarantees.

 - Improved the lockless plist check for the wake calls.
   Based on the community feedback, the necessary abstractions
   and barriers are added to maintain ordering guarantees.
   Code documentation is also updated.

 - Removed patch "sched,futex: Provide delayed wakeup list".
   Based on feedback from PeterZ, I will look into this as
   a separate issue once the other patches are settled.

We have been dealing with a customer database workload on large
12Tb, 240 core 16 socket NUMA system that exhibits high amounts 
of contention on some of the locks that serialize internal futex 
data structures. This workload specially suffers in the wakeup 
paths, where waiting on the corresponding hb->lock can account for 
up to ~60% of the time. The result of such calls can mostly be 
classified as (i) nothing to wake up and (ii) wakeup large amount 
of tasks.

Before these patches are applied, we can see this pathological behavior:

 37.12%  826174  xxx  [kernel.kallsyms] [k] _raw_spin_lock
            --- _raw_spin_lock
             |
             |--97.14%-- futex_wake
             |          do_futex
             |          sys_futex
             |          system_call_fastpath
             |          |
             |          |--99.70%-- 0x7f383fbdea1f
             |          |           yyy

 43.71%  762296  xxx  [kernel.kallsyms] [k] _raw_spin_lock
            --- _raw_spin_lock
             |
             |--53.74%-- futex_wake
             |          do_futex
             |          sys_futex
             |          system_call_fastpath
             |          |
             |          |--99.40%-- 0x7fe7d44a4c05
             |          |           zzz
             |--45.90%-- futex_wait_setup
             |          futex_wait
             |          do_futex
             |          sys_futex
             |          system_call_fastpath
             |          0x7fe7ba315789
             |          syscall


With these patches, contention is practically non existent:

 0.10%     49   xxx  [kernel.kallsyms]   [k] _raw_spin_lock
               --- _raw_spin_lock
                |
                |--76.06%-- futex_wait_setup
                |          futex_wait
                |          do_futex
                |          sys_futex
                |          system_call_fastpath
                |          |
                |          |--99.90%-- 0x7f3165e63789
                |          |          syscall|
                           ...
                |--6.27%-- futex_wake
                |          do_futex
                |          sys_futex
                |          system_call_fastpath
                |          |
                |          |--54.56%-- 0x7f317fff2c05
                ...

Patch 1 is a cleanup.

Patch 2 addresses the well known issue of the global hash table.
By creating a larger and NUMA aware table, we can reduce the false
sharing and collisions, thus reducing the chance of different futexes 
using hb->lock.

Patch 3 documents the futex ordering guarantees.

Patch 4 reduces contention on the corresponding hb->lock by not trying to
acquire it if there are no blocked tasks in the waitqueue.
This particularly deals with point (i) above, where we see that it is not
uncommon for up to 90% of wakeup calls end up returning 0, indicating that no
tasks were woken.

This patchset has also been tested on smaller systems for a variety of
benchmarks, including java workloads, kernel builds and custom bang-the-hell-out-of
hb locks programs. So far, no functional or performance regressions have been seen.
Furthermore, no issues were found when running the different tests in the futextest 
suite: http://git.kernel.org/cgit/linux/kernel/git/dvhart/futextest.git/

This patchset applies on top of Linus' tree as of v3.13-rc6 (9a0bb296)

Special thanks to Scott Norton, Tom Vanden, Mark Ray and Aswin Chandramouleeswaran
for help presenting, debugging and analyzing the data.

  futex: Misc cleanups
  futex: Larger hash table
  futex: Document ordering guarantees
  futex: Avoid taking hb lock if nothing to wakeup

 kernel/futex.c | 197 ++++++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 159 insertions(+), 38 deletions(-)

-- 
1.8.1.4


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v5 1/4] futex: Misc cleanups
  2014-01-02 15:05 [PATCH v5 0/4] futex: Wakeup optimizations Davidlohr Bueso
@ 2014-01-02 15:05 ` Davidlohr Bueso
  2014-01-11  6:43   ` Paul E. McKenney
  2014-01-02 15:05 ` [PATCH v5 2/4] futex: Larger hash table Davidlohr Bueso
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 23+ messages in thread
From: Davidlohr Bueso @ 2014-01-02 15:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, dvhart, peterz, tglx, paulmck, efault, jeffm, torvalds,
	jason.low2, Waiman.Long, tom.vaden, scott.norton, aswin,
	davidlohr

From: Jason Low <jason.low2@hp.com>

- Remove unnecessary head variables.
- Delete unused parameter in queue_unlock().

Cc: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Darren Hart <dvhart@linux.intel.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Scott Norton <scott.norton@hp.com>
Cc: Tom Vaden <tom.vaden@hp.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
---
 kernel/futex.c | 39 ++++++++++++---------------------------
 1 file changed, 12 insertions(+), 27 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index f6ff019..085f5fa 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -598,13 +598,10 @@ lookup_pi_state(u32 uval, struct futex_hash_bucket *hb,
 {
 	struct futex_pi_state *pi_state = NULL;
 	struct futex_q *this, *next;
-	struct plist_head *head;
 	struct task_struct *p;
 	pid_t pid = uval & FUTEX_TID_MASK;
 
-	head = &hb->chain;
-
-	plist_for_each_entry_safe(this, next, head, list) {
+	plist_for_each_entry_safe(this, next, &hb->chain, list) {
 		if (match_futex(&this->key, key)) {
 			/*
 			 * Another waiter already exists - bump up
@@ -986,7 +983,6 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 {
 	struct futex_hash_bucket *hb;
 	struct futex_q *this, *next;
-	struct plist_head *head;
 	union futex_key key = FUTEX_KEY_INIT;
 	int ret;
 
@@ -999,9 +995,8 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 
 	hb = hash_futex(&key);
 	spin_lock(&hb->lock);
-	head = &hb->chain;
 
-	plist_for_each_entry_safe(this, next, head, list) {
+	plist_for_each_entry_safe(this, next, &hb->chain, list) {
 		if (match_futex (&this->key, &key)) {
 			if (this->pi_state || this->rt_waiter) {
 				ret = -EINVAL;
@@ -1034,7 +1029,6 @@ futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
 {
 	union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT;
 	struct futex_hash_bucket *hb1, *hb2;
-	struct plist_head *head;
 	struct futex_q *this, *next;
 	int ret, op_ret;
 
@@ -1082,9 +1076,7 @@ retry_private:
 		goto retry;
 	}
 
-	head = &hb1->chain;
-
-	plist_for_each_entry_safe(this, next, head, list) {
+	plist_for_each_entry_safe(this, next, &hb1->chain, list) {
 		if (match_futex (&this->key, &key1)) {
 			if (this->pi_state || this->rt_waiter) {
 				ret = -EINVAL;
@@ -1097,10 +1089,8 @@ retry_private:
 	}
 
 	if (op_ret > 0) {
-		head = &hb2->chain;
-
 		op_ret = 0;
-		plist_for_each_entry_safe(this, next, head, list) {
+		plist_for_each_entry_safe(this, next, &hb2->chain, list) {
 			if (match_futex (&this->key, &key2)) {
 				if (this->pi_state || this->rt_waiter) {
 					ret = -EINVAL;
@@ -1270,7 +1260,6 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags,
 	int drop_count = 0, task_count = 0, ret;
 	struct futex_pi_state *pi_state = NULL;
 	struct futex_hash_bucket *hb1, *hb2;
-	struct plist_head *head1;
 	struct futex_q *this, *next;
 	u32 curval2;
 
@@ -1393,8 +1382,7 @@ retry_private:
 		}
 	}
 
-	head1 = &hb1->chain;
-	plist_for_each_entry_safe(this, next, head1, list) {
+	plist_for_each_entry_safe(this, next, &hb1->chain, list) {
 		if (task_count - nr_wake >= nr_requeue)
 			break;
 
@@ -1494,7 +1482,7 @@ static inline struct futex_hash_bucket *queue_lock(struct futex_q *q)
 }
 
 static inline void
-queue_unlock(struct futex_q *q, struct futex_hash_bucket *hb)
+queue_unlock(struct futex_hash_bucket *hb)
 	__releases(&hb->lock)
 {
 	spin_unlock(&hb->lock);
@@ -1867,7 +1855,7 @@ retry_private:
 	ret = get_futex_value_locked(&uval, uaddr);
 
 	if (ret) {
-		queue_unlock(q, *hb);
+		queue_unlock(*hb);
 
 		ret = get_user(uval, uaddr);
 		if (ret)
@@ -1881,7 +1869,7 @@ retry_private:
 	}
 
 	if (uval != val) {
-		queue_unlock(q, *hb);
+		queue_unlock(*hb);
 		ret = -EWOULDBLOCK;
 	}
 
@@ -2029,7 +2017,7 @@ retry_private:
 			 * Task is exiting and we just wait for the
 			 * exit to complete.
 			 */
-			queue_unlock(&q, hb);
+			queue_unlock(hb);
 			put_futex_key(&q.key);
 			cond_resched();
 			goto retry;
@@ -2081,7 +2069,7 @@ retry_private:
 	goto out_put_key;
 
 out_unlock_put_key:
-	queue_unlock(&q, hb);
+	queue_unlock(hb);
 
 out_put_key:
 	put_futex_key(&q.key);
@@ -2091,7 +2079,7 @@ out:
 	return ret != -EINTR ? ret : -ERESTARTNOINTR;
 
 uaddr_faulted:
-	queue_unlock(&q, hb);
+	queue_unlock(hb);
 
 	ret = fault_in_user_writeable(uaddr);
 	if (ret)
@@ -2113,7 +2101,6 @@ static int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
 {
 	struct futex_hash_bucket *hb;
 	struct futex_q *this, *next;
-	struct plist_head *head;
 	union futex_key key = FUTEX_KEY_INIT;
 	u32 uval, vpid = task_pid_vnr(current);
 	int ret;
@@ -2153,9 +2140,7 @@ retry:
 	 * Ok, other tasks may need to be woken up - check waiters
 	 * and do the wakeup if necessary:
 	 */
-	head = &hb->chain;
-
-	plist_for_each_entry_safe(this, next, head, list) {
+	plist_for_each_entry_safe(this, next, &hb->chain, list) {
 		if (!match_futex (&this->key, &key))
 			continue;
 		ret = wake_futex_pi(uaddr, uval, this);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v5 2/4] futex: Larger hash table
  2014-01-02 15:05 [PATCH v5 0/4] futex: Wakeup optimizations Davidlohr Bueso
  2014-01-02 15:05 ` [PATCH v5 1/4] futex: Misc cleanups Davidlohr Bueso
@ 2014-01-02 15:05 ` Davidlohr Bueso
  2014-01-11  7:37   ` Paul E. McKenney
  2014-01-02 15:05 ` [PATCH v5 3/4] futex: Document ordering guarantees Davidlohr Bueso
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 23+ messages in thread
From: Davidlohr Bueso @ 2014-01-02 15:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, dvhart, peterz, tglx, paulmck, efault, jeffm, torvalds,
	jason.low2, Waiman.Long, tom.vaden, scott.norton, aswin,
	davidlohr

From: Davidlohr Bueso <davidlohr@hp.com>

Currently, the futex global hash table suffers from it's fixed, smallish
(for today's standards) size of 256 entries, as well as its lack of NUMA
awareness. Large systems, using many futexes, can be prone to high amounts
of collisions; where these futexes hash to the same bucket and lead to
extra contention on the same hb->lock. Furthermore, cacheline bouncing is a
reality when we have multiple hb->locks residing on the same cacheline and
different futexes hash to adjacent buckets.

This patch keeps the current static size of 16 entries for small systems,
or otherwise, 256 * ncpus (or larger as we need to round the number to a
power of 2). Note that this number of CPUs accounts for all CPUs that can
ever be available in the system, taking into consideration things like
hotpluging. While we do impose extra overhead at bootup by making the hash
table larger, this is a one time thing, and does not shadow the benefits
of this patch.

Furthermore, as suggested by tglx, by cache aligning the hash buckets we can
avoid access across cacheline boundaries and also avoid massive cache line
bouncing if multiple cpus are hammering away at different hash buckets which
happen to reside in the same cache line.

Also, similar to other core kernel components (pid, dcache, tcp), by using
alloc_large_system_hash() we benefit from its NUMA awareness and thus the
table is distributed among the nodes instead of in a single one.

For a custom microbenchmark that pounds on the uaddr hashing -- making the wait
path fail at futex_wait_setup() returning -EWOULDBLOCK for large amounts of
futexes, we can see the following benefits on a 80-core, 8-socket 1Tb server:

+---------+--------------------+------------------------+-----------------------+-------------------------------+
| threads | baseline (ops/sec) | aligned-only (ops/sec) | large table (ops/sec) | large table+aligned (ops/sec) |
+---------+--------------------+------------------------+-----------------------+-------------------------------+
|     512 |		 32426 | 50531  (+55.8%)	| 255274  (+687.2%)	| 292553  (+802.2%)		|
|     256 |		 65360 | 99588  (+52.3%)	| 443563  (+578.6%)	| 508088  (+677.3%)		|
|     128 |		125635 | 200075 (+59.2%)	| 742613  (+491.1%)	| 835452  (+564.9%)		|
|      80 |		193559 | 323425 (+67.1%)	| 1028147 (+431.1%)	| 1130304 (+483.9%)		|
|      64 |		247667 | 443740 (+79.1%)	| 997300  (+302.6%)	| 1145494 (+362.5%)		|
|      32 |		628412 | 721401 (+14.7%)	| 965996  (+53.7%)	| 1122115 (+78.5%)		|
+---------+--------------------+------------------------+-----------------------+-------------------------------+

Cc: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Darren Hart <dvhart@linux.intel.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Scott Norton <scott.norton@hp.com>
Cc: Tom Vaden <tom.vaden@hp.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Reviewed-by: Waiman Long <Waiman.Long@hp.com>
Reviewed-and-tested-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
---
 kernel/futex.c | 26 +++++++++++++++++++-------
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 085f5fa..577481d 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -63,6 +63,7 @@
 #include <linux/sched/rt.h>
 #include <linux/hugetlb.h>
 #include <linux/freezer.h>
+#include <linux/bootmem.h>
 
 #include <asm/futex.h>
 
@@ -70,8 +71,6 @@
 
 int __read_mostly futex_cmpxchg_enabled;
 
-#define FUTEX_HASHBITS (CONFIG_BASE_SMALL ? 4 : 8)
-
 /*
  * Futex flags used to encode options to functions and preserve them across
  * restarts.
@@ -149,9 +148,11 @@ static const struct futex_q futex_q_init = {
 struct futex_hash_bucket {
 	spinlock_t lock;
 	struct plist_head chain;
-};
+} ____cacheline_aligned_in_smp;
 
-static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS];
+static unsigned long __read_mostly futex_hashsize;
+
+static struct futex_hash_bucket *futex_queues;
 
 /*
  * We hash on the keys returned from get_futex_key (see below).
@@ -161,7 +162,7 @@ static struct futex_hash_bucket *hash_futex(union futex_key *key)
 	u32 hash = jhash2((u32*)&key->both.word,
 			  (sizeof(key->both.word)+sizeof(key->both.ptr))/4,
 			  key->both.offset);
-	return &futex_queues[hash & ((1 << FUTEX_HASHBITS)-1)];
+	return &futex_queues[hash & (futex_hashsize - 1)];
 }
 
 /*
@@ -2719,7 +2720,18 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
 static int __init futex_init(void)
 {
 	u32 curval;
-	int i;
+	unsigned long i;
+
+#if CONFIG_BASE_SMALL
+	futex_hashsize = 16;
+#else
+	futex_hashsize = roundup_pow_of_two(256 * num_possible_cpus());
+#endif
+
+	futex_queues = alloc_large_system_hash("futex", sizeof(*futex_queues),
+					       futex_hashsize, 0,
+					       futex_hashsize < 256 ? HASH_SMALL : 0,
+					       NULL, NULL, futex_hashsize, futex_hashsize);
 
 	/*
 	 * This will fail and we want it. Some arch implementations do
@@ -2734,7 +2746,7 @@ static int __init futex_init(void)
 	if (cmpxchg_futex_value_locked(&curval, NULL, 0, 0) == -EFAULT)
 		futex_cmpxchg_enabled = 1;
 
-	for (i = 0; i < ARRAY_SIZE(futex_queues); i++) {
+	for (i = 0; i < futex_hashsize; i++) {
 		plist_head_init(&futex_queues[i].chain);
 		spin_lock_init(&futex_queues[i].lock);
 	}
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v5 3/4] futex: Document ordering guarantees
  2014-01-02 15:05 [PATCH v5 0/4] futex: Wakeup optimizations Davidlohr Bueso
  2014-01-02 15:05 ` [PATCH v5 1/4] futex: Misc cleanups Davidlohr Bueso
  2014-01-02 15:05 ` [PATCH v5 2/4] futex: Larger hash table Davidlohr Bueso
@ 2014-01-02 15:05 ` Davidlohr Bueso
  2014-01-06 18:58   ` Darren Hart
  2014-01-11  7:40   ` Paul E. McKenney
  2014-01-02 15:05 ` [PATCH v5 4/4] futex: Avoid taking hb lock if nothing to wakeup Davidlohr Bueso
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 23+ messages in thread
From: Davidlohr Bueso @ 2014-01-02 15:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, dvhart, peterz, tglx, paulmck, efault, jeffm, torvalds,
	jason.low2, Waiman.Long, tom.vaden, scott.norton, aswin,
	davidlohr, Randy Dunlap

From: Thomas Gleixner <tglx@linutronix.de>

That's essential, if you want to hack on futexes.

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Darren Hart <dvhart@linux.intel.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Scott Norton <scott.norton@hp.com>
Cc: Tom Vaden <tom.vaden@hp.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: Jason Low <jason.low2@hp.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
---
 kernel/futex.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 57 insertions(+)

diff --git a/kernel/futex.c b/kernel/futex.c
index 577481d..fcc6850 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -69,6 +69,63 @@
 
 #include "locking/rtmutex_common.h"
 
+/*
+ * Basic futex operation and ordering guarantees:
+ *
+ * The waiter reads the futex value in user space and calls
+ * futex_wait(). This function computes the hash bucket and acquires
+ * the hash bucket lock. After that it reads the futex user space value
+ * again and verifies that the data has not changed. If it has not
+ * changed it enqueues itself into the hash bucket, releases the hash
+ * bucket lock and schedules.
+ *
+ * The waker side modifies the user space value of the futex and calls
+ * futex_wake(). This functions computes the hash bucket and acquires
+ * the hash bucket lock. Then it looks for waiters on that futex in the
+ * hash bucket and wakes them.
+ *
+ * Note that the spin_lock serializes waiters and wakers, so that the
+ * following scenario is avoided:
+ *
+ * CPU 0                               CPU 1
+ * val = *futex;
+ * sys_futex(WAIT, futex, val);
+ *   futex_wait(futex, val);
+ *   uval = *futex;
+ *                                     *futex = newval;
+ *                                     sys_futex(WAKE, futex);
+ *                                       futex_wake(futex);
+ *                                       if (queue_empty())
+ *                                         return;
+ *   if (uval == val)
+ *      lock(hash_bucket(futex));
+ *      queue();
+ *     unlock(hash_bucket(futex));
+ *     schedule();
+ *
+ * This would cause the waiter on CPU 0 to wait forever because it
+ * missed the transition of the user space value from val to newval
+ * and the waker did not find the waiter in the hash bucket queue.
+ * The spinlock serializes that:
+ *
+ * CPU 0                               CPU 1
+ * val = *futex;
+ * sys_futex(WAIT, futex, val);
+ *   futex_wait(futex, val);
+ *   lock(hash_bucket(futex));
+ *   uval = *futex;
+ *                                     *futex = newval;
+ *                                     sys_futex(WAKE, futex);
+ *                                       futex_wake(futex);
+ *                                       lock(hash_bucket(futex));
+ *   if (uval == val)
+ *      queue();
+ *     unlock(hash_bucket(futex));
+ *     schedule();                       if (!queue_empty())
+ *                                         wake_waiters(futex);
+ *                                       unlock(hash_bucket(futex));
+ */
+
 int __read_mostly futex_cmpxchg_enabled;
 
 /*
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v5 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2014-01-02 15:05 [PATCH v5 0/4] futex: Wakeup optimizations Davidlohr Bueso
                   ` (2 preceding siblings ...)
  2014-01-02 15:05 ` [PATCH v5 3/4] futex: Document ordering guarantees Davidlohr Bueso
@ 2014-01-02 15:05 ` Davidlohr Bueso
  2014-01-02 19:23   ` Linus Torvalds
                     ` (2 more replies)
  2014-01-06  0:59 ` [PATCH v5 0/4] futex: Wakeup optimizations Davidlohr Bueso
  2014-01-06  1:38 ` [PATCH 5/4] futex: silence uninitialized warnings Davidlohr Bueso
  5 siblings, 3 replies; 23+ messages in thread
From: Davidlohr Bueso @ 2014-01-02 15:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, dvhart, peterz, tglx, paulmck, efault, jeffm, torvalds,
	jason.low2, Waiman.Long, tom.vaden, scott.norton, aswin,
	davidlohr

From: Davidlohr Bueso <davidlohr@hp.com>

In futex_wake() there is clearly no point in taking the hb->lock if we know
beforehand that there are no tasks to be woken. While the hash bucket's plist
head is a cheap way of knowing this, we cannot rely 100% on it as there is a
racy window between the futex_wait call and when the task is actually added to
the plist. To this end, we couple it with the spinlock check as tasks trying to
enter the critical region are most likely potential waiters that will be added
to the plist, thus preventing tasks sleeping forever if wakers don't acknowledge
all possible waiters.

Furthermore, the futex ordering guarantees are preserved, ensuring that waiters
either observe the changed user space value before blocking or is woken by a
concurrent waker. For wakers, this is done by relying on the barriers in
get_futex_key_refs() -- for archs that do have implicit mb in atomic_inc() we
explicitly add them through a new futex_get_mm function. For waiters we rely
on the fact that spin_lock calls already update the head counter, so spinners
are visible even if the lock hasn't been acquired yet.

For more details please refer to the updated comments in the code and related
discussion: https://lkml.org/lkml/2013/11/26/556

Special thanks to tglx for careful review and feedback.

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Darren Hart <dvhart@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Jeff Mahoney <jeffm@suse.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Scott Norton <scott.norton@hp.com>
Cc: Tom Vaden <tom.vaden@hp.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: Jason Low <jason.low2@hp.com>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
---
 kernel/futex.c | 113 +++++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 90 insertions(+), 23 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index fcc6850..5b4d09e 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -75,17 +75,20 @@
  * The waiter reads the futex value in user space and calls
  * futex_wait(). This function computes the hash bucket and acquires
  * the hash bucket lock. After that it reads the futex user space value
- * again and verifies that the data has not changed. If it has not
- * changed it enqueues itself into the hash bucket, releases the hash
- * bucket lock and schedules.
+ * again and verifies that the data has not changed. If it has not changed
+ * it enqueues itself into the hash bucket, releases the hash bucket lock
+ * and schedules.
  *
  * The waker side modifies the user space value of the futex and calls
- * futex_wake(). This functions computes the hash bucket and acquires
- * the hash bucket lock. Then it looks for waiters on that futex in the
- * hash bucket and wakes them.
+ * futex_wake(). This function computes the hash bucket and acquires the
+ * hash bucket lock. Then it looks for waiters on that futex in the hash
+ * bucket and wakes them.
  *
- * Note that the spin_lock serializes waiters and wakers, so that the
- * following scenario is avoided:
+ * In scenarios where wakeups are called and no tasks are blocked on a futex,
+ * taking the hb spinlock can be avoided and simply return. In order for this
+ * optimization to work, ordering guarantees must exist so that the waiter
+ * being added to the list is acknowledged when the list is concurrently being
+ * checked by the waker, avoiding scenarios like the following:
  *
  * CPU 0                               CPU 1
  * val = *futex;
@@ -106,24 +109,50 @@
  * This would cause the waiter on CPU 0 to wait forever because it
  * missed the transition of the user space value from val to newval
  * and the waker did not find the waiter in the hash bucket queue.
- * The spinlock serializes that:
+ *
+ * The correct serialization ensures that a waiter either observes
+ * the changed user space value before blocking or is woken by a
+ * concurrent waker:
  *
  * CPU 0                               CPU 1
  * val = *futex;
  * sys_futex(WAIT, futex, val);
  *   futex_wait(futex, val);
- *   lock(hash_bucket(futex));
- *   uval = *futex;
- *                                     *futex = newval;
- *                                     sys_futex(WAKE, futex);
- *                                       futex_wake(futex);
- *                                       lock(hash_bucket(futex));
+ *
+ *   waiters++;
+ *   mb(); (A) <-- paired with -.
+ *                              |
+ *   lock(hash_bucket(futex));  |
+ *                              |
+ *   uval = *futex;             |
+ *                              |        *futex = newval;
+ *                              |        sys_futex(WAKE, futex);
+ *                              |          futex_wake(futex);
+ *                              |
+ *                              `------->   mb(); (B)
  *   if (uval == val)
- *      queue();
+ *     queue();
  *     unlock(hash_bucket(futex));
- *     schedule();                       if (!queue_empty())
- *                                         wake_waiters(futex);
- *                                       unlock(hash_bucket(futex));
+ *     schedule();                         if (waiters)
+ *                                           lock(hash_bucket(futex));
+ *                                           wake_waiters(futex);
+ *                                           unlock(hash_bucket(futex));
+ *
+ * Where (A) orders the waiters increment and the futex value read -- this
+ * is guaranteed by the head counter in the hb spinlock; and where (B)
+ * orders the write to futex and the waiters read.
+ *
+ * This yields the following case (where X:=waiters, Y:=futex):
+ *
+ *	X = Y = 0
+ *
+ *	w[X]=1		w[Y]=1
+ *	MB		MB
+ *	r[Y]=y		r[X]=x
+ *
+ * Which guarantees that x==0 && y==0 is impossible; which translates back into
+ * the guarantee that we cannot both miss the futex variable change and the
+ * enqueue.
  */
 
 int __read_mostly futex_cmpxchg_enabled;
@@ -211,6 +240,38 @@ static unsigned long __read_mostly futex_hashsize;
 
 static struct futex_hash_bucket *futex_queues;
 
+static inline void futex_get_mm(union futex_key *key)
+{
+	atomic_inc(&key->private.mm->mm_count);
+#ifdef CONFIG_SMP
+	/*
+	 * Ensure futex_get_mm() implies a full barrier such that
+	 * get_futex_key() implies a full barrier. This is relied upon
+	 * as full barrier (B), see the ordering comment above.
+	 */
+	smp_mb__after_atomic_inc();
+#endif
+}
+
+static inline bool hb_waiters_pending(struct futex_hash_bucket *hb)
+{
+#ifdef CONFIG_SMP
+	/*
+	 * Tasks trying to enter the critical region are most likely
+	 * potential waiters that will be added to the plist. Ensure
+	 * that wakers won't miss to-be-slept tasks in the window between
+	 * the wait call and the actual plist_add.
+	 */
+	if (spin_is_locked(&hb->lock))
+		return true;
+	smp_rmb(); /* Make sure we check the lock state first */
+
+	return !plist_head_empty(&hb->chain);
+#else
+	return true;
+#endif
+}
+
 /*
  * We hash on the keys returned from get_futex_key (see below).
  */
@@ -245,10 +306,10 @@ static void get_futex_key_refs(union futex_key *key)
 
 	switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
 	case FUT_OFF_INODE:
-		ihold(key->shared.inode);
+		ihold(key->shared.inode); /* implies MB (B) */
 		break;
 	case FUT_OFF_MMSHARED:
-		atomic_inc(&key->private.mm->mm_count);
+		futex_get_mm(key); /* implies MB (B) */
 		break;
 	}
 }
@@ -322,7 +383,7 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
 	if (!fshared) {
 		key->private.mm = mm;
 		key->private.address = address;
-		get_futex_key_refs(key);
+		get_futex_key_refs(key);  /* implies MB (B) */
 		return 0;
 	}
 
@@ -1052,6 +1113,11 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 		goto out;
 
 	hb = hash_futex(&key);
+
+	/* Make sure we really have tasks to wakeup */
+	if (!hb_waiters_pending(hb))
+		goto out_put_key;
+
 	spin_lock(&hb->lock);
 
 	plist_for_each_entry_safe(this, next, &hb->chain, list) {
@@ -1072,6 +1138,7 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 	}
 
 	spin_unlock(&hb->lock);
+out_put_key:
 	put_futex_key(&key);
 out:
 	return ret;
@@ -1535,7 +1602,7 @@ static inline struct futex_hash_bucket *queue_lock(struct futex_q *q)
 	hb = hash_futex(&q->key);
 	q->lock_ptr = &hb->lock;
 
-	spin_lock(&hb->lock);
+	spin_lock(&hb->lock); /* implies MB (A) */
 	return hb;
 }
 
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2014-01-02 15:05 ` [PATCH v5 4/4] futex: Avoid taking hb lock if nothing to wakeup Davidlohr Bueso
@ 2014-01-02 19:23   ` Linus Torvalds
  2014-01-02 20:59     ` Davidlohr Bueso
  2014-01-06 20:52   ` Darren Hart
  2014-01-11  9:49   ` Paul E. McKenney
  2 siblings, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2014-01-02 19:23 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Linux Kernel Mailing List, Ingo Molnar, Darren Hart,
	Peter Zijlstra, Thomas Gleixner, Paul McKenney, Mike Galbraith,
	Jeff Mahoney, Jason Low, Waiman Long, Tom Vaden, Norton, Scott J,
	Chandramouleeswaran, Aswin

On Thu, Jan 2, 2014 at 7:05 AM, Davidlohr Bueso <davidlohr@hp.com> wrote:
>
> In futex_wake() there is clearly no point in taking the hb->lock if we know
> beforehand that there are no tasks to be woken.

Btw, I think we could optimize this a bit further for the wakeup case.

wake_futex() does a get_task_struct(p)/put_task_struct(p) around its
actual waking logic, and I don't think that's necessary. The task
structures are RCU-delayed, and the task cannot go away until the
"q->lock_ptr = NULL" afaik, so you could replace that atomic inc/dec
with just a RCU read region.

Maybe it's not a big deal ("wake_up_state()" ends up getting the task
struct pi_lock anyway, so it's not like we can avoid toucing the task
structure), but I'm getting the feeling that we're doing a lot of
unnecessary work here.

This only triggers for the actual case of having a task to wake up,
though, so not your particular test load.

              Linus

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2014-01-02 19:23   ` Linus Torvalds
@ 2014-01-02 20:59     ` Davidlohr Bueso
  2014-01-06 20:56       ` Darren Hart
  0 siblings, 1 reply; 23+ messages in thread
From: Davidlohr Bueso @ 2014-01-02 20:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Ingo Molnar, Darren Hart,
	Peter Zijlstra, Thomas Gleixner, Paul McKenney, Mike Galbraith,
	Jeff Mahoney, Jason Low, Waiman Long, Tom Vaden, Norton, Scott J,
	Chandramouleeswaran, Aswin

On Thu, 2014-01-02 at 11:23 -0800, Linus Torvalds wrote:
> On Thu, Jan 2, 2014 at 7:05 AM, Davidlohr Bueso <davidlohr@hp.com> wrote:
> >
> > In futex_wake() there is clearly no point in taking the hb->lock if we know
> > beforehand that there are no tasks to be woken.
> 
> Btw, I think we could optimize this a bit further for the wakeup case.
> 
> wake_futex() does a get_task_struct(p)/put_task_struct(p) around its
> actual waking logic, and I don't think that's necessary. The task
> structures are RCU-delayed, and the task cannot go away until the
> "q->lock_ptr = NULL" afaik, so you could replace that atomic inc/dec
> with just a RCU read region.

I had originally explored making the whole plist thing more rcu aware
but never got to anything worth sharing. What you say does make a lot of
sense, however, I haven't been able to see any actual improvements. It
doesn't hurt however, so I'd have no problem adding such patch to the
lot.

> 
> Maybe it's not a big deal ("wake_up_state()" ends up getting the task
> struct pi_lock anyway, so it's not like we can avoid toucing the task
> structure), but I'm getting the feeling that we're doing a lot of
> unnecessary work here.

I passed this idea through my wakeup measuring program and didn't notice
hardly any difference, just noise, even for large amounts of futexes.
I believe that peterz's idea of lockless batch wakeups is the next step
worth looking into for futexes -- even though the spurious wakeup
problem can become a real pain.

Thanks,
Davidlohr



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 0/4] futex: Wakeup optimizations
  2014-01-02 15:05 [PATCH v5 0/4] futex: Wakeup optimizations Davidlohr Bueso
                   ` (3 preceding siblings ...)
  2014-01-02 15:05 ` [PATCH v5 4/4] futex: Avoid taking hb lock if nothing to wakeup Davidlohr Bueso
@ 2014-01-06  0:59 ` Davidlohr Bueso
  2014-01-06  1:38 ` [PATCH 5/4] futex: silence uninitialized warnings Davidlohr Bueso
  5 siblings, 0 replies; 23+ messages in thread
From: Davidlohr Bueso @ 2014-01-06  0:59 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, dvhart, peterz, tglx, paulmck, efault, jeffm, torvalds,
	jason.low2, Waiman.Long, tom.vaden, scott.norton, aswin

Folks, unless there's a reason not do so, could we get this patchset in
for 3.14? We're already at -rc7 and we could benefit from more testing
in -next, until 3.13 is out.

Thanks,
Davidlohr

On Thu, 2014-01-02 at 07:05 -0800, Davidlohr Bueso wrote:
> Changes from v3/v4 [http://lkml.org/lkml/2013/12/19/627]:
>  - Almost completely redid patch 4, based on suggestions
>    by Linus. Instead of adding an atomic counter to keep
>    track of the plist size, couple the list's head empty
>    call with a check to see if the hb lock is locked.
>    This solves the race that motivated the use of the new
>    atomic field.
> 
>  - Fix grammar in patch 3
> 
>  - Fix SOB tags.
> 
> Changes from v2 [http://lwn.net/Articles/575449/]:
>  - Reordered SOB tags to reflect me as primary author.
> 
>  - Improved ordering guarantee comments for patch 4.
> 
>  - Rebased patch 4 against Linus' tree (this patch didn't
>    apply after the recent futex changes/fixes).
> 
> Changes from v1 [https://lkml.org/lkml/2013/11/22/525]:
>  - Removed patch "futex: Check for pi futex_q only once".
> 
>  - Cleaned up ifdefs for larger hash table.
> 
>  - Added a doc patch from tglx that describes the futex 
>    ordering guarantees.
> 
>  - Improved the lockless plist check for the wake calls.
>    Based on the community feedback, the necessary abstractions
>    and barriers are added to maintain ordering guarantees.
>    Code documentation is also updated.
> 
>  - Removed patch "sched,futex: Provide delayed wakeup list".
>    Based on feedback from PeterZ, I will look into this as
>    a separate issue once the other patches are settled.
> 
> We have been dealing with a customer database workload on large
> 12Tb, 240 core 16 socket NUMA system that exhibits high amounts 
> of contention on some of the locks that serialize internal futex 
> data structures. This workload specially suffers in the wakeup 
> paths, where waiting on the corresponding hb->lock can account for 
> up to ~60% of the time. The result of such calls can mostly be 
> classified as (i) nothing to wake up and (ii) wakeup large amount 
> of tasks.
> 
> Before these patches are applied, we can see this pathological behavior:
> 
>  37.12%  826174  xxx  [kernel.kallsyms] [k] _raw_spin_lock
>             --- _raw_spin_lock
>              |
>              |--97.14%-- futex_wake
>              |          do_futex
>              |          sys_futex
>              |          system_call_fastpath
>              |          |
>              |          |--99.70%-- 0x7f383fbdea1f
>              |          |           yyy
> 
>  43.71%  762296  xxx  [kernel.kallsyms] [k] _raw_spin_lock
>             --- _raw_spin_lock
>              |
>              |--53.74%-- futex_wake
>              |          do_futex
>              |          sys_futex
>              |          system_call_fastpath
>              |          |
>              |          |--99.40%-- 0x7fe7d44a4c05
>              |          |           zzz
>              |--45.90%-- futex_wait_setup
>              |          futex_wait
>              |          do_futex
>              |          sys_futex
>              |          system_call_fastpath
>              |          0x7fe7ba315789
>              |          syscall
> 
> 
> With these patches, contention is practically non existent:
> 
>  0.10%     49   xxx  [kernel.kallsyms]   [k] _raw_spin_lock
>                --- _raw_spin_lock
>                 |
>                 |--76.06%-- futex_wait_setup
>                 |          futex_wait
>                 |          do_futex
>                 |          sys_futex
>                 |          system_call_fastpath
>                 |          |
>                 |          |--99.90%-- 0x7f3165e63789
>                 |          |          syscall|
>                            ...
>                 |--6.27%-- futex_wake
>                 |          do_futex
>                 |          sys_futex
>                 |          system_call_fastpath
>                 |          |
>                 |          |--54.56%-- 0x7f317fff2c05
>                 ...
> 
> Patch 1 is a cleanup.
> 
> Patch 2 addresses the well known issue of the global hash table.
> By creating a larger and NUMA aware table, we can reduce the false
> sharing and collisions, thus reducing the chance of different futexes 
> using hb->lock.
> 
> Patch 3 documents the futex ordering guarantees.
> 
> Patch 4 reduces contention on the corresponding hb->lock by not trying to
> acquire it if there are no blocked tasks in the waitqueue.
> This particularly deals with point (i) above, where we see that it is not
> uncommon for up to 90% of wakeup calls end up returning 0, indicating that no
> tasks were woken.
> 
> This patchset has also been tested on smaller systems for a variety of
> benchmarks, including java workloads, kernel builds and custom bang-the-hell-out-of
> hb locks programs. So far, no functional or performance regressions have been seen.
> Furthermore, no issues were found when running the different tests in the futextest 
> suite: http://git.kernel.org/cgit/linux/kernel/git/dvhart/futextest.git/
> 
> This patchset applies on top of Linus' tree as of v3.13-rc6 (9a0bb296)
> 
> Special thanks to Scott Norton, Tom Vanden, Mark Ray and Aswin Chandramouleeswaran
> for help presenting, debugging and analyzing the data.
> 
>   futex: Misc cleanups
>   futex: Larger hash table
>   futex: Document ordering guarantees
>   futex: Avoid taking hb lock if nothing to wakeup
> 
>  kernel/futex.c | 197 ++++++++++++++++++++++++++++++++++++++++++++++-----------
>  1 file changed, 159 insertions(+), 38 deletions(-)
> 



^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 5/4] futex: silence uninitialized warnings
  2014-01-02 15:05 [PATCH v5 0/4] futex: Wakeup optimizations Davidlohr Bueso
                   ` (4 preceding siblings ...)
  2014-01-06  0:59 ` [PATCH v5 0/4] futex: Wakeup optimizations Davidlohr Bueso
@ 2014-01-06  1:38 ` Davidlohr Bueso
  2014-01-06 18:48   ` Darren Hart
  2014-01-07  2:55   ` Linus Torvalds
  5 siblings, 2 replies; 23+ messages in thread
From: Davidlohr Bueso @ 2014-01-06  1:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, dvhart, peterz, tglx, paulmck, efault, jeffm, torvalds,
	jason.low2, Waiman.Long, tom.vaden, scott.norton, aswin

From: Davidlohr Bueso <davidlohr@hp.com>

Callers of cmpxchg_futex_value_locked() can trigger the following:

kernel/futex.c: In function ‘futex_lock_pi_atomic’:
kernel/futex.c:725: warning: ‘curval’ may be used uninitialized in this function

This was initially addressed by commit 7cfdaf38, but others still remain. Silence
these messages once and for all as the variables really aren't uninitialized.

Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
---
 kernel/futex.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 5b4d09e..65ade8a 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -838,7 +838,7 @@ static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
 				struct task_struct *task, int set_waiters)
 {
 	int lock_taken, ret, force_take = 0;
-	u32 uval, newval, curval, vpid = task_pid_vnr(task);
+	u32 uval, newval, uninitialized_var(curval), vpid = task_pid_vnr(task);
 
 retry:
 	ret = lock_taken = 0;
@@ -2227,7 +2227,7 @@ static int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
 	struct futex_hash_bucket *hb;
 	struct futex_q *this, *next;
 	union futex_key key = FUTEX_KEY_INIT;
-	u32 uval, vpid = task_pid_vnr(current);
+	u32 uninitialized_var(uval), vpid = task_pid_vnr(current);
 	int ret;
 
 retry:
@@ -2843,7 +2843,7 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
 
 static int __init futex_init(void)
 {
-	u32 curval;
+	u32 uninitialized_var(curval);
 	unsigned long i;
 
 #if CONFIG_BASE_SMALL
-- 
1.8.1.4




^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 5/4] futex: silence uninitialized warnings
  2014-01-06  1:38 ` [PATCH 5/4] futex: silence uninitialized warnings Davidlohr Bueso
@ 2014-01-06 18:48   ` Darren Hart
  2014-01-07  2:55   ` Linus Torvalds
  1 sibling, 0 replies; 23+ messages in thread
From: Darren Hart @ 2014-01-06 18:48 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-kernel, mingo, peterz, tglx, paulmck, efault, jeffm,
	torvalds, jason.low2, Waiman.Long, tom.vaden, scott.norton,
	aswin

On Sun, 2014-01-05 at 17:38 -0800, Davidlohr Bueso wrote:
> From: Davidlohr Bueso <davidlohr@hp.com>
> 
> Callers of cmpxchg_futex_value_locked() can trigger the following:
> 
> kernel/futex.c: In function ‘futex_lock_pi_atomic’:
> kernel/futex.c:725: warning: ‘curval’ may be used uninitialized in this function
> 
> This was initially addressed by commit 7cfdaf38, but others still remain. Silence
> these messages once and for all as the variables really aren't uninitialized.
> 

I've gone after such things myself and seen arguments against this as it
will mask any future such bugs. If, as in this case, the value is indeed
not being used uninitialized, I believe this is considered a compiler
bug. Do we mask compiler bugs in the kernel at the expense of catching
legitimate future bugs? There are certainly several examples of this
method being used throughout the kernel.

Alternatively - does anyone see something Davidlohr or I have missed in
this call chain? It seems to me that the only way for curval not be
initialized is if cmpxchg_futex_value_lock() returns an error, in which
case we return without using curval anyway.

Is there some rule about how to deal with this when inline ASM is
involved? (Although there is an explicit assignment following that ASM).

If not, given existing precedent, and the unlikelihood of this
particular path seeing significant changes, I'm inclined to agree with
the proposed fix:

Acked-by: Darren Hart <dvhart@linux.intel.com>

--
Darren

> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
> ---
>  kernel/futex.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/futex.c b/kernel/futex.c
> index 5b4d09e..65ade8a 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -838,7 +838,7 @@ static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
>  				struct task_struct *task, int set_waiters)
>  {
>  	int lock_taken, ret, force_take = 0;
> -	u32 uval, newval, curval, vpid = task_pid_vnr(task);
> +	u32 uval, newval, uninitialized_var(curval), vpid = task_pid_vnr(task);
>  
>  retry:
>  	ret = lock_taken = 0;
> @@ -2227,7 +2227,7 @@ static int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
>  	struct futex_hash_bucket *hb;
>  	struct futex_q *this, *next;
>  	union futex_key key = FUTEX_KEY_INIT;
> -	u32 uval, vpid = task_pid_vnr(current);
> +	u32 uninitialized_var(uval), vpid = task_pid_vnr(current);
>  	int ret;
>  
>  retry:
> @@ -2843,7 +2843,7 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
>  
>  static int __init futex_init(void)
>  {
> -	u32 curval;
> +	u32 uninitialized_var(curval);
>  	unsigned long i;
>  
>  #if CONFIG_BASE_SMALL

-- 
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 3/4] futex: Document ordering guarantees
  2014-01-02 15:05 ` [PATCH v5 3/4] futex: Document ordering guarantees Davidlohr Bueso
@ 2014-01-06 18:58   ` Darren Hart
  2014-01-11  7:40   ` Paul E. McKenney
  1 sibling, 0 replies; 23+ messages in thread
From: Darren Hart @ 2014-01-06 18:58 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-kernel, mingo, peterz, tglx, paulmck, efault, jeffm,
	torvalds, jason.low2, Waiman.Long, tom.vaden, scott.norton,
	aswin, Randy Dunlap

On Thu, 2014-01-02 at 07:05 -0800, Davidlohr Bueso wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> That's essential, if you want to hack on futexes.
> 
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Darren Hart <dvhart@linux.intel.com>

Reviewed-by: Darren Hart <dvhart@linux.intel.com>

> Acked-by: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Mike Galbraith <efault@gmx.de>
> Cc: Jeff Mahoney <jeffm@suse.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Randy Dunlap <rdunlap@infradead.org>
> Cc: Scott Norton <scott.norton@hp.com>
> Cc: Tom Vaden <tom.vaden@hp.com>
> Cc: Aswin Chandramouleeswaran <aswin@hp.com>
> Cc: Waiman Long <Waiman.Long@hp.com>
> Cc: Jason Low <jason.low2@hp.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
> ---
>  kernel/futex.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 57 insertions(+)
> 
> diff --git a/kernel/futex.c b/kernel/futex.c
> index 577481d..fcc6850 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -69,6 +69,63 @@
>  
>  #include "locking/rtmutex_common.h"
>  
> +/*
> + * Basic futex operation and ordering guarantees:
> + *
> + * The waiter reads the futex value in user space and calls
> + * futex_wait(). This function computes the hash bucket and acquires
> + * the hash bucket lock. After that it reads the futex user space value
> + * again and verifies that the data has not changed. If it has not
> + * changed it enqueues itself into the hash bucket, releases the hash
> + * bucket lock and schedules.
> + *
> + * The waker side modifies the user space value of the futex and calls
> + * futex_wake(). This functions computes the hash bucket and acquires
> + * the hash bucket lock. Then it looks for waiters on that futex in the
> + * hash bucket and wakes them.
> + *
> + * Note that the spin_lock serializes waiters and wakers, so that the
> + * following scenario is avoided:
> + *
> + * CPU 0                               CPU 1
> + * val = *futex;
> + * sys_futex(WAIT, futex, val);
> + *   futex_wait(futex, val);
> + *   uval = *futex;
> + *                                     *futex = newval;
> + *                                     sys_futex(WAKE, futex);
> + *                                       futex_wake(futex);
> + *                                       if (queue_empty())
> + *                                         return;
> + *   if (uval == val)
> + *      lock(hash_bucket(futex));
> + *      queue();
> + *     unlock(hash_bucket(futex));
> + *     schedule();
> + *
> + * This would cause the waiter on CPU 0 to wait forever because it
> + * missed the transition of the user space value from val to newval
> + * and the waker did not find the waiter in the hash bucket queue.
> + * The spinlock serializes that:
> + *
> + * CPU 0                               CPU 1
> + * val = *futex;
> + * sys_futex(WAIT, futex, val);
> + *   futex_wait(futex, val);
> + *   lock(hash_bucket(futex));
> + *   uval = *futex;
> + *                                     *futex = newval;
> + *                                     sys_futex(WAKE, futex);
> + *                                       futex_wake(futex);
> + *                                       lock(hash_bucket(futex));
> + *   if (uval == val)
> + *      queue();
> + *     unlock(hash_bucket(futex));
> + *     schedule();                       if (!queue_empty())
> + *                                         wake_waiters(futex);
> + *                                       unlock(hash_bucket(futex));
> + */
> +
>  int __read_mostly futex_cmpxchg_enabled;
>  
>  /*

-- 
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2014-01-02 15:05 ` [PATCH v5 4/4] futex: Avoid taking hb lock if nothing to wakeup Davidlohr Bueso
  2014-01-02 19:23   ` Linus Torvalds
@ 2014-01-06 20:52   ` Darren Hart
  2014-01-07  3:29     ` Davidlohr Bueso
  2014-01-11  9:49   ` Paul E. McKenney
  2 siblings, 1 reply; 23+ messages in thread
From: Darren Hart @ 2014-01-06 20:52 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-kernel, mingo, peterz, tglx, paulmck, efault, jeffm,
	torvalds, jason.low2, Waiman.Long, tom.vaden, scott.norton,
	aswin

On Thu, 2014-01-02 at 07:05 -0800, Davidlohr Bueso wrote:
> From: Davidlohr Bueso <davidlohr@hp.com>
> 
> In futex_wake() there is clearly no point in taking the hb->lock if we know
> beforehand that there are no tasks to be woken. While the hash bucket's plist
> head is a cheap way of knowing this, we cannot rely 100% on it as there is a
> racy window between the futex_wait call and when the task is actually added to
> the plist. To this end, we couple it with the spinlock check as tasks trying to
> enter the critical region are most likely potential waiters that will be added
> to the plist, thus preventing tasks sleeping forever if wakers don't acknowledge
> all possible waiters.
> 
> Furthermore, the futex ordering guarantees are preserved, ensuring that waiters
> either observe the changed user space value before blocking or is woken by a
> concurrent waker. For wakers, this is done by relying on the barriers in
> get_futex_key_refs() -- for archs that do have implicit mb in atomic_inc() we

do NOT have implicit mb in atomic_inc()
   ^

Sorry to be a pedant, but this is gnarly stuff and we have to get the
documentation right.

> explicitly add them through a new futex_get_mm function. For waiters we rely
> on the fact that spin_lock calls already update the head counter, so spinners
> are visible even if the lock hasn't been acquired yet.
> 
> For more details please refer to the updated comments in the code and related
> discussion: https://lkml.org/lkml/2013/11/26/556
> 
> Special thanks to tglx for careful review and feedback.
> 
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Darren Hart <dvhart@linux.intel.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Mike Galbraith <efault@gmx.de>
> Cc: Jeff Mahoney <jeffm@suse.com>
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Scott Norton <scott.norton@hp.com>
> Cc: Tom Vaden <tom.vaden@hp.com>
> Cc: Aswin Chandramouleeswaran <aswin@hp.com>
> Cc: Waiman Long <Waiman.Long@hp.com>
> Cc: Jason Low <jason.low2@hp.com>
> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
> ---
>  kernel/futex.c | 113 +++++++++++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 90 insertions(+), 23 deletions(-)
> 
> diff --git a/kernel/futex.c b/kernel/futex.c
> index fcc6850..5b4d09e 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -75,17 +75,20 @@
>   * The waiter reads the futex value in user space and calls
>   * futex_wait(). This function computes the hash bucket and acquires
>   * the hash bucket lock. After that it reads the futex user space value
> - * again and verifies that the data has not changed. If it has not
> - * changed it enqueues itself into the hash bucket, releases the hash
> - * bucket lock and schedules.
> + * again and verifies that the data has not changed. If it has not changed
> + * it enqueues itself into the hash bucket, releases the hash bucket lock
> + * and schedules.
>   *
>   * The waker side modifies the user space value of the futex and calls
> - * futex_wake(). This functions computes the hash bucket and acquires
> - * the hash bucket lock. Then it looks for waiters on that futex in the
> - * hash bucket and wakes them.
> + * futex_wake(). This function computes the hash bucket and acquires the
> + * hash bucket lock. Then it looks for waiters on that futex in the hash
> + * bucket and wakes them.
>   *
> - * Note that the spin_lock serializes waiters and wakers, so that the
> - * following scenario is avoided:
> + * In scenarios where wakeups are called and no tasks are blocked on a futex,


"wakeups are called" reads awkwardly to me. Perhaps:

"In futex wake up scenarios where no tasks are blocked on the
futex, ..."


> + * taking the hb spinlock can be avoided and simply return. In order for this
> + * optimization to work, ordering guarantees must exist so that the waiter
> + * being added to the list is acknowledged when the list is concurrently being
> + * checked by the waker, avoiding scenarios like the following:
>   *
>   * CPU 0                               CPU 1
>   * val = *futex;
> @@ -106,24 +109,50 @@
>   * This would cause the waiter on CPU 0 to wait forever because it
>   * missed the transition of the user space value from val to newval
>   * and the waker did not find the waiter in the hash bucket queue.
> - * The spinlock serializes that:
> + *
> + * The correct serialization ensures that a waiter either observes
> + * the changed user space value before blocking or is woken by a
> + * concurrent waker:
>   *
>   * CPU 0                               CPU 1
>   * val = *futex;
>   * sys_futex(WAIT, futex, val);
>   *   futex_wait(futex, val);
> - *   lock(hash_bucket(futex));
> - *   uval = *futex;
> - *                                     *futex = newval;
> - *                                     sys_futex(WAKE, futex);
> - *                                       futex_wake(futex);
> - *                                       lock(hash_bucket(futex));
> + *
> + *   waiters++;
> + *   mb(); (A) <-- paired with -.
> + *                              |
> + *   lock(hash_bucket(futex));  |
> + *                              |
> + *   uval = *futex;             |
> + *                              |        *futex = newval;
> + *                              |        sys_futex(WAKE, futex);
> + *                              |          futex_wake(futex);
> + *                              |
> + *                              `------->   mb(); (B)
>   *   if (uval == val)
> - *      queue();
> + *     queue();
>   *     unlock(hash_bucket(futex));
> - *     schedule();                       if (!queue_empty())
> - *                                         wake_waiters(futex);
> - *                                       unlock(hash_bucket(futex));
> + *     schedule();                         if (waiters)
> + *                                           lock(hash_bucket(futex));
> + *                                           wake_waiters(futex);
> + *                                           unlock(hash_bucket(futex));
> + *
> + * Where (A) orders the waiters increment and the futex value read -- this
> + * is guaranteed by the head counter in the hb spinlock; and where (B)
> + * orders the write to futex and the waiters read.
> + *
> + * This yields the following case (where X:=waiters, Y:=futex):
> + *
> + *	X = Y = 0
> + *
> + *	w[X]=1		w[Y]=1
> + *	MB		MB
> + *	r[Y]=y		r[X]=x
> + *
> + * Which guarantees that x==0 && y==0 is impossible; which translates back into
> + * the guarantee that we cannot both miss the futex variable change and the
> + * enqueue.
>   */
>  
>  int __read_mostly futex_cmpxchg_enabled;
> @@ -211,6 +240,38 @@ static unsigned long __read_mostly futex_hashsize;
>  
>  static struct futex_hash_bucket *futex_queues;
>  
> +static inline void futex_get_mm(union futex_key *key)
> +{
> +	atomic_inc(&key->private.mm->mm_count);
> +#ifdef CONFIG_SMP
> +	/*
> +	 * Ensure futex_get_mm() implies a full barrier such that
> +	 * get_futex_key() implies a full barrier. This is relied upon
> +	 * as full barrier (B), see the ordering comment above.
> +	 */
> +	smp_mb__after_atomic_inc();
> +#endif
> +}
> +
> +static inline bool hb_waiters_pending(struct futex_hash_bucket *hb)
> +{
> +#ifdef CONFIG_SMP
> +	/*
> +	 * Tasks trying to enter the critical region are most likely
> +	 * potential waiters that will be added to the plist. Ensure
> +	 * that wakers won't miss to-be-slept tasks in the window between
> +	 * the wait call and the actual plist_add.
> +	 */
> +	if (spin_is_locked(&hb->lock))
> +		return true;
> +	smp_rmb(); /* Make sure we check the lock state first */
> +
> +	return !plist_head_empty(&hb->chain);
> +#else
> +	return true;
> +#endif
> +}


I thought someone, Peter Z?, had commented on these CONFIG_SMP bits. Are
they really necessary? Does smp_mb__after_atomic_inc() and smp_rmb() not
already just do the right thing as far as we're concerned here?


> +
>  /*
>   * We hash on the keys returned from get_futex_key (see below).
>   */
> @@ -245,10 +306,10 @@ static void get_futex_key_refs(union futex_key *key)
>  
>  	switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
>  	case FUT_OFF_INODE:
> -		ihold(key->shared.inode);
> +		ihold(key->shared.inode); /* implies MB (B) */
>  		break;
>  	case FUT_OFF_MMSHARED:
> -		atomic_inc(&key->private.mm->mm_count);
> +		futex_get_mm(key); /* implies MB (B) */
>  		break;
>  	}
>  }
> @@ -322,7 +383,7 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
>  	if (!fshared) {
>  		key->private.mm = mm;
>  		key->private.address = address;
> -		get_futex_key_refs(key);
> +		get_futex_key_refs(key);  /* implies MB (B) */
>  		return 0;
>  	}
>  
> @@ -1052,6 +1113,11 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
>  		goto out;
>  


Given the subtlety of the implementation - I think it would be good to
explicitly annotate the get_futex_key() call site in futex_wake() as
providing the MB (B). 

Similar comment for futex_wait() and futex_requeue() for MB (A).

These will also raise the appropriate red flags for people looking to
optimize or modify these paths in the future. It would be good to have
it in the top level futex_* function to make the MB placement and
relationship explicitly clear.


>  	hb = hash_futex(&key);
> +
> +	/* Make sure we really have tasks to wakeup */
> +	if (!hb_waiters_pending(hb))
> +		goto out_put_key;
> +
>  	spin_lock(&hb->lock);
>  
>  	plist_for_each_entry_safe(this, next, &hb->chain, list) {
> @@ -1072,6 +1138,7 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
>  	}
>  
>  	spin_unlock(&hb->lock);
> +out_put_key:
>  	put_futex_key(&key);
>  out:
>  	return ret;
> @@ -1535,7 +1602,7 @@ static inline struct futex_hash_bucket *queue_lock(struct futex_q *q)
>  	hb = hash_futex(&q->key);
>  	q->lock_ptr = &hb->lock;
>  
> -	spin_lock(&hb->lock);
> +	spin_lock(&hb->lock); /* implies MB (A) */
>  	return hb;
>  }
>  

Functionally, this looks correct to me and Davidlohr's testing has been
well documented.

Reviewed-by: Darren Hart <dvhart@linux.intel.com>

-- 
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2014-01-02 20:59     ` Davidlohr Bueso
@ 2014-01-06 20:56       ` Darren Hart
  0 siblings, 0 replies; 23+ messages in thread
From: Darren Hart @ 2014-01-06 20:56 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Linus Torvalds, Linux Kernel Mailing List, Ingo Molnar,
	Peter Zijlstra, Thomas Gleixner, Paul McKenney, Mike Galbraith,
	Jeff Mahoney, Jason Low, Waiman Long, Tom Vaden, Norton, Scott J,
	Chandramouleeswaran, Aswin

On Thu, 2014-01-02 at 12:59 -0800, Davidlohr Bueso wrote:
> On Thu, 2014-01-02 at 11:23 -0800, Linus Torvalds wrote:
> > On Thu, Jan 2, 2014 at 7:05 AM, Davidlohr Bueso <davidlohr@hp.com> wrote:
> > >
> > > In futex_wake() there is clearly no point in taking the hb->lock if we know
> > > beforehand that there are no tasks to be woken.
> > 
> > Btw, I think we could optimize this a bit further for the wakeup case.
> > 
> > wake_futex() does a get_task_struct(p)/put_task_struct(p) around its
> > actual waking logic, and I don't think that's necessary. The task
> > structures are RCU-delayed, and the task cannot go away until the
> > "q->lock_ptr = NULL" afaik, so you could replace that atomic inc/dec
> > with just a RCU read region.
> 
> I had originally explored making the whole plist thing more rcu aware
> but never got to anything worth sharing. What you say does make a lot of
> sense, however, I haven't been able to see any actual improvements. It
> doesn't hurt however, so I'd have no problem adding such patch to the
> lot.
> 
> > 
> > Maybe it's not a big deal ("wake_up_state()" ends up getting the task
> > struct pi_lock anyway, so it's not like we can avoid toucing the task
> > structure), but I'm getting the feeling that we're doing a lot of
> > unnecessary work here.
> 
> I passed this idea through my wakeup measuring program and didn't notice
> hardly any difference, just noise, even for large amounts of futexes.
> I believe that peterz's idea of lockless batch wakeups is the next step
> worth looking into for futexes -- even though the spurious wakeup
> problem can become a real pain.
> 
> Thanks,
> Davidlohr
> 
> 

While I love to see significant performance improvements to the futex
hot paths, I am wary of the sort of implicit improvements we've been
exploring here. At the risk of being a wimp here, this code is
incredibly complex already, so I would prefer anything along these lines
have very strong empirical justification first - as Davidlohr's changes
here have provided.

Does anyone see any reason to hold off getting them in at this point?
I've made a couple points on comments and docs to the 4/5 patch, but
otherwise, I think it's time to get them in and more broadly tested.

-- 
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 5/4] futex: silence uninitialized warnings
  2014-01-06  1:38 ` [PATCH 5/4] futex: silence uninitialized warnings Davidlohr Bueso
  2014-01-06 18:48   ` Darren Hart
@ 2014-01-07  2:55   ` Linus Torvalds
  2014-01-07  3:02     ` Davidlohr Bueso
  1 sibling, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2014-01-07  2:55 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Linux Kernel Mailing List, Ingo Molnar, Darren Hart,
	Peter Zijlstra, Thomas Gleixner, Paul McKenney, Mike Galbraith,
	Jeff Mahoney, Jason Low, Waiman Long, Tom Vaden, Norton, Scott J,
	Chandramouleeswaran, Aswin

On Mon, Jan 6, 2014 at 9:38 AM, Davidlohr Bueso <davidlohr@hp.com> wrote:
>  {
>         int lock_taken, ret, force_take = 0;
> -       u32 uval, newval, curval, vpid = task_pid_vnr(task);
> +       u32 uval, newval, uninitialized_var(curval), vpid = task_pid_vnr(task);

Do you have some broken compiler? I really tend to hate this kind of
workarounds, because as mentioned, they can actually hide valid
warnings, and it seems to be due to just stupid compilers. Are we
perhaps better off trying to get people off the broken compiler
versions instead?

              Linus

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 5/4] futex: silence uninitialized warnings
  2014-01-07  2:55   ` Linus Torvalds
@ 2014-01-07  3:02     ` Davidlohr Bueso
  0 siblings, 0 replies; 23+ messages in thread
From: Davidlohr Bueso @ 2014-01-07  3:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Ingo Molnar, Darren Hart,
	Peter Zijlstra, Thomas Gleixner, Paul McKenney, Mike Galbraith,
	Jeff Mahoney, Jason Low, Waiman Long, Tom Vaden, Norton, Scott J,
	Chandramouleeswaran, Aswin

On Tue, 2014-01-07 at 10:55 +0800, Linus Torvalds wrote:
> On Mon, Jan 6, 2014 at 9:38 AM, Davidlohr Bueso <davidlohr@hp.com> wrote:
> >  {
> >         int lock_taken, ret, force_take = 0;
> > -       u32 uval, newval, curval, vpid = task_pid_vnr(task);
> > +       u32 uval, newval, uninitialized_var(curval), vpid = task_pid_vnr(task);
> 
> Do you have some broken compiler? 

I only notice this when testing this patchset on our servers with RHEL
6.4 (gcc 4.4.7 20120313 (Red Hat 4.4.7-3)).

> I really tend to hate this kind of
> workarounds, because as mentioned, they can actually hide valid
> warnings, and it seems to be due to just stupid compilers. Are we
> perhaps better off trying to get people off the broken compiler
> versions instead?

As Darren points out, this path is unlikely to change, but I have no
particular preference otherwise.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2014-01-06 20:52   ` Darren Hart
@ 2014-01-07  3:29     ` Davidlohr Bueso
  2014-01-07 17:40       ` Darren Hart
  0 siblings, 1 reply; 23+ messages in thread
From: Davidlohr Bueso @ 2014-01-07  3:29 UTC (permalink / raw)
  To: Darren Hart
  Cc: linux-kernel, mingo, peterz, tglx, paulmck, efault, jeffm,
	torvalds, jason.low2, Waiman.Long, tom.vaden, scott.norton,
	aswin

On Mon, 2014-01-06 at 12:52 -0800, Darren Hart wrote:
> On Thu, 2014-01-02 at 07:05 -0800, Davidlohr Bueso wrote:
> > From: Davidlohr Bueso <davidlohr@hp.com>
> > 
> > In futex_wake() there is clearly no point in taking the hb->lock if we know
> > beforehand that there are no tasks to be woken. While the hash bucket's plist
> > head is a cheap way of knowing this, we cannot rely 100% on it as there is a
> > racy window between the futex_wait call and when the task is actually added to
> > the plist. To this end, we couple it with the spinlock check as tasks trying to
> > enter the critical region are most likely potential waiters that will be added
> > to the plist, thus preventing tasks sleeping forever if wakers don't acknowledge
> > all possible waiters.
> > 
> > Furthermore, the futex ordering guarantees are preserved, ensuring that waiters
> > either observe the changed user space value before blocking or is woken by a
> > concurrent waker. For wakers, this is done by relying on the barriers in
> > get_futex_key_refs() -- for archs that do have implicit mb in atomic_inc() we
> 
> do NOT have implicit mb in atomic_inc()
>    ^

oh, yes!

> 
> Sorry to be a pedant, but this is gnarly stuff and we have to get the
> documentation right.
> 

Absolutely!

> > explicitly add them through a new futex_get_mm function. For waiters we rely
> > on the fact that spin_lock calls already update the head counter, so spinners
> > are visible even if the lock hasn't been acquired yet.
> > 
> > For more details please refer to the updated comments in the code and related
> > discussion: https://lkml.org/lkml/2013/11/26/556
> > 
> > Special thanks to tglx for careful review and feedback.
> > 
> > Cc: Ingo Molnar <mingo@kernel.org>
> > Cc: Darren Hart <dvhart@linux.intel.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > Cc: Mike Galbraith <efault@gmx.de>
> > Cc: Jeff Mahoney <jeffm@suse.com>
> > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> > Cc: Scott Norton <scott.norton@hp.com>
> > Cc: Tom Vaden <tom.vaden@hp.com>
> > Cc: Aswin Chandramouleeswaran <aswin@hp.com>
> > Cc: Waiman Long <Waiman.Long@hp.com>
> > Cc: Jason Low <jason.low2@hp.com>
> > Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
> > ---
> >  kernel/futex.c | 113 +++++++++++++++++++++++++++++++++++++++++++++------------
> >  1 file changed, 90 insertions(+), 23 deletions(-)
> > 
> > diff --git a/kernel/futex.c b/kernel/futex.c
> > index fcc6850..5b4d09e 100644
> > --- a/kernel/futex.c
> > +++ b/kernel/futex.c
> > @@ -75,17 +75,20 @@
> >   * The waiter reads the futex value in user space and calls
> >   * futex_wait(). This function computes the hash bucket and acquires
> >   * the hash bucket lock. After that it reads the futex user space value
> > - * again and verifies that the data has not changed. If it has not
> > - * changed it enqueues itself into the hash bucket, releases the hash
> > - * bucket lock and schedules.
> > + * again and verifies that the data has not changed. If it has not changed
> > + * it enqueues itself into the hash bucket, releases the hash bucket lock
> > + * and schedules.
> >   *
> >   * The waker side modifies the user space value of the futex and calls
> > - * futex_wake(). This functions computes the hash bucket and acquires
> > - * the hash bucket lock. Then it looks for waiters on that futex in the
> > - * hash bucket and wakes them.
> > + * futex_wake(). This function computes the hash bucket and acquires the
> > + * hash bucket lock. Then it looks for waiters on that futex in the hash
> > + * bucket and wakes them.
> >   *
> > - * Note that the spin_lock serializes waiters and wakers, so that the
> > - * following scenario is avoided:
> > + * In scenarios where wakeups are called and no tasks are blocked on a futex,
> 
> 
> "wakeups are called" reads awkwardly to me. Perhaps:
> 
> "In futex wake up scenarios where no tasks are blocked on the
> futex, ..."
> 

I have no particular preference, so I'll update it.

> 
> > + * taking the hb spinlock can be avoided and simply return. In order for this
> > + * optimization to work, ordering guarantees must exist so that the waiter
> > + * being added to the list is acknowledged when the list is concurrently being
> > + * checked by the waker, avoiding scenarios like the following:
> >   *
> >   * CPU 0                               CPU 1
> >   * val = *futex;
> > @@ -106,24 +109,50 @@
> >   * This would cause the waiter on CPU 0 to wait forever because it
> >   * missed the transition of the user space value from val to newval
> >   * and the waker did not find the waiter in the hash bucket queue.
> > - * The spinlock serializes that:
> > + *
> > + * The correct serialization ensures that a waiter either observes
> > + * the changed user space value before blocking or is woken by a
> > + * concurrent waker:
> >   *
> >   * CPU 0                               CPU 1
> >   * val = *futex;
> >   * sys_futex(WAIT, futex, val);
> >   *   futex_wait(futex, val);
> > - *   lock(hash_bucket(futex));
> > - *   uval = *futex;
> > - *                                     *futex = newval;
> > - *                                     sys_futex(WAKE, futex);
> > - *                                       futex_wake(futex);
> > - *                                       lock(hash_bucket(futex));
> > + *
> > + *   waiters++;
> > + *   mb(); (A) <-- paired with -.
> > + *                              |
> > + *   lock(hash_bucket(futex));  |
> > + *                              |
> > + *   uval = *futex;             |
> > + *                              |        *futex = newval;
> > + *                              |        sys_futex(WAKE, futex);
> > + *                              |          futex_wake(futex);
> > + *                              |
> > + *                              `------->   mb(); (B)
> >   *   if (uval == val)
> > - *      queue();
> > + *     queue();
> >   *     unlock(hash_bucket(futex));
> > - *     schedule();                       if (!queue_empty())
> > - *                                         wake_waiters(futex);
> > - *                                       unlock(hash_bucket(futex));
> > + *     schedule();                         if (waiters)
> > + *                                           lock(hash_bucket(futex));
> > + *                                           wake_waiters(futex);
> > + *                                           unlock(hash_bucket(futex));
> > + *
> > + * Where (A) orders the waiters increment and the futex value read -- this
> > + * is guaranteed by the head counter in the hb spinlock; and where (B)
> > + * orders the write to futex and the waiters read.
> > + *
> > + * This yields the following case (where X:=waiters, Y:=futex):
> > + *
> > + *	X = Y = 0
> > + *
> > + *	w[X]=1		w[Y]=1
> > + *	MB		MB
> > + *	r[Y]=y		r[X]=x
> > + *
> > + * Which guarantees that x==0 && y==0 is impossible; which translates back into
> > + * the guarantee that we cannot both miss the futex variable change and the
> > + * enqueue.
> >   */
> >  
> >  int __read_mostly futex_cmpxchg_enabled;
> > @@ -211,6 +240,38 @@ static unsigned long __read_mostly futex_hashsize;
> >  
> >  static struct futex_hash_bucket *futex_queues;
> >  
> > +static inline void futex_get_mm(union futex_key *key)
> > +{
> > +	atomic_inc(&key->private.mm->mm_count);
> > +#ifdef CONFIG_SMP
> > +	/*
> > +	 * Ensure futex_get_mm() implies a full barrier such that
> > +	 * get_futex_key() implies a full barrier. This is relied upon
> > +	 * as full barrier (B), see the ordering comment above.
> > +	 */
> > +	smp_mb__after_atomic_inc();
> > +#endif
> > +}
> > +
> > +static inline bool hb_waiters_pending(struct futex_hash_bucket *hb)
> > +{
> > +#ifdef CONFIG_SMP
> > +	/*
> > +	 * Tasks trying to enter the critical region are most likely
> > +	 * potential waiters that will be added to the plist. Ensure
> > +	 * that wakers won't miss to-be-slept tasks in the window between
> > +	 * the wait call and the actual plist_add.
> > +	 */
> > +	if (spin_is_locked(&hb->lock))
> > +		return true;
> > +	smp_rmb(); /* Make sure we check the lock state first */
> > +
> > +	return !plist_head_empty(&hb->chain);
> > +#else
> > +	return true;
> > +#endif
> > +}
> 
> 
> I thought someone, Peter Z?, had commented on these CONFIG_SMP bits. Are
> they really necessary? Does smp_mb__after_atomic_inc() and smp_rmb() not
> already just do the right thing as far as we're concerned here?

I don't think so. Thomas and I agreed that this was in fact the way to
go. I rechecked old email and didn't notice any objections to
CONFIG_SMP. Also for things like hb_waiters_pending we definitely need
it.

> > +
> >  /*
> >   * We hash on the keys returned from get_futex_key (see below).
> >   */
> > @@ -245,10 +306,10 @@ static void get_futex_key_refs(union futex_key *key)
> >  
> >  	switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
> >  	case FUT_OFF_INODE:
> > -		ihold(key->shared.inode);
> > +		ihold(key->shared.inode); /* implies MB (B) */
> >  		break;
> >  	case FUT_OFF_MMSHARED:
> > -		atomic_inc(&key->private.mm->mm_count);
> > +		futex_get_mm(key); /* implies MB (B) */
> >  		break;
> >  	}
> >  }
> > @@ -322,7 +383,7 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
> >  	if (!fshared) {
> >  		key->private.mm = mm;
> >  		key->private.address = address;
> > -		get_futex_key_refs(key);
> > +		get_futex_key_refs(key);  /* implies MB (B) */
> >  		return 0;
> >  	}
> >  
> > @@ -1052,6 +1113,11 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
> >  		goto out;
> >  
> 
> 
> Given the subtlety of the implementation - I think it would be good to
> explicitly annotate the get_futex_key() call site in futex_wake() as
> providing the MB (B). 
> 
> Similar comment for futex_wait() and futex_requeue() for MB (A).
> 
> These will also raise the appropriate red flags for people looking to
> optimize or modify these paths in the future. It would be good to have
> it in the top level futex_* function to make the MB placement and
> relationship explicitly clear.
> 

Something quite similar was already there for v2 but PeterZ's feedback
made me update the main documentation at the top of futex.c to as it is
now...

> 
> >  	hb = hash_futex(&key);
> > +
> > +	/* Make sure we really have tasks to wakeup */
> > +	if (!hb_waiters_pending(hb))
> > +		goto out_put_key;
> > +
> >  	spin_lock(&hb->lock);
> >  
> >  	plist_for_each_entry_safe(this, next, &hb->chain, list) {
> > @@ -1072,6 +1138,7 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
> >  	}
> >  
> >  	spin_unlock(&hb->lock);
> > +out_put_key:
> >  	put_futex_key(&key);
> >  out:
> >  	return ret;
> > @@ -1535,7 +1602,7 @@ static inline struct futex_hash_bucket *queue_lock(struct futex_q *q)
> >  	hb = hash_futex(&q->key);
> >  	q->lock_ptr = &hb->lock;
> >  
> > -	spin_lock(&hb->lock);
> > +	spin_lock(&hb->lock); /* implies MB (A) */
> >  	return hb;
> >  }
> >  
> 
> Functionally, this looks correct to me and Davidlohr's testing has been
> well documented.
> 
> Reviewed-by: Darren Hart <dvhart@linux.intel.com>

Thanks for looking into this Darren!


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2014-01-07  3:29     ` Davidlohr Bueso
@ 2014-01-07 17:40       ` Darren Hart
  0 siblings, 0 replies; 23+ messages in thread
From: Darren Hart @ 2014-01-07 17:40 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-kernel, mingo, peterz, tglx, paulmck, efault, jeffm,
	torvalds, jason.low2, Waiman.Long, tom.vaden, scott.norton,
	aswin

On Mon, 2014-01-06 at 19:29 -0800, Davidlohr Bueso wrote:
> On Mon, 2014-01-06 at 12:52 -0800, Darren Hart wrote:
> > On Thu, 2014-01-02 at 07:05 -0800, Davidlohr Bueso wrote:

 
> > I thought someone, Peter Z?, had commented on these CONFIG_SMP bits. Are
> > they really necessary? Does smp_mb__after_atomic_inc() and smp_rmb() not
> > already just do the right thing as far as we're concerned here?
> 
> I don't think so. Thomas and I agreed that this was in fact the way to
> go. I rechecked old email and didn't notice any objections to
> CONFIG_SMP. Also for things like hb_waiters_pending we definitely need
> it.

I'll happily defer to Thomas here.
 
> > 
> > Given the subtlety of the implementation - I think it would be good to
> > explicitly annotate the get_futex_key() call site in futex_wake() as
> > providing the MB (B). 
> > 
> > Similar comment for futex_wait() and futex_requeue() for MB (A).
> > 
> > These will also raise the appropriate red flags for people looking to
> > optimize or modify these paths in the future. It would be good to have
> > it in the top level futex_* function to make the MB placement and
> > relationship explicitly clear.
> > 
> 
> Something quite similar was already there for v2 but PeterZ's feedback
> made me update the main documentation at the top of futex.c to as it is
> now...

I don't want to block this any longer - but as complicated and
non-obvious as this is, I would *MUCH* prefer we document the memory
barrier point in the top level algorithm. If Peter/Thomas/Linus/Ingo
object, so be it, but otherwise let's err on the side of overly explicit
documentation.

Peter/Thomas/Linus/Ingo: Do any of you object to adding the three memory
barrier comments to the high level functions? futex_wait, futex_wake,
futex_requeue?

-- 
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 1/4] futex: Misc cleanups
  2014-01-02 15:05 ` [PATCH v5 1/4] futex: Misc cleanups Davidlohr Bueso
@ 2014-01-11  6:43   ` Paul E. McKenney
  0 siblings, 0 replies; 23+ messages in thread
From: Paul E. McKenney @ 2014-01-11  6:43 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-kernel, mingo, dvhart, peterz, tglx, efault, jeffm,
	torvalds, jason.low2, Waiman.Long, tom.vaden, scott.norton,
	aswin

On Thu, Jan 02, 2014 at 07:05:17AM -0800, Davidlohr Bueso wrote:
> From: Jason Low <jason.low2@hp.com>
> 
> - Remove unnecessary head variables.
> - Delete unused parameter in queue_unlock().
> 
> Cc: Ingo Molnar <mingo@kernel.org>
> Reviewed-by: Darren Hart <dvhart@linux.intel.com>
> Acked-by: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Mike Galbraith <efault@gmx.de>
> Cc: Jeff Mahoney <jeffm@suse.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Scott Norton <scott.norton@hp.com>
> Cc: Tom Vaden <tom.vaden@hp.com>
> Cc: Aswin Chandramouleeswaran <aswin@hp.com>
> Cc: Waiman Long <Waiman.Long@hp.com>
> Signed-off-by: Jason Low <jason.low2@hp.com>
> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

> ---
>  kernel/futex.c | 39 ++++++++++++---------------------------
>  1 file changed, 12 insertions(+), 27 deletions(-)
> 
> diff --git a/kernel/futex.c b/kernel/futex.c
> index f6ff019..085f5fa 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -598,13 +598,10 @@ lookup_pi_state(u32 uval, struct futex_hash_bucket *hb,
>  {
>  	struct futex_pi_state *pi_state = NULL;
>  	struct futex_q *this, *next;
> -	struct plist_head *head;
>  	struct task_struct *p;
>  	pid_t pid = uval & FUTEX_TID_MASK;
> 
> -	head = &hb->chain;
> -
> -	plist_for_each_entry_safe(this, next, head, list) {
> +	plist_for_each_entry_safe(this, next, &hb->chain, list) {
>  		if (match_futex(&this->key, key)) {
>  			/*
>  			 * Another waiter already exists - bump up
> @@ -986,7 +983,6 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
>  {
>  	struct futex_hash_bucket *hb;
>  	struct futex_q *this, *next;
> -	struct plist_head *head;
>  	union futex_key key = FUTEX_KEY_INIT;
>  	int ret;
> 
> @@ -999,9 +995,8 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
> 
>  	hb = hash_futex(&key);
>  	spin_lock(&hb->lock);
> -	head = &hb->chain;
> 
> -	plist_for_each_entry_safe(this, next, head, list) {
> +	plist_for_each_entry_safe(this, next, &hb->chain, list) {
>  		if (match_futex (&this->key, &key)) {
>  			if (this->pi_state || this->rt_waiter) {
>  				ret = -EINVAL;
> @@ -1034,7 +1029,6 @@ futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
>  {
>  	union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT;
>  	struct futex_hash_bucket *hb1, *hb2;
> -	struct plist_head *head;
>  	struct futex_q *this, *next;
>  	int ret, op_ret;
> 
> @@ -1082,9 +1076,7 @@ retry_private:
>  		goto retry;
>  	}
> 
> -	head = &hb1->chain;
> -
> -	plist_for_each_entry_safe(this, next, head, list) {
> +	plist_for_each_entry_safe(this, next, &hb1->chain, list) {
>  		if (match_futex (&this->key, &key1)) {
>  			if (this->pi_state || this->rt_waiter) {
>  				ret = -EINVAL;
> @@ -1097,10 +1089,8 @@ retry_private:
>  	}
> 
>  	if (op_ret > 0) {
> -		head = &hb2->chain;
> -
>  		op_ret = 0;
> -		plist_for_each_entry_safe(this, next, head, list) {
> +		plist_for_each_entry_safe(this, next, &hb2->chain, list) {
>  			if (match_futex (&this->key, &key2)) {
>  				if (this->pi_state || this->rt_waiter) {
>  					ret = -EINVAL;
> @@ -1270,7 +1260,6 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags,
>  	int drop_count = 0, task_count = 0, ret;
>  	struct futex_pi_state *pi_state = NULL;
>  	struct futex_hash_bucket *hb1, *hb2;
> -	struct plist_head *head1;
>  	struct futex_q *this, *next;
>  	u32 curval2;
> 
> @@ -1393,8 +1382,7 @@ retry_private:
>  		}
>  	}
> 
> -	head1 = &hb1->chain;
> -	plist_for_each_entry_safe(this, next, head1, list) {
> +	plist_for_each_entry_safe(this, next, &hb1->chain, list) {
>  		if (task_count - nr_wake >= nr_requeue)
>  			break;
> 
> @@ -1494,7 +1482,7 @@ static inline struct futex_hash_bucket *queue_lock(struct futex_q *q)
>  }
> 
>  static inline void
> -queue_unlock(struct futex_q *q, struct futex_hash_bucket *hb)
> +queue_unlock(struct futex_hash_bucket *hb)
>  	__releases(&hb->lock)
>  {
>  	spin_unlock(&hb->lock);
> @@ -1867,7 +1855,7 @@ retry_private:
>  	ret = get_futex_value_locked(&uval, uaddr);
> 
>  	if (ret) {
> -		queue_unlock(q, *hb);
> +		queue_unlock(*hb);
> 
>  		ret = get_user(uval, uaddr);
>  		if (ret)
> @@ -1881,7 +1869,7 @@ retry_private:
>  	}
> 
>  	if (uval != val) {
> -		queue_unlock(q, *hb);
> +		queue_unlock(*hb);
>  		ret = -EWOULDBLOCK;
>  	}
> 
> @@ -2029,7 +2017,7 @@ retry_private:
>  			 * Task is exiting and we just wait for the
>  			 * exit to complete.
>  			 */
> -			queue_unlock(&q, hb);
> +			queue_unlock(hb);
>  			put_futex_key(&q.key);
>  			cond_resched();
>  			goto retry;
> @@ -2081,7 +2069,7 @@ retry_private:
>  	goto out_put_key;
> 
>  out_unlock_put_key:
> -	queue_unlock(&q, hb);
> +	queue_unlock(hb);
> 
>  out_put_key:
>  	put_futex_key(&q.key);
> @@ -2091,7 +2079,7 @@ out:
>  	return ret != -EINTR ? ret : -ERESTARTNOINTR;
> 
>  uaddr_faulted:
> -	queue_unlock(&q, hb);
> +	queue_unlock(hb);
> 
>  	ret = fault_in_user_writeable(uaddr);
>  	if (ret)
> @@ -2113,7 +2101,6 @@ static int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
>  {
>  	struct futex_hash_bucket *hb;
>  	struct futex_q *this, *next;
> -	struct plist_head *head;
>  	union futex_key key = FUTEX_KEY_INIT;
>  	u32 uval, vpid = task_pid_vnr(current);
>  	int ret;
> @@ -2153,9 +2140,7 @@ retry:
>  	 * Ok, other tasks may need to be woken up - check waiters
>  	 * and do the wakeup if necessary:
>  	 */
> -	head = &hb->chain;
> -
> -	plist_for_each_entry_safe(this, next, head, list) {
> +	plist_for_each_entry_safe(this, next, &hb->chain, list) {
>  		if (!match_futex (&this->key, &key))
>  			continue;
>  		ret = wake_futex_pi(uaddr, uval, this);
> -- 
> 1.8.1.4
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 2/4] futex: Larger hash table
  2014-01-02 15:05 ` [PATCH v5 2/4] futex: Larger hash table Davidlohr Bueso
@ 2014-01-11  7:37   ` Paul E. McKenney
  0 siblings, 0 replies; 23+ messages in thread
From: Paul E. McKenney @ 2014-01-11  7:37 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-kernel, mingo, dvhart, peterz, tglx, efault, jeffm,
	torvalds, jason.low2, Waiman.Long, tom.vaden, scott.norton,
	aswin

On Thu, Jan 02, 2014 at 07:05:18AM -0800, Davidlohr Bueso wrote:
> From: Davidlohr Bueso <davidlohr@hp.com>
> 
> Currently, the futex global hash table suffers from it's fixed, smallish
> (for today's standards) size of 256 entries, as well as its lack of NUMA
> awareness. Large systems, using many futexes, can be prone to high amounts
> of collisions; where these futexes hash to the same bucket and lead to
> extra contention on the same hb->lock. Furthermore, cacheline bouncing is a
> reality when we have multiple hb->locks residing on the same cacheline and
> different futexes hash to adjacent buckets.
> 
> This patch keeps the current static size of 16 entries for small systems,
> or otherwise, 256 * ncpus (or larger as we need to round the number to a
> power of 2). Note that this number of CPUs accounts for all CPUs that can
> ever be available in the system, taking into consideration things like
> hotpluging. While we do impose extra overhead at bootup by making the hash
> table larger, this is a one time thing, and does not shadow the benefits
> of this patch.
> 
> Furthermore, as suggested by tglx, by cache aligning the hash buckets we can
> avoid access across cacheline boundaries and also avoid massive cache line
> bouncing if multiple cpus are hammering away at different hash buckets which
> happen to reside in the same cache line.
> 
> Also, similar to other core kernel components (pid, dcache, tcp), by using
> alloc_large_system_hash() we benefit from its NUMA awareness and thus the
> table is distributed among the nodes instead of in a single one.
> 
> For a custom microbenchmark that pounds on the uaddr hashing -- making the wait
> path fail at futex_wait_setup() returning -EWOULDBLOCK for large amounts of
> futexes, we can see the following benefits on a 80-core, 8-socket 1Tb server:
> 
> +---------+--------------------+------------------------+-----------------------+-------------------------------+
> | threads | baseline (ops/sec) | aligned-only (ops/sec) | large table (ops/sec) | large table+aligned (ops/sec) |
> +---------+--------------------+------------------------+-----------------------+-------------------------------+
> |     512 |		 32426 | 50531  (+55.8%)	| 255274  (+687.2%)	| 292553  (+802.2%)		|
> |     256 |		 65360 | 99588  (+52.3%)	| 443563  (+578.6%)	| 508088  (+677.3%)		|
> |     128 |		125635 | 200075 (+59.2%)	| 742613  (+491.1%)	| 835452  (+564.9%)		|
> |      80 |		193559 | 323425 (+67.1%)	| 1028147 (+431.1%)	| 1130304 (+483.9%)		|
> |      64 |		247667 | 443740 (+79.1%)	| 997300  (+302.6%)	| 1145494 (+362.5%)		|
> |      32 |		628412 | 721401 (+14.7%)	| 965996  (+53.7%)	| 1122115 (+78.5%)		|
> +---------+--------------------+------------------------+-----------------------+-------------------------------+
> 
> Cc: Ingo Molnar <mingo@kernel.org>
> Reviewed-by: Darren Hart <dvhart@linux.intel.com>
> Acked-by: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Mike Galbraith <efault@gmx.de>
> Cc: Jeff Mahoney <jeffm@suse.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Scott Norton <scott.norton@hp.com>
> Cc: Tom Vaden <tom.vaden@hp.com>
> Cc: Aswin Chandramouleeswaran <aswin@hp.com>
> Reviewed-by: Waiman Long <Waiman.Long@hp.com>
> Reviewed-and-tested-by: Jason Low <jason.low2@hp.com>
> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

> ---
>  kernel/futex.c | 26 +++++++++++++++++++-------
>  1 file changed, 19 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/futex.c b/kernel/futex.c
> index 085f5fa..577481d 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -63,6 +63,7 @@
>  #include <linux/sched/rt.h>
>  #include <linux/hugetlb.h>
>  #include <linux/freezer.h>
> +#include <linux/bootmem.h>
> 
>  #include <asm/futex.h>
> 
> @@ -70,8 +71,6 @@
> 
>  int __read_mostly futex_cmpxchg_enabled;
> 
> -#define FUTEX_HASHBITS (CONFIG_BASE_SMALL ? 4 : 8)
> -
>  /*
>   * Futex flags used to encode options to functions and preserve them across
>   * restarts.
> @@ -149,9 +148,11 @@ static const struct futex_q futex_q_init = {
>  struct futex_hash_bucket {
>  	spinlock_t lock;
>  	struct plist_head chain;
> -};
> +} ____cacheline_aligned_in_smp;
> 
> -static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS];
> +static unsigned long __read_mostly futex_hashsize;
> +
> +static struct futex_hash_bucket *futex_queues;
> 
>  /*
>   * We hash on the keys returned from get_futex_key (see below).
> @@ -161,7 +162,7 @@ static struct futex_hash_bucket *hash_futex(union futex_key *key)
>  	u32 hash = jhash2((u32*)&key->both.word,
>  			  (sizeof(key->both.word)+sizeof(key->both.ptr))/4,
>  			  key->both.offset);
> -	return &futex_queues[hash & ((1 << FUTEX_HASHBITS)-1)];
> +	return &futex_queues[hash & (futex_hashsize - 1)];
>  }
> 
>  /*
> @@ -2719,7 +2720,18 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
>  static int __init futex_init(void)
>  {
>  	u32 curval;
> -	int i;
> +	unsigned long i;
> +
> +#if CONFIG_BASE_SMALL
> +	futex_hashsize = 16;
> +#else
> +	futex_hashsize = roundup_pow_of_two(256 * num_possible_cpus());
> +#endif
> +
> +	futex_queues = alloc_large_system_hash("futex", sizeof(*futex_queues),
> +					       futex_hashsize, 0,
> +					       futex_hashsize < 256 ? HASH_SMALL : 0,
> +					       NULL, NULL, futex_hashsize, futex_hashsize);
> 
>  	/*
>  	 * This will fail and we want it. Some arch implementations do
> @@ -2734,7 +2746,7 @@ static int __init futex_init(void)
>  	if (cmpxchg_futex_value_locked(&curval, NULL, 0, 0) == -EFAULT)
>  		futex_cmpxchg_enabled = 1;
> 
> -	for (i = 0; i < ARRAY_SIZE(futex_queues); i++) {
> +	for (i = 0; i < futex_hashsize; i++) {
>  		plist_head_init(&futex_queues[i].chain);
>  		spin_lock_init(&futex_queues[i].lock);
>  	}
> -- 
> 1.8.1.4
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 3/4] futex: Document ordering guarantees
  2014-01-02 15:05 ` [PATCH v5 3/4] futex: Document ordering guarantees Davidlohr Bueso
  2014-01-06 18:58   ` Darren Hart
@ 2014-01-11  7:40   ` Paul E. McKenney
  1 sibling, 0 replies; 23+ messages in thread
From: Paul E. McKenney @ 2014-01-11  7:40 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-kernel, mingo, dvhart, peterz, tglx, efault, jeffm,
	torvalds, jason.low2, Waiman.Long, tom.vaden, scott.norton,
	aswin, Randy Dunlap

On Thu, Jan 02, 2014 at 07:05:19AM -0800, Davidlohr Bueso wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> That's essential, if you want to hack on futexes.
> 
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Darren Hart <dvhart@linux.intel.com>
> Acked-by: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Mike Galbraith <efault@gmx.de>
> Cc: Jeff Mahoney <jeffm@suse.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Randy Dunlap <rdunlap@infradead.org>
> Cc: Scott Norton <scott.norton@hp.com>
> Cc: Tom Vaden <tom.vaden@hp.com>
> Cc: Aswin Chandramouleeswaran <aswin@hp.com>
> Cc: Waiman Long <Waiman.Long@hp.com>
> Cc: Jason Low <jason.low2@hp.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

> ---
>  kernel/futex.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 57 insertions(+)
> 
> diff --git a/kernel/futex.c b/kernel/futex.c
> index 577481d..fcc6850 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -69,6 +69,63 @@
> 
>  #include "locking/rtmutex_common.h"
> 
> +/*
> + * Basic futex operation and ordering guarantees:
> + *
> + * The waiter reads the futex value in user space and calls
> + * futex_wait(). This function computes the hash bucket and acquires
> + * the hash bucket lock. After that it reads the futex user space value
> + * again and verifies that the data has not changed. If it has not
> + * changed it enqueues itself into the hash bucket, releases the hash
> + * bucket lock and schedules.
> + *
> + * The waker side modifies the user space value of the futex and calls
> + * futex_wake(). This functions computes the hash bucket and acquires
> + * the hash bucket lock. Then it looks for waiters on that futex in the
> + * hash bucket and wakes them.
> + *
> + * Note that the spin_lock serializes waiters and wakers, so that the
> + * following scenario is avoided:
> + *
> + * CPU 0                               CPU 1
> + * val = *futex;
> + * sys_futex(WAIT, futex, val);
> + *   futex_wait(futex, val);
> + *   uval = *futex;
> + *                                     *futex = newval;
> + *                                     sys_futex(WAKE, futex);
> + *                                       futex_wake(futex);
> + *                                       if (queue_empty())
> + *                                         return;
> + *   if (uval == val)
> + *      lock(hash_bucket(futex));
> + *      queue();
> + *     unlock(hash_bucket(futex));
> + *     schedule();
> + *
> + * This would cause the waiter on CPU 0 to wait forever because it
> + * missed the transition of the user space value from val to newval
> + * and the waker did not find the waiter in the hash bucket queue.
> + * The spinlock serializes that:
> + *
> + * CPU 0                               CPU 1
> + * val = *futex;
> + * sys_futex(WAIT, futex, val);
> + *   futex_wait(futex, val);
> + *   lock(hash_bucket(futex));
> + *   uval = *futex;
> + *                                     *futex = newval;
> + *                                     sys_futex(WAKE, futex);
> + *                                       futex_wake(futex);
> + *                                       lock(hash_bucket(futex));
> + *   if (uval == val)
> + *      queue();
> + *     unlock(hash_bucket(futex));
> + *     schedule();                       if (!queue_empty())
> + *                                         wake_waiters(futex);
> + *                                       unlock(hash_bucket(futex));
> + */
> +
>  int __read_mostly futex_cmpxchg_enabled;
> 
>  /*
> -- 
> 1.8.1.4
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2014-01-02 15:05 ` [PATCH v5 4/4] futex: Avoid taking hb lock if nothing to wakeup Davidlohr Bueso
  2014-01-02 19:23   ` Linus Torvalds
  2014-01-06 20:52   ` Darren Hart
@ 2014-01-11  9:49   ` Paul E. McKenney
  2014-01-11  9:52     ` Paul E. McKenney
  2 siblings, 1 reply; 23+ messages in thread
From: Paul E. McKenney @ 2014-01-11  9:49 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-kernel, mingo, dvhart, peterz, tglx, efault, jeffm,
	torvalds, jason.low2, Waiman.Long, tom.vaden, scott.norton,
	aswin

On Thu, Jan 02, 2014 at 07:05:20AM -0800, Davidlohr Bueso wrote:
> From: Davidlohr Bueso <davidlohr@hp.com>
> 
> In futex_wake() there is clearly no point in taking the hb->lock if we know
> beforehand that there are no tasks to be woken. While the hash bucket's plist
> head is a cheap way of knowing this, we cannot rely 100% on it as there is a
> racy window between the futex_wait call and when the task is actually added to
> the plist. To this end, we couple it with the spinlock check as tasks trying to
> enter the critical region are most likely potential waiters that will be added
> to the plist, thus preventing tasks sleeping forever if wakers don't acknowledge
> all possible waiters.
> 
> Furthermore, the futex ordering guarantees are preserved, ensuring that waiters
> either observe the changed user space value before blocking or is woken by a
> concurrent waker. For wakers, this is done by relying on the barriers in
> get_futex_key_refs() -- for archs that do have implicit mb in atomic_inc() we
> explicitly add them through a new futex_get_mm function. For waiters we rely
> on the fact that spin_lock calls already update the head counter, so spinners
> are visible even if the lock hasn't been acquired yet.
> 
> For more details please refer to the updated comments in the code and related
> discussion: https://lkml.org/lkml/2013/11/26/556
> 
> Special thanks to tglx for careful review and feedback.
> 
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Darren Hart <dvhart@linux.intel.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Mike Galbraith <efault@gmx.de>
> Cc: Jeff Mahoney <jeffm@suse.com>
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Scott Norton <scott.norton@hp.com>
> Cc: Tom Vaden <tom.vaden@hp.com>
> Cc: Aswin Chandramouleeswaran <aswin@hp.com>
> Cc: Waiman Long <Waiman.Long@hp.com>
> Cc: Jason Low <jason.low2@hp.com>
> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>

A couple of comments below.

							Thanx, Paul

> ---
>  kernel/futex.c | 113 +++++++++++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 90 insertions(+), 23 deletions(-)
> 
> diff --git a/kernel/futex.c b/kernel/futex.c
> index fcc6850..5b4d09e 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -75,17 +75,20 @@
>   * The waiter reads the futex value in user space and calls
>   * futex_wait(). This function computes the hash bucket and acquires
>   * the hash bucket lock. After that it reads the futex user space value
> - * again and verifies that the data has not changed. If it has not
> - * changed it enqueues itself into the hash bucket, releases the hash
> - * bucket lock and schedules.
> + * again and verifies that the data has not changed. If it has not changed
> + * it enqueues itself into the hash bucket, releases the hash bucket lock
> + * and schedules.
>   *
>   * The waker side modifies the user space value of the futex and calls
> - * futex_wake(). This functions computes the hash bucket and acquires
> - * the hash bucket lock. Then it looks for waiters on that futex in the
> - * hash bucket and wakes them.
> + * futex_wake(). This function computes the hash bucket and acquires the
> + * hash bucket lock. Then it looks for waiters on that futex in the hash
> + * bucket and wakes them.
>   *
> - * Note that the spin_lock serializes waiters and wakers, so that the
> - * following scenario is avoided:
> + * In scenarios where wakeups are called and no tasks are blocked on a futex,
> + * taking the hb spinlock can be avoided and simply return. In order for this
> + * optimization to work, ordering guarantees must exist so that the waiter
> + * being added to the list is acknowledged when the list is concurrently being
> + * checked by the waker, avoiding scenarios like the following:
>   *
>   * CPU 0                               CPU 1
>   * val = *futex;
> @@ -106,24 +109,50 @@
>   * This would cause the waiter on CPU 0 to wait forever because it
>   * missed the transition of the user space value from val to newval
>   * and the waker did not find the waiter in the hash bucket queue.
> - * The spinlock serializes that:
> + *
> + * The correct serialization ensures that a waiter either observes
> + * the changed user space value before blocking or is woken by a
> + * concurrent waker:
>   *
>   * CPU 0                               CPU 1
>   * val = *futex;
>   * sys_futex(WAIT, futex, val);
>   *   futex_wait(futex, val);
> - *   lock(hash_bucket(futex));
> - *   uval = *futex;
> - *                                     *futex = newval;
> - *                                     sys_futex(WAKE, futex);
> - *                                       futex_wake(futex);
> - *                                       lock(hash_bucket(futex));
> + *
> + *   waiters++;
> + *   mb(); (A) <-- paired with -.
> + *                              |
> + *   lock(hash_bucket(futex));  |
> + *                              |
> + *   uval = *futex;             |
> + *                              |        *futex = newval;
> + *                              |        sys_futex(WAKE, futex);
> + *                              |          futex_wake(futex);
> + *                              |
> + *                              `------->   mb(); (B)
>   *   if (uval == val)
> - *      queue();
> + *     queue();
>   *     unlock(hash_bucket(futex));
> - *     schedule();                       if (!queue_empty())
> - *                                         wake_waiters(futex);
> - *                                       unlock(hash_bucket(futex));
> + *     schedule();                         if (waiters)
> + *                                           lock(hash_bucket(futex));
> + *                                           wake_waiters(futex);
> + *                                           unlock(hash_bucket(futex));
> + *
> + * Where (A) orders the waiters increment and the futex value read -- this
> + * is guaranteed by the head counter in the hb spinlock; and where (B)
> + * orders the write to futex and the waiters read.
> + *
> + * This yields the following case (where X:=waiters, Y:=futex):
> + *
> + *	X = Y = 0
> + *
> + *	w[X]=1		w[Y]=1
> + *	MB		MB
> + *	r[Y]=y		r[X]=x
> + *
> + * Which guarantees that x==0 && y==0 is impossible; which translates back into
> + * the guarantee that we cannot both miss the futex variable change and the
> + * enqueue.
>   */
> 
>  int __read_mostly futex_cmpxchg_enabled;
> @@ -211,6 +240,38 @@ static unsigned long __read_mostly futex_hashsize;
> 
>  static struct futex_hash_bucket *futex_queues;
> 
> +static inline void futex_get_mm(union futex_key *key)
> +{
> +	atomic_inc(&key->private.mm->mm_count);
> +#ifdef CONFIG_SMP

You don't need the #ifdef -- smp_mb__after_atomic_inc() does not emit
code if !SMP.

> +	/*
> +	 * Ensure futex_get_mm() implies a full barrier such that
> +	 * get_futex_key() implies a full barrier. This is relied upon
> +	 * as full barrier (B), see the ordering comment above.
> +	 */
> +	smp_mb__after_atomic_inc();
> +#endif
> +}
> +
> +static inline bool hb_waiters_pending(struct futex_hash_bucket *hb)
> +{
> +#ifdef CONFIG_SMP
> +	/*
> +	 * Tasks trying to enter the critical region are most likely
> +	 * potential waiters that will be added to the plist. Ensure
> +	 * that wakers won't miss to-be-slept tasks in the window between
> +	 * the wait call and the actual plist_add.
> +	 */
> +	if (spin_is_locked(&hb->lock))
> +		return true;
> +	smp_rmb(); /* Make sure we check the lock state first */
> +
> +	return !plist_head_empty(&hb->chain);
> +#else
> +	return true;
> +#endif
> +}
> +
>  /*
>   * We hash on the keys returned from get_futex_key (see below).
>   */
> @@ -245,10 +306,10 @@ static void get_futex_key_refs(union futex_key *key)
> 
>  	switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
>  	case FUT_OFF_INODE:
> -		ihold(key->shared.inode);
> +		ihold(key->shared.inode); /* implies MB (B) */
>  		break;
>  	case FUT_OFF_MMSHARED:
> -		atomic_inc(&key->private.mm->mm_count);
> +		futex_get_mm(key); /* implies MB (B) */
>  		break;
>  	}
>  }
> @@ -322,7 +383,7 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
>  	if (!fshared) {
>  		key->private.mm = mm;
>  		key->private.address = address;
> -		get_futex_key_refs(key);
> +		get_futex_key_refs(key);  /* implies MB (B) */
>  		return 0;
>  	}
> 
> @@ -1052,6 +1113,11 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
>  		goto out;
> 
>  	hb = hash_futex(&key);
> +
> +	/* Make sure we really have tasks to wakeup */
> +	if (!hb_waiters_pending(hb))
> +		goto out_put_key;

Nice optimization, especially in the (hopefully) common case of low
contention!

> +
>  	spin_lock(&hb->lock);
> 
>  	plist_for_each_entry_safe(this, next, &hb->chain, list) {
> @@ -1072,6 +1138,7 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
>  	}
> 
>  	spin_unlock(&hb->lock);
> +out_put_key:
>  	put_futex_key(&key);
>  out:
>  	return ret;
> @@ -1535,7 +1602,7 @@ static inline struct futex_hash_bucket *queue_lock(struct futex_q *q)
>  	hb = hash_futex(&q->key);
>  	q->lock_ptr = &hb->lock;
> 
> -	spin_lock(&hb->lock);
> +	spin_lock(&hb->lock); /* implies MB (A) */

You need smp_mb__before_spinlock() before the spin_lock() to get a
full memory barrier.

>  	return hb;
>  }
> 
> -- 
> 1.8.1.4
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2014-01-11  9:49   ` Paul E. McKenney
@ 2014-01-11  9:52     ` Paul E. McKenney
  2014-01-11 18:21       ` Davidlohr Bueso
  0 siblings, 1 reply; 23+ messages in thread
From: Paul E. McKenney @ 2014-01-11  9:52 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-kernel, mingo, dvhart, peterz, tglx, efault, jeffm,
	torvalds, jason.low2, Waiman.Long, tom.vaden, scott.norton,
	aswin

On Sat, Jan 11, 2014 at 01:49:12AM -0800, Paul E. McKenney wrote:
> On Thu, Jan 02, 2014 at 07:05:20AM -0800, Davidlohr Bueso wrote:
> > From: Davidlohr Bueso <davidlohr@hp.com>
> > 
> > In futex_wake() there is clearly no point in taking the hb->lock if we know
> > beforehand that there are no tasks to be woken. While the hash bucket's plist
> > head is a cheap way of knowing this, we cannot rely 100% on it as there is a
> > racy window between the futex_wait call and when the task is actually added to
> > the plist. To this end, we couple it with the spinlock check as tasks trying to
> > enter the critical region are most likely potential waiters that will be added
> > to the plist, thus preventing tasks sleeping forever if wakers don't acknowledge
> > all possible waiters.
> > 
> > Furthermore, the futex ordering guarantees are preserved, ensuring that waiters
> > either observe the changed user space value before blocking or is woken by a
> > concurrent waker. For wakers, this is done by relying on the barriers in
> > get_futex_key_refs() -- for archs that do have implicit mb in atomic_inc() we
> > explicitly add them through a new futex_get_mm function. For waiters we rely
> > on the fact that spin_lock calls already update the head counter, so spinners
> > are visible even if the lock hasn't been acquired yet.
> > 
> > For more details please refer to the updated comments in the code and related
> > discussion: https://lkml.org/lkml/2013/11/26/556
> > 
> > Special thanks to tglx for careful review and feedback.
> > 
> > Cc: Ingo Molnar <mingo@kernel.org>
> > Cc: Darren Hart <dvhart@linux.intel.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > Cc: Mike Galbraith <efault@gmx.de>
> > Cc: Jeff Mahoney <jeffm@suse.com>
> > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> > Cc: Scott Norton <scott.norton@hp.com>
> > Cc: Tom Vaden <tom.vaden@hp.com>
> > Cc: Aswin Chandramouleeswaran <aswin@hp.com>
> > Cc: Waiman Long <Waiman.Long@hp.com>
> > Cc: Jason Low <jason.low2@hp.com>
> > Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
> 
> A couple of comments below.
> 
> 							Thanx, Paul
> 
> > ---
> >  kernel/futex.c | 113 +++++++++++++++++++++++++++++++++++++++++++++------------
> >  1 file changed, 90 insertions(+), 23 deletions(-)
> > 
> > diff --git a/kernel/futex.c b/kernel/futex.c
> > index fcc6850..5b4d09e 100644
> > --- a/kernel/futex.c
> > +++ b/kernel/futex.c
> > @@ -75,17 +75,20 @@
> >   * The waiter reads the futex value in user space and calls
> >   * futex_wait(). This function computes the hash bucket and acquires
> >   * the hash bucket lock. After that it reads the futex user space value
> > - * again and verifies that the data has not changed. If it has not
> > - * changed it enqueues itself into the hash bucket, releases the hash
> > - * bucket lock and schedules.
> > + * again and verifies that the data has not changed. If it has not changed
> > + * it enqueues itself into the hash bucket, releases the hash bucket lock
> > + * and schedules.
> >   *
> >   * The waker side modifies the user space value of the futex and calls
> > - * futex_wake(). This functions computes the hash bucket and acquires
> > - * the hash bucket lock. Then it looks for waiters on that futex in the
> > - * hash bucket and wakes them.
> > + * futex_wake(). This function computes the hash bucket and acquires the
> > + * hash bucket lock. Then it looks for waiters on that futex in the hash
> > + * bucket and wakes them.
> >   *
> > - * Note that the spin_lock serializes waiters and wakers, so that the
> > - * following scenario is avoided:
> > + * In scenarios where wakeups are called and no tasks are blocked on a futex,
> > + * taking the hb spinlock can be avoided and simply return. In order for this
> > + * optimization to work, ordering guarantees must exist so that the waiter
> > + * being added to the list is acknowledged when the list is concurrently being
> > + * checked by the waker, avoiding scenarios like the following:
> >   *
> >   * CPU 0                               CPU 1
> >   * val = *futex;
> > @@ -106,24 +109,50 @@
> >   * This would cause the waiter on CPU 0 to wait forever because it
> >   * missed the transition of the user space value from val to newval
> >   * and the waker did not find the waiter in the hash bucket queue.
> > - * The spinlock serializes that:
> > + *
> > + * The correct serialization ensures that a waiter either observes
> > + * the changed user space value before blocking or is woken by a
> > + * concurrent waker:
> >   *
> >   * CPU 0                               CPU 1
> >   * val = *futex;
> >   * sys_futex(WAIT, futex, val);
> >   *   futex_wait(futex, val);
> > - *   lock(hash_bucket(futex));
> > - *   uval = *futex;
> > - *                                     *futex = newval;
> > - *                                     sys_futex(WAKE, futex);
> > - *                                       futex_wake(futex);
> > - *                                       lock(hash_bucket(futex));
> > + *
> > + *   waiters++;
> > + *   mb(); (A) <-- paired with -.
> > + *                              |
> > + *   lock(hash_bucket(futex));  |
> > + *                              |
> > + *   uval = *futex;             |
> > + *                              |        *futex = newval;
> > + *                              |        sys_futex(WAKE, futex);
> > + *                              |          futex_wake(futex);
> > + *                              |
> > + *                              `------->   mb(); (B)
> >   *   if (uval == val)
> > - *      queue();
> > + *     queue();
> >   *     unlock(hash_bucket(futex));
> > - *     schedule();                       if (!queue_empty())
> > - *                                         wake_waiters(futex);
> > - *                                       unlock(hash_bucket(futex));
> > + *     schedule();                         if (waiters)
> > + *                                           lock(hash_bucket(futex));
> > + *                                           wake_waiters(futex);
> > + *                                           unlock(hash_bucket(futex));
> > + *
> > + * Where (A) orders the waiters increment and the futex value read -- this
> > + * is guaranteed by the head counter in the hb spinlock; and where (B)
> > + * orders the write to futex and the waiters read.
> > + *
> > + * This yields the following case (where X:=waiters, Y:=futex):
> > + *
> > + *	X = Y = 0
> > + *
> > + *	w[X]=1		w[Y]=1
> > + *	MB		MB
> > + *	r[Y]=y		r[X]=x
> > + *
> > + * Which guarantees that x==0 && y==0 is impossible; which translates back into
> > + * the guarantee that we cannot both miss the futex variable change and the
> > + * enqueue.
> >   */
> > 
> >  int __read_mostly futex_cmpxchg_enabled;
> > @@ -211,6 +240,38 @@ static unsigned long __read_mostly futex_hashsize;
> > 
> >  static struct futex_hash_bucket *futex_queues;
> > 
> > +static inline void futex_get_mm(union futex_key *key)
> > +{
> > +	atomic_inc(&key->private.mm->mm_count);
> > +#ifdef CONFIG_SMP
> 
> You don't need the #ifdef -- smp_mb__after_atomic_inc() does not emit
> code if !SMP.
> 
> > +	/*
> > +	 * Ensure futex_get_mm() implies a full barrier such that
> > +	 * get_futex_key() implies a full barrier. This is relied upon
> > +	 * as full barrier (B), see the ordering comment above.
> > +	 */
> > +	smp_mb__after_atomic_inc();
> > +#endif
> > +}
> > +
> > +static inline bool hb_waiters_pending(struct futex_hash_bucket *hb)
> > +{
> > +#ifdef CONFIG_SMP
> > +	/*
> > +	 * Tasks trying to enter the critical region are most likely
> > +	 * potential waiters that will be added to the plist. Ensure
> > +	 * that wakers won't miss to-be-slept tasks in the window between
> > +	 * the wait call and the actual plist_add.
> > +	 */
> > +	if (spin_is_locked(&hb->lock))
> > +		return true;
> > +	smp_rmb(); /* Make sure we check the lock state first */
> > +
> > +	return !plist_head_empty(&hb->chain);
> > +#else
> > +	return true;
> > +#endif
> > +}
> > +
> >  /*
> >   * We hash on the keys returned from get_futex_key (see below).
> >   */
> > @@ -245,10 +306,10 @@ static void get_futex_key_refs(union futex_key *key)
> > 
> >  	switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
> >  	case FUT_OFF_INODE:
> > -		ihold(key->shared.inode);
> > +		ihold(key->shared.inode); /* implies MB (B) */
> >  		break;
> >  	case FUT_OFF_MMSHARED:
> > -		atomic_inc(&key->private.mm->mm_count);
> > +		futex_get_mm(key); /* implies MB (B) */
> >  		break;
> >  	}
> >  }
> > @@ -322,7 +383,7 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
> >  	if (!fshared) {
> >  		key->private.mm = mm;
> >  		key->private.address = address;
> > -		get_futex_key_refs(key);
> > +		get_futex_key_refs(key);  /* implies MB (B) */
> >  		return 0;
> >  	}
> > 
> > @@ -1052,6 +1113,11 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
> >  		goto out;
> > 
> >  	hb = hash_futex(&key);
> > +
> > +	/* Make sure we really have tasks to wakeup */
> > +	if (!hb_waiters_pending(hb))
> > +		goto out_put_key;
> 
> Nice optimization, especially in the (hopefully) common case of low
> contention!
> 
> > +
> >  	spin_lock(&hb->lock);
> > 
> >  	plist_for_each_entry_safe(this, next, &hb->chain, list) {
> > @@ -1072,6 +1138,7 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
> >  	}
> > 
> >  	spin_unlock(&hb->lock);
> > +out_put_key:
> >  	put_futex_key(&key);
> >  out:
> >  	return ret;
> > @@ -1535,7 +1602,7 @@ static inline struct futex_hash_bucket *queue_lock(struct futex_q *q)
> >  	hb = hash_futex(&q->key);
> >  	q->lock_ptr = &hb->lock;
> > 
> > -	spin_lock(&hb->lock);
> > +	spin_lock(&hb->lock); /* implies MB (A) */
> 
> You need smp_mb__before_spinlock() before the spin_lock() to get a
> full memory barrier.

Actually, even that only gets you smp_mb().

Unless you are ordering a prior write against a later write here, you
will need an smp_mb().

							Thanx, Paul

> >  	return hb;
> >  }
> > 
> > -- 
> > 1.8.1.4
> > 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2014-01-11  9:52     ` Paul E. McKenney
@ 2014-01-11 18:21       ` Davidlohr Bueso
  0 siblings, 0 replies; 23+ messages in thread
From: Davidlohr Bueso @ 2014-01-11 18:21 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, mingo, dvhart, peterz, tglx, efault, jeffm,
	torvalds, jason.low2, Waiman.Long, tom.vaden, scott.norton,
	aswin

On Sat, 2014-01-11 at 01:52 -0800, Paul E. McKenney wrote:
[...]
> On Sat, Jan 11, 2014 at 01:49:12AM -0800, Paul E. McKenney wrote:
> > On Thu, Jan 02, 2014 at 07:05:20AM -0800, Davidlohr Bueso wrote:
> > > -	spin_lock(&hb->lock);
> > > +	spin_lock(&hb->lock); /* implies MB (A) */
> > 
> > You need smp_mb__before_spinlock() before the spin_lock() to get a
> > full memory barrier.

Hmmm, the thing we need to guarantee here is that the ticket increment
is visible (which is the same as the smp_mb__after_atomic_inc we used to
have in the original atomic counter approach), so adding a barrier
before the spin_lock call wouldn't serve that. I previously consulted
this with Linus and we can rely on the fact that spin_lock calls already
update the head counter, so spinners are visible even if the lock hasn't
been acquired yet.

> Actually, even that only gets you smp_mb().

I guess you mean smp_wmb() here.

> Unless you are ordering a prior write against a later write here, you
> will need an smp_mb().

Yep.

Thanks for looking into this,
Davidlohr


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2014-01-11 18:21 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-01-02 15:05 [PATCH v5 0/4] futex: Wakeup optimizations Davidlohr Bueso
2014-01-02 15:05 ` [PATCH v5 1/4] futex: Misc cleanups Davidlohr Bueso
2014-01-11  6:43   ` Paul E. McKenney
2014-01-02 15:05 ` [PATCH v5 2/4] futex: Larger hash table Davidlohr Bueso
2014-01-11  7:37   ` Paul E. McKenney
2014-01-02 15:05 ` [PATCH v5 3/4] futex: Document ordering guarantees Davidlohr Bueso
2014-01-06 18:58   ` Darren Hart
2014-01-11  7:40   ` Paul E. McKenney
2014-01-02 15:05 ` [PATCH v5 4/4] futex: Avoid taking hb lock if nothing to wakeup Davidlohr Bueso
2014-01-02 19:23   ` Linus Torvalds
2014-01-02 20:59     ` Davidlohr Bueso
2014-01-06 20:56       ` Darren Hart
2014-01-06 20:52   ` Darren Hart
2014-01-07  3:29     ` Davidlohr Bueso
2014-01-07 17:40       ` Darren Hart
2014-01-11  9:49   ` Paul E. McKenney
2014-01-11  9:52     ` Paul E. McKenney
2014-01-11 18:21       ` Davidlohr Bueso
2014-01-06  0:59 ` [PATCH v5 0/4] futex: Wakeup optimizations Davidlohr Bueso
2014-01-06  1:38 ` [PATCH 5/4] futex: silence uninitialized warnings Davidlohr Bueso
2014-01-06 18:48   ` Darren Hart
2014-01-07  2:55   ` Linus Torvalds
2014-01-07  3:02     ` Davidlohr Bueso

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).