linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/5] futex: Wakeup optimizations
@ 2013-11-23  0:56 Davidlohr Bueso
  2013-11-23  0:56 ` [PATCH 1/5] futex: Misc cleanups Davidlohr Bueso
                   ` (5 more replies)
  0 siblings, 6 replies; 39+ messages in thread
From: Davidlohr Bueso @ 2013-11-23  0:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, dvhart, peterz, tglx, efault, jeffm, torvalds,
	scott.norton, tom.vaden, aswin, Waiman.Long, jason.low2,
	davidlohr

We have been dealing with a customer database workload on a large
12TB, 240-core, 16-socket NUMA system that exhibits high amounts
of contention on some of the locks that serialize internal futex
data structures. This workload especially suffers in the wakeup
paths, where waiting on the corresponding hb->lock can account for
up to ~60% of the time. The result of such calls can mostly be
classified as (i) nothing to wake up and (ii) waking up a large
number of tasks.

Before these patches are applied, we can see this pathological behavior:

 37.12%  826174  xxx  [kernel.kallsyms] [k] _raw_spin_lock
            --- _raw_spin_lock
             |
             |--97.14%-- futex_wake
             |          do_futex
             |          sys_futex
             |          system_call_fastpath
             |          |
             |          |--99.70%-- 0x7f383fbdea1f
             |          |           yyy

 43.71%  762296  xxx  [kernel.kallsyms] [k] _raw_spin_lock
            --- _raw_spin_lock
             |
             |--53.74%-- futex_wake
             |          do_futex
             |          sys_futex
             |          system_call_fastpath
             |          |
             |          |--99.40%-- 0x7fe7d44a4c05
             |          |           zzz
             |--45.90%-- futex_wait_setup
             |          futex_wait
             |          do_futex
             |          sys_futex
             |          system_call_fastpath
             |          0x7fe7ba315789
             |          syscall


With these patches, contention is practically non-existent:

 0.10%     49   xxx  [kernel.kallsyms]   [k] _raw_spin_lock
               --- _raw_spin_lock
                |
                |--76.06%-- futex_wait_setup
                |          futex_wait
                |          do_futex
                |          sys_futex
                |          system_call_fastpath
                |          |
                |          |--99.90%-- 0x7f3165e63789
                |          |          syscall|
                           ...
                |--6.27%-- futex_wake
                |          do_futex
                |          sys_futex
                |          system_call_fastpath
                |          |
                |          |--54.56%-- 0x7f317fff2c05
                ...

Patches 1 & 2 are cleanups and micro optimizations.

Patch 3 addresses the well-known issue of the global hash table.
By creating a larger and NUMA-aware table, we can reduce the false
sharing and collisions, thus reducing the chance of different futexes
contending on the same hb->lock.
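
Just to illustrate the sizing rule, here is a rough userspace sketch (the
helper below and the 240-CPU figure are purely illustrative; the kernel
side uses roundup_pow_of_two() and alloc_large_system_hash()):

 #include <stdio.h>

 /* smallest power of two >= n, mirroring roundup_pow_of_two() */
 static unsigned long next_pow2(unsigned long n)
 {
         unsigned long p = 1;

         while (p < n)
                 p <<= 1;
         return p;
 }

 int main(void)
 {
         unsigned long ncpus = 240;      /* e.g. the 240-core box above */
         unsigned long size = next_pow2(256 * ncpus);

         if (size < 256)                 /* minimum imposed on the table */
                 size = 256;
         printf("futex hash buckets: %lu\n", size);  /* 65536 for 240 CPUs */
         return 0;
 }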

Patch 4 reduces contention on the corresponding hb->lock by not trying to
acquire it if there are no blocked tasks in the waitqueue.
This particularly deals with point (i) above, where we see that it is not
uncommon for up to 90% of wakeup calls to end up returning 0, indicating that no
tasks were woken.
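
A minimal userspace sketch of the idea, assuming a pthread mutex stands in
for hb->lock and an explicit atomic counter for the per-bucket length (the
names below are illustrative, not the kernel's):

 #include <stdatomic.h>
 #include <pthread.h>

 struct bucket {
         atomic_int      len;    /* number of queued waiters */
         pthread_mutex_t lock;   /* stands in for hb->lock */
 };

 /* waiter side: announce ourselves *before* taking the lock,
  * mirroring the atomic increment done in queue_lock() */
 void queue_waiter(struct bucket *b)
 {
         atomic_fetch_add(&b->len, 1);
         pthread_mutex_lock(&b->lock);
         /* ... add ourselves to the wait list, then block ... */
         pthread_mutex_unlock(&b->lock);
 }

 /* waker side: skip the lock entirely when nobody can be waiting */
 int wake_waiters(struct bucket *b)
 {
         int woken = 0;

         if (atomic_load(&b->len) == 0)
                 return 0;               /* nothing to wake, lock never taken */

         pthread_mutex_lock(&b->lock);
         /* ... walk the wait list, wake matching waiters, count them ... */
         pthread_mutex_unlock(&b->lock);
         return woken;
 }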

Patch 5 resurrects a two-year-old idea from Peter Zijlstra to delay
the waking of blocked tasks so that it happens without holding the hb->lock:
https://lkml.org/lkml/2011/9/14/118

This is useful for locking primitives that can effect multiple wakeups
per operation and want to avoid the futex's internal spinlock contention by
delaying the wakeups until we've released the hb->lock.
This particularly deals with point (ii) above, where we can observe that
on occasion the wake calls end up waking 125 to 200 waiters in what we believe
are RW locks in the application.
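
The shape of that scheme, as a self-contained userspace analogue (condition
variables play the role of the actual task wakeup here; the patch itself
queues task_structs and wakes them via wake_up_state() in wake_up_list()):

 #include <pthread.h>
 #include <stddef.h>

 struct waiter {
         struct waiter   *next;  /* plays the role of task->wake_list */
         pthread_cond_t  cond;
 };

 struct wake_list {
         struct waiter   *first;
 };

 /* called with hb->lock held: only link the waiter, don't wake it yet */
 void wake_list_add(struct wake_list *wl, struct waiter *w)
 {
         w->next = wl->first;
         wl->first = w;
 }

 /* called after hb->lock has been dropped: issue the actual wakeups */
 void wake_up_list(struct wake_list *wl)
 {
         struct waiter *w = wl->first;

         while (w) {
                 struct waiter *next = w->next;

                 w->next = NULL;
                 pthread_cond_signal(&w->cond);
                 w = next;
         }
         wl->first = NULL;
 }

The key point is that the signal (like wake_up_state() in the patch) only
runs once the bucket lock has been released, so the woken tasks don't
immediately pile up on it.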

This patchset has also been tested on smaller systems for a variety of
benchmarks, including java workloads, kernel builds and custom bang-the-hell-out-of
hb locks programs. So far, no functional or performance regressions have been seen.
Furthermore, no issues were found when running the different tests in the futextest 
suite: http://git.kernel.org/cgit/linux/kernel/git/dvhart/futextest.git/

This patchset applies on top of Linus' tree as of v3.13-rc1.

Special thanks to Scott Norton, Tom Vaden and Mark Ray for help presenting,
debugging and analyzing the data.

  futex: Misc cleanups
  futex: Check for pi futex_q only once
  futex: Larger hash table
  futex: Avoid taking hb lock if nothing to wakeup
  sched,futex: Provide delayed wakeup list

 include/linux/sched.h |  41 ++++++++++++++++++
 kernel/futex.c        | 113 +++++++++++++++++++++++++++-----------------------
 kernel/sched/core.c   |  19 +++++++++
 3 files changed, 122 insertions(+), 51 deletions(-)

-- 
1.8.1.4


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 1/5] futex: Misc cleanups
  2013-11-23  0:56 [PATCH 0/5] futex: Wakeup optimizations Davidlohr Bueso
@ 2013-11-23  0:56 ` Davidlohr Bueso
  2013-11-23  6:52   ` Darren Hart
  2013-11-23  0:56 ` [PATCH 2/5] futex: Check for pi futex_q only once Davidlohr Bueso
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 39+ messages in thread
From: Davidlohr Bueso @ 2013-11-23  0:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, dvhart, peterz, tglx, efault, jeffm, torvalds,
	scott.norton, tom.vaden, aswin, Waiman.Long, jason.low2,
	davidlohr

From: Jason Low <jason.low2@hp.com>

- Remove unnecessary head variables.
- Delete unused parameter in queue_unlock().

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Darren Hart <dvhart@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Scott Norton <scott.norton@hp.com>
Cc: Tom Vaden <tom.vaden@hp.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
---
 kernel/futex.c | 39 ++++++++++++---------------------------
 1 file changed, 12 insertions(+), 27 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 80ba086..e6ffe73 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -597,13 +597,10 @@ lookup_pi_state(u32 uval, struct futex_hash_bucket *hb,
 {
 	struct futex_pi_state *pi_state = NULL;
 	struct futex_q *this, *next;
-	struct plist_head *head;
 	struct task_struct *p;
 	pid_t pid = uval & FUTEX_TID_MASK;
 
-	head = &hb->chain;
-
-	plist_for_each_entry_safe(this, next, head, list) {
+	plist_for_each_entry_safe(this, next, &hb->chain, list) {
 		if (match_futex(&this->key, key)) {
 			/*
 			 * Another waiter already exists - bump up
@@ -985,7 +982,6 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 {
 	struct futex_hash_bucket *hb;
 	struct futex_q *this, *next;
-	struct plist_head *head;
 	union futex_key key = FUTEX_KEY_INIT;
 	int ret;
 
@@ -998,9 +994,8 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 
 	hb = hash_futex(&key);
 	spin_lock(&hb->lock);
-	head = &hb->chain;
 
-	plist_for_each_entry_safe(this, next, head, list) {
+	plist_for_each_entry_safe(this, next, &hb->chain, list) {
 		if (match_futex (&this->key, &key)) {
 			if (this->pi_state || this->rt_waiter) {
 				ret = -EINVAL;
@@ -1033,7 +1028,6 @@ futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
 {
 	union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT;
 	struct futex_hash_bucket *hb1, *hb2;
-	struct plist_head *head;
 	struct futex_q *this, *next;
 	int ret, op_ret;
 
@@ -1081,9 +1075,7 @@ retry_private:
 		goto retry;
 	}
 
-	head = &hb1->chain;
-
-	plist_for_each_entry_safe(this, next, head, list) {
+	plist_for_each_entry_safe(this, next, &hb1->chain, list) {
 		if (match_futex (&this->key, &key1)) {
 			if (this->pi_state || this->rt_waiter) {
 				ret = -EINVAL;
@@ -1096,10 +1088,8 @@ retry_private:
 	}
 
 	if (op_ret > 0) {
-		head = &hb2->chain;
-
 		op_ret = 0;
-		plist_for_each_entry_safe(this, next, head, list) {
+		plist_for_each_entry_safe(this, next, &hb2->chain, list) {
 			if (match_futex (&this->key, &key2)) {
 				if (this->pi_state || this->rt_waiter) {
 					ret = -EINVAL;
@@ -1269,7 +1259,6 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags,
 	int drop_count = 0, task_count = 0, ret;
 	struct futex_pi_state *pi_state = NULL;
 	struct futex_hash_bucket *hb1, *hb2;
-	struct plist_head *head1;
 	struct futex_q *this, *next;
 	u32 curval2;
 
@@ -1392,8 +1381,7 @@ retry_private:
 		}
 	}
 
-	head1 = &hb1->chain;
-	plist_for_each_entry_safe(this, next, head1, list) {
+	plist_for_each_entry_safe(this, next, &hb1->chain, list) {
 		if (task_count - nr_wake >= nr_requeue)
 			break;
 
@@ -1493,7 +1481,7 @@ static inline struct futex_hash_bucket *queue_lock(struct futex_q *q)
 }
 
 static inline void
-queue_unlock(struct futex_q *q, struct futex_hash_bucket *hb)
+queue_unlock(struct futex_hash_bucket *hb)
 	__releases(&hb->lock)
 {
 	spin_unlock(&hb->lock);
@@ -1866,7 +1854,7 @@ retry_private:
 	ret = get_futex_value_locked(&uval, uaddr);
 
 	if (ret) {
-		queue_unlock(q, *hb);
+		queue_unlock(*hb);
 
 		ret = get_user(uval, uaddr);
 		if (ret)
@@ -1880,7 +1868,7 @@ retry_private:
 	}
 
 	if (uval != val) {
-		queue_unlock(q, *hb);
+		queue_unlock(*hb);
 		ret = -EWOULDBLOCK;
 	}
 
@@ -2028,7 +2016,7 @@ retry_private:
 			 * Task is exiting and we just wait for the
 			 * exit to complete.
 			 */
-			queue_unlock(&q, hb);
+			queue_unlock(hb);
 			put_futex_key(&q.key);
 			cond_resched();
 			goto retry;
@@ -2080,7 +2068,7 @@ retry_private:
 	goto out_put_key;
 
 out_unlock_put_key:
-	queue_unlock(&q, hb);
+	queue_unlock(hb);
 
 out_put_key:
 	put_futex_key(&q.key);
@@ -2090,7 +2078,7 @@ out:
 	return ret != -EINTR ? ret : -ERESTARTNOINTR;
 
 uaddr_faulted:
-	queue_unlock(&q, hb);
+	queue_unlock(hb);
 
 	ret = fault_in_user_writeable(uaddr);
 	if (ret)
@@ -2112,7 +2100,6 @@ static int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
 {
 	struct futex_hash_bucket *hb;
 	struct futex_q *this, *next;
-	struct plist_head *head;
 	union futex_key key = FUTEX_KEY_INIT;
 	u32 uval, vpid = task_pid_vnr(current);
 	int ret;
@@ -2152,9 +2139,7 @@ retry:
 	 * Ok, other tasks may need to be woken up - check waiters
 	 * and do the wakeup if necessary:
 	 */
-	head = &hb->chain;
-
-	plist_for_each_entry_safe(this, next, head, list) {
+	plist_for_each_entry_safe(this, next, &hb->chain, list) {
 		if (!match_futex (&this->key, &key))
 			continue;
 		ret = wake_futex_pi(uaddr, uval, this);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 2/5] futex: Check for pi futex_q only once
  2013-11-23  0:56 [PATCH 0/5] futex: Wakeup optimizations Davidlohr Bueso
  2013-11-23  0:56 ` [PATCH 1/5] futex: Misc cleanups Davidlohr Bueso
@ 2013-11-23  0:56 ` Davidlohr Bueso
  2013-11-23  6:33   ` Darren Hart
  2013-11-23  0:56 ` [PATCH 3/5] futex: Larger hash table Davidlohr Bueso
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 39+ messages in thread
From: Davidlohr Bueso @ 2013-11-23  0:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, dvhart, peterz, tglx, efault, jeffm, torvalds,
	scott.norton, tom.vaden, aswin, Waiman.Long, jason.low2,
	davidlohr

All wake_futex() callers already verify that we are not dealing with
a pi futex_q, so we can remove the redundant WARN() check, as this is never
triggered anyway.

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Darren Hart <dvhart@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Scott Norton <scott.norton@hp.com>
Cc: Tom Vaden <tom.vaden@hp.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: Jason Low <jason.low2@hp.com>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
---
 kernel/futex.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index e6ffe73..0768c68 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -844,9 +844,6 @@ static void wake_futex(struct futex_q *q)
 {
 	struct task_struct *p = q->task;
 
-	if (WARN(q->pi_state || q->rt_waiter, "refusing to wake PI futex\n"))
-		return;
-
 	/*
 	 * We set q->lock_ptr = NULL _before_ we wake up the task. If
 	 * a non-futex wake up happens on another CPU then the task
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 3/5] futex: Larger hash table
  2013-11-23  0:56 [PATCH 0/5] futex: Wakeup optimizations Davidlohr Bueso
  2013-11-23  0:56 ` [PATCH 1/5] futex: Misc cleanups Davidlohr Bueso
  2013-11-23  0:56 ` [PATCH 2/5] futex: Check for pi futex_q only once Davidlohr Bueso
@ 2013-11-23  0:56 ` Davidlohr Bueso
  2013-11-23  6:52   ` Darren Hart
  2013-11-23  0:56 ` [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup Davidlohr Bueso
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 39+ messages in thread
From: Davidlohr Bueso @ 2013-11-23  0:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, dvhart, peterz, tglx, efault, jeffm, torvalds,
	scott.norton, tom.vaden, aswin, Waiman.Long, jason.low2,
	davidlohr

Currently, the futex global hash table suffers from its fixed, smallish
(by today's standards) size of 256 entries, as well as its lack of NUMA
awareness. Large systems, using many futexes, can be prone to high amounts
of collisions, where different futexes hash to the same bucket and lead to
extra contention on the same hb->lock. Furthermore, cacheline bouncing is a
reality when we have multiple hb->locks residing on the same cacheline and
different futexes hash to adjacent buckets.

This patch keeps the current static size of 16 entries for small systems,
and otherwise uses 256 * ncpus (or larger, as we need to round the number up
to a power of 2). Note that this number of CPUs accounts for all CPUs that
can ever be available in the system, taking into consideration things like
hotplugging. While we do impose extra overhead at bootup by making the hash
table larger, this is a one-time cost and does not overshadow the benefits
of this patch.

Also, similar to other core kernel components (pid, dcache, tcp), by using
alloc_large_system_hash() we benefit from its NUMA awareness and thus the
table is distributed among the nodes instead of residing on a single one. We
impose this function's minimum limit of 256 entries, so that in worst-case
scenarios or upon any issues, we still end up using at least the current
number of buckets anyway.

For a custom microbenchmark that pounds on the uaddr hashing -- making the
wait path fail at futex_wait_setup() and return -EWOULDBLOCK for large
numbers of futexes -- we can see the following benefits on an 80-core,
8-socket, 1TB server:

+---------+----------------------------------+------------------------------------------+----------+
| threads | baseline (ops/sec) [insns/cycle] | large hash table (ops/sec) [insns/cycle] | increase |
+---------+----------------------------------+------------------------------------------+----------+
|     512 | 34429    [0.07]                  | 255274    [0.48]                         | +641.45% |
|     256 | 65452    [0.07]                  | 443563    [0.41]                         | +577.69% |
|     128 | 125111   [0.07]                  | 742613    [0.33]                         | +493.56% |
|      80 | 203642   [0.09]                  | 1028147   [0.29]                         | +404.87% |
|      64 | 262944   [0.09]                  | 997300    [0.28]                         | +279.28% |
|      32 | 642390   [0.24]                  | 965996    [0.27]                         | +50.37%  |
+---------+----------------------------------+------------------------------------------+----------+

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Darren Hart <dvhart@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Scott Norton <scott.norton@hp.com>
Cc: Tom Vaden <tom.vaden@hp.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
---
 kernel/futex.c | 26 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 0768c68..5fa9eb0 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -63,6 +63,7 @@
 #include <linux/sched/rt.h>
 #include <linux/hugetlb.h>
 #include <linux/freezer.h>
+#include <linux/bootmem.h>
 
 #include <asm/futex.h>
 
@@ -70,7 +71,11 @@
 
 int __read_mostly futex_cmpxchg_enabled;
 
-#define FUTEX_HASHBITS (CONFIG_BASE_SMALL ? 4 : 8)
+#if CONFIG_BASE_SMALL
+static unsigned long futex_hashsize = 16;
+#else
+static unsigned long futex_hashsize;
+#endif
 
 /*
  * Futex flags used to encode options to functions and preserve them across
@@ -151,7 +156,11 @@ struct futex_hash_bucket {
 	struct plist_head chain;
 };
 
-static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS];
+#if CONFIG_BASE_SMALL
+static struct futex_hash_bucket futex_queues[futex_hashsize];
+#else
+static struct futex_hash_bucket *futex_queues;
+#endif
 
 /*
  * We hash on the keys returned from get_futex_key (see below).
@@ -161,7 +170,7 @@ static struct futex_hash_bucket *hash_futex(union futex_key *key)
 	u32 hash = jhash2((u32*)&key->both.word,
 			  (sizeof(key->both.word)+sizeof(key->both.ptr))/4,
 			  key->both.offset);
-	return &futex_queues[hash & ((1 << FUTEX_HASHBITS)-1)];
+	return &futex_queues[hash & (futex_hashsize - 1)];
 }
 
 /*
@@ -2715,7 +2724,14 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
 static int __init futex_init(void)
 {
 	u32 curval;
-	int i;
+	unsigned long i;
+
+#if !CONFIG_BASE_SMALL
+	futex_hashsize = roundup_pow_of_two((256 * num_possible_cpus()));
+	futex_queues = alloc_large_system_hash("futex", sizeof(*futex_queues),
+					       futex_hashsize, 0, 0, NULL, NULL,
+					       256, futex_hashsize);
+#endif
 
 	/*
 	 * This will fail and we want it. Some arch implementations do
@@ -2730,7 +2746,7 @@ static int __init futex_init(void)
 	if (cmpxchg_futex_value_locked(&curval, NULL, 0, 0) == -EFAULT)
 		futex_cmpxchg_enabled = 1;
 
-	for (i = 0; i < ARRAY_SIZE(futex_queues); i++) {
+	for (i = 0; i < futex_hashsize; i++) {
 		plist_head_init(&futex_queues[i].chain);
 		spin_lock_init(&futex_queues[i].lock);
 	}
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup
  2013-11-23  0:56 [PATCH 0/5] futex: Wakeup optimizations Davidlohr Bueso
                   ` (2 preceding siblings ...)
  2013-11-23  0:56 ` [PATCH 3/5] futex: Larger hash table Davidlohr Bueso
@ 2013-11-23  0:56 ` Davidlohr Bueso
  2013-11-23  1:25   ` Linus Torvalds
                     ` (2 more replies)
  2013-11-23  0:56 ` [PATCH 5/5] sched,futex: Provide delayed wakeup list Davidlohr Bueso
  2013-11-23  5:55 ` [PATCH 0/5] futex: Wakeup optimizations Darren Hart
  5 siblings, 3 replies; 39+ messages in thread
From: Davidlohr Bueso @ 2013-11-23  0:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, dvhart, peterz, tglx, efault, jeffm, torvalds,
	scott.norton, tom.vaden, aswin, Waiman.Long, jason.low2,
	davidlohr

In futex_wake() there is clearly no point in taking the hb->lock if
we know beforehand that there are no tasks to be woken. This comes
at the smaller cost of doing some atomic operations to keep track of
the list's size. Specifically, increment the counter when an element is
added to the list, and decrement when it is removed. Of course, if the
counter is 0, then there are no tasks blocked on a futex. Some special
considerations:

- increment the counter at queue_lock() as we always end up calling
  queue_me() which adds the element to the list. Upon any error,
  queue_unlock() is called for housekeeping, for which we decrement
  to match the increment done in queue_lock().

- decrement the counter at __unqueue_futex() to reflect when an element is
  removed from the queue for wakeup related purposes.

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Darren Hart <dvhart@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Scott Norton <scott.norton@hp.com>
Cc: Tom Vaden <tom.vaden@hp.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
---
 kernel/futex.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/kernel/futex.c b/kernel/futex.c
index 5fa9eb0..2f9dd5d 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -152,6 +152,7 @@ static const struct futex_q futex_q_init = {
  * waiting on a futex.
  */
 struct futex_hash_bucket {
+	atomic_t len;
 	spinlock_t lock;
 	struct plist_head chain;
 };
@@ -843,6 +844,7 @@ static void __unqueue_futex(struct futex_q *q)
 
 	hb = container_of(q->lock_ptr, struct futex_hash_bucket, lock);
 	plist_del(&q->list, &hb->chain);
+	atomic_dec(&hb->len);
 }
 
 /*
@@ -999,6 +1001,10 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 		goto out;
 
 	hb = hash_futex(&key);
+	/* make sure we really have tasks to wakeup */
+	if (atomic_read(&hb->len) == 0)
+		goto out_put_key;
+
 	spin_lock(&hb->lock);
 
 	plist_for_each_entry_safe(this, next, &hb->chain, list) {
@@ -1019,6 +1025,7 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 	}
 
 	spin_unlock(&hb->lock);
+out_put_key:
 	put_futex_key(&key);
 out:
 	return ret;
@@ -1137,7 +1144,9 @@ void requeue_futex(struct futex_q *q, struct futex_hash_bucket *hb1,
 	 */
 	if (likely(&hb1->chain != &hb2->chain)) {
 		plist_del(&q->list, &hb1->chain);
+		atomic_dec(&hb1->len);
 		plist_add(&q->list, &hb2->chain);
+		atomic_inc(&hb2->len);
 		q->lock_ptr = &hb2->lock;
 	}
 	get_futex_key_refs(key2);
@@ -1480,6 +1489,8 @@ static inline struct futex_hash_bucket *queue_lock(struct futex_q *q)
 	struct futex_hash_bucket *hb;
 
 	hb = hash_futex(&q->key);
+	atomic_inc(&hb->len);
+
 	q->lock_ptr = &hb->lock;
 
 	spin_lock(&hb->lock);
@@ -1491,6 +1502,7 @@ queue_unlock(struct futex_hash_bucket *hb)
 	__releases(&hb->lock)
 {
 	spin_unlock(&hb->lock);
+	atomic_dec(&hb->len);
 }
 
 /**
@@ -2222,6 +2234,7 @@ int handle_early_requeue_pi_wakeup(struct futex_hash_bucket *hb,
 		 * Unqueue the futex_q and determine which it was.
 		 */
 		plist_del(&q->list, &hb->chain);
+		atomic_dec(&hb->len);
 
 		/* Handle spurious wakeups gracefully */
 		ret = -EWOULDBLOCK;
@@ -2749,6 +2762,7 @@ static int __init futex_init(void)
 	for (i = 0; i < futex_hashsize; i++) {
 		plist_head_init(&futex_queues[i].chain);
 		spin_lock_init(&futex_queues[i].lock);
+		atomic_set(&futex_queues[i].len, 0);
 	}
 
 	return 0;
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 5/5] sched,futex: Provide delayed wakeup list
  2013-11-23  0:56 [PATCH 0/5] futex: Wakeup optimizations Davidlohr Bueso
                   ` (3 preceding siblings ...)
  2013-11-23  0:56 ` [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup Davidlohr Bueso
@ 2013-11-23  0:56 ` Davidlohr Bueso
  2013-11-23 11:48   ` Peter Zijlstra
  2013-11-23  5:55 ` [PATCH 0/5] futex: Wakeup optimizations Darren Hart
  5 siblings, 1 reply; 39+ messages in thread
From: Davidlohr Bueso @ 2013-11-23  0:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, dvhart, peterz, tglx, efault, jeffm, torvalds,
	scott.norton, tom.vaden, aswin, Waiman.Long, jason.low2,
	davidlohr

From: Peter Zijlstra <peterz@infradead.org>

Original patchset: https://lkml.org/lkml/2011/9/14/118

This is useful for locking primitives that can effect multiple
wakeups per operation and want to avoid contention on the primitive's
internal locks by delaying the wakeups until those locks have been
released.

Alternatively it can be used to avoid issuing multiple wakeups, and
thus save a few cycles, in packet processing. Queue all target tasks
and wakeup once you've processed all packets. That way you avoid
waking the target task multiple times if there were multiple packets
for the same task.

This patch adds the needed infrastructure into the scheduler code
and uses the new wake_list to delay the futex wakeups until
after we've released the hash bucket locks. This prevents the newly
woken tasks from immediately getting stuck on the hb->lock.

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Darren Hart <dvhart@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Scott Norton <scott.norton@hp.com>
Cc: Tom Vaden <tom.vaden@hp.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Waiman Long <Waiman.Long@hp.com>
Tested-by: Jason Low <jason.low2@hp.com>
[forward ported]
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
---
Please note that in the original thread there was some debate
about spurious wakeups (https://lkml.org/lkml/2011/9/17/31), so
you can consider this more of an RFC patch if folks believe that
this functionality is incomplete/buggy.

 include/linux/sched.h | 41 +++++++++++++++++++++++++++++++++++++++++
 kernel/futex.c        | 31 +++++++++++++++----------------
 kernel/sched/core.c   | 19 +++++++++++++++++++
 3 files changed, 75 insertions(+), 16 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7e35d4b..679aabb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -793,6 +793,20 @@ enum cpu_idle_type {
 
 extern int __weak arch_sd_sibiling_asym_packing(void);
 
+struct wake_list_head {
+	struct wake_list_node *first;
+};
+
+struct wake_list_node {
+	struct wake_list_node *next;
+};
+
+#define WAKE_LIST_TAIL ((struct wake_list_node *) 0x01)
+
+#define WAKE_LIST(name)					\
+	struct wake_list_head name = { WAKE_LIST_TAIL }
+
+
 struct sched_domain_attr {
 	int relax_domain_level;
 };
@@ -1078,6 +1092,8 @@ struct task_struct {
 	unsigned int btrace_seq;
 #endif
 
+	struct wake_list_node wake_list;
+
 	unsigned int policy;
 	int nr_cpus_allowed;
 	cpumask_t cpus_allowed;
@@ -2044,6 +2060,31 @@ extern void wake_up_new_task(struct task_struct *tsk);
 extern void sched_fork(unsigned long clone_flags, struct task_struct *p);
 extern void sched_dead(struct task_struct *p);
 
+static inline void
+wake_list_add(struct wake_list_head *head, struct task_struct *p)
+{
+	struct wake_list_node *n = &p->wake_list;
+
+	/*
+	 * Atomically grab the task, if ->wake_list is !0 already it means
+	 * its already queued (either by us or someone else) and will get the
+	 * wakeup due to that.
+	 *
+	 * This cmpxchg() implies a full barrier, which pairs with the write
+	 * barrier implied by the wakeup in wake_up_list().
+	 */
+	if (cmpxchg(&n->next, NULL, head->first))
+		return;
+
+	/*
+	 * The head is context local, there can be no concurrency.
+	 */
+	get_task_struct(p);
+	head->first = n;
+}
+
+extern void wake_up_list(struct wake_list_head *head, unsigned int state);
+
 extern void proc_caches_init(void);
 extern void flush_signals(struct task_struct *);
 extern void __flush_signals(struct task_struct *);
diff --git a/kernel/futex.c b/kernel/futex.c
index 2f9dd5d..3d60a3d 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -851,20 +851,17 @@ static void __unqueue_futex(struct futex_q *q)
  * The hash bucket lock must be held when this is called.
  * Afterwards, the futex_q must not be accessed.
  */
-static void wake_futex(struct futex_q *q)
+static void wake_futex(struct wake_list_head *wake_list, struct futex_q *q)
 {
 	struct task_struct *p = q->task;
 
 	/*
-	 * We set q->lock_ptr = NULL _before_ we wake up the task. If
-	 * a non-futex wake up happens on another CPU then the task
-	 * might exit and p would dereference a non-existing task
-	 * struct. Prevent this by holding a reference on p across the
-	 * wake up.
+	 * Queue up and delay the futex wakeups until the hb lock
+	 * has been released.
 	 */
-	get_task_struct(p);
-
+	wake_list_add(wake_list, p);
 	__unqueue_futex(q);
+
 	/*
 	 * The waiting task can free the futex_q as soon as
 	 * q->lock_ptr = NULL is written, without taking any locks. A
@@ -873,9 +870,6 @@ static void wake_futex(struct futex_q *q)
 	 */
 	smp_wmb();
 	q->lock_ptr = NULL;
-
-	wake_up_state(p, TASK_NORMAL);
-	put_task_struct(p);
 }
 
 static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this)
@@ -992,6 +986,7 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 	struct futex_q *this, *next;
 	union futex_key key = FUTEX_KEY_INIT;
 	int ret;
+	WAKE_LIST(wake_list);
 
 	if (!bitset)
 		return -EINVAL;
@@ -1018,7 +1013,7 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 			if (!(this->bitset & bitset))
 				continue;
 
-			wake_futex(this);
+			wake_futex(&wake_list, this);
 			if (++ret >= nr_wake)
 				break;
 		}
@@ -1027,6 +1022,7 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 	spin_unlock(&hb->lock);
 out_put_key:
 	put_futex_key(&key);
+	wake_up_list(&wake_list, TASK_NORMAL);
 out:
 	return ret;
 }
@@ -1043,7 +1039,7 @@ futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
 	struct futex_hash_bucket *hb1, *hb2;
 	struct futex_q *this, *next;
 	int ret, op_ret;
-
+	WAKE_LIST(wake_list);
 retry:
 	ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1, VERIFY_READ);
 	if (unlikely(ret != 0))
@@ -1094,7 +1090,7 @@ retry_private:
 				ret = -EINVAL;
 				goto out_unlock;
 			}
-			wake_futex(this);
+			wake_futex(&wake_list, this);
 			if (++ret >= nr_wake)
 				break;
 		}
@@ -1108,7 +1104,7 @@ retry_private:
 					ret = -EINVAL;
 					goto out_unlock;
 				}
-				wake_futex(this);
+				wake_futex(&wake_list, this);
 				if (++op_ret >= nr_wake2)
 					break;
 			}
@@ -1122,6 +1118,7 @@ out_put_keys:
 	put_futex_key(&key2);
 out_put_key1:
 	put_futex_key(&key1);
+	wake_up_list(&wake_list, TASK_NORMAL);
 out:
 	return ret;
 }
@@ -1276,6 +1273,7 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags,
 	struct futex_hash_bucket *hb1, *hb2;
 	struct futex_q *this, *next;
 	u32 curval2;
+	WAKE_LIST(wake_list);
 
 	if (requeue_pi) {
 		/*
@@ -1423,7 +1421,7 @@ retry_private:
 		 * woken by futex_unlock_pi().
 		 */
 		if (++task_count <= nr_wake && !requeue_pi) {
-			wake_futex(this);
+			wake_futex(&wake_list, this);
 			continue;
 		}
 
@@ -1476,6 +1474,7 @@ out_put_keys:
 	put_futex_key(&key2);
 out_put_key1:
 	put_futex_key(&key1);
+	wake_up_list(&wake_list, TASK_NORMAL);
 out:
 	if (pi_state != NULL)
 		free_pi_state(pi_state);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c180860..1e1b1ad 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1695,6 +1695,25 @@ int wake_up_state(struct task_struct *p, unsigned int state)
 	return try_to_wake_up(p, state, 0);
 }
 
+void wake_up_list(struct wake_list_head *head, unsigned int state)
+{
+	struct wake_list_node *n = head->first;
+	struct task_struct *p;
+
+	while (n != WAKE_LIST_TAIL) {
+		p = container_of(n, struct task_struct, wake_list);
+		n = n->next;
+
+		p->wake_list.next = NULL;
+		/*
+		 * wake_up_state() implies a wmb() to pair with the queueing
+		 * in wake_list_add() so as not to miss wakeups.
+		 */
+		wake_up_state(p, state);
+		put_task_struct(p);
+	}
+}
+
 /*
  * Perform scheduler related setup for a newly forked process p.
  * p is forked by current.
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup
  2013-11-23  0:56 ` [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup Davidlohr Bueso
@ 2013-11-23  1:25   ` Linus Torvalds
  2013-11-23  3:03     ` Jason Low
                       ` (2 more replies)
  2013-11-23  5:40   ` Darren Hart
  2013-11-23  7:20   ` Darren Hart
  2 siblings, 3 replies; 39+ messages in thread
From: Linus Torvalds @ 2013-11-23  1:25 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Linux Kernel Mailing List, Ingo Molnar, Darren Hart,
	Peter Zijlstra, Thomas Gleixner, Mike Galbraith, jeffm, Norton,
	Scott J, tom.vaden, Chandramouleeswaran, Aswin, Waiman Long,
	Jason Low

On Fri, Nov 22, 2013 at 4:56 PM, Davidlohr Bueso <davidlohr@hp.com> wrote:
> In futex_wake() there is clearly no point in taking the hb->lock if
> we know beforehand that there are no tasks to be woken. This comes
> at the smaller cost of doing some atomic operations to keep track of
> the list's size.

Hmm. Why? Afaik, you only care about "empty or not". And if you don't
need the serialization from locking, then afaik you can just do a
"plist_head_empty()" without holding the lock.

NOTE!

The "list_empty()" function is very much designed to work even without
holding a lock (as long as the head itself exists reliably, of course)
BUT you have to then guarantee yourself that your algorithm doesn't
have any races wrt other CPU's adding an entry to the list at the same
time. Not holding a lock obviously means that you are not serialized
against that.. We've had problems with people doing

    if (!list_empty(waiters))
        wake_up_list(..)

because they wouldn't wake people up who just got added.

But considering that your atomic counter checking has the same lack of
serialization, at least the plist_head_empty() check shouldn't be any
worse than that counter thing.. And doesn't need any steenking atomic
ops or a new counter field.

            Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup
  2013-11-23  1:25   ` Linus Torvalds
@ 2013-11-23  3:03     ` Jason Low
  2013-11-23  3:19     ` Davidlohr Bueso
  2013-11-23  4:05     ` Waiman Long
  2 siblings, 0 replies; 39+ messages in thread
From: Jason Low @ 2013-11-23  3:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Davidlohr Bueso, Linux Kernel Mailing List, Ingo Molnar,
	Darren Hart, Peter Zijlstra, Thomas Gleixner, Mike Galbraith,
	jeffm, Norton, Scott J, tom.vaden, Chandramouleeswaran, Aswin,
	Waiman Long, Jason Low

On Fri, Nov 22, 2013 at 5:25 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Nov 22, 2013 at 4:56 PM, Davidlohr Bueso <davidlohr@hp.com> wrote:
>> In futex_wake() there is clearly no point in taking the hb->lock if
>> we know beforehand that there are no tasks to be woken. This comes
>> at the smaller cost of doing some atomic operations to keep track of
>> the list's size.
>
> Hmm. Why? Afaik, you only care about "empty or not". And if you don't
> need the serialization from locking, then afaik you can just do a
> "plist_head_empty()" without holding the lock.
>
> NOTE!
>
> The "list_empty()" function is very much designed to work even without
> holding a lock (as long as the head itself exists reliably, of course)
> BUT you have to then guarantee yourself that your algorithm doesn't
> have any races wrt other CPU's adding an entry to the list at the same
> time. Not holding a lock obviously means that you are not serialized
> against that.. We've had problems with people doing
>
>     if (!list_empty(waiters))
>         wake_up_list(..)
>
> because they wouldn't wake people up who just got added.
>
> But considering that your atomic counter checking has the same lack of
> serialization, at least the plist_head_empty() check shouldn't be any
> worse than that counter thing.. And doesn't need any steenking atomic
> ops or a new counter field.

Hi Linus,

In this patch, since we do the atomic increments before taking the
hb->lock in situations where we may potentially add a task to the
list, we would only skip attempting the wakeup if the list is empty
and no other thread already holds the hb->lock while in the process of
adding a task to the list. Would this be enough to protect it from any
race condition?

Thanks,
Jason

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup
  2013-11-23  1:25   ` Linus Torvalds
  2013-11-23  3:03     ` Jason Low
@ 2013-11-23  3:19     ` Davidlohr Bueso
  2013-11-23  7:23       ` Darren Hart
  2013-11-23 13:16       ` Thomas Gleixner
  2013-11-23  4:05     ` Waiman Long
  2 siblings, 2 replies; 39+ messages in thread
From: Davidlohr Bueso @ 2013-11-23  3:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Ingo Molnar, Darren Hart,
	Peter Zijlstra, Thomas Gleixner, Mike Galbraith, jeffm, Norton,
	Scott J, tom.vaden, Chandramouleeswaran, Aswin, Waiman Long,
	Jason Low

On Fri, 2013-11-22 at 17:25 -0800, Linus Torvalds wrote:
> On Fri, Nov 22, 2013 at 4:56 PM, Davidlohr Bueso <davidlohr@hp.com> wrote:
> > In futex_wake() there is clearly no point in taking the hb->lock if
> > we know beforehand that there are no tasks to be woken. This comes
> > at the smaller cost of doing some atomic operations to keep track of
> > the list's size.
> 
> Hmm. Why? Afaik, you only care about "empty or not". And if you don't
> need the serialization from locking, then afaik you can just do a
> "plist_head_empty()" without holding the lock.

I remember this being the original approach, but after noticing some
strange behavior we quickly decided it wasn't the right path. And sure
enough, I just double-checked and tried the patch without atomic ops and
can see things being off: one of the futextest performance cases is stuck
blocked on a futex and I couldn't reboot the machine either -- nothing
apparent in dmesg, just not 100% functional. The thing is, we can only
avoid taking the lock if nobody else is trying to add itself to the
list.

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup
  2013-11-23  1:25   ` Linus Torvalds
  2013-11-23  3:03     ` Jason Low
  2013-11-23  3:19     ` Davidlohr Bueso
@ 2013-11-23  4:05     ` Waiman Long
  2 siblings, 0 replies; 39+ messages in thread
From: Waiman Long @ 2013-11-23  4:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Davidlohr Bueso, Linux Kernel Mailing List, Ingo Molnar,
	Darren Hart, Peter Zijlstra, Thomas Gleixner, Mike Galbraith,
	jeffm, Norton, Scott J, tom.vaden, Chandramouleeswaran, Aswin,
	Jason Low

On 11/22/2013 08:25 PM, Linus Torvalds wrote:
> On Fri, Nov 22, 2013 at 4:56 PM, Davidlohr Bueso<davidlohr@hp.com>  wrote:
>> In futex_wake() there is clearly no point in taking the hb->lock if
>> we know beforehand that there are no tasks to be woken. This comes
>> at the smaller cost of doing some atomic operations to keep track of
>> the list's size.
> Hmm. Why? Afaik, you only care about "empty or not". And if you don't
> need the serialization from locking, then afaik you can just do a
> "plist_head_empty()" without holding the lock.
>
> NOTE!
>
> The "list_empty()" function is very much designed to work even without
> holding a lock (as long as the head itself exists reliably, of course)
> BUT you have to then guarantee yourself that your algorithm doesn't
> have any races wrt other CPU's adding an entry to the list at the same
> time. Not holding a lock obviously means that you are not serialized
> against that.. We've had problems with people doing
>
>      if (!list_empty(waiters))
>          wake_up_list(..)
>
> because they wouldn't wake people up who just got added.
>
> But considering that your atomic counter checking has the same lack of
> serialization, at least the plist_head_empty() check shouldn't be any
> worse than that counter thing.. And doesn't need any steenking atomic
> ops or a new counter field.
>
>              Linus

Before the patch, a waker will wake up one or more waiters who called
spin_lock() before the waker. If we just use plist_head_empty(), we will
miss waiters who have called spin_lock() but have not yet gotten the
lock or inserted themselves into the wait list. By adding an atomic
counter for checking purposes, we again have a single point where any
waiter who passes that point (incrementing the atomic counter) can be
woken up by a waker that checks the atomic counter after the increment.
This acts sort of like serialization without using a lock.

-Longman

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup
  2013-11-23  0:56 ` [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup Davidlohr Bueso
  2013-11-23  1:25   ` Linus Torvalds
@ 2013-11-23  5:40   ` Darren Hart
  2013-11-23  5:42     ` Hart, Darren
  2013-11-23  7:20   ` Darren Hart
  2 siblings, 1 reply; 39+ messages in thread
From: Darren Hart @ 2013-11-23  5:40 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-kernel, mingo, peterz, tglx, efault, jeffm, torvalds,
	scott.norton, tom.vaden, aswin, Waiman.Long, jason.low2

On Fri, 2013-11-22 at 16:56 -0800, Davidlohr Bueso wrote:
> In futex_wake() there is clearly no point in taking the hb->lock if
> we know beforehand that there are no tasks to be woken. This comes
> at the smaller cost of doing some atomic operations to keep track of
> the list's size. Specifically, increment the counter when an element is
> added to the list, and decrement when it is removed. Of course, if the
> counter is 0, then there are no tasks blocked on a futex. Some special
> considerations:
> 
> - increment the counter at queue_lock() as we always end up calling
>   queue_me() which adds the element to the list. Upon any error,
>   queue_unlock() is called for housekeeping, for which we decrement
>   to match the increment done in queue_lock().
> 
> - decrement the counter at __unqueue_futex() to reflect when an element is
>   removed from the queue for wakeup related purposes.

What is the problem you are trying to solve here?

The fast-path in futexes is not calling into the kernel at all; if
you've made the syscall, you've already lost. What kind of improvement
are you seeing by adding this optimization? Is it worth the introduction
of atomic operations into a file known to the state of California to
increase the risk of liver failure?

-- 
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup
  2013-11-23  5:40   ` Darren Hart
@ 2013-11-23  5:42     ` Hart, Darren
  0 siblings, 0 replies; 39+ messages in thread
From: Hart, Darren @ 2013-11-23  5:42 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-kernel, mingo, peterz, tglx, efault, jeffm, torvalds,
	scott.norton, tom.vaden, aswin, Waiman.Long, jason.low2

On Fri, 2013-11-22 at 21:40 -0800, Darren Hart wrote:
> On Fri, 2013-11-22 at 16:56 -0800, Davidlohr Bueso wrote:
> > In futex_wake() there is clearly no point in taking the hb->lock if
> > we know beforehand that there are no tasks to be woken. This comes
> > at the smaller cost of doing some atomic operations to keep track of
> > the list's size. Specifically, increment the counter when an element is
> > added to the list, and decrement when it is removed. Of course, if the
> > counter is 0, then there are no tasks blocked on a futex. Some special
> > considerations:
> > 
> > - increment the counter at queue_lock() as we always end up calling
> >   queue_me() which adds the element to the list. Upon any error,
> >   queue_unlock() is called for housekeeping, for which we decrement
> >   to match the increment done in queue_lock().
> > 
> > - decrement the counter at __unqueue_futex() to reflect when an element is
> >   removed from the queue for wakeup related purposes.
> 
> What is the problem you are trying to solve here?

Apologies, too quick on the trigger. I see plenty of detail in 0/5. Will
spend some time reviewing that.

-- 
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/5] futex: Wakeup optimizations
  2013-11-23  0:56 [PATCH 0/5] futex: Wakeup optimizations Davidlohr Bueso
                   ` (4 preceding siblings ...)
  2013-11-23  0:56 ` [PATCH 5/5] sched,futex: Provide delayed wakeup list Davidlohr Bueso
@ 2013-11-23  5:55 ` Darren Hart
  2013-11-23  6:35   ` Mike Galbraith
  2013-11-23  6:38   ` Davidlohr Bueso
  5 siblings, 2 replies; 39+ messages in thread
From: Darren Hart @ 2013-11-23  5:55 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-kernel, mingo, peterz, tglx, efault, jeffm, torvalds,
	scott.norton, tom.vaden, aswin, Waiman.Long, jason.low2

On Fri, 2013-11-22 at 16:56 -0800, Davidlohr Bueso wrote:
> We have been dealing with a customer database workload on a large
> 12TB, 240-core, 16-socket NUMA system that exhibits high amounts
> of contention on some of the locks that serialize internal futex
> data structures. This workload especially suffers in the wakeup
> paths, where waiting on the corresponding hb->lock can account for
> up to ~60% of the time. The result of such calls can mostly be
> classified as (i) nothing to wake up and (ii) waking up a large
> number of tasks.

With as many cores as you have, have you done any analysis of how
effective the hashing algorithm is, and would more buckets relieve some
of the contention.... ah, I see below that you did. Nice work.

> Before these patches are applied, we can see this pathological behavior:
> 
>  37.12%  826174  xxx  [kernel.kallsyms] [k] _raw_spin_lock
>             --- _raw_spin_lock
>              |
>              |--97.14%-- futex_wake
>              |          do_futex
>              |          sys_futex
>              |          system_call_fastpath
>              |          |
>              |          |--99.70%-- 0x7f383fbdea1f
>              |          |           yyy
> 
>  43.71%  762296  xxx  [kernel.kallsyms] [k] _raw_spin_lock
>             --- _raw_spin_lock
>              |
>              |--53.74%-- futex_wake
>              |          do_futex
>              |          sys_futex
>              |          system_call_fastpath
>              |          |
>              |          |--99.40%-- 0x7fe7d44a4c05
>              |          |           zzz
>              |--45.90%-- futex_wait_setup
>              |          futex_wait
>              |          do_futex
>              |          sys_futex
>              |          system_call_fastpath
>              |          0x7fe7ba315789
>              |          syscall
> 

Sorry to be dense, can you spell out how 60% falls out of these numbers?

> 
> With these patches, contention is practically non existent:
> 
>  0.10%     49   xxx  [kernel.kallsyms]   [k] _raw_spin_lock
>                --- _raw_spin_lock
>                 |
>                 |--76.06%-- futex_wait_setup
>                 |          futex_wait
>                 |          do_futex
>                 |          sys_futex
>                 |          system_call_fastpath
>                 |          |
>                 |          |--99.90%-- 0x7f3165e63789
>                 |          |          syscall|
>                            ...
>                 |--6.27%-- futex_wake
>                 |          do_futex
>                 |          sys_futex
>                 |          system_call_fastpath
>                 |          |
>                 |          |--54.56%-- 0x7f317fff2c05
>                 ...
> 
> Patches 1 & 2 are cleanups and micro optimizations.
> 
> Patch 3 addresses the well-known issue of the global hash table.
> By creating a larger and NUMA-aware table, we can reduce the false
> sharing and collisions, thus reducing the chance of different futexes
> contending on the same hb->lock.
> 
> Patch 4 reduces contention on the corresponding hb->lock by not trying to
> acquire it if there are no blocked tasks in the waitqueue.
> This particularly deals with point (i) above, where we see that it is not
> uncommon for up to 90% of wakeup calls to end up returning 0, indicating that no
> tasks were woken.

Can you determine how much benefit comes from 3 and how much additional
benefit comes from 4?

> 
> Patch 5 resurrects a two-year-old idea from Peter Zijlstra to delay
> the waking of blocked tasks so that it happens without holding the hb->lock:
> https://lkml.org/lkml/2011/9/14/118
> 
> This is useful for locking primitives that can effect multiple wakeups
> per operation and want to avoid the futex's internal spinlock contention by
> delaying the wakeups until we've released the hb->lock.
> This particularly deals with point (ii) above, where we can observe that
> on occasion the wake calls end up waking 125 to 200 waiters in what we believe
> are RW locks in the application.
> 
> This patchset has also been tested on smaller systems for a variety of
> benchmarks, including java workloads, kernel builds and custom bang-the-hell-out-of
> hb locks programs. So far, no functional or performance regressions have been seen.
> Furthermore, no issues were found when running the different tests in the futextest 
> suite: http://git.kernel.org/cgit/linux/kernel/git/dvhart/futextest.git/

Excellent. Would you be able to contribute any of these (C only please)
to the stress test group?

> 
> This patchset applies on top of Linus' tree as of v3.13-rc1.
> 
> Special thanks to Scott Norton, Tom Vaden and Mark Ray for help presenting,
> debugging and analyzing the data.
> 
>   futex: Misc cleanups
>   futex: Check for pi futex_q only once
>   futex: Larger hash table
>   futex: Avoid taking hb lock if nothing to wakeup
>   sched,futex: Provide delayed wakeup list
> 
>  include/linux/sched.h |  41 ++++++++++++++++++
>  kernel/futex.c        | 113 +++++++++++++++++++++++++++-----------------------
>  kernel/sched/core.c   |  19 +++++++++
>  3 files changed, 122 insertions(+), 51 deletions(-)
> 

-- 
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 2/5] futex: Check for pi futex_q only once
  2013-11-23  0:56 ` [PATCH 2/5] futex: Check for pi futex_q only once Davidlohr Bueso
@ 2013-11-23  6:33   ` Darren Hart
  2013-11-24  5:19     ` Davidlohr Bueso
  0 siblings, 1 reply; 39+ messages in thread
From: Darren Hart @ 2013-11-23  6:33 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-kernel, mingo, peterz, tglx, efault, jeffm, torvalds,
	scott.norton, tom.vaden, aswin, Waiman.Long, jason.low2

On Fri, 2013-11-22 at 16:56 -0800, Davidlohr Bueso wrote:
> All wake_futex() callers already verify that we are not dealing with
> a pi futex_q, so we can remove the redundant WARN() check, as this is never
> triggered anyway.
> 
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Darren Hart <dvhart@linux.intel.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Mike Galbraith <efault@gmx.de>
> Cc: Jeff Mahoney <jeffm@suse.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Scott Norton <scott.norton@hp.com>
> Cc: Tom Vaden <tom.vaden@hp.com>
> Cc: Aswin Chandramouleeswaran <aswin@hp.com>
> Cc: Waiman Long <Waiman.Long@hp.com>
> Cc: Jason Low <jason.low2@hp.com>
> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
> ---
>  kernel/futex.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/kernel/futex.c b/kernel/futex.c
> index e6ffe73..0768c68 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -844,9 +844,6 @@ static void wake_futex(struct futex_q *q)
>  {
>  	struct task_struct *p = q->task;
>  
> -	if (WARN(q->pi_state || q->rt_waiter, "refusing to wake PI futex\n"))
> -		return;
> -

This was added deliberately after adding said checks to the callers...
admittedly after a very long debug session I didn't ever want to repeat.
Sometimes warnings are added to make sure we caught everything and later
removed.... sometimes they are added to make sure nothing new ever
breaks this again. Since the failure scenario is non-obvious, unless
this is causing some significant performance issues for you, I'd prefer
this stays.

See commit aa10990e028cac3d5e255711fb9fb47e00700e35 for details.

-- 
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/5] futex: Wakeup optimizations
  2013-11-23  5:55 ` [PATCH 0/5] futex: Wakeup optimizations Darren Hart
@ 2013-11-23  6:35   ` Mike Galbraith
  2013-11-23  6:38   ` Davidlohr Bueso
  1 sibling, 0 replies; 39+ messages in thread
From: Mike Galbraith @ 2013-11-23  6:35 UTC (permalink / raw)
  To: Darren Hart
  Cc: Davidlohr Bueso, linux-kernel, mingo, peterz, tglx, jeffm,
	torvalds, scott.norton, tom.vaden, aswin, Waiman.Long,
	jason.low2

On Fri, 2013-11-22 at 21:55 -0800, Darren Hart wrote: 
> On Fri, 2013-11-22 at 16:56 -0800, Davidlohr Bueso wrote:

> > This patchset has also been tested on smaller systems for a variety of
> > benchmarks, including java workloads, kernel builds and custom bang-the-hell-out-of
> > hb locks programs. So far, no functional or performance regressions have been seen.
> > Furthermore, no issues were found when running the different tests in the futextest 
> > suite: http://git.kernel.org/cgit/linux/kernel/git/dvhart/futextest.git/
> 
> Excellent. Would you be able to contribute any of these (C only please)
> to the stress test group?

FWIW, I plugged this series into an rt kernel (extra raciness) and beat
it up a bit on a 64 core box too.  Nothing fell out, nor did futextest
numbers change outside variance (poor box has 8 whole gig ram, single
numa node, so kinda crippled/wimpy, and not good box for benchmarking).

What concerned me most about the series was 5/5.. looks like a great
idea to me, but the original thread did not have a happy ending.

-Mike


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/5] futex: Wakeup optimizations
  2013-11-23  5:55 ` [PATCH 0/5] futex: Wakeup optimizations Darren Hart
  2013-11-23  6:35   ` Mike Galbraith
@ 2013-11-23  6:38   ` Davidlohr Bueso
  1 sibling, 0 replies; 39+ messages in thread
From: Davidlohr Bueso @ 2013-11-23  6:38 UTC (permalink / raw)
  To: Darren Hart
  Cc: linux-kernel, mingo, peterz, tglx, efault, jeffm, torvalds,
	scott.norton, tom.vaden, aswin, Waiman.Long, jason.low2

Hi Darren,

On Fri, 2013-11-22 at 21:55 -0800, Darren Hart wrote:
> On Fri, 2013-11-22 at 16:56 -0800, Davidlohr Bueso wrote:
> > We have been dealing with a customer database workload on a large
> > 12TB, 240-core, 16-socket NUMA system that exhibits high amounts
> > of contention on some of the locks that serialize internal futex
> > data structures. This workload especially suffers in the wakeup
> > paths, where waiting on the corresponding hb->lock can account for
> > up to ~60% of the time. The result of such calls can mostly be
> > classified as (i) nothing to wake up and (ii) waking up a large
> > number of tasks.
> 
> With as many cores as you have, have you done any analysis of how
> effective the hashing algorithm is, and would more buckets relieve some
> of the contention.... ah, I see below that you did. Nice work.
> 
> > Before these patches are applied, we can see this pathological behavior:
> > 
> >  37.12%  826174  xxx  [kernel.kallsyms] [k] _raw_spin_lock
> >             --- _raw_spin_lock
> >              |
> >              |--97.14%-- futex_wake
> >              |          do_futex
> >              |          sys_futex
> >              |          system_call_fastpath
> >              |          |
> >              |          |--99.70%-- 0x7f383fbdea1f
> >              |          |           yyy
> > 
> >  43.71%  762296  xxx  [kernel.kallsyms] [k] _raw_spin_lock
> >             --- _raw_spin_lock
> >              |
> >              |--53.74%-- futex_wake
> >              |          do_futex
> >              |          sys_futex
> >              |          system_call_fastpath
> >              |          |
> >              |          |--99.40%-- 0x7fe7d44a4c05
> >              |          |           zzz
> >              |--45.90%-- futex_wait_setup
> >              |          futex_wait
> >              |          do_futex
> >              |          sys_futex
> >              |          system_call_fastpath
> >              |          0x7fe7ba315789
> >              |          syscall
> > 
> 
> Sorry to be dense, can you spell out how 60% falls out of these numbers?

By adding the respective percentages of futex_wake()*_raw_spin_lock
calls.
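
(Presumably: 37.12% * 97.14% + 43.71% * 53.74% ~= 36.1% + 23.5% ~= 59.6%,
which is where the ~60% figure in the cover letter comes from.)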

> 
> > 
> > With these patches, contention is practically non existent:
> > 
> >  0.10%     49   xxx  [kernel.kallsyms]   [k] _raw_spin_lock
> >                --- _raw_spin_lock
> >                 |
> >                 |--76.06%-- futex_wait_setup
> >                 |          futex_wait
> >                 |          do_futex
> >                 |          sys_futex
> >                 |          system_call_fastpath
> >                 |          |
> >                 |          |--99.90%-- 0x7f3165e63789
> >                 |          |          syscall|
> >                            ...
> >                 |--6.27%-- futex_wake
> >                 |          do_futex
> >                 |          sys_futex
> >                 |          system_call_fastpath
> >                 |          |
> >                 |          |--54.56%-- 0x7f317fff2c05
> >                 ...
> > 
> > Patches 1 & 2 are cleanups and micro optimizations.
> > 
> > Patch 3 addresses the well known issue of the global hash table.
> > By creating a larger and NUMA aware table, we can reduce the false
> > sharing and collisions, thus reducing the chance of different futexes 
> > using hb->lock.
> > 
> > Patch 4 reduces contention on the corresponding hb->lock by not trying to
> > acquire it if there are no blocked tasks in the waitqueue.
> > This particularly deals with point (i) above, where we see that it is not
> > uncommon for up to 90% of wakeup calls end up returning 0, indicating that no
> > tasks were woken.
> 
> Can you determine how much benefit comes from 3 and how much additional
> benefit comes from 4?

While I don't have specific per-patch data, there are indications that
the workload mostly deals with a handful of futexes. So it's pretty safe
to assume that patch 4 is the one with the most benefit for _this_
particular workload.

> 
> > 
> > Patch 5 resurrects a two year old idea from Peter Zijlstra to delay
> > the waking of the blocked tasks to be done without holding the hb->lock:
> > https://lkml.org/lkml/2011/9/14/118
> > 
> > This is useful for locking primitives that can effect multiple wakeups
> > per operation and want to avoid the futex's internal spinlock contention by
> > delaying the wakeups until we've released the hb->lock.
> > This particularly deals with point (ii) above, where we can observe that
> > on occasion the wake calls end up waking 125 to 200 waiters in what we believe 
> > are RW locks in the application.
> > 
> > This patchset has also been tested on smaller systems for a variety of
> > benchmarks, including java workloads, kernel builds and custom bang-the-hell-out-of
> > hb locks programs. So far, no functional or performance regressions have been seen.
> > Furthermore, no issues were found when running the different tests in the futextest 
> > suite: http://git.kernel.org/cgit/linux/kernel/git/dvhart/futextest.git/
> 
> Excellent. Would you be able to contribute any of these (C only please)
> to the stress test group?
> 

Sure.

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/5] futex: Larger hash table
  2013-11-23  0:56 ` [PATCH 3/5] futex: Larger hash table Davidlohr Bueso
@ 2013-11-23  6:52   ` Darren Hart
  0 siblings, 0 replies; 39+ messages in thread
From: Darren Hart @ 2013-11-23  6:52 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-kernel, mingo, peterz, tglx, efault, jeffm, torvalds,
	scott.norton, tom.vaden, aswin, Waiman.Long, jason.low2

On Fri, 2013-11-22 at 16:56 -0800, Davidlohr Bueso wrote:
> Currently, the futex global hash table suffers from its fixed, smallish
> (for today's standards) size of 256 entries, as well as its lack of NUMA
> awareness. Large systems, using many futexes, can be prone to high amounts
> of collisions; where these futexes hash to the same bucket and lead to
> extra contention on the same hb->lock. Furthermore, cacheline bouncing is a
> reality when we have multiple hb->locks residing on the same cacheline and
> different futexes hash to adjacent buckets.
> 
> This patch keeps the current static size of 16 entries for small systems,
> or otherwise, 256 * ncpus (or larger as we need to round the number to a
> power of 2). Note that this number of CPUs accounts for all CPUs that can
> ever be available in the system, taking into consideration things like
> hotplugging. While we do impose extra overhead at bootup by making the hash
> table larger, this is a one time thing, and does not shadow the benefits
> of this patch.
> 
> Also, similar to other core kernel components (pid, dcache, tcp), by using
> alloc_large_system_hash() we benefit from its NUMA awareness and thus the
> table is distributed among the nodes instead of in a single one. We impose
> this function's minimum limit of 256 entries, so that in worst case scenarios
> or issues, we still end up using the current amount anyways.
> 
> For a custom microbenchmark that pounds on the uaddr hashing -- making the wait
> path fail at futex_wait_setup() returning -EWOULDBLOCK for large amounts of futexes,
> we can see the following benefits on a 80-core, 8-socket 1Tb server:
> 
> +---------+----------------------------------+------------------------------------------+----------+
> | threads | baseline (ops/sec) [insns/cycle] | large hash table (ops/sec) [insns/cycle] | increase |
> +---------+----------------------------------+------------------------------------------+----------+
> |     512 | 34429    [0.07]                  | 255274    [0.48]                         | +641.45% |
> |     256 | 65452    [0.07]                  | 443563    [0.41]                         | +577.69% |
> |     128 | 125111   [0.07]                  | 742613    [0.33]                         | +493.56% |
> |      80 | 203642   [0.09]                  | 1028147   [0.29]                         | +404.87% |
> |      64 | 262944   [0.09]                  | 997300    [0.28]                         | +279.28% |
> |      32 | 642390   [0.24]                  | 965996    [0.27]                         | +50.37%  |
> +---------+----------------------------------+------------------------------------------+----------+
> 
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Darren Hart <dvhart@linux.intel.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Mike Galbraith <efault@gmx.de>
> Cc: Jeff Mahoney <jeffm@suse.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Scott Norton <scott.norton@hp.com>
> Cc: Tom Vaden <tom.vaden@hp.com>
> Cc: Aswin Chandramouleeswaran <aswin@hp.com>
> Signed-off-by: Waiman Long <Waiman.Long@hp.com>
> Signed-off-by: Jason Low <jason.low2@hp.com>
> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
> ---
>  kernel/futex.c | 26 +++++++++++++++++++++-----
>  1 file changed, 21 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/futex.c b/kernel/futex.c
> index 0768c68..5fa9eb0 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -63,6 +63,7 @@
>  #include <linux/sched/rt.h>
>  #include <linux/hugetlb.h>
>  #include <linux/freezer.h>
> +#include <linux/bootmem.h>
>  
>  #include <asm/futex.h>
>  
> @@ -70,7 +71,11 @@
>  
>  int __read_mostly futex_cmpxchg_enabled;
>  
> -#define FUTEX_HASHBITS (CONFIG_BASE_SMALL ? 4 : 8)
> +#if CONFIG_BASE_SMALL
> +static unsigned long futex_hashsize = 16;
> +#else
> +static unsigned long futex_hashsize;
> +#endif
>  
>  /*
>   * Futex flags used to encode options to functions and preserve them across
> @@ -151,7 +156,11 @@ struct futex_hash_bucket {
>  	struct plist_head chain;
>  };
>  
> -static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS];
> +#if CONFIG_BASE_SMALL
> +static struct futex_hash_bucket futex_queues[futex_hashsize];
> +#else
> +static struct futex_hash_bucket *futex_queues;
> +#endif

Something in me squirms at the #if/#else here, but I'll leave that to
the tip maintainers to call out if they are so inclined. Same below.
Otherwise, looks reasonable to me.
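
For reference, the allocation itself happens at boot in futex_init(); that
hunk is not quoted above, but it is roughly of the following shape (a sketch
based on the changelog, not the verbatim patch, ignoring the CONFIG_BASE_SMALL
case and the cmpxchg feature detection):

static int __init futex_init(void)
{
	unsigned int futex_shift;
	unsigned long i;

	/* 256 entries per possible CPU, rounded up to a power of two. */
	futex_hashsize = roundup_pow_of_two(256 * num_possible_cpus());

	/* NUMA-aware allocation, spread across nodes like pid/dcache/tcp. */
	futex_queues = alloc_large_system_hash("futex", sizeof(*futex_queues),
					       futex_hashsize, 0,
					       futex_hashsize < 256 ?
							HASH_SMALL : 0,
					       &futex_shift, NULL,
					       futex_hashsize, futex_hashsize);
	/* The helper may round further; keep the size in sync with it. */
	futex_hashsize = 1UL << futex_shift;

	for (i = 0; i < futex_hashsize; i++) {
		plist_head_init(&futex_queues[i].chain);
		spin_lock_init(&futex_queues[i].lock);
	}
	return 0;
}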

Reviewed-by: Darren Hart <dvhart@linux.intel.com>

-- 
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/5] futex: Misc cleanups
  2013-11-23  0:56 ` [PATCH 1/5] futex: Misc cleanups Davidlohr Bueso
@ 2013-11-23  6:52   ` Darren Hart
  0 siblings, 0 replies; 39+ messages in thread
From: Darren Hart @ 2013-11-23  6:52 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-kernel, mingo, peterz, tglx, efault, jeffm, torvalds,
	scott.norton, tom.vaden, aswin, Waiman.Long, jason.low2

On Fri, 2013-11-22 at 16:56 -0800, Davidlohr Bueso wrote:
> From: Jason Low <jason.low2@hp.com>
> 
> - Remove unnecessary head variables.
> - Delete unused parameter in queue_unlock().
> 
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Darren Hart <dvhart@linux.intel.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Mike Galbraith <efault@gmx.de>
> Cc: Jeff Mahoney <jeffm@suse.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Scott Norton <scott.norton@hp.com>
> Cc: Tom Vaden <tom.vaden@hp.com>
> Cc: Aswin Chandramouleeswaran <aswin@hp.com>
> Cc: Waiman Long <Waiman.Long@hp.com>
> Signed-off-by: Jason Low <jason.low2@hp.com>
> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>

My concern was that we saved off head for a reason, but the hb lock is
held in these cases, and the call is safe against removal, so I don't
see a problem. Thanks.
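
(For readers without the patch in front of them, the cleanups are roughly of
this shape -- illustrative only, not the verbatim hunks:

	/* before: a local head variable that only aliases &hb->chain */
	struct plist_head *head;
	...
	head = &hb->chain;
	plist_for_each_entry_safe(this, next, head, list) {
		...
	}

	/* after: use &hb->chain directly */
	plist_for_each_entry_safe(this, next, &hb->chain, list) {
		...
	}

plus dropping the unused futex_q argument, i.e. queue_unlock(q, hb)
becomes queue_unlock(hb).)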

Reviewed-by: Darren Hart <dvhart@linux.intel.com>

-- 
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup
  2013-11-23  0:56 ` [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup Davidlohr Bueso
  2013-11-23  1:25   ` Linus Torvalds
  2013-11-23  5:40   ` Darren Hart
@ 2013-11-23  7:20   ` Darren Hart
  2 siblings, 0 replies; 39+ messages in thread
From: Darren Hart @ 2013-11-23  7:20 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-kernel, mingo, peterz, tglx, efault, jeffm, torvalds,
	scott.norton, tom.vaden, aswin, Waiman.Long, jason.low2

On Fri, 2013-11-22 at 16:56 -0800, Davidlohr Bueso wrote:
> In futex_wake() there is clearly no point in taking the hb->lock if
> we know beforehand that there are no tasks to be woken. This comes
> at the smaller cost of doing some atomic operations to keep track of
> the list's size. Specifically, increment the counter when an element is
> added to the list, and decrement when it is removed. Of course, if the
> counter is 0, then there are no tasks blocked on a futex. Some special
> considerations:
> 
> - increment the counter at queue_lock() as we always end up calling
>   queue_me() which adds the element to the list. Upon any error,
>   queue_unlock() is called for housekeeping, for which we decrement
>   to mach the increment done in queue_lock().

      ^match

> 
> - decrement the counter at __unqueue_me() to reflect when an element is
>   removed from the queue for wakeup related purposes.

__unqueue_futex (not __unqueue_me)


> @@ -999,6 +1001,10 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
>  		goto out;
>  
>  	hb = hash_futex(&key);
> +	/* make sure we really have tasks to wakeup */

Nit, but please use proper sentence formatting for consistency with the
rest of the comments in futex.c (most of them anyway).

	/* Make sure we really have tasks to wake up. */

Now... I'm not thrilled with adding atomics if we don't need to,
especially for an optimization, since the atomics themselves cause
enough problems across a large number of CPUs... I'll respond to
Linus's thread though.

-- 
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup
  2013-11-23  3:19     ` Davidlohr Bueso
@ 2013-11-23  7:23       ` Darren Hart
  2013-11-23 13:16       ` Thomas Gleixner
  1 sibling, 0 replies; 39+ messages in thread
From: Darren Hart @ 2013-11-23  7:23 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Linus Torvalds, Linux Kernel Mailing List, Ingo Molnar,
	Peter Zijlstra, Thomas Gleixner, Mike Galbraith, jeffm, Norton,
	Scott J, tom.vaden, Chandramouleeswaran, Aswin, Waiman Long,
	Jason Low

On Fri, 2013-11-22 at 19:19 -0800, Davidlohr Bueso wrote:
> On Fri, 2013-11-22 at 17:25 -0800, Linus Torvalds wrote:
> > On Fri, Nov 22, 2013 at 4:56 PM, Davidlohr Bueso <davidlohr@hp.com> wrote:
> > > In futex_wake() there is clearly no point in taking the hb->lock if
> > > we know beforehand that there are no tasks to be woken. This comes
> > > at the smaller cost of doing some atomic operations to keep track of
> > > the list's size.
> > 
> > Hmm. Why? Afaik, you only care about "empty or not". And if you don't
> > need the serialization from locking, then afaik you can just do a
> > "plist_head_empty()" without holding the lock.
> 
> I remember this being the original approach, but after noticing some
> strange behavior we quickly decided it wasn't the path. And sure enough,
> I just double checked and tried the patch without atomic ops and can see
> things being off: one of the futextest performance cases is stuck
> blocked on a futex and I couldn't reboot the machine either -- nothing
> apparent in dmesg, just not 100% functional. The thing is, we can only
> avoid taking the lock if nobody else is trying to add itself to the
> list.

In your usage, the worst case scenario is that you detect 0 when locking
may have blocked and found a waiter. Correct?

In this case, you return 0, instead of 1 (or more).

This suggests to me a bug in the futextest testcase. Which test
specifically hung up waiting?

Futex hangs are almost always bad userspace code (my bad userspace code
in this case ;-)

-- 
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/5] sched,futex: Provide delayed wakeup list
  2013-11-23  0:56 ` [PATCH 5/5] sched,futex: Provide delayed wakeup list Davidlohr Bueso
@ 2013-11-23 11:48   ` Peter Zijlstra
  2013-11-23 12:01     ` Peter Zijlstra
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2013-11-23 11:48 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-kernel, mingo, dvhart, tglx, efault, jeffm, torvalds,
	scott.norton, tom.vaden, aswin, Waiman.Long, jason.low2

On Fri, Nov 22, 2013 at 04:56:37PM -0800, Davidlohr Bueso wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> Original patchset: https://lkml.org/lkml/2011/9/14/118
> 
> This is useful for locking primitives that can effect multiple
> wakeups per operation and want to avoid lock internal lock contention
> by delaying the wakeups until we've released the lock internal locks.
> 
> Alternatively it can be used to avoid issuing multiple wakeups, and
> thus save a few cycles, in packet processing. Queue all target tasks
> and wakeup once you've processed all packets. That way you avoid
> waking the target task multiple times if there were multiple packets
> for the same task.
> 
> This patch adds the needed infrastructure into the scheduler code
> and uses the new wake_list to delay the futex wakeups until
> after we've released the hash bucket locks. This prevents the newly
> woken tasks from immediately getting stuck on the hb->lock.
> 
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Darren Hart <dvhart@linux.intel.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Mike Galbraith <efault@gmx.de>
> Cc: Jeff Mahoney <jeffm@suse.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Scott Norton <scott.norton@hp.com>
> Cc: Tom Vaden <tom.vaden@hp.com>
> Cc: Aswin Chandramouleeswaran <aswin@hp.com>
> Cc: Waiman Long <Waiman.Long@hp.com>
> Tested-by: Jason Low <jason.low2@hp.com>
> [forward ported]
> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
> ---
> Please note that in the original thread there was some debate
> about spurious wakeups (https://lkml.org/lkml/2011/9/17/31), so
> you can consider this more of an RFC patch if folks believe that
> this functionality is incomplete/buggy.

Right, from what I remember, this patch can cause spurious wakeups, and
while all our regular sleeping lock / wait thingies can deal with this,
not all creative schedule() usage in the tree can deal with this.

There's about ~1400 (or there were that many 2 years ago, might be more
by now) schedule() calls, many of which are open coded wait constructs
of which most are buggy in one way or another.

So we first need to audit / fix all those before we can do this one.
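
(For reference, the idiom those call sites need to follow -- a generic
sketch, with "condition" standing in for whatever is being waited on:

	for (;;) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (condition)
			break;
		schedule();	/* may return spuriously; the loop re-checks */
	}
	__set_current_state(TASK_RUNNING);

anything that assumes a single schedule() call means the wakeup condition
now holds will break.)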

I used to have a patch to schedule() that would always immediately fall
through and only actually block on the second call; it illustrated the
problem really well, in fact so well that the kernel fails to boot most
of the time.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/5] sched,futex: Provide delayed wakeup list
  2013-11-23 11:48   ` Peter Zijlstra
@ 2013-11-23 12:01     ` Peter Zijlstra
  2013-11-24  5:25       ` Davidlohr Bueso
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2013-11-23 12:01 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-kernel, mingo, dvhart, tglx, efault, jeffm, torvalds,
	scott.norton, tom.vaden, aswin, Waiman.Long, jason.low2

> I used to have a patch to schedule() that would always immediately fall
> through and only actually block on the second call; it illustrated the
> problem really well, in fact so well the kernels fails to boot most
> times.

I found the below on my filesystem -- making it apply shouldn't be hard.
Making it work is the same effort as that patch you sent, we need to
guarantee all schedule() callers can deal with not actually sleeping --
aka. spurious wakeups.

I don't think anybody ever got that thing to run reliably enough to see
if the idea proposed in the patch made any difference to actual
workloads though.

---
Subject: 
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Thu Dec 09 17:51:09 CET 2010


Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-v17vshx6uasjguuwd67fe7tg@git.kernel.org
---
 include/linux/sched.h |    5 +++--
 kernel/sched/core.c   |   18 ++++++++++++++++++
 2 files changed, 21 insertions(+), 2 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -153,9 +153,10 @@ print_cfs_rq(struct seq_file *m, int cpu
 #define TASK_DEAD		64
 #define TASK_WAKEKILL		128
 #define TASK_WAKING		256
-#define TASK_STATE_MAX		512
+#define TASK_YIELD		512
+#define TASK_STATE_MAX		1024
 
-#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKW"
+#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKWY"
 
 extern char ___assert_task_state[1 - 2*!!(
 		sizeof(TASK_STATE_TO_CHAR_STR)-1 != ilog2(TASK_STATE_MAX)+1)];
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -931,6 +931,7 @@ void set_task_cpu(struct task_struct *p,
 	 * ttwu() will sort out the placement.
 	 */
 	WARN_ON_ONCE(p->state != TASK_RUNNING && p->state != TASK_WAKING &&
+			!(p->state & TASK_YIELD) &&
 			!(task_thread_info(p)->preempt_count & PREEMPT_ACTIVE));
 
 #ifdef CONFIG_LOCKDEP
@@ -2864,6 +2865,22 @@ static void __sched __schedule(void)
 		if (unlikely(signal_pending_state(prev->state, prev))) {
 			prev->state = TASK_RUNNING;
 		} else {
+			/*
+			 * Provide an auto-yield feature on schedule().
+			 *
+			 * The thought is to avoid a sleep+wakeup cycle
+			 * if simply yielding the cpu will suffice to
+			 * satisfy the required condition.
+			 *
+			 * Assumes the calling schedule() site can deal
+			 * with spurious wakeups.
+			 */
+			if (prev->state & TASK_YIELD) {
+				prev->state &= ~TASK_YIELD;
+			       	if (rq->nr_running > 1)
+					goto no_deactivate;
+			}
+
 			deactivate_task(rq, prev, DEQUEUE_SLEEP);
 			prev->on_rq = 0;
 
@@ -2880,6 +2897,7 @@ static void __sched __schedule(void)
 					try_to_wake_up_local(to_wakeup);
 			}
 		}
+no_deactivate:
 		switch_count = &prev->nvcsw;
 	}
 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup
  2013-11-23  3:19     ` Davidlohr Bueso
  2013-11-23  7:23       ` Darren Hart
@ 2013-11-23 13:16       ` Thomas Gleixner
  2013-11-24  3:46         ` Linus Torvalds
  2013-11-25 19:47         ` Thomas Gleixner
  1 sibling, 2 replies; 39+ messages in thread
From: Thomas Gleixner @ 2013-11-23 13:16 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Linus Torvalds, Linux Kernel Mailing List, Ingo Molnar,
	Darren Hart, Peter Zijlstra, Mike Galbraith, jeffm, Norton,
	Scott J, tom.vaden, Chandramouleeswaran, Aswin, Waiman Long,
	Jason Low, Paul E. McKenney

On Fri, 22 Nov 2013, Davidlohr Bueso wrote:
> On Fri, 2013-11-22 at 17:25 -0800, Linus Torvalds wrote:
> > On Fri, Nov 22, 2013 at 4:56 PM, Davidlohr Bueso <davidlohr@hp.com> wrote:
> > > In futex_wake() there is clearly no point in taking the hb->lock if
> > > we know beforehand that there are no tasks to be woken. This comes
> > > at the smaller cost of doing some atomic operations to keep track of
> > > the list's size.
> > 
> > Hmm. Why? Afaik, you only care about "empty or not". And if you don't
> > need the serialization from locking, then afaik you can just do a
> > "plist_head_empty()" without holding the lock.
> 
> I remember this being the original approach, but after noticing some
> strange behavior we quickly decided it wasn't the path. And sure enough,
> I just double checked and tried the patch without atomic ops and can see
> things being off: one of the futextest performance cases is stuck
> blocked on a futex and I couldn't reboot the machine either -- nothing
> apparent in dmesg, just not 100% functional. The thing is, we can only
> avoid taking the lock if nobody else is trying to add itself to the
> list.

Right. The reason is:

CPU 0					CPU 1

    val = *futex;
    futex_wait(futex, val);

    spin_lock(&hb->lock);

    uval = *futex;
					*futex = newval;

    if (uval != val) {			futex_wake();
       spin_unlock(&hb->lock);		if (plist_empty(hb))
       return;				   return;
    }

    plist_add(hb, self);
    spin_unlock(&hb->lock);
    schedule();
    ...

So the waiter on CPU0 gets queued after the user space value changed
and the wakee on CPU1 returns because the plist is empty. The waiter therefore
waits forever.

So with the atomic ops you are changing that to:

CPU 0					CPU 1

    val = *futex;
    futex_wait(futex, val);

    spin_lock(&hb->lock);

    atomic_inc(&hb->waiters);
    uval = *futex;
					*futex = newval;

    if (uval != val) {			futex_wake();
       spin_unlock(&hb->lock);		if (!atomic_read(&hb->waiters))
       return;				   return;
    }
   					spin_lock(&hb->lock);
    plist_add(hb, self);
    spin_unlock(&hb->lock);
    schedule();				wakeup_waiters(hb);
    ...

which restores the ordering guarantee, which the hash bucket lock
provided so far.

Now the question is why we queue the waiter _AFTER_ reading the user
space value. The comment in the code is pretty nonsensical:

   * On the other hand, we insert q and release the hash-bucket only
   * after testing *uaddr.  This guarantees that futex_wait() will NOT
   * absorb a wakeup if *uaddr does not match the desired values
   * while the syscall executes.

There is no reason why we cannot queue _BEFORE_ reading the user space
value. We just have to dequeue in all the error handling cases, but
for the fast path it does not matter at all.

CPU 0					CPU 1

    val = *futex;
    futex_wait(futex, val);

    spin_lock(&hb->lock);

    plist_add(hb, self);
    smp_wmb();

    uval = *futex;
					*futex = newval;
					futex_wake();
					
					smp_rmb();
					if (plist_empty(hb))
					   return;

   					spin_lock(&hb->lock);
    if (uval != val) {			
       plist_del(self);			
       spin_unlock(&hb->lock);
       return;		
    }

    spin_unlock(&hb->lock);
    schedule();				wakeup_waiters(hb);
    ...

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup
  2013-11-23 13:16       ` Thomas Gleixner
@ 2013-11-24  3:46         ` Linus Torvalds
  2013-11-24  5:15           ` Davidlohr Bueso
  2013-11-25 16:23           ` Thomas Gleixner
  2013-11-25 19:47         ` Thomas Gleixner
  1 sibling, 2 replies; 39+ messages in thread
From: Linus Torvalds @ 2013-11-24  3:46 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Davidlohr Bueso, Linux Kernel Mailing List, Ingo Molnar,
	Darren Hart, Peter Zijlstra, Mike Galbraith, Jeff Mahoney,
	Norton, Scott J, tom.vaden, Chandramouleeswaran, Aswin,
	Waiman Long, Jason Low, Paul E. McKenney

On Sat, Nov 23, 2013 at 5:16 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Now the question is why we queue the waiter _AFTER_ reading the user
> space value. The comment in the code is pretty non sensical:
>
>    * On the other hand, we insert q and release the hash-bucket only
>    * after testing *uaddr.  This guarantees that futex_wait() will NOT
>    * absorb a wakeup if *uaddr does not match the desired values
>    * while the syscall executes.
>
> There is no reason why we cannot queue _BEFORE_ reading the user space
> value. We just have to dequeue in all the error handling cases, but
> for the fast path it does not matter at all.
>
> CPU 0                                   CPU 1
>
>     val = *futex;
>     futex_wait(futex, val);
>
>     spin_lock(&hb->lock);
>
>     plist_add(hb, self);
>     smp_wmb();
>
>     uval = *futex;
>                                         *futex = newval;
>                                         futex_wake();
>
>                                         smp_rmb();
>                                         if (plist_empty(hb))
>                                            return;
> ...

This would seem to be a nicer approach indeed, without needing the
extra atomics.

Davidlohr, mind trying Thomas' approach?

             Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup
  2013-11-24  3:46         ` Linus Torvalds
@ 2013-11-24  5:15           ` Davidlohr Bueso
  2013-11-25 12:01             ` Thomas Gleixner
  2013-11-25 16:23           ` Thomas Gleixner
  1 sibling, 1 reply; 39+ messages in thread
From: Davidlohr Bueso @ 2013-11-24  5:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, Linux Kernel Mailing List, Ingo Molnar,
	Darren Hart, Peter Zijlstra, Mike Galbraith, Jeff Mahoney,
	Norton, Scott J, tom.vaden, Chandramouleeswaran, Aswin,
	Waiman Long, Jason Low, Paul E. McKenney

On Sat, 2013-11-23 at 19:46 -0800, Linus Torvalds wrote:
> On Sat, Nov 23, 2013 at 5:16 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > Now the question is why we queue the waiter _AFTER_ reading the user
> > space value. The comment in the code is pretty non sensical:
> >
> >    * On the other hand, we insert q and release the hash-bucket only
> >    * after testing *uaddr.  This guarantees that futex_wait() will NOT
> >    * absorb a wakeup if *uaddr does not match the desired values
> >    * while the syscall executes.
> >
> > There is no reason why we cannot queue _BEFORE_ reading the user space
> > value. We just have to dequeue in all the error handling cases, but
> > for the fast path it does not matter at all.
> >
> > CPU 0                                   CPU 1
> >
> >     val = *futex;
> >     futex_wait(futex, val);
> >
> >     spin_lock(&hb->lock);
> >
> >     plist_add(hb, self);
> >     smp_wmb();
> >
> >     uval = *futex;
> >                                         *futex = newval;
> >                                         futex_wake();
> >
> >                                         smp_rmb();
> >                                         if (plist_empty(hb))
> >                                            return;
> > ...
> 
> This would seem to be a nicer approach indeed, without needing the
> extra atomics.

Yep, I think we can all agree that doing this optimization without atomic
ops is a big plus.

> 
> Davidlohr, mind trying Thomas' approach?

I just took a quick look and it seems pretty straightforward, but not
without some details to consider. We basically have to redo/reorder
futex_wait_setup(), which checks that uval == val, and
futex_wait_queue_me(), which adds the task to the list and blocks. Now,
both futex_wait() and futex_wait_requeue_pi() have this logic, but since
we don't use futex_wake() to wakeup tasks on pi futex_qs, I believe it's
ok to only change futex_wait(), while the order of the uval checking
doesn't matter for futex_wait_requeue_pi() so it can stay as is.

The following is the general skeleton of what Thomas is probably
referring to (yes, it needs comments, cleanups, refactoring, etc, etc).
Futexes are easy to screw up but at least this passes a kernel build and
all futextests on a large box.

static int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
		      ktime_t *abs_time, u32 bitset)
{
	struct hrtimer_sleeper timeout, *to = NULL;
	struct restart_block *restart;
	struct futex_hash_bucket *hb;
	struct futex_q q = futex_q_init;
	int prio, ret = 0;
	u32 uval;

	if (!bitset)
		return -EINVAL;
	q.bitset = bitset;

	if (abs_time) {
		to = &timeout;

		hrtimer_init_on_stack(&to->timer, (flags & FLAGS_CLOCKRT) ?
				      CLOCK_REALTIME : CLOCK_MONOTONIC,
				      HRTIMER_MODE_ABS);
		hrtimer_init_sleeper(to, current);
		hrtimer_set_expires_range_ns(&to->timer, *abs_time,
					     current->timer_slack_ns);
	}
retry:	
	ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q.key, VERIFY_READ);
	if (unlikely(ret != 0))
		goto out;

retry_private:
	hb = queue_lock(&q);
	prio = min(current->normal_prio, MAX_RT_PRIO);

	plist_node_init(&q.list, prio);
	plist_add(&q.list, &hb->chain);
	q.task = current;
	/* do NOT drop the lock here */
	smp_wmb();

	ret = get_futex_value_locked(&uval, uaddr);
	if (ret) {
		plist_del(&q.list, &hb->chain);
		spin_unlock(&hb->lock);
		
		ret = get_user(uval, uaddr);
		if (ret)
			goto out_put_key;

		if (!(flags & FLAGS_SHARED))
			goto retry_private;

		put_futex_key(&q.key);
		goto retry;

	}

	if (uval != val) {
		plist_del(&q.list, &hb->chain);
		spin_unlock(&hb->lock);
		ret = -EWOULDBLOCK;
		goto out_put_key;
	}

	set_current_state(TASK_INTERRUPTIBLE);
	spin_unlock(&hb->lock);

	/* Arm the timer */
	if (to) {
		hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
		if (!hrtimer_active(&to->timer))
			to->task = NULL;
	}

	/*
	 * If we have been removed from the hash list, then another task
	 * has tried to wake us, and we can skip the call to schedule().
	 */
	if (likely(!plist_node_empty(&q.list))) {
		/*
		 * If the timer has already expired, current will already be
		 * flagged for rescheduling. Only call schedule if there
		 * is no timeout, or if it has yet to expire.
		 */
		if (!to || to->task)
			freezable_schedule();
	}
	__set_current_state(TASK_RUNNING);


	/* If we were woken (and unqueued), we succeeded, whatever. */
	ret = 0;
	/* unqueue_me() drops q.key ref */
	if (!unqueue_me(&q))
		goto out;
	ret = -ETIMEDOUT;
	if (to && !to->task)
		goto out;

	/*
	 * We expect signal_pending(current), but we might be the
	 * victim of a spurious wakeup as well.
	 */
	if (!signal_pending(current))
		goto retry;

	ret = -ERESTARTSYS;
	if (!abs_time)
		goto out;

	restart = &current_thread_info()->restart_block;
	restart->fn = futex_wait_restart;
	restart->futex.uaddr = uaddr;
	restart->futex.val = val;
	restart->futex.time = abs_time->tv64;
	restart->futex.bitset = bitset;
	restart->futex.flags = flags | FLAGS_HAS_TIMEOUT;

	ret = -ERESTART_RESTARTBLOCK;

out_put_key:
	if (ret)
		put_futex_key(&q.key);
out:
	if (to) {
		hrtimer_cancel(&to->timer);
		destroy_hrtimer_on_stack(&to->timer);
	}
	return ret;
}

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 2/5] futex: Check for pi futex_q only once
  2013-11-23  6:33   ` Darren Hart
@ 2013-11-24  5:19     ` Davidlohr Bueso
  0 siblings, 0 replies; 39+ messages in thread
From: Davidlohr Bueso @ 2013-11-24  5:19 UTC (permalink / raw)
  To: Darren Hart
  Cc: linux-kernel, mingo, peterz, tglx, efault, jeffm, torvalds,
	scott.norton, tom.vaden, aswin, Waiman.Long, jason.low2

On Fri, 2013-11-22 at 22:33 -0800, Darren Hart wrote:
> On Fri, 2013-11-22 at 16:56 -0800, Davidlohr Bueso wrote:
> > All wake_futex() callers already verify that the we are not dealing with
> > a pi futex_q, so we can remove the redundant WARN() check, as this is never
> > triggered anyway.
> > 
> > Cc: Ingo Molnar <mingo@kernel.org>
> > Cc: Darren Hart <dvhart@linux.intel.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Mike Galbraith <efault@gmx.de>
> > Cc: Jeff Mahoney <jeffm@suse.com>
> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > Cc: Scott Norton <scott.norton@hp.com>
> > Cc: Tom Vaden <tom.vaden@hp.com>
> > Cc: Aswin Chandramouleeswaran <aswin@hp.com>
> > Cc: Waiman Long <Waiman.Long@hp.com>
> > Cc: Jason Low <jason.low2@hp.com>
> > Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
> > ---
> >  kernel/futex.c | 3 ---
> >  1 file changed, 3 deletions(-)
> > 
> > diff --git a/kernel/futex.c b/kernel/futex.c
> > index e6ffe73..0768c68 100644
> > --- a/kernel/futex.c
> > +++ b/kernel/futex.c
> > @@ -844,9 +844,6 @@ static void wake_futex(struct futex_q *q)
> >  {
> >  	struct task_struct *p = q->task;
> >  
> > -	if (WARN(q->pi_state || q->rt_waiter, "refusing to wake PI futex\n"))
> > -		return;
> > -
> 
> This was added deliberately after adding said checks to the callers...
> admittedly after a very long debug session I didn't ever want to repeat.
> Sometimes warnings are added to make sure we caught everything and later
> removed.... sometimes they are added to make sure nothing new ever
> breaks this again. Since the failure scenario is non-obvious, unless
> this is causing some significant performance issues for you, I'd prefer
> this stays.
> 
> See commit aa10990e028cac3d5e255711fb9fb47e00700e35 for details.

I think a strong comment in wake_futex() warning about pi futex_qs would
do nowadays, but fair enough, I was kind of expecting this reply.

Thanks,
Davidlohr



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/5] sched,futex: Provide delayed wakeup list
  2013-11-23 12:01     ` Peter Zijlstra
@ 2013-11-24  5:25       ` Davidlohr Bueso
  0 siblings, 0 replies; 39+ messages in thread
From: Davidlohr Bueso @ 2013-11-24  5:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, dvhart, tglx, efault, jeffm, torvalds,
	scott.norton, tom.vaden, aswin, Waiman.Long, jason.low2

On Sat, 2013-11-23 at 13:01 +0100, Peter Zijlstra wrote:
> > I used to have a patch to schedule() that would always immediately fall
> > through and only actually block on the second call; it illustrated the
> > problem really well, in fact so well the kernels fails to boot most
> > times.
> 
> I found the below on my filesystem -- making it apply shouldn't be hard.
> Making it work is the same effort as that patch you sent, we need to
> guarantee all schedule() callers can deal with not actually sleeping --
> aka. spurious wakeups.

Thanks, I'll definitely try the patch and see what comes up.

> 
> I don't think anybody ever got that thing to run reliable enough to see
> if the idea proposed in the patch made any difference to actual
> workloads though.

Since your idea can also be applied to sysv sems (patch 3/3 back then),
I can definitely do some Oracle runs -- IIRC it also likes doing
multiple wakeups at once. In any case this patch deals very nicely with
our customer workload, which is why I believe it's particularly good
here.

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup
  2013-11-24  5:15           ` Davidlohr Bueso
@ 2013-11-25 12:01             ` Thomas Gleixner
  0 siblings, 0 replies; 39+ messages in thread
From: Thomas Gleixner @ 2013-11-25 12:01 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Linus Torvalds, Linux Kernel Mailing List, Ingo Molnar,
	Darren Hart, Peter Zijlstra, Mike Galbraith, Jeff Mahoney,
	Norton, Scott J, tom.vaden, Chandramouleeswaran, Aswin,
	Waiman Long, Jason Low, Paul E. McKenney

On Sat, 23 Nov 2013, Davidlohr Bueso wrote:
> On Sat, 2013-11-23 at 19:46 -0800, Linus Torvalds wrote:
> > On Sat, Nov 23, 2013 at 5:16 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > >
> > > Now the question is why we queue the waiter _AFTER_ reading the user
> > > space value. The comment in the code is pretty non sensical:
> > >
> > >    * On the other hand, we insert q and release the hash-bucket only
> > >    * after testing *uaddr.  This guarantees that futex_wait() will NOT
> > >    * absorb a wakeup if *uaddr does not match the desired values
> > >    * while the syscall executes.
> > >
> > > There is no reason why we cannot queue _BEFORE_ reading the user space
> > > value. We just have to dequeue in all the error handling cases, but
> > > for the fast path it does not matter at all.
> > >
> > > CPU 0                                   CPU 1
> > >
> > >     val = *futex;
> > >     futex_wait(futex, val);
> > >
> > >     spin_lock(&hb->lock);
> > >
> > >     plist_add(hb, self);
> > >     smp_wmb();
> > >
> > >     uval = *futex;
> > >                                         *futex = newval;
> > >                                         futex_wake();
> > >
> > >                                         smp_rmb();
> > >                                         if (plist_empty(hb))
> > >                                            return;
> > > ...
> > 
> > This would seem to be a nicer approach indeed, without needing the
> > extra atomics.
> 
> Yep, I think we can all agree that doing this optization without atomic
> ops is a big plus.
> 
> > 
> > Davidlohr, mind trying Thomas' approach?
> 
> I just took a quick look and it seems pretty straightforward, but not
> without some details to consider. We basically have to redo/reorder
> futex_wait_setup(), which checks that uval == val, and
> futex_wait_queue_me(), which adds the task to the list and blocks. Now,
> both futex_wait() and futex_wait_requeue_pi() have this logic, but since
> we don't use futex_wake() to wakeup tasks on pi futex_qs, I believe it's
> ok to only change futex_wait(), while the order of the uval checking
> doesn't matter for futex_wait_requeue_pi() so it can stay as is.

There is no mechanism which prevents a futex_wake() call on the inner
futex of the wait_requeue_pi mechanism. So no, we have to change both.

Futexes are no place for belief. Either you understand them
completely or you just leave them alone.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup
  2013-11-24  3:46         ` Linus Torvalds
  2013-11-24  5:15           ` Davidlohr Bueso
@ 2013-11-25 16:23           ` Thomas Gleixner
  2013-11-25 16:36             ` Peter Zijlstra
  1 sibling, 1 reply; 39+ messages in thread
From: Thomas Gleixner @ 2013-11-25 16:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Davidlohr Bueso, Linux Kernel Mailing List, Ingo Molnar,
	Darren Hart, Peter Zijlstra, Mike Galbraith, Jeff Mahoney,
	Norton, Scott J, tom.vaden, Chandramouleeswaran, Aswin,
	Waiman Long, Jason Low, Paul E. McKenney

On Sat, 23 Nov 2013, Linus Torvalds wrote:

> On Sat, Nov 23, 2013 at 5:16 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > Now the question is why we queue the waiter _AFTER_ reading the user
> > space value. The comment in the code is pretty non sensical:
> >
> >    * On the other hand, we insert q and release the hash-bucket only
> >    * after testing *uaddr.  This guarantees that futex_wait() will NOT
> >    * absorb a wakeup if *uaddr does not match the desired values
> >    * while the syscall executes.
> >
> > There is no reason why we cannot queue _BEFORE_ reading the user space
> > value. We just have to dequeue in all the error handling cases, but
> > for the fast path it does not matter at all.
> >
> > CPU 0                                   CPU 1
> >
> >     val = *futex;
> >     futex_wait(futex, val);
> >
> >     spin_lock(&hb->lock);
> >
> >     plist_add(hb, self);
> >     smp_wmb();
> >
> >     uval = *futex;
> >                                         *futex = newval;
> >                                         futex_wake();
> >
> >                                         smp_rmb();
> >                                         if (plist_empty(hb))
> >                                            return;
> > ...
> 
> This would seem to be a nicer approach indeed, without needing the
> extra atomics.

I went through the issue with Peter and he noticed that we need
smp_mb() in both places. That's what we have right now with the
spin_lock() and it is required as we need to guarantee that

 The waiter observes the change to the uaddr value after it added
 itself to the plist

 The waker observes plist not empty if the change to uaddr was made
 after the waiter checked the value.


	write(plist)		|	write(futex_uaddr)
	mb()			|	mb()
	read(futex_uaddr)	|	read(plist)

The spin_lock mb() on the waiter side does not help here because it
happens before the write(plist) and not after it.
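
Concretely, against the pseudo code above that means something like this
(sketch only):

  waiter (futex_wait):
	spin_lock(&hb->lock);
	plist_add(&q->list, &hb->chain);
	smp_mb();		/* full barrier after queueing */
	uval = *uaddr;
	...

  waker (futex_wake), after user space stored the new value:
	smp_mb();		/* full barrier before the lockless check */
	if (plist_head_empty(&hb->chain))
		return;		/* nothing to wake, hb->lock never taken */
	spin_lock(&hb->lock);
	...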

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup
  2013-11-25 16:23           ` Thomas Gleixner
@ 2013-11-25 16:36             ` Peter Zijlstra
  2013-11-25 17:32               ` Thomas Gleixner
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2013-11-25 16:36 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linus Torvalds, Davidlohr Bueso, Linux Kernel Mailing List,
	Ingo Molnar, Darren Hart, Mike Galbraith, Jeff Mahoney, Norton,
	Scott J, tom.vaden, Chandramouleeswaran, Aswin, Waiman Long,
	Jason Low, Paul E. McKenney

On Mon, Nov 25, 2013 at 05:23:51PM +0100, Thomas Gleixner wrote:
> On Sat, 23 Nov 2013, Linus Torvalds wrote:
> 
> > On Sat, Nov 23, 2013 at 5:16 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > >
> > > Now the question is why we queue the waiter _AFTER_ reading the user
> > > space value. The comment in the code is pretty non sensical:
> > >
> > >    * On the other hand, we insert q and release the hash-bucket only
> > >    * after testing *uaddr.  This guarantees that futex_wait() will NOT
> > >    * absorb a wakeup if *uaddr does not match the desired values
> > >    * while the syscall executes.
> > >
> > > There is no reason why we cannot queue _BEFORE_ reading the user space
> > > value. We just have to dequeue in all the error handling cases, but
> > > for the fast path it does not matter at all.
> > >
> > > CPU 0                                   CPU 1
> > >
> > >     val = *futex;
> > >     futex_wait(futex, val);
> > >
> > >     spin_lock(&hb->lock);
> > >
> > >     plist_add(hb, self);
> > >     smp_wmb();
> > >
> > >     uval = *futex;
> > >                                         *futex = newval;
> > >                                         futex_wake();
> > >
> > >                                         smp_rmb();
> > >                                         if (plist_empty(hb))
> > >                                            return;
> > > ...
> > 
> > This would seem to be a nicer approach indeed, without needing the
> > extra atomics.
> 
> I went through the issue with Peter and he noticed, that we need
> smp_mb() in both places. That's what we have right now with the
> spin_lock() and it is required as we need to guarantee that
> 
>  The waiter observes the change to the uaddr value after it added
>  itself to the plist
> 
>  The waker observes plist not empty if the change to uaddr was made
>  after the waiter checked the value.
> 
> 
> 	write(plist)		|	write(futex_uaddr)
> 	mb()			|	mb()
> 	read(futex_uaddr)	|	read(plist)
> 
> The spin_lock mb() on the waiter side does not help here because it
> happpens before the write(plist) and not after it.

Ah, note that spin_lock() is only an smp_mb() on x86; in general it's an
ACQUIRE barrier, which is weaker than a full mb and will not suffice in
this case even if it were in the right place.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup
  2013-11-25 16:36             ` Peter Zijlstra
@ 2013-11-25 17:32               ` Thomas Gleixner
  2013-11-25 17:38                 ` Peter Zijlstra
  2013-11-25 18:55                 ` Davidlohr Bueso
  0 siblings, 2 replies; 39+ messages in thread
From: Thomas Gleixner @ 2013-11-25 17:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Davidlohr Bueso, Linux Kernel Mailing List,
	Ingo Molnar, Darren Hart, Mike Galbraith, Jeff Mahoney, Norton,
	Scott J, tom.vaden, Chandramouleeswaran, Aswin, Waiman Long,
	Jason Low, Paul E. McKenney

On Mon, 25 Nov 2013, Peter Zijlstra wrote:
> On Mon, Nov 25, 2013 at 05:23:51PM +0100, Thomas Gleixner wrote:
> > On Sat, 23 Nov 2013, Linus Torvalds wrote:
> > 
> > > On Sat, Nov 23, 2013 at 5:16 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > > >
> > > > Now the question is why we queue the waiter _AFTER_ reading the user
> > > > space value. The comment in the code is pretty non sensical:
> > > >
> > > >    * On the other hand, we insert q and release the hash-bucket only
> > > >    * after testing *uaddr.  This guarantees that futex_wait() will NOT
> > > >    * absorb a wakeup if *uaddr does not match the desired values
> > > >    * while the syscall executes.
> > > >
> > > > There is no reason why we cannot queue _BEFORE_ reading the user space
> > > > value. We just have to dequeue in all the error handling cases, but
> > > > for the fast path it does not matter at all.
> > > >
> > > > CPU 0                                   CPU 1
> > > >
> > > >     val = *futex;
> > > >     futex_wait(futex, val);
> > > >
> > > >     spin_lock(&hb->lock);
> > > >
> > > >     plist_add(hb, self);
> > > >     smp_wmb();
> > > >
> > > >     uval = *futex;
> > > >                                         *futex = newval;
> > > >                                         futex_wake();
> > > >
> > > >                                         smp_rmb();
> > > >                                         if (plist_empty(hb))
> > > >                                            return;
> > > > ...
> > > 
> > > This would seem to be a nicer approach indeed, without needing the
> > > extra atomics.
> > 
> > I went through the issue with Peter and he noticed, that we need
> > smp_mb() in both places. That's what we have right now with the
> > spin_lock() and it is required as we need to guarantee that
> > 
> >  The waiter observes the change to the uaddr value after it added
> >  itself to the plist
> > 
> >  The waker observes plist not empty if the change to uaddr was made
> >  after the waiter checked the value.
> > 
> > 
> > 	write(plist)		|	write(futex_uaddr)
> > 	mb()			|	mb()
> > 	read(futex_uaddr)	|	read(plist)
> > 
> > The spin_lock mb() on the waiter side does not help here because it
> > happpens before the write(plist) and not after it.
> 
> Ah, note that spin_lock() is only a smp_mb() on x86, in general its an
> ACQUIRE barrier which is weaker than a full mb and will not suffice in
> this case even it if were in the right place.

So now the question is whether this lockless empty check optimization,
which seems to be quite nice on x86 for a particular workload, will
have any negative side effects on other architectures.

If the smp_mb() is heavyweight, then it will hurt massively in the
case where the hash bucket is not empty, because we add the price for
the smp_mb() just for no gain.

In that context it would also be helpful to measure the overhead on
x86 for the !empty case.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup
  2013-11-25 17:32               ` Thomas Gleixner
@ 2013-11-25 17:38                 ` Peter Zijlstra
  2013-11-25 18:55                 ` Davidlohr Bueso
  1 sibling, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2013-11-25 17:38 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linus Torvalds, Davidlohr Bueso, Linux Kernel Mailing List,
	Ingo Molnar, Darren Hart, Mike Galbraith, Jeff Mahoney, Norton,
	Scott J, tom.vaden, Chandramouleeswaran, Aswin, Waiman Long,
	Jason Low, Paul E. McKenney

On Mon, Nov 25, 2013 at 06:32:55PM +0100, Thomas Gleixner wrote:
> In that context it would also be helpful to measure the overhead on
> x86 for the !empty case.

Yes, because mfence is by no means a cheap instruction on x86.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup
  2013-11-25 17:32               ` Thomas Gleixner
  2013-11-25 17:38                 ` Peter Zijlstra
@ 2013-11-25 18:55                 ` Davidlohr Bueso
  2013-11-25 19:52                   ` Thomas Gleixner
  1 sibling, 1 reply; 39+ messages in thread
From: Davidlohr Bueso @ 2013-11-25 18:55 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Linus Torvalds, Linux Kernel Mailing List,
	Ingo Molnar, Darren Hart, Mike Galbraith, Jeff Mahoney, Norton,
	Scott J, tom.vaden, Chandramouleeswaran, Aswin, Waiman Long,
	Jason Low, Paul E. McKenney

On Mon, 2013-11-25 at 18:32 +0100, Thomas Gleixner wrote:
> On Mon, 25 Nov 2013, Peter Zijlstra wrote:
> > On Mon, Nov 25, 2013 at 05:23:51PM +0100, Thomas Gleixner wrote:
> > > On Sat, 23 Nov 2013, Linus Torvalds wrote:
> > > 
> > > > On Sat, Nov 23, 2013 at 5:16 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > > > >
> > > > > Now the question is why we queue the waiter _AFTER_ reading the user
> > > > > space value. The comment in the code is pretty non sensical:
> > > > >
> > > > >    * On the other hand, we insert q and release the hash-bucket only
> > > > >    * after testing *uaddr.  This guarantees that futex_wait() will NOT
> > > > >    * absorb a wakeup if *uaddr does not match the desired values
> > > > >    * while the syscall executes.
> > > > >
> > > > > There is no reason why we cannot queue _BEFORE_ reading the user space
> > > > > value. We just have to dequeue in all the error handling cases, but
> > > > > for the fast path it does not matter at all.
> > > > >
> > > > > CPU 0                                   CPU 1
> > > > >
> > > > >     val = *futex;
> > > > >     futex_wait(futex, val);
> > > > >
> > > > >     spin_lock(&hb->lock);
> > > > >
> > > > >     plist_add(hb, self);
> > > > >     smp_wmb();
> > > > >
> > > > >     uval = *futex;
> > > > >                                         *futex = newval;
> > > > >                                         futex_wake();
> > > > >
> > > > >                                         smp_rmb();
> > > > >                                         if (plist_empty(hb))
> > > > >                                            return;
> > > > > ...
> > > > 
> > > > This would seem to be a nicer approach indeed, without needing the
> > > > extra atomics.
> > > 
> > > I went through the issue with Peter and he noticed, that we need
> > > smp_mb() in both places. That's what we have right now with the
> > > spin_lock() and it is required as we need to guarantee that
> > > 
> > >  The waiter observes the change to the uaddr value after it added
> > >  itself to the plist
> > > 
> > >  The waker observes plist not empty if the change to uaddr was made
> > >  after the waiter checked the value.
> > > 
> > > 
> > > 	write(plist)		|	write(futex_uaddr)
> > > 	mb()			|	mb()
> > > 	read(futex_uaddr)	|	read(plist)
> > > 
> > > The spin_lock mb() on the waiter side does not help here because it
> > > happpens before the write(plist) and not after it.
> > 
> > Ah, note that spin_lock() is only a smp_mb() on x86, in general its an
> > ACQUIRE barrier which is weaker than a full mb and will not suffice in
> > this case even it if were in the right place.
> 
> So now the question is whether this lockless empty check optimization
> which seems to be quite nice on x86 for a particular workload will
> have any negative side effects on other architectures.
> 
> If the smp_mb() is heavy weight, then it will hurt massivly in the
> case where the hash bucket is not empty, because we add the price for
> the smp_mb() just for no gain.
> 
> In that context it would also be helpful to measure the overhead on
> x86 for the !empty case.

Absolutely, I will add these comparisons. If we do notice that we end up
hurting the !empty case, would the current patch using atomic ops still
be considered? We have made sure that none of the changes in this set
affects performance on other workloads/smaller systems.

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup
  2013-11-23 13:16       ` Thomas Gleixner
  2013-11-24  3:46         ` Linus Torvalds
@ 2013-11-25 19:47         ` Thomas Gleixner
  2013-11-25 20:03           ` Darren Hart
  1 sibling, 1 reply; 39+ messages in thread
From: Thomas Gleixner @ 2013-11-25 19:47 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Linus Torvalds, Linux Kernel Mailing List, Ingo Molnar,
	Darren Hart, Peter Zijlstra, Mike Galbraith, jeffm, Norton,
	Scott J, tom.vaden, Chandramouleeswaran, Aswin, Waiman Long,
	Jason Low, Paul E. McKenney

On Sat, 23 Nov 2013, Thomas Gleixner wrote:
> On Fri, 22 Nov 2013, Davidlohr Bueso wrote:
> So with the atomic ops you are changing that to:
> 
> CPU 0					CPU 1
> 
>     val = *futex;
>     futex_wait(futex, val);
> 
>     spin_lock(&hb->lock);
> 
>     atomic_inc(&hb->waiters);
>     uval = *futex;
> 					*futex = newval;
> 
>     if (uval != val) {			futex_wake();
>        spin_unlock(&hb->lock);		if (!atomic_read(&hb->waiters))
>        return;				   return;
>     }
>    					spin_lock((&hb->lock);	
>     plist_add(hb, self);
>     spin_unlock(&hb->lock);
>     schedule();				wakeup_waiters(hb);
>     ...
> 
> which restores the ordering guarantee, which the hash bucket lock
> provided so far.

Actually that's not true by design, it just happens to work.

atomic_inc() on x86 is a "lock incl".

 The LOCK prefix just guarantees that the cache line which is affected
 by the INCL is locked. And it guarantees that locked operations
 serialize all outstanding load and store operations.

 But Documentation/atomic_ops.txt says about atomic_inc():

 "One very important aspect of these two routines is that they DO NOT
  require any explicit memory barriers.  They need only perform the
  atomic_t counter update in an SMP safe manner."

 So while this has a barrier on x86, it's not guaranteed.

atomic_read() is a simple read.

 This does not guarantee anything. And if you read
 Documentation/atomic_ops.txt it's well documented:

 "*** WARNING: atomic_read() and atomic_set() DO NOT IMPLY BARRIERS! ***"


So now your code melts down to:

 write(hb->waiters)    |    write(uaddr)
 mb		       |    read(hb->waiters)
 read(uaddr)

I fear you simply managed to make the window small enough that your
testing was no longer able to expose it.
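
If the counter variant were to be kept, the barriers would have to be made
explicit instead of being relied on as an x86 side effect, roughly (again
just a sketch):

	/* waiter (futex_wait path) */
	atomic_inc(&hb->waiters);
	smp_mb__after_atomic_inc();	/* pairs with the waker's smp_mb() */
	uval = *uaddr;

	/* waker (futex_wake path), after user space stored the new value */
	smp_mb();
	if (!atomic_read(&hb->waiters))
		return 0;		/* no waiters, skip hb->lock */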

Thanks,

	tglx





^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup
  2013-11-25 18:55                 ` Davidlohr Bueso
@ 2013-11-25 19:52                   ` Thomas Gleixner
  0 siblings, 0 replies; 39+ messages in thread
From: Thomas Gleixner @ 2013-11-25 19:52 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Peter Zijlstra, Linus Torvalds, Linux Kernel Mailing List,
	Ingo Molnar, Darren Hart, Mike Galbraith, Jeff Mahoney, Norton,
	Scott J, tom.vaden, Chandramouleeswaran, Aswin, Waiman Long,
	Jason Low, Paul E. McKenney

On Mon, 25 Nov 2013, Davidlohr Bueso wrote:
> On Mon, 2013-11-25 at 18:32 +0100, Thomas Gleixner wrote:
> > If the smp_mb() is heavy weight, then it will hurt massivly in the
> > case where the hash bucket is not empty, because we add the price for
> > the smp_mb() just for no gain.
> > 
> > In that context it would also be helpful to measure the overhead on
> > x86 for the !empty case.
> 
> Absolutely, I will add these comparisons. If we do notice that we end up
> hurting the !empty case, would the current patch using atomic ops still
> be considered? We have made sure that none of the changes in this set
> affects performance on other workloads/smaller systems.

Please read my last reply to the atomic ops approach.

Aside of that we need numbers for a significant range of !x86.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup
  2013-11-25 19:47         ` Thomas Gleixner
@ 2013-11-25 20:03           ` Darren Hart
  2013-11-25 20:26             ` Thomas Gleixner
  2013-11-26 13:53             ` Thomas Gleixner
  0 siblings, 2 replies; 39+ messages in thread
From: Darren Hart @ 2013-11-25 20:03 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Davidlohr Bueso, Linus Torvalds, Linux Kernel Mailing List,
	Ingo Molnar, Peter Zijlstra, Mike Galbraith, jeffm, Norton,
	Scott J, tom.vaden, Chandramouleeswaran, Aswin, Waiman Long,
	Jason Low, Paul E. McKenney

On Mon, 2013-11-25 at 20:47 +0100, Thomas Gleixner wrote:
> On Sat, 23 Nov 2013, Thomas Gleixner wrote:
> > On Fri, 22 Nov 2013, Davidlohr Bueso wrote:
> > So with the atomic ops you are changing that to:
> > 
> > CPU 0                                 CPU 1
> > 
> >     val = *futex;
> >     futex_wait(futex, val);
> > 
> >     spin_lock(&hb->lock);
> > 
> >     atomic_inc(&hb->waiters);
> >     uval = *futex;
> >                                       *futex = newval;
> > 
> >     if (uval != val) {                        futex_wake();
> >        spin_unlock(&hb->lock);                if (!atomic_read(&hb->waiters))
> >        return;                                   return;
> >     }
> >                                       spin_lock(&hb->lock);
> >     plist_add(hb, self);
> >     spin_unlock(&hb->lock);
> >     schedule();                               wakeup_waiters(hb);
> >     ...
> > 
> > which restores the ordering guarantee, which the hash bucket lock
> > provided so far.
> 
> Actually that's not true by design, it just happens to work.
> 
> atomic_inc() on x86 is a "lock incl".
> 
>  The LOCK prefix just guarantees that the cache line which is affected
>  by the INCL is locked. And it guarantees that locked operations
>  serialize all outstanding load and store operations.
> 
>  But Documentation/atomic_ops.txt says about atomic_inc():
> 
>  "One very important aspect of these two routines is that they DO NOT
>   require any explicit memory barriers.  They need only perform the
>   atomic_t counter update in an SMP safe manner."
> 
>  So while this has a barrier on x86, it's not guaranteed.


But it is guaranteed to be "in an SMP safe manner"... which I guess just
means that two writes will not intermix bytes, but no guarantee that the
value will be visible to other CPUs unless some kind of barrier is
explicitly imposed.

Correct?


> atomic_read() is a simple read.
> 
>  This does not guarantee anything. And if you read
>  Documentation/atomic_ops.txt it's well documented:
> 
>  "*** WARNING: atomic_read() and atomic_set() DO NOT IMPLY BARRIERS! ***"
> 
> 
> So now your code melts down to:
> 
>  write(hb->waiters)    |    write(uaddr)
>  mb                    |    read(hb->waiters)
>  read(uaddr)
> 
> I fear you simply managed to make the window small enough that your
> testing was no longer able to expose it.

Does seem to be the case.

-- 
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup
  2013-11-25 20:03           ` Darren Hart
@ 2013-11-25 20:26             ` Thomas Gleixner
  2013-11-26 13:53             ` Thomas Gleixner
  1 sibling, 0 replies; 39+ messages in thread
From: Thomas Gleixner @ 2013-11-25 20:26 UTC (permalink / raw)
  To: Darren Hart
  Cc: Davidlohr Bueso, Linus Torvalds, Linux Kernel Mailing List,
	Ingo Molnar, Peter Zijlstra, Mike Galbraith, jeffm, Norton,
	Scott J, tom.vaden, Chandramouleeswaran, Aswin, Waiman Long,
	Jason Low, Paul E. McKenney

On Mon, 25 Nov 2013, Darren Hart wrote:
> On Mon, 2013-11-25 at 20:47 +0100, Thomas Gleixner wrote:
> > > which restores the ordering guarantee, which the hash bucket lock
> > > provided so far.
> > 
> > Actually that's not true by design, it just happens to work.
> > 
> > atomic_inc() on x86 is a "lock incl".
> > 
> >  The LOCK prefix just guarantees that the cache line which is affected
> >  by the INCL is locked. And it guarantees that locked operations
> >  serialize all outstanding load and store operations.
> > 
> >  But Documentation/atomic_ops.txt says about atomic_inc():
> > 
> >  "One very important aspect of these two routines is that they DO NOT
> >   require any explicit memory barriers.  They need only perform the
> >   atomic_t counter update in an SMP safe manner."
> > 
> >  So while this has a barrier on x86, it's not guaranteed.
> 
> 
> But it is guaranteed to be "in an SMP safe manner"... which I guess just
> means that two writes will not intermix bytes, but no guarantee that the
> value will be visible to other CPUs unless some kind of barrier is
> explicitly imposed.
> 
> Correct?

Yep, that's what it says.
 
> > So now your code melts down to:
> > 
> >  write(hb->waiters)    |    write(uaddr)
> >  mb                    |    read(hb->waiters)
> >  read(uaddr)
> > 
> > I fear you simply managed to make the window small enough that your
> > testing was no longer able to expose it.
> 
> Does seem to be the case.

You must be aware that between the write(uaddr) and the
read(hb->waiters) there is the syscall, i.e. a user/kernel space
transition.

sysenter/syscall have no documented barriers inside, and we don't know
whether the actual hw implementation provides one, or whether the memory
ops between modifying uaddr and reaching the read(hb->waiters) point
(including the real atomic op on the waiter side) are good enough to
paper over the issue.

Thanks,

	tglx




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup
  2013-11-25 20:03           ` Darren Hart
  2013-11-25 20:26             ` Thomas Gleixner
@ 2013-11-26 13:53             ` Thomas Gleixner
  1 sibling, 0 replies; 39+ messages in thread
From: Thomas Gleixner @ 2013-11-26 13:53 UTC (permalink / raw)
  To: Darren Hart
  Cc: Davidlohr Bueso, Linus Torvalds, Linux Kernel Mailing List,
	Ingo Molnar, Peter Zijlstra, Mike Galbraith, jeffm, Norton,
	Scott J, tom.vaden, Chandramouleeswaran, Aswin, Waiman Long,
	Jason Low, Paul E. McKenney

On Mon, 25 Nov 2013, Darren Hart wrote:
> On Mon, 2013-11-25 at 20:47 +0100, Thomas Gleixner wrote:
> > So now your code melts down to:
> > 
> >  write(hb->waiters)    |    write(uaddr)
> >  mb                    |    read(hb->waiters)
> >  read(uaddr)
> > 
> > I fear you simply managed to make the window small enough that your
> > testing was no longer able to expose it.
> 
> Does seem to be the case.

Actually not. It's protected by an atomic_inc() between the
write(uaddr) and read(hb->waiters) on the waker side.

It took me a while to realize that get_futex_key_refs() is providing
the protection inadvertently. But as I explained we cannot rely on
that for all architectures.
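
(The v6 patch further down in this thread makes that barrier explicit
rather than relying on the side effect; a rough sketch of the relevant
hunk:

static inline void futex_get_mm(union futex_key *key)
{
	atomic_inc(&key->private.mm->mm_count);
	/*
	 * Pair with the barrier on the waiter side so the
	 * write(uaddr)/read(hb->waiters) ordering holds on every arch,
	 * not just those where atomic_inc() happens to imply a mb.
	 */
	smp_mb__after_atomic_inc();
}

and get_futex_key_refs() relies on the barrier in ihold() the same way
for the inode-backed case.)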

Thanks,

	tglx









^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup
  2014-01-12 23:31 [PATCH v6 " Davidlohr Bueso
@ 2014-01-12 23:31 ` Davidlohr Bueso
  0 siblings, 0 replies; 39+ messages in thread
From: Davidlohr Bueso @ 2014-01-12 23:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, dvhart, peterz, tglx, paulmck, efault, jeffm, torvalds,
	jason.low2, Waiman.Long, tom.vaden, scott.norton, aswin,
	davidlohr

From: Davidlohr Bueso <davidlohr@hp.com>

In futex_wake() there is clearly no point in taking the hb->lock if we know
beforehand that there are no tasks to be woken. While the hash bucket's plist
head is a cheap way of knowing this, we cannot rely 100% on it as there is a
racy window between the futex_wait call and when the task is actually added to
the plist. To this end, we couple it with the spinlock check as tasks trying to
enter the critical region are most likely potential waiters that will be added
to the plist, thus preventing tasks sleeping forever if wakers don't acknowledge
all possible waiters.

Furthermore, the futex ordering guarantees are preserved, ensuring that waiters
either observe the changed user space value before blocking or are woken by a
concurrent waker. For wakers, this is done by relying on the barriers in
get_futex_key_refs() -- for archs that do not have implicit mb in atomic_inc(),
we explicitly add them through a new futex_get_mm function. For waiters we rely
on the fact that spin_lock calls already update the head counter, so spinners
are visible even if the lock hasn't been acquired yet.

For more details please refer to the updated comments in the code and related
discussion: https://lkml.org/lkml/2013/11/26/556
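
(For context, the syscall pair being optimized is the plain wait/wake
pattern; a hypothetical minimal user-space sketch, not part of this patch,
with all names made up:

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static int futex_word;	/* 32-bit futex word shared between threads */

/* Waiter side: block while futex_word is still 0. */
static void wait_side(void)
{
	syscall(SYS_futex, &futex_word, FUTEX_WAIT, 0, NULL, NULL, 0);
}

/*
 * Waker side: publish the new value, then wake at most one waiter.
 * The return value is the number of tasks actually woken; when nobody
 * is queued it is 0, which is the case futex_wake() now detects before
 * bothering with hb->lock.
 */
static long wake_side(void)
{
	__atomic_store_n(&futex_word, 1, __ATOMIC_SEQ_CST);
	return syscall(SYS_futex, &futex_word, FUTEX_WAKE, 1, NULL, NULL, 0);
}

Because the waker stores the new value before FUTEX_WAKE, the ordering
described above is exactly what keeps wait_side() from sleeping forever.)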

Special thanks to tglx for careful review and feedback.

Cc: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Darren Hart <dvhart@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Jeff Mahoney <jeffm@suse.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Scott Norton <scott.norton@hp.com>
Cc: Tom Vaden <tom.vaden@hp.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: Jason Low <jason.low2@hp.com>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
---
 kernel/futex.c | 115 +++++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 91 insertions(+), 24 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index fcc6850..be6399a 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -75,17 +75,20 @@
  * The waiter reads the futex value in user space and calls
  * futex_wait(). This function computes the hash bucket and acquires
  * the hash bucket lock. After that it reads the futex user space value
- * again and verifies that the data has not changed. If it has not
- * changed it enqueues itself into the hash bucket, releases the hash
- * bucket lock and schedules.
+ * again and verifies that the data has not changed. If it has not changed
+ * it enqueues itself into the hash bucket, releases the hash bucket lock
+ * and schedules.
  *
  * The waker side modifies the user space value of the futex and calls
- * futex_wake(). This functions computes the hash bucket and acquires
- * the hash bucket lock. Then it looks for waiters on that futex in the
- * hash bucket and wakes them.
+ * futex_wake(). This function computes the hash bucket and acquires the
+ * hash bucket lock. Then it looks for waiters on that futex in the hash
+ * bucket and wakes them.
  *
- * Note that the spin_lock serializes waiters and wakers, so that the
- * following scenario is avoided:
+ * In futex wake up scenarios where no tasks are blocked on a futex, taking
+ * the hb spinlock can be avoided entirely, returning early. In order for this
+ * optimization to work, ordering guarantees must exist so that the waiter
+ * being added to the list is acknowledged when the list is concurrently being
+ * checked by the waker, avoiding scenarios like the following:
  *
  * CPU 0                               CPU 1
  * val = *futex;
@@ -106,24 +109,52 @@
  * This would cause the waiter on CPU 0 to wait forever because it
  * missed the transition of the user space value from val to newval
  * and the waker did not find the waiter in the hash bucket queue.
- * The spinlock serializes that:
+ *
+ * The correct serialization ensures that a waiter either observes
+ * the changed user space value before blocking or is woken by a
+ * concurrent waker:
  *
  * CPU 0                               CPU 1
  * val = *futex;
  * sys_futex(WAIT, futex, val);
  *   futex_wait(futex, val);
- *   lock(hash_bucket(futex));
- *   uval = *futex;
- *                                     *futex = newval;
- *                                     sys_futex(WAKE, futex);
- *                                       futex_wake(futex);
- *                                       lock(hash_bucket(futex));
+ *
+ *   waiters++;
+ *   mb(); (A) <-- paired with -.
+ *                              |
+ *   lock(hash_bucket(futex));  |
+ *                              |
+ *   uval = *futex;             |
+ *                              |        *futex = newval;
+ *                              |        sys_futex(WAKE, futex);
+ *                              |          futex_wake(futex);
+ *                              |
+ *                              `------->   mb(); (B)
  *   if (uval == val)
- *      queue();
+ *     queue();
  *     unlock(hash_bucket(futex));
- *     schedule();                       if (!queue_empty())
- *                                         wake_waiters(futex);
- *                                       unlock(hash_bucket(futex));
+ *     schedule();                         if (waiters)
+ *                                           lock(hash_bucket(futex));
+ *                                           wake_waiters(futex);
+ *                                           unlock(hash_bucket(futex));
+ *
+ * Where (A) orders the waiters increment and the futex value read -- this
+ * is guaranteed by the head counter in the hb spinlock; and where (B)
+ * orders the write to futex and the waiters read -- this is done by the
+ * barriers in get_futex_key_refs(), through either ihold or atomic_inc,
+ * depending on the futex type.
+ *
+ * This yields the following case (where X:=waiters, Y:=futex):
+ *
+ *	X = Y = 0
+ *
+ *	w[X]=1		w[Y]=1
+ *	MB		MB
+ *	r[Y]=y		r[X]=x
+ *
+ * Which guarantees that x==0 && y==0 is impossible; which translates back into
+ * the guarantee that we cannot both miss the futex variable change and the
+ * enqueue.
  */
 
 int __read_mostly futex_cmpxchg_enabled;
@@ -211,6 +242,36 @@ static unsigned long __read_mostly futex_hashsize;
 
 static struct futex_hash_bucket *futex_queues;
 
+static inline void futex_get_mm(union futex_key *key)
+{
+	atomic_inc(&key->private.mm->mm_count);
+	/*
+	 * Ensure futex_get_mm() implies a full barrier such that
+	 * get_futex_key() implies a full barrier. This is relied upon
+	 * as full barrier (B), see the ordering comment above.
+	 */
+	smp_mb__after_atomic_inc();
+}
+
+static inline bool hb_waiters_pending(struct futex_hash_bucket *hb)
+{
+#ifdef CONFIG_SMP
+	/*
+	 * Tasks trying to enter the critical region are most likely
+	 * potential waiters that will be added to the plist. Ensure
+	 * that wakers won't miss to-be-slept tasks in the window between
+	 * the wait call and the actual plist_add.
+	 */
+	if (spin_is_locked(&hb->lock))
+		return true;
+	smp_rmb(); /* Make sure we check the lock state first */
+
+	return !plist_head_empty(&hb->chain);
+#else
+	return true;
+#endif
+}
+
 /*
  * We hash on the keys returned from get_futex_key (see below).
  */
@@ -245,10 +306,10 @@ static void get_futex_key_refs(union futex_key *key)
 
 	switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
 	case FUT_OFF_INODE:
-		ihold(key->shared.inode);
+		ihold(key->shared.inode); /* implies MB (B) */
 		break;
 	case FUT_OFF_MMSHARED:
-		atomic_inc(&key->private.mm->mm_count);
+		futex_get_mm(key); /* implies MB (B) */
 		break;
 	}
 }
@@ -322,7 +383,7 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
 	if (!fshared) {
 		key->private.mm = mm;
 		key->private.address = address;
-		get_futex_key_refs(key);
+		get_futex_key_refs(key);  /* implies MB (B) */
 		return 0;
 	}
 
@@ -429,7 +490,7 @@ again:
 		key->shared.pgoff = basepage_index(page);
 	}
 
-	get_futex_key_refs(key);
+	get_futex_key_refs(key); /* implies MB (B) */
 
 out:
 	unlock_page(page_head);
@@ -1052,6 +1113,11 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 		goto out;
 
 	hb = hash_futex(&key);
+
+	/* Make sure we really have tasks to wakeup */
+	if (!hb_waiters_pending(hb))
+		goto out_put_key;
+
 	spin_lock(&hb->lock);
 
 	plist_for_each_entry_safe(this, next, &hb->chain, list) {
@@ -1072,6 +1138,7 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 	}
 
 	spin_unlock(&hb->lock);
+out_put_key:
 	put_futex_key(&key);
 out:
 	return ret;
@@ -1535,7 +1602,7 @@ static inline struct futex_hash_bucket *queue_lock(struct futex_q *q)
 	hb = hash_futex(&q->key);
 	q->lock_ptr = &hb->lock;
 
-	spin_lock(&hb->lock);
+	spin_lock(&hb->lock); /* implies MB (A) */
 	return hb;
 }
 
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2014-01-12 23:32 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-11-23  0:56 [PATCH 0/5] futex: Wakeup optimizations Davidlohr Bueso
2013-11-23  0:56 ` [PATCH 1/5] futex: Misc cleanups Davidlohr Bueso
2013-11-23  6:52   ` Darren Hart
2013-11-23  0:56 ` [PATCH 2/5] futex: Check for pi futex_q only once Davidlohr Bueso
2013-11-23  6:33   ` Darren Hart
2013-11-24  5:19     ` Davidlohr Bueso
2013-11-23  0:56 ` [PATCH 3/5] futex: Larger hash table Davidlohr Bueso
2013-11-23  6:52   ` Darren Hart
2013-11-23  0:56 ` [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup Davidlohr Bueso
2013-11-23  1:25   ` Linus Torvalds
2013-11-23  3:03     ` Jason Low
2013-11-23  3:19     ` Davidlohr Bueso
2013-11-23  7:23       ` Darren Hart
2013-11-23 13:16       ` Thomas Gleixner
2013-11-24  3:46         ` Linus Torvalds
2013-11-24  5:15           ` Davidlohr Bueso
2013-11-25 12:01             ` Thomas Gleixner
2013-11-25 16:23           ` Thomas Gleixner
2013-11-25 16:36             ` Peter Zijlstra
2013-11-25 17:32               ` Thomas Gleixner
2013-11-25 17:38                 ` Peter Zijlstra
2013-11-25 18:55                 ` Davidlohr Bueso
2013-11-25 19:52                   ` Thomas Gleixner
2013-11-25 19:47         ` Thomas Gleixner
2013-11-25 20:03           ` Darren Hart
2013-11-25 20:26             ` Thomas Gleixner
2013-11-26 13:53             ` Thomas Gleixner
2013-11-23  4:05     ` Waiman Long
2013-11-23  5:40   ` Darren Hart
2013-11-23  5:42     ` Hart, Darren
2013-11-23  7:20   ` Darren Hart
2013-11-23  0:56 ` [PATCH 5/5] sched,futex: Provide delayed wakeup list Davidlohr Bueso
2013-11-23 11:48   ` Peter Zijlstra
2013-11-23 12:01     ` Peter Zijlstra
2013-11-24  5:25       ` Davidlohr Bueso
2013-11-23  5:55 ` [PATCH 0/5] futex: Wakeup optimizations Darren Hart
2013-11-23  6:35   ` Mike Galbraith
2013-11-23  6:38   ` Davidlohr Bueso
2014-01-12 23:31 [PATCH v6 " Davidlohr Bueso
2014-01-12 23:31 ` [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup Davidlohr Bueso

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).