* [RFC PATCH v2 0/5] futex: Introducing throughput-optimized futexes
@ 2016-09-20 13:42 Waiman Long
  2016-09-20 13:42 ` [RFC PATCH v2 1/5] futex: Add futex_set_timer() helper function Waiman Long
                   ` (4 more replies)
  0 siblings, 5 replies; 35+ messages in thread
From: Waiman Long @ 2016-09-20 13:42 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Davidlohr Bueso, Jason Low,
	Scott J Norton, Douglas Hatch, Waiman Long

 v1->v2:
  - Adds an explicit lock hand-off mechanism.
  - Adds timeout support.
  - Simplifies the required userspace code.
  - Fixes a number of problems in the v1 code.

This patchset introduces a new futex implementation called
throughput-optimized (TO) futexes. It is similar to PI futexes in its
calling convention, but provides better throughput than the wait-wake
futexes by encouraging lock stealing and optimistic spinning.
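
For reference, the expected userspace usage pattern looks like the
sketch below. This is not part of the patchset itself; the raw
syscall(2) invocation and the to_lock()/to_unlock() names are
illustrative only, the opcode values are taken from patch 3/5, and
real code would inspect errno instead of blindly retrying:

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdatomic.h>

#define FUTEX_LOCK_TO		13
#define FUTEX_UNLOCK_TO		14

static void to_lock(atomic_uint *f, unsigned int tid)
{
	unsigned int zero = 0;

	/* Fast path: 0 => TID transition, as with PI futexes */
	if (atomic_compare_exchange_strong(f, &zero, tid))
		return;
	/*
	 * Slow path: the kernel spins or sleeps until the lock is
	 * acquired (0), stolen (1) or handed off (2).
	 */
	while (syscall(SYS_futex, f, FUTEX_LOCK_TO, 0, NULL, NULL, 0) < 0)
		;
}

static void to_unlock(atomic_uint *f, unsigned int tid)
{
	unsigned int old = tid;

	/* Fast path: TID => 0 transition */
	if (atomic_compare_exchange_strong(f, &old, 0))
		return;
	/* Waiters present: wake up or hand off via the kernel */
	if (old & FUTEX_WAITERS)
		syscall(SYS_futex, f, FUTEX_UNLOCK_TO, 0, NULL, NULL, 0);
}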

Waiman Long (5):
  futex: Add futex_set_timer() helper function
  futex: Rename futex_pi_state to futex_state
  futex: Throughput-optimized (TO) futexes
  futex: Add timeout support to TO futexes
  futex, doc: TO futexes document

 Documentation/00-INDEX     |    2 +
 Documentation/to-futex.txt |  140 ++++++++
 include/linux/sched.h      |    4 +-
 include/uapi/linux/futex.h |    4 +
 kernel/futex.c             |  810 +++++++++++++++++++++++++++++++++++++++-----
 5 files changed, 869 insertions(+), 91 deletions(-)
 create mode 100644 Documentation/to-futex.txt


* [RFC PATCH v2 1/5] futex: Add futex_set_timer() helper function
  2016-09-20 13:42 [RFC PATCH v2 0/5] futex: Introducing throughput-optimized futexes Waiman Long
@ 2016-09-20 13:42 ` Waiman Long
  2016-09-22 21:31   ` Thomas Gleixner
  2016-09-20 13:42 ` [RFC PATCH v2 2/5] futex: Rename futex_pi_state to futex_state Waiman Long
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 35+ messages in thread
From: Waiman Long @ 2016-09-20 13:42 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Davidlohr Bueso, Jason Low,
	Scott J Norton, Douglas Hatch, Waiman Long

This patch adds a new futex_set_timer() function to consolidate all
the sleeping hrtimer setup code.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 kernel/futex.c |   51 ++++++++++++++++++++++++---------------------------
 1 files changed, 24 insertions(+), 27 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 46cb3a3..37e61ef 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -467,6 +467,25 @@ static void drop_futex_key_refs(union futex_key *key)
 	}
 }
 
+/*
+ * Helper function to set the sleeping hrtimer.
+ */
+static inline void futex_set_timer(ktime_t *time, struct hrtimer_sleeper **pto,
+		struct hrtimer_sleeper *timeout, int flags, u64 range_ns)
+{
+	if (!time)
+		return;
+	*pto = timeout;
+	hrtimer_init_on_stack(&timeout->timer, (flags & FLAGS_CLOCKRT) ?
+			      CLOCK_REALTIME : CLOCK_MONOTONIC,
+			      HRTIMER_MODE_ABS);
+	hrtimer_init_sleeper(timeout, current);
+	if (range_ns)
+		hrtimer_set_expires_range_ns(&timeout->timer, *time, range_ns);
+	else
+		hrtimer_set_expires(&timeout->timer, *time);
+}
+
 /**
  * get_futex_key() - Get parameters which are the keys for a futex
  * @uaddr:	virtual address of the futex
@@ -2403,17 +2422,8 @@ static int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
 		return -EINVAL;
 	q.bitset = bitset;
 
-	if (abs_time) {
-		to = &timeout;
-
-		hrtimer_init_on_stack(&to->timer, (flags & FLAGS_CLOCKRT) ?
-				      CLOCK_REALTIME : CLOCK_MONOTONIC,
-				      HRTIMER_MODE_ABS);
-		hrtimer_init_sleeper(to, current);
-		hrtimer_set_expires_range_ns(&to->timer, *abs_time,
-					     current->timer_slack_ns);
-	}
-
+	futex_set_timer(abs_time, &to, &timeout, flags,
+			current->timer_slack_ns);
 retry:
 	/*
 	 * Prepare to wait on uaddr. On success, holds hb lock and increments
@@ -2501,13 +2511,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags,
 	if (refill_pi_state_cache())
 		return -ENOMEM;
 
-	if (time) {
-		to = &timeout;
-		hrtimer_init_on_stack(&to->timer, CLOCK_REALTIME,
-				      HRTIMER_MODE_ABS);
-		hrtimer_init_sleeper(to, current);
-		hrtimer_set_expires(&to->timer, *time);
-	}
+	futex_set_timer(time, &to, &timeout, FLAGS_CLOCKRT, 0);
 
 retry:
 	ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q.key, VERIFY_WRITE);
@@ -2816,15 +2820,8 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 	if (!bitset)
 		return -EINVAL;
 
-	if (abs_time) {
-		to = &timeout;
-		hrtimer_init_on_stack(&to->timer, (flags & FLAGS_CLOCKRT) ?
-				      CLOCK_REALTIME : CLOCK_MONOTONIC,
-				      HRTIMER_MODE_ABS);
-		hrtimer_init_sleeper(to, current);
-		hrtimer_set_expires_range_ns(&to->timer, *abs_time,
-					     current->timer_slack_ns);
-	}
+	futex_set_timer(abs_time, &to, &timeout, flags,
+			current->timer_slack_ns);
 
 	/*
 	 * The waiter is allocated on our stack, manipulated by the requeue
-- 
1.7.1


* [RFC PATCH v2 2/5] futex: Rename futex_pi_state to futex_state
  2016-09-20 13:42 [RFC PATCH v2 0/5] futex: Introducing throughput-optimized futexes Waiman Long
  2016-09-20 13:42 ` [RFC PATCH v2 1/5] futex: Add futex_set_timer() helper function Waiman Long
@ 2016-09-20 13:42 ` Waiman Long
  2016-09-20 13:42 ` [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes Waiman Long
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 35+ messages in thread
From: Waiman Long @ 2016-09-20 13:42 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Davidlohr Bueso, Jason Low,
	Scott J Norton, Douglas Hatch, Waiman Long

The futex_pi_state structure will be overloaded in later patches to
store state information about non-PI futexes. So the structure name
itself is no longer a good description of its purpose. This patch
renames it to futex_state, a more generic name.

Some of the functions that process the futex states are also renamed.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 include/linux/sched.h |    4 +-
 kernel/futex.c        |  107 +++++++++++++++++++++++++------------------------
 2 files changed, 56 insertions(+), 55 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 62c68e5..fefd7f7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -126,7 +126,7 @@ struct sched_attr {
 	u64 sched_period;
 };
 
-struct futex_pi_state;
+struct futex_state;
 struct robust_list_head;
 struct bio_list;
 struct fs_struct;
@@ -1772,7 +1772,7 @@ struct task_struct {
 	struct compat_robust_list_head __user *compat_robust_list;
 #endif
 	struct list_head pi_state_list;
-	struct futex_pi_state *pi_state_cache;
+	struct futex_state *pi_state_cache;
 #endif
 #ifdef CONFIG_PERF_EVENTS
 	struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts];
diff --git a/kernel/futex.c b/kernel/futex.c
index 37e61ef..f8bb93f 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -192,11 +192,12 @@ int __read_mostly futex_cmpxchg_enabled;
 #define FLAGS_HAS_TIMEOUT	0x04
 
 /*
- * Priority Inheritance state:
+ * Futex state object:
+ *  - Priority Inheritance state
  */
-struct futex_pi_state {
+struct futex_state {
 	/*
-	 * list of 'owned' pi_state instances - these have to be
+	 * list of 'owned' state instances - these have to be
 	 * cleaned up in do_exit() if the task exits prematurely:
 	 */
 	struct list_head list;
@@ -240,7 +241,7 @@ struct futex_q {
 	struct task_struct *task;
 	spinlock_t *lock_ptr;
 	union futex_key key;
-	struct futex_pi_state *pi_state;
+	struct futex_state *pi_state;
 	struct rt_mutex_waiter *rt_waiter;
 	union futex_key *requeue_pi_key;
 	u32 bitset;
@@ -787,76 +788,76 @@ static int get_futex_value_locked(u32 *dest, u32 __user *from)
 /*
  * PI code:
  */
-static int refill_pi_state_cache(void)
+static int refill_futex_state_cache(void)
 {
-	struct futex_pi_state *pi_state;
+	struct futex_state *state;
 
 	if (likely(current->pi_state_cache))
 		return 0;
 
-	pi_state = kzalloc(sizeof(*pi_state), GFP_KERNEL);
+	state = kzalloc(sizeof(*state), GFP_KERNEL);
 
-	if (!pi_state)
+	if (!state)
 		return -ENOMEM;
 
-	INIT_LIST_HEAD(&pi_state->list);
+	INIT_LIST_HEAD(&state->list);
 	/* pi_mutex gets initialized later */
-	pi_state->owner = NULL;
-	atomic_set(&pi_state->refcount, 1);
-	pi_state->key = FUTEX_KEY_INIT;
+	state->owner = NULL;
+	atomic_set(&state->refcount, 1);
+	state->key = FUTEX_KEY_INIT;
 
-	current->pi_state_cache = pi_state;
+	current->pi_state_cache = state;
 
 	return 0;
 }
 
-static struct futex_pi_state * alloc_pi_state(void)
+static struct futex_state *alloc_futex_state(void)
 {
-	struct futex_pi_state *pi_state = current->pi_state_cache;
+	struct futex_state *state = current->pi_state_cache;
 
-	WARN_ON(!pi_state);
+	WARN_ON(!state);
 	current->pi_state_cache = NULL;
 
-	return pi_state;
+	return state;
 }
 
 /*
- * Drops a reference to the pi_state object and frees or caches it
+ * Drops a reference to the futex state object and frees or caches it
  * when the last reference is gone.
  *
  * Must be called with the hb lock held.
  */
-static void put_pi_state(struct futex_pi_state *pi_state)
+static void put_futex_state(struct futex_state *state)
 {
-	if (!pi_state)
+	if (!state)
 		return;
 
-	if (!atomic_dec_and_test(&pi_state->refcount))
+	if (!atomic_dec_and_test(&state->refcount))
 		return;
 
 	/*
-	 * If pi_state->owner is NULL, the owner is most probably dying
-	 * and has cleaned up the pi_state already
+	 * If state->owner is NULL, the owner is most probably dying
+	 * and has cleaned up the futex state already
 	 */
-	if (pi_state->owner) {
-		raw_spin_lock_irq(&pi_state->owner->pi_lock);
-		list_del_init(&pi_state->list);
-		raw_spin_unlock_irq(&pi_state->owner->pi_lock);
+	if (state->owner) {
+		raw_spin_lock_irq(&state->owner->pi_lock);
+		list_del_init(&state->list);
+		raw_spin_unlock_irq(&state->owner->pi_lock);
 
-		rt_mutex_proxy_unlock(&pi_state->pi_mutex, pi_state->owner);
+		rt_mutex_proxy_unlock(&state->pi_mutex, state->owner);
 	}
 
 	if (current->pi_state_cache)
-		kfree(pi_state);
+		kfree(state);
 	else {
 		/*
-		 * pi_state->list is already empty.
-		 * clear pi_state->owner.
+		 * state->list is already empty.
+		 * clear state->owner.
 		 * refcount is at 0 - put it back to 1.
 		 */
-		pi_state->owner = NULL;
-		atomic_set(&pi_state->refcount, 1);
-		current->pi_state_cache = pi_state;
+		state->owner = NULL;
+		atomic_set(&state->refcount, 1);
+		current->pi_state_cache = state;
 	}
 }
 
@@ -886,7 +887,7 @@ static struct task_struct * futex_find_get_task(pid_t pid)
 void exit_pi_state_list(struct task_struct *curr)
 {
 	struct list_head *next, *head = &curr->pi_state_list;
-	struct futex_pi_state *pi_state;
+	struct futex_state *pi_state;
 	struct futex_hash_bucket *hb;
 	union futex_key key = FUTEX_KEY_INIT;
 
@@ -901,7 +902,7 @@ void exit_pi_state_list(struct task_struct *curr)
 	while (!list_empty(head)) {
 
 		next = head->next;
-		pi_state = list_entry(next, struct futex_pi_state, list);
+		pi_state = list_entry(next, struct futex_state, list);
 		key = pi_state->key;
 		hb = hash_futex(&key);
 		raw_spin_unlock_irq(&curr->pi_lock);
@@ -988,8 +989,8 @@ void exit_pi_state_list(struct task_struct *curr)
  * the pi_state against the user space value. If correct, attach to
  * it.
  */
-static int attach_to_pi_state(u32 uval, struct futex_pi_state *pi_state,
-			      struct futex_pi_state **ps)
+static int attach_to_pi_state(u32 uval, struct futex_state *pi_state,
+			      struct futex_state **ps)
 {
 	pid_t pid = uval & FUTEX_TID_MASK;
 
@@ -1060,10 +1061,10 @@ out_state:
  * it after doing proper sanity checks.
  */
 static int attach_to_pi_owner(u32 uval, union futex_key *key,
-			      struct futex_pi_state **ps)
+			      struct futex_state **ps)
 {
 	pid_t pid = uval & FUTEX_TID_MASK;
-	struct futex_pi_state *pi_state;
+	struct futex_state *pi_state;
 	struct task_struct *p;
 
 	/*
@@ -1104,7 +1105,7 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
 	/*
 	 * No existing pi state. First waiter. [2]
 	 */
-	pi_state = alloc_pi_state();
+	pi_state = alloc_futex_state();
 
 	/*
 	 * Initialize the pi_mutex in locked state and make @p
@@ -1128,7 +1129,7 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
 }
 
 static int lookup_pi_state(u32 uval, struct futex_hash_bucket *hb,
-			   union futex_key *key, struct futex_pi_state **ps)
+			   union futex_key *key, struct futex_state **ps)
 {
 	struct futex_q *match = futex_top_waiter(hb, key);
 
@@ -1180,7 +1181,7 @@ static int lock_pi_update_atomic(u32 __user *uaddr, u32 uval, u32 newval)
  */
 static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
 				union futex_key *key,
-				struct futex_pi_state **ps,
+				struct futex_state **ps,
 				struct task_struct *task, int set_waiters)
 {
 	u32 uval, newval, vpid = task_pid_vnr(task);
@@ -1306,7 +1307,7 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this,
 			 struct futex_hash_bucket *hb)
 {
 	struct task_struct *new_owner;
-	struct futex_pi_state *pi_state = this->pi_state;
+	struct futex_state *pi_state = this->pi_state;
 	u32 uninitialized_var(curval), newval;
 	WAKE_Q(wake_q);
 	bool deboost;
@@ -1646,7 +1647,7 @@ static int futex_proxy_trylock_atomic(u32 __user *pifutex,
 				 struct futex_hash_bucket *hb1,
 				 struct futex_hash_bucket *hb2,
 				 union futex_key *key1, union futex_key *key2,
-				 struct futex_pi_state **ps, int set_waiters)
+				 struct futex_state **ps, int set_waiters)
 {
 	struct futex_q *top_waiter = NULL;
 	u32 curval;
@@ -1715,7 +1716,7 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags,
 {
 	union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT;
 	int drop_count = 0, task_count = 0, ret;
-	struct futex_pi_state *pi_state = NULL;
+	struct futex_state *pi_state = NULL;
 	struct futex_hash_bucket *hb1, *hb2;
 	struct futex_q *this, *next;
 	WAKE_Q(wake_q);
@@ -1732,7 +1733,7 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags,
 		 * requeue_pi requires a pi_state, try to allocate it now
 		 * without any locks in case it fails.
 		 */
-		if (refill_pi_state_cache())
+		if (refill_futex_state_cache())
 			return -ENOMEM;
 		/*
 		 * requeue_pi must wake as many tasks as it can, up to nr_wake
@@ -1944,7 +1945,7 @@ retry_private:
 				 * object.
 				 */
 				this->pi_state = NULL;
-				put_pi_state(pi_state);
+				put_futex_state(pi_state);
 				/*
 				 * We stop queueing more waiters and let user
 				 * space deal with the mess.
@@ -1961,7 +1962,7 @@ retry_private:
 	 * in futex_proxy_trylock_atomic() or in lookup_pi_state(). We
 	 * need to drop it here again.
 	 */
-	put_pi_state(pi_state);
+	put_futex_state(pi_state);
 
 out_unlock:
 	double_unlock_hb(hb1, hb2);
@@ -2116,7 +2117,7 @@ static void unqueue_me_pi(struct futex_q *q)
 	__unqueue_futex(q);
 
 	BUG_ON(!q->pi_state);
-	put_pi_state(q->pi_state);
+	put_futex_state(q->pi_state);
 	q->pi_state = NULL;
 
 	spin_unlock(q->lock_ptr);
@@ -2132,7 +2133,7 @@ static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q,
 				struct task_struct *newowner)
 {
 	u32 newtid = task_pid_vnr(newowner) | FUTEX_WAITERS;
-	struct futex_pi_state *pi_state = q->pi_state;
+	struct futex_state *pi_state = q->pi_state;
 	struct task_struct *oldowner = pi_state->owner;
 	u32 uval, uninitialized_var(curval), newval;
 	int ret;
@@ -2508,7 +2509,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags,
 	struct futex_q q = futex_q_init;
 	int res, ret;
 
-	if (refill_pi_state_cache())
+	if (refill_futex_state_cache())
 		return -ENOMEM;
 
 	futex_set_timer(time, &to, &timeout, FLAGS_CLOCKRT, 0);
@@ -2889,7 +2890,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 			 * Drop the reference to the pi state which
 			 * the requeue_pi() code acquired for us.
 			 */
-			put_pi_state(q.pi_state);
+			put_futex_state(q.pi_state);
 			spin_unlock(q.lock_ptr);
 		}
 	} else {
-- 
1.7.1


* [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
  2016-09-20 13:42 [RFC PATCH v2 0/5] futex: Introducing throughput-optimized futexes Waiman Long
  2016-09-20 13:42 ` [RFC PATCH v2 1/5] futex: Add futex_set_timer() helper function Waiman Long
  2016-09-20 13:42 ` [RFC PATCH v2 2/5] futex: Rename futex_pi_state to futex_state Waiman Long
@ 2016-09-20 13:42 ` Waiman Long
  2016-09-21  6:59   ` Mike Galbraith
  2016-09-22 13:23   ` Peter Zijlstra
  2016-09-20 13:42 ` [RFC PATCH v2 4/5] futex: Add timeout support to TO futexes Waiman Long
  2016-09-20 13:42 ` [RFC PATCH v2 5/5] futex, doc: TO futexes document Waiman Long
  4 siblings, 2 replies; 35+ messages in thread
From: Waiman Long @ 2016-09-20 13:42 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Davidlohr Bueso, Jason Low,
	Scott J Norton, Douglas Hatch, Waiman Long

This patch introduces a new futex implementation called
throughput-optimized (TO) futexes. The goal of this new futex
type is to maximize locking throughput at the expense of fairness
and deterministic latency. Its throughput is higher than that of
the wait-wake futexes, especially on systems with a large number of
CPUs where the lock owners are unlikely to sleep. The downside is an
increase in response time variance. It also implements a lock
hand-off mechanism to make sure that lock starvation won't happen.
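
For context, the futex word encoding follows the existing PI futex
ABI (this summary is mine, not from the patch):

  0                       - unlocked
  TID                     - locked by thread TID, uncontended
  TID | FUTEX_WAITERS     - locked; the unlocker must call into the kernel
  TID | FUTEX_OWNER_DIED  - the lock holder exited without unlocking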

Using a futex locking microbenchmark to do 10 million locking
operations with 5 pause instructions as the critical section by 256
threads (10M/256 locking ops each) on a 4-socket 72-core 144-thread
Haswell-EX system, the results of the benchmark runs were as follows:

                wait-wake futex     PI futex        TO futex
                ---------------     --------        --------
max time            3.49s            50.91s          2.65s
min time            3.24s            50.84s          0.07s
average time        3.41s            50.90s          1.84s
sys time          7m22.4s            55.73s        2m32.9s
lock count       3,090,294          9,999,813       698,318
unlock count     3,268,896          9,999,814           134

The lock and unlock counts above show the actual numbers of futex(2)
lock and unlock syscalls that were issued.
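
The per-thread loop of the microbenchmark was essentially as follows
(a hypothetical reconstruction, reusing the to_lock()/to_unlock()
wrappers sketched in the cover letter; NITERS and the thread function
name are my own):

#define NITERS	(10000000 / 256)	/* 10M ops split across 256 threads */

static atomic_uint futex_word;

static void *locker(void *arg)
{
	unsigned int tid = (unsigned int)syscall(SYS_gettid);
	long i, j;

	(void)arg;
	for (i = 0; i < NITERS; i++) {
		to_lock(&futex_word, tid);
		for (j = 0; j < 5; j++)		/* the critical section */
			__builtin_ia32_pause();	/* x86 pause instruction */
		to_unlock(&futex_word, tid);
	}
	return NULL;
}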

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 include/uapi/linux/futex.h |    4 +
 kernel/futex.c             |  634 +++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 627 insertions(+), 11 deletions(-)

diff --git a/include/uapi/linux/futex.h b/include/uapi/linux/futex.h
index 0b1f716..e7deaf3 100644
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -20,6 +20,8 @@
 #define FUTEX_WAKE_BITSET	10
 #define FUTEX_WAIT_REQUEUE_PI	11
 #define FUTEX_CMP_REQUEUE_PI	12
+#define FUTEX_LOCK_TO		13
+#define FUTEX_UNLOCK_TO		14
 
 #define FUTEX_PRIVATE_FLAG	128
 #define FUTEX_CLOCK_REALTIME	256
@@ -39,6 +41,8 @@
 					 FUTEX_PRIVATE_FLAG)
 #define FUTEX_CMP_REQUEUE_PI_PRIVATE	(FUTEX_CMP_REQUEUE_PI | \
 					 FUTEX_PRIVATE_FLAG)
+#define FUTEX_LOCK_TO_PRIVATE	(FUTEX_LOCK_TO | FUTEX_PRIVATE_FLAG)
+#define FUTEX_UNLOCK_TO_PRIVATE	(FUTEX_UNLOCK_TO | FUTEX_PRIVATE_FLAG)
 
 /*
  * Support for robust futexes: the kernel cleans up held futexes at
diff --git a/kernel/futex.c b/kernel/futex.c
index f8bb93f..7daba56 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -191,6 +191,11 @@ int __read_mostly futex_cmpxchg_enabled;
 #define FLAGS_CLOCKRT		0x02
 #define FLAGS_HAS_TIMEOUT	0x04
 
+enum futex_type {
+	TYPE_PI = 0,
+	TYPE_TO,
+};
+
 /*
  * Futex state object:
  *  - Priority Inheritance state
@@ -203,13 +208,30 @@ struct futex_state {
 	struct list_head list;
 
 	/*
-	 * The PI object:
+	 * Linking into the owning hashed bucket (TO futexes only)
+	 */
+	struct list_head hb_list;
+
+	/*
+	 * The PI or mutex object:
 	 */
-	struct rt_mutex pi_mutex;
+	union {
+		struct rt_mutex pi_mutex;
+		struct mutex mutex;
+	};
 
+	/*
+	 * For PI futex, owner is the task that owns the futex.
+	 * For TO futex, owner is the mutex lock holder that is either
+	 * spinning on the futex owner or sleeping.
+	 */
 	struct task_struct *owner;
 	atomic_t refcount;
 
+	enum futex_type type;
+
+	u32 handoff_pid;	/* TO only, PID for lock hand-off */
+
 	union futex_key key;
 };
 
@@ -262,6 +284,7 @@ struct futex_hash_bucket {
 	atomic_t waiters;
 	spinlock_t lock;
 	struct plist_head chain;
+	struct list_head fs_list;	/* List of futex state objects */
 } ____cacheline_aligned_in_smp;
 
 /*
@@ -801,8 +824,11 @@ static int refill_futex_state_cache(void)
 		return -ENOMEM;
 
 	INIT_LIST_HEAD(&state->list);
+	INIT_LIST_HEAD(&state->hb_list);
+
 	/* pi_mutex gets initialized later */
 	state->owner = NULL;
+	state->handoff_pid = 0;
 	atomic_set(&state->refcount, 1);
 	state->key = FUTEX_KEY_INIT;
 
@@ -836,10 +862,10 @@ static void put_futex_state(struct futex_state *state)
 		return;
 
 	/*
-	 * If state->owner is NULL, the owner is most probably dying
-	 * and has cleaned up the futex state already
+	 * If state->owner is NULL and the type is TYPE_PI, the owner
+	 * is most probably dying and has cleaned up the state already
 	 */
-	if (state->owner) {
+	if (state->owner && (state->type == TYPE_PI)) {
 		raw_spin_lock_irq(&state->owner->pi_lock);
 		list_del_init(&state->list);
 		raw_spin_unlock_irq(&state->owner->pi_lock);
@@ -847,6 +873,10 @@ static void put_futex_state(struct futex_state *state)
 		rt_mutex_proxy_unlock(&state->pi_mutex, state->owner);
 	}
 
+	/*
+	 * Dequeue it from the HB futex state list.
+	 */
+	list_del_init(&state->hb_list);
 	if (current->pi_state_cache)
 		kfree(state);
 	else {
@@ -919,13 +949,24 @@ void exit_pi_state_list(struct task_struct *curr)
 			continue;
 		}
 
-		WARN_ON(pi_state->owner != curr);
 		WARN_ON(list_empty(&pi_state->list));
+		if (pi_state->type == TYPE_PI) {
+			WARN_ON(pi_state->owner != curr);
+			pi_state->owner = NULL;
+		}
 		list_del_init(&pi_state->list);
-		pi_state->owner = NULL;
 		raw_spin_unlock_irq(&curr->pi_lock);
 
-		rt_mutex_unlock(&pi_state->pi_mutex);
+		if (pi_state->type == TYPE_PI)
+			rt_mutex_unlock(&pi_state->pi_mutex);
+		else if (pi_state->type == TYPE_TO) {
+			/*
+			 * Need to wakeup the mutex owner.
+			 */
+			WARN_ON(!pi_state->owner);
+			if (pi_state->owner)
+				wake_up_process(pi_state->owner);
+		}
 
 		spin_unlock(&hb->lock);
 
@@ -997,7 +1038,7 @@ static int attach_to_pi_state(u32 uval, struct futex_state *pi_state,
 	/*
 	 * Userspace might have messed up non-PI and PI futexes [3]
 	 */
-	if (unlikely(!pi_state))
+	if (unlikely(!pi_state || (pi_state->type != TYPE_PI)))
 		return -EINVAL;
 
 	WARN_ON(!atomic_read(&pi_state->refcount));
@@ -1115,6 +1156,7 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
 
 	/* Store the key for possible exit cleanups: */
 	pi_state->key = *key;
+	pi_state->type = TYPE_PI;
 
 	WARN_ON(!list_empty(&pi_state->list));
 	list_add(&pi_state->list, &p->pi_state_list);
@@ -3075,8 +3117,8 @@ retry:
 			goto retry;
 
 		/*
-		 * Wake robust non-PI futexes here. The wakeup of
-		 * PI futexes happens in exit_pi_state():
+		 * Wake robust non-PI/TO futexes here. The wakeup of
+		 * PI/TO futexes happens in exit_pi_state():
 		 */
 		if (!pi && (uval & FUTEX_WAITERS))
 			futex_wake(uaddr, 1, 1, FUTEX_BITSET_MATCH_ANY);
@@ -3171,6 +3213,565 @@ void exit_robust_list(struct task_struct *curr)
 				   curr, pip);
 }
 
+#ifdef CONFIG_SMP
+/*
+ * Throughput-Optimized Futexes
+ * ----------------------------
+ *
+ * Userspace mutual exclusion locks can be implemented either by using the
+ * wait-wake futexes or the PI futexes. The wait-wake futexes have much higher
+ * throughput but can't guarantee minimal latency for high priority processes
+ * like the PI futexes. Even then, the overhead of wait-wake futexes in
+ * userspace locking primitives can easily become a performance bottleneck
+ * on systems with a large number of cores.
+ *
+ * The throughput-optimized (TO) futex is a new futex implementation that
+ * provides higher throughput than wait-wake futex via the use of following
+ * two techniques:
+ *  1) Optimistic spinning when futex owner is actively running
+ *  2) Lock stealing
+ *
+ * Optimistic spinning isn't present in other futexes. Lock stealing is
+ * possible in wait-wake futexes, but isn't actively encouraged as it
+ * is with TO futexes. The downside, however, is a much higher variance
+ * of the response times.
+ *
+ * TO futexes have a built-in kernel lock hand-off mechanism that prevents
+ * lock starvation, as long as the serialization mutex code itself can
+ * guarantee freedom from starvation. For the wait-wake futexes, however,
+ * lock starvation prevention has to be done in userspace, if desired.
+ *
+ * The use of TO futexes is very similar to the PI futexes. Locking is done
+ * by atomically transitioning the futex from 0 to the task's thread ID.
+ * Unlocking is done by atomically changing the futex from thread ID to 0.
+ * Any failure to do so will require calls to the kernel to do the locking
+ * and unlocking.
+ *
+ * Within the kernel, trylocks are done ignoring the FUTEX_WAITERS bit. The
+ * purpose of this FUTEX_WAITERS bit is to make the unlocker wake up the
+ * serialization mutex owner.
+ *
+ * Like PI futexes, TO futexes are orthogonal to robust futexes. However,
+ * TO futexes handle dying processes similarly to the PI futexes. So
+ * userspace management of locks for exiting lock holders should not be
+ * needed for TO futexes, except to avoid the rare cases of pid wraparound.
+ *
+ * Unlike the other futexes, waiters of the TO futexes will not queue
+ * themselves to the plist of the hash bucket. Instead, they will queue up
+ * in the serialization mutex of the futex state container queued in the
+ * hash bucket.
+ */
+
+/*
+ * Looking up the futex state structure.
+ *
+ * It differs from lookup_pi_state() in that it searches the fs_list in the
+ * hash bucket instead of the waiters in the plist. The HB lock must be held
+ * before calling it. If the search_only flag isn't set, a new state structure
+ * will be pushed into the list and returned if there is no matching key.
+ *
+ * The reference count won't be incremented for search_only call as
+ * the futex state won't be destroyed while the HB lock is being held.
+ */
+static struct futex_state *
+lookup_futex_state(struct futex_hash_bucket *hb, union futex_key *key,
+		   bool search_only)
+{
+	struct futex_state *state;
+
+	list_for_each_entry(state, &hb->fs_list, hb_list)
+		if (match_futex(key, &state->key)) {
+			if (!search_only)
+				atomic_inc(&state->refcount);
+			return state;
+		}
+
+	if (search_only)
+		return NULL;
+
+	/*
+	 * Push a new one into the list and return it.
+	 */
+	state = alloc_futex_state();
+	state->type = TYPE_TO;
+	state->key = *key;
+	list_add(&state->hb_list, &hb->fs_list);
+	WARN_ON(atomic_read(&state->refcount) != 1);
+
+	/*
+	 * Initialize the mutex structure.
+	 */
+	mutex_init(&state->mutex);
+	WARN_ON(!list_empty(&state->list));
+	return state;
+}
+
+/*
+ * Fast dropping of reference to the futex state object while not holding
+ * the HB lock.
+ *
+ * Return: 1 if successful and 0 if the reference count is going to be
+ *	   decremented to 0 and hence has to be done under lock.
+ */
+static inline int put_futex_state_unlocked(struct futex_state *state)
+{
+	return atomic_add_unless(&state->refcount, -1, 1);
+}
+
+/*
+ * Spinning threshold before enabling lock handoff.
+ * Each sleep will decrement the threshold by 1/32 of the start value.
+ */
+#define TO_SPIN_THRESHOLD	(1 << 13)
+#define TO_SLEEP_DECREMENT	(TO_SPIN_THRESHOLD/32)
+
+/*
+ * futex_lock_to returned values:
+ * 0 - mutex owner acquires the lock
+ * 1 - steals the lock
+ * 2 - handed off the lock
+ */
+#define TO_LOCK_ACQUIRED	0
+#define TO_LOCK_STOLEN		1
+#define TO_LOCK_HANDOFF		2
+
+/*
+ * Try to lock the userspace futex word (0 => vpid).
+ *
+ * Return: 1 if lock acquired or an error happens, 0 if not.
+ *	   The status code will be 0 if no error, or < 0 if an error happens.
+ *	   *puval will contain the latest futex value when trylock fails.
+ *
+ * The HB spinlock should NOT be held while calling this function.
+ * The flag bits are ignored in the trylock.
+ *
+ * If waiter is true
+ * then
+ *   don't preserve the flag bits;
+ *   check for handoff (futex word == own pid)
+ * else
+ *   preserve the flag bits
+ * endif
+ */
+static inline int futex_trylock_to(u32 __user *uaddr, u32 vpid, u32 *puval,
+				   int *status, const bool waiter)
+{
+	u32 uval, flags = 0;
+
+	*status = 0;
+
+	if (unlikely(get_user(uval, uaddr)))
+		goto efault;
+
+	*puval = uval;
+
+	if (!waiter) {
+		flags |= (uval & ~FUTEX_TID_MASK);
+	} else if ((uval & FUTEX_TID_MASK) == vpid) {
+		*status = TO_LOCK_HANDOFF;
+		return 1;
+	}
+
+	if (uval & FUTEX_TID_MASK)
+		return 0;	/* Trylock fails */
+
+	if (unlikely(futex_atomic_cmpxchg_inatomic(puval, uaddr, uval,
+						   vpid | flags)))
+		goto efault;
+
+	return *puval == uval;
+
+efault:
+	*status = -EFAULT;
+	return 1;
+}
+
+/*
+ * Set the FUTEX_WAITERS bit while not holding the HB lock. The given puval
+ * value points to the initial value of the futex lock.
+ *
+ * Return: 0 if successful or < 0 if an error happens.
+ */
+static inline int futex_set_waiters_unlocked(u32 __user *uaddr, u32 *puval)
+{
+	u32 curval, uval = *puval;
+
+	while (!(uval & FUTEX_WAITERS)) {
+		/*
+		 * Set the FUTEX_WAITERS bit.
+		 */
+		if (futex_atomic_cmpxchg_inatomic(&curval,
+			uaddr, uval, uval | FUTEX_WAITERS))
+			return -1;
+		if (curval == uval)
+			break;
+		uval = curval;
+	}
+	*puval = uval | FUTEX_WAITERS;
+	return 0;
+}
+
+/*
+ * Spin on the futex word while the futex owner is active. Otherwise, set
+ * the FUTEX_WAITERS bit and go to sleep. As we take a reference to the futex
+ * owner's task structure, we don't need to use RCU to ensure that the task
+ * structure is valid. The function will directly grab the lock if the
+ * owner is dying or the pid is invalid. That should take care of the problem
+ * of dead lock owners unless the pid wraps around and the perceived owner is
+ * not the real owner.
+ *
+ * Return: 0 if futex acquired, < 0 if an error happens.
+ */
+static int futex_spin_on_owner(u32 __user *uaddr, u32 vpid,
+			       struct futex_state *state)
+{
+	int ret, loop = TO_SPIN_THRESHOLD;
+	u32 uval, curval;
+	u32 opid = 0;				/* Futex owner task ID */
+	struct task_struct *otask = NULL;	/* Futex owner task struct */
+	bool on_owner_list = false;
+
+	WRITE_ONCE(state->owner, current);
+	preempt_disable();
+	for (;; loop--) {
+		if (futex_trylock_to(uaddr, vpid, &uval, &ret, true))
+			break;
+
+		if (((uval & FUTEX_TID_MASK) != opid) ||
+		     (uval & FUTEX_OWNER_DIED)) {
+			/*
+			 * Get the new task structure
+			 */
+			if (otask) {
+				if (on_owner_list) {
+					raw_spin_lock_irq(&otask->pi_lock);
+					list_del_init(&state->list);
+					raw_spin_unlock_irq(&otask->pi_lock);
+					on_owner_list = false;
+				}
+				put_task_struct(otask);
+			}
+
+			if (likely(!(uval & FUTEX_OWNER_DIED))) {
+				opid  = uval & FUTEX_TID_MASK;
+				otask = futex_find_get_task(opid);
+			} else {
+				opid = 0;
+				otask = NULL;
+			}
+		}
+		if (unlikely(!otask || (otask->flags & PF_EXITING) ||
+			    (uval & FUTEX_OWNER_DIED))) {
+			/*
+			 * PID invalid or exiting/dead task, try to grab
+			 * the lock now.
+			 */
+			ret = futex_atomic_cmpxchg_inatomic(&curval,
+					uaddr, uval, vpid);
+			if (unlikely(ret))
+				goto efault;
+			if (curval != uval)
+				continue;	/* Futex value changed */
+			pr_info("futex_spin_on_owner: pid %d grabs futex from pid %d (%s)!\n",
+				vpid, opid, otask ? "dying" : "invalid");
+			break;
+		}
+
+		if (need_resched()) {
+			__set_current_state(TASK_RUNNING);
+			schedule_preempt_disabled();
+			continue;
+		}
+
+		/* Check for signal */
+		if (signal_pending(current)) {
+			ret = -EINTR;
+			break;
+		}
+
+		/*
+		 * Enable lock handoff if the threshold reaches 0.
+		 * We also need to set the FUTEX_WAITERS bit to make sure
+		 * that futex lock holder will initiate the handoff at
+		 * unlock time.
+		 */
+		if ((loop <= 0) && !READ_ONCE(state->handoff_pid)) {
+			WRITE_ONCE(state->handoff_pid, vpid);
+		} else if (otask->on_cpu) {
+			cpu_relax();
+			continue;
+		}
+
+		/*
+		 * If the owner isn't active, we need to go to sleep after
+		 * making sure that the FUTEX_WAITERS bit is set. We also
+		 * need to put the futex state into the futex owner's
+		 * pi_state_list to prevent deadlock when the owner dies.
+		 */
+		if (futex_set_waiters_unlocked(uaddr, &uval) < 0)
+			goto efault;
+
+		if (otask->on_cpu) {
+			cpu_relax();
+			continue;
+		}
+
+		if (!on_owner_list) {
+			raw_spin_lock_irq(&otask->pi_lock);
+			if (unlikely(otask->flags & PF_EXITING)) {
+				/*
+				 * Task is exiting, can directly
+				 * grab the futex instead.
+				 */
+				raw_spin_unlock_irq(&otask->pi_lock);
+				continue;
+			}
+			WARN_ON(!list_empty(&state->list));
+			list_add(&state->list, &otask->pi_state_list);
+			raw_spin_unlock_irq(&otask->pi_lock);
+			on_owner_list = true;
+		}
+
+		/*
+		 * Do a trylock after setting the task state to make
+		 * sure we won't miss a wakeup.
+		 *
+		 * Futex owner		Mutex owner
+		 * -----------		-----------
+		 *  unlock		set state
+		 *  MB			MB
+		 *  read state		trylock
+		 *  wakeup		sleep
+		 */
+		set_current_state(TASK_INTERRUPTIBLE);
+		if (futex_trylock_to(uaddr, vpid, &uval, &ret, true)) {
+			__set_current_state(TASK_RUNNING);
+			break;
+		}
+
+		/*
+		 * Don't sleep if the owner has died or the FUTEX_WAITERS bit
+		 * was cleared. The latter case can happen when unlock and
+		 * lock stealing happen in between setting the FUTEX_WAITERS
+		 * and setting state to TASK_INTERRUPTIBLE.
+		 */
+		if (!(uval & FUTEX_OWNER_DIED) && (uval & FUTEX_WAITERS)) {
+			schedule_preempt_disabled();
+			loop -= TO_SLEEP_DECREMENT;
+		}
+		__set_current_state(TASK_RUNNING);
+	}
+out:
+	preempt_enable();
+	if (on_owner_list && !list_empty(&state->list)) {
+		BUG_ON(!otask);
+		raw_spin_lock_irq(&otask->pi_lock);
+		list_del_init(&state->list);
+		raw_spin_unlock_irq(&otask->pi_lock);
+	}
+	if (otask)
+		put_task_struct(otask);
+	/*
+	 * Cleanup futex state.
+	 */
+	WRITE_ONCE(state->owner, NULL);
+	WRITE_ONCE(state->handoff_pid, 0);
+	return ret;
+
+efault:
+	ret = -EFAULT;
+	goto out;
+}
+
+/*
+ * Userspace tried a 0 -> TID atomic transition of the futex value
+ * and failed. The kernel side here does the whole locking operation.
+ * The kernel mutex will be used for serialization. Once becoming the
+ * sole mutex lock owner, it will spin on the futex owner's task structure
+ * to see if it is running. It will also spin on the futex word so as to grab
+ * the lock as soon as it is free.
+ *
+ * This function is not inlined so that it can show up in stack traces
+ * for analysis purposes.
+ *
+ * Return:
+ *  < 0 - an error happens
+ *  0   - acquires the lock via futex_spin_on_owner()
+ *  1   - steals the lock
+ *  2   - lock handed off from unlocker
+ */
+static noinline int
+futex_lock_to(u32 __user *uaddr, unsigned int flags)
+{
+	struct futex_hash_bucket *hb;
+	union futex_key key = FUTEX_KEY_INIT;
+	struct futex_state *state;
+	u32 uval, vpid = task_pid_vnr(current);
+	int ret;
+
+
+	/*
+	 * Steal the lock and preserve the flag bits.
+	 */
+	if (futex_trylock_to(uaddr, vpid, &uval, &ret, false))
+		/* Lock acquired or an error happens */
+		return (ret < 0) ? ret : TO_LOCK_STOLEN;
+
+	/*
+	 * Detect deadlocks.
+	 */
+	if (unlikely(((uval & FUTEX_TID_MASK) == vpid) ||
+			should_fail_futex(true)))
+		return -EDEADLK;
+
+	if (refill_futex_state_cache())
+		return -ENOMEM;
+
+	ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_WRITE);
+	if (unlikely(ret))
+		goto out;
+
+	hb = hash_futex(&key);
+	spin_lock(&hb->lock);
+
+	/*
+	 * Locate the futex state by looking up the futex state list in the
+	 * hash bucket. If it isn't found, create a new one and put it into
+	 * the list.
+	 */
+	state = lookup_futex_state(hb, &key, false);
+
+	/*
+	 * We don't need to hold the HB lock after looking up the futex state
+	 * as we have incremented the reference count.
+	 */
+	spin_unlock(&hb->lock);
+	BUG_ON(!state);
+
+	/*
+	 * Acquiring the serialization mutex.
+	 */
+	if (state->type != TYPE_TO)
+		ret = -EINVAL;
+	else
+		ret = mutex_lock_interruptible(&state->mutex);
+
+	if (unlikely(ret))
+		/*
+		 * We got a signal or some other error; need to abort
+		 * the lock operation and return.
+		 */
+		goto out_put_state_key;
+
+	/*
+	 * As the mutex owner, we can now spin on the futex word as well as
+	 * the active-ness of the futex owner.
+	 */
+	ret = futex_spin_on_owner(uaddr, vpid, state);
+
+	mutex_unlock(&state->mutex);
+
+out_put_state_key:
+	if (!put_futex_state_unlocked(state)) {
+		/*
+		 * May need to free the futex state object and so must be
+		 * under HB lock.
+		 */
+		spin_lock(&hb->lock);
+		put_futex_state(state);
+		spin_unlock(&hb->lock);
+	}
+	put_futex_key(&key);
+
+out:
+	return ret;
+}
+
+/*
+ * Userspace attempted a TID -> 0 atomic transition, and failed.
+ * This is the in-kernel slowpath: we look up the futex state (if any),
+ * and wake up the mutex owner.
+ *
+ * Return: 1 if a wakeup is attempted, 0 if no task to wake,
+ * 	   or < 0 when an error happens.
+ */
+static int futex_unlock_to(u32 __user *uaddr, unsigned int flags)
+{
+	u32 uval, newpid = 0, vpid = task_pid_vnr(current);
+	union futex_key key = FUTEX_KEY_INIT;
+	struct futex_hash_bucket *hb;
+	struct futex_state *state = NULL;
+	struct task_struct *owner = NULL;
+	int ret;
+	WAKE_Q(wake_q);
+
+	if (get_user(uval, uaddr))
+		return -EFAULT;
+
+	if ((uval & FUTEX_TID_MASK) != vpid)
+		return -EPERM;
+
+	if (!(uval & FUTEX_WAITERS))
+		return -EINVAL;
+
+	ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_WRITE);
+	if (ret)
+		return ret;
+
+	hb = hash_futex(&key);
+	spin_lock(&hb->lock);
+
+	/*
+	 * Check the hash bucket only for matching futex state.
+	 */
+	state = lookup_futex_state(hb, &key, true);
+	WARN_ON_ONCE(!state);
+
+	if (state) {
+		if (state->type != TYPE_TO) {
+			ret = -EINVAL;
+			goto out_unlock;
+		}
+
+		newpid = READ_ONCE(state->handoff_pid);
+		if (newpid)
+			WRITE_ONCE(state->handoff_pid, 0);
+
+		owner = READ_ONCE(state->owner);
+		if (owner)
+			wake_q_add(&wake_q, owner);
+	}
+
+	/*
+	 * Unlock the futex or handoff to the next owner.
+	 * The flag bits are not preserved to encourage more lock stealing.
+	 */
+	for (;;) {
+		u32 old = uval;
+
+		if (cmpxchg_futex_value_locked(&uval, uaddr, old, newpid)) {
+			ret = -EFAULT;
+			goto out_unlock;
+		}
+		if (old == uval)
+			break;
+	}
+
+out_unlock:
+	spin_unlock(&hb->lock);
+	put_futex_key(&key);
+	if (owner) {
+		/*
+		 * No error has happened if owner is defined.
+		 */
+		wake_up_q(&wake_q);
+		return 1;
+	}
+
+	return ret;
+}
+#endif /* CONFIG_SMP */
+
 long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 		u32 __user *uaddr2, u32 val2, u32 val3)
 {
@@ -3193,6 +3794,10 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 	case FUTEX_TRYLOCK_PI:
 	case FUTEX_WAIT_REQUEUE_PI:
 	case FUTEX_CMP_REQUEUE_PI:
+#ifdef CONFIG_SMP
+	case FUTEX_LOCK_TO:
+	case FUTEX_UNLOCK_TO:
+#endif
 		if (!futex_cmpxchg_enabled)
 			return -ENOSYS;
 	}
@@ -3224,6 +3829,12 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 					     uaddr2);
 	case FUTEX_CMP_REQUEUE_PI:
 		return futex_requeue(uaddr, flags, uaddr2, val, val2, &val3, 1);
+#ifdef CONFIG_SMP
+	case FUTEX_LOCK_TO:
+		return futex_lock_to(uaddr, flags);
+	case FUTEX_UNLOCK_TO:
+		return futex_unlock_to(uaddr, flags);
+#endif
 	}
 	return -ENOSYS;
 }
@@ -3307,6 +3918,7 @@ static int __init futex_init(void)
 	for (i = 0; i < futex_hashsize; i++) {
 		atomic_set(&futex_queues[i].waiters, 0);
 		plist_head_init(&futex_queues[i].chain);
+		INIT_LIST_HEAD(&futex_queues[i].fs_list);
 		spin_lock_init(&futex_queues[i].lock);
 	}
 
-- 
1.7.1


* [RFC PATCH v2 4/5] futex: Add timeout support to TO futexes
  2016-09-20 13:42 [RFC PATCH v2 0/5] futex: Introducing throughput-optimized futexes Waiman Long
                   ` (2 preceding siblings ...)
  2016-09-20 13:42 ` [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes Waiman Long
@ 2016-09-20 13:42 ` Waiman Long
  2016-09-20 13:42 ` [RFC PATCH v2 5/5] futex, doc: TO futexes document Waiman Long
  4 siblings, 0 replies; 35+ messages in thread
From: Waiman Long @ 2016-09-20 13:42 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Davidlohr Bueso, Jason Low,
	Scott J Norton, Douglas Hatch, Waiman Long

This patch adds the timeout support available in other futex types
to the TO futexes. However, the timeout isn't as precise as in the
other futex types because timer expiration can't currently be detected
while the thread is waiting on the serialization mutex.
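
With this patch applied, a locking attempt can be bounded roughly as
follows (a sketch reusing futex_word from the earlier examples; as
with FUTEX_LOCK_PI, the timeout is passed as an absolute time, here
assumed to be against CLOCK_MONOTONIC since FUTEX_CLOCK_REALTIME
isn't set):

	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	ts.tv_sec += 1;		/* give up after about one second */

	if (syscall(SYS_futex, &futex_word, FUTEX_LOCK_TO, 0, &ts,
		    NULL, 0) < 0 && errno == ETIMEDOUT) {
		/* lock not acquired within the timeout */
	}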

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 kernel/futex.c |   40 +++++++++++++++++++++++++++++++---------
 1 files changed, 31 insertions(+), 9 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 7daba56..c73be79 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -3423,7 +3423,8 @@ static inline int futex_set_waiters_unlocked(u32 __user *uaddr, u32 *puval)
  * Return: 0 if futex acquired, < 0 if an error happens.
  */
 static int futex_spin_on_owner(u32 __user *uaddr, u32 vpid,
-			       struct futex_state *state)
+			       struct futex_state *state,
+			       struct hrtimer_sleeper *timeout)
 {
 	int ret, loop = TO_SPIN_THRESHOLD;
 	u32 uval, curval;
@@ -3483,8 +3484,11 @@ static int futex_spin_on_owner(u32 __user *uaddr, u32 vpid,
 			continue;
 		}
 
-		/* Check for signal */
-		if (signal_pending(current)) {
+		/* Check for timeout & signal */
+		if (timeout && !timeout->task) {
+			ret = -ETIMEDOUT;
+			break;
+		} else if (signal_pending(current)) {
 			ret = -EINTR;
 			break;
 		}
@@ -3594,6 +3598,12 @@ efault:
  * This function is not inlined so that it can show up in stack trace for
  * analysis purpose.
  *
+ * The timeout functionality isn't as precise as in other futex types,
+ * as timer expiration will not be detected while waiting in the
+ * serialization mutex. One possible solution is to modify the expiration
+ * function to send out a signal as well so as to break the thread out of
+ * the mutex code if we really want more precise timeouts.
+ *
  * Return:
  *  < 0 - an error happens
  *  0   - acquires the lock via futex_spin_on_owner()
@@ -3601,8 +3611,9 @@ efault:
  *  2   - lock handed off from unlocker
  */
 static noinline int
-futex_lock_to(u32 __user *uaddr, unsigned int flags)
+futex_lock_to(u32 __user *uaddr, unsigned int flags, ktime_t *time)
 {
+	struct hrtimer_sleeper timeout, *to = NULL;
 	struct futex_hash_bucket *hb;
 	union futex_key key = FUTEX_KEY_INIT;
 	struct futex_state *state;
@@ -3627,6 +3638,8 @@ futex_lock_to(u32 __user *uaddr, unsigned int flags)
 	if (refill_futex_state_cache())
 		return -ENOMEM;
 
+	futex_set_timer(time, &to, &timeout, flags, current->timer_slack_ns);
+
 	ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_WRITE);
 	if (unlikely(ret))
 		goto out;
@@ -3651,10 +3664,15 @@ futex_lock_to(u32 __user *uaddr, unsigned int flags)
 	/*
 	 * Acquiring the serialization mutex.
 	 */
-	if (state->type != TYPE_TO)
+	if (state->type != TYPE_TO) {
 		ret = -EINVAL;
-	else
+	} else {
+		if (to)
+			hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
 		ret = mutex_lock_interruptible(&state->mutex);
+		if (ret && to && !timeout.task)
+			ret = -ETIMEDOUT;
+	}
 
 	if (unlikely(ret))
 		/*
@@ -3667,7 +3685,7 @@ futex_lock_to(u32 __user *uaddr, unsigned int flags)
 	 * As the mutex owner, we can now spin on the futex word as well as
 	 * the active-ness of the futex owner.
 	 */
-	ret = futex_spin_on_owner(uaddr, vpid, state);
+	ret = futex_spin_on_owner(uaddr, vpid, state, to ? &timeout : NULL);
 
 	mutex_unlock(&state->mutex);
 
@@ -3684,6 +3702,10 @@ out_put_state_key:
 	put_futex_key(&key);
 
 out:
+	if (to) {
+		hrtimer_cancel(&to->timer);
+		destroy_hrtimer_on_stack(&to->timer);
+	}
 	return ret;
 }
 
@@ -3831,7 +3853,7 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 		return futex_requeue(uaddr, flags, uaddr2, val, val2, &val3, 1);
 #ifdef CONFIG_SMP
 	case FUTEX_LOCK_TO:
-		return futex_lock_to(uaddr, flags);
+		return futex_lock_to(uaddr, flags, timeout);
 	case FUTEX_UNLOCK_TO:
 		return futex_unlock_to(uaddr, flags);
 #endif
@@ -3850,7 +3872,7 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
 	int cmd = op & FUTEX_CMD_MASK;
 
 	if (utime && (cmd == FUTEX_WAIT || cmd == FUTEX_LOCK_PI ||
-		      cmd == FUTEX_WAIT_BITSET ||
+		      cmd == FUTEX_WAIT_BITSET || cmd == FUTEX_LOCK_TO ||
 		      cmd == FUTEX_WAIT_REQUEUE_PI)) {
 		if (unlikely(should_fail_futex(!(op & FUTEX_PRIVATE_FLAG))))
 			return -EFAULT;
-- 
1.7.1


* [RFC PATCH v2 5/5] futex, doc: TO futexes document
  2016-09-20 13:42 [RFC PATCH v2 0/5] futex: Introducing throughput-optimized futexes Waiman Long
                   ` (3 preceding siblings ...)
  2016-09-20 13:42 ` [RFC PATCH v2 4/5] futex: Add timeout support to TO futexes Waiman Long
@ 2016-09-20 13:42 ` Waiman Long
  4 siblings, 0 replies; 35+ messages in thread
From: Waiman Long @ 2016-09-20 13:42 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Davidlohr Bueso, Jason Low,
	Scott J Norton, Douglas Hatch, Waiman Long

This patch adds a new documentation file describing how to use the
TO futexes.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 Documentation/00-INDEX     |    2 +
 Documentation/to-futex.txt |  140 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 142 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/to-futex.txt

diff --git a/Documentation/00-INDEX b/Documentation/00-INDEX
index cb9a6c6..a39f020 100644
--- a/Documentation/00-INDEX
+++ b/Documentation/00-INDEX
@@ -439,6 +439,8 @@ this_cpu_ops.txt
 	- List rationale behind and the way to use this_cpu operations.
 thermal/
 	- directory with information on managing thermal issues (CPU/temp)
+to-futex.txt
+	- Documentation on lightweight throughput-optimized futexes.
 trace/
 	- directory with info on tracing technologies within linux
 unaligned-memory-access.txt
diff --git a/Documentation/to-futex.txt b/Documentation/to-futex.txt
new file mode 100644
index 0000000..bcd29d1
--- /dev/null
+++ b/Documentation/to-futex.txt
@@ -0,0 +1,140 @@
+Started by: Waiman Long <waiman.long@hpe.com>
+
+Throughput-Optimized Futexes
+----------------------------
+
+There are two main problems for a wait-wake futex (FUTEX_WAIT and
+FUTEX_WAKE) when used for creating user-space locking primitives:
+
+ 1) With a wait-wake futex, tasks waiting for a lock are put to sleep
+    in the futex queue to be woken up by the lock owner when it is done
+    with the lock. Waking up a sleeping task, however, introduces some
+    additional latency which can be large especially if the critical
+    section protected by the lock is relatively short. This may cause
+    a performance bottleneck on large systems with many CPUs running
+    applications that need a lot of inter-thread synchronization.
+
+ 2) The performance of the wait-wake futex is currently
+    spinlock-constrained.  When many threads are contending for a
+    futex in a large system with many CPUs, it is not unusual to have
+    spinlock contention accounting for more than 90% of the total
+    CPU cycles consumed at various points in time.
+
+These two problems can create performance bottlenecks with a
+futex-constrained workload, especially on systems with a large number
+of CPUs.
+
+The goal of the throughput-optimized (TO) futexes is to maximize
+locking throughput at the expense of fairness and deterministic
+latency. This is done by encouraging lock stealing and by optimistic
+spinning within the kernel on a locked futex while the lock owner is
+running, until the lock is free. This is the same optimistic spinning
+mechanism used by the kernel mutex and rw semaphore implementations
+to improve performance. The optimistic spinning is done without
+taking any lock.
+
+The downside of this improved throughput is the increased variance
+of the actual response times of the locking operations. Some locking
+operations will be very fast, while others may be considerably slower.
+The average response time should be better than the wait-wake futexes.
+
+The TO futexes have a built-in lock hand-off mechanism to prevent lock
+starvation from happening. When the top lock waiter has too many failed
+attempts to acquire the lock, it will initiate the hand-off mechanism
+by forcing the unlocker to transfer the lock to itself instead of
+freeing it. This limits the maximum latency a waiter can experience.
+
+Performance-wise, TO futexes should be faster than wait-wake futexes,
+especially if the futex lock holders do not sleep. For workloads
+that do a lot of sleeping within the critical sections, the TO
+futexes may not be faster than the wait-wake futexes.
+
+Implementation
+--------------
+
+Like the PI and robust futexes, a lock acquirer has to atomically
+put its thread ID (TID) into the lower 30 bits of the 32-bit futex
+which should have an initial value of 0. If it succeeds, it will be
+the owner of the futex. Otherwise, it has to call into the kernel
+using the new FUTEX_LOCK_TO futex(2) syscall.
+
+  futex(uaddr, FUTEX_LOCK_TO, 0, timeout, NULL, 0);
+
+Only the optional timeout parameter is being used by the new futex
+call.
+
+A kernel mutex is used for serialization. The top lock waiter that is
+the owner of the serialization mutex will try to acquire the lock when
+it is available.
+
+When the futex lock owner is no longer running, the top waiter will
+set the FUTEX_WAITERS bit before going to sleep. This is to make sure
+the futex owner will go into the kernel at unlock time to wake the
+waiter up.
+
+The expected return values of the above futex call are:
+ a) 0   - lock acquired as the top waiter
+ b) 1   - lock stolen as non-top waiter
+ c) 2   - lock explicitly handed off by the unlocker
+ d) < 0 - an error happens
+
+When it is time to unlock, the lock owner has to atomically change
+the futex value from its TID to 0. If that fails, it has to issue a
+FUTEX_UNLOCK_TO futex system call to wake up the top waiter.
+
+  futex(uaddr, FUTEX_UNLOCK_TO, 0, NULL, NULL, 0);
+
+A return value of 1 from the FUTEX_UNLOCK_TO futex(2) syscall
+indicates a task has been woken up. The syscall returns 0 if no
+sleeping task is woken. A negative value will be returned if an
+error happens.
+
+The error number returned by a FUTEX_UNLOCK_TO call on an empty futex
+can be used to decide if the TO futex functionality is implemented in
+the kernel. If it is present, no error will be returned.  Otherwise it
+will be ENOSYS.
+
+TO futexes require the kernel to have SMP support as well as support
+for the cmpxchg functionality. For architectures that don't support
+cmpxchg, TO futexes will not be supported as well.
+
+The TO futexes are orthogonal to the robust futexes and can be combined
+without problem.
+
+Usage Scenario
+--------------
+
+A TO futex can be used to implement a user-space exclusive lock
+or mutex to guard a critical section that is unlikely to go to
+sleep. The waiters in a TO futex, however, will fall back to sleep in
+a wait queue if the lock owner isn't running. Therefore, it can also be
+used when the critical section is long and prone to sleeping. However,
+it may not have a performance benefit when compared with a wait-wake
+futex in this case.
+
+Sample Code
+-----------
+
+The following is sample code implementing simple lock and unlock
+functions.
+
+__thread int tid;	/* Thread ID */
+
+void mutex_lock(int *faddr)
+{
+	if (cmpxchg(faddr, 0, tid) == 0)
+		return;
+	for (;;)	/* 0, 1 or 2 all mean the lock was acquired */
+		if (futex(faddr, FUTEX_LOCK_TO, ...) >= 0)
+			break;
+}
+
+void mutex_unlock(int *faddr)
+{
+	int fval;
+
+	if ((fval = cmpxchg(faddr, tid, 0)) == tid)
+		return;
+	if (fval & FUTEX_WAITERS)
+		futex(faddr, FUTEX_UNLOCK_TO, ...);
+}
-- 
1.7.1


* Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
  2016-09-20 13:42 ` [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes Waiman Long
@ 2016-09-21  6:59   ` Mike Galbraith
  2016-09-21 23:37     ` Waiman Long
  2016-09-22 13:23   ` Peter Zijlstra
  1 sibling, 1 reply; 35+ messages in thread
From: Mike Galbraith @ 2016-09-21  6:59 UTC (permalink / raw)
  To: Waiman Long, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Jonathan Corbet
  Cc: linux-kernel, linux-doc, Davidlohr Bueso, Jason Low,
	Scott J Norton, Douglas Hatch

On Tue, 2016-09-20 at 09:42 -0400, Waiman Long wrote:
> This patch introduces a new futex implementation called
> throughput-optimized (TO) futexes.

nit: 'TO' sounds way too much like timeout... TP?  You even use 'to' as
shorthand for timeout in the next patch.

>  		/*
> -		 * Wake robust non-PI futexes here. The wakeup of
> -		 * PI futexes happens in exit_pi_state():
> +		 * Wake robust non-PI/TO futexes here. The wakeup of
> +		 * PI/TO futexes happens in exit_pi_state():
>  		 */

nit: comment attracted my eye during high speed scroll-by.

	-Mike


* Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
  2016-09-21  6:59   ` Mike Galbraith
@ 2016-09-21 23:37     ` Waiman Long
  2016-09-22  7:49       ` Peter Zijlstra
  0 siblings, 1 reply; 35+ messages in thread
From: Waiman Long @ 2016-09-21 23:37 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet,
	linux-kernel, linux-doc, Davidlohr Bueso, Jason Low,
	Scott J Norton, Douglas Hatch

On 09/21/2016 02:59 AM, Mike Galbraith wrote:
> On Tue, 2016-09-20 at 09:42 -0400, Waiman Long wrote:
>> This patch introduces a new futex implementation called
>> throughput-optimized (TO) futexes.
> nit: 'TO' sounds way too much like timeout... TP?  You even use 'to' as
> shorthand for timeout in the next patch.

I agree. I am not that satisfied with the TO name. So I will change it 
to TP in my next revision of the patch. Thanks for the suggestion.

Cheers,
Longman


* Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
  2016-09-21 23:37     ` Waiman Long
@ 2016-09-22  7:49       ` Peter Zijlstra
  2016-09-22 13:04         ` Waiman Long
  2016-09-22 13:34         ` Thomas Gleixner
  0 siblings, 2 replies; 35+ messages in thread
From: Peter Zijlstra @ 2016-09-22  7:49 UTC (permalink / raw)
  To: Waiman Long
  Cc: Mike Galbraith, Thomas Gleixner, Ingo Molnar, Jonathan Corbet,
	linux-kernel, linux-doc, Davidlohr Bueso, Jason Low,
	Scott J Norton, Douglas Hatch

On Wed, Sep 21, 2016 at 07:37:34PM -0400, Waiman Long wrote:
> On 09/21/2016 02:59 AM, Mike Galbraith wrote:
> >On Tue, 2016-09-20 at 09:42 -0400, Waiman Long wrote:
> >>This patch introduces a new futex implementation called
> >>throughput-optimized (TO) futexes.
> >nit: 'TO' sounds way too much like timeout... TP?  You even use 'to' as
> >shorthand for timeout in the next patch.
> 
> I agree. I am not that satisfied with the TO name. So I will change it to TP
> in my next revision of the patch. Thanks for the suggestion.

I'd leave out the TO part entirely (or only mention it in changelogs).

That is, I'd call the futex ops: FUTEX_LOCK and FUTEX_UNLOCK.


* Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
  2016-09-22  7:49       ` Peter Zijlstra
@ 2016-09-22 13:04         ` Waiman Long
  2016-09-22 13:34         ` Thomas Gleixner
  1 sibling, 0 replies; 35+ messages in thread
From: Waiman Long @ 2016-09-22 13:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, Thomas Gleixner, Ingo Molnar, Jonathan Corbet,
	linux-kernel, linux-doc, Davidlohr Bueso, Jason Low,
	Scott J Norton, Douglas Hatch

On 09/22/2016 03:49 AM, Peter Zijlstra wrote:
> On Wed, Sep 21, 2016 at 07:37:34PM -0400, Waiman Long wrote:
>> On 09/21/2016 02:59 AM, Mike Galbraith wrote:
>>> On Tue, 2016-09-20 at 09:42 -0400, Waiman Long wrote:
>>>> This patch introduces a new futex implementation called
>>>> throughput-optimized (TO) futexes.
>>> nit: 'TO' sounds way too much like timeout... TP?  You even use 'to' as
>>> shorthand for timeout in the next patch.
>> I agree. I am not that satisfied with the TO name. So I will change it to TP
>> in my next revision of the patch. Thanks for the suggestion.
> I'd leave out the TO part entirely (or only mention it in changelogs).
>
> That is, I'd call the futex ops: FUTEX_LOCK and FUTEX_UNLOCK.

I was following the convention for the PI futexes where we have 
FUTEX_LOCK_PI and FUTEX_UNLOCK_PI. If you think it is OK to promote this 
new futex as the default for mutex-type locks and use the PI futexes 
only when needed, I am certainly OK with dropping the suffix.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
  2016-09-20 13:42 ` [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes Waiman Long
  2016-09-21  6:59   ` Mike Galbraith
@ 2016-09-22 13:23   ` Peter Zijlstra
  2016-09-22 17:21     ` Waiman Long
  1 sibling, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2016-09-22 13:23 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Ingo Molnar, Jonathan Corbet, linux-kernel,
	linux-doc, Davidlohr Bueso, Jason Low, Scott J Norton,
	Douglas Hatch

On Tue, Sep 20, 2016 at 09:42:41AM -0400, Waiman Long wrote:
> +/*
> + * Spinning threshold before enabling lock handoff.
> + * Each sleep will decrement the threshold by 1/32 of the start value.
> + */
> +#define TO_SPIN_THRESHOLD	(1 << 13)
> +#define TO_SLEEP_DECREMENT	(TO_SPIN_THRESHOLD/32)

Argh, we should really get rid of those stupid numbers. Wasn't there a
patch set working on implementing paravirt functions that would make all
this fixable in a sane way?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
  2016-09-22  7:49       ` Peter Zijlstra
  2016-09-22 13:04         ` Waiman Long
@ 2016-09-22 13:34         ` Thomas Gleixner
  2016-09-22 14:41           ` Davidlohr Bueso
  2016-09-22 19:56           ` [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes Waiman Long
  1 sibling, 2 replies; 35+ messages in thread
From: Thomas Gleixner @ 2016-09-22 13:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Waiman Long, Mike Galbraith, Ingo Molnar, Jonathan Corbet,
	linux-kernel, linux-doc, Davidlohr Bueso, Jason Low,
	Scott J Norton, Douglas Hatch

On Thu, 22 Sep 2016, Peter Zijlstra wrote:
> On Wed, Sep 21, 2016 at 07:37:34PM -0400, Waiman Long wrote:
> > On 09/21/2016 02:59 AM, Mike Galbraith wrote:
> > >On Tue, 2016-09-20 at 09:42 -0400, Waiman Long wrote:
> > >>This patch introduces a new futex implementation called
> > >>throughput-optimized (TO) futexes.
> > >nit: 'TO' sounds way too much like timeout... TP?  You even use 'to' as
> > >shorthand for timeout in the next patch.
> > 
> > I agree. I am not that satisfied with the TO name. So I will change it to TP
> > in my next revision of the patch. Thanks for the suggestion.
> 
> I'd leave out the TO part entirely (or only mention it in changelogs).
> 
> That is, I'd call the futex ops: FUTEX_LOCK and FUTEX_UNLOCK.

That brings me to a different question:

How is user space going to support this, i.e. is this some extra magic for
code which implements its own locking primitives or is there going to be a
wide use via e.g. glibc.

Also what's the reason that we can't do probabilistic spinning for
FUTEX_WAIT and have to add yet another specialized variant of futexes?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
  2016-09-22 13:34         ` Thomas Gleixner
@ 2016-09-22 14:41           ` Davidlohr Bueso
  2016-09-22 14:46             ` Thomas Gleixner
  2016-09-22 19:56           ` [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes Waiman Long
  1 sibling, 1 reply; 35+ messages in thread
From: Davidlohr Bueso @ 2016-09-22 14:41 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Waiman Long, Mike Galbraith, Ingo Molnar,
	Jonathan Corbet, linux-kernel, linux-doc, Jason Low,
	Scott J Norton, Douglas Hatch

On Thu, 22 Sep 2016, Thomas Gleixner wrote:

>On Thu, 22 Sep 2016, Peter Zijlstra wrote:
>> On Wed, Sep 21, 2016 at 07:37:34PM -0400, Waiman Long wrote:
>> > On 09/21/2016 02:59 AM, Mike Galbraith wrote:
>> > >On Tue, 2016-09-20 at 09:42 -0400, Waiman Long wrote:
>> > >>This patch introduces a new futex implementation called
>> > >>throughput-optimized (TO) futexes.
>> > >nit: 'TO' sounds way too much like timeout... TP?  You even use 'to' as
>> > >shorthand for timeout in the next patch.
>> >
>> > I agree. I am not that satisfied with the TO name. So I will change it to TP
>> > in my next revision of the patch. Thanks for the suggestion.
>>
>> I'd leave out the TO part entirely (or only mention it in changelogs).
>>
>> That is, I'd call the futex ops: FUTEX_LOCK and FUTEX_UNLOCK.
>
>That brings me to a different question:
>
>How is user space going to support this, i.e. is this some extra magic for
>code which implements its own locking primitives or is there going to be a
>wide use via e.g. glibc.
>
>Also what's the reason that we can't do probabilistic spinning for
>FUTEX_WAIT and have to add yet another specialized variant of futexes?

Where would this leave the respective FUTEX_WAKE? A nop? Probably have to
differentiate the fact that the queue was empty, but there was a spinning,
instead of straightforward returning 0.

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
  2016-09-22 14:41           ` Davidlohr Bueso
@ 2016-09-22 14:46             ` Thomas Gleixner
  2016-09-22 15:11               ` Davidlohr Bueso
  0 siblings, 1 reply; 35+ messages in thread
From: Thomas Gleixner @ 2016-09-22 14:46 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Peter Zijlstra, Waiman Long, Mike Galbraith, Ingo Molnar,
	Jonathan Corbet, linux-kernel, linux-doc, Jason Low,
	Scott J Norton, Douglas Hatch

On Thu, 22 Sep 2016, Davidlohr Bueso wrote:
> On Thu, 22 Sep 2016, Thomas Gleixner wrote:
> > Also what's the reason that we can't do probabilistic spinning for
> > FUTEX_WAIT and have to add yet another specialized variant of futexes?
> 
> Where would this leave the respective FUTEX_WAKE? A nop? Probably have to
> differentiate the fact that the queue was empty, but there was a spinning,
> instead of straightforward returning 0.

Sorry, but I really can't parse this answer.

Can you folks please communicate with proper and coherent explanations
instead of throwing a few gnawed off bones in my direction?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
  2016-09-22 14:46             ` Thomas Gleixner
@ 2016-09-22 15:11               ` Davidlohr Bueso
  2016-09-22 20:08                 ` Waiman Long
  0 siblings, 1 reply; 35+ messages in thread
From: Davidlohr Bueso @ 2016-09-22 15:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Waiman Long, Mike Galbraith, Ingo Molnar,
	Jonathan Corbet, linux-kernel, linux-doc, Jason Low,
	Scott J Norton, Douglas Hatch

On Thu, 22 Sep 2016, Thomas Gleixner wrote:

>On Thu, 22 Sep 2016, Davidlohr Bueso wrote:
>> On Thu, 22 Sep 2016, Thomas Gleixner wrote:
>> > Also what's the reason that we can't do probabilistic spinning for
>> > FUTEX_WAIT and have to add yet another specialized variant of futexes?
>>
>> Where would this leave the respective FUTEX_WAKE? A nop? Probably have to
>> differentiate the fact that the queue was empty, but there was a spinning,
>> instead of straightforward returning 0.
>
>Sorry, but I really can't parse this answer.
>
>Can you folks please communicate with proper and coherent explanations
>instead of throwing a few gnawed off bones in my direction?

I actually think that FUTEX_WAIT is the better/nicer approach. But my immediate
question above was how to handle the FUTEX_WAKE counterpart. If we want to
maintain the current FIFO ordering for wakeups, then with WAIT spinners this
will create lock stealing scenarios (even if we guard against starvation).
Or we could reduce the scope of the spinners, due to the restrictions, similar
to only the top waiter being able to spin for rtmutexes. This of course will
hurt the effectiveness of spinning in FUTEX_WAIT in the first place.

Another immediate thought was situations where we have spinner(s) and the wait
queue is empty: the WAKE side would also have to acknowledge that situation,
as just returning 0 would indicate that there are actually no waiters on the
futex.

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
  2016-09-22 13:23   ` Peter Zijlstra
@ 2016-09-22 17:21     ` Waiman Long
  0 siblings, 0 replies; 35+ messages in thread
From: Waiman Long @ 2016-09-22 17:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, Jonathan Corbet, linux-kernel,
	linux-doc, Davidlohr Bueso, Jason Low, Scott J Norton,
	Douglas Hatch

On 09/22/2016 09:23 AM, Peter Zijlstra wrote:
> On Tue, Sep 20, 2016 at 09:42:41AM -0400, Waiman Long wrote:
>> +/*
>> + * Spinning threshold before enabling lock handoff.
>> + * Each sleep will decrement the threshold by 1/32 of the start value.
>> + */
>> +#define TO_SPIN_THRESHOLD	(1<<  13)
>> +#define TO_SLEEP_DECREMENT	(TO_SPIN_THRESHOLD/32)
> Argh, we should really get rid of those stupid numbers. Wasn't there a
> patch set working on implementing paravirt functions that would make all
> this fixable in a sane way?

These thresholds are not paravirt related. They are there to determine 
when we should activate the explicit lock handoff because a waiter has 
been waiting for too long. Yes, the numbers are arbitrary. They are on 
the high side because setting them too low will limit lock stealing and 
hence overall performance. The FUTEX_WAITERS bit is only turned on when 
either the top waiter is sleeping or a lock handoff is needed. 
Otherwise, it is kept off to encourage lock stealing in user space as 
well as in the kernel.
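
In heavily condensed pseudo-code, the waiter side does something like 
the sketch below. The helper names here are made up; only the threshold 
handling mirrors the patch:

	int threshold = TO_SPIN_THRESHOLD;		/* 1 << 13 */

	while (!try_to_steal_lock(uaddr)) {
		if (owner_on_cpu(owner)) {
			cpu_relax();			/* optimistic spin */
			threshold--;
		} else {
			sleep_on_futex(uaddr);		/* top waiter sleeps */
			threshold -= TO_SLEEP_DECREMENT; /* 1/32 of start */
		}
		if (threshold < 0) {
			set_handoff(uaddr);	/* stop stealing, hand off */
			break;
		}
	}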

Cheers,
Longman

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
  2016-09-22 13:34         ` Thomas Gleixner
  2016-09-22 14:41           ` Davidlohr Bueso
@ 2016-09-22 19:56           ` Waiman Long
  2016-09-22 20:26             ` Thomas Gleixner
  1 sibling, 1 reply; 35+ messages in thread
From: Waiman Long @ 2016-09-22 19:56 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Mike Galbraith, Ingo Molnar, Jonathan Corbet,
	linux-kernel, linux-doc, Davidlohr Bueso, Jason Low,
	Scott J Norton, Douglas Hatch

On 09/22/2016 09:34 AM, Thomas Gleixner wrote:
> On Thu, 22 Sep 2016, Peter Zijlstra wrote:
>>
>> I'd leave out the TO part entirely (or only mention it in changelogs).
>>
>> That is, I'd call the futex ops: FUTEX_LOCK and FUTEX_UNLOCK.
> That brings me to a different question:
>
> How is user space going to support this, i.e. is this some extra magic for
> code which implements its own locking primitives or is there going to be a
> wide use via e.g. glibc.

There are some applications that use futex(2) directly to implement 
their synchronization primitives. Those applications will need to modify 
their code to detect the presence of the new futexes. They can then use 
the new futexes if available and fall back to wait-wake futexes if not.
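
As a rough sketch, the detection/fallback could look like the code 
below. FUTEX_LOCK_TO stands for whatever opcode the final uapi header 
ends up exporting, and wait_wake_lock() is the application's existing 
FUTEX_WAIT based lock loop (the TID cmpxchg fastpath is omitted here):

	#include <errno.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	static int have_lock_to = 1;	/* assume present until ENOSYS */

	static void futex_lock(int *f)
	{
		while (have_lock_to) {
			if (syscall(SYS_futex, f, FUTEX_LOCK_TO, 0, NULL) == 0)
				return;		/* lock acquired in the kernel */
			if (errno == ENOSYS) {
				have_lock_to = 0;	/* old kernel: fall back */
				break;
			}
			/* transient errors (e.g. EINTR): just retry */
		}
		wait_wake_lock(f);
	}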

I am also planning to take a look at the pthread_mutex* APIs to see if 
they can be modified to use the new futexes later on when the patch 
becomes more mature.

> Also what's the reason that we can't do probabilistic spinning for
> FUTEX_WAIT and have to add yet another specialized variant of futexes?
>

The main reason is that a FUTEX_WAIT waiter has no idea who the owner of 
the futex is. We usually do spinning when the lock owner is running and 
abort when it goes to sleep. We can't do that for FUTEX_WAIT.
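
For reference, the reason the kernel can spin for the new futexes is 
that, like the PI futexes, the futex word holds the owner's TID. A rough 
sketch of the user-space fastpath under that convention (C11 atomics; 
needs <stdatomic.h>, <stdint.h>, <sys/syscall.h>, <unistd.h>):

	static int to_lock(_Atomic uint32_t *fword)
	{
		uint32_t expected = 0;
		uint32_t tid = (uint32_t)syscall(SYS_gettid);

		/*
		 * 0 -> TID when free. Because the futex word identifies
		 * the owner, the kernel slowpath can look the owner up
		 * and spin only while it is on a CPU, which is exactly
		 * what FUTEX_WAIT cannot do.
		 */
		if (atomic_compare_exchange_strong(fword, &expected, tid))
			return 0;	/* uncontended: no syscall at all */
		return syscall(SYS_futex, fword, FUTEX_LOCK_TO, 0, NULL);
	}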

Cheers,
Longman

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
  2016-09-22 15:11               ` Davidlohr Bueso
@ 2016-09-22 20:08                 ` Waiman Long
  2016-09-22 20:28                   ` Waiman Long
  0 siblings, 1 reply; 35+ messages in thread
From: Waiman Long @ 2016-09-22 20:08 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Thomas Gleixner, Peter Zijlstra, Mike Galbraith, Ingo Molnar,
	Jonathan Corbet, linux-kernel, linux-doc, Jason Low,
	Scott J Norton, Douglas Hatch

On 09/22/2016 11:11 AM, Davidlohr Bueso wrote:
> On Thu, 22 Sep 2016, Thomas Gleixner wrote:
>
>> On Thu, 22 Sep 2016, Davidlohr Bueso wrote:
>>> On Thu, 22 Sep 2016, Thomas Gleixner wrote:
>>> > Also what's the reason that we can't do probabilistic spinning for
>>> > FUTEX_WAIT and have to add yet another specialized variant of 
>>> futexes?
>>>
>>> Where would this leave the respective FUTEX_WAKE? A nop? Probably 
>>> have to
>>> differentiate the fact that the queue was empty, but there was a 
>>> spinning,
>>> instead of straightforward returning 0.
>>
>> Sorry, but I really can't parse this answer.
>>
>> Can you folks please communicate with proper and coherent explanations
>> instead of throwing a few gnawed off bones in my direction?
>
> I actually think that FUTEX_WAIT is the better/nicer approach. But my 
> immediate
> question above was how to handle the FUTEX_WAKE counter-part. If we 
> want to
> maintain current FIFO ordering for wakeups, now with WAIT spinners 
> this will
> create lock stealing scenarios (including if we even guard against 
> starvation).
> Or we could reduce the scope of spinners, due to the restrictions, 
> similar to
> the top-waiter only being able to spin for rtmutexes. This of course 
> will hurt
> the effectiveness of spinning in FUTEX_WAIT in the first place.

Actually, there can be a lot of lock stealing going on with the 
wait-wake futexes. If the critical section is short enough, many of the 
lock waiters can be waiting in the hash bucket spinlock queue and not 
sleeping yet while the futex value changes. As a result, they will exit 
the futex syscall and return to user space with EAGAIN, where one of 
them may get the lock. So we can't assume that they will get the lock in 
FIFO order anyway.
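
For reference, the classic wait-wake lock loop that produces this 
behavior looks roughly like the sketch below (this is the standard futex 
mutex pattern, not code from this series):

	/* futex word: 0 = unlocked, 1 = locked, 2 = locked, maybe waiters */
	static void ww_lock(_Atomic int *f)
	{
		int zero = 0;

		if (atomic_compare_exchange_strong(f, &zero, 1))
			return;			/* uncontended fastpath */
		while (atomic_exchange(f, 2) != 0) {
			/*
			 * Returns EAGAIN without sleeping if *f is no
			 * longer 2 once the hash bucket lock is taken;
			 * the waiter then loops back to user space and
			 * may steal the lock ahead of older waiters.
			 */
			syscall(SYS_futex, f, FUTEX_WAIT, 2, NULL);
		}
	}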

> Another immediate thought was situations where we have spinner(s) and 
> the wait queue is
> empty: the WAKE side would also have to acknowledge that situation, as 
> just returning 0
> would indicate that there are actually no waiters on the futex.

I would say that adding optimistic spinning to FUTEX_WAIT would be a 
major change. I don't think it would be less complex than adding a new 
futex type like the TO futexes, and it would introduce a fair amount of 
risk of breaking existing applications.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
  2016-09-22 19:56           ` [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes Waiman Long
@ 2016-09-22 20:26             ` Thomas Gleixner
  2016-09-22 21:13               ` Waiman Long
  0 siblings, 1 reply; 35+ messages in thread
From: Thomas Gleixner @ 2016-09-22 20:26 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Mike Galbraith, Ingo Molnar, Jonathan Corbet,
	linux-kernel, linux-doc, Davidlohr Bueso, Jason Low,
	Scott J Norton, Douglas Hatch

On Thu, 22 Sep 2016, Waiman Long wrote:
> On 09/22/2016 09:34 AM, Thomas Gleixner wrote:
> > On Thu, 22 Sep 2016, Peter Zijlstra wrote:
> > > 
> > > I'd leave out the TO part entirely (or only mention it in changelogs).
> > > 
> > > That is, I'd call the futex ops: FUTEX_LOCK and FUTEX_UNLOCK.
> > That brings me to a different question:
> > 
> > How is user space going to support this, i.e. is this some extra magic for
> > code which implements its own locking primitives or is there going to be a
> > wide use via e.g. glibc.
> 
> There are some applications that use futex(2) directly to implement their
> synchronization primitives. For those applications, they will need to modify
> their code to detect the presence of the new futexes. They can then use the
> new futexes if available and use wait-wake futexes if not.

That's what I suspected. Did you talk to the other folks who complain about
futex performance (database, JVM, etc.) and play their own games with user
space spinlocks and whatever?
 
> I am also planning to take a look at the pthread_mutex* APIs to see if they
> can be modified to use the new futexes later on when the patch becomes more
> mature.

Please involve glibc people who are interested in the futex stuff early and
discuss the concept before it's set in stone for your particular usecase.
 
> > Also what's the reason that we can't do probabilistic spinning for
> > FUTEX_WAIT and have to add yet another specialized variant of futexes?
> > 
> 
> The main reason is that a FUTEX_WAIT waiter has no idea who the owner of the
> futex is. We usually do spinning when the lock owner is running and abort when
> it goes to sleep. We can't do that for FUTEX_WAIT.

Fair enough. This wants to be spelled out in the changelog and explained a
bit more detailed.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
  2016-09-22 20:08                 ` Waiman Long
@ 2016-09-22 20:28                   ` Waiman Long
  2016-09-22 20:38                     ` Thomas Gleixner
  2016-09-22 21:39                     ` Davidlohr Bueso
  0 siblings, 2 replies; 35+ messages in thread
From: Waiman Long @ 2016-09-22 20:28 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Thomas Gleixner, Peter Zijlstra, Mike Galbraith, Ingo Molnar,
	Jonathan Corbet, linux-kernel, linux-doc, Jason Low,
	Scott J Norton, Douglas Hatch

On 09/22/2016 04:08 PM, Waiman Long wrote:
> On 09/22/2016 11:11 AM, Davidlohr Bueso wrote:
>> On Thu, 22 Sep 2016, Thomas Gleixner wrote:
>>
>>> On Thu, 22 Sep 2016, Davidlohr Bueso wrote:
>>>> On Thu, 22 Sep 2016, Thomas Gleixner wrote:
>>>> > Also what's the reason that we can't do probabilistic spinning for
>>>> > FUTEX_WAIT and have to add yet another specialized variant of 
>>>> futexes?
>>>>
>>>> Where would this leave the respective FUTEX_WAKE? A nop? Probably 
>>>> have to
>>>> differentiate the fact that the queue was empty, but there was a 
>>>> spinning,
>>>> instead of straightforward returning 0.
>>>
>>> Sorry, but I really can't parse this answer.
>>>
>>> Can you folks please communicate with proper and coherent explanations
>>> instead of throwing a few gnawed off bones in my direction?
>>
>> I actually think that FUTEX_WAIT is the better/nicer approach. But my 
>> immediate
>> question above was how to handle the FUTEX_WAKE counter-part. If we 
>> want to
>> maintain current FIFO ordering for wakeups, now with WAIT spinners 
>> this will
>> create lock stealing scenarios (including if we even guard against 
>> starvation).
>> Or we could reduce the scope of spinners, due to the restrictions, 
>> similar to
>> the top-waiter only being able to spin for rtmutexes. This of course 
>> will hurt
>> the effectiveness of spinning in FUTEX_WAIT in the first place.
>
> Actually, there can be a lot of lock stealing going on with the 
> wait-wake futexes. If the critical section is short enough, many of 
> the lock waiters can be waiting in the hash bucket spinlock queue and 
> not sleeping yet while the futex value changes. As a result, they will 
> exit the futex syscall and back to user space with EAGAIN where one of 
> them may get the lock. So we can't assume that they will get the lock 
> in the FIFO order anyway.

BTW, my initial attempt for the new futex was to use the same workflow 
as the PI futexes, but use a mutex, which has optimistic spinning, 
instead of an rt_mutex. That version can double the throughput compared 
with PI futexes but still falls far short of what can be achieved with 
wait-wake futexes. Looking at the performance figures from the patch:

                 wait-wake futex     PI futex        TO futex
                 ---------------     --------        --------
max time            3.49s            50.91s          2.65s
min time            3.24s            50.84s          0.07s
average time        3.41s            50.90s          1.84s
sys time          7m22.4s            55.73s        2m32.9s
lock count       3,090,294          9,999,813       698,318
unlock count     3,268,896          9,999,814           134

The problem with a PI-futex-like version is that almost all the 
lock/unlock operations were done in the kernel, which added overhead and 
latency. Now looking at the numbers for the TO futexes, less than 1/10 
of the lock operations were done in the kernel, and the number of 
unlocks was insignificant. Locking was done mostly by lock stealing. 
This is where most of the performance benefit comes from, not optimistic 
spinning.

This is also the reason that a lock handoff mechanism is implemented to 
prevent lock starvation, which is likely to happen without one.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
  2016-09-22 20:28                   ` Waiman Long
@ 2016-09-22 20:38                     ` Thomas Gleixner
  2016-09-22 21:48                       ` Waiman Long
  2016-09-22 21:39                     ` Davidlohr Bueso
  1 sibling, 1 reply; 35+ messages in thread
From: Thomas Gleixner @ 2016-09-22 20:38 UTC (permalink / raw)
  To: Waiman Long
  Cc: Davidlohr Bueso, Peter Zijlstra, Mike Galbraith, Ingo Molnar,
	Jonathan Corbet, linux-kernel, linux-doc, Jason Low,
	Scott J Norton, Douglas Hatch

On Thu, 22 Sep 2016, Waiman Long wrote:
> BTW, my initial attempt for the new futex was to use the same workflow as the
> PI futexes, but use mutex which has optimistic spinning instead of rt_mutex.
> That version can double the throughput compared with PI futexes but still far
> short of what can be achieved with wait-wake futex. Looking at the performance
> figures from the patch:
> 
>                 wait-wake futex     PI futex        TO futex
>                 ---------------     --------        --------
> max time            3.49s            50.91s          2.65s
> min time            3.24s            50.84s          0.07s
> average time        3.41s            50.90s          1.84s
> sys time          7m22.4s            55.73s        2m32.9s

That's really interesting. Do you have any explanation for this massive
system time differences?

> lock count       3,090,294          9,999,813       698,318
> unlock count     3,268,896          9,999,814           134
> 
> The problem with a PI futexes like version is that almost all the lock/unlock
> operations were done in the kernel which added overhead and latency. Now
> looking at the numbers for the TO futexes, less than 1/10 of the lock
> operations were done in the kernel, the number of unlock was insignificant.
> Locking was done mostly by lock stealing. This is where most of the
> performance benefit comes from, not optimistic spinning.

How does the lock latency distribution of all this look like and how fair
is the whole thing?

> This is also the reason that a lock handoff mechanism is implemented to
> prevent lock starvation which is likely to happen without one.

Where is that lock handoff mechanism?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
  2016-09-22 20:26             ` Thomas Gleixner
@ 2016-09-22 21:13               ` Waiman Long
  0 siblings, 0 replies; 35+ messages in thread
From: Waiman Long @ 2016-09-22 21:13 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Mike Galbraith, Ingo Molnar, Jonathan Corbet,
	linux-kernel, linux-doc, Davidlohr Bueso, Jason Low,
	Scott J Norton, Douglas Hatch

On 09/22/2016 04:26 PM, Thomas Gleixner wrote:
> On Thu, 22 Sep 2016, Waiman Long wrote:
>> On 09/22/2016 09:34 AM, Thomas Gleixner wrote:
>>> On Thu, 22 Sep 2016, Peter Zijlstra wrote:
>>>> I'd leave out the TO part entirely (or only mention it in changelogs).
>>>>
>>>> That is, I'd call the futex ops: FUTEX_LOCK and FUTEX_UNLOCK.
>>> That brings me to a different question:
>>>
>>> How is user space going to support this, i.e. is this some extra magic for
>>> code which implements its own locking primitives or is there going to be a
>>> wide use via e.g. glibc.
>> There are some applications that use futex(2) directly to implement their
>> synchronization primitives. For those applications, they will need to modify
>> their code to detect the presence of the new futexes. They can then use the
>> new futexes if available and use wait-wake futexes if not.
> That's what I suspected. Did you talk to the other folks who complain about
> futex performance (database, JVM, etc.) and play their own games with user
> space spinlocks and whatever?

I am also part of the team that helps large application vendors tune 
their application performance on our large SMP systems. Those 
application vendors tend to use futexes directly instead of relying on 
glibc. We have seen that spinlock contention in the futex code can 
sometimes consume a significant portion of the CPU cycles, depending on 
the workloads being run. We have been providing suggestions on the best 
practices for using futexes, but there is only so much you can do by 
tuning their locking code implementation. That is why I am also looking 
for ways to improve the performance of the futex code in the kernel.

>
>> I am also planning to take a look at the pthread_mutex* APIs to see if they
>> can be modified to use the new futexes later on when the patch becomes more
>> mature.
> Please involve glibc people who are interested in the futex stuff early and
> discuss the concept before it's set in stone for your particular usecase.
>   

Sure, I will start to do some prototyping and performance testing with 
glibc and then engage those folks about that.

>>> Also what's the reason that we can't do probabilistic spinning for
>>> FUTEX_WAIT and have to add yet another specialized variant of futexes?
>>>
>> The main reason is that a FUTEX_WAIT waiter has no idea who the owner of the
>> futex is. We usually do spinning when the lock owner is running and abort when
>> it goes to sleep. We can't do that for FUTEX_WAIT.
> Fair enough. This wants to be spelled out in the changelog and explained a
> bit more detailed.

I will.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v2 1/5] futex: Add futex_set_timer() helper function
  2016-09-20 13:42 ` [RFC PATCH v2 1/5] futex: Add futex_set_timer() helper function Waiman Long
@ 2016-09-22 21:31   ` Thomas Gleixner
  2016-09-23  0:45     ` Waiman Long
  0 siblings, 1 reply; 35+ messages in thread
From: Thomas Gleixner @ 2016-09-22 21:31 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Peter Zijlstra, Jonathan Corbet, linux-kernel,
	linux-doc, Davidlohr Bueso, Jason Low, Scott J Norton,
	Douglas Hatch

On Tue, 20 Sep 2016, Waiman Long wrote:

Please be more careful with your subject lines. The first thing I thought
was that you were adding a helper which is used in later patches, only to
find out that you actually consolidate duplicated code. Something like:

  futex: Consolidate duplicated timer setup code

would have told me right away what this is about.

> This patch adds a new futex_set_timer() function to consolidate all

Please do not use: "This patch ...". We already know that this is a patch,
otherwise it would not be tagged [PATCH n/m] in the subject line.

See Documentation/SubmittingPatches ....

> the sleeping hrtime setup code.

Let me give you a hint:

1:  The code has three identical code copies to set up the futex timeout.

2:  Add a helper function and consolidate the call sites.

#1 tells precisely what the problem is
#2 tells precisely how it is solved

Can you see the difference?

> +/*
> + * Helper function to set the sleeping hrtimer.
> + */
> +static inline void futex_set_timer(ktime_t *time, struct hrtimer_sleeper **pto,
> +		struct hrtimer_sleeper *timeout, int flags, u64 range_ns)

Please use futex_setup_timer() as the function name. I was confused when I
read the other patch that you wanted to "set" the timer before entering
into the place which would actually need it.

> +{
> +	if (!time)
> +		return;
> +	*pto = timeout;

Please don't do that. That's a horrible coding style.

What's wrong with returning NULL or the timeout pointer and assign it to
"to" at the call site?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
  2016-09-22 20:28                   ` Waiman Long
  2016-09-22 20:38                     ` Thomas Gleixner
@ 2016-09-22 21:39                     ` Davidlohr Bueso
  2016-09-22 21:41                       ` Thomas Gleixner
  1 sibling, 1 reply; 35+ messages in thread
From: Davidlohr Bueso @ 2016-09-22 21:39 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Peter Zijlstra, Mike Galbraith, Ingo Molnar,
	Jonathan Corbet, linux-kernel, linux-doc, Jason Low,
	Scott J Norton, Douglas Hatch

On Thu, 22 Sep 2016, Waiman Long wrote:

>BTW, my initial attempt for the new futex was to use the same workflow 
>as the PI futexes, but use mutex which has optimistic spinning instead 
>of rt_mutex.

Btw, Thomas, do you still have any interest in pursuing this for rtmutexes
from -rt into mainline? If so I can resend the patches from a while ago.

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
  2016-09-22 21:39                     ` Davidlohr Bueso
@ 2016-09-22 21:41                       ` Thomas Gleixner
  2016-09-22 21:59                         ` Waiman Long
  2016-09-24  1:28                         ` [PATCH " Davidlohr Bueso
  0 siblings, 2 replies; 35+ messages in thread
From: Thomas Gleixner @ 2016-09-22 21:41 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Waiman Long, Peter Zijlstra, Mike Galbraith, Ingo Molnar,
	Jonathan Corbet, linux-kernel, linux-doc, Jason Low,
	Scott J Norton, Douglas Hatch

On Thu, 22 Sep 2016, Davidlohr Bueso wrote:
> On Thu, 22 Sep 2016, Waiman Long wrote:
> 
> > BTW, my initial attempt for the new futex was to use the same workflow as
> > the PI futexes, but use mutex which has optimistic spinning instead of
> > rt_mutex.
> 
> Btw, Thomas, do you still have any interest pursuing this for rtmutexes from
> -rt into mainline? If so I can resend the patches from a while ago.

Certainly yes. My faint memory tells me that there was some potential issue
due to boosting the owner only if it gets scheduled out, but I might be
wrong.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
  2016-09-22 20:38                     ` Thomas Gleixner
@ 2016-09-22 21:48                       ` Waiman Long
  2016-09-23 13:02                         ` Thomas Gleixner
  0 siblings, 1 reply; 35+ messages in thread
From: Waiman Long @ 2016-09-22 21:48 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Davidlohr Bueso, Peter Zijlstra, Mike Galbraith, Ingo Molnar,
	Jonathan Corbet, linux-kernel, linux-doc, Jason Low,
	Scott J Norton, Douglas Hatch

On 09/22/2016 04:38 PM, Thomas Gleixner wrote:
> On Thu, 22 Sep 2016, Waiman Long wrote:
>> BTW, my initial attempt for the new futex was to use the same workflow as the
>> PI futexes, but use mutex which has optimistic spinning instead of rt_mutex.
>> That version can double the throughput compared with PI futexes but still far
>> short of what can be achieved with wait-wake futex. Looking at the performance
>> figures from the patch:
>>
>>                  wait-wake futex     PI futex        TO futex
>>                  ---------------     --------        --------
>> max time            3.49s            50.91s          2.65s
>> min time            3.24s            50.84s          0.07s
>> average time        3.41s            50.90s          1.84s
>> sys time          7m22.4s            55.73s        2m32.9s
> That's really interesting. Do you have any explanation for this massive
> system time differences?

For the wait-wake futexes, the waiters will be kicked out with EAGAIN 
without sleeping as long as the futex value changes before they acquire 
the hash bucket spinlock. These waiters will attempt to acquire the lock 
again in user space. If that fails, they will call FUTEX_WAIT again.

In the above test case, the critical section is pretty short and about 
4/5 of the FUTEX_WAIT calls resulted in an EAGAIN error. So there are a 
lot more user-space/kernel transitions than with the TO futexes. I think 
the difference in the number of FUTEX_WAIT calls vs. FUTEX_LOCK_TO calls 
accounts for the difference in their sys times. As for the PI futex, it 
is a strictly sleeping lock and so won't use much sys time.

>> lock count       3,090,294          9,999,813       698,318
>> unlock count     3,268,896          9,999,814           134
>>
>> The problem with a PI futexes like version is that almost all the lock/unlock
>> operations were done in the kernel which added overhead and latency. Now
>> looking at the numbers for the TO futexes, less than 1/10 of the lock
>> operations were done in the kernel, the number of unlock was insignificant.
>> Locking was done mostly by lock stealing. This is where most of the
>> performance benefit comes from, not optimistic spinning.
> How does the lock latency distribution of all this look like and how fair
> is the whole thing?

The TO futexes are unfair, as can be seen from the min/max thread times 
listed above. It took the fastest thread 0.07s to complete all the 
locking operations, whereas the slowest one needed 2.65s. However, the 
situation reverses when I changed the critical section to a 1us sleep. 
In this case, there will be no optimistic spinning. The performance 
results for 100k locking operations are listed below.

                 wait-wake futex     PI futex        TO futex
                 ---------------     --------        --------
max time            0.06s             9.32s          4.76s
min time            5.59s             9.36s          5.62s
average time        3.25s             9.35s          5.41s

In this case, the TO futexes are fairer but perform worse than the 
wait-wake futexes. That is because the lock handoff mechanism limits the 
amount of lock stealing in the TO futexes while the wait-wake futexes 
have no such restriction. When I disabled lock handoff, the TO futexes 
performed similarly to the wait-wake futexes.

>
>> This is also the reason that a lock handoff mechanism is implemented to
>> prevent lock starvation which is likely to happen without one.
> Where is that lock handoff mechanism?
>
>

In the futex_state object, there is a new field, handoff_pid. It is set 
when the threshold count in futex_spin_on_owner() becomes negative. When 
this field is set, the unlocker will change the futex word to that value 
instead of clearing it to 0 and letting others steal the lock. I will 
separate out the lock handoff into its own patch in the next revision to 
highlight it.
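
In pseudo-code, the unlock side then does roughly the following. The 
helper names are made up; the real code uses the futex cmpxchg 
primitives and also has to manage the FUTEX_WAITERS bit:

	u32 newval = READ_ONCE(state->handoff_pid);	/* 0 if no handoff */

	if (futex_word_cmpxchg(uaddr, curval, newval) == 0) {
		if (newval)
			wake_up_process(top_waiter);	/* direct handoff */
		else
			wake_top_waiter(state);	/* word is 0, free to steal */
	}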

Cheers,
Longman

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
  2016-09-22 21:41                       ` Thomas Gleixner
@ 2016-09-22 21:59                         ` Waiman Long
  2016-09-27 19:02                           ` [PATCH v2 -tip] locking/rtmutex: Reduce top-waiter blocking on a lock Davidlohr Bueso
  2016-09-24  1:28                         ` [PATCH " Davidlohr Bueso
  1 sibling, 1 reply; 35+ messages in thread
From: Waiman Long @ 2016-09-22 21:59 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Davidlohr Bueso, Peter Zijlstra, Mike Galbraith, Ingo Molnar,
	Jonathan Corbet, linux-kernel, linux-doc, Jason Low,
	Scott J Norton, Douglas Hatch

On 09/22/2016 05:41 PM, Thomas Gleixner wrote:
> On Thu, 22 Sep 2016, Davidlohr Bueso wrote:
>> On Thu, 22 Sep 2016, Waiman Long wrote:
>>
>>> BTW, my initial attempt for the new futex was to use the same workflow as
>>> the PI futexes, but use mutex which has optimistic spinning instead of
>>> rt_mutex.
>> Btw, Thomas, do you still have any interest pursuing this for rtmutexes from
>> -rt into mainline? If so I can resend the patches from a while ago.
> Certainly yes. My faint memory tells me that there was some potential issue
> due to boosting the owner only if it gets scheduled out, but I might be
> wrong.

It is tricky to add optimistic spinning to rtmutexes because of the need 
to observe process priorities. It is certainly possible to make the top 
waiter spin, but I am not sure how much performance gain we would get 
with just that.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v2 1/5] futex: Add futex_set_timer() helper function
  2016-09-22 21:31   ` Thomas Gleixner
@ 2016-09-23  0:45     ` Waiman Long
  0 siblings, 0 replies; 35+ messages in thread
From: Waiman Long @ 2016-09-23  0:45 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, Peter Zijlstra, Jonathan Corbet, linux-kernel,
	linux-doc, Davidlohr Bueso, Jason Low, Scott J Norton,
	Douglas Hatch

On 09/22/2016 05:31 PM, Thomas Gleixner wrote:
> On Tue, 20 Sep 2016, Waiman Long wrote:
>
> Please be more careful of your subject lines. First thing I thought was
> that you add a helper which is used in later patches to find out that you
> actualy consolidate duplicated code. Something like:
>
>    futex: Consolidate duplicated timer setup code
>
> would have told me right away what this is about.
>
>> This patch adds a new futex_set_timer() function to consolidate all
> Please do not use: "This patch ...". We already know that this is a patch,
> otherwise it would not be tagged [PATCH n/m] in the subject line.
>
> See Documentation/SubmittingPatches ....
>
>> the sleeping hrtime setup code.
> Let me give you a hint:
>
> 1:  The code has three identical code copies to set up the futex timeout.
>
> 2:  Add a helper function and consolidate the call sites.
>
> #1 tells precisely what the problem is
> #2 tells precisely how it is solved
>
> Can you see the difference?
>
>> +/*
>> + * Helper function to set the sleeping hrtimer.
>> + */
>> +static inline void futex_set_timer(ktime_t *time, struct hrtimer_sleeper **pto,
>> +		struct hrtimer_sleeper *timeout, int flags, u64 range_ns)
> Please use futex_setup_timer() as the function name. I was confused when I
> read the other patch that you wanted to "set" the timer before entering
> into the place which would actually need it.
>
>> +{
>> +	if (!time)
>> +		return;
>> +	*pto = timeout;
> Please don't do that. That's a horrible coding style.
>
> What's wrong with returning NULL or the timeout pointer and assign it to
> "to" at the call site?
>
> Thanks,
>
> 	tglx
>

Thanks for the suggestions. I will fix this patch in the next revision.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
  2016-09-22 21:48                       ` Waiman Long
@ 2016-09-23 13:02                         ` Thomas Gleixner
  2016-09-26 22:02                           ` Waiman Long
  0 siblings, 1 reply; 35+ messages in thread
From: Thomas Gleixner @ 2016-09-23 13:02 UTC (permalink / raw)
  To: Waiman Long
  Cc: Davidlohr Bueso, Peter Zijlstra, Mike Galbraith, Ingo Molnar,
	Jonathan Corbet, linux-kernel, linux-doc, Jason Low,
	Scott J Norton, Douglas Hatch

On Thu, 22 Sep 2016, Waiman Long wrote:
> On 09/22/2016 04:38 PM, Thomas Gleixner wrote:
> > >                  wait-wake futex     PI futex        TO futex
> > >                  ---------------     --------        --------
> > > max time            3.49s            50.91s          2.65s
> > > min time            3.24s            50.84s          0.07s
> > > average time        3.41s            50.90s          1.84s
> > > sys time          7m22.4s            55.73s        2m32.9s
> > That's really interesting. Do you have any explanation for this massive
> > system time differences?
> 
> For the wait-wake futexes, the waiters will be kicked out with EAGAIN without
> sleeping as long as the futex value changes before they acquire the hash
> bucket spinlock. These waiters will attempt to acquire the lock again in the
> user space. If that fails, they will call FUTEX_WAIT again.
> 
> In the above test case, the critical section is pretty short and about 4/5 of
> the FUTEX_WAIT calls resulted in an EAGAIN error. So there are a lot more
> user-space/kernel transitions than the TO futexes. I think the difference in
> the number of FUTEX_WAIT calls vs. the FUTEX_LOCK_TO calls causes the
> difference between their sys times.

Ok.

> As for the PI futex, it is a strictly sleeping lock and so won't use that
> much sys time.

That's clear.

> > > Locking was done mostly by lock stealing. This is where most of the
> > > performance benefit comes from, not optimistic spinning.
> > How does the lock latency distribution of all this look like and how fair
> > is the whole thing?
> 
> The TO futexes are unfair as can be seen from the min/max thread times listed
> above. It took the fastest thread 0.07s to complete all the locking
> operations, whereas the slowest one needed 2.65s. However, the situation
> reverses when I changed the critical section to a 1us sleep. In this case,

1us sleep is going to add another syscall and therefore scheduling, so what?

Or did you just extend the critical section busy time?

> there will be no optimistic spinning. The performance results for 100k locking
> operations were listed below.
> 
>                 wait-wake futex     PI futex        TO futex
>                 ---------------     --------        --------
> max time            0.06s             9.32s          4.76s

      		      ^^^^ ????			

> min time            5.59s             9.36s          5.62s
> average time        3.25s             9.35s          5.41s
> 
> In this case, the TO futexes are fairer but perform worse than the wait-wake
> futexes. That is because the lock handoff mechanism limit the amount of lock
> stealing in the TO futexes while the wait-wake futexes have no such
> restriction. When I disabled  lock handoff, the TO futexes would then perform
> similar to the wait-wake futexes.

So the benefit of these newfangled futexes is only there for extremely
short critical sections and a gazillion of threads fighting for the same
futex, right?

I really wonder how the average programmer should pick the right flavour,
not to talk about any useful decision for something like glibc to pick the
proper one.

Thanks,

	tglx

 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH -tip] locking/rtmutex: Reduce top-waiter blocking on a lock
  2016-09-22 21:41                       ` Thomas Gleixner
  2016-09-22 21:59                         ` Waiman Long
@ 2016-09-24  1:28                         ` Davidlohr Bueso
  2016-09-26 21:40                           ` Waiman Long
  1 sibling, 1 reply; 35+ messages in thread
From: Davidlohr Bueso @ 2016-09-24  1:28 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Waiman Long, Peter Zijlstra, Mike Galbraith, Ingo Molnar,
	Jonathan Corbet, linux-kernel, linux-doc, Jason Low,
	Scott J Norton, Douglas Hatch

By applying well known spin-on-lock-owner techniques, we can avoid the
blocking overhead while the task is trying to take the rtmutex. The idea
is that as long as the owner is running, there is a fair chance it'll
release the lock soon, and thus a task trying to acquire the rtmutex is
better off spinning instead of blocking immediately after the fastpath.
This is similar to what we use for other locks, borrowed from -rt. The
main difference (due to the obvious real-time constraints) is that
top-waiter spinning must account for any new higher priority waiter, and
therefore cannot steal the lock and avoid the pi-dance. As such there
will be at most one spinning waiter on a contended lock.

Conditions to stop spinning and block are simple:

(1) Upon need_resched()
(2) Current lock owner blocks

The unlock side remains unchanged as wake_up_process() can safely deal with
calls where the task is not actually blocked (TASK_NORMAL). The biggest
concern would perhaps be if we relied on any implicit barriers (in that the
wake_up_process call would not imply it anymore since nothing was awoken),
but this is not the case. As such, there is only unnecessary overhead dealing
with the wake_q, but this allows us not to miss any wakeups between the spinning
step and the unlocking side.

Measuring the amount of priority inversions of the pi_stress program, there is
some improvement in throughput during a 30 second window. On a 32-core box, with
increasing thread-group count:

pistress
			    4.4.3                 4.4.3
			  vanilla           rtmutex-topspinner
Hmean    1   2321586.73 (  0.00%)  2339847.23 (  0.79%)
Hmean    4   8209026.49 (  0.00%)  8597809.55 (  4.74%)
Hmean    7  12655322.45 (  0.00%) 13194896.45 (  4.26%)
Hmean    12  4210477.03 (  0.00%)  4348643.08 (  3.28%)
Hmean    21  2996823.05 (  0.00%)  3104513.47 (  3.59%)
Hmean    30  2463107.53 (  0.00%)  2584275.71 (  4.91%)
Hmean    48  2656668.46 (  0.00%)  2719324.53 (  2.36%)
Hmean    64  2397253.65 (  0.00%)  2471628.92 (  3.10%)
Stddev   1    653473.88 (  0.00%)   527076.59 (-19.34%)
Stddev   4    664995.50 (  0.00%)   359487.15 (-45.94%)
Stddev   7    248476.88 (  0.00%)   278307.31 ( 12.01%)
Stddev   12    74537.42 (  0.00%)    54305.86 (-27.14%)
Stddev   21    72143.80 (  0.00%)    40371.42 (-44.04%)
Stddev   30    31981.43 (  0.00%)    42306.07 ( 32.28%)
Stddev   48    21317.95 (  0.00%)    42608.50 ( 99.87%)
Stddev   64    23433.99 (  0.00%)    21502.56 ( -8.24%)

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---

Hi, so I've rebased the patch against -tip, and it has survived about a full
day of pistress pounding. That said, I don't have any interesting boxes
available for performance tests, so I'm keeping the results I obtained
originally, which should obviously not matter. The other difference from the
previous post is that I've sprinkled READ/WRITE_ONCE around lock->owner, as
we're now playing with it without the wait_lock.

 kernel/Kconfig.locks            |  4 ++
 kernel/locking/rtmutex.c        | 83 ++++++++++++++++++++++++++++++++++-------
 kernel/locking/rtmutex_common.h |  2 +-
 3 files changed, 75 insertions(+), 14 deletions(-)

diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index ebdb0043203a..e20790cdc446 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -227,6 +227,10 @@ config MUTEX_SPIN_ON_OWNER
 	def_bool y
 	depends on SMP && !DEBUG_MUTEXES && ARCH_SUPPORTS_ATOMIC_RMW
 
+config RT_MUTEX_SPIN_ON_OWNER
+       def_bool y
+       depends on SMP && RT_MUTEXES && !DEBUG_RT_MUTEXES && ARCH_SUPPORTS_ATOMIC_RMW
+
 config RWSEM_SPIN_ON_OWNER
        def_bool y
        depends on SMP && RWSEM_XCHGADD_ALGORITHM && ARCH_SUPPORTS_ATOMIC_RMW
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 1ec0f48962b3..282a773d1563 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -54,13 +54,13 @@ rt_mutex_set_owner(struct rt_mutex *lock, struct task_struct *owner)
 	if (rt_mutex_has_waiters(lock))
 		val |= RT_MUTEX_HAS_WAITERS;
 
-	lock->owner = (struct task_struct *)val;
+	WRITE_ONCE(lock->owner, (struct task_struct *)val);
 }
 
 static inline void clear_rt_mutex_waiters(struct rt_mutex *lock)
 {
-	lock->owner = (struct task_struct *)
-			((unsigned long)lock->owner & ~RT_MUTEX_HAS_WAITERS);
+	WRITE_ONCE(lock->owner, (struct task_struct *)
+		   ((unsigned long)lock->owner & ~RT_MUTEX_HAS_WAITERS));
 }
 
 static void fixup_rt_mutex_waiters(struct rt_mutex *lock)
@@ -989,14 +989,17 @@ static void mark_wakeup_next_waiter(struct wake_q_head *wake_q,
 	rt_mutex_dequeue_pi(current, waiter);
 
 	/*
-	 * As we are waking up the top waiter, and the waiter stays
-	 * queued on the lock until it gets the lock, this lock
-	 * obviously has waiters. Just set the bit here and this has
-	 * the added benefit of forcing all new tasks into the
-	 * slow path making sure no task of lower priority than
-	 * the top waiter can steal this lock.
+	 * As we are potentially waking up the top waiter, and the waiter
+	 * stays queued on the lock until it gets the lock, this lock
+	 * obviously has waiters. Just set the bit here and this has the
+	 * added benefit of forcing all new tasks into the slow path
+	 * making sure no task of lower priority than the top waiter can
+	 * steal this lock.
+	 *
+	 * If the top waiter, otoh, is spinning on ->owner, this will also
+	 * serve to exit out of the loop and try to acquire the lock.
 	 */
-	lock->owner = (void *) RT_MUTEX_HAS_WAITERS;
+	WRITE_ONCE(lock->owner, (void *) RT_MUTEX_HAS_WAITERS);
 
 	raw_spin_unlock(&current->pi_lock);
 
@@ -1089,6 +1092,48 @@ void rt_mutex_adjust_pi(struct task_struct *task)
 				   next_lock, NULL, task);
 }
 
+#ifdef CONFIG_RT_MUTEX_SPIN_ON_OWNER
+static bool rt_mutex_spin_on_owner(struct rt_mutex *lock,
+				   struct task_struct *owner)
+{
+	bool ret = true;
+
+	/*
+	 * The last owner could have just released the lock,
+	 * immediately try taking it again.
+	 */
+	if (!owner)
+		goto done;
+
+	rcu_read_lock();
+	while (rt_mutex_owner(lock) == owner) {
+		/*
+		 * Ensure we emit the owner->on_cpu, dereference _after_
+		 * checking lock->owner still matches owner. If that fails,
+		 * owner might point to freed memory. If it still matches,
+		 * the rcu_read_lock() ensures the memory stays valid.
+		 */
+		barrier();
+		if (!owner->on_cpu || need_resched()) {
+			ret = false;
+			break;
+		}
+
+		cpu_relax_lowlatency();
+	}
+	rcu_read_unlock();
+done:
+	return ret;
+}
+
+#else
+static bool rt_mutex_spin_on_owner(struct rt_mutex *lock,
+				   struct task_struct *owner)
+{
+	return false;
+}
+#endif
+
 /**
  * __rt_mutex_slowlock() - Perform the wait-wake-try-to-take loop
  * @lock:		 the rt_mutex to take
@@ -1107,6 +1152,8 @@ __rt_mutex_slowlock(struct rt_mutex *lock, int state,
 	int ret = 0;
 
 	for (;;) {
+		struct rt_mutex_waiter *top_waiter = NULL;
+
 		/* Try to acquire the lock: */
 		if (try_to_take_rt_mutex(lock, current, waiter))
 			break;
@@ -1125,11 +1172,21 @@ __rt_mutex_slowlock(struct rt_mutex *lock, int state,
 				break;
 		}
 
+		top_waiter = rt_mutex_top_waiter(lock);
+
 		raw_spin_unlock_irq(&lock->wait_lock);
 
 		debug_rt_mutex_print_deadlock(waiter);
 
-		schedule();
+		/*
+		 * At this point the PI-dance is done, and, as the top waiter,
+		 * we are next in line for the lock. Try to spin on the current
+		 * owner for a while, in the hope that the lock will be released
+		 * soon. Otherwise fallback and block.
+		 */
+		if (top_waiter != waiter ||
+		    !rt_mutex_spin_on_owner(lock, rt_mutex_owner(lock)))
+			schedule();
 
 		raw_spin_lock_irq(&lock->wait_lock);
 		set_current_state(state);
@@ -1555,7 +1612,7 @@ EXPORT_SYMBOL_GPL(__rt_mutex_init);
  * rt_mutex_init_proxy_locked - initialize and lock a rt_mutex on behalf of a
  *				proxy owner
  *
- * @lock: 	the rt_mutex to be locked
+ * @lock:	the rt_mutex to be locked
  * @proxy_owner:the task to set as owner
  *
  * No locking. Caller has to do serializing itself
@@ -1573,7 +1630,7 @@ void rt_mutex_init_proxy_locked(struct rt_mutex *lock,
 /**
  * rt_mutex_proxy_unlock - release a lock on behalf of owner
  *
- * @lock: 	the rt_mutex to be locked
+ * @lock:	the rt_mutex to be locked
  *
  * No locking. Caller has to do serializing itself
  * Special API call for PI-futex support
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index 4f5f83c7d2d3..d30fc4797fe1 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -76,7 +76,7 @@ task_top_pi_waiter(struct task_struct *p)
 static inline struct task_struct *rt_mutex_owner(struct rt_mutex *lock)
 {
 	return (struct task_struct *)
-		((unsigned long)lock->owner & ~RT_MUTEX_OWNER_MASKALL);
+		((unsigned long)READ_ONCE(lock->owner) & ~RT_MUTEX_OWNER_MASKALL);
 }
 
 /*
-- 
2.6.6

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH -tip] locking/rtmutex: Reduce top-waiter blocking on a lock
  2016-09-24  1:28                         ` [PATCH " Davidlohr Bueso
@ 2016-09-26 21:40                           ` Waiman Long
  0 siblings, 0 replies; 35+ messages in thread
From: Waiman Long @ 2016-09-26 21:40 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Thomas Gleixner, Peter Zijlstra, Mike Galbraith, Ingo Molnar,
	Jonathan Corbet, linux-kernel, linux-doc, Jason Low,
	Scott J Norton, Douglas Hatch

On 09/23/2016 09:28 PM, Davidlohr Bueso wrote:
>
> +#ifdef CONFIG_RT_MUTEX_SPIN_ON_OWNER
> +static bool rt_mutex_spin_on_owner(struct rt_mutex *lock,
> +                   struct task_struct *owner)
> +{
> +    bool ret = true;
> +
> +    /*
> +     * The last owner could have just released the lock,
> +     * immediately try taking it again.
> +     */
> +    if (!owner)
> +        goto done;
> +
> +    rcu_read_lock();
> +    while (rt_mutex_owner(lock) == owner) {
> +        /*
> +         * Ensure we emit the owner->on_cpu, dereference _after_
> +         * checking lock->owner still matches owner. If that fails,
> +         * owner might point to freed memory. If it still matches,
> +         * the rcu_read_lock() ensures the memory stays valid.
> +         */
> +        barrier();
> +        if (!owner->on_cpu || need_resched()) {
> +            ret = false;
> +            break;
> +        }
> +
> +        cpu_relax_lowlatency();
> +    }
> +    rcu_read_unlock();
> +done:
> +    return ret;
> +}
> +

One issue that I saw is that the spinner may no longer be the top waiter 
while spinning. Should we also check this condition in the spin loop?
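
I mean something like the sketch below. rt_mutex_top_waiter() normally 
wants the wait_lock held, so the check would have to be done carefully 
locklessly, and the waiter would need to be passed into 
rt_mutex_spin_on_owner():

	while (rt_mutex_owner(lock) == owner) {
		barrier();
		/*
		 * Stop spinning when the owner blocks, when we need to
		 * reschedule, or when a higher priority waiter has
		 * queued itself ahead of us.
		 */
		if (!owner->on_cpu || need_resched() ||
		    rt_mutex_top_waiter(lock) != waiter) {
			ret = false;
			break;
		}
		cpu_relax_lowlatency();
	}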

Cheers,
Longman

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
  2016-09-23 13:02                         ` Thomas Gleixner
@ 2016-09-26 22:02                           ` Waiman Long
  0 siblings, 0 replies; 35+ messages in thread
From: Waiman Long @ 2016-09-26 22:02 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Davidlohr Bueso, Peter Zijlstra, Mike Galbraith, Ingo Molnar,
	Jonathan Corbet, linux-kernel, linux-doc, Jason Low,
	Scott J Norton, Douglas Hatch

On 09/23/2016 09:02 AM, Thomas Gleixner wrote:
> On Thu, 22 Sep 2016, Waiman Long wrote:
>>>> Locking was done mostly by lock stealing. This is where most of the
>>>> performance benefit comes from, not optimistic spinning.
>>> How does the lock latency distribution of all this look like and how fair
>>> is the whole thing?
>> The TO futexes are unfair as can be seen from the min/max thread times listed
>> above. It took the fastest thread 0.07s to complete all the locking
>> operations, whereas the slowest one needed 2.65s. However, the situation
>> reverses when I changed the critical section to a 1us sleep. In this case,
> 1us sleep is going to add another syscall and therefore scheduling, so what?
>
> Or did you just extend the critical section busy time?

The 1us sleep will cause the spinning to stop and make all the waiters 
sleep. This is to simulate the extreme case where the TO futexes may not 
have a performance advantage.

>
>> there will be no optimistic spinning. The performance results for 100k locking
>> operations were listed below.
>>
>>                  wait-wake futex     PI futex        TO futex
>>                  ---------------     --------        --------
>> max time            0.06s             9.32s          4.76s
>        		^^^^ ????			

Yes, wait-wake futex is the unfair one in this case.

>> min time            5.59s             9.36s          5.62s
>> average time        3.25s             9.35s          5.41s
>>
>> In this case, the TO futexes are fairer but perform worse than the wait-wake
>> futexes. That is because the lock handoff mechanism limit the amount of lock
>> stealing in the TO futexes while the wait-wake futexes have no such
>> restriction. When I disabled  lock handoff, the TO futexes would then perform
>> similar to the wait-wake futexes.
> So the benefit of these newfangled futexes is only there for extremely short
> critical sections and a gazillion threads fighting for the same futex,
> right?

Not really. Lock stealing will help performance when a gazillion of 
threads are fighting for the same futex. Optimistic spinning will help 
to reduce the lock transfer latency because the waiter isn't sleeping, 
no matter the number of threads. One set of data that I haven't shown 
so far is that the performance delta between wait-wake and TO futexes 
actually increases as the critical section is lengthened. This is 
because, for a short critical section, the waiters of a wait-wake futex 
may not actually go to sleep: by the time the code that has to run 
before the final check of the futex value has executed, the value has 
often already changed, so they return without sleeping. The longer the 
critical section, the higher the chance that they actually sleep, and 
hence their performance gets worse relative to the TO futexes.
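
For reference, the behavior described here falls out of the classic
wait-wake locking pattern. A minimal sketch of a Drepper-style futex
mutex (an illustrative reconstruction, not the benchmark's actual code)
shows where that final in-kernel check happens:

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Futex word: 0 = unlocked, 1 = locked, 2 = locked with waiters. */
static void ww_lock(int *futex)
{
	int c = __sync_val_compare_and_swap(futex, 0, 1);

	if (c == 0)
		return;		/* uncontended fast path */

	do {
		/* Mark the lock contended so the unlocker wakes us. */
		if (c == 2 ||
		    __sync_val_compare_and_swap(futex, 1, 2) != 0)
			/*
			 * FUTEX_WAIT re-checks *futex == 2 inside the
			 * kernel before sleeping; if the owner released
			 * the lock in the meantime, it returns with
			 * EWOULDBLOCK right away.  For short critical
			 * sections the syscall-entry latency means this
			 * check often fails, so the waiter never
			 * actually sleeps.
			 */
			syscall(SYS_futex, futex, FUTEX_WAIT, 2,
				NULL, NULL, 0);
	} while ((c = __sync_val_compare_and_swap(futex, 0, 2)) != 0);
}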

For example, with a critical section of 50 pause instructions instead 
of 5, the performance gain is about 5X, versus about 1.6X for the 
shorter critical section.

> I really wonder how the average programmer should pick the right flavour,
> not to mention any useful decision for something like glibc to pick the
> proper one.

I would say that TO futexes will have better performance in most cases. 
Of course, I still need to run some real world benchmarks to quantify 
the effect of the new futexes. I am hoping to get suggestions on a good 
set of benchmarks to run.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v2 -tip] locking/rtmutex: Reduce top-waiter blocking on a lock
  2016-09-22 21:59                         ` Waiman Long
@ 2016-09-27 19:02                           ` Davidlohr Bueso
  2016-10-24 18:08                             ` Davidlohr Bueso
  0 siblings, 1 reply; 35+ messages in thread
From: Davidlohr Bueso @ 2016-09-27 19:02 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Peter Zijlstra, Mike Galbraith, Ingo Molnar,
	Jonathan Corbet, linux-kernel, linux-doc, Jason Low,
	Scott J Norton, Douglas Hatch

By applying well known spin-on-lock-owner techniques, we can avoid the
blocking overhead while a task is trying to take the rtmutex. The idea
is that as long as the owner is running, there is a fair chance it will
release the lock soon, and thus a task trying to acquire the rtmutex is
better off spinning instead of blocking immediately after the fastpath.
This is similar to what we use for other locks, borrowed from -rt. The
main difference (due to the obvious real-time constraints) is that
top-waiter spinning must account for any new higher-priority waiter, and
therefore cannot steal the lock and avoid the pi-dance. As such, there
will be at most one spinning waiter on a contended lock.

Conditions to stop spinning and block are simple:

(1) Upon need_resched()
(2) Current lock owner blocks
(3) We are no longer the top-waiter

The unlock side remains unchanged, as wake_up_process() can safely deal
with calls where the task is not actually blocked (TASK_NORMAL). The
biggest concern would be if we relied on any implicit barriers (in that
the wake_up_process() call would no longer imply one since nothing was
awoken), but this is not the case. As such, there is only some
unnecessary overhead dealing with the wake_q, but this ensures we do
not miss any wakeups between the spinning step and the unlock side.
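
Concretely, the unlock slow path keeps its existing shape (a fragment
sketch against the wake_q API as used here; mark_wakeup_next_waiter()
is the function touched in the hunk below, everything else is elided):

	WAKE_Q(wake_q);

	raw_spin_lock_irqsave(&lock->wait_lock, flags);
	...
	/* Dequeue the top waiter and queue it for wakeup. */
	mark_wakeup_next_waiter(&wake_q, lock);
	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);

	/*
	 * wake_up_q() ends up calling wake_up_process() on the top waiter.
	 * If that task is still spinning on ->owner (TASK_RUNNING), this is
	 * a safe no-op; if it has already blocked, it is woken exactly as
	 * before.  Either way no wakeup is lost between the spinning step
	 * and the unlock side.
	 */
	wake_up_q(&wake_q);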

Measuring the number of priority inversions performed by the pi_stress
program, there is some improvement in throughput during a 30-second window.
On a 32-core box, with increasing thread-group count:

pistress
			    4.4.3                 4.4.3
			  vanilla           rtmutex-topspinner
Hmean    1   2321586.73 (  0.00%)  2339847.23 (  0.79%)
Hmean    4   8209026.49 (  0.00%)  8597809.55 (  4.74%)
Hmean    7  12655322.45 (  0.00%) 13194896.45 (  4.26%)
Hmean    12  4210477.03 (  0.00%)  4348643.08 (  3.28%)
Hmean    21  2996823.05 (  0.00%)  3104513.47 (  3.59%)
Hmean    30  2463107.53 (  0.00%)  2584275.71 (  4.91%)
Hmean    48  2656668.46 (  0.00%)  2719324.53 (  2.36%)
Hmean    64  2397253.65 (  0.00%)  2471628.92 (  3.10%)
Stddev   1    653473.88 (  0.00%)   527076.59 (-19.34%)
Stddev   4    664995.50 (  0.00%)   359487.15 (-45.94%)
Stddev   7    248476.88 (  0.00%)   278307.31 ( 12.01%)
Stddev   12    74537.42 (  0.00%)    54305.86 (-27.14%)
Stddev   21    72143.80 (  0.00%)    40371.42 (-44.04%)
Stddev   30    31981.43 (  0.00%)    42306.07 ( 32.28%)
Stddev   48    21317.95 (  0.00%)    42608.50 ( 99.87%)
Stddev   64    23433.99 (  0.00%)    21502.56 ( -8.24%)

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---

Changes from v1: Stop spinning and block if we are no longer the
top-waiter for the lock. This also prevents having more than one
spinner at a time, as the old waiter would otherwise still think it's
next in line for the lock.

 kernel/Kconfig.locks            |  4 ++
 kernel/locking/rtmutex.c        | 92 ++++++++++++++++++++++++++++++++++-------
 kernel/locking/rtmutex_common.h |  4 +-
 3 files changed, 82 insertions(+), 18 deletions(-)

diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index ebdb0043203a..e20790cdc446 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -227,6 +227,10 @@ config MUTEX_SPIN_ON_OWNER
 	def_bool y
 	depends on SMP && !DEBUG_MUTEXES && ARCH_SUPPORTS_ATOMIC_RMW
 
+config RT_MUTEX_SPIN_ON_OWNER
+       def_bool y
+       depends on SMP && RT_MUTEXES && !DEBUG_RT_MUTEXES && ARCH_SUPPORTS_ATOMIC_RMW
+
 config RWSEM_SPIN_ON_OWNER
        def_bool y
        depends on SMP && RWSEM_XCHGADD_ALGORITHM && ARCH_SUPPORTS_ATOMIC_RMW
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 1ec0f48962b3..a789528172c2 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -54,13 +54,13 @@ rt_mutex_set_owner(struct rt_mutex *lock, struct task_struct *owner)
 	if (rt_mutex_has_waiters(lock))
 		val |= RT_MUTEX_HAS_WAITERS;
 
-	lock->owner = (struct task_struct *)val;
+	WRITE_ONCE(lock->owner, (struct task_struct *)val);
 }
 
 static inline void clear_rt_mutex_waiters(struct rt_mutex *lock)
 {
-	lock->owner = (struct task_struct *)
-			((unsigned long)lock->owner & ~RT_MUTEX_HAS_WAITERS);
+	WRITE_ONCE(lock->owner, (struct task_struct *)
+		   ((unsigned long)lock->owner & ~RT_MUTEX_HAS_WAITERS));
 }
 
 static void fixup_rt_mutex_waiters(struct rt_mutex *lock)
@@ -198,7 +198,7 @@ rt_mutex_enqueue(struct rt_mutex *lock, struct rt_mutex_waiter *waiter)
 	}
 
 	if (leftmost)
-		lock->waiters_leftmost = &waiter->tree_entry;
+		WRITE_ONCE(lock->waiters_leftmost, &waiter->tree_entry);
 
 	rb_link_node(&waiter->tree_entry, parent, link);
 	rb_insert_color(&waiter->tree_entry, &lock->waiters);
@@ -210,8 +210,8 @@ rt_mutex_dequeue(struct rt_mutex *lock, struct rt_mutex_waiter *waiter)
 	if (RB_EMPTY_NODE(&waiter->tree_entry))
 		return;
 
-	if (lock->waiters_leftmost == &waiter->tree_entry)
-		lock->waiters_leftmost = rb_next(&waiter->tree_entry);
+	if (READ_ONCE(lock->waiters_leftmost) == &waiter->tree_entry)
+		WRITE_ONCE(lock->waiters_leftmost, rb_next(&waiter->tree_entry));
 
 	rb_erase(&waiter->tree_entry, &lock->waiters);
 	RB_CLEAR_NODE(&waiter->tree_entry);
@@ -989,14 +989,17 @@ static void mark_wakeup_next_waiter(struct wake_q_head *wake_q,
 	rt_mutex_dequeue_pi(current, waiter);
 
 	/*
-	 * As we are waking up the top waiter, and the waiter stays
-	 * queued on the lock until it gets the lock, this lock
-	 * obviously has waiters. Just set the bit here and this has
-	 * the added benefit of forcing all new tasks into the
-	 * slow path making sure no task of lower priority than
-	 * the top waiter can steal this lock.
+	 * As we are potentially waking up the top waiter, and the waiter
+	 * stays queued on the lock until it gets the lock, this lock
+	 * obviously has waiters. Just set the bit here and this has the
+	 * added benefit of forcing all new tasks into the slow path
+	 * making sure no task of lower priority than the top waiter can
+	 * steal this lock.
+	 *
+	 * If the top waiter, otoh, is spinning on ->owner, this will also
+	 * serve to exit out of the loop and try to acquire the lock.
 	 */
-	lock->owner = (void *) RT_MUTEX_HAS_WAITERS;
+	WRITE_ONCE(lock->owner, (void *) RT_MUTEX_HAS_WAITERS);
 
 	raw_spin_unlock(&current->pi_lock);
 
@@ -1089,6 +1092,51 @@ void rt_mutex_adjust_pi(struct task_struct *task)
 				   next_lock, NULL, task);
 }
 
+#ifdef CONFIG_RT_MUTEX_SPIN_ON_OWNER
+static bool rt_mutex_spin_on_owner(struct rt_mutex *lock,
+				   struct task_struct *owner,
+				   struct rt_mutex_waiter *waiter)
+{
+	bool ret = true;
+
+	/*
+	 * The last owner could have just released the lock,
+	 * immediately try taking it again.
+	 */
+	if (!owner)
+		goto done;
+
+	rcu_read_lock();
+	while (rt_mutex_owner(lock) == owner) {
+		/*
+	 * Ensure we emit the owner->on_cpu dereference _after_
+		 * checking lock->owner still matches owner. If that fails,
+		 * owner might point to freed memory. If it still matches,
+		 * the rcu_read_lock() ensures the memory stays valid.
+		 */
+		barrier();
+		if (!owner->on_cpu || need_resched() ||
+		    rt_mutex_top_waiter(lock) != waiter) {
+			ret = false;
+			break;
+		}
+
+		cpu_relax_lowlatency();
+	}
+	rcu_read_unlock();
+done:
+	return ret;
+}
+
+#else
+static bool rt_mutex_spin_on_owner(struct rt_mutex *lock,
+				   struct task_struct *owner,
+				   struct rt_mutex_waiter *waiter)
+{
+	return false;
+}
+#endif
+
 /**
  * __rt_mutex_slowlock() - Perform the wait-wake-try-to-take loop
  * @lock:		 the rt_mutex to take
@@ -1107,6 +1155,8 @@ __rt_mutex_slowlock(struct rt_mutex *lock, int state,
 	int ret = 0;
 
 	for (;;) {
+		struct rt_mutex_waiter *top_waiter = NULL;
+
 		/* Try to acquire the lock: */
 		if (try_to_take_rt_mutex(lock, current, waiter))
 			break;
@@ -1125,11 +1175,21 @@ __rt_mutex_slowlock(struct rt_mutex *lock, int state,
 				break;
 		}
 
+		top_waiter = rt_mutex_top_waiter(lock);
+
 		raw_spin_unlock_irq(&lock->wait_lock);
 
 		debug_rt_mutex_print_deadlock(waiter);
 
-		schedule();
+		/*
+		 * At this point the PI-dance is done, and, as the top waiter,
+		 * we are next in line for the lock. Try to spin on the current
+		 * owner for a while, in the hope that the lock will be released
+		 * soon. Otherwise fall back and block.
+		 */
+		if (top_waiter != waiter ||
+		    !rt_mutex_spin_on_owner(lock, rt_mutex_owner(lock), waiter))
+			schedule();
 
 		raw_spin_lock_irq(&lock->wait_lock);
 		set_current_state(state);
@@ -1555,7 +1615,7 @@ EXPORT_SYMBOL_GPL(__rt_mutex_init);
  * rt_mutex_init_proxy_locked - initialize and lock a rt_mutex on behalf of a
  *				proxy owner
  *
- * @lock: 	the rt_mutex to be locked
+ * @lock:	the rt_mutex to be locked
  * @proxy_owner:the task to set as owner
  *
  * No locking. Caller has to do serializing itself
@@ -1573,7 +1633,7 @@ void rt_mutex_init_proxy_locked(struct rt_mutex *lock,
 /**
  * rt_mutex_proxy_unlock - release a lock on behalf of owner
  *
- * @lock: 	the rt_mutex to be locked
+ * @lock:	the rt_mutex to be locked
  *
  * No locking. Caller has to do serializing itself
  * Special API call for PI-futex support
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index 4f5f83c7d2d3..dc80fce3b407 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -48,7 +48,7 @@ rt_mutex_top_waiter(struct rt_mutex *lock)
 {
 	struct rt_mutex_waiter *w;
 
-	w = rb_entry(lock->waiters_leftmost, struct rt_mutex_waiter,
+	w = rb_entry(READ_ONCE(lock->waiters_leftmost), struct rt_mutex_waiter,
 		     tree_entry);
 	BUG_ON(w->lock != lock);
 
@@ -76,7 +76,7 @@ task_top_pi_waiter(struct task_struct *p)
 static inline struct task_struct *rt_mutex_owner(struct rt_mutex *lock)
 {
 	return (struct task_struct *)
-		((unsigned long)lock->owner & ~RT_MUTEX_OWNER_MASKALL);
+		((unsigned long)READ_ONCE(lock->owner) & ~RT_MUTEX_OWNER_MASKALL);
 }
 
 /*
-- 
2.6.6

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 -tip] locking/rtmutex: Reduce top-waiter blocking on a lock
  2016-09-27 19:02                           ` [PATCH v2 -tip] locking/rtmutex: Reduce top-waiter blocking on a lock Davidlohr Bueso
@ 2016-10-24 18:08                             ` Davidlohr Bueso
  2016-10-24 18:48                               ` Thomas Gleixner
  0 siblings, 1 reply; 35+ messages in thread
From: Davidlohr Bueso @ 2016-10-24 18:08 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Peter Zijlstra, Mike Galbraith, Ingo Molnar,
	Jonathan Corbet, linux-kernel, linux-doc, Jason Low,
	Scott J Norton, Douglas Hatch

Any comments? Can this make it into v4.10?

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 -tip] locking/rtmutex: Reduce top-waiter blocking on a lock
  2016-10-24 18:08                             ` Davidlohr Bueso
@ 2016-10-24 18:48                               ` Thomas Gleixner
  0 siblings, 0 replies; 35+ messages in thread
From: Thomas Gleixner @ 2016-10-24 18:48 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Waiman Long, Peter Zijlstra, Mike Galbraith, Ingo Molnar,
	Jonathan Corbet, linux-kernel, linux-doc, Jason Low,
	Scott J Norton, Douglas Hatch

On Mon, 24 Oct 2016, Davidlohr Bueso wrote:

> Any comments? can this make it for v4.10?

We need to fry the futex/rtmutex fish first....

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread

Thread overview: 35+ messages
2016-09-20 13:42 [RFC PATCH v2 0/5] futex: Introducing throughput-optimized futexes Waiman Long
2016-09-20 13:42 ` [RFC PATCH v2 1/5] futex: Add futex_set_timer() helper function Waiman Long
2016-09-22 21:31   ` Thomas Gleixner
2016-09-23  0:45     ` Waiman Long
2016-09-20 13:42 ` [RFC PATCH v2 2/5] futex: Rename futex_pi_state to futex_state Waiman Long
2016-09-20 13:42 ` [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes Waiman Long
2016-09-21  6:59   ` Mike Galbraith
2016-09-21 23:37     ` Waiman Long
2016-09-22  7:49       ` Peter Zijlstra
2016-09-22 13:04         ` Waiman Long
2016-09-22 13:34         ` Thomas Gleixner
2016-09-22 14:41           ` Davidlohr Bueso
2016-09-22 14:46             ` Thomas Gleixner
2016-09-22 15:11               ` Davidlohr Bueso
2016-09-22 20:08                 ` Waiman Long
2016-09-22 20:28                   ` Waiman Long
2016-09-22 20:38                     ` Thomas Gleixner
2016-09-22 21:48                       ` Waiman Long
2016-09-23 13:02                         ` Thomas Gleixner
2016-09-26 22:02                           ` Waiman Long
2016-09-22 21:39                     ` Davidlohr Bueso
2016-09-22 21:41                       ` Thomas Gleixner
2016-09-22 21:59                         ` Waiman Long
2016-09-27 19:02                           ` [PATCH v2 -tip] locking/rtmutex: Reduce top-waiter blocking on a lock Davidlohr Bueso
2016-10-24 18:08                             ` Davidlohr Bueso
2016-10-24 18:48                               ` Thomas Gleixner
2016-09-24  1:28                         ` [PATCH " Davidlohr Bueso
2016-09-26 21:40                           ` Waiman Long
2016-09-22 19:56           ` [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes Waiman Long
2016-09-22 20:26             ` Thomas Gleixner
2016-09-22 21:13               ` Waiman Long
2016-09-22 13:23   ` Peter Zijlstra
2016-09-22 17:21     ` Waiman Long
2016-09-20 13:42 ` [RFC PATCH v2 4/5] futex: Add timeout support to TO futexes Waiman Long
2016-09-20 13:42 ` [RFC PATCH v2 5/5] futex, doc: TO futexes document Waiman Long
