linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 0/4] futex: Introducing throughput-optimized futexes
@ 2016-09-06 18:53 Waiman Long
  2016-09-06 18:53 ` [RFC PATCH 1/4] futex: Add futex_set_timer() helper function Waiman Long
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Waiman Long @ 2016-09-06 18:53 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Davidlohr Bueso, Scott J Norton,
	Douglas Hatch, Waiman Long

This patchset introduces a new futex implementation called
throughput-optimized (TO) futexes. It is similar to PI futexes in its
calling convention, but provides better throughput than the wait-wake
futexes by encouraging lock stealing and optimistic spinning.
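To illustrate the calling convention described above, the userspace fast path would presumably look like the sketch below. This is an editorial sketch, not code from the patchset: the helper names and the stubbed-out slow paths (which in real code would be futex(2) calls with the new FUTEX_LOCK_TO/FUTEX_UNLOCK_TO opcodes) are assumptions; only the atomic 0 <-> TID transitions are exercised here.

```c
/* Editorial sketch of the TO futex userspace fast path: lock is an
 * atomic 0 -> TID transition, unlock is TID -> 0.  The kernel slow
 * paths are stubbed out so the fast path can be tried on its own. */
#include <stdatomic.h>
#include <stdbool.h>

/* Would be gettid() in real code; a fixed value for illustration. */
static unsigned int self_tid = 42;

/* Hypothetical slow paths: would be
 * syscall(SYS_futex, f, FUTEX_LOCK_TO/FUTEX_UNLOCK_TO, ...) */
static void futex_to_lock_slow(atomic_uint *f) { (void)f; }
static void futex_to_unlock_slow(atomic_uint *f) { (void)f; }

static bool futex_to_trylock(atomic_uint *f)
{
	unsigned int expected = 0;

	/* Lock fast path: transition the futex word from 0 to our TID. */
	return atomic_compare_exchange_strong(f, &expected, self_tid);
}

void futex_to_lock(atomic_uint *f)
{
	if (!futex_to_trylock(f))
		futex_to_lock_slow(f);	/* contended: let the kernel serialize */
}

void futex_to_unlock(atomic_uint *f)
{
	unsigned int expected = self_tid;

	/* Unlock fast path: transition our TID back to 0.  Fails if the
	 * FUTEX_WAITERS or FUTEX_OWNER_DIED bits got set meanwhile. */
	if (!atomic_compare_exchange_strong(f, &expected, 0))
		futex_to_unlock_slow(f);	/* waiters present: kernel hand-off */
}
```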

The timeout parameter, however, isn't quite working yet. More
investigation will be needed to determine whether it can be fixed or
whether the timeout parameter has to be scrapped entirely.

Waiman Long (4):
  futex: Add futex_set_timer() helper function
  futex: Rename futex_pi_state to futex_state
  futex: Throughput-optimized (TO) futexes
  futex, doc: TO futexes document

 Documentation/00-INDEX     |    2 +
 Documentation/to-futex.txt |  124 ++++++++
 include/linux/sched.h      |    4 +-
 include/uapi/linux/futex.h |    4 +
 kernel/futex.c             |  675 ++++++++++++++++++++++++++++++++++++++------
 5 files changed, 719 insertions(+), 90 deletions(-)
 create mode 100644 Documentation/to-futex.txt


* [RFC PATCH 1/4] futex: Add futex_set_timer() helper function
  2016-09-06 18:53 [RFC PATCH 0/4] futex: Introducing throughput-optimized futexes Waiman Long
@ 2016-09-06 18:53 ` Waiman Long
  2016-09-06 18:53 ` [RFC PATCH 2/4] futex: Rename futex_pi_state to futex_state Waiman Long
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 12+ messages in thread
From: Waiman Long @ 2016-09-06 18:53 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Davidlohr Bueso, Scott J Norton,
	Douglas Hatch, Waiman Long

This patch adds a new futex_set_timer() helper function to consolidate
the sleeping hrtimer setup code.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 kernel/futex.c |   51 ++++++++++++++++++++++++---------------------------
 1 files changed, 24 insertions(+), 27 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 46cb3a3..37e61ef 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -467,6 +467,25 @@ static void drop_futex_key_refs(union futex_key *key)
 	}
 }
 
+/*
+ * Helper function to set the sleeping hrtimer.
+ */
+static inline void futex_set_timer(ktime_t *time, struct hrtimer_sleeper **pto,
+		struct hrtimer_sleeper *timeout, int flags, u64 range_ns)
+{
+	if (!time)
+		return;
+	*pto = timeout;
+	hrtimer_init_on_stack(&timeout->timer, (flags & FLAGS_CLOCKRT) ?
+			      CLOCK_REALTIME : CLOCK_MONOTONIC,
+			      HRTIMER_MODE_ABS);
+	hrtimer_init_sleeper(timeout, current);
+	if (range_ns)
+		hrtimer_set_expires_range_ns(&timeout->timer, *time, range_ns);
+	else
+		hrtimer_set_expires(&timeout->timer, *time);
+}
+
 /**
  * get_futex_key() - Get parameters which are the keys for a futex
  * @uaddr:	virtual address of the futex
@@ -2403,17 +2422,8 @@ static int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
 		return -EINVAL;
 	q.bitset = bitset;
 
-	if (abs_time) {
-		to = &timeout;
-
-		hrtimer_init_on_stack(&to->timer, (flags & FLAGS_CLOCKRT) ?
-				      CLOCK_REALTIME : CLOCK_MONOTONIC,
-				      HRTIMER_MODE_ABS);
-		hrtimer_init_sleeper(to, current);
-		hrtimer_set_expires_range_ns(&to->timer, *abs_time,
-					     current->timer_slack_ns);
-	}
-
+	futex_set_timer(abs_time, &to, &timeout, flags,
+			current->timer_slack_ns);
 retry:
 	/*
 	 * Prepare to wait on uaddr. On success, holds hb lock and increments
@@ -2501,13 +2511,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags,
 	if (refill_pi_state_cache())
 		return -ENOMEM;
 
-	if (time) {
-		to = &timeout;
-		hrtimer_init_on_stack(&to->timer, CLOCK_REALTIME,
-				      HRTIMER_MODE_ABS);
-		hrtimer_init_sleeper(to, current);
-		hrtimer_set_expires(&to->timer, *time);
-	}
+	futex_set_timer(time, &to, &timeout, FLAGS_CLOCKRT, 0);
 
 retry:
 	ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q.key, VERIFY_WRITE);
@@ -2816,15 +2820,8 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 	if (!bitset)
 		return -EINVAL;
 
-	if (abs_time) {
-		to = &timeout;
-		hrtimer_init_on_stack(&to->timer, (flags & FLAGS_CLOCKRT) ?
-				      CLOCK_REALTIME : CLOCK_MONOTONIC,
-				      HRTIMER_MODE_ABS);
-		hrtimer_init_sleeper(to, current);
-		hrtimer_set_expires_range_ns(&to->timer, *abs_time,
-					     current->timer_slack_ns);
-	}
+	futex_set_timer(abs_time, &to, &timeout, flags,
+			current->timer_slack_ns);
 
 	/*
 	 * The waiter is allocated on our stack, manipulated by the requeue
-- 
1.7.1


* [RFC PATCH 2/4] futex: Rename futex_pi_state to futex_state
  2016-09-06 18:53 [RFC PATCH 0/4] futex: Introducing throughput-optimized futexes Waiman Long
  2016-09-06 18:53 ` [RFC PATCH 1/4] futex: Add futex_set_timer() helper function Waiman Long
@ 2016-09-06 18:53 ` Waiman Long
  2016-09-06 18:53 ` [RFC PATCH 3/4] futex: Throughput-optimized (TO) futexes Waiman Long
  2016-09-06 18:53 ` [RFC PATCH 4/4] futex, doc: TO futexes document Waiman Long
  3 siblings, 0 replies; 12+ messages in thread
From: Waiman Long @ 2016-09-06 18:53 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Davidlohr Bueso, Scott J Norton,
	Douglas Hatch, Waiman Long

The futex_pi_state structure will be overloaded in later patches to
store state information about non-PI futexes. So the structure name
itself is no longer a good description of its purpose. This patch
renames it to futex_state, a more generic name.

Some of the functions that process the futex states are also renamed.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 include/linux/sched.h |    4 +-
 kernel/futex.c        |  107 +++++++++++++++++++++++++------------------------
 2 files changed, 56 insertions(+), 55 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 62c68e5..fefd7f7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -126,7 +126,7 @@ struct sched_attr {
 	u64 sched_period;
 };
 
-struct futex_pi_state;
+struct futex_state;
 struct robust_list_head;
 struct bio_list;
 struct fs_struct;
@@ -1772,7 +1772,7 @@ struct task_struct {
 	struct compat_robust_list_head __user *compat_robust_list;
 #endif
 	struct list_head pi_state_list;
-	struct futex_pi_state *pi_state_cache;
+	struct futex_state *pi_state_cache;
 #endif
 #ifdef CONFIG_PERF_EVENTS
 	struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts];
diff --git a/kernel/futex.c b/kernel/futex.c
index 37e61ef..f8bb93f 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -192,11 +192,12 @@ int __read_mostly futex_cmpxchg_enabled;
 #define FLAGS_HAS_TIMEOUT	0x04
 
 /*
- * Priority Inheritance state:
+ * Futex state object:
+ *  - Priority Inheritance state
  */
-struct futex_pi_state {
+struct futex_state {
 	/*
-	 * list of 'owned' pi_state instances - these have to be
+	 * list of 'owned' state instances - these have to be
 	 * cleaned up in do_exit() if the task exits prematurely:
 	 */
 	struct list_head list;
@@ -240,7 +241,7 @@ struct futex_q {
 	struct task_struct *task;
 	spinlock_t *lock_ptr;
 	union futex_key key;
-	struct futex_pi_state *pi_state;
+	struct futex_state *pi_state;
 	struct rt_mutex_waiter *rt_waiter;
 	union futex_key *requeue_pi_key;
 	u32 bitset;
@@ -787,76 +788,76 @@ static int get_futex_value_locked(u32 *dest, u32 __user *from)
 /*
  * PI code:
  */
-static int refill_pi_state_cache(void)
+static int refill_futex_state_cache(void)
 {
-	struct futex_pi_state *pi_state;
+	struct futex_state *state;
 
 	if (likely(current->pi_state_cache))
 		return 0;
 
-	pi_state = kzalloc(sizeof(*pi_state), GFP_KERNEL);
+	state = kzalloc(sizeof(*state), GFP_KERNEL);
 
-	if (!pi_state)
+	if (!state)
 		return -ENOMEM;
 
-	INIT_LIST_HEAD(&pi_state->list);
+	INIT_LIST_HEAD(&state->list);
 	/* pi_mutex gets initialized later */
-	pi_state->owner = NULL;
-	atomic_set(&pi_state->refcount, 1);
-	pi_state->key = FUTEX_KEY_INIT;
+	state->owner = NULL;
+	atomic_set(&state->refcount, 1);
+	state->key = FUTEX_KEY_INIT;
 
-	current->pi_state_cache = pi_state;
+	current->pi_state_cache = state;
 
 	return 0;
 }
 
-static struct futex_pi_state * alloc_pi_state(void)
+static struct futex_state *alloc_futex_state(void)
 {
-	struct futex_pi_state *pi_state = current->pi_state_cache;
+	struct futex_state *state = current->pi_state_cache;
 
-	WARN_ON(!pi_state);
+	WARN_ON(!state);
 	current->pi_state_cache = NULL;
 
-	return pi_state;
+	return state;
 }
 
 /*
- * Drops a reference to the pi_state object and frees or caches it
+ * Drops a reference to the futex state object and frees or caches it
  * when the last reference is gone.
  *
  * Must be called with the hb lock held.
  */
-static void put_pi_state(struct futex_pi_state *pi_state)
+static void put_futex_state(struct futex_state *state)
 {
-	if (!pi_state)
+	if (!state)
 		return;
 
-	if (!atomic_dec_and_test(&pi_state->refcount))
+	if (!atomic_dec_and_test(&state->refcount))
 		return;
 
 	/*
-	 * If pi_state->owner is NULL, the owner is most probably dying
-	 * and has cleaned up the pi_state already
+	 * If state->owner is NULL, the owner is most probably dying
+	 * and has cleaned up the futex state already
 	 */
-	if (pi_state->owner) {
-		raw_spin_lock_irq(&pi_state->owner->pi_lock);
-		list_del_init(&pi_state->list);
-		raw_spin_unlock_irq(&pi_state->owner->pi_lock);
+	if (state->owner) {
+		raw_spin_lock_irq(&state->owner->pi_lock);
+		list_del_init(&state->list);
+		raw_spin_unlock_irq(&state->owner->pi_lock);
 
-		rt_mutex_proxy_unlock(&pi_state->pi_mutex, pi_state->owner);
+		rt_mutex_proxy_unlock(&state->pi_mutex, state->owner);
 	}
 
 	if (current->pi_state_cache)
-		kfree(pi_state);
+		kfree(state);
 	else {
 		/*
-		 * pi_state->list is already empty.
-		 * clear pi_state->owner.
+		 * state->list is already empty.
+		 * clear state->owner.
 		 * refcount is at 0 - put it back to 1.
 		 */
-		pi_state->owner = NULL;
-		atomic_set(&pi_state->refcount, 1);
-		current->pi_state_cache = pi_state;
+		state->owner = NULL;
+		atomic_set(&state->refcount, 1);
+		current->pi_state_cache = state;
 	}
 }
 
@@ -886,7 +887,7 @@ static struct task_struct * futex_find_get_task(pid_t pid)
 void exit_pi_state_list(struct task_struct *curr)
 {
 	struct list_head *next, *head = &curr->pi_state_list;
-	struct futex_pi_state *pi_state;
+	struct futex_state *pi_state;
 	struct futex_hash_bucket *hb;
 	union futex_key key = FUTEX_KEY_INIT;
 
@@ -901,7 +902,7 @@ void exit_pi_state_list(struct task_struct *curr)
 	while (!list_empty(head)) {
 
 		next = head->next;
-		pi_state = list_entry(next, struct futex_pi_state, list);
+		pi_state = list_entry(next, struct futex_state, list);
 		key = pi_state->key;
 		hb = hash_futex(&key);
 		raw_spin_unlock_irq(&curr->pi_lock);
@@ -988,8 +989,8 @@ void exit_pi_state_list(struct task_struct *curr)
  * the pi_state against the user space value. If correct, attach to
  * it.
  */
-static int attach_to_pi_state(u32 uval, struct futex_pi_state *pi_state,
-			      struct futex_pi_state **ps)
+static int attach_to_pi_state(u32 uval, struct futex_state *pi_state,
+			      struct futex_state **ps)
 {
 	pid_t pid = uval & FUTEX_TID_MASK;
 
@@ -1060,10 +1061,10 @@ out_state:
  * it after doing proper sanity checks.
  */
 static int attach_to_pi_owner(u32 uval, union futex_key *key,
-			      struct futex_pi_state **ps)
+			      struct futex_state **ps)
 {
 	pid_t pid = uval & FUTEX_TID_MASK;
-	struct futex_pi_state *pi_state;
+	struct futex_state *pi_state;
 	struct task_struct *p;
 
 	/*
@@ -1104,7 +1105,7 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
 	/*
 	 * No existing pi state. First waiter. [2]
 	 */
-	pi_state = alloc_pi_state();
+	pi_state = alloc_futex_state();
 
 	/*
 	 * Initialize the pi_mutex in locked state and make @p
@@ -1128,7 +1129,7 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
 }
 
 static int lookup_pi_state(u32 uval, struct futex_hash_bucket *hb,
-			   union futex_key *key, struct futex_pi_state **ps)
+			   union futex_key *key, struct futex_state **ps)
 {
 	struct futex_q *match = futex_top_waiter(hb, key);
 
@@ -1180,7 +1181,7 @@ static int lock_pi_update_atomic(u32 __user *uaddr, u32 uval, u32 newval)
  */
 static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
 				union futex_key *key,
-				struct futex_pi_state **ps,
+				struct futex_state **ps,
 				struct task_struct *task, int set_waiters)
 {
 	u32 uval, newval, vpid = task_pid_vnr(task);
@@ -1306,7 +1307,7 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this,
 			 struct futex_hash_bucket *hb)
 {
 	struct task_struct *new_owner;
-	struct futex_pi_state *pi_state = this->pi_state;
+	struct futex_state *pi_state = this->pi_state;
 	u32 uninitialized_var(curval), newval;
 	WAKE_Q(wake_q);
 	bool deboost;
@@ -1646,7 +1647,7 @@ static int futex_proxy_trylock_atomic(u32 __user *pifutex,
 				 struct futex_hash_bucket *hb1,
 				 struct futex_hash_bucket *hb2,
 				 union futex_key *key1, union futex_key *key2,
-				 struct futex_pi_state **ps, int set_waiters)
+				 struct futex_state **ps, int set_waiters)
 {
 	struct futex_q *top_waiter = NULL;
 	u32 curval;
@@ -1715,7 +1716,7 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags,
 {
 	union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT;
 	int drop_count = 0, task_count = 0, ret;
-	struct futex_pi_state *pi_state = NULL;
+	struct futex_state *pi_state = NULL;
 	struct futex_hash_bucket *hb1, *hb2;
 	struct futex_q *this, *next;
 	WAKE_Q(wake_q);
@@ -1732,7 +1733,7 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags,
 		 * requeue_pi requires a pi_state, try to allocate it now
 		 * without any locks in case it fails.
 		 */
-		if (refill_pi_state_cache())
+		if (refill_futex_state_cache())
 			return -ENOMEM;
 		/*
 		 * requeue_pi must wake as many tasks as it can, up to nr_wake
@@ -1944,7 +1945,7 @@ retry_private:
 				 * object.
 				 */
 				this->pi_state = NULL;
-				put_pi_state(pi_state);
+				put_futex_state(pi_state);
 				/*
 				 * We stop queueing more waiters and let user
 				 * space deal with the mess.
@@ -1961,7 +1962,7 @@ retry_private:
 	 * in futex_proxy_trylock_atomic() or in lookup_pi_state(). We
 	 * need to drop it here again.
 	 */
-	put_pi_state(pi_state);
+	put_futex_state(pi_state);
 
 out_unlock:
 	double_unlock_hb(hb1, hb2);
@@ -2116,7 +2117,7 @@ static void unqueue_me_pi(struct futex_q *q)
 	__unqueue_futex(q);
 
 	BUG_ON(!q->pi_state);
-	put_pi_state(q->pi_state);
+	put_futex_state(q->pi_state);
 	q->pi_state = NULL;
 
 	spin_unlock(q->lock_ptr);
@@ -2132,7 +2133,7 @@ static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q,
 				struct task_struct *newowner)
 {
 	u32 newtid = task_pid_vnr(newowner) | FUTEX_WAITERS;
-	struct futex_pi_state *pi_state = q->pi_state;
+	struct futex_state *pi_state = q->pi_state;
 	struct task_struct *oldowner = pi_state->owner;
 	u32 uval, uninitialized_var(curval), newval;
 	int ret;
@@ -2508,7 +2509,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags,
 	struct futex_q q = futex_q_init;
 	int res, ret;
 
-	if (refill_pi_state_cache())
+	if (refill_futex_state_cache())
 		return -ENOMEM;
 
 	futex_set_timer(time, &to, &timeout, FLAGS_CLOCKRT, 0);
@@ -2889,7 +2890,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 			 * Drop the reference to the pi state which
 			 * the requeue_pi() code acquired for us.
 			 */
-			put_pi_state(q.pi_state);
+			put_futex_state(q.pi_state);
 			spin_unlock(q.lock_ptr);
 		}
 	} else {
-- 
1.7.1


* [RFC PATCH 3/4] futex: Throughput-optimized (TO) futexes
  2016-09-06 18:53 [RFC PATCH 0/4] futex: Introducing throughput-optimized futexes Waiman Long
  2016-09-06 18:53 ` [RFC PATCH 1/4] futex: Add futex_set_timer() helper function Waiman Long
  2016-09-06 18:53 ` [RFC PATCH 2/4] futex: Rename futex_pi_state to futex_state Waiman Long
@ 2016-09-06 18:53 ` Waiman Long
  2016-09-22 13:32   ` Thomas Gleixner
  2016-09-06 18:53 ` [RFC PATCH 4/4] futex, doc: TO futexes document Waiman Long
  3 siblings, 1 reply; 12+ messages in thread
From: Waiman Long @ 2016-09-06 18:53 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Davidlohr Bueso, Scott J Norton,
	Douglas Hatch, Waiman Long

This patch introduces a new futex implementation called
throughput-optimized (TO) futexes. The goal of this new futex type is
to provide locking throughput higher than that of wait-wake futexes,
especially on systems with a large number of CPUs where the lock
owners are unlikely to sleep. The downside is an increase in
response-time variance.

Using a futex locking microbenchmark to do 10 million locking
operations, with 5 pause instructions as the critical section, by 256
threads (10M/256 locking ops each) on a 4-socket 72-core 144-thread
Haswell-EX system, the results of the benchmark runs were as follows:

                wait-wake futex     PI futex        TO futex
                ---------------     --------        --------
elapsed time        3.28s            57.77s          2.63s
min time            3.05s            57.50s          0.29s
average time        3.21s            57.76s          1.92s
lock count       2,827,202          9,999,640       555,025
unlock count     3,027,002          9,999,641        16,217

The lock and unlock counts above show the actual numbers of futex(2)
lock and unlock syscalls that were being issued.
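The benchmark structure described above (N threads, fixed total number of lock operations, a short pause-loop critical section) can be approximated with the per-thread loop sketched below. This is an editorial sketch with illustrative names: it uses a plain userspace spinlock in place of the TO futex syscalls (which only exist with this patchset applied), so its numbers are not comparable with the table; it only shows the shape of the workload.

```c
/* Editorial sketch of the microbenchmark loop structure.  A plain
 * atomic spinlock stands in for the TO futex lock/unlock paths. */
#include <pthread.h>
#include <stdatomic.h>

#define NTHREADS	4	/* 256 in the quoted runs */
#define OPS_PER_THREAD	10000	/* 10M/256 in the quoted runs */

static atomic_uint lock_word;	/* stand-in for the futex word */
static atomic_ulong total_ops;

static void critical_section(void)
{
	/* ~5 pause instructions in the original benchmark. */
	for (volatile int i = 0; i < 5; i++)
		;
}

static void *worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < OPS_PER_THREAD; i++) {
		unsigned int expected = 0;

		/* With real TO futexes, a failed transition here would
		 * enter the FUTEX_LOCK_TO slow path instead of spinning. */
		while (!atomic_compare_exchange_weak(&lock_word, &expected, 1))
			expected = 0;
		critical_section();
		atomic_store(&lock_word, 0);
		atomic_fetch_add(&total_ops, 1);
	}
	return NULL;
}

unsigned long run_bench(void)
{
	pthread_t tid[NTHREADS];

	atomic_store(&total_ops, 0);
	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, worker, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	return atomic_load(&total_ops);
}
```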

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 include/uapi/linux/futex.h |    4 +
 kernel/futex.c             |  523 +++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 516 insertions(+), 11 deletions(-)

diff --git a/include/uapi/linux/futex.h b/include/uapi/linux/futex.h
index 0b1f716..e7deaf3 100644
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -20,6 +20,8 @@
 #define FUTEX_WAKE_BITSET	10
 #define FUTEX_WAIT_REQUEUE_PI	11
 #define FUTEX_CMP_REQUEUE_PI	12
+#define FUTEX_LOCK_TO		13
+#define FUTEX_UNLOCK_TO		14
 
 #define FUTEX_PRIVATE_FLAG	128
 #define FUTEX_CLOCK_REALTIME	256
@@ -39,6 +41,8 @@
 					 FUTEX_PRIVATE_FLAG)
 #define FUTEX_CMP_REQUEUE_PI_PRIVATE	(FUTEX_CMP_REQUEUE_PI | \
 					 FUTEX_PRIVATE_FLAG)
+#define FUTEX_LOCK_TO_PRIVATE	(FUTEX_LOCK_TO | FUTEX_PRIVATE_FLAG)
+#define FUTEX_UNLOCK_TO_PRIVATE	(FUTEX_UNLOCK_TO | FUTEX_PRIVATE_FLAG)
 
 /*
  * Support for robust futexes: the kernel cleans up held futexes at
diff --git a/kernel/futex.c b/kernel/futex.c
index f8bb93f..8bc2803 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -191,6 +191,11 @@ int __read_mostly futex_cmpxchg_enabled;
 #define FLAGS_CLOCKRT		0x02
 #define FLAGS_HAS_TIMEOUT	0x04
 
+enum futex_type {
+	TYPE_PI = 0,
+	TYPE_TO,
+};
+
 /*
  * Futex state object:
  *  - Priority Inheritance state
@@ -203,13 +208,28 @@ struct futex_state {
 	struct list_head list;
 
 	/*
-	 * The PI object:
+	 * Linking into the owning hashed bucket (TO futexes only)
 	 */
-	struct rt_mutex pi_mutex;
+	struct list_head hb_list;
 
+	/*
+	 * The PI or mutex object:
+	 */
+	union {
+		struct rt_mutex pi_mutex;
+		struct mutex mutex;
+	};
+
+	/*
+	 * For PI futex, owner is the task that owns the futex.
+	 * For TO futex, owner is the mutex lock holder that is either
+	 * spinning on the futex owner or sleeping.
+	 */
 	struct task_struct *owner;
 	atomic_t refcount;
 
+	enum futex_type type;
+
 	union futex_key key;
 };
 
@@ -262,6 +282,7 @@ struct futex_hash_bucket {
 	atomic_t waiters;
 	spinlock_t lock;
 	struct plist_head chain;
+	struct list_head fs_list;	/* List of futex state objects */
 } ____cacheline_aligned_in_smp;
 
 /*
@@ -801,6 +822,8 @@ static int refill_futex_state_cache(void)
 		return -ENOMEM;
 
 	INIT_LIST_HEAD(&state->list);
+	INIT_LIST_HEAD(&state->hb_list);
+
 	/* pi_mutex gets initialized later */
 	state->owner = NULL;
 	atomic_set(&state->refcount, 1);
@@ -836,10 +859,10 @@ static void put_futex_state(struct futex_state *state)
 		return;
 
 	/*
-	 * If state->owner is NULL, the owner is most probably dying
-	 * and has cleaned up the futex state already
+	 * If state->owner is NULL and the type is TYPE_PI, the owner
+	 * is most probably dying and has cleaned up the state already
 	 */
-	if (state->owner) {
+	if (state->owner && (state->type == TYPE_PI)) {
 		raw_spin_lock_irq(&state->owner->pi_lock);
 		list_del_init(&state->list);
 		raw_spin_unlock_irq(&state->owner->pi_lock);
@@ -847,6 +870,11 @@ static void put_futex_state(struct futex_state *state)
 		rt_mutex_proxy_unlock(&state->pi_mutex, state->owner);
 	}
 
+	/*
+	 * Dequeue it from the HB futex state list.
+	 */
+	list_del_init(&state->hb_list);
+
 	if (current->pi_state_cache)
 		kfree(state);
 	else {
@@ -919,13 +947,24 @@ void exit_pi_state_list(struct task_struct *curr)
 			continue;
 		}
 
-		WARN_ON(pi_state->owner != curr);
 		WARN_ON(list_empty(&pi_state->list));
+		if (pi_state->type == TYPE_PI) {
+			WARN_ON(pi_state->owner != curr);
+			pi_state->owner = NULL;
+		}
 		list_del_init(&pi_state->list);
-		pi_state->owner = NULL;
 		raw_spin_unlock_irq(&curr->pi_lock);
 
-		rt_mutex_unlock(&pi_state->pi_mutex);
+		if (pi_state->type == TYPE_PI)
+			rt_mutex_unlock(&pi_state->pi_mutex);
+		else if (pi_state->type == TYPE_TO) {
+			/*
+			 * Need to wakeup the mutex owner.
+			 */
+			WARN_ON(!pi_state->owner);
+			if (pi_state->owner)
+				wake_up_process(pi_state->owner);
+		}
 
 		spin_unlock(&hb->lock);
 
@@ -997,7 +1036,7 @@ static int attach_to_pi_state(u32 uval, struct futex_state *pi_state,
 	/*
 	 * Userspace might have messed up non-PI and PI futexes [3]
 	 */
-	if (unlikely(!pi_state))
+	if (unlikely(!pi_state || (pi_state->type != TYPE_PI)))
 		return -EINVAL;
 
 	WARN_ON(!atomic_read(&pi_state->refcount));
@@ -1115,6 +1154,7 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
 
 	/* Store the key for possible exit cleanups: */
 	pi_state->key = *key;
+	pi_state->type = TYPE_PI;
 
 	WARN_ON(!list_empty(&pi_state->list));
 	list_add(&pi_state->list, &p->pi_state_list);
@@ -3075,8 +3115,8 @@ retry:
 			goto retry;
 
 		/*
-		 * Wake robust non-PI futexes here. The wakeup of
-		 * PI futexes happens in exit_pi_state():
+		 * Wake robust non-PI/TO futexes here. The wakeup of
+		 * PI/TO futexes happens in exit_pi_state():
 		 */
 		if (!pi && (uval & FUTEX_WAITERS))
 			futex_wake(uaddr, 1, 1, FUTEX_BITSET_MATCH_ANY);
@@ -3171,6 +3211,460 @@ void exit_robust_list(struct task_struct *curr)
 				   curr, pip);
 }
 
+/*
+ * Throughput-Optimized Futexes
+ * ----------------------------
+ *
+ * Userspace mutual exclusion locks can be implemented either by using the
+ * wait-wake futexes or the PI futexes. The wait-wake futexes have much higher
+ * throughput but can't guarantee minimal latency for high priority processes
+ * like the PI futexes. Even then, the overhead of wait-wake futexes in
+ * userspace locking primitives can easily become a performance bottleneck
+ * on systems with a large number of cores.
+ *
+ * The throughput-optimized (TO) futex is a new futex implementation that
+ * provides higher throughput than wait-wake futexes via the following
+ * two techniques:
+ *  1) Optimistic spinning when futex owner is actively running
+ *  2) Lock stealing
+ *
+ * Optimistic spinning isn't present in other futexes. Lock stealing is
+ * possible in wait-wake futexes, but isn't actively encouraged as it is
+ * in TO futexes. The downside, however, is a much higher variance in
+ * response times. Starvation shouldn't happen, but some tasks may take
+ * much longer to acquire a futex than others.
+ *
+ * The use of TO futexes is very similar to the PI futexes. Locking is done
+ * by atomically transitioning the futex from 0 to the task's thread ID.
+ * Unlocking is done by atomically changing the futex from thread ID to 0.
+ * Any failure to do so will require calls to the kernel to do the locking
+ * and unlocking. One difference is that an unlocker can optionally clear
+ * the task ID value from the futex while keeping the FUTEX_WAITERS bit
+ * before calling into the kernel. This will encourage faster lock transfer.
+ *
+ * Like PI futexes, TO futexes are orthogonal to robust futexes. However,
+ * TO futexes have similar dying-process handling to the PI futexes, so
+ * userspace management of locks for exiting lock holders should not be
+ * needed for TO futexes, except to avoid the rare case of pid wraparound.
+ *
+ * Unlike the other futexes, waiters of the TO futexes will not queue
+ * themselves to the plist of the hash bucket. Instead, they will queue up
+ * in the serialization mutex of the futex state container queued in the
+ * hash bucket.
+ */
+
+/*
+ * Looking up the futex state structure.
+ *
+ * It differs from lookup_pi_state() in that it searches the fs_list in the
+ * hash bucket instead of the waiters in the plist. The hb lock must be held
+ * before calling it. If the create flag is set, a new state structure will be
+ * pushed into the list and returned.
+ */
+static struct futex_state *
+lookup_futex_state(struct futex_hash_bucket *hb, union futex_key *key,
+		   bool create)
+{
+	struct futex_state *state;
+
+	list_for_each_entry(state, &hb->fs_list, hb_list)
+		if (match_futex(key, &state->key)) {
+			atomic_inc(&state->refcount);
+			return state;
+		}
+
+	if (!create)
+		return NULL;
+
+	/*
+	 * Push a new one into the list and return it.
+	 */
+	state = alloc_futex_state();
+	state->type = TYPE_TO;
+	state->key = *key;
+	list_add(&state->hb_list, &hb->fs_list);
+	WARN_ON(atomic_read(&state->refcount) != 1);
+
+	/*
+	 * Initialize the mutex structure.
+	 */
+	mutex_init(&state->mutex);
+	WARN_ON(!list_empty(&state->list));
+	return state;
+}
+
+/*
+ * Try to lock the userspace futex word (0 => vpid).
+ *
+ * Return: 1 if lock acquired or an error happens, 0 if not.
+ *	   The status code will be 0 if no error, or < 0 if an error happens.
+ *	   *puval will contain the latest futex value when trylock fails.
+ *
+ * The waiter flag, if set, will make it ignore the FUTEX_WAITERS bit.
+ * The HB spinlock should NOT be held while calling this function. A
+ * successful lock acquisition will clear the waiter and died bits.
+ */
+static inline int futex_trylock_to(u32 __user *uaddr, u32 vpid, u32 *puval,
+				   const bool waiter, int *status)
+{
+	u32 uval;
+
+	*status = 0;
+
+	if (unlikely(get_user(uval, uaddr)))
+		goto efault;
+
+	*puval = uval;
+
+	if (waiter ? (uval & FUTEX_TID_MASK) : uval)
+		return 0;	/* Trylock fails */
+
+	if (unlikely(futex_atomic_cmpxchg_inatomic(puval, uaddr, uval, vpid)))
+		goto efault;
+
+	return *puval == uval;
+
+efault:
+	*status = -EFAULT;
+	return 1;
+}
+
+/*
+ * Spinning threshold for futex word without setting FUTEX_WAITERS.
+ */
+#define FUTEX_SPIN_THRESHOLD	(1 << 10)
+
+/*
+ * Spin on the futex word while the futex owner is active. Otherwise, set
+ * the FUTEX_WAITERS bit and go to sleep. As we take a reference to the futex
+ * owner's task structure, we don't need to use RCU to ensure that the task
+ * structure is valid. The function will directly grab the lock if the
+ * owner is dying or the pid is invalid. That should take care of the problem
+ * of dead lock owners unless the pid wraps around and the perceived owner is
+ * not the real owner.
+ *
+ * Return: 0 if futex acquired, < 0 if an error happens.
+ */
+static int futex_spin_on_owner(u32 __user *uaddr, u32 vpid,
+			       struct futex_state *state)
+{
+	int ret, loop = FUTEX_SPIN_THRESHOLD;
+	u32 uval, curval;
+	u32 opid = 0;				/* Futex owner task ID */
+	struct task_struct *otask = NULL;	/* Futex owner task struct */
+	bool on_owner_list = false;
+
+	preempt_disable();
+	WRITE_ONCE(state->owner, current);
+	for (;; loop--) {
+		if (futex_trylock_to(uaddr, vpid, &uval, true, &ret))
+			break;
+		WARN_ON_ONCE(!(uval & FUTEX_TID_MASK));
+
+		if ((uval & FUTEX_TID_MASK) != opid) {
+			/*
+			 * Get the new task structure
+			 */
+			if (otask)
+				put_task_struct(otask);
+
+			WARN_ON_ONCE(on_owner_list);
+			opid  = uval & FUTEX_TID_MASK;
+			otask = futex_find_get_task(opid);
+		}
+		if (unlikely(!otask || (otask->flags & PF_EXITING) ||
+			    (uval & FUTEX_OWNER_DIED))) {
+			/*
+			 * PID invalid or exiting/dead task, try to grab
+			 * the lock now.
+			 */
+			ret = futex_atomic_cmpxchg_inatomic(&curval,
+					uaddr, uval, vpid);
+			if (unlikely(ret))
+				goto efault;
+			if (curval != uval)
+				continue;	/* Futex value changed */
+			pr_info("futex_spin_on_owner: pid %d grabs futex from pid %d (%s)!\n",
+				vpid, opid, otask ? "dying" : "invalid");
+			break;
+		}
+		while ((loop <= 0) && !(uval & FUTEX_WAITERS)) {
+			/*
+			 * Need to set the FUTEX_WAITERS bit.
+			 */
+			if (futex_atomic_cmpxchg_inatomic(&curval, uaddr, uval,
+							  uval | FUTEX_WAITERS))
+				goto efault;
+			if (curval == uval) {
+				uval |= FUTEX_WAITERS;
+				break;
+			}
+			uval = curval;
+		}
+		if (!(uval & ~FUTEX_TID_MASK))
+			continue;	/* Do trylock again */
+
+		if (need_resched()) {
+			__set_current_state(TASK_RUNNING);
+			schedule_preempt_disabled();
+			continue;
+		}
+
+		if (signal_pending(current)) {
+			ret = -EINTR;
+			break;
+		}
+
+		/*
+		 * If the owner isn't active, we need to go to sleep after
+		 * making sure that the FUTEX_WAITERS bit is set. We also
+		 * need to put the futex state into the futex owner's
+		 * pi_state_list to prevent deadlock when the owner dies.
+		 */
+		if (!otask->on_cpu) {
+			if (!(uval & FUTEX_WAITERS)) {
+				loop = 0;
+				continue;
+			}
+
+			/*
+			 * As there will be no lock stealing with the
+			 * FUTEX_WAITERS bit set, otask should remain the
+			 * same after that unless the owner dies.
+			 */
+			if (!on_owner_list) {
+				raw_spin_lock_irq(&otask->pi_lock);
+				if (unlikely(otask->flags & PF_EXITING)) {
+					/*
+					 * Task is exiting, can directly
+					 * grab the futex instead.
+					 */
+					raw_spin_unlock_irq(&otask->pi_lock);
+					continue;
+				}
+				WARN_ON(!list_empty(&state->list));
+				list_add(&state->list, &otask->pi_state_list);
+				raw_spin_unlock_irq(&otask->pi_lock);
+				on_owner_list = true;
+			}
+
+			/*
+			 * Do a trylock after setting the task state to make
+			 * sure we won't miss a wakeup.
+			 *
+			 * Futex owner		Mutex owner
+			 * -----------		-----------
+			 *  unlock		set state
+			 *  MB			MB
+			 *  read state		trylock
+			 *  wakeup		sleep
+			 */
+			set_current_state(TASK_INTERRUPTIBLE);
+			if (futex_trylock_to(uaddr, vpid, &uval, true, &ret)) {
+				__set_current_state(TASK_RUNNING);
+				break;
+			}
+
+			/*
+			 * Don't sleep if the owner has died.
+			 */
+			if (!(uval & FUTEX_OWNER_DIED))
+				schedule_preempt_disabled();
+			__set_current_state(TASK_RUNNING);
+			continue;
+		}
+
+		cpu_relax();
+	}
+out:
+	preempt_enable();
+	if (on_owner_list && !list_empty(&state->list)) {
+		BUG_ON(!otask);
+		raw_spin_lock_irq(&otask->pi_lock);
+		list_del_init(&state->list);
+		raw_spin_unlock_irq(&otask->pi_lock);
+	}
+	if (otask)
+		put_task_struct(otask);
+	WRITE_ONCE(state->owner, NULL);
+	return ret;
+
+efault:
+	ret = -EFAULT;
+	goto out;
+}
+
+/*
+ * Userspace tried a 0 -> TID atomic transition of the futex value
+ * and failed. The kernel side here does the whole locking operation.
+ * The kernel mutex will be used for serialization. Once becoming the
+ * sole mutex lock owner, it will spin on the futex owner's task structure
+ * to see if it is running. It will also spin on the futex word so as to grab
+ * the lock as soon as it is free.
+ */
+static int futex_lock_to(u32 __user *uaddr, unsigned int flags, ktime_t *time)
+{
+	struct hrtimer_sleeper timeout, *to = NULL;
+	struct futex_hash_bucket *hb;
+	union futex_key key = FUTEX_KEY_INIT;
+	struct futex_state *state;
+	u32 uval, vpid = task_pid_vnr(current);
+	int ret;
+
+	if (futex_trylock_to(uaddr, vpid, &uval, false, &ret))
+		/* Lock acquired or an error happens */
+		return ret;
+
+	/*
+	 * Detect deadlocks.
+	 */
+	if (unlikely(((uval & FUTEX_TID_MASK) == vpid) ||
+			should_fail_futex(true)))
+		return -EDEADLK;
+
+	if (refill_futex_state_cache())
+		return -ENOMEM;
+
+	futex_set_timer(time, &to, &timeout, flags, current->timer_slack_ns);
+
+	ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_WRITE);
+	if (unlikely(ret))
+		goto out;
+
+	hb = hash_futex(&key);
+	spin_lock(&hb->lock);
+
+	/*
+	 * Locate the futex state by looking up the futex state list in the
+	 * hash bucket. If it isn't found, create a new one and put it into
+	 * the list.
+	 */
+	state = lookup_futex_state(hb, &key, true);
+
+	/*
+	 * We don't need to hold the HB lock after looking up the futex state
+	 * as we have incremented the reference count.
+	 */
+	spin_unlock(&hb->lock);
+	BUG_ON(!state);
+
+	/*
+	 * Acquiring the serialization mutex.
+	 */
+	if (state->type != TYPE_TO) {
+		ret = -EINVAL;
+	} else {
+		if (to)
+			hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
+		ret = mutex_lock_interruptible(&state->mutex);
+	}
+
+	if (unlikely(ret))
+		/*
		 * We got a signal or some other error occurred; we need
		 * to abort the lock operation and return.
+		 */
+		goto out_put_state_key;
+
+	/*
+	 * As the mutex owner, we can now spin on the futex word as well as
+	 * the active-ness of the futex owner.
+	 */
+	ret = futex_spin_on_owner(uaddr, vpid, state);
+
+	mutex_unlock(&state->mutex);
+
+out_put_state_key:
+	spin_lock(&hb->lock);
+	put_futex_state(state);
+	spin_unlock(&hb->lock);
+	put_futex_key(&key);
+
+out:
+	if (to) {
+		hrtimer_cancel(&to->timer);
+		destroy_hrtimer_on_stack(&to->timer);
+	}
+	return ret;
+}
+
+/*
+ * Userspace attempted a TID -> 0 atomic transition, and failed.
+ * This is the in-kernel slowpath: we look up the futex state (if any),
+ * and wakeup the mutex owner.
+ *
+ * Return: 1 if a process was woken up, 0 if not, or < 0 when an error happens.
+ */
+static int futex_unlock_to(u32 __user *uaddr, unsigned int flags)
+{
+	u32 uval, pid, vpid = task_pid_vnr(current);
+	union futex_key key = FUTEX_KEY_INIT;
+	struct futex_hash_bucket *hb;
+	struct futex_state *state;
+	struct task_struct *owner;
+	int ret;
+
+	if (get_user(uval, uaddr))
+		return -EFAULT;
+	/*
+	 * If there is a new lock owner, we can exit now.
+	 * If uval is 0, another task may have acquired and released the
+	 * lock in the meantime, so we should also exit.
+	 */
+	pid = uval & FUTEX_TID_MASK;
+	if ((pid && (pid != vpid)) || !uval)
+		return 0;
+	WARN_ON_ONCE(!(uval & FUTEX_WAITERS));
+
+	/*
+	 * If the TID isn't cleared in the userspace, clear it here to
+	 * encourage faster lock transfer to the mutex owner.
+	 */
+	if (pid == vpid) {
+		futex_atomic_cmpxchg_inatomic(&uval, uaddr, uval,
+					       uval & ~FUTEX_TID_MASK);
+		WARN_ON((uval & FUTEX_TID_MASK) != vpid);
+	}
+	ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_WRITE);
+	if (ret)
+		return ret;
+
+	hb = hash_futex(&key);
+	spin_lock(&hb->lock);
+
+	/*
+	 * Check the hash bucket only for matching futex state.
+	 */
+	state = lookup_futex_state(hb, &key, false);
+
+	if (!state)
+		goto out_unlock;
+
+	if (state->type != TYPE_TO) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+
+	/*
+	 * We don't need to do the wakeup if the futex isn't equal to
+	 * FUTEX_WAITERS as it implies that someone else has taken over
+	 * the futex.
+	 */
+	if (get_user(uval, uaddr)) {
+		ret = -EFAULT;
+		goto out_unlock;
+	}
+
+	owner = READ_ONCE(state->owner);
+	if ((uval == FUTEX_WAITERS) && owner)
+		ret = wake_up_process(owner);
+
+out_unlock:
+	spin_unlock(&hb->lock);
+	put_futex_key(&key);
+	return ret;
+}
+
 long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 		u32 __user *uaddr2, u32 val2, u32 val3)
 {
@@ -3193,6 +3687,8 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 	case FUTEX_TRYLOCK_PI:
 	case FUTEX_WAIT_REQUEUE_PI:
 	case FUTEX_CMP_REQUEUE_PI:
+	case FUTEX_LOCK_TO:
+	case FUTEX_UNLOCK_TO:
 		if (!futex_cmpxchg_enabled)
 			return -ENOSYS;
 	}
@@ -3224,6 +3720,10 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 					     uaddr2);
 	case FUTEX_CMP_REQUEUE_PI:
 		return futex_requeue(uaddr, flags, uaddr2, val, val2, &val3, 1);
+	case FUTEX_LOCK_TO:
+		return futex_lock_to(uaddr, flags, timeout);
+	case FUTEX_UNLOCK_TO:
+		return futex_unlock_to(uaddr, flags);
 	}
 	return -ENOSYS;
 }
@@ -3307,6 +3807,7 @@ static int __init futex_init(void)
 	for (i = 0; i < futex_hashsize; i++) {
 		atomic_set(&futex_queues[i].waiters, 0);
 		plist_head_init(&futex_queues[i].chain);
+		INIT_LIST_HEAD(&futex_queues[i].fs_list);
 		spin_lock_init(&futex_queues[i].lock);
 	}
 
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC PATCH 4/4] futex, doc: TO futexes document
  2016-09-06 18:53 [RFC PATCH 0/4] futex: Introducing throughput-optimized futexes Waiman Long
                   ` (2 preceding siblings ...)
  2016-09-06 18:53 ` [RFC PATCH 3/4] futex: Throughput-optimized (TO) futexes Waiman Long
@ 2016-09-06 18:53 ` Waiman Long
  2016-09-22 10:40   ` Thomas Gleixner
  3 siblings, 1 reply; 12+ messages in thread
From: Waiman Long @ 2016-09-06 18:53 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Davidlohr Bueso, Scott J Norton,
	Douglas Hatch, Waiman Long

This patch adds a new document file on how to use the TO futexes.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 Documentation/00-INDEX     |    2 +
 Documentation/to-futex.txt |  124 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 126 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/to-futex.txt

diff --git a/Documentation/00-INDEX b/Documentation/00-INDEX
index cb9a6c6..a39f020 100644
--- a/Documentation/00-INDEX
+++ b/Documentation/00-INDEX
@@ -439,6 +439,8 @@ this_cpu_ops.txt
 	- List rationale behind and the way to use this_cpu operations.
 thermal/
 	- directory with information on managing thermal issues (CPU/temp)
+to-futex.txt
+	- Documentation on lightweight throughput-optimized futexes.
 trace/
 	- directory with info on tracing technologies within linux
 unaligned-memory-access.txt
diff --git a/Documentation/to-futex.txt b/Documentation/to-futex.txt
new file mode 100644
index 0000000..58a0637
--- /dev/null
+++ b/Documentation/to-futex.txt
@@ -0,0 +1,124 @@
+Started by: Waiman Long <waiman.long@hpe.com>
+
+Throughput-Optimized Futexes
+----------------------------
+
+There are two main problems for a wait-wake futex (FUTEX_WAIT and
+FUTEX_WAKE) when used for creating user-space locking primitives:
+
+ 1) With a wait-wake futex, tasks waiting for a lock are put to sleep
+    in the futex queue to be woken up by the lock owner when it is done
+    with the lock. Waking up a sleeping task, however, introduces some
+    additional latency which can be large especially if the critical
+    section protected by the lock is relatively short. This may cause
+    a performance bottleneck on large systems with many CPUs running
+    applications that need a lot of inter-thread synchronization.
+
+ 2) The performance of the wait-wake futex is currently
+    spinlock-constrained.  When many threads are contending for a
+    futex in a large system with many CPUs, it is not unusual to have
+    spinlock contention accounting for more than 90% of the total
+    CPU cycles consumed at various points in time.
+
+The throughput-optimized futex addresses both the wakeup latency
+and the spinlock contention problems by encouraging lock stealing
+and by optimistically spinning, within the kernel, on a locked futex
+while the lock owner is running, until the lock becomes free. This
+is the same optimistic spinning mechanism used by the kernel mutex
+and rw semaphore implementations to improve performance. The
+optimistic spinning is done without taking any spinlock.
+
+The downside of this improved throughput is the increased variance
+of the actual response times of the locking operations. Some locking
+operations will be very fast, while others may be considerably slower.
+The average response time should be better than that of wait-wake futexes.
+
+Implementation
+--------------
+
+Like the PI and robust futexes, a lock acquirer has to atomically
+put its thread ID (TID) into the lower 30 bits of the 32-bit futex
+word, which should have an initial value of 0. If it succeeds, it
+will be the owner of the futex. Otherwise, it has to call into the
+kernel using the new FUTEX_LOCK_TO futex(2) syscall.
+
+  futex(uaddr, FUTEX_LOCK_TO, 0, timeout, NULL, 0);
+
+Only the optional timeout parameter is being used by the new futex
+call.
+
+A kernel mutex is used for serialization.  The top lock waiter, which
+is the owner of the serialization mutex, will try to steal the lock
+when it is available and will delay setting the FUTEX_WAITERS bit to
+encourage more lock stealing in userspace. When it has waited long
+enough or is about to sleep, it then sets the FUTEX_WAITERS bit to
+make sure that it will get the lock next and will be woken up when
+the lock owner releases the lock.
+
+When it is time to unlock, the lock owner has to atomically change the
+futex value from its TID to 0. If that fails, it can optionally clear
+the TID portion of the futex value to encourage faster lock transfer.
+It also has to issue a FUTEX_UNLOCK_TO futex system call to wake up
+the top waiter, if it has slept.
+
+  futex(uaddr, FUTEX_UNLOCK_TO, 0, NULL, NULL, 0);
+
+A return value of 1 from the FUTEX_UNLOCK_TO futex(2) syscall
+indicates a task has been woken up. The syscall returns 0 if no
+sleeping task is woken.
+
+The error number returned by a FUTEX_UNLOCK_TO call on an empty futex
+can be used to decide if the TO futex functionality is implemented in
+the kernel. If it is present, no error will be returned.  Otherwise it
+will be ENOSYS.
+
+TO futexes require the kernel to have support for the cmpxchg
+functionality. For architectures that don't support cmpxchg, TO
+futexes will not be supported either.
+
+The TO futexes are orthogonal to the robust futexes and can be combined
+without problem.
+
+Usage Scenario
+--------------
+
+A TO futex can be used as an exclusive lock to guard a critical
+section that is unlikely to go to sleep. The waiters on a TO futex,
+however, will fall back to sleeping in a wait queue if the lock owner
+isn't running. Therefore, it can also be used when the critical
+section is long and prone to sleeping. However, it may not have a
+performance benefit over a wait-wake futex in this case.
+
+Sample Code
+-----------
+
+The following sample code implements simple lock and unlock
+functions.
+
+__thread int tid;	/* Thread ID */
+
+void mutex_lock(int *faddr)
+{
+	if (cmpxchg(faddr, 0, tid) == 0)
+		return;
+	for (;;)
+		if (futex(faddr, FUTEX_LOCK_TO, ...) == 0)
+			break;
+}
+
+void mutex_unlock(int *faddr)
+{
+	int old, fval;
+
+	if ((fval = cmpxchg(faddr, tid, 0)) == tid)
+		return;
+	/* Clear only the TID portion of the futex */
+	for (;;) {
+		old  = fval;
+		fval = cmpxchg(faddr, old, old & ~FUTEX_TID_MASK);
+		if (fval == old)
+			break;
+	}
+	if (fval & FUTEX_WAITERS)
+		futex(faddr, FUTEX_UNLOCK_TO, ...);
+}
-- 
1.7.1


* Re: [RFC PATCH 4/4] futex, doc: TO futexes document
  2016-09-06 18:53 ` [RFC PATCH 4/4] futex, doc: TO futexes document Waiman Long
@ 2016-09-22 10:40   ` Thomas Gleixner
  2016-09-22 13:06     ` Waiman Long
  0 siblings, 1 reply; 12+ messages in thread
From: Thomas Gleixner @ 2016-09-22 10:40 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Peter Zijlstra, Jonathan Corbet, linux-kernel,
	linux-doc, Davidlohr Bueso, Scott J Norton, Douglas Hatch

On Tue, 6 Sep 2016, Waiman Long wrote:
> This patch adds a new document file on how to use the TO futexes.

Documentation is nice, but the proper place for documenting this is the
futex(2) man page.
 
Thanks,

	tglx


* Re: [RFC PATCH 4/4] futex, doc: TO futexes document
  2016-09-22 10:40   ` Thomas Gleixner
@ 2016-09-22 13:06     ` Waiman Long
  2016-09-22 13:51       ` Thomas Gleixner
  0 siblings, 1 reply; 12+ messages in thread
From: Waiman Long @ 2016-09-22 13:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, Peter Zijlstra, Jonathan Corbet, linux-kernel,
	linux-doc, Davidlohr Bueso, Scott J Norton, Douglas Hatch

On 09/22/2016 06:40 AM, Thomas Gleixner wrote:
> On Tue, 6 Sep 2016, Waiman Long wrote:
>> This patch adds a new document file on how to use the TO futexes.
> Documentation is nice, but the proper place for documenting this is the
> futex(2) man page.
>
> Thanks,
>
> 	tglx

Yes, I will certainly send a patch to update the manpages after the new 
futex gets merged into the upstream kernel. Or should I do that in parallel?

Cheers,
Longman


* Re: [RFC PATCH 3/4] futex: Throughput-optimized (TO) futexes
  2016-09-06 18:53 ` [RFC PATCH 3/4] futex: Throughput-optimized (TO) futexes Waiman Long
@ 2016-09-22 13:32   ` Thomas Gleixner
  2016-09-23  2:44     ` Waiman Long
  0 siblings, 1 reply; 12+ messages in thread
From: Thomas Gleixner @ 2016-09-22 13:32 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Peter Zijlstra, Jonathan Corbet, linux-kernel,
	linux-doc, Davidlohr Bueso, Scott J Norton, Douglas Hatch

On Tue, 6 Sep 2016, Waiman Long wrote:
> +enum futex_type {
> +	TYPE_PI = 0,
> +	TYPE_TO,
> +};

Please introduce the futex_type magic and the related changes to the pi
code in a separate patch so it can be verified independently.

It's sad that one has to explain that to you over and over ....

> @@ -836,10 +859,10 @@ static void put_futex_state(struct futex_state *state)
>  		return;
>  
>  	/*
> -	 * If state->owner is NULL, the owner is most probably dying
> -	 * and has cleaned up the futex state already
> +	 * If state->owner is NULL and the type is TYPE_PI, the owner
> +	 * is most probably dying and has cleaned up the state already
>  	 */
> -	if (state->owner) {
> +	if (state->owner && (state->type == TYPE_PI)) {
>  		raw_spin_lock_irq(&state->owner->pi_lock);
>  		list_del_init(&state->list);
>  		raw_spin_unlock_irq(&state->owner->pi_lock);
> @@ -847,6 +870,11 @@ static void put_futex_state(struct futex_state *state)
>  		rt_mutex_proxy_unlock(&state->pi_mutex, state->owner);
>  	}
>  
> +	/*
> +	 * Dequeue it from the HB futex state list.
> +	 */
> +	list_del_init(&state->hb_list);

The comment above this list_del() is really pointless. I can see that from
the code itself.

Aside from that: Why do you need separate list heads? You explain the
separate list somewhere in that big comment below, but it should be
explained at the point where you add it to the state and the hash bucket.

>  	if (current->pi_state_cache)
>  		kfree(state);
>  	else {
> @@ -919,13 +947,24 @@ void exit_pi_state_list(struct task_struct *curr)
>  			continue;
>  		}
>  
> -		WARN_ON(pi_state->owner != curr);
>  		WARN_ON(list_empty(&pi_state->list));
> +		if (pi_state->type == TYPE_PI) {
> +			WARN_ON(pi_state->owner != curr);
> +			pi_state->owner = NULL;
> +		}
>  		list_del_init(&pi_state->list);
> -		pi_state->owner = NULL;
>  		raw_spin_unlock_irq(&curr->pi_lock);
>  
> -		rt_mutex_unlock(&pi_state->pi_mutex);
> +		if (pi_state->type == TYPE_PI)

lacks curly braces

> +			rt_mutex_unlock(&pi_state->pi_mutex);
> +		else if (pi_state->type == TYPE_TO) {
> +			/*
> +			 * Need to wakeup the mutex owner.
> +			 */

Another completely useless comment. Because you tell what you do, but not
WHY.

> +			WARN_ON(!pi_state->owner);
> +			if (pi_state->owner)
> +				wake_up_process(pi_state->owner);

And what handles or sanity checks the state->hb_list ???

> +/*
> + * Try to lock the userspace futex word (0 => vpid).
> + *
> + * Return: 1 if lock acquired or an error happens, 0 if not.
> + *	   The status code will be 0 if no error, or < 0 if an error happens.
> + *	   *puval will contain the latest futex value when trylock fails.
> + *
> + * The waiter flag, if set, will make it ignore the FUTEX_WAITERS bit.
> + * The HB spinlock should NOT be held while calling this function. A
> + * successful lock acquisition will clear the waiter and died bits.
> + */
> +static inline int futex_trylock_to(u32 __user *uaddr, u32 vpid, u32 *puval,
> +				   const bool waiter, int *status)
> +{
> +	u32 uval;
> +
> +	*status = 0;
> +
> +	if (unlikely(get_user(uval, uaddr)))
> +		goto efault;
> +
> +	*puval = uval;
> +
> +	if (waiter ? (uval & FUTEX_TID_MASK) : uval)
> +		return 0;	/* Trylock fails */

Please do not use tail comments. They are hard to parse. 

> +
> +	if (unlikely(futex_atomic_cmpxchg_inatomic(puval, uaddr, uval, vpid)))
> +		goto efault;
> +
> +	return *puval == uval;
> +
> +efault:
> +	*status = -EFAULT;
> +	return 1;
> +}

Do we really need another variant of cmpxchg and why do you need that extra
status? What's wrong in doing the magic in the return value?

This is utter crap, really.

> +
> +/*
> + * Spinning threshold for futex word without setting FUTEX_WAITERS.
> + */
> +#define FUTEX_SPIN_THRESHOLD	(1 << 10)

That number is pulled out of thin air or what is the rationale for choosing
1024?

> +/*
> + * Spin on the futex word while the futex owner is active. Otherwise, set
> + * the FUTEX_WAITERS bit and go to sleep. As we take a reference to the futex
> + * owner's task structure, we don't need to use RCU to ensure that the task
> + * structure is valid. The function will directly grab the lock if the
> + * owner is dying or the pid is invalid. That should take care of the problem
> + * of dead lock owners unless the pid wraps around and the perceived owner is
> + * not the real owner.
> + *
> + * Return: 0 if futex acquired, < 0 if an error happens.

If you document functions then please follow the style of this file which
uses kerneldoc comments.

> + */
> +static int futex_spin_on_owner(u32 __user *uaddr, u32 vpid,
> +			       struct futex_state *state)
> +{
> +	int ret, loop = FUTEX_SPIN_THRESHOLD;
> +	u32 uval, curval;
> +	u32 opid = 0;				/* Futex owner task ID */
> +	struct task_struct *otask = NULL;	/* Futex owner task struct */

Please use understandable variable names instead of this horrible tail
comments. What's wrong with using owner and owner_pid?

> +	bool on_owner_list = false;
> +
> +	preempt_disable();
> +	WRITE_ONCE(state->owner, current);
> +	for (;; loop--) {
> +		if (futex_trylock_to(uaddr, vpid, &uval, true, &ret))
> +			break;

And here you are. What the hell is wrong with:

    	     	ret = futex_trylock(uaddr, vpid, &uval, true);
		if (ret < 0)
			break;

Nothing is wrong. It's just way simpler to read and understand....

> +		WARN_ON_ONCE(!(uval & FUTEX_TID_MASK));
> +
> +		if ((uval & FUTEX_TID_MASK) != opid) {
> +			/*
> +			 * Get the new task structure
> +			 */
> +			if (otask)
> +				put_task_struct(otask);
> +
> +			WARN_ON_ONCE(on_owner_list);
> +			opid  = uval & FUTEX_TID_MASK;
> +			otask = futex_find_get_task(opid);
> +		}

Can you pretty please use split out functions for this otask handling? This
for loop does not fit on a single screen and can't be understood in one go.

> +		if (unlikely(!otask || (otask->flags & PF_EXITING) ||
> +			    (uval & FUTEX_OWNER_DIED))) {
> +			/*
> +			 * PID invalid or exiting/dead task, try to grab
> +			 * the lock now.
> +			 */
> +			ret = futex_atomic_cmpxchg_inatomic(&curval,
> +					uaddr, uval, vpid);
> +			if (unlikely(ret))
> +				goto efault;
> +			if (curval != uval)
> +				continue;	/* Futex value changed */

This code flow is completely non parseable. Stop this and put proper
comments above decisions which explain why and not what.

> +			pr_info("futex_spin_on_owner: pid %d grabs futex from pid %d (%s)!\n",
> +				vpid, opid, otask ? "dying" : "invalid");

Really useful information to confuse sysadmins. A proper comment explaining
the issue and the implications and how we deal with it would have been the
right thing to do...

> +			break;
> +		}

> +		while ((loop <= 0) && !(uval & FUTEX_WAITERS)) {

Groan. This is more than horrible to read.

       	        if (loop <= 0) {
		        if (futex_set_waiters_bit(....))
			   	goto fault;
		}

If at all. This loop <= 0 thing is utterly confusing.

> +			/*
> +			 * Need to set the FUTEX_WAITERS bit.
> +			 */
> +			if (futex_atomic_cmpxchg_inatomic(&curval, uaddr, uval,
> +							  uval | FUTEX_WAITERS))
> +				goto efault;
> +			if (curval == uval) {
> +				uval |= FUTEX_WAITERS;
> +				break;

I had to look five times to figure out to which loop this break belongs. I
really wonder how you manage to keep track of this mess.

> +			}
> +			uval = curval;
> +		}
> +
> +		if (!(uval & ~FUTEX_TID_MASK))
> +			continue;	/* Do trylock again */

Gah. I can see that you go over to start and do trylock, but WHY ? I know
it, but for the casual reader it's fcking non obvious.

> +
> +		if (need_resched()) {
> +			__set_current_state(TASK_RUNNING);
> +			schedule_preempt_disabled();
> +			continue;
> +		}
> +
> +		if (signal_pending(current)) {
> +			ret = -EINTR;
> +			break;
> +		}
> +
> +		/*
> +		 * If the owner isn't active, we need to go to sleep after
> +		 * making sure that the FUTEX_WAITERS bit is set. We also
> +		 * need to put the futex state into the futex owner's
> +		 * pi_state_list to prevent deadlock when the owner dies.
> +		 */
> +		if (!otask->on_cpu) {
> +			if (!(uval & FUTEX_WAITERS)) {
> +				loop = 0;
> +				continue;

This is completely fucked, really.

     		if (owner->on_cpu) {
		   	cpu_relax();
			continue;
		}	

		ret = futex_sleep();
		if (ret < 0)
		   	goto fault;
		if (ret == FUTEX_ACQUIRED)
		   	break;

Your attempt to reuse code which is in the loop above is just making this
completely unreadable. Hell no! Futexes are complex enough already. We
really can do without obfuscation. This not the obfuscated C-code contest.

> +	if (futex_trylock_to(uaddr, vpid, &uval, false, &ret))
> +		/* Lock acquired or an error happens */
> +		return ret;

This lacks curly braces. See: https://marc.info/?l=linux-kernel&m=147351236615103

> +
> +	/*
> +	 * Detect deadlocks.
> +	 */
> +	if (unlikely(((uval & FUTEX_TID_MASK) == vpid) ||
> +			should_fail_futex(true)))
> +		return -EDEADLK;
> +
> +	if (refill_futex_state_cache())
> +		return -ENOMEM;
> +
> +	futex_set_timer(time, &to, &timeout, flags, current->timer_slack_ns);
> +
> +	ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_WRITE);
> +	if (unlikely(ret))
> +		goto out;
> +
> +	hb = hash_futex(&key);
> +	spin_lock(&hb->lock);

Why are you using the global hash for this? As we have shown the global
hash has horrible performance due to cross node access and potential hash
bucket lock contention of unrelated processes. If we add something new
which is performance optimized then we pretty please make it use a separate
process private storage. I don't see any point in making this optimized for
process shared futexes.

> +
> +	/*
> +	 * Locate the futex state by looking up the futex state list in the
> +	 * hash bucket. If it isn't found, create a new one and put it into
> +	 * the list.
> +	 */
> +	state = lookup_futex_state(hb, &key, true);
> +
> +	/*
> +	 * We don't need to hold the HB lock after looking up the futex state
> +	 * as we have incremented the reference count.
> +	 */
> +	spin_unlock(&hb->lock);
> +	BUG_ON(!state);
> +
> +	/*
> +	 * Acquiring the serialization mutex.
> +	 */
> +	if (state->type != TYPE_TO) {
> +		ret = -EINVAL;
> +	} else {
> +		if (to)
> +			hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
> +		ret = mutex_lock_interruptible(&state->mutex);
> +	}
> +
> +	if (unlikely(ret))
> +		/*
> +		 * We got a signal or some other error occurred; we need
> +		 * to abort the lock operation and return.
> +		 */
> +		goto out_put_state_key;
> +
> +	/*
> +	 * As the mutex owner, we can now spin on the futex word as well as
> +	 * the active-ness of the futex owner.
> +	 */
> +	ret = futex_spin_on_owner(uaddr, vpid, state);

So if futex_spin_on_owner() faults, then you return -EFAULT to user space
w/o trying to page in the stuff? That's just wrong.

> +static int futex_unlock_to(u32 __user *uaddr, unsigned int flags)
> +{
> +	u32 uval, pid, vpid = task_pid_vnr(current);
> +	union futex_key key = FUTEX_KEY_INIT;
> +	struct futex_hash_bucket *hb;
> +	struct futex_state *state;
> +	struct task_struct *owner;
> +	int ret;
> +
> +	if (get_user(uval, uaddr))
> +		return -EFAULT;
> +	/*
> +	 * If there is a new lock owner, we can exit now.
> +	 * If uval is 0, another task may have acquired and released the
> +	 * lock in the meantime, so we should also exit.

If there is a new lock owner or the lock has been released? Voodoo magic or
what? How can user space take over the lock when the current task owns it?

That's just broken. The only reason to get here is that user space called
in because it was not able to release the lock atomically, i.e. because the
waiters bit was set. If anything fiddled with the uval then returning 0 is
beyond stupid.

> +	 */
> +	pid = uval & FUTEX_TID_MASK;
> +	if ((pid && (pid != vpid)) || !uval)
> +		return 0;
> +	WARN_ON_ONCE(!(uval & FUTEX_WAITERS));

Crap. It's legit to call here even if the waiters bit is not set.

> +	/*
> +	 * If the TID isn't cleared in the userspace, clear it here to
> +	 * encourage faster lock transfer to the mutex owner.

What do you encourage here? How is the mutex transfer accelerated?

> +	 */
> +	if (pid == vpid) {
> +		futex_atomic_cmpxchg_inatomic(&uval, uaddr, uval,
> +					       uval & ~FUTEX_TID_MASK);
> +		WARN_ON((uval & FUTEX_TID_MASK) != vpid);
> +	}
> +	ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_WRITE);
> +	if (ret)
> +		return ret;
> +
> +	hb = hash_futex(&key);
> +	spin_lock(&hb->lock);
> +
> +	/*
> +	 * Check the hash bucket only for matching futex state.
> +	 */
> +	state = lookup_futex_state(hb, &key, false);
> +
> +	if (!state)
> +		goto out_unlock;
> +
> +	if (state->type != TYPE_TO) {
> +		ret = -EINVAL;
> +		goto out_unlock;
> +	}
> +
> +	/*
> +	 * We don't need to do the wakeup if the futex isn't equal to
> +	 * FUTEX_WAITERS as it implies that someone else has taken over
> +	 * the futex.
> +	 */
> +	if (get_user(uval, uaddr)) {
> +		ret = -EFAULT;
> +		goto out_unlock;
> +	}
> +
> +	owner = READ_ONCE(state->owner);
> +	if ((uval == FUTEX_WAITERS) && owner)
> +		ret = wake_up_process(owner);

What guarantees that owner cannot exit between reading owner state and
calling wake_up_process?

Dammit. If you read the source in this file carefully, then you notice that
it contains a lot of documentation about life time rules, protection and
serialization.

If you think that it's the reviewer's problem to figure that out from your
code w/o proper comments in place, then you are completely on the wrong track.

I'm not even trying to think about the concept itself as long as this is
presented as a pile of impenetrable clusterfuck.

Thanks,

	tglx


* Re: [RFC PATCH 4/4] futex, doc: TO futexes document
  2016-09-22 13:06     ` Waiman Long
@ 2016-09-22 13:51       ` Thomas Gleixner
  2016-09-22 19:14         ` Waiman Long
  0 siblings, 1 reply; 12+ messages in thread
From: Thomas Gleixner @ 2016-09-22 13:51 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Peter Zijlstra, Jonathan Corbet, linux-kernel,
	linux-doc, Davidlohr Bueso, Scott J Norton, Douglas Hatch

On Thu, 22 Sep 2016, Waiman Long wrote:
> On 09/22/2016 06:40 AM, Thomas Gleixner wrote:
> > On Tue, 6 Sep 2016, Waiman Long wrote:
> > > This patch adds a new document file on how to use the TO futexes.
> > Documentation is nice, but the proper place for documenting this is the
> > futex(2) man page.
> > 
> > Thanks,
> > 
> > 	tglx
> 
> Yes, I will certainly send a patch to update the manpages after the new futex
> gets merged into the upstream kernel. Or should I do that in parallel?

It won't get merged w/o a patch to the manpage which is accepted by the man
page maintainer.

Thanks,

	tglx


* Re: [RFC PATCH 4/4] futex, doc: TO futexes document
  2016-09-22 13:51       ` Thomas Gleixner
@ 2016-09-22 19:14         ` Waiman Long
  0 siblings, 0 replies; 12+ messages in thread
From: Waiman Long @ 2016-09-22 19:14 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, Peter Zijlstra, Jonathan Corbet, linux-kernel,
	linux-doc, Davidlohr Bueso, Scott J Norton, Douglas Hatch

On 09/22/2016 09:51 AM, Thomas Gleixner wrote:
> On Thu, 22 Sep 2016, Waiman Long wrote:
>> Yes, I will certainly send patch to update the manpages after the new futex
>> gets merged into the upstream kernel. Or should I do that in parallel?
> It won't get merged w/o a patch to the manpage which is accepted by the man
> page maintainer.

Sure. I will send out a manpage patch when I send out the next revision 
of this patch.

Thanks,
Longman


* Re: [RFC PATCH 3/4] futex: Throughput-optimized (TO) futexes
  2016-09-22 13:32   ` Thomas Gleixner
@ 2016-09-23  2:44     ` Waiman Long
  2016-09-23 10:16       ` Thomas Gleixner
  0 siblings, 1 reply; 12+ messages in thread
From: Waiman Long @ 2016-09-23  2:44 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, Peter Zijlstra, Jonathan Corbet, linux-kernel,
	linux-doc, Davidlohr Bueso, Scott J Norton, Douglas Hatch

On 09/22/2016 09:32 AM, Thomas Gleixner wrote:
> On Tue, 6 Sep 2016, Waiman Long wrote:
>> +enum futex_type {
>> +	TYPE_PI = 0,
>> +	TYPE_TO,
>> +};
> Please introduce the futex_type magic and the related changes to the pi
> code in a separate patch so it can be verified independently.
>
> It's sad that one has to explain that to you over and over ....

I didn't break it out because the changes to the PI code were pretty 
small. I will break it out in the next version.

>> @@ -836,10 +859,10 @@ static void put_futex_state(struct futex_state *state)
>>   		return;
>>
>>   	/*
>> -	 * If state->owner is NULL, the owner is most probably dying
>> -	 * and has cleaned up the futex state already
>> +	 * If state->owner is NULL and the type is TYPE_PI, the owner
>> +	 * is most probably dying and has cleaned up the state already
>>   	 */
>> -	if (state->owner) {
>> +	if (state->owner && (state->type == TYPE_PI)) {
>>   		raw_spin_lock_irq(&state->owner->pi_lock);
>>   		list_del_init(&state->list);
>>   		raw_spin_unlock_irq(&state->owner->pi_lock);
>> @@ -847,6 +870,11 @@ static void put_futex_state(struct futex_state *state)
>>   		rt_mutex_proxy_unlock(&state->pi_mutex, state->owner);
>>   	}
>>
>> +	/*
>> +	 * Dequeue it from the HB futex state list.
>> +	 */
>> +	list_del_init(&state->hb_list);
> The comment above this list_del() is really pointless. I can see that from
> the code itself.
>
> Aside from that: Why do you need separate list heads? You explain the
> separate list somewhere in that big comment below, but it should be
> explained at the point where you add it to the state and the hash bucket.

Sure. Will fix the comment.

>>   	if (current->pi_state_cache)
>>   		kfree(state);
>>   	else {
>> @@ -919,13 +947,24 @@ void exit_pi_state_list(struct task_struct *curr)
>>   			continue;
>>   		}
>>
>> -		WARN_ON(pi_state->owner != curr);
>>   		WARN_ON(list_empty(&pi_state->list));
>> +		if (pi_state->type == TYPE_PI) {
>> +			WARN_ON(pi_state->owner != curr);
>> +			pi_state->owner = NULL;
>> +		}
>>   		list_del_init(&pi_state->list);
>> -		pi_state->owner = NULL;
>>   		raw_spin_unlock_irq(&curr->pi_lock);
>>
>> -		rt_mutex_unlock(&pi_state->pi_mutex);
>> +		if (pi_state->type == TYPE_PI)
> lacks curly braces

Yes, you are right.

>> +			rt_mutex_unlock(&pi_state->pi_mutex);
>> +		else if (pi_state->type == TYPE_TO) {
>> +			/*
>> +			 * Need to wakeup the mutex owner.
>> +			 */
> Another completely useless comment. Because you tell what you do, but not
> WHY.

Will elaborate on why the wakeup here.

>> +			WARN_ON(!pi_state->owner);
>> +			if (pi_state->owner)
>> +				wake_up_process(pi_state->owner);
> And what handles or sanity checks the state->hb_list ???
>

The exit_pi_state_list() function doesn't need to deal with 
state->hb_list. The hb_list is only used to locate the futex state, and 
the futex owner doesn't hold a reference to the futex state. So it 
doesn't need to decrement the reference count or potentially free the 
state.

>> +/*
>> + * Try to lock the userspace futex word (0 =>  vpid).
>> + *
>> + * Return: 1 if lock acquired or an error happens, 0 if not.
>> + *	   The status code will be 0 if no error, or < 0 if an error happens.
>> + *	   *puval will contain the latest futex value when trylock fails.
>> + *
>> + * The waiter flag, if set, will make it ignore the FUTEX_WAITERS bit.
>> + * The HB spinlock should NOT be held while calling this function. A
>> + * successful lock acquisition will clear the waiter and died bits.
>> + */
>> +static inline int futex_trylock_to(u32 __user *uaddr, u32 vpid, u32 *puval,
>> +				   const bool waiter, int *status)
>> +{
>> +	u32 uval;
>> +
>> +	*status = 0;
>> +
>> +	if (unlikely(get_user(uval, uaddr)))
>> +		goto efault;
>> +
>> +	*puval = uval;
>> +
>> +	if (waiter ? (uval & FUTEX_TID_MASK) : uval)
>> +		return 0;	/* Trylock fails */
> Please do not use tail comments. They are hard to parse.
>

OK, will move the comment up.

>> +
>> +	if (unlikely(futex_atomic_cmpxchg_inatomic(puval, uaddr, uval, vpid)))
>> +		goto efault;
>> +
>> +	return *puval == uval;
>> +
>> +efault:
>> +	*status = -EFAULT;
>> +	return 1;
>> +}
> Do we really need another variant of cmpxchg and why do you need that extra
> status? What's wrong in doing the magic in the return value?

This is not another variant of cmpxchg. It is the same cmpxchg used by 
cmpxchg_futex_value_locked(); the only difference is that the locked 
version disables page faults. I call futex_atomic_cmpxchg_inatomic() 
directly because it is called without the HB spinlock held, so I don't 
need to disable page faults. I will add a separate patch to introduce a 
helper function cmpxchg_futex_value_unlocked() to indicate that the 
cmpxchg is done without the lock held.

I will remove the extra status parameter.
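For illustration, the trylock semantics described in the quoted comment
(futex word 0 when free, owner TID when held, flag bits on top) can be
modeled in userspace with C11 atomics. This is only a sketch of the
protocol under discussion, not the kernel code; the function name
to_futex_trylock() and the demo values are made up here:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit layout mirroring the futex uapi: owner TID in the low bits,
 * flag bits on top. */
#define FUTEX_WAITERS    0x80000000u
#define FUTEX_OWNER_DIED 0x40000000u
#define FUTEX_TID_MASK   0x3fffffffu

/* Userspace model of the trylock: grab the word by storing our TID.
 * A plain caller needs the whole word to be 0; a waiter only needs the
 * TID field to be 0, and a successful cmpxchg then implicitly clears
 * the FUTEX_WAITERS/FUTEX_OWNER_DIED bits by writing the bare TID.
 * Returns 1 if the lock was acquired, 0 otherwise. */
static inline int to_futex_trylock(_Atomic uint32_t *word, uint32_t vpid,
                                   bool waiter)
{
    uint32_t uval = atomic_load(word);

    /* Non-waiters give up if any bit is set; waiters only require the
     * owner TID field to be clear. */
    if (waiter ? (uval & FUTEX_TID_MASK) : uval)
        return 0;

    /* Succeeds only if the word still holds the value we just read. */
    return atomic_compare_exchange_strong(word, &uval, vpid);
}
```

Note how the waiter variant "steals" a word that is free but still has
FUTEX_WAITERS set, clearing the flag as a side effect of the store.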

>> +
>> +/*
>> + * Spinning threshold for futex word without setting FUTEX_WAITERS.
>> + */
>> +#define FUTEX_SPIN_THRESHOLD	(1 << 10)
> That number is pulled out of thin air or what is the rationale for chosing
> 1024?

It is kind of arbitrary. I want a value large enough to encourage lock 
stealing, while not so large that it may take too long to get the lock. 
I will elaborate more on the choice in the comment.


>> +/*
>> + * Spin on the futex word while the futex owner is active. Otherwise, set
>> + * the FUTEX_WAITERS bit and go to sleep. As we take a reference to the futex
>> + * owner's task structure, we don't need to use RCU to ensure that the task
>> + * structure is valid. The function will directly grab the lock if the
>> + * owner is dying or the pid is invalid. That should take care of the problem
>> + * of dead lock owners unless the pid wraps around and the perceived owner is
>> + * not the real owner.
>> + *
>> + * Return: 0 if futex acquired, < 0 if an error happens.
> If you document functions then please follow the style of this file which
> uses kerneldoc comments.
>

Sorry, I forgot to use the kerneldoc notation. It is an RFC patch, and 
my focus was on making it work. I haven't spent much time cleaning up 
the comments. I will do that in the next version.

>> + */
>> +static int futex_spin_on_owner(u32 __user *uaddr, u32 vpid,
>> +			       struct futex_state *state)
>> +{
>> +	int ret, loop = FUTEX_SPIN_THRESHOLD;
>> +	u32 uval, curval;
>> +	u32 opid = 0;				/* Futex owner task ID */
>> +	struct task_struct *otask = NULL;	/* Futex owner task struct */
> Please use understandable variable names instead of this horrible tail
> comments. What's wrong with using owner and owner_pid?

Yes, I will use more descriptive variable name.

>> +	bool on_owner_list = false;
>> +
>> +	preempt_disable();
>> +	WRITE_ONCE(state->owner, current);
>> +	for (;; loop--) {
>> +		if (futex_trylock_to(uaddr, vpid, &uval, true, &ret))
>> +			break;
> And here you are. What the hell is wrong with:
>
>		ret = futex_trylock(uaddr, vpid, &uval, true);
> 		if (ret < 0)
> 			break;
>
> Nothing is wrong. It's just way simpler to read and understand....

Will do that.

>> +		WARN_ON_ONCE(!(uval & FUTEX_TID_MASK));
>> +
>> +		if ((uval & FUTEX_TID_MASK) != opid) {
>> +			/*
>> +			 * Get the new task structure
>> +			 */
>> +			if (otask)
>> +				put_task_struct(otask);
>> +
>> +			WARN_ON_ONCE(on_owner_list);
>> +			opid  = uval & FUTEX_TID_MASK;
>> +			otask = futex_find_get_task(opid);
>> +		}
> Can you pretty please use split out functions for this otask handling? This
> for loop does not fit on a single screen and can't be understood in one go.

Yes, this is the largest function in the new code. I will try to add 
helper functions to make it easier to understand.

>> +		if (unlikely(!otask || (otask->flags & PF_EXITING) ||
>> +			    (uval & FUTEX_OWNER_DIED))) {
>> +			/*
>> +			 * PID invalid or exiting/dead task, try to grab
>> +			 * the lock now.
>> +			 */
>> +			ret = futex_atomic_cmpxchg_inatomic(&curval,
>> +					uaddr, uval, vpid);
>> +			if (unlikely(ret))
>> +				goto efault;
>> +			if (curval != uval)
>> +				continue;	/* Futex value changed */
> This code flow is completely non parseable. Stop this and put proper
> comments above decisions which explain why and not what.

I will spend more time cleaning up the comment and streamline the code.

>
>> +			pr_info("futex_spin_on_owner: pid %d grabs futex from pid %d (%s)!\n",
>> +				vpid, opid, otask ? "dying" : "invalid");
> Really useful information to confuse sysadmins. A proper comment explaining
> the issue and the implications and how we deal with it would have been the
> right thing to do...

Yes, the message may be too cryptic. I will make it more clear in the 
next version.

>> +			break;
>> +		}
>> +		while ((loop <= 0) && !(uval & FUTEX_WAITERS)) {
> Groan. This is more than horrible to read.
>
>		if (loop <= 0) {
> 		        if (futex_set_waiters_bit(....))
> 			   	goto fault;
> 		}
>
> If at all. This loop <= 0 thing is utterly confusing.
>

I will improve this code.

>> +			/*
>> +			 * Need to set the FUTEX_WAITERS bit.
>> +			 */
>> +			if (futex_atomic_cmpxchg_inatomic(&curval, uaddr, uval,
>> +							  uval | FUTEX_WAITERS))
>> +				goto efault;
>> +			if (curval == uval) {
>> +				uval |= FUTEX_WAITERS;
>> +				break;
> I had to look five times to figure out to which loop this break belongs. I
> really wonder how you manage to keep track of this mess.

Because I wrote it :-)

In the v2 patch, the FUTEX_WAITERS bit setting was put into a helper 
function which should clarify the code. I will add a few more to 
simplify the main loop.
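A helper of the kind referred to here might look like the following
userspace sketch. The name futex_set_waiters_bit is taken from the
reviewer's pseudocode above; the body is an assumption about what such
a helper would do, modeled with C11 atomics rather than the kernel's
cmpxchg primitives:

```c
#include <stdatomic.h>
#include <stdint.h>

#define FUTEX_WAITERS  0x80000000u

/* Set FUTEX_WAITERS with a cmpxchg loop so a concurrent owner change is
 * retried rather than overwritten. Returns the word's value with the
 * bit set. */
static uint32_t futex_set_waiters_bit(_Atomic uint32_t *word)
{
    uint32_t uval = atomic_load(word);

    while (!(uval & FUTEX_WAITERS)) {
        /* On failure the cmpxchg reloads 'uval' with the current
         * value, so the loop simply retries against fresh state. */
        if (atomic_compare_exchange_weak(word, &uval, uval | FUTEX_WAITERS))
            return uval | FUTEX_WAITERS;
    }
    return uval;
}
```

Factoring this out of the main loop removes one of the nested-loop
levels criticized in the review.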

>> +			}
>> +			uval = curval;
>> +		}
>> +
>> +		if (!(uval & ~FUTEX_TID_MASK))
>> +			continue;	/* Do trylock again */
> Gah. I can see that you go over to start and do trylock, but WHY ? I know
> it, but for the casual reader it's fcking non obvious.
>
>> +
>> +		if (need_resched()) {
>> +			__set_current_state(TASK_RUNNING);
>> +			schedule_preempt_disabled();
>> +			continue;
>> +		}
>> +
>> +		if (signal_pending(current)) {
>> +			ret = -EINTR;
>> +			break;
>> +		}
>> +
>> +		/*
>> +		 * If the owner isn't active, we need to go to sleep after
>> +		 * making sure that the FUTEX_WAITERS bit is set. We also
>> +		 * need to put the futex state into the futex owner's
>> +		 * pi_state_list to prevent deadlock when the owner dies.
>> +		 */
>> +		if (!otask->on_cpu) {
>> +			if (!(uval & FUTEX_WAITERS)) {
>> +				loop = 0;
>> +				continue;
> This is completely fucked, really.
>
>       		if (owner->on_cpu) {
> 		   	cpu_relax();
> 			continue;
> 		}	
>
> 		ret = futex_sleep();
> 		if (ret < 0)
> 		   	goto fault;
> 		if (ret == FUTEX_AQUIRED)
> 		   	break;
>
> Your attempt to resue code which is in the loop above is just making this
> completely unreadable. Hell no! Futexes are complex enough already. We
> really can do without obfuscation. This not the obfuscated C-code contest.

As said above, I will add more helper functions and streamline the code 
to make it more readable.

>> +	if (futex_trylock_to(uaddr, vpid, &uval, false, &ret))
>> +		/* Lock acquired or an error happens */
>> +		return ret;
> This lacks curly braces. See: https://marc.info/?l=linux-kernel&m=147351236615103

Got it.

>> +
>> +	/*
>> +	 * Detect deadlocks.
>> +	 */
>> +	if (unlikely(((uval & FUTEX_TID_MASK) == vpid) ||
>> +			should_fail_futex(true)))
>> +		return -EDEADLK;
>> +
>> +	if (refill_futex_state_cache())
>> +		return -ENOMEM;
>> +
>> +	futex_set_timer(time, &to, &timeout, flags, current->timer_slack_ns);
>> +
>> +	ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_WRITE);
>> +	if (unlikely(ret))
>> +		goto out;
>> +
>> +	hb = hash_futex(&key);
>> +	spin_lock(&hb->lock);
> Why are you using the global hash for this? As we have shown the global
> hash has horrible performance due to cross node access and potential hash
> bucket lock contention of unrelated processes. If we add something new
> which is performance optimized then we pretty please make it use a seperate
> process private storage. I don't see any point in making this optimized for
> process shared futexes.

I used the hash bucket lock for managing the futex state objects only; 
I don't need it for any other purpose. It is true that a hash collision 
with a wait-wake futex using the same bucket may cause performance 
problems. I think I will add another spinlock to the hash bucket just 
for TO futexes. That way, a collision with a wait-wake futex won't 
cause unexpected spinlock contention, though a bit of extra hash bucket 
cacheline contention may still happen.

BTW, running my microbenchmark with wait-wake futexes, over 90% of the 
CPU time was spent on spinlock contention:

   96.23%  futex_test  [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
       - queued_spin_lock_slowpath
          - _raw_spin_lock
             + 51.81% futex_wake
             + 48.08% futex_wait_setup

For the TO futexes, the perf profile was:

   97.33%   0.97%  futex_test  [kernel.kallsyms]  [k] futex_lock_to
   92.45%  92.45%  futex_test  [kernel.kallsyms]  [k] osq_lock
    1.65%   1.65%  futex_test  [kernel.kallsyms]  [k] mutex_spin_on_owner.is
    0.83%   0.83%  futex_test  [kernel.kallsyms]  [k] __get_user_4

The CPU time spent in spinlocks was about 0.5%, so spinlock contention 
wasn't an issue for the TO futexes.

>> +
>> +	/*
>> +	 * Locate the futex state by looking up the futex state list in the
>> +	 * hash bucket. If it isn't found, create a new one and put it into
>> +	 * the list.
>> +	 */
>> +	state = lookup_futex_state(hb, &key, true);
>> +
>> +	/*
>> +	 * We don't need to hold the HB lock after looking up the futex state
>> +	 * as we have incremented the reference count.
>> +	 */
>> +	spin_unlock(&hb->lock);
>> +	BUG_ON(!state);
>> +
>> +	/*
>> +	 * Acquiring the serialization mutex.
>> +	 */
>> +	if (state->type != TYPE_TO) {
>> +		ret = -EINVAL;
>> +	} else {
>> +		if (to)
>> +			hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
>> +		ret = mutex_lock_interruptible(&state->mutex);
>> +	}
>> +
>> +	if (unlikely(ret))
>> +		/*
>> +		 * We got a signal or has some other error, need to abort
>> +		 * the lock operation and return.
>> +		 */
>> +		goto out_put_state_key;
>> +
>> +	/*
>> +	 * As the mutex owner, we can now spin on the futex word as well as
>> +	 * the active-ness of the futex owner.
>> +	 */
>> +	ret = futex_spin_on_owner(uaddr, vpid, state);
> So if futex_spin_on_owner() faults, then you return -EFAULT to user space
> w/o trying to page in the stuff? That's just wrong.

I am not so sure about the proper fault handling in futexes. However, 
futex_atomic_cmpxchg_inatomic() does the cmpxchg without disabling page 
faults. So does that mean that if I get a fault, it is probably some 
other problem and not just a page fault?

>
>> +static int futex_unlock_to(u32 __user *uaddr, unsigned int flags)
>> +{
>> +	u32 uval, pid, vpid = task_pid_vnr(current);
>> +	union futex_key key = FUTEX_KEY_INIT;
>> +	struct futex_hash_bucket *hb;
>> +	struct futex_state *state;
>> +	struct task_struct *owner;
>> +	int ret;
>> +
>> +	if (get_user(uval, uaddr))
>> +		return -EFAULT;
>> +	/*
>> +	 * If there is a new lock owner, we can exit now.
>> +	 * If uval is 0, another task may have acquired and release the
>> +	 * lock in the mean time, so we should also exit.
> If there is a new lock owner or the lock has been released? Voodoo magic or
> what? How can user space take over the lock when the current task owns it?
>
> That's just broken. The only reason to get here is that user space called
> in because it was not able to release the lock atomically, i.e. because the
> waiters bit was set. If anything fiddled with the uval then returning 0 is
> beyond stupid.

I had changed it in the v2 patch to not allow that and to return an 
error if it happens.

>> +	 */
>> +	pid = uval & FUTEX_TID_MASK;
>> +	if ((pid && (pid != vpid)) || !uval)
>> +		return 0;
>> +	WARN_ON_ONCE(!(uval & FUTEX_WAITERS));
> Crap. It's legit to call here even if the waiters bit is not set.

Again, it was changed in the v2 patch to return error in this case.
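The stricter v2 behaviour described in these two replies (reject calls
from non-owners instead of returning 0) can be modeled in userspace
like this. The names and return codes are illustrative assumptions, not
the actual v2 code:

```c
#include <stdatomic.h>
#include <stdint.h>

#define FUTEX_WAITERS   0x80000000u
#define FUTEX_TID_MASK  0x3fffffffu

enum unlock_result { UNLOCK_DONE, UNLOCK_NEED_WAKE, UNLOCK_NOT_OWNER };

/* Model of the unlock path: only the current owner may unlock. A word
 * holding the bare TID is released with a single cmpxchg; if flag bits
 * are set, that cmpxchg fails and the caller must take the slow path
 * and wake a waiter (done in the kernel in the real implementation). */
static enum unlock_result to_futex_unlock(_Atomic uint32_t *word,
                                          uint32_t vpid)
{
    uint32_t uval = atomic_load(word);

    if ((uval & FUTEX_TID_MASK) != vpid)
        return UNLOCK_NOT_OWNER;    /* v2 behaviour: report an error */

    uint32_t expected = vpid;       /* matches only a flag-free word */
    if (atomic_compare_exchange_strong(word, &expected, 0))
        return UNLOCK_DONE;

    return UNLOCK_NEED_WAKE;        /* FUTEX_WAITERS (or OWNER_DIED) set */
}
```

The userspace fast path corresponds to UNLOCK_DONE; UNLOCK_NEED_WAKE is
the case where the FUTEX_UNLOCK syscall is actually required.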

>> +	/*
>> +	 * If the TID isn't cleared in the userspace, clear it here to
>> +	 * encourage faster lock transfer to the mutex owner.
> What do you encourage here? How is the mutex transfer accelerated?

It was removed in the v2 patch because of the need to do lock handoff.

>> +	 */
>> +	if (pid == vpid) {
>> +		futex_atomic_cmpxchg_inatomic(&uval, uaddr, uval,
>> +					       uval & ~FUTEX_TID_MASK);
>> +		WARN_ON((uval & FUTEX_TID_MASK) != vpid);
>> +	}
>> +	ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_WRITE);
>> +	if (ret)
>> +		return ret;
>> +
>> +	hb = hash_futex(&key);
>> +	spin_lock(&hb->lock);
>> +
>> +	/*
>> +	 * Check the hash bucket only for matching futex state.
>> +	 */
>> +	state = lookup_futex_state(hb, &key, false);
>> +
>> +	if (!state)
>> +		goto out_unlock;
>> +
>> +	if (state->type != TYPE_TO) {
>> +		ret = -EINVAL;
>> +		goto out_unlock;
>> +	}
>> +
>> +	/*
>> +	 * We don't need to do the wakeup if the futex isn't equal to
>> +	 * FUTEX_WAITERS as it implies that someone else has taken over
>> +	 * the futex.
>> +	 */
>> +	if (get_user(uval, uaddr)) {
>> +		ret = -EFAULT;
>> +		goto out_unlock;
>> +	}
>> +
>> +	owner = READ_ONCE(state->owner);
>> +	if ((uval == FUTEX_WAITERS) && owner)
>> +		ret = wake_up_process(owner);
> What guarantees that owner cannot exit between reading owner state and
> calling wake_up_process?

I used the wake_q functions in the v2 patch, which increment the 
reference count.
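The point of the wake_q-style reference can be shown with a toy
lifetime model: taking a reference before the wakeup pins the task so
it cannot be freed between reading state->owner and waking it. Plain
structs stand in for task_struct/get_task_struct here; all names are
made up for illustration:

```c
#include <stdatomic.h>

/* Toy stand-in for task_struct lifetime: the task is "freed" when the
 * last reference is dropped. */
struct toy_task {
    _Atomic int refcount;
    int freed;
};

static void toy_get_task(struct toy_task *t)
{
    atomic_fetch_add(&t->refcount, 1);
}

static void toy_put_task(struct toy_task *t)
{
    if (atomic_fetch_sub(&t->refcount, 1) == 1)
        t->freed = 1;   /* last reference dropped */
}

/* wake_q_add() takes a reference, so the task stays valid ... */
static void toy_wake_q_add(struct toy_task *t)
{
    toy_get_task(t);
}

/* ... until wake_up_q() has actually woken it and drops the
 * reference. wake_up_process(t) would run here on a pinned task. */
static void toy_wake_up_q(struct toy_task *t)
{
    toy_put_task(t);
}
```

Without the extra reference, a task exiting between the owner load and
the wakeup would leave the waker with a dangling pointer.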

>
> Dammit. If you read the source in this file carefully, then you notice that
> it contains a lot of documentation about life time rules, protection and
> serialization.
>
> If you think, that it's the reviewers problem to figure that out from your
> w/o proper comments in place, then you are completely on the wrong track.
>
> I'm not even trying to think about the concept itself as long as this is
> presented as a pile of unpenetrable clusterfuck.

I am sorry the patch wasn't as well documented as it should be. I will 
spend more time cleaning up the comments and streamlining the code 
before sending out the v3 version.

Thanks for the review.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH 3/4] futex: Throughput-optimized (TO) futexes
  2016-09-23  2:44     ` Waiman Long
@ 2016-09-23 10:16       ` Thomas Gleixner
  0 siblings, 0 replies; 12+ messages in thread
From: Thomas Gleixner @ 2016-09-23 10:16 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Peter Zijlstra, Jonathan Corbet, linux-kernel,
	linux-doc, Davidlohr Bueso, Scott J Norton, Douglas Hatch

On Thu, 22 Sep 2016, Waiman Long wrote:
> On 09/22/2016 09:32 AM, Thomas Gleixner wrote:
> > > +			WARN_ON(!pi_state->owner);
> > > +			if (pi_state->owner)
> > > +				wake_up_process(pi_state->owner);
> > And what handles or sanity checks the state->hb_list ???
> > 
> 
> The exit_pi_state_list() function doesn't need to deal with state->hb_list.
> The hb_list is used to locate the futex state, but the futex owner doesn't
> have a reference to the futex state. So it won't need to decrement it and
> potentially free it.

Ok. Though that wants comments as well.
 
> > > +/*
> > > + * Spinning threshold for futex word without setting FUTEX_WAITERS.
> > > + */
> > > +#define FUTEX_SPIN_THRESHOLD	(1 << 10)
> > That number is pulled out of thin air or what is the rationale for chosing
> > 1024?
> 
> It is kind of arbitrary. I want a value large enough to encourage lock
> stealing, while not too large that it may take too long to get the lock. Will
> elaborate more about the choice in the comment.

I'm not really happy about these heuristics. The chosen value fits a
particular machine and scenario. Now try to do the same on some slow ARM
SoC and the 1024 loops are going to hurt badly.
 
> Sorry, I forgot to use the kerneldoc notation. It is an RFC patch, and my
> focus was to make it work. I haven't spent much time in cleaning up the
> comment. I will do that in the next version.

RFC is not an excuse for an unreadable mess. This stuff is complex enough
that it should be in your own interest to make it well documented and
readable. If it takes more time for a reviewer to disentangle the code
jungle than to think about the concept, verify the lifetime and
serialization rules, etc., then there is something very wrong. And it's
wasting my time and yours. Not that I care much about your time waste :)

When Sebastian and myself worked on the attached futexes we spent more time
tidying up the code and writing documentation than on actually coding,
debugging and optimizing it.

Futexes have a gazillion corner cases and pitfalls, and your stuff, even
if it seems to be simpler, has the same issues.

> > > +		WARN_ON_ONCE(!(uval & FUTEX_TID_MASK));
> > > +
> > > +		if ((uval & FUTEX_TID_MASK) != opid) {
> > > +			/*
> > > +			 * Get the new task structure
> > > +			 */
> > > +			if (otask)
> > > +				put_task_struct(otask);
> > > +
> > > +			WARN_ON_ONCE(on_owner_list);
> > > +			opid  = uval & FUTEX_TID_MASK;
> > > +			otask = futex_find_get_task(opid);
> > > +		}
> > Can you pretty please use split out functions for this otask handling? This
> > for loop does not fit on a single screen and can't be understood in one go.
> 
> Yes, this is the largest function in the new code, I will try to add helper
> functions to make it easier to understand.

You should have done that in the first place. Even when you develop
and experiment, it's insane to try to follow a loop with several nesting
levels which occupies _two_ screens.

It's your choice how you develop, but offering something like this for
review is just an affront.

> > > +			pr_info("futex_spin_on_owner: pid %d grabs futex from
> > > pid %d (%s)!\n",
> > > +				vpid, opid, otask ? "dying" : "invalid");
> > Really useful information to confuse sysadmins. A proper comment explaining
> > the issue and the implications and how we deal with it would have been the
> > right thing to do...
> 
> Yes, the message may be too cryptic. I will make it more clear in the next
> version.

No, this message is not too cryptic. It's entirely useless and it need
not be made clearer. It needs to be replaced by a comment explaining the
issue. This is a legit thing to happen and it has absolutely no value to
spam dmesg about something that is expected and dealt with.

> > > +	ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_WRITE);
> > > +	if (unlikely(ret))
> > > +		goto out;
> > > +
> > > +	hb = hash_futex(&key);
> > > +	spin_lock(&hb->lock);
> > Why are you using the global hash for this? As we have shown the global
> > hash has horrible performance due to cross node access and potential hash
> > bucket lock contention of unrelated processes. If we add something new
> > which is performance optimized then we pretty please make it use a seperate
> > process private storage. I don't see any point in making this optimized for
> > process shared futexes.
> 
> I used the hash lock for managing the futex state objects only. I don't need
> it for other purpose. It is true that if there is a hash collision that
> another wait-wake futex is using the same hash. It may cause performance
> problem. I think I will add another spinlock in the hash bucket just for TO
> futexes. In this way, a collision with wait-wake futex won't cause unexpected
> spinlock contention, though a bit of extra hash bucket cachline contention may
> still happen.

The real question is whether we want to use the global hash at all. We've
shown that using a process-private hash has massive advantages and I don't
see any point in providing support for this optimized futex for anything
other than process-private futexes. Process-shared is insanely slow anyway.
 
> BTW, running my microbenchmark with wait-wake futex, over 90% of the CPU time
> were in the spinlock contention:
> 
>   96.23%  futex_test  [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
>       - queued_spin_lock_slowpath
>          - _raw_spin_lock
>             + 51.81% futex_wake
>             + 48.08% futex_wait_setup

That's not really helpful if you don't tell me what the microbenchmark
actually does. If it's a gazillion threads whacking on the same futex,
then yes, this is expected. If there are different futexes then it's
important to know whether part of this contention is hash collision.

We've shown that hash collisions happen and they cause a massive slowdown.

> For the TO futexes, the perf profile was:
> 
>   97.33%   0.97%  futex_test  [kernel.kallsyms]  [k] futex_lock_to
>   92.45%  92.45%  futex_test  [kernel.kallsyms]  [k] osq_lock
>    1.65%   1.65%  futex_test  [kernel.kallsyms]  [k] mutex_spin_on_owner.is
>    0.83%   0.83%  futex_test  [kernel.kallsyms]  [k] __get_user_4
> 
> The %cpu time in spinlock was about 0.5%. So spinlock contention wasn't an
> issue for the TO futexes.

Well. You traded the spinlock contention with burning cpu cycles in
osq_lock. Your comparison metrics simply suck.
 
> > > +	ret = futex_spin_on_owner(uaddr, vpid, state);
> > So if futex_spin_on_owner() faults, then you return -EFAULT to user space
> > w/o trying to page in the stuff? That's just wrong.
> 
> I am not so sure about the proper fault handling in futex. However,
> futex_atomic_cmpxchg_inatomic() was doing cmpxchg without disabling page
> fault. So does that mean if I get a fault, it is probably other problem and
> not because of a lock of page fault?

Sigh. You do that inside a preempt disabled region. So if you really get a
page fault then this will just explode because the fault handler will
either trip over might_sleep() or end up calling schedule() with preemption
disabled. IOW, you must disable pagefaults and you have to deal with the
fallout yourself. Rename cmpxchg_futex_value_locked() to
cmpxchg_futex_value_inatomic() and use it. 

About 90% of the futex code is about handling corner cases and sanity
checking user space values. Even with the simplified scheme of your
optimized futexes it's not going to be any different.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2016-09-23 10:19 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-06 18:53 [RFC PATCH 0/4] futex: Introducing throughput-optimized futexes Waiman Long
2016-09-06 18:53 ` [RFC PATCH 1/4] futex: Add futex_set_timer() helper function Waiman Long
2016-09-06 18:53 ` [RFC PATCH 2/4] futex: Rename futex_pi_state to futex_state Waiman Long
2016-09-06 18:53 ` [RFC PATCH 3/4] futex: Throughput-optimized (TO) futexes Waiman Long
2016-09-22 13:32   ` Thomas Gleixner
2016-09-23  2:44     ` Waiman Long
2016-09-23 10:16       ` Thomas Gleixner
2016-09-06 18:53 ` [RFC PATCH 4/4] futex, doc: TO futexes document Waiman Long
2016-09-22 10:40   ` Thomas Gleixner
2016-09-22 13:06     ` Waiman Long
2016-09-22 13:51       ` Thomas Gleixner
2016-09-22 19:14         ` Waiman Long
