All of lore.kernel.org
* [PATCH v3 00/13] futex: Introducing throughput-optimized futexes
@ 2016-09-30 21:26 Waiman Long
  2016-09-30 21:26 ` [PATCH v3 01/13] futex: Consolidate duplicated timer setup code Waiman Long
                   ` (12 more replies)
  0 siblings, 13 replies; 16+ messages in thread
From: Waiman Long @ 2016-09-30 21:26 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Arnaldo Carvalho de Melo,
	Davidlohr Bueso, Mike Galbraith, Jason Low, Scott J Norton,
	Douglas Hatch, Waiman Long

 v2->v3:
  - Use the abbreviation TP for the new futexes instead of TO.
  - Make a number of changes according to review comments from
    ThomasG, PeterZ and MikeG.
  - Break the main futex patch into smaller pieces to make them easier
    to review.
  - Integrate the microbenchmark into the "perf bench futex" tool
    so that others can use it too.

 v1->v2:
  - Adds an explicit lock hand-off mechanism.
  - Adds timeout support.
  - Simplifies the required userspace code.
  - Fixes a number of problems in the v1 code.

This patchset introduces a new futex implementation called
throughput-optimized (TP) futexes. TP futexes are similar to PI
futexes in their calling convention, but provide better throughput
than the wait-wake futexes by encouraging more lock stealing and
optimistic spinning, especially when the lock holders don't sleep
and the critical sections are not very short.

A manpage patch that documents changes to the futex(2) system call
will be sent out separately.

Patch 1 consolidates the duplicated timer setup code.

Patch 2 renames the futex_pi_state structure to futex_state as it
will no longer be specific to the PI futexes. Some futex_pi_state
related functions are also renamed.

Patch 3 adds some helper functions to be used in later patches.

Patch 4 consolidates the code that adds futex_state objects to, or
deletes them from, a task's pi_state_list.

Patch 5 adds a new type field to the futex_state structure.

Patch 6 changes the futex_hash_bucket to include another list for
futex state objects and a spinlock to guard the new list.

Patch 7 is the core of this patchset; it introduces the new TP
futexes.

Patch 8 enables more robust handling when a futex owner dies
unexpectedly.

Patch 9 implements the lock handoff mechanism to prevent lock
starvation.

Patch 10 enables the FUTEX_LOCK call to return a status code that
shows how the lock was acquired.

Patch 11 adds support for specifying a timeout value in the
FUTEX_LOCK call.

Patch 12 adds a new documentation file about the TP futexes.

Patch 13 adds the futex microbenchmark that was used to produce the
performance data to the "perf bench futex" tool, so that others can
use it to do their own evaluation.

Waiman Long (13):
  futex: Consolidate duplicated timer setup code
  futex: Rename futex_pi_state to futex_state
  futex: Add helpers to get & cmpxchg futex value without lock
  futex: Consolidate pure pi_state_list add & delete codes to helpers
  futex: Add a new futex type field into futex_state
  futex: Allow direct attachment of futex_state objects to hash bucket
  futex: Throughput-optimized (TP) futexes
  futex: Enable robust handling of TP futexes
  futex: Implement lock handoff for TP futexes to prevent lock
    starvation
  futex: Inform FUTEX_LOCK callers on how the lock is acquired
  futex: Add timeout support to TP futexes
  futex, doc: TP futexes document
  perf bench: New microbenchmark for userspace mutex performance

 Documentation/00-INDEX         |    2 +
 Documentation/tp-futex.txt     |  147 +++++++
 include/linux/sched.h          |    4 +-
 include/uapi/linux/futex.h     |    4 +
 kernel/futex.c                 |  943 +++++++++++++++++++++++++++++++++++-----
 tools/perf/bench/Build         |    1 +
 tools/perf/bench/bench.h       |    1 +
 tools/perf/bench/futex-mutex.c |  558 ++++++++++++++++++++++++
 tools/perf/bench/futex.h       |   25 +
 tools/perf/builtin-bench.c     |    4 +
 10 files changed, 1575 insertions(+), 114 deletions(-)
 create mode 100644 Documentation/tp-futex.txt
 create mode 100644 tools/perf/bench/futex-mutex.c

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v3 01/13] futex: Consolidate duplicated timer setup code
  2016-09-30 21:26 [PATCH v3 00/13] futex: Introducing throughput-optimized futexes Waiman Long
@ 2016-09-30 21:26 ` Waiman Long
  2016-09-30 21:26 ` [PATCH v3 02/13] futex: Rename futex_pi_state to futex_state Waiman Long
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2016-09-30 21:26 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Arnaldo Carvalho de Melo,
	Davidlohr Bueso, Mike Galbraith, Jason Low, Scott J Norton,
	Douglas Hatch, Waiman Long

A new futex_setup_timer() helper function is added to consolidate all
the hrtimer_sleeper setup code.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 kernel/futex.c |   67 +++++++++++++++++++++++++++++++-------------------------
 1 files changed, 37 insertions(+), 30 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 46cb3a3..91724ec 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -468,6 +468,35 @@ static void drop_futex_key_refs(union futex_key *key)
 }
 
 /**
+ * futex_setup_timer - set up the sleeping hrtimer.
+ * @time:	ptr to the given timeout value
+ * @timeout:	the hrtimer_sleeper structure to be set up
+ * @flags:	futex flags
+ * @range_ns:	optional range in ns
+ *
+ * Return: Initialized hrtimer_sleeper structure or NULL if no timeout
+ *	   value given
+ */
+static inline struct hrtimer_sleeper *
+futex_setup_timer(ktime_t *time, struct hrtimer_sleeper *timeout,
+		  int flags, u64 range_ns)
+{
+	if (!time)
+		return NULL;
+
+	hrtimer_init_on_stack(&timeout->timer, (flags & FLAGS_CLOCKRT) ?
+			      CLOCK_REALTIME : CLOCK_MONOTONIC,
+			      HRTIMER_MODE_ABS);
+	hrtimer_init_sleeper(timeout, current);
+	if (range_ns)
+		hrtimer_set_expires_range_ns(&timeout->timer, *time, range_ns);
+	else
+		hrtimer_set_expires(&timeout->timer, *time);
+
+	return timeout;
+}
+
+/**
  * get_futex_key() - Get parameters which are the keys for a futex
  * @uaddr:	virtual address of the futex
  * @fshared:	0 for a PROCESS_PRIVATE futex, 1 for PROCESS_SHARED
@@ -2393,7 +2422,7 @@ out:
 static int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
 		      ktime_t *abs_time, u32 bitset)
 {
-	struct hrtimer_sleeper timeout, *to = NULL;
+	struct hrtimer_sleeper timeout, *to;
 	struct restart_block *restart;
 	struct futex_hash_bucket *hb;
 	struct futex_q q = futex_q_init;
@@ -2403,17 +2432,8 @@ static int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
 		return -EINVAL;
 	q.bitset = bitset;
 
-	if (abs_time) {
-		to = &timeout;
-
-		hrtimer_init_on_stack(&to->timer, (flags & FLAGS_CLOCKRT) ?
-				      CLOCK_REALTIME : CLOCK_MONOTONIC,
-				      HRTIMER_MODE_ABS);
-		hrtimer_init_sleeper(to, current);
-		hrtimer_set_expires_range_ns(&to->timer, *abs_time,
-					     current->timer_slack_ns);
-	}
-
+	to = futex_setup_timer(abs_time, &timeout, flags,
+			       current->timer_slack_ns);
 retry:
 	/*
 	 * Prepare to wait on uaddr. On success, holds hb lock and increments
@@ -2493,7 +2513,7 @@ static long futex_wait_restart(struct restart_block *restart)
 static int futex_lock_pi(u32 __user *uaddr, unsigned int flags,
 			 ktime_t *time, int trylock)
 {
-	struct hrtimer_sleeper timeout, *to = NULL;
+	struct hrtimer_sleeper timeout, *to;
 	struct futex_hash_bucket *hb;
 	struct futex_q q = futex_q_init;
 	int res, ret;
@@ -2501,13 +2521,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags,
 	if (refill_pi_state_cache())
 		return -ENOMEM;
 
-	if (time) {
-		to = &timeout;
-		hrtimer_init_on_stack(&to->timer, CLOCK_REALTIME,
-				      HRTIMER_MODE_ABS);
-		hrtimer_init_sleeper(to, current);
-		hrtimer_set_expires(&to->timer, *time);
-	}
+	to = futex_setup_timer(time, &timeout, FLAGS_CLOCKRT, 0);
 
 retry:
 	ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q.key, VERIFY_WRITE);
@@ -2802,7 +2816,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 				 u32 val, ktime_t *abs_time, u32 bitset,
 				 u32 __user *uaddr2)
 {
-	struct hrtimer_sleeper timeout, *to = NULL;
+	struct hrtimer_sleeper timeout, *to;
 	struct rt_mutex_waiter rt_waiter;
 	struct rt_mutex *pi_mutex = NULL;
 	struct futex_hash_bucket *hb;
@@ -2816,15 +2830,8 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 	if (!bitset)
 		return -EINVAL;
 
-	if (abs_time) {
-		to = &timeout;
-		hrtimer_init_on_stack(&to->timer, (flags & FLAGS_CLOCKRT) ?
-				      CLOCK_REALTIME : CLOCK_MONOTONIC,
-				      HRTIMER_MODE_ABS);
-		hrtimer_init_sleeper(to, current);
-		hrtimer_set_expires_range_ns(&to->timer, *abs_time,
-					     current->timer_slack_ns);
-	}
+	to = futex_setup_timer(abs_time, &timeout, flags,
+			       current->timer_slack_ns);
 
 	/*
 	 * The waiter is allocated on our stack, manipulated by the requeue
-- 
1.7.1


* [PATCH v3 02/13] futex: Rename futex_pi_state to futex_state
  2016-09-30 21:26 [PATCH v3 00/13] futex: Introducing throughput-optimized futexes Waiman Long
  2016-09-30 21:26 ` [PATCH v3 01/13] futex: Consolidate duplicated timer setup code Waiman Long
@ 2016-09-30 21:26 ` Waiman Long
  2016-09-30 21:26 ` [PATCH v3 03/13] futex: Add helpers to get & cmpxchg futex value without lock Waiman Long
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2016-09-30 21:26 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Arnaldo Carvalho de Melo,
	Davidlohr Bueso, Mike Galbraith, Jason Low, Scott J Norton,
	Douglas Hatch, Waiman Long

The futex_pi_state structure will be overloaded in later patches to
store state information about non-PI futexes, so its name will no
longer be a good description of its purpose. It is therefore renamed
to futex_state, a more generic name.

Some of the functions that process the futex states are also renamed.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 include/linux/sched.h |    4 +-
 kernel/futex.c        |  107 +++++++++++++++++++++++++------------------------
 2 files changed, 56 insertions(+), 55 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 62c68e5..fefd7f7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -126,7 +126,7 @@ struct sched_attr {
 	u64 sched_period;
 };
 
-struct futex_pi_state;
+struct futex_state;
 struct robust_list_head;
 struct bio_list;
 struct fs_struct;
@@ -1772,7 +1772,7 @@ struct task_struct {
 	struct compat_robust_list_head __user *compat_robust_list;
 #endif
 	struct list_head pi_state_list;
-	struct futex_pi_state *pi_state_cache;
+	struct futex_state *pi_state_cache;
 #endif
 #ifdef CONFIG_PERF_EVENTS
 	struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts];
diff --git a/kernel/futex.c b/kernel/futex.c
index 91724ec..f4e20ad 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -192,11 +192,12 @@ int __read_mostly futex_cmpxchg_enabled;
 #define FLAGS_HAS_TIMEOUT	0x04
 
 /*
- * Priority Inheritance state:
+ * Futex state object:
+ *  - Priority Inheritance state
  */
-struct futex_pi_state {
+struct futex_state {
 	/*
-	 * list of 'owned' pi_state instances - these have to be
+	 * list of 'owned' state instances - these have to be
 	 * cleaned up in do_exit() if the task exits prematurely:
 	 */
 	struct list_head list;
@@ -240,7 +241,7 @@ struct futex_q {
 	struct task_struct *task;
 	spinlock_t *lock_ptr;
 	union futex_key key;
-	struct futex_pi_state *pi_state;
+	struct futex_state *pi_state;
 	struct rt_mutex_waiter *rt_waiter;
 	union futex_key *requeue_pi_key;
 	u32 bitset;
@@ -797,76 +798,76 @@ static int get_futex_value_locked(u32 *dest, u32 __user *from)
 /*
  * PI code:
  */
-static int refill_pi_state_cache(void)
+static int refill_futex_state_cache(void)
 {
-	struct futex_pi_state *pi_state;
+	struct futex_state *state;
 
 	if (likely(current->pi_state_cache))
 		return 0;
 
-	pi_state = kzalloc(sizeof(*pi_state), GFP_KERNEL);
+	state = kzalloc(sizeof(*state), GFP_KERNEL);
 
-	if (!pi_state)
+	if (!state)
 		return -ENOMEM;
 
-	INIT_LIST_HEAD(&pi_state->list);
+	INIT_LIST_HEAD(&state->list);
 	/* pi_mutex gets initialized later */
-	pi_state->owner = NULL;
-	atomic_set(&pi_state->refcount, 1);
-	pi_state->key = FUTEX_KEY_INIT;
+	state->owner = NULL;
+	atomic_set(&state->refcount, 1);
+	state->key = FUTEX_KEY_INIT;
 
-	current->pi_state_cache = pi_state;
+	current->pi_state_cache = state;
 
 	return 0;
 }
 
-static struct futex_pi_state * alloc_pi_state(void)
+static struct futex_state *alloc_futex_state(void)
 {
-	struct futex_pi_state *pi_state = current->pi_state_cache;
+	struct futex_state *state = current->pi_state_cache;
 
-	WARN_ON(!pi_state);
+	WARN_ON(!state);
 	current->pi_state_cache = NULL;
 
-	return pi_state;
+	return state;
 }
 
 /*
- * Drops a reference to the pi_state object and frees or caches it
+ * Drops a reference to the futex state object and frees or caches it
  * when the last reference is gone.
  *
  * Must be called with the hb lock held.
  */
-static void put_pi_state(struct futex_pi_state *pi_state)
+static void put_futex_state(struct futex_state *state)
 {
-	if (!pi_state)
+	if (!state)
 		return;
 
-	if (!atomic_dec_and_test(&pi_state->refcount))
+	if (!atomic_dec_and_test(&state->refcount))
 		return;
 
 	/*
-	 * If pi_state->owner is NULL, the owner is most probably dying
-	 * and has cleaned up the pi_state already
+	 * If state->owner is NULL, the owner is most probably dying
+	 * and has cleaned up the futex state already
 	 */
-	if (pi_state->owner) {
-		raw_spin_lock_irq(&pi_state->owner->pi_lock);
-		list_del_init(&pi_state->list);
-		raw_spin_unlock_irq(&pi_state->owner->pi_lock);
+	if (state->owner) {
+		raw_spin_lock_irq(&state->owner->pi_lock);
+		list_del_init(&state->list);
+		raw_spin_unlock_irq(&state->owner->pi_lock);
 
-		rt_mutex_proxy_unlock(&pi_state->pi_mutex, pi_state->owner);
+		rt_mutex_proxy_unlock(&state->pi_mutex, state->owner);
 	}
 
 	if (current->pi_state_cache)
-		kfree(pi_state);
+		kfree(state);
 	else {
 		/*
-		 * pi_state->list is already empty.
-		 * clear pi_state->owner.
+		 * state->list is already empty.
+		 * clear state->owner.
 		 * refcount is at 0 - put it back to 1.
 		 */
-		pi_state->owner = NULL;
-		atomic_set(&pi_state->refcount, 1);
-		current->pi_state_cache = pi_state;
+		state->owner = NULL;
+		atomic_set(&state->refcount, 1);
+		current->pi_state_cache = state;
 	}
 }
 
@@ -896,7 +897,7 @@ static struct task_struct * futex_find_get_task(pid_t pid)
 void exit_pi_state_list(struct task_struct *curr)
 {
 	struct list_head *next, *head = &curr->pi_state_list;
-	struct futex_pi_state *pi_state;
+	struct futex_state *pi_state;
 	struct futex_hash_bucket *hb;
 	union futex_key key = FUTEX_KEY_INIT;
 
@@ -911,7 +912,7 @@ void exit_pi_state_list(struct task_struct *curr)
 	while (!list_empty(head)) {
 
 		next = head->next;
-		pi_state = list_entry(next, struct futex_pi_state, list);
+		pi_state = list_entry(next, struct futex_state, list);
 		key = pi_state->key;
 		hb = hash_futex(&key);
 		raw_spin_unlock_irq(&curr->pi_lock);
@@ -998,8 +999,8 @@ void exit_pi_state_list(struct task_struct *curr)
  * the pi_state against the user space value. If correct, attach to
  * it.
  */
-static int attach_to_pi_state(u32 uval, struct futex_pi_state *pi_state,
-			      struct futex_pi_state **ps)
+static int attach_to_pi_state(u32 uval, struct futex_state *pi_state,
+			      struct futex_state **ps)
 {
 	pid_t pid = uval & FUTEX_TID_MASK;
 
@@ -1070,10 +1071,10 @@ out_state:
  * it after doing proper sanity checks.
  */
 static int attach_to_pi_owner(u32 uval, union futex_key *key,
-			      struct futex_pi_state **ps)
+			      struct futex_state **ps)
 {
 	pid_t pid = uval & FUTEX_TID_MASK;
-	struct futex_pi_state *pi_state;
+	struct futex_state *pi_state;
 	struct task_struct *p;
 
 	/*
@@ -1114,7 +1115,7 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
 	/*
 	 * No existing pi state. First waiter. [2]
 	 */
-	pi_state = alloc_pi_state();
+	pi_state = alloc_futex_state();
 
 	/*
 	 * Initialize the pi_mutex in locked state and make @p
@@ -1138,7 +1139,7 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
 }
 
 static int lookup_pi_state(u32 uval, struct futex_hash_bucket *hb,
-			   union futex_key *key, struct futex_pi_state **ps)
+			   union futex_key *key, struct futex_state **ps)
 {
 	struct futex_q *match = futex_top_waiter(hb, key);
 
@@ -1190,7 +1191,7 @@ static int lock_pi_update_atomic(u32 __user *uaddr, u32 uval, u32 newval)
  */
 static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
 				union futex_key *key,
-				struct futex_pi_state **ps,
+				struct futex_state **ps,
 				struct task_struct *task, int set_waiters)
 {
 	u32 uval, newval, vpid = task_pid_vnr(task);
@@ -1316,7 +1317,7 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this,
 			 struct futex_hash_bucket *hb)
 {
 	struct task_struct *new_owner;
-	struct futex_pi_state *pi_state = this->pi_state;
+	struct futex_state *pi_state = this->pi_state;
 	u32 uninitialized_var(curval), newval;
 	WAKE_Q(wake_q);
 	bool deboost;
@@ -1656,7 +1657,7 @@ static int futex_proxy_trylock_atomic(u32 __user *pifutex,
 				 struct futex_hash_bucket *hb1,
 				 struct futex_hash_bucket *hb2,
 				 union futex_key *key1, union futex_key *key2,
-				 struct futex_pi_state **ps, int set_waiters)
+				 struct futex_state **ps, int set_waiters)
 {
 	struct futex_q *top_waiter = NULL;
 	u32 curval;
@@ -1725,7 +1726,7 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags,
 {
 	union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT;
 	int drop_count = 0, task_count = 0, ret;
-	struct futex_pi_state *pi_state = NULL;
+	struct futex_state *pi_state = NULL;
 	struct futex_hash_bucket *hb1, *hb2;
 	struct futex_q *this, *next;
 	WAKE_Q(wake_q);
@@ -1742,7 +1743,7 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags,
 		 * requeue_pi requires a pi_state, try to allocate it now
 		 * without any locks in case it fails.
 		 */
-		if (refill_pi_state_cache())
+		if (refill_futex_state_cache())
 			return -ENOMEM;
 		/*
 		 * requeue_pi must wake as many tasks as it can, up to nr_wake
@@ -1954,7 +1955,7 @@ retry_private:
 				 * object.
 				 */
 				this->pi_state = NULL;
-				put_pi_state(pi_state);
+				put_futex_state(pi_state);
 				/*
 				 * We stop queueing more waiters and let user
 				 * space deal with the mess.
@@ -1971,7 +1972,7 @@ retry_private:
 	 * in futex_proxy_trylock_atomic() or in lookup_pi_state(). We
 	 * need to drop it here again.
 	 */
-	put_pi_state(pi_state);
+	put_futex_state(pi_state);
 
 out_unlock:
 	double_unlock_hb(hb1, hb2);
@@ -2126,7 +2127,7 @@ static void unqueue_me_pi(struct futex_q *q)
 	__unqueue_futex(q);
 
 	BUG_ON(!q->pi_state);
-	put_pi_state(q->pi_state);
+	put_futex_state(q->pi_state);
 	q->pi_state = NULL;
 
 	spin_unlock(q->lock_ptr);
@@ -2142,7 +2143,7 @@ static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q,
 				struct task_struct *newowner)
 {
 	u32 newtid = task_pid_vnr(newowner) | FUTEX_WAITERS;
-	struct futex_pi_state *pi_state = q->pi_state;
+	struct futex_state *pi_state = q->pi_state;
 	struct task_struct *oldowner = pi_state->owner;
 	u32 uval, uninitialized_var(curval), newval;
 	int ret;
@@ -2518,7 +2519,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags,
 	struct futex_q q = futex_q_init;
 	int res, ret;
 
-	if (refill_pi_state_cache())
+	if (refill_futex_state_cache())
 		return -ENOMEM;
 
 	to = futex_setup_timer(time, &timeout, FLAGS_CLOCKRT, 0);
@@ -2899,7 +2900,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 			 * Drop the reference to the pi state which
 			 * the requeue_pi() code acquired for us.
 			 */
-			put_pi_state(q.pi_state);
+			put_futex_state(q.pi_state);
 			spin_unlock(q.lock_ptr);
 		}
 	} else {
-- 
1.7.1


* [PATCH v3 03/13] futex: Add helpers to get & cmpxchg futex value without lock
  2016-09-30 21:26 [PATCH v3 00/13] futex: Introducing throughput-optimized futexes Waiman Long
  2016-09-30 21:26 ` [PATCH v3 01/13] futex: Consolidate duplicated timer setup code Waiman Long
  2016-09-30 21:26 ` [PATCH v3 02/13] futex: Rename futex_pi_state to futex_state Waiman Long
@ 2016-09-30 21:26 ` Waiman Long
  2016-09-30 21:26 ` [PATCH v3 04/13] futex: Consolidate pure pi_state_list add & delete codes to helpers Waiman Long
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2016-09-30 21:26 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Arnaldo Carvalho de Melo,
	Davidlohr Bueso, Mike Galbraith, Jason Low, Scott J Norton,
	Douglas Hatch, Waiman Long

Two new helper functions, cmpxchg_futex_value() and get_futex_value(),
are added to access and change the futex value without holding the
hash bucket lock. They are called with page faults enabled, so the
page will be faulted in if it is not present yet.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 kernel/futex.c |   15 +++++++++++++++
 1 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index f4e20ad..ebe59fa 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -794,6 +794,21 @@ static int get_futex_value_locked(u32 *dest, u32 __user *from)
 	return ret ? -EFAULT : 0;
 }
 
+/*
+ * The equivalents of the above cmpxchg_futex_value_locked() and
+ * get_futex_value_locked which are called without the hash bucket lock
+ * and so can have page fault enabled.
+ */
+static inline int cmpxchg_futex_value(u32 *curval, u32 __user *uaddr,
+				      u32 uval, u32 newval)
+{
+	return futex_atomic_cmpxchg_inatomic(curval, uaddr, uval, newval);
+}
+
+static inline int get_futex_value(u32 *dest, u32 __user *from)
+{
+	return __get_user(*dest, from);
+}
 
 /*
  * PI code:
-- 
1.7.1


* [PATCH v3 04/13] futex: Consolidate pure pi_state_list add & delete codes to helpers
  2016-09-30 21:26 [PATCH v3 00/13] futex: Introducing throughput-optimized futexes Waiman Long
                   ` (2 preceding siblings ...)
  2016-09-30 21:26 ` [PATCH v3 03/13] futex: Add helpers to get & cmpxchg futex value without lock Waiman Long
@ 2016-09-30 21:26 ` Waiman Long
  2016-09-30 21:26 ` [PATCH v3 05/13] futex: Add a new futex type field into futex_state Waiman Long
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2016-09-30 21:26 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Arnaldo Carvalho de Melo,
	Davidlohr Bueso, Mike Galbraith, Jason Low, Scott J Norton,
	Douglas Hatch, Waiman Long

Two new helper functions (task_pi_list_add & task_pi_list_del)
are created to consolidate all the pure pi_state_list addition and
deletion code. The set_owner argument of task_pi_list_add() will
be needed in a later patch.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 kernel/futex.c |   67 +++++++++++++++++++++++++++++++++++---------------------
 1 files changed, 42 insertions(+), 25 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index ebe59fa..0d1b1be 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -813,6 +813,42 @@ static inline int get_futex_value(u32 *dest, u32 __user *from)
 /*
  * PI code:
  */
+
+/**
+ * task_pi_list_add - add futex state object to a task's pi_state_list
+ * @task :	task structure
+ * @state:	futex state object
+ * @set_owner:	set futex state owner to the given task if true
+ */
+static inline void task_pi_list_add(struct task_struct *task,
+				    struct futex_state *state,
+				    const bool set_owner)
+{
+	raw_spin_lock_irq(&task->pi_lock);
+	WARN_ON(!list_empty(&state->list));
+	list_add(&state->list, &task->pi_state_list);
+	if (set_owner)
+		state->owner = task;
+	raw_spin_unlock_irq(&task->pi_lock);
+}
+
+/**
+ * task_pi_list_del - delete futex state object from a task's pi_state_list
+ * @task : task structure
+ * @state: futex state object
+ * @warn : warn if list is empty when set
+ */
+static inline void task_pi_list_del(struct task_struct *task,
+				    struct futex_state *state,
+				    const bool warn)
+{
+	raw_spin_lock_irq(&task->pi_lock);
+	if (warn)
+		WARN_ON(list_empty(&state->list));
+	list_del_init(&state->list);
+	raw_spin_unlock_irq(&task->pi_lock);
+}
+
 static int refill_futex_state_cache(void)
 {
 	struct futex_state *state;
@@ -865,9 +901,7 @@ static void put_futex_state(struct futex_state *state)
 	 * and has cleaned up the futex state already
 	 */
 	if (state->owner) {
-		raw_spin_lock_irq(&state->owner->pi_lock);
-		list_del_init(&state->list);
-		raw_spin_unlock_irq(&state->owner->pi_lock);
+		task_pi_list_del(state->owner, state, false);
 
 		rt_mutex_proxy_unlock(&state->pi_mutex, state->owner);
 	}
@@ -1388,16 +1422,8 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this,
 		return ret;
 	}
 
-	raw_spin_lock(&pi_state->owner->pi_lock);
-	WARN_ON(list_empty(&pi_state->list));
-	list_del_init(&pi_state->list);
-	raw_spin_unlock(&pi_state->owner->pi_lock);
-
-	raw_spin_lock(&new_owner->pi_lock);
-	WARN_ON(!list_empty(&pi_state->list));
-	list_add(&pi_state->list, &new_owner->pi_state_list);
-	pi_state->owner = new_owner;
-	raw_spin_unlock(&new_owner->pi_lock);
+	task_pi_list_del(pi_state->owner, pi_state, true);
+	task_pi_list_add(new_owner, pi_state, true);
 
 	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 
@@ -2202,19 +2228,10 @@ retry:
 	 * We fixed up user space. Now we need to fix the pi_state
 	 * itself.
 	 */
-	if (pi_state->owner != NULL) {
-		raw_spin_lock_irq(&pi_state->owner->pi_lock);
-		WARN_ON(list_empty(&pi_state->list));
-		list_del_init(&pi_state->list);
-		raw_spin_unlock_irq(&pi_state->owner->pi_lock);
-	}
+	if (pi_state->owner != NULL)
+		task_pi_list_del(pi_state->owner, pi_state, true);
 
-	pi_state->owner = newowner;
-
-	raw_spin_lock_irq(&newowner->pi_lock);
-	WARN_ON(!list_empty(&pi_state->list));
-	list_add(&pi_state->list, &newowner->pi_state_list);
-	raw_spin_unlock_irq(&newowner->pi_lock);
+	task_pi_list_add(newowner, pi_state, true);
 	return 0;
 
 	/*
-- 
1.7.1


* [PATCH v3 05/13] futex: Add a new futex type field into futex_state
  2016-09-30 21:26 [PATCH v3 00/13] futex: Introducing throughput-optimized futexes Waiman Long
                   ` (3 preceding siblings ...)
  2016-09-30 21:26 ` [PATCH v3 04/13] futex: Consolidate pure pi_state_list add & delete codes to helpers Waiman Long
@ 2016-09-30 21:26 ` Waiman Long
  2016-09-30 21:26 ` [PATCH v3 06/13] futex: Allow direct attachment of futex_state objects to hash bucket Waiman Long
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2016-09-30 21:26 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Arnaldo Carvalho de Melo,
	Davidlohr Bueso, Mike Galbraith, Jason Low, Scott J Norton,
	Douglas Hatch, Waiman Long

As the futex_state structure will be overloaded in later patches
to be used by non-PI futexes, it is necessary to add a type field to
distinguish among different types of futexes.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 kernel/futex.c |   14 ++++++++++----
 1 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 0d1b1be..a496c01 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -191,6 +191,10 @@ int __read_mostly futex_cmpxchg_enabled;
 #define FLAGS_CLOCKRT		0x02
 #define FLAGS_HAS_TIMEOUT	0x04
 
+enum futex_type {
+	TYPE_PI = 0,
+};
+
 /*
  * Futex state object:
  *  - Priority Inheritance state
@@ -210,6 +214,7 @@ struct futex_state {
 	struct task_struct *owner;
 	atomic_t refcount;
 
+	enum futex_type type;
 	union futex_key key;
 };
 
@@ -897,10 +902,10 @@ static void put_futex_state(struct futex_state *state)
 		return;
 
 	/*
-	 * If state->owner is NULL, the owner is most probably dying
-	 * and has cleaned up the futex state already
+	 * If state->owner is NULL and the type is TYPE_PI, the owner is
+	 * most probably dying and has cleaned up the futex state already
 	 */
-	if (state->owner) {
+	if (state->owner && (state->type == TYPE_PI)) {
 		task_pi_list_del(state->owner, state, false);
 
 		rt_mutex_proxy_unlock(&state->pi_mutex, state->owner);
@@ -1056,7 +1061,7 @@ static int attach_to_pi_state(u32 uval, struct futex_state *pi_state,
 	/*
 	 * Userspace might have messed up non-PI and PI futexes [3]
 	 */
-	if (unlikely(!pi_state))
+	if (unlikely(!pi_state || (pi_state->type != TYPE_PI)))
 		return -EINVAL;
 
 	WARN_ON(!atomic_read(&pi_state->refcount));
@@ -1174,6 +1179,7 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
 
 	/* Store the key for possible exit cleanups: */
 	pi_state->key = *key;
+	pi_state->type = TYPE_PI;
 
 	WARN_ON(!list_empty(&pi_state->list));
 	list_add(&pi_state->list, &p->pi_state_list);
-- 
1.7.1


* [PATCH v3 06/13] futex: Allow direct attachment of futex_state objects to hash bucket
  2016-09-30 21:26 [PATCH v3 00/13] futex: Introducing throughput-optimized futexes Waiman Long
                   ` (4 preceding siblings ...)
  2016-09-30 21:26 ` [PATCH v3 05/13] futex: Add a new futex type field into futex_state Waiman Long
@ 2016-09-30 21:26 ` Waiman Long
  2016-09-30 21:26 ` [PATCH v3 07/13] futex: Throughput-optimized (TP) futexes Waiman Long
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2016-09-30 21:26 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Arnaldo Carvalho de Melo,
	Davidlohr Bueso, Mike Galbraith, Jason Low, Scott J Norton,
	Douglas Hatch, Waiman Long

Currently, the futex state objects can only be located indirectly as

        hash bucket => futex_q => futex state

It can be beneficial in some cases to locate the futex state object
directly from the hash bucket without the futex_q middleman.
Therefore, a new list head to link the futex state objects, together
with a new spinlock to guard the list, is added to the hash bucket.

To limit size increase for UP systems, these new fields are only for
SMP machines where the cacheline alignment of the hash bucket leaves
it with enough empty space for the new fields.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 kernel/futex.c |   24 ++++++++++++++++++++++++
 1 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index a496c01..646445f 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -207,6 +207,11 @@ struct futex_state {
 	struct list_head list;
 
 	/*
+	 * Can link to fs_head in the owning hash bucket.
+	 */
+	struct list_head fs_list;
+
+	/*
 	 * The PI object:
 	 */
 	struct rt_mutex pi_mutex;
@@ -262,11 +267,24 @@ static const struct futex_q futex_q_init = {
  * Hash buckets are shared by all the futex_keys that hash to the same
  * location.  Each key may have multiple futex_q structures, one for each task
  * waiting on a futex.
+ *
+ * Alternatively (on SMP), a key can be associated with a unique futex_state
+ * object on which multiple waiters waiting for that futex can queue up
+ * without using the futex_q structure. A separate futex_state lock
+ * (fs_lock) is used for processing those futex_state objects.
  */
 struct futex_hash_bucket {
 	atomic_t waiters;
 	spinlock_t lock;
 	struct plist_head chain;
+
+#ifdef CONFIG_SMP
+	/*
+	 * Fields for managing futex_state object list
+	 */
+	spinlock_t fs_lock;
+	struct list_head fs_head;
+#endif
 } ____cacheline_aligned_in_smp;
 
 /*
@@ -867,6 +885,8 @@ static int refill_futex_state_cache(void)
 		return -ENOMEM;
 
 	INIT_LIST_HEAD(&state->list);
+	INIT_LIST_HEAD(&state->fs_list);
+
 	/* pi_mutex gets initialized later */
 	state->owner = NULL;
 	atomic_set(&state->refcount, 1);
@@ -3356,6 +3376,10 @@ static int __init futex_init(void)
 		atomic_set(&futex_queues[i].waiters, 0);
 		plist_head_init(&futex_queues[i].chain);
 		spin_lock_init(&futex_queues[i].lock);
+#ifdef CONFIG_SMP
+		INIT_LIST_HEAD(&futex_queues[i].fs_head);
+		spin_lock_init(&futex_queues[i].fs_lock);
+#endif
 	}
 
 	return 0;
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v3 07/13] futex: Throughput-optimized (TP) futexes
  2016-09-30 21:26 [PATCH v3 00/13] futex: Introducing throughput-optimized futexes Waiman Long
                   ` (5 preceding siblings ...)
  2016-09-30 21:26 ` [PATCH v3 06/13] futex: Allow direct attachment of futex_state objects to hash bucket Waiman Long
@ 2016-09-30 21:26 ` Waiman Long
  2016-10-01  6:47   ` Thomas Gleixner
  2016-09-30 21:26 ` [PATCH v3 08/13] futex: Enable robust handling of TP futexes Waiman Long
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 16+ messages in thread
From: Waiman Long @ 2016-09-30 21:26 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Arnaldo Carvalho de Melo,
	Davidlohr Bueso, Mike Galbraith, Jason Low, Scott J Norton,
	Douglas Hatch, Waiman Long

A new futex implementation called throughput-optimized (TP) futexes
is introduced. The goal of this new futex type is to maximize locking
throughput at the expense of fairness and deterministic latency. Its
throughput is higher than that of the wait-wake futexes, especially
on systems with a large number of CPUs and for workloads where the
lock owners are unlikely to sleep. The downside of this new futex
type is an increase in response time variance.

The new TP futex type is designed mainly to be used for mutex-like
exclusive user space locks. It is not as versatile as the wait-wake
futexes, which can also be used for other locking primitives like
condition variables or read-write locks. So it is not a direct
replacement for wait-wake futexes.

A futex locking microbenchmark was written in which separate userspace
mutexes are implemented using wait-wake (WW), PI and TP futexes
respectively. This microbenchmark was instructed to run locking
operations for 10s, with 5 pause instructions as the critical section,
by 144 threads on a 4-socket 72-core 144-thread Haswell-EX system
running a 4.8-rc7 based kernel. The results of the benchmark runs
were as follows:

                        WW futex       PI futex      TP futex
                        --------       --------      --------
Total locking ops      34,925,195      249,343      54,622,973
Per-thread avg/sec         24,248          172          37,929
Per-thread min/sec         22,656          172          12,552
Per-thread max/sec         26,436          214          70,187
% Stddev                    0.33%        0.17%           2.34%
lock futex calls        8,062,402      248,947       1,010,022
unlock futex calls      9,295,362      248,948               8

The TP futexes had better throughput, but they also had a larger
variance.

For the TP futexes, the heavy hitters in the perf profile were:

 96.52%  96.52%  [kernel.vmlinux] [k] osq_lock
  0.77%   0.74%  [kernel.vmlinux] [k] futex_spin_on_owner
  0.72%   0.72%  [kernel.vmlinux] [k] mutex_spin_on_owner

For the WW futexes, the heavy hitter was:

 95.49%  95.49%  [kernel.vmlinux] [k] queued_spin_lock_slowpath

As the critical section (CS) length increases, the performance gap
between the WW and TP futexes actually widens, as shown in the table
below.

  CS length     WW locking ops  TP locking ops  % change
  ---------     --------------  --------------  --------
      5           34,925,195      54,622,973      +56%
     10           25,191,955      42,577,283      +69%
     20           16,322,731      29,424,333      +80%
     30           13,109,462      33,664,095     +157%
     40           11,052,076      27,998,446     +153%
     50            9,722,942      25,793,991     +161%
  1us sleep          175,824         179,067       +2%

For the WW futexes, the number of actual lock and unlock futex calls
increases with the critical section length. That is not necessarily
the case for TP futexes.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 include/uapi/linux/futex.h |    4 +
 kernel/futex.c             |  501 +++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 503 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/futex.h b/include/uapi/linux/futex.h
index 0b1f716..7f676ac 100644
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -20,6 +20,8 @@
 #define FUTEX_WAKE_BITSET	10
 #define FUTEX_WAIT_REQUEUE_PI	11
 #define FUTEX_CMP_REQUEUE_PI	12
+#define FUTEX_LOCK		13
+#define FUTEX_UNLOCK		14
 
 #define FUTEX_PRIVATE_FLAG	128
 #define FUTEX_CLOCK_REALTIME	256
@@ -39,6 +41,8 @@
 					 FUTEX_PRIVATE_FLAG)
 #define FUTEX_CMP_REQUEUE_PI_PRIVATE	(FUTEX_CMP_REQUEUE_PI | \
 					 FUTEX_PRIVATE_FLAG)
+#define FUTEX_LOCK_PRIVATE	(FUTEX_LOCK | FUTEX_PRIVATE_FLAG)
+#define FUTEX_UNLOCK_PRIVATE	(FUTEX_UNLOCK | FUTEX_PRIVATE_FLAG)
 
 /*
  * Support for robust futexes: the kernel cleans up held futexes at
diff --git a/kernel/futex.c b/kernel/futex.c
index 646445f..bfe17a9 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -193,6 +193,7 @@ int __read_mostly futex_cmpxchg_enabled;
 
 enum futex_type {
 	TYPE_PI = 0,
+	TYPE_TP,
 };
 
 /*
@@ -212,10 +213,18 @@ struct futex_state {
 	struct list_head fs_list;
 
 	/*
-	 * The PI object:
+	 * The PI or mutex object:
 	 */
-	struct rt_mutex pi_mutex;
+	union {
+		struct rt_mutex pi_mutex;
+		struct mutex mutex;
+	};
 
+	/*
+	 * For a PI futex, owner is the task that owns the futex.
+	 * For a TP futex, owner is the mutex lock holder that is either
+	 * spinning on the futex owner or sleeping.
+	 */
 	struct task_struct *owner;
 	atomic_t refcount;
 
@@ -3239,6 +3248,484 @@ void exit_robust_list(struct task_struct *curr)
 				   curr, pip);
 }
 
+#ifdef CONFIG_SMP
+/*
+ * Throughput-Optimized (TP) Futexes
+ * ---------------------------------
+ *
+ * Userspace mutual exclusion locks can be implemented either by using the
+ * wait-wake (WW) futexes or the PI futexes. The WW futexes have much higher
+ * throughput but, unlike the PI futexes, can't guarantee minimal latency
+ * for high-priority processes. Even then, the overhead of WW futexes in
+ * userspace locking primitives can easily become a performance bottleneck
+ * on systems with a large number of cores.
+ *
+ * The throughput-optimized (TP) futex is a new futex implementation that
+ * provides higher throughput than WW futex via the use of following
+ * two techniques:
+ *  1) Optimistic spinning when futex owner is actively running
+ *  2) Lock stealing
+ *
+ * Optimistic spinning isn't present in other futexes. Lock stealing is
+ * possible with WW futexes, but isn't actively encouraged as it is with
+ * TP futexes. The downside, however, is a much higher variance in the
+ * response times. Since there is no point in supporting optimistic
+ * spinning on a UP system, this new futex type is only available on SMP
+ * systems.
+ *
+ * The TP futexes are not as versatile as the WW futexes. The TP futexes
+ * are mainly for mutex-like exclusive user space locks. On the other hand,
+ * WW futexes can also be used to implement other synchronization primitives
+ * like condition variables or read/write locks.
+ *
+ * The use of TP futexes is very similar to the PI futexes. Locking is done
+ * by atomically transiting the futex from 0 to the task's thread ID.
+ * Unlocking is done by atomically changing the futex from thread ID to 0.
+ * Any failure to do so will require calls to the kernel to do the locking
+ * and unlocking.
+ *
+ * Within the kernel, trylocks are done ignoring the FUTEX_WAITERS bit. The
+ * purpose of this FUTEX_WAITERS bit is to make the unlocker wake up the
+ * serialization mutex owner. Not having the FUTEX_WAITERS bit set doesn't
+ * mean there is no waiter in the kernel.
+ *
+ * Like PI futexes, TP futexes are orthogonal to robust futexes and can be
+ * used together with them.
+ *
+ * Unlike the other futexes, the futex_q structures aren't used. Instead,
+ * waiters queue up on the serialization mutex of the futex state object
+ * attached to the hash bucket.
+ */
+
+/**
+ * lookup_futex_state - Look up the futex state structure.
+ * @hb:		 hash bucket
+ * @key:	 futex key
+ * @search_only: Boolean search only (no allocation) flag
+ *
+ * This function differs from lookup_pi_state() in that it searches the fs_head
+ * in the hash bucket instead of the waiters in the plist. The HB fs_lock must
+ * be held before calling it. If the search_only flag isn't set, a new state
+ * structure will be pushed into the list and returned if there is no matching
+ * key.
+ *
+ * The reference count won't be incremented for search_only call as the
+ * futex state object won't be destroyed while the fs_lock is being held.
+ *
+ * Return: The matching futex state object or a newly allocated one.
+ *	   NULL if not found in search only lookup.
+ */
+static struct futex_state *
+lookup_futex_state(struct futex_hash_bucket *hb, union futex_key *key,
+		   bool search_only)
+{
+	struct futex_state *state;
+
+	list_for_each_entry(state, &hb->fs_head, fs_list)
+		if (match_futex(key, &state->key)) {
+			if (!search_only)
+				atomic_inc(&state->refcount);
+			return state;
+		}
+
+	if (search_only)
+		return NULL;
+
+	/*
+	 * Push a new one into the list and return it.
+	 */
+	state = alloc_futex_state();
+	state->type = TYPE_TP;
+	state->key = *key;
+	list_add(&state->fs_list, &hb->fs_head);
+	WARN_ON(atomic_read(&state->refcount) != 1);
+
+	/*
+	 * Initialize the mutex structure.
+	 */
+	mutex_init(&state->mutex);
+	WARN_ON(!list_empty(&state->list));
+	return state;
+}
+
+/**
+ * put_futex_state_unlocked - fast dropping futex state reference count
+ * @state: the futex state to be dropped
+ *
+ * This function is used to decrement the reference count in the futex state
+ * object while not holding the HB fs_lock. It can fail.
+ *
+ * Return: 1 if successful and 0 if the reference count is going to be
+ *	   decremented to 0 and hence has to be done under lock.
+ */
+static inline int put_futex_state_unlocked(struct futex_state *state)
+{
+	return atomic_add_unless(&state->refcount, -1, 1);
+}
+
+/**
+ * futex_trylock - try to lock the userspace futex word (0 => vpid).
+ * @uaddr:  futex address
+ * @puval:  storage location for current futex value
+ * @waiter: true for top waiter, false otherwise
+ *
+ * The HB fs_lock should NOT be held while calling this function.
+ * The flag bits are ignored in the trylock.
+ *
+ * If waiter is true
+ * then
+ *   don't preserve the flag bits;
+ * else
+ *   preserve the flag bits
+ * endif
+ *
+ * Return: 1 if lock acquired;
+ *	   0 if lock acquisition failed;
+ *	   -EFAULT if an error happened.
+ */
+static inline int futex_trylock(u32 __user *uaddr, u32 vpid, u32 *puval,
+				const bool waiter)
+{
+	u32 uval, flags = 0;
+
+	if (unlikely(get_futex_value(&uval, uaddr)))
+		return -EFAULT;
+
+	*puval = uval;
+
+	if (uval & FUTEX_TID_MASK)
+		return 0;	/* Trylock fails */
+
+	if (!waiter)
+		flags |= (uval & ~FUTEX_TID_MASK);
+
+	if (unlikely(cmpxchg_futex_value(puval, uaddr, uval, vpid|flags)))
+		return -EFAULT;
+
+	return *puval == uval;
+}
+
+/**
+ * futex_set_waiters_bit - set the FUTEX_WAITERS bit without fs_lock
+ * @uaddr: futex address
+ * @puval: pointer to current futex value
+ *
+ * Return: 0 if successful or < 0 if an error happens.
+ *	   The futex value pointed to by puval will be set to the current value.
+ */
+static inline int futex_set_waiters_bit(u32 __user *uaddr, u32 *puval)
+{
+	u32 curval, uval = *puval;
+
+	while (!(uval & FUTEX_WAITERS)) {
+		/*
+		 * Set the FUTEX_WAITERS bit.
+		 */
+		if (cmpxchg_futex_value(&curval, uaddr, uval,
+					uval | FUTEX_WAITERS))
+			return -EFAULT;
+		if (curval == uval) {
+			*puval = uval | FUTEX_WAITERS;
+			break;
+		}
+		uval = curval;
+	}
+	return 0;
+}
+
+/**
+ * futex_spin_on_owner - Optimistically spin on futex owner
+ * @uaddr: futex address
+ * @vpid:  PID of current task
+ * @state: futex state object
+ *
+ * Spin on the futex word while the futex owner is active. Otherwise, set
+ * the FUTEX_WAITERS bit and go to sleep.
+ *
+ * As we take a reference to the futex owner's task structure, we don't need
+ * to use RCU to ensure that the task structure is valid.
+ *
+ * The function will directly grab the lock if the owner is dying or the pid
+ * is invalid. That should take care of the problem of dead lock owners
+ * unless the pid wraps around and the perceived owner is not the real owner.
+ * To guard against this case, we will have to use the robust futex feature.
+ *
+ * Return: 0 if futex acquired, < 0 if an error happens.
+ */
+static int futex_spin_on_owner(u32 __user *uaddr, u32 vpid,
+			       struct futex_state *state)
+{
+	int ret;
+	u32 uval;
+	u32 owner_pid = 0;
+	struct task_struct *owner_task = NULL;
+
+	WRITE_ONCE(state->owner, current);
+	preempt_disable();
+	for (;;) {
+		ret = futex_trylock(uaddr, vpid, &uval, true);
+		if (ret)
+			break;
+
+		if ((uval & FUTEX_TID_MASK) != owner_pid) {
+			if (owner_task)
+				put_task_struct(owner_task);
+
+			owner_pid  = uval & FUTEX_TID_MASK;
+			owner_task = futex_find_get_task(owner_pid);
+		}
+
+		if (need_resched()) {
+			__set_current_state(TASK_RUNNING);
+			schedule_preempt_disabled();
+			continue;
+		}
+
+		if (signal_pending(current)) {
+			ret = -EINTR;
+			break;
+		}
+
+		if (owner_task->on_cpu) {
+			cpu_relax();
+			continue;
+		}
+
+		/*
+		 * If the owner isn't active, we need to go to sleep after
+		 * making sure that the FUTEX_WAITERS bit is set.
+		 */
+		if (futex_set_waiters_bit(uaddr, &uval) < 0)
+			goto efault;
+
+		/*
+		 * Do a trylock after setting the task state to make
+		 * sure we won't miss a wakeup.
+		 *
+		 * Futex owner		Mutex owner
+		 * -----------		-----------
+		 * unlock		set state
+		 * MB			MB
+		 * read state		trylock
+		 * wakeup		sleep
+		 */
+		set_current_state(TASK_INTERRUPTIBLE);
+		ret = futex_trylock(uaddr, vpid, &uval, true);
+		if (ret) {
+			__set_current_state(TASK_RUNNING);
+			break;
+		}
+
+		/*
+		 * Don't sleep if the owner has died or the FUTEX_WAITERS bit
+		 * was cleared. The latter case can happen when unlock and
+		 * lock stealing happen in between setting the FUTEX_WAITERS
+		 * and setting state to TASK_INTERRUPTIBLE.
+		 */
+		if (!(uval & FUTEX_OWNER_DIED) && (uval & FUTEX_WAITERS))
+			schedule_preempt_disabled();
+		__set_current_state(TASK_RUNNING);
+	}
+out:
+	preempt_enable();
+	if (owner_task)
+		put_task_struct(owner_task);
+
+	/*
+	 * Cleanup futex state.
+	 */
+	WRITE_ONCE(state->owner, NULL);
+	return ret;
+
+efault:
+	ret = -EFAULT;
+	goto out;
+}
+
+/*
+ * Userspace tried a 0 -> TID atomic transition of the futex value
+ * and failed. The kernel side here does the whole locking operation.
+ * The kernel mutex will be used for serialization. Once it becomes the
+ * sole mutex lock owner, it will spin on the futex owner's task structure
+ * to see if it is running. It will also spin on the futex word so as to grab
+ * the lock as soon as it is free.
+ *
+ * This function is not inlined so that it can show up separately in perf
+ * profile for performance analysis purpose.
+ *
+ * Return: 0   - lock acquired
+ *	   < 0 - an error happens
+ */
+static noinline int futex_lock(u32 __user *uaddr, unsigned int flags)
+{
+	struct futex_hash_bucket *hb;
+	union futex_key key = FUTEX_KEY_INIT;
+	struct futex_state *state;
+	u32 uval, vpid = task_pid_vnr(current);
+	int ret;
+
+
+	/*
+	 * Try to steal the lock, preserving the flag bits.
+	 */
+	ret = futex_trylock(uaddr, vpid, &uval, false);
+	if (ret)
+		goto out;
+
+	/*
+	 * Detect deadlocks.
+	 */
+	if (unlikely(((uval & FUTEX_TID_MASK) == vpid) ||
+			should_fail_futex(true)))
+		return -EDEADLK;
+
+	if (refill_futex_state_cache())
+		return -ENOMEM;
+
+	ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_WRITE);
+	if (unlikely(ret))
+		goto out;
+
+	hb = hash_futex(&key);
+	spin_lock(&hb->fs_lock);
+
+	/*
+	 * Locate the futex state by looking up the futex state list in the
+	 * hash bucket. If it isn't found, create a new one and put it into
+	 * the list.
+	 */
+	state = lookup_futex_state(hb, &key, false);
+
+	/*
+	 * We don't need to hold the lock after looking up the futex state
+	 * as we have incremented the reference count.
+	 */
+	spin_unlock(&hb->fs_lock);
+
+	/*
+	 * Acquire the serialization mutex.
+	 */
+	if (!state)
+		ret = -ENOMEM;
+	else if (state->type != TYPE_TP)
+		ret = -EINVAL;
+	else
+		ret = mutex_lock_interruptible(&state->mutex);
+
+	/*
+	 * If we got a signal or hit some other error, we need to abort
+	 * the lock operation and return.
+	 */
+	if (unlikely(ret))
+		goto out_put_state_key;
+
+	/*
+	 * As the mutex owner, we can now spin on the futex word as well as
+	 * the active-ness of the futex owner.
+	 */
+	ret = futex_spin_on_owner(uaddr, vpid, state);
+
+	mutex_unlock(&state->mutex);
+
+out_put_state_key:
+	if (!put_futex_state_unlocked(state)) {
+		/*
+		 * May need to free the futex state object and so must be
+		 * under the lock. If the reference count is really going
+		 * to reach 0, we also need to dequeue the futex state
+		 * from the hash bucket list.
+		 */
+		spin_lock(&hb->fs_lock);
+		if (atomic_read(&state->refcount) == 1)
+			list_del_init(&state->fs_list);
+		put_futex_state(state);
+		spin_unlock(&hb->fs_lock);
+	}
+	put_futex_key(&key);
+
+out:
+	return (ret < 0) ? ret : 0;
+}
+
+/*
+ * Userspace attempted a TID -> 0 atomic transition, and failed.
+ * This is the in-kernel slowpath: we look up the futex state (if any),
+ * and wakeup the mutex owner.
+ *
+ * Return: 1 if a wakeup is attempted, 0 if there is no task to wake,
+ *	   or < 0 when an error happens.
+ */
+static int futex_unlock(u32 __user *uaddr, unsigned int flags)
+{
+	u32 uval, vpid = task_pid_vnr(current);
+	union futex_key key = FUTEX_KEY_INIT;
+	struct futex_hash_bucket *hb;
+	struct futex_state *state = NULL;
+	struct task_struct *owner = NULL;
+	int ret;
+	WAKE_Q(wake_q);
+
+	if (get_user(uval, uaddr))
+		return -EFAULT;
+
+	if ((uval & FUTEX_TID_MASK) != vpid)
+		return -EPERM;
+
+	if (!(uval & FUTEX_WAITERS))
+		return -EINVAL;
+
+	ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_WRITE);
+	if (ret)
+		return ret;
+
+	hb = hash_futex(&key);
+	spin_lock(&hb->fs_lock);
+
+	/*
+	 * Check the hash bucket only for matching futex state.
+	 */
+	state = lookup_futex_state(hb, &key, true);
+	if (!state || (state->type != TYPE_TP)) {
+		ret = -EINVAL;
+		spin_unlock(&hb->fs_lock);
+		goto out_put_key;
+	}
+
+	owner = READ_ONCE(state->owner);
+	if (owner)
+		wake_q_add(&wake_q, owner);
+
+	spin_unlock(&hb->fs_lock);
+
+	/*
+	 * Unlock the futex.
+	 * The flag bits are not preserved to encourage more lock stealing.
+	 */
+	for (;;) {
+		u32 old = uval;
+
+		if (cmpxchg_futex_value(&uval, uaddr, old, 0)) {
+			ret = -EFAULT;
+			break;
+		}
+		if (old == uval)
+			break;
+	}
+
+out_put_key:
+	put_futex_key(&key);
+	if (owner) {
+		/*
+		 * No error can have happened if owner is set.
+		 */
+		wake_up_q(&wake_q);
+		return 1;
+	}
+
+	return ret;
+}
+#endif /* CONFIG_SMP */
+
 long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 		u32 __user *uaddr2, u32 val2, u32 val3)
 {
@@ -3261,6 +3748,10 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 	case FUTEX_TRYLOCK_PI:
 	case FUTEX_WAIT_REQUEUE_PI:
 	case FUTEX_CMP_REQUEUE_PI:
+#ifdef CONFIG_SMP
+	case FUTEX_LOCK:
+	case FUTEX_UNLOCK:
+#endif
 		if (!futex_cmpxchg_enabled)
 			return -ENOSYS;
 	}
@@ -3292,6 +3783,12 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 					     uaddr2);
 	case FUTEX_CMP_REQUEUE_PI:
 		return futex_requeue(uaddr, flags, uaddr2, val, val2, &val3, 1);
+#ifdef CONFIG_SMP
+	case FUTEX_LOCK:
+		return futex_lock(uaddr, flags);
+	case FUTEX_UNLOCK:
+		return futex_unlock(uaddr, flags);
+#endif
 	}
 	return -ENOSYS;
 }
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v3 08/13] futex: Enable robust handling of TP futexes
  2016-09-30 21:26 [PATCH v3 00/13] futex: Introducing throughput-optimized futexes Waiman Long
                   ` (6 preceding siblings ...)
  2016-09-30 21:26 ` [PATCH v3 07/13] futex: Throughput-optimized (TP) futexes Waiman Long
@ 2016-09-30 21:26 ` Waiman Long
  2016-09-30 21:26 ` [PATCH v3 09/13] futex: Implement lock handoff for TP futexes to prevent lock starvation Waiman Long
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2016-09-30 21:26 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Arnaldo Carvalho de Melo,
	Davidlohr Bueso, Mike Galbraith, Jason Low, Scott J Norton,
	Douglas Hatch, Waiman Long

The TP futexes don't have code to handle the death of futex
owners. There are two different cases that need to be considered.

As the top waiter gets a reference to the task structure of the futex
owner, the task structure will never go away even if the owner dies.
If the futex owner dies while the top waiter is spinning, the task
structure will be marked dead, or the pid won't have a matching task
structure if the task died before a reference was taken. Alternatively,
if the robust futex attribute is enabled, the FUTEX_OWNER_DIED bit of
the futex word may be set. In all those cases, what the top waiter
needs to do is to grab the futex directly. An informational message
will also be printed to highlight this event.

If the futex owner dies while the top waiter is sleeping, we need to
make the exit processing code wake up the top waiter. This is done
by chaining the futex state object into the pi_state_list of the futex
owner before the top waiter sleeps, so that the wakeup will happen
if exit_pi_state_list() is called. The top waiter needs to remove
its futex state object from the pi_state_list of the old owner if
the ownership changes hands.
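The chaining-and-wakeup scheme described above can be modelled in a few lines. This is a hypothetical, heavily condensed userspace model, not the kernel code from the diff below: the names follow the patch (task_pi_list_add(), exit_pi_state_list()), but list_head is replaced by a singly linked list and wake_up_process() by a flag.

```c
#include <stddef.h>

/* Stand-in for a sleeping task; woken replaces wake_up_process(). */
struct waiter {
	int woken;
};

/* Minimal model of a TP futex_state: chained on the futex owner's
 * pi_state_list, recording the sleeping serialization mutex owner. */
struct tp_state {
	struct tp_state *next;	/* stand-in for list_head list */
	struct waiter *owner;	/* top waiter to wake on owner death */
};

/* Before sleeping, the top waiter chains its futex state onto the
 * futex owner's pi_state_list so a dying owner can find and wake it. */
static void task_pi_list_add(struct tp_state **head, struct tp_state *st)
{
	st->next = *head;
	*head = st;
}

/* Exit processing for a dying futex owner: dequeue every chained
 * futex state and wake its sleeping top waiter. */
static void exit_pi_state_list(struct tp_state **head)
{
	while (*head) {
		struct tp_state *st = *head;

		*head = st->next;
		st->next = NULL;
		if (st->owner)
			st->owner->woken = 1;
	}
}
```

The invariant the model captures: a top waiter is only ever asleep while its state is on the owner's list, so a dying owner cannot miss it.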

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 kernel/futex.c |   82 +++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 files changed, 76 insertions(+), 6 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index bfe17a9..5fa9a10 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -973,7 +973,7 @@ static struct task_struct * futex_find_get_task(pid_t pid)
 }
 
 /*
- * This task is holding PI mutexes at exit time => bad.
+ * This task is holding PI or TP mutexes at exit time => bad.
  * Kernel cleans up PI-state, but userspace is likely hosed.
  * (Robust-futex cleanup is separate and might save the day for userspace.)
  */
@@ -990,12 +990,29 @@ void exit_pi_state_list(struct task_struct *curr)
 	 * We are a ZOMBIE and nobody can enqueue itself on
 	 * pi_state_list anymore, but we have to be careful
 	 * versus waiters unqueueing themselves:
+	 *
+	 * For TP futexes, the only purpose of showing up in the
+	 * pi_state_list is for this function to wake up the serialization
+	 * mutex owner (state->owner). We don't actually need to take the
+	 * HB lock. The futex state and task struct won't go away as long
+	 * as we hold the pi_lock.
 	 */
 	raw_spin_lock_irq(&curr->pi_lock);
 	while (!list_empty(head)) {
 
 		next = head->next;
 		pi_state = list_entry(next, struct futex_state, list);
+
+		if (pi_state->type == TYPE_TP) {
+			struct task_struct *owner = READ_ONCE(pi_state->owner);
+
+			WARN_ON(list_empty(&pi_state->list));
+			list_del_init(&pi_state->list);
+			if (owner)
+				wake_up_process(owner);
+			continue;
+		}
+
 		key = pi_state->key;
 		hb = hash_futex(&key);
 		raw_spin_unlock_irq(&curr->pi_lock);
@@ -3152,8 +3169,8 @@ retry:
 			goto retry;
 
 		/*
-		 * Wake robust non-PI futexes here. The wakeup of
-		 * PI futexes happens in exit_pi_state():
+		 * Wake robust wait-wake futexes here. The wakeup of
+		 * PI and TP futexes happens in exit_pi_state():
 		 */
 		if (!pi && (uval & FUTEX_WAITERS))
 			futex_wake(uaddr, 1, 1, FUTEX_BITSET_MATCH_ANY);
@@ -3295,6 +3312,12 @@ void exit_robust_list(struct task_struct *curr)
  * Unlike the other futexes, the futex_q structures aren't used. Instead,
  * waiters queue up on the serialization mutex of the futex state object
  * attached to the hash bucket.
+ *
+ * To handle the exceptional case that the futex owner died, the robust
+ * futex list mechanism is used to wake up a sleeping top waiter.
+ * Checks are also made in the futex_spin_on_owner() loop for a dead task
+ * structure or an invalid pid. In both cases, the top waiter will take over
+ * the ownership of the futex.
  */
 
 /**
@@ -3459,6 +3482,7 @@ static int futex_spin_on_owner(u32 __user *uaddr, u32 vpid,
 	u32 uval;
 	u32 owner_pid = 0;
 	struct task_struct *owner_task = NULL;
+	bool on_owner_pi_list = false;
 
 	WRITE_ONCE(state->owner, current);
 	preempt_disable();
@@ -3468,13 +3492,47 @@ static int futex_spin_on_owner(u32 __user *uaddr, u32 vpid,
 			break;
 
 		if ((uval & FUTEX_TID_MASK) != owner_pid) {
-			if (owner_task)
+			if (owner_task) {
+				/*
+				 * task_pi_list_del() should always be
+				 * done before put_task_struct(). The futex
+				 * state may have been dequeued if the task
+				 * is dead.
+				 */
+				if (on_owner_pi_list) {
+					task_pi_list_del(owner_task, state,
+							 false);
+					on_owner_pi_list = false;
+				}
 				put_task_struct(owner_task);
+			}
 
 			owner_pid  = uval & FUTEX_TID_MASK;
 			owner_task = futex_find_get_task(owner_pid);
 		}
 
+		if (unlikely(!owner_task ||
+			    (owner_task->flags & PF_EXITING) ||
+			    (uval & FUTEX_OWNER_DIED))) {
+			/*
+			 * PID invalid or exiting/dead task, we can directly
+			 * grab the lock now.
+			 */
+			u32 curval;
+
+			ret = cmpxchg_futex_value(&curval, uaddr, uval, vpid);
+			if (unlikely(ret))
+				goto efault;
+			if (curval != uval)
+				continue;
+			pr_info("futex: owner pid %d of futex 0x%lx was %s, lock is now acquired by pid %d!\n",
+				owner_pid, (long)uaddr,
+				(owner_task || (uval & FUTEX_OWNER_DIED))
+				? "dead" : "invalid", vpid);
+
+			break;
+		}
+
 		if (need_resched()) {
 			__set_current_state(TASK_RUNNING);
 			schedule_preempt_disabled();
@@ -3493,11 +3551,18 @@ static int futex_spin_on_owner(u32 __user *uaddr, u32 vpid,
 
 		/*
 		 * If the owner isn't active, we need to go to sleep after
-		 * making sure that the FUTEX_WAITERS bit is set.
+		 * making sure that the FUTEX_WAITERS bit is set. We also
+		 * need to put the futex state into the futex owner's
+		 * pi_state_list to prevent deadlock when the owner dies.
 		 */
 		if (futex_set_waiters_bit(uaddr, &uval) < 0)
 			goto efault;
 
+		if (!on_owner_pi_list) {
+			task_pi_list_add(owner_task, state, false);
+			on_owner_pi_list = true;
+		}
+
 		/*
 		 * Do a trylock after setting the task state to make
 		 * sure we won't miss a wakeup.
@@ -3528,8 +3593,13 @@ static int futex_spin_on_owner(u32 __user *uaddr, u32 vpid,
 	}
 out:
 	preempt_enable();
-	if (owner_task)
+	if (owner_task) {
+		if (on_owner_pi_list)
+			task_pi_list_del(owner_task, state, false);
 		put_task_struct(owner_task);
+	} else {
+		WARN_ON(on_owner_pi_list);
+	}
 
 	/*
 	 * Cleanup futex state.
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v3 09/13] futex: Implement lock handoff for TP futexes to prevent lock starvation
  2016-09-30 21:26 [PATCH v3 00/13] futex: Introducing throughput-optimized futexes Waiman Long
                   ` (7 preceding siblings ...)
  2016-09-30 21:26 ` [PATCH v3 08/13] futex: Enable robust handling of TP futexes Waiman Long
@ 2016-09-30 21:26 ` Waiman Long
  2016-09-30 21:26 ` [PATCH v3 10/13] futex: Inform FUTEX_LOCK callers on how the lock is acquired Waiman Long
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2016-09-30 21:26 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Arnaldo Carvalho de Melo,
	Davidlohr Bueso, Mike Galbraith, Jason Low, Scott J Norton,
	Douglas Hatch, Waiman Long

The current TP futexes have no guarantee that the top waiter
(serialization mutex owner) can get the lock within a finite time.
As a result, lock starvation can happen.

A lock handoff mechanism is added to the TP futexes to prevent lock
starvation from happening. The idea is that the top waiter can set a
special handoff_pid variable to its own pid. The futex owner will
then check this variable at unlock time to see if it should free the
futex or change its value to the designated pid, transferring the
lock directly to the top waiter.
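The unlock-side half of that handshake can be sketched as follows. Again this is a hypothetical userspace model under stated assumptions, not the patch's kernel code: the handoff_pid name follows the patch, but the helper names are invented and the wakeup of the handed-off waiter is omitted.

```c
#include <stdatomic.h>

/* Top waiter side: after exhausting its spin threshold, publish our
 * own pid so the owner can pass the lock on directly at unlock time. */
static void request_handoff(atomic_uint *handoff_pid, unsigned int my_pid)
{
	atomic_store(handoff_pid, my_pid);
}

/* Owner side, in the unlock slowpath: consume any pending handoff
 * request.  If one was posted, write the designated pid into the futex
 * word instead of clearing it, so ownership transfers directly to the
 * top waiter with no window for lock stealing.  Returns the new owner
 * pid, or 0 if the futex was simply freed. */
static unsigned int unlock_with_handoff(atomic_uint *futex,
					atomic_uint *handoff_pid)
{
	unsigned int next = atomic_exchange(handoff_pid, 0);

	atomic_store(futex, next);	/* 0 frees; pid hands the lock off */
	return next;
}
```

The design point the sketch illustrates is why handoff bounds starvation: once the top waiter posts its pid, the very next unlock makes it the owner, regardless of how many stealers are spinning.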

This change does have the effect of limiting the amount of unfairness
and hence may adversely affect performance depending on the workloads.

Using a futex locking microbenchmark on a 4-socket 72-core 144-thread
Haswell-EX system running 4.8-rc7 based kernel, the performance data
for the TP futexes before and after the patch were as follows:

  CS length     Before patch    After patch    % change
  ---------     ------------    -----------    --------
      5          54,622,973      59,455,380      +8.8%
     10          42,577,283      41,161,387      -3.3%
     20          29,424,333      37,718,989     +28.2%
     30          33,664,095      29,474,986     -12.4%
     40          27,998,446      24,774,389     -11.5%
     50          25,793,991      22,718,432     -11.9%
  1us sleep         179,067         177,662      -0.8%

In fact, when the critical section is a 1us sleep, the wait-wake
futexes perform a bit better than the TP futexes as the former futexes
are more unfair in this case. So lock starvation can happen with
wait-wake futexes.

                        WW futex        TP futex
                        --------        --------
Total locking ops       175,824         177,662
Per-thread avg/sec          122             123
Per-thread min/sec            2             106
Per-thread max/sec          498             199
% Stddev                  6.26%           0.97%

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 kernel/futex.c |   55 ++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 files changed, 48 insertions(+), 7 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 5fa9a10..d260410 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -228,6 +228,8 @@ struct futex_state {
 	struct task_struct *owner;
 	atomic_t refcount;
 
+	u32 handoff_pid;	/* For TP futexes only */
+
 	enum futex_type type;
 	union futex_key key;
 };
@@ -898,6 +900,7 @@ static int refill_futex_state_cache(void)
 
 	/* pi_mutex gets initialized later */
 	state->owner = NULL;
+	state->handoff_pid = 0;
 	atomic_set(&state->refcount, 1);
 	state->key = FUTEX_KEY_INIT;
 
@@ -3303,8 +3306,9 @@ void exit_robust_list(struct task_struct *curr)
  *
  * Within the kernel, trylocks are done ignoring the FUTEX_WAITERS bit. The
  * purpose of this FUTEX_WAITERS bit is to make the unlocker wake up the
- * serialization mutex owner. Not having the FUTEX_WAITERS bit set doesn't
- * mean there is no waiter in the kernel.
+ * serialization mutex owner or hand off the lock directly to the top
+ * waiter. Not having the FUTEX_WAITERS bit set doesn't mean there is no
+ * waiter in the kernel.
  *
 * Like PI futexes, TP futexes are orthogonal to robust futexes and can be
 * used together.
@@ -3320,6 +3324,19 @@ void exit_robust_list(struct task_struct *curr)
  * the ownership of the futex.
  */
 
+/*
+ * Spinning and sleeping thresholds for enabling lock handoff and preventing
+ * lock starvation.
+ *
+ * The current threshold values are somewhat arbitrary. Setting them too
+ * low will limit the amount of lock stealing possible, thus reducing
+ * performance. Setting them too high may make the top waiter wait too long
+ * before getting the lock. The current set of values is a bit on the
+ * high side to ensure adequate performance.
+ */
+#define TP_SPIN_THRESHOLD       (1 << 14)
+#define TP_SLEEP_DECREMENT      (TP_SPIN_THRESHOLD/64)
+
 /**
  * lookup_futex_state - Looking up the futex state structure.
  * @hb:		 hash bucket
@@ -3398,6 +3415,7 @@ static inline int put_futex_state_unlocked(struct futex_state *state)
  * If waiter is true
  * then
  *   don't preserve the flag bits;
+ *   check for handoff (futex word == own pid)
  * else
  *   preserve the flag bits
  * endif
@@ -3416,6 +3434,9 @@ static inline int futex_trylock(u32 __user *uaddr, u32 vpid, u32 *puval,
 
 	*puval = uval;
 
+	if (waiter && (uval & FUTEX_TID_MASK) == vpid)
+		return 1;
+
 	if (uval & FUTEX_TID_MASK)
 		return 0;	/* Trylock fails */
 
@@ -3478,7 +3499,7 @@ static inline int futex_set_waiters_bit(u32 __user *uaddr, u32 *puval)
 static int futex_spin_on_owner(u32 __user *uaddr, u32 vpid,
 			       struct futex_state *state)
 {
-	int ret;
+	int ret, loop = TP_SPIN_THRESHOLD;
 	u32 uval;
 	u32 owner_pid = 0;
 	struct task_struct *owner_task = NULL;
@@ -3486,7 +3507,7 @@ static int futex_spin_on_owner(u32 __user *uaddr, u32 vpid,
 
 	WRITE_ONCE(state->owner, current);
 	preempt_disable();
-	for (;;) {
+	for (;; loop--) {
 		ret = futex_trylock(uaddr, vpid, &uval, true);
 		if (ret)
 			break;
@@ -3544,6 +3565,18 @@ static int futex_spin_on_owner(u32 __user *uaddr, u32 vpid,
 			break;
 		}
 
+		/*
+		 * Enable lock handoff if the threshold reaches 0.
+		 * We also need to set the FUTEX_WAITERS bit to make sure
+		 * that futex lock holder will initiate the handoff at
+		 * unlock time.
+		 */
+		if ((loop <= 0) && !READ_ONCE(state->handoff_pid)) {
+			WRITE_ONCE(state->handoff_pid, vpid);
+			if (futex_set_waiters_bit(uaddr, &uval) < 0)
+				goto efault;
+		}
+
 		if (owner_task->on_cpu) {
 			cpu_relax();
 			continue;
@@ -3587,8 +3620,10 @@ static int futex_spin_on_owner(u32 __user *uaddr, u32 vpid,
 		 * lock stealing happen in between setting the FUTEX_WAITERS
 		 * and setting state to TASK_INTERRUPTIBLE.
 		 */
-		if (!(uval & FUTEX_OWNER_DIED) && (uval & FUTEX_WAITERS))
+		if (!(uval & FUTEX_OWNER_DIED) && (uval & FUTEX_WAITERS)) {
 			schedule_preempt_disabled();
+			loop -= TP_SLEEP_DECREMENT;
+		}
 		__set_current_state(TASK_RUNNING);
 	}
 out:
@@ -3605,6 +3640,7 @@ out:
 	 * Cleanup futex state.
 	 */
 	WRITE_ONCE(state->owner, NULL);
+	WRITE_ONCE(state->handoff_pid, 0);
 	return ret;
 
 efault:
@@ -3728,6 +3764,7 @@ out:
 static int futex_unlock(u32 __user *uaddr, unsigned int flags)
 {
 	u32 uval, vpid = task_pid_vnr(current);
+	u32 newpid = 0;
 	union futex_key key = FUTEX_KEY_INIT;
 	struct futex_hash_bucket *hb;
 	struct futex_state *state = NULL;
@@ -3761,6 +3798,10 @@ static int futex_unlock(u32 __user *uaddr, unsigned int flags)
 		goto out_put_key;
 	}
 
+	newpid = READ_ONCE(state->handoff_pid);
+	if (newpid)
+		WRITE_ONCE(state->handoff_pid, 0);
+
 	owner = READ_ONCE(state->owner);
 	if (owner)
 		wake_q_add(&wake_q, owner);
@@ -3768,13 +3809,13 @@ static int futex_unlock(u32 __user *uaddr, unsigned int flags)
 	spin_unlock(&hb->fs_lock);
 
 	/*
-	 * Unlock the futex.
+	 * Unlock the futex or handoff to the next owner.
 	 * The flag bits are not preserved to encourage more lock stealing.
 	 */
 	for (;;) {
 		u32 old = uval;
 
-		if (cmpxchg_futex_value(&uval, uaddr, old, 0)) {
+		if (cmpxchg_futex_value(&uval, uaddr, old, newpid)) {
 			ret = -EFAULT;
 			break;
 		}
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v3 10/13] futex: Inform FUTEX_LOCK callers on how the lock is acquired
  2016-09-30 21:26 [PATCH v3 00/13] futex: Introducing throughput-optimized futexes Waiman Long
                   ` (8 preceding siblings ...)
  2016-09-30 21:26 ` [PATCH v3 09/13] futex: Implement lock handoff for TP futexes to prevent lock starvation Waiman Long
@ 2016-09-30 21:26 ` Waiman Long
  2016-09-30 21:26 ` [PATCH v3 11/13] futex: Add timeout support to TP futexes Waiman Long
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2016-09-30 21:26 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Arnaldo Carvalho de Melo,
	Davidlohr Bueso, Mike Galbraith, Jason Low, Scott J Norton,
	Douglas Hatch, Waiman Long

There are three different ways for a TP futex waiter in the kernel
to acquire the lock, so it may be useful to pass this information
out to give user space a better view of what is happening in the
kernel. With this change, different non-negative values will now be
returned depending on how the lock is acquired in the kernel.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 kernel/futex.c |   30 ++++++++++++++++++++++--------
 1 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index d260410..1219f32 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -3337,7 +3337,17 @@ void exit_robust_list(struct task_struct *curr)
 #define TP_SPIN_THRESHOLD       (1 << 14)
 #define TP_SLEEP_DECREMENT      (TP_SPIN_THRESHOLD/64)
 
-/**
+/*
+ * futex_lock() return values that identify how the lock was acquired:
+ * 0 - lock stolen
+ * 1 - lock acquired as the top waiter (mutex owner)
+ * 2 - lock handed off by the unlocker
+ */
+#define TP_LOCK_STOLEN		0
+#define TP_LOCK_ACQUIRED	1
+#define TP_LOCK_HANDOFF		2
+
+/**
  * lookup_futex_state - Looking up the futex state structure.
  * @hb:		 hash bucket
  * @key:	 futex key
@@ -3420,9 +3430,11 @@ static inline int put_futex_state_unlocked(struct futex_state *state)
  *   preserve the flag bits
  * endif
  *
- * Return: 1 if lock acquired;
+ * Return: TP_LOCK_ACQUIRED if lock acquired;
+ *	   TP_LOCK_HANDOFF if lock was handed off;
  *	   0 if lock acquisition failed;
  *	   -EFAULT if an error happened.
+ *	   *puval will contain the latest futex value when trylock fails.
  */
 static inline int futex_trylock(u32 __user *uaddr, u32 vpid, u32 *puval,
 				const bool waiter)
@@ -3435,7 +3447,7 @@ static inline int futex_trylock(u32 __user *uaddr, u32 vpid, u32 *puval,
 	*puval = uval;
 
 	if (waiter && (uval & FUTEX_TID_MASK) == vpid)
-		return 1;
+		return TP_LOCK_HANDOFF;
 
 	if (uval & FUTEX_TID_MASK)
 		return 0;	/* Trylock fails */
@@ -3446,7 +3458,7 @@ static inline int futex_trylock(u32 __user *uaddr, u32 vpid, u32 *puval,
 	if (unlikely(cmpxchg_futex_value(puval, uaddr, uval, vpid|flags)))
 		return -EFAULT;
 
-	return *puval == uval;
+	return (*puval == uval) ? TP_LOCK_ACQUIRED : 0;
 }
 
 /**
@@ -3659,8 +3671,10 @@ efault:
  * This function is not inlined so that it can show up separately in perf
  * profile for performance analysis purpose.
  *
- * Return: 0   - lock acquired
- *	   < 0 - an error happens
+ * Return: TP_LOCK_ACQUIRED - lock acquired normally
+ *	   TP_LOCK_HANDOFF  - lock handed off directly from unlocker
+ *	   TP_LOCK_STOLEN   - lock stolen
+ *	   < 0		    - an error happens
  */
 static noinline int futex_lock(u32 __user *uaddr, unsigned int flags)
 {
@@ -3676,7 +3690,7 @@ static noinline int futex_lock(u32 __user *uaddr, unsigned int flags)
 	 */
 	ret = futex_trylock(uaddr, vpid, &uval, false);
 	if (ret)
-		goto out;
+		return (ret < 0) ? ret : TP_LOCK_STOLEN;
 
 	/*
 	 * Detect deadlocks.
@@ -3750,7 +3764,7 @@ out_put_state_key:
 	put_futex_key(&key);
 
 out:
-	return (ret < 0) ? ret : 0;
+	return ret;
 }
 
 /*
-- 
1.7.1


* [PATCH v3 11/13] futex: Add timeout support to TP futexes
  2016-09-30 21:26 [PATCH v3 00/13] futex: Introducing throughput-optimized futexes Waiman Long
                   ` (9 preceding siblings ...)
  2016-09-30 21:26 ` [PATCH v3 10/13] futex: Inform FUTEX_LOCK callers on how the lock is acquired Waiman Long
@ 2016-09-30 21:26 ` Waiman Long
  2016-09-30 21:26 ` [PATCH v3 12/13] futex, doc: TP futexes document Waiman Long
  2016-09-30 21:26 ` [PATCH v3 13/13] perf bench: New microbenchmark for userspace mutex performance Waiman Long
  12 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2016-09-30 21:26 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Arnaldo Carvalho de Melo,
	Davidlohr Bueso, Mike Galbraith, Jason Low, Scott J Norton,
	Douglas Hatch, Waiman Long

Unlike the other futexes, TP futexes currently do not accept the
specification of a timeout value. To align with the other futexes,
timeout support is now added. However, the timeout isn't as precise
as for the other futex types because timer expiration can't be
detected while the thread is waiting in the serialization mutex.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 kernel/futex.c |   49 ++++++++++++++++++++++++++++++++++++++-----------
 1 files changed, 38 insertions(+), 11 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 1219f32..bc16eca 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -3491,9 +3491,10 @@ static inline int futex_set_waiters_bit(u32 __user *uaddr, u32 *puval)
 
 /**
  * futex_spin_on_owner - Optimistically spin on futex owner
- * @uaddr: futex address
- * @vpid:  PID of current task
- * @state: futex state object
+ * @uaddr:   futex address
+ * @vpid:    PID of current task
+ * @state:   futex state object
+ * @timeout: hrtimer_sleeper structure
  *
  * Spin on the futex word while the futex owner is active. Otherwise, set
  * the FUTEX_WAITERS bit and go to sleep.
@@ -3509,7 +3510,8 @@ static inline int futex_set_waiters_bit(u32 __user *uaddr, u32 *puval)
  * Return: 0 if futex acquired, < 0 if an error happens.
  */
 static int futex_spin_on_owner(u32 __user *uaddr, u32 vpid,
-			       struct futex_state *state)
+			       struct futex_state *state,
+			       struct hrtimer_sleeper *timeout)
 {
 	int ret, loop = TP_SPIN_THRESHOLD;
 	u32 uval;
@@ -3572,6 +3574,12 @@ static int futex_spin_on_owner(u32 __user *uaddr, u32 vpid,
 			continue;
 		}
 
+
+		if (timeout && !timeout->task) {
+			ret = -ETIMEDOUT;
+			break;
+		}
+
 		if (signal_pending(current)) {
 			ret = -EINTR;
 			break;
@@ -3671,13 +3679,21 @@ efault:
  * This function is not inlined so that it can show up separately in perf
  * profile for performance analysis purpose.
  *
+ * The timeout functionality isn't as precise as that of the other futex
+ * types as timer expiration will not be detected while waiting in the
+ * serialization mutex. One possible solution is to modify the expiration
+ * function to also send a signal to break the thread out of the mutex
+ * code if more precise timeouts are needed.
+ *
  * Return: TP_LOCK_ACQUIRED - lock acquired normally
  *	   TP_LOCK_HANDOFF  - lock handed off directly from unlocker
  *	   TP_LOCK_STOLEN   - lock stolen
  *	   < 0		    - an error happens
  */
-static noinline int futex_lock(u32 __user *uaddr, unsigned int flags)
+static noinline int
+futex_lock(u32 __user *uaddr, unsigned int flags, ktime_t *time)
 {
+	struct hrtimer_sleeper timeout, *to;
 	struct futex_hash_bucket *hb;
 	union futex_key key = FUTEX_KEY_INIT;
 	struct futex_state *state;
@@ -3702,6 +3718,8 @@ static noinline int futex_lock(u32 __user *uaddr, unsigned int flags)
 	if (refill_futex_state_cache())
 		return -ENOMEM;
 
+	to = futex_setup_timer(time, &timeout, flags, current->timer_slack_ns);
+
 	ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_WRITE);
 	if (unlikely(ret))
 		goto out;
@@ -3725,12 +3743,17 @@ static noinline int futex_lock(u32 __user *uaddr, unsigned int flags)
 	/*
 	 * Acquiring the serialization mutex.
 	 */
-	if (!state)
+	if (!state) {
 		ret = -ENOMEM;
-	else if (state->type != TYPE_TP)
+	} else if (state->type != TYPE_TP) {
 		ret = -EINVAL;
-	else
+	} else {
+		if (to)
+			hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
 		ret = mutex_lock_interruptible(&state->mutex);
+		if (to && !timeout.task)
+			ret = -ETIMEDOUT;
+	}
 
 	/*
 	 * If we got a signal or has some other error, we need to abort
@@ -3743,7 +3766,7 @@ static noinline int futex_lock(u32 __user *uaddr, unsigned int flags)
 	 * As the mutex owner, we can now spin on the futex word as well as
 	 * the active-ness of the futex owner.
 	 */
-	ret = futex_spin_on_owner(uaddr, vpid, state);
+	ret = futex_spin_on_owner(uaddr, vpid, state, to ? &timeout : NULL);
 
 	mutex_unlock(&state->mutex);
 
@@ -3764,6 +3787,10 @@ out_put_state_key:
 	put_futex_key(&key);
 
 out:
+	if (to) {
+		hrtimer_cancel(&to->timer);
+		destroy_hrtimer_on_stack(&to->timer);
+	}
 	return ret;
 }
 
@@ -3910,7 +3937,7 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 		return futex_requeue(uaddr, flags, uaddr2, val, val2, &val3, 1);
 #ifdef CONFIG_SMP
 	case FUTEX_LOCK:
-		return futex_lock(uaddr, flags);
+		return futex_lock(uaddr, flags, timeout);
 	case FUTEX_UNLOCK:
 		return futex_unlock(uaddr, flags);
 #endif
@@ -3929,7 +3956,7 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
 	int cmd = op & FUTEX_CMD_MASK;
 
 	if (utime && (cmd == FUTEX_WAIT || cmd == FUTEX_LOCK_PI ||
-		      cmd == FUTEX_WAIT_BITSET ||
+		      cmd == FUTEX_WAIT_BITSET || cmd == FUTEX_LOCK ||
 		      cmd == FUTEX_WAIT_REQUEUE_PI)) {
 		if (unlikely(should_fail_futex(!(op & FUTEX_PRIVATE_FLAG))))
 			return -EFAULT;
-- 
1.7.1


* [PATCH v3 12/13] futex, doc: TP futexes document
  2016-09-30 21:26 [PATCH v3 00/13] futex: Introducing throughput-optimized futexes Waiman Long
                   ` (10 preceding siblings ...)
  2016-09-30 21:26 ` [PATCH v3 11/13] futex: Add timeout support to TP futexes Waiman Long
@ 2016-09-30 21:26 ` Waiman Long
  2016-09-30 21:26 ` [PATCH v3 13/13] perf bench: New microbenchmark for userspace mutex performance Waiman Long
  12 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2016-09-30 21:26 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Arnaldo Carvalho de Melo,
	Davidlohr Bueso, Mike Galbraith, Jason Low, Scott J Norton,
	Douglas Hatch, Waiman Long

This patch adds a new document file on how to use the TP futexes.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 Documentation/00-INDEX     |    2 +
 Documentation/tp-futex.txt |  147 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 149 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/tp-futex.txt

diff --git a/Documentation/00-INDEX b/Documentation/00-INDEX
index cb9a6c6..28fba32 100644
--- a/Documentation/00-INDEX
+++ b/Documentation/00-INDEX
@@ -439,6 +439,8 @@ this_cpu_ops.txt
 	- List rationale behind and the way to use this_cpu operations.
 thermal/
 	- directory with information on managing thermal issues (CPU/temp)
+tp-futex.txt
+	- Documentation on lightweight throughput-optimized futexes.
 trace/
 	- directory with info on tracing technologies within linux
 unaligned-memory-access.txt
diff --git a/Documentation/tp-futex.txt b/Documentation/tp-futex.txt
new file mode 100644
index 0000000..3d8fe2a
--- /dev/null
+++ b/Documentation/tp-futex.txt
@@ -0,0 +1,147 @@
+Started by: Waiman Long <waiman.long@hpe.com>
+
+Throughput-Optimized Futexes
+----------------------------
+
+There are two main problems for a wait-wake futex (FUTEX_WAIT and
+FUTEX_WAKE) when used for creating user-space locking primitives:
+
+ 1) With a wait-wake futex, tasks waiting for a lock are put to sleep
+    in the futex queue to be woken up by the lock owner when it is done
+    with the lock. Waking up a sleeping task, however, introduces some
+    additional latency which can be large especially if the critical
+    section protected by the lock is relatively short. This may cause
+    a performance bottleneck on large systems with many CPUs running
+    applications that need a lot of inter-thread synchronization.
+
+ 2) The performance of the wait-wake futex is currently
+    spinlock-constrained.  When many threads are contending for a
+    futex in a large system with many CPUs, it is not unusual to have
+    spinlock contention accounting for more than 90% of the total
+    CPU cycles consumed at various points in time.
+
+These two problems can create performance bottlenecks with a
+futex-constrained workload, especially on systems with a large
+number of CPUs.
+
+The goal of the throughput-optimized (TP) futexes is to maximize
+locking throughput at the expense of fairness and deterministic
+latency. This is done by encouraging lock stealing and optimistic
+spinning on a locked futex when the futex owner is running.  This is
+the same optimistic spinning mechanism used by the kernel mutex and rw
+semaphore implementations to improve performance. Optimistic spinning
+is done without taking any lock.
+
+Lock stealing is known to be a performance enhancement technique as
+long as safeguards are in place to make sure that there will be no
+lock starvation.  The TP futexes have a built-in lock hand-off
+mechanism to prevent lock starvation. When the top lock waiter has
+had too many failed attempts to acquire the lock, it will initiate
+the hand-off mechanism by forcing the unlocker to transfer the lock
+to itself instead of freeing it. This limits the maximum latency
+a waiter has to wait.
+
+The downside of this improved throughput is the increased variance
+of the actual response times of the locking operations. Some locking
+operations will be very fast, while others may be considerably slower.
+The average response time should be better than that of the wait-wake futexes.
+
+Performance-wise, TP futexes should be faster than wait-wake futexes,
+especially if the futex lock holders do not sleep. For workloads
+that do a lot of sleeping within the critical sections, the TP
+futexes may not be faster than the wait-wake futexes.
+
+Implementation
+--------------
+
+Like the PI and robust futexes, a lock acquirer has to atomically
+put its thread ID (TID) into the lower 30 bits of the 32-bit futex
+word, which should have an initial value of 0. If it succeeds, it
+will be the owner of the futex. Otherwise, it has to call into the
+kernel using the new FUTEX_LOCK futex(2) syscall.
+
+  futex(uaddr, FUTEX_LOCK, 0, timeout, NULL, 0);
+
+Only the optional timeout parameter is being used by the new futex
+call.
+
+A kernel mutex is used for serialization. The top lock waiter that is
+the owner of the serialization mutex will try to acquire the lock when
+it is available.
+
+When the futex lock owner is no longer running, the top waiter will
+set the FUTEX_WAITERS bit before going to sleep. This is to make sure
+the futex owner will go into the kernel at unlock time to wake up the
+waiter.
+
+The expected return values of the above futex call are:
+ a) 0   - lock stolen as non-top waiter
+ b) 1   - lock acquired as the top waiter
+ c) 2   - lock explicitly handed off by the unlocker
+ d) < 0 - an error happens
+
+When it is time to unlock, the lock owner has to atomically change
+the futex value from its TID to 0. If that fails, it has to issue a
+FUTEX_UNLOCK futex(2) system call to wake up the top waiter.
+
+  futex(uaddr, FUTEX_UNLOCK, 0, NULL, NULL, 0);
+
+A return value of 1 from the FUTEX_UNLOCK futex(2) syscall
+indicates a task has been woken up. The syscall returns 0 if no
+sleeping task is woken. A negative value will be returned if an
+error happens.
+
+The error number returned by a FUTEX_UNLOCK call on an empty futex
+can be used to decide if the TP futex functionality is implemented
+in the kernel. If it is present, an EPERM error will be returned.
+Otherwise it will return ENOSYS.
+
+TP futexes require the kernel to have SMP support as well as support
+for the cmpxchg functionality. On architectures that don't support
+cmpxchg, TP futexes will not be supported either.
+
+The TP futexes are orthogonal to the robust futexes and can be combined
+without problem.
+
+Usage Scenario
+--------------
+
+A TP futex can be used to implement a user-space exclusive lock
+or mutex to guard a critical section that is unlikely to go to
+sleep. The waiters on a TP futex, however, will fall back to sleeping
+in a wait queue if the lock owner isn't running. Therefore, it can also
+be used when the critical section is long and prone to sleeping.
+However, it may not show a performance gain over a wait-wake futex
+in this case.
+
+The wait-wake futexes are more versatile as they can also be used
+to implement other locking primitives like condition variables or
+read/write locks. So the new TP futex type is not a replacement for
+the wait-wake futexes. Looking just at the mutex type of locks, TP
+futexes can be used as a replacement for the wait-wake futexes.
+
+Sample Code
+-----------
+
+The following is sample code implementing a simple mutex lock and
+unlock function pair.
+
+__thread int thread_id;
+
+void mutex_lock(int *faddr)
+{
+	if (cmpxchg(faddr, 0, thread_id) == 0)
+		return;
+	for (;;)
+		if (futex(faddr, FUTEX_LOCK, ...) >= 0)
+			break;
+}
+
+void mutex_unlock(int *faddr)
+{
+	/* Fast path: uncontended unlock */
+	if (cmpxchg(faddr, thread_id, 0) == thread_id)
+		return;
+	/* Slow path: wake up or hand off to the top waiter */
+	futex(faddr, FUTEX_UNLOCK, ...);
+}
-- 
1.7.1


* [PATCH v3 13/13] perf bench: New microbenchmark for userspace mutex performance
  2016-09-30 21:26 [PATCH v3 00/13] futex: Introducing throughput-optimized futexes Waiman Long
                   ` (11 preceding siblings ...)
  2016-09-30 21:26 ` [PATCH v3 12/13] futex, doc: TP futexes document Waiman Long
@ 2016-09-30 21:26 ` Waiman Long
  12 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2016-09-30 21:26 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Arnaldo Carvalho de Melo,
	Davidlohr Bueso, Mike Galbraith, Jason Low, Scott J Norton,
	Douglas Hatch, Waiman Long

This microbenchmark simulates how the use of different futex types
can affect the actual performance of userspace mutex locks. The
usage is:

        perf bench futex mutex <options>

Three sets of simple mutex lock and unlock functions are implemented
using the wait-wake, PI and TP futexes respectively. This
microbenchmark then runs the locking rate measurement tests using
either one of those mutexes or all of them consecutively.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 tools/perf/bench/Build         |    1 +
 tools/perf/bench/bench.h       |    1 +
 tools/perf/bench/futex-mutex.c |  558 ++++++++++++++++++++++++++++++++++++++++
 tools/perf/bench/futex.h       |   25 ++
 tools/perf/builtin-bench.c     |    4 +
 5 files changed, 589 insertions(+), 0 deletions(-)
 create mode 100644 tools/perf/bench/futex-mutex.c

diff --git a/tools/perf/bench/Build b/tools/perf/bench/Build
index 60bf119..dc6dfc1 100644
--- a/tools/perf/bench/Build
+++ b/tools/perf/bench/Build
@@ -6,6 +6,7 @@ perf-y += futex-wake.o
 perf-y += futex-wake-parallel.o
 perf-y += futex-requeue.o
 perf-y += futex-lock-pi.o
+perf-y += futex-mutex.o
 
 perf-$(CONFIG_X86_64) += mem-memcpy-x86-64-asm.o
 perf-$(CONFIG_X86_64) += mem-memset-x86-64-asm.o
diff --git a/tools/perf/bench/bench.h b/tools/perf/bench/bench.h
index 579a592..b0632df 100644
--- a/tools/perf/bench/bench.h
+++ b/tools/perf/bench/bench.h
@@ -36,6 +36,7 @@ int bench_futex_wake_parallel(int argc, const char **argv, const char *prefix);
 int bench_futex_requeue(int argc, const char **argv, const char *prefix);
 /* pi futexes */
 int bench_futex_lock_pi(int argc, const char **argv, const char *prefix);
+int bench_futex_mutex(int argc, const char **argv, const char *prefix);
 
 #define BENCH_FORMAT_DEFAULT_STR	"default"
 #define BENCH_FORMAT_DEFAULT		0
diff --git a/tools/perf/bench/futex-mutex.c b/tools/perf/bench/futex-mutex.c
new file mode 100644
index 0000000..d48e198
--- /dev/null
+++ b/tools/perf/bench/futex-mutex.c
@@ -0,0 +1,558 @@
+/*
+ * Copyright (C) 2016 Waiman Long
+ *
+ * This microbenchmark simulates how the use of different futex types can
+ * affect the actual performance of userspace locking primitives like mutexes.
+ *
+ * The raw throughput of the futex lock and unlock calls is not a good
+ * indication of actual throughput of the mutex code as it may not really
+ * need to call into the kernel. Therefore, 3 simple mutex lock and unlock
+ * functions are written to implenment a mutex lock using the wait-wake,
+ * PI and TP futexes respectively. These functions serve as the basis for
+ * measuring the locking throughput.
+ */
+
+#include <pthread.h>
+
+#include <signal.h>
+#include <string.h>
+#include "../util/stat.h"
+#include "../perf-sys.h"
+#include <subcmd/parse-options.h>
+#include <linux/compiler.h>
+#include <linux/kernel.h>
+#include <errno.h>
+#include "bench.h"
+#include "futex.h"
+
+#include <err.h>
+#include <stdlib.h>
+#include <sys/time.h>
+
+#define	gettid()		syscall(SYS_gettid)
+#define __cacheline_aligned	__attribute__((__aligned__(64)))
+
+typedef u32 futex_t;
+typedef void (*mutex_lock_fn_t)(futex_t *futex, int tid);
+typedef void (*mutex_unlock_fn_t)(futex_t *futex, int tid);
+
+
+struct worker {
+	futex_t *futex;
+	pthread_t thread;
+
+	/*
+	 * Per-thread operation statistics
+	 */
+	unsigned int ops;	/* # of locking operations	*/
+	unsigned int locks;	/* # of lock futex calls	*/
+	unsigned int unlocks;	/* # of unlock futex calls	*/
+	unsigned int eagains;	/* # of EAGAIN errors		*/
+	unsigned int lockerrs;	/* # of other lock errors	*/
+	unsigned int unlockerrs;/* # of unlock errors		*/
+	unsigned int wakeups;	/* # of wakeups (unlock return)	*/
+	unsigned int handoffs;	/* # of lock handoff (TP only)	*/
+	unsigned int steals;	/* # of lock steals (TP only)	*/
+} __cacheline_aligned;
+
+/*
+ * Global cache-aligned futex
+ */
+static futex_t global_futex __cacheline_aligned;
+
+static __thread futex_t thread_id;	/* Thread ID */
+static __thread int counter;		/* Sleep counter */
+
+static struct worker *worker;
+static unsigned int nsecs = 10;
+static bool verbose, done, fshared, exit_now;
+static unsigned int ncpus, nthreads;
+static int futex_flag;
+static const char *ftype = "all";
+static int csload = 1;
+static int wratio;
+struct timeval start, end, runtime;
+static unsigned int worker_start;
+static unsigned int threads_starting;
+static struct stats throughput_stats;
+static mutex_lock_fn_t mutex_lock_fn;
+static mutex_unlock_fn_t mutex_unlock_fn;
+
+/*
+ * CS - Lock critical section
+ */
+static const struct option options[] = {
+	OPT_STRING  ('f', "futex-type",	&ftype, "type", "Specify futex type: WW, PI, TP, all (default)"),
+	OPT_INTEGER ('l', "load",	&csload,   "Specify # of cpu_relax's inside CS, default = 1"),
+	OPT_UINTEGER('t', "threads",	&nthreads, "Specify number of threads, default = # of CPUs"),
+	OPT_UINTEGER('r', "runtime",	&nsecs,    "Specify runtime (in seconds, default = 10s)"),
+	OPT_BOOLEAN ('S', "shared",	&fshared,  "Use shared futexes instead of private ones"),
+	OPT_BOOLEAN ('v', "verbose",	&verbose,  "Verbose mode: display thread-level details"),
+	OPT_INTEGER ('w', "wait-ratio", &wratio,   "Specify <n>/1024 of CS is 1us sleep, default = 0"),
+	OPT_END()
+};
+
+static const char * const bench_futex_mutex_usage[] = {
+	"perf bench futex mutex <options>",
+	NULL
+};
+
+/**
+ * futex_cmpxchg() - atomic compare and exchange
+ * @uaddr:	The address of the futex to be modified
+ * @oldval:	The expected value of the futex
+ * @newval:	The new value to try and assign the futex
+ *
+ * Implement cmpxchg using gcc atomic builtins.
+ *
+ * Return: the old futex value.
+ */
+static inline futex_t futex_cmpxchg(futex_t *uaddr, futex_t old, futex_t new)
+{
+	return __sync_val_compare_and_swap(uaddr, old, new);
+}
+
+/**
+ * atomic_dec_return - atomically decrement & return the new value
+ * @uaddr:	The address of the futex to be decremented
+ * Return:	The new value
+ */
+static inline int atomic_dec_return(futex_t *uaddr)
+{
+	return __sync_sub_and_fetch(uaddr, 1);
+}
+
+/**
+ * atomic_inc_return - atomically increment & return the new value
+ * @uaddr:	The address of the futex to be incremented
+ * Return:	The new value
+ */
+static inline int atomic_inc_return(futex_t *uaddr)
+{
+	return __sync_add_and_fetch(uaddr, 1);
+}
+
+/*
+ * Wait-wake futex lock/unlock functions
+ * futex value: 0 - unlocked
+ *		1 - locked
+ *		2 - locked with waiters (contended)
+ */
+static void ww_mutex_lock(futex_t *futex, int tid)
+{
+	futex_t old, val = *futex;
+	int ret;
+
+	for (;;) {
+		if (!val) {
+			val = futex_cmpxchg(futex, 0, 1);
+			if (val == 0)
+				return;
+		}
+		if (val != 2) {
+			/*
+			 * Force value to 2 to indicate waiter
+			 */
+			old = val;
+			val = futex_cmpxchg(futex, old, 2);
+			if (val == old)
+				val = 2;
+			else
+				continue;
+		}
+		break;
+	}
+
+	for (;;) {
+		ret = futex_wait(futex, 2, NULL, futex_flag);
+		worker[tid].locks++;
+		if (ret < 0) {
+			if (errno == EAGAIN)
+				worker[tid].eagains++;
+			else
+				worker[tid].lockerrs++;
+		}
+
+		val = *futex;
+		if (val == 2)
+			continue;
+		for (;;) {
+			old = val;
+			val = futex_cmpxchg(futex, old, 2);
+			if (old == val)
+				break;
+		}
+		if (val == 0)
+			break;	/* We got the lock */
+	}
+}
+
+static void ww_mutex_unlock(futex_t *futex, int tid)
+{
+	futex_t old, val;
+	int ret;
+
+	val = *futex;
+	do {
+		old = val;
+		val = futex_cmpxchg(futex, old, 0);
+	} while (val != old);
+
+	switch (val) {
+	default:
+	case 1: /* No waiter */
+		break;
+
+	case 2:
+		worker[tid].unlocks++;
+		ret = futex_wake(futex, 1, futex_flag);
+		if (ret < 0)
+			worker[tid].unlockerrs++;
+		else
+			worker[tid].wakeups += ret;
+		break;
+	}
+}
+
+/*
+ * PI futex lock/unlock functions
+ */
+static void pi_mutex_lock(futex_t *futex, int tid)
+{
+	futex_t val;
+	int ret;
+
+	val = futex_cmpxchg(futex, 0, thread_id);
+	if (val == 0)
+		return;
+
+	/*
+	 * Retry if an error happens
+	 */
+	for (;;) {
+		ret = futex_lock_pi(futex, NULL, futex_flag);
+		worker[tid].locks++;
+		if (ret >= 0)
+			break;
+		worker[tid].lockerrs++;
+	}
+}
+
+static void pi_mutex_unlock(futex_t *futex, int tid)
+{
+	futex_t val;
+	int ret;
+
+	val = futex_cmpxchg(futex, thread_id, 0);
+	if (val == thread_id)
+		return;
+
+	ret = futex_unlock_pi(futex, futex_flag);
+	if (ret < 0)
+		worker[tid].unlockerrs++;
+	else
+		worker[tid].wakeups += ret;
+	worker[tid].unlocks++;
+}
+
+/*
+ * TP futex lock/unlock functions
+ */
+static void tp_mutex_lock(futex_t *futex, int tid)
+{
+	futex_t val;
+	int ret;
+
+	val = futex_cmpxchg(futex, 0, thread_id);
+	if (val == 0)
+		return;
+
+	/*
+	 * Retry if an error happens
+	 */
+	for (;;) {
+		ret = futex_lock(futex, NULL, futex_flag);
+		worker[tid].locks++;
+		if (ret >= 0)
+			break;
+		worker[tid].lockerrs++;
+	}
+	/*
+	 * Check locking method
+	 */
+	if (!ret)
+		worker[tid].steals++;
+	else if (ret == 2)
+		worker[tid].handoffs++;
+}
+
+static void tp_mutex_unlock(futex_t *futex, int tid)
+{
+	futex_t val;
+	int ret;
+
+	val = futex_cmpxchg(futex, thread_id, 0);
+	if (val == thread_id)
+		return;
+
+	ret = futex_unlock(futex, futex_flag);
+	if (ret < 0)
+		worker[tid].unlockerrs++;
+	else
+		worker[tid].wakeups += ret;
+	worker[tid].unlocks++;
+}
+
+/*
+ * Load function
+ */
+static inline void load(int tid)
+{
+	int n = csload;
+
+	/*
+	 * Optionally do a 1us sleep instead if wratio is set and the
+	 * current counter value falls within the wratio bound.
+	 */
+	if (wratio && (((counter++ + tid) & 0x3ff) < wratio)) {
+		usleep(1);
+		return;
+	}
+
+	while (n-- > 0)
+		cpu_relax();
+}
+
+/****************************************************************************/
+
+static void toggle_done(int sig __maybe_unused,
+			siginfo_t *info __maybe_unused,
+			void *uc __maybe_unused)
+{
+	/* inform all threads that we're done for the day */
+	done = true;
+	gettimeofday(&end, NULL);
+	timersub(&end, &start, &runtime);
+	if (sig)
+		exit_now = true;
+}
+
+static void *workerfn(void *arg)
+{
+	long tid = (long)arg;
+	struct worker *w = &worker[tid];
+	mutex_lock_fn_t lock_fn = mutex_lock_fn;
+	mutex_unlock_fn_t unlock_fn = mutex_unlock_fn;
+
+	thread_id = gettid();
+	counter = 0;
+
+	atomic_dec_return(&threads_starting);
+
+	/*
+	 * Busy wait until asked to start
+	 */
+	while (!worker_start)
+		cpu_relax();
+
+	do {
+		lock_fn(w->futex, tid);
+		load(tid);
+		unlock_fn(w->futex, tid);
+		w->ops++;	/* One more locking operation */
+		cpu_relax();
+	}  while (!done);
+
+	return NULL;
+}
+
+static void create_threads(struct worker *w, pthread_attr_t *thread_attr,
+			   long tid)
+{
+	cpu_set_t cpu;
+
+	/*
+	 * Bind each thread to a CPU
+	 */
+	CPU_ZERO(&cpu);
+	CPU_SET(tid % ncpus, &cpu);
+	w->futex = &global_futex;
+
+	if (pthread_attr_setaffinity_np(thread_attr, sizeof(cpu_set_t), &cpu))
+		err(EXIT_FAILURE, "pthread_attr_setaffinity_np");
+
+	if (pthread_create(&w->thread, thread_attr, workerfn, (void *)tid))
+		err(EXIT_FAILURE, "pthread_create");
+}
+
+static void futex_mutex_test(const char *futex_type)
+{
+	u64 us;
+	unsigned int i;
+	struct worker total;
+	double avg, stddev;
+	pthread_attr_t thread_attr;
+
+	if (exit_now)
+		return;
+
+	if (!strcasecmp(futex_type, "WW")) {
+		futex_type = "WW";
+		mutex_lock_fn = ww_mutex_lock;
+		mutex_unlock_fn = ww_mutex_unlock;
+	} else if (!strcasecmp(futex_type, "PI")) {
+		futex_type = "PI";
+		mutex_lock_fn = pi_mutex_lock;
+		mutex_unlock_fn = pi_mutex_unlock;
+	} else if (!strcasecmp(futex_type, "TP")) {
+		futex_type = "TP";
+		mutex_lock_fn = tp_mutex_lock;
+		mutex_unlock_fn = tp_mutex_unlock;
+
+		/*
+		 * Check if TP futex is supported.
+		 */
+		errno = 0;
+		futex_unlock(&global_futex, 0);
+		if (errno == ENOSYS) {
+			fprintf(stderr, "\nTP futexes are not supported by the kernel!\n");
+			return;
+		}
+	} else {
+		fprintf(stderr, "Unknown futex type '%s'!\n", futex_type);
+		exit(1);
+	}
+
+	printf("\n=====================================\n");
+	printf("Run summary [PID %d]: %d threads doing %s futex locking ops for %d secs.\n\n",
+	       getpid(), nthreads, futex_type, nsecs);
+
+	init_stats(&throughput_stats);
+
+	global_futex = 0;
+	done = false;
+	threads_starting = nthreads;
+	pthread_attr_init(&thread_attr);
+
+	for (i = 0; i < nthreads; i++)
+		create_threads(&worker[i], &thread_attr, i);
+	pthread_attr_destroy(&thread_attr);
+
+	while (threads_starting)
+		usleep(1);
+
+	gettimeofday(&start, NULL);
+
+	/*
+	 * Start the test
+	 *
+	 * Unlike the other futex benchmarks, this one uses busy waiting
+	 * instead of pthread APIs to make sure that all the threads (except
+	 * the one that shares CPU with the parent) will start more or less
+	 * simultaneously.
+	 */
+	atomic_inc_return(&worker_start);
+	sleep(nsecs);
+	toggle_done(0, NULL, NULL);
+
+	for (i = 0; i < nthreads; i++) {
+		int ret = pthread_join(worker[i].thread, NULL);
+
+		if (ret)
+			err(EXIT_FAILURE, "pthread_join");
+	}
+
+	us = runtime.tv_sec * 1000000 + runtime.tv_usec;
+	memset(&total, 0, sizeof(total));
+	for (i = 0; i < nthreads; i++) {
+		/*
+		 * Get a rounded estimate of the # of locking ops/sec.
+		 */
+		u64 tp = ((u64)worker[i].ops) * 1000000 / us;
+
+		total.ops        += worker[i].ops;
+		total.locks      += worker[i].locks;
+		total.unlocks    += worker[i].unlocks;
+		total.wakeups    += worker[i].wakeups;
+		total.eagains    += worker[i].eagains;
+		total.lockerrs   += worker[i].lockerrs;
+		total.unlockerrs += worker[i].unlockerrs;
+		total.handoffs   += worker[i].handoffs;
+		total.steals     += worker[i].steals;
+
+		update_stats(&throughput_stats, tp);
+		if (verbose)
+			printf("[thread %3d] futex: %p [ %ld ops/sec ]\n",
+			       i, worker[i].futex, (long)tp);
+	}
+
+	avg    = avg_stats(&throughput_stats);
+	stddev = stddev_stats(&throughput_stats);
+
+	printf("Locking statistics:\n");
+	printf("Test run time      = %'.2f s\n", (double)us/1000000);
+	printf("Total locking ops  = %'d\n", total.ops);
+	printf("Lock futex calls   = %'d (%.1f%%)\n", total.locks,
+		(double)total.locks*100/total.ops);
+	printf("Unlock futex calls = %'d (%.1f%%)\n", total.unlocks,
+		(double)total.unlocks*100/total.ops);
+	if (total.wakeups)
+		printf("Process wakeups    = %'d\n", total.wakeups);
+	if (total.eagains)
+		printf("EAGAIN lock errors = %'d\n", total.eagains);
+	if (total.lockerrs)
+		printf("Other lock errors  = %'d\n", total.lockerrs);
+	if (total.unlockerrs)
+		printf("Unlock errors      = %'d\n", total.unlockerrs);
+	if (total.handoffs)
+		printf("Lock handoffs      = %'d\n", total.handoffs);
+	if (total.steals)
+		printf("Lock steals        = %'d\n", total.steals);
+
+	printf("\nPer-thread Locking Rates:\n");
+	printf("Avg = %'d ops/sec (+- %.2f%%)\n", (int)(avg + 0.5),
+		rel_stddev_stats(stddev, avg));
+	printf("Min = %'d ops/sec\n", (int)throughput_stats.min);
+	printf("Max = %'d ops/sec\n", (int)throughput_stats.max);
+
+	/* Clear the workers area */
+	memset(worker, 0, sizeof(*worker) * nthreads);
+}
+
+int bench_futex_mutex(int argc, const char **argv,
+		      const char *prefix __maybe_unused)
+{
+	struct sigaction act;
+
+	argc = parse_options(argc, argv, options, bench_futex_mutex_usage, 0);
+	if (argc)
+		goto err;
+
+	ncpus = sysconf(_SC_NPROCESSORS_ONLN);
+
+	sigfillset(&act.sa_mask);
+	act.sa_sigaction = toggle_done;
+	sigaction(SIGINT, &act, NULL);
+
+	if (!nthreads)
+		nthreads = ncpus;
+
+	worker = calloc(nthreads, sizeof(*worker));
+	if (!worker)
+		err(EXIT_FAILURE, "calloc");
+
+	if (!fshared)
+		futex_flag = FUTEX_PRIVATE_FLAG;
+
+	if (!strcmp(ftype, "all")) {
+		futex_mutex_test("WW");
+		futex_mutex_test("PI");
+		futex_mutex_test("TP");
+	} else {
+		futex_mutex_test(ftype);
+	}
+	free(worker);
+	return 0;
+err:
+	usage_with_options(bench_futex_mutex_usage, options);
+	exit(EXIT_FAILURE);
+}
diff --git a/tools/perf/bench/futex.h b/tools/perf/bench/futex.h
index b2e06d1..bd3f0df 100644
--- a/tools/perf/bench/futex.h
+++ b/tools/perf/bench/futex.h
@@ -86,6 +86,31 @@ futex_cmp_requeue(u_int32_t *uaddr, u_int32_t val, u_int32_t *uaddr2, int nr_wak
 		 val, opflags);
 }
 
+#ifndef FUTEX_LOCK
+#define FUTEX_LOCK	13
+#endif
+#ifndef FUTEX_UNLOCK
+#define FUTEX_UNLOCK	14
+#endif
+
+/**
+ * futex_lock() - lock the TP futex
+ */
+static inline int
+futex_lock(u_int32_t *uaddr, struct timespec *timeout, int opflags)
+{
+	return futex(uaddr, FUTEX_LOCK, 0, timeout, NULL, 0, opflags);
+}
+
+/**
+ * futex_unlock() - unlock the TP futex
+ */
+static inline int
+futex_unlock(u_int32_t *uaddr, int opflags)
+{
+	return futex(uaddr, FUTEX_UNLOCK, 0, NULL, NULL, 0, opflags);
+}
+
 #ifndef HAVE_PTHREAD_ATTR_SETAFFINITY_NP
 #include <pthread.h>
 static inline int pthread_attr_setaffinity_np(pthread_attr_t *attr,
diff --git a/tools/perf/builtin-bench.c b/tools/perf/builtin-bench.c
index a1cddc6..3a3a3af 100644
--- a/tools/perf/builtin-bench.c
+++ b/tools/perf/builtin-bench.c
@@ -20,6 +20,7 @@
 #include "builtin.h"
 #include "bench/bench.h"
 
+#include <locale.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
@@ -62,6 +63,7 @@ static struct bench futex_benchmarks[] = {
 	{ "requeue",	"Benchmark for futex requeue calls",            bench_futex_requeue	},
 	/* pi-futexes */
 	{ "lock-pi",	"Benchmark for futex lock_pi calls",            bench_futex_lock_pi	},
+	{ "mutex",	"Benchmark for mutex locks using futexes",	bench_futex_mutex	},
 	{ "all",	"Run all futex benchmarks",			NULL			},
 	{ NULL,		NULL,						NULL			}
 };
@@ -222,6 +224,8 @@ int cmd_bench(int argc, const char **argv, const char *prefix __maybe_unused)
 		goto end;
 	}
 
+	setlocale(LC_NUMERIC, "en_US");	/* Allow better number formatting */
+
 	argc = parse_options(argc, argv, bench_options, bench_usage,
 			     PARSE_OPT_STOP_AT_NON_OPTION);
 
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 07/13] futex: Throughput-optimized (TP) futexes
  2016-09-30 21:26 ` [PATCH v3 07/13] futex: Throughput-optimized (TP) futexes Waiman Long
@ 2016-10-01  6:47   ` Thomas Gleixner
  2016-10-02  1:12     ` Waiman Long
  0 siblings, 1 reply; 16+ messages in thread
From: Thomas Gleixner @ 2016-10-01  6:47 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Peter Zijlstra, Jonathan Corbet, linux-kernel,
	linux-doc, Arnaldo Carvalho de Melo, Davidlohr Bueso,
	Mike Galbraith, Jason Low, Scott J Norton, Douglas Hatch

On Fri, 30 Sep 2016, Waiman Long wrote:
> +	WRITE_ONCE(state->owner, current);
> +	preempt_disable();
> +	for (;;) {
> +		ret = futex_trylock(uaddr, vpid, &uval, true);

Did you actually read what I said? You CANNOT access userspace in a preempt
disabled region without disabling pagefaults and handle the resulting
wreckage yourself.

End of review.

       tglx


* Re: [PATCH v3 07/13] futex: Throughput-optimized (TP) futexes
  2016-10-01  6:47   ` Thomas Gleixner
@ 2016-10-02  1:12     ` Waiman Long
  0 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2016-10-02  1:12 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, Peter Zijlstra, Jonathan Corbet, linux-kernel,
	linux-doc, Arnaldo Carvalho de Melo, Davidlohr Bueso,
	Mike Galbraith, Jason Low, Scott J Norton, Douglas Hatch

On 10/01/2016 02:47 AM, Thomas Gleixner wrote:
> On Fri, 30 Sep 2016, Waiman Long wrote:
>> +	WRITE_ONCE(state->owner, current);
>> +	preempt_disable();
>> +	for (;;) {
>> +		ret = futex_trylock(uaddr, vpid,&uval, true);
> Did you actually read what I said? You CANNOT access userspace in a preempt
> disabled region without disabling pagefaults and handle the resulting
> wreckage yourself.

I think I had missed that comment. My bad:-(

I will fix that with the code changes below. I will also double-check 
your comments again to see if I missed any others.

Cheers,
Longman


---------------------------[ cut here ]-----------------------------------------

diff --git a/kernel/futex.c b/kernel/futex.c
index bc16eca..132a36d 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -3520,6 +3520,13 @@ static int futex_spin_on_owner(u32 __user *uaddr, u32 vpid,
         bool on_owner_pi_list = false;

         WRITE_ONCE(state->owner, current);
+retry:
+       /*
+        * The preempt_disable() has similar effect as pagefault_disable().
+        * As a result, we will have to disable page fault as well and handle
+        * the case of faulting in the futex word.
+        */
+       pagefault_disable();
         preempt_disable();
         for (;; loop--) {
                 ret = futex_trylock(uaddr, vpid, &uval, true);
@@ -3648,6 +3655,14 @@ static int futex_spin_on_owner(u32 __user *uaddr, u32 vpid,
         }
  out:
         preempt_enable();
+       pagefault_enable();
+
+       if (ret == -EFAULT) {
+               ret = fault_in_user_writeable(uaddr);
+               if (!ret)
+                       goto retry;
+       }
+
         if (owner_task) {
                 if (on_owner_pi_list)
                         task_pi_list_del(owner_task, state, false);



Thread overview: 16+ messages
2016-09-30 21:26 [PATCH v3 00/13] futex: Introducing throughput-optimized futexes Waiman Long
2016-09-30 21:26 ` [PATCH v3 01/13] futex: Consolidate duplicated timer setup code Waiman Long
2016-09-30 21:26 ` [PATCH v3 02/13] futex: Rename futex_pi_state to futex_state Waiman Long
2016-09-30 21:26 ` [PATCH v3 03/13] futex: Add helpers to get & cmpxchg futex value without lock Waiman Long
2016-09-30 21:26 ` [PATCH v3 04/13] futex: Consolidate pure pi_state_list add & delete codes to helpers Waiman Long
2016-09-30 21:26 ` [PATCH v3 05/13] futex: Add a new futex type field into futex_state Waiman Long
2016-09-30 21:26 ` [PATCH v3 06/13] futex: Allow direct attachment of futex_state objects to hash bucket Waiman Long
2016-09-30 21:26 ` [PATCH v3 07/13] futex: Throughput-optimized (TP) futexes Waiman Long
2016-10-01  6:47   ` Thomas Gleixner
2016-10-02  1:12     ` Waiman Long
2016-09-30 21:26 ` [PATCH v3 08/13] futex: Enable robust handling of TP futexes Waiman Long
2016-09-30 21:26 ` [PATCH v3 09/13] futex: Implement lock handoff for TP futexes to prevent lock starvation Waiman Long
2016-09-30 21:26 ` [PATCH v3 10/13] futex: Inform FUTEX_LOCK callers on how the lock is acquired Waiman Long
2016-09-30 21:26 ` [PATCH v3 11/13] futex: Add timeout support to TP futexes Waiman Long
2016-09-30 21:26 ` [PATCH v3 12/13] futex, doc: TP futexes document Waiman Long
2016-09-30 21:26 ` [PATCH v3 13/13] perf bench: New microbenchmark for userspace mutex performance Waiman Long
