linux-kernel.vger.kernel.org archive mirror
* [PATCH v4 00/20] futex: Introducing throughput-optimized (TP) futexes
@ 2016-12-29 16:13 Waiman Long
  2016-12-29 16:13 ` [PATCH v4 01/20] futex: Consolidate duplicated timer setup code Waiman Long
                   ` (19 more replies)
  0 siblings, 20 replies; 28+ messages in thread
From: Waiman Long @ 2016-12-29 16:13 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Arnaldo Carvalho de Melo,
	Davidlohr Bueso, Mike Galbraith, Scott J Norton, Waiman Long

 v3->v4:
  - Properly handle EFAULT error due to page fault.
  - Extend the use cases of TP futexes to userspace rwlock.
  - Change the lock handoff trigger from the number of spins to
    elapsed time (5ms).
  - Various updates to the "perf bench futex mutex" microbenchmark.
  - Add a new "perf bench futex rwlock" microbenchmark for measuring
    rwlock performance.
  - Add mutex_owner to futex_state object to hold the serialization
    mutex owner.
  - Streamline a number of helper functions and other miscellaneous
    coding improvements.
  - Rebase the patchset to the 4.10 kernel.

 v2->v3:
  - Use the abbreviation TP for the new futexes instead of TO.
  - Make a number of changes according to review comments from
    ThomasG, PeterZ and MikeG.
  - Break the main futex patch into smaller pieces to make them easier
    to review.

 v1->v2:
  - Add an explicit lock hand-off mechanism.
  - Add timeout support.
  - Simplify the required userspace code.
  - Fix a number of problems in the v1 code.

This patchset introduces a new futex implementation called
throughput-optimized (TP) futexes. They are similar to PI futexes in
their calling convention, but provide better throughput than the
wait-wake (WW) futexes by encouraging lock stealing and optimistic
spinning.

The new TP futexes can be used to implement both userspace mutexes
and rwlocks. They provide better performance while also simplifying
the userspace locking implementation. The WW futexes are still needed
to implement other synchronization primitives, such as condition
variables and semaphores, that cannot be handled by the TP futexes.
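To illustrate the calling convention, here is a minimal, hypothetical
sketch of the userspace mutex fast path: the futex word holds the
owner's TID (0 when unlocked), and lock stealing is possible because
any task may cmpxchg its TID into an unlocked word at any time. The
FUTEX_LOCK/FUTEX_UNLOCK operations mentioned in the comments are the
ones proposed by this patchset and are not in mainline kernels; all
names here are illustrative, not the patchset's actual code.

```c
#include <stdatomic.h>
#include <stdbool.h>

static _Atomic unsigned int futex_word;	/* 0 = unlocked, else owner TID */

static bool tp_mutex_trylock(unsigned int tid)
{
	unsigned int expected = 0;

	/*
	 * Fast path: cmpxchg our TID into an unlocked (0) futex word.
	 * Any task can do this at any time, which is what makes lock
	 * stealing possible.
	 */
	return atomic_compare_exchange_strong(&futex_word, &expected, tid);
}

static int tp_mutex_lock(unsigned int tid)
{
	if (tp_mutex_trylock(tid))
		return 0;
	/*
	 * Contended slow path: a real implementation would issue the
	 * patchset's proposed FUTEX_LOCK futex(2) op here and let the
	 * kernel optimistically spin on the owner before sleeping.
	 * Since FUTEX_LOCK is not in mainline kernels, this sketch
	 * just reports contention instead of making the syscall.
	 */
	return -1;
}

static void tp_mutex_unlock(unsigned int tid)
{
	unsigned int expected = tid;

	/*
	 * Fast path: release by cmpxchg'ing our TID back to 0.  If the
	 * cmpxchg fails (e.g. waiter bits are set), a real
	 * implementation would use the proposed FUTEX_UNLOCK op to
	 * wake or hand off to a waiter.
	 */
	atomic_compare_exchange_strong(&futex_word, &expected, 0);
}
```

Under contention a real implementation would make the futex(2) call
instead of returning -1; the kernel side then handles the spinning,
sleeping and handoff.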

Another advantage of the TP futexes is their built-in lock handoff
mechanism, which prevents lock starvation as long as the underlying
kernel mutex is itself free of lock starvation.

Patches 1-6 are preparatory patches that pave the way to implement
the TP futexes.

Patch 7 implements the basic TP futex that can support userspace
mutexes.

Patch 8 adds robust handling to the TP futexes to deal with the death
of TP futex exclusive lock owners.

Patch 9 adds a lock hand-off mechanism that prevents lock starvation
while introducing minimal runtime overhead.

Patch 10 enables the FUTEX_LOCK futex(2) syscall to return status
information, such as how the lock is acquired and how many times
the task needs to sleep in the spinning loop, that can be used by
userspace utilities to monitor how the TP futexes are performing.

Patch 11 enables userspace applications to supply a timeout value to
abort the lock acquisition attempt after the specified time period
has passed.

Patch 12 adds a new document tp-futex.txt in the Documentation
directory to describe the new TP futexes.

Patch 13 adds a microbenchmark to the "perf bench futex" suite to
measure the performance of the WW, PI and TP futexes when implementing
a userspace mutex.

Patch 14 extends the TP futexes to support userspace rwlocks.
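As a rough illustration of how a single futex word can support shared
locking, here is a hypothetical userspace fast-path sketch. The word
layout (bit 31 for the writer, low bits as a reader count) is an
assumption chosen for illustration only and is NOT the layout used by
the patchset; the contended paths, which would go through the
patchset's proposed shared/exclusive futex(2) lock and unlock ops, are
stubbed out in comments.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative layout only: bit 31 = writer locked, bits 0-30 = readers. */
#define RW_WRITER	(1u << 31)

static _Atomic unsigned int rw_word;

static bool rw_read_trylock(void)
{
	unsigned int old = atomic_load(&rw_word);

	/* Readers can steal the lock as long as no writer holds it. */
	while (!(old & RW_WRITER)) {
		if (atomic_compare_exchange_weak(&rw_word, &old, old + 1))
			return true;
	}
	/* Writer-owned: would fall back to the proposed shared-lock op. */
	return false;
}

static void rw_read_unlock(void)
{
	/* If we were the last reader, a real implementation would wake
	 * waiting writers via the kernel unlock path. */
	atomic_fetch_sub(&rw_word, 1);
}

static bool rw_write_trylock(void)
{
	unsigned int expected = 0;

	/* A writer needs the word fully idle: no owner, no readers. */
	return atomic_compare_exchange_strong(&rw_word, &expected, RW_WRITER);
}

static void rw_write_unlock(void)
{
	unsigned int expected = RW_WRITER;

	/* On failure (e.g. waiter bits set) a real implementation would
	 * go through the kernel unlock/handoff path instead. */
	atomic_compare_exchange_strong(&rw_word, &expected, 0);
}
```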

Patch 15 enables more reader lock stealing of TP futexes in the kernel.

Patch 16 groups readers together as a spin group to enhance reader
throughput.

Patch 17 updates the tp-futex.txt file to add information about
rwlock support.

Patch 18 adds another microbenchmark to the "perf bench futex" suite
to measure the performance of the WW futexes, TP futexes and Glibc
when implementing a userspace rwlock.

Patch 19 updates the wake_up_q() function to return the number of
woken tasks.

Patch 20 enables the dumping of internal futex state information
via debugfs.

All the benchmark results shown in the change logs were produced by
the microbenchmarks included in this patchset, so everyone can run
them to see how the TP futexes perform and behave on their own test
systems.

Once this patchset is finalized, an updated manpage patch to document
the new TP futexes will be sent out. The next step will then be to
make Glibc NPTL use the new TP futexes.

Performance highlights in terms of average locking rates (ops/sec)
on a 2-socket system are as follows:

                WW futex       TP futex      Glibc
                --------       --------      -----
  mutex         121,129        152,242         -
  rwlock         99,149        162,887       30,362

Patches 7 and 16 contain more detailed information about the
performance characteristics of the TP futexes when implementing
userspace mutexes and rwlocks respectively, compared with other
possible ways of doing so via the wait-wake futexes.

Waiman Long (20):
  futex: Consolidate duplicated timer setup code
  futex: Rename futex_pi_state to futex_state
  futex: Add helpers to get & cmpxchg futex value without lock
  futex: Consolidate pure pi_state_list add & delete codes to helpers
  futex: Add a new futex type field into futex_state
  futex: Allow direct attachment of futex_state objects to hash bucket
  futex: Introduce throughput-optimized (TP) futexes
  TP-futex: Enable robust handling
  TP-futex: Implement lock handoff to prevent lock starvation
  TP-futex: Return status code on FUTEX_LOCK calls
  TP-futex: Add timeout support
  TP-futex, doc: Add TP futexes documentation
  perf bench: New microbenchmark for userspace mutex performance
  TP-futex: Support userspace reader/writer locks
  TP-futex: Enable kernel reader lock stealing
  TP-futex: Group readers together in wait queue
  TP-futex, doc: Update TP futexes document on shared locking
  perf bench: New microbenchmark for userspace rwlock performance
  sched, TP-futex: Make wake_up_q() return wakeup count
  futex: Dump internal futex state via debugfs

 Documentation/00-INDEX         |    2 +
 Documentation/tp-futex.txt     |  284 +++++++
 include/linux/sched.h          |    6 +-
 include/uapi/linux/futex.h     |   32 +-
 kernel/futex.c                 | 1411 +++++++++++++++++++++++++++++++---
 kernel/sched/core.c            |    6 +-
 tools/perf/bench/Build         |    1 +
 tools/perf/bench/bench.h       |    2 +
 tools/perf/bench/futex-locks.c | 1654 ++++++++++++++++++++++++++++++++++++++++
 tools/perf/bench/futex.h       |   43 ++
 tools/perf/builtin-bench.c     |   11 +
 tools/perf/check-headers.sh    |    4 +
 12 files changed, 3335 insertions(+), 121 deletions(-)
 create mode 100644 Documentation/tp-futex.txt
 create mode 100644 tools/perf/bench/futex-locks.c

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v4 01/20] futex: Consolidate duplicated timer setup code
  2016-12-29 16:13 [PATCH v4 00/20] futex: Introducing throughput-optimized (TP) futexes Waiman Long
@ 2016-12-29 16:13 ` Waiman Long
  2016-12-29 16:13 ` [PATCH v4 02/20] futex: Rename futex_pi_state to futex_state Waiman Long
                   ` (18 subsequent siblings)
  19 siblings, 0 replies; 28+ messages in thread
From: Waiman Long @ 2016-12-29 16:13 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Arnaldo Carvalho de Melo,
	Davidlohr Bueso, Mike Galbraith, Scott J Norton, Waiman Long

A new futex_setup_timer() helper function is added to consolidate all
the hrtimer_sleeper setup code.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/futex.c | 67 ++++++++++++++++++++++++++++++++--------------------------
 1 file changed, 37 insertions(+), 30 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 9246d9f..0ea1b57 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -477,6 +477,35 @@ static void drop_futex_key_refs(union futex_key *key)
 }
 
 /**
+ * futex_setup_timer - set up the sleeping hrtimer.
+ * @time:	ptr to the given timeout value
+ * @timeout:	the hrtimer_sleeper structure to be set up
+ * @flags:	futex flags
+ * @range_ns:	optional range in ns
+ *
+ * Return: Initialized hrtimer_sleeper structure or NULL if no timeout
+ *	   value given
+ */
+static inline struct hrtimer_sleeper *
+futex_setup_timer(ktime_t *time, struct hrtimer_sleeper *timeout,
+		  int flags, u64 range_ns)
+{
+	if (!time)
+		return NULL;
+
+	hrtimer_init_on_stack(&timeout->timer, (flags & FLAGS_CLOCKRT) ?
+			      CLOCK_REALTIME : CLOCK_MONOTONIC,
+			      HRTIMER_MODE_ABS);
+	hrtimer_init_sleeper(timeout, current);
+	if (range_ns)
+		hrtimer_set_expires_range_ns(&timeout->timer, *time, range_ns);
+	else
+		hrtimer_set_expires(&timeout->timer, *time);
+
+	return timeout;
+}
+
+/**
  * get_futex_key() - Get parameters which are the keys for a futex
  * @uaddr:	virtual address of the futex
  * @fshared:	0 for a PROCESS_PRIVATE futex, 1 for PROCESS_SHARED
@@ -2402,7 +2431,7 @@ static int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
 static int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
 		      ktime_t *abs_time, u32 bitset)
 {
-	struct hrtimer_sleeper timeout, *to = NULL;
+	struct hrtimer_sleeper timeout, *to;
 	struct restart_block *restart;
 	struct futex_hash_bucket *hb;
 	struct futex_q q = futex_q_init;
@@ -2412,17 +2441,8 @@ static int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
 		return -EINVAL;
 	q.bitset = bitset;
 
-	if (abs_time) {
-		to = &timeout;
-
-		hrtimer_init_on_stack(&to->timer, (flags & FLAGS_CLOCKRT) ?
-				      CLOCK_REALTIME : CLOCK_MONOTONIC,
-				      HRTIMER_MODE_ABS);
-		hrtimer_init_sleeper(to, current);
-		hrtimer_set_expires_range_ns(&to->timer, *abs_time,
-					     current->timer_slack_ns);
-	}
-
+	to = futex_setup_timer(abs_time, &timeout, flags,
+			       current->timer_slack_ns);
 retry:
 	/*
 	 * Prepare to wait on uaddr. On success, holds hb lock and increments
@@ -2502,7 +2522,7 @@ static long futex_wait_restart(struct restart_block *restart)
 static int futex_lock_pi(u32 __user *uaddr, unsigned int flags,
 			 ktime_t *time, int trylock)
 {
-	struct hrtimer_sleeper timeout, *to = NULL;
+	struct hrtimer_sleeper timeout, *to;
 	struct futex_hash_bucket *hb;
 	struct futex_q q = futex_q_init;
 	int res, ret;
@@ -2510,13 +2530,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags,
 	if (refill_pi_state_cache())
 		return -ENOMEM;
 
-	if (time) {
-		to = &timeout;
-		hrtimer_init_on_stack(&to->timer, CLOCK_REALTIME,
-				      HRTIMER_MODE_ABS);
-		hrtimer_init_sleeper(to, current);
-		hrtimer_set_expires(&to->timer, *time);
-	}
+	to = futex_setup_timer(time, &timeout, FLAGS_CLOCKRT, 0);
 
 retry:
 	ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q.key, VERIFY_WRITE);
@@ -2811,7 +2825,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 				 u32 val, ktime_t *abs_time, u32 bitset,
 				 u32 __user *uaddr2)
 {
-	struct hrtimer_sleeper timeout, *to = NULL;
+	struct hrtimer_sleeper timeout, *to;
 	struct rt_mutex_waiter rt_waiter;
 	struct rt_mutex *pi_mutex = NULL;
 	struct futex_hash_bucket *hb;
@@ -2825,15 +2839,8 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 	if (!bitset)
 		return -EINVAL;
 
-	if (abs_time) {
-		to = &timeout;
-		hrtimer_init_on_stack(&to->timer, (flags & FLAGS_CLOCKRT) ?
-				      CLOCK_REALTIME : CLOCK_MONOTONIC,
-				      HRTIMER_MODE_ABS);
-		hrtimer_init_sleeper(to, current);
-		hrtimer_set_expires_range_ns(&to->timer, *abs_time,
-					     current->timer_slack_ns);
-	}
+	to = futex_setup_timer(abs_time, &timeout, flags,
+			       current->timer_slack_ns);
 
 	/*
 	 * The waiter is allocated on our stack, manipulated by the requeue
-- 
1.8.3.1

* [PATCH v4 02/20] futex: Rename futex_pi_state to futex_state
  2016-12-29 16:13 [PATCH v4 00/20] futex: Introducing throughput-optimized (TP) futexes Waiman Long
  2016-12-29 16:13 ` [PATCH v4 01/20] futex: Consolidate duplicated timer setup code Waiman Long
@ 2016-12-29 16:13 ` Waiman Long
  2016-12-29 16:13 ` [PATCH v4 03/20] futex: Add helpers to get & cmpxchg futex value without lock Waiman Long
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 28+ messages in thread
From: Waiman Long @ 2016-12-29 16:13 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Arnaldo Carvalho de Melo,
	Davidlohr Bueso, Mike Galbraith, Scott J Norton, Waiman Long

The futex_pi_state structure will be overloaded in later patches to
store state information about non-PI futexes, so the structure name
itself is no longer a good description of its purpose. Change its
name to futex_state, a more generic name.

Some of the functions that process the futex states are also renamed.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/sched.h |   4 +-
 kernel/futex.c        | 107 +++++++++++++++++++++++++-------------------------
 2 files changed, 56 insertions(+), 55 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4d19052..453143c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -126,7 +126,7 @@ struct sched_attr {
 	u64 sched_period;
 };
 
-struct futex_pi_state;
+struct futex_state;
 struct robust_list_head;
 struct bio_list;
 struct fs_struct;
@@ -1830,7 +1830,7 @@ struct task_struct {
 	struct compat_robust_list_head __user *compat_robust_list;
 #endif
 	struct list_head pi_state_list;
-	struct futex_pi_state *pi_state_cache;
+	struct futex_state *pi_state_cache;
 #endif
 #ifdef CONFIG_PERF_EVENTS
 	struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts];
diff --git a/kernel/futex.c b/kernel/futex.c
index 0ea1b57..6f7cb40 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -192,11 +192,12 @@
 #define FLAGS_HAS_TIMEOUT	0x04
 
 /*
- * Priority Inheritance state:
+ * Futex state object:
+ *  - Priority Inheritance state
  */
-struct futex_pi_state {
+struct futex_state {
 	/*
-	 * list of 'owned' pi_state instances - these have to be
+	 * list of 'owned' state instances - these have to be
 	 * cleaned up in do_exit() if the task exits prematurely:
 	 */
 	struct list_head list;
@@ -240,7 +241,7 @@ struct futex_q {
 	struct task_struct *task;
 	spinlock_t *lock_ptr;
 	union futex_key key;
-	struct futex_pi_state *pi_state;
+	struct futex_state *pi_state;
 	struct rt_mutex_waiter *rt_waiter;
 	union futex_key *requeue_pi_key;
 	u32 bitset;
@@ -806,76 +807,76 @@ static int get_futex_value_locked(u32 *dest, u32 __user *from)
 /*
  * PI code:
  */
-static int refill_pi_state_cache(void)
+static int refill_futex_state_cache(void)
 {
-	struct futex_pi_state *pi_state;
+	struct futex_state *state;
 
 	if (likely(current->pi_state_cache))
 		return 0;
 
-	pi_state = kzalloc(sizeof(*pi_state), GFP_KERNEL);
+	state = kzalloc(sizeof(*state), GFP_KERNEL);
 
-	if (!pi_state)
+	if (!state)
 		return -ENOMEM;
 
-	INIT_LIST_HEAD(&pi_state->list);
+	INIT_LIST_HEAD(&state->list);
 	/* pi_mutex gets initialized later */
-	pi_state->owner = NULL;
-	atomic_set(&pi_state->refcount, 1);
-	pi_state->key = FUTEX_KEY_INIT;
+	state->owner = NULL;
+	atomic_set(&state->refcount, 1);
+	state->key = FUTEX_KEY_INIT;
 
-	current->pi_state_cache = pi_state;
+	current->pi_state_cache = state;
 
 	return 0;
 }
 
-static struct futex_pi_state * alloc_pi_state(void)
+static struct futex_state *alloc_futex_state(void)
 {
-	struct futex_pi_state *pi_state = current->pi_state_cache;
+	struct futex_state *state = current->pi_state_cache;
 
-	WARN_ON(!pi_state);
+	WARN_ON(!state);
 	current->pi_state_cache = NULL;
 
-	return pi_state;
+	return state;
 }
 
 /*
- * Drops a reference to the pi_state object and frees or caches it
+ * Drops a reference to the futex state object and frees or caches it
  * when the last reference is gone.
  *
  * Must be called with the hb lock held.
  */
-static void put_pi_state(struct futex_pi_state *pi_state)
+static void put_futex_state(struct futex_state *state)
 {
-	if (!pi_state)
+	if (!state)
 		return;
 
-	if (!atomic_dec_and_test(&pi_state->refcount))
+	if (!atomic_dec_and_test(&state->refcount))
 		return;
 
 	/*
-	 * If pi_state->owner is NULL, the owner is most probably dying
-	 * and has cleaned up the pi_state already
+	 * If state->owner is NULL, the owner is most probably dying
+	 * and has cleaned up the futex state already
 	 */
-	if (pi_state->owner) {
-		raw_spin_lock_irq(&pi_state->owner->pi_lock);
-		list_del_init(&pi_state->list);
-		raw_spin_unlock_irq(&pi_state->owner->pi_lock);
+	if (state->owner) {
+		raw_spin_lock_irq(&state->owner->pi_lock);
+		list_del_init(&state->list);
+		raw_spin_unlock_irq(&state->owner->pi_lock);
 
-		rt_mutex_proxy_unlock(&pi_state->pi_mutex, pi_state->owner);
+		rt_mutex_proxy_unlock(&state->pi_mutex, state->owner);
 	}
 
 	if (current->pi_state_cache)
-		kfree(pi_state);
+		kfree(state);
 	else {
 		/*
-		 * pi_state->list is already empty.
-		 * clear pi_state->owner.
+		 * state->list is already empty.
+		 * clear state->owner.
 		 * refcount is at 0 - put it back to 1.
 		 */
-		pi_state->owner = NULL;
-		atomic_set(&pi_state->refcount, 1);
-		current->pi_state_cache = pi_state;
+		state->owner = NULL;
+		atomic_set(&state->refcount, 1);
+		current->pi_state_cache = state;
 	}
 }
 
@@ -905,7 +906,7 @@ static struct task_struct * futex_find_get_task(pid_t pid)
 void exit_pi_state_list(struct task_struct *curr)
 {
 	struct list_head *next, *head = &curr->pi_state_list;
-	struct futex_pi_state *pi_state;
+	struct futex_state *pi_state;
 	struct futex_hash_bucket *hb;
 	union futex_key key = FUTEX_KEY_INIT;
 
@@ -920,7 +921,7 @@ void exit_pi_state_list(struct task_struct *curr)
 	while (!list_empty(head)) {
 
 		next = head->next;
-		pi_state = list_entry(next, struct futex_pi_state, list);
+		pi_state = list_entry(next, struct futex_state, list);
 		key = pi_state->key;
 		hb = hash_futex(&key);
 		raw_spin_unlock_irq(&curr->pi_lock);
@@ -1007,8 +1008,8 @@ void exit_pi_state_list(struct task_struct *curr)
  * the pi_state against the user space value. If correct, attach to
  * it.
  */
-static int attach_to_pi_state(u32 uval, struct futex_pi_state *pi_state,
-			      struct futex_pi_state **ps)
+static int attach_to_pi_state(u32 uval, struct futex_state *pi_state,
+			      struct futex_state **ps)
 {
 	pid_t pid = uval & FUTEX_TID_MASK;
 
@@ -1079,10 +1080,10 @@ static int attach_to_pi_state(u32 uval, struct futex_pi_state *pi_state,
  * it after doing proper sanity checks.
  */
 static int attach_to_pi_owner(u32 uval, union futex_key *key,
-			      struct futex_pi_state **ps)
+			      struct futex_state **ps)
 {
 	pid_t pid = uval & FUTEX_TID_MASK;
-	struct futex_pi_state *pi_state;
+	struct futex_state *pi_state;
 	struct task_struct *p;
 
 	/*
@@ -1123,7 +1124,7 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
 	/*
 	 * No existing pi state. First waiter. [2]
 	 */
-	pi_state = alloc_pi_state();
+	pi_state = alloc_futex_state();
 
 	/*
 	 * Initialize the pi_mutex in locked state and make @p
@@ -1147,7 +1148,7 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
 }
 
 static int lookup_pi_state(u32 uval, struct futex_hash_bucket *hb,
-			   union futex_key *key, struct futex_pi_state **ps)
+			   union futex_key *key, struct futex_state **ps)
 {
 	struct futex_q *match = futex_top_waiter(hb, key);
 
@@ -1199,7 +1200,7 @@ static int lock_pi_update_atomic(u32 __user *uaddr, u32 uval, u32 newval)
  */
 static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
 				union futex_key *key,
-				struct futex_pi_state **ps,
+				struct futex_state **ps,
 				struct task_struct *task, int set_waiters)
 {
 	u32 uval, newval, vpid = task_pid_vnr(task);
@@ -1325,7 +1326,7 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this,
 			 struct futex_hash_bucket *hb)
 {
 	struct task_struct *new_owner;
-	struct futex_pi_state *pi_state = this->pi_state;
+	struct futex_state *pi_state = this->pi_state;
 	u32 uninitialized_var(curval), newval;
 	DEFINE_WAKE_Q(wake_q);
 	bool deboost;
@@ -1665,7 +1666,7 @@ static int futex_proxy_trylock_atomic(u32 __user *pifutex,
 				 struct futex_hash_bucket *hb1,
 				 struct futex_hash_bucket *hb2,
 				 union futex_key *key1, union futex_key *key2,
-				 struct futex_pi_state **ps, int set_waiters)
+				 struct futex_state **ps, int set_waiters)
 {
 	struct futex_q *top_waiter = NULL;
 	u32 curval;
@@ -1734,7 +1735,7 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags,
 {
 	union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT;
 	int drop_count = 0, task_count = 0, ret;
-	struct futex_pi_state *pi_state = NULL;
+	struct futex_state *pi_state = NULL;
 	struct futex_hash_bucket *hb1, *hb2;
 	struct futex_q *this, *next;
 	DEFINE_WAKE_Q(wake_q);
@@ -1751,7 +1752,7 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags,
 		 * requeue_pi requires a pi_state, try to allocate it now
 		 * without any locks in case it fails.
 		 */
-		if (refill_pi_state_cache())
+		if (refill_futex_state_cache())
 			return -ENOMEM;
 		/*
 		 * requeue_pi must wake as many tasks as it can, up to nr_wake
@@ -1963,7 +1964,7 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags,
 				 * object.
 				 */
 				this->pi_state = NULL;
-				put_pi_state(pi_state);
+				put_futex_state(pi_state);
 				/*
 				 * We stop queueing more waiters and let user
 				 * space deal with the mess.
@@ -1980,7 +1981,7 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags,
 	 * in futex_proxy_trylock_atomic() or in lookup_pi_state(). We
 	 * need to drop it here again.
 	 */
-	put_pi_state(pi_state);
+	put_futex_state(pi_state);
 
 out_unlock:
 	double_unlock_hb(hb1, hb2);
@@ -2135,7 +2136,7 @@ static void unqueue_me_pi(struct futex_q *q)
 	__unqueue_futex(q);
 
 	BUG_ON(!q->pi_state);
-	put_pi_state(q->pi_state);
+	put_futex_state(q->pi_state);
 	q->pi_state = NULL;
 
 	spin_unlock(q->lock_ptr);
@@ -2151,7 +2152,7 @@ static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q,
 				struct task_struct *newowner)
 {
 	u32 newtid = task_pid_vnr(newowner) | FUTEX_WAITERS;
-	struct futex_pi_state *pi_state = q->pi_state;
+	struct futex_state *pi_state = q->pi_state;
 	struct task_struct *oldowner = pi_state->owner;
 	u32 uval, uninitialized_var(curval), newval;
 	int ret;
@@ -2527,7 +2528,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags,
 	struct futex_q q = futex_q_init;
 	int res, ret;
 
-	if (refill_pi_state_cache())
+	if (refill_futex_state_cache())
 		return -ENOMEM;
 
 	to = futex_setup_timer(time, &timeout, FLAGS_CLOCKRT, 0);
@@ -2908,7 +2909,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 			 * Drop the reference to the pi state which
 			 * the requeue_pi() code acquired for us.
 			 */
-			put_pi_state(q.pi_state);
+			put_futex_state(q.pi_state);
 			spin_unlock(q.lock_ptr);
 		}
 	} else {
-- 
1.8.3.1

* [PATCH v4 03/20] futex: Add helpers to get & cmpxchg futex value without lock
  2016-12-29 16:13 [PATCH v4 00/20] futex: Introducing throughput-optimized (TP) futexes Waiman Long
  2016-12-29 16:13 ` [PATCH v4 01/20] futex: Consolidate duplicated timer setup code Waiman Long
  2016-12-29 16:13 ` [PATCH v4 02/20] futex: Rename futex_pi_state to futex_state Waiman Long
@ 2016-12-29 16:13 ` Waiman Long
  2016-12-29 16:13 ` [PATCH v4 04/20] futex: Consolidate pure pi_state_list add & delete codes to helpers Waiman Long
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 28+ messages in thread
From: Waiman Long @ 2016-12-29 16:13 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Arnaldo Carvalho de Melo,
	Davidlohr Bueso, Mike Galbraith, Scott J Norton, Waiman Long

Two new helper functions, cmpxchg_futex_value() and get_futex_value(),
are added to access and change the futex value without holding the
hash bucket lock. As a result, page faults are enabled and the page
will be faulted in if not present yet.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/futex.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/kernel/futex.c b/kernel/futex.c
index 6f7cb40..7590c7d 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -803,6 +803,21 @@ static int get_futex_value_locked(u32 *dest, u32 __user *from)
 	return ret ? -EFAULT : 0;
 }
 
+/*
+ * The equivalents of the above cmpxchg_futex_value_locked() and
+ * get_futex_value_locked which are called without the hash bucket lock
+ * and so can have page fault enabled.
+ */
+static inline int cmpxchg_futex_value(u32 *curval, u32 __user *uaddr,
+				      u32 uval, u32 newval)
+{
+	return futex_atomic_cmpxchg_inatomic(curval, uaddr, uval, newval);
+}
+
+static inline int get_futex_value(u32 *dest, u32 __user *from)
+{
+	return __get_user(*dest, from);
+}
 
 /*
  * PI code:
-- 
1.8.3.1

* [PATCH v4 04/20] futex: Consolidate pure pi_state_list add & delete codes to helpers
  2016-12-29 16:13 [PATCH v4 00/20] futex: Introducing throughput-optimized (TP) futexes Waiman Long
                   ` (2 preceding siblings ...)
  2016-12-29 16:13 ` [PATCH v4 03/20] futex: Add helpers to get & cmpxchg futex value without lock Waiman Long
@ 2016-12-29 16:13 ` Waiman Long
  2016-12-29 16:13 ` [PATCH v4 05/20] futex: Add a new futex type field into futex_state Waiman Long
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 28+ messages in thread
From: Waiman Long @ 2016-12-29 16:13 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Arnaldo Carvalho de Melo,
	Davidlohr Bueso, Mike Galbraith, Scott J Norton, Waiman Long

Two new helper functions (task_pi_list_add & task_pi_list_del)
are created to consolidate all the pure pi_state_list addition and
deletion code. The set_owner argument in task_pi_list_add() will
be needed in a later patch.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/futex.c | 64 +++++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 39 insertions(+), 25 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 7590c7d..ccec962 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -822,6 +822,39 @@ static inline int get_futex_value(u32 *dest, u32 __user *from)
 /*
  * PI code:
  */
+
+/**
+ * task_pi_list_add - add futex state object to a task's pi_state_list
+ * @task : task structure
+ * @state: futex state object
+ */
+static inline void task_pi_list_add(struct task_struct *task,
+				    struct futex_state *state)
+{
+	raw_spin_lock_irq(&task->pi_lock);
+	WARN_ON(!list_empty(&state->list));
+	list_add(&state->list, &task->pi_state_list);
+	state->owner = task;
+	raw_spin_unlock_irq(&task->pi_lock);
+}
+
+/**
+ * task_pi_list_del - delete futex state object from a task's pi_state_list
+ * @state: futex state object
+ * @warn : warn if list is empty when set
+ */
+static inline void task_pi_list_del(struct futex_state *state, const bool warn)
+{
+	struct task_struct *task = state->owner;
+
+	raw_spin_lock_irq(&task->pi_lock);
+	if (warn)
+		WARN_ON(list_empty(&state->list));
+	list_del_init(&state->list);
+	state->owner = NULL;
+	raw_spin_unlock_irq(&task->pi_lock);
+}
+
 static int refill_futex_state_cache(void)
 {
 	struct futex_state *state;
@@ -874,9 +907,7 @@ static void put_futex_state(struct futex_state *state)
 	 * and has cleaned up the futex state already
 	 */
 	if (state->owner) {
-		raw_spin_lock_irq(&state->owner->pi_lock);
-		list_del_init(&state->list);
-		raw_spin_unlock_irq(&state->owner->pi_lock);
+		task_pi_list_del(state, false);
 
 		rt_mutex_proxy_unlock(&state->pi_mutex, state->owner);
 	}
@@ -1397,16 +1428,8 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this,
 		return ret;
 	}
 
-	raw_spin_lock(&pi_state->owner->pi_lock);
-	WARN_ON(list_empty(&pi_state->list));
-	list_del_init(&pi_state->list);
-	raw_spin_unlock(&pi_state->owner->pi_lock);
-
-	raw_spin_lock(&new_owner->pi_lock);
-	WARN_ON(!list_empty(&pi_state->list));
-	list_add(&pi_state->list, &new_owner->pi_state_list);
-	pi_state->owner = new_owner;
-	raw_spin_unlock(&new_owner->pi_lock);
+	task_pi_list_del(pi_state, true);
+	task_pi_list_add(new_owner, pi_state);
 
 	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 
@@ -2211,19 +2234,10 @@ static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q,
 	 * We fixed up user space. Now we need to fix the pi_state
 	 * itself.
 	 */
-	if (pi_state->owner != NULL) {
-		raw_spin_lock_irq(&pi_state->owner->pi_lock);
-		WARN_ON(list_empty(&pi_state->list));
-		list_del_init(&pi_state->list);
-		raw_spin_unlock_irq(&pi_state->owner->pi_lock);
-	}
+	if (pi_state->owner != NULL)
+		task_pi_list_del(pi_state, true);
 
-	pi_state->owner = newowner;
-
-	raw_spin_lock_irq(&newowner->pi_lock);
-	WARN_ON(!list_empty(&pi_state->list));
-	list_add(&pi_state->list, &newowner->pi_state_list);
-	raw_spin_unlock_irq(&newowner->pi_lock);
+	task_pi_list_add(newowner, pi_state);
 	return 0;
 
 	/*
-- 
1.8.3.1

* [PATCH v4 05/20] futex: Add a new futex type field into futex_state
  2016-12-29 16:13 [PATCH v4 00/20] futex: Introducing throughput-optimized (TP) futexes Waiman Long
                   ` (3 preceding siblings ...)
  2016-12-29 16:13 ` [PATCH v4 04/20] futex: Consolidate pure pi_state_list add & delete codes to helpers Waiman Long
@ 2016-12-29 16:13 ` Waiman Long
  2016-12-29 16:13 ` [PATCH v4 06/20] futex: Allow direct attachment of futex_state objects to hash bucket Waiman Long
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 28+ messages in thread
From: Waiman Long @ 2016-12-29 16:13 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Arnaldo Carvalho de Melo,
	Davidlohr Bueso, Mike Galbraith, Scott J Norton, Waiman Long

As the futex_state structure will be overloaded in later patches
to be used by non-PI futexes, it is necessary to add a type field to
distinguish among different types of futexes.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/futex.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index ccec962..c8ff773 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -191,6 +191,10 @@
 #define FLAGS_CLOCKRT		0x02
 #define FLAGS_HAS_TIMEOUT	0x04
 
+enum futex_type {
+	TYPE_PI = 0,
+};
+
 /*
  * Futex state object:
  *  - Priority Inheritance state
@@ -210,6 +214,7 @@ struct futex_state {
 	struct task_struct *owner;
 	atomic_t refcount;
 
+	enum futex_type type;
 	union futex_key key;
 };
 
@@ -903,13 +908,14 @@ static void put_futex_state(struct futex_state *state)
 		return;
 
 	/*
-	 * If state->owner is NULL, the owner is most probably dying
-	 * and has cleaned up the futex state already
+	 * If state->owner is NULL and the type is TYPE_PI, the owner is
+	 * most probably dying and has cleaned up the futex state already.
 	 */
 	if (state->owner) {
 		task_pi_list_del(state, false);
 
-		rt_mutex_proxy_unlock(&state->pi_mutex, state->owner);
+		if (state->type == TYPE_PI)
+			rt_mutex_proxy_unlock(&state->pi_mutex, state->owner);
 	}
 
 	if (current->pi_state_cache)
@@ -1062,7 +1068,7 @@ static int attach_to_pi_state(u32 uval, struct futex_state *pi_state,
 	/*
 	 * Userspace might have messed up non-PI and PI futexes [3]
 	 */
-	if (unlikely(!pi_state))
+	if (unlikely(!pi_state || (pi_state->type != TYPE_PI)))
 		return -EINVAL;
 
 	WARN_ON(!atomic_read(&pi_state->refcount));
@@ -1180,6 +1186,7 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
 
 	/* Store the key for possible exit cleanups: */
 	pi_state->key = *key;
+	pi_state->type = TYPE_PI;
 
 	WARN_ON(!list_empty(&pi_state->list));
 	list_add(&pi_state->list, &p->pi_state_list);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v4 06/20] futex: Allow direct attachment of futex_state objects to hash bucket
  2016-12-29 16:13 [PATCH v4 00/20] futex: Introducing throughput-optimized (TP) futexes Waiman Long
                   ` (4 preceding siblings ...)
  2016-12-29 16:13 ` [PATCH v4 05/20] futex: Add a new futex type field into futex_state Waiman Long
@ 2016-12-29 16:13 ` Waiman Long
  2016-12-29 16:13 ` [PATCH v4 07/20] futex: Introduce throughput-optimized (TP) futexes Waiman Long
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 28+ messages in thread
From: Waiman Long @ 2016-12-29 16:13 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Arnaldo Carvalho de Melo,
	Davidlohr Bueso, Mike Galbraith, Scott J Norton, Waiman Long

Currently, the futex state objects can only be located indirectly as

        hash bucket => futex_q => futex state

However, it can be beneficial in some cases to locate the futex state
object directly from the hash bucket without the futex_q middleman.
Therefore, a new list head to link the futex state objects as well
as a new spinlock to manage them are added to the hash bucket.

To limit the size increase for UP systems, these new fields are only
added on SMP machines where the cacheline alignment of the hash bucket
leaves it with enough empty space for the new fields.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/futex.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/kernel/futex.c b/kernel/futex.c
index c8ff773..8b9e982 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -207,6 +207,11 @@ struct futex_state {
 	struct list_head list;
 
 	/*
+	 * Can link to fs_head in the owning hash bucket.
+	 */
+	struct list_head fs_list;
+
+	/*
 	 * The PI object:
 	 */
 	struct rt_mutex pi_mutex;
@@ -262,11 +267,24 @@ struct futex_q {
  * Hash buckets are shared by all the futex_keys that hash to the same
  * location.  Each key may have multiple futex_q structures, one for each task
  * waiting on a futex.
+ *
+ * Alternatively (in SMP), a key can be associated with a unique futex_state
+ * object where multiple waiters waiting for that futex can queue up in that
+ * futex_state object without using the futex_q structure. A separate
+ * futex_state lock (fs_lock) is used for processing those futex_state objects.
  */
 struct futex_hash_bucket {
 	atomic_t waiters;
 	spinlock_t lock;
 	struct plist_head chain;
+
+#ifdef CONFIG_SMP
+	/*
+	 * Fields for managing futex_state object list
+	 */
+	spinlock_t fs_lock;
+	struct list_head fs_head;
+#endif
 } ____cacheline_aligned_in_smp;
 
 /*
@@ -873,6 +891,8 @@ static int refill_futex_state_cache(void)
 		return -ENOMEM;
 
 	INIT_LIST_HEAD(&state->list);
+	INIT_LIST_HEAD(&state->fs_list);
+
 	/* pi_mutex gets initialized later */
 	state->owner = NULL;
 	atomic_set(&state->refcount, 1);
@@ -3363,6 +3383,10 @@ static int __init futex_init(void)
 		atomic_set(&futex_queues[i].waiters, 0);
 		plist_head_init(&futex_queues[i].chain);
 		spin_lock_init(&futex_queues[i].lock);
+#ifdef CONFIG_SMP
+		INIT_LIST_HEAD(&futex_queues[i].fs_head);
+		spin_lock_init(&futex_queues[i].fs_lock);
+#endif
 	}
 
 	return 0;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v4 07/20] futex: Introduce throughput-optimized (TP) futexes
  2016-12-29 16:13 [PATCH v4 00/20] futex: Introducing throughput-optimized (TP) futexes Waiman Long
                   ` (5 preceding siblings ...)
  2016-12-29 16:13 ` [PATCH v4 06/20] futex: Allow direct attachment of futex_state objects to hash bucket Waiman Long
@ 2016-12-29 16:13 ` Waiman Long
  2016-12-29 21:17   ` kbuild test robot
  2016-12-29 16:13 ` [PATCH v4 08/20] TP-futex: Enable robust handling Waiman Long
                   ` (12 subsequent siblings)
  19 siblings, 1 reply; 28+ messages in thread
From: Waiman Long @ 2016-12-29 16:13 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Arnaldo Carvalho de Melo,
	Davidlohr Bueso, Mike Galbraith, Scott J Norton, Waiman Long

A new futex implementation called throughput-optimized (TP) futexes
is introduced. The goal of this new futex type is to maximize locking
throughput at the expense of fairness and deterministic latency. Its
throughput is higher than that of the wait-wake futexes especially
on systems with a large number of CPUs and for workloads where the
lock owners are unlikely to sleep.  The downside of this new futex
is the increase in the response time variance.

The new TP futex type is designed to be used for mutex-like exclusive
user space locks. It is not as versatile as the wait-wake futexes which
can be used for other locking primitives like conditional variables,
or read-write locks. So it is not a direct replacement of wait-wake
futexes.

A futex locking microbenchmark was written where separate userspace
mutexes are implemented using wait-wake (WW), PI and TP futexes
respectively. This microbenchmark was instructed to run locking
operations for 10s with a load latency of 5 as the critical section
on a 2-socket 36-core E5-2699 v3 system (HT off).  The system was
running a 4.10 based kernel.  The results of the benchmark runs
were as follows:

                        WW futex       PI futex      TP futex
                        --------       --------      --------
Total locking ops      43,607,917      303,047      54,812,918
Per-thread avg/sec        121,129          841         152,242
Per-thread min/sec        114,842          841          37,440
Per-thread max/sec        127,201          842         455,677
% Stddev                    0.50%        0.00%           9.89%
lock futex calls       10,262,001      303,041         134,989
unlock futex calls     12,956,025      303,042              21

The TP futexes had better throughput, but they also have larger
variance.

For the TP futexes, the heavy hitters in the perf profile were:

 98.18%   7.25% [kernel.vmlinux] [k] futex_lock
 87.31%  87.31% [kernel.vmlinux] [k] osq_lock
  3.62%   3.62% [kernel.vmlinux] [k] mutex_spin_on_owner

For the WW futexes, the heavy hitters were:

 86.38%  86.38% [kernel.vmlinux] [k] queued_spin_lock_slowpath
 48.39%   2.65% [kernel.vmlinux] [k] futex_wake
 45.13%   3.57% [kernel.vmlinux] [k] futex_wait_setup

As the load latency increases, the performance gap between WW and
TP futexes actually widens, as shown in the table below.

  Load Latency  WW locking ops  TP locking ops  % change
  ------------  --------------  --------------  --------
      5           43,607,917      54,812,918      +26%
     10           32,822,045      45,685,420      +39%
     20           23,841,429      39,439,048      +65%
     30           18,360,775      31,494,788      +72%
     40           15,383,536      27,687,160      +80%
     50           13,290,368      23,096,937      +74%
    100            8,577,763      14,410,909      +68%
  1us sleep          176,450         179,660       +2%

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/uapi/linux/futex.h |   4 +
 kernel/futex.c             | 559 ++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 560 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/futex.h b/include/uapi/linux/futex.h
index 0b1f716..7f676ac 100644
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -20,6 +20,8 @@
 #define FUTEX_WAKE_BITSET	10
 #define FUTEX_WAIT_REQUEUE_PI	11
 #define FUTEX_CMP_REQUEUE_PI	12
+#define FUTEX_LOCK		13
+#define FUTEX_UNLOCK		14
 
 #define FUTEX_PRIVATE_FLAG	128
 #define FUTEX_CLOCK_REALTIME	256
@@ -39,6 +41,8 @@
 					 FUTEX_PRIVATE_FLAG)
 #define FUTEX_CMP_REQUEUE_PI_PRIVATE	(FUTEX_CMP_REQUEUE_PI | \
 					 FUTEX_PRIVATE_FLAG)
+#define FUTEX_LOCK_PRIVATE	(FUTEX_LOCK | FUTEX_PRIVATE_FLAG)
+#define FUTEX_UNLOCK_PRIVATE	(FUTEX_UNLOCK | FUTEX_PRIVATE_FLAG)
 
 /*
  * Support for robust futexes: the kernel cleans up held futexes at
diff --git a/kernel/futex.c b/kernel/futex.c
index 8b9e982..2374cce 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -30,6 +30,9 @@
  *  "The futexes are also cursed."
  *  "But they come in a choice of three flavours!"
  *
+ *  TP-futex support started by Waiman Long
+ *  Copyright (C) 2016 Red Hat, Inc., Waiman Long <longman@redhat.com>
+ *
  *  This program is free software; you can redistribute it and/or modify
  *  it under the terms of the GNU General Public License as published by
  *  the Free Software Foundation; either version 2 of the License, or
@@ -193,6 +196,7 @@
 
 enum futex_type {
 	TYPE_PI = 0,
+	TYPE_TP,
 };
 
 /*
@@ -212,11 +216,23 @@ struct futex_state {
 	struct list_head fs_list;
 
 	/*
-	 * The PI object:
+	 * The PI or mutex object:
 	 */
-	struct rt_mutex pi_mutex;
+	union {
+		struct rt_mutex pi_mutex;
+		struct mutex mutex;
+	};
 
+	/*
+	 * For both TP/PI futexes, owner is the task that owns the futex.
+	 *
+	 * For TP futex, mutex_owner is the mutex lock holder that is either
+	 * spinning on the futex owner or is sleeping. The owner field will
+	 * only be set via task_pi_list_add() when the mutex lock holder is
+	 * sleeping.
+	 */
 	struct task_struct *owner;
+	struct task_struct *mutex_owner;	/* For TP futexes only */
 	atomic_t refcount;
 
 	enum futex_type type;
@@ -895,6 +911,7 @@ static int refill_futex_state_cache(void)
 
 	/* pi_mutex gets initialized later */
 	state->owner = NULL;
+	state->mutex_owner = NULL;
 	atomic_set(&state->refcount, 1);
 	state->key = FUTEX_KEY_INIT;
 
@@ -917,7 +934,8 @@ static struct futex_state *alloc_futex_state(void)
  * Drops a reference to the futex state object and frees or caches it
  * when the last reference is gone.
  *
- * Must be called with the hb lock held.
+ * Must be called with the hb lock held or with the fs_lock held for TP
+ * futexes.
  */
 static void put_futex_state(struct futex_state *state)
 {
@@ -934,10 +952,24 @@ static void put_futex_state(struct futex_state *state)
 	if (state->owner) {
 		task_pi_list_del(state, false);
 
+		/*
+		 * For TP futexes, the owner field should have been cleared
+		 * before calling put_futex_state.
+		 */
+		WARN_ON(state->type == TYPE_TP);
+
 		if (state->type == TYPE_PI)
 			rt_mutex_proxy_unlock(&state->pi_mutex, state->owner);
 	}
 
+#ifdef CONFIG_SMP
+	/*
+	 * Dequeue the futex state from the hash bucket for TP futexes.
+	 */
+	if (state->type == TYPE_TP)
+		list_del_init(&state->fs_list);
+#endif
+
 	if (current->pi_state_cache)
 		kfree(state);
 	else {
@@ -947,6 +979,7 @@ static void put_futex_state(struct futex_state *state)
 		 * refcount is at 0 - put it back to 1.
 		 */
 		state->owner = NULL;
+		state->mutex_owner = NULL;
 		atomic_set(&state->refcount, 1);
 		current->pi_state_cache = state;
 	}
@@ -3246,6 +3279,516 @@ void exit_robust_list(struct task_struct *curr)
 				   curr, pip);
 }
 
+#ifdef CONFIG_SMP
+/*
+ * Throughput-Optimized (TP) Futexes
+ * ---------------------------------
+ *
+ * Userspace mutual exclusion locks can be implemented either by using the
+ * wait-wake (WW) futexes or the PI futexes. The WW futexes have much higher
+ * throughput but can't guarantee minimal latency for high priority processes
+ * like the PI futexes. Even then, the overhead of WW futexes in userspace
+ * locking primitives can easily become a performance bottleneck on systems
+ * with a large number of cores.
+ *
+ * The throughput-optimized (TP) futex is a new futex implementation that
+ * provides higher throughput than WW futex via the use of following
+ * two techniques:
+ *  1) Optimistic spinning when futex owner is actively running
+ *  2) Lock stealing
+ *
+ * Optimistic spinning isn't present in other futexes. Lock stealing is
+ * possible in WW futexes, but isn't actively encouraged like the
+ * TP futexes. The downside, however, is in the much higher variance of
+ * the response times. Since there is no point in supporting optimistic
+ * spinning on a UP system, this new futex type is only available on SMP
+ * systems.
+ *
+ * The TP futexes are not as versatile as the WW futexes. The TP futexes
+ * are used primarily for mutex-like exclusive userspace locks.
+ * On the other hand, WW futexes can also be used to implement other
+ * synchronization primitives like conditional variables or semaphores.
+ *
+ * The use of TP futexes is very similar to the PI futexes. Locking is done
+ * by atomically transitioning the futex from 0 to the task's thread ID.
+ * Unlocking is done by atomically changing the futex from thread ID to 0.
+ * Any failure to do so will require calls to the kernel to do the locking
+ * and unlocking.
+ *
+ * The purpose of the FUTEX_WAITERS bit is to make the unlocker wake up the
+ * serialization mutex owner. Not having the FUTEX_WAITERS bit set doesn't
+ * mean there is no waiter in the kernel.
+ *
+ * Like PI futexes, TP futexes are orthogonal to robust futexes and can be
+ * used together.
+ *
+ * Unlike the other futexes, the futex_q structures aren't used. Instead,
+ * they will queue up in the serialization mutex of the futex state container
+ * queued in the hash bucket.
+ */
+
+/**
+ * lookup_futex_state - Looking up the futex state structure.
+ * @hb:		 hash bucket
+ * @key:	 futex key
+ * @search_only: Boolean search only (no allocation) flag
+ *
+ * This function differs from lookup_pi_state() in that it searches the fs_head
+ * in the hash bucket instead of the waiters in the plist. The HB fs_lock must
+ * be held before calling it. If the search_only flag isn't set, a new state
+ * structure will be pushed into the list and returned if there is no matching
+ * key.
+ *
+ * The reference count won't be incremented for search_only call as the
+ * futex state object won't be destroyed while the fs_lock is being held.
+ *
+ * Return: The matching futex state object or a newly allocated one.
+ *	   NULL if not found in search only lookup.
+ */
+static struct futex_state *
+lookup_futex_state(struct futex_hash_bucket *hb, union futex_key *key,
+		   bool search_only)
+{
+	struct futex_state *state;
+
+	list_for_each_entry(state, &hb->fs_head, fs_list)
+		if (match_futex(key, &state->key)) {
+			if (!search_only)
+				atomic_inc(&state->refcount);
+			return state;
+		}
+
+	if (search_only)
+		return NULL;
+
+	/*
+	 * Push a new one into the list and return it.
+	 */
+	state = alloc_futex_state();
+	state->type = TYPE_TP;
+	state->key = *key;
+	list_add(&state->fs_list, &hb->fs_head);
+	WARN_ON(atomic_read(&state->refcount) != 1);
+
+	/*
+	 * Initialize the mutex structure.
+	 */
+	mutex_init(&state->mutex);
+	WARN_ON(!list_empty(&state->list));
+	return state;
+}
+
+/**
+ * put_futex_state_unlocked - fast dropping futex state reference count
+ * @state: the futex state to be dropped
+ *
+ * This function is used to decrement the reference count in the futex state
+ * object while not holding the HB fs_lock. It can fail.
+ *
+ * Return: 1 if successful and 0 if the reference count is going to be
+ *	   decremented to 0 and hence has to be done under lock.
+ */
+static inline int put_futex_state_unlocked(struct futex_state *state)
+{
+	return atomic_add_unless(&state->refcount, -1, 1);
+}
+
+/**
+ * futex_trylock - try to lock the userspace futex word (0 => vpid).
+ * @uaddr  - futex address
+ * @vpid   - PID of current task
+ * @puval  - storage location for current futex value
+ * @waiter - true for top waiter, false otherwise
+ *
+ * The HB fs_lock should NOT be held while calling this function.
+ * The flag bits are ignored in the trylock.
+ *
+ * If waiter is true
+ * then
+ *   don't preserve the flag bits;
+ * else
+ *   preserve the flag bits
+ * endif
+ *
+ * Return: 1 if lock acquired;
+ *	   0 if lock acquisition failed;
+ *	   -EFAULT if an error happened.
+ */
+static inline int futex_trylock(u32 __user *uaddr, const u32 vpid, u32 *puval,
+				const bool waiter)
+{
+	u32 uval, flags = 0;
+
+	if (unlikely(get_futex_value(puval, uaddr)))
+		return -EFAULT;
+
+	uval = *puval;
+
+	if (uval & FUTEX_TID_MASK)
+		return 0;	/* Trylock fails */
+
+	if (!waiter)
+		flags |= (uval & ~FUTEX_TID_MASK);
+
+	if (unlikely(cmpxchg_futex_value(puval, uaddr, uval, vpid|flags)))
+		return -EFAULT;
+
+	return *puval == uval;
+}
+
+/**
+ * futex_trylock_preempt_disabled - futex_trylock with preemption disabled
+ * @uaddr  - futex address
+ * @vpid   - PID of current task
+ * @puval  - storage location for current futex value
+ *
+ * The preempt_disable() has similar effect as pagefault_disable(). As a
+ * result, we will have to disable page fault as well and handle the case
+ * of faulting in the futex word. This function should only be called from
+ * within futex_spin_on_owner().
+ *
+ * Return: 1 if lock acquired;
+ *	   0 if lock acquisition failed;
+ *	   -EFAULT if an error happened.
+ */
+static inline int futex_trylock_preempt_disabled(u32 __user *uaddr,
+						 const u32 vpid, u32 *puval)
+{
+	int ret;
+
+	pagefault_disable();
+	ret = futex_trylock(uaddr, vpid, puval, true);
+	pagefault_enable();
+
+	return ret;
+}
+
+/**
+ * futex_set_waiters_bit - set the FUTEX_WAITERS bit without fs_lock
+ * @uaddr: futex address
+ * @puval: pointer to current futex value
+ *
+ * Return: 0 if successful or < 0 if an error happens.
+ *	   futex value pointed by puval will be set to current value.
+ */
+static inline int futex_set_waiters_bit(u32 __user *uaddr, u32 *puval)
+{
+	u32 curval, uval = *puval;
+
+	while (!(uval & FUTEX_WAITERS)) {
+		/*
+		 * Set the FUTEX_WAITERS bit.
+		 */
+		if (cmpxchg_futex_value_locked(&curval, uaddr, uval,
+					uval | FUTEX_WAITERS))
+			return -EFAULT;
+		if (curval == uval) {
+			*puval = uval | FUTEX_WAITERS;
+			break;
+		}
+		uval = curval;
+	}
+	return 0;
+}
+
+/**
+ * futex_spin_on_owner - Optimistically spin on futex owner
+ * @uaddr: futex address
+ * @vpid:  PID of current task
+ * @state: futex state object
+ *
+ * Spin on the futex word while the futex owner is active. Otherwise, set
+ * the FUTEX_WAITERS bit and go to sleep.
+ *
+ * As we take a reference to the futex owner's task structure, we don't need
+ * to use RCU to ensure that the task structure is valid.
+ *
+ * The function will directly grab the lock if the owner is dying or the pid
+ * is invalid. That should take care of the problem of dead lock owners
+ * unless the pid wraps around and the perceived owner is not the real owner.
+ * To guard against this case, we will have to use the robust futex feature.
+ *
+ * Return: 0 if futex acquired, < 0 if an error happens.
+ */
+static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid,
+			       struct futex_state *state)
+{
+	int ret;
+	u32 uval;
+	u32 owner_pid = 0;
+	struct task_struct *owner_task = NULL;
+
+	preempt_disable();
+	WRITE_ONCE(state->mutex_owner, current);
+retry:
+	for (;;) {
+		ret = futex_trylock_preempt_disabled(uaddr, vpid, &uval);
+		if (ret)
+			break;
+
+		if ((uval & FUTEX_TID_MASK) != owner_pid) {
+			if (owner_task)
+				put_task_struct(owner_task);
+
+			owner_pid  = uval & FUTEX_TID_MASK;
+			owner_task = futex_find_get_task(owner_pid);
+		}
+
+		if (need_resched()) {
+			__set_current_state(TASK_RUNNING);
+			schedule_preempt_disabled();
+			continue;
+		}
+
+		if (signal_pending(current)) {
+			ret = -EINTR;
+			break;
+		}
+
+		if (owner_task->on_cpu) {
+			cpu_relax();
+			continue;
+		}
+
+		/*
+		 * If the owner isn't active, we need to go to sleep after
+		 * making sure that the FUTEX_WAITERS bit is set.
+		 */
+		ret = futex_set_waiters_bit(uaddr, &uval);
+		if (ret)
+			break;
+
+		/*
+		 * Do a trylock after setting the task state to make
+		 * sure we won't miss a wakeup.
+		 *
+		 * Futex owner		Mutex owner
+		 * -----------		-----------
+		 * unlock		set state
+		 * MB			MB
+		 * read state		trylock
+		 * wakeup		sleep
+		 */
+		set_current_state(TASK_INTERRUPTIBLE);
+		ret = futex_trylock_preempt_disabled(uaddr, vpid, &uval);
+		if (ret) {
+			__set_current_state(TASK_RUNNING);
+			break;
+		}
+
+		/*
+		 * Don't sleep if the owner has died or the FUTEX_WAITERS bit
+		 * was cleared. The latter case can happen when unlock and
+		 * lock stealing happen in between setting the FUTEX_WAITERS
+		 * and setting state to TASK_INTERRUPTIBLE.
+		 */
+		if (!(uval & FUTEX_OWNER_DIED) && (uval & FUTEX_WAITERS))
+			schedule_preempt_disabled();
+		__set_current_state(TASK_RUNNING);
+	}
+
+	if (ret == -EFAULT) {
+		ret = fault_in_user_writeable(uaddr);
+		if (!ret)
+			goto retry;
+	}
+
+	if (owner_task)
+		put_task_struct(owner_task);
+
+	/*
+	 * Cleanup futex state.
+	 */
+	WRITE_ONCE(state->mutex_owner, NULL);
+
+	preempt_enable();
+	return ret;
+}
+
+/*
+ * Userspace tried a 0 -> TID atomic transition of the futex value
+ * and failed. The kernel side here does the whole locking operation.
+ * The kernel mutex will be used for serialization. Once becoming the
+ * sole mutex lock owner, it will spin on the futex owner's task structure
+ * to see if it is running. It will also spin on the futex word so as to grab
+ * the lock as soon as it is free.
+ *
+ * This function is not inlined so that it can show up separately in perf
+ * profile for performance analysis purpose.
+ *
+ * Return: 0   - lock acquired
+ *	   < 0 - an error happens
+ */
+static noinline int futex_lock(u32 __user *uaddr, unsigned int flags)
+{
+	struct futex_hash_bucket *hb;
+	union futex_key key = FUTEX_KEY_INIT;
+	struct futex_state *state;
+	u32 uval, vpid = task_pid_vnr(current);
+	int ret;
+
+	/*
+	 * Stealing lock and preserve the flag bits.
+	 */
+	ret = futex_trylock(uaddr, vpid, &uval, false);
+	if (ret)
+		goto out;
+
+	/*
+	 * Detect deadlocks.
+	 */
+	if (unlikely(((uval & FUTEX_TID_MASK) == vpid) ||
+			should_fail_futex(true)))
+		return -EDEADLK;
+
+	if (refill_futex_state_cache())
+		return -ENOMEM;
+
+	ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_WRITE);
+	if (unlikely(ret))
+		goto out;
+
+	hb = hash_futex(&key);
+	spin_lock(&hb->fs_lock);
+
+	/*
+	 * Locate the futex state by looking up the futex state list in the
+	 * hash bucket. If it isn't found, create a new one and put it into
+	 * the list.
+	 */
+	state = lookup_futex_state(hb, &key, false);
+
+	/*
+	 * We don't need to hold the lock after looking up the futex state
+	 * as we have incremented the reference count.
+	 */
+	spin_unlock(&hb->fs_lock);
+
+	if (!state || (state->type != TYPE_TP)) {
+		ret = state ? -EINVAL : -ENOMEM;
+		goto out_put_state_key;
+	}
+
+	/*
+	 * Acquiring the serialization mutex.
+	 *
+	 * If we got a signal or has some other error, we need to abort
+	 * the lock operation and return.
+	 */
+	ret = mutex_lock_interruptible(&state->mutex);
+	if (unlikely(ret))
+		goto out_put_state_key;
+
+	/*
+	 * As the mutex owner, we can now spin on the futex word as well as
+	 * the active-ness of the futex owner.
+	 */
+	ret = futex_spin_on_owner(uaddr, vpid, state);
+
+	mutex_unlock(&state->mutex);
+
+out_put_state_key:
+	if (!put_futex_state_unlocked(state)) {
+		/*
+		 * May need to free the futex state object and so must be
+		 * under the fs_lock.
+		 */
+		spin_lock(&hb->fs_lock);
+		put_futex_state(state);
+		spin_unlock(&hb->fs_lock);
+	}
+	put_futex_key(&key);
+
+out:
+	return (ret < 0) ? ret : 0;
+}
+
+/*
+ * Userspace attempted a TID -> 0 atomic transition, and failed.
+ * This is the in-kernel slowpath: we look up the futex state (if any),
+ * and wakeup the mutex owner.
+ *
+ * Return: 1 if a wakeup is attempted, 0 if no task to wake,
+ *	   or < 0 when an error happens.
+ */
+static int futex_unlock(u32 __user *uaddr, unsigned int flags)
+{
+	u32 uval, vpid = task_pid_vnr(current);
+	union futex_key key = FUTEX_KEY_INIT;
+	struct futex_hash_bucket *hb;
+	struct futex_state *state = NULL;
+	struct task_struct *owner = NULL;
+	int ret;
+	DEFINE_WAKE_Q(wake_q);
+
+	if (get_futex_value(&uval, uaddr))
+		return -EFAULT;
+
+	if ((uval & FUTEX_TID_MASK) != vpid)
+		return -EPERM;
+
+	if (!(uval & FUTEX_WAITERS))
+		return -EINVAL;
+
+	ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_WRITE);
+	if (ret)
+		return ret;
+
+	hb = hash_futex(&key);
+	spin_lock(&hb->fs_lock);
+
+	/*
+	 * Check the hash bucket only for matching futex state.
+	 */
+	state = lookup_futex_state(hb, &key, true);
+	if (!state || (state->type != TYPE_TP)) {
+		ret = -EINVAL;
+		spin_unlock(&hb->fs_lock);
+		goto out_put_key;
+	}
+
+	owner = READ_ONCE(state->mutex_owner);
+	if (owner)
+		wake_q_add(&wake_q, owner);
+
+	spin_unlock(&hb->fs_lock);
+
+	/*
+	 * Unlock the futex.
+	 * The flag bits are not preserved to encourage more lock stealing.
+	 */
+	for (;;) {
+		u32 old = uval;
+
+		if (cmpxchg_futex_value(&uval, uaddr, old, 0)) {
+			ret = -EFAULT;
+			break;
+		}
+		if (old == uval)
+			break;
+		/*
+		 * The non-flag value of the futex shouldn't change
+		 */
+		if ((uval & FUTEX_TID_MASK) != vpid) {
+			ret = -EINVAL;
+			break;
+		}
+	}
+
+out_put_key:
+	put_futex_key(&key);
+	if (owner) {
+		/*
+		 * No error would have happened if owner is defined.
+		 */
+		wake_up_q(&wake_q);
+		return 1;
+	}
+
+	return ret;
+}
+#endif /* CONFIG_SMP */
+
 long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 		u32 __user *uaddr2, u32 val2, u32 val3)
 {
@@ -3268,6 +3811,10 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 	case FUTEX_TRYLOCK_PI:
 	case FUTEX_WAIT_REQUEUE_PI:
 	case FUTEX_CMP_REQUEUE_PI:
+#ifdef CONFIG_SMP
+	case FUTEX_LOCK:
+	case FUTEX_UNLOCK:
+#endif
 		if (!futex_cmpxchg_enabled)
 			return -ENOSYS;
 	}
@@ -3299,6 +3846,12 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 					     uaddr2);
 	case FUTEX_CMP_REQUEUE_PI:
 		return futex_requeue(uaddr, flags, uaddr2, val, val2, &val3, 1);
+#ifdef CONFIG_SMP
+	case FUTEX_LOCK:
+		return futex_lock(uaddr, flags);
+	case FUTEX_UNLOCK:
+		return futex_unlock(uaddr, flags);
+#endif
 	}
 	return -ENOSYS;
 }
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v4 08/20] TP-futex: Enable robust handling
  2016-12-29 16:13 [PATCH v4 00/20] futex: Introducing throughput-optimized (TP) futexes Waiman Long
                   ` (6 preceding siblings ...)
  2016-12-29 16:13 ` [PATCH v4 07/20] futex: Introduce throughput-optimized (TP) futexes Waiman Long
@ 2016-12-29 16:13 ` Waiman Long
  2016-12-29 16:13 ` [PATCH v4 09/20] TP-futex: Implement lock handoff to prevent lock starvation Waiman Long
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 28+ messages in thread
From: Waiman Long @ 2016-12-29 16:13 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Arnaldo Carvalho de Melo,
	Davidlohr Bueso, Mike Galbraith, Scott J Norton, Waiman Long

The TP futexes don't have code to handle the death of futex
owners. There are 2 different cases that need to be considered.

As top waiter gets a reference to the task structure of the futex
owner, the task structure will never go away even if the owner dies.
When the futex owner died while the top waiter is spinning, the task
structure will be marked dead or the pid won't have a matching task
structure if the task died before a reference is taken. Alternatively,
if the robust futex attribute is enabled, the FUTEX_OWNER_DIED bit of the
futex word may also be set. In all those cases, what the top waiter
needs to do is to grab the futex directly. An informational message
will be printed to highlight this event.

If the futex owner died while the top waiter is sleeping, we need to
make the exit processing code wake up the top waiter. This is done
by chaining the futex state object into the pi_state_list of the futex
owner before the top waiter sleeps so that if exit_pi_state_list()
is called, the wakeup will happen. The top waiter needs to remove
its futex state object from the pi_state_list of the old owner if
the ownership changes hands or when the lock is acquired.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/futex.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 79 insertions(+), 6 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 2374cce..2d3ec8d 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1004,7 +1004,7 @@ static struct task_struct * futex_find_get_task(pid_t pid)
 }
 
 /*
- * This task is holding PI mutexes at exit time => bad.
+ * This task is holding PI or TP mutexes at exit time => bad.
  * Kernel cleans up PI-state, but userspace is likely hosed.
  * (Robust-futex cleanup is separate and might save the day for userspace.)
  */
@@ -1021,12 +1021,31 @@ void exit_pi_state_list(struct task_struct *curr)
 	 * We are a ZOMBIE and nobody can enqueue itself on
 	 * pi_state_list anymore, but we have to be careful
 	 * versus waiters unqueueing themselves:
+	 *
+	 * For TP futexes, the only purpose of showing up in the
+	 * pi_state_list is for this function to wake up the serialization
+	 * mutex owner (state->mutex_owner). We don't actually need to take
+	 * the HB lock. The futex state and task struct won't go away as long
+	 * as we hold the pi_lock.
 	 */
 	raw_spin_lock_irq(&curr->pi_lock);
 	while (!list_empty(head)) {
 
 		next = head->next;
 		pi_state = list_entry(next, struct futex_state, list);
+
+		if (pi_state->type == TYPE_TP) {
+			struct task_struct *owner;
+
+			owner = READ_ONCE(pi_state->mutex_owner);
+			WARN_ON(list_empty(&pi_state->list));
+			list_del_init(&pi_state->list);
+			pi_state->owner = NULL;
+			if (owner)
+				wake_up_process(owner);
+			continue;
+		}
+
 		key = pi_state->key;
 		hb = hash_futex(&key);
 		raw_spin_unlock_irq(&curr->pi_lock);
@@ -3183,8 +3202,8 @@ int handle_futex_death(u32 __user *uaddr, struct task_struct *curr, int pi)
 			goto retry;
 
 		/*
-		 * Wake robust non-PI futexes here. The wakeup of
-		 * PI futexes happens in exit_pi_state():
+		 * Wake robust wait-wake futexes here. The wakeup of
+		 * PI and TP futexes happens in exit_pi_state():
 		 */
 		if (!pi && (uval & FUTEX_WAITERS))
 			futex_wake(uaddr, 1, 1, FUTEX_BITSET_MATCH_ANY);
@@ -3325,6 +3344,12 @@ void exit_robust_list(struct task_struct *curr)
  * Unlike the other futexes, the futex_q structures aren't used. Instead,
  * they will queue up in the serialization mutex of the futex state container
  * queued in the hash bucket.
+ *
+ * To handle the exceptional case that the futex owner died, the robust
+ * futexes list mechanism is used to wake up the sleeping top waiter.
+ * Checks are also made in the futex_spin_on_owner() loop for dead task
+ * structure or invalid pid. In both cases, the top waiter will take over
+ * the ownership of the futex.
  */
 
 /**
@@ -3513,6 +3538,10 @@ static inline int futex_set_waiters_bit(u32 __user *uaddr, u32 *puval)
 static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid,
 			       struct futex_state *state)
 {
+#define OWNER_DEAD_MESSAGE					\
+	"futex: owner pid %d of TP futex 0x%lx was %s.\n"	\
+	"\tLock is now acquired by pid %d!\n"
+
 	int ret;
 	u32 uval;
 	u32 owner_pid = 0;
@@ -3527,13 +3556,47 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid,
 			break;
 
 		if ((uval & FUTEX_TID_MASK) != owner_pid) {
-			if (owner_task)
+			if (owner_task) {
+				/*
+				 * task_pi_list_del() should always be
+				 * done before put_task_struct(). The futex
+				 * state may have been dequeued if the task
+				 * is dead.
+				 */
+				if (state->owner) {
+					WARN_ON(state->owner != owner_task);
+					task_pi_list_del(state, true);
+				}
 				put_task_struct(owner_task);
+			}
 
 			owner_pid  = uval & FUTEX_TID_MASK;
 			owner_task = futex_find_get_task(owner_pid);
 		}
 
+		if (unlikely(!owner_task ||
+			    (owner_task->flags & PF_EXITING) ||
+			    (uval & FUTEX_OWNER_DIED))) {
+			/*
+			 * PID invalid or exiting/dead task, we can directly
+			 * grab the lock now.
+			 */
+			u32 curval;
+			char *owner_state;
+
+			ret = cmpxchg_futex_value_locked(&curval, uaddr, uval,
+							 vpid);
+			if (unlikely(ret))
+				break;
+			if (curval != uval)
+				continue;
+			owner_state = (owner_task || (uval & FUTEX_OWNER_DIED))
+				    ? "dead" : "invalid";
+			pr_info(OWNER_DEAD_MESSAGE, owner_pid,
+				(long)uaddr, owner_state, vpid);
+			break;
+		}
+
 		if (need_resched()) {
 			__set_current_state(TASK_RUNNING);
 			schedule_preempt_disabled();
@@ -3552,12 +3615,17 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid,
 
 		/*
 		 * If the owner isn't active, we need to go to sleep after
-		 * making sure that the FUTEX_WAITERS bit is set.
+		 * making sure that the FUTEX_WAITERS bit is set. We also
+		 * need to put the futex state into the futex owner's
+		 * pi_state_list to prevent deadlock when the owner dies.
 		 */
 		ret = futex_set_waiters_bit(uaddr, &uval);
 		if (ret)
 			break;
 
+		if (owner_task && !state->owner)
+			task_pi_list_add(owner_task, state);
+
 		/*
 		 * Do a trylock after setting the task state to make
 		 * sure we won't miss a wakeup.
@@ -3593,8 +3661,13 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid,
 			goto retry;
 	}
 
-	if (owner_task)
+	if (owner_task) {
+		if (state->owner)
+			task_pi_list_del(state, false);
 		put_task_struct(owner_task);
+	} else {
+		WARN_ON(state->owner);
+	}
 
 	/*
 	 * Cleanup futex state.
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v4 09/20] TP-futex: Implement lock handoff to prevent lock starvation
  2016-12-29 16:13 [PATCH v4 00/20] futex: Introducing throughput-optimized (TP) futexes Waiman Long
                   ` (7 preceding siblings ...)
  2016-12-29 16:13 ` [PATCH v4 08/20] TP-futex: Enable robust handling Waiman Long
@ 2016-12-29 16:13 ` Waiman Long
  2016-12-29 16:13 ` [PATCH v4 10/20] TP-futex: Return status code on FUTEX_LOCK calls Waiman Long
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 28+ messages in thread
From: Waiman Long @ 2016-12-29 16:13 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Arnaldo Carvalho de Melo,
	Davidlohr Bueso, Mike Galbraith, Scott J Norton, Waiman Long

The current TP futexes have no guarantee that the top waiter
(serialization mutex owner) can get the lock within a finite time.
As a result, lock starvation can happen.

A lock handoff mechanism is added to the TP futexes to prevent lock
starvation from happening. The idea is that the top waiter can set a
special handoff_pid variable to its own pid. The futex owner will
then check this variable at unlock time to see if it should free the
futex or change its value to the designated pid to transfer the lock
directly to the top waiter.

This change does have the effect of limiting the amount of unfairness
and hence may adversely affect performance depending on the workload.

Currently the handoff mechanism is triggered when the top waiter fails
to acquire the lock after about 5ms of spinning and sleeping. So the
maximum lock handoff rate is about 200 handoffs per second.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/futex.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 53 insertions(+), 6 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 2d3ec8d..711a2b4 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -63,6 +63,7 @@
 #include <linux/pid.h>
 #include <linux/nsproxy.h>
 #include <linux/ptrace.h>
+#include <linux/sched.h>
 #include <linux/sched/rt.h>
 #include <linux/hugetlb.h>
 #include <linux/freezer.h>
@@ -235,6 +236,8 @@ struct futex_state {
 	struct task_struct *mutex_owner;	/* For TP futexes only */
 	atomic_t refcount;
 
+	u32 handoff_pid;			/* For TP futexes only */
+
 	enum futex_type type;
 	union futex_key key;
 };
@@ -912,6 +915,7 @@ static int refill_futex_state_cache(void)
 	/* pi_mutex gets initialized later */
 	state->owner = NULL;
 	state->mutex_owner = NULL;
+	state->handoff_pid = 0;
 	atomic_set(&state->refcount, 1);
 	state->key = FUTEX_KEY_INIT;
 
@@ -3335,8 +3339,9 @@ void exit_robust_list(struct task_struct *curr)
  * and unlocking.
  *
  * The purpose of this FUTEX_WAITERS bit is to make the unlocker wake up the
- * serialization mutex owner. Not having the FUTEX_WAITERS bit set doesn't
- * mean there is no waiter in the kernel.
+ * serialization mutex owner or to hand off the lock directly to the top
+ * waiter. Not having the FUTEX_WAITERS bit set doesn't mean there is no
+ * waiter in the kernel.
  *
  * Like PI futexes, TP futexes are orthogonal to robust futexes can be
  * used together.
@@ -3352,6 +3357,16 @@ void exit_robust_list(struct task_struct *curr)
  * the ownership of the futex.
  */
 
+/*
+ * Timeout value for enabling lock handoff and preventing lock starvation.
+ *
+ * Currently, lock handoff will be enabled if the top waiter can't get the
+ * futex after about 5ms. To reduce time checking overhead, it is only done
+ * after every 128 spins or right after wakeup. As a result, the actual
+ * elapsed time will be a bit longer than the specified value.
+ */
+#define TP_HANDOFF_TIMEOUT	5000000	/* 5ms	*/
+
 /**
  * lookup_futex_state - Looking up the futex state structure.
  * @hb:		 hash bucket
@@ -3431,6 +3446,7 @@ static inline int put_futex_state_unlocked(struct futex_state *state)
  * If waiter is true
  * then
  *   don't preserve the flag bits;
+ *   check for handoff (futex word == own pid)
  * else
  *   preserve the flag bits
  * endif
@@ -3449,6 +3465,9 @@ static inline int futex_trylock(u32 __user *uaddr, const u32 vpid, u32 *puval,
 
 	uval = *puval;
 
+	if (waiter && (uval & FUTEX_TID_MASK) == vpid)
+		return 1;
+
 	if (uval & FUTEX_TID_MASK)
 		return 0;	/* Trylock fails */
 
@@ -3542,10 +3561,12 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid,
 	"futex: owner pid %d of TP futex 0x%lx was %s.\n"	\
 	"\tLock is now acquired by pid %d!\n"
 
-	int ret;
+	int ret, loopcnt = 1;
+	bool handoff_set = false;
 	u32 uval;
 	u32 owner_pid = 0;
 	struct task_struct *owner_task = NULL;
+	u64 handoff_time = sched_clock() + TP_HANDOFF_TIMEOUT;
 
 	preempt_disable();
 	WRITE_ONCE(state->mutex_owner, current);
@@ -3600,6 +3621,7 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid,
 		if (need_resched()) {
 			__set_current_state(TASK_RUNNING);
 			schedule_preempt_disabled();
+			loopcnt = 0;
 			continue;
 		}
 
@@ -3608,6 +3630,23 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid,
 			break;
 		}
 
+		/*
+		 * Enable lock handoff if the elapsed time exceeds the timeout
+		 * value. We also need to set the FUTEX_WAITERS bit to make
+		 * sure that futex lock holder will initiate the handoff at
+		 * unlock time.
+		 */
+		if (!handoff_set && !(loopcnt++ & 0x7f)) {
+			if (sched_clock() > handoff_time) {
+				WARN_ON(READ_ONCE(state->handoff_pid));
+				ret = futex_set_waiters_bit(uaddr, &uval);
+				if (ret)
+					break;
+				WRITE_ONCE(state->handoff_pid, vpid);
+				handoff_set = true;
+			}
+		}
+
 		if (owner_task->on_cpu) {
 			cpu_relax();
 			continue;
@@ -3650,8 +3689,10 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid,
 		 * lock stealing happen in between setting the FUTEX_WAITERS
 		 * and setting state to TASK_INTERRUPTIBLE.
 		 */
-		if (!(uval & FUTEX_OWNER_DIED) && (uval & FUTEX_WAITERS))
+		if (!(uval & FUTEX_OWNER_DIED) && (uval & FUTEX_WAITERS)) {
 			schedule_preempt_disabled();
+			loopcnt = 0;
+		}
 		__set_current_state(TASK_RUNNING);
 	}
 
@@ -3673,6 +3714,7 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid,
 	 * Cleanup futex state.
 	 */
 	WRITE_ONCE(state->mutex_owner, NULL);
+	WRITE_ONCE(state->handoff_pid, 0);
 
 	preempt_enable();
 	return ret;
@@ -3787,6 +3829,7 @@ static noinline int futex_lock(u32 __user *uaddr, unsigned int flags)
 static int futex_unlock(u32 __user *uaddr, unsigned int flags)
 {
 	u32 uval, vpid = task_pid_vnr(current);
+	u32 newpid = 0;
 	union futex_key key = FUTEX_KEY_INIT;
 	struct futex_hash_bucket *hb;
 	struct futex_state *state = NULL;
@@ -3820,6 +3863,10 @@ static int futex_unlock(u32 __user *uaddr, unsigned int flags)
 		goto out_put_key;
 	}
 
+	newpid = READ_ONCE(state->handoff_pid);
+	if (newpid)
+		WRITE_ONCE(state->handoff_pid, 0);
+
 	owner = READ_ONCE(state->mutex_owner);
 	if (owner)
 		wake_q_add(&wake_q, owner);
@@ -3827,13 +3874,13 @@ static int futex_unlock(u32 __user *uaddr, unsigned int flags)
 	spin_unlock(&hb->fs_lock);
 
 	/*
-	 * Unlock the futex.
+	 * Unlock the futex or handoff to the next owner.
 	 * The flag bits are not preserved to encourage more lock stealing.
 	 */
 	for (;;) {
 		u32 old = uval;
 
-		if (cmpxchg_futex_value(&uval, uaddr, old, 0)) {
+		if (cmpxchg_futex_value(&uval, uaddr, old, newpid)) {
 			ret = -EFAULT;
 			break;
 		}
-- 
1.8.3.1


* [PATCH v4 10/20] TP-futex: Return status code on FUTEX_LOCK calls
  2016-12-29 16:13 [PATCH v4 00/20] futex: Introducing throughput-optimized (TP) futexes Waiman Long
                   ` (8 preceding siblings ...)
  2016-12-29 16:13 ` [PATCH v4 09/20] TP-futex: Implement lock handoff to prevent lock starvation Waiman Long
@ 2016-12-29 16:13 ` Waiman Long
  2016-12-29 16:13 ` [PATCH v4 11/20] TP-futex: Add timeout support Waiman Long
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 28+ messages in thread
From: Waiman Long @ 2016-12-29 16:13 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Arnaldo Carvalho de Melo,
	Davidlohr Bueso, Mike Galbraith, Scott J Norton, Waiman Long

To better understand how the TP futexes are performing, it is useful to
return the internal status of the TP futexes. The FUTEX_LOCK futex(2)
syscall will now return a positive status code if no error happens. The
status code consists of the following 3 fields:

 1) Bits 00-07: code on how the lock is acquired.
 2) Bits 08-15: reserved
 3) Bits 16-30: how many times the task sleeps in the optimistic
    spinning loop.

By returning the TP status code, an external monitoring or tracking
program can have a macro view of how the TP futexes are performing.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/futex.c | 43 ++++++++++++++++++++++++++++++++-----------
 1 file changed, 32 insertions(+), 11 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 711a2b4..3308cc3 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -3367,7 +3367,23 @@ void exit_robust_list(struct task_struct *curr)
  */
 #define TP_HANDOFF_TIMEOUT	5000000	/* 5ms	*/
 
-/**
+/*
+ * The futex_lock() function returns the internal status of the TP futex.
+ * The status code consists of the following 3 fields:
+ * 1) bits 00-07: code on how the lock is acquired
+ *		   0 - steals the lock
+ *		   1 - top waiter (mutex owner) acquires the lock
+ *		   2 - handed off the lock
+ * 2) bits 08-15: reserved
+ * 3) bits 16-30: how many times the task has slept or yielded to the
+ *		  scheduler in futex_spin_on_owner().
+ */
+#define TP_LOCK_STOLEN		0
+#define TP_LOCK_ACQUIRED	1
+#define TP_LOCK_HANDOFF		2
+#define TP_STATUS_SLEEP(val, sleep)	((val)|((sleep) << 16))
+
+/**
  * lookup_futex_state - Looking up the futex state structure.
  * @hb:		 hash bucket
  * @key:	 futex key
@@ -3451,9 +3467,11 @@ static inline int put_futex_state_unlocked(struct futex_state *state)
  *   preserve the flag bits
  * endif
  *
- * Return: 1 if lock acquired;
+ * Return: TP_LOCK_ACQUIRED if lock acquired;
+ *	   TP_LOCK_HANDOFF if lock was handed off;
  *	   0 if lock acquisition failed;
  *	   -EFAULT if an error happened.
+ *	   *puval will contain the latest futex value when trylock fails.
  */
 static inline int futex_trylock(u32 __user *uaddr, const u32 vpid, u32 *puval,
 				const bool waiter)
@@ -3466,7 +3484,7 @@ static inline int futex_trylock(u32 __user *uaddr, const u32 vpid, u32 *puval,
 	uval = *puval;
 
 	if (waiter && (uval & FUTEX_TID_MASK) == vpid)
-		return 1;
+		return TP_LOCK_HANDOFF;
 
 	if (uval & FUTEX_TID_MASK)
 		return 0;	/* Trylock fails */
@@ -3477,7 +3495,7 @@ static inline int futex_trylock(u32 __user *uaddr, const u32 vpid, u32 *puval,
 	if (unlikely(cmpxchg_futex_value(puval, uaddr, uval, vpid|flags)))
 		return -EFAULT;
 
-	return *puval == uval;
+	return (*puval == uval) ? TP_LOCK_ACQUIRED : 0;
 }
 
 /**
@@ -3491,7 +3509,8 @@ static inline int futex_trylock(u32 __user *uaddr, const u32 vpid, u32 *puval,
  * of faulting in the futex word. This function should only be called from
  * within futex_spin_on_owner().
  *
- * Return: 1 if lock acquired;
+ * Return: TP_LOCK_ACQUIRED if lock acquired;
+ *	   TP_LOCK_HANDOFF if lock was handed off;
  *	   0 if lock acquisition failed;
  *	   -EFAULT if an error happened.
  */
@@ -3552,7 +3571,7 @@ static inline int futex_set_waiters_bit(u32 __user *uaddr, u32 *puval)
  * unless the pid wraps around and the perceived owner is not the real owner.
  * To guard against this case, we will have to use the robust futex feature.
  *
- * Return: 0 if futex acquired, < 0 if an error happens.
+ * Return: TP status code if lock acquired, < 0 if an error happens.
  */
 static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid,
 			       struct futex_state *state)
@@ -3562,6 +3581,7 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid,
 	"\tLock is now acquired by pid %d!\n"
 
 	int ret, loopcnt = 1;
+	int nsleep = 0;
 	bool handoff_set = false;
 	u32 uval;
 	u32 owner_pid = 0;
@@ -3621,6 +3641,7 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid,
 		if (need_resched()) {
 			__set_current_state(TASK_RUNNING);
 			schedule_preempt_disabled();
+			nsleep++;
 			loopcnt = 0;
 			continue;
 		}
@@ -3691,6 +3712,7 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid,
 		 */
 		if (!(uval & FUTEX_OWNER_DIED) && (uval & FUTEX_WAITERS)) {
 			schedule_preempt_disabled();
+			nsleep++;
 			loopcnt = 0;
 		}
 		__set_current_state(TASK_RUNNING);
@@ -3717,7 +3739,7 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid,
 	WRITE_ONCE(state->handoff_pid, 0);
 
 	preempt_enable();
-	return ret;
+	return (ret < 0) ? ret : TP_STATUS_SLEEP(ret, nsleep);
 }
 
 /*
@@ -3731,8 +3753,7 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid,
  * This function is not inlined so that it can show up separately in perf
  * profile for performance analysis purpose.
  *
- * Return: 0   - lock acquired
- *	   < 0 - an error happens
+ * Return: TP status code if lock acquired, < 0 if an error happens.
  */
 static noinline int futex_lock(u32 __user *uaddr, unsigned int flags)
 {
@@ -3747,7 +3768,7 @@ static noinline int futex_lock(u32 __user *uaddr, unsigned int flags)
 	 */
 	ret = futex_trylock(uaddr, vpid, &uval, false);
 	if (ret)
-		goto out;
+		return (ret < 0) ? ret : TP_LOCK_STOLEN;
 
 	/*
 	 * Detect deadlocks.
@@ -3815,7 +3836,7 @@ static noinline int futex_lock(u32 __user *uaddr, unsigned int flags)
 	put_futex_key(&key);
 
 out:
-	return (ret < 0) ? ret : 0;
+	return ret;
 }
 
 /*
-- 
1.8.3.1


* [PATCH v4 11/20] TP-futex: Add timeout support
  2016-12-29 16:13 [PATCH v4 00/20] futex: Introducing throughput-optimized (TP) futexes Waiman Long
                   ` (9 preceding siblings ...)
  2016-12-29 16:13 ` [PATCH v4 10/20] TP-futex: Return status code on FUTEX_LOCK calls Waiman Long
@ 2016-12-29 16:13 ` Waiman Long
  2016-12-29 16:13 ` [PATCH v4 12/20] TP-futex, doc: Add TP futexes documentation Waiman Long
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 28+ messages in thread
From: Waiman Long @ 2016-12-29 16:13 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Arnaldo Carvalho de Melo,
	Davidlohr Bueso, Mike Galbraith, Scott J Norton, Waiman Long

Unlike other futexes, TP futexes do not accept the specification of
a timeout value. To align with the other futexes, timeout support is
now added.  However, the timeout isn't as precise as in the other
futex types because timer expiration can't be detected while the
thread is waiting in the serialization mutex.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/futex.c | 50 ++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 42 insertions(+), 8 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 3308cc3..429d85f 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -3556,9 +3556,10 @@ static inline int futex_set_waiters_bit(u32 __user *uaddr, u32 *puval)
 
 /**
  * futex_spin_on_owner - Optimistically spin on futex owner
- * @uaddr: futex address
- * @vpid:  PID of current task
- * @state: futex state object
+ * @uaddr:   futex address
+ * @vpid:    PID of current task
+ * @state:   futex state object
+ * @timeout: hrtimer_sleeper structure
  *
  * Spin on the futex word while the futex owner is active. Otherwise, set
  * the FUTEX_WAITERS bit and go to sleep.
@@ -3574,7 +3575,8 @@ static inline int futex_set_waiters_bit(u32 __user *uaddr, u32 *puval)
  * Return: TP status code if lock acquired, < 0 if an error happens.
  */
 static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid,
-			       struct futex_state *state)
+			       struct futex_state *state,
+			       struct hrtimer_sleeper *timeout)
 {
 #define OWNER_DEAD_MESSAGE					\
 	"futex: owner pid %d of TP futex 0x%lx was %s.\n"	\
@@ -3646,6 +3648,12 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid,
 			continue;
 		}
 
+
+		if (timeout && !timeout->task) {
+			ret = -ETIMEDOUT;
+			break;
+		}
+
 		if (signal_pending(current)) {
 			ret = -EINTR;
 			break;
@@ -3753,10 +3761,18 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid,
  * This function is not inlined so that it can show up separately in perf
  * profile for performance analysis purpose.
  *
+ * The timeout functionality isn't as precise as in the other futex types
+ * because timer expiration will not be detected while waiting in the
+ * serialization mutex. One possible solution is to modify the expiration
+ * function to send out a signal as well so as to break the thread out of
+ * the mutex code if we really want more precise timeout.
+ *
  * Return: TP status code if lock acquired, < 0 if an error happens.
  */
-static noinline int futex_lock(u32 __user *uaddr, unsigned int flags)
+static noinline int
+futex_lock(u32 __user *uaddr, unsigned int flags, ktime_t *time)
 {
+	struct hrtimer_sleeper timeout, *to;
 	struct futex_hash_bucket *hb;
 	union futex_key key = FUTEX_KEY_INIT;
 	struct futex_state *state;
@@ -3780,6 +3796,8 @@ static noinline int futex_lock(u32 __user *uaddr, unsigned int flags)
 	if (refill_futex_state_cache())
 		return -ENOMEM;
 
+	to = futex_setup_timer(time, &timeout, flags, current->timer_slack_ns);
+
 	ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_WRITE);
 	if (unlikely(ret))
 		goto out;
@@ -3805,6 +3823,9 @@ static noinline int futex_lock(u32 __user *uaddr, unsigned int flags)
 		goto out_put_state_key;
 	}
 
+	if (to)
+		hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
+
 	/*
 	 * Acquiring the serialization mutex.
 	 *
@@ -3816,10 +3837,19 @@ static noinline int futex_lock(u32 __user *uaddr, unsigned int flags)
 		goto out_put_state_key;
 
 	/*
+	 * Check for timeout.
+	 */
+	if (to && !timeout.task) {
+		ret = -ETIMEDOUT;
+		mutex_unlock(&state->mutex);
+		goto out_put_state_key;
+	}
+
+	/*
 	 * As the mutex owner, we can now spin on the futex word as well as
 	 * the active-ness of the futex owner.
 	 */
-	ret = futex_spin_on_owner(uaddr, vpid, state);
+	ret = futex_spin_on_owner(uaddr, vpid, state, to);
 
 	mutex_unlock(&state->mutex);
 
@@ -3836,6 +3866,10 @@ static noinline int futex_lock(u32 __user *uaddr, unsigned int flags)
 	put_futex_key(&key);
 
 out:
+	if (to) {
+		hrtimer_cancel(&to->timer);
+		destroy_hrtimer_on_stack(&to->timer);
+	}
 	return ret;
 }
 
@@ -3989,7 +4023,7 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 		return futex_requeue(uaddr, flags, uaddr2, val, val2, &val3, 1);
 #ifdef CONFIG_SMP
 	case FUTEX_LOCK:
-		return futex_lock(uaddr, flags);
+		return futex_lock(uaddr, flags, timeout);
 	case FUTEX_UNLOCK:
 		return futex_unlock(uaddr, flags);
 #endif
@@ -4008,7 +4042,7 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 	int cmd = op & FUTEX_CMD_MASK;
 
 	if (utime && (cmd == FUTEX_WAIT || cmd == FUTEX_LOCK_PI ||
-		      cmd == FUTEX_WAIT_BITSET ||
+		      cmd == FUTEX_WAIT_BITSET || cmd == FUTEX_LOCK ||
 		      cmd == FUTEX_WAIT_REQUEUE_PI)) {
 		if (unlikely(should_fail_futex(!(op & FUTEX_PRIVATE_FLAG))))
 			return -EFAULT;
-- 
1.8.3.1


* [PATCH v4 12/20] TP-futex, doc: Add TP futexes documentation
  2016-12-29 16:13 [PATCH v4 00/20] futex: Introducing throughput-optimized (TP) futexes Waiman Long
                   ` (10 preceding siblings ...)
  2016-12-29 16:13 ` [PATCH v4 11/20] TP-futex: Add timeout support Waiman Long
@ 2016-12-29 16:13 ` Waiman Long
  2016-12-29 16:13 ` [PATCH v4 13/20] perf bench: New microbenchmark for userspace mutex performance Waiman Long
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 28+ messages in thread
From: Waiman Long @ 2016-12-29 16:13 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Arnaldo Carvalho de Melo,
	Davidlohr Bueso, Mike Galbraith, Scott J Norton, Waiman Long

This patch adds a new document file on how to use the TP futexes.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/00-INDEX     |   2 +
 Documentation/tp-futex.txt | 161 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 163 insertions(+)
 create mode 100644 Documentation/tp-futex.txt

diff --git a/Documentation/00-INDEX b/Documentation/00-INDEX
index c8a8eb1..326b68c 100644
--- a/Documentation/00-INDEX
+++ b/Documentation/00-INDEX
@@ -416,6 +416,8 @@ this_cpu_ops.txt
 	- List rationale behind and the way to use this_cpu operations.
 thermal/
 	- directory with information on managing thermal issues (CPU/temp)
+tp-futex.txt
+	- Documentation on lightweight throughput-optimized futexes.
 trace/
 	- directory with info on tracing technologies within linux
 translations/
diff --git a/Documentation/tp-futex.txt b/Documentation/tp-futex.txt
new file mode 100644
index 0000000..5324ee0
--- /dev/null
+++ b/Documentation/tp-futex.txt
@@ -0,0 +1,161 @@
+Started by: Waiman Long <longman@redhat.com>
+
+Throughput-Optimized Futexes
+----------------------------
+
+There are two main problems for a wait-wake futex (FUTEX_WAIT and
+FUTEX_WAKE) when used for creating user-space locking primitives:
+
+ 1) With a wait-wake futex, tasks waiting for a lock are put to sleep
+    in the futex queue to be woken up by the lock owner when it is done
+    with the lock. Waking up a sleeping task, however, introduces some
+    additional latency which can be large especially if the critical
+    section protected by the lock is relatively short. This may cause
+    a performance bottleneck on large systems with many CPUs running
+    applications that need a lot of inter-thread synchronization.
+
+ 2) The performance of the wait-wake futex is currently
+    spinlock-constrained.  When many threads are contending for a
+    futex in a large system with many CPUs, it is not unusual to have
+    spinlock contention accounting for more than 90% of the total
+    CPU cycles consumed at various points in time.
+
+These two problems can create performance bottlenecks with a
+futex-constrained workload, especially on systems with a large
+number of CPUs.
+
+The goal of the throughput-optimized (TP) futexes is to maximize
+locking throughput at the expense of fairness and deterministic
+latency. This is done by encouraging lock stealing and optimistic
+spinning on a locked futex when the futex owner is running.  This is
+the same optimistic spinning mechanism used by the kernel mutex and rw
+semaphore implementations to improve performance. Optimistic spinning
+is done without taking any lock.
+
+Lock stealing is known to be a performance enhancement technique as
+long as the safeguards are in place to make sure that there will be no
+lock starvation.  The TP futexes have a built-in lock hand-off mechanism
+to prevent lock starvation from happening as long as the underlying
+kernel mutexes that the TP futexes use have no lock starvation problem.
+
+When the top lock waiter has failed to acquire the lock within a
+certain time threshold, it will initiate the hand-off mechanism by
+forcing the unlocker to transfer the lock to itself instead of freeing
+it for others to grab. This limits the maximum time a waiter has
+to wait for the lock.
+
+The downside of this improved throughput is the increased variance
+of the actual response times of the locking operations. Some locking
+operations will be very fast, while others may be considerably slower.
+The average response time should be better than the wait-wake futexes.
+
+Performance-wise, TP futexes should be faster than wait-wake futexes
+especially if the futex lock holders do not sleep. For workloads
+that do a lot of sleeping within the critical sections, the TP
+futexes may not be faster than the wait-wake futexes.
+
+Implementation
+--------------
+
+Like the PI and robust futexes, an exclusive lock acquirer has to
+atomically put its thread ID (TID) into the lower 30 bits of the
+32-bit futex, which should have an initial value of 0. If it succeeds,
+it will be the owner of the futex. Otherwise, it has to call into
+the kernel using the FUTEX_LOCK futex(2) syscall.
+
+  futex(uaddr, FUTEX_LOCK, 0, timeout, NULL, 0);
+
+Only the optional timeout parameter is being used by this futex(2)
+syscall.
+
+Inside the kernel, a kernel mutex is used for serialization among
+the futex waiters. Only the top lock waiter which is the owner of
+the serialization mutex is allowed to continuously spin and attempt
+to acquire the lock.  Other lock waiters will have one attempt to
+steal the lock before entering the mutex queues.
+
+When the exclusive futex lock owner is no longer running, the top
+waiter will set the FUTEX_WAITERS bit before going to sleep. This is
+to make sure the futex owner will go into the kernel at unlock time
+to wake up the top waiter.
+
+The return values of the above futex locking syscall, if non-negative,
+are status codes that consist of 2 fields - the lock acquisition code
+(bits 0-7) and the number of sleeps (bits 16-30) in the optimistic
+spinning loop before acquiring the futex. A negative returned value
+means an error has happened.
+
+The lock acquisition code can have the following values:
+ a) 0   - lock stolen as non-top waiter
+ b) 1   - lock acquired as the top waiter
+ c) 2   - lock explicitly handed off by the unlocker
+
+When it is time to unlock, the exclusive lock owner has to atomically
+change the futex value from its TID to 0. If that fails, it has to
+issue a FUTEX_UNLOCK futex(2) syscall to wake up the top waiter.
+
+  futex(uaddr, FUTEX_UNLOCK, 0, NULL, NULL, 0);
+
+A return value of 1 from the FUTEX_UNLOCK futex(2) syscall indicates
+a task has been woken up. The syscall returns 0 if no sleeping task
+is woken. A negative value will be returned if an error happens.
+
+The error number returned by a FUTEX_UNLOCK syscall on an empty futex
+can be used to decide if the TP futex functionality is implemented
+in the kernel. If it is present, an EPERM error will be returned.
+Otherwise it will return ENOSYS.
+
+TP futexes require the kernel to have SMP support as well as support
+for the cmpxchg functionality. For architectures that don't support
+cmpxchg, TP futexes will not be supported either.
+
+The exclusive locking TP futexes are orthogonal to the robust futexes
+and can be combined without problem. The TP futexes also have code
+to detect the death of an exclusive TP futex owner and handle the
+transfer of futex ownership automatically without the use of the
+robust futexes. The only case that the TP futexes cannot handle alone
+is the PID wrap-around issue, where another process that happens to
+get the same PID as the real futex owner is mis-identified as the
+owner of the futex.
+
+Usage Scenario
+--------------
+
+A TP futex can be used to implement a user-space exclusive lock
+or mutex to guard a critical section that is unlikely to go to
+sleep. The waiters on a TP futex, however, will fall back to sleep in
+a wait queue if the lock owner isn't running. Therefore, it can also be
+used when the critical section is long and prone to sleeping. However,
+it may not show a performance gain when compared with a wait-wake
+futex in this case.
+
+The wait-wake futexes are more versatile as they can also be used to
+implement other locking primitives like semaphores or condition
+variables.  So the TP futex is not a direct replacement for the
+wait-wake futex. However for userspace mutexes or rwlocks, the TP
+futex is likely a better option than the wait-wake futex.
+
+Sample Code
+-----------
+
+The following is sample code implementing simple mutex lock and
+unlock functions.
+
+__thread int thread_id;
+
+void mutex_lock(int *faddr)
+{
+	if (cmpxchg(faddr, 0, thread_id) == 0)
+		return;
+	while (futex(faddr, FUTEX_LOCK, ...) < 0)
+		;
+}
+
+void mutex_unlock(int *faddr)
+{
+	int old, fval;
+
+	if (cmpxchg(faddr, thread_id, 0) == thread_id)
+		return;
+	futex(faddr, FUTEX_UNLOCK, ...);
+}
-- 
1.8.3.1


* [PATCH v4 13/20] perf bench: New microbenchmark for userspace mutex performance
  2016-12-29 16:13 [PATCH v4 00/20] futex: Introducing throughput-optimized (TP) futexes Waiman Long
                   ` (11 preceding siblings ...)
  2016-12-29 16:13 ` [PATCH v4 12/20] TP-futex, doc: Add TP futexes documentation Waiman Long
@ 2016-12-29 16:13 ` Waiman Long
  2017-01-02 17:16   ` Arnaldo Carvalho de Melo
  2016-12-29 16:13 ` [PATCH v4 14/20] TP-futex: Support userspace reader/writer locks Waiman Long
                   ` (6 subsequent siblings)
  19 siblings, 1 reply; 28+ messages in thread
From: Waiman Long @ 2016-12-29 16:13 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Arnaldo Carvalho de Melo,
	Davidlohr Bueso, Mike Galbraith, Scott J Norton, Waiman Long

This microbenchmark simulates how the use of different futex types
can affect the actual performance of userspace mutex locks. The
usage is:

        perf bench futex mutex <options>

Three sets of simple mutex lock and unlock functions are implemented
using the wait-wake, PI and TP futexes respectively. This
microbenchmark then runs the locking rate measurement tests using
either one of those mutexes or all of them consecutively.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 tools/perf/bench/Build         |   1 +
 tools/perf/bench/bench.h       |   1 +
 tools/perf/bench/futex-locks.c | 844 +++++++++++++++++++++++++++++++++++++++++
 tools/perf/bench/futex.h       |  23 ++
 tools/perf/builtin-bench.c     |  10 +
 tools/perf/check-headers.sh    |   4 +
 6 files changed, 883 insertions(+)
 create mode 100644 tools/perf/bench/futex-locks.c

diff --git a/tools/perf/bench/Build b/tools/perf/bench/Build
index 60bf119..7ce1cd7 100644
--- a/tools/perf/bench/Build
+++ b/tools/perf/bench/Build
@@ -6,6 +6,7 @@ perf-y += futex-wake.o
 perf-y += futex-wake-parallel.o
 perf-y += futex-requeue.o
 perf-y += futex-lock-pi.o
+perf-y += futex-locks.o
 
 perf-$(CONFIG_X86_64) += mem-memcpy-x86-64-asm.o
 perf-$(CONFIG_X86_64) += mem-memset-x86-64-asm.o
diff --git a/tools/perf/bench/bench.h b/tools/perf/bench/bench.h
index 579a592..b0632df 100644
--- a/tools/perf/bench/bench.h
+++ b/tools/perf/bench/bench.h
@@ -36,6 +36,7 @@
 int bench_futex_requeue(int argc, const char **argv, const char *prefix);
 /* pi futexes */
 int bench_futex_lock_pi(int argc, const char **argv, const char *prefix);
+int bench_futex_mutex(int argc, const char **argv, const char *prefix);
 
 #define BENCH_FORMAT_DEFAULT_STR	"default"
 #define BENCH_FORMAT_DEFAULT		0
diff --git a/tools/perf/bench/futex-locks.c b/tools/perf/bench/futex-locks.c
new file mode 100644
index 0000000..ec1e627
--- /dev/null
+++ b/tools/perf/bench/futex-locks.c
@@ -0,0 +1,844 @@
+/*
+ * Copyright (C) 2016 Waiman Long <longman@redhat.com>
+ *
+ * This microbenchmark simulates how the use of different futex types can
+ * affect the actual performance of userspace locking primitives like mutex.
+ *
+ * The raw throughput of the futex lock and unlock calls is not a good
+ * indication of actual throughput of the mutex code as it may not really
+ * need to call into the kernel. Therefore, 3 sets of simple mutex lock and
+ * unlock functions are written to implement a mutex lock using the
+ * wait-wake, PI and TP futexes respectively. These functions serve as the
+ * basis for measuring the locking throughput.
+ */
+
+#include <pthread.h>
+
+#include <signal.h>
+#include <string.h>
+#include "../util/stat.h"
+#include "../perf-sys.h"
+#include <subcmd/parse-options.h>
+#include <linux/compiler.h>
+#include <linux/kernel.h>
+#include <errno.h>
+#include "bench.h"
+#include "futex.h"
+
+#include <err.h>
+#include <stdlib.h>
+#include <sys/time.h>
+
+#define CACHELINE_SIZE		64
+#define gettid()		syscall(SYS_gettid)
+#define __cacheline_aligned	__attribute__((__aligned__(CACHELINE_SIZE)))
+
+typedef u32 futex_t;
+typedef void (*lock_fn_t)(futex_t *futex, int tid);
+typedef void (*unlock_fn_t)(futex_t *futex, int tid);
+
+/*
+ * Statistical count list
+ */
+enum {
+	STAT_OPS,	/* # of exclusive locking operations	*/
+	STAT_LOCKS,	/* # of exclusive lock futex calls	*/
+	STAT_UNLOCKS,	/* # of exclusive unlock futex calls	*/
+	STAT_SLEEPS,	/* # of exclusive lock sleeps		*/
+	STAT_EAGAINS,	/* # of EAGAIN errors			*/
+	STAT_WAKEUPS,	/* # of wakeups (unlock return)		*/
+	STAT_HANDOFFS,	/* # of lock handoff (TP only)		*/
+	STAT_STEALS,	/* # of lock steals (TP only)		*/
+	STAT_LOCKERRS,	/* # of exclusive lock errors		*/
+	STAT_UNLKERRS,	/* # of exclusive unlock errors		*/
+	STAT_NUM	/* Total # of statistical count		*/
+};
+
+/*
+ * Syscall time list
+ */
+enum {
+	TIME_LOCK,	/* Total exclusive lock syscall time	*/
+	TIME_UNLK,	/* Total exclusive unlock syscall time	*/
+	TIME_NUM,
+};
+
+struct worker {
+	futex_t *futex;
+	pthread_t thread;
+
+	/*
+	 * Per-thread operation statistics
+	 */
+	u32 stats[STAT_NUM];
+
+	/*
+	 * Lock/unlock times
+	 */
+	u64 times[TIME_NUM];
+} __cacheline_aligned;
+
+/*
+ * Global cache-aligned futex
+ */
+static futex_t __cacheline_aligned global_futex;
+static futex_t *pfutex = &global_futex;
+
+static __thread futex_t thread_id;	/* Thread ID */
+static __thread int counter;		/* Sleep counter */
+
+static struct worker *worker, *worker_alloc;
+static unsigned int nsecs = 10;
+static bool verbose, done, fshared, exit_now, timestat;
+static unsigned int ncpus, nthreads;
+static int flags;
+static const char *ftype;
+static int loadlat = 1;
+static int locklat = 1;
+static int wratio;
+struct timeval start, end, runtime;
+static unsigned int worker_start;
+static unsigned int threads_starting;
+static unsigned int threads_stopping;
+static struct stats throughput_stats;
+static lock_fn_t mutex_lock_fn;
+static unlock_fn_t mutex_unlock_fn;
+
+/*
+ * Add the elapsed lock/unlock syscall time to the per-thread total
+ */
+static inline void systime_add(int tid, int item, struct timespec *begin,
+			       struct timespec *_end)
+{
+	worker[tid].times[item] += (_end->tv_sec  - begin->tv_sec)*1000000000 +
+				    _end->tv_nsec - begin->tv_nsec;
+}
+
+static inline double stat_percent(struct worker *w, int top, int bottom)
+{
+	return (double)w->stats[top] * 100 / w->stats[bottom];
+}
+
+/*
+ * Inline functions to update the statistical counts
+ *
+ * Enabling statistics collection may sometimes impact the locking rates
+ * being measured. So the DISABLE_STAT macro can be defined to disable
+ * statistics collection for all except the core locking rate counts.
+ *
+ * #define DISABLE_STAT
+ */
+#ifndef DISABLE_STAT
+static inline void stat_add(int tid, int item, int num)
+{
+	worker[tid].stats[item] += num;
+}
+
+static inline void stat_inc(int tid, int item)
+{
+	stat_add(tid, item, 1);
+}
+#else
+static inline void stat_add(int tid __maybe_unused, int item __maybe_unused,
+			    int num __maybe_unused)
+{
+}
+
+static inline void stat_inc(int tid __maybe_unused, int item __maybe_unused)
+{
+}
+#endif
+
+/*
+ * The latency value within a lock critical section (load) and between locking
+ * operations is in terms of the number of cpu_relax() calls that are being
+ * issued.
+ */
+static const struct option mutex_options[] = {
+	OPT_INTEGER ('d', "locklat",	&locklat,  "Specify inter-locking latency (default = 1)"),
+	OPT_STRING  ('f', "ftype",	&ftype,    "type", "Specify futex type: WW, PI, TP, all (default)"),
+	OPT_INTEGER ('L', "loadlat",	&loadlat,  "Specify load latency (default = 1)"),
+	OPT_UINTEGER('r', "runtime",	&nsecs,    "Specify runtime (in seconds, default = 10s)"),
+	OPT_BOOLEAN ('S', "shared",	&fshared,  "Use shared futexes instead of private ones"),
+	OPT_BOOLEAN ('T', "timestat",	&timestat, "Track lock/unlock syscall times"),
+	OPT_UINTEGER('t', "threads",	&nthreads, "Specify number of threads, default = # of CPUs"),
+	OPT_BOOLEAN ('v', "verbose",	&verbose,  "Verbose mode: display thread-level details"),
+	OPT_INTEGER ('w', "wait-ratio", &wratio,   "Specify <n>/1024 of load is 1us sleep, default = 0"),
+	OPT_END()
+};
+
+static const char * const bench_futex_mutex_usage[] = {
+	"perf bench futex mutex <options>",
+	NULL
+};
+
+/**
+ * futex_cmpxchg() - atomic compare and exchange
+ * @uaddr:	The address of the futex to be modified
+ * @old:	The expected value of the futex
+ * @new:	The new value to try and assign to the futex
+ *
+ * Implement cmpxchg using gcc atomic builtins.
+ *
+ * Return: the old futex value.
+ */
+static inline futex_t futex_cmpxchg(futex_t *uaddr, futex_t old, futex_t new)
+{
+	return __sync_val_compare_and_swap(uaddr, old, new);
+}
+
+/**
+ * futex_xchg() - atomic exchange
+ * @uaddr:	The address of the futex to be modified
+ * @new:	The new value to assign to the futex
+ *
+ * Implement xchg using gcc atomic builtins.
+ *
+ * Return: the old futex value.
+ */
+static inline futex_t futex_xchg(futex_t *uaddr, futex_t new)
+{
+	return __sync_lock_test_and_set(uaddr, new);
+}
+
+/**
+ * atomic_dec_return - atomically decrement & return the new value
+ * @uaddr:	The address of the futex to be decremented
+ * Return:	The new value
+ */
+static inline int atomic_dec_return(futex_t *uaddr)
+{
+	return __sync_sub_and_fetch(uaddr, 1);
+}
+
+/**
+ * atomic_inc_return - atomically increment & return the new value
+ * @uaddr:	The address of the futex to be incremented
+ * Return:	The new value
+ */
+static inline int atomic_inc_return(futex_t *uaddr)
+{
+	return __sync_add_and_fetch(uaddr, 1);
+}
+
+/**********************[ MUTEX lock/unlock functions ]*********************/
+
+/*
+ * Wait-wake futex lock/unlock functions (Glibc implementation)
+ * futex value: 0 - unlocked
+ *		1 - locked
+ *		2 - locked with waiters (contended)
+ */
+static void ww_mutex_lock(futex_t *futex, int tid)
+{
+	struct timespec stime, etime;
+	futex_t val = *futex;
+	int ret;
+
+	if (!val) {
+		val = futex_cmpxchg(futex, 0, 1);
+		if (val == 0)
+			return;
+	}
+
+	for (;;) {
+		if (val != 2) {
+			/*
+			 * Force value to 2 to indicate waiter
+			 */
+			val = futex_xchg(futex, 2);
+			if (val == 0)
+				return;
+		}
+		if (timestat) {
+			clock_gettime(CLOCK_REALTIME, &stime);
+			ret = futex_wait(futex, 2, NULL, flags);
+			clock_gettime(CLOCK_REALTIME, &etime);
+			systime_add(tid, TIME_LOCK, &stime, &etime);
+		} else {
+			ret = futex_wait(futex, 2, NULL, flags);
+		}
+
+		stat_inc(tid, STAT_LOCKS);
+		if (ret < 0) {
+			if (errno == EAGAIN)
+				stat_inc(tid, STAT_EAGAINS);
+			else
+				stat_inc(tid, STAT_LOCKERRS);
+		}
+
+		val = *futex;
+	}
+}
+
+static void ww_mutex_unlock(futex_t *futex, int tid)
+{
+	struct timespec stime, etime;
+	futex_t val;
+	int ret;
+
+	val = futex_xchg(futex, 0);
+
+	if (val == 2) {
+		stat_inc(tid, STAT_UNLOCKS);
+		if (timestat) {
+			clock_gettime(CLOCK_REALTIME, &stime);
+			ret = futex_wake(futex, 1, flags);
+			clock_gettime(CLOCK_REALTIME, &etime);
+			systime_add(tid, TIME_UNLK, &stime, &etime);
+		} else {
+			ret = futex_wake(futex, 1, flags);
+		}
+		if (ret < 0)
+			stat_inc(tid, STAT_UNLKERRS);
+		else
+			stat_add(tid, STAT_WAKEUPS, ret);
+	}
+}
+
+/*
+ * Alternate wait-wake futex lock/unlock functions with thread_id lock word
+ */
+static void ww2_mutex_lock(futex_t *futex, int tid)
+{
+	struct timespec stime, etime;
+	futex_t val = *futex;
+	int ret;
+
+	if (!val) {
+		val = futex_cmpxchg(futex, 0, thread_id);
+		if (val == 0)
+			return;
+	}
+
+	for (;;) {
+		/*
+		 * Set the FUTEX_WAITERS bit, if not set yet.
+		 */
+		while (!(val & FUTEX_WAITERS)) {
+			futex_t old;
+
+			if (!val) {
+				val = futex_cmpxchg(futex, 0, thread_id);
+				if (val == 0)
+					return;
+				continue;
+			}
+			old = futex_cmpxchg(futex, val, val | FUTEX_WAITERS);
+			if (old == val) {
+				val |= FUTEX_WAITERS;
+				break;
+			}
+			val = old;
+		}
+		if (timestat) {
+			clock_gettime(CLOCK_REALTIME, &stime);
+			ret = futex_wait(futex, val, NULL, flags);
+			clock_gettime(CLOCK_REALTIME, &etime);
+			systime_add(tid, TIME_LOCK, &stime, &etime);
+		} else {
+			ret = futex_wait(futex, val, NULL, flags);
+		}
+		stat_inc(tid, STAT_LOCKS);
+		if (ret < 0) {
+			if (errno == EAGAIN)
+				stat_inc(tid, STAT_EAGAINS);
+			else
+				stat_inc(tid, STAT_LOCKERRS);
+		}
+
+		val = *futex;
+	}
+}
+
+static void ww2_mutex_unlock(futex_t *futex, int tid)
+{
+	struct timespec stime, etime;
+	futex_t val;
+	int ret;
+
+	val = futex_xchg(futex, 0);
+
+	if ((val & FUTEX_TID_MASK) != thread_id)
+		stat_inc(tid, STAT_UNLKERRS);
+
+	if (val & FUTEX_WAITERS) {
+		stat_inc(tid, STAT_UNLOCKS);
+		if (timestat) {
+			clock_gettime(CLOCK_REALTIME, &stime);
+			ret = futex_wake(futex, 1, flags);
+			clock_gettime(CLOCK_REALTIME, &etime);
+			systime_add(tid, TIME_UNLK, &stime, &etime);
+		} else {
+			ret = futex_wake(futex, 1, flags);
+		}
+		if (ret < 0)
+			stat_inc(tid, STAT_UNLKERRS);
+		else
+			stat_add(tid, STAT_WAKEUPS, ret);
+	}
+}
+
+/*
+ * PI futex lock/unlock functions
+ */
+static void pi_mutex_lock(futex_t *futex, int tid)
+{
+	struct timespec stime, etime;
+	futex_t val;
+	int ret;
+
+	val = futex_cmpxchg(futex, 0, thread_id);
+	if (val == 0)
+		return;
+
+	/*
+	 * Retry if an error happens
+	 */
+	for (;;) {
+		if (timestat) {
+			clock_gettime(CLOCK_REALTIME, &stime);
+			ret = futex_lock_pi(futex, NULL, flags);
+			clock_gettime(CLOCK_REALTIME, &etime);
+			systime_add(tid, TIME_LOCK, &stime, &etime);
+		} else {
+			ret = futex_lock_pi(futex, NULL, flags);
+		}
+		stat_inc(tid, STAT_LOCKS);
+		if (ret >= 0)
+			break;
+		stat_inc(tid, STAT_LOCKERRS);
+	}
+}
+
+static void pi_mutex_unlock(futex_t *futex, int tid)
+{
+	struct timespec stime, etime;
+	futex_t val;
+	int ret;
+
+	val = futex_cmpxchg(futex, thread_id, 0);
+	if (val == thread_id)
+		return;
+
+	if (timestat) {
+		clock_gettime(CLOCK_REALTIME, &stime);
+		ret = futex_unlock_pi(futex, flags);
+		clock_gettime(CLOCK_REALTIME, &etime);
+		systime_add(tid, TIME_UNLK, &stime, &etime);
+	} else {
+		ret = futex_unlock_pi(futex, flags);
+	}
+	if (ret < 0)
+		stat_inc(tid, STAT_UNLKERRS);
+	else
+		stat_add(tid, STAT_WAKEUPS, ret);
+	stat_inc(tid, STAT_UNLOCKS);
+}
+
+/*
+ * TP futex lock/unlock functions
+ */
+static void tp_mutex_lock(futex_t *futex, int tid)
+{
+	struct timespec stime, etime;
+	futex_t val;
+	int ret;
+
+	val = futex_cmpxchg(futex, 0, thread_id);
+	if (val == 0)
+		return;
+
+	/*
+	 * Retry if an error happens
+	 */
+	for (;;) {
+		if (timestat) {
+			clock_gettime(CLOCK_REALTIME, &stime);
+			ret = futex_lock(futex, NULL, flags);
+			clock_gettime(CLOCK_REALTIME, &etime);
+			systime_add(tid, TIME_LOCK, &stime, &etime);
+		} else {
+			ret = futex_lock(futex, NULL, flags);
+		}
+		stat_inc(tid, STAT_LOCKS);
+		if (ret >= 0)
+			break;
+		stat_inc(tid, STAT_LOCKERRS);
+	}
+	/*
+	 * Get # of sleeps & locking method
+	 */
+	stat_add(tid, STAT_SLEEPS, ret >> 16);
+	ret &= 0xff;
+	if (!ret)
+		stat_inc(tid, STAT_STEALS);
+	else if (ret == 2)
+		stat_inc(tid, STAT_HANDOFFS);
+}
+
+static void tp_mutex_unlock(futex_t *futex, int tid)
+{
+	struct timespec stime, etime;
+	futex_t val;
+	int ret;
+
+	val = futex_cmpxchg(futex, thread_id, 0);
+	if (val == thread_id)
+		return;
+
+	if (timestat) {
+		clock_gettime(CLOCK_REALTIME, &stime);
+		ret = futex_unlock(futex, flags);
+		clock_gettime(CLOCK_REALTIME, &etime);
+		systime_add(tid, TIME_UNLK, &stime, &etime);
+	} else {
+		ret = futex_unlock(futex, flags);
+	}
+	stat_inc(tid, STAT_UNLOCKS);
+	if (ret < 0)
+		stat_inc(tid, STAT_UNLKERRS);
+	else
+		stat_add(tid, STAT_WAKEUPS, ret);
+}
+
+/**************************************************************************/
+
+/*
+ * Load function
+ */
+static inline void load(int tid)
+{
+	int n = loadlat;
+
+	/*
+	 * Optionally do a 1us sleep instead if wratio is set and the
+	 * counter value falls within bound.
+	 */
+	if (wratio && (((counter++ + tid) & 0x3ff) < wratio)) {
+		usleep(1);
+		return;
+	}
+
+	while (n-- > 0)
+		cpu_relax();
+}
+
+static inline void csdelay(void)
+{
+	int n = locklat;
+
+	while (n-- > 0)
+		cpu_relax();
+}
+
+static void toggle_done(int sig __maybe_unused,
+			siginfo_t *info __maybe_unused,
+			void *uc __maybe_unused)
+{
+	/* inform all threads that we're done for the day */
+	done = true;
+	gettimeofday(&end, NULL);
+	timersub(&end, &start, &runtime);
+	if (sig)
+		exit_now = true;
+}
+
+static void *mutex_workerfn(void *arg)
+{
+	long tid = (long)arg;
+	struct worker *w = &worker[tid];
+	lock_fn_t lock_fn = mutex_lock_fn;
+	unlock_fn_t unlock_fn = mutex_unlock_fn;
+
+	thread_id = gettid();
+	counter = 0;
+
+	atomic_dec_return(&threads_starting);
+
+	/*
+	 * Busy wait until asked to start
+	 */
+	while (!worker_start)
+		cpu_relax();
+
+	do {
+		lock_fn(w->futex, tid);
+		load(tid);
+		unlock_fn(w->futex, tid);
+		w->stats[STAT_OPS]++;	/* One more locking operation */
+		csdelay();
+	}  while (!done);
+
+	if (verbose)
+		printf("[thread %3ld (%d)] exited.\n", tid, thread_id);
+	atomic_inc_return(&threads_stopping);
+	return NULL;
+}
+
+static void create_threads(struct worker *w, pthread_attr_t *thread_attr,
+			   void *(*workerfn)(void *arg), long tid)
+{
+	cpu_set_t cpu;
+
+	/*
+	 * Bind each thread to a CPU
+	 */
+	CPU_ZERO(&cpu);
+	CPU_SET(tid % ncpus, &cpu);
+	w->futex = pfutex;
+
+	if (pthread_attr_setaffinity_np(thread_attr, sizeof(cpu_set_t), &cpu))
+		err(EXIT_FAILURE, "pthread_attr_setaffinity_np");
+
+	if (pthread_create(&w->thread, thread_attr, workerfn, (void *)tid))
+		err(EXIT_FAILURE, "pthread_create");
+}
+
+static int futex_mutex_type(const char **ptype)
+{
+	const char *type = *ptype;
+
+	if (!strcasecmp(type, "WW")) {
+		*ptype = "WW";
+		mutex_lock_fn = ww_mutex_lock;
+		mutex_unlock_fn = ww_mutex_unlock;
+	} else if (!strcasecmp(type, "WW2")) {
+		*ptype = "WW2";
+		mutex_lock_fn = ww2_mutex_lock;
+		mutex_unlock_fn = ww2_mutex_unlock;
+	} else if (!strcasecmp(type, "PI")) {
+		*ptype = "PI";
+		mutex_lock_fn = pi_mutex_lock;
+		mutex_unlock_fn = pi_mutex_unlock;
+	} else if (!strcasecmp(type, "TP")) {
+		*ptype = "TP";
+		mutex_lock_fn = tp_mutex_lock;
+		mutex_unlock_fn = tp_mutex_unlock;
+
+		/*
+		 * Check if TP futex is supported.
+		 */
+		futex_unlock(&global_futex, 0);
+		if (errno == ENOSYS) {
+			fprintf(stderr,
+			    "\nTP futexes are not supported by the kernel!\n");
+			return -1;
+		}
+	} else {
+		return -1;
+	}
+	return 0;
+}
+
+static void futex_test_driver(const char *futex_type,
+			      int (*proc_type)(const char **ptype),
+			      void *(*workerfn)(void *arg))
+{
+	u64 us;
+	int i, j;
+	struct worker total;
+	double avg, stddev;
+	pthread_attr_t thread_attr;
+
+	/*
+	 * There is an extra blank line before the error counts to highlight
+	 * them.
+	 */
+	const char *desc[STAT_NUM] = {
+		[STAT_OPS]	 = "Total exclusive locking ops",
+		[STAT_LOCKS]	 = "Exclusive lock futex calls",
+		[STAT_UNLOCKS]	 = "Exclusive unlock futex calls",
+		[STAT_SLEEPS]	 = "Exclusive lock sleeps",
+		[STAT_WAKEUPS]	 = "Process wakeups",
+		[STAT_EAGAINS]	 = "EAGAIN lock errors",
+		[STAT_HANDOFFS]  = "Lock handoffs",
+		[STAT_STEALS]	 = "Lock stealings",
+		[STAT_LOCKERRS]  = "\nExclusive lock errors",
+		[STAT_UNLKERRS]  = "\nExclusive unlock errors",
+	};
+
+	if (exit_now)
+		return;
+
+	if (proc_type(&futex_type) < 0) {
+		fprintf(stderr, "Unknown futex type '%s'!\n", futex_type);
+		exit(1);
+	}
+
+	printf("\n=====================================\n");
+	printf("[PID %d]: %d threads doing %s futex lockings (load=%d) for %d secs.\n\n",
+	       getpid(), nthreads, futex_type, loadlat, nsecs);
+
+	init_stats(&throughput_stats);
+
+	*pfutex = 0;
+	done = false;
+	threads_starting = nthreads;
+	pthread_attr_init(&thread_attr);
+
+	for (i = 0; i < (int)nthreads; i++)
+		create_threads(&worker[i], &thread_attr, workerfn, i);
+
+	while (threads_starting)
+		usleep(1);
+
+	gettimeofday(&start, NULL);
+
+	/*
+	 * Start the test
+	 *
+	 * Unlike the other futex benchmarks, this one uses busy waiting
+	 * instead of pthread APIs to make sure that all the threads (except
+	 * the one that shares CPU with the parent) will start more or less
+	 * simultaneously.
+	 */
+	atomic_inc_return(&worker_start);
+	sleep(nsecs);
+	toggle_done(0, NULL, NULL);
+
+	/*
+	 * In verbose mode, we check if all the threads have been stopped
+	 * after 1ms and report the status if some are still running.
+	 */
+	if (verbose) {
+		usleep(1000);
+		if (threads_stopping != nthreads) {
+			printf("%d threads still running 1ms after timeout"
+				" - futex = 0x%x\n",
+				nthreads - threads_stopping, *pfutex);
+			/*
+			 * If the threads are still running after 10s,
+			 * go directly to statistics printing and exit.
+			 */
+			for (i = 10; i > 0; i--) {
+				sleep(1);
+				if (threads_stopping == nthreads)
+					break;
+			}
+			if (!i) {
+				printf("*** Threads waiting ABORTED!! ***\n\n");
+				goto print_stat;
+			}
+		}
+	}
+
+	for (i = 0; i < (int)nthreads; i++) {
+		int ret = pthread_join(worker[i].thread, NULL);
+
+		if (ret)
+			err(EXIT_FAILURE, "pthread_join");
+	}
+
+print_stat:
+	pthread_attr_destroy(&thread_attr);
+
+	us = runtime.tv_sec * 1000000 + runtime.tv_usec;
+	memset(&total, 0, sizeof(total));
+	for (i = 0; i < (int)nthreads; i++) {
+		/*
+		 * Get a rounded estimate of the # of locking ops/sec.
+		 */
+		u64 tp = (u64)worker[i].stats[STAT_OPS] * 1000000 / us;
+
+		for (j = 0; j < STAT_NUM; j++)
+			total.stats[j] += worker[i].stats[j];
+
+		update_stats(&throughput_stats, tp);
+		if (verbose)
+			printf("[thread %3d] futex: %p [ %'ld ops/sec ]\n",
+			       i, worker[i].futex, (long)tp);
+	}
+
+	avg    = avg_stats(&throughput_stats);
+	stddev = stddev_stats(&throughput_stats);
+
+	printf("Locking statistics:\n");
+	printf("%-28s = %'.2fs\n", "Test run time", (double)us/1000000);
+	for (i = 0; i < STAT_NUM; i++)
+		if (total.stats[i])
+			printf("%-28s = %'d\n", desc[i], total.stats[i]);
+
+	if (timestat && total.times[TIME_LOCK]) {
+		printf("\nSyscall times:\n");
+		if (total.stats[STAT_LOCKS])
+			printf("Avg exclusive lock syscall   = %'ldns\n",
+			    total.times[TIME_LOCK]/total.stats[STAT_LOCKS]);
+		if (total.stats[STAT_UNLOCKS])
+			printf("Avg exclusive unlock syscall = %'ldns\n",
+			    total.times[TIME_UNLK]/total.stats[STAT_UNLOCKS]);
+	}
+
+	printf("\nPercentages:\n");
+	if (total.stats[STAT_LOCKS])
+		printf("Exclusive lock futex calls   = %.1f%%\n",
+			stat_percent(&total, STAT_LOCKS, STAT_OPS));
+	if (total.stats[STAT_UNLOCKS])
+		printf("Exclusive unlock futex calls = %.1f%%\n",
+			stat_percent(&total, STAT_UNLOCKS, STAT_OPS));
+	if (total.stats[STAT_EAGAINS])
+		printf("EAGAIN lock errors           = %.1f%%\n",
+			stat_percent(&total, STAT_EAGAINS, STAT_LOCKS));
+	if (total.stats[STAT_WAKEUPS])
+		printf("Process wakeups              = %.1f%%\n",
+			stat_percent(&total, STAT_WAKEUPS, STAT_UNLOCKS));
+
+	printf("\nPer-thread Locking Rates:\n");
+	printf("Avg = %'d ops/sec (+- %.2f%%)\n", (int)(avg + 0.5),
+		rel_stddev_stats(stddev, avg));
+	printf("Min = %'d ops/sec\n", (int)throughput_stats.min);
+	printf("Max = %'d ops/sec\n", (int)throughput_stats.max);
+
+	if (*pfutex != 0)
+		printf("\nResidual futex value = 0x%x\n", *pfutex);
+
+	/* Clear the workers area */
+	memset(worker, 0, sizeof(*worker) * nthreads);
+}
+
+int bench_futex_mutex(int argc, const char **argv,
+		      const char *prefix __maybe_unused)
+{
+	struct sigaction act;
+
+	argc = parse_options(argc, argv, mutex_options,
+			     bench_futex_mutex_usage, 0);
+	if (argc)
+		goto err;
+
+	ncpus = sysconf(_SC_NPROCESSORS_ONLN);
+
+	sigfillset(&act.sa_mask);
+	act.sa_sigaction = toggle_done;
+	sigaction(SIGINT, &act, NULL);
+
+	if (!nthreads)
+		nthreads = ncpus;
+
+	/*
+	 * Since the allocated memory buffer may not be properly cacheline
+	 * aligned, we need to allocate one more than needed and manually
+	 * adjust array boundary to be cacheline aligned.
+	 */
+	worker_alloc = calloc(nthreads + 1, sizeof(*worker));
+	if (!worker_alloc)
+		err(EXIT_FAILURE, "calloc");
+	worker = (void *)((unsigned long)&worker_alloc[1] &
+					~(CACHELINE_SIZE - 1));
+
+	if (!fshared)
+		flags = FUTEX_PRIVATE_FLAG;
+
+	if (!ftype || !strcmp(ftype, "all")) {
+		futex_test_driver("WW", futex_mutex_type, mutex_workerfn);
+		futex_test_driver("PI", futex_mutex_type, mutex_workerfn);
+		futex_test_driver("TP", futex_mutex_type, mutex_workerfn);
+	} else {
+		futex_test_driver(ftype, futex_mutex_type, mutex_workerfn);
+	}
+	free(worker_alloc);
+	return 0;
+err:
+	usage_with_options(bench_futex_mutex_usage, mutex_options);
+	exit(EXIT_FAILURE);
+}
diff --git a/tools/perf/bench/futex.h b/tools/perf/bench/futex.h
index ba7c735..be1eb7a 100644
--- a/tools/perf/bench/futex.h
+++ b/tools/perf/bench/futex.h
@@ -87,6 +87,29 @@
 		 val, opflags);
 }
 
+#ifndef FUTEX_LOCK
+#define FUTEX_LOCK	0xff	/* Disable the use of TP futexes */
+#define FUTEX_UNLOCK	0xff
+#endif /* FUTEX_LOCK */
+
+/**
+ * futex_lock() - lock the TP futex
+ */
+static inline int
+futex_lock(u_int32_t *uaddr, struct timespec *timeout, int opflags)
+{
+	return futex(uaddr, FUTEX_LOCK, 0, timeout, NULL, 0, opflags);
+}
+
+/**
+ * futex_unlock() - unlock the TP futex
+ */
+static inline int
+futex_unlock(u_int32_t *uaddr, int opflags)
+{
+	return futex(uaddr, FUTEX_UNLOCK, 0, NULL, NULL, 0, opflags);
+}
+
 #ifndef HAVE_PTHREAD_ATTR_SETAFFINITY_NP
 #include <pthread.h>
 static inline int pthread_attr_setaffinity_np(pthread_attr_t *attr,
diff --git a/tools/perf/builtin-bench.c b/tools/perf/builtin-bench.c
index a1cddc6..bf4418d 100644
--- a/tools/perf/builtin-bench.c
+++ b/tools/perf/builtin-bench.c
@@ -20,6 +20,7 @@
 #include "builtin.h"
 #include "bench/bench.h"
 
+#include <locale.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
@@ -62,6 +63,7 @@ struct bench {
 	{ "requeue",	"Benchmark for futex requeue calls",            bench_futex_requeue	},
 	/* pi-futexes */
 	{ "lock-pi",	"Benchmark for futex lock_pi calls",            bench_futex_lock_pi	},
+	{ "mutex",	"Benchmark for mutex locks using futexes",	bench_futex_mutex	},
 	{ "all",	"Run all futex benchmarks",			NULL			},
 	{ NULL,		NULL,						NULL			}
 };
@@ -215,6 +217,7 @@ int cmd_bench(int argc, const char **argv, const char *prefix __maybe_unused)
 {
 	struct collection *coll;
 	int ret = 0;
+	char *locale;
 
 	if (argc < 2) {
 		/* No collection specified. */
@@ -222,6 +225,13 @@ int cmd_bench(int argc, const char **argv, const char *prefix __maybe_unused)
 		goto end;
 	}
 
+	/*
+	 * Enable better number formatting.
+	 */
+	locale = setlocale(LC_NUMERIC, "");
+	if (locale && !strcmp(locale, "C"))
+		setlocale(LC_NUMERIC, "en_US");
+
 	argc = parse_options(argc, argv, bench_options, bench_usage,
 			     PARSE_OPT_STOP_AT_NON_OPTION);
 
diff --git a/tools/perf/check-headers.sh b/tools/perf/check-headers.sh
index c747bfd..6ec5f2d 100755
--- a/tools/perf/check-headers.sh
+++ b/tools/perf/check-headers.sh
@@ -57,3 +57,7 @@ check arch/x86/lib/memcpy_64.S        -B -I "^EXPORT_SYMBOL" -I "^#include <asm/
 check arch/x86/lib/memset_64.S        -B -I "^EXPORT_SYMBOL" -I "^#include <asm/export.h>"
 check include/uapi/asm-generic/mman.h -B -I "^#include <\(uapi/\)*asm-generic/mman-common.h>"
 check include/uapi/linux/mman.h       -B -I "^#include <\(uapi/\)*asm/mman.h>"
+
+# need to link to the latest futex.h header
+[[ -f ../../include/uapi/linux/futex.h ]] &&
+	ln -sf  ../../../../../include/uapi/linux/futex.h util/include/linux
-- 
1.8.3.1


* [PATCH v4 14/20] TP-futex: Support userspace reader/writer locks
  2016-12-29 16:13 [PATCH v4 00/20] futex: Introducing throughput-optimized (TP) futexes Waiman Long
                   ` (12 preceding siblings ...)
  2016-12-29 16:13 ` [PATCH v4 13/20] perf bench: New microbenchmark for userspace mutex performance Waiman Long
@ 2016-12-29 16:13 ` Waiman Long
  2016-12-29 18:34   ` kbuild test robot
  2016-12-29 16:13 ` [PATCH v4 15/20] TP-futex: Enable kernel reader lock stealing Waiman Long
                   ` (5 subsequent siblings)
  19 siblings, 1 reply; 28+ messages in thread
From: Waiman Long @ 2016-12-29 16:13 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Arnaldo Carvalho de Melo,
	Davidlohr Bueso, Mike Galbraith, Scott J Norton, Waiman Long

The TP futexes were designed to support userspace mutually exclusive
locks. They are now extended to support userspace reader/writer
locks as well.

Two new futex command codes are added:
 1) FUTEX_LOCK_SHARED - to acquire a shared lock (reader-lock)
 2) FUTEX_UNLOCK_SHARED - to release a shared lock (reader-unlock)

The original FUTEX_LOCK and FUTEX_UNLOCK command codes act as the
writer lock and unlock methods of a userspace rwlock.

A special bit FUTEX_SHARED (bit 29) is used to indicate if a lock is
an exclusive (writer) or shared (reader) lock. Because of that, the
maximum TID value is now reduced to 2^29 instead of 2^30. With a
shared lock, the lower 24 bits are used to track the reader count.
Bit 28 is a special FUTEX_SHARED_UNLOCK bit that should be used by
userspace to prevent racing in the FUTEX_UNLOCK_SHARED futex(2)
syscall. Bits 24-27 are reserved for future extension.

When a TP futex becomes reader-owned, there is no way to tell if the
readers are running or not. So a fixed time interval (50us) is now
used for spinning on a reader-owned futex. The top waiter will abort
the spinning and go to sleep when no activity is observed in the
futex value after the spinning timer expires. Activity here means
any change in the reader count of the futex, indicating that new
readers are coming in or old readers are leaving.

Sample code to acquire/release a reader lock in userspace is:

  void read_lock(int *faddr)
  {
  	int val = cmpxchg(faddr, 0, FUTEX_SHARED + 1);

	if (!val)
		return;
	for (;;) {
		int old = val, new;

		if (!old)
			new = FUTEX_SHARED + 1;
		else if ((old & (~FUTEX_SHARED_TID_MASK |
				  FUTEX_SHARED_UNLOCK)) ||
			 !(old & FUTEX_SHARED))
			break;
		else
			new = old + 1;
		val = cmpxchg(faddr, old, new);
		if (val == old)
			return;
	}
	while (futex(faddr, FUTEX_LOCK_SHARED, ...) < 0)
		;
  }

  void read_unlock(int *faddr)
  {
	int val = xadd(faddr, -1);
	int old;

	for (;;) {
		if ((val & (FUTEX_SCNT_MASK|FUTEX_SHARED_UNLOCK)) ||
		    !(val & FUTEX_SHARED))
			return;	/* Not the last reader or shared */
		if (val & ~FUTEX_SHARED_TID_MASK) {
			old = cmpxchg(faddr, val, val|FUTEX_SHARED_UNLOCK);
			if (old == val)
				break;
		} else {
			old = cmpxchg(faddr, val, 0);
			if (old == val)
				return;
		}
		val = old;
	}
	futex(faddr, FUTEX_UNLOCK_SHARED, ...);
  }

The read lock/unlock code is a bit more complicated than the
corresponding write lock/unlock code. This is by design to minimize
any additional overhead for mutex lock and unlock. As a result,
the TP futex rwlock prefers writers a bit more than readers.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/uapi/linux/futex.h |  28 +++++-
 kernel/futex.c             | 227 ++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 219 insertions(+), 36 deletions(-)

diff --git a/include/uapi/linux/futex.h b/include/uapi/linux/futex.h
index 7f676ac..1a22ecc 100644
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -22,6 +22,8 @@
 #define FUTEX_CMP_REQUEUE_PI	12
 #define FUTEX_LOCK		13
 #define FUTEX_UNLOCK		14
+#define FUTEX_LOCK_SHARED	15
+#define FUTEX_UNLOCK_SHARED	16
 
 #define FUTEX_PRIVATE_FLAG	128
 #define FUTEX_CLOCK_REALTIME	256
@@ -43,6 +45,9 @@
 					 FUTEX_PRIVATE_FLAG)
 #define FUTEX_LOCK_PRIVATE	(FUTEX_LOCK | FUTEX_PRIVATE_FLAG)
 #define FUTEX_UNLOCK_PRIVATE	(FUTEX_UNLOCK | FUTEX_PRIVATE_FLAG)
+#define FUTEX_LOCK_SHARED_PRIVATE	(FUTEX_LOCK_SHARED | FUTEX_PRIVATE_FLAG)
+#define FUTEX_UNLOCK_SHARED_PRIVATE	(FUTEX_UNLOCK_SHARED | \
+					 FUTEX_PRIVATE_FLAG)
 
 /*
  * Support for robust futexes: the kernel cleans up held futexes at
@@ -111,9 +116,28 @@ struct robust_list_head {
 #define FUTEX_OWNER_DIED	0x40000000
 
 /*
- * The rest of the robust-futex field is for the TID:
+ * Shared locking bit (for readers).
+ * Robust futexes cannot be used in combination with shared locking.
+ *
+ * The FUTEX_SHARED_UNLOCK bit can be set by userspace to indicate that
+ * a shared unlock is in progress and so no shared locking is allowed
+ * within the kernel.
+ */
+#define FUTEX_SHARED		0x20000000
+#define FUTEX_SHARED_UNLOCK	0x10000000
+
+/*
+ * The rest of the bits are for the TID or, in the case of readers, the
+ * shared locking bit and the number of active readers present. IOW, the
+ * max supported TID is actually 0x1fffffff. Only 24 bits are used for
+ * tracking the number of shared futex owners. Bits 24-27 may be used by
+ * the userspace for internal management purposes. However, a successful
+ * FUTEX_UNLOCK_SHARED futex(2) syscall will clear or alter those bits.
  */
-#define FUTEX_TID_MASK		0x3fffffff
+#define FUTEX_TID_MASK		0x1fffffff
+#define FUTEX_SCNT_MASK		0x00ffffff	/* Shared futex owner count */
+#define FUTEX_SHARED_TID_MASK	0x3fffffff	/* FUTEX_SHARED bit + TID   */
+#define FUTEX_SHARED_SCNT_MASK	0x20ffffff	/* FUTEX_SHARED bit + count */
 
 /*
  * This limit protects against a deliberately circular list.
diff --git a/kernel/futex.c b/kernel/futex.c
index 429d85f..ee35264 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -3176,7 +3176,7 @@ int handle_futex_death(u32 __user *uaddr, struct task_struct *curr, int pi)
 	if (get_user(uval, uaddr))
 		return -1;
 
-	if ((uval & FUTEX_TID_MASK) == task_pid_vnr(curr)) {
+	if ((uval & FUTEX_SHARED_TID_MASK) == task_pid_vnr(curr)) {
 		/*
 		 * Ok, this dying thread is truly holding a futex
 		 * of interest. Set the OWNER_DIED bit atomically
@@ -3355,17 +3355,30 @@ void exit_robust_list(struct task_struct *curr)
  * Checks are also made in the futex_spin_on_owner() loop for dead task
  * structure or invalid pid. In both cases, the top waiter will take over
  * the ownership of the futex.
+ *
+ * TP futexes can also be used as a shared lock. However, TP futexes in
+ * shared locking mode cannot be used with robust futexes. Userspace
+ * reader/writer locks can now be implemented with TP futexes.
  */
 
 /*
- * Timeout value for enabling lock handoff and prevent lock starvation.
+ * Timeout value for enabling lock handoff and preventing lock starvation.
  *
  * Currently, lock handoff will be enabled if the top waiter can't get the
  * futex after about 5ms. To reduce time checking overhead, it is only done
  * after every 128 spins or right after wakeup. As a result, the actual
  * elapsed time will be a bit longer than the specified value.
+ *
+ * For reader-owned futex spinning, the timeout is 50us. The timeout value
+ * will be reset whenever activity (i.e. any change in the reader count)
+ * is detected in the futex value. However, reader spinning will never
+ * exceed the handoff timeout plus an additional reader spinning timeout.
  */
 #define TP_HANDOFF_TIMEOUT	5000000	/* 5ms	*/
+#define TP_RSPIN_TIMEOUT	50000	/* 50us	*/
+
+#define TP_FUTEX_SHARED_UNLOCK_READY(uval)	\
+	(((uval) & FUTEX_SHARED_SCNT_MASK) == FUTEX_SHARED)
 
 /*
  * The futex_lock() function returns the internal status of the TP futex.
@@ -3383,7 +3396,7 @@ void exit_robust_list(struct task_struct *curr)
 #define TP_LOCK_HANDOFF		2
 #define TP_STATUS_SLEEP(val, sleep)	((val)|((sleep) << 16))
 
- /**
+/**
  * lookup_futex_state - Looking up the futex state structure.
  * @hb:		 hash bucket
  * @key:	 futex key
@@ -3457,7 +3470,8 @@ static inline int put_futex_state_unlocked(struct futex_state *state)
  * @waiter - true for top waiter, false otherwise
  *
  * The HB fs_lock should NOT be held while calling this function.
- * The flag bits are ignored in the trylock.
+ * For writers, the flag bits are ignored in the trylock. The workflow is
+ * as follows:
  *
  * If waiter is true
  * then
@@ -3467,35 +3481,81 @@ static inline int put_futex_state_unlocked(struct futex_state *state)
  *   preserve the flag bits
  * endif
  *
+ * For readers (vpid == FUTEX_SHARED), the workflow is a bit different as
+ * shown in the code below.
+ *
  * Return: TP_LOCK_ACQUIRED if lock acquired;
  *	   TP_LOCK_HANDOFF if lock was handed off;
  *	   0 if lock acquisition failed;
  *	   -EFAULT if an error happened.
  *	   *puval will contain the latest futex value when trylock fails.
  */
-static inline int futex_trylock(u32 __user *uaddr, const u32 vpid, u32 *puval,
-				const bool waiter)
+static int futex_trylock(u32 __user *uaddr, const u32 vpid, u32 *puval,
+			 const bool waiter)
 {
 	u32 uval, flags = 0;
 
 	if (unlikely(get_futex_value(puval, uaddr)))
 		return -EFAULT;
 
+	if (vpid == FUTEX_SHARED)
+		goto trylock_shared;
+
 	uval = *puval;
 
-	if (waiter && (uval & FUTEX_TID_MASK) == vpid)
+	if (waiter && (uval & FUTEX_SHARED_TID_MASK) == vpid)
 		return TP_LOCK_HANDOFF;
 
-	if (uval & FUTEX_TID_MASK)
+	if (uval & FUTEX_SHARED_TID_MASK)
 		return 0;	/* Trylock fails */
 
 	if (!waiter)
-		flags |= (uval & ~FUTEX_TID_MASK);
+		flags |= (uval & ~FUTEX_SHARED_TID_MASK);
 
 	if (unlikely(cmpxchg_futex_value(puval, uaddr, uval, vpid|flags)))
 		return -EFAULT;
 
 	return (*puval == uval) ? TP_LOCK_ACQUIRED : 0;
+
+trylock_shared:
+	for (;;) {
+		u32 new;
+
+		uval = *puval;
+
+		/*
+		 * When it is not the top waiter, fail the trylock attempt whenever
+		 * any of the flag bits is set.
+		 */
+		if (!waiter && (uval & ~FUTEX_SHARED_TID_MASK))
+			return 0;
+
+		/*
+		 * 1. Increment the reader counts if FUTEX_SHARED bit is set
+		 *    but not the FUTEX_SHARED_UNLOCK bit; or
+		 * 2. Change 0 => FUTEX_SHARED + 1.
+		 */
+		if ((uval & (FUTEX_SHARED|FUTEX_SHARED_UNLOCK)) == FUTEX_SHARED)
+			new = uval + 1;
+		else if (!(uval & FUTEX_SHARED_TID_MASK))
+			new = uval | (FUTEX_SHARED + 1);
+		else
+			return 0;
+
+		if (unlikely(cmpxchg_futex_value(puval, uaddr, uval, new)))
+			return -EFAULT;
+
+		if (*puval == uval)
+			break;
+	}
+	/*
+	 * If the original futex value is FUTEX_SHARED, it is assumed to be
+	 * a handoff even though it can also be a reader-owned futex whose
+	 * reader count has been decremented to 0.
+	 */
+	return ((uval & FUTEX_SHARED_TID_MASK) == FUTEX_SHARED)
+		? TP_LOCK_HANDOFF : TP_LOCK_ACQUIRED;
+
 }
 
 /**
@@ -3581,24 +3641,33 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid,
 #define OWNER_DEAD_MESSAGE					\
 	"futex: owner pid %d of TP futex 0x%lx was %s.\n"	\
 	"\tLock is now acquired by pid %d!\n"
+#define OWNER_DEAD_MESSAGE_BY_READER				\
+	"futex: owner pid %d of TP futex 0x%lx was %s.\n"	\
+	"\tLock is now acquired by a reader!\n"
 
 	int ret, loopcnt = 1;
 	int nsleep = 0;
+	int reader_value = 0;
 	bool handoff_set = false;
 	u32 uval;
 	u32 owner_pid = 0;
 	struct task_struct *owner_task = NULL;
 	u64 handoff_time = sched_clock() + TP_HANDOFF_TIMEOUT;
+	u64 rspin_timeout;
 
 	preempt_disable();
 	WRITE_ONCE(state->mutex_owner, current);
 retry:
 	for (;;) {
+		u32 new_pid;
+
 		ret = futex_trylock_preempt_disabled(uaddr, vpid, &uval);
 		if (ret)
 			break;
 
-		if ((uval & FUTEX_TID_MASK) != owner_pid) {
+		new_pid = (uval & FUTEX_SHARED) ? FUTEX_SHARED
+						: (uval & FUTEX_TID_MASK);
+		if (new_pid != owner_pid) {
 			if (owner_task) {
 				/*
 				 * task_pi_list_del() should always be
@@ -3613,13 +3682,25 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid,
 				put_task_struct(owner_task);
 			}
 
-			owner_pid  = uval & FUTEX_TID_MASK;
-			owner_task = futex_find_get_task(owner_pid);
+			/*
+			 * We need to reset the loop iteration count and
+			 * reader value for each writer=>reader ownership
+			 * transition so that we will have the right timeout
+			 * value for reader spinning.
+			 */
+			if ((owner_pid != FUTEX_SHARED) &&
+			     (new_pid == FUTEX_SHARED))
+				loopcnt = reader_value = 0;
+
+			owner_pid  = new_pid;
+			owner_task = (owner_pid == FUTEX_SHARED)
+				   ? NULL : futex_find_get_task(owner_pid);
 		}
 
-		if (unlikely(!owner_task ||
+		if (unlikely((owner_pid != FUTEX_SHARED) &&
+			    (!owner_task ||
 			    (owner_task->flags & PF_EXITING) ||
-			    (uval & FUTEX_OWNER_DIED))) {
+			    (uval & FUTEX_OWNER_DIED)))) {
 			/*
 			 * PID invalid or exiting/dead task, we can directly
 			 * grab the lock now.
@@ -3635,8 +3716,12 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid,
 				continue;
 			owner_state = (owner_task || (uval & FUTEX_OWNER_DIED))
 				    ? "dead" : "invalid";
-			pr_info(OWNER_DEAD_MESSAGE, owner_pid,
-				(long)uaddr, owner_state, vpid);
+			if (vpid & FUTEX_SHARED)
+				pr_info(OWNER_DEAD_MESSAGE_BY_READER,
+					owner_pid, (long)uaddr, owner_state);
+			else
+				pr_info(OWNER_DEAD_MESSAGE, owner_pid,
+					(long)uaddr, owner_state, vpid);
 			break;
 		}
 
@@ -3664,19 +3749,75 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid,
 		 * value. We also need to set the FUTEX_WAITERS bit to make
 		 * sure that futex lock holder will initiate the handoff at
 		 * unlock time.
+		 *
+		 * For readers, the handoff is done by setting just the
+		 * FUTEX_SHARED bit. The top waiter still needs to increment
+		 * the reader count to get the lock.
 		 */
-		if (!handoff_set && !(loopcnt++ & 0x7f)) {
-			if (sched_clock() > handoff_time) {
+		if ((!handoff_set || (reader_value >= 0)) &&
+		   !(loopcnt++ & 0x7f)) {
+			u64 curtime = sched_clock();
+
+			if (!owner_task && (reader_value >= 0)) {
+				int old_count = reader_value;
+
+				/*
+				 * Reset timeout value if the old reader
+				 * count is 0 or the reader value changes and
+				 * handoff time hasn't been reached. Otherwise,
+				 * disable reader spinning if the handoff time
+				 * is reached.
+				 */
+				reader_value = (uval & FUTEX_TID_MASK);
+				if (!old_count || (!handoff_set
+					       && (reader_value != old_count)))
+					rspin_timeout = curtime +
+							TP_RSPIN_TIMEOUT;
+				else if (curtime > rspin_timeout)
+					reader_value = -1;
+			}
+			if (!handoff_set && (curtime > handoff_time)) {
 				WARN_ON(READ_ONCE(state->handoff_pid));
-				ret = futex_set_waiters_bit(uaddr, &uval);
-				if (ret)
-					break;
 				WRITE_ONCE(state->handoff_pid, vpid);
+
+				/*
+				 * Disable handoff check.
+				 */
 				handoff_set = true;
 			}
 		}
 
-		if (owner_task->on_cpu) {
+		if (handoff_set && !(uval & FUTEX_WAITERS)) {
+			/*
+			 * This is reached either as a fall through after
+			 * setting handoff_set or due to a race in reader unlock
+			 * where the task is woken up without the proper
+			 * handoff. So we need to set the FUTEX_WAITERS bit
+			 * as well as ensuring that the handoff_pid is
+			 * properly set.
+			 */
+			ret = futex_set_waiters_bit(uaddr, &uval);
+			if (ret)
+				break;
+			if (!READ_ONCE(state->handoff_pid))
+				WRITE_ONCE(state->handoff_pid, vpid);
+		}
+
+		if (!owner_task) {
+			/*
+			 * For reader-owned futex, we optimistically spin
+			 * on the lock until reader spinning is disabled.
+			 *
+			 * Ideally, !owner_task <=> (uval & FUTEX_SHARED).
+			 * It is possible that there may be a mismatch here,
+			 * but it should be corrected in the next iteration
+			 * of the loop.
+			 */
+			if (reader_value >= 0) {
+				cpu_relax();
+				continue;
+			}
+		} else if (owner_task->on_cpu) {
 			cpu_relax();
 			continue;
 		}
@@ -3758,6 +3899,8 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid,
  * to see if it is running. It will also spin on the futex word so as to grab
  * the lock as soon as it is free.
  *
+ * For readers, TID = FUTEX_SHARED.
+ *
  * This function is not inlined so that it can show up separately in perf
  * profile for performance analysis purpose.
  *
@@ -3769,14 +3912,14 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid,
  *
  * Return: TP status code if lock acquired, < 0 if an error happens.
  */
-static noinline int
-futex_lock(u32 __user *uaddr, unsigned int flags, ktime_t *time)
+static noinline int futex_lock(u32 __user *uaddr, unsigned int flags,
+			       ktime_t *time, const bool shared)
 {
 	struct hrtimer_sleeper timeout, *to;
 	struct futex_hash_bucket *hb;
 	union futex_key key = FUTEX_KEY_INIT;
 	struct futex_state *state;
-	u32 uval, vpid = task_pid_vnr(current);
+	u32 uval, vpid = shared ? FUTEX_SHARED : task_pid_vnr(current);
 	int ret;
 
 	/*
@@ -3789,7 +3932,7 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid,
 	/*
 	 * Detect deadlocks.
 	 */
-	if (unlikely(((uval & FUTEX_TID_MASK) == vpid) ||
+	if (unlikely((!shared && ((uval & FUTEX_TID_MASK) == vpid)) ||
 			should_fail_futex(true)))
 		return -EDEADLK;
 
@@ -3878,12 +4021,18 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid,
  * This is the in-kernel slowpath: we look up the futex state (if any),
  * and wakeup the mutex owner.
  *
+ * For shared unlock, userspace code must make sure that there is no
+ * racing and only one task will go into the kernel to do the unlock.
+ * The reader count shouldn't be incremented from 0 while the unlock is
+ * in progress.
+ *
  * Return: 1 if a wakeup is attempted, 0 if no task to wake,
  *	   or < 0 when an error happens.
  */
-static int futex_unlock(u32 __user *uaddr, unsigned int flags)
+static int futex_unlock(u32 __user *uaddr, unsigned int flags,
+			const bool shared)
 {
-	u32 uval, vpid = task_pid_vnr(current);
+	u32 uval, vpid = shared ? FUTEX_SHARED : task_pid_vnr(current);
 	u32 newpid = 0;
 	union futex_key key = FUTEX_KEY_INIT;
 	struct futex_hash_bucket *hb;
@@ -3895,8 +4044,12 @@ static int futex_unlock(u32 __user *uaddr, unsigned int flags)
 	if (get_futex_value(&uval, uaddr))
 		return -EFAULT;
 
-	if ((uval & FUTEX_TID_MASK) != vpid)
+	if (uval & FUTEX_SHARED) {
+		if (!shared || !TP_FUTEX_SHARED_UNLOCK_READY(uval))
+			return -EPERM;
+	} else if (shared || ((uval & FUTEX_TID_MASK) != vpid)) {
 		return -EPERM;
+	}
 
 	if (!(uval & FUTEX_WAITERS))
 		return -EINVAL;
@@ -3925,7 +4078,6 @@ static int futex_unlock(u32 __user *uaddr, unsigned int flags)
 	owner = READ_ONCE(state->mutex_owner);
 	if (owner)
 		wake_q_add(&wake_q, owner);
-
 	spin_unlock(&hb->fs_lock);
 
 	/*
@@ -3944,7 +4096,8 @@ static int futex_unlock(u32 __user *uaddr, unsigned int flags)
 		/*
 		 * The non-flag value of the futex shouldn't change
 		 */
-		if ((uval & FUTEX_TID_MASK) != vpid) {
+		if ((uval & (shared ? FUTEX_SHARED_SCNT_MASK : FUTEX_TID_MASK))
+				    != vpid) {
 			ret = -EINVAL;
 			break;
 		}
@@ -3957,7 +4110,7 @@ static int futex_unlock(u32 __user *uaddr, unsigned int flags)
 		 * No error would have happened if owner defined.
 		 */
 		wake_up_q(&wake_q);
-		return 1;
+		return ret ? ret : 1;
 	}
 
 	return ret;
@@ -3989,6 +4142,8 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 #ifdef CONFIG_SMP
 	case FUTEX_LOCK:
 	case FUTEX_UNLOCK:
+	case FUTEX_LOCK_SHARED:
+	case FUTEX_UNLOCK_SHARED:
 #endif
 		if (!futex_cmpxchg_enabled)
 			return -ENOSYS;
@@ -4023,9 +4178,13 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 		return futex_requeue(uaddr, flags, uaddr2, val, val2, &val3, 1);
 #ifdef CONFIG_SMP
 	case FUTEX_LOCK:
-		return futex_lock(uaddr, flags, timeout);
+	case FUTEX_LOCK_SHARED:
+		return futex_lock(uaddr, flags, timeout,
+				 (cmd == FUTEX_LOCK) ? false : true);
 	case FUTEX_UNLOCK:
-		return futex_unlock(uaddr, flags);
+	case FUTEX_UNLOCK_SHARED:
+		return futex_unlock(uaddr, flags,
+				   (cmd == FUTEX_UNLOCK) ? false : true);
 #endif
 	}
 	return -ENOSYS;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v4 15/20] TP-futex: Enable kernel reader lock stealing
  2016-12-29 16:13 [PATCH v4 00/20] futex: Introducing throughput-optimized (TP) futexes Waiman Long
                   ` (13 preceding siblings ...)
  2016-12-29 16:13 ` [PATCH v4 14/20] TP-futex: Support userspace reader/writer locks Waiman Long
@ 2016-12-29 16:13 ` Waiman Long
  2016-12-29 16:13 ` [PATCH v4 16/20] TP-futex: Group readers together in wait queue Waiman Long
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 28+ messages in thread
From: Waiman Long @ 2016-12-29 16:13 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Arnaldo Carvalho de Melo,
	Davidlohr Bueso, Mike Galbraith, Scott J Norton, Waiman Long

The TP-futexes prefer writers as the code path is much simpler for
them. This may not be good for overall system performance, especially
if there is a fair amount of reader activity.

This patch enables a kernel reader to steal the lock when the futex is
currently reader-owned and the lock handoff mechanism hasn't been
enabled yet. A new field locksteal_disabled is added to the futex
state object for controlling reader lock stealing, so a waiting
reader must retrieve the futex state object before attempting the steal.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/futex.c | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index ee35264..b690a79 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -237,6 +237,7 @@ struct futex_state {
 	atomic_t refcount;
 
 	u32 handoff_pid;			/* For TP futexes only */
+	int locksteal_disabled;			/* For TP futexes only */
 
 	enum futex_type type;
 	union futex_key key;
@@ -3436,6 +3437,7 @@ void exit_robust_list(struct task_struct *curr)
 	state = alloc_futex_state();
 	state->type = TYPE_TP;
 	state->key = *key;
+	state->locksteal_disabled = false;
 	list_add(&state->fs_list, &hb->fs_head);
 	WARN_ON(atomic_read(&state->refcount) != 1);
 
@@ -3781,9 +3783,10 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid,
 				WRITE_ONCE(state->handoff_pid, vpid);
 
 				/*
-				 * Disable handoff check.
+				 * Disable handoff check and reader lock
+				 * stealing.
 				 */
-				handoff_set = true;
+				state->locksteal_disabled = handoff_set = true;
 			}
 		}
 
@@ -3886,6 +3889,7 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid,
 	 */
 	WRITE_ONCE(state->mutex_owner, NULL);
 	WRITE_ONCE(state->handoff_pid, 0);
+	state->locksteal_disabled = false;
 
 	preempt_enable();
 	return (ret < 0) ? ret : TP_STATUS_SLEEP(ret, nsleep);
@@ -3966,6 +3970,21 @@ static noinline int futex_lock(u32 __user *uaddr, unsigned int flags,
 		goto out_put_state_key;
 	}
 
+	/*
+	 * For readers, if lock stealing isn't disabled, we will try to
+	 * steal the lock here as if we were the top waiter, without
+	 * taking the serialization mutex, even though the FUTEX_WAITERS
+	 * bit is set.
+	 */
+	if (shared && !state->locksteal_disabled) {
+		ret = futex_trylock(uaddr, vpid, &uval, true);
+		if (ret) {
+			if (ret > 0)
+				ret = TP_LOCK_STOLEN;
+			goto out_put_state_key;
+		}
+	}
+
 	if (to)
 		hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
 
-- 
1.8.3.1


* [PATCH v4 16/20] TP-futex: Group readers together in wait queue
  2016-12-29 16:13 [PATCH v4 00/20] futex: Introducing throughput-optimized (TP) futexes Waiman Long
                   ` (14 preceding siblings ...)
  2016-12-29 16:13 ` [PATCH v4 15/20] TP-futex: Enable kernel reader lock stealing Waiman Long
@ 2016-12-29 16:13 ` Waiman Long
  2016-12-29 19:53   ` kbuild test robot
  2016-12-29 20:16   ` kbuild test robot
  2016-12-29 16:13 ` [PATCH v4 17/20] TP-futex, doc: Update TP futexes document on shared locking Waiman Long
                   ` (3 subsequent siblings)
  19 siblings, 2 replies; 28+ messages in thread
From: Waiman Long @ 2016-12-29 16:13 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Arnaldo Carvalho de Melo,
	Davidlohr Bueso, Mike Galbraith, Scott J Norton, Waiman Long

All the TP futex lock waiters are serialized in the kernel using a
kernel mutex which acts like a wait queue. The order in which the
waiters pop out of the wait queue will affect performance when
exclusive (writer) and shared (reader) lock waiters are mixed in the
queue. The worst case scenario is an ordering like RWRWRW..., in
which no parallelism in lock ownership can happen.

To improve throughput, the readers are now grouped together as a single
entity in the wait queue. The first reader that enters the mutex wait
queue will become the leader of the group. The other readers will
spin on the group leader via an OSQ lock. When the futex is put into
shared mode, either by the group leader or by an external reader,
the spinning readers in the reader group will then acquire the read
lock successively.

The spinning readers in the group will get disbanded when the group
leader goes to sleep. In this case, each of those readers will enter the
mutex wait queue alone and wait for its turn to acquire the TP futex.

On a 2-socket 36-core E5-2699 v3 system (HT off) running on a 4.10
based kernel, the performance of TP rwlock with 1:1 reader/writer
ratio versus one based on the wait-wake futexes as well as the Glibc
rwlock with a microbenchmark (1 worker thread per cpu core) running
for 10s were as follows:

                          WW futex       TP futex         Glibc
                          --------       --------         -----
Total locking ops        35,707,234     58,645,434     10,930,422
Per-thread avg/sec           99,149        162,887         30,362
Per-thread min/sec           93,190         38,641         29,872
Per-thread max/sec          104,213        225,983         30,708
Write lock futex calls   11,161,534         14,094          -
Write unlock futex calls  8,696,121            167          -
Read lock futex calls     1,717,863          6,659          -
Read unlock futex calls   4,316,418            323          -

It can be seen that the throughput of the TP futex is close to 2X
the WW futex and almost 6X the Glibc version in this particular case.

The following table shows the CPU cores scaling for the average
per-thread locking rates (ops/sec):

                          WW futex       TP futex       Glibc
                          --------       --------       -----
  9 threads (1 socket)    422,014        647,006       190,236
 18 threads (2 sockets)   197,145        330,353        66,934
 27 threads (2 sockets)   127,947        213,417        43,641
 36 threads (2 sockets)    99,149        162,887        30,362

The following table shows the average per-thread locking rates
(36 threads) with different reader percentages:

                          WW futex       TP futex       Glibc
                          --------       --------       -----
   90% readers            124,426        159,657        60,208
   95% readers            148,905        152,315        68,666
  100% reader             210,029        191,759        84,316

The rwlocks based on the WW futexes and Glibc prefer readers by
default. So it can be seen that the performance of the WW futex and
Glibc rwlocks increased with higher reader percentages. The TP futex
rwlock, however, prefers writers a bit more than readers. So the
performance didn't increase as the reader percentage rose.

The WW futexes and Glibc rwlocks also have a writer-preferring version.
Their performance with the same tests is as follows:

                          WW futex       TP futex       Glibc
                          --------       --------       -----
   90% readers             87,866        159,657        26,057
   95% readers             93,193        152,315        32,611
  100% reader             193,267        191,759        88,440

With separate 18 reader and 18 writer threads, the average
per-thread reader and writer locking rates with different load
latencies (L, default = 1) and reader-preferring rwlocks are:

                          WW futex       TP futex       Glibc
                          --------       --------       -----
 Reader rate, L=1         381,411         74,059       164,184
 Writer rate, L=1               0        240,841             0

 Reader rate, L=5         330,732         57,361       150,691
 Writer rate, L=5               0        175,400             0

 Reader rate, L=50        304,505         17,355        97,066
 Writer rate, L=50              0        114,504             0

The corresponding locking rates with writer-preferring rwlocks are:

                          WW futex       TP futex       Glibc
                          --------       --------       -----
 Reader rate, L=1         138,805         74,059            54
 Writer rate, L=1          31,113        240,841        56,424

 Reader rate, L=5         114,414         57,361            24
 Writer rate, L=5          28,062        175,400        52,039

 Reader rate, L=50         88,619         51,483             5
 Writer rate, L=50         21,005         98,005        49,885

With both the WW futex and Glibc rwlocks, lock starvation happened
for the writers with reader-preferring rwlocks. For writer-preferring
rwlocks, the WW futex one fared better. The Glibc one, however,
was close to starving the readers.

The TP futex prefers writers in general, but the actual preference
depends on the timing. Lock starvation should not happen on the TP
futexes as long as the underlying kernel mutex is lock-starvation
free, which is the case for 4.10 and later kernels.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/futex.c | 135 +++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 132 insertions(+), 3 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index b690a79..5e69c44 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -239,6 +239,16 @@ struct futex_state {
 	u32 handoff_pid;			/* For TP futexes only */
 	int locksteal_disabled;			/* For TP futexes only */
 
+	/*
+	 * To improve reader throughput in TP futexes, all the readers
+	 * in the mutex queue are grouped together. The first reader in the
+	 * queue will set first_reader, then the rest of the readers will
+	 * spin on the first reader via the OSQ without actually entering
+	 * the mutex queue.
+	 */
+	struct optimistic_spin_queue reader_osq;
+	struct task_struct *first_reader;
+
 	enum futex_type type;
 	union futex_key key;
 };
@@ -3388,14 +3398,21 @@ void exit_robust_list(struct task_struct *curr)
  *		   0 - steals the lock
  *		   1 - top waiter (mutex owner) acquires the lock
  *		   2 - handed off the lock
- * 2) bits 08-15: reserved
- * 3) bits 15-30: how many times the task has slept or yield to scheduler
+ * 2) bit  08:	  1 if reader spins alone		(shared lock only)
+ *    bit  09:	  1 if reader is a spin group leader	(shared lock only)
+ *    bits 10-15: reserved
+ * 3) bits 16-30: how many times the task has slept or yielded to the
+ *		  scheduler in futex_spin_on_owner().
  */
 #define TP_LOCK_STOLEN		0
 #define TP_LOCK_ACQUIRED	1
 #define TP_LOCK_HANDOFF		2
+
+#define TP_READER_ALONE		1
+#define TP_READER_GROUP		2
+
 #define TP_STATUS_SLEEP(val, sleep)	((val)|((sleep) << 16))
+#define TP_STATUS_ALONE(val, alone)	((val)|((alone) << 8))
 
 /**
  * lookup_futex_state - Looking up the futex state structure.
@@ -3438,6 +3455,7 @@ void exit_robust_list(struct task_struct *curr)
 	state->type = TYPE_TP;
 	state->key = *key;
 	state->locksteal_disabled = false;
+	osq_lock_init(&state->reader_osq);
 	list_add(&state->fs_list, &hb->fs_head);
 	WARN_ON(atomic_read(&state->refcount) != 1);
 
@@ -3895,6 +3913,98 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid,
 	return (ret < 0) ? ret : TP_STATUS_SLEEP(ret, nsleep);
 }
 
+/**
+ * futex_spin_on_reader - Optimistically spin on first reader
+ * @uaddr:   futex address
+ * @pfirst:  pointer to first reader
+ * @state:   futex state object
+ * @timeout: hrtimer_sleeper structure
+ *
+ * Reader performance will depend on the placement of readers within the
+ * mutex queue. For a queue of 4 readers and 4 writers, for example, the
+ * optimal placement will be either RRRRWWWW or WWWWRRRR. A worst case will
+ * be RWRWRWRW.
+ *
+ * One way to avoid the worst case scenario is to gather all the readers
+ * together as a single unit and place it into the mutex queue. This is done
+ * by having the first reader put its task structure into state->first_reader
+ * and the rest of the readers optimistically spin on it instead of entering
+ * the mutex queue.
+ *
+ * On exit from this function, the reader would have
+ *  1) acquired a read lock on the futex;
+ *  2) become the first reader and gone into the mutex queue; or
+ *  3) seen the first reader sleep and so will go into the mutex queue alone.
+ *
+ * Any fault on accessing the futex will cause the function to return 0
+ * and the reader to go into the mutex queue.
+ *
+ * Return:  > 0 - acquired the read lock
+ *	   <= 0 - need to go into the mutex queue
+ */
+static int futex_spin_on_reader(u32 __user *uaddr, struct task_struct **pfirst,
+				struct futex_state *state,
+				struct hrtimer_sleeper *timeout)
+{
+	struct task_struct *first_reader;
+	int ret = 0;
+
+	preempt_disable();
+
+retry:
+	first_reader = READ_ONCE(state->first_reader);
+	if (!first_reader)
+		first_reader = cmpxchg(&state->first_reader, NULL, current);
+	if (!first_reader)
+		goto out;	/* Became the first reader */
+
+	if (!osq_lock(&state->reader_osq))
+		goto reschedule;
+
+	for (;;) {
+		u32 uval;
+
+		if (!state->locksteal_disabled) {
+			ret = futex_trylock_preempt_disabled(uaddr,
+						FUTEX_SHARED, &uval);
+			/*
+			 * Return if lock acquired or an error happened
+			 */
+			if (ret)
+				break;
+		}
+
+		/*
+		 * Reread the first reader value again.
+		 */
+		first_reader = READ_ONCE(state->first_reader);
+		if (!first_reader)
+			first_reader = cmpxchg(&state->first_reader, NULL,
+						current);
+		if (!first_reader || !first_reader->on_cpu)
+			break;
+
+		if (need_resched()) {
+			osq_unlock(&state->reader_osq);
+			goto reschedule;
+		}
+
+		cpu_relax();
+	}
+	osq_unlock(&state->reader_osq);
+out:
+	*pfirst = first_reader;
+	preempt_enable();
+	return ret;
+
+reschedule:
+	/*
+	 * Yield the CPU and retry later.
+	 */
+	schedule_preempt_disabled();
+	goto retry;
+}
+
 /*
  * Userspace tried a 0 -> TID atomic transition of the futex value
  * and failed. The kernel side here does the whole locking operation.
@@ -3921,9 +4031,11 @@ static noinline int futex_lock(u32 __user *uaddr, unsigned int flags,
 {
 	struct hrtimer_sleeper timeout, *to;
 	struct futex_hash_bucket *hb;
+	struct task_struct *first_reader = NULL;
 	union futex_key key = FUTEX_KEY_INIT;
 	struct futex_state *state;
 	u32 uval, vpid = shared ? FUTEX_SHARED : task_pid_vnr(current);
+	int alone = 0;
 	int ret;
 
 	/*
@@ -3989,6 +4101,17 @@ static noinline int futex_lock(u32 __user *uaddr, unsigned int flags,
 		hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
 
 	/*
+	 * Spin on the first reader and return if we acquired the read lock.
+	 */
+	if (shared) {
+		int rspin_ret = futex_spin_on_reader(uaddr, &first_reader,
+						     state, to);
+		if (rspin_ret > 0)
+			goto out_put_state_key;
+		alone = first_reader ? TP_READER_ALONE : TP_READER_GROUP;
+	}
+
+	/*
 	 * Acquiring the serialization mutex.
 	 *
 	 * If we got a signal or hit some other error, we need to abort
@@ -4016,6 +4139,12 @@ static noinline int futex_lock(u32 __user *uaddr, unsigned int flags,
 	mutex_unlock(&state->mutex);
 
 out_put_state_key:
+	/*
+	 * We will be the first reader if (first_reader == NULL).
+	 */
+	if (shared && !first_reader)
+		WRITE_ONCE(state->first_reader, NULL);
+
 	if (!put_futex_state_unlocked(state)) {
 		/*
 		 * May need to free the futex state object and so must be
@@ -4032,7 +4161,7 @@ static noinline int futex_lock(u32 __user *uaddr, unsigned int flags,
 		hrtimer_cancel(&to->timer);
 		destroy_hrtimer_on_stack(&to->timer);
 	}
-	return ret;
+	return (ret < 0) ? ret : TP_STATUS_ALONE(ret, alone);
 }
 
 /*
-- 
1.8.3.1


* [PATCH v4 17/20] TP-futex, doc: Update TP futexes document on shared locking
  2016-12-29 16:13 [PATCH v4 00/20] futex: Introducing throughput-optimized (TP) futexes Waiman Long
                   ` (15 preceding siblings ...)
  2016-12-29 16:13 ` [PATCH v4 16/20] TP-futex: Group readers together in wait queue Waiman Long
@ 2016-12-29 16:13 ` Waiman Long
  2016-12-29 16:13 ` [PATCH v4 18/20] perf bench: New microbenchmark for userspace rwlock performance Waiman Long
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 28+ messages in thread
From: Waiman Long @ 2016-12-29 16:13 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Arnaldo Carvalho de Melo,
	Davidlohr Bueso, Mike Galbraith, Scott J Norton, Waiman Long

The tp-futex.txt document is updated to add a description of shared
locking support.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/tp-futex.txt | 163 +++++++++++++++++++++++++++++++++++++++------
 1 file changed, 143 insertions(+), 20 deletions(-)

diff --git a/Documentation/tp-futex.txt b/Documentation/tp-futex.txt
index 5324ee0..d4f0ed6 100644
--- a/Documentation/tp-futex.txt
+++ b/Documentation/tp-futex.txt
@@ -68,22 +68,58 @@ the kernel using the FUTEX_LOCK futex(2) syscall.
 Only the optional timeout parameter is being used by this futex(2)
 syscall.
 
-Inside the kernel, a kernel mutex is used for serialization among
-the futex waiters. Only the top lock waiter which is the owner of
-the serialization mutex is allowed to continuously spin and attempt
-to acquire the lock.  Other lock waiters will have one attempt to
-steal the lock before entering the mutex queues.
+A shared lock acquirer, on the other hand, has to do one of the
+following two actions depending on the current value of the futex:
+
+ 1) If current futex value is 0, atomically put (FUTEX_SHARED + 1)
+    into the lower 30 bits of the futex.
+
+ 2) If the FUTEX_SHARED bit is set and the FUTEX_SHARED_UNLOCK bit
+    isn't, atomically increment the reader count in the lower 24 bits.
+    The other unused bits (24-27) can be used for other management
+    purposes and will be ignored by the kernel.
+
+If both of the above actions fail, the task will have to issue the
+following new FUTEX_LOCK_SHARED futex(2) syscall.
+
+  futex(uaddr, FUTEX_LOCK_SHARED, 0, timeout, NULL, 0);
+
+The special FUTEX_SHARED bit (bit 30) is used to distinguish between a
+shared lock and an exclusive lock. If the FUTEX_SHARED bit isn't set,
+the lower 29 bits contain the TID of the exclusive lock owner. If
+the FUTEX_SHARED bit is set, the lower 29 bits contain the number of
+tasks owning the shared lock.  Therefore, only 29 bits are allowed
+to be used by the TID.
+
+The FUTEX_SHARED_UNLOCK bit can be used by userspace to prevent races
+and to indicate that a shared unlock is in progress, as it disables
+any shared trylock attempt in the kernel.
+
+Inside the kernel, a kernel mutex is used for serialization among the
+futex waiters. Only the top lock waiter which is the owner of the
+serialization mutex is allowed to continuously spin and attempt to
+acquire the lock.  Other lock waiters will have one or two attempts
+to steal the lock before entering the mutex queues.
 
 When the exclusive futex lock owner is no longer running, the top
 waiter will set the FUTEX_WAITERS bit before going to sleep. This is
 to make sure the futex owner will go into the kernel at unlock time
 to wake up the top waiter.
 
+When the futex is shared by multiple owners, there is no way to
+determine if all the shared owners are running or not. In this case,
+the top waiter will spin for a fixed amount of time if no reader
+activity is observed before giving up and going to sleep. Any change
+in the reader count is considered reader activity and will reset
+the timer. However, once the hand-off threshold has been reached,
+this timer reset is automatically disabled.
+
 The return values of the above futex locking syscall, if non-negative,
-are status code that consists of 2 fields - the lock acquisition code
-(bits 0-7) and the number of sleeps (bits 8-30) in the optimistic
-spinning loop before acquiring the futex. A negative returned value
-means an error has happened.
+are status codes that consist of 3 fields - the lock acquisition code
+(bits 0-7), the number of sleeps (bits 16-30) in the optimistic
+spinning loop before acquiring the futex, and bits 8-15 which are
+reserved for future extension. A negative return value means an
+error has happened.
 
 The lock acquisition code can have the following values:
  a) 0   - lock stolen as non-top waiter
@@ -96,14 +132,21 @@ issue a FUTEX_UNLOCK futex(2) syscall to wake up the top waiter.
 
   futex(uaddr, FUTEX_UNLOCK, 0, NULL, NULL, 0);
 
-A return value of 1 from the FUTEX_UNLOCK futex(2) syscall indicates
-a task has been woken up. The syscall returns 0 if no sleeping task
-is woken. A negative value will be returned if an error happens.
+For a shared lock owner, unlock is done by atomically decrementing
+the reader count by one. If the resultant futex value has no reader
+remaining, the process has to atomically change the futex value from
+FUTEX_SHARED to 0. If that fails, it has to issue a FUTEX_UNLOCK_SHARED
+futex(2) syscall to wake up the top waiter.
+
+A return value of 1 from the FUTEX_UNLOCK or FUTEX_UNLOCK_SHARED
+futex(2) syscall indicates a task has been woken up. The syscall
+returns 0 if no sleeping task is woken. A negative value will be
+returned if an error happens.
 
-The error number returned by a FUTEX_UNLOCK syscall on an empty futex
-can be used to decide if the TP futex functionality is implemented
-in the kernel. If it is present, an EPERFM error will be returned.
-Otherwise it will return ENOSYS.
+The error number returned by a FUTEX_UNLOCK or FUTEX_UNLOCK_SHARED
+syscall on an empty futex can be used to decide if the TP futex
+functionality is implemented in the kernel. If it is present, an
+EPERM error will be returned.  Otherwise it will return ENOSYS.
 
 TP futexes require the kernel to have SMP support as well as support
 for the cmpxchg functionality. For architectures that don't support
@@ -118,6 +161,13 @@ is the PID wrap-around issue where another process with the same PID
 as the real futex owner because of PID wrap-around is mis-identified
 as the owner of a futex.
 
+Shared locking TP futexes, on the other hand, cannot be used with
+robust futexes. The unexpected death of a shared lock owner may
+leave the futex permanently in the shared state leading to deadlock
+of exclusive lock waiters. If necessary, some kind of timeout and
+userspace lock management system has to be used to resolve this
+deadlock problem.
+
 Usage Scenario
 --------------
 
@@ -129,6 +179,24 @@ used when the critical section is long and prone to sleeping. However,
 it may not have the performance gain when compared with a wait-wake
 futex in this case.
 
+A TP futex can also be used to implement a userspace reader-writer
+lock (rwlock) where the writers use the FUTEX_LOCK and FUTEX_UNLOCK
+futex(2) syscalls and the readers use the FUTEX_LOCK_SHARED and
+FUTEX_UNLOCK_SHARED futex(2) syscalls. Besides providing better
+performance than wait-wake futexes, other advantages of using the
+TP futexes for rwlocks are:
+
+ 1) The simplicity and ease of maintenance of the userspace locking
+    code. There is only one version of rwlock using TP futexes versus
+    a reader-preferring variant and/or a writer-preferring variant
+    when using the wait-wake futexes.
+
+ 2) The TP futex is lock-starvation free. That may not be the case
+    with rwlocks implemented with wait-wake futexes, especially the
+    reader-preferring variants where the writers can be starved of
+    the lock under certain circumstances. Some writer-preferring
+    rwlock variants may also starve readers.
+
 The wait-wake futexes are more versatile as they can also be used to
 implement other locking primitives like semaphores or conditional
 variables.  So the TP futex is not a direct replacement of the
@@ -138,12 +206,12 @@ futex is likely a better option than the wait-wake futex.
 Sample Code
 -----------
 
-The following are sample code to implement simple mutex lock and
-unlock functions.
+The following is sample code to implement simple mutex or writer
+lock and unlock functions.
 
 __thread int thread_id;
 
-void mutex_lock(int *faddr)
+void write_lock(int *faddr)
 {
 	if (cmpxchg(faddr, 0, thread_id) == 0)
 		return;
@@ -151,7 +219,7 @@ void mutex_lock(int *faddr)
 		;
 }
 
-void mutex_unlock(int *faddr)
+void write_unlock(int *faddr)
 {
 	int old, fval;
 
@@ -159,3 +227,58 @@ void mutex_unlock(int *faddr)
 		return;
 	futex(faddr, FUTEX_UNLOCK, ...);
 }
+
+The following sample code can be used to implement shared or reader
+lock and unlock functions.
+
+void read_lock(int *faddr)
+{
+	int val = cmpxchg(faddr, 0, FUTEX_SHARED + 1);
+
+	if (!val)
+		return;
+	for (;;) {
+		int old = val, new;
+
+		if (!old)
+			new = FUTEX_SHARED + 1;
+		else if (!(old & FUTEX_SHARED) ||
+			 (old & ((~FUTEX_SHARED_TID_MASK)|
+				   FUTEX_SHARED_UNLOCK)))
+			break;
+		else
+			new = old + 1;
+		val = cmpxchg(faddr, old, new);
+		if (val == old)
+			return;
+	}
+	while (futex(faddr, FUTEX_LOCK_SHARED, ...) < 0)
+		;
+}
+
+void read_unlock(int *faddr)
+{
+	int val = xadd(faddr, -1) - 1;
+	int old;
+
+	for (;;) {
+		/*
+		 * Return if not the last reader, not in shared locking
+		 * mode or the unlock bit has been set.
+		 */
+		if (!(val & FUTEX_SHARED) ||
+		     (val & (FUTEX_SHARED_UNLOCK|FUTEX_SCNT_MASK)))
+			return;
+		if (val & ~FUTEX_SHARED_TID_MASK) {
+			old = cmpxchg(faddr, val, val|FUTEX_SHARED_UNLOCK);
+			if (old == val)
+				break;	/* Do FUTEX_UNLOCK_SHARED */
+		} else {
+			old = cmpxchg(faddr, val, 0);
+			if (old == val)
+				return;
+		}
+		val = old;
+	}
+	futex(faddr, FUTEX_UNLOCK_SHARED, ...);
+}
-- 
1.8.3.1


* [PATCH v4 18/20] perf bench: New microbenchmark for userspace rwlock performance
  2016-12-29 16:13 [PATCH v4 00/20] futex: Introducing throughput-optimized (TP) futexes Waiman Long
                   ` (16 preceding siblings ...)
  2016-12-29 16:13 ` [PATCH v4 17/20] TP-futex, doc: Update TP futexes document on shared locking Waiman Long
@ 2016-12-29 16:13 ` Waiman Long
  2016-12-29 16:13 ` [PATCH v4 19/20] sched, TP-futex: Make wake_up_q() return wakeup count Waiman Long
  2016-12-29 16:13 ` [PATCH v4 20/20] futex: Dump internal futex state via debugfs Waiman Long
  19 siblings, 0 replies; 28+ messages in thread
From: Waiman Long @ 2016-12-29 16:13 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Arnaldo Carvalho de Melo,
	Davidlohr Bueso, Mike Galbraith, Scott J Norton, Waiman Long

This microbenchmark simulates how the use of different futex types
can affect the actual performance of userspace rwlocks. The usage
is:

        perf bench futex rwlock <options>

Three sets of simple rwlock lock and unlock functions are implemented
using wait-wake futexes, TP futexes and the Glibc rwlock respectively.
This microbenchmark then runs the locking rate measurement tests using
either one of those rwlocks or all of them consecutively.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 tools/perf/bench/bench.h       |   1 +
 tools/perf/bench/futex-locks.c | 842 ++++++++++++++++++++++++++++++++++++++++-
 tools/perf/bench/futex.h       |  24 +-
 tools/perf/builtin-bench.c     |   1 +
 4 files changed, 850 insertions(+), 18 deletions(-)

diff --git a/tools/perf/bench/bench.h b/tools/perf/bench/bench.h
index b0632df..a7e2037 100644
--- a/tools/perf/bench/bench.h
+++ b/tools/perf/bench/bench.h
@@ -37,6 +37,7 @@
 /* pi futexes */
 int bench_futex_lock_pi(int argc, const char **argv, const char *prefix);
 int bench_futex_mutex(int argc, const char **argv, const char *prefix);
+int bench_futex_rwlock(int argc, const char **argv, const char *prefix);
 
 #define BENCH_FORMAT_DEFAULT_STR	"default"
 #define BENCH_FORMAT_DEFAULT		0
diff --git a/tools/perf/bench/futex-locks.c b/tools/perf/bench/futex-locks.c
index ec1e627..4008691 100644
--- a/tools/perf/bench/futex-locks.c
+++ b/tools/perf/bench/futex-locks.c
@@ -10,6 +10,12 @@
  * unlock functions are written to implenment a mutex lock using the
  * wait-wake, PI and TP futexes respectively. These functions serve as the
  * basis for measuring the locking throughput.
+ *
+ * Three sets of simple reader/writer lock and unlock functions are also
+ * implemented using the wait-wake and TP futexes as well as the Glibc
+ * rwlock respectively for performance measurement purposes. The TP futex
+ * writer lock/unlock functions are the same as the mutex functions.
+ * That is not the case for the wait-wake futexes.
  */
 
 #include <pthread.h>
@@ -21,11 +27,13 @@
 #include <subcmd/parse-options.h>
 #include <linux/compiler.h>
 #include <linux/kernel.h>
+#include <asm/byteorder.h>
 #include <errno.h>
 #include "bench.h"
 #include "futex.h"
 
 #include <err.h>
+#include <limits.h>
 #include <stdlib.h>
 #include <sys/time.h>
 
@@ -42,15 +50,25 @@
  */
 enum {
 	STAT_OPS,	/* # of exclusive locking operations	*/
+	STAT_SOPS,	/* # of shared locking operations	*/
 	STAT_LOCKS,	/* # of exclusive lock futex calls	*/
 	STAT_UNLOCKS,	/* # of exclusive unlock futex calls	*/
 	STAT_SLEEPS,	/* # of exclusive lock sleeps		*/
+	STAT_SLOCKS,	/* # of shared lock futex calls		*/
+	STAT_SUNLOCKS,	/* # of shared unlock futex calls	*/
+	STAT_SSLEEPS,	/* # of shared lock sleeps		*/
 	STAT_EAGAINS,	/* # of EAGAIN errors			*/
 	STAT_WAKEUPS,	/* # of wakeups (unlock return)		*/
 	STAT_HANDOFFS,	/* # of lock handoff (TP only)		*/
 	STAT_STEALS,	/* # of lock steals (TP only)		*/
+	STAT_SLRETRIES,	/* # of shared lock retries		*/
+	STAT_SURETRIES,	/* # of shared unlock retries		*/
+	STAT_SALONES,	/* # of shared locking alone ops	*/
+	STAT_SGROUPS,	/* # of shared locking group ops	*/
 	STAT_LOCKERRS,	/* # of exclusive lock errors		*/
 	STAT_UNLKERRS,	/* # of exclusive unlock errors		*/
+	STAT_SLOCKERRS,	/* # of shared lock errors		*/
+	STAT_SUNLKERRS,	/* # of shared unlock errors		*/
 	STAT_NUM	/* Total # of statistical count		*/
 };
 
@@ -60,6 +78,8 @@ enum {
 enum {
 	TIME_LOCK,	/* Total exclusive lock syscall time	*/
 	TIME_UNLK,	/* Total exclusive unlock syscall time	*/
+	TIME_SLOCK,	/* Total shared lock syscall time	*/
+	TIME_SUNLK,	/* Total shared unlock syscall time	*/
 	TIME_NUM,
 };
 
@@ -102,7 +122,11 @@ struct worker {
 static unsigned int threads_stopping;
 static struct stats throughput_stats;
 static lock_fn_t mutex_lock_fn;
+static lock_fn_t read_lock_fn;
+static lock_fn_t write_lock_fn;
 static unlock_fn_t mutex_unlock_fn;
+static unlock_fn_t read_unlock_fn;
+static unlock_fn_t write_unlock_fn;
 
 /*
  * Lock/unlock syscall time macro
@@ -150,6 +174,26 @@ static inline void stat_inc(int tid __maybe_unused, int item __maybe_unused)
 #endif
 
 /*
+ * For rwlock, the default is to have each thread act as both reader and
+ * writer. The use of the -x option will force the use of separate threads
+ * for readers and writers. The numbers of reader and writer threads are
+ * determined by the reader percentage and the total number of threads used.
+ * So the actual ratio of reader and writer operations may not be close
+ * to the given reader percentage.
+ */
+static bool xthread;
+static bool pwriter;			/* Prefer writer flag */
+static int rthread_threshold = -1;	/* (tid < threshold) => reader */
+static unsigned int rpercent = 50;	/* Reader percentage */
+
+/*
+ * Glibc rwlock
+ */
+static pthread_rwlock_t __cacheline_aligned rwlock;
+static pthread_rwlockattr_t rwlock_attr;
+static bool rwlock_inited, rwlock_attr_inited;
+
+/*
  * The latency value within a lock critical section (load) and between locking
  * operations is in term of the number of cpu_relax() calls that are being
  * issued.
@@ -172,6 +216,27 @@ static inline void stat_inc(int tid __maybe_unused, int item __maybe_unused)
 	NULL
 };
 
+static const struct option rwlock_options[] = {
+	OPT_INTEGER ('d', "locklat",	&locklat,  "Specify inter-locking latency (default = 1)"),
+	OPT_STRING  ('f', "ftype",	&ftype,    "type", "Specify futex type: WW, TP, GC, all (default)"),
+	OPT_INTEGER ('L', "loadlat",	&loadlat,  "Specify load latency (default = 1)"),
+	OPT_UINTEGER('R', "read-%",	&rpercent, "Specify reader percentage (default 50%)"),
+	OPT_UINTEGER('r', "runtime",	&nsecs,    "Specify runtime (in seconds, default = 10s)"),
+	OPT_BOOLEAN ('S', "shared",	&fshared,  "Use shared futexes instead of private ones"),
+	OPT_BOOLEAN ('T', "timestat",	&timestat, "Track lock/unlock syscall times"),
+	OPT_UINTEGER('t', "threads",	&nthreads, "Specify number of threads, default = # of CPUs"),
+	OPT_BOOLEAN ('v', "verbose",	&verbose,  "Verbose mode: display thread-level details"),
+	OPT_BOOLEAN ('W', "prefer-wr",	&pwriter,  "Prefer writers instead of readers"),
+	OPT_INTEGER ('w', "wait-ratio", &wratio,   "Specify <n>/1024 of load is 1us sleep, default = 0"),
+	OPT_BOOLEAN ('x', "xthread",	&xthread,  "Use separate reader/writer threads"),
+	OPT_END()
+};
+
+static const char * const bench_futex_rwlock_usage[] = {
+	"perf bench futex rwlock <options>",
+	NULL
+};
+
 /**
  * futex_cmpxchg() - atomic compare and exchange
  * @uaddr:	The address of the futex to be modified
@@ -221,6 +286,17 @@ static inline int atomic_inc_return(futex_t *uaddr)
 	return __sync_add_and_fetch(uaddr, 1);
 }
 
+/**
+ * atomic_add_return - atomically add a number & return the new value
+ * @uaddr:      The address of the futex to be added
+ * @val  :	The integer value to be added
+ * Return:	The new value
+ */
+static inline int atomic_add_return(futex_t *uaddr, int val)
+{
+	return  __sync_add_and_fetch(uaddr, val);
+}
+
 /**********************[ MUTEX lock/unlock functions ]*********************/
 
 /*
@@ -502,6 +578,498 @@ static void tp_mutex_unlock(futex_t *futex, int tid)
 		stat_add(tid, STAT_WAKEUPS, ret);
 }
 
+/**********************[ RWLOCK lock/unlock functions ]********************/
+
+/*
+ * Wait-wake futex reader/writer lock/unlock functions
+ *
+ * This implementation is based on the reader-preferring futex eventcount
+ * rwlocks posted on http://locklessinc.com/articles/sleeping_rwlocks with
+ * some modification.
+ *
+ * It is assumed the passed-in futex has sufficient trailing space to
+ * be used by the bigger reader/writer lock structure.
+ */
+#ifdef __LITTLE_ENDIAN
+#define LSB(field)	unsigned char field
+#else
+#define LSB(field)	struct {			\
+				unsigned char __pad[3];	\
+				unsigned char field;	\
+			}
+#endif
+
+#define RW_WLOCKED	(1U << 0)
+#define RW_READER	(1U << 8)
+#define RW_EC_CONTEND	(1U << 0)
+#define RW_EC_INC	(1U << 8)
+
+struct rwlock {
+	/*
+	 * Bits 0-7 : writer lock
+	 * Bits 8-31: reader count
+	 */
+	union {
+		futex_t val;
+		LSB(wlocked);
+	} lock;
+
+	/* Writer event count */
+	union {
+		futex_t val;
+		LSB(contend);
+	} write_ec;
+
+	/* Reader event count */
+	union {
+		futex_t val;
+		LSB(contend);
+	} read_ec;
+};
+
+static struct rwlock __cacheline_aligned rwfutex;
+
+/*
+ * Reader preferring rwlock functions
+ */
+static void ww_write_lock(futex_t *futex __maybe_unused, int tid)
+{
+	struct timespec stime, etime;
+	struct rwlock *rw = &rwfutex;
+
+	for (;;) {
+		futex_t ec = rw->write_ec.val | RW_EC_CONTEND;
+		int ret;
+
+		/* Set the write lock if there is no reader */
+		if (!rw->lock.val &&
+		   (futex_cmpxchg(&rw->lock.val, 0, RW_WLOCKED) == 0))
+			return;
+
+		rw->write_ec.contend = 1;
+
+		if (timestat) {
+			clock_gettime(CLOCK_REALTIME, &stime);
+			ret = futex_wait(&rw->write_ec.val, ec, NULL, flags);
+			clock_gettime(CLOCK_REALTIME, &etime);
+			systime_add(tid, TIME_LOCK, &stime, &etime);
+		} else {
+			ret = futex_wait(&rw->write_ec.val, ec, NULL, flags);
+		}
+
+		stat_inc(tid, STAT_LOCKS);
+		if (ret < 0) {
+			if (errno == EAGAIN)
+				stat_inc(tid, STAT_EAGAINS);
+			else
+				stat_inc(tid, STAT_LOCKERRS);
+		}
+
+		/* Other writers may exist */
+		rw->write_ec.contend = 1;
+	}
+}
+
+static void ww_write_unlock(futex_t *futex __maybe_unused, int tid)
+{
+	struct timespec stime, etime;
+	struct rwlock *rw = &rwfutex;
+	int ret;
+
+	rw->lock.wlocked = 0;	/* Unlock */
+
+	/* Wake all the readers */
+	atomic_add_return(&rw->read_ec.val, RW_EC_INC);
+	if (rw->read_ec.contend) {
+		rw->read_ec.contend = 0;
+		if (timestat) {
+			clock_gettime(CLOCK_REALTIME, &stime);
+			ret = futex_wake(&rw->read_ec.val, INT_MAX, flags);
+			clock_gettime(CLOCK_REALTIME, &etime);
+			systime_add(tid, TIME_UNLK, &stime, &etime);
+		} else {
+			ret = futex_wake(&rw->read_ec.val, INT_MAX, flags);
+		}
+		stat_inc(tid, STAT_UNLOCKS);
+		if (ret < 0)
+			stat_inc(tid, STAT_UNLKERRS);
+		else
+			stat_add(tid, STAT_WAKEUPS, ret);
+		if (ret > 0)
+			return;
+	}
+
+	/* Wake a writer */
+	atomic_add_return(&rw->write_ec.val, RW_EC_INC);
+	if (rw->write_ec.contend) {
+		rw->write_ec.contend = 0;
+		if (timestat) {
+			clock_gettime(CLOCK_REALTIME, &stime);
+			ret = futex_wake(&rw->write_ec.val, 1, flags);
+			clock_gettime(CLOCK_REALTIME, &etime);
+			systime_add(tid, TIME_UNLK, &stime, &etime);
+		} else {
+			ret = futex_wake(&rw->write_ec.val, 1, flags);
+		}
+		stat_inc(tid, STAT_UNLOCKS);
+		if (ret < 0)
+			stat_inc(tid, STAT_UNLKERRS);
+		else
+			stat_add(tid, STAT_WAKEUPS, ret);
+	}
+}
+
+static void ww_read_lock(futex_t *futex __maybe_unused, int tid)
+{
+	struct timespec stime, etime;
+	struct rwlock *rw = &rwfutex;
+	futex_t ec = rw->read_ec.val, state;
+	int ret;
+
+	state = atomic_add_return(&rw->lock.val, RW_READER);
+
+	while (state & RW_WLOCKED) {
+		ec |= RW_EC_CONTEND;
+		rw->read_ec.contend = 1;
+
+		/* Sleep until no longer held by a writer */
+		if (timestat) {
+			clock_gettime(CLOCK_REALTIME, &stime);
+			ret = futex_wait(&rw->read_ec.val, ec, NULL, flags);
+			clock_gettime(CLOCK_REALTIME, &etime);
+			systime_add(tid, TIME_SLOCK, &stime, &etime);
+		} else {
+			ret = futex_wait(&rw->read_ec.val, ec, NULL, flags);
+		}
+		stat_inc(tid, STAT_SLOCKS);
+		if (ret < 0) {
+			if (errno == EAGAIN)
+				stat_inc(tid, STAT_EAGAINS);
+			else
+				stat_inc(tid, STAT_SLOCKERRS);
+		}
+
+		ec = rw->read_ec.val;
+		barrier();
+		state = rw->lock.val;
+	}
+}
+
+static void ww_read_unlock(futex_t *futex __maybe_unused, int tid)
+{
+	struct timespec stime, etime;
+	struct rwlock *rw = &rwfutex;
+	futex_t state;
+	int ret;
+
+	/* Read unlock */
+	state = atomic_add_return(&rw->lock.val, -RW_READER);
+
+	/* Other readers there, don't do anything */
+	if (state >> 8)
+		return;
+
+	/* We may need to wake up a writer */
+	atomic_add_return(&rw->write_ec.val, RW_EC_INC);
+	if (rw->write_ec.contend) {
+		rw->write_ec.contend = 0;
+		if (timestat) {
+			clock_gettime(CLOCK_REALTIME, &stime);
+			ret = futex_wake(&rw->write_ec.val, 1, flags);
+			clock_gettime(CLOCK_REALTIME, &etime);
+			systime_add(tid, TIME_SUNLK, &stime, &etime);
+		} else {
+			ret = futex_wake(&rw->write_ec.val, 1, flags);
+		}
+		stat_inc(tid, STAT_SUNLOCKS);
+		if (ret < 0)
+			stat_inc(tid, STAT_SUNLKERRS);
+		else
+			stat_add(tid, STAT_WAKEUPS, ret);
+	}
+}
+
+/*
+ * Writer preferring rwlock functions
+ */
+#define ww2_write_lock	ww_write_lock
+#define ww2_read_unlock	ww_read_unlock
+
+static void ww2_write_unlock(futex_t *futex __maybe_unused, int tid)
+{
+	struct timespec stime, etime;
+	struct rwlock *rw = &rwfutex;
+	int ret;
+
+	rw->lock.wlocked = 0;	/* Unlock */
+
+	/* Wake a writer */
+	atomic_add_return(&rw->write_ec.val, RW_EC_INC);
+	if (rw->write_ec.contend) {
+		rw->write_ec.contend = 0;
+		if (timestat) {
+			clock_gettime(CLOCK_REALTIME, &stime);
+			ret = futex_wake(&rw->write_ec.val, 1, flags);
+			clock_gettime(CLOCK_REALTIME, &etime);
+			systime_add(tid, TIME_UNLK, &stime, &etime);
+		} else {
+			ret = futex_wake(&rw->write_ec.val, 1, flags);
+		}
+		stat_inc(tid, STAT_UNLOCKS);
+		if (ret < 0)
+			stat_inc(tid, STAT_UNLKERRS);
+		else
+			stat_add(tid, STAT_WAKEUPS, ret);
+		if (ret > 0)
+			return;
+	}
+
+	/* Wake all the readers */
+	atomic_add_return(&rw->read_ec.val, RW_EC_INC);
+	if (rw->read_ec.contend) {
+		rw->read_ec.contend = 0;
+		if (timestat) {
+			clock_gettime(CLOCK_REALTIME, &stime);
+			ret = futex_wake(&rw->read_ec.val, INT_MAX, flags);
+			clock_gettime(CLOCK_REALTIME, &etime);
+			systime_add(tid, TIME_UNLK, &stime, &etime);
+		} else {
+			ret = futex_wake(&rw->read_ec.val, INT_MAX, flags);
+		}
+		stat_inc(tid, STAT_UNLOCKS);
+		if (ret < 0)
+			stat_inc(tid, STAT_UNLKERRS);
+		else
+			stat_add(tid, STAT_WAKEUPS, ret);
+	}
+}
+
+static void ww2_read_lock(futex_t *futex __maybe_unused, int tid)
+{
+	struct timespec stime, etime;
+	struct rwlock *rw = &rwfutex;
+
+	for (;;) {
+		futex_t ec = rw->read_ec.val | RW_EC_CONTEND;
+		futex_t state;
+		int ret;
+
+		if (!rw->write_ec.contend) {
+			state = atomic_add_return(&rw->lock.val, RW_READER);
+
+			if (!(state & RW_WLOCKED))
+				return;
+
+			/* Unlock */
+			state = atomic_add_return(&rw->lock.val, -RW_READER);
+		} else {
+			atomic_add_return(&rw->write_ec.val, RW_EC_INC);
+			if (rw->write_ec.contend) {
+				rw->write_ec.contend = 0;
+
+				/*  Wake a writer, and then try again */
+				if (timestat) {
+					clock_gettime(CLOCK_REALTIME, &stime);
+					ret = futex_wake(&rw->write_ec.val, 1,
+							 flags);
+					clock_gettime(CLOCK_REALTIME, &etime);
+					systime_add(tid, TIME_SUNLK, &stime,
+						    &etime);
+				} else {
+					ret = futex_wake(&rw->write_ec.val, 1,
+							 flags);
+				}
+				stat_inc(tid, STAT_SUNLOCKS);
+				if (ret < 0)
+					stat_inc(tid, STAT_SUNLKERRS);
+				else
+					stat_add(tid, STAT_WAKEUPS, ret);
+				continue;
+			}
+		}
+
+		rw->read_ec.contend = 1;
+		if (rw->read_ec.val != ec)
+			continue;
+
+		/* Sleep until no longer held by a writer */
+		if (timestat) {
+			clock_gettime(CLOCK_REALTIME, &stime);
+			ret = futex_wait(&rw->read_ec.val, ec, NULL, flags);
+			clock_gettime(CLOCK_REALTIME, &etime);
+			systime_add(tid, TIME_SLOCK, &stime, &etime);
+		} else {
+			ret = futex_wait(&rw->read_ec.val, ec, NULL, flags);
+		}
+		stat_inc(tid, STAT_SLOCKS);
+		if (ret < 0) {
+			if (errno == EAGAIN)
+				stat_inc(tid, STAT_EAGAINS);
+			else
+				stat_inc(tid, STAT_SLOCKERRS);
+		}
+	}
+}
+
+/*
+ * TP futex reader/writer lock/unlock functions
+ */
+#define tp_write_lock		tp_mutex_lock
+#define tp_write_unlock		tp_mutex_unlock
+#define FUTEX_FLAGS_MASK	(3UL << 30)
+
+static void tp_read_lock(futex_t *futex, int tid)
+{
+	struct timespec stime, etime;
+	futex_t val;
+	int ret, retry = 0;
+
+	val = futex_cmpxchg(futex, 0, FUTEX_SHARED + 1);
+	if (!val)
+		return;
+
+	for (;;) {
+		futex_t old = val, new;
+
+		/*
+		 * Try to increment the reader count only if
+		 * 1) the FUTEX_SHARED bit is set; and
+		 * 2) none of the flags bits or FUTEX_SHARED_UNLOCK is set.
+		 */
+		if (!old)
+			new = FUTEX_SHARED + 1;
+		else if ((old & FUTEX_SHARED) &&
+			!(old & (FUTEX_FLAGS_MASK|FUTEX_SHARED_UNLOCK)))
+			new = old + 1;
+		else
+			break;
+		val = futex_cmpxchg(futex, old, new);
+		if (val == old)
+			goto out;
+		retry++;
+	}
+
+	for (;;) {
+		if (timestat) {
+			clock_gettime(CLOCK_REALTIME, &stime);
+			ret = futex_lock_shared(futex, NULL, flags);
+			clock_gettime(CLOCK_REALTIME, &etime);
+			systime_add(tid, TIME_SLOCK, &stime, &etime);
+		} else {
+			ret = futex_lock_shared(futex, NULL, flags);
+		}
+		stat_inc(tid, STAT_SLOCKS);
+		if (ret >= 0)
+			break;
+		stat_inc(tid, STAT_SLOCKERRS);
+	}
+	/*
+	 * Get # of sleeps & locking method
+	 */
+	stat_add(tid, STAT_SSLEEPS, ret >> 16);
+	if (ret & 0x100)
+		 stat_inc(tid, STAT_SALONES);	/* In alone mode */
+	if (ret & 0x200)
+		 stat_inc(tid, STAT_SGROUPS);	/* In group mode */
+
+	ret &= 0xff;
+	if (!ret)
+		stat_inc(tid, STAT_STEALS);
+	else if (ret == 2)
+		stat_inc(tid, STAT_HANDOFFS);
+out:
+	if (unlikely(retry))
+		stat_add(tid, STAT_SLRETRIES, retry);
+}
+
+static void tp_read_unlock(futex_t *futex, int tid)
+{
+	struct timespec stime, etime;
+	futex_t old, val;
+	int ret, retry = 0;
+
+	val = atomic_dec_return(futex);
+	if (!(val & FUTEX_SHARED)) {
+		fprintf(stderr,
+			"tp_read_unlock: Incorrect futex value = 0x%x\n", val);
+		exit(1);
+	}
+
+	for (;;) {
+		/*
+		 * Return if not the last reader, not in shared locking
+		 * mode or the unlock bit has been set.
+		 */
+		if ((val & (FUTEX_SCNT_MASK|FUTEX_SHARED_UNLOCK)) ||
+		   !(val & FUTEX_SHARED))
+			return;
+
+		if (val & ~FUTEX_SHARED_TID_MASK) {
+			/*
+			 * Only one task that can set the FUTEX_SHARED_UNLOCK
+			 * bit will do the unlock.
+			 */
+			old = futex_cmpxchg(futex, val,
+					    val|FUTEX_SHARED_UNLOCK);
+			if (old == val)
+				break;
+		} else {
+			/*
+			 * Try to clear the futex.
+			 */
+			old = futex_cmpxchg(futex, val, 0);
+			if (old == val)
+				goto out;
+		}
+		val = old;
+		retry++;
+	}
+
+	if (timestat) {
+		clock_gettime(CLOCK_REALTIME, &stime);
+		ret = futex_unlock_shared(futex, flags);
+		clock_gettime(CLOCK_REALTIME, &etime);
+		systime_add(tid, TIME_SUNLK, &stime, &etime);
+	} else {
+		ret = futex_unlock_shared(futex, flags);
+	}
+	stat_inc(tid, STAT_SUNLOCKS);
+	if (ret < 0)
+		stat_inc(tid, STAT_SUNLKERRS);
+	else
+		stat_add(tid, STAT_WAKEUPS, ret);
+out:
+	if (unlikely(retry))
+		stat_add(tid, STAT_SURETRIES, retry);
+}
+
+/*
+ * Glibc read/write lock
+ */
+static void gc_write_lock(futex_t *futex __maybe_unused,
+			  int tid __maybe_unused)
+{
+	pthread_rwlock_wrlock(&rwlock);
+}
+
+static void gc_write_unlock(futex_t *futex __maybe_unused,
+			    int tid __maybe_unused)
+{
+	pthread_rwlock_unlock(&rwlock);
+}
+
+static void gc_read_lock(futex_t *futex __maybe_unused,
+			 int tid __maybe_unused)
+{
+	pthread_rwlock_rdlock(&rwlock);
+}
+
+static void gc_read_unlock(futex_t *futex __maybe_unused,
+			   int tid __maybe_unused)
+{
+	pthread_rwlock_unlock(&rwlock);
+}
+
 /**************************************************************************/
 
 /*
@@ -576,6 +1144,77 @@ static void *mutex_workerfn(void *arg)
 	return NULL;
 }
 
+static void *rwlock_workerfn(void *arg)
+{
+	long tid = (long)arg;
+	struct worker *w = &worker[tid];
+	lock_fn_t rlock_fn = read_lock_fn;
+	lock_fn_t wlock_fn = write_lock_fn;
+	unlock_fn_t runlock_fn = read_unlock_fn;
+	unlock_fn_t wunlock_fn = write_unlock_fn;
+
+	thread_id = gettid();
+	counter = 0;
+
+	atomic_dec_return(&threads_starting);
+
+	/*
+	 * Busy wait until asked to start
+	 */
+	while (!worker_start)
+		cpu_relax();
+
+	if (rthread_threshold >= 0) {
+		if (tid < rthread_threshold) {
+			do {
+				rlock_fn(w->futex, tid);
+				load(tid);
+				runlock_fn(w->futex, tid);
+				w->stats[STAT_SOPS]++;
+				csdelay();
+			} while (!done);
+		} else {
+			do {
+				wlock_fn(w->futex, tid);
+				load(tid);
+				wunlock_fn(w->futex, tid);
+				w->stats[STAT_OPS]++;
+				csdelay();
+			} while (!done);
+		}
+		goto out;
+	}
+
+	while (!done) {
+		int rcnt = rpercent;
+		int wcnt = 100 - rcnt;
+
+		do {
+			if (wcnt) {
+				wlock_fn(w->futex, tid);
+				load(tid);
+				wunlock_fn(w->futex, tid);
+				w->stats[STAT_OPS]++;
+				wcnt--;
+				csdelay();
+			}
+			if (rcnt) {
+				rlock_fn(w->futex, tid);
+				load(tid);
+				runlock_fn(w->futex, tid);
+				w->stats[STAT_SOPS]++;
+				rcnt--;
+				csdelay();
+			}
+		}  while (!done && (rcnt + wcnt));
+	}
+out:
+	if (verbose)
+		printf("[thread %3ld (%d)] exited.\n", tid, thread_id);
+	atomic_inc_return(&threads_stopping);
+	return NULL;
+}
+
 static void create_threads(struct worker *w, pthread_attr_t *thread_attr,
 			   void *(*workerfn)(void *arg), long tid)
 {
@@ -631,6 +1270,66 @@ static int futex_mutex_type(const char **ptype)
 	return 0;
 }
 
+static int futex_rwlock_type(const char **ptype)
+{
+	const char *type = *ptype;
+
+	if (!strcasecmp(type, "WW")) {
+		*ptype = "WW";
+		pfutex = &rwfutex.lock.val;
+		if (pwriter) {
+			read_lock_fn = ww2_read_lock;
+			read_unlock_fn = ww2_read_unlock;
+			write_lock_fn = ww2_write_lock;
+			write_unlock_fn = ww2_write_unlock;
+		} else {
+			read_lock_fn = ww_read_lock;
+			read_unlock_fn = ww_read_unlock;
+			write_lock_fn = ww_write_lock;
+			write_unlock_fn = ww_write_unlock;
+		}
+	} else if (!strcasecmp(type, "TP")) {
+		*ptype = "TP";
+		read_lock_fn = tp_read_lock;
+		read_unlock_fn = tp_read_unlock;
+		write_lock_fn = tp_write_lock;
+		write_unlock_fn = tp_write_unlock;
+
+		/*
+		 * Check if TP futex is supported.
+		 */
+		futex_unlock(&global_futex, 0);
+		if (errno == ENOSYS) {
+			fprintf(stderr,
+			    "\nTP futexes are not supported by the kernel!\n");
+			return -1;
+		}
+	} else if (!strcasecmp(type, "GC")) {
+		pthread_rwlockattr_t *attr = NULL;
+
+		*ptype = "GC";
+		read_lock_fn = gc_read_lock;
+		read_unlock_fn = gc_read_unlock;
+		write_lock_fn = gc_write_lock;
+		write_unlock_fn = gc_write_unlock;
+		if (pwriter || fshared) {
+			pthread_rwlockattr_init(&rwlock_attr);
+			attr = &rwlock_attr;
+			rwlock_attr_inited = true;
+			if (fshared)
+				pthread_rwlockattr_setpshared(attr, true);
+			if (pwriter)
+				pthread_rwlockattr_setkind_np(attr,
+				  PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP);
+		}
+		pthread_rwlock_init(&rwlock, attr);
+		rwlock_inited = true;
+	} else {
+		return -1;
+	}
+	return 0;
+}
+
 static void futex_test_driver(const char *futex_type,
 			      int (*proc_type)(const char **ptype),
 			      void *(*workerfn)(void *arg))
@@ -647,15 +1346,25 @@ static void futex_test_driver(const char *futex_type,
 	 */
 	const char *desc[STAT_NUM] = {
 		[STAT_OPS]	 = "Total exclusive locking ops",
+		[STAT_SOPS]	 = "Total shared locking ops",
 		[STAT_LOCKS]	 = "Exclusive lock futex calls",
 		[STAT_UNLOCKS]	 = "Exclusive unlock futex calls",
 		[STAT_SLEEPS]	 = "Exclusive lock sleeps",
+		[STAT_SLOCKS]	 = "Shared lock futex calls",
+		[STAT_SUNLOCKS]	 = "Shared unlock futex calls",
+		[STAT_SSLEEPS]	 = "Shared lock sleeps",
 		[STAT_WAKEUPS]	 = "Process wakeups",
 		[STAT_EAGAINS]	 = "EAGAIN lock errors",
 		[STAT_HANDOFFS]  = "Lock handoffs",
 		[STAT_STEALS]	 = "Lock stealings",
+		[STAT_SLRETRIES] = "Shared lock retries",
+		[STAT_SURETRIES] = "Shared unlock retries",
+		[STAT_SALONES]	 = "Shared lock ops (alone mode)",
+		[STAT_SGROUPS]	 = "Shared lock ops (group mode)",
 		[STAT_LOCKERRS]  = "\nExclusive lock errors",
 		[STAT_UNLKERRS]  = "\nExclusive unlock errors",
+		[STAT_SLOCKERRS] = "\nShared lock errors",
+		[STAT_SUNLKERRS] = "\nShared unlock errors",
 	};
 
 	if (exit_now)
@@ -667,9 +1376,18 @@ static void futex_test_driver(const char *futex_type,
 	}
 
 	printf("\n=====================================\n");
-	printf("[PID %d]: %d threads doing %s futex lockings (load=%d) for %d secs.\n\n",
+	printf("[PID %d]: %d threads doing %s futex lockings (load=%d) for %d secs.\n",
 	       getpid(), nthreads, futex_type, loadlat, nsecs);
 
+	if (xthread) {
+		/*
+		 * Compute numbers of reader and writer threads.
+		 */
+		rthread_threshold = (rpercent * nthreads + 50)/100;
+		printf("\t\t{%d reader threads, %d writer threads}\n",
+			rthread_threshold, nthreads - rthread_threshold);
+	}
+	printf("\n");
 	init_stats(&throughput_stats);
 
 	*pfutex = 0;
@@ -739,11 +1457,15 @@ static void futex_test_driver(const char *futex_type,
 		/*
 		 * Get a rounded estimate of the # of locking ops/sec.
 		 */
-		u64 tp = (u64)worker[i].stats[STAT_OPS] * 1000000 / us;
+		u64 tp = (u64)(worker[i].stats[STAT_OPS] +
+			       worker[i].stats[STAT_SOPS]) * 1000000 / us;
 
 		for (j = 0; j < STAT_NUM; j++)
 			total.stats[j] += worker[i].stats[j];
 
+		for (j = 0; j < TIME_NUM; j++)
+			total.times[j] += worker[i].times[j];
+
 		update_stats(&throughput_stats, tp);
 		if (verbose)
 			printf("[thread %3d] futex: %p [ %'ld ops/sec ]\n",
@@ -767,21 +1489,41 @@ static void futex_test_driver(const char *futex_type,
 		if (total.stats[STAT_UNLOCKS])
 			printf("Avg exclusive unlock syscall = %'ldns\n",
 			    total.times[TIME_UNLK]/total.stats[STAT_UNLOCKS]);
+		if (total.stats[STAT_SLOCKS])
+			printf("Avg shared lock syscall      = %'ldns\n",
+			    total.times[TIME_SLOCK]/total.stats[STAT_SLOCKS]);
+		if (total.stats[STAT_SUNLOCKS])
+			printf("Avg shared unlock syscall    = %'ldns\n",
+			    total.times[TIME_SUNLK]/total.stats[STAT_SUNLOCKS]);
 	}
 
 	printf("\nPercentages:\n");
 	if (total.stats[STAT_LOCKS])
 		printf("Exclusive lock futex calls   = %.1f%%\n",
 			stat_percent(&total, STAT_LOCKS, STAT_OPS));
+	if (total.stats[STAT_SLOCKS])
+		printf("Shared lock futex calls      = %.1f%%\n",
+			stat_percent(&total, STAT_SLOCKS, STAT_SOPS));
 	if (total.stats[STAT_UNLOCKS])
 		printf("Exclusive unlock futex calls = %.1f%%\n",
 			stat_percent(&total, STAT_UNLOCKS, STAT_OPS));
+	if (total.stats[STAT_SUNLOCKS])
+		printf("Shared unlock futex calls    = %.1f%%\n",
+			stat_percent(&total, STAT_SUNLOCKS, STAT_SOPS));
 	if (total.stats[STAT_EAGAINS])
 		printf("EAGAIN lock errors           = %.1f%%\n",
-			stat_percent(&total, STAT_EAGAINS, STAT_LOCKS));
+			(double)total.stats[STAT_EAGAINS] * 100 /
+			(total.stats[STAT_LOCKS] + total.stats[STAT_SLOCKS]));
 	if (total.stats[STAT_WAKEUPS])
 		printf("Process wakeups              = %.1f%%\n",
-			stat_percent(&total, STAT_WAKEUPS, STAT_UNLOCKS));
+			(double)total.stats[STAT_WAKEUPS] * 100 /
+			(total.stats[STAT_UNLOCKS] +
+			 total.stats[STAT_SUNLOCKS]));
+	if (xthread)
+		printf("Reader operations            = %.1f%%\n",
+			(double)total.stats[STAT_SOPS] * 100 /
+			(total.stats[STAT_OPS] + total.stats[STAT_SOPS]));
+
 
 	printf("\nPer-thread Locking Rates:\n");
 	printf("Avg = %'d ops/sec (+- %.2f%%)\n", (int)(avg + 0.5),
@@ -789,28 +1531,55 @@ static void futex_test_driver(const char *futex_type,
 	printf("Min = %'d ops/sec\n", (int)throughput_stats.min);
 	printf("Max = %'d ops/sec\n", (int)throughput_stats.max);
 
+	/*
+	 * Compute the average reader and writer locking operation rates
+	 * with separate reader and writer threads.
+	 */
+	if (xthread) {
+		u64 tp;
+
+		/* Reader stats */
+		memset(&throughput_stats, 0, sizeof(throughput_stats));
+		for (i = 0, tp = 0; i < rthread_threshold; i++) {
+			tp = (u64)worker[i].stats[STAT_SOPS] * 1000000 / us;
+			update_stats(&throughput_stats, tp);
+		}
+		avg    = avg_stats(&throughput_stats);
+		stddev = stddev_stats(&throughput_stats);
+		printf("\nReader avg = %'d ops/sec (+- %.2f%%)\n",
+			(int)(avg + 0.5), rel_stddev_stats(stddev, avg));
+
+		/* Writer stats */
+		memset(&throughput_stats, 0, sizeof(throughput_stats));
+		for (tp = 0; i < (int)nthreads; i++) {
+			tp = (u64)worker[i].stats[STAT_OPS] * 1000000 / us;
+			update_stats(&throughput_stats, tp);
+		}
+		avg    = avg_stats(&throughput_stats);
+		stddev = stddev_stats(&throughput_stats);
+		printf("Writer avg = %'d ops/sec (+- %.2f%%)\n",
+			(int)(avg + 0.5), rel_stddev_stats(stddev, avg));
+	}
+
 	if (*pfutex != 0)
 		printf("\nResidual futex value = 0x%x\n", *pfutex);
 
 	/* Clear the workers area */
 	memset(worker, 0, sizeof(*worker) * nthreads);
+
+	if (rwlock_inited)
+		pthread_rwlock_destroy(&rwlock);
+	if (rwlock_attr_inited)
+		pthread_rwlockattr_destroy(&rwlock_attr);
 }
 
-int bench_futex_mutex(int argc, const char **argv,
-		      const char *prefix __maybe_unused)
+static void bench_futex_common(struct sigaction *act)
 {
-	struct sigaction act;
-
-	argc = parse_options(argc, argv, mutex_options,
-			     bench_futex_mutex_usage, 0);
-	if (argc)
-		goto err;
-
 	ncpus = sysconf(_SC_NPROCESSORS_ONLN);
 
-	sigfillset(&act.sa_mask);
-	act.sa_sigaction = toggle_done;
-	sigaction(SIGINT, &act, NULL);
+	sigfillset(&act->sa_mask);
+	act->sa_sigaction = toggle_done;
+	sigaction(SIGINT, act, NULL);
 
 	if (!nthreads)
 		nthreads = ncpus;
@@ -828,6 +1597,19 @@ int bench_futex_mutex(int argc, const char **argv,
 
 	if (!fshared)
 		flags = FUTEX_PRIVATE_FLAG;
+}
+
+int bench_futex_mutex(int argc, const char **argv,
+		      const char *prefix __maybe_unused)
+{
+	struct sigaction act;
+
+	argc = parse_options(argc, argv, mutex_options,
+			     bench_futex_mutex_usage, 0);
+	if (argc)
+		goto err;
+
+	bench_futex_common(&act);
 
 	if (!ftype || !strcmp(ftype, "all")) {
 		futex_test_driver("WW", futex_mutex_type, mutex_workerfn);
@@ -842,3 +1624,31 @@ int bench_futex_mutex(int argc, const char **argv,
 	usage_with_options(bench_futex_mutex_usage, mutex_options);
 	exit(EXIT_FAILURE);
 }
+
+int bench_futex_rwlock(int argc, const char **argv,
+		      const char *prefix __maybe_unused)
+{
+	struct sigaction act;
+
+	argc = parse_options(argc, argv, rwlock_options,
+			     bench_futex_rwlock_usage, 0);
+	if (argc)
+		goto err;
+
+	ncpus = sysconf(_SC_NPROCESSORS_ONLN);
+
+	bench_futex_common(&act);
+
+	if (!ftype || !strcmp(ftype, "all")) {
+		futex_test_driver("WW", futex_rwlock_type, rwlock_workerfn);
+		futex_test_driver("TP", futex_rwlock_type, rwlock_workerfn);
+		futex_test_driver("GC", futex_rwlock_type, rwlock_workerfn);
+	} else {
+		futex_test_driver(ftype, futex_rwlock_type, rwlock_workerfn);
+	}
+	free(worker_alloc);
+	return 0;
+err:
+	usage_with_options(bench_futex_rwlock_usage, rwlock_options);
+	exit(EXIT_FAILURE);
+}
diff --git a/tools/perf/bench/futex.h b/tools/perf/bench/futex.h
index be1eb7a..199cc77 100644
--- a/tools/perf/bench/futex.h
+++ b/tools/perf/bench/futex.h
@@ -88,8 +88,10 @@
 }
 
 #ifndef FUTEX_LOCK
-#define FUTEX_LOCK	0xff	/* Disable the use of TP futexes */
-#define FUTEX_UNLOCK	0xff
+#define FUTEX_LOCK		0xff	/* Disable the use of TP futexes */
+#define FUTEX_UNLOCK		0xff
+#define FUTEX_LOCK_SHARED	0xff
+#define FUTEX_UNLOCK_SHARED	0xff
 #endif /* FUTEX_LOCK */
 
 /**
@@ -110,6 +112,24 @@
 	return futex(uaddr, FUTEX_UNLOCK, 0, NULL, NULL, 0, opflags);
 }
 
+/**
+ * futex_lock_shared() - shared locking of TP futex
+ */
+static inline int
+futex_lock_shared(u_int32_t *uaddr, struct timespec *timeout, int opflags)
+{
+	return futex(uaddr, FUTEX_LOCK_SHARED, 0, timeout, NULL, 0, opflags);
+}
+
+/**
+ * futex_unlock_shared() - shared unlocking of TP futex
+ */
+static inline int
+futex_unlock_shared(u_int32_t *uaddr, int opflags)
+{
+	return futex(uaddr, FUTEX_UNLOCK_SHARED, 0, NULL, NULL, 0, opflags);
+}
+
 #ifndef HAVE_PTHREAD_ATTR_SETAFFINITY_NP
 #include <pthread.h>
 static inline int pthread_attr_setaffinity_np(pthread_attr_t *attr,
diff --git a/tools/perf/builtin-bench.c b/tools/perf/builtin-bench.c
index bf4418d..424a014 100644
--- a/tools/perf/builtin-bench.c
+++ b/tools/perf/builtin-bench.c
@@ -64,6 +64,7 @@ struct bench {
 	/* pi-futexes */
 	{ "lock-pi",	"Benchmark for futex lock_pi calls",            bench_futex_lock_pi	},
 	{ "mutex",	"Benchmark for mutex locks using futexes",	bench_futex_mutex	},
+	{ "rwlock",	"Benchmark for rwlocks using futexes",		bench_futex_rwlock	},
 	{ "all",	"Run all futex benchmarks",			NULL			},
 	{ NULL,		NULL,						NULL			}
 };
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread
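
A userspace read lock of the kind the WW variant of this benchmark exercises can be sketched with C11 atomics: readers atomically increment a count unless a writer bit is set, and fall back to a futex wait on contention. The sketch below is illustrative only — the names (`rw_word`, `WRITER_BIT`, `rw_read_lock`) are not taken from the patch, and the contended path merely spins where the real code would call into the futex syscall.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical word layout: bit 31 = writer held, bits 0-30 = reader count. */
#define WRITER_BIT	(1u << 31)

static _Atomic uint32_t rw_word;

/*
 * Take a read lock. While a writer holds the lock we retry; the real
 * benchmark would futex-wait at that point instead of spinning.
 * Returns the number of writer-contended retries seen (for illustration).
 */
static int rw_read_lock(void)
{
	int retries = 0;
	uint32_t old = atomic_load(&rw_word);

	for (;;) {
		if (old & WRITER_BIT) {		/* writer active */
			retries++;
			old = atomic_load(&rw_word);
			continue;
		}
		/* CAS in old+1: one more reader, writer bit still clear. */
		if (atomic_compare_exchange_weak(&rw_word, &old, old + 1))
			return retries;
	}
}

static void rw_read_unlock(void)
{
	atomic_fetch_sub(&rw_word, 1);
}
```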

* [PATCH v4 19/20] sched, TP-futex: Make wake_up_q() return wakeup count
  2016-12-29 16:13 [PATCH v4 00/20] futex: Introducing throughput-optimized (TP) futexes Waiman Long
                   ` (17 preceding siblings ...)
  2016-12-29 16:13 ` [PATCH v4 18/20] perf bench: New microbenchmark for userspace rwlock performance Waiman Long
@ 2016-12-29 16:13 ` Waiman Long
  2016-12-29 16:13 ` [PATCH v4 20/20] futex: Dump internal futex state via debugfs Waiman Long
  19 siblings, 0 replies; 28+ messages in thread
From: Waiman Long @ 2016-12-29 16:13 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Arnaldo Carvalho de Melo,
	Davidlohr Bueso, Mike Galbraith, Scott J Norton, Waiman Long

Unlike wake_up_process(), wake_up_q() doesn't tell us how many
tasks have been woken up. This information can sometimes be useful
for tracking purposes. So wake_up_q() is now modified to return that
information.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/sched.h | 2 +-
 kernel/futex.c        | 8 +++-----
 kernel/sched/core.c   | 6 ++++--
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 453143c..eeb6d5a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1038,7 +1038,7 @@ struct wake_q_head {
 
 extern void wake_q_add(struct wake_q_head *head,
 		       struct task_struct *task);
-extern void wake_up_q(struct wake_q_head *head);
+extern int wake_up_q(struct wake_q_head *head);
 
 /*
  * sched-domains (multiprocessor balancing) declarations:
diff --git a/kernel/futex.c b/kernel/futex.c
index 5e69c44..4f235e8 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -4254,11 +4254,9 @@ static int futex_unlock(u32 __user *uaddr, unsigned int flags,
 out_put_key:
 	put_futex_key(&key);
 	if (owner) {
-		/*
-		 * No error would have happened if owner defined.
-		 */
-		wake_up_q(&wake_q);
-		return ret ? ret : 1;
+		int cnt = wake_up_q(&wake_q);
+
+		return ret ? ret : cnt;
 	}
 
 	return ret;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 966556e..eed7a15 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -449,9 +449,10 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
 	head->lastp = &node->next;
 }
 
-void wake_up_q(struct wake_q_head *head)
+int wake_up_q(struct wake_q_head *head)
 {
 	struct wake_q_node *node = head->first;
+	int wakecnt = 0;
 
 	while (node != WAKE_Q_TAIL) {
 		struct task_struct *task;
@@ -466,9 +467,10 @@ void wake_up_q(struct wake_q_head *head)
 		 * wake_up_process() implies a wmb() to pair with the queueing
 		 * in wake_q_add() so as not to miss wakeups.
 		 */
-		wake_up_process(task);
+		wakecnt += wake_up_process(task);
 		put_task_struct(task);
 	}
+	return wakecnt;
 }
 
 /*
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread
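
The semantic change in this patch — summing the per-task results so the caller learns how many tasks were actually woken — can be illustrated with a toy wake queue in plain userspace C. Here `wake_one()` stands in for wake_up_process(), which returns 1 only if the task was actually transitioned to running; the struct and function names are hypothetical.

```c
#include <stddef.h>

struct wake_node {
	struct wake_node *next;
	int running;		/* 1 if already running: waking is a no-op */
};

/* Stand-in for wake_up_process(): returns 1 iff the task was woken. */
static int wake_one(struct wake_node *n)
{
	if (n->running)
		return 0;
	n->running = 1;
	return 1;
}

/* Mirrors the new wake_up_q(): walk the queue, accumulate wake results. */
static int wake_up_queue(struct wake_node *head)
{
	int wakecnt = 0;

	for (struct wake_node *n = head; n; n = n->next)
		wakecnt += wake_one(n);
	return wakecnt;
}
```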

* [PATCH v4 20/20] futex: Dump internal futex state via debugfs
  2016-12-29 16:13 [PATCH v4 00/20] futex: Introducing throughput-optimized (TP) futexes Waiman Long
                   ` (18 preceding siblings ...)
  2016-12-29 16:13 ` [PATCH v4 19/20] sched, TP-futex: Make wake_up_q() return wakeup count Waiman Long
@ 2016-12-29 16:13 ` Waiman Long
  2016-12-29 18:55   ` kbuild test robot
  19 siblings, 1 reply; 28+ messages in thread
From: Waiman Long @ 2016-12-29 16:13 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet
  Cc: linux-kernel, linux-doc, Arnaldo Carvalho de Melo,
	Davidlohr Bueso, Mike Galbraith, Scott J Norton, Waiman Long

For debugging purposes, it is sometimes useful to dump the internal
states in the futex hash bucket table. This patch adds a file
"futex_hash_table" in the debugfs root filesystem to dump the internal
futex states.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/futex.c | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 82 insertions(+)

diff --git a/kernel/futex.c b/kernel/futex.c
index 4f235e8..9f7bfd3 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -4425,3 +4425,85 @@ static int __init futex_init(void)
 	return 0;
 }
 __initcall(futex_init);
+
+#if defined(CONFIG_DEBUG_FS) && defined(CONFIG_SMP)
+/*
+ * Debug code to dump selected content of in-kernel futex hash bucket table.
+ */
+#include <linux/debugfs.h>
+
+static int futex_dump_show(struct seq_file *m, void *arg)
+{
+	struct futex_hash_bucket *hb = arg;
+	struct futex_state *state;
+	int i;
+
+	if (list_empty(&hb->fs_head))
+		return 0;
+
+	seq_printf(m, "\nHash bucket %ld:\n", hb - futex_queues);
+	spin_lock(&hb->fs_lock);
+	i = 0;
+	list_for_each_entry(state, &hb->fs_head, fs_list) {
+		seq_printf(m, "  Futex state %d\n", i++);
+		if (state->owner)
+			seq_printf(m, "    owner PID = %d\n",
+				   task_pid_vnr(state->owner));
+		if (state->mutex_owner)
+			seq_printf(m, "    mutex owner PID = %d\n",
+				   task_pid_vnr(state->mutex_owner));
+		seq_printf(m, "    reference count = %d\n",
+			   atomic_read(&state->refcount));
+		seq_printf(m, "    handoff PID = %d\n", state->handoff_pid);
+	}
+	spin_unlock(&hb->fs_lock);
+	return 0;
+}
+
+static void *futex_dump_start(struct seq_file *m, loff_t *pos)
+{
+	return (*pos < futex_hashsize) ? &futex_queues[*pos] : NULL;
+}
+
+static void *futex_dump_next(struct seq_file *m, void *arg, loff_t *pos)
+{
+	(*pos)++;
+	return (*pos < futex_hashsize) ? &futex_queues[*pos] : NULL;
+}
+
+static void futex_dump_stop(struct seq_file *m, void *arg)
+{
+}
+
+static const struct seq_operations futex_dump_op = {
+	.start = futex_dump_start,
+	.next  = futex_dump_next,
+	.stop  = futex_dump_stop,
+	.show  = futex_dump_show,
+};
+
+static int futex_dump_open(struct inode *inode, struct file *file)
+{
+	return seq_open(file, &futex_dump_op);
+}
+
+static const struct file_operations fops_futex_dump = {
+	.open	 = futex_dump_open,
+	.read	 = seq_read,
+	.llseek	 = seq_lseek,
+	.release = seq_release,
+};
+
+/*
+ * Initialize debugfs for the futex hash bucket table dump.
+ */
+static int __init init_futex_dump(void)
+{
+	if (!debugfs_create_file("futex_hash_table", 0400, NULL, NULL,
+				 &fops_futex_dump))
+		return -ENOMEM;
+	return 0;
+}
+fs_initcall(init_futex_dump);
+
+#endif /* CONFIG_DEBUG_FS && CONFIG_SMP */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread
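
The debugfs dump above is built on the seq_file start/next/stop/show protocol walking the futex_queues array and skipping empty buckets. The same iteration contract can be mimicked in userspace to see how the pieces fit together; everything here (the table contents, the function names) is illustrative, not the kernel code.

```c
#include <stdio.h>
#include <string.h>

#define HASHSIZE 4
static int table[HASHSIZE] = { 0, 2, 0, 1 };	/* futex states per bucket */

/* start/next mirror futex_dump_start()/futex_dump_next(): return the
 * element at *pos, or NULL once *pos runs past the table. */
static int *dump_start(long *pos)
{
	return (*pos < HASHSIZE) ? &table[*pos] : NULL;
}

static int *dump_next(long *pos)
{
	(*pos)++;
	return (*pos < HASHSIZE) ? &table[*pos] : NULL;
}

/* show skips empty buckets, as futex_dump_show() skips an empty fs_head. */
static int dump_show(char *buf, size_t len, int *bucket)
{
	if (*bucket == 0)
		return 0;
	return snprintf(buf, len, "bucket %ld: %d states\n",
			(long)(bucket - table), *bucket);
}

static void dump_all(char *buf, size_t len)
{
	long pos = 0;
	size_t off = 0;

	for (int *b = dump_start(&pos); b; b = dump_next(&pos))
		off += dump_show(buf + off, len - off, b);
}
```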

* Re: [PATCH v4 14/20] TP-futex: Support userspace reader/writer locks
  2016-12-29 16:13 ` [PATCH v4 14/20] TP-futex: Support userspace reader/writer locks Waiman Long
@ 2016-12-29 18:34   ` kbuild test robot
  0 siblings, 0 replies; 28+ messages in thread
From: kbuild test robot @ 2016-12-29 18:34 UTC (permalink / raw)
  To: Waiman Long
  Cc: kbuild-all, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Jonathan Corbet, linux-kernel, linux-doc,
	Arnaldo Carvalho de Melo, Davidlohr Bueso, Mike Galbraith,
	Scott J Norton, Waiman Long

[-- Attachment #1: Type: text/plain, Size: 2157 bytes --]

Hi Waiman,

[auto build test WARNING on linus/master]
[also build test WARNING on v4.10-rc1 next-20161224]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Waiman-Long/futex-Introducing-throughput-optimized-TP-futexes/20161230-020021
config: i386-randconfig-x006-201652 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

Note: it may well be a FALSE warning. FWIW you are at least aware of it now.
http://gcc.gnu.org/wiki/Better_Uninitialized_Warnings

All warnings (new ones prefixed by >>):

   kernel/futex.c: In function 'futex_lock':
>> kernel/futex.c:3776:13: warning: 'rspin_timeout' may be used uninitialized in this function [-Wmaybe-uninitialized]
        else if (curtime > rspin_timeout)
                ^
   kernel/futex.c:3656:6: note: 'rspin_timeout' was declared here
     u64 rspin_timeout;
         ^~~~~~~~~~~~~

vim +/rspin_timeout +3776 kernel/futex.c

  3760	
  3761				if (!owner_task && (reader_value >= 0)) {
  3762					int old_count = reader_value;
  3763	
  3764					/*
  3765					 * Reset timeout value if the old reader
  3766					 * count is 0 or the reader value changes and
  3767					 * handoff time hasn't been reached. Otherwise,
  3768					 * disable reader spinning if the handoff time
  3769					 * is reached.
  3770					 */
  3771					reader_value = (uval & FUTEX_TID_MASK);
  3772					if (!old_count || (!handoff_set
  3773						       && (reader_value != old_count)))
  3774						rspin_timeout = curtime +
  3775								TP_RSPIN_TIMEOUT;
> 3776					else if (curtime > rspin_timeout)
  3777						reader_value = -1;
  3778				}
  3779				if (!handoff_set && (curtime > handoff_time)) {
  3780					WARN_ON(READ_ONCE(state->handoff_pid));
  3781					WRITE_ONCE(state->handoff_pid, vpid);
  3782	
  3783					/*
  3784					 * Disable handoff check.

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 21429 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread
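
The usual remedy for this class of -Wmaybe-uninitialized warning is to give the variable a well-defined "not yet armed" value at its declaration (or to restructure the code so the compiler can prove the first write precedes the read). A minimal sketch of that pattern, with names chosen for illustration rather than taken from the patch:

```c
#include <stdint.h>

/*
 * Sketch of the warning's shape: the timeout is only written on some
 * paths before being compared. Initializing it at the declaration
 * (0 == "not armed yet") makes every read well-defined and silences
 * the warning without changing behavior on the armed paths.
 */
static int handoff_due(uint64_t curtime, int reset)
{
	static uint64_t timeout = 0;	/* 0 == not armed yet */

	if (reset || timeout == 0)
		timeout = curtime + 5;	/* stand-in for TP_RSPIN_TIMEOUT */
	return curtime > timeout;
}
```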

* Re: [PATCH v4 20/20] futex: Dump internal futex state via debugfs
  2016-12-29 16:13 ` [PATCH v4 20/20] futex: Dump internal futex state via debugfs Waiman Long
@ 2016-12-29 18:55   ` kbuild test robot
  0 siblings, 0 replies; 28+ messages in thread
From: kbuild test robot @ 2016-12-29 18:55 UTC (permalink / raw)
  To: Waiman Long
  Cc: kbuild-all, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Jonathan Corbet, linux-kernel, linux-doc,
	Arnaldo Carvalho de Melo, Davidlohr Bueso, Mike Galbraith,
	Scott J Norton, Waiman Long

[-- Attachment #1: Type: text/plain, Size: 2208 bytes --]

Hi Waiman,

[auto build test WARNING on linus/master]
[also build test WARNING on v4.10-rc1 next-20161224]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Waiman-Long/futex-Introducing-throughput-optimized-TP-futexes/20161230-020021
config: i386-randconfig-x006-201652 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All warnings (new ones prefixed by >>):

   kernel/futex.c: In function 'futex_dump_show':
>> kernel/futex.c:4444:33: warning: format '%ld' expects argument of type 'long int', but argument 3 has type 'int' [-Wformat=]
     seq_printf(m, "\nHash bucket %ld:\n", hb - futex_queues);
                                    ^
   kernel/futex.c: In function 'futex_lock':
   kernel/futex.c:3796:13: warning: 'rspin_timeout' may be used uninitialized in this function [-Wmaybe-uninitialized]
        else if (curtime > rspin_timeout)
                ^
   kernel/futex.c:3676:6: note: 'rspin_timeout' was declared here
     u64 rspin_timeout;
         ^~~~~~~~~~~~~

vim +4444 kernel/futex.c

  4428	
  4429	#if defined(CONFIG_DEBUG_FS) && defined(CONFIG_SMP)
  4430	/*
  4431	 * Debug code to dump selected content of in-kernel futex hash bucket table.
  4432	 */
  4433	#include <linux/debugfs.h>
  4434	
  4435	static int futex_dump_show(struct seq_file *m, void *arg)
  4436	{
  4437		struct futex_hash_bucket *hb = arg;
  4438		struct futex_state *state;
  4439		int i;
  4440	
  4441		if (list_empty(&hb->fs_head))
  4442			return 0;
  4443	
> 4444		seq_printf(m, "\nHash bucket %ld:\n", hb - futex_queues);
  4445		spin_lock(&hb->fs_lock);
  4446		i = 0;
  4447		list_for_each_entry(state, &hb->fs_head, fs_list) {
  4448			seq_printf(m, "  Futex state %d\n", i++);
  4449			if (state->owner)
  4450				seq_printf(m, "    owner PID = %d\n",
  4451					   task_pid_vnr(state->owner));
  4452			if (state->mutex_owner)

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 21429 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread
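
The format warning above arises because a pointer difference such as `hb - futex_queues` has type ptrdiff_t, which is only 32 bits on i386, so `%ld` mismatches there. The portable fixes are the `%td` length modifier or an explicit cast to long; a quick standalone check of the `%td` form (the struct and function names are hypothetical, not the kernel's):

```c
#include <stdio.h>
#include <stddef.h>
#include <string.h>

struct bucket { int pad[4]; };

/* Format a bucket index portably: hb - base is a ptrdiff_t, so use %td
 * (or cast to long and use %ld). */
static int fmt_bucket(char *buf, size_t len,
		      const struct bucket *hb, const struct bucket *base)
{
	return snprintf(buf, len, "Hash bucket %td:", hb - base);
}
```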

* Re: [PATCH v4 16/20] TP-futex: Group readers together in wait queue
  2016-12-29 16:13 ` [PATCH v4 16/20] TP-futex: Group readers together in wait queue Waiman Long
@ 2016-12-29 19:53   ` kbuild test robot
  2016-12-29 20:16   ` kbuild test robot
  1 sibling, 0 replies; 28+ messages in thread
From: kbuild test robot @ 2016-12-29 19:53 UTC (permalink / raw)
  To: Waiman Long
  Cc: kbuild-all, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Jonathan Corbet, linux-kernel, linux-doc,
	Arnaldo Carvalho de Melo, Davidlohr Bueso, Mike Galbraith,
	Scott J Norton, Waiman Long

[-- Attachment #1: Type: text/plain, Size: 2583 bytes --]

Hi Waiman,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.10-rc1 next-20161224]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Waiman-Long/futex-Introducing-throughput-optimized-TP-futexes/20161230-020021
config: m32r-usrv_defconfig (attached as .config)
compiler: m32r-linux-gcc (GCC) 6.2.0
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=m32r 

All errors (new ones prefixed by >>):

   kernel/built-in.o: In function `futex_spin_on_reader':
>> kernel/futex.c:3961: undefined reference to `osq_lock'
   kernel/futex.c:3961:(.text+0x60cd8): relocation truncated to fit: R_M32R_26_PCREL_RELA against undefined symbol `osq_lock'
>> kernel/futex.c:3988: undefined reference to `osq_unlock'
   kernel/futex.c:3988:(.text+0x61028): relocation truncated to fit: R_M32R_26_PCREL_RELA against undefined symbol `osq_unlock'
   kernel/futex.c:3994: undefined reference to `osq_unlock'
   kernel/futex.c:3994:(.text+0x61080): relocation truncated to fit: R_M32R_26_PCREL_RELA against undefined symbol `osq_unlock'

vim +3961 kernel/futex.c

  3955		first_reader = READ_ONCE(state->first_reader);
  3956		if (!first_reader)
  3957			first_reader = cmpxchg(&state->first_reader, NULL, current);
  3958		if (!first_reader)
  3959			goto out;	/* Became the first reader */
  3960	
> 3961		if (!osq_lock(&state->reader_osq))
  3962			goto reschedule;
  3963	
  3964		for (;;) {
  3965			u32 uval;
  3966	
  3967			if (!state->locksteal_disabled) {
  3968				ret = futex_trylock_preempt_disabled(uaddr,
  3969							FUTEX_SHARED, &uval);
  3970				/*
  3971				 * Return if lock acquired or an error happened
  3972				 */
  3973				if (ret)
  3974					break;
  3975			}
  3976	
  3977			/*
  3978			 * Reread the first reader value again.
  3979			 */
  3980			first_reader = READ_ONCE(state->first_reader);
  3981			if (!first_reader)
  3982				first_reader = cmpxchg(&state->first_reader, NULL,
  3983							current);
  3984			if (!first_reader || !first_reader->on_cpu)
  3985				break;
  3986	
  3987			if (need_resched()) {
> 3988				osq_unlock(&state->reader_osq);
  3989				goto reschedule;
  3990			}
  3991	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 8617 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 16/20] TP-futex: Group readers together in wait queue
  2016-12-29 16:13 ` [PATCH v4 16/20] TP-futex: Group readers together in wait queue Waiman Long
  2016-12-29 19:53   ` kbuild test robot
@ 2016-12-29 20:16   ` kbuild test robot
  1 sibling, 0 replies; 28+ messages in thread
From: kbuild test robot @ 2016-12-29 20:16 UTC (permalink / raw)
  To: Waiman Long
  Cc: kbuild-all, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Jonathan Corbet, linux-kernel, linux-doc,
	Arnaldo Carvalho de Melo, Davidlohr Bueso, Mike Galbraith,
	Scott J Norton, Waiman Long

[-- Attachment #1: Type: text/plain, Size: 1226 bytes --]

Hi Waiman,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.10-rc1 next-20161224]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Waiman-Long/futex-Introducing-throughput-optimized-TP-futexes/20161230-020021
config: ia64-defconfig (attached as .config)
compiler: ia64-linux-gcc (GCC) 6.2.0
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=ia64 

All errors (new ones prefixed by >>):

   kernel/built-in.o: In function `futex_lock':
>> futex.c:(.text+0xfadd2): undefined reference to `osq_lock'
>> futex.c:(.text+0xfb1f2): undefined reference to `osq_unlock'
   futex.c:(.text+0xfb2d2): undefined reference to `osq_unlock'
   futex.c:(.text+0xfc0c2): undefined reference to `osq_lock'
   futex.c:(.text+0xfc6d2): undefined reference to `osq_unlock'

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 18112 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 07/20] futex: Introduce throughput-optimized (TP) futexes
  2016-12-29 16:13 ` [PATCH v4 07/20] futex: Introduce throughput-optimized (TP) futexes Waiman Long
@ 2016-12-29 21:17   ` kbuild test robot
  0 siblings, 0 replies; 28+ messages in thread
From: kbuild test robot @ 2016-12-29 21:17 UTC (permalink / raw)
  To: Waiman Long
  Cc: kbuild-all, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Jonathan Corbet, linux-kernel, linux-doc,
	Arnaldo Carvalho de Melo, Davidlohr Bueso, Mike Galbraith,
	Scott J Norton, Waiman Long

[-- Attachment #1: Type: text/plain, Size: 1873 bytes --]

Hi Waiman,

[auto build test WARNING on linus/master]
[also build test WARNING on v4.10-rc1 next-20161224]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Waiman-Long/futex-Introducing-throughput-optimized-TP-futexes/20161230-020021
reproduce: make htmldocs

All warnings (new ones prefixed by >>):

>> kernel/futex.c:3419: warning: No description found for parameter 'uaddr'
>> kernel/futex.c:3419: warning: No description found for parameter 'vpid'
>> kernel/futex.c:3419: warning: No description found for parameter 'puval'
>> kernel/futex.c:3419: warning: No description found for parameter 'waiter'
   kernel/futex.c:3456: warning: No description found for parameter 'uaddr'
   kernel/futex.c:3456: warning: No description found for parameter 'vpid'
   kernel/futex.c:3456: warning: No description found for parameter 'puval'

vim +/uaddr +3419 kernel/futex.c

  3403	 * The HB fs_lock should NOT be held while calling this function.
  3404	 * The flag bits are ignored in the trylock.
  3405	 *
  3406	 * If waiter is true
  3407	 * then
  3408	 *   don't preserve the flag bits;
  3409	 * else
  3410	 *   preserve the flag bits
  3411	 * endif
  3412	 *
  3413	 * Return: 1 if lock acquired;
  3414	 *	   0 if lock acquisition failed;
  3415	 *	   -EFAULT if an error happened.
  3416	 */
  3417	static inline int futex_trylock(u32 __user *uaddr, const u32 vpid, u32 *puval,
  3418					const bool waiter)
> 3419	{
  3420		u32 uval, flags = 0;
  3421	
  3422		if (unlikely(get_futex_value(puval, uaddr)))
  3423			return -EFAULT;
  3424	
  3425		uval = *puval;
  3426	
  3427		if (uval & FUTEX_TID_MASK)

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 6457 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread
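
These warnings fire because the kernel-doc comment above futex_trylock() lacks one "@name: description" line per parameter. The stub below shows the expected shape of such a header on a deliberately trivial function — it is only a format illustration, not the real futex code.

```c
/**
 * shifted_or - illustrative helper carrying a complete kernel-doc header
 * @a: first operand
 * @b: second operand, shifted left by 8 before the OR
 *
 * Kernel-doc wants exactly one "@name: description" line for every
 * parameter, directly under the function line, before any free text.
 *
 * Return: a | (b << 8)
 */
static unsigned int shifted_or(unsigned int a, unsigned int b)
{
	return a | (b << 8);
}
```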

* Re: [PATCH v4 13/20] perf bench: New microbenchmark for userspace mutex performance
  2016-12-29 16:13 ` [PATCH v4 13/20] perf bench: New microbenchmark for userspace mutex performance Waiman Long
@ 2017-01-02 17:16   ` Arnaldo Carvalho de Melo
  2017-01-02 18:17     ` Waiman Long
  0 siblings, 1 reply; 28+ messages in thread
From: Arnaldo Carvalho de Melo @ 2017-01-02 17:16 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet,
	linux-kernel, linux-doc, Davidlohr Bueso, Mike Galbraith,
	Scott J Norton

Em Thu, Dec 29, 2016 at 11:13:39AM -0500, Waiman Long escreveu:
> This microbenchmark simulates how the use of different futex types
> can affect the actual performance of userspace mutex locks. The
> usage is:
> 
>         perf bench futex mutex <options>

Showing the tool output is preferred, trim it if needed or state why
that is not possible.
 
> Three sets of simple mutex lock and unlock functions are implemented
> using the wait-wake, PI and TP futexes respectively. This
> microbenchmark then runs the locking rate measurement tests using
> either one of those mutexes or all of them consecutively.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  tools/perf/bench/Build         |   1 +
>  tools/perf/bench/bench.h       |   1 +
>  tools/perf/bench/futex-locks.c | 844 +++++++++++++++++++++++++++++++++++++++++
>  tools/perf/bench/futex.h       |  23 ++
>  tools/perf/builtin-bench.c     |  10 +
>  tools/perf/check-headers.sh    |   4 +

You forgot to document it in tools/perf/Documentation/perf-bench.txt

Also I would suggest you submit this first without supporting TP
futexes, i.e. supporting just what is in the kernel already. That way I
could cherry-pick it, which would already be useful for benchmarking
the existing types of futexes.

Then, after the new futex type is accepted into the kernel, in a
separate patch you would add support for it in 'perf bench futex mutex'.

Thanks,

- Arnaldo

>  6 files changed, 883 insertions(+)
>  create mode 100644 tools/perf/bench/futex-locks.c
> 
> diff --git a/tools/perf/bench/Build b/tools/perf/bench/Build
> index 60bf119..7ce1cd7 100644
> --- a/tools/perf/bench/Build
> +++ b/tools/perf/bench/Build
> @@ -6,6 +6,7 @@ perf-y += futex-wake.o
>  perf-y += futex-wake-parallel.o
>  perf-y += futex-requeue.o
>  perf-y += futex-lock-pi.o
> +perf-y += futex-locks.o
>  
>  perf-$(CONFIG_X86_64) += mem-memcpy-x86-64-asm.o
>  perf-$(CONFIG_X86_64) += mem-memset-x86-64-asm.o
> diff --git a/tools/perf/bench/bench.h b/tools/perf/bench/bench.h
> index 579a592..b0632df 100644
> --- a/tools/perf/bench/bench.h
> +++ b/tools/perf/bench/bench.h
> @@ -36,6 +36,7 @@
>  int bench_futex_requeue(int argc, const char **argv, const char *prefix);
>  /* pi futexes */
>  int bench_futex_lock_pi(int argc, const char **argv, const char *prefix);
> +int bench_futex_mutex(int argc, const char **argv, const char *prefix);
>  
>  #define BENCH_FORMAT_DEFAULT_STR	"default"
>  #define BENCH_FORMAT_DEFAULT		0
> diff --git a/tools/perf/bench/futex-locks.c b/tools/perf/bench/futex-locks.c
> new file mode 100644
> index 0000000..ec1e627
> --- /dev/null
> +++ b/tools/perf/bench/futex-locks.c
> @@ -0,0 +1,844 @@
> +/*
> + * Copyright (C) 2016 Waiman Long <longman@redhat.com>
> + *
> + * This microbenchmark simulates how the use of different futex types can
> + * affect the actual performance of userspace locking primitives like mutex.
> + *
> + * The raw throughput of the futex lock and unlock calls is not a good
> + * indication of the actual throughput of the mutex code as it may not really
> + * need to call into the kernel. Therefore, 3 sets of simple mutex lock and
> + * unlock functions are written to implement a mutex lock using the
> + * wait-wake, PI and TP futexes respectively. These functions serve as the
> + * basis for measuring the locking throughput.
> + */
> +
> +#include <pthread.h>
> +
> +#include <signal.h>
> +#include <string.h>
> +#include "../util/stat.h"
> +#include "../perf-sys.h"
> +#include <subcmd/parse-options.h>
> +#include <linux/compiler.h>
> +#include <linux/kernel.h>
> +#include <errno.h>
> +#include "bench.h"
> +#include "futex.h"
> +
> +#include <err.h>
> +#include <stdlib.h>
> +#include <sys/time.h>
> +
> +#define CACHELINE_SIZE		64
> +#define gettid()		syscall(SYS_gettid)
> +#define __cacheline_aligned	__attribute__((__aligned__(CACHELINE_SIZE)))
> +
> +typedef u32 futex_t;
> +typedef void (*lock_fn_t)(futex_t *futex, int tid);
> +typedef void (*unlock_fn_t)(futex_t *futex, int tid);
> +
> +/*
> + * Statistical count list
> + */
> +enum {
> +	STAT_OPS,	/* # of exclusive locking operations	*/
> +	STAT_LOCKS,	/* # of exclusive lock futex calls	*/
> +	STAT_UNLOCKS,	/* # of exclusive unlock futex calls	*/
> +	STAT_SLEEPS,	/* # of exclusive lock sleeps		*/
> +	STAT_EAGAINS,	/* # of EAGAIN errors			*/
> +	STAT_WAKEUPS,	/* # of wakeups (unlock return)		*/
> +	STAT_HANDOFFS,	/* # of lock handoff (TP only)		*/
> +	STAT_STEALS,	/* # of lock steals (TP only)		*/
> +	STAT_LOCKERRS,	/* # of exclusive lock errors		*/
> +	STAT_UNLKERRS,	/* # of exclusive unlock errors		*/
> +	STAT_NUM	/* Total # of statistical count		*/
> +};
> +
> +/*
> + * Syscall time list
> + */
> +enum {
> +	TIME_LOCK,	/* Total exclusive lock syscall time	*/
> +	TIME_UNLK,	/* Total exclusive unlock syscall time	*/
> +	TIME_NUM,
> +};
> +
> +struct worker {
> +	futex_t *futex;
> +	pthread_t thread;
> +
> +	/*
> +	 * Per-thread operation statistics
> +	 */
> +	u32 stats[STAT_NUM];
> +
> +	/*
> +	 * Lock/unlock times
> +	 */
> +	u64 times[TIME_NUM];
> +} __cacheline_aligned;
> +
> +/*
> + * Global cache-aligned futex
> + */
> +static futex_t __cacheline_aligned global_futex;
> +static futex_t *pfutex = &global_futex;
> +
> +static __thread futex_t thread_id;	/* Thread ID */
> +static __thread int counter;		/* Sleep counter */
> +
> +static struct worker *worker, *worker_alloc;
> +static unsigned int nsecs = 10;
> +static bool verbose, done, fshared, exit_now, timestat;
> +static unsigned int ncpus, nthreads;
> +static int flags;
> +static const char *ftype;
> +static int loadlat = 1;
> +static int locklat = 1;
> +static int wratio;
> +struct timeval start, end, runtime;
> +static unsigned int worker_start;
> +static unsigned int threads_starting;
> +static unsigned int threads_stopping;
> +static struct stats throughput_stats;
> +static lock_fn_t mutex_lock_fn;
> +static unlock_fn_t mutex_unlock_fn;
> +
> +/*
> + * Lock/unlock syscall time accounting helpers
> + */
> +static inline void systime_add(int tid, int item, struct timespec *begin,
> +			       struct timespec *_end)
> +{
> +	worker[tid].times[item] += (_end->tv_sec  - begin->tv_sec)*1000000000 +
> +				    _end->tv_nsec - begin->tv_nsec;
> +}
> +
> +static inline double stat_percent(struct worker *w, int top, int bottom)
> +{
> +	return (double)w->stats[top] * 100 / w->stats[bottom];
> +}
> +
> +/*
> + * Inline functions to update the statistical counts
> + *
> + * Enabling statistics collection may sometimes impact the locking rates
> + * being measured. So the DISABLE_STAT macro can be defined to disable
> + * statistics collection for all except the core locking rate counts.
> + *
> + * #define DISABLE_STAT
> + */
> +#ifndef DISABLE_STAT
> +static inline void stat_add(int tid, int item, int num)
> +{
> +	worker[tid].stats[item] += num;
> +}
> +
> +static inline void stat_inc(int tid, int item)
> +{
> +	stat_add(tid, item, 1);
> +}
> +#else
> +static inline void stat_add(int tid __maybe_unused, int item __maybe_unused,
> +			    int num __maybe_unused)
> +{
> +}
> +
> +static inline void stat_inc(int tid __maybe_unused, int item __maybe_unused)
> +{
> +}
> +#endif
> +
> +/*
> + * The latency value within a lock critical section (load) and between locking
> + * operations is expressed in terms of the number of cpu_relax() calls
> + * that are issued.
> + */
> +static const struct option mutex_options[] = {
> +	OPT_INTEGER ('d', "locklat",	&locklat,  "Specify inter-locking latency (default = 1)"),
> +	OPT_STRING  ('f', "ftype",	&ftype,    "type", "Specify futex type: WW, PI, TP, all (default)"),
> +	OPT_INTEGER ('L', "loadlat",	&loadlat,  "Specify load latency (default = 1)"),
> +	OPT_UINTEGER('r', "runtime",	&nsecs,    "Specify runtime (in seconds, default = 10s)"),
> +	OPT_BOOLEAN ('S', "shared",	&fshared,  "Use shared futexes instead of private ones"),
> +	OPT_BOOLEAN ('T', "timestat",	&timestat, "Track lock/unlock syscall times"),
> +	OPT_UINTEGER('t', "threads",	&nthreads, "Specify number of threads, default = # of CPUs"),
> +	OPT_BOOLEAN ('v', "verbose",	&verbose,  "Verbose mode: display thread-level details"),
> +	OPT_INTEGER ('w', "wait-ratio", &wratio,   "Specify <n>/1024 of load is 1us sleep, default = 0"),
> +	OPT_END()
> +};
> +
> +static const char * const bench_futex_mutex_usage[] = {
> +	"perf bench futex mutex <options>",
> +	NULL
> +};
> +
> +/**
> + * futex_cmpxchg() - atomic compare and exchange
> + * @uaddr:	The address of the futex to be modified
> + * @old:	The expected value of the futex
> + * @new:	The new value to try and assign to the futex
> + *
> + * Implement cmpxchg using gcc atomic builtins.
> + *
> + * Return: the old futex value.
> + */
> +static inline futex_t futex_cmpxchg(futex_t *uaddr, futex_t old, futex_t new)
> +{
> +	return __sync_val_compare_and_swap(uaddr, old, new);
> +}
> +
> +/**
> + * futex_xchg() - atomic exchange
> + * @uaddr:	The address of the futex to be modified
> + * @new:	The new value to assign to the futex
> + *
> + * Implement xchg using gcc atomic builtins.
> + *
> + * Return: the old futex value.
> + */
> +static inline futex_t futex_xchg(futex_t *uaddr, futex_t new)
> +{
> +	return __sync_lock_test_and_set(uaddr, new);
> +}
> +
> +/**
> + * atomic_dec_return - atomically decrement & return the new value
> + * @uaddr:	The address of the futex to be decremented
> + * Return:	The new value
> + */
> +static inline int atomic_dec_return(futex_t *uaddr)
> +{
> +	return __sync_sub_and_fetch(uaddr, 1);
> +}
> +
> +/**
> + * atomic_inc_return - atomically increment & return the new value
> + * @uaddr:	The address of the futex to be incremented
> + * Return:	The new value
> + */
> +static inline int atomic_inc_return(futex_t *uaddr)
> +{
> +	return __sync_add_and_fetch(uaddr, 1);
> +}
> +
> +/**********************[ MUTEX lock/unlock functions ]*********************/
> +
> +/*
> + * Wait-wake futex lock/unlock functions (Glibc implementation)
> + * futex value: 0 - unlocked
> + *		1 - locked
> + *		2 - locked with waiters (contended)
> + */
> +static void ww_mutex_lock(futex_t *futex, int tid)
> +{
> +	struct timespec stime, etime;
> +	futex_t val = *futex;
> +	int ret;
> +
> +	if (!val) {
> +		val = futex_cmpxchg(futex, 0, 1);
> +		if (val == 0)
> +			return;
> +	}
> +
> +	for (;;) {
> +		if (val != 2) {
> +			/*
> +			 * Force value to 2 to indicate waiter
> +			 */
> +			val = futex_xchg(futex, 2);
> +			if (val == 0)
> +				return;
> +		}
> +		if (timestat) {
> +			clock_gettime(CLOCK_REALTIME, &stime);
> +			ret = futex_wait(futex, 2, NULL, flags);
> +			clock_gettime(CLOCK_REALTIME, &etime);
> +			systime_add(tid, TIME_LOCK, &stime, &etime);
> +		} else {
> +			ret = futex_wait(futex, 2, NULL, flags);
> +		}
> +
> +		stat_inc(tid, STAT_LOCKS);
> +		if (ret < 0) {
> +			if (errno == EAGAIN)
> +				stat_inc(tid, STAT_EAGAINS);
> +			else
> +				stat_inc(tid, STAT_LOCKERRS);
> +		}
> +
> +		val = *futex;
> +	}
> +}
> +
> +static void ww_mutex_unlock(futex_t *futex, int tid)
> +{
> +	struct timespec stime, etime;
> +	futex_t val;
> +	int ret;
> +
> +	val = futex_xchg(futex, 0);
> +
> +	if (val == 2) {
> +		stat_inc(tid, STAT_UNLOCKS);
> +		if (timestat) {
> +			clock_gettime(CLOCK_REALTIME, &stime);
> +			ret = futex_wake(futex, 1, flags);
> +			clock_gettime(CLOCK_REALTIME, &etime);
> +			systime_add(tid, TIME_UNLK, &stime, &etime);
> +		} else {
> +			ret = futex_wake(futex, 1, flags);
> +		}
> +		if (ret < 0)
> +			stat_inc(tid, STAT_UNLKERRS);
> +		else
> +			stat_add(tid, STAT_WAKEUPS, ret);
> +	}
> +}
> +
> +/*
> + * Alternate wait-wake futex lock/unlock functions with thread_id lock word
> + */
> +static void ww2_mutex_lock(futex_t *futex, int tid)
> +{
> +	struct timespec stime, etime;
> +	futex_t val = *futex;
> +	int ret;
> +
> +	if (!val) {
> +		val = futex_cmpxchg(futex, 0, thread_id);
> +		if (val == 0)
> +			return;
> +	}
> +
> +	for (;;) {
> +		/*
> +		 * Set the FUTEX_WAITERS bit, if not set yet.
> +		 */
> +		while (!(val & FUTEX_WAITERS)) {
> +			futex_t old;
> +
> +			if (!val) {
> +				val = futex_cmpxchg(futex, 0, thread_id);
> +				if (val == 0)
> +					return;
> +				continue;
> +			}
> +			old = futex_cmpxchg(futex, val, val | FUTEX_WAITERS);
> +			if (old == val) {
> +				val |= FUTEX_WAITERS;
> +				break;
> +			}
> +			val = old;
> +		}
> +		if (timestat) {
> +			clock_gettime(CLOCK_REALTIME, &stime);
> +			ret = futex_wait(futex, val, NULL, flags);
> +			clock_gettime(CLOCK_REALTIME, &etime);
> +			systime_add(tid, TIME_LOCK, &stime, &etime);
> +		} else {
> +			ret = futex_wait(futex, val, NULL, flags);
> +		}
> +		stat_inc(tid, STAT_LOCKS);
> +		if (ret < 0) {
> +			if (errno == EAGAIN)
> +				stat_inc(tid, STAT_EAGAINS);
> +			else
> +				stat_inc(tid, STAT_LOCKERRS);
> +		}
> +
> +		val = *futex;
> +	}
> +}
> +
> +static void ww2_mutex_unlock(futex_t *futex, int tid)
> +{
> +	struct timespec stime, etime;
> +	futex_t val;
> +	int ret;
> +
> +	val = futex_xchg(futex, 0);
> +
> +	if ((val & FUTEX_TID_MASK) != thread_id)
> +		stat_inc(tid, STAT_UNLKERRS);
> +
> +	if (val & FUTEX_WAITERS) {
> +		stat_inc(tid, STAT_UNLOCKS);
> +		if (timestat) {
> +			clock_gettime(CLOCK_REALTIME, &stime);
> +			ret = futex_wake(futex, 1, flags);
> +			clock_gettime(CLOCK_REALTIME, &etime);
> +			systime_add(tid, TIME_UNLK, &stime, &etime);
> +		} else {
> +			ret = futex_wake(futex, 1, flags);
> +		}
> +		if (ret < 0)
> +			stat_inc(tid, STAT_UNLKERRS);
> +		else
> +			stat_add(tid, STAT_WAKEUPS, ret);
> +	}
> +}
> +
> +/*
> + * PI futex lock/unlock functions
> + */
> +static void pi_mutex_lock(futex_t *futex, int tid)
> +{
> +	struct timespec stime, etime;
> +	futex_t val;
> +	int ret;
> +
> +	val = futex_cmpxchg(futex, 0, thread_id);
> +	if (val == 0)
> +		return;
> +
> +	/*
> +	 * Retry if an error happens
> +	 */
> +	for (;;) {
> +		if (timestat) {
> +			clock_gettime(CLOCK_REALTIME, &stime);
> +			ret = futex_lock_pi(futex, NULL, flags);
> +			clock_gettime(CLOCK_REALTIME, &etime);
> +			systime_add(tid, TIME_LOCK, &stime, &etime);
> +		} else {
> +			ret = futex_lock_pi(futex, NULL, flags);
> +		}
> +		stat_inc(tid, STAT_LOCKS);
> +		if (ret >= 0)
> +			break;
> +		stat_inc(tid, STAT_LOCKERRS);
> +	}
> +}
> +
> +static void pi_mutex_unlock(futex_t *futex, int tid)
> +{
> +	struct timespec stime, etime;
> +	futex_t val;
> +	int ret;
> +
> +	val = futex_cmpxchg(futex, thread_id, 0);
> +	if (val == thread_id)
> +		return;
> +
> +	if (timestat) {
> +		clock_gettime(CLOCK_REALTIME, &stime);
> +		ret = futex_unlock_pi(futex, flags);
> +		clock_gettime(CLOCK_REALTIME, &etime);
> +		systime_add(tid, TIME_UNLK, &stime, &etime);
> +	} else {
> +		ret = futex_unlock_pi(futex, flags);
> +	}
> +	if (ret < 0)
> +		stat_inc(tid, STAT_UNLKERRS);
> +	else
> +		stat_add(tid, STAT_WAKEUPS, ret);
> +	stat_inc(tid, STAT_UNLOCKS);
> +}
> +
> +/*
> + * TP futex lock/unlock functions
> + */
> +static void tp_mutex_lock(futex_t *futex, int tid)
> +{
> +	struct timespec stime, etime;
> +	futex_t val;
> +	int ret;
> +
> +	val = futex_cmpxchg(futex, 0, thread_id);
> +	if (val == 0)
> +		return;
> +
> +	/*
> +	 * Retry if an error happens
> +	 */
> +	for (;;) {
> +		if (timestat) {
> +			clock_gettime(CLOCK_REALTIME, &stime);
> +			ret = futex_lock(futex, NULL, flags);
> +			clock_gettime(CLOCK_REALTIME, &etime);
> +			systime_add(tid, TIME_LOCK, &stime, &etime);
> +		} else {
> +			ret = futex_lock(futex, NULL, flags);
> +		}
> +		stat_inc(tid, STAT_LOCKS);
> +		if (ret >= 0)
> +			break;
> +		stat_inc(tid, STAT_LOCKERRS);
> +	}
> +	/*
> +	 * Get # of sleeps & locking method
> +	 */
> +	stat_add(tid, STAT_SLEEPS, ret >> 16);
> +	ret &= 0xff;
> +	if (!ret)
> +		stat_inc(tid, STAT_STEALS);
> +	else if (ret == 2)
> +		stat_inc(tid, STAT_HANDOFFS);
> +}
> +
> +static void tp_mutex_unlock(futex_t *futex, int tid)
> +{
> +	struct timespec stime, etime;
> +	futex_t val;
> +	int ret;
> +
> +	val = futex_cmpxchg(futex, thread_id, 0);
> +	if (val == thread_id)
> +		return;
> +
> +	if (timestat) {
> +		clock_gettime(CLOCK_REALTIME, &stime);
> +		ret = futex_unlock(futex, flags);
> +		clock_gettime(CLOCK_REALTIME, &etime);
> +		systime_add(tid, TIME_UNLK, &stime, &etime);
> +	} else {
> +		ret = futex_unlock(futex, flags);
> +	}
> +	stat_inc(tid, STAT_UNLOCKS);
> +	if (ret < 0)
> +		stat_inc(tid, STAT_UNLKERRS);
> +	else
> +		stat_add(tid, STAT_WAKEUPS, ret);
> +}
> +
> +/**************************************************************************/
> +
> +/*
> + * Load function
> + */
> +static inline void load(int tid)
> +{
> +	int n = loadlat;
> +
> +	/*
> +	 * Optionally do a 1us sleep instead if wratio is defined and
> +	 * the counter is within bound.
> +	 */
> +	if (wratio && (((counter++ + tid) & 0x3ff) < wratio)) {
> +		usleep(1);
> +		return;
> +	}
> +
> +	while (n-- > 0)
> +		cpu_relax();
> +}
> +
> +static inline void csdelay(void)
> +{
> +	int n = locklat;
> +
> +	while (n-- > 0)
> +		cpu_relax();
> +}
> +
> +static void toggle_done(int sig __maybe_unused,
> +			siginfo_t *info __maybe_unused,
> +			void *uc __maybe_unused)
> +{
> +	/* inform all threads that we're done for the day */
> +	done = true;
> +	gettimeofday(&end, NULL);
> +	timersub(&end, &start, &runtime);
> +	if (sig)
> +		exit_now = true;
> +}
> +
> +static void *mutex_workerfn(void *arg)
> +{
> +	long tid = (long)arg;
> +	struct worker *w = &worker[tid];
> +	lock_fn_t lock_fn = mutex_lock_fn;
> +	unlock_fn_t unlock_fn = mutex_unlock_fn;
> +
> +	thread_id = gettid();
> +	counter = 0;
> +
> +	atomic_dec_return(&threads_starting);
> +
> +	/*
> +	 * Busy wait until asked to start
> +	 */
> +	while (!worker_start)
> +		cpu_relax();
> +
> +	do {
> +		lock_fn(w->futex, tid);
> +		load(tid);
> +		unlock_fn(w->futex, tid);
> +		w->stats[STAT_OPS]++;	/* One more locking operation */
> +		csdelay();
> +	}  while (!done);
> +
> +	if (verbose)
> +		printf("[thread %3ld (%d)] exited.\n", tid, thread_id);
> +	atomic_inc_return(&threads_stopping);
> +	return NULL;
> +}
> +
> +static void create_threads(struct worker *w, pthread_attr_t *thread_attr,
> +			   void *(*workerfn)(void *arg), long tid)
> +{
> +	cpu_set_t cpu;
> +
> +	/*
> +	 * Bind each thread to a CPU
> +	 */
> +	CPU_ZERO(&cpu);
> +	CPU_SET(tid % ncpus, &cpu);
> +	w->futex = pfutex;
> +
> +	if (pthread_attr_setaffinity_np(thread_attr, sizeof(cpu_set_t), &cpu))
> +		err(EXIT_FAILURE, "pthread_attr_setaffinity_np");
> +
> +	if (pthread_create(&w->thread, thread_attr, workerfn, (void *)tid))
> +		err(EXIT_FAILURE, "pthread_create");
> +}
> +
> +static int futex_mutex_type(const char **ptype)
> +{
> +	const char *type = *ptype;
> +
> +	if (!strcasecmp(type, "WW")) {
> +		*ptype = "WW";
> +		mutex_lock_fn = ww_mutex_lock;
> +		mutex_unlock_fn = ww_mutex_unlock;
> +	} else if (!strcasecmp(type, "WW2")) {
> +		*ptype = "WW2";
> +		mutex_lock_fn = ww2_mutex_lock;
> +		mutex_unlock_fn = ww2_mutex_unlock;
> +	} else if (!strcasecmp(type, "PI")) {
> +		*ptype = "PI";
> +		mutex_lock_fn = pi_mutex_lock;
> +		mutex_unlock_fn = pi_mutex_unlock;
> +	} else if (!strcasecmp(type, "TP")) {
> +		*ptype = "TP";
> +		mutex_lock_fn = tp_mutex_lock;
> +		mutex_unlock_fn = tp_mutex_unlock;
> +
> +		/*
> +		 * Check if TP futex is supported.
> +		 */
> +		futex_unlock(&global_futex, 0);
> +		if (errno == ENOSYS) {
> +			fprintf(stderr,
> +			    "\nTP futexes are not supported by the kernel!\n");
> +			return -1;
> +		}
> +	} else {
> +		return -1;
> +	}
> +	return 0;
> +}
> +
> +static void futex_test_driver(const char *futex_type,
> +			      int (*proc_type)(const char **ptype),
> +			      void *(*workerfn)(void *arg))
> +{
> +	u64 us;
> +	int i, j;
> +	struct worker total;
> +	double avg, stddev;
> +	pthread_attr_t thread_attr;
> +
> +	/*
> +	 * There is an extra blank line before the error counts to highlight
> +	 * them.
> +	 */
> +	const char *desc[STAT_NUM] = {
> +		[STAT_OPS]	 = "Total exclusive locking ops",
> +		[STAT_LOCKS]	 = "Exclusive lock futex calls",
> +		[STAT_UNLOCKS]	 = "Exclusive unlock futex calls",
> +		[STAT_SLEEPS]	 = "Exclusive lock sleeps",
> +		[STAT_WAKEUPS]	 = "Process wakeups",
> +		[STAT_EAGAINS]	 = "EAGAIN lock errors",
> +		[STAT_HANDOFFS]  = "Lock handoffs",
> +		[STAT_STEALS]	 = "Lock steals",
> +		[STAT_LOCKERRS]  = "\nExclusive lock errors",
> +		[STAT_UNLKERRS]  = "\nExclusive unlock errors",
> +	};
> +
> +	if (exit_now)
> +		return;
> +
> +	if (proc_type(&futex_type) < 0) {
> +		fprintf(stderr, "Unknown futex type '%s'!\n", futex_type);
> +		exit(1);
> +	}
> +
> +	printf("\n=====================================\n");
> +	printf("[PID %d]: %d threads doing %s futex lockings (load=%d) for %d secs.\n\n",
> +	       getpid(), nthreads, futex_type, loadlat, nsecs);
> +
> +	init_stats(&throughput_stats);
> +
> +	*pfutex = 0;
> +	done = false;
> +	threads_starting = nthreads;
> +	pthread_attr_init(&thread_attr);
> +
> +	for (i = 0; i < (int)nthreads; i++)
> +		create_threads(&worker[i], &thread_attr, workerfn, i);
> +
> +	while (threads_starting)
> +		usleep(1);
> +
> +	gettimeofday(&start, NULL);
> +
> +	/*
> +	 * Start the test
> +	 *
> +	 * Unlike the other futex benchmarks, this one uses busy waiting
> +	 * instead of pthread APIs to make sure that all the threads (except
> +	 * the one that shares a CPU with the parent) will start more or less
> +	 * simultaneously.
> +	 */
> +	atomic_inc_return(&worker_start);
> +	sleep(nsecs);
> +	toggle_done(0, NULL, NULL);
> +
> +	/*
> +	 * In verbose mode, we check if all the threads have been stopped
> +	 * after 1ms and report the status if some are still running.
> +	 */
> +	if (verbose) {
> +		usleep(1000);
> +		if (threads_stopping != nthreads) {
> +			printf("%d threads still running 1ms after timeout"
> +				" - futex = 0x%x\n",
> +				nthreads - threads_stopping, *pfutex);
> +			/*
> +			 * If the threads are still running after 10s,
> +			 * go directly to statistics printing and exit.
> +			 */
> +			for (i = 10; i > 0; i--) {
> +				sleep(1);
> +				if (threads_stopping == nthreads)
> +					break;
> +			}
> +			if (!i) {
> +				printf("*** Threads waiting ABORTED!! ***\n\n");
> +				goto print_stat;
> +			}
> +		}
> +	}
> +
> +	for (i = 0; i < (int)nthreads; i++) {
> +		int ret = pthread_join(worker[i].thread, NULL);
> +
> +		if (ret)
> +			err(EXIT_FAILURE, "pthread_join");
> +	}
> +
> +print_stat:
> +	pthread_attr_destroy(&thread_attr);
> +
> +	us = runtime.tv_sec * 1000000 + runtime.tv_usec;
> +	memset(&total, 0, sizeof(total));
> +	for (i = 0; i < (int)nthreads; i++) {
> +		/*
> +		 * Get a rounded estimate of the # of locking ops/sec.
> +		 */
> +		u64 tp = (u64)worker[i].stats[STAT_OPS] * 1000000 / us;
> +
> +		for (j = 0; j < STAT_NUM; j++)
> +			total.stats[j] += worker[i].stats[j];
> +
> +		update_stats(&throughput_stats, tp);
> +		if (verbose)
> +			printf("[thread %3d] futex: %p [ %'ld ops/sec ]\n",
> +			       i, worker[i].futex, (long)tp);
> +	}
> +
> +	avg    = avg_stats(&throughput_stats);
> +	stddev = stddev_stats(&throughput_stats);
> +
> +	printf("Locking statistics:\n");
> +	printf("%-28s = %'.2fs\n", "Test run time", (double)us/1000000);
> +	for (i = 0; i < STAT_NUM; i++)
> +		if (total.stats[i])
> +			printf("%-28s = %'d\n", desc[i], total.stats[i]);
> +
> +	if (timestat && total.times[TIME_LOCK]) {
> +		printf("\nSyscall times:\n");
> +		if (total.stats[STAT_LOCKS])
> +			printf("Avg exclusive lock syscall   = %'ldns\n",
> +			    total.times[TIME_LOCK]/total.stats[STAT_LOCKS]);
> +		if (total.stats[STAT_UNLOCKS])
> +			printf("Avg exclusive unlock syscall = %'ldns\n",
> +			    total.times[TIME_UNLK]/total.stats[STAT_UNLOCKS]);
> +	}
> +
> +	printf("\nPercentages:\n");
> +	if (total.stats[STAT_LOCKS])
> +		printf("Exclusive lock futex calls   = %.1f%%\n",
> +			stat_percent(&total, STAT_LOCKS, STAT_OPS));
> +	if (total.stats[STAT_UNLOCKS])
> +		printf("Exclusive unlock futex calls = %.1f%%\n",
> +			stat_percent(&total, STAT_UNLOCKS, STAT_OPS));
> +	if (total.stats[STAT_EAGAINS])
> +		printf("EAGAIN lock errors           = %.1f%%\n",
> +			stat_percent(&total, STAT_EAGAINS, STAT_LOCKS));
> +	if (total.stats[STAT_WAKEUPS])
> +		printf("Process wakeups              = %.1f%%\n",
> +			stat_percent(&total, STAT_WAKEUPS, STAT_UNLOCKS));
> +
> +	printf("\nPer-thread Locking Rates:\n");
> +	printf("Avg = %'d ops/sec (+- %.2f%%)\n", (int)(avg + 0.5),
> +		rel_stddev_stats(stddev, avg));
> +	printf("Min = %'d ops/sec\n", (int)throughput_stats.min);
> +	printf("Max = %'d ops/sec\n", (int)throughput_stats.max);
> +
> +	if (*pfutex != 0)
> +		printf("\nResidual futex value = 0x%x\n", *pfutex);
> +
> +	/* Clear the workers area */
> +	memset(worker, 0, sizeof(*worker) * nthreads);
> +}
> +
> +int bench_futex_mutex(int argc, const char **argv,
> +		      const char *prefix __maybe_unused)
> +{
> +	struct sigaction act;
> +
> +	argc = parse_options(argc, argv, mutex_options,
> +			     bench_futex_mutex_usage, 0);
> +	if (argc)
> +		goto err;
> +
> +	ncpus = sysconf(_SC_NPROCESSORS_ONLN);
> +
> +	sigfillset(&act.sa_mask);
> +	act.sa_sigaction = toggle_done;
> +	sigaction(SIGINT, &act, NULL);
> +
> +	if (!nthreads)
> +		nthreads = ncpus;
> +
> +	/*
> +	 * Since the allocated memory buffer may not be properly cacheline
> +	 * aligned, we need to allocate one more than needed and manually
> +	 * adjust the array boundary to be cacheline aligned.
> +	 */
> +	worker_alloc = calloc(nthreads + 1, sizeof(*worker));
> +	if (!worker_alloc)
> +		err(EXIT_FAILURE, "calloc");
> +	worker = (void *)((unsigned long)&worker_alloc[1] &
> +					~(CACHELINE_SIZE - 1));
> +
> +	if (!fshared)
> +		flags = FUTEX_PRIVATE_FLAG;
> +
> +	if (!ftype || !strcmp(ftype, "all")) {
> +		futex_test_driver("WW", futex_mutex_type, mutex_workerfn);
> +		futex_test_driver("PI", futex_mutex_type, mutex_workerfn);
> +		futex_test_driver("TP", futex_mutex_type, mutex_workerfn);
> +	} else {
> +		futex_test_driver(ftype, futex_mutex_type, mutex_workerfn);
> +	}
> +	free(worker_alloc);
> +	return 0;
> +err:
> +	usage_with_options(bench_futex_mutex_usage, mutex_options);
> +	exit(EXIT_FAILURE);
> +}
> diff --git a/tools/perf/bench/futex.h b/tools/perf/bench/futex.h
> index ba7c735..be1eb7a 100644
> --- a/tools/perf/bench/futex.h
> +++ b/tools/perf/bench/futex.h
> @@ -87,6 +87,29 @@
>  		 val, opflags);
>  }
>  
> +#ifndef FUTEX_LOCK
> +#define FUTEX_LOCK	0xff	/* Disable the use of TP futexes */
> +#define FUTEX_UNLOCK	0xff
> +#endif /* FUTEX_LOCK */
> +
> +/**
> + * futex_lock() - lock the TP futex
> + */
> +static inline int
> +futex_lock(u_int32_t *uaddr, struct timespec *timeout, int opflags)
> +{
> +	return futex(uaddr, FUTEX_LOCK, 0, timeout, NULL, 0, opflags);
> +}
> +
> +/**
> + * futex_unlock() - unlock the TP futex
> + */
> +static inline int
> +futex_unlock(u_int32_t *uaddr, int opflags)
> +{
> +	return futex(uaddr, FUTEX_UNLOCK, 0, NULL, NULL, 0, opflags);
> +}
> +
>  #ifndef HAVE_PTHREAD_ATTR_SETAFFINITY_NP
>  #include <pthread.h>
>  static inline int pthread_attr_setaffinity_np(pthread_attr_t *attr,
> diff --git a/tools/perf/builtin-bench.c b/tools/perf/builtin-bench.c
> index a1cddc6..bf4418d 100644
> --- a/tools/perf/builtin-bench.c
> +++ b/tools/perf/builtin-bench.c
> @@ -20,6 +20,7 @@
>  #include "builtin.h"
>  #include "bench/bench.h"
>  
> +#include <locale.h>
>  #include <stdio.h>
>  #include <stdlib.h>
>  #include <string.h>
> @@ -62,6 +63,7 @@ struct bench {
>  	{ "requeue",	"Benchmark for futex requeue calls",            bench_futex_requeue	},
>  	/* pi-futexes */
>  	{ "lock-pi",	"Benchmark for futex lock_pi calls",            bench_futex_lock_pi	},
> +	{ "mutex",	"Benchmark for mutex locks using futexes",	bench_futex_mutex	},
>  	{ "all",	"Run all futex benchmarks",			NULL			},
>  	{ NULL,		NULL,						NULL			}
>  };
> @@ -215,6 +217,7 @@ int cmd_bench(int argc, const char **argv, const char *prefix __maybe_unused)
>  {
>  	struct collection *coll;
>  	int ret = 0;
> +	char *locale;
>  
>  	if (argc < 2) {
>  		/* No collection specified. */
> @@ -222,6 +225,13 @@ int cmd_bench(int argc, const char **argv, const char *prefix __maybe_unused)
>  		goto end;
>  	}
>  
> +	/*
> +	 * Enable better number formatting.
> +	 */
> +	locale = setlocale(LC_NUMERIC, "");
> +	if (locale && !strcmp(locale, "C"))
> +		setlocale(LC_NUMERIC, "en_US");
> +
>  	argc = parse_options(argc, argv, bench_options, bench_usage,
>  			     PARSE_OPT_STOP_AT_NON_OPTION);
>  
> diff --git a/tools/perf/check-headers.sh b/tools/perf/check-headers.sh
> index c747bfd..6ec5f2d 100755
> --- a/tools/perf/check-headers.sh
> +++ b/tools/perf/check-headers.sh
> @@ -57,3 +57,7 @@ check arch/x86/lib/memcpy_64.S        -B -I "^EXPORT_SYMBOL" -I "^#include <asm/
>  check arch/x86/lib/memset_64.S        -B -I "^EXPORT_SYMBOL" -I "^#include <asm/export.h>"
>  check include/uapi/asm-generic/mman.h -B -I "^#include <\(uapi/\)*asm-generic/mman-common.h>"
>  check include/uapi/linux/mman.h       -B -I "^#include <\(uapi/\)*asm/mman.h>"
> +
> +# need to link to the latest futex.h header
> +[[ -f ../../include/uapi/linux/futex.h ]] &&
> +	ln -sf  ../../../../../include/uapi/linux/futex.h util/include/linux
> -- 
> 1.8.3.1

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 13/20] perf bench: New microbenchmark for userspace mutex performance
  2017-01-02 17:16   ` Arnaldo Carvalho de Melo
@ 2017-01-02 18:17     ` Waiman Long
  0 siblings, 0 replies; 28+ messages in thread
From: Waiman Long @ 2017-01-02 18:17 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jonathan Corbet,
	linux-kernel, linux-doc, Davidlohr Bueso, Mike Galbraith,
	Scott J Norton

On 01/02/2017 12:16 PM, Arnaldo Carvalho de Melo wrote:
> On Thu, Dec 29, 2016 at 11:13:39AM -0500, Waiman Long wrote:
>> This microbenchmark simulates how the use of different futex types
>> can affect the actual performance of userspace mutex locks. The
>> usage is:
>>
>>         perf bench futex mutex <options>
> Showing the tool output is preferred, trim it if needed or state why
> that is not possible.

I will do so when I update the patchset.
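
Until then, the general shape of the summary the current code prints is
roughly the following (layout reconstructed from the printf calls in the
patch; all numbers are elided as placeholders):

```text
=====================================
[PID <pid>]: <n> threads doing TP futex lockings (load=1) for 10 secs.

Locking statistics:
Test run time                = 10.00s
Total exclusive locking ops  = <count>
Exclusive lock futex calls   = <count>
Exclusive unlock futex calls = <count>

Per-thread Locking Rates:
Avg = <ops> ops/sec (+- <pct>%)
Min = <ops> ops/sec
Max = <ops> ops/sec
```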

>  
>> Three sets of simple mutex lock and unlock functions are implemented
>> using the wait-wake, PI and TP futexes respectively. This
>> microbenchmark then runs the locking rate measurement tests using
>> either one of those mutexes or all of them consecutively.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>  tools/perf/bench/Build         |   1 +
>>  tools/perf/bench/bench.h       |   1 +
>>  tools/perf/bench/futex-locks.c | 844 +++++++++++++++++++++++++++++++++++++++++
>>  tools/perf/bench/futex.h       |  23 ++
>>  tools/perf/builtin-bench.c     |  10 +
>>  tools/perf/check-headers.sh    |   4 +
> You forgot to document it in tools/perf/Documentation/perf-bench.txt

Yes, I forgot about that. I will update the document in the next v5
patchset.

> Also I would suggest you submit this first without supportint TP
> futexes, i.e. supporting just what is in the kernel already, this way I
> could cherry-pick this, which would be useful already for benchmarking
> the existing types of futexes.
>
> Then, after the new futex type is accepted into the kernel, in a
> separate patch you would add support for it in 'perf bench futex mutex'.

I can separate out the TP portion of the changes and put it later in
the patchset.

Thanks for the comments.

Cheers,
Longman


end of thread, other threads:[~2017-01-02 18:17 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-29 16:13 [PATCH v4 00/20] futex: Introducing throughput-optimized (TP) futexes Waiman Long
2016-12-29 16:13 ` [PATCH v4 01/20] futex: Consolidate duplicated timer setup code Waiman Long
2016-12-29 16:13 ` [PATCH v4 02/20] futex: Rename futex_pi_state to futex_state Waiman Long
2016-12-29 16:13 ` [PATCH v4 03/20] futex: Add helpers to get & cmpxchg futex value without lock Waiman Long
2016-12-29 16:13 ` [PATCH v4 04/20] futex: Consolidate pure pi_state_list add & delete codes to helpers Waiman Long
2016-12-29 16:13 ` [PATCH v4 05/20] futex: Add a new futex type field into futex_state Waiman Long
2016-12-29 16:13 ` [PATCH v4 06/20] futex: Allow direct attachment of futex_state objects to hash bucket Waiman Long
2016-12-29 16:13 ` [PATCH v4 07/20] futex: Introduce throughput-optimized (TP) futexes Waiman Long
2016-12-29 21:17   ` kbuild test robot
2016-12-29 16:13 ` [PATCH v4 08/20] TP-futex: Enable robust handling Waiman Long
2016-12-29 16:13 ` [PATCH v4 09/20] TP-futex: Implement lock handoff to prevent lock starvation Waiman Long
2016-12-29 16:13 ` [PATCH v4 10/20] TP-futex: Return status code on FUTEX_LOCK calls Waiman Long
2016-12-29 16:13 ` [PATCH v4 11/20] TP-futex: Add timeout support Waiman Long
2016-12-29 16:13 ` [PATCH v4 12/20] TP-futex, doc: Add TP futexes documentation Waiman Long
2016-12-29 16:13 ` [PATCH v4 13/20] perf bench: New microbenchmark for userspace mutex performance Waiman Long
2017-01-02 17:16   ` Arnaldo Carvalho de Melo
2017-01-02 18:17     ` Waiman Long
2016-12-29 16:13 ` [PATCH v4 14/20] TP-futex: Support userspace reader/writer locks Waiman Long
2016-12-29 18:34   ` kbuild test robot
2016-12-29 16:13 ` [PATCH v4 15/20] TP-futex: Enable kernel reader lock stealing Waiman Long
2016-12-29 16:13 ` [PATCH v4 16/20] TP-futex: Group readers together in wait queue Waiman Long
2016-12-29 19:53   ` kbuild test robot
2016-12-29 20:16   ` kbuild test robot
2016-12-29 16:13 ` [PATCH v4 17/20] TP-futex, doc: Update TP futexes document on shared locking Waiman Long
2016-12-29 16:13 ` [PATCH v4 18/20] perf bench: New microbenchmark for userspace rwlock performance Waiman Long
2016-12-29 16:13 ` [PATCH v4 19/20] sched, TP-futex: Make wake_up_q() return wakeup count Waiman Long
2016-12-29 16:13 ` [PATCH v4 20/20] futex: Dump internal futex state via debugfs Waiman Long
2016-12-29 18:55   ` kbuild test robot
