* [RFC patch 0/7] futex: Add support for attached futexes
@ 2016-04-02 11:09 Thomas Gleixner
  2016-04-02 11:09 ` [RFC patch 1/7] futex: Provide helpers for hash bucket add/remove Thomas Gleixner
                   ` (6 more replies)
  0 siblings, 7 replies; 26+ messages in thread
From: Thomas Gleixner @ 2016-04-02 11:09 UTC (permalink / raw)
  To: LKML
  Cc: Sebastian Andrzej Siewior, Darren Hart, Peter Zijlstra,
	Ingo Molnar, Michael Kerrisk, Davidlohr Bueso, Chris Mason,
	Carlos O'Donell, Torvald Riegel, Eric Dumazet

The standard futex mechanism in the Linux kernel uses a global hash to store
transient state. Collisions on that hash can lead to performance degradation
and on real-time enabled kernels to unbounded priority inversions.

To guarantee futexes without collisions on the global kernel hash, we provide
a mechanism to attach to a futex. This creates futex private state which
avoids hash collisions and on NUMA systems also cross node memory accesses.

The full explanation of the mechanism is in the changelog of 
patch 4/7 'futex: Add support for attached futexes'.

The last two patches in the series contain a first update to the futex man
page and initial glibc support.

For your convenience the kernel part is available at:

git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git WIP.futex

Thanks,

	tglx


* [RFC patch 1/7] futex: Provide helpers for hash bucket add/remove
  2016-04-02 11:09 [RFC patch 0/7] futex: Add support for attached futexes Thomas Gleixner
@ 2016-04-02 11:09 ` Thomas Gleixner
  2016-04-02 11:09 ` [RFC patch 2/7] futex: Add some more function commentary Thomas Gleixner
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 26+ messages in thread
From: Thomas Gleixner @ 2016-04-02 11:09 UTC (permalink / raw)
  To: LKML
  Cc: Sebastian Andrzej Siewior, Darren Hart, Peter Zijlstra,
	Ingo Molnar, Michael Kerrisk, Davidlohr Bueso, Chris Mason,
	Carlos O'Donell, Torvald Riegel, Eric Dumazet

[-- Attachment #1: futex--Provide-helpers-for-hash-bucket-add-remove.patch --]
[-- Type: text/plain, Size: 2528 bytes --]

Provide helpers for adding/removing a futex_q to/from a hash bucket so we
don't end up with more open-coded places when adding support for attached
futexes.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/futex.c |   39 +++++++++++++++++++++++++++++++++------
 1 file changed, 33 insertions(+), 6 deletions(-)

Index: b/kernel/futex.c
===================================================================
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -373,6 +373,35 @@ static inline int hb_waiters_pending(str
 #endif
 }
 
+/**
+ * hb_insert_q - Insert futex_q into a hash bucket
+ * @q:		Pointer to the futex_q object
+ * @hb:		Pointer to the hash bucket
+ * @prio:	Priority ordering for insertion
+ *
+ * Inserts @q into the plist of hash bucket @hb. Must be called with @hb->lock
+ * held.
+ */
+static inline void
+hb_insert_q(struct futex_q *q, struct futex_hash_bucket *hb, int prio)
+{
+	plist_node_init(&q->list, prio);
+	plist_add(&q->list, &hb->chain);
+}
+
+/**
+ * hb_remove_q - Remove futex_q from a hash bucket
+ * @q:		Pointer to the futex_q object
+ * @hb:		Pointer to the hash bucket
+ *
+ * Removes @q from the plist of hash bucket @hb. Must be called with @hb->lock
+ * held.
+ */
+static inline void hb_remove_q(struct futex_q *q, struct futex_hash_bucket *hb)
+{
+	plist_del(&q->list, &hb->chain);
+}
+
 /*
  * We hash on the keys returned from get_futex_key (see below).
  */
@@ -1221,7 +1250,7 @@ static void __unqueue_futex(struct futex
 		return;
 
 	hb = container_of(q->lock_ptr, struct futex_hash_bucket, lock);
-	plist_del(&q->list, &hb->chain);
+	hb_remove_q(q, hb);
 	hb_waiters_dec(hb);
 }
 
@@ -1523,7 +1552,7 @@ void requeue_futex(struct futex_q *q, st
 	 * requeue.
 	 */
 	if (likely(&hb1->chain != &hb2->chain)) {
-		plist_del(&q->list, &hb1->chain);
+		hb_remove_q(q, hb1);
 		hb_waiters_dec(hb1);
 		plist_add(&q->list, &hb2->chain);
 		hb_waiters_inc(hb2);
@@ -1985,9 +2014,7 @@ static inline void queue_me(struct futex
 	 * the others are woken last, in FIFO order.
 	 */
 	prio = min(current->normal_prio, MAX_RT_PRIO);
-
-	plist_node_init(&q->list, prio);
-	plist_add(&q->list, &hb->chain);
+	hb_insert_q(q, hb, prio);
 	q->task = current;
 	spin_unlock(&hb->lock);
 }
@@ -2697,7 +2724,7 @@ int handle_early_requeue_pi_wakeup(struc
 		 * We were woken prior to requeue by a timeout or a signal.
 		 * Unqueue the futex_q and determine which it was.
 		 */
-		plist_del(&q->list, &hb->chain);
+		hb_remove_q(q, hb);
 		hb_waiters_dec(hb);
 
 		/* Handle spurious wakeups gracefully */


* [RFC patch 2/7] futex: Add some more function commentary
  2016-04-02 11:09 [RFC patch 0/7] futex: Add support for attached futexes Thomas Gleixner
  2016-04-02 11:09 ` [RFC patch 1/7] futex: Provide helpers for hash bucket add/remove Thomas Gleixner
@ 2016-04-02 11:09 ` Thomas Gleixner
  2016-04-02 11:09 ` [RFC patch 3/7] futex: Make key init a helper function Thomas Gleixner
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 26+ messages in thread
From: Thomas Gleixner @ 2016-04-02 11:09 UTC (permalink / raw)
  To: LKML
  Cc: Sebastian Andrzej Siewior, Darren Hart, Peter Zijlstra,
	Ingo Molnar, Michael Kerrisk, Davidlohr Bueso, Chris Mason,
	Carlos O'Donell, Torvald Riegel, Eric Dumazet

[-- Attachment #1: futex--Add-some-more-function-commentry.patch --]
[-- Type: text/plain, Size: 1245 bytes --]

Add some more comments and reformat existing ones to kernel doc style.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/futex.c |   15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

Index: b/kernel/futex.c
===================================================================
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -402,8 +402,12 @@ static inline void hb_remove_q(struct fu
 	plist_del(&q->list, &hb->chain);
 }
 
-/*
- * We hash on the keys returned from get_futex_key (see below).
+/**
+ * hash_futex - Return the hash bucket in the global hash
+ * @key:	Pointer to the futex key for which the hash is calculated
+ *
+ * We hash on the keys returned from get_futex_key (see below) and return the
+ * corresponding hash bucket in the global hash.
  */
 static struct futex_hash_bucket *hash_futex(union futex_key *key)
 {
@@ -413,7 +417,12 @@ static struct futex_hash_bucket *hash_fu
 	return &futex_queues[hash & (futex_hashsize - 1)];
 }
 
-/*
+
+/**
+ * match_futex - Check whether two futex keys are equal
+ * @key1:	Pointer to key1
+ * @key2:	Pointer to key2
+ *
  * Return 1 if two futex_keys are equal, 0 otherwise.
  */
 static inline int match_futex(union futex_key *key1, union futex_key *key2)


* [RFC patch 3/7] futex: Make key init a helper function
  2016-04-02 11:09 [RFC patch 0/7] futex: Add support for attached futexes Thomas Gleixner
  2016-04-02 11:09 ` [RFC patch 1/7] futex: Provide helpers for hash bucket add/remove Thomas Gleixner
  2016-04-02 11:09 ` [RFC patch 2/7] futex: Add some more function commentary Thomas Gleixner
@ 2016-04-02 11:09 ` Thomas Gleixner
  2016-04-02 11:09 ` [RFC patch 4/7] futex: Add support for attached futexes Thomas Gleixner
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 26+ messages in thread
From: Thomas Gleixner @ 2016-04-02 11:09 UTC (permalink / raw)
  To: LKML
  Cc: Sebastian Andrzej Siewior, Darren Hart, Peter Zijlstra,
	Ingo Molnar, Michael Kerrisk, Davidlohr Bueso, Chris Mason,
	Carlos O'Donell, Torvald Riegel, Eric Dumazet

[-- Attachment #1: futex--Make-key-init-a-helper-function.patch --]
[-- Type: text/plain, Size: 4876 bytes --]

Support for attached futexes requires storing information in the futex
key. Make the key init a helper function and let it return the futex uaddr.

At the call sites, check the returned uaddr for NULL so attached mode can
abort the operation when it detects that an attached operation is attempted on
a non-attached futex.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/futex.c |   67 +++++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 61 insertions(+), 6 deletions(-)

Index: b/kernel/futex.c
===================================================================
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -433,6 +433,21 @@ static inline int match_futex(union fute
 		&& key1->both.offset == key2->both.offset);
 }
 
+/**
+ * futex_key_init - Initialize a futex key
+ * @key:	Pointer to the key to initialize
+ * @uaddr:	User space address of the futex
+ * @flags:	Flags to check for futex mode. Not yet used
+ *
+ * Returns:	@uaddr
+ */
+static u32 __user *futex_key_init(union futex_key *key, u32 __user *uaddr,
+				  unsigned int flags)
+{
+	*key = FUTEX_KEY_INIT;
+	return uaddr;
+}
+
 /*
  * Take a reference to the resource addressed by a key.
  * Can be called while holding spinlocks.
@@ -1403,13 +1418,17 @@ futex_wake(u32 __user *uaddr, unsigned i
 {
 	struct futex_hash_bucket *hb;
 	struct futex_q *this, *next;
-	union futex_key key = FUTEX_KEY_INIT;
+	union futex_key key;
 	int ret;
 	WAKE_Q(wake_q);
 
 	if (!bitset)
 		return -EINVAL;
 
+	uaddr = futex_key_init(&key, uaddr, flags);
+	if (!uaddr)
+		return -EINVAL;
+
 	ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_READ);
 	if (unlikely(ret != 0))
 		goto out;
@@ -1455,12 +1474,20 @@ static int
 futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
 	      int nr_wake, int nr_wake2, int op)
 {
-	union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT;
 	struct futex_hash_bucket *hb1, *hb2;
 	struct futex_q *this, *next;
+	union futex_key key1, key2;
 	int ret, op_ret;
 	WAKE_Q(wake_q);
 
+	uaddr1 = futex_key_init(&key1, uaddr1, flags);
+	if (!uaddr1)
+		return -EINVAL;
+
+	uaddr2 = futex_key_init(&key2, uaddr2, flags);
+	if (!uaddr2)
+		return -EINVAL;
+
 retry:
 	ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1, VERIFY_READ);
 	if (unlikely(ret != 0))
@@ -1693,11 +1720,11 @@ static int futex_requeue(u32 __user *uad
 			 u32 __user *uaddr2, int nr_wake, int nr_requeue,
 			 u32 *cmpval, int requeue_pi)
 {
-	union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT;
 	int drop_count = 0, task_count = 0, ret;
 	struct futex_pi_state *pi_state = NULL;
 	struct futex_hash_bucket *hb1, *hb2;
 	struct futex_q *this, *next;
+	union futex_key key1, key2;
 	WAKE_Q(wake_q);
 
 	if (requeue_pi) {
@@ -1728,6 +1755,14 @@ static int futex_requeue(u32 __user *uad
 			return -EINVAL;
 	}
 
+	uaddr1 = futex_key_init(&key1, uaddr1, flags);
+	if (!uaddr1)
+		return -EINVAL;
+
+	uaddr2 = futex_key_init(&key2, uaddr2, flags);
+	if (!uaddr2)
+		return -EINVAL;
+
 retry:
 	ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1, VERIFY_READ);
 	if (unlikely(ret != 0))
@@ -2398,6 +2433,11 @@ static int futex_wait(u32 __user *uaddr,
 
 	if (!bitset)
 		return -EINVAL;
+
+	uaddr = futex_key_init(&q.key, uaddr, flags);
+	if (!uaddr)
+		return -EINVAL;
+
 	q.bitset = bitset;
 
 	if (abs_time) {
@@ -2498,6 +2538,10 @@ static int futex_lock_pi(u32 __user *uad
 	if (refill_pi_state_cache())
 		return -ENOMEM;
 
+	uaddr = futex_key_init(&q.key, uaddr, flags);
+	if (!uaddr)
+		return -EINVAL;
+
 	if (time) {
 		to = &timeout;
 		hrtimer_init_on_stack(&to->timer, CLOCK_REALTIME,
@@ -2617,11 +2661,14 @@ static int futex_lock_pi(u32 __user *uad
 static int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
 {
 	u32 uninitialized_var(curval), uval, vpid = task_pid_vnr(current);
-	union futex_key key = FUTEX_KEY_INIT;
 	struct futex_hash_bucket *hb;
 	struct futex_q *match;
+	union futex_key key;
 	int ret;
 
+	uaddr = futex_key_init(&key, uaddr, flags);
+	if (!uaddr)
+		return -EINVAL;
 retry:
 	if (get_user(uval, uaddr))
 		return -EFAULT;
@@ -2793,9 +2840,9 @@ static int futex_wait_requeue_pi(u32 __u
 	struct hrtimer_sleeper timeout, *to = NULL;
 	struct rt_mutex_waiter rt_waiter;
 	struct rt_mutex *pi_mutex = NULL;
-	struct futex_hash_bucket *hb;
-	union futex_key key2 = FUTEX_KEY_INIT;
 	struct futex_q q = futex_q_init;
+	struct futex_hash_bucket *hb;
+	union futex_key key2;
 	int res, ret;
 
 	if (uaddr == uaddr2)
@@ -2804,6 +2851,14 @@ static int futex_wait_requeue_pi(u32 __u
 	if (!bitset)
 		return -EINVAL;
 
+	uaddr = futex_key_init(&q.key, uaddr, flags);
+	if (!uaddr)
+		return -EINVAL;
+
+	uaddr2 = futex_key_init(&key2, uaddr2, flags);
+	if (!uaddr2)
+		return -EINVAL;
+
 	if (abs_time) {
 		to = &timeout;
 		hrtimer_init_on_stack(&to->timer, (flags & FLAGS_CLOCKRT) ?


* [RFC patch 4/7] futex: Add support for attached futexes
  2016-04-02 11:09 [RFC patch 0/7] futex: Add support for attached futexes Thomas Gleixner
                   ` (2 preceding siblings ...)
  2016-04-02 11:09 ` [RFC patch 3/7] futex: Make key init a helper function Thomas Gleixner
@ 2016-04-02 11:09 ` Thomas Gleixner
  2016-04-02 16:26   ` Peter Zijlstra
                     ` (4 more replies)
  2016-04-02 11:09 ` [RFC patch 5/7] perf/bench/futex-hash: Support " Thomas Gleixner
                   ` (2 subsequent siblings)
  6 siblings, 5 replies; 26+ messages in thread
From: Thomas Gleixner @ 2016-04-02 11:09 UTC (permalink / raw)
  To: LKML
  Cc: Sebastian Andrzej Siewior, Darren Hart, Peter Zijlstra,
	Ingo Molnar, Michael Kerrisk, Davidlohr Bueso, Chris Mason,
	Carlos O'Donell, Torvald Riegel, Eric Dumazet

[-- Attachment #1: futex--Add-support-for-attached-futexes.patch --]
[-- Type: text/plain, Size: 23292 bytes --]

The standard futex mechanism in the Linux kernel uses a global hash to store
transient state. Collisions on that hash can lead to performance degradation
and on real-time enabled kernels even to unbounded priority inversions.

To guarantee futexes without collisions on the global kernel hash, we provide
a mechanism to attach to a futex. This creates futex private state which
avoids hash collisions and on NUMA systems also cross node memory access.

To utilize this mechanism, each thread has to attach to the futex before
performing any other operations on that futex.

The inner workings are as follows:

Attach:

    sys_futex(FUTEX_ATTACH | FUTEX_ATTACHED, uaddr, ....);

    If this is the first attach to uaddr then a 'global state' object is
    created. This global state contains a futex hash bucket and a futex_q
    object which is enqueued into the global hash for reference so subsequent
    attachers can find it. Each attacher takes a reference count on the
    'global state' object and hashes 'uaddr' into a thread local hash. This
    thread local hash is lock free and dynamically expanded to avoid
    collisions. Each populated entry in the thread local hash stores 'uaddr'
    and a pointer to the 'global state' object.

Futex ops:

    sys_futex(FUTEX_XXX | FUTEX_ATTACHED, uaddr, ....);

    If the attached flag is set, then 'uaddr' is hashed and the thread local
    hash is checked for an entry containing 'uaddr'. If there is none, an
    error code is returned. If there is, the hash slot number is stored in the
    futex key which is used for further operations on the futex. When the hash
    bucket is looked up, attached futexes use the slot number to retrieve the
    pointer to the 'global state' object and use the embedded hash bucket for
    the operation. Non-attached futexes just use the global hash as before.

Detach:

    sys_futex(FUTEX_DETACH | FUTEX_ATTACHED, uaddr, ....);
   
    Detach removes the entry in the thread local hash and decrements the
    refcount on the 'global state' object. Once the refcount drops to zero the
    'global state' object is removed from the global hash and destroyed.

    Thread exit cleans up the thread local hash and the 'global state' objects
    as we do for other futex related storage already.

The thread local hash and the 'global state' object are allocated on the node
on which the attaching thread runs.

Attached mode works with all futex operations and with both private and shared
futexes. For operations which involve two futexes, i.e. FUTEX_REQUEUE_*, both
futexes have to be either attached or detached (as with FUTEX_PRIVATE).

Why not auto attaching?

    Auto attaching has the following problems:

     - Memory consumption
     - Lifetime issues
     - Performance issues due to the necessary allocations

    So, no. It must be opt-in and reserved for explicit isolation purposes.

A modified version of 'perf bench futex hash' shows the following results:

 # ./perf bench futex hash -s -r 60
 # Running 'futex/hash' benchmark:
 Run summary [PID 4456]: 144 threads, each operating on 1024 [private ] futexes for 60 secs.

 Averaged 1451441 operations/sec (+- 3.65%), total secs = 60

 Perf stat analysis for such a run:

 Performance counter stats for './perf bench futex hash -s -r 60':

     9,400,551,930      node-load-misses
        74,404,109      node-loads
    11,001,148,922      node-store-misses
       574,505,561      node-stores

      60.016993846 seconds time elapsed

 # Running 'futex/hash' benchmark:
 Run summary [PID 3675]: 144 threads, each operating on 1024 [private attached] futexes for 60 secs.

 Averaged 1709712 operations/sec (+- 4.67%), total secs = 60

That's a performance increase of 18%.

 Perf stat analysis for such a run:

 Performance counter stats for './perf bench futex hash -s -r 60 -a':

     2,609,871,190      node-load-misses
         9,671,312      node-loads
     2,346,187,670      node-store-misses
       144,453,796      node-stores

      60.020374008 seconds time elapsed

Now someone might argue that this is an unrealistic test case. Sure, the number
of futexes here exceeds the size of the global hash:

   144 * 1024  >  144 * 256

But in real-world applications with a far smaller number of futexes we have
observed hash collisions on real-time and non-real-time systems, which cause
performance or determinism issues. Surely we could increase the hash size, but
that just reduces the probability and does not eliminate the chance of hitting
a bad spot.

We wrote a - too ugly to share - collision scanner which runs on each node of
a NUMA system. It scans through malloc()ed and mmap()ed memory and invokes
thousands of consecutive futex_wait operations (failing ones, as in the hash
bench test) on each address while observing the consumed time. In case of a
collision the time jumps massively due to the atomic_inc/dec and spinlock
operations on the hash bucket and the resulting cache line bouncing.

Once it detects a collision it stays there for 60 seconds and then switches
to attached mode. This immediately brings back the performance of the involved
scanners. A collision between two threads, and therefore two futexes, on
different nodes results in a performance degradation of 30 or more percent in
this test. On real-time enabled systems even a one-time collision can have
fatal consequences due to possibly unbounded priority inversions.

An interesting observation is that the time to find a collision varies from a
few seconds (20 addresses scanned) to a couple of hours. That variation is due
to the different mm_struct addresses which are mixed into the global hash and
different malloc/mmap addresses which are handed out on each run.

A few smallish issues:

  - we must limit the number of allocated 'global state' objects. That must
    be a per-process limit, but I have no idea where to put it best and how
    we let the sysadmin tune it.

  - this needs massive review and exposure to fuzzers as I don't want to end
    up with another exploit disaster in that code.

[ tglx: Wrote changelog ]

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/futex.h      |   12 -
 include/linux/sched.h      |    2 
 include/uapi/linux/futex.h |    6 
 kernel/fork.c              |    7 
 kernel/futex.c             |  444 ++++++++++++++++++++++++++++++++++++++++++++-
 5 files changed, 458 insertions(+), 13 deletions(-)

--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -47,6 +47,8 @@ union futex_key {
 		unsigned long word;
 		void *ptr;
 		int offset;
+		unsigned int slot;
+		bool attached;
 	} both;
 };
 
@@ -55,17 +57,15 @@ union futex_key {
 #ifdef CONFIG_FUTEX
 extern void exit_robust_list(struct task_struct *curr);
 extern void exit_pi_state_list(struct task_struct *curr);
+extern void exit_futex_task_cache(struct task_struct *curr);
 #ifdef CONFIG_HAVE_FUTEX_CMPXCHG
 #define futex_cmpxchg_enabled 1
 #else
 extern int futex_cmpxchg_enabled;
 #endif
 #else
-static inline void exit_robust_list(struct task_struct *curr)
-{
-}
-static inline void exit_pi_state_list(struct task_struct *curr)
-{
-}
+static inline void exit_robust_list(struct task_struct *curr) { }
+static inline void exit_pi_state_list(struct task_struct *curr) { }
+static inline void exit_futex_task_cache(struct task_struct *curr) { }
 #endif
 #endif
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -128,6 +128,7 @@ struct sched_attr {
 };
 
 struct futex_pi_state;
+struct futex_cache;
 struct robust_list_head;
 struct bio_list;
 struct fs_struct;
@@ -1702,6 +1703,7 @@ struct task_struct {
 #endif
 	struct list_head pi_state_list;
 	struct futex_pi_state *pi_state_cache;
+	struct futex_cache *futex_cache;
 #endif
 #ifdef CONFIG_PERF_EVENTS
 	struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts];
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -20,10 +20,14 @@
 #define FUTEX_WAKE_BITSET	10
 #define FUTEX_WAIT_REQUEUE_PI	11
 #define FUTEX_CMP_REQUEUE_PI	12
+#define FUTEX_ATTACH		13
+#define FUTEX_DETACH		14
 
 #define FUTEX_PRIVATE_FLAG	128
 #define FUTEX_CLOCK_REALTIME	256
-#define FUTEX_CMD_MASK		~(FUTEX_PRIVATE_FLAG | FUTEX_CLOCK_REALTIME)
+#define FUTEX_ATTACHED		512
+#define FUTEX_CMD_MASK		~(FUTEX_PRIVATE_FLAG | FUTEX_CLOCK_REALTIME | \
+				  FUTEX_ATTACHED)
 
 #define FUTEX_WAIT_PRIVATE	(FUTEX_WAIT | FUTEX_PRIVATE_FLAG)
 #define FUTEX_WAKE_PRIVATE	(FUTEX_WAKE | FUTEX_PRIVATE_FLAG)
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -880,6 +880,12 @@ void mm_release(struct task_struct *tsk,
 #endif
 	if (unlikely(!list_empty(&tsk->pi_state_list)))
 		exit_pi_state_list(tsk);
+
+	/* Must be last as the other cleanups might deref it */
+	if (unlikely(tsk->futex_cache)) {
+		exit_futex_task_cache(tsk);
+		tsk->futex_cache = NULL;
+	}
 #endif
 
 	uprobe_free_utask(tsk);
@@ -1489,6 +1495,7 @@ static struct task_struct *copy_process(
 #endif
 	INIT_LIST_HEAD(&p->pi_state_list);
 	p->pi_state_cache = NULL;
+	p->futex_cache = NULL;
 #endif
 	/*
 	 * sigaltstack should be cleared when sharing the same VM
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -23,6 +23,9 @@
  *  Copyright (C) IBM Corporation, 2009
  *  Thanks to Thomas Gleixner for conceptual design and careful reviews.
  *
+ *  Attached futex support by Sebastian Siewior and Thomas Gleixner
+ *  Copyright (C) Linutronix GmbH, 2016
+ *
  *  Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
  *  enough at me, Linus for the original (flawed) idea, Matthew
  *  Kirkwood for proof-of-concept implementation.
@@ -182,6 +185,7 @@ int __read_mostly futex_cmpxchg_enabled;
 #define FLAGS_SHARED		0x01
 #define FLAGS_CLOCKRT		0x02
 #define FLAGS_HAS_TIMEOUT	0x04
+#define FLAGS_ATTACHED		0x08
 
 /*
  * Priority Inheritance state:
@@ -255,6 +259,30 @@ struct futex_hash_bucket {
 	struct plist_head chain;
 } ____cacheline_aligned_in_smp;
 
+struct futex_state {
+	struct futex_hash_bucket	hb;
+	struct futex_q			q;
+	struct futex_hash_bucket	*global_hb;
+	int				refcount;
+};
+
+/* Task cache sizes must be power of 2! */
+#define	TASK_CACHE_BASE_SIZE		16
+/* FIXME: Make this a tunable */
+#define TASK_CACHE_MAX_SIZE		4096
+
+struct futex_cache_slot {
+	u32 __user		*uaddr;
+	struct futex_state	*fs;
+};
+
+struct futex_cache {
+	unsigned int		cache_size;
+	unsigned int		hash_prime;
+	unsigned long		cache_map[BITS_TO_LONGS(TASK_CACHE_MAX_SIZE)];
+	struct futex_cache_slot	slots[];
+};
+
 /*
  * The base of the bucket array and its size are always used together
  * (after initialization only in hash_futex()), so ensure that they
@@ -403,13 +431,13 @@ static inline void hb_remove_q(struct fu
 }
 
 /**
- * hash_futex - Return the hash bucket in the global hash
+ * hash_global_futex - Return the hash bucket in the global hash
  * @key:	Pointer to the futex key for which the hash is calculated
  *
  * We hash on the keys returned from get_futex_key (see below) and return the
  * corresponding hash bucket in the global hash.
  */
-static struct futex_hash_bucket *hash_futex(union futex_key *key)
+static struct futex_hash_bucket *hash_global_futex(union futex_key *key)
 {
 	u32 hash = jhash2((u32*)&key->both.word,
 			  (sizeof(key->both.word)+sizeof(key->both.ptr))/4,
@@ -417,6 +445,38 @@ static struct futex_hash_bucket *hash_fu
 	return &futex_queues[hash & (futex_hashsize - 1)];
 }
 
+/**
+ * hash_local_futex - Return the hash bucket in the task local cache
+ * @uaddr:	The user space address of the futex
+ * @prime:	The prime number for the modulo operation
+ *
+ * That's a primitive hash function, but it turned out to be the most
+ * efficient one for the task local cache as we don't have anything to
+ * mix in like we have for the global hash.
+ */
+static inline unsigned int
+hash_local_futex(void __user *uaddr, unsigned int prime)
+{
+	return ((unsigned long) uaddr) % prime;
+}
+
+/**
+ * hash_futex - Get the hash bucket for a futex
+ *
+ * Returns either the local or the global hash bucket which fits the key.
+ *
+ * In case of an attached futex, we already verified that the cache and the
+ * slot exists, so we can unconditionally dereference it.
+ */
+static struct futex_hash_bucket *hash_futex(union futex_key *key)
+{
+	struct futex_cache *tc = current->futex_cache;
+
+	if (!key->both.attached)
+		return hash_global_futex(key);
+
+	return &tc->slots[key->both.slot].fs->hb;
+}
 
 /**
  * match_futex - Check whether two futex keys are equal
@@ -430,22 +490,107 @@ static inline int match_futex(union fute
 	return (key1 && key2
 		&& key1->both.word == key2->both.word
 		&& key1->both.ptr == key2->both.ptr
-		&& key1->both.offset == key2->both.offset);
+		&& key1->both.offset == key2->both.offset
+		&& key1->both.attached == key2->both.attached);
+}
+
+/**
+ * futex_detach_task - Detach task from global state
+ * @slot:	Slot number in the task local cache
+ *
+ * If the global state refcount drops to zero, the global state is destroyed.
+ */
+static void futex_detach_task(int slot)
+{
+	struct futex_cache *tc = current->futex_cache;
+	struct futex_state *fs = tc->slots[slot].fs;
+	struct futex_hash_bucket *hb = fs->global_hb;
+	struct futex_q *q = &fs->q;
+
+	/* Remove it from the task local cache */
+	__clear_bit(slot, tc->cache_map);
+	tc->slots[slot].uaddr = NULL;
+	tc->slots[slot].fs = NULL;
+
+	/*
+	 * Lock the global hash bucket. Decrement global state refcount. If 0
+	 * remove it from the global hash and free it.
+	 */
+	spin_lock(&hb->lock);
+	if (--fs->refcount == 0)
+		hb_remove_q(q, hb);
+	else
+		fs = NULL;
+	spin_unlock(&hb->lock);
+	kfree(fs);
+}
+
+/**
+ * futex_attach_task - Attach current to a global state
+ * @fs:		Pointer to global state
+ * @uaddr:	User space address of the futex
+ * @slot:	Hash slot to reference @fs in current
+ *
+ * Take a refcount on the global state and store the pointer to it in the
+ * given @slot of the current tasks futex cache along with @uaddr. Mark the
+ * slot as occupied.
+ *
+ * Must be called with fs->global_hb->lock held
+ */
+static void
+futex_attach_task(struct futex_state *fs, u32 __user *uaddr, int slot)
+{
+	struct futex_cache *tc = current->futex_cache;
+
+	fs->refcount++;
+	tc->slots[slot].fs = fs;
+	tc->slots[slot].uaddr = uaddr;
+	__set_bit(slot, tc->cache_map);
+}
+
+/**
+ * futex_queue_state - Queue a futex state object in the global hash
+ * @fs:		Pointer to the futex state object
+ * @hb:		Pointer to the hash bucket
+ *
+ * Must be called with hb->lock held
+ */
+static void
+futex_queue_state(struct futex_state *fs, struct futex_hash_bucket *hb)
+{
+	int prio = NICE_TO_PRIO(MIN_NICE);
+
+	fs->global_hb = hb;
+	fs->q.lock_ptr = &hb->lock;
+	hb_insert_q(&fs->q, hb, prio);
 }
 
 /**
  * futex_key_init - Initialize a futex key
  * @key:	Pointer to the key to initialize
  * @uaddr:	User space address of the futex
- * @flags:	Flags to check for futex mode. Not yet used
+ * @flags:	Flags to check for attached mode
  *
- * Returns:	@uaddr
+ * Returns:
+ *	@uaddr or NULL, if no match in the task local cache for attached mode
  */
 static u32 __user *futex_key_init(union futex_key *key, u32 __user *uaddr,
 				  unsigned int flags)
 {
+	struct futex_cache *tc = current->futex_cache;
+	int slot;
+
 	*key = FUTEX_KEY_INIT;
-	return uaddr;
+	if (!(flags & FLAGS_ATTACHED))
+		return uaddr;
+
+	if (!tc)
+		return NULL;
+
+	slot = hash_local_futex(uaddr, tc->hash_prime);
+	key->both.attached = true;
+	key->both.slot = slot;
+	return tc->slots[slot].uaddr == uaddr ? uaddr : NULL;
 }
 
 /*
@@ -3216,6 +3361,286 @@ void exit_robust_list(struct task_struct
 				   curr, pip);
 }
 
+/**
+ * exit_futex_task_cache -- Cleanup the task cache
+ *
+ * Called when the task exits.
+ */
+void exit_futex_task_cache(struct task_struct *tsk)
+{
+	struct futex_cache *tc = tsk->futex_cache;
+	unsigned long *map = tc->cache_map;
+	unsigned int slot, size = tc->cache_size;
+
+	slot = find_first_bit(map, size);
+	for (; slot < size; slot = find_next_bit(map, size, slot + 1))
+		futex_detach_task(slot);
+	kfree(tc);
+}
+
+static unsigned int hash_prime(unsigned int size)
+{
+	switch(size) {
+	case   16:
+	default:   return 13;
+	case   32: return 31;
+	case   64: return 61;
+	case  128: return 127;
+	case  256: return 251;
+	case  512: return 509;
+	case 1024: return 1021;
+	case 2048: return 2039;
+	case 4096: return 4093;
+	}
+}
+
+static struct futex_cache *futex_alloc_cache(int cache_size)
+{
+	struct futex_cache *tc;
+	size_t size;
+
+	/* Allocate a new task cache */
+	size = sizeof(*tc) + cache_size * sizeof(struct futex_cache_slot);
+	tc = kzalloc_node(size, GFP_KERNEL, numa_node_id());
+	if (tc) {
+		tc->hash_prime = hash_prime(cache_size);
+		tc->cache_size = cache_size;
+	}
+	return tc;
+}
+
+static int
+futex_rehash_task_cache(struct futex_cache *tc, struct futex_cache *tcnew)
+{
+	unsigned long *newmap = tcnew->cache_map;
+	unsigned int prime = tcnew->hash_prime;
+	unsigned long *map = tc->cache_map;
+	unsigned int size = tc->cache_size;
+	unsigned int slot, newslot;
+
+	slot = find_first_bit(map, size);
+	for (; slot < size; slot = find_next_bit(map, size, slot + 1)) {
+		newslot = hash_local_futex(tc->slots[slot].uaddr, prime);
+		/*
+		 * Paranoia. Rehashing to a larger cache should not result in
+		 * collisions which did not exist in the small one.
+		 */
+		if (__test_and_set_bit(newslot, newmap))
+			return -ENOSPC;
+		/* Copy uaddr and futex state pointer */
+		tcnew->slots[newslot] = tc->slots[slot];
+	}
+	return 0;
+}
+
+/**
+ * futex_get_task_cache_slot - Get a slot in the tasks local cache
+ *
+ * If the cache is not yet available it's allocated. If the existing cache is
+ * too small the cache is extended.
+ *
+ * Returns a valid slot or an error code
+ */
+static int futex_get_task_cache_slot(u32 __user *uaddr)
+{
+	struct futex_cache *tcnew, *tc = current->futex_cache;
+	unsigned int cache_size;
+	int slot;
+
+	/* First caller allocates the initial cache */
+	if (!tc) {
+		tc = futex_alloc_cache(TASK_CACHE_BASE_SIZE);
+		if (!tc)
+			return -ENOMEM;
+		current->futex_cache = tc;
+		slot = hash_local_futex(uaddr, tc->hash_prime);
+		return slot;
+	}
+
+	slot = hash_local_futex(uaddr, tc->hash_prime);
+
+	/* Check whether the slot is populated already */
+	if (!test_bit(slot, tc->cache_map))
+		return slot;
+
+	/* Was this futex attached already ? */
+	if (tc->slots[slot].uaddr == uaddr)
+		return -EEXIST;
+
+	cache_size = tc->cache_size;
+retry:
+	/* Task has reached max cache size? */
+	if (cache_size >= TASK_CACHE_MAX_SIZE)
+		return -ENOSPC;
+
+	cache_size *= 2;
+	tcnew = futex_alloc_cache(cache_size);
+	if (!tcnew)
+		return -ENOMEM;
+
+	/*
+	 * If the rehashing fails or the slot for uaddr is busy after
+	 * rehashing, try with a larger cache.
+	 */
+	slot = hash_local_futex(uaddr, tcnew->hash_prime);
+
+	if (futex_rehash_task_cache(tc, tcnew) ||
+	    test_bit(slot, tcnew->cache_map)) {
+		kfree(tcnew);
+		goto retry;
+	}
+
+	/* Populate the new cache and return the slot number */
+	current->futex_cache = tcnew;
+	return slot;
+}
+
+/**
+ * futex_create - Create an attached futex object and attach to it
+ * @uaddr:	The user space address of the futex
+ * @key:	Pointer to an initialized futex key object
+ * @hb:		Pointer to the hash bucket in the global hash corresponding
+ *		to @key
+ * @slot:	Free task cache slot number
+ *
+ * Returns:
+ *  Success:	0
+ *  Failure:	Proper error code
+ *		ENOMEM: Out of memory
+ *		EEXIST: Global state exists already
+ */
+static int futex_create(u32 __user *uaddr, union futex_key *key,
+			struct futex_hash_bucket *hb, int slot)
+{
+	struct futex_state *fs;
+	struct futex_q *match;
+
+	fs = kzalloc_node(sizeof(*fs), GFP_KERNEL, numa_node_id());
+	if (!fs)
+		return -ENOMEM;
+
+	atomic_set(&fs->hb.waiters, 0);
+	plist_head_init(&fs->hb.chain);
+	spin_lock_init(&fs->hb.lock);
+
+	fs->q = futex_q_init;
+	/* This is the global state object. Set an invalid slot */
+	fs->q.key = *key;
+	fs->q.key.both.slot = ~0U;
+
+	/* Verify again whether global state for this futex exists already */
+	spin_lock(&hb->lock);
+	match = futex_top_waiter(hb, &fs->q.key);
+	if (match) {
+		spin_unlock(&hb->lock);
+		kfree(fs);
+		return -EEXIST;
+	}
+	/*
+	 * Queue the new global state in the global hash and attach the task
+	 * to it.
+	 */
+	futex_queue_state(fs, hb);
+	futex_attach_task(fs, uaddr, slot);
+	spin_unlock(&hb->lock);
+	return slot;
+}
+
+/**
+ * futex_attach - Attach a task to a registered futex
+ * @uaddr:	The user space address of the futex
+ * @flags:	User supplied flags
+ *
+ * Returns:
+ *  Success:	0
+ *  Failure:	Proper error code
+ *		ENOMEM: Out of memory
+ *		EINVAL: Invalid @flags or invalid @uaddr
+ *		EFAULT: @uaddr access would fault
+ *		EEXIST: @uaddr is already attached
+ *		ENOSPC: TASK_CACHE_MAX_SIZE reached, no free slots
+ *
+ */
+static int futex_attach(u32 __user *uaddr, unsigned int flags)
+{
+	struct futex_hash_bucket *hb;
+	struct futex_state *fs;
+	struct futex_q *match;
+	union futex_key key;
+	int ret, slot;
+
+	if (!(flags & FLAGS_ATTACHED))
+		return -EINVAL;
+
+	if (futex_key_init(&key, uaddr, flags) != NULL)
+		return -EEXIST;
+
+	key = FUTEX_KEY_INIT;
+	ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_WRITE);
+	if (ret)
+		return ret;
+
+	/* Get a slot in the task local cache */
+	slot = futex_get_task_cache_slot(uaddr);
+	if (slot < 0) {
+		ret = slot;
+		goto out_put_key;
+	}
+
+	/* Find the global state and attach to it */
+	key.both.attached = true;
+	hb = hash_global_futex(&key);
+
+again:
+	spin_lock(&hb->lock);
+	match = futex_top_waiter(hb, &key);
+	if (!match) {
+		spin_unlock(&hb->lock);
+		ret = futex_create(uaddr, &key, hb, slot);
+		/* We raced against another task creating the global state */
+		if (ret == -EEXIST)
+			goto again;
+		goto out_put_key;
+	}
+
+	/* Attach to it */
+	fs = container_of(match, struct futex_state, q);
+	futex_attach_task(fs, uaddr, slot);
+	spin_unlock(&hb->lock);
+	ret = 0;
+
+out_put_key:
+	put_futex_key(&key);
+	return ret;
+}
+
+/**
+ * futex_detach - Detach a task from a registered futex
+ * @uaddr:	The cookie which was returned from attach
+ * @flags:	User supplied flags
+ *
+ * Returns:
+ *  Success:	0
+ *  Failure:	Proper error code
+ *		EINVAL: Invalid @flags or invalid @uaddr
+ */
+static int futex_detach(u32 __user *uaddr, unsigned int flags)
+{
+	union futex_key key;
+
+	if (!(flags & FLAGS_ATTACHED))
+		return -EINVAL;
+
+	/* We look up the slot and verify that it is populated */
+	uaddr = futex_key_init(&key, uaddr, flags);
+	if (!uaddr)
+		return -EINVAL;
+
+	futex_detach_task(key.both.slot);
+	return 0;
+}
+
 long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 		u32 __user *uaddr2, u32 val2, u32 val3)
 {
@@ -3232,6 +3657,9 @@ long do_futex(u32 __user *uaddr, int op,
 			return -ENOSYS;
 	}
 
+	if (op & FUTEX_ATTACHED)
+		flags |= FLAGS_ATTACHED;
+
 	switch (cmd) {
 	case FUTEX_LOCK_PI:
 	case FUTEX_UNLOCK_PI:
@@ -3269,6 +3697,10 @@ long do_futex(u32 __user *uaddr, int op,
 					     uaddr2);
 	case FUTEX_CMP_REQUEUE_PI:
 		return futex_requeue(uaddr, flags, uaddr2, val, val2, &val3, 1);
+	case FUTEX_ATTACH:
+		return futex_attach(uaddr, flags);
+	case FUTEX_DETACH:
+		return futex_detach(uaddr, flags);
 	}
 	return -ENOSYS;
 }


* [RFC patch 5/7] perf/bench/futex-hash: Support for attached futexes
  2016-04-02 11:09 [RFC patch 0/7] futex: Add support for attached futexes Thomas Gleixner
                   ` (3 preceding siblings ...)
  2016-04-02 11:09 ` [RFC patch 4/7] futex: Add support for attached futexes Thomas Gleixner
@ 2016-04-02 11:09 ` Thomas Gleixner
  2016-04-02 11:09 ` [RFC patch 6/7] futex.2: Document attached mode Thomas Gleixner
  2016-04-02 11:09 ` [RFC patch 7/7] [PATCH] glibc: nptl: Add support for attached pthread_mutexes Thomas Gleixner
  6 siblings, 0 replies; 26+ messages in thread
From: Thomas Gleixner @ 2016-04-02 11:09 UTC (permalink / raw)
  To: LKML
  Cc: Sebastian Andrzej Siewior, Darren Hart, Peter Zijlstra,
	Ingo Molnar, Michael Kerrisk, Davidlohr Bueso, Chris Mason,
	Carlos O'Donell, Torvald Riegel, Eric Dumazet

[-- Attachment #1: perf-bench-futex-hash--Support-for-attached-futexes.patch --]
[-- Type: text/plain, Size: 3737 bytes --]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 tools/perf/bench/futex-hash.c |   40 ++++++++++++++++++++++++++++++----------
 tools/perf/bench/futex.h      |   14 ++++++++++++++
 2 files changed, 44 insertions(+), 10 deletions(-)

Index: b/tools/perf/bench/futex-hash.c
===================================================================
--- a/tools/perf/bench/futex-hash.c
+++ b/tools/perf/bench/futex-hash.c
@@ -25,7 +25,7 @@ static unsigned int nthreads = 0;
 static unsigned int nsecs    = 10;
 /* amount of futexes per thread */
 static unsigned int nfutexes = 1024;
-static bool fshared = false, done = false, silent = false;
+static bool fshared = false, done = false, silent = false, attached = false;
 static int futex_flag = 0;
 
 struct timeval start, end, runtime;
@@ -42,11 +42,12 @@ struct worker {
 };
 
 static const struct option options[] = {
-	OPT_UINTEGER('t', "threads", &nthreads, "Specify amount of threads"),
-	OPT_UINTEGER('r', "runtime", &nsecs,    "Specify runtime (in seconds)"),
-	OPT_UINTEGER('f', "futexes", &nfutexes, "Specify amount of futexes per threads"),
-	OPT_BOOLEAN( 's', "silent",  &silent,   "Silent mode: do not display data/details"),
-	OPT_BOOLEAN( 'S', "shared",  &fshared,  "Use shared futexes instead of private ones"),
+	OPT_UINTEGER('t', "threads",  &nthreads, "Specify amount of threads"),
+	OPT_UINTEGER('r', "runtime",  &nsecs,    "Specify runtime (in seconds)"),
+	OPT_UINTEGER('f', "futexes",  &nfutexes, "Specify amount of futexes per threads"),
+	OPT_BOOLEAN( 's', "silent",   &silent,   "Silent mode: do not display data/details"),
+	OPT_BOOLEAN( 'S', "shared",   &fshared,  "Use shared futexes instead of private ones"),
+	OPT_BOOLEAN( 'a', "attached", &attached, "Use attached futexes"),
 	OPT_END()
 };
 
@@ -61,6 +62,14 @@ static void *workerfn(void *arg)
 	unsigned int i;
 	struct worker *w = (struct worker *) arg;
 
+	if (attached) {
+		for (i = 0; i < nfutexes; i++) {
+			ret = futex_attach(&w->futex[i], futex_flag);
+			if (ret < 0)
+				err(EXIT_FAILURE, "Attach futex failed");
+		}
+	}
+
 	pthread_mutex_lock(&thread_lock);
 	threads_starting--;
 	if (!threads_starting)
@@ -83,6 +92,14 @@ static void *workerfn(void *arg)
 		}
 	}  while (!done);
 
+	if (attached) {
+		for (i = 0; i < nfutexes; i++) {
+			ret = futex_detach(&w->futex[i], futex_flag);
+			if (ret < 0)
+				err(EXIT_FAILURE, "Detach futex failed");
+		}
+	}
+
 	return NULL;
 }
 
@@ -136,10 +153,13 @@ int bench_futex_hash(int argc, const cha
 		goto errmem;
 
 	if (!fshared)
-		futex_flag = FUTEX_PRIVATE_FLAG;
-
-	printf("Run summary [PID %d]: %d threads, each operating on %d [%s] futexes for %d secs.\n\n",
-	       getpid(), nthreads, nfutexes, fshared ? "shared":"private", nsecs);
+		futex_flag |= FUTEX_PRIVATE_FLAG;
+	if (attached)
+		futex_flag |= FUTEX_ATTACHED;
+
+	printf("Run summary [PID %d]: %d threads, each operating on %d [%s%s] futexes for %d secs.\n\n",
+	       getpid(), nthreads, nfutexes, fshared ? "shared":"private",
+	       attached ? " attached" : "", nsecs);
 
 	init_stats(&throughput_stats);
 	pthread_mutex_init(&thread_lock, NULL);
Index: b/tools/perf/bench/futex.h
===================================================================
--- a/tools/perf/bench/futex.h
+++ b/tools/perf/bench/futex.h
@@ -101,4 +101,18 @@ static inline int pthread_attr_setaffini
 }
 #endif
 
+#define FUTEX_ATTACH		13
+#define FUTEX_DETACH		14
+#define FUTEX_ATTACHED		512
+
+static inline int futex_attach(u_int32_t *uaddr, int opflags)
+{
+	return futex(uaddr, FUTEX_ATTACH, 0, NULL, NULL, 0, opflags);
+}
+
+static inline int futex_detach(u_int32_t *uaddr, int opflags)
+{
+	return futex(uaddr, FUTEX_DETACH, 0, NULL, NULL, 0, opflags);
+}
+
 #endif /* _FUTEX_H */


* [RFC patch 7/7] [PATCH] glibc: nptl: Add support for attached pthread_mutexes
  2016-04-02 11:09 [RFC patch 0/7] futex: Add support for attached futexes Thomas Gleixner
                   ` (5 preceding siblings ...)
  2016-04-02 11:09 ` [RFC patch 6/7] futex.2: Document attached mode Thomas Gleixner
@ 2016-04-02 11:09 ` Thomas Gleixner
  2016-04-02 16:30   ` Peter Zijlstra
  6 siblings, 1 reply; 26+ messages in thread
From: Thomas Gleixner @ 2016-04-02 11:09 UTC (permalink / raw)
  To: LKML
  Cc: Sebastian Andrzej Siewior, Darren Hart, Peter Zijlstra,
	Ingo Molnar, Michael Kerrisk, Davidlohr Bueso, Chris Mason,
	Carlos O'Donell, Torvald Riegel, Eric Dumazet

[-- Attachment #1: nptl-add-support-for-attached-pthread_mutexes.patch --]
[-- Type: text/plain, Size: 28822 bytes --]

pthread_mutexes on Linux are based on the futex mechanism. The standard futex
mechanism in the Linux kernel uses a global hash to store transient
state. Collisions on that hash can lead to performance degradation and, on
real-time enabled kernels, even to unbounded priority inversions.

To guarantee futexes without collisions on the global kernel hash, the kernel
provides a mechanism to attach to a futex. This creates futex private state
which avoids hash collisions and on NUMA systems also cross node memory
access.

To utilize this mechanism each thread has to attach to the futex before any
other operations on that futex which involve kernel interaction.

At pthread_mutex_init() the pthread_mutex attribute needs to be initialized
for attached mode via:

    pthread_mutexattr_setattached_np(&attr, 1);

All threads which are using the mutex - including the one which called
pthread_mutex_init() - must invoke

    pthread_mutex_attach_np(&mutex);

before any other pthread_mutex related operations.

Example:
        pthread_mutexattr_t attr;
	pthread_mutex_t lock;

        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setattached_np(&attr, 1);
        pthread_mutex_init(&lock, &attr);

        pthread_mutex_attach_np(&lock);

	pthread_mutex_lock(&lock);

In strace output this looks like:

	futex(<addr>, 0x280 /* FUTEX_??? */, 1, NULL <unfinished ...>

	0x280: 	 FUTEX_ATTACHED | FUTEX_PRIVATE | FUTEX_WAIT

To undo the attachment each involved thread needs to call

    pthread_mutex_detach_np(&mutex);

When the last user detaches, the kernel state is destroyed.

Attached mode works with all futex operations. For operations which involve
two futexes, i.e. FUTEX_REQUEUE_*, both futexes have to be either attached or
detached (similar to FUTEX_PRIVATE_FLAG). This excludes pthread condvars for
now, but support will be implemented once the details of this new mechanism
are agreed on.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 nptl/Makefile                                      |  3 ++
 nptl/Versions                                      |  7 +++++
 nptl/pthreadP.h                                    | 15 ++++++++++
 nptl/pthread_cond_destroy.c                        |  2 +-
 nptl/pthread_mutex_attach_np.c                     | 35 ++++++++++++++++++++++
 nptl/pthread_mutex_cond_lock.c                     |  6 ++--
 nptl/pthread_mutex_destroy.c                       |  3 ++
 nptl/pthread_mutex_detach_np.c                     | 35 ++++++++++++++++++++++
 nptl/pthread_mutex_init.c                          |  3 ++
 nptl/pthread_mutex_lock.c                          | 18 +++++------
 nptl/pthread_mutex_setprioceiling.c                |  4 +--
 nptl/pthread_mutex_timedlock.c                     | 20 ++++++-------
 nptl/pthread_mutex_trylock.c                       |  8 ++---
 nptl/pthread_mutex_unlock.c                        | 12 ++++----
 nptl/pthread_mutexattr_getattached_np.c            | 15 ++++++++++
 nptl/pthread_mutexattr_setattached_np.c            | 17 +++++++++++
 sysdeps/nptl/pthread.h                             | 19 ++++++++++++
 sysdeps/unix/sysv/linux/i386/libpthread.abilist    |  5 ++++
 sysdeps/unix/sysv/linux/lowlevellock-futex.h       |  9 ++++--
 .../unix/sysv/linux/x86_64/64/libpthread.abilist   |  5 ++++
 .../unix/sysv/linux/x86_64/x32/libpthread.abilist  |  5 ++++
 21 files changed, 208 insertions(+), 38 deletions(-)
 create mode 100644 nptl/pthread_mutex_attach_np.c
 create mode 100644 nptl/pthread_mutex_detach_np.c
 create mode 100644 nptl/pthread_mutexattr_getattached_np.c
 create mode 100644 nptl/pthread_mutexattr_setattached_np.c

diff --git a/nptl/Makefile b/nptl/Makefile
index dc3ccab991ad..d4944f905c5b 100644
--- a/nptl/Makefile
+++ b/nptl/Makefile
@@ -59,6 +59,9 @@ libpthread-routines = nptl-init vars events version pt-interp \
 		      pthread_mutexattr_init pthread_mutexattr_destroy \
 		      pthread_mutexattr_getpshared \
 		      pthread_mutexattr_setpshared \
+		      pthread_mutexattr_setattached_np \
+		      pthread_mutexattr_getattached_np \
+		      pthread_mutex_attach_np pthread_mutex_detach_np \
 		      pthread_mutexattr_gettype pthread_mutexattr_settype \
 		      pthread_rwlock_init pthread_rwlock_destroy \
 		      pthread_rwlock_rdlock pthread_rwlock_timedrdlock \
diff --git a/nptl/Versions b/nptl/Versions
index 0ae5def464b9..06981b7ee6ca 100644
--- a/nptl/Versions
+++ b/nptl/Versions
@@ -265,6 +265,13 @@ libpthread {
   GLIBC_2.22 {
   }
 
+  GLIBC_2.23 {
+    pthread_mutexattr_getattached_np;
+    pthread_mutexattr_setattached_np;
+    pthread_mutex_attach_np;
+    pthread_mutex_detach_np;
+  }
+
   GLIBC_PRIVATE {
     __pthread_initialize_minimal;
     __pthread_clock_gettime; __pthread_clock_settime;
diff --git a/nptl/pthreadP.h b/nptl/pthreadP.h
index 4edc74b1ef54..b7656c292ecb 100644
--- a/nptl/pthreadP.h
+++ b/nptl/pthreadP.h
@@ -109,6 +109,7 @@ enum
 	  PTHREAD_MUTEX_TIMED_NP | PTHREAD_MUTEX_NO_ELISION_NP,
 };
 #define PTHREAD_MUTEX_PSHARED_BIT 128
+#define PTHREAD_MUTEX_ATTACHED_BIT 1024
 
 #define PTHREAD_MUTEX_TYPE(m) \
   ((m)->__data.__kind & 127)
@@ -125,10 +126,19 @@ enum
   (((m)->__data.__kind & 128) ? LLL_SHARED : LLL_PRIVATE)
 #endif
 
+#define PTHREAD_MUTEX_PATTACHED(m) \
+  (((m)->__data.__kind & PTHREAD_MUTEX_ATTACHED_BIT) ? FUTEX_ATTACHED : 0)
+
+#define PTHREAD_MUTEX_PSHARED_PATTACHED(m) \
+  (PTHREAD_MUTEX_PSHARED(m) | PTHREAD_MUTEX_PATTACHED(m))
+
 /* The kernel when waking robust mutexes on exit never uses
    FUTEX_PRIVATE_FLAG FUTEX_WAKE.  */
 #define PTHREAD_ROBUST_MUTEX_PSHARED(m) LLL_SHARED
 
+#define PTHREAD_ROBUST_MUTEX_PSHARED_PATTACHED(m) \
+	(PTHREAD_MUTEX_PATTACHED(m) | LLL_SHARED)
+
 /* Ceiling in __data.__lock.  __data.__lock is signed, so don't
    use the MSB bit in there, but in the mask also include that bit,
    so that the compiler can optimize & PTHREAD_MUTEX_PRIO_CEILING_MASK
@@ -139,6 +149,7 @@ enum
 
 
 /* Flags in mutex attr.  */
+#define PTHREAD_MUTEXATTR_FLAG_ATTACHED		0x08000000
 #define PTHREAD_MUTEXATTR_PROTOCOL_SHIFT	28
 #define PTHREAD_MUTEXATTR_PROTOCOL_MASK		0x30000000
 #define PTHREAD_MUTEXATTR_PRIO_CEILING_SHIFT	12
@@ -422,6 +433,8 @@ extern int __pthread_mutex_unlock (pthread_mutex_t *__mutex);
 extern int __pthread_mutex_unlock_usercnt (pthread_mutex_t *__mutex,
 					   int __decr)
      attribute_hidden internal_function;
+extern int __pthread_mutex_attach_np (pthread_mutex_t *__mutex);
+extern int __pthread_mutex_detach_np (pthread_mutex_t *__mutex);
 extern int __pthread_mutexattr_init (pthread_mutexattr_t *attr);
 extern int __pthread_mutexattr_destroy (pthread_mutexattr_t *attr);
 extern int __pthread_mutexattr_settype (pthread_mutexattr_t *attr, int kind);
@@ -497,6 +510,8 @@ hidden_proto (__pthread_mutex_init)
 hidden_proto (__pthread_mutex_destroy)
 hidden_proto (__pthread_mutex_lock)
 hidden_proto (__pthread_mutex_unlock)
+hidden_proto (__pthread_mutex_attach_np)
+hidden_proto (__pthread_mutex_detach_np)
 hidden_proto (__pthread_rwlock_rdlock)
 hidden_proto (__pthread_rwlock_wrlock)
 hidden_proto (__pthread_rwlock_unlock)
diff --git a/nptl/pthread_cond_destroy.c b/nptl/pthread_cond_destroy.c
index 1acd8042d841..3f7ae9fb36db 100644
--- a/nptl/pthread_cond_destroy.c
+++ b/nptl/pthread_cond_destroy.c
@@ -64,7 +64,7 @@ __pthread_cond_destroy (pthread_cond_t *cond)
 	{
 	  pthread_mutex_t *mut = (pthread_mutex_t *) cond->__data.__mutex;
 	  lll_futex_wake (&mut->__data.__lock, INT_MAX,
-			  PTHREAD_MUTEX_PSHARED (mut));
+			  PTHREAD_MUTEX_PSHARED_PATTACHED (mut));
 	}
 
       do
diff --git a/nptl/pthread_mutex_attach_np.c b/nptl/pthread_mutex_attach_np.c
new file mode 100644
index 000000000000..1038f898ac37
--- /dev/null
+++ b/nptl/pthread_mutex_attach_np.c
@@ -0,0 +1,35 @@
+#include <errno.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <sys/param.h>
+#include "pthreadP.h"
+#include <lowlevellock.h>
+#include <stap-probe.h>
+
+int __pthread_mutex_attach_np (pthread_mutex_t *mutex)
+{
+	int ret;
+	int private = FUTEX_ATTACHED;
+	INTERNAL_SYSCALL_DECL (__err);
+
+	if (!(mutex->__data.__kind & PTHREAD_MUTEX_ATTACHED_BIT))
+		return EINVAL;
+
+	if (mutex->__data.__kind & PTHREAD_MUTEX_ROBUST_NORMAL_NP)
+		private |= LLL_SHARED;
+	else if (mutex->__data.__kind & PTHREAD_MUTEX_PSHARED_BIT)
+		private |= LLL_SHARED;
+
+	ret = INTERNAL_SYSCALL (futex, __err, 4, &mutex->__data.__lock,
+				__lll_private_flag(FUTEX_ATTACH,
+						    private), 0, 0);
+	if (INTERNAL_SYSCALL_ERROR_P (ret, __err))
+	    return INTERNAL_SYSCALL_ERRNO (ret, __err);
+
+	return ret;
+}
+
+#ifndef __pthread_mutex_lock
+strong_alias (__pthread_mutex_attach_np, pthread_mutex_attach_np)
+hidden_def (__pthread_mutex_attach_np)
+#endif
diff --git a/nptl/pthread_mutex_cond_lock.c b/nptl/pthread_mutex_cond_lock.c
index 2ac421fd63af..fd1932088996 100644
--- a/nptl/pthread_mutex_cond_lock.c
+++ b/nptl/pthread_mutex_cond_lock.c
@@ -1,11 +1,11 @@
 #include <pthreadP.h>
 
 #define LLL_MUTEX_LOCK(mutex) \
-  lll_cond_lock ((mutex)->__data.__lock, PTHREAD_MUTEX_PSHARED (mutex))
+  lll_cond_lock ((mutex)->__data.__lock, PTHREAD_MUTEX_PSHARED_PATTACHED (mutex))
 
 /* Not actually elided so far. Needed? */
 #define LLL_MUTEX_LOCK_ELISION(mutex)  \
-  ({ lll_cond_lock ((mutex)->__data.__lock, PTHREAD_MUTEX_PSHARED (mutex)); 0; })
+  ({ lll_cond_lock ((mutex)->__data.__lock, PTHREAD_MUTEX_PSHARED_PATTACHED (mutex)); 0; })
 
 #define LLL_MUTEX_TRYLOCK(mutex) \
   lll_cond_trylock ((mutex)->__data.__lock)
@@ -13,7 +13,7 @@
 
 #define LLL_ROBUST_MUTEX_LOCK(mutex, id) \
   lll_robust_cond_lock ((mutex)->__data.__lock, id, \
-			PTHREAD_ROBUST_MUTEX_PSHARED (mutex))
+			PTHREAD_ROBUST_MUTEX_PSHARED_PATTACHED (mutex))
 #define __pthread_mutex_lock internal_function __pthread_mutex_cond_lock
 #define __pthread_mutex_lock_full __pthread_mutex_cond_lock_full
 #define NO_INCR
diff --git a/nptl/pthread_mutex_destroy.c b/nptl/pthread_mutex_destroy.c
index 8a57d49272c1..99550852c2f2 100644
--- a/nptl/pthread_mutex_destroy.c
+++ b/nptl/pthread_mutex_destroy.c
@@ -31,6 +31,9 @@ __pthread_mutex_destroy (pthread_mutex_t *mutex)
       && mutex->__data.__nusers != 0)
     return EBUSY;
 
+  if (mutex->__data.__kind & PTHREAD_MUTEX_ATTACHED_BIT)
+	  return __pthread_mutex_detach_np(mutex);
+
   /* Set to an invalid value.  */
   mutex->__data.__kind = -1;
 
diff --git a/nptl/pthread_mutex_detach_np.c b/nptl/pthread_mutex_detach_np.c
new file mode 100644
index 000000000000..4c4f8fd1c9af
--- /dev/null
+++ b/nptl/pthread_mutex_detach_np.c
@@ -0,0 +1,35 @@
+#include <errno.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <sys/param.h>
+#include "pthreadP.h"
+#include <lowlevellock.h>
+#include <stap-probe.h>
+
+int __pthread_mutex_detach_np (pthread_mutex_t *mutex)
+{
+	int ret;
+	int private = FUTEX_ATTACHED;
+	INTERNAL_SYSCALL_DECL (__err);
+
+	if (!(mutex->__data.__kind & PTHREAD_MUTEX_ATTACHED_BIT))
+		return EINVAL;
+
+	if (mutex->__data.__kind & PTHREAD_MUTEX_ROBUST_NORMAL_NP)
+		private |= LLL_SHARED;
+	else if (mutex->__data.__kind & PTHREAD_MUTEX_PSHARED_BIT)
+		private |= LLL_SHARED;
+
+	ret = INTERNAL_SYSCALL (futex, __err, 4, &mutex->__data.__lock,
+				__lll_private_flag(FUTEX_DETACH,
+						    private), 0, 0);
+	if (INTERNAL_SYSCALL_ERROR_P (ret, __err))
+		return INTERNAL_SYSCALL_ERRNO (ret, __err);
+
+	return ret;
+}
+
+#ifndef __pthread_mutex_lock
+strong_alias (__pthread_mutex_detach_np, pthread_mutex_detach_np)
+hidden_def (__pthread_mutex_detach_np)
+#endif
diff --git a/nptl/pthread_mutex_init.c b/nptl/pthread_mutex_init.c
index 71ac7bc7f3df..6f6344fcc805 100644
--- a/nptl/pthread_mutex_init.c
+++ b/nptl/pthread_mutex_init.c
@@ -136,6 +136,9 @@ __pthread_mutex_init (pthread_mutex_t *mutex,
 				| PTHREAD_MUTEXATTR_FLAG_ROBUST)) != 0)
     mutex->__data.__kind |= PTHREAD_MUTEX_PSHARED_BIT;
 
+  if (imutexattr->mutexkind & PTHREAD_MUTEXATTR_FLAG_ATTACHED)
+	  mutex->__data.__kind |= PTHREAD_MUTEX_ATTACHED_BIT;
+
   /* Default values: mutex not used yet.  */
   // mutex->__count = 0;	already done by memset
   // mutex->__owner = 0;	already done by memset
diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
index bdfa529f639b..41f18caf5ba8 100644
--- a/nptl/pthread_mutex_lock.c
+++ b/nptl/pthread_mutex_lock.c
@@ -38,18 +38,18 @@
 
 #ifndef LLL_MUTEX_LOCK
 # define LLL_MUTEX_LOCK(mutex) \
-  lll_lock ((mutex)->__data.__lock, PTHREAD_MUTEX_PSHARED (mutex))
+  lll_lock ((mutex)->__data.__lock, PTHREAD_MUTEX_PSHARED_PATTACHED (mutex))
 # define LLL_MUTEX_TRYLOCK(mutex) \
   lll_trylock ((mutex)->__data.__lock)
 # define LLL_ROBUST_MUTEX_LOCK(mutex, id) \
   lll_robust_lock ((mutex)->__data.__lock, id, \
-		   PTHREAD_ROBUST_MUTEX_PSHARED (mutex))
+		   PTHREAD_ROBUST_MUTEX_PSHARED_PATTACHED (mutex))
 # define LLL_MUTEX_LOCK_ELISION(mutex) \
   lll_lock_elision ((mutex)->__data.__lock, (mutex)->__data.__elision, \
-		   PTHREAD_MUTEX_PSHARED (mutex))
+		   PTHREAD_MUTEX_PSHARED_PATTACHED (mutex))
 # define LLL_MUTEX_TRYLOCK_ELISION(mutex) \
   lll_trylock_elision((mutex)->__data.__lock, (mutex)->__data.__elision, \
-		   PTHREAD_MUTEX_PSHARED (mutex))
+		   PTHREAD_MUTEX_PSHARED_PATTACHED (mutex))
 #endif
 
 #ifndef FORCE_ELISION
@@ -261,7 +261,7 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
 	      /* This mutex is now not recoverable.  */
 	      mutex->__data.__count = 0;
 	      lll_unlock (mutex->__data.__lock,
-			  PTHREAD_ROBUST_MUTEX_PSHARED (mutex));
+			  PTHREAD_ROBUST_MUTEX_PSHARED_PATTACHED (mutex));
 	      THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
 	      return ENOTRECOVERABLE;
 	    }
@@ -333,8 +333,8 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
 	    /* The mutex is locked.  The kernel will now take care of
 	       everything.  */
 	    int private = (robust
-			   ? PTHREAD_ROBUST_MUTEX_PSHARED (mutex)
-			   : PTHREAD_MUTEX_PSHARED (mutex));
+			   ? PTHREAD_ROBUST_MUTEX_PSHARED_PATTACHED (mutex)
+			   : PTHREAD_MUTEX_PSHARED_PATTACHED (mutex));
 	    INTERNAL_SYSCALL_DECL (__err);
 	    int e = INTERNAL_SYSCALL (futex, __err, 4, &mutex->__data.__lock,
 				      __lll_private_flag (FUTEX_LOCK_PI,
@@ -395,7 +395,7 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
 	    INTERNAL_SYSCALL_DECL (__err);
 	    INTERNAL_SYSCALL (futex, __err, 4, &mutex->__data.__lock,
 			      __lll_private_flag (FUTEX_UNLOCK_PI,
-						  PTHREAD_ROBUST_MUTEX_PSHARED (mutex)),
+						  PTHREAD_ROBUST_MUTEX_PSHARED_PATTACHED (mutex)),
 			      0, 0);
 
 	    THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
@@ -484,7 +484,7 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
 
 		if (oldval != ceilval)
 		  lll_futex_wait (&mutex->__data.__lock, ceilval | 2,
-				  PTHREAD_MUTEX_PSHARED (mutex));
+				  PTHREAD_MUTEX_PSHARED_PATTACHED (mutex));
 	      }
 	    while (atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
 							ceilval | 2, ceilval)
diff --git a/nptl/pthread_mutex_setprioceiling.c b/nptl/pthread_mutex_setprioceiling.c
index 6b98b5d046c9..55d4d2276f56 100644
--- a/nptl/pthread_mutex_setprioceiling.c
+++ b/nptl/pthread_mutex_setprioceiling.c
@@ -84,7 +84,7 @@ pthread_mutex_setprioceiling (pthread_mutex_t *mutex, int prioceiling,
 
 	    if (oldval != ceilval)
 	      lll_futex_wait (&mutex->__data.__lock, ceilval | 2,
-			      PTHREAD_MUTEX_PSHARED (mutex));
+			      PTHREAD_MUTEX_PSHARED_PATTACHED (mutex));
 	  }
 	while (atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
 						    ceilval | 2, ceilval)
@@ -115,7 +115,7 @@ pthread_mutex_setprioceiling (pthread_mutex_t *mutex, int prioceiling,
   atomic_full_barrier ();
 
   lll_futex_wake (&mutex->__data.__lock, INT_MAX,
-		  PTHREAD_MUTEX_PSHARED (mutex));
+		  PTHREAD_MUTEX_PSHARED_PATTACHED (mutex));
 
   return 0;
 }
diff --git a/nptl/pthread_mutex_timedlock.c b/nptl/pthread_mutex_timedlock.c
index 07f0901e5266..06f466ef5dde 100644
--- a/nptl/pthread_mutex_timedlock.c
+++ b/nptl/pthread_mutex_timedlock.c
@@ -74,7 +74,7 @@ pthread_mutex_timedlock (pthread_mutex_t *mutex,
 
       /* We have to get the mutex.  */
       result = lll_timedlock (mutex->__data.__lock, abstime,
-			      PTHREAD_MUTEX_PSHARED (mutex));
+			      PTHREAD_MUTEX_PSHARED_PATTACHED (mutex));
 
       if (result != 0)
 	goto out;
@@ -97,7 +97,7 @@ pthread_mutex_timedlock (pthread_mutex_t *mutex,
     simple:
       /* Normal mutex.  */
       result = lll_timedlock (mutex->__data.__lock, abstime,
-			      PTHREAD_MUTEX_PSHARED (mutex));
+			      PTHREAD_MUTEX_PSHARED_PATTACHED (mutex));
       break;
 
     case PTHREAD_MUTEX_TIMED_ELISION_NP:
@@ -106,7 +106,7 @@ pthread_mutex_timedlock (pthread_mutex_t *mutex,
       return lll_timedlock_elision (mutex->__data.__lock,
 				    mutex->__data.__spins,
 				    abstime,
-				    PTHREAD_MUTEX_PSHARED (mutex));
+				    PTHREAD_MUTEX_PSHARED_PATTACHED (mutex));
 
 
     case PTHREAD_MUTEX_ADAPTIVE_NP:
@@ -123,7 +123,7 @@ pthread_mutex_timedlock (pthread_mutex_t *mutex,
 	      if (cnt++ >= max_cnt)
 		{
 		  result = lll_timedlock (mutex->__data.__lock, abstime,
-					  PTHREAD_MUTEX_PSHARED (mutex));
+					  PTHREAD_MUTEX_PSHARED_PATTACHED (mutex));
 		  break;
 		}
 	      atomic_spin_nop ();
@@ -204,7 +204,7 @@ pthread_mutex_timedlock (pthread_mutex_t *mutex,
 	    }
 
 	  result = lll_robust_timedlock (mutex->__data.__lock, abstime, id,
-					 PTHREAD_ROBUST_MUTEX_PSHARED (mutex));
+					 PTHREAD_ROBUST_MUTEX_PSHARED_PATTACHED (mutex));
 
 	  if (__builtin_expect (mutex->__data.__owner
 				== PTHREAD_MUTEX_NOTRECOVERABLE, 0))
@@ -212,7 +212,7 @@ pthread_mutex_timedlock (pthread_mutex_t *mutex,
 	      /* This mutex is now not recoverable.  */
 	      mutex->__data.__count = 0;
 	      lll_unlock (mutex->__data.__lock,
-			  PTHREAD_ROBUST_MUTEX_PSHARED (mutex));
+			  PTHREAD_ROBUST_MUTEX_PSHARED_PATTACHED (mutex));
 	      THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
 	      return ENOTRECOVERABLE;
 	    }
@@ -288,8 +288,8 @@ pthread_mutex_timedlock (pthread_mutex_t *mutex,
 	       everything.  The timeout value must be a relative value.
 	       Convert it.  */
 	    int private = (robust
-			   ? PTHREAD_ROBUST_MUTEX_PSHARED (mutex)
-			   : PTHREAD_MUTEX_PSHARED (mutex));
+			   ? PTHREAD_ROBUST_MUTEX_PSHARED_PATTACHED (mutex)
+			   : PTHREAD_MUTEX_PSHARED_PATTACHED (mutex));
 	    INTERNAL_SYSCALL_DECL (__err);
 
 	    int e = INTERNAL_SYSCALL (futex, __err, 4, &mutex->__data.__lock,
@@ -370,7 +370,7 @@ pthread_mutex_timedlock (pthread_mutex_t *mutex,
 	    INTERNAL_SYSCALL_DECL (__err);
 	    INTERNAL_SYSCALL (futex, __err, 4, &mutex->__data.__lock,
 			      __lll_private_flag (FUTEX_UNLOCK_PI,
-						  PTHREAD_ROBUST_MUTEX_PSHARED (mutex)),
+						  PTHREAD_ROBUST_MUTEX_PSHARED_PATTACHED (mutex)),
 			      0, 0);
 
 	    THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
@@ -489,7 +489,7 @@ pthread_mutex_timedlock (pthread_mutex_t *mutex,
 
 		    lll_futex_timed_wait (&mutex->__data.__lock,
 					  ceilval | 2, &rt,
-					  PTHREAD_MUTEX_PSHARED (mutex));
+					  PTHREAD_MUTEX_PSHARED_PATTACHED (mutex));
 		  }
 	      }
 	    while (atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
diff --git a/nptl/pthread_mutex_trylock.c b/nptl/pthread_mutex_trylock.c
index 48c7865702ba..c0221cc86c02 100644
--- a/nptl/pthread_mutex_trylock.c
+++ b/nptl/pthread_mutex_trylock.c
@@ -170,7 +170,7 @@ __pthread_mutex_trylock (pthread_mutex_t *mutex)
 	      mutex->__data.__count = 0;
 	      if (oldval == id)
 		lll_unlock (mutex->__data.__lock,
-			    PTHREAD_ROBUST_MUTEX_PSHARED (mutex));
+			    PTHREAD_ROBUST_MUTEX_PSHARED_PATTACHED (mutex));
 	      THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
 	      return ENOTRECOVERABLE;
 	    }
@@ -252,8 +252,8 @@ __pthread_mutex_trylock (pthread_mutex_t *mutex)
 	    /* The mutex owner died.  The kernel will now take care of
 	       everything.  */
 	    int private = (robust
-			   ? PTHREAD_ROBUST_MUTEX_PSHARED (mutex)
-			   : PTHREAD_MUTEX_PSHARED (mutex));
+			   ? PTHREAD_ROBUST_MUTEX_PSHARED_PATTACHED (mutex)
+			   : PTHREAD_MUTEX_PSHARED_PATTACHED (mutex));
 	    INTERNAL_SYSCALL_DECL (__err);
 	    int e = INTERNAL_SYSCALL (futex, __err, 4, &mutex->__data.__lock,
 				      __lll_private_flag (FUTEX_TRYLOCK_PI,
@@ -299,7 +299,7 @@ __pthread_mutex_trylock (pthread_mutex_t *mutex)
 	    INTERNAL_SYSCALL_DECL (__err);
 	    INTERNAL_SYSCALL (futex, __err, 4, &mutex->__data.__lock,
 			      __lll_private_flag (FUTEX_UNLOCK_PI,
-						  PTHREAD_ROBUST_MUTEX_PSHARED (mutex)),
+						  PTHREAD_ROBUST_MUTEX_PSHARED_PATTACHED (mutex)),
 			      0, 0);
 
 	    THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
diff --git a/nptl/pthread_mutex_unlock.c b/nptl/pthread_mutex_unlock.c
index 334ce383420e..f2c276881f03 100644
--- a/nptl/pthread_mutex_unlock.c
+++ b/nptl/pthread_mutex_unlock.c
@@ -52,7 +52,7 @@ __pthread_mutex_unlock_usercnt (pthread_mutex_t *mutex, int decr)
 	--mutex->__data.__nusers;
 
       /* Unlock.  */
-      lll_unlock (mutex->__data.__lock, PTHREAD_MUTEX_PSHARED (mutex));
+      lll_unlock (mutex->__data.__lock, PTHREAD_MUTEX_PSHARED_PATTACHED (mutex));
 
       LIBC_PROBE (mutex_release, 1, mutex);
 
@@ -62,7 +62,7 @@ __pthread_mutex_unlock_usercnt (pthread_mutex_t *mutex, int decr)
     {
       /* Don't reset the owner/users fields for elision.  */
       return lll_unlock_elision (mutex->__data.__lock, mutex->__data.__elision,
-				      PTHREAD_MUTEX_PSHARED (mutex));
+				 PTHREAD_MUTEX_PSHARED_PATTACHED (mutex));
     }
   else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex)
 			      == PTHREAD_MUTEX_RECURSIVE_NP, 1))
@@ -151,7 +151,7 @@ __pthread_mutex_unlock_full (pthread_mutex_t *mutex, int decr)
 
       /* Unlock.  */
       lll_robust_unlock (mutex->__data.__lock,
-			 PTHREAD_ROBUST_MUTEX_PSHARED (mutex));
+			 PTHREAD_ROBUST_MUTEX_PSHARED_PATTACHED (mutex));
 
       THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
       break;
@@ -235,8 +235,8 @@ __pthread_mutex_unlock_full (pthread_mutex_t *mutex, int decr)
 	 lll_unlock).  */
       int robust = mutex->__data.__kind & PTHREAD_MUTEX_ROBUST_NORMAL_NP;
       int private = (robust
-		     ? PTHREAD_ROBUST_MUTEX_PSHARED (mutex)
-		     : PTHREAD_MUTEX_PSHARED (mutex));
+		     ? PTHREAD_ROBUST_MUTEX_PSHARED_PATTACHED (mutex)
+		     : PTHREAD_MUTEX_PSHARED_PATTACHED (mutex));
       if ((mutex->__data.__lock & FUTEX_WAITERS) != 0
 	  || atomic_compare_and_exchange_bool_rel (&mutex->__data.__lock, 0,
 						   THREAD_GETMEM (THREAD_SELF,
@@ -290,7 +290,7 @@ __pthread_mutex_unlock_full (pthread_mutex_t *mutex, int decr)
 
       if ((oldval & ~PTHREAD_MUTEX_PRIO_CEILING_MASK) > 1)
 	lll_futex_wake (&mutex->__data.__lock, 1,
-			PTHREAD_MUTEX_PSHARED (mutex));
+			PTHREAD_MUTEX_PSHARED_PATTACHED (mutex));
 
       int oldprio = newval >> PTHREAD_MUTEX_PRIO_CEILING_SHIFT;
 
diff --git a/nptl/pthread_mutexattr_getattached_np.c b/nptl/pthread_mutexattr_getattached_np.c
new file mode 100644
index 000000000000..033cfdb141a8
--- /dev/null
+++ b/nptl/pthread_mutexattr_getattached_np.c
@@ -0,0 +1,15 @@
+#include <pthreadP.h>
+
+int pthread_mutexattr_getattached_np (const pthread_mutexattr_t *attr,
+				      int *attached)
+{
+	const struct pthread_mutexattr *iattr;
+
+	iattr = (const struct pthread_mutexattr *) attr;
+
+	if (iattr->mutexkind & PTHREAD_MUTEXATTR_FLAG_ATTACHED)
+		*attached = 1;
+	else
+		*attached = 0;
+	return 0;
+}
diff --git a/nptl/pthread_mutexattr_setattached_np.c b/nptl/pthread_mutexattr_setattached_np.c
new file mode 100644
index 000000000000..aadbdc7478a8
--- /dev/null
+++ b/nptl/pthread_mutexattr_setattached_np.c
@@ -0,0 +1,17 @@
+#include <errno.h>
+#include <pthreadP.h>
+
+int pthread_mutexattr_setattached_np (pthread_mutexattr_t *attr,
+				      int attached)
+{
+	struct pthread_mutexattr *iattr = (struct pthread_mutexattr *) attr;
+
+	if (attached < 0 || attached > 1)
+		return EINVAL;
+
+	if (!attached)
+		iattr->mutexkind &= ~PTHREAD_MUTEXATTR_FLAG_ATTACHED;
+	else
+		iattr->mutexkind |= PTHREAD_MUTEXATTR_FLAG_ATTACHED;
+	return 0;
+}
diff --git a/sysdeps/nptl/pthread.h b/sysdeps/nptl/pthread.h
index fd0894efd210..75c1c9c2112e 100644
--- a/sysdeps/nptl/pthread.h
+++ b/sysdeps/nptl/pthread.h
@@ -762,6 +762,12 @@ extern int pthread_mutex_trylock (pthread_mutex_t *__mutex)
 extern int pthread_mutex_lock (pthread_mutex_t *__mutex)
      __THROWNL __nonnull ((1));
 
+extern int pthread_mutex_attach_np (pthread_mutex_t *__mutex)
+     __THROWNL __nonnull ((1));
+
+extern int pthread_mutex_detach_np (pthread_mutex_t *__mutex)
+     __THROWNL __nonnull ((1));
+
 #ifdef __USE_XOPEN2K
 /* Wait until lock becomes available, or specified time passes. */
 extern int pthread_mutex_timedlock (pthread_mutex_t *__restrict __mutex,
@@ -846,6 +852,19 @@ extern int pthread_mutexattr_setprotocol (pthread_mutexattr_t *__attr,
 					  int __protocol)
      __THROW __nonnull ((1));
 
+/* Return in *ATTACHED the mutex attached attribute in *ATTR.  */
+extern int pthread_mutexattr_getattached_np (const pthread_mutexattr_t *
+					     __restrict __attr,
+					     int *__restrict __attached)
+     __THROW __nonnull ((1, 2));
+
+/* Set the mutex attached attribute in *ATTR to ATTACHED (either
+   0, or 1). */
+extern int pthread_mutexattr_setattached_np (pthread_mutexattr_t *__attr,
+					     int __attached)
+     __THROW __nonnull ((1));
+
+
 /* Return in *PRIOCEILING the mutex prioceiling attribute in *ATTR.  */
 extern int pthread_mutexattr_getprioceiling (const pthread_mutexattr_t *
 					     __restrict __attr,
diff --git a/sysdeps/unix/sysv/linux/i386/libpthread.abilist b/sysdeps/unix/sysv/linux/i386/libpthread.abilist
index 8f9c3254be88..c0e711985fc4 100644
--- a/sysdeps/unix/sysv/linux/i386/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/i386/libpthread.abilist
@@ -216,6 +216,11 @@ GLIBC_2.2.3 GLIBC_2.2.3 A
 GLIBC_2.2.3 pthread_getattr_np F
 GLIBC_2.2.6 GLIBC_2.2.6 A
 GLIBC_2.2.6 __nanosleep F
+GLIBC_2.23 GLIBC_2.23 A
+GLIBC_2.23 pthread_mutexattr_getattached_np F
+GLIBC_2.23 pthread_mutexattr_setattached_np F
+GLIBC_2.23 pthread_mutex_attach_np F
+GLIBC_2.23 pthread_mutex_detach_np F
 GLIBC_2.3.2 GLIBC_2.3.2 A
 GLIBC_2.3.2 pthread_cond_broadcast F
 GLIBC_2.3.2 pthread_cond_destroy F
diff --git a/sysdeps/unix/sysv/linux/lowlevellock-futex.h b/sysdeps/unix/sysv/linux/lowlevellock-futex.h
index 7111bac9431e..be03683e2e08 100644
--- a/sysdeps/unix/sysv/linux/lowlevellock-futex.h
+++ b/sysdeps/unix/sysv/linux/lowlevellock-futex.h
@@ -38,8 +38,11 @@
 #define FUTEX_WAKE_BITSET	10
 #define FUTEX_WAIT_REQUEUE_PI   11
 #define FUTEX_CMP_REQUEUE_PI    12
+#define FUTEX_ATTACH		13
+#define FUTEX_DETACH		14
 #define FUTEX_PRIVATE_FLAG	128
 #define FUTEX_CLOCK_REALTIME	256
+#define FUTEX_ATTACHED		512
 
 #define FUTEX_BITSET_MATCH_ANY	0xffffffff
 
@@ -72,9 +75,9 @@
 # else
 #  define __lll_private_flag(fl, private) \
   (__builtin_constant_p (private)					      \
-   ? ((private) == 0							      \
-      ? ((fl) | THREAD_GETMEM (THREAD_SELF, header.private_futex))	      \
-      : (fl))								      \
+   ? (((private) & LLL_SHARED) == 0					      \
+      ? ((fl) | THREAD_GETMEM (THREAD_SELF, header.private_futex) | ((private) & FUTEX_ATTACHED)) \
+      : ((fl) | ((private) & FUTEX_ATTACHED)))				      \
    : ((fl) | (((private) ^ FUTEX_PRIVATE_FLAG)				      \
 	      & THREAD_GETMEM (THREAD_SELF, header.private_futex))))
 # endif
diff --git a/sysdeps/unix/sysv/linux/x86_64/64/libpthread.abilist b/sysdeps/unix/sysv/linux/x86_64/64/libpthread.abilist
index 85365c057cd7..daa0ac8f2320 100644
--- a/sysdeps/unix/sysv/linux/x86_64/64/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/x86_64/64/libpthread.abilist
@@ -203,6 +203,11 @@ GLIBC_2.2.5 waitpid F
 GLIBC_2.2.5 write F
 GLIBC_2.2.6 GLIBC_2.2.6 A
 GLIBC_2.2.6 __nanosleep F
+GLIBC_2.23 GLIBC_2.23 A
+GLIBC_2.23 pthread_mutexattr_getattached_np F
+GLIBC_2.23 pthread_mutexattr_setattached_np F
+GLIBC_2.23 pthread_mutex_attach_np F
+GLIBC_2.23 pthread_mutex_detach_np F
 GLIBC_2.3.2 GLIBC_2.3.2 A
 GLIBC_2.3.2 pthread_cond_broadcast F
 GLIBC_2.3.2 pthread_cond_destroy F
diff --git a/sysdeps/unix/sysv/linux/x86_64/x32/libpthread.abilist b/sysdeps/unix/sysv/linux/x86_64/x32/libpthread.abilist
index 6cd0fc348708..de17ebf46818 100644
--- a/sysdeps/unix/sysv/linux/x86_64/x32/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/x86_64/x32/libpthread.abilist
@@ -224,3 +224,8 @@ GLIBC_2.16 write F
 GLIBC_2.18 GLIBC_2.18 A
 GLIBC_2.18 pthread_getattr_default_np F
 GLIBC_2.18 pthread_setattr_default_np F
+GLIBC_2.23 GLIBC_2.23 A
+GLIBC_2.23 pthread_mutexattr_getattached_np F
+GLIBC_2.23 pthread_mutexattr_setattached_np F
+GLIBC_2.23 pthread_mutex_attach_np F
+GLIBC_2.23 pthread_mutex_detach_np F
-- 
2.8.0.rc3

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC patch 6/7] futex.2: Document attached mode
  2016-04-02 11:09 [RFC patch 0/7] futex: Add support for attached futexes Thomas Gleixner
                   ` (4 preceding siblings ...)
  2016-04-02 11:09 ` [RFC patch 5/7] perf/bench/futex-hash: Support " Thomas Gleixner
@ 2016-04-02 11:09 ` Thomas Gleixner
  2016-04-02 11:09 ` [RFC patch 7/7] [PATCH] glibc: nptl: Add support for attached pthread_mutexes Thomas Gleixner
  6 siblings, 0 replies; 26+ messages in thread
From: Thomas Gleixner @ 2016-04-02 11:09 UTC (permalink / raw)
  To: LKML
  Cc: Sebastian Andrzej Siewior, Darren Hart, Peter Zijlstra,
	Ingo Molnar, Michael Kerrisk, Davidlohr Bueso, Chris Mason,
	Carlos O'Donell, Torvald Riegel, Eric Dumazet

[-- Attachment #1: futex.2--Document-attached-mode.patch --]
[-- Type: text/plain, Size: 3620 bytes --]

At least an attempt to document the futex attached mode extension.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 man2/futex.2 |   78 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 75 insertions(+), 3 deletions(-)

--- a/man2/futex.2
+++ b/man2/futex.2
@@ -138,8 +138,8 @@ Besides the basic wait and wake-up futex
 futex operations aimed at supporting more complex use cases.
 
 Note that
-no explicit initialization or destruction is necessary to use futexes;
-the kernel maintains a futex
+no explicit initialization or destruction is necessary to use futexes
+except when using futexes in attached mode; the kernel maintains a futex
 (i.e., the kernel-internal implementation artifact)
 only while operations such as
 .BR FUTEX_WAIT ,
@@ -253,6 +253,12 @@ as a relative time,
 measured against the
 .BR CLOCK_MONOTONIC
 clock.
+.TP
+.BR FUTEX_ATTACHED " (since Linux ?.?.?)"
+This option can be employed with all futex operations. It tells the kernel
+that the futex operates in attached mode. See the
+.BR FUTEX_ATTACH
+operation for further explanation.
 .PP
 The operation specified in
 .I futex_op
@@ -767,6 +773,63 @@ operations correspond to
 and
 .BR FUTEX_WAKE_BITSET
 operations where the bit masks are all ones.
+
+.\"
+.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+.\"
+.TP
+.BR FUTEX_ATTACH " (since Linux ?.?.?)"
+This operation attaches the futex on the futex address provided by the
+.IR uaddr
+argument.
+
+This operation tells the kernel that the thread is going to operate on that
+futex and therefore it should create kernel-side state for it. This state is
+used in further futex operations instead of the global hash which is used
+without that optimization.
+
+There are two advantages to using attached futexes. Potential hash collisions
+on the global hash are avoided. The kernel state which replaces the global
+hash for attached futexes is allocated on the node on which the thread runs
+and therefore avoids cross-node memory access.
+
+For this operation and all subsequent operations on the futex, the flag
+.BR FUTEX_ATTACHED
+must be ORed into the
+.IR futex_op .
+
+The
+.IR val ,
+.IR timeout ,
+.IR uaddr2
+and
+.I val3
+arguments are ignored.
+
+.\"
+.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+.\"
+.TP
+.BR FUTEX_DETACH " (since Linux ?.?.?)"
+This operation detaches the futex on the futex address provided by the
+.I uaddr
+argument from the kernel state.
+
+This operation reverses a previous
+.BR FUTEX_ATTACH
+operation.
+
+If the caller is the last user of the kernel-side state, the kernel-side
+state is destroyed.
+
+The
+.IR val ,
+.IR timeout ,
+.IR uaddr2
+and
+.I val3
+arguments are ignored.
+
 .\"
 .\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
 .\"
@@ -1402,6 +1465,10 @@ While requeueing a waiter to the PI fute
 .IR uaddr2 ,
 the kernel detected a deadlock.
 .TP
+.B EEXIST
+.RB ( FUTEX_ATTACH )
Operation was attempted on an already attached futex.
+.TP
 .B EFAULT
 A required pointer argument (i.e.,
 .IR uaddr ,
@@ -1550,13 +1617,18 @@ Invalid argument.
 .BR ENOMEM
 .RB ( FUTEX_LOCK_PI ,
 .BR FUTEX_TRYLOCK_PI ,
-.BR FUTEX_CMP_REQUEUE_PI )
.BR FUTEX_CMP_REQUEUE_PI ,
+.BR FUTEX_ATTACH )
 The kernel could not allocate memory to hold state information.
 .TP
 .B ENFILE
 .RB ( FUTEX_FD )
 The system-wide limit on the total number of open files has been reached.
 .TP
+.B ENOSPC
+.RB ( FUTEX_ATTACH )
+Operation failed because the limit of attached futexes has been reached.
+.TP
 .B ENOSYS
 Invalid operation specified in
 .IR futex_op .

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC patch 4/7] futex: Add support for attached futexes
  2016-04-02 11:09 ` [RFC patch 4/7] futex: Add support for attached futexes Thomas Gleixner
@ 2016-04-02 16:26   ` Peter Zijlstra
  2016-04-02 18:01     ` Thomas Gleixner
  2016-04-02 16:29   ` Peter Zijlstra
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2016-04-02 16:26 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Sebastian Andrzej Siewior, Darren Hart, Ingo Molnar,
	Michael Kerrisk, Davidlohr Bueso, Chris Mason,
	Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Sat, Apr 02, 2016 at 11:09:18AM -0000, Thomas Gleixner wrote:
> +/**
> + * hash_local_futex - Return the hash bucket in the task local cache
> + * @uaddr:	The user space address of the futex
> + * @prime:	The prime number for the modulo operation
> + *
> + * That's a primitive hash function, but it turned out to be the most
> + * efficient one for the task local cache as we don't have anything to
> + * mix in like we have for the global hash.
> + */
> +static inline unsigned int
> +hash_local_futex(void __user *uaddr, unsigned int prime)
> +{
> +	return ((unsigned long) uaddr) % prime;
> +}

> +static unsigned int hash_prime(unsigned int size)
> +{
> +	switch(size) {
> +	case   16:
> +	default:   return 13;
> +	case   32: return 31;
> +	case   64: return 61;
> +	case  128: return 127;
> +	case  256: return 251;
> +	case  512: return 509;
> +	case 1024: return 1021;
> +	case 2048: return 2039;
> +	case 4096: return 4093;
> +	}
> +}

One wonders what's wrong with include/linux/hash.h ? It appears to me
hash_ptr() is pretty much what you want, no?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC patch 4/7] futex: Add support for attached futexes
  2016-04-02 11:09 ` [RFC patch 4/7] futex: Add support for attached futexes Thomas Gleixner
  2016-04-02 16:26   ` Peter Zijlstra
@ 2016-04-02 16:29   ` Peter Zijlstra
  2016-04-03  9:59     ` Thomas Gleixner
  2016-04-02 18:19   ` Andy Lutomirski
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2016-04-02 16:29 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Sebastian Andrzej Siewior, Darren Hart, Ingo Molnar,
	Michael Kerrisk, Davidlohr Bueso, Chris Mason,
	Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Sat, Apr 02, 2016 at 11:09:18AM -0000, Thomas Gleixner wrote:
> +/**
> + * futex_detach_task - Detach task from global state
> + * @slot:	Slot number in the task local cache
> + *
> + * If the global state refcount drops to zero, the global state is destroyed.
> + */
> +static void futex_detach_task(int slot)
> +{
> +	struct futex_cache *tc = current->futex_cache;
> +	struct futex_state *fs = tc->slots[slot].fs;
> +	struct futex_hash_bucket *hb = fs->global_hb;
> +	struct futex_q *q = &fs->q;
> +
> +	/* Remove it from the task local cache */
> +	__clear_bit(slot, tc->cache_map);
> +	tc->slots[slot].uaddr = NULL;
> +	tc->slots[slot].fs = NULL;
> +
> +	/*
> +	 * Lock the global hash bucket. Decrement global state refcount. If 0
> +	 * remove it from the global hash and free it.
> +	 */
> +	spin_lock(&hb->lock);
> +	if (--fs->refcount == 0)
> +		hb_remove_q(q, hb);
> +	else
> +		fs = NULL;
> +	spin_unlock(&hb->lock);

So you could play funny games like:

	if (atomic_add_unless(&fs->recount, -1, 1))
		return;

	spin_lock(&hb->lock);
	if (atomic_dec_return(&fs->refcount) == 0)
		hb_remove_q(q, hb);
	else
		fs = NULL;
	spin_unlock(&hb->lock);

To avoid taking that lock entirely in the 'fast' path, but I'm not sure
how performance critical this path is.

> +	kfree(fs);
> +}

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC patch 7/7] [PATCH] glibc: nptl: Add support for attached pthread_mutexes
  2016-04-02 11:09 ` [RFC patch 7/7] [PATCH] glibc: nptl: Add support for attached pthread_mutexes Thomas Gleixner
@ 2016-04-02 16:30   ` Peter Zijlstra
  2016-04-02 16:32     ` Peter Zijlstra
  2016-04-03 10:08     ` Thomas Gleixner
  0 siblings, 2 replies; 26+ messages in thread
From: Peter Zijlstra @ 2016-04-02 16:30 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Sebastian Andrzej Siewior, Darren Hart, Ingo Molnar,
	Michael Kerrisk, Davidlohr Bueso, Chris Mason,
	Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Sat, Apr 02, 2016 at 11:09:20AM -0000, Thomas Gleixner wrote:
> pthread_mutexes on Linux are based on the futex mechanism. The standard futex
> mechanism in the Linux kernel uses a global hash to store transient
> state. Collisions on that hash can lead to performance degradation and on
> real-time enabled kernels even to priority inversions.
> 
> To guarantee futexes without collisions on the global kernel hash, the kernel
> provides a mechanism to attach to a futex. This creates futex private state
> which avoids hash collisions and on NUMA systems also cross node memory
> access.
> 
> To utilize this mechanism each thread has to attach to the futex before any
> other operations on that futex which involve kernel interaction.
> 
> At pthread_mutex_init() the pthread_mutex attribute needs to be initialized
> for attached mode via:
> 
>     pthread_mutexattr_setattached_np(&attr, 1);
> 
> All threads which are using the mutex - including the one which called
> pthread_mutex_init() - must invoke
> 
>     pthread_mutex_attach_np(&mutex);
> 
> before any other pthread_mutex related operations.
> 
> Example:
>         pthread_mutexattr_t attr;
> 	pthread_mutex_t lock;
> 
>         pthread_mutexattr_init(&attr);
>         pthread_mutexattr_setattached_np(&attr, 1);
>         pthread_mutex_init(&lock, &attr);
> 
>         pthread_mutex_attach_np(&lock);
> 
> 	pthread_mutex_lock(&lock);
> 
> In ptrace this should look like this:
> 
> 	futex(<addr>, 0x280 /* FUTEX_??? */, 1, NULL <unfinished ...>
> 
> 	0x280: 	 FUTEX_ATTACHED | FUTEX_PRIVATE | FUTEX_WAIT
> 
> To undo the attachment each involved thread needs to call
> 
>     pthread_mutex_detach_np(&mutex);
>         
> When the last user detaches the kernel state is destroyed.

So I was fully expecting pthread_mutex_{at,de}tach_np() to not exist and
be internal to pthread_mutex_{init,destroy}().

Is there a reason this is not so?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC patch 7/7] [PATCH] glibc: nptl: Add support for attached pthread_mutexes
  2016-04-02 16:30   ` Peter Zijlstra
@ 2016-04-02 16:32     ` Peter Zijlstra
  2016-04-03 10:08     ` Thomas Gleixner
  1 sibling, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2016-04-02 16:32 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Sebastian Andrzej Siewior, Darren Hart, Ingo Molnar,
	Michael Kerrisk, Davidlohr Bueso, Chris Mason,
	Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Sat, Apr 02, 2016 at 06:30:59PM +0200, Peter Zijlstra wrote:
> On Sat, Apr 02, 2016 at 11:09:20AM -0000, Thomas Gleixner wrote:
> > pthread_mutexes on Linux are based on the futex mechanism. The standard futex
> > mechanism in the Linux kernel uses a global hash to store transient
> > state. Collisions on that hash can lead to performance degradation and on
> > real-time enabled kernels even to priority inversions.
> > 
> > To guarantee futexes without collisions on the global kernel hash, the kernel
> > provides a mechanism to attach to a futex. This creates futex private state
> > which avoids hash collisions and on NUMA systems also cross node memory
> > access.
> > 
> > To utilize this mechanism each thread has to attach to the futex before any
> > other operations on that futex which involve kernel interaction.
> > 
> > At pthread_mutex_init() the pthread_mutex attribute needs to be initialized
> > for attached mode via:
> > 
> >     pthread_mutexattr_setattached_np(&attr, 1);
> > 
> > All threads which are using the mutex - including the one which called
> > pthread_mutex_init() - must invoke
> > 
> >     pthread_mutex_attach_np(&mutex);
> > 
> > before any other pthread_mutex related operations.
> > 
> > Example:
> >         pthread_mutexattr_t attr;
> > 	pthread_mutex_t lock;
> > 
> >         pthread_mutexattr_init(&attr);
> >         pthread_mutexattr_setattached_np(&attr, 1);
> >         pthread_mutex_init(&lock, &attr);
> > 
> >         pthread_mutex_attach_np(&lock);
> > 
> > 	pthread_mutex_lock(&lock);
> > 
> > In ptrace this should look like this:
> > 
> > 	futex(<addr>, 0x280 /* FUTEX_??? */, 1, NULL <unfinished ...>
> > 
> > 	0x280: 	 FUTEX_ATTACHED | FUTEX_PRIVATE | FUTEX_WAIT
> > 
> > To undo the attachment each involved thread needs to call
> > 
> >     pthread_mutex_detach_np(&mutex);
> >         
> > When the last user detaches the kernel state is destroyed.
> 
> So I was fully expecting pthread_mutex_{at,de}tach_np() to not exist and
> be internal to pthread_mutex_{init,destroy}().
> 
> Is there a reason this is not so?

Ah yes there is; it just isn't spelled out and I'm tired.

This is because every thread that wants to play with the mutex needs to
attach, and only one can do init.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC patch 4/7] futex: Add support for attached futexes
  2016-04-02 16:26   ` Peter Zijlstra
@ 2016-04-02 18:01     ` Thomas Gleixner
  0 siblings, 0 replies; 26+ messages in thread
From: Thomas Gleixner @ 2016-04-02 18:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Sebastian Andrzej Siewior, Darren Hart, Ingo Molnar,
	Michael Kerrisk, Davidlohr Bueso, Chris Mason,
	Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Sat, 2 Apr 2016, Peter Zijlstra wrote:
> On Sat, Apr 02, 2016 at 11:09:18AM -0000, Thomas Gleixner wrote:
> > +/**
> > + * hash_local_futex - Return the hash bucket in the task local cache
> > + * @uaddr:	The user space address of the futex
> > + * @prime:	The prime number for the modulo operation
> > + *
> > + * That's a primitive hash function, but it turned out to be the most
> > + * efficient one for the task local cache as we don't have anything to
> > + * mix in like we have for the global hash.
> > + */
> > +static inline unsigned int
> > +hash_local_futex(void __user *uaddr, unsigned int prime)
> > +{
> > +	return ((unsigned long) uaddr) % prime;
> > +}
> 
> > +static unsigned int hash_prime(unsigned int size)
> > +{
> > +	switch(size) {
> > +	case   16:
> > +	default:   return 13;
> > +	case   32: return 31;
> > +	case   64: return 61;
> > +	case  128: return 127;
> > +	case  256: return 251;
> > +	case  512: return 509;
> > +	case 1024: return 1021;
> > +	case 2048: return 2039;
> > +	case 4096: return 4093;
> > +	}
> > +}
> 
> One wonders what's wrong with include/linux/hash.h ? It appears to me
> hash_ptr() is pretty much what you want, no?

Nothing is wrong with that. Just my futex fried brain not being able to see
the obvious:)

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC patch 4/7] futex: Add support for attached futexes
  2016-04-02 11:09 ` [RFC patch 4/7] futex: Add support for attached futexes Thomas Gleixner
  2016-04-02 16:26   ` Peter Zijlstra
  2016-04-02 16:29   ` Peter Zijlstra
@ 2016-04-02 18:19   ` Andy Lutomirski
  2016-04-03  9:57     ` Thomas Gleixner
  2016-04-02 23:48   ` Rasmus Villemoes
  2016-04-03 11:16   ` Ingo Molnar
  4 siblings, 1 reply; 26+ messages in thread
From: Andy Lutomirski @ 2016-04-02 18:19 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Sebastian Andrzej Siewior, Darren Hart, Peter Zijlstra,
	Ingo Molnar, Michael Kerrisk, Davidlohr Bueso, Chris Mason,
	Carlos O'Donell, Torvald Riegel, Eric Dumazet

On 04/02/2016 04:09 AM, Thomas Gleixner wrote:
[omitted due to some Thunderbird bug, sigh]

What happens if you mix attached and non-attached ops on the same futex?

--Andy

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC patch 4/7] futex: Add support for attached futexes
  2016-04-02 11:09 ` [RFC patch 4/7] futex: Add support for attached futexes Thomas Gleixner
                     ` (2 preceding siblings ...)
  2016-04-02 18:19   ` Andy Lutomirski
@ 2016-04-02 23:48   ` Rasmus Villemoes
  2016-04-03 10:05     ` Thomas Gleixner
  2016-04-03 11:16   ` Ingo Molnar
  4 siblings, 1 reply; 26+ messages in thread
From: Rasmus Villemoes @ 2016-04-02 23:48 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Sebastian Andrzej Siewior, Darren Hart, Peter Zijlstra,
	Ingo Molnar, Michael Kerrisk, Davidlohr Bueso, Chris Mason,
	Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Sat, Apr 02 2016, Thomas Gleixner <tglx@linutronix.de> wrote:

> The standard futex mechanism in the Linux kernel uses a global hash to store
> transient state. Collisions on that hash can lead to performance degradation
> and on real-time enabled kernels even to priority inversions.
>
> To guarantee futexes without collisions on the global kernel hash, we provide
> a mechanism to attach to a futex. This creates futex private state which
> avoids hash collisions and on NUMA systems also cross node memory access.

Hi,

A few minor comments inline below, and a question about the design:

How is an application supposed to handle it when the kernel fails to
achieve the no collision-goal?  With any reasonable upper bound on the
size of the local hash table (which of course has to be there, whether
sysctl'ed or not), and regardless of the hashing scheme used, it
seems inevitable that someone is going to get -ENOSPC when trying to
attach. Moreover, since different threads can attach to different sets
of futexes, one thread may successfully attach to a futex, while another
fails - the second thread is then permanently prevented from operating
on that futex (?).

Why not use some sort of tree instead? Or fall back to a traditional
chained hash table once we're no longer allowed to increase the table
size? Of course these have worse lookup performance, and maybe failing
the attach in the rare case is better than penalizing the common case,
but it would be nice to have some mention of this in the change log.

Alternatively [this is not really thought through], maybe one could move
the decision and the complexity to userspace: On successful FUTEX_ATTACH,
return an index into a small per-task array of struct futex_state*. On
subsequent FUTEX_ATTACHED operations on that futex, userspace passes in
this index somehow (either instead of uaddr, in which case the kernel
array would have to include this in addition to the futex_state pointer,
or by making uaddr actually point to a struct { int *futex_addr; int
attach_idx; }, or...) Then each thread would have to maintain a (futex
address => index) mapping, but that's more or less what the kernel
otherwise has to do.

> +
> +static unsigned int hash_prime(unsigned int size)
> +{
> +	switch(size) {
> +	case   16:
> +	default:   return 13;
> +	case   32: return 31;
> +	case   64: return 61;
> +	case  128: return 127;
> +	case  256: return 251;
> +	case  512: return 509;
> +	case 1024: return 1021;
> +	case 2048: return 2039;
> +	case 4096: return 4093;
> +	}
> +}
> +

There should probably be some mention of TASK_CACHE_{BASE,MAX}_SIZE here
so that anyone updating those would also look here, so we don't end up
using 13 out of 8192 slots...  BUILD_BUG_ON(TASK_CACHE_{BASE,MAX}_SIZE
!= {16,4096}) should do it. 

> +static struct futex_cache *futex_alloc_cache(int cache_size)
> +{
> +	struct futex_cache *tc;
> +	size_t size;
> +
> +	/* Allocate a new task cache */
> +	size = sizeof(*tc) + cache_size * sizeof(struct futex_cache_slot);
> +	tc = kzalloc_node(size, GFP_KERNEL, numa_node_id());
> +	if (tc) {
> +		tc->hash_prime = hash_prime(cache_size);
> +		tc->cache_size = cache_size;
> +	}
> +	return tc;
> +}
> +
> +static int
> +futex_rehash_task_cache(struct futex_cache *tc, struct futex_cache *tcnew)
> +{
> +	unsigned long *newmap = tcnew->cache_map;
> +	unsigned int prime = tcnew->hash_prime;
> +	unsigned long *map = tc->cache_map;
> +	unsigned int size = tc->cache_size;
> +	unsigned int slot, newslot;
> +
> +	slot = find_first_bit(map, size);
> +	for (; slot < size; slot = find_next_bit(map, size, slot + 1)) {
> +		newslot = hash_local_futex(tc->slots[slot].uaddr, prime);
> +		/*
> +		 * Paranoia. Rehashing to a larger cache should not result in
> +		 * collisions which did not exist in the small one.
> +		 */

This doesn't sound right when you're doing mod prime hashing; two
numbers can easily have the same remainder mod 31 while having distinct
remainders mod 13.


> +		if (__test_and_set_bit(newslot, newmap))
> +			return -ENOSPC;
> +		/* Copy uaddr and futex state pointer */
> +		tcnew->slots[newslot] = tc->slots[slot];
> +	}
> +	return 0;
> +}
> +
> +/**
> + * futex_get_task_cache_slot - Get a slot in the tasks local cache
> + *
> + * If the cache is not yet available it's allocated. If the existing cache is
> + * too small the cache is extended.
> + *
> + * Returns a valid slot or an error code
> + */
> +static int futex_get_task_cache_slot(u32 __user *uaddr)
> +{
> +	struct futex_cache *tcnew, *tc = current->futex_cache;
> +	unsigned int cache_size;
> +	int slot;
> +
> +	/* First caller allocates the initial cache */
> +	if (!tc) {
> +		tc = futex_alloc_cache(TASK_CACHE_BASE_SIZE);
> +		if (!tc)
> +			return -ENOMEM;
> +		current->futex_cache = tc;
> +		slot = hash_local_futex(uaddr, tc->hash_prime);
> +		return slot;
> +	}
> +
> +	slot = hash_local_futex(uaddr, tc->hash_prime);
> +
> +	/* Check whether the slot is populated already */
> +	if (!test_bit(slot, tc->cache_map))
> +		return slot;
> +
> +	/* Was this futex attached already ? */
> +	if (tc->slots[slot].uaddr == uaddr)
> +		return -EEXIST;
> +
> +	cache_size = tc->cache_size;
> +retry:
> +	/* Task has reached max cache size? */
> +	if (cache_size >= TASK_CACHE_MAX_SIZE)
> +		return -ENOSPC;
> +
> +	cache_size *= 2;
> +	tcnew = futex_alloc_cache(cache_size);
> +	if (!tcnew)
> +		return -ENOMEM;
> +
> +	/*
> +	 * If the rehashing fails or the slot for uaddr is busy after
> +	 * rehashing, try with a larger cache.
> +	 */
> +	slot = hash_local_futex(uaddr, tcnew->hash_prime);
> +
> +	if (futex_rehash_task_cache(tc, tcnew) ||
> +	    test_bit(slot, tcnew->cache_map)) {
> +		kfree(tcnew);
> +		goto retry;
> +	}
> +
> +	/* Populate the new cache and return the slot number */
> +	current->futex_cache = tcnew;
> +	return slot;
> +}

I may be misreading it, but this seems to leak the old ->futex_cache?


Rasmus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC patch 4/7] futex: Add support for attached futexes
  2016-04-02 18:19   ` Andy Lutomirski
@ 2016-04-03  9:57     ` Thomas Gleixner
  2016-04-03 13:18       ` Andy Lutomirski
  0 siblings, 1 reply; 26+ messages in thread
From: Thomas Gleixner @ 2016-04-03  9:57 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, Sebastian Andrzej Siewior, Darren Hart, Peter Zijlstra,
	Ingo Molnar, Michael Kerrisk, Davidlohr Bueso, Chris Mason,
	Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Sat, 2 Apr 2016, Andy Lutomirski wrote:

> On 04/02/2016 04:09 AM, Thomas Gleixner wrote:
> [omitted due to some Thunderbird bug, sigh]
> 
> What happens if you mix attached and non-attached ops on the same futex?

Not much. You might get an error code, sleep forever or the call will just
result in a NOP wasting cpu cycles. That's the same when you mix
shared/private operations on the same futex.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC patch 4/7] futex: Add support for attached futexes
  2016-04-02 16:29   ` Peter Zijlstra
@ 2016-04-03  9:59     ` Thomas Gleixner
  0 siblings, 0 replies; 26+ messages in thread
From: Thomas Gleixner @ 2016-04-03  9:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Sebastian Andrzej Siewior, Darren Hart, Ingo Molnar,
	Michael Kerrisk, Davidlohr Bueso, Chris Mason,
	Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Sat, 2 Apr 2016, Peter Zijlstra wrote:
> On Sat, Apr 02, 2016 at 11:09:18AM -0000, Thomas Gleixner wrote:
> > +	/*
> > +	 * Lock the global hash bucket. Decrement global state refcount. If 0
> > +	 * remove it from the global hash and free it.
> > +	 */
> > +	spin_lock(&hb->lock);
> > +	if (--fs->refcount == 0)
> > +		hb_remove_q(q, hb);
> > +	else
> > +		fs = NULL;
> > +	spin_unlock(&hb->lock);
> 
> So you could play funny games like:
> 
> 	if (atomic_add_unless(&fs->recount, -1, 1))
> 		return;
> 
> 	spin_lock(&hb->lock);
> 	if (atomic_dec_return(&fs->refcount) == 0)
> 		hb_remove_q(q, hb);
> 	else
> 		fs = NULL;
> 	spin_unlock(&hb->lock);
> 
> To avoid taking that lock entirely in the 'fast' path, but I'm not sure
> how performance critical this path is.

Attach/detach is not really critical. The futex ops are critical and they do
not touch the global hash bucket lock at all.

Thanks,

	tglx


* Re: [RFC patch 4/7] futex: Add support for attached futexes
  2016-04-02 23:48   ` Rasmus Villemoes
@ 2016-04-03 10:05     ` Thomas Gleixner
  0 siblings, 0 replies; 26+ messages in thread
From: Thomas Gleixner @ 2016-04-03 10:05 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: LKML, Sebastian Andrzej Siewior, Darren Hart, Peter Zijlstra,
	Ingo Molnar, Michael Kerrisk, Davidlohr Bueso, Chris Mason,
	Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Sun, 3 Apr 2016, Rasmus Villemoes wrote:
> On Sat, Apr 02 2016, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> > The standard futex mechanism in the Linux kernel uses a global hash to store
> > transient state. Collisions on that hash can lead to performance degradation
> > and on real-time enabled kernels even to priority inversions.
> >
> > To guarantee futexes without collisions on the global kernel hash, we provide
> > a mechanism to attach to a futex. This creates futex private state which
> > avoids hash collisions and on NUMA systems also cross node memory access.
> 
> Hi,
> 
> A few minor comments inline below, and a question about the design:
> 
> How is an application supposed to handle it when the kernel fails to
> achieve the no collision-goal?  With any reasonable upper bound on the
> size of the local hash table (which of course has to be there, whether
> sysctl'ed or not), and regardless of the hashing scheme used, it
> seems inevitable that someone is going to get -ENOSPC when trying to
> attach. Moreover, since different threads can attach to different sets
> of futexes, one thread may successfully attach to a futex, while another
> fails - the second thread is then permanently prevented from operating
> on that futex (?).

Yes. There is not much we can do about that except adding it to the
documentation.
 
> Why not use some sort of tree instead? Or fall back to a traditional
> chained hash table once we're no longer allowed to increase the table
> size? Of course these have worse lookup performance, and maybe failing
> the attach in the rare case is better than penalizing the common case,
> but it would be nice to have some mention of this in the change log.

The lookup performance is critical for the futex ops (wait/wake ....)
 
> Alternatively [this is not really thought through], maybe one could move
> the decision and the complexity to userspace: On successful FUTEX_ATTACH,
> return an index into a small per-task array of struct futex_state*. On
> subsequent FUTEX_ATTACHED operations on that futex, userspace passes in
> this index somehow (either instead of uaddr, in which case the kernel
> array would have to include this in addition to the futex_state pointer,
> or by making uaddr actually point to a struct { int *futex_addr; int
> attach_idx; }, or...) Then each thread would have to maintain a (futex
> address => index) mapping, but that's more or less what the kernel
> otherwise has to do.

Right. We tried this as a first attempt. It moves the complexity of hashing
futex addresses to per-thread indices into user space. It can be done simply
enough, though the resulting 
 
> > +	case 4096: return 4093;
> > +	}
> > +}
> > +
> 
> There should probably be some mention of TASK_CACHE_{BASE,MAX}_SIZE here
> so that anyone updating those would also look here, so we don't end up
> using 13 out of 8192 slots...  BUILD_BUG_ON(TASK_CACHE_{BASE,MAX}_SIZE
> != {16,4096}) should do it. 

As Peter pointed out there is hash_ptr() already so this will go away.
 
> > +	slot = find_first_bit(map, size);
> > +	for (; slot < size; slot = find_next_bit(map, size, slot + 1)) {
> > +		newslot = hash_local_futex(tc->slots[slot].uaddr, prime);
> > +		/*
> > +		 * Paranoia. Rehashing to a larger cache should not result in
> > +		 * collisions which did not exist in the small one.
> > +		 */
> 
> This doesn't sound right when you're doing mod prime hashing; two
> numbers can easily have the same remainder mod 31 while having distinct
> remainders mod 13.

I know, but as far as my extensive testing went I never hit that case.
 
> > +
> > +	/* Populate the new cache and return the slot number */
> > +	current->futex_cache = tcnew;
> > +	return slot;
> > +}
> 
> I may be misreading it, but this seems to leak the old ->futex_cache?

Duh, you are right.

Thanks,

	tglx


* Re: [RFC patch 7/7] [PATCH] glibc: nptl: Add support for attached pthread_mutexes
  2016-04-02 16:30   ` Peter Zijlstra
  2016-04-02 16:32     ` Peter Zijlstra
@ 2016-04-03 10:08     ` Thomas Gleixner
  1 sibling, 0 replies; 26+ messages in thread
From: Thomas Gleixner @ 2016-04-03 10:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Sebastian Andrzej Siewior, Darren Hart, Ingo Molnar,
	Michael Kerrisk, Davidlohr Bueso, Chris Mason,
	Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Sat, 2 Apr 2016, Peter Zijlstra wrote:
> On Sat, Apr 02, 2016 at 11:09:20AM -0000, Thomas Gleixner wrote:
> > To undo the attachment each involved thread needs to call
> > 
> >     pthread_mutex_detach_np(&mutex);
> >         
> > When the last user detaches the kernel state is destroyed.
> 
> So I was fully expecting pthread_mutex_{at,de}tach_np() to not exist and
> be internal to pthread_mutex_{init,destroy}().
> 
> Is there a reason this is not so?

init/destroy are only called once and not on all threads using the futex. If
you want to hide that, then you need to add it to all pthread_mutex_*
operations and attach on the first call.

That's possible, but for simplicity we made it explicit.

Thanks,

	tglx


* Re: [RFC patch 4/7] futex: Add support for attached futexes
  2016-04-02 11:09 ` [RFC patch 4/7] futex: Add support for attached futexes Thomas Gleixner
                     ` (3 preceding siblings ...)
  2016-04-02 23:48   ` Rasmus Villemoes
@ 2016-04-03 11:16   ` Ingo Molnar
  2016-04-03 11:30     ` Linus Torvalds
  4 siblings, 1 reply; 26+ messages in thread
From: Ingo Molnar @ 2016-04-03 11:16 UTC (permalink / raw)
  To: Thomas Gleixner, Linus Torvalds, Andrew Morton, Peter Zijlstra
  Cc: LKML, Sebastian Andrzej Siewior, Darren Hart, Peter Zijlstra,
	Michael Kerrisk, Davidlohr Bueso, Chris Mason,
	Carlos O'Donell, Torvald Riegel, Eric Dumazet


* Thomas Gleixner <tglx@linutronix.de> wrote:

> The standard futex mechanism in the Linux kernel uses a global hash to store
> transient state. Collisions on that hash can lead to performance degradation
> and on real-time enabled kernels even to priority inversions.
> 
> To guarantee futexes without collisions on the global kernel hash, we provide
> a mechanism to attach to a futex. This creates futex private state which
> avoids hash collisions and on NUMA systems also cross node memory access.
> 
> To utilize this mechanism each thread has to attach to the futex before any
> other operations on that futex.
> 
> The inner workings are as follows:
> 
> Attach:
> 
>     sys_futex(FUTEX_ATTACH | FUTEX_ATTACHED, uaddr, ....);
> 
>     If this is the first attach to uaddr then a 'global state' object is
>     created. This global state contains a futex hash bucket and a futex_q
>     object which is enqueued into the global hash for reference so subsequent
>     attachers can find it. Each attacher takes a reference count on the
>     'global state' object and hashes 'uaddr' into a thread local hash. This
>     thread local hash is lock free and dynamically expanded to avoid
>     collisions. Each populated entry in the thread local hash stores 'uaddr'
>     and a pointer to the 'global state' object.
> 
> Futex ops:
> 
>     sys_futex(FUTEX_XXX | FUTEX_ATTACHED, uaddr, ....);
> 
>     If the attached flag is set, then 'uaddr' is hashed and the thread local
>     hash is checked whether the hash entry contains 'uaddr'. If no, an error
>     code is returned. If yes, the hash slot number is stored in the futex key
>     which is used for further operations on the futex. When the hash bucket is
>     looked up then attached futexes will use the slot number to retrieve the
>     pointer to the 'global state' object and use the embedded hash bucket for
>     the operation. Non-attached futexes just use the global hash as before.
> 
> Detach:
> 
>     sys_futex(FUTEX_DETACH | FUTEX_ATTACHED, uaddr, ....);
>    
>     Detach removes the entry in the thread local hash and decrements the
>     refcount on the 'global state' object. Once the refcount drops to zero the
>     'global state' object is removed from the global hash and destroyed.
> 
>     Thread exit cleans up the thread local hash and the 'global state' objects
>     as we do for other futex related storage already.
> 
> The thread local hash and the 'global state' object are allocated on the node
> on which the attaching thread runs.
> 
> Attached mode works with all futex operations and with both private and shared
> futexes. For operations which involve two futexes, i.e. FUTEX_REQUEUE_* both
> futexes have to be either attached or detached (like FUTEX_PRIVATE).
> 
> Why not auto attaching?
> 
>     Auto attaching has the following problems:
> 
>      - Memory consumption
>      - Life time issues
>      - Performance issues due to the necessary allocations

But those are mostly setup only costs, right?

So I don't think this conclusion is necessarily true, even on smaller systems:

>     So, no. It must be opt-in and reserved for explicit isolation purposes.
> 
> A modified version of 'perf bench futex hash' shows the following results:

and look at the very measurable performance advantages on a small NUMA system:

  Before:

  >  Averaged 1451441 operations/sec (+- 3.65%), total secs = 60

  After:

  >  Averaged 1709712 operations/sec (+- 4.67%), total secs = 60

  > That's a performance increase of 18%.

... and I suspect that on a larger NUMA system the speedup is probably a lot more 
pronounced.

Also, the thing is, allocation/deallocation costs are a second order concern IMHO, 
because most of the futex's usage is the lock/unlock operations.

So my prediction: in real life large systems will want to have collision-free 
futexes most of the time, and they don't want to modify every futex using 
application or library. So this is a mostly kernel side system sizing 
question/decision, not really a user-side system purpose policy question.

So an ABI distinction and offloading the decision to every single application that 
wants to use it and hardcode it into actual application source code via an ABI is 
pretty much the _WORST_ way to go about it IMHO...

So how about this: don't add any ABI details, but make futexes auto-attached on 
NUMA systems (and obviously PREEMPT_RT systems)?

I.e. make it a build time or boot time decision at most, don't start a messy 
'should we use attached futexes or not' decision on the ABI side, which we know 
from Linux ABI history won't be answered and utilized very well by applications!

Thanks,

	Ingo


* Re: [RFC patch 4/7] futex: Add support for attached futexes
  2016-04-03 11:16   ` Ingo Molnar
@ 2016-04-03 11:30     ` Linus Torvalds
  2016-04-05  7:44       ` Torvald Riegel
  2016-04-05 15:58       ` Carlos O'Donell
  0 siblings, 2 replies; 26+ messages in thread
From: Linus Torvalds @ 2016-04-03 11:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, Andrew Morton, Peter Zijlstra, LKML,
	Sebastian Andrzej Siewior, Darren Hart, Peter Zijlstra,
	Michael Kerrisk, Davidlohr Bueso, Chris Mason,
	Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Sun, Apr 3, 2016 at 6:16 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> So an ABI distinction and offloading the decision to every single application that
> wants to use it and hardcode it into actual application source code via an ABI is
> pretty much the _WORST_ way to go about it IMHO...
>
> So how about this: don't add any ABI details, but make futexes auto-attached on
> NUMA systems (and obviously PREEMPT_RT systems)?

I agree.

Do *not* make this a visible new ABI.

You will find that people will make exactly the wrong choices - either
not using it (because the futex is deep in a standard library!) when
they want to, or using it when they shouldn't (because the futex is
deep in a standard library, and the library writer knows *his* code is
so important that it should get a special faster futex).

So I absolutely detest this approach. It's the wrong way to go about
things. User space does *not* know whether they want to use this or
not, and they *will* be wrong.

So automatically using a local hashtable (for private mutexes - I
think people need to just accept that a shared mutex is more costly)
according to some heuristic is definitely the way to go. And yes, the
heuristic may be well be - at least to start - "this is a preempt-RT
system" (for people who clearly care about having predictable
latencies) or "this is actually a multi-node NUMA system, and I have
heaps of memory".

Then, add a tunable (for root, not per-futex) to allow people to tweak it.

Because the *last* thing you want is programmers saying "I'm so
important that I want the special futex". Because every single
programmer thinks they are special and that _their_ code is special. I
know - because I'm special.

                   Linus


* Re: [RFC patch 4/7] futex: Add support for attached futexes
  2016-04-03  9:57     ` Thomas Gleixner
@ 2016-04-03 13:18       ` Andy Lutomirski
  2016-04-03 15:56         ` Thomas Gleixner
  0 siblings, 1 reply; 26+ messages in thread
From: Andy Lutomirski @ 2016-04-03 13:18 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, LKML, Sebastian Andrzej Siewior, Darren Hart,
	Peter Zijlstra, Ingo Molnar, Michael Kerrisk, Davidlohr Bueso,
	Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Sun, Apr 3, 2016 at 2:57 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Sat, 2 Apr 2016, Andy Lutomirski wrote:
>
>> On 04/02/2016 04:09 AM, Thomas Gleixner wrote:
>> [omitted due to some Thunderbird bug, sigh]
>>
>> What happens if you mix attached and non-attached ops on the same futex?
>
> Not much. You might get an error code, sleep forever or the call will just
> result in a NOP wasting cpu cycles. That's the same when you mix
> shared/private operations on the same futex.

What's the workflow?

Can the creation of an attached futex fail due to memory allocation
problems or any other reason?  If so, how does a library make sure it
falls back to a normal futex safely?

Why can't private futexes be attached by default?

--Andy


* Re: [RFC patch 4/7] futex: Add support for attached futexes
  2016-04-03 13:18       ` Andy Lutomirski
@ 2016-04-03 15:56         ` Thomas Gleixner
  2016-04-03 16:11           ` Andy Lutomirski
  0 siblings, 1 reply; 26+ messages in thread
From: Thomas Gleixner @ 2016-04-03 15:56 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, LKML, Sebastian Andrzej Siewior, Darren Hart,
	Peter Zijlstra, Ingo Molnar, Michael Kerrisk, Davidlohr Bueso,
	Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Sun, 3 Apr 2016, Andy Lutomirski wrote:

> On Sun, Apr 3, 2016 at 2:57 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Sat, 2 Apr 2016, Andy Lutomirski wrote:
> >
> >> On 04/02/2016 04:09 AM, Thomas Gleixner wrote:
> >> [omitted due to some Thunderbird bug, sigh]
> >>
> >> What happens if you mix attached and non-attached ops on the same futex?
> >
> > Not much. You might get an error code, sleep forever or the call will just
> > result in a NOP wasting cpu cycles. That's the same when you mix
> > shared/private operations on the same futex.
> 
> What's the workflow?
> 
> Can the creation of an attached futex fail due to memory allocation
> problems or any other reason?  If so, how does a library make sure it
> falls back to a normal futex safely?

Well, other operations on futexes can fail as well and the library or the
usage site has to take care of it. It's not that different.

> Why can't private futexes be attached by default?

We _can_ attach any futex - private or not - by default.

Thanks,

	tglx


* Re: [RFC patch 4/7] futex: Add support for attached futexes
  2016-04-03 15:56         ` Thomas Gleixner
@ 2016-04-03 16:11           ` Andy Lutomirski
  0 siblings, 0 replies; 26+ messages in thread
From: Andy Lutomirski @ 2016-04-03 16:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, LKML, Sebastian Andrzej Siewior, Darren Hart,
	Peter Zijlstra, Ingo Molnar, Michael Kerrisk, Davidlohr Bueso,
	Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Sun, Apr 3, 2016 at 8:56 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Sun, 3 Apr 2016, Andy Lutomirski wrote:
>
>> On Sun, Apr 3, 2016 at 2:57 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> > On Sat, 2 Apr 2016, Andy Lutomirski wrote:
>> >
>> >> On 04/02/2016 04:09 AM, Thomas Gleixner wrote:
>> >> [omitted due to some Thunderbird bug, sigh]
>> >>
>> >> What happens if you mix attached and non-attached ops on the same futex?
>> >
>> > Not much. You might get an error code, sleep forever or the call will just
>> > result in a NOP wasting cpu cycles. That's the same when you mix
>> > shared/private operations on the same futex.
>>
>> What's the workflow?
>>
>> Can the creation of an attached futex fail due to memory allocation
>> problems or any other reason?  If so, how does a library make sure it
>> falls back to a normal futex safely?
>
> Well, other operations on futexes can fail as well and the library or the
> usage site has to take care of it. It's not that different.
>
>> Why can't private futexes be attached by default?
>
> We _can_ attach any futex - private or not - by default.

Then I feel like I'm missing something.  Why does this need an API
change?  Why can't the kernel just attach all futexes?

The reason I'm asking about failures is this:

int some_futex;

Thread 1:
   attach some_futex;
   ...
   wait for some_futex;

Thread 2:
   attach some_futex;
   ...
   wake some_futex;

If one of the attach operations fails and the other succeeds, then either
the thread that failed to attach can't operate on the futex or, if it
falls back to a normal futex, the wake won't wake the waiter, right?
This seems fragile.

--Andy


* Re: [RFC patch 4/7] futex: Add support for attached futexes
  2016-04-03 11:30     ` Linus Torvalds
@ 2016-04-05  7:44       ` Torvald Riegel
  2016-04-05 15:58       ` Carlos O'Donell
  1 sibling, 0 replies; 26+ messages in thread
From: Torvald Riegel @ 2016-04-05  7:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Thomas Gleixner, Andrew Morton, Peter Zijlstra,
	LKML, Sebastian Andrzej Siewior, Darren Hart, Peter Zijlstra,
	Michael Kerrisk, Davidlohr Bueso, Chris Mason,
	Carlos O'Donell, Eric Dumazet

On Sun, 2016-04-03 at 06:30 -0500, Linus Torvalds wrote:
> On Sun, Apr 3, 2016 at 6:16 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > So an ABI distinction and offloading the decision to every single application that
> > wants to use it and hardcode it into actual application source code via an ABI is
> > pretty much the _WORST_ way to go about it IMHO...
> >
> > So how about this: don't add any ABI details, but make futexes auto-attached on
> > NUMA systems (and obviously PREEMPT_RT systems)?
> 
> I agree.
> 
> Do *not* make this a visible new ABI.

From a glibc perspective, I agree that this shouldn't require an
extension of the ABI unless it's really the only possible way to solve
this.

For "special" mutex kinds such as PI mutexes, the change in the
interface might be justifiable -- but for ordinary mutexes, there's no
good place to add the attach/detach calls in each thread: An
implementation of, say, C11 mutexes cannot easily estimate whether it
should use attached futexes, and it would have to track whether a
particular mutex has been attached to by the current thread; this might
just move the overhead of tracking and caching associations from the
kernel to userspace.

> You will find that people will make exactly the wrong choices - either
> not using it (because the futex is deep in a standard library!) when
> they want to, or using it when they shouldn't (because the futex is
> deep in a standard library, and the library writer knows *his* code is
> so important that it should get a special faster futex).

There are cases where userspace might know that it should use a
"special" futex.  Consider an MCS-lock implementation in glibc by which
all pthreads, C, and C++ mutexes are backed: the lock nodes that threads
would be spinning on would be per-thread and exist for the whole
lifetime of the thread; attach and detach would be simple to do, and
there would be a limited number of these in the system.
A Java VM's park/unpark implementation might be another candidate.
However, these use cases are pretty specific (eg, there's only a single
waiter), so any kernel support for these might be more special too.


* Re: [RFC patch 4/7] futex: Add support for attached futexes
  2016-04-03 11:30     ` Linus Torvalds
  2016-04-05  7:44       ` Torvald Riegel
@ 2016-04-05 15:58       ` Carlos O'Donell
  1 sibling, 0 replies; 26+ messages in thread
From: Carlos O'Donell @ 2016-04-05 15:58 UTC (permalink / raw)
  To: Linus Torvalds, Ingo Molnar
  Cc: Thomas Gleixner, Andrew Morton, Peter Zijlstra, LKML,
	Sebastian Andrzej Siewior, Darren Hart, Peter Zijlstra,
	Michael Kerrisk, Davidlohr Bueso, Chris Mason, Torvald Riegel,
	Eric Dumazet

On 04/03/2016 07:30 AM, Linus Torvalds wrote:
> On Sun, Apr 3, 2016 at 6:16 AM, Ingo Molnar <mingo@kernel.org> wrote:
>>
>> So an ABI distinction and offloading the decision to every single application that
>> wants to use it and hardcode it into actual application source code via an ABI is
>> pretty much the _WORST_ way to go about it IMHO...
>>
>> So how about this: don't add any ABI details, but make futexes auto-attached on
>> NUMA systems (and obviously PREEMPT_RT systems)?
> 
> I agree.
> 
> Do *not* make this a visible new ABI.

Agreed.

We had similar requests in glibc to add APIs to tweak the parameters of
the elision for locks backed by hardware transactional memory.

The person submitting the patches always thinks this is a great API
because it allows them to write tests to verify their own work (which
means it still might be useful for internal testing or developing
auto-tuning).

Users have no clue what to do with the API and worse the state space of
the parameters is immense. You can't possibly do any kind of sensible
optimization without knowing a lot about the hardware.

So no public API was ever added in glibc for pthread_mutex_lock elision
parameters. Either the parameters work by default or you have to post
patches to change the auto-tuning used internally in glibc.

> So automatically using a local hashtable (for private mutexes - I
> think people need to just accept that a shared mutex is more costly)
> according to some heuristic is definitely the way to go. And yes, the
> heuristic may be well be - at least to start - "this is a preempt-RT
> system" (for people who clearly care about having predictable
> latencies) or "this is actually a multi-node NUMA system, and I have
> heaps of memory".

Agreed.
 
> Then, add a tunable (for root, not per-futex) to allow people to tweak it.

Agreed.

-- 
Cheers,
Carlos.



Thread overview: 26+ messages
2016-04-02 11:09 [RFC patch 0/7] futex: Add support for attached futexes Thomas Gleixner
2016-04-02 11:09 ` [RFC patch 1/7] futex: Provide helpers for hash bucket add/remove Thomas Gleixner
2016-04-02 11:09 ` [RFC patch 2/7] futex: Add some more function commentry Thomas Gleixner
2016-04-02 11:09 ` [RFC patch 3/7] futex: Make key init a helper function Thomas Gleixner
2016-04-02 11:09 ` [RFC patch 4/7] futex: Add support for attached futexes Thomas Gleixner
2016-04-02 16:26   ` Peter Zijlstra
2016-04-02 18:01     ` Thomas Gleixner
2016-04-02 16:29   ` Peter Zijlstra
2016-04-03  9:59     ` Thomas Gleixner
2016-04-02 18:19   ` Andy Lutomirski
2016-04-03  9:57     ` Thomas Gleixner
2016-04-03 13:18       ` Andy Lutomirski
2016-04-03 15:56         ` Thomas Gleixner
2016-04-03 16:11           ` Andy Lutomirski
2016-04-02 23:48   ` Rasmus Villemoes
2016-04-03 10:05     ` Thomas Gleixner
2016-04-03 11:16   ` Ingo Molnar
2016-04-03 11:30     ` Linus Torvalds
2016-04-05  7:44       ` Torvald Riegel
2016-04-05 15:58       ` Carlos O'Donell
2016-04-02 11:09 ` [RFC patch 5/7] perf/bench/futex-hash: Support " Thomas Gleixner
2016-04-02 11:09 ` [RFC patch 6/7] futex.2: Document attached mode Thomas Gleixner
2016-04-02 11:09 ` [RFC patch 7/7] [PATCH] glibc: nptl: Add support for attached pthread_mutexes Thomas Gleixner
2016-04-02 16:30   ` Peter Zijlstra
2016-04-02 16:32     ` Peter Zijlstra
2016-04-03 10:08     ` Thomas Gleixner
