linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [patch V2 0/7] futex: Add support for process private hashing
@ 2016-05-05 20:44 Thomas Gleixner
  2016-05-05 20:44 ` [patch V2 1/7] futex: Add some more function commentry Thomas Gleixner
                   ` (6 more replies)
  0 siblings, 7 replies; 42+ messages in thread
From: Thomas Gleixner @ 2016-05-05 20:44 UTC (permalink / raw)
  To: LKML
  Cc: Sebastian Andrzej Siewior, Linus Torvalds, Darren Hart,
	Peter Zijlstra, Ingo Molnar, Michael Kerrisk, Davidlohr Bueso,
	Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet

The standard futex mechanism in the Linux kernel uses a global hash to store
transient state. Collisions on that hash can lead to performance degradation
and on real-time enabled kernels to unbound priority inversions.

This new attempt to solve the issue does not require user space changes and
operates transparently. On the first futex operation of a process the kernel
allocates a hash private to the process. All process private futexes are
hashed in this hash. Process shared futexes still use the global hash.

For RT applications and pathological use cases a new futex op is provided
which allows the application to preallocate and thereby size the process
private hash.

The last two patches add support to the perf futex-hash benchmark so test can
be run on nodes and the preallocation sizing can be tested.

The last patch contains a first update for the futex man page.

The difference vs. V1 of this series is that it uses hash_long() now that the
hash_64 implementation has been fixed in mainline. The performance numbers are
more or less the same as with V1 [1].

Thanks,

	tglx

[1] http://lkml.kernel.org/r/20160428161742.363543816@linutronix.de

----
 Documentation/sysctl/kernel.txt |   17 +++
 b/include/linux/futex_types.h   |   12 ++
 include/linux/futex.h           |   39 +++++--
 include/linux/mm_types.h        |    4 
 include/uapi/linux/futex.h      |    1 
 init/Kconfig                    |    4 
 kernel/fork.c                   |    3 
 kernel/futex.c                  |  215 +++++++++++++++++++++++++++++++++++++++-
 kernel/sysctl.c                 |   21 +++
 tools/perf/bench/Build          |    4 
 tools/perf/bench/futex-hash.c   |  101 ++++++++++++++++--
 tools/perf/bench/futex.h        |    5 
 12 files changed, 403 insertions(+), 23 deletions(-)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [patch V2 1/7] futex: Add some more function commentry
  2016-05-05 20:44 [patch V2 0/7] futex: Add support for process private hashing Thomas Gleixner
@ 2016-05-05 20:44 ` Thomas Gleixner
  2016-05-06 17:37   ` Darren Hart
  2016-05-05 20:44 ` [patch V2 2/7] futex: Hash private futexes per process Thomas Gleixner
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 42+ messages in thread
From: Thomas Gleixner @ 2016-05-05 20:44 UTC (permalink / raw)
  To: LKML
  Cc: Sebastian Andrzej Siewior, Linus Torvalds, Darren Hart,
	Peter Zijlstra, Ingo Molnar, Michael Kerrisk, Davidlohr Bueso,
	Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet

[-- Attachment #1: futex-Add-some-more-function-commentry.patch --]
[-- Type: text/plain, Size: 1126 bytes --]

Add some more comments and reformat existing ones to kernel doc style.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/futex.c |   15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -373,8 +373,12 @@ static inline int hb_waiters_pending(str
 #endif
 }
 
-/*
- * We hash on the keys returned from get_futex_key (see below).
+/**
+ * hash_futex - Return the hash bucket in the global hash
+ * @key:	Pointer to the futex key for which the hash is calculated
+ *
+ * We hash on the keys returned from get_futex_key (see below) and return the
+ * corresponding hash bucket in the global hash.
  */
 static struct futex_hash_bucket *hash_futex(union futex_key *key)
 {
@@ -384,7 +388,12 @@ static struct futex_hash_bucket *hash_fu
 	return &futex_queues[hash & (futex_hashsize - 1)];
 }
 
-/*
+
+/**
+ * match_futex - Check whether to futex keys are equal
+ * @key1:	Pointer to key1
+ * @key2:	Pointer to key2
+ *
  * Return 1 if two futex_keys are equal, 0 otherwise.
  */
 static inline int match_futex(union futex_key *key1, union futex_key *key2)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [patch V2 2/7] futex: Hash private futexes per process
  2016-05-05 20:44 [patch V2 0/7] futex: Add support for process private hashing Thomas Gleixner
  2016-05-05 20:44 ` [patch V2 1/7] futex: Add some more function commentry Thomas Gleixner
@ 2016-05-05 20:44 ` Thomas Gleixner
  2016-05-06 18:09   ` Darren Hart
                     ` (2 more replies)
  2016-05-05 20:44 ` [patch V2 4/7] futex: Add sysctl knobs for process private hash Thomas Gleixner
                   ` (4 subsequent siblings)
  6 siblings, 3 replies; 42+ messages in thread
From: Thomas Gleixner @ 2016-05-05 20:44 UTC (permalink / raw)
  To: LKML
  Cc: Sebastian Andrzej Siewior, Linus Torvalds, Darren Hart,
	Peter Zijlstra, Ingo Molnar, Michael Kerrisk, Davidlohr Bueso,
	Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet

[-- Attachment #1: futex-Hash-private-futexes-per-process.patch --]
[-- Type: text/plain, Size: 13765 bytes --]

From: Sebastian Siewior <bigeasy@linutronix.de>

The standard futex mechanism in the Linux kernel uses a global hash to store
transient state. Collisions on that hash can lead to performance degradation
especially on NUMA systems and on real-time enabled kernels even to priority
inversions.

To mitigate that problem we provide per process private hashing. On the first
futex operation in a process the kernel allocates a hash table. The hash table
is accessible via the process mm_struct. On Numa systems the hash is allocated
node local.

If the allocation fails then the global hash table is used as fallback, so
there is no user space visible impact of this feature.

The hash size is a default value which can be tweaked by the sys admin. The
sysctl interface is implemented in a follow up patch to make the review
simpler. For applications which have special requirements for the private hash
and to allow preallocation of the hash for RT applications, we'll provide a
futex OP in a follow up patch.

Performance data acquired on a 4 socket (node) Intel machine with perf bench
futex-hash:

Threads  G 65536  P 4	  P 8      P 16       P 32     P 64     P 128    P 256

1        8175006  8645465  8617469  8628686   8625223  8664491  8590934  8631582
2	 8149869  8618385  8578185  8622267   8603253  8618787  8595073  8590591
4	 7479482  5867840  7882991  7604838   7894380  7882850  7884911  7886278
8	 7308822  2378057  5731051  5550479   7691198  7672814  7711939  7681549
16	 7295893   677414  2670682  3453552   7158906  7688978  7677603  7690290

So with the proper hash size of the private hash is ~5% faster than the global
hash.

With a full perf bench futex-hash run with one process (36 threads) per node
and 1024 futexes per thread the following results are achieved:

G 65536	 P 4     P 8     P 16     P 32     P 64     P 128    P 256    P 512    P 1024  P 2048     
2673390  368952  682626  1223908  1845922  3003524  3538313  4118533  4286925  4289589 4274020

Ratio:   0,14    0,26    0,46     0,69	   1,12     1,32     1,54     1,60     1,60    1,60

So with a private hash size of 256 buckets and above the performance is almost
steady in this pathological test case and factor 1.6 better than the global
hash. Even a 64 buckets hash is already 10% faster,

Signed-off-by: Sebastian Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/futex.h       |   38 ++++++++--
 include/linux/futex_types.h |   12 +++
 include/linux/mm_types.h    |    4 +
 init/Kconfig                |    4 +
 kernel/fork.c               |    3 
 kernel/futex.c              |  162 +++++++++++++++++++++++++++++++++++++++++++-
 6 files changed, 212 insertions(+), 11 deletions(-)
 create mode 100644 include/linux/futex_types.h

--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -1,6 +1,7 @@
 #ifndef _LINUX_FUTEX_H
 #define _LINUX_FUTEX_H
 
+#include <linux/futex_types.h>
 #include <uapi/linux/futex.h>
 
 struct inode;
@@ -21,16 +22,19 @@ handle_futex_death(u32 __user *uaddr, st
  *
  * offset is aligned to a multiple of sizeof(u32) (== 4) by definition.
  * We use the two low order bits of offset to tell what is the kind of key :
- *  00 : Private process futex (PTHREAD_PROCESS_PRIVATE)
- *       (no reference on an inode or mm)
+ *  00 : Private process futex (PTHREAD_PROCESS_PRIVATE) using process private
+ *	 hash (no reference on an inode or mm)
  *  01 : Shared futex (PTHREAD_PROCESS_SHARED)
  *	mapped on a file (reference on the underlying inode)
  *  10 : Shared futex (PTHREAD_PROCESS_SHARED)
  *       (but private mapping on an mm, and reference taken on it)
+ *  11 : Private process futex (PTHREAD_PROCESS_PRIVATE) using global hash
+ *	 (no reference on an inode or mm)
 */
 
-#define FUT_OFF_INODE    1 /* We set bit 0 if key has a reference on inode */
-#define FUT_OFF_MMSHARED 2 /* We set bit 1 if key has a reference on mm */
+#define FUT_OFF_INODE		0x01 /* Key has a reference on inode */
+#define FUT_OFF_MMSHARED	0x02 /* Key has a reference on mm */
+#define FUT_OFF_PRIVATE		0x03 /* Key has no ref on inode/mm */
 
 union futex_key {
 	struct {
@@ -60,12 +64,30 @@ extern void exit_pi_state_list(struct ta
 #else
 extern int futex_cmpxchg_enabled;
 #endif
+
 #else
-static inline void exit_robust_list(struct task_struct *curr)
-{
-}
-static inline void exit_pi_state_list(struct task_struct *curr)
+static inline void exit_robust_list(struct task_struct *curr) { }
+static inline void exit_pi_state_list(struct task_struct *curr) { }
+#endif
+
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
+/* Process private hash data for futexes */
+
+extern unsigned int futex_default_hash_bits;
+extern unsigned int futex_max_hash_bits;
+
+extern void futex_mm_hash_exit(struct mm_struct *mm);
+
+static inline void futex_mm_hash_init(struct mm_struct *mm)
 {
+	raw_spin_lock_init(&mm->futex_hash.lock);
+	mm->futex_hash.hash = NULL;
 }
+
+#else
+
+static inline void futex_mm_hash_init(struct mm_struct *mm) { }
+static inline void futex_mm_hash_exit(struct mm_struct *mm) { }
 #endif
+
 #endif
--- /dev/null
+++ b/include/linux/futex_types.h
@@ -0,0 +1,12 @@
+#ifndef _LINUX_FUTEX_TYPES_H
+#define _LINUX_FUTEX_TYPES_H
+
+struct futex_hash_bucket;
+
+struct futex_hash {
+	struct raw_spinlock		lock;
+	unsigned int			hash_bits;
+	struct futex_hash_bucket	*hash;
+};
+
+#endif
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -11,6 +11,7 @@
 #include <linux/completion.h>
 #include <linux/cpumask.h>
 #include <linux/uprobes.h>
+#include <linux/futex_types.h>
 #include <linux/page-flags-layout.h>
 #include <asm/page.h>
 #include <asm/mmu.h>
@@ -442,6 +443,9 @@ struct mm_struct {
 
 	struct linux_binfmt *binfmt;
 
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
+	struct futex_hash futex_hash;
+#endif
 	cpumask_var_t cpu_vm_mask_var;
 
 	/* Architecture-specific MM context */
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1498,6 +1498,10 @@ config FUTEX
 	  support for "fast userspace mutexes".  The resulting kernel may not
 	  run glibc-based applications correctly.
 
+config FUTEX_PRIVATE_HASH
+	bool
+	default FUTEX && SMP
+
 config HAVE_FUTEX_CMPXCHG
 	bool
 	depends on FUTEX
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -617,6 +617,8 @@ static struct mm_struct *mm_init(struct
 	mm_init_owner(mm, p);
 	mmu_notifier_mm_init(mm);
 	clear_tlb_flush_pending(mm);
+	futex_mm_hash_init(mm);
+
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	mm->pmd_huge_pte = NULL;
 #endif
@@ -713,6 +715,7 @@ void mmput(struct mm_struct *mm)
 		khugepaged_exit(mm); /* must run before exit_mmap */
 		exit_mmap(mm);
 		set_mm_exe_file(mm, NULL);
+		futex_mm_hash_exit(mm);
 		if (!list_empty(&mm->mmlist)) {
 			spin_lock(&mmlist_lock);
 			list_del(&mm->mmlist);
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -23,6 +23,9 @@
  *  Copyright (C) IBM Corporation, 2009
  *  Thanks to Thomas Gleixner for conceptual design and careful reviews.
  *
+ *  Private hashed futex support by Sebastian Siewior and Thomas Gleixner
+ *  Copyright (C) Linutronix GmbH, 2016
+ *
  *  Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
  *  enough at me, Linus for the original (flawed) idea, Matthew
  *  Kirkwood for proof-of-concept implementation.
@@ -49,6 +52,7 @@
 #include <linux/fs.h>
 #include <linux/file.h>
 #include <linux/jhash.h>
+#include <linux/hash.h>
 #include <linux/init.h>
 #include <linux/futex.h>
 #include <linux/mount.h>
@@ -169,6 +173,34 @@
  * the code that actually moves the futex(es) between hash buckets (requeue_futex)
  * will do the additional required waiter count housekeeping. This is done for
  * double_lock_hb() and double_unlock_hb(), respectively.
+ *
+ * For private futexes we (pre)allocate a per process hash. We check lockless
+ * whether the hash is already allocated. To access the hash later we need
+ * information about the hash properties as well. This requires barriers as
+ * follows:
+ *
+ * CPU 0					CPU 1
+ * check_hash_allocation()
+ *	if (mm->futex_hash.hash)
+ *		return;
+ *	hash = alloc_hash()
+ *	lock(&mm->futex_hash.lock);
+ *	if (!mm->futex_hash.hash) {
+ *	  mm->futex_hash.par = params;
+ *
+ *	  smp_wmb(); (A0) <-paired with-|
+ *					|
+ *	  mm->futex_hash.hash = hash;	|
+ *					|	check_hash_allocation()
+ *					|	   if (mm->futex_hash.hash)
+ *					|		return;
+ *	  unlock(&mm->futex_hash.lock);	|	get_futex_key_refs()
+ *					|
+ *					|--------- smp_mb() (B)
+ *						s = hash(f, mm->futex_hash.par);
+ *						hb = &mm->futex_hash.hash[s];
+ *
+ * So we utilize the existing smp_mb() in get_futex_key_refs().
  */
 
 #ifndef CONFIG_HAVE_FUTEX_CMPXCHG
@@ -255,6 +287,22 @@ struct futex_hash_bucket {
 	struct plist_head chain;
 } ____cacheline_aligned_in_smp;
 
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
+/*
+ * Process private hash for non-shared futexes
+ */
+#define FUTEX_USE_GLOBAL_HASH		((void *) 0x03)
+
+#define FUTEX_MIN_HASH_BITS		order_base_2(4UL)
+#define FUTEX_DEF_HASH_BITS		order_base_2(8UL)
+#define FUTEX_MAX_HASH_BITS		order_base_2(256UL)
+
+unsigned int futex_default_hash_bits	= FUTEX_DEF_HASH_BITS;
+unsigned int futex_max_hash_bits	= FUTEX_MAX_HASH_BITS;
+#else
+static const unsigned int futex_default_hash_bits = 0;
+#endif
+
 /*
  * The base of the bucket array and its size are always used together
  * (after initialization only in hash_futex()), so ensure that they
@@ -374,13 +422,13 @@ static inline int hb_waiters_pending(str
 }
 
 /**
- * hash_futex - Return the hash bucket in the global hash
+ * hash_global_futex - Return the hash bucket in the global hash
  * @key:	Pointer to the futex key for which the hash is calculated
  *
  * We hash on the keys returned from get_futex_key (see below) and return the
  * corresponding hash bucket in the global hash.
  */
-static struct futex_hash_bucket *hash_futex(union futex_key *key)
+static struct futex_hash_bucket *hash_global_futex(union futex_key *key)
 {
 	u32 hash = jhash2((u32*)&key->both.word,
 			  (sizeof(key->both.word)+sizeof(key->both.ptr))/4,
@@ -388,9 +436,33 @@ static struct futex_hash_bucket *hash_fu
 	return &futex_queues[hash & (futex_hashsize - 1)];
 }
 
+/**
+ * hash_futex - Get the hash bucket for a futex
+ *
+ * Returns either the process private or the global hash bucket which fits the
+ * key.
+ */
+static struct futex_hash_bucket *hash_futex(union futex_key *key)
+{
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
+	struct mm_struct *mm = current->mm;
+	unsigned int slot;
+
+	/*
+	 * Futexes which use the per process hash have the lower bits cleared
+	 */
+	if (key->both.offset & (FUT_OFF_INODE | FUT_OFF_MMSHARED))
+		return hash_global_futex(key);
+
+	slot = hash_long(key->private.address, mm->futex_hash.hash_bits);
+	return &mm->futex_hash.hash[slot];
+#else
+	return hash_global_futex(key);
+#endif
+}
 
 /**
- * match_futex - Check whether to futex keys are equal
+ * match_futex - Check whether two futex keys are equal
  * @key1:	Pointer to key1
  * @key2:	Pointer to key2
  *
@@ -505,7 +577,20 @@ get_futex_key(u32 __user *uaddr, int fsh
 	 */
 	if (!fshared) {
 		key->private.mm = mm;
+		/*
+		 * If we have a process private hash, then we store uaddr
+		 * instead of the page base address.
+		 */
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
+		if (mm->futex_hash.hash != FUTEX_USE_GLOBAL_HASH) {
+			key->private.address = (unsigned long) uaddr;
+		} else {
+			key->private.address = address;
+			key->both.offset |= FUT_OFF_PRIVATE;
+		}
+#else
 		key->private.address = address;
+#endif
 		get_futex_key_refs(key);  /* implies smp_mb(); (B) */
 		return 0;
 	}
@@ -3153,6 +3238,75 @@ void exit_robust_list(struct task_struct
 				   curr, pip);
 }
 
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
+
+void futex_mm_hash_exit(struct mm_struct *mm)
+{
+	if (mm->futex_hash.hash && mm->futex_hash.hash != FUTEX_USE_GLOBAL_HASH)
+		kfree(mm->futex_hash.hash);
+	mm->futex_hash.hash = NULL;
+}
+
+static struct futex_hash_bucket *futex_alloc_hash(unsigned int hash_bits)
+{
+	struct futex_hash_bucket *hb;
+	size_t hash_size, size;
+	int i;
+
+	hash_size = 1 << hash_bits;
+	size = hash_size * sizeof(struct futex_hash_bucket);
+	hb = kzalloc_node(size, GFP_KERNEL, numa_node_id());
+	if (!hb)
+		return NULL;
+
+	for (i = 0; i < hash_size; i++) {
+		atomic_set(&hb[i].waiters, 0);
+		plist_head_init(&hb[i].chain);
+		spin_lock_init(&hb[i].lock);
+	}
+	return hb;
+}
+
+static void futex_populate_hash(unsigned int hash_bits)
+{
+	struct mm_struct *mm = current->mm;
+	struct futex_hash_bucket *hb = NULL;
+
+	/*
+	 * We don't need an explicit smp_mb() when the hash is populated
+	 * because before we dereference mm->futex_hash.hash_bits in the hash
+	 * function we have an smp_mb() in futex_get_key_refs() already.
+	 */
+	if (mm->futex_hash.hash)
+		return;
+
+	/*
+	 * If we failed to allocate a hash on the fly, fall back to the global
+	 * hash.
+	 */
+	hb = futex_alloc_hash(hash_bits);
+	if (!hb)
+		hb = FUTEX_USE_GLOBAL_HASH;
+
+	raw_spin_lock(&mm->futex_hash.lock);
+	/* We might have raced with another task allocating the hash. */
+	if (!mm->futex_hash.hash) {
+		mm->futex_hash.hash_bits = hash_bits;
+		/*
+		 * Ensure that the above is visible before we store
+		 * the pointer.
+		 */
+		smp_wmb(); /* (A0) Pairs with (B) */
+		mm->futex_hash.hash = hb;
+		hb = NULL;
+	}
+	raw_spin_unlock(&mm->futex_hash.lock);
+	kfree(hb);
+}
+#else /* CONFIG_FUTEX_PRIVATE_HASH */
+static inline void futex_populate_hash(unsigned int hash_bits) { }
+#endif
+
 long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 		u32 __user *uaddr2, u32 val2, u32 val3)
 {
@@ -3161,6 +3315,8 @@ long do_futex(u32 __user *uaddr, int op,
 
 	if (!(op & FUTEX_PRIVATE_FLAG))
 		flags |= FLAGS_SHARED;
+	else
+		futex_populate_hash(futex_default_hash_bits);
 
 	if (op & FUTEX_CLOCK_REALTIME) {
 		flags |= FLAGS_CLOCKRT;

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [patch V2 3/7] futex: Add op for hash preallocation
  2016-05-05 20:44 [patch V2 0/7] futex: Add support for process private hashing Thomas Gleixner
                   ` (2 preceding siblings ...)
  2016-05-05 20:44 ` [patch V2 4/7] futex: Add sysctl knobs for process private hash Thomas Gleixner
@ 2016-05-05 20:44 ` Thomas Gleixner
  2016-05-06 18:18   ` Darren Hart
                     ` (2 more replies)
  2016-05-05 20:44 ` [patch V2 5/7] perf/bench/futex-hash: Support NUMA Thomas Gleixner
                   ` (2 subsequent siblings)
  6 siblings, 3 replies; 42+ messages in thread
From: Thomas Gleixner @ 2016-05-05 20:44 UTC (permalink / raw)
  To: LKML
  Cc: Sebastian Andrzej Siewior, Linus Torvalds, Darren Hart,
	Peter Zijlstra, Ingo Molnar, Michael Kerrisk, Davidlohr Bueso,
	Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet

[-- Attachment #1: futex-Add-op-for-hash-preallocation.patch --]
[-- Type: text/plain, Size: 3659 bytes --]

From: Sebastian Siewior <bigeasy@linutronix.de>

The per process hash is allocated on the fly at the first futex operation of a
process. The size of the hash is determined by a system wide default setting
controlled by the sys admin, This is suboptimal for RT applications and
applications with pathological futex abuse,

 - For RT applications its important to allocate the per process hash before the
   first futex operation to avoid the allocation on the first futex operation.

 - For pathological applications which use gazillions of futexes its useful to
   allocate a hash greater than the default hash size.

Add a futex op which allows to preallocate the hash with the requested
size. The size is limited by the systemwide maximum hash size, which can be
set by the admin. The requested size is rounded up to the next order of 2.

The function can be called several times, but ony the first call results in a
hash allocation of the requested size as there is no non-intrusive way to
reallocate/rehash in a multithreaded application.

Note, that this call must be issued before the first futex operation in the
process because that would automatically allocate the default sized hash.

The function returns the actual hash size or 0 if the global hash is used. The
latter is the case on UP and in the rare case that the allocation failed and
the global hash is used as a fallback.

Signed-off-by: Sebastian Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/uapi/linux/futex.h |    1 +
 kernel/futex.c             |   41 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 42 insertions(+)

--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -20,6 +20,7 @@
 #define FUTEX_WAKE_BITSET	10
 #define FUTEX_WAIT_REQUEUE_PI	11
 #define FUTEX_CMP_REQUEUE_PI	12
+#define FUTEX_PREALLOC_HASH	13
 
 #define FUTEX_PRIVATE_FLAG	128
 #define FUTEX_CLOCK_REALTIME	256
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -3307,6 +3307,45 @@ static void futex_populate_hash(unsigned
 static inline void futex_populate_hash(unsigned int hash_bits) { }
 #endif
 
+/**
+ * futex_preallocate_hash - Preallocate the process private hash
+ * @slots:	Number of slots to allocate
+ *
+ * The function will allocate the process private hash with the number of
+ * requested slots. The number is rounded to the next power of two and may not
+ * exceed the current system limit.
+ *
+ * If the hash was already allocated by either an earlier call to
+ * futex_preallocate_hash() or an earlier futex op which allocated the cache
+ * on the fly, we return the size of the active hash.
+ *
+ * Returns::	Size of the hash, if 0 then the global hash is used.
+ */
+static int futex_preallocate_hash(unsigned int slots)
+{
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
+	struct mm_struct *mm = current->mm;
+	struct futex_hash_bucket *hb;
+	unsigned int bits;
+
+	/* Try to allocate the requested nr of slots */
+	bits = order_base_2(slots);
+
+	if (bits < FUTEX_MIN_HASH_BITS)
+		bits = FUTEX_MIN_HASH_BITS;
+
+	if (bits > futex_max_hash_bits)
+		bits = futex_max_hash_bits;
+
+	futex_populate_hash(bits);
+
+	hb = mm->futex_hash.hash;
+	return hb == FUTEX_USE_GLOBAL_HASH ? 0 : 1 << mm->futex_hash.hash_bits;
+#else
+	return 0;
+#endif
+}
+
 long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 		u32 __user *uaddr2, u32 val2, u32 val3)
 {
@@ -3362,6 +3401,8 @@ long do_futex(u32 __user *uaddr, int op,
 					     uaddr2);
 	case FUTEX_CMP_REQUEUE_PI:
 		return futex_requeue(uaddr, flags, uaddr2, val, val2, &val3, 1);
+	case FUTEX_PREALLOC_HASH:
+		return futex_preallocate_hash(val);
 	}
 	return -ENOSYS;
 }

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [patch V2 4/7] futex: Add sysctl knobs for process private hash
  2016-05-05 20:44 [patch V2 0/7] futex: Add support for process private hashing Thomas Gleixner
  2016-05-05 20:44 ` [patch V2 1/7] futex: Add some more function commentry Thomas Gleixner
  2016-05-05 20:44 ` [patch V2 2/7] futex: Hash private futexes per process Thomas Gleixner
@ 2016-05-05 20:44 ` Thomas Gleixner
  2016-05-06 18:22   ` Darren Hart
  2016-05-05 20:44 ` [patch V2 3/7] futex: Add op for hash preallocation Thomas Gleixner
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 42+ messages in thread
From: Thomas Gleixner @ 2016-05-05 20:44 UTC (permalink / raw)
  To: LKML
  Cc: Sebastian Andrzej Siewior, Linus Torvalds, Darren Hart,
	Peter Zijlstra, Ingo Molnar, Michael Kerrisk, Davidlohr Bueso,
	Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet

[-- Attachment #1: add-sysctl-interface.patch --]
[-- Type: text/plain, Size: 3677 bytes --]

From: Sebastian Siewior <bigeasy@linutronix.de>

To adjust the default hash size and the maximum hash size for process private
futexes we add the following sysctls:

futex_private_default_hash_bits:

     Adjusts the default hash size (in bits) which is used for automatic hash
     allocations on the first futex operation

futex_private_max_hash_bits:

     Adjusts the maximum hash size (in bits). This limits the hash size which
     can be preallocated by applications with the FUTEX_PREALLOC_HASH op.

Signed-off-by: Sebastian Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 Documentation/sysctl/kernel.txt |   17 +++++++++++++++++
 include/linux/futex.h           |    1 +
 kernel/futex.c                  |    5 +++--
 kernel/sysctl.c                 |   21 +++++++++++++++++++++
 4 files changed, 42 insertions(+), 2 deletions(-)

--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -29,6 +29,8 @@ Currently, these files might (depending
 - core_pipe_limit
 - core_uses_pid
 - ctrl-alt-del
+- futex_private_default_hash_bits
+- futex_private_max_hash_bits
 - dmesg_restrict
 - domainname
 - hostname
@@ -265,6 +267,21 @@ to decide what to do with it.
 
 ==============================================================
 
+futex_private_default_hash_bits:
+
+Adjusts the default hash size (in bits) which is used for
+automatic hash allocations on the first futex operation
+
+==============================================================
+
+futex_private_max_hash_bits:
+
+Adjusts the maximum hash size (in bits). This limits the hash
+size which can be preallocated by applications with the
+FUTEX_PREALLOC_HASH op.
+
+==============================================================
+
 dmesg_restrict:
 
 This toggle indicates whether unprivileged users are prevented
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -75,6 +75,7 @@ static inline void exit_pi_state_list(st
 
 extern unsigned int futex_default_hash_bits;
 extern unsigned int futex_max_hash_bits;
+extern unsigned int futex_sysmax_hash_bits;
 
 extern void futex_mm_hash_exit(struct mm_struct *mm);
 
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -297,8 +297,9 @@ struct futex_hash_bucket {
 #define FUTEX_DEF_HASH_BITS		order_base_2(8UL)
 #define FUTEX_MAX_HASH_BITS		order_base_2(256UL)
 
-unsigned int futex_default_hash_bits	= FUTEX_DEF_HASH_BITS;
-unsigned int futex_max_hash_bits	= FUTEX_MAX_HASH_BITS;
+unsigned int futex_default_hash_bits			= FUTEX_DEF_HASH_BITS;
+unsigned int futex_max_hash_bits			= FUTEX_MAX_HASH_BITS;
+unsigned int  __read_mostly futex_sysmax_hash_bits	= FUTEX_MAX_HASH_BITS;
 #else
 static const unsigned int futex_default_hash_bits = 0;
 #endif
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -65,6 +65,7 @@
 #include <linux/sched/sysctl.h>
 #include <linux/kexec.h>
 #include <linux/bpf.h>
+#include <linux/futex.h>
 
 #include <asm/uaccess.h>
 #include <asm/processor.h>
@@ -593,6 +594,26 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
+	{
+		.procname	= "futex_private_default_hash_bits",
+		.data		= &futex_default_hash_bits,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &two,
+		.extra2		= &futex_max_hash_bits,
+	},
+	{
+		.procname	= "futex_private_max_hash_bits",
+		.data		= &futex_max_hash_bits,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &futex_default_hash_bits,
+		.extra2		= &futex_sysmax_hash_bits,
+	},
+#endif
 #ifdef CONFIG_FUNCTION_TRACER
 	{
 		.procname	= "ftrace_enabled",

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [patch V2 5/7] perf/bench/futex-hash: Support NUMA
  2016-05-05 20:44 [patch V2 0/7] futex: Add support for process private hashing Thomas Gleixner
                   ` (3 preceding siblings ...)
  2016-05-05 20:44 ` [patch V2 3/7] futex: Add op for hash preallocation Thomas Gleixner
@ 2016-05-05 20:44 ` Thomas Gleixner
  2016-05-05 20:44 ` [patch V2 6/7] perf/bench/futex-hash: Support preallocate hash table Thomas Gleixner
  2016-05-05 20:44 ` [patch V2 7/7] futex.2: Document hash preallocation opcode Thomas Gleixner
  6 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2016-05-05 20:44 UTC (permalink / raw)
  To: LKML
  Cc: Sebastian Andrzej Siewior, Linus Torvalds, Darren Hart,
	Peter Zijlstra, Ingo Molnar, Michael Kerrisk, Davidlohr Bueso,
	Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet

[-- Attachment #1: perf-bench-futex-hash-Support-NUMA.patch --]
[-- Type: text/plain, Size: 5893 bytes --]

This adds a new option to tell perf on which numa node the hash benchmark
should run. If set then 

 - The test is bound to the node
 - Memory is allocated on the local NUMA node
 - The threads are bound to the cpus on the node

The NUMA node can be specified by the -n argument.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 tools/perf/bench/Build        |  4 ++
 tools/perf/bench/futex-hash.c | 89 +++++++++++++++++++++++++++++++++++++------
 2 files changed, 82 insertions(+), 11 deletions(-)

diff --git a/tools/perf/bench/Build b/tools/perf/bench/Build
index 60bf119..9e6e518 100644
--- a/tools/perf/bench/Build
+++ b/tools/perf/bench/Build
@@ -1,3 +1,7 @@
+ifdef CONFIG_NUMA
+CFLAGS_futex-hash.o   += -DCONFIG_NUMA=1
+endif
+
 perf-y += sched-messaging.o
 perf-y += sched-pipe.o
 perf-y += mem-functions.o
diff --git a/tools/perf/bench/futex-hash.c b/tools/perf/bench/futex-hash.c
index 0999ac5..a1c6ee9 100644
--- a/tools/perf/bench/futex-hash.c
+++ b/tools/perf/bench/futex-hash.c
@@ -20,6 +20,9 @@
 #include <stdlib.h>
 #include <sys/time.h>
 #include <pthread.h>
+#ifdef CONFIG_NUMA
+#include <numa.h>
+#endif
 
 static unsigned int nthreads = 0;
 static unsigned int nsecs    = 10;
@@ -27,6 +30,7 @@ static unsigned int nsecs    = 10;
 static unsigned int nfutexes = 1024;
 static bool fshared = false, done = false, silent = false;
 static int futex_flag = 0;
+static int numa_node = -1;
 
 struct timeval start, end, runtime;
 static pthread_mutex_t thread_lock;
@@ -39,7 +43,7 @@ struct worker {
 	u_int32_t *futex;
 	pthread_t thread;
 	unsigned long ops;
-};
+} __attribute__((aligned(128)));
 
 static const struct option options[] = {
 	OPT_UINTEGER('t', "threads", &nthreads, "Specify amount of threads"),
@@ -47,9 +51,28 @@ static const struct option options[] = {
 	OPT_UINTEGER('f', "futexes", &nfutexes, "Specify amount of futexes per threads"),
 	OPT_BOOLEAN( 's', "silent",  &silent,   "Silent mode: do not display data/details"),
 	OPT_BOOLEAN( 'S', "shared",  &fshared,  "Use shared futexes instead of private ones"),
+#ifdef CONFIG_NUMA
+	OPT_INTEGER( 'n', "numa",   &numa_node,  "Specify the NUMA node"),
+#endif
 	OPT_END()
 };
 
+#ifndef CONFIG_NUMA
+static int numa_run_on_node(int node __maybe_unused) { return 0; }
+static int numa_node_of_cpu(int node __maybe_unused) { return 0; }
+static void *numa_alloc_local(size_t size) { return malloc(size); }
+static void numa_free(void *p, size_t size __maybe_unused) { return free(p); }
+#endif
+
+static bool cpu_is_local(int cpu)
+{
+	if (numa_node < 0)
+		return true;
+	if (numa_node_of_cpu(cpu) == numa_node)
+		return true;
+	return false;
+}
+
 static const char * const bench_futex_hash_usage[] = {
 	"perf bench futex hash <options>",
 	NULL
@@ -115,6 +138,8 @@ int bench_futex_hash(int argc, const char **argv,
 	unsigned int i, ncpus;
 	pthread_attr_t thread_attr;
 	struct worker *worker = NULL;
+	char *node_str = NULL;
+	unsigned int cpunum;
 
 	argc = parse_options(argc, argv, options, bench_futex_hash_usage, 0);
 	if (argc) {
@@ -128,18 +153,50 @@ int bench_futex_hash(int argc, const char **argv,
 	act.sa_sigaction = toggle_done;
 	sigaction(SIGINT, &act, NULL);
 
-	if (!nthreads) /* default to the number of CPUs */
-		nthreads = ncpus;
+	if (!nthreads) {
+		/* default to the number of CPUs per NUMA node */
+		if (numa_node < 0) {
+			nthreads = ncpus;
+		} else {
+			for (i = 0; i < ncpus; i++) {
+				if (cpu_is_local(i))
+					nthreads++;
+			}
+			if (!nthreads)
+				err(EXIT_FAILURE, "No online CPUs for this node");
+		}
+	} else {
+		int cpu_available = 0;
+
+		for (i = 0; i < ncpus && !cpu_available; i++) {
+			if (cpu_is_local(i))
+				cpu_available = 1;
+		}
+		if (!cpu_available)
+			err(EXIT_FAILURE, "No online CPUs for this node");
+	}
+
+	if (numa_node >= 0) {
+		ret = numa_run_on_node(numa_node);
+		if (ret < 0)
+			err(EXIT_FAILURE, "numa_run_on_node");
+		ret = asprintf(&node_str, " on node %d", numa_node);
+		if (ret < 0)
+			err(EXIT_FAILURE, "numa_node, asprintf");
+	}
 
-	worker = calloc(nthreads, sizeof(*worker));
+	worker = numa_alloc_local(nthreads * sizeof(*worker));
 	if (!worker)
 		goto errmem;
 
 	if (!fshared)
 		futex_flag = FUTEX_PRIVATE_FLAG;
 
-	printf("Run summary [PID %d]: %d threads, each operating on %d [%s] futexes for %d secs.\n\n",
-	       getpid(), nthreads, nfutexes, fshared ? "shared":"private", nsecs);
+	printf("Run summary [PID %d]: %d threads%s, each operating on %d [%s] futexes for %d secs.\n\n",
+	       getpid(), nthreads,
+	       node_str ? : "",
+	       nfutexes, fshared ? "shared":"private",
+	       nsecs);
 
 	init_stats(&throughput_stats);
 	pthread_mutex_init(&thread_lock, NULL);
@@ -149,14 +206,24 @@ int bench_futex_hash(int argc, const char **argv,
 	threads_starting = nthreads;
 	pthread_attr_init(&thread_attr);
 	gettimeofday(&start, NULL);
-	for (i = 0; i < nthreads; i++) {
+	for (cpunum = 0, i = 0; i < nthreads; i++, cpunum++) {
+
+		do {
+			if (cpu_is_local(cpunum))
+				break;
+			cpunum++;
+			if (cpunum > ncpus)
+				cpunum = 0;
+		} while (1);
+
 		worker[i].tid = i;
-		worker[i].futex = calloc(nfutexes, sizeof(*worker[i].futex));
+		worker[i].futex = numa_alloc_local(nfutexes *
+						   sizeof(*worker[i].futex));
 		if (!worker[i].futex)
 			goto errmem;
 
 		CPU_ZERO(&cpu);
-		CPU_SET(i % ncpus, &cpu);
+		CPU_SET(cpunum % ncpus, &cpu);
 
 		ret = pthread_attr_setaffinity_np(&thread_attr, sizeof(cpu_set_t), &cpu);
 		if (ret)
@@ -203,12 +270,12 @@ int bench_futex_hash(int argc, const char **argv,
 				       &worker[i].futex[nfutexes-1], t);
 		}
 
-		free(worker[i].futex);
+		numa_free(worker[i].futex, nfutexes * sizeof(*worker[i].futex));
 	}
 
 	print_summary();
 
-	free(worker);
+	numa_free(worker, nthreads * sizeof(*worker));
 	return ret;
 errmem:
 	err(EXIT_FAILURE, "calloc");
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [patch V2 6/7] perf/bench/futex-hash: Support preallocate hash table
  2016-05-05 20:44 [patch V2 0/7] futex: Add support for process private hashing Thomas Gleixner
                   ` (4 preceding siblings ...)
  2016-05-05 20:44 ` [patch V2 5/7] perf/bench/futex-hash: Support NUMA Thomas Gleixner
@ 2016-05-05 20:44 ` Thomas Gleixner
  2016-05-05 20:44 ` [patch V2 7/7] futex.2: Document hash preallocation opcode Thomas Gleixner
  6 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2016-05-05 20:44 UTC (permalink / raw)
  To: LKML
  Cc: Sebastian Andrzej Siewior, Linus Torvalds, Darren Hart,
	Peter Zijlstra, Ingo Molnar, Michael Kerrisk, Davidlohr Bueso,
	Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet

[-- Attachment #1: perf-bench-futex-hash-Support-preallocate-hash-table.patch --]
[-- Type: text/plain, Size: 2578 bytes --]

Instead of using the default hash size on the first allocation it is
possible to allocate a specific number of slots upfront.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 tools/perf/bench/futex-hash.c |   16 ++++++++++++++--
 tools/perf/bench/futex.h      |    5 +++++
 2 files changed, 19 insertions(+), 2 deletions(-)

--- a/tools/perf/bench/futex-hash.c
+++ b/tools/perf/bench/futex-hash.c
@@ -30,6 +30,7 @@ static unsigned int nsecs    = 10;
 static unsigned int nfutexes = 1024;
 static bool fshared = false, done = false, silent = false;
 static int futex_flag = 0;
+static unsigned int prealloc;
 static int numa_node = -1;
 
 struct timeval start, end, runtime;
@@ -51,6 +52,7 @@ static const struct option options[] = {
 	OPT_UINTEGER('f', "futexes", &nfutexes, "Specify amount of futexes per threads"),
 	OPT_BOOLEAN( 's', "silent",  &silent,   "Silent mode: do not display data/details"),
 	OPT_BOOLEAN( 'S', "shared",  &fshared,  "Use shared futexes instead of private ones"),
+	OPT_UINTEGER('p', "prealloc",&prealloc, "Specify number of preallocated hash slots"),
 #ifdef CONFIG_NUMA
 	OPT_INTEGER( 'n', "numa",   &numa_node,  "Specify the NUMA node"),
 #endif
@@ -138,6 +140,7 @@ int bench_futex_hash(int argc, const cha
 	unsigned int i, ncpus;
 	pthread_attr_t thread_attr;
 	struct worker *worker = NULL;
+	char *prealloc_str = NULL;
 	char *node_str = NULL;
 	unsigned int cpunum;
 
@@ -192,11 +195,20 @@ int bench_futex_hash(int argc, const cha
 	if (!fshared)
 		futex_flag = FUTEX_PRIVATE_FLAG;
 
-	printf("Run summary [PID %d]: %d threads%s, each operating on %d [%s] futexes for %d secs.\n\n",
+	if (prealloc) {
+		ret = futex_preallocate(prealloc);
+		if (ret < 0)
+			err(EXIT_FAILURE, "futex_prealloate");
+		ret = asprintf(&prealloc_str, " P %u %d", prealloc, ret);
+		if (ret < 0)
+			err(EXIT_FAILURE, "futex_preallocate, asprintf");
+	}
+
+	printf("Run summary [PID %d]: %d threads%s, each operating on %d [%s%s] futexes for %d secs.\n\n",
 	       getpid(), nthreads,
 	       node_str ? : "",
 	       nfutexes, fshared ? "shared":"private",
-	       nsecs);
+	       prealloc_str ? : "", nsecs);
 
 	init_stats(&throughput_stats);
 	pthread_mutex_init(&thread_lock, NULL);
--- a/tools/perf/bench/futex.h
+++ b/tools/perf/bench/futex.h
@@ -101,4 +101,9 @@ static inline int pthread_attr_setaffini
 }
 #endif
 
+static inline int futex_preallocate(u_int32_t hash_size)
+{
+	return futex(0, FUTEX_PREALLOC_HASH, hash_size, NULL, NULL, 0, 0);
+}
+
 #endif /* _FUTEX_H */

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [patch V2 7/7] futex.2: Document hash preallocation opcode
  2016-05-05 20:44 [patch V2 0/7] futex: Add support for process private hashing Thomas Gleixner
                   ` (5 preceding siblings ...)
  2016-05-05 20:44 ` [patch V2 6/7] perf/bench/futex-hash: Support preallocate hash table Thomas Gleixner
@ 2016-05-05 20:44 ` Thomas Gleixner
  6 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2016-05-05 20:44 UTC (permalink / raw)
  To: LKML
  Cc: Sebastian Andrzej Siewior, Linus Torvalds, Darren Hart,
	Peter Zijlstra, Ingo Molnar, Michael Kerrisk, Davidlohr Bueso,
	Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet

[-- Attachment #1: futex.2--Document-hash-preallocation-opcode --]
[-- Type: text/plain, Size: 2067 bytes --]

At least an attempt to document the futex attached mode extension.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 man2/futex.2 |   40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

--- a/man2/futex.2
+++ b/man2/futex.2
@@ -767,6 +767,40 @@ operations correspond to
 and
 .BR FUTEX_WAKE_BITSET
 operations where the bit masks are all ones.
+
+.\"
+.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+.\"
+.TP
+.BR FUTEX_PREALLOC_HASH " (since Linux ?.?.?)"
+This operation preallocates the per process private hash with a hash size
+provided by the
+.IR val
+argument.
+
+This operation tells the kernel to preallocate the per process private hash
+with a given hash size. The size is rounded up to the next power of two and
+limited internally in the kernel according to the sys control setting.
+
+The per process private hash is allocated once and cannot be resized. The
+allocation either happens by this op code or automatically when the first real
+futex operation takes place.
+
+If the hash is allocated then subsequent calls with this opcode return the
+currently active hash size.
+
+If the returned size value is 0 then the global hash is used instead of the
+process private hash. This can happen on UP machines where the private hashing
+is disabled and in case of memory allocation failure.
+
+The
+.IR uaddr ,
+.IR timeout ,
+.IR uaddr2
+and
+.I val3
+arguments are ignored.
+
 .\"
 .\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
 .\"
@@ -1334,6 +1368,12 @@ the futex word at
 Returns 0 if the caller was successfully requeued to the futex for
 the futex word at
 .IR uaddr2 .
+.TP
+.B FUTEX_PREALLOC_HASH
+Returns the size of the allocated hash, i.e. the number of hash slots. If the
+size value is 0 then the global hash is used instead of the process private
+hash. This can happen on UP machines where the private hashing is disabled and
+in case of memory allocation failure.
 .\"
 .\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
 .\"

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch V2 1/7] futex: Add some more function commentry
  2016-05-05 20:44 ` [patch V2 1/7] futex: Add some more function commentry Thomas Gleixner
@ 2016-05-06 17:37   ` Darren Hart
  0 siblings, 0 replies; 42+ messages in thread
From: Darren Hart @ 2016-05-06 17:37 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Sebastian Andrzej Siewior, Linus Torvalds, Darren Hart,
	Peter Zijlstra, Ingo Molnar, Michael Kerrisk, Davidlohr Bueso,
	Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Thu, May 05, 2016 at 08:44:03PM -0000, Thomas Gleixner wrote:
> Add some more comments and reformat existing ones to kernel doc style.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  kernel/futex.c |   15 ++++++++++++---
>  1 file changed, 12 insertions(+), 3 deletions(-)
> 
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -373,8 +373,12 @@ static inline int hb_waiters_pending(str
>  #endif
>  }
>  
> -/*
> - * We hash on the keys returned from get_futex_key (see below).
> +/**
> + * hash_futex - Return the hash bucket in the global hash
> + * @key:	Pointer to the futex key for which the hash is calculated
> + *
> + * We hash on the keys returned from get_futex_key (see below) and return the
> + * corresponding hash bucket in the global hash.
>   */
>  static struct futex_hash_bucket *hash_futex(union futex_key *key)
>  {
> @@ -384,7 +388,12 @@ static struct futex_hash_bucket *hash_fu
>  	return &futex_queues[hash & (futex_hashsize - 1)];
>  }
>  
> -/*
> +
> +/**
> + * match_futex - Check whether to futex keys are equal

s/to/two/

Otherwise:

Reviewed-by: Darren Hart <dvhart@linux.intel.com>

> + * @key1:	Pointer to key1
> + * @key2:	Pointer to key2
> + *
>   * Return 1 if two futex_keys are equal, 0 otherwise.
>   */
>  static inline int match_futex(union futex_key *key1, union futex_key *key2)
> 
> 
> 

-- 
Darren Hart
Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch V2 2/7] futex: Hash private futexes per process
  2016-05-05 20:44 ` [patch V2 2/7] futex: Hash private futexes per process Thomas Gleixner
@ 2016-05-06 18:09   ` Darren Hart
  2016-05-06 21:56     ` Darren Hart
  2016-05-07  8:44     ` Thomas Gleixner
  2016-05-19 12:21   ` Peter Zijlstra
  2016-05-19 12:24   ` Peter Zijlstra
  2 siblings, 2 replies; 42+ messages in thread
From: Darren Hart @ 2016-05-06 18:09 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Sebastian Andrzej Siewior, Linus Torvalds, Darren Hart,
	Peter Zijlstra, Ingo Molnar, Michael Kerrisk, Davidlohr Bueso,
	Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Thu, May 05, 2016 at 08:44:04PM -0000, Thomas Gleixner wrote:
> From: Sebastian Siewior <bigeasy@linutronix.de>
> 
> The standard futex mechanism in the Linux kernel uses a global hash to store
> transient state. Collisions on that hash can lead to performance degradation
> especially on NUMA systems and on real-time enabled kernels even to priority
> inversions.

I think it is worth noting the how this causes an unbounded priority inversion
as it wasn't obvious to me. At least mention that "CPU pinning" can result in an
unbounded priority inversion.

> 
> To mitigate that problem we provide per process private hashing. On the first
> futex operation in a process the kernel allocates a hash table. The hash table
> is accessible via the process mm_struct. On Numa systems the hash is allocated
> node local.
> 
> If the allocation fails then the global hash table is used as fallback, so
> there is no user space visible impact of this feature.
> 

It would be good to have a way to detect that the process private hash table was
successfully created. Perhaps a /proc/pid/ feature? This would allow us to write
a functional futex test for tools/testing/selftests/futex

> The hash size is a default value which can be tweaked by the sys admin. The
> sysctl interface is implemented in a follow up patch to make the review
> simpler. For applications which have special requirements for the private hash
> and to allow preallocation of the hash for RT applications, we'll provide a
> futex OP in a follow up patch.
> 
> Performance data acquired on a 4 socket (node) Intel machine with perf bench
> futex-hash:
> 
> Threads  G 65536  P 4	  P 8      P 16       P 32     P 64     P 128    P 256
> 
> 1        8175006  8645465  8617469  8628686   8625223  8664491  8590934  8631582
> 2	 8149869  8618385  8578185  8622267   8603253  8618787  8595073  8590591
> 4	 7479482  5867840  7882991  7604838   7894380  7882850  7884911  7886278
> 8	 7308822  2378057  5731051  5550479   7691198  7672814  7711939  7681549
> 16	 7295893   677414  2670682  3453552   7158906  7688978  7677603  7690290
> 
> So with the proper hash size of the private hash is ~5% faster than the global
> hash.
> 
> With a full perf bench futex-hash run with one process (36 threads) per node
> and 1024 futexes per thread the following results are achieved:
> 
> G 65536	 P 4     P 8     P 16     P 32     P 64     P 128    P 256    P 512    P 1024  P 2048     
> 2673390  368952  682626  1223908  1845922  3003524  3538313  4118533  4286925  4289589 4274020
> 
> Ratio:   0,14    0,26    0,46     0,69	   1,12     1,32     1,54     1,60     1,60    1,60
> 
> So with a private hash size of 256 buckets and above the performance is almost
> steady in this pathological test case and factor 1.6 better than the global
> hash. Even a 64 buckets hash is already 10% faster,
> 

Nice!

> Signed-off-by: Sebastian Siewior <bigeasy@linutronix.de>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  include/linux/futex.h       |   38 ++++++++--
>  include/linux/futex_types.h |   12 +++
>  include/linux/mm_types.h    |    4 +
>  init/Kconfig                |    4 +
>  kernel/fork.c               |    3 
>  kernel/futex.c              |  162 +++++++++++++++++++++++++++++++++++++++++++-
>  6 files changed, 212 insertions(+), 11 deletions(-)
>  create mode 100644 include/linux/futex_types.h
> 
> --- a/include/linux/futex.h
> +++ b/include/linux/futex.h
> @@ -1,6 +1,7 @@
>  #ifndef _LINUX_FUTEX_H
>  #define _LINUX_FUTEX_H
>  
> +#include <linux/futex_types.h>
>  #include <uapi/linux/futex.h>
>  
>  struct inode;
> @@ -21,16 +22,19 @@ handle_futex_death(u32 __user *uaddr, st
>   *
>   * offset is aligned to a multiple of sizeof(u32) (== 4) by definition.
>   * We use the two low order bits of offset to tell what is the kind of key :
> - *  00 : Private process futex (PTHREAD_PROCESS_PRIVATE)
> - *       (no reference on an inode or mm)
> + *  00 : Private process futex (PTHREAD_PROCESS_PRIVATE) using process private
> + *	 hash (no reference on an inode or mm)
>   *  01 : Shared futex (PTHREAD_PROCESS_SHARED)
>   *	mapped on a file (reference on the underlying inode)
>   *  10 : Shared futex (PTHREAD_PROCESS_SHARED)
>   *       (but private mapping on an mm, and reference taken on it)
> + *  11 : Private process futex (PTHREAD_PROCESS_PRIVATE) using global hash
> + *	 (no reference on an inode or mm)
>  */
>  
> -#define FUT_OFF_INODE    1 /* We set bit 0 if key has a reference on inode */
> -#define FUT_OFF_MMSHARED 2 /* We set bit 1 if key has a reference on mm */
> +#define FUT_OFF_INODE		0x01 /* Key has a reference on inode */
> +#define FUT_OFF_MMSHARED	0x02 /* Key has a reference on mm */
> +#define FUT_OFF_PRIVATE		0x03 /* Key has no ref on inode/mm */
>  
>  union futex_key {
>  	struct {
> @@ -60,12 +64,30 @@ extern void exit_pi_state_list(struct ta
>  #else
>  extern int futex_cmpxchg_enabled;
>  #endif
> +
>  #else
> -static inline void exit_robust_list(struct task_struct *curr)
> -{
> -}
> -static inline void exit_pi_state_list(struct task_struct *curr)
> +static inline void exit_robust_list(struct task_struct *curr) { }
> +static inline void exit_pi_state_list(struct task_struct *curr) { }
> +#endif

These appear to be unrelated changes, can they preceed this patch?

> +
> +#ifdef CONFIG_FUTEX_PRIVATE_HASH
> +/* Process private hash data for futexes */
> +
> +extern unsigned int futex_default_hash_bits;
> +extern unsigned int futex_max_hash_bits;
> +
> +extern void futex_mm_hash_exit(struct mm_struct *mm);
> +
> +static inline void futex_mm_hash_init(struct mm_struct *mm)
>  {
> +	raw_spin_lock_init(&mm->futex_hash.lock);
> +	mm->futex_hash.hash = NULL;
>  }
> +
> +#else
> +
> +static inline void futex_mm_hash_init(struct mm_struct *mm) { }
> +static inline void futex_mm_hash_exit(struct mm_struct *mm) { }
>  #endif
> +

Ah, the above was to make it consistent with this... mmmm... kay. Nevermind.

>  #endif
> --- /dev/null
> +++ b/include/linux/futex_types.h
> @@ -0,0 +1,12 @@
> +#ifndef _LINUX_FUTEX_TYPES_H
> +#define _LINUX_FUTEX_TYPES_H
> +
> +struct futex_hash_bucket;
> +
> +struct futex_hash {
> +	struct raw_spinlock		lock;

As it isn't always obvious to everone, it would be good to add a single line
comment stating why a *raw* spinlock is necessary.

In this case... I suppose this could lead to some nasty scenarios setting up IPC
mechanisms between threads if they weren't strictly serialized? Something else?

> +	unsigned int			hash_bits;
> +	struct futex_hash_bucket	*hash;
> +};
> +
> +#endif
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -11,6 +11,7 @@
>  #include <linux/completion.h>
>  #include <linux/cpumask.h>
>  #include <linux/uprobes.h>
> +#include <linux/futex_types.h>
>  #include <linux/page-flags-layout.h>
>  #include <asm/page.h>
>  #include <asm/mmu.h>
> @@ -442,6 +443,9 @@ struct mm_struct {
>  
>  	struct linux_binfmt *binfmt;
>  
> +#ifdef CONFIG_FUTEX_PRIVATE_HASH
> +	struct futex_hash futex_hash;
> +#endif
>  	cpumask_var_t cpu_vm_mask_var;
>  
>  	/* Architecture-specific MM context */
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1498,6 +1498,10 @@ config FUTEX
>  	  support for "fast userspace mutexes".  The resulting kernel may not
>  	  run glibc-based applications correctly.
>  
> +config FUTEX_PRIVATE_HASH
> +	bool
> +	default FUTEX && SMP
> +

So no prompt, not user selectable. If you have SMP, you get this? I think
automatic is a good call... but is SMP the right criteria, or would NUMA be more
appropriate since I thought it was keeping the hash local to the NUMA node that
was the big win?

>  config HAVE_FUTEX_CMPXCHG
>  	bool
>  	depends on FUTEX
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -617,6 +617,8 @@ static struct mm_struct *mm_init(struct
>  	mm_init_owner(mm, p);
>  	mmu_notifier_mm_init(mm);
>  	clear_tlb_flush_pending(mm);
> +	futex_mm_hash_init(mm);
> +
>  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
>  	mm->pmd_huge_pte = NULL;
>  #endif
> @@ -713,6 +715,7 @@ void mmput(struct mm_struct *mm)
>  		khugepaged_exit(mm); /* must run before exit_mmap */
>  		exit_mmap(mm);
>  		set_mm_exe_file(mm, NULL);
> +		futex_mm_hash_exit(mm);
>  		if (!list_empty(&mm->mmlist)) {
>  			spin_lock(&mmlist_lock);
>  			list_del(&mm->mmlist);
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -23,6 +23,9 @@
>   *  Copyright (C) IBM Corporation, 2009
>   *  Thanks to Thomas Gleixner for conceptual design and careful reviews.
>   *
> + *  Private hashed futex support by Sebastian Siewior and Thomas Gleixner
> + *  Copyright (C) Linutronix GmbH, 2016
> + *
>   *  Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
>   *  enough at me, Linus for the original (flawed) idea, Matthew
>   *  Kirkwood for proof-of-concept implementation.
> @@ -49,6 +52,7 @@
>  #include <linux/fs.h>
>  #include <linux/file.h>
>  #include <linux/jhash.h>
> +#include <linux/hash.h>
>  #include <linux/init.h>
>  #include <linux/futex.h>
>  #include <linux/mount.h>
> @@ -169,6 +173,34 @@
>   * the code that actually moves the futex(es) between hash buckets (requeue_futex)
>   * will do the additional required waiter count housekeeping. This is done for
>   * double_lock_hb() and double_unlock_hb(), respectively.
> + *
> + * For private futexes we (pre)allocate a per process hash. We check lockless
> + * whether the hash is already allocated. To access the hash later we need
> + * information about the hash properties as well. This requires barriers as
> + * follows:
> + *
> + * CPU 0					CPU 1
> + * check_hash_allocation()
> + *	if (mm->futex_hash.hash)
> + *		return;
> + *	hash = alloc_hash()
> + *	lock(&mm->futex_hash.lock);
> + *	if (!mm->futex_hash.hash) {
> + *	  mm->futex_hash.par = params;
> + *
> + *	  smp_wmb(); (A0) <-paired with-|
> + *					|
> + *	  mm->futex_hash.hash = hash;	|
> + *					|	check_hash_allocation()
> + *					|	   if (mm->futex_hash.hash)
> + *					|		return;
> + *	  unlock(&mm->futex_hash.lock);	|	get_futex_key_refs()
> + *					|
> + *					|--------- smp_mb() (B)
> + *						s = hash(f, mm->futex_hash.par);
> + *						hb = &mm->futex_hash.hash[s];
> + *
> + * So we utilize the existing smp_mb() in get_futex_key_refs().
>   */
>  
>  #ifndef CONFIG_HAVE_FUTEX_CMPXCHG
> @@ -255,6 +287,22 @@ struct futex_hash_bucket {
>  	struct plist_head chain;
>  } ____cacheline_aligned_in_smp;
>  
> +#ifdef CONFIG_FUTEX_PRIVATE_HASH
> +/*
> + * Process private hash for non-shared futexes
> + */
> +#define FUTEX_USE_GLOBAL_HASH		((void *) 0x03)
> +
> +#define FUTEX_MIN_HASH_BITS		order_base_2(4UL)
> +#define FUTEX_DEF_HASH_BITS		order_base_2(8UL)
> +#define FUTEX_MAX_HASH_BITS		order_base_2(256UL)
> +
> +unsigned int futex_default_hash_bits	= FUTEX_DEF_HASH_BITS;
> +unsigned int futex_max_hash_bits	= FUTEX_MAX_HASH_BITS;
> +#else
> +static const unsigned int futex_default_hash_bits = 0;
> +#endif
> +
>  /*
>   * The base of the bucket array and its size are always used together
>   * (after initialization only in hash_futex()), so ensure that they
> @@ -374,13 +422,13 @@ static inline int hb_waiters_pending(str
>  }
>  
>  /**
> - * hash_futex - Return the hash bucket in the global hash
> + * hash_global_futex - Return the hash bucket in the global hash
>   * @key:	Pointer to the futex key for which the hash is calculated
>   *
>   * We hash on the keys returned from get_futex_key (see below) and return the
>   * corresponding hash bucket in the global hash.
>   */
> -static struct futex_hash_bucket *hash_futex(union futex_key *key)
> +static struct futex_hash_bucket *hash_global_futex(union futex_key *key)
>  {
>  	u32 hash = jhash2((u32*)&key->both.word,
>  			  (sizeof(key->both.word)+sizeof(key->both.ptr))/4,
> @@ -388,9 +436,33 @@ static struct futex_hash_bucket *hash_fu
>  	return &futex_queues[hash & (futex_hashsize - 1)];
>  }
>  
> +/**
> + * hash_futex - Get the hash bucket for a futex
> + *
> + * Returns either the process private or the global hash bucket which fits the
> + * key.
> + */
> +static struct futex_hash_bucket *hash_futex(union futex_key *key)
> +{
> +#ifdef CONFIG_FUTEX_PRIVATE_HASH
> +	struct mm_struct *mm = current->mm;
> +	unsigned int slot;
> +
> +	/*
> +	 * Futexes which use the per process hash have the lower bits cleared
> +	 */
> +	if (key->both.offset & (FUT_OFF_INODE | FUT_OFF_MMSHARED))
> +		return hash_global_futex(key);
> +
> +	slot = hash_long(key->private.address, mm->futex_hash.hash_bits);
> +	return &mm->futex_hash.hash[slot];

Don't we also need to check if the private hash exists? Per the commit
description, if we fail to allocate the private hash, we fall back to using the
global hash...

...

-- 
Darren Hart
Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch V2 3/7] futex: Add op for hash preallocation
  2016-05-05 20:44 ` [patch V2 3/7] futex: Add op for hash preallocation Thomas Gleixner
@ 2016-05-06 18:18   ` Darren Hart
  2016-05-07  8:47     ` Thomas Gleixner
  2016-05-19 12:24   ` Peter Zijlstra
  2016-05-19 12:25   ` Peter Zijlstra
  2 siblings, 1 reply; 42+ messages in thread
From: Darren Hart @ 2016-05-06 18:18 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Sebastian Andrzej Siewior, Linus Torvalds, Darren Hart,
	Peter Zijlstra, Ingo Molnar, Michael Kerrisk, Davidlohr Bueso,
	Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Thu, May 05, 2016 at 08:44:05PM -0000, Thomas Gleixner wrote:
> From: Sebastian Siewior <bigeasy@linutronix.de>
> 
> The per process hash is allocated on the fly at the first futex operation of a
> process. The size of the hash is determined by a system wide default setting
> controlled by the sys admin, This is suboptimal for RT applications and
> applications with pathological futex abuse,
> 
>  - For RT applications its important to allocate the per process hash before the
>    first futex operation to avoid the allocation on the first futex operation.
> 
>  - For pathological applications which use gazillions of futexes its useful to
>    allocate a hash greater than the default hash size.

"it's" or preferably "it is"

> 
> Add a futex op which allows to preallocate the hash with the requested

"allows for preallocating"

> size. The size is limited by the systemwide maximum hash size, which can be

system-wide

> set by the admin. The requested size is rounded up to the next order of 2.
> 
> The function can be called several times, but ony the first call results in a
> hash allocation of the requested size as there is no non-intrusive way to
> reallocate/rehash in a multithreaded application.
> 
> Note, that this call must be issued before the first futex operation in the
> process because that would automatically allocate the default sized hash.

So this seems like it could be tricky for the user as system libraries, like
glibc, make use of futexes. Can we guarantee that "sys_futex" is not called by
the time main() is called?

> The function returns the actual hash size or 0 if the global hash is used. The
> latter is the case on UP and in the rare case that the allocation failed and
> the global hash is used as a fallback.
> 
> Signed-off-by: Sebastian Siewior <bigeasy@linutronix.de>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

...

-- 
Darren Hart
Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch V2 4/7] futex: Add sysctl knobs for process private hash
  2016-05-05 20:44 ` [patch V2 4/7] futex: Add sysctl knobs for process private hash Thomas Gleixner
@ 2016-05-06 18:22   ` Darren Hart
  2016-05-27 17:33     ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 42+ messages in thread
From: Darren Hart @ 2016-05-06 18:22 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Sebastian Andrzej Siewior, Linus Torvalds, Darren Hart,
	Peter Zijlstra, Ingo Molnar, Michael Kerrisk, Davidlohr Bueso,
	Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Thu, May 05, 2016 at 08:44:05PM -0000, Thomas Gleixner wrote:
> From: Sebastian Siewior <bigeasy@linutronix.de>
> 
> To adjust the default hash size and the maximum hash size for process private
> futexes we add the following sysctls:
> 
> futex_private_default_hash_bits:
> 
>      Adjusts the default hash size (in bits) which is used for automatic hash
>      allocations on the first futex operation
> 
> futex_private_max_hash_bits:
> 
>      Adjusts the maximum hash size (in bits). This limits the hash size which
>      can be preallocated by applications with the FUTEX_PREALLOC_HASH op.
> 
> Signed-off-by: Sebastian Siewior <bigeasy@linutronix.de>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  Documentation/sysctl/kernel.txt |   17 +++++++++++++++++
>  include/linux/futex.h           |    1 +
>  kernel/futex.c                  |    5 +++--
>  kernel/sysctl.c                 |   21 +++++++++++++++++++++
>  4 files changed, 42 insertions(+), 2 deletions(-)
> 
> --- a/Documentation/sysctl/kernel.txt
> +++ b/Documentation/sysctl/kernel.txt
> @@ -29,6 +29,8 @@ Currently, these files might (depending
>  - core_pipe_limit
>  - core_uses_pid
>  - ctrl-alt-del
> +- futex_private_default_hash_bits
> +- futex_private_max_hash_bits

This list (this context anyway) looks to have been alphabetical previously,
consider moving futex* between domainname and hostname?


-- 
Darren Hart
Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch V2 2/7] futex: Hash private futexes per process
  2016-05-06 18:09   ` Darren Hart
@ 2016-05-06 21:56     ` Darren Hart
  2016-05-07  8:45       ` Thomas Gleixner
  2016-05-07  8:44     ` Thomas Gleixner
  1 sibling, 1 reply; 42+ messages in thread
From: Darren Hart @ 2016-05-06 21:56 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Sebastian Andrzej Siewior, Linus Torvalds, Darren Hart,
	Peter Zijlstra, Ingo Molnar, Michael Kerrisk, Davidlohr Bueso,
	Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Fri, May 06, 2016 at 11:09:33AM -0700, Darren Hart wrote:
> On Thu, May 05, 2016 at 08:44:04PM -0000, Thomas Gleixner wrote:
> > From: Sebastian Siewior <bigeasy@linutronix.de>
> > 
> > The standard futex mechanism in the Linux kernel uses a global hash to store
> > transient state. Collisions on that hash can lead to performance degradation
> > especially on NUMA systems and on real-time enabled kernels even to priority
> > inversions.
> 
> I think it is worth noting the how this causes an unbounded priority inversion
> as it wasn't obvious to me. At least mention that "CPU pinning" can result in an
> unbounded priority inversion.
> 
> > 
> > To mitigate that problem we provide per process private hashing. On the first
> > futex operation in a process the kernel allocates a hash table. The hash table
> > is accessible via the process mm_struct. On Numa systems the hash is allocated
> > node local.
> > 
> > If the allocation fails then the global hash table is used as fallback, so
> > there is no user space visible impact of this feature.
> > 
> 
> It would be good to have a way to detect that the process private hash table was
> successfully created. Perhaps a /proc/pid/ feature? This would allow us to write
> a functional futex test for tools/testing/selftests/futex

I suppose we could just use FUTEX_PREALLOC_HASH for this purpose, passing in the
default hash size. This will either return the default, the previously set
value, or 0, indicating the global hash is being used. That should be sufficient
for programatically determining the state of the system.

The /proc/pid/futex_hash_size option may still be convenient for other purposes.
Perhaps with a -1 indicating it hasn't been set yet.

-- 
Darren Hart
Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch V2 2/7] futex: Hash private futexes per process
  2016-05-06 18:09   ` Darren Hart
  2016-05-06 21:56     ` Darren Hart
@ 2016-05-07  8:44     ` Thomas Gleixner
  2016-05-11 21:07       ` Darren Hart
  2016-05-27 16:36       ` Sebastian Andrzej Siewior
  1 sibling, 2 replies; 42+ messages in thread
From: Thomas Gleixner @ 2016-05-07  8:44 UTC (permalink / raw)
  To: Darren Hart
  Cc: LKML, Sebastian Andrzej Siewior, Linus Torvalds, Darren Hart,
	Peter Zijlstra, Ingo Molnar, Michael Kerrisk, Davidlohr Bueso,
	Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Fri, 6 May 2016, Darren Hart wrote:
> On Thu, May 05, 2016 at 08:44:04PM -0000, Thomas Gleixner wrote:
> > --- /dev/null
> > +++ b/include/linux/futex_types.h
> > @@ -0,0 +1,12 @@
> > +#ifndef _LINUX_FUTEX_TYPES_H
> > +#define _LINUX_FUTEX_TYPES_H
> > +
> > +struct futex_hash_bucket;
> > +
> > +struct futex_hash {
> > +	struct raw_spinlock		lock;
> 
> As it isn't always obvious to everone, it would be good to add a single line
> comment stating why a *raw* spinlock is necessary.

Well. Necessary. It protects the hash pointer and the hash bits. So the scope
is very limited and really does not need the heavy weight version of a
sleeping spinlock in RT.
 
> In this case... I suppose this could lead to some nasty scenarios setting up IPC
> mechanisms between threads if they weren't strictly serialized? Something else?

Sure, we need to serialize attempts to populate the hash. Especially in the
non preallocated case. The thing with raw vs. non raw spinlocks is that the
latter are expensive on RT and if there are just 5 instructions to protect it
does not make any sense to chose the heavy version.
 
> > +config FUTEX_PRIVATE_HASH
> > +	bool
> > +	default FUTEX && SMP
> > +
> 
> So no prompt, not user selectable. If you have SMP, you get this? I think
> automatic is a good call... but is SMP the right criteria, or would NUMA be more
> appropriate since I thought it was keeping the hash local to the NUMA node that
> was the big win?

Yes, we can make it depend on NUMA. I even thought about making a run time
decision for non preallocated ones when the machine is not numa. But for test
coverage I wanted to have it as widely used as possible.
 
> > +	/*
> > +	 * Futexes which use the per process hash have the lower bits cleared
> > +	 */
> > +	if (key->both.offset & (FUT_OFF_INODE | FUT_OFF_MMSHARED))
> > +		return hash_global_futex(key);
> > +
> > +	slot = hash_long(key->private.address, mm->futex_hash.hash_bits);
> > +	return &mm->futex_hash.hash[slot];
> 
> Don't we also need to check if the private hash exists? Per the commit
> description, if we fail to allocate the private hash, we fall back to using the
> global hash...

If we fall back to the global hash, then the lower bits in offset are not
0. So the hash is guaranteed to be available.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch V2 2/7] futex: Hash private futexes per process
  2016-05-06 21:56     ` Darren Hart
@ 2016-05-07  8:45       ` Thomas Gleixner
  2016-05-11 21:08         ` Darren Hart
  0 siblings, 1 reply; 42+ messages in thread
From: Thomas Gleixner @ 2016-05-07  8:45 UTC (permalink / raw)
  To: Darren Hart
  Cc: LKML, Sebastian Andrzej Siewior, Linus Torvalds, Darren Hart,
	Peter Zijlstra, Ingo Molnar, Michael Kerrisk, Davidlohr Bueso,
	Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Fri, 6 May 2016, Darren Hart wrote:
 > It would be good to have a way to detect that the process private hash table was
> > successfully created. Perhaps a /proc/pid/ feature? This would allow us to write
> > a functional futex test for tools/testing/selftests/futex
> 
> I suppose we could just use FUTEX_PREALLOC_HASH for this purpose, passing in the
> default hash size. This will either return the default, the previously set
> value, or 0, indicating the global hash is being used. That should be sufficient
> for programatically determining the state of the system.

Right.
 
> The /proc/pid/futex_hash_size option may still be convenient for other purposes.
> Perhaps with a -1 indicating it hasn't been set yet.

Dunno, whether that's valuable, but it can be done on top.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch V2 3/7] futex: Add op for hash preallocation
  2016-05-06 18:18   ` Darren Hart
@ 2016-05-07  8:47     ` Thomas Gleixner
  2016-05-07 11:40       ` Thomas Gleixner
  2016-05-19 12:28       ` Peter Zijlstra
  0 siblings, 2 replies; 42+ messages in thread
From: Thomas Gleixner @ 2016-05-07  8:47 UTC (permalink / raw)
  To: Darren Hart
  Cc: LKML, Sebastian Andrzej Siewior, Linus Torvalds, Darren Hart,
	Peter Zijlstra, Ingo Molnar, Michael Kerrisk, Davidlohr Bueso,
	Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Fri, 6 May 2016, Darren Hart wrote:
> On Thu, May 05, 2016 at 08:44:05PM -0000, Thomas Gleixner wrote:
> > From: Sebastian Siewior <bigeasy@linutronix.de>
> > 
> > The per process hash is allocated on the fly at the first futex operation of a
> > process. The size of the hash is determined by a system wide default setting
> > controlled by the sys admin, This is suboptimal for RT applications and
> > applications with pathological futex abuse,
> > 
> >  - For RT applications its important to allocate the per process hash before the
> >    first futex operation to avoid the allocation on the first futex operation.
> > 
> >  - For pathological applications which use gazillions of futexes its useful to
> >    allocate a hash greater than the default hash size.
> 
> "it's" or preferably "it is"
> 
> > 
> > Add a futex op which allows to preallocate the hash with the requested
> 
> "allows for preallocating"
> 
> > size. The size is limited by the systemwide maximum hash size, which can be
> 
> system-wide
> 
> > set by the admin. The requested size is rounded up to the next order of 2.
> > 
> > The function can be called several times, but ony the first call results in a
> > hash allocation of the requested size as there is no non-intrusive way to
> > reallocate/rehash in a multithreaded application.
> > 
> > Note, that this call must be issued before the first futex operation in the
> > process because that would automatically allocate the default sized hash.
> 
> So this seems like it could be tricky for the user as system libraries, like
> glibc, make use of futexes. Can we guarantee that "sys_futex" is not called by
> the time main() is called?

To the extent of my testing I never observed that the hash was automatically
created when I called futex(PREALLOC) right away in main. But yes, that might
need some thought.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch V2 3/7] futex: Add op for hash preallocation
  2016-05-07  8:47     ` Thomas Gleixner
@ 2016-05-07 11:40       ` Thomas Gleixner
  2016-05-19 12:28       ` Peter Zijlstra
  1 sibling, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2016-05-07 11:40 UTC (permalink / raw)
  To: Darren Hart
  Cc: LKML, Sebastian Andrzej Siewior, Linus Torvalds, Darren Hart,
	Peter Zijlstra, Ingo Molnar, Michael Kerrisk, Davidlohr Bueso,
	Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Sat, 7 May 2016, Thomas Gleixner wrote:
> On Fri, 6 May 2016, Darren Hart wrote:
> > > Note, that this call must be issued before the first futex operation in the
> > > process because that would automatically allocate the default sized hash.
> > 
> > So this seems like it could be tricky for the user as system libraries, like
> > glibc, make use of futexes. Can we guarantee that "sys_futex" is not called by
> > the time main() is called?
> 
> To the extent of my testing I never observed that the hash was automatically
> created when I called futex(PREALLOC) right away in main. But yes, that might
> need some thought.

Thinking more about it. If a process is single threaded and it definitely is
up to the point where it reaches main(), there is nothing which might cause a
sys_futex() call except something which would use shared futexes in the depth
of init code. I doubt that this happens, and if it does, then it's some non
standard feature^Whackery which I do not care about at all.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch V2 2/7] futex: Hash private futexes per process
  2016-05-07  8:44     ` Thomas Gleixner
@ 2016-05-11 21:07       ` Darren Hart
  2016-05-27 16:36       ` Sebastian Andrzej Siewior
  1 sibling, 0 replies; 42+ messages in thread
From: Darren Hart @ 2016-05-11 21:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Sebastian Andrzej Siewior, Linus Torvalds, Darren Hart,
	Peter Zijlstra, Ingo Molnar, Michael Kerrisk, Davidlohr Bueso,
	Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Sat, May 07, 2016 at 10:44:39AM +0200, Thomas Gleixner wrote:
> On Fri, 6 May 2016, Darren Hart wrote:
> > On Thu, May 05, 2016 at 08:44:04PM -0000, Thomas Gleixner wrote:
> > > --- /dev/null
> > > +++ b/include/linux/futex_types.h
> > > @@ -0,0 +1,12 @@
> > > +#ifndef _LINUX_FUTEX_TYPES_H
> > > +#define _LINUX_FUTEX_TYPES_H
> > > +
> > > +struct futex_hash_bucket;
> > > +
> > > +struct futex_hash {
> > > +	struct raw_spinlock		lock;
> > 
> > As it isn't always obvious to everone, it would be good to add a single line
> > comment stating why a *raw* spinlock is necessary.
> 
> Well. Necessary. It protects the hash pointer and the hash bits. So the scope
> is very limited and really does not need the heavy weight version of a
> sleeping spinlock in RT.
>  
> > In this case... I suppose this could lead to some nasty scenarios setting up IPC
> > mechanisms between threads if they weren't strictly serialized? Something else?
> 
> Sure, we need to serialize attempts to populate the hash. Especially in the
> non preallocated case. The thing with raw vs. non raw spinlocks is that the
> latter are expensive on RT and if there are just 5 instructions to protect it
> does not make any sense to chose the heavy version.
>  
> > > +config FUTEX_PRIVATE_HASH
> > > +	bool
> > > +	default FUTEX && SMP
> > > +
> > 
> > So no prompt, not user selectable. If you have SMP, you get this? I think
> > automatic is a good call... but is SMP the right criteria, or would NUMA be more
> > appropriate since I thought it was keeping the hash local to the NUMA node that
> > was the big win?
> 
> Yes, we can make it depend on NUMA. I even thought about making a run time
> decision for non preallocated ones when the machine is not numa. But for test
> coverage I wanted to have it as widely used as possible.

OK, understood to here.

> > > +	/*
> > > +	 * Futexes which use the per process hash have the lower bits cleared
> > > +	 */
> > > +	if (key->both.offset & (FUT_OFF_INODE | FUT_OFF_MMSHARED))
> > > +		return hash_global_futex(key);
> > > +
> > > +	slot = hash_long(key->private.address, mm->futex_hash.hash_bits);
> > > +	return &mm->futex_hash.hash[slot];
> > 
> > Don't we also need to check if the private hash exists? Per the commit
> > description, if we fail to allocate the private hash, we fall back to using the
> > global hash...
> 
> If we fall back to the global hash, then the lower bits in offset are not
> 0. So the hash is guaranteed to be available.
> 

Ah right. Since the position of the bits in the two flags isn't obvious when
reading the test, the comment about the lower bits being cleared didn't
translate to that case being implicitly covered by the test.

Maybe make this explicit?

/*
 * Only private futexes use the per process hash and they will not have
 * FUT_OFF_INODE nor FUT_OFF_MMSHARED set.
 */


-- 
Darren Hart
Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch V2 2/7] futex: Hash private futexes per process
  2016-05-07  8:45       ` Thomas Gleixner
@ 2016-05-11 21:08         ` Darren Hart
  0 siblings, 0 replies; 42+ messages in thread
From: Darren Hart @ 2016-05-11 21:08 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Sebastian Andrzej Siewior, Linus Torvalds, Darren Hart,
	Peter Zijlstra, Ingo Molnar, Michael Kerrisk, Davidlohr Bueso,
	Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Sat, May 07, 2016 at 10:45:57AM +0200, Thomas Gleixner wrote:
> On Fri, 6 May 2016, Darren Hart wrote:
>  > It would be good to have a way to detect that the process private hash table was
> > > successfully created. Perhaps a /proc/pid/ feature? This would allow us to write
> > > a functional futex test for tools/testing/selftests/futex
> > 
> > I suppose we could just use FUTEX_PREALLOC_HASH for this purpose, passing in the
> > default hash size. This will either return the default, the previously set
> > value, or 0, indicating the global hash is being used. That should be sufficient
> > for programatically determining the state of the system.
> 
> Right.
>  
> > The /proc/pid/futex_hash_size option may still be convenient for other purposes.
> > Perhaps with a -1 indicating it hasn't been set yet.
> 
> Dunno, whether that's valuable, but it can be done on top.

Agreed. We can leave that to the kselftest patch, and if it's easily done
without this, then we're done. If not, we can look at it then.

-- 
Darren Hart
Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch V2 2/7] futex: Hash private futexes per process
  2016-05-05 20:44 ` [patch V2 2/7] futex: Hash private futexes per process Thomas Gleixner
  2016-05-06 18:09   ` Darren Hart
@ 2016-05-19 12:21   ` Peter Zijlstra
  2016-05-27 16:52     ` Sebastian Andrzej Siewior
  2016-05-19 12:24   ` Peter Zijlstra
  2 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2016-05-19 12:21 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Sebastian Andrzej Siewior, Linus Torvalds, Darren Hart,
	Ingo Molnar, Michael Kerrisk, Davidlohr Bueso, Chris Mason,
	Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Thu, May 05, 2016 at 08:44:04PM -0000, Thomas Gleixner wrote:
> +struct futex_hash {
> +	struct raw_spinlock		lock;

	raw_spinlock_t ?

> +	unsigned int			hash_bits;
> +	struct futex_hash_bucket	*hash;
> +};

> +static void futex_populate_hash(unsigned int hash_bits)
> +{
> +	struct mm_struct *mm = current->mm;
> +	struct futex_hash_bucket *hb = NULL;
> +
> +	/*
> +	 * We don't need an explicit smp_mb() when the hash is populated
> +	 * because before we dereference mm->futex_hash.hash_bits in the hash
> +	 * function we have an smp_mb() in futex_get_key_refs() already.
> +	 */
> +	if (mm->futex_hash.hash)
> +		return;
> +
> +	/*
> +	 * If we failed to allocate a hash on the fly, fall back to the global
> +	 * hash.
> +	 */
> +	hb = futex_alloc_hash(hash_bits);
> +	if (!hb)
> +		hb = FUTEX_USE_GLOBAL_HASH;
> +
> +	raw_spin_lock(&mm->futex_hash.lock);
> +	/* We might have raced with another task allocating the hash. */
> +	if (!mm->futex_hash.hash) {
> +		mm->futex_hash.hash_bits = hash_bits;
> +		/*
> +		 * Ensure that the above is visible before we store
> +		 * the pointer.
> +		 */
> +		smp_wmb(); /* (A0) Pairs with (B) */
> +		mm->futex_hash.hash = hb;

		smp_store_release(&mm->futex_hash.hash, hb); ?

> +		hb = NULL;
> +	}
> +	raw_spin_unlock(&mm->futex_hash.lock);
> +	kfree(hb);
> +}

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch V2 2/7] futex: Hash private futexes per process
  2016-05-05 20:44 ` [patch V2 2/7] futex: Hash private futexes per process Thomas Gleixner
  2016-05-06 18:09   ` Darren Hart
  2016-05-19 12:21   ` Peter Zijlstra
@ 2016-05-19 12:24   ` Peter Zijlstra
  2016-05-27 17:10     ` Sebastian Andrzej Siewior
  2 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2016-05-19 12:24 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Sebastian Andrzej Siewior, Linus Torvalds, Darren Hart,
	Ingo Molnar, Michael Kerrisk, Davidlohr Bueso, Chris Mason,
	Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Thu, May 05, 2016 at 08:44:04PM -0000, Thomas Gleixner wrote:
> +static struct futex_hash_bucket *hash_futex(union futex_key *key)
> +{
> +#ifdef CONFIG_FUTEX_PRIVATE_HASH
> +	struct mm_struct *mm = current->mm;
> +	unsigned int slot;
> +
> +	/*
> +	 * Futexes which use the per process hash have the lower bits cleared
> +	 */
> +	if (key->both.offset & (FUT_OFF_INODE | FUT_OFF_MMSHARED))
> +		return hash_global_futex(key);
> +
> +	slot = hash_long(key->private.address, mm->futex_hash.hash_bits);
> +	return &mm->futex_hash.hash[slot];

Do we want the option to WARN if we get collisions in this per-process
hash?

Because afaiu there is no guarantee what so ever this doesn't happen,
and collisions here can create the very same priority inversions as are
possible in the global hash.

Less likely etc.. more contained since its only the threads of the one
process that get tangled up, but still possible.

> +#else
> +	return hash_global_futex(key);
> +#endif
> +}

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch V2 3/7] futex: Add op for hash preallocation
  2016-05-05 20:44 ` [patch V2 3/7] futex: Add op for hash preallocation Thomas Gleixner
  2016-05-06 18:18   ` Darren Hart
@ 2016-05-19 12:24   ` Peter Zijlstra
  2016-05-19 19:38     ` Darren Hart
  2016-05-19 12:25   ` Peter Zijlstra
  2 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2016-05-19 12:24 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Sebastian Andrzej Siewior, Linus Torvalds, Darren Hart,
	Ingo Molnar, Michael Kerrisk, Davidlohr Bueso, Chris Mason,
	Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Thu, May 05, 2016 at 08:44:05PM -0000, Thomas Gleixner wrote:
> From: Sebastian Siewior <bigeasy@linutronix.de>
> 
> The per process hash is allocated on the fly at the first futex operation of a
> process. The size of the hash is determined by a system wide default setting
> controlled by the sys admin, This is suboptimal for RT applications and
> applications with pathological futex abuse,
> 
>  - For RT applications its important to allocate the per process hash before the
>    first futex operation to avoid the allocation on the first futex operation.
> 
>  - For pathological applications which use gazillions of futexes its useful to
>    allocate a hash greater than the default hash size.
> 
> Add a futex op which allows to preallocate the hash with the requested
> size. The size is limited by the systemwide maximum hash size, which can be
> set by the admin. The requested size is rounded up to the next order of 2.
> 
> The function can be called several times, but ony the first call results in a
> hash allocation of the requested size as there is no non-intrusive way to
> reallocate/rehash in a multithreaded application.
> 
> Note, that this call must be issued before the first futex operation in the
> process because that would automatically allocate the default sized hash.
> 
> The function returns the actual hash size or 0 if the global hash is used. The
> latter is the case on UP and in the rare case that the allocation failed and
> the global hash is used as a fallback.

OK, so no on-line rehashing possible?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch V2 3/7] futex: Add op for hash preallocation
  2016-05-05 20:44 ` [patch V2 3/7] futex: Add op for hash preallocation Thomas Gleixner
  2016-05-06 18:18   ` Darren Hart
  2016-05-19 12:24   ` Peter Zijlstra
@ 2016-05-19 12:25   ` Peter Zijlstra
  2016-05-27 17:27     ` Sebastian Andrzej Siewior
  2 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2016-05-19 12:25 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Sebastian Andrzej Siewior, Linus Torvalds, Darren Hart,
	Ingo Molnar, Michael Kerrisk, Davidlohr Bueso, Chris Mason,
	Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Thu, May 05, 2016 at 08:44:05PM -0000, Thomas Gleixner wrote:
> +static int futex_preallocate_hash(unsigned int slots)
> +{
> +#ifdef CONFIG_FUTEX_PRIVATE_HASH
> +	struct mm_struct *mm = current->mm;
> +	struct futex_hash_bucket *hb;
> +	unsigned int bits;
> +
> +	/* Try to allocate the requested nr of slots */
> +	bits = order_base_2(slots);
> +
> +	if (bits < FUTEX_MIN_HASH_BITS)
> +		bits = FUTEX_MIN_HASH_BITS;
> +
> +	if (bits > futex_max_hash_bits)
> +		bits = futex_max_hash_bits;
> +
> +	futex_populate_hash(bits);

Should we not simply fail if the provided number of slots is not a power
of 2 ?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch V2 3/7] futex: Add op for hash preallocation
  2016-05-07  8:47     ` Thomas Gleixner
  2016-05-07 11:40       ` Thomas Gleixner
@ 2016-05-19 12:28       ` Peter Zijlstra
  2016-05-19 19:36         ` Darren Hart
  1 sibling, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2016-05-19 12:28 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Darren Hart, LKML, Sebastian Andrzej Siewior, Linus Torvalds,
	Darren Hart, Ingo Molnar, Michael Kerrisk, Davidlohr Bueso,
	Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Sat, May 07, 2016 at 10:47:38AM +0200, Thomas Gleixner wrote:
> On Fri, 6 May 2016, Darren Hart wrote:

> > So this seems like it could be tricky for the user as system libraries, like
> > glibc, make use of futexes. Can we guarantee that "sys_futex" is not called by
> > the time main() is called?
> 
> To the extent of my testing I never observed that the hash was automatically
> created when I called futex(PREALLOC) right away in main. But yes, that might
> need some thought.

I suspect that even if glibc uses futexes before main(), they will not
be contended, because, last time I checked, the C runtime environment is
very much single threaded unless explicitly made not so by the program.

In any case (re)hashing if the hash is empty is 'easy', if there's already state,
not so much.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch V2 3/7] futex: Add op for hash preallocation
  2016-05-19 12:28       ` Peter Zijlstra
@ 2016-05-19 19:36         ` Darren Hart
  0 siblings, 0 replies; 42+ messages in thread
From: Darren Hart @ 2016-05-19 19:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, LKML, Sebastian Andrzej Siewior, Linus Torvalds,
	Darren Hart, Ingo Molnar, Michael Kerrisk, Davidlohr Bueso,
	Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Thu, May 19, 2016 at 02:28:49PM +0200, Peter Zijlstra wrote:
> On Sat, May 07, 2016 at 10:47:38AM +0200, Thomas Gleixner wrote:
> > On Fri, 6 May 2016, Darren Hart wrote:
> 
> > > So this seems like it could be tricky for the user as system libraries, like
> > > glibc, make use of futexes. Can we guarantee that "sys_futex" is not called by
> > > the time main() is called?
> > 
> > To the extent of my testing I never observed that the hash was automatically
> > created when I called futex(PREALLOC) right away in main. But yes, that might
> > need some thought.
> 
> I suspect that even if glibc uses futexes before main(), they will not
> be contended, because, last time I checked, the C runtime environment is
> very much single threaded unless explicitly made not so by the program.
> 
> In any case (re)hashing if the hash is empty is 'easy', if there's already state,
> not so much.

It certainly would be nice to be able to resize the hash if it's empty.

-- 
Darren Hart
Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch V2 3/7] futex: Add op for hash preallocation
  2016-05-19 12:24   ` Peter Zijlstra
@ 2016-05-19 19:38     ` Darren Hart
  2016-05-20  4:50       ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: Darren Hart @ 2016-05-19 19:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, LKML, Sebastian Andrzej Siewior, Linus Torvalds,
	Darren Hart, Ingo Molnar, Michael Kerrisk, Davidlohr Bueso,
	Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Thu, May 19, 2016 at 02:24:57PM +0200, Peter Zijlstra wrote:
> On Thu, May 05, 2016 at 08:44:05PM -0000, Thomas Gleixner wrote:
> > From: Sebastian Siewior <bigeasy@linutronix.de>
> > 
> > The per process hash is allocated on the fly at the first futex operation of a
> > process. The size of the hash is determined by a system wide default setting
> > controlled by the sys admin, This is suboptimal for RT applications and
> > applications with pathological futex abuse,
> > 
> >  - For RT applications its important to allocate the per process hash before the
> >    first futex operation to avoid the allocation on the first futex operation.
> > 
> >  - For pathological applications which use gazillions of futexes its useful to
> >    allocate a hash greater than the default hash size.
> > 
> > Add a futex op which allows to preallocate the hash with the requested
> > size. The size is limited by the systemwide maximum hash size, which can be
> > set by the admin. The requested size is rounded up to the next order of 2.
> > 
> > The function can be called several times, but ony the first call results in a
> > hash allocation of the requested size as there is no non-intrusive way to
> > reallocate/rehash in a multithreaded application.
> > 
> > Note, that this call must be issued before the first futex operation in the
> > process because that would automatically allocate the default sized hash.
> > 
> > The function returns the actual hash size or 0 if the global hash is used. The
> > latter is the case on UP and in the rare case that the allocation failed and
> > the global hash is used as a fallback.
> 
> OK, so no on-line rehashing possible?

It doesn't do it currently... did you see something that makes it impossible to
add?


-- 
Darren Hart
Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch V2 3/7] futex: Add op for hash preallocation
  2016-05-19 19:38     ` Darren Hart
@ 2016-05-20  4:50       ` Peter Zijlstra
  0 siblings, 0 replies; 42+ messages in thread
From: Peter Zijlstra @ 2016-05-20  4:50 UTC (permalink / raw)
  To: Darren Hart
  Cc: Thomas Gleixner, LKML, Sebastian Andrzej Siewior, Linus Torvalds,
	Darren Hart, Ingo Molnar, Michael Kerrisk, Davidlohr Bueso,
	Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Thu, May 19, 2016 at 12:38:19PM -0700, Darren Hart wrote:
> On Thu, May 19, 2016 at 02:24:57PM +0200, Peter Zijlstra wrote:
> > OK, so no on-line rehashing possible?
> 
> It doesn't do it currently... did you see something that makes it impossible to
> add?

No, just tons of tricky.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch V2 2/7] futex: Hash private futexes per process
  2016-05-07  8:44     ` Thomas Gleixner
  2016-05-11 21:07       ` Darren Hart
@ 2016-05-27 16:36       ` Sebastian Andrzej Siewior
  1 sibling, 0 replies; 42+ messages in thread
From: Sebastian Andrzej Siewior @ 2016-05-27 16:36 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Darren Hart, LKML, Linus Torvalds, Darren Hart, Peter Zijlstra,
	Ingo Molnar, Michael Kerrisk, Davidlohr Bueso, Chris Mason,
	Carlos O'Donell, Torvald Riegel, Eric Dumazet

On 2016-05-07 10:44:39 [+0200], Thomas Gleixner wrote:
> On Fri, 6 May 2016, Darren Hart wrote:
> > On Thu, May 05, 2016 at 08:44:04PM -0000, Thomas Gleixner wrote:
> Sure, we need to serialize attempts to populate the hash. Especially in the
> non preallocated case. The thing with raw vs. non raw spinlocks is that the
> latter are expensive on RT and if there are just 5 instructions to protect it
> does not make any sense to chose the heavy version.
>  
> > > +config FUTEX_PRIVATE_HASH
> > > +	bool
> > > +	default FUTEX && SMP
> > > +
> > 
> > So no prompt, not user selectable. If you have SMP, you get this? I think
> > automatic is a good call... but is SMP the right criteria, or would NUMA be more
> > appropriate since I thought it was keeping the hash local to the NUMA node that
> > was the big win?
> 
> Yes, we can make it depend on NUMA. I even thought about making a run time
> decision for non preallocated ones when the machine is not numa. But for test
> coverage I wanted to have it as widely used as possible.

Do we want to change it to autodetect NUMA at runtime or do we want to
keep it as is for now?

Sebastian

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch V2 2/7] futex: Hash private futexes per process
  2016-05-19 12:21   ` Peter Zijlstra
@ 2016-05-27 16:52     ` Sebastian Andrzej Siewior
  2016-05-30  8:43       ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: Sebastian Andrzej Siewior @ 2016-05-27 16:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, LKML, Sebastian Andrzej Siewior, Linus Torvalds,
	Darren Hart, Ingo Molnar, Michael Kerrisk, Davidlohr Bueso,
	Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet

On 2016-05-19 14:21:48 [+0200], Peter Zijlstra wrote:
> > +static void futex_populate_hash(unsigned int hash_bits)
> > +{
> > +	raw_spin_lock(&mm->futex_hash.lock);
> > +	/* We might have raced with another task allocating the hash. */
> > +	if (!mm->futex_hash.hash) {
> > +		mm->futex_hash.hash_bits = hash_bits;
> > +		/*
> > +		 * Ensure that the above is visible before we store
> > +		 * the pointer.
> > +		 */
> > +		smp_wmb(); /* (A0) Pairs with (B) */
> > +		mm->futex_hash.hash = hb;
> 
> 		smp_store_release(&mm->futex_hash.hash, hb); ?

just to be clear: You suggest to use "smp_store_release()" instead
smp_wmb() followed by the assignment?

Sebastian

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch V2 2/7] futex: Hash private futexes per process
  2016-05-19 12:24   ` Peter Zijlstra
@ 2016-05-27 17:10     ` Sebastian Andrzej Siewior
  2016-05-30  8:58       ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: Sebastian Andrzej Siewior @ 2016-05-27 17:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, LKML, Sebastian Andrzej Siewior, Linus Torvalds,
	Darren Hart, Ingo Molnar, Michael Kerrisk, Davidlohr Bueso,
	Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet

On 2016-05-19 14:24:06 [+0200], Peter Zijlstra wrote:
> On Thu, May 05, 2016 at 08:44:04PM -0000, Thomas Gleixner wrote:
> > +static struct futex_hash_bucket *hash_futex(union futex_key *key)
> > +{
> > +#ifdef CONFIG_FUTEX_PRIVATE_HASH
> > +	struct mm_struct *mm = current->mm;
> > +	unsigned int slot;
> > +
> > +	/*
> > +	 * Futexes which use the per process hash have the lower bits cleared
> > +	 */
> > +	if (key->both.offset & (FUT_OFF_INODE | FUT_OFF_MMSHARED))
> > +		return hash_global_futex(key);
> > +
> > +	slot = hash_long(key->private.address, mm->futex_hash.hash_bits);
> > +	return &mm->futex_hash.hash[slot];
> 
> Do we want the option to WARN if we get collisions in this per-process
> hash?
> 
> Because afaiu there is no guarantee what so ever this doesn't happen,
> and collisions here can create the very same priority inversions as are
> possible in the global hash.
> 
> Less likely etc.. more contained since its only the threads of the one
> process that get tangled up, but still possible.

Since the collision is contained the same process it is less dramatic.
But how do you want to warn the user? A trace-event would be handy to
dump the uaddr and slot. The user would have to check the trace and
figure out which slot was assigend to different uaddr. But due to ASLR
the same application might result in a different behaviour on each run.
However, it might be good for a indication about the size of the private
hash…

Sebastian

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch V2 3/7] futex: Add op for hash preallocation
  2016-05-19 12:25   ` Peter Zijlstra
@ 2016-05-27 17:27     ` Sebastian Andrzej Siewior
  2016-05-30  8:59       ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: Sebastian Andrzej Siewior @ 2016-05-27 17:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, LKML, Sebastian Andrzej Siewior, Linus Torvalds,
	Darren Hart, Ingo Molnar, Michael Kerrisk, Davidlohr Bueso,
	Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet

On 2016-05-19 14:25:58 [+0200], Peter Zijlstra wrote:
> On Thu, May 05, 2016 at 08:44:05PM -0000, Thomas Gleixner wrote:
> > +static int futex_preallocate_hash(unsigned int slots)
> > +{
> > +#ifdef CONFIG_FUTEX_PRIVATE_HASH
> > +	struct mm_struct *mm = current->mm;
> > +	struct futex_hash_bucket *hb;
> > +	unsigned int bits;
> > +
> > +	/* Try to allocate the requested nr of slots */
> > +	bits = order_base_2(slots);
> > +
> > +	if (bits < FUTEX_MIN_HASH_BITS)
> > +		bits = FUTEX_MIN_HASH_BITS;
> > +
> > +	if (bits > futex_max_hash_bits)
> > +		bits = futex_max_hash_bits;
> > +
> > +	futex_populate_hash(bits);
> 
> Should we not simply fail if the provided number of slots is not a power
> of 2 ?

We could if it is worth doing so. The procfs interface which limits the
upper / lower limit is bits based. This is slot based which then gets
converted to the number if bits.
If we align this interface with proc's limits then we would expect the
number of bits instead slots - now check for power of two anymore.
Anyone?

Sebastian

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch V2 4/7] futex: Add sysctl knobs for process private hash
  2016-05-06 18:22   ` Darren Hart
@ 2016-05-27 17:33     ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 42+ messages in thread
From: Sebastian Andrzej Siewior @ 2016-05-27 17:33 UTC (permalink / raw)
  To: Darren Hart
  Cc: Thomas Gleixner, LKML, Sebastian Andrzej Siewior, Linus Torvalds,
	Darren Hart, Peter Zijlstra, Ingo Molnar, Michael Kerrisk,
	Davidlohr Bueso, Chris Mason, Carlos O'Donell,
	Torvald Riegel, Eric Dumazet

On 2016-05-06 11:22:04 [-0700], Darren Hart wrote:
> > --- a/Documentation/sysctl/kernel.txt
> > +++ b/Documentation/sysctl/kernel.txt
> > @@ -29,6 +29,8 @@ Currently, these files might (depending
> >  - core_pipe_limit
> >  - core_uses_pid
> >  - ctrl-alt-del
> > +- futex_private_default_hash_bits
> > +- futex_private_max_hash_bits
> 
> This list (this context anyway) looks to have been alphabetical previously,
> consider moving futex* between domainname and hostname?
This is list is mostly sorted, yes. I will move it.

Sebastian

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch V2 2/7] futex: Hash private futexes per process
  2016-05-27 16:52     ` Sebastian Andrzej Siewior
@ 2016-05-30  8:43       ` Peter Zijlstra
  0 siblings, 0 replies; 42+ messages in thread
From: Peter Zijlstra @ 2016-05-30  8:43 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Thomas Gleixner, LKML, Sebastian Andrzej Siewior, Linus Torvalds,
	Darren Hart, Ingo Molnar, Michael Kerrisk, Davidlohr Bueso,
	Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Fri, May 27, 2016 at 06:52:11PM +0200, Sebastian Andrzej Siewior wrote:
> On 2016-05-19 14:21:48 [+0200], Peter Zijlstra wrote:
> > > +static void futex_populate_hash(unsigned int hash_bits)
> > > +{
> …
> > > +	raw_spin_lock(&mm->futex_hash.lock);
> > > +	/* We might have raced with another task allocating the hash. */
> > > +	if (!mm->futex_hash.hash) {
> > > +		mm->futex_hash.hash_bits = hash_bits;
> > > +		/*
> > > +		 * Ensure that the above is visible before we store
> > > +		 * the pointer.
> > > +		 */
> > > +		smp_wmb(); /* (A0) Pairs with (B) */
> > > +		mm->futex_hash.hash = hb;
> > 
> > 		smp_store_release(&mm->futex_hash.hash, hb); ?
> 
> just to be clear: You suggest to use "smp_store_release()" instead
> smp_wmb() followed by the assignment?

Yes, smp_store_release() is the most natural way to publish things like
this. Note that rcu_assign_pointer() also switched to using that. See
commit: 88c1863066cc ("rcu: Define rcu_assign_pointer() in terms of
smp_store_release()") for detail on the difference.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch V2 2/7] futex: Hash private futexes per process
  2016-05-27 17:10     ` Sebastian Andrzej Siewior
@ 2016-05-30  8:58       ` Peter Zijlstra
  2016-05-30 11:08         ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2016-05-30  8:58 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Thomas Gleixner, LKML, Sebastian Andrzej Siewior, Linus Torvalds,
	Darren Hart, Ingo Molnar, Michael Kerrisk, Davidlohr Bueso,
	Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Fri, May 27, 2016 at 07:10:01PM +0200, Sebastian Andrzej Siewior wrote:
> On 2016-05-19 14:24:06 [+0200], Peter Zijlstra wrote:
> > On Thu, May 05, 2016 at 08:44:04PM -0000, Thomas Gleixner wrote:
> > > +static struct futex_hash_bucket *hash_futex(union futex_key *key)
> > > +{
> > > +#ifdef CONFIG_FUTEX_PRIVATE_HASH
> > > +	struct mm_struct *mm = current->mm;
> > > +	unsigned int slot;
> > > +
> > > +	/*
> > > +	 * Futexes which use the per process hash have the lower bits cleared
> > > +	 */
> > > +	if (key->both.offset & (FUT_OFF_INODE | FUT_OFF_MMSHARED))
> > > +		return hash_global_futex(key);
> > > +
> > > +	slot = hash_long(key->private.address, mm->futex_hash.hash_bits);
> > > +	return &mm->futex_hash.hash[slot];
> > 
> > Do we want the option to WARN if we get collisions in this per-process
> > hash?
> > 
> > Because afaiu there is no guarantee what so ever this doesn't happen,
> > and collisions here can create the very same priority inversions as are
> > possible in the global hash.
> > 
> > Less likely etc.. more contained since its only the threads of the one
> > process that get tangled up, but still possible.
> 
> Since the collision is contained the same process it is less dramatic.

Right, but can still cause significant malfunction inside the process.
So its not something to completely ignore. If your room sized CNC
machine gets the priorities of the logging thread and the motor control
thread confused bad things could happen.

> But how do you want to warn the user? A trace-event would be handy to
> dump the uaddr and slot.

So I think there's a number of cases:

 - PREALLOC_HASH finds a taken bucket; in this case we can simply return
   an error.
 - PREALLOC_HASH succeeds, but an on demand hash later hits the same
   bucket. This is harder; we could maybe mark all buckets taken by
   PREALLOC_HASH and allow for a signal when this collision hits. Dunno.

> The user would have to check the trace and
> figure out which slot was assigend to different uaddr. 

Yeah, that's not really workable, might work for debugging, but blergh.

> But due to ASLR
> the same application might result in a different behaviour on each run.

Yeah, ASLR makes this all somewhat non deterministic, which is why you
really don't want a silent collision for your PREALLOC_HASH buckets.
Because once every 100 runs it does weird,..

> However, it might be good for a indication about the size of the private
> hash…

Yeah, now if online resize wasn't such a pain ;-)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch V2 3/7] futex: Add op for hash preallocation
  2016-05-27 17:27     ` Sebastian Andrzej Siewior
@ 2016-05-30  8:59       ` Peter Zijlstra
  0 siblings, 0 replies; 42+ messages in thread
From: Peter Zijlstra @ 2016-05-30  8:59 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Thomas Gleixner, LKML, Sebastian Andrzej Siewior, Linus Torvalds,
	Darren Hart, Ingo Molnar, Michael Kerrisk, Davidlohr Bueso,
	Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Fri, May 27, 2016 at 07:27:57PM +0200, Sebastian Andrzej Siewior wrote:
> On 2016-05-19 14:25:58 [+0200], Peter Zijlstra wrote:
> > On Thu, May 05, 2016 at 08:44:05PM -0000, Thomas Gleixner wrote:
> > > +static int futex_preallocate_hash(unsigned int slots)
> > > +{
> > > +#ifdef CONFIG_FUTEX_PRIVATE_HASH
> > > +	struct mm_struct *mm = current->mm;
> > > +	struct futex_hash_bucket *hb;
> > > +	unsigned int bits;
> > > +
> > > +	/* Try to allocate the requested nr of slots */
> > > +	bits = order_base_2(slots);
> > > +
> > > +	if (bits < FUTEX_MIN_HASH_BITS)
> > > +		bits = FUTEX_MIN_HASH_BITS;
> > > +
> > > +	if (bits > futex_max_hash_bits)
> > > +		bits = futex_max_hash_bits;
> > > +
> > > +	futex_populate_hash(bits);
> > 
> > Should we not simply fail if the provided number of slots is not a power
> > of 2 ?
> 
> We could if it is worth doing so. The procfs interface which limits the
> upper / lower limit is bits based. This is slot based which then gets
> converted to the number if bits.
> If we align this interface with proc's limits then we would expect the
> number of bits instead slots - now check for power of two anymore.
> Anyone?

I'm all for consistent and strict when it comes to things like this.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch V2 2/7] futex: Hash private futexes per process
  2016-05-30  8:58       ` Peter Zijlstra
@ 2016-05-30 11:08         ` Sebastian Andrzej Siewior
  2016-05-30 12:06           ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: Sebastian Andrzej Siewior @ 2016-05-30 11:08 UTC (permalink / raw)
  To: Peter Zijlstra, Sebastian Andrzej Siewior
  Cc: Thomas Gleixner, LKML, Linus Torvalds, Darren Hart, Ingo Molnar,
	Michael Kerrisk, Davidlohr Bueso, Chris Mason,
	Carlos O'Donell, Torvald Riegel, Eric Dumazet

On 05/30/2016 10:58 AM, Peter Zijlstra wrote:
> On Fri, May 27, 2016 at 07:10:01PM +0200, Sebastian Andrzej Siewior wrote:
>> On 2016-05-19 14:24:06 [+0200], Peter Zijlstra wrote:
>>> On Thu, May 05, 2016 at 08:44:04PM -0000, Thomas Gleixner wrote:
>>>> +static struct futex_hash_bucket *hash_futex(union futex_key *key)
>>>> +{
>>>> +#ifdef CONFIG_FUTEX_PRIVATE_HASH
>>>> +	struct mm_struct *mm = current->mm;
>>>> +	unsigned int slot;
>>>> +
>>>> +	/*
>>>> +	 * Futexes which use the per process hash have the lower bits cleared
>>>> +	 */
>>>> +	if (key->both.offset & (FUT_OFF_INODE | FUT_OFF_MMSHARED))
>>>> +		return hash_global_futex(key);
>>>> +
>>>> +	slot = hash_long(key->private.address, mm->futex_hash.hash_bits);
>>>> +	return &mm->futex_hash.hash[slot];
>>>
>>> Do we want the option to WARN if we get collisions in this per-process
>>> hash?
>>>
>>> Because afaiu there is no guarantee what so ever this doesn't happen,
>>> and collisions here can create the very same priority inversions as are
>>> possible in the global hash.
>>>
>>> Less likely etc.. more contained since its only the threads of the one
>>> process that get tangled up, but still possible.
>>
>> Since the collision is contained the same process it is less dramatic.
> 
> Right, but can still cause significant malfunction inside the process.
> So its not something to completely ignore. If your room sized CNC
> machine gets the priorities of the logging thread and the motor control
> thread confused bad things could happen.

> 
>> But how do you want to warn the user? A trace-event would be handy to
>> dump the uaddr and slot.
> 
> So I think there's a number of cases:
> 
>  - PREALLOC_HASH finds a taken bucket; in this case we can simply return
>    an error.
>  - PREALLOC_HASH succeeds, but an on demand hash later hits the same
>    bucket. This is harder; we could maybe mark all buckets taken by
>    PREALLOC_HASH and allow for a signal when this collision hits. Dunno.

PREALLOC_HASH happens once before any (contended) lock operation. We
never rehash the hash. To rehash the hash on runtime we would need an
empty futex hash and some locking in the fast path. And rehash seems
not to be required since we tried to come up with a sane default value
and the user/RT task can set it to the current max value.

So back to when does the collision happen. Since glibc visits the
kernel only on contention we might learn about the collision when it is
too late. We could have a lock operation by thread1 followed by lock
operation by thread2 on different uaddr resulting in the same bucket.
In this case we learn about this once the spin_lock() operation blocks.

Also marking a bucket as taken (on contention) might give false
positive results since we know nothing about lock's lifetime (i.e. the
lock might have been free()ed).

But if I may bring some ideas from v1. In v1 we had "tickets / IDs" for
the futex per thread. In v2 we don't have them anymore. We still have
the "private" futex hash buckets but per process this time.
We could introduce the "tickets / IDs" back and make them process wide.
We could hide them in pthread_mutex_init() and pthread_mutex_destroy()
since their IDs are no longer thread unique. I think I had something in
glibc's pthread variable where we could store 16bit if I split another
32bit variable.

That would be guaranteed collision free and hidden in glibc. But it
would take some time to get it used since it does no longer work out of
the box by updating the kernel. We could also add it later if people
scream for it since we can't change the behavior of "PRIVATE" futex
(uaddr vs ticket number).

>> But due to ASLR
>> the same application might result in a different behaviour on each run.
> 
> Yeah, ASLR makes this all somewhat non deterministic, which is why you
> really don't want a silent collision for your PREALLOC_HASH buckets.
> Because once every 100 runs it does weird,..

I think that you might learn about your collision too late and there
is nothing you can do about it. Logging in a logfile or restart the
application after 5 minutes of runtime? Not to mention the locks which
are contended in an error situation.

A futex op returning the kernel's hash value might be sane. You would
expose some implementation detail but the application could check its
mutex for collision before starting (doing the real work). On collision
it would have to restart and hope for the best if the collision from
ASLR.
But since you don't know about all mutexes, like those which are part
of library, you wouldn't never be 100% collision free here. So it is
probably a bad idea.

>> However, it might be good for a indication about the size of the private
>> hash…
> 
> Yeah, now if online resize wasn't such a pain ;-)

My point was during development / testing to figure out which initial
value is sane / reasonable. Start your APP with 32 slot. Collisions.
Again with 128 slots. Oh better.

Sebastian

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch V2 2/7] futex: Hash private futexes per process
  2016-05-30 11:08         ` Sebastian Andrzej Siewior
@ 2016-05-30 12:06           ` Peter Zijlstra
  2016-05-30 13:37             ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2016-05-30 12:06 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Sebastian Andrzej Siewior, Thomas Gleixner, LKML, Linus Torvalds,
	Darren Hart, Ingo Molnar, Michael Kerrisk, Davidlohr Bueso,
	Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Mon, May 30, 2016 at 01:08:53PM +0200, Sebastian Andrzej Siewior wrote:
> > So I think there's a number of cases:
> > 
> >  - PREALLOC_HASH finds a taken bucket; in this case we can simply return
> >    an error.
> >  - PREALLOC_HASH succeeds, but an on demand hash later hits the same
> >    bucket. This is harder; we could maybe mark all buckets taken by
> >    PREALLOC_HASH and allow for a signal when this collision hits. Dunno.
> 
> PREALLOC_HASH happens once before any (contended) lock operation. 

I know ;-)

> We
> never rehash the hash. To rehash the hash on runtime we would need an
> empty futex hash and some locking in the fast path.

Nah; you can do lockless rehashing. Little tricky, but entirely doable.
At worst we'd need an rcu_read_lock(), but with a little extra work we
can extend an existing preempt_disable section to cover the hash lookup
if it isn't already covered.

> And rehash seems
> not to be required since we tried to come up with a sane default value
> and the user/RT task can set it to the current max value.

I'm not sure this statement is true; given that the hash function
reduces the entire address space down to a few bits you're guaranteed to
have collisions at some point. A good hash will not have more than given
by the birthday paradox, but you cannot have less.

So while the sizing can certainly reduce the chance on collisions,
nothing guarantees you will not have them, quite the opposite.

> So back to when does the collision happen. Since glibc visits the
> kernel only on contention we might learn about the collision when it is
> too late. We could have a lock operation by thread1 followed by lock
> operation by thread2 on different uaddr resulting in the same bucket.
> In this case we learn about this once the spin_lock() operation blocks.

Yep, so if this is two ondemand futexes hitting the same bucket, we
don't care. But if this is an ondemand futex hitting a prealloc one, we
do care.

And yes, this is late, but its the only possible time. And notification
is better than silent weird behaviour.

> Also marking a bucket as taken (on contention) might give false
> positive results since we know nothing about lock's lifetime (i.e. the
> lock might have been free()ed).
> 
> But if I may bring some ideas from v1. In v1 we had "tickets / IDs" for
> the futex per thread. In v2 we don't have them anymore. We still have
> the "private" futex hash buckets but per process this time.
> We could introduce the "tickets / IDs" back and make them process wide.
> We could hide them in pthread_mutex_init() and pthread_mutex_destroy()
> since their IDs are no longer thread unique. I think I had something in
> glibc's pthread variable where we could store 16bit if I split another
> 32bit variable.

Not sure you need a ticket (and I never got around to reading v1), a
simple twin to PREALLOCATE_HASH to call on mutex_destroy would be
sufficient I think. It could free the hash bucket.

> That would be guaranteed collision free and hidden in glibc.

I'm still not seeing how you can guarantee anything with it; ondemand
local hashing can still collide, right?

Or are you thinking two local hashtables, one for ondemand things and
one for prealloc/free type thingies?

> My point was during development / testing to figure out which initial
> value is sane / reasonable. Start your APP with 32 slot. Collisions.
> Again with 128 slots. Oh better.

Ah, right, but given the above and ASLR, you cannot exhaustively test
this.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch V2 2/7] futex: Hash private futexes per process
  2016-05-30 12:06           ` Peter Zijlstra
@ 2016-05-30 13:37             ` Sebastian Andrzej Siewior
  2016-05-30 13:49               ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: Sebastian Andrzej Siewior @ 2016-05-30 13:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sebastian Andrzej Siewior, Thomas Gleixner, LKML, Linus Torvalds,
	Darren Hart, Ingo Molnar, Michael Kerrisk, Davidlohr Bueso,
	Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet

On 05/30/2016 02:06 PM, Peter Zijlstra wrote:
>> We
>> never rehash the hash. To rehash the hash on runtime we would need an
>> empty futex hash and some locking in the fast path.
> 
> Nah; you can do lockless rehashing. Little tricky, but entirely doable.
> At worst we'd need an rcu_read_lock(), but with a little extra work we
> can extend an existing preempt_disable section to cover the hash lookup
> if it isn't already covered.

you also need ensure that all new users use the new hash but all old
users go for the old one until you can switch them over. And it is
likely we miss a corner case because it is futex after all.

And still, we all that? You need an upper limit how often / until which
size you do resize. And if you want to avoid collisions at all costs
(because) go for the max value if you think you need it so there is no
need to resize. And if the task does not pre-allocate why care at all?

>> And rehash seems
>> not to be required since we tried to come up with a sane default value
>> and the user/RT task can set it to the current max value.
> 
> I'm not sure this statement is true; given that the hash function
> reduces the entire address space down to a few bits you're guaranteed to
> have collisions at some point. A good hash will not have more than given
> by the birthday paradox, but you cannot have less.
> 
> So while the sizing can certainly reduce the chance on collisions,
> nothing guarantees you will not have them, quite the opposite.

Yes, I did not try to discuss the hash collision away. From RT's point
of view the problems I am aware are of the following scenario:
Task #1 pinned to CPU1 task #2 pinned to CPU2 but share the same
bucket. Task #2 got a wakeup and should run but is blocked on the
bucket lock - otherwise it could run. Usually it would PI-boost task#1
but task#1 got preempted itself by task and since task#1 prio is lower
it can't boost its way free towards the lock and so so CPU #2 may
remain idle for some time.

The same thing can happen within a Task if you take my story from above
and replace task with thread. Completely understood.

>> So back to when does the collision happen. Since glibc visits the
>> kernel only on contention we might learn about the collision when it is
>> too late. We could have a lock operation by thread1 followed by lock
>> operation by thread2 on different uaddr resulting in the same bucket.
>> In this case we learn about this once the spin_lock() operation blocks.
> 
> Yep, so if this is two ondemand futexes hitting the same bucket, we
> don't care. But if this is an ondemand futex hitting a prealloc one, we
> do care.
> 
> And yes, this is late, but its the only possible time. And notification
> is better than silent weird behaviour.

We don't distinguish right now between "user asked to preallocate the
hash" and "we created this hash because we had a locking job in
kernel". The latter is the behaviour we used to have and the former is
something like "this new preallocate thingy does not always work".

>> Also marking a bucket as taken (on contention) might give false
>> positive results since we know nothing about lock's lifetime (i.e. the
>> lock might have been free()ed).
>>
>> But if I may bring some ideas from v1. In v1 we had "tickets / IDs" for
>> the futex per thread. In v2 we don't have them anymore. We still have
>> the "private" futex hash buckets but per process this time.
>> We could introduce the "tickets / IDs" back and make them process wide.
>> We could hide them in pthread_mutex_init() and pthread_mutex_destroy()
>> since their IDs are no longer thread unique. I think I had something in
>> glibc's pthread variable where we could store 16bit if I split another
>> 32bit variable.
> 
> Not sure you need a ticket (and I never got around to reading v1), a

so v1 in short. To lock you do:
  futex(uaddr, FUTEX_LOCK_PI | PRIVATE, …)

instead uaddr, we used tickets and called the mode attached:
  futex(ticket, FUTEX_LOCK_PI | PRIVATE | ATTACHED, …)

and the ticket was the index in our array where we kept the hash
buckets. That means each user space futex has its own hash bucket
without contention. But since we had the tickets unique in every thread
it was hard to handle. Keeping them the same process wide should make
things simple again.

> simple twin to PREALLOCATE_HASH to call on mutex_destroy would be
> sufficient I think. It could free the hash bucket.

You do PREALLOCATE_HASH _once_ on startup not for every mutex. But yes,
you would mark each mutex in mutex_destroy() as gone.

So this would "work", you would find a collision during a slot lookup.
You could increase the coverage if you add another marker in
pthread_mutex_init() (well, except for static initializes). That means
we need glibc changes for this to work.
And what do you suggest to report such a collision? A trace point, a
file in proc? WARN_ON_ONCE() would not result in CVE but is not that
pretty and informative. Not sure if glibc is all for a return code
which is not part of POSIX.

Also I'm not sure what to do with this information unless this happens
during development. That is why I suggest the above which is guaranteed
hash clash free :)

>> That would be guaranteed collision free and hidden in glibc.
> 
> I'm still not seeing how you can guarantee anything with it; ondemand
> local hashing can still collide, right?

Since we have a process local hash table for private futex, yes
collisions can happen.

> Or are you thinking two local hashtables, one for ondemand things and
> one for prealloc/free type thingies?

I think one hash table for private futexes as we have now and if people
still complain about collisions get the dedicated "struct
futex_hash_bucket" for each pthread_mutex_t which we had in v1 and then
use the returned ticket for every lock operation. But this time keep it
ticket number the same within a process.
This would guarantee that no collision happens within a task and
requires changes in glibc. So the entry barrier is a little higher but
we could merge this (more or less) as is and the the collision-free
extension on top if there is demand for it.

>> My point was during development / testing to figure out which initial
>> value is sane / reasonable. Start your APP with 32 slot. Collisions.
>> Again with 128 slots. Oh better.
> 
> Ah, right, but given the above and ASLR, you cannot exhaustively test
> this.

correct. You could disable ASLR but security wise you might not want do
that (but then if everything runs as root anyway, disabling ASLR is the
next logical step :) ).

Sebastian

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch V2 2/7] futex: Hash private futexes per process
  2016-05-30 13:37             ` Sebastian Andrzej Siewior
@ 2016-05-30 13:49               ` Peter Zijlstra
  2016-05-30 13:59                 ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2016-05-30 13:49 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Sebastian Andrzej Siewior, Thomas Gleixner, LKML, Linus Torvalds,
	Darren Hart, Ingo Molnar, Michael Kerrisk, Davidlohr Bueso,
	Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Mon, May 30, 2016 at 03:37:48PM +0200, Sebastian Andrzej Siewior wrote:
> Yes, I did not try to discuss the hash collision away. From RT's point
> of view the problems I am aware are of the following scenario:
> Task #1 pinned to CPU1 task #2 pinned to CPU2 but share the same
> bucket. Task #2 got a wakeup and should run but is blocked on the
> bucket lock - otherwise it could run. Usually it would PI-boost task#1
> but task#1 got preempted itself by task and since task#1 prio is lower
> it can't boost its way free towards the lock and so so CPU #2 may
> remain idle for some time.
> 
> The same thing can happen within a Task if you take my story from above
> and replace task with thread. Completely understood.

Right; so I don't see the point of PREALLOCATE_HASH to cater for RT
workloads if it still doesn't guarantee anything, esp. if the failure
case is silent and obscure.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch V2 2/7] futex: Hash private futexes per process
  2016-05-30 13:49               ` Peter Zijlstra
@ 2016-05-30 13:59                 ` Sebastian Andrzej Siewior
  2016-05-30 14:02                   ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: Sebastian Andrzej Siewior @ 2016-05-30 13:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sebastian Andrzej Siewior, Thomas Gleixner, LKML, Linus Torvalds,
	Darren Hart, Ingo Molnar, Michael Kerrisk, Davidlohr Bueso,
	Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet

On 05/30/2016 03:49 PM, Peter Zijlstra wrote:
>> The same thing can happen within a Task if you take my story from above
>> and replace task with thread. Completely understood.
> 
> Right; so I don't see the point of PREALLOCATE_HASH to cater for RT
> workloads if it still doesn't guarantee anything, esp. if the failure
> case is silent and obscure.

So what do you suggest? Adding trace points in order to learn about
possible collisions or using tickets (on top of this) to guarantee
being collision free?
Note: this as it, is already a win on NUMA boxes since the memory is
not referenced cross node.

Sebastian

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch V2 2/7] futex: Hash private futexes per process
  2016-05-30 13:59                 ` Sebastian Andrzej Siewior
@ 2016-05-30 14:02                   ` Peter Zijlstra
  0 siblings, 0 replies; 42+ messages in thread
From: Peter Zijlstra @ 2016-05-30 14:02 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Sebastian Andrzej Siewior, Thomas Gleixner, LKML, Linus Torvalds,
	Darren Hart, Ingo Molnar, Michael Kerrisk, Davidlohr Bueso,
	Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Mon, May 30, 2016 at 03:59:51PM +0200, Sebastian Andrzej Siewior wrote:
> On 05/30/2016 03:49 PM, Peter Zijlstra wrote:
> >> The same thing can happen within a Task if you take my story from above
> >> and replace task with thread. Completely understood.
> > 
> > Right; so I don't see the point of PREALLOCATE_HASH to cater for RT
> > workloads if it still doesn't guarantee anything, esp. if the failure
> > case is silent and obscure.
> 
> So what do you suggest? Adding trace points in order to learn about
> possible collisions or using tickets (on top of this) to guarantee
> being collision free?

I have no idea about the ticket stuff, i've not seen it. But yes, you
need to somehow guarantee no collisions.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [patch V2 2/7] futex: Hash private futexes per process
  2016-05-05 19:03 [patch V2 0/7] Sebastian Andrzej Siewior <bigeasy@linutronix.de>, Linus Torvalds <torvalds@linux-foundation.org>, Darren Hart <darren@dvhart.com>, Peter Zijlstra <peterz@infradead.org>, Ingo Molnar <mingo@kernel.org>, Michael Kerrisk <mtk.manpages@googlemail.com>, Davidlohr Bueso <dave@stgolabs.net>, Chris Mason <clm@fb.com>, Carlos O'Donell <carlos@redhat.com>, Torvald Riegel <triegel@redhat.com>, Eric Dumazet <edumazet@google.com> Thomas Gleixner
@ 2016-05-05 19:03 ` Thomas Gleixner
  0 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2016-05-05 19:03 UTC (permalink / raw)
  To: LKML; +Cc: Sebastian Siewior

[-- Attachment #1: futex-Hash-private-futexes-per-process.patch --]
[-- Type: text/plain, Size: 13765 bytes --]

From: Sebastian Siewior <bigeasy@linutronix.de>

The standard futex mechanism in the Linux kernel uses a global hash to store
transient state. Collisions on that hash can lead to performance degradation
especially on NUMA systems and on real-time enabled kernels even to priority
inversions.

To mitigate that problem we provide per process private hashing. On the first
futex operation in a process the kernel allocates a hash table. The hash table
is accessible via the process mm_struct. On Numa systems the hash is allocated
node local.

If the allocation fails then the global hash table is used as fallback, so
there is no user space visible impact of this feature.

The hash size is a default value which can be tweaked by the sys admin. The
sysctl interface is implemented in a follow up patch to make the review
simpler. For applications which have special requirements for the private hash
and to allow preallocation of the hash for RT applications, we'll provide a
futex OP in a follow up patch.

Performance data acquired on a 4 socket (node) Intel machine with perf bench
futex-hash:

Threads  G 65536  P 4	  P 8      P 16       P 32     P 64     P 128    P 256

1        8175006  8645465  8617469  8628686   8625223  8664491  8590934  8631582
2	 8149869  8618385  8578185  8622267   8603253  8618787  8595073  8590591
4	 7479482  5867840  7882991  7604838   7894380  7882850  7884911  7886278
8	 7308822  2378057  5731051  5550479   7691198  7672814  7711939  7681549
16	 7295893   677414  2670682  3453552   7158906  7688978  7677603  7690290

So with the proper hash size of the private hash is ~5% faster than the global
hash.

With a full perf bench futex-hash run with one process (36 threads) per node
and 1024 futexes per thread the following results are achieved:

G 65536	 P 4     P 8     P 16     P 32     P 64     P 128    P 256    P 512    P 1024  P 2048     
2673390  368952  682626  1223908  1845922  3003524  3538313  4118533  4286925  4289589 4274020

Ratio:   0,14    0,26    0,46     0,69	   1,12     1,32     1,54     1,60     1,60    1,60

So with a private hash size of 256 buckets and above the performance is almost
steady in this pathological test case and factor 1.6 better than the global
hash. Even a 64 buckets hash is already 10% faster,

Signed-off-by: Sebastian Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/futex.h       |   38 ++++++++--
 include/linux/futex_types.h |   12 +++
 include/linux/mm_types.h    |    4 +
 init/Kconfig                |    4 +
 kernel/fork.c               |    3 
 kernel/futex.c              |  162 +++++++++++++++++++++++++++++++++++++++++++-
 6 files changed, 212 insertions(+), 11 deletions(-)
 create mode 100644 include/linux/futex_types.h

--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -1,6 +1,7 @@
 #ifndef _LINUX_FUTEX_H
 #define _LINUX_FUTEX_H
 
+#include <linux/futex_types.h>
 #include <uapi/linux/futex.h>
 
 struct inode;
@@ -21,16 +22,19 @@ handle_futex_death(u32 __user *uaddr, st
  *
  * offset is aligned to a multiple of sizeof(u32) (== 4) by definition.
  * We use the two low order bits of offset to tell what is the kind of key :
- *  00 : Private process futex (PTHREAD_PROCESS_PRIVATE)
- *       (no reference on an inode or mm)
+ *  00 : Private process futex (PTHREAD_PROCESS_PRIVATE) using process private
+ *	 hash (no reference on an inode or mm)
  *  01 : Shared futex (PTHREAD_PROCESS_SHARED)
  *	mapped on a file (reference on the underlying inode)
  *  10 : Shared futex (PTHREAD_PROCESS_SHARED)
  *       (but private mapping on an mm, and reference taken on it)
+ *  11 : Private process futex (PTHREAD_PROCESS_PRIVATE) using global hash
+ *	 (no reference on an inode or mm)
 */
 
-#define FUT_OFF_INODE    1 /* We set bit 0 if key has a reference on inode */
-#define FUT_OFF_MMSHARED 2 /* We set bit 1 if key has a reference on mm */
+#define FUT_OFF_INODE		0x01 /* Key has a reference on inode */
+#define FUT_OFF_MMSHARED	0x02 /* Key has a reference on mm */
+#define FUT_OFF_PRIVATE		0x03 /* Key has no ref on inode/mm */
 
 union futex_key {
 	struct {
@@ -60,12 +64,30 @@ extern void exit_pi_state_list(struct ta
 #else
 extern int futex_cmpxchg_enabled;
 #endif
+
 #else
-static inline void exit_robust_list(struct task_struct *curr)
-{
-}
-static inline void exit_pi_state_list(struct task_struct *curr)
+static inline void exit_robust_list(struct task_struct *curr) { }
+static inline void exit_pi_state_list(struct task_struct *curr) { }
+#endif
+
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
+/* Process private hash data for futexes */
+
+extern unsigned int futex_default_hash_bits;
+extern unsigned int futex_max_hash_bits;
+
+extern void futex_mm_hash_exit(struct mm_struct *mm);
+
+static inline void futex_mm_hash_init(struct mm_struct *mm)
 {
+	raw_spin_lock_init(&mm->futex_hash.lock);
+	mm->futex_hash.hash = NULL;
 }
+
+#else
+
+static inline void futex_mm_hash_init(struct mm_struct *mm) { }
+static inline void futex_mm_hash_exit(struct mm_struct *mm) { }
 #endif
+
 #endif
--- /dev/null
+++ b/include/linux/futex_types.h
@@ -0,0 +1,12 @@
+#ifndef _LINUX_FUTEX_TYPES_H
+#define _LINUX_FUTEX_TYPES_H
+
+struct futex_hash_bucket;
+
+struct futex_hash {
+	struct raw_spinlock		lock;
+	unsigned int			hash_bits;
+	struct futex_hash_bucket	*hash;
+};
+
+#endif
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -11,6 +11,7 @@
 #include <linux/completion.h>
 #include <linux/cpumask.h>
 #include <linux/uprobes.h>
+#include <linux/futex_types.h>
 #include <linux/page-flags-layout.h>
 #include <asm/page.h>
 #include <asm/mmu.h>
@@ -442,6 +443,9 @@ struct mm_struct {
 
 	struct linux_binfmt *binfmt;
 
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
+	struct futex_hash futex_hash;
+#endif
 	cpumask_var_t cpu_vm_mask_var;
 
 	/* Architecture-specific MM context */
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1498,6 +1498,10 @@ config FUTEX
 	  support for "fast userspace mutexes".  The resulting kernel may not
 	  run glibc-based applications correctly.
 
+config FUTEX_PRIVATE_HASH
+	bool
+	default FUTEX && SMP
+
 config HAVE_FUTEX_CMPXCHG
 	bool
 	depends on FUTEX
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -617,6 +617,8 @@ static struct mm_struct *mm_init(struct
 	mm_init_owner(mm, p);
 	mmu_notifier_mm_init(mm);
 	clear_tlb_flush_pending(mm);
+	futex_mm_hash_init(mm);
+
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	mm->pmd_huge_pte = NULL;
 #endif
@@ -713,6 +715,7 @@ void mmput(struct mm_struct *mm)
 		khugepaged_exit(mm); /* must run before exit_mmap */
 		exit_mmap(mm);
 		set_mm_exe_file(mm, NULL);
+		futex_mm_hash_exit(mm);
 		if (!list_empty(&mm->mmlist)) {
 			spin_lock(&mmlist_lock);
 			list_del(&mm->mmlist);
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -23,6 +23,9 @@
  *  Copyright (C) IBM Corporation, 2009
  *  Thanks to Thomas Gleixner for conceptual design and careful reviews.
  *
+ *  Private hashed futex support by Sebastian Siewior and Thomas Gleixner
+ *  Copyright (C) Linutronix GmbH, 2016
+ *
  *  Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
  *  enough at me, Linus for the original (flawed) idea, Matthew
  *  Kirkwood for proof-of-concept implementation.
@@ -49,6 +52,7 @@
 #include <linux/fs.h>
 #include <linux/file.h>
 #include <linux/jhash.h>
+#include <linux/hash.h>
 #include <linux/init.h>
 #include <linux/futex.h>
 #include <linux/mount.h>
@@ -169,6 +173,34 @@
  * the code that actually moves the futex(es) between hash buckets (requeue_futex)
  * will do the additional required waiter count housekeeping. This is done for
  * double_lock_hb() and double_unlock_hb(), respectively.
+ *
+ * For private futexes we (pre)allocate a per process hash. We check lockless
+ * whether the hash is already allocated. To access the hash later we need
+ * information about the hash properties as well. This requires barriers as
+ * follows:
+ *
+ * CPU 0					CPU 1
+ * check_hash_allocation()
+ *	if (mm->futex_hash.hash)
+ *		return;
+ *	hash = alloc_hash()
+ *	lock(&mm->futex_hash.lock);
+ *	if (!mm->futex_hash.hash) {
+ *	  mm->futex_hash.par = params;
+ *
+ *	  smp_wmb(); (A0) <-paired with-|
+ *					|
+ *	  mm->futex_hash.hash = hash;	|
+ *					|	check_hash_allocation()
+ *					|	   if (mm->futex_hash.hash)
+ *					|		return;
+ *	  unlock(&mm->futex_hash.lock);	|	get_futex_key_refs()
+ *					|
+ *					|--------- smp_mb() (B)
+ *						s = hash(f, mm->futex_hash.par);
+ *						hb = &mm->futex_hash.hash[s];
+ *
+ * So we utilize the existing smp_mb() in get_futex_key_refs().
  */
 
 #ifndef CONFIG_HAVE_FUTEX_CMPXCHG
@@ -255,6 +287,22 @@ struct futex_hash_bucket {
 	struct plist_head chain;
 } ____cacheline_aligned_in_smp;
 
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
+/*
+ * Process private hash for non-shared futexes
+ */
+#define FUTEX_USE_GLOBAL_HASH		((void *) 0x03)
+
+#define FUTEX_MIN_HASH_BITS		order_base_2(4UL)
+#define FUTEX_DEF_HASH_BITS		order_base_2(8UL)
+#define FUTEX_MAX_HASH_BITS		order_base_2(256UL)
+
+unsigned int futex_default_hash_bits	= FUTEX_DEF_HASH_BITS;
+unsigned int futex_max_hash_bits	= FUTEX_MAX_HASH_BITS;
+#else
+static const unsigned int futex_default_hash_bits = 0;
+#endif
+
 /*
  * The base of the bucket array and its size are always used together
  * (after initialization only in hash_futex()), so ensure that they
@@ -374,13 +422,13 @@ static inline int hb_waiters_pending(str
 }
 
 /**
- * hash_futex - Return the hash bucket in the global hash
+ * hash_global_futex - Return the hash bucket in the global hash
  * @key:	Pointer to the futex key for which the hash is calculated
  *
  * We hash on the keys returned from get_futex_key (see below) and return the
  * corresponding hash bucket in the global hash.
  */
-static struct futex_hash_bucket *hash_futex(union futex_key *key)
+static struct futex_hash_bucket *hash_global_futex(union futex_key *key)
 {
 	u32 hash = jhash2((u32*)&key->both.word,
 			  (sizeof(key->both.word)+sizeof(key->both.ptr))/4,
@@ -388,9 +436,33 @@ static struct futex_hash_bucket *hash_fu
 	return &futex_queues[hash & (futex_hashsize - 1)];
 }
 
+/**
+ * hash_futex - Get the hash bucket for a futex
+ *
+ * Returns either the process private or the global hash bucket which fits the
+ * key.
+ */
+static struct futex_hash_bucket *hash_futex(union futex_key *key)
+{
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
+	struct mm_struct *mm = current->mm;
+	unsigned int slot;
+
+	/*
+	 * Futexes which use the per process hash have the lower bits cleared
+	 */
+	if (key->both.offset & (FUT_OFF_INODE | FUT_OFF_MMSHARED))
+		return hash_global_futex(key);
+
+	slot = hash_long(key->private.address, mm->futex_hash.hash_bits);
+	return &mm->futex_hash.hash[slot];
+#else
+	return hash_global_futex(key);
+#endif
+}
 
 /**
- * match_futex - Check whether to futex keys are equal
+ * match_futex - Check whether two futex keys are equal
  * @key1:	Pointer to key1
  * @key2:	Pointer to key2
  *
@@ -505,7 +577,20 @@ get_futex_key(u32 __user *uaddr, int fsh
 	 */
 	if (!fshared) {
 		key->private.mm = mm;
+		/*
+		 * If we have a process private hash, then we store uaddr
+		 * instead of the page base address.
+		 */
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
+		if (mm->futex_hash.hash != FUTEX_USE_GLOBAL_HASH) {
+			key->private.address = (unsigned long) uaddr;
+		} else {
+			key->private.address = address;
+			key->both.offset |= FUT_OFF_PRIVATE;
+		}
+#else
 		key->private.address = address;
+#endif
 		get_futex_key_refs(key);  /* implies smp_mb(); (B) */
 		return 0;
 	}
@@ -3153,6 +3238,75 @@ void exit_robust_list(struct task_struct
 				   curr, pip);
 }
 
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
+
+void futex_mm_hash_exit(struct mm_struct *mm)
+{
+	if (mm->futex_hash.hash && mm->futex_hash.hash != FUTEX_USE_GLOBAL_HASH)
+		kfree(mm->futex_hash.hash);
+	mm->futex_hash.hash = NULL;
+}
+
+static struct futex_hash_bucket *futex_alloc_hash(unsigned int hash_bits)
+{
+	struct futex_hash_bucket *hb;
+	size_t hash_size, size;
+	int i;
+
+	hash_size = 1 << hash_bits;
+	size = hash_size * sizeof(struct futex_hash_bucket);
+	hb = kzalloc_node(size, GFP_KERNEL, numa_node_id());
+	if (!hb)
+		return NULL;
+
+	for (i = 0; i < hash_size; i++) {
+		atomic_set(&hb[i].waiters, 0);
+		plist_head_init(&hb[i].chain);
+		spin_lock_init(&hb[i].lock);
+	}
+	return hb;
+}
+
+static void futex_populate_hash(unsigned int hash_bits)
+{
+	struct mm_struct *mm = current->mm;
+	struct futex_hash_bucket *hb = NULL;
+
+	/*
+	 * We don't need an explicit smp_mb() when the hash is populated
+	 * because before we dereference mm->futex_hash.hash_bits in the hash
+	 * function we have an smp_mb() in futex_get_key_refs() already.
+	 */
+	if (mm->futex_hash.hash)
+		return;
+
+	/*
+	 * If we failed to allocate a hash on the fly, fall back to the global
+	 * hash.
+	 */
+	hb = futex_alloc_hash(hash_bits);
+	if (!hb)
+		hb = FUTEX_USE_GLOBAL_HASH;
+
+	raw_spin_lock(&mm->futex_hash.lock);
+	/* We might have raced with another task allocating the hash. */
+	if (!mm->futex_hash.hash) {
+		mm->futex_hash.hash_bits = hash_bits;
+		/*
+		 * Ensure that the above is visible before we store
+		 * the pointer.
+		 */
+		smp_wmb(); /* (A0) Pairs with (B) */
+		mm->futex_hash.hash = hb;
+		hb = NULL;
+	}
+	raw_spin_unlock(&mm->futex_hash.lock);
+	kfree(hb);
+}
+#else /* CONFIG_FUTEX_PRIVATE_HASH */
+static inline void futex_populate_hash(unsigned int hash_bits) { }
+#endif
+
 long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 		u32 __user *uaddr2, u32 val2, u32 val3)
 {
@@ -3161,6 +3315,8 @@ long do_futex(u32 __user *uaddr, int op,
 
 	if (!(op & FUTEX_PRIVATE_FLAG))
 		flags |= FLAGS_SHARED;
+	else
+		futex_populate_hash(futex_default_hash_bits);
 
 	if (op & FUTEX_CLOCK_REALTIME) {
 		flags |= FLAGS_CLOCKRT;

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2016-05-30 14:02 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-05 20:44 [patch V2 0/7] futex: Add support for process private hashing Thomas Gleixner
2016-05-05 20:44 ` [patch V2 1/7] futex: Add some more function commentry Thomas Gleixner
2016-05-06 17:37   ` Darren Hart
2016-05-05 20:44 ` [patch V2 2/7] futex: Hash private futexes per process Thomas Gleixner
2016-05-06 18:09   ` Darren Hart
2016-05-06 21:56     ` Darren Hart
2016-05-07  8:45       ` Thomas Gleixner
2016-05-11 21:08         ` Darren Hart
2016-05-07  8:44     ` Thomas Gleixner
2016-05-11 21:07       ` Darren Hart
2016-05-27 16:36       ` Sebastian Andrzej Siewior
2016-05-19 12:21   ` Peter Zijlstra
2016-05-27 16:52     ` Sebastian Andrzej Siewior
2016-05-30  8:43       ` Peter Zijlstra
2016-05-19 12:24   ` Peter Zijlstra
2016-05-27 17:10     ` Sebastian Andrzej Siewior
2016-05-30  8:58       ` Peter Zijlstra
2016-05-30 11:08         ` Sebastian Andrzej Siewior
2016-05-30 12:06           ` Peter Zijlstra
2016-05-30 13:37             ` Sebastian Andrzej Siewior
2016-05-30 13:49               ` Peter Zijlstra
2016-05-30 13:59                 ` Sebastian Andrzej Siewior
2016-05-30 14:02                   ` Peter Zijlstra
2016-05-05 20:44 ` [patch V2 4/7] futex: Add sysctl knobs for process private hash Thomas Gleixner
2016-05-06 18:22   ` Darren Hart
2016-05-27 17:33     ` Sebastian Andrzej Siewior
2016-05-05 20:44 ` [patch V2 3/7] futex: Add op for hash preallocation Thomas Gleixner
2016-05-06 18:18   ` Darren Hart
2016-05-07  8:47     ` Thomas Gleixner
2016-05-07 11:40       ` Thomas Gleixner
2016-05-19 12:28       ` Peter Zijlstra
2016-05-19 19:36         ` Darren Hart
2016-05-19 12:24   ` Peter Zijlstra
2016-05-19 19:38     ` Darren Hart
2016-05-20  4:50       ` Peter Zijlstra
2016-05-19 12:25   ` Peter Zijlstra
2016-05-27 17:27     ` Sebastian Andrzej Siewior
2016-05-30  8:59       ` Peter Zijlstra
2016-05-05 20:44 ` [patch V2 5/7] perf/bench/futex-hash: Support NUMA Thomas Gleixner
2016-05-05 20:44 ` [patch V2 6/7] perf/bench/futex-hash: Support preallocate hash table Thomas Gleixner
2016-05-05 20:44 ` [patch V2 7/7] futex.2: Document hash preallocation opcode Thomas Gleixner
  -- strict thread matches above, loose matches on Subject: below --
2016-05-05 19:03 [patch V2 0/7] Sebastian Andrzej Siewior <bigeasy@linutronix.de>, Linus Torvalds <torvalds@linux-foundation.org>, Darren Hart <darren@dvhart.com>, Peter Zijlstra <peterz@infradead.org>, Ingo Molnar <mingo@kernel.org>, Michael Kerrisk <mtk.manpages@googlemail.com>, Davidlohr Bueso <dave@stgolabs.net>, Chris Mason <clm@fb.com>, Carlos O'Donell <carlos@redhat.com>, Torvald Riegel <triegel@redhat.com>, Eric Dumazet <edumazet@google.com> Thomas Gleixner
2016-05-05 19:03 ` [patch V2 2/7] futex: Hash private futexes per process Thomas Gleixner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).