* [patch 0/7] futex: Add support for process private hashing
@ 2016-04-28 16:42 Thomas Gleixner
  2016-04-28 16:42 ` [patch 1/7] futex: Add some more function commentary Thomas Gleixner
                   ` (6 more replies)
  0 siblings, 7 replies; 21+ messages in thread
From: Thomas Gleixner @ 2016-04-28 16:42 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Ingo Molnar, Linus Torvalds, Andrew Morton,
	Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk,
	Davidlohr Bueso, Chris Mason, Carlos O'Donell,
	Torvald Riegel, Eric Dumazet

The standard futex mechanism in the Linux kernel uses a global hash to store
transient state. Collisions on that hash can lead to performance degradation
and on real-time enabled kernels to unbounded priority inversions.

This new attempt to solve the issue does not require user space changes and
operates transparently. On the first futex operation of a process the kernel
allocates a hash private to the process. All process private futexes are
hashed in this hash. Process shared futexes still use the global hash.

For RT applications and pathological use cases a new futex op is provided
which allows the application to preallocate and thereby size the process
private hash.

The series comes with a new 'stupid' hash function based on the good old
modulo prime. That function provides way better hash results than
hash_ptr/hash_long() for small hash sizes.

The last two patches add support to the perf futex-hash benchmark so tests can
be run on NUMA nodes and the preallocation sizing can be tested.

The last patch contains a first update for the futex man page.

Results from our testing in nice colored charts are available here:

perf bench futex-hash run parallel on 4 nodes with global hash and various
sized private hashes and various numbers of futexes per thread

 https://tglx.de/~tglx/f-ops.png

perf bench futex-hash run parallel on 4 nodes with global hash and various
sized private hashes using the new hash_mod() and various numbers of futexes
per thread

 https://tglx.de/~tglx/f-ops.png

perf bench futex-hash run parallel on 4 nodes with global hash and various
sized private hashes using hash_long() and various numbers of futexes per
thread

 https://tglx.de/~tglx/f-ops-hlong.png

perf bench futex-hash run parallel on 2 nodes with global hash and various
sized private hashes and various numbers of futexes per thread

 https://tglx.de/~tglx/f-ops-2.png

perf bench futex-hash run parallel on 4 nodes with global hash and various
sized private hashes using hash_mod(). 1 futex per thread and various thread
numbers.

 https://tglx.de/~tglx/f-ops-mod-t.png

perf bench futex-hash run parallel on 4 nodes with global hash and various
sized private hashes using hash_long(). 1 futex per thread and various thread
numbers.

 https://tglx.de/~tglx/f-ops-hlong-t.png

Thanks,

	tglx

----
 Documentation/sysctl/kernel.txt |   17 +++
 b/include/linux/futex_types.h   |   14 ++
 b/lib/hashmod.c                 |   44 ++++++++
 include/linux/futex.h           |   39 +++++--
 include/linux/hash.h            |   28 +++++
 include/linux/mm_types.h        |    4 
 include/uapi/linux/futex.h      |    1 
 init/Kconfig                    |    5 
 kernel/fork.c                   |    3 
 kernel/futex.c                  |  219 +++++++++++++++++++++++++++++++++++++++-
 kernel/sysctl.c                 |   21 +++
 lib/Kconfig                     |    3 
 lib/Makefile                    |    1 
 tools/perf/bench/Build          |    4 
 tools/perf/bench/futex-hash.c   |  101 ++++++++++++++++--
 tools/perf/bench/futex.h        |    5 
 16 files changed, 486 insertions(+), 23 deletions(-)


* [patch 1/7] futex: Add some more function commentary
  2016-04-28 16:42 [patch 0/7] futex: Add support for process private hashing Thomas Gleixner
@ 2016-04-28 16:42 ` Thomas Gleixner
  2016-04-28 16:42 ` [patch 2/7] lib/hashmod: Add modulo based hash mechanism Thomas Gleixner
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Thomas Gleixner @ 2016-04-28 16:42 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Ingo Molnar, Linus Torvalds, Andrew Morton,
	Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk,
	Davidlohr Bueso, Chris Mason, Carlos O'Donell,
	Torvald Riegel, Eric Dumazet

[-- Attachment #1: futex-Add-some-more-function-commentry.patch --]
[-- Type: text/plain, Size: 1126 bytes --]

Add some more comments and reformat existing ones to kernel doc style.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/futex.c |   15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -373,8 +373,12 @@ static inline int hb_waiters_pending(str
 #endif
 }
 
-/*
- * We hash on the keys returned from get_futex_key (see below).
+/**
+ * hash_futex - Return the hash bucket in the global hash
+ * @key:	Pointer to the futex key for which the hash is calculated
+ *
+ * We hash on the keys returned from get_futex_key (see below) and return the
+ * corresponding hash bucket in the global hash.
  */
 static struct futex_hash_bucket *hash_futex(union futex_key *key)
 {
@@ -384,7 +388,12 @@ static struct futex_hash_bucket *hash_fu
 	return &futex_queues[hash & (futex_hashsize - 1)];
 }
 
-/*
+
+/**
+ * match_futex - Check whether to futex keys are equal
+ * @key1:	Pointer to key1
+ * @key2:	Pointer to key2
+ *
  * Return 1 if two futex_keys are equal, 0 otherwise.
  */
 static inline int match_futex(union futex_key *key1, union futex_key *key2)


* [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-04-28 16:42 [patch 0/7] futex: Add support for process private hashing Thomas Gleixner
  2016-04-28 16:42 ` [patch 1/7] futex: Add some more function commentary Thomas Gleixner
@ 2016-04-28 16:42 ` Thomas Gleixner
  2016-04-28 18:32   ` Linus Torvalds
  2016-04-28 16:42 ` [patch 3/7] futex: Hash private futexes per process Thomas Gleixner
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Thomas Gleixner @ 2016-04-28 16:42 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Ingo Molnar, Linus Torvalds, Andrew Morton,
	Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk,
	Davidlohr Bueso, Chris Mason, Carlos O'Donell,
	Torvald Riegel, Eric Dumazet

[-- Attachment #1: lib-hashmod-Add-modulo-based-hash-mechanism.patch --]
[-- Type: text/plain, Size: 3515 bytes --]

hash_long/hash_ptr() provide really bad hashing for small hash sizes and for
cases where the lower 12 bits (Page size aligned) of the hash value are 0.

A simple modulo(prime) based hash function has way better results
across a wide range of input values. The implementation uses inverse
multiplication instead of a slow division.

A futex benchmark shows results better by up to a factor of 10 on small hashes.
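
As a minimal userspace sketch (not part of the patch), the reciprocal
multiplication idea mirrors the pmul/prime/mask scheme of the hash_mod()
added below, minus the 64-bit fold. Page aligned inputs, the worst case for
hash_ptr(), spread cleanly across the slots:

	#include <stdio.h>
	#include <stdint.h>

	/* Same reciprocal constant as hash_pmul() in the patch */
	#define HASH_PMUL(prime)	((uint32_t)((1ULL << 32) / (prime)))

	/*
	 * m = (a * pmul) >> 32 approximates a / prime, so a - m * prime
	 * approximates a % prime without a division.  The quotient estimate
	 * can be one too small, which is why the slot is masked to the
	 * power-of-two table size (prime = 251, mask = 0xff here).
	 */
	static unsigned int hash_mod32(uint32_t a, uint32_t prime, uint32_t mask)
	{
		uint32_t m = ((uint64_t)a * HASH_PMUL(prime)) >> 32;

		return (a - m * prime) & mask;
	}

	int main(void)
	{
		/* page aligned values: the case called out above */
		for (uint32_t a = 0; a < 16 * 4096; a += 4096)
			printf("%#x -> slot %u\n", (unsigned int)a,
			       hash_mod32(a, 251, 0xff));
		return 0;
	}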

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/hash.h |   28 ++++++++++++++++++++++++++++
 lib/Kconfig          |    3 +++
 lib/Makefile         |    1 +
 lib/hashmod.c        |   44 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 76 insertions(+)
 create mode 100644 lib/hashmod.c

--- a/include/linux/hash.h
+++ b/include/linux/hash.h
@@ -83,4 +83,32 @@ static inline u32 hash32_ptr(const void
 	return (u32)val;
 }
 
+struct hash_modulo {
+	unsigned int	pmul;
+	unsigned int	prime;
+	unsigned int	mask;
+};
+
+#ifdef CONFIG_HASH_MODULO
+
+int hash_modulo_params(unsigned int hash_bits, struct hash_modulo *hm);
+
+/**
+ * hash_mod - Hash @addr into a table slot using the parameters in @hm
+ */
+static inline unsigned int hash_mod(unsigned long addr, struct hash_modulo *hm)
+{
+	u32 a, m;
+
+	if (IS_ENABLED(CONFIG_64BIT)) {
+		a =  addr >> 32;
+		a ^= (unsigned int) addr;
+	} else {
+		a = addr;
+	}
+	m = ((u64)a * hm->pmul) >> 32;
+	return (a - m * hm->prime) & hm->mask;
+}
+#endif
+
 #endif /* _LINUX_HASH_H */
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -185,6 +185,9 @@ config CRC8
 	  when they need to do cyclic redundancy check according CRC8
 	  algorithm. Module will be called crc8.
 
+config HASH_MODULO
+       bool
+
 config AUDIT_GENERIC
 	bool
 	depends on AUDIT && !AUDIT_ARCH
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -97,6 +97,7 @@ obj-$(CONFIG_CRC32)	+= crc32.o
 obj-$(CONFIG_CRC7)	+= crc7.o
 obj-$(CONFIG_LIBCRC32C)	+= libcrc32c.o
 obj-$(CONFIG_CRC8)	+= crc8.o
+obj-$(CONFIG_HASH_MODULO) += hashmod.o
 obj-$(CONFIG_GENERIC_ALLOCATOR) += genalloc.o
 
 obj-$(CONFIG_842_COMPRESS) += 842/
--- /dev/null
+++ b/lib/hashmod.c
@@ -0,0 +1,44 @@
+/*
+ * Modulo based hash - Global helper functions
+ *
+ * (C) 2016 Linutronix GmbH, Thomas Gleixner
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public Licence version 2 as published by
+ * the Free Software Foundation;
+ */
+
+#include <linux/hash.h>
+#include <linux/errno.h>
+#include <linux/bug.h>
+#include <linux/kernel.h>
+
+#define hash_pmul(prime)	((unsigned int)((1ULL << 32) / prime))
+
+static const struct hash_modulo hash_modulo[] = {
+	{ .prime =    3, .pmul = hash_pmul(3),    .mask = 0x0003 },
+	{ .prime =    7, .pmul = hash_pmul(7),    .mask = 0x0007 },
+	{ .prime =   13, .pmul = hash_pmul(13),   .mask = 0x000f },
+	{ .prime =   31, .pmul = hash_pmul(31),   .mask = 0x001f },
+	{ .prime =   61, .pmul = hash_pmul(61),   .mask = 0x003f },
+	{ .prime =  127, .pmul = hash_pmul(127),  .mask = 0x007f },
+	{ .prime =  251, .pmul = hash_pmul(251),  .mask = 0x00ff },
+	{ .prime =  509, .pmul = hash_pmul(509),  .mask = 0x01ff },
+	{ .prime = 1021, .pmul = hash_pmul(1021), .mask = 0x03ff },
+	{ .prime = 2039, .pmul = hash_pmul(2039), .mask = 0x07ff },
+	{ .prime = 4093, .pmul = hash_pmul(4093), .mask = 0x0fff },
+};
+
+/**
+ * hash_modulo_params - Get the modulo hash parameters for a @hash_bits sized hash
+ */
+int hash_modulo_params(unsigned int hash_bits, struct hash_modulo *hm)
+{
+	hash_bits -= 2;
+
+	if (hash_bits >= ARRAY_SIZE(hash_modulo))
+		return -EINVAL;
+
+	*hm = hash_modulo[hash_bits];
+	return 0;
+}


* [patch 3/7] futex: Hash private futexes per process
  2016-04-28 16:42 [patch 0/7] futex: Add support for process private hashing Thomas Gleixner
  2016-04-28 16:42 ` [patch 1/7] futex: Add some more function commentary Thomas Gleixner
  2016-04-28 16:42 ` [patch 2/7] lib/hashmod: Add modulo based hash mechanism Thomas Gleixner
@ 2016-04-28 16:42 ` Thomas Gleixner
  2016-04-28 16:42 ` [patch 4/7] futex: Add op for hash preallocation Thomas Gleixner
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Thomas Gleixner @ 2016-04-28 16:42 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Ingo Molnar, Linus Torvalds, Andrew Morton,
	Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk,
	Davidlohr Bueso, Chris Mason, Carlos O'Donell,
	Torvald Riegel, Eric Dumazet

[-- Attachment #1: futex-Hash-private-futexes-per-process.patch --]
[-- Type: text/plain, Size: 13926 bytes --]

From: Sebastian Siewior <bigeasy@linutronix.de>

The standard futex mechanism in the Linux kernel uses a global hash to store
transient state. Collisions on that hash can lead to performance degradation
especially on NUMA systems, and on real-time enabled kernels even to priority
inversions.

To mitigate that problem we provide per process private hashing. On the first
futex operation in a process the kernel allocates a hash table. The hash table
is accessible via the process mm_struct. On Numa systems the hash is allocated
node local.

If the allocation fails then the global hash table is used as fallback, so
there is no user space visible impact of this feature.

The hash size has a default value which can be tweaked by the sysadmin. The
sysctl interface is implemented in a follow up patch to make the review
simpler. For applications which have special requirements for the private hash
and to allow preallocation of the hash for RT applications, we'll provide a
futex OP in a follow up patch.

Performance data acquired on a 4 socket (node) Intel machine with perf bench
futex-hash. 'G 65536' is the global hash with 65536 buckets, 'P <n>' a process
private hash with <n> buckets:

Threads  G 65536  P 4	  P 8      P 16       P 32     P 64     P 128    P 256

1        8175006  8645465  8617469  8628686   8625223  8664491  8590934  8631582
2	 8149869  8618385  8578185  8622267   8603253  8618787  8595073  8590591
4	 7479482  5867840  7882991  7604838   7894380  7882850  7884911  7886278
8	 7308822  2378057  5731051  5550479   7691198  7672814  7711939  7681549
16	 7295893   677414  2670682  3453552   7158906  7688978  7677603  7690290

So with the proper hash size the private hash is ~5% faster than the global
hash.

With a full perf bench futex-hash run with one process (36 threads) per node
and 1024 futexes per thread the following results are achieved:

G 65536	 P 4     P 8     P 16     P 32     P 64     P 128    P 256    P 512    P 1024  P 2048     
2673390  368952  682626  1223908  1845922  3003524  3538313  4118533  4286925  4289589 4274020

Ratio:   0.14    0.26    0.46     0.69     1.12     1.32     1.54     1.60     1.60    1.60

So with a private hash size of 256 buckets and above the performance is almost
steady in this pathological test case and a factor of 1.6 better than the global
hash. Even a 64 bucket hash is already 10% faster.

Signed-off-by: Sebastian Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/futex.h       |   38 +++++++---
 include/linux/futex_types.h |   14 +++
 include/linux/mm_types.h    |    4 +
 init/Kconfig                |    5 +
 kernel/fork.c               |    3 
 kernel/futex.c              |  166 +++++++++++++++++++++++++++++++++++++++++++-
 6 files changed, 219 insertions(+), 11 deletions(-)
 create mode 100644 include/linux/futex_types.h

--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -1,6 +1,7 @@
 #ifndef _LINUX_FUTEX_H
 #define _LINUX_FUTEX_H
 
+#include <linux/futex_types.h>
 #include <uapi/linux/futex.h>
 
 struct inode;
@@ -21,16 +22,19 @@ handle_futex_death(u32 __user *uaddr, st
  *
  * offset is aligned to a multiple of sizeof(u32) (== 4) by definition.
  * We use the two low order bits of offset to tell what is the kind of key :
- *  00 : Private process futex (PTHREAD_PROCESS_PRIVATE)
- *       (no reference on an inode or mm)
+ *  00 : Private process futex (PTHREAD_PROCESS_PRIVATE) using process private
+ *	 hash (no reference on an inode or mm)
  *  01 : Shared futex (PTHREAD_PROCESS_SHARED)
  *	mapped on a file (reference on the underlying inode)
  *  10 : Shared futex (PTHREAD_PROCESS_SHARED)
  *       (but private mapping on an mm, and reference taken on it)
+ *  11 : Private process futex (PTHREAD_PROCESS_PRIVATE) using global hash
+ *	 (no reference on an inode or mm)
 */
 
-#define FUT_OFF_INODE    1 /* We set bit 0 if key has a reference on inode */
-#define FUT_OFF_MMSHARED 2 /* We set bit 1 if key has a reference on mm */
+#define FUT_OFF_INODE		0x01 /* Key has a reference on inode */
+#define FUT_OFF_MMSHARED	0x02 /* Key has a reference on mm */
+#define FUT_OFF_PRIVATE		0x03 /* Key has no ref on inode/mm */
 
 union futex_key {
 	struct {
@@ -60,12 +64,30 @@ extern void exit_pi_state_list(struct ta
 #else
 extern int futex_cmpxchg_enabled;
 #endif
+
 #else
-static inline void exit_robust_list(struct task_struct *curr)
-{
-}
-static inline void exit_pi_state_list(struct task_struct *curr)
+static inline void exit_robust_list(struct task_struct *curr) { }
+static inline void exit_pi_state_list(struct task_struct *curr) { }
+#endif
+
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
+/* Process private hash data for futexes */
+
+extern unsigned int futex_default_hash_bits;
+extern unsigned int futex_max_hash_bits;
+
+extern void futex_mm_hash_exit(struct mm_struct *mm);
+
+static inline void futex_mm_hash_init(struct mm_struct *mm)
 {
+	raw_spin_lock_init(&mm->futex_hash.lock);
+	mm->futex_hash.hash = NULL;
 }
+
+#else
+
+static inline void futex_mm_hash_init(struct mm_struct *mm) { }
+static inline void futex_mm_hash_exit(struct mm_struct *mm) { }
 #endif
+
 #endif
--- /dev/null
+++ b/include/linux/futex_types.h
@@ -0,0 +1,14 @@
+#ifndef _LINUX_FUTEX_TYPES_H
+#define _LINUX_FUTEX_TYPES_H
+
+#include <linux/hash.h>
+
+struct futex_hash_bucket;
+
+struct futex_hash {
+	struct raw_spinlock		lock;
+	struct hash_modulo		hmod;
+	struct futex_hash_bucket	*hash;
+};
+
+#endif
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -11,6 +11,7 @@
 #include <linux/completion.h>
 #include <linux/cpumask.h>
 #include <linux/uprobes.h>
+#include <linux/futex_types.h>
 #include <linux/page-flags-layout.h>
 #include <asm/page.h>
 #include <asm/mmu.h>
@@ -442,6 +443,9 @@ struct mm_struct {
 
 	struct linux_binfmt *binfmt;
 
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
+	struct futex_hash futex_hash;
+#endif
 	cpumask_var_t cpu_vm_mask_var;
 
 	/* Architecture-specific MM context */
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1498,6 +1498,11 @@ config FUTEX
 	  support for "fast userspace mutexes".  The resulting kernel may not
 	  run glibc-based applications correctly.
 
+config FUTEX_PRIVATE_HASH
+	bool
+	default FUTEX && SMP
+	select HASH_MODULO
+
 config HAVE_FUTEX_CMPXCHG
 	bool
 	depends on FUTEX
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -617,6 +617,8 @@ static struct mm_struct *mm_init(struct
 	mm_init_owner(mm, p);
 	mmu_notifier_mm_init(mm);
 	clear_tlb_flush_pending(mm);
+	futex_mm_hash_init(mm);
+
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	mm->pmd_huge_pte = NULL;
 #endif
@@ -713,6 +715,7 @@ void mmput(struct mm_struct *mm)
 		khugepaged_exit(mm); /* must run before exit_mmap */
 		exit_mmap(mm);
 		set_mm_exe_file(mm, NULL);
+		futex_mm_hash_exit(mm);
 		if (!list_empty(&mm->mmlist)) {
 			spin_lock(&mmlist_lock);
 			list_del(&mm->mmlist);
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -23,6 +23,9 @@
  *  Copyright (C) IBM Corporation, 2009
  *  Thanks to Thomas Gleixner for conceptual design and careful reviews.
  *
+ *  Private hashed futex support by Sebastian Siewior and Thomas Gleixner
+ *  Copyright (C) Linutronix GmbH, 2016
+ *
  *  Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
  *  enough at me, Linus for the original (flawed) idea, Matthew
  *  Kirkwood for proof-of-concept implementation.
@@ -49,6 +52,7 @@
 #include <linux/fs.h>
 #include <linux/file.h>
 #include <linux/jhash.h>
+#include <linux/hash.h>
 #include <linux/init.h>
 #include <linux/futex.h>
 #include <linux/mount.h>
@@ -169,6 +173,34 @@
  * the code that actually moves the futex(es) between hash buckets (requeue_futex)
  * will do the additional required waiter count housekeeping. This is done for
  * double_lock_hb() and double_unlock_hb(), respectively.
+ *
+ * For private futexes we (pre)allocate a per process hash. We check lockless
+ * whether the hash is already allocated. To access the hash later we need
+ * information about the hash properties as well. This requires barriers as
+ * follows:
+ *
+ * CPU 0					CPU 1
+ * check_hash_allocation()
+ *	if (mm->futex_hash.hash)
+ *		return;
+ *	hash = alloc_hash()
+ *	lock(&mm->futex_hash.lock);
+ *	if (!mm->futex_hash.hash) {
+ *	  mm->futex_hash.par = params;
+ *
+ *	  smp_wmb(); (A0) <-paired with-|
+ *					|
+ *	  mm->futex_hash.hash = hash;	|
+ *					|	check_hash_allocation()
+ *					|	   if (mm->futex_hash.hash)
+ *					|		return;
+ *	  unlock(&mm->futex_hash.lock);	|	get_futex_key_refs()
+ *					|
+ *					|--------- smp_mb() (B)
+ *						s = hash(f, mm->futex_hash.par);
+ *						hb = &mm->futex_hash.hash[s];
+ *
+ * So we utilize the existing smp_mb() in get_futex_key_refs().
  */
 
 #ifndef CONFIG_HAVE_FUTEX_CMPXCHG
@@ -255,6 +287,22 @@ struct futex_hash_bucket {
 	struct plist_head chain;
 } ____cacheline_aligned_in_smp;
 
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
+/*
+ * Process private hash for non-shared futexes
+ */
+#define FUTEX_USE_GLOBAL_HASH		((void *) 0x03)
+
+#define FUTEX_MIN_HASH_BITS		order_base_2(4UL)
+#define FUTEX_DEF_HASH_BITS		order_base_2(8UL)
+#define FUTEX_MAX_HASH_BITS		order_base_2(256UL)
+
+unsigned int futex_default_hash_bits	= FUTEX_DEF_HASH_BITS;
+unsigned int futex_max_hash_bits	= FUTEX_MAX_HASH_BITS;
+#else
+static const unsigned int futex_default_hash_bits = 0;
+#endif
+
 /*
  * The base of the bucket array and its size are always used together
  * (after initialization only in hash_futex()), so ensure that they
@@ -374,13 +422,13 @@ static inline int hb_waiters_pending(str
 }
 
 /**
- * hash_futex - Return the hash bucket in the global hash
+ * hash_global_futex - Return the hash bucket in the global hash
  * @key:	Pointer to the futex key for which the hash is calculated
  *
  * We hash on the keys returned from get_futex_key (see below) and return the
  * corresponding hash bucket in the global hash.
  */
-static struct futex_hash_bucket *hash_futex(union futex_key *key)
+static struct futex_hash_bucket *hash_global_futex(union futex_key *key)
 {
 	u32 hash = jhash2((u32*)&key->both.word,
 			  (sizeof(key->both.word)+sizeof(key->both.ptr))/4,
@@ -388,9 +436,33 @@ static struct futex_hash_bucket *hash_fu
 	return &futex_queues[hash & (futex_hashsize - 1)];
 }
 
+/**
+ * hash_futex - Get the hash bucket for a futex
+ *
+ * Returns either the process private or the global hash bucket which fits the
+ * key.
+ */
+static struct futex_hash_bucket *hash_futex(union futex_key *key)
+{
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
+	struct mm_struct *mm = current->mm;
+	unsigned int slot;
+
+	/*
+	 * Futexes which use the per process hash have the lower bits cleared
+	 */
+	if (key->both.offset & (FUT_OFF_INODE | FUT_OFF_MMSHARED))
+		return hash_global_futex(key);
+
+	slot = hash_mod(key->private.address, &mm->futex_hash.hmod);
+	return &mm->futex_hash.hash[slot];
+#else
+	return hash_global_futex(key);
+#endif
+}
 
 /**
- * match_futex - Check whether to futex keys are equal
+ * match_futex - Check whether two futex keys are equal
  * @key1:	Pointer to key1
  * @key2:	Pointer to key2
  *
@@ -505,7 +577,20 @@ get_futex_key(u32 __user *uaddr, int fsh
 	 */
 	if (!fshared) {
 		key->private.mm = mm;
+		/*
+		 * If we have a process private hash, then we store uaddr
+		 * instead of the page base address.
+		 */
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
+		if (mm->futex_hash.hash != FUTEX_USE_GLOBAL_HASH) {
+			key->private.address = (unsigned long) uaddr;
+		} else {
+			key->private.address = address;
+			key->both.offset |= FUT_OFF_PRIVATE;
+		}
+#else
 		key->private.address = address;
+#endif
 		get_futex_key_refs(key);  /* implies smp_mb(); (B) */
 		return 0;
 	}
@@ -3153,6 +3238,79 @@ void exit_robust_list(struct task_struct
 				   curr, pip);
 }
 
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
+
+void futex_mm_hash_exit(struct mm_struct *mm)
+{
+	if (mm->futex_hash.hash && mm->futex_hash.hash != FUTEX_USE_GLOBAL_HASH)
+		kfree(mm->futex_hash.hash);
+	mm->futex_hash.hash = NULL;
+}
+
+static struct futex_hash_bucket *futex_alloc_hash(unsigned int hash_bits)
+{
+	struct futex_hash_bucket *hb;
+	size_t hash_size, size;
+	int i;
+
+	hash_size = 1 << hash_bits;
+	size = hash_size * sizeof(struct futex_hash_bucket);
+	hb = kzalloc_node(size, GFP_KERNEL, numa_node_id());
+	if (!hb)
+		return NULL;
+
+	for (i = 0; i < hash_size; i++) {
+		atomic_set(&hb[i].waiters, 0);
+		plist_head_init(&hb[i].chain);
+		spin_lock_init(&hb[i].lock);
+	}
+	return hb;
+}
+
+static void futex_populate_hash(unsigned int hash_bits)
+{
+	struct mm_struct *mm = current->mm;
+	struct futex_hash_bucket *hb = NULL;
+	struct hash_modulo hmod;
+
+	/*
+	 * We don't need an explicit smp_mb() when the hash is populated
+	 * because before we dereference mm->futex_hash.hmod in the hash
+	 * function we have an smp_mb() in futex_get_key_refs() already.
+	 */
+	if (mm->futex_hash.hash)
+		return;
+
+	/* If the params are invalid fallback to global hash */
+	if (!hash_modulo_params(hash_bits, &hmod))
+		hb = futex_alloc_hash(hash_bits);
+
+	/*
+	 * If we failed to allocate a hash on the fly, fall back to the global
+	 * hash.
+	 */
+	if (!hb)
+		hb = FUTEX_USE_GLOBAL_HASH;
+
+	raw_spin_lock(&mm->futex_hash.lock);
+	/* We might have raced with another task allocating the hash. */
+	if (!mm->futex_hash.hash) {
+		mm->futex_hash.hmod = hmod;
+		/*
+		 * Ensure that the above is visible before we store
+		 * the pointer.
+		 */
+		smp_wmb(); /* (A0) Pairs with (B) */
+		mm->futex_hash.hash = hb;
+		hb = NULL;
+	}
+	raw_spin_unlock(&mm->futex_hash.lock);
+	kfree(hb);
+}
+#else /* CONFIG_FUTEX_PRIVATE_HASH */
+static inline void futex_populate_hash(unsigned int hash_bits) { }
+#endif
+
 long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 		u32 __user *uaddr2, u32 val2, u32 val3)
 {
@@ -3161,6 +3319,8 @@ long do_futex(u32 __user *uaddr, int op,
 
 	if (!(op & FUTEX_PRIVATE_FLAG))
 		flags |= FLAGS_SHARED;
+	else
+		futex_populate_hash(futex_default_hash_bits);
 
 	if (op & FUTEX_CLOCK_REALTIME) {
 		flags |= FLAGS_CLOCKRT;


* [patch 4/7] futex: Add op for hash preallocation
  2016-04-28 16:42 [patch 0/7] futex: Add support for process private hashing Thomas Gleixner
                   ` (2 preceding siblings ...)
  2016-04-28 16:42 ` [patch 3/7] futex: Hash private futexes per process Thomas Gleixner
@ 2016-04-28 16:42 ` Thomas Gleixner
  2016-04-28 16:42 ` [patch 5/7] futex: Add sysctl knobs for process private hash Thomas Gleixner
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Thomas Gleixner @ 2016-04-28 16:42 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Ingo Molnar, Linus Torvalds, Andrew Morton,
	Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk,
	Davidlohr Bueso, Chris Mason, Carlos O'Donell,
	Torvald Riegel, Eric Dumazet

[-- Attachment #1: futex-Add-op-for-hash-preallocation.patch --]
[-- Type: text/plain, Size: 3658 bytes --]

From: Sebastian Siewior <bigeasy@linutronix.de>

The per process hash is allocated on the fly at the first futex operation of a
process. The size of the hash is determined by a system wide default setting
controlled by the sysadmin. This is suboptimal for RT applications and
applications with pathological futex abuse:

 - For RT applications it's important to allocate the per process hash up front
   so the allocation does not happen during the first real futex operation.

 - For pathological applications which use gazillions of futexes it's useful to
   allocate a hash greater than the default hash size.

Add a futex op which allows the application to preallocate the hash with the
requested size. The size is limited by the system wide maximum hash size, which
can be set by the admin. The requested size is rounded up to the next power of
two.

The function can be called several times, but only the first call results in a
hash allocation of the requested size as there is no non-intrusive way to
reallocate/rehash in a multithreaded application.

Note that this call must be issued before the first futex operation in the
process, because the first operation would otherwise allocate the default sized
hash automatically.

The function returns the actual hash size or 0 if the global hash is used. The
latter is the case on UP and in the rare case that the allocation failed and
the global hash is used as a fallback.
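
As a usage sketch (illustrative, not part of the patch), an RT application
would issue the op once at startup, before creating threads or touching any
futex; the calling convention below matches the perf bench helper added later
in this series (uaddr, timeout, uaddr2 and val3 are unused):

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <unistd.h>
	#include <sys/syscall.h>

	#ifndef FUTEX_PREALLOC_HASH
	#define FUTEX_PREALLOC_HASH	13	/* from this series */
	#endif

	int main(void)
	{
		/*
		 * Ask for 256 slots; the kernel rounds to a power of two and
		 * clamps to the sysctl maximum.  The return value is the
		 * actual hash size, or 0 if the global hash is used.
		 */
		long size = syscall(SYS_futex, NULL, FUTEX_PREALLOC_HASH, 256,
				    NULL, NULL, 0);

		if (size < 0)
			perror("futex(FUTEX_PREALLOC_HASH)");
		else
			printf("private futex hash size: %ld\n", size);
		return 0;
	}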

Signed-off-by: Sebastian Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/uapi/linux/futex.h |    1 +
 kernel/futex.c             |   41 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 42 insertions(+)

--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -20,6 +20,7 @@
 #define FUTEX_WAKE_BITSET	10
 #define FUTEX_WAIT_REQUEUE_PI	11
 #define FUTEX_CMP_REQUEUE_PI	12
+#define FUTEX_PREALLOC_HASH	13
 
 #define FUTEX_PRIVATE_FLAG	128
 #define FUTEX_CLOCK_REALTIME	256
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -3311,6 +3311,45 @@ static void futex_populate_hash(unsigned
 static inline void futex_populate_hash(unsigned int hash_bits) { }
 #endif
 
+/**
+ * futex_preallocate_hash - Preallocate the process private hash
+ * @slots:	Number of slots to allocate
+ *
+ * The function will allocate the process private hash with the number of
+ * requested slots. The number is rounded to the next power of two and may not
+ * exceed the current system limit.
+ *
+ * If the hash was already allocated by either an earlier call to
+ * futex_preallocate_hash() or an earlier futex op which allocated the hash
+ * on the fly, we return the size of the active hash.
+ *
+ * Return:	Size of the hash; 0 if the global hash is used.
+ */
+static int futex_preallocate_hash(unsigned int slots)
+{
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
+	struct mm_struct *mm = current->mm;
+	struct futex_hash_bucket *hb;
+	unsigned int bits;
+
+	/* Try to allocate the requested nr of slots */
+	bits = order_base_2(slots);
+
+	if (bits < FUTEX_MIN_HASH_BITS)
+		bits = FUTEX_MIN_HASH_BITS;
+
+	if (bits > futex_max_hash_bits)
+		bits = futex_max_hash_bits;
+
+	futex_populate_hash(bits);
+
+	hb = mm->futex_hash.hash;
+	return hb == FUTEX_USE_GLOBAL_HASH ? 0 : mm->futex_hash.hmod.mask + 1;
+#else
+	return 0;
+#endif
+}
+
 long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 		u32 __user *uaddr2, u32 val2, u32 val3)
 {
@@ -3366,6 +3405,8 @@ long do_futex(u32 __user *uaddr, int op,
 					     uaddr2);
 	case FUTEX_CMP_REQUEUE_PI:
 		return futex_requeue(uaddr, flags, uaddr2, val, val2, &val3, 1);
+	case FUTEX_PREALLOC_HASH:
+		return futex_preallocate_hash(val);
 	}
 	return -ENOSYS;
 }


* [patch 5/7] futex: Add sysctl knobs for process private hash
  2016-04-28 16:42 [patch 0/7] futex: Add support for process private hashing Thomas Gleixner
                   ` (3 preceding siblings ...)
  2016-04-28 16:42 ` [patch 4/7] futex: Add op for hash preallocation Thomas Gleixner
@ 2016-04-28 16:42 ` Thomas Gleixner
  2016-04-28 16:42 ` [patch 6/7] perf/bench/futex-hash: Support NUMA Thomas Gleixner
  2016-04-28 16:42 ` [patch 7/7] perf/bench/futex-hash: Support preallocate hash table Thomas Gleixner
  6 siblings, 0 replies; 21+ messages in thread
From: Thomas Gleixner @ 2016-04-28 16:42 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Ingo Molnar, Linus Torvalds, Andrew Morton,
	Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk,
	Davidlohr Bueso, Chris Mason, Carlos O'Donell,
	Torvald Riegel, Eric Dumazet

[-- Attachment #1: add-sysctl-interface.patch --]
[-- Type: text/plain, Size: 3677 bytes --]

From: Sebastian Siewior <bigeasy@linutronix.de>

To adjust the default hash size and the maximum hash size for process private
futexes we add the following sysctls:

futex_private_default_hash_bits:

     Adjusts the default hash size (in bits) which is used for automatic hash
     allocations on the first futex operation

futex_private_max_hash_bits:

     Adjusts the maximum hash size (in bits). This limits the hash size which
     can be preallocated by applications with the FUTEX_PREALLOC_HASH op.

Signed-off-by: Sebastian Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 Documentation/sysctl/kernel.txt |   17 +++++++++++++++++
 include/linux/futex.h           |    1 +
 kernel/futex.c                  |    5 +++--
 kernel/sysctl.c                 |   21 +++++++++++++++++++++
 4 files changed, 42 insertions(+), 2 deletions(-)

--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -29,6 +29,8 @@ Currently, these files might (depending
 - core_pipe_limit
 - core_uses_pid
 - ctrl-alt-del
+- futex_private_default_hash_bits
+- futex_private_max_hash_bits
 - dmesg_restrict
 - domainname
 - hostname
@@ -265,6 +267,21 @@ to decide what to do with it.
 
 ==============================================================
 
+futex_private_default_hash_bits:
+
+Adjusts the default hash size (in bits) which is used for
+automatic hash allocations on the first futex operation
+
+==============================================================
+
+futex_private_max_hash_bits:
+
+Adjusts the maximum hash size (in bits). This limits the hash
+size which can be preallocated by applications with the
+FUTEX_PREALLOC_HASH op.
+
+==============================================================
+
 dmesg_restrict:
 
 This toggle indicates whether unprivileged users are prevented
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -75,6 +75,7 @@ static inline void exit_pi_state_list(st
 
 extern unsigned int futex_default_hash_bits;
 extern unsigned int futex_max_hash_bits;
+extern unsigned int futex_sysmax_hash_bits;
 
 extern void futex_mm_hash_exit(struct mm_struct *mm);
 
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -297,8 +297,9 @@ struct futex_hash_bucket {
 #define FUTEX_DEF_HASH_BITS		order_base_2(8UL)
 #define FUTEX_MAX_HASH_BITS		order_base_2(256UL)
 
-unsigned int futex_default_hash_bits	= FUTEX_DEF_HASH_BITS;
-unsigned int futex_max_hash_bits	= FUTEX_MAX_HASH_BITS;
+unsigned int futex_default_hash_bits			= FUTEX_DEF_HASH_BITS;
+unsigned int futex_max_hash_bits			= FUTEX_MAX_HASH_BITS;
+unsigned int  __read_mostly futex_sysmax_hash_bits	= FUTEX_MAX_HASH_BITS;
 #else
 static const unsigned int futex_default_hash_bits = 0;
 #endif
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -65,6 +65,7 @@
 #include <linux/sched/sysctl.h>
 #include <linux/kexec.h>
 #include <linux/bpf.h>
+#include <linux/futex.h>
 
 #include <asm/uaccess.h>
 #include <asm/processor.h>
@@ -593,6 +594,26 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
+	{
+		.procname	= "futex_private_default_hash_bits",
+		.data		= &futex_default_hash_bits,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &two,
+		.extra2		= &futex_max_hash_bits,
+	},
+	{
+		.procname	= "futex_private_max_hash_bits",
+		.data		= &futex_max_hash_bits,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &futex_default_hash_bits,
+		.extra2		= &futex_sysmax_hash_bits,
+	},
+#endif
 #ifdef CONFIG_FUNCTION_TRACER
 	{
 		.procname	= "ftrace_enabled",


* [patch 6/7] perf/bench/futex-hash: Support NUMA
  2016-04-28 16:42 [patch 0/7] futex: Add support for process private hashing Thomas Gleixner
                   ` (4 preceding siblings ...)
  2016-04-28 16:42 ` [patch 5/7] futex: Add sysctl knobs for process private hash Thomas Gleixner
@ 2016-04-28 16:42 ` Thomas Gleixner
  2016-04-28 16:42 ` [patch 7/7] perf/bench/futex-hash: Support preallocate hash table Thomas Gleixner
  6 siblings, 0 replies; 21+ messages in thread
From: Thomas Gleixner @ 2016-04-28 16:42 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Ingo Molnar, Linus Torvalds, Andrew Morton,
	Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk,
	Davidlohr Bueso, Chris Mason, Carlos O'Donell,
	Torvald Riegel, Eric Dumazet

[-- Attachment #1: perf-bench-futex-hash-Support-NUMA.patch --]
[-- Type: text/plain, Size: 5893 bytes --]

This adds a new option to tell perf on which NUMA node the hash benchmark
should run. If set, then:

 - The test is bound to the node
 - Memory is allocated on the local NUMA node
 - The threads are bound to the cpus on the node

The NUMA node can be specified by the -n argument.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 tools/perf/bench/Build        |  4 ++
 tools/perf/bench/futex-hash.c | 89 +++++++++++++++++++++++++++++++++++++------
 2 files changed, 82 insertions(+), 11 deletions(-)

diff --git a/tools/perf/bench/Build b/tools/perf/bench/Build
index 60bf119..9e6e518 100644
--- a/tools/perf/bench/Build
+++ b/tools/perf/bench/Build
@@ -1,3 +1,7 @@
+ifdef CONFIG_NUMA
+CFLAGS_futex-hash.o   += -DCONFIG_NUMA=1
+endif
+
 perf-y += sched-messaging.o
 perf-y += sched-pipe.o
 perf-y += mem-functions.o
diff --git a/tools/perf/bench/futex-hash.c b/tools/perf/bench/futex-hash.c
index 0999ac5..a1c6ee9 100644
--- a/tools/perf/bench/futex-hash.c
+++ b/tools/perf/bench/futex-hash.c
@@ -20,6 +20,9 @@
 #include <stdlib.h>
 #include <sys/time.h>
 #include <pthread.h>
+#ifdef CONFIG_NUMA
+#include <numa.h>
+#endif
 
 static unsigned int nthreads = 0;
 static unsigned int nsecs    = 10;
@@ -27,6 +30,7 @@ static unsigned int nsecs    = 10;
 static unsigned int nfutexes = 1024;
 static bool fshared = false, done = false, silent = false;
 static int futex_flag = 0;
+static int numa_node = -1;
 
 struct timeval start, end, runtime;
 static pthread_mutex_t thread_lock;
@@ -39,7 +43,7 @@ struct worker {
 	u_int32_t *futex;
 	pthread_t thread;
 	unsigned long ops;
-};
+} __attribute__((aligned(128)));
 
 static const struct option options[] = {
 	OPT_UINTEGER('t', "threads", &nthreads, "Specify amount of threads"),
@@ -47,9 +51,28 @@ static const struct option options[] = {
 	OPT_UINTEGER('f', "futexes", &nfutexes, "Specify amount of futexes per threads"),
 	OPT_BOOLEAN( 's', "silent",  &silent,   "Silent mode: do not display data/details"),
 	OPT_BOOLEAN( 'S', "shared",  &fshared,  "Use shared futexes instead of private ones"),
+#ifdef CONFIG_NUMA
+	OPT_INTEGER( 'n', "numa",   &numa_node,  "Specify the NUMA node"),
+#endif
 	OPT_END()
 };
 
+#ifndef CONFIG_NUMA
+static int numa_run_on_node(int node __maybe_unused) { return 0; }
+static int numa_node_of_cpu(int node __maybe_unused) { return 0; }
+static void *numa_alloc_local(size_t size) { return malloc(size); }
+static void numa_free(void *p, size_t size __maybe_unused) { return free(p); }
+#endif
+
+static bool cpu_is_local(int cpu)
+{
+	if (numa_node < 0)
+		return true;
+	if (numa_node_of_cpu(cpu) == numa_node)
+		return true;
+	return false;
+}
+
 static const char * const bench_futex_hash_usage[] = {
 	"perf bench futex hash <options>",
 	NULL
@@ -115,6 +138,8 @@ int bench_futex_hash(int argc, const char **argv,
 	unsigned int i, ncpus;
 	pthread_attr_t thread_attr;
 	struct worker *worker = NULL;
+	char *node_str = NULL;
+	unsigned int cpunum;
 
 	argc = parse_options(argc, argv, options, bench_futex_hash_usage, 0);
 	if (argc) {
@@ -128,18 +153,50 @@ int bench_futex_hash(int argc, const char **argv,
 	act.sa_sigaction = toggle_done;
 	sigaction(SIGINT, &act, NULL);
 
-	if (!nthreads) /* default to the number of CPUs */
-		nthreads = ncpus;
+	if (!nthreads) {
+		/* default to the number of CPUs per NUMA node */
+		if (numa_node < 0) {
+			nthreads = ncpus;
+		} else {
+			for (i = 0; i < ncpus; i++) {
+				if (cpu_is_local(i))
+					nthreads++;
+			}
+			if (!nthreads)
+				err(EXIT_FAILURE, "No online CPUs for this node");
+		}
+	} else {
+		int cpu_available = 0;
+
+		for (i = 0; i < ncpus && !cpu_available; i++) {
+			if (cpu_is_local(i))
+				cpu_available = 1;
+		}
+		if (!cpu_available)
+			err(EXIT_FAILURE, "No online CPUs for this node");
+	}
+
+	if (numa_node >= 0) {
+		ret = numa_run_on_node(numa_node);
+		if (ret < 0)
+			err(EXIT_FAILURE, "numa_run_on_node");
+		ret = asprintf(&node_str, " on node %d", numa_node);
+		if (ret < 0)
+			err(EXIT_FAILURE, "numa_node, asprintf");
+	}
 
-	worker = calloc(nthreads, sizeof(*worker));
+	worker = numa_alloc_local(nthreads * sizeof(*worker));
 	if (!worker)
 		goto errmem;
 
 	if (!fshared)
 		futex_flag = FUTEX_PRIVATE_FLAG;
 
-	printf("Run summary [PID %d]: %d threads, each operating on %d [%s] futexes for %d secs.\n\n",
-	       getpid(), nthreads, nfutexes, fshared ? "shared":"private", nsecs);
+	printf("Run summary [PID %d]: %d threads%s, each operating on %d [%s] futexes for %d secs.\n\n",
+	       getpid(), nthreads,
+	       node_str ? : "",
+	       nfutexes, fshared ? "shared":"private",
+	       nsecs);
 
 	init_stats(&throughput_stats);
 	pthread_mutex_init(&thread_lock, NULL);
@@ -149,14 +206,24 @@ int bench_futex_hash(int argc, const char **argv,
 	threads_starting = nthreads;
 	pthread_attr_init(&thread_attr);
 	gettimeofday(&start, NULL);
-	for (i = 0; i < nthreads; i++) {
+	for (cpunum = 0, i = 0; i < nthreads; i++, cpunum++) {
+
+		do {
+			if (cpu_is_local(cpunum))
+				break;
+			cpunum++;
+			if (cpunum > ncpus)
+				cpunum = 0;
+		} while (1);
+
 		worker[i].tid = i;
-		worker[i].futex = calloc(nfutexes, sizeof(*worker[i].futex));
+		worker[i].futex = numa_alloc_local(nfutexes *
+						   sizeof(*worker[i].futex));
 		if (!worker[i].futex)
 			goto errmem;
 
 		CPU_ZERO(&cpu);
-		CPU_SET(i % ncpus, &cpu);
+		CPU_SET(cpunum % ncpus, &cpu);
 
 		ret = pthread_attr_setaffinity_np(&thread_attr, sizeof(cpu_set_t), &cpu);
 		if (ret)
@@ -203,12 +270,12 @@ int bench_futex_hash(int argc, const char **argv,
 				       &worker[i].futex[nfutexes-1], t);
 		}
 
-		free(worker[i].futex);
+		numa_free(worker[i].futex, nfutexes * sizeof(*worker[i].futex));
 	}
 
 	print_summary();
 
-	free(worker);
+	numa_free(worker, nthreads * sizeof(*worker));
 	return ret;
 errmem:
 	err(EXIT_FAILURE, "calloc");
-- 
2.1.4


* [patch 7/7] perf/bench/futex-hash: Support preallocate hash table
  2016-04-28 16:42 [patch 0/7] futex: Add support for process private hashing Thomas Gleixner
                   ` (5 preceding siblings ...)
  2016-04-28 16:42 ` [patch 6/7] perf/bench/futex-hash: Support NUMA Thomas Gleixner
@ 2016-04-28 16:42 ` Thomas Gleixner
  6 siblings, 0 replies; 21+ messages in thread
From: Thomas Gleixner @ 2016-04-28 16:42 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Ingo Molnar, Linus Torvalds, Andrew Morton,
	Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk,
	Davidlohr Bueso, Chris Mason, Carlos O'Donell,
	Torvald Riegel, Eric Dumazet

[-- Attachment #1: perf-bench-futex-hash-Support-preallocate-hash-table.patch --]
[-- Type: text/plain, Size: 2578 bytes --]

Instead of using the default hash size on the first allocation it is
possible to allocate a specific number of slots upfront.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 tools/perf/bench/futex-hash.c |   16 ++++++++++++++--
 tools/perf/bench/futex.h      |    5 +++++
 2 files changed, 19 insertions(+), 2 deletions(-)

--- a/tools/perf/bench/futex-hash.c
+++ b/tools/perf/bench/futex-hash.c
@@ -30,6 +30,7 @@ static unsigned int nsecs    = 10;
 static unsigned int nfutexes = 1024;
 static bool fshared = false, done = false, silent = false;
 static int futex_flag = 0;
+static unsigned int prealloc;
 static int numa_node = -1;
 
 struct timeval start, end, runtime;
@@ -51,6 +52,7 @@ static const struct option options[] = {
 	OPT_UINTEGER('f', "futexes", &nfutexes, "Specify amount of futexes per threads"),
 	OPT_BOOLEAN( 's', "silent",  &silent,   "Silent mode: do not display data/details"),
 	OPT_BOOLEAN( 'S', "shared",  &fshared,  "Use shared futexes instead of private ones"),
+	OPT_UINTEGER('p', "prealloc",&prealloc, "Specify number of preallocated hash slots"),
 #ifdef CONFIG_NUMA
 	OPT_INTEGER( 'n', "numa",   &numa_node,  "Specify the NUMA node"),
 #endif
@@ -138,6 +140,7 @@ int bench_futex_hash(int argc, const cha
 	unsigned int i, ncpus;
 	pthread_attr_t thread_attr;
 	struct worker *worker = NULL;
+	char *prealloc_str = NULL;
 	char *node_str = NULL;
 	unsigned int cpunum;
 
@@ -192,11 +195,20 @@ int bench_futex_hash(int argc, const cha
 	if (!fshared)
 		futex_flag = FUTEX_PRIVATE_FLAG;
 
-	printf("Run summary [PID %d]: %d threads%s, each operating on %d [%s] futexes for %d secs.\n\n",
+	if (prealloc) {
+		ret = futex_preallocate(prealloc);
+		if (ret < 0)
+			err(EXIT_FAILURE, "futex_preallocate");
+		ret = asprintf(&prealloc_str, " P %u %d", prealloc, ret);
+		if (ret < 0)
+			err(EXIT_FAILURE, "futex_preallocate, asprintf");
+	}
+
+	printf("Run summary [PID %d]: %d threads%s, each operating on %d [%s%s] futexes for %d secs.\n\n",
 	       getpid(), nthreads,
 	       node_str ? : "",
 	       nfutexes, fshared ? "shared":"private",
-	       nsecs);
+	       prealloc_str ? : "", nsecs);
 
 	init_stats(&throughput_stats);
 	pthread_mutex_init(&thread_lock, NULL);
--- a/tools/perf/bench/futex.h
+++ b/tools/perf/bench/futex.h
@@ -101,4 +101,9 @@ static inline int pthread_attr_setaffini
 }
 #endif
 
+static inline int futex_preallocate(u_int32_t hash_size)
+{
+	return futex(0, FUTEX_PREALLOC_HASH, hash_size, NULL, NULL, 0, 0);
+}
+
 #endif /* _FUTEX_H */


* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-04-28 16:42 ` [patch 2/7] lib/hashmod: Add modulo based hash mechanism Thomas Gleixner
@ 2016-04-28 18:32   ` Linus Torvalds
  2016-04-28 23:26     ` Thomas Gleixner
  2016-04-29 21:10     ` Linus Torvalds
  0 siblings, 2 replies; 21+ messages in thread
From: Linus Torvalds @ 2016-04-28 18:32 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Andrew Morton,
	Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk,
	Davidlohr Bueso, Chris Mason, Carlos O'Donell,
	Torvald Riegel, Eric Dumazet

On Thu, Apr 28, 2016 at 9:42 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> hash_long/hash_ptr() provide really bad hashing for small hash sizes and for
> cases where the lower 12 bits (Page size aligned) of the hash value are 0.

Hmm.

hash_long/ptr really shouldn't care about the low bits being zero at
all, because it should mix in all the bits (using a prime multiplier
and taking the high bits).

That said, numbers rule, so clearly we need to do something. It does
strike me that we would be better off just trying to improve
hash_long().

In particular, there are people and projects that have worked on
nothing but hashing. I'm not sure we should add a new hash algorithm
even if the whole "modulo prime" sounds obviously fine in theory. For
example, your 64-bit code has obvious problems if there are common
patterns in the low and the high 32 bits.. Not a problem for something
like hash_ptr(), but it can certainly be a problem for other cases.

It would be a really good idea to have some real hard numbers on the
hashing in general, but _particularly_ so if/when we start adding new
ones. Have you tested the modulus version with SMhasher, for example?

For example, there's Thomas Wang's hash function which should cascade
all the bits.

I'd really hate to add *another* ad-hoc hash when the previous ad-hoc
hash has been shown to be bad.

               Linus


* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-04-28 18:32   ` Linus Torvalds
@ 2016-04-28 23:26     ` Thomas Gleixner
  2016-04-29  2:25       ` Linus Torvalds
  2016-04-29 21:10     ` Linus Torvalds
  1 sibling, 1 reply; 21+ messages in thread
From: Thomas Gleixner @ 2016-04-28 23:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Andrew Morton,
	Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk,
	Davidlohr Bueso, Chris Mason, Carlos O'Donell,
	Torvald Riegel, Eric Dumazet

On Thu, 28 Apr 2016, Linus Torvalds wrote:
> 
> I'd really hate to add *another* ad-hoc hash when the previous ad-hoc
> hash has been shown to be bad.

I completely agree.

I'm not a hashing wizard and I completely failed to understand why
hash_long/ptr are so horrible for the various test cases I ran.

So my ad hoc test was to use the only hash function I truly understand. It was
state of the art in my university days :) And surprise, surprise it worked
really well.

My main focus was really to solve this futex issue which plagues various people
and not to dive into hashing theory for a few weeks.

I'll try to dig up some time to analyze the hash_long failure unless someone
familiar with the problem is beating me to it.

Thanks,

	tglx


* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-04-28 23:26     ` Thomas Gleixner
@ 2016-04-29  2:25       ` Linus Torvalds
  2016-04-30 13:02         ` Thomas Gleixner
  2016-06-12 12:18         ` Sandy Harris
  0 siblings, 2 replies; 21+ messages in thread
From: Linus Torvalds @ 2016-04-29  2:25 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Andrew Morton,
	Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk,
	Davidlohr Bueso, Chris Mason, Carlos O'Donell,
	Torvald Riegel, Eric Dumazet

On Thu, Apr 28, 2016 at 4:26 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> I'll try to dig up some time to analyze the hash_long failure unless someone
> familiar with the problem is beating me to it.

I'm not sure you need to spend time analyzing failure: if you get bad
hashing with hash_long(), then we know that is a bad hash without
having to really try to figure out why.

It's the hashes that _look_ like they might be good hashes, but
there's not a lot of analysis behind it, that I would worry about. The
simple prime modulus _should_ be fine, but at the same time I kind of
suspect we can do better. Especially since it has two multiplications.

Looking around, there's

    http://burtleburtle.net/bob/hash/integer.html

and that 32-bit "full avalanche" hash in six shifts looks like it
could be better. You wouldn't want to inline it, but the point of a
full avalanche bit mixing _should_ be that you could avoid the whole
"upper bits" part, and it should work independently of the target set
size.

So if that hash works better, it would be a pretty good replacement
option for hash_int().
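
For comparison, one widely used full-avalanche mixer of exactly that shape is
the 32-bit finalizer from MurmurHash3 (shown here only as an example of the
style; it is not the function from the page above):

	static inline unsigned int fmix32(unsigned int h)
	{
		/* two multiplies and three xor-shifts, designed so every
		   input bit influences every output bit */
		h ^= h >> 16;
		h *= 0x85ebca6b;
		h ^= h >> 13;
		h *= 0xc2b2ae35;
		h ^= h >> 16;
		return h;
	}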

There is also

    https://gist.github.com/badboy/6267743

that has a 64 bit to 32 bit hash function that might be useful for
"hash_long()".

Most of the people who worry about hashes tend to have strings to
hash, not just a single word like a pointer, but there's clearly
people around who have tried to search for good hashes that really
spread out the bits.

                Linus


* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-04-28 18:32   ` Linus Torvalds
  2016-04-28 23:26     ` Thomas Gleixner
@ 2016-04-29 21:10     ` Linus Torvalds
  2016-04-29 23:51       ` Linus Torvalds
  2016-04-30 15:22       ` Thomas Gleixner
  1 sibling, 2 replies; 21+ messages in thread
From: Linus Torvalds @ 2016-04-29 21:10 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Andrew Morton,
	Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk,
	Davidlohr Bueso, Chris Mason, Carlos O'Donell,
	Torvald Riegel, Eric Dumazet

On Thu, Apr 28, 2016 at 11:32 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> hash_long/ptr really shouldn't care about the low bits being zero at
> all, because it should mix in all the bits (using a prime multiplier
> and taking the high bits).

Looking at this assertion, it doesn't actually pan out very well at all.

The 32-bit hash looks to be ok. Certainly not perfect, but not horrible.

The 64-bit hash seems to be quite horribly bad with lots of values. I
wrote a test-harness to check it out (some simple values just spread
out at a fixed stride), and the end results are *so* bad that I'm kind
of worried that I screwed up the test harness. But it gives quite
reasonable values for hash_32() and for the plain modulo case.

Now, the way my test-harness works (literally testing a lot of
equal-stride cases), the "modulo prime number" approach will
automatically look perfect. So my test-harness is pretty unfair in
that respect.

But the hash_32() function looks good when hashing into 16 bits, for
example. In that case it does a very good job of spreading things out.
When hashing into 17 bits, hash_32 still looks good, except it does
very badly for strides of 32. It starts doing worse for bigger hash
buckets and bigger strides.

But our hash_64() seems to do very badly on pretty much *any* pattern.
To the point where I started to doubt my test-program. But it really
looks like that multiplication constant is almost pessimally chosen.

For example, that _long_ range of bits set ("7fffffffc" in the middle)
is effectively just one bit set with a subtraction. And it's *right*
in that bit area that is supposed to shuffle bits 14-40 to the high
bits (which is what we actually *use*). So it effectively shuffles none
of those bits around at all, and if you have a stride of 4096, you're
pretty much done for.

The 32-bit value is better in this regard, largely thanks to having
that low bit set. Thanks to that, the information in bits around 12-18
will stay in bits 12-18, and because we then only have 32 bits, if the
hash table is large enough, they will still be part of the bits when
we take the high bits. For the 64-bit case, bits 12-18 will never even
be relevant.

So I think that what happens here is that hash_32() is actually
somewhat reasonable. But hash_64() sucks donkey balls when there's a
lot of information in around bits 10-20 (which is exactly where a lot
of pointer bits have the *most* information thanks to alignment
issues).

Picking a new value almost at random (I say "almost", because I just
started with that 32-bit multiplicand value that mostly works and
shifted it up by 32 bits and then randomly added a few more bits to
avoid long ranges of ones and zeroes), I picked

  #define GOLDEN_RATIO_PRIME_64 0x9e3700310c100d01UL

and it is *much* better in my test harness.

Of course, things like that depend on what patterns you test. But I
did have a "range of strides and hash sizes" I tried. So just for fun:
try changing GOLDEN_RATIO_PRIME_64 to that value, and see if the
absolutely _horrid_ page-aligned case goes away for you?

It really looks like those multiplication numbers were very very badly picked.

Still, that number doesn't do very well if the hash is small (say, 8
bits). But for slightly larger hash tables it seems to be doing much
better.

                  Linus


* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-04-29 21:10     ` Linus Torvalds
@ 2016-04-29 23:51       ` Linus Torvalds
  2016-04-30  1:34         ` Rik van Riel
  2016-05-02  9:39         ` Torvald Riegel
  2016-04-30 15:22       ` Thomas Gleixner
  1 sibling, 2 replies; 21+ messages in thread
From: Linus Torvalds @ 2016-04-29 23:51 UTC (permalink / raw)
  To: Thomas Gleixner, Rik van Riel
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Andrew Morton,
	Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk,
	Davidlohr Bueso, Chris Mason, Carlos O'Donell,
	Torvald Riegel, Eric Dumazet

On Fri, Apr 29, 2016 at 2:10 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Apr 28, 2016 at 11:32 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> hash_long/ptr really shouldn't care about the low bits being zero at
>> all, because it should mix in all the bits (using a prime multiplier
>> and taking the high bits).
>
> Looking at this assertion, it doesn't actually pan out very well at all.
>
> The 32-bit hash looks to be ok. Certainly not perfect, but not horrible.
>
> The 64-bit hash seems to be quite horribly bad with lots of values.

Ok, I have tried to research this a bit more. There's a lot of
confusion here that ends up making hash_64() suck donkey balls.

The basic method for the hashing method by multiplication is fairly
sane. It's well-documented, and attributed to Knuth as the comment
above it says.

However, there's two problems in there that degrade the hash, and
particularly so for the 64-bit case.

The first is the confusion about multiplying with a prime number..
That actually makes no sense at all, and is in fact entirely wrong.
There's no reason to try to pick a prime number for the
multiplication, and I'm not seeing Knuth having ever suggested that.

Knuth's suggestion is to do the multiplication with a floating point
value A somewhere in between 0 and 1, and multiplying the integer with
it, and then just taking the fractional part and multiply it up by 'm'
and do the floor of that to get a number in the range 0..m

At no point are primes involved.

And our multiplication really does approximate that - except it's
being done in fixed-point arithmetic (so the thing you multiply with
is basically n/2**64, and the overflow is what gets rid of the
fractional part - then getting the "high bits" of the result is really
just multiplying by a power of two and taking the floor of the
result).

So the first mistake is thinking that the thing you should multiply
with should be prime. The primality matters for when you use a
division to get a modulus, which is presumably where the confusion
came from.

Now, what value 'A' you use doesn't seem to really matter much. Knuth
suggested the fractional part of the golden ratio, but I suspect that
is purely because it's an irrational number that is not near to 0 or
1. You could use anything, although since "random bit pattern" is part
of the issue, irrational numbers are a good starting point. I suspect
that with our patterns, there's actually seldom a good reason to do
lots of high-order bits, so you might as well pick something closer to
0, but whatever - it's only going to matter for the overflow part that
gets thrown away anyway.

The second mistake - and the one that actually causes the real problem
- is to believe that the "bit sparseness" is a good thing. It's not.
It's _very_ much not. If you don't mix the bits well in the
multiplication, you get exactly the problem we hit: certain bit
patterns just will not mix up into the higher order bits.

So if you look at what the actual golden ratio representation *should* have been:

  #define GOLDEN_RATIO_32 0x9e3779b9
  #define GOLDEN_RATIO_64 0x9e3779b97f4a7c16

and then compare it to the values we actually _use_ (bit-sparse primes
closeish to those values):

  #define GOLDEN_RATIO_PRIME_32 0x9e370001UL
  #define GOLDEN_RATIO_PRIME_64 0x9e37fffffffc0001UL

you start to see the problem. The right set of values has roughly 50%
of the bits set in a random pattern. The wrong set of values does not.

But as far as I can tell, you might as well use the fractional part of
'e' or 'pi' or just make random shit up that has a reasonable bit
distribution.

So we could use the fractional part of the golden ratio (phi):
  0x9e3779b9
  0x9e3779b97f4a7c16

or pi:
  0x243f6a88
  0x243f6a8885a308d3

or e:
  0xb7e15162
  0xb7e151628aed2a6b

or whatever. The constants don't have to be prime. They don't even
have to be odd, because the low bits will always be masked off anyway.
The whole "hash one integer value down to X bits" is very different in
this respect to things like string hashes, where you really do tend to
want primes (because you keep all bits).
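
For what it's worth, those example constants fall out of scaling the
fractional parts up to 32 bits of fixed point. A throwaway sketch (the
64-bit variants would need more precision than a double carries):

  #include <math.h>
  #include <stdio.h>

  int main(void)
  {
          double vals[] = { (1.0 + sqrt(5.0)) / 2.0,      /* phi */
                            3.141592653589793,            /* pi */
                            2.718281828459045 };          /* e */
          const char *names[] = { "phi", "pi", "e" };

          for (int i = 0; i < 3; i++) {
                  double frac = vals[i] - floor(vals[i]);
                  printf("%-3s: 0x%08lx\n", names[i],
                         (unsigned long)(frac * 4294967296.0));
          }
          return 0;
  }

which should print 0x9e3779b9, 0x243f6a88 and 0xb7e15162.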

But none of those are sparse. I think *some* amount of sparseness
might be ok if it allows people with bad CPUs to do it using
shift-and-adds; it just must not be as sparse as the current numbers,
the 64-bit one in particular.

There's presumably a few optimal values from a "spread bits out
evenly" standpoint, and they won't have anything to do with random
irrational constants, and will have everything to do with having nice
bitpatterns.

I'm adding Rik to the cc, because the original broken constants came
from him long long ago (they go back to 2002, originally only used for
the waitqueue hashing). Maybe he had some other input that caused him
to believe that the primeness actually mattered.

                 Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-04-29 23:51       ` Linus Torvalds
@ 2016-04-30  1:34         ` Rik van Riel
  2016-05-02  9:39         ` Torvald Riegel
  1 sibling, 0 replies; 21+ messages in thread
From: Rik van Riel @ 2016-04-30  1:34 UTC (permalink / raw)
  To: Linus Torvalds, Thomas Gleixner
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Andrew Morton,
	Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk,
	Davidlohr Bueso, Chris Mason, Carlos O'Donell,
	Torvald Riegel, Eric Dumazet

[-- Attachment #1: Type: text/plain, Size: 1130 bytes --]

On Fri, 2016-04-29 at 16:51 -0700, Linus Torvalds wrote:

> There's presumably a few optimal values from a "spread bits out
> evenly" standpoint, and they won't have anything to do with random
> irrational constants, and will have everything to do with having nice
> bitpatterns.
> 
> I'm adding Rik to the cc, because the original broken constants came
> from him long long ago (they go back to 2002, originally only used
> for
> the waitqueue hashing). Maybe he had some other input that caused him
> to believe that the primeness actually mattered.

I do not remember where I got that hashing algorithm and
magic constant from 14 years ago, but the changelog suggests
I got it from Chuck Lever's paper.

Chuck Lever's paper does mention that primeness "adds
certain desirable qualities", and I may have read too
much into that.

I really do not remember the "bit sparse" thing at all, and have no
idea where that came from. Googling old email threads about the code
mostly makes me wonder "hey, where did that person go?"

I am all for magic numbers that work better.

-- 
All Rights Reversed.


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-04-29  2:25       ` Linus Torvalds
@ 2016-04-30 13:02         ` Thomas Gleixner
  2016-04-30 16:45           ` Eric Dumazet
  2016-06-12 12:18         ` Sandy Harris
  1 sibling, 1 reply; 21+ messages in thread
From: Thomas Gleixner @ 2016-04-30 13:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Andrew Morton,
	Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk,
	Davidlohr Bueso, Chris Mason, Carlos O'Donell,
	Torvald Riegel, Eric Dumazet

On Thu, 28 Apr 2016, Linus Torvalds wrote:
> It's the hashes that _look_ like they might be good hashes, but
> there's not a lot of analysis behind it, that I would worry about. The
> simple prime modulus _should_ be fine, but at the same time I kind of
> suspect we can do better. Especially since it has two multiplications.
> 
> Looking around, there's
> 
>     http://burtleburtle.net/bob/hash/integer.html
> 
> and that 32-bit "full avalanche" hash in six shifts looks like it
> could be better. You wouldn't want to inline it, but the point of a
> full avalanche bit mixing _should_ be that you could avoid the whole
> "upper bits" part, and it should work independently of the target set
> size.

Yes. So I tested those two:

u32 hash_64(u64 key)
{
       key  = ~key + (key << 18);
       key ^= key >> 31;
       key += (key << 2) + (key << 4);
       key ^= key >> 11;
       key += key << 6;
       key ^= key >> 22;
       return (u32) key;
}

u32 hash_32(u32 key)
{
       key = (key + 0x7ed55d16) + (key << 12);
       key = (key ^ 0xc761c23c) ^ (key >> 19);
       key = (key + 0x165667b1) + (key <<  5);
       key = (key + 0xd3a2646c) ^ (key <<  9);
       key = (key + 0xfd7046c5) + (key <<  3);
       key = (key ^ 0xb55a4f09) ^ (key >> 16);
       return key;
}

They are really good and the results are similar to the simple modulo prime
hash. hash_64 is slightly faster than the modulo prime one as it does not have
the multiplication.
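
Just to illustrate the intended use (not part of the patch): since both
mix all input bits, a caller can simply mask the result down to a
power-of-two table size instead of taking high bits, e.g.:

static inline unsigned hash_64_bucket(u64 key, unsigned table_bits)
{
        return hash_64(key) & ((1U << table_bits) - 1);
}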

I'll send a patch to replace hash_64 and hash_32.

Text size:
	   x86_64	i386	arm
hash_64	   88		148	128
hash_32	   88		84	112

So probably slightly too large to inline.

Thanks,

	tglx
	

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-04-29 21:10     ` Linus Torvalds
  2016-04-29 23:51       ` Linus Torvalds
@ 2016-04-30 15:22       ` Thomas Gleixner
  1 sibling, 0 replies; 21+ messages in thread
From: Thomas Gleixner @ 2016-04-30 15:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Andrew Morton,
	Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk,
	Davidlohr Bueso, Chris Mason, Carlos O'Donell,
	Torvald Riegel, Eric Dumazet

On Fri, 29 Apr 2016, Linus Torvalds wrote:
> Picking a new value almost at random (I say "almost", because I just
> started with that 32-bit multiplicand value that mostly works and
> shifted it up by 32 bits and then randomly added a few more bits to
> avoid long ranges of ones and zeroes), I picked
> 
>   #define GOLDEN_RATIO_PRIME_64 0x9e3700310c100d01UL
> 
> and it is *much* better in my test harness.
> 
> Of course, things like that depend on what patterns you test, but I
> did have a "range of strides and hash sizes" I tried. So just for fun:
> try changing GOLDEN_RATIO_PRIME_64 to that value, and see if the
> absolutely _horrid_ page-aligned case goes away for you?

It solves that horrid case:

   https://tglx.de/~tglx/f-ops-h64-t.png

It's faster than the shift-based version, but the degradation with
hyperthreading is slightly worse.

Here, for comparison, is the 64bit -> 32bit shift version:

  https://tglx.de/~tglx/f-ops-wang32-t.png

  FYI, that works way better than the existing shift machinery in hash_64

and the modulo prime one:

  https://tglx.de/~tglx/f-ops-mod-t.png

> It really looks like those multiplication numbers were very very badly picked.

Indeed.
 
> Still, that number doesn't do very well if the hash is small (say, 8
> bits).

I'm still waiting for the other test to complete. Will send numbers later
today.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-04-30 13:02         ` Thomas Gleixner
@ 2016-04-30 16:45           ` Eric Dumazet
  2016-04-30 17:12             ` Linus Torvalds
  0 siblings, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2016-04-30 16:45 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linus Torvalds, LKML, Peter Zijlstra, Ingo Molnar, Andrew Morton,
	Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk,
	Davidlohr Bueso, Chris Mason, Carlos O'Donell,
	Torvald Riegel, Eric Dumazet

On Sat, 2016-04-30 at 15:02 +0200, Thomas Gleixner wrote:

> Yes. So I tested those two:
> 
> u32 hash_64(u64 key)
> {
>        key  = ~key + (key << 18);
>        key ^= key >> 31;
>        key += (key << 2) + (key << 4);
>        key ^= key >> 11;
>        key += key << 6;
>        key ^= key >> 22;
>        return (u32) key;
> }
> 
> u32 hash_32(u32 key)
> {
>        key = (key + 0x7ed55d16) + (key << 12);
>        key = (key ^ 0xc761c23c) ^ (key >> 19);
>        key = (key + 0x165667b1) + (key <<  5);
>        key = (key + 0xd3a2646c) ^ (key <<  9);
>        key = (key + 0xfd7046c5) + (key <<  3);
>        key = (key ^ 0xb55a4f09) ^ (key >> 16);
>        return key;
> }
> 
> They are really good and the results are similar to the simple modulo prime
> hash. hash_64 is slightly faster than the modulo prime one as it does not have
> the multiplication.
> 
> I'll send a patch to replace hash_64 and hash_32.
> 
> Text size:
> 	   x86_64	i386	arm
> hash_64	   88		148	128
> hash_32	   88		84	112
> 
> So probably slightly too large to inline.

I use hash_32() in net/sched/sch_fq.c, for all packets sent by Google
servers. (Note that I did _not_ use hash_ptr())

That's gazillions of packets per second, and the current multiply worked
just fine in terms of hash spreading.

Are you really going to use something which looks much slower?

u32 hash_32(u32 key)
{
        key = (key + 0x7ed55d16) + (key << 12);
        key = (key ^ 0xc761c23c) ^ (key >> 19);
        key = (key + 0x165667b1) + (key <<  5);
        key = (key + 0xd3a2646c) ^ (key <<  9);
        key = (key + 0xfd7046c5) + (key <<  3);
        key = (key ^ 0xb55a4f09) ^ (key >> 16);
        return key;
}

Probably having a simple multiply when ARCH_HAS_FAST_MULTIPLIER is
defined might be good enough, possibly by choosing a better
GOLDEN_RATIO_PRIME_32.
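
Roughly something like this, as a sketch only (the config symbol and
the fallback mixer name are illustrative, not an actual patch):

static inline u32 hash_32(u32 val, unsigned int bits)
{
#ifdef CONFIG_ARCH_HAS_FAST_MULTIPLIER
        /* keep the cheap multiply where hardware is good at it */
        return (val * GOLDEN_RATIO_PRIME_32) >> (32 - bits);
#else
        /* hypothetical name for the six-step mixer quoted above */
        return full_avalanche_32(val) >> (32 - bits);
#endif
}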

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-04-30 16:45           ` Eric Dumazet
@ 2016-04-30 17:12             ` Linus Torvalds
  2016-04-30 17:37               ` Eric Dumazet
  0 siblings, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2016-04-30 17:12 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Thomas Gleixner, LKML, Peter Zijlstra, Ingo Molnar,
	Andrew Morton, Sebastian Andrzej Siewior, Darren Hart,
	Michael Kerrisk, Davidlohr Bueso, Chris Mason,
	Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Sat, Apr 30, 2016 at 9:45 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> I use hash_32() in net/sched/sch_fq.c, for all packets sent by Google
> servers. (Note that I did _not_ use hash_ptr())
>
> That's gazillions of packets per second, and the current multiply worked
> just fine in terms of hash spreading.

So hash_32() really is much better than hash_64(). I think we'll tweak
it a bit, but largely leave it alone.

The 64-bit case needs to be tweaked a _lot_.

For the 32-bit case, I like the one that George Spelvin suggested:

   #define GOLDEN_RATIO_32 0x61c88647      /* phi^2 = 1-phi */

because of his slow multiplier fallback version that we could also use:

  /* Returns x * GOLDEN_RATIO_32 without a hardware multiplier */
  unsigned hash_32(unsigned x)
  {
          unsigned y, z;
                                /* Path length */
          y = (x << 19) + x;      /* 1 shift + 1 add */
          z = (x << 9) + y;       /* 1 shift + 2 add */
          x = (x << 23) + z;      /* 1 shift + 3 add */
          z = (z << 8) + y;       /* 2 shift + 3 add */
          x = (x << 6) - x;       /* 2 shift + 4 add */
          return (z << 3) + x;    /* 3 shift + 4 add */
  }

and I don't think that we really need the several big constants with
the fancy "full cascade" function.

If you have a test-case for that sch_fq.c case, it might be a good
idea to test the above GOLDEN_RATIO_32 value, but quite frankly, I
don't see any way it would be materially different from the one we use
now. It does avoid that long series of zeroes in the low bits, but
that's actually not a huge problem for the 32-bit hash to begin with.
It's not nearly as long a series (or in the wrong bit positions) as
the 64-bit hash multiplier value had.

Also, I suspect that since you hash the kernel "struct sock" pointers,
you actually never get the kinds of really bad patterns that Thomas
had.

But maybe you use hash_32() on a pointer because you noticed that
hash_long() or hash_ptr() (which use hash_64 on 64-bit architectures,
and would have been more natural) gave worse performance?

Maybe you thought that it was the bigger multiply that caused the
performance problems? If you did performance work, I suspect it really
could have been that hash_64() did a bad job for you.

                 Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-04-30 17:12             ` Linus Torvalds
@ 2016-04-30 17:37               ` Eric Dumazet
  0 siblings, 0 replies; 21+ messages in thread
From: Eric Dumazet @ 2016-04-30 17:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, LKML, Peter Zijlstra, Ingo Molnar,
	Andrew Morton, Sebastian Andrzej Siewior, Darren Hart,
	Michael Kerrisk, Davidlohr Bueso, Chris Mason,
	Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Sat, 2016-04-30 at 10:12 -0700, Linus Torvalds wrote:
> On Sat, Apr 30, 2016 at 9:45 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > I use hash_32() in net/sched/sch_fq.c, for all packets sent by Google
> > servers. (Note that I did _not_ use hash_ptr())
> >
> > That's gazillions of packets per second, and the current multiply worked
> > just fine in terms of hash spreading.
> 
> So hash_32() really is much better than hash_64(). I think we'll tweak
> it a bit, but largely leave it alone.
> 
> The 64-bit case needs to be tweaked a _lot_.

Agreed ;)

> 
> For the 32-bit case, I like the one that George Spelvin suggested:
> 
>    #define GOLDEN_RATIO_32 0x61c88647      /* phi^2 = 1-phi */
> 
> because of his slow multiplier fallback version that we could also use:
> 
>   /* Returns x * GOLDEN_RATIO_32 without a hardware multiplier */
>   unsigned hash_32(unsigned x)
>   {
>           unsigned y, z;
>                                 /* Path length */
>           y = (x << 19) + x;      /* 1 shift + 1 add */
>           z = (x << 9) + y;       /* 1 shift + 2 add */
>           x = (x << 23) + z;      /* 1 shift + 3 add */
>           z = (z << 8) + y;       /* 2 shift + 3 add */
>           x = (x << 6) - x;       /* 2 shift + 4 add */
>           return (z << 3) + x;    /* 3 shift + 4 add */
>   }
> 
> and I don't think that we really need the several big constants with
> the fancy "full cascade" function.
> 
> If you have a test-case for that sch_fq.c case, it might be a good
> idea to test the above GOLDEN_RATIO_32 value, but quite frankly, I
> don't see any way it would be materially different from the one we use
> now. It does avoid that long series of zeroes in the low bits, but
> that's actually not a huge problem for the 32-bit hash to begin with.
> It's not nearly as long a series (or in the wrong bit positions) as
> the 64-bit hash multiplier value had.
> 
> Also, I suspect that since you hash the kernel "struct sock" pointers,
> you actually never get the kinds of really bad patterns that Thomas
> had.
> 
> But maybe you use hash_32() on a pointer because you noticed that
> hash_long() or hash_ptr() (which use hash_64 on 64-bit architectures,
> and would have been more natural) gave worse performance?

Not at all.

At the time I did sch_fq (for linux-3.12), hash_64() was not using a
multiply yet, but that long series of shifts/adds/subs.

I used hash_32() because it was simply faster on my servers.

You added this multiply in linux-3.17, and I did not notice it at the
time.


> 
> Maybe you thought that it was the bigger multiply that caused the
> performance problems? If you did performance work, I suspect it really
> could have been that hash_64() did a bad job for you.

Really not ;)

I could test hash_64(); more entropy cannot be bad.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-04-29 23:51       ` Linus Torvalds
  2016-04-30  1:34         ` Rik van Riel
@ 2016-05-02  9:39         ` Torvald Riegel
  1 sibling, 0 replies; 21+ messages in thread
From: Torvald Riegel @ 2016-05-02  9:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, Rik van Riel, LKML, Peter Zijlstra, Ingo Molnar,
	Andrew Morton, Sebastian Andrzej Siewior, Darren Hart,
	Michael Kerrisk, Davidlohr Bueso, Chris Mason,
	Carlos O'Donell, Eric Dumazet

On Fri, 2016-04-29 at 16:51 -0700, Linus Torvalds wrote:
> On Fri, Apr 29, 2016 at 2:10 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > On Thu, Apr 28, 2016 at 11:32 AM, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> >>
> >> hash_long/ptr really shouldn't care about the low bits being zero at
> >> all, because it should mix in all the bits (using a prime multiplier
> >> and taking the high bits).
> >
> > Looking at this assertion, it doesn't actually pan out very well at all.
> >
> > The 32-bit hash looks to be ok. Certainly not perfect, but not horrible.
> >
> > The 64-bit hash seems to be quite horribly bad with lots of values.
> 
> Ok, I have tried to research this a bit more. There's a lot of
> confusion here, which is why hash_64() sucks donkey balls.
> 
> The basic method of hashing by multiplication is fairly sane. It's
> well-documented, and attributed to Knuth, as the comment above it
> says.

Section 2.2 of this article might also be of interest:

M. Dietzfelbinger, T. Hagerup, J. Katajainen, and M. Penttonen. A
Reliable Randomized Algorithm for the Closest-Pair Problem. Journal of
Algorithms, 25(1):19-51, 1997.

I don't know much about hash functions, but my understanding of this is
that one can do good hashing of words by multiplying with a random
number and using the most-significant bits of the low word of the
result.  Different random numbers may work better than others for the
input data (and some might be really awful), but the paper seems to
claim that one *can* find a good random number for all input data.  In
practice, this might mean that re-hashing with a different random number
after seeing too many collisions in a hash table should eventually lead
to good hashing (i.e., because the *class* of such hash functions is
universal).
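
A rough sketch of that multiply-shift scheme (names and the seeding are
made up for illustration; any decent random source would do):

#include <stdint.h>
#include <stdlib.h>

static uint64_t hash_mult;      /* the random multiplier "a", kept odd */

static void hash_reseed(void)
{
        hash_mult = (((uint64_t)rand() << 32) ^ rand()) | 1;
}

/* h_a(x) = (a * x mod 2^64) >> (64 - bits) */
static unsigned hash_64_rand(uint64_t key, unsigned bits)
{
        return (unsigned)((key * hash_mult) >> (64 - bits));
}

If a table then sees too many collisions, calling hash_reseed() and
rebuilding it with the new multiplier is the "pick another member of
the class" step.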

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-04-29  2:25       ` Linus Torvalds
  2016-04-30 13:02         ` Thomas Gleixner
@ 2016-06-12 12:18         ` Sandy Harris
  1 sibling, 0 replies; 21+ messages in thread
From: Sandy Harris @ 2016-06-12 12:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, LKML, Peter Zijlstra, Ingo Molnar,
	Andrew Morton, Sebastian Andrzej Siewior, Darren Hart,
	Michael Kerrisk, Davidlohr Bueso, Chris Mason,
	Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Thu, Apr 28, 2016 at 10:25 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:

> It's the hashes that _look_ like they might be good hashes, but
> there's not a lot of analysis behind it, that I would worry about. The
> simple prime modulus _should_ be fine, but at the same time I kind of
> suspect we can do better. Especially since it has two multiplications.
>
> Looking around, there's
>
>     http://burtleburtle.net/bob/hash/integer.html
>
> and that 32-bit "full avalanche" hash in six shifts looks like it
> could be better. You wouldn't want to inline it, but the point of a
> full avalanche bit mixing _should_ be that you could avoid the whole
> "upper bits" part, and it should work independently of the target set
> size.
>
> So if that hash works better, it would be a pretty good replacement
> option for hash_int().
>
> There is also
>
>     https://gist.github.com/badboy/6267743
>
> that has a 64 bit to 32 bit hash function that might be useful for
> "hash_long()".
>
> Most of the people who worry about hashes tend to have strings to
> hash, not just a single word like a pointer, but there's clearly
> people around who have tried to search for good hashes that really
> spread out the bits.
>
>                 Linus

Here's another possibility, from my GPL code at:
https://github.com/sandy-harris/maxwell

Not very efficient -- two each of 32-bit multiply & modulo
in most cases -- but provably good mixing.

/*
    Quasi-Hadamard transform
    My own invention

    Goal is to mix a 32-bit object
    so that each output bit depends
    on every input bit

    Underlying primitive is IDEA
    multiplication which mixes
    a pair of 16-bit objects

    This is analogous to the
    pseudo-Hadamard transform
    (PHT) originally from the
    SAFER cipher, later in
    Twofish and others

    Conceptually, a two-way PHT
    on a,b is:

    x = a + b
    y = a + 2b
    a = x
    b = y

    This is reversible; it loses
    no information. Each output
    word depends on both inputs.

    A PHT can be implemented as

    a += b
    b += a

    which is faster and avoids
    using intermediate variables

    QHT is the same thing using
    IDEA multiplication instead
    of addition, calculating a*b
    and a*b^2 instead of a+b and
    a+2b

    IDEA multiplication operates
    on 16-bit chunks and makes
    every output bit depend on
    all input bits. Therefore
    QHT is close to an ideal
    mixer for 32-bit words.
*/

u32 qht(u32 x)
{
    u32 a, b ;
    a = x >> 16 ;        // high 16 bits
    b = x & 0xffff ;    // low 16
    a = idea(a,b) ;        // a *= b
    b = idea(a,b) ;        // b *= a
    return( (a<<16) | b) ;
}

/*
    IDEA multiplication
    borrowed from the IDEA cipher
*/
#define    MAX (1<<16)
#define MOD (MAX+1)

u32 idea(u32 a, u32 b)
{
    u32 x ;
    // make sure they are in range
    a %= MOD ;
    b %= MOD ;
    // special cases
    if( (a == 0) && (b == 0))
        return(1) ;
    else if( a == 0)
        return(MOD - b) ;
    else if( b == 0)
        return(MOD - a) ;
    // non-special
    else    {
        x = (a*b) % MOD ;
        if(x == MAX)
            return(0) ;
        else    return(x) ;
    }
}
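
Hypothetical usage, just to show how it might slot in (the pointer fold
and the power-of-two table mask are assumptions, not part of maxwell):

/* fold a pointer to 32 bits, mix with qht(), mask to the table size */
static inline unsigned qht_bucket(const void *ptr, unsigned table_bits)
{
    u64 v = (u64)(unsigned long)ptr;
    u32 folded = (u32)v ^ (u32)(v >> 32);

    return qht(folded) & ((1u << table_bits) - 1);
}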

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2016-06-12 12:18 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-04-28 16:42 [patch 0/7] futex: Add support for process private hashing Thomas Gleixner
2016-04-28 16:42 ` [patch 1/7] futex: Add some more function commentry Thomas Gleixner
2016-04-28 16:42 ` [patch 2/7] lib/hashmod: Add modulo based hash mechanism Thomas Gleixner
2016-04-28 18:32   ` Linus Torvalds
2016-04-28 23:26     ` Thomas Gleixner
2016-04-29  2:25       ` Linus Torvalds
2016-04-30 13:02         ` Thomas Gleixner
2016-04-30 16:45           ` Eric Dumazet
2016-04-30 17:12             ` Linus Torvalds
2016-04-30 17:37               ` Eric Dumazet
2016-06-12 12:18         ` Sandy Harris
2016-04-29 21:10     ` Linus Torvalds
2016-04-29 23:51       ` Linus Torvalds
2016-04-30  1:34         ` Rik van Riel
2016-05-02  9:39         ` Torvald Riegel
2016-04-30 15:22       ` Thomas Gleixner
2016-04-28 16:42 ` [patch 3/7] futex: Hash private futexes per process Thomas Gleixner
2016-04-28 16:42 ` [patch 4/7] futex: Add op for hash preallocation Thomas Gleixner
2016-04-28 16:42 ` [patch 5/7] futex: Add sysctl knobs for process private hash Thomas Gleixner
2016-04-28 16:42 ` [patch 6/7] perf/bench/futex-hash: Support NUMA Thomas Gleixner
2016-04-28 16:42 ` [patch 7/7] perf/bench/futex-hash: Support preallocate hash table Thomas Gleixner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).