* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
       [not found] <CA+55aFxBWfAHQNAdBbdVr+z8ror4GVteyce3D3=vwDWxhu5KqQ@mail.gmail.com>
@ 2016-04-30 20:52 ` George Spelvin
  2016-05-01  8:35   ` Thomas Gleixner
  0 siblings, 1 reply; 43+ messages in thread
From: George Spelvin @ 2016-04-30 20:52 UTC (permalink / raw)
To: tglx; +Cc: eric.dumazet, linux, linux-kernel, riel, torvalds

Thomas Gleixner wrote:
> I'll send a patch to replace hash_64 and hash_32.

Before you do that, could we look for a way to tweak the constants in
the existing hash?  It seems the basic "take the high bits of x * K"
algorithm is actually a decent hash function if K is chosen properly,
and has a significant speed advantage on machines with half-decent
multipliers.

I'm researching how to do the multiply with fewer shifts and adds on
machines that need it.  (Or we could use a totally different function
in that case.)

You say that
> hash64 is slightly faster as the modulo prime as it does not have the
> multiplication.

Um... are you sure you benchmarked that right?  The hash_64 code you
used (Thomas Wang's 64->32-bit hash) has a critical path consisting of 6
shifts and 7 adds.  I can't believe that's faster than a single multiply.

For 1,000,000 iterations on an Ivy Bridge, the multiply is 4x faster
(5x if out of line) for me!

The constants I recommend are

#define GOLDEN_RATIO_64 0x61C8864680B583EBull
#define GOLDEN_RATIO_32 0x61C88647

rdtsc times for 1,000,000 iterations of each of the two.  (The sum of
all hashes is printed to prevent dead code elimination.)
 hash_64 (sum)      * PHI (sum)       hash_64 (sum)      * PHI (sum)
 17552154 a52752df  3431821 2ce5398c  17485381 a52752df  3375535 2ce5398c
 17522273 a52752df  3487206 2ce5398c  17551217 a52752df  3374221 2ce5398c
 17546242 a52752df  3377355 2ce5398c  17494306 a52752df  3374202 2ce5398c
 17495702 a52752df  3409768 2ce5398c  17505839 a52752df  3398205 2ce5398c
 17501114 a52752df  3375435 2ce5398c  17539388 a52752df  3374202 2ce5398c

And with hash_64 forced inline:

 13596945 a52752df  3374931 2ce5398c  13585916 a52752df  3411107 2ce5398c
 13564616 a52752df  3374928 2ce5398c  13573465 a52752df  3425160 2ce5398c
 13569712 a52752df  3374915 2ce5398c  13580461 a52752df  3397773 2ce5398c
 13577481 a52752df  3374912 2ce5398c  13558708 a52752df  3417456 2ce5398c
 13569044 a52752df  3374909 2ce5398c  13557193 a52752df  3407912 2ce5398c

That's 3.5 cycles vs. 13.5.  (I actually have two copies of the inlined
code, to show code alignment issues.)

On a Phenom, it's worse, 4 cycles vs. 35.

 35083119 a52752df  4020754 2ce5398c  35068116 a52752df  4015659 2ce5398c
 35074377 a52752df  4000819 2ce5398c  35068735 a52752df  4016943 2ce5398c
 35067596 a52752df  4025397 2ce5398c  35074365 a52752df  4000108 2ce5398c
 35071050 a52752df  4016190 2ce5398c  35058775 a52752df  4017988 2ce5398c
 35055091 a52752df  4000066 2ce5398c  35201158 a52752df  4000094 2ce5398c

My simple test code appended for anyone who cares...

#include <stdint.h>
#include <stdio.h>

/* Phi = 0x0.9E3779B97F4A7C15F... */
/* -Phi = 0x0.61C8864680B583EA1...
 */
#define K 0x61C8864680B583EBull

static inline uint32_t hash1(uint64_t key)
{
	key = ~key + (key << 18);
	key ^= key >> 31;
	key += (key << 2) + (key << 4);
	key ^= key >> 11;
	key += key << 6;
	key ^= key >> 22;
	return (uint32_t)key;
}

static inline uint32_t hash2(uint64_t key)
{
	return (uint32_t)(key * K >> 32);
}

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;

	asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return (uint64_t)hi << 32 | lo;
}

int main(void)
{
	int i, j;
	uint32_t sum, sums[20];
	uint64_t start, times[20];

	for (i = 0; i < 20; i += 4) {
		sum = 0;
		start = rdtsc();
		for (j = 0; j < 1000000; j++)
			sum += hash1(j+0xdeadbeef);
		times[i] = rdtsc() - start;
		sums[i] = sum;

		sum = 0;
		start = rdtsc();
		for (j = 0; j < 1000000; j++)
			sum += hash2(j+0xdeadbeef);
		times[i+1] = rdtsc() - start;
		sums[i+1] = sum;

		sum = 0;
		start = rdtsc();
		for (j = 0; j < 1000000; j++)
			sum += hash1(j+0xdeadbeef);
		times[i+2] = rdtsc() - start;
		sums[i+2] = sum;

		sum = 0;
		start = rdtsc();
		for (j = 0; j < 1000000; j++)
			sum += hash2(j+0xdeadbeef);
		times[i+3] = rdtsc() - start;
		sums[i+3] = sum;
	}
	for (i = 0; i < 20; i++)
		printf(" %llu %08x%c", times[i], sums[i], (~i & 3) ? ' ' : '\n');
	return 0;
}

^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-04-30 20:52 ` [patch 2/7] lib/hashmod: Add modulo based hash mechanism George Spelvin
@ 2016-05-01  8:35   ` Thomas Gleixner
  2016-05-01  9:43     ` George Spelvin
  0 siblings, 1 reply; 43+ messages in thread
From: Thomas Gleixner @ 2016-05-01 8:35 UTC (permalink / raw)
To: George Spelvin; +Cc: eric.dumazet, linux-kernel, riel, torvalds

On Sat, 30 Apr 2016, George Spelvin wrote:
> Thomas Gleixner wrote:
> You say that
> > hash64 is slightly faster as the modulo prime as it does not have the
> > multiplication.
>
> Um... are you sure you benchmarked that right?  The hash_64 code you
> used (Thomas Wang's 64->32-bit hash) has a critical path consisting of 6
> shifts and 7 adds.  I can't believe that's faster than a single multiply.

Sorry, I did not express myself clearly enough.  hash64 (the single
multiply with the adjusted golden ratio) is slightly faster than the
modulo one, which has two multiplications.  So here is the list:

  hash_64(): (key * GOLDEN_RATIO) >> (64 - bits)   31 Mio ops/sec
  modulo:                                          28 Mio ops/sec
  Thomas Wang's 64 -> 32 bit                       21 Mio ops/sec

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-05-01  8:35 ` Thomas Gleixner
@ 2016-05-01  9:43   ` George Spelvin
  2016-05-01 16:51     ` Linus Torvalds
  2016-05-02  7:11     ` Thomas Gleixner
  0 siblings, 2 replies; 43+ messages in thread
From: George Spelvin @ 2016-05-01 9:43 UTC (permalink / raw)
To: linux, tglx; +Cc: eric.dumazet, linux-kernel, riel, torvalds

> Sorry, I did not express myself clearly enough.
> hash64 (the single multiply with the adjusted golden ratio) is
> slightly faster than the modulo one, which has two multiplications.

Yes, I figured out that was probably what you were talking about,
and benchmarked it similarly.  But I noticed a much greater difference:

 Wang      * PHI     % 4093    Wang      * PHI     % 4093
 13599486  3494816   5238442   13584166  3460266   5239463
 13589552  3466764   5237327   13582381  3422304   5276253
 13569409  3407317   5236239   13566738  3393215   5267909
 13575257  3413736   5237708   13596428  3379811   5280118
 13583607  3540416   5325609   13650964  3380301   5265210

At 3.7 GHz, that's

* PHI:    1059 M ops/second
* Modulo:  706 M ops/second
* Wang:    271 M ops/second

Of course, that's a tight hashing loop; I presume your test case has
more overhead.

Anyway, my recommendation (I'll write the patches if you like) is:

* Modify the multiplicative constants to be
  #define GOLDEN_RATIO_32 0x61C88647
  #define GOLDEN_RATIO_64 0x61C8864680B583EB

* Change the prototype of hash_64 to return u32.

* Create a separate 32-bit implementation of hash_64() for the
  BITS_PER_LONG < 64 case.  This should not be Wang's or anything
  similar, because 64-bit shifts are slow and awkward on 32-bit
  machines.  One option is something like __jhash_final(), but I
  think it will suffice to do:

	static __always_inline u32 hash_64(u64 val, unsigned int bits)
	{
		u32 hash = (u32)(val >> 32) * GOLDEN_RATIO_32 + (u32)val;
		hash *= GOLDEN_RATIO_32;
		return hash >> (32 - bits);
	}

  (S-o-b on that if you want it, of course.)
  (If you want it out of line, make a helper function that computes the
  32-bit hash and have an inline wrapper that does the shifting.)

* Eliminate the !CONFIG_ARCH_HAS_FAST_MULTIPLIER code path.  Once we've
  got rid of the 32-bit processors, we're left with 64-bit processors,
  and they *all* have hardware multipliers.  Even the MIPS R4000 (first
  64-bit processor ever) and Alpha 21064 had one.

  (Admittedly, the R4000 was 20 cycles, which is slower than Wang's hash,
  but 18 of them could overlap other instructions.)

  The worst multiply support is SPARCv9, which has 64x64-bit
  multiply with a 64-bit result, but no widening multiply.

* If you feel ambitious, add a 32-bit CONFIG_ARCH_HAS_SLOW_MULTIPLIER
  exception path.

^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-05-01  9:43 ` George Spelvin
@ 2016-05-01 16:51   ` Linus Torvalds
  2016-05-14  3:54     ` George Spelvin
  2016-05-02  7:11   ` Thomas Gleixner
  1 sibling, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2016-05-01 16:51 UTC (permalink / raw)
To: George Spelvin, David Woodhouse, Chris Mason
Cc: Thomas Gleixner, Eric Dumazet, Linux Kernel Mailing List, Rik van Riel

On Sun, May 1, 2016 at 2:43 AM, George Spelvin <linux@horizon.com> wrote:
>
> Anyway, my recommendation (I'll write the patches if you like) is:
>
> * Modify the multiplicative constants to be
>   #define GOLDEN_RATIO_32 0x61C88647
>   #define GOLDEN_RATIO_64 0x61C8864680B583EB

Interestingly, looking at where hash_64() is used, I notice the btrfs
raid56 code.  And the comment there in rbio_bucket():

	 * we shift down quite a bit.  We're using byte
	 * addressing, and most of the lower bits are zeros.
	 * This tends to upset hash_64, and it consistently
	 * returns just one or two different values.
	 *
	 * shifting off the lower bits fixes things.

so people had actually noticed the breakage before.

Adding DavidW and Chris Mason explicitly to the cc, to perhaps have
them verify that fixing the hash_64 constant would make that btrfs
hack be unnecessary too..

Chris?  Do you have a test-case?  Do things end up being ok if you
change that GOLDEN_RATIO_64 value and remove the ">>16"?

> * Change the prototype of hash_64 to return u32.

Fair enough.  That simplifies things for 32-bit.  Not that there are a
_lot_ of 32-bit cases, and most of the callers seem to just assign the
value directly to an int or just use it as an index, so it's probably
not really ever going to change anything, but it might make it easier
to write a simpler 32-bit code that uses two explicitly 32-bit fields
and mixes them.

> * Create a separate 32-bit implementation of hash_64() for the
>   BITS_PER_LONG < 64 case.
> This should not be Wang's or anything
> similar because 64-bit shifts are slow and awkward on 32-bit
> machines.  One option is something like __jhash_final(), but I think
> it will suffice to do:
>
> static __always_inline u32 hash_64(u64 val, unsigned int bits)
> {
>	u32 hash = (u32)(val >> 32) * GOLDEN_RATIO_32 + (u32)val;
>	hash *= GOLDEN_RATIO_32;
>	return hash >> (32 - bits);
> }
>
> (S-o-b on that if you want it, of course.)

Agreed.

However, we need to make very certain that nobody wants a value bigger
than 32 bits.  It all looks fine: the name hash folding does want
exactly 32 bits, but that code is only enabled for 64-bit anyway, and
it would be ok.

But it might be worth double-checking that nobody uses hash_64() for
anything else odd.  My quick grep didn't show anything, but it was quick.

> * Eliminate the !CONFIG_ARCH_HAS_FAST_MULTIPLIER code path.  Once we've
>   got rid of the 32-bit processors, we're left with 64-bit processors,
>   and they *all* have hardware multipliers.  Even the MIPS R4000 (first
>   64-bit processor ever) and Alpha 21064 had one.
>
>   (Admittedly, the R4000 was 20 cycles, which is slower than Wang's hash,
>   but 18 of them could overlap other instructions.)
>
>   The worst multiply support is SPARCv9, which has 64x64-bit
>   multiply with a 64-bit result, but no widening multiply.

Ack.

> * If you feel ambitious, add a 32-bit CONFIG_ARCH_HAS_SLOW_MULTIPLIER
>   exception path.

Let's make that a separate worry, and just fix hash_64() first.

In particular, that means "let's not touch GOLDEN_RATIO_32 yet".  I
suspect that when we *do* change that value, we do want the
non-multiplying version you had.

If you wrote out the patch for the hash_64 case as you outlined above,
I will definitely apply it.  The current hash_64 is clearly broken, and
we now have two real life examples (i.e. the futex problems and the
btrfs code both).

	Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-05-01 16:51 ` Linus Torvalds
@ 2016-05-14  3:54   ` George Spelvin
  2016-05-14 18:35     ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: George Spelvin @ 2016-05-14 3:54 UTC (permalink / raw)
To: torvalds; +Cc: eric.dumazet, linux, linux-kernel, riel, tglx

On May 1, 2016 at 9:51 AM, Linus Torvalds wrote:
> On Sun, May 1, 2016 at 2:43 AM, George Spelvin <linux@horizon.com> wrote:
>> * If you feel ambitious, add a 32-bit CONFIG_ARCH_HAS_SLOW_MULTIPLIER
>>   exception path.

> Let's make that a separate worry, and just fix hash_64() first.
>
> In particular, that means "let's not touch GOLDEN_RATIO_32 yet".  I
> suspect that when we *do* change that value, we do want the
> non-multiplying version you had.

I've been working on this, and just a brief status update: it's
definitely one of those rabbit holes.

There are exactly three architectures which (some models) don't have
an efficient 32x32->32-bit multiply:

- arch/m68k: MC68000 (and 68010 and 68328) no-mmu
- arch/h8300: Most (all?) of the H8 processor series
- arch/microblaze: Depending on Verilog compilation options

The thing is, they all don't have a barrel shifter, either.  Indeed,
only the m68k even has multi-bit shift instructions.  So the upshot is
that it's not clear that shift-and-add is a whole lot better.  Working
out the timing on the 68000, I can beat the multiply code, but not
by much.

So I'm working on arch-specific solutions for those three cases.  H8
and 68000 have 16x16->32-bit multiplies, which can be used to make a
reasonable hash function (some H8 models can multiply faster than they
can shift!), but if you configure a Microblaze with neither multiplier
nor barrel shifter (which arch/microblaze/Kconfig.platform lets you do),
I have no idea what to do.

^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-05-14  3:54 ` George Spelvin
@ 2016-05-14 18:35   ` Linus Torvalds
  0 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2016-05-14 18:35 UTC (permalink / raw)
To: George Spelvin
Cc: Eric Dumazet, Linux Kernel Mailing List, Rik van Riel, Thomas Gleixner

On Fri, May 13, 2016 at 8:54 PM, George Spelvin <linux@horizon.com> wrote:
>
> There are exactly three architectures which (some models) don't have
> an efficient 32x32->32-bit multiply:
>
> - arch/m68k: MC68000 (and 68010 and 68328) no-mmu
> - arch/h8300: Most (all?) of the H8 processor series
> - arch/microblaze: Depending on Verilog compilation options

I wouldn't worry about it too much.  The architectures where performance
really matters are x86, ARM and powerpc.  The rest need to *work* and
not suck horribly, but we're not going to try to do cycle counting for
them.  It's not worth the pain.

If an architecture doesn't have a barrel shifter, it's not going to have
fast hash functions.  So I'd be ok with just saying "32-bit architectures
are going to use a multiply with non-sparse bits".  Not a problem.

We do want to make sure that hash_64 isn't totally disgusting on 32-bit
architectures (but that's a fairly rare case), so we probably do want to
have that function fall back on something else than a 64x64 multiply on
a 32-bit architecture.

Presumably just "mix the two 32-bit words into one, then use hash_32()
on that" is good enough.

	Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-05-01  9:43 ` George Spelvin
  2016-05-01 16:51   ` Linus Torvalds
@ 2016-05-02  7:11   ` Thomas Gleixner
  2016-05-02 10:20     ` [PATCH 1/2] <linux/hash.h>: Make hash_64(), hash_ptr() return 32 bits George Spelvin
  1 sibling, 1 reply; 43+ messages in thread
From: Thomas Gleixner @ 2016-05-02 7:11 UTC (permalink / raw)
To: George Spelvin; +Cc: eric.dumazet, linux-kernel, riel, torvalds

On Sun, 1 May 2016, George Spelvin wrote:
> But I noticed a much greater difference:
>
>  Wang      * PHI     % 4093    Wang      * PHI     % 4093
>  13599486  3494816   5238442   13584166  3460266   5239463
>  13589552  3466764   5237327   13582381  3422304   5276253
>  13569409  3407317   5236239   13566738  3393215   5267909
>  13575257  3413736   5237708   13596428  3379811   5280118
>  13583607  3540416   5325609   13650964  3380301   5265210
>
> At 3.7 GHz, that's
>
> * PHI:    1059 M ops/second
> * Modulo:  706 M ops/second
> * Wang:    271 M ops/second
>
> Of course, that's a tight hashing loop; I presume your test case
> has more overhead.

Indeed.

> Anyway, my recommendation (I'll write the patches if you like) is:
>
> * Modify the multiplicative constants to be
>   #define GOLDEN_RATIO_32 0x61C88647
>   #define GOLDEN_RATIO_64 0x61C8864680B583EB

Works for me.  I ran them through my test case and they behaved
reasonably well.

> * Change the prototype of hash_64 to return u32.

Makes sense.

> * Create a separate 32-bit implementation of hash_64() for the
>   BITS_PER_LONG < 64 case.  This should not be Wang's or anything
>   similar, because 64-bit shifts are slow and awkward on 32-bit
>   machines.  One option is something like __jhash_final(), but I
>   think it will suffice to do:
>
>	static __always_inline u32 hash_64(u64 val, unsigned int bits)
>	{
>		u32 hash = (u32)(val >> 32) * GOLDEN_RATIO_32 + (u32)val;
>		hash *= GOLDEN_RATIO_32;
>		return hash >> (32 - bits);
>	}

Works.  That's more or less the same overhead as the modulo one, which
behaved well on 32bit.
Thanks, tglx ^ permalink raw reply [flat|nested] 43+ messages in thread
* [PATCH 1/2] <linux/hash.h>: Make hash_64(), hash_ptr() return 32 bits 2016-05-02 7:11 ` Thomas Gleixner @ 2016-05-02 10:20 ` George Spelvin 2016-05-02 10:22 ` [PATCH 2/2] <linux/hash.h>: Fix hash_64()'s horrible collision problem George Spelvin ` (4 more replies) 0 siblings, 5 replies; 43+ messages in thread From: George Spelvin @ 2016-05-02 10:20 UTC (permalink / raw) To: linux-kernel, tglx, torvalds Cc: bfields, eric.dumazet, jlayton, linux, linux-nfs, riel This also affects hash_str() and hash_mem() in <linux/sunrpc/svcauth.h>. After a careful scan through the kernel code, no caller asks any of those four for more than 32 bits of hash result, except that the latter two need 64 bits from hash_long() if BITS_PER_LONG == 64. This is in preparation for the following patch, which will create a new implementation of hash_64 for the BITS_PER_LONG == 32 case which is optimized for 32-bit machines. Signed-off-by: George Spelvin <linux@horizon.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Jeff Layton <jlayton@poochiereds.net> Cc: linux-nfs@vger.kernel.org --- Cc: to NFS folks because it touches the sunrpc directory. Is that "TODO" comment too presumptuous of me? 
include/linux/hash.h | 22 ++++++++++++++++------ include/linux/sunrpc/svcauth.h | 15 +++++++-------- 2 files changed, 23 insertions(+), 14 deletions(-) diff --git a/include/linux/hash.h b/include/linux/hash.h index 1afde47e..05003fdc 100644 --- a/include/linux/hash.h +++ b/include/linux/hash.h @@ -24,15 +24,17 @@ #if BITS_PER_LONG == 32 #define GOLDEN_RATIO_PRIME GOLDEN_RATIO_PRIME_32 +#define __hash_long(val) __hash_32(val) #define hash_long(val, bits) hash_32(val, bits) #elif BITS_PER_LONG == 64 +#define __hash_long(val) __hash_64(val) #define hash_long(val, bits) hash_64(val, bits) #define GOLDEN_RATIO_PRIME GOLDEN_RATIO_PRIME_64 #else #error Wordsize not 32 or 64 #endif -static __always_inline u64 hash_64(u64 val, unsigned int bits) +static __always_inline u64 __hash_64(u64 val) { u64 hash = val; @@ -55,20 +57,28 @@ static __always_inline u64 hash_64(u64 val, unsigned int bits) hash += n; #endif + return hash; +} + +static __always_inline u64 hash_64(u64 val, unsigned bits) +{ /* High bits are more random, so use them. */ - return hash >> (64 - bits); + return __hash_64(val) >> (64 - bits); } -static inline u32 hash_32(u32 val, unsigned int bits) +static inline u32 __hash_32(u32 val) { /* On some cpus multiply is faster, on others gcc will do shifts */ - u32 hash = val * GOLDEN_RATIO_PRIME_32; + return val * GOLDEN_RATIO_PRIME_32; +} +static inline u32 hash_32(u32 val, unsigned bits) +{ /* High bits are more random, so use them. 
*/ - return hash >> (32 - bits); + return __hash_32(val) >> (32 - bits); } -static inline unsigned long hash_ptr(const void *ptr, unsigned int bits) +static inline u32 hash_ptr(const void *ptr, unsigned bits) { return hash_long((unsigned long)ptr, bits); } diff --git a/include/linux/sunrpc/svcauth.h b/include/linux/sunrpc/svcauth.h index c00f53a4..eb1241b3 100644 --- a/include/linux/sunrpc/svcauth.h +++ b/include/linux/sunrpc/svcauth.h @@ -165,7 +165,8 @@ extern int svcauth_unix_set_client(struct svc_rqst *rqstp); extern int unix_gid_cache_create(struct net *net); extern void unix_gid_cache_destroy(struct net *net); -static inline unsigned long hash_str(char *name, int bits) +/* TODO: Update to <asm/word-at-a-time.h> when CONFIG_DCACHE_WORD_ACCESS */ +static inline u32 hash_str(const char *name, int bits) { unsigned long hash = 0; unsigned long l = 0; @@ -176,14 +177,13 @@ static inline unsigned long hash_str(char *name, int bits) c = (char)len; len = -1; } l = (l << 8) | c; - len++; - if ((len & (BITS_PER_LONG/8-1))==0) - hash = hash_long(hash^l, BITS_PER_LONG); + if (++len % sizeof(hash) == 0) + hash = __hash_long(hash^l); } while (len); return hash >> (BITS_PER_LONG - bits); } -static inline unsigned long hash_mem(char *buf, int length, int bits) +static inline u32 hash_mem(const char *buf, int length, int bits) { unsigned long hash = 0; unsigned long l = 0; @@ -195,9 +195,8 @@ static inline unsigned long hash_mem(char *buf, int length, int bits) } else c = *buf++; l = (l << 8) | c; - len++; - if ((len & (BITS_PER_LONG/8-1))==0) - hash = hash_long(hash^l, BITS_PER_LONG); + if (++len % sizeof(hash) == 0) + hash = __hash_long(hash^l); } while (len); return hash >> (BITS_PER_LONG - bits); } -- 2.8.1 ^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH 2/2] <linux/hash.h>: Fix hash_64()'s horrible collision problem 2016-05-02 10:20 ` [PATCH 1/2] <linux/hash.h>: Make hash_64(), hash_ptr() return 32 bits George Spelvin @ 2016-05-02 10:22 ` George Spelvin 2016-05-02 20:08 ` Linus Torvalds 0 siblings, 1 reply; 43+ messages in thread From: George Spelvin @ 2016-05-02 10:22 UTC (permalink / raw) To: linux-kernel, tglx, torvalds Cc: bfields, eric.dumazet, jlayton, linux-nfs, linux, riel hash_64() was using a low-bit-weight multiplier, which resulted in very bad mixing of the high bits of the input. In particular, page-aligned pointers (low 12 bits not used) were a disaster. Since all 64-bit processors (I checked back to the MIPS R4000 and Alpha 21064) have hardware multipliers and don't benefit from this "optimization", use the proper golden ratio value. Avoid performance problems on 32-bit machines by providing a totally separate implementation for them based on 32-bit arithmetic. Keep the bad multiplier for hash_32() for now, at Linus' request. "Sparse in 32 bits" is not quite as bad as "sparse in 64 bits". Explicitly document that the algorithm is not stable. I've tried to check all callers for inadvertent dependence on the exact numerical value, but some of them are so confusing (*cough* Lustre *cough*) that I can't tell for sure. Reported-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: George Spelvin <linux@horizon.com> --- include/linux/hash.h | 151 ++++++++++++++++++++++++++++++++++++--------------- 1 file changed, 107 insertions(+), 44 deletions(-) diff --git a/include/linux/hash.h b/include/linux/hash.h index 05003fdc..64c44e20 100644 --- a/include/linux/hash.h +++ b/include/linux/hash.h @@ -1,63 +1,56 @@ #ifndef _LINUX_HASH_H #define _LINUX_HASH_H -/* Fast hashing routine for ints, longs and pointers.
- (C) 2002 Nadia Yvette Chambers, IBM */ - /* - * Knuth recommends primes in approximately golden ratio to the maximum - * integer representable by a machine word for multiplicative hashing. + * Fast hashing routine for ints, longs and pointers. + * (C) 2002 Nadia Yvette Chambers, IBM + * + * These are used for small in-memory hash tables, where speed is a + * primary concern. If you want something a little bit stronger, see + * <linux/jhash.h>, especially functions like jhash_3words(). If your + * hash table is subject to a hash collision denial of service attack, + * use something cryptographic. + * + * Note that the algorithms used are not guaranteed stable across kernel + * versions or architectures! In particular, hash_64() is implemented + * differently on 32- and 64-bit machines. Do not let external behavior + * depend on the hash values. + * + * The algorithm used is straight from Knuth: multiply a w-bit word by + * a suitable large constant, and take the high bits of the w-bit result. + * * Chuck Lever verified the effectiveness of this technique: * http://www.citi.umich.edu/techreports/reports/citi-tr-00-1.pdf * - * These primes are chosen to be bit-sparse, that is operations on - * them can use shifts and additions instead of multiplications for - * machines where multiplications are slow. + * A good reference is Mikkel Thorup, "High Speed Hashing for + * Integers and Strings" at http://arxiv.org/abs/1504.06804 and + * https://www.youtube.com/watch?v=cB85UZKJQTc + * + * Because the current algorithm is linear (hash(a+b) = hash(a) + hash(b)), + * adding or subtracting hash values is just as likely to cause collisions + * as adding or subtracting the keys themselves. 
*/ - #include <asm/types.h> #include <linux/compiler.h> -/* 2^31 + 2^29 - 2^25 + 2^22 - 2^19 - 2^16 + 1 */ -#define GOLDEN_RATIO_PRIME_32 0x9e370001UL -/* 2^63 + 2^61 - 2^57 + 2^54 - 2^51 - 2^18 + 1 */ -#define GOLDEN_RATIO_PRIME_64 0x9e37fffffffc0001UL +/* + * Although a random odd number will do, it turns out that the golden ratio + * phi = (sqrt(5)-1)/2, or its negative, has particularly nice properties. + * + * These are actually the negative, (1 - phi) = (phi^2) = (3 - sqrt(5))/2. + * (See Knuth vol 3, section 6.4, exercise 9.) + */ +#define GOLDEN_RATIO_32 0x61C88647 +#define GOLDEN_RATIO_64 0x61C8864680B583EBull -#if BITS_PER_LONG == 32 -#define GOLDEN_RATIO_PRIME GOLDEN_RATIO_PRIME_32 -#define __hash_long(val) __hash_32(val) -#define hash_long(val, bits) hash_32(val, bits) -#elif BITS_PER_LONG == 64 +#if BITS_PER_LONG == 64 + +#define GOLDEN_RATIO_PRIME GOLDEN_RATIO_64 /* Used in fs/inode.c */ #define __hash_long(val) __hash_64(val) #define hash_long(val, bits) hash_64(val, bits) -#define GOLDEN_RATIO_PRIME GOLDEN_RATIO_PRIME_64 -#else -#error Wordsize not 32 or 64 -#endif static __always_inline u64 __hash_64(u64 val) { - u64 hash = val; - -#if defined(CONFIG_ARCH_HAS_FAST_MULTIPLIER) && BITS_PER_LONG == 64 - hash = hash * GOLDEN_RATIO_PRIME_64; -#else - /* Sigh, gcc can't optimise this alone like it does for 32 bits. 
*/ - u64 n = hash; - n <<= 18; - hash -= n; - n <<= 33; - hash -= n; - n <<= 3; - hash += n; - n <<= 3; - hash -= n; - n <<= 4; - hash += n; - n <<= 2; - hash += n; -#endif - - return hash; + return val * GOLDEN_RATIO_64; } static __always_inline u64 hash_64(u64 val, unsigned bits) @@ -66,6 +59,75 @@ static __always_inline u64 hash_64(u64 val, unsigned bits) return __hash_64(val) >> (64 - bits); } +#elif BITS_PER_LONG == 32 + +#define GOLDEN_RATIO_PRIME GOLDEN_RATIO_32 +#define __hash_long(val) __hash_32(val) +#define hash_long(val, bits) hash_32(val, bits) + +/* + * Because 64-bit multiplications are very expensive on 32-bit machines, + * provide a completely separate implementation for them. + * + * This is mostly used via the hash_long() and hash_ptr() wrappers, + * which use hash_32() on 32-bit platforms, but there are some direct + * users of hash_64() in 32-bit kernels. + * + * Note that there is no __hash_64 function at all; that exists + * only to implement __hash_long(). + * + * The algorithm is somewhat ad-hoc, but achieves decent mixing. + */ +static __always_inline u32 hash_64(u64 val, unsigned bits) +{ + u32 hash = (u32)(val >> 32) * GOLDEN_RATIO_32; + hash += (u32)val; + hash *= GOLDEN_RATIO_32; + return hash >> (32 - bits); +} + +#else /* BITS_PER_LONG is something else */ +#error Wordsize not 32 or 64 +#endif + + +/* + * This is the old bastard constant: a low-bit-weight + * prime close to 2^32 * phi = 0x9E3779B9. + * + * The purpose of the low bit weight is to make the shift-and-add + * code faster on processors like ARMv2 without hardware multiply. + * The downside is that the high bits of the input are hashed very weakly. + * In particular, the high 16 bits of input are just shifted up and added, + * so if you ask for b < 16 bits of output, bits 16..31-b of the input + * barely affect the output.
+ * + * Annoyingly, GCC compiles this into 6 shifts and adds, which + * is enough to multiply by the full GOLDEN_RATIO_32 using a + * cleverer algorithm: + * + * unsigned hash_32(unsigned x) + * { + * unsigned y, z; + * + * y = (x << 19) + x; + * z = (x << 9) + y; + * x = (x << 23) + z; + * z = (z << 8) + y; + * x = (x << 6) - x; + * return (z << 3) + x; + * } + * + * (Found by Yevgen Voronenko's Hcub algorithm, from + * http://spiral.ece.cmu.edu/mcm/gen.html) + * + * Unfortunately, figuring out which version to compile requires + * replicating the compiler's logic in Kconfig or the preprocessor. + */ + +/* 2^31 + 2^29 - 2^25 + 2^22 - 2^19 - 2^16 + 1 */ +#define GOLDEN_RATIO_PRIME_32 0x9e370001UL + static inline u32 __hash_32(u32 val) { /* On some cpus multiply is faster, on others gcc will do shifts */ @@ -83,6 +145,7 @@ static inline u32 hash_ptr(const void *ptr, unsigned bits) return hash_long((unsigned long)ptr, bits); } +/* This really should be called "fold32_ptr"; it barely hashes at all. */ static inline u32 hash32_ptr(const void *ptr) { unsigned long val = (unsigned long)ptr; -- 2.8.1 ^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [PATCH 2/2] <linux/hash.h>: Fix hash_64()'s horrible collision problem
  2016-05-02 10:22 ` [PATCH 2/2] <linux/hash.h>: Fix hash_64()'s horrible collision problem George Spelvin
@ 2016-05-02 20:08   ` Linus Torvalds
  0 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2016-05-02 20:08 UTC (permalink / raw)
To: George Spelvin
Cc: Linux Kernel Mailing List, Thomas Gleixner, Bruce Fields,
	Eric Dumazet, Jeff Layton, Linux NFS Mailing List, Rik van Riel

[-- Attachment #1: Type: text/plain, Size: 1166 bytes --]

On Mon, May 2, 2016 at 3:22 AM, George Spelvin <linux@horizon.com> wrote:
> hash_64() was using a low-bit-weight multiplier, which resulted in
> very bad mixing of the high bits of the input.  In particular,
> page-aligned pointers (low 12 bits not used) were a disaster.

So I did just a minimal fix for 4.6 (and back-porting), which took just
the constants and made _only_ the 64-bit architecture case use this
improved constant for hash_64.

In other words, people who use "hash_long()" or use "hash_64()" on 64-bit
architectures will get the improvements, but if you use hash_64() on a
32-bit architecture you'll continue to see the old behavior.

Quite frankly, looking at some of the explicit hash_64() users, they
seem to be a bit dubious anyway.  And it won't make things *worse*
for them.

So that simple "just use multiplication unconditionally on 64-bit,
and use the better constant" should fix the actual _practical_ problems
that we've seen.  And it shouldn't have any negative consequences, since
as you say, 64-bit architectures universally do have a multiplier.

The bigger changes will have to be for 4.7 by now, I think.
Linus [-- Attachment #2: patch.diff --] [-- Type: text/plain, Size: 2805 bytes --] From 689de1d6ca95b3b5bd8ee446863bf81a4883ea25 Mon Sep 17 00:00:00 2001 From: Linus Torvalds <torvalds@linux-foundation.org> Date: Mon, 2 May 2016 12:46:42 -0700 Subject: [PATCH] Minimal fix-up of bad hashing behavior of hash_64() This is a fairly minimal fixup to the horribly bad behavior of hash_64() with certain input patterns. In particular, because the multiplicative value used for the 64-bit hash was intentionally bit-sparse (so that the multiply could be done with shifts and adds on architectures without hardware multipliers), some bits did not get spread out very much. In particular, certain fairly common bit ranges in the input (roughly bits 12-20: commonly with the most information in them when you hash things like byte offsets in files or memory that have block factors that mean that the low bits are often zero) would not necessarily show up much in the result. There's a bigger patch-series brewing to fix up things more completely, but this is the fairly minimal fix for the 64-bit hashing problem. It simply picks a much better constant multiplier, spreading the bits out a lot better. NOTE! For 32-bit architectures, the bad old hash_64() remains the same for now, since 64-bit multiplies are expensive. The bigger hashing cleanup will replace the 32-bit case with something better. The new constants were picked by George Spelvin who wrote that bigger cleanup series. I just picked out the constants and part of the comment from that series. 
Cc: stable@vger.kernel.org Cc: George Spelvin <linux@horizon.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff --git a/include/linux/hash.h b/include/linux/hash.h index 1afde47e1528..79c52fa81cac 100644 --- a/include/linux/hash.h +++ b/include/linux/hash.h @@ -32,12 +32,28 @@ #error Wordsize not 32 or 64 #endif +/* + * The above primes are actively bad for hashing, since they are + * too sparse. The 32-bit one is mostly ok, the 64-bit one causes + * real problems. Besides, the "prime" part is pointless for the + * multiplicative hash. + * + * Although a random odd number will do, it turns out that the golden + * ratio phi = (sqrt(5)-1)/2, or its negative, has particularly nice + * properties. + * + * These are the negative, (1 - phi) = (phi^2) = (3 - sqrt(5))/2. + * (See Knuth vol 3, section 6.4, exercise 9.) + */ +#define GOLDEN_RATIO_32 0x61C88647 +#define GOLDEN_RATIO_64 0x61C8864680B583EBull + static __always_inline u64 hash_64(u64 val, unsigned int bits) { u64 hash = val; -#if defined(CONFIG_ARCH_HAS_FAST_MULTIPLIER) && BITS_PER_LONG == 64 - hash = hash * GOLDEN_RATIO_PRIME_64; +#if BITS_PER_LONG == 64 + hash = hash * GOLDEN_RATIO_64; #else /* Sigh, gcc can't optimise this alone like it does for 32 bits. */ u64 n = hash; ^ permalink raw reply related [flat|nested] 43+ messages in thread
* [RFC PATCH 3/2] (Rant) Fix various hash abuses 2016-05-02 10:20 ` [PATCH 1/2] <linux/hash.h>: Make hash_64(), hash_ptr() return 32 bits George Spelvin 2016-05-02 10:22 ` [PATCH 2/2] <linux/hash.h>: Fix hash_64()'s horrible collision problem George Spelvin @ 2016-05-02 10:27 ` George Spelvin 2016-05-02 10:31 ` [RFC PATCH 4/2] namei: Improve hash mixing if CONFIG_DCACHE_WORD_ACCESS George Spelvin ` (2 subsequent siblings) 4 siblings, 0 replies; 43+ messages in thread From: George Spelvin @ 2016-05-02 10:27 UTC (permalink / raw) To: linux-kernel, tglx, torvalds; +Cc: eric.dumazet, linux, riel

From c7d6cf96d3b3695ef1a7a3da9e8be58add9c827d Mon Sep 17 00:00:00 2001
From: George Spelvin <linux@horizon.com>
Date: Mon, 2 May 2016 00:00:22 -0400
Subject: [RFC PATCH 3/2] (Rant) Fix various hash abuses

(This patch is not seriously meant to be applied as-is, but should be divided up and sent to the various subsystems. I haven't even compiled all of the code I touched.)

While checking the call sites of hash functions for the previous patches, I found numerous stupidities or even bugs. This patch fixes them.

Particularly common was calling hash_long() on values declared as 32 bits, where hash_32() would be just as good and faster. (Except for the different rounding of the constant phi used, it would compute the same hash value!)

The Mellanox mlx4 driver did this and a bit more: it XORed together IP addresses into a 32-bit value and then apparently hoped that hash_long() would fix the resultant collisions. Migrated to jhash_3words().

Lustre had several places where it did the opposite: used hash_long() on 64-bit values. That would ignore the high 32 bits on a 32-bit kernel. It's not all Lustre's fault; the same bug was in include/linux/hashtable.h.

In several places, I changed hash_long() with a cast to hash_ptr(). It does the same thing, just with less clutter.

CIFS did some strange things with hash_64 that could be better done with modular reduction.
ima_hash_key() from security/integrity/ima/ima.h was too hard to figure out so I just added a rude comment to it, but it doesn't inspire feelings of security or integrity. Signed-off-by: George Spelvin <linux@horizon.com> Cc: kvm-ppc@vger.kernel.org Cc: Alexander Graf <agraf@suse.com> Cc: linux-bcache@vger.kernel.org Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Eugenia Emantayev <eugenia@mellanox.com> Cc: lustre-devel@lists.lustre.org Cc: Oleg Drokin <oleg.drokin@intel.com> Cc: Andreas Dilger <andreas.dilger@intel.com> Cc: Steve French <sfrench@samba.org> Cc: linux-cifs@vger.kernel.org Cc: Tyler Hicks <tyhicks@canonical.com> Cc: ecryptfs@vger.kernel.org Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Jeff Layton <jlayton@poochiereds.net> Cc: Sasha Levin <levinsasha928@gmail.com> Cc: Peter Zijlstra <pzijlstr@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Mimi Zohar <zohar@linux.vnet.ibm.com> Cc: Dmitry Kasatkin <dmitry.kasatkin@gmail.com> Cc: linux-ima-devel@lists.sourceforge.net Cc: Kentaro Takeda <takedakn@nttdata.co.jp> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> --- (Cc: list above is a reminder to self. I haven't spammed everyone on it yet while I think about this.) 
arch/powerpc/kvm/book3s_mmu_hpte.c | 6 +++--- drivers/md/bcache/request.c | 2 +- drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 13 ++++--------- drivers/staging/lustre/include/linux/lnet/lib-lnet.h | 2 +- drivers/staging/lustre/lnet/lnet/api-ni.c | 2 +- drivers/staging/lustre/lnet/lnet/lib-ptl.c | 4 ++-- drivers/staging/lustre/lustre/include/lustre_fid.h | 2 +- drivers/staging/lustre/lustre/ldlm/ldlm_resource.c | 4 ++-- drivers/staging/lustre/lustre/obdclass/lu_object.c | 4 ++-- fs/cifs/cifsfs.h | 16 ++++++++++------ fs/ecryptfs/messaging.c | 2 +- fs/nfsd/nfs4idmap.c | 4 ++-- include/linux/hashtable.h | 2 +- kernel/locking/lockdep.c | 2 +- kernel/sched/wait.c | 3 +-- lib/debugobjects.c | 2 +- net/sunrpc/auth.c | 2 +- net/sunrpc/svcauth_unix.c | 2 +- security/integrity/ima/ima.h | 11 ++++++++++- security/tomoyo/memory.c | 2 +- tools/perf/builtin-lock.c | 4 ++-- 21 files changed, 49 insertions(+), 42 deletions(-) diff --git a/arch/powerpc/kvm/book3s_mmu_hpte.c b/arch/powerpc/kvm/book3s_mmu_hpte.c index 5a1ab125..e0499ac9 100644 --- a/arch/powerpc/kvm/book3s_mmu_hpte.c +++ b/arch/powerpc/kvm/book3s_mmu_hpte.c @@ -41,7 +41,7 @@ static inline u64 kvmppc_mmu_hash_pte(u64 eaddr) static inline u64 kvmppc_mmu_hash_pte_long(u64 eaddr) { - return hash_64((eaddr & 0x0ffff000) >> PTE_SIZE, + return hash_32((eaddr & 0x0ffff000) >> PTE_SIZE, HPTEG_HASH_BITS_PTE_LONG); } @@ -52,14 +52,14 @@ static inline u64 kvmppc_mmu_hash_vpte(u64 vpage) static inline u64 kvmppc_mmu_hash_vpte_long(u64 vpage) { - return hash_64((vpage & 0xffffff000ULL) >> 12, + return hash_32((vpage & 0xffffff000ULL) >> 12, HPTEG_HASH_BITS_VPTE_LONG); } #ifdef CONFIG_PPC_BOOK3S_64 static inline u64 kvmppc_mmu_hash_vpte_64k(u64 vpage) { - return hash_64((vpage & 0xffffffff0ULL) >> 4, + return hash_32((vpage & 0xffffffff0ULL) >> 4, HPTEG_HASH_BITS_VPTE_64K); } #endif diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c index 25fa8445..5137ab31 100644 --- a/drivers/md/bcache/request.c +++ 
b/drivers/md/bcache/request.c @@ -664,7 +664,7 @@ static inline struct search *search_alloc(struct bio *bio, s->iop.c = d->c; s->iop.bio = NULL; s->iop.inode = d->id; - s->iop.write_point = hash_long((unsigned long) current, 16); + s->iop.write_point = hash_ptr(current, 16); s->iop.write_prio = 0; s->iop.error = 0; s->iop.flags = 0; diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c index b4b258c8..180d0b7d 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c @@ -36,7 +36,7 @@ #include <linux/if_vlan.h> #include <linux/delay.h> #include <linux/slab.h> -#include <linux/hash.h> +#include <linux/jhash.h> #include <net/ip.h> #include <net/busy_poll.h> #include <net/vxlan.h> @@ -194,14 +194,9 @@ static inline struct hlist_head * filter_hash_bucket(struct mlx4_en_priv *priv, __be32 src_ip, __be32 dst_ip, __be16 src_port, __be16 dst_port) { - unsigned long l; - int bucket_idx; - - l = (__force unsigned long)src_port | - ((__force unsigned long)dst_port << 2); - l ^= (__force unsigned long)(src_ip ^ dst_ip); - - bucket_idx = hash_long(l, MLX4_EN_FILTER_HASH_SHIFT); + u32 ports = (__force u32)src_port << 16 | (__force u32)dst_port; + int bucket_idx = jhash_3words(ports, (__force u32)src_ip, + (__force u32)dst_ip, 0) >> (32 - MLX4_EN_FILTER_HASH_SHIFT); return &priv->filter_hash[bucket_idx]; } diff --git a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h index dfc0208d..9b42fb55 100644 --- a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h +++ b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h @@ -428,7 +428,7 @@ lnet_ni_alloc(__u32 net, struct cfs_expr_list *el, struct list_head *nilist); static inline int lnet_nid2peerhash(lnet_nid_t nid) { - return hash_long(nid, LNET_PEER_HASH_BITS); + return hash_64(nid, LNET_PEER_HASH_BITS); } static inline struct list_head * diff --git 
a/drivers/staging/lustre/lnet/lnet/api-ni.c b/drivers/staging/lustre/lnet/lnet/api-ni.c index 87647555..4de382a2 100644 --- a/drivers/staging/lustre/lnet/lnet/api-ni.c +++ b/drivers/staging/lustre/lnet/lnet/api-ni.c @@ -700,7 +700,7 @@ lnet_nid_cpt_hash(lnet_nid_t nid, unsigned int number) if (number == 1) return 0; - val = hash_long(key, LNET_CPT_BITS); + val = hash_64(key, LNET_CPT_BITS); /* NB: LNET_CP_NUMBER doesn't have to be PO2 */ if (val < number) return val; diff --git a/drivers/staging/lustre/lnet/lnet/lib-ptl.c b/drivers/staging/lustre/lnet/lnet/lib-ptl.c index 3947e8b7..dc600c3b 100644 --- a/drivers/staging/lustre/lnet/lnet/lib-ptl.c +++ b/drivers/staging/lustre/lnet/lnet/lib-ptl.c @@ -360,13 +360,13 @@ lnet_mt_match_head(struct lnet_match_table *mtable, lnet_process_id_t id, __u64 mbits) { struct lnet_portal *ptl = the_lnet.ln_portals[mtable->mt_portal]; - unsigned long hash = mbits; + u64 hash = mbits; if (!lnet_ptl_is_wildcard(ptl)) { hash += id.nid + id.pid; LASSERT(lnet_ptl_is_unique(ptl)); - hash = hash_long(hash, LNET_MT_HASH_BITS); + hash = hash_64(hash, LNET_MT_HASH_BITS); } return &mtable->mt_mhash[hash & LNET_MT_HASH_MASK]; } diff --git a/drivers/staging/lustre/lustre/include/lustre_fid.h b/drivers/staging/lustre/lustre/include/lustre_fid.h index ab4a9239..5a19bd50 100644 --- a/drivers/staging/lustre/lustre/include/lustre_fid.h +++ b/drivers/staging/lustre/lustre/include/lustre_fid.h @@ -594,7 +594,7 @@ static inline __u32 fid_hash(const struct lu_fid *f, int bits) /* all objects with same id and different versions will belong to same * collisions list. 
*/ - return hash_long(fid_flatten(f), bits); + return hash_64(fid_flatten(f), bits); } /** diff --git a/drivers/staging/lustre/lustre/ldlm/ldlm_resource.c b/drivers/staging/lustre/lustre/ldlm/ldlm_resource.c index 9dede87a..d3b66496 100644 --- a/drivers/staging/lustre/lustre/ldlm/ldlm_resource.c +++ b/drivers/staging/lustre/lustre/ldlm/ldlm_resource.c @@ -475,9 +475,9 @@ static unsigned ldlm_res_hop_fid_hash(struct cfs_hash *hs, } else { val = fid_oid(&fid); } - hash = hash_long(hash, hs->hs_bkt_bits); + hash = hash_32(hash, hs->hs_bkt_bits); /* give me another random factor */ - hash -= hash_long((unsigned long)hs, val % 11 + 3); + hash -= hash_ptr(hs, val % 11 + 3); hash <<= hs->hs_cur_bits - hs->hs_bkt_bits; hash |= ldlm_res_hop_hash(hs, key, CFS_HASH_NBKT(hs) - 1); diff --git a/drivers/staging/lustre/lustre/obdclass/lu_object.c b/drivers/staging/lustre/lustre/obdclass/lu_object.c index 978568ad..369a193b 100644 --- a/drivers/staging/lustre/lustre/obdclass/lu_object.c +++ b/drivers/staging/lustre/lustre/obdclass/lu_object.c @@ -869,10 +869,10 @@ static unsigned lu_obj_hop_hash(struct cfs_hash *hs, hash = fid_flatten32(fid); hash += (hash >> 4) + (hash << 12); /* mixing oid and seq */ - hash = hash_long(hash, hs->hs_bkt_bits); + hash = hash_32(hash, hs->hs_bkt_bits); /* give me another random factor */ - hash -= hash_long((unsigned long)hs, fid_oid(fid) % 11 + 3); + hash -= hash_ptr(hs, fid_oid(fid) % 11 + 3); hash <<= hs->hs_cur_bits - hs->hs_bkt_bits; hash |= (fid_seq(fid) + fid_oid(fid)) & (CFS_HASH_NBKT(hs) - 1); diff --git a/fs/cifs/cifsfs.h b/fs/cifs/cifsfs.h index 83aac8ba..b10522e3 100644 --- a/fs/cifs/cifsfs.h +++ b/fs/cifs/cifsfs.h @@ -22,20 +22,24 @@ #ifndef _CIFSFS_H #define _CIFSFS_H -#include <linux/hash.h> - #define ROOT_I 2 /* * ino_t is 32-bits on 32-bit arch. We have to squash the 64-bit value down - * so that it will fit. We use hash_64 to convert the value to 31 bits, and - * then add 1, to ensure that we don't end up with a 0 as the value. 
+ * so that it will fit, and we have to ensure that only a zero fileid will + * map to the zero ino_t. + * + * This can be done very simply by reducing mod-2^32-1, similar to IP + * checksums. After the first addition, the sum is at most 0x1fffffffe. + * The second sum cannot overflow 32 bits. */ static inline ino_t cifs_uniqueid_to_ino_t(u64 fileid) { - if ((sizeof(ino_t)) < (sizeof(u64))) - return (ino_t)hash_64(fileid, (sizeof(ino_t) * 8) - 1) + 1; + if (sizeof(ino_t) < sizeof(u64)) { + fileid = (fileid & 0xffffffff) + (fileid >> 32); + fileid = (fileid & 0xffffffff) + (fileid >> 32); + } return (ino_t)fileid; diff --git a/fs/ecryptfs/messaging.c b/fs/ecryptfs/messaging.c index 286f10b0..bd5ae06b 100644 --- a/fs/ecryptfs/messaging.c +++ b/fs/ecryptfs/messaging.c @@ -33,7 +33,7 @@ static struct hlist_head *ecryptfs_daemon_hash; struct mutex ecryptfs_daemon_hash_mux; static int ecryptfs_hash_bits; #define ecryptfs_current_euid_hash(uid) \ - hash_long((unsigned long)from_kuid(&init_user_ns, current_euid()), ecryptfs_hash_bits) + hash_32(from_kuid(&init_user_ns, current_euid()), ecryptfs_hash_bits) static u32 ecryptfs_msg_counter; static struct ecryptfs_msg_ctx *ecryptfs_msg_ctx_arr; diff --git a/fs/nfsd/nfs4idmap.c b/fs/nfsd/nfs4idmap.c index 5b20577d..8e99a418 100644 --- a/fs/nfsd/nfs4idmap.c +++ b/fs/nfsd/nfs4idmap.c @@ -111,8 +111,8 @@ idtoname_hash(struct ent *ent) { uint32_t hash; - hash = hash_str(ent->authname, ENT_HASHBITS); - hash = hash_long(hash ^ ent->id, ENT_HASHBITS); + hash = hash_str(ent->authname, 32); + hash = hash_32(hash ^ ent->id, ENT_HASHBITS); /* Flip LSB for user/group */ if (ent->type == IDMAP_TYPE_GROUP) diff --git a/include/linux/hashtable.h b/include/linux/hashtable.h index 661e5c2a..91e3fc5d 100644 --- a/include/linux/hashtable.h +++ b/include/linux/hashtable.h @@ -28,7 +28,7 @@ /* Use hash_32 when possible to allow for fast 32bit hashing in 64bit kernels. */ #define hash_min(val, bits) \ - (sizeof(val) <= 4 ? 
hash_32(val, bits) : hash_long(val, bits)) + (sizeof(val) <= 4 ? hash_32(val, bits) : hash_64(val, bits)) static inline void __hash_init(struct hlist_head *ht, unsigned int sz) { diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 78c1c0ee..11dca9fc 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -286,7 +286,7 @@ LIST_HEAD(all_lock_classes); */ #define CLASSHASH_BITS (MAX_LOCKDEP_KEYS_BITS - 1) #define CLASSHASH_SIZE (1UL << CLASSHASH_BITS) -#define __classhashfn(key) hash_long((unsigned long)key, CLASSHASH_BITS) +#define __classhashfn(key) hash_ptr(key, CLASSHASH_BITS) #define classhashentry(key) (classhash_table + __classhashfn((key))) static struct hlist_head classhash_table[CLASSHASH_SIZE]; diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c index f15d6b6a..a23611d2 100644 --- a/kernel/sched/wait.c +++ b/kernel/sched/wait.c @@ -485,9 +485,8 @@ EXPORT_SYMBOL(wake_up_bit); wait_queue_head_t *bit_waitqueue(void *word, int bit) { - const int shift = BITS_PER_LONG == 32 ? 
5 : 6; const struct zone *zone = page_zone(virt_to_page(word)); - unsigned long val = (unsigned long)word << shift | bit; + unsigned long val = (unsigned long)word * BITS_PER_LONG + bit; return &zone->wait_table[hash_long(val, zone->wait_table_bits)]; } diff --git a/lib/debugobjects.c b/lib/debugobjects.c index 519b5a10..df65cfba 100644 --- a/lib/debugobjects.c +++ b/lib/debugobjects.c @@ -242,7 +242,7 @@ static void debug_objects_oom(void) */ static struct debug_bucket *get_bucket(unsigned long addr) { - unsigned long hash; + unsigned int hash; hash = hash_long((addr >> ODEBUG_CHUNK_SHIFT), ODEBUG_HASH_BITS); return &obj_hash[hash]; diff --git a/net/sunrpc/auth.c b/net/sunrpc/auth.c index 02f53674..1524dd3b 100644 --- a/net/sunrpc/auth.c +++ b/net/sunrpc/auth.c @@ -551,7 +551,7 @@ rpcauth_lookup_credcache(struct rpc_auth *auth, struct auth_cred * acred, *entry, *new; unsigned int nr; - nr = hash_long(from_kuid(&init_user_ns, acred->uid), cache->hashbits); + nr = hash_32(from_kuid(&init_user_ns, acred->uid), cache->hashbits); rcu_read_lock(); hlist_for_each_entry_rcu(entry, &cache->hashtable[nr], cr_hash) { diff --git a/net/sunrpc/svcauth_unix.c b/net/sunrpc/svcauth_unix.c index dfacdc95..44773bdc 100644 --- a/net/sunrpc/svcauth_unix.c +++ b/net/sunrpc/svcauth_unix.c @@ -416,7 +416,7 @@ struct unix_gid { static int unix_gid_hash(kuid_t uid) { - return hash_long(from_kuid(&init_user_ns, uid), GID_HASHBITS); + return hash_32(from_kuid(&init_user_ns, uid), GID_HASHBITS); } static void unix_gid_put(struct kref *kref) diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h index 5d0f6116..80296292 100644 --- a/security/integrity/ima/ima.h +++ b/security/integrity/ima/ima.h @@ -135,9 +135,18 @@ struct ima_h_table { }; extern struct ima_h_table ima_htable; +/* + * FIXME: What the hell is the point of hashing one byte to produce + * a 9-bit hash value? Should this just be "return *digest"? Or should + * we be hashing more than the first byte of the digest? 
Callers seem + * to pass a buffer of TPM_DIGEST_SIZE bytes. + * + * It's always fun to be WTFing over "security" code for + * "integrity management". + */ static inline unsigned long ima_hash_key(u8 *digest) { - return hash_long(*digest, IMA_HASH_BITS); + return hash_32(*digest, IMA_HASH_BITS); } enum ima_hooks { diff --git a/security/tomoyo/memory.c b/security/tomoyo/memory.c index 0e995716..594c4aac 100644 --- a/security/tomoyo/memory.c +++ b/security/tomoyo/memory.c @@ -155,7 +155,7 @@ const struct tomoyo_path_info *tomoyo_get_name(const char *name) return NULL; len = strlen(name) + 1; hash = full_name_hash((const unsigned char *) name, len - 1); - head = &tomoyo_name_list[hash_long(hash, TOMOYO_HASH_BITS)]; + head = &tomoyo_name_list[hash_32(hash, TOMOYO_HASH_BITS)]; if (mutex_lock_interruptible(&tomoyo_policy_lock)) return NULL; list_for_each_entry(ptr, head, head.list) { diff --git a/tools/perf/builtin-lock.c b/tools/perf/builtin-lock.c index ce3bfb48..34f764a9 100644 --- a/tools/perf/builtin-lock.c +++ b/tools/perf/builtin-lock.c @@ -35,8 +35,8 @@ static struct perf_session *session; static struct list_head lockhash_table[LOCKHASH_SIZE]; -#define __lockhashfn(key) hash_long((unsigned long)key, LOCKHASH_BITS) -#define lockhashentry(key) (lockhash_table + __lockhashfn((key))) +#define __lockhashfn(addr) hash_ptr(addr, LOCKHASH_BITS) +#define lockhashentry(addr) (lockhash_table + __lockhashfn(addr)) struct lock_stat { struct list_head hash_entry; -- 2.8.1 ^ permalink raw reply related [flat|nested] 43+ messages in thread
* [RFC PATCH 4/2] namei: Improve hash mixing if CONFIG_DCACHE_WORD_ACCESS 2016-05-02 10:20 ` [PATCH 1/2] <linux/hash.h>: Make hash_64(), hash_ptr() return 32 bits George Spelvin 2016-05-02 10:22 ` [PATCH 2/2] <linux/hash.h>: Fix hash_64()'s horrible collision problem George Spelvin 2016-05-02 10:27 ` [RFC PATCH 3/2] (Rant) Fix various hash abuses George Spelvin @ 2016-05-02 10:31 ` George Spelvin 2016-05-16 18:51 ` Linus Torvalds 2016-05-02 13:28 ` [PATCH 1/2] <linux/hash.h>: Make hash_64(), hash_ptr() return 32 bits Peter Zijlstra 2016-05-02 16:24 ` Linus Torvalds 4 siblings, 1 reply; 43+ messages in thread From: George Spelvin @ 2016-05-02 10:31 UTC (permalink / raw) To: linux-kernel, tglx, torvalds; +Cc: eric.dumazet, linux, riel

The hash mixing between adding the next 64 bits of name was just a bit weak.

Replaced with a still very fast but slightly more effective mixing function.

Signed-off-by: George Spelvin <linux@horizon.com>
---
As long as I was looking at all sorts of hashing in the kernel, I noticed this. I'm not sure if this is still too expensive and will slow down the loop.

fs/namei.c | 33 ++++++++++++++++++++++++++------- 1 file changed, 26 insertions(+), 7 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index 1d9ca2d5..e2bff05d 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -1794,30 +1794,49 @@ static inline unsigned int fold_hash(unsigned long hash) return hash_64(hash, 32); } +/* + * This is George Marsaglia's XORSHIFT generator. + * It implements a maximum-period LFSR in only a few + * instructions. It also has the property (required + * by hash_name()) that mix_hash(0) = 0.
+ */ +static inline unsigned long mix_hash(unsigned long hash) +{ + hash ^= hash << 13; + hash ^= hash >> 7; + hash ^= hash << 17; + return hash; +} + #else /* 32-bit case */ #define fold_hash(x) (x) +static inline unsigned long mix_hash(unsigned long hash) +{ + hash ^= hash << 13; + hash ^= hash >> 17; + hash ^= hash << 5; + return hash; +} + #endif unsigned int full_name_hash(const unsigned char *name, unsigned int len) { - unsigned long a, mask; - unsigned long hash = 0; + unsigned long a, hash = 0; for (;;) { a = load_unaligned_zeropad(name); if (len < sizeof(unsigned long)) break; - hash += a; - hash *= 9; + hash = mix_hash(hash + a); name += sizeof(unsigned long); len -= sizeof(unsigned long); if (!len) goto done; } - mask = bytemask_from_count(len); - hash += mask & a; + hash += a & bytemask_from_count(len); done: return fold_hash(hash); } @@ -1835,7 +1854,7 @@ static inline u64 hash_name(const char *name) hash = a = 0; len = -sizeof(unsigned long); do { - hash = (hash + a) * 9; + hash = mix_hash(hash + a); len += sizeof(unsigned long); a = load_unaligned_zeropad(name+len); b = a ^ REPEAT_BYTE('/'); -- 2.8.1 ^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [RFC PATCH 4/2] namei: Improve hash mixing if CONFIG_DCACHE_WORD_ACCESS 2016-05-02 10:31 ` [RFC PATCH 4/2] namei: Improve hash mixing if CONFIG_DCACHE_WORD_ACCESS George Spelvin @ 2016-05-16 18:51 ` Linus Torvalds 0 siblings, 0 replies; 43+ messages in thread From: Linus Torvalds @ 2016-05-16 18:51 UTC (permalink / raw) To: George Spelvin Cc: Linux Kernel Mailing List, Thomas Gleixner, Eric Dumazet, Rik van Riel

On Mon, May 2, 2016 at 3:31 AM, George Spelvin <linux@horizon.com> wrote:
> The hash mixing between adding the next 64 bits of name
> was just a bit weak.
>
> Replaced with a still very fast but slightly more effective
> mixing function.

I've applied this patch independently of all your other hash rework to my tree. I verified that the code generation for the inner loop is still fine, and it does look like a much better mixing function, as well as just cleaning up the code.

I hope to get new versions of the actual <linux/hash.h> fixes during this merge window from you.

Thanks,

Linus ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 1/2] <linux/hash.h>: Make hash_64(), hash_ptr() return 32 bits 2016-05-02 10:20 ` [PATCH 1/2] <linux/hash.h>: Make hash_64(), hash_ptr() return 32 bits George Spelvin ` (2 preceding siblings ...) 2016-05-02 10:31 ` [RFC PATCH 4/2] namei: Improve hash mixing if CONFIG_DCACHE_WORD_ACCESS George Spelvin @ 2016-05-02 13:28 ` Peter Zijlstra 2016-05-02 19:08 ` George Spelvin 2016-05-02 16:24 ` Linus Torvalds 4 siblings, 1 reply; 43+ messages in thread From: Peter Zijlstra @ 2016-05-02 13:28 UTC (permalink / raw) To: George Spelvin Cc: linux-kernel, tglx, torvalds, bfields, eric.dumazet, jlayton, linux-nfs, riel On Mon, May 02, 2016 at 06:20:16AM -0400, George Spelvin wrote: > Subject: [PATCH 1/2] <linux/hash.h>: Make hash_64(), hash_ptr() return 32 bits > +static __always_inline u64 hash_64(u64 val, unsigned bits) > +{ > /* High bits are more random, so use them. */ > + return __hash_64(val) >> (64 - bits); > } Is the subject stale or the above a mistake? Because hash_64() still very much seems to return u64. Also, I think I would prefer to keep it like this, I would like to use it for kernel/locking/lockdep.c:iterate_chain_key(), which currently is a somewhat crap hash. Something like: static inline u64 iterate_chain_key(u64 key1, u64 key2) { return hash_64(key1 ^ key2, 64); } ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 1/2] <linux/hash.h>: Make hash_64(), hash_ptr() return 32 bits 2016-05-02 13:28 ` [PATCH 1/2] <linux/hash.h>: Make hash_64(), hash_ptr() return 32 bits Peter Zijlstra @ 2016-05-02 19:08 ` George Spelvin 0 siblings, 0 replies; 43+ messages in thread From: George Spelvin @ 2016-05-02 19:08 UTC (permalink / raw) To: linux, peterz Cc: bfields, eric.dumazet, jlayton, linux-kernel, linux-nfs, riel, tglx, torvalds

Peter Zijlstra wrote:
> Is the subject stale or the above a mistake? Because hash_64() still
> very much seems to return u64.

Damn it no, it's a brown-paper-bag typo caused by a recent rebase. It's meant to be u32, it was developed with u32, but the typo snuck in during late testing and I didn't catch it.

I know Linus hates recent rebases, but I actually had a good reason if you want to hear the saga...

I developed the patch while running v4.4.x. I'd been doing other hacking on top of v4.5 that resulted in an unstable system, so I kept going back to the "last known good" v4.4.x kernel to get work done.

Developing this patch, I backed out that buggy work and based it on my 4.5 tree, since Linus hates basing work on random kernels. Most of it was compile testing, but just before submitting, I of course had to boot it and test.

When I booted it, I discovered I couldn't load microcode. How the hell did I cause that? Oh, I have CONFIG_MICROCODE turned off... huh? Oh, v4.5 has a bug where CONFIG_MICROCODE depended on CONFIG_BLK_DEV_INITRD, which I don't use, and the fix went in to v4.6-rc1. Okay, fine, in the interest of getting a clean boot for testing, I'll rebase to v4.6-rc6. See, I told you I had a reason!

Now, I actually have a fair pile of local patches for hacking projects in progress (I'm running 4.6.0-rc6-0130), so rebasing my whole stack takes me about an hour and a half, with several merge conflict resolutions. Finally, I get my clean boot, and everything seems to be working fine, and I'm ready to post.
But by this time it's late, I'm tired, and I didn't notice that I somehow managed to screw that up! In hindsight, I think I remember the sequence of edits that caused it (I deleted something by accident and cut & pasted it back), but that's an even more obscure saga. I will now go and fix it and boot test again, just to be sure. Grump. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 1/2] <linux/hash.h>: Make hash_64(), hash_ptr() return 32 bits 2016-05-02 10:20 ` [PATCH 1/2] <linux/hash.h>: Make hash_64(), hash_ptr() return 32 bits George Spelvin ` (3 preceding siblings ...) 2016-05-02 13:28 ` [PATCH 1/2] <linux/hash.h>: Make hash_64(), hash_ptr() return 32 bits Peter Zijlstra @ 2016-05-02 16:24 ` Linus Torvalds 2016-05-02 20:26 ` George Spelvin 4 siblings, 1 reply; 43+ messages in thread From: Linus Torvalds @ 2016-05-02 16:24 UTC (permalink / raw) To: George Spelvin Cc: Linux Kernel Mailing List, Thomas Gleixner, Bruce Fields, Eric Dumazet, Jeff Layton, Linux NFS Mailing List, Rik van Riel

On Mon, May 2, 2016 at 3:20 AM, George Spelvin <linux@horizon.com> wrote:
>
> After a careful scan through the kernel code, no caller asks any of
> those four for more than 32 bits of hash result, except that the
> latter two need 64 bits from hash_long() if BITS_PER_LONG == 64.

Ugh. I hate this patch.

I really think that we should *not* confuse those two odd svcauth.h users with the whole hash_32/64 thing.

I think hash_str/hash_mem should be moved to lib/hash.c, and they just shouldn't use "hash_long()" at all, except at the very end (they currently have a very odd way of doing "every <n> bytes _and_ at the end").

In particular, the hashing in the *middle* is very different from the hashing at the end.

At the end, you need to make sure the lower bits get spread out particularly to the upper bits, since you're going to shift things down.

But in the middle, you just want to spread the bits out (and in particular, destroy any byte-per-byte patterns that build up in between).

Quite frankly, I think those functions should just use something like the FNV hash (or Jenkins hash), and then maybe use "hash_long()" at the *end* to shift the result down to "bits".

I don't want to make our <linux/hash.h> odder just because of two entirely broken users.

That said, I actually think hash_str() should be killed entirely.
Better just use what we use for pathnames: full_name_hash() (which gets a pointer and length) and hash_name (which gets the string). Those functions do the "word-at-a-time" optimization, and they should do a good enough job. If they aren't, we should fix them, because they are a hell of a lot more important than anything that the svcauth code does. Linus ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 1/2] <linux/hash.h>: Make hash_64(), hash_ptr() return 32 bits 2016-05-02 16:24 ` Linus Torvalds @ 2016-05-02 20:26 ` George Spelvin 2016-05-02 21:19 ` Linus Torvalds 0 siblings, 1 reply; 43+ messages in thread From: George Spelvin @ 2016-05-02 20:26 UTC (permalink / raw) To: linux, torvalds Cc: bfields, eric.dumazet, jlayton, linux-kernel, linux-nfs, riel, tglx

Linus wrote:
> I really think that we should *not* confuse those two odd svcauth.h
> users with the whole hash_32/64 thing.
>
> I think hash_str/hash_mem should be moved to lib/hash.c, and they
> just shouldn't use "hash_long()" at all, except at the very end (they
> currently have a very odd way of doing "every <n> bytes _and_ at the
> end").

Moving them is fine. I have no problem with that, except that I agree that merging them with the fs/namei.c hashes would be even better.

But the hash_long() is not odd at all. They're using it as the mix function between adding words, like hash_name() uses "hash = (hash + a) * 9". So it's

	x = (first 8 bytes of string)
	x = __hash64(x);
	x ^= (next 8 bytes of string)
	x = __hash64(x);
	... repeat ...
	x ^= (last 1..8 bytes of string)
	return __hash64(x);

It's a quite reasonable mix function: one multiply (4 cycles or so) per 8 bytes. It's definitely swamped by the byte-at-a-time string loading.

(The one peculiar thing is that the "last 1..8 bytes of string" is, due to the way it's shifted into an accumulator, actually the last 8 bytes, which may include some previously-hashed bytes. But that has nothing to do with the mixing.)

But... the fundamental reason I didn't do that is that this is late -rc. I'm trying to fix a *bug* tglx found, not add risk and delay with a much more ambitious patch series. This is a related but separate issue that can be addressed separately.

> In particular, the hashing in the *middle* is very different from the hashing at the end.
> At the end, you need to make sure the lower bits get spread out
> particularly to the upper bits, since you're going to shift things
> down.
>
> But in the middle, you just want to spread the bits out (and in
> particular, destroy any byte-per-byte patterns that build up in
> between).

Yes, in the middle you have a full width hash to spread the bits among, while at the end you have to "concentrate" the hash value in a few bits. The latter is a more difficult task and might be slower.

But there's nothing really *wrong* with using the same mix operation both places if it's both good enough and fast enough. In particular, if you're loading the string a word at a time, you need a much better mixing function than if you're processing it byte at a time.

> Quite frankly, I think those functions should just use something like
> the FNV hash (or Jenkins hash), and then maybe use "hash_long()" at
> the *end* to shift the result down to "bits".

It *is* the FNV hash! More specifically, FNV-1a done a word at a time; doing it byte at a time like most implementations would result in 8 times as many multiplies and be slower. The only difference is the exact multiplier. The published FNV-1 uses a low-bit-weight multiplier. Since this is being done a word at a time, I think the stronger multiplier is worth it.

Jenkins hash is (if you have a HW multiplier) slower. Better mixing, so adding a multiply at the end would be redundant.

> I don't want to make our <linux/hash.h> odder just because of two
> entirely broken users.
>
> That said, I actually think hash_str() should be killed entirely.
> Better just use what we use for pathnames: full_name_hash() (which
> gets a pointer and length) and hash_name (which gets the string).

I was thinking the exact same thing. Repeated grepping over the kernel tree for "hash_*" functions showed me just how many there are in various places, and combining some would be a good idea.
For example, partial_name_hash() is still used in many places even if the word-at-a-time code is used in namei.c. > Those functions do the "word-at-a-time" optimization, and they should > do a good enough job. If they aren't, we should fix them, because they > are a hell of a lot more important than anything that the svcauth code > does. Okay, let me list the problems...

1) hash_name() stops at a slash or a nul. hash_str() only looks for a nul. Should I add a third variant? Should that go in fs/namei, or should the whole pile be moved elsewhere?

2) Some places need the length *and* the hash. Calling strlen() and then full_name_hash() somewhat defeats the purpose of word-at-a-time access. hash_name returns both jammed into a 64-bit word. Is that a good interface in general? Myself, I think the length should be computed at copy_from_user() time, and I'd like to look at each such call site and understand why it *doesn't* have the length ready. But that's a lot of work.

3) They do particularly crappy mixing. See that "RFC PATCH 4/2" I posted because I couldn't stand how bad it was. If you don't have word at a time, the partial_name_hash() is decent, but the word-at-a-time mixes by multiplying by 9. So the hashes of the strings "1.......0" and "0.......9" are identical. (I assume this was chosen as the best available one-instruction (LEA) mix operation due to an obsessive interest in speed in the dcache.) More crappy mixing is the final folding. On a 64-bit machine, the high and low 32 bits are just XORed together. So the hashes of "deadbeef" and "beefdead" are identical. I agree we should be very careful of the mixing function, since it's the only thing limiting loop cycle time. The has_zero hackery has a critical path about 6 cycles long, but the iterations are independent, and a sufficiently aggressive OOO machine could pipeline them. (If you have a particular cycle count budget in mind, I can come up with something.)

4) They don't do a final mix. Again, obsessive interest in speed when you're storing the whole 32 bits and comparing that. For applications that use only a few bits of hash index and need the entropy "concentrated", should this be done outside?

Basically, I can understand why someone might prefer a stronger hash for less speed-critical applications. I agree that we should ideally have just two string hash functions for general kernel use (plus any imposed by external standards like file systems or ELF). One is the dcache hash, for applications where collision DoS attacks are not expected, optimized strongly for speed. A second, something like SipHash, for applications which require collision resistance against malicious attackers. But achieving that ideal is a project of significant size. There are a lot of corner cases to worry about. ^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: [PATCH 1/2] <linux/hash.h>: Make hash_64(), hash_ptr() return 32 bits 2016-05-02 20:26 ` George Spelvin @ 2016-05-02 21:19 ` Linus Torvalds 2016-05-02 21:41 ` Linus Torvalds 2016-05-03 1:59 ` George Spelvin 0 siblings, 2 replies; 43+ messages in thread From: Linus Torvalds @ 2016-05-02 21:19 UTC (permalink / raw) To: George Spelvin Cc: Bruce Fields, Eric Dumazet, Jeff Layton, Linux Kernel Mailing List, Linux NFS Mailing List, Rik van Riel, Thomas Gleixner On Mon, May 2, 2016 at 1:26 PM, George Spelvin <linux@horizon.com> wrote: > > But the hash_long is not odd at all. They're using it as the mix function > between adding words. Like hash_name() uses "hash = (hash + a) * 9". Right. But there is no reason to think that that should be the same thing as the final hash. In fact, there are many reasons to think it should *not*. The final hash is different. It's different not just because it wants to concentrate the bits at the high end (which the middle hash does not), but it's different exactly because the whole calling convention is different: the final hash returns a "small" value (in your version a "u32"), while the hash in the middle very much does not. So thinking that they are somehow related is wrong. It screws up hash.h, and incorrectly conflates this middle hash_mix() thing with the final hash_32/64() thing. > It's a quite reasonable mix function. One multiply (4 cycles or so) per > 8 bytes. It's definitely swamped by the byte-at-a-time string loading. .. but it's not. Exactly because of the above. Make it be a "hash_mix()" function. Make it use the multiply, by all means. Same multiplier, even. BUT IT IS STILL NOT THE SAME FUNCTION, for the above reason. One wants to return "u32", the other does not. Really. > But... the fundamental reason I didn't was that this is late -rc. > I'm trying to fix a *bug* tglx found, not add risk and delay with a much > more ambitious patch series. Oh, this late in the rc we're not doing _any_ of this.
I sent out my suggested "late in the rc and for stable" patch that fixes the practical problem we have, that has nothing to do with cleaning things up. > It *is* the FNV hash! More specifically, FNV-1a done a word at a time; > doing it byte at a time like most implementations would result in 8 > times as many multiplies and be slower. I refuse to call that shit "word at a time". It's "byte at a time with a slow conditional that will screw up your branch predictor and a multiply in the middle". A compiler might be able to turn it into some disgusting unrolled thing that avoids some of the problems, but at no point is that a good thing. I seriously think that it would be (a) more readable (b) have a better mix function if it was just kept as a byte-at-a-time thing entirely with the per-byte mixing thing done better than just "shift by 8 bits". And then at the end you could do a single hash_long(). That horrible svc thing needs to go. It's all pretty moot, since we have a reasonable version that actually does do word-at-a-time. That can be improved too, I'm sure, but that svcauth garbage should just be thrown out. > The only difference is the exact multiplier. The published FNV-1 > uses a low-bit-weight multiplier. Since this is being done a word > at a time, I think the stronger multiplier is worth it. .. Considering that the "word-at-a-time" is likely *slower* than doing it a byte-at-a-time the way it has been encoded, I'm not in the least convinced. >> For example, partial_name_hash() is still used in many places >> even if the word-at-a-time code is used in namei.c. Those places aren't actually all that performance-critical. They really don't matter. > Okay, let me list the problems... > > 1) hash_name() stops at a slash or a nul. hash_str() only looks > for a nul. Should I add a third variant? Should that go in fs/namei, > or should the whole pile be moved elsewhere?
We'll never get rid of "hash_name()", it not only has that '/' case, it's also inlined for a reason. You'd copy it without the magic for '/' and turn that into str_hash() for others to use. full_name_hash() can presumably be used pretty much as-is as mem_hash(). > 2) Some places need the length *and* the hash. Calling strlen() and then > full_name_hash() somewhat defeats the purpose of word-at-a-time access. > hash_name returns both jammed into a 64-bit word. Is that a good > interface in general? Hmm. We actually have that exact case in the dentry code and hash_name(), except we handle it by doing that "hashlen" thing that contains both the length and the hash in one 64-bit thing. Maybe we could expose that kind of interface, even if it's pretty ugly. > Myself, I think the length should be computed at copy_from_user() > time and I'd like to look at each such call site and understand why > it *doesn't* have the length ready. But that's a lot of work. If it's copied from user space, we already do have the length. You do realize that pathnames are different from pretty much every other case in the kernel? There's a reason pathnames have their own logic. The string length is very different from the component length, and it turns out that component length and hash is generally critical. > 3) They do particularly crappy mixing. See that "RFC PATCH 4/2" I posted > that because I couldn't stand how bad it was. > > If you don't have word at a time, the partial_name_hash() is decent, > but the word-at-a-time mixes by multiplying by 9. So the hashes > of the strings "1.......0" and "0.......9" are identical. > > (I assume this was chosen as the best available one-instruction (LEA) > mix operation due to an obsessive interest in speed in the dcache.) The thing is, a lot of people tend to optimize performance (and behavior) for large strings. For path components, the most common lengths are less than a single 8-byte word! 
That "mixing" function almost doesn't matter, because the case that matters the most (by far) are strings that fit in one or _maybe_ two words. Yes, things are slowly moving towards longer pathnames, but even with long filenames, most component names are the directory entries that still tend to be shorter. We still have the old 3-letter ones close to the root, of course, but the statistics are still pretty heavily skewed to <15 characters especially if you look at directory names. So for pathnames, the mixing in the middle has tended to not be _nearly_ as important as the final mix, for the simple reason that it happens maybe once, often not at all. > More crappy mixing is the final folding. On a 64-bit machine, the > high and low 32 bits are just XORed together. So the hashes of > "deadbeef" and "beefdead" are identical. Hmm? It uses "fold_hash()", which definitely doesn't do that. Are you still looking at partial_name_hash() and friends? Don't. They are garbage. They are used for things like case-insensitive filesystems etc. > (If you have a particular cycle count budget in mind, I can come up with > something.) The real issue I had in fs/namei.c is that link_path_walk() and __d_lookup_rcu() are literally _the_ hottest functions in the kernel under a few (common) loads, and for code generation things like register pressure ends up mattering. The reason that "hash_len" is a single 64-bit field rather than two 32-bit fields, for example, is that that way it takes one _register_ when we do the hash lookup. Some of that code was tuned to inline - and _not_ inline in particular patterns. Now, some of that tuning may have bitrotted, of course, so I'm not saying it's necessarily doing great now. But some of that code was tuned to not need a stack frame etc. > 4) They don't do a final mix. Again, obsessive interest in speed when > you're storing the whole 32 bits and comparing that.
For applications > that use only a few bits of hash index and need the entropy > "concentrated", should this be done outside? I think the caller should do it, yes. There's a difference between "calculate a hash for a string" and "turn that hash into a hashtable lookup". Linus ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 1/2] <linux/hash.h>: Make hash_64(), hash_ptr() return 32 bits 2016-05-02 21:19 ` Linus Torvalds @ 2016-05-02 21:41 ` Linus Torvalds 2016-05-03 1:59 ` George Spelvin 1 sibling, 0 replies; 43+ messages in thread From: Linus Torvalds @ 2016-05-02 21:41 UTC (permalink / raw) To: George Spelvin Cc: Bruce Fields, Eric Dumazet, Jeff Layton, Linux Kernel Mailing List, Linux NFS Mailing List, Rik van Riel, Thomas Gleixner On Mon, May 2, 2016 at 2:19 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > The reason that "hash_len" is a single 64-bit field rather than two > 32-bit fields, for example, is that that way it takes one _register_ > when we do the hash lookup. Some of that code was tuned to inline - and > _not_ inline in particular patterns. Actually, I think the tuning for no stack frame etc was mostly for the permission testing with the selinux AVC code. The filename hashing does have some of it too - like making sure that the call to ->d_hash() is in an unlikely path etc and doesn't pollute the case that actually matters. But I don't think any of it is simple enough to avoid a stack frame. The hash_len thing did make the innermost hash lookup loop smaller, which was noticeable at some point. What I really wanted to do there was actually have a direct-mapped "L1 dentry hash cache", that didn't have a hash loop at all (it would fall back to the slow case for that). I had a patch for that (which worked *beautifully*, partly because it also moved the hot entries to that hash cache and thus effectively moved the active entries to the head of the queue), but I couldn't get the L1 cache update to be coherent without locking, which killed the thing. Anyway, I suspect that your mixing function changes should be fine. That link_path_walk() is important, but a couple of shifts and xors shouldn't kill it. Linus ^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: [PATCH 1/2] <linux/hash.h>: Make hash_64(), hash_ptr() return 32 bits 2016-05-02 21:19 ` Linus Torvalds 2016-05-02 21:41 ` Linus Torvalds @ 2016-05-03 1:59 ` George Spelvin 2016-05-03 3:01 ` Linus Torvalds 1 sibling, 1 reply; 43+ messages in thread From: George Spelvin @ 2016-05-03 1:59 UTC (permalink / raw) To: linux, torvalds Cc: bfields, eric.dumazet, jlayton, linux-kernel, linux-nfs, riel, tglx > Right. But there is no reason to think that that should be the same > thing as the final hash. Your logic is absolutely correct. I agree with you. The two operations have different purposes, and should be thought of differently. HOWEVER, if you can find one operation which is good enough to serve both purposes, that's not evidence of incompetence. I'm not saying hash_str is anything wonderful. Just that it's not a complete disaster. (As a hash function. Your comments about its performance are addressed below.) It's frankly a lot *better* than the hash computed by namei.c. That one *is* perilously close to a complete disaster. The main problem with the multiplicative hash (which describes both hash_str() and hash_name()) is that input bits can only affect higher state bits. The lsbit of the 64-bit state is just the XOR of all the input lsbits. The lsbyte depends only on the lsbytes. But that's not a disaster. In hash_str(), we throw away the lsbits at the end so their imperfect mixing doesn't matter. What's bad is that the msbits of the inputs can only affect the msbits of the hash. The msbits of the inputs are just XORed together into the msbit of the hash. It's fundamentally impossible for such a hash to detect any 2-bit change to the msbits of the input. That said, you're right that most names aren't long enough (16 bytes on little-endian) to *have* a second msbit to worry about. > I refuse to call that shit "word at a time". > > It's "byte at a time with a slow conditional that will screw up your > branch predictor and a multiply in the middle". 
The *hashing* is word at a time. The *fetching* is done byte at a time, and I agree that it doesn't qualify in any way at all. But yeah, your point about the branch predictor penalty being worse than the 7 saved multiplies is well taken.

Rule #1 of performance: Avoid lock contention at any cost.
Rule #2 of performance: Avoid cache misses unless it conflicts with #1.
(Rule 2a: Cache bouncing is a particularly pernicious form of cache missing.)
(Rule 2b: Locking, even uncontended, tends to cause cache bouncing.)
Rule #3 of performance: Avoid unpredictable branches.

>> For example, partial_name_hash() is still used in many places >> even if the word-at-a-time code is used in namei.c. > Those places aren't actually all that performance-critical. They > really don't matter. AFAICT, none of the string-hash functions outside of fs/ are on particularly hot paths. The goal is deleting redundant code. > We'll never get rid of "hash_name()", it not only has that '/' case, > it's also inlined for a reason. You'd copy it without the magic for > '/' and turn that into str_hash() for others to use. > > full_name_hash() can presumably be used pretty much as-is as mem_hash(). That part is obvious. I was just caught between two unpleasant possibilities:

- Placing a str_hash() helper function in the middle of fs/namei.c which nothing in fs/namei.c actually calls, or
- Copying it to some outside file and then having to keep the two in sync.

Thinking about the function comments on each, the first seems less embarrassing to write. > Maybe we could expose that kind of interface, even if it's pretty ugly. Yeah, that's pretty much what I thought. I just hoped you had some brilliant idea for avoiding it. > The thing is, a lot of people tend to optimize performance (and > behavior) for large strings. > > For path components, the most common lengths are less than a single > 8-byte word!
> That "mixing" function almost doesn't matter, because the > case that matters the most (by far) are strings that fit in one or > _maybe_ two words. I'll remember that next time I look up .git/objects/69/466b786e99a0a2d86f0cb99e0f4bb61588d13c :-) But yes, it makes your point about pathname components. The first three are all a single word. I just worry with people having directories full of PHP state cookies. > Hmm? It uses "fold_hash()", which definitely doesn't do that. Mea culpa; that's a side effect of repeatedly grepping the kernel. There was a *different* hash folding function in some other source file elsewhere that did that, and I got the two mixed up in my memory. The word-at-a-time one does "hash_64(x)", which doesn't have that problem. (Maybe it was the IP checksum folding?) >> (If you have a particular cycle count budget in mind, I can come up with >> something.) > The real issue I had in fs/namei.c is that link_path_walk() and > __d_lookup_rcu() are literally _the_ hottest functions in the kernel > under a few (common) loads, and for code generation things like > register pressure ends up mattering. > > The reason that "hash_len" is a single 64-bit field rather than two > 32-bit fields, for example, is that that way it takes one _register_ > when we do the hash lookup. Some of that code was tuned to inline - and > _not_ inline in particular patterns. > > Now, some of that tuning may have bitrotted, of course, so I'm not > saying it's necessarily doing great now. But some of that code was > tuned to not need a stack frame etc. Yes, and it's hard to make a general purpose helper out of code that's tuned to piano wire tension like that. I was thinking about very fast hash functions (2 or 3 cycles per word), and the obvious solution is to use a second register.
As you pointed out above, the mixing function can take advantage of an internal state which is larger than the final state, so the easy way to minimize collisions without very expensive state mixing between words is to give the bits more room to spread out. I was playing with the idea of an ARX structure like the "Speck" block cipher (https://en.wikipedia.org/wiki/Speck_%28cipher%29):

	h1 ^= a;
	h2 ^= h1; h1 = ROTL(h1, K);
	h1 += h2; h2 *= 9;

The "h2 *= 9" replaces "ROTL(h2, 3)" in Speck, achieves a little more mixing, is one cycle on most machines, and is probably supported by more functional units than a general barrel shift. It's only one more cycle on the critical path than the current "hash = (hash + a) * 9"... but it's one more register. > I think the caller should do it, yes. There's a difference between > "calculate a hash for a string" and "turn that hash into a hashtable > lookup". Makes sense, thanks. Another thing worth doing is having a slightly less thorough folding function than hash_64(x, 32). (x >> 32) + (uint32)x * CONSTANT does just as well if you're keeping all 32 bits. This is basically the first half of my 32-bit hash_64(). Then if you want less than 32 bits, you can do a second hash_32() on the result. > Anyway, I suspect that your mixing function changes should be fine. > That link_path_walk() is important, but a couple of shifts and xors > shouldn't kill it. Actually, that code 1) Uses an extra register for a shift temporary anyway, and 2) Has a 6-cycle critical path, which is pretty borderline. The ARX code above seems like a more efficient use of two registers. (It's also conveniently nonlinear, so if we want, feeding a salt into h2 makes it less utterly trivial to force collisions.) I'll play with that for a bit. Thank you for mentioning those critical functions; I'll check the x86 code generation for them to make sure it doesn't get worse. P.S. Here's a way to improve partial_name_hash on x86.
Compare the assembly for

unsigned long
pnh1(unsigned long c, unsigned long prevhash)
{
	return (prevhash + (c << 4) + (c >> 4)) * 11;
}

pnh1:
	movl	%eax, %ecx
	shrl	$4, %eax
	sall	$4, %ecx
	addl	%ecx, %eax
	addl	%eax, %edx
	leal	(%edx,%edx,4), %eax
	leal	(%edx,%eax,2), %eax
	ret

unsigned long
pnh2(unsigned long c, unsigned long prevhash)
{
	prevhash += c <<= 4;
	prevhash += c >> 8;
	return prevhash * 11;
}

pnh2:
	sall	$4, %eax
	addl	%eax, %edx
	shrl	$8, %eax
	addl	%eax, %edx
	leal	(%edx,%edx,4), %eax
	leal	(%edx,%eax,2), %eax
	ret

pnh1 doesn't know that "c" is really only 8 bits, so it must copy it to a second register to compute the two shifted forms. pnh2 shifts left first and derives the second term from the already-shifted value ((c << 4) >> 8 equals the original c >> 4 for byte inputs), so no copy is needed. ^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: [PATCH 1/2] <linux/hash.h>: Make hash_64(), hash_ptr() return 32 bits 2016-05-03 1:59 ` George Spelvin @ 2016-05-03 3:01 ` Linus Torvalds 0 siblings, 0 replies; 43+ messages in thread From: Linus Torvalds @ 2016-05-03 3:01 UTC (permalink / raw) To: George Spelvin Cc: Bruce Fields, Eric Dumazet, Jeff Layton, Linux Kernel Mailing List, Linux NFS Mailing List, Rik van Riel, Thomas Gleixner On Mon, May 2, 2016 at 6:59 PM, George Spelvin <linux@horizon.com> wrote: > > AFAICT, none of the string-hash functions outside of fs/ are > on particularly hot paths. The goal is deleting redundant code. Yes, agreed. >> We'll never get rid of "hash_name()", it not only has that '/' case, >> it's also inlined for a reason. You'd copy it without the magic for >> '/' and turn that into str_hash() for others to use. >> >> full_name_hash() can presumably be used pretty much as-is as mem_hash(). > > That part is obvious. I was just caught between two unpleasant > possibiites: > > - Placing a str_hash() helper function in the middle of fs/namei.c which > nothing in fs/namei.c actually calls, or > - Copying it to some outside file and then having to keep the > two in sync. So I don't think the "keep the two in sync" is necessarily all that problematic. The word-at-a-time logic _used_ to be very specific to the name_hash() code, but it got made generic enough over a few iterations that it's actually a fairly reasonable pattern now. 
So the code loop ends up being really not very complex:

	const struct word_at_a_time constants = WORD_AT_A_TIME_CONSTANTS;

	hash = a = 0;
	len = -sizeof(unsigned long);
	do {
		hash = (hash + a) * 9;
		len += sizeof(unsigned long);
		a = load_unaligned_zeropad(name+len);
		b = a ^ REPEAT_BYTE('/');
	} while (!(has_zero(a, &adata, &constants) | has_zero(b, &bdata, &constants)));

and with your suggested "hash_mix()" function (may I suggest we just make it take both the old hash and the new value as two separate arguments, and it can choose how to mix them), there remains pretty much nothing to keep in sync. Yes, there's the tail part, but that ends up being pretty simple too. The non-path-component case (so checking only for an ending NUL character, not the '/') ends up being exactly the same, except all the 'b' logic goes away, so you end up with

	hash = a = 0;
	len = -sizeof(unsigned long);
	do {
		hash = hash_mix(hash, a);
		len += sizeof(unsigned long);
		a = load_unaligned_zeropad(name+len);
	} while (!has_zero(a, &adata, &constants));

which really seems to not be a pain. Very simple, in fact. There's hardly any code left to keep in sync: any tweaking of the hash would happen by tweaking the hash_mix(). The tail part would presumably be:

	adata = prep_zero_mask(a, adata, &constants);
	mask = create_zero_mask(adata | bdata);
	return hash_mix(hash, a & zero_bytemask(mask));

and you're done (or we could perhaps decide that the last mix is fine just doing a single add? We will have mixed up all previous hash values, so that last "hash_mix()" might just be a simple addition). Yes, if we want to return some mix of the length and the hash, we'd have to play those hashlen games, but I suspect the only case that *really* cares deeply about that is the dentry hash case, and we'd just keep it separate.
In other words, I think just the addition of your "hash_mix()" helper is enough to abstract things out enough that there really is nothing left but tying all those pieces together, and no maintenance burden from having "two copies" of that tying-togetherness. >> For path components, the most common lengths are less than a single >> 8-byte word! That "mixing" function almost doesn't matter, because the >> case that matters the most (by far) are strings that fit in one or >> _maybe_ two words. > > I'll remember that next time I look up > .git/objects/69/466b786e99a0a2d86f0cb99e0f4bb61588d13c > > :-) Yes, they happen, but when people have pathnames like that, your hashing function generally isn't going to much matter. Except you absolutely want to avoid 8-bit and 4-bit boundaries when mixing. The "*9" we have now does that; we had a *11 in an earlier incarnation (but that was coupled with shifting right too - I think our old hash remains in the partial_name_hash()). I do agree that it's not a great hash mixing function, but I don't think it's been particularly problematic either. I did run my whole filesystem through the hash at some point just to verify, and the statistics seemed fairly balanced. But yes, I think your hash_mix() function is likely a noticeable improvement from a hashing standpoint. And while it may not be all that noticeable for path components that are usually pretty short, if we extend this code to be a generic string hash then the whole "three characters is the most common component length" argument goes away ;)

> I was playing with the idea of an ARX structure like the "Speck" block
> cipher (https://en.wikipedia.org/wiki/Speck_%28cipher%29):
>
>	h1 ^= a;
>	h2 ^= h1; h1 = ROTL(h1, K);
>	h1 += h2; h2 *= 9;
>
> The "h2 *= 9" replaces "ROTL(h2, 3)" in Speck, achieves a little more
> mixing, is one cycle on most machines, and is probably supported by
> more functional units than a general barrel shift.
> It's only one more cycle on the critical path than the current
> "hash = (hash + a) * 9"... but it's one more register.

Try it. I wouldn't worry too much about 32-bit x86 any more (we want it to not suck horribly, but it's not the primary target for anybody who cares about best performance). But x86-64 code generation is worth looking at. The register pressure issue is still real, but it's not quite as bad as the old 32-bit code. The other main architectures that it would be good to verify are ok are ARMv8 and powerpc.

> P.S. Here's a way to improve partial_name_hash on x86.
> Compare the assembly for
>
> unsigned long
> pnh1(unsigned long c, unsigned long prevhash)
> {
>	return (prevhash + (c << 4) + (c >> 4)) * 11;
> }
>
> pnh1:
>	movl	%eax, %ecx
>	shrl	$4, %eax
>	sall	$4, %ecx
>	addl	%ecx, %eax
>	addl	%eax, %edx
>	leal	(%edx,%edx,4), %eax
>	leal	(%edx,%eax,2), %eax
>	ret
>
> unsigned long
> pnh2(unsigned long c, unsigned long prevhash)
> {
>	prevhash += c <<= 4;
>	prevhash += c >> 8;
>	return prevhash * 11;
> }
>
> pnh2:
>	sall	$4, %eax
>	addl	%eax, %edx
>	shrl	$8, %eax
>	addl	%eax, %edx
>	leal	(%edx,%edx,4), %eax
>	leal	(%edx,%eax,2), %eax
>	ret
>
> pnh1 doesn't know that "c" is really only 8 bits, so it must copy it
> to a second register to compute the two shifted forms.

Well, if I cared about the partial_name_hash() (which I don't), I'd suggest you just convince the compiler to generate rolb $4,%al instead of two shifts, and just be done with it. You might find that you end up with partial register write stalls, though, so you might have to add a "movzbq %al,%rax" to get around those.
In fact, it looks like you can get gcc to generate that code:

	unsigned long pnh3(unsigned char c, unsigned long prevhash)
	{
		c = (c << 4) | (c >> 4);
		return (prevhash + c)*11;
	}

generates

	pnh3:
		rolb	$4, %dil
		movzbl	%dil, %edi
		addq	%rdi, %rsi
		leaq	(%rsi,%rsi,4), %rax
		leaq	(%rsi,%rax,2), %rax
		ret

which is one instruction less than your pnh2, but should perform even better because I think "movzbl" ends up being done as a rename and mask in the microarchitecture - without any actual ALU costs or added latency. But I suspect it can be a bit fragile to get gcc to actually generate that rotate instruction. I could easily see that being inlined, and then the pattern that turns into a rotate instruction gets perturbed enough that gcc no longer generates the rot.. Linus ^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
@ 2016-04-29 2:57 George Spelvin
2016-04-29 3:16 ` Linus Torvalds
0 siblings, 1 reply; 43+ messages in thread
From: George Spelvin @ 2016-04-29 2:57 UTC (permalink / raw)
To: tglx; +Cc: linux, linux-kernel, torvalds
Thomas Gleixner wrote:
> I'm not a hashing wizard and I completely failed to understand why
> hash_long/ptr are so horrible for the various test cases I ran.
It's very simple: the constants chosen are bit-sparse, *particularly*
in the least significant bits, and only 32/64 bits of the product are
kept. Using the high-word of a double-width multiply is even better,
but some machines (*cough* SPARCv9 *cough*) don't have hardware
support for that.
So what you get is:
(0x9e370001 * (x << 12)) & 0xffffffff
= (0x9e370001 * x & 0xfffff) << 12
= (0x70001 * x & 0xfffff) << 12
*Now* does it make sense?
64 bits is just as bad... 0x9e37fffffffc0001 becomes
0x7fffffffc0001, which is 2^51 - 2^18 + 1.
The challenge is the !CONFIG_ARCH_HAS_FAST_MULTIPLIER case,
when it has to be done with shifts and adds/subtracts.
Now, what's odd is that it's only relevant for 64-bit platforms, and
currently only x86 and POWER7+ have it.
SPARCv9, MIPS64, ARM64, SH64, PPC64, and IA64 all have it turned off.
Is this a bug that should be fixed?
In fact, do *any* 64-bit platforms need multiply emulation?
How many 32-bit platforms need a multiplier that's easy for GCC to
evaluate via shifts and adds?
Generally, by the time you've got a machine grunty enough to
need 64 bits, a multiplier is quite affordable.
Anyway, assuming there exists at least one platform that needs the
shift-and-add sequence, it's quite easy to get a higher hamming weight,
you just have to use a few more registers to save some intermediate
results.
E.g.
u64 x = val, t = val, u;
x <<= 2;
u = x += t; /* val * 5 */
x <<= 4; /* val * 80 */
x -= u; /* val * 75 = 0b1001011 */
Shall I try to come up with something?
Footnote: useful web pages on shift-and-add/subtract multiplication
http://www.vinc17.org/research/mulbyconst/index.en.html
http://www.spiral.net/hardware/multless.html
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism 2016-04-29 2:57 [patch 2/7] lib/hashmod: Add modulo based hash mechanism George Spelvin @ 2016-04-29 3:16 ` Linus Torvalds 2016-04-29 4:12 ` George Spelvin 2016-04-29 23:31 ` George Spelvin 0 siblings, 2 replies; 43+ messages in thread From: Linus Torvalds @ 2016-04-29 3:16 UTC (permalink / raw) To: George Spelvin; +Cc: Thomas Gleixner, Linux Kernel Mailing List On Thu, Apr 28, 2016 at 7:57 PM, George Spelvin <linux@horizon.com> wrote: > > How many 32-bit platforms nead a multiplier that's easy for GCC to > evaluate via shifts and adds? > > Generlly, by the time you've got a machine grunty enough to > need 64 bits, a multiplier is quite affordable. Probably true. That said, the whole "use a multiply to do bit shifts and adds" may be a false economy too. It's a good trick, but it does limit the end result in many ways: you are limited to (a) only left-shifts and (b) only addition and subtraction. The "only left-shifts" means that you will always be in the situation that you'll then need to use the high bits (so you'll always need that shift down). And being limited to just the adder tends to mean that it's harder to get a nice spread of bits - you're basically always going to have that same carry chain. Having looked around at other hashes, I suspect we should look at the ones that do five or six shifts, and a mix of add/sub and xor. And because they shift the bits around more freely you don't have the final shift (that ends up being dependent on the size of the target set). It really would be lovely to hear that we can just replace hash_int/long() with a better hash. And I wouldn't get too hung up on the multiplication trick. I suspect it's not worth it. Linus ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism 2016-04-29 3:16 ` Linus Torvalds @ 2016-04-29 4:12 ` George Spelvin 2016-04-29 23:31 ` George Spelvin 1 sibling, 0 replies; 43+ messages in thread From: George Spelvin @ 2016-04-29 4:12 UTC (permalink / raw) To: linux, torvalds; +Cc: linux-kernel, tglx Linus wrote: > Having looked around at other hashes, I suspect we should look at the > ones that do five or six shifts, and a mix of add/sub and xor. And > because they shift the bits around more freely you don't have the > final shift (that ends up being dependent on the size of the target > set). I'm not sure that final shift is a problem. You need to mask the result to the desired final size somehow, and a shift is no more cycles than an AND. > It really would be lovely to hear that we can just replace > hash_int/long() with a better hash. And I wouldn't get too hung up on > the multiplication trick. I suspect it's not worth it. My main concern is that the scope of the search grows enormously if we include such things. I don't want to discourage someone from looking, but I volunteered to find a better multiplication constant with an efficient add/subtract chain, not start a thesis project on more general hash functions. 
Two places one could look for ideas, though: http://www.burtleburtle.net/bob/hash/integer.html https://gist.github.com/badboy/6267743 Here's Thomas Wang's 64-bit hash, which is reputedly quite good, in case it helps:

uint64_t hash(uint64_t key)
{
	key = ~key + (key << 21);		// key = (key << 21) - key - 1;
	key ^= key >> 24;
	key += (key << 3) + (key << 8);		// key *= 265
	key ^= key >> 14;
	key += (key << 2) + (key << 4);		// key *= 21
	key ^= key >> 28;
	key += key << 31;
	return key;
}

And his slightly shorter 64-to-32-bit function:

unsigned hash(uint64_t key)
{
	key = ~key + (key << 18);	// key = (key << 18) - key - 1;
	key ^= key >> 31;
	key *= 21;			// key += (key << 2) + (key << 4);
	key ^= key >> 11;
	key += key << 6;
	key ^= key >> 22;
	return (uint32_t)key;
}

Sticking to multiplication, using the heuristics in the current comments (prime near the golden ratio, 0x9e3779b9 = 2654435769), I can come up with this for multiplying by 2654435599 = 0x9e37790f:

// -----------------------------------------------------------------------------
// This code was generated by Spiral Multiplier Block Generator, www.spiral.net
// Copyright (c) 2006, Carnegie Mellon University
// All rights reserved.
// The generated code is distributed under a BSD style license
// (see http://www.opensource.org/licenses/bsd-license.php)
// -----------------------------------------------
// Cost: 6 adds/subtracts 6 shifts 0 negations
// Depth: 5
// Input:
//   int t0
// Outputs:
//   int t1 = 2654435599 * t0
// -----------------------------------------------
t3 = shl(t0, 11);	/*       2048 */
t2 = sub(t3, t0);	/*       2047 */
t5 = shl(t2, 8);	/*     524032 */
t4 = sub(t5, t2);	/*     521985 */
t7 = shl(t0, 25);	/*   33554432 */
t6 = add(t4, t7);	/*   34076417 */
t9 = shl(t0, 9);	/*        512 */
t8 = sub(t9, t0);	/*        511 */
t11 = shl(t6, 4);	/*  545222672 */
t10 = sub(t11, t6);	/*  511146255 */
t12 = shl(t8, 22);	/* 2143289344 */
t1 = add(t10, t12);	/* 2654435599 */

Which translates into C as

uint32_t multiply(uint32_t x)
{
	unsigned y = (x << 11) - x;
	y -= y << 8;
	y -= x << 25;
	x -= x << 9;
	y -= y << 4;
	y -= x << 22;
	return y;
}

Unfortunately, that utility bogs like hell on 64-bit constants. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism 2016-04-29 3:16 ` Linus Torvalds 2016-04-29 4:12 ` George Spelvin @ 2016-04-29 23:31 ` George Spelvin 2016-04-30 0:05 ` Linus Torvalds 1 sibling, 1 reply; 43+ messages in thread From: George Spelvin @ 2016-04-29 23:31 UTC (permalink / raw) To: torvalds; +Cc: linux-kernel, linux, tglx On Fri, Apr 29, 2016 at 9:32 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > For example, that _long_ range of bits set ("7fffffffc" in the middle) > is effectively just one bit set with a subtraction. And it's *right* > in that bit area that is supposed to shuffle bits 14-40 to the high bits > (which is what we actually *use*). So it effectively shuffles none of those > bits around at all, and if you have a stride of 4096, you're pretty much > done for. Gee, I recall saying something a lot like that. > 64 bits is just as bad... 0x9e37fffffffc0001 becomes > 0x7fffffffc0001, which is 2^51 - 2^18 + 1. After researching it, I think that the "high bits of a multiply" is in fact a decent way to do such a hash. Interestingly, for a randomly chosen odd multiplier A, the high k bits of the w-bit product A*x form a universal hash function in the cryptographic sense. See section 2.3 of http://arxiv.org/abs/1504.06804 One thing I note is that the advice in the comments to choose a prime number is misquoting Knuth! Knuth says (vol. 3 section 6.4) the number should be *relatively* prime to the word size, which for binary computers simply means odd. When we have a hardware multiplier, keeping the Hamming weight low is a waste of time. When we don't, clever organization can do better than the very naive addition/subtraction chain in the current hash_64(). Multiplying by the 32-bit constant 1640531527 = 0x61c88647 (which is the negative of the golden ratio, so has identical distribution properties) can be done in 6 shifts + adds, with a critical path length of 7 operations (3 shifts + 4 adds).
#define GOLDEN_RATIO_32 0x61c88647	/* phi^2 = 1-phi */

/* Returns x * GOLDEN_RATIO_32 without a hardware multiplier */
unsigned hash_32(unsigned x)
{
	unsigned y, z;			/* Path length */
	y = (x << 19) + x;		/* 1 shift + 1 add */
	z = (x << 9) + y;		/* 1 shift + 2 add */
	x = (x << 23) + z;		/* 1 shift + 3 add */
	z = (z << 8) + y;		/* 2 shift + 3 add */
	x = (x << 6) - x;		/* 2 shift + 4 add */
	return (z << 3) + x;		/* 3 shift + 4 add */
}

Finding a similarly efficient chain for the 64-bit golden ratio 0x9E3779B97F4A7C15 = 11400714819323198485 or 0x61C8864680B583EB = 7046029254386353131 is a bit of a challenge, but algorithms are known. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism 2016-04-29 23:31 ` George Spelvin @ 2016-04-30 0:05 ` Linus Torvalds 2016-04-30 0:32 ` George Spelvin 0 siblings, 1 reply; 43+ messages in thread From: Linus Torvalds @ 2016-04-30 0:05 UTC (permalink / raw) To: George Spelvin; +Cc: Linux Kernel Mailing List, Thomas Gleixner On Fri, Apr 29, 2016 at 4:31 PM, George Spelvin <linux@horizon.com> wrote: > > After researching it, I think that the "high bits of a multiply" is > in fact a decent way to do such a hash. Our emails crossed. Yes. My test harness actually likes the multiplication more than most of the specialized "spread out bits" versions I've found, but only if the constants are better-chosen than the ones we have now. > One thing I note is that the advice in the comments to choose a prime > number is misquoting Knuth! Knuth says (vol. 3 section 6.4) the number > should be *relatively* prime to the word size, which for binary computers > simply means odd. At least for my tests, even that seems to actually be a total non-issue. Yes, odd values *might* be better, but as mentioned in my crossing email, it doesn't actually seem to matter for any case the kernel cares about, since we tend to want to hash down to 10-20 bits of data, so the least significant bit (particularly for the 64-bit case) just doesn't matter all that much. For the 32-bit case I suspect it's more noticeable, since we might be using even half or more of the result. But as mentioned: that least-order bit seems to be a *lot* less important than the mix of the bits in the middle. Because even if your input ends up being all zeroes in the low bits (in that worst-case "page aligned pointers" case that Thomas had), and the hash multiplies by an even number, you'll still end up just dropping all those zero bits anyway. > When we have a hardware multiplier, keeping the Hamming weight low is > a waste of time.
> When we don't, clever organization can do > better than the very naive addition/subtraction chain in the > current hash_64(). Yeah. gcc will actually do the clever stuff for the 32-bit case, afaik. Nothing I know of does it for the 64-bit case, which is why our current hand-written one isn't even the smart one. > To multiply by the 32-bit constant 1640531527 = 0x61c88647 (which is > the negative of the golden ratio, so has identical distribution > properties) can be done in 6 shifts + adds, with a critical path > length of 7 operations (3 shifts + 4 adds). So the reason we don't do this for the 32-bit case is exactly that gcc can already do this. If you can do the same for the 64-bit case, that might be worth it. Linus ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism 2016-04-30 0:05 ` Linus Torvalds @ 2016-04-30 0:32 ` George Spelvin 2016-04-30 1:12 ` Linus Torvalds 0 siblings, 1 reply; 43+ messages in thread From: George Spelvin @ 2016-04-30 0:32 UTC (permalink / raw) To: linux, torvalds; +Cc: linux-kernel, tglx > At least for my tests, even that seems to actually be a total > non-issue. Yes, odd values *might* be better, but as mentioned in my > crossing email, it doesn't actually seem to matter for any case the > kernel cares about, since we tend to want to hash down to 10-20 bits > of data, so the least significant bit (particularly for the 64-bit > case) just doesn't matter all that much. Odd is important. If the multiplier is even, the msbit of the input doesn't affect the hash result at all. x and (x + 0x80000000) hash to the same value, always. That just seems like a crappy hash function. > Yeah. gcc will actually do the clever stuff for the 32-bit case, afaik. It's not as clever as it could be; it just does the same Booth recoding thing, a simple series of shifts with add/subtract. Here's the ARM code that GCC produces (9 instructions, all dependent):

mult1:
	add	r3, r0, r0, lsl #1
	rsb	r3, r0, r3, lsl #5
	add	r3, r3, r3, lsl #4
	rsb	r3, r3, r3, lsl #5
	add	r3, r0, r3, lsl #5
	add	r3, r0, r3, lsl #1
	add	r3, r0, r3, lsl #3
	add	r3, r0, r3, lsl #3
	rsb	r0, r0, r3, lsl #3
	bx	lr

versus the clever code (6 instructions, #4 and #5 could dual-issue):

mult2:
	add	r3, r0, r0, lsl #19
	add	r2, r3, r0, lsl #9
	add	r0, r2, r0, lsl #23
	add	r3, r3, r2, lsl #8
	rsb	r0, r0, r0, lsl #6
	add	r0, r0, r3, lsl #3
	bx	lr

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism 2016-04-30 0:32 ` George Spelvin @ 2016-04-30 1:12 ` Linus Torvalds 2016-04-30 3:04 ` George Spelvin 0 siblings, 1 reply; 43+ messages in thread From: Linus Torvalds @ 2016-04-30 1:12 UTC (permalink / raw) To: George Spelvin; +Cc: Linux Kernel Mailing List, Thomas Gleixner On Fri, Apr 29, 2016 at 5:32 PM, George Spelvin <linux@horizon.com> wrote: > > Odd is important. If the multiplier is even, the msbit of the input > doesn't affect the hash result at all. Fair enough. My test-set was incomplete. >> Yeah. gcc will actually do the clever stuff for the 32-bit case, afaik. > > It's not as clever as it could be; it just does the same Booth > recoding thing, a simple series of shifts with add/subtract. Ahh. I thought gcc did the Bernstein's algorithm thing, which is exponential in the bit size. That would have explained why it only does it for 32-bit constants. Not doing it for 64-bit constants makes no sense if it just uses the trivial Booth's algorithm version. So the odd "we don't do it for 64-bit" is apparently just an oversight, not because gcc does something clever. Oh well. Linus ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism 2016-04-30 1:12 ` Linus Torvalds @ 2016-04-30 3:04 ` George Spelvin 0 siblings, 0 replies; 43+ messages in thread From: George Spelvin @ 2016-04-30 3:04 UTC (permalink / raw) To: linux, torvalds; +Cc: linux-kernel, tglx > Not doing it for 64-bit constants makes no sense if it just uses the > trivial Booth's algorithm version. AFAICT, gcc 5 *does* optimize 64-bit multiplies by constants. Does the belief that it doesn't date back to some really old version? There's still a threshold where it just punts to the multiplier. Some examples, x86-64 (gcc 6.0.1) and aarch64 (gcc 5.3.1). Note the difference in the multiply-by-12345 routine.

return x*9;

mul9:
	leaq	(%rdi,%rdi,8), %rax
	ret

mul9:
	add	x0, x0, x0, lsl 3
	ret

return x*10;

mul10:
	leaq	(%rdi,%rdi,4), %rax
	addq	%rax, %rax
	ret

mul10:
	lsl	x1, x0, 3
	add	x0, x1, x0, lsl 1
	ret

return x*127;

mul127:
	movq	%rdi, %rax
	salq	$7, %rax
	subq	%rdi, %rax
	ret

mul127:
	lsl	x1, x0, 7
	sub	x0, x1, x0
	ret

return x*12345;

mul12345:
	imulq	$12345, %rdi, %rax
	ret

mul12345:
	lsl	x1, x0, 3
	sub	x1, x1, x0
	lsl	x1, x1, 1
	sub	x1, x1, x0
	lsl	x1, x1, 3
	sub	x1, x1, x0
	lsl	x1, x1, 3
	sub	x0, x1, x0
	lsl	x1, x0, 4
	sub	x0, x1, x0
	ret

uint64_t y = (x << 9) - (x << 3) + x;
return x + (x << 14) - (y << 3);

mul12345_manual:
	movq	%rdi, %rax
	movq	%rdi, %rdx
	salq	$14, %rax
	salq	$9, %rdx
	addq	%rdi, %rax
	addq	%rdi, %rdx
	salq	$3, %rdi
	subq	%rdi, %rdx
	salq	$3, %rdx
	subq	%rdx, %rax
	ret

mul12345_manual:
	lsl	x2, x0, 9
	lsl	x1, x0, 14
	add	x2, x2, x0
	add	x1, x1, x0
	sub	x0, x2, x0, lsl 3
	sub	x0, x1, x0, lsl 3
	ret

return x*2654435769;

mul2654435769:
	movl	$2654435769, %eax
	imulq	%rdi, %rax
	ret

mul2654435769:
	mov	x1, 31161
	movk	x1, 0x9e37, lsl 16
	mul	x0, x0, x1
	ret

The problem with variant code paths like mul12345_manual is that the infrastructure required to determine which to use is many times larger than the code itself. :-( ^ permalink raw reply [flat|nested] 43+ messages in thread
* [patch 0/7] futex: Add support for process private hashing @ 2016-04-28 16:42 Thomas Gleixner 2016-04-28 16:42 ` [patch 2/7] lib/hashmod: Add modulo based hash mechanism Thomas Gleixner 0 siblings, 1 reply; 43+ messages in thread From: Thomas Gleixner @ 2016-04-28 16:42 UTC (permalink / raw) To: LKML Cc: Peter Zijlstra, Ingo Molnar, Linus Torvalds, Andrew Morton, Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk, Davidlohr Bueso, Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet The standard futex mechanism in the Linux kernel uses a global hash to store transient state. Collisions on that hash can lead to performance degradation and on real-time enabled kernels to unbounded priority inversions. This new attempt to solve the issue does not require user space changes and operates transparently. On the first futex operation of a process the kernel allocates a hash private to the process. All process private futexes are hashed in this hash. Process shared futexes still use the global hash. For RT applications and pathological use cases a new futex op is provided which allows the application to preallocate and thereby size the process private hash. The series comes with a new 'stupid' hash function based on the good old modulo prime. That function provides way better hash results than hash_ptr/hash_long() for small hash sizes. The last two patches add support to the perf futex-hash benchmark so tests can be run on nodes and the preallocation sizing can be tested. The last patch contains a first update for the futex man page.
Results from our testing in nice colored charts are available here:

perf bench futex-hash run parallel on 4 nodes with global hash and various sized private hashes and various numbers of futexes per thread
https://tglx.de/~tglx/f-ops.png

perf bench futex-hash run parallel on 4 nodes with global hash and various sized private hashes using the new hash_mod() and various numbers of futexes per thread
https://tglx.de/~tglx/f-ops.png

perf bench futex-hash run parallel on 4 nodes with global hash and various sized private hashes using hash_long() and various numbers of futexes per thread
https://tglx.de/~tglx/f-ops-hlong.png

perf bench futex-hash run parallel on 2 nodes with global hash and various sized private hashes and various numbers of futexes per thread
https://tglx.de/~tglx/f-ops-2.png

perf bench futex-hash run parallel on 4 nodes with global hash and various sized private hashes using hash_mod(). 1 futex per thread and various thread numbers.
https://tglx.de/~tglx/f-ops-mod-t.png

perf bench futex-hash run parallel on 4 nodes with global hash and various sized private hashes using hash_long(). 1 futex per thread and various thread numbers.
https://tglx.de/~tglx/f-ops-hlong-t.png

Thanks,

	tglx

----
 Documentation/sysctl/kernel.txt |   17 +++
 b/include/linux/futex_types.h   |   14 ++
 b/lib/hashmod.c                 |   44 ++++++++
 include/linux/futex.h           |   39 +++++--
 include/linux/hash.h            |   28 +++++
 include/linux/mm_types.h        |    4
 include/uapi/linux/futex.h      |    1
 init/Kconfig                    |    5
 kernel/fork.c                   |    3
 kernel/futex.c                  |  219 +++++++++++++++++++++++++++++++++++++++-
 kernel/sysctl.c                 |   21 +++
 lib/Kconfig                     |    3
 lib/Makefile                    |    1
 tools/perf/bench/Build          |    4
 tools/perf/bench/futex-hash.c   |  101 ++++++++++++++++--
 tools/perf/bench/futex.h        |    5

 16 files changed, 486 insertions(+), 23 deletions(-)

^ permalink raw reply [flat|nested] 43+ messages in thread
* [patch 2/7] lib/hashmod: Add modulo based hash mechanism 2016-04-28 16:42 [patch 0/7] futex: Add support for process private hashing Thomas Gleixner @ 2016-04-28 16:42 ` Thomas Gleixner 2016-04-28 18:32 ` Linus Torvalds 0 siblings, 1 reply; 43+ messages in thread From: Thomas Gleixner @ 2016-04-28 16:42 UTC (permalink / raw) To: LKML Cc: Peter Zijlstra, Ingo Molnar, Linus Torvalds, Andrew Morton, Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk, Davidlohr Bueso, Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet [-- Attachment #1: lib-hashmod-Add-modulo-based-hash-mechanism.patch --] [-- Type: text/plain, Size: 3515 bytes --] hash_long/hash_ptr() provide really bad hashing for small hash sizes and for cases where the lower 12 bits (page-size aligned) of the hash value are 0. A simple modulo(prime) based hash function has way better results across a wide range of input values. The implementation uses inverse multiplication instead of a slow division. A futex benchmark shows better results up to a factor of 10 on small hashes.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/hash.h |   28 ++++++++++++++++++++++++++++
 lib/Kconfig          |    3 +++
 lib/Makefile         |    1 +
 lib/hashmod.c        |   44 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 76 insertions(+)
 create mode 100644 lib/hashmod.c

--- a/include/linux/hash.h
+++ b/include/linux/hash.h
@@ -83,4 +83,32 @@ static inline u32 hash32_ptr(const void
 	return (u32)val;
 }

+struct hash_modulo {
+	unsigned int	pmul;
+	unsigned int	prime;
+	unsigned int	mask;
+};
+
+#ifdef CONFIG_HASH_MODULO
+
+int hash_modulo_params(unsigned int hash_bits, struct hash_modulo *hm);
+
+/**
+ * hash_mod - FIXME
+ */
+static inline unsigned int hash_mod(unsigned long addr, struct hash_modulo *hm)
+{
+	u32 a, m;
+
+	if (IS_ENABLED(CONFIG_64BIT)) {
+		a = addr >> 32;
+		a ^= (unsigned int) addr;
+	} else {
+		a = addr;
+	}
+	m = ((u64)a * hm->pmul) >> 32;
+	return (a - m * hm->prime) & hm->mask;
+}
+#endif
+
 #endif /* _LINUX_HASH_H */
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -185,6 +185,9 @@ config CRC8
 	  when they need to do cyclic redundancy check according CRC8
 	  algorithm. Module will be called crc8.
+config HASH_MODULO
+	bool
+
 config AUDIT_GENERIC
 	bool
 	depends on AUDIT && !AUDIT_ARCH
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -97,6 +97,7 @@ obj-$(CONFIG_CRC32)	+= crc32.o
 obj-$(CONFIG_CRC7)	+= crc7.o
 obj-$(CONFIG_LIBCRC32C)	+= libcrc32c.o
 obj-$(CONFIG_CRC8)	+= crc8.o
+obj-$(CONFIG_HASH_MODULO) += hashmod.o
 obj-$(CONFIG_GENERIC_ALLOCATOR) += genalloc.o
 obj-$(CONFIG_842_COMPRESS) += 842/
--- /dev/null
+++ b/lib/hashmod.c
@@ -0,0 +1,44 @@
+/*
+ * Modulo based hash - Global helper functions
+ *
+ * (C) 2016 Linutronix GmbH, Thomas Gleixner
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public Licence version 2 as published by
+ * the Free Software Foundation;
+ */
+
+#include <linux/hash.h>
+#include <linux/errno.h>
+#include <linux/bug.h>
+#include <linux/kernel.h>
+
+#define hash_pmul(prime) ((unsigned int)((1ULL << 32) / prime))
+
+static const struct hash_modulo hash_modulo[] = {
+	{ .prime =    3, .pmul = hash_pmul(3),    .mask = 0x0003 },
+	{ .prime =    7, .pmul = hash_pmul(7),    .mask = 0x0007 },
+	{ .prime =   13, .pmul = hash_pmul(13),   .mask = 0x000f },
+	{ .prime =   31, .pmul = hash_pmul(31),   .mask = 0x001f },
+	{ .prime =   61, .pmul = hash_pmul(61),   .mask = 0x003f },
+	{ .prime =  127, .pmul = hash_pmul(127),  .mask = 0x007f },
+	{ .prime =  251, .pmul = hash_pmul(251),  .mask = 0x00ff },
+	{ .prime =  509, .pmul = hash_pmul(509),  .mask = 0x01ff },
+	{ .prime = 1021, .pmul = hash_pmul(1021), .mask = 0x03ff },
+	{ .prime = 2039, .pmul = hash_pmul(2039), .mask = 0x07ff },
+	{ .prime = 4093, .pmul = hash_pmul(4093), .mask = 0x0fff },
+};
+
+/**
+ * hash_modulo_params - FIXME
+ */
+int hash_modulo_params(unsigned int hash_bits, struct hash_modulo *hm)
+{
+	hash_bits -= 2;
+
+	if (hash_bits >= ARRAY_SIZE(hash_modulo))
+		return -EINVAL;
+
+	*hm = hash_modulo[hash_bits];
+	return 0;
+}

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism 2016-04-28 16:42 ` [patch 2/7] lib/hashmod: Add modulo based hash mechanism Thomas Gleixner @ 2016-04-28 18:32 ` Linus Torvalds 2016-04-28 23:26 ` Thomas Gleixner 2016-04-29 21:10 ` Linus Torvalds 0 siblings, 2 replies; 43+ messages in thread From: Linus Torvalds @ 2016-04-28 18:32 UTC (permalink / raw) To: Thomas Gleixner Cc: LKML, Peter Zijlstra, Ingo Molnar, Andrew Morton, Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk, Davidlohr Bueso, Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet On Thu, Apr 28, 2016 at 9:42 AM, Thomas Gleixner <tglx@linutronix.de> wrote: > hash_long/hash_ptr() provide really bad hashing for small hash sizes and for > cases where the lower 12 bits (Page size aligned) of the hash value are 0. Hmm. hash_long/ptr really shouldn't care about the low bits being zero at all, because it should mix in all the bits (using a prime multiplier and taking the high bits). That said, numbers rule, so clearly we need to do something. It does strike me that we would be better off just trying to improve hash_long(). In particular, there are people and projects that have worked on nothing but hashing. I'm not sure we should add a new hash algorithm even if the whole "modulo prime" sounds obviously fine in theory. For example, your 64-bit code has obvious problems if there are common patterns in the low and the high 32 bits.. Not a problem for something like hash_ptr(), but it can certainly be a problem for other cases. It would be a really good idea to have some real hard numbers on the hashing in general, but _particularly_ so if/when we start adding new ones. Have you tested the modulus version with SMhasher, for example? For example, there's Thomas Wang's hash function which should cascade all the bits. I'd really hate to add *another* ad-hoc hash when the previous ad-hoc hash has been shown to be bad. Linus ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism 2016-04-28 18:32 ` Linus Torvalds @ 2016-04-28 23:26 ` Thomas Gleixner 2016-04-29 2:25 ` Linus Torvalds 2016-04-29 21:10 ` Linus Torvalds 1 sibling, 1 reply; 43+ messages in thread From: Thomas Gleixner @ 2016-04-28 23:26 UTC (permalink / raw) To: Linus Torvalds Cc: LKML, Peter Zijlstra, Ingo Molnar, Andrew Morton, Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk, Davidlohr Bueso, Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet On Thu, 28 Apr 2016, Linus Torvalds wrote: > > I'd really hate to add *another* ad-hoc hash when the previous ad-hoc > hash has been shown to be bad. I completely agree. I'm not a hashing wizard and I completely failed to understand why hash_long/ptr are so horrible for the various test cases I ran. So my ad hoc test was to use the only hash function I truly understand. It was state of the art in my university days :) And surprise, surprise it worked really well. My main focus was really to solve this futex issue which plagues various people and not to dive into hashing theory for a few weeks. I'll try to dig up some time to analyze the hash_long failure unless someone familiar with the problem is beating me to it. Thanks, tglx ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism 2016-04-28 23:26 ` Thomas Gleixner @ 2016-04-29 2:25 ` Linus Torvalds 2016-04-30 13:02 ` Thomas Gleixner 2016-06-12 12:18 ` Sandy Harris 0 siblings, 2 replies; 43+ messages in thread From: Linus Torvalds @ 2016-04-29 2:25 UTC (permalink / raw) To: Thomas Gleixner Cc: LKML, Peter Zijlstra, Ingo Molnar, Andrew Morton, Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk, Davidlohr Bueso, Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet On Thu, Apr 28, 2016 at 4:26 PM, Thomas Gleixner <tglx@linutronix.de> wrote: > > I'll try to dig up some time to analyze the hash_long failure unless someone > familiar with the problem is beating me to it. I'm not sure you need to spend time analyzing failure: if you get bad hashing with hash_long(), then we know that is a bad hash without having to really try to figure out why. It's the hashes that _look_ like they might be good hashes, but there's not a lot of analysis behind it, that I would worry about. The simple prime modulus _should_ be fine, but at the same time I kind of suspect we can do better. Especially since it has two multiplications. Looking around, there's http://burtleburtle.net/bob/hash/integer.html and that 32-bit "full avalanche" hash in six shifts looks like it could be better. You wouldn't want to inline it, but the point of a full avalanche bit mixing _should_ be that you could avoid the whole "upper bits" part, and it should work independently of the target set size. So if that hash works better, it would be a pretty good replacement option for hash_int(). There is also https://gist.github.com/badboy/6267743 that has a 64 bit to 32 bit hash function that might be useful for "hash_long()". Most of the people who worry about hashes tend to have strings to hash, not just a single word like a pointer, but there's clearly people around who have tried to search for good hashes that really spread out the bits. 
Linus ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism 2016-04-29 2:25 ` Linus Torvalds @ 2016-04-30 13:02 ` Thomas Gleixner 2016-04-30 16:45 ` Eric Dumazet 0 siblings, 1 reply; 43+ messages in thread From: Thomas Gleixner @ 2016-04-30 13:02 UTC (permalink / raw) To: Linus Torvalds Cc: LKML, Peter Zijlstra, Ingo Molnar, Andrew Morton, Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk, Davidlohr Bueso, Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet On Thu, 28 Apr 2016, Linus Torvalds wrote: > It's the hashes that _look_ like they might be good hashes, but > there's not a lot of analysis behind it, that I would worry about. The > simple prime modulus _should_ be fine, but at the same time I kind of > suspect we can do better. Especially since it has two multiplications. > > Looking around, there's > > http://burtleburtle.net/bob/hash/integer.html > > and that 32-bit "full avalanche" hash in six shifts looks like it > could be better. You wouldn't want to inline it, but the point of a > full avalanche bit mixing _should_ be that you could avoid the whole > "upper bits" part, and it should work independently of the target set > size. Yes. So I tested those two:

u32 hash_64(u64 key)
{
	key = ~key + (key << 18);
	key ^= key >> 31;
	key += (key << 2) + (key << 4);
	key ^= key >> 11;
	key += key << 6;
	key ^= key >> 22;
	return (u32) key;
}

u32 hash_32(u32 key)
{
	key = (key + 0x7ed55d16) + (key << 12);
	key = (key ^ 0xc761c23c) ^ (key >> 19);
	key = (key + 0x165667b1) + (key << 5);
	key = (key + 0xd3a2646c) ^ (key << 9);
	key = (key + 0xfd7046c5) + (key << 3);
	key = (key ^ 0xb55a4f09) ^ (key >> 16);
	return key;
}

They are really good and the results are similar to the simple modulo prime hash. hash_64 is slightly faster than the modulo prime hash as it does not have the multiplication. I'll send a patch to replace hash_64 and hash_32.
Text size:
	 x86_64  i386  arm
hash_64      88   148  128
hash_32      88    84  112

So probably slightly too large to inline.

Thanks,

	tglx

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism 2016-04-30 13:02 ` Thomas Gleixner @ 2016-04-30 16:45 ` Eric Dumazet 2016-04-30 17:12 ` Linus Torvalds 0 siblings, 1 reply; 43+ messages in thread From: Eric Dumazet @ 2016-04-30 16:45 UTC (permalink / raw) To: Thomas Gleixner Cc: Linus Torvalds, LKML, Peter Zijlstra, Ingo Molnar, Andrew Morton, Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk, Davidlohr Bueso, Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet On Sat, 2016-04-30 at 15:02 +0200, Thomas Gleixner wrote: > Yes. So I tested those two:
>
> u32 hash_64(u64 key)
> {
> 	key = ~key + (key << 18);
> 	key ^= key >> 31;
> 	key += (key << 2) + (key << 4);
> 	key ^= key >> 11;
> 	key += key << 6;
> 	key ^= key >> 22;
> 	return (u32) key;
> }
>
> u32 hash_32(u32 key)
> {
> 	key = (key + 0x7ed55d16) + (key << 12);
> 	key = (key ^ 0xc761c23c) ^ (key >> 19);
> 	key = (key + 0x165667b1) + (key << 5);
> 	key = (key + 0xd3a2646c) ^ (key << 9);
> 	key = (key + 0xfd7046c5) + (key << 3);
> 	key = (key ^ 0xb55a4f09) ^ (key >> 16);
> 	return key;
> }
>
> They are really good and the results are similar to the simple modulo prime > hash. hash_64 is slightly faster than the modulo prime hash as it does not have the > multiplication. > > I'll send a patch to replace hash_64 and hash_32. > > Text size:
> 	 x86_64  i386  arm
> hash_64      88   148  128
> hash_32      88    84  112
> > So probably slightly too large to inline. I use hash_32() in net/sched/sch_fq.c, for all packets sent by Google servers. (Note that I did _not_ use hash_ptr()) That's gazillions of packets per second, and the current multiply worked just fine in term of hash spreading. Are you really going to use something which looks much slower ?
u32 hash_32(u32 key)
{
	key = (key + 0x7ed55d16) + (key << 12);
	key = (key ^ 0xc761c23c) ^ (key >> 19);
	key = (key + 0x165667b1) + (key << 5);
	key = (key + 0xd3a2646c) ^ (key << 9);
	key = (key + 0xfd7046c5) + (key << 3);
	key = (key ^ 0xb55a4f09) ^ (key >> 16);
	return key;
}

Probably having a simple multiply when ARCH_HAS_FAST_MULTIPLIER is defined might be good enough, eventually by choosing a better GOLDEN_RATIO_PRIME_32 ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism 2016-04-30 16:45 ` Eric Dumazet @ 2016-04-30 17:12 ` Linus Torvalds 2016-04-30 17:37 ` Eric Dumazet 0 siblings, 1 reply; 43+ messages in thread From: Linus Torvalds @ 2016-04-30 17:12 UTC (permalink / raw) To: Eric Dumazet Cc: Thomas Gleixner, LKML, Peter Zijlstra, Ingo Molnar, Andrew Morton, Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk, Davidlohr Bueso, Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet On Sat, Apr 30, 2016 at 9:45 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > > I use hash_32() in net/sched/sch_fq.c, for all packets sent by Google > servers. (Note that I did _not_ use hash_ptr()) > > That's gazillions of packets per second, and the current multiply worked > just fine in term of hash spreading. So hash_32() really is much better than hash_64(). I think we'll tweak it a bit, but largely leave it alone. The 64-bit case needs to be tweaked a _lot_. For the 32-bit case, I like the one that George Spelvin suggested:

#define GOLDEN_RATIO_32 0x61c88647	/* phi^2 = 1-phi */

because of his slow multiplier fallback version that we could also use:

/* Returns x * GOLDEN_RATIO_32 without a hardware multiplier */
unsigned hash_32(unsigned x)
{
	unsigned y, z;			/* Path length */
	y = (x << 19) + x;		/* 1 shift + 1 add */
	z = (x << 9) + y;		/* 1 shift + 2 add */
	x = (x << 23) + z;		/* 1 shift + 3 add */
	z = (z << 8) + y;		/* 2 shift + 3 add */
	x = (x << 6) - x;		/* 2 shift + 4 add */
	return (z << 3) + x;		/* 3 shift + 4 add */
}

and I don't think that we really need the several big constants with the fancy "full cascade" function. If you have a test-case for that sch_fq.c case, it might be a good idea to test the above GOLDEN_RATIO_32 value, but quite frankly, I don't see any way it would be materially different from the one we use now. It does avoid that long series of zeroes in the low bits, but that's actually not a huge problem for the 32-bit hash to begin with.
It's not nearly as long a series (or in the wrong bit positions) as
the 64-bit hash multiplier value had.

Also, I suspect that since you hash the kernel "struct sock" pointers,
you actually never get the kinds of really bad patterns that Thomas
had.

But maybe you use hash_32() on a pointer because you noticed that
hash_long() or hash_ptr() (which use hash_64 on 64-bit architectures,
and would have been more natural) gave worse performance? Maybe you
thought that it was the bigger multiply that caused the performance
problems? If you did performance work, I suspect it really could have
been that hash_64() did a bad job for you.

              Linus
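As a sanity check on the fallback sequence quoted above, here is a short Python sketch (editor's check, not from the thread) confirming that the six-step shift-and-add chain really does compute x * 0x61c88647 modulo 2**32:

```python
GOLDEN_RATIO_32 = 0x61C88647
M32 = 0xFFFFFFFF  # mask emulating 32-bit unsigned wraparound

def hash_32_shift_add(x: int) -> int:
    """George Spelvin's shift-and-add sequence, ported from the C above."""
    y = ((x << 19) + x) & M32   # x * (2**19 + 1)
    z = ((x << 9) + y) & M32    # x * (2**19 + 2**9 + 1)
    x = ((x << 23) + z) & M32
    z = ((z << 8) + y) & M32
    x = ((x << 6) - x) & M32    # running value times 63
    return ((z << 3) + x) & M32

# The sequence must agree with a plain multiply for every input.
for v in [0, 1, 2, 0xDEADBEEF, 0xFFFFFFFF, 12345678]:
    assert hash_32_shift_add(v) == (v * GOLDEN_RATIO_32) & M32
```

Expanding the chain symbolically gives a final multiplier of (2^17 + 2^27 + 2^8 + 2^19 + 1) * 2^3 + (2^23 + 2^19 + 2^9 + 1) * 63 = 1640531527 = 0x61c88647, which is why the two agree.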
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-04-30 17:12       ` Linus Torvalds
@ 2016-04-30 17:37         ` Eric Dumazet
  0 siblings, 0 replies; 43+ messages in thread
From: Eric Dumazet @ 2016-04-30 17:37 UTC (permalink / raw)
To: Linus Torvalds
Cc: Thomas Gleixner, LKML, Peter Zijlstra, Ingo Molnar, Andrew Morton,
    Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk,
    Davidlohr Bueso, Chris Mason, Carlos O'Donell, Torvald Riegel,
    Eric Dumazet

On Sat, 2016-04-30 at 10:12 -0700, Linus Torvalds wrote:
> On Sat, Apr 30, 2016 at 9:45 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > I use hash_32() in net/sched/sch_fq.c, for all packets sent by Google
> > servers. (Note that I did _not_ use hash_ptr())
> >
> > That's gazillions of packets per second, and the current multiply worked
> > just fine in terms of hash spreading.
>
> So hash_32() really is much better than hash_64(). I think we'll tweak
> it a bit, but largely leave it alone.
>
> The 64-bit case needs to be tweaked a _lot_.

Agreed ;)

> For the 32-bit case, I like the one that George Spelvin suggested:
>
>    #define GOLDEN_RATIO_32 0x61c88647	/* phi^2 = 1-phi */
>
> because of his slow-multiplier fallback version that we could also use:
>
>    /* Returns x * GOLDEN_RATIO_32 without a hardware multiplier */
>    unsigned hash_32(unsigned x)
>    {
>	unsigned y, z;
>				/* Path length */
>	y = (x << 19) + x;	/* 1 shift + 1 add */
>	z = (x << 9) + y;	/* 1 shift + 2 add */
>	x = (x << 23) + z;	/* 1 shift + 3 add */
>	z = (z << 8) + y;	/* 2 shift + 3 add */
>	x = (x << 6) - x;	/* 2 shift + 4 add */
>	return (z << 3) + x;	/* 3 shift + 4 add */
>    }
>
> and I don't think that we really need the several big constants with
> the fancy "full cascade" function.
>
> If you have a test case for that sch_fq.c case, it might be a good
> idea to test the above GOLDEN_RATIO_32 value, but quite frankly, I
> don't see any way it would be materially different from the one we use
> now.
> It does avoid that long series of zeroes in the low bits, but
> that's actually not a huge problem for the 32-bit hash to begin with.
> It's not nearly as long a series (or in the wrong bit positions) as
> the 64-bit hash multiplier value had.
>
> Also, I suspect that since you hash the kernel "struct sock" pointers,
> you actually never get the kinds of really bad patterns that Thomas
> had.
>
> But maybe you use hash_32() on a pointer because you noticed that
> hash_long() or hash_ptr() (which use hash_64 on 64-bit architectures,
> and would have been more natural) gave worse performance?

Not at all. At the time I did sch_fq (for linux-3.12), hash_64() was
not using a multiply yet, but that long series of shifts/adds/subs. I
used hash_32() because it was simply faster on my servers. You added
the multiply in linux-3.17, and I did not notice at that time.

> Maybe you thought that it was the bigger multiply that caused the
> performance problems? If you did performance work, I suspect it really
> could have been that hash_64() did a bad job for you.

Really not ;) I could test hash_64(); more entropy cannot be bad.
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-04-29  2:25   ` Linus Torvalds
  2016-04-30 13:02     ` Thomas Gleixner
@ 2016-06-12 12:18     ` Sandy Harris
  1 sibling, 0 replies; 43+ messages in thread
From: Sandy Harris @ 2016-06-12 12:18 UTC (permalink / raw)
To: Linus Torvalds
Cc: Thomas Gleixner, LKML, Peter Zijlstra, Ingo Molnar, Andrew Morton,
    Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk,
    Davidlohr Bueso, Chris Mason, Carlos O'Donell, Torvald Riegel,
    Eric Dumazet

On Thu, Apr 28, 2016 at 10:25 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:

> It's the hashes that _look_ like they might be good hashes, but
> there's not a lot of analysis behind it, that I would worry about. The
> simple prime modulus _should_ be fine, but at the same time I kind of
> suspect we can do better. Especially since it has two multiplications.
>
> Looking around, there's
>
>    http://burtleburtle.net/bob/hash/integer.html
>
> and that 32-bit "full avalanche" hash in six shifts looks like it
> could be better. You wouldn't want to inline it, but the point of a
> full avalanche bit mixing _should_ be that you could avoid the whole
> "upper bits" part, and it should work independently of the target set
> size.
>
> So if that hash works better, it would be a pretty good replacement
> option for hash_int().
>
> There is also
>
>    https://gist.github.com/badboy/6267743
>
> that has a 64 bit to 32 bit hash function that might be useful for
> "hash_long()".
>
> Most of the people who worry about hashes tend to have strings to
> hash, not just a single word like a pointer, but there's clearly
> people around who have tried to search for good hashes that really
> spread out the bits.
>
>               Linus

Here's another possibility, from my GPL code at:
https://github.com/sandy-harris/maxwell

Not very efficient -- two each of 32-bit multiply & modulo in most
cases -- but provably good mixing.
/*
 * Quasi-Hadamard transform
 * My own invention
 *
 * Goal is to mix a 32-bit object so that each output bit
 * depends on every input bit
 *
 * Underlying primitive is IDEA multiplication, which mixes
 * a pair of 16-bit objects
 *
 * This is analogous to the pseudo-Hadamard transform (PHT),
 * originally from the SAFER cipher, later in Twofish and others
 *
 * Conceptually, a two-way PHT on a, b is:
 *	x = a + b
 *	y = a + 2b
 *	a = x
 *	b = y
 *
 * This is reversible; it loses no information. Each output
 * word depends on both inputs.
 *
 * A PHT can be implemented as
 *	a += b
 *	b += a
 * which is faster and avoids using intermediate variables.
 *
 * QHT is the same thing using IDEA multiplication instead of
 * addition, calculating a*b and a*b^2 instead of a+b and a+2b.
 *
 * IDEA multiplication operates on 16-bit chunks and makes every
 * output bit depend on all input bits. Therefore QHT is close to
 * an ideal mixer for 32-bit words.
 */

u32 idea(u32 a, u32 b);

u32 qht(u32 x)
{
	u32 a, b;

	a = x >> 16;		/* high 16 bits */
	b = x & 0xffff;		/* low 16 */
	a = idea(a, b);		/* a *= b */
	b = idea(a, b);		/* b *= a */
	return (a << 16) | b;
}

/* IDEA multiplication, borrowed from the IDEA cipher */

#define MAX (1 << 16)
#define MOD (MAX + 1)

u32 idea(u32 a, u32 b)
{
	u32 x;

	/* make sure they are in range */
	a %= MOD;
	b %= MOD;

	/* special cases */
	if ((a == 0) && (b == 0))
		return 1;
	else if (a == 0)
		return MOD - b;
	else if (b == 0)
		return MOD - a;

	/* non-special */
	x = (a * b) % MOD;
	if (x == MAX)
		return 0;
	else
		return x;
}
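For experimenting with the transform outside C, a rough Python port (editor's transcription of the code above; `idea_mul`/`qht` naming is mine) looks like this. Note that 65537 is prime, which is what makes IDEA multiplication a group operation with 0 standing in for 65536:

```python
MAX = 1 << 16
MOD = MAX + 1  # 65537, prime: nonzero residues form a multiplicative group

def idea_mul(a: int, b: int) -> int:
    """IDEA-style multiplication mod 65537, with 0 representing 65536."""
    a %= MOD
    b %= MOD
    if a == 0 and b == 0:
        return 1
    if a == 0:
        return MOD - b
    if b == 0:
        return MOD - a
    x = (a * b) % MOD
    return 0 if x == MAX else x

def qht(x: int) -> int:
    """Quasi-Hadamard transform on a 32-bit word."""
    a = x >> 16        # high 16 bits
    b = x & 0xFFFF     # low 16 bits
    a = idea_mul(a, b)
    b = idea_mul(a, b)
    return ((a << 16) | b) & 0xFFFFFFFF

# 0x00010001 is a fixed point: idea_mul(1, 1) == 1 for both halves.
assert qht(0x00010001) == 0x00010001
```

The fixed point at 0x00010001 mirrors the fact that 1 is the multiplicative identity; as Sandy notes, the cost is two multiplies and two modulo operations per word.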
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-04-28 18:32 ` Linus Torvalds
  2016-04-28 23:26   ` Thomas Gleixner
@ 2016-04-29 21:10   ` Linus Torvalds
  2016-04-29 23:51     ` Linus Torvalds
  2016-04-30 15:22     ` Thomas Gleixner
  1 sibling, 2 replies; 43+ messages in thread
From: Linus Torvalds @ 2016-04-29 21:10 UTC (permalink / raw)
To: Thomas Gleixner
Cc: LKML, Peter Zijlstra, Ingo Molnar, Andrew Morton,
    Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk,
    Davidlohr Bueso, Chris Mason, Carlos O'Donell, Torvald Riegel,
    Eric Dumazet

On Thu, Apr 28, 2016 at 11:32 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> hash_long/ptr really shouldn't care about the low bits being zero at
> all, because it should mix in all the bits (using a prime multiplier
> and taking the high bits).

Looking at this assertion, it doesn't actually pan out very well at all.

The 32-bit hash looks to be ok. Certainly not perfect, but not horrible.

The 64-bit hash seems to be quite horribly bad with lots of values.

I wrote a test harness to check it out (some simple values just spread
out at a fixed stride), and the end results are *so* bad that I'm kind
of worried that I screwed up the test harness. But it gives quite
reasonable values for hash_32() and for the plain modulo case.

Now, the way my test harness works (literally testing a lot of
equal-stride cases), the "modulo prime number" approach will
automatically look perfect. So my test harness is pretty unfair in
that respect.

But the hash_32() function looks good when hashing into 16 bits, for
example. In that case it does a very good job of spreading things out.
When hashing into 17 bits, hash_32 still looks good, except it does
very badly for strides of 32.

It starts doing worse for bigger hash buckets and bigger strides.

But our hash_64() seems to do very badly on pretty much *any* pattern.
To the point where I started to doubt my test program.
But it really looks like that multiplication constant is almost
pessimally chosen.

For example, that _long_ range of bits set ("7fffffffc" in the middle)
is effectively just one bit set with a subtraction. And it's *right*
in that bit area that is supposed to shuffle bits 14-40 to the high
bits (which is what we actually *use*). So it effectively shuffles
none of those bits around at all, and if you have a stride of 4096,
you're pretty much done for.

The 32-bit value is better in this regard, largely thanks to having
that low bit set. Thanks to that, the information in bits around 12-18
will stay in bits 12-18, and because we then only have 32 bits, if the
hash table is large enough, they will still be part of the bits when
we take the high bits. For the 64-bit case, bits 12-18 will never even
be relevant.

So I think that what happens here is that hash_32() is actually
somewhat reasonable. But hash_64() sucks donkey balls when there's a
lot of information in around bits 10-20 (which is exactly where a lot
of pointer bits have the *most* information thanks to alignment
issues).

Picking a new value almost at random (I say "almost", because I just
started with that 32-bit multiplicand value that mostly works and
shifted it up by 32 bits and then randomly added a few more bits to
avoid long ranges of ones and zeroes), I picked

   #define GOLDEN_RATIO_PRIME_64 0x9e3700310c100d01UL

and it is *much* better in my test harness.

Of course, things like that depend on what patterns you test, but I
did have a "range of strides and hash sizes" I tried. So just for fun:
try changing GOLDEN_RATIO_PRIME_64 to that value, and see if the
absolutely _horrid_ page-aligned case goes away for you?

It really looks like those multiplication numbers were very very badly
picked.

Still, that number doesn't do very well if the hash is small (say, 8
bits). But for slightly larger hash tables it seems to be doing much
better.

              Linus
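Linus's harness itself isn't shown, but the page-aligned failure he describes is easy to reproduce in a few lines of Python (editor's toy check, with a dense comparison constant taken from elsewhere in the thread). Because (0x9e37fffffffc0001 << 12) mod 2**64 is just 2**63 minus a small value, 1024 page-aligned inputs hashed into 17 bits collapse into exactly three buckets:

```python
def hash_64(val: int, bits: int, mult: int) -> int:
    # High `bits` of the 64-bit product, mirroring the kernel's hash_64().
    return ((val * mult) & 0xFFFFFFFFFFFFFFFF) >> (64 - bits)

SPARSE = 0x9E37FFFFFFFC0001  # bit-sparse GOLDEN_RATIO_PRIME_64 under discussion
DENSE  = 0x9E3779B97F4A7C16  # fractional golden ratio, dense bit pattern

# 1024 page-aligned "pointers" (stride 4096), hashed into 2**17 buckets.
vals = [i << 12 for i in range(1024)]
sparse_buckets = {hash_64(v, 17, SPARSE) for v in vals}
dense_buckets = {hash_64(v, 17, DENSE) for v in vals}

assert len(sparse_buckets) == 3   # {0, 0x0ffff, 0x1ffff}: almost total collapse
assert len(dense_buckets) > len(sparse_buckets)
```

The three surviving buckets come straight from the parity of the input index: even multiples land just below 2**64, odd ones just below 2**63, so the top 17 bits carry essentially one bit of information.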
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-04-29 21:10   ` Linus Torvalds
@ 2016-04-29 23:51     ` Linus Torvalds
  2016-04-30  1:34       ` Rik van Riel
  2016-05-02  9:39       ` Torvald Riegel
  1 sibling, 2 replies; 43+ messages in thread
From: Linus Torvalds @ 2016-04-29 23:51 UTC (permalink / raw)
To: Thomas Gleixner, Rik van Riel
Cc: LKML, Peter Zijlstra, Ingo Molnar, Andrew Morton,
    Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk,
    Davidlohr Bueso, Chris Mason, Carlos O'Donell, Torvald Riegel,
    Eric Dumazet

On Fri, Apr 29, 2016 at 2:10 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Apr 28, 2016 at 11:32 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> hash_long/ptr really shouldn't care about the low bits being zero at
>> all, because it should mix in all the bits (using a prime multiplier
>> and taking the high bits).
>
> Looking at this assertion, it doesn't actually pan out very well at all.
>
> The 32-bit hash looks to be ok. Certainly not perfect, but not horrible.
>
> The 64-bit hash seems to be quite horribly bad with lots of values.

Ok, I have tried to research this a bit more. There's a lot of
confusion here that causes the fact that hash_64() sucks donkey balls.

The basic method for hashing by multiplication is fairly sane. It's
well-documented, and attributed to Knuth as the comment above it says.

However, there are two problems in there that degrade the hash, and
particularly so for the 64-bit case.

The first is the confusion about multiplying with a prime number. That
actually makes no sense at all, and is in fact entirely wrong. There's
no reason to try to pick a prime number for the multiplication, and
I'm not seeing Knuth having ever suggested that.
Knuth's suggestion is to do the multiplication with a floating point
value A somewhere in between 0 and 1, multiplying the integer with it,
then just taking the fractional part, multiplying it up by 'm' and
taking the floor of that to get a number in the range 0..m.

At no point are primes involved.

And our multiplication really does approximate that - except it's
being done in fixed-point arithmetic (so the thing you multiply with
is basically A*2**64, and the overflow is what gets rid of the
fractional part - then getting the "high bits" of the result is really
just multiplying by a power of two and taking the floor of the
result).

So the first mistake is thinking that the thing you should multiply
with should be prime. The primality matters for when you use a
division to get a modulus, which is presumably where the confusion
came from.

Now, what value 'A' you use doesn't seem to really matter much. Knuth
suggested the fractional part of the golden ratio, but I suspect that
is purely because it's an irrational number that is not near to 0 or
1. You could use anything, although since "random bit pattern" is part
of the issue, irrational numbers are a good starting point. I suspect
that with our patterns, there's actually seldom a good reason to do
lots of high-order bits, so you might as well pick something closer to
0, but whatever - it's only going to matter for the overflow part that
gets thrown away anyway.

The second mistake - and the one that actually causes the real
problem - is to believe that the "bit sparseness" is a good thing.
It's not. It's _very_ much not. If you don't mix the bits well in the
multiplication, you get exactly the problem we hit: certain bit
patterns just will not mix up into the higher order bits.
So if you look at what the actual golden ratio representation *should*
have been:

   #define GOLDEN_RATIO_32 0x9e3779b9
   #define GOLDEN_RATIO_64 0x9e3779b97f4a7c16

and then compare it to the values we actually _use_ (bit-sparse primes
closeish to those values):

   #define GOLDEN_RATIO_PRIME_32 0x9e370001UL
   #define GOLDEN_RATIO_PRIME_64 0x9e37fffffffc0001UL

you start to see the problem. The right set of values have roughly 50%
of the bits set in a random pattern. The wrong set of values do not.

But as far as I can tell, you might as well use the fractional part of
'e' or 'pi' or just make random shit up that has a reasonable bit
distribution.

So we could use the fractional part of the golden ratio (phi):

   0x9e3779b9
   0x9e3779b97f4a7c16

or pi:

   0x243f6a88
   0x243f6a8885a308d3

or e:

   0xb7e15162
   0xb7e151628aed2a6b

or whatever. The constants don't have to be prime. They don't even
have to be odd, because the low bits will always be masked off anyway.

The whole "hash one integer value down to X bits" is very different in
this respect to things like string hashes, where you really do tend to
want primes (because you keep all bits).

But none of those are sparse. I think *some* amount of sparseness
might be ok if it allows people with bad CPUs to do it using
shift-and-adds, it just must not be as sparse as the current number,
the 64-bit one in particular.

There's presumably a few optimal values from a "spread bits out
evenly" standpoint, and they won't have anything to do with random
irrational constants, and will have everything to do with having nice
bit patterns.

I'm adding Rik to the cc, because the original broken constants came
from him long long ago (they go back to 2002, originally only used for
the waitqueue hashing). Maybe he had some other input that caused him
to believe that the primeness actually mattered.

              Linus
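The "right" golden-ratio constants above can be derived with exact integer arithmetic rather than floating point (editor's sketch; `golden_ratio_const` is not a kernel function). Since frac(phi) = (sqrt(5) - 1)/2, the w-bit fixed-point constant is floor(2**(w-1) * (sqrt(5) - 1)), which `math.isqrt` computes exactly:

```python
from math import isqrt

def golden_ratio_const(w: int) -> int:
    # floor(2**w * (sqrt(5) - 1) / 2), with no floating-point error:
    # 2**(w-1) * sqrt(5) == sqrt(5 * 2**(2w-2)), so take isqrt of that.
    return isqrt(5 << (2 * w - 2)) - (1 << (w - 1))

assert golden_ratio_const(32) == 0x9E3779B9
# Truncation gives ...c15; the ...c16 quoted in the thread is the
# round-to-nearest form of the same fraction (it begins 0x...c15f).
assert golden_ratio_const(64) == 0x9E3779B97F4A7C15
```

Doing it this way avoids the double-precision rounding that would otherwise corrupt the low bits of the 64-bit constant.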
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-04-29 23:51     ` Linus Torvalds
@ 2016-04-30  1:34       ` Rik van Riel
  1 sibling, 0 replies; 43+ messages in thread
From: Rik van Riel @ 2016-04-30  1:34 UTC (permalink / raw)
To: Linus Torvalds, Thomas Gleixner
Cc: LKML, Peter Zijlstra, Ingo Molnar, Andrew Morton,
    Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk,
    Davidlohr Bueso, Chris Mason, Carlos O'Donell, Torvald Riegel,
    Eric Dumazet

On Fri, 2016-04-29 at 16:51 -0700, Linus Torvalds wrote:
> There's presumably a few optimal values from a "spread bits out
> evenly" standpoint, and they won't have anything to do with random
> irrational constants, and will have everything to do with having nice
> bit patterns.
>
> I'm adding Rik to the cc, because the original broken constants came
> from him long long ago (they go back to 2002, originally only used
> for the waitqueue hashing). Maybe he had some other input that caused
> him to believe that the primeness actually mattered.

I do not remember where I got that hashing algorithm and magic
constant from 14 years ago, but the changelog suggests I got it from
Chuck Lever's paper.

Chuck Lever's paper does mention that primeness "adds certain
desirable qualities", and I may have read too much into that.

I really do not remember the "bit sparse" thing at all, and have no
idea where that came from.

Googling old email threads about the code mostly makes me wonder "hey,
where did that person go?"

I am all for magic numbers that work better.

-- 
All Rights Reversed.
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-04-29 23:51     ` Linus Torvalds
  2016-04-30  1:34       ` Rik van Riel
@ 2016-05-02  9:39       ` Torvald Riegel
  1 sibling, 0 replies; 43+ messages in thread
From: Torvald Riegel @ 2016-05-02 9:39 UTC (permalink / raw)
To: Linus Torvalds
Cc: Thomas Gleixner, Rik van Riel, LKML, Peter Zijlstra, Ingo Molnar,
    Andrew Morton, Sebastian Andrzej Siewior, Darren Hart,
    Michael Kerrisk, Davidlohr Bueso, Chris Mason, Carlos O'Donell,
    Eric Dumazet

On Fri, 2016-04-29 at 16:51 -0700, Linus Torvalds wrote:
> On Fri, Apr 29, 2016 at 2:10 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > On Thu, Apr 28, 2016 at 11:32 AM, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> >>
> >> hash_long/ptr really shouldn't care about the low bits being zero at
> >> all, because it should mix in all the bits (using a prime multiplier
> >> and taking the high bits).
> >
> > Looking at this assertion, it doesn't actually pan out very well at all.
> >
> > The 32-bit hash looks to be ok. Certainly not perfect, but not horrible.
> >
> > The 64-bit hash seems to be quite horribly bad with lots of values.
>
> Ok, I have tried to research this a bit more. There's a lot of
> confusion here that causes the fact that hash_64() sucks donkey balls.
>
> The basic method for the hashing method by multiplication is fairly
> sane. It's well-documented, and attributed to Knuth as the comment
> above it says.

Section 2.2 of this article might also be of interest:

M. Dietzfelbinger, T. Hagerup, J. Katajainen, and M. Penttonen. A
Reliable Randomized Algorithm for the Closest-Pair Problem. Journal of
Algorithms, 25(1):19-51, 1997.

I don't know much about hash functions, but my understanding of this
is that one can do good hashing of words by multiplying with a random
number and using the most-significant bits of the lower-significant
word of the result.
Different random numbers may work better than others for the input
data (and some might be really awful), but the paper seems to claim
that one *can* find a good random number for all input data.

In practice, this might mean that re-hashing with a different random
number after seeing too many conflicts in a hash table should
eventually lead to a good hashing (i.e., because the *class* of such
hash functions is universal).
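The multiply-shift scheme Torvald describes can be sketched as follows (editor's illustration of the idea, not code from the paper): draw a random odd word-sized multiplier, keep the top bits of the low word of the product, and simply redraw the multiplier if a table degrades.

```python
import random

W = 64  # word size in bits

def multiply_shift(x: int, a: int, bits: int) -> int:
    """Top `bits` of the low W-bit word of a*x (multiply-shift hashing)."""
    return ((a * x) & ((1 << W) - 1)) >> (W - bits)

def random_multiplier() -> int:
    # A random odd W-bit multiplier. The class of all such functions is
    # close to universal, so a bad draw can be fixed by redrawing.
    return random.getrandbits(W) | 1

a = random_multiplier()
table_bits = 10
hashes = [multiply_shift(x, a, table_bits) for x in range(100)]
assert all(0 <= h < 2 ** table_bits for h in hashes)
```

This is exactly the kernel's fixed-point multiplicative hash with the fixed golden-ratio multiplier replaced by a random one; the randomness is what buys the universality guarantee discussed above.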
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-04-29 21:10   ` Linus Torvalds
  2016-04-29 23:51     ` Linus Torvalds
@ 2016-04-30 15:22     ` Thomas Gleixner
  1 sibling, 0 replies; 43+ messages in thread
From: Thomas Gleixner @ 2016-04-30 15:22 UTC (permalink / raw)
To: Linus Torvalds
Cc: LKML, Peter Zijlstra, Ingo Molnar, Andrew Morton,
    Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk,
    Davidlohr Bueso, Chris Mason, Carlos O'Donell, Torvald Riegel,
    Eric Dumazet

On Fri, 29 Apr 2016, Linus Torvalds wrote:
> Picking a new value almost at random (I say "almost", because I just
> started with that 32-bit multiplicand value that mostly works and
> shifted it up by 32 bits and then randomly added a few more bits to
> avoid long ranges of ones and zeroes), I picked
>
>    #define GOLDEN_RATIO_PRIME_64 0x9e3700310c100d01UL
>
> and it is *much* better in my test harness.
>
> Of course, things like that depend on what patterns you test, but I
> did have a "range of strides and hash sizes" I tried. So just for fun:
> try changing GOLDEN_RATIO_PRIME_64 to that value, and see if the
> absolutely _horrid_ page-aligned case goes away for you?

It solves that horrid case:

    https://tglx.de/~tglx/f-ops-h64-t.png

It's faster than the shifts-based version, but the degradation with
hyperthreading is slightly worse. Here, for comparison, the
64-bit -> 32-bit shift version:

    https://tglx.de/~tglx/f-ops-wang32-t.png

FYI, that works way better than the existing shift machinery in
hash_64 and the modulo prime one:

    https://tglx.de/~tglx/f-ops-mod-t.png

> It really looks like those multiplication numbers were very very badly picked.

Indeed.

> Still, that number doesn't do very well if the hash is small (say, 8
> bits).

I'm still waiting for the other test to complete. Will send numbers
later today.

Thanks,

	tglx