* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
@ 2016-04-29 2:57 George Spelvin
2016-04-29 3:16 ` Linus Torvalds
0 siblings, 1 reply; 29+ messages in thread
From: George Spelvin @ 2016-04-29 2:57 UTC (permalink / raw)
To: tglx; +Cc: linux, linux-kernel, torvalds
Thomas Gleixner wrote:
> I'm not a hashing wizard and I completely failed to understand why
> hash_long/ptr are so horrible for the various test cases I ran.
It's very simple: the constants chosen are bit-sparse, *particularly*
in the least significant bits, and only 32/64 bits of the product are
kept. Using the high-word of a double-width multiply is even better,
but some machines (*cough* SPARCv9 *cough*) don't have hardware
support for that.
So what you get is:
(0x9e370001 * (x << 12)) & 0xffffffff
= (0x9e370001 * x & 0xfffff) << 12
= (0x70001 * x & 0xfffff) << 12
*Now* does it make sense?
64 bits is just as bad... 0x9e37fffffffc0001 becomes
0x7fffffffc0001, which is 2^51 - 2^18 + 1.
The challenge is the !CONFIG_ARCH_HAS_FAST_MULTIPLIER case,
when it has to be done with shifts and adds/subtracts.
Now, what's odd is that it's only relevant for 64-bit platforms, and
currently only x86 and POWER7+ have it.
SPARCv9, MIPS64, ARM64, SH64, PPC64, and IA64 all have it turned off.
Is this a bug that should be fixed?
In fact, do *any* 64-bit platforms need multiply emulation?
How many 32-bit platforms nead a multiplier that's easy for GCC to
evaluate via shifts and adds?
Generlly, by the time you've got a machine grunty enough to
need 64 bits, a multiplier is quite affordable.
Anyway, assuming there exists at least one platform that needs the
shift-and-add sequence, it's quite easy to get a higher hamming weight,
you just have to use a few more registers to save some intermediate
results.
E.g.
u64 x = val, t = val, u;
x <<= 2;
u = x += t; /* val * 5 */
x <<= 4; /* val * 80 */
x -= u; /* val * 75 = 0b1001011 */
Shall I try to come up with something?
Footnote: useful web pages on shift-and-add/subtract mutliplciation
http://www.vinc17.org/research/mulbyconst/index.en.html
http://www.spiral.net/hardware/multless.html
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism 2016-04-29 2:57 [patch 2/7] lib/hashmod: Add modulo based hash mechanism George Spelvin @ 2016-04-29 3:16 ` Linus Torvalds 2016-04-29 4:12 ` George Spelvin 2016-04-29 23:31 ` George Spelvin 0 siblings, 2 replies; 29+ messages in thread From: Linus Torvalds @ 2016-04-29 3:16 UTC (permalink / raw) To: George Spelvin; +Cc: Thomas Gleixner, Linux Kernel Mailing List On Thu, Apr 28, 2016 at 7:57 PM, George Spelvin <linux@horizon.com> wrote: > > How many 32-bit platforms nead a multiplier that's easy for GCC to > evaluate via shifts and adds? > > Generlly, by the time you've got a machine grunty enough to > need 64 bits, a multiplier is quite affordable. Probably true. That said, the whole "use a multiply to do bit shifts and adds" may be a false economy too. It's a good trick, but it does limit the end result in many ways: you are limited to (a) only left-shifts and (b) only addition and subtraction. The "only left-shifts" means that you will always be in the situation that you'll then need to use the high bits (so you'll always need that shift down). And being limited to just the adder tends to mean that it's harder to get a nice spread of bits - you're basically always going to have that same carry chain. Having looked around at other hashes, I suspect we should look at the ones that do five or six shifts, and a mix of add/sub and xor. And because they shift the bits around more freely you don't have the final shift (that ends up being dependent on the size of the target set). It really would be lovely to hear that we can just replace hash_int/long() with a better hash. And I wouldn't get too hung up on the multiplication trick. I suspect it's not worth it. Linus ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism 2016-04-29 3:16 ` Linus Torvalds @ 2016-04-29 4:12 ` George Spelvin 2016-04-29 23:31 ` George Spelvin 1 sibling, 0 replies; 29+ messages in thread From: George Spelvin @ 2016-04-29 4:12 UTC (permalink / raw) To: linux, torvalds; +Cc: linux-kernel, tglx Linus wrote: > Having looked around at other hashes, I suspect we should look at the > ones that do five or six shifts, and a mix of add/sub and xor. And > because they shift the bits around more freely you don't have the > final shift (that ends up being dependent on the size of the target > set). I'm not sure that final shift is a problem. You need to mask the result to the desired final size somehow, and a shift is no more cycles than an AND. > It really would be lovely to hear that we can just replace > hash_int/long() with a better hash. And I wouldn't get too hung up on > the multiplication trick. I suspect it's not worth it. My main concern is that the scope of the search grows enormously if we include such things. I don't want to discourage someone from looking, but I volunteered to find a better multiplication constant with an efficient add/subtract chain, not start a thesis project on more general hash functions. Two places one could look for ideas, though: http://www.burtleburtle.net/bob/hash/integer.html https://gist.github.com/badboy/6267743 Here's Thomas Wang's 64-bit hash, which is reputedly quite good, in case it helps: uint64_t hash(uint64_t key) { key = ~key + (key << 21); // key = (key << 21) - key - 1; key ^= key >> 24; key += (key << 3)) + (key << 8); // key *= 265 key ^= key >> 14; key += (key << 2)) + (key << 4); // key *= 21 key ^= key >> 28; key += key << 31; return key; } And his slightly shorter 64-to-32-bit function: unsigned hash(uint64_t key) { key = ~key + (key << 18); // key = (key << 18) - key - 1; key ^= key >> 31; key *= 21; // key += (key << 2)) + (key << 4); key ^= key >> 11; key += key << 6; key ^= key >> 22; return (uint32_t)key; } Sticking to multiplication, using the heuristics in the current comments (prime near golden ratio = 9e3779b9 = 2654435769,) I can come up with this for multiplying by 2654435599 = 0x9e37790f: // ----------------------------------------------------------------------------- // This code was generated by Spiral Multiplier Block Generator, www.spiral.net // Copyright (c) 2006, Carnegie Mellon University // All rights reserved. // The generated code is distributed under a BSD style license // (see http://www.opensource.org/licenses/bsd-license.php) // ----------------------------------------------- // Cost: 6 adds/subtracts 6 shifts 0 negations // Depth: 5 // Input: // int t0 // Outputs: // int t1 = 2654435599 * t0 // ----------------------------------------------- t3 = shl(t0, 11); /* 2048*/ t2 = sub(t3, t0); /* 2047*/ t5 = shl(t2, 8); /* 524032*/ t4 = sub(t5, t2); /* 521985*/ t7 = shl(t0, 25); /* 33554432*/ t6 = add(t4, t7); /* 34076417*/ t9 = shl(t0, 9); /* 512*/ t8 = sub(t9, t0); /* 511*/ t11 = shl(t6, 4); /* 545222672*/ t10 = sub(t11, t6); /* 511146255*/ t12 = shl(t8, 22); /* 2143289344*/ t1 = add(t10, t12); /* 2654435599*/ Which translates into C as uint32_t multiply(uint32_t x) { unsigned y = (x << 11) - x; y -= y << 8; y -= x << 25; x -= x << 9; y -= y << 4; y -= x << 22; return y; } Unfortunately, that utility bogs like hell on 64-bit constants. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism 2016-04-29 3:16 ` Linus Torvalds 2016-04-29 4:12 ` George Spelvin @ 2016-04-29 23:31 ` George Spelvin 2016-04-30 0:05 ` Linus Torvalds 1 sibling, 1 reply; 29+ messages in thread From: George Spelvin @ 2016-04-29 23:31 UTC (permalink / raw) To: torvalds; +Cc: linux-kernel, linux, tglx On Fri, Apr 29, 2016 at 9:32 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: wrote: > For example, that _long_ range of bits set ("7fffffffc" in the middle) > is effectively just one bit set with a subtraction. And it's *right* > in that bit area that is supposed to shuffle bits 14-40 to the high bits > (which is what we actually *use*. So it effectively shuffles none of those > bits around at all, and if you have a stride of 4096, your'e pretty much > done for. Gee, I recall saying something a lot like that. > 64 bits is just as bad... 0x9e37fffffffc0001 becomes > 0x7fffffffc0001, which is 2^51 - 2^18 + 1. After researching it, I think that the "high bits of a multiply" is in fact a decent way to do such a hash. Interestingly, for a randomly chosen odd multiplier A, the high k bits of the w-bit product A*x is a universal hash function in the cryptographic sense. See section 2.3 of http://arxiv.org/abs/1504.06804 One thing I note is that the advice in the comments to choose a prime number is misquoting Knuth! Knuth says (vol. 3 section 6.4) the number should be *relatively* prime to the word size, which for binary computers simply means odd. When we have a hardware multiplier, keeping the Hamming weight low is a waste of time. When we don't, clever organization can do better than the very naive addition/subtraction chain in the current hash_64(). To multiply by the 32-bit constant 1640531527 = 0x61c88647 (which is the negative of the golden ratio, so has identical distribution properties) can be done in 6 shifts + adds, with a critical path length of 7 operations (3 shifts + 4 adds). #define GOLDEN_RATIO_32 0x61c88647 /* phi^2 = 1-phi */ /* Returns x * GOLDEN_RATIO_32 without a hardware multiplier */ unsigned hash_32(unsigned x) { unsigned y, z; /* Path length */ y = (x << 19) + x; /* 1 shift + 1 add */ z = (x << 9) + y; /* 1 shift + 2 add */ x = (x << 23) + z; /* 1 shift + 3 add */ z = (z << 8) + y; /* 2 shift + 3 add */ x = (x << 6) - x; /* 2 shift + 4 add */ return (z << 3) + x; /* 3 shift + 4 add */ } Finding a similarly efficient chain for the 64-bit golden ratio 0x9E3779B97F4A7C15 = 11400714819323198485 or 0x61C8864680B583EB = 7046029254386353131 is a bit of a challenge, but algorithms are known. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism 2016-04-29 23:31 ` George Spelvin @ 2016-04-30 0:05 ` Linus Torvalds 2016-04-30 0:32 ` George Spelvin 0 siblings, 1 reply; 29+ messages in thread From: Linus Torvalds @ 2016-04-30 0:05 UTC (permalink / raw) To: George Spelvin; +Cc: Linux Kernel Mailing List, Thomas Gleixner On Fri, Apr 29, 2016 at 4:31 PM, George Spelvin <linux@horizon.com> wrote: > > After researching it, I think that the "high bits of a multiply" is > in fact a decent way to do such a hash. Our emails crossed. Yes. My test harness actually likes the multiplication more than most of the specialized "spread out bits" versions I've found, but only if the constants are better-chosen than the ones we have now. > One thing I note is that the advice in the comments to choose a prime > number is misquoting Knuth! Knuth says (vol. 3 section 6.4) the number > should be *relatively* prime to the word size, which for binary computers > simply means odd. At least for my tests, even that seems to actually be a total non-issue. Yes, odd values *might* be better, but as mentioned in my crossing email, it doesn't actually seem to matter for any case the kernel cares about, since we tend to want to hash down to 10-20 bits of data, so the least significant bit (particularly for the 64-bit case) just doesn't matter all that much. For the 32-bit case I suspect it's more noticeable, since we might be using even half or more of the result. But as mentioned: that least-order bit seems to be a *lot* less important than the mix of the bits in the middle. Because even if your input ends up being all zeroes in the low bits (it that worst-case "page aligned pointers" case that Thomas had), and the hash multiplies by an even number, you'll still end up just dropping all those zero bits anyway. > When we have a hardware multiplier, keeping the Hamming weight low is > a waste of time. When we don't, clever organization can do > better than the very naive addition/subtraction chain in the > current hash_64(). Yeah. gcc will actually do the clever stuff for the 32-bit case, afaik. Nothing I know of does it for the 64-bit case, which is why our current hand-written one isn't even the smark one. > To multiply by the 32-bit constant 1640531527 = 0x61c88647 (which is > the negative of the golden ratio, so has identical distribution > properties) can be done in 6 shifts + adds, with a critical path > length of 7 operations (3 shifts + 4 adds). So the reason we don't do this for the 32-bit case is exactly that gcc can already do this. If you can do the same for the 64-bit case, that might be worth it. Linus ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism 2016-04-30 0:05 ` Linus Torvalds @ 2016-04-30 0:32 ` George Spelvin 2016-04-30 1:12 ` Linus Torvalds 0 siblings, 1 reply; 29+ messages in thread From: George Spelvin @ 2016-04-30 0:32 UTC (permalink / raw) To: linux, torvalds; +Cc: linux-kernel, tglx > At least for my tests, even that seems to actually be a total > non-issue. Yes, odd values *might* be better, but as mentioned in my > crossing email, it doesn't actually seem to matter for any case the > kernel cares about, since we tend to want to hash down to 10-20 bits > of data, so the least significant bit (particularly for the 64-bit > case) just doesn't matter all that much. Odd is important. If the multiplier is even, the msbit of the input doesn't affect the hash result at all. x and (x + 0x80000000) hash to the same value, always. That just seems like a crappy hash function. > Yeah. gcc will actually do the clever stuff for the 32-bit case, afaik. It's not as clever as it could be; it just does the same Booth recoding thing, a simple series of shifts with add/subtract. Here's the ARM code that GCC produces (9 instructions, all dependent): mult1: add r3, r0, r0, lsl #1 rsb r3, r0, r3, lsl #5 add r3, r3, r3, lsl #4 rsb r3, r3, r3, lsl #5 add r3, r0, r3, lsl #5 add r3, r0, r3, lsl #1 add r3, r0, r3, lsl #3 add r3, r0, r3, lsl #3 rsb r0, r0, r3, lsl #3 bx lr versus the clever code (6 instructions, #4 and #5 could dual-issue): mult2: add r3, r0, r0, lsl #19 add r2, r3, r0, lsl #9 add r0, r2, r0, lsl #23 add r3, r3, r2, lsl #8 rsb r0, r0, r0, lsl #6 add r0, r0, r3, lsl #3 bx lr ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism 2016-04-30 0:32 ` George Spelvin @ 2016-04-30 1:12 ` Linus Torvalds 2016-04-30 3:04 ` George Spelvin 0 siblings, 1 reply; 29+ messages in thread From: Linus Torvalds @ 2016-04-30 1:12 UTC (permalink / raw) To: George Spelvin; +Cc: Linux Kernel Mailing List, Thomas Gleixner On Fri, Apr 29, 2016 at 5:32 PM, George Spelvin <linux@horizon.com> wrote: > > Odd is important. If the multiplier is even, the msbit of the input > doesn't affect the hash result at all. Fair enough. My test-set was incomplete. >> Yeah. gcc will actually do the clever stuff for the 32-bit case, afaik. > > It's not as clever as it could be; it just does the same Booth > recoding thing, a simple series of shifts with add/subtract. Ahh. I thought gcc did the Bernstein's algorithm thing, which is exponential in the bit size. That would have explained why it only does it for 32-bit constants. Not doing it for 64-bit constants makes no sense if it just uses the trivial Booth's algorithm version. So the odd "we don't do it for 64-bit" is apparently just an oversight, not because gcc does something clever. Oh well. Linus ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism 2016-04-30 1:12 ` Linus Torvalds @ 2016-04-30 3:04 ` George Spelvin 0 siblings, 0 replies; 29+ messages in thread From: George Spelvin @ 2016-04-30 3:04 UTC (permalink / raw) To: linux, torvalds; +Cc: linux-kernel, tglx > Not doing it for 64-bit constants makes no sense if it just uses the > trivial Booth's algorithm version. AFAICT, gcc 5 *does* optimize 64-bit multiplies by constants. Does the belief that it doesn't date back to some really old version? There's still a threshold where it just punts to the multiplier. Some examples, x86-64 (gcc 6.0.1) and aarch64 (gcc 5.3.1). Note the difference in the multiply-by-12345 routine. return x*9; mul9: leaq (%rdi,%rdi,8), %rax ret mul9: add x0, x0, x0, lsl 3 ret return x*10; mul10: leaq (%rdi,%rdi,4), %rax addq %rax, %rax ret mul10: lsl x1, x0, 3 add x0, x1, x0, lsl 1 ret return x*127; mul127: movq %rdi, %rax salq $7, %rax subq %rdi, %rax ret mul127: lsl x1, x0, 7 sub x0, x1, x0 ret return x*12345; mul12345: imulq $12345, %rdi, %rax ret mul12345: lsl x1, x0, 3 sub x1, x1, x0 lsl x1, x1, 1 sub x1, x1, x0 lsl x1, x1, 3 sub x1, x1, x0 lsl x1, x1, 3 sub x0, x1, x0 lsl x1, x0, 4 sub x0, x1, x0 ret uint64_t y = (x << 9) - (x << 3) + x; return x + (x << 14) - (y << 3); mul12345_manual: movq %rdi, %rdx salq $14, %rax salq $9, %rdx addq %rdi, %rax addq %rdi, %rdx salq $3, %rdi subq %rdi, %rdx salq $3, %rdx subq %rdx, %rax ret mul12345_manual: lsl x2, x0, 9 lsl x1, x0, 14 add x2, x2, x0 add x1, x1, x0 sub x0, x2, x0, lsl 3 sub x0, x1, x0, lsl 3 ret return x*2654435769: mul2654435769: movl $2654435769, %eax imulq %rdi, %rax ret mul2654435769: mov x1, 31161 movk x1, 0x9e37, lsl 16 mul x0, x0, x1 ret The problem with variant code paths like mul12345_manual is that the infrastructure required to determine which to use is many times larger than the code itself. :-( ^ permalink raw reply [flat|nested] 29+ messages in thread
[parent not found: <CA+55aFxBWfAHQNAdBbdVr+z8ror4GVteyce3D3=vwDWxhu5KqQ@mail.gmail.com>]
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism [not found] <CA+55aFxBWfAHQNAdBbdVr+z8ror4GVteyce3D3=vwDWxhu5KqQ@mail.gmail.com> @ 2016-04-30 20:52 ` George Spelvin 2016-05-01 8:35 ` Thomas Gleixner 0 siblings, 1 reply; 29+ messages in thread From: George Spelvin @ 2016-04-30 20:52 UTC (permalink / raw) To: tglx; +Cc: eric.dumazet, linux, linux-kernel, riel, torvalds Thomas Gleixner wrote: > I'll send a patch to replace hash_64 and hash_32. Before you do that, could we look for a way to tweak the constants in the existing hash? It seems the basic "take the high bits of x * K" algorithm is actually a decent hash function if K is chosen properly, and has a significant speed advantage on machines with half-decent multipliers. I'm researching how to do the multiply with fewer shifts and adds on machines that need it. (Or we could use a totally different function in that case.) You say that > hash64 is slightly faster as the modulo prime as it does not have the > multiplication. Um... are you sure you benchmarked that right? The hash_64 code you used (Thomas Wang's 64->32-bit hash) has a critical path consisting of 6 shifts and 7 adds. I can't believe that's faster than a single multiply. For 1,000,000 iterations on an Ivy Bridge, the multiply is 4x faster (5x if out of line) for me! The constants I recommend are #define GOLDEN_RATIO_64 0x61C8864680B583EBull #define GOLDEN_RATIO_32 0x61C88647 rdtsc times for 1,000,000 iterations of each of the two. (The sum of all hashes is printed to prevent dead code elimination.) hash_64 (sum) * PHI (sum) hash_64 (sum) * PHI (sum) 17552154 a52752df 3431821 2ce5398c 17485381 a52752df 3375535 2ce5398c 17522273 a52752df 3487206 2ce5398c 17551217 a52752df 3374221 2ce5398c 17546242 a52752df 3377355 2ce5398c 17494306 a52752df 3374202 2ce5398c 17495702 a52752df 3409768 2ce5398c 17505839 a52752df 3398205 2ce5398c 17501114 a52752df 3375435 2ce5398c 17539388 a52752df 3374202 2ce5398c And with hash_64 forced inline: 13596945 a52752df 3374931 2ce5398c 13585916 a52752df 3411107 2ce5398c 13564616 a52752df 3374928 2ce5398c 13573465 a52752df 3425160 2ce5398c 13569712 a52752df 3374915 2ce5398c 13580461 a52752df 3397773 2ce5398c 13577481 a52752df 3374912 2ce5398c 13558708 a52752df 3417456 2ce5398c 13569044 a52752df 3374909 2ce5398c 13557193 a52752df 3407912 2ce5398c That's 3.5 cycles vs. 13.5. (I actually have two copies of the inlined code, to show code alignment issues.) On a Phenom, it's worse, 4 cycles vs. 35. 35083119 a52752df 4020754 2ce5398c 35068116 a52752df 4015659 2ce5398c 35074377 a52752df 4000819 2ce5398c 35068735 a52752df 4016943 2ce5398c 35067596 a52752df 4025397 2ce5398c 35074365 a52752df 4000108 2ce5398c 35071050 a52752df 4016190 2ce5398c 35058775 a52752df 4017988 2ce5398c 35055091 a52752df 4000066 2ce5398c 35201158 a52752df 4000094 2ce5398c My simple test code appended for anyone who cares... #include <stdint.h> #include <stdio.h> /* Phi = 0x0.9E3779B97F4A7C15F... */ /* -Phi = 0x0.61C8864680B583EA1... */ #define K 0x61C8864680B583EBull static inline uint32_t hash1(uint64_t key) { key = ~key + (key << 18); key ^= key >> 31; key += (key << 2) + (key << 4); key ^= key >> 11; key += key << 6; key ^= key >> 22; return (uint32_t)key; } static inline uint32_t hash2(uint64_t key) { return (uint32_t)(key * K >> 32); } static inline uint64_t rdtsc(void) { uint32_t lo, hi; asm volatile("rdtsc" : "=a" (lo), "=d" (hi)); return (uint64_t)hi << 32 | lo; } int main(void) { int i, j; uint32_t sum, sums[20]; uint64_t start, times[20]; for (i = 0; i < 20; i += 4) { sum = 0; start = rdtsc(); for (j = 0; j < 1000000; j++) sum += hash1(j+0xdeadbeef); times[i] = rdtsc() - start; sums[i] = sum; sum = 0; start = rdtsc(); for (j = 0; j < 1000000; j++) sum += hash2(j+0xdeadbeef); times[i+1] = rdtsc() - start; sums[i+1] = sum; sum = 0; start = rdtsc(); for (j = 0; j < 1000000; j++) sum += hash1(j+0xdeadbeef); times[i+2] = rdtsc() - start; sums[i+2] = sum; sum = 0; start = rdtsc(); for (j = 0; j < 1000000; j++) sum += hash2(j+0xdeadbeef); times[i+3] = rdtsc() - start; sums[i+3] = sum; } for (i = 0; i < 20; i++) printf(" %llu %08x%c", times[i], sums[i], (~i & 3) ? ' ' : '\n'); return 0; } ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism 2016-04-30 20:52 ` George Spelvin @ 2016-05-01 8:35 ` Thomas Gleixner 2016-05-01 9:43 ` George Spelvin 0 siblings, 1 reply; 29+ messages in thread From: Thomas Gleixner @ 2016-05-01 8:35 UTC (permalink / raw) To: George Spelvin; +Cc: eric.dumazet, linux-kernel, riel, torvalds On Sat, 30 Apr 2016, George Spelvin wrote: > Thomas Gleixner wrote: > You say that > > hash64 is slightly faster as the modulo prime as it does not have the > > multiplication. > > Um... are you sure you benchmarked that right? The hash_64 code you > used (Thomas Wang's 64->32-bit hash) has a critical path consisting of 6 > shifts and 7 adds. I can't believe that's faster than a single multiply. Sorry I did not express myself clear enough. hash64 (the single multiply with the adjusted golden ratio) is slightly faster than the modulo one which has two mutiplications. So here is the list: hash_64(): (key * GOLDEN_RATIO) >> (64 - bits) 31Mio Ops/sec modulo: 28Mio Ops/sec Thomas Wangs 64 -> 32 bit 21Mio Ops/sec Thanks, tglx ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism 2016-05-01 8:35 ` Thomas Gleixner @ 2016-05-01 9:43 ` George Spelvin 2016-05-01 16:51 ` Linus Torvalds 2016-05-02 7:11 ` Thomas Gleixner 0 siblings, 2 replies; 29+ messages in thread From: George Spelvin @ 2016-05-01 9:43 UTC (permalink / raw) To: linux, tglx; +Cc: eric.dumazet, linux-kernel, riel, torvalds > Sorry I did not express myself clear enough. > hash64 (the single multiply with the adjusted golden ratio) is > slightly faster than the modulo one which has two mutiplications. Yes, I figured out that was probably what you were talking about, and benchmarked it similarly. But I noticed a much greater difference. Wang * PHI % 4093 Wang * PHI % 4093 13599486 3494816 5238442 13584166 3460266 5239463 13589552 3466764 5237327 13582381 3422304 5276253 13569409 3407317 5236239 13566738 3393215 5267909 13575257 3413736 5237708 13596428 3379811 5280118 13583607 3540416 5325609 13650964 3380301 5265210 At 3.7 GHz, that's * PHI: 1059 M ops/second * Modulo: 706 M ops/second * Wang: 271 M ops/second Of course, that's a tight loop hashing; I presume your test case has more overhead. Anyway, my recommendation (I'll write the patches if you like) is: * Modify the multiplicative constants to be #define COLDEN_RATIO_32 0x61C88647 #define COLDEN_RATIO_64 0x61C8864680B583EB * Change the prototype of hash_64 to return u32. * Create a separate 32-bit implementation of hash_64() for the BITS_PER_LONG < 64 case. This should not be Wang's or anything similar because 64-bit shifts are slow and awkward on 32-bit machines. One option is something like __jhash_final(), but I think it will suffice to do: static __always_inline u32 hash_64(u64 val, unsigned int bits) { u32 hash = (u32)(val >> 32) * GOLDEN_RATIO_32 + (u32)val; hash *= GOLDEN_RATIO_32; return hash >> (32 - bits); } (S-o-b on that if you want it, of course.) (If you want it out of line, make a helper function that computes the 32-bit hash and have an inline wrapper that does the shifting.) * Eliminate the !CONFIG_ARCH_HAS_FAST_MULTIPLIER code path. Once we've got rid of the 32-bit processors, we're left with 64-bit processors, and they *all* have hardware multipliers. Even the MIPS R4000 (first 64-bit processor ever) and Alpha 21064 had one. (Admittedly, the R4000 was 20 cycles, which is slower than Wang's hash, but 18 of them could overlap other instructions.) The worst multiply support is SPARCv9, which has 64x64-bit multiply with a 64-bit result, but no widening multiply. * If you feel ambitious, add a 32-bit CONFIG_ARCH_HAS_SLOW_MULTIPLIER exception path. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism 2016-05-01 9:43 ` George Spelvin @ 2016-05-01 16:51 ` Linus Torvalds 2016-05-14 3:54 ` George Spelvin 2016-05-02 7:11 ` Thomas Gleixner 1 sibling, 1 reply; 29+ messages in thread From: Linus Torvalds @ 2016-05-01 16:51 UTC (permalink / raw) To: George Spelvin, David Woodhouse, Chris Mason Cc: Thomas Gleixner, Eric Dumazet, Linux Kernel Mailing List, Rik van Riel On Sun, May 1, 2016 at 2:43 AM, George Spelvin <linux@horizon.com> wrote: > > Anyway, my recommendation (I'll write the patches if you like) is: > > * Modify the multiplicative constants to be > #define COLDEN_RATIO_32 0x61C88647 > #define COLDEN_RATIO_64 0x61C8864680B583EB Interestingly, looking at where hash_64() is used, I notice the btrfs raid56 code. And the comment there in rio_bucket(): * we shift down quite a bit. We're using byte * addressing, and most of the lower bits are zeros. * This tends to upset hash_64, and it consistently * returns just one or two different values. * * shifting off the lower bits fixes things. so people had actually noticed the breakage before. Adding DavidW and Chris Mason explicitly to the cc, to perhaps have them verify that fixing the hash_64 constant would make that btrfs hack be unnecessary too.. Chris? Do you have a test-case? Do things end up being ok if you change that COLDEN_RATIO_64 value and remove the ">>16"? > * Change the prototype of hash_64 to return u32. Fair enough. That simplifies things for 32-bit. Not that there are a _lot_ of 32-bit cases, and most of the callers seem to just assign the value directly to an int or just use it as an index, so it's probably not really ever going to change anything, but it might make it easier to write a simpler 32-bit code that uses two explicitly 32-bit fields and mixes them. > * Create a separate 32-bit implementation of hash_64() for the > BITS_PER_LONG < 64 case. This should not be Wang's or anything > similar because 64-bit shifts are slow and awkward on 32-bit > machines. > One option is something like __jhash_final(), but I think > it will suffice to do: > > static __always_inline u32 hash_64(u64 val, unsigned int bits) > { > u32 hash = (u32)(val >> 32) * GOLDEN_RATIO_32 + (u32)val; > hash *= GOLDEN_RATIO_32; > return hash >> (32 - bits); > } > (S-o-b on that if you want it, of course.) Agreed. However, we need to make very certain that nobody wants a value bigger than 32 bits. It all looks fine: the name hash folding does want exactly 32 bits, but that code is only enabled for 64-bit anyway, and it would be ok. But it might be worth double-checking that nobody uses hash_64() for anything else odd. My quick grep didn't show anything, but it was quick. > * Eliminate the !CONFIG_ARCH_HAS_FAST_MULTIPLIER code path. Once we've > got rid of the 32-bit processors, we're left with 64-bit processors, > and they *all* have hardware multipliers. Even the MIPS R4000 (first > 64-bit processor ever) and Alpha 21064 had one. > > (Admittedly, the R4000 was 20 cycles, which is slower than Wang's hash, > but 18 of them could overlap other instructions.) > > The worst multiply support is SPARCv9, which has 64x64-bit > multiply with a 64-bit result, but no widening multiply. Ack. > * If you feel ambitious, add a 32-bit CONFIG_ARCH_HAS_SLOW_MULTIPLIER > exception path. Let's make that a separate worry, and just fix hash_64() first. In particular, that means "let's not touch COLDEN_RATIO_32 yet". I suspect that when we *do* change that value, we do want the non-multiplying version you had. If you wrote out the patch for the hash_64 case as you outlined above, I will definitely apply it. The current hash_64 is clearly broken, and we now have two real life examples (ie futex problems and the btrfs code both). Linus ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism 2016-05-01 16:51 ` Linus Torvalds @ 2016-05-14 3:54 ` George Spelvin 2016-05-14 18:35 ` Linus Torvalds 0 siblings, 1 reply; 29+ messages in thread From: George Spelvin @ 2016-05-14 3:54 UTC (permalink / raw) To: torvalds; +Cc: eric.dumazet, linux, linux-kernel, riel, tglx On May 1, 2016 at 9:51 AM, Lnus Torvalds wrote: > On Sun, May 1, 2016 at 2:43 AM, George Spelvin <linux@horizon.com> wrote: >> * If you feel ambitious, add a 32-bit CONFIG_ARCH_HAS_SLOW_MULTIPLIER >> exception path. > Let's make that a separate worry, and just fix hash_64() first. > > In particular, that means "let's not touch COLDEN_RATIO_32 yet". I > suspect that when we *do* change that value, we do want the > non-multiplying version you had. I've been working on this, and just a brief status update: it's definitely one of those rabbit holes. There are exactly three architectures which (some models) don't have an efficient 32x32->32-bit multiply: - arch/m58k: MC68000 (and 68010 and 68328) no-mmu - arch/h8300: Most (all?) of the H8 processor series - arch/microblaze: Depending on Verilog compilation options The thing is, they all don't have a barrel shifter, either. Indeed, only the m68k even has multi-bit shift instructions. So the upshot is that it's not clear that shift-and-add is a whole lot better. Working out the timing on the 68000, I can beat the multiply code, but not by much. So I'm working on arch-specific solutions for those three cases. H8 and 68000 have 16x16->32-bit multiplies, which can be used to make a reasonable hash function (some H8 models can multiply faster than they can shift!), but if you configure a Microblaze with neither multiplier nor barrel shifter (which arch/microblaze/Kconfig.platform lets you do), I have no idea what to do. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism 2016-05-14 3:54 ` George Spelvin @ 2016-05-14 18:35 ` Linus Torvalds 0 siblings, 0 replies; 29+ messages in thread From: Linus Torvalds @ 2016-05-14 18:35 UTC (permalink / raw) To: George Spelvin Cc: Eric Dumazet, Linux Kernel Mailing List, Rik van Riel, Thomas Gleixner On Fri, May 13, 2016 at 8:54 PM, George Spelvin <linux@horizon.com> wrote: > > There are exactly three architectures which (some models) don't have > an efficient 32x32->32-bit multiply: > > - arch/m58k: MC68000 (and 68010 and 68328) no-mmu > - arch/h8300: Most (all?) of the H8 processor series > - arch/microblaze: Depending on Verilog compilation options I wouldn't worry about it too much. The architectures where performance really matters are x86, ARM and powerpc. The rest need to *work* and not suck horribly, but we're not going to try to do cycle counting for them. It's not worth the pain. If an architecture doesn't have a barrel shifter, it's not going to have fast hash functions. So I'd be ok with just saying "32-bit architectures are going to use a multiply with non-sparse bits". Not a problem. We do want to make sure that hash_64 isn't totally disgusting on 32-bit architectures (but that's a fairly rare case), so we probably do want to have that function fall back on something else than a 64x64 multiply on a 32-bit architecture. Presumably just "mix the two 32-bit words into one, then use hash_32() on that" is good enough. Linus ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism 2016-05-01 9:43 ` George Spelvin 2016-05-01 16:51 ` Linus Torvalds @ 2016-05-02 7:11 ` Thomas Gleixner 1 sibling, 0 replies; 29+ messages in thread From: Thomas Gleixner @ 2016-05-02 7:11 UTC (permalink / raw) To: George Spelvin; +Cc: eric.dumazet, linux-kernel, riel, torvalds On Sun, 1 May 2016, George Spelvin wrote: > But I noticed a much greater difference. > > Wang * PHI % 4093 Wang * PHI % 4093 > 13599486 3494816 5238442 13584166 3460266 5239463 > 13589552 3466764 5237327 13582381 3422304 5276253 > 13569409 3407317 5236239 13566738 3393215 5267909 > 13575257 3413736 5237708 13596428 3379811 5280118 > 13583607 3540416 5325609 13650964 3380301 5265210 > > At 3.7 GHz, that's > > * PHI: 1059 M ops/second > * Modulo: 706 M ops/second > * Wang: 271 M ops/second > > Of course, that's a tight loop hashing; I presume your test case > has more overhead. Indeed. > Anyway, my recommendation (I'll write the patches if you like) is: > > * Modify the multiplicative constants to be > #define COLDEN_RATIO_32 0x61C88647 > #define COLDEN_RATIO_64 0x61C8864680B583EB Works for me. I ran them through my test case and they behaved reasonably well. > * Change the prototype of hash_64 to return u32. Makes sense. > * Create a separate 32-bit implementation of hash_64() for the > BITS_PER_LONG < 64 case. This should not be Wang's or anything > similar because 64-bit shifts are slow and awkward on 32-bit > machines. > One option is something like __jhash_final(), but I think > it will suffice to do: > > static __always_inline u32 hash_64(u64 val, unsigned int bits) > { > u32 hash = (u32)(val >> 32) * GOLDEN_RATIO_32 + (u32)val; > hash *= GOLDEN_RATIO_32; > return hash >> (32 - bits); > } Works. That's more or less the same overhead as the modulo one, which behaved well on 32bit. Thanks, tglx ^ permalink raw reply [flat|nested] 29+ messages in thread
* [patch 0/7] futex: Add support for process private hashing @ 2016-04-28 16:42 Thomas Gleixner 2016-04-28 16:42 ` [patch 2/7] lib/hashmod: Add modulo based hash mechanism Thomas Gleixner 0 siblings, 1 reply; 29+ messages in thread From: Thomas Gleixner @ 2016-04-28 16:42 UTC (permalink / raw) To: LKML Cc: Peter Zijlstra, Ingo Molnar, Linus Torvalds, Andrew Morton, Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk, Davidlohr Bueso, Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet The standard futex mechanism in the Linux kernel uses a global hash to store transient state. Collisions on that hash can lead to performance degradation and on real-time enabled kernels to unbound priority inversions. This new attempt to solve the issue does not require user space changes and operates transparently. On the first futex operation of a process the kernel allocates a hash private to the process. All process private futexes are hashed in this hash. Process shared futexes still use the global hash. For RT applications and pathological use cases a new futex op is provided which allows the application to preallocate and thereby size the process private hash. The series comes with a new 'stupid' hash function based on the good old modulu prime. That function provides way better hash results than hash_ptr/hash_long() for small hash sizes. The last two patches add support to the perf futex-hash benchmark so test can be run on nodes and the preallocation sizing can be tested. The last patch contains a first update for the futex man page. Results from our testing in nice colored charts are available here: perf bench futex-hash run parallel on 4 nodes with global hash and various sized private hashes and various numbers of futexes per thread https://tglx.de/~tglx/f-ops.png perf bench futex-hash run parallel on 4 nodes with global hash and various sized private hashes using the new hash_mod() and various numbers of futexes per thread https://tglx.de/~tglx/f-ops.png perf bench futex-hash run parallel on 4 nodes with global hash and various sized private hashes using hash_long() and various numbers of futexes per thread https://tglx.de/~tglx/f-ops-hlong.png perf bench futex-hash run parallel on 2 nodes with global hash and various sized private hashes and various numbers of futexes per thread https://tglx.de/~tglx/f-ops-2.png perf bench futex-hash run parallel on 4 nodes with global hash and various sized private hashes using hash_mod(). 1 futex per thread and various thread numbers. https://tglx.de/~tglx/f-ops-mod-t.png perf bench futex-hash run parallel on 4 nodes with global hash and various sized private hashes using hash_long(). 1 futex per thread and various thread numbers. https://tglx.de/~tglx/f-ops-hlong-t.png Thanks, tglx ---- Documentation/sysctl/kernel.txt | 17 +++ b/include/linux/futex_types.h | 14 ++ b/lib/hashmod.c | 44 ++++++++ include/linux/futex.h | 39 +++++-- include/linux/hash.h | 28 +++++ include/linux/mm_types.h | 4 include/uapi/linux/futex.h | 1 init/Kconfig | 5 kernel/fork.c | 3 kernel/futex.c | 219 +++++++++++++++++++++++++++++++++++++++- kernel/sysctl.c | 21 +++ lib/Kconfig | 3 lib/Makefile | 1 tools/perf/bench/Build | 4 tools/perf/bench/futex-hash.c | 101 ++++++++++++++++-- tools/perf/bench/futex.h | 5 16 files changed, 486 insertions(+), 23 deletions(-) ^ permalink raw reply [flat|nested] 29+ messages in thread
* [patch 2/7] lib/hashmod: Add modulo based hash mechanism 2016-04-28 16:42 [patch 0/7] futex: Add support for process private hashing Thomas Gleixner @ 2016-04-28 16:42 ` Thomas Gleixner 2016-04-28 18:32 ` Linus Torvalds 0 siblings, 1 reply; 29+ messages in thread From: Thomas Gleixner @ 2016-04-28 16:42 UTC (permalink / raw) To: LKML Cc: Peter Zijlstra, Ingo Molnar, Linus Torvalds, Andrew Morton, Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk, Davidlohr Bueso, Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet [-- Attachment #1: lib-hashmod-Add-modulo-based-hash-mechanism.patch --] [-- Type: text/plain, Size: 3515 bytes --] hash_long/hash_ptr() provide really bad hashing for small hash sizes and for cases where the lower 12 bits (Page size aligned) of the hash value are 0. A simple modulo(prime) based hash function has way better results across a wide range of input values. The implementation uses invers multiplication instead of a slow division. A futex benchmark shows better results up to a factor 10 on small hashes. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- include/linux/hash.h | 28 ++++++++++++++++++++++++++++ lib/Kconfig | 3 +++ lib/Makefile | 1 + lib/hashmod.c | 44 ++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 76 insertions(+) create mode 100644 lib/hashmod.c --- a/include/linux/hash.h +++ b/include/linux/hash.h @@ -83,4 +83,32 @@ static inline u32 hash32_ptr(const void return (u32)val; } +struct hash_modulo { + unsigned int pmul; + unsigned int prime; + unsigned int mask; +}; + +#ifdef CONFIG_HASH_MODULO + +int hash_modulo_params(unsigned int hash_bits, struct hash_modulo *hm); + +/** + * hash_mod - FIXME + */ +static inline unsigned int hash_mod(unsigned long addr, struct hash_modulo *hm) +{ + u32 a, m; + + if (IS_ENABLED(CONFIG_64BIT)) { + a = addr >> 32; + a ^= (unsigned int) addr; + } else { + a = addr; + } + m = ((u64)a * hm->pmul) >> 32; + return (a - m * hm->prime) & hm->mask; +} +#endif + #endif /* _LINUX_HASH_H */ --- a/lib/Kconfig +++ b/lib/Kconfig @@ -185,6 +185,9 @@ config CRC8 when they need to do cyclic redundancy check according CRC8 algorithm. Module will be called crc8. +config HASH_MODULO + bool + config AUDIT_GENERIC bool depends on AUDIT && !AUDIT_ARCH --- a/lib/Makefile +++ b/lib/Makefile @@ -97,6 +97,7 @@ obj-$(CONFIG_CRC32) += crc32.o obj-$(CONFIG_CRC7) += crc7.o obj-$(CONFIG_LIBCRC32C) += libcrc32c.o obj-$(CONFIG_CRC8) += crc8.o +obj-$(CONFIG_HASH_MODULO) += hashmod.o obj-$(CONFIG_GENERIC_ALLOCATOR) += genalloc.o obj-$(CONFIG_842_COMPRESS) += 842/ --- /dev/null +++ b/lib/hashmod.c @@ -0,0 +1,44 @@ +/* + * Modulo based hash - Global helper functions + * + * (C) 2016 Linutronix GmbH, Thomas Gleixner + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public Licence version 2 as published by + * the Free Software Foundation; + */ + +#include <linux/hash.h> +#include <linux/errno,h> +#include <linux/bug.h> +#include <linux/kernel.h> + +#define hash_pmul(prime) ((unsigned int)((1ULL << 32) / prime)) + +static const struct hash_modulo hash_modulo[] = { + { .prime = 3, .pmul = hash_pmul(3), .mask = 0x0003 }, + { .prime = 7, .pmul = hash_pmul(7), .mask = 0x0007 }, + { .prime = 13, .pmul = hash_pmul(13), .mask = 0x000f }, + { .prime = 31, .pmul = hash_pmul(31), .mask = 0x001f }, + { .prime = 61, .pmul = hash_pmul(61), .mask = 0x003f }, + { .prime = 127, .pmul = hash_pmul(127), .mask = 0x007f }, + { .prime = 251, .pmul = hash_pmul(251), .mask = 0x00ff }, + { .prime = 509, .pmul = hash_pmul(509), .mask = 0x01ff }, + { .prime = 1021, .pmul = hash_pmul(1021), .mask = 0x03ff }, + { .prime = 2039, .pmul = hash_pmul(2039), .mask = 0x07ff }, + { .prime = 4093, .pmul = hash_pmul(4093), .mask = 0x0fff }, +}; + +/** + * hash_modulo_params - FIXME + */ +int hash_modulo_params(unsigned int hash_bits, struct hash_modulo *hm) +{ + hash_bits -= 2; + + if (hash_bits >= ARRAY_SIZE(hash_modulo)) + return -EINVAL; + + *hm = hash_modulo[hash_bits]; + return 0; +} ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism 2016-04-28 16:42 ` [patch 2/7] lib/hashmod: Add modulo based hash mechanism Thomas Gleixner @ 2016-04-28 18:32 ` Linus Torvalds 2016-04-28 23:26 ` Thomas Gleixner 2016-04-29 21:10 ` Linus Torvalds 0 siblings, 2 replies; 29+ messages in thread From: Linus Torvalds @ 2016-04-28 18:32 UTC (permalink / raw) To: Thomas Gleixner Cc: LKML, Peter Zijlstra, Ingo Molnar, Andrew Morton, Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk, Davidlohr Bueso, Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet On Thu, Apr 28, 2016 at 9:42 AM, Thomas Gleixner <tglx@linutronix.de> wrote: > hash_long/hash_ptr() provide really bad hashing for small hash sizes and for > cases where the lower 12 bits (Page size aligned) of the hash value are 0. Hmm. hash_long/ptr really shouldn't care about the low bits being zero at all, because it should mix in all the bits (using a prime multiplier and taking the high bits). That said, numbers rule, so clearly we need to do something. It does strike me that we would be better off just trying to improve hash_long(). In particular, there are people and projects that have worked on nothing but hashing. I'm not sure we should add a new hash algorithm even if the whole "modulo prime" sounds obviously fine in theory. For example, your 64-bit code has obvious problems if there are common patterns in the low and the high 32 bits.. Not a problem for something like hash_ptr(), but it can certainly be a problem for other cases. It would be a really good idea to have some real hard numbers on the hashing in general, but _particularly_ so if/when we start adding new ones. Have you tested the modulus version with SMhasher, for example? For example, there's Thomas Wang's hash function which should cascade all the bits. I'd really hate to add *another* ad-hoc hash when the previous ad-hoc hash has been shown to be bad. Linus ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism 2016-04-28 18:32 ` Linus Torvalds @ 2016-04-28 23:26 ` Thomas Gleixner 2016-04-29 2:25 ` Linus Torvalds 2016-04-29 21:10 ` Linus Torvalds 1 sibling, 1 reply; 29+ messages in thread From: Thomas Gleixner @ 2016-04-28 23:26 UTC (permalink / raw) To: Linus Torvalds Cc: LKML, Peter Zijlstra, Ingo Molnar, Andrew Morton, Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk, Davidlohr Bueso, Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet On Thu, 28 Apr 2016, Linus Torvalds wrote: > > I'd really hate to add *another* ad-hoc hash when the previous ad-hoc > hash has been shown to be bad. I completely agree. I'm not a hashing wizard and I completely failed to understand why hash_long/ptr are so horrible for the various test cases I ran. So my ad hoc test was to use the only hash function I truly understand. It was state of the art in my university days :) And surprise, surprise it worked really well. My main focus was really to solve this futex issue which plages various people and not to dive into hashing theory for a few weeks. I'll try to dig up some time to analyze the hash_long failure unless someone familiar with the problem is beating me to it. Thanks, tglx ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism 2016-04-28 23:26 ` Thomas Gleixner @ 2016-04-29 2:25 ` Linus Torvalds 2016-04-30 13:02 ` Thomas Gleixner 2016-06-12 12:18 ` Sandy Harris 0 siblings, 2 replies; 29+ messages in thread From: Linus Torvalds @ 2016-04-29 2:25 UTC (permalink / raw) To: Thomas Gleixner Cc: LKML, Peter Zijlstra, Ingo Molnar, Andrew Morton, Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk, Davidlohr Bueso, Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet On Thu, Apr 28, 2016 at 4:26 PM, Thomas Gleixner <tglx@linutronix.de> wrote: > > I'll try to dig up some time to analyze the hash_long failure unless someone > familiar with the problem is beating me to it. I'm not sure you need to spend time analyzing failure: if you get bad hashing with hash_long(), then we know that is a bad hash without having to really try to figure out why. It's the hashes that _look_ like they might be good hashes, but there's not a lot of analysis behind it, that I would worry about. The simple prime modulus _should_ be fine, but at the same time I kind of suspect we can do better. Especially since it has two multiplications. Looking around, there's http://burtleburtle.net/bob/hash/integer.html and that 32-bit "full avalanche" hash in six shifts looks like it could be better. You wouldn't want to inline it, but the point of a full avalanche bit mixing _should_ be that you could avoid the whole "upper bits" part, and it should work independently of the target set size. So if that hash works better, it would be a pretty good replacement option for hash_int(). There is also https://gist.github.com/badboy/6267743 that has a 64 bit to 32 bit hash function that might be useful for "hash_long()". Most of the people who worry about hashes tend to have strings to hash, not just a single word like a pointer, but there's clearly people around who have tried to search for good hashes that really spread out the bits. Linus ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism 2016-04-29 2:25 ` Linus Torvalds @ 2016-04-30 13:02 ` Thomas Gleixner 2016-04-30 16:45 ` Eric Dumazet 2016-06-12 12:18 ` Sandy Harris 1 sibling, 1 reply; 29+ messages in thread From: Thomas Gleixner @ 2016-04-30 13:02 UTC (permalink / raw) To: Linus Torvalds Cc: LKML, Peter Zijlstra, Ingo Molnar, Andrew Morton, Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk, Davidlohr Bueso, Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet On Thu, 28 Apr 2016, Linus Torvalds wrote: > It's the hashes that _look_ like they might be good hashes, but > there's not a lot of analysis behind it, that I would worry about. The > simple prime modulus _should_ be fine, but at the same time I kind of > suspect we can do better. Especially since it has two multiplications. > > Looking around, there's > > http://burtleburtle.net/bob/hash/integer.html > > and that 32-bit "full avalanche" hash in six shifts looks like it > could be better. You wouldn't want to inline it, but the point of a > full avalanche bit mixing _should_ be that you could avoid the whole > "upper bits" part, and it should work independently of the target set > size. Yes. So I tested those two: u32 hash_64(u64 key) { key = ~key + (key << 18); key ^= key >> 31; key += (key << 2)) + (key << 4); key ^= key >> 11; key += key << 6; key ^= key >> 22; return (u32) key; } u32 hash_32(u32 key) { key = (key + 0x7ed55d16) + (key << 12); key = (key ^ 0xc761c23c) ^ (key >> 19); key = (key + 0x165667b1) + (key << 5); key = (key + 0xd3a2646c) ^ (key << 9); key = (key + 0xfd7046c5) + (key << 3); key = (key ^ 0xb55a4f09) ^ (key >> 16); return key; } They are really good and the results are similar to the simple modulo prime hash. hash64 is slightly faster as the modulo prime as it does not have the multiplication. I'll send a patch to replace hash_64 and hash_32. Text size: x86_64 i386 arm hash_64 88 148 128 hash_32 88 84 112 So probably slightly too large to inline. Thanks, tglx ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism 2016-04-30 13:02 ` Thomas Gleixner @ 2016-04-30 16:45 ` Eric Dumazet 2016-04-30 17:12 ` Linus Torvalds 0 siblings, 1 reply; 29+ messages in thread From: Eric Dumazet @ 2016-04-30 16:45 UTC (permalink / raw) To: Thomas Gleixner Cc: Linus Torvalds, LKML, Peter Zijlstra, Ingo Molnar, Andrew Morton, Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk, Davidlohr Bueso, Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet On Sat, 2016-04-30 at 15:02 +0200, Thomas Gleixner wrote: > Yes. So I tested those two: > > u32 hash_64(u64 key) > { > key = ~key + (key << 18); > key ^= key >> 31; > key += (key << 2)) + (key << 4); > key ^= key >> 11; > key += key << 6; > key ^= key >> 22; > return (u32) key; > } > > u32 hash_32(u32 key) > { > key = (key + 0x7ed55d16) + (key << 12); > key = (key ^ 0xc761c23c) ^ (key >> 19); > key = (key + 0x165667b1) + (key << 5); > key = (key + 0xd3a2646c) ^ (key << 9); > key = (key + 0xfd7046c5) + (key << 3); > key = (key ^ 0xb55a4f09) ^ (key >> 16); > return key; > } > > They are really good and the results are similar to the simple modulo prime > hash. hash64 is slightly faster as the modulo prime as it does not have the > multiplication. > > I'll send a patch to replace hash_64 and hash_32. > > Text size: > x86_64 i386 arm > hash_64 88 148 128 > hash_32 88 84 112 > > So probably slightly too large to inline. I use hash_32() in net/sched/sch_fq.c, for all packets sent by Google servers. (Note that I did _not_ use hash_ptr()) That's gazillions of packets per second, and the current multiply worked just fine in term of hash spreading. Are you really going to use something which looks much slower ? u32 hash_32(u32 key) { key = (key + 0x7ed55d16) + (key << 12); key = (key ^ 0xc761c23c) ^ (key >> 19); key = (key + 0x165667b1) + (key << 5); key = (key + 0xd3a2646c) ^ (key << 9); key = (key + 0xfd7046c5) + (key << 3); key = (key ^ 0xb55a4f09) ^ (key >> 16); return key; } Probably having a simple multiple when ARCH_HAS_FAST_MULTIPLIER is defined might be good enough, eventually by choosing a better GOLDEN_RATIO_PRIME_32 ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism 2016-04-30 16:45 ` Eric Dumazet @ 2016-04-30 17:12 ` Linus Torvalds 2016-04-30 17:37 ` Eric Dumazet 0 siblings, 1 reply; 29+ messages in thread From: Linus Torvalds @ 2016-04-30 17:12 UTC (permalink / raw) To: Eric Dumazet Cc: Thomas Gleixner, LKML, Peter Zijlstra, Ingo Molnar, Andrew Morton, Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk, Davidlohr Bueso, Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet On Sat, Apr 30, 2016 at 9:45 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > > I use hash_32() in net/sched/sch_fq.c, for all packets sent by Google > servers. (Note that I did _not_ use hash_ptr()) > > That's gazillions of packets per second, and the current multiply worked > just fine in term of hash spreading. So hash_32() really is much better than hash_64(). I think we'll tweak it a bit, but largely leave it alone. The 64-bit case needs to be tweaked a _lot_. For the 32-bit case, I like the one that George Spelvin suggested: #define GOLDEN_RATIO_32 0x61c88647 /* phi^2 = 1-phi */ because of his slow multiplier fallback version that we could also use: /* Returns x * GOLDEN_RATIO_32 without a hardware multiplier */ unsigned hash_32(unsigned x) { unsigned y, z; /* Path length */ y = (x << 19) + x; /* 1 shift + 1 add */ z = (x << 9) + y; /* 1 shift + 2 add */ x = (x << 23) + z; /* 1 shift + 3 add */ z = (z << 8) + y; /* 2 shift + 3 add */ x = (x << 6) - x; /* 2 shift + 4 add */ return (z << 3) + x; /* 3 shift + 4 add */ } and I don't think that we really need the several big constants with the fancy "full cascade" function. If you have a test-case for that sch_fq.c case, it might be a good idea to test the above GOLDEN_RATIO_32 value, but quite frankly, I don't see any way it would be materially different from the one we use now. It does avoid that long series of zeroes in the low bits, but that's actually not a huge problem for the 32-bit hash to begin with. It's not nearly as long a series (or in the wrong bit positions) as the 64-bit hash multiplier value had. Also, I suspect that since you hash the kernel "struct sock" pointers, you actually never get the kinds of really bad patterns that Thomas had. But maybe you use hash_32() on a pointer because you noticed that hash_long() or hash_ptr() (which use hash_64 on 64-bit architectures, and would have been more natural) gave worse performance? Maybe you thought that it was the bigger multiply that caused the performance problems? If you did performance work, I suspect it really could have been that hash_64() did a bad job for you. Linus ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism 2016-04-30 17:12 ` Linus Torvalds @ 2016-04-30 17:37 ` Eric Dumazet 0 siblings, 0 replies; 29+ messages in thread From: Eric Dumazet @ 2016-04-30 17:37 UTC (permalink / raw) To: Linus Torvalds Cc: Thomas Gleixner, LKML, Peter Zijlstra, Ingo Molnar, Andrew Morton, Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk, Davidlohr Bueso, Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet On Sat, 2016-04-30 at 10:12 -0700, Linus Torvalds wrote: > On Sat, Apr 30, 2016 at 9:45 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > > > > I use hash_32() in net/sched/sch_fq.c, for all packets sent by Google > > servers. (Note that I did _not_ use hash_ptr()) > > > > That's gazillions of packets per second, and the current multiply worked > > just fine in term of hash spreading. > > So hash_32() really is much better than hash_64(). I think we'll tweak > it a bit, but largely leave it alone. > > The 64-bit case needs to be tweaked a _lot_. Agreed ;) > > For the 32-bit case, I like the one that George Spelvin suggested: > > #define GOLDEN_RATIO_32 0x61c88647 /* phi^2 = 1-phi */ > > because of his slow multiplier fallback version that we could also use: > > /* Returns x * GOLDEN_RATIO_32 without a hardware multiplier */ > unsigned hash_32(unsigned x) > { > unsigned y, z; > /* Path length */ > y = (x << 19) + x; /* 1 shift + 1 add */ > z = (x << 9) + y; /* 1 shift + 2 add */ > x = (x << 23) + z; /* 1 shift + 3 add */ > z = (z << 8) + y; /* 2 shift + 3 add */ > x = (x << 6) - x; /* 2 shift + 4 add */ > return (z << 3) + x; /* 3 shift + 4 add */ > } > > and I don't think that we really need the several big constants with > the fancy "full cascade" function. > > If you have a test-case for that sch_fq.c case, it might be a good > idea to test the above GOLDEN_RATIO_32 value, but quite frankly, I > don't see any way it would be materially different from the one we use > now. It does avoid that long series of zeroes in the low bits, but > that's actually not a huge problem for the 32-bit hash to begin with. > It's not nearly as long a series (or in the wrong bit positions) as > the 64-bit hash multiplier value had. > > Also, I suspect that since you hash the kernel "struct sock" pointers, > you actually never get the kinds of really bad patterns that Thomas > had. > > But maybe you use hash_32() on a pointer because you noticed that > hash_long() or hash_ptr() (which use hash_64 on 64-bit architectures, > and would have been more natural) gave worse performance? Not at all. At the time I did sch_fq (for linux-3.12), hash_64() was not using a multiply yet, but this long series of shifts/add/sub I used hash_32() because it was simply faster on my servers. You added this multiply in linux-3.17, and I did not noticed at that time. > > Maybe you thought that it was the bigger multiply that caused the > performance problems? If you did performance work, I suspect it really > could have been that hash_64() did a bad job for you. Really not ;) I could test hash_64(), more entropy can not be bad. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism 2016-04-29 2:25 ` Linus Torvalds 2016-04-30 13:02 ` Thomas Gleixner @ 2016-06-12 12:18 ` Sandy Harris 1 sibling, 0 replies; 29+ messages in thread From: Sandy Harris @ 2016-06-12 12:18 UTC (permalink / raw) To: Linus Torvalds Cc: Thomas Gleixner, LKML, Peter Zijlstra, Ingo Molnar, Andrew Morton, Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk, Davidlohr Bueso, Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet On Thu, Apr 28, 2016 at 10:25 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > It's the hashes that _look_ like they might be good hashes, but > there's not a lot of analysis behind it, that I would worry about. The > simple prime modulus _should_ be fine, but at the same time I kind of > suspect we can do better. Especially since it has two multiplications. > > Looking around, there's > > http://burtleburtle.net/bob/hash/integer.html > > and that 32-bit "full avalanche" hash in six shifts looks like it > could be better. You wouldn't want to inline it, but the point of a > full avalanche bit mixing _should_ be that you could avoid the whole > "upper bits" part, and it should work independently of the target set > size. > > So if that hash works better, it would be a pretty good replacement > option for hash_int(). > > There is also > > https://gist.github.com/badboy/6267743 > > that has a 64 bit to 32 bit hash function that might be useful for > "hash_long()". > > Most of the people who worry about hashes tend to have strings to > hash, not just a single word like a pointer, but there's clearly > people around who have tried to search for good hashes that really > spread out the bits. > > Linus Here's another possibility, from my GPL code at: https://github.com/sandy-harris/maxwell Not very efficient -- two each of 32-bit multiply & modulo in most cases -- but provably good mixing. /* Quasi-Hadamard transform My own invention Goal is to mix a 32-bit object so that each output bit depends on every input bit Underlying primitive is IDEA multiplication which mixes a pair of 16-bit objects This is analogous to the pseudo-Hadamard transform (PHT) originally from the SAFER cipher, later in Twofish and others Conceptually, a two-way PHT on a,b is: x = a + b y = a + 2b a = x b = y This is reversible; it loses no information. Each output word depends on both inputs. A PHT can be implemented as a += b b += a which is faster and avoids using intermediate variables QHT is the same thing using IDEA multiplication instead of addition, calculating a*b and a*b^2 instead of a+b and a+2b IDEA multiplication operates on 16-bit chunks and makes every output bit depend on all input bits. Therefore QHT is close to an ideal mixer for 32-bit words. */ u32 qht(u32 x) { u32 a, b ; a = x >> 16 ; // high 16 bits b = x & 0xffff ; // low 16 a = idea(a,b) ; // a *= b b = idea(a,b) ; // b *= a return( (a<<16) | b) ; } /* IDEA multiplication borrowed from the IDEA cipher */ #define MAX (1<<16) #define MOD (MAX+1) u32 idea(u32 a, u32 b) { u32 x ; // make sure they are in range a %= MOD ; b %= MOD ; // special cases if( (a == 0) && (b == 0)) return(1) ; else if( a == 0) return(MOD - b) ; else if( b == 0) return(MOD - a) ; // non-special else { x = (a*b) % MOD ; if(x == MAX) return(0) ; else return(x) ; } } ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism 2016-04-28 18:32 ` Linus Torvalds 2016-04-28 23:26 ` Thomas Gleixner @ 2016-04-29 21:10 ` Linus Torvalds 2016-04-29 23:51 ` Linus Torvalds 2016-04-30 15:22 ` Thomas Gleixner 1 sibling, 2 replies; 29+ messages in thread From: Linus Torvalds @ 2016-04-29 21:10 UTC (permalink / raw) To: Thomas Gleixner Cc: LKML, Peter Zijlstra, Ingo Molnar, Andrew Morton, Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk, Davidlohr Bueso, Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet On Thu, Apr 28, 2016 at 11:32 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > hash_long/ptr really shouldn't care about the low bits being zero at > all, because it should mix in all the bits (using a prime multiplier > and taking the high bits). Looking at this assertion, it doesn't actually pan out very well at all. The 32-bit hash looks to be ok. Certainly not perfect, but not horrible. The 64-bit hash seems to be quite horribly bad with lots of values. I wrote a test-harness to check it out (some simple values just spread out at a fixed stride), and the end results are *so* bad that I'm kind of worried that I screwed up the test harness. But it gives quite reasonable values for hash_32() and for the plain modulo case. Now, the way my test-harness works (literally testing a lot of equal-stride cases), the "modulo prime number" approach will automatically look perfect. So my test-harness is pretty unfair in that respect. But the hash_32() function looks good when hashing into 16 bits, for example. In that case it does a very good job of spreading things out. When hashing into 17 bits, hash_32 still looks good, except it does very badly for strides of 32. It starts doing worse for bigger hash buckets and bigger strides. But out hash_64() seems to do very badly on pretty much *any* pattern. To the point where I started to doubt my test-program. But it really looks like that multiplication constant is almost pessimally chosen. For example, that _long_ range of bits set ("7fffffffc" in the middle) is effectively just one bit set with a subtraction. And it's *right* in that bit area that is supposed to shuffle bits 14-40 to the high bits (which is what we actually *use*. So it effectively shuffles none of those bits around at all, and if you have a stride of 4096, your'e pretty much done for. The 32-bit value is better in this regard, largely thanks to having that low bit set. Thanks to that, the information in bits around 12-18 will stay in bits 12-18, and because we then only have 32 bits, if the hash table is large enough, they will still be part of the bits when we take the high bits. For the 64-bit case, bits 12-18 will never even be relevant. So I think that what happens here is that hash_32() is actually somewhat reasonable. But hash_64() sucks donkey balls when there's a lot of information in around bits 10-20 (which is exactly where a lot of pointer bits have the *most* information thanks to alignment issues. Picking a new value almost at random (I say "almost", because I just started with that 32-bit multiplicand value that mostly works and shifted it up by 32 bits and then randomly added a few more bits to avoid long ranges of ones and zeroes), I picked #define GOLDEN_RATIO_PRIME_64 0x9e3700310c100d01UL and it is *much* better in my test harness. Of course, things like that depend on what patterns you test, But I did have a "range of strides and hash sizes" I tried. So just for fun: try changing GOLDEN_RATIO_PRIME_64 to that value, and see if the absolutely _horrid_ page-aligned case goes away for you? It really looks like those multiplication numbers were very very badly picked. Still, that number doesn't do very well if the hash is small (say, 8 bits). But for slightly larger hash tables it seems to be doing much better. Linus ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism 2016-04-29 21:10 ` Linus Torvalds @ 2016-04-29 23:51 ` Linus Torvalds 2016-04-30 1:34 ` Rik van Riel 2016-05-02 9:39 ` Torvald Riegel 2016-04-30 15:22 ` Thomas Gleixner 1 sibling, 2 replies; 29+ messages in thread From: Linus Torvalds @ 2016-04-29 23:51 UTC (permalink / raw) To: Thomas Gleixner, Rik van Riel Cc: LKML, Peter Zijlstra, Ingo Molnar, Andrew Morton, Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk, Davidlohr Bueso, Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet On Fri, Apr 29, 2016 at 2:10 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Thu, Apr 28, 2016 at 11:32 AM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> >> hash_long/ptr really shouldn't care about the low bits being zero at >> all, because it should mix in all the bits (using a prime multiplier >> and taking the high bits). > > Looking at this assertion, it doesn't actually pan out very well at all. > > The 32-bit hash looks to be ok. Certainly not perfect, but not horrible. > > The 64-bit hash seems to be quite horribly bad with lots of values. Ok, I have tried to research this a bit more. There's a lot of confusion here that causes the fact that hash_64() sucks donkey balls. The basic method for the hashing method by multiplication is fairly sane. It's well-documented, and attributed to Knuth as the comment above it says. However, there's two problems in there that degrade the hash, and particularly so for the 64-bit case. The first is the confusion about multiplying with a prime number.. That actually makes no sense at all, and is in fact entirely wrong. There's no reason to try to pick a prime number for the multiplication, and I'm not seeing Knuth having ever suggested that. Knuth's suggestion is to do the multiplication with a floating point value A somewhere in between 0 and 1, and multiplying the integer with it, and then just taking the fractional part and multiply it up by 'm' and do the floor of that to get a number in the range 0..m At no point are primes involved. And our multiplication really does approximate that - except it's being done in fixed-point arithmetic (so the thing you multiply with is basically n./2**64, and the overflow is what gets rid of the fractional part - then getting the "high bits" of the result is really just multiplying by a power of two and taking the floor of the result). So the first mistake is thinking that the thing you should multiply with should be prime. The primality matters for when you use a division to get a modulus, which is presumably where the confusion came from. Now, what value 'A' you use doesn't seem to really matter much. Knuth suggested the fractional part of the golden ratio, but I suspect that is purely because it's an irrational number that is not near to 0 or 1. You could use anything, although since "random bit pattern" is part of the issue, irrational numbers are a good starting point. I suspect that with our patterns, there's actually seldom a good reason to do lots of high-order bits, so you might as well pick something closer to 0, but whatever - it's only going to matter for the overflow part that gets thrown away anyway. The second mistake - and the one that actually causes the real problem - is to believe that the "bit sparseness" is a good thing. It's not. It's _very_ much not. If you don't mix the bits well in the multiplication, you get exactly the problem we hit: certain bit patternsjust will not mix up into the higher order bits. So if you look at what the actual golden ratio representation *should* have bee: #define GOLDEN_RATIO_32 0x9e3779b9 #define GOLDEN_RATIO_64 0x9e3779b97f4a7c16 and then compare it to the values we actually _use_ (bit-sparse primes closeish to those values): #define GOLDEN_RATIO_PRIME_32 0x9e370001UL #define GOLDEN_RATIO_PRIME_64 0x9e37fffffffc0001UL you start to see the problem. The right set of values have roughly 50% of the bits set in a random pattern. The wrong set of values do not. But as far as I an tell, you might as well use the fractional part of 'e' or 'pi' or just make random shit up that has a reasonable bit distribution. So we could use the fractional part of the golden ratio (phi): 0x9e3779b9 0x9e3779b97f4a7c16 or pi: 0x243f6a88 0x243f6a8885a308d3 or e: 0xb7e15162 0xb7e151628aed2a6b or whatever. The constants don't have to be prime. They don't even have to be odd, because the low bits will always be masked off anyway. The whole "hash one integer value down to X bits" is very different in this respect to things like string hashes, where you really do tend to want primes (because you keep all bits). But none of those are sparse. I think *some* amount of sparseness might be ok if it allows people with bad CPU's to do it using shift-and-adds, it just must not be as sparse as the current number, the 64-bit one on particular. There's presumably a few optimal values from a "spread bits out evenly" standpoint, and they won't have anything to do with random irrational constants, and will have everything to do with having nice bitpatterns. I'm adding Rik to the cc, because the original broken constants came from him long long ago (they go back to 2002, originally only used for the waitqueue hashing. Maybe he had some other input that caused him to believe that the primeness actually mattered. Linus ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism 2016-04-29 23:51 ` Linus Torvalds @ 2016-04-30 1:34 ` Rik van Riel 2016-05-02 9:39 ` Torvald Riegel 1 sibling, 0 replies; 29+ messages in thread From: Rik van Riel @ 2016-04-30 1:34 UTC (permalink / raw) To: Linus Torvalds, Thomas Gleixner Cc: LKML, Peter Zijlstra, Ingo Molnar, Andrew Morton, Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk, Davidlohr Bueso, Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet [-- Attachment #1: Type: text/plain, Size: 1130 bytes --] On Fri, 2016-04-29 at 16:51 -0700, Linus Torvalds wrote: > There's presumably a few optimal values from a "spread bits out > evenly" standpoint, and they won't have anything to do with random > irrational constants, and will have everything to do with having nice > bitpatterns. > > I'm adding Rik to the cc, because the original broken constants came > from him long long ago (they go back to 2002, originally only used > for > the waitqueue hashing. Maybe he had some other input that caused him > to believe that the primeness actually mattered. I do not remember where I got that hashing algorithm and magic constant from 14 years ago, but the changelog suggests I got it from Chuck Lever's paper. Chuck Lever's paper does mention that primeness "adds certain desirable qualities", and I may have read too much into that. I really do not remember the "bit sparse" thing at all, and have no idea where that came from. Googling old email threads about the code mostly makes me wonder "hey, where did that person go?" I am all for magic numbers that work better. -- All Rights Reversed. [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 473 bytes --] ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism 2016-04-29 23:51 ` Linus Torvalds 2016-04-30 1:34 ` Rik van Riel @ 2016-05-02 9:39 ` Torvald Riegel 1 sibling, 0 replies; 29+ messages in thread From: Torvald Riegel @ 2016-05-02 9:39 UTC (permalink / raw) To: Linus Torvalds Cc: Thomas Gleixner, Rik van Riel, LKML, Peter Zijlstra, Ingo Molnar, Andrew Morton, Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk, Davidlohr Bueso, Chris Mason, Carlos O'Donell, Eric Dumazet On Fri, 2016-04-29 at 16:51 -0700, Linus Torvalds wrote: > On Fri, Apr 29, 2016 at 2:10 PM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > On Thu, Apr 28, 2016 at 11:32 AM, Linus Torvalds > > <torvalds@linux-foundation.org> wrote: > >> > >> hash_long/ptr really shouldn't care about the low bits being zero at > >> all, because it should mix in all the bits (using a prime multiplier > >> and taking the high bits). > > > > Looking at this assertion, it doesn't actually pan out very well at all. > > > > The 32-bit hash looks to be ok. Certainly not perfect, but not horrible. > > > > The 64-bit hash seems to be quite horribly bad with lots of values. > > Ok, I have tried to research this a bit more. There's a lot of > confusion here that causes the fact that hash_64() sucks donkey balls. > > The basic method for the hashing method by multiplication is fairly > sane. It's well-documented, and attributed to Knuth as the comment > above it says. Section 2.2 of this article might also be of interest: M. Dietzfelbinger, T. Hagerup, J. Katajainen, and M. Penttonen. A Re- liable Randomized Algorithm for the Closest-Pair Problem. Journal of Algorithms, 25(1):19 – 51, 1997. I don't know much about hash functions, but my understanding of this is that one can do good hashing of words by multiplying with a random number and using the most-significant bits of the lower-significant word of the result. Different random numbers may work better than others for the input data (and some might be really awful), but the paper seems to claim that one *can* find a good random number for all input data. In practice, this might mean that re-hashing with a different random number after seeing too many conflicts in a hash table should eventually lead to a good hashing (ie, because the *class* of such hash functions is universal). ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism 2016-04-29 21:10 ` Linus Torvalds 2016-04-29 23:51 ` Linus Torvalds @ 2016-04-30 15:22 ` Thomas Gleixner 1 sibling, 0 replies; 29+ messages in thread From: Thomas Gleixner @ 2016-04-30 15:22 UTC (permalink / raw) To: Linus Torvalds Cc: LKML, Peter Zijlstra, Ingo Molnar, Andrew Morton, Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk, Davidlohr Bueso, Chris Mason, Carlos O'Donell, Torvald Riegel, Eric Dumazet On Fri, 29 Apr 2016, Linus Torvalds wrote: > Picking a new value almost at random (I say "almost", because I just > started with that 32-bit multiplicand value that mostly works and > shifted it up by 32 bits and then randomly added a few more bits to > avoid long ranges of ones and zeroes), I picked > > #define GOLDEN_RATIO_PRIME_64 0x9e3700310c100d01UL > > and it is *much* better in my test harness. > > Of course, things like that depend on what patterns you test, But I > did have a "range of strides and hash sizes" I tried. So just for fun: > try changing GOLDEN_RATIO_PRIME_64 to that value, and see if the > absolutely _horrid_ page-aligned case goes away for you? It solves that horrid case: https://tglx.de/~tglx/f-ops-h64-t.png It's faster than the shifts based version but the degradation with hyperthreading is slightly worse. Here for comparison the 64bit -> 32 shift version https://tglx.de/~tglx/f-ops-wang32-t.png FYI, that works way better than the existing shift machinery in hash_64 and the modulo prime one: https://tglx.de/~tglx/f-ops-mod-t.png > It really looks like those multiplication numbers were very very badly picked. Indeed. > Still, that number doesn't do very well if the hash is small (say, 8 > bits). I'm still waiting for the other test to complete. Will send numbers later today. Thanks, tglx ^ permalink raw reply [flat|nested] 29+ messages in thread
end of thread, other threads:[~2016-06-12 12:18 UTC | newest] Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2016-04-29 2:57 [patch 2/7] lib/hashmod: Add modulo based hash mechanism George Spelvin 2016-04-29 3:16 ` Linus Torvalds 2016-04-29 4:12 ` George Spelvin 2016-04-29 23:31 ` George Spelvin 2016-04-30 0:05 ` Linus Torvalds 2016-04-30 0:32 ` George Spelvin 2016-04-30 1:12 ` Linus Torvalds 2016-04-30 3:04 ` George Spelvin [not found] <CA+55aFxBWfAHQNAdBbdVr+z8ror4GVteyce3D3=vwDWxhu5KqQ@mail.gmail.com> 2016-04-30 20:52 ` George Spelvin 2016-05-01 8:35 ` Thomas Gleixner 2016-05-01 9:43 ` George Spelvin 2016-05-01 16:51 ` Linus Torvalds 2016-05-14 3:54 ` George Spelvin 2016-05-14 18:35 ` Linus Torvalds 2016-05-02 7:11 ` Thomas Gleixner -- strict thread matches above, loose matches on Subject: below -- 2016-04-28 16:42 [patch 0/7] futex: Add support for process private hashing Thomas Gleixner 2016-04-28 16:42 ` [patch 2/7] lib/hashmod: Add modulo based hash mechanism Thomas Gleixner 2016-04-28 18:32 ` Linus Torvalds 2016-04-28 23:26 ` Thomas Gleixner 2016-04-29 2:25 ` Linus Torvalds 2016-04-30 13:02 ` Thomas Gleixner 2016-04-30 16:45 ` Eric Dumazet 2016-04-30 17:12 ` Linus Torvalds 2016-04-30 17:37 ` Eric Dumazet 2016-06-12 12:18 ` Sandy Harris 2016-04-29 21:10 ` Linus Torvalds 2016-04-29 23:51 ` Linus Torvalds 2016-04-30 1:34 ` Rik van Riel 2016-05-02 9:39 ` Torvald Riegel 2016-04-30 15:22 ` Thomas Gleixner
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.