* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
       [not found] <CA+55aFxBWfAHQNAdBbdVr+z8ror4GVteyce3D3=vwDWxhu5KqQ@mail.gmail.com>
@ 2016-04-30 20:52 ` George Spelvin
  2016-05-01  8:35   ` Thomas Gleixner
  0 siblings, 1 reply; 43+ messages in thread
From: George Spelvin @ 2016-04-30 20:52 UTC (permalink / raw)
  To: tglx; +Cc: eric.dumazet, linux, linux-kernel, riel, torvalds

Thomas Gleixner wrote:
> I'll send a patch to replace hash_64 and hash_32.

Before you do that, could we look for a way to tweak the constants
in the existing hash?

It seems the basic "take the high bits of x * K" algorithm is actually
a decent hash function if K is chosen properly, and has a significant
speed advantage on machines with half-decent multipliers.

I'm researching how to do the multiply with fewer shifts and adds on
machines that need it.  (Or we could use a totally different function
in that case.)

You say that
> hash64 is slightly faster than the modulo prime as it does not have the
> multiplication.

Um... are you sure you benchmarked that right?  The hash_64 code you
used (Thomas Wang's 64->32-bit hash) has a critical path consisting of 6
shifts and 7 adds.  I can't believe that's faster than a single multiply.

For 1,000,000 iterations on an Ivy Bridge, the multiply is 4x
faster (5x if out of line) for me!

The constants I recommend are
#define GOLDEN_RATIO_64 0x61C8864680B583EBull
#define GOLDEN_RATIO_32 0x61C88647
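
For concreteness, the 64-bit one is meant to be used like this (just a
sketch; it is essentially what hash2() in the test program below times):

	static __always_inline u32 hash_64(u64 val, unsigned int bits)
	{
		/* Multiplicative hash: keep the high bits of val * K */
		return (u32)(val * GOLDEN_RATIO_64 >> (64 - bits));
	}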


rdtsc times for 1,000,000 iterations of each of the two.
(The sum of all hashes is printed to prevent dead code elimination.)

  hash_64  (sum)      * PHI   (sum)      hash_64  (sum)      * PHI   (sum)
  17552154 a52752df   3431821 2ce5398c   17485381 a52752df   3375535 2ce5398c
  17522273 a52752df   3487206 2ce5398c   17551217 a52752df   3374221 2ce5398c
  17546242 a52752df   3377355 2ce5398c   17494306 a52752df   3374202 2ce5398c
  17495702 a52752df   3409768 2ce5398c   17505839 a52752df   3398205 2ce5398c
  17501114 a52752df   3375435 2ce5398c   17539388 a52752df   3374202 2ce5398c
And with hash_64 forced inline:
  13596945 a52752df   3374931 2ce5398c   13585916 a52752df   3411107 2ce5398c
  13564616 a52752df   3374928 2ce5398c   13573465 a52752df   3425160 2ce5398c
  13569712 a52752df   3374915 2ce5398c   13580461 a52752df   3397773 2ce5398c
  13577481 a52752df   3374912 2ce5398c   13558708 a52752df   3417456 2ce5398c
  13569044 a52752df   3374909 2ce5398c   13557193 a52752df   3407912 2ce5398c

That's 3.5 cycles vs. 13.5.
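
(That's just the tick totals above divided by the 1,000,000 iterations:
~3.4-3.5M ticks for the multiply, ~13.6M for the inlined hash_64.)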

(I actually have two copies of the inlined code, to show code alignment
issues.)

On a Phenom, it's worse, 4 cycles vs. 35.
  35083119 a52752df   4020754 2ce5398c   35068116 a52752df   4015659 2ce5398c
  35074377 a52752df   4000819 2ce5398c   35068735 a52752df   4016943 2ce5398c
  35067596 a52752df   4025397 2ce5398c   35074365 a52752df   4000108 2ce5398c
  35071050 a52752df   4016190 2ce5398c   35058775 a52752df   4017988 2ce5398c
  35055091 a52752df   4000066 2ce5398c   35201158 a52752df   4000094 2ce5398c




My simple test code appended for anyone who cares...

#include <stdint.h>
#include <stdio.h>

/*  Phi = 0x0.9E3779B97F4A7C15F... */
/* -Phi = 0x0.61C8864680B583EA1... */
#define K 0x61C8864680B583EBull

static inline uint32_t hash1(uint64_t key)
{
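       /* Thomas Wang's 64->32-bit integer hash: a chain of shifts, adds and xors */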
       key  = ~key + (key << 18);
       key ^= key >> 31;
       key += (key << 2) + (key << 4);
       key ^= key >> 11;
       key += key << 6;
       key ^= key >> 22;
       return (uint32_t)key;
}

static inline uint32_t hash2(uint64_t key)
{
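	/* Multiplicative hash: the high 32 bits of key * K */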
	return (uint32_t)(key * K >> 32);
}

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;
	asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return (uint64_t)hi << 32 | lo;
}

int
main(void)
{
	int i, j;
	uint32_t sum, sums[20];
	uint64_t start, times[20];

	for (i = 0; i < 20; i += 4) {
		sum = 0;
		start = rdtsc();
		for (j = 0; j < 1000000; j++)
			sum += hash1(j+0xdeadbeef);
		times[i] = rdtsc() - start;
		sums[i] = sum;

		sum = 0;
		start = rdtsc();
		for (j = 0; j < 1000000; j++)
			sum += hash2(j+0xdeadbeef);
		times[i+1] = rdtsc() - start;
		sums[i+1] = sum;

		sum = 0;
		start = rdtsc();
		for (j = 0; j < 1000000; j++)
			sum += hash1(j+0xdeadbeef);
		times[i+2] = rdtsc() - start;
		sums[i+2] = sum;

		sum = 0;
		start = rdtsc();
		for (j = 0; j < 1000000; j++)
			sum += hash2(j+0xdeadbeef);
		times[i+3] = rdtsc() - start;
		sums[i+3] = sum;
	}
	for (i = 0; i < 20; i++)
		printf("  %llu %08x%c",
			times[i], sums[i], (~i & 3) ? ' ' : '\n');
	return 0;
}


* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-04-30 20:52 ` [patch 2/7] lib/hashmod: Add modulo based hash mechanism George Spelvin
@ 2016-05-01  8:35   ` Thomas Gleixner
  2016-05-01  9:43     ` George Spelvin
  0 siblings, 1 reply; 43+ messages in thread
From: Thomas Gleixner @ 2016-05-01  8:35 UTC (permalink / raw)
  To: George Spelvin; +Cc: eric.dumazet, linux-kernel, riel, torvalds

On Sat, 30 Apr 2016, George Spelvin wrote:
> Thomas Gleixner wrote:
> You say that
> > hash64 is slightly faster than the modulo prime as it does not have the
> > multiplication.
> 
> Um... are you sure you benchmarked that right?  The hash_64 code you
> used (Thomas Wang's 64->32-bit hash) has a critical path consisting of 6
> shifts and 7 adds.  I can't believe that's faster than a single multiply.

Sorry I did not express myself clearly enough.

hash64 (the single multiply with the adjusted golden ratio) is slightly faster
than the modulo one, which has two multiplications.
 
So here is the list:

hash_64(): (key * GOLDEN_RATIO) >> (64 - bits)		31Mio Ops/sec

modulo:	   	  		       	 		28Mio Ops/sec

Thomas Wang's 64 -> 32 bit				21Mio Ops/sec

Thanks,

	tglx


* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-05-01  8:35   ` Thomas Gleixner
@ 2016-05-01  9:43     ` George Spelvin
  2016-05-01 16:51       ` Linus Torvalds
  2016-05-02  7:11       ` Thomas Gleixner
  0 siblings, 2 replies; 43+ messages in thread
From: George Spelvin @ 2016-05-01  9:43 UTC (permalink / raw)
  To: linux, tglx; +Cc: eric.dumazet, linux-kernel, riel, torvalds

> Sorry I did not express myself clearly enough.

> hash64 (the single multiply with the adjusted golden ratio) is
> slightly faster than the modulo one, which has two multiplications.

Yes, I figured out that was probably what you were talking about,
and benchmarked it similarly.

But I noticed a much greater difference.

 Wang      * PHI    % 4093   Wang      * PHI    % 4093
 13599486  3494816  5238442  13584166  3460266  5239463
 13589552  3466764  5237327  13582381  3422304  5276253
 13569409  3407317  5236239  13566738  3393215  5267909
 13575257  3413736  5237708  13596428  3379811  5280118
 13583607  3540416  5325609  13650964  3380301  5265210

At 3.7 GHz, that's 

* PHI:     1059 M ops/second
* Modulo:   706 M ops/second
* Wang:     271 M ops/second
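
(Those rates are just 3.7e9 cycles/second divided by the per-hash cost
implied by the tick counts above: roughly 3.5, 5.2 and 13.6 cycles.)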

Of course, that's a tight loop hashing; I presume your test case
has more overhead.


Anyway, my recommendation (I'll write the patches if you like) is:

* Modify the multiplicative constants to be
  #define GOLDEN_RATIO_32 0x61C88647
  #define GOLDEN_RATIO_64 0x61C8864680B583EB

* Change the prototype of hash_64 to return u32.

* Create a separate 32-bit implementation of hash_64() for the
  BITS_PER_LONG < 64 case.  This should not be Wang's or anything
  similar because 64-bit shifts are slow and awkward on 32-bit
  machines.
  One option is something like __jhash_final(), but I think
  it will suffice to do:

  static __always_inline u32 hash_64(u64 val, unsigned int bits)
  {
	u32 hash = (u32)(val >> 32) * GOLDEN_RATIO_32 + (u32)val;
	hash *= GOLDEN_RATIO_32;
        return hash >> (32 - bits);
  }
  (S-o-b on that if you want it, of course.)

  (If you want it out of line, make a helper function that
  computes the 32-bit hash and have an inline wrapper that
  does the shifting; see the sketch after this list.)

* Eliminate the !CONFIG_ARCH_HAS_FAST_MULTIPLIER code path.  Once we've
  got rid of the 32-bit processors, we're left with 64-bit processors,
  and they *all* have hardware multipliers.  Even the MIPS R4000 (first
  64-bit processor ever) and Alpha 21064 had one.

  (Admittedly, the R4000 was 20 cycles, which is slower than Wang's hash,
  but 18 of them could overlap other instructions.)

  The worst multiply support is SPARCv9, which has 64x64-bit
  multiply with a 64-bit result, but no widening multiply.

* If you feel ambitious, add a 32-bit CONFIG_ARCH_HAS_SLOW_MULTIPLIER
  exception path.
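
As a sketch of the out-of-line variant mentioned above (names invented
for illustration; the helper would live in lib/):

	/* Out of line: the expensive part, the two 32-bit multiplies. */
	u32 __hash_64_32(u64 val);

	static __always_inline u32 hash_64(u64 val, unsigned int bits)
	{
		return __hash_64_32(val) >> (32 - bits);
	}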


* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-05-01  9:43     ` George Spelvin
@ 2016-05-01 16:51       ` Linus Torvalds
  2016-05-14  3:54         ` George Spelvin
  2016-05-02  7:11       ` Thomas Gleixner
  1 sibling, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2016-05-01 16:51 UTC (permalink / raw)
  To: George Spelvin, David Woodhouse, Chris Mason
  Cc: Thomas Gleixner, Eric Dumazet, Linux Kernel Mailing List, Rik van Riel

On Sun, May 1, 2016 at 2:43 AM, George Spelvin <linux@horizon.com> wrote:
>
> Anyway, my recommendation (I'll write the patches if you like) is:
>
> * Modify the multiplicative constants to be
>   #define GOLDEN_RATIO_32 0x61C88647
>   #define GOLDEN_RATIO_64 0x61C8864680B583EB

Interestingly, looking at where hash_64() is used, I notice the btrfs
raid56 code. And the comment there in rbio_bucket():

         * we shift down quite a bit.  We're using byte
         * addressing, and most of the lower bits are zeros.
         * This tends to upset hash_64, and it consistently
         * returns just one or two different values.
         *
         * shifting off the lower bits fixes things.

so people had actually noticed the breakage before.

Adding DavidW and Chris Mason explicitly to the cc, to perhaps have
them verify that fixing the hash_64 constant would make that btrfs
hack be unnecessary too..

Chris? Do you have a test-case? Do things end up being ok if you
change that GOLDEN_RATIO_64 value and remove the ">>16"?
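
(From memory, the pattern in question is roughly

	hash_64(num >> 16, BTRFS_STRIPE_HASH_TABLE_BITS)

and the question is whether plain

	hash_64(num, BTRFS_STRIPE_HASH_TABLE_BITS)

behaves once the constant is fixed - the exact expression in raid56.c may
differ, so treat this only as a sketch.)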

> * Change the prototype of hash_64 to return u32.

Fair enough. That simplifies things for 32-bit. Not that there are a
_lot_ of 32-bit cases, and most of the callers seem to just assign the
value directly to an int or just use it as an index, so it's probably
not really ever going to change anything, but it might make it easier
to write a simpler 32-bit code that uses two explicitly 32-bit fields
and mixes them.

> * Create a separate 32-bit implementation of hash_64() for the
>   BITS_PER_LONG < 64 case.  This should not be Wang's or anything
>   similar because 64-bit shifts are slow and awkward on 32-bit
>   machines.
>   One option is something like __jhash_final(), but I think
>   it will suffice to do:
>
>   static __always_inline u32 hash_64(u64 val, unsigned int bits)
>   {
>         u32 hash = (u32)(val >> 32) * GOLDEN_RATIO_32 + (u32)val;
>         hash *= GOLDEN_RATIO_32;
>         return hash >> (32 - bits);
>   }
>   (S-o-b on that if you want it, of course.)

Agreed. However, we need to make very certain that nobody wants a
value bigger than 32 bits. It all looks fine: the name hash folding
does want exactly 32 bits, but that code is only enabled for 64-bit
anyway, and it would be ok.

But it might be worth double-checking that nobody uses hash_64() for
anything else odd. My quick grep didn't show anything, but it was
quick.

> * Eliminate the !CONFIG_ARCH_HAS_FAST_MULTIPLIER code path.  Once we've
>   got rid of the 32-bit processors, we're left with 64-bit processors,
>   and they *all* have hardware multipliers.  Even the MIPS R4000 (first
>   64-bit processor ever) and Alpha 21064 had one.
>
>   (Admittedly, the R4000 was 20 cycles, which is slower than Wang's hash,
>   but 18 of them could overlap other instructions.)
>
>   The worst multiply support is SPARCv9, which has 64x64-bit
>   multiply with a 64-bit result, but no widening multiply.

Ack.

> * If you feel ambitious, add a 32-bit CONFIG_ARCH_HAS_SLOW_MULTIPLIER
>   exception path.

Let's make that a separate worry, and just fix hash_64() first.

In particular, that means "let's not touch GOLDEN_RATIO_32 yet". I
suspect that when we *do* change that value, we do want the
non-multiplying version you had.

If you wrote out the patch for the hash_64 case as you outlined above,
I will definitely apply it. The current hash_64 is clearly broken, and
we now have two real life examples (ie futex problems and the btrfs
code both).

                      Linus


* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-05-01  9:43     ` George Spelvin
  2016-05-01 16:51       ` Linus Torvalds
@ 2016-05-02  7:11       ` Thomas Gleixner
  2016-05-02 10:20         ` [PATCH 1/2] <linux/hash.h>: Make hash_64(), hash_ptr() return 32 bits George Spelvin
  1 sibling, 1 reply; 43+ messages in thread
From: Thomas Gleixner @ 2016-05-02  7:11 UTC (permalink / raw)
  To: George Spelvin; +Cc: eric.dumazet, linux-kernel, riel, torvalds

On Sun, 1 May 2016, George Spelvin wrote:
> But I noticed a much greater difference.
> 
>  Wang      * PHI    % 4093   Wang      * PHI    % 4093
>  13599486  3494816  5238442  13584166  3460266  5239463
>  13589552  3466764  5237327  13582381  3422304  5276253
>  13569409  3407317  5236239  13566738  3393215  5267909
>  13575257  3413736  5237708  13596428  3379811  5280118
>  13583607  3540416  5325609  13650964  3380301  5265210
> 
> At 3.7 GHz, that's 
> 
> * PHI:     1059 M ops/second
> * Modulo:   706 M ops/second
> * Wang:     271 M ops/second
> 
> Of course, that's a tight loop hashing; I presume your test case
> has more overhead.

Indeed.
 
> Anyway, my recommendation (I'll write the patches if you like) is:
> 
> * Modify the multiplicative constants to be
>   #define GOLDEN_RATIO_32 0x61C88647
>   #define GOLDEN_RATIO_64 0x61C8864680B583EB

Works for me. I ran them through my test case and they behaved reasonably
well.
 
> * Change the prototype of hash_64 to return u32.

Makes sense.
 
> * Create a separate 32-bit implementation of hash_64() for the
>   BITS_PER_LONG < 64 case.  This should not be Wang's or anything
>   similar because 64-bit shifts are slow and awkward on 32-bit
>   machines.
>   One option is something like __jhash_final(), but I think
>   it will suffice to do:
> 
>   static __always_inline u32 hash_64(u64 val, unsigned int bits)
>   {
> 	u32 hash = (u32)(val >> 32) * GOLDEN_RATIO_32 + (u32)val;
> 	hash *= GOLDEN_RATIO_32;
>         return hash >> (32 - bits);
>   }

Works. That's more or less the same overhead as the modulo one, which behaved
well on 32bit.
 
Thanks,

	tglx


* [PATCH 1/2] <linux/hash.h>: Make hash_64(), hash_ptr() return 32 bits
  2016-05-02  7:11       ` Thomas Gleixner
@ 2016-05-02 10:20         ` George Spelvin
  2016-05-02 10:22           ` [PATCH 2/2] <linux/hash.h>: Fix hash_64()'s horrible collision problem George Spelvin
                             ` (4 more replies)
  0 siblings, 5 replies; 43+ messages in thread
From: George Spelvin @ 2016-05-02 10:20 UTC (permalink / raw)
  To: linux-kernel, tglx, torvalds
  Cc: bfields, eric.dumazet, jlayton, linux, linux-nfs, riel

This also affects hash_str() and hash_mem() in <linux/sunrpc/svcauth.h>.

After a careful scan through the kernel code, no caller asks any of
those four for  more than 32 bits of hash result, except that the
latter two need 64 bits from hash_long() if BITS_PER_LONG == 64.

This is in preparation for the following patch, which will create
a new implementation of hash_64 for the BITS_PER_LONG == 32 case
which is optimized for 32-bit machines.

Signed-off-by: George Spelvin <linux@horizon.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: linux-nfs@vger.kernel.org
---
Cc: to NFS folks because it touches the sunrpc directory.

Is that "TODO" comment too presumptuous of me?

 include/linux/hash.h           | 22 ++++++++++++++++------
 include/linux/sunrpc/svcauth.h | 15 +++++++--------
 2 files changed, 23 insertions(+), 14 deletions(-)

diff --git a/include/linux/hash.h b/include/linux/hash.h
index 1afde47e..05003fdc 100644
--- a/include/linux/hash.h
+++ b/include/linux/hash.h
@@ -24,15 +24,17 @@
 
 #if BITS_PER_LONG == 32
 #define GOLDEN_RATIO_PRIME GOLDEN_RATIO_PRIME_32
+#define __hash_long(val) __hash_32(val)
 #define hash_long(val, bits) hash_32(val, bits)
 #elif BITS_PER_LONG == 64
+#define __hash_long(val) __hash_64(val)
 #define hash_long(val, bits) hash_64(val, bits)
 #define GOLDEN_RATIO_PRIME GOLDEN_RATIO_PRIME_64
 #else
 #error Wordsize not 32 or 64
 #endif
 
-static __always_inline u64 hash_64(u64 val, unsigned int bits)
+static __always_inline u64 __hash_64(u64 val)
 {
 	u64 hash = val;
 
@@ -55,20 +57,28 @@ static __always_inline u64 hash_64(u64 val, unsigned int bits)
 	hash += n;
 #endif
 
+	return hash;
+}
+
+static __always_inline u64 hash_64(u64 val, unsigned bits)
+{
 	/* High bits are more random, so use them. */
-	return hash >> (64 - bits);
+	return __hash_64(val) >> (64 - bits);
 }
 
-static inline u32 hash_32(u32 val, unsigned int bits)
+static inline u32 __hash_32(u32 val)
 {
 	/* On some cpus multiply is faster, on others gcc will do shifts */
-	u32 hash = val * GOLDEN_RATIO_PRIME_32;
+	return val * GOLDEN_RATIO_PRIME_32;
+}
 
+static inline u32 hash_32(u32 val, unsigned bits)
+{
 	/* High bits are more random, so use them. */
-	return hash >> (32 - bits);
+	return __hash_32(val) >> (32 - bits);
 }
 
-static inline unsigned long hash_ptr(const void *ptr, unsigned int bits)
+static inline u32 hash_ptr(const void *ptr, unsigned bits)
 {
 	return hash_long((unsigned long)ptr, bits);
 }
diff --git a/include/linux/sunrpc/svcauth.h b/include/linux/sunrpc/svcauth.h
index c00f53a4..eb1241b3 100644
--- a/include/linux/sunrpc/svcauth.h
+++ b/include/linux/sunrpc/svcauth.h
@@ -165,7 +165,8 @@ extern int svcauth_unix_set_client(struct svc_rqst *rqstp);
 extern int unix_gid_cache_create(struct net *net);
 extern void unix_gid_cache_destroy(struct net *net);
 
-static inline unsigned long hash_str(char *name, int bits)
+/* TODO: Update to <asm/word-at-a-time.h> when CONFIG_DCACHE_WORD_ACCESS */
+static inline u32 hash_str(const char *name, int bits)
 {
 	unsigned long hash = 0;
 	unsigned long l = 0;
@@ -176,14 +177,13 @@ static inline unsigned long hash_str(char *name, int bits)
 			c = (char)len; len = -1;
 		}
 		l = (l << 8) | c;
-		len++;
-		if ((len & (BITS_PER_LONG/8-1))==0)
-			hash = hash_long(hash^l, BITS_PER_LONG);
+		if (++len % sizeof(hash) == 0)
+			hash = __hash_long(hash^l);
 	} while (len);
 	return hash >> (BITS_PER_LONG - bits);
 }
 
-static inline unsigned long hash_mem(char *buf, int length, int bits)
+static inline u32 hash_mem(const char *buf, int length, int bits)
 {
 	unsigned long hash = 0;
 	unsigned long l = 0;
@@ -195,9 +195,8 @@ static inline unsigned long hash_mem(char *buf, int length, int bits)
 		} else
 			c = *buf++;
 		l = (l << 8) | c;
-		len++;
-		if ((len & (BITS_PER_LONG/8-1))==0)
-			hash = hash_long(hash^l, BITS_PER_LONG);
+		if (++len % sizeof(hash) == 0)
+			hash = __hash_long(hash^l);
 	} while (len);
 	return hash >> (BITS_PER_LONG - bits);
 }
-- 
2.8.1


* [PATCH 2/2] <linux/hash.h>: Fix hash_64()'s horrible collision problem
  2016-05-02 10:20         ` [PATCH 1/2] <linux/hash.h>: Make hash_64(), hash_ptr() return 32 bits George Spelvin
@ 2016-05-02 10:22           ` George Spelvin
  2016-05-02 20:08             ` Linus Torvalds
  2016-05-02 10:27           ` [RFC PATCH 3/2] (Rant) Fix various hash abuses George Spelvin
                             ` (3 subsequent siblings)
  4 siblings, 1 reply; 43+ messages in thread
From: George Spelvin @ 2016-05-02 10:22 UTC (permalink / raw)
  To: linux-kernel, tglx, torvalds
  Cc: bfields, eric.dumazet, jlayton, linux-nfs, linux, riel

hash_64() was using a low-bit-weight multiplier, which resulted in
very bad mixing of the high bits of the input.  In particular,
page-aligned pointers (low 12 bits not used) were a disaster.

Since all 64-bit processors (I checked back to the MIPS R4000 and
Alpha 21064) have hardware multipliers and don't benefit from this
"optimization", use the proper golden ratio value.

Avoid performance problems on 32-bit machines by providing a totally
separate implementation for them based on 32-bit arithmetic.

Keep the bad multiplier for hash_32() for now, at Linus' request.
"Sparse in 32 bits" is not quite as bad as "sparse in 64 bits".

Explicitly document that the algorithm is not stable.  I've tried to
check all callers for inadvertent dependence on the exact numerical value,
but some of them are so confusing (*cough* Lustre *cough*) that I can't
tell for sure.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: George Spelvin <linux@horizon.com>
---
 include/linux/hash.h | 151 ++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 107 insertions(+), 44 deletions(-)

diff --git a/include/linux/hash.h b/include/linux/hash.h
index 05003fdc..64c44e20 100644
--- a/include/linux/hash.h
+++ b/include/linux/hash.h
@@ -1,63 +1,56 @@
 #ifndef _LINUX_HASH_H
 #define _LINUX_HASH_H
-/* Fast hashing routine for ints,  longs and pointers.
-   (C) 2002 Nadia Yvette Chambers, IBM */
-
 /*
- * Knuth recommends primes in approximately golden ratio to the maximum
- * integer representable by a machine word for multiplicative hashing.
+ * Fast hashing routine for ints, longs and pointers.
+ * (C) 2002 Nadia Yvette Chambers, IBM
+ *
+ * These are used for small in-memory hash tables, where speed is a
+ * primary concern.  If you want something a little bit stronger, see
+ * <linux/jhash.h>, especially functions like jhash_3words().  If your
+ * hash table is subject to a hash collision denial of service attack,
+ * use something cryptographic.
+ *
+ * Note that the algorithms used are not guaranteed stable across kernel
+ * versions or architectures!  In particular, hash_64() is implemented
+ * differently on 32- and 64-bit machines.  Do not let external behavior
+ * depend on the hash values.
+ *
+ * The algorithm used is straight from Knuth: multiply a w-bit word by
+ * a suitable large constant, and take the high bits of the w-bit result.
+ *
  * Chuck Lever verified the effectiveness of this technique:
  * http://www.citi.umich.edu/techreports/reports/citi-tr-00-1.pdf
  *
- * These primes are chosen to be bit-sparse, that is operations on
- * them can use shifts and additions instead of multiplications for
- * machines where multiplications are slow.
+ * A good reference is Mikkel Thorup, "High Speed Hashing for
+ * Integers and Strings" at http://arxiv.org/abs/1504.06804 and
+ * https://www.youtube.com/watch?v=cB85UZKJQTc
+ *
+ * Because the current algorithm is linear (hash(a+b) = hash(a) + hash(b)),
+ * adding or subtracting hash values is just as likely to cause collisions
+ * as adding or subtracting the keys themselves.
  */
-
 #include <asm/types.h>
 #include <linux/compiler.h>
 
-/* 2^31 + 2^29 - 2^25 + 2^22 - 2^19 - 2^16 + 1 */
-#define GOLDEN_RATIO_PRIME_32 0x9e370001UL
-/*  2^63 + 2^61 - 2^57 + 2^54 - 2^51 - 2^18 + 1 */
-#define GOLDEN_RATIO_PRIME_64 0x9e37fffffffc0001UL
+/*
+ * Although a random odd number will do, it turns out that the golden ratio
+ * phi = (sqrt(5)-1)/2, or its negative, has particularly nice properties.
+ *
+ * These are actually the negative, (1 - phi) = (phi^2) = (3 - sqrt(5))/2.
+ * (See Knuth vol 3, section 6.4, exercise 9.)
+ */
+#define GOLDEN_RATIO_32 0x61C88647
+#define GOLDEN_RATIO_64 0x61C8864680B583EBull
 
-#if BITS_PER_LONG == 32
-#define GOLDEN_RATIO_PRIME GOLDEN_RATIO_PRIME_32
-#define __hash_long(val) __hash_32(val)
-#define hash_long(val, bits) hash_32(val, bits)
-#elif BITS_PER_LONG == 64
+#if BITS_PER_LONG == 64
+
+#define GOLDEN_RATIO_PRIME GOLDEN_RATIO_64	/* Used in fs/inode.c */
 #define __hash_long(val) __hash_64(val)
 #define hash_long(val, bits) hash_64(val, bits)
-#define GOLDEN_RATIO_PRIME GOLDEN_RATIO_PRIME_64
-#else
-#error Wordsize not 32 or 64
-#endif
 
 static __always_inline u64 __hash_64(u64 val)
 {
-	u64 hash = val;
-
-#if defined(CONFIG_ARCH_HAS_FAST_MULTIPLIER) && BITS_PER_LONG == 64
-	hash = hash * GOLDEN_RATIO_PRIME_64;
-#else
-	/*  Sigh, gcc can't optimise this alone like it does for 32 bits. */
-	u64 n = hash;
-	n <<= 18;
-	hash -= n;
-	n <<= 33;
-	hash -= n;
-	n <<= 3;
-	hash += n;
-	n <<= 3;
-	hash -= n;
-	n <<= 4;
-	hash += n;
-	n <<= 2;
-	hash += n;
-#endif
-
-	return hash;
+	return val * GOLDEN_RATIO_64;
 }
 
 static __always_inline u64 hash_64(u64 val, unsigned bits)
@@ -66,6 +59,75 @@ static __always_inline u64 hash_64(u64 val, unsigned bits)
 	return __hash_64(val) >> (64 - bits);
 }
 
+#elif BITS_PER_LONG == 32
+
+#define GOLDEN_RATIO_PRIME GOLDEN_RATIO_32
+#define __hash_long(val) __hash_32(val)
+#define hash_long(val, bits) hash_32(val, bits)
+
+/*
+ * Because 64-bit multiplications are very expensive on 32-bit machines,
+ * provide a completely separate implementation for them.
+ *
+ * This is mostly used via the hash_long() and hash_ptr() wrappers,
+ * which use hash_32() on 32-bit platforms, but there are some direct
+ * users of hash_64() in 32-bit kernels.
+ *
+ * Note that there is no __hash_64 function at all; that exists
+ * only to implement __hash_long().
+ *
+ * The algorithm is somewhat ad-hoc, but achieves decent mixing.
+ */
+static __always_inline u32 hash_64(u64 val, unsigned bits)
+{
+	u32 hash = (u32)(val >> 32) * GOLDEN_RATIO_32;
+	hash += (u32)val;
+	hash *= GOLDEN_RATIO_32;
+	return hash >> (32 - bits);
+}
+
+#else /* BITS_PER_LONG is something else */
+#error Wordsize not 32 or 64
+#endif
+
+
+/*
+ * This is the old bastard constant: a low-bit-weight
+ * prime close to 2^32 * phi = 0x9E3779B9.
+ *
+ * The purpose of the low bit weight is to make the shift-and-add
+ * code faster on processors like ARMv2 without hardware multiply.
+ * The downside is that the high bits of the input are hashed very weakly.
+ * In particular, the high 16 bits of input are just shifted up and added,
+ * so if you ask for b < 16 bits of output, bits 16..31-b of the input
+ * barely affect the output.
+ *
+ * Annoyingly, GCC compiles this into 6 shifts and adds, which
+ * is enough to multiply by the full GOLDEN_RATIO_32 using a
+ * cleverer algorithm:
+ *
+ * unsigned hash_32(unsigned x)
+ * {
+ * 	unsigned y, z;
+ *
+ * 	y = (x << 19) + x;
+ * 	z = (x << 9) + y;
+ * 	x = (x << 23) + z;
+ * 	z = (z << 8) + y;
+ * 	x = (x << 6) - x;
+ * 	return (z << 3) + x;
+ * }
+ *
+ * (Found by Yevgen Voronenko's Hcub algorithm, from
+ * http://spiral.ece.cmu.edu/mcm/gen.html)
+ *
+ * Unfortunately, figuring out which version to compile requires
+ * replicating the compiler's logic in Kconfig or the preprocessor.
+ */
+
+/* 2^31 + 2^29 - 2^25 + 2^22 - 2^19 - 2^16 + 1 */
+#define GOLDEN_RATIO_PRIME_32 0x9e370001UL
+
 static inline u32 __hash_32(u32 val)
 {
 	/* On some cpus multiply is faster, on others gcc will do shifts */
@@ -83,6 +145,7 @@ static inline u32 hash_ptr(const void *ptr, unsigned bits)
 	return hash_long((unsigned long)ptr, bits);
 }
 
+/* This really should be called "fold32_ptr"; it barely hashes at all. */
 static inline u32 hash32_ptr(const void *ptr)
 {
 	unsigned long val = (unsigned long)ptr;
-- 
2.8.1


* [RFC PATCH 3/2] (Rant) Fix various hash abuses
  2016-05-02 10:20         ` [PATCH 1/2] <linux/hash.h>: Make hash_64(), hash_ptr() return 32 bits George Spelvin
  2016-05-02 10:22           ` [PATCH 2/2] <linux/hash.h>: Fix hash_64()'s horrible collision problem George Spelvin
@ 2016-05-02 10:27           ` George Spelvin
  2016-05-02 10:31           ` [RFC PATCH 4/2] namei: Improve hash mixing if CONFIG_DCACHE_WORD_ACCESS George Spelvin
                             ` (2 subsequent siblings)
  4 siblings, 0 replies; 43+ messages in thread
From: George Spelvin @ 2016-05-02 10:27 UTC (permalink / raw)
  To: linux-kernel, tglx, torvalds; +Cc: eric.dumazet, linux, riel

From c7d6cf96d3b3695ef1a7a3da9e8be58add9c827d Mon Sep 17 00:00:00 2001
From: George Spelvin <linux@horizon.com>
Date: Mon, 2 May 2016 00:00:22 -0400
Subject: [RFC PATCH 3/2] (Rant) Fix various hash abuses

(This patch is not seriously meant to be applied as-is, but should
be divided up and sent to the various subsystems.  I haven't even
compiled all of the code I touched.)

While checking the call sites of hash functions for the previous
patches, I found numerous stupidities or even bugs.  This patch
fixes them.

Particularly common was calling hash_long() on values declared as
32 bits, where hash_32 would be just as good and faster.  (Except
for the different rounding of the constant phi used, it would compute
the same hash value!)

The Mellanox mlx4 driver did this and a bit more: it XORed together IP
addresses into a 32-bit value and then apparently hoped that hash_long()
would fix the resultant collisions.  Migrated to jhash_3words().

Lustre had several places where it did the opposite: used hash_long()
on 64-bit values.  That would ignore the high 32 bits on a 32-bit kernel.

It's not all Lustre's fault; the same bug was in include/linux/hashtable.h.

In several places, I changed hash_long() with a cast into plain hash_ptr().
It does the same thing, just with less clutter.

CIFS did some strange things with hash_64 that could be better done
with modular reduction.

ima_hash_key() from security/integrity/ima/ima.h was too hard to figure
out so I just added a rude comment to it, but it doesn't inspire feelings
of security or integrity.

Signed-off-by: George Spelvin <linux@horizon.com>
Cc: kvm-ppc@vger.kernel.org
Cc: Alexander Graf <agraf@suse.com>
Cc: linux-bcache@vger.kernel.org
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Eugenia Emantayev <eugenia@mellanox.com>
Cc: lustre-devel@lists.lustre.org
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: Andreas Dilger <andreas.dilger@intel.com>
Cc: Steve French <sfrench@samba.org>
Cc: linux-cifs@vger.kernel.org
Cc: Tyler Hicks <tyhicks@canonical.com>
Cc: ecryptfs@vger.kernel.org
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: Sasha Levin <levinsasha928@gmail.com>
Cc: Peter Zijlstra <pzijlstr@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Mimi Zohar <zohar@linux.vnet.ibm.com>
Cc: Dmitry Kasatkin <dmitry.kasatkin@gmail.com>
Cc: linux-ima-devel@lists.sourceforge.net
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
---
(Cc: list above is a reminder to self.  I haven't spammed everyone on it
yet while I think about this.)

 arch/powerpc/kvm/book3s_mmu_hpte.c                   |  6 +++---
 drivers/md/bcache/request.c                          |  2 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c       | 13 ++++---------
 drivers/staging/lustre/include/linux/lnet/lib-lnet.h |  2 +-
 drivers/staging/lustre/lnet/lnet/api-ni.c            |  2 +-
 drivers/staging/lustre/lnet/lnet/lib-ptl.c           |  4 ++--
 drivers/staging/lustre/lustre/include/lustre_fid.h   |  2 +-
 drivers/staging/lustre/lustre/ldlm/ldlm_resource.c   |  4 ++--
 drivers/staging/lustre/lustre/obdclass/lu_object.c   |  4 ++--
 fs/cifs/cifsfs.h                                     | 16 ++++++++++------
 fs/ecryptfs/messaging.c                              |  2 +-
 fs/nfsd/nfs4idmap.c                                  |  4 ++--
 include/linux/hashtable.h                            |  2 +-
 kernel/locking/lockdep.c                             |  2 +-
 kernel/sched/wait.c                                  |  3 +--
 lib/debugobjects.c                                   |  2 +-
 net/sunrpc/auth.c                                    |  2 +-
 net/sunrpc/svcauth_unix.c                            |  2 +-
 security/integrity/ima/ima.h                         | 11 ++++++++++-
 security/tomoyo/memory.c                             |  2 +-
 tools/perf/builtin-lock.c                            |  4 ++--
 21 files changed, 49 insertions(+), 42 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_mmu_hpte.c b/arch/powerpc/kvm/book3s_mmu_hpte.c
index 5a1ab125..e0499ac9 100644
--- a/arch/powerpc/kvm/book3s_mmu_hpte.c
+++ b/arch/powerpc/kvm/book3s_mmu_hpte.c
@@ -41,7 +41,7 @@ static inline u64 kvmppc_mmu_hash_pte(u64 eaddr)
 
 static inline u64 kvmppc_mmu_hash_pte_long(u64 eaddr)
 {
-	return hash_64((eaddr & 0x0ffff000) >> PTE_SIZE,
+	return hash_32((eaddr & 0x0ffff000) >> PTE_SIZE,
 		       HPTEG_HASH_BITS_PTE_LONG);
 }
 
@@ -52,14 +52,14 @@ static inline u64 kvmppc_mmu_hash_vpte(u64 vpage)
 
 static inline u64 kvmppc_mmu_hash_vpte_long(u64 vpage)
 {
-	return hash_64((vpage & 0xffffff000ULL) >> 12,
+	return hash_32((vpage & 0xffffff000ULL) >> 12,
 		       HPTEG_HASH_BITS_VPTE_LONG);
 }
 
 #ifdef CONFIG_PPC_BOOK3S_64
 static inline u64 kvmppc_mmu_hash_vpte_64k(u64 vpage)
 {
-	return hash_64((vpage & 0xffffffff0ULL) >> 4,
+	return hash_32((vpage & 0xffffffff0ULL) >> 4,
 		       HPTEG_HASH_BITS_VPTE_64K);
 }
 #endif
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 25fa8445..5137ab31 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -664,7 +664,7 @@ static inline struct search *search_alloc(struct bio *bio,
 	s->iop.c		= d->c;
 	s->iop.bio		= NULL;
 	s->iop.inode		= d->id;
-	s->iop.write_point	= hash_long((unsigned long) current, 16);
+	s->iop.write_point	= hash_ptr(current, 16);
 	s->iop.write_prio	= 0;
 	s->iop.error		= 0;
 	s->iop.flags		= 0;
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index b4b258c8..180d0b7d 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -36,7 +36,7 @@
 #include <linux/if_vlan.h>
 #include <linux/delay.h>
 #include <linux/slab.h>
-#include <linux/hash.h>
+#include <linux/jhash.h>
 #include <net/ip.h>
 #include <net/busy_poll.h>
 #include <net/vxlan.h>
@@ -194,14 +194,9 @@ static inline struct hlist_head *
 filter_hash_bucket(struct mlx4_en_priv *priv, __be32 src_ip, __be32 dst_ip,
 		   __be16 src_port, __be16 dst_port)
 {
-	unsigned long l;
-	int bucket_idx;
-
-	l = (__force unsigned long)src_port |
-	    ((__force unsigned long)dst_port << 2);
-	l ^= (__force unsigned long)(src_ip ^ dst_ip);
-
-	bucket_idx = hash_long(l, MLX4_EN_FILTER_HASH_SHIFT);
+	u32 ports = (__force u32)src_port << 16 | (__force u32)dst_port;
+	int bucket_idx = jhash_3words(ports, (__force u32)src_ip,
+		(__force u32)dst_ip, 0) >> (32 - MLX4_EN_FILTER_HASH_SHIFT);
 
 	return &priv->filter_hash[bucket_idx];
 }
diff --git a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
index dfc0208d..9b42fb55 100644
--- a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
+++ b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
@@ -428,7 +428,7 @@ lnet_ni_alloc(__u32 net, struct cfs_expr_list *el, struct list_head *nilist);
 static inline int
 lnet_nid2peerhash(lnet_nid_t nid)
 {
-	return hash_long(nid, LNET_PEER_HASH_BITS);
+	return hash_64(nid, LNET_PEER_HASH_BITS);
 }
 
 static inline struct list_head *
diff --git a/drivers/staging/lustre/lnet/lnet/api-ni.c b/drivers/staging/lustre/lnet/lnet/api-ni.c
index 87647555..4de382a2 100644
--- a/drivers/staging/lustre/lnet/lnet/api-ni.c
+++ b/drivers/staging/lustre/lnet/lnet/api-ni.c
@@ -700,7 +700,7 @@ lnet_nid_cpt_hash(lnet_nid_t nid, unsigned int number)
 	if (number == 1)
 		return 0;
 
-	val = hash_long(key, LNET_CPT_BITS);
+	val = hash_64(key, LNET_CPT_BITS);
 	/* NB: LNET_CP_NUMBER doesn't have to be PO2 */
 	if (val < number)
 		return val;
diff --git a/drivers/staging/lustre/lnet/lnet/lib-ptl.c b/drivers/staging/lustre/lnet/lnet/lib-ptl.c
index 3947e8b7..dc600c3b 100644
--- a/drivers/staging/lustre/lnet/lnet/lib-ptl.c
+++ b/drivers/staging/lustre/lnet/lnet/lib-ptl.c
@@ -360,13 +360,13 @@ lnet_mt_match_head(struct lnet_match_table *mtable,
 		   lnet_process_id_t id, __u64 mbits)
 {
 	struct lnet_portal *ptl = the_lnet.ln_portals[mtable->mt_portal];
-	unsigned long hash = mbits;
+	u64 hash = mbits;
 
 	if (!lnet_ptl_is_wildcard(ptl)) {
 		hash += id.nid + id.pid;
 
 		LASSERT(lnet_ptl_is_unique(ptl));
-		hash = hash_long(hash, LNET_MT_HASH_BITS);
+		hash = hash_64(hash, LNET_MT_HASH_BITS);
 	}
 	return &mtable->mt_mhash[hash & LNET_MT_HASH_MASK];
 }
diff --git a/drivers/staging/lustre/lustre/include/lustre_fid.h b/drivers/staging/lustre/lustre/include/lustre_fid.h
index ab4a9239..5a19bd50 100644
--- a/drivers/staging/lustre/lustre/include/lustre_fid.h
+++ b/drivers/staging/lustre/lustre/include/lustre_fid.h
@@ -594,7 +594,7 @@ static inline __u32 fid_hash(const struct lu_fid *f, int bits)
 	/* all objects with same id and different versions will belong to same
 	 * collisions list.
 	 */
-	return hash_long(fid_flatten(f), bits);
+	return hash_64(fid_flatten(f), bits);
 }
 
 /**
diff --git a/drivers/staging/lustre/lustre/ldlm/ldlm_resource.c b/drivers/staging/lustre/lustre/ldlm/ldlm_resource.c
index 9dede87a..d3b66496 100644
--- a/drivers/staging/lustre/lustre/ldlm/ldlm_resource.c
+++ b/drivers/staging/lustre/lustre/ldlm/ldlm_resource.c
@@ -475,9 +475,9 @@ static unsigned ldlm_res_hop_fid_hash(struct cfs_hash *hs,
 	} else {
 		val = fid_oid(&fid);
 	}
-	hash = hash_long(hash, hs->hs_bkt_bits);
+	hash = hash_32(hash, hs->hs_bkt_bits);
 	/* give me another random factor */
-	hash -= hash_long((unsigned long)hs, val % 11 + 3);
+	hash -= hash_ptr(hs, val % 11 + 3);
 
 	hash <<= hs->hs_cur_bits - hs->hs_bkt_bits;
 	hash |= ldlm_res_hop_hash(hs, key, CFS_HASH_NBKT(hs) - 1);
diff --git a/drivers/staging/lustre/lustre/obdclass/lu_object.c b/drivers/staging/lustre/lustre/obdclass/lu_object.c
index 978568ad..369a193b 100644
--- a/drivers/staging/lustre/lustre/obdclass/lu_object.c
+++ b/drivers/staging/lustre/lustre/obdclass/lu_object.c
@@ -869,10 +869,10 @@ static unsigned lu_obj_hop_hash(struct cfs_hash *hs,
 
 	hash = fid_flatten32(fid);
 	hash += (hash >> 4) + (hash << 12); /* mixing oid and seq */
-	hash = hash_long(hash, hs->hs_bkt_bits);
+	hash = hash_32(hash, hs->hs_bkt_bits);
 
 	/* give me another random factor */
-	hash -= hash_long((unsigned long)hs, fid_oid(fid) % 11 + 3);
+	hash -= hash_ptr(hs, fid_oid(fid) % 11 + 3);
 
 	hash <<= hs->hs_cur_bits - hs->hs_bkt_bits;
 	hash |= (fid_seq(fid) + fid_oid(fid)) & (CFS_HASH_NBKT(hs) - 1);
diff --git a/fs/cifs/cifsfs.h b/fs/cifs/cifsfs.h
index 83aac8ba..b10522e3 100644
--- a/fs/cifs/cifsfs.h
+++ b/fs/cifs/cifsfs.h
@@ -22,20 +22,24 @@
 #ifndef _CIFSFS_H
 #define _CIFSFS_H
 
-#include <linux/hash.h>
-
 #define ROOT_I 2
 
 /*
  * ino_t is 32-bits on 32-bit arch. We have to squash the 64-bit value down
- * so that it will fit. We use hash_64 to convert the value to 31 bits, and
- * then add 1, to ensure that we don't end up with a 0 as the value.
+ * so that it will fit, and we have to ensure that only a zero fileid will
+ * map to the zero ino_t.
+ *
+ * This can be done very simply by reducing mod-2^32-1, similar to IP
+ * checksums.  After the first addition, the sum is at most 0x1fffffffe.
+ * The second sum cannot overflow 32 bits.
  */
 static inline ino_t
 cifs_uniqueid_to_ino_t(u64 fileid)
 {
-	if ((sizeof(ino_t)) < (sizeof(u64)))
-		return (ino_t)hash_64(fileid, (sizeof(ino_t) * 8) - 1) + 1;
+	if (sizeof(ino_t) < sizeof(u64)) {
+		fileid = (fileid & 0xffffffff) + (fileid >> 32);
+		fileid = (fileid & 0xffffffff) + (fileid >> 32);
+	}
 
 	return (ino_t)fileid;
 
diff --git a/fs/ecryptfs/messaging.c b/fs/ecryptfs/messaging.c
index 286f10b0..bd5ae06b 100644
--- a/fs/ecryptfs/messaging.c
+++ b/fs/ecryptfs/messaging.c
@@ -33,7 +33,7 @@ static struct hlist_head *ecryptfs_daemon_hash;
 struct mutex ecryptfs_daemon_hash_mux;
 static int ecryptfs_hash_bits;
 #define ecryptfs_current_euid_hash(uid) \
-	hash_long((unsigned long)from_kuid(&init_user_ns, current_euid()), ecryptfs_hash_bits)
+	hash_32(from_kuid(&init_user_ns, current_euid()), ecryptfs_hash_bits)
 
 static u32 ecryptfs_msg_counter;
 static struct ecryptfs_msg_ctx *ecryptfs_msg_ctx_arr;
diff --git a/fs/nfsd/nfs4idmap.c b/fs/nfsd/nfs4idmap.c
index 5b20577d..8e99a418 100644
--- a/fs/nfsd/nfs4idmap.c
+++ b/fs/nfsd/nfs4idmap.c
@@ -111,8 +111,8 @@ idtoname_hash(struct ent *ent)
 {
 	uint32_t hash;
 
-	hash = hash_str(ent->authname, ENT_HASHBITS);
-	hash = hash_long(hash ^ ent->id, ENT_HASHBITS);
+	hash = hash_str(ent->authname, 32);
+	hash = hash_32(hash ^ ent->id, ENT_HASHBITS);
 
 	/* Flip LSB for user/group */
 	if (ent->type == IDMAP_TYPE_GROUP)
diff --git a/include/linux/hashtable.h b/include/linux/hashtable.h
index 661e5c2a..91e3fc5d 100644
--- a/include/linux/hashtable.h
+++ b/include/linux/hashtable.h
@@ -28,7 +28,7 @@
 
 /* Use hash_32 when possible to allow for fast 32bit hashing in 64bit kernels. */
 #define hash_min(val, bits)							\
-	(sizeof(val) <= 4 ? hash_32(val, bits) : hash_long(val, bits))
+	(sizeof(val) <= 4 ? hash_32(val, bits) : hash_64(val, bits))
 
 static inline void __hash_init(struct hlist_head *ht, unsigned int sz)
 {
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 78c1c0ee..11dca9fc 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -286,7 +286,7 @@ LIST_HEAD(all_lock_classes);
  */
 #define CLASSHASH_BITS		(MAX_LOCKDEP_KEYS_BITS - 1)
 #define CLASSHASH_SIZE		(1UL << CLASSHASH_BITS)
-#define __classhashfn(key)	hash_long((unsigned long)key, CLASSHASH_BITS)
+#define __classhashfn(key)	hash_ptr(key, CLASSHASH_BITS)
 #define classhashentry(key)	(classhash_table + __classhashfn((key)))
 
 static struct hlist_head classhash_table[CLASSHASH_SIZE];
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index f15d6b6a..a23611d2 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -485,9 +485,8 @@ EXPORT_SYMBOL(wake_up_bit);
 
 wait_queue_head_t *bit_waitqueue(void *word, int bit)
 {
-	const int shift = BITS_PER_LONG == 32 ? 5 : 6;
 	const struct zone *zone = page_zone(virt_to_page(word));
-	unsigned long val = (unsigned long)word << shift | bit;
+	unsigned long val = (unsigned long)word * BITS_PER_LONG + bit;
 
 	return &zone->wait_table[hash_long(val, zone->wait_table_bits)];
 }
diff --git a/lib/debugobjects.c b/lib/debugobjects.c
index 519b5a10..df65cfba 100644
--- a/lib/debugobjects.c
+++ b/lib/debugobjects.c
@@ -242,7 +242,7 @@ static void debug_objects_oom(void)
  */
 static struct debug_bucket *get_bucket(unsigned long addr)
 {
-	unsigned long hash;
+	unsigned int hash;
 
 	hash = hash_long((addr >> ODEBUG_CHUNK_SHIFT), ODEBUG_HASH_BITS);
 	return &obj_hash[hash];
diff --git a/net/sunrpc/auth.c b/net/sunrpc/auth.c
index 02f53674..1524dd3b 100644
--- a/net/sunrpc/auth.c
+++ b/net/sunrpc/auth.c
@@ -551,7 +551,7 @@ rpcauth_lookup_credcache(struct rpc_auth *auth, struct auth_cred * acred,
 			*entry, *new;
 	unsigned int nr;
 
-	nr = hash_long(from_kuid(&init_user_ns, acred->uid), cache->hashbits);
+	nr = hash_32(from_kuid(&init_user_ns, acred->uid), cache->hashbits);
 
 	rcu_read_lock();
 	hlist_for_each_entry_rcu(entry, &cache->hashtable[nr], cr_hash) {
diff --git a/net/sunrpc/svcauth_unix.c b/net/sunrpc/svcauth_unix.c
index dfacdc95..44773bdc 100644
--- a/net/sunrpc/svcauth_unix.c
+++ b/net/sunrpc/svcauth_unix.c
@@ -416,7 +416,7 @@ struct unix_gid {
 
 static int unix_gid_hash(kuid_t uid)
 {
-	return hash_long(from_kuid(&init_user_ns, uid), GID_HASHBITS);
+	return hash_32(from_kuid(&init_user_ns, uid), GID_HASHBITS);
 }
 
 static void unix_gid_put(struct kref *kref)
diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
index 5d0f6116..80296292 100644
--- a/security/integrity/ima/ima.h
+++ b/security/integrity/ima/ima.h
@@ -135,9 +135,18 @@ struct ima_h_table {
 };
 extern struct ima_h_table ima_htable;
 
+/*
+ * FIXME: What the hell is the point of hashing one byte to produce
+ * a 9-bit hash value?  Should this just be "return *digest"?  Or should
+ * we be hashing more than the first byte of the digest?  Callers seem
+ * to pass a buffer of TPM_DIGEST_SIZE bytes.
+ *
+ * It's always fun to be WTFing over "security" code for
+ * "integrity management".
+ */
 static inline unsigned long ima_hash_key(u8 *digest)
 {
-	return hash_long(*digest, IMA_HASH_BITS);
+	return hash_32(*digest, IMA_HASH_BITS);
 }
 
 enum ima_hooks {
diff --git a/security/tomoyo/memory.c b/security/tomoyo/memory.c
index 0e995716..594c4aac 100644
--- a/security/tomoyo/memory.c
+++ b/security/tomoyo/memory.c
@@ -155,7 +155,7 @@ const struct tomoyo_path_info *tomoyo_get_name(const char *name)
 		return NULL;
 	len = strlen(name) + 1;
 	hash = full_name_hash((const unsigned char *) name, len - 1);
-	head = &tomoyo_name_list[hash_long(hash, TOMOYO_HASH_BITS)];
+	head = &tomoyo_name_list[hash_32(hash, TOMOYO_HASH_BITS)];
 	if (mutex_lock_interruptible(&tomoyo_policy_lock))
 		return NULL;
 	list_for_each_entry(ptr, head, head.list) {
diff --git a/tools/perf/builtin-lock.c b/tools/perf/builtin-lock.c
index ce3bfb48..34f764a9 100644
--- a/tools/perf/builtin-lock.c
+++ b/tools/perf/builtin-lock.c
@@ -35,8 +35,8 @@ static struct perf_session *session;
 
 static struct list_head lockhash_table[LOCKHASH_SIZE];
 
-#define __lockhashfn(key)	hash_long((unsigned long)key, LOCKHASH_BITS)
-#define lockhashentry(key)	(lockhash_table + __lockhashfn((key)))
+#define __lockhashfn(addr)	hash_ptr(addr, LOCKHASH_BITS)
+#define lockhashentry(addr)	(lockhash_table + __lockhashfn(addr))
 
 struct lock_stat {
 	struct list_head	hash_entry;
-- 
2.8.1


* [RFC PATCH 4/2] namei: Improve hash mixing if CONFIG_DCACHE_WORD_ACCESS
  2016-05-02 10:20         ` [PATCH 1/2] <linux/hash.h>: Make hash_64(), hash_ptr() return 32 bits George Spelvin
  2016-05-02 10:22           ` [PATCH 2/2] <linux/hash.h>: Fix hash_64()'s horrible collision problem George Spelvin
  2016-05-02 10:27           ` [RFC PATCH 3/2] (Rant) Fix various hash abuses George Spelvin
@ 2016-05-02 10:31           ` George Spelvin
  2016-05-16 18:51             ` Linus Torvalds
  2016-05-02 13:28           ` [PATCH 1/2] <linux/hash.h>: Make hash_64(), hash_ptr() return 32 bits Peter Zijlstra
  2016-05-02 16:24           ` Linus Torvalds
  4 siblings, 1 reply; 43+ messages in thread
From: George Spelvin @ 2016-05-02 10:31 UTC (permalink / raw)
  To: linux-kernel, tglx, torvalds; +Cc: eric.dumazet, linux, riel

The hash mixing between adding the next 64 bits of name
was just a bit weak.

Replaced with a still very fast but slightly more effective
mixing function.

Signed-off-by: George Spelvin <linux@horizon.com>
---
As long as I was looking at all sorts of hashing in the kernel, I noticed
this.  I'm not sure if this is still too expansive and will slow down
the loop.

 fs/namei.c | 33 ++++++++++++++++++++++++++-------
 1 file changed, 26 insertions(+), 7 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 1d9ca2d5..e2bff05d 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1794,30 +1794,49 @@ static inline unsigned int fold_hash(unsigned long hash)
 	return hash_64(hash, 32);
 }
 
+/*
+ * This is George Marsaglia's XORSHIFT generator.
+ * It implements a maximum-period LFSR in only a few
+ * instructions.  It also has the property (required
+ * by hash_name()) that mix_hash(0) = 0.
+ */
+static inline unsigned long mix_hash(unsigned long hash)
+{
+	hash ^= hash << 13;
+	hash ^= hash >> 7;
+	hash ^= hash << 17;
+	return hash;
+}
+
 #else	/* 32-bit case */
 
 #define fold_hash(x) (x)
 
+static inline unsigned long mix_hash(unsigned long hash)
+{
+	hash ^= hash << 13;
+	hash ^= hash >> 17;
+	hash ^= hash << 5;
+	return hash;
+}
+
 #endif
 
 unsigned int full_name_hash(const unsigned char *name, unsigned int len)
 {
-	unsigned long a, mask;
-	unsigned long hash = 0;
+	unsigned long a, hash = 0;
 
 	for (;;) {
 		a = load_unaligned_zeropad(name);
 		if (len < sizeof(unsigned long))
 			break;
-		hash += a;
-		hash *= 9;
+		hash = mix_hash(hash + a);
 		name += sizeof(unsigned long);
 		len -= sizeof(unsigned long);
 		if (!len)
 			goto done;
 	}
-	mask = bytemask_from_count(len);
-	hash += mask & a;
+	hash += a & bytemask_from_count(len);
 done:
 	return fold_hash(hash);
 }
@@ -1835,7 +1854,7 @@ static inline u64 hash_name(const char *name)
 	hash = a = 0;
 	len = -sizeof(unsigned long);
 	do {
-		hash = (hash + a) * 9;
+		hash = mix_hash(hash + a);
 		len += sizeof(unsigned long);
 		a = load_unaligned_zeropad(name+len);
 		b = a ^ REPEAT_BYTE('/');
-- 
2.8.1


* Re: [PATCH 1/2] <linux/hash.h>: Make hash_64(), hash_ptr() return 32 bits
  2016-05-02 10:20         ` [PATCH 1/2] <linux/hash.h>: Make hash_64(), hash_ptr() return 32 bits George Spelvin
                             ` (2 preceding siblings ...)
  2016-05-02 10:31           ` [RFC PATCH 4/2] namei: Improve hash mixing if CONFIG_DCACHE_WORD_ACCESS George Spelvin
@ 2016-05-02 13:28           ` Peter Zijlstra
  2016-05-02 19:08             ` George Spelvin
  2016-05-02 16:24           ` Linus Torvalds
  4 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2016-05-02 13:28 UTC (permalink / raw)
  To: George Spelvin
  Cc: linux-kernel, tglx, torvalds, bfields, eric.dumazet, jlayton,
	linux-nfs, riel

On Mon, May 02, 2016 at 06:20:16AM -0400, George Spelvin wrote:
> Subject: [PATCH 1/2] <linux/hash.h>: Make hash_64(), hash_ptr() return 32 bits

> +static __always_inline u64 hash_64(u64 val, unsigned bits)
> +{
>  	/* High bits are more random, so use them. */
> +	return __hash_64(val) >> (64 - bits);
>  }

Is the subject stale or the above a mistake? Because hash_64() still
very much seems to return u64.

Also, I think I would prefer to keep it like this, I would like to use
it for kernel/locking/lockdep.c:iterate_chain_key(), which currently is
a somewhat crap hash.

Something like:

static inline u64 iterate_chain_key(u64 key1, u64 key2)
{
	return hash_64(key1 ^ key2, 64);
}


* Re: [PATCH 1/2] <linux/hash.h>: Make hash_64(), hash_ptr() return 32 bits
  2016-05-02 10:20         ` [PATCH 1/2] <linux/hash.h>: Make hash_64(), hash_ptr() return 32 bits George Spelvin
                             ` (3 preceding siblings ...)
  2016-05-02 13:28           ` [PATCH 1/2] <linux/hash.h>: Make hash_64(), hash_ptr() return 32 bits Peter Zijlstra
@ 2016-05-02 16:24           ` Linus Torvalds
  2016-05-02 20:26             ` George Spelvin
  4 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2016-05-02 16:24 UTC (permalink / raw)
  To: George Spelvin
  Cc: Linux Kernel Mailing List, Thomas Gleixner, Bruce Fields,
	Eric Dumazet, Jeff Layton, Linux NFS Mailing List, Rik van Riel

On Mon, May 2, 2016 at 3:20 AM, George Spelvin <linux@horizon.com> wrote:
>
> After a careful scan through the kernel code, no caller asks any of
> those four for  more than 32 bits of hash result, except that the
> latter two need 64 bits from hash_long() if BITS_PER_LONG == 64.

Ugh. I hate this patch.

I really think that we should *not* confuse those two odd svcauth.h
users with the whole hash_32/64 thing.

I think hash_str/hash_mem should be moved to lib/hash.c, and they
just shouldn't use "hash_long()" at all, except at the very end (they
currently have a very odd way of doing "every <n> bytes _and_ at the
end").

In particular, the hashing in the *middle* is very different from the
hashing at the end.

At the end, you need to make sure the lower bits get spread out
particularly to the upper bits, since you're going to shift things
down.

But in the middle, you just want to spread the bits out (and in
particular, destroy any byte-per-byte patterns that build up in
between).

Quite frankly, I think those functions should just use something like
the FNV hash (or Jenkins hash), and then maybe use "hash_long()" at
the *end* to shift the result down to "bits".
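
Something along these lines, say (an untested sketch: 32-bit FNV-1a in the
middle, with hash_long() used only for the final narrowing):

	static inline u32 hash_str(const char *name, int bits)
	{
		u32 hash = 2166136261u;		/* FNV-1a offset basis */

		while (*name)
			hash = (hash ^ (unsigned char)*name++) * 16777619;
		return hash_long(hash, bits);
	}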

I don't want to make our <linux/hash.h> odder just because of two
entirely broken users.

That said, I actually think hash_str() should be killed entirely.
Better just use what we use for pathnames: full_name_hash() (which
gets a pointer and length) and hash_name (which gets the string).

Those functions do the "word-at-a-time" optimization, and they should
do a good enough job. If they aren't, we should fix them, because they
are a hell of a lot more important than anything that the svcauth code
does.
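
(Concretely, something like

	hash_32(full_name_hash(name, strlen(name)), bits)

as a drop-in for the svcauth hash_str() users - only a sketch, the exact
types may need a cast.)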

                Linus


* Re: [PATCH 1/2] <linux/hash.h>: Make hash_64(), hash_ptr() return 32 bits
  2016-05-02 13:28           ` [PATCH 1/2] <linux/hash.h>: Make hash_64(), hash_ptr() return 32 bits Peter Zijlstra
@ 2016-05-02 19:08             ` George Spelvin
  0 siblings, 0 replies; 43+ messages in thread
From: George Spelvin @ 2016-05-02 19:08 UTC (permalink / raw)
  To: linux, peterz
  Cc: bfields, eric.dumazet, jlayton, linux-kernel, linux-nfs, riel,
	tglx, torvalds

Peter Zijlstra wrote:
> Is the subject stale or the above a mistake? Because hash_64() still
> very much seems to return u64.

Damn it no, it's a brown-paper-bag typo caused by a recent rebase.
It's meant to be u32, it was developed with u32, but the typo snuck
in during late testing and I didn't catch it.

I know Linus hates recent rebases, but I actually had a good
reason if you want to hear the saga...

I developed the patch while running v4.4.x.  I'd been doing other hacking
on top of v4.5 that resulted in an unstable system, so I kept going back
to the "last known good" v4.4.x kernel to get work done.

Developing this patch, I backed out that buggy work and based it on my
4.5 tree, since Linus hates basing work on random kernels.

Most of it was compile testing, but just before submitting, I of course
had to boot it and test.

When I booted it, I discovered I couldn't load microcode.  How the hell
did I cause that?  Oh, I have CONFIG_MICROCODE turned off... huh?  Oh,
v4.5 has bug where CONFIG_MICROCODE depende on CONFIG_BLK_DEV_INITRD
which I don't use, and the fix went in to v4.6-rc1.

Okay, fine, in the interest of getting a clean boot for testing, I'll
rebase to v4.6-rc6.  See, I told you I had a reason!

Now, I actually have a fair pile of local patches for hacking projects in
progress (I'm running 4.6.0-rc6-0130), so rebasing my whole stack takes
me about an hour and a half, with several merge conflict resolutions.

Finally, I get my clean boot, and everything seems to be working
fine, and I'm ready to post.

But by this time it's late, I'm tired, and I didn't notice that I somehow
managed to screw that up!  In hindsight, I think I remember the sequence
of edits that caused it (I deleted something by accident and cut & pasted
it back), but that's an even more obscure saga.

I will now go and fix it and boot test again, just to be sure.

Grump.


* Re: [PATCH 2/2] <linux/hash.h>: Fix hash_64()'s horrible collision problem
  2016-05-02 10:22           ` [PATCH 2/2] <linux/hash.h>: Fix hash_64()'s horrible collision problem George Spelvin
@ 2016-05-02 20:08             ` Linus Torvalds
  0 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2016-05-02 20:08 UTC (permalink / raw)
  To: George Spelvin
  Cc: Linux Kernel Mailing List, Thomas Gleixner, Bruce Fields,
	Eric Dumazet, Jeff Layton, Linux NFS Mailing List, Rik van Riel

[-- Attachment #1: Type: text/plain, Size: 1166 bytes --]

On Mon, May 2, 2016 at 3:22 AM, George Spelvin <linux@horizon.com> wrote:
> hash_64() was using a low-bit-weight multiplier, which resulted in
> very bad mixing of the high bits of the input.  In particular,
> page-aligned pointers (low 12 bits not used) were a disaster.

So I did just a minimal fix for 4.6 (and back-porting), which took
just the constants and made _only_ the 64-bit architecture case use
this improved constant for hash_64.

In other words, people who use "hash_long()" or use "hash_64()" on
64-bit architectures will get the improvements, but if you use
hash_64() on a 32-bit architecture you'll continue to see the old
behavior.

Quite frankly, looking at some of the explicit hash_64() users, they
seem to be a bit dubious anyway. And it won't make things *worse* for
them.

So that simple "just use multiplication unconditionally on 64-bit, and
use the better constant" should fix the actual _practical_ problems
that we've seen. And it shouldn't have any negative consequences,
since as you say, 64-bit architectures universally do have a
multiplier.

The bigger changes will have to be for 4.7 by now, I think.

                     Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 2805 bytes --]

From 689de1d6ca95b3b5bd8ee446863bf81a4883ea25 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon, 2 May 2016 12:46:42 -0700
Subject: [PATCH] Minimal fix-up of bad hashing behavior of hash_64()

This is a fairly minimal fixup to the horribly bad behavior of hash_64()
with certain input patterns.

In particular, because the multiplicative value used for the 64-bit hash
was intentionally bit-sparse (so that the multiply could be done with
shifts and adds on architectures without hardware multipliers), some
bits did not get spread out very much.  In particular, certain fairly
common bit ranges in the input (roughly bits 12-20: commonly with the
most information in them when you hash things like byte offsets in files
or memory that have block factors that mean that the low bits are often
zero) would not necessarily show up much in the result.

There's a bigger patch-series brewing to fix up things more completely,
but this is the fairly minimal fix for the 64-bit hashing problem.  It
simply picks a much better constant multiplier, spreading the bits out a
lot better.

NOTE! For 32-bit architectures, the bad old hash_64() remains the same
for now, since 64-bit multiplies are expensive.  The bigger hashing
cleanup will replace the 32-bit case with something better.

The new constants were picked by George Spelvin who wrote that bigger
cleanup series.  I just picked out the constants and part of the comment
from that series.

Cc: stable@vger.kernel.org
Cc: George Spelvin <linux@horizon.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

diff --git a/include/linux/hash.h b/include/linux/hash.h
index 1afde47e1528..79c52fa81cac 100644
--- a/include/linux/hash.h
+++ b/include/linux/hash.h
@@ -32,12 +32,28 @@
 #error Wordsize not 32 or 64
 #endif
 
+/*
+ * The above primes are actively bad for hashing, since they are
+ * too sparse. The 32-bit one is mostly ok, the 64-bit one causes
+ * real problems. Besides, the "prime" part is pointless for the
+ * multiplicative hash.
+ *
+ * Although a random odd number will do, it turns out that the golden
+ * ratio phi = (sqrt(5)-1)/2, or its negative, has particularly nice
+ * properties.
+ *
+ * These are the negative, (1 - phi) = (phi^2) = (3 - sqrt(5))/2.
+ * (See Knuth vol 3, section 6.4, exercise 9.)
+ */
+#define GOLDEN_RATIO_32 0x61C88647
+#define GOLDEN_RATIO_64 0x61C8864680B583EBull
+
 static __always_inline u64 hash_64(u64 val, unsigned int bits)
 {
 	u64 hash = val;
 
-#if defined(CONFIG_ARCH_HAS_FAST_MULTIPLIER) && BITS_PER_LONG == 64
-	hash = hash * GOLDEN_RATIO_PRIME_64;
+#if BITS_PER_LONG == 64
+	hash = hash * GOLDEN_RATIO_64;
 #else
 	/*  Sigh, gcc can't optimise this alone like it does for 32 bits. */
 	u64 n = hash;

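For reference, a minimal userspace sketch (not part of the patch) of what
the fixed 64-bit path computes: one multiply by the new dense constant,
then keep the top "bits" bits of the product, which is how hash_64() uses it.

#include <stdint.h>

#define GOLDEN_RATIO_64 0x61C8864680B583EBull

/* hash_64() on a 64-bit architecture after this patch, in plain C.
 * bits must be in 1..64. */
static inline uint64_t hash_64_sketch(uint64_t val, unsigned int bits)
{
        return (val * GOLDEN_RATIO_64) >> (64 - bits);
}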

* Re: [PATCH 1/2] <linux/hash.h>: Make hash_64(), hash_ptr() return 32 bits
  2016-05-02 16:24           ` Linus Torvalds
@ 2016-05-02 20:26             ` George Spelvin
  2016-05-02 21:19               ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: George Spelvin @ 2016-05-02 20:26 UTC (permalink / raw)
  To: linux, torvalds
  Cc: bfields, eric.dumazet, jlayton, linux-kernel, linux-nfs, riel, tglx

Linus wrote:
> I really think that we should *not* confuse those two odd svcauth.h
> users with the whole hash_32/64 thing.
>
> I think  hash_str/hash_mem should be moved to lib/hash.c, and they
> just shouldn't use "hash_long()" at all, except at the very end (they
> currently have a very odd way of doing "every <n> bytes _and_ at the
> end".

Moving them is fine.  I have no problem with that, except that I agree
that merging them with the fs/namei.c hashes would be even better.

But the hash_long is not odd at all.  They're using it as the mix function
between adding words.  Like hash_name() uses "hash = (hash + a) * 9".

So it's

	x = (first 8 bytes of string)
	x = __hash64(x);
	x ^= (next 8 bytes of string)
	x = __hash64(x);
	... repeat ...
	x ^= (last 1..8 bytes of string)
	return __hash64(x);

It's a quite reasonable mix function.  One multiply (4 cycles or so) per
8 bytes.  It's definitely swamped by the byte-at-a-time string loading.

(The one peculiar thing is that the "last 1..8 bytes of string" is, due
to the way it's shifted into an accumulator, actually the last 8 bytes,
which may include some previously-hashed bytes.  But that has nothing
to do with the mixing.)
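
A minimal standalone sketch of that structure (simplified: the tail here is
zero-padded rather than re-reading the last 8 bytes, and __hash64() stands
for a single multiplicative mix):

#include <stdint.h>
#include <stddef.h>
#include <string.h>

static uint64_t __hash64(uint64_t x)
{
        return x * 0x61C8864680B583EBull;       /* one multiply per 8 bytes */
}

static uint64_t hash_str_words(const char *s)
{
        uint64_t x = 0, w;
        size_t i, n = strlen(s);

        for (i = 0; i + 8 <= n; i += 8) {
                memcpy(&w, s + i, 8);           /* next 8 bytes of string */
                x = __hash64(x ^ w);
        }
        if (i < n) {                            /* last 1..7 bytes, zero-padded */
                w = 0;
                memcpy(&w, s + i, n - i);
                x = __hash64(x ^ w);
        }
        return x;
}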


But... the fundamental reason I didn't was that this is late -rc.
I'm trying to fix a *bug* tglx found, not add risk and delay with a much
more ambitious patch series.

This is a related but separate issue that can be addressed separately.

> In particular, the hashing in the *middle* is very different from the
> hashing at the end.
>
> At the end, you need to make sure the lower bits get spread out
> particularly to the upper bits, since you're going to shift things
> down.
>
> But in the middle, you just want to spread the bits out (and in
> particular, destroy any byte-per-byte patterns that it build it in
> between).

Yes, in the middle you have a full width hash to spread the bits among,
while at the end you have to "concentrate" the hash value in a few bits.
The latter is a more difficult task and might be slower.

But there's nothing really *wrong* with using the same mix operation
both places if it's both good enough and fast enough.

In particular, if you're loading the string a word at a time, you need
a much better mixing function than if you're processing it byte at a time.

> Quite frankly, I think those functions should just use something like
> the FNV hash (or Jenkins hash), and then maybe use "hash_long()" at
> the *end* to shift the result down to "bits".

It *is* the FNV hash!  More specifically, FNV-1a done a word at a time;
doing it byte at a time like most implementations would result in 8
times as many multiplies and be slower.

The only difference is the exact multiplier.  The published FNV-1
uses a low-bit-weight multiplier.  Since this is being done a word
at a time, I think the stronger multiplier is worth it.
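
For comparison, the textbook byte-at-a-time FNV-1a with its published 64-bit
offset basis and prime; one multiply per input byte rather than per word:

#include <stdint.h>
#include <stddef.h>

static uint64_t fnv1a_64(const unsigned char *p, size_t len)
{
        uint64_t h = 0xcbf29ce484222325ull;     /* FNV-1a 64-bit offset basis */
        size_t i;

        for (i = 0; i < len; i++) {
                h ^= p[i];
                h *= 0x100000001b3ull;          /* FNV 64-bit prime */
        }
        return h;
}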

Jenkins hash is (if you have a HW multiplier) slower.  Better mixing,
so adding a multiply at the end would be redundant.

> I don't want to make our <linux/hash.h> odder just because of two
> entirely broken users.

> That said, I actually think hash_str() should be killed entirely.
> Better just use what we use for pathnames: full_name_hash() (which
> gets a pointer and length) and hash_name (which gets the string).

I was thinking the exact same thing.  Repeated grepping over the kernel
tree for "hash_*" functions showed me just how many there are in various
places, and combining some would be a good idea.

For example, partial_name_hash() is still used in many places
even if the word-at-a-time code is used in namei.c.

> Those functions do the "word-at-a-time" optimization, and they should
> do a good enough job. If they aren't, we should fix them, because they
> are a hell of a lot more important than anything that the svcauth code
> does.

Okay, let me list the problems...

1) hash_name() stops at a slash or a nul.  hash_str() only looks
   for a nul.  Should I add a third variant?  Should that go in fs/namei,
   or should the whole pile be moved elsewhere?

2) Some places need the length *and* the hash.  Calling strlen() and then
   full_name_hash() somewhat defeats the purpose of word-at-a-time access.
   hash_name returns both jammed into a 64-bit word.  Is that a good
   interface in general?

   Myself, I think the length should be computed at copy_from_user()
   time and I'd like to look at each such call site and understand why
   it *doesn't* have the length ready.  But that's a lot of work.

3) They do particularly crappy mixing.  See that "RFC PATCH 4/2" I posted
   that because I couldn't stand how bad it was.

   If you don't have word at a time, the partial_name_hash() is decent,
   but the word-at-a-time mixes by multiplying by 9.  So the hashes
   of the strings "1.......0" and "0.......9" are identical. 

   (I assume this was chosen as the best available one-instruction (LEA)
   mix operation due to an obsessive interest in speed in the dcache.)

   More crappy mixing is the final folding.  On a 64-bit machine, the
   high and low 32 bits are just XORed together.  So the hashes of
   "deadbeef" and "beefdead" are identical.

   I agree we should be very careful of the mixing function, since it's
   the only thing limiting loop cycle time.  The has_zero hackery has a
   critical path about 6 cycles long, but they're independent per loop
   and a sufficiently aggressive OOO machine could pipeline them.

   (If you have a particular cycle count budget in mind, I can come up with
   something.)

4) They don't do a final mix.  Again, obsessive interest in speed when
   you're storing the whole 32 bits and comparing that.  For applications
   that use only a few bits of hash index and need the entropy
   "concentrated", should this be done outside?

Basically, I can understand why someone might prefer a stronger hash
for less speed-critical applications.


I agree that we should ideally have just two string hash functions for
general kernel use (plus any imposed by external standards like file
systems or ELF).

One is the dcache hash, for applications where collision DoS attacks are
not expected and is optimized strongly for speed.

A second, something like SipHash, for applications which require
collision resistance against malicious attackers.

But achieving that ideal is a project of significant size.
There are a lot of corner cases to worry about.


* Re: [PATCH 1/2] <linux/hash.h>: Make hash_64(), hash_ptr() return 32 bits
  2016-05-02 20:26             ` George Spelvin
@ 2016-05-02 21:19               ` Linus Torvalds
  2016-05-02 21:41                 ` Linus Torvalds
  2016-05-03  1:59                 ` George Spelvin
  0 siblings, 2 replies; 43+ messages in thread
From: Linus Torvalds @ 2016-05-02 21:19 UTC (permalink / raw)
  To: George Spelvin
  Cc: Bruce Fields, Eric Dumazet, Jeff Layton,
	Linux Kernel Mailing List, Linux NFS Mailing List, Rik van Riel,
	Thomas Gleixner

On Mon, May 2, 2016 at 1:26 PM, George Spelvin <linux@horizon.com> wrote:
>
> But the hash_long it not odd at all.  They're using it as the mix function
> between adding words.  Like hash_name() uses "hash = (hash + a) * 9".

Right. But there is no reason to think that that should be the same
thing as the final hash.

In fact, there are many reasons to think it should *not*.

The final hash is different.

It's different not just because it wants to concentrate the bits at
the high end (which the middle hash does not), but it's different
exactly because the whole calling convention is different: the final
hash returns a "small" value (in your version an "u32"), while the
hash in the middle very much does not.

So thinking that they are somehow related is wrong. It screws up
hash.h, and incorrectly conflates this middle hash_mix() thing with
the final hash_32/64() thing.

> It's a quite reasonable mix function.  One multiply (4 cycles or so) per
> 8 bytes.  It's definitely swamped by the byte-at-a-time string loading.

.. but it's not. Exactly because of the above.

Make it be a "hash_mix()" function. Make it use the multiply, by all
means. Same multiplier, even. BUT IT IS STILL NOT THE SAME FUNCTION,
for the above reason. One wants to return "u32", the other does not.

Really.

> But... the fundamental reason I didn't was that this is late -rc.
> I'm trying to fix a *bug* tglx found, not add risk and delay with a much
> more ambitious patch series.

Oh, this late in the rc we're not doing _any_ of this. I sent out my
suggested "late in the rc and for stable" patch that fixes the
practical problem we have, that has nothing to do with cleaning things
up.

> It *is* the FNV hash!  More specifically, FNV-1a done a word at a time;
> doing it byte at a time like most implementations would result in 8
> times as many multiplies and be slower.

I refuse to call that shit "word at a time".

It's "byte at a time with a slow conditional that will screw up your
branch predictor and a multiply in the middle".

A compiler might be able to turn it into some disgusting unrolled
thing that avoids some of the problems, but at no point is that a good
thing.

I seriously think that it would be

 (a) more readable

 (b) have a better mix function

if it was just kept as a byte-at-a-time thing entirely with the
per-byte mixing thing done better than just "shift by 8 bits".

And then at the end you could do a single hash_long().

That horrible svc thing needs to go.

It's all pretty moot, since we have a reasonable version that
actually does do word-at-a-time.

That can be improved too, I'm sure, but that svcauth garbage should
just be thrown out.

> The only difference is the exact multiplier.  The published FNV-1
> uses a low-bit-weight multiplier.  Since this is being done a word
> at a time, I think the stronger multiplier is worth it.

.. Considering that the "word-at-a-time" is likely *slower* than doing
it a byte-at-a-time the way it has been encoded, I'm not in the least
convinced.

> For example, partial_name_hash() is still used in many places
> even if the word-at-a-time code is used in namei.c.

Those places aren't actually all that performance-critical. They
really don't matter.


> Okay, let me list the problems...
>
> 1) hash_name() stops at a slash or a nul.  hash_str() only looks
>    for a nul.  Should I add a third variant?  Should that go in fs/namei,
>    or should the whole pole be moved elsewhere?

We'll never get rid of "hash_name()", it not only has that '/' case,
it's also inlined for a reason. You'd copy it without the magic for
'/' and turn that into str_hash() for others to use.

full_name_hash() can presumably be used pretty much as-is as mem_hash().

> 2) Some places need the length *and* the hash.  Calling strlen() and then
>    full_name_hash() somewhat defeats the purpose of word-at-a-time access.
>    hash_name returns both jammed into a 64-bit word.  Is that a good
>    interface in general?

Hmm. We actually have that exact case in the dentry code and
hash_name(), except we handle it by doing that "hashlen" thing that
contains both the length and the hash in one 64-bit thing.

Maybe we could expose that kind of interface, even if it's pretty ugly.
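
A rough sketch of what such an interface could look like (names are
illustrative, not the actual fs/namei.c macros): pack the 32-bit length and
the 32-bit hash into one 64-bit value so the pair travels in one register.

#include <stdint.h>

static inline uint64_t hashlen_pack(uint32_t hash, uint32_t len)
{
        return ((uint64_t)len << 32) | hash;
}

static inline uint32_t hashlen_hash(uint64_t hashlen)
{
        return (uint32_t)hashlen;
}

static inline uint32_t hashlen_len(uint64_t hashlen)
{
        return (uint32_t)(hashlen >> 32);
}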

>    Myself, I think the length should be computed at copy_from_user()
>    time and I'd like to look at each such call site and understand why
>    it *doesn't* have the length ready.  But that's a lot of work.

If it's copied from user space, we already do have the length.

You do realize that pathnames are different from pretty much every
other case in the kernel? There's a reason pathnames have their own
logic. The string length is very different from the component length,
and it turns out that component length and hash are generally critical.

> 3) They do particularly crappy mixing.  See that "RFC PATCH 4/2" I posted
>    that because I couldn't stand how bad it was.
>
>    If you don't have word at a time, the partial_name_hash() is decent,
>    but the word-at-a-time mixes by multiplying by 9.  So the hashes
>    of the strings "1.......0" and "0.......9" are identical.
>
>    (I assume this was chosen as the best available one-instruction (LEA)
>    mix operation due to an obsessive interest in speed in the dcache.)

The thing is, a lot of people tend to optimize performance (and
behavior) for large strings.

For path components, the most common lengths are less than a single
8-byte word! That "mixing" function almost doesn't matter, because the
case that matters the most (by far) are strings that fit in one or
_maybe_ two words.

Yes, things are slowly moving towards longer pathnames, but even with
long filenames, most component names are the directory entries that
still tend to be shorter. We still have the old 3-letter ones close to
the root, of course, but the statistics are still pretty heavily
skewed to <15 characters especially if you look at directory names.

So for pathnames, the mixing in the middle has tended to not be
_nearly_ as important as the final mix, for the simple reason that it
happens maybe once, often not at all.

>    More crappy mixing is the final folding.  On a 64-bit machine, the
>    high and low 32 bits are just XORed together.  So the hashes of
>    "deadbeef" and "beefdead" are identical.

Hmm? It uses "fold_hash()", which definitely doesn't do that.

Are you still looking at partial_name_hash() and friends? Don't. They
are garbage. They are used for things like case-insensitive
filesystems etc.

>    (If you have a particular cycle count budget in mind, I can come up with
>    something.)

The real issue I had in fs/namei.c is that link_path_walk() and
__d_lookup_rcu() are literally _the_ hottest functions in the kernel
under a few (common) loads, and for code generation things like
register pressure ends up mattering.

The reason that "hash_len" is a single 64-bit field rather than two
32-bit fields, for example, is that that way it takes one _register_
when we do the hash lookup. Some of that code was tuned to inline - and
_not_ inline in particular patterns.

Now, some of that tuning may have bitrotted, of course, so I'm not
saying it's necessarily doing great now. But some of that code was
tuned to not need a stack frame etc.

> 4) They don't do a final mix.  Again, obsessive interest in speed when
>    you're storing the whole 32 bits and comparing that.  For applications
>    that use only a few bits of hash index and need the entropy
>    "concentrated", should this be done outside?

I think the caller should do it, yes. There's a difference between
"calculate a hash for a string" and "turn that hash into a hashtable
lookup".

               Linus


* Re: [PATCH 1/2] <linux/hash.h>: Make hash_64(), hash_ptr() return 32 bits
  2016-05-02 21:19               ` Linus Torvalds
@ 2016-05-02 21:41                 ` Linus Torvalds
  2016-05-03  1:59                 ` George Spelvin
  1 sibling, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2016-05-02 21:41 UTC (permalink / raw)
  To: George Spelvin
  Cc: Bruce Fields, Eric Dumazet, Jeff Layton,
	Linux Kernel Mailing List, Linux NFS Mailing List, Rik van Riel,
	Thomas Gleixner

On Mon, May 2, 2016 at 2:19 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> The reason that "hash_len" is a single 64-bit field rather than two
> 32-bit fields, for example, is that that way it takes one _register_
> when we do the hash lookup. Some of that code was tuned to inline - and
> _not_ inline in particular patterns.

Actually, I think the tuning for no stack frame etc was mostly for the
permission testing with the selinux AVC code.

The filename hashing does have some of it too - like making sure that
the call to ->d_hash() is in an unlikely path etc and doesn't pollute
the case that actually matters. But I don't think any of it is simple
enough to avoid a stack frame.

The hash_len thing did make the innermost hash lookup loop smaller,
which was noticeable at some point.

What I really wanted to do there was actually have a direct-mapped "L1
dentry hash cache", that didn't have a hash loop at all (it would fall
back to the slow case for that). I had a patch for that (which worked
*beautifully*, partly because it also moved the hot entries to that
hash cache and thus effectively moved the active entries to the head
of the queue), but I couldn't get the L1 cache update to be coherent
without locking, which killed the thing.

Anyway, I suspect that your mixing function changes should be fine.
That link_path_walk() is important, but a couple of shifts and xors
shouldn't kill it.

             Linus


* Re: [PATCH 1/2] <linux/hash.h>: Make hash_64(), hash_ptr() return 32 bits
  2016-05-02 21:19               ` Linus Torvalds
  2016-05-02 21:41                 ` Linus Torvalds
@ 2016-05-03  1:59                 ` George Spelvin
  2016-05-03  3:01                   ` Linus Torvalds
  1 sibling, 1 reply; 43+ messages in thread
From: George Spelvin @ 2016-05-03  1:59 UTC (permalink / raw)
  To: linux, torvalds
  Cc: bfields, eric.dumazet, jlayton, linux-kernel, linux-nfs, riel, tglx

> Right. But there is no reason to think that that should be the same
> thing as the final hash.

Your logic is absolutely correct.  I agree with you.  The two operations
have different purposes, and should be thought of differently.

HOWEVER, if you can find one operation which is good enough to
serve both purposes, that's not evidence of incompetence.

I'm not saying hash_str is anything wonderful.  Just that it's not a
complete disaster.  (As a hash function.  Your comments about its
performance are addressed below.)

It's frankly a lot *better* than the hash computed by namei.c.
That one *is* perilously close to a complete disaster.


The main problem with the multiplicative hash (which describes both
hash_str() and hash_name()) is that input bits can only affect higher
state bits.  The lsbit of the 64-bit state is just the XOR of all the
input lsbits.  The lsbyte depends only on the lsbytes.

But that's not a disaster.  In hash_str(), we throw away the lsbits
at the end so their imperfect mixing doesn't matter.

What's bad is that the msbits of the inputs can only affect the
msbits of the hash.  The msbits of the inputs are just XORed together
into the msbit of the hash.  It's fundamentally impossible for
such a hash to detect any 2-bit change to the msbits of the input.

That said, you're right that most names aren't long enough (16 bytes on
little-endian) to *have* a second msbit to worry about.

> I refuse to call that shit "word at a time".
>
> It's "byte at a time with a slow conditional that will screw up your
> branch predictor and a multiply in the middle".

The *hashing* is word at a time.  The *fetching* is done byte at a time
and I agree that it doesn't qualify in any way at all.

But yeah, your point about the branch predictor penalty being worse than
the 7 saved multiplies is well taken.

Rule #1 of performance: Avoid lock contention at any cost.
Rule #2 of performance: Avoid cache misses unless it conflicts with #1.
(Rule 2a: Cache bouncing is a particularly pernicious form of cache missing.)
(Rule 2b: Locking, even uncontended, tends to cause cache bouncing.)
Rule #3 of performance: Avoid unpredictable branches.

>> For example, partial_name_hash() is still used in many places
>> even if the word-at-a-time code is used in namei.c.

> Those places aren't actually all that performance-critical.  They
> really don't matter.

AFAICT, none of the string-hash functions outside of fs/ are
on particularly hot paths.  The goal is deleting redundant code.

> We'll never get rid of "hash_name()", it not only has that '/' case,
> it's also inlined for a reason. You'd copy it without the magic for
> '/' and turn that into str_hash() for others to use.
>
> full_name_hash() can presumably be used pretty much as-is as mem_hash().

That part is obvious. I was just caught between two unpleasant
possibilities:

- Placing a str_hash() helper function in the middle of fs/namei.c which
  nothing in fs/namei.c actually calls, or
- Copying it to some outside file and then having to keep the
  two in sync.

Thinking about the function comments on each, the first seems less
embarrassing to write.

> Maybe we could expose that kind of interface, even if it's pretty ugly.

Yeah, that's pretty much what I thought.  I just hoped you had some
brilliant idea for avoiding it.

> The thing is, a lot of people tend to optimize performance (and
> behavior) for large strings.
>
> For path components, the most common lengths are less than a single
> 8-byte word! That "mixing" function almost doesn't matter, because the
>  case that matters the most (by far) are strings that fit in one or
> _maybe_ two words.

I'll remember that next time I look up
.git/objects/69/466b786e99a0a2d86f0cb99e0f4bb61588d13c

:-)

But yes, it makes your point about pathname components.
The first three are all a single word.

I just worry about people having directories full of PHP state
cookies.

> Hmm? It uses "fold_hash()", which definitely doesn't do that.

Mea culpa; that's a side effect of repeatedly grepping the kernel.
There was a *different* hash folding function in some other source file
elsewhere that did that, and I got the two mixed up in my memory.

The word-at-a-time one does "hash_64(x)", which doesn't have that problem.

(Maybe it was the IP checksum folding?)

>>    (If you have a particular cycle count budget in mind, I can come up with
>>    something.)

> The real issue I had in fs/namei.c is that link_path_walk() and
> __d_lookup_rcu() are literally _the_ hottest functions in the kernel
> under a few (common) loads, and for code generation things like
> register pressure ends up mattering.
> 
> The reason that "hash_len" is a single 64-bit field rather than two
> 32-bit fields, for example, is that that way it takes one _register_
> when we do the hash lookup. Some of that code was tuned to inline - and
> _not_ inline in particular patterns.
> 
> Now, some of that tuning may have bitrotted, of course, so I'm not
> saying it's necessarily doing great now. But some of that code was
> tuned to not need a stack frame etc.

Yes, and it's hard to make a general purpose helper out of code
that's tuned to piano wire tension like that.

I was thinking about very fast hash functions (2 or 3 cycles
per word), and the obvious solution is to use a second register.

As you pointed out above, the mixing function can take advantage of an
internal state which is larger than the final state, so the easy way to
minimize collisions without very expensive state mixing between words
is to give the bits more room to spread out.

I was playing with the idea of an ARX structure like the "Speck" block
cipher (https://en.wikipedia.org/wiki/Speck_%28cipher%29):

	h1 ^= a;
	h2 ^= h1; h1 = ROTL(h1, K);
	h1 += h2; h2 *= 9;

The "h2 *= 9" replaces "ROTL(h2, 3)" in Speck, achieves a little more
mixing, is one cycle on most machines, and is probably supported by
more functional units than a general barrel shift.

It's only one more cycle on the critical path than the current
"hash = (hash + a) * 9"... but it's one more register.


> I think the caller should do it, yes. There's a difference between
> "calculate a hash for a string" and "turn that hash into a hashtable
> lookup".

Makes sense, thanks.

Another thing worth doing is having a slightly less thorough folding
function than hash_64(x, 32).  (x >> 32) + (uint32)x * CONSTANT does
just as well if you're keeping all 32 bits.

This is basically the first half of my 32-bit hash_64().  Then if you
want less than 32 bits, you can do a second hash_32() on the result.
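
That is, something like this sketch, with the GOLDEN_RATIO_32 value from
earlier in the thread standing in for CONSTANT:

#include <stdint.h>

#define GOLDEN_RATIO_32 0x61C88647u

static inline uint32_t fold_64_to_32(uint64_t x)
{
        /* one 32-bit multiply plus an add; keeps all 32 output bits */
        return (uint32_t)(x >> 32) + (uint32_t)x * GOLDEN_RATIO_32;
}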

> Anyway, I suspect that your mixing function changes should be fine.
> That link_path_walk() is important, but a couple of shifts and xors
> shouldn't kill it.

Actually, that code
1) Uses an extra register for a shift temporary anyway, and
2) Has a 6-cycle critical path, which is pretty borderline.

The ARX code above seems like a more efficient use of two registers.

(It's also conveniently nonlinear, so if we want, feeding a salt into
h2 makes it less utterly trivial to force collisions.)

I'll play with that for a bit.  Thank you for mentioning those critical
functions; I'll check the x86 code generation for them to make sure it
doesn't get worse.



P.S. Here's a way to improve partial_name_hash on x86.
Compare the assembly for

unsigned long
pnh1(unsigned long c, unsigned long prevhash)
{
        return (prevhash + (c << 4) + (c >> 4)) * 11;
}

pnh1:
	movl    %eax, %ecx
        shrl    $4, %eax
        sall    $4, %ecx
        addl    %ecx, %eax
        addl    %eax, %edx
        leal    (%edx,%edx,4), %eax
        leal    (%edx,%eax,2), %eax
        ret

unsigned long
pnh2(unsigned long c, unsigned long prevhash)
{
        prevhash += c <<= 4;
        prevhash += c >> 8;
        return prevhash * 11;
}

pnh2:
        sall    $4, %eax
        addl    %eax, %edx
        shrl    $8, %eax
        addl    %eax, %edx
        leal    (%edx,%edx,4), %eax
        leal    (%edx,%eax,2), %eax
        ret

pnh1 doesn't know that "c" is really only 8 bits and so it doesn't need
to copy it to another register to compute the two shifted forms.


* Re: [PATCH 1/2] <linux/hash.h>: Make hash_64(), hash_ptr() return 32 bits
  2016-05-03  1:59                 ` George Spelvin
@ 2016-05-03  3:01                   ` Linus Torvalds
  0 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2016-05-03  3:01 UTC (permalink / raw)
  To: George Spelvin
  Cc: Bruce Fields, Eric Dumazet, Jeff Layton,
	Linux Kernel Mailing List, Linux NFS Mailing List, Rik van Riel,
	Thomas Gleixner

On Mon, May 2, 2016 at 6:59 PM, George Spelvin <linux@horizon.com> wrote:
>
> AFAICT, none of the string-hash functions outside of fs/ are
> on particularly hot paths.  The goal is deleting redundant code.

Yes, agreed.

>> We'll never get rid of "hash_name()", it not only has that '/' case,
>> it's also inlined for a reason. You'd copy it without the magic for
>> '/' and turn that into str_hash() for others to use.
>>
>> full_name_hash() can presumably be used pretty much as-is as mem_hash().
>
> That part is obvious. I was just caught between two unpleasant
> possibiites:
>
> - Placing a str_hash() helper function in the middle of fs/namei.c which
>   nothing in fs/namei.c actually calls, or
> - Copying it to some outside file and then having to keep the
>   two in sync.

So I don't think the "keep the two in sync" is necessarily all that problematic.

The word-at-a-time logic _used_ to be very specific to the name_hash()
code, but it got made generic enough over a few iterations that it's
actually a fairly reasonable pattern now.

So the code loop ends up being really not very complex:

        const struct word_at_a_time constants = WORD_AT_A_TIME_CONSTANTS;

        hash = a = 0;
        len = -sizeof(unsigned long);
        do {
                hash = (hash + a) * 9;
                len += sizeof(unsigned long);
                a = load_unaligned_zeropad(name+len);
                b = a ^ REPEAT_BYTE('/');
        } while (!(has_zero(a, &adata, &constants) |
                   has_zero(b, &bdata, &constants)));

and with your suggested "hash_mix()" function (may I suggest we just
make it take both the old hash and the new value as two separate
arguments, and it can choose how to mix them), there remains pretty
much nothing to keep in sync.

Yes, there's the tail part, but that ends up being pretty simple too.

The non-path-component case (so checking only for an ending NUL
character, not the '/') ends up being exactly the same, except all the
'b' logic goes away, so you end up with

        hash = a = 0;
        len = -sizeof(unsigned long);
        do {
                hash = hash_mix(hash, a);
                len += sizeof(unsigned long);
                a = load_unaligned_zeropad(name+len);
        } while (!has_zero(a, &adata, &constants));

which really seems to not be a pain. Very simple, in fact. There's
hardly any code left to keep in sync: any tweaking of the hash would
happen by tweaking the hash_mix()

The tail part would presumably be:

        adata = prep_zero_mask(a, adata, &constants);

        mask = create_zero_mask(adata);
        return hash_mix(hash, a & zero_bytemask(mask));

and you're done (or we could perhaps decide that the last mix is fine
just doing a single add? We will have mixed up all previous hash
values, so that last "hash_mix()" might just be a simple addition).

Yes, if we want to return some mix of the length and the hash, we'd
have to play those hashlen games, but I suspect the only case that
*really* cares deeply about that is the dentry hash case, and we'd
just keep it separate.

In other words, I think just the addition of your "hash_mix()" helper
is enough to abstract things out enough that there really is nothing
left but tying all those pieces together, and no maintenance burden
from having "two copies" of that tying-togetherness.

>> For path components, the most common lengths are less than a single
>> 8-byte word! That "mixing" function almost doesn't matter, because the
>>  case that matters the most (by far) are strings that fit in one or
>> _maybe_ two words.
>
> I'll remember that next time I look up
> .git/objects/69/466b786e99a0a2d86f0cb99e0f4bb61588d13c
>
> :-)

Yes, they happen, but when people have pathnames like that, your
hashing function generally isn't going to much matter.

Except you absolutely want to avoid 8-bit and 4-bit boundaries when
mixing. The "*9" we have now does that, we had a *11 in an earlier
incarnation (but that was coupled with shifting right too - I think
our old hash remains in the partial_name_hash())

I do agree that it's not a great hash mixing function, but I don't
think it's been particularly problematic either. I did run my whole
filesystem through the hash at some point just to verify, and the
statistics seemed fairly balanced.

But yes, I think your hash_mix() function is likely a noticeable
improvement from a hashing standpoint.  And while it may not be all
that noticeable for path components that are usually pretty short, if
we extend this code to be a generic string hash then the whole "three
characters is the most common component length" argument goes away ;)

> I was playing with the idea of an ARX structure like the "Speck" block
> cipher (https://en.wikipedia.org/wiki/Speck_%28cipher%29):
>
>         h1 ^= a;
>         h2 ^= h1; h1 = ROTL(h1, K);
>         h1 += h2; h2 *= 9;
>
> The "h2 *= 9" replaces "ROTL(h2, 3)" in Speck, achieves a little more
> mixing, is one cycle on most machines, and is probably supported by
> more functional units than a general barrel shift.
>
> It's only one more cycle on the critical path than the current
> "hash = (hash + a) * 9"... but it's one more register.

Try it.

I wouldn't worry too much about 32-bit x86 any more (we want it to not
suck horribly, but it's not the primary target for anybody who cares
about best performance). But x86-64 code generation is worth
looking at. The register pressure issue is still real, but it's not
quite as bad as the old 32-bit code.

The other main architectures that it would be good to verify are ok
are ARMv8 and powerpc.

> P.S. Here's a way to improve partial_name_hash on x86.
> Compare the assembly for
>
> unsigned long
> pnh1(unsigned long c, unsigned long prevhash)
> {
>         return (prevhash + (c << 4) + (c >> 4)) * 11;
> }
>
> pnh1:
>         movl    %eax, %ecx
>         shrl    $4, %eax
>         sall    $4, %ecx
>         addl    %ecx, %eax
>         addl    %eax, %edx
>         leal    (%edx,%edx,4), %eax
>         leal    (%edx,%eax,2), %eax
>         ret
>
> unsigned long
> pnh2(unsigned long c, unsigned long prevhash)
> {
>         prevhash += c <<= 4;
>         prevhash += c >> 8;
>         return prevhash * 11;
> }
>
> pnh2:
>         sall    $4, %eax
>         addl    %eax, %edx
>         shrl    $8, %eax
>         addl    %eax, %edx
>         leal    (%edx,%edx,4), %eax
>         leal    (%edx,%eax,2), %eax
>         ret
>
> pnh1 doesn't know that "c" is really only 8 bits and so it doesn't need
> to copy it to another register to compute the two shifted forms.

Well, if I cared about the partial_name_hash() (which I don't), I'd
suggest you just convince the compiler to generate

  rolb $4,%al

instead of two shifts, and just be done with it.

You might find that you end up with partial register write stalls,
though, so you might have to add a "movzbq %al,%rax" to get around
those.

In fact, it looks like you can get gcc to generate that code:

    unsigned long pnh3(unsigned char c, unsigned long prevhash)
    {
        c = (c << 4) | (c >> 4);
        return (prevhash + c)*11;
    }

generates

    pnh3:
        rolb $4, %dil
        movzbl %dil, %edi
        addq %rdi, %rsi
        leaq (%rsi,%rsi,4), %rax
        leaq (%rsi,%rax,2), %rax
        ret

which is one instruction less than your pnh2, but should perform even
better because I think "movzbl" ends up being done as a rename and
mask in the microarchitecture - without any actual ALU costs or added
latency.

But I suspect it can be a bit fragile to get gcc to actually generate
that rotate instruction. I could easily see that being inlined, and
then the pattern that turns into a rotate instruction gets perturbed
enough that gcc no longer generates the rot..

                  Linus


* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-05-01 16:51       ` Linus Torvalds
@ 2016-05-14  3:54         ` George Spelvin
  2016-05-14 18:35           ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: George Spelvin @ 2016-05-14  3:54 UTC (permalink / raw)
  To: torvalds; +Cc: eric.dumazet, linux, linux-kernel, riel, tglx

On May 1, 2016 at 9:51 AM, Linus Torvalds wrote:
> On Sun, May 1, 2016 at 2:43 AM, George Spelvin <linux@horizon.com> wrote:
>> * If you feel ambitious, add a 32-bit CONFIG_ARCH_HAS_SLOW_MULTIPLIER
>>   exception path.

> Let's make that a separate worry, and just fix hash_64() first.
> 
> In particular, that means "let's not touch GOLDEN_RATIO_32 yet". I
> suspect that when we *do* change that value, we do want the
> non-multiplying version you had.

I've been working on this, and just a brief status update: it's definitely
one of those rabbit holes.

There are exactly three architectures which (some models) don't have
an efficient 32x32->32-bit multiply:

- arch/m68k: MC68000 (and 68010 and 68328) no-mmu
- arch/h8300: Most (all?) of the H8 processor series
- arch/microblaze: Depending on Verilog compilation options

The thing is, they all don't have a barrel shifter, either.
Indeed, only the m68k even has multi-bit shift instructions.

So the upshot is that it's not clear that shift-and-add is a whole lot
better.  Working out the timing on the 68000, I can beat the multiply
code, but not by much.

So I'm working on arch-specific solutions for those three cases.

H8 and 68000 have 16x16->32-bit multiplies, which can be used to make a
reasonable hash function (some H8 models can multiply faster than they
can shift!), but if you configure a Microblaze with neither multiplier
nor barrel shifter (which arch/microblaze/Kconfig.platform lets you do),
I have no idea what to do.
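
For the 16x16->32 case, a sketch (not from the patch series) of computing
x * GOLDEN_RATIO_32 mod 2^32 from three 16x16 products; the high*high term
shifts out entirely:

#include <stdint.h>

#define K32 0x61C88647u                 /* GOLDEN_RATIO_32 */

static uint32_t hash_32_mul16(uint32_t x)
{
        uint32_t xl = x & 0xffff, xh = x >> 16;
        uint32_t kl = K32 & 0xffff, kh = K32 >> 16;

        /* (x * K32) mod 2^32, using only 16x16->32 multiplies */
        return xl * kl + ((xl * kh + xh * kl) << 16);
}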


* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-05-14  3:54         ` George Spelvin
@ 2016-05-14 18:35           ` Linus Torvalds
  0 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2016-05-14 18:35 UTC (permalink / raw)
  To: George Spelvin
  Cc: Eric Dumazet, Linux Kernel Mailing List, Rik van Riel, Thomas Gleixner

On Fri, May 13, 2016 at 8:54 PM, George Spelvin <linux@horizon.com> wrote:
>
> There are exactly three architectures which (some models) don't have
> an efficient 32x32->32-bit multiply:
>
> - arch/m68k: MC68000 (and 68010 and 68328) no-mmu
> - arch/h8300: Most (all?) of the H8 processor series
> - arch/microblaze: Depending on Verilog compilation options

I wouldn't worry about it too much.

The architectures where performance really matters are x86, ARM and powerpc.

The rest need to *work* and not suck horribly, but we're not going to
try to do cycle counting for them. It's not worth the pain.

If an architecture doesn't have a barrel shifter, it's not going to
have fast hash functions.

So I'd be ok with just saying "32-bit architectures are going to use a
multiply with non-sparse bits". Not a problem.

We do want to make sure that hash_64 isn't totally disgusting on
32-bit architectures (but that's a fairly rare case), so we probably
do want to have that function fall back on something else than a 64x64
multiply on a 32-bit architecture. Presumably just "mix the two 32-bit
words into one, then use hash_32() on that" is good enough.
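
A sketch of that fallback (illustrative, not a committed implementation):
hash the high half with the 32-bit multiply, fold it into the low half,
then finish the way hash_32() does.

#include <stdint.h>

#define GOLDEN_RATIO_32 0x61C88647u

static inline uint32_t __hash_32(uint32_t x)
{
        return x * GOLDEN_RATIO_32;
}

static inline uint32_t hash_64_on_32bit(uint64_t val, unsigned int bits)
{
        /* mix the two 32-bit halves into one word, then hash_32() it;
         * bits must be in 1..32 */
        uint32_t folded = (uint32_t)val ^ __hash_32((uint32_t)(val >> 32));

        return __hash_32(folded) >> (32 - bits);
}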

                     Linus


* Re: [RFC PATCH 4/2] namei: Improve hash mixing if CONFIG_DCACHE_WORD_ACCESS
  2016-05-02 10:31           ` [RFC PATCH 4/2] namei: Improve hash mixing if CONFIG_DCACHE_WORD_ACCESS George Spelvin
@ 2016-05-16 18:51             ` Linus Torvalds
  0 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2016-05-16 18:51 UTC (permalink / raw)
  To: George Spelvin
  Cc: Linux Kernel Mailing List, Thomas Gleixner, Eric Dumazet, Rik van Riel

On Mon, May 2, 2016 at 3:31 AM, George Spelvin <linux@horizon.com> wrote:
> The hash mixing between adding the next 64 bits of name
> was just a bit weak.
>
> Replaced with a still very fast but slightly more effective
> mixing function.

I'e applied this patch independently of all your other hash rework to my tree.

I verified that the code generation for the inner loop is still fine,
and it does look like a much better mixing function, as well as just
clean up the code.

I hope to get new versions of the actual <linux/hash.h> fixes during
this merge window from you.

Thanks,

              Linus


* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-04-29  2:25       ` Linus Torvalds
  2016-04-30 13:02         ` Thomas Gleixner
@ 2016-06-12 12:18         ` Sandy Harris
  1 sibling, 0 replies; 43+ messages in thread
From: Sandy Harris @ 2016-06-12 12:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, LKML, Peter Zijlstra, Ingo Molnar,
	Andrew Morton, Sebastian Andrzej Siewior, Darren Hart,
	Michael Kerrisk, Davidlohr Bueso, Chris Mason,
	Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Thu, Apr 28, 2016 at 10:25 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:

> It's the hashes that _look_ like they might be good hashes, but
> there's not a lot of analysis behind it, that I would worry about. The
> simple prime modulus _should_ be fine, but at the same time I kind of
> suspect we can do better. Especially since it has two multiplications.
>
> Looking around, there's
>
>     http://burtleburtle.net/bob/hash/integer.html
>
> and that 32-bit "full avalanche" hash in six shifts looks like it
> could be better. You wouldn't want to inline it, but the point of a
> full avalanche bit mixing _should_ be that you could avoid the whole
> "upper bits" part, and it should work independently of the target set
> size.
>
> So if that hash works better, it would be a pretty good replacement
> option for hash_int().
>
> There is also
>
>     https://gist.github.com/badboy/6267743
>
> that has a 64 bit to 32 bit hash function that might be useful for
> "hash_long()".
>
> Most of the people who worry about hashes tend to have strings to
> hash, not just a single word like a pointer, but there's clearly
> people around who have tried to search for good hashes that really
> spread out the bits.
>
>                 Linus

Here's another possibility, from my GPL code at:
https://github.com/sandy-harris/maxwell

Not very efficient -- two each of 32-bit multiply & modulo
in most cases -- but provably good mixing.

/*
    Quasi-Hadamard transform
    My own invention

    Goal is to mix a 32-bit object
    so that each output bit depends
    on every input bit

    Underlying primitive is IDEA
    multiplication which mixes
    a pair of 16-bit objects

    This is analogous to the
    pseudo-Hadamard transform
    (PHT) originally from the
    SAFER cipher, later in
    Twofish and others

    Conceptually, a two-way PHT
    on a,b is:

    x = a + b
    y = a + 2b
    a = x
    b = y

    This is reversible; it loses
    no information. Each output
    word depends on both inputs.

    A PHT can be implemented as

    a += b
    b += a

    which is faster and avoids
    using intermediate variables

    QHT is the same thing using
    IDEA multiplication instead
    of addition, calculating a*b
    and a*b^2 instead of a+b and
    a+2b

    IDEA multiplication operates
    on 16-bit chunks and makes
    every output bit depend on
    all input bits. Therefore
    QHT is close to an ideal
    mixer for 32-bit words.
*/

u32 qht(u32 x)
{
    u32 a, b ;
    a = x >> 16 ;        // high 16 bits
    b = x & 0xffff ;    // low 16
    a = idea(a,b) ;        // a *= b
    b = idea(a,b) ;        // b *= a
    return( (a<<16) | b) ;
}

/*
    IDEA multiplication
    borrowed from the IDEA cipher
*/
#define    MAX (1<<16)
#define MOD (MAX+1)

u32 idea(u32 a, u32 b)
{
    u32 x ;
    // make sure they are in range
    a %= MOD ;
    b %= MOD ;
    // special cases
    if( (a == 0) && (b == 0))
        return(1) ;
    else if( a == 0)
        return(MOD - b) ;
    else if( b == 0)
        return(MOD - a) ;
    // non-special
    else    {
        x = (a*b) % MOD ;
        if(x == MAX)
            return(0) ;
        else    return(x) ;
    }
}


* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-04-29 23:51       ` Linus Torvalds
  2016-04-30  1:34         ` Rik van Riel
@ 2016-05-02  9:39         ` Torvald Riegel
  1 sibling, 0 replies; 43+ messages in thread
From: Torvald Riegel @ 2016-05-02  9:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, Rik van Riel, LKML, Peter Zijlstra, Ingo Molnar,
	Andrew Morton, Sebastian Andrzej Siewior, Darren Hart,
	Michael Kerrisk, Davidlohr Bueso, Chris Mason,
	Carlos O'Donell, Eric Dumazet

On Fri, 2016-04-29 at 16:51 -0700, Linus Torvalds wrote:
> On Fri, Apr 29, 2016 at 2:10 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > On Thu, Apr 28, 2016 at 11:32 AM, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> >>
> >> hash_long/ptr really shouldn't care about the low bits being zero at
> >> all, because it should mix in all the bits (using a prime multiplier
> >> and taking the high bits).
> >
> > Looking at this assertion, it doesn't actually pan out very well at all.
> >
> > The 32-bit hash looks to be ok. Certainly not perfect, but not horrible.
> >
> > The 64-bit hash seems to be quite horribly bad with lots of values.
> 
> Ok, I have tried to research this a bit more. There's a lot of
> confusion here that causes the fact that hash_64() sucks donkey balls.
> 
> The basic method for the hashing method by multiplication is fairly
> sane. It's well-documented, and attributed to Knuth as the comment
> above it says.

Section 2.2 of this article might also be of interest:

M. Dietzfelbinger, T. Hagerup, J. Katajainen, and M. Penttonen.
A Reliable Randomized Algorithm for the Closest-Pair Problem.
Journal of Algorithms, 25(1):19–51, 1997.

I don't know much about hash functions, but my understanding of this is
that one can do good hashing of words by multiplying with a random
number and using the most-significant bits of the lower-significant word
of the result.  Different random numbers may work better than others for
the input data (and some might be really awful), but the paper seems to
claim that one *can* find a good random number for all input data.  In
practice, this might mean that re-hashing with a different random number
after seeing too many conflicts in a hash table should eventually lead
to a good hashing (ie, because the *class* of such hash functions is
universal).
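
A sketch of that multiply-shift scheme for w = 32 bits: multiply by a
randomly chosen odd word and keep the most significant M bits of the low
word of the product.  The multiplier below is only an example value.

#include <stdint.h>

static inline uint32_t mul_shift_hash(uint32_t x, uint32_t a, unsigned int M)
{
        /* a must be odd; M in 1..32; returns a value in [0, 2^M) */
        return (uint32_t)(a * x) >> (32 - M);
}

/* e.g. bucket = mul_shift_hash(key, 0x9E3779B1u, 10); */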


* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-04-30 17:12             ` Linus Torvalds
@ 2016-04-30 17:37               ` Eric Dumazet
  0 siblings, 0 replies; 43+ messages in thread
From: Eric Dumazet @ 2016-04-30 17:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, LKML, Peter Zijlstra, Ingo Molnar,
	Andrew Morton, Sebastian Andrzej Siewior, Darren Hart,
	Michael Kerrisk, Davidlohr Bueso, Chris Mason,
	Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Sat, 2016-04-30 at 10:12 -0700, Linus Torvalds wrote:
> On Sat, Apr 30, 2016 at 9:45 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > I use hash_32() in net/sched/sch_fq.c, for all packets sent by Google
> > servers. (Note that I did _not_ use hash_ptr())
> >
> > That's gazillions of packets per second, and the current multiply worked
> just fine in terms of hash spreading.
> 
> So hash_32() really is much better than hash_64(). I think we'll tweak
> it a bit, but largely leave it alone.
> 
> The 64-bit case needs to be tweaked a _lot_.

Agreed ;)

> 
> For the 32-bit case, I like the one that George Spelvin suggested:
> 
>    #define GOLDEN_RATIO_32 0x61c88647      /* phi^2 = 1-phi */
> 
> because of his slow multiplier fallback version that we could also use:
> 
>   /* Returns x * GOLDEN_RATIO_32 without a hardware multiplier */
>   unsigned hash_32(unsigned x)
>   {
>           unsigned y, z;
>                                 /* Path length */
>           y = (x << 19) + x;      /* 1 shift + 1 add */
>           z = (x << 9) + y;       /* 1 shift + 2 add */
>           x = (x << 23) + z;      /* 1 shift + 3 add */
>           z = (z << 8) + y;       /* 2 shift + 3 add */
>           x = (x << 6) - x;       /* 2 shift + 4 add */
>           return (z << 3) + x;    /* 3 shift + 4 add */
>   }
> 
> and I don't think that we really need the several big constants with
> the fancy "full cascade" function.
> 
> If you have a test-case for that sch_fq.c case, it might be a good
> idea to test the above GOLDEN_RATIO_32 value, but quite frankly, I
> don't see any way it would be materially different from the one we use
> now. It does avoid that long series of zeroes in the low bits, but
> that's actually not a huge problem for the 32-bit hash to begin with.
> It's not nearly as long a series (or in the wrong bit positions) as
> the 64-bit hash multiplier value had.
> 
> Also, I suspect that since you hash the kernel "struct sock" pointers,
> you actually never get the kinds of really bad patterns that Thomas
> had.
> 
> But maybe you use hash_32() on a pointer because you noticed that
> hash_long() or hash_ptr() (which use hash_64 on 64-bit architectures,
> and would have been more natural) gave worse performance?

Not at all.

At the time I did sch_fq (for linux-3.12), hash_64() was not using a
multiply yet, just that long series of shifts, adds and subs.

I used hash_32() because it was simply faster on my servers.

You added this multiply in linux-3.17, and I did not notice at that
time.


> 
> Maybe you thought that it was the bigger multiply that caused the
> performance problems? If you did performance work, I suspect it really
> could have been that hash_64() did a bad job for you.

Really not ;)

I could test hash_64(), more entropy can not be bad.


* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-04-30 16:45           ` Eric Dumazet
@ 2016-04-30 17:12             ` Linus Torvalds
  2016-04-30 17:37               ` Eric Dumazet
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2016-04-30 17:12 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Thomas Gleixner, LKML, Peter Zijlstra, Ingo Molnar,
	Andrew Morton, Sebastian Andrzej Siewior, Darren Hart,
	Michael Kerrisk, Davidlohr Bueso, Chris Mason,
	Carlos O'Donell, Torvald Riegel, Eric Dumazet

On Sat, Apr 30, 2016 at 9:45 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> I use hash_32() in net/sched/sch_fq.c, for all packets sent by Google
> servers. (Note that I did _not_ use hash_ptr())
>
> That's gazillions of packets per second, and the current multiply worked
> just fine in terms of hash spreading.

So hash_32() really is much better than hash_64(). I think we'll tweak
it a bit, but largely leave it alone.

The 64-bit case needs to be tweaked a _lot_.

For the 32-bit case, I like the one that George Spelvin suggested:

   #define GOLDEN_RATIO_32 0x61c88647      /* phi^2 = 1-phi */

because of his slow multiplier fallback version that we could also use:

  /* Returns x * GOLDEN_RATIO_32 without a hardware multiplier */
  unsigned hash_32(unsigned x)
  {
          unsigned y, z;
                                /* Path length */
          y = (x << 19) + x;      /* 1 shift + 1 add */
          z = (x << 9) + y;       /* 1 shift + 2 add */
          x = (x << 23) + z;      /* 1 shift + 3 add */
          z = (z << 8) + y;       /* 2 shift + 3 add */
          x = (x << 6) - x;       /* 2 shift + 4 add */
          return (z << 3) + x;    /* 3 shift + 4 add */
  }

and I don't think that we really need the several big constants with
the fancy "full cascade" function.

If you have a test-case for that sch_fq.c case, it might be a good
idea to test the above GOLDEN_RATIO_32 value, but quite frankly, I
don't see any way it would be materially different from the one we use
now. It does avoid that long series of zeroes in the low bits, but
that's actually not a huge problem for the 32-bit hash to begin with.
It's not nearly as long a series (or in the wrong bit positions) as
the 64-bit hash multiplier value had.

Also, I suspect that since you hash the kernel "struct sock" pointers,
you actually never get the kinds of really bad patterns that Thomas
had.

But maybe you use hash_32() on a pointer because you noticed that
hash_long() or hash_ptr() (which use hash_64 on 64-bit architectures,
and would have been more natural) gave worse performance?

Maybe you thought that it was the bigger multiply that caused the
performance problems? If you did performance work, I suspect it really
could have been that hash_64() did a bad job for you.

                 Linus


* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-04-30 13:02         ` Thomas Gleixner
@ 2016-04-30 16:45           ` Eric Dumazet
  2016-04-30 17:12             ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Eric Dumazet @ 2016-04-30 16:45 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linus Torvalds, LKML, Peter Zijlstra, Ingo Molnar, Andrew Morton,
	Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk,
	Davidlohr Bueso, Chris Mason, Carlos O'Donell,
	Torvald Riegel, Eric Dumazet

On Sat, 2016-04-30 at 15:02 +0200, Thomas Gleixner wrote:

> Yes. So I tested those two:
> 
> u32 hash_64(u64 key)
> {
>        key  = ~key + (key << 18);
>        key ^= key >> 31;
>        key += (key << 2) + (key << 4);
>        key ^= key >> 11;
>        key += key << 6;
>        key ^= key >> 22;
>        return (u32) key;
> }
> 
> u32 hash_32(u32 key)
> {
>        key = (key + 0x7ed55d16) + (key << 12);
>        key = (key ^ 0xc761c23c) ^ (key >> 19);
>        key = (key + 0x165667b1) + (key <<  5);
>        key = (key + 0xd3a2646c) ^ (key <<  9);
>        key = (key + 0xfd7046c5) + (key <<  3);
>        key = (key ^ 0xb55a4f09) ^ (key >> 16);
>        return key;
> }
> 
> They are really good and the results are similar to the simple modulo prime
> hash. hash_64 is slightly faster than the modulo prime as it does not have the
> multiplication.
> 
> I'll send a patch to replace hash_64 and hash_32.
> 
> Text size:
> 	   x86_64	i386	arm
> hash_64	   88		148	128
> hash_32	   88		84	112
> 
> So probably slightly too large to inline.

I use hash_32() in net/sched/sch_fq.c, for all packets sent by Google
servers. (Note that I did _not_ use hash_ptr())

That's gazillions of packets per second, and the current multiply worked
just fine in terms of hash spreading.

Are you really going to use something which looks much slower ?

u32 hash_32(u32 key)
{
        key = (key + 0x7ed55d16) + (key << 12);
        key = (key ^ 0xc761c23c) ^ (key >> 19);
        key = (key + 0x165667b1) + (key <<  5);
        key = (key + 0xd3a2646c) ^ (key <<  9);
        key = (key + 0xfd7046c5) + (key <<  3);
        key = (key ^ 0xb55a4f09) ^ (key >> 16);
        return key;
}

Probably having a simple multiple when ARCH_HAS_FAST_MULTIPLIER is
defined might be good enough, eventually by choosing a better
GOLDEN_RATIO_PRIME_32


* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-04-29 21:10     ` Linus Torvalds
  2016-04-29 23:51       ` Linus Torvalds
@ 2016-04-30 15:22       ` Thomas Gleixner
  1 sibling, 0 replies; 43+ messages in thread
From: Thomas Gleixner @ 2016-04-30 15:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Andrew Morton,
	Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk,
	Davidlohr Bueso, Chris Mason, Carlos O'Donell,
	Torvald Riegel, Eric Dumazet

On Fri, 29 Apr 2016, Linus Torvalds wrote:
> Picking a new value almost at random (I say "almost", because I just
> started with that 32-bit multiplicand value that mostly works and
> shifted it up by 32 bits and then randomly added a few more bits to
> avoid long ranges of ones and zeroes), I picked
> 
>   #define GOLDEN_RATIO_PRIME_64 0x9e3700310c100d01UL
> 
> and it is *much* better in my test harness.
> 
> Of course, things like that depend on what patterns you test, But I
> did have a "range of strides and hash sizes" I tried. So just for fun:
> try changing GOLDEN_RATIO_PRIME_64 to that value, and see if the
> absolutely _horrid_ page-aligned case goes away for you?

It solves that horrid case:

   https://tglx.de/~tglx/f-ops-h64-t.png

It's faster than the shift-based version, but the degradation with
hyperthreading is slightly worse.

Here for comparison the 64bit -> 32 shift version

  https://tglx.de/~tglx/f-ops-wang32-t.png

  FYI, that works way better than the existing shift machinery in hash_64

and the modulo prime one:

  https://tglx.de/~tglx/f-ops-mod-t.png

> It really looks like those multiplication numbers were very very badly picked.

Indeed.
 
> Still, that number doesn't do very well if the hash is small (say, 8
> bits).

I'm still waiting for the other test to complete. Will send numbers later
today.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-04-29  2:25       ` Linus Torvalds
@ 2016-04-30 13:02         ` Thomas Gleixner
  2016-04-30 16:45           ` Eric Dumazet
  2016-06-12 12:18         ` Sandy Harris
  1 sibling, 1 reply; 43+ messages in thread
From: Thomas Gleixner @ 2016-04-30 13:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Andrew Morton,
	Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk,
	Davidlohr Bueso, Chris Mason, Carlos O'Donell,
	Torvald Riegel, Eric Dumazet

On Thu, 28 Apr 2016, Linus Torvalds wrote:
> It's the hashes that _look_ like they might be good hashes, but
> there's not a lot of analysis behind it, that I would worry about. The
> simple prime modulus _should_ be fine, but at the same time I kind of
> suspect we can do better. Especially since it has two multiplications.
> 
> Looking around, there's
> 
>     http://burtleburtle.net/bob/hash/integer.html
> 
> and that 32-bit "full avalanche" hash in six shifts looks like it
> could be better. You wouldn't want to inline it, but the point of a
> full avalanche bit mixing _should_ be that you could avoid the whole
> "upper bits" part, and it should work independently of the target set
> size.

Yes. So I tested those two:

u32 hash_64(u64 key)
{
       key  = ~key + (key << 18);
       key ^= key >> 31;
       key += (key << 2) + (key << 4);
       key ^= key >> 11;
       key += key << 6;
       key ^= key >> 22;
       return (u32) key;
}

u32 hash_32(u32 key)
{
       key = (key + 0x7ed55d16) + (key << 12);
       key = (key ^ 0xc761c23c) ^ (key >> 19);
       key = (key + 0x165667b1) + (key <<  5);
       key = (key + 0xd3a2646c) ^ (key <<  9);
       key = (key + 0xfd7046c5) + (key <<  3);
       key = (key ^ 0xb55a4f09) ^ (key >> 16);
       return key;
}

They are really good and the results are similar to the simple modulo prime
hash. hash_64 is slightly faster than the modulo prime hash as it does not
have the multiplication.

I'll send a patch to replace hash_64 and hash_32.

Text size:
	   x86_64	i386	arm
hash_64	   88		148	128
hash_32	   88		84	112

So probably slightly too large to inline.

Thanks,

	tglx
	

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-04-30  1:12         ` Linus Torvalds
@ 2016-04-30  3:04           ` George Spelvin
  0 siblings, 0 replies; 43+ messages in thread
From: George Spelvin @ 2016-04-30  3:04 UTC (permalink / raw)
  To: linux, torvalds; +Cc: linux-kernel, tglx

> Not doing it for 64-bit constants makes no sense if it just uses the
> trivial Booth's algorithm version.

AFAICT, gcc 5 *does* optimize 64-bit multiplies by constants.
Does the belief that it doesn't date back to some really old
version?

There's still a threshold where it just punts to the multiplier.

Some examples, x86-64 (gcc 6.0.1) and aarch64 (gcc 5.3.1).
Note the difference in the multiply-by-12345 routine.

	return x*9;
mul9:
        leaq    (%rdi,%rdi,8), %rax
        ret
mul9:
        add     x0, x0, x0, lsl 3
        ret

	return x*10;
mul10:
        leaq    (%rdi,%rdi,4), %rax
        addq    %rax, %rax
        ret
mul10:
        lsl     x1, x0, 3
        add     x0, x1, x0, lsl 1
        ret

	return x*127;
mul127:
        movq    %rdi, %rax
        salq    $7, %rax
        subq    %rdi, %rax
        ret
mul127:
        lsl     x1, x0, 7
        sub     x0, x1, x0
        ret

	return x*12345;
mul12345:
        imulq   $12345, %rdi, %rax
        ret
mul12345:
        lsl     x1, x0, 3
        sub     x1, x1, x0
        lsl     x1, x1, 1
        sub     x1, x1, x0
        lsl     x1, x1, 3
        sub     x1, x1, x0
        lsl     x1, x1, 3
        sub     x0, x1, x0
        lsl     x1, x0, 4
        sub     x0, x1, x0
        ret

        uint64_t y = (x << 9) - (x << 3) + x;
        return x + (x << 14) - (y << 3);
mul12345_manual:
        movq    %rdi, %rax
        movq    %rdi, %rdx
        salq    $14, %rax
        salq    $9, %rdx
        addq    %rdi, %rax
        addq    %rdi, %rdx
        salq    $3, %rdi
        subq    %rdi, %rdx
        salq    $3, %rdx
        subq    %rdx, %rax
        ret
mul12345_manual:
        lsl     x2, x0, 9
        lsl     x1, x0, 14
        add     x2, x2, x0
        add     x1, x1, x0
        sub     x0, x2, x0, lsl 3
        sub     x0, x1, x0, lsl 3
        ret

	return x*2654435769:
mul2654435769:
        movl    $2654435769, %eax
        imulq   %rdi, %rax
        ret
mul2654435769:
        mov     x1, 31161
        movk    x1, 0x9e37, lsl 16
        mul     x0, x0, x1
        ret

The problem with variant code paths like mul12345_manual is that the
infrastructure required to determine which to use is many times larger
than the code itself.  :-(

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-04-29 23:51       ` Linus Torvalds
@ 2016-04-30  1:34         ` Rik van Riel
  2016-05-02  9:39         ` Torvald Riegel
  1 sibling, 0 replies; 43+ messages in thread
From: Rik van Riel @ 2016-04-30  1:34 UTC (permalink / raw)
  To: Linus Torvalds, Thomas Gleixner
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Andrew Morton,
	Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk,
	Davidlohr Bueso, Chris Mason, Carlos O'Donell,
	Torvald Riegel, Eric Dumazet

[-- Attachment #1: Type: text/plain, Size: 1130 bytes --]

On Fri, 2016-04-29 at 16:51 -0700, Linus Torvalds wrote:

> There's presumably a few optimal values from a "spread bits out
> evenly" standpoint, and they won't have anything to do with random
> irrational constants, and will have everything to do with having nice
> bitpatterns.
> 
> I'm adding Rik to the cc, because the original broken constants came
> from him long long ago (they go back to 2002, originally only used for
> the waitqueue hashing). Maybe he had some other input that caused him
> to believe that the primeness actually mattered.

I do not remember where I got that hashing algorithm and
magic constant from 14 years ago, but the changelog suggests
I got it from Chuck Lever's paper.

Chuck Lever's paper does mention that primeness "adds
certain desirable qualities", and I may have read too
much into that.

I really do not remember the "bit sparse" thing at all, and have no
idea where that came from. Googling old email threads about the code
mostly makes me wonder "hey, where did that person go?"

I am all for magic numbers that work better.

-- 
All Rights Reversed.


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-04-30  0:32       ` George Spelvin
@ 2016-04-30  1:12         ` Linus Torvalds
  2016-04-30  3:04           ` George Spelvin
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2016-04-30  1:12 UTC (permalink / raw)
  To: George Spelvin; +Cc: Linux Kernel Mailing List, Thomas Gleixner

On Fri, Apr 29, 2016 at 5:32 PM, George Spelvin <linux@horizon.com> wrote:
>
> Odd is important.  If the multiplier is even, the msbit of the input
> doesn't affect the hash result at all.

Fair enough. My test-set was incomplete.

>> Yeah. gcc will actually do the clever stuff for the 32-bit case, afaik.
>
> It's not as clever as it could be; it just does the same Booth
> recoding thing, a simple series of shifts with add/subtract.

Ahh. I thought gcc did the Bernstein's algorithm thing, which is
exponential in the bit size. That would have explained why it only
does it for 32-bit constants.

Not doing it for 64-bit constants makes no sense if it just uses the
trivial Booth's algorithm version.

So the odd "we don't do it for 64-bit" is apparently just an
oversight, not because gcc does something clever.

Oh well.

             Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-04-30  0:05     ` Linus Torvalds
@ 2016-04-30  0:32       ` George Spelvin
  2016-04-30  1:12         ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: George Spelvin @ 2016-04-30  0:32 UTC (permalink / raw)
  To: linux, torvalds; +Cc: linux-kernel, tglx

> At least for my tests, even that seems to actually be a total
> non-issue. Yes, odd values *might* be better, but as mentioned in my
> crossing email, it doesn't actually seem to matter for any case the
> kernel cares about, since we tend to want to hash down to 10-20 bits
> of data, so the least significant bit (particularly for the 64-bit
> case) just doesn't matter all that much.

Odd is important.  If the multiplier is even, the msbit of the input
doesn't affect the hash result at all.  x and (x + 0x80000000) hash to
the same value, always.  That just seems like a crappy hash function.
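
A standalone demo of that collision (not from the original mail; the
even multiplier below is just the current prime with its low bit
cleared):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint32_t m = 0x9e370000;	/* even: low bit cleared */
	uint32_t x = 0x12345678;	/* arbitrary input */

	/*
	 * (x + 0x80000000) * m differs from x * m by 0x80000000 * m,
	 * which is a multiple of 2^32 whenever m is even, so the two
	 * 32-bit products -- and any hash taken from them -- are equal.
	 */
	printf("%08x %08x\n", x * m, (x + 0x80000000u) * m);
	return 0;
}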

> Yeah. gcc will actually do the clever stuff for the 32-bit case, afaik.

It's not as clever as it could be; it just does the same Booth
recoding thing, a simple series of shifts with add/subtract.

Here's the ARM code that GCC produces (9 instructions, all dependent):

mult1:
	add	r3, r0, r0, lsl #1
	rsb	r3, r0, r3, lsl #5
	add	r3, r3, r3, lsl #4
	rsb	r3, r3, r3, lsl #5
	add	r3, r0, r3, lsl #5
	add	r3, r0, r3, lsl #1
	add	r3, r0, r3, lsl #3
	add	r3, r0, r3, lsl #3
	rsb	r0, r0, r3, lsl #3
	bx	lr

versus the clever code (6 instructions, #4 and #5 could dual-issue):
mult2:
	add	r3, r0, r0, lsl #19
	add	r2, r3, r0, lsl #9
	add	r0, r2, r0, lsl #23
	add	r3, r3, r2, lsl #8
	rsb	r0, r0, r0, lsl #6
	add	r0, r0, r3, lsl #3
	bx	lr

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-04-29 23:31   ` George Spelvin
@ 2016-04-30  0:05     ` Linus Torvalds
  2016-04-30  0:32       ` George Spelvin
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2016-04-30  0:05 UTC (permalink / raw)
  To: George Spelvin; +Cc: Linux Kernel Mailing List, Thomas Gleixner

On Fri, Apr 29, 2016 at 4:31 PM, George Spelvin <linux@horizon.com> wrote:
>
> After researching it, I think that the "high bits of a multiply" is
> in fact a decent way to do such a hash.

Our emails crossed. Yes. My test harness actually likes the
multiplication more than most of the specialized "spread out bits"
versions I've found, but only if the constants are better-chosen than
the ones we have now.

> One thing I note is that the advice in the comments to choose a prime
> number is misquoting Knuth!  Knuth says (vol. 3 section 6.4) the number
> should be *relatively* prime to the word size, which for binary computers
> simply means odd.

At least for my tests, even that seems to actually be a total
non-issue. Yes, odd values *might* be better, but as mentioned in my
crossing email, it doesn't actually seem to matter for any case the
kernel cares about, since we tend to want to hash down to 10-20 bits
of data, so the least significant bit (particularly for the 64-bit
case) just doesn't matter all that much.

For the 32-bit case I suspect it's more noticeable, since we might be
using even half or more of the result.

But as mentioned: that least-order bit seems to be a *lot* less
important than the mix of the bits in the middle. Because even if your
input ends up being all zeroes in the low bits (in that worst-case
"page aligned pointers" case that Thomas had), and the hash multiplies
by an even number, you'll still end up just dropping all those zero
bits anyway.

> When we have a hardware multiplier, keeping the Hamming weight low is
> a waste of time.  When we don't, clever organization can do
> better than the very naive addition/subtraction chain in the
> current hash_64().

Yeah. gcc will actually do the clever stuff for the 32-bit case, afaik.

Nothing I know of does it for the 64-bit case, which is why our
current hand-written one isn't even the smart one.

> To multiply by the 32-bit constant 1640531527 = 0x61c88647 (which is
> the negative of the golden ratio, so has identical distribution
> properties) can be done in 6 shifts + adds, with a critical path
> length of 7 operations (3 shifts + 4 adds).

So the reason we don't do this for the 32-bit case is exactly that gcc
can already do this.

If you can do the same for the 64-bit case, that might be worth it.

                Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-04-29 21:10     ` Linus Torvalds
@ 2016-04-29 23:51       ` Linus Torvalds
  2016-04-30  1:34         ` Rik van Riel
  2016-05-02  9:39         ` Torvald Riegel
  2016-04-30 15:22       ` Thomas Gleixner
  1 sibling, 2 replies; 43+ messages in thread
From: Linus Torvalds @ 2016-04-29 23:51 UTC (permalink / raw)
  To: Thomas Gleixner, Rik van Riel
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Andrew Morton,
	Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk,
	Davidlohr Bueso, Chris Mason, Carlos O'Donell,
	Torvald Riegel, Eric Dumazet

On Fri, Apr 29, 2016 at 2:10 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Apr 28, 2016 at 11:32 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> hash_long/ptr really shouldn't care about the low bits being zero at
>> all, because it should mix in all the bits (using a prime multiplier
>> and taking the high bits).
>
> Looking at this assertion, it doesn't actually pan out very well at all.
>
> The 32-bit hash looks to be ok. Certainly not perfect, but not horrible.
>
> The 64-bit hash seems to be quite horribly bad with lots of values.

Ok, I have tried to research this a bit more. There's a lot of
confusion here, and the end result is that hash_64() sucks donkey balls.

The basic method for the hashing method by multiplication is fairly
sane. It's well-documented, and attributed to Knuth as the comment
above it says.

However, there's two problems in there that degrade the hash, and
particularly so for the 64-bit case.

The first is the confusion about multiplying with a prime number..
That actually makes no sense at all, and is in fact entirely wrong.
There's no reason to try to pick a prime number for the
multiplication, and I'm not seeing Knuth having ever suggested that.

Knuth's suggestion is to pick a floating point value A somewhere in
between 0 and 1, multiply the integer by it, take the fractional part,
multiply that up by 'm', and take the floor of that to get a number in
the range 0..m-1.

At no point are primes involved.

And our multiplication really does approximate that - except it's
being done in fixed-point arithmetic (so the thing you multiply with
is basically n/2**64, and the overflow is what gets rid of the
fractional part - then getting the "high bits" of the result is really
just multiplying by a power of two and taking the floor of the
result).
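
In code, the fixed-point form of that scheme is just a multiply and a
shift (a sketch only, using the GOLDEN_RATIO_64 value listed below):

static inline u32 hash_64(u64 val, unsigned int bits)
{
	/*
	 * A = 0x9e3779b97f4a7c16 / 2^64, m = 2^bits: the overflow
	 * drops the integer part, the shift keeps the top 'bits' bits
	 * of the fractional part.
	 */
	return (u32)((val * 0x9e3779b97f4a7c16ull) >> (64 - bits));
}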

So the first mistake is thinking that the thing you should multiply
with should be prime. The primality matters for when you use a
division to get a modulus, which is presumably where the confusion
came from.

Now, what value 'A' you use doesn't seem to really matter much. Knuth
suggested the fractional part of the golden ratio, but I suspect that
is purely because it's an irrational number that is not near to 0 or
1. You could use anything, although since "random bit pattern" is part
of the issue, irrational numbers are a good starting point. I suspect
that with our patterns, there's actually seldom a good reason to do
lots of high-order bits, so you might as well pick something closer to
0, but whatever - it's only going to matter for the overflow part that
gets thrown away anyway.

The second mistake - and the one that actually causes the real problem
- is to believe that the "bit sparseness" is a good thing. It's not.
It's _very_ much not. If you don't mix the bits well in the
multiplication, you get exactly the problem we hit: certain bit
patterns just will not mix up into the higher order bits.

So if you look at what the actual golden ratio representation *should* have been:

  #define GOLDEN_RATIO_32 0x9e3779b9
  #define GOLDEN_RATIO_64 0x9e3779b97f4a7c16

and then compare it to the values we actually _use_ (bit-sparse primes
closeish to those values):

  #define GOLDEN_RATIO_PRIME_32 0x9e370001UL
  #define GOLDEN_RATIO_PRIME_64 0x9e37fffffffc0001UL

you start to see the problem. The right set of values have roughly 50%
of the bits set in a random pattern. The wrong set of values do not.

But as far as I can tell, you might as well use the fractional part of
'e' or 'pi' or just make random shit up that has a reasonable bit
distribution.

So we could use the fractional part of the golden ratio (phi):
  0x9e3779b9
  0x9e3779b97f4a7c16

or pi:
  0x243f6a88
  0x243f6a8885a308d3

or e:
  0xb7e15162
  0xb7e151628aed2a6b

or whatever. The constants don't have to be prime. They don't even
have to be odd, because the low bits will always be masked off anyway.
The whole "hash one integer value down to X bits" is very different in
this respect to things like string hashes, where you really do tend to
want primes (because you keep all bits).

But none of those are sparse. I think *some* amount of sparseness
might be ok if it allows people with bad CPU's to do it using
shift-and-adds, it just must not be as sparse as the current number,
the 64-bit one on particular.

There's presumably a few optimal values from a "spread bits out
evenly" standpoint, and they won't have anything to do with random
irrational constants, and will have everything to do with having nice
bitpatterns.

I'm adding Rik to the cc, because the original broken constants came
from him long long ago (they go back to 2002, originally only used for
the waitqueue hashing). Maybe he had some other input that caused him
to believe that the primeness actually mattered.

                 Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-04-29  3:16 ` Linus Torvalds
  2016-04-29  4:12   ` George Spelvin
@ 2016-04-29 23:31   ` George Spelvin
  2016-04-30  0:05     ` Linus Torvalds
  1 sibling, 1 reply; 43+ messages in thread
From: George Spelvin @ 2016-04-29 23:31 UTC (permalink / raw)
  To: torvalds; +Cc: linux-kernel, linux, tglx

On Fri, Apr 29, 2016 at 9:32 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> For example, that _long_ range of bits set ("7fffffffc" in the middle)
> is effectively just one bit set with a subtraction. And it's *right*
> in that bit area that is supposed to shuffle bits 14-40 to the high bits
> (which is what we actually *use*). So it effectively shuffles none of those
> bits around at all, and if you have a stride of 4096, you're pretty much
> done for.

Gee, I recall saying something a lot like that.
> 64 bits is just as bad...  0x9e37fffffffc0001 becomes
> 0x7fffffffc0001, which is 2^51 - 2^18 + 1.

After researching it, I think that the "high bits of a multiply" is
in fact a decent way to do such a hash.  Interestingly, for a randomly
chosen odd multiplier A, the high k bits of the w-bit product A*x is a
universal hash function in the cryptographic sense.  See section 2.3 of
http://arxiv.org/abs/1504.06804


One thing I note is that the advice in the comments to choose a prime
number is misquoting Knuth!  Knuth says (vol. 3 section 6.4) the number
should be *relatively* prime to the word size, which for binary computers
simply means odd.

When we have a hardware multiplier, keeping the Hamming weight low is
a waste of time.  When we don't, clever organization can do
better than the very naive addition/subtraction chain in the
current hash_64().

To multiply by the 32-bit constant 1640531527 = 0x61c88647 (which is
the negative of the golden ratio, so has identical distribution
properties) can be done in 6 shifts + adds, with a critical path
length of 7 operations (3 shifts + 4 adds).

#define GOLDEN_RATIO_32 0x61c88647	/* phi^2 = 1-phi */
/* Returns x * GOLDEN_RATIO_32 without a hardware multiplier */
unsigned hash_32(unsigned x)
{
        unsigned y, z;
				/* Path length */
        y = (x << 19) + x;	/* 1 shift + 1 add */
        z = (x << 9) + y;	/* 1 shift + 2 add */
        x = (x << 23) + z;	/* 1 shift + 3 add */
        z = (z << 8) + y;	/* 2 shift + 3 add */
	x = (x << 6) - x;	/* 2 shift + 4 add */
        return (z << 3) + x;	/* 3 shift + 4 add */
}

Finding a similarly efficient chain for the 64-bit golden ratio
0x9E3779B97F4A7C15 = 11400714819323198485
or
0x61C8864680B583EB = 7046029254386353131

is a bit of a challenge, but algorithms are known.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-04-28 18:32   ` Linus Torvalds
  2016-04-28 23:26     ` Thomas Gleixner
@ 2016-04-29 21:10     ` Linus Torvalds
  2016-04-29 23:51       ` Linus Torvalds
  2016-04-30 15:22       ` Thomas Gleixner
  1 sibling, 2 replies; 43+ messages in thread
From: Linus Torvalds @ 2016-04-29 21:10 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Andrew Morton,
	Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk,
	Davidlohr Bueso, Chris Mason, Carlos O'Donell,
	Torvald Riegel, Eric Dumazet

On Thu, Apr 28, 2016 at 11:32 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> hash_long/ptr really shouldn't care about the low bits being zero at
> all, because it should mix in all the bits (using a prime multiplier
> and taking the high bits).

Looking at this assertion, it doesn't actually pan out very well at all.

The 32-bit hash looks to be ok. Certainly not perfect, but not horrible.

The 64-bit hash seems to be quite horribly bad with lots of values. I
wrote a test-harness to check it out (some simple values just spread
out at a fixed stride), and the end results are *so* bad that I'm kind
of worried that I screwed up the test harness. But it gives quite
reasonable values for hash_32() and for the plain modulo case.

Now, the way my test-harness works (literally testing a lot of
equal-stride cases), the "modulo prime number" approach will
automatically look perfect. So my test-harness is pretty unfair in
that respect.

But the hash_32() function looks good when hashing into 16 bits, for
example. In that case it does a very good job of spreading things out.
When hashing into 17 bits, hash_32 still looks good, except it does
very badly for strides of 32. It starts doing worse for bigger hash
buckets and bigger strides.

But our hash_64() seems to do very badly on pretty much *any* pattern.
To the point where I started to doubt my test-program. But it really
looks like that multiplication constant is almost pessimally chosen.

For example, that _long_ range of bits set ("7fffffffc" in the middle)
is effectively just one bit set with a subtraction. And it's *right*
in that bit area that is supposed to shuffle bits 14-40 to the high
bits (which is what we actually *use*). So it effectively shuffles none
of those bits around at all, and if you have a stride of 4096, you're
pretty much done for.

The 32-bit value is better in this regard, largely thanks to having
that low bit set. Thanks to that, the information in bits around 12-18
will stay in bits 12-18, and because we then only have 32 bits, if the
hash table is large enough, they will still be part of the bits when
we take the high bits. For the 64-bit case, bits 12-18 will never even
be relevant.

So I think that what happens here is that hash_32() is actually
somewhat reasonable. But hash_64() sucks donkey balls when there's a
lot of information in around bits 10-20 (which is exactly where a lot
of pointer bits have the *most* information thanks to alignment
issues).

Picking a new value almost at random (I say "almost", because I just
started with that 32-bit multiplicand value that mostly works and
shifted it up by 32 bits and then randomly added a few more bits to
avoid long ranges of ones and zeroes), I picked

  #define GOLDEN_RATIO_PRIME_64 0x9e3700310c100d01UL

and it is *much* better in my test harness.

Of course, things like that depend on what patterns you test. But I
did have a "range of strides and hash sizes" I tried. So just for fun:
try changing GOLDEN_RATIO_PRIME_64 to that value, and see if the
absolutely _horrid_ page-aligned case goes away for you?
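
For reference, the experiment is a one-line edit where those constants
live, in include/linux/hash.h (shown as a sketch, not a patch that was
posted in this thread):

-#define GOLDEN_RATIO_PRIME_64 0x9e37fffffffc0001UL
+#define GOLDEN_RATIO_PRIME_64 0x9e3700310c100d01UL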

It really looks like those multiplication numbers were very very badly picked.

Still, that number doesn't do very well if the hash is small (say, 8
bits). But for slightly larger hash tables it seems to be doing much
better.

                  Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-04-29  3:16 ` Linus Torvalds
@ 2016-04-29  4:12   ` George Spelvin
  2016-04-29 23:31   ` George Spelvin
  1 sibling, 0 replies; 43+ messages in thread
From: George Spelvin @ 2016-04-29  4:12 UTC (permalink / raw)
  To: linux, torvalds; +Cc: linux-kernel, tglx

Linus wrote:
> Having looked around at other hashes, I suspect we should look at the
> ones that do five or six shifts, and a mix of add/sub and xor. And
> because they shift the bits around more freely you don't have the
> final shift (that ends up being dependent on the size of the target
> set).

I'm not sure that final shift is a problem.  You need to mask the result
to the desired final size somehow, and a shift is no more cycles than
an AND.
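
Concretely, the two reductions being compared are (an illustrative
sketch; neither helper exists under these names in the kernel):

/* Multiplicative style: keep the top 'bits' bits of the value. */
static inline unsigned take_high_bits(unsigned h, unsigned bits)
{
	return h >> (32 - bits);
}

/* Avalanche style: mix fully, then keep the low 'bits' bits. */
static inline unsigned take_low_bits(unsigned h, unsigned bits)
{
	return h & ((1U << bits) - 1);
}

Either way it is one cheap instruction at the end.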

> It really would be lovely to hear that we can just replace
> hash_int/long() with a better hash. And I wouldn't get too hung up on
> the multiplication trick. I suspect it's not worth it.

My main concern is that the scope of the search grows enormously
if we include such things.  I don't want to discourage someone
from looking, but I volunteered to find a better multiplication
constant with an efficient add/subtract chain, not start a thesis
project on more general hash functions.

Two places one could look for ideas, though:
http://www.burtleburtle.net/bob/hash/integer.html
https://gist.github.com/badboy/6267743

Here's Thomas Wang's 64-bit hash, which is reputedly quite
good, in case it helps:

uint64_t hash(uint64_t key)
{
	key  = ~key + (key << 21);	// key = (key << 21) - key - 1;
	key ^= key >> 24;
	key += (key << 3) + (key << 8); // key *= 265
	key ^= key >> 14;
	key += (key << 2) + (key << 4); // key *= 21
	key ^= key >> 28;
	key += key << 31;
	return key;
}

And his slightly shorter 64-to-32-bit function:
unsigned hash(uint64_t key)
{
  key  = ~key + (key << 18); // key = (key << 18) - key - 1;
  key ^= key >> 31;
  key *= 21;	 // key += (key << 2) + (key << 4);
  key ^= key >> 11;
  key += key << 6;
  key ^= key >> 22;
  return (uint32_t)key;
}


Sticking to multiplication, using the heuristics in the
current comments (prime near golden ratio = 9e3779b9 = 2654435769),
I can come up with this for multiplying by 2654435599 = 0x9e37790f:

// -----------------------------------------------------------------------------
// This code was generated by Spiral Multiplier Block Generator, www.spiral.net
// Copyright (c) 2006, Carnegie Mellon University
// All rights reserved.
// The generated code is distributed under a BSD style license
// (see http://www.opensource.org/licenses/bsd-license.php)
// -----------------------------------------------
// Cost: 6 adds/subtracts 6 shifts 0 negations
// Depth: 5
// Input:
//   int t0
// Outputs:
//   int t1 = 2654435599 * t0
// -----------------------------------------------
t3 = shl(t0, 11);   /* 2048*/
t2 = sub(t3, t0);   /* 2047*/
t5 = shl(t2, 8);   /* 524032*/
t4 = sub(t5, t2);   /* 521985*/
t7 = shl(t0, 25);   /* 33554432*/
t6 = add(t4, t7);   /* 34076417*/
t9 = shl(t0, 9);   /* 512*/
t8 = sub(t9, t0);   /* 511*/
t11 = shl(t6, 4);   /* 545222672*/
t10 = sub(t11, t6);   /* 511146255*/
t12 = shl(t8, 22);   /* 2143289344*/
t1 = add(t10, t12);   /* 2654435599*/

Which translates into C as

uint32_t multiply(uint32_t x)
{
        unsigned y = (x << 11) - x;

        y -= y << 8;
        y -= x << 25;
        x -= x << 9;
        y -= y << 4;
        y -= x << 22;
        return y;
}

Unfortunately, that utility bogs like hell on 64-bit constants.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-04-29  2:57 [patch 2/7] lib/hashmod: Add modulo based hash mechanism George Spelvin
@ 2016-04-29  3:16 ` Linus Torvalds
  2016-04-29  4:12   ` George Spelvin
  2016-04-29 23:31   ` George Spelvin
  0 siblings, 2 replies; 43+ messages in thread
From: Linus Torvalds @ 2016-04-29  3:16 UTC (permalink / raw)
  To: George Spelvin; +Cc: Thomas Gleixner, Linux Kernel Mailing List

On Thu, Apr 28, 2016 at 7:57 PM, George Spelvin <linux@horizon.com> wrote:
>
> How many 32-bit platforms need a multiplier that's easy for GCC to
> evaluate via shifts and adds?
>
> Generally, by the time you've got a machine grunty enough to
> need 64 bits, a multiplier is quite affordable.

Probably true.

That said, the whole "use a multiply to do bit shifts and adds" may be
a false economy too. It's a good trick, but it does limit the end
result in many ways: you are limited to (a) only left-shifts and (b)
only addition and subtraction.

The "only left-shifts" means that you will always be in the situation
that you'll then need to use the high bits (so you'll always need that
shift down). And being limited to just the adder tends to mean that
it's harder to get a nice spread of bits - you're basically always
going to have that same carry chain.

Having looked around at other hashes, I suspect we should look at the
ones that do five or six shifts, and a mix of add/sub and xor. And
because they shift the bits around more freely you don't have the
final shift (that ends up being dependent on the size of the target
set).

It really would be lovely to hear that we can just replace
hash_int/long() with a better hash. And I wouldn't get too hung up on
the multiplication trick. I suspect it's not worth it.

              Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
@ 2016-04-29  2:57 George Spelvin
  2016-04-29  3:16 ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: George Spelvin @ 2016-04-29  2:57 UTC (permalink / raw)
  To: tglx; +Cc: linux, linux-kernel, torvalds

Thomas Gleixner wrote:
> I'm not a hashing wizard and I completely failed to understand why
> hash_long/ptr are so horrible for the various test cases I ran.

It's very simple: the constants chosen are bit-sparse, *particularly*
in the least significant bits, and only 32/64 bits of the product are
kept.  Using the high-word of a double-width multiply is even better,
but some machines (*cough* SPARCv9 *cough*) don't have hardware
support for that.

So what you get is:

  (0x9e370001 * (x << 12)) & 0xffffffff
= (0x9e370001 * x & 0xfffff) << 12
= (0x70001 * x & 0xfffff) << 12

*Now* does it make sense?

64 bits is just as bad...  0x9e37fffffffc0001 becomes
0x7fffffffc0001, which is 2^51 - 2^18 + 1.
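
A tiny standalone check of the 32-bit reduction above (not from the
original mail):

#include <assert.h>
#include <stdint.h>

int main(void)
{
	uint32_t x = 0xdeadbeef;	/* any value works */

	/* With page-aligned input, only 20 bits of the constant matter
	 * and the low 12 bits of the product are always zero. */
	assert(0x9e370001u * (x << 12) == ((0x70001u * x) & 0xfffffu) << 12);
	return 0;
}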


The challenge is the !CONFIG_ARCH_HAS_FAST_MULTIPLIER case,
when it has to be done with shifts and adds/subtracts.

Now, what's odd is that it's only relevant for 64-bit platforms, and
currently only x86 and POWER7+ have it.

SPARCv9, MIPS64, ARM64, SH64, PPC64, and IA64 all have it turned off.

Is this a bug that should be fixed?

In fact, do *any* 64-bit platforms need multiply emulation?

How many 32-bit platforms need a multiplier that's easy for GCC to
evaluate via shifts and adds?

Generally, by the time you've got a machine grunty enough to
need 64 bits, a multiplier is quite affordable.


Anyway, assuming there exists at least one platform that needs the
shift-and-add sequence, it's quite easy to get a higher hamming weight,
you just have to use a few more registers to save some intermediate
results.

E.g.

	u64 x = val, t = val, u;
	x <<= 2;
	u = x += t;	/* val * 5 */
	x <<= 4;	/* val * 80 */
	x -= u;		/* val * 75 = 0b1001011 */

Shall I try to come up with something?


Footnote: useful web pages on shift-and-add/subtract multiplication
http://www.vinc17.org/research/mulbyconst/index.en.html
http://www.spiral.net/hardware/multless.html

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-04-28 23:26     ` Thomas Gleixner
@ 2016-04-29  2:25       ` Linus Torvalds
  2016-04-30 13:02         ` Thomas Gleixner
  2016-06-12 12:18         ` Sandy Harris
  0 siblings, 2 replies; 43+ messages in thread
From: Linus Torvalds @ 2016-04-29  2:25 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Andrew Morton,
	Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk,
	Davidlohr Bueso, Chris Mason, Carlos O'Donell,
	Torvald Riegel, Eric Dumazet

On Thu, Apr 28, 2016 at 4:26 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> I'll try to dig up some time to analyze the hash_long failure unless someone
> familiar with the problem is beating me to it.

I'm not sure you need to spend time analyzing failure: if you get bad
hashing with hash_long(), then we know that is a bad hash without
having to really try to figure out why.

It's the hashes that _look_ like they might be good hashes, but
there's not a lot of analysis behind it, that I would worry about. The
simple prime modulus _should_ be fine, but at the same time I kind of
suspect we can do better. Especially since it has two multiplications.

Looking around, there's

    http://burtleburtle.net/bob/hash/integer.html

and that 32-bit "full avalanche" hash in six shifts looks like it
could be better. You wouldn't want to inline it, but the point of a
full avalanche bit mixing _should_ be that you could avoid the whole
"upper bits" part, and it should work independently of the target set
size.

So if that hash works better, it would be a pretty good replacement
option for hash_int().

There is also

    https://gist.github.com/badboy/6267743

that has a 64 bit to 32 bit hash function that might be useful for
"hash_long()".

Most of the people who worry about hashes tend to have strings to
hash, not just a single word like a pointer, but there's clearly
people around who have tried to search for good hashes that really
spread out the bits.

                Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-04-28 18:32   ` Linus Torvalds
@ 2016-04-28 23:26     ` Thomas Gleixner
  2016-04-29  2:25       ` Linus Torvalds
  2016-04-29 21:10     ` Linus Torvalds
  1 sibling, 1 reply; 43+ messages in thread
From: Thomas Gleixner @ 2016-04-28 23:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Andrew Morton,
	Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk,
	Davidlohr Bueso, Chris Mason, Carlos O'Donell,
	Torvald Riegel, Eric Dumazet

On Thu, 28 Apr 2016, Linus Torvalds wrote:
> 
> I'd really hate to add *another* ad-hoc hash when the previous ad-hoc
> hash has been shown to be bad.

I completely agree.

I'm not a hashing wizard and I completely failed to understand why
hash_long/ptr are so horrible for the various test cases I ran.

So my ad hoc test was to use the only hash function I truly understand. It was
state of the art in my university days :) And surprise, surprise it worked
really well.

My main focus was really to solve this futex issue which plagues various people
and not to dive into hashing theory for a few weeks.

I'll try to dig up some time to analyze the hash_long failure unless someone
familiar with the problem is beating me to it.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-04-28 16:42 ` [patch 2/7] lib/hashmod: Add modulo based hash mechanism Thomas Gleixner
@ 2016-04-28 18:32   ` Linus Torvalds
  2016-04-28 23:26     ` Thomas Gleixner
  2016-04-29 21:10     ` Linus Torvalds
  0 siblings, 2 replies; 43+ messages in thread
From: Linus Torvalds @ 2016-04-28 18:32 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Andrew Morton,
	Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk,
	Davidlohr Bueso, Chris Mason, Carlos O'Donell,
	Torvald Riegel, Eric Dumazet

On Thu, Apr 28, 2016 at 9:42 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> hash_long/hash_ptr() provide really bad hashing for small hash sizes and for
> cases where the lower 12 bits (Page size aligned) of the hash value are 0.

Hmm.

hash_long/ptr really shouldn't care about the low bits being zero at
all, because it should mix in all the bits (using a prime multiplier
and taking the high bits).

That said, numbers rule, so clearly we need to do something. It does
strike me that we would be better off just trying to improve
hash_long().

In particular, there are people and projects that have worked on
nothing but hashing. I'm not sure we should add a new hash algorithm
even if the whole "modulo prime" sounds obviously fine in theory. For
example, your 64-bit code has obvious problems if there are common
patterns in the low and the high 32 bits.. Not a problem for something
like hash_ptr(), but it can certainly be a problem for other cases.
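
To make the "common patterns" point concrete: the proposed hash_mod()
(quoted further down this page) starts by XOR-folding the two 32-bit
halves of the key, so any keys whose halves XOR to the same value
collide before the modulo is even computed.  A small demo, with made-up
key values:

#include <stdint.h>
#include <stdio.h>

/* Same 64->32 fold as at the start of hash_mod() in the patch. */
static uint32_t fold(uint64_t addr)
{
	return (uint32_t)(addr >> 32) ^ (uint32_t)addr;
}

int main(void)
{
	/* Both keys have equal halves, so both fold to 0. */
	printf("%08x %08x\n", fold(0x1234567812345678ull),
			      fold(0xdeadbeefdeadbeefull));
	return 0;
}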

It would be a really good idea to have some real hard numbers on the
hashing in general, but _particularly_ so if/when we start adding new
ones. Have you tested the modulus version with SMhasher, for example?

For example, there's Thomas Wang's hash function which should cascade
all the bits.

I'd really hate to add *another* ad-hoc hash when the previous ad-hoc
hash has been shown to be bad.

               Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [patch 2/7] lib/hashmod: Add modulo based hash mechanism
  2016-04-28 16:42 [patch 0/7] futex: Add support for process private hashing Thomas Gleixner
@ 2016-04-28 16:42 ` Thomas Gleixner
  2016-04-28 18:32   ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Thomas Gleixner @ 2016-04-28 16:42 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Ingo Molnar, Linus Torvalds, Andrew Morton,
	Sebastian Andrzej Siewior, Darren Hart, Michael Kerrisk,
	Davidlohr Bueso, Chris Mason, Carlos O'Donell,
	Torvald Riegel, Eric Dumazet

[-- Attachment #1: lib-hashmod-Add-modulo-based-hash-mechanism.patch --]
[-- Type: text/plain, Size: 3515 bytes --]

hash_long/hash_ptr() provide really bad hashing for small hash sizes and for
cases where the lower 12 bits (Page size aligned) of the hash value are 0.

A simple modulo(prime) based hash function has way better results
across a wide range of input values. The implementation uses inverse
multiplication instead of a slow division.

A futex benchmark shows better results up to a factor 10 on small hashes.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/hash.h |   28 ++++++++++++++++++++++++++++
 lib/Kconfig          |    3 +++
 lib/Makefile         |    1 +
 lib/hashmod.c        |   44 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 76 insertions(+)
 create mode 100644 lib/hashmod.c

--- a/include/linux/hash.h
+++ b/include/linux/hash.h
@@ -83,4 +83,32 @@ static inline u32 hash32_ptr(const void
 	return (u32)val;
 }
 
+struct hash_modulo {
+	unsigned int	pmul;
+	unsigned int	prime;
+	unsigned int	mask;
+};
+
+#ifdef CONFIG_HASH_MODULO
+
+int hash_modulo_params(unsigned int hash_bits, struct hash_modulo *hm);
+
+/**
+ * hash_mod - FIXME
+ */
+static inline unsigned int hash_mod(unsigned long addr, struct hash_modulo *hm)
+{
+	u32 a, m;
+
+	if (IS_ENABLED(CONFIG_64BIT)) {
+		a =  addr >> 32;
+		a ^= (unsigned int) addr;
+	} else {
+		a = addr;
+	}
+	m = ((u64)a * hm->pmul) >> 32;
+	return (a - m * hm->prime) & hm->mask;
+}
+#endif
+
 #endif /* _LINUX_HASH_H */
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -185,6 +185,9 @@ config CRC8
 	  when they need to do cyclic redundancy check according CRC8
 	  algorithm. Module will be called crc8.
 
+config HASH_MODULO
+       bool
+
 config AUDIT_GENERIC
 	bool
 	depends on AUDIT && !AUDIT_ARCH
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -97,6 +97,7 @@ obj-$(CONFIG_CRC32)	+= crc32.o
 obj-$(CONFIG_CRC7)	+= crc7.o
 obj-$(CONFIG_LIBCRC32C)	+= libcrc32c.o
 obj-$(CONFIG_CRC8)	+= crc8.o
+obj-$(CONFIG_HASH_MODULO) += hashmod.o
 obj-$(CONFIG_GENERIC_ALLOCATOR) += genalloc.o
 
 obj-$(CONFIG_842_COMPRESS) += 842/
--- /dev/null
+++ b/lib/hashmod.c
@@ -0,0 +1,44 @@
+/*
+ * Modulo based hash - Global helper functions
+ *
+ * (C) 2016 Linutronix GmbH, Thomas Gleixner
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public Licence version 2 as published by
+ * the Free Software Foundation;
+ */
+
+#include <linux/hash.h>
+#include <linux/errno.h>
+#include <linux/bug.h>
+#include <linux/kernel.h>
+
+#define hash_pmul(prime)	((unsigned int)((1ULL << 32) / prime))
+
+static const struct hash_modulo hash_modulo[] = {
+	{ .prime =    3, .pmul = hash_pmul(3),    .mask = 0x0003 },
+	{ .prime =    7, .pmul = hash_pmul(7),    .mask = 0x0007 },
+	{ .prime =   13, .pmul = hash_pmul(13),   .mask = 0x000f },
+	{ .prime =   31, .pmul = hash_pmul(31),   .mask = 0x001f },
+	{ .prime =   61, .pmul = hash_pmul(61),   .mask = 0x003f },
+	{ .prime =  127, .pmul = hash_pmul(127),  .mask = 0x007f },
+	{ .prime =  251, .pmul = hash_pmul(251),  .mask = 0x00ff },
+	{ .prime =  509, .pmul = hash_pmul(509),  .mask = 0x01ff },
+	{ .prime = 1021, .pmul = hash_pmul(1021), .mask = 0x03ff },
+	{ .prime = 2039, .pmul = hash_pmul(2039), .mask = 0x07ff },
+	{ .prime = 4093, .pmul = hash_pmul(4093), .mask = 0x0fff },
+};
+
+/**
+ * hash_modulo_params - FIXME
+ */
+int hash_modulo_params(unsigned int hash_bits, struct hash_modulo *hm)
+{
+	hash_bits -= 2;
+
+	if (hash_bits >= ARRAY_SIZE(hash_modulo))
+		return -EINVAL;
+
+	*hm = hash_modulo[hash_bits];
+	return 0;
+}
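
A hypothetical caller, to show how the two helpers fit together (this is
not part of the patch; the names and the 256-entry table are made up for
illustration):

#include <linux/hash.h>
#include <linux/init.h>

static struct hash_modulo example_hm;

static int __init example_hash_init(void)
{
	/* 8 hash bits -> prime 251, mask 0xff from the table above. */
	return hash_modulo_params(8, &example_hm);
}

static unsigned int example_hash_bucket(void *ptr)
{
	return hash_mod((unsigned long)ptr, &example_hm);
}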

^ permalink raw reply	[flat|nested] 43+ messages in thread

