* [PATCH 0/3] Cuckoo hash lookup enhancements
@ 2016-08-26 21:34 Pablo de Lara
  2016-08-26 21:34 ` [PATCH 1/3] hash: reorganize bucket structure Pablo de Lara
                   ` (3 more replies)
  0 siblings, 4 replies; 37+ messages in thread
From: Pablo de Lara @ 2016-08-26 21:34 UTC (permalink / raw)
  To: dev; +Cc: bruce.richardson, Pablo de Lara

This patchset improves lookup performance in the hash library by replacing
the existing 4-stage, 2-entry bulk lookup pipeline with an improved pipeline
based on a loop-and-jump model.
Also, x86 vector intrinsics are used to improve performance when comparing signatures.

The first patch reorganizes the bucket structure.
Currently, buckets store the current and alternative signatures
interleaved per entry.
To enable a vectorized signature comparison, all the current signatures
must be contiguous, so the bucket layout has been changed to separate
the current signatures from the alternative signatures.
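
As an illustrative sketch (the field names follow the patch below, but this
standalone struct and its entry count are hypothetical stand-ins for the real
rte_hash_bucket), the layout change looks like:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ENTRIES 4	/* illustrative; the real count is RTE_HASH_BUCKET_ENTRIES */

/* Old layout: current/alternative signatures interleaved per entry */
struct bucket_old {
	struct {
		uint32_t current;
		uint32_t alt;
	} signatures[ENTRIES];
	uint32_t key_idx[ENTRIES + 1];
};

/* New layout: all current signatures contiguous, so a single vector
 * load can fetch them for a parallel comparison */
struct bucket_new {
	uint32_t sig_current[ENTRIES];
	uint32_t key_idx[ENTRIES + 1];
	uint32_t sig_alt[ENTRIES];
};
```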

The second patch introduces x86 vector intrinsics.
When performing a bulk lookup, all the current signatures in a bucket
are compared against the signature of the key being looked up.
Now that they are contiguous, a vectorized comparison can be performed,
which takes fewer instructions.
On machines with AVX2, the number of entries per bucket is increased
from 4 to 8, as AVX2 can compare two 256-bit values, i.e. 8x32-bit integers,
which matches the 8 signatures in the bucket.
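
A minimal, self-contained sketch of the idea (not the library code itself;
`compare_sigs` is a hypothetical name, and a 32-bit packed compare,
`_mm_cmpeq_epi32`, is used here since the signatures are 32 bits wide):

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

#define ENTRIES 4	/* SSE path; the AVX2 path widens this to 8 entries */

/* Compare one signature against all current signatures in a bucket and
 * return a hit bitmask: bit i is set if entry i matched. */
static inline uint32_t
compare_sigs(const uint32_t *sig_current, uint32_t sig)
{
#if defined(__SSE2__)
	/* One 128-bit load fetches all four 32-bit signatures; a single
	 * packed compare plus movemask produces the bitmask. */
	__m128i match = _mm_cmpeq_epi32(
			_mm_loadu_si128((const __m128i *)sig_current),
			_mm_set1_epi32((int)sig));
	return (uint32_t)_mm_movemask_ps(_mm_castsi128_ps(match));
#else
	/* Scalar fallback: one compare per entry */
	uint32_t i, matches = 0;

	for (i = 0; i < ENTRIES; i++)
		matches |= (uint32_t)(sig == sig_current[i]) << i;
	return matches;
#endif
}
```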

The third (and last) patch modifies the current pipeline of the lookup bulk function.
The new pipeline is based on a loop-and-jump model. The two key improvements are:

- Better prefetching: the first 4 keys to be looked up are prefetched up front,
  and the remaining keys are prefetched while the signature calculations
  are being performed. This gives the CPU more time to fetch the data
  before it is actually needed, which results in fewer cache misses and,
  therefore, higher throughput.
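
The prefetch pattern can be sketched as follows (a standalone illustration;
`prefetch0`, `toy_hash` and `process_keys` are hypothetical stand-ins for
`rte_prefetch0`, `rte_hash_hash` and the bulk lookup loop):

```c
#include <assert.h>
#include <stdint.h>

#define PREFETCH_OFFSET 4

/* Stand-in for rte_prefetch0(): a prefetch hint on GCC/Clang, a no-op elsewhere */
static inline void prefetch0(const void *p)
{
#if defined(__GNUC__)
	__builtin_prefetch(p, 0, 3);
#else
	(void)p;
#endif
}

/* Stand-in for the real hash function */
static uint32_t toy_hash(const void *key)
{
	return *(const uint32_t *)key * 2654435761u;
}

/* The first PREFETCH_OFFSET keys are prefetched up front; each later key
 * is prefetched while the hash of an earlier key is being computed,
 * hiding the memory latency of the key fetches. */
static void
process_keys(const void **keys, int num_keys, uint32_t *hashes)
{
	int i;

	for (i = 0; i < PREFETCH_OFFSET && i < num_keys; i++)
		prefetch0(keys[i]);

	for (i = 0; i < num_keys - PREFETCH_OFFSET; i++) {
		prefetch0(keys[i + PREFETCH_OFFSET]);
		hashes[i] = toy_hash(keys[i]);
	}

	/* Hash the tail, whose keys were already prefetched */
	for (; i < num_keys; i++)
		hashes[i] = toy_hash(keys[i]);
}
```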

- Lower performance penalty on the fallback path: the lookup bulk algorithm
  assumes that most of the time there will not be a collision in a bucket,
  but two or more signatures may be equal, in which case more than one
  key comparison may be necessary. In that case, only the key of the first hit
  is prefetched, as in the current implementation. The difference is that if
  this comparison results in a miss, the information about the remaining
  candidate keys has already been stored, whereas the current implementation
  has to perform an entire single lookup again.
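
The fallback idea can be sketched as follows (a hypothetical standalone
function, not the patch code: the hit bitmask produced by the signature
comparison is kept, so further candidates can be tried without restarting
the lookup):

```c
#include <assert.h>
#include <stdint.h>

/* Walk the stored hit bitmask: try the lowest set bit first (the entry
 * whose key was prefetched); on a key mismatch, clear that bit and try
 * the next candidate instead of redoing the whole lookup.
 * Returns the matching bucket entry index, or -1 if all candidates miss. */
static int
resolve_hits(uint32_t hitmask, const uint32_t *key_idx,
	     const uint32_t *stored_keys, uint32_t wanted_key)
{
	while (hitmask) {
		uint32_t hit = (uint32_t)__builtin_ctz(hitmask);

		if (stored_keys[key_idx[hit]] == wanted_key)
			return (int)hit;
		hitmask &= ~(1u << hit);	/* miss: drop this candidate */
	}
	return -1;
}
```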

This patchset depends on the following patchset:
"Hash library fixes" (http://dpdk.org/ml/archives/dev/2016-August/045780.html)

Byron Marohn (3):
  hash: reorganize bucket structure
  hash: add vectorized comparison
  hash: modify lookup bulk pipeline

 lib/librte_hash/rte_cuckoo_hash.c     | 427 +++++++++++++---------------------
 lib/librte_hash/rte_cuckoo_hash.h     |  24 +-
 lib/librte_hash/rte_cuckoo_hash_x86.h |  20 +-
 3 files changed, 179 insertions(+), 292 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH 1/3] hash: reorganize bucket structure
  2016-08-26 21:34 [PATCH 0/3] Cuckoo hash lookup enhancements Pablo de Lara
@ 2016-08-26 21:34 ` Pablo de Lara
  2016-08-26 21:34 ` [PATCH 2/3] hash: add vectorized comparison Pablo de Lara
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 37+ messages in thread
From: Pablo de Lara @ 2016-08-26 21:34 UTC (permalink / raw)
  To: dev; +Cc: bruce.richardson, Byron Marohn, Saikrishna Edupuganti

From: Byron Marohn <byron.marohn@intel.com>

Store the current signatures of all entries contiguously in the bucket,
and likewise all the alternative signatures, instead of interleaving
the current and alternative signatures per entry.
This will be beneficial in the next commits, where a vectorized
comparison will be performed, achieving better performance.

Signed-off-by: Byron Marohn <byron.marohn@intel.com>
Signed-off-by: Saikrishna Edupuganti <saikrishna.edupuganti@intel.com>
---
 lib/librte_hash/rte_cuckoo_hash.c     | 43 ++++++++++++++++++-----------------
 lib/librte_hash/rte_cuckoo_hash.h     | 17 ++++----------
 lib/librte_hash/rte_cuckoo_hash_x86.h | 20 ++++++++--------
 3 files changed, 37 insertions(+), 43 deletions(-)

diff --git a/lib/librte_hash/rte_cuckoo_hash.c b/lib/librte_hash/rte_cuckoo_hash.c
index dd0290f..9d507b6 100644
--- a/lib/librte_hash/rte_cuckoo_hash.c
+++ b/lib/librte_hash/rte_cuckoo_hash.c
@@ -420,7 +420,7 @@ make_space_bucket(const struct rte_hash *h, struct rte_hash_bucket *bkt)
 	 */
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
 		/* Search for space in alternative locations */
-		next_bucket_idx = bkt->signatures[i].alt & h->bucket_bitmask;
+		next_bucket_idx = bkt->sig_alt[i] & h->bucket_bitmask;
 		next_bkt[i] = &h->buckets[next_bucket_idx];
 		for (j = 0; j < RTE_HASH_BUCKET_ENTRIES; j++) {
 			if (next_bkt[i]->key_idx[j] == EMPTY_SLOT)
@@ -433,8 +433,8 @@ make_space_bucket(const struct rte_hash *h, struct rte_hash_bucket *bkt)
 
 	/* Alternative location has spare room (end of recursive function) */
 	if (i != RTE_HASH_BUCKET_ENTRIES) {
-		next_bkt[i]->signatures[j].alt = bkt->signatures[i].current;
-		next_bkt[i]->signatures[j].current = bkt->signatures[i].alt;
+		next_bkt[i]->sig_alt[j] = bkt->sig_current[i];
+		next_bkt[i]->sig_current[j] = bkt->sig_alt[i];
 		next_bkt[i]->key_idx[j] = bkt->key_idx[i];
 		return i;
 	}
@@ -460,8 +460,8 @@ make_space_bucket(const struct rte_hash *h, struct rte_hash_bucket *bkt)
 	 */
 	bkt->flag[i] = 0;
 	if (ret >= 0) {
-		next_bkt[i]->signatures[ret].alt = bkt->signatures[i].current;
-		next_bkt[i]->signatures[ret].current = bkt->signatures[i].alt;
+		next_bkt[i]->sig_alt[ret] = bkt->sig_current[i];
+		next_bkt[i]->sig_current[ret] = bkt->sig_alt[i];
 		next_bkt[i]->key_idx[ret] = bkt->key_idx[i];
 		return i;
 	} else
@@ -543,8 +543,8 @@ __rte_hash_add_key_with_hash(const struct rte_hash *h, const void *key,
 
 	/* Check if key is already inserted in primary location */
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		if (prim_bkt->signatures[i].current == sig &&
-				prim_bkt->signatures[i].alt == alt_hash) {
+		if (prim_bkt->sig_current[i] == sig &&
+				prim_bkt->sig_alt[i] == alt_hash) {
 			k = (struct rte_hash_key *) ((char *)keys +
 					prim_bkt->key_idx[i] * h->key_entry_size);
 			if (rte_hash_cmp_eq(key, k->key, h) == 0) {
@@ -563,8 +563,8 @@ __rte_hash_add_key_with_hash(const struct rte_hash *h, const void *key,
 
 	/* Check if key is already inserted in secondary location */
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		if (sec_bkt->signatures[i].alt == sig &&
-				sec_bkt->signatures[i].current == alt_hash) {
+		if (sec_bkt->sig_alt[i] == sig &&
+				sec_bkt->sig_current[i] == alt_hash) {
 			k = (struct rte_hash_key *) ((char *)keys +
 					sec_bkt->key_idx[i] * h->key_entry_size);
 			if (rte_hash_cmp_eq(key, k->key, h) == 0) {
@@ -610,8 +610,8 @@ __rte_hash_add_key_with_hash(const struct rte_hash *h, const void *key,
 		for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
 			/* Check if slot is available */
 			if (likely(prim_bkt->key_idx[i] == EMPTY_SLOT)) {
-				prim_bkt->signatures[i].current = sig;
-				prim_bkt->signatures[i].alt = alt_hash;
+				prim_bkt->sig_current[i] = sig;
+				prim_bkt->sig_alt[i] = alt_hash;
 				prim_bkt->key_idx[i] = new_idx;
 				break;
 			}
@@ -631,8 +631,8 @@ __rte_hash_add_key_with_hash(const struct rte_hash *h, const void *key,
 		 */
 		ret = make_space_bucket(h, prim_bkt);
 		if (ret >= 0) {
-			prim_bkt->signatures[ret].current = sig;
-			prim_bkt->signatures[ret].alt = alt_hash;
+			prim_bkt->sig_current[ret] = sig;
+			prim_bkt->sig_alt[ret] = alt_hash;
 			prim_bkt->key_idx[ret] = new_idx;
 			if (h->add_key == ADD_KEY_MULTIWRITER)
 				rte_spinlock_unlock(h->multiwriter_lock);
@@ -706,7 +706,7 @@ __rte_hash_lookup_with_hash(const struct rte_hash *h, const void *key,
 
 	/* Check if key is in primary location */
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		if (bkt->signatures[i].current == sig &&
+		if (bkt->sig_current[i] == sig &&
 				bkt->key_idx[i] != EMPTY_SLOT) {
 			k = (struct rte_hash_key *) ((char *)keys +
 					bkt->key_idx[i] * h->key_entry_size);
@@ -729,8 +729,8 @@ __rte_hash_lookup_with_hash(const struct rte_hash *h, const void *key,
 
 	/* Check if key is in secondary location */
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		if (bkt->signatures[i].current == alt_hash &&
-				bkt->signatures[i].alt == sig) {
+		if (bkt->sig_current[i] == alt_hash &&
+				bkt->sig_alt[i] == sig) {
 			k = (struct rte_hash_key *) ((char *)keys +
 					bkt->key_idx[i] * h->key_entry_size);
 			if (rte_hash_cmp_eq(key, k->key, h) == 0) {
@@ -784,7 +784,8 @@ remove_entry(const struct rte_hash *h, struct rte_hash_bucket *bkt, unsigned i)
 	unsigned lcore_id, n_slots;
 	struct lcore_cache *cached_free_slots;
 
-	bkt->signatures[i].sig = NULL_SIGNATURE;
+	bkt->sig_current[i] = NULL_SIGNATURE;
+	bkt->sig_alt[i] = NULL_SIGNATURE;
 	if (h->hw_trans_mem_support) {
 		lcore_id = rte_lcore_id();
 		cached_free_slots = &h->local_free_slots[lcore_id];
@@ -822,7 +823,7 @@ __rte_hash_del_key_with_hash(const struct rte_hash *h, const void *key,
 
 	/* Check if key is in primary location */
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		if (bkt->signatures[i].current == sig &&
+		if (bkt->sig_current[i] == sig &&
 				bkt->key_idx[i] != EMPTY_SLOT) {
 			k = (struct rte_hash_key *) ((char *)keys +
 					bkt->key_idx[i] * h->key_entry_size);
@@ -847,7 +848,7 @@ __rte_hash_del_key_with_hash(const struct rte_hash *h, const void *key,
 
 	/* Check if key is in secondary location */
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		if (bkt->signatures[i].current == alt_hash &&
+		if (bkt->sig_current[i] == alt_hash &&
 				bkt->key_idx[i] != EMPTY_SLOT) {
 			k = (struct rte_hash_key *) ((char *)keys +
 					bkt->key_idx[i] * h->key_entry_size);
@@ -956,8 +957,8 @@ lookup_stage2(unsigned idx, hash_sig_t prim_hash, hash_sig_t sec_hash,
 	prim_hash_matches = 1 << RTE_HASH_BUCKET_ENTRIES;
 	sec_hash_matches = 1 << RTE_HASH_BUCKET_ENTRIES;
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		prim_hash_matches |= ((prim_hash == prim_bkt->signatures[i].current) << i);
-		sec_hash_matches |= ((sec_hash == sec_bkt->signatures[i].current) << i);
+		prim_hash_matches |= ((prim_hash == prim_bkt->sig_current[i]) << i);
+		sec_hash_matches |= ((sec_hash == sec_bkt->sig_current[i]) << i);
 	}
 
 	key_idx = prim_bkt->key_idx[__builtin_ctzl(prim_hash_matches)];
diff --git a/lib/librte_hash/rte_cuckoo_hash.h b/lib/librte_hash/rte_cuckoo_hash.h
index e290dab..fe0654f 100644
--- a/lib/librte_hash/rte_cuckoo_hash.h
+++ b/lib/librte_hash/rte_cuckoo_hash.h
@@ -151,17 +151,6 @@ struct lcore_cache {
 	void *objs[LCORE_CACHE_SIZE]; /**< Cache objects */
 } __rte_cache_aligned;
 
-/* Structure storing both primary and secondary hashes */
-struct rte_hash_signatures {
-	union {
-		struct {
-			hash_sig_t current;
-			hash_sig_t alt;
-		};
-		uint64_t sig;
-	};
-};
-
 /* Structure that stores key-value pair */
 struct rte_hash_key {
 	union {
@@ -174,10 +163,14 @@ struct rte_hash_key {
 
 /** Bucket structure */
 struct rte_hash_bucket {
-	struct rte_hash_signatures signatures[RTE_HASH_BUCKET_ENTRIES];
+	hash_sig_t sig_current[RTE_HASH_BUCKET_ENTRIES];
+
 	/* Includes dummy key index that always contains index 0 */
 	uint32_t key_idx[RTE_HASH_BUCKET_ENTRIES + 1];
+
 	uint8_t flag[RTE_HASH_BUCKET_ENTRIES];
+
+	hash_sig_t sig_alt[RTE_HASH_BUCKET_ENTRIES];
 } __rte_cache_aligned;
 
 /** A hash table structure. */
diff --git a/lib/librte_hash/rte_cuckoo_hash_x86.h b/lib/librte_hash/rte_cuckoo_hash_x86.h
index e16d69c..494c160 100644
--- a/lib/librte_hash/rte_cuckoo_hash_x86.h
+++ b/lib/librte_hash/rte_cuckoo_hash_x86.h
@@ -54,8 +54,8 @@ rte_hash_cuckoo_insert_mw_tm(struct rte_hash_bucket *prim_bkt,
 			for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
 				/* Check if slot is available */
 				if (likely(prim_bkt->key_idx == EMPTY_SLOT)) {
-					prim_bkt->signatures[i].current = sig;
-					prim_bkt->signatures[i].alt = alt_hash;
+					prim_bkt->sig_current[i] = sig;
+					prim_bkt->sig_alt[i] = alt_hash;
 					prim_bkt->key_idx[i] = new_idx;
 					break;
 				}
@@ -101,7 +101,7 @@ rte_hash_cuckoo_move_insert_mw_tm(const struct rte_hash *h,
 				prev_slot = curr_node->prev_slot;
 
 				prev_alt_bkt_idx
-					= prev_bkt->signatures[prev_slot].alt
+					= prev_bkt->sig_alt[prev_slot]
 					    & h->bucket_bitmask;
 
 				if (unlikely(&h->buckets[prev_alt_bkt_idx]
@@ -113,10 +113,10 @@ rte_hash_cuckoo_move_insert_mw_tm(const struct rte_hash *h,
 				 * Cuckoo insert to move elements back to its
 				 * primary bucket if available
 				 */
-				curr_bkt->signatures[curr_slot].alt =
-				    prev_bkt->signatures[prev_slot].current;
-				curr_bkt->signatures[curr_slot].current =
-				    prev_bkt->signatures[prev_slot].alt;
+				curr_bkt->sig_alt[curr_slot] =
+				    prev_bkt->sig_current[prev_slot];
+				curr_bkt->sig_current[curr_slot] =
+				    prev_bkt->sig_alt[prev_slot];
 				curr_bkt->key_idx[curr_slot]
 				    = prev_bkt->key_idx[prev_slot];
 
@@ -125,8 +125,8 @@ rte_hash_cuckoo_move_insert_mw_tm(const struct rte_hash *h,
 				curr_bkt = curr_node->bkt;
 			}
 
-			curr_bkt->signatures[curr_slot].current = sig;
-			curr_bkt->signatures[curr_slot].alt = alt_hash;
+			curr_bkt->sig_current[curr_slot] = sig;
+			curr_bkt->sig_alt[curr_slot] = alt_hash;
 			curr_bkt->key_idx[curr_slot] = new_idx;
 
 			rte_xend();
@@ -178,7 +178,7 @@ rte_hash_cuckoo_make_space_mw_tm(const struct rte_hash *h,
 			}
 
 			/* Enqueue new node and keep prev node info */
-			alt_bkt = &(h->buckets[curr_bkt->signatures[i].alt
+			alt_bkt = &(h->buckets[curr_bkt->sig_alt[i]
 						    & h->bucket_bitmask]);
 			head->bkt = alt_bkt;
 			head->prev = tail;
-- 
2.7.4


* [PATCH 2/3] hash: add vectorized comparison
  2016-08-26 21:34 [PATCH 0/3] Cuckoo hash lookup enhancements Pablo de Lara
  2016-08-26 21:34 ` [PATCH 1/3] hash: reorganize bucket structure Pablo de Lara
@ 2016-08-26 21:34 ` Pablo de Lara
  2016-08-27  8:57   ` Thomas Monjalon
  2016-08-26 21:34 ` [PATCH 3/3] hash: modify lookup bulk pipeline Pablo de Lara
  2016-09-02 22:56 ` [PATCH v2 0/4] Cuckoo hash lookup enhancements Pablo de Lara
  3 siblings, 1 reply; 37+ messages in thread
From: Pablo de Lara @ 2016-08-26 21:34 UTC (permalink / raw)
  To: dev; +Cc: bruce.richardson, Byron Marohn, Saikrishna Edupuganti, Pablo de Lara

From: Byron Marohn <byron.marohn@intel.com>

In the lookup bulk function, the signatures of all entries
are compared against the signature of the key being looked up.
Now that all the signatures are contiguous, they can be compared
with vector instructions (SSE, AVX2), achieving higher lookup performance.

Also, the number of entries per bucket is increased to 8 on processors
with AVX2, as 256 bits (the size of 8x32-bit signatures) can be
compared at once.

Signed-off-by: Byron Marohn <byron.marohn@intel.com>
Signed-off-by: Saikrishna Edupuganti <saikrishna.edupuganti@intel.com>
Signed-off-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>
---
 lib/librte_hash/rte_cuckoo_hash.c | 41 ++++++++++++++++++++++++++++++++++-----
 lib/librte_hash/rte_cuckoo_hash.h |  4 ++++
 2 files changed, 40 insertions(+), 5 deletions(-)

diff --git a/lib/librte_hash/rte_cuckoo_hash.c b/lib/librte_hash/rte_cuckoo_hash.c
index 9d507b6..98713d3 100644
--- a/lib/librte_hash/rte_cuckoo_hash.c
+++ b/lib/librte_hash/rte_cuckoo_hash.c
@@ -939,6 +939,38 @@ lookup_stage1(unsigned idx, hash_sig_t *prim_hash, hash_sig_t *sec_hash,
 	rte_prefetch0(*secondary_bkt);
 }
 
+static inline void
+compare_signatures(unsigned *prim_hash_matches, unsigned *sec_hash_matches,
+				const struct rte_hash_bucket *prim_bkt,
+				const struct rte_hash_bucket *sec_bkt,
+				hash_sig_t prim_hash, hash_sig_t sec_hash)
+{
+/* 8 entries per bucket */
+#if defined(__AVX2__)
+	*prim_hash_matches |= _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
+			_mm256_load_si256((__m256i const *)prim_bkt->sig_current),
+			_mm256_set1_epi32(prim_hash)));
+	*sec_hash_matches |= _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
+			_mm256_load_si256((__m256i const *)sec_bkt->sig_current),
+			_mm256_set1_epi32(sec_hash)));
+/* 4 entries per bucket */
+#elif defined(__SSE2__)
+	*prim_hash_matches |= _mm_movemask_ps((__m128)_mm_cmpeq_epi16(
+			_mm_load_si128((__m128i const *)prim_bkt->sig_current),
+			_mm_set1_epi32(prim_hash)));
+	*sec_hash_matches |= _mm_movemask_ps((__m128)_mm_cmpeq_epi16(
+			_mm_load_si128((__m128i const *)sec_bkt->sig_current),
+			_mm_set1_epi32(sec_hash)));
+#else
+	unsigned i;
+
+	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
+		*prim_hash_matches |= ((prim_hash == prim_bkt->sig_current[i]) << i);
+		*sec_hash_matches |= ((sec_hash == sec_bkt->sig_current[i]) << i);
+	}
+#endif
+}
+
 /*
  * Lookup bulk stage 2:  Search for match hashes in primary/secondary locations
  * and prefetch first key slot
@@ -951,15 +983,14 @@ lookup_stage2(unsigned idx, hash_sig_t prim_hash, hash_sig_t sec_hash,
 		uint64_t *extra_hits_mask, const void *keys,
 		const struct rte_hash *h)
 {
-	unsigned prim_hash_matches, sec_hash_matches, key_idx, i;
+	unsigned prim_hash_matches, sec_hash_matches, key_idx;
 	unsigned total_hash_matches;
 
 	prim_hash_matches = 1 << RTE_HASH_BUCKET_ENTRIES;
 	sec_hash_matches = 1 << RTE_HASH_BUCKET_ENTRIES;
-	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		prim_hash_matches |= ((prim_hash == prim_bkt->sig_current[i]) << i);
-		sec_hash_matches |= ((sec_hash == sec_bkt->sig_current[i]) << i);
-	}
+
+	compare_signatures(&prim_hash_matches, &sec_hash_matches, prim_bkt,
+						sec_bkt, prim_hash, sec_hash);
 
 	key_idx = prim_bkt->key_idx[__builtin_ctzl(prim_hash_matches)];
 	if (key_idx == 0)
diff --git a/lib/librte_hash/rte_cuckoo_hash.h b/lib/librte_hash/rte_cuckoo_hash.h
index fe0654f..eb57d7e 100644
--- a/lib/librte_hash/rte_cuckoo_hash.h
+++ b/lib/librte_hash/rte_cuckoo_hash.h
@@ -130,7 +130,11 @@ enum add_key_case {
 };
 
 /** Number of items per bucket. */
+#if defined(__AVX2__)
+#define RTE_HASH_BUCKET_ENTRIES		8
+#else
 #define RTE_HASH_BUCKET_ENTRIES		4
+#endif
 
 #define NULL_SIGNATURE			0
 
-- 
2.7.4


* [PATCH 3/3] hash: modify lookup bulk pipeline
  2016-08-26 21:34 [PATCH 0/3] Cuckoo hash lookup enhancements Pablo de Lara
  2016-08-26 21:34 ` [PATCH 1/3] hash: reorganize bucket structure Pablo de Lara
  2016-08-26 21:34 ` [PATCH 2/3] hash: add vectorized comparison Pablo de Lara
@ 2016-08-26 21:34 ` Pablo de Lara
  2016-09-02 22:56 ` [PATCH v2 0/4] Cuckoo hash lookup enhancements Pablo de Lara
  3 siblings, 0 replies; 37+ messages in thread
From: Pablo de Lara @ 2016-08-26 21:34 UTC (permalink / raw)
  To: dev; +Cc: bruce.richardson, Byron Marohn, Saikrishna Edupuganti, Pablo de Lara

From: Byron Marohn <byron.marohn@intel.com>

This patch replaces the pipelined rte_hash lookup mechanism with a
loop-and-jump model, which performs significantly better,
especially for smaller table sizes and smaller table occupancies.

Signed-off-by: Byron Marohn <byron.marohn@intel.com>
Signed-off-by: Saikrishna Edupuganti <saikrishna.edupuganti@intel.com>
Signed-off-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>
---
 lib/librte_hash/rte_cuckoo_hash.c | 381 ++++++++++++--------------------------
 lib/librte_hash/rte_cuckoo_hash.h |   3 +-
 2 files changed, 121 insertions(+), 263 deletions(-)

diff --git a/lib/librte_hash/rte_cuckoo_hash.c b/lib/librte_hash/rte_cuckoo_hash.c
index 98713d3..41acdc7 100644
--- a/lib/librte_hash/rte_cuckoo_hash.c
+++ b/lib/librte_hash/rte_cuckoo_hash.c
@@ -904,61 +904,26 @@ rte_hash_get_key_with_position(const struct rte_hash *h, const int32_t position,
 	return 0;
 }
 
-/* Lookup bulk stage 0: Prefetch input key */
 static inline void
-lookup_stage0(unsigned *idx, uint64_t *lookup_mask,
-		const void * const *keys)
-{
-	*idx = __builtin_ctzl(*lookup_mask);
-	if (*lookup_mask == 0)
-		*idx = 0;
-
-	rte_prefetch0(keys[*idx]);
-	*lookup_mask &= ~(1llu << *idx);
-}
-
-/*
- * Lookup bulk stage 1: Calculate primary/secondary hashes
- * and prefetch primary/secondary buckets
- */
-static inline void
-lookup_stage1(unsigned idx, hash_sig_t *prim_hash, hash_sig_t *sec_hash,
-		const struct rte_hash_bucket **primary_bkt,
-		const struct rte_hash_bucket **secondary_bkt,
-		hash_sig_t *hash_vals, const void * const *keys,
-		const struct rte_hash *h)
-{
-	*prim_hash = rte_hash_hash(h, keys[idx]);
-	hash_vals[idx] = *prim_hash;
-	*sec_hash = rte_hash_secondary_hash(*prim_hash);
-
-	*primary_bkt = &h->buckets[*prim_hash & h->bucket_bitmask];
-	*secondary_bkt = &h->buckets[*sec_hash & h->bucket_bitmask];
-
-	rte_prefetch0(*primary_bkt);
-	rte_prefetch0(*secondary_bkt);
-}
-
-static inline void
-compare_signatures(unsigned *prim_hash_matches, unsigned *sec_hash_matches,
+compare_signatures(uint32_t *prim_hash_matches, uint32_t *sec_hash_matches,
 				const struct rte_hash_bucket *prim_bkt,
 				const struct rte_hash_bucket *sec_bkt,
 				hash_sig_t prim_hash, hash_sig_t sec_hash)
 {
 /* 8 entries per bucket */
 #if defined(__AVX2__)
-	*prim_hash_matches |= _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
+	*prim_hash_matches = _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
 			_mm256_load_si256((__m256i const *)prim_bkt->sig_current),
 			_mm256_set1_epi32(prim_hash)));
-	*sec_hash_matches |= _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
+	*sec_hash_matches = _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
 			_mm256_load_si256((__m256i const *)sec_bkt->sig_current),
 			_mm256_set1_epi32(sec_hash)));
 /* 4 entries per bucket */
 #elif defined(__SSE2__)
-	*prim_hash_matches |= _mm_movemask_ps((__m128)_mm_cmpeq_epi16(
+	*prim_hash_matches = _mm_movemask_ps((__m128)_mm_cmpeq_epi16(
 			_mm_load_si128((__m128i const *)prim_bkt->sig_current),
 			_mm_set1_epi32(prim_hash)));
-	*sec_hash_matches |= _mm_movemask_ps((__m128)_mm_cmpeq_epi16(
+	*sec_hash_matches = _mm_movemask_ps((__m128)_mm_cmpeq_epi16(
 			_mm_load_si128((__m128i const *)sec_bkt->sig_current),
 			_mm_set1_epi32(sec_hash)));
 #else
@@ -971,244 +936,138 @@ compare_signatures(unsigned *prim_hash_matches, unsigned *sec_hash_matches,
 #endif
 }
 
-/*
- * Lookup bulk stage 2:  Search for match hashes in primary/secondary locations
- * and prefetch first key slot
- */
+#define PREFETCH_OFFSET 4
 static inline void
-lookup_stage2(unsigned idx, hash_sig_t prim_hash, hash_sig_t sec_hash,
-		const struct rte_hash_bucket *prim_bkt,
-		const struct rte_hash_bucket *sec_bkt,
-		const struct rte_hash_key **key_slot, int32_t *positions,
-		uint64_t *extra_hits_mask, const void *keys,
-		const struct rte_hash *h)
+__rte_hash_lookup_bulk(const struct rte_hash *h, const void **keys,
+			int32_t num_keys, int32_t *positions,
+			uint64_t *hit_mask, void *data[])
 {
-	unsigned prim_hash_matches, sec_hash_matches, key_idx;
-	unsigned total_hash_matches;
+	uint64_t hits = 0;
+	int32_t i;
+	uint32_t prim_hash[RTE_HASH_LOOKUP_BULK_MAX];
+	uint32_t sec_hash[RTE_HASH_LOOKUP_BULK_MAX];
+	const struct rte_hash_bucket *primary_bkt[RTE_HASH_LOOKUP_BULK_MAX];
+	const struct rte_hash_bucket *secondary_bkt[RTE_HASH_LOOKUP_BULK_MAX];
+	uint32_t prim_hitmask[RTE_HASH_LOOKUP_BULK_MAX] = {0};
+	uint32_t sec_hitmask[RTE_HASH_LOOKUP_BULK_MAX] = {0};
+
+	/* Prefetch first keys */
+	for (i = 0; i < PREFETCH_OFFSET && i < num_keys; i++) {
+		rte_prefetch0(keys[i]);
+	}
 
-	prim_hash_matches = 1 << RTE_HASH_BUCKET_ENTRIES;
-	sec_hash_matches = 1 << RTE_HASH_BUCKET_ENTRIES;
+	/*
+	 * Prefetch rest of the keys, calculate primary and
+	 * secondary bucket and prefetch them
+	 */
+	for (i = 0; i < (num_keys - PREFETCH_OFFSET); i++) {
+		rte_prefetch0(keys[i + PREFETCH_OFFSET]);
 
-	compare_signatures(&prim_hash_matches, &sec_hash_matches, prim_bkt,
-						sec_bkt, prim_hash, sec_hash);
+		prim_hash[i] = rte_hash_hash(h, keys[i]);
+		sec_hash[i] = rte_hash_secondary_hash(prim_hash[i]);
 
-	key_idx = prim_bkt->key_idx[__builtin_ctzl(prim_hash_matches)];
-	if (key_idx == 0)
-		key_idx = sec_bkt->key_idx[__builtin_ctzl(sec_hash_matches)];
+		primary_bkt[i] = &h->buckets[prim_hash[i] & h->bucket_bitmask];
+		secondary_bkt[i] = &h->buckets[sec_hash[i] & h->bucket_bitmask];
 
-	total_hash_matches = (prim_hash_matches |
-				(sec_hash_matches << (RTE_HASH_BUCKET_ENTRIES + 1)));
-	*key_slot = (const struct rte_hash_key *) ((const char *)keys +
-					key_idx * h->key_entry_size);
+		rte_prefetch0(primary_bkt[i]);
+		rte_prefetch0(secondary_bkt[i]);
+	}
 
-	rte_prefetch0(*key_slot);
-	/*
-	 * Return index where key is stored,
-	 * substracting the first dummy index
-	 */
-	positions[idx] = (key_idx - 1);
+	/* Calculate and prefetch rest of the buckets */
+	for (; i < num_keys; i++) {
+		prim_hash[i] = rte_hash_hash(h, keys[i]);
+		sec_hash[i] = rte_hash_secondary_hash(prim_hash[i]);
 
-	*extra_hits_mask |= (uint64_t)(__builtin_popcount(total_hash_matches) > 3) << idx;
+		primary_bkt[i] = &h->buckets[prim_hash[i] & h->bucket_bitmask];
+		secondary_bkt[i] = &h->buckets[sec_hash[i] & h->bucket_bitmask];
 
-}
+		rte_prefetch0(primary_bkt[i]);
+		rte_prefetch0(secondary_bkt[i]);
+	}
 
+	/* Compare signatures and prefetch key slot of first hit */
+	for (i = 0; i < num_keys; i++) {
+		compare_signatures(&prim_hitmask[i], &sec_hitmask[i],
+				primary_bkt[i], secondary_bkt[i],
+				prim_hash[i], sec_hash[i]);
+
+		if (prim_hitmask[i]) {
+			uint32_t first_hit = __builtin_ctzl(prim_hitmask[i]);
+			uint32_t key_idx = primary_bkt[i]->key_idx[first_hit];
+			const struct rte_hash_key *key_slot =
+				(const struct rte_hash_key *)(
+				(const char *)h->key_store +
+				key_idx * h->key_entry_size);
+			rte_prefetch0(key_slot);
+			goto next_prefetch;
+		}
 
-/* Lookup bulk stage 3: Check if key matches, update hit mask and return data */
-static inline void
-lookup_stage3(unsigned idx, const struct rte_hash_key *key_slot, const void * const *keys,
-		const int32_t *positions, void *data[], uint64_t *hits,
-		const struct rte_hash *h)
-{
-	unsigned hit;
-	unsigned key_idx;
+		if (sec_hitmask[i]) {
+			uint32_t first_hit = __builtin_ctzl(sec_hitmask[i]);
+			uint32_t key_idx = secondary_bkt[i]->key_idx[first_hit];
+			const struct rte_hash_key *key_slot =
+				(const struct rte_hash_key *)(
+				(const char *)h->key_store +
+				key_idx * h->key_entry_size);
+			rte_prefetch0(key_slot);
+		}
 
-	hit = !rte_hash_cmp_eq(key_slot->key, keys[idx], h);
-	if (data != NULL)
-		data[idx] = key_slot->pdata;
+next_prefetch:
+		continue;
+	}
 
-	key_idx = positions[idx] + 1;
-	/*
-	 * If key index is 0, force hit to be 0, in case key to be looked up
-	 * is all zero (as in the dummy slot), which would result in a wrong hit
-	 */
-	*hits |= (uint64_t)(hit && !!key_idx)  << idx;
-}
+	/* Compare keys, first hits in primary first */
+	for (i = 0; i < num_keys; i++) {
+		positions[i] = -ENOENT;
+		while (prim_hitmask[i]) {
+			uint32_t hit_index = __builtin_ctzl(prim_hitmask[i]);
+
+			uint32_t key_idx = primary_bkt[i]->key_idx[hit_index];
+			const struct rte_hash_key *key_slot =
+				(const struct rte_hash_key *)(
+				(const char *)h->key_store +
+				key_idx * h->key_entry_size);
+			/*
+			 * If key index is 0, do not compare key,
+			 * as it is checking the dummy slot
+			 */
+			if (!!key_idx & !rte_hash_cmp_eq(key_slot->key, keys[i], h)) {
+				if (data != NULL)
+					data[i] = key_slot->pdata;
 
-static inline void
-__rte_hash_lookup_bulk(const struct rte_hash *h, const void **keys,
-			uint32_t num_keys, int32_t *positions,
-			uint64_t *hit_mask, void *data[])
-{
-	uint64_t hits = 0;
-	uint64_t extra_hits_mask = 0;
-	uint64_t lookup_mask, miss_mask;
-	unsigned idx;
-	const void *key_store = h->key_store;
-	int ret;
-	hash_sig_t hash_vals[RTE_HASH_LOOKUP_BULK_MAX];
-
-	unsigned idx00, idx01, idx10, idx11, idx20, idx21, idx30, idx31;
-	const struct rte_hash_bucket *primary_bkt10, *primary_bkt11;
-	const struct rte_hash_bucket *secondary_bkt10, *secondary_bkt11;
-	const struct rte_hash_bucket *primary_bkt20, *primary_bkt21;
-	const struct rte_hash_bucket *secondary_bkt20, *secondary_bkt21;
-	const struct rte_hash_key *k_slot20, *k_slot21, *k_slot30, *k_slot31;
-	hash_sig_t primary_hash10, primary_hash11;
-	hash_sig_t secondary_hash10, secondary_hash11;
-	hash_sig_t primary_hash20, primary_hash21;
-	hash_sig_t secondary_hash20, secondary_hash21;
-
-	lookup_mask = (uint64_t) -1 >> (64 - num_keys);
-	miss_mask = lookup_mask;
-
-	lookup_stage0(&idx00, &lookup_mask, keys);
-	lookup_stage0(&idx01, &lookup_mask, keys);
-
-	idx10 = idx00, idx11 = idx01;
-
-	lookup_stage0(&idx00, &lookup_mask, keys);
-	lookup_stage0(&idx01, &lookup_mask, keys);
-	lookup_stage1(idx10, &primary_hash10, &secondary_hash10,
-			&primary_bkt10, &secondary_bkt10, hash_vals, keys, h);
-	lookup_stage1(idx11, &primary_hash11, &secondary_hash11,
-			&primary_bkt11,	&secondary_bkt11, hash_vals, keys, h);
-
-	primary_bkt20 = primary_bkt10;
-	primary_bkt21 = primary_bkt11;
-	secondary_bkt20 = secondary_bkt10;
-	secondary_bkt21 = secondary_bkt11;
-	primary_hash20 = primary_hash10;
-	primary_hash21 = primary_hash11;
-	secondary_hash20 = secondary_hash10;
-	secondary_hash21 = secondary_hash11;
-	idx20 = idx10, idx21 = idx11;
-	idx10 = idx00, idx11 = idx01;
-
-	lookup_stage0(&idx00, &lookup_mask, keys);
-	lookup_stage0(&idx01, &lookup_mask, keys);
-	lookup_stage1(idx10, &primary_hash10, &secondary_hash10,
-			&primary_bkt10, &secondary_bkt10, hash_vals, keys, h);
-	lookup_stage1(idx11, &primary_hash11, &secondary_hash11,
-			&primary_bkt11,	&secondary_bkt11, hash_vals, keys, h);
-	lookup_stage2(idx20, primary_hash20, secondary_hash20, primary_bkt20,
-			secondary_bkt20, &k_slot20, positions, &extra_hits_mask,
-			key_store, h);
-	lookup_stage2(idx21, primary_hash21, secondary_hash21, primary_bkt21,
-			secondary_bkt21, &k_slot21, positions, &extra_hits_mask,
-			key_store, h);
-
-	while (lookup_mask) {
-		k_slot30 = k_slot20, k_slot31 = k_slot21;
-		idx30 = idx20, idx31 = idx21;
-		primary_bkt20 = primary_bkt10;
-		primary_bkt21 = primary_bkt11;
-		secondary_bkt20 = secondary_bkt10;
-		secondary_bkt21 = secondary_bkt11;
-		primary_hash20 = primary_hash10;
-		primary_hash21 = primary_hash11;
-		secondary_hash20 = secondary_hash10;
-		secondary_hash21 = secondary_hash11;
-		idx20 = idx10, idx21 = idx11;
-		idx10 = idx00, idx11 = idx01;
-
-		lookup_stage0(&idx00, &lookup_mask, keys);
-		lookup_stage0(&idx01, &lookup_mask, keys);
-		lookup_stage1(idx10, &primary_hash10, &secondary_hash10,
-			&primary_bkt10, &secondary_bkt10, hash_vals, keys, h);
-		lookup_stage1(idx11, &primary_hash11, &secondary_hash11,
-			&primary_bkt11,	&secondary_bkt11, hash_vals, keys, h);
-		lookup_stage2(idx20, primary_hash20, secondary_hash20,
-			primary_bkt20, secondary_bkt20, &k_slot20, positions,
-			&extra_hits_mask, key_store, h);
-		lookup_stage2(idx21, primary_hash21, secondary_hash21,
-			primary_bkt21, secondary_bkt21,	&k_slot21, positions,
-			&extra_hits_mask, key_store, h);
-		lookup_stage3(idx30, k_slot30, keys, positions, data, &hits, h);
-		lookup_stage3(idx31, k_slot31, keys, positions, data, &hits, h);
-	}
+				hits |= 1ULL << i;
+				positions[i] = key_idx - 1;
+				goto next_key;
+			}
+			prim_hitmask[i] &= ~(1 << (hit_index));
+		}
 
-	k_slot30 = k_slot20, k_slot31 = k_slot21;
-	idx30 = idx20, idx31 = idx21;
-	primary_bkt20 = primary_bkt10;
-	primary_bkt21 = primary_bkt11;
-	secondary_bkt20 = secondary_bkt10;
-	secondary_bkt21 = secondary_bkt11;
-	primary_hash20 = primary_hash10;
-	primary_hash21 = primary_hash11;
-	secondary_hash20 = secondary_hash10;
-	secondary_hash21 = secondary_hash11;
-	idx20 = idx10, idx21 = idx11;
-	idx10 = idx00, idx11 = idx01;
-
-	lookup_stage1(idx10, &primary_hash10, &secondary_hash10,
-		&primary_bkt10, &secondary_bkt10, hash_vals, keys, h);
-	lookup_stage1(idx11, &primary_hash11, &secondary_hash11,
-		&primary_bkt11,	&secondary_bkt11, hash_vals, keys, h);
-	lookup_stage2(idx20, primary_hash20, secondary_hash20, primary_bkt20,
-		secondary_bkt20, &k_slot20, positions, &extra_hits_mask,
-		key_store, h);
-	lookup_stage2(idx21, primary_hash21, secondary_hash21, primary_bkt21,
-		secondary_bkt21, &k_slot21, positions, &extra_hits_mask,
-		key_store, h);
-	lookup_stage3(idx30, k_slot30, keys, positions, data, &hits, h);
-	lookup_stage3(idx31, k_slot31, keys, positions, data, &hits, h);
-
-	k_slot30 = k_slot20, k_slot31 = k_slot21;
-	idx30 = idx20, idx31 = idx21;
-	primary_bkt20 = primary_bkt10;
-	primary_bkt21 = primary_bkt11;
-	secondary_bkt20 = secondary_bkt10;
-	secondary_bkt21 = secondary_bkt11;
-	primary_hash20 = primary_hash10;
-	primary_hash21 = primary_hash11;
-	secondary_hash20 = secondary_hash10;
-	secondary_hash21 = secondary_hash11;
-	idx20 = idx10, idx21 = idx11;
-
-	lookup_stage2(idx20, primary_hash20, secondary_hash20, primary_bkt20,
-		secondary_bkt20, &k_slot20, positions, &extra_hits_mask,
-		key_store, h);
-	lookup_stage2(idx21, primary_hash21, secondary_hash21, primary_bkt21,
-		secondary_bkt21, &k_slot21, positions, &extra_hits_mask,
-		key_store, h);
-	lookup_stage3(idx30, k_slot30, keys, positions, data, &hits, h);
-	lookup_stage3(idx31, k_slot31, keys, positions, data, &hits, h);
-
-	k_slot30 = k_slot20, k_slot31 = k_slot21;
-	idx30 = idx20, idx31 = idx21;
-
-	lookup_stage3(idx30, k_slot30, keys, positions, data, &hits, h);
-	lookup_stage3(idx31, k_slot31, keys, positions, data, &hits, h);
-
-	/* ignore any items we have already found */
-	extra_hits_mask &= ~hits;
-
-	if (unlikely(extra_hits_mask)) {
-		/* run a single search for each remaining item */
-		do {
-			idx = __builtin_ctzl(extra_hits_mask);
-			if (data != NULL) {
-				ret = rte_hash_lookup_with_hash_data(h,
-						keys[idx], hash_vals[idx], &data[idx]);
-				if (ret >= 0)
-					hits |= 1ULL << idx;
-			} else {
-				positions[idx] = rte_hash_lookup_with_hash(h,
-							keys[idx], hash_vals[idx]);
-				if (positions[idx] >= 0)
-					hits |= 1llu << idx;
+		while (sec_hitmask[i]) {
+			uint32_t hit_index = __builtin_ctzl(sec_hitmask[i]);
+
+			uint32_t key_idx = secondary_bkt[i]->key_idx[hit_index];
+			const struct rte_hash_key *key_slot =
+				(const struct rte_hash_key *)(
+				(const char *)h->key_store +
+				key_idx * h->key_entry_size);
+			/*
+			 * If key index is 0, do not compare key,
+			 * as it is checking the dummy slot
+			 */
+
+			if (!!key_idx & !rte_hash_cmp_eq(key_slot->key, keys[i], h)) {
+				if (data != NULL)
+					data[i] = key_slot->pdata;
+
+				hits |= 1ULL << i;
+				positions[i] = key_idx - 1;
+				goto next_key;
 			}
-			extra_hits_mask &= ~(1llu << idx);
-		} while (extra_hits_mask);
-	}
+			sec_hitmask[i] &= ~(1 << (hit_index));
+		}
 
-	miss_mask &= ~hits;
-	if (unlikely(miss_mask)) {
-		do {
-			idx = __builtin_ctzl(miss_mask);
-			positions[idx] = -ENOENT;
-			miss_mask &= ~(1llu << idx);
-		} while (miss_mask);
+next_key:
+		continue;
 	}
 
 	if (hit_mask != NULL)
diff --git a/lib/librte_hash/rte_cuckoo_hash.h b/lib/librte_hash/rte_cuckoo_hash.h
index eb57d7e..f5c7904 100644
--- a/lib/librte_hash/rte_cuckoo_hash.h
+++ b/lib/librte_hash/rte_cuckoo_hash.h
@@ -169,8 +169,7 @@ struct rte_hash_key {
 struct rte_hash_bucket {
 	hash_sig_t sig_current[RTE_HASH_BUCKET_ENTRIES];
 
-	/* Includes dummy key index that always contains index 0 */
-	uint32_t key_idx[RTE_HASH_BUCKET_ENTRIES + 1];
+	uint32_t key_idx[RTE_HASH_BUCKET_ENTRIES];
 
 	uint8_t flag[RTE_HASH_BUCKET_ENTRIES];
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/3] hash: add vectorized comparison
  2016-08-26 21:34 ` [PATCH 2/3] hash: add vectorized comparison Pablo de Lara
@ 2016-08-27  8:57   ` Thomas Monjalon
  2016-09-02 17:05     ` De Lara Guarch, Pablo
  0 siblings, 1 reply; 37+ messages in thread
From: Thomas Monjalon @ 2016-08-27  8:57 UTC (permalink / raw)
  To: Pablo de Lara, Byron Marohn
  Cc: dev, bruce.richardson, Saikrishna Edupuganti, jianbo.liu,
	chaozhu, jerin.jacob

2016-08-26 22:34, Pablo de Lara:
> From: Byron Marohn <byron.marohn@intel.com>
> 
> In lookup bulk function, the signatures of all entries
> are compared against the signature of the key that is being looked up.
> Now that all the signatures are together, they can be compared
> with vector instructions (SSE, AVX2), achieving higher lookup performance.
> 
> Also, entries per bucket are increased to 8 when using processors
> with AVX2, as 256 bits can be compared at once, which is the size of
> 8x32-bit signatures.

Please, would it be possible to use the generic SIMD intrinsics?
We could define generic types compatible with Altivec and NEON:
	__attribute__ ((vector_size (n)))
as described in https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html
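[Editorial note: a minimal sketch of what such a generic-vector comparison could look like. This is illustrative only, not DPDK code; all names here are hypothetical.]

```c
/*
 * Compare four 32-bit signatures at once using GCC generic vector
 * extensions, which compile to SSE, NEON or Altivec as available.
 */
#include <stdint.h>

typedef uint32_t v4u32 __attribute__((vector_size(16)));
typedef int32_t v4i32 __attribute__((vector_size(16)));

static inline unsigned
sig_cmp_generic(const uint32_t sig[4], uint32_t hash)
{
	v4u32 cur = { sig[0], sig[1], sig[2], sig[3] };
	v4u32 val = { hash, hash, hash, hash };
	v4i32 eq = (cur == val);	/* each lane: -1 on match, 0 otherwise */
	unsigned mask = 0;
	int i;

	/* There is no generic movemask builtin, so extract lanes one by one */
	for (i = 0; i < 4; i++)
		mask |= (unsigned)(eq[i] & 1) << i;
	return mask;
}
```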

> +/* 8 entries per bucket */
> +#if defined(__AVX2__)

Please prefer
	#ifdef RTE_MACHINE_CPUFLAG_AVX2
Ideally the vector support could be checked at runtime:
	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
It would allow packaging one binary using the best optimization available.
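[Editorial note: the runtime selection being suggested can be sketched as below. Names are illustrative, not the DPDK API; `__builtin_cpu_supports` stands in for `rte_cpu_get_flag_enabled()`.]

```c
/*
 * Probe CPU features once (e.g. at table creation) and store the
 * result, then switch on it in the hot path instead of deciding at
 * compile time with #ifdef.
 */
enum sig_cmp_fn { CMP_SCALAR = 0, CMP_SSE, CMP_AVX2 };

static enum sig_cmp_fn
select_sig_cmp_fn(void)
{
#if defined(__x86_64__) || defined(__i386__)
	/* GCC/Clang builtin; DPDK would use rte_cpu_get_flag_enabled() */
	if (__builtin_cpu_supports("avx2"))
		return CMP_AVX2;
	if (__builtin_cpu_supports("sse2"))
		return CMP_SSE;
#endif
	return CMP_SCALAR;
}
```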

> +	*prim_hash_matches |= _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
> +			_mm256_load_si256((__m256i const *)prim_bkt->sig_current),
> +			_mm256_set1_epi32(prim_hash)));
> +	*sec_hash_matches |= _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
> +			_mm256_load_si256((__m256i const *)sec_bkt->sig_current),
> +			_mm256_set1_epi32(sec_hash)));
> +/* 4 entries per bucket */
> +#elif defined(__SSE2__)
> +	*prim_hash_matches |= _mm_movemask_ps((__m128)_mm_cmpeq_epi16(
> +			_mm_load_si128((__m128i const *)prim_bkt->sig_current),
> +			_mm_set1_epi32(prim_hash)));
> +	*sec_hash_matches |= _mm_movemask_ps((__m128)_mm_cmpeq_epi16(
> +			_mm_load_si128((__m128i const *)sec_bkt->sig_current),
> +			_mm_set1_epi32(sec_hash)));

In order to allow such switch based on register size, we could have an
abstraction in EAL supporting 128/256/512 width for x86/ARM/POWER.
I think aliasing RTE_MACHINE_CPUFLAG_ and RTE_CPUFLAG_ may be enough.


* Re: [PATCH 2/3] hash: add vectorized comparison
  2016-08-27  8:57   ` Thomas Monjalon
@ 2016-09-02 17:05     ` De Lara Guarch, Pablo
  0 siblings, 0 replies; 37+ messages in thread
From: De Lara Guarch, Pablo @ 2016-09-02 17:05 UTC (permalink / raw)
  To: Thomas Monjalon, Marohn, Byron
  Cc: dev, Richardson, Bruce, Edupuganti, Saikrishna, jianbo.liu,
	chaozhu, jerin.jacob



> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> Sent: Saturday, August 27, 2016 1:58 AM
> To: De Lara Guarch, Pablo; Marohn, Byron
> Cc: dev@dpdk.org; Richardson, Bruce; Edupuganti, Saikrishna;
> jianbo.liu@linaro.org; chaozhu@linux.vnet.ibm.com;
> jerin.jacob@caviumnetworks.com
> Subject: Re: [dpdk-dev] [PATCH 2/3] hash: add vectorized comparison
> 
> 2016-08-26 22:34, Pablo de Lara:
> > From: Byron Marohn <byron.marohn@intel.com>
> >
> > In lookup bulk function, the signatures of all entries
> > are compared against the signature of the key that is being looked up.
> > Now that all the signatures are together, they can be compared
> > with vector instructions (SSE, AVX2), achieving higher lookup performance.
> >
> > Also, entries per bucket are increased to 8 when using processors
> > with AVX2, as 256 bits can be compared at once, which is the size of
> > 8x32-bit signatures.
> 
> Please, would it be possible to use the generic SIMD intrinsics?
> We could define generic types compatible with Altivec and NEON:
> 	__attribute__ ((vector_size (n)))
> as described in https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html
> 

I tried to convert these into generic code with gcc builtins,
but I couldn't find a way to translate the _mm_movemask intrinsic into a generic builtin
(which is very necessary for performance reasons).
Therefore, I think it is not possible to do this without penalizing performance.
Sure, we could try to translate the other intrinsics, but it would mean that we still need
#ifdefs and would end up with a mix of x86 intrinsics and gcc builtins,
so it is better to leave it this way.
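[Editorial note: the movemask point above can be illustrated with a sketch. On x86 this operation is a single MOVMSKPS instruction; with generic vectors it has to be emulated per lane, which is the performance penalty being described. Illustrative code only.]

```c
/*
 * Generic-vector emulation of movemask: gather the sign bit of each
 * 32-bit comparison-result lane into a 4-bit scalar mask, one lane
 * at a time.
 */
#include <stdint.h>

typedef int32_t v4i32 __attribute__((vector_size(16)));

static inline unsigned
movemask_emulated(v4i32 v)
{
	return ((uint32_t)v[0] >> 31)
	     | (((uint32_t)v[1] >> 31) << 1)
	     | (((uint32_t)v[2] >> 31) << 2)
	     | (((uint32_t)v[3] >> 31) << 3);
}
```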

> > +/* 8 entries per bucket */
> > +#if defined(__AVX2__)
> 
> Please prefer
> 	#ifdef RTE_MACHINE_CPUFLAG_AVX2
> Ideally the vector support could be checked at runtime:
> 	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
> It would allow packaging one binary using the best optimization available.
> 

Good idea. Will submit a v2 with this change. It took me a bit of time to figure out
a way to do this without paying a big performance penalty.

> > +	*prim_hash_matches |= _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
> > +			_mm256_load_si256((__m256i const *)prim_bkt->sig_current),
> > +			_mm256_set1_epi32(prim_hash)));
> > +	*sec_hash_matches |= _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
> > +			_mm256_load_si256((__m256i const *)sec_bkt->sig_current),
> > +			_mm256_set1_epi32(sec_hash)));
> > +/* 4 entries per bucket */
> > +#elif defined(__SSE2__)
> > +	*prim_hash_matches |= _mm_movemask_ps((__m128)_mm_cmpeq_epi16(
> > +			_mm_load_si128((__m128i const *)prim_bkt->sig_current),
> > +			_mm_set1_epi32(prim_hash)));
> > +	*sec_hash_matches |= _mm_movemask_ps((__m128)_mm_cmpeq_epi16(
> > +			_mm_load_si128((__m128i const *)sec_bkt->sig_current),
> > +			_mm_set1_epi32(sec_hash)));
> 
> In order to allow such switch based on register size, we could have an
> abstraction in EAL supporting 128/256/512 width for x86/ARM/POWER.
> I think aliasing RTE_MACHINE_CPUFLAG_ and RTE_CPUFLAG_ may be
> enough.


* [PATCH v2 0/4] Cuckoo hash lookup enhancements
  2016-08-26 21:34 [PATCH 0/3] Cuckoo hash lookup enhancements Pablo de Lara
                   ` (2 preceding siblings ...)
  2016-08-26 21:34 ` [PATCH 3/3] hash: modify lookup bulk pipeline Pablo de Lara
@ 2016-09-02 22:56 ` Pablo de Lara
  2016-09-02 22:56   ` [PATCH v2 1/4] hash: reorder hash structure Pablo de Lara
                     ` (5 more replies)
  3 siblings, 6 replies; 37+ messages in thread
From: Pablo de Lara @ 2016-09-02 22:56 UTC (permalink / raw)
  To: dev; +Cc: bruce.richardson, Pablo de Lara

This patchset improves lookup performance in the current hash library
by replacing the existing 4-stage, 2-entry lookup bulk pipeline
with an improved pipeline based on a loop-and-jump model.
x86 vectorized intrinsics are also used to improve performance when comparing signatures.

First patch modifies the layout of the bucket structure.
Currently, buckets store the current and alternative signatures interleaved per entry.
To be able to perform a vectorized signature comparison,
all current signatures have to be contiguous, so the bucket layout has been changed,
separating the current signatures from the alternative signatures.

Second patch introduces x86 vectorized intrinsics.
When performing a lookup bulk operation, all current signatures in a bucket
are compared against the signature of the key being looked up.
Now that they are stored contiguously, a vectorized comparison can be performed,
which takes fewer instructions to carry out.
On machines with AVX2, the number of entries per bucket is increased
from 4 to 8, as AVX2 can compare two 256-bit values at once,
i.e. the 8x32-bit signatures of a bucket.

The last patch modifies the current pipeline of the lookup bulk function.
The new pipeline is based on a loop-and-jump model. The two key improvements are:

- Better prefetching: the first 4 keys to be looked up are prefetched up front,
  and the rest of the keys are prefetched while the signature calculations
  are being performed. This gives the CPU more time to fetch the data
  before it is actually needed, which results in fewer cache misses
  and therefore higher throughput.

- Lower performance penalty when using fallback: the lookup bulk algorithm
  assumes that most of the time there will be no collision in a bucket, but
  two or more signatures may be equal, in which case more than one key
  comparison is necessary. As in the current implementation, only the key of
  the first hit is prefetched. The difference is that, if that comparison
  results in a miss, the information about the remaining candidate keys has
  already been stored, whereas the current implementation has to perform an
  entire simple lookup again.
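[Editorial note: the prefetch pattern described above can be sketched with the toy code below. This is illustrative only, not the DPDK implementation; `toy_sig` and `LOOKAHEAD` are hypothetical stand-ins.]

```c
/*
 * Prefetch the first batch of keys before any computation, then
 * prefetch later keys while the current ones are being hashed, so
 * memory latency overlaps with signature calculation.
 */
#include <stddef.h>
#include <stdint.h>

#define LOOKAHEAD 4

static uint32_t
toy_sig(uint32_t key)
{
	return key * 2654435761u;	/* stand-in for the real hash */
}

static void
bulk_signatures(const uint32_t *keys, uint32_t *sigs, size_t n)
{
	size_t i;

	/* Stage 1: prefetch the first LOOKAHEAD keys up front */
	for (i = 0; i < LOOKAHEAD && i < n; i++)
		__builtin_prefetch(&keys[i]);

	for (i = 0; i < n; i++) {
		/* Prefetch a later key while the current one is hashed */
		if (i + LOOKAHEAD < n)
			__builtin_prefetch(&keys[i + LOOKAHEAD]);
		sigs[i] = toy_sig(keys[i]);
	}
}
```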

This patchset depends on the following patchset:
"Hash library fixes" (http://dpdk.org/ml/archives/dev/2016-August/045780.html)

Changes in v2:
- Increased entries per bucket from 4 to 8 for all cases,
  so it is not architecture dependent any longer.
- Replaced compile-time signature comparison function election
  with run-time election, so best optimization available
  will be used from a single binary.
- Reordered the hash structure, so all the fields used by lookup
  are in the same cache line (first).


Byron Marohn (3):
  hash: reorganize bucket structure
  hash: add vectorized comparison
  hash: modify lookup bulk pipeline

Pablo de Lara (1):
  hash: reorder hash structure

 lib/librte_hash/rte_cuckoo_hash.c     | 455 ++++++++++++++--------------------
 lib/librte_hash/rte_cuckoo_hash.h     |  44 ++--
 lib/librte_hash/rte_cuckoo_hash_x86.h |  20 +-
 3 files changed, 221 insertions(+), 298 deletions(-)

-- 
2.7.4


* [PATCH v2 1/4] hash: reorder hash structure
  2016-09-02 22:56 ` [PATCH v2 0/4] Cuckoo hash lookup enhancements Pablo de Lara
@ 2016-09-02 22:56   ` Pablo de Lara
  2016-09-02 22:56   ` [PATCH v2 2/4] hash: reorganize bucket structure Pablo de Lara
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 37+ messages in thread
From: Pablo de Lara @ 2016-09-02 22:56 UTC (permalink / raw)
  To: dev; +Cc: bruce.richardson, Pablo de Lara

In order to optimize lookup performance, the hash structure
is reordered so that all fields used for lookup are
in the first cache line.

Signed-off-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>
---
 lib/librte_hash/rte_cuckoo_hash.h | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/lib/librte_hash/rte_cuckoo_hash.h b/lib/librte_hash/rte_cuckoo_hash.h
index e290dab..701531a 100644
--- a/lib/librte_hash/rte_cuckoo_hash.h
+++ b/lib/librte_hash/rte_cuckoo_hash.h
@@ -182,9 +182,7 @@ struct rte_hash_bucket {
 
 /** A hash table structure. */
 struct rte_hash {
-	char name[RTE_HASH_NAMESIZE];   /**< Name of the hash. */
-	uint32_t entries;               /**< Total table entries. */
-	uint32_t num_buckets;           /**< Number of buckets in table. */
+	/* first cache line - fields used in lookup */
 	uint32_t key_len;               /**< Length of hash key. */
 	rte_hash_function hash_func;    /**< Function used to calculate hash. */
 	uint32_t hash_func_init_val;    /**< Init value used by hash_func. */
@@ -196,12 +194,13 @@ struct rte_hash {
 						from hash signature. */
 	uint32_t key_entry_size;         /**< Size of each key entry. */
 
-	struct rte_ring *free_slots;    /**< Ring that stores all indexes
-						of the free slots in the key table */
 	void *key_store;                /**< Table storing all keys and data */
 	struct rte_hash_bucket *buckets;	/**< Table with buckets storing all the
 							hash values and key indexes
 							to the key table*/
+
+	struct rte_ring *free_slots;    /**< Ring that stores all indexes
+						of the free slots in the key table */
 	uint8_t hw_trans_mem_support;	/**< Hardware transactional
 							memory support */
 	struct lcore_cache *local_free_slots;
@@ -209,6 +208,9 @@ struct rte_hash {
 	enum add_key_case add_key; /**< Multi-writer hash add behavior */
 
 	rte_spinlock_t *multiwriter_lock; /**< Multi-writer spinlock for w/o TM */
+	char name[RTE_HASH_NAMESIZE];   /**< Name of the hash. */
+	uint32_t entries;               /**< Total table entries. */
+	uint32_t num_buckets;           /**< Number of buckets in table. */
 } __rte_cache_aligned;
 
 struct queue_node {
-- 
2.7.4


* [PATCH v2 2/4] hash: reorganize bucket structure
  2016-09-02 22:56 ` [PATCH v2 0/4] Cuckoo hash lookup enhancements Pablo de Lara
  2016-09-02 22:56   ` [PATCH v2 1/4] hash: reorder hash structure Pablo de Lara
@ 2016-09-02 22:56   ` Pablo de Lara
  2016-09-02 22:56   ` [PATCH v2 3/4] hash: add vectorized comparison Pablo de Lara
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 37+ messages in thread
From: Pablo de Lara @ 2016-09-02 22:56 UTC (permalink / raw)
  To: dev; +Cc: bruce.richardson, Byron Marohn, Saikrishna Edupuganti

From: Byron Marohn <byron.marohn@intel.com>

Store the current signatures of all entries together in the bucket,
and likewise all the alternative signatures, instead of having
the current and alternative signatures interleaved per entry.
This will be beneficial in the next commits, where a vectorized
comparison will be performed, achieving better performance.

Signed-off-by: Byron Marohn <byron.marohn@intel.com>
Signed-off-by: Saikrishna Edupuganti <saikrishna.edupuganti@intel.com>
---
 lib/librte_hash/rte_cuckoo_hash.c     | 43 ++++++++++++++++++-----------------
 lib/librte_hash/rte_cuckoo_hash.h     | 17 ++++----------
 lib/librte_hash/rte_cuckoo_hash_x86.h | 20 ++++++++--------
 3 files changed, 37 insertions(+), 43 deletions(-)

diff --git a/lib/librte_hash/rte_cuckoo_hash.c b/lib/librte_hash/rte_cuckoo_hash.c
index dd0290f..9d507b6 100644
--- a/lib/librte_hash/rte_cuckoo_hash.c
+++ b/lib/librte_hash/rte_cuckoo_hash.c
@@ -420,7 +420,7 @@ make_space_bucket(const struct rte_hash *h, struct rte_hash_bucket *bkt)
 	 */
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
 		/* Search for space in alternative locations */
-		next_bucket_idx = bkt->signatures[i].alt & h->bucket_bitmask;
+		next_bucket_idx = bkt->sig_alt[i] & h->bucket_bitmask;
 		next_bkt[i] = &h->buckets[next_bucket_idx];
 		for (j = 0; j < RTE_HASH_BUCKET_ENTRIES; j++) {
 			if (next_bkt[i]->key_idx[j] == EMPTY_SLOT)
@@ -433,8 +433,8 @@ make_space_bucket(const struct rte_hash *h, struct rte_hash_bucket *bkt)
 
 	/* Alternative location has spare room (end of recursive function) */
 	if (i != RTE_HASH_BUCKET_ENTRIES) {
-		next_bkt[i]->signatures[j].alt = bkt->signatures[i].current;
-		next_bkt[i]->signatures[j].current = bkt->signatures[i].alt;
+		next_bkt[i]->sig_alt[j] = bkt->sig_current[i];
+		next_bkt[i]->sig_current[j] = bkt->sig_alt[i];
 		next_bkt[i]->key_idx[j] = bkt->key_idx[i];
 		return i;
 	}
@@ -460,8 +460,8 @@ make_space_bucket(const struct rte_hash *h, struct rte_hash_bucket *bkt)
 	 */
 	bkt->flag[i] = 0;
 	if (ret >= 0) {
-		next_bkt[i]->signatures[ret].alt = bkt->signatures[i].current;
-		next_bkt[i]->signatures[ret].current = bkt->signatures[i].alt;
+		next_bkt[i]->sig_alt[ret] = bkt->sig_current[i];
+		next_bkt[i]->sig_current[ret] = bkt->sig_alt[i];
 		next_bkt[i]->key_idx[ret] = bkt->key_idx[i];
 		return i;
 	} else
@@ -543,8 +543,8 @@ __rte_hash_add_key_with_hash(const struct rte_hash *h, const void *key,
 
 	/* Check if key is already inserted in primary location */
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		if (prim_bkt->signatures[i].current == sig &&
-				prim_bkt->signatures[i].alt == alt_hash) {
+		if (prim_bkt->sig_current[i] == sig &&
+				prim_bkt->sig_alt[i] == alt_hash) {
 			k = (struct rte_hash_key *) ((char *)keys +
 					prim_bkt->key_idx[i] * h->key_entry_size);
 			if (rte_hash_cmp_eq(key, k->key, h) == 0) {
@@ -563,8 +563,8 @@ __rte_hash_add_key_with_hash(const struct rte_hash *h, const void *key,
 
 	/* Check if key is already inserted in secondary location */
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		if (sec_bkt->signatures[i].alt == sig &&
-				sec_bkt->signatures[i].current == alt_hash) {
+		if (sec_bkt->sig_alt[i] == sig &&
+				sec_bkt->sig_current[i] == alt_hash) {
 			k = (struct rte_hash_key *) ((char *)keys +
 					sec_bkt->key_idx[i] * h->key_entry_size);
 			if (rte_hash_cmp_eq(key, k->key, h) == 0) {
@@ -610,8 +610,8 @@ __rte_hash_add_key_with_hash(const struct rte_hash *h, const void *key,
 		for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
 			/* Check if slot is available */
 			if (likely(prim_bkt->key_idx[i] == EMPTY_SLOT)) {
-				prim_bkt->signatures[i].current = sig;
-				prim_bkt->signatures[i].alt = alt_hash;
+				prim_bkt->sig_current[i] = sig;
+				prim_bkt->sig_alt[i] = alt_hash;
 				prim_bkt->key_idx[i] = new_idx;
 				break;
 			}
@@ -631,8 +631,8 @@ __rte_hash_add_key_with_hash(const struct rte_hash *h, const void *key,
 		 */
 		ret = make_space_bucket(h, prim_bkt);
 		if (ret >= 0) {
-			prim_bkt->signatures[ret].current = sig;
-			prim_bkt->signatures[ret].alt = alt_hash;
+			prim_bkt->sig_current[ret] = sig;
+			prim_bkt->sig_alt[ret] = alt_hash;
 			prim_bkt->key_idx[ret] = new_idx;
 			if (h->add_key == ADD_KEY_MULTIWRITER)
 				rte_spinlock_unlock(h->multiwriter_lock);
@@ -706,7 +706,7 @@ __rte_hash_lookup_with_hash(const struct rte_hash *h, const void *key,
 
 	/* Check if key is in primary location */
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		if (bkt->signatures[i].current == sig &&
+		if (bkt->sig_current[i] == sig &&
 				bkt->key_idx[i] != EMPTY_SLOT) {
 			k = (struct rte_hash_key *) ((char *)keys +
 					bkt->key_idx[i] * h->key_entry_size);
@@ -729,8 +729,8 @@ __rte_hash_lookup_with_hash(const struct rte_hash *h, const void *key,
 
 	/* Check if key is in secondary location */
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		if (bkt->signatures[i].current == alt_hash &&
-				bkt->signatures[i].alt == sig) {
+		if (bkt->sig_current[i] == alt_hash &&
+				bkt->sig_alt[i] == sig) {
 			k = (struct rte_hash_key *) ((char *)keys +
 					bkt->key_idx[i] * h->key_entry_size);
 			if (rte_hash_cmp_eq(key, k->key, h) == 0) {
@@ -784,7 +784,8 @@ remove_entry(const struct rte_hash *h, struct rte_hash_bucket *bkt, unsigned i)
 	unsigned lcore_id, n_slots;
 	struct lcore_cache *cached_free_slots;
 
-	bkt->signatures[i].sig = NULL_SIGNATURE;
+	bkt->sig_current[i] = NULL_SIGNATURE;
+	bkt->sig_alt[i] = NULL_SIGNATURE;
 	if (h->hw_trans_mem_support) {
 		lcore_id = rte_lcore_id();
 		cached_free_slots = &h->local_free_slots[lcore_id];
@@ -822,7 +823,7 @@ __rte_hash_del_key_with_hash(const struct rte_hash *h, const void *key,
 
 	/* Check if key is in primary location */
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		if (bkt->signatures[i].current == sig &&
+		if (bkt->sig_current[i] == sig &&
 				bkt->key_idx[i] != EMPTY_SLOT) {
 			k = (struct rte_hash_key *) ((char *)keys +
 					bkt->key_idx[i] * h->key_entry_size);
@@ -847,7 +848,7 @@ __rte_hash_del_key_with_hash(const struct rte_hash *h, const void *key,
 
 	/* Check if key is in secondary location */
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		if (bkt->signatures[i].current == alt_hash &&
+		if (bkt->sig_current[i] == alt_hash &&
 				bkt->key_idx[i] != EMPTY_SLOT) {
 			k = (struct rte_hash_key *) ((char *)keys +
 					bkt->key_idx[i] * h->key_entry_size);
@@ -956,8 +957,8 @@ lookup_stage2(unsigned idx, hash_sig_t prim_hash, hash_sig_t sec_hash,
 	prim_hash_matches = 1 << RTE_HASH_BUCKET_ENTRIES;
 	sec_hash_matches = 1 << RTE_HASH_BUCKET_ENTRIES;
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		prim_hash_matches |= ((prim_hash == prim_bkt->signatures[i].current) << i);
-		sec_hash_matches |= ((sec_hash == sec_bkt->signatures[i].current) << i);
+		prim_hash_matches |= ((prim_hash == prim_bkt->sig_current[i]) << i);
+		sec_hash_matches |= ((sec_hash == sec_bkt->sig_current[i]) << i);
 	}
 
 	key_idx = prim_bkt->key_idx[__builtin_ctzl(prim_hash_matches)];
diff --git a/lib/librte_hash/rte_cuckoo_hash.h b/lib/librte_hash/rte_cuckoo_hash.h
index 701531a..86471f7 100644
--- a/lib/librte_hash/rte_cuckoo_hash.h
+++ b/lib/librte_hash/rte_cuckoo_hash.h
@@ -151,17 +151,6 @@ struct lcore_cache {
 	void *objs[LCORE_CACHE_SIZE]; /**< Cache objects */
 } __rte_cache_aligned;
 
-/* Structure storing both primary and secondary hashes */
-struct rte_hash_signatures {
-	union {
-		struct {
-			hash_sig_t current;
-			hash_sig_t alt;
-		};
-		uint64_t sig;
-	};
-};
-
 /* Structure that stores key-value pair */
 struct rte_hash_key {
 	union {
@@ -174,10 +163,14 @@ struct rte_hash_key {
 
 /** Bucket structure */
 struct rte_hash_bucket {
-	struct rte_hash_signatures signatures[RTE_HASH_BUCKET_ENTRIES];
+	hash_sig_t sig_current[RTE_HASH_BUCKET_ENTRIES];
+
 	/* Includes dummy key index that always contains index 0 */
 	uint32_t key_idx[RTE_HASH_BUCKET_ENTRIES + 1];
+
 	uint8_t flag[RTE_HASH_BUCKET_ENTRIES];
+
+	hash_sig_t sig_alt[RTE_HASH_BUCKET_ENTRIES];
 } __rte_cache_aligned;
 
 /** A hash table structure. */
diff --git a/lib/librte_hash/rte_cuckoo_hash_x86.h b/lib/librte_hash/rte_cuckoo_hash_x86.h
index e16d69c..494c160 100644
--- a/lib/librte_hash/rte_cuckoo_hash_x86.h
+++ b/lib/librte_hash/rte_cuckoo_hash_x86.h
@@ -54,8 +54,8 @@ rte_hash_cuckoo_insert_mw_tm(struct rte_hash_bucket *prim_bkt,
 			for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
 				/* Check if slot is available */
 				if (likely(prim_bkt->key_idx[i] == EMPTY_SLOT)) {
-					prim_bkt->signatures[i].current = sig;
-					prim_bkt->signatures[i].alt = alt_hash;
+					prim_bkt->sig_current[i] = sig;
+					prim_bkt->sig_alt[i] = alt_hash;
 					prim_bkt->key_idx[i] = new_idx;
 					break;
 				}
@@ -101,7 +101,7 @@ rte_hash_cuckoo_move_insert_mw_tm(const struct rte_hash *h,
 				prev_slot = curr_node->prev_slot;
 
 				prev_alt_bkt_idx
-					= prev_bkt->signatures[prev_slot].alt
+					= prev_bkt->sig_alt[prev_slot]
 					    & h->bucket_bitmask;
 
 				if (unlikely(&h->buckets[prev_alt_bkt_idx]
@@ -113,10 +113,10 @@ rte_hash_cuckoo_move_insert_mw_tm(const struct rte_hash *h,
 				 * Cuckoo insert to move elements back to its
 				 * primary bucket if available
 				 */
-				curr_bkt->signatures[curr_slot].alt =
-				    prev_bkt->signatures[prev_slot].current;
-				curr_bkt->signatures[curr_slot].current =
-				    prev_bkt->signatures[prev_slot].alt;
+				curr_bkt->sig_alt[curr_slot] =
+				    prev_bkt->sig_current[prev_slot];
+				curr_bkt->sig_current[curr_slot] =
+				    prev_bkt->sig_alt[prev_slot];
 				curr_bkt->key_idx[curr_slot]
 				    = prev_bkt->key_idx[prev_slot];
 
@@ -125,8 +125,8 @@ rte_hash_cuckoo_move_insert_mw_tm(const struct rte_hash *h,
 				curr_bkt = curr_node->bkt;
 			}
 
-			curr_bkt->signatures[curr_slot].current = sig;
-			curr_bkt->signatures[curr_slot].alt = alt_hash;
+			curr_bkt->sig_current[curr_slot] = sig;
+			curr_bkt->sig_alt[curr_slot] = alt_hash;
 			curr_bkt->key_idx[curr_slot] = new_idx;
 
 			rte_xend();
@@ -178,7 +178,7 @@ rte_hash_cuckoo_make_space_mw_tm(const struct rte_hash *h,
 			}
 
 			/* Enqueue new node and keep prev node info */
-			alt_bkt = &(h->buckets[curr_bkt->signatures[i].alt
+			alt_bkt = &(h->buckets[curr_bkt->sig_alt[i]
 						    & h->bucket_bitmask]);
 			head->bkt = alt_bkt;
 			head->prev = tail;
-- 
2.7.4


* [PATCH v2 3/4] hash: add vectorized comparison
  2016-09-02 22:56 ` [PATCH v2 0/4] Cuckoo hash lookup enhancements Pablo de Lara
  2016-09-02 22:56   ` [PATCH v2 1/4] hash: reorder hash structure Pablo de Lara
  2016-09-02 22:56   ` [PATCH v2 2/4] hash: reorganize bucket structure Pablo de Lara
@ 2016-09-02 22:56   ` Pablo de Lara
  2016-09-02 22:56   ` [PATCH v2 4/4] hash: modify lookup bulk pipeline Pablo de Lara
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 37+ messages in thread
From: Pablo de Lara @ 2016-09-02 22:56 UTC (permalink / raw)
  To: dev; +Cc: bruce.richardson, Byron Marohn, Saikrishna Edupuganti, Pablo de Lara

From: Byron Marohn <byron.marohn@intel.com>

In lookup bulk function, the signatures of all entries
are compared against the signature of the key that is being looked up.
Now that all the signatures are together, they can be compared
with vector instructions (SSE, AVX2), achieving higher lookup performance.

Also, the number of entries per bucket is increased to 8 when using
processors with AVX2, as 256 bits, the size of 8x32-bit signatures,
can then be compared at once.

Signed-off-by: Byron Marohn <byron.marohn@intel.com>
Signed-off-by: Saikrishna Edupuganti <saikrishna.edupuganti@intel.com>
Signed-off-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>
---
 lib/librte_hash/rte_cuckoo_hash.c | 73 ++++++++++++++++++++++++++++++++++++---
 lib/librte_hash/rte_cuckoo_hash.h | 12 ++++++-
 2 files changed, 79 insertions(+), 6 deletions(-)

diff --git a/lib/librte_hash/rte_cuckoo_hash.c b/lib/librte_hash/rte_cuckoo_hash.c
index 9d507b6..eab28a1 100644
--- a/lib/librte_hash/rte_cuckoo_hash.c
+++ b/lib/librte_hash/rte_cuckoo_hash.c
@@ -283,6 +283,15 @@ rte_hash_create(const struct rte_hash_parameters *params)
 	h->free_slots = r;
 	h->hw_trans_mem_support = hw_trans_mem_support;
 
+#if defined(RTE_ARCH_X86)
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
+		h->sig_cmp_fn = RTE_HASH_COMPARE_AVX2;
+	else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE2))
+		h->sig_cmp_fn = RTE_HASH_COMPARE_SSE;
+	else
+#endif
+		h->sig_cmp_fn = RTE_HASH_COMPARE_SCALAR;
+
 	/* Turn on multi-writer only with explicit flat from user and TM
 	 * support.
 	 */
@@ -939,6 +948,61 @@ lookup_stage1(unsigned idx, hash_sig_t *prim_hash, hash_sig_t *sec_hash,
 	rte_prefetch0(*secondary_bkt);
 }
 
+static inline void
+compare_signatures(unsigned *prim_hash_matches, unsigned *sec_hash_matches,
+				const struct rte_hash_bucket *prim_bkt,
+				const struct rte_hash_bucket *sec_bkt,
+				hash_sig_t prim_hash, hash_sig_t sec_hash,
+				enum rte_hash_sig_compare_function sig_cmp_fn)
+{
+	unsigned i;
+
+	switch (sig_cmp_fn) {
+#ifdef RTE_MACHINE_CPUFLAG_AVX2
+	case RTE_HASH_COMPARE_AVX2:
+		*prim_hash_matches |= _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
+				_mm256_load_si256(
+					(__m256i const *)prim_bkt->sig_current),
+				_mm256_set1_epi32(prim_hash)));
+		*sec_hash_matches |= _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
+				_mm256_load_si256(
+					(__m256i const *)sec_bkt->sig_current),
+				_mm256_set1_epi32(sec_hash)));
+		break;
+#endif
+#ifdef RTE_MACHINE_CPUFLAG_SSE2
+	case RTE_HASH_COMPARE_SSE:
+		/* Compare the first 4 signatures in the bucket */
> +		*prim_hash_matches |= _mm_movemask_ps((__m128)_mm_cmpeq_epi32(
+				_mm_load_si128(
+					(__m128i const *)prim_bkt->sig_current),
+				_mm_set1_epi32(prim_hash)));
> +		*prim_hash_matches |= (_mm_movemask_ps((__m128)_mm_cmpeq_epi32(
+				_mm_load_si128(
+					(__m128i const *)&prim_bkt->sig_current[4]),
+				_mm_set1_epi32(prim_hash)))) << 4;
> +		/* Compare the 4 signatures in the secondary bucket */
> +		*sec_hash_matches |= _mm_movemask_ps((__m128)_mm_cmpeq_epi32(
+				_mm_load_si128(
+					(__m128i const *)sec_bkt->sig_current),
+				_mm_set1_epi32(sec_hash)));
> +		*sec_hash_matches |= (_mm_movemask_ps((__m128)_mm_cmpeq_epi32(
+				_mm_load_si128(
+					(__m128i const *)&sec_bkt->sig_current[4]),
+				_mm_set1_epi32(sec_hash)))) << 4;
+		break;
+#endif
+	default:
+		for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
+			*prim_hash_matches |=
+				((prim_hash == prim_bkt->sig_current[i]) << i);
+			*sec_hash_matches |=
+				((sec_hash == sec_bkt->sig_current[i]) << i);
+		}
+	}
+
+}
+
 /*
  * Lookup bulk stage 2:  Search for match hashes in primary/secondary locations
  * and prefetch first key slot
@@ -951,15 +1015,14 @@ lookup_stage2(unsigned idx, hash_sig_t prim_hash, hash_sig_t sec_hash,
 		uint64_t *extra_hits_mask, const void *keys,
 		const struct rte_hash *h)
 {
-	unsigned prim_hash_matches, sec_hash_matches, key_idx, i;
+	unsigned prim_hash_matches, sec_hash_matches, key_idx;
 	unsigned total_hash_matches;
 
 	prim_hash_matches = 1 << RTE_HASH_BUCKET_ENTRIES;
 	sec_hash_matches = 1 << RTE_HASH_BUCKET_ENTRIES;
-	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		prim_hash_matches |= ((prim_hash == prim_bkt->sig_current[i]) << i);
-		sec_hash_matches |= ((sec_hash == sec_bkt->sig_current[i]) << i);
-	}
+
+	compare_signatures(&prim_hash_matches, &sec_hash_matches, prim_bkt,
+				sec_bkt, prim_hash, sec_hash, h->sig_cmp_fn);
 
 	key_idx = prim_bkt->key_idx[__builtin_ctzl(prim_hash_matches)];
 	if (key_idx == 0)
diff --git a/lib/librte_hash/rte_cuckoo_hash.h b/lib/librte_hash/rte_cuckoo_hash.h
index 86471f7..8ffc146 100644
--- a/lib/librte_hash/rte_cuckoo_hash.h
+++ b/lib/librte_hash/rte_cuckoo_hash.h
@@ -130,7 +130,7 @@ enum add_key_case {
 };
 
 /** Number of items per bucket. */
-#define RTE_HASH_BUCKET_ENTRIES		4
+#define RTE_HASH_BUCKET_ENTRIES		8
 
 #define NULL_SIGNATURE			0
 
@@ -161,6 +161,14 @@ struct rte_hash_key {
 	char key[0];
 } __attribute__((aligned(KEY_ALIGNMENT)));
 
+/* All different signature compare functions */
+enum rte_hash_sig_compare_function {
+	RTE_HASH_COMPARE_SCALAR = 0,
+	RTE_HASH_COMPARE_SSE,
+	RTE_HASH_COMPARE_AVX2,
+	RTE_HASH_COMPARE_NUM
+};
+
 /** Bucket structure */
 struct rte_hash_bucket {
 	hash_sig_t sig_current[RTE_HASH_BUCKET_ENTRIES];
@@ -183,6 +191,8 @@ struct rte_hash {
 	/**< Custom function used to compare keys. */
 	enum cmp_jump_table_case cmp_jump_table_idx;
 	/**< Indicates which compare function to use. */
+	enum rte_hash_sig_compare_function sig_cmp_fn;
+	/**< Indicates which signature compare function to use. */
 	uint32_t bucket_bitmask;        /**< Bitmask for getting bucket index
 						from hash signature. */
 	uint32_t key_entry_size;         /**< Size of each key entry. */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v2 4/4] hash: modify lookup bulk pipeline
  2016-09-02 22:56 ` [PATCH v2 0/4] Cuckoo hash lookup enhancements Pablo de Lara
                     ` (2 preceding siblings ...)
  2016-09-02 22:56   ` [PATCH v2 3/4] hash: add vectorized comparison Pablo de Lara
@ 2016-09-02 22:56   ` Pablo de Lara
  2016-09-06 19:33   ` [PATCH v3 0/4] Cuckoo hash lookup enhancements Pablo de Lara
  2016-09-06 19:34   ` [PATCH v3 0/4] Cuckoo hash lookup enhancements Pablo de Lara
  5 siblings, 0 replies; 37+ messages in thread
From: Pablo de Lara @ 2016-09-02 22:56 UTC (permalink / raw)
  To: dev; +Cc: bruce.richardson, Byron Marohn, Saikrishna Edupuganti, Pablo de Lara

From: Byron Marohn <byron.marohn@intel.com>

This patch replaces the pipelined rte_hash lookup mechanism with a
loop-and-jump model, which performs significantly better,
especially for smaller table sizes and smaller table occupancies.

Signed-off-by: Byron Marohn <byron.marohn@intel.com>
Signed-off-by: Saikrishna Edupuganti <saikrishna.edupuganti@intel.com>
Signed-off-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>
---
 lib/librte_hash/rte_cuckoo_hash.c | 377 ++++++++++++--------------------------
 lib/librte_hash/rte_cuckoo_hash.h |   3 +-
 2 files changed, 117 insertions(+), 263 deletions(-)

diff --git a/lib/librte_hash/rte_cuckoo_hash.c b/lib/librte_hash/rte_cuckoo_hash.c
index eab28a1..47b5beb 100644
--- a/lib/librte_hash/rte_cuckoo_hash.c
+++ b/lib/librte_hash/rte_cuckoo_hash.c
@@ -913,43 +913,8 @@ rte_hash_get_key_with_position(const struct rte_hash *h, const int32_t position,
 	return 0;
 }
 
-/* Lookup bulk stage 0: Prefetch input key */
 static inline void
-lookup_stage0(unsigned *idx, uint64_t *lookup_mask,
-		const void * const *keys)
-{
-	*idx = __builtin_ctzl(*lookup_mask);
-	if (*lookup_mask == 0)
-		*idx = 0;
-
-	rte_prefetch0(keys[*idx]);
-	*lookup_mask &= ~(1llu << *idx);
-}
-
-/*
- * Lookup bulk stage 1: Calculate primary/secondary hashes
- * and prefetch primary/secondary buckets
- */
-static inline void
-lookup_stage1(unsigned idx, hash_sig_t *prim_hash, hash_sig_t *sec_hash,
-		const struct rte_hash_bucket **primary_bkt,
-		const struct rte_hash_bucket **secondary_bkt,
-		hash_sig_t *hash_vals, const void * const *keys,
-		const struct rte_hash *h)
-{
-	*prim_hash = rte_hash_hash(h, keys[idx]);
-	hash_vals[idx] = *prim_hash;
-	*sec_hash = rte_hash_secondary_hash(*prim_hash);
-
-	*primary_bkt = &h->buckets[*prim_hash & h->bucket_bitmask];
-	*secondary_bkt = &h->buckets[*sec_hash & h->bucket_bitmask];
-
-	rte_prefetch0(*primary_bkt);
-	rte_prefetch0(*secondary_bkt);
-}
-
-static inline void
-compare_signatures(unsigned *prim_hash_matches, unsigned *sec_hash_matches,
+compare_signatures(uint32_t *prim_hash_matches, uint32_t *sec_hash_matches,
 				const struct rte_hash_bucket *prim_bkt,
 				const struct rte_hash_bucket *sec_bkt,
 				hash_sig_t prim_hash, hash_sig_t sec_hash,
@@ -960,11 +925,11 @@ compare_signatures(unsigned *prim_hash_matches, unsigned *sec_hash_matches,
 	switch (sig_cmp_fn) {
 #ifdef RTE_MACHINE_CPUFLAG_AVX2
 	case RTE_HASH_COMPARE_AVX2:
-		*prim_hash_matches |= _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
+		*prim_hash_matches = _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
 				_mm256_load_si256(
 					(__m256i const *)prim_bkt->sig_current),
 				_mm256_set1_epi32(prim_hash)));
-		*sec_hash_matches |= _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
+		*sec_hash_matches = _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
 				_mm256_load_si256(
 					(__m256i const *)sec_bkt->sig_current),
 				_mm256_set1_epi32(sec_hash)));
@@ -973,7 +938,7 @@ compare_signatures(unsigned *prim_hash_matches, unsigned *sec_hash_matches,
 #ifdef RTE_MACHINE_CPUFLAG_SSE2
 	case RTE_HASH_COMPARE_SSE:
 		/* Compare the first 4 signatures in the bucket */
-		*prim_hash_matches |= _mm_movemask_ps((__m128)_mm_cmpeq_epi32(
+		*prim_hash_matches = _mm_movemask_ps((__m128)_mm_cmpeq_epi32(
 				_mm_load_si128(
 					(__m128i const *)prim_bkt->sig_current),
 				_mm_set1_epi32(prim_hash)));
@@ -982,7 +947,7 @@ compare_signatures(unsigned *prim_hash_matches, unsigned *sec_hash_matches,
 					(__m128i const *)&prim_bkt->sig_current[4]),
 				_mm_set1_epi32(prim_hash)))) << 4;
 		/* Compare the first 4 signatures in the bucket */
-		*sec_hash_matches |= _mm_movemask_ps((__m128)_mm_cmpeq_epi32(
+		*sec_hash_matches = _mm_movemask_ps((__m128)_mm_cmpeq_epi32(
 				_mm_load_si128(
 					(__m128i const *)sec_bkt->sig_current),
 				_mm_set1_epi32(sec_hash)));
@@ -1003,244 +968,134 @@ compare_signatures(unsigned *prim_hash_matches, unsigned *sec_hash_matches,
 
 }
 
-/*
- * Lookup bulk stage 2:  Search for match hashes in primary/secondary locations
- * and prefetch first key slot
- */
+#define PREFETCH_OFFSET 4
 static inline void
-lookup_stage2(unsigned idx, hash_sig_t prim_hash, hash_sig_t sec_hash,
-		const struct rte_hash_bucket *prim_bkt,
-		const struct rte_hash_bucket *sec_bkt,
-		const struct rte_hash_key **key_slot, int32_t *positions,
-		uint64_t *extra_hits_mask, const void *keys,
-		const struct rte_hash *h)
+__rte_hash_lookup_bulk(const struct rte_hash *h, const void **keys,
+			int32_t num_keys, int32_t *positions,
+			uint64_t *hit_mask, void *data[])
 {
-	unsigned prim_hash_matches, sec_hash_matches, key_idx;
-	unsigned total_hash_matches;
+	uint64_t hits = 0;
+	int32_t i;
+	uint32_t prim_hash[RTE_HASH_LOOKUP_BULK_MAX];
+	uint32_t sec_hash[RTE_HASH_LOOKUP_BULK_MAX];
+	const struct rte_hash_bucket *primary_bkt[RTE_HASH_LOOKUP_BULK_MAX];
+	const struct rte_hash_bucket *secondary_bkt[RTE_HASH_LOOKUP_BULK_MAX];
+	uint32_t prim_hitmask[RTE_HASH_LOOKUP_BULK_MAX] = {0};
+	uint32_t sec_hitmask[RTE_HASH_LOOKUP_BULK_MAX] = {0};
+
+	/* Prefetch first keys */
+	for (i = 0; i < PREFETCH_OFFSET && i < num_keys; i++)
+		rte_prefetch0(keys[i]);
 
-	prim_hash_matches = 1 << RTE_HASH_BUCKET_ENTRIES;
-	sec_hash_matches = 1 << RTE_HASH_BUCKET_ENTRIES;
+	/*
+	 * Prefetch rest of the keys, calculate primary and
+	 * secondary bucket and prefetch them
+	 */
+	for (i = 0; i < (num_keys - PREFETCH_OFFSET); i++) {
+		rte_prefetch0(keys[i + PREFETCH_OFFSET]);
 
-	compare_signatures(&prim_hash_matches, &sec_hash_matches, prim_bkt,
-				sec_bkt, prim_hash, sec_hash, h->sig_cmp_fn);
+		prim_hash[i] = rte_hash_hash(h, keys[i]);
+		sec_hash[i] = rte_hash_secondary_hash(prim_hash[i]);
 
-	key_idx = prim_bkt->key_idx[__builtin_ctzl(prim_hash_matches)];
-	if (key_idx == 0)
-		key_idx = sec_bkt->key_idx[__builtin_ctzl(sec_hash_matches)];
+		primary_bkt[i] = &h->buckets[prim_hash[i] & h->bucket_bitmask];
+		secondary_bkt[i] = &h->buckets[sec_hash[i] & h->bucket_bitmask];
 
-	total_hash_matches = (prim_hash_matches |
-				(sec_hash_matches << (RTE_HASH_BUCKET_ENTRIES + 1)));
-	*key_slot = (const struct rte_hash_key *) ((const char *)keys +
-					key_idx * h->key_entry_size);
+		rte_prefetch0(primary_bkt[i]);
+		rte_prefetch0(secondary_bkt[i]);
+	}
 
-	rte_prefetch0(*key_slot);
-	/*
-	 * Return index where key is stored,
-	 * substracting the first dummy index
-	 */
-	positions[idx] = (key_idx - 1);
+	/* Calculate and prefetch rest of the buckets */
+	for (; i < num_keys; i++) {
+		prim_hash[i] = rte_hash_hash(h, keys[i]);
+		sec_hash[i] = rte_hash_secondary_hash(prim_hash[i]);
 
-	*extra_hits_mask |= (uint64_t)(__builtin_popcount(total_hash_matches) > 3) << idx;
+		primary_bkt[i] = &h->buckets[prim_hash[i] & h->bucket_bitmask];
+		secondary_bkt[i] = &h->buckets[sec_hash[i] & h->bucket_bitmask];
 
-}
+		rte_prefetch0(primary_bkt[i]);
+		rte_prefetch0(secondary_bkt[i]);
+	}
 
+	/* Compare signatures and prefetch key slot of first hit */
+	for (i = 0; i < num_keys; i++) {
+		compare_signatures(&prim_hitmask[i], &sec_hitmask[i],
+				primary_bkt[i], secondary_bkt[i],
+				prim_hash[i], sec_hash[i], h->sig_cmp_fn);
+
+		if (prim_hitmask[i]) {
+			uint32_t first_hit = __builtin_ctzl(prim_hitmask[i]);
+			uint32_t key_idx = primary_bkt[i]->key_idx[first_hit];
+			const struct rte_hash_key *key_slot =
+				(const struct rte_hash_key *)(
+				(const char *)h->key_store +
+				key_idx * h->key_entry_size);
+			rte_prefetch0(key_slot);
+			continue;
+		}
 
-/* Lookup bulk stage 3: Check if key matches, update hit mask and return data */
-static inline void
-lookup_stage3(unsigned idx, const struct rte_hash_key *key_slot, const void * const *keys,
-		const int32_t *positions, void *data[], uint64_t *hits,
-		const struct rte_hash *h)
-{
-	unsigned hit;
-	unsigned key_idx;
+		if (sec_hitmask[i]) {
+			uint32_t first_hit = __builtin_ctzl(sec_hitmask[i]);
+			uint32_t key_idx = secondary_bkt[i]->key_idx[first_hit];
+			const struct rte_hash_key *key_slot =
+				(const struct rte_hash_key *)(
+				(const char *)h->key_store +
+				key_idx * h->key_entry_size);
+			rte_prefetch0(key_slot);
+		}
+	}
 
-	hit = !rte_hash_cmp_eq(key_slot->key, keys[idx], h);
-	if (data != NULL)
-		data[idx] = key_slot->pdata;
+	/* Compare keys, first hits in primary first */
+	for (i = 0; i < num_keys; i++) {
+		positions[i] = -ENOENT;
+		while (prim_hitmask[i]) {
+			uint32_t hit_index = __builtin_ctzl(prim_hitmask[i]);
+
+			uint32_t key_idx = primary_bkt[i]->key_idx[hit_index];
+			const struct rte_hash_key *key_slot =
+				(const struct rte_hash_key *)(
+				(const char *)h->key_store +
+				key_idx * h->key_entry_size);
+			/*
+			 * If key index is 0, do not compare key,
+			 * as it is checking the dummy slot
+			 */
+			if (!!key_idx & !rte_hash_cmp_eq(key_slot->key, keys[i], h)) {
+				if (data != NULL)
+					data[i] = key_slot->pdata;
 
-	key_idx = positions[idx] + 1;
-	/*
-	 * If key index is 0, force hit to be 0, in case key to be looked up
-	 * is all zero (as in the dummy slot), which would result in a wrong hit
-	 */
-	*hits |= (uint64_t)(hit && !!key_idx)  << idx;
-}
+				hits |= 1ULL << i;
+				positions[i] = key_idx - 1;
+				goto next_key;
+			}
+			prim_hitmask[i] &= ~(1 << (hit_index));
+		}
 
-static inline void
-__rte_hash_lookup_bulk(const struct rte_hash *h, const void **keys,
-			uint32_t num_keys, int32_t *positions,
-			uint64_t *hit_mask, void *data[])
-{
-	uint64_t hits = 0;
-	uint64_t extra_hits_mask = 0;
-	uint64_t lookup_mask, miss_mask;
-	unsigned idx;
-	const void *key_store = h->key_store;
-	int ret;
-	hash_sig_t hash_vals[RTE_HASH_LOOKUP_BULK_MAX];
-
-	unsigned idx00, idx01, idx10, idx11, idx20, idx21, idx30, idx31;
-	const struct rte_hash_bucket *primary_bkt10, *primary_bkt11;
-	const struct rte_hash_bucket *secondary_bkt10, *secondary_bkt11;
-	const struct rte_hash_bucket *primary_bkt20, *primary_bkt21;
-	const struct rte_hash_bucket *secondary_bkt20, *secondary_bkt21;
-	const struct rte_hash_key *k_slot20, *k_slot21, *k_slot30, *k_slot31;
-	hash_sig_t primary_hash10, primary_hash11;
-	hash_sig_t secondary_hash10, secondary_hash11;
-	hash_sig_t primary_hash20, primary_hash21;
-	hash_sig_t secondary_hash20, secondary_hash21;
-
-	lookup_mask = (uint64_t) -1 >> (64 - num_keys);
-	miss_mask = lookup_mask;
-
-	lookup_stage0(&idx00, &lookup_mask, keys);
-	lookup_stage0(&idx01, &lookup_mask, keys);
-
-	idx10 = idx00, idx11 = idx01;
-
-	lookup_stage0(&idx00, &lookup_mask, keys);
-	lookup_stage0(&idx01, &lookup_mask, keys);
-	lookup_stage1(idx10, &primary_hash10, &secondary_hash10,
-			&primary_bkt10, &secondary_bkt10, hash_vals, keys, h);
-	lookup_stage1(idx11, &primary_hash11, &secondary_hash11,
-			&primary_bkt11,	&secondary_bkt11, hash_vals, keys, h);
-
-	primary_bkt20 = primary_bkt10;
-	primary_bkt21 = primary_bkt11;
-	secondary_bkt20 = secondary_bkt10;
-	secondary_bkt21 = secondary_bkt11;
-	primary_hash20 = primary_hash10;
-	primary_hash21 = primary_hash11;
-	secondary_hash20 = secondary_hash10;
-	secondary_hash21 = secondary_hash11;
-	idx20 = idx10, idx21 = idx11;
-	idx10 = idx00, idx11 = idx01;
-
-	lookup_stage0(&idx00, &lookup_mask, keys);
-	lookup_stage0(&idx01, &lookup_mask, keys);
-	lookup_stage1(idx10, &primary_hash10, &secondary_hash10,
-			&primary_bkt10, &secondary_bkt10, hash_vals, keys, h);
-	lookup_stage1(idx11, &primary_hash11, &secondary_hash11,
-			&primary_bkt11,	&secondary_bkt11, hash_vals, keys, h);
-	lookup_stage2(idx20, primary_hash20, secondary_hash20, primary_bkt20,
-			secondary_bkt20, &k_slot20, positions, &extra_hits_mask,
-			key_store, h);
-	lookup_stage2(idx21, primary_hash21, secondary_hash21, primary_bkt21,
-			secondary_bkt21, &k_slot21, positions, &extra_hits_mask,
-			key_store, h);
-
-	while (lookup_mask) {
-		k_slot30 = k_slot20, k_slot31 = k_slot21;
-		idx30 = idx20, idx31 = idx21;
-		primary_bkt20 = primary_bkt10;
-		primary_bkt21 = primary_bkt11;
-		secondary_bkt20 = secondary_bkt10;
-		secondary_bkt21 = secondary_bkt11;
-		primary_hash20 = primary_hash10;
-		primary_hash21 = primary_hash11;
-		secondary_hash20 = secondary_hash10;
-		secondary_hash21 = secondary_hash11;
-		idx20 = idx10, idx21 = idx11;
-		idx10 = idx00, idx11 = idx01;
-
-		lookup_stage0(&idx00, &lookup_mask, keys);
-		lookup_stage0(&idx01, &lookup_mask, keys);
-		lookup_stage1(idx10, &primary_hash10, &secondary_hash10,
-			&primary_bkt10, &secondary_bkt10, hash_vals, keys, h);
-		lookup_stage1(idx11, &primary_hash11, &secondary_hash11,
-			&primary_bkt11,	&secondary_bkt11, hash_vals, keys, h);
-		lookup_stage2(idx20, primary_hash20, secondary_hash20,
-			primary_bkt20, secondary_bkt20, &k_slot20, positions,
-			&extra_hits_mask, key_store, h);
-		lookup_stage2(idx21, primary_hash21, secondary_hash21,
-			primary_bkt21, secondary_bkt21,	&k_slot21, positions,
-			&extra_hits_mask, key_store, h);
-		lookup_stage3(idx30, k_slot30, keys, positions, data, &hits, h);
-		lookup_stage3(idx31, k_slot31, keys, positions, data, &hits, h);
-	}
+		while (sec_hitmask[i]) {
+			uint32_t hit_index = __builtin_ctzl(sec_hitmask[i]);
+
+			uint32_t key_idx = secondary_bkt[i]->key_idx[hit_index];
+			const struct rte_hash_key *key_slot =
+				(const struct rte_hash_key *)(
+				(const char *)h->key_store +
+				key_idx * h->key_entry_size);
+			/*
+			 * If key index is 0, do not compare key,
+			 * as it is checking the dummy slot
+			 */
+
+			if (!!key_idx & !rte_hash_cmp_eq(key_slot->key, keys[i], h)) {
+				if (data != NULL)
+					data[i] = key_slot->pdata;
 
-	k_slot30 = k_slot20, k_slot31 = k_slot21;
-	idx30 = idx20, idx31 = idx21;
-	primary_bkt20 = primary_bkt10;
-	primary_bkt21 = primary_bkt11;
-	secondary_bkt20 = secondary_bkt10;
-	secondary_bkt21 = secondary_bkt11;
-	primary_hash20 = primary_hash10;
-	primary_hash21 = primary_hash11;
-	secondary_hash20 = secondary_hash10;
-	secondary_hash21 = secondary_hash11;
-	idx20 = idx10, idx21 = idx11;
-	idx10 = idx00, idx11 = idx01;
-
-	lookup_stage1(idx10, &primary_hash10, &secondary_hash10,
-		&primary_bkt10, &secondary_bkt10, hash_vals, keys, h);
-	lookup_stage1(idx11, &primary_hash11, &secondary_hash11,
-		&primary_bkt11,	&secondary_bkt11, hash_vals, keys, h);
-	lookup_stage2(idx20, primary_hash20, secondary_hash20, primary_bkt20,
-		secondary_bkt20, &k_slot20, positions, &extra_hits_mask,
-		key_store, h);
-	lookup_stage2(idx21, primary_hash21, secondary_hash21, primary_bkt21,
-		secondary_bkt21, &k_slot21, positions, &extra_hits_mask,
-		key_store, h);
-	lookup_stage3(idx30, k_slot30, keys, positions, data, &hits, h);
-	lookup_stage3(idx31, k_slot31, keys, positions, data, &hits, h);
-
-	k_slot30 = k_slot20, k_slot31 = k_slot21;
-	idx30 = idx20, idx31 = idx21;
-	primary_bkt20 = primary_bkt10;
-	primary_bkt21 = primary_bkt11;
-	secondary_bkt20 = secondary_bkt10;
-	secondary_bkt21 = secondary_bkt11;
-	primary_hash20 = primary_hash10;
-	primary_hash21 = primary_hash11;
-	secondary_hash20 = secondary_hash10;
-	secondary_hash21 = secondary_hash11;
-	idx20 = idx10, idx21 = idx11;
-
-	lookup_stage2(idx20, primary_hash20, secondary_hash20, primary_bkt20,
-		secondary_bkt20, &k_slot20, positions, &extra_hits_mask,
-		key_store, h);
-	lookup_stage2(idx21, primary_hash21, secondary_hash21, primary_bkt21,
-		secondary_bkt21, &k_slot21, positions, &extra_hits_mask,
-		key_store, h);
-	lookup_stage3(idx30, k_slot30, keys, positions, data, &hits, h);
-	lookup_stage3(idx31, k_slot31, keys, positions, data, &hits, h);
-
-	k_slot30 = k_slot20, k_slot31 = k_slot21;
-	idx30 = idx20, idx31 = idx21;
-
-	lookup_stage3(idx30, k_slot30, keys, positions, data, &hits, h);
-	lookup_stage3(idx31, k_slot31, keys, positions, data, &hits, h);
-
-	/* ignore any items we have already found */
-	extra_hits_mask &= ~hits;
-
-	if (unlikely(extra_hits_mask)) {
-		/* run a single search for each remaining item */
-		do {
-			idx = __builtin_ctzl(extra_hits_mask);
-			if (data != NULL) {
-				ret = rte_hash_lookup_with_hash_data(h,
-						keys[idx], hash_vals[idx], &data[idx]);
-				if (ret >= 0)
-					hits |= 1ULL << idx;
-			} else {
-				positions[idx] = rte_hash_lookup_with_hash(h,
-							keys[idx], hash_vals[idx]);
-				if (positions[idx] >= 0)
-					hits |= 1llu << idx;
+				hits |= 1ULL << i;
+				positions[i] = key_idx - 1;
+				goto next_key;
 			}
-			extra_hits_mask &= ~(1llu << idx);
-		} while (extra_hits_mask);
-	}
+			sec_hitmask[i] &= ~(1 << (hit_index));
+		}
 
-	miss_mask &= ~hits;
-	if (unlikely(miss_mask)) {
-		do {
-			idx = __builtin_ctzl(miss_mask);
-			positions[idx] = -ENOENT;
-			miss_mask &= ~(1llu << idx);
-		} while (miss_mask);
+next_key:
+		continue;
 	}
 
 	if (hit_mask != NULL)
diff --git a/lib/librte_hash/rte_cuckoo_hash.h b/lib/librte_hash/rte_cuckoo_hash.h
index 8ffc146..986596f 100644
--- a/lib/librte_hash/rte_cuckoo_hash.h
+++ b/lib/librte_hash/rte_cuckoo_hash.h
@@ -173,8 +173,7 @@ enum rte_hash_sig_compare_function {
 struct rte_hash_bucket {
 	hash_sig_t sig_current[RTE_HASH_BUCKET_ENTRIES];
 
-	/* Includes dummy key index that always contains index 0 */
-	uint32_t key_idx[RTE_HASH_BUCKET_ENTRIES + 1];
+	uint32_t key_idx[RTE_HASH_BUCKET_ENTRIES];
 
 	uint8_t flag[RTE_HASH_BUCKET_ENTRIES];
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v3 0/4] Cuckoo hash lookup enhancements
  2016-09-02 22:56 ` [PATCH v2 0/4] Cuckoo hash lookup enhancements Pablo de Lara
                     ` (3 preceding siblings ...)
  2016-09-02 22:56   ` [PATCH v2 4/4] hash: modify lookup bulk pipeline Pablo de Lara
@ 2016-09-06 19:33   ` Pablo de Lara
  2016-09-30  7:38     ` [PATCH v4 0/4] Cuckoo hash enhancements Pablo de Lara
  2016-09-06 19:34   ` [PATCH v3 0/4] Cuckoo hash lookup enhancements Pablo de Lara
  5 siblings, 1 reply; 37+ messages in thread
From: Pablo de Lara @ 2016-09-06 19:33 UTC (permalink / raw)
  To: dev; +Cc: bruce.richardson, Pablo de Lara

This patchset improves lookup performance in the current hash library
by replacing the existing 4-stage, 2-entry lookup bulk pipeline
with an improved pipeline based on a loop-and-jump model.
Also, x86 vector intrinsics are used to improve performance when comparing signatures.

First patch reorganizes the hash structure.
The structure takes more than one 64-byte cache line, but not all
the fields are used in the lookup operation (the most common operation).
Therefore, the fields used by lookup have been moved to the first part
of the structure, so they all fit in one cache line, slightly improving
performance in some scenarios.

Second patch modifies the layout of the bucket structure.
Currently, the buckets store all the signatures together (current and alternative).
In order to perform a vectorized signature comparison,
all current signatures have to be contiguous, so the bucket layout has been changed,
separating the current signatures from the alternative signatures.

Third patch introduces x86 vector intrinsics.
When performing a lookup bulk operation, all current signatures in a bucket
are compared against the signature of the key being looked up.
Now that they are all contiguous, a vectorized comparison can be performed,
which takes fewer instructions.
The number of entries per bucket is increased from 4 to 8, as AVX2 allows
comparing two 256-bit values, each holding 8x32-bit integers,
which are the 8 signatures in the bucket.

Fourth (and last) patch modifies the current pipeline of the lookup bulk function.
The new pipeline is based on a loop-and-jump model. The two key improvements are:

- Better prefetching: the first 4 keys to be looked up are prefetched,
  and after that, the rest of the keys are prefetched while the signatures
  of earlier keys are being calculated. This gives the CPU more time to
  fetch the data before it is actually needed, which results in fewer
  cache misses and, therefore, higher throughput.

- Lower performance penalty when using the fallback: the lookup bulk
  algorithm assumes that, most of the time, there will be no collision in
  a bucket, but two or more signatures may be equal, in which case more
  than one key comparison might be necessary. In that case, only the key
  of the first hit is prefetched, as in the current implementation. The
  difference is that if this comparison results in a miss, the information
  about the other candidate keys has already been stored, whereas the
  current implementation needs to perform an entire single lookup again.

This patchset depends on the following patchset:
"Hash library fixes" (http://dpdk.org/ml/archives/dev/2016-August/045780.html)

Changes in v3:
- Corrected the cover letter (wrong number of patches)

Changes in v2:
- Increased entries per bucket from 4 to 8 for all cases,
  so it is no longer architecture-dependent.
- Replaced compile-time signature comparison function selection
  with run-time selection, so the best optimization available
  will be used from a single binary.
- Reordered the hash structure, so all the fields used by lookup
  are in the same (first) cache line.

Byron Marohn (3):
  hash: reorganize bucket structure
  hash: add vectorized comparison
  hash: modify lookup bulk pipeline

Pablo de Lara (1):
  hash: reorder hash structure

 lib/librte_hash/rte_cuckoo_hash.c     | 455 ++++++++++++++--------------------
 lib/librte_hash/rte_cuckoo_hash.h     |  44 ++--
 lib/librte_hash/rte_cuckoo_hash_x86.h |  20 +-
 3 files changed, 221 insertions(+), 298 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v3 0/4] Cuckoo hash lookup enhancements
  2016-09-02 22:56 ` [PATCH v2 0/4] Cuckoo hash lookup enhancements Pablo de Lara
                     ` (4 preceding siblings ...)
  2016-09-06 19:33   ` [PATCH v3 0/4] Cuckoo hash lookup enhancements Pablo de Lara
@ 2016-09-06 19:34   ` Pablo de Lara
  2016-09-06 19:34     ` [PATCH v3 1/4] hash: reorder hash structure Pablo de Lara
                       ` (3 more replies)
  5 siblings, 4 replies; 37+ messages in thread
From: Pablo de Lara @ 2016-09-06 19:34 UTC (permalink / raw)
  To: dev; +Cc: bruce.richardson, Pablo de Lara


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v3 1/4] hash: reorder hash structure
  2016-09-06 19:34   ` [PATCH v3 0/4] Cuckoo hash lookup enhancements Pablo de Lara
@ 2016-09-06 19:34     ` Pablo de Lara
  2016-09-28  9:02       ` Bruce Richardson
  2016-09-06 19:34     ` [PATCH v3 2/4] hash: reorganize bucket structure Pablo de Lara
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 37+ messages in thread
From: Pablo de Lara @ 2016-09-06 19:34 UTC (permalink / raw)
  To: dev; +Cc: bruce.richardson, Pablo de Lara

In order to optimize lookup performance, the hash structure
is reordered, so that all fields used for lookup are
in the first cache line.

Signed-off-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>
---
 lib/librte_hash/rte_cuckoo_hash.h | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/lib/librte_hash/rte_cuckoo_hash.h b/lib/librte_hash/rte_cuckoo_hash.h
index e290dab..701531a 100644
--- a/lib/librte_hash/rte_cuckoo_hash.h
+++ b/lib/librte_hash/rte_cuckoo_hash.h
@@ -182,9 +182,7 @@ struct rte_hash_bucket {
 
 /** A hash table structure. */
 struct rte_hash {
-	char name[RTE_HASH_NAMESIZE];   /**< Name of the hash. */
-	uint32_t entries;               /**< Total table entries. */
-	uint32_t num_buckets;           /**< Number of buckets in table. */
+	/* first cache line - fields used in lookup */
 	uint32_t key_len;               /**< Length of hash key. */
 	rte_hash_function hash_func;    /**< Function used to calculate hash. */
 	uint32_t hash_func_init_val;    /**< Init value used by hash_func. */
@@ -196,12 +194,13 @@ struct rte_hash {
 						from hash signature. */
 	uint32_t key_entry_size;         /**< Size of each key entry. */
 
-	struct rte_ring *free_slots;    /**< Ring that stores all indexes
-						of the free slots in the key table */
 	void *key_store;                /**< Table storing all keys and data */
 	struct rte_hash_bucket *buckets;	/**< Table with buckets storing all the
 							hash values and key indexes
 							to the key table*/
+
+	struct rte_ring *free_slots;    /**< Ring that stores all indexes
+						of the free slots in the key table */
 	uint8_t hw_trans_mem_support;	/**< Hardware transactional
 							memory support */
 	struct lcore_cache *local_free_slots;
@@ -209,6 +208,9 @@ struct rte_hash {
 	enum add_key_case add_key; /**< Multi-writer hash add behavior */
 
 	rte_spinlock_t *multiwriter_lock; /**< Multi-writer spinlock for w/o TM */
+	char name[RTE_HASH_NAMESIZE];   /**< Name of the hash. */
+	uint32_t entries;               /**< Total table entries. */
+	uint32_t num_buckets;           /**< Number of buckets in table. */
 } __rte_cache_aligned;
 
 struct queue_node {
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v3 2/4] hash: reorganize bucket structure
  2016-09-06 19:34   ` [PATCH v3 0/4] Cuckoo hash lookup enhancements Pablo de Lara
  2016-09-06 19:34     ` [PATCH v3 1/4] hash: reorder hash structure Pablo de Lara
@ 2016-09-06 19:34     ` Pablo de Lara
  2016-09-28  9:05       ` Bruce Richardson
  2016-09-06 19:34     ` [PATCH v3 3/4] hash: add vectorized comparison Pablo de Lara
  2016-09-06 19:34     ` [PATCH v3 4/4] hash: modify lookup bulk pipeline Pablo de Lara
  3 siblings, 1 reply; 37+ messages in thread
From: Pablo de Lara @ 2016-09-06 19:34 UTC (permalink / raw)
  To: dev; +Cc: bruce.richardson, Byron Marohn, Saikrishna Edupuganti

From: Byron Marohn <byron.marohn@intel.com>

Store the current signatures of all entries together in the bucket,
and likewise all the alternative signatures, instead of interleaving
the current and alternative signatures per entry in the bucket.
This layout will be beneficial in the next commits, where a vectorized
comparison is performed, achieving better performance.

Signed-off-by: Byron Marohn <byron.marohn@intel.com>
Signed-off-by: Saikrishna Edupuganti <saikrishna.edupuganti@intel.com>
---
 lib/librte_hash/rte_cuckoo_hash.c     | 43 ++++++++++++++++++-----------------
 lib/librte_hash/rte_cuckoo_hash.h     | 17 ++++----------
 lib/librte_hash/rte_cuckoo_hash_x86.h | 20 ++++++++--------
 3 files changed, 37 insertions(+), 43 deletions(-)

diff --git a/lib/librte_hash/rte_cuckoo_hash.c b/lib/librte_hash/rte_cuckoo_hash.c
index dd0290f..9d507b6 100644
--- a/lib/librte_hash/rte_cuckoo_hash.c
+++ b/lib/librte_hash/rte_cuckoo_hash.c
@@ -420,7 +420,7 @@ make_space_bucket(const struct rte_hash *h, struct rte_hash_bucket *bkt)
 	 */
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
 		/* Search for space in alternative locations */
-		next_bucket_idx = bkt->signatures[i].alt & h->bucket_bitmask;
+		next_bucket_idx = bkt->sig_alt[i] & h->bucket_bitmask;
 		next_bkt[i] = &h->buckets[next_bucket_idx];
 		for (j = 0; j < RTE_HASH_BUCKET_ENTRIES; j++) {
 			if (next_bkt[i]->key_idx[j] == EMPTY_SLOT)
@@ -433,8 +433,8 @@ make_space_bucket(const struct rte_hash *h, struct rte_hash_bucket *bkt)
 
 	/* Alternative location has spare room (end of recursive function) */
 	if (i != RTE_HASH_BUCKET_ENTRIES) {
-		next_bkt[i]->signatures[j].alt = bkt->signatures[i].current;
-		next_bkt[i]->signatures[j].current = bkt->signatures[i].alt;
+		next_bkt[i]->sig_alt[j] = bkt->sig_current[i];
+		next_bkt[i]->sig_current[j] = bkt->sig_alt[i];
 		next_bkt[i]->key_idx[j] = bkt->key_idx[i];
 		return i;
 	}
@@ -460,8 +460,8 @@ make_space_bucket(const struct rte_hash *h, struct rte_hash_bucket *bkt)
 	 */
 	bkt->flag[i] = 0;
 	if (ret >= 0) {
-		next_bkt[i]->signatures[ret].alt = bkt->signatures[i].current;
-		next_bkt[i]->signatures[ret].current = bkt->signatures[i].alt;
+		next_bkt[i]->sig_alt[ret] = bkt->sig_current[i];
+		next_bkt[i]->sig_current[ret] = bkt->sig_alt[i];
 		next_bkt[i]->key_idx[ret] = bkt->key_idx[i];
 		return i;
 	} else
@@ -543,8 +543,8 @@ __rte_hash_add_key_with_hash(const struct rte_hash *h, const void *key,
 
 	/* Check if key is already inserted in primary location */
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		if (prim_bkt->signatures[i].current == sig &&
-				prim_bkt->signatures[i].alt == alt_hash) {
+		if (prim_bkt->sig_current[i] == sig &&
+				prim_bkt->sig_alt[i] == alt_hash) {
 			k = (struct rte_hash_key *) ((char *)keys +
 					prim_bkt->key_idx[i] * h->key_entry_size);
 			if (rte_hash_cmp_eq(key, k->key, h) == 0) {
@@ -563,8 +563,8 @@ __rte_hash_add_key_with_hash(const struct rte_hash *h, const void *key,
 
 	/* Check if key is already inserted in secondary location */
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		if (sec_bkt->signatures[i].alt == sig &&
-				sec_bkt->signatures[i].current == alt_hash) {
+		if (sec_bkt->sig_alt[i] == sig &&
+				sec_bkt->sig_current[i] == alt_hash) {
 			k = (struct rte_hash_key *) ((char *)keys +
 					sec_bkt->key_idx[i] * h->key_entry_size);
 			if (rte_hash_cmp_eq(key, k->key, h) == 0) {
@@ -610,8 +610,8 @@ __rte_hash_add_key_with_hash(const struct rte_hash *h, const void *key,
 		for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
 			/* Check if slot is available */
 			if (likely(prim_bkt->key_idx[i] == EMPTY_SLOT)) {
-				prim_bkt->signatures[i].current = sig;
-				prim_bkt->signatures[i].alt = alt_hash;
+				prim_bkt->sig_current[i] = sig;
+				prim_bkt->sig_alt[i] = alt_hash;
 				prim_bkt->key_idx[i] = new_idx;
 				break;
 			}
@@ -631,8 +631,8 @@ __rte_hash_add_key_with_hash(const struct rte_hash *h, const void *key,
 		 */
 		ret = make_space_bucket(h, prim_bkt);
 		if (ret >= 0) {
-			prim_bkt->signatures[ret].current = sig;
-			prim_bkt->signatures[ret].alt = alt_hash;
+			prim_bkt->sig_current[ret] = sig;
+			prim_bkt->sig_alt[ret] = alt_hash;
 			prim_bkt->key_idx[ret] = new_idx;
 			if (h->add_key == ADD_KEY_MULTIWRITER)
 				rte_spinlock_unlock(h->multiwriter_lock);
@@ -706,7 +706,7 @@ __rte_hash_lookup_with_hash(const struct rte_hash *h, const void *key,
 
 	/* Check if key is in primary location */
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		if (bkt->signatures[i].current == sig &&
+		if (bkt->sig_current[i] == sig &&
 				bkt->key_idx[i] != EMPTY_SLOT) {
 			k = (struct rte_hash_key *) ((char *)keys +
 					bkt->key_idx[i] * h->key_entry_size);
@@ -729,8 +729,8 @@ __rte_hash_lookup_with_hash(const struct rte_hash *h, const void *key,
 
 	/* Check if key is in secondary location */
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		if (bkt->signatures[i].current == alt_hash &&
-				bkt->signatures[i].alt == sig) {
+		if (bkt->sig_current[i] == alt_hash &&
+				bkt->sig_alt[i] == sig) {
 			k = (struct rte_hash_key *) ((char *)keys +
 					bkt->key_idx[i] * h->key_entry_size);
 			if (rte_hash_cmp_eq(key, k->key, h) == 0) {
@@ -784,7 +784,8 @@ remove_entry(const struct rte_hash *h, struct rte_hash_bucket *bkt, unsigned i)
 	unsigned lcore_id, n_slots;
 	struct lcore_cache *cached_free_slots;
 
-	bkt->signatures[i].sig = NULL_SIGNATURE;
+	bkt->sig_current[i] = NULL_SIGNATURE;
+	bkt->sig_alt[i] = NULL_SIGNATURE;
 	if (h->hw_trans_mem_support) {
 		lcore_id = rte_lcore_id();
 		cached_free_slots = &h->local_free_slots[lcore_id];
@@ -822,7 +823,7 @@ __rte_hash_del_key_with_hash(const struct rte_hash *h, const void *key,
 
 	/* Check if key is in primary location */
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		if (bkt->signatures[i].current == sig &&
+		if (bkt->sig_current[i] == sig &&
 				bkt->key_idx[i] != EMPTY_SLOT) {
 			k = (struct rte_hash_key *) ((char *)keys +
 					bkt->key_idx[i] * h->key_entry_size);
@@ -847,7 +848,7 @@ __rte_hash_del_key_with_hash(const struct rte_hash *h, const void *key,
 
 	/* Check if key is in secondary location */
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		if (bkt->signatures[i].current == alt_hash &&
+		if (bkt->sig_current[i] == alt_hash &&
 				bkt->key_idx[i] != EMPTY_SLOT) {
 			k = (struct rte_hash_key *) ((char *)keys +
 					bkt->key_idx[i] * h->key_entry_size);
@@ -956,8 +957,8 @@ lookup_stage2(unsigned idx, hash_sig_t prim_hash, hash_sig_t sec_hash,
 	prim_hash_matches = 1 << RTE_HASH_BUCKET_ENTRIES;
 	sec_hash_matches = 1 << RTE_HASH_BUCKET_ENTRIES;
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		prim_hash_matches |= ((prim_hash == prim_bkt->signatures[i].current) << i);
-		sec_hash_matches |= ((sec_hash == sec_bkt->signatures[i].current) << i);
+		prim_hash_matches |= ((prim_hash == prim_bkt->sig_current[i]) << i);
+		sec_hash_matches |= ((sec_hash == sec_bkt->sig_current[i]) << i);
 	}
 
 	key_idx = prim_bkt->key_idx[__builtin_ctzl(prim_hash_matches)];
diff --git a/lib/librte_hash/rte_cuckoo_hash.h b/lib/librte_hash/rte_cuckoo_hash.h
index 701531a..86471f7 100644
--- a/lib/librte_hash/rte_cuckoo_hash.h
+++ b/lib/librte_hash/rte_cuckoo_hash.h
@@ -151,17 +151,6 @@ struct lcore_cache {
 	void *objs[LCORE_CACHE_SIZE]; /**< Cache objects */
 } __rte_cache_aligned;
 
-/* Structure storing both primary and secondary hashes */
-struct rte_hash_signatures {
-	union {
-		struct {
-			hash_sig_t current;
-			hash_sig_t alt;
-		};
-		uint64_t sig;
-	};
-};
-
 /* Structure that stores key-value pair */
 struct rte_hash_key {
 	union {
@@ -174,10 +163,14 @@ struct rte_hash_key {
 
 /** Bucket structure */
 struct rte_hash_bucket {
-	struct rte_hash_signatures signatures[RTE_HASH_BUCKET_ENTRIES];
+	hash_sig_t sig_current[RTE_HASH_BUCKET_ENTRIES];
+
 	/* Includes dummy key index that always contains index 0 */
 	uint32_t key_idx[RTE_HASH_BUCKET_ENTRIES + 1];
+
 	uint8_t flag[RTE_HASH_BUCKET_ENTRIES];
+
+	hash_sig_t sig_alt[RTE_HASH_BUCKET_ENTRIES];
 } __rte_cache_aligned;
 
 /** A hash table structure. */
diff --git a/lib/librte_hash/rte_cuckoo_hash_x86.h b/lib/librte_hash/rte_cuckoo_hash_x86.h
index e16d69c..494c160 100644
--- a/lib/librte_hash/rte_cuckoo_hash_x86.h
+++ b/lib/librte_hash/rte_cuckoo_hash_x86.h
@@ -54,8 +54,8 @@ rte_hash_cuckoo_insert_mw_tm(struct rte_hash_bucket *prim_bkt,
 			for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
 				/* Check if slot is available */
 				if (likely(prim_bkt->key_idx == EMPTY_SLOT)) {
-					prim_bkt->signatures[i].current = sig;
-					prim_bkt->signatures[i].alt = alt_hash;
+					prim_bkt->sig_current[i] = sig;
+					prim_bkt->sig_alt[i] = alt_hash;
 					prim_bkt->key_idx[i] = new_idx;
 					break;
 				}
@@ -101,7 +101,7 @@ rte_hash_cuckoo_move_insert_mw_tm(const struct rte_hash *h,
 				prev_slot = curr_node->prev_slot;
 
 				prev_alt_bkt_idx
-					= prev_bkt->signatures[prev_slot].alt
+					= prev_bkt->sig_alt[prev_slot]
 					    & h->bucket_bitmask;
 
 				if (unlikely(&h->buckets[prev_alt_bkt_idx]
@@ -113,10 +113,10 @@ rte_hash_cuckoo_move_insert_mw_tm(const struct rte_hash *h,
 				 * Cuckoo insert to move elements back to its
 				 * primary bucket if available
 				 */
-				curr_bkt->signatures[curr_slot].alt =
-				    prev_bkt->signatures[prev_slot].current;
-				curr_bkt->signatures[curr_slot].current =
-				    prev_bkt->signatures[prev_slot].alt;
+				curr_bkt->sig_alt[curr_slot] =
+				    prev_bkt->sig_current[prev_slot];
+				curr_bkt->sig_current[curr_slot] =
+				    prev_bkt->sig_alt[prev_slot];
 				curr_bkt->key_idx[curr_slot]
 				    = prev_bkt->key_idx[prev_slot];
 
@@ -125,8 +125,8 @@ rte_hash_cuckoo_move_insert_mw_tm(const struct rte_hash *h,
 				curr_bkt = curr_node->bkt;
 			}
 
-			curr_bkt->signatures[curr_slot].current = sig;
-			curr_bkt->signatures[curr_slot].alt = alt_hash;
+			curr_bkt->sig_current[curr_slot] = sig;
+			curr_bkt->sig_alt[curr_slot] = alt_hash;
 			curr_bkt->key_idx[curr_slot] = new_idx;
 
 			rte_xend();
@@ -178,7 +178,7 @@ rte_hash_cuckoo_make_space_mw_tm(const struct rte_hash *h,
 			}
 
 			/* Enqueue new node and keep prev node info */
-			alt_bkt = &(h->buckets[curr_bkt->signatures[i].alt
+			alt_bkt = &(h->buckets[curr_bkt->sig_alt[i]
 						    & h->bucket_bitmask]);
 			head->bkt = alt_bkt;
 			head->prev = tail;
-- 
2.7.4


* [PATCH v3 3/4] hash: add vectorized comparison
  2016-09-06 19:34   ` [PATCH v3 0/4] Cuckoo hash lookup enhancements Pablo de Lara
  2016-09-06 19:34     ` [PATCH v3 1/4] hash: reorder hash structure Pablo de Lara
  2016-09-06 19:34     ` [PATCH v3 2/4] hash: reorganize bucket structure Pablo de Lara
@ 2016-09-06 19:34     ` Pablo de Lara
  2016-09-06 19:34     ` [PATCH v3 4/4] hash: modify lookup bulk pipeline Pablo de Lara
  3 siblings, 0 replies; 37+ messages in thread
From: Pablo de Lara @ 2016-09-06 19:34 UTC (permalink / raw)
  To: dev; +Cc: bruce.richardson, Byron Marohn, Saikrishna Edupuganti, Pablo de Lara

From: Byron Marohn <byron.marohn@intel.com>

In the lookup bulk function, the signatures of all entries in a bucket
are compared against the signature of the key being looked up.
Now that all the current signatures are stored together, they can be compared
with vector instructions (SSE, AVX2), achieving higher lookup performance.

Also, the number of entries per bucket is increased to 8 when using processors
with AVX2, as 256 bits can be compared at once, which is the size of
8x32-bit signatures.

Signed-off-by: Byron Marohn <byron.marohn@intel.com>
Signed-off-by: Saikrishna Edupuganti <saikrishna.edupuganti@intel.com>
Signed-off-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>
---
 lib/librte_hash/rte_cuckoo_hash.c | 73 ++++++++++++++++++++++++++++++++++++---
 lib/librte_hash/rte_cuckoo_hash.h | 12 ++++++-
 2 files changed, 79 insertions(+), 6 deletions(-)

diff --git a/lib/librte_hash/rte_cuckoo_hash.c b/lib/librte_hash/rte_cuckoo_hash.c
index 9d507b6..eab28a1 100644
--- a/lib/librte_hash/rte_cuckoo_hash.c
+++ b/lib/librte_hash/rte_cuckoo_hash.c
@@ -283,6 +283,15 @@ rte_hash_create(const struct rte_hash_parameters *params)
 	h->free_slots = r;
 	h->hw_trans_mem_support = hw_trans_mem_support;
 
+#if defined(RTE_ARCH_X86)
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
+		h->sig_cmp_fn = RTE_HASH_COMPARE_AVX2;
+	else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE2))
+		h->sig_cmp_fn = RTE_HASH_COMPARE_SSE;
+	else
+#endif
+		h->sig_cmp_fn = RTE_HASH_COMPARE_SCALAR;
+
 	/* Turn on multi-writer only with explicit flat from user and TM
 	 * support.
 	 */
@@ -939,6 +948,61 @@ lookup_stage1(unsigned idx, hash_sig_t *prim_hash, hash_sig_t *sec_hash,
 	rte_prefetch0(*secondary_bkt);
 }
 
+static inline void
+compare_signatures(unsigned *prim_hash_matches, unsigned *sec_hash_matches,
+				const struct rte_hash_bucket *prim_bkt,
+				const struct rte_hash_bucket *sec_bkt,
+				hash_sig_t prim_hash, hash_sig_t sec_hash,
+				enum rte_hash_sig_compare_function sig_cmp_fn)
+{
+	unsigned i;
+
+	switch (sig_cmp_fn) {
+#ifdef RTE_MACHINE_CPUFLAG_AVX2
+	case RTE_HASH_COMPARE_AVX2:
+		*prim_hash_matches |= _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
+				_mm256_load_si256(
+					(__m256i const *)prim_bkt->sig_current),
+				_mm256_set1_epi32(prim_hash)));
+		*sec_hash_matches |= _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
+				_mm256_load_si256(
+					(__m256i const *)sec_bkt->sig_current),
+				_mm256_set1_epi32(sec_hash)));
+		break;
+#endif
+#ifdef RTE_MACHINE_CPUFLAG_SSE2
+	case RTE_HASH_COMPARE_SSE:
+		/* Compare the first 4 signatures in the bucket */
+		*prim_hash_matches |= _mm_movemask_ps((__m128)_mm_cmpeq_epi16(
+				_mm_load_si128(
+					(__m128i const *)prim_bkt->sig_current),
+				_mm_set1_epi32(prim_hash)));
+		*prim_hash_matches |= (_mm_movemask_ps((__m128)_mm_cmpeq_epi16(
+				_mm_load_si128(
+					(__m128i const *)&prim_bkt->sig_current[4]),
+				_mm_set1_epi32(prim_hash)))) << 4;
+		/* Compare the first 4 signatures in the bucket */
+		*sec_hash_matches |= _mm_movemask_ps((__m128)_mm_cmpeq_epi16(
+				_mm_load_si128(
+					(__m128i const *)sec_bkt->sig_current),
+				_mm_set1_epi32(sec_hash)));
+		*sec_hash_matches |= (_mm_movemask_ps((__m128)_mm_cmpeq_epi16(
+				_mm_load_si128(
+					(__m128i const *)&sec_bkt->sig_current[4]),
+				_mm_set1_epi32(sec_hash)))) << 4;
+		break;
+#endif
+	default:
+		for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
+			*prim_hash_matches |=
+				((prim_hash == prim_bkt->sig_current[i]) << i);
+			*sec_hash_matches |=
+				((sec_hash == sec_bkt->sig_current[i]) << i);
+		}
+	}
+
+}
+
 /*
  * Lookup bulk stage 2:  Search for match hashes in primary/secondary locations
  * and prefetch first key slot
@@ -951,15 +1015,14 @@ lookup_stage2(unsigned idx, hash_sig_t prim_hash, hash_sig_t sec_hash,
 		uint64_t *extra_hits_mask, const void *keys,
 		const struct rte_hash *h)
 {
-	unsigned prim_hash_matches, sec_hash_matches, key_idx, i;
+	unsigned prim_hash_matches, sec_hash_matches, key_idx;
 	unsigned total_hash_matches;
 
 	prim_hash_matches = 1 << RTE_HASH_BUCKET_ENTRIES;
 	sec_hash_matches = 1 << RTE_HASH_BUCKET_ENTRIES;
-	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		prim_hash_matches |= ((prim_hash == prim_bkt->sig_current[i]) << i);
-		sec_hash_matches |= ((sec_hash == sec_bkt->sig_current[i]) << i);
-	}
+
+	compare_signatures(&prim_hash_matches, &sec_hash_matches, prim_bkt,
+				sec_bkt, prim_hash, sec_hash, h->sig_cmp_fn);
 
 	key_idx = prim_bkt->key_idx[__builtin_ctzl(prim_hash_matches)];
 	if (key_idx == 0)
diff --git a/lib/librte_hash/rte_cuckoo_hash.h b/lib/librte_hash/rte_cuckoo_hash.h
index 86471f7..8ffc146 100644
--- a/lib/librte_hash/rte_cuckoo_hash.h
+++ b/lib/librte_hash/rte_cuckoo_hash.h
@@ -130,7 +130,7 @@ enum add_key_case {
 };
 
 /** Number of items per bucket. */
-#define RTE_HASH_BUCKET_ENTRIES		4
+#define RTE_HASH_BUCKET_ENTRIES		8
 
 #define NULL_SIGNATURE			0
 
@@ -161,6 +161,14 @@ struct rte_hash_key {
 	char key[0];
 } __attribute__((aligned(KEY_ALIGNMENT)));
 
+/* All different signature compare functions */
+enum rte_hash_sig_compare_function {
+	RTE_HASH_COMPARE_SCALAR = 0,
+	RTE_HASH_COMPARE_SSE,
+	RTE_HASH_COMPARE_AVX2,
+	RTE_HASH_COMPARE_NUM
+};
+
 /** Bucket structure */
 struct rte_hash_bucket {
 	hash_sig_t sig_current[RTE_HASH_BUCKET_ENTRIES];
@@ -183,6 +191,8 @@ struct rte_hash {
 	/**< Custom function used to compare keys. */
 	enum cmp_jump_table_case cmp_jump_table_idx;
 	/**< Indicates which compare function to use. */
+	enum rte_hash_sig_compare_function sig_cmp_fn;
+	/**< Indicates which signature compare function to use. */
 	uint32_t bucket_bitmask;        /**< Bitmask for getting bucket index
 						from hash signature. */
 	uint32_t key_entry_size;         /**< Size of each key entry. */
-- 
2.7.4


* [PATCH v3 4/4] hash: modify lookup bulk pipeline
  2016-09-06 19:34   ` [PATCH v3 0/4] Cuckoo hash lookup enhancements Pablo de Lara
                       ` (2 preceding siblings ...)
  2016-09-06 19:34     ` [PATCH v3 3/4] hash: add vectorized comparison Pablo de Lara
@ 2016-09-06 19:34     ` Pablo de Lara
  3 siblings, 0 replies; 37+ messages in thread
From: Pablo de Lara @ 2016-09-06 19:34 UTC (permalink / raw)
  To: dev; +Cc: bruce.richardson, Byron Marohn, Saikrishna Edupuganti, Pablo de Lara

From: Byron Marohn <byron.marohn@intel.com>

This patch replaces the pipelined rte_hash lookup mechanism with a
loop-and-jump model, which performs significantly better,
especially for smaller table sizes and smaller table occupancies.

Signed-off-by: Byron Marohn <byron.marohn@intel.com>
Signed-off-by: Saikrishna Edupuganti <saikrishna.edupuganti@intel.com>
Signed-off-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>
---
 lib/librte_hash/rte_cuckoo_hash.c | 377 ++++++++++++--------------------------
 lib/librte_hash/rte_cuckoo_hash.h |   3 +-
 2 files changed, 117 insertions(+), 263 deletions(-)

diff --git a/lib/librte_hash/rte_cuckoo_hash.c b/lib/librte_hash/rte_cuckoo_hash.c
index eab28a1..47b5beb 100644
--- a/lib/librte_hash/rte_cuckoo_hash.c
+++ b/lib/librte_hash/rte_cuckoo_hash.c
@@ -913,43 +913,8 @@ rte_hash_get_key_with_position(const struct rte_hash *h, const int32_t position,
 	return 0;
 }
 
-/* Lookup bulk stage 0: Prefetch input key */
 static inline void
-lookup_stage0(unsigned *idx, uint64_t *lookup_mask,
-		const void * const *keys)
-{
-	*idx = __builtin_ctzl(*lookup_mask);
-	if (*lookup_mask == 0)
-		*idx = 0;
-
-	rte_prefetch0(keys[*idx]);
-	*lookup_mask &= ~(1llu << *idx);
-}
-
-/*
- * Lookup bulk stage 1: Calculate primary/secondary hashes
- * and prefetch primary/secondary buckets
- */
-static inline void
-lookup_stage1(unsigned idx, hash_sig_t *prim_hash, hash_sig_t *sec_hash,
-		const struct rte_hash_bucket **primary_bkt,
-		const struct rte_hash_bucket **secondary_bkt,
-		hash_sig_t *hash_vals, const void * const *keys,
-		const struct rte_hash *h)
-{
-	*prim_hash = rte_hash_hash(h, keys[idx]);
-	hash_vals[idx] = *prim_hash;
-	*sec_hash = rte_hash_secondary_hash(*prim_hash);
-
-	*primary_bkt = &h->buckets[*prim_hash & h->bucket_bitmask];
-	*secondary_bkt = &h->buckets[*sec_hash & h->bucket_bitmask];
-
-	rte_prefetch0(*primary_bkt);
-	rte_prefetch0(*secondary_bkt);
-}
-
-static inline void
-compare_signatures(unsigned *prim_hash_matches, unsigned *sec_hash_matches,
+compare_signatures(uint32_t *prim_hash_matches, uint32_t *sec_hash_matches,
 				const struct rte_hash_bucket *prim_bkt,
 				const struct rte_hash_bucket *sec_bkt,
 				hash_sig_t prim_hash, hash_sig_t sec_hash,
@@ -960,11 +925,11 @@ compare_signatures(unsigned *prim_hash_matches, unsigned *sec_hash_matches,
 	switch (sig_cmp_fn) {
 #ifdef RTE_MACHINE_CPUFLAG_AVX2
 	case RTE_HASH_COMPARE_AVX2:
-		*prim_hash_matches |= _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
+		*prim_hash_matches = _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
 				_mm256_load_si256(
 					(__m256i const *)prim_bkt->sig_current),
 				_mm256_set1_epi32(prim_hash)));
-		*sec_hash_matches |= _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
+		*sec_hash_matches = _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
 				_mm256_load_si256(
 					(__m256i const *)sec_bkt->sig_current),
 				_mm256_set1_epi32(sec_hash)));
@@ -973,7 +938,7 @@ compare_signatures(unsigned *prim_hash_matches, unsigned *sec_hash_matches,
 #ifdef RTE_MACHINE_CPUFLAG_SSE2
 	case RTE_HASH_COMPARE_SSE:
 		/* Compare the first 4 signatures in the bucket */
-		*prim_hash_matches |= _mm_movemask_ps((__m128)_mm_cmpeq_epi16(
+		*prim_hash_matches = _mm_movemask_ps((__m128)_mm_cmpeq_epi16(
 				_mm_load_si128(
 					(__m128i const *)prim_bkt->sig_current),
 				_mm_set1_epi32(prim_hash)));
@@ -982,7 +947,7 @@ compare_signatures(unsigned *prim_hash_matches, unsigned *sec_hash_matches,
 					(__m128i const *)&prim_bkt->sig_current[4]),
 				_mm_set1_epi32(prim_hash)))) << 4;
 		/* Compare the first 4 signatures in the bucket */
-		*sec_hash_matches |= _mm_movemask_ps((__m128)_mm_cmpeq_epi16(
+		*sec_hash_matches = _mm_movemask_ps((__m128)_mm_cmpeq_epi16(
 				_mm_load_si128(
 					(__m128i const *)sec_bkt->sig_current),
 				_mm_set1_epi32(sec_hash)));
@@ -1003,244 +968,134 @@ compare_signatures(unsigned *prim_hash_matches, unsigned *sec_hash_matches,
 
 }
 
-/*
- * Lookup bulk stage 2:  Search for match hashes in primary/secondary locations
- * and prefetch first key slot
- */
+#define PREFETCH_OFFSET 4
 static inline void
-lookup_stage2(unsigned idx, hash_sig_t prim_hash, hash_sig_t sec_hash,
-		const struct rte_hash_bucket *prim_bkt,
-		const struct rte_hash_bucket *sec_bkt,
-		const struct rte_hash_key **key_slot, int32_t *positions,
-		uint64_t *extra_hits_mask, const void *keys,
-		const struct rte_hash *h)
+__rte_hash_lookup_bulk(const struct rte_hash *h, const void **keys,
+			int32_t num_keys, int32_t *positions,
+			uint64_t *hit_mask, void *data[])
 {
-	unsigned prim_hash_matches, sec_hash_matches, key_idx;
-	unsigned total_hash_matches;
+	uint64_t hits = 0;
+	int32_t i;
+	uint32_t prim_hash[RTE_HASH_LOOKUP_BULK_MAX];
+	uint32_t sec_hash[RTE_HASH_LOOKUP_BULK_MAX];
+	const struct rte_hash_bucket *primary_bkt[RTE_HASH_LOOKUP_BULK_MAX];
+	const struct rte_hash_bucket *secondary_bkt[RTE_HASH_LOOKUP_BULK_MAX];
+	uint32_t prim_hitmask[RTE_HASH_LOOKUP_BULK_MAX] = {0};
+	uint32_t sec_hitmask[RTE_HASH_LOOKUP_BULK_MAX] = {0};
+
+	/* Prefetch first keys */
+	for (i = 0; i < PREFETCH_OFFSET && i < num_keys; i++)
+		rte_prefetch0(keys[i]);
 
-	prim_hash_matches = 1 << RTE_HASH_BUCKET_ENTRIES;
-	sec_hash_matches = 1 << RTE_HASH_BUCKET_ENTRIES;
+	/*
+	 * Prefetch rest of the keys, calculate primary and
+	 * secondary bucket and prefetch them
+	 */
+	for (i = 0; i < (num_keys - PREFETCH_OFFSET); i++) {
+		rte_prefetch0(keys[i + PREFETCH_OFFSET]);
 
-	compare_signatures(&prim_hash_matches, &sec_hash_matches, prim_bkt,
-				sec_bkt, prim_hash, sec_hash, h->sig_cmp_fn);
+		prim_hash[i] = rte_hash_hash(h, keys[i]);
+		sec_hash[i] = rte_hash_secondary_hash(prim_hash[i]);
 
-	key_idx = prim_bkt->key_idx[__builtin_ctzl(prim_hash_matches)];
-	if (key_idx == 0)
-		key_idx = sec_bkt->key_idx[__builtin_ctzl(sec_hash_matches)];
+		primary_bkt[i] = &h->buckets[prim_hash[i] & h->bucket_bitmask];
+		secondary_bkt[i] = &h->buckets[sec_hash[i] & h->bucket_bitmask];
 
-	total_hash_matches = (prim_hash_matches |
-				(sec_hash_matches << (RTE_HASH_BUCKET_ENTRIES + 1)));
-	*key_slot = (const struct rte_hash_key *) ((const char *)keys +
-					key_idx * h->key_entry_size);
+		rte_prefetch0(primary_bkt[i]);
+		rte_prefetch0(secondary_bkt[i]);
+	}
 
-	rte_prefetch0(*key_slot);
-	/*
-	 * Return index where key is stored,
-	 * substracting the first dummy index
-	 */
-	positions[idx] = (key_idx - 1);
+	/* Calculate and prefetch rest of the buckets */
+	for (; i < num_keys; i++) {
+		prim_hash[i] = rte_hash_hash(h, keys[i]);
+		sec_hash[i] = rte_hash_secondary_hash(prim_hash[i]);
 
-	*extra_hits_mask |= (uint64_t)(__builtin_popcount(total_hash_matches) > 3) << idx;
+		primary_bkt[i] = &h->buckets[prim_hash[i] & h->bucket_bitmask];
+		secondary_bkt[i] = &h->buckets[sec_hash[i] & h->bucket_bitmask];
 
-}
+		rte_prefetch0(primary_bkt[i]);
+		rte_prefetch0(secondary_bkt[i]);
+	}
 
+	/* Compare signatures and prefetch key slot of first hit */
+	for (i = 0; i < num_keys; i++) {
+		compare_signatures(&prim_hitmask[i], &sec_hitmask[i],
+				primary_bkt[i], secondary_bkt[i],
+				prim_hash[i], sec_hash[i], h->sig_cmp_fn);
+
+		if (prim_hitmask[i]) {
+			uint32_t first_hit = __builtin_ctzl(prim_hitmask[i]);
+			uint32_t key_idx = primary_bkt[i]->key_idx[first_hit];
+			const struct rte_hash_key *key_slot =
+				(const struct rte_hash_key *)(
+				(const char *)h->key_store +
+				key_idx * h->key_entry_size);
+			rte_prefetch0(key_slot);
+			continue;
+		}
 
-/* Lookup bulk stage 3: Check if key matches, update hit mask and return data */
-static inline void
-lookup_stage3(unsigned idx, const struct rte_hash_key *key_slot, const void * const *keys,
-		const int32_t *positions, void *data[], uint64_t *hits,
-		const struct rte_hash *h)
-{
-	unsigned hit;
-	unsigned key_idx;
+		if (sec_hitmask[i]) {
+			uint32_t first_hit = __builtin_ctzl(sec_hitmask[i]);
+			uint32_t key_idx = secondary_bkt[i]->key_idx[first_hit];
+			const struct rte_hash_key *key_slot =
+				(const struct rte_hash_key *)(
+				(const char *)h->key_store +
+				key_idx * h->key_entry_size);
+			rte_prefetch0(key_slot);
+		}
+	}
 
-	hit = !rte_hash_cmp_eq(key_slot->key, keys[idx], h);
-	if (data != NULL)
-		data[idx] = key_slot->pdata;
+	/* Compare keys, first hits in primary first */
+	for (i = 0; i < num_keys; i++) {
+		positions[i] = -ENOENT;
+		while (prim_hitmask[i]) {
+			uint32_t hit_index = __builtin_ctzl(prim_hitmask[i]);
+
+			uint32_t key_idx = primary_bkt[i]->key_idx[hit_index];
+			const struct rte_hash_key *key_slot =
+				(const struct rte_hash_key *)(
+				(const char *)h->key_store +
+				key_idx * h->key_entry_size);
+			/*
+			 * If key index is 0, do not compare key,
+			 * as it is checking the dummy slot
+			 */
+			if (!!key_idx & !rte_hash_cmp_eq(key_slot->key, keys[i], h)) {
+				if (data != NULL)
+					data[i] = key_slot->pdata;
 
-	key_idx = positions[idx] + 1;
-	/*
-	 * If key index is 0, force hit to be 0, in case key to be looked up
-	 * is all zero (as in the dummy slot), which would result in a wrong hit
-	 */
-	*hits |= (uint64_t)(hit && !!key_idx)  << idx;
-}
+				hits |= 1ULL << i;
+				positions[i] = key_idx - 1;
+				goto next_key;
+			}
+			prim_hitmask[i] &= ~(1 << (hit_index));
+		}
 
-static inline void
-__rte_hash_lookup_bulk(const struct rte_hash *h, const void **keys,
-			uint32_t num_keys, int32_t *positions,
-			uint64_t *hit_mask, void *data[])
-{
-	uint64_t hits = 0;
-	uint64_t extra_hits_mask = 0;
-	uint64_t lookup_mask, miss_mask;
-	unsigned idx;
-	const void *key_store = h->key_store;
-	int ret;
-	hash_sig_t hash_vals[RTE_HASH_LOOKUP_BULK_MAX];
-
-	unsigned idx00, idx01, idx10, idx11, idx20, idx21, idx30, idx31;
-	const struct rte_hash_bucket *primary_bkt10, *primary_bkt11;
-	const struct rte_hash_bucket *secondary_bkt10, *secondary_bkt11;
-	const struct rte_hash_bucket *primary_bkt20, *primary_bkt21;
-	const struct rte_hash_bucket *secondary_bkt20, *secondary_bkt21;
-	const struct rte_hash_key *k_slot20, *k_slot21, *k_slot30, *k_slot31;
-	hash_sig_t primary_hash10, primary_hash11;
-	hash_sig_t secondary_hash10, secondary_hash11;
-	hash_sig_t primary_hash20, primary_hash21;
-	hash_sig_t secondary_hash20, secondary_hash21;
-
-	lookup_mask = (uint64_t) -1 >> (64 - num_keys);
-	miss_mask = lookup_mask;
-
-	lookup_stage0(&idx00, &lookup_mask, keys);
-	lookup_stage0(&idx01, &lookup_mask, keys);
-
-	idx10 = idx00, idx11 = idx01;
-
-	lookup_stage0(&idx00, &lookup_mask, keys);
-	lookup_stage0(&idx01, &lookup_mask, keys);
-	lookup_stage1(idx10, &primary_hash10, &secondary_hash10,
-			&primary_bkt10, &secondary_bkt10, hash_vals, keys, h);
-	lookup_stage1(idx11, &primary_hash11, &secondary_hash11,
-			&primary_bkt11,	&secondary_bkt11, hash_vals, keys, h);
-
-	primary_bkt20 = primary_bkt10;
-	primary_bkt21 = primary_bkt11;
-	secondary_bkt20 = secondary_bkt10;
-	secondary_bkt21 = secondary_bkt11;
-	primary_hash20 = primary_hash10;
-	primary_hash21 = primary_hash11;
-	secondary_hash20 = secondary_hash10;
-	secondary_hash21 = secondary_hash11;
-	idx20 = idx10, idx21 = idx11;
-	idx10 = idx00, idx11 = idx01;
-
-	lookup_stage0(&idx00, &lookup_mask, keys);
-	lookup_stage0(&idx01, &lookup_mask, keys);
-	lookup_stage1(idx10, &primary_hash10, &secondary_hash10,
-			&primary_bkt10, &secondary_bkt10, hash_vals, keys, h);
-	lookup_stage1(idx11, &primary_hash11, &secondary_hash11,
-			&primary_bkt11,	&secondary_bkt11, hash_vals, keys, h);
-	lookup_stage2(idx20, primary_hash20, secondary_hash20, primary_bkt20,
-			secondary_bkt20, &k_slot20, positions, &extra_hits_mask,
-			key_store, h);
-	lookup_stage2(idx21, primary_hash21, secondary_hash21, primary_bkt21,
-			secondary_bkt21, &k_slot21, positions, &extra_hits_mask,
-			key_store, h);
-
-	while (lookup_mask) {
-		k_slot30 = k_slot20, k_slot31 = k_slot21;
-		idx30 = idx20, idx31 = idx21;
-		primary_bkt20 = primary_bkt10;
-		primary_bkt21 = primary_bkt11;
-		secondary_bkt20 = secondary_bkt10;
-		secondary_bkt21 = secondary_bkt11;
-		primary_hash20 = primary_hash10;
-		primary_hash21 = primary_hash11;
-		secondary_hash20 = secondary_hash10;
-		secondary_hash21 = secondary_hash11;
-		idx20 = idx10, idx21 = idx11;
-		idx10 = idx00, idx11 = idx01;
-
-		lookup_stage0(&idx00, &lookup_mask, keys);
-		lookup_stage0(&idx01, &lookup_mask, keys);
-		lookup_stage1(idx10, &primary_hash10, &secondary_hash10,
-			&primary_bkt10, &secondary_bkt10, hash_vals, keys, h);
-		lookup_stage1(idx11, &primary_hash11, &secondary_hash11,
-			&primary_bkt11,	&secondary_bkt11, hash_vals, keys, h);
-		lookup_stage2(idx20, primary_hash20, secondary_hash20,
-			primary_bkt20, secondary_bkt20, &k_slot20, positions,
-			&extra_hits_mask, key_store, h);
-		lookup_stage2(idx21, primary_hash21, secondary_hash21,
-			primary_bkt21, secondary_bkt21,	&k_slot21, positions,
-			&extra_hits_mask, key_store, h);
-		lookup_stage3(idx30, k_slot30, keys, positions, data, &hits, h);
-		lookup_stage3(idx31, k_slot31, keys, positions, data, &hits, h);
-	}
+		while (sec_hitmask[i]) {
+			uint32_t hit_index = __builtin_ctzl(sec_hitmask[i]);
+
+			uint32_t key_idx = secondary_bkt[i]->key_idx[hit_index];
+			const struct rte_hash_key *key_slot =
+				(const struct rte_hash_key *)(
+				(const char *)h->key_store +
+				key_idx * h->key_entry_size);
+			/*
+			 * If key index is 0, do not compare key,
+			 * as it is checking the dummy slot
+			 */
+
+			if (!!key_idx & !rte_hash_cmp_eq(key_slot->key, keys[i], h)) {
+				if (data != NULL)
+					data[i] = key_slot->pdata;
 
-	k_slot30 = k_slot20, k_slot31 = k_slot21;
-	idx30 = idx20, idx31 = idx21;
-	primary_bkt20 = primary_bkt10;
-	primary_bkt21 = primary_bkt11;
-	secondary_bkt20 = secondary_bkt10;
-	secondary_bkt21 = secondary_bkt11;
-	primary_hash20 = primary_hash10;
-	primary_hash21 = primary_hash11;
-	secondary_hash20 = secondary_hash10;
-	secondary_hash21 = secondary_hash11;
-	idx20 = idx10, idx21 = idx11;
-	idx10 = idx00, idx11 = idx01;
-
-	lookup_stage1(idx10, &primary_hash10, &secondary_hash10,
-		&primary_bkt10, &secondary_bkt10, hash_vals, keys, h);
-	lookup_stage1(idx11, &primary_hash11, &secondary_hash11,
-		&primary_bkt11,	&secondary_bkt11, hash_vals, keys, h);
-	lookup_stage2(idx20, primary_hash20, secondary_hash20, primary_bkt20,
-		secondary_bkt20, &k_slot20, positions, &extra_hits_mask,
-		key_store, h);
-	lookup_stage2(idx21, primary_hash21, secondary_hash21, primary_bkt21,
-		secondary_bkt21, &k_slot21, positions, &extra_hits_mask,
-		key_store, h);
-	lookup_stage3(idx30, k_slot30, keys, positions, data, &hits, h);
-	lookup_stage3(idx31, k_slot31, keys, positions, data, &hits, h);
-
-	k_slot30 = k_slot20, k_slot31 = k_slot21;
-	idx30 = idx20, idx31 = idx21;
-	primary_bkt20 = primary_bkt10;
-	primary_bkt21 = primary_bkt11;
-	secondary_bkt20 = secondary_bkt10;
-	secondary_bkt21 = secondary_bkt11;
-	primary_hash20 = primary_hash10;
-	primary_hash21 = primary_hash11;
-	secondary_hash20 = secondary_hash10;
-	secondary_hash21 = secondary_hash11;
-	idx20 = idx10, idx21 = idx11;
-
-	lookup_stage2(idx20, primary_hash20, secondary_hash20, primary_bkt20,
-		secondary_bkt20, &k_slot20, positions, &extra_hits_mask,
-		key_store, h);
-	lookup_stage2(idx21, primary_hash21, secondary_hash21, primary_bkt21,
-		secondary_bkt21, &k_slot21, positions, &extra_hits_mask,
-		key_store, h);
-	lookup_stage3(idx30, k_slot30, keys, positions, data, &hits, h);
-	lookup_stage3(idx31, k_slot31, keys, positions, data, &hits, h);
-
-	k_slot30 = k_slot20, k_slot31 = k_slot21;
-	idx30 = idx20, idx31 = idx21;
-
-	lookup_stage3(idx30, k_slot30, keys, positions, data, &hits, h);
-	lookup_stage3(idx31, k_slot31, keys, positions, data, &hits, h);
-
-	/* ignore any items we have already found */
-	extra_hits_mask &= ~hits;
-
-	if (unlikely(extra_hits_mask)) {
-		/* run a single search for each remaining item */
-		do {
-			idx = __builtin_ctzl(extra_hits_mask);
-			if (data != NULL) {
-				ret = rte_hash_lookup_with_hash_data(h,
-						keys[idx], hash_vals[idx], &data[idx]);
-				if (ret >= 0)
-					hits |= 1ULL << idx;
-			} else {
-				positions[idx] = rte_hash_lookup_with_hash(h,
-							keys[idx], hash_vals[idx]);
-				if (positions[idx] >= 0)
-					hits |= 1llu << idx;
+				hits |= 1ULL << i;
+				positions[i] = key_idx - 1;
+				goto next_key;
 			}
-			extra_hits_mask &= ~(1llu << idx);
-		} while (extra_hits_mask);
-	}
+			sec_hitmask[i] &= ~(1 << (hit_index));
+		}
 
-	miss_mask &= ~hits;
-	if (unlikely(miss_mask)) {
-		do {
-			idx = __builtin_ctzl(miss_mask);
-			positions[idx] = -ENOENT;
-			miss_mask &= ~(1llu << idx);
-		} while (miss_mask);
+next_key:
+		continue;
 	}
 
 	if (hit_mask != NULL)
diff --git a/lib/librte_hash/rte_cuckoo_hash.h b/lib/librte_hash/rte_cuckoo_hash.h
index 8ffc146..986596f 100644
--- a/lib/librte_hash/rte_cuckoo_hash.h
+++ b/lib/librte_hash/rte_cuckoo_hash.h
@@ -173,8 +173,7 @@ enum rte_hash_sig_compare_function {
 struct rte_hash_bucket {
 	hash_sig_t sig_current[RTE_HASH_BUCKET_ENTRIES];
 
-	/* Includes dummy key index that always contains index 0 */
-	uint32_t key_idx[RTE_HASH_BUCKET_ENTRIES + 1];
+	uint32_t key_idx[RTE_HASH_BUCKET_ENTRIES];
 
 	uint8_t flag[RTE_HASH_BUCKET_ENTRIES];
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 1/4] hash: reorder hash structure
  2016-09-06 19:34     ` [PATCH v3 1/4] hash: reorder hash structure Pablo de Lara
@ 2016-09-28  9:02       ` Bruce Richardson
  2016-09-29  1:33         ` De Lara Guarch, Pablo
  0 siblings, 1 reply; 37+ messages in thread
From: Bruce Richardson @ 2016-09-28  9:02 UTC (permalink / raw)
  To: Pablo de Lara; +Cc: dev

On Tue, Sep 06, 2016 at 08:34:01PM +0100, Pablo de Lara wrote:
> In order to optimize lookup performance, hash structure
> is reordered, so all fields used for lookup will be
> in the first cache line.
> 
> Signed-off-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>
> ---
>  lib/librte_hash/rte_cuckoo_hash.h | 12 +++++++-----
>  1 file changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/lib/librte_hash/rte_cuckoo_hash.h b/lib/librte_hash/rte_cuckoo_hash.h
> index e290dab..701531a 100644
> --- a/lib/librte_hash/rte_cuckoo_hash.h
> +++ b/lib/librte_hash/rte_cuckoo_hash.h
> @@ -182,9 +182,7 @@ struct rte_hash_bucket {
>  
>  /** A hash table structure. */
>  struct rte_hash {
> -	char name[RTE_HASH_NAMESIZE];   /**< Name of the hash. */
> -	uint32_t entries;               /**< Total table entries. */
> -	uint32_t num_buckets;           /**< Number of buckets in table. */
> +	/* first cache line - fields used in lookup */
>  	uint32_t key_len;               /**< Length of hash key. */
>  	rte_hash_function hash_func;    /**< Function used to calculate hash. */
>  	uint32_t hash_func_init_val;    /**< Init value used by hash_func. */
> @@ -196,12 +194,13 @@ struct rte_hash {
>  						from hash signature. */
>  	uint32_t key_entry_size;         /**< Size of each key entry. */
>  
> -	struct rte_ring *free_slots;    /**< Ring that stores all indexes
> -						of the free slots in the key table */
>  	void *key_store;                /**< Table storing all keys and data */
>  	struct rte_hash_bucket *buckets;	/**< Table with buckets storing all the
>  							hash values and key indexes
>  							to the key table*/
> +
> +	struct rte_ring *free_slots;    /**< Ring that stores all indexes
> +						of the free slots in the key table */
>  	uint8_t hw_trans_mem_support;	/**< Hardware transactional
>  							memory support */
>  	struct lcore_cache *local_free_slots;
> @@ -209,6 +208,9 @@ struct rte_hash {
>  	enum add_key_case add_key; /**< Multi-writer hash add behavior */
>  
>  	rte_spinlock_t *multiwriter_lock; /**< Multi-writer spinlock for w/o TM */
> +	char name[RTE_HASH_NAMESIZE];   /**< Name of the hash. */
> +	uint32_t entries;               /**< Total table entries. */
> +	uint32_t num_buckets;           /**< Number of buckets in table. */

Hi Pablo,

While I've no strong objection to the change, having the name at the start
is a common paradigm in DPDK. Rather than place these fields at the end,
can you get the same effect by just marking the key_len field
__rte_cache_aligned? It may use a little more memory per table, but given that
the size of the hash table is going to be largely governed by the table data,
I don't see an extra 64 bytes in the structure as being an issue.

Regards,
/Bruce


* Re: [PATCH v3 2/4] hash: reorganize bucket structure
  2016-09-06 19:34     ` [PATCH v3 2/4] hash: reorganize bucket structure Pablo de Lara
@ 2016-09-28  9:05       ` Bruce Richardson
  2016-09-29  1:40         ` De Lara Guarch, Pablo
  0 siblings, 1 reply; 37+ messages in thread
From: Bruce Richardson @ 2016-09-28  9:05 UTC (permalink / raw)
  To: Pablo de Lara; +Cc: dev, Byron Marohn, Saikrishna Edupuganti

On Tue, Sep 06, 2016 at 08:34:02PM +0100, Pablo de Lara wrote:
> From: Byron Marohn <byron.marohn@intel.com>
> 
> Move current signatures of all entries together in the bucket
> and same with all alternative signatures, instead of having
> current and alternative signatures together per entry in the bucket.
> This will be beneficial in the next commits, where a vectorized
> comparison will be performed, achieving better performance.
> 
> Signed-off-by: Byron Marohn <byron.marohn@intel.com>
> Signed-off-by: Saikrishna Edupuganti <saikrishna.edupuganti@intel.com>
> ---
>  lib/librte_hash/rte_cuckoo_hash.c     | 43 ++++++++++++++++++-----------------
>  lib/librte_hash/rte_cuckoo_hash.h     | 17 ++++----------
>  lib/librte_hash/rte_cuckoo_hash_x86.h | 20 ++++++++--------
>  3 files changed, 37 insertions(+), 43 deletions(-)
> 
<snip>
> --- a/lib/librte_hash/rte_cuckoo_hash.h
> +++ b/lib/librte_hash/rte_cuckoo_hash.h
> @@ -151,17 +151,6 @@ struct lcore_cache {
>  	void *objs[LCORE_CACHE_SIZE]; /**< Cache objects */
>  } __rte_cache_aligned;
>  
> -/* Structure storing both primary and secondary hashes */
> -struct rte_hash_signatures {
> -	union {
> -		struct {
> -			hash_sig_t current;
> -			hash_sig_t alt;
> -		};
> -		uint64_t sig;
> -	};
> -};
> -
>  /* Structure that stores key-value pair */
>  struct rte_hash_key {
>  	union {
> @@ -174,10 +163,14 @@ struct rte_hash_key {
>  
>  /** Bucket structure */
>  struct rte_hash_bucket {
> -	struct rte_hash_signatures signatures[RTE_HASH_BUCKET_ENTRIES];
> +	hash_sig_t sig_current[RTE_HASH_BUCKET_ENTRIES];
> +
>  	/* Includes dummy key index that always contains index 0 */
>  	uint32_t key_idx[RTE_HASH_BUCKET_ENTRIES + 1];
> +
>  	uint8_t flag[RTE_HASH_BUCKET_ENTRIES];
> +
> +	hash_sig_t sig_alt[RTE_HASH_BUCKET_ENTRIES];
>  } __rte_cache_aligned;
> 

Is there a reason why sig_current and sig_alt fields cannot be beside each
other in the structure. It looks strange having them separate by other fields?

/Bruce


* Re: [PATCH v3 1/4] hash: reorder hash structure
  2016-09-28  9:02       ` Bruce Richardson
@ 2016-09-29  1:33         ` De Lara Guarch, Pablo
  0 siblings, 0 replies; 37+ messages in thread
From: De Lara Guarch, Pablo @ 2016-09-29  1:33 UTC (permalink / raw)
  To: Richardson, Bruce; +Cc: dev



> -----Original Message-----
> From: Richardson, Bruce
> Sent: Wednesday, September 28, 2016 2:03 AM
> To: De Lara Guarch, Pablo
> Cc: dev@dpdk.org
> Subject: Re: [PATCH v3 1/4] hash: reorder hash structure
> 
> On Tue, Sep 06, 2016 at 08:34:01PM +0100, Pablo de Lara wrote:
> > In order to optimize lookup performance, hash structure
> > is reordered, so all fields used for lookup will be
> > in the first cache line.
> >
> > Signed-off-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>
> > ---
> >  lib/librte_hash/rte_cuckoo_hash.h | 12 +++++++-----
> >  1 file changed, 7 insertions(+), 5 deletions(-)
> >
> > diff --git a/lib/librte_hash/rte_cuckoo_hash.h
> b/lib/librte_hash/rte_cuckoo_hash.h
> > index e290dab..701531a 100644
> > --- a/lib/librte_hash/rte_cuckoo_hash.h
> > +++ b/lib/librte_hash/rte_cuckoo_hash.h
> > @@ -182,9 +182,7 @@ struct rte_hash_bucket {
> >
> >  /** A hash table structure. */
> >  struct rte_hash {
> > -	char name[RTE_HASH_NAMESIZE];   /**< Name of the hash. */
> > -	uint32_t entries;               /**< Total table entries. */
> > -	uint32_t num_buckets;           /**< Number of buckets in table. */
> > +	/* first cache line - fields used in lookup */
> >  	uint32_t key_len;               /**< Length of hash key. */
> >  	rte_hash_function hash_func;    /**< Function used to calculate hash.
> */
> >  	uint32_t hash_func_init_val;    /**< Init value used by hash_func. */
> > @@ -196,12 +194,13 @@ struct rte_hash {
> >  						from hash signature. */
> >  	uint32_t key_entry_size;         /**< Size of each key entry. */
> >
> > -	struct rte_ring *free_slots;    /**< Ring that stores all indexes
> > -						of the free slots in the key
> table */
> >  	void *key_store;                /**< Table storing all keys and data */
> >  	struct rte_hash_bucket *buckets;	/**< Table with buckets
> storing all the
> >  							hash values and key
> indexes
> >  							to the key table*/
> > +
> > +	struct rte_ring *free_slots;    /**< Ring that stores all indexes
> > +						of the free slots in the key
> table */
> >  	uint8_t hw_trans_mem_support;	/**< Hardware transactional
> >  							memory support */
> >  	struct lcore_cache *local_free_slots;
> > @@ -209,6 +208,9 @@ struct rte_hash {
> >  	enum add_key_case add_key; /**< Multi-writer hash add behavior */
> >
> >  	rte_spinlock_t *multiwriter_lock; /**< Multi-writer spinlock for w/o
> TM */
> > +	char name[RTE_HASH_NAMESIZE];   /**< Name of the hash. */
> > +	uint32_t entries;               /**< Total table entries. */
> > +	uint32_t num_buckets;           /**< Number of buckets in table. */
> 
> Hi Pablo,
> 
> While I've no strong objection to the change, having the name at the start
> is a common paradigm in DPDK. Rather than place these fields at the end,
> can you get the same effect by just marking the key_len field
> __rte_cache_aligned? It may use a little more memory per table, but given
> that
> the size of the hash table is going to be largely governed by the table data,
> I don't see an extra 64 bytes in the structure as being an issue.

Hi Bruce,

Sure, sounds good to me. I was trying an approach similar to the mbuf structure,
and I actually saw a small performance boost doing this,
so I can push the fields needed for lookup to the second cache line instead;
it should work the same :)

Thanks,
Pablo
> 
> Regards,
> /Bruce


* Re: [PATCH v3 2/4] hash: reorganize bucket structure
  2016-09-28  9:05       ` Bruce Richardson
@ 2016-09-29  1:40         ` De Lara Guarch, Pablo
  0 siblings, 0 replies; 37+ messages in thread
From: De Lara Guarch, Pablo @ 2016-09-29  1:40 UTC (permalink / raw)
  To: Richardson, Bruce; +Cc: dev, Marohn, Byron, Edupuganti, Saikrishna



> -----Original Message-----
> From: Richardson, Bruce
> Sent: Wednesday, September 28, 2016 2:05 AM
> To: De Lara Guarch, Pablo
> Cc: dev@dpdk.org; Marohn, Byron; Edupuganti, Saikrishna
> Subject: Re: [PATCH v3 2/4] hash: reorganize bucket structure
> 
> On Tue, Sep 06, 2016 at 08:34:02PM +0100, Pablo de Lara wrote:
> > From: Byron Marohn <byron.marohn@intel.com>
> >
> > Move current signatures of all entries together in the bucket
> > and same with all alternative signatures, instead of having
> > current and alternative signatures together per entry in the bucket.
> > This will be beneficial in the next commits, where a vectorized
> > comparison will be performed, achieving better performance.
> >
> > Signed-off-by: Byron Marohn <byron.marohn@intel.com>
> > Signed-off-by: Saikrishna Edupuganti <saikrishna.edupuganti@intel.com>
> > ---
> >  lib/librte_hash/rte_cuckoo_hash.c     | 43 ++++++++++++++++++----------------
> -
> >  lib/librte_hash/rte_cuckoo_hash.h     | 17 ++++----------
> >  lib/librte_hash/rte_cuckoo_hash_x86.h | 20 ++++++++--------
> >  3 files changed, 37 insertions(+), 43 deletions(-)
> >
> <snip>
> > --- a/lib/librte_hash/rte_cuckoo_hash.h
> > +++ b/lib/librte_hash/rte_cuckoo_hash.h
> > @@ -151,17 +151,6 @@ struct lcore_cache {
> >  	void *objs[LCORE_CACHE_SIZE]; /**< Cache objects */
> >  } __rte_cache_aligned;
> >
> > -/* Structure storing both primary and secondary hashes */
> > -struct rte_hash_signatures {
> > -	union {
> > -		struct {
> > -			hash_sig_t current;
> > -			hash_sig_t alt;
> > -		};
> > -		uint64_t sig;
> > -	};
> > -};
> > -
> >  /* Structure that stores key-value pair */
> >  struct rte_hash_key {
> >  	union {
> > @@ -174,10 +163,14 @@ struct rte_hash_key {
> >
> >  /** Bucket structure */
> >  struct rte_hash_bucket {
> > -	struct rte_hash_signatures signatures[RTE_HASH_BUCKET_ENTRIES];
> > +	hash_sig_t sig_current[RTE_HASH_BUCKET_ENTRIES];
> > +
> >  	/* Includes dummy key index that always contains index 0 */
> >  	uint32_t key_idx[RTE_HASH_BUCKET_ENTRIES + 1];
> > +
> >  	uint8_t flag[RTE_HASH_BUCKET_ENTRIES];
> > +
> > +	hash_sig_t sig_alt[RTE_HASH_BUCKET_ENTRIES];
> >  } __rte_cache_aligned;
> >
> 
> Is there a reason why sig_current and sig_alt fields cannot be beside each
> other in the structure. It looks strange having them separate by other fields?

Bucket entries have increased to 8 now, so sig_current and key_idx take 64 bytes
(key_idx will be reduced to 8 entries in the fourth patch).

Therefore, the idea was to push sig_alt to the next cache line
(assuming a 64-byte cache line; if it is bigger, then either way is ok),
as it is not used in lookup (unlike sig_current and key_idx).

Anyway, I think I will move sig_alt before flag, so it is cache aligned,
in case vectorization is used in the future.

Thanks,
Pablo

> 
> /Bruce


* [PATCH v4 0/4] Cuckoo hash enhancements
  2016-09-06 19:33   ` [PATCH v3 0/4] Cuckoo hash lookup enhancements Pablo de Lara
@ 2016-09-30  7:38     ` Pablo de Lara
  2016-09-30  7:38       ` [PATCH v4 1/4] hash: reorder hash structure Pablo de Lara
                         ` (6 more replies)
  0 siblings, 7 replies; 37+ messages in thread
From: Pablo de Lara @ 2016-09-30  7:38 UTC (permalink / raw)
  To: dev; +Cc: bruce.richardson, Pablo de Lara

This patchset improves lookup performance in the current hash library
by replacing the existing 4-stage, 2-entry lookup bulk pipeline
with an improved pipeline based on a loop-and-jump model.
Also, x86 vector intrinsics are used to improve performance when comparing signatures.

The first patch reorders the hash structure.
The structure takes more than one 64-byte cache line, but not all
of its fields are used in the lookup operation (the most common operation).
Therefore, the fields used in lookup have been grouped together,
so they all fit in one cache line, slightly improving performance in some
scenarios.

The second patch modifies the layout of the bucket structure.
Currently, the buckets store the current and alternative signatures interleaved per entry.
In order to perform a vectorized signature comparison,
all current signatures have to be contiguous, so the bucket layout has been changed
to separate the current signatures from the alternative signatures.

The third patch introduces x86 vector intrinsics.
When performing a bulk lookup operation, all current signatures in a bucket
are compared against the signature of the key being looked up.
Now that they are all contiguous, a vectorized comparison can be performed,
which takes fewer instructions to carry out.
The number of entries per bucket is increased from 4 to 8,
as AVX2 allows comparing two 256-bit values, each holding 8x32-bit integers,
which are the 8 signatures in the bucket.

The fourth (and last) patch modifies the current pipeline of the lookup bulk function.
The new pipeline is based on a loop-and-jump model. The two key improvements are:

- Better prefetching: the first 4 keys to be looked up are prefetched,
  and after that, the rest of the keys are prefetched while the calculation
  of the signatures is being performed. This gives the CPU more time to
  prefetch the data before it is actually needed, which results in fewer
  cache misses and, therefore, higher throughput.

- Lower performance penalty when using the fallback: the lookup bulk algorithm
  assumes that most of the time there will not be a collision in a bucket, but it might
  happen that two or more signatures are equal, which means that more than one
  key comparison might be necessary. In that case, only the key of the first hit is prefetched,
  as in the current implementation. The difference now is that if this comparison
  results in a miss, the information about the other candidate keys has been stored,
  unlike in the current implementation, which needs to perform an entire simple lookup again.

Changes in v4:
- Reordered the hash structure, so the alt signature is at the start
  of the next cache line, and explained in the commit message
  why it has been moved
- Reordered the hash structure, so the name field is at the top of the structure,
  leaving all the fields used in lookup in the next cache line
  (instead of the first cache line)

Changes in v3:
- Corrected the cover letter (wrong number of patches)

Changes in v2:
- Increased entries per bucket from 4 to 8 for all cases,
  so it is not architecture dependent any longer.
- Replaced compile-time signature comparison function selection
  with run-time selection, so the best optimization available
  will be used from a single binary.
- Reordered the hash structure, so all the fields used by lookup
  are in the same cache line (first).

Byron Marohn (3):
  hash: reorganize bucket structure
  hash: add vectorized comparison
  hash: modify lookup bulk pipeline

Pablo de Lara (1):
  hash: reorder hash structure

 lib/librte_hash/rte_cuckoo_hash.c     | 455 ++++++++++++++--------------------
 lib/librte_hash/rte_cuckoo_hash.h     |  56 +++--
 lib/librte_hash/rte_cuckoo_hash_x86.h |  20 +-
 3 files changed, 228 insertions(+), 303 deletions(-)

-- 
2.7.4


* [PATCH v4 1/4] hash: reorder hash structure
  2016-09-30  7:38     ` [PATCH v4 0/4] Cuckoo hash enhancements Pablo de Lara
@ 2016-09-30  7:38       ` Pablo de Lara
  2016-09-30  7:38       ` [PATCH v4 2/4] hash: reorganize bucket structure Pablo de Lara
                         ` (5 subsequent siblings)
  6 siblings, 0 replies; 37+ messages in thread
From: Pablo de Lara @ 2016-09-30  7:38 UTC (permalink / raw)
  To: dev; +Cc: bruce.richardson, Pablo de Lara

In order to optimize lookup performance, the hash structure
is reordered, so all fields used for lookup are
grouped in the same cache line.

Signed-off-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>
---
 lib/librte_hash/rte_cuckoo_hash.h | 24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/lib/librte_hash/rte_cuckoo_hash.h b/lib/librte_hash/rte_cuckoo_hash.h
index e290dab..5a32ea6 100644
--- a/lib/librte_hash/rte_cuckoo_hash.h
+++ b/lib/librte_hash/rte_cuckoo_hash.h
@@ -185,7 +185,20 @@ struct rte_hash {
 	char name[RTE_HASH_NAMESIZE];   /**< Name of the hash. */
 	uint32_t entries;               /**< Total table entries. */
 	uint32_t num_buckets;           /**< Number of buckets in table. */
-	uint32_t key_len;               /**< Length of hash key. */
+
+	struct rte_ring *free_slots;    /**< Ring that stores all indexes
+						of the free slots in the key table */
+	uint8_t hw_trans_mem_support;	/**< Hardware transactional
+							memory support */
+	struct lcore_cache *local_free_slots;
+	/**< Local cache per lcore, storing some indexes of the free slots */
+	enum add_key_case add_key; /**< Multi-writer hash add behavior */
+
+	rte_spinlock_t *multiwriter_lock; /**< Multi-writer spinlock for w/o TM */
+
+	/* Fields used in lookup */
+	uint32_t key_len __rte_cache_aligned;
+	/**< Length of hash key. */
 	rte_hash_function hash_func;    /**< Function used to calculate hash. */
 	uint32_t hash_func_init_val;    /**< Init value used by hash_func. */
 	rte_hash_cmp_eq_t rte_hash_custom_cmp_eq;
@@ -196,19 +209,10 @@ struct rte_hash {
 						from hash signature. */
 	uint32_t key_entry_size;         /**< Size of each key entry. */
 
-	struct rte_ring *free_slots;    /**< Ring that stores all indexes
-						of the free slots in the key table */
 	void *key_store;                /**< Table storing all keys and data */
 	struct rte_hash_bucket *buckets;	/**< Table with buckets storing all the
 							hash values and key indexes
 							to the key table*/
-	uint8_t hw_trans_mem_support;	/**< Hardware transactional
-							memory support */
-	struct lcore_cache *local_free_slots;
-	/**< Local cache per lcore, storing some indexes of the free slots */
-	enum add_key_case add_key; /**< Multi-writer hash add behavior */
-
-	rte_spinlock_t *multiwriter_lock; /**< Multi-writer spinlock for w/o TM */
 } __rte_cache_aligned;
 
 struct queue_node {
-- 
2.7.4


* [PATCH v4 2/4] hash: reorganize bucket structure
  2016-09-30  7:38     ` [PATCH v4 0/4] Cuckoo hash enhancements Pablo de Lara
  2016-09-30  7:38       ` [PATCH v4 1/4] hash: reorder hash structure Pablo de Lara
@ 2016-09-30  7:38       ` Pablo de Lara
  2016-09-30  7:38       ` [PATCH v4 3/4] hash: add vectorized comparison Pablo de Lara
                         ` (4 subsequent siblings)
  6 siblings, 0 replies; 37+ messages in thread
From: Pablo de Lara @ 2016-09-30  7:38 UTC (permalink / raw)
  To: dev; +Cc: bruce.richardson, Byron Marohn, Saikrishna Edupuganti

From: Byron Marohn <byron.marohn@intel.com>

Move the current signatures of all entries together in the bucket,
and likewise all the alternative signatures, instead of storing
current and alternative signatures together per entry in the bucket.
This will be beneficial in the next commits, where a vectorized
comparison will be performed, achieving better performance.

The alternative signatures have been moved away from
the current signatures to make the key indices consecutive
with the current signatures, as these two fields are used by lookup,
so they stay in the same cache line.

Signed-off-by: Byron Marohn <byron.marohn@intel.com>
Signed-off-by: Saikrishna Edupuganti <saikrishna.edupuganti@intel.com>
---
 lib/librte_hash/rte_cuckoo_hash.c     | 43 ++++++++++++++++++-----------------
 lib/librte_hash/rte_cuckoo_hash.h     | 17 ++++----------
 lib/librte_hash/rte_cuckoo_hash_x86.h | 20 ++++++++--------
 3 files changed, 37 insertions(+), 43 deletions(-)

diff --git a/lib/librte_hash/rte_cuckoo_hash.c b/lib/librte_hash/rte_cuckoo_hash.c
index 4de4422..a7ee2b9 100644
--- a/lib/librte_hash/rte_cuckoo_hash.c
+++ b/lib/librte_hash/rte_cuckoo_hash.c
@@ -421,7 +421,7 @@ make_space_bucket(const struct rte_hash *h, struct rte_hash_bucket *bkt)
 	 */
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
 		/* Search for space in alternative locations */
-		next_bucket_idx = bkt->signatures[i].alt & h->bucket_bitmask;
+		next_bucket_idx = bkt->sig_alt[i] & h->bucket_bitmask;
 		next_bkt[i] = &h->buckets[next_bucket_idx];
 		for (j = 0; j < RTE_HASH_BUCKET_ENTRIES; j++) {
 			if (next_bkt[i]->key_idx[j] == EMPTY_SLOT)
@@ -434,8 +434,8 @@ make_space_bucket(const struct rte_hash *h, struct rte_hash_bucket *bkt)
 
 	/* Alternative location has spare room (end of recursive function) */
 	if (i != RTE_HASH_BUCKET_ENTRIES) {
-		next_bkt[i]->signatures[j].alt = bkt->signatures[i].current;
-		next_bkt[i]->signatures[j].current = bkt->signatures[i].alt;
+		next_bkt[i]->sig_alt[j] = bkt->sig_current[i];
+		next_bkt[i]->sig_current[j] = bkt->sig_alt[i];
 		next_bkt[i]->key_idx[j] = bkt->key_idx[i];
 		return i;
 	}
@@ -461,8 +461,8 @@ make_space_bucket(const struct rte_hash *h, struct rte_hash_bucket *bkt)
 	 */
 	bkt->flag[i] = 0;
 	if (ret >= 0) {
-		next_bkt[i]->signatures[ret].alt = bkt->signatures[i].current;
-		next_bkt[i]->signatures[ret].current = bkt->signatures[i].alt;
+		next_bkt[i]->sig_alt[ret] = bkt->sig_current[i];
+		next_bkt[i]->sig_current[ret] = bkt->sig_alt[i];
 		next_bkt[i]->key_idx[ret] = bkt->key_idx[i];
 		return i;
 	} else
@@ -544,8 +544,8 @@ __rte_hash_add_key_with_hash(const struct rte_hash *h, const void *key,
 
 	/* Check if key is already inserted in primary location */
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		if (prim_bkt->signatures[i].current == sig &&
-				prim_bkt->signatures[i].alt == alt_hash) {
+		if (prim_bkt->sig_current[i] == sig &&
+				prim_bkt->sig_alt[i] == alt_hash) {
 			k = (struct rte_hash_key *) ((char *)keys +
 					prim_bkt->key_idx[i] * h->key_entry_size);
 			if (rte_hash_cmp_eq(key, k->key, h) == 0) {
@@ -564,8 +564,8 @@ __rte_hash_add_key_with_hash(const struct rte_hash *h, const void *key,
 
 	/* Check if key is already inserted in secondary location */
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		if (sec_bkt->signatures[i].alt == sig &&
-				sec_bkt->signatures[i].current == alt_hash) {
+		if (sec_bkt->sig_alt[i] == sig &&
+				sec_bkt->sig_current[i] == alt_hash) {
 			k = (struct rte_hash_key *) ((char *)keys +
 					sec_bkt->key_idx[i] * h->key_entry_size);
 			if (rte_hash_cmp_eq(key, k->key, h) == 0) {
@@ -611,8 +611,8 @@ __rte_hash_add_key_with_hash(const struct rte_hash *h, const void *key,
 		for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
 			/* Check if slot is available */
 			if (likely(prim_bkt->key_idx[i] == EMPTY_SLOT)) {
-				prim_bkt->signatures[i].current = sig;
-				prim_bkt->signatures[i].alt = alt_hash;
+				prim_bkt->sig_current[i] = sig;
+				prim_bkt->sig_alt[i] = alt_hash;
 				prim_bkt->key_idx[i] = new_idx;
 				break;
 			}
@@ -632,8 +632,8 @@ __rte_hash_add_key_with_hash(const struct rte_hash *h, const void *key,
 		 */
 		ret = make_space_bucket(h, prim_bkt);
 		if (ret >= 0) {
-			prim_bkt->signatures[ret].current = sig;
-			prim_bkt->signatures[ret].alt = alt_hash;
+			prim_bkt->sig_current[ret] = sig;
+			prim_bkt->sig_alt[ret] = alt_hash;
 			prim_bkt->key_idx[ret] = new_idx;
 			if (h->add_key == ADD_KEY_MULTIWRITER)
 				rte_spinlock_unlock(h->multiwriter_lock);
@@ -707,7 +707,7 @@ __rte_hash_lookup_with_hash(const struct rte_hash *h, const void *key,
 
 	/* Check if key is in primary location */
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		if (bkt->signatures[i].current == sig &&
+		if (bkt->sig_current[i] == sig &&
 				bkt->key_idx[i] != EMPTY_SLOT) {
 			k = (struct rte_hash_key *) ((char *)keys +
 					bkt->key_idx[i] * h->key_entry_size);
@@ -730,8 +730,8 @@ __rte_hash_lookup_with_hash(const struct rte_hash *h, const void *key,
 
 	/* Check if key is in secondary location */
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		if (bkt->signatures[i].current == alt_hash &&
-				bkt->signatures[i].alt == sig) {
+		if (bkt->sig_current[i] == alt_hash &&
+				bkt->sig_alt[i] == sig) {
 			k = (struct rte_hash_key *) ((char *)keys +
 					bkt->key_idx[i] * h->key_entry_size);
 			if (rte_hash_cmp_eq(key, k->key, h) == 0) {
@@ -785,7 +785,8 @@ remove_entry(const struct rte_hash *h, struct rte_hash_bucket *bkt, unsigned i)
 	unsigned lcore_id, n_slots;
 	struct lcore_cache *cached_free_slots;
 
-	bkt->signatures[i].sig = NULL_SIGNATURE;
+	bkt->sig_current[i] = NULL_SIGNATURE;
+	bkt->sig_alt[i] = NULL_SIGNATURE;
 	if (h->hw_trans_mem_support) {
 		lcore_id = rte_lcore_id();
 		cached_free_slots = &h->local_free_slots[lcore_id];
@@ -823,7 +824,7 @@ __rte_hash_del_key_with_hash(const struct rte_hash *h, const void *key,
 
 	/* Check if key is in primary location */
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		if (bkt->signatures[i].current == sig &&
+		if (bkt->sig_current[i] == sig &&
 				bkt->key_idx[i] != EMPTY_SLOT) {
 			k = (struct rte_hash_key *) ((char *)keys +
 					bkt->key_idx[i] * h->key_entry_size);
@@ -848,7 +849,7 @@ __rte_hash_del_key_with_hash(const struct rte_hash *h, const void *key,
 
 	/* Check if key is in secondary location */
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		if (bkt->signatures[i].current == alt_hash &&
+		if (bkt->sig_current[i] == alt_hash &&
 				bkt->key_idx[i] != EMPTY_SLOT) {
 			k = (struct rte_hash_key *) ((char *)keys +
 					bkt->key_idx[i] * h->key_entry_size);
@@ -957,8 +958,8 @@ lookup_stage2(unsigned idx, hash_sig_t prim_hash, hash_sig_t sec_hash,
 	prim_hash_matches = 1 << RTE_HASH_BUCKET_ENTRIES;
 	sec_hash_matches = 1 << RTE_HASH_BUCKET_ENTRIES;
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		prim_hash_matches |= ((prim_hash == prim_bkt->signatures[i].current) << i);
-		sec_hash_matches |= ((sec_hash == sec_bkt->signatures[i].current) << i);
+		prim_hash_matches |= ((prim_hash == prim_bkt->sig_current[i]) << i);
+		sec_hash_matches |= ((sec_hash == sec_bkt->sig_current[i]) << i);
 	}
 
 	key_idx = prim_bkt->key_idx[__builtin_ctzl(prim_hash_matches)];
diff --git a/lib/librte_hash/rte_cuckoo_hash.h b/lib/librte_hash/rte_cuckoo_hash.h
index 5a32ea6..24f8437 100644
--- a/lib/librte_hash/rte_cuckoo_hash.h
+++ b/lib/librte_hash/rte_cuckoo_hash.h
@@ -151,17 +151,6 @@ struct lcore_cache {
 	void *objs[LCORE_CACHE_SIZE]; /**< Cache objects */
 } __rte_cache_aligned;
 
-/* Structure storing both primary and secondary hashes */
-struct rte_hash_signatures {
-	union {
-		struct {
-			hash_sig_t current;
-			hash_sig_t alt;
-		};
-		uint64_t sig;
-	};
-};
-
 /* Structure that stores key-value pair */
 struct rte_hash_key {
 	union {
@@ -174,9 +163,13 @@ struct rte_hash_key {
 
 /** Bucket structure */
 struct rte_hash_bucket {
-	struct rte_hash_signatures signatures[RTE_HASH_BUCKET_ENTRIES];
+	hash_sig_t sig_current[RTE_HASH_BUCKET_ENTRIES];
+
 	/* Includes dummy key index that always contains index 0 */
 	uint32_t key_idx[RTE_HASH_BUCKET_ENTRIES + 1];
+
+	hash_sig_t sig_alt[RTE_HASH_BUCKET_ENTRIES];
+
 	uint8_t flag[RTE_HASH_BUCKET_ENTRIES];
 } __rte_cache_aligned;
 
diff --git a/lib/librte_hash/rte_cuckoo_hash_x86.h b/lib/librte_hash/rte_cuckoo_hash_x86.h
index e16d69c..494c160 100644
--- a/lib/librte_hash/rte_cuckoo_hash_x86.h
+++ b/lib/librte_hash/rte_cuckoo_hash_x86.h
@@ -54,8 +54,8 @@ rte_hash_cuckoo_insert_mw_tm(struct rte_hash_bucket *prim_bkt,
 			for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
 				/* Check if slot is available */
  				if (likely(prim_bkt->key_idx[i] == EMPTY_SLOT)) {
-					prim_bkt->signatures[i].current = sig;
-					prim_bkt->signatures[i].alt = alt_hash;
+					prim_bkt->sig_current[i] = sig;
+					prim_bkt->sig_alt[i] = alt_hash;
 					prim_bkt->key_idx[i] = new_idx;
 					break;
 				}
@@ -101,7 +101,7 @@ rte_hash_cuckoo_move_insert_mw_tm(const struct rte_hash *h,
 				prev_slot = curr_node->prev_slot;
 
 				prev_alt_bkt_idx
-					= prev_bkt->signatures[prev_slot].alt
+					= prev_bkt->sig_alt[prev_slot]
 					    & h->bucket_bitmask;
 
 				if (unlikely(&h->buckets[prev_alt_bkt_idx]
@@ -113,10 +113,10 @@ rte_hash_cuckoo_move_insert_mw_tm(const struct rte_hash *h,
 				 * Cuckoo insert to move elements back to its
 				 * primary bucket if available
 				 */
-				curr_bkt->signatures[curr_slot].alt =
-				    prev_bkt->signatures[prev_slot].current;
-				curr_bkt->signatures[curr_slot].current =
-				    prev_bkt->signatures[prev_slot].alt;
+				curr_bkt->sig_alt[curr_slot] =
+				    prev_bkt->sig_current[prev_slot];
+				curr_bkt->sig_current[curr_slot] =
+				    prev_bkt->sig_alt[prev_slot];
 				curr_bkt->key_idx[curr_slot]
 				    = prev_bkt->key_idx[prev_slot];
 
@@ -125,8 +125,8 @@ rte_hash_cuckoo_move_insert_mw_tm(const struct rte_hash *h,
 				curr_bkt = curr_node->bkt;
 			}
 
-			curr_bkt->signatures[curr_slot].current = sig;
-			curr_bkt->signatures[curr_slot].alt = alt_hash;
+			curr_bkt->sig_current[curr_slot] = sig;
+			curr_bkt->sig_alt[curr_slot] = alt_hash;
 			curr_bkt->key_idx[curr_slot] = new_idx;
 
 			rte_xend();
@@ -178,7 +178,7 @@ rte_hash_cuckoo_make_space_mw_tm(const struct rte_hash *h,
 			}
 
 			/* Enqueue new node and keep prev node info */
-			alt_bkt = &(h->buckets[curr_bkt->signatures[i].alt
+			alt_bkt = &(h->buckets[curr_bkt->sig_alt[i]
 						    & h->bucket_bitmask]);
 			head->bkt = alt_bkt;
 			head->prev = tail;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v4 3/4] hash: add vectorized comparison
  2016-09-30  7:38     ` [PATCH v4 0/4] Cuckoo hash enhancements Pablo de Lara
  2016-09-30  7:38       ` [PATCH v4 1/4] hash: reorder hash structure Pablo de Lara
  2016-09-30  7:38       ` [PATCH v4 2/4] hash: reorganize bucket structure Pablo de Lara
@ 2016-09-30  7:38       ` Pablo de Lara
  2016-09-30  7:38       ` [PATCH v4 4/4] hash: modify lookup bulk pipeline Pablo de Lara
                         ` (3 subsequent siblings)
  6 siblings, 0 replies; 37+ messages in thread
From: Pablo de Lara @ 2016-09-30  7:38 UTC (permalink / raw)
  To: dev; +Cc: bruce.richardson, Byron Marohn, Saikrishna Edupuganti, Pablo de Lara

From: Byron Marohn <byron.marohn@intel.com>

In the lookup bulk function, the signatures of all entries
are compared against the signature of the key that is being looked up.
Now that all the signatures are stored together, they can be compared
with vector instructions (SSE, AVX2), achieving higher lookup performance.

Also, the number of entries per bucket is increased to 8 on processors
with AVX2, as 256 bits can be compared at once, which is the size of
8x32-bit signatures.
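For illustration, the per-bucket signature comparison can be sketched as a
standalone helper (a hypothetical sketch, not the exact code in this patch;
it uses SSE2 only, unaligned loads, and a made-up function name):

```c
#include <emmintrin.h>	/* SSE2 intrinsics */
#include <stdint.h>

/*
 * Hypothetical sketch: compare the 8 32-bit signatures of a bucket
 * against one target signature and return an 8-bit match mask
 * (bit i is set if sig[i] == target).
 */
static inline uint32_t
sig_match_mask_sse(const uint32_t sig[8], uint32_t target)
{
	__m128i t = _mm_set1_epi32((int32_t)target);
	uint32_t m;

	/* First 4 signatures: 4 lane-wise 32-bit equality tests -> bits 0..3 */
	m = (uint32_t)_mm_movemask_ps(_mm_castsi128_ps(_mm_cmpeq_epi32(
			_mm_loadu_si128((const __m128i *)&sig[0]), t)));
	/* Next 4 signatures, shifted into bits 4..7 */
	m |= (uint32_t)_mm_movemask_ps(_mm_castsi128_ps(_mm_cmpeq_epi32(
			_mm_loadu_si128((const __m128i *)&sig[4]), t))) << 4;
	return m;
}
```

With AVX2, the same 8-bit mask is produced in one step by `_mm256_cmpeq_epi32`
plus `_mm256_movemask_ps`, which is why 8 entries per bucket fit naturally.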

Signed-off-by: Byron Marohn <byron.marohn@intel.com>
Signed-off-by: Saikrishna Edupuganti <saikrishna.edupuganti@intel.com>
Signed-off-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>
---
 lib/librte_hash/rte_cuckoo_hash.c | 73 ++++++++++++++++++++++++++++++++++++---
 lib/librte_hash/rte_cuckoo_hash.h | 12 ++++++-
 2 files changed, 79 insertions(+), 6 deletions(-)

diff --git a/lib/librte_hash/rte_cuckoo_hash.c b/lib/librte_hash/rte_cuckoo_hash.c
index a7ee2b9..397aa8e 100644
--- a/lib/librte_hash/rte_cuckoo_hash.c
+++ b/lib/librte_hash/rte_cuckoo_hash.c
@@ -284,6 +284,15 @@ rte_hash_create(const struct rte_hash_parameters *params)
 	h->free_slots = r;
 	h->hw_trans_mem_support = hw_trans_mem_support;
 
+#if defined(RTE_ARCH_X86)
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
+		h->sig_cmp_fn = RTE_HASH_COMPARE_AVX2;
+	else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE2))
+		h->sig_cmp_fn = RTE_HASH_COMPARE_SSE;
+	else
+#endif
+		h->sig_cmp_fn = RTE_HASH_COMPARE_SCALAR;
+
 	/* Turn on multi-writer only with explicit flag from user and TM
 	 * support.
 	 */
@@ -940,6 +949,61 @@ lookup_stage1(unsigned idx, hash_sig_t *prim_hash, hash_sig_t *sec_hash,
 	rte_prefetch0(*secondary_bkt);
 }
 
+static inline void
+compare_signatures(unsigned *prim_hash_matches, unsigned *sec_hash_matches,
+				const struct rte_hash_bucket *prim_bkt,
+				const struct rte_hash_bucket *sec_bkt,
+				hash_sig_t prim_hash, hash_sig_t sec_hash,
+				enum rte_hash_sig_compare_function sig_cmp_fn)
+{
+	unsigned i;
+
+	switch (sig_cmp_fn) {
+#ifdef RTE_MACHINE_CPUFLAG_AVX2
+	case RTE_HASH_COMPARE_AVX2:
+		*prim_hash_matches |= _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
+				_mm256_load_si256(
+					(__m256i const *)prim_bkt->sig_current),
+				_mm256_set1_epi32(prim_hash)));
+		*sec_hash_matches |= _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
+				_mm256_load_si256(
+					(__m256i const *)sec_bkt->sig_current),
+				_mm256_set1_epi32(sec_hash)));
+		break;
+#endif
+#ifdef RTE_MACHINE_CPUFLAG_SSE2
+	case RTE_HASH_COMPARE_SSE:
+		/* Compare the first 4 signatures in the bucket */
+		*prim_hash_matches |= _mm_movemask_ps((__m128)_mm_cmpeq_epi32(
+				_mm_load_si128(
+					(__m128i const *)prim_bkt->sig_current),
+				_mm_set1_epi32(prim_hash)));
+		*prim_hash_matches |= (_mm_movemask_ps((__m128)_mm_cmpeq_epi32(
+				_mm_load_si128(
+					(__m128i const *)&prim_bkt->sig_current[4]),
+				_mm_set1_epi32(prim_hash)))) << 4;
+		/* Compare the first 4 signatures in the bucket */
+		*sec_hash_matches |= _mm_movemask_ps((__m128)_mm_cmpeq_epi32(
+				_mm_load_si128(
+					(__m128i const *)sec_bkt->sig_current),
+				_mm_set1_epi32(sec_hash)));
+		*sec_hash_matches |= (_mm_movemask_ps((__m128)_mm_cmpeq_epi32(
+				_mm_load_si128(
+					(__m128i const *)&sec_bkt->sig_current[4]),
+				_mm_set1_epi32(sec_hash)))) << 4;
+		break;
+#endif
+	default:
+		for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
+			*prim_hash_matches |=
+				((prim_hash == prim_bkt->sig_current[i]) << i);
+			*sec_hash_matches |=
+				((sec_hash == sec_bkt->sig_current[i]) << i);
+		}
+	}
+
+}
+
 /*
  * Lookup bulk stage 2:  Search for match hashes in primary/secondary locations
  * and prefetch first key slot
@@ -952,15 +1016,14 @@ lookup_stage2(unsigned idx, hash_sig_t prim_hash, hash_sig_t sec_hash,
 		uint64_t *extra_hits_mask, const void *keys,
 		const struct rte_hash *h)
 {
-	unsigned prim_hash_matches, sec_hash_matches, key_idx, i;
+	unsigned prim_hash_matches, sec_hash_matches, key_idx;
 	unsigned total_hash_matches;
 
 	prim_hash_matches = 1 << RTE_HASH_BUCKET_ENTRIES;
 	sec_hash_matches = 1 << RTE_HASH_BUCKET_ENTRIES;
-	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		prim_hash_matches |= ((prim_hash == prim_bkt->sig_current[i]) << i);
-		sec_hash_matches |= ((sec_hash == sec_bkt->sig_current[i]) << i);
-	}
+
+	compare_signatures(&prim_hash_matches, &sec_hash_matches, prim_bkt,
+				sec_bkt, prim_hash, sec_hash, h->sig_cmp_fn);
 
 	key_idx = prim_bkt->key_idx[__builtin_ctzl(prim_hash_matches)];
 	if (key_idx == 0)
diff --git a/lib/librte_hash/rte_cuckoo_hash.h b/lib/librte_hash/rte_cuckoo_hash.h
index 24f8437..9ff79c0 100644
--- a/lib/librte_hash/rte_cuckoo_hash.h
+++ b/lib/librte_hash/rte_cuckoo_hash.h
@@ -130,7 +130,7 @@ enum add_key_case {
 };
 
 /** Number of items per bucket. */
-#define RTE_HASH_BUCKET_ENTRIES		4
+#define RTE_HASH_BUCKET_ENTRIES		8
 
 #define NULL_SIGNATURE			0
 
@@ -161,6 +161,14 @@ struct rte_hash_key {
 	char key[0];
 } __attribute__((aligned(KEY_ALIGNMENT)));
 
+/* All different signature compare functions */
+enum rte_hash_sig_compare_function {
+	RTE_HASH_COMPARE_SCALAR = 0,
+	RTE_HASH_COMPARE_SSE,
+	RTE_HASH_COMPARE_AVX2,
+	RTE_HASH_COMPARE_NUM
+};
+
 /** Bucket structure */
 struct rte_hash_bucket {
 	hash_sig_t sig_current[RTE_HASH_BUCKET_ENTRIES];
@@ -198,6 +206,8 @@ struct rte_hash {
 	/**< Custom function used to compare keys. */
 	enum cmp_jump_table_case cmp_jump_table_idx;
 	/**< Indicates which compare function to use. */
+	enum rte_hash_sig_compare_function sig_cmp_fn;
+	/**< Indicates which signature compare function to use. */
 	uint32_t bucket_bitmask;        /**< Bitmask for getting bucket index
 						from hash signature. */
 	uint32_t key_entry_size;         /**< Size of each key entry. */
-- 
2.7.4


* [PATCH v4 4/4] hash: modify lookup bulk pipeline
  2016-09-30  7:38     ` [PATCH v4 0/4] Cuckoo hash enhancements Pablo de Lara
                         ` (2 preceding siblings ...)
  2016-09-30  7:38       ` [PATCH v4 3/4] hash: add vectorized comparison Pablo de Lara
@ 2016-09-30  7:38       ` Pablo de Lara
  2016-09-30 19:53       ` [PATCH v4 0/4] Cuckoo hash enhancements Gobriel, Sameh
                         ` (2 subsequent siblings)
  6 siblings, 0 replies; 37+ messages in thread
From: Pablo de Lara @ 2016-09-30  7:38 UTC (permalink / raw)
  To: dev; +Cc: bruce.richardson, Byron Marohn, Saikrishna Edupuganti, Pablo de Lara

From: Byron Marohn <byron.marohn@intel.com>

This patch replaces the pipelined rte_hash lookup mechanism with a
loop-and-jump model, which performs significantly better,
especially for smaller table sizes and smaller table occupancies.

Signed-off-by: Byron Marohn <byron.marohn@intel.com>
Signed-off-by: Saikrishna Edupuganti <saikrishna.edupuganti@intel.com>
Signed-off-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>
---
 lib/librte_hash/rte_cuckoo_hash.c | 377 ++++++++++++--------------------------
 lib/librte_hash/rte_cuckoo_hash.h |   3 +-
 2 files changed, 117 insertions(+), 263 deletions(-)

diff --git a/lib/librte_hash/rte_cuckoo_hash.c b/lib/librte_hash/rte_cuckoo_hash.c
index 397aa8e..1443ee1 100644
--- a/lib/librte_hash/rte_cuckoo_hash.c
+++ b/lib/librte_hash/rte_cuckoo_hash.c
@@ -914,43 +914,8 @@ rte_hash_get_key_with_position(const struct rte_hash *h, const int32_t position,
 	return 0;
 }
 
-/* Lookup bulk stage 0: Prefetch input key */
 static inline void
-lookup_stage0(unsigned *idx, uint64_t *lookup_mask,
-		const void * const *keys)
-{
-	*idx = __builtin_ctzl(*lookup_mask);
-	if (*lookup_mask == 0)
-		*idx = 0;
-
-	rte_prefetch0(keys[*idx]);
-	*lookup_mask &= ~(1llu << *idx);
-}
-
-/*
- * Lookup bulk stage 1: Calculate primary/secondary hashes
- * and prefetch primary/secondary buckets
- */
-static inline void
-lookup_stage1(unsigned idx, hash_sig_t *prim_hash, hash_sig_t *sec_hash,
-		const struct rte_hash_bucket **primary_bkt,
-		const struct rte_hash_bucket **secondary_bkt,
-		hash_sig_t *hash_vals, const void * const *keys,
-		const struct rte_hash *h)
-{
-	*prim_hash = rte_hash_hash(h, keys[idx]);
-	hash_vals[idx] = *prim_hash;
-	*sec_hash = rte_hash_secondary_hash(*prim_hash);
-
-	*primary_bkt = &h->buckets[*prim_hash & h->bucket_bitmask];
-	*secondary_bkt = &h->buckets[*sec_hash & h->bucket_bitmask];
-
-	rte_prefetch0(*primary_bkt);
-	rte_prefetch0(*secondary_bkt);
-}
-
-static inline void
-compare_signatures(unsigned *prim_hash_matches, unsigned *sec_hash_matches,
+compare_signatures(uint32_t *prim_hash_matches, uint32_t *sec_hash_matches,
 				const struct rte_hash_bucket *prim_bkt,
 				const struct rte_hash_bucket *sec_bkt,
 				hash_sig_t prim_hash, hash_sig_t sec_hash,
@@ -961,11 +926,11 @@ compare_signatures(unsigned *prim_hash_matches, unsigned *sec_hash_matches,
 	switch (sig_cmp_fn) {
 #ifdef RTE_MACHINE_CPUFLAG_AVX2
 	case RTE_HASH_COMPARE_AVX2:
-		*prim_hash_matches |= _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
+		*prim_hash_matches = _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
 				_mm256_load_si256(
 					(__m256i const *)prim_bkt->sig_current),
 				_mm256_set1_epi32(prim_hash)));
-		*sec_hash_matches |= _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
+		*sec_hash_matches = _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
 				_mm256_load_si256(
 					(__m256i const *)sec_bkt->sig_current),
 				_mm256_set1_epi32(sec_hash)));
@@ -974,7 +939,7 @@ compare_signatures(unsigned *prim_hash_matches, unsigned *sec_hash_matches,
 #ifdef RTE_MACHINE_CPUFLAG_SSE2
 	case RTE_HASH_COMPARE_SSE:
 		/* Compare the first 4 signatures in the bucket */
-		*prim_hash_matches |= _mm_movemask_ps((__m128)_mm_cmpeq_epi32(
+		*prim_hash_matches = _mm_movemask_ps((__m128)_mm_cmpeq_epi32(
 				_mm_load_si128(
 					(__m128i const *)prim_bkt->sig_current),
 				_mm_set1_epi32(prim_hash)));
@@ -983,7 +948,7 @@ compare_signatures(unsigned *prim_hash_matches, unsigned *sec_hash_matches,
 					(__m128i const *)&prim_bkt->sig_current[4]),
 				_mm_set1_epi32(prim_hash)))) << 4;
 		/* Compare the first 4 signatures in the bucket */
-		*sec_hash_matches |= _mm_movemask_ps((__m128)_mm_cmpeq_epi32(
+		*sec_hash_matches = _mm_movemask_ps((__m128)_mm_cmpeq_epi32(
 				_mm_load_si128(
 					(__m128i const *)sec_bkt->sig_current),
 				_mm_set1_epi32(sec_hash)));
@@ -1004,244 +969,134 @@ compare_signatures(unsigned *prim_hash_matches, unsigned *sec_hash_matches,
 
 }
 
-/*
- * Lookup bulk stage 2:  Search for match hashes in primary/secondary locations
- * and prefetch first key slot
- */
+#define PREFETCH_OFFSET 4
 static inline void
-lookup_stage2(unsigned idx, hash_sig_t prim_hash, hash_sig_t sec_hash,
-		const struct rte_hash_bucket *prim_bkt,
-		const struct rte_hash_bucket *sec_bkt,
-		const struct rte_hash_key **key_slot, int32_t *positions,
-		uint64_t *extra_hits_mask, const void *keys,
-		const struct rte_hash *h)
+__rte_hash_lookup_bulk(const struct rte_hash *h, const void **keys,
+			int32_t num_keys, int32_t *positions,
+			uint64_t *hit_mask, void *data[])
 {
-	unsigned prim_hash_matches, sec_hash_matches, key_idx;
-	unsigned total_hash_matches;
+	uint64_t hits = 0;
+	int32_t i;
+	uint32_t prim_hash[RTE_HASH_LOOKUP_BULK_MAX];
+	uint32_t sec_hash[RTE_HASH_LOOKUP_BULK_MAX];
+	const struct rte_hash_bucket *primary_bkt[RTE_HASH_LOOKUP_BULK_MAX];
+	const struct rte_hash_bucket *secondary_bkt[RTE_HASH_LOOKUP_BULK_MAX];
+	uint32_t prim_hitmask[RTE_HASH_LOOKUP_BULK_MAX] = {0};
+	uint32_t sec_hitmask[RTE_HASH_LOOKUP_BULK_MAX] = {0};
+
+	/* Prefetch first keys */
+	for (i = 0; i < PREFETCH_OFFSET && i < num_keys; i++)
+		rte_prefetch0(keys[i]);
 
-	prim_hash_matches = 1 << RTE_HASH_BUCKET_ENTRIES;
-	sec_hash_matches = 1 << RTE_HASH_BUCKET_ENTRIES;
+	/*
+	 * Prefetch rest of the keys, calculate primary and
+	 * secondary bucket and prefetch them
+	 */
+	for (i = 0; i < (num_keys - PREFETCH_OFFSET); i++) {
+		rte_prefetch0(keys[i + PREFETCH_OFFSET]);
 
-	compare_signatures(&prim_hash_matches, &sec_hash_matches, prim_bkt,
-				sec_bkt, prim_hash, sec_hash, h->sig_cmp_fn);
+		prim_hash[i] = rte_hash_hash(h, keys[i]);
+		sec_hash[i] = rte_hash_secondary_hash(prim_hash[i]);
 
-	key_idx = prim_bkt->key_idx[__builtin_ctzl(prim_hash_matches)];
-	if (key_idx == 0)
-		key_idx = sec_bkt->key_idx[__builtin_ctzl(sec_hash_matches)];
+		primary_bkt[i] = &h->buckets[prim_hash[i] & h->bucket_bitmask];
+		secondary_bkt[i] = &h->buckets[sec_hash[i] & h->bucket_bitmask];
 
-	total_hash_matches = (prim_hash_matches |
-				(sec_hash_matches << (RTE_HASH_BUCKET_ENTRIES + 1)));
-	*key_slot = (const struct rte_hash_key *) ((const char *)keys +
-					key_idx * h->key_entry_size);
+		rte_prefetch0(primary_bkt[i]);
+		rte_prefetch0(secondary_bkt[i]);
+	}
 
-	rte_prefetch0(*key_slot);
-	/*
-	 * Return index where key is stored,
-	 * substracting the first dummy index
-	 */
-	positions[idx] = (key_idx - 1);
+	/* Calculate and prefetch rest of the buckets */
+	for (; i < num_keys; i++) {
+		prim_hash[i] = rte_hash_hash(h, keys[i]);
+		sec_hash[i] = rte_hash_secondary_hash(prim_hash[i]);
 
-	*extra_hits_mask |= (uint64_t)(__builtin_popcount(total_hash_matches) > 3) << idx;
+		primary_bkt[i] = &h->buckets[prim_hash[i] & h->bucket_bitmask];
+		secondary_bkt[i] = &h->buckets[sec_hash[i] & h->bucket_bitmask];
 
-}
+		rte_prefetch0(primary_bkt[i]);
+		rte_prefetch0(secondary_bkt[i]);
+	}
 
+	/* Compare signatures and prefetch key slot of first hit */
+	for (i = 0; i < num_keys; i++) {
+		compare_signatures(&prim_hitmask[i], &sec_hitmask[i],
+				primary_bkt[i], secondary_bkt[i],
+				prim_hash[i], sec_hash[i], h->sig_cmp_fn);
+
+		if (prim_hitmask[i]) {
+			uint32_t first_hit = __builtin_ctzl(prim_hitmask[i]);
+			uint32_t key_idx = primary_bkt[i]->key_idx[first_hit];
+			const struct rte_hash_key *key_slot =
+				(const struct rte_hash_key *)(
+				(const char *)h->key_store +
+				key_idx * h->key_entry_size);
+			rte_prefetch0(key_slot);
+			continue;
+		}
 
-/* Lookup bulk stage 3: Check if key matches, update hit mask and return data */
-static inline void
-lookup_stage3(unsigned idx, const struct rte_hash_key *key_slot, const void * const *keys,
-		const int32_t *positions, void *data[], uint64_t *hits,
-		const struct rte_hash *h)
-{
-	unsigned hit;
-	unsigned key_idx;
+		if (sec_hitmask[i]) {
+			uint32_t first_hit = __builtin_ctzl(sec_hitmask[i]);
+			uint32_t key_idx = secondary_bkt[i]->key_idx[first_hit];
+			const struct rte_hash_key *key_slot =
+				(const struct rte_hash_key *)(
+				(const char *)h->key_store +
+				key_idx * h->key_entry_size);
+			rte_prefetch0(key_slot);
+		}
+	}
 
-	hit = !rte_hash_cmp_eq(key_slot->key, keys[idx], h);
-	if (data != NULL)
-		data[idx] = key_slot->pdata;
+	/* Compare keys, first hits in primary first */
+	for (i = 0; i < num_keys; i++) {
+		positions[i] = -ENOENT;
+		while (prim_hitmask[i]) {
+			uint32_t hit_index = __builtin_ctzl(prim_hitmask[i]);
+
+			uint32_t key_idx = primary_bkt[i]->key_idx[hit_index];
+			const struct rte_hash_key *key_slot =
+				(const struct rte_hash_key *)(
+				(const char *)h->key_store +
+				key_idx * h->key_entry_size);
+			/*
+			 * If key index is 0, do not compare key,
+			 * as it is checking the dummy slot
+			 */
+			if (!!key_idx & !rte_hash_cmp_eq(key_slot->key, keys[i], h)) {
+				if (data != NULL)
+					data[i] = key_slot->pdata;
 
-	key_idx = positions[idx] + 1;
-	/*
-	 * If key index is 0, force hit to be 0, in case key to be looked up
-	 * is all zero (as in the dummy slot), which would result in a wrong hit
-	 */
-	*hits |= (uint64_t)(hit && !!key_idx)  << idx;
-}
+				hits |= 1ULL << i;
+				positions[i] = key_idx - 1;
+				goto next_key;
+			}
+			prim_hitmask[i] &= ~(1 << (hit_index));
+		}
 
-static inline void
-__rte_hash_lookup_bulk(const struct rte_hash *h, const void **keys,
-			uint32_t num_keys, int32_t *positions,
-			uint64_t *hit_mask, void *data[])
-{
-	uint64_t hits = 0;
-	uint64_t extra_hits_mask = 0;
-	uint64_t lookup_mask, miss_mask;
-	unsigned idx;
-	const void *key_store = h->key_store;
-	int ret;
-	hash_sig_t hash_vals[RTE_HASH_LOOKUP_BULK_MAX];
-
-	unsigned idx00, idx01, idx10, idx11, idx20, idx21, idx30, idx31;
-	const struct rte_hash_bucket *primary_bkt10, *primary_bkt11;
-	const struct rte_hash_bucket *secondary_bkt10, *secondary_bkt11;
-	const struct rte_hash_bucket *primary_bkt20, *primary_bkt21;
-	const struct rte_hash_bucket *secondary_bkt20, *secondary_bkt21;
-	const struct rte_hash_key *k_slot20, *k_slot21, *k_slot30, *k_slot31;
-	hash_sig_t primary_hash10, primary_hash11;
-	hash_sig_t secondary_hash10, secondary_hash11;
-	hash_sig_t primary_hash20, primary_hash21;
-	hash_sig_t secondary_hash20, secondary_hash21;
-
-	lookup_mask = (uint64_t) -1 >> (64 - num_keys);
-	miss_mask = lookup_mask;
-
-	lookup_stage0(&idx00, &lookup_mask, keys);
-	lookup_stage0(&idx01, &lookup_mask, keys);
-
-	idx10 = idx00, idx11 = idx01;
-
-	lookup_stage0(&idx00, &lookup_mask, keys);
-	lookup_stage0(&idx01, &lookup_mask, keys);
-	lookup_stage1(idx10, &primary_hash10, &secondary_hash10,
-			&primary_bkt10, &secondary_bkt10, hash_vals, keys, h);
-	lookup_stage1(idx11, &primary_hash11, &secondary_hash11,
-			&primary_bkt11,	&secondary_bkt11, hash_vals, keys, h);
-
-	primary_bkt20 = primary_bkt10;
-	primary_bkt21 = primary_bkt11;
-	secondary_bkt20 = secondary_bkt10;
-	secondary_bkt21 = secondary_bkt11;
-	primary_hash20 = primary_hash10;
-	primary_hash21 = primary_hash11;
-	secondary_hash20 = secondary_hash10;
-	secondary_hash21 = secondary_hash11;
-	idx20 = idx10, idx21 = idx11;
-	idx10 = idx00, idx11 = idx01;
-
-	lookup_stage0(&idx00, &lookup_mask, keys);
-	lookup_stage0(&idx01, &lookup_mask, keys);
-	lookup_stage1(idx10, &primary_hash10, &secondary_hash10,
-			&primary_bkt10, &secondary_bkt10, hash_vals, keys, h);
-	lookup_stage1(idx11, &primary_hash11, &secondary_hash11,
-			&primary_bkt11,	&secondary_bkt11, hash_vals, keys, h);
-	lookup_stage2(idx20, primary_hash20, secondary_hash20, primary_bkt20,
-			secondary_bkt20, &k_slot20, positions, &extra_hits_mask,
-			key_store, h);
-	lookup_stage2(idx21, primary_hash21, secondary_hash21, primary_bkt21,
-			secondary_bkt21, &k_slot21, positions, &extra_hits_mask,
-			key_store, h);
-
-	while (lookup_mask) {
-		k_slot30 = k_slot20, k_slot31 = k_slot21;
-		idx30 = idx20, idx31 = idx21;
-		primary_bkt20 = primary_bkt10;
-		primary_bkt21 = primary_bkt11;
-		secondary_bkt20 = secondary_bkt10;
-		secondary_bkt21 = secondary_bkt11;
-		primary_hash20 = primary_hash10;
-		primary_hash21 = primary_hash11;
-		secondary_hash20 = secondary_hash10;
-		secondary_hash21 = secondary_hash11;
-		idx20 = idx10, idx21 = idx11;
-		idx10 = idx00, idx11 = idx01;
-
-		lookup_stage0(&idx00, &lookup_mask, keys);
-		lookup_stage0(&idx01, &lookup_mask, keys);
-		lookup_stage1(idx10, &primary_hash10, &secondary_hash10,
-			&primary_bkt10, &secondary_bkt10, hash_vals, keys, h);
-		lookup_stage1(idx11, &primary_hash11, &secondary_hash11,
-			&primary_bkt11,	&secondary_bkt11, hash_vals, keys, h);
-		lookup_stage2(idx20, primary_hash20, secondary_hash20,
-			primary_bkt20, secondary_bkt20, &k_slot20, positions,
-			&extra_hits_mask, key_store, h);
-		lookup_stage2(idx21, primary_hash21, secondary_hash21,
-			primary_bkt21, secondary_bkt21,	&k_slot21, positions,
-			&extra_hits_mask, key_store, h);
-		lookup_stage3(idx30, k_slot30, keys, positions, data, &hits, h);
-		lookup_stage3(idx31, k_slot31, keys, positions, data, &hits, h);
-	}
+		while (sec_hitmask[i]) {
+			uint32_t hit_index = __builtin_ctzl(sec_hitmask[i]);
+
+			uint32_t key_idx = secondary_bkt[i]->key_idx[hit_index];
+			const struct rte_hash_key *key_slot =
+				(const struct rte_hash_key *)(
+				(const char *)h->key_store +
+				key_idx * h->key_entry_size);
+			/*
+			 * If key index is 0, do not compare key,
+			 * as it is checking the dummy slot
+			 */
+
+			if (!!key_idx & !rte_hash_cmp_eq(key_slot->key, keys[i], h)) {
+				if (data != NULL)
+					data[i] = key_slot->pdata;
 
-	k_slot30 = k_slot20, k_slot31 = k_slot21;
-	idx30 = idx20, idx31 = idx21;
-	primary_bkt20 = primary_bkt10;
-	primary_bkt21 = primary_bkt11;
-	secondary_bkt20 = secondary_bkt10;
-	secondary_bkt21 = secondary_bkt11;
-	primary_hash20 = primary_hash10;
-	primary_hash21 = primary_hash11;
-	secondary_hash20 = secondary_hash10;
-	secondary_hash21 = secondary_hash11;
-	idx20 = idx10, idx21 = idx11;
-	idx10 = idx00, idx11 = idx01;
-
-	lookup_stage1(idx10, &primary_hash10, &secondary_hash10,
-		&primary_bkt10, &secondary_bkt10, hash_vals, keys, h);
-	lookup_stage1(idx11, &primary_hash11, &secondary_hash11,
-		&primary_bkt11,	&secondary_bkt11, hash_vals, keys, h);
-	lookup_stage2(idx20, primary_hash20, secondary_hash20, primary_bkt20,
-		secondary_bkt20, &k_slot20, positions, &extra_hits_mask,
-		key_store, h);
-	lookup_stage2(idx21, primary_hash21, secondary_hash21, primary_bkt21,
-		secondary_bkt21, &k_slot21, positions, &extra_hits_mask,
-		key_store, h);
-	lookup_stage3(idx30, k_slot30, keys, positions, data, &hits, h);
-	lookup_stage3(idx31, k_slot31, keys, positions, data, &hits, h);
-
-	k_slot30 = k_slot20, k_slot31 = k_slot21;
-	idx30 = idx20, idx31 = idx21;
-	primary_bkt20 = primary_bkt10;
-	primary_bkt21 = primary_bkt11;
-	secondary_bkt20 = secondary_bkt10;
-	secondary_bkt21 = secondary_bkt11;
-	primary_hash20 = primary_hash10;
-	primary_hash21 = primary_hash11;
-	secondary_hash20 = secondary_hash10;
-	secondary_hash21 = secondary_hash11;
-	idx20 = idx10, idx21 = idx11;
-
-	lookup_stage2(idx20, primary_hash20, secondary_hash20, primary_bkt20,
-		secondary_bkt20, &k_slot20, positions, &extra_hits_mask,
-		key_store, h);
-	lookup_stage2(idx21, primary_hash21, secondary_hash21, primary_bkt21,
-		secondary_bkt21, &k_slot21, positions, &extra_hits_mask,
-		key_store, h);
-	lookup_stage3(idx30, k_slot30, keys, positions, data, &hits, h);
-	lookup_stage3(idx31, k_slot31, keys, positions, data, &hits, h);
-
-	k_slot30 = k_slot20, k_slot31 = k_slot21;
-	idx30 = idx20, idx31 = idx21;
-
-	lookup_stage3(idx30, k_slot30, keys, positions, data, &hits, h);
-	lookup_stage3(idx31, k_slot31, keys, positions, data, &hits, h);
-
-	/* ignore any items we have already found */
-	extra_hits_mask &= ~hits;
-
-	if (unlikely(extra_hits_mask)) {
-		/* run a single search for each remaining item */
-		do {
-			idx = __builtin_ctzl(extra_hits_mask);
-			if (data != NULL) {
-				ret = rte_hash_lookup_with_hash_data(h,
-						keys[idx], hash_vals[idx], &data[idx]);
-				if (ret >= 0)
-					hits |= 1ULL << idx;
-			} else {
-				positions[idx] = rte_hash_lookup_with_hash(h,
-							keys[idx], hash_vals[idx]);
-				if (positions[idx] >= 0)
-					hits |= 1llu << idx;
+				hits |= 1ULL << i;
+				positions[i] = key_idx - 1;
+				goto next_key;
 			}
-			extra_hits_mask &= ~(1llu << idx);
-		} while (extra_hits_mask);
-	}
+			sec_hitmask[i] &= ~(1 << (hit_index));
+		}
 
-	miss_mask &= ~hits;
-	if (unlikely(miss_mask)) {
-		do {
-			idx = __builtin_ctzl(miss_mask);
-			positions[idx] = -ENOENT;
-			miss_mask &= ~(1llu << idx);
-		} while (miss_mask);
+next_key:
+		continue;
 	}
 
 	if (hit_mask != NULL)
diff --git a/lib/librte_hash/rte_cuckoo_hash.h b/lib/librte_hash/rte_cuckoo_hash.h
index 9ff79c0..2172b2c 100644
--- a/lib/librte_hash/rte_cuckoo_hash.h
+++ b/lib/librte_hash/rte_cuckoo_hash.h
@@ -173,8 +173,7 @@ enum rte_hash_sig_compare_function {
 struct rte_hash_bucket {
 	hash_sig_t sig_current[RTE_HASH_BUCKET_ENTRIES];
 
-	/* Includes dummy key index that always contains index 0 */
-	uint32_t key_idx[RTE_HASH_BUCKET_ENTRIES + 1];
+	uint32_t key_idx[RTE_HASH_BUCKET_ENTRIES];
 
 	hash_sig_t sig_alt[RTE_HASH_BUCKET_ENTRIES];
 
-- 
2.7.4


* Re: [PATCH v4 0/4] Cuckoo hash enhancements
  2016-09-30  7:38     ` [PATCH v4 0/4] Cuckoo hash enhancements Pablo de Lara
                         ` (3 preceding siblings ...)
  2016-09-30  7:38       ` [PATCH v4 4/4] hash: modify lookup bulk pipeline Pablo de Lara
@ 2016-09-30 19:53       ` Gobriel, Sameh
  2016-10-03  9:59       ` Bruce Richardson
  2016-10-04 23:25       ` [PATCH v5 " Pablo de Lara
  6 siblings, 0 replies; 37+ messages in thread
From: Gobriel, Sameh @ 2016-09-30 19:53 UTC (permalink / raw)
  To: De Lara Guarch, Pablo, dev; +Cc: Richardson, Bruce



> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of De Lara Guarch, Pablo
> Sent: Friday, September 30, 2016 12:39 AM
> To: dev@dpdk.org
> Cc: Richardson, Bruce <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>
> Subject: [dpdk-dev] [PATCH v4 0/4] Cuckoo hash enhancements
> 
> This patchset improves lookup performance on the current hash library by
> replacing the existing 4-stage, 2-entry lookup bulk pipeline with an improved
> pipeline based on a loop-and-jump model.
> Also, x86 vectorized intrinsics are used to improve performance when
> comparing signatures.
> 
> First patch reorganizes the order of the hash structure.
> The structure takes more than one 64-byte cache line, but not all the fields are
> used in the lookup operation (the most common operation).
> Therefore, the fields used in lookup have been moved to the first part of the
> structure, so they all fit in one cache line, slightly improving performance in
> some scenarios.
> 
> Second patch modifies the order of the bucket structure.
> Currently, the buckets store all the signatures together (current and
> alternative).
> In order to be able to perform a vectorized signature comparison, all current
> signatures have to be stored together, so the bucket layout has been changed,
> separating the current signatures from the alternative signatures.
> 
> Third patch introduces x86 vectorized intrinsics.
> When performing a lookup bulk operation, all current signatures in a bucket
> are compared against the signature of the key being looked up.
> Now that they are all stored together, a vectorized comparison can be performed,
> which takes fewer instructions to carry out.
> On machines with AVX2, the number of entries per bucket is increased from 4 to 8,
> as AVX2 allows comparing two 256-bit values containing 8x32-bit integers,
> which are the 8 signatures in the bucket.
> 
> Fourth (and last) patch modifies the current pipeline of the lookup bulk
> function.
> The new pipeline is based on a loop-and-jump model. The two key
> improvements are:
> 
> - Better prefetching: in this case, the first 4 keys to be looked up are prefetched,
>   and after that, the rest of the keys are prefetched while the calculation
>   of the signatures is being performed. This gives the CPU more time to
>   prefetch the data before it is actually needed, which results in fewer
>   cache misses and therefore higher throughput.
> 
> - Lower performance penalty when using fallback: the lookup bulk algorithm
>   assumes that most of the time there will not be a collision in a bucket, but it
>   might happen that two or more signatures are equal, which means that more than
>   one key comparison might be necessary. In that case, only the key of the first
>   hit is prefetched, like in the current implementation. The difference now is
>   that if this comparison results in a miss, the information of the other keys to
>   be compared has been stored, unlike the current implementation, which needs to
>   perform an entire simple lookup again.
> 
> Changes in v4:
> - Reordered hash structure, so alt signature is at the start
>   of the next cache line, and explain in the commit message
>   why it has been moved
> - Reordered hash structure, so name field is on top of the structure,
>   leaving all the fields used in lookup in the next cache line
>   (instead of the first cache line)
> 
> Changes in v3:
> - Corrected the cover letter (wrong number of patches)
> 
> Changes in v2:
> - Increased entries per bucket from 4 to 8 for all cases,
>   so it is not architecture dependent any longer.
> - Replaced compile-time signature comparison function election
>   with run-time election, so best optimization available
>   will be used from a single binary.
> - Reordered the hash structure, so all the fields used by lookup
>   are in the same cache line (first).
> 
> Byron Marohn (3):
>   hash: reorganize bucket structure
>   hash: add vectorized comparison
>   hash: modify lookup bulk pipeline
> 
> Pablo de Lara (1):
>   hash: reorder hash structure
> 
>  lib/librte_hash/rte_cuckoo_hash.c     | 455 ++++++++++++++--------------------
>  lib/librte_hash/rte_cuckoo_hash.h     |  56 +++--
>  lib/librte_hash/rte_cuckoo_hash_x86.h |  20 +-
>  3 files changed, 228 insertions(+), 303 deletions(-)
> 
> --
> 2.7.4

Series-acked-by: Sameh Gobriel <sameh.gobriel@intel.com>


* Re: [PATCH v4 0/4] Cuckoo hash enhancements
  2016-09-30  7:38     ` [PATCH v4 0/4] Cuckoo hash enhancements Pablo de Lara
                         ` (4 preceding siblings ...)
  2016-09-30 19:53       ` [PATCH v4 0/4] Cuckoo hash enhancements Gobriel, Sameh
@ 2016-10-03  9:59       ` Bruce Richardson
  2016-10-04  6:50         ` De Lara Guarch, Pablo
  2016-10-04 23:25       ` [PATCH v5 " Pablo de Lara
  6 siblings, 1 reply; 37+ messages in thread
From: Bruce Richardson @ 2016-10-03  9:59 UTC (permalink / raw)
  To: Pablo de Lara; +Cc: dev

On Fri, Sep 30, 2016 at 08:38:52AM +0100, Pablo de Lara wrote:
> This patchset improves lookup performance on the current hash library
> by replacing the existing 4-stage, 2-entry lookup bulk pipeline with an
> improved pipeline based on a loop-and-jump model.
> Also, x86 vectorized intrinsics are used to improve performance when comparing signatures.
> 
> First patch reorganizes the order of the hash structure.
> The structure takes more than one 64-byte cache line, but not all
> the fields are used in the lookup operation (the most common operation).
> Therefore, all these fields have been moved to the first part of the structure,
> so they all fit in one cache line, improving slightly the performance in some
> scenarios.
> 
> Second patch modifies the order of the bucket structure.
> Currently, the buckets store all the signatures together (current and alternative).
> In order to be able to perform a vectorized signature comparison,
> all current signatures have to be together, so the order of the bucket has been changed,
> having separated all the current signatures from the alternative signatures.
> 
> Third patch introduces x86 vectorized intrinsics.
> When performing a lookup bulk operation, all current signatures in a bucket
> are compared against the signature of the key being looked up.
> Now that they all are together, a vectorized comparison can be performed,
> which takes less instructions to be carried out.
> In case of having a machine with AVX2, number of entries per bucket are
> increased from 4 to 8, as AVX2 allows comparing two 256-bit values, with 8x32-bit integers,
> which are the 8 signatures on the bucket.
> 
> Fourth (and last) patch modifies the current pipeline of the lookup bulk function.
> The new pipeline is based on a loop-and-jump model. The two key improvements are:
> 
> - Better prefetching: in this case, first 4 keys to be looked up are prefetched,
>   and after that, the rest of the keys are prefetched at the time the calculation
>   of the signatures are being performed. This gives more time for the CPU to
>   prefetch the data requesting before actually need it, which result in less
>   cache misses and therefore, higher throughput.
> 
> - Lower performance penalty when using fallback: the lookup bulk algorithm
>   assumes that most times there will not be a collision in a bucket, but it might
>   happen that two or more signatures are equal, which means that more than one
>   key comparison might be necessary. In that case, only the key of the first hit is prefetched,
>   like in the current implementation. The difference now is that if this comparison
>   results in a miss, the information of the other keys to be compared has been stored,
>   unlike the current implementation, which needs to perform an entire simple lookup again.
> 
> Changes in v4:
> - Reordered hash structure, so alt signature is at the start
>   of the next cache line, and explain in the commit message
>   why it has been moved
> - Reordered hash structure, so name field is on top of the structure,
>   leaving all the fields used in lookup in the next cache line
>   (instead of the first cache line)
> 
> Changes in v3:
> - Corrected the cover letter (wrong number of patches)
> 
> Changes in v2:
> - Increased entries per bucket from 4 to 8 for all cases,
>   so it is not architecture dependent any longer.
> - Replaced compile-time signature comparison function election
>   with run-time election, so best optimization available
>   will be used from a single binary.
> - Reordered the hash structure, so all the fields used by lookup
>   are in the same cache line (first).
> 
> Byron Marohn (3):
>   hash: reorganize bucket structure
>   hash: add vectorized comparison
>   hash: modify lookup bulk pipeline
> 

Hi,

Firstly, checkpatch is reporting some style errors in these patches.

Secondly, when I run the "hash_multiwriter_autotest" I get what I assume to be
an error after applying this patchset. Before this set is applied, running
that test shows the cycles per insert with/without lock elision. Now, though
I'm getting an error about a key being dropped or failing to insert in the lock
elision case, e.g. 

  Core #2 inserting 1572864: 0 - 1,572,864
  key 1497087 is lost
  1 key lost

I've run the test a number of times, and there is a single key lost each time.
Please check on this: is it expected, or is it a problem?

Thanks,
/Bruce

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 0/4] Cuckoo hash enhancements
  2016-10-03  9:59       ` Bruce Richardson
@ 2016-10-04  6:50         ` De Lara Guarch, Pablo
  2016-10-04  7:17           ` De Lara Guarch, Pablo
  0 siblings, 1 reply; 37+ messages in thread
From: De Lara Guarch, Pablo @ 2016-10-04  6:50 UTC (permalink / raw)
  To: Richardson, Bruce; +Cc: dev

Hi Bruce,

> -----Original Message-----
> From: Richardson, Bruce
> Sent: Monday, October 03, 2016 2:59 AM
> To: De Lara Guarch, Pablo
> Cc: dev@dpdk.org
> Subject: Re: [PATCH v4 0/4] Cuckoo hash enhancements
> 
> On Fri, Sep 30, 2016 at 08:38:52AM +0100, Pablo de Lara wrote:
> > This patchset improves lookup performance on the current hash library
> > by changing the existing lookup bulk pipeline, with an improved pipeline,
> > based on a loop-and-jump model, instead of the current 4-stage 2-entry
> pipeline.
> > Also, x86 vectorized intrinsics are used to improve performance when
> comparing signatures.
> >
> > First patch reorganizes the order of the hash structure.
> > The structure takes more than one 64-byte cache line, but not all
> > the fields are used in the lookup operation (the most common operation).
> > Therefore, all these fields have been moved to the first part of the structure,
> > so they all fit in one cache line, improving slightly the performance in some
> > scenarios.
> >
> > Second patch modifies the order of the bucket structure.
> > Currently, the buckets store all the signatures together (current and
> alternative).
> > In order to be able to perform a vectorized signature comparison,
> > all current signatures have to be together, so the order of the bucket has
> been changed,
> > having separated all the current signatures from the alternative signatures.
> >
> > Third patch introduces x86 vectorized intrinsics.
> > When performing a lookup bulk operation, all current signatures in a bucket
> > are compared against the signature of the key being looked up.
> > Now that they all are together, a vectorized comparison can be performed,
> > which takes less instructions to be carried out.
> > In case of having a machine with AVX2, number of entries per bucket are
> > increased from 4 to 8, as AVX2 allows comparing two 256-bit values, with
> 8x32-bit integers,
> > which are the 8 signatures on the bucket.
> >
> > Fourth (and last) patch modifies the current pipeline of the lookup bulk
> function.
> > The new pipeline is based on a loop-and-jump model. The two key
> improvements are:
> >
> > - Better prefetching: in this case, first 4 keys to be looked up are prefetched,
> >   and after that, the rest of the keys are prefetched at the time the
> calculation
> >   of the signatures are being performed. This gives more time for the CPU to
> >   prefetch the data requesting before actually need it, which result in less
> >   cache misses and therefore, higher throughput.
> >
> > - Lower performance penalty when using fallback: the lookup bulk
> algorithm
> >   assumes that most times there will not be a collision in a bucket, but it
> might
> >   happen that two or more signatures are equal, which means that more
> than one
> >   key comparison might be necessary. In that case, only the key of the first
> hit is prefetched,
> >   like in the current implementation. The difference now is that if this
> comparison
> >   results in a miss, the information of the other keys to be compared has
> been stored,
> >   unlike the current implementation, which needs to perform an entire
> simple lookup again.
> >
> > Changes in v4:
> > - Reordered hash structure, so alt signature is at the start
> >   of the next cache line, and explain in the commit message
> >   why it has been moved
> > - Reordered hash structure, so name field is on top of the structure,
> >   leaving all the fields used in lookup in the next cache line
> >   (instead of the first cache line)
> >
> > Changes in v3:
> > - Corrected the cover letter (wrong number of patches)
> >
> > Changes in v2:
> > - Increased entries per bucket from 4 to 8 for all cases,
> >   so it is not architecture dependent any longer.
> > - Replaced compile-time signature comparison function election
> >   with run-time election, so best optimization available
> >   will be used from a single binary.
> > - Reordered the hash structure, so all the fields used by lookup
> >   are in the same cache line (first).
> >
> > Byron Marohn (3):
> >   hash: reorganize bucket structure
> >   hash: add vectorized comparison
> >   hash: modify lookup bulk pipeline
> >
> 
> Hi,
> 
> Firstly, checkpatches is reporting some style errors in these patches.
> 
> Secondly, when I run the "hash_multiwriter_autotest" I get what I assume to
> be
> an error after applying this patchset. Before this set is applied, running
> that test shows the cycles per insert with/without lock elision. Now, though
> I'm getting an error about a key being dropped or failing to insert in the lock
> elision case, e.g.
> 
>   Core #2 inserting 1572864: 0 - 1,572,864
>   key 1497087 is lost
>   1 key lost
> 
> I've run the test a number of times, and there is a single key lost each time.
> Please check on this, is it expected or is it a problem?

I am seeing that error even without the patchset. I am still investigating it,
but according to "git bisect", the problem appears to be in commit 5fc74c2e146d

Thanks,
Pablo

> 
> Thanks,
> /Bruce

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 0/4] Cuckoo hash enhancements
  2016-10-04  6:50         ` De Lara Guarch, Pablo
@ 2016-10-04  7:17           ` De Lara Guarch, Pablo
  2016-10-04  9:47             ` Bruce Richardson
  0 siblings, 1 reply; 37+ messages in thread
From: De Lara Guarch, Pablo @ 2016-10-04  7:17 UTC (permalink / raw)
  To: De Lara Guarch, Pablo, Richardson, Bruce; +Cc: dev



> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of De Lara Guarch,
> Pablo
> Sent: Monday, October 03, 2016 11:51 PM
> To: Richardson, Bruce
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v4 0/4] Cuckoo hash enhancements
> 
> Hi Bruce,
> 
> > -----Original Message-----
> > From: Richardson, Bruce
> > Sent: Monday, October 03, 2016 2:59 AM
> > To: De Lara Guarch, Pablo
> > Cc: dev@dpdk.org
> > Subject: Re: [PATCH v4 0/4] Cuckoo hash enhancements
> >
> > On Fri, Sep 30, 2016 at 08:38:52AM +0100, Pablo de Lara wrote:
> > > This patchset improves lookup performance on the current hash library
> > > by changing the existing lookup bulk pipeline, with an improved pipeline,
> > > based on a loop-and-jump model, instead of the current 4-stage 2-entry
> > pipeline.
> > > Also, x86 vectorized intrinsics are used to improve performance when
> > comparing signatures.
> > >
> > > First patch reorganizes the order of the hash structure.
> > > The structure takes more than one 64-byte cache line, but not all
> > > the fields are used in the lookup operation (the most common operation).
> > > Therefore, all these fields have been moved to the first part of the
> structure,
> > > so they all fit in one cache line, improving slightly the performance in
> some
> > > scenarios.
> > >
> > > Second patch modifies the order of the bucket structure.
> > > Currently, the buckets store all the signatures together (current and
> > alternative).
> > > In order to be able to perform a vectorized signature comparison,
> > > all current signatures have to be together, so the order of the bucket has
> > been changed,
> > > having separated all the current signatures from the alternative
> signatures.
> > >
> > > Third patch introduces x86 vectorized intrinsics.
> > > When performing a lookup bulk operation, all current signatures in a
> bucket
> > > are compared against the signature of the key being looked up.
> > > Now that they all are together, a vectorized comparison can be
> performed,
> > > which takes less instructions to be carried out.
> > > In case of having a machine with AVX2, number of entries per bucket are
> > > increased from 4 to 8, as AVX2 allows comparing two 256-bit values, with
> > 8x32-bit integers,
> > > which are the 8 signatures on the bucket.
> > >
> > > Fourth (and last) patch modifies the current pipeline of the lookup bulk
> > function.
> > > The new pipeline is based on a loop-and-jump model. The two key
> > improvements are:
> > >
> > > - Better prefetching: in this case, first 4 keys to be looked up are
> prefetched,
> > >   and after that, the rest of the keys are prefetched at the time the
> > calculation
> > >   of the signatures are being performed. This gives more time for the CPU
> to
> > >   prefetch the data requesting before actually need it, which result in less
> > >   cache misses and therefore, higher throughput.
> > >
> > > - Lower performance penalty when using fallback: the lookup bulk
> > algorithm
> > >   assumes that most times there will not be a collision in a bucket, but it
> > might
> > >   happen that two or more signatures are equal, which means that more
> > than one
> > >   key comparison might be necessary. In that case, only the key of the first
> > hit is prefetched,
> > >   like in the current implementation. The difference now is that if this
> > comparison
> > >   results in a miss, the information of the other keys to be compared has
> > been stored,
> > >   unlike the current implementation, which needs to perform an entire
> > simple lookup again.
> > >
> > > Changes in v4:
> > > - Reordered hash structure, so alt signature is at the start
> > >   of the next cache line, and explain in the commit message
> > >   why it has been moved
> > > - Reordered hash structure, so name field is on top of the structure,
> > >   leaving all the fields used in lookup in the next cache line
> > >   (instead of the first cache line)
> > >
> > > Changes in v3:
> > > - Corrected the cover letter (wrong number of patches)
> > >
> > > Changes in v2:
> > > - Increased entries per bucket from 4 to 8 for all cases,
> > >   so it is not architecture dependent any longer.
> > > - Replaced compile-time signature comparison function election
> > >   with run-time election, so best optimization available
> > >   will be used from a single binary.
> > > - Reordered the hash structure, so all the fields used by lookup
> > >   are in the same cache line (first).
> > >
> > > Byron Marohn (3):
> > >   hash: reorganize bucket structure
> > >   hash: add vectorized comparison
> > >   hash: modify lookup bulk pipeline
> > >
> >
> > Hi,
> >
> > Firstly, checkpatches is reporting some style errors in these patches.
> >
> > Secondly, when I run the "hash_multiwriter_autotest" I get what I assume
> to
> > be
> > an error after applying this patchset. Before this set is applied, running
> > that test shows the cycles per insert with/without lock elision. Now, though
> > I'm getting an error about a key being dropped or failing to insert in the lock
> > elision case, e.g.
> >
> >   Core #2 inserting 1572864: 0 - 1,572,864
> >   key 1497087 is lost
> >   1 key lost
> >
> > I've run the test a number of times, and there is a single key lost each time.
> > Please check on this, is it expected or is it a problem?
> 
> I am seeing that error even without the patchset. I am still investigating it,
> but using "git bisect" looks like the problem is in commit 5fc74c2e146d
> ("hash: check if slot is empty with key index").

I found the problem, and I submitted a patch for it (http://dpdk.org/dev/patchwork/patch/16361/).
Could you check if it works for you? 

> 
> Thanks,
> Pablo
> 
> >
> > Thanks,
> > /Bruce

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 0/4] Cuckoo hash enhancements
  2016-10-04  7:17           ` De Lara Guarch, Pablo
@ 2016-10-04  9:47             ` Bruce Richardson
  0 siblings, 0 replies; 37+ messages in thread
From: Bruce Richardson @ 2016-10-04  9:47 UTC (permalink / raw)
  To: De Lara Guarch, Pablo; +Cc: dev

On Tue, Oct 04, 2016 at 08:17:28AM +0100, De Lara Guarch, Pablo wrote:
> 
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of De Lara Guarch,
> > Pablo
> > Sent: Monday, October 03, 2016 11:51 PM
> > To: Richardson, Bruce
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH v4 0/4] Cuckoo hash enhancements
> > 
> > Hi Bruce,
> > 
> > > -----Original Message-----
> > > From: Richardson, Bruce
> > > Sent: Monday, October 03, 2016 2:59 AM
> > > To: De Lara Guarch, Pablo
> > > Cc: dev@dpdk.org
> > > Subject: Re: [PATCH v4 0/4] Cuckoo hash enhancements
> > >
> > > On Fri, Sep 30, 2016 at 08:38:52AM +0100, Pablo de Lara wrote:
> > > > This patchset improves lookup performance on the current hash library
> > > > by changing the existing lookup bulk pipeline, with an improved pipeline,
> > > > based on a loop-and-jump model, instead of the current 4-stage 2-entry
> > > pipeline.
> > > > Also, x86 vectorized intrinsics are used to improve performance when
> > > comparing signatures.
> > > >
> > > > First patch reorganizes the order of the hash structure.
> > > > The structure takes more than one 64-byte cache line, but not all
> > > > the fields are used in the lookup operation (the most common operation).
> > > > Therefore, all these fields have been moved to the first part of the
> > structure,
> > > > so they all fit in one cache line, improving slightly the performance in
> > some
> > > > scenarios.
> > > >
> > > > Second patch modifies the order of the bucket structure.
> > > > Currently, the buckets store all the signatures together (current and
> > > alternative).
> > > > In order to be able to perform a vectorized signature comparison,
> > > > all current signatures have to be together, so the order of the bucket has
> > > been changed,
> > > > having separated all the current signatures from the alternative
> > signatures.
> > > >
> > > > Third patch introduces x86 vectorized intrinsics.
> > > > When performing a lookup bulk operation, all current signatures in a
> > bucket
> > > > are compared against the signature of the key being looked up.
> > > > Now that they all are together, a vectorized comparison can be
> > performed,
> > > > which takes less instructions to be carried out.
> > > > In case of having a machine with AVX2, number of entries per bucket are
> > > > increased from 4 to 8, as AVX2 allows comparing two 256-bit values, with
> > > 8x32-bit integers,
> > > > which are the 8 signatures on the bucket.
> > > >
> > > > Fourth (and last) patch modifies the current pipeline of the lookup bulk
> > > function.
> > > > The new pipeline is based on a loop-and-jump model. The two key
> > > improvements are:
> > > >
> > > > - Better prefetching: in this case, first 4 keys to be looked up are
> > prefetched,
> > > >   and after that, the rest of the keys are prefetched at the time the
> > > calculation
> > > >   of the signatures are being performed. This gives more time for the CPU
> > to
> > > >   prefetch the data requesting before actually need it, which result in less
> > > >   cache misses and therefore, higher throughput.
> > > >
> > > > - Lower performance penalty when using fallback: the lookup bulk
> > > algorithm
> > > >   assumes that most times there will not be a collision in a bucket, but it
> > > might
> > > >   happen that two or more signatures are equal, which means that more
> > > than one
> > > >   key comparison might be necessary. In that case, only the key of the first
> > > hit is prefetched,
> > > >   like in the current implementation. The difference now is that if this
> > > comparison
> > > >   results in a miss, the information of the other keys to be compared has
> > > been stored,
> > > >   unlike the current implementation, which needs to perform an entire
> > > simple lookup again.
> > > >
> > > > Changes in v4:
> > > > - Reordered hash structure, so alt signature is at the start
> > > >   of the next cache line, and explain in the commit message
> > > >   why it has been moved
> > > > - Reordered hash structure, so name field is on top of the structure,
> > > >   leaving all the fields used in lookup in the next cache line
> > > >   (instead of the first cache line)
> > > >
> > > > Changes in v3:
> > > > - Corrected the cover letter (wrong number of patches)
> > > >
> > > > Changes in v2:
> > > > - Increased entries per bucket from 4 to 8 for all cases,
> > > >   so it is not architecture dependent any longer.
> > > > - Replaced compile-time signature comparison function election
> > > >   with run-time election, so best optimization available
> > > >   will be used from a single binary.
> > > > - Reordered the hash structure, so all the fields used by lookup
> > > >   are in the same cache line (first).
> > > >
> > > > Byron Marohn (3):
> > > >   hash: reorganize bucket structure
> > > >   hash: add vectorized comparison
> > > >   hash: modify lookup bulk pipeline
> > > >
> > >
> > > Hi,
> > >
> > > Firstly, checkpatches is reporting some style errors in these patches.
> > >
> > > Secondly, when I run the "hash_multiwriter_autotest" I get what I assume
> > to
> > > be
> > > an error after applying this patchset. Before this set is applied, running
> > > that test shows the cycles per insert with/without lock elision. Now, though
> > > I'm getting an error about a key being dropped or failing to insert in the lock
> > > elision case, e.g.
> > >
> > >   Core #2 inserting 1572864: 0 - 1,572,864
> > >   key 1497087 is lost
> > >   1 key lost
> > >
> > > I've run the test a number of times, and there is a single key lost each time.
> > > Please check on this, is it expected or is it a problem?
> > 
> > I am seeing that error even without the patchset. I am still investigating it,
> > but using "git bisect" looks like the problem is in commit 5fc74c2e146d
> > ("hash: check if slot is empty with key index").
> 
> I found the problem, and I submitted a patch for it (http://dpdk.org/dev/patchwork/patch/16361/).
> Could you check if it works for you? 
> 

That patch looks like a correct bugfix, so I've acked it for you. However, I still
see the error appearing very occasionally. Since it also appeared before I applied
this set, I am OK to accept this set anyway. Please send a new version of the
set with the checkpatch issues fixed and keep my ack.

Series Acked-by: Bruce Richardson <bruce.richardson@intel.com>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v5 0/4] Cuckoo hash enhancements
  2016-09-30  7:38     ` [PATCH v4 0/4] Cuckoo hash enhancements Pablo de Lara
                         ` (5 preceding siblings ...)
  2016-10-03  9:59       ` Bruce Richardson
@ 2016-10-04 23:25       ` Pablo de Lara
  2016-10-04 23:25         ` [PATCH v5 1/4] hash: reorder hash structure Pablo de Lara
                           ` (4 more replies)
  6 siblings, 5 replies; 37+ messages in thread
From: Pablo de Lara @ 2016-10-04 23:25 UTC (permalink / raw)
  To: dev; +Cc: bruce.richardson, Pablo de Lara

This patchset improves lookup performance in the current hash library
by replacing the existing lookup bulk pipeline with an improved pipeline
based on a loop-and-jump model, instead of the current 4-stage 2-entry pipeline.
Also, x86 vectorized intrinsics are used to improve performance when comparing signatures.

First patch reorganizes the hash structure.
The structure takes more than one 64-byte cache line, but not all
the fields are used in the lookup operation (the most common operation).
Therefore, the fields used by lookup have been moved to the first part of the structure,
so they all fit in one cache line, slightly improving performance in some
scenarios.

Second patch modifies the order of the bucket structure.
Currently, the buckets store all the signatures together (current and alternative).
To be able to perform a vectorized signature comparison,
all current signatures have to be contiguous, so the bucket layout has been changed
to separate the current signatures from the alternative signatures.

Third patch introduces x86 vectorized intrinsics.
When performing a lookup bulk operation, all current signatures in a bucket
are compared against the signature of the key being looked up.
Now that they are all contiguous, a vectorized comparison can be performed,
which takes fewer instructions to carry out.
On machines with AVX2, the number of entries per bucket is
increased from 4 to 8, as AVX2 allows comparing two 256-bit values, each holding
8x32-bit integers, which covers the 8 signatures in the bucket.

Fourth (and last) patch modifies the current pipeline of the lookup bulk function.
The new pipeline is based on a loop-and-jump model. The two key improvements are:

- Better prefetching: the first 4 keys to be looked up are prefetched,
  and after that, the rest of the keys are prefetched while the calculation
  of the signatures is being performed. This gives the CPU more time to
  prefetch the data it requests before actually needing it, which results in fewer
  cache misses and, therefore, higher throughput.

- Lower performance penalty when using the fallback: the lookup bulk algorithm
  assumes that most of the time there will not be a collision in a bucket, but it might
  happen that two or more signatures are equal, which means that more than one
  key comparison might be necessary. In that case, only the key of the first hit is prefetched,
  as in the current implementation. The difference now is that if this comparison
  results in a miss, the information about the other keys to be compared has been stored,
  unlike the current implementation, which needs to perform an entire simple lookup again.

Changes in v5:
- Rebased against current HEAD
- Fixed checkpatch warnings

Changes in v4:
- Reordered hash structure, so alt signature is at the start
  of the next cache line, and explain in the commit message
  why it has been moved
- Reordered hash structure, so name field is on top of the structure,
  leaving all the fields used in lookup in the next cache line
  (instead of the first cache line)

Changes in v3:
- Corrected the cover letter (wrong number of patches)

Changes in v2:
- Increased entries per bucket from 4 to 8 for all cases,
  so it is not architecture dependent any longer.
- Replaced compile-time signature comparison function election
  with run-time election, so best optimization available
  will be used from a single binary.
- Reordered the hash structure, so all the fields used by lookup
  are in the same cache line (first).


Byron Marohn (3):
  hash: reorganize bucket structure
  hash: add vectorized comparison
  hash: modify lookup bulk pipeline

Pablo de Lara (1):
  hash: reorder hash structure

 lib/librte_hash/rte_cuckoo_hash.c     | 455 ++++++++++++++--------------------
 lib/librte_hash/rte_cuckoo_hash.h     |  68 ++---
 lib/librte_hash/rte_cuckoo_hash_x86.h |  20 +-
 3 files changed, 235 insertions(+), 308 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v5 1/4] hash: reorder hash structure
  2016-10-04 23:25       ` [PATCH v5 " Pablo de Lara
@ 2016-10-04 23:25         ` Pablo de Lara
  2016-10-04 23:25         ` [PATCH v5 2/4] hash: reorganize bucket structure Pablo de Lara
                           ` (3 subsequent siblings)
  4 siblings, 0 replies; 37+ messages in thread
From: Pablo de Lara @ 2016-10-04 23:25 UTC (permalink / raw)
  To: dev; +Cc: bruce.richardson, Pablo de Lara

In order to optimize lookup performance, the hash structure
is reordered so that all fields used for lookup are
in the first cache line.

Signed-off-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
Acked-by: Sameh Gobriel <sameh.gobriel@intel.com>
---
 lib/librte_hash/rte_cuckoo_hash.h | 36 +++++++++++++++++++++---------------
 1 file changed, 21 insertions(+), 15 deletions(-)

diff --git a/lib/librte_hash/rte_cuckoo_hash.h b/lib/librte_hash/rte_cuckoo_hash.h
index e290dab..27a47e5 100644
--- a/lib/librte_hash/rte_cuckoo_hash.h
+++ b/lib/librte_hash/rte_cuckoo_hash.h
@@ -185,30 +185,36 @@ struct rte_hash {
 	char name[RTE_HASH_NAMESIZE];   /**< Name of the hash. */
 	uint32_t entries;               /**< Total table entries. */
 	uint32_t num_buckets;           /**< Number of buckets in table. */
-	uint32_t key_len;               /**< Length of hash key. */
+
+	struct rte_ring *free_slots;
+	/**< Ring that stores all indexes of the free slots in the key table */
+	uint8_t hw_trans_mem_support;
+	/**< Hardware transactional memory support */
+	struct lcore_cache *local_free_slots;
+	/**< Local cache per lcore, storing some indexes of the free slots */
+	enum add_key_case add_key; /**< Multi-writer hash add behavior */
+
+	rte_spinlock_t *multiwriter_lock; /**< Multi-writer spinlock for w/o TM */
+
+	/* Fields used in lookup */
+
+	uint32_t key_len __rte_cache_aligned;
+	/**< Length of hash key. */
 	rte_hash_function hash_func;    /**< Function used to calculate hash. */
 	uint32_t hash_func_init_val;    /**< Init value used by hash_func. */
 	rte_hash_cmp_eq_t rte_hash_custom_cmp_eq;
 	/**< Custom function used to compare keys. */
 	enum cmp_jump_table_case cmp_jump_table_idx;
 	/**< Indicates which compare function to use. */
-	uint32_t bucket_bitmask;        /**< Bitmask for getting bucket index
-						from hash signature. */
+	uint32_t bucket_bitmask;
+	/**< Bitmask for getting bucket index from hash signature. */
 	uint32_t key_entry_size;         /**< Size of each key entry. */
 
-	struct rte_ring *free_slots;    /**< Ring that stores all indexes
-						of the free slots in the key table */
 	void *key_store;                /**< Table storing all keys and data */
-	struct rte_hash_bucket *buckets;	/**< Table with buckets storing all the
-							hash values and key indexes
-							to the key table*/
-	uint8_t hw_trans_mem_support;	/**< Hardware transactional
-							memory support */
-	struct lcore_cache *local_free_slots;
-	/**< Local cache per lcore, storing some indexes of the free slots */
-	enum add_key_case add_key; /**< Multi-writer hash add behavior */
-
-	rte_spinlock_t *multiwriter_lock; /**< Multi-writer spinlock for w/o TM */
+	struct rte_hash_bucket *buckets;
+	/**< Table with buckets storing all the	hash values and key indexes
+	 * to the key table.
+	 */
 } __rte_cache_aligned;
 
 struct queue_node {
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v5 2/4] hash: reorganize bucket structure
  2016-10-04 23:25       ` [PATCH v5 " Pablo de Lara
  2016-10-04 23:25         ` [PATCH v5 1/4] hash: reorder hash structure Pablo de Lara
@ 2016-10-04 23:25         ` Pablo de Lara
  2016-10-04 23:25         ` [PATCH v5 3/4] hash: add vectorized comparison Pablo de Lara
                           ` (2 subsequent siblings)
  4 siblings, 0 replies; 37+ messages in thread
From: Pablo de Lara @ 2016-10-04 23:25 UTC (permalink / raw)
  To: dev; +Cc: bruce.richardson, Byron Marohn, Saikrishna Edupuganti

From: Byron Marohn <byron.marohn@intel.com>

Move the current signatures of all entries together in the bucket,
and likewise all the alternative signatures, instead of having
the current and alternative signatures interleaved per entry.
This will be beneficial in the next commits, where a vectorized
comparison will be performed, achieving better performance.

The alternative signatures have been moved away from
the current signatures, so that the key indexes sit right after
the current signatures; these two fields are both used by lookup,
so they stay in the same cache line.

Signed-off-by: Byron Marohn <byron.marohn@intel.com>
Signed-off-by: Saikrishna Edupuganti <saikrishna.edupuganti@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
Acked-by: Sameh Gobriel <sameh.gobriel@intel.com>
---
 lib/librte_hash/rte_cuckoo_hash.c     | 43 ++++++++++++++++++-----------------
 lib/librte_hash/rte_cuckoo_hash.h     | 17 ++++----------
 lib/librte_hash/rte_cuckoo_hash_x86.h | 20 ++++++++--------
 3 files changed, 37 insertions(+), 43 deletions(-)

diff --git a/lib/librte_hash/rte_cuckoo_hash.c b/lib/librte_hash/rte_cuckoo_hash.c
index 4de4422..a7ee2b9 100644
--- a/lib/librte_hash/rte_cuckoo_hash.c
+++ b/lib/librte_hash/rte_cuckoo_hash.c
@@ -421,7 +421,7 @@ make_space_bucket(const struct rte_hash *h, struct rte_hash_bucket *bkt)
 	 */
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
 		/* Search for space in alternative locations */
-		next_bucket_idx = bkt->signatures[i].alt & h->bucket_bitmask;
+		next_bucket_idx = bkt->sig_alt[i] & h->bucket_bitmask;
 		next_bkt[i] = &h->buckets[next_bucket_idx];
 		for (j = 0; j < RTE_HASH_BUCKET_ENTRIES; j++) {
 			if (next_bkt[i]->key_idx[j] == EMPTY_SLOT)
@@ -434,8 +434,8 @@ make_space_bucket(const struct rte_hash *h, struct rte_hash_bucket *bkt)
 
 	/* Alternative location has spare room (end of recursive function) */
 	if (i != RTE_HASH_BUCKET_ENTRIES) {
-		next_bkt[i]->signatures[j].alt = bkt->signatures[i].current;
-		next_bkt[i]->signatures[j].current = bkt->signatures[i].alt;
+		next_bkt[i]->sig_alt[j] = bkt->sig_current[i];
+		next_bkt[i]->sig_current[j] = bkt->sig_alt[i];
 		next_bkt[i]->key_idx[j] = bkt->key_idx[i];
 		return i;
 	}
@@ -461,8 +461,8 @@ make_space_bucket(const struct rte_hash *h, struct rte_hash_bucket *bkt)
 	 */
 	bkt->flag[i] = 0;
 	if (ret >= 0) {
-		next_bkt[i]->signatures[ret].alt = bkt->signatures[i].current;
-		next_bkt[i]->signatures[ret].current = bkt->signatures[i].alt;
+		next_bkt[i]->sig_alt[ret] = bkt->sig_current[i];
+		next_bkt[i]->sig_current[ret] = bkt->sig_alt[i];
 		next_bkt[i]->key_idx[ret] = bkt->key_idx[i];
 		return i;
 	} else
@@ -544,8 +544,8 @@ __rte_hash_add_key_with_hash(const struct rte_hash *h, const void *key,
 
 	/* Check if key is already inserted in primary location */
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		if (prim_bkt->signatures[i].current == sig &&
-				prim_bkt->signatures[i].alt == alt_hash) {
+		if (prim_bkt->sig_current[i] == sig &&
+				prim_bkt->sig_alt[i] == alt_hash) {
 			k = (struct rte_hash_key *) ((char *)keys +
 					prim_bkt->key_idx[i] * h->key_entry_size);
 			if (rte_hash_cmp_eq(key, k->key, h) == 0) {
@@ -564,8 +564,8 @@ __rte_hash_add_key_with_hash(const struct rte_hash *h, const void *key,
 
 	/* Check if key is already inserted in secondary location */
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		if (sec_bkt->signatures[i].alt == sig &&
-				sec_bkt->signatures[i].current == alt_hash) {
+		if (sec_bkt->sig_alt[i] == sig &&
+				sec_bkt->sig_current[i] == alt_hash) {
 			k = (struct rte_hash_key *) ((char *)keys +
 					sec_bkt->key_idx[i] * h->key_entry_size);
 			if (rte_hash_cmp_eq(key, k->key, h) == 0) {
@@ -611,8 +611,8 @@ __rte_hash_add_key_with_hash(const struct rte_hash *h, const void *key,
 		for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
 			/* Check if slot is available */
 			if (likely(prim_bkt->key_idx[i] == EMPTY_SLOT)) {
-				prim_bkt->signatures[i].current = sig;
-				prim_bkt->signatures[i].alt = alt_hash;
+				prim_bkt->sig_current[i] = sig;
+				prim_bkt->sig_alt[i] = alt_hash;
 				prim_bkt->key_idx[i] = new_idx;
 				break;
 			}
@@ -632,8 +632,8 @@ __rte_hash_add_key_with_hash(const struct rte_hash *h, const void *key,
 		 */
 		ret = make_space_bucket(h, prim_bkt);
 		if (ret >= 0) {
-			prim_bkt->signatures[ret].current = sig;
-			prim_bkt->signatures[ret].alt = alt_hash;
+			prim_bkt->sig_current[ret] = sig;
+			prim_bkt->sig_alt[ret] = alt_hash;
 			prim_bkt->key_idx[ret] = new_idx;
 			if (h->add_key == ADD_KEY_MULTIWRITER)
 				rte_spinlock_unlock(h->multiwriter_lock);
@@ -707,7 +707,7 @@ __rte_hash_lookup_with_hash(const struct rte_hash *h, const void *key,
 
 	/* Check if key is in primary location */
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		if (bkt->signatures[i].current == sig &&
+		if (bkt->sig_current[i] == sig &&
 				bkt->key_idx[i] != EMPTY_SLOT) {
 			k = (struct rte_hash_key *) ((char *)keys +
 					bkt->key_idx[i] * h->key_entry_size);
@@ -730,8 +730,8 @@ __rte_hash_lookup_with_hash(const struct rte_hash *h, const void *key,
 
 	/* Check if key is in secondary location */
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		if (bkt->signatures[i].current == alt_hash &&
-				bkt->signatures[i].alt == sig) {
+		if (bkt->sig_current[i] == alt_hash &&
+				bkt->sig_alt[i] == sig) {
 			k = (struct rte_hash_key *) ((char *)keys +
 					bkt->key_idx[i] * h->key_entry_size);
 			if (rte_hash_cmp_eq(key, k->key, h) == 0) {
@@ -785,7 +785,8 @@ remove_entry(const struct rte_hash *h, struct rte_hash_bucket *bkt, unsigned i)
 	unsigned lcore_id, n_slots;
 	struct lcore_cache *cached_free_slots;
 
-	bkt->signatures[i].sig = NULL_SIGNATURE;
+	bkt->sig_current[i] = NULL_SIGNATURE;
+	bkt->sig_alt[i] = NULL_SIGNATURE;
 	if (h->hw_trans_mem_support) {
 		lcore_id = rte_lcore_id();
 		cached_free_slots = &h->local_free_slots[lcore_id];
@@ -823,7 +824,7 @@ __rte_hash_del_key_with_hash(const struct rte_hash *h, const void *key,
 
 	/* Check if key is in primary location */
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		if (bkt->signatures[i].current == sig &&
+		if (bkt->sig_current[i] == sig &&
 				bkt->key_idx[i] != EMPTY_SLOT) {
 			k = (struct rte_hash_key *) ((char *)keys +
 					bkt->key_idx[i] * h->key_entry_size);
@@ -848,7 +849,7 @@ __rte_hash_del_key_with_hash(const struct rte_hash *h, const void *key,
 
 	/* Check if key is in secondary location */
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		if (bkt->signatures[i].current == alt_hash &&
+		if (bkt->sig_current[i] == alt_hash &&
 				bkt->key_idx[i] != EMPTY_SLOT) {
 			k = (struct rte_hash_key *) ((char *)keys +
 					bkt->key_idx[i] * h->key_entry_size);
@@ -957,8 +958,8 @@ lookup_stage2(unsigned idx, hash_sig_t prim_hash, hash_sig_t sec_hash,
 	prim_hash_matches = 1 << RTE_HASH_BUCKET_ENTRIES;
 	sec_hash_matches = 1 << RTE_HASH_BUCKET_ENTRIES;
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		prim_hash_matches |= ((prim_hash == prim_bkt->signatures[i].current) << i);
-		sec_hash_matches |= ((sec_hash == sec_bkt->signatures[i].current) << i);
+		prim_hash_matches |= ((prim_hash == prim_bkt->sig_current[i]) << i);
+		sec_hash_matches |= ((sec_hash == sec_bkt->sig_current[i]) << i);
 	}
 
 	key_idx = prim_bkt->key_idx[__builtin_ctzl(prim_hash_matches)];
diff --git a/lib/librte_hash/rte_cuckoo_hash.h b/lib/librte_hash/rte_cuckoo_hash.h
index 27a47e5..6549731 100644
--- a/lib/librte_hash/rte_cuckoo_hash.h
+++ b/lib/librte_hash/rte_cuckoo_hash.h
@@ -151,17 +151,6 @@ struct lcore_cache {
 	void *objs[LCORE_CACHE_SIZE]; /**< Cache objects */
 } __rte_cache_aligned;
 
-/* Structure storing both primary and secondary hashes */
-struct rte_hash_signatures {
-	union {
-		struct {
-			hash_sig_t current;
-			hash_sig_t alt;
-		};
-		uint64_t sig;
-	};
-};
-
 /* Structure that stores key-value pair */
 struct rte_hash_key {
 	union {
@@ -174,9 +163,13 @@ struct rte_hash_key {
 
 /** Bucket structure */
 struct rte_hash_bucket {
-	struct rte_hash_signatures signatures[RTE_HASH_BUCKET_ENTRIES];
+	hash_sig_t sig_current[RTE_HASH_BUCKET_ENTRIES];
+
 	/* Includes dummy key index that always contains index 0 */
 	uint32_t key_idx[RTE_HASH_BUCKET_ENTRIES + 1];
+
+	hash_sig_t sig_alt[RTE_HASH_BUCKET_ENTRIES];
+
 	uint8_t flag[RTE_HASH_BUCKET_ENTRIES];
 } __rte_cache_aligned;
 
diff --git a/lib/librte_hash/rte_cuckoo_hash_x86.h b/lib/librte_hash/rte_cuckoo_hash_x86.h
index 7ffa56f..47aec6d 100644
--- a/lib/librte_hash/rte_cuckoo_hash_x86.h
+++ b/lib/librte_hash/rte_cuckoo_hash_x86.h
@@ -54,8 +54,8 @@ rte_hash_cuckoo_insert_mw_tm(struct rte_hash_bucket *prim_bkt,
 			for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
 				/* Check if slot is available */
 				if (likely(prim_bkt->key_idx[i] == EMPTY_SLOT)) {
-					prim_bkt->signatures[i].current = sig;
-					prim_bkt->signatures[i].alt = alt_hash;
+					prim_bkt->sig_current[i] = sig;
+					prim_bkt->sig_alt[i] = alt_hash;
 					prim_bkt->key_idx[i] = new_idx;
 					break;
 				}
@@ -101,7 +101,7 @@ rte_hash_cuckoo_move_insert_mw_tm(const struct rte_hash *h,
 				prev_slot = curr_node->prev_slot;
 
 				prev_alt_bkt_idx
-					= prev_bkt->signatures[prev_slot].alt
+					= prev_bkt->sig_alt[prev_slot]
 					    & h->bucket_bitmask;
 
 				if (unlikely(&h->buckets[prev_alt_bkt_idx]
@@ -113,10 +113,10 @@ rte_hash_cuckoo_move_insert_mw_tm(const struct rte_hash *h,
 				 * Cuckoo insert to move elements back to its
 				 * primary bucket if available
 				 */
-				curr_bkt->signatures[curr_slot].alt =
-				    prev_bkt->signatures[prev_slot].current;
-				curr_bkt->signatures[curr_slot].current =
-				    prev_bkt->signatures[prev_slot].alt;
+				curr_bkt->sig_alt[curr_slot] =
+				    prev_bkt->sig_current[prev_slot];
+				curr_bkt->sig_current[curr_slot] =
+				    prev_bkt->sig_alt[prev_slot];
 				curr_bkt->key_idx[curr_slot]
 				    = prev_bkt->key_idx[prev_slot];
 
@@ -125,8 +125,8 @@ rte_hash_cuckoo_move_insert_mw_tm(const struct rte_hash *h,
 				curr_bkt = curr_node->bkt;
 			}
 
-			curr_bkt->signatures[curr_slot].current = sig;
-			curr_bkt->signatures[curr_slot].alt = alt_hash;
+			curr_bkt->sig_current[curr_slot] = sig;
+			curr_bkt->sig_alt[curr_slot] = alt_hash;
 			curr_bkt->key_idx[curr_slot] = new_idx;
 
 			rte_xend();
@@ -178,7 +178,7 @@ rte_hash_cuckoo_make_space_mw_tm(const struct rte_hash *h,
 			}
 
 			/* Enqueue new node and keep prev node info */
-			alt_bkt = &(h->buckets[curr_bkt->signatures[i].alt
+			alt_bkt = &(h->buckets[curr_bkt->sig_alt[i]
 						    & h->bucket_bitmask]);
 			head->bkt = alt_bkt;
 			head->prev = tail;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v5 3/4] hash: add vectorized comparison
  2016-10-04 23:25       ` [PATCH v5 " Pablo de Lara
  2016-10-04 23:25         ` [PATCH v5 1/4] hash: reorder hash structure Pablo de Lara
  2016-10-04 23:25         ` [PATCH v5 2/4] hash: reorganize bucket structure Pablo de Lara
@ 2016-10-04 23:25         ` Pablo de Lara
  2016-10-04 23:25         ` [PATCH v5 4/4] hash: modify lookup bulk pipeline Pablo de Lara
  2016-10-05 10:12         ` [PATCH v5 0/4] Cuckoo hash enhancements Thomas Monjalon
  4 siblings, 0 replies; 37+ messages in thread
From: Pablo de Lara @ 2016-10-04 23:25 UTC (permalink / raw)
  To: dev; +Cc: bruce.richardson, Byron Marohn, Saikrishna Edupuganti, Pablo de Lara

From: Byron Marohn <byron.marohn@intel.com>

In the lookup bulk function, the signatures of all entries
are compared against the signature of the key being looked up.
Now that all the current signatures are stored contiguously, they can
be compared with vector instructions (SSE, AVX2), achieving higher
lookup performance.

Also, the number of entries per bucket is increased from 4 to 8 on
processors with AVX2, since a 256-bit register holds eight 32-bit
signatures, which can all be compared at once.

Signed-off-by: Byron Marohn <byron.marohn@intel.com>
Signed-off-by: Saikrishna Edupuganti <saikrishna.edupuganti@intel.com>
Signed-off-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
Acked-by: Sameh Gobriel <sameh.gobriel@intel.com>
---
 lib/librte_hash/rte_cuckoo_hash.c | 76 +++++++++++++++++++++++++++++++++++----
 lib/librte_hash/rte_cuckoo_hash.h | 12 ++++++-
 2 files changed, 81 insertions(+), 7 deletions(-)

diff --git a/lib/librte_hash/rte_cuckoo_hash.c b/lib/librte_hash/rte_cuckoo_hash.c
index a7ee2b9..d762f36 100644
--- a/lib/librte_hash/rte_cuckoo_hash.c
+++ b/lib/librte_hash/rte_cuckoo_hash.c
@@ -284,6 +284,15 @@ rte_hash_create(const struct rte_hash_parameters *params)
 	h->free_slots = r;
 	h->hw_trans_mem_support = hw_trans_mem_support;
 
+#if defined(RTE_ARCH_X86)
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
+		h->sig_cmp_fn = RTE_HASH_COMPARE_AVX2;
+	else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE2))
+		h->sig_cmp_fn = RTE_HASH_COMPARE_SSE;
+	else
+#endif
+		h->sig_cmp_fn = RTE_HASH_COMPARE_SCALAR;
+
 	/* Turn on multi-writer only with explicit flat from user and TM
 	 * support.
 	 */
@@ -940,6 +949,62 @@ lookup_stage1(unsigned idx, hash_sig_t *prim_hash, hash_sig_t *sec_hash,
 	rte_prefetch0(*secondary_bkt);
 }
 
+static inline void
+compare_signatures(unsigned int *prim_hash_matches,
+			unsigned int *sec_hash_matches,
+			const struct rte_hash_bucket *prim_bkt,
+			const struct rte_hash_bucket *sec_bkt,
+			hash_sig_t prim_hash, hash_sig_t sec_hash,
+			enum rte_hash_sig_compare_function sig_cmp_fn)
+{
+	unsigned int i;
+
+	switch (sig_cmp_fn) {
+#ifdef RTE_MACHINE_CPUFLAG_AVX2
+	case RTE_HASH_COMPARE_AVX2:
+		*prim_hash_matches |= _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
+				_mm256_load_si256(
+					(__m256i const *)prim_bkt->sig_current),
+				_mm256_set1_epi32(prim_hash)));
+		*sec_hash_matches |= _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
+				_mm256_load_si256(
+					(__m256i const *)sec_bkt->sig_current),
+				_mm256_set1_epi32(sec_hash)));
+		break;
+#endif
+#ifdef RTE_MACHINE_CPUFLAG_SSE2
+	case RTE_HASH_COMPARE_SSE:
+		/* Compare the first 4 signatures in the bucket */
+		*prim_hash_matches |= _mm_movemask_ps((__m128)_mm_cmpeq_epi16(
+				_mm_load_si128(
+					(__m128i const *)prim_bkt->sig_current),
+				_mm_set1_epi32(prim_hash)));
+		*prim_hash_matches |= (_mm_movemask_ps((__m128)_mm_cmpeq_epi16(
+				_mm_load_si128(
+					(__m128i const *)&prim_bkt->sig_current[4]),
+				_mm_set1_epi32(prim_hash)))) << 4;
+		/* Compare the first 4 signatures in the bucket */
+		*sec_hash_matches |= _mm_movemask_ps((__m128)_mm_cmpeq_epi16(
+				_mm_load_si128(
+					(__m128i const *)sec_bkt->sig_current),
+				_mm_set1_epi32(sec_hash)));
+		*sec_hash_matches |= (_mm_movemask_ps((__m128)_mm_cmpeq_epi16(
+				_mm_load_si128(
+					(__m128i const *)&sec_bkt->sig_current[4]),
+				_mm_set1_epi32(sec_hash)))) << 4;
+		break;
+#endif
+	default:
+		for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
+			*prim_hash_matches |=
+				((prim_hash == prim_bkt->sig_current[i]) << i);
+			*sec_hash_matches |=
+				((sec_hash == sec_bkt->sig_current[i]) << i);
+		}
+	}
+
+}
+
 /*
  * Lookup bulk stage 2:  Search for match hashes in primary/secondary locations
  * and prefetch first key slot
@@ -952,15 +1017,14 @@ lookup_stage2(unsigned idx, hash_sig_t prim_hash, hash_sig_t sec_hash,
 		uint64_t *extra_hits_mask, const void *keys,
 		const struct rte_hash *h)
 {
-	unsigned prim_hash_matches, sec_hash_matches, key_idx, i;
-	unsigned total_hash_matches;
+	unsigned int prim_hash_matches, sec_hash_matches, key_idx;
+	unsigned int total_hash_matches;
 
 	prim_hash_matches = 1 << RTE_HASH_BUCKET_ENTRIES;
 	sec_hash_matches = 1 << RTE_HASH_BUCKET_ENTRIES;
-	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		prim_hash_matches |= ((prim_hash == prim_bkt->sig_current[i]) << i);
-		sec_hash_matches |= ((sec_hash == sec_bkt->sig_current[i]) << i);
-	}
+
+	compare_signatures(&prim_hash_matches, &sec_hash_matches, prim_bkt,
+				sec_bkt, prim_hash, sec_hash, h->sig_cmp_fn);
 
 	key_idx = prim_bkt->key_idx[__builtin_ctzl(prim_hash_matches)];
 	if (key_idx == 0)
diff --git a/lib/librte_hash/rte_cuckoo_hash.h b/lib/librte_hash/rte_cuckoo_hash.h
index 6549731..504661d 100644
--- a/lib/librte_hash/rte_cuckoo_hash.h
+++ b/lib/librte_hash/rte_cuckoo_hash.h
@@ -130,7 +130,7 @@ enum add_key_case {
 };
 
 /** Number of items per bucket. */
-#define RTE_HASH_BUCKET_ENTRIES		4
+#define RTE_HASH_BUCKET_ENTRIES		8
 
 #define NULL_SIGNATURE			0
 
@@ -161,6 +161,14 @@ struct rte_hash_key {
 	char key[0];
 } __attribute__((aligned(KEY_ALIGNMENT)));
 
+/* All different signature compare functions */
+enum rte_hash_sig_compare_function {
+	RTE_HASH_COMPARE_SCALAR = 0,
+	RTE_HASH_COMPARE_SSE,
+	RTE_HASH_COMPARE_AVX2,
+	RTE_HASH_COMPARE_NUM
+};
+
 /** Bucket structure */
 struct rte_hash_bucket {
 	hash_sig_t sig_current[RTE_HASH_BUCKET_ENTRIES];
@@ -199,6 +207,8 @@ struct rte_hash {
 	/**< Custom function used to compare keys. */
 	enum cmp_jump_table_case cmp_jump_table_idx;
 	/**< Indicates which compare function to use. */
+	enum rte_hash_sig_compare_function sig_cmp_fn;
+	/**< Indicates which signature compare function to use. */
 	uint32_t bucket_bitmask;
 	/**< Bitmask for getting bucket index from hash signature. */
 	uint32_t key_entry_size;         /**< Size of each key entry. */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v5 4/4] hash: modify lookup bulk pipeline
  2016-10-04 23:25       ` [PATCH v5 " Pablo de Lara
                           ` (2 preceding siblings ...)
  2016-10-04 23:25         ` [PATCH v5 3/4] hash: add vectorized comparison Pablo de Lara
@ 2016-10-04 23:25         ` Pablo de Lara
  2016-10-05 10:12         ` [PATCH v5 0/4] Cuckoo hash enhancements Thomas Monjalon
  4 siblings, 0 replies; 37+ messages in thread
From: Pablo de Lara @ 2016-10-04 23:25 UTC (permalink / raw)
  To: dev; +Cc: bruce.richardson, Byron Marohn, Saikrishna Edupuganti, Pablo de Lara

From: Byron Marohn <byron.marohn@intel.com>

This patch replaces the pipelined rte_hash lookup mechanism with a
loop-and-jump model, which performs significantly better,
especially for smaller table sizes and smaller table occupancies.

Signed-off-by: Byron Marohn <byron.marohn@intel.com>
Signed-off-by: Saikrishna Edupuganti <saikrishna.edupuganti@intel.com>
Signed-off-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
Acked-by: Sameh Gobriel <sameh.gobriel@intel.com>
---
 lib/librte_hash/rte_cuckoo_hash.c | 378 ++++++++++++--------------------------
 lib/librte_hash/rte_cuckoo_hash.h |   3 +-
 2 files changed, 117 insertions(+), 264 deletions(-)

diff --git a/lib/librte_hash/rte_cuckoo_hash.c b/lib/librte_hash/rte_cuckoo_hash.c
index d762f36..3324b17 100644
--- a/lib/librte_hash/rte_cuckoo_hash.c
+++ b/lib/librte_hash/rte_cuckoo_hash.c
@@ -914,44 +914,8 @@ rte_hash_get_key_with_position(const struct rte_hash *h, const int32_t position,
 	return 0;
 }
 
-/* Lookup bulk stage 0: Prefetch input key */
 static inline void
-lookup_stage0(unsigned *idx, uint64_t *lookup_mask,
-		const void * const *keys)
-{
-	*idx = __builtin_ctzl(*lookup_mask);
-	if (*lookup_mask == 0)
-		*idx = 0;
-
-	rte_prefetch0(keys[*idx]);
-	*lookup_mask &= ~(1llu << *idx);
-}
-
-/*
- * Lookup bulk stage 1: Calculate primary/secondary hashes
- * and prefetch primary/secondary buckets
- */
-static inline void
-lookup_stage1(unsigned idx, hash_sig_t *prim_hash, hash_sig_t *sec_hash,
-		const struct rte_hash_bucket **primary_bkt,
-		const struct rte_hash_bucket **secondary_bkt,
-		hash_sig_t *hash_vals, const void * const *keys,
-		const struct rte_hash *h)
-{
-	*prim_hash = rte_hash_hash(h, keys[idx]);
-	hash_vals[idx] = *prim_hash;
-	*sec_hash = rte_hash_secondary_hash(*prim_hash);
-
-	*primary_bkt = &h->buckets[*prim_hash & h->bucket_bitmask];
-	*secondary_bkt = &h->buckets[*sec_hash & h->bucket_bitmask];
-
-	rte_prefetch0(*primary_bkt);
-	rte_prefetch0(*secondary_bkt);
-}
-
-static inline void
-compare_signatures(unsigned int *prim_hash_matches,
-			unsigned int *sec_hash_matches,
+compare_signatures(uint32_t *prim_hash_matches, uint32_t *sec_hash_matches,
 			const struct rte_hash_bucket *prim_bkt,
 			const struct rte_hash_bucket *sec_bkt,
 			hash_sig_t prim_hash, hash_sig_t sec_hash,
@@ -962,11 +926,11 @@ compare_signatures(unsigned int *prim_hash_matches,
 	switch (sig_cmp_fn) {
 #ifdef RTE_MACHINE_CPUFLAG_AVX2
 	case RTE_HASH_COMPARE_AVX2:
-		*prim_hash_matches |= _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
+		*prim_hash_matches = _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
 				_mm256_load_si256(
 					(__m256i const *)prim_bkt->sig_current),
 				_mm256_set1_epi32(prim_hash)));
-		*sec_hash_matches |= _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
+		*sec_hash_matches = _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
 				_mm256_load_si256(
 					(__m256i const *)sec_bkt->sig_current),
 				_mm256_set1_epi32(sec_hash)));
@@ -975,7 +939,7 @@ compare_signatures(unsigned int *prim_hash_matches,
 #ifdef RTE_MACHINE_CPUFLAG_SSE2
 	case RTE_HASH_COMPARE_SSE:
 		/* Compare the first 4 signatures in the bucket */
-		*prim_hash_matches |= _mm_movemask_ps((__m128)_mm_cmpeq_epi16(
+		*prim_hash_matches = _mm_movemask_ps((__m128)_mm_cmpeq_epi16(
 				_mm_load_si128(
 					(__m128i const *)prim_bkt->sig_current),
 				_mm_set1_epi32(prim_hash)));
@@ -984,7 +948,7 @@ compare_signatures(unsigned int *prim_hash_matches,
 					(__m128i const *)&prim_bkt->sig_current[4]),
 				_mm_set1_epi32(prim_hash)))) << 4;
 		/* Compare the first 4 signatures in the bucket */
-		*sec_hash_matches |= _mm_movemask_ps((__m128)_mm_cmpeq_epi16(
+		*sec_hash_matches = _mm_movemask_ps((__m128)_mm_cmpeq_epi16(
 				_mm_load_si128(
 					(__m128i const *)sec_bkt->sig_current),
 				_mm_set1_epi32(sec_hash)));
@@ -1005,244 +969,134 @@ compare_signatures(unsigned int *prim_hash_matches,
 
 }
 
-/*
- * Lookup bulk stage 2:  Search for match hashes in primary/secondary locations
- * and prefetch first key slot
- */
+#define PREFETCH_OFFSET 4
 static inline void
-lookup_stage2(unsigned idx, hash_sig_t prim_hash, hash_sig_t sec_hash,
-		const struct rte_hash_bucket *prim_bkt,
-		const struct rte_hash_bucket *sec_bkt,
-		const struct rte_hash_key **key_slot, int32_t *positions,
-		uint64_t *extra_hits_mask, const void *keys,
-		const struct rte_hash *h)
+__rte_hash_lookup_bulk(const struct rte_hash *h, const void **keys,
+			int32_t num_keys, int32_t *positions,
+			uint64_t *hit_mask, void *data[])
 {
-	unsigned int prim_hash_matches, sec_hash_matches, key_idx;
-	unsigned int total_hash_matches;
+	uint64_t hits = 0;
+	int32_t i;
+	uint32_t prim_hash[RTE_HASH_LOOKUP_BULK_MAX];
+	uint32_t sec_hash[RTE_HASH_LOOKUP_BULK_MAX];
+	const struct rte_hash_bucket *primary_bkt[RTE_HASH_LOOKUP_BULK_MAX];
+	const struct rte_hash_bucket *secondary_bkt[RTE_HASH_LOOKUP_BULK_MAX];
+	uint32_t prim_hitmask[RTE_HASH_LOOKUP_BULK_MAX] = {0};
+	uint32_t sec_hitmask[RTE_HASH_LOOKUP_BULK_MAX] = {0};
+
+	/* Prefetch first keys */
+	for (i = 0; i < PREFETCH_OFFSET && i < num_keys; i++)
+		rte_prefetch0(keys[i]);
 
-	prim_hash_matches = 1 << RTE_HASH_BUCKET_ENTRIES;
-	sec_hash_matches = 1 << RTE_HASH_BUCKET_ENTRIES;
+	/*
+	 * Prefetch rest of the keys, calculate primary and
+	 * secondary bucket and prefetch them
+	 */
+	for (i = 0; i < (num_keys - PREFETCH_OFFSET); i++) {
+		rte_prefetch0(keys[i + PREFETCH_OFFSET]);
 
-	compare_signatures(&prim_hash_matches, &sec_hash_matches, prim_bkt,
-				sec_bkt, prim_hash, sec_hash, h->sig_cmp_fn);
+		prim_hash[i] = rte_hash_hash(h, keys[i]);
+		sec_hash[i] = rte_hash_secondary_hash(prim_hash[i]);
 
-	key_idx = prim_bkt->key_idx[__builtin_ctzl(prim_hash_matches)];
-	if (key_idx == 0)
-		key_idx = sec_bkt->key_idx[__builtin_ctzl(sec_hash_matches)];
+		primary_bkt[i] = &h->buckets[prim_hash[i] & h->bucket_bitmask];
+		secondary_bkt[i] = &h->buckets[sec_hash[i] & h->bucket_bitmask];
 
-	total_hash_matches = (prim_hash_matches |
-				(sec_hash_matches << (RTE_HASH_BUCKET_ENTRIES + 1)));
-	*key_slot = (const struct rte_hash_key *) ((const char *)keys +
-					key_idx * h->key_entry_size);
+		rte_prefetch0(primary_bkt[i]);
+		rte_prefetch0(secondary_bkt[i]);
+	}
 
-	rte_prefetch0(*key_slot);
-	/*
-	 * Return index where key is stored,
-	 * substracting the first dummy index
-	 */
-	positions[idx] = (key_idx - 1);
+	/* Calculate and prefetch rest of the buckets */
+	for (; i < num_keys; i++) {
+		prim_hash[i] = rte_hash_hash(h, keys[i]);
+		sec_hash[i] = rte_hash_secondary_hash(prim_hash[i]);
 
-	*extra_hits_mask |= (uint64_t)(__builtin_popcount(total_hash_matches) > 3) << idx;
+		primary_bkt[i] = &h->buckets[prim_hash[i] & h->bucket_bitmask];
+		secondary_bkt[i] = &h->buckets[sec_hash[i] & h->bucket_bitmask];
 
-}
+		rte_prefetch0(primary_bkt[i]);
+		rte_prefetch0(secondary_bkt[i]);
+	}
 
+	/* Compare signatures and prefetch key slot of first hit */
+	for (i = 0; i < num_keys; i++) {
+		compare_signatures(&prim_hitmask[i], &sec_hitmask[i],
+				primary_bkt[i], secondary_bkt[i],
+				prim_hash[i], sec_hash[i], h->sig_cmp_fn);
+
+		if (prim_hitmask[i]) {
+			uint32_t first_hit = __builtin_ctzl(prim_hitmask[i]);
+			uint32_t key_idx = primary_bkt[i]->key_idx[first_hit];
+			const struct rte_hash_key *key_slot =
+				(const struct rte_hash_key *)(
+				(const char *)h->key_store +
+				key_idx * h->key_entry_size);
+			rte_prefetch0(key_slot);
+			continue;
+		}
 
-/* Lookup bulk stage 3: Check if key matches, update hit mask and return data */
-static inline void
-lookup_stage3(unsigned idx, const struct rte_hash_key *key_slot, const void * const *keys,
-		const int32_t *positions, void *data[], uint64_t *hits,
-		const struct rte_hash *h)
-{
-	unsigned hit;
-	unsigned key_idx;
+		if (sec_hitmask[i]) {
+			uint32_t first_hit = __builtin_ctzl(sec_hitmask[i]);
+			uint32_t key_idx = secondary_bkt[i]->key_idx[first_hit];
+			const struct rte_hash_key *key_slot =
+				(const struct rte_hash_key *)(
+				(const char *)h->key_store +
+				key_idx * h->key_entry_size);
+			rte_prefetch0(key_slot);
+		}
+	}
 
-	hit = !rte_hash_cmp_eq(key_slot->key, keys[idx], h);
-	if (data != NULL)
-		data[idx] = key_slot->pdata;
+	/* Compare keys, first hits in primary first */
+	for (i = 0; i < num_keys; i++) {
+		positions[i] = -ENOENT;
+		while (prim_hitmask[i]) {
+			uint32_t hit_index = __builtin_ctzl(prim_hitmask[i]);
+
+			uint32_t key_idx = primary_bkt[i]->key_idx[hit_index];
+			const struct rte_hash_key *key_slot =
+				(const struct rte_hash_key *)(
+				(const char *)h->key_store +
+				key_idx * h->key_entry_size);
+			/*
+			 * If key index is 0, do not compare key,
+			 * as it is checking the dummy slot
+			 */
+			if (!!key_idx & !rte_hash_cmp_eq(key_slot->key, keys[i], h)) {
+				if (data != NULL)
+					data[i] = key_slot->pdata;
 
-	key_idx = positions[idx] + 1;
-	/*
-	 * If key index is 0, force hit to be 0, in case key to be looked up
-	 * is all zero (as in the dummy slot), which would result in a wrong hit
-	 */
-	*hits |= (uint64_t)(hit && !!key_idx)  << idx;
-}
+				hits |= 1ULL << i;
+				positions[i] = key_idx - 1;
+				goto next_key;
+			}
+			prim_hitmask[i] &= ~(1 << (hit_index));
+		}
 
-static inline void
-__rte_hash_lookup_bulk(const struct rte_hash *h, const void **keys,
-			uint32_t num_keys, int32_t *positions,
-			uint64_t *hit_mask, void *data[])
-{
-	uint64_t hits = 0;
-	uint64_t extra_hits_mask = 0;
-	uint64_t lookup_mask, miss_mask;
-	unsigned idx;
-	const void *key_store = h->key_store;
-	int ret;
-	hash_sig_t hash_vals[RTE_HASH_LOOKUP_BULK_MAX];
-
-	unsigned idx00, idx01, idx10, idx11, idx20, idx21, idx30, idx31;
-	const struct rte_hash_bucket *primary_bkt10, *primary_bkt11;
-	const struct rte_hash_bucket *secondary_bkt10, *secondary_bkt11;
-	const struct rte_hash_bucket *primary_bkt20, *primary_bkt21;
-	const struct rte_hash_bucket *secondary_bkt20, *secondary_bkt21;
-	const struct rte_hash_key *k_slot20, *k_slot21, *k_slot30, *k_slot31;
-	hash_sig_t primary_hash10, primary_hash11;
-	hash_sig_t secondary_hash10, secondary_hash11;
-	hash_sig_t primary_hash20, primary_hash21;
-	hash_sig_t secondary_hash20, secondary_hash21;
-
-	lookup_mask = (uint64_t) -1 >> (64 - num_keys);
-	miss_mask = lookup_mask;
-
-	lookup_stage0(&idx00, &lookup_mask, keys);
-	lookup_stage0(&idx01, &lookup_mask, keys);
-
-	idx10 = idx00, idx11 = idx01;
-
-	lookup_stage0(&idx00, &lookup_mask, keys);
-	lookup_stage0(&idx01, &lookup_mask, keys);
-	lookup_stage1(idx10, &primary_hash10, &secondary_hash10,
-			&primary_bkt10, &secondary_bkt10, hash_vals, keys, h);
-	lookup_stage1(idx11, &primary_hash11, &secondary_hash11,
-			&primary_bkt11,	&secondary_bkt11, hash_vals, keys, h);
-
-	primary_bkt20 = primary_bkt10;
-	primary_bkt21 = primary_bkt11;
-	secondary_bkt20 = secondary_bkt10;
-	secondary_bkt21 = secondary_bkt11;
-	primary_hash20 = primary_hash10;
-	primary_hash21 = primary_hash11;
-	secondary_hash20 = secondary_hash10;
-	secondary_hash21 = secondary_hash11;
-	idx20 = idx10, idx21 = idx11;
-	idx10 = idx00, idx11 = idx01;
-
-	lookup_stage0(&idx00, &lookup_mask, keys);
-	lookup_stage0(&idx01, &lookup_mask, keys);
-	lookup_stage1(idx10, &primary_hash10, &secondary_hash10,
-			&primary_bkt10, &secondary_bkt10, hash_vals, keys, h);
-	lookup_stage1(idx11, &primary_hash11, &secondary_hash11,
-			&primary_bkt11,	&secondary_bkt11, hash_vals, keys, h);
-	lookup_stage2(idx20, primary_hash20, secondary_hash20, primary_bkt20,
-			secondary_bkt20, &k_slot20, positions, &extra_hits_mask,
-			key_store, h);
-	lookup_stage2(idx21, primary_hash21, secondary_hash21, primary_bkt21,
-			secondary_bkt21, &k_slot21, positions, &extra_hits_mask,
-			key_store, h);
-
-	while (lookup_mask) {
-		k_slot30 = k_slot20, k_slot31 = k_slot21;
-		idx30 = idx20, idx31 = idx21;
-		primary_bkt20 = primary_bkt10;
-		primary_bkt21 = primary_bkt11;
-		secondary_bkt20 = secondary_bkt10;
-		secondary_bkt21 = secondary_bkt11;
-		primary_hash20 = primary_hash10;
-		primary_hash21 = primary_hash11;
-		secondary_hash20 = secondary_hash10;
-		secondary_hash21 = secondary_hash11;
-		idx20 = idx10, idx21 = idx11;
-		idx10 = idx00, idx11 = idx01;
-
-		lookup_stage0(&idx00, &lookup_mask, keys);
-		lookup_stage0(&idx01, &lookup_mask, keys);
-		lookup_stage1(idx10, &primary_hash10, &secondary_hash10,
-			&primary_bkt10, &secondary_bkt10, hash_vals, keys, h);
-		lookup_stage1(idx11, &primary_hash11, &secondary_hash11,
-			&primary_bkt11,	&secondary_bkt11, hash_vals, keys, h);
-		lookup_stage2(idx20, primary_hash20, secondary_hash20,
-			primary_bkt20, secondary_bkt20, &k_slot20, positions,
-			&extra_hits_mask, key_store, h);
-		lookup_stage2(idx21, primary_hash21, secondary_hash21,
-			primary_bkt21, secondary_bkt21,	&k_slot21, positions,
-			&extra_hits_mask, key_store, h);
-		lookup_stage3(idx30, k_slot30, keys, positions, data, &hits, h);
-		lookup_stage3(idx31, k_slot31, keys, positions, data, &hits, h);
-	}
+		while (sec_hitmask[i]) {
+			uint32_t hit_index = __builtin_ctzl(sec_hitmask[i]);
+
+			uint32_t key_idx = secondary_bkt[i]->key_idx[hit_index];
+			const struct rte_hash_key *key_slot =
+				(const struct rte_hash_key *)(
+				(const char *)h->key_store +
+				key_idx * h->key_entry_size);
+			/*
+			 * If key index is 0, do not compare key,
+			 * as it is checking the dummy slot
+			 */
+
+			if (!!key_idx & !rte_hash_cmp_eq(key_slot->key, keys[i], h)) {
+				if (data != NULL)
+					data[i] = key_slot->pdata;
 
-	k_slot30 = k_slot20, k_slot31 = k_slot21;
-	idx30 = idx20, idx31 = idx21;
-	primary_bkt20 = primary_bkt10;
-	primary_bkt21 = primary_bkt11;
-	secondary_bkt20 = secondary_bkt10;
-	secondary_bkt21 = secondary_bkt11;
-	primary_hash20 = primary_hash10;
-	primary_hash21 = primary_hash11;
-	secondary_hash20 = secondary_hash10;
-	secondary_hash21 = secondary_hash11;
-	idx20 = idx10, idx21 = idx11;
-	idx10 = idx00, idx11 = idx01;
-
-	lookup_stage1(idx10, &primary_hash10, &secondary_hash10,
-		&primary_bkt10, &secondary_bkt10, hash_vals, keys, h);
-	lookup_stage1(idx11, &primary_hash11, &secondary_hash11,
-		&primary_bkt11,	&secondary_bkt11, hash_vals, keys, h);
-	lookup_stage2(idx20, primary_hash20, secondary_hash20, primary_bkt20,
-		secondary_bkt20, &k_slot20, positions, &extra_hits_mask,
-		key_store, h);
-	lookup_stage2(idx21, primary_hash21, secondary_hash21, primary_bkt21,
-		secondary_bkt21, &k_slot21, positions, &extra_hits_mask,
-		key_store, h);
-	lookup_stage3(idx30, k_slot30, keys, positions, data, &hits, h);
-	lookup_stage3(idx31, k_slot31, keys, positions, data, &hits, h);
-
-	k_slot30 = k_slot20, k_slot31 = k_slot21;
-	idx30 = idx20, idx31 = idx21;
-	primary_bkt20 = primary_bkt10;
-	primary_bkt21 = primary_bkt11;
-	secondary_bkt20 = secondary_bkt10;
-	secondary_bkt21 = secondary_bkt11;
-	primary_hash20 = primary_hash10;
-	primary_hash21 = primary_hash11;
-	secondary_hash20 = secondary_hash10;
-	secondary_hash21 = secondary_hash11;
-	idx20 = idx10, idx21 = idx11;
-
-	lookup_stage2(idx20, primary_hash20, secondary_hash20, primary_bkt20,
-		secondary_bkt20, &k_slot20, positions, &extra_hits_mask,
-		key_store, h);
-	lookup_stage2(idx21, primary_hash21, secondary_hash21, primary_bkt21,
-		secondary_bkt21, &k_slot21, positions, &extra_hits_mask,
-		key_store, h);
-	lookup_stage3(idx30, k_slot30, keys, positions, data, &hits, h);
-	lookup_stage3(idx31, k_slot31, keys, positions, data, &hits, h);
-
-	k_slot30 = k_slot20, k_slot31 = k_slot21;
-	idx30 = idx20, idx31 = idx21;
-
-	lookup_stage3(idx30, k_slot30, keys, positions, data, &hits, h);
-	lookup_stage3(idx31, k_slot31, keys, positions, data, &hits, h);
-
-	/* ignore any items we have already found */
-	extra_hits_mask &= ~hits;
-
-	if (unlikely(extra_hits_mask)) {
-		/* run a single search for each remaining item */
-		do {
-			idx = __builtin_ctzl(extra_hits_mask);
-			if (data != NULL) {
-				ret = rte_hash_lookup_with_hash_data(h,
-						keys[idx], hash_vals[idx], &data[idx]);
-				if (ret >= 0)
-					hits |= 1ULL << idx;
-			} else {
-				positions[idx] = rte_hash_lookup_with_hash(h,
-							keys[idx], hash_vals[idx]);
-				if (positions[idx] >= 0)
-					hits |= 1llu << idx;
+				hits |= 1ULL << i;
+				positions[i] = key_idx - 1;
+				goto next_key;
 			}
-			extra_hits_mask &= ~(1llu << idx);
-		} while (extra_hits_mask);
-	}
+			sec_hitmask[i] &= ~(1 << (hit_index));
+		}
 
-	miss_mask &= ~hits;
-	if (unlikely(miss_mask)) {
-		do {
-			idx = __builtin_ctzl(miss_mask);
-			positions[idx] = -ENOENT;
-			miss_mask &= ~(1llu << idx);
-		} while (miss_mask);
+next_key:
+		continue;
 	}
 
 	if (hit_mask != NULL)
diff --git a/lib/librte_hash/rte_cuckoo_hash.h b/lib/librte_hash/rte_cuckoo_hash.h
index 504661d..c00aafa 100644
--- a/lib/librte_hash/rte_cuckoo_hash.h
+++ b/lib/librte_hash/rte_cuckoo_hash.h
@@ -173,8 +173,7 @@ enum rte_hash_sig_compare_function {
 struct rte_hash_bucket {
 	hash_sig_t sig_current[RTE_HASH_BUCKET_ENTRIES];
 
-	/* Includes dummy key index that always contains index 0 */
-	uint32_t key_idx[RTE_HASH_BUCKET_ENTRIES + 1];
+	uint32_t key_idx[RTE_HASH_BUCKET_ENTRIES];
 
 	hash_sig_t sig_alt[RTE_HASH_BUCKET_ENTRIES];
 
-- 
2.7.4
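[Editor's note: the patch above replaces the staged pipeline with a loop that scans a per-bucket hit mask, using `__builtin_ctzl()` to jump directly to the next matching entry. The following standalone sketch illustrates that idiom; the names `scan_hitmask` and `NUM_ENTRIES` are illustrative only, not part of the DPDK API, and the real code compares stored keys via `rte_hash_cmp_eq()` rather than plain integers.]

```c
#include <stdint.h>

#define NUM_ENTRIES 8  /* entries per bucket on the AVX2 path */

/*
 * Loop-and-jump scan: each set bit in `hitmask` marks a bucket entry
 * whose signature matched. __builtin_ctzl() finds the lowest set bit,
 * so the loop visits only candidate entries, skipping empty slots.
 * Returns the index of the first entry whose key equals `want`,
 * or -1 if no flagged candidate holds it.
 */
static int scan_hitmask(uint32_t hitmask, const uint32_t keys[NUM_ENTRIES],
			uint32_t want)
{
	while (hitmask) {
		uint32_t hit_index = (uint32_t)__builtin_ctzl(hitmask);

		if (keys[hit_index] == want)
			return (int)hit_index;
		/* Signature collision: clear this bit, jump to the next. */
		hitmask &= ~(1U << hit_index);
	}
	return -1;
}
```

As in the patch, a false positive (signature match, key mismatch) simply clears its bit and continues; a true match exits immediately, which is the `goto next_key` in the real code.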


* Re: [PATCH v5 0/4] Cuckoo hash enhancements
  2016-10-04 23:25       ` [PATCH v5 " Pablo de Lara
                           ` (3 preceding siblings ...)
  2016-10-04 23:25         ` [PATCH v5 4/4] hash: modify lookup bulk pipeline Pablo de Lara
@ 2016-10-05 10:12         ` Thomas Monjalon
  4 siblings, 0 replies; 37+ messages in thread
From: Thomas Monjalon @ 2016-10-05 10:12 UTC (permalink / raw)
  To: Pablo de Lara; +Cc: dev, bruce.richardson

2016-10-05 00:25, Pablo de Lara:
> This patchset improves lookup performance in the current hash library
> by replacing the existing 4-stage, 2-entry lookup bulk pipeline with an
> improved pipeline based on a loop-and-jump model.
> x86 vector intrinsics are also used to speed up signature comparison.
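
[Editor's note: the vectorized comparison referenced above produces exactly the per-bucket hit mask that the new loop consumes. The scalar sketch below shows the equivalent computation; the name `compare_signatures` is illustrative, not the DPDK API. With the current signatures stored contiguously (patch 1), the AVX2 path replaces this whole loop with one 256-bit compare of 8x32-bit signatures plus a movemask.]

```c
#include <stdint.h>

#define ENTRIES_PER_BUCKET 8  /* AVX2 path: 8 x 32-bit signatures */

/*
 * Scalar stand-in for the vectorized signature compare: test the
 * lookup signature against every current signature in the bucket
 * and set one bit per matching entry.
 */
static uint32_t compare_signatures(
		const uint32_t sig_current[ENTRIES_PER_BUCKET],
		uint32_t sig)
{
	uint32_t hitmask = 0;
	unsigned int i;

	for (i = 0; i < ENTRIES_PER_BUCKET; i++)
		if (sig_current[i] == sig)
			hitmask |= 1U << i;
	return hitmask;
}
```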

Applied, thanks


end of thread, other threads:[~2016-10-05 10:12 UTC | newest]

Thread overview: 37+ messages
2016-08-26 21:34 [PATCH 0/3] Cuckoo hash lookup enhancements Pablo de Lara
2016-08-26 21:34 ` [PATCH 1/3] hash: reorganize bucket structure Pablo de Lara
2016-08-26 21:34 ` [PATCH 2/3] hash: add vectorized comparison Pablo de Lara
2016-08-27  8:57   ` Thomas Monjalon
2016-09-02 17:05     ` De Lara Guarch, Pablo
2016-08-26 21:34 ` [PATCH 3/3] hash: modify lookup bulk pipeline Pablo de Lara
2016-09-02 22:56 ` [PATCH v2 0/4] Cuckoo hash lookup enhancements Pablo de Lara
2016-09-02 22:56   ` [PATCH v2 1/4] hash: reorder hash structure Pablo de Lara
2016-09-02 22:56   ` [PATCH v2 2/4] hash: reorganize bucket structure Pablo de Lara
2016-09-02 22:56   ` [PATCH v2 3/4] hash: add vectorized comparison Pablo de Lara
2016-09-02 22:56   ` [PATCH v2 4/4] hash: modify lookup bulk pipeline Pablo de Lara
2016-09-06 19:33   ` [PATCH v3 0/4] Cuckoo hash lookup enhancements Pablo de Lara
2016-09-30  7:38     ` [PATCH v4 0/4] Cuckoo hash enhancements Pablo de Lara
2016-09-30  7:38       ` [PATCH v4 1/4] hash: reorder hash structure Pablo de Lara
2016-09-30  7:38       ` [PATCH v4 2/4] hash: reorganize bucket structure Pablo de Lara
2016-09-30  7:38       ` [PATCH v4 3/4] hash: add vectorized comparison Pablo de Lara
2016-09-30  7:38       ` [PATCH v4 4/4] hash: modify lookup bulk pipeline Pablo de Lara
2016-09-30 19:53       ` [PATCH v4 0/4] Cuckoo hash enhancements Gobriel, Sameh
2016-10-03  9:59       ` Bruce Richardson
2016-10-04  6:50         ` De Lara Guarch, Pablo
2016-10-04  7:17           ` De Lara Guarch, Pablo
2016-10-04  9:47             ` Bruce Richardson
2016-10-04 23:25       ` [PATCH v5 " Pablo de Lara
2016-10-04 23:25         ` [PATCH v5 1/4] hash: reorder hash structure Pablo de Lara
2016-10-04 23:25         ` [PATCH v5 2/4] hash: reorganize bucket structure Pablo de Lara
2016-10-04 23:25         ` [PATCH v5 3/4] hash: add vectorized comparison Pablo de Lara
2016-10-04 23:25         ` [PATCH v5 4/4] hash: modify lookup bulk pipeline Pablo de Lara
2016-10-05 10:12         ` [PATCH v5 0/4] Cuckoo hash enhancements Thomas Monjalon
2016-09-06 19:34   ` [PATCH v3 0/4] Cuckoo hash lookup enhancements Pablo de Lara
2016-09-06 19:34     ` [PATCH v3 1/4] hash: reorder hash structure Pablo de Lara
2016-09-28  9:02       ` Bruce Richardson
2016-09-29  1:33         ` De Lara Guarch, Pablo
2016-09-06 19:34     ` [PATCH v3 2/4] hash: reorganize bucket structure Pablo de Lara
2016-09-28  9:05       ` Bruce Richardson
2016-09-29  1:40         ` De Lara Guarch, Pablo
2016-09-06 19:34     ` [PATCH v3 3/4] hash: add vectorized comparison Pablo de Lara
2016-09-06 19:34     ` [PATCH v3 4/4] hash: modify lookup bulk pipeline Pablo de Lara
