* [PATCH 0/6] rhashtable: Fixes + cleanups + preparation for multiple rehash
@ 2015-03-13  9:56 Herbert Xu
  2015-03-13  9:57 ` [PATCH 1/6] rhashtable: Fix walker behaviour during rehash Herbert Xu
                   ` (8 more replies)
  0 siblings, 9 replies; 113+ messages in thread
From: Herbert Xu @ 2015-03-13  9:56 UTC (permalink / raw)
  To: Thomas Graf, netdev

Hi:

Patch 1 fixes the walker so that it behaves properly even during
a resize.

Patches 2-3 are cleanups.

Patches 4-6 lay some groundwork for the upcoming multiple rehashing.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 1/6] rhashtable: Fix walker behaviour during rehash
  2015-03-13  9:56 [PATCH 0/6] rhashtable: Fixes + cleanups + preparation for multiple rehash Herbert Xu
@ 2015-03-13  9:57 ` Herbert Xu
  2015-03-13 15:50   ` Thomas Graf
  2015-03-13  9:57 ` [PATCH 2/6] rhashtable: Use SINGLE_DEPTH_NESTING Herbert Xu
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 113+ messages in thread
From: Herbert Xu @ 2015-03-13  9:57 UTC (permalink / raw)
  To: Thomas Graf, netdev

Previously, whenever the walker encountered a resize it simply
snapped back to the beginning and started again.  However, this only
works if the rehash started and completed while the walker was
idle.

If the walker attempts to restart while the rehash is still ongoing,
we may miss objects that we shouldn't have.

This patch fixes this by making the walker walk the old table
followed by the new table just like all other readers.  If a
rehash is detected we will still signal our caller of the fact
so they can prepare for duplicates but we will simply continue
the walk onto the new table after the old one is finished either
by us or by the rehasher.
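
For callers this means that a dump loop can treat the -EAGAIN returned
by rhashtable_walk_next purely as a hint that duplicates may follow and
keep going.  A minimal sketch (dump_one() is a hypothetical per-object
handler, not part of this patch):

	void *obj;

	while ((obj = rhashtable_walk_next(&iter)) != NULL) {
		if (IS_ERR(obj)) {
			if (PTR_ERR(obj) == -EAGAIN)
				continue;	/* rehash seen, duplicates possible */
			break;
		}
		dump_one(obj);
	}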

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 include/linux/rhashtable.h |    8 ++---
 lib/rhashtable.c           |   69 ++++++++++++++++++++++++++++++---------------
 2 files changed, 50 insertions(+), 27 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index c93ff8a..4192682 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -53,6 +53,7 @@ struct rhash_head {
  * @shift: Current size (1 << shift)
  * @locks_mask: Mask to apply before accessing locks[]
  * @locks: Array of spinlocks protecting individual buckets
+ * @walkers: List of active walkers
  * @buckets: size * hash buckets
  */
 struct bucket_table {
@@ -61,6 +62,7 @@ struct bucket_table {
 	u32			shift;
 	unsigned int		locks_mask;
 	spinlock_t		*locks;
+	struct list_head	walkers;
 
 	struct rhash_head __rcu	*buckets[] ____cacheline_aligned_in_smp;
 };
@@ -104,7 +106,6 @@ struct rhashtable_params {
  * @p: Configuration parameters
  * @run_work: Deferred worker to expand/shrink asynchronously
  * @mutex: Mutex to protect current/future table swapping
- * @walkers: List of active walkers
  * @being_destroyed: True if table is set up for destruction
  */
 struct rhashtable {
@@ -115,17 +116,16 @@ struct rhashtable {
 	struct rhashtable_params	p;
 	struct work_struct		run_work;
 	struct mutex                    mutex;
-	struct list_head		walkers;
 };
 
 /**
  * struct rhashtable_walker - Hash table walker
  * @list: List entry on list of walkers
- * @resize: Resize event occured
+ * @tbl: The table that we were walking over
  */
 struct rhashtable_walker {
 	struct list_head list;
-	bool resize;
+	struct bucket_table *tbl;
 };
 
 /**
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index fc0d451..f7c7607 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -170,6 +170,8 @@ static struct bucket_table *bucket_table_alloc(struct rhashtable *ht,
 		return NULL;
 	}
 
+	INIT_LIST_HEAD(&tbl->walkers);
+
 	for (i = 0; i < nbuckets; i++)
 		INIT_RHT_NULLS_HEAD(tbl->buckets[i], ht, i);
 
@@ -264,6 +266,7 @@ static void rhashtable_rehash(struct rhashtable *ht,
 			      struct bucket_table *new_tbl)
 {
 	struct bucket_table *old_tbl = rht_dereference(ht->tbl, ht);
+	struct rhashtable_walker *walker;
 	unsigned old_hash;
 
 	get_random_bytes(&new_tbl->hash_rnd, sizeof(new_tbl->hash_rnd));
@@ -284,6 +287,9 @@ static void rhashtable_rehash(struct rhashtable *ht,
 	/* Publish the new table pointer. */
 	rcu_assign_pointer(ht->tbl, new_tbl);
 
+	list_for_each_entry(walker, &old_tbl->walkers, list)
+		walker->tbl = NULL;
+
 	/* Wait for readers. All new readers will see the new
 	 * table, and thus no references to the old table will
 	 * remain.
@@ -358,7 +364,6 @@ static void rht_deferred_worker(struct work_struct *work)
 {
 	struct rhashtable *ht;
 	struct bucket_table *tbl;
-	struct rhashtable_walker *walker;
 
 	ht = container_of(work, struct rhashtable, run_work);
 	mutex_lock(&ht->mutex);
@@ -367,9 +372,6 @@ static void rht_deferred_worker(struct work_struct *work)
 
 	tbl = rht_dereference(ht->tbl, ht);
 
-	list_for_each_entry(walker, &ht->walkers, list)
-		walker->resize = true;
-
 	if (rht_grow_above_75(ht, tbl))
 		rhashtable_expand(ht);
 	else if (rht_shrink_below_30(ht, tbl))
@@ -725,11 +727,9 @@ int rhashtable_walk_init(struct rhashtable *ht, struct rhashtable_iter *iter)
 	if (!iter->walker)
 		return -ENOMEM;
 
-	INIT_LIST_HEAD(&iter->walker->list);
-	iter->walker->resize = false;
-
 	mutex_lock(&ht->mutex);
-	list_add(&iter->walker->list, &ht->walkers);
+	iter->walker->tbl = rht_dereference(ht->tbl, ht);
+	list_add(&iter->walker->list, &iter->walker->tbl->walkers);
 	mutex_unlock(&ht->mutex);
 
 	return 0;
@@ -745,7 +745,8 @@ EXPORT_SYMBOL_GPL(rhashtable_walk_init);
 void rhashtable_walk_exit(struct rhashtable_iter *iter)
 {
 	mutex_lock(&iter->ht->mutex);
-	list_del(&iter->walker->list);
+	if (iter->walker->tbl)
+		list_del(&iter->walker->list);
 	mutex_unlock(&iter->ht->mutex);
 	kfree(iter->walker);
 }
@@ -767,12 +768,19 @@ EXPORT_SYMBOL_GPL(rhashtable_walk_exit);
  */
 int rhashtable_walk_start(struct rhashtable_iter *iter)
 {
+	struct rhashtable *ht = iter->ht;
+
+	mutex_lock(&ht->mutex);
+
+	if (iter->walker->tbl)
+		list_del(&iter->walker->list);
+
 	rcu_read_lock();
 
-	if (iter->walker->resize) {
-		iter->slot = 0;
-		iter->skip = 0;
-		iter->walker->resize = false;
+	mutex_unlock(&ht->mutex);
+
+	if (!iter->walker->tbl) {
+		iter->walker->tbl = rht_dereference_rcu(ht->tbl, ht);
 		return -EAGAIN;
 	}
 
@@ -794,13 +802,11 @@ EXPORT_SYMBOL_GPL(rhashtable_walk_start);
  */
 void *rhashtable_walk_next(struct rhashtable_iter *iter)
 {
-	const struct bucket_table *tbl;
+	struct bucket_table *tbl = iter->walker->tbl;
 	struct rhashtable *ht = iter->ht;
 	struct rhash_head *p = iter->p;
 	void *obj = NULL;
 
-	tbl = rht_dereference_rcu(ht->tbl, ht);
-
 	if (p) {
 		p = rht_dereference_bucket_rcu(p->next, tbl, iter->slot);
 		goto next;
@@ -826,17 +832,18 @@ next:
 		iter->skip = 0;
 	}
 
-	iter->p = NULL;
-
-out:
-	if (iter->walker->resize) {
-		iter->p = NULL;
+	iter->walker->tbl = rht_dereference_rcu(ht->future_tbl, ht);
+	if (iter->walker->tbl != tbl) {
 		iter->slot = 0;
 		iter->skip = 0;
-		iter->walker->resize = false;
 		return ERR_PTR(-EAGAIN);
 	}
 
+	iter->walker->tbl = NULL;
+	iter->p = NULL;
+
+out:
+
 	return obj;
 }
 EXPORT_SYMBOL_GPL(rhashtable_walk_next);
@@ -849,7 +856,24 @@ EXPORT_SYMBOL_GPL(rhashtable_walk_next);
  */
 void rhashtable_walk_stop(struct rhashtable_iter *iter)
 {
+	struct rhashtable *ht;
+	struct bucket_table *tbl = iter->walker->tbl;
+
 	rcu_read_unlock();
+
+	if (!tbl)
+		return;
+
+	ht = iter->ht;
+
+	mutex_lock(&ht->mutex);
+	if (rht_dereference(ht->tbl, ht) == tbl ||
+	    rht_dereference(ht->future_tbl, ht) == tbl)
+		list_add(&iter->walker->list, &tbl->walkers);
+	else
+		iter->walker->tbl = NULL;
+	mutex_unlock(&ht->mutex);
+
 	iter->p = NULL;
 }
 EXPORT_SYMBOL_GPL(rhashtable_walk_stop);
@@ -927,7 +951,6 @@ int rhashtable_init(struct rhashtable *ht, struct rhashtable_params *params)
 	memset(ht, 0, sizeof(*ht));
 	mutex_init(&ht->mutex);
 	memcpy(&ht->p, params, sizeof(*params));
-	INIT_LIST_HEAD(&ht->walkers);
 
 	if (params->locks_mul)
 		ht->p.locks_mul = roundup_pow_of_two(params->locks_mul);

^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [PATCH 2/6] rhashtable: Use SINGLE_DEPTH_NESTING
  2015-03-13  9:56 [PATCH 0/6] rhashtable: Fixes + cleanups + preparation for multiple rehash Herbert Xu
  2015-03-13  9:57 ` [PATCH 1/6] rhashtable: Fix walker behaviour during rehash Herbert Xu
@ 2015-03-13  9:57 ` Herbert Xu
  2015-03-13 15:40   ` Thomas Graf
  2015-03-13  9:57 ` [PATCH 3/6] rhashtable: Move seed init into bucket_table_alloc Herbert Xu
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 113+ messages in thread
From: Herbert Xu @ 2015-03-13  9:57 UTC (permalink / raw)
  To: Thomas Graf, netdev

We only nest one level deep, so there is no need to roll our own
subclasses.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 lib/rhashtable.c |    9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index f7c7607..5d06cc2 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -33,11 +33,6 @@
 /* Base bits plus 1 bit for nulls marker */
 #define HASH_RESERVED_SPACE	(RHT_BASE_BITS + 1)
 
-enum {
-	RHT_LOCK_NORMAL,
-	RHT_LOCK_NESTED,
-};
-
 /* The bucket lock is selected based on the hash and protects mutations
  * on a group of hash buckets.
  *
@@ -231,7 +226,7 @@ static int rhashtable_rehash_one(struct rhashtable *ht, unsigned old_hash)
 
 	new_bucket_lock = bucket_lock(new_tbl, new_hash);
 
-	spin_lock_nested(new_bucket_lock, RHT_LOCK_NESTED);
+	spin_lock_nested(new_bucket_lock, SINGLE_DEPTH_NESTING);
 	head = rht_dereference_bucket(new_tbl->buckets[new_hash],
 				      new_tbl, new_hash);
 
@@ -405,7 +400,7 @@ static bool __rhashtable_insert(struct rhashtable *ht, struct rhash_head *obj,
 	tbl = rht_dereference_rcu(ht->future_tbl, ht);
 	if (tbl != old_tbl) {
 		hash = head_hashfn(ht, tbl, obj);
-		spin_lock_nested(bucket_lock(tbl, hash), RHT_LOCK_NESTED);
+		spin_lock_nested(bucket_lock(tbl, hash), SINGLE_DEPTH_NESTING);
 	}
 
 	if (compare &&

^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [PATCH 3/6] rhashtable: Move seed init into bucket_table_alloc
  2015-03-13  9:56 [PATCH 0/6] rhashtable: Fixes + cleanups + preparation for multiple rehash Herbert Xu
  2015-03-13  9:57 ` [PATCH 1/6] rhashtable: Fix walker behaviour during rehash Herbert Xu
  2015-03-13  9:57 ` [PATCH 2/6] rhashtable: Use SINGLE_DEPTH_NESTING Herbert Xu
@ 2015-03-13  9:57 ` Herbert Xu
  2015-03-13 10:03   ` Daniel Borkmann
                     ` (2 more replies)
  2015-03-13  9:57 ` [PATCH 4/6] rhashtable: Free bucket tables asynchronously after rehash Herbert Xu
                   ` (5 subsequent siblings)
  8 siblings, 3 replies; 113+ messages in thread
From: Herbert Xu @ 2015-03-13  9:57 UTC (permalink / raw)
  To: Thomas Graf, netdev

It seems that I have already made every rehash redo the random
seed even though my commit message indicated otherwise :)

Since we have already taken that step, this patch goes one step
further and moves the seed initialisation into bucket_table_alloc.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 lib/rhashtable.c |   16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 5d06cc2..e55bbc8 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -142,7 +142,7 @@ static void bucket_table_free(const struct bucket_table *tbl)
 }
 
 static struct bucket_table *bucket_table_alloc(struct rhashtable *ht,
-					       size_t nbuckets, u32 hash_rnd)
+					       size_t nbuckets)
 {
 	struct bucket_table *tbl = NULL;
 	size_t size;
@@ -158,7 +158,6 @@ static struct bucket_table *bucket_table_alloc(struct rhashtable *ht,
 
 	tbl->size = nbuckets;
 	tbl->shift = ilog2(nbuckets);
-	tbl->hash_rnd = hash_rnd;
 
 	if (alloc_bucket_locks(ht, tbl) < 0) {
 		bucket_table_free(tbl);
@@ -167,6 +166,8 @@ static struct bucket_table *bucket_table_alloc(struct rhashtable *ht,
 
 	INIT_LIST_HEAD(&tbl->walkers);
 
+	get_random_bytes(&tbl->hash_rnd, sizeof(tbl->hash_rnd));
+
 	for (i = 0; i < nbuckets; i++)
 		INIT_RHT_NULLS_HEAD(tbl->buckets[i], ht, i);
 
@@ -264,8 +265,6 @@ static void rhashtable_rehash(struct rhashtable *ht,
 	struct rhashtable_walker *walker;
 	unsigned old_hash;
 
-	get_random_bytes(&new_tbl->hash_rnd, sizeof(new_tbl->hash_rnd));
-
 	/* Make insertions go into the new, empty table right away. Deletions
 	 * and lookups will be attempted in both tables until we synchronize.
 	 * The synchronize_rcu() guarantees for the new table to be picked up
@@ -315,7 +314,7 @@ int rhashtable_expand(struct rhashtable *ht)
 
 	ASSERT_RHT_MUTEX(ht);
 
-	new_tbl = bucket_table_alloc(ht, old_tbl->size * 2, old_tbl->hash_rnd);
+	new_tbl = bucket_table_alloc(ht, old_tbl->size * 2);
 	if (new_tbl == NULL)
 		return -ENOMEM;
 
@@ -346,7 +345,7 @@ int rhashtable_shrink(struct rhashtable *ht)
 
 	ASSERT_RHT_MUTEX(ht);
 
-	new_tbl = bucket_table_alloc(ht, old_tbl->size / 2, old_tbl->hash_rnd);
+	new_tbl = bucket_table_alloc(ht, old_tbl->size / 2);
 	if (new_tbl == NULL)
 		return -ENOMEM;
 
@@ -926,7 +925,6 @@ int rhashtable_init(struct rhashtable *ht, struct rhashtable_params *params)
 {
 	struct bucket_table *tbl;
 	size_t size;
-	u32 hash_rnd;
 
 	size = HASH_DEFAULT_SIZE;
 
@@ -952,9 +950,7 @@ int rhashtable_init(struct rhashtable *ht, struct rhashtable_params *params)
 	else
 		ht->p.locks_mul = BUCKET_LOCKS_PER_CPU;
 
-	get_random_bytes(&hash_rnd, sizeof(hash_rnd));
-
-	tbl = bucket_table_alloc(ht, size, hash_rnd);
+	tbl = bucket_table_alloc(ht, size);
 	if (tbl == NULL)
 		return -ENOMEM;
 

^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [PATCH 4/6] rhashtable: Free bucket tables asynchronously after rehash
  2015-03-13  9:56 [PATCH 0/6] rhashtable: Fixes + cleanups + preparation for multiple rehash Herbert Xu
                   ` (2 preceding siblings ...)
  2015-03-13  9:57 ` [PATCH 3/6] rhashtable: Move seed init into bucket_table_alloc Herbert Xu
@ 2015-03-13  9:57 ` Herbert Xu
  2015-03-13 15:42   ` Thomas Graf
  2015-03-13  9:57 ` [PATCH 5/6] rhashtable: Add rehash counter to bucket_table Herbert Xu
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 113+ messages in thread
From: Herbert Xu @ 2015-03-13  9:57 UTC (permalink / raw)
  To: Thomas Graf, netdev

There is in fact no need to wait for an RCU grace period in the
rehash function, since all insertions are guaranteed to go into
the new table through spin locks.

This patch uses call_rcu to free the old/rehashed table at our
leisure.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 include/linux/rhashtable.h |    2 ++
 lib/rhashtable.c           |    9 ++++++---
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 4192682..a0abddd 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -54,6 +54,7 @@ struct rhash_head {
  * @locks_mask: Mask to apply before accessing locks[]
  * @locks: Array of spinlocks protecting individual buckets
  * @walkers: List of active walkers
+ * @rcu: RCU structure for freeing the table
  * @buckets: size * hash buckets
  */
 struct bucket_table {
@@ -63,6 +64,7 @@ struct bucket_table {
 	unsigned int		locks_mask;
 	spinlock_t		*locks;
 	struct list_head	walkers;
+	struct rcu_head		rcu;
 
 	struct rhash_head __rcu	*buckets[] ____cacheline_aligned_in_smp;
 };
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index e55bbc8..36fb091 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -141,6 +141,11 @@ static void bucket_table_free(const struct bucket_table *tbl)
 	kvfree(tbl);
 }
 
+static void bucket_table_free_rcu(struct rcu_head *head)
+{
+	bucket_table_free(container_of(head, struct bucket_table, rcu));
+}
+
 static struct bucket_table *bucket_table_alloc(struct rhashtable *ht,
 					       size_t nbuckets)
 {
@@ -288,9 +293,7 @@ static void rhashtable_rehash(struct rhashtable *ht,
 	 * table, and thus no references to the old table will
 	 * remain.
 	 */
-	synchronize_rcu();
-
-	bucket_table_free(old_tbl);
+	call_rcu(&old_tbl->rcu, bucket_table_free_rcu);
 }
 
 /**

^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [PATCH 5/6] rhashtable: Add rehash counter to bucket_table
  2015-03-13  9:56 [PATCH 0/6] rhashtable: Fixes + cleanups + preparation for multiple rehash Herbert Xu
                   ` (3 preceding siblings ...)
  2015-03-13  9:57 ` [PATCH 4/6] rhashtable: Free bucket tables asynchronously after rehash Herbert Xu
@ 2015-03-13  9:57 ` Herbert Xu
  2015-03-13 13:51   ` Thomas Graf
  2015-03-13  9:57 ` [PATCH 6/6] rhashtable: Move future_tbl into struct bucket_table Herbert Xu
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 113+ messages in thread
From: Herbert Xu @ 2015-03-13  9:57 UTC (permalink / raw)
  To: Thomas Graf, netdev

This patch adds a rehash counter to bucket_table to indicate
the last bucket that has been rehashed.  This serves two purposes:

1. Any bucket that has been rehashed can never gain a new object.
2. If the rehash counter reaches the size of the table, the table
will forever remain empty.
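
For illustration, this is how the counter ends up being consumed later
in this series (patch 6 checks it in rhashtable_walk_stop to decide
whether a walker may re-attach to a table):

	if (tbl->rehash < tbl->size)
		list_add(&iter->walker->list, &tbl->walkers);
	else
		iter->walker->tbl = NULL;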

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 include/linux/rhashtable.h |    4 +++-
 lib/rhashtable.c           |    1 +
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index a0abddd..ed7562a 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -49,6 +49,7 @@ struct rhash_head {
 /**
  * struct bucket_table - Table of hash buckets
  * @size: Number of hash buckets
+ * @rehash: Current bucket being rehashed
  * @hash_rnd: Random seed to fold into hash
  * @shift: Current size (1 << shift)
  * @locks_mask: Mask to apply before accessing locks[]
@@ -58,7 +59,8 @@ struct rhash_head {
  * @buckets: size * hash buckets
  */
 struct bucket_table {
-	size_t			size;
+	unsigned int		size;
+	unsigned int		rehash;
 	u32			hash_rnd;
 	u32			shift;
 	unsigned int		locks_mask;
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 36fb091..ff4ea17 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -260,6 +260,7 @@ static void rhashtable_rehash_chain(struct rhashtable *ht, unsigned old_hash)
 	spin_lock_bh(old_bucket_lock);
 	while (!rhashtable_rehash_one(ht, old_hash))
 		;
+	old_tbl->rehash++;
 	spin_unlock_bh(old_bucket_lock);
 }
 

^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [PATCH 6/6] rhashtable: Move future_tbl into struct bucket_table
  2015-03-13  9:56 [PATCH 0/6] rhashtable: Fixes + cleanups + preparation for multiple rehash Herbert Xu
                   ` (4 preceding siblings ...)
  2015-03-13  9:57 ` [PATCH 5/6] rhashtable: Add rehash counter to bucket_table Herbert Xu
@ 2015-03-13  9:57 ` Herbert Xu
  2015-03-13 16:13   ` Thomas Graf
  2015-03-13 13:57 ` [PATCH 0/6] rhashtable: Fixes + cleanups + preparation for multiple rehash Thomas Graf
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 113+ messages in thread
From: Herbert Xu @ 2015-03-13  9:57 UTC (permalink / raw)
  To: Thomas Graf, netdev

This patch moves future_tbl to open up the possibility of having
multiple rehashes on the same table.
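
With future_tbl hanging off each bucket_table, a reader can in
principle follow a whole chain of tables starting from the one it
holds.  A rough sketch of the pattern this opens up (illustrative, not
code from this patch):

	struct bucket_table *tbl;

	for (tbl = rht_dereference_rcu(ht->tbl, ht); tbl;
	     tbl = rht_dereference_rcu(tbl->future_tbl, ht))
		/* search the relevant bucket in tbl */;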

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 include/linux/rhashtable.h |    5 +++--
 lib/rhashtable.c           |   27 +++++++++++----------------
 2 files changed, 14 insertions(+), 18 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index ed7562a..1695378 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -56,6 +56,7 @@ struct rhash_head {
  * @locks: Array of spinlocks protecting individual buckets
  * @walkers: List of active walkers
  * @rcu: RCU structure for freeing the table
+ * @future_tbl: Table under construction during rehashing
  * @buckets: size * hash buckets
  */
 struct bucket_table {
@@ -68,6 +69,8 @@ struct bucket_table {
 	struct list_head	walkers;
 	struct rcu_head		rcu;
 
+	struct bucket_table __rcu *future_tbl;
+
 	struct rhash_head __rcu	*buckets[] ____cacheline_aligned_in_smp;
 };
 
@@ -105,7 +108,6 @@ struct rhashtable_params {
 /**
  * struct rhashtable - Hash table handle
  * @tbl: Bucket table
- * @future_tbl: Table under construction during expansion/shrinking
  * @nelems: Number of elements in table
  * @p: Configuration parameters
  * @run_work: Deferred worker to expand/shrink asynchronously
@@ -114,7 +116,6 @@ struct rhashtable_params {
  */
 struct rhashtable {
 	struct bucket_table __rcu	*tbl;
-	struct bucket_table __rcu       *future_tbl;
 	atomic_t			nelems;
 	bool                            being_destroyed;
 	struct rhashtable_params	p;
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index ff4ea17..9d53a46 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -207,8 +207,9 @@ static bool rht_shrink_below_30(const struct rhashtable *ht,
 
 static int rhashtable_rehash_one(struct rhashtable *ht, unsigned old_hash)
 {
-	struct bucket_table *new_tbl = rht_dereference(ht->future_tbl, ht);
 	struct bucket_table *old_tbl = rht_dereference(ht->tbl, ht);
+	struct bucket_table *new_tbl =
+		rht_dereference(old_tbl->future_tbl, ht) ?: old_tbl;
 	struct rhash_head __rcu **pprev = &old_tbl->buckets[old_hash];
 	int err = -ENOENT;
 	struct rhash_head *head, *next, *entry;
@@ -273,10 +274,8 @@ static void rhashtable_rehash(struct rhashtable *ht,
 
 	/* Make insertions go into the new, empty table right away. Deletions
 	 * and lookups will be attempted in both tables until we synchronize.
-	 * The synchronize_rcu() guarantees for the new table to be picked up
-	 * so no new additions go into the old table while we relink.
 	 */
-	rcu_assign_pointer(ht->future_tbl, new_tbl);
+	rcu_assign_pointer(old_tbl->future_tbl, new_tbl);
 
 	/* Ensure the new table is visible to readers. */
 	smp_wmb();
@@ -400,7 +399,7 @@ static bool __rhashtable_insert(struct rhashtable *ht, struct rhash_head *obj,
 	 * also grab the bucket lock in old_tbl because until the
 	 * rehash completes ht->tbl won't be changed.
 	 */
-	tbl = rht_dereference_rcu(ht->future_tbl, ht);
+	tbl = rht_dereference_rcu(old_tbl->future_tbl, ht) ?: old_tbl;
 	if (tbl != old_tbl) {
 		hash = head_hashfn(ht, tbl, obj);
 		spin_lock_nested(bucket_lock(tbl, hash), SINGLE_DEPTH_NESTING);
@@ -525,7 +524,7 @@ bool rhashtable_remove(struct rhashtable *ht, struct rhash_head *obj)
 	 * visible then that guarantees the entry to still be in
 	 * old_tbl if it exists.
 	 */
-	tbl = rht_dereference_rcu(ht->future_tbl, ht);
+	tbl = rht_dereference_rcu(old_tbl->future_tbl, ht) ?: old_tbl;
 	if (!ret && old_tbl != tbl)
 		ret = __rhashtable_remove(ht, tbl, obj);
 
@@ -599,7 +598,7 @@ EXPORT_SYMBOL_GPL(rhashtable_lookup);
 void *rhashtable_lookup_compare(struct rhashtable *ht, const void *key,
 				bool (*compare)(void *, void *), void *arg)
 {
-	const struct bucket_table *tbl, *old_tbl;
+	const struct bucket_table *tbl;
 	struct rhash_head *he;
 	u32 hash;
 
@@ -618,9 +617,8 @@ restart:
 	/* Ensure we see any new tables. */
 	smp_rmb();
 
-	old_tbl = tbl;
-	tbl = rht_dereference_rcu(ht->future_tbl, ht);
-	if (unlikely(tbl != old_tbl))
+	tbl = rht_dereference_rcu(tbl->future_tbl, ht);
+	if (unlikely(tbl))
 		goto restart;
 	rcu_read_unlock();
 
@@ -830,14 +828,13 @@ next:
 		iter->skip = 0;
 	}
 
-	iter->walker->tbl = rht_dereference_rcu(ht->future_tbl, ht);
-	if (iter->walker->tbl != tbl) {
+	iter->walker->tbl = rht_dereference_rcu(tbl->future_tbl, ht);
+	if (iter->walker->tbl) {
 		iter->slot = 0;
 		iter->skip = 0;
 		return ERR_PTR(-EAGAIN);
 	}
 
-	iter->walker->tbl = NULL;
 	iter->p = NULL;
 
 out:
@@ -865,8 +862,7 @@ void rhashtable_walk_stop(struct rhashtable_iter *iter)
 	ht = iter->ht;
 
 	mutex_lock(&ht->mutex);
-	if (rht_dereference(ht->tbl, ht) == tbl ||
-	    rht_dereference(ht->future_tbl, ht) == tbl)
+	if (tbl->rehash < tbl->size)
 		list_add(&iter->walker->list, &tbl->walkers);
 	else
 		iter->walker->tbl = NULL;
@@ -961,7 +957,6 @@ int rhashtable_init(struct rhashtable *ht, struct rhashtable_params *params)
 	atomic_set(&ht->nelems, 0);
 
 	RCU_INIT_POINTER(ht->tbl, tbl);
-	RCU_INIT_POINTER(ht->future_tbl, tbl);
 
 	INIT_WORK(&ht->run_work, rht_deferred_worker);
 

^ permalink raw reply related	[flat|nested] 113+ messages in thread

* Re: [PATCH 3/6] rhashtable: Move seed init into bucket_table_alloc
  2015-03-13  9:57 ` [PATCH 3/6] rhashtable: Move seed init into bucket_table_alloc Herbert Xu
@ 2015-03-13 10:03   ` Daniel Borkmann
  2015-03-13 11:33   ` David Laight
  2015-03-13 15:40   ` Thomas Graf
  2 siblings, 0 replies; 113+ messages in thread
From: Daniel Borkmann @ 2015-03-13 10:03 UTC (permalink / raw)
  To: Herbert Xu, Thomas Graf, netdev

On 03/13/2015 10:57 AM, Herbert Xu wrote:
> It seems that I have already made every rehash redo the random
> seed even though my commit message indicated otherwise :)
>
> Since we have already taken that step, this patch goes one step
> further and moves the seed initialisation into bucket_table_alloc.
>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Acked-by: Daniel Borkmann <daniel@iogearbox.net>

I wanted to suggest the same. :) Thanks!

^ permalink raw reply	[flat|nested] 113+ messages in thread

* RE: [PATCH 3/6] rhashtable: Move seed init into bucket_table_alloc
  2015-03-13  9:57 ` [PATCH 3/6] rhashtable: Move seed init into bucket_table_alloc Herbert Xu
  2015-03-13 10:03   ` Daniel Borkmann
@ 2015-03-13 11:33   ` David Laight
  2015-03-13 11:40     ` Herbert Xu
  2015-03-13 15:40   ` Thomas Graf
  2 siblings, 1 reply; 113+ messages in thread
From: David Laight @ 2015-03-13 11:33 UTC (permalink / raw)
  To: 'Herbert Xu', Thomas Graf, netdev

From: Herbert Xu
> It seems that I have already made every rehash redo the random
> seed even though my commit message indicated otherwise :)
> 
> Since we have already taken that step, this patch goes one step
> further and moves the seed initialisation into bucket_table_alloc.

If the decision to resize a hash table is based on the length
of the hash chains, then changing the hash function might either
make that unnecessary or make an immediate resize be needed.

Not that I'm convinced you can get a good enough hash function
to avoid one or two long chains when the hash table is large.
I've not played with the hashes used for network 'stuff' but
I remember playing with the elf symbol table hash. Whatever I
did, one or two chains would end up with more elements than
you might have hoped.

	David

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 3/6] rhashtable: Move seed init into bucket_table_alloc
  2015-03-13 11:33   ` David Laight
@ 2015-03-13 11:40     ` Herbert Xu
  0 siblings, 0 replies; 113+ messages in thread
From: Herbert Xu @ 2015-03-13 11:40 UTC (permalink / raw)
  To: David Laight; +Cc: Thomas Graf, netdev

On Fri, Mar 13, 2015 at 11:33:39AM +0000, David Laight wrote:
>
> If the decision to resize a hash table is based on the length
> of the hash chains, then changing the hash function might either
> make that unnecessary or make an immediate resize be needed.

My plan is to trigger a delayed resize (as we do now) if we're
over 75%.

If we get a chain length above 4 then we do an immediate resize
if we're already over 75% and the delayed resize hasn't started
yet, otherwise we do an immediate rehash with no change in the
hash table size.

Note that in my new scheme the rehash will contain two distinct
parts.  The first part that is executed immediately will only
involve allocating the hash table.

The hard work of moving entries across is always performed in the
kernel thread.
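
Roughly, the insert-time check would look like this (only
rht_grow_above_75() and ht->run_work exist in the code today; the
helpers below are placeholders for the scheme described above):

	static void example_chain_grew(struct rhashtable *ht,
				       struct bucket_table *tbl,
				       unsigned int chain_len)
	{
		if (rht_grow_above_75(ht, tbl))
			schedule_work(&ht->run_work);	/* delayed resize, as now */

		if (chain_len <= 4)
			return;

		if (rht_grow_above_75(ht, tbl) && !resize_in_progress(ht))
			immediate_resize(ht);		/* grow right away */
		else
			immediate_rehash(ht);		/* same size, new hash_rnd */
	}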

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 5/6] rhashtable: Add rehash counter to bucket_table
  2015-03-13  9:57 ` [PATCH 5/6] rhashtable: Add rehash counter to bucket_table Herbert Xu
@ 2015-03-13 13:51   ` Thomas Graf
  2015-03-14  2:49     ` Herbert Xu
  0 siblings, 1 reply; 113+ messages in thread
From: Thomas Graf @ 2015-03-13 13:51 UTC (permalink / raw)
  To: Herbert Xu; +Cc: netdev

On 03/13/15 at 08:57pm, Herbert Xu wrote:
> @@ -58,7 +59,8 @@ struct rhash_head {
>   * @buckets: size * hash buckets
>   */
>  struct bucket_table {
> -	size_t			size;
> +	unsigned int		size;
> +	unsigned int		rehash;
>  	u32			hash_rnd;
>  	u32			shift;
>  	unsigned int		locks_mask;

This change causes a minor warning:
lib/test_rhashtable.c: In function ‘test_bucket_stats’:
lib/test_rhashtable.c:83:4: warning: format ‘%zu’ expects argument of
type ‘size_t’, but argument 3 has type ‘unsigned int’ [-Wformat=]
    pr_info(" [%#4x/%zu]", i, tbl->size)

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 0/6] rhashtable: Fixes + cleanups + preparation for multiple rehash
  2015-03-13  9:56 [PATCH 0/6] rhashtable: Fixes + cleanups + preparation for multiple rehash Herbert Xu
                   ` (5 preceding siblings ...)
  2015-03-13  9:57 ` [PATCH 6/6] rhashtable: Move future_tbl into struct bucket_table Herbert Xu
@ 2015-03-13 13:57 ` Thomas Graf
  2015-03-13 16:25 ` David Miller
  2015-03-14  2:53 ` [v2 PATCH " Herbert Xu
  8 siblings, 0 replies; 113+ messages in thread
From: Thomas Graf @ 2015-03-13 13:57 UTC (permalink / raw)
  To: Herbert Xu; +Cc: netdev

On 03/13/15 at 08:56pm, Herbert Xu wrote:
> Hi:
> 
> Patch 1 fixes the walker so that it behaves properly even during
> a resize.
> 
> Patches 2-3 are cleanups.
> 
> Patches 4-6 lay some groundwork for the upcoming multiple rehashing.

Tested this series plus already committed rehash fix with netlink
and nft tests. All works well.

Tested-by: Thomas Graf <tgraf@suug.ch>

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 2/6] rhashtable: Use SINGLE_DEPTH_NESTING
  2015-03-13  9:57 ` [PATCH 2/6] rhashtable: Use SINGLE_DEPTH_NESTING Herbert Xu
@ 2015-03-13 15:40   ` Thomas Graf
  0 siblings, 0 replies; 113+ messages in thread
From: Thomas Graf @ 2015-03-13 15:40 UTC (permalink / raw)
  To: Herbert Xu; +Cc: netdev

On 03/13/15 at 08:57pm, Herbert Xu wrote:
> We only nest one level deep, so there is no need to roll our own
> subclasses.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Acked-by: Thomas Graf <tgraf@suug.ch>

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 3/6] rhashtable: Move seed init into bucket_table_alloc
  2015-03-13  9:57 ` [PATCH 3/6] rhashtable: Move seed init into bucket_table_alloc Herbert Xu
  2015-03-13 10:03   ` Daniel Borkmann
  2015-03-13 11:33   ` David Laight
@ 2015-03-13 15:40   ` Thomas Graf
  2 siblings, 0 replies; 113+ messages in thread
From: Thomas Graf @ 2015-03-13 15:40 UTC (permalink / raw)
  To: Herbert Xu; +Cc: netdev

On 03/13/15 at 08:57pm, Herbert Xu wrote:
> It seems that I have already made every rehash redo the random
> seed even though my commit message indicated otherwise :)
> 
> Since we have already taken that step, this patch goes one step
> further and moves the seed initialisation into bucket_table_alloc.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Acked-by: Thomas Graf <tgraf@suug.ch>

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 4/6] rhashtable: Free bucket tables asynchronously after rehash
  2015-03-13  9:57 ` [PATCH 4/6] rhashtable: Free bucket tables asynchronously after rehash Herbert Xu
@ 2015-03-13 15:42   ` Thomas Graf
  0 siblings, 0 replies; 113+ messages in thread
From: Thomas Graf @ 2015-03-13 15:42 UTC (permalink / raw)
  To: Herbert Xu; +Cc: netdev

On 03/13/15 at 08:57pm, Herbert Xu wrote:
> There is in fact no need to wait for an RCU grace period in the
> rehash function, since all insertions are guaranteed to go into
> the new table through spin locks.
> 
> This patch uses call_rcu to free the old/rehashed table at our
> leisure.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Acked-by: Thomas Graf <tgraf@suug.ch>

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 1/6] rhashtable: Fix walker behaviour during rehash
  2015-03-13  9:57 ` [PATCH 1/6] rhashtable: Fix walker behaviour during rehash Herbert Xu
@ 2015-03-13 15:50   ` Thomas Graf
  2015-03-13 23:42     ` Herbert Xu
  0 siblings, 1 reply; 113+ messages in thread
From: Thomas Graf @ 2015-03-13 15:50 UTC (permalink / raw)
  To: Herbert Xu; +Cc: netdev

On 03/13/15 at 08:57pm, Herbert Xu wrote:
> @@ -725,11 +727,9 @@ int rhashtable_walk_init(struct rhashtable *ht, struct rhashtable_iter *iter)
>  	if (!iter->walker)
>  		return -ENOMEM;
>  
> -	INIT_LIST_HEAD(&iter->walker->list);
> -	iter->walker->resize = false;
> -
>  	mutex_lock(&ht->mutex);
> -	list_add(&iter->walker->list, &ht->walkers);
> +	iter->walker->tbl = rht_dereference(ht->tbl, ht);
> +	list_add(&iter->walker->list, &iter->walker->tbl->walkers);
>  	mutex_unlock(&ht->mutex);
>  
>  	return 0;

Is there a reason for keeping rhashtable_walk_init() and
rhashtable_walk_start() separate? It seems like one always has
to call init for each iteration, as start does not reset the
iterator bits. It would also save a mutex_lock().

> @@ -826,17 +832,18 @@ next:
>  		iter->skip = 0;
>  	}
>  
> -	iter->p = NULL;
> -
> -out:
> -	if (iter->walker->resize) {
> -		iter->p = NULL;
> +	iter->walker->tbl = rht_dereference_rcu(ht->future_tbl, ht);
> +	if (iter->walker->tbl != tbl) {

I was confused at first because this would cause the table to always
be dumped twice in the regular case. I see that in patch 6/6 you change
the meaning of future_tbl and only make it non-NULL during the short
period of the rehashing. Ordering patch 6 in front of 1 would have
helped ;-)

Looks good in combination with patch 6.

Acked-by: Thomas Graf <tgraf@suug.ch>

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 6/6] rhashtable: Move future_tbl into struct bucket_table
  2015-03-13  9:57 ` [PATCH 6/6] rhashtable: Move future_tbl into struct bucket_table Herbert Xu
@ 2015-03-13 16:13   ` Thomas Graf
  0 siblings, 0 replies; 113+ messages in thread
From: Thomas Graf @ 2015-03-13 16:13 UTC (permalink / raw)
  To: Herbert Xu; +Cc: netdev

On 03/13/15 at 08:57pm, Herbert Xu wrote:
> This patch moves future_tbl to open up the possibility of having
> multiple rehashes on the same table.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Acked-by: Thomas Graf <tgraf@suug.ch>

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 0/6] rhashtable: Fixes + cleanups + preparation for multiple rehash
  2015-03-13  9:56 [PATCH 0/6] rhashtable: Fixes + cleanups + preparation for multiple rehash Herbert Xu
                   ` (6 preceding siblings ...)
  2015-03-13 13:57 ` [PATCH 0/6] rhashtable: Fixes + cleanups + preparation for multiple rehash Thomas Graf
@ 2015-03-13 16:25 ` David Miller
  2015-03-14  2:51   ` Herbert Xu
  2015-03-14  2:53 ` [v2 PATCH " Herbert Xu
  8 siblings, 1 reply; 113+ messages in thread
From: David Miller @ 2015-03-13 16:25 UTC (permalink / raw)
  To: herbert; +Cc: tgraf, netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Fri, 13 Mar 2015 20:56:07 +1100

> Patch 1 fixes the walker so that it behaves properly even during
> a resize.
> 
> Patches 2-3 are cleanups.
> 
> Patches 4-6 lay some groundwork for the upcoming multiple rehashing.

Looks like this needs at least one respin, to fix the printf formatting
warning caused by changing size from size_t to unsigned int.

And perhaps addressing the ordering of patches #6 and #1.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 1/6] rhashtable: Fix walker behaviour during rehash
  2015-03-13 15:50   ` Thomas Graf
@ 2015-03-13 23:42     ` Herbert Xu
  2015-03-14  0:06       ` Thomas Graf
  0 siblings, 1 reply; 113+ messages in thread
From: Herbert Xu @ 2015-03-13 23:42 UTC (permalink / raw)
  To: Thomas Graf; +Cc: netdev

On Fri, Mar 13, 2015 at 03:50:23PM +0000, Thomas Graf wrote:
>
> Is there a reason for keeping rhashtable_walk_init() and
> rhashtable_walk_start() separate? It seems like one always has
> to call init for each iteration, as start does not reset the
> iterator bits. It would also save a mutex_lock().

Yes they are needed for netlink users, e.g., netfilter.  Proc
users will always init/start while netlink users can init, and
then just use start/stop to keep their state.
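
A minimal sketch of the two patterns (error handling and the
surrounding dump machinery omitted):

	struct rhashtable_iter iter;

	/* proc-style user: one-shot walk */
	rhashtable_walk_init(ht, &iter);
	rhashtable_walk_start(&iter);
	/* ... rhashtable_walk_next() loop ... */
	rhashtable_walk_stop(&iter);
	rhashtable_walk_exit(&iter);

	/* netlink-style user: init once at dump setup, then start/stop
	 * around every pass so the iterator keeps its position */
	rhashtable_walk_init(ht, &iter);
	/* for each ->dump() invocation: */
	rhashtable_walk_start(&iter);
	/* ... emit as many objects as fit ... */
	rhashtable_walk_stop(&iter);
	/* at dump teardown: */
	rhashtable_walk_exit(&iter);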

> > @@ -826,17 +832,18 @@ next:
> >  		iter->skip = 0;
> >  	}
> >  
> > -	iter->p = NULL;
> > -
> > -out:
> > -	if (iter->walker->resize) {
> > -		iter->p = NULL;
> > +	iter->walker->tbl = rht_dereference_rcu(ht->future_tbl, ht);
> > +	if (iter->walker->tbl != tbl) {
> 
> I was confused at first because this would cause the table to always
> be dumped twice in the regular case. I see that in patch 6/6 you change
> the meaning of future_tbl and only make it non-NULL during the short
> period of the rehashing. Ordering patch 6 in front of 1 would have
> helped ;-)
> 
> Looks good in combination with patch 6.

No this patch is needed on its own and should not depend on patch 6
in any way.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 1/6] rhashtable: Fix walker behaviour during rehash
  2015-03-13 23:42     ` Herbert Xu
@ 2015-03-14  0:06       ` Thomas Graf
  0 siblings, 0 replies; 113+ messages in thread
From: Thomas Graf @ 2015-03-14  0:06 UTC (permalink / raw)
  To: Herbert Xu; +Cc: netdev

On 03/14/15 at 10:42am, Herbert Xu wrote:
> On Fri, Mar 13, 2015 at 03:50:23PM +0000, Thomas Graf wrote:
> >
> > Is there a reason for keeping rhashtable_walk_init() and
> > rhashtable_walk_start() separate? It seems like one always has
> > to call init for each iteration, as start does not reset the
> > iterator bits. It would also save a mutex_lock().
> 
> Yes they are needed for netlink users, e.g., netfilter.  Proc
> users will always init/start while netlink users can init, and
> then just use start/stop to keep their state.

OK, it's for future code. Thanks.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 5/6] rhashtable: Add rehash counter to bucket_table
  2015-03-13 13:51   ` Thomas Graf
@ 2015-03-14  2:49     ` Herbert Xu
  0 siblings, 0 replies; 113+ messages in thread
From: Herbert Xu @ 2015-03-14  2:49 UTC (permalink / raw)
  To: Thomas Graf; +Cc: netdev

On Fri, Mar 13, 2015 at 01:51:31PM +0000, Thomas Graf wrote:
> On 03/13/15 at 08:57pm, Herbert Xu wrote:
> > @@ -58,7 +59,8 @@ struct rhash_head {
> >   * @buckets: size * hash buckets
> >   */
> >  struct bucket_table {
> > -	size_t			size;
> > +	unsigned int		size;
> > +	unsigned int		rehash;
> >  	u32			hash_rnd;
> >  	u32			shift;
> >  	unsigned int		locks_mask;
> 
> This change causes a minor warning:
> lib/test_rhashtable.c: In function ‘test_bucket_stats’:
> lib/test_rhashtable.c:83:4: warning: format ‘%zu’ expects argument of
> type ‘size_t’, but argument 3 has type ‘unsigned int’ [-Wformat=]
>     pr_info(" [%#4x/%zu]", i, tbl->size)

Thanks.  I have now enabled test_rhashtable in my config so I
should be able to catch it next time.
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 0/6] rhashtable: Fixes + cleanups + preparation for multiple rehash
  2015-03-13 16:25 ` David Miller
@ 2015-03-14  2:51   ` Herbert Xu
  0 siblings, 0 replies; 113+ messages in thread
From: Herbert Xu @ 2015-03-14  2:51 UTC (permalink / raw)
  To: David Miller; +Cc: tgraf, netdev

On Fri, Mar 13, 2015 at 12:25:05PM -0400, David Miller wrote:
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Fri, 13 Mar 2015 20:56:07 +1100
> 
> > Patch 1 fixes the walker so that it behaves properly even during
> > a resize.
> > 
> > Patches 2-3 are cleanups.
> > 
> > Patches 4-6 lay some groundwork for the upcoming multiple rehashing.
> 
> Looks like this needs at least one respin, to fix the printf formatting
> warning caused by changing size from size_t to unsigned int.

OK I have fixed the warning.

> And perhaps addressing the ordering of patches #6 and #1.

AFAICS there is no ordering issue there.  #1 is a bug-fix to the
current rhashtable while #6 is purely groundwork for multiple
rehashing.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [v2 PATCH 0/6] rhashtable: Fixes + cleanups + preparation for multiple rehash
  2015-03-13  9:56 [PATCH 0/6] rhashtable: Fixes + cleanups + preparation for multiple rehash Herbert Xu
                   ` (7 preceding siblings ...)
  2015-03-13 16:25 ` David Miller
@ 2015-03-14  2:53 ` Herbert Xu
  2015-03-14  2:57   ` [v2 PATCH 1/6] rhashtable: Fix walker behaviour during rehash Herbert Xu
                     ` (6 more replies)
  8 siblings, 7 replies; 113+ messages in thread
From: Herbert Xu @ 2015-03-14  2:53 UTC (permalink / raw)
  To: Thomas Graf, netdev

Hi:

Patch 1 fixes the walker so that it behaves properly even during
a resize.

Patches 2-3 are cleanups.

Patches 4-6 lay some groundwork for the upcoming multiple rehashing.

This revision fixes the warning coming from the bucket_table->size
downsize and improves its changelog.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [v2 PATCH 1/6] rhashtable: Fix walker behaviour during rehash
  2015-03-14  2:53 ` [v2 PATCH " Herbert Xu
@ 2015-03-14  2:57   ` Herbert Xu
  2015-03-14  2:57   ` [v2 PATCH 2/6] rhashtable: Use SINGLE_DEPTH_NESTING Herbert Xu
                     ` (5 subsequent siblings)
  6 siblings, 0 replies; 113+ messages in thread
From: Herbert Xu @ 2015-03-14  2:57 UTC (permalink / raw)
  To: David Miller, Thomas Graf, netdev

Previously, whenever the walker encountered a resize it simply
snapped back to the beginning and started again.  However, this only
works if the rehash started and completed while the walker was
idle.

If the walker attempts to restart while the rehash is still ongoing,
we may miss objects that we shouldn't have.

This patch fixes this by making the walker walk the old table
followed by the new table just like all other readers.  If a
rehash is detected we will still signal our caller of the fact
so they can prepare for duplicates but we will simply continue
the walk onto the new table after the old one is finished either
by us or by the rehasher.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 include/linux/rhashtable.h |    8 ++---
 lib/rhashtable.c           |   69 ++++++++++++++++++++++++++++++---------------
 2 files changed, 50 insertions(+), 27 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index c93ff8a..4192682 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -53,6 +53,7 @@ struct rhash_head {
  * @shift: Current size (1 << shift)
  * @locks_mask: Mask to apply before accessing locks[]
  * @locks: Array of spinlocks protecting individual buckets
+ * @walkers: List of active walkers
  * @buckets: size * hash buckets
  */
 struct bucket_table {
@@ -61,6 +62,7 @@ struct bucket_table {
 	u32			shift;
 	unsigned int		locks_mask;
 	spinlock_t		*locks;
+	struct list_head	walkers;
 
 	struct rhash_head __rcu	*buckets[] ____cacheline_aligned_in_smp;
 };
@@ -104,7 +106,6 @@ struct rhashtable_params {
  * @p: Configuration parameters
  * @run_work: Deferred worker to expand/shrink asynchronously
  * @mutex: Mutex to protect current/future table swapping
- * @walkers: List of active walkers
  * @being_destroyed: True if table is set up for destruction
  */
 struct rhashtable {
@@ -115,17 +116,16 @@ struct rhashtable {
 	struct rhashtable_params	p;
 	struct work_struct		run_work;
 	struct mutex                    mutex;
-	struct list_head		walkers;
 };
 
 /**
  * struct rhashtable_walker - Hash table walker
  * @list: List entry on list of walkers
- * @resize: Resize event occured
+ * @tbl: The table that we were walking over
  */
 struct rhashtable_walker {
 	struct list_head list;
-	bool resize;
+	struct bucket_table *tbl;
 };
 
 /**
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index fc0d451..f7c7607 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -170,6 +170,8 @@ static struct bucket_table *bucket_table_alloc(struct rhashtable *ht,
 		return NULL;
 	}
 
+	INIT_LIST_HEAD(&tbl->walkers);
+
 	for (i = 0; i < nbuckets; i++)
 		INIT_RHT_NULLS_HEAD(tbl->buckets[i], ht, i);
 
@@ -264,6 +266,7 @@ static void rhashtable_rehash(struct rhashtable *ht,
 			      struct bucket_table *new_tbl)
 {
 	struct bucket_table *old_tbl = rht_dereference(ht->tbl, ht);
+	struct rhashtable_walker *walker;
 	unsigned old_hash;
 
 	get_random_bytes(&new_tbl->hash_rnd, sizeof(new_tbl->hash_rnd));
@@ -284,6 +287,9 @@ static void rhashtable_rehash(struct rhashtable *ht,
 	/* Publish the new table pointer. */
 	rcu_assign_pointer(ht->tbl, new_tbl);
 
+	list_for_each_entry(walker, &old_tbl->walkers, list)
+		walker->tbl = NULL;
+
 	/* Wait for readers. All new readers will see the new
 	 * table, and thus no references to the old table will
 	 * remain.
@@ -358,7 +364,6 @@ static void rht_deferred_worker(struct work_struct *work)
 {
 	struct rhashtable *ht;
 	struct bucket_table *tbl;
-	struct rhashtable_walker *walker;
 
 	ht = container_of(work, struct rhashtable, run_work);
 	mutex_lock(&ht->mutex);
@@ -367,9 +372,6 @@ static void rht_deferred_worker(struct work_struct *work)
 
 	tbl = rht_dereference(ht->tbl, ht);
 
-	list_for_each_entry(walker, &ht->walkers, list)
-		walker->resize = true;
-
 	if (rht_grow_above_75(ht, tbl))
 		rhashtable_expand(ht);
 	else if (rht_shrink_below_30(ht, tbl))
@@ -725,11 +727,9 @@ int rhashtable_walk_init(struct rhashtable *ht, struct rhashtable_iter *iter)
 	if (!iter->walker)
 		return -ENOMEM;
 
-	INIT_LIST_HEAD(&iter->walker->list);
-	iter->walker->resize = false;
-
 	mutex_lock(&ht->mutex);
-	list_add(&iter->walker->list, &ht->walkers);
+	iter->walker->tbl = rht_dereference(ht->tbl, ht);
+	list_add(&iter->walker->list, &iter->walker->tbl->walkers);
 	mutex_unlock(&ht->mutex);
 
 	return 0;
@@ -745,7 +745,8 @@ EXPORT_SYMBOL_GPL(rhashtable_walk_init);
 void rhashtable_walk_exit(struct rhashtable_iter *iter)
 {
 	mutex_lock(&iter->ht->mutex);
-	list_del(&iter->walker->list);
+	if (iter->walker->tbl)
+		list_del(&iter->walker->list);
 	mutex_unlock(&iter->ht->mutex);
 	kfree(iter->walker);
 }
@@ -767,12 +768,19 @@ EXPORT_SYMBOL_GPL(rhashtable_walk_exit);
  */
 int rhashtable_walk_start(struct rhashtable_iter *iter)
 {
+	struct rhashtable *ht = iter->ht;
+
+	mutex_lock(&ht->mutex);
+
+	if (iter->walker->tbl)
+		list_del(&iter->walker->list);
+
 	rcu_read_lock();
 
-	if (iter->walker->resize) {
-		iter->slot = 0;
-		iter->skip = 0;
-		iter->walker->resize = false;
+	mutex_unlock(&ht->mutex);
+
+	if (!iter->walker->tbl) {
+		iter->walker->tbl = rht_dereference_rcu(ht->tbl, ht);
 		return -EAGAIN;
 	}
 
@@ -794,13 +802,11 @@ EXPORT_SYMBOL_GPL(rhashtable_walk_start);
  */
 void *rhashtable_walk_next(struct rhashtable_iter *iter)
 {
-	const struct bucket_table *tbl;
+	struct bucket_table *tbl = iter->walker->tbl;
 	struct rhashtable *ht = iter->ht;
 	struct rhash_head *p = iter->p;
 	void *obj = NULL;
 
-	tbl = rht_dereference_rcu(ht->tbl, ht);
-
 	if (p) {
 		p = rht_dereference_bucket_rcu(p->next, tbl, iter->slot);
 		goto next;
@@ -826,17 +832,18 @@ next:
 		iter->skip = 0;
 	}
 
-	iter->p = NULL;
-
-out:
-	if (iter->walker->resize) {
-		iter->p = NULL;
+	iter->walker->tbl = rht_dereference_rcu(ht->future_tbl, ht);
+	if (iter->walker->tbl != tbl) {
 		iter->slot = 0;
 		iter->skip = 0;
-		iter->walker->resize = false;
 		return ERR_PTR(-EAGAIN);
 	}
 
+	iter->walker->tbl = NULL;
+	iter->p = NULL;
+
+out:
+
 	return obj;
 }
 EXPORT_SYMBOL_GPL(rhashtable_walk_next);
@@ -849,7 +856,24 @@ EXPORT_SYMBOL_GPL(rhashtable_walk_next);
  */
 void rhashtable_walk_stop(struct rhashtable_iter *iter)
 {
+	struct rhashtable *ht;
+	struct bucket_table *tbl = iter->walker->tbl;
+
 	rcu_read_unlock();
+
+	if (!tbl)
+		return;
+
+	ht = iter->ht;
+
+	mutex_lock(&ht->mutex);
+	if (rht_dereference(ht->tbl, ht) == tbl ||
+	    rht_dereference(ht->future_tbl, ht) == tbl)
+		list_add(&iter->walker->list, &tbl->walkers);
+	else
+		iter->walker->tbl = NULL;
+	mutex_unlock(&ht->mutex);
+
 	iter->p = NULL;
 }
 EXPORT_SYMBOL_GPL(rhashtable_walk_stop);
@@ -927,7 +951,6 @@ int rhashtable_init(struct rhashtable *ht, struct rhashtable_params *params)
 	memset(ht, 0, sizeof(*ht));
 	mutex_init(&ht->mutex);
 	memcpy(&ht->p, params, sizeof(*params));
-	INIT_LIST_HEAD(&ht->walkers);
 
 	if (params->locks_mul)
 		ht->p.locks_mul = roundup_pow_of_two(params->locks_mul);

^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [v2 PATCH 2/6] rhashtable: Use SINGLE_DEPTH_NESTING
  2015-03-14  2:53 ` [v2 PATCH " Herbert Xu
  2015-03-14  2:57   ` [v2 PATCH 1/6] rhashtable: Fix walker behaviour during rehash Herbert Xu
@ 2015-03-14  2:57   ` Herbert Xu
  2015-03-14  2:57   ` [v2 PATCH 3/6] rhashtable: Move seed init into bucket_table_alloc Herbert Xu
                     ` (4 subsequent siblings)
  6 siblings, 0 replies; 113+ messages in thread
From: Herbert Xu @ 2015-03-14  2:57 UTC (permalink / raw)
  To: David Miller, Thomas Graf, netdev

We only nest one level deep, so there is no need to roll our own
subclasses.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 lib/rhashtable.c |    9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index f7c7607..5d06cc2 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -33,11 +33,6 @@
 /* Base bits plus 1 bit for nulls marker */
 #define HASH_RESERVED_SPACE	(RHT_BASE_BITS + 1)
 
-enum {
-	RHT_LOCK_NORMAL,
-	RHT_LOCK_NESTED,
-};
-
 /* The bucket lock is selected based on the hash and protects mutations
  * on a group of hash buckets.
  *
@@ -231,7 +226,7 @@ static int rhashtable_rehash_one(struct rhashtable *ht, unsigned old_hash)
 
 	new_bucket_lock = bucket_lock(new_tbl, new_hash);
 
-	spin_lock_nested(new_bucket_lock, RHT_LOCK_NESTED);
+	spin_lock_nested(new_bucket_lock, SINGLE_DEPTH_NESTING);
 	head = rht_dereference_bucket(new_tbl->buckets[new_hash],
 				      new_tbl, new_hash);
 
@@ -405,7 +400,7 @@ static bool __rhashtable_insert(struct rhashtable *ht, struct rhash_head *obj,
 	tbl = rht_dereference_rcu(ht->future_tbl, ht);
 	if (tbl != old_tbl) {
 		hash = head_hashfn(ht, tbl, obj);
-		spin_lock_nested(bucket_lock(tbl, hash), RHT_LOCK_NESTED);
+		spin_lock_nested(bucket_lock(tbl, hash), SINGLE_DEPTH_NESTING);
 	}
 
 	if (compare &&

^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [v2 PATCH 3/6] rhashtable: Move seed init into bucket_table_alloc
  2015-03-14  2:53 ` [v2 PATCH " Herbert Xu
  2015-03-14  2:57   ` [v2 PATCH 1/6] rhashtable: Fix walker behaviour during rehash Herbert Xu
  2015-03-14  2:57   ` [v2 PATCH 2/6] rhashtable: Use SINGLE_DEPTH_NESTING Herbert Xu
@ 2015-03-14  2:57   ` Herbert Xu
  2015-03-14  2:57   ` [v2 PATCH 4/6] rhashtable: Free bucket tables asynchronously after rehash Herbert Xu
                     ` (3 subsequent siblings)
  6 siblings, 0 replies; 113+ messages in thread
From: Herbert Xu @ 2015-03-14  2:57 UTC (permalink / raw)
  To: David Miller, Thomas Graf, netdev

It seems that I have already made every rehash redo the random
seed even though my commit message indicated otherwise :)

Since we have already taken that step, this patch goes one step
further and moves the seed initialisation into bucket_table_alloc.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 lib/rhashtable.c |   16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 5d06cc2..e55bbc8 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -142,7 +142,7 @@ static void bucket_table_free(const struct bucket_table *tbl)
 }
 
 static struct bucket_table *bucket_table_alloc(struct rhashtable *ht,
-					       size_t nbuckets, u32 hash_rnd)
+					       size_t nbuckets)
 {
 	struct bucket_table *tbl = NULL;
 	size_t size;
@@ -158,7 +158,6 @@ static struct bucket_table *bucket_table_alloc(struct rhashtable *ht,
 
 	tbl->size = nbuckets;
 	tbl->shift = ilog2(nbuckets);
-	tbl->hash_rnd = hash_rnd;
 
 	if (alloc_bucket_locks(ht, tbl) < 0) {
 		bucket_table_free(tbl);
@@ -167,6 +166,8 @@ static struct bucket_table *bucket_table_alloc(struct rhashtable *ht,
 
 	INIT_LIST_HEAD(&tbl->walkers);
 
+	get_random_bytes(&tbl->hash_rnd, sizeof(tbl->hash_rnd));
+
 	for (i = 0; i < nbuckets; i++)
 		INIT_RHT_NULLS_HEAD(tbl->buckets[i], ht, i);
 
@@ -264,8 +265,6 @@ static void rhashtable_rehash(struct rhashtable *ht,
 	struct rhashtable_walker *walker;
 	unsigned old_hash;
 
-	get_random_bytes(&new_tbl->hash_rnd, sizeof(new_tbl->hash_rnd));
-
 	/* Make insertions go into the new, empty table right away. Deletions
 	 * and lookups will be attempted in both tables until we synchronize.
 	 * The synchronize_rcu() guarantees for the new table to be picked up
@@ -315,7 +314,7 @@ int rhashtable_expand(struct rhashtable *ht)
 
 	ASSERT_RHT_MUTEX(ht);
 
-	new_tbl = bucket_table_alloc(ht, old_tbl->size * 2, old_tbl->hash_rnd);
+	new_tbl = bucket_table_alloc(ht, old_tbl->size * 2);
 	if (new_tbl == NULL)
 		return -ENOMEM;
 
@@ -346,7 +345,7 @@ int rhashtable_shrink(struct rhashtable *ht)
 
 	ASSERT_RHT_MUTEX(ht);
 
-	new_tbl = bucket_table_alloc(ht, old_tbl->size / 2, old_tbl->hash_rnd);
+	new_tbl = bucket_table_alloc(ht, old_tbl->size / 2);
 	if (new_tbl == NULL)
 		return -ENOMEM;
 
@@ -926,7 +925,6 @@ int rhashtable_init(struct rhashtable *ht, struct rhashtable_params *params)
 {
 	struct bucket_table *tbl;
 	size_t size;
-	u32 hash_rnd;
 
 	size = HASH_DEFAULT_SIZE;
 
@@ -952,9 +950,7 @@ int rhashtable_init(struct rhashtable *ht, struct rhashtable_params *params)
 	else
 		ht->p.locks_mul = BUCKET_LOCKS_PER_CPU;
 
-	get_random_bytes(&hash_rnd, sizeof(hash_rnd));
-
-	tbl = bucket_table_alloc(ht, size, hash_rnd);
+	tbl = bucket_table_alloc(ht, size);
 	if (tbl == NULL)
 		return -ENOMEM;
 

^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [v2 PATCH 4/6] rhashtable: Free bucket tables asynchronously after rehash
  2015-03-14  2:53 ` [v2 PATCH " Herbert Xu
                     ` (2 preceding siblings ...)
  2015-03-14  2:57   ` [v2 PATCH 3/6] rhashtable: Move seed init into bucket_table_alloc Herbert Xu
@ 2015-03-14  2:57   ` Herbert Xu
  2015-03-14  2:57   ` [v2 PATCH 5/6] rhashtable: Add rehash counter to bucket_table Herbert Xu
                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 113+ messages in thread
From: Herbert Xu @ 2015-03-14  2:57 UTC (permalink / raw)
  To: David Miller, Thomas Graf, netdev

There is in fact no need to wait for an RCU grace period in the
rehash function, since all insertions are guaranteed to go into
the new table through spin locks.

This patch uses call_rcu to free the old/rehashed table at our
leisure.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 include/linux/rhashtable.h |    2 ++
 lib/rhashtable.c           |    9 ++++++---
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 4192682..a0abddd 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -54,6 +54,7 @@ struct rhash_head {
  * @locks_mask: Mask to apply before accessing locks[]
  * @locks: Array of spinlocks protecting individual buckets
  * @walkers: List of active walkers
+ * @rcu: RCU structure for freeing the table
  * @buckets: size * hash buckets
  */
 struct bucket_table {
@@ -63,6 +64,7 @@ struct bucket_table {
 	unsigned int		locks_mask;
 	spinlock_t		*locks;
 	struct list_head	walkers;
+	struct rcu_head		rcu;
 
 	struct rhash_head __rcu	*buckets[] ____cacheline_aligned_in_smp;
 };
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index e55bbc8..36fb091 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -141,6 +141,11 @@ static void bucket_table_free(const struct bucket_table *tbl)
 	kvfree(tbl);
 }
 
+static void bucket_table_free_rcu(struct rcu_head *head)
+{
+	bucket_table_free(container_of(head, struct bucket_table, rcu));
+}
+
 static struct bucket_table *bucket_table_alloc(struct rhashtable *ht,
 					       size_t nbuckets)
 {
@@ -288,9 +293,7 @@ static void rhashtable_rehash(struct rhashtable *ht,
 	 * table, and thus no references to the old table will
 	 * remain.
 	 */
-	synchronize_rcu();
-
-	bucket_table_free(old_tbl);
+	call_rcu(&old_tbl->rcu, bucket_table_free_rcu);
 }
 
 /**

^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [v2 PATCH 5/6] rhashtable: Add rehash counter to bucket_table
  2015-03-14  2:53 ` [v2 PATCH " Herbert Xu
                     ` (3 preceding siblings ...)
  2015-03-14  2:57   ` [v2 PATCH 4/6] rhashtable: Free bucket tables asynchronously after rehash Herbert Xu
@ 2015-03-14  2:57   ` Herbert Xu
  2015-03-14  2:57   ` [v2 PATCH 6/6] rhashtable: Move future_tbl into struct bucket_table Herbert Xu
  2015-03-15  5:36   ` [v2 PATCH 0/6] rhashtable: Fixes + cleanups + preparation for multiple rehash David Miller
  6 siblings, 0 replies; 113+ messages in thread
From: Herbert Xu @ 2015-03-14  2:57 UTC (permalink / raw)
  To: David Miller, Thomas Graf, netdev

This patch adds a rehash counter to bucket_table to indicate
the last bucket that has been rehashed.  This serves two purposes:

1. Any bucket that has been rehashed can never gain a new object.
2. If the rehash counter reaches the size of the table, the table
will forever remain empty.

This patch also downsizes bucket_table->size to an unsigned int
since we do not support sizes greater than 32 bits yet.
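
As a sketch of how the counter can be consumed (the helper below is
illustrative and not part of this patch; the next patch open-codes
the same check in rhashtable_walk_stop):

	/* Once every bucket has been rehashed the table can never hold
	 * objects again, so e.g. a walker may safely stop tracking it.
	 */
	static bool rht_table_dead(const struct bucket_table *tbl)
	{
		return tbl->rehash >= tbl->size;
	}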

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 include/linux/rhashtable.h |    4 +++-
 lib/rhashtable.c           |    1 +
 lib/test_rhashtable.c      |    2 +-
 3 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index a0abddd..ed7562a 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -49,6 +49,7 @@ struct rhash_head {
 /**
  * struct bucket_table - Table of hash buckets
  * @size: Number of hash buckets
+ * @rehash: Current bucket being rehashed
  * @hash_rnd: Random seed to fold into hash
  * @shift: Current size (1 << shift)
  * @locks_mask: Mask to apply before accessing locks[]
@@ -58,7 +59,8 @@ struct rhash_head {
  * @buckets: size * hash buckets
  */
 struct bucket_table {
-	size_t			size;
+	unsigned int		size;
+	unsigned int		rehash;
 	u32			hash_rnd;
 	u32			shift;
 	unsigned int		locks_mask;
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 36fb091..ff4ea17 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -260,6 +260,7 @@ static void rhashtable_rehash_chain(struct rhashtable *ht, unsigned old_hash)
 	spin_lock_bh(old_bucket_lock);
 	while (!rhashtable_rehash_one(ht, old_hash))
 		;
+	old_tbl->rehash++;
 	spin_unlock_bh(old_bucket_lock);
 }
 
diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index 67c7593..16974fd 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -80,7 +80,7 @@ static void test_bucket_stats(struct rhashtable *ht, bool quiet)
 		rcu_cnt = cnt = 0;
 
 		if (!quiet)
-			pr_info(" [%#4x/%zu]", i, tbl->size);
+			pr_info(" [%#4x/%u]", i, tbl->size);
 
 		rht_for_each_entry_rcu(obj, pos, tbl, i, node) {
 			cnt++;

^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [v2 PATCH 6/6] rhashtable: Move future_tbl into struct bucket_table
  2015-03-14  2:53 ` [v2 PATCH " Herbert Xu
                     ` (4 preceding siblings ...)
  2015-03-14  2:57   ` [v2 PATCH 5/6] rhashtable: Add rehash counter to bucket_table Herbert Xu
@ 2015-03-14  2:57   ` Herbert Xu
  2015-03-15  5:36   ` [v2 PATCH 0/6] rhashtable: Fixes + cleanups + preparation for multiple rehash David Miller
  6 siblings, 0 replies; 113+ messages in thread
From: Herbert Xu @ 2015-03-14  2:57 UTC (permalink / raw)
  To: David Miller, Thomas Graf, netdev

This patch moves future_tbl into struct bucket_table to open up the
possibility of having multiple rehashes on the same table.
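
Roughly, the effect is that readers now follow a chain of tables
rather than looking at exactly two.  A sketch (lookup_all_tables and
lookup_one are invented for illustration; rhashtable_lookup,
rhashtable_remove and the walker below open-code this loop):

	/* Caller holds rcu_read_lock(). */
	static void *lookup_all_tables(struct rhashtable *ht, const void *key)
	{
		const struct bucket_table *tbl;
		void *obj;

		tbl = rht_dereference_rcu(ht->tbl, ht);
		do {
			obj = lookup_one(ht, tbl, key);	/* hypothetical */
			if (obj)
				return obj;
			/* Pairs with the write barrier after installing
			 * old_tbl->future_tbl so new tables are seen.
			 */
			smp_rmb();
			tbl = rht_dereference_rcu(tbl->future_tbl, ht);
		} while (tbl);

		return NULL;
	}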

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 include/linux/rhashtable.h |    5 +++--
 lib/rhashtable.c           |   27 +++++++++++----------------
 2 files changed, 14 insertions(+), 18 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index ed7562a..1695378 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -56,6 +56,7 @@ struct rhash_head {
  * @locks: Array of spinlocks protecting individual buckets
  * @walkers: List of active walkers
  * @rcu: RCU structure for freeing the table
+ * @future_tbl: Table under construction during rehashing
  * @buckets: size * hash buckets
  */
 struct bucket_table {
@@ -68,6 +69,8 @@ struct bucket_table {
 	struct list_head	walkers;
 	struct rcu_head		rcu;
 
+	struct bucket_table __rcu *future_tbl;
+
 	struct rhash_head __rcu	*buckets[] ____cacheline_aligned_in_smp;
 };
 
@@ -105,7 +108,6 @@ struct rhashtable_params {
 /**
  * struct rhashtable - Hash table handle
  * @tbl: Bucket table
- * @future_tbl: Table under construction during expansion/shrinking
  * @nelems: Number of elements in table
  * @p: Configuration parameters
  * @run_work: Deferred worker to expand/shrink asynchronously
@@ -114,7 +116,6 @@ struct rhashtable_params {
  */
 struct rhashtable {
 	struct bucket_table __rcu	*tbl;
-	struct bucket_table __rcu       *future_tbl;
 	atomic_t			nelems;
 	bool                            being_destroyed;
 	struct rhashtable_params	p;
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index ff4ea17..9d53a46 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -207,8 +207,9 @@ static bool rht_shrink_below_30(const struct rhashtable *ht,
 
 static int rhashtable_rehash_one(struct rhashtable *ht, unsigned old_hash)
 {
-	struct bucket_table *new_tbl = rht_dereference(ht->future_tbl, ht);
 	struct bucket_table *old_tbl = rht_dereference(ht->tbl, ht);
+	struct bucket_table *new_tbl =
+		rht_dereference(old_tbl->future_tbl, ht) ?: old_tbl;
 	struct rhash_head __rcu **pprev = &old_tbl->buckets[old_hash];
 	int err = -ENOENT;
 	struct rhash_head *head, *next, *entry;
@@ -273,10 +274,8 @@ static void rhashtable_rehash(struct rhashtable *ht,
 
 	/* Make insertions go into the new, empty table right away. Deletions
 	 * and lookups will be attempted in both tables until we synchronize.
-	 * The synchronize_rcu() guarantees for the new table to be picked up
-	 * so no new additions go into the old table while we relink.
 	 */
-	rcu_assign_pointer(ht->future_tbl, new_tbl);
+	rcu_assign_pointer(old_tbl->future_tbl, new_tbl);
 
 	/* Ensure the new table is visible to readers. */
 	smp_wmb();
@@ -400,7 +399,7 @@ static bool __rhashtable_insert(struct rhashtable *ht, struct rhash_head *obj,
 	 * also grab the bucket lock in old_tbl because until the
 	 * rehash completes ht->tbl won't be changed.
 	 */
-	tbl = rht_dereference_rcu(ht->future_tbl, ht);
+	tbl = rht_dereference_rcu(old_tbl->future_tbl, ht) ?: old_tbl;
 	if (tbl != old_tbl) {
 		hash = head_hashfn(ht, tbl, obj);
 		spin_lock_nested(bucket_lock(tbl, hash), SINGLE_DEPTH_NESTING);
@@ -525,7 +524,7 @@ bool rhashtable_remove(struct rhashtable *ht, struct rhash_head *obj)
 	 * visible then that guarantees the entry to still be in
 	 * old_tbl if it exists.
 	 */
-	tbl = rht_dereference_rcu(ht->future_tbl, ht);
+	tbl = rht_dereference_rcu(old_tbl->future_tbl, ht) ?: old_tbl;
 	if (!ret && old_tbl != tbl)
 		ret = __rhashtable_remove(ht, tbl, obj);
 
@@ -599,7 +598,7 @@ EXPORT_SYMBOL_GPL(rhashtable_lookup);
 void *rhashtable_lookup_compare(struct rhashtable *ht, const void *key,
 				bool (*compare)(void *, void *), void *arg)
 {
-	const struct bucket_table *tbl, *old_tbl;
+	const struct bucket_table *tbl;
 	struct rhash_head *he;
 	u32 hash;
 
@@ -618,9 +617,8 @@ restart:
 	/* Ensure we see any new tables. */
 	smp_rmb();
 
-	old_tbl = tbl;
-	tbl = rht_dereference_rcu(ht->future_tbl, ht);
-	if (unlikely(tbl != old_tbl))
+	tbl = rht_dereference_rcu(tbl->future_tbl, ht);
+	if (unlikely(tbl))
 		goto restart;
 	rcu_read_unlock();
 
@@ -830,14 +828,13 @@ next:
 		iter->skip = 0;
 	}
 
-	iter->walker->tbl = rht_dereference_rcu(ht->future_tbl, ht);
-	if (iter->walker->tbl != tbl) {
+	iter->walker->tbl = rht_dereference_rcu(tbl->future_tbl, ht);
+	if (iter->walker->tbl) {
 		iter->slot = 0;
 		iter->skip = 0;
 		return ERR_PTR(-EAGAIN);
 	}
 
-	iter->walker->tbl = NULL;
 	iter->p = NULL;
 
 out:
@@ -865,8 +862,7 @@ void rhashtable_walk_stop(struct rhashtable_iter *iter)
 	ht = iter->ht;
 
 	mutex_lock(&ht->mutex);
-	if (rht_dereference(ht->tbl, ht) == tbl ||
-	    rht_dereference(ht->future_tbl, ht) == tbl)
+	if (tbl->rehash < tbl->size)
 		list_add(&iter->walker->list, &tbl->walkers);
 	else
 		iter->walker->tbl = NULL;
@@ -961,7 +957,6 @@ int rhashtable_init(struct rhashtable *ht, struct rhashtable_params *params)
 	atomic_set(&ht->nelems, 0);
 
 	RCU_INIT_POINTER(ht->tbl, tbl);
-	RCU_INIT_POINTER(ht->future_tbl, tbl);
 
 	INIT_WORK(&ht->run_work, rht_deferred_worker);
 

^ permalink raw reply related	[flat|nested] 113+ messages in thread

* Re: [v2 PATCH 0/6] rhashtable: Fixes + cleanups + preparation for multiple rehash
  2015-03-14  2:53 ` [v2 PATCH " Herbert Xu
                     ` (5 preceding siblings ...)
  2015-03-14  2:57   ` [v2 PATCH 6/6] rhashtable: Move future_tbl into struct bucket_table Herbert Xu
@ 2015-03-15  5:36   ` David Miller
  2015-03-15 10:10     ` [v1 PATCH 0/6] rhashtable: Fix two bugs caused by multiple rehash preparation Herbert Xu
  6 siblings, 1 reply; 113+ messages in thread
From: David Miller @ 2015-03-15  5:36 UTC (permalink / raw)
  To: herbert; +Cc: tgraf, netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sat, 14 Mar 2015 13:53:39 +1100

> Patch 1 fixes the walker so that it behaves properly even during
> a resize.
> 
> Patch 2-3 are cleanups.
> 
> Patch 4-6 lays some ground work for the upcoming multiple rehashing.
> 
> This revision fixes the warning coming from the bucket_table->size
> downsize and improves its changelog.

Series applied, thanks a lot Herbert.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [v1 PATCH 0/6] rhashtable: Fix two bugs caused by multiple rehash preparation
  2015-03-15  5:36   ` [v2 PATCH 0/6] rhashtable: Fixes + cleanups + preparation for multiple rehash David Miller
@ 2015-03-15 10:10     ` Herbert Xu
  2015-03-15 10:12       ` [v1 PATCH 1/2] rhashtable: Fix use-after-free in rhashtable_walk_stop Herbert Xu
                         ` (4 more replies)
  0 siblings, 5 replies; 113+ messages in thread
From: Herbert Xu @ 2015-03-15 10:10 UTC (permalink / raw)
  To: David Miller; +Cc: tgraf, netdev

Hi Dave:

While testing some new patches over the weekend I discovered a
couple of bugs in the series that had just been merged.  These
two patches fix them:

1) A use-after-free in the walker that can cause crashes when
walking during a rehash.

2) When a second rehash starts during a single rhashtable_remove
call, the remove may fail when it shouldn't.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [v1 PATCH 1/2] rhashtable: Fix use-after-free in rhashtable_walk_stop
  2015-03-15 10:10     ` [v1 PATCH 0/6] rhashtable: Fix two bugs caused by multiple rehash preparation Herbert Xu
@ 2015-03-15 10:12       ` Herbert Xu
  2015-03-15 10:12       ` [v1 PATCH 2/2] rhashtable: Fix rhashtable_remove failures Herbert Xu
                         ` (3 subsequent siblings)
  4 siblings, 0 replies; 113+ messages in thread
From: Herbert Xu @ 2015-03-15 10:12 UTC (permalink / raw)
  To: David Miller, tgraf, netdev

The commit c4db8848af6af92f90462258603be844baeab44d ("rhashtable:
Move future_tbl into struct bucket_table") introduced a use-after-
free bug in rhashtable_walk_stop because it dereferences tbl after
dropping the RCU read lock.

This patch fixes it by moving the RCU read unlock down to the bottom
of rhashtable_walk_stop.  In fact this was how I had it originally
but it got dropped while rearranging patches because this one
depended on the async freeing of bucket_table.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 lib/rhashtable.c |    7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 9d53a46..b916679 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -854,10 +854,8 @@ void rhashtable_walk_stop(struct rhashtable_iter *iter)
 	struct rhashtable *ht;
 	struct bucket_table *tbl = iter->walker->tbl;
 
-	rcu_read_unlock();
-
 	if (!tbl)
-		return;
+		goto out;
 
 	ht = iter->ht;
 
@@ -869,6 +867,9 @@ void rhashtable_walk_stop(struct rhashtable_iter *iter)
 	mutex_unlock(&ht->mutex);
 
 	iter->p = NULL;
+
+out:
+	rcu_read_unlock();
 }
 EXPORT_SYMBOL_GPL(rhashtable_walk_stop);
 

^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [v1 PATCH 2/2] rhashtable: Fix rhashtable_remove failures
  2015-03-15 10:10     ` [v1 PATCH 0/6] rhashtable: Fix two bugs caused by multiple rehash preparation Herbert Xu
  2015-03-15 10:12       ` [v1 PATCH 1/2] rhashtable: Fix use-after-free in rhashtable_walk_stop Herbert Xu
@ 2015-03-15 10:12       ` Herbert Xu
  2015-03-15 10:43       ` [v1 PATCH 0/14] rhashtable: Kill shift/Key netlink namespace/Merge jhash Herbert Xu
                         ` (2 subsequent siblings)
  4 siblings, 0 replies; 113+ messages in thread
From: Herbert Xu @ 2015-03-15 10:12 UTC (permalink / raw)
  To: David Miller, tgraf, netdev

The commit 9d901bc05153bbf33b5da2cd6266865e531f0545 ("rhashtable:
Free bucket tables asynchronously after rehash") causes gratuitous
failures in rhashtable_remove.

The reason is that it inadvertently introduced multiple rehashing
from the perspective of readers.  IOW it is now possible to see
more than two tables during a single RCU critical section.

Fortunately the other reader, rhashtable_lookup, already deals with
this correctly thanks to c4db8848af6af92f90462258603be844baeab44d
("rhashtable: Move future_tbl into struct bucket_table"), so only
rhashtable_remove is broken by this change.

This patch fixes this by looping over every table from the first
one to the last or until we find the element that we were trying
to delete.

Incidentally, the simple test that detected an ongoing rehash in
order to avoid starting another shrink no longer works.  Since it
isn't needed anyway (the work queue and the mutex serve as a natural
barrier against unnecessary rehashes) I've simply killed the test.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 lib/rhashtable.c |   17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index b916679..c523d3a 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -511,28 +511,25 @@ static bool __rhashtable_remove(struct rhashtable *ht,
  */
 bool rhashtable_remove(struct rhashtable *ht, struct rhash_head *obj)
 {
-	struct bucket_table *tbl, *old_tbl;
+	struct bucket_table *tbl;
 	bool ret;
 
 	rcu_read_lock();
 
-	old_tbl = rht_dereference_rcu(ht->tbl, ht);
-	ret = __rhashtable_remove(ht, old_tbl, obj);
+	tbl = rht_dereference_rcu(ht->tbl, ht);
 
 	/* Because we have already taken (and released) the bucket
 	 * lock in old_tbl, if we find that future_tbl is not yet
 	 * visible then that guarantees the entry to still be in
-	 * old_tbl if it exists.
+	 * the old tbl if it exists.
 	 */
-	tbl = rht_dereference_rcu(old_tbl->future_tbl, ht) ?: old_tbl;
-	if (!ret && old_tbl != tbl)
-		ret = __rhashtable_remove(ht, tbl, obj);
+	while (!(ret = __rhashtable_remove(ht, tbl, obj)) &&
+	       (tbl = rht_dereference_rcu(tbl->future_tbl, ht)))
+		;
 
 	if (ret) {
-		bool no_resize_running = tbl == old_tbl;
-
 		atomic_dec(&ht->nelems);
-		if (no_resize_running && rht_shrink_below_30(ht, tbl))
+		if (rht_shrink_below_30(ht, tbl))
 			schedule_work(&ht->run_work);
 	}
 

^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [v1 PATCH 0/14] rhashtable: Kill shift/Key netlink namespace/Merge jhash
  2015-03-15 10:10     ` [v1 PATCH 0/6] rhashtable: Fix two bugs caused by multiple rehash preparation Herbert Xu
  2015-03-15 10:12       ` [v1 PATCH 1/2] rhashtable: Fix use-after-free in rhashtable_walk_stop Herbert Xu
  2015-03-15 10:12       ` [v1 PATCH 2/2] rhashtable: Fix rhashtable_remove failures Herbert Xu
@ 2015-03-15 10:43       ` Herbert Xu
  2015-03-15 10:44         ` [v1 PATCH 1/14] rhashtable: Remove shift from bucket_table Herbert Xu
                           ` (20 more replies)
  2015-03-15 10:43       ` [v1 PATCH 0/6] rhashtable: Fix two bugs caused by multiple rehash preparation Herbert Xu
  2015-03-16  2:23       ` David Miller
  4 siblings, 21 replies; 113+ messages in thread
From: Herbert Xu @ 2015-03-15 10:43 UTC (permalink / raw)
  To: David Miller; +Cc: tgraf, netdev, Eric Dumazet

Hi Dave:

This series of patches deals with some niggly bits that nevertheless
need to be dealt with before we can move ahead.

I was trying to squeeze bucket_table->rehash in by downsizing
bucket_table->size, only to find that my spot had been taken
over by bucket_table->shift.  Patches 1-6 kill shift and make
me feel better :)

One of the things I want to do is limit the chain length.  However,
certain users try to hash multiple objects with the same key.  This
would obviously defeat any attempt to control the maximum chain
length.  The only current offender using rhashtable is netlink and
it really has no excuse to do this at all.

Patches 7-10 resolves that problem.

Finally, I saw that with every new rhashtable user we gained a copy
of jhash: as it's an inline function, every indirect reference to
it results in another copy.  Patches 11-14 deal with this by making
jhash (or jhash2 when applicable) the function to use when the user
does not select a hash function.

PS I'd love to kill the indirect call on the hash but I think
you guys need something other than jhash for sockets.  Presumably
they are not exposed to untrusted parties.  But I really wonder
whether that is still the case given things like namespaces where
even root may be untrusted.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 0/6] rhashtable: Fix two bugs caused by multiple rehash preparation
  2015-03-15 10:10     ` [v1 PATCH 0/6] rhashtable: Fix two bugs caused by multiple rehash preparation Herbert Xu
                         ` (2 preceding siblings ...)
  2015-03-15 10:43       ` [v1 PATCH 0/14] rhashtable: Kill shift/Key netlink namespace/Merge jhash Herbert Xu
@ 2015-03-15 10:43       ` Herbert Xu
  2015-03-16  2:23       ` David Miller
  4 siblings, 0 replies; 113+ messages in thread
From: Herbert Xu @ 2015-03-15 10:43 UTC (permalink / raw)
  To: David Miller; +Cc: tgraf, netdev

On Sun, Mar 15, 2015 at 09:10:04PM +1100, Herbert Xu wrote:
> Hi Dave:
> 
> While testing some new patches over the weekend I discovered a
> couple of bugs in the series that had just been merged.  These
> two patches fix them:

Doh, obviously the subject should have said 0/2, not 0/6.
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [v1 PATCH 1/14] rhashtable: Remove shift from bucket_table
  2015-03-15 10:43       ` [v1 PATCH 0/14] rhashtable: Kill shift/Key netlink namespace/Merge jhash Herbert Xu
@ 2015-03-15 10:44         ` Herbert Xu
  2015-03-17 10:51           ` David Laight
  2015-03-15 10:44         ` [v1 PATCH 2/14] rhashtable: Introduce max_size/min_size Herbert Xu
                           ` (19 subsequent siblings)
  20 siblings, 1 reply; 113+ messages in thread
From: Herbert Xu @ 2015-03-15 10:44 UTC (permalink / raw)
  To: David Miller, tgraf, netdev, Eric Dumazet

Keeping both size and shift is silly.  We only need one.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 include/linux/rhashtable.h |    2 --
 lib/rhashtable.c           |    5 ++---
 2 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 1695378..f16e856 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -51,7 +51,6 @@ struct rhash_head {
  * @size: Number of hash buckets
  * @rehash: Current bucket being rehashed
  * @hash_rnd: Random seed to fold into hash
- * @shift: Current size (1 << shift)
  * @locks_mask: Mask to apply before accessing locks[]
  * @locks: Array of spinlocks protecting individual buckets
  * @walkers: List of active walkers
@@ -63,7 +62,6 @@ struct bucket_table {
 	unsigned int		size;
 	unsigned int		rehash;
 	u32			hash_rnd;
-	u32			shift;
 	unsigned int		locks_mask;
 	spinlock_t		*locks;
 	struct list_head	walkers;
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index c523d3a..4a99c0c 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -162,7 +162,6 @@ static struct bucket_table *bucket_table_alloc(struct rhashtable *ht,
 		return NULL;
 
 	tbl->size = nbuckets;
-	tbl->shift = ilog2(nbuckets);
 
 	if (alloc_bucket_locks(ht, tbl) < 0) {
 		bucket_table_free(tbl);
@@ -189,7 +188,7 @@ static bool rht_grow_above_75(const struct rhashtable *ht,
 {
 	/* Expand table when exceeding 75% load */
 	return atomic_read(&ht->nelems) > (tbl->size / 4 * 3) &&
-	       (!ht->p.max_shift || tbl->shift < ht->p.max_shift);
+	       (!ht->p.max_shift || tbl->size < (1 << ht->p.max_shift));
 }
 
 /**
@@ -202,7 +201,7 @@ static bool rht_shrink_below_30(const struct rhashtable *ht,
 {
 	/* Shrink table beneath 30% load */
 	return atomic_read(&ht->nelems) < (tbl->size * 3 / 10) &&
-	       tbl->shift > ht->p.min_shift;
+	       tbl->size > (1 << ht->p.min_shift);
 }
 
 static int rhashtable_rehash_one(struct rhashtable *ht, unsigned old_hash)

^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [v1 PATCH 2/14] rhashtable: Introduce max_size/min_size
  2015-03-15 10:43       ` [v1 PATCH 0/14] rhashtable: Kill shift/Key netlink namespace/Merge jhash Herbert Xu
  2015-03-15 10:44         ` [v1 PATCH 1/14] rhashtable: Remove shift from bucket_table Herbert Xu
@ 2015-03-15 10:44         ` Herbert Xu
  2015-03-15 15:12           ` Sergei Shtylyov
  2015-03-15 10:44         ` [v1 PATCH 3/14] netlink: Use rhashtable max_size instead of max_shift Herbert Xu
                           ` (18 subsequent siblings)
  20 siblings, 1 reply; 113+ messages in thread
From: Herbert Xu @ 2015-03-15 10:44 UTC (permalink / raw)
  To: David Miller, tgraf, netdev, Eric Dumazet

This patch adds the parameters max_size and min_size, which are
meant to replace max_shift and min_shift.
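
For a user the conversion is mechanical; something along these lines,
where struct my_obj is made up for illustration (the real conversions
for netlink, tipc and test_rhashtable follow in patches 3-5):

	struct my_obj {
		u32 id;
		struct rhash_head node;
	};

	struct rhashtable_params params = {
		.head_offset = offsetof(struct my_obj, node),
		.key_offset  = offsetof(struct my_obj, id),
		.key_len     = sizeof(u32),
		.hashfn      = jhash,
		.max_size    = 65536,	/* instead of .max_shift = 16 */
		.min_size    = 256,	/* instead of .min_shift = 8 */
	};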

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 include/linux/rhashtable.h |    4 ++++
 lib/rhashtable.c           |   12 ++++++++----
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index f16e856..81267fe 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -85,6 +85,8 @@ struct rhashtable;
  * @head_offset: Offset of rhash_head in struct to be hashed
  * @max_shift: Maximum number of shifts while expanding
  * @min_shift: Minimum number of shifts while shrinking
+ * @max_size: Maximum size while expanding
+ * @min_size: Minimum size while shrinking
  * @nulls_base: Base value to generate nulls marker
  * @locks_mul: Number of bucket locks to allocate per cpu (default: 128)
  * @hashfn: Function to hash key
@@ -97,6 +99,8 @@ struct rhashtable_params {
 	size_t			head_offset;
 	size_t			max_shift;
 	size_t			min_shift;
+	unsigned int		max_size;
+	unsigned int		min_size;
 	u32			nulls_base;
 	size_t			locks_mul;
 	rht_hashfn_t		hashfn;
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 4a99c0c..5101e18 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -27,7 +27,7 @@
 #include <linux/err.h>
 
 #define HASH_DEFAULT_SIZE	64UL
-#define HASH_MIN_SIZE		4UL
+#define HASH_MIN_SIZE		4U
 #define BUCKET_LOCKS_PER_CPU   128UL
 
 /* Base bits plus 1 bit for nulls marker */
@@ -188,7 +188,8 @@ static bool rht_grow_above_75(const struct rhashtable *ht,
 {
 	/* Expand table when exceeding 75% load */
 	return atomic_read(&ht->nelems) > (tbl->size / 4 * 3) &&
-	       (!ht->p.max_shift || tbl->size < (1 << ht->p.max_shift));
+	       (!ht->p.max_shift || tbl->size < (1 << ht->p.max_shift)) &&
+	       (!ht->p.max_size || tbl->size < ht->p.max_size);
 }
 
 /**
@@ -201,7 +202,8 @@ static bool rht_shrink_below_30(const struct rhashtable *ht,
 {
 	/* Shrink table beneath 30% load */
 	return atomic_read(&ht->nelems) < (tbl->size * 3 / 10) &&
-	       tbl->size > (1 << ht->p.min_shift);
+	       tbl->size > (1 << ht->p.min_shift) &&
+	       tbl->size > ht->p.min_size;
 }
 
 static int rhashtable_rehash_one(struct rhashtable *ht, unsigned old_hash)
@@ -872,7 +874,8 @@ EXPORT_SYMBOL_GPL(rhashtable_walk_stop);
 static size_t rounded_hashtable_size(struct rhashtable_params *params)
 {
 	return max(roundup_pow_of_two(params->nelem_hint * 4 / 3),
-		   1UL << params->min_shift);
+		   max(1UL << params->min_shift,
+		       (unsigned long)params->min_size));
 }
 
 /**
@@ -934,6 +937,7 @@ int rhashtable_init(struct rhashtable *ht, struct rhashtable_params *params)
 
 	params->min_shift = max_t(size_t, params->min_shift,
 				  ilog2(HASH_MIN_SIZE));
+	params->min_size = max(params->min_size, HASH_MIN_SIZE);
 
 	if (params->nelem_hint)
 		size = rounded_hashtable_size(params);

^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [v1 PATCH 3/14] netlink: Use rhashtable max_size instead of max_shift
  2015-03-15 10:43       ` [v1 PATCH 0/14] rhashtable: Kill shift/Key netlink namespace/Merge jhash Herbert Xu
  2015-03-15 10:44         ` [v1 PATCH 1/14] rhashtable: Remove shift from bucket_table Herbert Xu
  2015-03-15 10:44         ` [v1 PATCH 2/14] rhashtable: Introduce max_size/min_size Herbert Xu
@ 2015-03-15 10:44         ` Herbert Xu
  2015-03-15 10:44         ` [v1 PATCH 4/14] tipc: " Herbert Xu
                           ` (17 subsequent siblings)
  20 siblings, 0 replies; 113+ messages in thread
From: Herbert Xu @ 2015-03-15 10:44 UTC (permalink / raw)
  To: David Miller, tgraf, netdev, Eric Dumazet

This patch converts netlink to use rhashtable max_size instead
of the obsolete max_shift.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 net/netlink/af_netlink.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 6b0f219..d97aed6 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -3123,7 +3123,7 @@ static int __init netlink_proto_init(void)
 		.key_offset = offsetof(struct netlink_sock, portid),
 		.key_len = sizeof(u32), /* portid */
 		.hashfn = jhash,
-		.max_shift = 16, /* 64K */
+		.max_size = 65536,
 	};
 
 	if (err != 0)

^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [v1 PATCH 4/14] tipc: Use rhashtable max_size instead of max_shift
  2015-03-15 10:43       ` [v1 PATCH 0/14] rhashtable: Kill shift/Key netlink namespace/Merge jhash Herbert Xu
                           ` (2 preceding siblings ...)
  2015-03-15 10:44         ` [v1 PATCH 3/14] netlink: Use rhashtable max_size instead of max_shift Herbert Xu
@ 2015-03-15 10:44         ` Herbert Xu
  2015-03-15 15:13           ` Sergei Shtylyov
  2015-03-15 10:44         ` [v1 PATCH 5/14] test_rhashtable: " Herbert Xu
                           ` (16 subsequent siblings)
  20 siblings, 1 reply; 113+ messages in thread
From: Herbert Xu @ 2015-03-15 10:44 UTC (permalink / raw)
  To: David Miller, tgraf, netdev, Eric Dumazet

This patch converts tipc to use rhashtable max_size instead of
the obsolete max_shift.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 net/tipc/socket.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 934947f..64eb669 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -2361,8 +2361,8 @@ int tipc_sk_rht_init(struct net *net)
 		.key_offset = offsetof(struct tipc_sock, portid),
 		.key_len = sizeof(u32), /* portid */
 		.hashfn = jhash,
-		.max_shift = 20, /* 1M */
-		.min_shift = 8,  /* 256 */
+		.max_size = 1048576,
+		.min_size = 256,
 	};
 
 	return rhashtable_init(&tn->sk_rht, &rht_params);

^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [v1 PATCH 5/14] test_rhashtable: Use rhashtable max_size instead of max_shift
  2015-03-15 10:43       ` [v1 PATCH 0/14] rhashtable: Kill shift/Key netlink namespace/Merge jhash Herbert Xu
                           ` (3 preceding siblings ...)
  2015-03-15 10:44         ` [v1 PATCH 4/14] tipc: " Herbert Xu
@ 2015-03-15 10:44         ` Herbert Xu
  2015-03-16  3:50           ` David Miller
  2015-03-15 10:44         ` [v1 PATCH 6/14] rhashtable: Remove max_shift and min_shift Herbert Xu
                           ` (15 subsequent siblings)
  20 siblings, 1 reply; 113+ messages in thread
From: Herbert Xu @ 2015-03-15 10:44 UTC (permalink / raw)
  To: David Miller, tgraf, netdev, Eric Dumazet

This patch converts test_rhashtable to use rhashtable max_size
instead of the obsolete max_shift.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 lib/test_rhashtable.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index 16974fd..2bc403d 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -201,7 +201,7 @@ static int __init test_rht_init(void)
 		.key_offset = offsetof(struct test_obj, value),
 		.key_len = sizeof(int),
 		.hashfn = jhash,
-		.max_shift = 1, /* we expand/shrink manually here */
+		.max_size = 2, /* we expand/shrink manually here */
 		.nulls_base = (3U << RHT_BASE_SHIFT),
 	};
 	int err;

^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [v1 PATCH 6/14] rhashtable: Remove max_shift and min_shift
  2015-03-15 10:43       ` [v1 PATCH 0/14] rhashtable: Kill shift/Key netlink namespace/Merge jhash Herbert Xu
                           ` (4 preceding siblings ...)
  2015-03-15 10:44         ` [v1 PATCH 5/14] test_rhashtable: " Herbert Xu
@ 2015-03-15 10:44         ` Herbert Xu
  2015-03-15 10:44         ` [v1 PATCH 7/14] netfilter: Use rhashtable_lookup instead of lookup_compare Herbert Xu
                           ` (14 subsequent siblings)
  20 siblings, 0 replies; 113+ messages in thread
From: Herbert Xu @ 2015-03-15 10:44 UTC (permalink / raw)
  To: David Miller, tgraf, netdev, Eric Dumazet

Now that nobody uses max_shift and min_shift, we can safely remove
them.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 include/linux/rhashtable.h |    4 ----
 lib/rhashtable.c           |    7 +------
 2 files changed, 1 insertion(+), 10 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 81267fe..99425f2 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -83,8 +83,6 @@ struct rhashtable;
  * @key_len: Length of key
  * @key_offset: Offset of key in struct to be hashed
  * @head_offset: Offset of rhash_head in struct to be hashed
- * @max_shift: Maximum number of shifts while expanding
- * @min_shift: Minimum number of shifts while shrinking
  * @max_size: Maximum size while expanding
  * @min_size: Minimum size while shrinking
  * @nulls_base: Base value to generate nulls marker
@@ -97,8 +95,6 @@ struct rhashtable_params {
 	size_t			key_len;
 	size_t			key_offset;
 	size_t			head_offset;
-	size_t			max_shift;
-	size_t			min_shift;
 	unsigned int		max_size;
 	unsigned int		min_size;
 	u32			nulls_base;
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 5101e18..a74ba72 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -188,7 +188,6 @@ static bool rht_grow_above_75(const struct rhashtable *ht,
 {
 	/* Expand table when exceeding 75% load */
 	return atomic_read(&ht->nelems) > (tbl->size / 4 * 3) &&
-	       (!ht->p.max_shift || tbl->size < (1 << ht->p.max_shift)) &&
 	       (!ht->p.max_size || tbl->size < ht->p.max_size);
 }
 
@@ -202,7 +201,6 @@ static bool rht_shrink_below_30(const struct rhashtable *ht,
 {
 	/* Shrink table beneath 30% load */
 	return atomic_read(&ht->nelems) < (tbl->size * 3 / 10) &&
-	       tbl->size > (1 << ht->p.min_shift) &&
 	       tbl->size > ht->p.min_size;
 }
 
@@ -874,8 +872,7 @@ EXPORT_SYMBOL_GPL(rhashtable_walk_stop);
 static size_t rounded_hashtable_size(struct rhashtable_params *params)
 {
 	return max(roundup_pow_of_two(params->nelem_hint * 4 / 3),
-		   max(1UL << params->min_shift,
-		       (unsigned long)params->min_size));
+		   (unsigned long)params->min_size);
 }
 
 /**
@@ -935,8 +932,6 @@ int rhashtable_init(struct rhashtable *ht, struct rhashtable_params *params)
 	if (params->nulls_base && params->nulls_base < (1U << RHT_BASE_SHIFT))
 		return -EINVAL;
 
-	params->min_shift = max_t(size_t, params->min_shift,
-				  ilog2(HASH_MIN_SIZE));
 	params->min_size = max(params->min_size, HASH_MIN_SIZE);
 
 	if (params->nelem_hint)

^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [v1 PATCH 7/14] netfilter: Use rhashtable_lookup instead of lookup_compare
  2015-03-15 10:43       ` [v1 PATCH 0/14] rhashtable: Kill shift/Key netlink namespace/Merge jhash Herbert Xu
                           ` (5 preceding siblings ...)
  2015-03-15 10:44         ` [v1 PATCH 6/14] rhashtable: Remove max_shift and min_shift Herbert Xu
@ 2015-03-15 10:44         ` Herbert Xu
  2015-03-16  8:28           ` Thomas Graf
  2015-03-15 10:44         ` [v1 PATCH 8/14] rhashtable: Fix support of objects with no accessible keys Herbert Xu
                           ` (13 subsequent siblings)
  20 siblings, 1 reply; 113+ messages in thread
From: Herbert Xu @ 2015-03-15 10:44 UTC (permalink / raw)
  To: David Miller, tgraf, netdev, Eric Dumazet

The use of rhashtable_lookup_compare in nft_hash is gratuitous
since the comparison function is just doing memcmp.  Furthermore,
there is cruft embedded in the comparison function that should
instead be moved into the caller of the lookup.

This patch moves that cruft over and replaces the call to
rhashtable_lookup_compare with rhashtable_lookup.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 net/netfilter/nft_hash.c |   40 ++++++++++------------------------------
 1 file changed, 10 insertions(+), 30 deletions(-)

diff --git a/net/netfilter/nft_hash.c b/net/netfilter/nft_hash.c
index c82df0a..1fcae5e 100644
--- a/net/netfilter/nft_hash.c
+++ b/net/netfilter/nft_hash.c
@@ -89,41 +89,21 @@ static void nft_hash_remove(const struct nft_set *set,
 	kfree(elem->cookie);
 }
 
-struct nft_compare_arg {
-	const struct nft_set *set;
-	struct nft_set_elem *elem;
-};
-
-static bool nft_hash_compare(void *ptr, void *arg)
-{
-	struct nft_hash_elem *he = ptr;
-	struct nft_compare_arg *x = arg;
-
-	if (!nft_data_cmp(&he->key, &x->elem->key, x->set->klen)) {
-		x->elem->cookie = he;
-		x->elem->flags = 0;
-		if (x->set->flags & NFT_SET_MAP)
-			nft_data_copy(&x->elem->data, he->data);
-
-		return true;
-	}
-
-	return false;
-}
-
 static int nft_hash_get(const struct nft_set *set, struct nft_set_elem *elem)
 {
 	struct rhashtable *priv = nft_set_priv(set);
-	struct nft_compare_arg arg = {
-		.set = set,
-		.elem = elem,
-	};
+	struct nft_hash_elem *he;
 
-	if (rhashtable_lookup_compare(priv, &elem->key,
-				      &nft_hash_compare, &arg))
-		return 0;
+	he = rhashtable_lookup(priv, &elem->key);
+	if (!he)
+		return -ENOENT;
 
-	return -ENOENT;
+	elem->cookie = he;
+	elem->flags = 0;
+	if (set->flags & NFT_SET_MAP)
+		nft_data_copy(&elem->data, he->data);
+
+	return 0;
 }
 
 static void nft_hash_walk(const struct nft_ctx *ctx, const struct nft_set *set,

^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [v1 PATCH 8/14] rhashtable: Fix support of objects with no accessible keys
  2015-03-15 10:43       ` [v1 PATCH 0/14] rhashtable: Kill shift/Key netlink namespace/Merge jhash Herbert Xu
                           ` (6 preceding siblings ...)
  2015-03-15 10:44         ` [v1 PATCH 7/14] netfilter: Use rhashtable_lookup instead of lookup_compare Herbert Xu
@ 2015-03-15 10:44         ` Herbert Xu
  2015-03-15 10:44         ` [v1 PATCH 9/14] netlink: Move namespace into hash key Herbert Xu
                           ` (12 subsequent siblings)
  20 siblings, 0 replies; 113+ messages in thread
From: Herbert Xu @ 2015-03-15 10:44 UTC (permalink / raw)
  To: David Miller, tgraf, netdev, Eric Dumazet

The current support for so-called variable-length keys (a misnomer,
actually) is broken.  I found out about this after trying to
actually use it for netlink by incorporating the namespace into
the key.

The crux of the matter is that all the code that is needed to
make it work (such as netlink_lookup_insert_compare) only works
when key_len != 0.  However, the object hash function is only
used when key_len == 0.

In fact key_len should always be non-zero since even for these
objects with inaccessible keys, we need to use a key to perform
a table lookup.

This patch fixes this by directly using the object hash function
as the indicator of whether the key is accessible or not.  It
also adds a new function obj_cmpfn to compare a key against an
object.  This means that the caller no longer needs to supply
explicit compare functions.

All this is done in a backwards compatible manner so we can rip
out the existing users safely after this patch.
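
To make the intended usage concrete, here is a sketch of a user with
a "hidden" key after this patch.  All my_* names are invented for
illustration; netlink is converted to exactly this pattern in the
next patch:

	#include <linux/jhash.h>
	#include <linux/rhashtable.h>

	struct my_obj {
		struct rhash_head node;
		u32 id;
		char name[16];
		u32 owner;	/* not adjacent to id, so no flat key */
	};

	/* Key assembled on demand for lookups and insertions. */
	struct my_key {
		u32 id;
		u32 owner;
	};

	/* Must hash exactly the bytes that hashfn sees for a key. */
	static u32 my_obj_hashfn(const void *data, u32 seed)
	{
		const struct my_obj *obj = data;
		struct my_key key = { .id = obj->id, .owner = obj->owner };

		return jhash(&key, sizeof(key), seed);
	}

	/* Returns true on a match. */
	static bool my_obj_cmpfn(const void *key, const void *ptr)
	{
		const struct my_key *k = key;
		const struct my_obj *obj = ptr;

		return obj->id == k->id && obj->owner == k->owner;
	}

	static struct rhashtable_params my_params = {
		.head_offset = offsetof(struct my_obj, node),
		.key_len     = sizeof(struct my_key),
		.hashfn      = jhash,
		.obj_hashfn  = my_obj_hashfn,
		.obj_cmpfn   = my_obj_cmpfn,
	};

Insertion then goes through rhashtable_lookup_insert_key(&ht, &key,
&obj->node) and lookups through rhashtable_lookup(&ht, &key), both
taking a struct my_key built by the caller.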

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 include/linux/rhashtable.h |    5 +
 lib/rhashtable.c           |  158 +++++++++++++++++++++++++++++++++++++++------
 2 files changed, 143 insertions(+), 20 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 99425f2..de7ac49 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -74,6 +74,7 @@ struct bucket_table {
 
 typedef u32 (*rht_hashfn_t)(const void *data, u32 len, u32 seed);
 typedef u32 (*rht_obj_hashfn_t)(const void *data, u32 seed);
+typedef bool (*rht_obj_cmpfn_t)(const void *key, const void *obj);
 
 struct rhashtable;
 
@@ -89,6 +90,7 @@ struct rhashtable;
  * @locks_mul: Number of bucket locks to allocate per cpu (default: 128)
  * @hashfn: Function to hash key
  * @obj_hashfn: Function to hash object
+ * @obj_cmpfn: Function to compare key with object
  */
 struct rhashtable_params {
 	size_t			nelem_hint;
@@ -101,6 +103,7 @@ struct rhashtable_params {
 	size_t			locks_mul;
 	rht_hashfn_t		hashfn;
 	rht_obj_hashfn_t	obj_hashfn;
+	rht_obj_cmpfn_t		obj_cmpfn;
 };
 
 /**
@@ -198,6 +201,8 @@ bool rhashtable_lookup_compare_insert(struct rhashtable *ht,
 				      struct rhash_head *obj,
 				      bool (*compare)(void *, void *),
 				      void *arg);
+int rhashtable_lookup_insert_key(struct rhashtable *ht,
+				 const void *key, struct rhash_head *obj);
 
 int rhashtable_walk_init(struct rhashtable *ht, struct rhashtable_iter *iter);
 void rhashtable_walk_exit(struct rhashtable_iter *iter);
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index a74ba72..8b3cae3 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -74,7 +74,7 @@ static u32 head_hashfn(struct rhashtable *ht,
 {
 	const char *ptr = rht_obj(ht, he);
 
-	return likely(ht->p.key_len) ?
+	return likely(!ht->p.obj_hashfn) ?
 	       key_hashfn(ht, tbl, ptr + ht->p.key_offset) :
 	       rht_bucket_index(tbl, ht->p.obj_hashfn(ptr, tbl->hash_rnd));
 }
@@ -376,6 +376,69 @@ unlock:
 	mutex_unlock(&ht->mutex);
 }
 
+static int __rhashtable_lookup_insert_key(struct rhashtable *ht,
+					  const void *key,
+					  struct rhash_head *obj)
+{
+	struct bucket_table *tbl, *old_tbl;
+	struct rhash_head *head;
+	bool no_resize_running;
+	unsigned hash;
+	int err = -EEXIST;
+
+	rcu_read_lock();
+
+	old_tbl = rht_dereference_rcu(ht->tbl, ht);
+	hash = head_hashfn(ht, old_tbl, obj);
+
+	spin_lock_bh(bucket_lock(old_tbl, hash));
+
+	/* Because we have already taken the bucket lock in old_tbl,
+	 * if we find that future_tbl is not yet visible then that
+	 * guarantees all other insertions of the same entry will
+	 * also grab the bucket lock in old_tbl because until the
+	 * rehash completes ht->tbl won't be changed.
+	 */
+	tbl = rht_dereference_rcu(old_tbl->future_tbl, ht) ?: old_tbl;
+	if (tbl != old_tbl) {
+		hash = head_hashfn(ht, tbl, obj);
+		spin_lock_nested(bucket_lock(tbl, hash), SINGLE_DEPTH_NESTING);
+	}
+
+	if (key && rhashtable_lookup(ht, key))
+		goto exit;
+
+	err = 0;
+
+	no_resize_running = tbl == old_tbl;
+
+	head = rht_dereference_bucket(tbl->buckets[hash], tbl, hash);
+
+	if (rht_is_a_nulls(head))
+		INIT_RHT_NULLS_HEAD(obj->next, ht, hash);
+	else
+		RCU_INIT_POINTER(obj->next, head);
+
+	rcu_assign_pointer(tbl->buckets[hash], obj);
+
+	atomic_inc(&ht->nelems);
+	if (no_resize_running && rht_grow_above_75(ht, tbl))
+		schedule_work(&ht->run_work);
+
+exit:
+	if (tbl != old_tbl) {
+		hash = head_hashfn(ht, tbl, obj);
+		spin_unlock(bucket_lock(tbl, hash));
+	}
+
+	hash = head_hashfn(ht, old_tbl, obj);
+	spin_unlock_bh(bucket_lock(old_tbl, hash));
+
+	rcu_read_unlock();
+
+	return err;
+}
+
 static bool __rhashtable_insert(struct rhashtable *ht, struct rhash_head *obj,
 				bool (*compare)(void *, void *), void *arg)
 {
@@ -457,7 +520,9 @@ exit:
  */
 void rhashtable_insert(struct rhashtable *ht, struct rhash_head *obj)
 {
-	__rhashtable_insert(ht, obj, NULL, NULL);
+	BUG_ON(ht->p.obj_hashfn);
+
+	__rhashtable_lookup_insert_key(ht, NULL, obj);
 }
 EXPORT_SYMBOL_GPL(rhashtable_insert);
 
@@ -543,9 +608,9 @@ struct rhashtable_compare_arg {
 	const void *key;
 };
 
-static bool rhashtable_compare(void *ptr, void *arg)
+static bool rhashtable_compare(const void *arg, const void *ptr)
 {
-	struct rhashtable_compare_arg *x = arg;
+	const struct rhashtable_compare_arg *x = arg;
 	struct rhashtable *ht = x->ht;
 
 	return !memcmp(ptr + ht->p.key_offset, x->key, ht->p.key_len);
@@ -559,9 +624,6 @@ static bool rhashtable_compare(void *ptr, void *arg)
  * Computes the hash value for the key and traverses the bucket chain looking
  * for a entry with an identical key. The first matching entry is returned.
  *
- * This lookup function may only be used for fixed key hash table (key_len
- * parameter set). It will BUG() if used inappropriately.
- *
  * Lookups may occur in parallel with hashtable mutations and resizing.
  */
 void *rhashtable_lookup(struct rhashtable *ht, const void *key)
@@ -570,10 +632,41 @@ void *rhashtable_lookup(struct rhashtable *ht, const void *key)
 		.ht = ht,
 		.key = key,
 	};
+	const struct bucket_table *tbl;
+	rht_obj_cmpfn_t compare;
+	const void *compare_key;
+	struct rhash_head *he;
+	u32 hash;
+
+	if (ht->p.obj_cmpfn) {
+		compare = ht->p.obj_cmpfn;
+		compare_key = key;
+	} else {
+		compare = rhashtable_compare;
+		compare_key = &arg;
+	}
+
+	rcu_read_lock();
+
+	tbl = rht_dereference_rcu(ht->tbl, ht);
+restart:
+	hash = key_hashfn(ht, tbl, key);
+	rht_for_each_rcu(he, tbl, hash) {
+		if (!compare(compare_key, rht_obj(ht, he)))
+			continue;
+		rcu_read_unlock();
+		return rht_obj(ht, he);
+	}
 
-	BUG_ON(!ht->p.key_len);
+	/* Ensure we see any new tables. */
+	smp_rmb();
+
+	tbl = rht_dereference_rcu(tbl->future_tbl, ht);
+	if (unlikely(tbl))
+		goto restart;
+	rcu_read_unlock();
 
-	return rhashtable_lookup_compare(ht, key, &rhashtable_compare, &arg);
+	return NULL;
 }
 EXPORT_SYMBOL_GPL(rhashtable_lookup);
 
@@ -644,15 +737,12 @@ EXPORT_SYMBOL_GPL(rhashtable_lookup_compare);
  */
 bool rhashtable_lookup_insert(struct rhashtable *ht, struct rhash_head *obj)
 {
-	struct rhashtable_compare_arg arg = {
-		.ht = ht,
-		.key = rht_obj(ht, obj) + ht->p.key_offset,
-	};
+	const char *key = rht_obj(ht, obj);
 
-	BUG_ON(!ht->p.key_len);
+	BUG_ON(ht->p.obj_hashfn);
+
+	return !__rhashtable_lookup_insert_key(ht, key + ht->p.key_offset, obj);
 
-	return rhashtable_lookup_compare_insert(ht, obj, &rhashtable_compare,
-						&arg);
 }
 EXPORT_SYMBOL_GPL(rhashtable_lookup_insert);
 
@@ -681,13 +771,41 @@ bool rhashtable_lookup_compare_insert(struct rhashtable *ht,
 				      bool (*compare)(void *, void *),
 				      void *arg)
 {
-	BUG_ON(!ht->p.key_len);
-
 	return __rhashtable_insert(ht, obj, compare, arg);
 }
 EXPORT_SYMBOL_GPL(rhashtable_lookup_compare_insert);
 
 /**
+ * rhashtable_lookup_insert_key - search and insert object to hash table
+ *                                with explicit key
+ * @ht:		hash table
+ * @key:	key 
+ * @obj:	pointer to hash head inside object
+ *
+ * Locks down the bucket chain in both the old and new table if a resize
+ * is in progress to ensure that writers can't remove from the old table
+ * and can't insert to the new table during the atomic operation of search
+ * and insertion. Searches for duplicates in both the old and new table if
+ * a resize is in progress.
+ *
+ * Lookups may occur in parallel with hashtable mutations and resizing.
+ *
+ * Will trigger an automatic deferred table resizing if the size grows
+ * beyond the watermark indicated by grow_decision() which can be passed
+ * to rhashtable_init().
+ *
+ * Returns zero on success.
+ */
+int rhashtable_lookup_insert_key(struct rhashtable *ht,
+				 const void *key, struct rhash_head *obj)
+{
+	BUG_ON(!ht->p.obj_hashfn || !key);
+
+	return __rhashtable_lookup_insert_key(ht, key, obj);
+}
+EXPORT_SYMBOL_GPL(rhashtable_lookup_insert_key);
+
+/**
  * rhashtable_walk_init - Initialise an iterator
  * @ht:		Table to walk over
  * @iter:	Hash table Iterator
@@ -925,8 +1043,8 @@ int rhashtable_init(struct rhashtable *ht, struct rhashtable_params *params)
 
 	size = HASH_DEFAULT_SIZE;
 
-	if ((params->key_len && !params->hashfn) ||
-	    (!params->key_len && !params->obj_hashfn))
+	if (!params->hashfn || !params->key_len ||
+	    (params->obj_hashfn && !params->obj_cmpfn))
 		return -EINVAL;
 
 	if (params->nulls_base && params->nulls_base < (1U << RHT_BASE_SHIFT))

^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [v1 PATCH 9/14] netlink: Move namespace into hash key
  2015-03-15 10:43       ` [v1 PATCH 0/14] rhashtable: Kill shift/Key netlink namespace/Merge jhash Herbert Xu
                           ` (7 preceding siblings ...)
  2015-03-15 10:44         ` [v1 PATCH 8/14] rhashtable: Fix support of objects with no accessible keys Herbert Xu
@ 2015-03-15 10:44         ` Herbert Xu
  2015-03-15 10:44         ` [v1 PATCH 10/14] rhashtable: Rip out obsolete compare interface Herbert Xu
                           ` (11 subsequent siblings)
  20 siblings, 0 replies; 113+ messages in thread
From: Herbert Xu @ 2015-03-15 10:44 UTC (permalink / raw)
  To: David Miller, tgraf, netdev, Eric Dumazet

Currently the namespace is a de facto key because it has to match
before we find an object in the hash table.  However, it isn't in
the hash value, so all objects from different namespaces with the
same port ID hash to the same bucket.

This is bad as the number of namespaces is unbounded.

This patch fixes this by using the namespace when doing the hash.

Because the namespace field doesn't lie next to the portid field
in the netlink socket, this patch switches over to the rhashtable
interface without a fixed key.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 net/netlink/af_netlink.c |   66 +++++++++++++++++++++++++++++------------------
 1 file changed, 41 insertions(+), 25 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index d97aed6..ca048f3 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -970,41 +970,46 @@ netlink_unlock_table(void)
 
 struct netlink_compare_arg
 {
-	struct net *net;
+	possible_net_t pnet;
 	u32 portid;
+	char trailer[];
 };
 
-static bool netlink_compare(void *ptr, void *arg)
+#define netlink_compare_arg_len offsetof(struct netlink_compare_arg, trailer)
+
+static bool netlink_compare(const void *arg, const void *ptr)
 {
-	struct netlink_compare_arg *x = arg;
-	struct sock *sk = ptr;
+	const struct netlink_compare_arg *x = arg;
+	const struct netlink_sock *nlk = ptr;
 
-	return nlk_sk(sk)->portid == x->portid &&
-	       net_eq(sock_net(sk), x->net);
+	return nlk->portid == x->portid &&
+	       net_eq(sock_net(&nlk->sk), read_pnet(&x->pnet));
+}
+
+static void netlink_compare_arg_init(struct netlink_compare_arg *arg,
+				     struct net *net, u32 portid)
+{
+	memset(arg, 0, sizeof(*arg));
+	write_pnet(&arg->pnet, net);
+	arg->portid = portid;
 }
 
 static struct sock *__netlink_lookup(struct netlink_table *table, u32 portid,
 				     struct net *net)
 {
-	struct netlink_compare_arg arg = {
-		.net = net,
-		.portid = portid,
-	};
+	struct netlink_compare_arg arg;
 
-	return rhashtable_lookup_compare(&table->hash, &portid,
-					 &netlink_compare, &arg);
+	netlink_compare_arg_init(&arg, net, portid);
+	return rhashtable_lookup(&table->hash, &arg);
 }
 
-static bool __netlink_insert(struct netlink_table *table, struct sock *sk)
+static int __netlink_insert(struct netlink_table *table, struct sock *sk)
 {
-	struct netlink_compare_arg arg = {
-		.net = sock_net(sk),
-		.portid = nlk_sk(sk)->portid,
-	};
+	struct netlink_compare_arg arg;
 
-	return rhashtable_lookup_compare_insert(&table->hash,
-						&nlk_sk(sk)->node,
-						&netlink_compare, &arg);
+	netlink_compare_arg_init(&arg, sock_net(sk), nlk_sk(sk)->portid);
+	return rhashtable_lookup_insert_key(&table->hash, &arg,
+					    &nlk_sk(sk)->node);
 }
 
 static struct sock *netlink_lookup(struct net *net, int protocol, u32 portid)
@@ -1066,9 +1071,10 @@ static int netlink_insert(struct sock *sk, u32 portid)
 	nlk_sk(sk)->portid = portid;
 	sock_hold(sk);
 
-	err = 0;
-	if (!__netlink_insert(table, sk)) {
-		err = -EADDRINUSE;
+	err = __netlink_insert(table, sk);
+	if (err) {
+		if (err == -EEXIST)
+			err = -EADDRINUSE;
 		sock_put(sk);
 	}
 
@@ -3114,15 +3120,25 @@ static struct pernet_operations __net_initdata netlink_net_ops = {
 	.exit = netlink_net_exit,
 };
 
+static u32 netlink_hash(const void *data, u32 seed)
+{
+	const struct netlink_sock *nlk = data;
+	struct netlink_compare_arg arg;
+
+	netlink_compare_arg_init(&arg, sock_net(&nlk->sk), nlk->portid);
+	return jhash(&arg, netlink_compare_arg_len, seed);
+}
+
 static int __init netlink_proto_init(void)
 {
 	int i;
 	int err = proto_register(&netlink_proto, 0);
 	struct rhashtable_params ht_params = {
 		.head_offset = offsetof(struct netlink_sock, node),
-		.key_offset = offsetof(struct netlink_sock, portid),
-		.key_len = sizeof(u32), /* portid */
+		.key_len = netlink_compare_arg_len,
 		.hashfn = jhash,
+		.obj_hashfn = netlink_hash,
+		.obj_cmpfn = netlink_compare,
 		.max_size = 65536,
 	};
 

^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [v1 PATCH 10/14] rhashtable: Rip out obsolete compare interface
  2015-03-15 10:43       ` [v1 PATCH 0/14] rhashtable: Kill shift/Key netlink namespace/Merge jhash Herbert Xu
                           ` (8 preceding siblings ...)
  2015-03-15 10:44         ` [v1 PATCH 9/14] netlink: Move namespace into hash key Herbert Xu
@ 2015-03-15 10:44         ` Herbert Xu
  2015-03-16  9:35           ` Thomas Graf
  2015-03-15 10:44         ` [v1 PATCH 11/14] rhashtable: Allow hashfn to be unset Herbert Xu
                           ` (10 subsequent siblings)
  20 siblings, 1 reply; 113+ messages in thread
From: Herbert Xu @ 2015-03-15 10:44 UTC (permalink / raw)
  To: David Miller, tgraf, netdev, Eric Dumazet

Now that the only user of rhashtable_lookup_compare_insert and
rhashtable_lookup_compare (i.e., netlink) has switched over to
the new obj_hashfn based interface, we can rip them out safely.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 include/linux/rhashtable.h |    6 -
 lib/rhashtable.c           |  143 +--------------------------------------------
 2 files changed, 4 insertions(+), 145 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index de7ac49..00fe294 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -193,14 +193,8 @@ int rhashtable_expand(struct rhashtable *ht);
 int rhashtable_shrink(struct rhashtable *ht);
 
 void *rhashtable_lookup(struct rhashtable *ht, const void *key);
-void *rhashtable_lookup_compare(struct rhashtable *ht, const void *key,
-				bool (*compare)(void *, void *), void *arg);
 
 bool rhashtable_lookup_insert(struct rhashtable *ht, struct rhash_head *obj);
-bool rhashtable_lookup_compare_insert(struct rhashtable *ht,
-				      struct rhash_head *obj,
-				      bool (*compare)(void *, void *),
-				      void *arg);
 int rhashtable_lookup_insert_key(struct rhashtable *ht,
 				 const void *key, struct rhash_head *obj);
 
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 8b3cae3..e33e7b9 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -1,13 +1,16 @@
 /*
  * Resizable, Scalable, Concurrent Hash Table
  *
+ * Copyright (c) 2015 Herbert Xu <herbert@gondor.apana.org.au>
  * Copyright (c) 2014-2015 Thomas Graf <tgraf@suug.ch>
  * Copyright (c) 2008-2014 Patrick McHardy <kaber@trash.net>
  *
  * Based on the following paper:
  * https://www.usenix.org/legacy/event/atc11/tech/final_files/Triplett.pdf
  *
- * Code partially derived from nft_hash
+ * Code originally partially derived from nft_hash
+ * Rewritten with rehash code from br_multicast plus single list
+ * pointer as suggested by Josh Triplett
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License version 2 as
@@ -439,70 +442,6 @@ exit:
 	return err;
 }
 
-static bool __rhashtable_insert(struct rhashtable *ht, struct rhash_head *obj,
-				bool (*compare)(void *, void *), void *arg)
-{
-	struct bucket_table *tbl, *old_tbl;
-	struct rhash_head *head;
-	bool no_resize_running;
-	unsigned hash;
-	bool success = true;
-
-	rcu_read_lock();
-
-	old_tbl = rht_dereference_rcu(ht->tbl, ht);
-	hash = head_hashfn(ht, old_tbl, obj);
-
-	spin_lock_bh(bucket_lock(old_tbl, hash));
-
-	/* Because we have already taken the bucket lock in old_tbl,
-	 * if we find that future_tbl is not yet visible then that
-	 * guarantees all other insertions of the same entry will
-	 * also grab the bucket lock in old_tbl because until the
-	 * rehash completes ht->tbl won't be changed.
-	 */
-	tbl = rht_dereference_rcu(old_tbl->future_tbl, ht) ?: old_tbl;
-	if (tbl != old_tbl) {
-		hash = head_hashfn(ht, tbl, obj);
-		spin_lock_nested(bucket_lock(tbl, hash), SINGLE_DEPTH_NESTING);
-	}
-
-	if (compare &&
-	    rhashtable_lookup_compare(ht, rht_obj(ht, obj) + ht->p.key_offset,
-				      compare, arg)) {
-		success = false;
-		goto exit;
-	}
-
-	no_resize_running = tbl == old_tbl;
-
-	head = rht_dereference_bucket(tbl->buckets[hash], tbl, hash);
-
-	if (rht_is_a_nulls(head))
-		INIT_RHT_NULLS_HEAD(obj->next, ht, hash);
-	else
-		RCU_INIT_POINTER(obj->next, head);
-
-	rcu_assign_pointer(tbl->buckets[hash], obj);
-
-	atomic_inc(&ht->nelems);
-	if (no_resize_running && rht_grow_above_75(ht, tbl))
-		schedule_work(&ht->run_work);
-
-exit:
-	if (tbl != old_tbl) {
-		hash = head_hashfn(ht, tbl, obj);
-		spin_unlock(bucket_lock(tbl, hash));
-	}
-
-	hash = head_hashfn(ht, old_tbl, obj);
-	spin_unlock_bh(bucket_lock(old_tbl, hash));
-
-	rcu_read_unlock();
-
-	return success;
-}
-
 /**
  * rhashtable_insert - insert object into hash table
  * @ht:		hash table
@@ -671,51 +610,6 @@ restart:
 EXPORT_SYMBOL_GPL(rhashtable_lookup);
 
 /**
- * rhashtable_lookup_compare - search hash table with compare function
- * @ht:		hash table
- * @key:	the pointer to the key
- * @compare:	compare function, must return true on match
- * @arg:	argument passed on to compare function
- *
- * Traverses the bucket chain behind the provided hash value and calls the
- * specified compare function for each entry.
- *
- * Lookups may occur in parallel with hashtable mutations and resizing.
- *
- * Returns the first entry on which the compare function returned true.
- */
-void *rhashtable_lookup_compare(struct rhashtable *ht, const void *key,
-				bool (*compare)(void *, void *), void *arg)
-{
-	const struct bucket_table *tbl;
-	struct rhash_head *he;
-	u32 hash;
-
-	rcu_read_lock();
-
-	tbl = rht_dereference_rcu(ht->tbl, ht);
-restart:
-	hash = key_hashfn(ht, tbl, key);
-	rht_for_each_rcu(he, tbl, hash) {
-		if (!compare(rht_obj(ht, he), arg))
-			continue;
-		rcu_read_unlock();
-		return rht_obj(ht, he);
-	}
-
-	/* Ensure we see any new tables. */
-	smp_rmb();
-
-	tbl = rht_dereference_rcu(tbl->future_tbl, ht);
-	if (unlikely(tbl))
-		goto restart;
-	rcu_read_unlock();
-
-	return NULL;
-}
-EXPORT_SYMBOL_GPL(rhashtable_lookup_compare);
-
-/**
  * rhashtable_lookup_insert - lookup and insert object into hash table
  * @ht:		hash table
  * @obj:	pointer to hash head inside object
@@ -747,35 +641,6 @@ bool rhashtable_lookup_insert(struct rhashtable *ht, struct rhash_head *obj)
 EXPORT_SYMBOL_GPL(rhashtable_lookup_insert);
 
 /**
- * rhashtable_lookup_compare_insert - search and insert object to hash table
- *                                    with compare function
- * @ht:		hash table
- * @obj:	pointer to hash head inside object
- * @compare:	compare function, must return true on match
- * @arg:	argument passed on to compare function
- *
- * Locks down the bucket chain in both the old and new table if a resize
- * is in progress to ensure that writers can't remove from the old table
- * and can't insert to the new table during the atomic operation of search
- * and insertion. Searches for duplicates in both the old and new table if
- * a resize is in progress.
- *
- * Lookups may occur in parallel with hashtable mutations and resizing.
- *
- * Will trigger an automatic deferred table resizing if the size grows
- * beyond the watermark indicated by grow_decision() which can be passed
- * to rhashtable_init().
- */
-bool rhashtable_lookup_compare_insert(struct rhashtable *ht,
-				      struct rhash_head *obj,
-				      bool (*compare)(void *, void *),
-				      void *arg)
-{
-	return __rhashtable_insert(ht, obj, compare, arg);
-}
-EXPORT_SYMBOL_GPL(rhashtable_lookup_compare_insert);
-
-/**
  * rhashtable_lookup_insert_key - search and insert object to hash table
  *                                with explicit key
  * @ht:		hash table

^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [v1 PATCH 11/14] rhashtable: Allow hashfn to be unset
  2015-03-15 10:43       ` [v1 PATCH 0/14] rhashtable: Kill shift/Key netlink namespace/Merge jhash Herbert Xu
                           ` (9 preceding siblings ...)
  2015-03-15 10:44         ` [v1 PATCH 10/14] rhashtable: Rip out obsolete compare interface Herbert Xu
@ 2015-03-15 10:44         ` Herbert Xu
  2015-03-15 10:44         ` [v1 PATCH 12/14] netlink: Use default rhashtable hashfn Herbert Xu
                           ` (9 subsequent siblings)
  20 siblings, 0 replies; 113+ messages in thread
From: Herbert Xu @ 2015-03-15 10:44 UTC (permalink / raw)
  To: David Miller, tgraf, netdev, Eric Dumazet

Since every current rhashtable user uses jhash as their hash
function, the fact that jhash is an inline function causes each
user to generate a copy of its code.

This patch provides a solution to this problem by allowing
hashfn to be unset, in which case rhashtable will automatically
set it to jhash.  Furthermore, if the key length is a multiple
of 4, we will switch over to jhash2.
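
(As an illustration of the rule above: a minimal userspace sketch, not
part of the patch, with made-up names.  jhash2 takes its length argument
in 32-bit words, so the stored key length is converted when it is used.)

#include <stdio.h>

static void pick_hashfn(unsigned int key_len)
{
	unsigned int hashfn_key_len = key_len;
	const char *fn = "jhash";

	/* A key length that is a multiple of 4 bytes can use jhash2,
	 * which takes its length in u32 words instead of bytes.
	 */
	if (!(hashfn_key_len & (sizeof(unsigned int) - 1))) {
		hashfn_key_len /= sizeof(unsigned int);
		fn = "jhash2";
	}

	printf("key_len %u bytes -> %s, length argument %u\n",
	       key_len, fn, hashfn_key_len);
}

int main(void)
{
	pick_hashfn(6);		/* -> jhash, 6 */
	pick_hashfn(16);	/* -> jhash2, 4 */
	return 0;
}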

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 include/linux/rhashtable.h |    2 ++
 lib/rhashtable.c           |   19 +++++++++++++++++--
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 00fe294..295580e 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -110,6 +110,7 @@ struct rhashtable_params {
  * struct rhashtable - Hash table handle
  * @tbl: Bucket table
  * @nelems: Number of elements in table
+ * @hashfn_key_len: Key length for hashfn
  * @p: Configuration parameters
  * @run_work: Deferred worker to expand/shrink asynchronously
  * @mutex: Mutex to protect current/future table swapping
@@ -119,6 +120,7 @@ struct rhashtable {
 	struct bucket_table __rcu	*tbl;
 	atomic_t			nelems;
 	bool                            being_destroyed;
+	unsigned int			hashfn_key_len;
 	struct rhashtable_params	p;
 	struct work_struct		run_work;
 	struct mutex                    mutex;
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index e33e7b9..a2dc855 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -67,7 +67,7 @@ static u32 rht_bucket_index(const struct bucket_table *tbl, u32 hash)
 static u32 key_hashfn(struct rhashtable *ht, const struct bucket_table *tbl,
 		      const void *key)
 {
-	return rht_bucket_index(tbl, ht->p.hashfn(key, ht->p.key_len,
+	return rht_bucket_index(tbl, ht->p.hashfn(key, ht->hashfn_key_len,
 						  tbl->hash_rnd));
 }
 
@@ -858,6 +858,11 @@ static size_t rounded_hashtable_size(struct rhashtable_params *params)
 		   (unsigned long)params->min_size);
 }
 
+static u32 rhashtable_jhash2(const void *key, u32 length, u32 seed)
+{
+	return jhash2(key, length, seed);
+}
+
 /**
  * rhashtable_init - initialize a new hash table
  * @ht:		hash table to be initialized
@@ -908,7 +913,7 @@ int rhashtable_init(struct rhashtable *ht, struct rhashtable_params *params)
 
 	size = HASH_DEFAULT_SIZE;
 
-	if (!params->hashfn || !params->key_len ||
+	if (!params->key_len ||
 	    (params->obj_hashfn && !params->obj_cmpfn))
 		return -EINVAL;
 
@@ -924,6 +929,16 @@ int rhashtable_init(struct rhashtable *ht, struct rhashtable_params *params)
 	mutex_init(&ht->mutex);
 	memcpy(&ht->p, params, sizeof(*params));
 
+	ht->hashfn_key_len = ht->p.key_len;
+	if (!params->hashfn) {
+		ht->p.hashfn = jhash;
+
+		if (!(ht->hashfn_key_len & (sizeof(u32) - 1))) {
+			ht->hashfn_key_len /= sizeof(u32);
+			ht->p.hashfn = rhashtable_jhash2;
+		}
+	}
+
 	if (params->locks_mul)
 		ht->p.locks_mul = roundup_pow_of_two(params->locks_mul);
 	else

^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [v1 PATCH 12/14] netlink: Use default rhashtable hashfn
  2015-03-15 10:43       ` [v1 PATCH 0/14] rhashtable: Kill shift/Key netlink namespace/Merge jhash Herbert Xu
                           ` (10 preceding siblings ...)
  2015-03-15 10:44         ` [v1 PATCH 11/14] rhashtable: Allow hashfn to be unset Herbert Xu
@ 2015-03-15 10:44         ` Herbert Xu
  2015-03-15 10:44         ` [v1 PATCH 13/14] tipc: " Herbert Xu
                           ` (8 subsequent siblings)
  20 siblings, 0 replies; 113+ messages in thread
From: Herbert Xu @ 2015-03-15 10:44 UTC (permalink / raw)
  To: David Miller, tgraf, netdev, Eric Dumazet

This patch removes the explicit jhash value for the hashfn parameter
of rhashtable.  The default is now jhash so removing the setting
makes no difference apart from making one less copy of jhash in
the kernel.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 net/netlink/af_netlink.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index ca048f3..f521f35 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -3126,7 +3126,7 @@ static u32 netlink_hash(const void *data, u32 seed)
 	struct netlink_compare_arg arg;
 
 	netlink_compare_arg_init(&arg, sock_net(&nlk->sk), nlk->portid);
-	return jhash(&arg, netlink_compare_arg_len, seed);
+	return jhash2((u32 *)&arg, netlink_compare_arg_len / sizeof(u32), seed);
 }
 
 static int __init netlink_proto_init(void)
@@ -3136,7 +3136,6 @@ static int __init netlink_proto_init(void)
 	struct rhashtable_params ht_params = {
 		.head_offset = offsetof(struct netlink_sock, node),
 		.key_len = netlink_compare_arg_len,
-		.hashfn = jhash,
 		.obj_hashfn = netlink_hash,
 		.obj_cmpfn = netlink_compare,
 		.max_size = 65536,

^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [v1 PATCH 13/14] tipc: Use default rhashtable hashfn
  2015-03-15 10:43       ` [v1 PATCH 0/14] rhashtable: Kill shift/Key netlink namespace/Merge jhash Herbert Xu
                           ` (11 preceding siblings ...)
  2015-03-15 10:44         ` [v1 PATCH 12/14] netlink: Use default rhashtable hashfn Herbert Xu
@ 2015-03-15 10:44         ` Herbert Xu
  2015-03-15 10:44         ` [v1 PATCH 14/14] netfilter: " Herbert Xu
                           ` (7 subsequent siblings)
  20 siblings, 0 replies; 113+ messages in thread
From: Herbert Xu @ 2015-03-15 10:44 UTC (permalink / raw)
  To: David Miller, tgraf, netdev, Eric Dumazet

This patch removes the explicit jhash value for the hashfn parameter
of rhashtable.  The default is now jhash so removing the setting
makes no difference apart from making one less copy of jhash in
the kernel.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 net/tipc/socket.c |    2 --
 1 file changed, 2 deletions(-)

diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 64eb669..f462498 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -35,7 +35,6 @@
  */
 
 #include <linux/rhashtable.h>
-#include <linux/jhash.h>
 #include "core.h"
 #include "name_table.h"
 #include "node.h"
@@ -2360,7 +2359,6 @@ int tipc_sk_rht_init(struct net *net)
 		.head_offset = offsetof(struct tipc_sock, node),
 		.key_offset = offsetof(struct tipc_sock, portid),
 		.key_len = sizeof(u32), /* portid */
-		.hashfn = jhash,
 		.max_size = 1048576,
 		.min_size = 256,
 	};

^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [v1 PATCH 14/14] netfilter: Use default rhashtable hashfn
  2015-03-15 10:43       ` [v1 PATCH 0/14] rhashtable: Kill shift/Key netlink namespace/Merge jhash Herbert Xu
                           ` (12 preceding siblings ...)
  2015-03-15 10:44         ` [v1 PATCH 13/14] tipc: " Herbert Xu
@ 2015-03-15 10:44         ` Herbert Xu
  2015-03-16  4:01         ` [v1 PATCH 0/14] rhashtable: Kill shift/Key netlink namespace/Merge jhash David Miller
                           ` (6 subsequent siblings)
  20 siblings, 0 replies; 113+ messages in thread
From: Herbert Xu @ 2015-03-15 10:44 UTC (permalink / raw)
  To: David Miller, tgraf, netdev, Eric Dumazet

This patch removes the explicit jhash value for the hashfn parameter
of rhashtable.  The default is now jhash so removing the setting
makes no difference apart from making one less copy of jhash in
the kernel.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 net/netfilter/nft_hash.c |    2 --
 1 file changed, 2 deletions(-)

diff --git a/net/netfilter/nft_hash.c b/net/netfilter/nft_hash.c
index 1fcae5e..397bdd3 100644
--- a/net/netfilter/nft_hash.c
+++ b/net/netfilter/nft_hash.c
@@ -13,7 +13,6 @@
 #include <linux/module.h>
 #include <linux/list.h>
 #include <linux/log2.h>
-#include <linux/jhash.h>
 #include <linux/netlink.h>
 #include <linux/rhashtable.h>
 #include <linux/netfilter.h>
@@ -171,7 +170,6 @@ static int nft_hash_init(const struct nft_set *set,
 		.head_offset = offsetof(struct nft_hash_elem, node),
 		.key_offset = offsetof(struct nft_hash_elem, key),
 		.key_len = set->klen,
-		.hashfn = jhash,
 	};
 
 	return rhashtable_init(priv, &params);

^ permalink raw reply related	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 2/14] rhashtable: Introduce max_size/min_size
  2015-03-15 10:44         ` [v1 PATCH 2/14] rhashtable: Introduce max_size/min_size Herbert Xu
@ 2015-03-15 15:12           ` Sergei Shtylyov
  2015-03-15 20:21             ` Herbert Xu
  0 siblings, 1 reply; 113+ messages in thread
From: Sergei Shtylyov @ 2015-03-15 15:12 UTC (permalink / raw)
  To: Herbert Xu, David Miller, tgraf, netdev, Eric Dumazet

Hello.

On 3/15/2015 1:44 PM, Herbert Xu wrote:

> This patch adds the parameters max_size and min_size which are
> meant to replace max_shift and min_shift.

> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

[...]
> diff --git a/lib/rhashtable.c b/lib/rhashtable.c
> index 4a99c0c..5101e18 100644
> --- a/lib/rhashtable.c
> +++ b/lib/rhashtable.c
[...]
> @@ -872,7 +874,8 @@ EXPORT_SYMBOL_GPL(rhashtable_walk_stop);
>   static size_t rounded_hashtable_size(struct rhashtable_params *params)
>   {
>   	return max(roundup_pow_of_two(params->nelem_hint * 4 / 3),
> -		   1UL << params->min_shift);
> +		   max(1UL << params->min_shift,

    max_t(), perhaps?

> +		       (unsigned long)params->min_size));
>   }
>
>   /**
[...]

WBR, Sergei

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 4/14] tipc: Use rhashtable max_size instead of max_shift
  2015-03-15 10:44         ` [v1 PATCH 4/14] tipc: " Herbert Xu
@ 2015-03-15 15:13           ` Sergei Shtylyov
  0 siblings, 0 replies; 113+ messages in thread
From: Sergei Shtylyov @ 2015-03-15 15:13 UTC (permalink / raw)
  To: Herbert Xu, David Miller, tgraf, netdev, Eric Dumazet

On 3/15/2015 1:44 PM, Herbert Xu wrote:

> This patch converts tipc to use rhashtable max_size instead of
> the obsolete max_shift.

    You're converting 'min_shift' to 'min_size' as well...

> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> ---

>   net/tipc/socket.c |    4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)

> diff --git a/net/tipc/socket.c b/net/tipc/socket.c
> index 934947f..64eb669 100644
> --- a/net/tipc/socket.c
> +++ b/net/tipc/socket.c
> @@ -2361,8 +2361,8 @@ int tipc_sk_rht_init(struct net *net)
>   		.key_offset = offsetof(struct tipc_sock, portid),
>   		.key_len = sizeof(u32), /* portid */
>   		.hashfn = jhash,
> -		.max_shift = 20, /* 1M */
> -		.min_shift = 8,  /* 256 */
> +		.max_size = 1048576,
> +		.min_size = 256,
>   	};
>
>   	return rhashtable_init(&tn->sk_rht, &rht_params);

WBR, Sergei

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 2/14] rhashtable: Introduce max_size/min_size
  2015-03-15 15:12           ` Sergei Shtylyov
@ 2015-03-15 20:21             ` Herbert Xu
  0 siblings, 0 replies; 113+ messages in thread
From: Herbert Xu @ 2015-03-15 20:21 UTC (permalink / raw)
  To: Sergei Shtylyov; +Cc: David Miller, tgraf, netdev, Eric Dumazet

On Sun, Mar 15, 2015 at 06:12:02PM +0300, Sergei Shtylyov wrote:
> Hello.
> 
> On 3/15/2015 1:44 PM, Herbert Xu wrote:
> 
> >This patch adds the parameters max_size and min_size which are
> >meant to replace max_shift and min_shift.
> 
> >Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> 
> [...]
> >diff --git a/lib/rhashtable.c b/lib/rhashtable.c
> >index 4a99c0c..5101e18 100644
> >--- a/lib/rhashtable.c
> >+++ b/lib/rhashtable.c
> [...]
> >@@ -872,7 +874,8 @@ EXPORT_SYMBOL_GPL(rhashtable_walk_stop);
> >  static size_t rounded_hashtable_size(struct rhashtable_params *params)
> >  {
> >  	return max(roundup_pow_of_two(params->nelem_hint * 4 / 3),
> >-		   1UL << params->min_shift);
> >+		   max(1UL << params->min_shift,
> 
>    max_t(), perhaps?

max works just fine.  Besides, this max goes away a few patches
down.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 0/6] rhashtable: Fix two bugs caused by multiple rehash preparation
  2015-03-15 10:10     ` [v1 PATCH 0/6] rhashtable: Fix two bugs caused by multiple rehash preparation Herbert Xu
                         ` (3 preceding siblings ...)
  2015-03-15 10:43       ` [v1 PATCH 0/6] rhashtable: Fix two bugs caused by multiple rehash preparation Herbert Xu
@ 2015-03-16  2:23       ` David Miller
  4 siblings, 0 replies; 113+ messages in thread
From: David Miller @ 2015-03-16  2:23 UTC (permalink / raw)
  To: herbert; +Cc: tgraf, netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sun, 15 Mar 2015 21:10:04 +1100

> While testing some new patches over the weekend I discovered a
> couple of bugs in the series that had just been merged.  These
> two patches fix them:
> 
> 1) A use-after-free in the walker that can cause crashes when
> walking during a rehash.
> 
> 2) When a second rehash starts during a single rhashtable_remove
> call the remove may fail when it shouldn't.

Series applied, thanks Herbert.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 5/14] test_rhashtable: Use rhashtable max_size instead of max_shift
  2015-03-15 10:44         ` [v1 PATCH 5/14] test_rhashtable: " Herbert Xu
@ 2015-03-16  3:50           ` David Miller
  0 siblings, 0 replies; 113+ messages in thread
From: David Miller @ 2015-03-16  3:50 UTC (permalink / raw)
  To: herbert; +Cc: tgraf, netdev, eric.dumazet

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sun, 15 Mar 2015 21:44:30 +1100

> This patch converts tets_rhashtable to use rhashtable max_size

Typo: test_rhashtable

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 0/14] rhashtable: Kill shift/Key netlink namespace/Merge jhash
  2015-03-15 10:43       ` [v1 PATCH 0/14] rhashtable: Kill shift/Key netlink namespace/Merge jhash Herbert Xu
                           ` (13 preceding siblings ...)
  2015-03-15 10:44         ` [v1 PATCH 14/14] netfilter: " Herbert Xu
@ 2015-03-16  4:01         ` David Miller
  2015-03-16  4:18           ` Herbert Xu
  2015-03-18  9:01         ` [v2 PATCH 1/6] rhashtable: Remove shift from bucket_table Herbert Xu
                           ` (5 subsequent siblings)
  20 siblings, 1 reply; 113+ messages in thread
From: David Miller @ 2015-03-16  4:01 UTC (permalink / raw)
  To: herbert; +Cc: tgraf, netdev, eric.dumazet

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sun, 15 Mar 2015 21:43:06 +1100

> PS I'd love to kill the indirect call on the hash but I think
> you guys need something other than jhash for sockets.  Presumably
> they are not exposed to untrusted parties.  But I really wonder
> whether that is still the case given things like namespaces where
> even root may be untrusted.

In my opinion the biggest architectural fault of rhashtables is
these damn callbacks, look at the assembler for a simple hash
lookup, it's disgusting.

Two callbacks, _two_.

That also means the hash lookup can't be a leaf function, and we take
two mispredicted indirect calls as well.  Not acceptable.

This is pure overhead from overengineering in my opinion and I'd much
rather have code duplication than this.

We should have rhashtable_lookup_compare() be a macro, for which the
caller provides the hash function and compare operation inline.

Otherwise when we convert things like inet_hashtables.c to rhashtable
it's going to be a regression that will definitely show up in
microbenchmarks.

So essentially I think the rhashtables interface to how the keys are
described to the user is inside out.  rhashtable should just provide
the outer-framework, and the user should provide the minute details of
the key hashing and comparison in something that gets expanded inline.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 0/14] rhashtable: Kill shift/Key netlink namespace/Merge jhash
  2015-03-16  4:01         ` [v1 PATCH 0/14] rhashtable: Kill shift/Key netlink namespace/Merge jhash David Miller
@ 2015-03-16  4:18           ` Herbert Xu
  2015-03-16  4:30             ` David Miller
  0 siblings, 1 reply; 113+ messages in thread
From: Herbert Xu @ 2015-03-16  4:18 UTC (permalink / raw)
  To: David Miller; +Cc: tgraf, netdev, eric.dumazet

On Mon, Mar 16, 2015 at 12:01:13AM -0400, David Miller wrote:
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Sun, 15 Mar 2015 21:43:06 +1100
> 
> > PS I'd love to kill the indirect call on the hash but I think
> > you guys need something other than jhash for sockets.  Presumably
> > they are not exposed to untrusted parties.  But I really wonder
> > whether that is still the case given things like namespaces where
> > even root may be untrusted.
> 
> In my opinion the biggest architectural fault of rhashtables is
> these damn callbacks, look at the assembler for a simple hash
> lookup, it's disgusting.

Well if everybody used jhash the indirect call would go away...

> So essentially I think the rhashtables interface to how the keys are
> described to the user is inside out.  rhashtable should just provide
> the outer-framework, and the user should provide the minute details of
> the key hashing and comparison in something that gets expanded inline.

Sure I'll try to work something out that eliminates the indirect
calls on the fast path.

However, the indirect call is still needed for rehashing but that
should be OK since rehashing is supposed to be rare.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 0/14] rhashtable: Kill shift/Key netlink namespace/Merge jhash
  2015-03-16  4:18           ` Herbert Xu
@ 2015-03-16  4:30             ` David Miller
  2015-03-16  4:33               ` Herbert Xu
  0 siblings, 1 reply; 113+ messages in thread
From: David Miller @ 2015-03-16  4:30 UTC (permalink / raw)
  To: herbert; +Cc: tgraf, netdev, eric.dumazet

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Mon, 16 Mar 2015 15:18:05 +1100

> On Mon, Mar 16, 2015 at 12:01:13AM -0400, David Miller wrote:
>> From: Herbert Xu <herbert@gondor.apana.org.au>
>> Date: Sun, 15 Mar 2015 21:43:06 +1100
>> 
>> > PS I'd love to kill the indirect call on the hash but I think
>> > you guys need something other than jhash for sockets.  Presumably
>> > they are not exposed to untrusted parties.  But I really wonder
>> > whether that is still the case given things like namespaces where
>> > even root may be untrusted.
>> 
>> In my opinion the biggest architectural fault of rhashtables is
>> these damn callbacks, look at the assembler for a simple hash
>> lookup, it's disgusting.
> 
> Well if everybody used jhash the indirect call would go away...

The other issue is that some hashes are per-ns and thus don't
need the namespace hash and comparison, whilst others do.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 0/14] rhashtable: Kill shift/Key netlink namespace/Merge jhash
  2015-03-16  4:30             ` David Miller
@ 2015-03-16  4:33               ` Herbert Xu
  2015-03-16  4:40                 ` David Miller
  0 siblings, 1 reply; 113+ messages in thread
From: Herbert Xu @ 2015-03-16  4:33 UTC (permalink / raw)
  To: David Miller; +Cc: tgraf, netdev, eric.dumazet

On Mon, Mar 16, 2015 at 12:30:49AM -0400, David Miller wrote:
>
> > Well if everybody used jhash the indirect call would go away...
> 
> The other issue is that some hashes are per-ns and thus don't
> need the namespace hash and comparison, whilst others do.

That's just a matter of what goes into the key.  IOW if your
table is global then you include the namespace in the hash,
otherwise you don't.  It shouldn't change the hash function.

Have a look at what I did to netlink in patch 9.  If you wanted
a per-namespace hash table it is as simple as getting rid of
obj_hashfn and just reducing the key to have just the portid.
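
(Illustrative sketch of that distinction, using hypothetical struct
names rather than the actual ones from patch 9, and kernel types as in
the headers: the hash function stays the same either way, only the key
layout changes.)

struct global_portid_key {		/* one table shared by all netns  */
	struct net *net;		/* namespace folded into the key  */
	u32 portid;
};

struct per_ns_portid_key {		/* one table per namespace        */
	u32 portid;			/* namespace implied by the table */
};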

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 0/14] rhashtable: Kill shift/Key netlink namespace/Merge jhash
  2015-03-16  4:33               ` Herbert Xu
@ 2015-03-16  4:40                 ` David Miller
  2015-03-16 11:26                   ` Herbert Xu
  0 siblings, 1 reply; 113+ messages in thread
From: David Miller @ 2015-03-16  4:40 UTC (permalink / raw)
  To: herbert; +Cc: tgraf, netdev, eric.dumazet

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Mon, 16 Mar 2015 15:33:59 +1100

> On Mon, Mar 16, 2015 at 12:30:49AM -0400, David Miller wrote:
>>
>> > Well if everybody used jhash the indirect call would go away...
>> 
>> The other issue is that some hashes are per-ns and thus don't
>> need the namespace hash and comparison, whilst others do.
> 
> That's just a matter of what goes into the key.  IOW if your
> table is global then you include the namespace in the hash,
> otherwise you don't.  It shouldn't change the hash function.
> 
> Have a look at what I did to netlink in patch 9.  If you wanted
> a per-namespace hash table it is as simple as getting rid of
> obj_hashfn and just reducing the key to have just the portid.

I know, reading the hash callback adjusting patches is what
prompted me to start this conversation about inlining the
lookup.

Patrick McHardy had mentioned to me his consternation about this issue
offhand the other week, and I've just failed to get around to bringing
it up :-)

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 7/14] netfilter: Use rhashtable_lookup instead of lookup_compare
  2015-03-15 10:44         ` [v1 PATCH 7/14] netfilter: Use rhashtable_lookup instead of lookup_compare Herbert Xu
@ 2015-03-16  8:28           ` Thomas Graf
  2015-03-16  9:14             ` Herbert Xu
  0 siblings, 1 reply; 113+ messages in thread
From: Thomas Graf @ 2015-03-16  8:28 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, netdev, Eric Dumazet, kaber

On 03/15/15 at 09:44pm, Herbert Xu wrote:
> The use of rhashtable_lookup_compare in nft_hash is gratuitous
> since the comparison function is just doing memcmp.  Furthermore,
> there is cruft embedded in the comparison function that should
> instead be moved into the caller of the lookup.
> 
> This patch moves that cruft over and replaces the call to
> rhashtable_lookup_compare with rhashtable_lookup.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

The reason for the indirection was to not bypass the abstraction
nft_data_cmp() provides.

No objection to the change but maybe leave a comment in
nft_data_cmp() that if one changes nft_data_cmp() one needs to
look at nft_hash and see if the direct use of rhashtable_lookup()
is still valid.

Copying Patrick as well.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 7/14] netfilter: Use rhashtable_lookup instead of lookup_compare
  2015-03-16  8:28           ` Thomas Graf
@ 2015-03-16  9:14             ` Herbert Xu
  2015-03-16  9:28               ` Thomas Graf
                                 ` (2 more replies)
  0 siblings, 3 replies; 113+ messages in thread
From: Herbert Xu @ 2015-03-16  9:14 UTC (permalink / raw)
  To: Thomas Graf; +Cc: David Miller, netdev, Eric Dumazet, kaber

On Mon, Mar 16, 2015 at 08:28:42AM +0000, Thomas Graf wrote:
> On 03/15/15 at 09:44pm, Herbert Xu wrote:
> > The use of rhashtable_lookup_compare in nft_hash is gratuitous
> > since the comparison function is just doing memcmp.  Furthermore,
> > there is cruft embedded in the comparison function that should
> > instead be moved into the caller of the lookup.
> > 
> > This patch moves that cruft over and replaces the call to
> > rhashtable_lookup_compare with rhashtable_lookup.
> > 
> > Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> 
> The reason for the indirection was to not bypass the abstraction
> nft_data_cmp() provides.
> 
> No objection to the change but maybe leave a comment in
> nft_data_cmp() that if one changes nft_data_cmp() one needs to
> look at nft_hash and see if the direct use of rhashtable_lookup()
> is still valid.

Well we could preserve nft_data_cmp if necessary.  It'll just
mean an extra indirect call to do the compare when needed.

Patrick?

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 7/14] netfilter: Use rhashtable_lookup instead of lookup_compare
  2015-03-16  9:14             ` Herbert Xu
@ 2015-03-16  9:28               ` Thomas Graf
  2015-03-16 11:13               ` Patrick McHardy
  2015-03-20  9:36               ` Herbert Xu
  2 siblings, 0 replies; 113+ messages in thread
From: Thomas Graf @ 2015-03-16  9:28 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, netdev, Eric Dumazet, kaber

On 03/16/15 at 08:14pm, Herbert Xu wrote:
> On Mon, Mar 16, 2015 at 08:28:42AM +0000, Thomas Graf wrote:
> > On 03/15/15 at 09:44pm, Herbert Xu wrote:
> > > The use of rhashtable_lookup_compare in nft_hash is gratuitous
> > > since the comparison function is just doing memcmp.  Furthermore,
> > > there is cruft embedded in the comparison function that should
> > > instead be moved into the caller of the lookup.
> > > 
> > > This patch moves that cruft over and replaces the call to
> > > rhashtable_lookup_compare with rhashtable_lookup.
> > > 
> > > Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> > 
> > The reason for the indirection was to not bypass the abstraction
> > nft_data_cmp() provides.
> > 
> > No objection to the change but maybe leave a comment in
> > nft_data_cmp() that if one changes nft_data_cmp() one needs to
> > look at nft_hash and see if the direct use of rhashtable_lookup()
> > is still valid.
> 
> Well we could preserve nft_data_cmp if necessary.  It'll just
> mean an extra indirect call to do the compare when needed.

I like Dave's idea of implementing the lookup as a macro to avoid
the redirect for both rhashtable and API users requiring their own
compare function to inline the compare and hash function:

#define rhashtable_lookup_compare(ht, obj, compare_fn, obj_hash_fn, arg)

Leaving indirect hash calls to the slow path only. This was also
the initial design of the compare lookup variation until it
"evolved" away from that due to lack of users of that API ;-)

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 10/14] rhashtable: Rip out obsolete compare interface
  2015-03-15 10:44         ` [v1 PATCH 10/14] rhashtable: Rip out obsolete compare interface Herbert Xu
@ 2015-03-16  9:35           ` Thomas Graf
  0 siblings, 0 replies; 113+ messages in thread
From: Thomas Graf @ 2015-03-16  9:35 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, netdev, Eric Dumazet

On 03/15/15 at 09:44pm, Herbert Xu wrote:
> Now that the only user of rhashtable_lookup_compare_insert and
> rhashtable_lookup_compare (i.e., netlink) has switched over to
> the new obj_hashfn based interface, we can rip them out safely.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

As pointed out in the previous patch, I think we should keep this
API but convert it to a macro for inlining. rhashtable_lookup() would
then just be a user of this macro.

It might even be worth to hardcode rhashtable_lookup() to jhash
and require users which require a different hash function to
use rhashtable_lookup_compare().

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 7/14] netfilter: Use rhashtable_lookup instead of lookup_compare
  2015-03-16  9:14             ` Herbert Xu
  2015-03-16  9:28               ` Thomas Graf
@ 2015-03-16 11:13               ` Patrick McHardy
  2015-03-20  8:55                 ` Herbert Xu
  2015-03-20  9:36               ` Herbert Xu
  2 siblings, 1 reply; 113+ messages in thread
From: Patrick McHardy @ 2015-03-16 11:13 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Thomas Graf, David Miller, netdev, Eric Dumazet

On 16.03, Herbert Xu wrote:
> On Mon, Mar 16, 2015 at 08:28:42AM +0000, Thomas Graf wrote:
> > On 03/15/15 at 09:44pm, Herbert Xu wrote:
> > > The use of rhashtable_lookup_compare in nft_hash is gratuitous
> > > since the comparison function is just doing memcmp.  Furthermore,
> > > there is cruft embedded in the comparison function that should
> > > instead be moved into the caller of the lookup.
> > > 
> > > This patch moves that cruft over and replaces the call to
> > > rhashtable_lookup_compare with rhashtable_lookup.
> > > 
> > > Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> > 
> > The reason for the indirection was to not bypass the abstraction
> > nft_data_cmp() provides.
> > 
> > No objection to the change but maybe leave a comment in
> > nft_data_cmp() that if one changes nft_data_cmp() one needs to
> > look at nft_hash and see if the direct use of rhashtable_lookup()
> > is still valid.
> 
> Well we could preserve nft_data_cmp if necessary.  It'll just
> mean an extra indirect call to do the compare when needed.
> 
> Patrick?

An upcoming patchset will add transaction support to sets, which needs
the callback. So please leave it for now since it will only cause
unnecessary churn.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 0/14] rhashtable: Kill shift/Key netlink namespace/Merge jhash
  2015-03-16  4:40                 ` David Miller
@ 2015-03-16 11:26                   ` Herbert Xu
  2015-03-16 20:25                     ` David Miller
  0 siblings, 1 reply; 113+ messages in thread
From: Herbert Xu @ 2015-03-16 11:26 UTC (permalink / raw)
  To: David Miller; +Cc: tgraf, netdev, eric.dumazet, Patrick McHardy

On Mon, Mar 16, 2015 at 12:40:17AM -0400, David Miller wrote:
>
> Patrick McHardy had mentioned to me his consternation about this issue
> offhand the other week, and I've just failed to get around to bringing
> it up :-)

How about something like this? Compile tested only but at least
it does inline correctly.

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index c5104aa..79df647 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -42,6 +42,9 @@
 #define RHT_HASH_BITS		27
 #define RHT_BASE_SHIFT		RHT_HASH_BITS
 
+/* Base bits plus 1 bit for nulls marker */
+#define RHT_HASH_RESERVED_SPACE	(RHT_BASE_BITS + 1)
+
 struct rhash_head {
 	struct rhash_head __rcu		*next;
 };
@@ -172,6 +175,24 @@ static inline unsigned long rht_get_nulls_value(const struct rhash_head *ptr)
 	return ((unsigned long) ptr) >> 1;
 }
 
+static void *rht_obj(const struct rhashtable *ht, const struct rhash_head *he)
+{
+	return (void *) he - ht->p.head_offset;
+}
+
+static inline u32 rht_bucket_index(const struct bucket_table *tbl, u32 hash)
+{
+	return (hash >> RHT_HASH_RESERVED_SPACE) & (tbl->size - 1);
+}
+
+static inline u32 rht_key_hashfn(const struct bucket_table *tbl,
+				 const void *key,
+				 unsigned int key_len,
+				 rht_hashfn_t hashfn)
+{
+	return rht_bucket_index(tbl, hashfn(key, key_len, tbl->hash_rnd));
+}
+
 #ifdef CONFIG_PROVE_LOCKING
 int lockdep_rht_mutex_is_held(struct rhashtable *ht);
 int lockdep_rht_bucket_is_held(const struct bucket_table *tbl, u32 hash);
@@ -197,6 +218,7 @@ int rhashtable_expand(struct rhashtable *ht);
 int rhashtable_shrink(struct rhashtable *ht);
 
 void *rhashtable_lookup(struct rhashtable *ht, const void *key);
+void *rhashtable_lookup_slow(struct rhashtable *ht, const void *key);
 
 int rhashtable_lookup_insert(struct rhashtable *ht, struct rhash_head *obj);
 int rhashtable_lookup_insert_key(struct rhashtable *ht,
@@ -358,4 +380,30 @@ void rhashtable_destroy(struct rhashtable *ht);
 	rht_for_each_entry_rcu_continue(tpos, pos, (tbl)->buckets[hash],\
 					tbl, hash, member)
 
+static inline void *rhashtable_lookup_fast(struct rhashtable *ht,
+					   const void *key,
+					   unsigned int key_len,
+					   rht_hashfn_t hashfn,
+					   rht_obj_cmpfn_t obj_cmpfn)
+{
+	const struct bucket_table *tbl;
+	struct rhash_head *he;
+	u32 hash;
+
+	rcu_read_lock();
+
+	tbl = rht_dereference_rcu(ht->tbl, ht);
+	hash = rht_key_hashfn(tbl, key, key_len, hashfn);
+	rht_for_each_rcu(he, tbl, hash) {
+		if (!obj_cmpfn(key, rht_obj(ht, he)))
+			continue;
+		rcu_read_unlock();
+		return rht_obj(ht, he);
+	}
+
+	rcu_read_unlock();
+
+	return rhashtable_lookup(ht, key);
+}
+
 #endif /* _LINUX_RHASHTABLE_H */
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index d22da74..7486e6a 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -33,9 +33,6 @@
 #define HASH_MIN_SIZE		4U
 #define BUCKET_LOCKS_PER_CPU   128UL
 
-/* Base bits plus 1 bit for nulls marker */
-#define HASH_RESERVED_SPACE	(RHT_BASE_BITS + 1)
-
 /* The bucket lock is selected based on the hash and protects mutations
  * on a group of hash buckets.
  *
@@ -54,21 +51,10 @@ static spinlock_t *bucket_lock(const struct bucket_table *tbl, u32 hash)
 	return &tbl->locks[hash & tbl->locks_mask];
 }
 
-static void *rht_obj(const struct rhashtable *ht, const struct rhash_head *he)
-{
-	return (void *) he - ht->p.head_offset;
-}
-
-static u32 rht_bucket_index(const struct bucket_table *tbl, u32 hash)
-{
-	return (hash >> HASH_RESERVED_SPACE) & (tbl->size - 1);
-}
-
 static u32 key_hashfn(struct rhashtable *ht, const struct bucket_table *tbl,
 		      const void *key)
 {
-	return rht_bucket_index(tbl, ht->p.hashfn(key, ht->hashfn_key_len,
-						  tbl->hash_rnd));
+	return rht_key_hashfn(tbl, key, ht->p.key_len, ht->p.hashfn);
 }
 
 static u32 head_hashfn(struct rhashtable *ht,
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index f521f35..4f225c7 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1000,7 +1000,7 @@ static struct sock *__netlink_lookup(struct netlink_table *table, u32 portid,
 	struct netlink_compare_arg arg;
 
 	netlink_compare_arg_init(&arg, net, portid);
-	return rhashtable_lookup(&table->hash, &arg);
+	return rhashtable_lookup_fast(&table->hash, &arg, netlink_compare_arg_len, jhash2, netlink_compare);
 }
 
 static int __netlink_insert(struct netlink_table *table, struct sock *sk)

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 0/14] rhashtable: Kill shift/Key netlink namespace/Merge jhash
  2015-03-16 11:26                   ` Herbert Xu
@ 2015-03-16 20:25                     ` David Miller
  0 siblings, 0 replies; 113+ messages in thread
From: David Miller @ 2015-03-16 20:25 UTC (permalink / raw)
  To: herbert; +Cc: tgraf, netdev, eric.dumazet, kaber

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Mon, 16 Mar 2015 22:26:28 +1100

> On Mon, Mar 16, 2015 at 12:40:17AM -0400, David Miller wrote:
>>
>> Patrick McHardy had mentioned to me his consternation about this issue
>> offhand the other week, and I've just failed to get around to bringing
>> it up :-)
> 
> How about something like this? Compile tested only but at least
> it does inline correctly.

I like it.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* RE: [v1 PATCH 1/14] rhashtable: Remove shift from bucket_table
  2015-03-15 10:44         ` [v1 PATCH 1/14] rhashtable: Remove shift from bucket_table Herbert Xu
@ 2015-03-17 10:51           ` David Laight
  2015-03-17 10:56             ` tgraf
  0 siblings, 1 reply; 113+ messages in thread
From: David Laight @ 2015-03-17 10:51 UTC (permalink / raw)
  To: 'Herbert Xu', David Miller, tgraf, netdev, Eric Dumazet

From: Herbert Xu
> Sent: 15 March 2015 10:44
> Keeping both size and shift is silly.  We only need one.
...
> --- a/lib/rhashtable.c
> +++ b/lib/rhashtable.c
> @@ -162,7 +162,6 @@ static struct bucket_table *bucket_table_alloc(struct rhashtable *ht,
>  		return NULL;
> 
>  	tbl->size = nbuckets;
> -	tbl->shift = ilog2(nbuckets);
> 
>  	if (alloc_bucket_locks(ht, tbl) < 0) {
>  		bucket_table_free(tbl);
> @@ -189,7 +188,7 @@ static bool rht_grow_above_75(const struct rhashtable *ht,
>  {
>  	/* Expand table when exceeding 75% load */
>  	return atomic_read(&ht->nelems) > (tbl->size / 4 * 3) &&
> -	       (!ht->p.max_shift || tbl->shift < ht->p.max_shift);
> +	       (!ht->p.max_shift || tbl->size < (1 << ht->p.max_shift));

Looks like you could pre-calculate the 'grow_at' size.
The test above would then be:
	return atomic_read(&ht->nelems) > tbl->grow_at_size;

Similarly for the shrink.
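
(Sketch of that idea, with grow_at_size as a hypothetical cached field
rather than anything from the series: the 75% watermark is computed once
when the table is allocated, so the insert fast path is a single compare.)

struct bucket_table_sketch {
	unsigned int size;
	unsigned int grow_at_size;	/* hypothetical: 75% of size */
};

static void sketch_set_watermarks(struct bucket_table_sketch *tbl,
				  unsigned int nbuckets)
{
	tbl->size = nbuckets;
	tbl->grow_at_size = nbuckets / 4 * 3;
}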

	David

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 1/14] rhashtable: Remove shift from bucket_table
  2015-03-17 10:51           ` David Laight
@ 2015-03-17 10:56             ` tgraf
  2015-03-17 11:00               ` Herbert Xu
  0 siblings, 1 reply; 113+ messages in thread
From: tgraf @ 2015-03-17 10:56 UTC (permalink / raw)
  To: David Laight; +Cc: 'Herbert Xu', David Miller, netdev, Eric Dumazet

On 03/17/15 at 10:51am, David Laight wrote:
> Looks like you could pre-calculate the 'grow_at' size.
> The test above would then be:
> 	return atomic_read(&ht->nelems > tbl->grow_at_size);
> 
> Similarly for the shrink.

Given the discussions, the grow decision will likely change to
a max bucket length limit anyway.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 1/14] rhashtable: Remove shift from bucket_table
  2015-03-17 10:56             ` tgraf
@ 2015-03-17 11:00               ` Herbert Xu
  2015-03-17 11:22                 ` tgraf
  2015-03-17 11:22                 ` David Laight
  0 siblings, 2 replies; 113+ messages in thread
From: Herbert Xu @ 2015-03-17 11:00 UTC (permalink / raw)
  To: tgraf; +Cc: David Laight, David Miller, netdev, Eric Dumazet

On Tue, Mar 17, 2015 at 10:56:57AM +0000, tgraf@suug.ch wrote:
>
> Given the discussions, the grow decision will likely change to
> a max bucket length limit anyway.

Actually no.  In my patches the chain length is only used to
force an immediate rehash.  Growing is still based on the number
of elements.

The reason is that the maximum (not average) chain length actually
grows with the hash table size, even at 75% utilisation.

So unless we want ever decreasing utilisation as the table grows,
we have to cope with a few chains with many elements.  The limit
I'm currently using is 16 which shouldn't be hit even if you had
a 2^32 table (which you currently cannot).
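
(Sketch of the policy described above, with hypothetical helper names:
growth stays driven by the element count, while an over-long chain only
triggers an immediate rehash into a same-sized table with a new seed.)

#define SKETCH_CHAIN_LIMIT	16

static void sketch_insert_done(struct rhashtable *ht,
			       struct bucket_table *tbl,
			       unsigned int chain_len)
{
	/* Deferred grow: based purely on nelems vs. 75% of table size. */
	if (rht_grow_above_75(ht, tbl))
		schedule_work(&ht->run_work);

	/* Forced rehash: hypothetical helper, breaks up a bad chain. */
	if (chain_len > SKETCH_CHAIN_LIMIT)
		sketch_schedule_rehash(ht);
}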

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 1/14] rhashtable: Remove shift from bucket_table
  2015-03-17 11:00               ` Herbert Xu
@ 2015-03-17 11:22                 ` tgraf
  2015-03-17 11:27                   ` Herbert Xu
  2015-03-17 11:22                 ` David Laight
  1 sibling, 1 reply; 113+ messages in thread
From: tgraf @ 2015-03-17 11:22 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Laight, David Miller, netdev, Eric Dumazet

On 03/17/15 at 10:00pm, Herbert Xu wrote:
> On Tue, Mar 17, 2015 at 10:56:57AM +0000, tgraf@suug.ch wrote:
> >
> > Given the discussions, the grow decision will likely change to
> > a max bucket length limit anyway.
> 
> Actually no.  In my patches the chain length is only used to
> force an immediate rehash.  Growing is still based on the number
> of elements.
> 
> The reason is that the maximum (not average) chain length actually
> grows with the hash table size, even at 75% utilisation.
> 
> So unless we want ever decreasing utilisation as the table grows,
> we have to cope with a few chains with many elements.  The limit
> I'm currently using is 16 which shouldn't be hit even if you had
> a 2^32 table (which you currently cannot).

Not sure I understand the downside of a bucket length based grow
decision with optional forced rehashing plus synchronous table
realloc if we hit a 2nd watermark as you proposed earlier. Shouldn't
we consider deterministic lookup and insert behaviour more important
than overall table utilization? Given the rehashing, the grow decision
should not be attackable.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* RE: [v1 PATCH 1/14] rhashtable: Remove shift from bucket_table
  2015-03-17 11:00               ` Herbert Xu
  2015-03-17 11:22                 ` tgraf
@ 2015-03-17 11:22                 ` David Laight
  2015-03-17 11:25                   ` Herbert Xu
  1 sibling, 1 reply; 113+ messages in thread
From: David Laight @ 2015-03-17 11:22 UTC (permalink / raw)
  To: 'Herbert Xu', tgraf; +Cc: David Miller, netdev, Eric Dumazet

From: Herbert Xu
> Sent: 17 March 2015 11:01
> On Tue, Mar 17, 2015 at 10:56:57AM +0000, tgraf@suug.ch wrote:
> >
> > Given the discussions, the grow decision will likely change to
> > a max bucket length limit anyway.
> 
> Actually no.  In my patches the chain length is only used to
> force an immediate rehash.  Growing is still based on the number
> of elements.
> 
> The reason is that the maximum (not average) chain length actually
> grows with the hash table size, even at 75% utilisation.

That doesn't surprise me.

But won't the rehashed table be just as likely to have a long list?
So you are likely to get an immediate rehash?

	David

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 1/14] rhashtable: Remove shift from bucket_table
  2015-03-17 11:22                 ` David Laight
@ 2015-03-17 11:25                   ` Herbert Xu
  0 siblings, 0 replies; 113+ messages in thread
From: Herbert Xu @ 2015-03-17 11:25 UTC (permalink / raw)
  To: David Laight; +Cc: tgraf, David Miller, netdev, Eric Dumazet

On Tue, Mar 17, 2015 at 11:22:44AM +0000, David Laight wrote:
>
> But won't the rehashed table be just as likely to have a long list?
> So you are likely to get an immediate rehash?

Not if the threshold is set to a value that should never be hit
at 100% utilisation.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 1/14] rhashtable: Remove shift from bucket_table
  2015-03-17 11:22                 ` tgraf
@ 2015-03-17 11:27                   ` Herbert Xu
  2015-03-17 11:57                     ` tgraf
  0 siblings, 1 reply; 113+ messages in thread
From: Herbert Xu @ 2015-03-17 11:27 UTC (permalink / raw)
  To: tgraf; +Cc: David Laight, David Miller, netdev, Eric Dumazet

On Tue, Mar 17, 2015 at 11:22:03AM +0000, tgraf@suug.ch wrote:
>
> Not sure I understand the downside of a bucket length based grow
> decision with optional forced rehashing plus synchroneous table
> realloc if we hit a 2nd watermark as you proposed earlier. Shouldn't
> we consider deterministic lookup and insert behaviour more important
> than overall table utilization? Given the rehashing, the grow decision
> should not be attackable.

Do you really want to double the table size when 0.1% of the buckets
have a chain length > 4 but still < 16?

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 1/14] rhashtable: Remove shift from bucket_table
  2015-03-17 11:27                   ` Herbert Xu
@ 2015-03-17 11:57                     ` tgraf
  2015-03-17 12:13                       ` David Laight
  0 siblings, 1 reply; 113+ messages in thread
From: tgraf @ 2015-03-17 11:57 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Laight, David Miller, netdev, Eric Dumazet

On 03/17/15 at 10:27pm, Herbert Xu wrote:
> On Tue, Mar 17, 2015 at 11:22:03AM +0000, tgraf@suug.ch wrote:
> >
> > Not sure I understand the downside of a bucket length based grow
> > decision with optional forced rehashing plus synchronous table
> > realloc if we hit a 2nd watermark as you proposed earlier. Shouldn't
> > we consider deterministic lookup and insert behaviour more important
> > than overall table utilization? Given the rehashing, the grow decision
> > should not be attackable.
> 
> Do you really want to double the table size when 0.1% of the buckets
> have a chain length > 4 but still < 16?

If we constantly hit that bucket because we are handling just a few
TCP flows it would be worth to double the size & rehash to avoid the
additional cache misses of the linked list.

Although a limit of 4 would be too high. Ideally we should resize and
rehash when we add the 2nd entry to a bucket to stay < 100% utilization.
It seems likely though that we'll always have a bucket with >=2
entries so we would end up constantly doubling and rehashing. This is
the only thing that speaks for a table wide nelems counter in my
opinion.

Another option would be to continue resizing & rehashing as long as we
have at least one bucket with >= 4 entries but allow a table size
dependent limit of (length > 1 && length < 4) buckets. This may be
overengineered again though ;-)

^ permalink raw reply	[flat|nested] 113+ messages in thread

* RE: [v1 PATCH 1/14] rhashtable: Remove shift from bucket_table
  2015-03-17 11:57                     ` tgraf
@ 2015-03-17 12:13                       ` David Laight
  2015-03-17 12:18                         ` 'tgraf@suug.ch'
  2015-03-17 12:20                         ` Herbert Xu
  0 siblings, 2 replies; 113+ messages in thread
From: David Laight @ 2015-03-17 12:13 UTC (permalink / raw)
  To: 'tgraf@suug.ch', Herbert Xu; +Cc: David Miller, netdev, Eric Dumazet

From: Thomas Graf 
> Sent: 17 March 2015 11:58
...
> > Do you really want to double the table size when 0.1% of the buckets
> > have a chain length > 4 but still < 16?
> 
> If we constantly hit that bucket because we are handling just a few
> TCP flows it would be worth to double the size & rehash to avoid the
> additional cache misses of the linked list.
> 
> Although a limit of 4 would be too high. Ideally we should resize and
> rehash when we add the 2nd entry to a bucket to stay < 100% utilization.
> It seems likely though that we'll always have a bucket with >=2
> entries so we would end up constantly doubling and rehashing. This is
> the only thing that speaks for a table wide nelems counter in my
> opinion.

I think you are seriously overestimating the 'efficiency' of the hash function.
And not doing the 'birthday paradox' maths at all.

The only way you'll get a 'hash' that good is if you can pick the input
value in order to generate a perfect hash.
However you aren't going to manage that for inbound TCP connections since
none of the inputs to the hash can be chosen by the listening system
(unless IPv6 has something that can help you).

You may have to live with a few % of the items being on long chains.
Maybe count the number of items on chains longer than (say) 4 and
rehash or increase the table size if this exceeds a few % of the
table size.
(Or count the number of items that are further than 4 from the start
of the hash chain.)
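
(A userspace-style sketch of that heuristic, purely illustrative: count
the entries sitting deeper than position 4 in their chain, and grow or
rehash once they exceed a small fraction of the table.)

static int sketch_should_resize(const unsigned int *chain_len,
				unsigned int nbuckets)
{
	unsigned int i, deep = 0;

	for (i = 0; i < nbuckets; i++)
		if (chain_len[i] > 4)
			deep += chain_len[i] - 4;

	return deep * 100 > nbuckets * 2;	/* "a few %" taken as 2% */
}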

	David

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 1/14] rhashtable: Remove shift from bucket_table
  2015-03-17 12:13                       ` David Laight
@ 2015-03-17 12:18                         ` 'tgraf@suug.ch'
  2015-03-17 12:20                         ` Herbert Xu
  1 sibling, 0 replies; 113+ messages in thread
From: 'tgraf@suug.ch' @ 2015-03-17 12:18 UTC (permalink / raw)
  To: David Laight; +Cc: Herbert Xu, David Miller, netdev, Eric Dumazet

On 03/17/15 at 12:13pm, David Laight wrote:
> From: Thomas Graf 
> > Sent: 17 March 2015 11:58
> ...
> > > Do you really want to double the table size when 0.1% of the buckets
> > > have a chain length > 4 but still < 16?
> > 
> > If we constantly hit that bucket because we are handling just a few
> > TCP flows it would be worth to double the size & rehash to avoid the
> > additional cache misses of the linked list.
> > 
> > Although a limit of 4 would be too high. Ideally we should resize and
> > rehash when we add the 2nd entry to a bucket to stay < 100% utilization.
> > It seems likely though that we'll always have a bucket with >=2
> > entries so we would end up constantly doubling and rehashing. This is
> > the only thing that speaks for a table wide nelems counter in my
> > opinion.
> 
> I think you are seriously overestimating the 'efficiency' of the hash function.
> And not doing the 'birthday paradox' maths at all.
> 
> The only way you'll get a 'hash' that good is if you can pick the input
> value in order to generate a perfect hash.
> However you aren't going to manage that for inbound TCP connections since
> none of the inputs to the hash can be chosen by the listening system
> (unless IPv6 has something that can help you).
> 
> You may have to live with a few % of the items being on long chains.
> Maybe count the number of items on chains longer than (say) 4 and
> rehash or increase the table size if this exceeds a few % of the
> table size.
> (Or count the number of items that are further than 4 from the start
> of the hash chain.)

Did you read my email all the way?  You basically omitted to quote and
then rephrased the 2nd half of my email.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 1/14] rhashtable: Remove shift from bucket_table
  2015-03-17 12:13                       ` David Laight
  2015-03-17 12:18                         ` 'tgraf@suug.ch'
@ 2015-03-17 12:20                         ` Herbert Xu
  2015-03-17 12:40                           ` 'tgraf@suug.ch'
  1 sibling, 1 reply; 113+ messages in thread
From: Herbert Xu @ 2015-03-17 12:20 UTC (permalink / raw)
  To: David Laight; +Cc: 'tgraf@suug.ch', David Miller, netdev, Eric Dumazet

On Tue, Mar 17, 2015 at 12:13:42PM +0000, David Laight wrote:
> From: Thomas Graf 
> > Sent: 17 March 2015 11:58
> ...
> > > Do you really want to double the table size when 0.1% of the buckets
> > > have a chain length > 4 but still < 16?
> > 
> > If we constantly hit that bucket because we are handling just a few
> > TCP flows it would be worth to double the size & rehash to avoid the
> > additional cache misses of the linked list.
> > 
> > Although a limit of 4 would be too high. Ideally we should resize and
> > rehash when we add the 2nd entry to a bucket to stay < 100% utilization.
> > It seems likely though that we'll always have a bucket with >=2
> > entries so we would end up constantly doubling and rehashing. This is
> > > the only thing that speaks for a table wide nelems counter in my
> > opinion.
> 
> I think you are seriously overestimating the 'efficiency' of the hash function.
> And not doing the 'birthday paradox' maths at all.

Agreed.  Thomas, an easy test to do is to pump the integers from
0 to 65535 into jhash, then mask it with 65535 and run sort |
uniq -c | sort -n on it, I think you'll see what we're talking
about.
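
(Spelled out, assuming a small helper program, here called ./jhash_print,
that emits jhash(i) & 65535 for i = 0..65535, one value per line, the
test reads roughly:

	./jhash_print | sort | uniq -c | sort -n

The end of the output shows how many of the 65536 uniformly chosen keys
pile into the most loaded buckets.)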

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 1/14] rhashtable: Remove shift from bucket_table
  2015-03-17 12:20                         ` Herbert Xu
@ 2015-03-17 12:40                           ` 'tgraf@suug.ch'
  2015-03-17 13:06                             ` David Laight
  2015-03-17 21:56                             ` Herbert Xu
  0 siblings, 2 replies; 113+ messages in thread
From: 'tgraf@suug.ch' @ 2015-03-17 12:40 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Laight, David Miller, netdev, Eric Dumazet

On 03/17/15 at 11:20pm, Herbert Xu wrote:
> On Tue, Mar 17, 2015 at 12:13:42PM +0000, David Laight wrote:
> > From: Thomas Graf 
> > > Sent: 17 March 2015 11:58
> > ...
> > > > Do you really want to double the table size when 0.1% of the buckets
> > > > have a chain length > 4 but still < 16?
> > > 
> > > If we constantly hit that bucket because we are handling just a few
> > > TCP flows it would be worth to double the size & rehash to avoid the
> > > additional cache misses of the linked list.
> > > 
> > > Although a limit of 4 would be too high. Ideally we should resize and
> > > rehash when we add the 2nd entry to a bucket to stay < 100% utilization.
> > > It seems likely though that we'll always have a bucket with >=2
> > > entries so we would end up constantly doubling and rehashing. This is
> > > > the only thing that speaks for a table wide nelems counter in my
> > > opinion.
> > 
> > I think you are seriously overestimating the 'efficiency' of the hash function.
> > And not doing the 'birthday paradox' maths at all.
> 
> Agreed.  Thomas, an easy test to do is to pump the integers from
> 0 to 65535 into jhash, then mask it with 65535 and run sort |
> uniq -c | sort -n on it, I think you'll see what we're talking
> about.

I'm not claiming perfect hash functions and this is exactly why I
think average utilization is not an optimal growth criterion because
it gives a very limited view into the actual chain lengths.

What you describe above is a 100% utilization scenario. Initially
we talked about 0.1% utilization and whether to resize & rehash if a
single chain has length > 4. My answer is: yes we should resize &
rehash or at least rehash in that case.

My point here is that a chain length of 4 may be a serious
performance bottleneck already and that it might be worth to try
and detect bad hashing distribution and attempt to fix it at an
earlier stage while ruling out the possibility of endless rehashes.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* RE: [v1 PATCH 1/14] rhashtable: Remove shift from bucket_table
  2015-03-17 12:40                           ` 'tgraf@suug.ch'
@ 2015-03-17 13:06                             ` David Laight
  2015-03-17 21:56                             ` Herbert Xu
  1 sibling, 0 replies; 113+ messages in thread
From: David Laight @ 2015-03-17 13:06 UTC (permalink / raw)
  To: 'tgraf@suug.ch', Herbert Xu; +Cc: David Miller, netdev, Eric Dumazet

From: Thomas Graf
> Sent: 17 March 2015 12:40
...
> I'm not claiming perfect hash functions and this is exactly why I
> think average utilization is not an optimal growth criteria because
> it gives very limited view into the actual chain lengths.

If you assume that all the entries are being looked up equally
often the interesting number ought to be the number of failed compares.
So you want to sum 'length * (length - 1)/2' and compare against the
total number of items (or the table size).
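
As an illustration only (the chain lengths below are made up, not
measured from anything), that metric could be computed like this:

#include <stdio.h>

int main(void)
{
        /* Hypothetical per-bucket chain lengths, purely for illustration. */
        unsigned int chain_len[] = { 0, 1, 4, 2, 1, 0, 3, 1 };
        unsigned int nbuckets = sizeof(chain_len) / sizeof(chain_len[0]);
        unsigned long failed = 0, nelems = 0;
        unsigned int i;

        for (i = 0; i < nbuckets; i++) {
                nelems += chain_len[i];
                /* looking up each item in a chain of length n once costs
                 * 0 + 1 + ... + (n - 1) = n * (n - 1) / 2 failed compares
                 */
                if (chain_len[i] >= 2)
                        failed += (unsigned long)chain_len[i] *
                                  (chain_len[i] - 1) / 2;
        }

        printf("items=%lu buckets=%u failed compares=%lu\n",
               nelems, nbuckets, failed);
        return 0;
}

If that sum is large compared to the number of items (or the table
size), lookups are doing a lot of wasted compares.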

> What you describe above is a 100% utilization scenario. Initially
> we talked about 0.1% utilization and whether to resize & rehash if a
> single chain has length > 4. My answer is: yes we should resize &
> rehash or at least rehash in that case.

0.1% utilisation is about as realistic as 100%.
50-75% is probably a reasonable limit.
This will give some short chains.

> My point here is that a chain length of 4 may be a serious
> performance bottleneck already and that it might be worth to try
> and detect bad hashing distribution and attempt to fix it at an
> earlier stage while ruling out the possibility of endless rehashes.

If you have 4 items and they are all on one chain then I suspect that
nothing you do will actually split them.

OTOH if you have 1000 items and one chain of length 4 it really doesn't
matter. Unless, of course, someone arranges to flood the last item with
traffic - in which case you are going to lose anyway.

	David

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 1/14] rhashtable: Remove shift from bucket_table
  2015-03-17 12:40                           ` 'tgraf@suug.ch'
  2015-03-17 13:06                             ` David Laight
@ 2015-03-17 21:56                             ` Herbert Xu
  2015-03-18  9:51                               ` 'tgraf@suug.ch'
  1 sibling, 1 reply; 113+ messages in thread
From: Herbert Xu @ 2015-03-17 21:56 UTC (permalink / raw)
  To: 'tgraf@suug.ch'; +Cc: David Laight, David Miller, netdev, Eric Dumazet

On Tue, Mar 17, 2015 at 12:40:12PM +0000, 'tgraf@suug.ch' wrote:
>
> > Agreed.  Thomas, an easy test to do is to pump the integers from
> > 0 to 65535 into jhash, then mask it with 65535 and run sort |
> > uniq -c | sort -n on it, I think you'll see what we're talking
> > about.
> 
> I'm not claiming perfect hash functions and this is exactly why I
> think average utilization is not an optimal growth criteria because
> it gives very limited view into the actual chain lengths.
> 
> What you describe above is a 100% utilization scenario. Initially

Actually 75% is just as bad.  To test that just repeat my script
above and add head -n $((65536 / 4 * 3)) before the first sort.

My point is that to get below anything like a chain length limit
of 2 (or even 4), your hash table is going to be mostly empty.
OK 0.1% might be an exaggeration but it's certainly nowhere near 50%
as your hash table size grows.

Please actually try out my test, here is a C program that will help
you do it:

#include <stdio.h>
#include <stdlib.h>

typedef unsigned char u8;
typedef unsigned int u32;

static inline u32 rol32(u32 word, unsigned int shift)
{
	return (word << shift) | (word >> (32 - shift));
}

/* Best hash sizes are of power of two */
#define jhash_size(n)   ((u32)1<<(n))
/* Mask the hash value, i.e (value & jhash_mask(n)) instead of (value % n) */
#define jhash_mask(n)   (jhash_size(n)-1)

/* __jhash_mix -- mix 3 32-bit values reversibly. */
#define __jhash_mix(a, b, c)			\
{						\
	a -= c;  a ^= rol32(c, 4);  c += b;	\
	b -= a;  b ^= rol32(a, 6);  a += c;	\
	c -= b;  c ^= rol32(b, 8);  b += a;	\
	a -= c;  a ^= rol32(c, 16); c += b;	\
	b -= a;  b ^= rol32(a, 19); a += c;	\
	c -= b;  c ^= rol32(b, 4);  b += a;	\
}

/* __jhash_final - final mixing of 3 32-bit values (a,b,c) into c */
#define __jhash_final(a, b, c)			\
{						\
	c ^= b; c -= rol32(b, 14);		\
	a ^= c; a -= rol32(c, 11);		\
	b ^= a; b -= rol32(a, 25);		\
	c ^= b; c -= rol32(b, 16);		\
	a ^= c; a -= rol32(c, 4);		\
	b ^= a; b -= rol32(a, 14);		\
	c ^= b; c -= rol32(b, 24);		\
}

#define __get_unaligned_cpu32(x) (*(u32 *)(x))

/* An arbitrary initial parameter */
#define JHASH_INITVAL		0xdeadbeef

/* jhash - hash an arbitrary key
 * @k: sequence of bytes as key
 * @length: the length of the key
 * @initval: the previous hash, or an arbitray value
 *
 * The generic version, hashes an arbitrary sequence of bytes.
 * No alignment or length assumptions are made about the input key.
 *
 * Returns the hash value of the key. The result depends on endianness.
 */
static inline u32 jhash(const void *key, u32 length, u32 initval)
{
	u32 a, b, c;
	const u8 *k = key;

	/* Set up the internal state */
	a = b = c = JHASH_INITVAL + length + initval;

	/* All but the last block: affect some 32 bits of (a,b,c) */
	while (length > 12) {
		a += __get_unaligned_cpu32(k);
		b += __get_unaligned_cpu32(k + 4);
		c += __get_unaligned_cpu32(k + 8);
		__jhash_mix(a, b, c);
		length -= 12;
		k += 12;
	}
	/* Last block: affect all 32 bits of (c) */
	/* All the case statements fall through */
	switch (length) {
	case 12: c += (u32)k[11]<<24;
	case 11: c += (u32)k[10]<<16;
	case 10: c += (u32)k[9]<<8;
	case 9:  c += k[8];
	case 8:  b += (u32)k[7]<<24;
	case 7:  b += (u32)k[6]<<16;
	case 6:  b += (u32)k[5]<<8;
	case 5:  b += k[4];
	case 4:  a += (u32)k[3]<<24;
	case 3:  a += (u32)k[2]<<16;
	case 2:  a += (u32)k[1]<<8;
	case 1:  a += k[0];
		 __jhash_final(a, b, c);
	case 0: /* Nothing left to add */
		break;
	}

	return c;
}

/* jhash2 - hash an array of u32's
 * @k: the key which must be an array of u32's
 * @length: the number of u32's in the key
 * @initval: the previous hash, or an arbitray value
 *
 * Returns the hash value of the key.
 */
static inline u32 jhash2(const u32 *k, u32 length, u32 initval)
{
	u32 a, b, c;

	/* Set up the internal state */
	a = b = c = JHASH_INITVAL + (length<<2) + initval;

	/* Handle most of the key */
	while (length > 3) {
		a += k[0];
		b += k[1];
		c += k[2];
		__jhash_mix(a, b, c);
		length -= 3;
		k += 3;
	}

	/* Handle the last 3 u32's: all the case statements fall through */
	switch (length) {
	case 3: c += k[2];
	case 2: b += k[1];
	case 1: a += k[0];
		__jhash_final(a, b, c);
	case 0:	/* Nothing left to add */
		break;
	}

	return c;
}


/* jhash_3words - hash exactly 3, 2 or 1 word(s) */
static inline u32 jhash_3words(u32 a, u32 b, u32 c, u32 initval)
{
	a += JHASH_INITVAL;
	b += JHASH_INITVAL;
	c += initval;

	__jhash_final(a, b, c);

	return c;
}

static inline u32 jhash_2words(u32 a, u32 b, u32 initval)
{
	return jhash_3words(a, b, 0, initval);
}

static inline u32 jhash_1word(u32 a, u32 initval)
{
	return jhash_3words(a, 0, 0, initval);
}

int main(int argc, char **argv)
{
	int i;
	struct {
		void *s;
		void *t;
		u32 l;
	} k = { .s = 0 };
	int total;

	total = atoi(argv[1]);

	for (i = 0; i < total; i++) {
		k.l = i;
		printf("0x%x\n", jhash2((u32 *)&k, 3, 12345) & (total - 1));
	}

	return 0;
}

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [v2 PATCH 1/6] rhashtable: Remove shift from bucket_table
  2015-03-15 10:43       ` [v1 PATCH 0/14] rhashtable: Kill shift/Key netlink namespace/Merge jhash Herbert Xu
                           ` (14 preceding siblings ...)
  2015-03-16  4:01         ` [v1 PATCH 0/14] rhashtable: Kill shift/Key netlink namespace/Merge jhash David Miller
@ 2015-03-18  9:01         ` Herbert Xu
  2015-03-18  9:01         ` [v2 PATCH 2/6] rhashtable: Introduce max_size/min_size Herbert Xu
                           ` (4 subsequent siblings)
  20 siblings, 0 replies; 113+ messages in thread
From: Herbert Xu @ 2015-03-18  9:01 UTC (permalink / raw)
  To: David Miller, tgraf, netdev, Eric Dumazet

Keeping both size and shift is silly.  We only need one.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 include/linux/rhashtable.h |    2 --
 lib/rhashtable.c           |    5 ++---
 2 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 1695378..f16e856 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -51,7 +51,6 @@ struct rhash_head {
  * @size: Number of hash buckets
  * @rehash: Current bucket being rehashed
  * @hash_rnd: Random seed to fold into hash
- * @shift: Current size (1 << shift)
  * @locks_mask: Mask to apply before accessing locks[]
  * @locks: Array of spinlocks protecting individual buckets
  * @walkers: List of active walkers
@@ -63,7 +62,6 @@ struct bucket_table {
 	unsigned int		size;
 	unsigned int		rehash;
 	u32			hash_rnd;
-	u32			shift;
 	unsigned int		locks_mask;
 	spinlock_t		*locks;
 	struct list_head	walkers;
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 09a7ada..0974003 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -162,7 +162,6 @@ static struct bucket_table *bucket_table_alloc(struct rhashtable *ht,
 		return NULL;
 
 	tbl->size = nbuckets;
-	tbl->shift = ilog2(nbuckets);
 
 	if (alloc_bucket_locks(ht, tbl) < 0) {
 		bucket_table_free(tbl);
@@ -189,7 +188,7 @@ static bool rht_grow_above_75(const struct rhashtable *ht,
 {
 	/* Expand table when exceeding 75% load */
 	return atomic_read(&ht->nelems) > (tbl->size / 4 * 3) &&
-	       (!ht->p.max_shift || tbl->shift < ht->p.max_shift);
+	       (!ht->p.max_shift || tbl->size < (1 << ht->p.max_shift));
 }
 
 /**
@@ -202,7 +201,7 @@ static bool rht_shrink_below_30(const struct rhashtable *ht,
 {
 	/* Shrink table beneath 30% load */
 	return atomic_read(&ht->nelems) < (tbl->size * 3 / 10) &&
-	       tbl->shift > ht->p.min_shift;
+	       tbl->size > (1 << ht->p.min_shift);
 }
 
 static int rhashtable_rehash_one(struct rhashtable *ht, unsigned old_hash)

^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [v2 PATCH 2/6] rhashtable: Introduce max_size/min_size
  2015-03-15 10:43       ` [v1 PATCH 0/14] rhashtable: Kill shift/Key netlink namespace/Merge jhash Herbert Xu
                           ` (15 preceding siblings ...)
  2015-03-18  9:01         ` [v2 PATCH 1/6] rhashtable: Remove shift from bucket_table Herbert Xu
@ 2015-03-18  9:01         ` Herbert Xu
  2015-03-18 10:55           ` Thomas Graf
  2015-03-18  9:01         ` [v2 PATCH 3/6] netlink: Use rhashtable max_size instead of max_shift Herbert Xu
                           ` (3 subsequent siblings)
  20 siblings, 1 reply; 113+ messages in thread
From: Herbert Xu @ 2015-03-18  9:01 UTC (permalink / raw)
  To: David Miller, tgraf, netdev, Eric Dumazet

This patch adds the parameters max_size and min_size which are
meant to replace max_shift and min_shift.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 include/linux/rhashtable.h |    4 ++++
 lib/rhashtable.c           |   12 ++++++++----
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index f16e856..81267fe 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -85,6 +85,8 @@ struct rhashtable;
  * @head_offset: Offset of rhash_head in struct to be hashed
  * @max_shift: Maximum number of shifts while expanding
  * @min_shift: Minimum number of shifts while shrinking
+ * @max_size: Maximum size while expanding
+ * @min_size: Minimum size while shrinking
  * @nulls_base: Base value to generate nulls marker
  * @locks_mul: Number of bucket locks to allocate per cpu (default: 128)
  * @hashfn: Function to hash key
@@ -97,6 +99,8 @@ struct rhashtable_params {
 	size_t			head_offset;
 	size_t			max_shift;
 	size_t			min_shift;
+	unsigned int		max_size;
+	unsigned int		min_size;
 	u32			nulls_base;
 	size_t			locks_mul;
 	rht_hashfn_t		hashfn;
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 0974003..c4061bb 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -27,7 +27,7 @@
 #include <linux/err.h>
 
 #define HASH_DEFAULT_SIZE	64UL
-#define HASH_MIN_SIZE		4UL
+#define HASH_MIN_SIZE		4U
 #define BUCKET_LOCKS_PER_CPU   128UL
 
 /* Base bits plus 1 bit for nulls marker */
@@ -188,7 +188,8 @@ static bool rht_grow_above_75(const struct rhashtable *ht,
 {
 	/* Expand table when exceeding 75% load */
 	return atomic_read(&ht->nelems) > (tbl->size / 4 * 3) &&
-	       (!ht->p.max_shift || tbl->size < (1 << ht->p.max_shift));
+	       (!ht->p.max_shift || tbl->size < (1 << ht->p.max_shift)) &&
+	       (!ht->p.max_size || tbl->size < ht->p.max_size);
 }
 
 /**
@@ -201,7 +202,8 @@ static bool rht_shrink_below_30(const struct rhashtable *ht,
 {
 	/* Shrink table beneath 30% load */
 	return atomic_read(&ht->nelems) < (tbl->size * 3 / 10) &&
-	       tbl->size > (1 << ht->p.min_shift);
+	       tbl->size > (1 << ht->p.min_shift) &&
+	       tbl->size > ht->p.min_size;
 }
 
 static int rhashtable_rehash_one(struct rhashtable *ht, unsigned old_hash)
@@ -873,7 +875,8 @@ EXPORT_SYMBOL_GPL(rhashtable_walk_stop);
 static size_t rounded_hashtable_size(struct rhashtable_params *params)
 {
 	return max(roundup_pow_of_two(params->nelem_hint * 4 / 3),
-		   1UL << params->min_shift);
+		   max(1UL << params->min_shift,
+		       (unsigned long)params->min_size));
 }
 
 /**
@@ -935,6 +938,7 @@ int rhashtable_init(struct rhashtable *ht, struct rhashtable_params *params)
 
 	params->min_shift = max_t(size_t, params->min_shift,
 				  ilog2(HASH_MIN_SIZE));
+	params->min_size = max(params->min_size, HASH_MIN_SIZE);
 
 	if (params->nelem_hint)
 		size = rounded_hashtable_size(params);

^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [v2 PATCH 3/6] netlink: Use rhashtable max_size instead of max_shift
  2015-03-15 10:43       ` [v1 PATCH 0/14] rhashtable: Kill shift/Key netlink namespace/Merge jhash Herbert Xu
                           ` (16 preceding siblings ...)
  2015-03-18  9:01         ` [v2 PATCH 2/6] rhashtable: Introduce max_size/min_size Herbert Xu
@ 2015-03-18  9:01         ` Herbert Xu
  2015-03-18  9:01         ` [v2 PATCH 4/6] tipc: Use rhashtable max/min_size instead of max/min_shift Herbert Xu
                           ` (2 subsequent siblings)
  20 siblings, 0 replies; 113+ messages in thread
From: Herbert Xu @ 2015-03-18  9:01 UTC (permalink / raw)
  To: David Miller, tgraf, netdev, Eric Dumazet

This patch converts netlink to use rhashtable max_size instead
of the obsolete max_shift.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 net/netlink/af_netlink.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 6b0f219..d97aed6 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -3123,7 +3123,7 @@ static int __init netlink_proto_init(void)
 		.key_offset = offsetof(struct netlink_sock, portid),
 		.key_len = sizeof(u32), /* portid */
 		.hashfn = jhash,
-		.max_shift = 16, /* 64K */
+		.max_size = 65536,
 	};
 
 	if (err != 0)

^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [v2 PATCH 4/6] tipc: Use rhashtable max/min_size instead of max/min_shift
  2015-03-15 10:43       ` [v1 PATCH 0/14] rhashtable: Kill shift/Key netlink namespace/Merge jhash Herbert Xu
                           ` (17 preceding siblings ...)
  2015-03-18  9:01         ` [v2 PATCH 3/6] netlink: Use rhashtable max_size instead of max_shift Herbert Xu
@ 2015-03-18  9:01         ` Herbert Xu
  2015-03-18  9:01         ` [v2 PATCH 5/6] test_rhashtable: Use rhashtable max_size instead of max_shift Herbert Xu
  2015-03-18  9:01         ` [v2 PATCH 6/6] rhashtable: Remove max_shift and min_shift Herbert Xu
  20 siblings, 0 replies; 113+ messages in thread
From: Herbert Xu @ 2015-03-18  9:01 UTC (permalink / raw)
  To: David Miller, tgraf, netdev, Eric Dumazet

This patch converts tipc to use rhashtable max/min_size instead of
the obsolete max/min_shift.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 net/tipc/socket.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 813847d..d7a6c10 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -2286,8 +2286,8 @@ int tipc_sk_rht_init(struct net *net)
 		.key_offset = offsetof(struct tipc_sock, portid),
 		.key_len = sizeof(u32), /* portid */
 		.hashfn = jhash,
-		.max_shift = 20, /* 1M */
-		.min_shift = 8,  /* 256 */
+		.max_size = 1048576,
+		.min_size = 256,
 	};
 
 	return rhashtable_init(&tn->sk_rht, &rht_params);

^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [v2 PATCH 5/6] test_rhashtable: Use rhashtable max_size instead of max_shift
  2015-03-15 10:43       ` [v1 PATCH 0/14] rhashtable: Kill shift/Key netlink namespace/Merge jhash Herbert Xu
                           ` (18 preceding siblings ...)
  2015-03-18  9:01         ` [v2 PATCH 4/6] tipc: Use rhashtable max/min_size instead of max/min_shift Herbert Xu
@ 2015-03-18  9:01         ` Herbert Xu
  2015-03-18  9:01         ` [v2 PATCH 6/6] rhashtable: Remove max_shift and min_shift Herbert Xu
  20 siblings, 0 replies; 113+ messages in thread
From: Herbert Xu @ 2015-03-18  9:01 UTC (permalink / raw)
  To: David Miller, tgraf, netdev, Eric Dumazet

This patch converts test_rhashtable to use rhashtable max_size
instead of the obsolete max_shift.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 lib/test_rhashtable.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index 16974fd..2bc403d 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -201,7 +201,7 @@ static int __init test_rht_init(void)
 		.key_offset = offsetof(struct test_obj, value),
 		.key_len = sizeof(int),
 		.hashfn = jhash,
-		.max_shift = 1, /* we expand/shrink manually here */
+		.max_size = 2, /* we expand/shrink manually here */
 		.nulls_base = (3U << RHT_BASE_SHIFT),
 	};
 	int err;

^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [v2 PATCH 6/6] rhashtable: Remove max_shift and min_shift
  2015-03-15 10:43       ` [v1 PATCH 0/14] rhashtable: Kill shift/Key netlink namespace/Merge jhash Herbert Xu
                           ` (19 preceding siblings ...)
  2015-03-18  9:01         ` [v2 PATCH 5/6] test_rhashtable: Use rhashtable max_size instead of max_shift Herbert Xu
@ 2015-03-18  9:01         ` Herbert Xu
  20 siblings, 0 replies; 113+ messages in thread
From: Herbert Xu @ 2015-03-18  9:01 UTC (permalink / raw)
  To: David Miller, tgraf, netdev, Eric Dumazet

Now that nobody uses max_shift and min_shift, we can safely remove
them.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 include/linux/rhashtable.h |    4 ----
 lib/rhashtable.c           |    7 +------
 2 files changed, 1 insertion(+), 10 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 81267fe..99425f2 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -83,8 +83,6 @@ struct rhashtable;
  * @key_len: Length of key
  * @key_offset: Offset of key in struct to be hashed
  * @head_offset: Offset of rhash_head in struct to be hashed
- * @max_shift: Maximum number of shifts while expanding
- * @min_shift: Minimum number of shifts while shrinking
  * @max_size: Maximum size while expanding
  * @min_size: Minimum size while shrinking
  * @nulls_base: Base value to generate nulls marker
@@ -97,8 +95,6 @@ struct rhashtable_params {
 	size_t			key_len;
 	size_t			key_offset;
 	size_t			head_offset;
-	size_t			max_shift;
-	size_t			min_shift;
 	unsigned int		max_size;
 	unsigned int		min_size;
 	u32			nulls_base;
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index c4061bb..5f8fe3e 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -188,7 +188,6 @@ static bool rht_grow_above_75(const struct rhashtable *ht,
 {
 	/* Expand table when exceeding 75% load */
 	return atomic_read(&ht->nelems) > (tbl->size / 4 * 3) &&
-	       (!ht->p.max_shift || tbl->size < (1 << ht->p.max_shift)) &&
 	       (!ht->p.max_size || tbl->size < ht->p.max_size);
 }
 
@@ -202,7 +201,6 @@ static bool rht_shrink_below_30(const struct rhashtable *ht,
 {
 	/* Shrink table beneath 30% load */
 	return atomic_read(&ht->nelems) < (tbl->size * 3 / 10) &&
-	       tbl->size > (1 << ht->p.min_shift) &&
 	       tbl->size > ht->p.min_size;
 }
 
@@ -875,8 +873,7 @@ EXPORT_SYMBOL_GPL(rhashtable_walk_stop);
 static size_t rounded_hashtable_size(struct rhashtable_params *params)
 {
 	return max(roundup_pow_of_two(params->nelem_hint * 4 / 3),
-		   max(1UL << params->min_shift,
-		       (unsigned long)params->min_size));
+		   (unsigned long)params->min_size);
 }
 
 /**
@@ -936,8 +933,6 @@ int rhashtable_init(struct rhashtable *ht, struct rhashtable_params *params)
 	if (params->nulls_base && params->nulls_base < (1U << RHT_BASE_SHIFT))
 		return -EINVAL;
 
-	params->min_shift = max_t(size_t, params->min_shift,
-				  ilog2(HASH_MIN_SIZE));
 	params->min_size = max(params->min_size, HASH_MIN_SIZE);
 
 	if (params->nelem_hint)

^ permalink raw reply related	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 1/14] rhashtable: Remove shift from bucket_table
  2015-03-17 21:56                             ` Herbert Xu
@ 2015-03-18  9:51                               ` 'tgraf@suug.ch'
  2015-03-18  9:55                                 ` Herbert Xu
  0 siblings, 1 reply; 113+ messages in thread
From: 'tgraf@suug.ch' @ 2015-03-18  9:51 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Laight, David Miller, netdev, Eric Dumazet

On 03/18/15 at 08:56am, Herbert Xu wrote:
> Actually 75% is just as bad.  To test that just repeat my script
> above and add head -n $((65536 / 4 * 3)) before the first sort.
> 
> My point is that to get below anything like a chain length limit
> of 2 (or even 4), your hash table is going to be mostly empty.
> OK 0.1% might be an exaggeration but it's certainly nowhere near 50%
> as your hash table size grows.
> 
> Please actually try out my test, here is a C program that will help
> you do it:

Thanks for providing that C program. I played around with it and
noticed a small typo which I fixed up. I'm attaching the version
I've used at the end of the mail.

Maybe there is a thinko on my part but the results I'm getting are
very impressive. I have a very hard time producing more than a
single hash conflict for sequences up to 0..2^27.

I also tried with a pseudo-random value for the u32 part of the key
instead of the sequence number and got the same results.

I'm running it like this:
for i in $(seq 1 27); do \
echo $i: && ./jhash $((2**$i)) | uniq -D | uniq -c
done

1:
2:
3:
      3 0x1
4:
      2 0xc
5:
6:
      2 0x2b
7:
8:
9:
10:
11:
      2 0x1c9
12:
13:
      2 0x9c7
      2 0x4e9
14:
15:
      2 0x5c8
16:
17:
18:
      2 0x34cb2
      2 0x21fd0
19:
      2 0x61fd0
      2 0x6e82f
20:
      2 0x61fd0
      2 0x6e82f
      2 0xfd571
21:
      2 0x1a0e02
      2 0x1776ea
22:
      2 0x379099
23:
      2 0x2650b6
      2 0x6dc5b5
      2 0x648fe1
      2 0x543446
24:
      2 0x8d60e0
      2 0x8726e5
25:
      2 0x1225f30
26:
      2 0x3a136f4
27:


--- jhash.c ---

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef unsigned char u8;
typedef unsigned int u32;

static inline u32 rol32(u32 word, unsigned int shift)
{
	return (word << shift) | (word >> (32 - shift));
}

/* Best hash sizes are of power of two */
#define jhash_size(n)   ((u32)1<<(n))
/* Mask the hash value, i.e (value & jhash_mask(n)) instead of (value % n) */
#define jhash_mask(n)   (jhash_size(n)-1)

/* __jhash_mix -- mix 3 32-bit values reversibly. */
#define __jhash_mix(a, b, c)			\
{						\
	a -= c;  a ^= rol32(c, 4);  c += b;	\
	b -= a;  b ^= rol32(a, 6);  a += c;	\
	c -= b;  c ^= rol32(b, 8);  b += a;	\
	a -= c;  a ^= rol32(c, 16); c += b;	\
	b -= a;  b ^= rol32(a, 19); a += c;	\
	c -= b;  c ^= rol32(b, 4);  b += a;	\
}

/* __jhash_final - final mixing of 3 32-bit values (a,b,c) into c */
#define __jhash_final(a, b, c)			\
{						\
	c ^= b; c -= rol32(b, 14);		\
	a ^= c; a -= rol32(c, 11);		\
	b ^= a; b -= rol32(a, 25);		\
	c ^= b; c -= rol32(b, 16);		\
	a ^= c; a -= rol32(c, 4);		\
	b ^= a; b -= rol32(a, 14);		\
	c ^= b; c -= rol32(b, 24);		\
}

#define __get_unaligned_cpu32(x) (*(u32 *)(x))

/* An arbitrary initial parameter */
#define JHASH_INITVAL		0xdeadbeef

/* jhash - hash an arbitrary key
 * @k: sequence of bytes as key
 * @length: the length of the key
 * @initval: the previous hash, or an arbitray value
 *
 * The generic version, hashes an arbitrary sequence of bytes.
 * No alignment or length assumptions are made about the input key.
 *
 * Returns the hash value of the key. The result depends on endianness.
 */
static inline u32 jhash(const void *key, u32 length, u32 initval)
{
	u32 a, b, c;
	const u8 *k = key;

	/* Set up the internal state */
	a = b = c = JHASH_INITVAL + length + initval;

	/* All but the last block: affect some 32 bits of (a,b,c) */
	while (length > 12) {
		a += __get_unaligned_cpu32(k);
		b += __get_unaligned_cpu32(k + 4);
		c += __get_unaligned_cpu32(k + 8);
		__jhash_mix(a, b, c);
		length -= 12;
		k += 12;
	}
	/* Last block: affect all 32 bits of (c) */
	/* All the case statements fall through */
	switch (length) {
	case 12: c += (u32)k[11]<<24;
	case 11: c += (u32)k[10]<<16;
	case 10: c += (u32)k[9]<<8;
	case 9:  c += k[8];
	case 8:  b += (u32)k[7]<<24;
	case 7:  b += (u32)k[6]<<16;
	case 6:  b += (u32)k[5]<<8;
	case 5:  b += k[4];
	case 4:  a += (u32)k[3]<<24;
	case 3:  a += (u32)k[2]<<16;
	case 2:  a += (u32)k[1]<<8;
	case 1:  a += k[0];
		 __jhash_final(a, b, c);
	case 0: /* Nothing left to add */
		break;
	}

	return c;
}

/* jhash2 - hash an array of u32's
 * @k: the key which must be an array of u32's
 * @length: the number of u32's in the key
 * @initval: the previous hash, or an arbitray value
 *
 * Returns the hash value of the key.
 */
static inline u32 jhash2(const u32 *k, u32 length, u32 initval)
{
	u32 a, b, c;

	/* Set up the internal state */
	a = b = c = JHASH_INITVAL + (length<<2) + initval;

	/* Handle most of the key */
	while (length > 3) {
		a += k[0];
		b += k[1];
		c += k[2];
		__jhash_mix(a, b, c);
		length -= 3;
		k += 3;
	}

	/* Handle the last 3 u32's: all the case statements fall through */
	switch (length) {
	case 3: c += k[2];
	case 2: b += k[1];
	case 1: a += k[0];
		__jhash_final(a, b, c);
	case 0:	/* Nothing left to add */
		break;
	}

	return c;
}


/* jhash_3words - hash exactly 3, 2 or 1 word(s) */
static inline u32 jhash_3words(u32 a, u32 b, u32 c, u32 initval)
{
	a += JHASH_INITVAL;
	b += JHASH_INITVAL;
	c += initval;

	__jhash_final(a, b, c);

	return c;
}

static inline u32 jhash_2words(u32 a, u32 b, u32 initval)
{
	return jhash_3words(a, b, 0, initval);
}

static inline u32 jhash_1word(u32 a, u32 initval)
{
	return jhash_3words(a, 0, 0, initval);
}

int main(int argc, char **argv)
{
	int i;
	struct {
		void *s;
		void *t;
		u32 l;
	} k = { .s = 0 };
	int total;
	u32 initval;

	total = atoi(argv[1]);
	srandom(time(0));
	initval = random();

	for (i = 0; i < total; i++) {
		k.l = i;
		printf("0x%x\n", jhash2((u32 *)&k, sizeof(k)/4, initval) & (total - 1));
	}

	return 0;
}

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 1/14] rhashtable: Remove shift from bucket_table
  2015-03-18  9:51                               ` 'tgraf@suug.ch'
@ 2015-03-18  9:55                                 ` Herbert Xu
  2015-03-18 10:08                                   ` 'tgraf@suug.ch'
  0 siblings, 1 reply; 113+ messages in thread
From: Herbert Xu @ 2015-03-18  9:55 UTC (permalink / raw)
  To: 'tgraf@suug.ch'; +Cc: David Laight, David Miller, netdev, Eric Dumazet

On Wed, Mar 18, 2015 at 09:51:02AM +0000, 'tgraf@suug.ch' wrote:
> 
> I'm running it like this:
> for i in $(seq 1 27); do \
> echo $i: && ./jhash $((2**$i)) | uniq -D | uniq -c \

uniq -D doesn't do what you think it does:

$ { echo a; echo b; echo a; } | uniq -D
$

So you should use sort | uniq -c

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 1/14] rhashtable: Remove shift from bucket_table
  2015-03-18  9:55                                 ` Herbert Xu
@ 2015-03-18 10:08                                   ` 'tgraf@suug.ch'
  2015-03-18 10:12                                     ` Herbert Xu
  0 siblings, 1 reply; 113+ messages in thread
From: 'tgraf@suug.ch' @ 2015-03-18 10:08 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Laight, David Miller, netdev, Eric Dumazet

On 03/18/15 at 08:55pm, Herbert Xu wrote:
> On Wed, Mar 18, 2015 at 09:51:02AM +0000, 'tgraf@suug.ch' wrote:
> > 
> > I'm running it like this:
> > for i in $(seq 1 27); do \
> > echo $i: && ./jhash $((2**$i)) | uniq -D | uniq -c \
> 
> uniq -D doesn't do what you think it does:
> 
> $ { echo a; echo b; echo a; } | uniq -D
> $
> 
> So you should use sort | uniq -c

Thanks. The important keyword is "adjacent" here ;-)

With this fixed up I can see what you mean. So if we are
to only do a chain length based decision, the limit would
have to grow together with the table size.

for i in $(seq 1 27); do
echo $i: && ./jhash $((2**$i)) | sort | uniq -D | uniq -c | sort -n -r | head -5;
done

1:
      2 0x0
2:
      3 0x0
3:
      3 0x0
      2 0x2
4:
      2 0xb
      2 0x8
      2 0x4
      2 0x2
      2 0x0
5:
      3 0xe
      3 0x13
      2 0xc
      2 0xa
      2 0x8
6:
      3 0x38
      3 0x37
      3 0x32
      3 0x2f
      3 0x2e
7:
      4 0x37
      3 0x7d
      3 0x79
      3 0x6f
      3 0x69
8:
      6 0xb7
      5 0x79
      4 0xf4
      4 0xe4
      4 0xa3
9:
      5 0xb7
      5 0x9e
      5 0x15b
      5 0x13b
      4 0xdd
10:
      5 0xa4
      5 0x79
      5 0x29e
      5 0x22a
      4 0xf4
11:
      5 0x92
      5 0x4a4
      5 0x479
      5 0x47
      5 0x379
12:
      6 0xf1
      6 0x751
      5 0xf00
      5 0xe26
      5 0xc26
13:
      7 0xf1
      6 0x520
      6 0x1e06
      6 0x1c44
      6 0x11ac
14:
      7 0x3174
      7 0x2f12
      6 0x520
      6 0x3763
      6 0x3615
15:
      7 0xf4
      7 0x7d0b
      7 0x179
      6 0x807
      6 0x69c9
16:
      7 0xf516
      7 0xe659
      7 0xdf00
      7 0x5f5a
      7 0x5cc3
17:
      7 0xe659
      7 0xde5a
      7 0xcaf
      7 0x9cce
      7 0x7fe3
18:
      9 0x17f9b
      8 0x3852d
      7 0x992b
      7 0x5e38
      7 0x3d39e
19:
      8 0x5f22b
      8 0x47d66
      8 0x436ba
      8 0x182aa
      7 0xe095
20:
      8 0xf87dd
      8 0xf464d
      8 0xef51d
      8 0xc08d4
      8 0x91a96
21:
      9 0xa7a90
      9 0x1fc5ad
      8 0xfae2f
      8 0xd4123
      8 0xb5686
22:
      9 0x97b11
      9 0x3fcfa3
      8 0xffa9e
      8 0xfa437
      8 0xef7e6
23:
     10 0x3b3115
      9 0x5cb5ba
      9 0x568fbf
      9 0x54d790
      9 0x4c2393
24:
     10 0x86373a
     10 0x35794f
      9 0xf6d21e
      9 0xde5301
      9 0xb36b9d

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 1/14] rhashtable: Remove shift from bucket_table
  2015-03-18 10:08                                   ` 'tgraf@suug.ch'
@ 2015-03-18 10:12                                     ` Herbert Xu
  2015-03-18 10:26                                       ` David Laight
  2015-03-18 10:44                                       ` 'tgraf@suug.ch'
  0 siblings, 2 replies; 113+ messages in thread
From: Herbert Xu @ 2015-03-18 10:12 UTC (permalink / raw)
  To: 'tgraf@suug.ch'; +Cc: David Laight, David Miller, netdev, Eric Dumazet

On Wed, Mar 18, 2015 at 10:08:12AM +0000, 'tgraf@suug.ch' wrote:
>
> With this fixed up I can see what you mean. So if we are
> to only do a chain length based decision, the limit would
> have to grow together with the table size.

Actually we could just use a flat limit of 16 since our maximum table
size is 2^32 (actually a bit less) and 16 should be plenty for
that.

Remember this limit is only there to detect when somebody has
stolen our secret and is trying to DoS us.  So if we can keep them
under 16 per-bucket it should be sufficient to defeat any attack.

Of course I will also add a patch to limit the number of elements
to the table size (so maximum utilisation is 100%).  This will come
after we allow insertions to fail.
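
Purely as a sketch of the idea (standalone, made-up names, not the
actual rhashtable code):

#include <stdio.h>

#define CHAIN_LIMIT     16      /* flat cap; enough even for a 2^32 table */

/* Sketch only: what the insert path could do given the chain length at
 * the target bucket and the overall fill level.
 */
static const char *insert_decision(unsigned int chain_len,
                                   unsigned int nelems, unsigned int size)
{
        if (nelems >= size)             /* cap utilisation at 100% */
                return "fail the insertion";
        if (chain_len >= CHAIN_LIMIT)   /* likely under attack: rehash */
                return "insert, then schedule a rehash";
        return "insert";
}

int main(void)
{
        printf("%s\n", insert_decision(3, 100, 256));
        printf("%s\n", insert_decision(17, 100, 256));
        printf("%s\n", insert_decision(2, 256, 256));
        return 0;
}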

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 113+ messages in thread

* RE: [v1 PATCH 1/14] rhashtable: Remove shift from bucket_table
  2015-03-18 10:12                                     ` Herbert Xu
@ 2015-03-18 10:26                                       ` David Laight
  2015-03-18 10:44                                       ` 'tgraf@suug.ch'
  1 sibling, 0 replies; 113+ messages in thread
From: David Laight @ 2015-03-18 10:26 UTC (permalink / raw)
  To: 'Herbert Xu', 'tgraf@suug.ch'
  Cc: David Miller, netdev, Eric Dumazet

From: Herbert Xu
..
> Of course I will also add a patch to limit the number of elements
> to the table size (so maximum utilisation is 100%).  This will come
> after we allow insertions to fail.

There may be some uses where short hash chains are acceptable,
in which case a utilisation way above 100% may make sense.

If a table is likely to have repeated lookups for the same item
then it can make sense to cache the last looked up item for each chain.
This cached value can be written without any locking - the only
place care is needed is ensuring it doesn't reference an item
that has just been deleted.
(You'd want a separate array to not dirty the cache lines containing
the head pointers. Although for a large table they are misses anyway.)
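
Very roughly, and single-threaded only (the real thing would need RCU
care around deletions), the shape would be:

#include <stdio.h>
#include <stddef.h>

struct item {
        struct item *next;
        unsigned int key;
};

/* The cache lives in its own array so that hits do not dirty the cache
 * lines holding the bucket head pointers.
 */
struct table {
        struct item **buckets;
        struct item **cache;    /* last hit per bucket; best effort only */
        unsigned int mask;
};

static struct item *lookup(struct table *t, unsigned int key)
{
        unsigned int b = key & t->mask;
        struct item *it = t->cache[b];

        if (it && it->key == key)       /* cheap best-effort hit */
                return it;
        for (it = t->buckets[b]; it; it = it->next) {
                if (it->key == key) {
                        t->cache[b] = it;       /* unlocked hint write */
                        return it;
                }
        }
        return NULL;
}

int main(void)
{
        struct item a = { .next = NULL, .key = 5 };
        struct item *heads[2] = { NULL, &a };
        struct item *cache[2] = { NULL, NULL };
        struct table t = { .buckets = heads, .cache = cache, .mask = 1 };

        printf("first lookup: %s\n", lookup(&t, 5) ? "hit" : "miss");
        printf("cached lookup: %s\n", lookup(&t, 5) ? "hit" : "miss");
        return 0;
}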

	David

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 1/14] rhashtable: Remove shift from bucket_table
  2015-03-18 10:12                                     ` Herbert Xu
  2015-03-18 10:26                                       ` David Laight
@ 2015-03-18 10:44                                       ` 'tgraf@suug.ch'
  1 sibling, 0 replies; 113+ messages in thread
From: 'tgraf@suug.ch' @ 2015-03-18 10:44 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Laight, David Miller, netdev, Eric Dumazet

On 03/18/15 at 09:12pm, Herbert Xu wrote:
> On Wed, Mar 18, 2015 at 10:08:12AM +0000, 'tgraf@suug.ch' wrote:
> >
> > With this fixed up I can see what you mean. So if we are
> > to only do a chain length based decision, the limit would
> > have to grow together with the table size.
> 
> Actually we could just use a flat limit of 16 since our maximum table
> size is 2^32 (actually a bit less) and 16 should be plenty for
> that.
> 
> Remember this limit is only there to detect when somebody has
> stolen our secret and is trying to DoS us.  So if we can keep them
> under 16 per-bucket it should be sufficient to defeat any attack.

Agreed, that is certainly sufficient to avoid attacks. I didn't
want to give up on getting rid of the need for a table-wide
nelems *inside* rhashtable if we have to maintain chain lengths
anyway, but I can see the difficulties.

The only approach that seems to work is to count the number of
buckets with a chain length above a limit relative to the table
size, or something similar, which requires counting at the table
level - and if we have to do that we might as well just count
the number of elements in the table.

> Of course I will also add a patch to limit the number of elements
> to the table size (so maximum utilisation is 100%).  This will come
> after we allow insertions to fail.

I'm also attaching distribution of chain length for table sizes
for completion's sake.

Percentage of buckets with more than 1 entry:

1: 0%
2: 50%
3: 25%
4: 25%
5: 28%
6: 28%
7: 31%
8: 26%
9: 25%
10: 27%
11: 26%
12: 26%
13: 26%
14: 26%
15: 26%
16: 26%
17: 26%
18: 26%
19: 26%
20: 26%
21: 26%
22: 26%
23: 26%

Distribution of chain lengths:
1:
      1 2
2:
      2 2
3:
      1 2
      1 3
4:
      3 2
      1 3
5:
      6 2
      4 3
6:
     11 2
      4 3
      1 4
7:
     27 2
      8 3
      2 4
8:
     50 2
     15 3
      3 4
      1 5
9:
    101 2
     33 3
      6 4
      1 5
10:
    201 2
     61 3
     17 4
      2 5
11:
    361 2
    134 3
     29 4
      5 5
      1 6
12:
    784 2
    242 3
     49 4
     14 5
      3 6
13:
   1552 2
    472 3
    118 4
     28 5
      5 6
      1 7
14:
   2965 2
    996 3
    262 4
     51 5
      9 6
      1 7
      1 8
15:
   6010 2
   2029 3
    510 4
     90 5
     21 6
      2 7
16:
  11993 2
   4068 3
   1014 4
    175 5
     40 6
      5 7
      1 8
17:
  24123 2
   8100 3
   1969 4
    403 5
     79 6
     10 7
      3 8
18:
  48127 2
  16145 3
   3930 4
    825 5
    137 6
     25 7
      1 8
      1 9
19:
  96558 2
  32250 3
   7990 4
   1619 5
    264 6
     40 7
      7 8
20:
 193267 2
  64476 3
  16051 4
   3140 5
    544 6
     71 7
      9 8
      1 9
21:
 385293 2
 128850 3
  32366 4
   6344 5
   1017 6
    128 7
     21 8
      2 9
22:
 769535 2
 257779 3
  64534 4
  13052 5
   2177 6
    322 7
     25 8
      1 9
23:
1540606 2
 514743 3
 130192 4
  26207 5
   4403 6
    635 7
     73 8
     11 9
24:
3073982 2
1032727 3
 262044 4
  53475 5
   9179 6
   1353 7
    184 8
     14 9
      2 10
      1 11
25:
6124258 2
2071733 3
 534130 4
 111862 5
  19754 6
   3030 7
    419 8
     56 9
      5 10
      1 12
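
For what it's worth, that ~26% is roughly what an ideal uniform hash
gives at 100% load: under a Poisson approximation the fraction of
buckets holding 2 or more entries is 1 - 2/e, about 26.4%.  A quick
standalone check (illustration only, link with -lm):

#include <stdio.h>
#include <math.h>

int main(void)
{
        /* chain length of a bucket at load factor 1 is ~ Poisson(1) */
        double p0 = exp(-1.0);          /* P(bucket empty) */
        double p1 = exp(-1.0);          /* P(exactly one entry) */

        printf("buckets with >1 entry: %.1f%%\n", 100.0 * (1.0 - p0 - p1));
        return 0;
}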

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v2 PATCH 2/6] rhashtable: Introduce max_size/min_size
  2015-03-18  9:01         ` [v2 PATCH 2/6] rhashtable: Introduce max_size/min_size Herbert Xu
@ 2015-03-18 10:55           ` Thomas Graf
  2015-03-18 16:47             ` David Miller
  2015-03-18 16:51             ` David Laight
  0 siblings, 2 replies; 113+ messages in thread
From: Thomas Graf @ 2015-03-18 10:55 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, netdev, Eric Dumazet

On 03/18/15 at 08:01pm, Herbert Xu wrote:
> @@ -935,6 +938,7 @@ int rhashtable_init(struct rhashtable *ht, struct rhashtable_params *params)
>  
>  	params->min_shift = max_t(size_t, params->min_shift,
>  				  ilog2(HASH_MIN_SIZE));
> +	params->min_size = max(params->min_size, HASH_MIN_SIZE);
>  
>  	if (params->nelem_hint)
>  		size = rounded_hashtable_size(params);

The only change I would add on top is to ensure that min_size
and max_size are a power of two as otherwise the table size
used will end up being greater or smaller than specified.
I can do that as a follow-up though.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v2 PATCH 2/6] rhashtable: Introduce max_size/min_size
  2015-03-18 10:55           ` Thomas Graf
@ 2015-03-18 16:47             ` David Miller
  2015-03-18 16:51             ` David Laight
  1 sibling, 0 replies; 113+ messages in thread
From: David Miller @ 2015-03-18 16:47 UTC (permalink / raw)
  To: tgraf; +Cc: herbert, netdev, eric.dumazet

From: Thomas Graf <tgraf@suug.ch>
Date: Wed, 18 Mar 2015 10:55:28 +0000

> On 03/18/15 at 08:01pm, Herbert Xu wrote:
>> @@ -935,6 +938,7 @@ int rhashtable_init(struct rhashtable *ht, struct rhashtable_params *params)
>>  
>>  	params->min_shift = max_t(size_t, params->min_shift,
>>  				  ilog2(HASH_MIN_SIZE));
>> +	params->min_size = max(params->min_size, HASH_MIN_SIZE);
>>  
>>  	if (params->nelem_hint)
>>  		size = rounded_hashtable_size(params);
> 
> The only change I would add on top is to ensure that min_size
> and max_size are a power of two as otherwise the table size
> used will end up being greater or smaller than specified.
> I can do that as a follow-up though.

Feel free.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* RE: [v2 PATCH 2/6] rhashtable: Introduce max_size/min_size
  2015-03-18 10:55           ` Thomas Graf
  2015-03-18 16:47             ` David Miller
@ 2015-03-18 16:51             ` David Laight
  1 sibling, 0 replies; 113+ messages in thread
From: David Laight @ 2015-03-18 16:51 UTC (permalink / raw)
  To: 'Thomas Graf', Herbert Xu; +Cc: David Miller, netdev, Eric Dumazet

From: Thomas Graf
> On 03/18/15 at 08:01pm, Herbert Xu wrote:
> > @@ -935,6 +938,7 @@ int rhashtable_init(struct rhashtable *ht, struct rhashtable_params *params)
> >
> >  	params->min_shift = max_t(size_t, params->min_shift,
> >  				  ilog2(HASH_MIN_SIZE));
> > +	params->min_size = max(params->min_size, HASH_MIN_SIZE);
> >
> >  	if (params->nelem_hint)
> >  		size = rounded_hashtable_size(params);
> 
> The only change I would add on top is to ensure that min_size
> and max_size are a power of two as otherwise the table size
> used will end up being greater or smaller than specified.

I'd just make sure that 'something sensible' happens if they aren't.

You don't really want to error the table creation if some sysctl (etc)
that controls the sizes isn't a power of 2.

	David

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 7/14] netfilter: Use rhashtable_lookup instead of lookup_compare
  2015-03-16 11:13               ` Patrick McHardy
@ 2015-03-20  8:55                 ` Herbert Xu
  2015-03-20  9:22                   ` Patrick McHardy
  0 siblings, 1 reply; 113+ messages in thread
From: Herbert Xu @ 2015-03-20  8:55 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Thomas Graf, David Miller, netdev, Eric Dumazet

On Mon, Mar 16, 2015 at 11:13:46AM +0000, Patrick McHardy wrote:
>
> An upcoming patchset will add transaction support to sets, which needs
> the callback. So please leave it for now since it will only cause
> unnecessary churn.

Patrick, can you explain what you mean by the callback?

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 7/14] netfilter: Use rhashtable_lookup instead of lookup_compare
  2015-03-20  8:55                 ` Herbert Xu
@ 2015-03-20  9:22                   ` Patrick McHardy
  2015-03-20  9:27                     ` Herbert Xu
  0 siblings, 1 reply; 113+ messages in thread
From: Patrick McHardy @ 2015-03-20  9:22 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Thomas Graf, David Miller, netdev, Eric Dumazet

On 20.03, Herbert Xu wrote:
> On Mon, Mar 16, 2015 at 11:13:46AM +0000, Patrick McHardy wrote:
> >
> > An upcoming patchset will add transaction support to sets, which needs
> > the callback. So please leave it for know since it will only cause
> > unnecessary churn.
> 
> Patrick, can you explain what you mean by the callback?

I need the compare functions for transaction support to decide whether
an element is already activated or has been deactivated and, with a
further patch, for timeouts. Inactive and timed out elements are treated
as non-existent but, in the case of transactions, remain present in the hash
until the transaction is committed, and in the case of timeouts until GC.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 7/14] netfilter: Use rhashtable_lookup instead of lookup_compare
  2015-03-20  9:22                   ` Patrick McHardy
@ 2015-03-20  9:27                     ` Herbert Xu
  2015-03-20  9:59                       ` Patrick McHardy
  0 siblings, 1 reply; 113+ messages in thread
From: Herbert Xu @ 2015-03-20  9:27 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Thomas Graf, David Miller, netdev, Eric Dumazet

On Fri, Mar 20, 2015 at 09:22:17AM +0000, Patrick McHardy wrote:
>
> I need the compare functions for transaction support to decide whether
> an element is already activated or has been deactivated and, with a
> further patch, for timeouts. Inactive and timed out elements are treated
> as non-existent but are in case of transactions present in the hash
> until the transaction is committed, in case of timeouts until GC.

I still don't understand what is in the compare callback.  Can
you provide some code example?

More importantly, why is it that you can't look up and then just do
whatever you need to do in the caller of lookup? If you're planning
on having multiple objects in the hash table with the same key then
I'm afraid I'll have to say no because we want to use the chain
length to determine whether we're being attacked and having multiple
objects with the same key defeats that mechanism.

Of course many hash table users need to be able to keep multiple
objects under the same key.  My suggestion would be to allocate
your own linked list and have the linked list be the object that
is inserted into the hash table.
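
Roughly this shape, as a plain C illustration with made-up names (not
the actual nft_hash structures):

#include <stdio.h>
#include <stddef.h>

struct elem {
        struct elem *next;
        int active;             /* e.g. generation or timeout state */
};

/* One of these per unique key is what goes into the hash table; all
 * elements sharing the key hang off the list, so chain lengths stay
 * meaningful for attack detection.
 */
struct key_node {
        /* struct rhash_head node; would go here in the kernel version */
        unsigned int key;
        struct elem *elems;
};

static struct elem *first_active(const struct key_node *kn)
{
        struct elem *e;

        for (e = kn->elems; e; e = e->next)
                if (e->active)
                        return e;
        return NULL;
}

int main(void)
{
        struct elem stale = { .next = NULL, .active = 0 };
        struct elem fresh = { .next = &stale, .active = 1 };
        struct key_node kn = { .key = 42, .elems = &fresh };

        printf("active element %s\n", first_active(&kn) ? "found" : "missing");
        return 0;
}

The per-unique-key overhead is then essentially the list head rather
than extra state in every element.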

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 7/14] netfilter: Use rhashtable_lookup instead of lookup_compare
  2015-03-16  9:14             ` Herbert Xu
  2015-03-16  9:28               ` Thomas Graf
  2015-03-16 11:13               ` Patrick McHardy
@ 2015-03-20  9:36               ` Herbert Xu
  2015-03-20 10:02                 ` Patrick McHardy
  2 siblings, 1 reply; 113+ messages in thread
From: Herbert Xu @ 2015-03-20  9:36 UTC (permalink / raw)
  To: Thomas Graf; +Cc: David Miller, netdev, Eric Dumazet, kaber

On Mon, Mar 16, 2015 at 08:14:15PM +1100, Herbert Xu wrote:
> On Mon, Mar 16, 2015 at 08:28:42AM +0000, Thomas Graf wrote:
>
> > The reason for the indirection was to not bypass the abstraction
> > nft_data_cmp() provides.
> > 
> > No objection to the change but maybe leave a comment in
> > nft_data_cmp() that if one changes nft_data_cmp() one needs to
> > look at nft_hash and see if the direct use of rhashtable_lookup()
> > is still valid.
> 
> Well we could preserve nft_data_cmp if necessary.  It'll just
> mean an extra indirect call to do the compare when needed.

FWIW I've decided to ditch nft_data_cmp unless someone can show
me that it's really necessary.  The reason is that nft_hash_lookup
already uses memcmp instead of nft_data_cmp.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 7/14] netfilter: Use rhashtable_lookup instead of lookup_compare
  2015-03-20  9:27                     ` Herbert Xu
@ 2015-03-20  9:59                       ` Patrick McHardy
  2015-03-20 10:16                         ` Herbert Xu
  0 siblings, 1 reply; 113+ messages in thread
From: Patrick McHardy @ 2015-03-20  9:59 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Thomas Graf, David Miller, netdev, Eric Dumazet

On 20.03, Herbert Xu wrote:
> On Fri, Mar 20, 2015 at 09:22:17AM +0000, Patrick McHardy wrote:
> >
> > I need the compare functions for transaction support to decide whether
> > an element is already activated or has been deactivated and, with a
> > further patch, for timeouts. Inactive and timed out elements are treated
> > as non-existent but are in case of transactions present in the hash
> > until the transaction is committed, in case of timeouts until GC.
> 
> I still don't understand what is in the compare callback.  Can
> you provide some code example?

Sure, for lookup / insert:

static bool nft_hash_cmp(void *ptr, void *arg)
{
        struct nft_hash_cmp_arg *x = arg;
        struct nft_hash_elem *he = ptr;

        if (nft_data_cmp(nft_set_ext_key(&he->ext), x->key, x->set->klen))
                return false;
        if (nft_hash_elem_expired(he))
                return false;
        if (!nft_hash_is_active(he, x->genmask))
                return false;
        return true;
}

static bool nft_hash_lookup(const struct nft_set *set,
                            const struct nft_data *key,
                            const struct nft_set_ext **ext)
{
        struct nft_hash *priv = nft_set_priv(set);
        const struct nft_hash_elem *he;
        struct nft_hash_cmp_arg arg = {
                .genmask = nft_genmask_cur(set->net) << NFT_HASH_GEN_SHIFT,
                .set     = set,
                .key     = key,
        };

        he = rhashtable_lookup_compare(&priv->ht, key, &nft_hash_cmp, &arg);
        if (he != NULL)
                *ext = &he->ext;

        return !!he;
}

static int nft_hash_insert(const struct nft_set *set,
                           const struct nft_set_elem *elem)
{
        struct nft_hash *priv = nft_set_priv(set);
        struct nft_hash_elem *he = elem->priv;
        struct nft_hash_cmp_arg arg = {
                .genmask = nft_genmask_next(set->net) << NFT_HASH_GEN_SHIFT,
                .set     = set,
                .key     = &elem->key,
        };

        nft_hash_set_inactive(set, he);
        if (!rhashtable_lookup_compare_insert(&priv->ht, &he->node,
                                              nft_hash_cmp, &arg))
                return -EEXIST;
        return 0;
}

> More importantly, why is that you can't lookup and then just do
> whatever you need to do in the caller of lookup? If you're planning
> on having multiple objects in the hash table with the same key then
> I'm afraid I'll have to say no because we want to use the chain
> length to determine whether we're being attacked and having multiple
> objects with the same key defeats that mechanism.

It's exactly that: there are multiple similar keys in the hash. Timed
out and inactive elements are treated as non-existent. Timeout could
be handled differently, but for transactions there is no other way
to implement this in order to provide atomic transaction semantics.

Consider f.i. the sequence:

- delete element X
- add element X

If we fail, we want the first X to be still active, otherwise the second
X becomes active and the first one inactive atomically. For this to work,
both need to be present in the hash already. Transactions can cover
an arbitrary number of elements, so even if the case of a single similar
key could be handled otherwise, it's not possible for multiple keys.

Regarding the chain length as trigger - I'm sorry, but this doesn't work
for us. I don't see why you would have to look at chain length. That
implies that you don't trust your hash function - why not fix that
instead?

> Of course many hash table users need to be able to keep multiple
> objects under the same key.  My suggestion would be to allocate
> your own linked list and have the linked list be the object that
> is inserted into the hash table.

That would require a huge amount of extra memory per element and having
millions of them is not unrealistic for our use case.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 7/14] netfilter: Use rhashtable_lookup instead of lookup_compare
  2015-03-20  9:36               ` Herbert Xu
@ 2015-03-20 10:02                 ` Patrick McHardy
  0 siblings, 0 replies; 113+ messages in thread
From: Patrick McHardy @ 2015-03-20 10:02 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Thomas Graf, David Miller, netdev, Eric Dumazet

On 20.03, Herbert Xu wrote:
> On Mon, Mar 16, 2015 at 08:14:15PM +1100, Herbert Xu wrote:
> > On Mon, Mar 16, 2015 at 08:28:42AM +0000, Thomas Graf wrote:
> >
> > > The reason for the indirection was to not bypass the abstraction
> > > nft_data_cmp() provides.
> > > 
> > > No objection to the change but maybe leave a comment in
> > > nft_data_cmp() that if one changes nft_data_cmp() one needs to
> > > look at nft_hash and see if the direct use of rhashtable_lookup()
> > > is still valid.
> > 
> > Well we could preserve nft_data_cmp if necessary.  It'll just
> > mean an extra indirect call to do the compare when needed.
> 
> FWIW I've decided to ditch nft_data_cmp unless someone can show
> me that it's really necessary.  The reason is that nft_hash_lookup
> already uses memcmp instead of nft_data_cmp.

I don't care about nft_data_cmp, but I do care about being able to
override the compare function.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 7/14] netfilter: Use rhashtable_lookup instead of lookup_compare
  2015-03-20  9:59                       ` Patrick McHardy
@ 2015-03-20 10:16                         ` Herbert Xu
  2015-03-20 10:27                           ` Patrick McHardy
  0 siblings, 1 reply; 113+ messages in thread
From: Herbert Xu @ 2015-03-20 10:16 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Thomas Graf, David Miller, netdev, Eric Dumazet

On Fri, Mar 20, 2015 at 09:59:09AM +0000, Patrick McHardy wrote:
>
> Regarding the chain length as trigger - I'm sorry, but this doesn't work
> for us. I don't see why you would have to look at chain length. That
> implies that you don't trust your hash function - why not fix that
> instead?

Any hash function can be attacked.  That's why we need to be able
to rehash it.  And the best way to decide when to rehash is based
on chain length (otherwise you'd waste time rehashing periodically
like we used to do).  With name spaces these days anyone could be
an adversary.

Besides, putting multiple objects with the same key into a hash
table defeats the whole point of hashing.

> > Of course many hash table users need to be able to keep multiple
> > objects under the same key.  My suggestion would be to allocate
> > your own linked list and have the linked list be the object that
> > is inserted into the hash table.
> 
> That would require a huge amount of extra memory per element and having
> millions of them is not unrealistic for our use case.

You should be able to do it with just 8 extra bytes per unique
hash table key.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 7/14] netfilter: Use rhashtable_lookup instead of lookup_compare
  2015-03-20 10:16                         ` Herbert Xu
@ 2015-03-20 10:27                           ` Patrick McHardy
  2015-03-20 21:47                             ` Herbert Xu
  0 siblings, 1 reply; 113+ messages in thread
From: Patrick McHardy @ 2015-03-20 10:27 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Thomas Graf, David Miller, netdev, Eric Dumazet

On 20.03, Herbert Xu wrote:
> On Fri, Mar 20, 2015 at 09:59:09AM +0000, Patrick McHardy wrote:
> >
> > Regarding the chain length as trigger - I'm sorry, but this doesn't work
> > for us. I don't see why you would have to look at chain length. That
> > implies that you don't trust your hash function - why not fix that
> > instead?
> 
> Any hash function can be attacked.  That's why we need to be able
> to rehash it.  And the best way to decide when to rehash is based
> on chain length (otherwise you'd waste time rehashing periodically
> like we used to do).  With name spaces these days anyone could be
> an adversary.

We already had this discussion. I strongly do not believe this is
the right way to fix namespace problems. There are millions of ways
of creating CPU intensive workloads. You need to be able to put
bounds on the entire namespace. Fixing individual spots will not
solve that problem.

> Besides, putting multiple objects with the same key into a hash
> table defeats the whole point of hashing.

They exist only for (very) short periods of time. It's simply not a
problem in our case. We could even put hard bounds on them, meaning
an element will at most exist twice during that period.

> > > Of course many hash table users need to be able to keep multiple
> > > objects under the same key.  My suggestion would be to allocate
> > > your own linked list and have the linked list be the object that
> > > is inserted into the hash table.
> > 
> > That would require a huge amount of extra memory per element and having
> > millions of them is not unrealistic for our use case.
> 
> You should be able to do it with just 8 extra bytes per unique
> hash table key.

That's something like 25% more memory usage for us in common cases. We try
very hard to keep the active memory size small. I don't want to waste
that amount of memory just for the very short periods while transactions
are unconfirmed.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 7/14] netfilter: Use rhashtable_lookup instead of lookup_compare
  2015-03-20 10:27                           ` Patrick McHardy
@ 2015-03-20 21:47                             ` Herbert Xu
  2015-03-20 21:56                               ` Thomas Graf
  2015-03-21  5:23                               ` Patrick McHardy
  0 siblings, 2 replies; 113+ messages in thread
From: Herbert Xu @ 2015-03-20 21:47 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Thomas Graf, David Miller, netdev, Eric Dumazet

On Fri, Mar 20, 2015 at 10:27:01AM +0000, Patrick McHardy wrote:
> On 20.03, Herbert Xu wrote:
>
> > Any hash function can be attacked.  That's why we need to be able
> > to rehash it.  And the best way to decide when to rehash is based
> > on chain length (otherwise you'd waste time rehashing periodically
> > like we used to do).  With name spaces these days anyone could be
> > an adversary.
> 
> We already had this discussion. I strongly do not believe this is
> the right way to fix namespace problems. There are millions of ways
> of creating CPU intensive workloads. You need to be able to put
> bounds on the entire namespace. Fixing individual spots will not
> solve that problem.

A CPU intensive workload that can be rescheduled is completely
different from one that is running under spin lock with BH disabled.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 7/14] netfilter: Use rhashtable_lookup instead of lookup_compare
  2015-03-20 21:47                             ` Herbert Xu
@ 2015-03-20 21:56                               ` Thomas Graf
  2015-03-20 21:57                                 ` Herbert Xu
  2015-03-21  5:23                               ` Patrick McHardy
  1 sibling, 1 reply; 113+ messages in thread
From: Thomas Graf @ 2015-03-20 21:56 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Patrick McHardy, David Miller, netdev, Eric Dumazet

On 03/21/15 at 08:47am, Herbert Xu wrote:
> On Fri, Mar 20, 2015 at 10:27:01AM +0000, Patrick McHardy wrote:
> > On 20.03, Herbert Xu wrote:
> >
> > > Any hash function can be attacked.  That's why we need to be able
> > > to rehash it.  And the best way to decide when to rehash is based
> > > on chain length (otherwise you'd waste time rehashing periodically
> > > like we used to do).  With name spaces these days anyone could be
> > > an adversary.
> > 
> > We already had this discussion. I strongly do not believe this is
> > the right way to fix namespace problems. There are millions of ways
> > of creating CPU intensive workloads. You need to be able to put
> > bounds on the entire namespace. Fixing individual spots will not
> > solve that problem.
> 
> A CPU intensive workload that can be rescheduled is completely
> different from one that is running under spin lock with BH disabled.

Just make the chain length based growth function configurable
and nft_hash can disable it. nft_hash entries are not created by
unprivileged users so attacking the table is out of the question.
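
A sketch of the kind of knob that would mean on the rhashtable side;
the flag name below is hypothetical and would need a matching field
added to struct rhashtable_params, it is not something that exists in
this series:

#include <linux/rhashtable.h>
#include <linux/jhash.h>

struct my_set_elem {
	struct rhash_head	node;
	u32			key;
};

/* Hypothetical opt-out of chain-length-triggered growth for tables
 * whose insertions only ever come from privileged, trusted context. */
static struct rhashtable_params my_set_params = {
	.head_offset		= offsetof(struct my_set_elem, node),
	.key_offset		= offsetof(struct my_set_elem, key),
	.key_len		= sizeof(u32),
	.hashfn			= jhash,
	.disable_chain_growth	= true,	/* hypothetical field */
};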

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 7/14] netfilter: Use rhashtable_lookup instead of lookup_compare
  2015-03-20 21:56                               ` Thomas Graf
@ 2015-03-20 21:57                                 ` Herbert Xu
  2015-03-20 22:07                                   ` Thomas Graf
  0 siblings, 1 reply; 113+ messages in thread
From: Herbert Xu @ 2015-03-20 21:57 UTC (permalink / raw)
  To: Thomas Graf; +Cc: Patrick McHardy, David Miller, netdev, Eric Dumazet

On Fri, Mar 20, 2015 at 09:56:12PM +0000, Thomas Graf wrote:
> On 03/21/15 at 08:47am, Herbert Xu wrote:
> > On Fri, Mar 20, 2015 at 10:27:01AM +0000, Patrick McHardy wrote:
> > > On 20.03, Herbert Xu wrote:
> > >
> > > > Any hash function can be attacked.  That's why we need to be able
> > > > to rehash it.  And the best way to decide when to rehash is based
> > > > on chain length (otherwise you'd waste time rehashing periodically
> > > > like we used to do).  With name spaces these days anyone could be
> > > > an adversary.
> > > 
> > > We already had this discussion. I strongly do not believe this is
> > > the right way to fix namespace problems. There are millions of ways
> > > of creating CPU intensive workloads. You need to be able to put
> > > bounds on the entire namespace. Fixing individual spots will not
> > > solve that problem.
> > 
> > A CPU intensive workload that can be rescheduled is completely
> > different from one that is running under spin lock with BH disabled.
> 
> Just make the chain length based growth function configurable
> and nft_hash can disable it. nft_hash entries are not created by
> unprivileged users so attacking the table is out of the question.

Please read the quoted text; we're talking about potential attacks
on nft_hash.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 7/14] netfilter: Use rhashtable_lookup instead of lookup_compare
  2015-03-20 21:57                                 ` Herbert Xu
@ 2015-03-20 22:07                                   ` Thomas Graf
  2015-03-20 22:10                                     ` Herbert Xu
  0 siblings, 1 reply; 113+ messages in thread
From: Thomas Graf @ 2015-03-20 22:07 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Patrick McHardy, David Miller, netdev, Eric Dumazet

On 03/21/15 at 08:57am, Herbert Xu wrote:
> On Fri, Mar 20, 2015 at 09:56:12PM +0000, Thomas Graf wrote:
> > On 03/21/15 at 08:47am, Herbert Xu wrote:
> > > On Fri, Mar 20, 2015 at 10:27:01AM +0000, Patrick McHardy wrote:
> > > > We already had this discussion. I strongly do not believe this is
> > > > the right way to fix namespace problems. There are millions of ways
> > > > of creating CPU intensive workloads. You need to be able to put
> > > > bounds on the entire namespace. Fixing individual spots will not
> > > > solve that problem.
> > > 
> > > A CPU intensive workload that can be rescheduled is completely
> > > different from one that is running under spin lock with BH disabled.
> > 
> > Just make the chain length based growth function configurable
> > and nft_hash can disable it. nft_hash entries are not created by
> > unprivileged users so attacking the table is out of the question.
> 
> Please read the quoted text, we're talking about potential attacks
> on nft_hash.

Attack by whom? If I read the nft_set code correctly then the only
way to add to an nft_set is via nfnetlink which requires
CAP_NET_ADMIN. My understanding was that the chain length based
growth is to counter hash seed attacks.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 7/14] netfilter: Use rhashtable_lookup instead of lookup_compare
  2015-03-20 22:07                                   ` Thomas Graf
@ 2015-03-20 22:10                                     ` Herbert Xu
  2015-03-20 22:23                                       ` Thomas Graf
  0 siblings, 1 reply; 113+ messages in thread
From: Herbert Xu @ 2015-03-20 22:10 UTC (permalink / raw)
  To: Thomas Graf; +Cc: Patrick McHardy, David Miller, netdev, Eric Dumazet

On Fri, Mar 20, 2015 at 10:07:51PM +0000, Thomas Graf wrote:
>
> Attack by whom? If I read the nft_set code correctly then the only
> way to add to an nft_set is via nfnetlink which requires
> CAP_NET_ADMIN. My understanding was that the chain length based
> growth is to counter hash seed attacks.

You cannot trust root in a namespace.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 7/14] netfilter: Use rhashtable_lookup instead of lookup_compare
  2015-03-20 22:10                                     ` Herbert Xu
@ 2015-03-20 22:23                                       ` Thomas Graf
  2015-03-20 22:25                                         ` Herbert Xu
  0 siblings, 1 reply; 113+ messages in thread
From: Thomas Graf @ 2015-03-20 22:23 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Patrick McHardy, David Miller, netdev, Eric Dumazet

On 03/21/15 at 09:10am, Herbert Xu wrote:
> On Fri, Mar 20, 2015 at 10:07:51PM +0000, Thomas Graf wrote:
> >
> > Attack by whom? If I read the nft_set code correctly then the only
> > way to add to an nft_set is via nfnetlink which requires
> > CAP_NET_ADMIN. My understanding was that the chain length based
> > growth is to counter hash seed attacks.
> 
> You cannot trust root in a namespace.

He might as well just run for (;;) to burn cycles in his namespace.
If you give away virtualized local privileges you better be ready
to restrict the resources consumed. 

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 7/14] netfilter: Use rhashtable_lookup instead of lookup_compare
  2015-03-20 22:23                                       ` Thomas Graf
@ 2015-03-20 22:25                                         ` Herbert Xu
  2015-03-20 22:36                                           ` Thomas Graf
  0 siblings, 1 reply; 113+ messages in thread
From: Herbert Xu @ 2015-03-20 22:25 UTC (permalink / raw)
  To: Thomas Graf; +Cc: Patrick McHardy, David Miller, netdev, Eric Dumazet

On Fri, Mar 20, 2015 at 10:23:11PM +0000, Thomas Graf wrote:
> 
> He might as well just run for (;;) to burn cycles in his namespace.
> If you give away virtualized local privileges you better be ready
> to restrict the resources consumed. 

Please reread the first email that you replied to, let me quote:

	A CPU intensive workload that can be rescheduled is
	completely different from one that is running under spin
	lock with BH disabled.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 7/14] netfilter: Use rhashtable_lookup instead of lookup_compare
  2015-03-20 22:25                                         ` Herbert Xu
@ 2015-03-20 22:36                                           ` Thomas Graf
  2015-03-21  5:25                                             ` Patrick McHardy
  0 siblings, 1 reply; 113+ messages in thread
From: Thomas Graf @ 2015-03-20 22:36 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Patrick McHardy, David Miller, netdev, Eric Dumazet

On 03/21/15 at 09:25am, Herbert Xu wrote:
> On Fri, Mar 20, 2015 at 10:23:11PM +0000, Thomas Graf wrote:
> > 
> > He might as well just run for (;;) to burn cycles in his namespace.
> > If you give away virtualized local privileges you better be ready
> > to restrict the resources consumed. 
> 
> Please reread the first email that you replied to, let me quote:
> 
> 	A CPU intensive workload that can be rescheduled is
> 	completely different from one that is running under spin
> 	lock with BH disabled.

We have countless ways to create linear lists of things like classifiers,
qdiscs, multicast memberships, net_devices, fib_rules, etc. All taking
spin locks or write locks. Most of them with BH disabled. Some at
least use hash tables, most of them fixed size.

I don't want to downplay this but do you *really* want to run
untrusted workloads with CAP_NET_ADMIN privileges?

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 7/14] netfilter: Use rhashtable_lookup instead of lookup_compare
  2015-03-20 21:47                             ` Herbert Xu
  2015-03-20 21:56                               ` Thomas Graf
@ 2015-03-21  5:23                               ` Patrick McHardy
  1 sibling, 0 replies; 113+ messages in thread
From: Patrick McHardy @ 2015-03-21  5:23 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Thomas Graf, David Miller, netdev, Eric Dumazet

On 21.03, Herbert Xu wrote:
> On Fri, Mar 20, 2015 at 10:27:01AM +0000, Patrick McHardy wrote:
> > On 20.03, Herbert Xu wrote:
> >
> > > Any hash function can be attacked.  That's why we need to be able
> > > to rehash it.  And the best way to decide when to rehash is based
> > > on chain length (otherwise you'd waste time rehashing periodically
> > > like we used to do).  With name spaces these days anyone could be
> > > an adversary.
> > 
> > We already had this discussion. I strongly do not believe this is
> > the right way to fix namespace problems. There are millions of ways
> > of creating CPU intensive workloads. You need to be able to put
> > bounds on the entire namespace. Fixing individual spots will not
> > solve that problem.
> 
> A CPU intensive workload that can be rescheduled is completely
> different from one that is running under spin lock with BH disabled.

Sure, that's what I actually meant of course.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [v1 PATCH 7/14] netfilter: Use rhashtable_lookup instead of lookup_compare
  2015-03-20 22:36                                           ` Thomas Graf
@ 2015-03-21  5:25                                             ` Patrick McHardy
  0 siblings, 0 replies; 113+ messages in thread
From: Patrick McHardy @ 2015-03-21  5:25 UTC (permalink / raw)
  To: Thomas Graf; +Cc: Herbert Xu, David Miller, netdev, Eric Dumazet

On 20.03, Thomas Graf wrote:
> On 03/21/15 at 09:25am, Herbert Xu wrote:
> > On Fri, Mar 20, 2015 at 10:23:11PM +0000, Thomas Graf wrote:
> > > 
> > > He might as well just run for (;;) to burn cycles in his namespace.
> > > If you give away virtualized local privileges you better be ready
> > > to restrict the resources consumed. 
> > 
> > Please reread the first email that you replied to, let me quote:
> > 
> > 	A CPU intensive workload that can be rescheduled is
> > 	completely different from one that is running under spin
> > 	lock with BH disabled.
> 
> We have countless ways to create linear lists of things like classifiers,
> qdiscs, multicast memberships, net_devices, fib_rules, etc. All taking
> spin locks or write locks. Most of them with BH disabled. Some at
> least use hash tables, most of them fixed size.

That's my point. It's impossible to fix this by restricting data
structures; it just removes a valid use case.

> I don't want to downplay this but do you *really* want to run
> untrusted workloads with CAP_NET_ADMIN privileges?

^ permalink raw reply	[flat|nested] 113+ messages in thread

end of thread, other threads:[~2015-03-21  5:25 UTC | newest]

Thread overview: 113+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-13  9:56 [PATCH 0/6] rhashtable: Fixes + cleanups + preparation for multiple rehash Herbert Xu
2015-03-13  9:57 ` [PATCH 1/6] rhashtable: Fix walker behaviour during rehash Herbert Xu
2015-03-13 15:50   ` Thomas Graf
2015-03-13 23:42     ` Herbert Xu
2015-03-14  0:06       ` Thomas Graf
2015-03-13  9:57 ` [PATCH 2/6] rhashtable: Use SINGLE_DEPTH_NESTING Herbert Xu
2015-03-13 15:40   ` Thomas Graf
2015-03-13  9:57 ` [PATCH 3/6] rhashtable: Move seed init into bucket_table_alloc Herbert Xu
2015-03-13 10:03   ` Daniel Borkmann
2015-03-13 11:33   ` David Laight
2015-03-13 11:40     ` Herbert Xu
2015-03-13 15:40   ` Thomas Graf
2015-03-13  9:57 ` [PATCH 4/6] rhashtable: Free bucket tables asynchronously after rehash Herbert Xu
2015-03-13 15:42   ` Thomas Graf
2015-03-13  9:57 ` [PATCH 5/6] rhashtable: Add rehash counter to bucket_table Herbert Xu
2015-03-13 13:51   ` Thomas Graf
2015-03-14  2:49     ` Herbert Xu
2015-03-13  9:57 ` [PATCH 6/6] rhashtable: Move future_tbl into struct bucket_table Herbert Xu
2015-03-13 16:13   ` Thomas Graf
2015-03-13 13:57 ` [PATCH 0/6] rhashtable: Fixes + cleanups + preparation for multiple rehash Thomas Graf
2015-03-13 16:25 ` David Miller
2015-03-14  2:51   ` Herbert Xu
2015-03-14  2:53 ` [v2 PATCH " Herbert Xu
2015-03-14  2:57   ` [v2 PATCH 1/6] rhashtable: Fix walker behaviour during rehash Herbert Xu
2015-03-14  2:57   ` [v2 PATCH 2/6] rhashtable: Use SINGLE_DEPTH_NESTING Herbert Xu
2015-03-14  2:57   ` [v2 PATCH 3/6] rhashtable: Move seed init into bucket_table_alloc Herbert Xu
2015-03-14  2:57   ` [v2 PATCH 4/6] rhashtable: Free bucket tables asynchronously after rehash Herbert Xu
2015-03-14  2:57   ` [v2 PATCH 5/6] rhashtable: Add rehash counter to bucket_table Herbert Xu
2015-03-14  2:57   ` [v2 PATCH 6/6] rhashtable: Move future_tbl into struct bucket_table Herbert Xu
2015-03-15  5:36   ` [v2 PATCH 0/6] rhashtable: Fixes + cleanups + preparation for multiple rehash David Miller
2015-03-15 10:10     ` [v1 PATCH 0/6] rhashtable: Fix two bugs caused by multiple rehash preparation Herbert Xu
2015-03-15 10:12       ` [v1 PATCH 1/2] rhashtable: Fix use-after-free in rhashtable_walk_stop Herbert Xu
2015-03-15 10:12       ` [v1 PATCH 2/2] rhashtable: Fix rhashtable_remove failures Herbert Xu
2015-03-15 10:43       ` [v1 PATCH 0/14] rhashtable: Kill shift/Key netlink namespace/Merge jhash Herbert Xu
2015-03-15 10:44         ` [v1 PATCH 1/14] rhashtable: Remove shift from bucket_table Herbert Xu
2015-03-17 10:51           ` David Laight
2015-03-17 10:56             ` tgraf
2015-03-17 11:00               ` Herbert Xu
2015-03-17 11:22                 ` tgraf
2015-03-17 11:27                   ` Herbert Xu
2015-03-17 11:57                     ` tgraf
2015-03-17 12:13                       ` David Laight
2015-03-17 12:18                         ` 'tgraf@suug.ch'
2015-03-17 12:20                         ` Herbert Xu
2015-03-17 12:40                           ` 'tgraf@suug.ch'
2015-03-17 13:06                             ` David Laight
2015-03-17 21:56                             ` Herbert Xu
2015-03-18  9:51                               ` 'tgraf@suug.ch'
2015-03-18  9:55                                 ` Herbert Xu
2015-03-18 10:08                                   ` 'tgraf@suug.ch'
2015-03-18 10:12                                     ` Herbert Xu
2015-03-18 10:26                                       ` David Laight
2015-03-18 10:44                                       ` 'tgraf@suug.ch'
2015-03-17 11:22                 ` David Laight
2015-03-17 11:25                   ` Herbert Xu
2015-03-15 10:44         ` [v1 PATCH 2/14] rhashtable: Introduce max_size/min_size Herbert Xu
2015-03-15 15:12           ` Sergei Shtylyov
2015-03-15 20:21             ` Herbert Xu
2015-03-15 10:44         ` [v1 PATCH 3/14] netlink: Use rhashtable max_size instead of max_shift Herbert Xu
2015-03-15 10:44         ` [v1 PATCH 4/14] tipc: " Herbert Xu
2015-03-15 15:13           ` Sergei Shtylyov
2015-03-15 10:44         ` [v1 PATCH 5/14] test_rhashtable: " Herbert Xu
2015-03-16  3:50           ` David Miller
2015-03-15 10:44         ` [v1 PATCH 6/14] rhashtable: Remove max_shift and min_shift Herbert Xu
2015-03-15 10:44         ` [v1 PATCH 7/14] netfilter: Use rhashtable_lookup instead of lookup_compare Herbert Xu
2015-03-16  8:28           ` Thomas Graf
2015-03-16  9:14             ` Herbert Xu
2015-03-16  9:28               ` Thomas Graf
2015-03-16 11:13               ` Patrick McHardy
2015-03-20  8:55                 ` Herbert Xu
2015-03-20  9:22                   ` Patrick McHardy
2015-03-20  9:27                     ` Herbert Xu
2015-03-20  9:59                       ` Patrick McHardy
2015-03-20 10:16                         ` Herbert Xu
2015-03-20 10:27                           ` Patrick McHardy
2015-03-20 21:47                             ` Herbert Xu
2015-03-20 21:56                               ` Thomas Graf
2015-03-20 21:57                                 ` Herbert Xu
2015-03-20 22:07                                   ` Thomas Graf
2015-03-20 22:10                                     ` Herbert Xu
2015-03-20 22:23                                       ` Thomas Graf
2015-03-20 22:25                                         ` Herbert Xu
2015-03-20 22:36                                           ` Thomas Graf
2015-03-21  5:25                                             ` Patrick McHardy
2015-03-21  5:23                               ` Patrick McHardy
2015-03-20  9:36               ` Herbert Xu
2015-03-20 10:02                 ` Patrick McHardy
2015-03-15 10:44         ` [v1 PATCH 8/14] rhashtable: Fix support of objects with no accessible keys Herbert Xu
2015-03-15 10:44         ` [v1 PATCH 9/14] netlink: Move namespace into hash key Herbert Xu
2015-03-15 10:44         ` [v1 PATCH 10/14] rhashtable: Rip out obsolete compare interface Herbert Xu
2015-03-16  9:35           ` Thomas Graf
2015-03-15 10:44         ` [v1 PATCH 11/14] rhashtable: Allow hashfn to be unset Herbert Xu
2015-03-15 10:44         ` [v1 PATCH 12/14] netlink: Use default rhashtable hashfn Herbert Xu
2015-03-15 10:44         ` [v1 PATCH 13/14] tipc: " Herbert Xu
2015-03-15 10:44         ` [v1 PATCH 14/14] netfilter: " Herbert Xu
2015-03-16  4:01         ` [v1 PATCH 0/14] rhashtable: Kill shift/Key netlink namespace/Merge jhash David Miller
2015-03-16  4:18           ` Herbert Xu
2015-03-16  4:30             ` David Miller
2015-03-16  4:33               ` Herbert Xu
2015-03-16  4:40                 ` David Miller
2015-03-16 11:26                   ` Herbert Xu
2015-03-16 20:25                     ` David Miller
2015-03-18  9:01         ` [v2 PATCH 1/6] rhashtable: Remove shift from bucket_table Herbert Xu
2015-03-18  9:01         ` [v2 PATCH 2/6] rhashtable: Introduce max_size/min_size Herbert Xu
2015-03-18 10:55           ` Thomas Graf
2015-03-18 16:47             ` David Miller
2015-03-18 16:51             ` David Laight
2015-03-18  9:01         ` [v2 PATCH 3/6] netlink: Use rhashtable max_size instead of max_shift Herbert Xu
2015-03-18  9:01         ` [v2 PATCH 4/6] tipc: Use rhashtable max/min_size instead of max/min_shift Herbert Xu
2015-03-18  9:01         ` [v2 PATCH 5/6] test_rhashtable: Use rhashtable max_size instead of max_shift Herbert Xu
2015-03-18  9:01         ` [v2 PATCH 6/6] rhashtable: Remove max_shift and min_shift Herbert Xu
2015-03-15 10:43       ` [v1 PATCH 0/6] rhashtable: Fix two bugs caused by multiple rehash preparation Herbert Xu
2015-03-16  2:23       ` David Miller
