All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/5 RFC net-next] rhashtable: Parallel atomic insert/deletion & deferred expansion
@ 2014-09-15 12:17 Thomas Graf
  2014-09-15 12:18 ` [PATCH 1/5] rhashtable: Remove gfp_flags from insert and remove functions Thomas Graf
                   ` (4 more replies)
  0 siblings, 5 replies; 13+ messages in thread
From: Thomas Graf @ 2014-09-15 12:17 UTC (permalink / raw)
  To: davem, eric.dumazet, paulmck, john.r.fastabend, kaber
  Cc: netdev, linux-kernel

This is still work in progress but I wanted to share this as soon as the
direction where this is heading to becomes clear.

 * Patch 1: Remove gfp_t awareness
 * Patch 2: Enhance selftest to catch traversal miscount
 * Patch 3: Nulls marker convertion
 * Patch 4: Add spin_lock_bh_nested()
 * Patch 5: Per bucket spinlock, deferred expansion/shrinking

Paul, Eric, John, Dave: Let me know how far off this is from what you had
in mind.

Opens: 
 * I'm in particular interested in the feasibility of the side effect
   regarding recent insertion visibility in bucket traversals (see below).
   I think current users of rhashtable can live with it but it might be
   worth to offer a synchronous expansion alternative at some point.

   If anybody has a better alternative I'm definitely interested to hear
   about it.

 * Convert inet hashtable over to this and see if it is missing anything.

 * Own worker thread as we might eventually sleep for multiple grace
   periods?

 * Yes, some of the rcu_assign_pointer() can probably be replaced with
   RCU_INIT_POINTER().

Converts the single linked list to us a nulls marker. Reserves 4 bits to
distinguish list membership. Stores a full 27 bit hash as the default
marker. This allows the marker to stay valid even if the stable expands
or shrinks.

Introduces an array of spinlocks to protect bucket mutations. The number
of spinlocks per CPU is configurable and selected based on the hash of
the bucket. This allows for parallel insertions and removals of entries
which do not share a lock.

The patch also defers expansion and shrinking to a worker queue which
allow insertion and removal from atomic context. Insertions and
deletions may occur in parallel to it and are only held up briefly
while the particular bucket is linked or unzipped.

Mutations of the bucket table pointer is protected by a new mutex taken
during expansion and shrinking, read access to that pointer for insertion
and deletion is RCU protected.

In the event of an expansion or shrinking, the new bucket table allocated
is exposed as a so called future table right away. Lookups, deletions, and
insertions will briefly use both tables. The future table becomes the main
table after an RCU grace period and initial relinking was performed which
guarantees that no new insertions occur on the old table and all entries
can be found.

The side effect of this is that during that RCU grace period, a bucket
traversal using any rht_for_each() variant on the main table will not see
any insertions performed during the RCU grace period which would at that
point land in the future table. The lookup will see them as it searches
both tables if needed.

Thomas Graf (5):
  rhashtable: Remove gfp_flags from insert and remove functions
  rhashtable: Check for count misatch in selftest
  rhashtable: Convert to nulls list
  spinlock: Add spin_lock_bh_nested()
  rhashtable: Per bucket locks & expansion/shrinking in work queue

 include/linux/list_nulls.h       |   3 +-
 include/linux/rhashtable.h       | 241 +++++++++++------
 include/linux/spinlock.h         |   8 +
 include/linux/spinlock_api_smp.h |   2 +
 include/linux/spinlock_api_up.h  |   1 +
 kernel/locking/spinlock.c        |   8 +
 lib/rhashtable.c                 | 539 ++++++++++++++++++++++++++-------------
 net/netfilter/nft_hash.c         |  69 ++---
 net/netlink/af_netlink.c         |  18 +-
 net/netlink/diag.c               |   4 +-
 10 files changed, 604 insertions(+), 289 deletions(-)

-- 
1.9.3


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/5] rhashtable: Remove gfp_flags from insert and remove functions
  2014-09-15 12:17 [PATCH 0/5 RFC net-next] rhashtable: Parallel atomic insert/deletion & deferred expansion Thomas Graf
@ 2014-09-15 12:18 ` Thomas Graf
  2014-09-15 12:35   ` Eric Dumazet
  2014-09-15 12:18 ` [PATCH 2/5] rhashtable: Check for count misatch in selftest Thomas Graf
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 13+ messages in thread
From: Thomas Graf @ 2014-09-15 12:18 UTC (permalink / raw)
  To: davem, eric.dumazet, paulmck, john.r.fastabend, kaber
  Cc: netdev, linux-kernel

As the expansion/shrinking is moved to a worker thread, no allocations
will be performed anymore.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
 include/linux/rhashtable.h | 10 +++++-----
 lib/rhashtable.c           | 41 +++++++++++++++++------------------------
 net/netfilter/nft_hash.c   |  4 ++--
 net/netlink/af_netlink.c   |  4 ++--
 4 files changed, 26 insertions(+), 33 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index fb298e9d..942fa44 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -96,16 +96,16 @@ int rhashtable_init(struct rhashtable *ht, struct rhashtable_params *params);
 u32 rhashtable_hashfn(const struct rhashtable *ht, const void *key, u32 len);
 u32 rhashtable_obj_hashfn(const struct rhashtable *ht, void *ptr);
 
-void rhashtable_insert(struct rhashtable *ht, struct rhash_head *node, gfp_t);
-bool rhashtable_remove(struct rhashtable *ht, struct rhash_head *node, gfp_t);
+void rhashtable_insert(struct rhashtable *ht, struct rhash_head *node);
+bool rhashtable_remove(struct rhashtable *ht, struct rhash_head *node);
 void rhashtable_remove_pprev(struct rhashtable *ht, struct rhash_head *obj,
-			     struct rhash_head __rcu **pprev, gfp_t flags);
+			     struct rhash_head __rcu **pprev);
 
 bool rht_grow_above_75(const struct rhashtable *ht, size_t new_size);
 bool rht_shrink_below_30(const struct rhashtable *ht, size_t new_size);
 
-int rhashtable_expand(struct rhashtable *ht, gfp_t flags);
-int rhashtable_shrink(struct rhashtable *ht, gfp_t flags);
+int rhashtable_expand(struct rhashtable *ht);
+int rhashtable_shrink(struct rhashtable *ht);
 
 void *rhashtable_lookup(const struct rhashtable *ht, const void *key);
 void *rhashtable_lookup_compare(const struct rhashtable *ht, u32 hash,
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 8dfec3f..c133d82 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -108,13 +108,13 @@ static u32 head_hashfn(const struct rhashtable *ht,
 	return obj_hashfn(ht, rht_obj(ht, he), hsize);
 }
 
-static struct bucket_table *bucket_table_alloc(size_t nbuckets, gfp_t flags)
+static struct bucket_table *bucket_table_alloc(size_t nbuckets)
 {
 	struct bucket_table *tbl;
 	size_t size;
 
 	size = sizeof(*tbl) + nbuckets * sizeof(tbl->buckets[0]);
-	tbl = kzalloc(size, flags);
+	tbl = kzalloc(size, GFP_KERNEL);
 	if (tbl == NULL)
 		tbl = vzalloc(size);
 
@@ -201,7 +201,6 @@ static void hashtable_chain_unzip(const struct rhashtable *ht,
 /**
  * rhashtable_expand - Expand hash table while allowing concurrent lookups
  * @ht:		the hash table to expand
- * @flags:	allocation flags
  *
  * A secondary bucket array is allocated and the hash entries are migrated
  * while keeping them on both lists until the end of the RCU grace period.
@@ -212,7 +211,7 @@ static void hashtable_chain_unzip(const struct rhashtable *ht,
  * The caller must ensure that no concurrent table mutations take place.
  * It is however valid to have concurrent lookups if they are RCU protected.
  */
-int rhashtable_expand(struct rhashtable *ht, gfp_t flags)
+int rhashtable_expand(struct rhashtable *ht)
 {
 	struct bucket_table *new_tbl, *old_tbl = rht_dereference(ht->tbl, ht);
 	struct rhash_head *he;
@@ -224,7 +223,7 @@ int rhashtable_expand(struct rhashtable *ht, gfp_t flags)
 	if (ht->p.max_shift && ht->shift >= ht->p.max_shift)
 		return 0;
 
-	new_tbl = bucket_table_alloc(old_tbl->size * 2, flags);
+	new_tbl = bucket_table_alloc(old_tbl->size * 2);
 	if (new_tbl == NULL)
 		return -ENOMEM;
 
@@ -282,7 +281,6 @@ EXPORT_SYMBOL_GPL(rhashtable_expand);
 /**
  * rhashtable_shrink - Shrink hash table while allowing concurrent lookups
  * @ht:		the hash table to shrink
- * @flags:	allocation flags
  *
  * This function may only be called in a context where it is safe to call
  * synchronize_rcu(), e.g. not within a rcu_read_lock() section.
@@ -290,7 +288,7 @@ EXPORT_SYMBOL_GPL(rhashtable_expand);
  * The caller must ensure that no concurrent table mutations take place.
  * It is however valid to have concurrent lookups if they are RCU protected.
  */
-int rhashtable_shrink(struct rhashtable *ht, gfp_t flags)
+int rhashtable_shrink(struct rhashtable *ht)
 {
 	struct bucket_table *ntbl, *tbl = rht_dereference(ht->tbl, ht);
 	struct rhash_head __rcu **pprev;
@@ -301,7 +299,7 @@ int rhashtable_shrink(struct rhashtable *ht, gfp_t flags)
 	if (ht->shift <= ht->p.min_shift)
 		return 0;
 
-	ntbl = bucket_table_alloc(tbl->size / 2, flags);
+	ntbl = bucket_table_alloc(tbl->size / 2);
 	if (ntbl == NULL)
 		return -ENOMEM;
 
@@ -342,7 +340,6 @@ EXPORT_SYMBOL_GPL(rhashtable_shrink);
  * rhashtable_insert - insert object into hash hash table
  * @ht:		hash table
  * @obj:	pointer to hash head inside object
- * @flags:	allocation flags (table expansion)
  *
  * Will automatically grow the table via rhashtable_expand() if the the
  * grow_decision function specified at rhashtable_init() returns true.
@@ -350,8 +347,7 @@ EXPORT_SYMBOL_GPL(rhashtable_shrink);
  * The caller must ensure that no concurrent table mutations occur. It is
  * however valid to have concurrent lookups if they are RCU protected.
  */
-void rhashtable_insert(struct rhashtable *ht, struct rhash_head *obj,
-		       gfp_t flags)
+void rhashtable_insert(struct rhashtable *ht, struct rhash_head *obj)
 {
 	struct bucket_table *tbl = rht_dereference(ht->tbl, ht);
 	u32 hash;
@@ -364,7 +360,7 @@ void rhashtable_insert(struct rhashtable *ht, struct rhash_head *obj,
 	ht->nelems++;
 
 	if (ht->p.grow_decision && ht->p.grow_decision(ht, tbl->size))
-		rhashtable_expand(ht, flags);
+		rhashtable_expand(ht);
 }
 EXPORT_SYMBOL_GPL(rhashtable_insert);
 
@@ -373,14 +369,13 @@ EXPORT_SYMBOL_GPL(rhashtable_insert);
  * @ht:		hash table
  * @obj:	pointer to hash head inside object
  * @pprev:	pointer to previous element
- * @flags:	allocation flags (table expansion)
  *
  * Identical to rhashtable_remove() but caller is alreayd aware of the element
  * in front of the element to be deleted. This is in particular useful for
  * deletion when combined with walking or lookup.
  */
 void rhashtable_remove_pprev(struct rhashtable *ht, struct rhash_head *obj,
-			     struct rhash_head __rcu **pprev, gfp_t flags)
+			     struct rhash_head __rcu **pprev)
 {
 	struct bucket_table *tbl = rht_dereference(ht->tbl, ht);
 
@@ -391,7 +386,7 @@ void rhashtable_remove_pprev(struct rhashtable *ht, struct rhash_head *obj,
 
 	if (ht->p.shrink_decision &&
 	    ht->p.shrink_decision(ht, tbl->size))
-		rhashtable_shrink(ht, flags);
+		rhashtable_shrink(ht);
 }
 EXPORT_SYMBOL_GPL(rhashtable_remove_pprev);
 
@@ -399,7 +394,6 @@ EXPORT_SYMBOL_GPL(rhashtable_remove_pprev);
  * rhashtable_remove - remove object from hash table
  * @ht:		hash table
  * @obj:	pointer to hash head inside object
- * @flags:	allocation flags (table expansion)
  *
  * Since the hash chain is single linked, the removal operation needs to
  * walk the bucket chain upon removal. The removal operation is thus
@@ -411,8 +405,7 @@ EXPORT_SYMBOL_GPL(rhashtable_remove_pprev);
  * The caller must ensure that no concurrent table mutations occur. It is
  * however valid to have concurrent lookups if they are RCU protected.
  */
-bool rhashtable_remove(struct rhashtable *ht, struct rhash_head *obj,
-		       gfp_t flags)
+bool rhashtable_remove(struct rhashtable *ht, struct rhash_head *obj)
 {
 	struct bucket_table *tbl = rht_dereference(ht->tbl, ht);
 	struct rhash_head __rcu **pprev;
@@ -430,7 +423,7 @@ bool rhashtable_remove(struct rhashtable *ht, struct rhash_head *obj,
 			continue;
 		}
 
-		rhashtable_remove_pprev(ht, he, pprev, flags);
+		rhashtable_remove_pprev(ht, he, pprev);
 		return true;
 	}
 
@@ -573,7 +566,7 @@ int rhashtable_init(struct rhashtable *ht, struct rhashtable_params *params)
 	if (params->nelem_hint)
 		size = rounded_hashtable_size(params);
 
-	tbl = bucket_table_alloc(size, GFP_KERNEL);
+	tbl = bucket_table_alloc(size);
 	if (tbl == NULL)
 		return -ENOMEM;
 
@@ -708,7 +701,7 @@ static int __init test_rhashtable(struct rhashtable *ht)
 		obj->ptr = TEST_PTR;
 		obj->value = i * 2;
 
-		rhashtable_insert(ht, &obj->node, GFP_KERNEL);
+		rhashtable_insert(ht, &obj->node);
 	}
 
 	rcu_read_lock();
@@ -719,7 +712,7 @@ static int __init test_rhashtable(struct rhashtable *ht)
 
 	for (i = 0; i < TEST_NEXPANDS; i++) {
 		pr_info("  Table expansion iteration %u...\n", i);
-		rhashtable_expand(ht, GFP_KERNEL);
+		rhashtable_expand(ht);
 
 		rcu_read_lock();
 		pr_info("  Verifying lookups...\n");
@@ -729,7 +722,7 @@ static int __init test_rhashtable(struct rhashtable *ht)
 
 	for (i = 0; i < TEST_NEXPANDS; i++) {
 		pr_info("  Table shrinkage iteration %u...\n", i);
-		rhashtable_shrink(ht, GFP_KERNEL);
+		rhashtable_shrink(ht);
 
 		rcu_read_lock();
 		pr_info("  Verifying lookups...\n");
@@ -744,7 +737,7 @@ static int __init test_rhashtable(struct rhashtable *ht)
 		obj = rhashtable_lookup(ht, &key);
 		BUG_ON(!obj);
 
-		rhashtable_remove(ht, &obj->node, GFP_KERNEL);
+		rhashtable_remove(ht, &obj->node);
 		kfree(obj);
 	}
 
diff --git a/net/netfilter/nft_hash.c b/net/netfilter/nft_hash.c
index 28fb8f3..b52873c 100644
--- a/net/netfilter/nft_hash.c
+++ b/net/netfilter/nft_hash.c
@@ -65,7 +65,7 @@ static int nft_hash_insert(const struct nft_set *set,
 	if (set->flags & NFT_SET_MAP)
 		nft_data_copy(he->data, &elem->data);
 
-	rhashtable_insert(priv, &he->node, GFP_KERNEL);
+	rhashtable_insert(priv, &he->node);
 
 	return 0;
 }
@@ -88,7 +88,7 @@ static void nft_hash_remove(const struct nft_set *set,
 	pprev = elem->cookie;
 	he = rht_dereference((*pprev), priv);
 
-	rhashtable_remove_pprev(priv, he, pprev, GFP_KERNEL);
+	rhashtable_remove_pprev(priv, he, pprev);
 
 	synchronize_rcu();
 	kfree(he);
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index c416725..a1e6104 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1082,7 +1082,7 @@ static int netlink_insert(struct sock *sk, struct net *net, u32 portid)
 
 	nlk_sk(sk)->portid = portid;
 	sock_hold(sk);
-	rhashtable_insert(&table->hash, &nlk_sk(sk)->node, GFP_KERNEL);
+	rhashtable_insert(&table->hash, &nlk_sk(sk)->node);
 	err = 0;
 err:
 	mutex_unlock(&nl_sk_hash_lock);
@@ -1095,7 +1095,7 @@ static void netlink_remove(struct sock *sk)
 
 	mutex_lock(&nl_sk_hash_lock);
 	table = &nl_table[sk->sk_protocol];
-	if (rhashtable_remove(&table->hash, &nlk_sk(sk)->node, GFP_KERNEL)) {
+	if (rhashtable_remove(&table->hash, &nlk_sk(sk)->node)) {
 		WARN_ON(atomic_read(&sk->sk_refcnt) == 1);
 		__sock_put(sk);
 	}
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 2/5] rhashtable: Check for count misatch in selftest
  2014-09-15 12:17 [PATCH 0/5 RFC net-next] rhashtable: Parallel atomic insert/deletion & deferred expansion Thomas Graf
  2014-09-15 12:18 ` [PATCH 1/5] rhashtable: Remove gfp_flags from insert and remove functions Thomas Graf
@ 2014-09-15 12:18 ` Thomas Graf
  2014-09-15 12:18 ` [PATCH 3/5] rhashtable: Convert to nulls list Thomas Graf
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 13+ messages in thread
From: Thomas Graf @ 2014-09-15 12:18 UTC (permalink / raw)
  To: davem, eric.dumazet, paulmck, john.r.fastabend, kaber
  Cc: netdev, linux-kernel

Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
 lib/rhashtable.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index c133d82..c10df45 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -648,15 +648,14 @@ static int __init test_rht_lookup(struct rhashtable *ht)
 	return 0;
 }
 
-static void test_bucket_stats(struct rhashtable *ht,
-				     struct bucket_table *tbl,
-				     bool quiet)
+static void test_bucket_stats(struct rhashtable *ht, struct bucket_table *tbl,
+			      bool quiet)
 {
-	unsigned int cnt, i, total = 0;
+	unsigned int cnt, rcu_cnt, i, total = 0;
 	struct test_obj *obj;
 
 	for (i = 0; i < tbl->size; i++) {
-		cnt = 0;
+		rcu_cnt = cnt = 0;
 
 		if (!quiet)
 			pr_info(" [%#4x/%zu]", i, tbl->size);
@@ -668,6 +667,13 @@ static void test_bucket_stats(struct rhashtable *ht,
 				pr_cont(" [%p],", obj);
 		}
 
+		rht_for_each_rcu(pos, tbl, i)
+			rcu_cnt++;
+
+		if (rcu_cnt != cnt)
+			pr_warn("Test failed: Chain count mismach %d != %d",
+				cnt, rcu_cnt);
+
 		if (!quiet)
 			pr_cont("\n  [%#x] first element: %p, chain length: %u\n",
 				i, tbl->buckets[i], cnt);
@@ -675,6 +681,9 @@ static void test_bucket_stats(struct rhashtable *ht,
 
 	pr_info("  Traversal complete: counted=%u, nelems=%zu, entries=%d\n",
 		total, ht->nelems, TEST_ENTRIES);
+
+	if (total != ht->nelems || total != TEST_ENTRIES)
+		pr_warn("Test failed: Total count mismatch ^^^");
 }
 
 static int __init test_rhashtable(struct rhashtable *ht)
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 3/5] rhashtable: Convert to nulls list
  2014-09-15 12:17 [PATCH 0/5 RFC net-next] rhashtable: Parallel atomic insert/deletion & deferred expansion Thomas Graf
  2014-09-15 12:18 ` [PATCH 1/5] rhashtable: Remove gfp_flags from insert and remove functions Thomas Graf
  2014-09-15 12:18 ` [PATCH 2/5] rhashtable: Check for count misatch in selftest Thomas Graf
@ 2014-09-15 12:18 ` Thomas Graf
  2014-09-15 12:18 ` [PATCH 4/5] spinlock: Add spin_lock_bh_nested() Thomas Graf
  2014-09-15 12:18 ` [PATCH 5/5] rhashtable: Per bucket locks & expansion/shrinking in work queue Thomas Graf
  4 siblings, 0 replies; 13+ messages in thread
From: Thomas Graf @ 2014-09-15 12:18 UTC (permalink / raw)
  To: davem, eric.dumazet, paulmck, john.r.fastabend, kaber
  Cc: netdev, linux-kernel

In order to allow wider usage of rhashtable, use a special nulls marker
to terminate each chain. The reason for not using the existing
nulls_list is that the pprev pointer usage would not be valid as entries
can be linked in two different buckets at the same time.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
 include/linux/list_nulls.h |   3 +-
 include/linux/rhashtable.h | 195 +++++++++++++++++++++++++++++++--------------
 lib/rhashtable.c           | 158 ++++++++++++++++++++++--------------
 net/netfilter/nft_hash.c   |  12 ++-
 net/netlink/af_netlink.c   |   9 ++-
 net/netlink/diag.c         |   4 +-
 6 files changed, 248 insertions(+), 133 deletions(-)

diff --git a/include/linux/list_nulls.h b/include/linux/list_nulls.h
index 5d10ae36..e8c300e 100644
--- a/include/linux/list_nulls.h
+++ b/include/linux/list_nulls.h
@@ -21,8 +21,9 @@ struct hlist_nulls_head {
 struct hlist_nulls_node {
 	struct hlist_nulls_node *next, **pprev;
 };
+#define NULLS_MARKER(value) (1UL | (((long)value) << 1))
 #define INIT_HLIST_NULLS_HEAD(ptr, nulls) \
-	((ptr)->first = (struct hlist_nulls_node *) (1UL | (((long)nulls) << 1)))
+	((ptr)->first = (struct hlist_nulls_node *) NULLS_MARKER(nulls))
 
 #define hlist_nulls_entry(ptr, type, member) container_of(ptr,type,member)
 /**
diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 942fa44..e9cdbda 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -18,14 +18,12 @@
 #ifndef _LINUX_RHASHTABLE_H
 #define _LINUX_RHASHTABLE_H
 
-#include <linux/rculist.h>
+#include <linux/list_nulls.h>
 
 struct rhash_head {
 	struct rhash_head __rcu		*next;
 };
 
-#define INIT_HASH_HEAD(ptr) ((ptr)->next = NULL)
-
 struct bucket_table {
 	size_t				size;
 	struct rhash_head __rcu		*buckets[];
@@ -45,6 +43,7 @@ struct rhashtable;
  * @hash_rnd: Seed to use while hashing
  * @max_shift: Maximum number of shifts while expanding
  * @min_shift: Minimum number of shifts while shrinking
+ * @nulls_base: Base value to generate nulls marker
  * @hashfn: Function to hash key
  * @obj_hashfn: Function to hash object
  * @grow_decision: If defined, may return true if table should expand
@@ -59,6 +58,7 @@ struct rhashtable_params {
 	u32			hash_rnd;
 	size_t			max_shift;
 	size_t			min_shift;
+	int			nulls_base;
 	rht_hashfn_t		hashfn;
 	rht_obj_hashfn_t	obj_hashfn;
 	bool			(*grow_decision)(const struct rhashtable *ht,
@@ -82,6 +82,24 @@ struct rhashtable {
 	struct rhashtable_params	p;
 };
 
+static inline unsigned long rht_marker(const struct rhashtable *ht, u32 hash)
+{
+	return NULLS_MARKER(ht->p.nulls_base + hash);
+}
+
+#define INIT_RHT_NULLS_HEAD(ptr, ht, hash) \
+	((ptr) = (typeof(ptr)) rht_marker(ht, hash))
+
+static inline bool rht_is_a_nulls(const struct rhash_head *ptr)
+{
+	return ((unsigned long) ptr & 1);
+}
+
+static inline unsigned long rht_get_nulls_value(const struct rhash_head *ptr)
+{
+	return ((unsigned long) ptr) >> 1;
+}
+
 #ifdef CONFIG_PROVE_LOCKING
 int lockdep_rht_mutex_is_held(const struct rhashtable *ht);
 #else
@@ -119,92 +137,145 @@ void rhashtable_destroy(const struct rhashtable *ht);
 #define rht_dereference_rcu(p, ht) \
 	rcu_dereference_check(p, lockdep_rht_mutex_is_held(ht))
 
-#define rht_entry(ptr, type, member) container_of(ptr, type, member)
-#define rht_entry_safe(ptr, type, member) \
-({ \
-	typeof(ptr) __ptr = (ptr); \
-	   __ptr ? rht_entry(__ptr, type, member) : NULL; \
-})
+#define rht_dereference_bucket(p, tbl, hash) \
+	rcu_dereference_protected(p, lockdep_rht_mutex_is_held(ht))
+
+#define rht_dereference_bucket_rcu(p, tbl, hash) \
+	rcu_dereference_check(p, lockdep_rht_mutex_is_held(ht))
+
+#define rht_entry(tpos, pos, member) \
+	({ tpos = container_of(pos, typeof(*tpos), member); 1; })
 
-#define rht_next_entry_safe(pos, ht, member) \
-({ \
-	pos ? rht_entry_safe(rht_dereference((pos)->member.next, ht), \
-			     typeof(*(pos)), member) : NULL; \
-})
+static inline struct rhash_head *rht_get_bucket(const struct bucket_table *tbl,
+						u32 hash)
+{
+	return rht_dereference_bucket(tbl->buckets[hash], tbl, hash);
+}
+
+static inline struct rhash_head *rht_get_bucket_rcu(const struct bucket_table *tbl,
+						    u32 hash)
+{
+	return rht_dereference_bucket_rcu(tbl->buckets[hash], tbl, hash);
+}
+
+/**
+ * rht_for_each_continue - continue iterating over hash chain
+ * @pos:	the &struct rhash_head to use as a loop cursor.
+ * @head:	the previous &struct rhash_head to continue from
+ * @tbl:	the &struct bucket_table
+ * @hash:	the hash value / bucket index
+ */
+#define rht_for_each_continue(pos, head, tbl, hash) \
+	for (pos = rht_dereference_bucket(head, tbl, hash); \
+	     !rht_is_a_nulls(pos); \
+	     pos = rht_dereference_bucket(pos->next, tbl, hash))
 
 /**
  * rht_for_each - iterate over hash chain
- * @pos:	&struct rhash_head to use as a loop cursor.
- * @head:	head of the hash chain (struct rhash_head *)
- * @ht:		pointer to your struct rhashtable
+ * @pos:	the &struct rhash_head to use as a loop cursor.
+ * @tbl:	the &struct bucket_table
+ * @hash:	the hash value / bucket index
  */
-#define rht_for_each(pos, head, ht) \
-	for (pos = rht_dereference(head, ht); \
-	     pos; \
-	     pos = rht_dereference((pos)->next, ht))
+#define rht_for_each(pos, tbl, hash) \
+	for (pos = rht_get_bucket(tbl, hash); \
+	     !rht_is_a_nulls(pos); \
+	     pos = rht_dereference_bucket(pos->next, tbl, hash))
 
 /**
  * rht_for_each_entry - iterate over hash chain of given type
- * @pos:	type * to use as a loop cursor.
- * @head:	head of the hash chain (struct rhash_head *)
- * @ht:		pointer to your struct rhashtable
- * @member:	name of the rhash_head within the hashable struct.
+ * @tpos:	the type * to use as a loop cursor.
+ * @pos:	the &struct rhash_head to use as a loop cursor.
+ * @tbl:	the &struct bucket_table
+ * @hash:	the hash value / bucket index
+ * @member:	name of the &struct rhash_head within the hashable struct.
  */
-#define rht_for_each_entry(pos, head, ht, member) \
-	for (pos = rht_entry_safe(rht_dereference(head, ht), \
-				   typeof(*(pos)), member); \
-	     pos; \
-	     pos = rht_next_entry_safe(pos, ht, member))
+#define rht_for_each_entry(tpos, pos, tbl, hash, member)		\
+	for (pos = rht_get_bucket(tbl, hash);				\
+	     (!rht_is_a_nulls(pos)) && rht_entry(tpos, pos, member);	\
+	     pos = rht_dereference_bucket(pos->next, tbl, hash))
 
 /**
  * rht_for_each_entry_safe - safely iterate over hash chain of given type
- * @pos:	type * to use as a loop cursor.
- * @n:		type * to use for temporary next object storage
- * @head:	head of the hash chain (struct rhash_head *)
- * @ht:		pointer to your struct rhashtable
- * @member:	name of the rhash_head within the hashable struct.
+ * @tpos:	the type * to use as a loop cursor.
+ * @pos:	the &struct rhash_head to use as a loop cursor.
+ * @next:	the &struct rhash_head to use as next in loop cursor.
+ * @tbl:	the &struct bucket_table
+ * @hash:	the hash value / bucket index
+ * @member:	name of the &struct rhash_head within the hashable struct.
  *
  * This hash chain list-traversal primitive allows for the looped code to
  * remove the loop cursor from the list.
  */
-#define rht_for_each_entry_safe(pos, n, head, ht, member)		\
-	for (pos = rht_entry_safe(rht_dereference(head, ht), \
-				  typeof(*(pos)), member), \
-	     n = rht_next_entry_safe(pos, ht, member); \
-	     pos; \
-	     pos = n, \
-	     n = rht_next_entry_safe(pos, ht, member))
+#define rht_for_each_entry_safe(tpos, pos, next, tbl, hash, member)	\
+	for (pos = rht_get_bucket(tbl, hash),				\
+	     next = !rht_is_a_nulls(pos) ?				\
+			rht_dereference_bucket(pos->next, tbl, hash) :	\
+			(struct rhash_head *) NULLS_MARKER(0);		\
+	     (!rht_is_a_nulls(pos)) && rht_entry(tpos, pos, member);	\
+	     pos = next)
+
+/**
+ * rht_for_each_rcu_continue - continue iterating over rcu hash chain
+ * @pos:	the &struct rhash_head to use as a loop cursor.
+ * @head:	the previous &struct rhash_head to continue from
+ *
+ * This hash chain list-traversal primitive may safely run concurrently with
+ * the _rcu mutation primitives such as rht_insert() as long as the traversal
+ * is guarded by rcu_read_lock().
+ */
+#define rht_for_each_rcu_continue(pos, head)				\
+	for (({barrier(); }), pos = rcu_dereference_raw(head);		\
+	     !rht_is_a_nulls(pos);					\
+	     pos = rcu_dereference_raw(pos->next))
 
 /**
  * rht_for_each_rcu - iterate over rcu hash chain
- * @pos:	&struct rhash_head to use as a loop cursor.
- * @head:	head of the hash chain (struct rhash_head *)
- * @ht:		pointer to your struct rhashtable
+ * @pos:	the &struct rhash_head to use as a loop cursor.
+ * @tbl:	the &struct bucket_table
+ * @hash:	the hash value / bucket index
+ *
+ * This hash chain list-traversal primitive may safely run concurrently with
+ * the _rcu mutation primitives such as rht_insert() as long as the traversal
+ * is guarded by rcu_read_lock().
+ */
+#define rht_for_each_rcu(pos, tbl, hash)				\
+	for (({barrier(); }), pos = rht_get_bucket_rcu(tbl, hash);	\
+	     !rht_is_a_nulls(pos);					\
+	     pos = rcu_dereference_raw(pos->next))
+
+/**
+ * rht_for_each_entry_rcu_continue - continue iterating over rcu hash chain
+ * @tpos:	the type * to use as a loop cursor.
+ * @pos:	the &struct rhash_head to use as a loop cursor.
+ * @head:	the previous &struct rhash_head to continue from
+ * @member:	name of the &struct rhash_head within the hashable struct.
  *
  * This hash chain list-traversal primitive may safely run concurrently with
- * the _rcu fkht mutation primitives such as rht_insert() as long as the
- * traversal is guarded by rcu_read_lock().
+ * the _rcu mutation primitives such as rht_insert() as long as the traversal
+ * is guarded by rcu_read_lock().
  */
-#define rht_for_each_rcu(pos, head, ht) \
-	for (pos = rht_dereference_rcu(head, ht); \
-	     pos; \
-	     pos = rht_dereference_rcu((pos)->next, ht))
+#define rht_for_each_entry_rcu_continue(tpos, pos, head, member)	\
+	for (({barrier(); }),						\
+	     pos = rcu_dereference_raw(head);				\
+	     (!rht_is_a_nulls(pos)) && rht_entry(tpos, pos, member);	\
+	     pos = rcu_dereference_raw(pos->next))
 
 /**
  * rht_for_each_entry_rcu - iterate over rcu hash chain of given type
- * @pos:	type * to use as a loop cursor.
- * @head:	head of the hash chain (struct rhash_head *)
- * @member:	name of the rhash_head within the hashable struct.
+ * @tpos:	the type * to use as a loop cursor.
+ * @pos:	the &struct rhash_head to use as a loop cursor.
+ * @tbl:	the &struct bucket_table
+ * @hash:	the hash value / bucket index
+ * @member:	name of the &struct rhash_head within the hashable struct.
  *
  * This hash chain list-traversal primitive may safely run concurrently with
- * the _rcu fkht mutation primitives such as rht_insert() as long as the
- * traversal is guarded by rcu_read_lock().
+ * the _rcu mutation primitives such as rht_insert() as long as the traversal
+ * is guarded by rcu_read_lock().
  */
-#define rht_for_each_entry_rcu(pos, head, member) \
-	for (pos = rht_entry_safe(rcu_dereference_raw(head), \
-				  typeof(*(pos)), member); \
-	     pos; \
-	     pos = rht_entry_safe(rcu_dereference_raw((pos)->member.next), \
-				  typeof(*(pos)), member))
+#define rht_for_each_entry_rcu(tpos, pos, tbl, hash, member)		\
+	for (({barrier(); }),						\
+	     pos = rht_get_bucket_rcu(tbl, hash);			\
+	     (!rht_is_a_nulls(pos)) && rht_entry(tpos, pos, member);	\
+	     pos = rht_dereference_bucket_rcu(pos->next, tbl, hash))
 
 #endif /* _LINUX_RHASHTABLE_H */
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index c10df45..d871483 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -28,6 +28,23 @@
 #define HASH_DEFAULT_SIZE	64UL
 #define HASH_MIN_SIZE		4UL
 
+/*
+ * The nulls marker consists of:
+ *
+ * +-------+-----------------------------------------------------+-+
+ * | Base  |                      Hash                           |1|
+ * +-------+-----------------------------------------------------+-+
+ *
+ * Base (4 bits) : Reserved to distinguish between multiple tables.
+ *                 Specified via &struct rhashtable_params.nulls_base.
+ * Hash (27 bits): Full hash (unmasked) of first element added to bucket
+ * 1 (1 bit)     : Nulls marker (always set)
+ *
+ */
+#define HASH_BASE_BITS		4
+#define HASH_BASE_MIN		(1 << (31 - HASH_BASE_BITS))
+#define HASH_RESERVED_SPACE	(HASH_BASE_BITS + 1)
+
 #define ASSERT_RHT_MUTEX(HT) BUG_ON(!lockdep_rht_mutex_is_held(HT))
 
 #ifdef CONFIG_PROVE_LOCKING
@@ -43,14 +60,22 @@ static void *rht_obj(const struct rhashtable *ht, const struct rhash_head *he)
 	return (void *) he - ht->p.head_offset;
 }
 
-static u32 __hashfn(const struct rhashtable *ht, const void *key,
-		      u32 len, u32 hsize)
+static u32 rht_bucket_index(u32 hash, const struct bucket_table *tbl)
 {
-	u32 h;
+	return hash & (tbl->size - 1);
+}
 
-	h = ht->p.hashfn(key, len, ht->p.hash_rnd);
+static u32 obj_raw_hashfn(const struct rhashtable *ht, const void *ptr)
+{
+	u32 hash;
+
+	if (unlikely(!ht->p.key_len))
+		hash = ht->p.obj_hashfn(ptr, ht->p.hash_rnd);
+	else
+		hash = ht->p.hashfn(ptr + ht->p.key_offset, ht->p.key_len,
+				    ht->p.hash_rnd);
 
-	return h & (hsize - 1);
+	return hash >> HASH_RESERVED_SPACE;
 }
 
 /**
@@ -66,23 +91,14 @@ static u32 __hashfn(const struct rhashtable *ht, const void *key,
 u32 rhashtable_hashfn(const struct rhashtable *ht, const void *key, u32 len)
 {
 	struct bucket_table *tbl = rht_dereference_rcu(ht->tbl, ht);
+	u32 hash;
 
-	return __hashfn(ht, key, len, tbl->size);
-}
-EXPORT_SYMBOL_GPL(rhashtable_hashfn);
-
-static u32 obj_hashfn(const struct rhashtable *ht, const void *ptr, u32 hsize)
-{
-	if (unlikely(!ht->p.key_len)) {
-		u32 h;
-
-		h = ht->p.obj_hashfn(ptr, ht->p.hash_rnd);
-
-		return h & (hsize - 1);
-	}
+	hash = ht->p.hashfn(key, len, ht->p.hash_rnd);
+	hash >>= HASH_RESERVED_SPACE;
 
-	return __hashfn(ht, ptr + ht->p.key_offset, ht->p.key_len, hsize);
+	return rht_bucket_index(hash, tbl);
 }
+EXPORT_SYMBOL_GPL(rhashtable_hashfn);
 
 /**
  * rhashtable_obj_hashfn - compute hash for hashed object
@@ -98,20 +114,23 @@ u32 rhashtable_obj_hashfn(const struct rhashtable *ht, void *ptr)
 {
 	struct bucket_table *tbl = rht_dereference_rcu(ht->tbl, ht);
 
-	return obj_hashfn(ht, ptr, tbl->size);
+	return rht_bucket_index(obj_raw_hashfn(ht, ptr), tbl);
 }
 EXPORT_SYMBOL_GPL(rhashtable_obj_hashfn);
 
 static u32 head_hashfn(const struct rhashtable *ht,
-		       const struct rhash_head *he, u32 hsize)
+		       const struct rhash_head *he,
+		       const struct bucket_table *tbl)
 {
-	return obj_hashfn(ht, rht_obj(ht, he), hsize);
+	return rht_bucket_index(obj_raw_hashfn(ht, rht_obj(ht, he)), tbl);
 }
 
-static struct bucket_table *bucket_table_alloc(size_t nbuckets)
+static struct bucket_table *bucket_table_alloc(struct rhashtable *ht,
+					       size_t nbuckets)
 {
 	struct bucket_table *tbl;
 	size_t size;
+	int i;
 
 	size = sizeof(*tbl) + nbuckets * sizeof(tbl->buckets[0]);
 	tbl = kzalloc(size, GFP_KERNEL);
@@ -121,6 +140,9 @@ static struct bucket_table *bucket_table_alloc(size_t nbuckets)
 	if (tbl == NULL)
 		return NULL;
 
+	for (i = 0; i < nbuckets; i++)
+		INIT_RHT_NULLS_HEAD(tbl->buckets[i], ht, i);
+
 	tbl->size = nbuckets;
 
 	return tbl;
@@ -159,34 +181,36 @@ static void hashtable_chain_unzip(const struct rhashtable *ht,
 				  const struct bucket_table *new_tbl,
 				  struct bucket_table *old_tbl, size_t n)
 {
-	struct rhash_head *he, *p, *next;
-	unsigned int h;
+	struct rhash_head *he, *p;
+	struct rhash_head __rcu *next;
+	u32 hash, new_tbl_idx;
 
 	/* Old bucket empty, no work needed. */
-	p = rht_dereference(old_tbl->buckets[n], ht);
-	if (!p)
+	p = rht_get_bucket(old_tbl, n);
+	if (rht_is_a_nulls(p))
 		return;
 
 	/* Advance the old bucket pointer one or more times until it
 	 * reaches a node that doesn't hash to the same bucket as the
 	 * previous node p. Call the previous node p;
 	 */
-	h = head_hashfn(ht, p, new_tbl->size);
-	rht_for_each(he, p->next, ht) {
-		if (head_hashfn(ht, he, new_tbl->size) != h)
+	hash = obj_raw_hashfn(ht, rht_obj(ht, p));
+	new_tbl_idx = rht_bucket_index(hash, new_tbl);
+	rht_for_each_continue(he, p->next, old_tbl, n) {
+		if (head_hashfn(ht, he, new_tbl) != new_tbl_idx)
 			break;
 		p = he;
 	}
-	RCU_INIT_POINTER(old_tbl->buckets[n], p->next);
+	RCU_INIT_POINTER(old_tbl->buckets[n], he);
 
 	/* Find the subsequent node which does hash to the same
 	 * bucket as node P, or NULL if no such node exists.
 	 */
-	next = NULL;
-	if (he) {
-		rht_for_each(he, he->next, ht) {
-			if (head_hashfn(ht, he, new_tbl->size) == h) {
-				next = he;
+	INIT_RHT_NULLS_HEAD(next, ht, hash);
+	if (!rht_is_a_nulls(he)) {
+		rht_for_each_continue(he, he->next, old_tbl, n) {
+			if (head_hashfn(ht, he, new_tbl) == new_tbl_idx) {
+				next = (struct rhash_head __rcu *) he;
 				break;
 			}
 		}
@@ -223,7 +247,7 @@ int rhashtable_expand(struct rhashtable *ht)
 	if (ht->p.max_shift && ht->shift >= ht->p.max_shift)
 		return 0;
 
-	new_tbl = bucket_table_alloc(old_tbl->size * 2);
+	new_tbl = bucket_table_alloc(ht, old_tbl->size * 2);
 	if (new_tbl == NULL)
 		return -ENOMEM;
 
@@ -239,8 +263,8 @@ int rhashtable_expand(struct rhashtable *ht)
 	 */
 	for (i = 0; i < new_tbl->size; i++) {
 		h = i & (old_tbl->size - 1);
-		rht_for_each(he, old_tbl->buckets[h], ht) {
-			if (head_hashfn(ht, he, new_tbl->size) == i) {
+		rht_for_each(he, old_tbl, h) {
+			if (head_hashfn(ht, he, new_tbl) == i) {
 				RCU_INIT_POINTER(new_tbl->buckets[i], he);
 				break;
 			}
@@ -268,7 +292,7 @@ int rhashtable_expand(struct rhashtable *ht)
 		complete = true;
 		for (i = 0; i < old_tbl->size; i++) {
 			hashtable_chain_unzip(ht, new_tbl, old_tbl, i);
-			if (old_tbl->buckets[i] != NULL)
+			if (!rht_is_a_nulls(old_tbl->buckets[i]))
 				complete = false;
 		}
 	} while (!complete);
@@ -299,7 +323,7 @@ int rhashtable_shrink(struct rhashtable *ht)
 	if (ht->shift <= ht->p.min_shift)
 		return 0;
 
-	ntbl = bucket_table_alloc(tbl->size / 2);
+	ntbl = bucket_table_alloc(ht, tbl->size / 2);
 	if (ntbl == NULL)
 		return -ENOMEM;
 
@@ -316,8 +340,9 @@ int rhashtable_shrink(struct rhashtable *ht)
 		 * in the old table that contains entries which will hash
 		 * to the new bucket.
 		 */
-		for (pprev = &ntbl->buckets[i]; *pprev != NULL;
-		     pprev = &rht_dereference(*pprev, ht)->next)
+		for (pprev = &ntbl->buckets[i];
+		     !rht_is_a_nulls(rht_dereference_bucket(*pprev, ntbl, i));
+		     pprev = &rht_dereference_bucket(*pprev, ntbl, i)->next)
 			;
 		RCU_INIT_POINTER(*pprev, tbl->buckets[i + ntbl->size]);
 	}
@@ -350,13 +375,17 @@ EXPORT_SYMBOL_GPL(rhashtable_shrink);
 void rhashtable_insert(struct rhashtable *ht, struct rhash_head *obj)
 {
 	struct bucket_table *tbl = rht_dereference(ht->tbl, ht);
-	u32 hash;
+	u32 hash, idx;
 
 	ASSERT_RHT_MUTEX(ht);
 
-	hash = head_hashfn(ht, obj, tbl->size);
-	RCU_INIT_POINTER(obj->next, tbl->buckets[hash]);
-	rcu_assign_pointer(tbl->buckets[hash], obj);
+	hash = obj_raw_hashfn(ht, rht_obj(ht, obj));
+	idx = rht_bucket_index(hash, tbl);
+	if (rht_is_a_nulls(rht_get_bucket(tbl, idx)))
+		INIT_RHT_NULLS_HEAD(obj->next, ht, hash);
+	else
+		obj->next = tbl->buckets[idx];
+	rcu_assign_pointer(tbl->buckets[idx], obj);
 	ht->nelems++;
 
 	if (ht->p.grow_decision && ht->p.grow_decision(ht, tbl->size))
@@ -410,14 +439,13 @@ bool rhashtable_remove(struct rhashtable *ht, struct rhash_head *obj)
 	struct bucket_table *tbl = rht_dereference(ht->tbl, ht);
 	struct rhash_head __rcu **pprev;
 	struct rhash_head *he;
-	u32 h;
+	u32 idx;
 
 	ASSERT_RHT_MUTEX(ht);
 
-	h = head_hashfn(ht, obj, tbl->size);
-
-	pprev = &tbl->buckets[h];
-	rht_for_each(he, tbl->buckets[h], ht) {
+	idx = head_hashfn(ht, obj, tbl);
+	pprev = &tbl->buckets[idx];
+	rht_for_each(he, tbl, idx) {
 		if (he != obj) {
 			pprev = &he->next;
 			continue;
@@ -453,12 +481,12 @@ void *rhashtable_lookup(const struct rhashtable *ht, const void *key)
 
 	BUG_ON(!ht->p.key_len);
 
-	h = __hashfn(ht, key, ht->p.key_len, tbl->size);
-	rht_for_each_rcu(he, tbl->buckets[h], ht) {
+	h = rhashtable_hashfn(ht, key, ht->p.key_len);
+	rht_for_each_rcu(he, tbl, h) {
 		if (memcmp(rht_obj(ht, he) + ht->p.key_offset, key,
 			   ht->p.key_len))
 			continue;
-		return (void *) he - ht->p.head_offset;
+		return rht_obj(ht, he);
 	}
 
 	return NULL;
@@ -489,7 +517,7 @@ void *rhashtable_lookup_compare(const struct rhashtable *ht, u32 hash,
 	if (unlikely(hash >= tbl->size))
 		return NULL;
 
-	rht_for_each_rcu(he, tbl->buckets[hash], ht) {
+	rht_for_each_rcu(he, tbl, hash) {
 		if (!compare(rht_obj(ht, he), arg))
 			continue;
 		return (void *) he - ht->p.head_offset;
@@ -560,19 +588,23 @@ int rhashtable_init(struct rhashtable *ht, struct rhashtable_params *params)
 	    (!params->key_len && !params->obj_hashfn))
 		return -EINVAL;
 
+	if (params->nulls_base && params->nulls_base < HASH_BASE_MIN)
+		return -EINVAL;
+
 	params->min_shift = max_t(size_t, params->min_shift,
 				  ilog2(HASH_MIN_SIZE));
 
 	if (params->nelem_hint)
 		size = rounded_hashtable_size(params);
 
-	tbl = bucket_table_alloc(size);
+	memset(ht, 0, sizeof(*ht));
+	memcpy(&ht->p, params, sizeof(*params));
+
+	tbl = bucket_table_alloc(ht, size);
 	if (tbl == NULL)
 		return -ENOMEM;
 
-	memset(ht, 0, sizeof(*ht));
 	ht->shift = ilog2(tbl->size);
-	memcpy(&ht->p, params, sizeof(*params));
 	RCU_INIT_POINTER(ht->tbl, tbl);
 
 	if (!ht->p.hash_rnd)
@@ -652,6 +684,7 @@ static void test_bucket_stats(struct rhashtable *ht, struct bucket_table *tbl,
 			      bool quiet)
 {
 	unsigned int cnt, rcu_cnt, i, total = 0;
+	struct rhash_head *pos;
 	struct test_obj *obj;
 
 	for (i = 0; i < tbl->size; i++) {
@@ -660,7 +693,7 @@ static void test_bucket_stats(struct rhashtable *ht, struct bucket_table *tbl,
 		if (!quiet)
 			pr_info(" [%#4x/%zu]", i, tbl->size);
 
-		rht_for_each_entry_rcu(obj, tbl->buckets[i], node) {
+		rht_for_each_entry_rcu(obj, pos, tbl, i, node) {
 			cnt++;
 			total++;
 			if (!quiet)
@@ -689,7 +722,8 @@ static void test_bucket_stats(struct rhashtable *ht, struct bucket_table *tbl,
 static int __init test_rhashtable(struct rhashtable *ht)
 {
 	struct bucket_table *tbl;
-	struct test_obj *obj, *next;
+	struct test_obj *obj;
+	struct rhash_head *pos, *next;
 	int err;
 	unsigned int i;
 
@@ -755,7 +789,7 @@ static int __init test_rhashtable(struct rhashtable *ht)
 error:
 	tbl = rht_dereference_rcu(ht->tbl, ht);
 	for (i = 0; i < tbl->size; i++)
-		rht_for_each_entry_safe(obj, next, tbl->buckets[i], ht, node)
+		rht_for_each_entry_safe(obj, pos, next, tbl, i, node)
 			kfree(obj);
 
 	return err;
diff --git a/net/netfilter/nft_hash.c b/net/netfilter/nft_hash.c
index b52873c..68b654b 100644
--- a/net/netfilter/nft_hash.c
+++ b/net/netfilter/nft_hash.c
@@ -99,12 +99,13 @@ static int nft_hash_get(const struct nft_set *set, struct nft_set_elem *elem)
 	const struct rhashtable *priv = nft_set_priv(set);
 	const struct bucket_table *tbl = rht_dereference_rcu(priv->tbl, priv);
 	struct rhash_head __rcu * const *pprev;
+	struct rhash_head *pos;
 	struct nft_hash_elem *he;
 	u32 h;
 
 	h = rhashtable_hashfn(priv, &elem->key, set->klen);
 	pprev = &tbl->buckets[h];
-	rht_for_each_entry_rcu(he, tbl->buckets[h], node) {
+	rht_for_each_entry_rcu(he, pos, tbl, h, node) {
 		if (nft_data_cmp(&he->key, &elem->key, set->klen)) {
 			pprev = &he->node.next;
 			continue;
@@ -130,7 +131,9 @@ static void nft_hash_walk(const struct nft_ctx *ctx, const struct nft_set *set,
 
 	tbl = rht_dereference_rcu(priv->tbl, priv);
 	for (i = 0; i < tbl->size; i++) {
-		rht_for_each_entry_rcu(he, tbl->buckets[i], node) {
+		struct rhash_head *pos;
+
+		rht_for_each_entry_rcu(he, pos, tbl, i, node) {
 			if (iter->count < iter->skip)
 				goto cont;
 
@@ -181,12 +184,13 @@ static void nft_hash_destroy(const struct nft_set *set)
 {
 	const struct rhashtable *priv = nft_set_priv(set);
 	const struct bucket_table *tbl;
-	struct nft_hash_elem *he, *next;
+	struct nft_hash_elem *he;
+	struct rhash_head *pos, *next;
 	unsigned int i;
 
 	tbl = rht_dereference(priv->tbl, priv);
 	for (i = 0; i < tbl->size; i++)
-		rht_for_each_entry_safe(he, next, tbl->buckets[i], priv, node)
+		rht_for_each_entry_safe(he, pos, next, tbl, i, node)
 			nft_hash_elem_destroy(set, he);
 
 	rhashtable_destroy(priv);
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index a1e6104..98e5b58 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2903,7 +2903,9 @@ static struct sock *netlink_seq_socket_idx(struct seq_file *seq, loff_t pos)
 		const struct bucket_table *tbl = rht_dereference_rcu(ht->tbl, ht);
 
 		for (j = 0; j < tbl->size; j++) {
-			rht_for_each_entry_rcu(nlk, tbl->buckets[j], node) {
+			struct rhash_head *node;
+
+			rht_for_each_entry_rcu(nlk, node, tbl, j, node) {
 				s = (struct sock *)nlk;
 
 				if (sock_net(s) != seq_file_net(seq))
@@ -2929,6 +2931,7 @@ static void *netlink_seq_start(struct seq_file *seq, loff_t *pos)
 
 static void *netlink_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 {
+	struct rhash_head *node;
 	struct netlink_sock *nlk;
 	struct nl_seq_iter *iter;
 	struct net *net;
@@ -2943,7 +2946,7 @@ static void *netlink_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 	iter = seq->private;
 	nlk = v;
 
-	rht_for_each_entry_rcu(nlk, nlk->node.next, node)
+	rht_for_each_entry_rcu_continue(nlk, node, nlk->node.next, node)
 		if (net_eq(sock_net((struct sock *)nlk), net))
 			return nlk;
 
@@ -2955,7 +2958,7 @@ static void *netlink_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 		const struct bucket_table *tbl = rht_dereference_rcu(ht->tbl, ht);
 
 		for (; j < tbl->size; j++) {
-			rht_for_each_entry_rcu(nlk, tbl->buckets[j], node) {
+			rht_for_each_entry_rcu(nlk, node, tbl, j, node) {
 				if (net_eq(sock_net((struct sock *)nlk), net)) {
 					iter->link = i;
 					iter->hash_idx = j;
diff --git a/net/netlink/diag.c b/net/netlink/diag.c
index de8c74a..1062bb4 100644
--- a/net/netlink/diag.c
+++ b/net/netlink/diag.c
@@ -113,7 +113,9 @@ static int __netlink_diag_dump(struct sk_buff *skb, struct netlink_callback *cb,
 	req = nlmsg_data(cb->nlh);
 
 	for (i = 0; i < htbl->size; i++) {
-		rht_for_each_entry(nlsk, htbl->buckets[i], ht, node) {
+		struct rhash_head *node;
+
+		rht_for_each_entry(nlsk, node, htbl, i, node) {
 			sk = (struct sock *)nlsk;
 
 			if (!net_eq(sock_net(sk), net))
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 4/5] spinlock: Add spin_lock_bh_nested()
  2014-09-15 12:17 [PATCH 0/5 RFC net-next] rhashtable: Parallel atomic insert/deletion & deferred expansion Thomas Graf
                   ` (2 preceding siblings ...)
  2014-09-15 12:18 ` [PATCH 3/5] rhashtable: Convert to nulls list Thomas Graf
@ 2014-09-15 12:18 ` Thomas Graf
  2014-09-15 12:18 ` [PATCH 5/5] rhashtable: Per bucket locks & expansion/shrinking in work queue Thomas Graf
  4 siblings, 0 replies; 13+ messages in thread
From: Thomas Graf @ 2014-09-15 12:18 UTC (permalink / raw)
  To: davem, eric.dumazet, paulmck, john.r.fastabend, kaber
  Cc: netdev, linux-kernel

Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
 include/linux/spinlock.h         | 8 ++++++++
 include/linux/spinlock_api_smp.h | 2 ++
 include/linux/spinlock_api_up.h  | 1 +
 kernel/locking/spinlock.c        | 8 ++++++++
 4 files changed, 19 insertions(+)

diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index 3f2867f..481180a 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -190,6 +190,8 @@ static inline void do_raw_spin_unlock(raw_spinlock_t *lock) __releases(lock)
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 # define raw_spin_lock_nested(lock, subclass) \
 	_raw_spin_lock_nested(lock, subclass)
+# define raw_spin_lock_bh_nested(lock, subclass) \
+	_raw_spin_lock_bh_nested(lock, subclass)
 
 # define raw_spin_lock_nest_lock(lock, nest_lock)			\
 	 do {								\
@@ -199,6 +201,7 @@ static inline void do_raw_spin_unlock(raw_spinlock_t *lock) __releases(lock)
 #else
 # define raw_spin_lock_nested(lock, subclass)		_raw_spin_lock(lock)
 # define raw_spin_lock_nest_lock(lock, nest_lock)	_raw_spin_lock(lock)
+# define raw_spin_lock_bh_nested(lock, subclass)	_raw_spin_lock_bh(lock)
 #endif
 
 #if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
@@ -318,6 +321,11 @@ do {								\
 	raw_spin_lock_nested(spinlock_check(lock), subclass);	\
 } while (0)
 
+#define spin_lock_bh_nested(lock, subclass)			\
+do {								\
+	raw_spin_lock_bh_nested(spinlock_check(lock), subclass);\
+} while (0)
+
 #define spin_lock_nest_lock(lock, nest_lock)				\
 do {									\
 	raw_spin_lock_nest_lock(spinlock_check(lock), nest_lock);	\
diff --git a/include/linux/spinlock_api_smp.h b/include/linux/spinlock_api_smp.h
index 42dfab8..5344268 100644
--- a/include/linux/spinlock_api_smp.h
+++ b/include/linux/spinlock_api_smp.h
@@ -22,6 +22,8 @@ int in_lock_functions(unsigned long addr);
 void __lockfunc _raw_spin_lock(raw_spinlock_t *lock)		__acquires(lock);
 void __lockfunc _raw_spin_lock_nested(raw_spinlock_t *lock, int subclass)
 								__acquires(lock);
+void __lockfunc _raw_spin_lock_bh_nested(raw_spinlock_t *lock, int subclass)
+								__acquires(lock);
 void __lockfunc
 _raw_spin_lock_nest_lock(raw_spinlock_t *lock, struct lockdep_map *map)
 								__acquires(lock);
diff --git a/include/linux/spinlock_api_up.h b/include/linux/spinlock_api_up.h
index d0d1888..d3afef9 100644
--- a/include/linux/spinlock_api_up.h
+++ b/include/linux/spinlock_api_up.h
@@ -57,6 +57,7 @@
 
 #define _raw_spin_lock(lock)			__LOCK(lock)
 #define _raw_spin_lock_nested(lock, subclass)	__LOCK(lock)
+#define _raw_spin_lock_bh_nested(lock, subclass) __LOCK(lock)
 #define _raw_read_lock(lock)			__LOCK(lock)
 #define _raw_write_lock(lock)			__LOCK(lock)
 #define _raw_spin_lock_bh(lock)			__LOCK_BH(lock)
diff --git a/kernel/locking/spinlock.c b/kernel/locking/spinlock.c
index 4b082b5..db3ccb1 100644
--- a/kernel/locking/spinlock.c
+++ b/kernel/locking/spinlock.c
@@ -363,6 +363,14 @@ void __lockfunc _raw_spin_lock_nested(raw_spinlock_t *lock, int subclass)
 }
 EXPORT_SYMBOL(_raw_spin_lock_nested);
 
+void __lockfunc _raw_spin_lock_bh_nested(raw_spinlock_t *lock, int subclass)
+{
+	__local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
+	spin_acquire(&lock->dep_map, subclass, 0, _RET_IP_);
+	LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
+}
+EXPORT_SYMBOL(_raw_spin_lock_bh_nested);
+
 unsigned long __lockfunc _raw_spin_lock_irqsave_nested(raw_spinlock_t *lock,
 						   int subclass)
 {
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 5/5] rhashtable: Per bucket locks & expansion/shrinking in work queue
  2014-09-15 12:17 [PATCH 0/5 RFC net-next] rhashtable: Parallel atomic insert/deletion & deferred expansion Thomas Graf
                   ` (3 preceding siblings ...)
  2014-09-15 12:18 ` [PATCH 4/5] spinlock: Add spin_lock_bh_nested() Thomas Graf
@ 2014-09-15 12:18 ` Thomas Graf
  2014-09-15 15:23   ` Eric Dumazet
  4 siblings, 1 reply; 13+ messages in thread
From: Thomas Graf @ 2014-09-15 12:18 UTC (permalink / raw)
  To: davem, eric.dumazet, paulmck, john.r.fastabend, kaber
  Cc: netdev, linux-kernel

Introduces an array of spinlocks to protect bucket mutations. The number
of spinlocks per CPU is configurable and selected based on the hash of
the bucket. This allows for parallel insertions and removals of entries
which do not share a lock.

The patch also defers expansion and shrinking to a worker queue which
allow insertion and removal from atomic context. Insertions and
deletions may occur in parallel to it and are only held up briefly
while the particular bucket is linked or unzipped.

Mutations of the bucket table pointer is protected by a new mutex, read
access to that pointer for insertion and deletion is RCU protected.

In the event of an expansion or shrinking, the new bucket table allocated
is exposed as a so called future table right away. Lookups, deletions, and
insertions will briefly use both tables. The future table becomes the main
table after an RCU grace period and initial relinking was performed which
guarantees that no new insertions occur on the old table but all entries
can be found.

The side effect of this is that during that RCU grace period, a bucket
traversal using any rht_for_each() variant on the main table will not see
any insertions performed during the RCU grace period which would at that
point land in the future table. The lookup will see them as it searches
both tables if needed.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
 include/linux/rhashtable.h |  42 +++--
 lib/rhashtable.c           | 375 +++++++++++++++++++++++++++++++--------------
 net/netfilter/nft_hash.c   |  59 +++----
 net/netlink/af_netlink.c   |   5 +-
 4 files changed, 330 insertions(+), 151 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index e9cdbda..dc8a447 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -19,13 +19,23 @@
 #define _LINUX_RHASHTABLE_H
 
 #include <linux/list_nulls.h>
+#include <linux/workqueue.h>
 
 struct rhash_head {
 	struct rhash_head __rcu		*next;
 };
 
+/**
+ * struct bucket_table - Table of hash buckets
+ * @size: Number of hash buckets
+ * @locks_mask: Mask to apply before accessing locks[]
+ * @locks: Array of spinlocks protecting individual buckets
+ * @buckets: size * hash buckets
+ */
 struct bucket_table {
 	size_t				size;
+	unsigned int			locks_mask;
+	spinlock_t			*locks;
 	struct rhash_head __rcu		*buckets[];
 };
 
@@ -44,6 +54,7 @@ struct rhashtable;
  * @max_shift: Maximum number of shifts while expanding
  * @min_shift: Minimum number of shifts while shrinking
  * @nulls_base: Base value to generate nulls marker
+ * @locks_mul: Number of bucket locks to allocate per cpu (default: 128)
  * @hashfn: Function to hash key
  * @obj_hashfn: Function to hash object
  * @grow_decision: If defined, may return true if table should expand
@@ -59,6 +70,7 @@ struct rhashtable_params {
 	size_t			max_shift;
 	size_t			min_shift;
 	int			nulls_base;
+	size_t			locks_mul;
 	rht_hashfn_t		hashfn;
 	rht_obj_hashfn_t	obj_hashfn;
 	bool			(*grow_decision)(const struct rhashtable *ht,
@@ -71,15 +83,23 @@ struct rhashtable_params {
 /**
  * struct rhashtable - Hash table handle
  * @tbl: Bucket table
+ * @future_tbl: Future table during expansion and shrinking
  * @nelems: Number of elements in table
  * @shift: Current size (1 << shift)
  * @p: Configuration parameters
+ * @run_work: Delayed work executing expansion/shrinking
+ * @mutex: Hold while expansion/shrinking takes places
+ * @being_destroyed: True if table is set to be destroyed
  */
 struct rhashtable {
 	struct bucket_table __rcu	*tbl;
+	struct bucket_table __rcu	*future_tbl;
 	size_t				nelems;
 	size_t				shift;
 	struct rhashtable_params	p;
+	struct delayed_work		run_work;
+	struct mutex			mutex;
+	bool				being_destroyed;
 };
 
 static inline unsigned long rht_marker(const struct rhashtable *ht, u32 hash)
@@ -101,9 +121,16 @@ static inline unsigned long rht_get_nulls_value(const struct rhash_head *ptr)
 }
 
 #ifdef CONFIG_PROVE_LOCKING
-int lockdep_rht_mutex_is_held(const struct rhashtable *ht);
+int lockdep_rht_mutex_is_held(struct rhashtable *ht);
+int rht_bucket_lock_is_held(const struct bucket_table *tbl, u32 hash);
 #else
-static inline int lockdep_rht_mutex_is_held(const struct rhashtable *ht)
+static inline int lockdep_rht_mutex_is_held(struct rhashtable *ht)
+{
+	return 1;
+}
+
+static inline int lockdep_rht_bucket_is_held(const struct bucket_table *tbl,
+					     u32 hash)
 {
 	return 1;
 }
@@ -112,12 +139,9 @@ static inline int lockdep_rht_mutex_is_held(const struct rhashtable *ht)
 int rhashtable_init(struct rhashtable *ht, struct rhashtable_params *params);
 
 u32 rhashtable_hashfn(const struct rhashtable *ht, const void *key, u32 len);
-u32 rhashtable_obj_hashfn(const struct rhashtable *ht, void *ptr);
 
 void rhashtable_insert(struct rhashtable *ht, struct rhash_head *node);
 bool rhashtable_remove(struct rhashtable *ht, struct rhash_head *node);
-void rhashtable_remove_pprev(struct rhashtable *ht, struct rhash_head *obj,
-			     struct rhash_head __rcu **pprev);
 
 bool rht_grow_above_75(const struct rhashtable *ht, size_t new_size);
 bool rht_shrink_below_30(const struct rhashtable *ht, size_t new_size);
@@ -126,10 +150,10 @@ int rhashtable_expand(struct rhashtable *ht);
 int rhashtable_shrink(struct rhashtable *ht);
 
 void *rhashtable_lookup(const struct rhashtable *ht, const void *key);
-void *rhashtable_lookup_compare(const struct rhashtable *ht, u32 hash,
+void *rhashtable_lookup_compare(const struct rhashtable *ht, const void *key,
 				bool (*compare)(void *, void *), void *arg);
 
-void rhashtable_destroy(const struct rhashtable *ht);
+void rhashtable_destroy(struct rhashtable *ht);
 
 #define rht_dereference(p, ht) \
 	rcu_dereference_protected(p, lockdep_rht_mutex_is_held(ht))
@@ -138,10 +162,10 @@ void rhashtable_destroy(const struct rhashtable *ht);
 	rcu_dereference_check(p, lockdep_rht_mutex_is_held(ht))
 
 #define rht_dereference_bucket(p, tbl, hash) \
-	rcu_dereference_protected(p, lockdep_rht_mutex_is_held(ht))
+	rcu_dereference_protected(p, lockdep_rht_bucket_is_held(tbl, hash))
 
 #define rht_dereference_bucket_rcu(p, tbl, hash) \
-	rcu_dereference_check(p, lockdep_rht_mutex_is_held(ht))
+	rcu_dereference_check(p, lockdep_rht_bucket_is_held(tbl, hash))
 
 #define rht_entry(tpos, pos, member) \
 	({ tpos = container_of(pos, typeof(*tpos), member); 1; })
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index d871483..2a6ff3d 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -27,6 +27,7 @@
 
 #define HASH_DEFAULT_SIZE	64UL
 #define HASH_MIN_SIZE		4UL
+#define BUCKET_LOCKS_PER_CPU	128UL
 
 /*
  * The nulls marker consists of:
@@ -45,14 +46,42 @@
 #define HASH_BASE_MIN		(1 << (31 - HASH_BASE_BITS))
 #define HASH_RESERVED_SPACE	(HASH_BASE_BITS + 1)
 
+/* The bucket lock is selected based on the hash and protects mutations
+ * on a group of hash buckets.
+ *
+ * Important: When holding the bucket lock of both the old and new table
+ * during expansions and shrinking, the old bucket lock must always be
+ * acquired first.
+ */
+static spinlock_t *bucket_lock(const struct bucket_table *tbl, u32 hash)
+{
+	return &tbl->locks[hash & tbl->locks_mask];
+}
+
 #define ASSERT_RHT_MUTEX(HT) BUG_ON(!lockdep_rht_mutex_is_held(HT))
 
 #ifdef CONFIG_PROVE_LOCKING
-int lockdep_rht_mutex_is_held(const struct rhashtable *ht)
+int lockdep_rht_mutex_is_held(struct rhashtable *ht)
 {
-	return ht->p.mutex_is_held();
+#ifdef CONFIG_LOCKDEP
+	return (debug_locks) ? lockdep_is_held(&ht->mutex) : 1;
+#else
+	return 1;
+#endif
 }
 EXPORT_SYMBOL_GPL(lockdep_rht_mutex_is_held);
+
+int lockdep_rht_bucket_is_held(const struct bucket_table *tbl, u32 hash)
+{
+#ifdef CONFIG_LOCKDEP
+	spinlock_t *lock = bucket_lock(tbl, hash);
+
+	return (debug_locks) ? lockdep_is_held(lock) : 1;
+#else
+	return 1;
+#endif
+}
+EXPORT_SYMBOL_GPL(lockdep_rht_bucket_is_held);
 #endif
 
 static void *rht_obj(const struct rhashtable *ht, const struct rhash_head *he)
@@ -85,39 +114,19 @@ static u32 obj_raw_hashfn(const struct rhashtable *ht, const void *ptr)
  * @len:	length of key
  *
  * Computes the hash value using the hash function provided in the 'hashfn'
- * of struct rhashtable_params. The returned value is guaranteed to be
- * smaller than the number of buckets in the hash table.
+ * of struct rhashtable_params.
  */
 u32 rhashtable_hashfn(const struct rhashtable *ht, const void *key, u32 len)
 {
-	struct bucket_table *tbl = rht_dereference_rcu(ht->tbl, ht);
 	u32 hash;
 
 	hash = ht->p.hashfn(key, len, ht->p.hash_rnd);
 	hash >>= HASH_RESERVED_SPACE;
 
-	return rht_bucket_index(hash, tbl);
+	return hash;
 }
 EXPORT_SYMBOL_GPL(rhashtable_hashfn);
 
-/**
- * rhashtable_obj_hashfn - compute hash for hashed object
- * @ht:		hash table to compuate for
- * @ptr:	pointer to hashed object
- *
- * Computes the hash value using the hash function `hashfn` respectively
- * 'obj_hashfn' depending on whether the hash table is set up to work with
- * a fixed length key. The returned value is guaranteed to be smaller than
- * the number of buckets in the hash table.
- */
-u32 rhashtable_obj_hashfn(const struct rhashtable *ht, void *ptr)
-{
-	struct bucket_table *tbl = rht_dereference_rcu(ht->tbl, ht);
-
-	return rht_bucket_index(obj_raw_hashfn(ht, ptr), tbl);
-}
-EXPORT_SYMBOL_GPL(rhashtable_obj_hashfn);
-
 static u32 head_hashfn(const struct rhashtable *ht,
 		       const struct rhash_head *he,
 		       const struct bucket_table *tbl)
@@ -125,6 +134,56 @@ static u32 head_hashfn(const struct rhashtable *ht,
 	return rht_bucket_index(obj_raw_hashfn(ht, rht_obj(ht, he)), tbl);
 }
 
+static struct rhash_head __rcu **rht_list_tail(struct bucket_table *tbl, u32 n)
+{
+	struct rhash_head __rcu **pprev;
+
+	for (pprev = &tbl->buckets[n];
+	     !rht_is_a_nulls(rht_dereference_bucket(*pprev, tbl, n));
+	     pprev = &rht_dereference_bucket(*pprev, tbl, n)->next)
+		;
+
+	return pprev;
+}
+
+static int alloc_bucket_locks(struct rhashtable *ht, struct bucket_table *tbl)
+{
+	unsigned int i, size;
+#if defined(CONFIG_PROVE_LOCKING)
+	unsigned int nr_pcpus = 2;
+#else
+	unsigned int nr_pcpus = num_possible_cpus();
+#endif
+
+	nr_pcpus = min_t(unsigned int, nr_pcpus, 32UL);
+	size = nr_pcpus * ht->p.locks_mul;
+
+	if (sizeof(spinlock_t) != 0) {
+#ifdef CONFIG_NUMA
+		if (size * sizeof(spinlock_t) > PAGE_SIZE)
+			tbl->locks = vmalloc(size * sizeof(spinlock_t));
+		else
+#endif
+		tbl->locks = kmalloc_array(size, sizeof(spinlock_t),
+					   GFP_KERNEL);
+		if (!tbl->locks)
+			return -ENOMEM;
+		for (i = 0; i < size; i++)
+			spin_lock_init(&tbl->locks[i]);
+	}
+	tbl->locks_mask = size - 1;
+
+	return 0;
+}
+
+static void bucket_table_free(const struct bucket_table *tbl)
+{
+	if (tbl)
+		kvfree(tbl->locks);
+
+	kvfree(tbl);
+}
+
 static struct bucket_table *bucket_table_alloc(struct rhashtable *ht,
 					       size_t nbuckets)
 {
@@ -143,16 +202,16 @@ static struct bucket_table *bucket_table_alloc(struct rhashtable *ht,
 	for (i = 0; i < nbuckets; i++)
 		INIT_RHT_NULLS_HEAD(tbl->buckets[i], ht, i);
 
+	if (alloc_bucket_locks(ht, tbl) < 0) {
+		bucket_table_free(tbl);
+		return NULL;
+	}
+
 	tbl->size = nbuckets;
 
 	return tbl;
 }
 
-static void bucket_table_free(const struct bucket_table *tbl)
-{
-	kvfree(tbl);
-}
-
 /**
  * rht_grow_above_75 - returns true if nelems > 0.75 * table-size
  * @ht:		hash table
@@ -183,8 +242,11 @@ static void hashtable_chain_unzip(const struct rhashtable *ht,
 {
 	struct rhash_head *he, *p;
 	struct rhash_head __rcu *next;
+	spinlock_t *new_bucket_lock;
 	u32 hash, new_tbl_idx;
 
+	BUG_ON(!lockdep_rht_bucket_is_held(old_tbl, n));
+
 	/* Old bucket empty, no work needed. */
 	p = rht_get_bucket(old_tbl, n);
 	if (rht_is_a_nulls(p))
@@ -196,6 +258,10 @@ static void hashtable_chain_unzip(const struct rhashtable *ht,
 	 */
 	hash = obj_raw_hashfn(ht, rht_obj(ht, p));
 	new_tbl_idx = rht_bucket_index(hash, new_tbl);
+
+	new_bucket_lock = bucket_lock(new_tbl, new_tbl_idx);
+	spin_lock_bh_nested(new_bucket_lock, SINGLE_DEPTH_NESTING);
+
 	rht_for_each_continue(he, p->next, old_tbl, n) {
 		if (head_hashfn(ht, he, new_tbl) != new_tbl_idx)
 			break;
@@ -204,7 +270,7 @@ static void hashtable_chain_unzip(const struct rhashtable *ht,
 	RCU_INIT_POINTER(old_tbl->buckets[n], he);
 
 	/* Find the subsequent node which does hash to the same
-	 * bucket as node P, or NULL if no such node exists.
+	 * bucket as node P, or nulls if no such node exists.
 	 */
 	INIT_RHT_NULLS_HEAD(next, ht, hash);
 	if (!rht_is_a_nulls(he)) {
@@ -219,7 +285,9 @@ static void hashtable_chain_unzip(const struct rhashtable *ht,
 	/* Set p's next pointer to that subsequent node pointer,
 	 * bypassing the nodes which do not hash to p's bucket
 	 */
-	RCU_INIT_POINTER(p->next, next);
+	rcu_assign_pointer(p->next, next);
+
+	spin_unlock_bh(new_bucket_lock);
 }
 
 /**
@@ -239,8 +307,9 @@ int rhashtable_expand(struct rhashtable *ht)
 {
 	struct bucket_table *new_tbl, *old_tbl = rht_dereference(ht->tbl, ht);
 	struct rhash_head *he;
+	spinlock_t *new_bucket_lock, *old_bucket_lock;
 	unsigned int i, h;
-	bool complete;
+	bool complete = false;
 
 	ASSERT_RHT_MUTEX(ht);
 
@@ -253,22 +322,38 @@ int rhashtable_expand(struct rhashtable *ht)
 
 	ht->shift++;
 
-	/* For each new bucket, search the corresponding old bucket
-	 * for the first entry that hashes to the new bucket, and
-	 * link the new bucket to that entry. Since all the entries
-	 * which will end up in the new bucket appear in the same
-	 * old bucket, this constructs an entirely valid new hash
-	 * table, but with multiple buckets "zipped" together into a
-	 * single imprecise chain.
+	/* Make insertions go into the new, empty table right away. Deletions
+	 * and lookups will be attemped in both tables until we synchronzie.
+	 * The synchronize_rcu() guarantees for the new table to be picked up
+	 * so no new additions go into the old table while we relink.
+	 */
+	rcu_assign_pointer(ht->future_tbl, new_tbl);
+	synchronize_rcu();
+
+	/* For each new bucket, search the corresponding old bucket for the
+	 * first entry that hashes to the new bucket, and link the end of
+	 * newly formed bucket chain (containing entries added to future
+	 * table) to that entry. Since all the entries which will end up in
+	 * the new bucket appear in the same old bucket, this constructs an
+	 * entirely valid new hash table, but with multiple buckets
+	 * "zipped" together into a single imprecise chain.
 	 */
 	for (i = 0; i < new_tbl->size; i++) {
 		h = i & (old_tbl->size - 1);
+		old_bucket_lock = bucket_lock(old_tbl, h);
+		spin_lock_bh(old_bucket_lock);
+
 		rht_for_each(he, old_tbl, h) {
 			if (head_hashfn(ht, he, new_tbl) == i) {
-				RCU_INIT_POINTER(new_tbl->buckets[i], he);
+				new_bucket_lock = bucket_lock(new_tbl, i);
+				spin_lock_bh_nested(new_bucket_lock,
+						    SINGLE_DEPTH_NESTING);
+				rcu_assign_pointer(*rht_list_tail(new_tbl, i), he);
+				spin_unlock_bh(new_bucket_lock);
 				break;
 			}
 		}
+		spin_unlock_bh(old_bucket_lock);
 	}
 
 	/* Publish the new table pointer. Lookups may now traverse
@@ -278,7 +363,7 @@ int rhashtable_expand(struct rhashtable *ht)
 	rcu_assign_pointer(ht->tbl, new_tbl);
 
 	/* Unzip interleaved hash chains */
-	do {
+	while (!complete && !ht->being_destroyed) {
 		/* Wait for readers. All new readers will see the new
 		 * table, and thus no references to the old table will
 		 * remain.
@@ -291,11 +376,18 @@ int rhashtable_expand(struct rhashtable *ht)
 		 */
 		complete = true;
 		for (i = 0; i < old_tbl->size; i++) {
+			spinlock_t *old_bucket_lock;
+
+			old_bucket_lock = bucket_lock(old_tbl, i);
+			spin_lock_bh(old_bucket_lock);
+
 			hashtable_chain_unzip(ht, new_tbl, old_tbl, i);
-			if (!rht_is_a_nulls(old_tbl->buckets[i]))
+			if (!rht_is_a_nulls(rht_get_bucket(old_tbl, i)))
 				complete = false;
+
+			spin_unlock_bh(old_bucket_lock);
 		}
-	} while (!complete);
+	}
 
 	bucket_table_free(old_tbl);
 	return 0;
@@ -315,7 +407,7 @@ EXPORT_SYMBOL_GPL(rhashtable_expand);
 int rhashtable_shrink(struct rhashtable *ht)
 {
 	struct bucket_table *ntbl, *tbl = rht_dereference(ht->tbl, ht);
-	struct rhash_head __rcu **pprev;
+	spinlock_t *new_bucket_lock, *old_bucket_lock;
 	unsigned int i;
 
 	ASSERT_RHT_MUTEX(ht);
@@ -327,28 +419,31 @@ int rhashtable_shrink(struct rhashtable *ht)
 	if (ntbl == NULL)
 		return -ENOMEM;
 
-	ht->shift--;
+	rcu_assign_pointer(ht->future_tbl, ntbl);
+	synchronize_rcu();
 
 	/* Link each bucket in the new table to the first bucket
 	 * in the old table that contains entries which will hash
 	 * to the new bucket.
 	 */
 	for (i = 0; i < ntbl->size; i++) {
+		old_bucket_lock = bucket_lock(tbl, rht_bucket_index(i, tbl));
+		new_bucket_lock = bucket_lock(ntbl, i);
+
+		spin_lock_bh(old_bucket_lock);
+		spin_lock_bh_nested(new_bucket_lock, SINGLE_DEPTH_NESTING);
+
 		ntbl->buckets[i] = tbl->buckets[i];
+		rcu_assign_pointer(*rht_list_tail(ntbl, i),
+				   tbl->buckets[i + ntbl->size]);
 
-		/* Link each bucket in the new table to the first bucket
-		 * in the old table that contains entries which will hash
-		 * to the new bucket.
-		 */
-		for (pprev = &ntbl->buckets[i];
-		     !rht_is_a_nulls(rht_dereference_bucket(*pprev, ntbl, i));
-		     pprev = &rht_dereference_bucket(*pprev, ntbl, i)->next)
-			;
-		RCU_INIT_POINTER(*pprev, tbl->buckets[i + ntbl->size]);
+		spin_unlock_bh(new_bucket_lock);
+		spin_unlock_bh(old_bucket_lock);
 	}
 
 	/* Publish the new, valid hash table */
 	rcu_assign_pointer(ht->tbl, ntbl);
+	ht->shift--;
 
 	/* Wait for readers. No new readers will have references to the
 	 * old hash table.
@@ -361,6 +456,23 @@ int rhashtable_shrink(struct rhashtable *ht)
 }
 EXPORT_SYMBOL_GPL(rhashtable_shrink);
 
+static void rht_deferred_worker(struct work_struct *work)
+{
+	struct rhashtable *ht;
+	struct bucket_table *tbl;
+
+	ht = container_of(work, struct rhashtable, run_work.work);
+	mutex_lock(&ht->mutex);
+	tbl = rht_dereference(ht->tbl, ht);
+
+	if (ht->p.grow_decision && ht->p.grow_decision(ht, tbl->size))
+		rhashtable_expand(ht);
+	else if (ht->p.shrink_decision && ht->p.shrink_decision(ht, tbl->size))
+		rhashtable_shrink(ht);
+
+	mutex_unlock(&ht->mutex);
+}
+
 /**
  * rhashtable_insert - insert object into hash hash table
  * @ht:		hash table
@@ -369,18 +481,23 @@ EXPORT_SYMBOL_GPL(rhashtable_shrink);
  * Will automatically grow the table via rhashtable_expand() if the the
  * grow_decision function specified at rhashtable_init() returns true.
  *
- * The caller must ensure that no concurrent table mutations occur. It is
- * however valid to have concurrent lookups if they are RCU protected.
+ * Will take a per bucket spinlock to protect against mutual mutations
+ * on the same bucket.
  */
 void rhashtable_insert(struct rhashtable *ht, struct rhash_head *obj)
 {
-	struct bucket_table *tbl = rht_dereference(ht->tbl, ht);
+	struct bucket_table *tbl;
+	spinlock_t *lock;
 	u32 hash, idx;
 
-	ASSERT_RHT_MUTEX(ht);
-
+	rcu_read_lock();
+	tbl = rht_dereference_rcu(ht->future_tbl, ht);
 	hash = obj_raw_hashfn(ht, rht_obj(ht, obj));
 	idx = rht_bucket_index(hash, tbl);
+
+	lock = bucket_lock(tbl, idx);
+	spin_lock_bh(lock);
+
 	if (rht_is_a_nulls(rht_get_bucket(tbl, idx)))
 		INIT_RHT_NULLS_HEAD(obj->next, ht, hash);
 	else
@@ -388,36 +505,14 @@ void rhashtable_insert(struct rhashtable *ht, struct rhash_head *obj)
 	rcu_assign_pointer(tbl->buckets[idx], obj);
 	ht->nelems++;
 
-	if (ht->p.grow_decision && ht->p.grow_decision(ht, tbl->size))
-		rhashtable_expand(ht);
-}
-EXPORT_SYMBOL_GPL(rhashtable_insert);
-
-/**
- * rhashtable_remove_pprev - remove object from hash table given previous element
- * @ht:		hash table
- * @obj:	pointer to hash head inside object
- * @pprev:	pointer to previous element
- *
- * Identical to rhashtable_remove() but caller is alreayd aware of the element
- * in front of the element to be deleted. This is in particular useful for
- * deletion when combined with walking or lookup.
- */
-void rhashtable_remove_pprev(struct rhashtable *ht, struct rhash_head *obj,
-			     struct rhash_head __rcu **pprev)
-{
-	struct bucket_table *tbl = rht_dereference(ht->tbl, ht);
-
-	ASSERT_RHT_MUTEX(ht);
+	spin_unlock_bh(lock);
 
-	RCU_INIT_POINTER(*pprev, obj->next);
-	ht->nelems--;
+	if (ht->p.grow_decision && ht->p.grow_decision(ht, tbl->size))
+		schedule_delayed_work(&ht->run_work, 0);
 
-	if (ht->p.shrink_decision &&
-	    ht->p.shrink_decision(ht, tbl->size))
-		rhashtable_shrink(ht);
+	rcu_read_unlock();
 }
-EXPORT_SYMBOL_GPL(rhashtable_remove_pprev);
+EXPORT_SYMBOL_GPL(rhashtable_insert);
 
 /**
  * rhashtable_remove - remove object from hash table
@@ -436,14 +531,20 @@ EXPORT_SYMBOL_GPL(rhashtable_remove_pprev);
  */
 bool rhashtable_remove(struct rhashtable *ht, struct rhash_head *obj)
 {
-	struct bucket_table *tbl = rht_dereference(ht->tbl, ht);
-	struct rhash_head __rcu **pprev;
+	struct bucket_table *tbl;
 	struct rhash_head *he;
+	struct rhash_head __rcu **pprev;
+	spinlock_t *lock;
 	u32 idx;
 
-	ASSERT_RHT_MUTEX(ht);
-
+	rcu_read_lock();
+	tbl = rht_dereference_rcu(ht->future_tbl, ht);
 	idx = head_hashfn(ht, obj, tbl);
+
+	lock = bucket_lock(tbl, idx);
+	spin_lock_bh(lock);
+
+restart:
 	pprev = &tbl->buckets[idx];
 	rht_for_each(he, tbl, idx) {
 		if (he != obj) {
@@ -451,10 +552,34 @@ bool rhashtable_remove(struct rhashtable *ht, struct rhash_head *obj)
 			continue;
 		}
 
-		rhashtable_remove_pprev(ht, he, pprev);
+		rcu_assign_pointer(*pprev, obj->next);
+		ht->nelems--;
+
+		spin_unlock_bh(lock);
+
+		if (ht->p.shrink_decision &&
+		    ht->p.shrink_decision(ht, tbl->size))
+			schedule_delayed_work(&ht->run_work, 0);
+
+		rcu_read_unlock();
+
 		return true;
 	}
 
+	if (tbl != rht_dereference_rcu(ht->tbl, ht)) {
+		spin_unlock_bh(lock);
+
+		tbl = rht_dereference_rcu(ht->tbl, ht);
+		idx = head_hashfn(ht, obj, tbl);
+
+		lock = bucket_lock(tbl, idx);
+		spin_lock_bh(lock);
+		goto restart;
+	}
+
+	spin_unlock_bh(lock);
+	rcu_read_unlock();
+
 	return false;
 }
 EXPORT_SYMBOL_GPL(rhashtable_remove);
@@ -471,24 +596,32 @@ EXPORT_SYMBOL_GPL(rhashtable_remove);
  * paramter set). It will BUG() if used inappropriately.
  *
  * Lookups may occur in parallel with hash mutations as long as the lookup is
- * guarded by rcu_read_lock(). The caller must take care of this.
+ * guarded by rcu_read_lock(). Otherwise the ht->mutex must be held.
  */
 void *rhashtable_lookup(const struct rhashtable *ht, const void *key)
 {
-	const struct bucket_table *tbl = rht_dereference_rcu(ht->tbl, ht);
+	const struct bucket_table *tbl, *old_tbl;
 	struct rhash_head *he;
 	u32 h;
 
 	BUG_ON(!ht->p.key_len);
 
+	old_tbl = rht_dereference_rcu(ht->tbl, ht);
+	tbl = rht_dereference_rcu(ht->future_tbl, ht);
 	h = rhashtable_hashfn(ht, key, ht->p.key_len);
-	rht_for_each_rcu(he, tbl, h) {
+restart:
+	rht_for_each_rcu(he, tbl, rht_bucket_index(h, tbl)) {
 		if (memcmp(rht_obj(ht, he) + ht->p.key_offset, key,
 			   ht->p.key_len))
 			continue;
 		return rht_obj(ht, he);
 	}
 
+	if (unlikely(tbl != old_tbl)) {
+		tbl = old_tbl;
+		goto restart;
+	}
+
 	return NULL;
 }
 EXPORT_SYMBOL_GPL(rhashtable_lookup);
@@ -496,7 +629,7 @@ EXPORT_SYMBOL_GPL(rhashtable_lookup);
 /**
  * rhashtable_lookup_compare - search hash table with compare function
  * @ht:		hash table
- * @hash:	hash value of desired entry
+ * @key:	the pointer to the key
  * @compare:	compare function, must return true on match
  * @arg:	argument passed on to compare function
  *
@@ -508,21 +641,28 @@ EXPORT_SYMBOL_GPL(rhashtable_lookup);
  *
  * Returns the first entry on which the compare function returned true.
  */
-void *rhashtable_lookup_compare(const struct rhashtable *ht, u32 hash,
+void *rhashtable_lookup_compare(const struct rhashtable *ht, const void *key,
 				bool (*compare)(void *, void *), void *arg)
 {
-	const struct bucket_table *tbl = rht_dereference_rcu(ht->tbl, ht);
+	const struct bucket_table *tbl, *old_tbl;
 	struct rhash_head *he;
+	u32 h;
 
-	if (unlikely(hash >= tbl->size))
-		return NULL;
-
-	rht_for_each_rcu(he, tbl, hash) {
+	old_tbl = rht_dereference_rcu(ht->tbl, ht);
+	tbl = rht_dereference_rcu(ht->future_tbl, ht);
+	h = rhashtable_hashfn(ht, key, ht->p.key_len);
+restart:
+	rht_for_each_rcu(he, tbl, rht_bucket_index(h, tbl)) {
 		if (!compare(rht_obj(ht, he), arg))
 			continue;
 		return (void *) he - ht->p.head_offset;
 	}
 
+	if (unlikely(tbl != old_tbl)) {
+		tbl = old_tbl;
+		goto restart;
+	}
+
 	return NULL;
 }
 EXPORT_SYMBOL_GPL(rhashtable_lookup_compare);
@@ -554,7 +694,6 @@ static size_t rounded_hashtable_size(struct rhashtable_params *params)
  *	.key_offset = offsetof(struct test_obj, key),
  *	.key_len = sizeof(int),
  *	.hashfn = arch_fast_hash,
- *	.mutex_is_held = &my_mutex_is_held,
  * };
  *
  * Configuration Example 2: Variable length keys
@@ -574,7 +713,6 @@ static size_t rounded_hashtable_size(struct rhashtable_params *params)
  *	.head_offset = offsetof(struct test_obj, node),
  *	.hashfn = arch_fast_hash,
  *	.obj_hashfn = my_hash_fn,
- *	.mutex_is_held = &my_mutex_is_held,
  * };
  */
 int rhashtable_init(struct rhashtable *ht, struct rhashtable_params *params)
@@ -599,6 +737,12 @@ int rhashtable_init(struct rhashtable *ht, struct rhashtable_params *params)
 
 	memset(ht, 0, sizeof(*ht));
 	memcpy(&ht->p, params, sizeof(*params));
+	mutex_init(&ht->mutex);
+
+	if (params->locks_mul)
+		ht->p.locks_mul = roundup_pow_of_two(params->locks_mul);
+	else
+		ht->p.locks_mul = BUCKET_LOCKS_PER_CPU;
 
 	tbl = bucket_table_alloc(ht, size);
 	if (tbl == NULL)
@@ -606,10 +750,14 @@ int rhashtable_init(struct rhashtable *ht, struct rhashtable_params *params)
 
 	ht->shift = ilog2(tbl->size);
 	RCU_INIT_POINTER(ht->tbl, tbl);
+	RCU_INIT_POINTER(ht->future_tbl, tbl);
 
 	if (!ht->p.hash_rnd)
 		get_random_bytes(&ht->p.hash_rnd, sizeof(ht->p.hash_rnd));
 
+	if (ht->p.grow_decision || ht->p.shrink_decision)
+		INIT_DEFERRABLE_WORK(&ht->run_work, rht_deferred_worker);
+
 	return 0;
 }
 EXPORT_SYMBOL_GPL(rhashtable_init);
@@ -620,11 +768,16 @@ EXPORT_SYMBOL_GPL(rhashtable_init);
  *
  * Frees the bucket array.
  */
-void rhashtable_destroy(const struct rhashtable *ht)
+void rhashtable_destroy(struct rhashtable *ht)
 {
-	const struct bucket_table *tbl = rht_dereference(ht->tbl, ht);
+	ht->being_destroyed = true;
 
-	bucket_table_free(tbl);
+	mutex_lock(&ht->mutex);
+
+	cancel_delayed_work(&ht->run_work);
+	bucket_table_free(rht_dereference(ht->tbl, ht));
+
+	mutex_unlock(&ht->mutex);
 }
 EXPORT_SYMBOL_GPL(rhashtable_destroy);
 
@@ -639,11 +792,6 @@ EXPORT_SYMBOL_GPL(rhashtable_destroy);
 #define TEST_PTR	((void *) 0xdeadbeef)
 #define TEST_NEXPANDS	4
 
-static int test_mutex_is_held(void)
-{
-	return 1;
-}
-
 struct test_obj {
 	void			*ptr;
 	int			value;
@@ -755,7 +903,9 @@ static int __init test_rhashtable(struct rhashtable *ht)
 
 	for (i = 0; i < TEST_NEXPANDS; i++) {
 		pr_info("  Table expansion iteration %u...\n", i);
+		mutex_lock(&ht->mutex);
 		rhashtable_expand(ht);
+		mutex_unlock(&ht->mutex);
 
 		rcu_read_lock();
 		pr_info("  Verifying lookups...\n");
@@ -765,7 +915,9 @@ static int __init test_rhashtable(struct rhashtable *ht)
 
 	for (i = 0; i < TEST_NEXPANDS; i++) {
 		pr_info("  Table shrinkage iteration %u...\n", i);
+		mutex_lock(&ht->mutex);
 		rhashtable_shrink(ht);
+		mutex_unlock(&ht->mutex);
 
 		rcu_read_lock();
 		pr_info("  Verifying lookups...\n");
@@ -804,7 +956,6 @@ static int __init test_rht_init(void)
 		.key_offset = offsetof(struct test_obj, value),
 		.key_len = sizeof(int),
 		.hashfn = arch_fast_hash,
-		.mutex_is_held = &test_mutex_is_held,
 		.grow_decision = rht_grow_above_75,
 		.shrink_decision = rht_shrink_below_30,
 	};
diff --git a/net/netfilter/nft_hash.c b/net/netfilter/nft_hash.c
index 68b654b..436f77b 100644
--- a/net/netfilter/nft_hash.c
+++ b/net/netfilter/nft_hash.c
@@ -83,41 +83,48 @@ static void nft_hash_remove(const struct nft_set *set,
 			    const struct nft_set_elem *elem)
 {
 	struct rhashtable *priv = nft_set_priv(set);
-	struct rhash_head *he, __rcu **pprev;
-
-	pprev = elem->cookie;
-	he = rht_dereference((*pprev), priv);
-
-	rhashtable_remove_pprev(priv, he, pprev);
+	struct rhash_head *he = elem->cookie;
 
+	rhashtable_remove(priv, he);
 	synchronize_rcu();
 	kfree(he);
 }
 
+struct nft_compare_arg {
+	const struct nft_set *set;
+	struct nft_set_elem *elem;
+};
+
+static bool nft_hash_compare(void *ptr, void *arg)
+{
+	struct nft_hash_elem *he = ptr;
+	struct nft_compare_arg *x = arg;
+
+	if (!nft_data_cmp(&he->key, &x->elem->key, x->set->klen)) {
+		x->elem->cookie = &he->node;
+		x->elem->flags = 0;
+		if (x->set->flags & NFT_SET_MAP)
+			nft_data_copy(&x->elem->data, he->data);
+
+		return true;
+	}
+
+	return false;
+}
+
 static int nft_hash_get(const struct nft_set *set, struct nft_set_elem *elem)
 {
 	const struct rhashtable *priv = nft_set_priv(set);
-	const struct bucket_table *tbl = rht_dereference_rcu(priv->tbl, priv);
-	struct rhash_head __rcu * const *pprev;
-	struct rhash_head *pos;
-	struct nft_hash_elem *he;
-	u32 h;
-
-	h = rhashtable_hashfn(priv, &elem->key, set->klen);
-	pprev = &tbl->buckets[h];
-	rht_for_each_entry_rcu(he, pos, tbl, h, node) {
-		if (nft_data_cmp(&he->key, &elem->key, set->klen)) {
-			pprev = &he->node.next;
-			continue;
-		}
+	struct nft_compare_arg arg = {
+		.set = set,
+		.elem = elem,
+	};
 
-		elem->cookie = (void *)pprev;
-		elem->flags = 0;
-		if (set->flags & NFT_SET_MAP)
-			nft_data_copy(&elem->data, he->data);
+	if (rhashtable_lookup_compare(priv, &elem->key,
+				      &nft_hash_compare, &arg))
 		return 0;
-	}
-	return -ENOENT;
+	else
+		return -ENOENT;
 }
 
 static void nft_hash_walk(const struct nft_ctx *ctx, const struct nft_set *set,
@@ -182,7 +189,7 @@ static int nft_hash_init(const struct nft_set *set,
 
 static void nft_hash_destroy(const struct nft_set *set)
 {
-	const struct rhashtable *priv = nft_set_priv(set);
+	struct rhashtable *priv = nft_set_priv(set);
 	const struct bucket_table *tbl;
 	struct nft_hash_elem *he;
 	struct rhash_head *pos, *next;
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 98e5b58..4e02fa2 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1015,11 +1015,8 @@ static struct sock *__netlink_lookup(struct netlink_table *table, u32 portid,
 		.net = net,
 		.portid = portid,
 	};
-	u32 hash;
 
-	hash = rhashtable_hashfn(&table->hash, &portid, sizeof(portid));
-
-	return rhashtable_lookup_compare(&table->hash, hash,
+	return rhashtable_lookup_compare(&table->hash, &portid,
 					 &netlink_compare, &arg);
 }
 
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/5] rhashtable: Remove gfp_flags from insert and remove functions
  2014-09-15 12:18 ` [PATCH 1/5] rhashtable: Remove gfp_flags from insert and remove functions Thomas Graf
@ 2014-09-15 12:35   ` Eric Dumazet
  2014-09-15 12:49     ` Thomas Graf
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Dumazet @ 2014-09-15 12:35 UTC (permalink / raw)
  To: Thomas Graf; +Cc: davem, paulmck, john.r.fastabend, kaber, netdev, linux-kernel

On Mon, 2014-09-15 at 14:18 +0200, Thomas Graf wrote:
> As the expansion/shrinking is moved to a worker thread, no allocations
> will be performed anymore.
> 

You meant : no GFP_ATOMIC allocations ?

I would rephrase using something like :

Because hash resizes are potentially time consuming, they'll be
performed in process context where GFP_KERNEL allocations are preferred.

> Signed-off-by: Thomas Graf <tgraf@suug.ch>
> ---
>  include/linux/rhashtable.h | 10 +++++-----
>  lib/rhashtable.c           | 41 +++++++++++++++++------------------------
>  net/netfilter/nft_hash.c   |  4 ++--
>  net/netlink/af_netlink.c   |  4 ++--
>  4 files changed, 26 insertions(+), 33 deletions(-)
> 
> diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
> index fb298e9d..942fa44 100644
> --- a/include/linux/rhashtable.h
> +++ b/include/linux/rhashtable.h
> @@ -96,16 +96,16 @@ int rhashtable_init(struct rhashtable *ht, struct rhashtable_params *params);
>  u32 rhashtable_hashfn(const struct rhashtable *ht, const void *key, u32 len);
>  u32 rhashtable_obj_hashfn(const struct rhashtable *ht, void *ptr);
>  
> -void rhashtable_insert(struct rhashtable *ht, struct rhash_head *node, gfp_t);
> -bool rhashtable_remove(struct rhashtable *ht, struct rhash_head *node, gfp_t);
> +void rhashtable_insert(struct rhashtable *ht, struct rhash_head *node);
> +bool rhashtable_remove(struct rhashtable *ht, struct rhash_head *node);
>  void rhashtable_remove_pprev(struct rhashtable *ht, struct rhash_head *obj,
> -			     struct rhash_head __rcu **pprev, gfp_t flags);
> +			     struct rhash_head __rcu **pprev);
>  
>  bool rht_grow_above_75(const struct rhashtable *ht, size_t new_size);
>  bool rht_shrink_below_30(const struct rhashtable *ht, size_t new_size);
>  
> -int rhashtable_expand(struct rhashtable *ht, gfp_t flags);
> -int rhashtable_shrink(struct rhashtable *ht, gfp_t flags);
> +int rhashtable_expand(struct rhashtable *ht);
> +int rhashtable_shrink(struct rhashtable *ht);
>  
>  void *rhashtable_lookup(const struct rhashtable *ht, const void *key);
>  void *rhashtable_lookup_compare(const struct rhashtable *ht, u32 hash,
> diff --git a/lib/rhashtable.c b/lib/rhashtable.c
> index 8dfec3f..c133d82 100644
> --- a/lib/rhashtable.c
> +++ b/lib/rhashtable.c
> @@ -108,13 +108,13 @@ static u32 head_hashfn(const struct rhashtable *ht,
>  	return obj_hashfn(ht, rht_obj(ht, he), hsize);
>  }
>  
> -static struct bucket_table *bucket_table_alloc(size_t nbuckets, gfp_t flags)
> +static struct bucket_table *bucket_table_alloc(size_t nbuckets)
>  {
>  	struct bucket_table *tbl;
>  	size_t size;
>  
>  	size = sizeof(*tbl) + nbuckets * sizeof(tbl->buckets[0]);
> -	tbl = kzalloc(size, flags);
> +	tbl = kzalloc(size, GFP_KERNEL);

Add __GFP_NOWARN, as you fallback to vzalloc ?

>  	if (tbl == NULL)
>  		tbl = vzalloc(size);

btw, inet hash table uses alloc_large_system_hash() which spreads the
pages on all available NUMA nodes :

# dmesg | grep "TCP established"
[    1.223046] TCP established hash table entries: 524288 (order: 10, 4194304 bytes)
# grep alloc_large_system_hash /proc/vmallocinfo | grep pages=1024
0xffffc900181d6000-0xffffc900185d7000 4198400 alloc_large_system_hash+0x177/0x237 pages=1024 vmalloc vpages N0=512 N1=512
0xffffc900185d7000-0xffffc900189d8000 4198400 alloc_large_system_hash+0x177/0x237 pages=1024 vmalloc vpages N0=512 N1=512
0xffffc90028a01000-0xffffc90028e02000 4198400 alloc_large_system_hash+0x177/0x237 pages=1024 vmalloc vpages N0=512 N1=512

So we might keep in mind to keep this numa policy.

Rest of the patch looks fine to me, thanks.



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/5] rhashtable: Remove gfp_flags from insert and remove functions
  2014-09-15 12:35   ` Eric Dumazet
@ 2014-09-15 12:49     ` Thomas Graf
  2014-09-15 14:49       ` Eric Dumazet
  2014-09-24  4:11       ` Eric W. Biederman
  0 siblings, 2 replies; 13+ messages in thread
From: Thomas Graf @ 2014-09-15 12:49 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: davem, paulmck, john.r.fastabend, kaber, netdev, linux-kernel

On 09/15/14 at 05:35am, Eric Dumazet wrote:
> On Mon, 2014-09-15 at 14:18 +0200, Thomas Graf wrote:
> > As the expansion/shrinking is moved to a worker thread, no allocations
> > will be performed anymore.
> > 
> 
> You meant : no GFP_ATOMIC allocations ?
> 
> I would rephrase using something like :
> 
> Because hash resizes are potentially time consuming, they'll be
> performed in process context where GFP_KERNEL allocations are preferred.

I meant to say no allocations in insert/remove anymore but your wording
is even clearer. I'll update it.

> > -	tbl = kzalloc(size, flags);
> > +	tbl = kzalloc(size, GFP_KERNEL);
> 
> Add __GFP_NOWARN, as you fallback to vzalloc ?

Good point.

> btw, inet hash table uses alloc_large_system_hash() which spreads the
> pages on all available NUMA nodes :
> 
> # dmesg | grep "TCP established"
> [    1.223046] TCP established hash table entries: 524288 (order: 10, 4194304 bytes)
> # grep alloc_large_system_hash /proc/vmallocinfo | grep pages=1024
> 0xffffc900181d6000-0xffffc900185d7000 4198400 alloc_large_system_hash+0x177/0x237 pages=1024 vmalloc vpages N0=512 N1=512
> 0xffffc900185d7000-0xffffc900189d8000 4198400 alloc_large_system_hash+0x177/0x237 pages=1024 vmalloc vpages N0=512 N1=512
> 0xffffc90028a01000-0xffffc90028e02000 4198400 alloc_large_system_hash+0x177/0x237 pages=1024 vmalloc vpages N0=512 N1=512
> 
> So we might keep in mind to keep this numa policy.

Agreed. Will introduce this through a table parameter option when
converting the inet hash table.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/5] rhashtable: Remove gfp_flags from insert and remove functions
  2014-09-15 12:49     ` Thomas Graf
@ 2014-09-15 14:49       ` Eric Dumazet
  2014-09-15 14:56         ` Thomas Graf
  2014-09-24  4:11       ` Eric W. Biederman
  1 sibling, 1 reply; 13+ messages in thread
From: Eric Dumazet @ 2014-09-15 14:49 UTC (permalink / raw)
  To: Thomas Graf; +Cc: davem, paulmck, john.r.fastabend, kaber, netdev, linux-kernel

On Mon, 2014-09-15 at 13:49 +0100, Thomas Graf wrote:

> Agreed. Will introduce this through a table parameter option when
> converting the inet hash table.

I am not sure you covered the /proc/net/tcp problem yet ? (or inet_diag)



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/5] rhashtable: Remove gfp_flags from insert and remove functions
  2014-09-15 14:49       ` Eric Dumazet
@ 2014-09-15 14:56         ` Thomas Graf
  0 siblings, 0 replies; 13+ messages in thread
From: Thomas Graf @ 2014-09-15 14:56 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: davem, paulmck, john.r.fastabend, kaber, netdev, linux-kernel

On 09/15/14 at 07:49am, Eric Dumazet wrote:
> On Mon, 2014-09-15 at 13:49 +0100, Thomas Graf wrote:
> 
> > Agreed. Will introduce this through a table parameter option when
> > converting the inet hash table.
> 
> I am not sure you covered the /proc/net/tcp problem yet ? (or inet_diag)

I haven't digged through inet_diag in all detail yet but was planning to
acquire ht->mutex to block expansions and then take bucket locks during
bucket traverals as done now. I will do that next but wanted to get this
out for comments in the meantime.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 5/5] rhashtable: Per bucket locks & expansion/shrinking in work queue
  2014-09-15 12:18 ` [PATCH 5/5] rhashtable: Per bucket locks & expansion/shrinking in work queue Thomas Graf
@ 2014-09-15 15:23   ` Eric Dumazet
  2014-09-16 13:22     ` Thomas Graf
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Dumazet @ 2014-09-15 15:23 UTC (permalink / raw)
  To: Thomas Graf; +Cc: davem, paulmck, john.r.fastabend, kaber, netdev, linux-kernel

On Mon, 2014-09-15 at 14:18 +0200, Thomas Graf wrote:

> +static int alloc_bucket_locks(struct rhashtable *ht, struct bucket_table *tbl)
> +{
> +	unsigned int i, size;
> +#if defined(CONFIG_PROVE_LOCKING)
> +	unsigned int nr_pcpus = 2;
> +#else
> +	unsigned int nr_pcpus = num_possible_cpus();
> +#endif
> +
> +	nr_pcpus = min_t(unsigned int, nr_pcpus, 32UL);
> +	size = nr_pcpus * ht->p.locks_mul;
> +

You need to roundup to next power of two.

> +	if (sizeof(spinlock_t) != 0) {
> +#ifdef CONFIG_NUMA
> +		if (size * sizeof(spinlock_t) > PAGE_SIZE)
> +			tbl->locks = vmalloc(size * sizeof(spinlock_t));
> +		else
> +#endif
> +		tbl->locks = kmalloc_array(size, sizeof(spinlock_t),
> +					   GFP_KERNEL);
> +		if (!tbl->locks)
> +			return -ENOMEM;
> +		for (i = 0; i < size; i++)
> +			spin_lock_init(&tbl->locks[i]);
> +	}
> +	tbl->locks_mask = size - 1;
> +
> +	return 0;
> +}



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 5/5] rhashtable: Per bucket locks & expansion/shrinking in work queue
  2014-09-15 15:23   ` Eric Dumazet
@ 2014-09-16 13:22     ` Thomas Graf
  0 siblings, 0 replies; 13+ messages in thread
From: Thomas Graf @ 2014-09-16 13:22 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: davem, paulmck, john.r.fastabend, kaber, netdev, linux-kernel

On 09/15/14 at 08:23am, Eric Dumazet wrote:
> You need to roundup to next power of two.

Fixed, thanks!

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/5] rhashtable: Remove gfp_flags from insert and remove functions
  2014-09-15 12:49     ` Thomas Graf
  2014-09-15 14:49       ` Eric Dumazet
@ 2014-09-24  4:11       ` Eric W. Biederman
  1 sibling, 0 replies; 13+ messages in thread
From: Eric W. Biederman @ 2014-09-24  4:11 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Eric Dumazet, davem, paulmck, john.r.fastabend, kaber, netdev,
	linux-kernel

Thomas Graf <tgraf@suug.ch> writes:

> On 09/15/14 at 05:35am, Eric Dumazet wrote:
>> On Mon, 2014-09-15 at 14:18 +0200, Thomas Graf wrote:
>> > As the expansion/shrinking is moved to a worker thread, no allocations
>> > will be performed anymore.
>> > 
>> 
>> You meant : no GFP_ATOMIC allocations ?
>> 
>> I would rephrase using something like :
>> 
>> Because hash resizes are potentially time consuming, they'll be
>> performed in process context where GFP_KERNEL allocations are preferred.
>
> I meant to say no allocations in insert/remove anymore but your wording
> is even clearer. I'll update it.
>
>> > -	tbl = kzalloc(size, flags);
>> > +	tbl = kzalloc(size, GFP_KERNEL);
>> 
>> Add __GFP_NOWARN, as you fallback to vzalloc ?
>
> Good point.

It needs to be both __GFP_NOWARN and __GFP_NORETRY.

Otherwise the system will kick in the OOM killer before it falls back to
vzalloc.  Which I can't imagine anyone wanting.

Look at the history of alloc_fdmem in fs/file.c for the real world
reasoning.

Eric

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2014-09-24  4:11 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-15 12:17 [PATCH 0/5 RFC net-next] rhashtable: Parallel atomic insert/deletion & deferred expansion Thomas Graf
2014-09-15 12:18 ` [PATCH 1/5] rhashtable: Remove gfp_flags from insert and remove functions Thomas Graf
2014-09-15 12:35   ` Eric Dumazet
2014-09-15 12:49     ` Thomas Graf
2014-09-15 14:49       ` Eric Dumazet
2014-09-15 14:56         ` Thomas Graf
2014-09-24  4:11       ` Eric W. Biederman
2014-09-15 12:18 ` [PATCH 2/5] rhashtable: Check for count misatch in selftest Thomas Graf
2014-09-15 12:18 ` [PATCH 3/5] rhashtable: Convert to nulls list Thomas Graf
2014-09-15 12:18 ` [PATCH 4/5] spinlock: Add spin_lock_bh_nested() Thomas Graf
2014-09-15 12:18 ` [PATCH 5/5] rhashtable: Per bucket locks & expansion/shrinking in work queue Thomas Graf
2014-09-15 15:23   ` Eric Dumazet
2014-09-16 13:22     ` Thomas Graf

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.