linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/6] rhashtable: assorted fixes and enhancements
@ 2018-03-26 23:33 NeilBrown
  2018-03-26 23:33 ` [PATCH 1/6] rhashtable: improve documentation for rhashtable_walk_peek() NeilBrown
                   ` (5 more replies)
  0 siblings, 6 replies; 33+ messages in thread
From: NeilBrown @ 2018-03-26 23:33 UTC (permalink / raw)
  To: Thomas Graf, Herbert Xu; +Cc: netdev, linux-kernel

Hi,
 I'm hoping to use rhashtable in lustre, to replace the resizeable
 hashtable implementation in libcfs.
 While working through the conversion I found some minor bugs in the
 rhashtable code and documentation, and some areas where enhancements
 could make rhashtable a better fit for lustre.

 The following 6 patches are the result.  Please review.

 It would help me if I could get an Ack for these patches, and could
 then submit them through the drivers/staging tree together with the
 lustre changes that make use of rhashtable.  The first 2 are mostly
 just fixes to comments and can go in through the netdev tree if you
 prefer - the last 4 are needed for lustre to work
 correctly/optimally.

Thanks,
NeilBrown


---

NeilBrown (6):
      rhashtable: improve documentation for rhashtable_walk_peek()
      rhashtable: remove outdated comments about grow_decision etc
      rhashtable: reset iter when rhashtable_walk_start sees new table
      rhashtable: allow a walk of the hash table without missing objects.
      rhashtable: support guaranteed successful insertion.
      rhashtable: allow element counting to be disabled.


 include/linux/rhashtable.h |   89 ++++++++++++++++++++++++++++----------
 lib/rhashtable.c           |  102 +++++++++++++++++++++++++++++++-------------
 2 files changed, 136 insertions(+), 55 deletions(-)

--
Signature


* [PATCH 1/6] rhashtable: improve documentation for rhashtable_walk_peek()
  2018-03-26 23:33 [PATCH 0/6] rhashtable: assorted fixes and enhancements NeilBrown
@ 2018-03-26 23:33 ` NeilBrown
  2018-03-27 10:55   ` Sergei Shtylyov
                     ` (4 more replies)
  2018-03-26 23:33 ` [PATCH 3/6] rhashtable: reset iter when rhashtable_walk_start sees new table NeilBrown
                   ` (4 subsequent siblings)
  5 siblings, 5 replies; 33+ messages in thread
From: NeilBrown @ 2018-03-26 23:33 UTC (permalink / raw)
  To: Thomas Graf, Herbert Xu; +Cc: netdev, linux-kernel

The documentation for rhashtable_walk_peek() is wrong.  It claims to
return the *next* entry, whereas it in fact returns the *previous*
entry.
However, if no entries have yet been returned - or if the iterator
was reset due to a resize event - then rhashtable_walk_peek()
*does* return the next entry, but it also advances the iterator.

I suspect that this interface should be discarded and the one user
should be changed to not require it.  Possibly this patch should be
seen as a first step in that conversation.

This patch mostly corrects the documentation, but does make a
small code change so that the documentation can be correct without
listing too many special cases.  I don't think the one user will
be affected by the code change.
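
For illustration, a minimal sketch of a walker relying on the corrected
semantics ('struct foo' and the debug message are assumptions, not part
of the patch):

#include <linux/rhashtable.h>

struct foo {
	u32			key;
	struct rhash_head	rhead;
};

static void foo_walk(struct rhashtable *ht)
{
	struct rhashtable_iter iter;
	struct foo *obj;

	rhashtable_walk_enter(ht, &iter);
	rhashtable_walk_start(&iter);
	while ((obj = rhashtable_walk_next(&iter)) != NULL) {
		if (IS_ERR(obj))	/* resize seen: walk was rewound */
			continue;
		/* rhashtable_walk_peek() re-returns 'obj', the entry most
		 * recently returned by rhashtable_walk_next(), or NULL if
		 * nothing has been returned since the last (re)start; it
		 * no longer advances the iterator in that case. */
		if (rhashtable_walk_peek(&iter) != obj)
			pr_debug("previous entry removed or walk reset\n");
	}
	rhashtable_walk_stop(&iter);
	rhashtable_walk_exit(&iter);
}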

Signed-off-by: NeilBrown <neilb@suse.com>
---
 lib/rhashtable.c |   17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 3825c30aaa36..24a57ca494cb 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -853,13 +853,17 @@ void *rhashtable_walk_next(struct rhashtable_iter *iter)
 EXPORT_SYMBOL_GPL(rhashtable_walk_next);
 
 /**
- * rhashtable_walk_peek - Return the next object but don't advance the iterator
+ * rhashtable_walk_peek - Return the previously returned object without advancing the iterator
  * @iter:	Hash table iterator
  *
- * Returns the next object or NULL when the end of the table is reached.
+ * Returns the last object returned, or NULL if no object has yet been returned.
+ * If the previously returned object has since been removed, then some other arbitrary
+ * object may be returned, or possibly NULL will be returned.  In that case, the
+ * iterator might be advanced.
  *
  * Returns -EAGAIN if resize event occurred.  Note that the iterator
- * will rewind back to the beginning and you may continue to use it.
+ * will rewind back to the beginning and rhashtable_walk_next() should be
+ * used to get the next object.
  */
 void *rhashtable_walk_peek(struct rhashtable_iter *iter)
 {
@@ -880,7 +884,12 @@ void *rhashtable_walk_peek(struct rhashtable_iter *iter)
 		 * the table hasn't changed.
 		 */
 		iter->skip--;
-	}
+	} else
+		/* ->skip is only zero after rhashtable_walk_start()
+		 * or when the iterator is reset.  In this case there
+		 * is no previous object to return.
+		 */
+		return NULL;
 
 	return __rhashtable_walk_find_next(iter);
 }


* [PATCH 2/6] rhashtable: remove outdated comments about grow_decision etc
  2018-03-26 23:33 [PATCH 0/6] rhashtable: assorted fixes and enhancements NeilBrown
                   ` (4 preceding siblings ...)
  2018-03-26 23:33 ` [PATCH 4/6] rhashtable: allow a walk of the hash table without missing objects NeilBrown
@ 2018-03-26 23:33 ` NeilBrown
  5 siblings, 0 replies; 33+ messages in thread
From: NeilBrown @ 2018-03-26 23:33 UTC (permalink / raw)
  To: Thomas Graf, Herbert Xu; +Cc: netdev, linux-kernel

grow_decision and shrink_decision no longer exist, so remove
the remaining references to them.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 include/linux/rhashtable.h |   33 ++++++++++++++-------------------
 1 file changed, 14 insertions(+), 19 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index c9df2527e0cd..3bd19d29f46b 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -834,9 +834,8 @@ static inline void *__rhashtable_insert_fast(
  *
  * It is safe to call this function from atomic context.
  *
- * Will trigger an automatic deferred table resizing if the size grows
- * beyond the watermark indicated by grow_decision() which can be passed
- * to rhashtable_init().
+ * Will trigger an automatic deferred table resizing if residency in the
+ * table grows beyond 70%.
  */
 static inline int rhashtable_insert_fast(
 	struct rhashtable *ht, struct rhash_head *obj,
@@ -864,9 +863,8 @@ static inline int rhashtable_insert_fast(
  *
  * It is safe to call this function from atomic context.
  *
- * Will trigger an automatic deferred table resizing if the size grows
- * beyond the watermark indicated by grow_decision() which can be passed
- * to rhashtable_init().
+ * Will trigger an automatic deferred table resizing if residency in the
+ * table grows beyond 70%.
  */
 static inline int rhltable_insert_key(
 	struct rhltable *hlt, const void *key, struct rhlist_head *list,
@@ -888,9 +886,8 @@ static inline int rhltable_insert_key(
  *
  * It is safe to call this function from atomic context.
  *
- * Will trigger an automatic deferred table resizing if the size grows
- * beyond the watermark indicated by grow_decision() which can be passed
- * to rhashtable_init().
+ * Will trigger an automatic deferred table resizing if residency in the
+ * table grows beyond 70%.
  */
 static inline int rhltable_insert(
 	struct rhltable *hlt, struct rhlist_head *list,
@@ -920,9 +917,8 @@ static inline int rhltable_insert(
  *
  * It is safe to call this function from atomic context.
  *
- * Will trigger an automatic deferred table resizing if the size grows
- * beyond the watermark indicated by grow_decision() which can be passed
- * to rhashtable_init().
+ * Will trigger an automatic deferred table resizing if residency in the
+ * table grows beyond 70%.
  */
 static inline int rhashtable_lookup_insert_fast(
 	struct rhashtable *ht, struct rhash_head *obj,
@@ -979,9 +975,8 @@ static inline void *rhashtable_lookup_get_insert_fast(
  *
  * Lookups may occur in parallel with hashtable mutations and resizing.
  *
- * Will trigger an automatic deferred table resizing if the size grows
- * beyond the watermark indicated by grow_decision() which can be passed
- * to rhashtable_init().
+ * Will trigger an automatic deferred table resizing if residency in the
+ * table grows beyond 70%.
  *
  * Returns zero on success.
  */
@@ -1132,8 +1127,8 @@ static inline int __rhashtable_remove_fast(
  * walk the bucket chain upon removal. The removal operation is thus
  * considerable slow if the hash table is not correctly sized.
  *
- * Will automatically shrink the table via rhashtable_expand() if the
- * shrink_decision function specified at rhashtable_init() returns true.
+ * Will automatically shrink the table if permitted when residency drops
+ * below 30%.
  *
  * Returns zero on success, -ENOENT if the entry could not be found.
  */
@@ -1154,8 +1149,8 @@ static inline int rhashtable_remove_fast(
  * walk the bucket chain upon removal. The removal operation is thus
  * considerable slow if the hash table is not correctly sized.
  *
- * Will automatically shrink the table via rhashtable_expand() if the
- * shrink_decision function specified at rhashtable_init() returns true.
+ * Will automatically shrink the table if permitted when residency drops
+ * below 30%.
  *
  * Returns zero on success, -ENOENT if the entry could not be found.
  */


* [PATCH 3/6] rhashtable: reset iter when rhashtable_walk_start sees new table
  2018-03-26 23:33 [PATCH 0/6] rhashtable: assorted fixes and enhancements NeilBrown
  2018-03-26 23:33 ` [PATCH 1/6] rhashtable: improve documentation for rhashtable_walk_peek() NeilBrown
@ 2018-03-26 23:33 ` NeilBrown
  2018-03-27 15:47   ` Herbert Xu
  2018-03-26 23:33 ` [PATCH 6/6] rhashtable: allow element counting to be disabled NeilBrown
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 33+ messages in thread
From: NeilBrown @ 2018-03-26 23:33 UTC (permalink / raw)
  To: Thomas Graf, Herbert Xu; +Cc: netdev, linux-kernel

The documentation claims that when rhashtable_walk_start_check()
detects a resize event, it will rewind back to the beginning
of the table.  This is not true.  We need to set ->slot and
->skip to be zero for it to be true.
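
A sketch of the caller pattern that depends on this (the surrounding
code and names are assumptions, not part of the patch):

#include <linux/rhashtable.h>

/* Restart a previously stopped walk on 'iter'. */
static void foo_restart_walk(struct rhashtable_iter *iter)
{
	void *obj;

	if (rhashtable_walk_start_check(iter) == -EAGAIN) {
		/* A resize happened while the walk was stopped.  The
		 * documentation promises the walk restarts from the first
		 * bucket; this patch zeroes ->slot and ->skip so that the
		 * promise holds and no entries are silently skipped. */
	}

	while ((obj = rhashtable_walk_next(iter)) != NULL) {
		if (IS_ERR(obj))
			continue;	/* resize seen mid-walk: also rewinds */
		/* process obj */
	}

	rhashtable_walk_stop(iter);
}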

Signed-off-by: NeilBrown <neilb@suse.com>
---
 lib/rhashtable.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 24a57ca494cb..08018198f045 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -733,6 +733,8 @@ int rhashtable_walk_start_check(struct rhashtable_iter *iter)
 
 	if (!iter->walker.tbl && !iter->end_of_table) {
 		iter->walker.tbl = rht_dereference_rcu(ht->tbl, ht);
+		iter->slot = 0;
+		iter->skip = 0;
 		return -EAGAIN;
 	}
 


* [PATCH 4/6] rhashtable: allow a walk of the hash table without missing objects.
  2018-03-26 23:33 [PATCH 0/6] rhashtable: assorted fixes and enhancements NeilBrown
                   ` (3 preceding siblings ...)
  2018-03-26 23:33 ` [PATCH 5/6] rhashtable: support guaranteed successful insertion NeilBrown
@ 2018-03-26 23:33 ` NeilBrown
  2018-03-27 15:49   ` David Miller
  2018-03-27 15:51   ` Herbert Xu
  2018-03-26 23:33 ` [PATCH 2/6] rhashtable: remove outdated comments about grow_decision etc NeilBrown
  5 siblings, 2 replies; 33+ messages in thread
From: NeilBrown @ 2018-03-26 23:33 UTC (permalink / raw)
  To: Thomas Graf, Herbert Xu; +Cc: netdev, linux-kernel

When a walk of the hashtable can be done entirely under RCU,
no objects will be missed - though seeing duplicates is possible.
This is because a cursor is kept in iter->p.
Without the cursor we depend on the ->skip counter.  If an object
before the current location in the hash chain is removed, the ->skip
counter will be too large and we could miss a later object.

In many cases where the walker needs to drop out of RCU protection,
it will take a reference to the object and this can prevent it from
being removed from the hash table.  In those cases, the last-returned
object can still be used as a cursor.  rhashtable cannot detect
these cases itself.

This patch adds a new rhashtable_walk_start_continue() interface which
is passed the last object returned.  This can be used if the caller
knows that the object is still in the hash table.  When it is used,
a walk of the hash table will return every object that was in the
hash table for the duration of the walk, at least once.  This can be
used, for example, to selectively delete objects from the table.
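
An assumed usage sketch of the new interface (the reference-counting
helpers and the blocking work are placeholders, not part of the patch):

#include <linux/rhashtable.h>

struct foo {
	struct rhash_head	rhead;
	/* key, refcount, ... assumed */
};

static void foo_get(struct foo *obj) { /* assumed: pins obj in the table */ }
static void foo_put(struct foo *obj) { /* assumed: drops that pin */ }
static void foo_blocking_work(struct foo *obj) { /* assumed: may sleep */ }

static void foo_walk_and_sleep(struct rhashtable *ht)
{
	struct rhashtable_iter iter;
	struct foo *obj;

	rhashtable_walk_enter(ht, &iter);
	rhashtable_walk_start(&iter);
	while ((obj = rhashtable_walk_next(&iter)) != NULL) {
		if (IS_ERR(obj))
			continue;
		foo_get(obj);
		rhashtable_walk_stop(&iter);	/* drop RCU; sleeping is OK */
		foo_blocking_work(obj);
		/* obj is known to still be in the table, so it can act as
		 * a cursor and no other object will be missed: */
		rhashtable_walk_start_continue(&iter, &obj->rhead);
		foo_put(obj);
	}
	rhashtable_walk_stop(&iter);
	rhashtable_walk_exit(&iter);
}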

Signed-off-by: NeilBrown <neilb@suse.com>
---
 include/linux/rhashtable.h |   28 ++++++++++++++++++++++++++--
 lib/rhashtable.c           |   42 ++++++++++++++++++++++++++----------------
 2 files changed, 52 insertions(+), 18 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 3bd19d29f46b..4ffd96949d4f 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -387,11 +387,35 @@ void *rhashtable_insert_slow(struct rhashtable *ht, const void *key,
 void rhashtable_walk_enter(struct rhashtable *ht,
 			   struct rhashtable_iter *iter);
 void rhashtable_walk_exit(struct rhashtable_iter *iter);
-int rhashtable_walk_start_check(struct rhashtable_iter *iter) __acquires(RCU);
+int rhashtable_walk_start_continue(struct rhashtable_iter *iter,
+				   struct rhash_head *obj) __acquires(RCU);
+
+/**
+ * rhashtable_walk_start_check - Start a hash table walk
+ * @iter:	Hash table iterator
+ *
+ * Start a hash table walk at the current iterator position.  Note that we take
+ * the RCU lock in all cases including when we return an error.  So you must
+ * always call rhashtable_walk_stop to clean up.
+ *
+ * Returns zero if successful.
+ *
+ * Returns -EAGAIN if resize event occurred.  Note that the iterator
+ * will rewind back to the beginning and you may use it immediately
+ * by calling rhashtable_walk_next.
+ *
+ * rhashtable_walk_start is defined as an inline variant that returns
+ * void. This is preferred in cases where the caller would ignore
+ * resize events and always continue.
+ */
+static inline int rhashtable_walk_start_check(struct rhashtable_iter *iter)
+{
+	return rhashtable_walk_start_continue(iter, NULL);
+}
 
 static inline void rhashtable_walk_start(struct rhashtable_iter *iter)
 {
-	(void)rhashtable_walk_start_check(iter);
+	(void)rhashtable_walk_start_continue(iter, NULL);
 }
 
 void *rhashtable_walk_next(struct rhashtable_iter *iter);
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 08018198f045..fd6f320b9704 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -702,30 +702,41 @@ void rhashtable_walk_exit(struct rhashtable_iter *iter)
 EXPORT_SYMBOL_GPL(rhashtable_walk_exit);
 
 /**
- * rhashtable_walk_start_check - Start a hash table walk
- * @iter:	Hash table iterator
+ * rhashtable_walk_start_continue - Restart a hash table walk from last object
+ * @iter:	Hash table iterator
+ * @obj:	pointer to rhash_head in last object returned.
+ *
+ * Restart a hash table walk, ensuring not to miss any objects.  The
+ * previously returned object must still be in the hash table, and must be
+ * provided as an argument.
  *
- * Start a hash table walk at the current iterator position.  Note that we take
- * the RCU lock in all cases including when we return an error.  So you must
- * always call rhashtable_walk_stop to clean up.
+ * When rhashtable_walk_start() or rhashtable_walk_start_check() is used,
+ * a deletion since the previous walk_start can result in objects being missed
+ * as a hash chain might be shorter than expected.  This can be avoided by
+ * using the last returned object as a cursor.
  *
- * Returns zero if successful.
+ * If the @obj passed is NULL, or not the most recently returned object,
+ * rhashtable_walk_start_continue() will act like rhashtable_walk_start_check().
  *
- * Returns -EAGAIN if resize event occured.  Note that the iterator
- * will rewind back to the beginning and you may use it immediately
- * by calling rhashtable_walk_next.
+ * Returns -EAGAIN if a resize event was detected.  The iterator will
+ * rewind back to the beginning and can be used immediately.  Seeing duplicates
+ * is possible but missing objects isn't.
+ * Returns zero if no resize event was detected.  This does not guarantee
+ * that no duplicates will be seen.
  *
- * rhashtable_walk_start is defined as an inline variant that returns
- * void. This is preferred in cases where the caller would ignore
- * resize events and always continue.
+ * Always takes the RCU read lock, so rhashtable_walk_stop() must always be called
+ * to clean up.
  */
-int rhashtable_walk_start_check(struct rhashtable_iter *iter)
+int rhashtable_walk_start_continue(struct rhashtable_iter *iter, struct rhash_head *obj)
 	__acquires(RCU)
 {
 	struct rhashtable *ht = iter->ht;
 
 	rcu_read_lock();
 
+	if (!obj || iter->p != obj)
+		iter->p = NULL;
+
 	spin_lock(&ht->lock);
 	if (iter->walker.tbl)
 		list_del(&iter->walker.list);
@@ -733,6 +744,7 @@ int rhashtable_walk_start_check(struct rhashtable_iter *iter)
 
 	if (!iter->walker.tbl && !iter->end_of_table) {
 		iter->walker.tbl = rht_dereference_rcu(ht->tbl, ht);
+		iter->p = NULL;
 		iter->slot = 0;
 		iter->skip = 0;
 		return -EAGAIN;
@@ -740,7 +752,7 @@ int rhashtable_walk_start_check(struct rhashtable_iter *iter)
 
 	return 0;
 }
-EXPORT_SYMBOL_GPL(rhashtable_walk_start_check);
+EXPORT_SYMBOL_GPL(rhashtable_walk_start_continue);
 
 /**
  * __rhashtable_walk_find_next - Find the next element in a table (or the first
@@ -922,8 +934,6 @@ void rhashtable_walk_stop(struct rhashtable_iter *iter)
 		iter->walker.tbl = NULL;
 	spin_unlock(&ht->lock);
 
-	iter->p = NULL;
-
 out:
 	rcu_read_unlock();
 }


* [PATCH 5/6] rhashtable: support guaranteed successful insertion.
  2018-03-26 23:33 [PATCH 0/6] rhashtable: assorted fixes and enhancements NeilBrown
                   ` (2 preceding siblings ...)
  2018-03-26 23:33 ` [PATCH 6/6] rhashtable: allow element counting to be disabled NeilBrown
@ 2018-03-26 23:33 ` NeilBrown
  2018-03-27 15:56   ` Herbert Xu
  2018-03-26 23:33 ` [PATCH 4/6] rhashtable: allow a walk of the hash table without missing objects NeilBrown
  2018-03-26 23:33 ` [PATCH 2/6] rhashtable: remove outdated comments about grow_decision etc NeilBrown
  5 siblings, 1 reply; 33+ messages in thread
From: NeilBrown @ 2018-03-26 23:33 UTC (permalink / raw)
  To: Thomas Graf, Herbert Xu; +Cc: netdev, linux-kernel

The current rhashtable will fail an insertion if the hashtable
it "too full", one of:
 - table already has 2^31 elements (-E2BIG)
 - a max_size was specified and table already has that
   many elements (rounded up to power of 2) (-E2BIG)
 - a single chain has more than 16 elements (-EBUSY)
 - table has more elements than the current table size,
   and allocating a new table fails (-ENOMEM)
 - a new page needed to be allocated for a nested table,
   and the memory allocation failed (-ENOMEM).

A traditional hash table does not have a concept of "too full", and
insertion only fails if the key already exists.  Many users of hash
tables have separate means of limiting the total number of entries,
and are not susceptible to an attack which could cause unusually large
hash chains.  For those users, the need to check for errors when
inserting objects into an rhashtable is an unnecessary burden and hence
a potential source of bugs (as these failures are likely to be rare).

This patch adds a "never_fail_insert" configuration parameter which
ensures that insertion will only fail if the key already exists.

When this option is in effect:
 - nelems is capped at INT_MAX and will never decrease once it reaches
   that value
 - max_size is largely ignored
 - elements will be added to a table that is nominally "full", though
   a rehash will be scheduled
 - a new table will never be allocated directly by the insert
   function; that is always left for the worker.
   For this to trigger a rehash when long chains are detected (possibly
   still useful) an extra field in the table records if a long chain
   has been seen.  This shares a word with the 'nest' value.  As
   'nest' is never changed once the table is created, updating the
   new ->long_chain without locking cannot cause any corruption.
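
For illustration, a sketch of a caller opting in to the new parameter
('struct foo' and the wrapper are assumptions, not taken from lustre or
from this patch):

#include <linux/rhashtable.h>

struct foo {
	u32			key;
	struct rhash_head	rhead;
};

static const struct rhashtable_params foo_params = {
	.key_offset		= offsetof(struct foo, key),
	.key_len		= sizeof(u32),
	.head_offset		= offsetof(struct foo, rhead),
	.never_fail_insert	= true,		/* new in this patch */
};

static int foo_add(struct rhashtable *ht, struct foo *obj)
{
	/* Per the description above, the only error still expected here
	 * is -EEXIST for a duplicate key; -E2BIG, -EBUSY and -ENOMEM are
	 * no longer returned when never_fail_insert is set. */
	return rhashtable_lookup_insert_fast(ht, &obj->rhead, foo_params);
}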

Signed-off-by: NeilBrown <neilb@suse.com>
---
 include/linux/rhashtable.h |   18 +++++++++++++++---
 lib/rhashtable.c           |   27 +++++++++++++++++++--------
 2 files changed, 34 insertions(+), 11 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 4ffd96949d4f..abdeb1f3f378 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -77,6 +77,7 @@ struct rhlist_head {
  * struct bucket_table - Table of hash buckets
  * @size: Number of hash buckets
  * @nest: Number of bits of first-level nested table.
+ * @long_chain: %true when a chain longer than RHT_ELASTICITY seen.
  * @rehash: Current bucket being rehashed
  * @hash_rnd: Random seed to fold into hash
  * @locks_mask: Mask to apply before accessing locks[]
@@ -89,7 +90,8 @@ struct rhlist_head {
  */
 struct bucket_table {
 	unsigned int		size;
-	unsigned int		nest;
+	unsigned short		nest;
+	bool			long_chain;
 	unsigned int		rehash;
 	u32			hash_rnd;
 	unsigned int		locks_mask;
@@ -129,6 +131,9 @@ struct rhashtable;
  * @min_size: Minimum size while shrinking
  * @locks_mul: Number of bucket locks to allocate per cpu (default: 32)
  * @automatic_shrinking: Enable automatic shrinking of tables
+ * @never_fail_insert: Insert will always succeed, even if table will become
+ *           unbalanced.  Without this, -E2BIG, -EBUSY, and -ENOMEM are possible
+ *           errors from rhashtable_*insert*()
  * @nulls_base: Base value to generate nulls marker
  * @hashfn: Hash function (default: jhash2 if !(key_len % 4), or jhash)
  * @obj_hashfn: Function to hash object
@@ -142,6 +147,7 @@ struct rhashtable_params {
 	unsigned int		max_size;
 	u16			min_size;
 	bool			automatic_shrinking;
+	bool			never_fail_insert;
 	u8			locks_mul;
 	u32			nulls_base;
 	rht_hashfn_t		hashfn;
@@ -832,7 +838,10 @@ static inline void *__rhashtable_insert_fast(
 
 	rcu_assign_pointer(*pprev, obj);
 
-	atomic_inc(&ht->nelems);
+	if (params.never_fail_insert)
+		atomic_add_unless(&ht->nelems, 1, INT_MAX);
+	else
+		atomic_inc(&ht->nelems);
 	if (rht_grow_above_75(ht, tbl))
 		schedule_work(&ht->run_work);
 
@@ -1104,7 +1113,10 @@ static inline int __rhashtable_remove_fast_one(
 	spin_unlock_bh(lock);
 
 	if (err > 0) {
-		atomic_dec(&ht->nelems);
+		if (params.never_fail_insert)
+			atomic_add_unless(&ht->nelems, -1, INT_MAX);
+		else
+			atomic_dec(&ht->nelems);
 		if (unlikely(ht->p.automatic_shrinking &&
 			     rht_shrink_below_30(ht, tbl)))
 			schedule_work(&ht->run_work);
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index fd6f320b9704..427836aace60 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -424,7 +424,7 @@ static void rht_deferred_worker(struct work_struct *work)
 		err = rhashtable_rehash_alloc(ht, tbl, tbl->size * 2);
 	else if (ht->p.automatic_shrinking && rht_shrink_below_30(ht, tbl))
 		err = rhashtable_shrink(ht);
-	else if (tbl->nest)
+	else if (tbl->nest || tbl->long_chain)
 		err = rhashtable_rehash_alloc(ht, tbl, tbl->size);
 
 	if (!err)
@@ -549,14 +549,22 @@ static struct bucket_table *rhashtable_insert_one(struct rhashtable *ht,
 	if (new_tbl)
 		return new_tbl;
 
-	if (PTR_ERR(data) != -ENOENT)
-		return ERR_CAST(data);
+	if (ht->p.never_fail_insert) {
+		if (PTR_ERR(data) == -EAGAIN &&
+		    atomic_read(&ht->nelems) != INT_MAX) {
+			tbl->long_chain = true;
+			schedule_work(&ht->run_work);
+		}
+	} else {
+		if (PTR_ERR(data) != -ENOENT)
+			return ERR_CAST(data);
 
-	if (unlikely(rht_grow_above_max(ht, tbl)))
-		return ERR_PTR(-E2BIG);
+		if (unlikely(rht_grow_above_max(ht, tbl)))
+			return ERR_PTR(-E2BIG);
 
-	if (unlikely(rht_grow_above_100(ht, tbl)))
-		return ERR_PTR(-EAGAIN);
+		if (unlikely(rht_grow_above_100(ht, tbl)))
+			return ERR_PTR(-EAGAIN);
+	}
 
 	pprev = rht_bucket_insert(ht, tbl, hash);
 	if (!pprev)
@@ -574,7 +582,10 @@ static struct bucket_table *rhashtable_insert_one(struct rhashtable *ht,
 
 	rcu_assign_pointer(*pprev, obj);
 
-	atomic_inc(&ht->nelems);
+	if (ht->p.never_fail_insert)
+		atomic_add_unless(&ht->nelems, 1, INT_MAX);
+	else
+		atomic_inc(&ht->nelems);
 	if (rht_grow_above_75(ht, tbl))
 		schedule_work(&ht->run_work);
 


* [PATCH 6/6] rhashtable: allow element counting to be disabled.
  2018-03-26 23:33 [PATCH 0/6] rhashtable: assorted fixes and enhancements NeilBrown
  2018-03-26 23:33 ` [PATCH 1/6] rhashtable: improve documentation for rhashtable_walk_peek() NeilBrown
  2018-03-26 23:33 ` [PATCH 3/6] rhashtable: reset iter when rhashtable_walk_start sees new table NeilBrown
@ 2018-03-26 23:33 ` NeilBrown
  2018-03-26 23:33 ` [PATCH 5/6] rhashtable: support guaranteed successful insertion NeilBrown
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 33+ messages in thread
From: NeilBrown @ 2018-03-26 23:33 UTC (permalink / raw)
  To: Thomas Graf, Herbert Xu; +Cc: netdev, linux-kernel

If multiple CPUs are performing concurrent updates, they can
contend on accessing the element counter even when they
don't often contend on hash chains or spin locks.  This can
hurt performance.

The nelems counter is only used to trigger a resize at the
70% and 30% marks, so it does not need to be precise.

It is easy to calculate an approximate value when the table
is being rehashed, and this happens when a chain is found to
be 16 elements long.  So just moving the counting from
"every update" to "every rehash" removes lots of contention,
but has the down-side that it allows the max bucket size
to grow to 16 (so average is probably under 8).  The normal
average is close to 1.

As a rehash can sometimes not see all (or any) elements, such as when
multiple tables are in the table chain, it is only safe to increase
nelems to match the number rehashed, never to decrease it.

If a client wants minimal contention while still maintaining
a shorter chain length, it can run a periodic task which
counts the number of elements and updates ->nelems directly.
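
As an illustration of that last point, a sketch of such a periodic
recount (caller-side code with assumed names; not part of the patch):

#include <linux/atomic.h>
#include <linux/rhashtable.h>

/* Walk the table occasionally and write the count back, so the table
 * still grows even though per-update counting is disabled.  The count
 * is approximate (a concurrent resize may cause duplicates), which is
 * all the 70%/30% watermarks need. */
static void foo_recount(struct rhashtable *ht)
{
	struct rhashtable_iter iter;
	void *obj;
	unsigned int n = 0;

	rhashtable_walk_enter(ht, &iter);
	rhashtable_walk_start(&iter);
	while ((obj = rhashtable_walk_next(&iter)) != NULL)
		if (!IS_ERR(obj))
			n++;
	rhashtable_walk_stop(&iter);
	rhashtable_walk_exit(&iter);

	atomic_set(&ht->nelems, n);	/* "updates ->nelems directly" */
}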

Signed-off-by: NeilBrown <neilb@suse.com>
---
 include/linux/rhashtable.h |   26 ++++++++++++++++++--------
 lib/rhashtable.c           |   22 +++++++++++++++-------
 2 files changed, 33 insertions(+), 15 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index abdeb1f3f378..d0ce5635540f 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -134,6 +134,11 @@ struct rhashtable;
  * @never_fail_insert: Insert will always succeed, even if table will become
  *           unbalanced.  Without this, -E2BIG, -EBUSY, and -ENOMEM are possible
  *           errors from rhashtable_*insert*()
+ * @disable_count: Disable precise counting of number of entries.  It is only
+ *           updated approximately when the hash table is resized.
+ *           This reduces contention in parallel updates, but means we only
+ *           grow the table when a hash chain length reaches 16 or when owner
+ *           directly updates ->nelems.
  * @nulls_base: Base value to generate nulls marker
  * @hashfn: Hash function (default: jhash2 if !(key_len % 4), or jhash)
  * @obj_hashfn: Function to hash object
@@ -148,6 +153,7 @@ struct rhashtable_params {
 	u16			min_size;
 	bool			automatic_shrinking;
 	bool			never_fail_insert;
+	bool			disable_count;
 	u8			locks_mul;
 	u32			nulls_base;
 	rht_hashfn_t		hashfn;
@@ -838,10 +844,12 @@ static inline void *__rhashtable_insert_fast(
 
 	rcu_assign_pointer(*pprev, obj);
 
-	if (params.never_fail_insert)
-		atomic_add_unless(&ht->nelems, 1, INT_MAX);
-	else
-		atomic_inc(&ht->nelems);
+	if (!params.disable_count) {
+		if (params.never_fail_insert)
+			atomic_add_unless(&ht->nelems, 1, INT_MAX);
+		else
+			atomic_inc(&ht->nelems);
+	}
 	if (rht_grow_above_75(ht, tbl))
 		schedule_work(&ht->run_work);
 
@@ -1113,10 +1121,12 @@ static inline int __rhashtable_remove_fast_one(
 	spin_unlock_bh(lock);
 
 	if (err > 0) {
-		if (params.never_fail_insert)
-			atomic_add_unless(&ht->nelems, -1, INT_MAX);
-		else
-			atomic_dec(&ht->nelems);
+		if (!params.disable_count) {
+			if (params.never_fail_insert)
+				atomic_add_unless(&ht->nelems, -1, INT_MAX);
+			else
+				atomic_dec(&ht->nelems);
+		}
 		if (unlikely(ht->p.automatic_shrinking &&
 			     rht_shrink_below_30(ht, tbl)))
 			schedule_work(&ht->run_work);
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 427836aace60..686193faf271 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -278,12 +278,13 @@ static int rhashtable_rehash_chain(struct rhashtable *ht,
 	struct bucket_table *old_tbl = rht_dereference(ht->tbl, ht);
 	spinlock_t *old_bucket_lock;
 	int err;
+	int cnt = 0;
 
 	old_bucket_lock = rht_bucket_lock(old_tbl, old_hash);
 
 	spin_lock_bh(old_bucket_lock);
 	while (!(err = rhashtable_rehash_one(ht, old_hash)))
-		;
+		cnt++;
 
 	if (err == -ENOENT) {
 		old_tbl->rehash++;
@@ -291,7 +292,7 @@ static int rhashtable_rehash_chain(struct rhashtable *ht,
 	}
 	spin_unlock_bh(old_bucket_lock);
 
-	return err;
+	return err ?: cnt;
 }
 
 static int rhashtable_rehash_attach(struct rhashtable *ht,
@@ -324,6 +325,7 @@ static int rhashtable_rehash_table(struct rhashtable *ht)
 	struct rhashtable_walker *walker;
 	unsigned int old_hash;
 	int err;
+	unsigned int cnt = 0;
 
 	new_tbl = rht_dereference(old_tbl->future_tbl, ht);
 	if (!new_tbl)
@@ -331,12 +333,16 @@ static int rhashtable_rehash_table(struct rhashtable *ht)
 
 	for (old_hash = 0; old_hash < old_tbl->size; old_hash++) {
 		err = rhashtable_rehash_chain(ht, old_hash);
-		if (err)
+		if (err < 0)
 			return err;
+		if (INT_MAX - cnt > err)
+			cnt += err;
 	}
 
 	/* Publish the new table pointer. */
 	rcu_assign_pointer(ht->tbl, new_tbl);
+	if (ht->p.disable_count && cnt > atomic_read(&ht->nelems))
+		atomic_set(&ht->nelems, cnt);
 
 	spin_lock(&ht->lock);
 	list_for_each_entry(walker, &old_tbl->walkers, list)
@@ -582,10 +588,12 @@ static struct bucket_table *rhashtable_insert_one(struct rhashtable *ht,
 
 	rcu_assign_pointer(*pprev, obj);
 
-	if (ht->p.never_fail_insert)
-		atomic_add_unless(&ht->nelems, 1, INT_MAX);
-	else
-		atomic_inc(&ht->nelems);
+	if (!ht->p.disable_count) {
+		if (ht->p.never_fail_insert)
+			atomic_add_unless(&ht->nelems, 1, INT_MAX);
+		else
+			atomic_inc(&ht->nelems);
+	}
 	if (rht_grow_above_75(ht, tbl))
 		schedule_work(&ht->run_work);
 


* Re: [PATCH 1/6] rhashtable: improve documentation for rhashtable_walk_peek()
  2018-03-26 23:33 ` [PATCH 1/6] rhashtable: improve documentation for rhashtable_walk_peek() NeilBrown
@ 2018-03-27 10:55   ` Sergei Shtylyov
  2018-03-27 15:30   ` Herbert Xu
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 33+ messages in thread
From: Sergei Shtylyov @ 2018-03-27 10:55 UTC (permalink / raw)
  To: NeilBrown, Thomas Graf, Herbert Xu; +Cc: netdev, linux-kernel

Hello!

On 3/27/2018 2:33 AM, NeilBrown wrote:

> The documentation for rhashtable_walk_peek() wrong.  It claims to
> return the *next* entry, whereas it in fact returns the *previous*
> entry.
> However if no entries have yet been returned - or if the iterator
> was reset due to a resize event, then rhashtable_walk_peek()
> *does* return the next entry, but also advances the iterator.
> 
> I suspect that this interface should be discarded and the one user
> should be changed to not require it.  Possibly this patch should be
> seen as a first step in that conversation.
> 
> This patch mostly corrects the documentation, but does make a
> small code change so that the documentation can be correct without
> listing too many special cases.  I don't think the one user will
> be affected by the code change.
> 
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>   lib/rhashtable.c |   17 +++++++++++++----
>   1 file changed, 13 insertions(+), 4 deletions(-)
> 
> diff --git a/lib/rhashtable.c b/lib/rhashtable.c
> index 3825c30aaa36..24a57ca494cb 100644
> --- a/lib/rhashtable.c
> +++ b/lib/rhashtable.c
> @@ -853,13 +853,17 @@ void *rhashtable_walk_next(struct rhashtable_iter *iter)
>   EXPORT_SYMBOL_GPL(rhashtable_walk_next);
>   
>   /**
> - * rhashtable_walk_peek - Return the next object but don't advance the iterator
> + * rhashtable_walk_peek - Return the previously returned object without advancing the iterator
>    * @iter:	Hash table iterator
>    *
> - * Returns the next object or NULL when the end of the table is reached.
> + * Returns the last object returned,

    Sounds somewhat tautological. :-)

> or NULL if no object has yet been returned.
> + * If the previously returned object has since been removed, then some other arbitrary
> + * object maybe returned, or possibly NULL will be returned.  In that case, the
> + * iterator might be advanced.
>    *
>    * Returns -EAGAIN if resize event occurred.  Note that the iterator
> - * will rewind back to the beginning and you may continue to use it.
> + * will rewind back to the beginning and rhashtable_walk_next() should be
> + * used to get the next object.
>    */
>   void *rhashtable_walk_peek(struct rhashtable_iter *iter)
>   {
> @@ -880,7 +884,12 @@ void *rhashtable_walk_peek(struct rhashtable_iter *iter)
>   		 * the table hasn't changed.
>   		 */
>   		iter->skip--;
> -	}
> +	} else
> +		/* ->skip is only zero after rhashtable_walk_start()
> +		 * or when the iterator is reset.  In this case there
> +		 * is no previous object to return.
> +		 */
> +		return NULL;

    CodingStyle: need {} on the *else* branch if the 1st branch has them.

>   
>   	return __rhashtable_walk_find_next(iter);
>   }

MBR, Sergei


* Re: [PATCH 1/6] rhashtable: improve documentation for rhashtable_walk_peek()
  2018-03-26 23:33 ` [PATCH 1/6] rhashtable: improve documentation for rhashtable_walk_peek() NeilBrown
  2018-03-27 10:55   ` Sergei Shtylyov
@ 2018-03-27 15:30   ` Herbert Xu
  2018-03-27 15:47   ` David Miller
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 33+ messages in thread
From: Herbert Xu @ 2018-03-27 15:30 UTC (permalink / raw)
  To: NeilBrown; +Cc: Thomas Graf, netdev, linux-kernel, Tom Herbert

On Tue, Mar 27, 2018 at 10:33:04AM +1100, NeilBrown wrote:
> The documentation for rhashtable_walk_peek() wrong.  It claims to
> return the *next* entry, whereas it in fact returns the *previous*
> entry.
> However if no entries have yet been returned - or if the iterator
> was reset due to a resize event, then rhashtable_walk_peek()
> *does* return the next entry, but also advances the iterator.
> 
> I suspect that this interface should be discarded and the one user
> should be changed to not require it.  Possibly this patch should be
> seen as a first step in that conversation.
> 
> This patch mostly corrects the documentation, but does make a
> small code change so that the documentation can be correct without
> listing too many special cases.  I don't think the one user will
> be affected by the code change.
> 
> Signed-off-by: NeilBrown <neilb@suse.com>

We should cc Tom Herbert too since he wrote this code.

> ---
>  lib/rhashtable.c |   17 +++++++++++++----
>  1 file changed, 13 insertions(+), 4 deletions(-)
> 
> diff --git a/lib/rhashtable.c b/lib/rhashtable.c
> index 3825c30aaa36..24a57ca494cb 100644
> --- a/lib/rhashtable.c
> +++ b/lib/rhashtable.c
> @@ -853,13 +853,17 @@ void *rhashtable_walk_next(struct rhashtable_iter *iter)
>  EXPORT_SYMBOL_GPL(rhashtable_walk_next);
>  
>  /**
> - * rhashtable_walk_peek - Return the next object but don't advance the iterator
> + * rhashtable_walk_peek - Return the previously returned object without advancing the iterator
>   * @iter:	Hash table iterator
>   *
> - * Returns the next object or NULL when the end of the table is reached.
> + * Returns the last object returned, or NULL if no object has yet been returned.
> + * If the previously returned object has since been removed, then some other arbitrary
> + * object maybe returned, or possibly NULL will be returned.  In that case, the
> + * iterator might be advanced.
>   *
>   * Returns -EAGAIN if resize event occurred.  Note that the iterator
> - * will rewind back to the beginning and you may continue to use it.
> + * will rewind back to the beginning and rhashtable_walk_next() should be
> + * used to get the next object.
>   */
>  void *rhashtable_walk_peek(struct rhashtable_iter *iter)
>  {
> @@ -880,7 +884,12 @@ void *rhashtable_walk_peek(struct rhashtable_iter *iter)
>  		 * the table hasn't changed.
>  		 */
>  		iter->skip--;
> -	}
> +	} else
> +		/* ->skip is only zero after rhashtable_walk_start()
> +		 * or when the iterator is reset.  In this case there
> +		 * is no previous object to return.
> +		 */
> +		return NULL;
>  
>  	return __rhashtable_walk_find_next(iter);
>  }
> 
> 

-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: [PATCH 1/6] rhashtable: improve documentation for rhashtable_walk_peek()
  2018-03-26 23:33 ` [PATCH 1/6] rhashtable: improve documentation for rhashtable_walk_peek() NeilBrown
  2018-03-27 10:55   ` Sergei Shtylyov
  2018-03-27 15:30   ` Herbert Xu
@ 2018-03-27 15:47   ` David Miller
  2018-03-27 21:45   ` [PATCH 1/6 v2] " NeilBrown
  2018-03-27 22:46   ` [PATCH 1/6] " Andreas Grünbacher
  4 siblings, 0 replies; 33+ messages in thread
From: David Miller @ 2018-03-27 15:47 UTC (permalink / raw)
  To: neilb; +Cc: tgraf, herbert, netdev, linux-kernel

From: NeilBrown <neilb@suse.com>
Date: Tue, 27 Mar 2018 10:33:04 +1100

> The documentation for rhashtable_walk_peek() wrong.  It claims to
> return the *next* entry, whereas it in fact returns the *previous*
> entry.
> However if no entries have yet been returned - or if the iterator
> was reset due to a resize event, then rhashtable_walk_peek()
> *does* return the next entry, but also advances the iterator.
> 
> I suspect that this interface should be discarded and the one user
> should be changed to not require it.  Possibly this patch should be
> seen as a first step in that conversation.
> 
> This patch mostly corrects the documentation, but does make a
> small code change so that the documentation can be correct without
> listing too many special cases.  I don't think the one user will
> be affected by the code change.
> 
> Signed-off-by: NeilBrown <neilb@suse.com>

Please mention the "one user" explicitly in both locations where
you refer to it in this commit message.

Thank you.


* Re: [PATCH 3/6] rhashtable: reset iter when rhashtable_walk_start sees new table
  2018-03-26 23:33 ` [PATCH 3/6] rhashtable: reset iter when rhashtable_walk_start sees new table NeilBrown
@ 2018-03-27 15:47   ` Herbert Xu
  0 siblings, 0 replies; 33+ messages in thread
From: Herbert Xu @ 2018-03-27 15:47 UTC (permalink / raw)
  To: NeilBrown; +Cc: Thomas Graf, netdev, linux-kernel

On Tue, Mar 27, 2018 at 10:33:04AM +1100, NeilBrown wrote:
> The documentation claims that when rhashtable_walk_start_check()
> detects a resize event, it will rewind back to the beginning
> of the table.  This is not true.  We need to set ->slot and
> ->skip to be zero for it to be true.
> 
> Signed-off-by: NeilBrown <neilb@suse.com>

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: [PATCH 4/6] rhashtable: allow a walk of the hash table without missing objects.
  2018-03-26 23:33 ` [PATCH 4/6] rhashtable: allow a walk of the hash table without missing objects NeilBrown
@ 2018-03-27 15:49   ` David Miller
  2018-03-27 15:54     ` Herbert Xu
  2018-03-27 21:50     ` NeilBrown
  2018-03-27 15:51   ` Herbert Xu
  1 sibling, 2 replies; 33+ messages in thread
From: David Miller @ 2018-03-27 15:49 UTC (permalink / raw)
  To: neilb; +Cc: tgraf, herbert, netdev, linux-kernel

From: NeilBrown <neilb@suse.com>
Date: Tue, 27 Mar 2018 10:33:04 +1100

> In many cases where the walker needs to drop out of RCU protection,
> it will take a reference to the object and this can prevent it from
> being removed from the hash table.  In those cases, the last-returned
> object can still be used as a cursor.  rhashtable cannot detect
> these cases itself.

Merely having an elevated reference count does not explicitly prevent
the object from being removed from the hash table.

This invariant might hold for the particular user of the rhashtable
instance, but it is not always the case.


* Re: [PATCH 4/6] rhashtable: allow a walk of the hash table without missing objects.
  2018-03-26 23:33 ` [PATCH 4/6] rhashtable: allow a walk of the hash table without missing objects NeilBrown
  2018-03-27 15:49   ` David Miller
@ 2018-03-27 15:51   ` Herbert Xu
  2018-03-27 21:54     ` NeilBrown
  1 sibling, 1 reply; 33+ messages in thread
From: Herbert Xu @ 2018-03-27 15:51 UTC (permalink / raw)
  To: NeilBrown; +Cc: Thomas Graf, netdev, linux-kernel

On Tue, Mar 27, 2018 at 10:33:04AM +1100, NeilBrown wrote:
>
> -int rhashtable_walk_start_check(struct rhashtable_iter *iter)
> +int rhashtable_walk_start_continue(struct rhashtable_iter *iter, struct rhash_head *obj)
>  	__acquires(RCU)
>  {
>  	struct rhashtable *ht = iter->ht;
>  
>  	rcu_read_lock();
>  
> +	if (!obj || iter->p != obj)
> +		iter->p = NULL;

Why bother with this check at all? Couldn't we make it so that
if you call continue then you continue with the cursor otherwise
you set it to NULL as we currently do.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: [PATCH 4/6] rhashtable: allow a walk of the hash table without missing objects.
  2018-03-27 15:49   ` David Miller
@ 2018-03-27 15:54     ` Herbert Xu
  2018-03-27 21:50     ` NeilBrown
  1 sibling, 0 replies; 33+ messages in thread
From: Herbert Xu @ 2018-03-27 15:54 UTC (permalink / raw)
  To: David Miller; +Cc: neilb, tgraf, netdev, linux-kernel

On Tue, Mar 27, 2018 at 11:49:41AM -0400, David Miller wrote:
>
> Merely having an elevated reference count does not explicitly prevent
> the object from being removed from the hash table.
> 
> This invariant might hold for the particular user of the rhashtable
> instance, but it is not always the case.

I think having a new interface like this should work: if the
user knows that the element has not been removed from the hash
table, then it should be safe to continue from it.

Of course people may misuse it but even using the current interface
correctly may result in lost objects due to concurrent removals.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: [PATCH 5/6] rhashtable: support guaranteed successful insertion.
  2018-03-26 23:33 ` [PATCH 5/6] rhashtable: support guaranteed successful insertion NeilBrown
@ 2018-03-27 15:56   ` Herbert Xu
  2018-03-27 21:34     ` NeilBrown
  0 siblings, 1 reply; 33+ messages in thread
From: Herbert Xu @ 2018-03-27 15:56 UTC (permalink / raw)
  To: NeilBrown; +Cc: Thomas Graf, netdev, linux-kernel

On Tue, Mar 27, 2018 at 10:33:04AM +1100, NeilBrown wrote:
> The current rhashtable will fail an insertion if the hashtable
> it "too full", one of:
>  - table already has 2^31 elements (-E2BIG)
>  - a max_size was specified and table already has that
>    many elements (rounded up to power of 2) (-E2BIG)
>  - a single chain has more than 16 elements (-EBUSY)
>  - table has more elements than the current table size,
>    and allocating a new table fails (-ENOMEM)
>  - a new page needed to be allocated for a nested table,
>    and the memory allocation failed (-ENOMEM).
> 
> A traditional hash table does not have a concept of "too full", and
> insertion only fails if the key already exists.  Many users of hash
> tables have separate means of limiting the total number of entries,
> and are not susceptible to an attack which could cause unusually large
> hash chains.  For those users, the need to check for errors when
> inserting objects to an rhashtable is an unnecessary burden and hence
> a potential source of bugs (as these failures are likely to be rare).

Did you actually encounter an insertion failure? The current code
should never fail an insertion until you actually run out of memory.
That is unless you're using rhashtable when you should be using
rhlist instead.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: [PATCH 5/6] rhashtable: support guaranteed successful insertion.
  2018-03-27 15:56   ` Herbert Xu
@ 2018-03-27 21:34     ` NeilBrown
  2018-03-28  6:04       ` Herbert Xu
  0 siblings, 1 reply; 33+ messages in thread
From: NeilBrown @ 2018-03-27 21:34 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Thomas Graf, netdev, linux-kernel


On Tue, Mar 27 2018, Herbert Xu wrote:

> On Tue, Mar 27, 2018 at 10:33:04AM +1100, NeilBrown wrote:
>> The current rhashtable will fail an insertion if the hashtable
>> it "too full", one of:
>>  - table already has 2^31 elements (-E2BIG)
>>  - a max_size was specified and table already has that
>>    many elements (rounded up to power of 2) (-E2BIG)
>>  - a single chain has more than 16 elements (-EBUSY)
>>  - table has more elements than the current table size,
>>    and allocating a new table fails (-ENOMEM)
>>  - a new page needed to be allocated for a nested table,
>>    and the memory allocation failed (-ENOMEM).
>> 
>> A traditional hash table does not have a concept of "too full", and
>> insertion only fails if the key already exists.  Many users of hash
>> tables have separate means of limiting the total number of entries,
>> and are not susceptible to an attack which could cause unusually large
>> hash chains.  For those users, the need to check for errors when
>> inserting objects to an rhashtable is an unnecessary burden and hence
>> a potential source of bugs (as these failures are likely to be rare).
>
> Did you actually encounter an insertion failure? The current code
> should never fail an insertion until you actually run ouf memory.
> That is unless you're using rhashtable when you should be using
> rhlist instead.

It is easy to get an -EBUSY insertion failure when .disable_count is
enabled, and I did get that.  Blindly propagating that up caused lustre
to get terribly confused - not too surprising really.

Even if I didn't see errors in practice, if the interface can return an
error, then I need to check for the error and really should test that
handling each error works correctly.  It is much easier to write
reliable code when errors cannot happen, so I'd rather have that option.

Thanks,
NeilBrown



* [PATCH 1/6 v2] rhashtable: improve documentation for rhashtable_walk_peek()
  2018-03-26 23:33 ` [PATCH 1/6] rhashtable: improve documentation for rhashtable_walk_peek() NeilBrown
                     ` (2 preceding siblings ...)
  2018-03-27 15:47   ` David Miller
@ 2018-03-27 21:45   ` NeilBrown
  2018-03-27 22:46   ` [PATCH 1/6] " Andreas Grünbacher
  4 siblings, 0 replies; 33+ messages in thread
From: NeilBrown @ 2018-03-27 21:45 UTC (permalink / raw)
  To: Thomas Graf, Herbert Xu
  Cc: netdev, linux-kernel, Tom Herbert, Andreas Gruenbacher



In this version I:
 - fixed brace-after-else coding style, thanks Sergei Shtylyov,
 - explained where the one user is, thanks David Miller
 - added CC for author of rhashtable_walk_peek (Tom Herbert) and
   of the one usage (Andreas Gruenbacher) - thanks Herbert Xu

NeilBrown

-----------------8<-------------------------
Subject: [PATCH] rhashtable: improve documentation for rhashtable_walk_peek()

The documentation for rhashtable_walk_peek() is wrong.  It claims to
return the *next* entry, whereas it in fact returns the *previous*
entry.
However, if no entries have yet been returned - or if the iterator
was reset due to a resize event - then rhashtable_walk_peek()
*does* return the next entry, but it also advances the iterator.

I suspect that this interface should be discarded and the one user
should be changed to not require it.  The only current user is
gfs2_glock_iter_next in fs/gfs2/glock.c, which is part of the
implementation of a seq_file which reports all the content of the
hash table.
Possibly this patch should be seen as a first step in that
conversation.

This patch mostly corrects the documentation, but does make a
small code change so that the documentation can be correct without
listing too many special cases.  I don't think gfs2_glock_iter_next
will be affected by the code change.

Cc: Tom Herbert <tom@quantonium.net>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 lib/rhashtable.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 3825c30aaa36..9367816820ea 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -853,13 +853,17 @@ void *rhashtable_walk_next(struct rhashtable_iter *iter)
 EXPORT_SYMBOL_GPL(rhashtable_walk_next);
 
 /**
- * rhashtable_walk_peek - Return the next object but don't advance the iterator
+ * rhashtable_walk_peek - Return the previously returned object without advancing the iterator
  * @iter:	Hash table iterator
  *
- * Returns the next object or NULL when the end of the table is reached.
+ * Returns the last object returned, or NULL if no object has yet been returned.
+ * If the previously returned object has since been removed, then some other arbitrary
+ * object may be returned, or possibly NULL will be returned.  In that case, the
+ * iterator might be advanced.
  *
  * Returns -EAGAIN if resize event occurred.  Note that the iterator
- * will rewind back to the beginning and you may continue to use it.
+ * will rewind back to the beginning and rhashtable_walk_next() should be
+ * used to get the next object.
  */
 void *rhashtable_walk_peek(struct rhashtable_iter *iter)
 {
@@ -880,6 +884,12 @@ void *rhashtable_walk_peek(struct rhashtable_iter *iter)
 		 * the table hasn't changed.
 		 */
 		iter->skip--;
+	} else {
+		/* ->skip is only zero after rhashtable_walk_start()
+		 * or when the iterator is reset.  In this case there
+		 * is no previous object to return.
+		 */
+		return NULL;
 	}
 
 	return __rhashtable_walk_find_next(iter);
-- 
2.14.0.rc0.dirty




* Re: [PATCH 4/6] rhashtable: allow a walk of the hash table without missing objects.
  2018-03-27 15:49   ` David Miller
  2018-03-27 15:54     ` Herbert Xu
@ 2018-03-27 21:50     ` NeilBrown
  1 sibling, 0 replies; 33+ messages in thread
From: NeilBrown @ 2018-03-27 21:50 UTC (permalink / raw)
  To: David Miller; +Cc: tgraf, herbert, netdev, linux-kernel


On Tue, Mar 27 2018, David Miller wrote:

> From: NeilBrown <neilb@suse.com>
> Date: Tue, 27 Mar 2018 10:33:04 +1100
>
>> In many cases where the walker needs to drop out of RCU protection,
>> it will take a reference to the object and this can prevent it from
>> being removed from the hash table.  In those cases, the last-returned
>> object can still be used as a cursor.  rhashtable cannot detect
>> these cases itself.
>
> Merely having an elevated reference count does not explicitly prevent
> the object from being removed from the hash table.
>
> This invariant might hold for the particular user of the rhashtable
> instance, but it is not always the case.

Agreed.  Hence "In many cases ... this *can* be prevented" and "In those
cases".

The doc comment for rhashtable_walk_start_continue() makes it clear
that:

 *   The
 * previously returned object must still be in the hash table, and must be
 * provided as an argument.

Thanks,
NeilBrown



* Re: [PATCH 4/6] rhashtable: allow a walk of the hash table without missing objects.
  2018-03-27 15:51   ` Herbert Xu
@ 2018-03-27 21:54     ` NeilBrown
  2018-03-28  6:07       ` Herbert Xu
  0 siblings, 1 reply; 33+ messages in thread
From: NeilBrown @ 2018-03-27 21:54 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Thomas Graf, netdev, linux-kernel


On Tue, Mar 27 2018, Herbert Xu wrote:

> On Tue, Mar 27, 2018 at 10:33:04AM +1100, NeilBrown wrote:
>>
>> -int rhashtable_walk_start_check(struct rhashtable_iter *iter)
>> +int rhashtable_walk_start_continue(struct rhashtable_iter *iter, struct rhash_head *obj)
>>  	__acquires(RCU)
>>  {
>>  	struct rhashtable *ht = iter->ht;
>>  
>>  	rcu_read_lock();
>>  
>> +	if (!obj || iter->p != obj)
>> +		iter->p = NULL;
>
> Why bother with this check at all? Couldn't we make it so that
> if you call continue then you continue with the cursor otherwise
> you set it to NULL as we currently do.

Possibly.
I particularly want the interface to require that you pass the
previously returned object to _continue. That makes it easy to see that
the object is still being used.  If someone changes the code to delete
the object before the _continue, there should be a strong hint that it
won't work.

Maybe it would be better to make it a WARN_ON()

  if (!obj || WARN_ON(iter->p != obj))
         iter->p = NULL;

??

Thanks,
NeilBrown



* Re: [PATCH 1/6] rhashtable: improve documentation for rhashtable_walk_peek()
  2018-03-26 23:33 ` [PATCH 1/6] rhashtable: improve documentation for rhashtable_walk_peek() NeilBrown
                     ` (3 preceding siblings ...)
  2018-03-27 21:45   ` [PATCH 1/6 v2] " NeilBrown
@ 2018-03-27 22:46   ` Andreas Grünbacher
  2018-03-28  0:49     ` NeilBrown
  4 siblings, 1 reply; 33+ messages in thread
From: Andreas Grünbacher @ 2018-03-27 22:46 UTC (permalink / raw)
  To: NeilBrown
  Cc: Thomas Graf, Herbert Xu, netdev, Linux Kernel Mailing List, Bob Peterson

Neil,

2018-03-27 1:33 GMT+02:00 NeilBrown <neilb@suse.com>:
> The documentation for rhashtable_walk_peek() wrong.  It claims to
> return the *next* entry, whereas it in fact returns the *previous*
> entry.
> However if no entries have yet been returned - or if the iterator
> was reset due to a resize event, then rhashtable_walk_peek()
> *does* return the next entry, but also advances the iterator.
>
> I suspect that this interface should be discarded and the one user
> should be changed to not require it.  Possibly this patch should be
> seen as a first step in that conversation.
>
> This patch mostly corrects the documentation, but does make a
> small code change so that the documentation can be correct without
> listing too many special cases.  I don't think the one user will
> be affected by the code change.

how about I come up with a replacement so that we can remove
rhashtable_walk_peek straight away without making it differently
broken in the meantime?

Thanks,
Andreas


* Re: [PATCH 1/6] rhashtable: improve documentation for rhashtable_walk_peek()
  2018-03-27 22:46   ` [PATCH 1/6] " Andreas Grünbacher
@ 2018-03-28  0:49     ` NeilBrown
  0 siblings, 0 replies; 33+ messages in thread
From: NeilBrown @ 2018-03-28  0:49 UTC (permalink / raw)
  To: Andreas Grünbacher
  Cc: Thomas Graf, Herbert Xu, netdev, Linux Kernel Mailing List, Bob Peterson

[-- Attachment #1: Type: text/plain, Size: 1168 bytes --]

On Wed, Mar 28 2018, Andreas Grünbacher wrote:

> Neil,
>
> 2018-03-27 1:33 GMT+02:00 NeilBrown <neilb@suse.com>:
>> The documentation for rhashtable_walk_peek() wrong.  It claims to
>> return the *next* entry, whereas it in fact returns the *previous*
>> entry.
>> However if no entries have yet been returned - or if the iterator
>> was reset due to a resize event, then rhashtable_walk_peek()
>> *does* return the next entry, but also advances the iterator.
>>
>> I suspect that this interface should be discarded and the one user
>> should be changed to not require it.  Possibly this patch should be
>> seen as a first step in that conversation.
>>
>> This patch mostly corrects the documentation, but does make a
>> small code change so that the documentation can be correct without
>> listing too many special cases.  I don't think the one user will
>> be affected by the code change.
>
> how about I come up with a replacement so that we can remove
> rhashtable_walk_peek straight away without making it differently
> broken in the meantime?
>

Hi Andreas,
 I'd be very happy with that outcome - thanks for the offer!

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 5/6] rhashtable: support guaranteed successful insertion.
  2018-03-27 21:34     ` NeilBrown
@ 2018-03-28  6:04       ` Herbert Xu
  2018-03-28  7:04         ` NeilBrown
  0 siblings, 1 reply; 33+ messages in thread
From: Herbert Xu @ 2018-03-28  6:04 UTC (permalink / raw)
  To: NeilBrown; +Cc: Thomas Graf, netdev, linux-kernel

On Wed, Mar 28, 2018 at 08:34:19AM +1100, NeilBrown wrote:
>
> It is easy to get an -EBUSY insertion failure when .disable_count is
> enabled, and I did get that.  Blindly propagating that up caused lustre
> to get terribly confused - not too surprising really.

Right, so this failure mode is specific to your patch 6.

I think I see the problem.  As it currently stands, we never
need growing when we hit the bucket length limit.  We only do
rehashes.

With your patch, you must change it so that we actually try to
grow the table if necessary when we hit the bucket length limit.

Otherwise it will result in the EBUSY that you're seeing.

I also think that we shouldn't make this a toggle.  If we're going
to do disable_count then it should be unconditionally done for
everyone.

However, I personally would prefer a percpu elem count instead of
disabling it completely.  Would that be acceptable to your use-case?

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 4/6] rhashtable: allow a walk of the hash table without missing objects.
  2018-03-27 21:54     ` NeilBrown
@ 2018-03-28  6:07       ` Herbert Xu
  2018-03-28  7:17         ` NeilBrown
  0 siblings, 1 reply; 33+ messages in thread
From: Herbert Xu @ 2018-03-28  6:07 UTC (permalink / raw)
  To: NeilBrown; +Cc: Thomas Graf, netdev, linux-kernel

On Wed, Mar 28, 2018 at 08:54:41AM +1100, NeilBrown wrote:
>
> Possibly.
> I particularly want the interface to require that you pass the
> previously returned object to _continue. That makes it easy to see that
>> the object is still being used.  If someone changes the code to delete
> the object before the _continue, there should be a strong hint that it
> won't work.
> 
> Maybe it would be better to make it a WARN_ON()
> 
>   if (!obj || WARN_ON(iter->p != obj))
>          iter->p = NULL;

This doesn't really protect against the case where obj is removed.
All it proves is that the user saved a copy of obj which we already
did anyway.

To detect an actual removal you'd need to traverse the list.

I have another idea: we could insert the walkers into the
hash table chain at the end, essentially as a hidden list.  We
can mark it with a flag like rht_marker so that normal traversal
doesn't see it.

That way the removal code can simply traverse that list and inform
them that the iterator is invalid.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 5/6] rhashtable: support guaranteed successful insertion.
  2018-03-28  6:04       ` Herbert Xu
@ 2018-03-28  7:04         ` NeilBrown
  2018-03-28  7:27           ` Herbert Xu
  0 siblings, 1 reply; 33+ messages in thread
From: NeilBrown @ 2018-03-28  7:04 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Thomas Graf, netdev, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2667 bytes --]

On Wed, Mar 28 2018, Herbert Xu wrote:

> On Wed, Mar 28, 2018 at 08:34:19AM +1100, NeilBrown wrote:
>>
>> It is easy to get an -EBUSY insertion failure when .disable_count is
>> enabled, and I did get that.  Blindly propagating that up caused lustre
>> to get terribly confused - not too surprising really.
>
> Right, so this failure mode is specific to your patch 6.

I disagree.  My patch 6 only makes it common instead of exceedingly
rare.  If any table in the list other than the first has a chain with 16
elements, then trying to insert an element with a hash which matches
that chain will fail with -EBUSY.  This is theoretically possible
already, though astronomically unlikely.  So that case will never be
tested for.

>
> I think I see the problem.  As it currently stands, we never
> need growing when we hit the bucket length limit.  We only do
> rehashes.
>
> With your patch, you must change it so that we actually try to
> grow the table if necessary when we hit the bucket length limit.

It is hard to know if it is necessary.  And making the new table larger
will make the error less likely, but still won't make it impossible.  So
callers will have to handle it - just like they currently have to handle
-ENOMEM even though it is highly unlikely (and not strictly necessary).

Are these errors ever actually useful?  I thought I had convinced myself
before that they were (to throttle attacks on the hash function), but
they happen even less often than I thought.

>
> Otherwise it will result in the EBUSY that you're seeing.
>
> I also think that we shouldn't make this a toggle.  If we're going
> to do disable_count then it should be unconditionally done for
> everyone.
>
> However, I personally would prefer a percpu elem count instead of
> disabling it completely.  Would that be acceptable to your use-case?

Maybe. Reading a percpu counter isn't cheap.  Reading it whenever a hash
chain reaches 16 is reasonable, but I think we would want to read it a
lot more often than that.  So probably store the last-sampled time (with
no locking) and only sample the counter if last-sampled is older than
 jiffies - 10*HZ (???)
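
Roughly (an untested sketch - the "last_sampled" field and the 10 second
interval are just for illustration):

  /* Only bother summing the percpu count if we haven't done so recently. */
  static bool rht_count_check_due(unsigned long *last_sampled)
  {
          unsigned long last = READ_ONCE(*last_sampled);

          if (time_before(jiffies, last + 10 * HZ))
                  return false;
          /* unlocked; a race just means two CPUs sample - harmless */
          WRITE_ONCE(*last_sampled, jiffies);
          return true;
  }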

In the case in lustre we also shard the LRU list so that adding to the
LRU causes minimal contention. Keeping a counter together with each lru
shard is trivial and summing them periodically is little burden.
Maybe it makes sense to include that functionality in rhashtables so
that it is there for everyone.

A percpu counter uses a lot more memory than atomic_t.  Given that some
callers set nelem_hint to 2 or 3, it seems likely that those callers
don't want to waste memory.  Should we force them to?

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 4/6] rhashtable: allow a walk of the hash table without missing objects.
  2018-03-28  6:07       ` Herbert Xu
@ 2018-03-28  7:17         ` NeilBrown
  2018-03-28  7:30           ` Herbert Xu
  0 siblings, 1 reply; 33+ messages in thread
From: NeilBrown @ 2018-03-28  7:17 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Thomas Graf, netdev, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2246 bytes --]

On Wed, Mar 28 2018, Herbert Xu wrote:

> On Wed, Mar 28, 2018 at 08:54:41AM +1100, NeilBrown wrote:
>>
>> Possibly.
>> I particularly want the interface to require that you pass the
>> previously returned object to _continue. That makes it easy to see that
>> the object is still being used.  If someone changes the code to delete
>> the object before the _continue, there should be a strong hint that it
>> won't work.
>> 
>> Maybe it would be better to make it a WARN_ON()
>> 
>>   if (!obj || WARN_ON(iter->p != obj))
>>          iter->p = NULL;
>
> This doesn't really protect against the case where obj is removed.
> All it proves is that the user saved a copy of obj which we already
> did anyway.

True.  We ultimately need to trust the user not to do something silly.
We can do little things to help catch obvious errors though.  That was
my intent with the WARN_ON.

>
> To detect an actual removal you'd need to traverse the list.

Yes.  You could reset ->skip to zero and search for the given object.
That feels a bit like excess hand-holding to me, but probably wouldn't
be too expensive.  It only happens when you need to drop out of RCU, and
in that case you are probably already doing something expensive.
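
i.e. something like this (rough sketch; "obj" is the rhash_head that the
caller passed back in):

  /* On restart, rescan the bucket and recompute ->skip from the
   * remembered object rather than trusting the old ->skip value.
   */
  struct rhash_head *p;
  unsigned int skip = 0;

  iter->skip = 0;
  rht_for_each(p, iter->walker.tbl, iter->slot) {
          skip++;
          if (p == obj) {
                  iter->skip = skip;      /* resume just after obj */
                  break;
          }
  }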

>
> I have another idea: we could insert the walkers into the
> hash table chain at the end, essentially as a hidden list.  We
> can mark it with a flag like rht_marker so that normal traversal
> doesn't see it.
>
> That way the removal code can simply traverse that list and inform
> them that the iterator is invalid.

Sounds like over-kill to me.
It might be reasonable to have a CONFIG_DEBUG_RHASHTABLE which enables
extra code to catch misuse, but I don't see the justification for
always performing these checks.
The DEBUG code could just scan the chain (usually quite short) to see if
the given element is present.  Of course it might have already been
rehashed to the next table, so you would have to allow for that possibility -
probably check tbl->rehash.

Thanks,
NeilBrown

>
> Cheers,
> -- 
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 5/6] rhashtable: support guaranteed successful insertion.
  2018-03-28  7:04         ` NeilBrown
@ 2018-03-28  7:27           ` Herbert Xu
  2018-03-28 21:26             ` NeilBrown
  0 siblings, 1 reply; 33+ messages in thread
From: Herbert Xu @ 2018-03-28  7:27 UTC (permalink / raw)
  To: NeilBrown; +Cc: Thomas Graf, netdev, linux-kernel

On Wed, Mar 28, 2018 at 06:04:40PM +1100, NeilBrown wrote:
>
> I disagree.  My patch 6 only makes it common instead of exceedingly
> rare.  If any table in the list other than the first has a chain with 16
> elements, then trying to insert an element with a hash which matches
> that chain will fail with -EBUSY.  This is theoretically possible
> already, though astronomically unlikely.  So that case will never be
> tested for.

No that's not true.  If the table is correctly sized then the
probability of having a chain with 16 elements is extremely low.

Even if it does happen we won't fail because we will perform
an immediate rehash.  We only fail if it happens right away
after the rehash (that is, at least another 16 elements have
been inserted and you're trying to insert a 17th element, all
while the new hash table has not been completely populated),
which means that somebody has figured out our hash secret and
failing in that case makes sense.

> It is hard to know if it is necessary.  And making the new table larger
> will make the error less likely, but still won't make it impossible.  So
> callers will have to handle it - just like they currently have to handle
> -ENOMEM even though it is highly unlikely (and not strictly necessary).

Callers should not handle an ENOMEM error by retrying.  Nor should
they retry an EBUSY return value.

> Are these errors ever actually useful?  I thought I had convinced myself
> before that they were (to throttle attacks on the hash function), but
> they happen even less often than I thought.

The EBUSY error indicates that the hash table has essentially
degenerated into a linked list because somebody has worked out
our hash secret.

> Maybe. Reading a percpu counter isn't cheap.  Reading it whenever a hash
> chain reaches 16 is reasonable, but I think we would want to read it a
> lot more often than that.  So probably store the last-sampled time (with
> no locking) and only sample the counter if last-sampled is older than
>  jiffies - 10*HZ (???)

We could also take the spinlock table approach and have a counter
per bucket spinlock.  This should be sufficient as you'll contend
on the bucket spinlock table anyway.

This also allows us to estimate the total table size and not have
to always do a last-ditch growth when it's too late.
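
Roughly (a sketch only - the counts array doesn't exist today and would
need to be added alongside the lock array):

  /* One count per bucket lock, incremented under that lock on insert. */
  static inline void rht_lock_count_inc(struct bucket_table *tbl,
                                        unsigned int hash,
                                        unsigned int *lock_counts)
  {
          /* protected by the matching entry in tbl->locks[] */
          lock_counts[hash & tbl->locks_mask]++;
  }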

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 4/6] rhashtable: allow a walk of the hash table without missing objects.
  2018-03-28  7:17         ` NeilBrown
@ 2018-03-28  7:30           ` Herbert Xu
  2018-03-28 21:34             ` NeilBrown
  0 siblings, 1 reply; 33+ messages in thread
From: Herbert Xu @ 2018-03-28  7:30 UTC (permalink / raw)
  To: NeilBrown; +Cc: Thomas Graf, netdev, linux-kernel

On Wed, Mar 28, 2018 at 06:17:57PM +1100, NeilBrown wrote:
>
> Sounds like over-kill to me.
> It might be reasonable to have a CONFIG_DEBUG_RHASHTABLE which enables
>> extra code to catch misuse, but I don't see the justification for
> always performing these checks.
> The DEBUG code could just scan the chain (usually quite short) to see if
> the given element is present.  Of course it might have already been
>> rehashed to the next table, so you would have to allow for that possibility -
> probably check tbl->rehash.

No this is not meant to debug users incorrectly using the cursor.
This is a replacement of your continue interface by automatically
validating the cursor.

In fact we can make it even more reliable.  We can insert the walker
right into the bucket chain, that way the walking will always be
consistent.

The only problem is that we need to be able to differentiate between
a walker, a normal object, and the end of the list.  I think it
should be doable.
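
For instance the pointer that links to a walker could carry a tag bit,
much like the nulls marker distinguishes end-of-list (sketch only - the
tag, the struct and the helpers are all invented for illustration):

  #define RHT_WALKER_TAG  2UL     /* low pointer bit assumed free by alignment */

  struct rht_chain_walker {
          struct rhash_head head;         /* sits in the bucket chain */
          struct rhashtable_iter *iter;   /* so removal can invalidate it */
  };

  static inline bool rht_is_chain_walker(const struct rhash_head *p)
  {
          return ((unsigned long)p & RHT_WALKER_TAG) != 0;
  }

  static inline struct rht_chain_walker *rht_chain_walker(struct rhash_head *p)
  {
          return (struct rht_chain_walker *)((unsigned long)p & ~RHT_WALKER_TAG);
  }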

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 5/6] rhashtable: support guaranteed successful insertion.
  2018-03-28  7:27           ` Herbert Xu
@ 2018-03-28 21:26             ` NeilBrown
  2018-03-29  5:22               ` Herbert Xu
  0 siblings, 1 reply; 33+ messages in thread
From: NeilBrown @ 2018-03-28 21:26 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Thomas Graf, netdev, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3887 bytes --]

On Wed, Mar 28 2018, Herbert Xu wrote:

> On Wed, Mar 28, 2018 at 06:04:40PM +1100, NeilBrown wrote:
>>
>> I disagree.  My patch 6 only makes it common instead of exceedingly
>> rare.  If any table in the list other than the first has a chain with 16
>> elements, then trying to insert an element with a hash which matches
>> that chain will fail with -EBUSY.  This is theoretically possible
>> already, though astronomically unlikely.  So that case will never be
>> tested for.
>
> No that's not true.  If the table is correctly sized then the
> probability of having a chain with 16 elements is extremely low.

I say "astronomically unlikely", you say "probability .. is extremely
low".  I think we are in agreement here.

The point remains that if an error *can* be returned then I have to
write code to handle it and test that code.  I'd rather not.

>
> Even if it does happen we won't fail because we will perform
> an immediate rehash.  We only fail if it happens right away
> after the rehash (that is, at least another 16 elements have
> been inserted and you're trying to insert a 17th element, all
> while the new hash table has not been completely populated),
> which means that somebody has figured out our hash secret and
> failing in that case makes sense.
>
>> It is hard to know if it is necessary.  And making the new table larger
>> will make the error less likely, but still won't make it impossible.  So
>> callers will have to handle it - just like they currently have to handle
>> -ENOMEM even though it is highly unlikely (and not strictly necessary).
>
> Callers should not handle an ENOMEM error by retrying.  Nor should
> they retry an EBUSY return value.

I never suggested retrying, but I would have to handle it somehow.  I'd
rather not.

>
>> Are these errors ever actually useful?  I thought I had convinced myself
>> before that they were (to throttle attacks on the hash function), but
>> they happen even less often than I thought.
>
> The EBUSY error indicates that the hash table has essentially
> degenerated into a linked list because somebody has worked out
> our hash secret.

While I have no doubt that there are hashtables where someone could try
to attack the hash, I am quite sure there are others where such an
attack is meaningless - any code which could generate the required range of
keys could do far worse things more easily.

>
>> Maybe. Reading a percpu counter isn't cheap.  Reading it whenever a hash
>> chain reaches 16 is reasonable, but I think we would want to read it a
>> lot more often than that.  So probably store the last-sampled time (with
>> no locking) and only sample the counter if last-sampled is older than
>>  jiffies - 10*HZ (???)
>
> We could also take the spinlock table approach and have a counter
> per bucket spinlock.  This should be sufficient as you'll contend
> on the bucket spinlock table anyway.

Yes, storing a sharded count in the spinlock table does seem like an
appropriate granularity.  However that leads me to ask: why do we have
the spinlock table?  Why not bit spinlocks in the hashchain head like
include/linux/list_bl uses?

>
> This also allows us to estimate the total table size and not have
> to always do a last-ditch growth when it's too late.

I don't understand how it can ever be "too late", though I appreciate
that in some cases "sooner" is better than "later".
If we give up on the single atomic_t counter, then we must accept that
the number of elements could exceed any given value.  The only promise
we can provide is that it won't exceed N% of the table size for more than
T seconds.

Thanks,
NeilBrown


>
> Cheers,
> -- 
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 4/6] rhashtable: allow a walk of the hash table without missing objects.
  2018-03-28  7:30           ` Herbert Xu
@ 2018-03-28 21:34             ` NeilBrown
  2018-03-29  1:13               ` NeilBrown
  0 siblings, 1 reply; 33+ messages in thread
From: NeilBrown @ 2018-03-28 21:34 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Thomas Graf, netdev, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1561 bytes --]

On Wed, Mar 28 2018, Herbert Xu wrote:

> On Wed, Mar 28, 2018 at 06:17:57PM +1100, NeilBrown wrote:
>>
>> Sounds like over-kill to me.
>> It might be reasonable to have a CONFIG_DEBUG_RHASHTABLE which enables
>> extra code to catch misuse, but I don't see the justification for
>> always performing these checks.
>> The DEBUG code could just scan the chain (usually quite short) to see if
>> the given element is present.  Of course it might have already been
>> rehashed to the next table, so you would have to allow for that possibility -
>> probably check tbl->rehash.
>
> No this is not meant to debug users incorrectly using the cursor.
> This is a replacement of your continue interface by automatically
> validating the cursor.
>
> In fact we can make it even more reliable.  We can insert the walker
> right into the bucket chain, that way the walking will always be
> consistent.
>
> The only problem is that we need to be able to differentiate between
> a walker, a normal object, and the end of the list.  I think it
> should be doable.

Yes, I think that could work.  The code to step over a walker object
during an unlocked search wouldn't be straightforward and would need
careful analysis.

How about storing the hash chains in order by object address?
Then rhashtable_walk_start() can easily find its place regardless of
whether the old object was still present or not, using <= on the
address.
"Insert" would need to record an insert location and insert there rather
than at the head of the chain.
I might try coding that.
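
Roughly, the insert would become something like (untested sketch,
ignoring nested tables and locking):

  /* Keep each chain sorted by object address so a walker can resume at
   * "first entry with address > last returned object".
   */
  struct rhash_head __rcu **pprev = &tbl->buckets[hash];
  struct rhash_head *pos;

  while (!rht_is_a_nulls(pos = rht_dereference_bucket(*pprev, tbl, hash)) &&
         pos < obj)
          pprev = &pos->next;

  RCU_INIT_POINTER(obj->next, pos);
  rcu_assign_pointer(*pprev, obj);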

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 4/6] rhashtable: allow a walk of the hash table without missing objects.
  2018-03-28 21:34             ` NeilBrown
@ 2018-03-29  1:13               ` NeilBrown
  0 siblings, 0 replies; 33+ messages in thread
From: NeilBrown @ 2018-03-29  1:13 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Thomas Graf, netdev, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 732 bytes --]

On Thu, Mar 29 2018, NeilBrown wrote:

>
> How about storing the hash chains in order by object address?
> Then rhashtable_walk_start() can easily find its place regardless of
> whether the old object was still present or not, using <= on the
> address.
> "Insert" would need to record an insert location and insert there rather
> than at the head of the chain.
> I might try coding that.

Unfortunately rhltables make this approach unworkable.
However while trying to write the code I found a bug :-(
I'll post a patch for the bug, and a patch to transparently make the
current interface more reliable when the caller keeps the current
object in the table.
I think this is sufficient for all current use-cases.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 5/6] rhashtable: support guaranteed successful insertion.
  2018-03-28 21:26             ` NeilBrown
@ 2018-03-29  5:22               ` Herbert Xu
  2018-04-06  3:11                 ` NeilBrown
  0 siblings, 1 reply; 33+ messages in thread
From: Herbert Xu @ 2018-03-29  5:22 UTC (permalink / raw)
  To: NeilBrown
  Cc: Thomas Graf, netdev, linux-kernel, David S. Miller, Eric Dumazet

On Thu, Mar 29, 2018 at 08:26:21AM +1100, NeilBrown wrote:
>
> I say "astronomically unlikely", you say "probability .. is extremely
> low".  I think we are in agreement here.
> 
> The point remains that if an error *can* be returned then I have to
> write code to handle it and test that code.  I'd rather not.

You have to be able to handle errors anyway because of memory allocation
failures.  Ultimately if you keep inserting you will eventually
fail with ENOMEM.  So I don't see the issue with an additional
error value.

> > Even if it does happen we won't fail because we will perform
> > an immediate rehash.  We only fail if it happens right away
> > after the rehash (that is, at least another 16 elements have
> > been inserted and you're trying to insert a 17th element, all
> > while the new hash table has not been completely populated),
> > which means that somebody has figured out our hash secret and
> > failing in that case makes sense.

BTW, you didn't acknowledge this bit which I think is crucial to
how likely such an error is.

> I never suggested retrying, but I would have to handle it somehow.  I'd
> rather not.

...

> While I have no doubt that there are hashtables where someone could try
> to attack the hash, I am quite sure there are others where such an
> attack is meaningless - any code which could generate the required range of
> keys could do far worse things more easily.

Our network hashtable has to be secure against adversaries.  I
understand that this may not be important to your use-case.  However,
given the fact that the failure would only occur if an adversary
is present and actively attacking your hash table, I don't think
it has that much of a negative effect on your use-case either.

Of course, if you can reproduce the EBUSY error without your
disable_count patch, or can still reproduce it even after you have
fixed the issue I have pointed out in your disable_count patch, then
that would suggest a real bug and we would need to fix it,
for everyone.

> Yes, storing a sharded count in the spinlock table does seem like an
> appropriate granularity.  However that leads me to ask: why do we have
> the spinlock table?  Why not bit spinlocks in the hashchain head like
> include/linux/list_bl uses?

The spinlock table predates rhashtable.  Perhaps Thomas/Eric/Dave
can elucidate this.

> I don't understand how it can ever be "too late", though I appreciate
> that in some cases "sooner" is better than "later".
> If we give up on the single atomic_t counter, then we must accept that
> the number of elements could exceed any given value.  The only promise
> we can provide is that it won't exceed N% of the table size for more than
> T seconds.

Sure.  However, assuming we use an estimate that is hash-based,
that *should* be fairly accurate assuming that your hash function
is working in the first place.  It's completely different vs.
estimating based on a per-cpu count which could be wildly inaccurate.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 5/6] rhashtable: support guaranteed successful insertion.
  2018-03-29  5:22               ` Herbert Xu
@ 2018-04-06  3:11                 ` NeilBrown
  2018-04-06  4:13                   ` Herbert Xu
  0 siblings, 1 reply; 33+ messages in thread
From: NeilBrown @ 2018-04-06  3:11 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Thomas Graf, netdev, linux-kernel, David S. Miller, Eric Dumazet

[-- Attachment #1: Type: text/plain, Size: 5902 bytes --]

On Thu, Mar 29 2018, Herbert Xu wrote:

> On Thu, Mar 29, 2018 at 08:26:21AM +1100, NeilBrown wrote:
>>
>> I say "astronomically unlikely", you say "probability .. is extremely
>> low".  I think we are in agreement here.
>> 
>> The point remains that if an error *can* be returned then I have to
>> write code to handle it and test that code.  I'd rather not.
>
> You have to be able to handle errors anyway because of memory allocation
> failures.  Ultimately if you keep inserting you will eventually
> fail with ENOMEM.  So I don't see the issue with an additional
> error value.

You don't need to handle memory allocation failures at the point where
you insert into the table - adding to a linked list requires no new
memory.

If you are running out of memory, you will fail to allocate new objects,
so you won't be able to add them to the rhashtable.  Obviously you have
to handle that failure, and that is likely to save rhashtable from
getting many mem-alloc failures.

But once you have allocated a new object, rhashtable should just accept
it.  It doesn't help anyone to have the insertion fail.
The insertion can currently fail even when allocating a new object would
succeed, as insertion can require a GFP_ATOMIC allocation to succeed,
while allocating a new object might only need a GFP_KERNEL allocation to
succeed. 

>
>> > Even if it does happen we won't fail because we will perform
>> > an immediate rehash.  We only fail if it happens right away
>> > after the rehash (that is, at least another 16 elements have
>> > been inserted and you're trying to insert a 17th element, all
>> > while the new hash table has not been completely populated),
>> > which means that somebody has figured out our hash secret and
>> > failing in that case makes sense.
>
> BTW, you didn't acknowledge this bit which I think is crucial to
> how likely such an error is.

The likelihood of the error isn't really the issue - it either can
happen or it cannot.  If it can, it requires code to handle it.

>
>> I never suggested retrying, but I would have to handle it somehow.  I'd
>> rather not.
>
> ...
>
>> While I have no doubt that there are hashtables where someone could try
>> to attack the hash, I am quite sure there are others where such an
>> attack is meaningless - any code which could generate the required range of
>> keys could do far worse things more easily.
>
> Our network hashtable has to be secure against adversaries.  I
> understand that this may not be important to your use-case.  However,
> given the fact that the failure would only occur if an adversary
> is present and actively attacking your hash table, I don't think
> it has that much of a negative effect on your use-case either.

I certainly accept that there are circumstances where it is a real
possibility that an adversary might succeed in attacking the hash
function, and changing the seed for each table is an excellent idea.
It isn't clear to me that failing insertions is needed - it is done
rather late and so doesn't throttle much, and could give the attacker
feedback that they had succeeded (?).

Not all hash tables are susceptible to attack.  It would be nice if
developers working in less exposed areas could use rhashtable without ever
having to check for errors.

Error codes are messages from the implementation to the caller.
They should be chosen to help the caller make useful choices, not to
allow the implementation to avoid awkward situations.

>
> Of course, if you can reproduce the EBUSY error without your
> disable_count patch, or can still reproduce it even after you have
> fixed the issue I have pointed out in your disable_count patch, then
> that would suggest a real bug and we would need to fix it,
> for everyone.

I suspect that I cannot.  However experience tells me that customers do
things that I would never expect - they can often trigger
errors that I cannot.  It is best if the error cannot be returned.

>
>> Yes, storing a sharded count in the spinlock table does seem like an
>> appropriate granularity.  However that leads me to ask: why do we have
>> the spinlock table?  Why not bit spinlocks in the hashchain head like
>> include/linux/list_bl uses?
>
> The spinlock table predates rhashtable.  Perhaps Thomas/Eric/Dave
> can elucidate this.

Maybe I'll submit an RFC patch to change it - if I can make it work.

>
>> I don't understand how it can ever be "too late", though I appreciate
>> that in some cases "sooner" is better than "later".
>> If we give up on the single atomic_t counter, then we must accept that
>> the number of elements could exceed any given value.  The only promise
>> we can provide is that it won't exceed N% of the table size for more than
>> T seconds.
>
> Sure.  However, assuming we use an estimate that is hash-based,
> that *should* be fairly accurate assuming that your hash function
> is working in the first place.  It's completely different vs.
> estimating based on a per-cpu count which could be wildly inaccurate.

Ahhh... I see what you are thinking now.  You aren't suggesting a
sharded count that is summed periodically.  Rather you are suggesting that
we divide the hash space into N regions each with their own independent
counter, and resize the table if any one region reaches 70% occupancy -
on the assumption that the other regions are likely to be close.  Is
that right?
It would probably be dangerous to allow automatic shrinking (when just
one drops below 30%) in that case, but it might be OK for growing.

I'm not sure I like the idea, but it is worth thinking about.
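
For concreteness, I imagine something like this (sketch only - the
region count array and the constants are invented, and it assumes the
table has at least RHT_NR_REGIONS buckets):

  #define RHT_NR_REGIONS  16      /* arbitrary for illustration */

  /* Grow when any one region's element count exceeds 70% of its share
   * of the buckets.
   */
  static bool rht_region_grow_needed(struct bucket_table *tbl,
                                     unsigned int hash,
                                     atomic_t *region_counts)
  {
          unsigned int per_region = tbl->size / RHT_NR_REGIONS;
          unsigned int region = (hash & (tbl->size - 1)) / per_region;

          return atomic_inc_return(&region_counts[region]) >
                 per_region * 7 / 10;
  }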

Thanks,
NeilBrown


>
> Cheers,
> -- 
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 5/6] rhashtable: support guaranteed successful insertion.
  2018-04-06  3:11                 ` NeilBrown
@ 2018-04-06  4:13                   ` Herbert Xu
  0 siblings, 0 replies; 33+ messages in thread
From: Herbert Xu @ 2018-04-06  4:13 UTC (permalink / raw)
  To: NeilBrown
  Cc: Thomas Graf, netdev, linux-kernel, David S. Miller, Eric Dumazet

On Fri, Apr 06, 2018 at 01:11:56PM +1000, NeilBrown wrote:
>
> You don't need to handle memory allocation failures at the point where
> you insert into the table - adding to a linked list requires no new
> memory.

You do actually.  The r in rhashtable stands for resizable.  We
cannot completely rely on a background hash table resize because
the insertions can be triggered from softirq context and be without
rate-limits of any kind (e.g., incoming TCP connections).

Therefore at some point you either have to wait for the resize or
fail the insertion.  Since we cannot sleep, failing is the only
option.

> The likelihood of the error isn't really the issue - it either can
> happen or it cannot.  If it can, it requires code to handle it.

Sure, but you just have to handle it the same way you would handle
a memory allocation failure, because that's what it essentially is.

I'm sorry if that means you have to restructure your code to do that
but that's what you pay for having a resizable hash table because
at some point you just have to perform a resize.

> Ahhh... I see what you are thinking now.  You aren't suggesting a
> sharded count that is summed periodically.  Rather you are suggesting that
> we divide the hash space into N regions each with their own independent
> counter, and resize the table if any one region reaches 70% occupancy -
> on the assumption that the other regions are likely to be close.  Is
> that right?

Yes.

> It would probably be dangerous to allow automatic shrinking (when just
> one drops below 30%) in that case, but it might be OK for growing.

At the greatest granularity it would be a counter per bucket, so
clearly we need to maintain some kind of bound on the granularity
so our sample space is not too small.

Automatic shrinking shouldn't be an issue because that's the slow
kind of resizing that we punt to kthread context and it can afford
to do a careful count.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2018-04-06  4:14 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-26 23:33 [PATCH 0/6] rhashtable: assorted fixes and enhancements NeilBrown
2018-03-26 23:33 ` [PATCH 1/6] rhashtable: improve documentation for rhashtable_walk_peek() NeilBrown
2018-03-27 10:55   ` Sergei Shtylyov
2018-03-27 15:30   ` Herbert Xu
2018-03-27 15:47   ` David Miller
2018-03-27 21:45   ` [PATCH 1/6 v2] " NeilBrown
2018-03-27 22:46   ` [PATCH 1/6] " Andreas Grünbacher
2018-03-28  0:49     ` NeilBrown
2018-03-26 23:33 ` [PATCH 3/6] rhashtable: reset intr when rhashtable_walk_start sees new table NeilBrown
2018-03-27 15:47   ` Herbert Xu
2018-03-26 23:33 ` [PATCH 6/6] rhashtable: allow element counting to be disabled NeilBrown
2018-03-26 23:33 ` [PATCH 5/6] rhashtable: support guaranteed successful insertion NeilBrown
2018-03-27 15:56   ` Herbert Xu
2018-03-27 21:34     ` NeilBrown
2018-03-28  6:04       ` Herbert Xu
2018-03-28  7:04         ` NeilBrown
2018-03-28  7:27           ` Herbert Xu
2018-03-28 21:26             ` NeilBrown
2018-03-29  5:22               ` Herbert Xu
2018-04-06  3:11                 ` NeilBrown
2018-04-06  4:13                   ` Herbert Xu
2018-03-26 23:33 ` [PATCH 4/6] rhashtable: allow a walk of the hash table without missing objects NeilBrown
2018-03-27 15:49   ` David Miller
2018-03-27 15:54     ` Herbert Xu
2018-03-27 21:50     ` NeilBrown
2018-03-27 15:51   ` Herbert Xu
2018-03-27 21:54     ` NeilBrown
2018-03-28  6:07       ` Herbert Xu
2018-03-28  7:17         ` NeilBrown
2018-03-28  7:30           ` Herbert Xu
2018-03-28 21:34             ` NeilBrown
2018-03-29  1:13               ` NeilBrown
2018-03-26 23:33 ` [PATCH 2/6] rhashtable: remove outdated comments about grow_decision etc NeilBrown

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).