LKML Archive on lore.kernel.org
* [PATCH 3/3] rhashtable: implement rhashtable_walk_peek() using rhashtable_walk_last_seen()
  2018-07-06  7:11 [PATCH 0/3] rhashtable: replace rhashtable_walk_peek implementation NeilBrown
@ 2018-07-06  7:11 ` NeilBrown
  2018-07-06  7:11 ` [PATCH 1/3] rhashtable: further improve stability of rhashtable_walk NeilBrown
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2018-07-06  7:11 UTC (permalink / raw)
  To: Thomas Graf, Herbert Xu, Tom Herbert; +Cc: netdev, linux-kernel

rhashtable_walk_last_seen() does most of the work that
rhashtable_walk_peek() needs, so implement the latter in terms
of it as a "static inline".
Also update the documentation for rhashtable_walk_peek() to clarify
the expected use case.

Acked-by: Tom Herbert <tom@quantonium.net>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 include/linux/rhashtable.h |   29 ++++++++++++++++++++++++++++-
 lib/rhashtable.c           |   34 ----------------------------------
 2 files changed, 28 insertions(+), 35 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index d63b472e9d50..96ebc2690027 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -247,10 +247,37 @@ static inline void rhashtable_walk_start(struct rhashtable_iter *iter)
 }
 
 void *rhashtable_walk_next(struct rhashtable_iter *iter);
-void *rhashtable_walk_peek(struct rhashtable_iter *iter);
 void *rhashtable_walk_last_seen(struct rhashtable_iter *iter);
 void rhashtable_walk_stop(struct rhashtable_iter *iter) __releases(RCU);
 
+/**
+ * rhashtable_walk_peek - Return the next object to use in an interrupted walk
+ * @iter:	Hash table iterator
+ *
+ * Returns the "current" object or NULL when the end of the table is reached.
+ * When an rhashtable_walk is interrupted with rhashtable_walk_stop(),
+ * it is often because an object was found that could not be processed
+ * immediately, possibly because there is no more space to encode details
+ * of the object (e.g. when producing a seq_file from the table).
+ * When the walk is restarted, the same object needs to be processed again,
+ * if possible.  The object might have been removed from the table while
+ * the walk was paused, so it might not be available.  In that case, the
+ * normal "next" object should be treated as "current".
+ *
+ * To support this common case, rhashtable_walk_peek() returns the
+ * appropriate object to process after an interrupted walk: either the
+ * one that was most recently returned or, if that no longer exists,
+ * the next one.
+ *
+ * Returns -EAGAIN if resize event occurred.  In that case the iterator
+ * will rewind back to the beginning and you may continue to use it.
+ */
+static inline void *rhashtable_walk_peek(struct rhashtable_iter *iter)
+{
+	return rhashtable_walk_last_seen(iter) ?:
+		rhashtable_walk_next(iter);
+}
+
 void rhashtable_free_and_destroy(struct rhashtable *ht,
 				 void (*free_fn)(void *ptr, void *arg),
 				 void *arg);
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 2d0227822262..9ddb7134285e 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -913,40 +913,6 @@ void *rhashtable_walk_next(struct rhashtable_iter *iter)
 }
 EXPORT_SYMBOL_GPL(rhashtable_walk_next);
 
-/**
- * rhashtable_walk_peek - Return the next object but don't advance the iterator
- * @iter:	Hash table iterator
- *
- * Returns the next object or NULL when the end of the table is reached.
- *
- * Returns -EAGAIN if resize event occurred.  Note that the iterator
- * will rewind back to the beginning and you may continue to use it.
- */
-void *rhashtable_walk_peek(struct rhashtable_iter *iter)
-{
-	struct rhlist_head *list = iter->list;
-	struct rhashtable *ht = iter->ht;
-	struct rhash_head *p = iter->p;
-
-	if (p)
-		return rht_obj(ht, ht->rhlist ? &list->rhead : p);
-
-	/* No object found in current iter, find next one in the table. */
-
-	if (iter->skip) {
-		/* A nonzero skip value points to the next entry in the table
-		 * beyond that last one that was found. Decrement skip so
-		 * we find the current value. __rhashtable_walk_find_next
-		 * will restore the original value of skip assuming that
-		 * the table hasn't changed.
-		 */
-		iter->skip--;
-	}
-
-	return __rhashtable_walk_find_next(iter);
-}
-EXPORT_SYMBOL_GPL(rhashtable_walk_peek);
-
 /**
  * rhashtable_walk_last_seen - Return the previously returned object, if available
  * @iter:	Hash table iterator
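The interrupted-walk pattern described in this patch can be modelled in ordinary user-space C. The sketch below is illustrative only: the list, iterator, and function names are hypothetical stand-ins for the rhashtable API, and RCU, buckets, and resize events are all ignored. It shows the same last_seen-with-next-fallback shape as the new static inline:

```c
#include <stddef.h>

struct node { int key; struct node *next; };

struct iter {
	struct node *head;	/* chain being walked */
	struct node *last;	/* object returned by the previous next() */
};

/* Return the previously returned object only if it is still reachable
 * from the head (cf. rhashtable_walk_last_seen()). */
static struct node *walk_last_seen(struct iter *it)
{
	struct node *p;

	for (p = it->head; p; p = p->next)
		if (p == it->last)
			return p;
	return NULL;
}

/* Advance the walk and return the next object (NULL at the end). */
static struct node *walk_next(struct iter *it)
{
	it->last = it->last ? it->last->next : it->head;
	return it->last;
}

/* Same shape as the new inline: reuse the last object if it survived,
 * otherwise take the next one. */
static struct node *walk_peek(struct iter *it)
{
	struct node *p = walk_last_seen(it);

	return p ? p : walk_next(it);
}
```

Here walk_peek() repeats the remembered object while it is still reachable and falls back to advancing the walk once it has been removed, which is the behaviour a seq_file dump wants after an interrupted walk.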



^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 2/3] rhashtable: add rhashtable_walk_last_seen()
  2018-07-06  7:11 [PATCH 0/3] rhashtable: replace rhashtable_walk_peek implementation NeilBrown
  2018-07-06  7:11 ` [PATCH 3/3] rhashtable: implement rhashtable_walk_peek() using rhashtable_walk_last_seen() NeilBrown
  2018-07-06  7:11 ` [PATCH 1/3] rhashtable: further improve stability of rhashtable_walk NeilBrown
@ 2018-07-06  7:11 ` NeilBrown
  2018-07-10 23:55   ` David Miller
  2018-07-10 23:55 ` [PATCH 0/3] rhashtable: replace rhashtable_walk_peek implementation David Miller
  3 siblings, 1 reply; 25+ messages in thread
From: NeilBrown @ 2018-07-06  7:11 UTC (permalink / raw)
  To: Thomas Graf, Herbert Xu, Tom Herbert; +Cc: netdev, linux-kernel

rhashtable_walk_last_seen() returns the object returned by
the previous rhashtable_walk_next(), provided it is still in the
table (or was during this grace period).
This works even if rhashtable_walk_stop() and rhashtable_walk_start()
have been called since the last rhashtable_walk_next().

If there have been no calls to rhashtable_walk_next(), or if the
object is gone from the table, then NULL is returned.

This can be useful in a seq_file ->start() function.
If the pos is the same as was returned by the last ->next() call,
then rhashtable_walk_last_seen() can be used to re-establish the
current location in the table.  If it returns NULL, then
rhashtable_walk_next() should be used.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 include/linux/rhashtable.h |    1 +
 lib/rhashtable.c           |   30 ++++++++++++++++++++++++++++++
 2 files changed, 31 insertions(+)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 657e37ae314c..d63b472e9d50 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -248,6 +248,7 @@ static inline void rhashtable_walk_start(struct rhashtable_iter *iter)
 
 void *rhashtable_walk_next(struct rhashtable_iter *iter);
 void *rhashtable_walk_peek(struct rhashtable_iter *iter);
+void *rhashtable_walk_last_seen(struct rhashtable_iter *iter);
 void rhashtable_walk_stop(struct rhashtable_iter *iter) __releases(RCU);
 
 void rhashtable_free_and_destroy(struct rhashtable *ht,
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 36f97d0c69ce..2d0227822262 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -947,6 +947,36 @@ void *rhashtable_walk_peek(struct rhashtable_iter *iter)
 }
 EXPORT_SYMBOL_GPL(rhashtable_walk_peek);
 
+/**
+ * rhashtable_walk_last_seen - Return the previously returned object, if available
+ * @iter:	Hash table iterator
+ *
+ * If rhashtable_walk_next() has previously been called and the object
+ * it returned is still in the hash table, that object is returned again,
+ * otherwise %NULL is returned.
+ *
+ * If the most recent rhashtable_walk_next() call was made since the most
+ * recent rhashtable_walk_start() call, the returned object may not,
+ * strictly speaking, still be in the table, but it is safe to dereference.
+ *
+ * Note that the iterator is not changed.
+ */
+void *rhashtable_walk_last_seen(struct rhashtable_iter *iter)
+{
+	struct rhashtable *ht = iter->ht;
+	struct rhash_head *p = iter->p;
+
+	if (!p)
+		return NULL;
+	if (!iter->p_is_unsafe || ht->rhlist)
+		return p;
+	rht_for_each_rcu(p, iter->walker.tbl, iter->slot)
+		if (p == iter->p)
+			return p;
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(rhashtable_walk_last_seen);
+
 /**
  * rhashtable_walk_stop - Finish a hash table walk
  * @iter:	Hash table iterator




* [PATCH 1/3] rhashtable: further improve stability of rhashtable_walk
  2018-07-06  7:11 [PATCH 0/3] rhashtable: replace rhashtable_walk_peek implementation NeilBrown
  2018-07-06  7:11 ` [PATCH 3/3] rhashtable: implement rhashtable_walk_peek() using rhashtable_walk_last_seen() NeilBrown
@ 2018-07-06  7:11 ` NeilBrown
  2018-07-06  8:24   ` kbuild test robot
                     ` (3 more replies)
  2018-07-06  7:11 ` [PATCH 2/3] rhashtable: add rhashtable_walk_last_seen() NeilBrown
  2018-07-10 23:55 ` [PATCH 0/3] rhashtable: replace rhashtable_walk_peek implementation David Miller
  3 siblings, 4 replies; 25+ messages in thread
From: NeilBrown @ 2018-07-06  7:11 UTC (permalink / raw)
  To: Thomas Graf, Herbert Xu, Tom Herbert; +Cc: netdev, linux-kernel

If the sequence:
   obj = rhashtable_walk_next(iter);
   rhashtable_walk_stop(iter);
   rhashtable_remove_fast(ht, &obj->head, params);
   rhashtable_walk_start(iter);

 races with another thread inserting or removing
 an object on the same hash chain, a subsequent
 rhashtable_walk_next() is not guaranteed to get the "next"
 object. It is possible that an object could be
 repeated, or missed.

 This can be made more reliable by keeping the objects in a hash chain
 sorted by memory address.  A subsequent rhashtable_walk_next()
 call can reliably find the correct position in the list, and thus
 find the 'next' object.

 It is not possible to take this approach with an rhltable as keeping
 the hash chain in order is not so easy.  When the first object with a
 given key is removed, it is replaced in the chain with the next
 object with the same key, and the address of that object may not be
 correctly ordered.
 I have not yet found any way to achieve the same stability
 with rhltables that doesn't have a major impact on lookup
 or insert.  No code currently in Linux would benefit from
 such extra stability.

 With this patch:
 - a new object is always inserted after the last object with a
   smaller address, or at the start.  This preserves the property,
   important when allowing objects to be removed and re-added, that
   an object is never inserted *after* a position that it previously
   held in the list.
 - when rhashtable_walk_start() is called, it records that 'p' is not
   'safe', meaning that it cannot be dereferenced.  The revalidation
   that was previously done here is moved to rhashtable_walk_next().
 - when rhashtable_walk_next() is called while p is not NULL and not
   safe, it walks the chain looking for the first object with an
   address greater than p and returns that.  If there is none, it moves
   to the next hash chain.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 include/linux/rhashtable-types.h |    1 
 include/linux/rhashtable.h       |   10 ++++-
 lib/rhashtable.c                 |   82 +++++++++++++++++++++++++-------------
 3 files changed, 62 insertions(+), 31 deletions(-)

diff --git a/include/linux/rhashtable-types.h b/include/linux/rhashtable-types.h
index 763d613ce2c2..bc3e84547ba7 100644
--- a/include/linux/rhashtable-types.h
+++ b/include/linux/rhashtable-types.h
@@ -126,6 +126,7 @@ struct rhashtable_iter {
 	struct rhashtable_walker walker;
 	unsigned int slot;
 	unsigned int skip;
+	bool p_is_unsafe;
 	bool end_of_table;
 };
 
diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 10435a77b156..657e37ae314c 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -628,7 +628,12 @@ static inline void *__rhashtable_insert_fast(
 		    (params.obj_cmpfn ?
 		     params.obj_cmpfn(&arg, rht_obj(ht, head)) :
 		     rhashtable_compare(&arg, rht_obj(ht, head)))) {
-			pprev = &head->next;
+			if (rhlist) {
+				pprev = &head->next;
+			} else {
+				if (head < obj)
+					headp = &head->next;
+			}
 			continue;
 		}
 
@@ -1124,7 +1129,8 @@ static inline int rhashtable_walk_init(struct rhashtable *ht,
  * Note that if you restart a walk after rhashtable_walk_stop you
  * may see the same object twice.  Also, you may miss objects if
  * there are removals in between rhashtable_walk_stop and the next
- * call to rhashtable_walk_start.
+ * call to rhashtable_walk_start.  Note that this is different from
+ * rhashtable_walk_enter(), which does not miss objects.
  *
  * For a completely stable walk you should construct your own data
  * structure outside the hash table.
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index f87af707f086..36f97d0c69ce 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -228,6 +228,7 @@ static int rhashtable_rehash_one(struct rhashtable *ht, unsigned int old_hash)
 	struct bucket_table *old_tbl = rht_dereference(ht->tbl, ht);
 	struct bucket_table *new_tbl = rhashtable_last_table(ht, old_tbl);
 	struct rhash_head __rcu **pprev = rht_bucket_var(old_tbl, old_hash);
+	struct rhash_head __rcu **inspos;
 	int err = -EAGAIN;
 	struct rhash_head *head, *next, *entry;
 	spinlock_t *new_bucket_lock;
@@ -256,12 +257,15 @@ static int rhashtable_rehash_one(struct rhashtable *ht, unsigned int old_hash)
 	new_bucket_lock = rht_bucket_lock(new_tbl, new_hash);
 
 	spin_lock_nested(new_bucket_lock, SINGLE_DEPTH_NESTING);
-	head = rht_dereference_bucket(new_tbl->buckets[new_hash],
-				      new_tbl, new_hash);
-
+	inspos = &new_tbl->buckets[new_hash];
+	head = rht_dereference_bucket(*inspos, new_tbl, new_hash);
+	while (!rht_is_a_nulls(head) && head < entry) {
+		inspos = &head->next;
+		head = rht_dereference_bucket(*inspos, new_tbl, new_hash);
+	}
 	RCU_INIT_POINTER(entry->next, head);
 
-	rcu_assign_pointer(new_tbl->buckets[new_hash], entry);
+	rcu_assign_pointer(*inspos, entry);
 	spin_unlock(new_bucket_lock);
 
 	rcu_assign_pointer(*pprev, next);
@@ -557,6 +561,10 @@ static struct bucket_table *rhashtable_insert_one(struct rhashtable *ht,
 		return ERR_PTR(-ENOMEM);
 
 	head = rht_dereference_bucket(*pprev, tbl, hash);
+	while (!ht->rhlist && !rht_is_a_nulls(head) && head < obj) {
+		pprev = &head->next;
+		head = rht_dereference_bucket(*pprev, tbl, hash);
+	}
 
 	RCU_INIT_POINTER(obj->next, head);
 	if (ht->rhlist) {
@@ -651,10 +659,10 @@ EXPORT_SYMBOL_GPL(rhashtable_insert_slow);
  *
  * This function prepares a hash table walk.
  *
- * Note that if you restart a walk after rhashtable_walk_stop you
- * may see the same object twice.  Also, you may miss objects if
- * there are removals in between rhashtable_walk_stop and the next
- * call to rhashtable_walk_start.
+ * A walk is guaranteed to return every object that was in
+ * the table before this call, and is still in the table when
+ * rhashtable_walk_next() returns NULL.  Duplicates can be
+ * seen, but only if there is a rehash event during the walk.
  *
  * For a completely stable walk you should construct your own data
  * structure outside the hash table.
@@ -738,19 +746,10 @@ int rhashtable_walk_start_check(struct rhashtable_iter *iter)
 
 	if (iter->p && !rhlist) {
 		/*
-		 * We need to validate that 'p' is still in the table, and
-		 * if so, update 'skip'
+		 * 'p' will be revalidated when rhashtable_walk_next()
+		 * is called.
 		 */
-		struct rhash_head *p;
-		int skip = 0;
-		rht_for_each_rcu(p, iter->walker.tbl, iter->slot) {
-			skip++;
-			if (p == iter->p) {
-				iter->skip = skip;
-				goto found;
-			}
-		}
-		iter->p = NULL;
+		iter->p_is_unsafe = true;
 	} else if (iter->p && rhlist) {
 		/* Need to validate that 'list' is still in the table, and
 		 * if so, update 'skip' and 'p'.
@@ -867,15 +866,39 @@ void *rhashtable_walk_next(struct rhashtable_iter *iter)
 	bool rhlist = ht->rhlist;
 
 	if (p) {
-		if (!rhlist || !(list = rcu_dereference(list->next))) {
-			p = rcu_dereference(p->next);
-			list = container_of(p, struct rhlist_head, rhead);
-		}
-		if (!rht_is_a_nulls(p)) {
-			iter->skip++;
-			iter->p = p;
-			iter->list = list;
-			return rht_obj(ht, rhlist ? &list->rhead : p);
+		if (!rhlist && iter->p_is_unsafe) {
+			/*
+			 * First time next() was called after start().
+			 * Need to find location of 'p' in the list.
+			 */
+			struct rhash_head *p;
+
+			iter->skip = 0;
+			rht_for_each_rcu(p, iter->walker.tbl, iter->slot) {
+				iter->skip++;
+				if (p <= iter->p)
+					continue;
+
+				/* p is the next object after iter->p */
+				iter->p = p;
+				iter->p_is_unsafe = false;
+				return rht_obj(ht, p);
+			}
+			/* There is no "next" object in the list, move
+			 * to next hash chain.
+			 */
+		} else {
+			if (!rhlist || !(list = rcu_dereference(list->next))) {
+				p = rcu_dereference(p->next);
+				list = container_of(p, struct rhlist_head,
+						    rhead);
+			}
+			if (!rht_is_a_nulls(p)) {
+				iter->skip++;
+				iter->p = p;
+				iter->list = list;
+				return rht_obj(ht, rhlist ? &list->rhead : p);
+			}
 		}
 
 		/* At the end of this slot, switch to next one and then find
@@ -885,6 +908,7 @@ void *rhashtable_walk_next(struct rhashtable_iter *iter)
 		iter->slot++;
 	}
 
+	iter->p_is_unsafe = false;
 	return __rhashtable_walk_find_next(iter);
 }
 EXPORT_SYMBOL_GPL(rhashtable_walk_next);
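The effect of keeping each chain address-sorted can be sketched with hypothetical user-space helpers (no RCU, no nulls markers; all names are invented for illustration). Note that ordering nodes by comparing pointers across separate allocations is formally undefined in standard C, so the sketch keeps all nodes in one array; the kernel simply relies on pointer comparison behaving sensibly:

```c
#include <stddef.h>

struct node { struct node *next; };

/* Insert 'obj' after the last node with a smaller address (or at the
 * head), so the chain stays sorted by address.  An object that is
 * removed and re-inserted can therefore never reappear *after* a
 * position it previously held. */
static void insert_ordered(struct node **head, struct node *obj)
{
	struct node **pos = head;

	while (*pos && *pos < obj)
		pos = &(*pos)->next;
	obj->next = *pos;
	*pos = obj;
}

/* Resume a walk: return the first node whose address is greater than
 * 'last', i.e. the "next" object, even if 'last' itself has been
 * removed from the chain in the meantime. */
static struct node *resume_after(struct node *head, struct node *last)
{
	struct node *p;

	for (p = head; p; p = p->next)
		if (p > last)
			return p;
	return NULL;	/* end of this chain: move to the next bucket */
}
```

resume_after() is the inverse of the `p <= iter->p` test in the patch's rhashtable_walk_next(): everything at or before the remembered position is skipped, and the first node failing that test is the resume point.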




* [PATCH 0/3] rhashtable: replace rhashtable_walk_peek implementation
@ 2018-07-06  7:11 NeilBrown
  2018-07-06  7:11 ` [PATCH 3/3] rhashtable: implement rhashtable_walk_peek() using rhashtable_walk_last_seen() NeilBrown
                   ` (3 more replies)
  0 siblings, 4 replies; 25+ messages in thread
From: NeilBrown @ 2018-07-06  7:11 UTC (permalink / raw)
  To: Thomas Graf, Herbert Xu, Tom Herbert; +Cc: netdev, linux-kernel

This is a resend of these three patches with no change
from last time.
The last patch has Acks from Tom Herbert and Herbert Xu, but the
first two, which are necessary prerequisites, do not yet.

Thanks,
NeilBrown

---

NeilBrown (3):
      rhashtable: further improve stability of rhashtable_walk
      rhashtable: add rhashtable_walk_last_seen()
      rhashtable: implement rhashtable_walk_peek() using rhashtable_walk_last_seen()


 include/linux/rhashtable-types.h |    1 
 include/linux/rhashtable.h       |   40 +++++++++++-
 lib/rhashtable.c                 |  124 ++++++++++++++++++++++----------------
 3 files changed, 110 insertions(+), 55 deletions(-)

--
Signature



* Re: [PATCH 1/3] rhashtable: further improve stability of rhashtable_walk
  2018-07-06  7:11 ` [PATCH 1/3] rhashtable: further improve stability of rhashtable_walk NeilBrown
@ 2018-07-06  8:24   ` kbuild test robot
  2018-07-06  9:50     ` NeilBrown
  2018-07-06  8:59   ` Paolo Abeni
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 25+ messages in thread
From: kbuild test robot @ 2018-07-06  8:24 UTC (permalink / raw)
  To: NeilBrown
  Cc: kbuild-all, Thomas Graf, Herbert Xu, Tom Herbert, netdev, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 4502 bytes --]

Hi NeilBrown,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on net-next/master]
[also build test ERROR on next-20180705]
[cannot apply to linus/master linux-sof-driver/master v4.18-rc3]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/NeilBrown/rhashtable-replace-rhashtable_walk_peek-implementation/20180706-153705
config: i386-randconfig-x078-201826 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   In file included from ipc/util.c:66:0:
   include/linux/rhashtable.h: In function '__rhashtable_insert_fast':
>> include/linux/rhashtable.h:624:6: error: 'headp' undeclared (first use in this function); did you mean 'head'?
         headp = &head->next;
         ^~~~~
         head
   include/linux/rhashtable.h:624:6: note: each undeclared identifier is reported only once for each function it appears in

vim +624 include/linux/rhashtable.h

   570	
   571	/* Internal function, please use rhashtable_insert_fast() instead. This
   572	 * function returns the existing element already in hashes in there is a clash,
   573	 * otherwise it returns an error via ERR_PTR().
   574	 */
   575	static inline void *__rhashtable_insert_fast(
   576		struct rhashtable *ht, const void *key, struct rhash_head *obj,
   577		const struct rhashtable_params params, bool rhlist)
   578	{
   579		struct rhashtable_compare_arg arg = {
   580			.ht = ht,
   581			.key = key,
   582		};
   583		struct rhash_head __rcu **pprev;
   584		struct bucket_table *tbl;
   585		struct rhash_head *head;
   586		spinlock_t *lock;
   587		unsigned int hash;
   588		int elasticity;
   589		void *data;
   590	
   591		rcu_read_lock();
   592	
   593		tbl = rht_dereference_rcu(ht->tbl, ht);
   594		hash = rht_head_hashfn(ht, tbl, obj, params);
   595		lock = rht_bucket_lock(tbl, hash);
   596		spin_lock_bh(lock);
   597	
   598		if (unlikely(rcu_access_pointer(tbl->future_tbl))) {
   599	slow_path:
   600			spin_unlock_bh(lock);
   601			rcu_read_unlock();
   602			return rhashtable_insert_slow(ht, key, obj);
   603		}
   604	
   605		elasticity = RHT_ELASTICITY;
   606		pprev = rht_bucket_insert(ht, tbl, hash);
   607		data = ERR_PTR(-ENOMEM);
   608		if (!pprev)
   609			goto out;
   610	
   611		rht_for_each_continue(head, *pprev, tbl, hash) {
   612			struct rhlist_head *plist;
   613			struct rhlist_head *list;
   614	
   615			elasticity--;
   616			if (!key ||
   617			    (params.obj_cmpfn ?
   618			     params.obj_cmpfn(&arg, rht_obj(ht, head)) :
   619			     rhashtable_compare(&arg, rht_obj(ht, head)))) {
   620				if (rhlist) {
   621					pprev = &head->next;
   622				} else {
   623					if (head < obj)
 > 624						headp = &head->next;
   625				}
   626				continue;
   627			}
   628	
   629			data = rht_obj(ht, head);
   630	
   631			if (!rhlist)
   632				goto out;
   633	
   634	
   635			list = container_of(obj, struct rhlist_head, rhead);
   636			plist = container_of(head, struct rhlist_head, rhead);
   637	
   638			RCU_INIT_POINTER(list->next, plist);
   639			head = rht_dereference_bucket(head->next, tbl, hash);
   640			RCU_INIT_POINTER(list->rhead.next, head);
   641			rcu_assign_pointer(*pprev, obj);
   642	
   643			goto good;
   644		}
   645	
   646		if (elasticity <= 0)
   647			goto slow_path;
   648	
   649		data = ERR_PTR(-E2BIG);
   650		if (unlikely(rht_grow_above_max(ht, tbl)))
   651			goto out;
   652	
   653		if (unlikely(rht_grow_above_100(ht, tbl)))
   654			goto slow_path;
   655	
   656		head = rht_dereference_bucket(*pprev, tbl, hash);
   657	
   658		RCU_INIT_POINTER(obj->next, head);
   659		if (rhlist) {
   660			struct rhlist_head *list;
   661	
   662			list = container_of(obj, struct rhlist_head, rhead);
   663			RCU_INIT_POINTER(list->next, NULL);
   664		}
   665	
   666		rcu_assign_pointer(*pprev, obj);
   667	
   668		atomic_inc(&ht->nelems);
   669		if (rht_grow_above_75(ht, tbl))
   670			schedule_work(&ht->run_work);
   671	
   672	good:
   673		data = NULL;
   674	
   675	out:
   676		spin_unlock_bh(lock);
   677		rcu_read_unlock();
   678	
   679		return data;
   680	}
   681	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 25836 bytes --]


* Re: [PATCH 1/3] rhashtable: further improve stability of rhashtable_walk
  2018-07-06  7:11 ` [PATCH 1/3] rhashtable: further improve stability of rhashtable_walk NeilBrown
  2018-07-06  8:24   ` kbuild test robot
@ 2018-07-06  8:59   ` Paolo Abeni
  2018-07-06  9:55     ` NeilBrown
  2018-07-06  9:25   ` kbuild test robot
  2018-12-05  3:51   ` [PATCH net-next] " NeilBrown
  3 siblings, 1 reply; 25+ messages in thread
From: Paolo Abeni @ 2018-07-06  8:59 UTC (permalink / raw)
  To: NeilBrown, Thomas Graf, Herbert Xu, Tom Herbert; +Cc: netdev, linux-kernel

On Fri, 2018-07-06 at 17:11 +1000, NeilBrown wrote:
> If the sequence:
>    obj = rhashtable_walk_next(iter);
>    rhashtable_walk_stop(iter);
>    rhashtable_remove_fast(ht, &obj->head, params);
>    rhashtable_walk_start(iter);
> 
>  races with another thread inserting or removing
>  an object on the same hash chain, a subsequent
>  rhashtable_walk_next() is not guaranteed to get the "next"
>  object. It is possible that an object could be
>  repeated, or missed.

The above scenario is very similar to the one I'm running:

   rhashtable_walk_next(iter);
   rhashtable_walk_stop(iter);     
   // rhashtable change not yet identified, could be either
   // remove, insert or even rehash
   rhashtable_walk_start(iter);
   rhashtable_walk_next(iter);

but I'm seeing a use-after-free there. I'll try this patch to see if it
solves my issue.

Note: the code under test is a pending new patch I'm holding back due to
the above issue. I can send it as an RFC to share the code if you think
it may help.

> @@ -867,15 +866,39 @@ void *rhashtable_walk_next(struct rhashtable_iter *iter)
>  	bool rhlist = ht->rhlist;
>  
>  	if (p) {
> -		if (!rhlist || !(list = rcu_dereference(list->next))) {
> -			p = rcu_dereference(p->next);
> -			list = container_of(p, struct rhlist_head, rhead);
> -		}
> -		if (!rht_is_a_nulls(p)) {
> -			iter->skip++;
> -			iter->p = p;
> -			iter->list = list;
> -			return rht_obj(ht, rhlist ? &list->rhead : p);
> +		if (!rhlist && iter->p_is_unsafe) {
> +			/*
> +			 * First time next() was called after start().
> +			 * Need to find location of 'p' in the list.
> +			 */
> +			struct rhash_head *p;
> +
> +			iter->skip = 0;
> +			rht_for_each_rcu(p, iter->walker.tbl, iter->slot) {
> +				iter->skip++;
> +				if (p <= iter->p)
> +					continue;

Out of sheer ignorance, I really don't understand the goal of the above
conditional ?!?

Should it possibly be something like:
				if (p != iter->p->next)

instead? 
But I think we can't safely dereference 'p' yet ?!?

I'm sorry for the possibly dumb comments; rhashtable internals are
somewhat obscure to me, but I'm really interested in this topic.

Cheers,

Paolo



* Re: [PATCH 1/3] rhashtable: further improve stability of rhashtable_walk
  2018-07-06  7:11 ` [PATCH 1/3] rhashtable: further improve stability of rhashtable_walk NeilBrown
  2018-07-06  8:24   ` kbuild test robot
  2018-07-06  8:59   ` Paolo Abeni
@ 2018-07-06  9:25   ` kbuild test robot
  2018-12-05  3:51   ` [PATCH net-next] " NeilBrown
  3 siblings, 0 replies; 25+ messages in thread
From: kbuild test robot @ 2018-07-06  9:25 UTC (permalink / raw)
  To: NeilBrown
  Cc: kbuild-all, Thomas Graf, Herbert Xu, Tom Herbert, netdev, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 4475 bytes --]

Hi NeilBrown,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on net-next/master]
[also build test ERROR on next-20180706]
[cannot apply to linus/master linux-sof-driver/master v4.18-rc3]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/NeilBrown/rhashtable-replace-rhashtable_walk_peek-implementation/20180706-153705
config: i386-randconfig-a0-07060846 (attached as .config)
compiler: gcc-4.9 (Debian 4.9.4-2) 4.9.4
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   In file included from net/core/xdp.c:10:0:
   include/linux/rhashtable.h: In function '__rhashtable_insert_fast':
>> include/linux/rhashtable.h:624:6: error: 'headp' undeclared (first use in this function)
         headp = &head->next;
         ^
   include/linux/rhashtable.h:624:6: note: each undeclared identifier is reported only once for each function it appears in

vim +/headp +624 include/linux/rhashtable.h

   570	
   571	/* Internal function, please use rhashtable_insert_fast() instead. This
   572	 * function returns the existing element already in hashes in there is a clash,
   573	 * otherwise it returns an error via ERR_PTR().
   574	 */
   575	static inline void *__rhashtable_insert_fast(
   576		struct rhashtable *ht, const void *key, struct rhash_head *obj,
   577		const struct rhashtable_params params, bool rhlist)
   578	{
   579		struct rhashtable_compare_arg arg = {
   580			.ht = ht,
   581			.key = key,
   582		};
   583		struct rhash_head __rcu **pprev;
   584		struct bucket_table *tbl;
   585		struct rhash_head *head;
   586		spinlock_t *lock;
   587		unsigned int hash;
   588		int elasticity;
   589		void *data;
   590	
   591		rcu_read_lock();
   592	
   593		tbl = rht_dereference_rcu(ht->tbl, ht);
   594		hash = rht_head_hashfn(ht, tbl, obj, params);
   595		lock = rht_bucket_lock(tbl, hash);
   596		spin_lock_bh(lock);
   597	
   598		if (unlikely(rcu_access_pointer(tbl->future_tbl))) {
   599	slow_path:
   600			spin_unlock_bh(lock);
   601			rcu_read_unlock();
   602			return rhashtable_insert_slow(ht, key, obj);
   603		}
   604	
   605		elasticity = RHT_ELASTICITY;
   606		pprev = rht_bucket_insert(ht, tbl, hash);
   607		data = ERR_PTR(-ENOMEM);
   608		if (!pprev)
   609			goto out;
   610	
   611		rht_for_each_continue(head, *pprev, tbl, hash) {
   612			struct rhlist_head *plist;
   613			struct rhlist_head *list;
   614	
   615			elasticity--;
   616			if (!key ||
   617			    (params.obj_cmpfn ?
   618			     params.obj_cmpfn(&arg, rht_obj(ht, head)) :
   619			     rhashtable_compare(&arg, rht_obj(ht, head)))) {
   620				if (rhlist) {
   621					pprev = &head->next;
   622				} else {
   623					if (head < obj)
 > 624						headp = &head->next;
   625				}
   626				continue;
   627			}
   628	
   629			data = rht_obj(ht, head);
   630	
   631			if (!rhlist)
   632				goto out;
   633	
   634	
   635			list = container_of(obj, struct rhlist_head, rhead);
   636			plist = container_of(head, struct rhlist_head, rhead);
   637	
   638			RCU_INIT_POINTER(list->next, plist);
   639			head = rht_dereference_bucket(head->next, tbl, hash);
   640			RCU_INIT_POINTER(list->rhead.next, head);
   641			rcu_assign_pointer(*pprev, obj);
   642	
   643			goto good;
   644		}
   645	
   646		if (elasticity <= 0)
   647			goto slow_path;
   648	
   649		data = ERR_PTR(-E2BIG);
   650		if (unlikely(rht_grow_above_max(ht, tbl)))
   651			goto out;
   652	
   653		if (unlikely(rht_grow_above_100(ht, tbl)))
   654			goto slow_path;
   655	
   656		head = rht_dereference_bucket(*pprev, tbl, hash);
   657	
   658		RCU_INIT_POINTER(obj->next, head);
   659		if (rhlist) {
   660			struct rhlist_head *list;
   661	
   662			list = container_of(obj, struct rhlist_head, rhead);
   663			RCU_INIT_POINTER(list->next, NULL);
   664		}
   665	
   666		rcu_assign_pointer(*pprev, obj);
   667	
   668		atomic_inc(&ht->nelems);
   669		if (rht_grow_above_75(ht, tbl))
   670			schedule_work(&ht->run_work);
   671	
   672	good:
   673		data = NULL;
   674	
   675	out:
   676		spin_unlock_bh(lock);
   677		rcu_read_unlock();
   678	
   679		return data;
   680	}
   681	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 27167 bytes --]


* Re: [PATCH 1/3] rhashtable: further improve stability of rhashtable_walk
  2018-07-06  8:24   ` kbuild test robot
@ 2018-07-06  9:50     ` NeilBrown
  0 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2018-07-06  9:50 UTC (permalink / raw)
  To: kbuild test robot
  Cc: kbuild-all, Thomas Graf, Herbert Xu, Tom Herbert, netdev, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 569 bytes --]

On Fri, Jul 06 2018, kbuild test robot wrote:

> Hi NeilBrown,
>
> Thank you for the patch! Yet something to improve:
>
> [auto build test ERROR on net-next/master]
> [also build test ERROR on next-20180705]
> [cannot apply to linus/master linux-sof-driver/master v4.18-rc3]
> [if your patch is applied to the wrong git tree, please drop us a note
> to help improve the system]

Patch is against net-next, plus "rhashtable: detect when object movement
might have invalidated a lookup" which was posted earlier (this
dependency wasn't made explicit).

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

* Re: [PATCH 1/3] rhashtable: further improve stability of rhashtable_walk
  2018-07-06  8:59   ` Paolo Abeni
@ 2018-07-06  9:55     ` NeilBrown
  2018-07-06 10:12       ` Paolo Abeni
  0 siblings, 1 reply; 25+ messages in thread
From: NeilBrown @ 2018-07-06  9:55 UTC (permalink / raw)
  To: Paolo Abeni, Thomas Graf, Herbert Xu, Tom Herbert; +Cc: netdev, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3125 bytes --]

On Fri, Jul 06 2018, Paolo Abeni wrote:

> On Fri, 2018-07-06 at 17:11 +1000, NeilBrown wrote:
>> If the sequence:
>>    obj = rhashtable_walk_next(iter);
>>    rhashtable_walk_stop(iter);
>>    rhashtable_remove_fast(ht, &obj->head, params);
>>    rhashtable_walk_start(iter);
>> 
>>  races with another thread inserting or removing
>>  an object on the same hash chain, a subsequent
>>  rhashtable_walk_next() is not guaranteed to get the "next"
>>  object. It is possible that an object could be
>>  repeated, or missed.
>
> The above scenario is very similar to the one I'm running:
>
>    rhashtable_walk_next(iter);
>    rhashtable_walk_stop(iter);     
>    // rhashtable change not yet identified, could be either
>    // remove, insert or even rehash
>    rhashtable_walk_start(iter);
>    rhashtable_walk_next(iter);
>
> but I'm seeing use-after-free there. I'll try this patch to see if
> solves my issue.
>
> Note: the code under test is a pending new patch I'm holding due to the
> above issue, I can send it as RFC to share the code if you think it may
> help.

I'd suggest posting it.  I may not get a chance to look at it, but if you
don't post it, then I definitely won't :-)

>
>> @@ -867,15 +866,39 @@ void *rhashtable_walk_next(struct rhashtable_iter *iter)
>>  	bool rhlist = ht->rhlist;
>>  
>>  	if (p) {
>> -		if (!rhlist || !(list = rcu_dereference(list->next))) {
>> -			p = rcu_dereference(p->next);
>> -			list = container_of(p, struct rhlist_head, rhead);
>> -		}
>> -		if (!rht_is_a_nulls(p)) {
>> -			iter->skip++;
>> -			iter->p = p;
>> -			iter->list = list;
>> -			return rht_obj(ht, rhlist ? &list->rhead : p);
>> +		if (!rhlist && iter->p_is_unsafe) {
>> +			/*
>> +			 * First time next() was called after start().
>> +			 * Need to find location of 'p' in the list.
>> +			 */
>> +			struct rhash_head *p;
>> +
>> +			iter->skip = 0;
>> +			rht_for_each_rcu(p, iter->walker.tbl, iter->slot) {
>> +				iter->skip++;
>> +				if (p <= iter->p)
>> +					continue;
>
> Out of sheer ignorance, I really don't understand the goal of the above
> conditional ?!?

I hoped the patch description would cover that:
     With this patch:
     - a new object is always inserted after the last object with a
       smaller address, or at the start.  This preserves the property,
       important when allowing objects to be removed and re-added, that
       an object is never inserted *after* a position that it previously
       held in the list.

The items in each table slot are stored in order of the address of the
item.  So to find the first item in a slot that was not before the
previously returned item (iter->p), we step forward while this item is
<= that one. 
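The two rules together can be sketched with a plain singly linked list (illustrative names only - this is not the kernel's RCU-protected code, just the ordering idea):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal chain node, standing in for struct rhash_head. */
struct node {
	struct node *next;
};

/* Insert obj after the last node with a smaller address (or at the
 * head), keeping the chain sorted by node address. */
static void insert_sorted(struct node **pprev, struct node *obj)
{
	while (*pprev && *pprev < obj)
		pprev = &(*pprev)->next;
	obj->next = *pprev;
	*pprev = obj;
}

/* Resume a walk when 'last' may have been removed: return the first
 * node whose address is greater than last.  Because the chain is
 * address-sorted, this is the correct "next" object even if last is
 * no longer in the chain. */
static struct node *resume_after(struct node *head, struct node *last)
{
	struct node *p;

	for (p = head; p; p = p->next)
		if (p > last)
			return p;
	return NULL;	/* end of this chain; move to the next slot */
}
```

Because insert_sorted() never places a new node before an address-smaller one, resume_after() can never return an object at or before the previously returned position, even if that object was removed while the walk was stopped.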

Does that help at all?

NeilBrown


>
> Should it possibly be something like:
> 				if (p != iter->p->next)
>
> instead? 
> But I think we can't safely dereference 'p' yet ?!?
>
> I'm sorry for the possibly dumb comments, rhashtable internals are
> somewhat obscure to me, but I'm really interested in this topic.
>
> Cheers,
>
> Paolo

* Re: [PATCH 1/3] rhashtable: further improve stability of rhashtable_walk
  2018-07-06  9:55     ` NeilBrown
@ 2018-07-06 10:12       ` Paolo Abeni
  0 siblings, 0 replies; 25+ messages in thread
From: Paolo Abeni @ 2018-07-06 10:12 UTC (permalink / raw)
  To: NeilBrown, Thomas Graf, Herbert Xu, Tom Herbert; +Cc: netdev, linux-kernel

On Fri, 2018-07-06 at 19:55 +1000, NeilBrown wrote:
> On Fri, Jul 06 2018, Paolo Abeni wrote:
> 
> > Note: the code under test is a pending new patch I'm holding due to the
> > above issue, I can send it as RFC to share the code if you think it may
> > help.
> 
> I'd suggest posting it.  I may not get a chance to look at it, but if you
> don't post it, then I definitely won't :-)

Oks, thanks, I just spammed the list (and you ;)

> > > @@ -867,15 +866,39 @@ void *rhashtable_walk_next(struct rhashtable_iter *iter)
> > >  	bool rhlist = ht->rhlist;
> > >  
> > >  	if (p) {
> > > -		if (!rhlist || !(list = rcu_dereference(list->next))) {
> > > -			p = rcu_dereference(p->next);
> > > -			list = container_of(p, struct rhlist_head, rhead);
> > > -		}
> > > -		if (!rht_is_a_nulls(p)) {
> > > -			iter->skip++;
> > > -			iter->p = p;
> > > -			iter->list = list;
> > > -			return rht_obj(ht, rhlist ? &list->rhead : p);
> > > +		if (!rhlist && iter->p_is_unsafe) {
> > > +			/*
> > > +			 * First time next() was called after start().
> > > +			 * Need to find location of 'p' in the list.
> > > +			 */
> > > +			struct rhash_head *p;
> > > +
> > > +			iter->skip = 0;
> > > +			rht_for_each_rcu(p, iter->walker.tbl, iter->slot) {
> > > +				iter->skip++;
> > > +				if (p <= iter->p)
> > > +					continue;
> > 
> > Out of sheer ignorance, I really don't understand the goal of the above
> > conditional ?!?
> 
> I hoped the patch description would cover that:
>      With this patch:
>      - a new object is always inserted after the last object with a
>        smaller address, or at the start.  This preserves the property,
>        important when allowing objects to be removed and re-added, that
>        an object is never inserted *after* a position that it previously
>        held in the list.
> 
> The items in each table slot are stored in order of the address of the
> item.  So to find the first item in a slot that was not before the
> previously returned item (iter->p), we step forward while this item is
> <= that one. 
> 
> Does that help at all?

Yes, it's very clear.  Earlier I had, rather dumbly, skipped over some parts of the patch.

Thanks,

Paolo

* Re: [PATCH 0/3] rhashtable: replace rhashtable_walk_peek implementation
  2018-07-06  7:11 [PATCH 0/3] rhashtable: replace rhashtable_walk_peek implementation NeilBrown
                   ` (2 preceding siblings ...)
  2018-07-06  7:11 ` [PATCH 2/3] rhashtable: add rhashtable_walk_last_seen() NeilBrown
@ 2018-07-10 23:55 ` David Miller
  3 siblings, 0 replies; 25+ messages in thread
From: David Miller @ 2018-07-10 23:55 UTC (permalink / raw)
  To: neilb; +Cc: tgraf, herbert, tom, netdev, linux-kernel

From: NeilBrown <neilb@suse.com>
Date: Fri, 06 Jul 2018 17:11:32 +1000

> This is a resend of these three patches with no change
> from last time.
> The last patch has an Ack from Tom Herbert and Herbert Xu, but the
> first two which are necessary pre-requisites don't yet.

Please repost this when your dependencies are actually in the
tree, not beforehand.

Otherwise you make more work for me, and you completely destroy
the value of the kbuild test robot.

Thank you.

* Re: [PATCH 2/3] rhashtable: add rhashtable_walk_last_seen()
  2018-07-06  7:11 ` [PATCH 2/3] rhashtable: add rhashtable_walk_last_seen() NeilBrown
@ 2018-07-10 23:55   ` David Miller
  2018-07-15 23:58     ` NeilBrown
  0 siblings, 1 reply; 25+ messages in thread
From: David Miller @ 2018-07-10 23:55 UTC (permalink / raw)
  To: neilb; +Cc: tgraf, herbert, tom, netdev, linux-kernel

From: NeilBrown <neilb@suse.com>
Date: Fri, 06 Jul 2018 17:11:32 +1000

> rhashtable_walk_last_seen() returns the object returned by
> the previous rhashtable_walk_next(), providing it is still in the
> table (or was during this grace period).
> This works even if rhashtable_walk_stop() and rhashtable_talk_start()
                                                           ^^^^

"walk"

* Re: [PATCH 2/3] rhashtable: add rhashtable_walk_last_seen()
  2018-07-10 23:55   ` David Miller
@ 2018-07-15 23:58     ` NeilBrown
  0 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2018-07-15 23:58 UTC (permalink / raw)
  To: David Miller; +Cc: tgraf, herbert, tom, netdev, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 504 bytes --]

On Tue, Jul 10 2018, David Miller wrote:

> From: NeilBrown <neilb@suse.com>
> Date: Fri, 06 Jul 2018 17:11:32 +1000
>
>> rhashtable_walk_last_seen() returns the object returned by
>> the previous rhashtable_walk_next(), providing it is still in the
>> table (or was during this grace period).
>> This works even if rhashtable_walk_stop() and rhashtable_talk_start()
>                                                            ^^^^
>
> "walk"

Thanks :-)  Fixed for when I next post.

Thanks,
NeilBrown

* [PATCH net-next] rhashtable: further improve stability of rhashtable_walk
  2018-07-06  7:11 ` [PATCH 1/3] rhashtable: further improve stability of rhashtable_walk NeilBrown
                     ` (2 preceding siblings ...)
  2018-07-06  9:25   ` kbuild test robot
@ 2018-12-05  3:51   ` " NeilBrown
  2018-12-07  5:39     ` Herbert Xu
  3 siblings, 1 reply; 25+ messages in thread
From: NeilBrown @ 2018-12-05  3:51 UTC (permalink / raw)
  To: Thomas Graf, Herbert Xu, Tom Herbert, David Miller; +Cc: netdev, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 8469 bytes --]


If the sequence:
   obj = rhashtable_walk_next(iter);
   rhashtable_walk_stop(iter);
   rhashtable_remove_fast(ht, &obj->head, params);
   rhashtable_walk_start(iter);

 races with another thread inserting or removing
 an object on the same hash chain, a subsequent
 rhashtable_walk_next() is not guaranteed to get the "next"
 object. It is possible that an object could be
 repeated, or missed.

 This can be made more reliable by keeping the objects in a hash chain
 sorted by memory address.  A subsequent rhashtable_walk_next()
 call can reliably find the correct position in the list, and thus
 find the 'next' object.

 It is not possible to take this approach with an rhltable as keeping
 the hash chain in order is not so easy.  When the first object with a
 given key is removed, it is replaced in the chain with the next
 object with the same key, and the address of that object may not be
 correctly ordered.
 I have not yet found any way to achieve the same stability
 with rhltables, that doesn't have a major impact on lookup
 or insert.  No code currently in Linux would benefit from
 such extra stability.

 With this patch:
 - a new object is always inserted after the last object with a
   smaller address, or at the start.
 - when rhashtable_walk_start() is called, it records that 'p' is not
   'safe', meaning that it cannot be dereferenced.  The revalidation
   that was previously done here is moved to rhashtable_walk_next()
 - when rhashtable_walk_next() is called while p is not NULL and not
   safe, it walks the chain looking for the first object with an
   address greater than p and returns that.  If there is none, it moves
   to the next hash chain.

Signed-off-by: NeilBrown <neilb@suse.com>
---

This is a resend of a patch that I sent back in July.  I couldn't get
it applied then because it depended on another rhashtable patch which
hadn't landed yet - it now has.

NeilBrown

 include/linux/rhashtable-types.h |  1 +
 include/linux/rhashtable.h       | 10 ++++-
 lib/rhashtable.c                 | 82 ++++++++++++++++++++++++++--------------
 3 files changed, 62 insertions(+), 31 deletions(-)

diff --git a/include/linux/rhashtable-types.h b/include/linux/rhashtable-types.h
index 763d613ce2c2..bc3e84547ba7 100644
--- a/include/linux/rhashtable-types.h
+++ b/include/linux/rhashtable-types.h
@@ -126,6 +126,7 @@ struct rhashtable_iter {
 	struct rhashtable_walker walker;
 	unsigned int slot;
 	unsigned int skip;
+	bool p_is_unsafe;
 	bool end_of_table;
 };
 
diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 20f9c6af7473..4a8056b66bfb 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -635,7 +635,12 @@ static inline void *__rhashtable_insert_fast(
 		    (params.obj_cmpfn ?
 		     params.obj_cmpfn(&arg, rht_obj(ht, head)) :
 		     rhashtable_compare(&arg, rht_obj(ht, head)))) {
-			pprev = &head->next;
+			if (rhlist) {
+				pprev = &head->next;
+			} else {
+				if (head < obj)
+					pprev = &head->next;
+			}
 			continue;
 		}
 
@@ -1131,7 +1136,8 @@ static inline int rhashtable_walk_init(struct rhashtable *ht,
  * Note that if you restart a walk after rhashtable_walk_stop you
  * may see the same object twice.  Also, you may miss objects if
  * there are removals in between rhashtable_walk_stop and the next
- * call to rhashtable_walk_start.
+ * call to rhashtable_walk_start.  Note that this is different to
+ * rhashtable_walk_enter() which never misses objects.
  *
  * For a completely stable walk you should construct your own data
  * structure outside the hash table.
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 852ffa5160f1..d4154b9a29a1 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -225,6 +225,7 @@ static int rhashtable_rehash_one(struct rhashtable *ht, unsigned int old_hash)
 	struct bucket_table *old_tbl = rht_dereference(ht->tbl, ht);
 	struct bucket_table *new_tbl = rhashtable_last_table(ht, old_tbl);
 	struct rhash_head __rcu **pprev = rht_bucket_var(old_tbl, old_hash);
+	struct rhash_head __rcu **inspos;
 	int err = -EAGAIN;
 	struct rhash_head *head, *next, *entry;
 	spinlock_t *new_bucket_lock;
@@ -253,12 +254,15 @@ static int rhashtable_rehash_one(struct rhashtable *ht, unsigned int old_hash)
 	new_bucket_lock = rht_bucket_lock(new_tbl, new_hash);
 
 	spin_lock_nested(new_bucket_lock, SINGLE_DEPTH_NESTING);
-	head = rht_dereference_bucket(new_tbl->buckets[new_hash],
-				      new_tbl, new_hash);
-
+	inspos = &new_tbl->buckets[new_hash];
+	head = rht_dereference_bucket(*inspos, new_tbl, new_hash);
+	while (!rht_is_a_nulls(head) && head < entry) {
+		inspos = &head->next;
+		head = rht_dereference_bucket(*inspos, new_tbl, new_hash);
+	}
 	RCU_INIT_POINTER(entry->next, head);
 
-	rcu_assign_pointer(new_tbl->buckets[new_hash], entry);
+	rcu_assign_pointer(*inspos, entry);
 	spin_unlock(new_bucket_lock);
 
 	rcu_assign_pointer(*pprev, next);
@@ -554,6 +558,10 @@ static struct bucket_table *rhashtable_insert_one(struct rhashtable *ht,
 		return ERR_PTR(-ENOMEM);
 
 	head = rht_dereference_bucket(*pprev, tbl, hash);
+	while (!ht->rhlist && !rht_is_a_nulls(head) && head < obj) {
+		pprev = &head->next;
+		head = rht_dereference_bucket(*pprev, tbl, hash);
+	}
 
 	RCU_INIT_POINTER(obj->next, head);
 	if (ht->rhlist) {
@@ -648,10 +656,10 @@ EXPORT_SYMBOL_GPL(rhashtable_insert_slow);
  *
  * This function prepares a hash table walk.
  *
- * Note that if you restart a walk after rhashtable_walk_stop you
- * may see the same object twice.  Also, you may miss objects if
- * there are removals in between rhashtable_walk_stop and the next
- * call to rhashtable_walk_start.
+ * A walk is guaranteed to return every object that was in
+ * the table before this call, and is still in the table when
+ * rhashtable_walk_next() returns NULL.  Duplicates can be
+ * seen, but only if there is a rehash event during the walk.
  *
  * For a completely stable walk you should construct your own data
  * structure outside the hash table.
@@ -735,19 +743,10 @@ int rhashtable_walk_start_check(struct rhashtable_iter *iter)
 
 	if (iter->p && !rhlist) {
 		/*
-		 * We need to validate that 'p' is still in the table, and
-		 * if so, update 'skip'
+		 * 'p' will be revalidated when rhashtable_walk_next()
+		 * is called.
 		 */
-		struct rhash_head *p;
-		int skip = 0;
-		rht_for_each_rcu(p, iter->walker.tbl, iter->slot) {
-			skip++;
-			if (p == iter->p) {
-				iter->skip = skip;
-				goto found;
-			}
-		}
-		iter->p = NULL;
+		iter->p_is_unsafe = true;
 	} else if (iter->p && rhlist) {
 		/* Need to validate that 'list' is still in the table, and
 		 * if so, update 'skip' and 'p'.
@@ -864,15 +863,39 @@ void *rhashtable_walk_next(struct rhashtable_iter *iter)
 	bool rhlist = ht->rhlist;
 
 	if (p) {
-		if (!rhlist || !(list = rcu_dereference(list->next))) {
-			p = rcu_dereference(p->next);
-			list = container_of(p, struct rhlist_head, rhead);
-		}
-		if (!rht_is_a_nulls(p)) {
-			iter->skip++;
-			iter->p = p;
-			iter->list = list;
-			return rht_obj(ht, rhlist ? &list->rhead : p);
+		if (!rhlist && iter->p_is_unsafe) {
+			/*
+			 * First time next() was called after start().
+			 * Need to find location of 'p' in the list.
+			 */
+			struct rhash_head *p;
+
+			iter->skip = 0;
+			rht_for_each_rcu(p, iter->walker.tbl, iter->slot) {
+				iter->skip++;
+				if (p <= iter->p)
+					continue;
+
+				/* p is the next object after iter->p */
+				iter->p = p;
+				iter->p_is_unsafe = false;
+				return rht_obj(ht, p);
+			}
+			/* There is no "next" object in the list, move
+			 * to next hash chain.
+			 */
+		} else {
+			if (!rhlist || !(list = rcu_dereference(list->next))) {
+				p = rcu_dereference(p->next);
+				list = container_of(p, struct rhlist_head,
+						    rhead);
+			}
+			if (!rht_is_a_nulls(p)) {
+				iter->skip++;
+				iter->p = p;
+				iter->list = list;
+				return rht_obj(ht, rhlist ? &list->rhead : p);
+			}
 		}
 
 		/* At the end of this slot, switch to next one and then find
@@ -882,6 +905,7 @@ void *rhashtable_walk_next(struct rhashtable_iter *iter)
 		iter->slot++;
 	}
 
+	iter->p_is_unsafe = false;
 	return __rhashtable_walk_find_next(iter);
 }
 EXPORT_SYMBOL_GPL(rhashtable_walk_next);
-- 
2.14.0.rc0.dirty


* Re: [PATCH net-next] rhashtable: further improve stability of rhashtable_walk
  2018-12-05  3:51   ` [PATCH net-next] " NeilBrown
@ 2018-12-07  5:39     ` Herbert Xu
  2018-12-09 22:50       ` NeilBrown
  0 siblings, 1 reply; 25+ messages in thread
From: Herbert Xu @ 2018-12-07  5:39 UTC (permalink / raw)
  To: NeilBrown; +Cc: Thomas Graf, Tom Herbert, David Miller, netdev, linux-kernel

On Wed, Dec 05, 2018 at 02:51:02PM +1100, NeilBrown wrote:
> 
> If the sequence:
>    obj = rhashtable_walk_next(iter);
>    rhashtable_walk_stop(iter);
>    rhashtable_remove_fast(ht, &obj->head, params);
>    rhashtable_walk_start(iter);
> 
>  races with another thread inserting or removing
>  an object on the same hash chain, a subsequent
>  rhashtable_walk_next() is not guaranteed to get the "next"
>  object. It is possible that an object could be
>  repeated, or missed.
> 
>  This can be made more reliable by keeping the objects in a hash chain
>  sorted by memory address.  A subsequent rhashtable_walk_next()
>  call can reliably find the correct position in the list, and thus
>  find the 'next' object.
> 
>  It is not possible to take this approach with an rhltable as keeping
>  the hash chain in order is not so easy.  When the first object with a
>  given key is removed, it is replaced in the chain with the next
>  object with the same key, and the address of that object may not be
>  correctly ordered.
>  I have not yet found any way to achieve the same stability
>  with rhltables, that doesn't have a major impact on lookup
>  or insert.  No code currently in Linux would benefit from
>  such extra stability.
> 
>  With this patch:
>  - a new object is always inserted after the last object with a
>    smaller address, or at the start.
>  - when rhashtable_walk_start() is called, it records that 'p' is not
>    'safe', meaning that it cannot be dereferenced.  The revalidation
>    that was previously done here is moved to rhashtable_walk_next()
>  - when rhashtable_walk_next() is called while p is not NULL and not
>    safe, it walks the chain looking for the first object with an
>    address greater than p and returns that.  If there is none, it moves
>    to the next hash chain.
> 
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
> 
> This is a resend of a patch that I sent back in July.  I couldn't get
> it applied then because it depended on another rhashtable patch which
> hadn't landed yet - it now has.

I thought we had agreed to drop this because nobody needs it
currently and it doesn't handle rhlist?

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

* Re: [PATCH net-next] rhashtable: further improve stability of rhashtable_walk
  2018-12-07  5:39     ` Herbert Xu
@ 2018-12-09 22:50       ` NeilBrown
  2018-12-11  5:17         ` Herbert Xu
  0 siblings, 1 reply; 25+ messages in thread
From: NeilBrown @ 2018-12-09 22:50 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Thomas Graf, Tom Herbert, David Miller, netdev, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3203 bytes --]

On Fri, Dec 07 2018, Herbert Xu wrote:

> On Wed, Dec 05, 2018 at 02:51:02PM +1100, NeilBrown wrote:
>> 
>> If the sequence:
>>    obj = rhashtable_walk_next(iter);
>>    rhashtable_walk_stop(iter);
>>    rhashtable_remove_fast(ht, &obj->head, params);
>>    rhashtable_walk_start(iter);
>> 
>>  races with another thread inserting or removing
>>  an object on the same hash chain, a subsequent
>>  rhashtable_walk_next() is not guaranteed to get the "next"
>>  object. It is possible that an object could be
>>  repeated, or missed.
>> 
>>  This can be made more reliable by keeping the objects in a hash chain
>>  sorted by memory address.  A subsequent rhashtable_walk_next()
>>  call can reliably find the correct position in the list, and thus
>>  find the 'next' object.
>> 
>>  It is not possible to take this approach with an rhltable as keeping
>>  the hash chain in order is not so easy.  When the first object with a
>>  given key is removed, it is replaced in the chain with the next
>>  object with the same key, and the address of that object may not be
>>  correctly ordered.
>>  I have not yet found any way to achieve the same stability
>>  with rhltables, that doesn't have a major impact on lookup
>>  or insert.  No code currently in Linux would benefit from
>>  such extra stability.
>> 
>>  With this patch:
>>  - a new object is always inserted after the last object with a
>>    smaller address, or at the start.
>>  - when rhashtable_walk_start() is called, it records that 'p' is not
>>    'safe', meaning that it cannot be dereferenced.  The revalidation
>>    that was previously done here is moved to rhashtable_walk_next()
>>  - when rhashtable_walk_next() is called while p is not NULL and not
>>    safe, it walks the chain looking for the first object with an
>>    address greater than p and returns that.  If there is none, it moves
>>    to the next hash chain.
>> 
>> Signed-off-by: NeilBrown <neilb@suse.com>
>> ---
>> 
>> This is a resend of a patch that I sent back in July.  I couldn't get
>> it applied then because it depended on another rhashtable patch which
>> hadn't landed yet - it now has.
>
> I thought we had agreed to drop this because nobody needs it
> currently and it doesn't handle rhlist?

Hi Herbert,
 I think it was agreed that I would not pursue features that were only
 of use to out-of-tree code, but I don't think that applies here.  This
 is not a feature, this is a quality-of-implementation improvement.
 There are users in the kernel today which use
   rhashtable_walk_stop()/rhashtable_walk_start()
 to drop out of RCU protection for periods during the walk.
 Any such user might miss seeing an object that has been in the table
 for a while - sure that is less than optimal, and should be fixed if
 the cost is small.

 There are currently no rhlist users which use stop/start to drop out
 of RCU, so there is no clear value in fixing that case, or cost in not
 fixing it.

Thanks,
NeilBrown

>
> Cheers,
> -- 
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

* Re: [PATCH net-next] rhashtable: further improve stability of rhashtable_walk
  2018-12-09 22:50       ` NeilBrown
@ 2018-12-11  5:17         ` Herbert Xu
  2018-12-12  0:02           ` NeilBrown
  0 siblings, 1 reply; 25+ messages in thread
From: Herbert Xu @ 2018-12-11  5:17 UTC (permalink / raw)
  To: NeilBrown; +Cc: Thomas Graf, Tom Herbert, David Miller, netdev, linux-kernel

Hi Neil:

On Mon, Dec 10, 2018 at 09:50:43AM +1100, NeilBrown wrote:
>  I think it was agreed that I would not pursue features that were only
>  of use to out-of-tree code, but I don't think that applies here.  This
>  is not a feature, this is a quality-of-implementation improvement.
>  There are users in the kernel today which use
>    rhashtable_walk_stop()/rhashtable_walk_start()
>  to drop out of RCU protection for periods during the walk.
>  Any such user might miss seeing an object that has been in the table
>  for a while - sure that is less than optimal, and should be fixed if
>  the cost is small.
> 
>  There are currently no rhlist users which use stop/start to drop out
>  of RCU, so there is no clear value in fixing that case, or cost in not
>  fixing it.

I think the complexities of this patch outweighs any benefit.

Walking an rhashtable is inherently unreliable, due to rehashing.
If you need an 100% reliable walking mechanism, then add your own
linked list alongside the hash table like xfrm does.

Having said that, if the current code is missing items that existed
prior to the start of the walk (in the absence of a rehash) then
that would be a bug that we should fix.  Anything else like
duplicates just needs to be accepted by the caller.

So please explain how an object can be missed, then we can work on
fixing that and that only.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

* Re: [PATCH net-next] rhashtable: further improve stability of rhashtable_walk
  2018-12-11  5:17         ` Herbert Xu
@ 2018-12-12  0:02           ` NeilBrown
  2018-12-12  5:46             ` Herbert Xu
  0 siblings, 1 reply; 25+ messages in thread
From: NeilBrown @ 2018-12-12  0:02 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Thomas Graf, Tom Herbert, David Miller, netdev, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 4007 bytes --]

On Tue, Dec 11 2018, Herbert Xu wrote:

> Hi Neil:
>
> On Mon, Dec 10, 2018 at 09:50:43AM +1100, NeilBrown wrote:
>>  I think it was agreed that I would not pursue features that were only
>>  of use to out-of-tree code, but I don't think that applies here.  This
>>  is not a feature, this is a quality-of-implementation improvement.
>>  There are users in the kernel today which use
>>    rhashtable_walk_stop()/rhashtable_walk_start()
>>  to drop out of RCU protection for periods during the walk.
>>  Any such user might miss seeing an object that has been in the table
>>  for a while - sure that is less than optimal, and should be fixed if
>>  the cost is small.
>> 
>>  There are currently no rhlist users which use stop/start to drop out
>>  of RCU, so there is no clear value in fixing that case, or cost in not
>>  fixing it.
>
> I think the complexities of this patch outweighs any benefit.
>
> Walking an rhashtable is inherently unreliable, due to rehashing.
> If you need an 100% reliable walking mechanism, then add your own
> linked list alongside the hash table like xfrm does.
>
> Having said that, if the current code is missing items that existed
> prior to the start of the walk (in the absence of a rehash) then
> that would be a bug that we should fix.  Anything else like
> duplicates just needs to be accepted by the caller.
>
> So please explain how can an object be missed then we can work on
> fixing that and that only.

The commit comment gives the context:

  If the sequence:
     obj = rhashtable_walk_next(iter);
     rhashtable_walk_stop(iter);
     rhashtable_remove_fast(ht, &obj->head, params);
     rhashtable_walk_start(iter);

   races with another thread inserting or removing
   an object on the same hash chain, a subsequent
   rhashtable_walk_next() is not guaranteed to get the "next"
   object. It is possible that an object could be
   repeated, or missed.

The rhashtable_walk_start() at the end of this sequence will find that
iter->p is not null (it will be precisely &obj->head) and will look for
it in the chain, but will not find it (because of the remove_fast).  So
it sets iter->p to NULL.  This causes rhashtable_walk_next() to fall
through to __rhashtable_walk_find_next() which looks for the entry
number item->skip in the chain for item->slot.
->skip will be the index for the entry just beyond obj->head.  Since
that has been removed, it is actually the index for the object one
further on.  So if obj->head was not the last entry in the chain, the
object after it will be missed.
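A toy model of the old skip-index resume shows the miss (plain pointers and made-up names, not the kernel code):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal chain node, standing in for struct rhash_head. */
struct node {
	struct node *next;
};

/* Model of the resume logic being replaced: refetch the walk position
 * as "entry number skip in the chain", counting from zero. */
static struct node *entry_at(struct node *head, int skip)
{
	while (head && skip-- > 0)
		head = head->next;
	return head;
}
```

With the chain a->b->c the iterator returns b and records skip == 2; removing b leaves a->c, and entry number 2 now runs off the end of the chain, so the walker moves to the next slot and c is silently skipped.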

The code in tipc_nl_sk_walk() is fairly similar to this pattern in that
the sock_put() call could potentially result in a call to
rhashtable_remove_fast().
It avoids the problem by ensuring that the rhashtable_remove_fast() is
*after* the rhashtable_walk_start().

If the rhashtable_remove_fast() happened from some other thread due to a
race, the walk could still miss an object.
Currently, the only way to protect against this is to hold a reference
to an object across rhashtable_walk_stop()/rhashtable_walk_start().
Sometimes that is easy to do, other times - not so much.

So I think this is a real bug - it is quite unlikely to hit, but
possible.
You need a chain with at least 2 objects, you need
rhashtable_walk_stop() to be called after visiting an object other than
the last object, and you need some thread (this or some other) to remove
that object from the table.

The patch that I posted aims to fix that bug, and only that bug.
The only alternative that I can think of is to document that this can
happen and advise that a reference should be held to the last visited
object when stop/start is called, or in some other way ensure that it
doesn't get removed.

Thanks,
NeilBrown


>
> Thanks,
> -- 
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

* Re: [PATCH net-next] rhashtable: further improve stability of rhashtable_walk
  2018-12-12  0:02           ` NeilBrown
@ 2018-12-12  5:46             ` Herbert Xu
  2018-12-12  6:41               ` NeilBrown
  0 siblings, 1 reply; 25+ messages in thread
From: Herbert Xu @ 2018-12-12  5:46 UTC (permalink / raw)
  To: NeilBrown; +Cc: Thomas Graf, Tom Herbert, David Miller, netdev, linux-kernel

On Wed, Dec 12, 2018 at 11:02:35AM +1100, NeilBrown wrote:
> 
> So I think this is a real bug - it is quite unlikely to hit, but
> possibly.
> You need a chain with at least 2 objects, you need
> rhashtable_walk_stop() to be called after visiting an object other than
> the last object, and you need some thread (this or some other) to remove
> that object from the table.
> 
> The patch that I posted aims to fix that bug, and only that bug.
> The only alternative that I can think of is to document that this can
> happen and advise that a reference should be held to the last visited
> object when stop/start is called, or in some other way ensure that it
> doesn't get removed.

Thanks for reminding me of the issue you were trying to fix.

So going back into the email archives, I suggested at the very
start that we could just insert the walker objects into the actual
hash table.  That would solve the issue for both rhashtable and
rhlist.

Could we do that rather than using this ordered list that only
works for rhashtable?

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: [PATCH net-next] rhashtable: further improve stability of rhashtable_walk
  2018-12-12  5:46             ` Herbert Xu
@ 2018-12-12  6:41               ` NeilBrown
  2018-12-12  8:00                 ` Herbert Xu
  0 siblings, 1 reply; 25+ messages in thread
From: NeilBrown @ 2018-12-12  6:41 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Thomas Graf, Tom Herbert, David Miller, netdev, linux-kernel


On Wed, Dec 12 2018, Herbert Xu wrote:

> On Wed, Dec 12, 2018 at 11:02:35AM +1100, NeilBrown wrote:
>> 
>> So I think this is a real bug - it is quite unlikely to hit, but
>> possibly.
>> You need a chain with at least 2 objects, you need
>> rhashtable_walk_stop() to be called after visiting an object other than
>> the last object, and you need some thread (this or some other) to remove
>> that object from the table.
>> 
>> The patch that I posted aims to fix that bug, and only that bug.
>> The only alternative that I can think of is to document that this can
>> happen and advise that a reference should be held to the last visited
>> object when stop/start is called, or in some other way ensure that it
>> doesn't get removed.
>
> Thanks for reminding me of the issue you were trying to fix.
>
> So going back into the email archives, I suggested at the very
> start that we could just insert the walker objects into the actual
> hash table.  That would solve the issue for both rhashtable and
> rhlist.
>
> Could we do that rather than using this ordered list that only
> works for rhashtable?

No, that doesn't work.
When you remove the walker object from the hash chain, you need to wait
for the RCU grace period to expire before you can safely insert it back
into the chain.  Inserting into a different chain isn't quite so bad now
that the nulls-marker stuff is working: a lookup thread will notice the
move and retry the lookup.

So you would substantially slow down the rhashtable_walk_start() step.
I've tried to think of various ways to overcome this problem, such as
walking backwards through each chain - it is fairly safe to move an
object earlier in the chain - but all the approaches I have tried make
the code much more complex.

Thanks,
NeilBrown


>
> Cheers,
> -- 
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: [PATCH net-next] rhashtable: further improve stability of rhashtable_walk
  2018-12-12  6:41               ` NeilBrown
@ 2018-12-12  8:00                 ` Herbert Xu
  2018-12-12  8:49                   ` NeilBrown
  0 siblings, 1 reply; 25+ messages in thread
From: Herbert Xu @ 2018-12-12  8:00 UTC (permalink / raw)
  To: NeilBrown; +Cc: Thomas Graf, Tom Herbert, David Miller, netdev, linux-kernel

On Wed, Dec 12, 2018 at 05:41:29PM +1100, NeilBrown wrote:
>
> So you would substantially slow down the rhashtable_walk_start() step.

This whole thing is meant for uses such as /proc and netlink
enumeration.  Speed is definitely not a priority for these
iterators.

For that matter, if speed was an issue then surely you would
not drop out of the RCU read lock at all while iterating.

It sounds to me like you're trying to use this interface for
something that it's simply not designed to do.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: [PATCH net-next] rhashtable: further improve stability of rhashtable_walk
  2018-12-12  8:00                 ` Herbert Xu
@ 2018-12-12  8:49                   ` NeilBrown
  2018-12-13  1:43                     ` Herbert Xu
  0 siblings, 1 reply; 25+ messages in thread
From: NeilBrown @ 2018-12-12  8:49 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Thomas Graf, Tom Herbert, David Miller, netdev, linux-kernel


On Wed, Dec 12 2018, Herbert Xu wrote:

> On Wed, Dec 12, 2018 at 05:41:29PM +1100, NeilBrown wrote:
>>
>> So you would substantially slow down the rhashtable_walk_start() step.
>
> This whole thing is meant for uses such as /proc and netlink
> enumeration.  Speed is definitely not a prerogative of these
> iterators.

An RCU grace period is typically 10s of milliseconds.  It doesn't take
very many of those to start being noticed.

>
> For that matter, if speed was an issue then surely you would
> not drop out of the RCU read lock at all while iterating.

tipc_nl_sk_walk() drops out of RCU for every object.
I don't know what it is used for, but I doubt that making it thousands
of times slower would be a good thing.

>
> It sounds to me like you're trying to use this interface for
> something that it's simply not designed to do.

I'm just trying to make sure we have a reliable data structure which I
can use without having to be worried that I might accidentally hit some
unsupported behaviour.

I don't really see why you think my patch is all that complex.
The only conceptual change is that hash chains are now ordered, and
ordered chains are easy to understand.
The biggest part of the code change is just moving some code from
rhashtable_walk_start() to rhashtable_walk_next().

Why don't you like it?

Thanks,
NeilBrown


>
> Cheers,
> -- 
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: [PATCH net-next] rhashtable: further improve stability of rhashtable_walk
  2018-12-12  8:49                   ` NeilBrown
@ 2018-12-13  1:43                     ` Herbert Xu
  2018-12-13  3:48                       ` NeilBrown
  0 siblings, 1 reply; 25+ messages in thread
From: Herbert Xu @ 2018-12-13  1:43 UTC (permalink / raw)
  To: NeilBrown; +Cc: Thomas Graf, Tom Herbert, David Miller, netdev, linux-kernel

On Wed, Dec 12, 2018 at 07:49:08PM +1100, NeilBrown wrote:
> On Wed, Dec 12 2018, Herbert Xu wrote:
> 
> > On Wed, Dec 12, 2018 at 05:41:29PM +1100, NeilBrown wrote:
> >>
> >> So you would substantially slow down the rhashtable_walk_start() step.
> >
> > This whole thing is meant for uses such as /proc and netlink
> > enumeration.  Speed is definitely not a prerogative of these
> > iterators.
> 
> An RCU grace period is typically 10s of milliseconds.  It doesn't take
> very many of those to start being noticed.

Why would you even need an RCU grace period for deleting and adding
the walker object to the bucket list? You just allocate a new one
each time and free the old one after an RCU grace period.  I don't
see where the latency is coming from.

> > For that matter, if speed was an issue then surely you would
> > not drop out of the RCU read lock at all while iterating.
> 
> tipc_nl_sk_walk() drops out of RCU for every object.
> I don't know what it is used for, but I doubt that making it thousands
> of times slower would be a good thing.

Now you're conflating two different things.  Dropping the RCU
isn't necessarily slow.  We were talking about waiting for an
RCU grace period which would only come into play if you were
suspending the walk indefinitely.  Actually as I said above even
there you don't really need to wait.

> > It sounds to me like you're trying to use this interface for
> > something that it's simply not designed to do.
> 
> I'm just trying to make sure we have a reliable data structure which I
> can use without having to be worried that I might accidentally hit some
> unsupported behaviour.

Now that is something I agree with.

> I don't really see why you think my patch is all that complex.
> The only conceptual change is that hash chains are now ordered, and
> ordered chains are easy to understand.
> The biggest part of the code change is just moving some code from
> rhashtable_walk_start() to rhashtable_walk_next().
> 
> Why don't you like it?

My main beef is the fact that it doesn't work with rhlist.  So down
the track either we'll have to add more complexity for rhlist or
we will have to rip this out again and do something completely
different.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: [PATCH net-next] rhashtable: further improve stability of rhashtable_walk
  2018-12-13  1:43                     ` Herbert Xu
@ 2018-12-13  3:48                       ` NeilBrown
  2018-12-13  8:47                         ` Herbert Xu
  0 siblings, 1 reply; 25+ messages in thread
From: NeilBrown @ 2018-12-13  3:48 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Thomas Graf, Tom Herbert, David Miller, netdev, linux-kernel


On Thu, Dec 13 2018, Herbert Xu wrote:

> On Wed, Dec 12, 2018 at 07:49:08PM +1100, NeilBrown wrote:
>> On Wed, Dec 12 2018, Herbert Xu wrote:
>> 
>> > On Wed, Dec 12, 2018 at 05:41:29PM +1100, NeilBrown wrote:
>> >>
>> >> So you would substantially slow down the rhashtable_walk_start() step.
>> >
>> > This whole thing is meant for uses such as /proc and netlink
>> > enumeration.  Speed is definitely not a prerogative of these
>> > iterators.
>> 
>> An RCU grace period is typically 10s of milliseconds.  It doesn't take
>> very many of those to start being noticed.
>
> Why would you even need an RCU grace period for deleting and adding
> the walker object to the bucket list? You just allocate a new one
> each time and free the old one after an RCU grace period.  I don't
> see where the latency is coming from.

Yes, you could rcu_free the old one and allocate a new one.  Then you
would have to be ready to deal with memory allocation failure which
complicates usage (I already don't like that rhashtable_insert() can
report -ENOMEM!).

Another problem with inserting a marker object on rhashtable_walk_stop()
is that it might not be clear where to insert it.
You would have to take the spinlock, and then you might find that the
location that you want to mark has been deleted from the list.  I think
you can probably still manage a reliable placement, but it could get
messy.

>
>> > For that matter, if speed was an issue then surely you would
>> > not drop out of the RCU read lock at all while iterating.
>> 
>> tipc_nl_sk_walk() drops out of RCU for every object.
>> I don't know what it is used for, but I doubt that making it thousands
>> of times slower would be a good thing.
>
> Now you're conflating two different things.  Dropping the RCU
> isn't necessarily slow.  We were talking about waiting for an
> RCU grace period which would only come into play if you were
> suspending the walk indefinitely.  Actually as I said above even
> there you don't really need to wait.

How would rhashtable_walk_stop() know if it was indefinite or not?

Yes, if you allocate a new marker on each rhashtable_walk_start()
(passing in a gfp_t and coping with errors) then you don't need to wait
a grace period.  If you reuse one to avoid the errors, you do.

>
>> > It sounds to me like you're trying to use this interface for
>> > something that it's simply not designed to do.
>> 
>> I'm just trying to make sure we have a reliable data structure which I
>> can use without having to be worried that I might accidentally hit some
>> unsupported behaviour.
>
> Now that is something I agree with.
>
>> I don't really see why you think my patch is all that complex.
>> The only conceptual change is that hash chains are now ordered, and
>> ordered chains are easy to understand.
>> The biggest part of the code change is just moving some code from
>> rhashtable_walk_start() to rhashtable_walk_next().
>> 
>> Why don't you like it?
>
> My main beef is the fact that it doesn't work with rhlist.  So down
> the track either we'll have to add more complexity for rhlist or
> we will have to rip this out again do something completely different.

You have previously said that functionality which isn't needed by current
code should not be a concern.  No current code ever stops and restarts
an rhlist walk, so this shouldn't be relevant.  Solve that problem
when/if we come to it.

If this is really the main beef, then let's turn the question around.
Suppose this patch had landed before rhltables had landed.  How would we
implement rhltables without breaking this functionality?

Keeping all the objects with the same key in a linked list is clearly a
good idea - make this a circular list as there is no obvious "first".

*Not* keeping them all in the hash chain is ideal, but not essential.
I see three costs with this.
One is that we would compare the same key multiple times for lookup.
How much of a problem is that? A failing compare is usually quite quick,
and most rhltable uses have inline memcmp for comparison (admittedly not
all).

The second cost is tracking the chain length against elasticity.
We could flag one object with each key as a 'master' (low bit of the
'next' pointer) and only count the masters.  When lookup raced with
remove this might get a slightly incorrect count, but I don't think that
hurts.

Finally, there is more pointer chasing as the chains are longer.

Was performance part of the reason for adding rhltables?  It isn't
mentioned in the commit message, but it is easy to miss things.

Thanks,
NeilBrown



>
> Cheers,
> -- 
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: [PATCH net-next] rhashtable: further improve stability of rhashtable_walk
  2018-12-13  3:48                       ` NeilBrown
@ 2018-12-13  8:47                         ` Herbert Xu
  0 siblings, 0 replies; 25+ messages in thread
From: Herbert Xu @ 2018-12-13  8:47 UTC (permalink / raw)
  To: NeilBrown; +Cc: Thomas Graf, Tom Herbert, David Miller, netdev, linux-kernel

On Thu, Dec 13, 2018 at 02:48:59PM +1100, NeilBrown wrote:
> 
> Yes, you could rcu_free the old one and allocate a new one.  Then you
> would have to be ready to deal with memory allocation failure which
> complicates usage (I already don't like that rhashtable_insert() can
> report -ENOMEM!).

Yes there will be a cost to dealing with allocation failure but at
least it'll work reliably in all cases.  For the intended use-case
of dumping things to user-space allocation failure is a non-issue.

> > Now you're conflating two different things.  Dropping the RCU
> > isn't necessarily slow.  We were talking about waiting for an
> > RCU grace period which would only come into play if you were
> > suspending the walk indefinitely.  Actually as I said above even
> > there you don't really need to wait.
> 
> How would rhashtable_walk_stop() know if it was indefinite or not?

You assume that it's always indefinite because the typical use of
stop is that we have run out of memory and must wait for user-space
to read what we have produced so far, freeing up memory.

> *Not* keeping them all in the hash chain is ideal, but not essential.
> I see three costs with this.
> One is that we would compare the same key multiple times for lookup.
> How much of a problem is that? A failing compare is usually quite quick,
> and most rhltable uses have inline memcmp for comparison (admittedly not
> all).
> 
> The second cost is tracking the chain length against elasticity.
> We could flag one object with each key as a 'master' (low bit of the
> 'next' pointer) and only count the masters.  When lookup raced with
> remove this might get a slightly incorrect count, but I don't think that
> hurts.
> 
> Finally, there is more pointer chasing as the chains are longer.

The biggest problem is that you can no longer return the lookup
result.  When you perform a lookup on rhltable you need to return
all the matching objects, not just a random one.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Thread overview: 25+ messages
2018-07-06  7:11 [PATCH 0/3] rhashtable: replace rhashtable_walk_peek implementation NeilBrown
2018-07-06  7:11 ` [PATCH 3/3] rhashtable: implement rhashtable_walk_peek() using rhashtable_walk_last_seen() NeilBrown
2018-07-06  7:11 ` [PATCH 1/3] rhashtable: further improve stability of rhashtable_walk NeilBrown
2018-07-06  8:24   ` kbuild test robot
2018-07-06  9:50     ` NeilBrown
2018-07-06  8:59   ` Paolo Abeni
2018-07-06  9:55     ` NeilBrown
2018-07-06 10:12       ` Paolo Abeni
2018-07-06  9:25   ` kbuild test robot
2018-12-05  3:51   ` [PATCH net-next] " NeilBrown
2018-12-07  5:39     ` Herbert Xu
2018-12-09 22:50       ` NeilBrown
2018-12-11  5:17         ` Herbert Xu
2018-12-12  0:02           ` NeilBrown
2018-12-12  5:46             ` Herbert Xu
2018-12-12  6:41               ` NeilBrown
2018-12-12  8:00                 ` Herbert Xu
2018-12-12  8:49                   ` NeilBrown
2018-12-13  1:43                     ` Herbert Xu
2018-12-13  3:48                       ` NeilBrown
2018-12-13  8:47                         ` Herbert Xu
2018-07-06  7:11 ` [PATCH 2/3] rhashtable: add rhashtable_walk_last_seen() NeilBrown
2018-07-10 23:55   ` David Miller
2018-07-15 23:58     ` NeilBrown
2018-07-10 23:55 ` [PATCH 0/3] rhashtable: replace rhashtable_walk_peek implementation David Miller
