linux-kernel.vger.kernel.org archive mirror
* [PATCH 00/20] staging: lustre: convert to rhashtable
@ 2018-04-11 21:54 NeilBrown
  2018-04-11 21:54 ` [PATCH 03/20] staging: lustre: convert obd uuid hash " NeilBrown
                   ` (21 more replies)
  0 siblings, 22 replies; 25+ messages in thread
From: NeilBrown @ 2018-04-11 21:54 UTC (permalink / raw)
  To: Oleg Drokin, Greg Kroah-Hartman, James Simmons, Andreas Dilger
  Cc: Linux Kernel Mailing List, Lustre Development List

libcfs in lustre has a resizeable hashtable.
Linux already has a resizeable hashtable, rhashtable, which is better
in most metrics.  See https://lwn.net/Articles/751374/ in a few days
for an introduction to rhashtable.
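
For anyone who hasn't met the API yet, the usage pattern these
conversions rely on looks roughly like the sketch below (the "item"
structure and names are illustrative only, not lustre code):

#include <linux/rhashtable.h>

struct item {
	u32			key;
	struct rhash_head	linkage;
};

static const struct rhashtable_params item_params = {
	.key_len	= sizeof(u32),
	.key_offset	= offsetof(struct item, key),
	.head_offset	= offsetof(struct item, linkage),
	.automatic_shrinking = true,
};

static struct rhashtable item_table;

static int item_example(struct item *it, u32 key)
{
	struct item *found;
	int err;

	err = rhashtable_init(&item_table, &item_params);
	if (err)
		return err;

	/* insertion can fail, e.g. -ENOMEM while resizing */
	err = rhashtable_insert_fast(&item_table, &it->linkage, item_params);

	/* lookups are lockless, under RCU */
	rcu_read_lock();
	found = rhashtable_lookup(&item_table, &key, item_params);
	rcu_read_unlock();

	/* removal; the object may only be freed after an RCU grace period */
	rhashtable_remove_fast(&item_table, &it->linkage, item_params);
	rhashtable_destroy(&item_table);
	return err;
}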

This series converts lustre to use rhashtable.  This affects several
different tables, and each differs in various ways.

There are two outstanding issues.  One is that a bug in rhashtable
means that we cannot enable auto-shrinking in one of the tables.  That
is documented as appropriate and should be fixed soon.

The other is that rhashtable has an atomic_t which counts the elements
in a hash table.  At least one table in lustre went to some trouble to
avoid any table-wide atomics, so that could lead to a regression.
I'm hoping that rhashtable can be enhanced with the option of a
per-cpu counter, or similar.
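
To be concrete about the sort of thing I am hoping for: an option to
count elements with a percpu_counter (or similar) instead of the
atomic_t.  This is purely illustrative - rhashtable has no such option
today - but the trade-off would look like:

#include <linux/percpu_counter.h>

static void nelems_sketch(void)
{
	struct percpu_counter nelems;
	s64 approx, exact;

	if (percpu_counter_init(&nelems, 0, GFP_KERNEL))
		return;

	/* cheap, mostly per-cpu updates on every insert/remove ... */
	percpu_counter_inc(&nelems);
	percpu_counter_dec(&nelems);

	/* ... at the cost of a slower (or only approximate) read */
	exact  = percpu_counter_sum(&nelems);
	approx = percpu_counter_read_positive(&nelems);

	percpu_counter_destroy(&nelems);
}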


I have enabled automatic shrinking on all tables where it makes sense
and doesn't trigger the bug.  I have also removed all hints concerning
min/max size - I cannot see how these could be useful.

The dump_pgcache debugfs file provided some interesting challenges.  I
think I have cleaned it up enough so that it all makes sense.  An
extra pair of eyes examining that code in particular would be
appreciated.

This series passes all the same tests that pass before the patches are
applied.

Thanks,
NeilBrown


---

NeilBrown (20):
      staging: lustre: ptlrpc: convert conn_hash to rhashtable
      staging: lustre: convert lov_pool to use rhashtable
      staging: lustre: convert obd uuid hash to rhashtable
      staging: lustre: convert osc_quota hash to rhashtable
      staging: lustre: separate buckets from ldlm hash table
      staging: lustre: ldlm: add a counter to the per-namespace data
      staging: lustre: ldlm: store name directly in namespace.
      staging: lustre: simplify ldlm_ns_hash_defs[]
      staging: lustre: convert ldlm_resource hash to rhashtable.
      staging: lustre: make struct lu_site_bkt_data private
      staging: lustre: lu_object: discard extra lru count.
      staging: lustre: lu_object: factor out extra per-bucket data
      staging: lustre: lu_object: move retry logic inside htable_lookup
      staging: lustre: fold lu_object_new() into lu_object_find_at()
      staging: lustre: llite: use more private data in dump_pgcache
      staging: lustre: llite: remove redundant lookup in dump_pgcache
      staging: lustre: use call_rcu() to free lu_object_headers
      staging: lustre: change how "dump_page_cache" walks a hash table
      staging: lustre: convert lu_object cache to rhashtable
      staging: lustre: remove cfs_hash resizeable hashtable implementation.


 .../staging/lustre/include/linux/libcfs/libcfs.h   |    1 
 .../lustre/include/linux/libcfs/libcfs_hash.h      |  866 --------
 drivers/staging/lustre/lnet/libcfs/Makefile        |    2 
 drivers/staging/lustre/lnet/libcfs/hash.c          | 2064 --------------------
 drivers/staging/lustre/lnet/libcfs/module.c        |   12 
 drivers/staging/lustre/lustre/include/lu_object.h  |   55 -
 drivers/staging/lustre/lustre/include/lustre_dlm.h |   19 
 .../staging/lustre/lustre/include/lustre_export.h  |    2 
 drivers/staging/lustre/lustre/include/lustre_net.h |    4 
 drivers/staging/lustre/lustre/include/obd.h        |   11 
 .../staging/lustre/lustre/include/obd_support.h    |    9 
 drivers/staging/lustre/lustre/ldlm/ldlm_request.c  |   31 
 drivers/staging/lustre/lustre/ldlm/ldlm_resource.c |  370 +---
 drivers/staging/lustre/lustre/llite/lcommon_cl.c   |    8 
 drivers/staging/lustre/lustre/llite/vvp_dev.c      |  332 +--
 drivers/staging/lustre/lustre/llite/vvp_object.c   |    9 
 drivers/staging/lustre/lustre/lov/lov_internal.h   |   11 
 drivers/staging/lustre/lustre/lov/lov_obd.c        |   12 
 drivers/staging/lustre/lustre/lov/lov_object.c     |    8 
 drivers/staging/lustre/lustre/lov/lov_pool.c       |  159 +-
 drivers/staging/lustre/lustre/lov/lovsub_object.c  |    9 
 drivers/staging/lustre/lustre/obdclass/genops.c    |   34 
 drivers/staging/lustre/lustre/obdclass/lu_object.c |  564 ++---
 .../staging/lustre/lustre/obdclass/obd_config.c    |  161 +-
 .../staging/lustre/lustre/obdecho/echo_client.c    |    8 
 drivers/staging/lustre/lustre/osc/osc_internal.h   |    5 
 drivers/staging/lustre/lustre/osc/osc_quota.c      |  136 -
 drivers/staging/lustre/lustre/osc/osc_request.c    |   12 
 drivers/staging/lustre/lustre/ptlrpc/connection.c  |  164 +-
 29 files changed, 870 insertions(+), 4208 deletions(-)
 delete mode 100644 drivers/staging/lustre/include/linux/libcfs/libcfs_hash.h
 delete mode 100644 drivers/staging/lustre/lnet/libcfs/hash.c

--
Signature


* [PATCH 12/20] staging: lustre: lu_object: factor out extra per-bucket data
  2018-04-11 21:54 [PATCH 00/20] staging: lustre: convert to rhashtable NeilBrown
                   ` (6 preceding siblings ...)
  2018-04-11 21:54 ` [PATCH 02/20] staging: lustre: convert lov_pool to use rhashtable NeilBrown
@ 2018-04-11 21:54 ` NeilBrown
  2018-04-11 21:54 ` [PATCH 08/20] staging: lustre: simplify ldlm_ns_hash_defs[] NeilBrown
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2018-04-11 21:54 UTC (permalink / raw)
  To: Oleg Drokin, Greg Kroah-Hartman, James Simmons, Andreas Dilger
  Cc: Linux Kernel Mailing List, Lustre Development List

The hash tables managed by lu_object store some extra
information in each bucket of the hash table.  This prevents the use
of resizeable hash tables, so lu_site_init() goes to some trouble
to try to guess a good hash size.

There is no real need for the extra data to be closely associated with
hash buckets.  There is a small advantage as both the hash bucket and
the extra information can then be protected by the same lock, but as
these locks have low contention, that should rarely be noticed.

The extra data, an lru list and a wait_queue_head, is updated
frequently and accessed rarely.  There could just be a single copy of
this data for the whole table, but on a many-cpu machine that could
become a contention bottleneck.  So it makes sense to keep multiple
shards and combine them only when needed.  It does not make sense to
have many more copies than there are CPUs.

This patch takes the extra data out of the hash table buckets and
creates a separate array, which never has more entries than twice the
number of possible cpus.  As this extra data contains a
wait_queue_head, which contains a spinlock, that lock is used to
protect the other data (counter and lru list).

The code currently uses a very simple hash to choose a
hash-table bucket:

 (fid_seq(fid) + fid_oid(fid)) & (CFS_HASH_NBKT(hs) - 1)

There is no documented reason for this and I cannot see any value in
not using a general hash function.  So I have chosen jhash2() instead.
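
Condensed, the new arrangement is as follows.  This just summarises
the diff below; lu_lru_add_sketch() is an illustrative helper, not
something the patch adds:

struct lu_site_bkt_data {
	struct list_head	lsb_lru;		/* one LRU shard */
	wait_queue_head_t	lsb_marche_funebre;	/* ->lock also guards lsb_lru */
};

static inline int lu_bkt_hash(struct lu_site *s, const struct lu_fid *fid)
{
	return jhash2((void *)fid, sizeof(*fid) / 4, 0xdeadbeef) &
		(s->ls_bkt_cnt - 1);
}

static void lu_lru_add_sketch(struct lu_site *s, struct lu_object_header *h)
{
	struct lu_site_bkt_data *bkt = &s->ls_bkts[lu_bkt_hash(s, &h->loh_fid)];

	spin_lock(&bkt->lsb_marche_funebre.lock);
	list_add_tail(&h->loh_lru, &bkt->lsb_lru);
	spin_unlock(&bkt->lsb_marche_funebre.lock);
}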

The lock ordering requires that a hash-table lock cannot be taken
while an extra-data lock is held.  This means that in
lu_site_purge_objects() we must first remove objects from the lru
(with the extra information locked) and then remove each one from the
hash table.  To ensure the object is not found between these two
steps, the LU_OBJECT_HEARD_BANSHEE flag is set.

As the extra info is now separate from the hash buckets, we cannot
report statistics from both at the same time.  I think the lru
statistics are probably more useful than the hash-table statistics, so
I have preserved the former and discarded the latter.  When the
hashtable becomes resizeable, those statistics will be irrelevant.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/staging/lustre/lustre/include/lu_object.h  |    5 +
 drivers/staging/lustre/lustre/obdclass/lu_object.c |  135 +++++++++++---------
 2 files changed, 77 insertions(+), 63 deletions(-)

diff --git a/drivers/staging/lustre/lustre/include/lu_object.h b/drivers/staging/lustre/lustre/include/lu_object.h
index f29bbca5af65..d23a78577fb5 100644
--- a/drivers/staging/lustre/lustre/include/lu_object.h
+++ b/drivers/staging/lustre/lustre/include/lu_object.h
@@ -576,6 +576,11 @@ struct lu_site {
 	 * objects hash table
 	 */
 	struct cfs_hash	       *ls_obj_hash;
+	/*
+	 * buckets for summary data
+	 */
+	struct lu_site_bkt_data *ls_bkts;
+	int			ls_bkt_cnt;
 	/**
 	 * index of bucket on hash table while purging
 	 */
diff --git a/drivers/staging/lustre/lustre/obdclass/lu_object.c b/drivers/staging/lustre/lustre/obdclass/lu_object.c
index 2bf089817157..064166843e64 100644
--- a/drivers/staging/lustre/lustre/obdclass/lu_object.c
+++ b/drivers/staging/lustre/lustre/obdclass/lu_object.c
@@ -55,15 +55,15 @@
 #include <cl_object.h>
 #include <lu_ref.h>
 #include <linux/list.h>
+#include <linux/jhash.h>
 
 struct lu_site_bkt_data {
 	/**
 	 * LRU list, updated on each access to object. Protected by
-	 * bucket lock of lu_site::ls_obj_hash.
+	 * lsb_marche_funebre.lock.
 	 *
 	 * "Cold" end of LRU is lu_site::ls_lru.next. Accessed object are
-	 * moved to the lu_site::ls_lru.prev (this is due to the non-existence
-	 * of list_for_each_entry_safe_reverse()).
+	 * moved to the lu_site::ls_lru.prev.
 	 */
 	struct list_head		lsb_lru;
 	/**
@@ -92,9 +92,11 @@ enum {
 #define LU_SITE_BITS_MAX	24
 #define LU_SITE_BITS_MAX_CL	19
 /**
- * total 256 buckets, we don't want too many buckets because:
- * - consume too much memory
+ * max 256 buckets, we don't want too many buckets because:
+ * - consume too much memory (currently max 16K)
  * - avoid unbalanced LRU list
+ * With few cpus there is little gain from extra buckets, so
+ * we treat this as a maximum in lu_site_init().
  */
 #define LU_SITE_BKT_BITS	8
 
@@ -109,14 +111,18 @@ MODULE_PARM_DESC(lu_cache_nr, "Maximum number of objects in lu_object cache");
 static void lu_object_free(const struct lu_env *env, struct lu_object *o);
 static __u32 ls_stats_read(struct lprocfs_stats *stats, int idx);
 
+static inline int lu_bkt_hash(struct lu_site *s, const struct lu_fid *fid)
+{
+	return jhash2((void*)fid, sizeof(*fid)/4, 0xdeadbeef) &
+		(s->ls_bkt_cnt - 1);
+}
+
 wait_queue_head_t *
 lu_site_wq_from_fid(struct lu_site *site, struct lu_fid *fid)
 {
-	struct cfs_hash_bd bd;
 	struct lu_site_bkt_data *bkt;
 
-	cfs_hash_bd_get(site->ls_obj_hash, fid, &bd);
-	bkt = cfs_hash_bd_extra_get(site->ls_obj_hash, &bd);
+	bkt = &site->ls_bkts[lu_bkt_hash(site, fid)];
 	return &bkt->lsb_marche_funebre;
 }
 
@@ -158,7 +164,6 @@ void lu_object_put(const struct lu_env *env, struct lu_object *o)
 	}
 
 	cfs_hash_bd_get(site->ls_obj_hash, &top->loh_fid, &bd);
-	bkt = cfs_hash_bd_extra_get(site->ls_obj_hash, &bd);
 
 	if (!cfs_hash_bd_dec_and_lock(site->ls_obj_hash, &bd, &top->loh_ref)) {
 		if (lu_object_is_dying(top)) {
@@ -166,6 +171,7 @@ void lu_object_put(const struct lu_env *env, struct lu_object *o)
 			 * somebody may be waiting for this, currently only
 			 * used for cl_object, see cl_object_put_last().
 			 */
+			bkt = &site->ls_bkts[lu_bkt_hash(site, &top->loh_fid)];
 			wake_up_all(&bkt->lsb_marche_funebre);
 		}
 		return;
@@ -180,9 +186,13 @@ void lu_object_put(const struct lu_env *env, struct lu_object *o)
 			o->lo_ops->loo_object_release(env, o);
 	}
 
+	bkt = &site->ls_bkts[lu_bkt_hash(site, &top->loh_fid)];
+	spin_lock(&bkt->lsb_marche_funebre.lock);
+
 	if (!lu_object_is_dying(top)) {
 		LASSERT(list_empty(&top->loh_lru));
 		list_add_tail(&top->loh_lru, &bkt->lsb_lru);
+		spin_unlock(&bkt->lsb_marche_funebre.lock);
 		percpu_counter_inc(&site->ls_lru_len_counter);
 		CDEBUG(D_INODE, "Add %p to site lru. hash: %p, bkt: %p\n",
 		       o, site->ls_obj_hash, bkt);
@@ -192,21 +202,20 @@ void lu_object_put(const struct lu_env *env, struct lu_object *o)
 
 	/*
 	 * If object is dying (will not be cached), then removed it
-	 * from hash table and LRU.
+	 * from hash table (it is already not on the LRU).
 	 *
-	 * This is done with hash table and LRU lists locked. As the only
+	 * This is done with hash table list locked. As the only
 	 * way to acquire first reference to previously unreferenced
-	 * object is through hash-table lookup (lu_object_find()),
-	 * or LRU scanning (lu_site_purge()), that are done under hash-table
-	 * and LRU lock, no race with concurrent object lookup is possible
-	 * and we can safely destroy object below.
+	 * object is through hash-table lookup (lu_object_find())
+	 * which is done under hash-table, no race with concurrent
+	 * object lookup is possible and we can safely destroy object below.
 	 */
 	if (!test_and_set_bit(LU_OBJECT_UNHASHED, &top->loh_flags))
 		cfs_hash_bd_del_locked(site->ls_obj_hash, &bd, &top->loh_hash);
+	spin_unlock(&bkt->lsb_marche_funebre.lock);
 	cfs_hash_bd_unlock(site->ls_obj_hash, &bd, 1);
 	/*
-	 * Object was already removed from hash and lru above, can
-	 * kill it.
+	 * Object was already removed from hash above, can kill it.
 	 */
 	lu_object_free(env, orig);
 }
@@ -231,8 +240,10 @@ void lu_object_unhash(const struct lu_env *env, struct lu_object *o)
 		if (!list_empty(&top->loh_lru)) {
 			struct lu_site_bkt_data *bkt;
 
+			bkt = &site->ls_bkts[lu_bkt_hash(site,&top->loh_fid)];
+			spin_lock(&bkt->lsb_marche_funebre.lock);
 			list_del_init(&top->loh_lru);
-			bkt = cfs_hash_bd_extra_get(obj_hash, &bd);
+			spin_unlock(&bkt->lsb_marche_funebre.lock);
 			percpu_counter_dec(&site->ls_lru_len_counter);
 		}
 		cfs_hash_bd_del_locked(obj_hash, &bd, &top->loh_hash);
@@ -369,8 +380,6 @@ int lu_site_purge_objects(const struct lu_env *env, struct lu_site *s,
 	struct lu_object_header *h;
 	struct lu_object_header *temp;
 	struct lu_site_bkt_data *bkt;
-	struct cfs_hash_bd	    bd;
-	struct cfs_hash_bd	    bd2;
 	struct list_head	       dispose;
 	int		      did_sth;
 	unsigned int start = 0;
@@ -388,7 +397,7 @@ int lu_site_purge_objects(const struct lu_env *env, struct lu_site *s,
 	 */
 	if (nr != ~0)
 		start = s->ls_purge_start;
-	bnr = (nr == ~0) ? -1 : nr / (int)CFS_HASH_NBKT(s->ls_obj_hash) + 1;
+	bnr = (nr == ~0) ? -1 : nr / s->ls_bkt_cnt + 1;
  again:
 	/*
 	 * It doesn't make any sense to make purge threads parallel, that can
@@ -400,21 +409,21 @@ int lu_site_purge_objects(const struct lu_env *env, struct lu_site *s,
 		goto out;
 
 	did_sth = 0;
-	cfs_hash_for_each_bucket(s->ls_obj_hash, &bd, i) {
-		if (i < start)
-			continue;
+	for (i = start; i < s->ls_bkt_cnt ; i++) {
 		count = bnr;
-		cfs_hash_bd_lock(s->ls_obj_hash, &bd, 1);
-		bkt = cfs_hash_bd_extra_get(s->ls_obj_hash, &bd);
+		bkt = &s->ls_bkts[i];
+		spin_lock(&bkt->lsb_marche_funebre.lock);
 
 		list_for_each_entry_safe(h, temp, &bkt->lsb_lru, loh_lru) {
 			LASSERT(atomic_read(&h->loh_ref) == 0);
 
-			cfs_hash_bd_get(s->ls_obj_hash, &h->loh_fid, &bd2);
-			LASSERT(bd.bd_bucket == bd2.bd_bucket);
+			LINVRNT(lu_bkt_hash(s, &h->loh_fid) == i);
 
-			cfs_hash_bd_del_locked(s->ls_obj_hash,
-					       &bd2, &h->loh_hash);
+			/* Cannot remove from hash under current spinlock,
+			 * so set flag to stop object from being found
+			 * by htable_lookup().
+			 */
+			set_bit(LU_OBJECT_HEARD_BANSHEE, &h->loh_flags);
 			list_move(&h->loh_lru, &dispose);
 			percpu_counter_dec(&s->ls_lru_len_counter);
 			if (did_sth == 0)
@@ -426,16 +435,17 @@ int lu_site_purge_objects(const struct lu_env *env, struct lu_site *s,
 			if (count > 0 && --count == 0)
 				break;
 		}
-		cfs_hash_bd_unlock(s->ls_obj_hash, &bd, 1);
+		spin_unlock(&bkt->lsb_marche_funebre.lock);
 		cond_resched();
 		/*
 		 * Free everything on the dispose list. This is safe against
 		 * races due to the reasons described in lu_object_put().
 		 */
-		while (!list_empty(&dispose)) {
-			h = container_of(dispose.next,
-					 struct lu_object_header, loh_lru);
+		while ((h = list_first_entry_or_null(&dispose,
+						     struct lu_object_header,
+						     loh_lru)) != NULL) {
 			list_del_init(&h->loh_lru);
+			cfs_hash_del(s->ls_obj_hash, &h->loh_fid, &h->loh_hash);
 			lu_object_free(env, lu_object_top(h));
 			lprocfs_counter_incr(s->ls_stats, LU_SS_LRU_PURGED);
 		}
@@ -450,7 +460,7 @@ int lu_site_purge_objects(const struct lu_env *env, struct lu_site *s,
 		goto again;
 	}
 	/* race on s->ls_purge_start, but nobody cares */
-	s->ls_purge_start = i % CFS_HASH_NBKT(s->ls_obj_hash);
+	s->ls_purge_start = i & (s->ls_bkt_cnt - 1);
 out:
 	return nr;
 }
@@ -598,7 +608,6 @@ static struct lu_object *htable_lookup(struct lu_site *s,
 		return ERR_PTR(-ENOENT);
 
 	*version = ver;
-	bkt = cfs_hash_bd_extra_get(s->ls_obj_hash, bd);
 	/* cfs_hash_bd_peek_locked is a somehow "internal" function
 	 * of cfs_hash, it doesn't add refcount on object.
 	 */
@@ -609,6 +618,8 @@ static struct lu_object *htable_lookup(struct lu_site *s,
 	}
 
 	h = container_of(hnode, struct lu_object_header, loh_hash);
+	bkt = &s->ls_bkts[lu_bkt_hash(s, f)];
+	spin_lock(&bkt->lsb_marche_funebre.lock);
 	if (likely(!lu_object_is_dying(h))) {
 		cfs_hash_get(s->ls_obj_hash, hnode);
 		lprocfs_counter_incr(s->ls_stats, LU_SS_CACHE_HIT);
@@ -616,8 +627,10 @@ static struct lu_object *htable_lookup(struct lu_site *s,
 			list_del_init(&h->loh_lru);
 			percpu_counter_dec(&s->ls_lru_len_counter);
 		}
+		spin_unlock(&bkt->lsb_marche_funebre.lock);
 		return lu_object_top(h);
 	}
+	spin_unlock(&bkt->lsb_marche_funebre.lock);
 
 	/*
 	 * Lookup found an object being destroyed this object cannot be
@@ -1028,7 +1041,6 @@ static void lu_dev_add_linkage(struct lu_site *s, struct lu_device *d)
 int lu_site_init(struct lu_site *s, struct lu_device *top)
 {
 	struct lu_site_bkt_data *bkt;
-	struct cfs_hash_bd bd;
 	unsigned long bits;
 	unsigned long i;
 	char name[16];
@@ -1045,7 +1057,7 @@ int lu_site_init(struct lu_site *s, struct lu_device *top)
 	for (bits = lu_htable_order(top); bits >= LU_SITE_BITS_MIN; bits--) {
 		s->ls_obj_hash = cfs_hash_create(name, bits, bits,
 						 bits - LU_SITE_BKT_BITS,
-						 sizeof(*bkt), 0, 0,
+						 0, 0, 0,
 						 &lu_site_hash_ops,
 						 CFS_HASH_SPIN_BKTLOCK |
 						 CFS_HASH_NO_ITEMREF |
@@ -1061,15 +1073,26 @@ int lu_site_init(struct lu_site *s, struct lu_device *top)
 		return -ENOMEM;
 	}
 
-	cfs_hash_for_each_bucket(s->ls_obj_hash, &bd, i) {
-		bkt = cfs_hash_bd_extra_get(s->ls_obj_hash, &bd);
+	s->ls_bkt_cnt = max_t(long, 1 << LU_SITE_BKT_BITS, 2 * num_possible_cpus());
+	s->ls_bkt_cnt = roundup_pow_of_two(s->ls_bkt_cnt);
+	s->ls_bkts = kvmalloc_array(s->ls_bkt_cnt, sizeof(*bkt), GFP_KERNEL);
+	if (!s->ls_bkts) {
+		cfs_hash_putref(s->ls_obj_hash);
+		s->ls_obj_hash = NULL;
+		s->ls_bkts = NULL;
+		return -ENOMEM;
+	}
+	for (i = 0; i < s->ls_bkt_cnt ; i++) {
+		bkt = &s->ls_bkts[i];
 		INIT_LIST_HEAD(&bkt->lsb_lru);
 		init_waitqueue_head(&bkt->lsb_marche_funebre);
 	}
 
 	s->ls_stats = lprocfs_alloc_stats(LU_SS_LAST_STAT, 0);
 	if (!s->ls_stats) {
+		kvfree(s->ls_bkts);
 		cfs_hash_putref(s->ls_obj_hash);
+		s->ls_bkts = NULL;
 		s->ls_obj_hash = NULL;
 		return -ENOMEM;
 	}
@@ -1118,6 +1141,8 @@ void lu_site_fini(struct lu_site *s)
 		s->ls_obj_hash = NULL;
 	}
 
+	kvfree(s->ls_bkts);
+
 	if (s->ls_top_dev) {
 		s->ls_top_dev->ld_site = NULL;
 		lu_ref_del(&s->ls_top_dev->ld_reference, "site-top", s);
@@ -1827,34 +1852,18 @@ struct lu_site_stats {
 };
 
 static void lu_site_stats_get(const struct lu_site *s,
-			      struct lu_site_stats *stats, int populated)
+			      struct lu_site_stats *stats)
 {
-	struct cfs_hash *hs = s->ls_obj_hash;
-	struct cfs_hash_bd bd;
-	unsigned int i;
+	int cnt = cfs_hash_size_get(s->ls_obj_hash);
 	/* percpu_counter_read_positive() won't accept a const pointer */
 	struct lu_site *s2 = (struct lu_site *)s;
 
-	stats->lss_busy += cfs_hash_size_get(hs) -
+	stats->lss_busy += cnt -
 		percpu_counter_read_positive(&s2->ls_lru_len_counter);
-	cfs_hash_for_each_bucket(hs, &bd, i) {
-		struct hlist_head	*hhead;
-
-		cfs_hash_bd_lock(hs, &bd, 1);
-		stats->lss_total += cfs_hash_bd_count_get(&bd);
-		stats->lss_max_search = max((int)stats->lss_max_search,
-					    cfs_hash_bd_depmax_get(&bd));
-		if (!populated) {
-			cfs_hash_bd_unlock(hs, &bd, 1);
-			continue;
-		}
 
-		cfs_hash_bd_for_each_hlist(hs, &bd, hhead) {
-			if (!hlist_empty(hhead))
-				stats->lss_populated++;
-		}
-		cfs_hash_bd_unlock(hs, &bd, 1);
-	}
+	stats->lss_total += cnt;
+	stats->lss_max_search = 0;
+	stats->lss_populated = 0;
 }
 
 /*
@@ -2033,7 +2042,7 @@ int lu_site_stats_print(const struct lu_site *s, struct seq_file *m)
 	struct lu_site_stats stats;
 
 	memset(&stats, 0, sizeof(stats));
-	lu_site_stats_get(s, &stats, 1);
+	lu_site_stats_get(s, &stats);
 
 	seq_printf(m, "%d/%d %d/%ld %d %d %d %d %d %d %d\n",
 		   stats.lss_busy,


* [PATCH 11/20] staging: lustre: lu_object: discard extra lru count.
  2018-04-11 21:54 [PATCH 00/20] staging: lustre: convert to rhashtable NeilBrown
                   ` (10 preceding siblings ...)
  2018-04-11 21:54 ` [PATCH 06/20] staging: lustre: ldlm: add a counter to the per-namespace data NeilBrown
@ 2018-04-11 21:54 ` NeilBrown
  2018-04-11 21:54 ` [PATCH 17/20] staging: lustre: use call_rcu() to free lu_object_headers NeilBrown
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2018-04-11 21:54 UTC (permalink / raw)
  To: Oleg Drokin, Greg Kroah-Hartman, James Simmons, Andreas Dilger
  Cc: Linux Kernel Mailing List, Lustre Development List

lu_object maintains 2 lru counts.
One is a per-bucket lsb_lru_len.
The other is the per-cpu ls_lru_len_counter.

The only times the per-bucket counters are used are:
- a debug message when an object is added
- in lu_site_stats_get when all the counters are combined.

The debug message is not essential, and the per-cpu counter
can be used to get the combined total.

So discard the per-bucket lsb_lru_len.
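
For reference, the combined total remains cheap to read from the
per-cpu counter - something like this illustrative helper (not part
of the patch):

static s64 lu_site_lru_len(struct lu_site *s)
{
	/* approximate but fast; percpu_counter_sum() would be exact */
	return percpu_counter_read_positive(&s->ls_lru_len_counter);
}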

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/staging/lustre/lustre/obdclass/lu_object.c |   24 ++++++++------------
 1 file changed, 9 insertions(+), 15 deletions(-)

diff --git a/drivers/staging/lustre/lustre/obdclass/lu_object.c b/drivers/staging/lustre/lustre/obdclass/lu_object.c
index 2a8a25d6edb5..2bf089817157 100644
--- a/drivers/staging/lustre/lustre/obdclass/lu_object.c
+++ b/drivers/staging/lustre/lustre/obdclass/lu_object.c
@@ -57,10 +57,6 @@
 #include <linux/list.h>
 
 struct lu_site_bkt_data {
-	/**
-	 * number of object in this bucket on the lsb_lru list.
-	 */
-	long			lsb_lru_len;
 	/**
 	 * LRU list, updated on each access to object. Protected by
 	 * bucket lock of lu_site::ls_obj_hash.
@@ -187,10 +183,9 @@ void lu_object_put(const struct lu_env *env, struct lu_object *o)
 	if (!lu_object_is_dying(top)) {
 		LASSERT(list_empty(&top->loh_lru));
 		list_add_tail(&top->loh_lru, &bkt->lsb_lru);
-		bkt->lsb_lru_len++;
 		percpu_counter_inc(&site->ls_lru_len_counter);
-		CDEBUG(D_INODE, "Add %p to site lru. hash: %p, bkt: %p, lru_len: %ld\n",
-		       o, site->ls_obj_hash, bkt, bkt->lsb_lru_len);
+		CDEBUG(D_INODE, "Add %p to site lru. hash: %p, bkt: %p\n",
+		       o, site->ls_obj_hash, bkt);
 		cfs_hash_bd_unlock(site->ls_obj_hash, &bd, 1);
 		return;
 	}
@@ -238,7 +233,6 @@ void lu_object_unhash(const struct lu_env *env, struct lu_object *o)
 
 			list_del_init(&top->loh_lru);
 			bkt = cfs_hash_bd_extra_get(obj_hash, &bd);
-			bkt->lsb_lru_len--;
 			percpu_counter_dec(&site->ls_lru_len_counter);
 		}
 		cfs_hash_bd_del_locked(obj_hash, &bd, &top->loh_hash);
@@ -422,7 +416,6 @@ int lu_site_purge_objects(const struct lu_env *env, struct lu_site *s,
 			cfs_hash_bd_del_locked(s->ls_obj_hash,
 					       &bd2, &h->loh_hash);
 			list_move(&h->loh_lru, &dispose);
-			bkt->lsb_lru_len--;
 			percpu_counter_dec(&s->ls_lru_len_counter);
 			if (did_sth == 0)
 				did_sth = 1;
@@ -621,7 +614,6 @@ static struct lu_object *htable_lookup(struct lu_site *s,
 		lprocfs_counter_incr(s->ls_stats, LU_SS_CACHE_HIT);
 		if (!list_empty(&h->loh_lru)) {
 			list_del_init(&h->loh_lru);
-			bkt->lsb_lru_len--;
 			percpu_counter_dec(&s->ls_lru_len_counter);
 		}
 		return lu_object_top(h);
@@ -1834,19 +1826,21 @@ struct lu_site_stats {
 	unsigned int	lss_busy;
 };
 
-static void lu_site_stats_get(struct cfs_hash *hs,
+static void lu_site_stats_get(const struct lu_site *s,
 			      struct lu_site_stats *stats, int populated)
 {
+	struct cfs_hash *hs = s->ls_obj_hash;
 	struct cfs_hash_bd bd;
 	unsigned int i;
+	/* percpu_counter_read_positive() won't accept a const pointer */
+	struct lu_site *s2 = (struct lu_site *)s;
 
+	stats->lss_busy += cfs_hash_size_get(hs) -
+		percpu_counter_read_positive(&s2->ls_lru_len_counter);
 	cfs_hash_for_each_bucket(hs, &bd, i) {
-		struct lu_site_bkt_data *bkt = cfs_hash_bd_extra_get(hs, &bd);
 		struct hlist_head	*hhead;
 
 		cfs_hash_bd_lock(hs, &bd, 1);
-		stats->lss_busy  +=
-			cfs_hash_bd_count_get(&bd) - bkt->lsb_lru_len;
 		stats->lss_total += cfs_hash_bd_count_get(&bd);
 		stats->lss_max_search = max((int)stats->lss_max_search,
 					    cfs_hash_bd_depmax_get(&bd));
@@ -2039,7 +2033,7 @@ int lu_site_stats_print(const struct lu_site *s, struct seq_file *m)
 	struct lu_site_stats stats;
 
 	memset(&stats, 0, sizeof(stats));
-	lu_site_stats_get(s->ls_obj_hash, &stats, 1);
+	lu_site_stats_get(s, &stats, 1);
 
 	seq_printf(m, "%d/%d %d/%ld %d %d %d %d %d %d %d\n",
 		   stats.lss_busy,


* [PATCH 10/20] staging: lustre: make struct lu_site_bkt_data private
  2018-04-11 21:54 [PATCH 00/20] staging: lustre: convert to rhashtable NeilBrown
                   ` (4 preceding siblings ...)
  2018-04-11 21:54 ` [PATCH 07/20] staging: lustre: ldlm: store name directly in namespace NeilBrown
@ 2018-04-11 21:54 ` NeilBrown
  2018-04-11 21:54 ` [PATCH 02/20] staging: lustre: convert lov_pool to use rhashtable NeilBrown
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2018-04-11 21:54 UTC (permalink / raw)
  To: Oleg Drokin, Greg Kroah-Hartman, James Simmons, Andreas Dilger
  Cc: Linux Kernel Mailing List, Lustre Development List

This data structure only needs to be public so that
various modules can access a wait queue to wait for object
destruction.
If we provide a function to get the wait queue, rather than the
whole bucket, the structure can be made private.
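
The calling pattern then becomes the following (condensed from the
call sites updated below; object_is_gone() stands in for each caller's
actual exit test):

static void wait_for_death(struct lu_site *site, struct lu_fid *fid)
{
	wait_queue_head_t *wq = lu_site_wq_from_fid(site, fid);
	wait_queue_entry_t waiter;

	init_waitqueue_entry(&waiter, current);
	add_wait_queue(wq, &waiter);
	while (1) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (object_is_gone())	/* caller-specific condition */
			break;
		schedule();
	}
	set_current_state(TASK_RUNNING);
	remove_wait_queue(wq, &waiter);
}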

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/staging/lustre/lustre/include/lu_object.h  |   36 +-------------
 drivers/staging/lustre/lustre/llite/lcommon_cl.c   |    8 ++-
 drivers/staging/lustre/lustre/lov/lov_object.c     |    8 ++-
 drivers/staging/lustre/lustre/obdclass/lu_object.c |   50 +++++++++++++++++---
 4 files changed, 54 insertions(+), 48 deletions(-)

diff --git a/drivers/staging/lustre/lustre/include/lu_object.h b/drivers/staging/lustre/lustre/include/lu_object.h
index c3b0ed518819..f29bbca5af65 100644
--- a/drivers/staging/lustre/lustre/include/lu_object.h
+++ b/drivers/staging/lustre/lustre/include/lu_object.h
@@ -549,31 +549,7 @@ struct lu_object_header {
 };
 
 struct fld;
-
-struct lu_site_bkt_data {
-	/**
-	 * number of object in this bucket on the lsb_lru list.
-	 */
-	long			lsb_lru_len;
-	/**
-	 * LRU list, updated on each access to object. Protected by
-	 * bucket lock of lu_site::ls_obj_hash.
-	 *
-	 * "Cold" end of LRU is lu_site::ls_lru.next. Accessed object are
-	 * moved to the lu_site::ls_lru.prev (this is due to the non-existence
-	 * of list_for_each_entry_safe_reverse()).
-	 */
-	struct list_head		lsb_lru;
-	/**
-	 * Wait-queue signaled when an object in this site is ultimately
-	 * destroyed (lu_object_free()). It is used by lu_object_find() to
-	 * wait before re-trying when object in the process of destruction is
-	 * found in the hash table.
-	 *
-	 * \see htable_lookup().
-	 */
-	wait_queue_head_t	       lsb_marche_funebre;
-};
+struct lu_site_bkt_data;
 
 enum {
 	LU_SS_CREATED	 = 0,
@@ -642,14 +618,8 @@ struct lu_site {
 	struct percpu_counter	 ls_lru_len_counter;
 };
 
-static inline struct lu_site_bkt_data *
-lu_site_bkt_from_fid(struct lu_site *site, struct lu_fid *fid)
-{
-	struct cfs_hash_bd bd;
-
-	cfs_hash_bd_get(site->ls_obj_hash, fid, &bd);
-	return cfs_hash_bd_extra_get(site->ls_obj_hash, &bd);
-}
+wait_queue_head_t *
+lu_site_wq_from_fid(struct lu_site *site, struct lu_fid *fid);
 
 static inline struct seq_server_site *lu_site2seq(const struct lu_site *s)
 {
diff --git a/drivers/staging/lustre/lustre/llite/lcommon_cl.c b/drivers/staging/lustre/lustre/llite/lcommon_cl.c
index df5c0c0ae703..d5b42fb1d601 100644
--- a/drivers/staging/lustre/lustre/llite/lcommon_cl.c
+++ b/drivers/staging/lustre/lustre/llite/lcommon_cl.c
@@ -211,12 +211,12 @@ static void cl_object_put_last(struct lu_env *env, struct cl_object *obj)
 
 	if (unlikely(atomic_read(&header->loh_ref) != 1)) {
 		struct lu_site *site = obj->co_lu.lo_dev->ld_site;
-		struct lu_site_bkt_data *bkt;
+		wait_queue_head_t *wq;
 
-		bkt = lu_site_bkt_from_fid(site, &header->loh_fid);
+		wq = lu_site_wq_from_fid(site, &header->loh_fid);
 
 		init_waitqueue_entry(&waiter, current);
-		add_wait_queue(&bkt->lsb_marche_funebre, &waiter);
+		add_wait_queue(wq, &waiter);
 
 		while (1) {
 			set_current_state(TASK_UNINTERRUPTIBLE);
@@ -226,7 +226,7 @@ static void cl_object_put_last(struct lu_env *env, struct cl_object *obj)
 		}
 
 		set_current_state(TASK_RUNNING);
-		remove_wait_queue(&bkt->lsb_marche_funebre, &waiter);
+		remove_wait_queue(wq, &waiter);
 	}
 
 	cl_object_put(env, obj);
diff --git a/drivers/staging/lustre/lustre/lov/lov_object.c b/drivers/staging/lustre/lustre/lov/lov_object.c
index f7c69680cb7d..adc90f310fd7 100644
--- a/drivers/staging/lustre/lustre/lov/lov_object.c
+++ b/drivers/staging/lustre/lustre/lov/lov_object.c
@@ -370,7 +370,7 @@ static void lov_subobject_kill(const struct lu_env *env, struct lov_object *lov,
 	struct cl_object	*sub;
 	struct lov_layout_raid0 *r0;
 	struct lu_site	  *site;
-	struct lu_site_bkt_data *bkt;
+	wait_queue_head_t *wq;
 	wait_queue_entry_t	  *waiter;
 
 	r0  = &lov->u.raid0;
@@ -378,7 +378,7 @@ static void lov_subobject_kill(const struct lu_env *env, struct lov_object *lov,
 
 	sub  = lovsub2cl(los);
 	site = sub->co_lu.lo_dev->ld_site;
-	bkt  = lu_site_bkt_from_fid(site, &sub->co_lu.lo_header->loh_fid);
+	wq   = lu_site_wq_from_fid(site, &sub->co_lu.lo_header->loh_fid);
 
 	cl_object_kill(env, sub);
 	/* release a reference to the sub-object and ... */
@@ -391,7 +391,7 @@ static void lov_subobject_kill(const struct lu_env *env, struct lov_object *lov,
 	if (r0->lo_sub[idx] == los) {
 		waiter = &lov_env_info(env)->lti_waiter;
 		init_waitqueue_entry(waiter, current);
-		add_wait_queue(&bkt->lsb_marche_funebre, waiter);
+		add_wait_queue(wq, waiter);
 		set_current_state(TASK_UNINTERRUPTIBLE);
 		while (1) {
 			/* this wait-queue is signaled at the end of
@@ -408,7 +408,7 @@ static void lov_subobject_kill(const struct lu_env *env, struct lov_object *lov,
 				break;
 			}
 		}
-		remove_wait_queue(&bkt->lsb_marche_funebre, waiter);
+		remove_wait_queue(wq, waiter);
 	}
 	LASSERT(!r0->lo_sub[idx]);
 }
diff --git a/drivers/staging/lustre/lustre/obdclass/lu_object.c b/drivers/staging/lustre/lustre/obdclass/lu_object.c
index 3de7dc0497c4..2a8a25d6edb5 100644
--- a/drivers/staging/lustre/lustre/obdclass/lu_object.c
+++ b/drivers/staging/lustre/lustre/obdclass/lu_object.c
@@ -56,6 +56,31 @@
 #include <lu_ref.h>
 #include <linux/list.h>
 
+struct lu_site_bkt_data {
+	/**
+	 * number of object in this bucket on the lsb_lru list.
+	 */
+	long			lsb_lru_len;
+	/**
+	 * LRU list, updated on each access to object. Protected by
+	 * bucket lock of lu_site::ls_obj_hash.
+	 *
+	 * "Cold" end of LRU is lu_site::ls_lru.next. Accessed object are
+	 * moved to the lu_site::ls_lru.prev (this is due to the non-existence
+	 * of list_for_each_entry_safe_reverse()).
+	 */
+	struct list_head		lsb_lru;
+	/**
+	 * Wait-queue signaled when an object in this site is ultimately
+	 * destroyed (lu_object_free()). It is used by lu_object_find() to
+	 * wait before re-trying when object in the process of destruction is
+	 * found in the hash table.
+	 *
+	 * \see htable_lookup().
+	 */
+	wait_queue_head_t	       lsb_marche_funebre;
+};
+
 enum {
 	LU_CACHE_PERCENT_MAX	 = 50,
 	LU_CACHE_PERCENT_DEFAULT = 20
@@ -88,6 +113,17 @@ MODULE_PARM_DESC(lu_cache_nr, "Maximum number of objects in lu_object cache");
 static void lu_object_free(const struct lu_env *env, struct lu_object *o);
 static __u32 ls_stats_read(struct lprocfs_stats *stats, int idx);
 
+wait_queue_head_t *
+lu_site_wq_from_fid(struct lu_site *site, struct lu_fid *fid)
+{
+	struct cfs_hash_bd bd;
+	struct lu_site_bkt_data *bkt;
+
+	cfs_hash_bd_get(site->ls_obj_hash, fid, &bd);
+	bkt = cfs_hash_bd_extra_get(site->ls_obj_hash, &bd);
+	return &bkt->lsb_marche_funebre;
+}
+
 /**
  * Decrease reference counter on object. If last reference is freed, return
  * object to the cache, unless lu_object_is_dying(o) holds. In the latter
@@ -288,7 +324,7 @@ static struct lu_object *lu_object_alloc(const struct lu_env *env,
  */
 static void lu_object_free(const struct lu_env *env, struct lu_object *o)
 {
-	struct lu_site_bkt_data *bkt;
+	wait_queue_head_t *wq;
 	struct lu_site	  *site;
 	struct lu_object	*scan;
 	struct list_head	      *layers;
@@ -296,7 +332,7 @@ static void lu_object_free(const struct lu_env *env, struct lu_object *o)
 
 	site   = o->lo_dev->ld_site;
 	layers = &o->lo_header->loh_layers;
-	bkt    = lu_site_bkt_from_fid(site, &o->lo_header->loh_fid);
+	wq     = lu_site_wq_from_fid(site, &o->lo_header->loh_fid);
 	/*
 	 * First call ->loo_object_delete() method to release all resources.
 	 */
@@ -324,8 +360,8 @@ static void lu_object_free(const struct lu_env *env, struct lu_object *o)
 		o->lo_ops->loo_object_free(env, o);
 	}
 
-	if (waitqueue_active(&bkt->lsb_marche_funebre))
-		wake_up_all(&bkt->lsb_marche_funebre);
+	if (waitqueue_active(wq))
+		wake_up_all(wq);
 }
 
 /**
@@ -749,7 +785,7 @@ struct lu_object *lu_object_find_at(const struct lu_env *env,
 				    const struct lu_fid *f,
 				    const struct lu_object_conf *conf)
 {
-	struct lu_site_bkt_data *bkt;
+	wait_queue_head_t	*wq;
 	struct lu_object	*obj;
 	wait_queue_entry_t	   wait;
 
@@ -762,8 +798,8 @@ struct lu_object *lu_object_find_at(const struct lu_env *env,
 		 * wait queue.
 		 */
 		schedule();
-		bkt = lu_site_bkt_from_fid(dev->ld_site, (void *)f);
-		remove_wait_queue(&bkt->lsb_marche_funebre, &wait);
+		wq = lu_site_wq_from_fid(dev->ld_site, (void *)f);
+		remove_wait_queue(wq, &wait);
 	}
 }
 EXPORT_SYMBOL(lu_object_find_at);


* [PATCH 09/20] staging: lustre: convert ldlm_resource hash to rhashtable.
  2018-04-11 21:54 [PATCH 00/20] staging: lustre: convert to rhashtable NeilBrown
                   ` (2 preceding siblings ...)
  2018-04-11 21:54 ` [PATCH 05/20] staging: lustre: separate buckets from ldlm hash table NeilBrown
@ 2018-04-11 21:54 ` NeilBrown
  2018-04-11 21:54 ` [PATCH 07/20] staging: lustre: ldlm: store name directly in namespace NeilBrown
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2018-04-11 21:54 UTC (permalink / raw)
  To: Oleg Drokin, Greg Kroah-Hartman, James Simmons, Andreas Dilger
  Cc: Linux Kernel Mailing List, Lustre Development List

Using an rhashtable allows lockless lookup at the cost
of rcu freeing of entries.

When we find an entry, we need to atomically check the
reference hasn't dropped to zero.

When adding an entry, we might find an existing entry which is in the
process of being removed - with a zero refcount.  In that case
we loop around and repeat the lookup.  To ensure this doesn't
spin, the 'cmp' function will fail any comparison with a resource
which has a zero refcount.
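
In outline the resulting lookup is as below; this is a condensed
sketch of the code in the diff, and the helper name is mine:

static struct ldlm_resource *
ldlm_resource_lookup_sketch(struct ldlm_namespace *ns,
			    const struct ldlm_res_id *name)
{
	struct ldlm_resource *res;

	rcu_read_lock();
	res = rhashtable_lookup(&ns->ns_rs_hash, name, ns_rs_hash_params);
	/* rs_cmp() already rejects refcount==0 entries; this closes the
	 * window where the count drops to zero after the compare.
	 */
	if (res && !atomic_inc_not_zero(&res->lr_refcount))
		res = NULL;	/* dying entry - treat as a miss and retry */
	rcu_read_unlock();
	return res;
}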

Now that we are using resizing hash tables, we don't need to
preconfigure suitable sizes for each namespace.  We can just use the
default and let it grow or shrink as needed.  We keep the
pre-configured sizes for the bucket array.  Previously the number of
bits used to size the bucket array was the difference between
nsd_all_bits and nsd_bkt_bits.  As we don't need nsd_all_bits any
more, nsd_bkt_bits is changed to the number of bits used to choose a
bucket.

Walking an rhashtable requires that we manage refcounts ourself, so
a new function, ldlm_resource_for_each() is added to do that.

Note that with this patch we now update a per-table counter
on every insert/remove, which might cause more contention
between CPUs on a busy system.  Hopefully rhashtable will
be enhanced in the near future to support a per-CPU counter
for nelems.

ldlm_namespace_cleanup() will sometimes iterate over the hash table
and remove everything.  If we enable automatic shrinking, the table
will be resized during this process.  There is currently a bug
in rhashtable_walk_start() which means that when that happens,
we can miss entries, so not everything gets deleted.  A fix for the
bug has been accepted but has not yet landed upstream.  Rather than
wait, automatic_shrinking has been disabled for now.  Once the bug fix
lands we can enable automatic_shrinking.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/staging/lustre/lustre/include/lustre_dlm.h |   10 +
 drivers/staging/lustre/lustre/ldlm/ldlm_request.c  |   31 +-
 drivers/staging/lustre/lustre/ldlm/ldlm_resource.c |  268 +++++++++-----------
 drivers/staging/lustre/lustre/osc/osc_request.c    |   12 -
 4 files changed, 138 insertions(+), 183 deletions(-)

diff --git a/drivers/staging/lustre/lustre/include/lustre_dlm.h b/drivers/staging/lustre/lustre/include/lustre_dlm.h
index 500dda854564..b60170a11d26 100644
--- a/drivers/staging/lustre/lustre/include/lustre_dlm.h
+++ b/drivers/staging/lustre/lustre/include/lustre_dlm.h
@@ -368,7 +368,7 @@ struct ldlm_namespace {
 	char			*ns_name;
 
 	/** Resource hash table for namespace. */
-	struct cfs_hash		*ns_rs_hash;
+	struct rhashtable	ns_rs_hash;
 	struct ldlm_ns_bucket	*ns_rs_buckets;
 	unsigned int		ns_bucket_bits;
 
@@ -828,9 +828,8 @@ struct ldlm_resource {
 
 	/**
 	 * List item for list in namespace hash.
-	 * protected by ns_lock
 	 */
-	struct hlist_node	lr_hash;
+	struct rhash_head	lr_hash;
 
 	/** Spinlock to protect locks under this resource. */
 	spinlock_t		lr_lock;
@@ -874,8 +873,13 @@ struct ldlm_resource {
 	struct lu_ref		lr_reference;
 
 	struct inode		*lr_lvb_inode;
+	struct rcu_head		lr_rcu;
 };
 
+void ldlm_resource_for_each(struct ldlm_namespace *ns,
+			    int cb(struct ldlm_resource *res, void *data),
+			    void *data);
+
 static inline bool ldlm_has_layout(struct ldlm_lock *lock)
 {
 	return lock->l_resource->lr_type == LDLM_IBITS &&
diff --git a/drivers/staging/lustre/lustre/ldlm/ldlm_request.c b/drivers/staging/lustre/lustre/ldlm/ldlm_request.c
index f573de9cf45d..1daeaf76ef89 100644
--- a/drivers/staging/lustre/lustre/ldlm/ldlm_request.c
+++ b/drivers/staging/lustre/lustre/ldlm/ldlm_request.c
@@ -1680,11 +1680,8 @@ struct ldlm_cli_cancel_arg {
 	void   *lc_opaque;
 };
 
-static int ldlm_cli_hash_cancel_unused(struct cfs_hash *hs,
-				       struct cfs_hash_bd *bd,
-				       struct hlist_node *hnode, void *arg)
+static int ldlm_cli_hash_cancel_unused(struct ldlm_resource *res, void *arg)
 {
-	struct ldlm_resource	   *res = cfs_hash_object(hs, hnode);
 	struct ldlm_cli_cancel_arg     *lc = arg;
 
 	ldlm_cli_cancel_unused_resource(ldlm_res_to_ns(res), &res->lr_name,
@@ -1718,8 +1715,7 @@ int ldlm_cli_cancel_unused(struct ldlm_namespace *ns,
 						       LCK_MINMODE, flags,
 						       opaque);
 	} else {
-		cfs_hash_for_each_nolock(ns->ns_rs_hash,
-					 ldlm_cli_hash_cancel_unused, &arg, 0);
+		ldlm_resource_for_each(ns, ldlm_cli_hash_cancel_unused, &arg);
 		return ELDLM_OK;
 	}
 }
@@ -1768,27 +1764,21 @@ static int ldlm_iter_helper(struct ldlm_lock *lock, void *closure)
 	return helper->iter(lock, helper->closure);
 }
 
-static int ldlm_res_iter_helper(struct cfs_hash *hs, struct cfs_hash_bd *bd,
-				struct hlist_node *hnode, void *arg)
-
+static int ldlm_res_iter_helper(struct ldlm_resource *res, void *arg)
 {
-	struct ldlm_resource *res = cfs_hash_object(hs, hnode);
-
 	return ldlm_resource_foreach(res, ldlm_iter_helper, arg) ==
 	       LDLM_ITER_STOP;
 }
 
 static void ldlm_namespace_foreach(struct ldlm_namespace *ns,
 				   ldlm_iterator_t iter, void *closure)
-
 {
 	struct iter_helper_data helper = {
 		.iter		= iter,
 		.closure	= closure,
 	};
 
-	cfs_hash_for_each_nolock(ns->ns_rs_hash,
-				 ldlm_res_iter_helper, &helper, 0);
+	ldlm_resource_for_each(ns, ldlm_res_iter_helper, &helper);
 }
 
 /* non-blocking function to manipulate a lock whose cb_data is being put away.
@@ -1823,11 +1813,14 @@ static int ldlm_chain_lock_for_replay(struct ldlm_lock *lock, void *closure)
 {
 	struct list_head *list = closure;
 
-	/* we use l_pending_chain here, because it's unused on clients. */
-	LASSERTF(list_empty(&lock->l_pending_chain),
-		 "lock %p next %p prev %p\n",
-		 lock, &lock->l_pending_chain.next,
-		 &lock->l_pending_chain.prev);
+	/*
+	 * We use l_pending_chain here, because it's unused on clients.
+	 * As rhashtable_walk_next() can repeat elements when a resize event
+	 * happens, we skip locks that have already been added to the chain
+	 */
+	if (!list_empty(&lock->l_pending_chain))
+		return LDLM_ITER_CONTINUE;
+
 	/* bug 9573: don't replay locks left after eviction, or
 	 * bug 17614: locks being actively cancelled. Get a reference
 	 * on a lock so that it does not disappear under us (e.g. due to cancel)
diff --git a/drivers/staging/lustre/lustre/ldlm/ldlm_resource.c b/drivers/staging/lustre/lustre/ldlm/ldlm_resource.c
index 4288a81fd62b..b30ad212e967 100644
--- a/drivers/staging/lustre/lustre/ldlm/ldlm_resource.c
+++ b/drivers/staging/lustre/lustre/ldlm/ldlm_resource.c
@@ -62,6 +62,7 @@ static LIST_HEAD(ldlm_cli_inactive_namespace_list);
 static struct dentry *ldlm_debugfs_dir;
 static struct dentry *ldlm_ns_debugfs_dir;
 struct dentry *ldlm_svc_debugfs_dir;
+static void __ldlm_resource_putref_final(struct ldlm_resource *res);
 
 /* during debug dump certain amount of granted locks for one resource to avoid
  * DDOS.
@@ -453,28 +454,6 @@ static int ldlm_namespace_debugfs_register(struct ldlm_namespace *ns)
 
 #undef MAX_STRING_SIZE
 
-static struct ldlm_resource *ldlm_resource_getref(struct ldlm_resource *res)
-{
-	LASSERT(res);
-	LASSERT(res != LP_POISON);
-	atomic_inc(&res->lr_refcount);
-	CDEBUG(D_INFO, "getref res: %p count: %d\n", res,
-	       atomic_read(&res->lr_refcount));
-	return res;
-}
-
-static unsigned int ldlm_res_hop_hash(struct cfs_hash *hs,
-				      const void *key, unsigned int mask)
-{
-	const struct ldlm_res_id     *id  = key;
-	unsigned int		val = 0;
-	unsigned int		i;
-
-	for (i = 0; i < RES_NAME_SIZE; i++)
-		val += id->name[i];
-	return val & mask;
-}
-
 static unsigned int ldlm_res_hop_fid_hash(const struct ldlm_res_id *id, unsigned int bits)
 {
 	struct lu_fid       fid;
@@ -496,84 +475,52 @@ static unsigned int ldlm_res_hop_fid_hash(const struct ldlm_res_id *id, unsigned
 	return hash_32(hash, bits);
 }
 
-static void *ldlm_res_hop_key(struct hlist_node *hnode)
-{
-	struct ldlm_resource   *res;
-
-	res = hlist_entry(hnode, struct ldlm_resource, lr_hash);
-	return &res->lr_name;
-}
-
-static int ldlm_res_hop_keycmp(const void *key, struct hlist_node *hnode)
-{
-	struct ldlm_resource   *res;
-
-	res = hlist_entry(hnode, struct ldlm_resource, lr_hash);
-	return ldlm_res_eq((const struct ldlm_res_id *)key,
-			   (const struct ldlm_res_id *)&res->lr_name);
-}
-
-static void *ldlm_res_hop_object(struct hlist_node *hnode)
-{
-	return hlist_entry(hnode, struct ldlm_resource, lr_hash);
-}
-
-static void ldlm_res_hop_get_locked(struct cfs_hash *hs,
-				    struct hlist_node *hnode)
-{
-	struct ldlm_resource *res;
-
-	res = hlist_entry(hnode, struct ldlm_resource, lr_hash);
-	ldlm_resource_getref(res);
-}
-
-static void ldlm_res_hop_put(struct cfs_hash *hs, struct hlist_node *hnode)
+static int rs_cmp(struct rhashtable_compare_arg *arg, const void *obj)
 {
-	struct ldlm_resource *res;
+	/*
+	 * Don't allow entries with lr_refcount==0 to be found.
+	 * rhashtable_remove doesn't use this function, so they
+	 * can still be deleted.
+	 */
+	const struct ldlm_res_id *name = arg->key;
+	const struct ldlm_resource *res = obj;
 
-	res = hlist_entry(hnode, struct ldlm_resource, lr_hash);
-	ldlm_resource_putref(res);
+	if (!ldlm_res_eq(name, &res->lr_name))
+		return -ESRCH;
+	return atomic_read(&res->lr_refcount) > 0 ? 0 : -EBUSY;
 }
 
-static struct cfs_hash_ops ldlm_ns_hash_ops = {
-	.hs_hash	= ldlm_res_hop_hash,
-	.hs_key		= ldlm_res_hop_key,
-	.hs_keycmp      = ldlm_res_hop_keycmp,
-	.hs_keycpy      = NULL,
-	.hs_object      = ldlm_res_hop_object,
-	.hs_get		= ldlm_res_hop_get_locked,
-	.hs_put		= ldlm_res_hop_put
+static const struct rhashtable_params ns_rs_hash_params = {
+	.key_len	= sizeof(struct ldlm_res_id),
+	.key_offset	= offsetof(struct ldlm_resource, lr_name),
+	.head_offset	= offsetof(struct ldlm_resource, lr_hash),
+	.obj_cmpfn	= rs_cmp,
+/* automatic_shrinking cannot be enabled until a bug
+ * in rhashtable_walk_start() is fixed
+	.automatic_shrinking = true,
+ */
 };
 
 static struct {
-	/** hash bucket bits */
 	unsigned int	nsd_bkt_bits;
-	/** hash bits */
-	unsigned int	nsd_all_bits;
 } ldlm_ns_hash_defs[] = {
 	[LDLM_NS_TYPE_MDC] = {
-		.nsd_bkt_bits   = 11,
-		.nsd_all_bits   = 16,
+		.nsd_bkt_bits   = 5,
 	},
 	[LDLM_NS_TYPE_MDT] = {
-		.nsd_bkt_bits   = 14,
-		.nsd_all_bits   = 21,
+		.nsd_bkt_bits   = 7,
 	},
 	[LDLM_NS_TYPE_OSC] = {
-		.nsd_bkt_bits   = 8,
-		.nsd_all_bits   = 12,
+		.nsd_bkt_bits   = 4,
 	},
 	[LDLM_NS_TYPE_OST] = {
-		.nsd_bkt_bits   = 11,
-		.nsd_all_bits   = 17,
+		.nsd_bkt_bits   = 6,
 	},
 	[LDLM_NS_TYPE_MGC] = {
-		.nsd_bkt_bits   = 3,
-		.nsd_all_bits   = 4,
+		.nsd_bkt_bits   = 1,
 	},
 	[LDLM_NS_TYPE_MGT] = {
-		.nsd_bkt_bits   = 3,
-		.nsd_all_bits   = 4,
+		.nsd_bkt_bits   = 1,
 	},
 };
 
@@ -618,23 +565,11 @@ struct ldlm_namespace *ldlm_namespace_new(struct obd_device *obd, char *name,
 	if (!ns)
 		goto out_ref;
 
-	ns->ns_rs_hash = cfs_hash_create(name,
-					 ldlm_ns_hash_defs[ns_type].nsd_all_bits,
-					 ldlm_ns_hash_defs[ns_type].nsd_all_bits,
-					 ldlm_ns_hash_defs[ns_type].nsd_bkt_bits,
-					 0,
-					 CFS_HASH_MIN_THETA,
-					 CFS_HASH_MAX_THETA,
-					 &ldlm_ns_hash_ops,
-					 CFS_HASH_DEPTH |
-					 CFS_HASH_BIGNAME |
-					 CFS_HASH_SPIN_BKTLOCK |
-					 CFS_HASH_NO_ITEMREF);
-	if (!ns->ns_rs_hash)
-		goto out_ns;
-	ns->ns_bucket_bits = ldlm_ns_hash_defs[ns_type].nsd_all_bits -
-			      ldlm_ns_hash_defs[ns_type].nsd_bkt_bits;
+	rc = rhashtable_init(&ns->ns_rs_hash, &ns_rs_hash_params);
 
+	if (rc)
+		goto out_ns;
+	ns->ns_bucket_bits = ldlm_ns_hash_defs[ns_type].nsd_bkt_bits;
 	ns->ns_rs_buckets = kvmalloc_array(1 << ns->ns_bucket_bits,
 					   sizeof(ns->ns_rs_buckets[0]),
 					   GFP_KERNEL);
@@ -698,7 +633,7 @@ struct ldlm_namespace *ldlm_namespace_new(struct obd_device *obd, char *name,
 out_hash:
 	kfree(ns->ns_name);
 	kvfree(ns->ns_rs_buckets);
-	cfs_hash_putref(ns->ns_rs_hash);
+	rhashtable_destroy(&ns->ns_rs_hash);
 out_ns:
 	kfree(ns);
 out_ref:
@@ -786,23 +721,18 @@ static void cleanup_resource(struct ldlm_resource *res, struct list_head *q,
 	} while (1);
 }
 
-static int ldlm_resource_clean(struct cfs_hash *hs, struct cfs_hash_bd *bd,
-			       struct hlist_node *hnode, void *arg)
+static int ldlm_resource_clean(struct ldlm_resource *res, void *arg)
 {
-	struct ldlm_resource *res = cfs_hash_object(hs, hnode);
-	__u64 flags = *(__u64 *)arg;
+	__u64 *flags = arg;
 
-	cleanup_resource(res, &res->lr_granted, flags);
-	cleanup_resource(res, &res->lr_waiting, flags);
+	cleanup_resource(res, &res->lr_granted, *flags);
+	cleanup_resource(res, &res->lr_waiting, *flags);
 
 	return 0;
 }
 
-static int ldlm_resource_complain(struct cfs_hash *hs, struct cfs_hash_bd *bd,
-				  struct hlist_node *hnode, void *arg)
+static int ldlm_resource_complain(struct ldlm_resource *res, void *arg)
 {
-	struct ldlm_resource  *res = cfs_hash_object(hs, hnode);
-
 	lock_res(res);
 	CERROR("%s: namespace resource " DLDLMRES
 	       " (%p) refcount nonzero (%d) after lock cleanup; forcing cleanup.\n",
@@ -814,6 +744,39 @@ static int ldlm_resource_complain(struct cfs_hash *hs, struct cfs_hash_bd *bd,
 	return 0;
 }
 
+void ldlm_resource_for_each(struct ldlm_namespace *ns,
+			    int cb(struct ldlm_resource *res, void *data),
+			    void *data)
+{
+	struct rhashtable_iter iter;
+	struct ldlm_resource *res, *to_put = NULL;
+
+	rhashtable_walk_enter(&ns->ns_rs_hash, &iter);
+	rhashtable_walk_start(&iter);
+	while ((res = rhashtable_walk_next(&iter)) != NULL) {
+		if (IS_ERR(res))
+			continue;
+		if (!atomic_inc_not_zero(&res->lr_refcount))
+			continue;
+		rhashtable_walk_stop(&iter);
+		if (to_put) {
+			__ldlm_resource_putref_final(to_put);
+			to_put = NULL;
+		}
+		if (cb(res, data)) {
+			ldlm_resource_putref(res);
+			goto exit;
+		}
+		rhashtable_walk_start(&iter);
+		if (atomic_dec_and_test(&res->lr_refcount))
+			to_put = res;
+	}
+	rhashtable_walk_stop(&iter);
+exit:
+	rhashtable_walk_exit(&iter);
+	if (to_put)
+		__ldlm_resource_putref_final(to_put);
+}
 /**
  * Cancel and destroy all locks in the namespace.
  *
@@ -828,10 +791,9 @@ int ldlm_namespace_cleanup(struct ldlm_namespace *ns, __u64 flags)
 		return ELDLM_OK;
 	}
 
-	cfs_hash_for_each_nolock(ns->ns_rs_hash, ldlm_resource_clean,
-				 &flags, 0);
-	cfs_hash_for_each_nolock(ns->ns_rs_hash, ldlm_resource_complain,
-				 NULL, 0);
+	ldlm_resource_for_each(ns, ldlm_resource_clean, &flags);
+	ldlm_resource_for_each(ns, ldlm_resource_complain, NULL);
+
 	return ELDLM_OK;
 }
 EXPORT_SYMBOL(ldlm_namespace_cleanup);
@@ -960,7 +922,7 @@ void ldlm_namespace_free_post(struct ldlm_namespace *ns)
 
 	ldlm_namespace_debugfs_unregister(ns);
 	ldlm_namespace_sysfs_unregister(ns);
-	cfs_hash_putref(ns->ns_rs_hash);
+	rhashtable_destroy(&ns->ns_rs_hash);
 	kvfree(ns->ns_rs_buckets);
 	kfree(ns->ns_name);
 	/* Namespace \a ns should be not on list at this time, otherwise
@@ -1062,27 +1024,22 @@ ldlm_resource_get(struct ldlm_namespace *ns, struct ldlm_resource *parent,
 		  const struct ldlm_res_id *name, enum ldlm_type type,
 		  int create)
 {
-	struct hlist_node     *hnode;
 	struct ldlm_resource *res = NULL;
-	struct cfs_hash_bd	 bd;
-	__u64		 version;
+	struct ldlm_resource *res2;
 	int		      ns_refcount = 0;
 	int rc;
 	int hash;
 
 	LASSERT(!parent);
-	LASSERT(ns->ns_rs_hash);
 	LASSERT(name->name[0] != 0);
 
-	cfs_hash_bd_get_and_lock(ns->ns_rs_hash, (void *)name, &bd, 0);
-	hnode = cfs_hash_bd_lookup_locked(ns->ns_rs_hash, &bd, (void *)name);
-	if (hnode) {
-		cfs_hash_bd_unlock(ns->ns_rs_hash, &bd, 0);
+	rcu_read_lock();
+	res = rhashtable_lookup(&ns->ns_rs_hash, name, ns_rs_hash_params);
+	if (res && atomic_inc_not_zero(&res->lr_refcount)) {
+		rcu_read_unlock();
 		goto lvbo_init;
 	}
-
-	version = cfs_hash_bd_version_get(&bd);
-	cfs_hash_bd_unlock(ns->ns_rs_hash, &bd, 0);
+	rcu_read_unlock();
 
 	if (create == 0)
 		return ERR_PTR(-ENOENT);
@@ -1098,20 +1055,30 @@ ldlm_resource_get(struct ldlm_namespace *ns, struct ldlm_resource *parent,
 	res->lr_name       = *name;
 	res->lr_type       = type;
 
-	cfs_hash_bd_lock(ns->ns_rs_hash, &bd, 1);
-	hnode = (version == cfs_hash_bd_version_get(&bd)) ?  NULL :
-		cfs_hash_bd_lookup_locked(ns->ns_rs_hash, &bd, (void *)name);
-
-	if (hnode) {
+	/*
+	 * If we find an existing entry with a refcount of zero, we need to
+	 * try again.
+	 */
+	rcu_read_lock();
+	do {
+		res2 = rhashtable_lookup_get_insert_fast(&ns->ns_rs_hash, &res->lr_hash,
+							 ns_rs_hash_params);
+	} while (!IS_ERR_OR_NULL(res2) && !atomic_inc_not_zero(&res2->lr_refcount));
+	rcu_read_unlock();
+	if (res2) {
+		/* Insertion failed: an error occurred or              */
 		/* Someone won the race and already added the resource. */
-		cfs_hash_bd_unlock(ns->ns_rs_hash, &bd, 1);
+
 		/* Clean lu_ref for failed resource. */
 		lu_ref_fini(&res->lr_reference);
 		/* We have taken lr_lvb_mutex. Drop it. */
 		mutex_unlock(&res->lr_lvb_mutex);
 		kmem_cache_free(ldlm_resource_slab, res);
+
+		if (IS_ERR(res2))
+			return ERR_PTR(-ENOMEM);
+		res = res2;
 lvbo_init:
-		res = hlist_entry(hnode, struct ldlm_resource, lr_hash);
 		/* Synchronize with regard to resource creation. */
 		if (ns->ns_lvbo && ns->ns_lvbo->lvbo_init) {
 			mutex_lock(&res->lr_lvb_mutex);
@@ -1125,12 +1092,9 @@ ldlm_resource_get(struct ldlm_namespace *ns, struct ldlm_resource *parent,
 		}
 		return res;
 	}
-	/* We won! Let's add the resource. */
-	cfs_hash_bd_add_locked(ns->ns_rs_hash, &bd, &res->lr_hash);
 	if (atomic_inc_return(&res->lr_ns_bucket->nsb_count) == 1)
 		ns_refcount = ldlm_namespace_get_return(ns);
 
-	cfs_hash_bd_unlock(ns->ns_rs_hash, &bd, 1);
 	if (ns->ns_lvbo && ns->ns_lvbo->lvbo_init) {
 		OBD_FAIL_TIMEOUT(OBD_FAIL_LDLM_CREATE_RESOURCE, 2);
 		rc = ns->ns_lvbo->lvbo_init(res);
@@ -1163,8 +1127,14 @@ ldlm_resource_get(struct ldlm_namespace *ns, struct ldlm_resource *parent,
 }
 EXPORT_SYMBOL(ldlm_resource_get);
 
-static void __ldlm_resource_putref_final(struct cfs_hash_bd *bd,
-					 struct ldlm_resource *res)
+static void ldlm_resource_free(struct rcu_head *rcu)
+{
+	struct ldlm_resource *res = container_of(rcu, struct ldlm_resource,
+						 lr_rcu);
+	kmem_cache_free(ldlm_resource_slab, res);
+}
+
+static void __ldlm_resource_putref_final(struct ldlm_resource *res)
 {
 	struct ldlm_ns_bucket *nsb = res->lr_ns_bucket;
 	struct ldlm_namespace *ns = nsb->nsb_namespace;
@@ -1179,29 +1149,24 @@ static void __ldlm_resource_putref_final(struct cfs_hash_bd *bd,
 		LBUG();
 	}
 
-	cfs_hash_bd_del_locked(ns->ns_rs_hash,
-			       bd, &res->lr_hash);
+	rhashtable_remove_fast(&ns->ns_rs_hash,
+			       &res->lr_hash, ns_rs_hash_params);
 	lu_ref_fini(&res->lr_reference);
-	cfs_hash_bd_unlock(ns->ns_rs_hash, bd, 1);
 	if (ns->ns_lvbo && ns->ns_lvbo->lvbo_free)
 		ns->ns_lvbo->lvbo_free(res);
 	if (atomic_dec_and_test(&nsb->nsb_count))
 		ldlm_namespace_put(ns);
-	kmem_cache_free(ldlm_resource_slab, res);
+	call_rcu(&res->lr_rcu, ldlm_resource_free);
 }
 
 void ldlm_resource_putref(struct ldlm_resource *res)
 {
-	struct ldlm_namespace *ns = ldlm_res_to_ns(res);
-	struct cfs_hash_bd   bd;
-
 	LASSERT_ATOMIC_GT_LT(&res->lr_refcount, 0, LI_POISON);
 	CDEBUG(D_INFO, "putref res: %p count: %d\n",
 	       res, atomic_read(&res->lr_refcount) - 1);
 
-	cfs_hash_bd_get(ns->ns_rs_hash, &res->lr_name, &bd);
-	if (cfs_hash_bd_dec_and_lock(ns->ns_rs_hash, &bd, &res->lr_refcount))
-		__ldlm_resource_putref_final(&bd, res);
+	if (atomic_dec_and_test(&res->lr_refcount))
+		__ldlm_resource_putref_final(res);
 }
 EXPORT_SYMBOL(ldlm_resource_putref);
 
@@ -1263,14 +1228,12 @@ void ldlm_dump_all_namespaces(enum ldlm_side client, int level)
 	mutex_unlock(ldlm_namespace_lock(client));
 }
 
-static int ldlm_res_hash_dump(struct cfs_hash *hs, struct cfs_hash_bd *bd,
-			      struct hlist_node *hnode, void *arg)
+static int ldlm_res_hash_dump(struct ldlm_resource *res, void *arg)
 {
-	struct ldlm_resource *res = cfs_hash_object(hs, hnode);
-	int    level = (int)(unsigned long)arg;
+	int    *level = arg;
 
 	lock_res(res);
-	ldlm_resource_dump(level, res);
+	ldlm_resource_dump(*level, res);
 	unlock_res(res);
 
 	return 0;
@@ -1291,9 +1254,8 @@ void ldlm_namespace_dump(int level, struct ldlm_namespace *ns)
 	if (time_before(jiffies, ns->ns_next_dump))
 		return;
 
-	cfs_hash_for_each_nolock(ns->ns_rs_hash,
-				 ldlm_res_hash_dump,
-				 (void *)(unsigned long)level, 0);
+	ldlm_resource_for_each(ns, ldlm_res_hash_dump, &level);
+
 	spin_lock(&ns->ns_lock);
 	ns->ns_next_dump = jiffies + 10 * HZ;
 	spin_unlock(&ns->ns_lock);
diff --git a/drivers/staging/lustre/lustre/osc/osc_request.c b/drivers/staging/lustre/lustre/osc/osc_request.c
index 0038e555e905..06dc52dfb671 100644
--- a/drivers/staging/lustre/lustre/osc/osc_request.c
+++ b/drivers/staging/lustre/lustre/osc/osc_request.c
@@ -2479,13 +2479,10 @@ static int osc_disconnect(struct obd_export *exp)
 	return rc;
 }
 
-static int osc_ldlm_resource_invalidate(struct cfs_hash *hs,
-					struct cfs_hash_bd *bd,
-					struct hlist_node *hnode, void *arg)
+static int osc_ldlm_resource_invalidate(struct ldlm_resource *res, void *data)
 {
-	struct ldlm_resource *res = cfs_hash_object(hs, hnode);
+	struct lu_env *env = data;
 	struct osc_object *osc = NULL;
-	struct lu_env *env = arg;
 	struct ldlm_lock *lock;
 
 	lock_res(res);
@@ -2545,9 +2542,8 @@ static int osc_import_event(struct obd_device *obd,
 		if (!IS_ERR(env)) {
 			osc_io_unplug(env, &obd->u.cli, NULL);
 
-			cfs_hash_for_each_nolock(ns->ns_rs_hash,
-						 osc_ldlm_resource_invalidate,
-						 env, 0);
+			ldlm_resource_for_each(ns, osc_ldlm_resource_invalidate, env);
+
 			cl_env_put(env, &refcheck);
 
 			ldlm_namespace_cleanup(ns, LDLM_FL_LOCAL_ONLY);

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 08/20] staging: lustre: simplify ldlm_ns_hash_defs[]
  2018-04-11 21:54 [PATCH 00/20] staging: lustre: convert to rhashtable NeilBrown
                   ` (7 preceding siblings ...)
  2018-04-11 21:54 ` [PATCH 12/20] staging: lustre: lu_object: factor out extra per-bucket data NeilBrown
@ 2018-04-11 21:54 ` NeilBrown
  2018-04-11 21:54 ` [PATCH 01/20] staging: lustre: ptlrpc: convert conn_hash to rhashtable NeilBrown
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2018-04-11 21:54 UTC (permalink / raw)
  To: Oleg Drokin, Greg Kroah-Hartman, James Simmons, Andreas Dilger
  Cc: Linux Kernel Mailing List, Lustre Development List

As the ldlm_ns_types are dense, we can use the type as
the index to the array, rather than searching through
the array for a match.
We can also discard nsd_hops as all hash tables now
use the same hops.
This makes the table smaller and the code simpler.
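
For illustration, a minimal sketch of the lookup-by-index pattern (the
names and numbers below are invented, not the lustre ones):

#include <linux/kernel.h>	/* ARRAY_SIZE() */
#include <linux/errno.h>

enum demo_ns_type { DEMO_NS_MDC, DEMO_NS_MDT, DEMO_NS_OSC };

static const struct {
	unsigned int bkt_bits;
	unsigned int all_bits;
} demo_ns_defs[] = {
	/* the enum value is the array index, so no search loop is needed */
	[DEMO_NS_MDC] = { .bkt_bits = 11, .all_bits = 16 },
	[DEMO_NS_OSC] = { .bkt_bits = 8,  .all_bits = 12 },
};

static int demo_ns_bits(enum demo_ns_type type, unsigned int *all_bits)
{
	/* a type without an initializer leaves a zeroed slot: treat as unknown */
	if (type >= ARRAY_SIZE(demo_ns_defs) ||
	    demo_ns_defs[type].bkt_bits == 0)
		return -EINVAL;
	*all_bits = demo_ns_defs[type].all_bits;
	return 0;
}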

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/staging/lustre/lustre/ldlm/ldlm_resource.c |   62 ++++++--------------
 1 file changed, 20 insertions(+), 42 deletions(-)

diff --git a/drivers/staging/lustre/lustre/ldlm/ldlm_resource.c b/drivers/staging/lustre/lustre/ldlm/ldlm_resource.c
index a14cc12303ab..4288a81fd62b 100644
--- a/drivers/staging/lustre/lustre/ldlm/ldlm_resource.c
+++ b/drivers/staging/lustre/lustre/ldlm/ldlm_resource.c
@@ -545,55 +545,35 @@ static struct cfs_hash_ops ldlm_ns_hash_ops = {
 	.hs_put		= ldlm_res_hop_put
 };
 
-struct ldlm_ns_hash_def {
-	enum ldlm_ns_type nsd_type;
+static struct {
 	/** hash bucket bits */
 	unsigned int	nsd_bkt_bits;
 	/** hash bits */
 	unsigned int	nsd_all_bits;
-	/** hash operations */
-	struct cfs_hash_ops *nsd_hops;
-};
-
-static struct ldlm_ns_hash_def ldlm_ns_hash_defs[] = {
-	{
-		.nsd_type       = LDLM_NS_TYPE_MDC,
+} ldlm_ns_hash_defs[] = {
+	[LDLM_NS_TYPE_MDC] = {
 		.nsd_bkt_bits   = 11,
 		.nsd_all_bits   = 16,
-		.nsd_hops       = &ldlm_ns_hash_ops,
 	},
-	{
-		.nsd_type       = LDLM_NS_TYPE_MDT,
+	[LDLM_NS_TYPE_MDT] = {
 		.nsd_bkt_bits   = 14,
 		.nsd_all_bits   = 21,
-		.nsd_hops       = &ldlm_ns_hash_ops,
 	},
-	{
-		.nsd_type       = LDLM_NS_TYPE_OSC,
+	[LDLM_NS_TYPE_OSC] = {
 		.nsd_bkt_bits   = 8,
 		.nsd_all_bits   = 12,
-		.nsd_hops       = &ldlm_ns_hash_ops,
 	},
-	{
-		.nsd_type       = LDLM_NS_TYPE_OST,
+	[LDLM_NS_TYPE_OST] = {
 		.nsd_bkt_bits   = 11,
 		.nsd_all_bits   = 17,
-		.nsd_hops       = &ldlm_ns_hash_ops,
 	},
-	{
-		.nsd_type       = LDLM_NS_TYPE_MGC,
+	[LDLM_NS_TYPE_MGC] = {
 		.nsd_bkt_bits   = 3,
 		.nsd_all_bits   = 4,
-		.nsd_hops       = &ldlm_ns_hash_ops,
 	},
-	{
-		.nsd_type       = LDLM_NS_TYPE_MGT,
+	[LDLM_NS_TYPE_MGT] = {
 		.nsd_bkt_bits   = 3,
 		.nsd_all_bits   = 4,
-		.nsd_hops       = &ldlm_ns_hash_ops,
-	},
-	{
-		.nsd_type       = LDLM_NS_TYPE_UNKNOWN,
 	},
 };
 
@@ -617,7 +597,6 @@ struct ldlm_namespace *ldlm_namespace_new(struct obd_device *obd, char *name,
 					  enum ldlm_ns_type ns_type)
 {
 	struct ldlm_namespace *ns = NULL;
-	struct ldlm_ns_hash_def    *nsd;
 	int		    idx;
 	int		    rc;
 
@@ -629,15 +608,10 @@ struct ldlm_namespace *ldlm_namespace_new(struct obd_device *obd, char *name,
 		return NULL;
 	}
 
-	for (idx = 0;; idx++) {
-		nsd = &ldlm_ns_hash_defs[idx];
-		if (nsd->nsd_type == LDLM_NS_TYPE_UNKNOWN) {
-			CERROR("Unknown type %d for ns %s\n", ns_type, name);
-			goto out_ref;
-		}
-
-		if (nsd->nsd_type == ns_type)
-			break;
+	if (ns_type >= ARRAY_SIZE(ldlm_ns_hash_defs) ||
+	    ldlm_ns_hash_defs[ns_type].nsd_bkt_bits == 0) {
+		CERROR("Unknown type %d for ns %s\n", ns_type, name);
+		goto out_ref;
 	}
 
 	ns = kzalloc(sizeof(*ns), GFP_NOFS);
@@ -645,18 +619,22 @@ struct ldlm_namespace *ldlm_namespace_new(struct obd_device *obd, char *name,
 		goto out_ref;
 
 	ns->ns_rs_hash = cfs_hash_create(name,
-					 nsd->nsd_all_bits, nsd->nsd_all_bits,
-					 nsd->nsd_bkt_bits, 0,
+					 ldlm_ns_hash_defs[ns_type].nsd_all_bits,
+					 ldlm_ns_hash_defs[ns_type].nsd_all_bits,
+					 ldlm_ns_hash_defs[ns_type].nsd_bkt_bits,
+					 0,
 					 CFS_HASH_MIN_THETA,
 					 CFS_HASH_MAX_THETA,
-					 nsd->nsd_hops,
+					 &ldlm_ns_hash_ops,
 					 CFS_HASH_DEPTH |
 					 CFS_HASH_BIGNAME |
 					 CFS_HASH_SPIN_BKTLOCK |
 					 CFS_HASH_NO_ITEMREF);
 	if (!ns->ns_rs_hash)
 		goto out_ns;
-	ns->ns_bucket_bits = nsd->nsd_all_bits - nsd->nsd_bkt_bits;
+	ns->ns_bucket_bits = ldlm_ns_hash_defs[ns_type].nsd_all_bits -
+			      ldlm_ns_hash_defs[ns_type].nsd_bkt_bits;
+
 	ns->ns_rs_buckets = kvmalloc_array(1 << ns->ns_bucket_bits,
 					   sizeof(ns->ns_rs_buckets[0]),
 					   GFP_KERNEL);

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 07/20] staging: lustre: ldlm: store name directly in namespace.
  2018-04-11 21:54 [PATCH 00/20] staging: lustre: convert to rhashtable NeilBrown
                   ` (3 preceding siblings ...)
  2018-04-11 21:54 ` [PATCH 09/20] staging: lustre: convert ldlm_resource hash to rhashtable NeilBrown
@ 2018-04-11 21:54 ` NeilBrown
  2018-04-11 21:54 ` [PATCH 10/20] staging: lustre: make struct lu_site_bkt_data private NeilBrown
                   ` (16 subsequent siblings)
  21 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2018-04-11 21:54 UTC (permalink / raw)
  To: Oleg Drokin, Greg Kroah-Hartman, James Simmons, Andreas Dilger
  Cc: Linux Kernel Mailing List, Lustre Development List

Rather than storing the name of a namespace in the
hash table, store it directly in the namespace.
This will allow the hashtable to be changed to use
rhashtable.
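
The pattern is simply a private copy owned by the namespace; a small
sketch with invented names (kfree(NULL) keeps the error paths simple):

#include <linux/slab.h>
#include <linux/string.h>
#include <linux/errno.h>

struct demo_ns {
	char *ns_name;
};

static int demo_ns_set_name(struct demo_ns *ns, const char *name)
{
	/* private copy: the name no longer lives in the hash table */
	ns->ns_name = kstrdup(name, GFP_KERNEL);
	return ns->ns_name ? 0 : -ENOMEM;
}

static void demo_ns_free_name(struct demo_ns *ns)
{
	kfree(ns->ns_name);	/* kfree(NULL) is a no-op */
	ns->ns_name = NULL;
}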

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/staging/lustre/lustre/include/lustre_dlm.h |    5 ++++-
 drivers/staging/lustre/lustre/ldlm/ldlm_resource.c |    5 +++++
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/staging/lustre/lustre/include/lustre_dlm.h b/drivers/staging/lustre/lustre/include/lustre_dlm.h
index ab90abe1c2d8..500dda854564 100644
--- a/drivers/staging/lustre/lustre/include/lustre_dlm.h
+++ b/drivers/staging/lustre/lustre/include/lustre_dlm.h
@@ -364,6 +364,9 @@ struct ldlm_namespace {
 	/** Flag indicating if namespace is on client instead of server */
 	enum ldlm_side		ns_client;
 
+	/** name of this namespace */
+	char			*ns_name;
+
 	/** Resource hash table for namespace. */
 	struct cfs_hash		*ns_rs_hash;
 	struct ldlm_ns_bucket	*ns_rs_buckets;
@@ -882,7 +885,7 @@ static inline bool ldlm_has_layout(struct ldlm_lock *lock)
 static inline char *
 ldlm_ns_name(struct ldlm_namespace *ns)
 {
-	return ns->ns_rs_hash->hs_name;
+	return ns->ns_name;
 }
 
 static inline struct ldlm_namespace *
diff --git a/drivers/staging/lustre/lustre/ldlm/ldlm_resource.c b/drivers/staging/lustre/lustre/ldlm/ldlm_resource.c
index 669556420bdf..a14cc12303ab 100644
--- a/drivers/staging/lustre/lustre/ldlm/ldlm_resource.c
+++ b/drivers/staging/lustre/lustre/ldlm/ldlm_resource.c
@@ -673,6 +673,9 @@ struct ldlm_namespace *ldlm_namespace_new(struct obd_device *obd, char *name,
 	ns->ns_obd      = obd;
 	ns->ns_appetite = apt;
 	ns->ns_client   = client;
+	ns->ns_name     = kstrdup(name, GFP_KERNEL);
+	if (!ns->ns_name)
+		goto out_hash;
 
 	INIT_LIST_HEAD(&ns->ns_list_chain);
 	INIT_LIST_HEAD(&ns->ns_unused_list);
@@ -715,6 +718,7 @@ struct ldlm_namespace *ldlm_namespace_new(struct obd_device *obd, char *name,
 	ldlm_namespace_sysfs_unregister(ns);
 	ldlm_namespace_cleanup(ns, 0);
 out_hash:
+	kfree(ns->ns_name);
 	kvfree(ns->ns_rs_buckets);
 	cfs_hash_putref(ns->ns_rs_hash);
 out_ns:
@@ -980,6 +984,7 @@ void ldlm_namespace_free_post(struct ldlm_namespace *ns)
 	ldlm_namespace_sysfs_unregister(ns);
 	cfs_hash_putref(ns->ns_rs_hash);
 	kvfree(ns->ns_rs_buckets);
+	kfree(ns->ns_name);
 	/* Namespace \a ns should be not on list at this time, otherwise
 	 * this will cause issues related to using freed \a ns in poold
 	 * thread.

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 06/20] staging: lustre: ldlm: add a counter to the per-namespace data
  2018-04-11 21:54 [PATCH 00/20] staging: lustre: convert to rhashtable NeilBrown
                   ` (9 preceding siblings ...)
  2018-04-11 21:54 ` [PATCH 01/20] staging: lustre: ptlrpc: convert conn_hash to rhashtable NeilBrown
@ 2018-04-11 21:54 ` NeilBrown
  2018-04-11 21:54 ` [PATCH 11/20] staging: lustre: lu_object: discard extra lru count NeilBrown
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2018-04-11 21:54 UTC (permalink / raw)
  To: Oleg Drokin, Greg Kroah-Hartman, James Simmons, Andreas Dilger
  Cc: Linux Kernel Mailing List, Lustre Development List

When we change the resource hash to rhashtable we won't have
a per-bucket counter.  We could use the nelems global counter,
but ldlm_resource goes to some trouble to avoid having any
table-wide atomics, and hopefully rhashtable will grow the
ability to disable the global counter in the near future.
Having a counter we control makes it easier to manage the
back-reference to the namespace when there is anything in the
hash table.

So add a counter to the ldlm_ns_bucket.
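
Roughly, it is the 0 <-> 1 transitions of this counter that drive the
back-reference; a sketch with invented names (serialisation against
namespace teardown is elided):

#include <linux/atomic.h>

struct demo_ns {
	atomic_t refcount;
};

struct demo_bucket {
	atomic_t count;
	struct demo_ns *ns;
};

static void demo_ns_get(struct demo_ns *ns)
{
	atomic_inc(&ns->refcount);
}

static void demo_ns_put(struct demo_ns *ns)
{
	atomic_dec(&ns->refcount);
}

static void demo_bucket_added_entry(struct demo_bucket *b)
{
	/* first entry in this bucket pins the namespace */
	if (atomic_inc_return(&b->count) == 1)
		demo_ns_get(b->ns);
}

static void demo_bucket_removed_entry(struct demo_bucket *b)
{
	/* last entry gone: drop the pin */
	if (atomic_dec_and_test(&b->count))
		demo_ns_put(b->ns);
}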

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/staging/lustre/lustre/include/lustre_dlm.h |    2 ++
 drivers/staging/lustre/lustre/ldlm/ldlm_resource.c |   10 +++++-----
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/staging/lustre/lustre/include/lustre_dlm.h b/drivers/staging/lustre/lustre/include/lustre_dlm.h
index 395d50160dcc..ab90abe1c2d8 100644
--- a/drivers/staging/lustre/lustre/include/lustre_dlm.h
+++ b/drivers/staging/lustre/lustre/include/lustre_dlm.h
@@ -309,6 +309,8 @@ struct ldlm_ns_bucket {
 	 * fact the network or overall system load is at fault
 	 */
 	struct adaptive_timeout     nsb_at_estimate;
+	/* counter of entries in this bucket */
+	atomic_t		nsb_count;
 };
 
 enum {
diff --git a/drivers/staging/lustre/lustre/ldlm/ldlm_resource.c b/drivers/staging/lustre/lustre/ldlm/ldlm_resource.c
index 927544f01adc..669556420bdf 100644
--- a/drivers/staging/lustre/lustre/ldlm/ldlm_resource.c
+++ b/drivers/staging/lustre/lustre/ldlm/ldlm_resource.c
@@ -181,12 +181,11 @@ static ssize_t resource_count_show(struct kobject *kobj, struct attribute *attr,
 	struct ldlm_namespace *ns = container_of(kobj, struct ldlm_namespace,
 						 ns_kobj);
 	__u64		  res = 0;
-	struct cfs_hash_bd	  bd;
 	int		    i;
 
 	/* result is not strictly consistent */
-	cfs_hash_for_each_bucket(ns->ns_rs_hash, &bd, i)
-		res += cfs_hash_bd_count_get(&bd);
+	for (i = 0; i < (1 << ns->ns_bucket_bits); i++)
+		res += atomic_read(&ns->ns_rs_buckets[i].nsb_count);
 	return sprintf(buf, "%lld\n", res);
 }
 LUSTRE_RO_ATTR(resource_count);
@@ -668,6 +667,7 @@ struct ldlm_namespace *ldlm_namespace_new(struct obd_device *obd, char *name,
 		struct ldlm_ns_bucket *nsb = &ns->ns_rs_buckets[idx];
 		at_init(&nsb->nsb_at_estimate, ldlm_enqueue_min, 0);
 		nsb->nsb_namespace = ns;
+		atomic_set(&nsb->nsb_count, 0);
 	}
 
 	ns->ns_obd      = obd;
@@ -1144,7 +1144,7 @@ ldlm_resource_get(struct ldlm_namespace *ns, struct ldlm_resource *parent,
 	}
 	/* We won! Let's add the resource. */
 	cfs_hash_bd_add_locked(ns->ns_rs_hash, &bd, &res->lr_hash);
-	if (cfs_hash_bd_count_get(&bd) == 1)
+	if (atomic_inc_return(&res->lr_ns_bucket->nsb_count) == 1)
 		ns_refcount = ldlm_namespace_get_return(ns);
 
 	cfs_hash_bd_unlock(ns->ns_rs_hash, &bd, 1);
@@ -1202,7 +1202,7 @@ static void __ldlm_resource_putref_final(struct cfs_hash_bd *bd,
 	cfs_hash_bd_unlock(ns->ns_rs_hash, bd, 1);
 	if (ns->ns_lvbo && ns->ns_lvbo->lvbo_free)
 		ns->ns_lvbo->lvbo_free(res);
-	if (cfs_hash_bd_count_get(bd) == 0)
+	if (atomic_dec_and_test(&nsb->nsb_count))
 		ldlm_namespace_put(ns);
 	kmem_cache_free(ldlm_resource_slab, res);
 }

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 05/20] staging: lustre: separate buckets from ldlm hash table
  2018-04-11 21:54 [PATCH 00/20] staging: lustre: convert to rhashtable NeilBrown
  2018-04-11 21:54 ` [PATCH 03/20] staging: lustre: convert obd uuid hash " NeilBrown
  2018-04-11 21:54 ` [PATCH 04/20] staging: lustre: convert osc_quota " NeilBrown
@ 2018-04-11 21:54 ` NeilBrown
  2018-04-11 21:54 ` [PATCH 09/20] staging: lustre: convert ldlm_resource hash to rhashtable NeilBrown
                   ` (18 subsequent siblings)
  21 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2018-04-11 21:54 UTC (permalink / raw)
  To: Oleg Drokin, Greg Kroah-Hartman, James Simmons, Andreas Dilger
  Cc: Linux Kernel Mailing List, Lustre Development List

ldlm maintains a per-namespace hashtable of resources.
With these hash tables it stores per-bucket 'struct adaptive_timeout'
structures.

Presumably having a single struct for the whole table results in too
much contention, while having one per resource results in very little
adaptation.

A future patch will change ldlm to use rhashtable which does not
support per-bucket data, so we need to manage the data separately.

There is no need for the multiple adaptive_timeout to align with the
hash chains, and trying to do this has resulted in a rather complex
hash function.
The purpose of ldlm_res_hop_fid_hash() appears to be to keep
resources with the same fid in the same hash bucket, so they use
the same adaptive timeout.  However, it fails at doing this
because it puts the fid-specific bits in the wrong part of the hash.
If that is not the purpose, then I can see no point to the
complexity.

This patch creates a completely separate array of adaptive timeouts
(and other less interesting data) and uses a hash of the fid to index
that, meaning that a simple hash can be used for the hash table.

In the previous code, two namespaces use the same value for
nsd_all_bits and nsd_bkt_bits.  This results in zero bits being
used to choose a bucket - so there is only one bucket.
This looks odd and would confuse hash_32(), so I've adjusted the
numbers so there is always at least 1 bit (2 buckets).
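
A rough sketch of the decoupled bucket array (invented names; the
payload stands in for the adaptive_timeout data, and hash_32() folds
the key down to exactly bucket_bits bits):

#include <linux/hash.h>
#include <linux/mm.h>		/* kvmalloc_array()/kvfree() */
#include <linux/errno.h>

struct demo_bucket {
	int payload;
};

struct demo_table {
	struct demo_bucket *buckets;
	unsigned int bucket_bits;
};

static int demo_table_init(struct demo_table *t, unsigned int bits)
{
	t->bucket_bits = bits;
	t->buckets = kvmalloc_array(1U << bits, sizeof(t->buckets[0]),
				    GFP_KERNEL | __GFP_ZERO);
	return t->buckets ? 0 : -ENOMEM;
}

static struct demo_bucket *demo_bucket_of(struct demo_table *t, u32 key)
{
	/* independent of however the resource hash table hashes its keys */
	return &t->buckets[hash_32(key, t->bucket_bits)];
}

static void demo_table_fini(struct demo_table *t)
{
	kvfree(t->buckets);
}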

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/staging/lustre/lustre/include/lustre_dlm.h |    2 +
 drivers/staging/lustre/lustre/ldlm/ldlm_resource.c |   53 ++++++++------------
 2 files changed, 23 insertions(+), 32 deletions(-)

diff --git a/drivers/staging/lustre/lustre/include/lustre_dlm.h b/drivers/staging/lustre/lustre/include/lustre_dlm.h
index d668d86423a4..395d50160dcc 100644
--- a/drivers/staging/lustre/lustre/include/lustre_dlm.h
+++ b/drivers/staging/lustre/lustre/include/lustre_dlm.h
@@ -364,6 +364,8 @@ struct ldlm_namespace {
 
 	/** Resource hash table for namespace. */
 	struct cfs_hash		*ns_rs_hash;
+	struct ldlm_ns_bucket	*ns_rs_buckets;
+	unsigned int		ns_bucket_bits;
 
 	/** serialize */
 	spinlock_t		ns_lock;
diff --git a/drivers/staging/lustre/lustre/ldlm/ldlm_resource.c b/drivers/staging/lustre/lustre/ldlm/ldlm_resource.c
index 6c615b6e9bdc..927544f01adc 100644
--- a/drivers/staging/lustre/lustre/ldlm/ldlm_resource.c
+++ b/drivers/staging/lustre/lustre/ldlm/ldlm_resource.c
@@ -476,10 +476,8 @@ static unsigned int ldlm_res_hop_hash(struct cfs_hash *hs,
 	return val & mask;
 }
 
-static unsigned int ldlm_res_hop_fid_hash(struct cfs_hash *hs,
-					  const void *key, unsigned int mask)
+static unsigned int ldlm_res_hop_fid_hash(const struct ldlm_res_id *id, unsigned int bits)
 {
-	const struct ldlm_res_id *id = key;
 	struct lu_fid       fid;
 	__u32	       hash;
 	__u32	       val;
@@ -492,18 +490,11 @@ static unsigned int ldlm_res_hop_fid_hash(struct cfs_hash *hs,
 	hash += (hash >> 4) + (hash << 12); /* mixing oid and seq */
 	if (id->name[LUSTRE_RES_ID_HSH_OFF] != 0) {
 		val = id->name[LUSTRE_RES_ID_HSH_OFF];
-		hash += (val >> 5) + (val << 11);
 	} else {
 		val = fid_oid(&fid);
 	}
-	hash = hash_long(hash, hs->hs_bkt_bits);
-	/* give me another random factor */
-	hash -= hash_long((unsigned long)hs, val % 11 + 3);
-
-	hash <<= hs->hs_cur_bits - hs->hs_bkt_bits;
-	hash |= ldlm_res_hop_hash(hs, key, CFS_HASH_NBKT(hs) - 1);
-
-	return hash & mask;
+	hash += (val >> 5) + (val << 11);
+	return hash_32(hash, bits);
 }
 
 static void *ldlm_res_hop_key(struct hlist_node *hnode)
@@ -555,16 +546,6 @@ static struct cfs_hash_ops ldlm_ns_hash_ops = {
 	.hs_put		= ldlm_res_hop_put
 };
 
-static struct cfs_hash_ops ldlm_ns_fid_hash_ops = {
-	.hs_hash	= ldlm_res_hop_fid_hash,
-	.hs_key		= ldlm_res_hop_key,
-	.hs_keycmp      = ldlm_res_hop_keycmp,
-	.hs_keycpy      = NULL,
-	.hs_object      = ldlm_res_hop_object,
-	.hs_get		= ldlm_res_hop_get_locked,
-	.hs_put		= ldlm_res_hop_put
-};
-
 struct ldlm_ns_hash_def {
 	enum ldlm_ns_type nsd_type;
 	/** hash bucket bits */
@@ -580,13 +561,13 @@ static struct ldlm_ns_hash_def ldlm_ns_hash_defs[] = {
 		.nsd_type       = LDLM_NS_TYPE_MDC,
 		.nsd_bkt_bits   = 11,
 		.nsd_all_bits   = 16,
-		.nsd_hops       = &ldlm_ns_fid_hash_ops,
+		.nsd_hops       = &ldlm_ns_hash_ops,
 	},
 	{
 		.nsd_type       = LDLM_NS_TYPE_MDT,
 		.nsd_bkt_bits   = 14,
 		.nsd_all_bits   = 21,
-		.nsd_hops       = &ldlm_ns_fid_hash_ops,
+		.nsd_hops       = &ldlm_ns_hash_ops,
 	},
 	{
 		.nsd_type       = LDLM_NS_TYPE_OSC,
@@ -602,13 +583,13 @@ static struct ldlm_ns_hash_def ldlm_ns_hash_defs[] = {
 	},
 	{
 		.nsd_type       = LDLM_NS_TYPE_MGC,
-		.nsd_bkt_bits   = 4,
+		.nsd_bkt_bits   = 3,
 		.nsd_all_bits   = 4,
 		.nsd_hops       = &ldlm_ns_hash_ops,
 	},
 	{
 		.nsd_type       = LDLM_NS_TYPE_MGT,
-		.nsd_bkt_bits   = 4,
+		.nsd_bkt_bits   = 3,
 		.nsd_all_bits   = 4,
 		.nsd_hops       = &ldlm_ns_hash_ops,
 	},
@@ -637,9 +618,7 @@ struct ldlm_namespace *ldlm_namespace_new(struct obd_device *obd, char *name,
 					  enum ldlm_ns_type ns_type)
 {
 	struct ldlm_namespace *ns = NULL;
-	struct ldlm_ns_bucket *nsb;
 	struct ldlm_ns_hash_def    *nsd;
-	struct cfs_hash_bd	  bd;
 	int		    idx;
 	int		    rc;
 
@@ -668,7 +647,7 @@ struct ldlm_namespace *ldlm_namespace_new(struct obd_device *obd, char *name,
 
 	ns->ns_rs_hash = cfs_hash_create(name,
 					 nsd->nsd_all_bits, nsd->nsd_all_bits,
-					 nsd->nsd_bkt_bits, sizeof(*nsb),
+					 nsd->nsd_bkt_bits, 0,
 					 CFS_HASH_MIN_THETA,
 					 CFS_HASH_MAX_THETA,
 					 nsd->nsd_hops,
@@ -678,9 +657,15 @@ struct ldlm_namespace *ldlm_namespace_new(struct obd_device *obd, char *name,
 					 CFS_HASH_NO_ITEMREF);
 	if (!ns->ns_rs_hash)
 		goto out_ns;
+	ns->ns_bucket_bits = nsd->nsd_all_bits - nsd->nsd_bkt_bits;
+	ns->ns_rs_buckets = kvmalloc_array(1 << ns->ns_bucket_bits,
+					   sizeof(ns->ns_rs_buckets[0]),
+					   GFP_KERNEL);
+	if (!ns->ns_rs_buckets)
+		goto out_hash;
 
-	cfs_hash_for_each_bucket(ns->ns_rs_hash, &bd, idx) {
-		nsb = cfs_hash_bd_extra_get(ns->ns_rs_hash, &bd);
+	for (idx = 0; idx < (1 << ns->ns_bucket_bits); idx++) {
+		struct ldlm_ns_bucket *nsb = &ns->ns_rs_buckets[idx];
 		at_init(&nsb->nsb_at_estimate, ldlm_enqueue_min, 0);
 		nsb->nsb_namespace = ns;
 	}
@@ -730,6 +715,7 @@ struct ldlm_namespace *ldlm_namespace_new(struct obd_device *obd, char *name,
 	ldlm_namespace_sysfs_unregister(ns);
 	ldlm_namespace_cleanup(ns, 0);
 out_hash:
+	kvfree(ns->ns_rs_buckets);
 	cfs_hash_putref(ns->ns_rs_hash);
 out_ns:
 	kfree(ns);
@@ -993,6 +979,7 @@ void ldlm_namespace_free_post(struct ldlm_namespace *ns)
 	ldlm_namespace_debugfs_unregister(ns);
 	ldlm_namespace_sysfs_unregister(ns);
 	cfs_hash_putref(ns->ns_rs_hash);
+	kvfree(ns->ns_rs_buckets);
 	/* Namespace \a ns should be not on list at this time, otherwise
 	 * this will cause issues related to using freed \a ns in poold
 	 * thread.
@@ -1098,6 +1085,7 @@ ldlm_resource_get(struct ldlm_namespace *ns, struct ldlm_resource *parent,
 	__u64		 version;
 	int		      ns_refcount = 0;
 	int rc;
+	int hash;
 
 	LASSERT(!parent);
 	LASSERT(ns->ns_rs_hash);
@@ -1122,7 +1110,8 @@ ldlm_resource_get(struct ldlm_namespace *ns, struct ldlm_resource *parent,
 	if (!res)
 		return ERR_PTR(-ENOMEM);
 
-	res->lr_ns_bucket  = cfs_hash_bd_extra_get(ns->ns_rs_hash, &bd);
+	hash = ldlm_res_hop_fid_hash(name, ns->ns_bucket_bits);
+	res->lr_ns_bucket  = &ns->ns_rs_buckets[hash];
 	res->lr_name       = *name;
 	res->lr_type       = type;
 

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 04/20] staging: lustre: convert osc_quota hash to rhashtable
  2018-04-11 21:54 [PATCH 00/20] staging: lustre: convert to rhashtable NeilBrown
  2018-04-11 21:54 ` [PATCH 03/20] staging: lustre: convert obd uuid hash " NeilBrown
@ 2018-04-11 21:54 ` NeilBrown
  2018-04-11 21:54 ` [PATCH 05/20] staging: lustre: separate buckets from ldlm hash table NeilBrown
                   ` (19 subsequent siblings)
  21 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2018-04-11 21:54 UTC (permalink / raw)
  To: Oleg Drokin, Greg Kroah-Hartman, James Simmons, Andreas Dilger
  Cc: Linux Kernel Mailing List, Lustre Development List

As this is indexed by an integer, an extensible array
or extensible bitmap would be better.
If/when xarray lands, we should change to use that.

For now, just a simple conversion to rhashtable.

When removing an entry, we need to hold rcu_read_lock()
across the lookup and remove in case we race with another thread
performing a removal.  This means we need to use call_rcu()
to free the quota info, so we need an rcu_head in the structure,
which unfortunately doubles its size.
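
The lookup-and-remove pattern looks roughly like this (names invented,
not the osc_quota ones):

#include <linux/rhashtable.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct demo_entry {
	struct rhash_head	hash;
	u32			id;
	struct rcu_head		rcu;
};

static const struct rhashtable_params demo_params = {
	.key_len	= sizeof(u32),
	.key_offset	= offsetof(struct demo_entry, id),
	.head_offset	= offsetof(struct demo_entry, hash),
	.automatic_shrinking = true,
};

static void demo_entry_free(struct rcu_head *head)
{
	kfree(container_of(head, struct demo_entry, rcu));
}

static void demo_remove(struct rhashtable *ht, u32 id)
{
	struct demo_entry *e;

	/*
	 * rcu_read_lock() is held across lookup and remove so a racing
	 * remover cannot free the entry under us; only the thread whose
	 * remove succeeds schedules the free.
	 */
	rcu_read_lock();
	e = rhashtable_lookup_fast(ht, &id, demo_params);
	if (e && rhashtable_remove_fast(ht, &e->hash, demo_params) == 0)
		call_rcu(&e->rcu, demo_entry_free);
	rcu_read_unlock();
}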

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/staging/lustre/lustre/include/obd.h      |    2 
 drivers/staging/lustre/lustre/osc/osc_internal.h |    5 -
 drivers/staging/lustre/lustre/osc/osc_quota.c    |  136 +++++++---------------
 3 files changed, 48 insertions(+), 95 deletions(-)

diff --git a/drivers/staging/lustre/lustre/include/obd.h b/drivers/staging/lustre/lustre/include/obd.h
index 1818fe6a7a2f..682902e744e2 100644
--- a/drivers/staging/lustre/lustre/include/obd.h
+++ b/drivers/staging/lustre/lustre/include/obd.h
@@ -333,7 +333,7 @@ struct client_obd {
 	void		    *cl_writeback_work;
 	void			*cl_lru_work;
 	/* hash tables for osc_quota_info */
-	struct cfs_hash	      *cl_quota_hash[MAXQUOTAS];
+	struct rhashtable	cl_quota_hash[MAXQUOTAS];
 };
 
 #define obd2cli_tgt(obd) ((char *)(obd)->u.cli.cl_target_uuid.uuid)
diff --git a/drivers/staging/lustre/lustre/osc/osc_internal.h b/drivers/staging/lustre/lustre/osc/osc_internal.h
index be8c7829b3de..fca020568c19 100644
--- a/drivers/staging/lustre/lustre/osc/osc_internal.h
+++ b/drivers/staging/lustre/lustre/osc/osc_internal.h
@@ -188,8 +188,9 @@ extern struct lu_kmem_descr osc_caches[];
 extern struct kmem_cache *osc_quota_kmem;
 struct osc_quota_info {
 	/** linkage for quota hash table */
-	struct hlist_node oqi_hash;
-	u32	  oqi_id;
+	struct rhash_head oqi_hash;
+	u32		  oqi_id;
+	struct rcu_head	  rcu;
 };
 
 int osc_quota_setup(struct obd_device *obd);
diff --git a/drivers/staging/lustre/lustre/osc/osc_quota.c b/drivers/staging/lustre/lustre/osc/osc_quota.c
index ce1731dc604f..723ec2fb18bf 100644
--- a/drivers/staging/lustre/lustre/osc/osc_quota.c
+++ b/drivers/staging/lustre/lustre/osc/osc_quota.c
@@ -27,6 +27,13 @@
 #include <obd_class.h>
 #include "osc_internal.h"
 
+static const struct rhashtable_params quota_hash_params = {
+	.key_len	= sizeof(u32),
+	.key_offset	= offsetof(struct osc_quota_info, oqi_id),
+	.head_offset	= offsetof(struct osc_quota_info, oqi_hash),
+	.automatic_shrinking = true,
+};
+
 static inline struct osc_quota_info *osc_oqi_alloc(u32 id)
 {
 	struct osc_quota_info *oqi;
@@ -45,9 +52,10 @@ int osc_quota_chkdq(struct client_obd *cli, const unsigned int qid[])
 	for (type = 0; type < MAXQUOTAS; type++) {
 		struct osc_quota_info *oqi;
 
-		oqi = cfs_hash_lookup(cli->cl_quota_hash[type], &qid[type]);
+		oqi = rhashtable_lookup_fast(&cli->cl_quota_hash[type], &qid[type],
+					     quota_hash_params);
 		if (oqi) {
-			/* do not try to access oqi here, it could have been
+			/* Must not access oqi here, it could have been
 			 * freed by osc_quota_setdq()
 			 */
 
@@ -63,6 +71,14 @@ int osc_quota_chkdq(struct client_obd *cli, const unsigned int qid[])
 	return QUOTA_OK;
 }
 
+static void osc_quota_free(struct rcu_head *head)
+{
+	struct osc_quota_info *oqi = container_of(head, struct osc_quota_info, rcu);
+
+	kmem_cache_free(osc_quota_kmem, oqi);
+}
+
+
 #define MD_QUOTA_FLAG(type) ((type == USRQUOTA) ? OBD_MD_FLUSRQUOTA \
 						: OBD_MD_FLGRPQUOTA)
 #define FL_QUOTA_FLAG(type) ((type == USRQUOTA) ? OBD_FL_NO_USRQUOTA \
@@ -84,11 +100,14 @@ int osc_quota_setdq(struct client_obd *cli, const unsigned int qid[],
 			continue;
 
 		/* lookup the ID in the per-type hash table */
-		oqi = cfs_hash_lookup(cli->cl_quota_hash[type], &qid[type]);
+		rcu_read_lock();
+		oqi = rhashtable_lookup_fast(&cli->cl_quota_hash[type], &qid[type],
+					     quota_hash_params);
 		if ((flags & FL_QUOTA_FLAG(type)) != 0) {
 			/* This ID is getting close to its quota limit, let's
 			 * switch to sync I/O
 			 */
+			rcu_read_unlock();
 			if (oqi)
 				continue;
 
@@ -98,12 +117,16 @@ int osc_quota_setdq(struct client_obd *cli, const unsigned int qid[],
 				break;
 			}
 
-			rc = cfs_hash_add_unique(cli->cl_quota_hash[type],
-						 &qid[type], &oqi->oqi_hash);
+			rc = rhashtable_lookup_insert_fast(&cli->cl_quota_hash[type],
+							   &oqi->oqi_hash, quota_hash_params);
 			/* race with others? */
-			if (rc == -EALREADY) {
-				rc = 0;
+			if (rc) {
 				kmem_cache_free(osc_quota_kmem, oqi);
+				if (rc != -EEXIST) {
+					rc = -ENOMEM;
+					break;
+				}
+				rc = 0;
 			}
 
 			CDEBUG(D_QUOTA, "%s: setdq to insert for %s %d (%d)\n",
@@ -114,14 +137,14 @@ int osc_quota_setdq(struct client_obd *cli, const unsigned int qid[],
 			/* This ID is now off the hook, let's remove it from
 			 * the hash table
 			 */
-			if (!oqi)
+			if (!oqi) {
+				rcu_read_unlock();
 				continue;
-
-			oqi = cfs_hash_del_key(cli->cl_quota_hash[type],
-					       &qid[type]);
-			if (oqi)
-				kmem_cache_free(osc_quota_kmem, oqi);
-
+			}
+			if (rhashtable_remove_fast(&cli->cl_quota_hash[type],
+						   &oqi->oqi_hash, quota_hash_params) == 0)
+				call_rcu(&oqi->rcu, osc_quota_free);
+			rcu_read_unlock();
 			CDEBUG(D_QUOTA, "%s: setdq to remove for %s %d (%p)\n",
 			       cli_name(cli),
 			       type == USRQUOTA ? "user" : "group",
@@ -132,93 +155,21 @@ int osc_quota_setdq(struct client_obd *cli, const unsigned int qid[],
 	return rc;
 }
 
-/*
- * Hash operations for uid/gid <-> osc_quota_info
- */
-static unsigned int
-oqi_hashfn(struct cfs_hash *hs, const void *key, unsigned int mask)
-{
-	return cfs_hash_u32_hash(*((__u32 *)key), mask);
-}
-
-static int
-oqi_keycmp(const void *key, struct hlist_node *hnode)
-{
-	struct osc_quota_info *oqi;
-	u32 uid;
-
-	LASSERT(key);
-	uid = *((u32 *)key);
-	oqi = hlist_entry(hnode, struct osc_quota_info, oqi_hash);
-
-	return uid == oqi->oqi_id;
-}
-
-static void *
-oqi_key(struct hlist_node *hnode)
-{
-	struct osc_quota_info *oqi;
-
-	oqi = hlist_entry(hnode, struct osc_quota_info, oqi_hash);
-	return &oqi->oqi_id;
-}
-
-static void *
-oqi_object(struct hlist_node *hnode)
-{
-	return hlist_entry(hnode, struct osc_quota_info, oqi_hash);
-}
-
-static void
-oqi_get(struct cfs_hash *hs, struct hlist_node *hnode)
-{
-}
-
-static void
-oqi_put_locked(struct cfs_hash *hs, struct hlist_node *hnode)
-{
-}
-
 static void
-oqi_exit(struct cfs_hash *hs, struct hlist_node *hnode)
+oqi_exit(void *vquota, void *data)
 {
-	struct osc_quota_info *oqi;
-
-	oqi = hlist_entry(hnode, struct osc_quota_info, oqi_hash);
+	struct osc_quota_info *oqi = vquota;
 
-	kmem_cache_free(osc_quota_kmem, oqi);
+	osc_quota_free(&oqi->rcu);
 }
 
-#define HASH_QUOTA_BKT_BITS 5
-#define HASH_QUOTA_CUR_BITS 5
-#define HASH_QUOTA_MAX_BITS 15
-
-static struct cfs_hash_ops quota_hash_ops = {
-	.hs_hash	= oqi_hashfn,
-	.hs_keycmp	= oqi_keycmp,
-	.hs_key		= oqi_key,
-	.hs_object	= oqi_object,
-	.hs_get		= oqi_get,
-	.hs_put_locked	= oqi_put_locked,
-	.hs_exit	= oqi_exit,
-};
-
 int osc_quota_setup(struct obd_device *obd)
 {
 	struct client_obd *cli = &obd->u.cli;
 	int i, type;
 
 	for (type = 0; type < MAXQUOTAS; type++) {
-		cli->cl_quota_hash[type] = cfs_hash_create("QUOTA_HASH",
-							   HASH_QUOTA_CUR_BITS,
-							   HASH_QUOTA_MAX_BITS,
-							   HASH_QUOTA_BKT_BITS,
-							   0,
-							   CFS_HASH_MIN_THETA,
-							   CFS_HASH_MAX_THETA,
-							   &quota_hash_ops,
-							   CFS_HASH_DEFAULT);
-		if (!cli->cl_quota_hash[type])
+		if (rhashtable_init(&cli->cl_quota_hash[type], &quota_hash_params) != 0)
 			break;
 	}
 
@@ -226,7 +177,7 @@ int osc_quota_setup(struct obd_device *obd)
 		return 0;
 
 	for (i = 0; i < type; i++)
-		cfs_hash_putref(cli->cl_quota_hash[i]);
+		rhashtable_destroy(&cli->cl_quota_hash[i]);
 
 	return -ENOMEM;
 }
@@ -237,7 +188,8 @@ int osc_quota_cleanup(struct obd_device *obd)
 	int type;
 
 	for (type = 0; type < MAXQUOTAS; type++)
-		cfs_hash_putref(cli->cl_quota_hash[type]);
+		rhashtable_free_and_destroy(&cli->cl_quota_hash[type],
+					    oqi_exit, NULL);
 
 	return 0;
 }

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 03/20] staging: lustre: convert obd uuid hash to rhashtable
  2018-04-11 21:54 [PATCH 00/20] staging: lustre: convert to rhashtable NeilBrown
@ 2018-04-11 21:54 ` NeilBrown
  2018-04-11 21:54 ` [PATCH 04/20] staging: lustre: convert osc_quota " NeilBrown
                   ` (20 subsequent siblings)
  21 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2018-04-11 21:54 UTC (permalink / raw)
  To: Oleg Drokin, Greg Kroah-Hartman, James Simmons, Andreas Dilger
  Cc: Linux Kernel Mailing List, Lustre Development List

The rhashtable data type is a perfect fit for the
export uuid hash table, so use that instead of
cfs_hash (which will eventually be removed).

As rhashtable supports lookups and insertions in atomic
context, there is no need to drop a spinlock while
inserting a new entry, which simplifies the code quite a bit.

As there are no simple lookups on this hash table (only
insertions which might fail and take a spinlock), there is
no need to use RCU to free the exports.
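
The interesting part is the compare function, which hides failed
exports so that a replacement can be inserted under the same uuid; a
sketch with invented types:

#include <linux/rhashtable.h>
#include <linux/string.h>
#include <linux/types.h>
#include <linux/errno.h>

struct demo_uuid {
	char uuid[40];
};

struct demo_export {
	struct rhash_head	hash;
	struct demo_uuid	uuid;
	bool			failed;
};

static int demo_uuid_cmp(struct rhashtable_compare_arg *arg, const void *obj)
{
	const struct demo_uuid *uuid = arg->key;
	const struct demo_export *exp = obj;

	/* failed exports are invisible to lookups and insert collisions */
	if (memcmp(uuid, &exp->uuid, sizeof(*uuid)) == 0 && !exp->failed)
		return 0;
	return -ESRCH;
}

static const struct rhashtable_params demo_params = {
	.key_len	= sizeof(struct demo_uuid),
	.key_offset	= offsetof(struct demo_export, uuid),
	.head_offset	= offsetof(struct demo_export, hash),
	.obj_cmpfn	= demo_uuid_cmp,
	.automatic_shrinking = true,
};

static int demo_export_add(struct rhashtable *ht, struct demo_export *exp)
{
	int rc = rhashtable_lookup_insert_fast(ht, &exp->hash, demo_params);

	/* -EEXIST means a live duplicate; other errors are resource failures */
	return rc == -EEXIST ? -EALREADY : rc;
}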

Signed-off-by: NeilBrown <neilb@suse.com>
---
 .../staging/lustre/lustre/include/lustre_export.h  |    2 
 drivers/staging/lustre/lustre/include/obd.h        |    5 -
 .../staging/lustre/lustre/include/obd_support.h    |    3 
 drivers/staging/lustre/lustre/obdclass/genops.c    |   34 +---
 .../staging/lustre/lustre/obdclass/obd_config.c    |  161 +++++++++-----------
 5 files changed, 80 insertions(+), 125 deletions(-)

diff --git a/drivers/staging/lustre/lustre/include/lustre_export.h b/drivers/staging/lustre/lustre/include/lustre_export.h
index 19ce13bc8ee6..79ad5aae86b9 100644
--- a/drivers/staging/lustre/lustre/include/lustre_export.h
+++ b/drivers/staging/lustre/lustre/include/lustre_export.h
@@ -89,7 +89,7 @@ struct obd_export {
 	struct list_head		exp_obd_chain;
 	/** work_struct for destruction of export */
 	struct work_struct	exp_zombie_work;
-	struct hlist_node	  exp_uuid_hash; /** uuid-export hash*/
+	struct rhash_head	exp_uuid_hash; /** uuid-export hash*/
 	/** Obd device of this export */
 	struct obd_device	*exp_obd;
 	/**
diff --git a/drivers/staging/lustre/lustre/include/obd.h b/drivers/staging/lustre/lustre/include/obd.h
index ad265db48b76..1818fe6a7a2f 100644
--- a/drivers/staging/lustre/lustre/include/obd.h
+++ b/drivers/staging/lustre/lustre/include/obd.h
@@ -558,7 +558,7 @@ struct obd_device {
 	 */
 	unsigned long obd_recovery_expired:1;
 	/* uuid-export hash body */
-	struct cfs_hash	     *obd_uuid_hash;
+	struct rhashtable	obd_uuid_hash;
 	wait_queue_head_t	     obd_refcount_waitq;
 	struct list_head	      obd_exports;
 	struct list_head	      obd_unlinked_exports;
@@ -622,6 +622,9 @@ struct obd_device {
 	struct completion	obd_kobj_unregister;
 };
 
+int obd_uuid_add(struct obd_device *obd, struct obd_export *export);
+void obd_uuid_del(struct obd_device *obd, struct obd_export *export);
+
 /* get/set_info keys */
 #define KEY_ASYNC	       "async"
 #define KEY_CHANGELOG_CLEAR     "changelog_clear"
diff --git a/drivers/staging/lustre/lustre/include/obd_support.h b/drivers/staging/lustre/lustre/include/obd_support.h
index 730a6ee71565..a4c7a2ee9738 100644
--- a/drivers/staging/lustre/lustre/include/obd_support.h
+++ b/drivers/staging/lustre/lustre/include/obd_support.h
@@ -61,9 +61,6 @@ extern atomic_long_t obd_dirty_transit_pages;
 extern char obd_jobid_var[];
 
 /* Some hash init argument constants */
-#define HASH_UUID_BKT_BITS 5
-#define HASH_UUID_CUR_BITS 7
-#define HASH_UUID_MAX_BITS 12
 /* Timeout definitions */
 #define OBD_TIMEOUT_DEFAULT	     100
 /* Time to wait for all clients to reconnect during recovery (hard limit) */
diff --git a/drivers/staging/lustre/lustre/obdclass/genops.c b/drivers/staging/lustre/lustre/obdclass/genops.c
index 86e22472719a..af233b868742 100644
--- a/drivers/staging/lustre/lustre/obdclass/genops.c
+++ b/drivers/staging/lustre/lustre/obdclass/genops.c
@@ -713,7 +713,6 @@ struct obd_export *class_new_export(struct obd_device *obd,
 				    struct obd_uuid *cluuid)
 {
 	struct obd_export *export;
-	struct cfs_hash *hash = NULL;
 	int rc = 0;
 
 	export = kzalloc(sizeof(*export), GFP_NOFS);
@@ -740,7 +739,6 @@ struct obd_export *class_new_export(struct obd_device *obd,
 	class_handle_hash(&export->exp_handle, &export_handle_ops);
 	spin_lock_init(&export->exp_lock);
 	spin_lock_init(&export->exp_rpc_lock);
-	INIT_HLIST_NODE(&export->exp_uuid_hash);
 	spin_lock_init(&export->exp_bl_list_lock);
 	INIT_LIST_HEAD(&export->exp_bl_list);
 	INIT_WORK(&export->exp_zombie_work, obd_zombie_exp_cull);
@@ -757,44 +755,24 @@ struct obd_export *class_new_export(struct obd_device *obd,
 		goto exit_unlock;
 	}
 
-	hash = cfs_hash_getref(obd->obd_uuid_hash);
-	if (!hash) {
-		rc = -ENODEV;
-		goto exit_unlock;
-	}
-	spin_unlock(&obd->obd_dev_lock);
-
 	if (!obd_uuid_equals(cluuid, &obd->obd_uuid)) {
-		rc = cfs_hash_add_unique(hash, cluuid, &export->exp_uuid_hash);
-		if (rc != 0) {
+		rc = obd_uuid_add(obd, export);
+		if (rc) {
 			LCONSOLE_WARN("%s: denying duplicate export for %s, %d\n",
 				      obd->obd_name, cluuid->uuid, rc);
-			rc = -EALREADY;
-			goto exit_err;
+			goto exit_unlock;
 		}
 	}
 
-	spin_lock(&obd->obd_dev_lock);
-	if (obd->obd_stopping) {
-		cfs_hash_del(hash, cluuid, &export->exp_uuid_hash);
-		rc = -ENODEV;
-		goto exit_unlock;
-	}
-
 	class_incref(obd, "export", export);
 	list_add(&export->exp_obd_chain, &export->exp_obd->obd_exports);
 	export->exp_obd->obd_num_exports++;
 	spin_unlock(&obd->obd_dev_lock);
-	cfs_hash_putref(hash);
 	return export;
 
 exit_unlock:
 	spin_unlock(&obd->obd_dev_lock);
-exit_err:
-	if (hash)
-		cfs_hash_putref(hash);
 	class_handle_unhash(&export->exp_handle);
-	LASSERT(hlist_unhashed(&export->exp_uuid_hash));
 	obd_destroy_export(export);
 	kfree(export);
 	return ERR_PTR(rc);
@@ -807,10 +785,8 @@ void class_unlink_export(struct obd_export *exp)
 
 	spin_lock(&exp->exp_obd->obd_dev_lock);
 	/* delete an uuid-export hashitem from hashtables */
-	if (!hlist_unhashed(&exp->exp_uuid_hash))
-		cfs_hash_del(exp->exp_obd->obd_uuid_hash,
-			     &exp->exp_client_uuid,
-			     &exp->exp_uuid_hash);
+	if (exp != exp->exp_obd->obd_self_export)
+		obd_uuid_del(exp->exp_obd, exp);
 
 	list_move(&exp->exp_obd_chain, &exp->exp_obd->obd_unlinked_exports);
 	exp->exp_obd->obd_num_exports--;
diff --git a/drivers/staging/lustre/lustre/obdclass/obd_config.c b/drivers/staging/lustre/lustre/obdclass/obd_config.c
index eab03766236f..ffc1814398a5 100644
--- a/drivers/staging/lustre/lustre/obdclass/obd_config.c
+++ b/drivers/staging/lustre/lustre/obdclass/obd_config.c
@@ -48,7 +48,69 @@
 
 #include "llog_internal.h"
 
-static struct cfs_hash_ops uuid_hash_ops;
+/*
+ * uuid<->export lustre hash operations
+ */
+/*
+ * NOTE: It is impossible to find an export that is in failed
+ *       state with this function
+ */
+static int
+uuid_keycmp(struct rhashtable_compare_arg *arg, const void *obj)
+{
+	const struct obd_uuid *uuid = arg->key;
+	const struct obd_export *exp = obj;
+
+	if (obd_uuid_equals(uuid, &exp->exp_client_uuid) &&
+	    !exp->exp_failed)
+		return 0;
+	return -ESRCH;
+}
+
+static void
+uuid_export_exit(void *vexport, void *data)
+{
+	struct obd_export *exp = vexport;
+
+	class_export_put(exp);
+}
+
+static const struct rhashtable_params uuid_hash_params = {
+	.key_len	= sizeof(struct obd_uuid),
+	.key_offset	= offsetof(struct obd_export, exp_client_uuid),
+	.head_offset	= offsetof(struct obd_export, exp_uuid_hash),
+	.obj_cmpfn	= uuid_keycmp,
+	.automatic_shrinking = true,
+};
+
+int obd_uuid_add(struct obd_device *obd, struct obd_export *export)
+{
+	int rc;
+
+	rc = rhashtable_lookup_insert_fast(&obd->obd_uuid_hash,
+					   &export->exp_uuid_hash,
+					   uuid_hash_params);
+	if (rc == 0)
+		class_export_get(export);
+	else if (rc == -EEXIST)
+		rc = -EALREADY;
+	else
+		/* map obscure error codes to -ENOMEM */
+		rc = -ENOMEM;
+	return rc;
+}
+
+void obd_uuid_del(struct obd_device *obd, struct obd_export *export)
+{
+	int rc;
+
+	rc = rhashtable_remove_fast(&obd->obd_uuid_hash,
+				    &export->exp_uuid_hash,
+				    uuid_hash_params);
+
+	if (rc == 0)
+		class_export_put(export);
+}
 
 /*********** string parsing utils *********/
 
@@ -347,26 +409,18 @@ static int class_setup(struct obd_device *obd, struct lustre_cfg *lcfg)
 	 * other fns check that status, and we're not actually set up yet.
 	 */
 	obd->obd_starting = 1;
-	obd->obd_uuid_hash = NULL;
 	spin_unlock(&obd->obd_dev_lock);
 
 	/* create an uuid-export lustre hash */
-	obd->obd_uuid_hash = cfs_hash_create("UUID_HASH",
-					     HASH_UUID_CUR_BITS,
-					     HASH_UUID_MAX_BITS,
-					     HASH_UUID_BKT_BITS, 0,
-					     CFS_HASH_MIN_THETA,
-					     CFS_HASH_MAX_THETA,
-					     &uuid_hash_ops, CFS_HASH_DEFAULT);
-	if (!obd->obd_uuid_hash) {
-		err = -ENOMEM;
+	err = rhashtable_init(&obd->obd_uuid_hash, &uuid_hash_params);
+
+	if (err)
 		goto err_hash;
-	}
 
 	exp = class_new_export(obd, &obd->obd_uuid);
 	if (IS_ERR(exp)) {
 		err = PTR_ERR(exp);
-		goto err_hash;
+		goto err_new;
 	}
 
 	obd->obd_self_export = exp;
@@ -392,11 +446,9 @@ static int class_setup(struct obd_device *obd, struct lustre_cfg *lcfg)
 		class_unlink_export(obd->obd_self_export);
 		obd->obd_self_export = NULL;
 	}
+err_new:
+	rhashtable_destroy(&obd->obd_uuid_hash);
 err_hash:
-	if (obd->obd_uuid_hash) {
-		cfs_hash_putref(obd->obd_uuid_hash);
-		obd->obd_uuid_hash = NULL;
-	}
 	obd->obd_starting = 0;
 	CERROR("setup %s failed (%d)\n", obd->obd_name, err);
 	return err;
@@ -490,10 +542,7 @@ static int class_cleanup(struct obd_device *obd, struct lustre_cfg *lcfg)
 		       obd->obd_name, err);
 
 	/* destroy an uuid-export hash body */
-	if (obd->obd_uuid_hash) {
-		cfs_hash_putref(obd->obd_uuid_hash);
-		obd->obd_uuid_hash = NULL;
-	}
+	rhashtable_free_and_destroy(&obd->obd_uuid_hash, uuid_export_exit, NULL);
 
 	class_decref(obd, "setup", obd);
 	obd->obd_set_up = 0;
@@ -1487,73 +1536,3 @@ int class_manual_cleanup(struct obd_device *obd)
 	return rc;
 }
 EXPORT_SYMBOL(class_manual_cleanup);
-
-/*
- * uuid<->export lustre hash operations
- */
-
-static unsigned int
-uuid_hash(struct cfs_hash *hs, const void *key, unsigned int mask)
-{
-	return cfs_hash_djb2_hash(((struct obd_uuid *)key)->uuid,
-				  sizeof(((struct obd_uuid *)key)->uuid), mask);
-}
-
-static void *
-uuid_key(struct hlist_node *hnode)
-{
-	struct obd_export *exp;
-
-	exp = hlist_entry(hnode, struct obd_export, exp_uuid_hash);
-
-	return &exp->exp_client_uuid;
-}
-
-/*
- * NOTE: It is impossible to find an export that is in failed
- *       state with this function
- */
-static int
-uuid_keycmp(const void *key, struct hlist_node *hnode)
-{
-	struct obd_export *exp;
-
-	LASSERT(key);
-	exp = hlist_entry(hnode, struct obd_export, exp_uuid_hash);
-
-	return obd_uuid_equals(key, &exp->exp_client_uuid) &&
-	       !exp->exp_failed;
-}
-
-static void *
-uuid_export_object(struct hlist_node *hnode)
-{
-	return hlist_entry(hnode, struct obd_export, exp_uuid_hash);
-}
-
-static void
-uuid_export_get(struct cfs_hash *hs, struct hlist_node *hnode)
-{
-	struct obd_export *exp;
-
-	exp = hlist_entry(hnode, struct obd_export, exp_uuid_hash);
-	class_export_get(exp);
-}
-
-static void
-uuid_export_put_locked(struct cfs_hash *hs, struct hlist_node *hnode)
-{
-	struct obd_export *exp;
-
-	exp = hlist_entry(hnode, struct obd_export, exp_uuid_hash);
-	class_export_put(exp);
-}
-
-static struct cfs_hash_ops uuid_hash_ops = {
-	.hs_hash	= uuid_hash,
-	.hs_key		= uuid_key,
-	.hs_keycmp      = uuid_keycmp,
-	.hs_object      = uuid_export_object,
-	.hs_get		= uuid_export_get,
-	.hs_put_locked  = uuid_export_put_locked,
-};

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 02/20] staging: lustre: convert lov_pool to use rhashtable
  2018-04-11 21:54 [PATCH 00/20] staging: lustre: convert to rhashtable NeilBrown
                   ` (5 preceding siblings ...)
  2018-04-11 21:54 ` [PATCH 10/20] staging: lustre: make struct lu_site_bkt_data private NeilBrown
@ 2018-04-11 21:54 ` NeilBrown
  2018-04-11 21:54 ` [PATCH 12/20] staging: lustre: lu_object: factor out extra per-bucket data NeilBrown
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2018-04-11 21:54 UTC (permalink / raw)
  To: Oleg Drokin, Greg Kroah-Hartman, James Simmons, Andreas Dilger
  Cc: Linux Kernel Mailing List, Lustre Development List

The pools hashtable can be implemented using
the rhashtable implementation in lib.
This has the benefit that lookups are lock-free.

We need to use kfree_rcu() to free a pool so
that a lookup racing with a deletion will not access
freed memory.

rhashtable has no combined lookup-and-delete interface,
but as the lookup is lockless and the chains are short,
this brings little cost.  Even if a lookup finds a pool,
we must be prepared for the delete to fail to find it,
as we might race with another thread doing a delete.

We use atomic_inc_not_zero() after finding a pool in the
hash table and if that fails, we must have raced with a
deletion, so we treat the lookup as a failure.

Use hashlen_string() rather than a hand-crafted hash
function.
Note that the pool_name and the search key are
guaranteed to be nul-terminated.
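
Putting those pieces together, the lookup side looks roughly like this
(invented "demo_pool" names, not the lov ones):

#include <linux/rhashtable.h>
#include <linux/stringhash.h>
#include <linux/rcupdate.h>
#include <linux/atomic.h>
#include <linux/string.h>
#include <linux/slab.h>

struct demo_pool {
	char			name[16 + 1];	/* nul-terminated key */
	atomic_t		refcount;
	struct rhash_head	hash;
	struct rcu_head		rcu;
};

static u32 demo_name_hash(const void *data, u32 len, u32 seed)
{
	/* hash the nul-terminated string, salted with the table seed */
	return hashlen_hash(hashlen_string((void *)(unsigned long)seed, data));
}

static int demo_name_cmp(struct rhashtable_compare_arg *arg, const void *obj)
{
	const struct demo_pool *pool = obj;

	return strcmp(arg->key, pool->name);
}

static const struct rhashtable_params demo_params = {
	.key_len	= 1,	/* actually variable length */
	.key_offset	= offsetof(struct demo_pool, name),
	.head_offset	= offsetof(struct demo_pool, hash),
	.hashfn		= demo_name_hash,
	.obj_cmpfn	= demo_name_cmp,
	.automatic_shrinking = true,
};

static struct demo_pool *demo_pool_find(struct rhashtable *ht, const char *name)
{
	struct demo_pool *pool;

	rcu_read_lock();
	pool = rhashtable_lookup(ht, name, demo_params);
	/* the refcount may already have reached zero: treat as "not found" */
	if (pool && !atomic_inc_not_zero(&pool->refcount))
		pool = NULL;
	rcu_read_unlock();
	return pool;
}

static void demo_pool_put(struct demo_pool *pool)
{
	if (atomic_dec_and_test(&pool->refcount))
		kfree_rcu(pool, rcu);	/* defer the free past concurrent lookups */
}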

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/staging/lustre/lustre/include/obd.h        |    4 -
 .../staging/lustre/lustre/include/obd_support.h    |    3 
 drivers/staging/lustre/lustre/lov/lov_internal.h   |   11 +
 drivers/staging/lustre/lustre/lov/lov_obd.c        |   12 +-
 drivers/staging/lustre/lustre/lov/lov_pool.c       |  159 ++++++++------------
 5 files changed, 80 insertions(+), 109 deletions(-)

diff --git a/drivers/staging/lustre/lustre/include/obd.h b/drivers/staging/lustre/lustre/include/obd.h
index f1233ca7d337..ad265db48b76 100644
--- a/drivers/staging/lustre/lustre/include/obd.h
+++ b/drivers/staging/lustre/lustre/include/obd.h
@@ -46,6 +46,8 @@
 #include <lustre_intent.h>
 #include <cl_object.h>
 
+#include <linux/rhashtable.h>
+
 #define MAX_OBD_DEVICES 8192
 
 struct osc_async_rc {
@@ -383,7 +385,7 @@ struct lov_obd {
 	__u32		   lov_tgt_size;   /* size of tgts array */
 	int		     lov_connects;
 	int		     lov_pool_count;
-	struct cfs_hash	     *lov_pools_hash_body; /* used for key access */
+	struct rhashtable	lov_pools_hash_body; /* used for key access */
 	struct list_head	lov_pool_list; /* used for sequential access */
 	struct dentry		*lov_pool_debugfs_entry;
 	enum lustre_sec_part    lov_sp_me;
diff --git a/drivers/staging/lustre/lustre/include/obd_support.h b/drivers/staging/lustre/lustre/include/obd_support.h
index aebcab191442..730a6ee71565 100644
--- a/drivers/staging/lustre/lustre/include/obd_support.h
+++ b/drivers/staging/lustre/lustre/include/obd_support.h
@@ -61,9 +61,6 @@ extern atomic_long_t obd_dirty_transit_pages;
 extern char obd_jobid_var[];
 
 /* Some hash init argument constants */
-#define HASH_POOLS_BKT_BITS 3
-#define HASH_POOLS_CUR_BITS 3
-#define HASH_POOLS_MAX_BITS 7
 #define HASH_UUID_BKT_BITS 5
 #define HASH_UUID_CUR_BITS 7
 #define HASH_UUID_MAX_BITS 12
diff --git a/drivers/staging/lustre/lustre/lov/lov_internal.h b/drivers/staging/lustre/lustre/lov/lov_internal.h
index 27f60dd7ab9a..47042f27ca90 100644
--- a/drivers/staging/lustre/lustre/lov/lov_internal.h
+++ b/drivers/staging/lustre/lustre/lov/lov_internal.h
@@ -149,11 +149,16 @@ struct pool_desc {
 	char			 pool_name[LOV_MAXPOOLNAME + 1];
 	struct ost_pool		 pool_obds;
 	atomic_t		 pool_refcount;
-	struct hlist_node	 pool_hash;		/* access by poolname */
-	struct list_head	 pool_list;		/* serial access */
+	struct rhash_head	 pool_hash;		/* access by poolname */
+	union {
+		struct list_head	pool_list;	/* serial access */
+		struct rcu_head		rcu;		/* delayed free */
+	};
 	struct dentry		*pool_debugfs_entry;	/* file in debugfs */
 	struct obd_device	*pool_lobd;		/* owner */
 };
+int lov_pool_hash_init(struct rhashtable *tbl);
+void lov_pool_hash_destroy(struct rhashtable *tbl);
 
 struct lov_request {
 	struct obd_info	  rq_oi;
@@ -241,8 +246,6 @@ void lprocfs_lov_init_vars(struct lprocfs_static_vars *lvars);
 /* lov_cl.c */
 extern struct lu_device_type lov_device_type;
 
-/* pools */
-extern struct cfs_hash_ops pool_hash_operations;
 /* ost_pool methods */
 int lov_ost_pool_init(struct ost_pool *op, unsigned int count);
 int lov_ost_pool_extend(struct ost_pool *op, unsigned int min_count);
diff --git a/drivers/staging/lustre/lustre/lov/lov_obd.c b/drivers/staging/lustre/lustre/lov/lov_obd.c
index 355e87ecc62d..94da35e673f7 100644
--- a/drivers/staging/lustre/lustre/lov/lov_obd.c
+++ b/drivers/staging/lustre/lustre/lov/lov_obd.c
@@ -795,15 +795,11 @@ int lov_setup(struct obd_device *obd, struct lustre_cfg *lcfg)
 
 	init_rwsem(&lov->lov_notify_lock);
 
-	lov->lov_pools_hash_body = cfs_hash_create("POOLS", HASH_POOLS_CUR_BITS,
-						   HASH_POOLS_MAX_BITS,
-						   HASH_POOLS_BKT_BITS, 0,
-						   CFS_HASH_MIN_THETA,
-						   CFS_HASH_MAX_THETA,
-						   &pool_hash_operations,
-						   CFS_HASH_DEFAULT);
 	INIT_LIST_HEAD(&lov->lov_pool_list);
 	lov->lov_pool_count = 0;
+	rc = lov_pool_hash_init(&lov->lov_pools_hash_body);
+	if (rc)
+		goto out;
 	rc = lov_ost_pool_init(&lov->lov_packed, 0);
 	if (rc)
 		goto out;
@@ -839,7 +835,7 @@ static int lov_cleanup(struct obd_device *obd)
 		/* coverity[overrun-buffer-val] */
 		lov_pool_del(obd, pool->pool_name);
 	}
-	cfs_hash_putref(lov->lov_pools_hash_body);
+	lov_pool_hash_destroy(&lov->lov_pools_hash_body);
 	lov_ost_pool_free(&lov->lov_packed);
 
 	lprocfs_obd_cleanup(obd);
diff --git a/drivers/staging/lustre/lustre/lov/lov_pool.c b/drivers/staging/lustre/lustre/lov/lov_pool.c
index ecd9329cd073..b673b4fd305b 100644
--- a/drivers/staging/lustre/lustre/lov/lov_pool.c
+++ b/drivers/staging/lustre/lustre/lov/lov_pool.c
@@ -49,6 +49,28 @@
 #define pool_tgt(_p, _i) \
 		_p->pool_lobd->u.lov.lov_tgts[_p->pool_obds.op_array[_i]]
 
+static u32 pool_hashfh(const void *data, u32 len, u32 seed)
+{
+	const char *pool_name = data;
+	return hashlen_hash(hashlen_string((void*)(unsigned long)seed, pool_name));
+}
+
+static int pool_cmpfn(struct rhashtable_compare_arg *arg, const void *obj)
+{
+	const struct pool_desc *pool = obj;
+	const char *pool_name = arg->key;
+	return strcmp(pool_name, pool->pool_name);
+}
+
+static const struct rhashtable_params pools_hash_params = {
+	.key_len	= 1, /* actually variable */
+	.key_offset	= offsetof(struct pool_desc, pool_name),
+	.head_offset	= offsetof(struct pool_desc, pool_hash),
+	.hashfn		= pool_hashfh,
+	.obj_cmpfn	= pool_cmpfn,
+	.automatic_shrinking = true,
+};
+
 static void lov_pool_getref(struct pool_desc *pool)
 {
 	CDEBUG(D_INFO, "pool %p\n", pool);
@@ -59,96 +81,13 @@ void lov_pool_putref(struct pool_desc *pool)
 {
 	CDEBUG(D_INFO, "pool %p\n", pool);
 	if (atomic_dec_and_test(&pool->pool_refcount)) {
-		LASSERT(hlist_unhashed(&pool->pool_hash));
 		LASSERT(list_empty(&pool->pool_list));
 		LASSERT(!pool->pool_debugfs_entry);
 		lov_ost_pool_free(&pool->pool_obds);
-		kfree(pool);
-	}
-}
-
-static void lov_pool_putref_locked(struct pool_desc *pool)
-{
-	CDEBUG(D_INFO, "pool %p\n", pool);
-	LASSERT(atomic_read(&pool->pool_refcount) > 1);
-
-	atomic_dec(&pool->pool_refcount);
-}
-
-/*
- * hash function using a Rotating Hash algorithm
- * Knuth, D. The Art of Computer Programming,
- * Volume 3: Sorting and Searching,
- * Chapter 6.4.
- * Addison Wesley, 1973
- */
-static __u32 pool_hashfn(struct cfs_hash *hash_body, const void *key,
-			 unsigned int mask)
-{
-	int i;
-	__u32 result;
-	char *poolname;
-
-	result = 0;
-	poolname = (char *)key;
-	for (i = 0; i < LOV_MAXPOOLNAME; i++) {
-		if (poolname[i] == '\0')
-			break;
-		result = (result << 4) ^ (result >> 28) ^  poolname[i];
+		kfree_rcu(pool, rcu);
 	}
-	return (result % mask);
-}
-
-static void *pool_key(struct hlist_node *hnode)
-{
-	struct pool_desc *pool;
-
-	pool = hlist_entry(hnode, struct pool_desc, pool_hash);
-	return pool->pool_name;
-}
-
-static int pool_hashkey_keycmp(const void *key, struct hlist_node *compared_hnode)
-{
-	char *pool_name;
-	struct pool_desc *pool;
-
-	pool_name = (char *)key;
-	pool = hlist_entry(compared_hnode, struct pool_desc, pool_hash);
-	return !strncmp(pool_name, pool->pool_name, LOV_MAXPOOLNAME);
-}
-
-static void *pool_hashobject(struct hlist_node *hnode)
-{
-	return hlist_entry(hnode, struct pool_desc, pool_hash);
-}
-
-static void pool_hashrefcount_get(struct cfs_hash *hs, struct hlist_node *hnode)
-{
-	struct pool_desc *pool;
-
-	pool = hlist_entry(hnode, struct pool_desc, pool_hash);
-	lov_pool_getref(pool);
-}
-
-static void pool_hashrefcount_put_locked(struct cfs_hash *hs,
-					 struct hlist_node *hnode)
-{
-	struct pool_desc *pool;
-
-	pool = hlist_entry(hnode, struct pool_desc, pool_hash);
-	lov_pool_putref_locked(pool);
 }
 
-struct cfs_hash_ops pool_hash_operations = {
-	.hs_hash	= pool_hashfn,
-	.hs_key		= pool_key,
-	.hs_keycmp      = pool_hashkey_keycmp,
-	.hs_object      = pool_hashobject,
-	.hs_get		= pool_hashrefcount_get,
-	.hs_put_locked  = pool_hashrefcount_put_locked,
-
-};
-
 /*
  * pool debugfs seq_file methods
  */
@@ -396,6 +335,23 @@ int lov_ost_pool_free(struct ost_pool *op)
 	return 0;
 }
 
+static void
+pools_hash_exit(void *vpool, void *data)
+{
+	struct pool_desc *pool = vpool;
+	lov_pool_putref(pool);
+}
+
+int lov_pool_hash_init(struct rhashtable *tbl)
+{
+	return rhashtable_init(tbl, &pools_hash_params);
+}
+
+void lov_pool_hash_destroy(struct rhashtable *tbl)
+{
+	rhashtable_free_and_destroy(tbl, pools_hash_exit, NULL);
+}
+
 int lov_pool_new(struct obd_device *obd, char *poolname)
 {
 	struct lov_obd *lov;
@@ -421,8 +377,6 @@ int lov_pool_new(struct obd_device *obd, char *poolname)
 	if (rc)
 		goto out_err;
 
-	INIT_HLIST_NODE(&new_pool->pool_hash);
-
 	/* get ref for debugfs file */
 	lov_pool_getref(new_pool);
 	new_pool->pool_debugfs_entry = ldebugfs_add_simple(
@@ -443,11 +397,16 @@ int lov_pool_new(struct obd_device *obd, char *poolname)
 	lov->lov_pool_count++;
 	spin_unlock(&obd->obd_dev_lock);
 
-	/* add to find only when it fully ready  */
-	rc = cfs_hash_add_unique(lov->lov_pools_hash_body, poolname,
-				 &new_pool->pool_hash);
+	/* Add to hash table only when it is fully ready.  */
+	rc = rhashtable_lookup_insert_fast(&lov->lov_pools_hash_body,
+					   &new_pool->pool_hash, pools_hash_params);
 	if (rc) {
-		rc = -EEXIST;
+		if (rc != -EEXIST)
+			/*
+			 * Hide -E2BIG and -EBUSY which
+			 * are not helpful.
+			 */
+			rc = -ENOMEM;
 		goto out_err;
 	}
 
@@ -476,7 +435,13 @@ int lov_pool_del(struct obd_device *obd, char *poolname)
 	lov = &obd->u.lov;
 
 	/* lookup and kill hash reference */
-	pool = cfs_hash_del_key(lov->lov_pools_hash_body, poolname);
+	rcu_read_lock();
+	pool = rhashtable_lookup(&lov->lov_pools_hash_body, poolname, pools_hash_params);
+	if (pool)
+		if (rhashtable_remove_fast(&lov->lov_pools_hash_body,
+					   &pool->pool_hash, pools_hash_params) != 0)
+			pool = NULL;
+	rcu_read_unlock();
 	if (!pool)
 		return -ENOENT;
 
@@ -507,7 +472,11 @@ int lov_pool_add(struct obd_device *obd, char *poolname, char *ostname)
 
 	lov = &obd->u.lov;
 
-	pool = cfs_hash_lookup(lov->lov_pools_hash_body, poolname);
+	rcu_read_lock();
+	pool = rhashtable_lookup(&lov->lov_pools_hash_body, poolname, pools_hash_params);
+	if (pool && !atomic_inc_not_zero(&pool->pool_refcount))
+		pool = NULL;
+	rcu_read_unlock();
 	if (!pool)
 		return -ENOENT;
 
@@ -551,7 +520,11 @@ int lov_pool_remove(struct obd_device *obd, char *poolname, char *ostname)
 
 	lov = &obd->u.lov;
 
-	pool = cfs_hash_lookup(lov->lov_pools_hash_body, poolname);
+	rcu_read_lock();
+	pool = rhashtable_lookup(&lov->lov_pools_hash_body, poolname, pools_hash_params);
+	if (pool && !atomic_inc_not_zero(&pool->pool_refcount))
+		pool = NULL;
+	rcu_read_unlock();
 	if (!pool)
 		return -ENOENT;
 

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 01/20] staging: lustre: ptlrpc: convert conn_hash to rhashtable
  2018-04-11 21:54 [PATCH 00/20] staging: lustre: convert to rhashtable NeilBrown
                   ` (8 preceding siblings ...)
  2018-04-11 21:54 ` [PATCH 08/20] staging: lustre: simplify ldlm_ns_hash_defs[] NeilBrown
@ 2018-04-11 21:54 ` NeilBrown
  2018-04-11 21:54 ` [PATCH 06/20] staging: lustre: ldlm: add a counter to the per-namespace data NeilBrown
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2018-04-11 21:54 UTC (permalink / raw)
  To: Oleg Drokin, Greg Kroah-Hartman, James Simmons, Andreas Dilger
  Cc: Linux Kernel Mailing List, Lustre Development List

Linux has a resizeable hashtable implementation in lib,
so we should use that instead of having one in libcfs.

This patch converts the ptlrpc conn_hash to use rhashtable.
In the process we gain lockless lookup.

As connections are never deleted until the hash table is destroyed,
there is no need for the hash table to hold a counted reference on each connection.  There
is also no need to enable automatic_shrinking.
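
The resulting get path handles an insertion race by keeping whichever
connection won; roughly (invented "demo" names, error handling pared
down):

#include <linux/rhashtable.h>
#include <linux/hash.h>
#include <linux/atomic.h>
#include <linux/slab.h>
#include <linux/errno.h>
#include <linux/err.h>

struct demo_peer {
	u64 nid;
	u32 pid;
};

struct demo_conn {
	struct rhash_head	hash;
	struct demo_peer	peer;
	atomic_t		refcount;
};

static u32 demo_peer_hashfn(const void *data, u32 len, u32 seed)
{
	const struct demo_peer *p = data;

	/* the key struct may contain padding, so hash the fields, not the bytes */
	seed = hash_32(seed ^ p->pid, 32);
	seed ^= hash_64(p->nid, 32);
	return seed;
}

static int demo_peer_cmp(struct rhashtable_compare_arg *arg, const void *obj)
{
	const struct demo_peer *p = arg->key;
	const struct demo_conn *c = obj;

	return (p->nid == c->peer.nid && p->pid == c->peer.pid) ? 0 : -ESRCH;
}

static const struct rhashtable_params demo_params = {
	.key_len	= 1,	/* nominal: we supply hashfn and obj_cmpfn */
	.key_offset	= offsetof(struct demo_conn, peer),
	.head_offset	= offsetof(struct demo_conn, hash),
	.hashfn		= demo_peer_hashfn,
	.obj_cmpfn	= demo_peer_cmp,
};

static struct demo_conn *demo_conn_get(struct rhashtable *ht,
				       struct demo_peer peer)
{
	struct demo_conn *conn, *old;

	/* entries are never removed, so a bare lookup needs no extra pinning */
	conn = rhashtable_lookup_fast(ht, &peer, demo_params);
	if (conn) {
		atomic_inc(&conn->refcount);
		return conn;
	}

	conn = kzalloc(sizeof(*conn), GFP_KERNEL);
	if (!conn)
		return NULL;
	conn->peer = peer;
	atomic_set(&conn->refcount, 1);

	/*
	 * Returns NULL on success, the existing object if we lost a race,
	 * or an ERR_PTR() on failure.
	 */
	old = rhashtable_lookup_get_insert_fast(ht, &conn->hash, demo_params);
	if (old) {
		kfree(conn);
		if (IS_ERR(old))
			return NULL;
		atomic_inc(&old->refcount);
		conn = old;
	}
	return conn;
}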

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/staging/lustre/lustre/include/lustre_net.h |    4 
 .../staging/lustre/lustre/include/obd_support.h    |    3 
 drivers/staging/lustre/lustre/ptlrpc/connection.c  |  164 +++++++-------------
 3 files changed, 64 insertions(+), 107 deletions(-)

diff --git a/drivers/staging/lustre/lustre/include/lustre_net.h b/drivers/staging/lustre/lustre/include/lustre_net.h
index d13db55b7242..35b43a77eb18 100644
--- a/drivers/staging/lustre/lustre/include/lustre_net.h
+++ b/drivers/staging/lustre/lustre/include/lustre_net.h
@@ -67,6 +67,8 @@
 #include <obd_support.h>
 #include <uapi/linux/lustre/lustre_ver.h>
 
+#include <linux/rhashtable.h>
+
 /* MD flags we _always_ use */
 #define PTLRPC_MD_OPTIONS  0
 
@@ -286,7 +288,7 @@ struct ptlrpc_replay_async_args {
  */
 struct ptlrpc_connection {
 	/** linkage for connections hash table */
-	struct hlist_node	c_hash;
+	struct rhash_head	c_hash;
 	/** Our own lnet nid for this connection */
 	lnet_nid_t	      c_self;
 	/** Remote side nid for this connection */
diff --git a/drivers/staging/lustre/lustre/include/obd_support.h b/drivers/staging/lustre/lustre/include/obd_support.h
index eb2d6cb6b40b..aebcab191442 100644
--- a/drivers/staging/lustre/lustre/include/obd_support.h
+++ b/drivers/staging/lustre/lustre/include/obd_support.h
@@ -67,9 +67,6 @@ extern char obd_jobid_var[];
 #define HASH_UUID_BKT_BITS 5
 #define HASH_UUID_CUR_BITS 7
 #define HASH_UUID_MAX_BITS 12
-#define HASH_CONN_BKT_BITS 5
-#define HASH_CONN_CUR_BITS 5
-#define HASH_CONN_MAX_BITS 15
 /* Timeout definitions */
 #define OBD_TIMEOUT_DEFAULT	     100
 /* Time to wait for all clients to reconnect during recovery (hard limit) */
diff --git a/drivers/staging/lustre/lustre/ptlrpc/connection.c b/drivers/staging/lustre/lustre/ptlrpc/connection.c
index dfdb4587d49d..fb35a89ca6c6 100644
--- a/drivers/staging/lustre/lustre/ptlrpc/connection.c
+++ b/drivers/staging/lustre/lustre/ptlrpc/connection.c
@@ -38,8 +38,41 @@
 
 #include "ptlrpc_internal.h"
 
-static struct cfs_hash *conn_hash;
-static struct cfs_hash_ops conn_hash_ops;
+static struct rhashtable conn_hash;
+
+/*
+ * struct lnet_process_id may contain unassigned bytes which might not
+ * be zero, so we cannot just hash and compare bytes.
+ */
+
+static u32 lnet_process_id_hash(const void *data, u32 len, u32 seed)
+{
+	const struct lnet_process_id *lpi = data;
+
+	seed = hash_32(seed ^ lpi->pid, 32);
+	seed ^= hash_64(lpi->nid, 32);
+	return seed;
+}
+
+static int lnet_process_id_cmp(struct rhashtable_compare_arg *arg,
+			       const void *obj)
+{
+	const struct lnet_process_id *lpi = arg->key;
+	const struct ptlrpc_connection *con = obj;
+
+	if (lpi->nid == con->c_peer.nid &&
+	    lpi->pid == con->c_peer.pid)
+		return 0;
+	return -ESRCH;
+}
+
+static const struct rhashtable_params conn_hash_params = {
+	.key_len	= 1, /* actually variable-length */
+	.key_offset	= offsetof(struct ptlrpc_connection, c_peer),
+	.head_offset	= offsetof(struct ptlrpc_connection, c_hash),
+	.hashfn		= lnet_process_id_hash,
+	.obj_cmpfn	= lnet_process_id_cmp,
+};
 
 struct ptlrpc_connection *
 ptlrpc_connection_get(struct lnet_process_id peer, lnet_nid_t self,
@@ -47,9 +80,11 @@ ptlrpc_connection_get(struct lnet_process_id peer, lnet_nid_t self,
 {
 	struct ptlrpc_connection *conn, *conn2;
 
-	conn = cfs_hash_lookup(conn_hash, &peer);
-	if (conn)
+	conn = rhashtable_lookup_fast(&conn_hash, &peer, conn_hash_params);
+	if (conn) {
+		ptlrpc_connection_addref(conn);
 		goto out;
+	}
 
 	conn = kzalloc(sizeof(*conn), GFP_NOFS);
 	if (!conn)
@@ -57,7 +92,6 @@ ptlrpc_connection_get(struct lnet_process_id peer, lnet_nid_t self,
 
 	conn->c_peer = peer;
 	conn->c_self = self;
-	INIT_HLIST_NODE(&conn->c_hash);
 	atomic_set(&conn->c_refcount, 1);
 	if (uuid)
 		obd_str2uuid(&conn->c_remote_uuid, uuid->uuid);
@@ -65,17 +99,18 @@ ptlrpc_connection_get(struct lnet_process_id peer, lnet_nid_t self,
 	/*
 	 * Add the newly created conn to the hash, on key collision we
 	 * lost a racing addition and must destroy our newly allocated
-	 * connection.  The object which exists in the has will be
-	 * returned and may be compared against out object.
-	 */
-	/* In the function below, .hs_keycmp resolves to
-	 * conn_keycmp()
+	 * connection.  The object which exists in the hash will be
+	 * returned, otherwise NULL is returned on success.
 	 */
-	/* coverity[overrun-buffer-val] */
-	conn2 = cfs_hash_findadd_unique(conn_hash, &peer, &conn->c_hash);
-	if (conn != conn2) {
+	conn2 = rhashtable_lookup_get_insert_fast(&conn_hash, &conn->c_hash,
+						  conn_hash_params);
+	if (conn2 != NULL) {
+		/* insertion failed */
 		kfree(conn);
+		if (IS_ERR(conn2))
+			return NULL;
 		conn = conn2;
+		ptlrpc_connection_addref(conn);
 	}
 out:
 	CDEBUG(D_INFO, "conn=%p refcount %d to %s\n",
@@ -91,7 +126,7 @@ int ptlrpc_connection_put(struct ptlrpc_connection *conn)
 	if (!conn)
 		return rc;
 
-	LASSERT(atomic_read(&conn->c_refcount) > 1);
+	LASSERT(atomic_read(&conn->c_refcount) > 0);
 
 	/*
 	 * We do not remove connection from hashtable and
@@ -109,7 +144,7 @@ int ptlrpc_connection_put(struct ptlrpc_connection *conn)
 	 * when ptlrpc_connection_fini()->lh_exit->conn_exit()
 	 * path is called.
 	 */
-	if (atomic_dec_return(&conn->c_refcount) == 1)
+	if (atomic_dec_return(&conn->c_refcount) == 0)
 		rc = 1;
 
 	CDEBUG(D_INFO, "PUT conn=%p refcount %d to %s\n",
@@ -130,88 +165,11 @@ ptlrpc_connection_addref(struct ptlrpc_connection *conn)
 	return conn;
 }
 
-int ptlrpc_connection_init(void)
-{
-	conn_hash = cfs_hash_create("CONN_HASH",
-				    HASH_CONN_CUR_BITS,
-				    HASH_CONN_MAX_BITS,
-				    HASH_CONN_BKT_BITS, 0,
-				    CFS_HASH_MIN_THETA,
-				    CFS_HASH_MAX_THETA,
-				    &conn_hash_ops, CFS_HASH_DEFAULT);
-	if (!conn_hash)
-		return -ENOMEM;
-
-	return 0;
-}
-
-void ptlrpc_connection_fini(void)
-{
-	cfs_hash_putref(conn_hash);
-}
-
-/*
- * Hash operations for net_peer<->connection
- */
-static unsigned int
-conn_hashfn(struct cfs_hash *hs, const void *key, unsigned int mask)
-{
-	return cfs_hash_djb2_hash(key, sizeof(struct lnet_process_id), mask);
-}
-
-static int
-conn_keycmp(const void *key, struct hlist_node *hnode)
-{
-	struct ptlrpc_connection *conn;
-	const struct lnet_process_id *conn_key;
-
-	LASSERT(key);
-	conn_key = key;
-	conn = hlist_entry(hnode, struct ptlrpc_connection, c_hash);
-
-	return conn_key->nid == conn->c_peer.nid &&
-	       conn_key->pid == conn->c_peer.pid;
-}
-
-static void *
-conn_key(struct hlist_node *hnode)
-{
-	struct ptlrpc_connection *conn;
-
-	conn = hlist_entry(hnode, struct ptlrpc_connection, c_hash);
-	return &conn->c_peer;
-}
-
-static void *
-conn_object(struct hlist_node *hnode)
-{
-	return hlist_entry(hnode, struct ptlrpc_connection, c_hash);
-}
-
 static void
-conn_get(struct cfs_hash *hs, struct hlist_node *hnode)
+conn_exit(void *vconn, void *data)
 {
-	struct ptlrpc_connection *conn;
+	struct ptlrpc_connection *conn = vconn;
 
-	conn = hlist_entry(hnode, struct ptlrpc_connection, c_hash);
-	atomic_inc(&conn->c_refcount);
-}
-
-static void
-conn_put_locked(struct cfs_hash *hs, struct hlist_node *hnode)
-{
-	struct ptlrpc_connection *conn;
-
-	conn = hlist_entry(hnode, struct ptlrpc_connection, c_hash);
-	atomic_dec(&conn->c_refcount);
-}
-
-static void
-conn_exit(struct cfs_hash *hs, struct hlist_node *hnode)
-{
-	struct ptlrpc_connection *conn;
-
-	conn = hlist_entry(hnode, struct ptlrpc_connection, c_hash);
 	/*
 	 * Nothing should be left. Connection user put it and
 	 * connection also was deleted from table by this time
@@ -223,12 +181,12 @@ conn_exit(struct cfs_hash *hs, struct hlist_node *hnode)
 	kfree(conn);
 }
 
-static struct cfs_hash_ops conn_hash_ops = {
-	.hs_hash	= conn_hashfn,
-	.hs_keycmp      = conn_keycmp,
-	.hs_key		= conn_key,
-	.hs_object      = conn_object,
-	.hs_get		= conn_get,
-	.hs_put_locked  = conn_put_locked,
-	.hs_exit	= conn_exit,
-};
+int ptlrpc_connection_init(void)
+{
+	return rhashtable_init(&conn_hash, &conn_hash_params);
+}
+
+void ptlrpc_connection_fini(void)
+{
+	rhashtable_free_and_destroy(&conn_hash, conn_exit, NULL);
+}
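
As a usage note on the teardown above: rhashtable_free_and_destroy() walks
the table and calls the supplied callback on every object still present,
which is why conn_exit() becomes the free routine here.  Continuing the
illustrative demo_obj sketch from earlier (again, not lustre code):

static void demo_free(void *ptr, void *arg)
{
	/* ptr is a struct demo_obj; arg is the opaque cookie (unused here). */
	kfree(ptr);
}

static void demo_table_fini(void)
{
	rhashtable_free_and_destroy(&demo_ht, demo_free, NULL);
}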

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 20/20] staging: lustre: remove cfs_hash resizeable hashtable implementation.
  2018-04-11 21:54 [PATCH 00/20] staging: lustre: convert to rhashtable NeilBrown
                   ` (17 preceding siblings ...)
  2018-04-11 21:54 ` [PATCH 18/20] staging: lustre: change how "dump_page_cache" walks a hash table NeilBrown
@ 2018-04-11 21:54 ` NeilBrown
  2018-04-11 21:54 ` [PATCH 19/20] staging: lustre: convert lu_object cache to rhashtable NeilBrown
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2018-04-11 21:54 UTC (permalink / raw)
  To: Oleg Drokin, Greg Kroah-Hartman, James Simmons, Andreas Dilger
  Cc: Linux Kernel Mailing List, Lustre Development List

All users of this library have been converted to use rhashtable.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 .../staging/lustre/include/linux/libcfs/libcfs.h   |    1 
 .../lustre/include/linux/libcfs/libcfs_hash.h      |  866 --------
 drivers/staging/lustre/lnet/libcfs/Makefile        |    2 
 drivers/staging/lustre/lnet/libcfs/hash.c          | 2064 --------------------
 drivers/staging/lustre/lnet/libcfs/module.c        |   12 
 5 files changed, 1 insertion(+), 2944 deletions(-)
 delete mode 100644 drivers/staging/lustre/include/linux/libcfs/libcfs_hash.h
 delete mode 100644 drivers/staging/lustre/lnet/libcfs/hash.c

diff --git a/drivers/staging/lustre/include/linux/libcfs/libcfs.h b/drivers/staging/lustre/include/linux/libcfs/libcfs.h
index f183f31da387..62e46aa3c554 100644
--- a/drivers/staging/lustre/include/linux/libcfs/libcfs.h
+++ b/drivers/staging/lustre/include/linux/libcfs/libcfs.h
@@ -44,7 +44,6 @@
 #include <linux/libcfs/libcfs_cpu.h>
 #include <linux/libcfs/libcfs_prim.h>
 #include <linux/libcfs/libcfs_string.h>
-#include <linux/libcfs/libcfs_hash.h>
 #include <linux/libcfs/libcfs_fail.h>
 #include <linux/libcfs/curproc.h>
 
diff --git a/drivers/staging/lustre/include/linux/libcfs/libcfs_hash.h b/drivers/staging/lustre/include/linux/libcfs/libcfs_hash.h
deleted file mode 100644
index 0506f1d45757..000000000000
--- a/drivers/staging/lustre/include/linux/libcfs/libcfs_hash.h
+++ /dev/null
@@ -1,866 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-/*
- * GPL HEADER START
- *
- * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 only,
- * as published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
- * General Public License version 2 for more details (a copy is included
- * in the LICENSE file that accompanied this code).
- *
- * You should have received a copy of the GNU General Public License
- * version 2 along with this program; If not, see
- * http://www.gnu.org/licenses/gpl-2.0.html
- *
- * GPL HEADER END
- */
-/*
- * Copyright (c) 2008, 2010, Oracle and/or its affiliates. All rights reserved.
- * Use is subject to license terms.
- *
- * Copyright (c) 2012, 2015 Intel Corporation.
- */
-/*
- * This file is part of Lustre, http://www.lustre.org/
- * Lustre is a trademark of Sun Microsystems, Inc.
- *
- * libcfs/include/libcfs/libcfs_hash.h
- *
- * Hashing routines
- *
- */
-
-#ifndef __LIBCFS_HASH_H__
-#define __LIBCFS_HASH_H__
-
-#include <linux/hash.h>
-
-/*
- * Knuth recommends primes in approximately golden ratio to the maximum
- * integer representable by a machine word for multiplicative hashing.
- * Chuck Lever verified the effectiveness of this technique:
- * http://www.citi.umich.edu/techreports/reports/citi-tr-00-1.pdf
- *
- * These primes are chosen to be bit-sparse, that is operations on
- * them can use shifts and additions instead of multiplications for
- * machines where multiplications are slow.
- */
-/* 2^31 + 2^29 - 2^25 + 2^22 - 2^19 - 2^16 + 1 */
-#define CFS_GOLDEN_RATIO_PRIME_32 0x9e370001UL
-/*  2^63 + 2^61 - 2^57 + 2^54 - 2^51 - 2^18 + 1 */
-#define CFS_GOLDEN_RATIO_PRIME_64 0x9e37fffffffc0001ULL
-
-/** disable debug */
-#define CFS_HASH_DEBUG_NONE	0
-/*
- * record hash depth and output to console when it's too deep,
- * computing overhead is low but consume more memory
- */
-#define CFS_HASH_DEBUG_1	1
-/** expensive, check key validation */
-#define CFS_HASH_DEBUG_2	2
-
-#define CFS_HASH_DEBUG_LEVEL	CFS_HASH_DEBUG_NONE
-
-struct cfs_hash_ops;
-struct cfs_hash_lock_ops;
-struct cfs_hash_hlist_ops;
-
-union cfs_hash_lock {
-	rwlock_t		rw;		/**< rwlock */
-	spinlock_t		spin;		/**< spinlock */
-};
-
-/**
- * cfs_hash_bucket is a container of:
- * - lock, counter ...
- * - array of hash-head starting from hsb_head[0], hash-head can be one of
- *   . struct cfs_hash_head
- *   . struct cfs_hash_head_dep
- *   . struct cfs_hash_dhead
- *   . struct cfs_hash_dhead_dep
- *   which depends on requirement of user
- * - some extra bytes (caller can require it while creating hash)
- */
-struct cfs_hash_bucket {
-	union cfs_hash_lock	hsb_lock;	/**< bucket lock */
-	u32			hsb_count;	/**< current entries */
-	u32			hsb_version;	/**< change version */
-	unsigned int		hsb_index;	/**< index of bucket */
-	int			hsb_depmax;	/**< max depth on bucket */
-	long			hsb_head[0];	/**< hash-head array */
-};
-
-/**
- * cfs_hash bucket descriptor, it's normally in stack of caller
- */
-struct cfs_hash_bd {
-	/* address of bucket */
-	struct cfs_hash_bucket	*bd_bucket;
-	/* offset in bucket */
-	unsigned int		 bd_offset;
-};
-
-#define CFS_HASH_NAME_LEN	16	/**< default name length */
-#define CFS_HASH_BIGNAME_LEN	64	/**< bigname for param tree */
-
-#define CFS_HASH_BKT_BITS	3	/**< default bits of bucket */
-#define CFS_HASH_BITS_MAX	30	/**< max bits of bucket */
-#define CFS_HASH_BITS_MIN	CFS_HASH_BKT_BITS
-
-/**
- * common hash attributes.
- */
-enum cfs_hash_tag {
-	/**
-	 * don't need any lock, caller will protect operations with it's
-	 * own lock. With this flag:
-	 *  . CFS_HASH_NO_BKTLOCK, CFS_HASH_RW_BKTLOCK, CFS_HASH_SPIN_BKTLOCK
-	 *    will be ignored.
-	 *  . Some functions will be disabled with this flag, i.e:
-	 *    cfs_hash_for_each_empty, cfs_hash_rehash
-	 */
-	CFS_HASH_NO_LOCK	= BIT(0),
-	/** no bucket lock, use one spinlock to protect the whole hash */
-	CFS_HASH_NO_BKTLOCK	= BIT(1),
-	/** rwlock to protect bucket */
-	CFS_HASH_RW_BKTLOCK	= BIT(2),
-	/** spinlock to protect bucket */
-	CFS_HASH_SPIN_BKTLOCK	= BIT(3),
-	/** always add new item to tail */
-	CFS_HASH_ADD_TAIL	= BIT(4),
-	/** hash-table doesn't have refcount on item */
-	CFS_HASH_NO_ITEMREF	= BIT(5),
-	/** big name for param-tree */
-	CFS_HASH_BIGNAME	= BIT(6),
-	/** track global count */
-	CFS_HASH_COUNTER	= BIT(7),
-	/** rehash item by new key */
-	CFS_HASH_REHASH_KEY	= BIT(8),
-	/** Enable dynamic hash resizing */
-	CFS_HASH_REHASH		= BIT(9),
-	/** can shrink hash-size */
-	CFS_HASH_SHRINK		= BIT(10),
-	/** assert hash is empty on exit */
-	CFS_HASH_ASSERT_EMPTY	= BIT(11),
-	/** record hlist depth */
-	CFS_HASH_DEPTH		= BIT(12),
-	/**
-	 * rehash is always scheduled in a different thread, so current
-	 * change on hash table is non-blocking
-	 */
-	CFS_HASH_NBLK_CHANGE	= BIT(13),
-	/**
-	 * NB, we typed hs_flags as  u16, please change it
-	 * if you need to extend >=16 flags
-	 */
-};
-
-/** most used attributes */
-#define CFS_HASH_DEFAULT	(CFS_HASH_RW_BKTLOCK | \
-				 CFS_HASH_COUNTER | CFS_HASH_REHASH)
-
-/**
- * cfs_hash is a hash-table implementation for general purpose, it can support:
- *    . two refcount modes
- *      hash-table with & without refcount
- *    . four lock modes
- *      nolock, one-spinlock, rw-bucket-lock, spin-bucket-lock
- *    . general operations
- *      lookup, add(add_tail or add_head), delete
- *    . rehash
- *      grows or shrink
- *    . iteration
- *      locked iteration and unlocked iteration
- *    . bigname
- *      support long name hash
- *    . debug
- *      trace max searching depth
- *
- * Rehash:
- * When the htable grows or shrinks, a separate task (cfs_hash_rehash_worker)
- * is spawned to handle the rehash in the background, it's possible that other
- * processes can concurrently perform additions, deletions, and lookups
- * without being blocked on rehash completion, because rehash will release
- * the global wrlock for each bucket.
- *
- * rehash and iteration can't run at the same time because it's too tricky
- * to keep both of them safe and correct.
- * As they are relatively rare operations, so:
- *   . if iteration is in progress while we try to launch rehash, then
- *     it just giveup, iterator will launch rehash at the end.
- *   . if rehash is in progress while we try to iterate the hash table,
- *     then we just wait (shouldn't be very long time), anyway, nobody
- *     should expect iteration of whole hash-table to be non-blocking.
- *
- * During rehashing, a (key,object) pair may be in one of two buckets,
- * depending on whether the worker task has yet to transfer the object
- * to its new location in the table. Lookups and deletions need to search both
- * locations; additions must take care to only insert into the new bucket.
- */
-
-struct cfs_hash {
-	/**
-	 * serialize with rehash, or serialize all operations if
-	 * the hash-table has CFS_HASH_NO_BKTLOCK
-	 */
-	union cfs_hash_lock		hs_lock;
-	/** hash operations */
-	struct cfs_hash_ops		*hs_ops;
-	/** hash lock operations */
-	struct cfs_hash_lock_ops	*hs_lops;
-	/** hash list operations */
-	struct cfs_hash_hlist_ops	*hs_hops;
-	/** hash buckets-table */
-	struct cfs_hash_bucket		**hs_buckets;
-	/** total number of items on this hash-table */
-	atomic_t			hs_count;
-	/** hash flags, see cfs_hash_tag for detail */
-	u16				hs_flags;
-	/** # of extra-bytes for bucket, for user saving extended attributes */
-	u16				hs_extra_bytes;
-	/** wants to iterate */
-	u8				hs_iterating;
-	/** hash-table is dying */
-	u8				hs_exiting;
-	/** current hash bits */
-	u8				hs_cur_bits;
-	/** min hash bits */
-	u8				hs_min_bits;
-	/** max hash bits */
-	u8				hs_max_bits;
-	/** bits for rehash */
-	u8				hs_rehash_bits;
-	/** bits for each bucket */
-	u8				hs_bkt_bits;
-	/** resize min threshold */
-	u16				hs_min_theta;
-	/** resize max threshold */
-	u16				hs_max_theta;
-	/** resize count */
-	u32				hs_rehash_count;
-	/** # of iterators (caller of cfs_hash_for_each_*) */
-	u32				hs_iterators;
-	/** rehash workitem */
-	struct work_struct		hs_rehash_work;
-	/** refcount on this hash table */
-	atomic_t			hs_refcount;
-	/** rehash buckets-table */
-	struct cfs_hash_bucket		**hs_rehash_buckets;
-#if CFS_HASH_DEBUG_LEVEL >= CFS_HASH_DEBUG_1
-	/** serialize debug members */
-	spinlock_t			hs_dep_lock;
-	/** max depth */
-	unsigned int			hs_dep_max;
-	/** id of the deepest bucket */
-	unsigned int			hs_dep_bkt;
-	/** offset in the deepest bucket */
-	unsigned int			hs_dep_off;
-	/** bits when we found the max depth */
-	unsigned int			hs_dep_bits;
-	/** workitem to output max depth */
-	struct work_struct		hs_dep_work;
-#endif
-	/** name of htable */
-	char				hs_name[0];
-};
-
-struct cfs_hash_lock_ops {
-	/** lock the hash table */
-	void    (*hs_lock)(union cfs_hash_lock *lock, int exclusive);
-	/** unlock the hash table */
-	void    (*hs_unlock)(union cfs_hash_lock *lock, int exclusive);
-	/** lock the hash bucket */
-	void    (*hs_bkt_lock)(union cfs_hash_lock *lock, int exclusive);
-	/** unlock the hash bucket */
-	void    (*hs_bkt_unlock)(union cfs_hash_lock *lock, int exclusive);
-};
-
-struct cfs_hash_hlist_ops {
-	/** return hlist_head of hash-head of @bd */
-	struct hlist_head *(*hop_hhead)(struct cfs_hash *hs,
-					struct cfs_hash_bd *bd);
-	/** return hash-head size */
-	int (*hop_hhead_size)(struct cfs_hash *hs);
-	/** add @hnode to hash-head of @bd */
-	int (*hop_hnode_add)(struct cfs_hash *hs, struct cfs_hash_bd *bd,
-			     struct hlist_node *hnode);
-	/** remove @hnode from hash-head of @bd */
-	int (*hop_hnode_del)(struct cfs_hash *hs, struct cfs_hash_bd *bd,
-			     struct hlist_node *hnode);
-};
-
-struct cfs_hash_ops {
-	/** return hashed value from @key */
-	unsigned int (*hs_hash)(struct cfs_hash *hs, const void *key,
-				unsigned int mask);
-	/** return key address of @hnode */
-	void *   (*hs_key)(struct hlist_node *hnode);
-	/** copy key from @hnode to @key */
-	void     (*hs_keycpy)(struct hlist_node *hnode, void *key);
-	/**
-	 *  compare @key with key of @hnode
-	 *  returns 1 on a match
-	 */
-	int      (*hs_keycmp)(const void *key, struct hlist_node *hnode);
-	/** return object address of @hnode, i.e: container_of(...hnode) */
-	void *   (*hs_object)(struct hlist_node *hnode);
-	/** get refcount of item, always called with holding bucket-lock */
-	void     (*hs_get)(struct cfs_hash *hs, struct hlist_node *hnode);
-	/** release refcount of item */
-	void     (*hs_put)(struct cfs_hash *hs, struct hlist_node *hnode);
-	/** release refcount of item, always called with holding bucket-lock */
-	void     (*hs_put_locked)(struct cfs_hash *hs,
-				  struct hlist_node *hnode);
-	/** it's called before removing of @hnode */
-	void     (*hs_exit)(struct cfs_hash *hs, struct hlist_node *hnode);
-};
-
-/** total number of buckets in @hs */
-#define CFS_HASH_NBKT(hs)	\
-	BIT((hs)->hs_cur_bits - (hs)->hs_bkt_bits)
-
-/** total number of buckets in @hs while rehashing */
-#define CFS_HASH_RH_NBKT(hs)	\
-	BIT((hs)->hs_rehash_bits - (hs)->hs_bkt_bits)
-
-/** number of hlist for in bucket */
-#define CFS_HASH_BKT_NHLIST(hs)	BIT((hs)->hs_bkt_bits)
-
-/** total number of hlist in @hs */
-#define CFS_HASH_NHLIST(hs)	BIT((hs)->hs_cur_bits)
-
-/** total number of hlist in @hs while rehashing */
-#define CFS_HASH_RH_NHLIST(hs)	BIT((hs)->hs_rehash_bits)
-
-static inline int
-cfs_hash_with_no_lock(struct cfs_hash *hs)
-{
-	/* caller will serialize all operations for this hash-table */
-	return hs->hs_flags & CFS_HASH_NO_LOCK;
-}
-
-static inline int
-cfs_hash_with_no_bktlock(struct cfs_hash *hs)
-{
-	/* no bucket lock, one single lock to protect the hash-table */
-	return hs->hs_flags & CFS_HASH_NO_BKTLOCK;
-}
-
-static inline int
-cfs_hash_with_rw_bktlock(struct cfs_hash *hs)
-{
-	/* rwlock to protect hash bucket */
-	return hs->hs_flags & CFS_HASH_RW_BKTLOCK;
-}
-
-static inline int
-cfs_hash_with_spin_bktlock(struct cfs_hash *hs)
-{
-	/* spinlock to protect hash bucket */
-	return hs->hs_flags & CFS_HASH_SPIN_BKTLOCK;
-}
-
-static inline int
-cfs_hash_with_add_tail(struct cfs_hash *hs)
-{
-	return hs->hs_flags & CFS_HASH_ADD_TAIL;
-}
-
-static inline int
-cfs_hash_with_no_itemref(struct cfs_hash *hs)
-{
-	/*
-	 * hash-table doesn't keep refcount on item,
-	 * item can't be removed from hash unless it's
-	 * ZERO refcount
-	 */
-	return hs->hs_flags & CFS_HASH_NO_ITEMREF;
-}
-
-static inline int
-cfs_hash_with_bigname(struct cfs_hash *hs)
-{
-	return hs->hs_flags & CFS_HASH_BIGNAME;
-}
-
-static inline int
-cfs_hash_with_counter(struct cfs_hash *hs)
-{
-	return hs->hs_flags & CFS_HASH_COUNTER;
-}
-
-static inline int
-cfs_hash_with_rehash(struct cfs_hash *hs)
-{
-	return hs->hs_flags & CFS_HASH_REHASH;
-}
-
-static inline int
-cfs_hash_with_rehash_key(struct cfs_hash *hs)
-{
-	return hs->hs_flags & CFS_HASH_REHASH_KEY;
-}
-
-static inline int
-cfs_hash_with_shrink(struct cfs_hash *hs)
-{
-	return hs->hs_flags & CFS_HASH_SHRINK;
-}
-
-static inline int
-cfs_hash_with_assert_empty(struct cfs_hash *hs)
-{
-	return hs->hs_flags & CFS_HASH_ASSERT_EMPTY;
-}
-
-static inline int
-cfs_hash_with_depth(struct cfs_hash *hs)
-{
-	return hs->hs_flags & CFS_HASH_DEPTH;
-}
-
-static inline int
-cfs_hash_with_nblk_change(struct cfs_hash *hs)
-{
-	return hs->hs_flags & CFS_HASH_NBLK_CHANGE;
-}
-
-static inline int
-cfs_hash_is_exiting(struct cfs_hash *hs)
-{
-	/* cfs_hash_destroy is called */
-	return hs->hs_exiting;
-}
-
-static inline int
-cfs_hash_is_rehashing(struct cfs_hash *hs)
-{
-	/* rehash is launched */
-	return !!hs->hs_rehash_bits;
-}
-
-static inline int
-cfs_hash_is_iterating(struct cfs_hash *hs)
-{
-	/* someone is calling cfs_hash_for_each_* */
-	return hs->hs_iterating || hs->hs_iterators;
-}
-
-static inline int
-cfs_hash_bkt_size(struct cfs_hash *hs)
-{
-	return offsetof(struct cfs_hash_bucket, hsb_head[0]) +
-	       hs->hs_hops->hop_hhead_size(hs) * CFS_HASH_BKT_NHLIST(hs) +
-	       hs->hs_extra_bytes;
-}
-
-static inline unsigned
-cfs_hash_id(struct cfs_hash *hs, const void *key, unsigned int mask)
-{
-	return hs->hs_ops->hs_hash(hs, key, mask);
-}
-
-static inline void *
-cfs_hash_key(struct cfs_hash *hs, struct hlist_node *hnode)
-{
-	return hs->hs_ops->hs_key(hnode);
-}
-
-static inline void
-cfs_hash_keycpy(struct cfs_hash *hs, struct hlist_node *hnode, void *key)
-{
-	if (hs->hs_ops->hs_keycpy)
-		hs->hs_ops->hs_keycpy(hnode, key);
-}
-
-/**
- * Returns 1 on a match,
- */
-static inline int
-cfs_hash_keycmp(struct cfs_hash *hs, const void *key, struct hlist_node *hnode)
-{
-	return hs->hs_ops->hs_keycmp(key, hnode);
-}
-
-static inline void *
-cfs_hash_object(struct cfs_hash *hs, struct hlist_node *hnode)
-{
-	return hs->hs_ops->hs_object(hnode);
-}
-
-static inline void
-cfs_hash_get(struct cfs_hash *hs, struct hlist_node *hnode)
-{
-	return hs->hs_ops->hs_get(hs, hnode);
-}
-
-static inline void
-cfs_hash_put_locked(struct cfs_hash *hs, struct hlist_node *hnode)
-{
-	return hs->hs_ops->hs_put_locked(hs, hnode);
-}
-
-static inline void
-cfs_hash_put(struct cfs_hash *hs, struct hlist_node *hnode)
-{
-	return hs->hs_ops->hs_put(hs, hnode);
-}
-
-static inline void
-cfs_hash_exit(struct cfs_hash *hs, struct hlist_node *hnode)
-{
-	if (hs->hs_ops->hs_exit)
-		hs->hs_ops->hs_exit(hs, hnode);
-}
-
-static inline void cfs_hash_lock(struct cfs_hash *hs, int excl)
-{
-	hs->hs_lops->hs_lock(&hs->hs_lock, excl);
-}
-
-static inline void cfs_hash_unlock(struct cfs_hash *hs, int excl)
-{
-	hs->hs_lops->hs_unlock(&hs->hs_lock, excl);
-}
-
-static inline int cfs_hash_dec_and_lock(struct cfs_hash *hs,
-					atomic_t *condition)
-{
-	LASSERT(cfs_hash_with_no_bktlock(hs));
-	return atomic_dec_and_lock(condition, &hs->hs_lock.spin);
-}
-
-static inline void cfs_hash_bd_lock(struct cfs_hash *hs,
-				    struct cfs_hash_bd *bd, int excl)
-{
-	hs->hs_lops->hs_bkt_lock(&bd->bd_bucket->hsb_lock, excl);
-}
-
-static inline void cfs_hash_bd_unlock(struct cfs_hash *hs,
-				      struct cfs_hash_bd *bd, int excl)
-{
-	hs->hs_lops->hs_bkt_unlock(&bd->bd_bucket->hsb_lock, excl);
-}
-
-/**
- * operations on cfs_hash bucket (bd: bucket descriptor),
- * they are normally for hash-table without rehash
- */
-void cfs_hash_bd_get(struct cfs_hash *hs, const void *key,
-		     struct cfs_hash_bd *bd);
-
-static inline void
-cfs_hash_bd_get_and_lock(struct cfs_hash *hs, const void *key,
-			 struct cfs_hash_bd *bd, int excl)
-{
-	cfs_hash_bd_get(hs, key, bd);
-	cfs_hash_bd_lock(hs, bd, excl);
-}
-
-static inline unsigned
-cfs_hash_bd_index_get(struct cfs_hash *hs, struct cfs_hash_bd *bd)
-{
-	return bd->bd_offset | (bd->bd_bucket->hsb_index << hs->hs_bkt_bits);
-}
-
-static inline void
-cfs_hash_bd_index_set(struct cfs_hash *hs, unsigned int index,
-		      struct cfs_hash_bd *bd)
-{
-	bd->bd_bucket = hs->hs_buckets[index >> hs->hs_bkt_bits];
-	bd->bd_offset = index & (CFS_HASH_BKT_NHLIST(hs) - 1U);
-}
-
-static inline void *
-cfs_hash_bd_extra_get(struct cfs_hash *hs, struct cfs_hash_bd *bd)
-{
-	return (void *)bd->bd_bucket +
-	       cfs_hash_bkt_size(hs) - hs->hs_extra_bytes;
-}
-
-static inline u32
-cfs_hash_bd_version_get(struct cfs_hash_bd *bd)
-{
-	/* need hold cfs_hash_bd_lock */
-	return bd->bd_bucket->hsb_version;
-}
-
-static inline u32
-cfs_hash_bd_count_get(struct cfs_hash_bd *bd)
-{
-	/* need hold cfs_hash_bd_lock */
-	return bd->bd_bucket->hsb_count;
-}
-
-static inline int
-cfs_hash_bd_depmax_get(struct cfs_hash_bd *bd)
-{
-	return bd->bd_bucket->hsb_depmax;
-}
-
-static inline int
-cfs_hash_bd_compare(struct cfs_hash_bd *bd1, struct cfs_hash_bd *bd2)
-{
-	if (bd1->bd_bucket->hsb_index != bd2->bd_bucket->hsb_index)
-		return bd1->bd_bucket->hsb_index - bd2->bd_bucket->hsb_index;
-
-	if (bd1->bd_offset != bd2->bd_offset)
-		return bd1->bd_offset - bd2->bd_offset;
-
-	return 0;
-}
-
-void cfs_hash_bd_add_locked(struct cfs_hash *hs, struct cfs_hash_bd *bd,
-			    struct hlist_node *hnode);
-void cfs_hash_bd_del_locked(struct cfs_hash *hs, struct cfs_hash_bd *bd,
-			    struct hlist_node *hnode);
-void cfs_hash_bd_move_locked(struct cfs_hash *hs, struct cfs_hash_bd *bd_old,
-			     struct cfs_hash_bd *bd_new,
-			     struct hlist_node *hnode);
-
-static inline int
-cfs_hash_bd_dec_and_lock(struct cfs_hash *hs, struct cfs_hash_bd *bd,
-			 atomic_t *condition)
-{
-	LASSERT(cfs_hash_with_spin_bktlock(hs));
-	return atomic_dec_and_lock(condition, &bd->bd_bucket->hsb_lock.spin);
-}
-
-static inline struct hlist_head *
-cfs_hash_bd_hhead(struct cfs_hash *hs, struct cfs_hash_bd *bd)
-{
-	return hs->hs_hops->hop_hhead(hs, bd);
-}
-
-struct hlist_node *
-cfs_hash_bd_lookup_locked(struct cfs_hash *hs, struct cfs_hash_bd *bd,
-			  const void *key);
-struct hlist_node *
-cfs_hash_bd_peek_locked(struct cfs_hash *hs, struct cfs_hash_bd *bd,
-			const void *key);
-
-/**
- * operations on cfs_hash bucket (bd: bucket descriptor),
- * they are safe for hash-table with rehash
- */
-void cfs_hash_dual_bd_get(struct cfs_hash *hs, const void *key,
-			  struct cfs_hash_bd *bds);
-void cfs_hash_dual_bd_lock(struct cfs_hash *hs, struct cfs_hash_bd *bds,
-			   int excl);
-void cfs_hash_dual_bd_unlock(struct cfs_hash *hs, struct cfs_hash_bd *bds,
-			     int excl);
-
-static inline void
-cfs_hash_dual_bd_get_and_lock(struct cfs_hash *hs, const void *key,
-			      struct cfs_hash_bd *bds, int excl)
-{
-	cfs_hash_dual_bd_get(hs, key, bds);
-	cfs_hash_dual_bd_lock(hs, bds, excl);
-}
-
-struct hlist_node *
-cfs_hash_dual_bd_lookup_locked(struct cfs_hash *hs, struct cfs_hash_bd *bds,
-			       const void *key);
-struct hlist_node *
-cfs_hash_dual_bd_findadd_locked(struct cfs_hash *hs, struct cfs_hash_bd *bds,
-				const void *key, struct hlist_node *hnode,
-				int insist_add);
-struct hlist_node *
-cfs_hash_dual_bd_finddel_locked(struct cfs_hash *hs, struct cfs_hash_bd *bds,
-				const void *key, struct hlist_node *hnode);
-
-/* Hash init/cleanup functions */
-struct cfs_hash *
-cfs_hash_create(char *name, unsigned int cur_bits, unsigned int max_bits,
-		unsigned int bkt_bits, unsigned int extra_bytes,
-		unsigned int min_theta, unsigned int max_theta,
-		struct cfs_hash_ops *ops, unsigned int flags);
-
-struct cfs_hash *cfs_hash_getref(struct cfs_hash *hs);
-void cfs_hash_putref(struct cfs_hash *hs);
-
-/* Hash addition functions */
-void cfs_hash_add(struct cfs_hash *hs, const void *key,
-		  struct hlist_node *hnode);
-int cfs_hash_add_unique(struct cfs_hash *hs, const void *key,
-			struct hlist_node *hnode);
-void *cfs_hash_findadd_unique(struct cfs_hash *hs, const void *key,
-			      struct hlist_node *hnode);
-
-/* Hash deletion functions */
-void *cfs_hash_del(struct cfs_hash *hs, const void *key,
-		   struct hlist_node *hnode);
-void *cfs_hash_del_key(struct cfs_hash *hs, const void *key);
-
-/* Hash lookup/for_each functions */
-#define CFS_HASH_LOOP_HOG       1024
-
-typedef int (*cfs_hash_for_each_cb_t)(struct cfs_hash *hs,
-				      struct cfs_hash_bd *bd,
-				      struct hlist_node *node,
-				      void *data);
-void *
-cfs_hash_lookup(struct cfs_hash *hs, const void *key);
-void
-cfs_hash_for_each(struct cfs_hash *hs, cfs_hash_for_each_cb_t cb, void *data);
-void
-cfs_hash_for_each_safe(struct cfs_hash *hs, cfs_hash_for_each_cb_t cb,
-		       void *data);
-int
-cfs_hash_for_each_nolock(struct cfs_hash *hs, cfs_hash_for_each_cb_t cb,
-			 void *data, int start);
-int
-cfs_hash_for_each_empty(struct cfs_hash *hs, cfs_hash_for_each_cb_t cb,
-			void *data);
-void
-cfs_hash_for_each_key(struct cfs_hash *hs, const void *key,
-		      cfs_hash_for_each_cb_t cb, void *data);
-typedef int (*cfs_hash_cond_opt_cb_t)(void *obj, void *data);
-void
-cfs_hash_cond_del(struct cfs_hash *hs, cfs_hash_cond_opt_cb_t cb, void *data);
-
-void
-cfs_hash_hlist_for_each(struct cfs_hash *hs, unsigned int hindex,
-			cfs_hash_for_each_cb_t cb, void *data);
-int  cfs_hash_is_empty(struct cfs_hash *hs);
-u64 cfs_hash_size_get(struct cfs_hash *hs);
-
-/*
- * Rehash - Theta is calculated to be the average chained
- * hash depth assuming a perfectly uniform hash function.
- */
-void cfs_hash_rehash_cancel_locked(struct cfs_hash *hs);
-void cfs_hash_rehash_cancel(struct cfs_hash *hs);
-void cfs_hash_rehash(struct cfs_hash *hs, int do_rehash);
-void cfs_hash_rehash_key(struct cfs_hash *hs, const void *old_key,
-			 void *new_key, struct hlist_node *hnode);
-
-#if CFS_HASH_DEBUG_LEVEL > CFS_HASH_DEBUG_1
-/* Validate hnode references the correct key */
-static inline void
-cfs_hash_key_validate(struct cfs_hash *hs, const void *key,
-		      struct hlist_node *hnode)
-{
-	LASSERT(cfs_hash_keycmp(hs, key, hnode));
-}
-
-/* Validate hnode is in the correct bucket */
-static inline void
-cfs_hash_bucket_validate(struct cfs_hash *hs, struct cfs_hash_bd *bd,
-			 struct hlist_node *hnode)
-{
-	struct cfs_hash_bd bds[2];
-
-	cfs_hash_dual_bd_get(hs, cfs_hash_key(hs, hnode), bds);
-	LASSERT(bds[0].bd_bucket == bd->bd_bucket ||
-		bds[1].bd_bucket == bd->bd_bucket);
-}
-
-#else /* CFS_HASH_DEBUG_LEVEL > CFS_HASH_DEBUG_1 */
-
-static inline void
-cfs_hash_key_validate(struct cfs_hash *hs, const void *key,
-		      struct hlist_node *hnode) {}
-
-static inline void
-cfs_hash_bucket_validate(struct cfs_hash *hs, struct cfs_hash_bd *bd,
-			 struct hlist_node *hnode) {}
-
-#endif /* CFS_HASH_DEBUG_LEVEL */
-
-#define CFS_HASH_THETA_BITS	10
-#define CFS_HASH_MIN_THETA	BIT(CFS_HASH_THETA_BITS - 1)
-#define CFS_HASH_MAX_THETA	BIT(CFS_HASH_THETA_BITS + 1)
-
-/* Return integer component of theta */
-static inline int __cfs_hash_theta_int(int theta)
-{
-	return (theta >> CFS_HASH_THETA_BITS);
-}
-
-/* Return a fractional value between 0 and 999 */
-static inline int __cfs_hash_theta_frac(int theta)
-{
-	return ((theta * 1000) >> CFS_HASH_THETA_BITS) -
-	       (__cfs_hash_theta_int(theta) * 1000);
-}
-
-static inline int __cfs_hash_theta(struct cfs_hash *hs)
-{
-	return (atomic_read(&hs->hs_count) <<
-		CFS_HASH_THETA_BITS) >> hs->hs_cur_bits;
-}
-
-static inline void
-__cfs_hash_set_theta(struct cfs_hash *hs, int min, int max)
-{
-	LASSERT(min < max);
-	hs->hs_min_theta = (u16)min;
-	hs->hs_max_theta = (u16)max;
-}
-
-/* Generic debug formatting routines mainly for proc handler */
-struct seq_file;
-void cfs_hash_debug_header(struct seq_file *m);
-void cfs_hash_debug_str(struct cfs_hash *hs, struct seq_file *m);
-
-/*
- * Generic djb2 hash algorithm for character arrays.
- */
-static inline unsigned
-cfs_hash_djb2_hash(const void *key, size_t size, unsigned int mask)
-{
-	unsigned int i, hash = 5381;
-
-	LASSERT(key);
-
-	for (i = 0; i < size; i++)
-		hash = hash * 33 + ((char *)key)[i];
-
-	return (hash & mask);
-}
-
-/*
- * Generic u32 hash algorithm.
- */
-static inline unsigned
-cfs_hash_u32_hash(const u32 key, unsigned int mask)
-{
-	return ((key * CFS_GOLDEN_RATIO_PRIME_32) & mask);
-}
-
-/*
- * Generic u64 hash algorithm.
- */
-static inline unsigned
-cfs_hash_u64_hash(const u64 key, unsigned int mask)
-{
-	return ((unsigned int)(key * CFS_GOLDEN_RATIO_PRIME_64) & mask);
-}
-
-/** iterate over all buckets in @bds (array of struct cfs_hash_bd) */
-#define cfs_hash_for_each_bd(bds, n, i)	\
-	for (i = 0; i < n && (bds)[i].bd_bucket != NULL; i++)
-
-/** iterate over all buckets of @hs */
-#define cfs_hash_for_each_bucket(hs, bd, pos)			\
-	for (pos = 0;						\
-	     pos < CFS_HASH_NBKT(hs) &&				\
-	     ((bd)->bd_bucket = (hs)->hs_buckets[pos]) != NULL; pos++)
-
-/** iterate over all hlist of bucket @bd */
-#define cfs_hash_bd_for_each_hlist(hs, bd, hlist)		\
-	for ((bd)->bd_offset = 0;				\
-	     (bd)->bd_offset < CFS_HASH_BKT_NHLIST(hs) &&	\
-	     (hlist = cfs_hash_bd_hhead(hs, bd)) != NULL;	\
-	     (bd)->bd_offset++)
-
-/* !__LIBCFS__HASH_H__ */
-#endif
diff --git a/drivers/staging/lustre/lnet/libcfs/Makefile b/drivers/staging/lustre/lnet/libcfs/Makefile
index b7dc7ac11cc5..36b49a6b7b88 100644
--- a/drivers/staging/lustre/lnet/libcfs/Makefile
+++ b/drivers/staging/lustre/lnet/libcfs/Makefile
@@ -13,7 +13,7 @@ libcfs-linux-objs += linux-crypto-adler.o
 libcfs-linux-objs := $(addprefix linux/,$(libcfs-linux-objs))
 
 libcfs-all-objs := debug.o fail.o module.o tracefile.o \
-		   libcfs_string.o hash.o \
+		   libcfs_string.o \
 		   libcfs_cpu.o libcfs_mem.o libcfs_lock.o
 
 libcfs-objs := $(libcfs-linux-objs) $(libcfs-all-objs)
diff --git a/drivers/staging/lustre/lnet/libcfs/hash.c b/drivers/staging/lustre/lnet/libcfs/hash.c
deleted file mode 100644
index f7b3c9306456..000000000000
--- a/drivers/staging/lustre/lnet/libcfs/hash.c
+++ /dev/null
@@ -1,2064 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-/*
- * GPL HEADER START
- *
- * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 only,
- * as published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
- * General Public License version 2 for more details (a copy is included
- * in the LICENSE file that accompanied this code).
- *
- * You should have received a copy of the GNU General Public License
- * version 2 along with this program; If not, see
- * http://www.gnu.org/licenses/gpl-2.0.html
- *
- * GPL HEADER END
- */
-/*
- * Copyright (c) 2009, 2010, Oracle and/or its affiliates. All rights reserved.
- * Use is subject to license terms.
- *
- * Copyright (c) 2011, 2012, Intel Corporation.
- */
-/*
- * This file is part of Lustre, http://www.lustre.org/
- * Lustre is a trademark of Sun Microsystems, Inc.
- *
- * libcfs/libcfs/hash.c
- *
- * Implement a hash class for hash process in lustre system.
- *
- * Author: YuZhangyong <yzy@clusterfs.com>
- *
- * 2008-08-15: Brian Behlendorf <behlendorf1@llnl.gov>
- * - Simplified API and improved documentation
- * - Added per-hash feature flags:
- *   * CFS_HASH_DEBUG additional validation
- *   * CFS_HASH_REHASH dynamic rehashing
- * - Added per-hash statistics
- * - General performance enhancements
- *
- * 2009-07-31: Liang Zhen <zhen.liang@sun.com>
- * - move all stuff to libcfs
- * - don't allow cur_bits != max_bits without setting of CFS_HASH_REHASH
- * - ignore hs_rwlock if without CFS_HASH_REHASH setting
- * - buckets are allocated one by one(instead of contiguous memory),
- *   to avoid unnecessary cacheline conflict
- *
- * 2010-03-01: Liang Zhen <zhen.liang@sun.com>
- * - "bucket" is a group of hlist_head now, user can specify bucket size
- *   by bkt_bits of cfs_hash_create(), all hlist_heads in a bucket share
- *   one lock for reducing memory overhead.
- *
- * - support lockless hash, caller will take care of locks:
- *   avoid lock overhead for hash tables that are already protected
- *   by locking in the caller for another reason
- *
- * - support both spin_lock/rwlock for bucket:
- *   overhead of spinlock contention is lower than read/write
- *   contention of rwlock, so using spinlock to serialize operations on
- *   bucket is more reasonable for those frequently changed hash tables
- *
- * - support one-single lock mode:
- *   one lock to protect all hash operations to avoid overhead of
- *   multiple locks if hash table is always small
- *
- * - removed a lot of unnecessary addref & decref on hash element:
- *   addref & decref are atomic operations in many use-cases which
- *   are expensive.
- *
- * - support non-blocking cfs_hash_add() and cfs_hash_findadd():
- *   some lustre use-cases require these functions to be strictly
- *   non-blocking, we need to schedule required rehash on a different
- *   thread on those cases.
- *
- * - safer rehash on large hash table
- *   In old implementation, rehash function will exclusively lock the
- *   hash table and finish rehash in one batch, it's dangerous on SMP
- *   system because rehash millions of elements could take long time.
- *   New implemented rehash can release lock and relax CPU in middle
- *   of rehash, it's safe for another thread to search/change on the
- *   hash table even it's in rehasing.
- *
- * - support two different refcount modes
- *   . hash table has refcount on element
- *   . hash table doesn't change refcount on adding/removing element
- *
- * - support long name hash table (for param-tree)
- *
- * - fix a bug for cfs_hash_rehash_key:
- *   in old implementation, cfs_hash_rehash_key could screw up the
- *   hash-table because @key is overwritten without any protection.
- *   Now we need user to define hs_keycpy for those rehash enabled
- *   hash tables, cfs_hash_rehash_key will overwrite hash-key
- *   inside lock by calling hs_keycpy.
- *
- * - better hash iteration:
- *   Now we support both locked iteration & lockless iteration of hash
- *   table. Also, user can break the iteration by return 1 in callback.
- */
-#include <linux/seq_file.h>
-#include <linux/log2.h>
-
-#include <linux/libcfs/libcfs.h>
-
-#if CFS_HASH_DEBUG_LEVEL >= CFS_HASH_DEBUG_1
-static unsigned int warn_on_depth = 8;
-module_param(warn_on_depth, uint, 0644);
-MODULE_PARM_DESC(warn_on_depth, "warning when hash depth is high.");
-#endif
-
-struct workqueue_struct *cfs_rehash_wq;
-
-static inline void
-cfs_hash_nl_lock(union cfs_hash_lock *lock, int exclusive) {}
-
-static inline void
-cfs_hash_nl_unlock(union cfs_hash_lock *lock, int exclusive) {}
-
-static inline void
-cfs_hash_spin_lock(union cfs_hash_lock *lock, int exclusive)
-	__acquires(&lock->spin)
-{
-	spin_lock(&lock->spin);
-}
-
-static inline void
-cfs_hash_spin_unlock(union cfs_hash_lock *lock, int exclusive)
-	__releases(&lock->spin)
-{
-	spin_unlock(&lock->spin);
-}
-
-static inline void
-cfs_hash_rw_lock(union cfs_hash_lock *lock, int exclusive)
-	__acquires(&lock->rw)
-{
-	if (!exclusive)
-		read_lock(&lock->rw);
-	else
-		write_lock(&lock->rw);
-}
-
-static inline void
-cfs_hash_rw_unlock(union cfs_hash_lock *lock, int exclusive)
-	__releases(&lock->rw)
-{
-	if (!exclusive)
-		read_unlock(&lock->rw);
-	else
-		write_unlock(&lock->rw);
-}
-
-/** No lock hash */
-static struct cfs_hash_lock_ops cfs_hash_nl_lops = {
-	.hs_lock	= cfs_hash_nl_lock,
-	.hs_unlock	= cfs_hash_nl_unlock,
-	.hs_bkt_lock	= cfs_hash_nl_lock,
-	.hs_bkt_unlock	= cfs_hash_nl_unlock,
-};
-
-/** no bucket lock, one spinlock to protect everything */
-static struct cfs_hash_lock_ops cfs_hash_nbl_lops = {
-	.hs_lock	= cfs_hash_spin_lock,
-	.hs_unlock	= cfs_hash_spin_unlock,
-	.hs_bkt_lock	= cfs_hash_nl_lock,
-	.hs_bkt_unlock	= cfs_hash_nl_unlock,
-};
-
-/** spin bucket lock, rehash is enabled */
-static struct cfs_hash_lock_ops cfs_hash_bkt_spin_lops = {
-	.hs_lock	= cfs_hash_rw_lock,
-	.hs_unlock	= cfs_hash_rw_unlock,
-	.hs_bkt_lock	= cfs_hash_spin_lock,
-	.hs_bkt_unlock	= cfs_hash_spin_unlock,
-};
-
-/** rw bucket lock, rehash is enabled */
-static struct cfs_hash_lock_ops cfs_hash_bkt_rw_lops = {
-	.hs_lock	= cfs_hash_rw_lock,
-	.hs_unlock	= cfs_hash_rw_unlock,
-	.hs_bkt_lock	= cfs_hash_rw_lock,
-	.hs_bkt_unlock	= cfs_hash_rw_unlock,
-};
-
-/** spin bucket lock, rehash is disabled */
-static struct cfs_hash_lock_ops cfs_hash_nr_bkt_spin_lops = {
-	.hs_lock	= cfs_hash_nl_lock,
-	.hs_unlock	= cfs_hash_nl_unlock,
-	.hs_bkt_lock	= cfs_hash_spin_lock,
-	.hs_bkt_unlock	= cfs_hash_spin_unlock,
-};
-
-/** rw bucket lock, rehash is disabled */
-static struct cfs_hash_lock_ops cfs_hash_nr_bkt_rw_lops = {
-	.hs_lock	= cfs_hash_nl_lock,
-	.hs_unlock	= cfs_hash_nl_unlock,
-	.hs_bkt_lock	= cfs_hash_rw_lock,
-	.hs_bkt_unlock	= cfs_hash_rw_unlock,
-};
-
-static void
-cfs_hash_lock_setup(struct cfs_hash *hs)
-{
-	if (cfs_hash_with_no_lock(hs)) {
-		hs->hs_lops = &cfs_hash_nl_lops;
-
-	} else if (cfs_hash_with_no_bktlock(hs)) {
-		hs->hs_lops = &cfs_hash_nbl_lops;
-		spin_lock_init(&hs->hs_lock.spin);
-
-	} else if (cfs_hash_with_rehash(hs)) {
-		rwlock_init(&hs->hs_lock.rw);
-
-		if (cfs_hash_with_rw_bktlock(hs))
-			hs->hs_lops = &cfs_hash_bkt_rw_lops;
-		else if (cfs_hash_with_spin_bktlock(hs))
-			hs->hs_lops = &cfs_hash_bkt_spin_lops;
-		else
-			LBUG();
-	} else {
-		if (cfs_hash_with_rw_bktlock(hs))
-			hs->hs_lops = &cfs_hash_nr_bkt_rw_lops;
-		else if (cfs_hash_with_spin_bktlock(hs))
-			hs->hs_lops = &cfs_hash_nr_bkt_spin_lops;
-		else
-			LBUG();
-	}
-}
-
-/**
- * Simple hash head without depth tracking
- * new element is always added to head of hlist
- */
-struct cfs_hash_head {
-	struct hlist_head	hh_head;	/**< entries list */
-};
-
-static int
-cfs_hash_hh_hhead_size(struct cfs_hash *hs)
-{
-	return sizeof(struct cfs_hash_head);
-}
-
-static struct hlist_head *
-cfs_hash_hh_hhead(struct cfs_hash *hs, struct cfs_hash_bd *bd)
-{
-	struct cfs_hash_head *head;
-
-	head = (struct cfs_hash_head *)&bd->bd_bucket->hsb_head[0];
-	return &head[bd->bd_offset].hh_head;
-}
-
-static int
-cfs_hash_hh_hnode_add(struct cfs_hash *hs, struct cfs_hash_bd *bd,
-		      struct hlist_node *hnode)
-{
-	hlist_add_head(hnode, cfs_hash_hh_hhead(hs, bd));
-	return -1; /* unknown depth */
-}
-
-static int
-cfs_hash_hh_hnode_del(struct cfs_hash *hs, struct cfs_hash_bd *bd,
-		      struct hlist_node *hnode)
-{
-	hlist_del_init(hnode);
-	return -1; /* unknown depth */
-}
-
-/**
- * Simple hash head with depth tracking
- * new element is always added to head of hlist
- */
-struct cfs_hash_head_dep {
-	struct hlist_head	hd_head;	/**< entries list */
-	unsigned int		hd_depth;	/**< list length */
-};
-
-static int
-cfs_hash_hd_hhead_size(struct cfs_hash *hs)
-{
-	return sizeof(struct cfs_hash_head_dep);
-}
-
-static struct hlist_head *
-cfs_hash_hd_hhead(struct cfs_hash *hs, struct cfs_hash_bd *bd)
-{
-	struct cfs_hash_head_dep *head;
-
-	head = (struct cfs_hash_head_dep *)&bd->bd_bucket->hsb_head[0];
-	return &head[bd->bd_offset].hd_head;
-}
-
-static int
-cfs_hash_hd_hnode_add(struct cfs_hash *hs, struct cfs_hash_bd *bd,
-		      struct hlist_node *hnode)
-{
-	struct cfs_hash_head_dep *hh;
-
-	hh = container_of(cfs_hash_hd_hhead(hs, bd),
-			  struct cfs_hash_head_dep, hd_head);
-	hlist_add_head(hnode, &hh->hd_head);
-	return ++hh->hd_depth;
-}
-
-static int
-cfs_hash_hd_hnode_del(struct cfs_hash *hs, struct cfs_hash_bd *bd,
-		      struct hlist_node *hnode)
-{
-	struct cfs_hash_head_dep *hh;
-
-	hh = container_of(cfs_hash_hd_hhead(hs, bd),
-			  struct cfs_hash_head_dep, hd_head);
-	hlist_del_init(hnode);
-	return --hh->hd_depth;
-}
-
-/**
- * double links hash head without depth tracking
- * new element is always added to tail of hlist
- */
-struct cfs_hash_dhead {
-	struct hlist_head	dh_head;	/**< entries list */
-	struct hlist_node	*dh_tail;	/**< the last entry */
-};
-
-static int
-cfs_hash_dh_hhead_size(struct cfs_hash *hs)
-{
-	return sizeof(struct cfs_hash_dhead);
-}
-
-static struct hlist_head *
-cfs_hash_dh_hhead(struct cfs_hash *hs, struct cfs_hash_bd *bd)
-{
-	struct cfs_hash_dhead *head;
-
-	head = (struct cfs_hash_dhead *)&bd->bd_bucket->hsb_head[0];
-	return &head[bd->bd_offset].dh_head;
-}
-
-static int
-cfs_hash_dh_hnode_add(struct cfs_hash *hs, struct cfs_hash_bd *bd,
-		      struct hlist_node *hnode)
-{
-	struct cfs_hash_dhead *dh;
-
-	dh = container_of(cfs_hash_dh_hhead(hs, bd),
-			  struct cfs_hash_dhead, dh_head);
-	if (dh->dh_tail) /* not empty */
-		hlist_add_behind(hnode, dh->dh_tail);
-	else /* empty list */
-		hlist_add_head(hnode, &dh->dh_head);
-	dh->dh_tail = hnode;
-	return -1; /* unknown depth */
-}
-
-static int
-cfs_hash_dh_hnode_del(struct cfs_hash *hs, struct cfs_hash_bd *bd,
-		      struct hlist_node *hnd)
-{
-	struct cfs_hash_dhead *dh;
-
-	dh = container_of(cfs_hash_dh_hhead(hs, bd),
-			  struct cfs_hash_dhead, dh_head);
-	if (!hnd->next) { /* it's the tail */
-		dh->dh_tail = (hnd->pprev == &dh->dh_head.first) ? NULL :
-			      container_of(hnd->pprev, struct hlist_node, next);
-	}
-	hlist_del_init(hnd);
-	return -1; /* unknown depth */
-}
-
-/**
- * double links hash head with depth tracking
- * new element is always added to tail of hlist
- */
-struct cfs_hash_dhead_dep {
-	struct hlist_head	dd_head;	/**< entries list */
-	struct hlist_node	*dd_tail;	/**< the last entry */
-	unsigned int		dd_depth;	/**< list length */
-};
-
-static int
-cfs_hash_dd_hhead_size(struct cfs_hash *hs)
-{
-	return sizeof(struct cfs_hash_dhead_dep);
-}
-
-static struct hlist_head *
-cfs_hash_dd_hhead(struct cfs_hash *hs, struct cfs_hash_bd *bd)
-{
-	struct cfs_hash_dhead_dep *head;
-
-	head = (struct cfs_hash_dhead_dep *)&bd->bd_bucket->hsb_head[0];
-	return &head[bd->bd_offset].dd_head;
-}
-
-static int
-cfs_hash_dd_hnode_add(struct cfs_hash *hs, struct cfs_hash_bd *bd,
-		      struct hlist_node *hnode)
-{
-	struct cfs_hash_dhead_dep *dh;
-
-	dh = container_of(cfs_hash_dd_hhead(hs, bd),
-			  struct cfs_hash_dhead_dep, dd_head);
-	if (dh->dd_tail) /* not empty */
-		hlist_add_behind(hnode, dh->dd_tail);
-	else /* empty list */
-		hlist_add_head(hnode, &dh->dd_head);
-	dh->dd_tail = hnode;
-	return ++dh->dd_depth;
-}
-
-static int
-cfs_hash_dd_hnode_del(struct cfs_hash *hs, struct cfs_hash_bd *bd,
-		      struct hlist_node *hnd)
-{
-	struct cfs_hash_dhead_dep *dh;
-
-	dh = container_of(cfs_hash_dd_hhead(hs, bd),
-			  struct cfs_hash_dhead_dep, dd_head);
-	if (!hnd->next) { /* it's the tail */
-		dh->dd_tail = (hnd->pprev == &dh->dd_head.first) ? NULL :
-			      container_of(hnd->pprev, struct hlist_node, next);
-	}
-	hlist_del_init(hnd);
-	return --dh->dd_depth;
-}
-
-static struct cfs_hash_hlist_ops cfs_hash_hh_hops = {
-	.hop_hhead	= cfs_hash_hh_hhead,
-	.hop_hhead_size	= cfs_hash_hh_hhead_size,
-	.hop_hnode_add	= cfs_hash_hh_hnode_add,
-	.hop_hnode_del	= cfs_hash_hh_hnode_del,
-};
-
-static struct cfs_hash_hlist_ops cfs_hash_hd_hops = {
-	.hop_hhead	= cfs_hash_hd_hhead,
-	.hop_hhead_size	= cfs_hash_hd_hhead_size,
-	.hop_hnode_add	= cfs_hash_hd_hnode_add,
-	.hop_hnode_del	= cfs_hash_hd_hnode_del,
-};
-
-static struct cfs_hash_hlist_ops cfs_hash_dh_hops = {
-	.hop_hhead	= cfs_hash_dh_hhead,
-	.hop_hhead_size	= cfs_hash_dh_hhead_size,
-	.hop_hnode_add	= cfs_hash_dh_hnode_add,
-	.hop_hnode_del	= cfs_hash_dh_hnode_del,
-};
-
-static struct cfs_hash_hlist_ops cfs_hash_dd_hops = {
-	.hop_hhead	= cfs_hash_dd_hhead,
-	.hop_hhead_size	= cfs_hash_dd_hhead_size,
-	.hop_hnode_add	= cfs_hash_dd_hnode_add,
-	.hop_hnode_del	= cfs_hash_dd_hnode_del,
-};
-
-static void
-cfs_hash_hlist_setup(struct cfs_hash *hs)
-{
-	if (cfs_hash_with_add_tail(hs)) {
-		hs->hs_hops = cfs_hash_with_depth(hs) ?
-			      &cfs_hash_dd_hops : &cfs_hash_dh_hops;
-	} else {
-		hs->hs_hops = cfs_hash_with_depth(hs) ?
-			      &cfs_hash_hd_hops : &cfs_hash_hh_hops;
-	}
-}
-
-static void
-cfs_hash_bd_from_key(struct cfs_hash *hs, struct cfs_hash_bucket **bkts,
-		     unsigned int bits, const void *key, struct cfs_hash_bd *bd)
-{
-	unsigned int index = cfs_hash_id(hs, key, (1U << bits) - 1);
-
-	LASSERT(bits == hs->hs_cur_bits || bits == hs->hs_rehash_bits);
-
-	bd->bd_bucket = bkts[index & ((1U << (bits - hs->hs_bkt_bits)) - 1)];
-	bd->bd_offset = index >> (bits - hs->hs_bkt_bits);
-}
-
-void
-cfs_hash_bd_get(struct cfs_hash *hs, const void *key, struct cfs_hash_bd *bd)
-{
-	/* NB: caller should hold hs->hs_rwlock if REHASH is set */
-	if (likely(!hs->hs_rehash_buckets)) {
-		cfs_hash_bd_from_key(hs, hs->hs_buckets,
-				     hs->hs_cur_bits, key, bd);
-	} else {
-		LASSERT(hs->hs_rehash_bits);
-		cfs_hash_bd_from_key(hs, hs->hs_rehash_buckets,
-				     hs->hs_rehash_bits, key, bd);
-	}
-}
-EXPORT_SYMBOL(cfs_hash_bd_get);
-
-static inline void
-cfs_hash_bd_dep_record(struct cfs_hash *hs, struct cfs_hash_bd *bd, int dep_cur)
-{
-	if (likely(dep_cur <= bd->bd_bucket->hsb_depmax))
-		return;
-
-	bd->bd_bucket->hsb_depmax = dep_cur;
-# if CFS_HASH_DEBUG_LEVEL >= CFS_HASH_DEBUG_1
-	if (likely(!warn_on_depth ||
-		   max(warn_on_depth, hs->hs_dep_max) >= dep_cur))
-		return;
-
-	spin_lock(&hs->hs_dep_lock);
-	hs->hs_dep_max = dep_cur;
-	hs->hs_dep_bkt = bd->bd_bucket->hsb_index;
-	hs->hs_dep_off = bd->bd_offset;
-	hs->hs_dep_bits = hs->hs_cur_bits;
-	spin_unlock(&hs->hs_dep_lock);
-
-	queue_work(cfs_rehash_wq, &hs->hs_dep_work);
-# endif
-}
-
-void
-cfs_hash_bd_add_locked(struct cfs_hash *hs, struct cfs_hash_bd *bd,
-		       struct hlist_node *hnode)
-{
-	int rc;
-
-	rc = hs->hs_hops->hop_hnode_add(hs, bd, hnode);
-	cfs_hash_bd_dep_record(hs, bd, rc);
-	bd->bd_bucket->hsb_version++;
-	if (unlikely(!bd->bd_bucket->hsb_version))
-		bd->bd_bucket->hsb_version++;
-	bd->bd_bucket->hsb_count++;
-
-	if (cfs_hash_with_counter(hs))
-		atomic_inc(&hs->hs_count);
-	if (!cfs_hash_with_no_itemref(hs))
-		cfs_hash_get(hs, hnode);
-}
-EXPORT_SYMBOL(cfs_hash_bd_add_locked);
-
-void
-cfs_hash_bd_del_locked(struct cfs_hash *hs, struct cfs_hash_bd *bd,
-		       struct hlist_node *hnode)
-{
-	hs->hs_hops->hop_hnode_del(hs, bd, hnode);
-
-	LASSERT(bd->bd_bucket->hsb_count > 0);
-	bd->bd_bucket->hsb_count--;
-	bd->bd_bucket->hsb_version++;
-	if (unlikely(!bd->bd_bucket->hsb_version))
-		bd->bd_bucket->hsb_version++;
-
-	if (cfs_hash_with_counter(hs)) {
-		LASSERT(atomic_read(&hs->hs_count) > 0);
-		atomic_dec(&hs->hs_count);
-	}
-	if (!cfs_hash_with_no_itemref(hs))
-		cfs_hash_put_locked(hs, hnode);
-}
-EXPORT_SYMBOL(cfs_hash_bd_del_locked);
-
-void
-cfs_hash_bd_move_locked(struct cfs_hash *hs, struct cfs_hash_bd *bd_old,
-			struct cfs_hash_bd *bd_new, struct hlist_node *hnode)
-{
-	struct cfs_hash_bucket *obkt = bd_old->bd_bucket;
-	struct cfs_hash_bucket *nbkt = bd_new->bd_bucket;
-	int rc;
-
-	if (!cfs_hash_bd_compare(bd_old, bd_new))
-		return;
-
-	/* use cfs_hash_bd_hnode_add/del, to avoid atomic & refcount ops
-	 * in cfs_hash_bd_del/add_locked
-	 */
-	hs->hs_hops->hop_hnode_del(hs, bd_old, hnode);
-	rc = hs->hs_hops->hop_hnode_add(hs, bd_new, hnode);
-	cfs_hash_bd_dep_record(hs, bd_new, rc);
-
-	LASSERT(obkt->hsb_count > 0);
-	obkt->hsb_count--;
-	obkt->hsb_version++;
-	if (unlikely(!obkt->hsb_version))
-		obkt->hsb_version++;
-	nbkt->hsb_count++;
-	nbkt->hsb_version++;
-	if (unlikely(!nbkt->hsb_version))
-		nbkt->hsb_version++;
-}
-
-enum {
-	/** always set, for sanity (avoid ZERO intent) */
-	CFS_HS_LOOKUP_MASK_FIND	= BIT(0),
-	/** return entry with a ref */
-	CFS_HS_LOOKUP_MASK_REF	= BIT(1),
-	/** add entry if not existing */
-	CFS_HS_LOOKUP_MASK_ADD	= BIT(2),
-	/** delete entry, ignore other masks */
-	CFS_HS_LOOKUP_MASK_DEL	= BIT(3),
-};
-
-enum cfs_hash_lookup_intent {
-	/** return item w/o refcount */
-	CFS_HS_LOOKUP_IT_PEEK	 = CFS_HS_LOOKUP_MASK_FIND,
-	/** return item with refcount */
-	CFS_HS_LOOKUP_IT_FIND	 = (CFS_HS_LOOKUP_MASK_FIND |
-				    CFS_HS_LOOKUP_MASK_REF),
-	/** return item w/o refcount if existed, otherwise add */
-	CFS_HS_LOOKUP_IT_ADD	 = (CFS_HS_LOOKUP_MASK_FIND |
-				    CFS_HS_LOOKUP_MASK_ADD),
-	/** return item with refcount if existed, otherwise add */
-	CFS_HS_LOOKUP_IT_FINDADD = (CFS_HS_LOOKUP_IT_FIND |
-				    CFS_HS_LOOKUP_MASK_ADD),
-	/** delete if existed */
-	CFS_HS_LOOKUP_IT_FINDDEL = (CFS_HS_LOOKUP_MASK_FIND |
-				    CFS_HS_LOOKUP_MASK_DEL)
-};
-
-static struct hlist_node *
-cfs_hash_bd_lookup_intent(struct cfs_hash *hs, struct cfs_hash_bd *bd,
-			  const void *key, struct hlist_node *hnode,
-			  enum cfs_hash_lookup_intent intent)
-
-{
-	struct hlist_head *hhead = cfs_hash_bd_hhead(hs, bd);
-	struct hlist_node *ehnode;
-	struct hlist_node *match;
-	int intent_add = intent & CFS_HS_LOOKUP_MASK_ADD;
-
-	/* with this function, we can avoid a lot of useless refcount ops,
-	 * which are expensive atomic operations most time.
-	 */
-	match = intent_add ? NULL : hnode;
-	hlist_for_each(ehnode, hhead) {
-		if (!cfs_hash_keycmp(hs, key, ehnode))
-			continue;
-
-		if (match && match != ehnode) /* can't match */
-			continue;
-
-		/* match and ... */
-		if (intent & CFS_HS_LOOKUP_MASK_DEL) {
-			cfs_hash_bd_del_locked(hs, bd, ehnode);
-			return ehnode;
-		}
-
-		/* caller wants refcount? */
-		if (intent & CFS_HS_LOOKUP_MASK_REF)
-			cfs_hash_get(hs, ehnode);
-		return ehnode;
-	}
-	/* no match item */
-	if (!intent_add)
-		return NULL;
-
-	LASSERT(hnode);
-	cfs_hash_bd_add_locked(hs, bd, hnode);
-	return hnode;
-}
-
-struct hlist_node *
-cfs_hash_bd_lookup_locked(struct cfs_hash *hs, struct cfs_hash_bd *bd,
-			  const void *key)
-{
-	return cfs_hash_bd_lookup_intent(hs, bd, key, NULL,
-					 CFS_HS_LOOKUP_IT_FIND);
-}
-EXPORT_SYMBOL(cfs_hash_bd_lookup_locked);
-
-struct hlist_node *
-cfs_hash_bd_peek_locked(struct cfs_hash *hs, struct cfs_hash_bd *bd,
-			const void *key)
-{
-	return cfs_hash_bd_lookup_intent(hs, bd, key, NULL,
-					 CFS_HS_LOOKUP_IT_PEEK);
-}
-EXPORT_SYMBOL(cfs_hash_bd_peek_locked);
-
-static void
-cfs_hash_multi_bd_lock(struct cfs_hash *hs, struct cfs_hash_bd *bds,
-		       unsigned int n, int excl)
-{
-	struct cfs_hash_bucket *prev = NULL;
-	int i;
-
-	/**
-	 * bds must be ascendantly ordered by bd->bd_bucket->hsb_index.
-	 * NB: it's possible that several bds point to the same bucket but
-	 * have different bd::bd_offset, so need take care of deadlock.
-	 */
-	cfs_hash_for_each_bd(bds, n, i) {
-		if (prev == bds[i].bd_bucket)
-			continue;
-
-		LASSERT(!prev || prev->hsb_index < bds[i].bd_bucket->hsb_index);
-		cfs_hash_bd_lock(hs, &bds[i], excl);
-		prev = bds[i].bd_bucket;
-	}
-}
-
-static void
-cfs_hash_multi_bd_unlock(struct cfs_hash *hs, struct cfs_hash_bd *bds,
-			 unsigned int n, int excl)
-{
-	struct cfs_hash_bucket *prev = NULL;
-	int i;
-
-	cfs_hash_for_each_bd(bds, n, i) {
-		if (prev != bds[i].bd_bucket) {
-			cfs_hash_bd_unlock(hs, &bds[i], excl);
-			prev = bds[i].bd_bucket;
-		}
-	}
-}
-
-static struct hlist_node *
-cfs_hash_multi_bd_lookup_locked(struct cfs_hash *hs, struct cfs_hash_bd *bds,
-				unsigned int n, const void *key)
-{
-	struct hlist_node *ehnode;
-	unsigned int i;
-
-	cfs_hash_for_each_bd(bds, n, i) {
-		ehnode = cfs_hash_bd_lookup_intent(hs, &bds[i], key, NULL,
-						   CFS_HS_LOOKUP_IT_FIND);
-		if (ehnode)
-			return ehnode;
-	}
-	return NULL;
-}
-
-static struct hlist_node *
-cfs_hash_multi_bd_findadd_locked(struct cfs_hash *hs, struct cfs_hash_bd *bds,
-				 unsigned int n, const void *key,
-				 struct hlist_node *hnode, int noref)
-{
-	struct hlist_node *ehnode;
-	int intent;
-	unsigned int i;
-
-	LASSERT(hnode);
-	intent = (!noref * CFS_HS_LOOKUP_MASK_REF) | CFS_HS_LOOKUP_IT_PEEK;
-
-	cfs_hash_for_each_bd(bds, n, i) {
-		ehnode = cfs_hash_bd_lookup_intent(hs, &bds[i], key,
-						   NULL, intent);
-		if (ehnode)
-			return ehnode;
-	}
-
-	if (i == 1) { /* only one bucket */
-		cfs_hash_bd_add_locked(hs, &bds[0], hnode);
-	} else {
-		struct cfs_hash_bd mybd;
-
-		cfs_hash_bd_get(hs, key, &mybd);
-		cfs_hash_bd_add_locked(hs, &mybd, hnode);
-	}
-
-	return hnode;
-}
-
-static struct hlist_node *
-cfs_hash_multi_bd_finddel_locked(struct cfs_hash *hs, struct cfs_hash_bd *bds,
-				 unsigned int n, const void *key,
-				 struct hlist_node *hnode)
-{
-	struct hlist_node *ehnode;
-	unsigned int i;
-
-	cfs_hash_for_each_bd(bds, n, i) {
-		ehnode = cfs_hash_bd_lookup_intent(hs, &bds[i], key, hnode,
-						   CFS_HS_LOOKUP_IT_FINDDEL);
-		if (ehnode)
-			return ehnode;
-	}
-	return NULL;
-}
-
-static void
-cfs_hash_bd_order(struct cfs_hash_bd *bd1, struct cfs_hash_bd *bd2)
-{
-	int rc;
-
-	if (!bd2->bd_bucket)
-		return;
-
-	if (!bd1->bd_bucket) {
-		*bd1 = *bd2;
-		bd2->bd_bucket = NULL;
-		return;
-	}
-
-	rc = cfs_hash_bd_compare(bd1, bd2);
-	if (!rc)
-		bd2->bd_bucket = NULL;
-	else if (rc > 0)
-		swap(*bd1, *bd2); /* swap bd1 and bd2 */
-}
-
-void
-cfs_hash_dual_bd_get(struct cfs_hash *hs, const void *key,
-		     struct cfs_hash_bd *bds)
-{
-	/* NB: caller should hold hs_lock.rw if REHASH is set */
-	cfs_hash_bd_from_key(hs, hs->hs_buckets,
-			     hs->hs_cur_bits, key, &bds[0]);
-	if (likely(!hs->hs_rehash_buckets)) {
-		/* no rehash or not rehashing */
-		bds[1].bd_bucket = NULL;
-		return;
-	}
-
-	LASSERT(hs->hs_rehash_bits);
-	cfs_hash_bd_from_key(hs, hs->hs_rehash_buckets,
-			     hs->hs_rehash_bits, key, &bds[1]);
-
-	cfs_hash_bd_order(&bds[0], &bds[1]);
-}
-
-void
-cfs_hash_dual_bd_lock(struct cfs_hash *hs, struct cfs_hash_bd *bds, int excl)
-{
-	cfs_hash_multi_bd_lock(hs, bds, 2, excl);
-}
-
-void
-cfs_hash_dual_bd_unlock(struct cfs_hash *hs, struct cfs_hash_bd *bds, int excl)
-{
-	cfs_hash_multi_bd_unlock(hs, bds, 2, excl);
-}
-
-struct hlist_node *
-cfs_hash_dual_bd_lookup_locked(struct cfs_hash *hs, struct cfs_hash_bd *bds,
-			       const void *key)
-{
-	return cfs_hash_multi_bd_lookup_locked(hs, bds, 2, key);
-}
-
-struct hlist_node *
-cfs_hash_dual_bd_findadd_locked(struct cfs_hash *hs, struct cfs_hash_bd *bds,
-				const void *key, struct hlist_node *hnode,
-				int noref)
-{
-	return cfs_hash_multi_bd_findadd_locked(hs, bds, 2, key,
-						hnode, noref);
-}
-
-struct hlist_node *
-cfs_hash_dual_bd_finddel_locked(struct cfs_hash *hs, struct cfs_hash_bd *bds,
-				const void *key, struct hlist_node *hnode)
-{
-	return cfs_hash_multi_bd_finddel_locked(hs, bds, 2, key, hnode);
-}
-
-static void
-cfs_hash_buckets_free(struct cfs_hash_bucket **buckets,
-		      int bkt_size, int prev_size, int size)
-{
-	int i;
-
-	for (i = prev_size; i < size; i++)
-		kfree(buckets[i]);
-
-	kvfree(buckets);
-}
-
-/*
- * Create or grow bucket memory. Return old_buckets if no allocation was
- * needed, the newly allocated buckets if allocation was needed and
- * successful, and NULL on error.
- */
-static struct cfs_hash_bucket **
-cfs_hash_buckets_realloc(struct cfs_hash *hs, struct cfs_hash_bucket **old_bkts,
-			 unsigned int old_size, unsigned int new_size)
-{
-	struct cfs_hash_bucket **new_bkts;
-	int i;
-
-	LASSERT(!old_size || old_bkts);
-
-	if (old_bkts && old_size == new_size)
-		return old_bkts;
-
-	new_bkts = kvmalloc_array(new_size, sizeof(new_bkts[0]), GFP_KERNEL);
-	if (!new_bkts)
-		return NULL;
-
-	if (old_bkts) {
-		memcpy(new_bkts, old_bkts,
-		       min(old_size, new_size) * sizeof(*old_bkts));
-	}
-
-	for (i = old_size; i < new_size; i++) {
-		struct hlist_head *hhead;
-		struct cfs_hash_bd bd;
-
-		new_bkts[i] = kzalloc(cfs_hash_bkt_size(hs), GFP_KERNEL);
-		if (!new_bkts[i]) {
-			cfs_hash_buckets_free(new_bkts, cfs_hash_bkt_size(hs),
-					      old_size, new_size);
-			return NULL;
-		}
-
-		new_bkts[i]->hsb_index = i;
-		new_bkts[i]->hsb_version = 1;	/* shouldn't be zero */
-		new_bkts[i]->hsb_depmax = -1;	/* unknown */
-		bd.bd_bucket = new_bkts[i];
-		cfs_hash_bd_for_each_hlist(hs, &bd, hhead)
-			INIT_HLIST_HEAD(hhead);
-
-		if (cfs_hash_with_no_lock(hs) ||
-		    cfs_hash_with_no_bktlock(hs))
-			continue;
-
-		if (cfs_hash_with_rw_bktlock(hs))
-			rwlock_init(&new_bkts[i]->hsb_lock.rw);
-		else if (cfs_hash_with_spin_bktlock(hs))
-			spin_lock_init(&new_bkts[i]->hsb_lock.spin);
-		else
-			LBUG(); /* invalid use-case */
-	}
-	return new_bkts;
-}
-
-/**
- * Initialize new libcfs hash, where:
- * @name     - Descriptive hash name
- * @cur_bits - Initial hash table size, in bits
- * @max_bits - Maximum allowed hash table resize, in bits
- * @ops      - Registered hash table operations
- * @flags    - CFS_HASH_REHASH enable dynamic hash resizing
- *	     - CFS_HASH_SORT enable chained hash sort
- */
-static void cfs_hash_rehash_worker(struct work_struct *work);
-
-#if CFS_HASH_DEBUG_LEVEL >= CFS_HASH_DEBUG_1
-static void cfs_hash_dep_print(struct work_struct *work)
-{
-	struct cfs_hash *hs = container_of(work, struct cfs_hash, hs_dep_work);
-	int dep;
-	int bkt;
-	int off;
-	int bits;
-
-	spin_lock(&hs->hs_dep_lock);
-	dep = hs->hs_dep_max;
-	bkt = hs->hs_dep_bkt;
-	off = hs->hs_dep_off;
-	bits = hs->hs_dep_bits;
-	spin_unlock(&hs->hs_dep_lock);
-
-	LCONSOLE_WARN("#### HASH %s (bits: %d): max depth %d at bucket %d/%d\n",
-		      hs->hs_name, bits, dep, bkt, off);
-	spin_lock(&hs->hs_dep_lock);
-	hs->hs_dep_bits = 0; /* mark as workitem done */
-	spin_unlock(&hs->hs_dep_lock);
-	return;
-}
-
-static void cfs_hash_depth_wi_init(struct cfs_hash *hs)
-{
-	spin_lock_init(&hs->hs_dep_lock);
-	INIT_WORK(&hs->hs_dep_work, cfs_hash_dep_print);
-}
-
-static void cfs_hash_depth_wi_cancel(struct cfs_hash *hs)
-{
-	cancel_work_sync(&hs->hs_dep_work);
-}
-
-#else /* CFS_HASH_DEBUG_LEVEL < CFS_HASH_DEBUG_1 */
-
-static inline void cfs_hash_depth_wi_init(struct cfs_hash *hs) {}
-static inline void cfs_hash_depth_wi_cancel(struct cfs_hash *hs) {}
-
-#endif /* CFS_HASH_DEBUG_LEVEL >= CFS_HASH_DEBUG_1 */
-
-struct cfs_hash *
-cfs_hash_create(char *name, unsigned int cur_bits, unsigned int max_bits,
-		unsigned int bkt_bits, unsigned int extra_bytes,
-		unsigned int min_theta, unsigned int max_theta,
-		struct cfs_hash_ops *ops, unsigned int flags)
-{
-	struct cfs_hash *hs;
-	int len;
-
-	BUILD_BUG_ON(CFS_HASH_THETA_BITS >= 15);
-
-	LASSERT(name);
-	LASSERT(ops->hs_key);
-	LASSERT(ops->hs_hash);
-	LASSERT(ops->hs_object);
-	LASSERT(ops->hs_keycmp);
-	LASSERT(ops->hs_get);
-	LASSERT(ops->hs_put || ops->hs_put_locked);
-
-	if (flags & CFS_HASH_REHASH)
-		flags |= CFS_HASH_COUNTER; /* must have counter */
-
-	LASSERT(cur_bits > 0);
-	LASSERT(cur_bits >= bkt_bits);
-	LASSERT(max_bits >= cur_bits && max_bits < 31);
-	LASSERT(ergo(!(flags & CFS_HASH_REHASH), cur_bits == max_bits));
-	LASSERT(ergo(flags & CFS_HASH_REHASH, !(flags & CFS_HASH_NO_LOCK)));
-	LASSERT(ergo(flags & CFS_HASH_REHASH_KEY, ops->hs_keycpy));
-
-	len = !(flags & CFS_HASH_BIGNAME) ?
-	      CFS_HASH_NAME_LEN : CFS_HASH_BIGNAME_LEN;
-	hs = kzalloc(offsetof(struct cfs_hash, hs_name[len]), GFP_KERNEL);
-	if (!hs)
-		return NULL;
-
-	strlcpy(hs->hs_name, name, len);
-	hs->hs_flags = flags;
-
-	atomic_set(&hs->hs_refcount, 1);
-	atomic_set(&hs->hs_count, 0);
-
-	cfs_hash_lock_setup(hs);
-	cfs_hash_hlist_setup(hs);
-
-	hs->hs_cur_bits = (u8)cur_bits;
-	hs->hs_min_bits = (u8)cur_bits;
-	hs->hs_max_bits = (u8)max_bits;
-	hs->hs_bkt_bits = (u8)bkt_bits;
-
-	hs->hs_ops = ops;
-	hs->hs_extra_bytes = extra_bytes;
-	hs->hs_rehash_bits = 0;
-	INIT_WORK(&hs->hs_rehash_work, cfs_hash_rehash_worker);
-	cfs_hash_depth_wi_init(hs);
-
-	if (cfs_hash_with_rehash(hs))
-		__cfs_hash_set_theta(hs, min_theta, max_theta);
-
-	hs->hs_buckets = cfs_hash_buckets_realloc(hs, NULL, 0,
-						  CFS_HASH_NBKT(hs));
-	if (hs->hs_buckets)
-		return hs;
-
-	kfree(hs);
-	return NULL;
-}
-EXPORT_SYMBOL(cfs_hash_create);
-
-/**
- * Cleanup libcfs hash @hs.
- */
-static void
-cfs_hash_destroy(struct cfs_hash *hs)
-{
-	struct hlist_node *hnode;
-	struct hlist_node *pos;
-	struct cfs_hash_bd bd;
-	int i;
-
-	LASSERT(hs);
-	LASSERT(!cfs_hash_is_exiting(hs) &&
-		!cfs_hash_is_iterating(hs));
-
-	/**
-	 * prohibit further rehashes, don't need any lock because
-	 * I'm the only (last) one who can change it.
-	 */
-	hs->hs_exiting = 1;
-	if (cfs_hash_with_rehash(hs))
-		cfs_hash_rehash_cancel(hs);
-
-	cfs_hash_depth_wi_cancel(hs);
-	/* rehash should be done/canceled */
-	LASSERT(hs->hs_buckets && !hs->hs_rehash_buckets);
-
-	cfs_hash_for_each_bucket(hs, &bd, i) {
-		struct hlist_head *hhead;
-
-		LASSERT(bd.bd_bucket);
-		/* no need to take this lock, just for consistent code */
-		cfs_hash_bd_lock(hs, &bd, 1);
-
-		cfs_hash_bd_for_each_hlist(hs, &bd, hhead) {
-			hlist_for_each_safe(hnode, pos, hhead) {
-				LASSERTF(!cfs_hash_with_assert_empty(hs),
-					 "hash %s bucket %u(%u) is not empty: %u items left\n",
-					 hs->hs_name, bd.bd_bucket->hsb_index,
-					 bd.bd_offset, bd.bd_bucket->hsb_count);
-				/* can't assert key validity, because we
-				 * can interrupt rehash
-				 */
-				cfs_hash_bd_del_locked(hs, &bd, hnode);
-				cfs_hash_exit(hs, hnode);
-			}
-		}
-		LASSERT(!bd.bd_bucket->hsb_count);
-		cfs_hash_bd_unlock(hs, &bd, 1);
-		cond_resched();
-	}
-
-	LASSERT(!atomic_read(&hs->hs_count));
-
-	cfs_hash_buckets_free(hs->hs_buckets, cfs_hash_bkt_size(hs),
-			      0, CFS_HASH_NBKT(hs));
-	i = cfs_hash_with_bigname(hs) ?
-	    CFS_HASH_BIGNAME_LEN : CFS_HASH_NAME_LEN;
-	kfree(hs);
-}
-
-struct cfs_hash *cfs_hash_getref(struct cfs_hash *hs)
-{
-	if (atomic_inc_not_zero(&hs->hs_refcount))
-		return hs;
-	return NULL;
-}
-EXPORT_SYMBOL(cfs_hash_getref);
-
-void cfs_hash_putref(struct cfs_hash *hs)
-{
-	if (atomic_dec_and_test(&hs->hs_refcount))
-		cfs_hash_destroy(hs);
-}
-EXPORT_SYMBOL(cfs_hash_putref);
-
-static inline int
-cfs_hash_rehash_bits(struct cfs_hash *hs)
-{
-	if (cfs_hash_with_no_lock(hs) ||
-	    !cfs_hash_with_rehash(hs))
-		return -EOPNOTSUPP;
-
-	if (unlikely(cfs_hash_is_exiting(hs)))
-		return -ESRCH;
-
-	if (unlikely(cfs_hash_is_rehashing(hs)))
-		return -EALREADY;
-
-	if (unlikely(cfs_hash_is_iterating(hs)))
-		return -EAGAIN;
-
-	/* XXX: need to handle case with max_theta != 2.0
-	 *      and the case with min_theta != 0.5
-	 */
-	if ((hs->hs_cur_bits < hs->hs_max_bits) &&
-	    (__cfs_hash_theta(hs) > hs->hs_max_theta))
-		return hs->hs_cur_bits + 1;
-
-	if (!cfs_hash_with_shrink(hs))
-		return 0;
-
-	if ((hs->hs_cur_bits > hs->hs_min_bits) &&
-	    (__cfs_hash_theta(hs) < hs->hs_min_theta))
-		return hs->hs_cur_bits - 1;
-
-	return 0;
-}
-
-/**
- * don't allow inline rehash if:
- * - user wants non-blocking change (add/del) on hash table
- * - too many elements
- */
-static inline int
-cfs_hash_rehash_inline(struct cfs_hash *hs)
-{
-	return !cfs_hash_with_nblk_change(hs) &&
-	       atomic_read(&hs->hs_count) < CFS_HASH_LOOP_HOG;
-}
-
-/**
- * Add item @hnode to libcfs hash @hs using @key.  The registered
- * ops->hs_get function will be called when the item is added.
- */
-void
-cfs_hash_add(struct cfs_hash *hs, const void *key, struct hlist_node *hnode)
-{
-	struct cfs_hash_bd bd;
-	int bits;
-
-	LASSERT(hlist_unhashed(hnode));
-
-	cfs_hash_lock(hs, 0);
-	cfs_hash_bd_get_and_lock(hs, key, &bd, 1);
-
-	cfs_hash_key_validate(hs, key, hnode);
-	cfs_hash_bd_add_locked(hs, &bd, hnode);
-
-	cfs_hash_bd_unlock(hs, &bd, 1);
-
-	bits = cfs_hash_rehash_bits(hs);
-	cfs_hash_unlock(hs, 0);
-	if (bits > 0)
-		cfs_hash_rehash(hs, cfs_hash_rehash_inline(hs));
-}
-EXPORT_SYMBOL(cfs_hash_add);
-
-static struct hlist_node *
-cfs_hash_find_or_add(struct cfs_hash *hs, const void *key,
-		     struct hlist_node *hnode, int noref)
-{
-	struct hlist_node *ehnode;
-	struct cfs_hash_bd bds[2];
-	int bits = 0;
-
-	LASSERTF(hlist_unhashed(hnode), "hnode = %p\n", hnode);
-
-	cfs_hash_lock(hs, 0);
-	cfs_hash_dual_bd_get_and_lock(hs, key, bds, 1);
-
-	cfs_hash_key_validate(hs, key, hnode);
-	ehnode = cfs_hash_dual_bd_findadd_locked(hs, bds, key,
-						 hnode, noref);
-	cfs_hash_dual_bd_unlock(hs, bds, 1);
-
-	if (ehnode == hnode)	/* new item added */
-		bits = cfs_hash_rehash_bits(hs);
-	cfs_hash_unlock(hs, 0);
-	if (bits > 0)
-		cfs_hash_rehash(hs, cfs_hash_rehash_inline(hs));
-
-	return ehnode;
-}
-
-/**
- * Add item @hnode to libcfs hash @hs using @key.  The registered
- * ops->hs_get function will be called if the item was added.
- * Returns 0 on success or -EALREADY on key collisions.
- */
-int
-cfs_hash_add_unique(struct cfs_hash *hs, const void *key,
-		    struct hlist_node *hnode)
-{
-	return cfs_hash_find_or_add(hs, key, hnode, 1) != hnode ?
-	       -EALREADY : 0;
-}
-EXPORT_SYMBOL(cfs_hash_add_unique);
-
-/**
- * Add item @hnode to libcfs hash @hs using @key.  If this @key
- * already exists in the hash then ops->hs_get will be called on the
- * conflicting entry and that entry will be returned to the caller.
- * Otherwise ops->hs_get is called on the item which was added.
- */
-void *
-cfs_hash_findadd_unique(struct cfs_hash *hs, const void *key,
-			struct hlist_node *hnode)
-{
-	hnode = cfs_hash_find_or_add(hs, key, hnode, 0);
-
-	return cfs_hash_object(hs, hnode);
-}
-EXPORT_SYMBOL(cfs_hash_findadd_unique);
-
-/**
- * Delete item @hnode from the libcfs hash @hs using @key.  The @key
- * is required to ensure the correct hash bucket is locked since there
- * is no direct linkage from the item to the bucket.  The object
- * removed from the hash will be returned and ops->hs_put is called
- * on the removed object.
- */
-void *
-cfs_hash_del(struct cfs_hash *hs, const void *key, struct hlist_node *hnode)
-{
-	void *obj = NULL;
-	int bits = 0;
-	struct cfs_hash_bd bds[2];
-
-	cfs_hash_lock(hs, 0);
-	cfs_hash_dual_bd_get_and_lock(hs, key, bds, 1);
-
-	/* NB: do nothing if @hnode is not in hash table */
-	if (!hnode || !hlist_unhashed(hnode)) {
-		if (!bds[1].bd_bucket && hnode) {
-			cfs_hash_bd_del_locked(hs, &bds[0], hnode);
-		} else {
-			hnode = cfs_hash_dual_bd_finddel_locked(hs, bds,
-								key, hnode);
-		}
-	}
-
-	if (hnode) {
-		obj = cfs_hash_object(hs, hnode);
-		bits = cfs_hash_rehash_bits(hs);
-	}
-
-	cfs_hash_dual_bd_unlock(hs, bds, 1);
-	cfs_hash_unlock(hs, 0);
-	if (bits > 0)
-		cfs_hash_rehash(hs, cfs_hash_rehash_inline(hs));
-
-	return obj;
-}
-EXPORT_SYMBOL(cfs_hash_del);
-
-/**
- * Delete item given @key in libcfs hash @hs.  The first @key found in
- * the hash will be removed; if the key exists multiple times in the hash
- * @hs, this function must be called once per key.  The removed object
- * will be returned and ops->hs_put is called on the removed object.
- */
-void *
-cfs_hash_del_key(struct cfs_hash *hs, const void *key)
-{
-	return cfs_hash_del(hs, key, NULL);
-}
-EXPORT_SYMBOL(cfs_hash_del_key);
-
-/**
- * Lookup an item using @key in the libcfs hash @hs and return it.
- * If the @key is found in the hash, hs->hs_get() is called and the
- * matching object is returned.  It is the caller's responsibility
- * to call the counterpart ops->hs_put using the cfs_hash_put() macro
- * when finished with the object.  If the @key was not found
- * in the hash @hs, NULL is returned.
- */
-void *
-cfs_hash_lookup(struct cfs_hash *hs, const void *key)
-{
-	void *obj = NULL;
-	struct hlist_node *hnode;
-	struct cfs_hash_bd bds[2];
-
-	cfs_hash_lock(hs, 0);
-	cfs_hash_dual_bd_get_and_lock(hs, key, bds, 0);
-
-	hnode = cfs_hash_dual_bd_lookup_locked(hs, bds, key);
-	if (hnode)
-		obj = cfs_hash_object(hs, hnode);
-
-	cfs_hash_dual_bd_unlock(hs, bds, 0);
-	cfs_hash_unlock(hs, 0);
-
-	return obj;
-}
-EXPORT_SYMBOL(cfs_hash_lookup);
-
-static void
-cfs_hash_for_each_enter(struct cfs_hash *hs)
-{
-	LASSERT(!cfs_hash_is_exiting(hs));
-
-	if (!cfs_hash_with_rehash(hs))
-		return;
-	/*
-	 * NB: there is a race on cfs_hash::hs_iterating, but it doesn't matter
-	 * because it's just an unreliable signal to rehash-thread,
-	 * rehash-thread will try to finish rehash ASAP when seeing this.
-	 */
-	hs->hs_iterating = 1;
-
-	cfs_hash_lock(hs, 1);
-	hs->hs_iterators++;
-	cfs_hash_unlock(hs, 1);
-
-	/* NB: iteration is mostly called by service threads.
-	 * Instead of blocking them, we cancel any pending rehash
-	 * request and relaunch it after the iteration completes.
-	 */
-	if (cfs_hash_is_rehashing(hs))
-		cfs_hash_rehash_cancel(hs);
-}
-
-static void
-cfs_hash_for_each_exit(struct cfs_hash *hs)
-{
-	int remained;
-	int bits;
-
-	if (!cfs_hash_with_rehash(hs))
-		return;
-	cfs_hash_lock(hs, 1);
-	remained = --hs->hs_iterators;
-	bits = cfs_hash_rehash_bits(hs);
-	cfs_hash_unlock(hs, 1);
-	/* NB: there is a race on cfs_hash::hs_iterating, see above */
-	if (!remained)
-		hs->hs_iterating = 0;
-	if (bits > 0) {
-		cfs_hash_rehash(hs, atomic_read(&hs->hs_count) <
-				    CFS_HASH_LOOP_HOG);
-	}
-}
-
-/**
- * For each item in the libcfs hash @hs call the passed callback @func
- * and pass to it as an argument each hash item and the private @data.
- *
- * a) the function may sleep!
- * b) during the callback:
- *    . the bucket lock is held so the callback must never sleep.
- *    . if @remove_safe is true, the user can remove the current item
- *      with cfs_hash_bd_del_locked()
- */
-static u64
-cfs_hash_for_each_tight(struct cfs_hash *hs, cfs_hash_for_each_cb_t func,
-			void *data, int remove_safe)
-{
-	struct hlist_node *hnode;
-	struct hlist_node *pos;
-	struct cfs_hash_bd bd;
-	u64 count = 0;
-	int excl = !!remove_safe;
-	int loop = 0;
-	int i;
-
-	cfs_hash_for_each_enter(hs);
-
-	cfs_hash_lock(hs, 0);
-	LASSERT(!cfs_hash_is_rehashing(hs));
-
-	cfs_hash_for_each_bucket(hs, &bd, i) {
-		struct hlist_head *hhead;
-
-		cfs_hash_bd_lock(hs, &bd, excl);
-		if (!func) { /* only glimpse size */
-			count += bd.bd_bucket->hsb_count;
-			cfs_hash_bd_unlock(hs, &bd, excl);
-			continue;
-		}
-
-		cfs_hash_bd_for_each_hlist(hs, &bd, hhead) {
-			hlist_for_each_safe(hnode, pos, hhead) {
-				cfs_hash_bucket_validate(hs, &bd, hnode);
-				count++;
-				loop++;
-				if (func(hs, &bd, hnode, data)) {
-					cfs_hash_bd_unlock(hs, &bd, excl);
-					goto out;
-				}
-			}
-		}
-		cfs_hash_bd_unlock(hs, &bd, excl);
-		if (loop < CFS_HASH_LOOP_HOG)
-			continue;
-		loop = 0;
-		cfs_hash_unlock(hs, 0);
-		cond_resched();
-		cfs_hash_lock(hs, 0);
-	}
- out:
-	cfs_hash_unlock(hs, 0);
-
-	cfs_hash_for_each_exit(hs);
-	return count;
-}
-
-struct cfs_hash_cond_arg {
-	cfs_hash_cond_opt_cb_t	func;
-	void			*arg;
-};
-
-static int
-cfs_hash_cond_del_locked(struct cfs_hash *hs, struct cfs_hash_bd *bd,
-			 struct hlist_node *hnode, void *data)
-{
-	struct cfs_hash_cond_arg *cond = data;
-
-	if (cond->func(cfs_hash_object(hs, hnode), cond->arg))
-		cfs_hash_bd_del_locked(hs, bd, hnode);
-	return 0;
-}
-
-/**
- * Delete items from the libcfs hash @hs for which @func returns true.
- * The write lock is held while looping over each bucket to prevent
- * any object from being referenced.
- */
-void
-cfs_hash_cond_del(struct cfs_hash *hs, cfs_hash_cond_opt_cb_t func, void *data)
-{
-	struct cfs_hash_cond_arg arg = {
-		.func	= func,
-		.arg	= data,
-	};
-
-	cfs_hash_for_each_tight(hs, cfs_hash_cond_del_locked, &arg, 1);
-}
-EXPORT_SYMBOL(cfs_hash_cond_del);
-
-void
-cfs_hash_for_each(struct cfs_hash *hs, cfs_hash_for_each_cb_t func,
-		  void *data)
-{
-	cfs_hash_for_each_tight(hs, func, data, 0);
-}
-EXPORT_SYMBOL(cfs_hash_for_each);
-
-void
-cfs_hash_for_each_safe(struct cfs_hash *hs, cfs_hash_for_each_cb_t func,
-		       void *data)
-{
-	cfs_hash_for_each_tight(hs, func, data, 1);
-}
-EXPORT_SYMBOL(cfs_hash_for_each_safe);
-
-static int
-cfs_hash_peek(struct cfs_hash *hs, struct cfs_hash_bd *bd,
-	      struct hlist_node *hnode, void *data)
-{
-	*(int *)data = 0;
-	return 1; /* return 1 to break the loop */
-}
-
-int
-cfs_hash_is_empty(struct cfs_hash *hs)
-{
-	int empty = 1;
-
-	cfs_hash_for_each_tight(hs, cfs_hash_peek, &empty, 0);
-	return empty;
-}
-EXPORT_SYMBOL(cfs_hash_is_empty);
-
-u64
-cfs_hash_size_get(struct cfs_hash *hs)
-{
-	return cfs_hash_with_counter(hs) ?
-	       atomic_read(&hs->hs_count) :
-	       cfs_hash_for_each_tight(hs, NULL, NULL, 0);
-}
-EXPORT_SYMBOL(cfs_hash_size_get);
-
-/*
- * cfs_hash_for_each_relax:
- * Iterate the hash table and call @func on each item without
- * any lock. This function can't guarantee to finish iteration
- * if these features are enabled:
- *
- *  a. if rehash_key is enabled, an item can be moved from
- *     one bucket to another bucket
- *  b. the user can remove a non-zero-ref item from the hash table,
- *     or, even worse, change its key and insert it into another
- *     hash bucket.
- * There is no way for us to finish the iteration correctly in these
- * two cases, so iteration has to be stopped on change.
- */
-static int
-cfs_hash_for_each_relax(struct cfs_hash *hs, cfs_hash_for_each_cb_t func,
-			void *data, int start)
-{
-	struct hlist_node *next = NULL;
-	struct hlist_node *hnode;
-	struct cfs_hash_bd bd;
-	u32 version;
-	int count = 0;
-	int stop_on_change;
-	int has_put_locked;
-	int end = -1;
-	int rc = 0;
-	int i;
-
-	stop_on_change = cfs_hash_with_rehash_key(hs) ||
-			 !cfs_hash_with_no_itemref(hs);
-	has_put_locked = hs->hs_ops->hs_put_locked != NULL;
-	cfs_hash_lock(hs, 0);
-again:
-	LASSERT(!cfs_hash_is_rehashing(hs));
-
-	cfs_hash_for_each_bucket(hs, &bd, i) {
-		struct hlist_head *hhead;
-
-		if (i < start)
-			continue;
-		else if (end > 0 && i >= end)
-			break;
-
-		cfs_hash_bd_lock(hs, &bd, 0);
-		version = cfs_hash_bd_version_get(&bd);
-
-		cfs_hash_bd_for_each_hlist(hs, &bd, hhead) {
-			hnode = hhead->first;
-			if (!hnode)
-				continue;
-			cfs_hash_get(hs, hnode);
-
-			for (; hnode; hnode = next) {
-				cfs_hash_bucket_validate(hs, &bd, hnode);
-				next = hnode->next;
-				if (next)
-					cfs_hash_get(hs, next);
-				cfs_hash_bd_unlock(hs, &bd, 0);
-				cfs_hash_unlock(hs, 0);
-
-				rc = func(hs, &bd, hnode, data);
-				if (stop_on_change || !has_put_locked)
-					cfs_hash_put(hs, hnode);
-				cond_resched();
-				count++;
-
-				cfs_hash_lock(hs, 0);
-				cfs_hash_bd_lock(hs, &bd, 0);
-				if (stop_on_change) {
-					if (version !=
-					    cfs_hash_bd_version_get(&bd))
-						rc = -EINTR;
-				} else if (has_put_locked) {
-					cfs_hash_put_locked(hs, hnode);
-				}
-				if (rc) /* callback wants to break iteration */
-					break;
-			}
-			if (next) {
-				if (has_put_locked) {
-					cfs_hash_put_locked(hs, next);
-					next = NULL;
-				}
-				break;
-			} else if (rc) {
-				break;
-			}
-		}
-		cfs_hash_bd_unlock(hs, &bd, 0);
-		if (next && !has_put_locked) {
-			cfs_hash_put(hs, next);
-			next = NULL;
-		}
-		if (rc) /* callback wants to break iteration */
-			break;
-	}
-	if (start > 0 && !rc) {
-		end = start;
-		start = 0;
-		goto again;
-	}
-
-	cfs_hash_unlock(hs, 0);
-	return count;
-}
-
-int
-cfs_hash_for_each_nolock(struct cfs_hash *hs, cfs_hash_for_each_cb_t func,
-			 void *data, int start)
-{
-	if (cfs_hash_with_no_lock(hs) ||
-	    cfs_hash_with_rehash_key(hs) ||
-	    !cfs_hash_with_no_itemref(hs))
-		return -EOPNOTSUPP;
-
-	if (!hs->hs_ops->hs_get ||
-	    (!hs->hs_ops->hs_put && !hs->hs_ops->hs_put_locked))
-		return -EOPNOTSUPP;
-
-	cfs_hash_for_each_enter(hs);
-	cfs_hash_for_each_relax(hs, func, data, start);
-	cfs_hash_for_each_exit(hs);
-
-	return 0;
-}
-EXPORT_SYMBOL(cfs_hash_for_each_nolock);
-
-/**
- * For each hash bucket in the libcfs hash @hs call the passed callback
- * @func until all the hash buckets are empty.  The passed callback @func
- * or the previously registered callback hs->hs_put must remove the item
- * from the hash.  You may either use the cfs_hash_del() or hlist_del()
- * functions.  No rwlocks will be held during the callback @func, so it is
- * safe to sleep if needed.  This function will not terminate until the
- * hash is empty.  Note it is still possible to concurrently add new
- * items into the hash.  It is the caller's responsibility to ensure
- * the required locking is in place to prevent concurrent insertions.
- */
-int
-cfs_hash_for_each_empty(struct cfs_hash *hs, cfs_hash_for_each_cb_t func,
-			void *data)
-{
-	unsigned int i = 0;
-
-	if (cfs_hash_with_no_lock(hs))
-		return -EOPNOTSUPP;
-
-	if (!hs->hs_ops->hs_get ||
-	    (!hs->hs_ops->hs_put && !hs->hs_ops->hs_put_locked))
-		return -EOPNOTSUPP;
-
-	cfs_hash_for_each_enter(hs);
-	while (cfs_hash_for_each_relax(hs, func, data, 0)) {
-		CDEBUG(D_INFO, "Try to empty hash: %s, loop: %u\n",
-		       hs->hs_name, i++);
-	}
-	cfs_hash_for_each_exit(hs);
-	return 0;
-}
-EXPORT_SYMBOL(cfs_hash_for_each_empty);
-
-void
-cfs_hash_hlist_for_each(struct cfs_hash *hs, unsigned int hindex,
-			cfs_hash_for_each_cb_t func, void *data)
-{
-	struct hlist_head *hhead;
-	struct hlist_node *hnode;
-	struct cfs_hash_bd bd;
-
-	cfs_hash_for_each_enter(hs);
-	cfs_hash_lock(hs, 0);
-	if (hindex >= CFS_HASH_NHLIST(hs))
-		goto out;
-
-	cfs_hash_bd_index_set(hs, hindex, &bd);
-
-	cfs_hash_bd_lock(hs, &bd, 0);
-	hhead = cfs_hash_bd_hhead(hs, &bd);
-	hlist_for_each(hnode, hhead) {
-		if (func(hs, &bd, hnode, data))
-			break;
-	}
-	cfs_hash_bd_unlock(hs, &bd, 0);
-out:
-	cfs_hash_unlock(hs, 0);
-	cfs_hash_for_each_exit(hs);
-}
-EXPORT_SYMBOL(cfs_hash_hlist_for_each);
-
-/*
- * For each item in the libcfs hash @hs which matches the @key call
- * the passed callback @func and pass to it as an argument each hash
- * item and the private @data. During the callback the bucket lock
- * is held so the callback must never sleep.
- */
-void
-cfs_hash_for_each_key(struct cfs_hash *hs, const void *key,
-		      cfs_hash_for_each_cb_t func, void *data)
-{
-	struct hlist_node *hnode;
-	struct cfs_hash_bd bds[2];
-	unsigned int i;
-
-	cfs_hash_lock(hs, 0);
-
-	cfs_hash_dual_bd_get_and_lock(hs, key, bds, 0);
-
-	cfs_hash_for_each_bd(bds, 2, i) {
-		struct hlist_head *hlist = cfs_hash_bd_hhead(hs, &bds[i]);
-
-		hlist_for_each(hnode, hlist) {
-			cfs_hash_bucket_validate(hs, &bds[i], hnode);
-
-			if (cfs_hash_keycmp(hs, key, hnode)) {
-				if (func(hs, &bds[i], hnode, data))
-					break;
-			}
-		}
-	}
-
-	cfs_hash_dual_bd_unlock(hs, bds, 0);
-	cfs_hash_unlock(hs, 0);
-}
-EXPORT_SYMBOL(cfs_hash_for_each_key);
-
-/**
- * Rehash the libcfs hash @hs to the given @bits.  This can be used
- * to grow the hash size when excessive chaining is detected, or to
- * shrink the hash when it is larger than needed.  When the CFS_HASH_REHASH
- * flag is set in @hs the libcfs hash may be dynamically rehashed
- * during addition or removal if the hash's theta value exceeds
- * either the hs->hs_min_theta or hs->hs_max_theta values.  By default
- * these values are tuned to keep the chained hash depth small, and
- * this approach assumes a reasonably uniform hashing function.  The
- * theta thresholds for @hs are tunable via cfs_hash_set_theta().
- */
-void
-cfs_hash_rehash_cancel(struct cfs_hash *hs)
-{
-	LASSERT(cfs_hash_with_rehash(hs));
-	cancel_work_sync(&hs->hs_rehash_work);
-}
-
-void
-cfs_hash_rehash(struct cfs_hash *hs, int do_rehash)
-{
-	int rc;
-
-	LASSERT(cfs_hash_with_rehash(hs) && !cfs_hash_with_no_lock(hs));
-
-	cfs_hash_lock(hs, 1);
-
-	rc = cfs_hash_rehash_bits(hs);
-	if (rc <= 0) {
-		cfs_hash_unlock(hs, 1);
-		return;
-	}
-
-	hs->hs_rehash_bits = rc;
-	if (!do_rehash) {
-		/* launch and return */
-		queue_work(cfs_rehash_wq, &hs->hs_rehash_work);
-		cfs_hash_unlock(hs, 1);
-		return;
-	}
-
-	/* rehash right now */
-	cfs_hash_unlock(hs, 1);
-
-	cfs_hash_rehash_worker(&hs->hs_rehash_work);
-}
-
-static int
-cfs_hash_rehash_bd(struct cfs_hash *hs, struct cfs_hash_bd *old)
-{
-	struct cfs_hash_bd new;
-	struct hlist_head *hhead;
-	struct hlist_node *hnode;
-	struct hlist_node *pos;
-	void *key;
-	int c = 0;
-
-	/* hold cfs_hash_lock(hs, 1), so don't need any bucket lock */
-	cfs_hash_bd_for_each_hlist(hs, old, hhead) {
-		hlist_for_each_safe(hnode, pos, hhead) {
-			key = cfs_hash_key(hs, hnode);
-			LASSERT(key);
-			/* Validate hnode is in the correct bucket. */
-			cfs_hash_bucket_validate(hs, old, hnode);
-			/*
-			 * Delete from old hash bucket; move to new bucket.
-			 * ops->hs_key must be defined.
-			 */
-			cfs_hash_bd_from_key(hs, hs->hs_rehash_buckets,
-					     hs->hs_rehash_bits, key, &new);
-			cfs_hash_bd_move_locked(hs, old, &new, hnode);
-			c++;
-		}
-	}
-
-	return c;
-}
-
-static void
-cfs_hash_rehash_worker(struct work_struct *work)
-{
-	struct cfs_hash *hs = container_of(work, struct cfs_hash, hs_rehash_work);
-	struct cfs_hash_bucket **bkts;
-	struct cfs_hash_bd bd;
-	unsigned int old_size;
-	unsigned int new_size;
-	int bsize;
-	int count = 0;
-	int rc = 0;
-	int i;
-
-	LASSERT(hs && cfs_hash_with_rehash(hs));
-
-	cfs_hash_lock(hs, 0);
-	LASSERT(cfs_hash_is_rehashing(hs));
-
-	old_size = CFS_HASH_NBKT(hs);
-	new_size = CFS_HASH_RH_NBKT(hs);
-
-	cfs_hash_unlock(hs, 0);
-
-	/*
-	 * don't need hs::hs_rwlock for hs::hs_buckets,
-	 * because nobody can change bkt-table except me.
-	 */
-	bkts = cfs_hash_buckets_realloc(hs, hs->hs_buckets,
-					old_size, new_size);
-	cfs_hash_lock(hs, 1);
-	if (!bkts) {
-		rc = -ENOMEM;
-		goto out;
-	}
-
-	if (bkts == hs->hs_buckets) {
-		bkts = NULL; /* do nothing */
-		goto out;
-	}
-
-	rc = __cfs_hash_theta(hs);
-	if ((rc >= hs->hs_min_theta) && (rc <= hs->hs_max_theta)) {
-		/* free the new allocated bkt-table */
-		old_size = new_size;
-		new_size = CFS_HASH_NBKT(hs);
-		rc = -EALREADY;
-		goto out;
-	}
-
-	LASSERT(!hs->hs_rehash_buckets);
-	hs->hs_rehash_buckets = bkts;
-
-	rc = 0;
-	cfs_hash_for_each_bucket(hs, &bd, i) {
-		if (cfs_hash_is_exiting(hs)) {
-			rc = -ESRCH;
-			/* someone wants to destroy the hash, abort now */
-			if (old_size < new_size) /* OK to free old bkt-table */
-				break;
-			/* it's shrinking, need free new bkt-table */
-			hs->hs_rehash_buckets = NULL;
-			old_size = new_size;
-			new_size = CFS_HASH_NBKT(hs);
-			goto out;
-		}
-
-		count += cfs_hash_rehash_bd(hs, &bd);
-		if (count < CFS_HASH_LOOP_HOG ||
-		    cfs_hash_is_iterating(hs)) { /* need to finish ASAP */
-			continue;
-		}
-
-		count = 0;
-		cfs_hash_unlock(hs, 1);
-		cond_resched();
-		cfs_hash_lock(hs, 1);
-	}
-
-	hs->hs_rehash_count++;
-
-	bkts = hs->hs_buckets;
-	hs->hs_buckets = hs->hs_rehash_buckets;
-	hs->hs_rehash_buckets = NULL;
-
-	hs->hs_cur_bits = hs->hs_rehash_bits;
-out:
-	hs->hs_rehash_bits = 0;
-	bsize = cfs_hash_bkt_size(hs);
-	cfs_hash_unlock(hs, 1);
-	/* can't refer to @hs anymore because it could be destroyed */
-	if (bkts)
-		cfs_hash_buckets_free(bkts, bsize, new_size, old_size);
-	if (rc)
-		CDEBUG(D_INFO, "early quit of rehashing: %d\n", rc);
-}
-
-/**
- * Rehash the object referenced by @hnode in the libcfs hash @hs.  The
- * @old_key must be provided to locate the objects previous location
- * in the hash, and the @new_key will be used to reinsert the object.
- * Use this function instead of a cfs_hash_add() + cfs_hash_del()
- * combo when it is critical that there is no window in time where the
- * object is missing from the hash.  When an object is being rehashed
- * the registered cfs_hash_get() and cfs_hash_put() functions will
- * not be called.
- */
-void cfs_hash_rehash_key(struct cfs_hash *hs, const void *old_key,
-			 void *new_key, struct hlist_node *hnode)
-{
-	struct cfs_hash_bd bds[3];
-	struct cfs_hash_bd old_bds[2];
-	struct cfs_hash_bd new_bd;
-
-	LASSERT(!hlist_unhashed(hnode));
-
-	cfs_hash_lock(hs, 0);
-
-	cfs_hash_dual_bd_get(hs, old_key, old_bds);
-	cfs_hash_bd_get(hs, new_key, &new_bd);
-
-	bds[0] = old_bds[0];
-	bds[1] = old_bds[1];
-	bds[2] = new_bd;
-
-	/* NB: bds[0] and bds[1] are ordered already */
-	cfs_hash_bd_order(&bds[1], &bds[2]);
-	cfs_hash_bd_order(&bds[0], &bds[1]);
-
-	cfs_hash_multi_bd_lock(hs, bds, 3, 1);
-	if (likely(!old_bds[1].bd_bucket)) {
-		cfs_hash_bd_move_locked(hs, &old_bds[0], &new_bd, hnode);
-	} else {
-		cfs_hash_dual_bd_finddel_locked(hs, old_bds, old_key, hnode);
-		cfs_hash_bd_add_locked(hs, &new_bd, hnode);
-	}
-	/* overwrite key inside locks, otherwise we may race with
-	 * other operations, e.g. rehash
-	 */
-	cfs_hash_keycpy(hs, hnode, new_key);
-
-	cfs_hash_multi_bd_unlock(hs, bds, 3, 1);
-	cfs_hash_unlock(hs, 0);
-}
-EXPORT_SYMBOL(cfs_hash_rehash_key);
-
-void cfs_hash_debug_header(struct seq_file *m)
-{
-	seq_printf(m, "%-*s   cur   min   max theta t-min t-max flags rehash   count  maxdep maxdepb distribution\n",
-		   CFS_HASH_BIGNAME_LEN, "name");
-}
-EXPORT_SYMBOL(cfs_hash_debug_header);
-
-static struct cfs_hash_bucket **
-cfs_hash_full_bkts(struct cfs_hash *hs)
-{
-	/* NB: caller should hold hs->hs_rwlock if REHASH is set */
-	if (!hs->hs_rehash_buckets)
-		return hs->hs_buckets;
-
-	LASSERT(hs->hs_rehash_bits);
-	return hs->hs_rehash_bits > hs->hs_cur_bits ?
-	       hs->hs_rehash_buckets : hs->hs_buckets;
-}
-
-static unsigned int
-cfs_hash_full_nbkt(struct cfs_hash *hs)
-{
-	/* NB: caller should hold hs->hs_rwlock if REHASH is set */
-	if (!hs->hs_rehash_buckets)
-		return CFS_HASH_NBKT(hs);
-
-	LASSERT(hs->hs_rehash_bits);
-	return hs->hs_rehash_bits > hs->hs_cur_bits ?
-	       CFS_HASH_RH_NBKT(hs) : CFS_HASH_NBKT(hs);
-}
-
-void cfs_hash_debug_str(struct cfs_hash *hs, struct seq_file *m)
-{
-	int dist[8] = { 0, };
-	int maxdep = -1;
-	int maxdepb = -1;
-	int total = 0;
-	int theta;
-	int i;
-
-	cfs_hash_lock(hs, 0);
-	theta = __cfs_hash_theta(hs);
-
-	seq_printf(m, "%-*s %5d %5d %5d %d.%03d %d.%03d %d.%03d  0x%02x %6d ",
-		   CFS_HASH_BIGNAME_LEN, hs->hs_name,
-		   1 << hs->hs_cur_bits, 1 << hs->hs_min_bits,
-		   1 << hs->hs_max_bits,
-		   __cfs_hash_theta_int(theta), __cfs_hash_theta_frac(theta),
-		   __cfs_hash_theta_int(hs->hs_min_theta),
-		   __cfs_hash_theta_frac(hs->hs_min_theta),
-		   __cfs_hash_theta_int(hs->hs_max_theta),
-		   __cfs_hash_theta_frac(hs->hs_max_theta),
-		   hs->hs_flags, hs->hs_rehash_count);
-
-	/*
-	 * The distribution is a summary of the chained hash depth in
-	 * each of the libcfs hash buckets.  Each bucket's hsb_count is
-	 * divided by the hash theta value and used to generate a
-	 * histogram of the hash distribution.  A uniform hash will
-	 * result in all hash buckets being close to the average, thus
-	 * only the first few entries in the histogram will be non-zero.
-	 * If your hash function results in a non-uniform hash, this will
-	 * be observable as outlier buckets in the distribution histogram.
-	 *
-	 * Uniform hash distribution:		128/128/0/0/0/0/0/0
-	 * Non-Uniform hash distribution:	128/125/0/0/0/0/2/1
-	 */
-	for (i = 0; i < cfs_hash_full_nbkt(hs); i++) {
-		struct cfs_hash_bd bd;
-
-		bd.bd_bucket = cfs_hash_full_bkts(hs)[i];
-		cfs_hash_bd_lock(hs, &bd, 0);
-		if (maxdep < bd.bd_bucket->hsb_depmax) {
-			maxdep  = bd.bd_bucket->hsb_depmax;
-			maxdepb = ffz(~maxdep);
-		}
-		total += bd.bd_bucket->hsb_count;
-		dist[min(fls(bd.bd_bucket->hsb_count / max(theta, 1)), 7)]++;
-		cfs_hash_bd_unlock(hs, &bd, 0);
-	}
-
-	seq_printf(m, "%7d %7d %7d ", total, maxdep, maxdepb);
-	for (i = 0; i < 8; i++)
-		seq_printf(m, "%d%c",  dist[i], (i == 7) ? '\n' : '/');
-
-	cfs_hash_unlock(hs, 0);
-}
-EXPORT_SYMBOL(cfs_hash_debug_str);
diff --git a/drivers/staging/lustre/lnet/libcfs/module.c b/drivers/staging/lustre/lnet/libcfs/module.c
index a03f924f1d7c..4b5a4c1b86c3 100644
--- a/drivers/staging/lustre/lnet/libcfs/module.c
+++ b/drivers/staging/lustre/lnet/libcfs/module.c
@@ -547,13 +547,6 @@ static int libcfs_init(void)
 		goto cleanup_cpu;
 	}
 
-	cfs_rehash_wq = alloc_workqueue("cfs_rh", WQ_SYSFS, 4);
-	if (!cfs_rehash_wq) {
-		CERROR("Failed to start rehash workqueue.\n");
-		rc = -ENOMEM;
-		goto cleanup_deregister;
-	}
-
 	rc = cfs_crypto_register();
 	if (rc) {
 		CERROR("cfs_crypto_register: error %d\n", rc);
@@ -579,11 +572,6 @@ static void libcfs_exit(void)
 
 	lustre_remove_debugfs();
 
-	if (cfs_rehash_wq) {
-		destroy_workqueue(cfs_rehash_wq);
-		cfs_rehash_wq = NULL;
-	}
-
 	cfs_crypto_unregister();
 
 	misc_deregister(&libcfs_dev);

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 19/20] staging: lustre: convert lu_object cache to rhashtable
  2018-04-11 21:54 [PATCH 00/20] staging: lustre: convert to rhashtable NeilBrown
                   ` (18 preceding siblings ...)
  2018-04-11 21:54 ` [PATCH 20/20] staging: lustre: remove cfs_hash resizeable hashtable implementation NeilBrown
@ 2018-04-11 21:54 ` NeilBrown
  2018-04-17  3:35 ` [PATCH 00/20] staging: lustre: convert " James Simmons
  2018-04-23 13:08 ` Greg Kroah-Hartman
  21 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2018-04-11 21:54 UTC (permalink / raw)
  To: Oleg Drokin, Greg Kroah-Hartman, James Simmons, Andreas Dilger
  Cc: Linux Kernel Mailing List, Lustre Development List

The lu_object cache is a little more complex than the other lustre
hash tables for two reasons.
1/ there is a debugfs file which displays the contents of the cache,
  so we need to use rhashtable_walk in a way that works for seq_file.

2/ There is a (shared) lru list for objects which are no longer
   referenced, so finding an object needs to consider races with the
   lru as well as with the hash table.

The debugfs file already manages walking the libcfs hash table, keeping
a current position in the private data.  We can fairly easily convert
that to a struct rhashtable_iter.  The debugfs file actually reports
pages, and there are multiple pages per hashtable object.  So as well
as rhashtable_iter, we need the current page index.
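
For reference, the walk pattern this converts to looks roughly like the
following (a minimal sketch using the generic rhashtable walk API, not
the exact vvp_dev.c code; the hypothetical walk_site_objects() omits the
seq_file plumbing and the page-index bookkeeping):

static void walk_site_objects(struct lu_site *site)
{
	struct rhashtable_iter iter;
	struct lu_object_header *h;

	rhashtable_walk_enter(&site->ls_obj_hash, &iter);
	rhashtable_walk_start(&iter);
	while ((h = rhashtable_walk_next(&iter)) != NULL) {
		if (IS_ERR(h))
			continue;	/* -EAGAIN: table resized, some entries may repeat */
		/* 'h' may only be inspected between _start() and _stop() */
	}
	rhashtable_walk_stop(&iter);
	rhashtable_walk_exit(&iter);
}

For the seq_file case the enter/exit calls live in open/release and the
start/next/stop calls in the ->start/->next/->stop methods, which is
what the vvp_dev.c hunk below does.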


For the double-locking, the current code uses direct access to the
bucket locks that libcfs_hash provides.  rhashtable doesn't provide
that access - callers must provide their own locking or use rcu
techniques.

The lsb_marche_funebre.lock is still used to manage the lru list, but
with this patch it is no longer nested *inside* the hashtable locks;
instead it is taken outside them.  It is used to protect an object with
a refcount of zero.

When purging old objects from an lru, we first set
LU_OBJECT_HEARD_BANSHEE while holding the lsb_marche_funebre.lock,
then remove all the entries from the hashtable separately.

When we find an object in the hashtable with a refcount of zero, we
take the corresponding lsb_marche_funebre.lock and check
LU_OBJECT_HEARD_BANSHEE isn't set.  If it isn't, we can safely
increment the refcount.  If it is, the object is gone.
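
In outline, the lookup side of that protocol is something like this
(only a sketch, loosely following htable_lookup()/lu_object_get_first()
in the diff below; the wait-for-death path and the stats counters are
left out, and site_obj_get() is just an illustrative name):

static struct lu_object_header *
site_obj_get(struct lu_site *s, const struct lu_fid *fid)
{
	struct lu_site_bkt_data *bkt = &s->ls_bkts[lu_bkt_hash(s, fid)];
	struct lu_object_header *h;

	rcu_read_lock();
	h = rhashtable_lookup(&s->ls_obj_hash, fid, obj_hash_params);
	if (h && !atomic_inc_not_zero(&h->loh_ref)) {
		/* refcount was zero: only the bucket lock makes it safe
		 * to take the first reference again.
		 */
		spin_lock(&bkt->lsb_marche_funebre.lock);
		if (!lu_object_is_dying(h))	/* LU_OBJECT_HEARD_BANSHEE clear */
			atomic_inc(&h->loh_ref);
		else
			h = NULL;		/* object is on its way out */
		spin_unlock(&bkt->lsb_marche_funebre.lock);
	}
	rcu_read_unlock();
	return h;
}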

When removing the last reference from an object, we first take the
lsb_marche_funebre.lock, then decrement the reference and add to the
lru list.

This way, we only ever manipulate an object with a refcount of zero
while holding the lsb_marche_funebre.lock.
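
The corresponding release side looks roughly like this (again only a
sketch, mirroring the shape of the reworked lu_object_put() in the diff
below while ignoring the dying-object and wake-up handling):

static void site_obj_put(struct lu_site *site, struct lu_object_header *top)
{
	struct lu_site_bkt_data *bkt;

	/* Fast path: this is not the last reference, no bucket lock needed. */
	if (atomic_add_unless(&top->loh_ref, -1, 1))
		return;

	/* Possibly the last reference: take the bucket lock first, so the
	 * refcount only ever reaches zero while the lock is held.
	 */
	bkt = &site->ls_bkts[lu_bkt_hash(site, &top->loh_fid)];
	spin_lock(&bkt->lsb_marche_funebre.lock);
	if (atomic_dec_and_test(&top->loh_ref))
		list_add_tail(&top->loh_lru, &bkt->lsb_lru);	/* keep cached */
	spin_unlock(&bkt->lsb_marche_funebre.lock);
}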

As there is nothing to stop us using the resizing capabilities of
rhashtable, the code to try to guess the perfect hash size has been
removed.
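
Concretely, the table is set up with only the key/head layout and
automatic shrinking enabled, leaving all sizing decisions to rhashtable
(this just restates the obj_hash_params definition that appears in the
lu_object.c hunk below):

static const struct rhashtable_params obj_hash_params = {
	.key_len	= sizeof(struct lu_fid),
	.key_offset	= offsetof(struct lu_object_header, loh_fid),
	.head_offset	= offsetof(struct lu_object_header, loh_hash),
	.automatic_shrinking = true,
	/* no explicit .min_size / .max_size hints: rhashtable picks and
	 * adjusts the table size itself
	 */
};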

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/staging/lustre/lustre/include/lu_object.h  |    8 
 drivers/staging/lustre/lustre/llite/vvp_dev.c      |   98 ++---
 drivers/staging/lustre/lustre/obdclass/lu_object.c |  362 +++++++-------------
 3 files changed, 170 insertions(+), 298 deletions(-)

diff --git a/drivers/staging/lustre/lustre/include/lu_object.h b/drivers/staging/lustre/lustre/include/lu_object.h
index 85066ece44d6..fab576687608 100644
--- a/drivers/staging/lustre/lustre/include/lu_object.h
+++ b/drivers/staging/lustre/lustre/include/lu_object.h
@@ -530,9 +530,9 @@ struct lu_object_header {
 	 */
 	__u32		  loh_attr;
 	/**
-	 * Linkage into per-site hash table. Protected by lu_site::ls_guard.
+	 * Linkage into per-site hash table.
 	 */
-	struct hlist_node       loh_hash;
+	struct rhash_head       loh_hash;
 	/**
 	 * Linkage into per-site LRU list. Protected by lu_site::ls_guard.
 	 * memory shared with lru_head for delayed freeing;
@@ -579,7 +579,7 @@ struct lu_site {
 	/**
 	 * objects hash table
 	 */
-	struct cfs_hash	       *ls_obj_hash;
+	struct rhashtable	ls_obj_hash;
 	/*
 	 * buckets for summary data
 	 */
@@ -655,6 +655,8 @@ int lu_object_init(struct lu_object *o,
 void lu_object_fini(struct lu_object *o);
 void lu_object_add_top(struct lu_object_header *h, struct lu_object *o);
 void lu_object_add(struct lu_object *before, struct lu_object *o);
+struct lu_object *lu_object_get_first(struct lu_object_header *h,
+				      struct lu_device *dev);
 
 /**
  * Helpers to initialize and finalize device types.
diff --git a/drivers/staging/lustre/lustre/llite/vvp_dev.c b/drivers/staging/lustre/lustre/llite/vvp_dev.c
index 64c3fdbbf0eb..da39375ae43d 100644
--- a/drivers/staging/lustre/lustre/llite/vvp_dev.c
+++ b/drivers/staging/lustre/lustre/llite/vvp_dev.c
@@ -365,21 +365,13 @@ int cl_sb_fini(struct super_block *sb)
  *
  ****************************************************************************/
 
-struct vvp_pgcache_id {
-	unsigned int		 vpi_bucket;
-	unsigned int		 vpi_depth;
-	u32			 vpi_index;
-
-	unsigned int		 vpi_curdep;
-	struct lu_object_header *vpi_obj;
-};
-
 struct seq_private {
 	struct ll_sb_info	*sbi;
 	struct lu_env		*env;
 	u16			refcheck;
 	struct cl_object	*clob;
-	struct vvp_pgcache_id	id;
+	struct rhashtable_iter	iter;
+	u32			page_index;
 	/*
 	 * prev_pos is the 'pos' of the last object returned
 	 * by ->start of ->next.
@@ -387,79 +379,43 @@ struct seq_private {
 	loff_t			prev_pos;
 };
 
-static int vvp_pgcache_obj_get(struct cfs_hash *hs, struct cfs_hash_bd *bd,
-			       struct hlist_node *hnode, void *data)
-{
-	struct vvp_pgcache_id   *id  = data;
-	struct lu_object_header *hdr = cfs_hash_object(hs, hnode);
-
-	if (lu_object_is_dying(hdr))
-		return 0;
-
-	if (id->vpi_curdep-- > 0)
-		return 0; /* continue */
-
-	cfs_hash_get(hs, hnode);
-	id->vpi_obj = hdr;
-	return 1;
-}
-
-static struct cl_object *vvp_pgcache_obj(const struct lu_env *env,
-					 struct lu_device *dev,
-					 struct vvp_pgcache_id *id)
-{
-	LASSERT(lu_device_is_cl(dev));
-
-	id->vpi_obj    = NULL;
-	id->vpi_curdep = id->vpi_depth;
-
-	cfs_hash_hlist_for_each(dev->ld_site->ls_obj_hash, id->vpi_bucket,
-				vvp_pgcache_obj_get, id);
-	if (id->vpi_obj) {
-		struct lu_object *lu_obj;
-
-		lu_obj = lu_object_locate(id->vpi_obj, dev->ld_type);
-		if (lu_obj) {
-			lu_object_ref_add(lu_obj, "dump", current);
-			return lu2cl(lu_obj);
-		}
-		lu_object_put(env, lu_object_top(id->vpi_obj));
-	}
-	return NULL;
-}
-
 static struct page *vvp_pgcache_current(struct seq_private *priv)
 {
 	struct lu_device *dev = &priv->sbi->ll_cl->cd_lu_dev;
 
+	rhashtable_walk_start(&priv->iter);
 	while(1) {
 		struct inode *inode;
 		int nr;
 		struct page *vmpage;
 
 		if (!priv->clob) {
-			struct cl_object *clob;
-
-			while ((clob = vvp_pgcache_obj(priv->env, dev, &priv->id)) == NULL &&
-			       ++(priv->id.vpi_bucket) < CFS_HASH_NHLIST(dev->ld_site->ls_obj_hash))
-				priv->id.vpi_depth = 0;
-			if (!clob)
+			struct lu_object_header *h;
+			struct lu_object *lu_obj;
+
+			while ((h = rhashtable_walk_next(&priv->iter)) != NULL &&
+			       (lu_obj = lu_object_get_first(h, dev)) == NULL)
+				;
+			if (!h) {
+				rhashtable_walk_stop(&priv->iter);
 				return NULL;
-			priv->clob = clob;
-			priv->id.vpi_index = 0;
+			}
+			priv->clob = lu2cl(lu_obj);
+			lu_object_ref_add(lu_obj, "dump", current);
+			priv->page_index = 0;
 		}
 
 		inode = vvp_object_inode(priv->clob);
-		nr = find_get_pages_contig(inode->i_mapping, priv->id.vpi_index, 1, &vmpage);
+		nr = find_get_pages_contig(inode->i_mapping, priv->page_index, 1, &vmpage);
 		if (nr > 0) {
-			priv->id.vpi_index = vmpage->index;
+			priv->page_index = vmpage->index;
+			rhashtable_walk_stop(&priv->iter);
 			return vmpage;
 		}
 		lu_object_ref_del(&priv->clob->co_lu, "dump", current);
 		cl_object_put(priv->env, priv->clob);
 		priv->clob = NULL;
-		priv->id.vpi_index = 0;
-		priv->id.vpi_depth++;
+		priv->page_index = 0;
 	}
 }
 
@@ -524,7 +480,9 @@ static int vvp_pgcache_show(struct seq_file *f, void *v)
 static void vvp_pgcache_rewind(struct seq_private *priv)
 {
 	if (priv->prev_pos) {
-		memset(&priv->id, 0, sizeof(priv->id));
+		struct lu_site *s = priv->sbi->ll_cl->cd_lu_dev.ld_site;
+		rhashtable_walk_exit(&priv->iter);
+		rhashtable_walk_enter(&s->ls_obj_hash, &priv->iter);
 		priv->prev_pos = 0;
 		if (priv->clob) {
 			lu_object_ref_del(&priv->clob->co_lu, "dump", current);
@@ -536,7 +494,7 @@ static void vvp_pgcache_rewind(struct seq_private *priv)
 
 static struct page *vvp_pgcache_next_page(struct seq_private *priv)
 {
-	priv->id.vpi_index += 1;
+	priv->page_index += 1;
 	return vvp_pgcache_current(priv);
 }
 
@@ -550,7 +508,7 @@ static void *vvp_pgcache_start(struct seq_file *f, loff_t *pos)
 		/* Return the current item */;
 	else {
 		WARN_ON(*pos != priv->prev_pos + 1);
-		priv->id.vpi_index += 1;
+		priv->page_index += 1;
 	}
 
 	priv->prev_pos = *pos;
@@ -582,6 +540,7 @@ static const struct seq_operations vvp_pgcache_ops = {
 static int vvp_dump_pgcache_seq_open(struct inode *inode, struct file *filp)
 {
 	struct seq_private *priv;
+	struct lu_site *s;
 
 	priv = __seq_open_private(filp, &vvp_pgcache_ops, sizeof(*priv));
 	if (!priv)
@@ -590,13 +549,16 @@ static int vvp_dump_pgcache_seq_open(struct inode *inode, struct file *filp)
 	priv->sbi = inode->i_private;
 	priv->env = cl_env_get(&priv->refcheck);
 	priv->clob = NULL;
-	memset(&priv->id, 0, sizeof(priv->id));
 
 	if (IS_ERR(priv->env)) {
 		int err = PTR_ERR(priv->env);
 		seq_release_private(inode, filp);
 		return err;
 	}
+
+	s = priv->sbi->ll_cl->cd_lu_dev.ld_site;
+	rhashtable_walk_enter(&s->ls_obj_hash, &priv->iter);
+
 	return 0;
 }
 
@@ -609,8 +571,8 @@ static int vvp_dump_pgcache_seq_release(struct inode *inode, struct file *file)
 		lu_object_ref_del(&priv->clob->co_lu, "dump", current);
 		cl_object_put(priv->env, priv->clob);
 	}
-
 	cl_env_put(priv->env, &priv->refcheck);
+	rhashtable_walk_exit(&priv->iter);
 	return seq_release_private(inode, file);
 }
 
diff --git a/drivers/staging/lustre/lustre/obdclass/lu_object.c b/drivers/staging/lustre/lustre/obdclass/lu_object.c
index 18019f41c7a8..6ec5b83b3570 100644
--- a/drivers/staging/lustre/lustre/obdclass/lu_object.c
+++ b/drivers/staging/lustre/lustre/obdclass/lu_object.c
@@ -45,8 +45,6 @@
 
 #include <linux/module.h>
 
-/* hash_long() */
-#include <linux/libcfs/libcfs_hash.h>
 #include <obd_class.h>
 #include <obd_support.h>
 #include <lustre_disk.h>
@@ -88,9 +86,6 @@ enum {
 #define LU_CACHE_NR_LDISKFS_LIMIT	LU_CACHE_NR_UNLIMITED
 #define LU_CACHE_NR_ZFS_LIMIT		256
 
-#define LU_SITE_BITS_MIN	12
-#define LU_SITE_BITS_MAX	24
-#define LU_SITE_BITS_MAX_CL	19
 /**
  * max 256 buckets, we don't want too many buckets because:
  * - consume too much memory (currently max 16K)
@@ -126,6 +121,13 @@ lu_site_wq_from_fid(struct lu_site *site, struct lu_fid *fid)
 	return &bkt->lsb_marche_funebre;
 }
 
+static const struct rhashtable_params obj_hash_params = {
+	.key_len	= sizeof(struct lu_fid),
+	.key_offset	= offsetof(struct lu_object_header, loh_fid),
+	.head_offset	= offsetof(struct lu_object_header, loh_hash),
+	.automatic_shrinking = true,
+};
+
 /**
  * Decrease reference counter on object. If last reference is freed, return
  * object to the cache, unless lu_object_is_dying(o) holds. In the latter
@@ -137,7 +139,6 @@ void lu_object_put(const struct lu_env *env, struct lu_object *o)
 	struct lu_object_header *top;
 	struct lu_site	  *site;
 	struct lu_object	*orig;
-	struct cfs_hash_bd	    bd;
 	const struct lu_fid     *fid;
 
 	top  = o->lo_header;
@@ -151,7 +152,6 @@ void lu_object_put(const struct lu_env *env, struct lu_object *o)
 	 */
 	fid = lu_object_fid(o);
 	if (fid_is_zero(fid)) {
-		LASSERT(!top->loh_hash.next && !top->loh_hash.pprev);
 		LASSERT(list_empty(&top->loh_lru));
 		if (!atomic_dec_and_test(&top->loh_ref))
 			return;
@@ -163,9 +163,8 @@ void lu_object_put(const struct lu_env *env, struct lu_object *o)
 		return;
 	}
 
-	cfs_hash_bd_get(site->ls_obj_hash, &top->loh_fid, &bd);
-
-	if (!cfs_hash_bd_dec_and_lock(site->ls_obj_hash, &bd, &top->loh_ref)) {
+	if (atomic_add_unless(&top->loh_ref, -1, 1)) {
+	still_active:
 		if (lu_object_is_dying(top)) {
 			/*
 			 * somebody may be waiting for this, currently only
@@ -177,6 +176,16 @@ void lu_object_put(const struct lu_env *env, struct lu_object *o)
 		return;
 	}
 
+	bkt = &site->ls_bkts[lu_bkt_hash(site, &top->loh_fid)];
+	spin_lock(&bkt->lsb_marche_funebre.lock);
+	if (!atomic_dec_and_test(&top->loh_ref)) {
+		spin_unlock(&bkt->lsb_marche_funebre.lock);
+		goto still_active;
+	}
+	/* refcount is zero, and cannot be incremented without taking the
+	 * bkt lock, so object is stable.
+	 */
+
 	/*
 	 * When last reference is released, iterate over object
 	 * layers, and notify them that object is no longer busy.
@@ -186,17 +195,13 @@ void lu_object_put(const struct lu_env *env, struct lu_object *o)
 			o->lo_ops->loo_object_release(env, o);
 	}
 
-	bkt = &site->ls_bkts[lu_bkt_hash(site, &top->loh_fid)];
-	spin_lock(&bkt->lsb_marche_funebre.lock);
-
 	if (!lu_object_is_dying(top)) {
 		LASSERT(list_empty(&top->loh_lru));
 		list_add_tail(&top->loh_lru, &bkt->lsb_lru);
 		spin_unlock(&bkt->lsb_marche_funebre.lock);
 		percpu_counter_inc(&site->ls_lru_len_counter);
-		CDEBUG(D_INODE, "Add %p to site lru. hash: %p, bkt: %p\n",
-		       o, site->ls_obj_hash, bkt);
-		cfs_hash_bd_unlock(site->ls_obj_hash, &bd, 1);
+		CDEBUG(D_INODE, "Add %p to site lru. bkt: %p\n",
+		       o, bkt);
 		return;
 	}
 
@@ -204,16 +209,15 @@ void lu_object_put(const struct lu_env *env, struct lu_object *o)
 	 * If object is dying (will not be cached), then removed it
 	 * from hash table (it is already not on the LRU).
 	 *
-	 * This is done with hash table list locked. As the only
+	 * This is done with bucket lock held. As the only
 	 * way to acquire first reference to previously unreferenced
 	 * object is through hash-table lookup (lu_object_find())
-	 * which is done under hash-table, no race with concurrent
+	 * which takes the lock for first reference, no race with concurrent
 	 * object lookup is possible and we can safely destroy object below.
 	 */
 	if (!test_and_set_bit(LU_OBJECT_UNHASHED, &top->loh_flags))
-		cfs_hash_bd_del_locked(site->ls_obj_hash, &bd, &top->loh_hash);
+		rhashtable_remove_fast(&site->ls_obj_hash, &top->loh_hash, obj_hash_params);
 	spin_unlock(&bkt->lsb_marche_funebre.lock);
-	cfs_hash_bd_unlock(site->ls_obj_hash, &bd, 1);
 	/*
 	 * Object was already removed from hash above, can kill it.
 	 */
@@ -233,21 +237,19 @@ void lu_object_unhash(const struct lu_env *env, struct lu_object *o)
 	set_bit(LU_OBJECT_HEARD_BANSHEE, &top->loh_flags);
 	if (!test_and_set_bit(LU_OBJECT_UNHASHED, &top->loh_flags)) {
 		struct lu_site *site = o->lo_dev->ld_site;
-		struct cfs_hash *obj_hash = site->ls_obj_hash;
-		struct cfs_hash_bd bd;
+		struct rhashtable *obj_hash = &site->ls_obj_hash;
+		struct lu_site_bkt_data *bkt;
 
-		cfs_hash_bd_get_and_lock(obj_hash, &top->loh_fid, &bd, 1);
-		if (!list_empty(&top->loh_lru)) {
-			struct lu_site_bkt_data *bkt;
+		bkt = &site->ls_bkts[lu_bkt_hash(site,&top->loh_fid)];
+		spin_lock(&bkt->lsb_marche_funebre.lock);
 
-			bkt = &site->ls_bkts[lu_bkt_hash(site,&top->loh_fid)];
-			spin_lock(&bkt->lsb_marche_funebre.lock);
+		if (!list_empty(&top->loh_lru)) {
 			list_del_init(&top->loh_lru);
-			spin_unlock(&bkt->lsb_marche_funebre.lock);
 			percpu_counter_dec(&site->ls_lru_len_counter);
 		}
-		cfs_hash_bd_del_locked(obj_hash, &bd, &top->loh_hash);
-		cfs_hash_bd_unlock(obj_hash, &bd, 1);
+		spin_unlock(&bkt->lsb_marche_funebre.lock);
+
+		rhashtable_remove_fast(obj_hash, &top->loh_hash, obj_hash_params);
 	}
 }
 EXPORT_SYMBOL(lu_object_unhash);
@@ -419,11 +421,8 @@ int lu_site_purge_objects(const struct lu_env *env, struct lu_site *s,
 
 			LINVRNT(lu_bkt_hash(s, &h->loh_fid) == i);
 
-			/* Cannot remove from hash under current spinlock,
-			 * so set flag to stop object from being found
-			 * by htable_lookup().
-			 */
-			set_bit(LU_OBJECT_HEARD_BANSHEE, &h->loh_flags);
+			rhashtable_remove_fast(&s->ls_obj_hash, &h->loh_hash,
+					       obj_hash_params);
 			list_move(&h->loh_lru, &dispose);
 			percpu_counter_dec(&s->ls_lru_len_counter);
 			if (did_sth == 0)
@@ -445,7 +444,6 @@ int lu_site_purge_objects(const struct lu_env *env, struct lu_site *s,
 						     struct lu_object_header,
 						     loh_lru)) != NULL) {
 			list_del_init(&h->loh_lru);
-			cfs_hash_del(s->ls_obj_hash, &h->loh_fid, &h->loh_hash);
 			lu_object_free(env, lu_object_top(h));
 			lprocfs_counter_incr(s->ls_stats, LU_SS_LRU_PURGED);
 		}
@@ -555,9 +553,9 @@ void lu_object_header_print(const struct lu_env *env, void *cookie,
 	(*printer)(env, cookie, "header@%p[%#lx, %d, " DFID "%s%s%s]",
 		   hdr, hdr->loh_flags, atomic_read(&hdr->loh_ref),
 		   PFID(&hdr->loh_fid),
-		   hlist_unhashed(&hdr->loh_hash) ? "" : " hash",
-		   list_empty((struct list_head *)&hdr->loh_lru) ? \
-		   "" : " lru",
+		   test_bit(LU_OBJECT_UNHASHED,
+			    &hdr->loh_flags) ? "" : " hash",
+		   list_empty(&hdr->loh_lru) ? "" : " lru",
 		   hdr->loh_attr & LOHA_EXISTS ? " exist":"");
 }
 EXPORT_SYMBOL(lu_object_header_print);
@@ -594,39 +592,37 @@ void lu_object_print(const struct lu_env *env, void *cookie,
 EXPORT_SYMBOL(lu_object_print);
 
 static struct lu_object *htable_lookup(struct lu_site *s,
-				       struct cfs_hash_bd *bd,
+				       struct lu_site_bkt_data *bkt,
 				       const struct lu_fid *f,
-				       __u64 *version)
+				       struct lu_object_header *new)
 {
-	struct cfs_hash		*hs = s->ls_obj_hash;
-	struct lu_site_bkt_data *bkt;
 	struct lu_object_header *h;
-	struct hlist_node	*hnode;
-	__u64 ver;
 	wait_queue_entry_t waiter;
 
 retry:
-	ver = cfs_hash_bd_version_get(bd);
-
-	if (*version == ver) {
+	rcu_read_lock();
+	if (new)
+		h = rhashtable_lookup_get_insert_fast(&s->ls_obj_hash, &new->loh_hash,
+						      obj_hash_params);
+	else
+		h = rhashtable_lookup(&s->ls_obj_hash, f, obj_hash_params);
+	if (!h) {
+		/* Not found */
+		if (!new)
+			lprocfs_counter_incr(s->ls_stats, LU_SS_CACHE_MISS);
+		rcu_read_unlock();
 		return ERR_PTR(-ENOENT);
 	}
-
-	*version = ver;
-	/* cfs_hash_bd_peek_locked is a somehow "internal" function
-	 * of cfs_hash, it doesn't add refcount on object.
-	 */
-	hnode = cfs_hash_bd_peek_locked(s->ls_obj_hash, bd, (void *)f);
-	if (!hnode) {
-		lprocfs_counter_incr(s->ls_stats, LU_SS_CACHE_MISS);
-		return ERR_PTR(-ENOENT);
+	if (atomic_inc_not_zero(&h->loh_ref)) {
+		rcu_read_unlock();
+		return lu_object_top(h);
 	}
 
-	h = container_of(hnode, struct lu_object_header, loh_hash);
-	bkt = &s->ls_bkts[lu_bkt_hash(s, f)];
 	spin_lock(&bkt->lsb_marche_funebre.lock);
 	if (likely(!lu_object_is_dying(h))) {
-		cfs_hash_get(s->ls_obj_hash, hnode);
+		rcu_read_unlock();
+		atomic_inc(&h->loh_ref);
+
 		lprocfs_counter_incr(s->ls_stats, LU_SS_CACHE_HIT);
 		if (!list_empty(&h->loh_lru)) {
 			list_del_init(&h->loh_lru);
@@ -635,7 +631,7 @@ static struct lu_object *htable_lookup(struct lu_site *s,
 		spin_unlock(&bkt->lsb_marche_funebre.lock);
 		return lu_object_top(h);
 	}
-	spin_unlock(&bkt->lsb_marche_funebre.lock);
+	rcu_read_unlock();
 
 	/*
 	 * Lookup found an object being destroyed this object cannot be
@@ -647,10 +643,9 @@ static struct lu_object *htable_lookup(struct lu_site *s,
 	add_wait_queue(&bkt->lsb_marche_funebre, &waiter);
 	set_current_state(TASK_UNINTERRUPTIBLE);
 	lprocfs_counter_incr(s->ls_stats, LU_SS_CACHE_DEATH_RACE);
-	cfs_hash_bd_unlock(hs, bd, 1);
+	spin_unlock(&bkt->lsb_marche_funebre.lock);
 	schedule();
 	remove_wait_queue(&bkt->lsb_marche_funebre, &waiter);
-	cfs_hash_bd_lock(hs, bd, 1);
 	goto retry;
 }
 
@@ -681,7 +676,7 @@ static void lu_object_limit(const struct lu_env *env, struct lu_device *dev)
 	if (lu_cache_nr == LU_CACHE_NR_UNLIMITED)
 		return;
 
-	size = cfs_hash_size_get(dev->ld_site->ls_obj_hash);
+	size = atomic_read(&dev->ld_site->ls_obj_hash.nelems);
 	nr = (__u64)lu_cache_nr;
 	if (size <= nr)
 		return;
@@ -691,6 +686,35 @@ static void lu_object_limit(const struct lu_env *env, struct lu_device *dev)
 			      false);
 }
 
+/*
+ * Get a 'first' reference to an object that was found while looking
+ * through the hash table.
+ */
+struct lu_object *lu_object_get_first(struct lu_object_header *h,
+				      struct lu_device *dev)
+{
+	struct lu_object *ret;
+	struct lu_site	*s = dev->ld_site;
+
+	if (IS_ERR_OR_NULL(h) || lu_object_is_dying(h))
+		return NULL;
+
+	ret = lu_object_locate(h, dev->ld_type);
+	if (!ret)
+		return ret;
+	if (!atomic_inc_not_zero(&h->loh_ref)) {
+		struct lu_site_bkt_data *bkt = &s->ls_bkts[lu_bkt_hash(s, &h->loh_fid)];
+
+		spin_lock(&bkt->lsb_marche_funebre.lock);
+		if (!lu_object_is_dying(h))
+			atomic_inc(&h->loh_ref);
+		else
+			ret = NULL;
+		spin_unlock(&bkt->lsb_marche_funebre.lock);
+	}
+	return ret;
+}
+
 /**
  * Much like lu_object_find(), but top level device of object is specifically
  * \a dev rather than top level device of the site. This interface allows
@@ -704,9 +728,8 @@ struct lu_object *lu_object_find_at(const struct lu_env *env,
 	struct lu_object      *o;
 	struct lu_object      *shadow;
 	struct lu_site	*s;
-	struct cfs_hash	    *hs;
-	struct cfs_hash_bd	  bd;
-	__u64		  version = 0;
+	struct rhashtable	*hs;
+	struct lu_site_bkt_data	*bkt;
 
 	/*
 	 * This uses standard index maintenance protocol:
@@ -727,13 +750,11 @@ struct lu_object *lu_object_find_at(const struct lu_env *env,
 	 *
 	 */
 	s  = dev->ld_site;
-	hs = s->ls_obj_hash;
+	hs = &s->ls_obj_hash;
+	bkt = &s->ls_bkts[lu_bkt_hash(s, f)];
 
-	cfs_hash_bd_get(hs, f, &bd);
 	if (!(conf && conf->loc_flags & LOC_F_NEW)) {
-		cfs_hash_bd_lock(hs, &bd, 0);
-		o = htable_lookup(s, &bd, f, &version);
-		cfs_hash_bd_unlock(hs, &bd, 0);
+		o = htable_lookup(s, bkt, f, NULL);
 
 		if (!IS_ERR(o) || PTR_ERR(o) != -ENOENT)
 			return o;
@@ -748,15 +769,20 @@ struct lu_object *lu_object_find_at(const struct lu_env *env,
 
 	LASSERT(lu_fid_eq(lu_object_fid(o), f));
 
-	cfs_hash_bd_lock(hs, &bd, 1);
+	if (conf && conf->loc_flags & LOC_F_NEW) {
+		int status = rhashtable_insert_fast(hs, &o->lo_header->loh_hash,
+						    obj_hash_params);
+		if (status == -EEXIST)
+			/* Already existed! go the slow way */
+			shadow = htable_lookup(s, bkt, f, o->lo_header);
+		else if (status)
+			shadow = ERR_PTR(status);
+		else
+			shadow = ERR_PTR(-ENOENT);
+	} else
+		shadow = htable_lookup(s, bkt, f, o->lo_header);
 
-	if (conf && conf->loc_flags & LOC_F_NEW)
-		shadow = ERR_PTR(-ENOENT);
-	else
-		shadow = htable_lookup(s, &bd, f, &version);
 	if (likely(PTR_ERR(shadow) == -ENOENT)) {
-		cfs_hash_bd_add_locked(hs, &bd, &o->lo_header->loh_hash);
-		cfs_hash_bd_unlock(hs, &bd, 1);
 
 		lu_object_limit(env, dev);
 
@@ -764,7 +790,6 @@ struct lu_object *lu_object_find_at(const struct lu_env *env,
 	}
 
 	lprocfs_counter_incr(s->ls_stats, LU_SS_CACHE_RACE);
-	cfs_hash_bd_unlock(hs, &bd, 1);
 	lu_object_free(env, o);
 	return shadow;
 }
@@ -846,14 +871,9 @@ struct lu_site_print_arg {
 	lu_printer_t     lsp_printer;
 };
 
-static int
-lu_site_obj_print(struct cfs_hash *hs, struct cfs_hash_bd *bd,
-		  struct hlist_node *hnode, void *data)
+static void
+lu_site_obj_print(struct lu_object_header *h, struct lu_site_print_arg *arg)
 {
-	struct lu_site_print_arg *arg = (struct lu_site_print_arg *)data;
-	struct lu_object_header  *h;
-
-	h = hlist_entry(hnode, struct lu_object_header, loh_hash);
 	if (!list_empty(&h->loh_layers)) {
 		const struct lu_object *o;
 
@@ -864,7 +884,6 @@ lu_site_obj_print(struct cfs_hash *hs, struct cfs_hash_bd *bd,
 		lu_object_header_print(arg->lsp_env, arg->lsp_cookie,
 				       arg->lsp_printer, h);
 	}
-	return 0;
 }
 
 /**
@@ -878,115 +897,20 @@ void lu_site_print(const struct lu_env *env, struct lu_site *s, void *cookie,
 		.lsp_cookie  = cookie,
 		.lsp_printer = printer,
 	};
-
-	cfs_hash_for_each(s->ls_obj_hash, lu_site_obj_print, &arg);
-}
-EXPORT_SYMBOL(lu_site_print);
-
-/**
- * Return desired hash table order.
- */
-static unsigned long lu_htable_order(struct lu_device *top)
-{
-	unsigned long bits_max = LU_SITE_BITS_MAX;
-	unsigned long cache_size;
-	unsigned long bits;
-
-	if (!strcmp(top->ld_type->ldt_name, LUSTRE_VVP_NAME))
-		bits_max = LU_SITE_BITS_MAX_CL;
-
-	/*
-	 * Calculate hash table size, assuming that we want reasonable
-	 * performance when 20% of total memory is occupied by cache of
-	 * lu_objects.
-	 *
-	 * Size of lu_object is (arbitrary) taken as 1K (together with inode).
-	 */
-	cache_size = totalram_pages;
-
-#if BITS_PER_LONG == 32
-	/* limit hashtable size for lowmem systems to low RAM */
-	if (cache_size > 1 << (30 - PAGE_SHIFT))
-		cache_size = 1 << (30 - PAGE_SHIFT) * 3 / 4;
-#endif
-
-	/* clear off unreasonable cache setting. */
-	if (lu_cache_percent == 0 || lu_cache_percent > LU_CACHE_PERCENT_MAX) {
-		CWARN("obdclass: invalid lu_cache_percent: %u, it must be in the range of (0, %u]. Will use default value: %u.\n",
-		      lu_cache_percent, LU_CACHE_PERCENT_MAX,
-		      LU_CACHE_PERCENT_DEFAULT);
-
-		lu_cache_percent = LU_CACHE_PERCENT_DEFAULT;
-	}
-	cache_size = cache_size / 100 * lu_cache_percent *
-		(PAGE_SIZE / 1024);
-
-	for (bits = 1; (1 << bits) < cache_size; ++bits)
-		;
-	return clamp_t(typeof(bits), bits, LU_SITE_BITS_MIN, bits_max);
-}
-
-static unsigned int lu_obj_hop_hash(struct cfs_hash *hs,
-				    const void *key, unsigned int mask)
-{
-	struct lu_fid  *fid = (struct lu_fid *)key;
-	__u32	   hash;
-
-	hash = fid_flatten32(fid);
-	hash += (hash >> 4) + (hash << 12); /* mixing oid and seq */
-	hash = hash_long(hash, hs->hs_bkt_bits);
-
-	/* give me another random factor */
-	hash -= hash_long((unsigned long)hs, fid_oid(fid) % 11 + 3);
-
-	hash <<= hs->hs_cur_bits - hs->hs_bkt_bits;
-	hash |= (fid_seq(fid) + fid_oid(fid)) & (CFS_HASH_NBKT(hs) - 1);
-
-	return hash & mask;
-}
-
-static void *lu_obj_hop_object(struct hlist_node *hnode)
-{
-	return hlist_entry(hnode, struct lu_object_header, loh_hash);
-}
-
-static void *lu_obj_hop_key(struct hlist_node *hnode)
-{
 	struct lu_object_header *h;
-
-	h = hlist_entry(hnode, struct lu_object_header, loh_hash);
-	return &h->loh_fid;
-}
-
-static int lu_obj_hop_keycmp(const void *key, struct hlist_node *hnode)
-{
-	struct lu_object_header *h;
-
-	h = hlist_entry(hnode, struct lu_object_header, loh_hash);
-	return lu_fid_eq(&h->loh_fid, (struct lu_fid *)key);
-}
-
-static void lu_obj_hop_get(struct cfs_hash *hs, struct hlist_node *hnode)
-{
-	struct lu_object_header *h;
-
-	h = hlist_entry(hnode, struct lu_object_header, loh_hash);
-	atomic_inc(&h->loh_ref);
-}
-
-static void lu_obj_hop_put_locked(struct cfs_hash *hs, struct hlist_node *hnode)
-{
-	LBUG(); /* we should never called it */
+	struct rhashtable_iter iter;
+
+	rhashtable_walk_enter(&s->ls_obj_hash, &iter);
+	rhashtable_walk_start(&iter);
+	while ((h = rhashtable_walk_next(&iter)) != NULL) {
+		if (IS_ERR(h))
+			continue;
+		lu_site_obj_print(h, &arg);
+	}
+	rhashtable_walk_stop(&iter);
+	rhashtable_walk_exit(&iter);
 }
-
-static struct cfs_hash_ops lu_site_hash_ops = {
-	.hs_hash	= lu_obj_hop_hash,
-	.hs_key		= lu_obj_hop_key,
-	.hs_keycmp      = lu_obj_hop_keycmp,
-	.hs_object      = lu_obj_hop_object,
-	.hs_get		= lu_obj_hop_get,
-	.hs_put_locked  = lu_obj_hop_put_locked,
-};
+EXPORT_SYMBOL(lu_site_print);
 
 static void lu_dev_add_linkage(struct lu_site *s, struct lu_device *d)
 {
@@ -1002,9 +926,7 @@ static void lu_dev_add_linkage(struct lu_site *s, struct lu_device *d)
 int lu_site_init(struct lu_site *s, struct lu_device *top)
 {
 	struct lu_site_bkt_data *bkt;
-	unsigned long bits;
 	unsigned long i;
-	char name[16];
 	int rc;
 
 	memset(s, 0, sizeof(*s));
@@ -1014,23 +936,8 @@ int lu_site_init(struct lu_site *s, struct lu_device *top)
 	if (rc)
 		return -ENOMEM;
 
-	snprintf(name, sizeof(name), "lu_site_%s", top->ld_type->ldt_name);
-	for (bits = lu_htable_order(top); bits >= LU_SITE_BITS_MIN; bits--) {
-		s->ls_obj_hash = cfs_hash_create(name, bits, bits,
-						 bits - LU_SITE_BKT_BITS,
-						 0, 0, 0,
-						 &lu_site_hash_ops,
-						 CFS_HASH_SPIN_BKTLOCK |
-						 CFS_HASH_NO_ITEMREF |
-						 CFS_HASH_DEPTH |
-						 CFS_HASH_ASSERT_EMPTY |
-						 CFS_HASH_COUNTER);
-		if (s->ls_obj_hash)
-			break;
-	}
-
-	if (!s->ls_obj_hash) {
-		CERROR("failed to create lu_site hash with bits: %lu\n", bits);
+	if (rhashtable_init(&s->ls_obj_hash, &obj_hash_params) != 0) {
+		CERROR("failed to create lu_site hash\n");
 		return -ENOMEM;
 	}
 
@@ -1038,8 +945,7 @@ int lu_site_init(struct lu_site *s, struct lu_device *top)
 	s->ls_bkt_cnt = roundup_pow_of_two(s->ls_bkt_cnt);
 	s->ls_bkts = kvmalloc_array(s->ls_bkt_cnt, sizeof(*bkt), GFP_KERNEL);
 	if (!s->ls_bkts) {
-		cfs_hash_putref(s->ls_obj_hash);
-		s->ls_obj_hash = NULL;
+		rhashtable_destroy(&s->ls_obj_hash);
 		s->ls_bkts = NULL;
 		return -ENOMEM;
 	}
@@ -1052,9 +958,8 @@ int lu_site_init(struct lu_site *s, struct lu_device *top)
 	s->ls_stats = lprocfs_alloc_stats(LU_SS_LAST_STAT, 0);
 	if (!s->ls_stats) {
 		kvfree(s->ls_bkts);
-		cfs_hash_putref(s->ls_obj_hash);
 		s->ls_bkts = NULL;
-		s->ls_obj_hash = NULL;
+		rhashtable_destroy(&s->ls_obj_hash);
 		return -ENOMEM;
 	}
 
@@ -1097,13 +1002,12 @@ void lu_site_fini(struct lu_site *s)
 
 	percpu_counter_destroy(&s->ls_lru_len_counter);
 
-	if (s->ls_obj_hash) {
-		cfs_hash_putref(s->ls_obj_hash);
-		s->ls_obj_hash = NULL;
+	if (s->ls_bkts) {
+		rhashtable_destroy(&s->ls_obj_hash);
+		kvfree(s->ls_bkts);
+		s->ls_bkts = NULL;
 	}
 
-	kvfree(s->ls_bkts);
-
 	if (s->ls_top_dev) {
 		s->ls_top_dev->ld_site = NULL;
 		lu_ref_del(&s->ls_top_dev->ld_reference, "site-top", s);
@@ -1259,7 +1163,6 @@ int lu_object_header_init(struct lu_object_header *h)
 {
 	memset(h, 0, sizeof(*h));
 	atomic_set(&h->loh_ref, 1);
-	INIT_HLIST_NODE(&h->loh_hash);
 	INIT_LIST_HEAD(&h->loh_lru);
 	INIT_LIST_HEAD(&h->loh_layers);
 	lu_ref_init(&h->loh_reference);
@@ -1274,7 +1177,6 @@ void lu_object_header_fini(struct lu_object_header *h)
 {
 	LASSERT(list_empty(&h->loh_layers));
 	LASSERT(list_empty(&h->loh_lru));
-	LASSERT(hlist_unhashed(&h->loh_hash));
 	lu_ref_fini(&h->loh_reference);
 }
 EXPORT_SYMBOL(lu_object_header_fini);
@@ -1815,7 +1717,7 @@ struct lu_site_stats {
 static void lu_site_stats_get(const struct lu_site *s,
 			      struct lu_site_stats *stats)
 {
-	int cnt = cfs_hash_size_get(s->ls_obj_hash);
+	int cnt = atomic_read(&s->ls_obj_hash.nelems);
 	/* percpu_counter_read_positive() won't accept a const pointer */
 	struct lu_site *s2 = (struct lu_site *)s;
 
@@ -2001,15 +1903,21 @@ static __u32 ls_stats_read(struct lprocfs_stats *stats, int idx)
 int lu_site_stats_print(const struct lu_site *s, struct seq_file *m)
 {
 	struct lu_site_stats stats;
+	const struct bucket_table *tbl;
+	long chains;
 
 	memset(&stats, 0, sizeof(stats));
 	lu_site_stats_get(s, &stats);
 
+	rcu_read_lock();
+	tbl = rht_dereference_rcu(s->ls_obj_hash.tbl, &((struct lu_site *)s)->ls_obj_hash);
+	chains = tbl->size;
+	rcu_read_unlock();
 	seq_printf(m, "%d/%d %d/%ld %d %d %d %d %d %d %d\n",
 		   stats.lss_busy,
 		   stats.lss_total,
 		   stats.lss_populated,
-		   CFS_HASH_NHLIST(s->ls_obj_hash),
+		   chains,
 		   stats.lss_max_search,
 		   ls_stats_read(s->ls_stats, LU_SS_CREATED),
 		   ls_stats_read(s->ls_stats, LU_SS_CACHE_HIT),

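Note: the obj_hash_params used by rhashtable_init() and rhashtable_insert_fast()
above is defined earlier in this patch and does not appear in this excerpt.  As a
rough sketch only (the field values below are assumptions, not copied from the
patch), an rhashtable keyed on the object's FID is declared along these lines:

static const struct rhashtable_params obj_hash_params = {
	.key_len	= sizeof(struct lu_fid),
	.key_offset	= offsetof(struct lu_object_header, loh_fid),
	.head_offset	= offsetof(struct lu_object_header, loh_hash),
	/* the patch may also set a custom .hashfn (e.g. based on
	 * fid_flatten32()) and .automatic_shrinking */
};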
^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 18/20] staging: lustre: change how "dump_page_cache" walks a hash table
  2018-04-11 21:54 [PATCH 00/20] staging: lustre: convert to rhashtable NeilBrown
                   ` (16 preceding siblings ...)
  2018-04-11 21:54 ` [PATCH 13/20] staging: lustre: lu_object: move retry logic inside htable_lookup NeilBrown
@ 2018-04-11 21:54 ` NeilBrown
  2018-04-11 21:54 ` [PATCH 20/20] staging: lustre: remove cfs_hash resizeable hashtable implementation NeilBrown
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2018-04-11 21:54 UTC (permalink / raw)
  To: Oleg Drokin, Greg Kroah-Hartman, James Simmons, Andreas Dilger
  Cc: Linux Kernel Mailing List, Lustre Development List

The "dump_page_cache" seq_file currently tries to encode
a location in the hash table into a 64bit file index so that
the seq_file can seek to any location.

This is not necessary with the current implementation of seq_file.
seq_file performs any seeks needed itself by rewinding and calling
->next and ->show until the required index is reached.

The required behaviour of ->next is that it always return the next
object after the last one returned by either ->start or ->next.
It can ignore the ppos, but should increment it.

The required behaviour of ->start is one of:
1/ if *ppos is 0, then return the first object
2/ if *ppos is the same value that was passed to the most recent call
   to either ->start or ->next, then return the same object again
3/ if *ppos is anything else, return the next object after the most
   recently returned one.

To implement this we store a vvp_pgcache_id (index into hash table)
in the seq_private data structure, and also store 'prev_pos' as the
last value passed to either ->start or ->next.

We remove all conversion of an id to a pos, and any limits on the
size of vpi_depth.

vvp_pgcache_obj_get() is changed to ignore dying objects, so that
vvp_pgcache_obj() only returns NULL when it reaches the end of a hash
chain, which tells the caller that vpi_bucket needs to be incremented.

A reference to the current ->clob pointer is now kept as long as we
are iterating over the pages in a given object, so we don't have to
try to find it again (and possibly fail) for each page.

And the ->start and ->next functions are changed as described above.
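
As a self-contained illustration of the three ->start rules above (a toy
example, not part of this patch; all toy_* names are invented), ->start and
->next over a plain array reduce to:

#include <linux/kernel.h>
#include <linux/seq_file.h>

struct toy_iter {
	int	index;		/* current slot in toy_data[] */
	loff_t	prev_pos;	/* last *pos seen by ->start or ->next */
};

static int toy_data[] = { 1, 2, 3, 5, 8, 13 };

static void *toy_cur(struct toy_iter *it)
{
	return it->index < ARRAY_SIZE(toy_data) ? &toy_data[it->index] : NULL;
}

static void *toy_start(struct seq_file *m, loff_t *pos)
{
	struct toy_iter *it = m->private;	/* set up in ->open */

	if (*pos == 0)
		it->index = 0;		/* rule 1: rewind to the first item */
	else if (*pos != it->prev_pos)
		it->index += 1;		/* rule 3: step past the last item returned */
	/* rule 2 (*pos == prev_pos): return the same item again */
	it->prev_pos = *pos;
	return toy_cur(it);
}

static void *toy_next(struct seq_file *m, void *v, loff_t *pos)
{
	struct toy_iter *it = m->private;

	*pos += 1;			/* ->next ignores the old value but must bump it */
	it->prev_pos = *pos;
	it->index += 1;
	return toy_cur(it);
}

vvp_pgcache_start() and vvp_pgcache_next() in the diff below follow the same
shape, with the hash-table/page iterator taking the place of the array index.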

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/staging/lustre/lustre/llite/vvp_dev.c |  173 +++++++++++--------------
 1 file changed, 79 insertions(+), 94 deletions(-)

diff --git a/drivers/staging/lustre/lustre/llite/vvp_dev.c b/drivers/staging/lustre/lustre/llite/vvp_dev.c
index 39a85e967368..64c3fdbbf0eb 100644
--- a/drivers/staging/lustre/lustre/llite/vvp_dev.c
+++ b/drivers/staging/lustre/lustre/llite/vvp_dev.c
@@ -365,22 +365,6 @@ int cl_sb_fini(struct super_block *sb)
  *
  ****************************************************************************/
 
-/*
- * To represent contents of a page cache as a byte stream, following
- * information if encoded in 64bit offset:
- *
- *       - file hash bucket in lu_site::ls_hash[]       28bits
- *
- *       - how far file is from bucket head	      4bits
- *
- *       - page index				   32bits
- *
- * First two data identify a file in the cache uniquely.
- */
-
-#define PGC_OBJ_SHIFT (32 + 4)
-#define PGC_DEPTH_SHIFT (32)
-
 struct vvp_pgcache_id {
 	unsigned int		 vpi_bucket;
 	unsigned int		 vpi_depth;
@@ -395,37 +379,26 @@ struct seq_private {
 	struct lu_env		*env;
 	u16			refcheck;
 	struct cl_object	*clob;
+	struct vvp_pgcache_id	id;
+	/*
+	 * prev_pos is the 'pos' of the last object returned
+	 * by ->start or ->next.
+	 */
+	loff_t			prev_pos;
 };
 
-static void vvp_pgcache_id_unpack(loff_t pos, struct vvp_pgcache_id *id)
-{
-	BUILD_BUG_ON(sizeof(pos) != sizeof(__u64));
-
-	id->vpi_index  = pos & 0xffffffff;
-	id->vpi_depth  = (pos >> PGC_DEPTH_SHIFT) & 0xf;
-	id->vpi_bucket = (unsigned long long)pos >> PGC_OBJ_SHIFT;
-}
-
-static loff_t vvp_pgcache_id_pack(struct vvp_pgcache_id *id)
-{
-	return
-		((__u64)id->vpi_index) |
-		((__u64)id->vpi_depth  << PGC_DEPTH_SHIFT) |
-		((__u64)id->vpi_bucket << PGC_OBJ_SHIFT);
-}
-
 static int vvp_pgcache_obj_get(struct cfs_hash *hs, struct cfs_hash_bd *bd,
 			       struct hlist_node *hnode, void *data)
 {
 	struct vvp_pgcache_id   *id  = data;
 	struct lu_object_header *hdr = cfs_hash_object(hs, hnode);
 
+	if (lu_object_is_dying(hdr))
+		return 0;
+
 	if (id->vpi_curdep-- > 0)
 		return 0; /* continue */
 
-	if (lu_object_is_dying(hdr))
-		return 1;
-
 	cfs_hash_get(hs, hnode);
 	id->vpi_obj = hdr;
 	return 1;
@@ -437,7 +410,6 @@ static struct cl_object *vvp_pgcache_obj(const struct lu_env *env,
 {
 	LASSERT(lu_device_is_cl(dev));
 
-	id->vpi_depth &= 0xf;
 	id->vpi_obj    = NULL;
 	id->vpi_curdep = id->vpi_depth;
 
@@ -452,55 +424,42 @@ static struct cl_object *vvp_pgcache_obj(const struct lu_env *env,
 			return lu2cl(lu_obj);
 		}
 		lu_object_put(env, lu_object_top(id->vpi_obj));
-
-	} else if (id->vpi_curdep > 0) {
-		id->vpi_depth = 0xf;
 	}
 	return NULL;
 }
 
-static struct page *vvp_pgcache_find(const struct lu_env *env,
-				     struct lu_device *dev,
-				     struct cl_object **clobp, loff_t *pos)
+static struct page *vvp_pgcache_current(struct seq_private *priv)
 {
-	struct cl_object     *clob;
-	struct lu_site       *site;
-	struct vvp_pgcache_id id;
-
-	site = dev->ld_site;
-	vvp_pgcache_id_unpack(*pos, &id);
-
-	while (1) {
-		if (id.vpi_bucket >= CFS_HASH_NHLIST(site->ls_obj_hash))
-			return NULL;
-		clob = vvp_pgcache_obj(env, dev, &id);
-		if (clob) {
-			struct inode *inode = vvp_object_inode(clob);
-			struct page *vmpage;
-			int nr;
-
-			nr = find_get_pages_contig(inode->i_mapping,
-						   id.vpi_index, 1, &vmpage);
-			if (nr > 0) {
-				id.vpi_index = vmpage->index;
-				/* Cant support over 16T file */
-				if (vmpage->index <= 0xffffffff) {
-					*clobp = clob;
-					*pos = vvp_pgcache_id_pack(&id);
-					return vmpage;
-				}
-				put_page(vmpage);
-			}
-
-			lu_object_ref_del(&clob->co_lu, "dump", current);
-			cl_object_put(env, clob);
+	struct lu_device *dev = &priv->sbi->ll_cl->cd_lu_dev;
+
+	while(1) {
+		struct inode *inode;
+		int nr;
+		struct page *vmpage;
+
+		if (!priv->clob) {
+			struct cl_object *clob;
+
+			while ((clob = vvp_pgcache_obj(priv->env, dev, &priv->id)) == NULL &&
+			       ++(priv->id.vpi_bucket) < CFS_HASH_NHLIST(dev->ld_site->ls_obj_hash))
+				priv->id.vpi_depth = 0;
+			if (!clob)
+				return NULL;
+			priv->clob = clob;
+			priv->id.vpi_index = 0;
+		}
+
+		inode = vvp_object_inode(priv->clob);
+		nr = find_get_pages_contig(inode->i_mapping, priv->id.vpi_index, 1, &vmpage);
+		if (nr > 0) {
+			priv->id.vpi_index = vmpage->index;
+			return vmpage;
 		}
-		/* to the next object. */
-		++id.vpi_depth;
-		id.vpi_depth &= 0xf;
-		if (id.vpi_depth == 0 && ++id.vpi_bucket == 0)
-			return NULL;
-		id.vpi_index = 0;
+		lu_object_ref_del(&priv->clob->co_lu, "dump", current);
+		cl_object_put(priv->env, priv->clob);
+		priv->clob = NULL;
+		priv->id.vpi_index = 0;
+		priv->id.vpi_depth++;
 	}
 }
 
@@ -558,36 +517,54 @@ static int vvp_pgcache_show(struct seq_file *f, void *v)
 	} else {
 		seq_puts(f, "missing\n");
 	}
-	lu_object_ref_del(&priv->clob->co_lu, "dump", current);
-	cl_object_put(priv->env, priv->clob);
 
 	return 0;
 }
 
+static void vvp_pgcache_rewind(struct seq_private *priv)
+{
+	if (priv->prev_pos) {
+		memset(&priv->id, 0, sizeof(priv->id));
+		priv->prev_pos = 0;
+		if (priv->clob) {
+			lu_object_ref_del(&priv->clob->co_lu, "dump", current);
+			cl_object_put(priv->env, priv->clob);
+		}
+		priv->clob = NULL;
+	}
+}
+
+static struct page *vvp_pgcache_next_page(struct seq_private *priv)
+{
+	priv->id.vpi_index += 1;
+	return vvp_pgcache_current(priv);
+}
+
 static void *vvp_pgcache_start(struct seq_file *f, loff_t *pos)
 {
 	struct seq_private	*priv = f->private;
-	struct page *ret;
 
-	if (priv->sbi->ll_site->ls_obj_hash->hs_cur_bits >
-	    64 - PGC_OBJ_SHIFT)
-		ret = ERR_PTR(-EFBIG);
-	else
-		ret = vvp_pgcache_find(priv->env, &priv->sbi->ll_cl->cd_lu_dev,
-				       &priv->clob, pos);
+	if (*pos == 0)
+		vvp_pgcache_rewind(priv);
+	else if (*pos == priv->prev_pos)
+		/* Return the current item */;
+	else {
+		WARN_ON(*pos != priv->prev_pos + 1);
+		priv->id.vpi_index += 1;
+	}
 
-	return ret;
+	priv->prev_pos = *pos;
+	return vvp_pgcache_current(priv);
 }
 
 static void *vvp_pgcache_next(struct seq_file *f, void *v, loff_t *pos)
 {
 	struct seq_private *priv = f->private;
-	struct page *ret;
 
+	WARN_ON(*pos != priv->prev_pos);
 	*pos += 1;
-	ret = vvp_pgcache_find(priv->env, &priv->sbi->ll_cl->cd_lu_dev,
-			       &priv->clob, pos);
-	return ret;
+	priv->prev_pos = *pos;
+	return vvp_pgcache_next_page(priv);
 }
 
 static void vvp_pgcache_stop(struct seq_file *f, void *v)
@@ -612,6 +589,9 @@ static int vvp_dump_pgcache_seq_open(struct inode *inode, struct file *filp)
 
 	priv->sbi = inode->i_private;
 	priv->env = cl_env_get(&priv->refcheck);
+	priv->clob = NULL;
+	memset(&priv->id, 0, sizeof(priv->id));
+
 	if (IS_ERR(priv->env)) {
 		int err = PTR_ERR(priv->env);
 		seq_release_private(inode, filp);
@@ -625,6 +605,11 @@ static int vvp_dump_pgcache_seq_release(struct inode *inode, struct file *file)
 	struct seq_file *seq = file->private_data;
 	struct seq_private *priv = seq->private;
 
+	if (priv->clob) {
+		lu_object_ref_del(&priv->clob->co_lu, "dump", current);
+		cl_object_put(priv->env, priv->clob);
+	}
+
 	cl_env_put(priv->env, &priv->refcheck);
 	return seq_release_private(inode, file);
 }

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 17/20] staging: lustre: use call_rcu() to free lu_object_headers
  2018-04-11 21:54 [PATCH 00/20] staging: lustre: convert to rhashtable NeilBrown
                   ` (11 preceding siblings ...)
  2018-04-11 21:54 ` [PATCH 11/20] staging: lustre: lu_object: discard extra lru count NeilBrown
@ 2018-04-11 21:54 ` NeilBrown
  2018-04-11 21:54 ` [PATCH 15/20] staging: lustre: llite: use more private data in dump_pgcache NeilBrown
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2018-04-11 21:54 UTC (permalink / raw)
  To: Oleg Drokin, Greg Kroah-Hartman, James Simmons, Andreas Dilger
  Cc: Linux Kernel Mailing List, Lustre Development List

Using call_rcu to free lu_object_headers will allow us to use
rhashtable and get lockless lookup.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/staging/lustre/lustre/include/lu_object.h  |    6 +++++-
 drivers/staging/lustre/lustre/llite/vvp_object.c   |    9 ++++++++-
 drivers/staging/lustre/lustre/lov/lovsub_object.c  |    9 ++++++++-
 .../staging/lustre/lustre/obdecho/echo_client.c    |    8 +++++++-
 4 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/drivers/staging/lustre/lustre/include/lu_object.h b/drivers/staging/lustre/lustre/include/lu_object.h
index d23a78577fb5..85066ece44d6 100644
--- a/drivers/staging/lustre/lustre/include/lu_object.h
+++ b/drivers/staging/lustre/lustre/include/lu_object.h
@@ -535,8 +535,12 @@ struct lu_object_header {
 	struct hlist_node       loh_hash;
 	/**
 	 * Linkage into per-site LRU list. Protected by lu_site::ls_guard.
+	 * memory shared with lru_head for delayed freeing;
 	 */
-	struct list_head	     loh_lru;
+	union {
+		struct list_head	loh_lru;
+		struct rcu_head		loh_rcu;
+	};
 	/**
 	 * Linkage into list of layers. Never modified once set (except lately
 	 * during object destruction). No locking is necessary.
diff --git a/drivers/staging/lustre/lustre/llite/vvp_object.c b/drivers/staging/lustre/lustre/llite/vvp_object.c
index 05ad3b322a29..48a999f8406b 100644
--- a/drivers/staging/lustre/lustre/llite/vvp_object.c
+++ b/drivers/staging/lustre/lustre/llite/vvp_object.c
@@ -251,13 +251,20 @@ static int vvp_object_init(const struct lu_env *env, struct lu_object *obj,
 	return result;
 }
 
+static void __vvp_object_free(struct rcu_head *rcu)
+{
+	struct vvp_object *vob = container_of(rcu, struct vvp_object, vob_header.coh_lu.loh_rcu);
+
+	kmem_cache_free(vvp_object_kmem, vob);
+}
+
 static void vvp_object_free(const struct lu_env *env, struct lu_object *obj)
 {
 	struct vvp_object *vob = lu2vvp(obj);
 
 	lu_object_fini(obj);
 	lu_object_header_fini(obj->lo_header);
-	kmem_cache_free(vvp_object_kmem, vob);
+	call_rcu(&vob->vob_header.coh_lu.loh_rcu, __vvp_object_free);
 }
 
 static const struct lu_object_operations vvp_lu_obj_ops = {
diff --git a/drivers/staging/lustre/lustre/lov/lovsub_object.c b/drivers/staging/lustre/lustre/lov/lovsub_object.c
index 13d452086b61..3626c2500149 100644
--- a/drivers/staging/lustre/lustre/lov/lovsub_object.c
+++ b/drivers/staging/lustre/lustre/lov/lovsub_object.c
@@ -70,6 +70,13 @@ int lovsub_object_init(const struct lu_env *env, struct lu_object *obj,
 	return result;
 }
 
+static void __lovsub_object_free(struct rcu_head *rcu)
+{
+	struct lovsub_object *los = container_of(rcu, struct lovsub_object,
+						 lso_header.coh_lu.loh_rcu);
+	kmem_cache_free(lovsub_object_kmem, los);
+}
+
 static void lovsub_object_free(const struct lu_env *env, struct lu_object *obj)
 {
 	struct lovsub_object *los = lu2lovsub(obj);
@@ -88,7 +95,7 @@ static void lovsub_object_free(const struct lu_env *env, struct lu_object *obj)
 
 	lu_object_fini(obj);
 	lu_object_header_fini(&los->lso_header.coh_lu);
-	kmem_cache_free(lovsub_object_kmem, los);
+	call_rcu(&los->lso_header.coh_lu.loh_rcu, __lovsub_object_free);
 }
 
 static int lovsub_object_print(const struct lu_env *env, void *cookie,
diff --git a/drivers/staging/lustre/lustre/obdecho/echo_client.c b/drivers/staging/lustre/lustre/obdecho/echo_client.c
index 767067b61109..16bf3b5a74e4 100644
--- a/drivers/staging/lustre/lustre/obdecho/echo_client.c
+++ b/drivers/staging/lustre/lustre/obdecho/echo_client.c
@@ -431,6 +431,12 @@ static int echo_object_init(const struct lu_env *env, struct lu_object *obj,
 	return 0;
 }
 
+static void __echo_object_free(struct rcu_head *rcu)
+{
+	struct echo_object *eco = container_of(rcu, struct echo_object, eo_hdr.coh_lu.loh_rcu);
+	kmem_cache_free(echo_object_kmem, eco);
+}
+
 static void echo_object_free(const struct lu_env *env, struct lu_object *obj)
 {
 	struct echo_object *eco    = cl2echo_obj(lu2cl(obj));
@@ -446,7 +452,7 @@ static void echo_object_free(const struct lu_env *env, struct lu_object *obj)
 	lu_object_header_fini(obj->lo_header);
 
 	kfree(eco->eo_oinfo);
-	kmem_cache_free(echo_object_kmem, eco);
+	call_rcu(&eco->eo_hdr.coh_lu.loh_rcu, __echo_object_free);
 }
 
 static int echo_object_print(const struct lu_env *env, void *cookie,

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 16/20] staging: lustre: llite: remove redundant lookup in dump_pgcache
  2018-04-11 21:54 [PATCH 00/20] staging: lustre: convert to rhashtable NeilBrown
                   ` (13 preceding siblings ...)
  2018-04-11 21:54 ` [PATCH 15/20] staging: lustre: llite: use more private data in dump_pgcache NeilBrown
@ 2018-04-11 21:54 ` NeilBrown
  2018-04-11 21:54 ` [PATCH 14/20] staging: lustre: fold lu_object_new() into lu_object_find_at() NeilBrown
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2018-04-11 21:54 UTC (permalink / raw)
  To: Oleg Drokin, Greg Kroah-Hartman, James Simmons, Andreas Dilger
  Cc: Linux Kernel Mailing List, Lustre Development List

Both the 'next' and the 'show' functions for the dump_page_cache
seqfile perform a lookup based on the current file index.  This is
needless duplication.

The reason appears to be that the state that needs to be communicated
from "next" to "show" is two pointers, but seq_file only provides for
a single pointer to be returned from next and passed to show.

So make use of the new 'seq_private' structure to store the extra
pointer.
When 'next' (or 'start') finds something, it returns the page and
stores the clob in the private area.
'show' accepts the page as an argument, and finds the clob where it
was stored.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/staging/lustre/lustre/llite/vvp_dev.c |   97 +++++++++++--------------
 1 file changed, 41 insertions(+), 56 deletions(-)

diff --git a/drivers/staging/lustre/lustre/llite/vvp_dev.c b/drivers/staging/lustre/lustre/llite/vvp_dev.c
index a2619dc04a7f..39a85e967368 100644
--- a/drivers/staging/lustre/lustre/llite/vvp_dev.c
+++ b/drivers/staging/lustre/lustre/llite/vvp_dev.c
@@ -394,6 +394,7 @@ struct seq_private {
 	struct ll_sb_info	*sbi;
 	struct lu_env		*env;
 	u16			refcheck;
+	struct cl_object	*clob;
 };
 
 static void vvp_pgcache_id_unpack(loff_t pos, struct vvp_pgcache_id *id)
@@ -458,19 +459,20 @@ static struct cl_object *vvp_pgcache_obj(const struct lu_env *env,
 	return NULL;
 }
 
-static loff_t vvp_pgcache_find(const struct lu_env *env,
-			       struct lu_device *dev, loff_t pos)
+static struct page *vvp_pgcache_find(const struct lu_env *env,
+				     struct lu_device *dev,
+				     struct cl_object **clobp, loff_t *pos)
 {
 	struct cl_object     *clob;
 	struct lu_site       *site;
 	struct vvp_pgcache_id id;
 
 	site = dev->ld_site;
-	vvp_pgcache_id_unpack(pos, &id);
+	vvp_pgcache_id_unpack(*pos, &id);
 
 	while (1) {
 		if (id.vpi_bucket >= CFS_HASH_NHLIST(site->ls_obj_hash))
-			return ~0ULL;
+			return NULL;
 		clob = vvp_pgcache_obj(env, dev, &id);
 		if (clob) {
 			struct inode *inode = vvp_object_inode(clob);
@@ -482,20 +484,22 @@ static loff_t vvp_pgcache_find(const struct lu_env *env,
 			if (nr > 0) {
 				id.vpi_index = vmpage->index;
 				/* Cant support over 16T file */
-				nr = !(vmpage->index > 0xffffffff);
+				if (vmpage->index <= 0xffffffff) {
+					*clobp = clob;
+					*pos = vvp_pgcache_id_pack(&id);
+					return vmpage;
+				}
 				put_page(vmpage);
 			}
 
 			lu_object_ref_del(&clob->co_lu, "dump", current);
 			cl_object_put(env, clob);
-			if (nr > 0)
-				return vvp_pgcache_id_pack(&id);
 		}
 		/* to the next object. */
 		++id.vpi_depth;
 		id.vpi_depth &= 0xf;
 		if (id.vpi_depth == 0 && ++id.vpi_bucket == 0)
-			return ~0ULL;
+			return NULL;
 		id.vpi_index = 0;
 	}
 }
@@ -538,71 +542,52 @@ static void vvp_pgcache_page_show(const struct lu_env *env,
 static int vvp_pgcache_show(struct seq_file *f, void *v)
 {
 	struct seq_private	*priv = f->private;
-	loff_t		   pos;
-	struct cl_object	*clob;
-	struct vvp_pgcache_id    id;
-
-	pos = *(loff_t *)v;
-	vvp_pgcache_id_unpack(pos, &id);
-	clob = vvp_pgcache_obj(priv->env, &priv->sbi->ll_cl->cd_lu_dev, &id);
-	if (clob) {
-		struct inode *inode = vvp_object_inode(clob);
-		struct cl_page *page = NULL;
-		struct page *vmpage;
-		int result;
-
-		result = find_get_pages_contig(inode->i_mapping,
-					       id.vpi_index, 1,
-					       &vmpage);
-		if (result > 0) {
-			lock_page(vmpage);
-			page = cl_vmpage_page(vmpage, clob);
-			unlock_page(vmpage);
-			put_page(vmpage);
-		}
-
-		seq_printf(f, "%8x@" DFID ": ", id.vpi_index,
-			   PFID(lu_object_fid(&clob->co_lu)));
-		if (page) {
-			vvp_pgcache_page_show(priv->env, f, page);
-			cl_page_put(priv->env, page);
-		} else {
-			seq_puts(f, "missing\n");
-		}
-		lu_object_ref_del(&clob->co_lu, "dump", current);
-		cl_object_put(priv->env, clob);
+	struct page		*vmpage = v;
+	struct cl_page		*page;
+
+	seq_printf(f, "%8lx@" DFID ": ", vmpage->index,
+		   PFID(lu_object_fid(&priv->clob->co_lu)));
+	lock_page(vmpage);
+	page = cl_vmpage_page(vmpage, priv->clob);
+	unlock_page(vmpage);
+	put_page(vmpage);
+
+	if (page) {
+		vvp_pgcache_page_show(priv->env, f, page);
+		cl_page_put(priv->env, page);
 	} else {
-		seq_printf(f, "%llx missing\n", pos);
+		seq_puts(f, "missing\n");
 	}
+	lu_object_ref_del(&priv->clob->co_lu, "dump", current);
+	cl_object_put(priv->env, priv->clob);
+
 	return 0;
 }
 
 static void *vvp_pgcache_start(struct seq_file *f, loff_t *pos)
 {
 	struct seq_private	*priv = f->private;
+	struct page *ret;
 
 	if (priv->sbi->ll_site->ls_obj_hash->hs_cur_bits >
-	    64 - PGC_OBJ_SHIFT) {
-		pos = ERR_PTR(-EFBIG);
-	} else {
-		*pos = vvp_pgcache_find(priv->env, &priv->sbi->ll_cl->cd_lu_dev,
-					*pos);
-		if (*pos == ~0ULL)
-			pos = NULL;
-	}
+	    64 - PGC_OBJ_SHIFT)
+		ret = ERR_PTR(-EFBIG);
+	else
+		ret = vvp_pgcache_find(priv->env, &priv->sbi->ll_cl->cd_lu_dev,
+				       &priv->clob, pos);
 
-	return pos;
+	return ret;
 }
 
 static void *vvp_pgcache_next(struct seq_file *f, void *v, loff_t *pos)
 {
 	struct seq_private *priv = f->private;
+	struct page *ret;
 
-	*pos = vvp_pgcache_find(priv->env, &priv->sbi->ll_cl->cd_lu_dev, *pos + 1);
-	if (*pos == ~0ULL)
-		pos = NULL;
-
-	return pos;
+	*pos += 1;
+	ret = vvp_pgcache_find(priv->env, &priv->sbi->ll_cl->cd_lu_dev,
+			       &priv->clob, pos);
+	return ret;
 }
 
 static void vvp_pgcache_stop(struct seq_file *f, void *v)

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 15/20] staging: lustre: llite: use more private data in dump_pgcache
  2018-04-11 21:54 [PATCH 00/20] staging: lustre: convert to rhashtable NeilBrown
                   ` (12 preceding siblings ...)
  2018-04-11 21:54 ` [PATCH 17/20] staging: lustre: use call_rcu() to free lu_object_headers NeilBrown
@ 2018-04-11 21:54 ` NeilBrown
  2018-04-11 21:54 ` [PATCH 16/20] staging: lustre: llite: remove redundant lookup " NeilBrown
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2018-04-11 21:54 UTC (permalink / raw)
  To: Oleg Drokin, Greg Kroah-Hartman, James Simmons, Andreas Dilger
  Cc: Linux Kernel Mailing List, Lustre Development List

The dump_page_cache debugfs file allocates and frees an 'env' in each
call to vvp_pgcache_start(), vvp_pgcache_next(), and vvp_pgcache_show().
This is likely to be fast, but does introduce the need to check for
errors.

It is reasonable to allocate a single 'env' when the file is opened,
and use that throughout.

So create a 'seq_private' structure which stores the sbi, env, and
refcheck, and attach this to the seq_file.

Then use it throughout instead of allocating 'env' repeatedly.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/staging/lustre/lustre/llite/vvp_dev.c |  150 ++++++++++++-------------
 1 file changed, 72 insertions(+), 78 deletions(-)

diff --git a/drivers/staging/lustre/lustre/llite/vvp_dev.c b/drivers/staging/lustre/lustre/llite/vvp_dev.c
index 987c03b058e6..a2619dc04a7f 100644
--- a/drivers/staging/lustre/lustre/llite/vvp_dev.c
+++ b/drivers/staging/lustre/lustre/llite/vvp_dev.c
@@ -390,6 +390,12 @@ struct vvp_pgcache_id {
 	struct lu_object_header *vpi_obj;
 };
 
+struct seq_private {
+	struct ll_sb_info	*sbi;
+	struct lu_env		*env;
+	u16			refcheck;
+};
+
 static void vvp_pgcache_id_unpack(loff_t pos, struct vvp_pgcache_id *id)
 {
 	BUILD_BUG_ON(sizeof(pos) != sizeof(__u64));
@@ -531,95 +537,71 @@ static void vvp_pgcache_page_show(const struct lu_env *env,
 
 static int vvp_pgcache_show(struct seq_file *f, void *v)
 {
+	struct seq_private	*priv = f->private;
 	loff_t		   pos;
-	struct ll_sb_info       *sbi;
 	struct cl_object	*clob;
-	struct lu_env	   *env;
 	struct vvp_pgcache_id    id;
-	u16 refcheck;
-	int		      result;
 
-	env = cl_env_get(&refcheck);
-	if (!IS_ERR(env)) {
-		pos = *(loff_t *)v;
-		vvp_pgcache_id_unpack(pos, &id);
-		sbi = f->private;
-		clob = vvp_pgcache_obj(env, &sbi->ll_cl->cd_lu_dev, &id);
-		if (clob) {
-			struct inode *inode = vvp_object_inode(clob);
-			struct cl_page *page = NULL;
-			struct page *vmpage;
-
-			result = find_get_pages_contig(inode->i_mapping,
-						       id.vpi_index, 1,
-						       &vmpage);
-			if (result > 0) {
-				lock_page(vmpage);
-				page = cl_vmpage_page(vmpage, clob);
-				unlock_page(vmpage);
-				put_page(vmpage);
-			}
+	pos = *(loff_t *)v;
+	vvp_pgcache_id_unpack(pos, &id);
+	clob = vvp_pgcache_obj(priv->env, &priv->sbi->ll_cl->cd_lu_dev, &id);
+	if (clob) {
+		struct inode *inode = vvp_object_inode(clob);
+		struct cl_page *page = NULL;
+		struct page *vmpage;
+		int result;
+
+		result = find_get_pages_contig(inode->i_mapping,
+					       id.vpi_index, 1,
+					       &vmpage);
+		if (result > 0) {
+			lock_page(vmpage);
+			page = cl_vmpage_page(vmpage, clob);
+			unlock_page(vmpage);
+			put_page(vmpage);
+		}
 
-			seq_printf(f, "%8x@" DFID ": ", id.vpi_index,
-				   PFID(lu_object_fid(&clob->co_lu)));
-			if (page) {
-				vvp_pgcache_page_show(env, f, page);
-				cl_page_put(env, page);
-			} else {
-				seq_puts(f, "missing\n");
-			}
-			lu_object_ref_del(&clob->co_lu, "dump", current);
-			cl_object_put(env, clob);
+		seq_printf(f, "%8x@" DFID ": ", id.vpi_index,
+			   PFID(lu_object_fid(&clob->co_lu)));
+		if (page) {
+			vvp_pgcache_page_show(priv->env, f, page);
+			cl_page_put(priv->env, page);
 		} else {
-			seq_printf(f, "%llx missing\n", pos);
+			seq_puts(f, "missing\n");
 		}
-		cl_env_put(env, &refcheck);
-		result = 0;
+		lu_object_ref_del(&clob->co_lu, "dump", current);
+		cl_object_put(priv->env, clob);
 	} else {
-		result = PTR_ERR(env);
+		seq_printf(f, "%llx missing\n", pos);
 	}
-	return result;
+	return 0;
 }
 
 static void *vvp_pgcache_start(struct seq_file *f, loff_t *pos)
 {
-	struct ll_sb_info *sbi;
-	struct lu_env     *env;
-	u16 refcheck;
-
-	sbi = f->private;
+	struct seq_private	*priv = f->private;
 
-	env = cl_env_get(&refcheck);
-	if (!IS_ERR(env)) {
-		sbi = f->private;
-		if (sbi->ll_site->ls_obj_hash->hs_cur_bits >
-		    64 - PGC_OBJ_SHIFT) {
-			pos = ERR_PTR(-EFBIG);
-		} else {
-			*pos = vvp_pgcache_find(env, &sbi->ll_cl->cd_lu_dev,
-						*pos);
-			if (*pos == ~0ULL)
-				pos = NULL;
-		}
-		cl_env_put(env, &refcheck);
+	if (priv->sbi->ll_site->ls_obj_hash->hs_cur_bits >
+	    64 - PGC_OBJ_SHIFT) {
+		pos = ERR_PTR(-EFBIG);
+	} else {
+		*pos = vvp_pgcache_find(priv->env, &priv->sbi->ll_cl->cd_lu_dev,
+					*pos);
+		if (*pos == ~0ULL)
+			pos = NULL;
 	}
+
 	return pos;
 }
 
 static void *vvp_pgcache_next(struct seq_file *f, void *v, loff_t *pos)
 {
-	struct ll_sb_info *sbi;
-	struct lu_env     *env;
-	u16 refcheck;
+	struct seq_private *priv = f->private;
+
+	*pos = vvp_pgcache_find(priv->env, &priv->sbi->ll_cl->cd_lu_dev, *pos + 1);
+	if (*pos == ~0ULL)
+		pos = NULL;
 
-	env = cl_env_get(&refcheck);
-	if (!IS_ERR(env)) {
-		sbi = f->private;
-		*pos = vvp_pgcache_find(env, &sbi->ll_cl->cd_lu_dev, *pos + 1);
-		if (*pos == ~0ULL)
-			pos = NULL;
-		cl_env_put(env, &refcheck);
-	}
 	return pos;
 }
 
@@ -637,17 +619,29 @@ static const struct seq_operations vvp_pgcache_ops = {
 
 static int vvp_dump_pgcache_seq_open(struct inode *inode, struct file *filp)
 {
-	struct seq_file *seq;
-	int rc;
-
-	rc = seq_open(filp, &vvp_pgcache_ops);
-	if (rc)
-		return rc;
+	struct seq_private *priv;
+
+	priv = __seq_open_private(filp, &vvp_pgcache_ops, sizeof(*priv));
+	if (!priv)
+		return -ENOMEM;
+
+	priv->sbi = inode->i_private;
+	priv->env = cl_env_get(&priv->refcheck);
+	if (IS_ERR(priv->env)) {
+		int err = PTR_ERR(priv->env);
+		seq_release_private(inode, filp);
+		return err;
+	}
+	return 0;
+}
 
-	seq = filp->private_data;
-	seq->private = inode->i_private;
+static int vvp_dump_pgcache_seq_release(struct inode *inode, struct file *file)
+{
+	struct seq_file *seq = file->private_data;
+	struct seq_private *priv = seq->private;
 
-	return 0;
+	cl_env_put(priv->env, &priv->refcheck);
+	return seq_release_private(inode, file);
 }
 
 const struct file_operations vvp_dump_pgcache_file_ops = {
@@ -655,5 +649,5 @@ const struct file_operations vvp_dump_pgcache_file_ops = {
 	.open    = vvp_dump_pgcache_seq_open,
 	.read    = seq_read,
 	.llseek	 = seq_lseek,
-	.release = seq_release,
+	.release = vvp_dump_pgcache_seq_release,
 };

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 14/20] staging: lustre: fold lu_object_new() into lu_object_find_at()
  2018-04-11 21:54 [PATCH 00/20] staging: lustre: convert to rhashtable NeilBrown
                   ` (14 preceding siblings ...)
  2018-04-11 21:54 ` [PATCH 16/20] staging: lustre: llite: remove redundant lookup " NeilBrown
@ 2018-04-11 21:54 ` NeilBrown
  2018-04-11 21:54 ` [PATCH 13/20] staging: lustre: lu_object: move retry logic inside htable_lookup NeilBrown
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2018-04-11 21:54 UTC (permalink / raw)
  To: Oleg Drokin, Greg Kroah-Hartman, James Simmons, Andreas Dilger
  Cc: Linux Kernel Mailing List, Lustre Development List

lu_object_new() duplicates a lot of code that is in
lu_object_find_at().
There is no real need for a separate function; it is simpler just
to skip the bits of lu_object_find_at() that we don't
want in the LOC_F_NEW case.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/staging/lustre/lustre/obdclass/lu_object.c |   44 +++++---------------
 1 file changed, 12 insertions(+), 32 deletions(-)

diff --git a/drivers/staging/lustre/lustre/obdclass/lu_object.c b/drivers/staging/lustre/lustre/obdclass/lu_object.c
index bf505a9463a3..18019f41c7a8 100644
--- a/drivers/staging/lustre/lustre/obdclass/lu_object.c
+++ b/drivers/staging/lustre/lustre/obdclass/lu_object.c
@@ -691,29 +691,6 @@ static void lu_object_limit(const struct lu_env *env, struct lu_device *dev)
 			      false);
 }
 
-static struct lu_object *lu_object_new(const struct lu_env *env,
-				       struct lu_device *dev,
-				       const struct lu_fid *f,
-				       const struct lu_object_conf *conf)
-{
-	struct lu_object	*o;
-	struct cfs_hash	      *hs;
-	struct cfs_hash_bd	    bd;
-
-	o = lu_object_alloc(env, dev, f, conf);
-	if (IS_ERR(o))
-		return o;
-
-	hs = dev->ld_site->ls_obj_hash;
-	cfs_hash_bd_get_and_lock(hs, (void *)f, &bd, 1);
-	cfs_hash_bd_add_locked(hs, &bd, &o->lo_header->loh_hash);
-	cfs_hash_bd_unlock(hs, &bd, 1);
-
-	lu_object_limit(env, dev);
-
-	return o;
-}
-
 /**
  * Much like lu_object_find(), but top level device of object is specifically
  * \a dev rather than top level device of the site. This interface allows
@@ -749,18 +726,18 @@ struct lu_object *lu_object_find_at(const struct lu_env *env,
 	 * just alloc and insert directly.
 	 *
 	 */
-	if (conf && conf->loc_flags & LOC_F_NEW)
-		return lu_object_new(env, dev, f, conf);
-
 	s  = dev->ld_site;
 	hs = s->ls_obj_hash;
-	cfs_hash_bd_get_and_lock(hs, (void *)f, &bd, 0);
-	o = htable_lookup(s, &bd, f, &version);
-	cfs_hash_bd_unlock(hs, &bd, 0);
 
-	if (!IS_ERR(o) || PTR_ERR(o) != -ENOENT)
-		return o;
+	cfs_hash_bd_get(hs, f, &bd);
+	if (!(conf && conf->loc_flags & LOC_F_NEW)) {
+		cfs_hash_bd_lock(hs, &bd, 0);
+		o = htable_lookup(s, &bd, f, &version);
+		cfs_hash_bd_unlock(hs, &bd, 0);
 
+		if (!IS_ERR(o) || PTR_ERR(o) != -ENOENT)
+			return o;
+	}
 	/*
 	 * Allocate new object. This may result in rather complicated
 	 * operations, including fld queries, inode loading, etc.
@@ -773,7 +750,10 @@ struct lu_object *lu_object_find_at(const struct lu_env *env,
 
 	cfs_hash_bd_lock(hs, &bd, 1);
 
-	shadow = htable_lookup(s, &bd, f, &version);
+	if (conf && conf->loc_flags & LOC_F_NEW)
+		shadow = ERR_PTR(-ENOENT);
+	else
+		shadow = htable_lookup(s, &bd, f, &version);
 	if (likely(PTR_ERR(shadow) == -ENOENT)) {
 		cfs_hash_bd_add_locked(hs, &bd, &o->lo_header->loh_hash);
 		cfs_hash_bd_unlock(hs, &bd, 1);

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 13/20] staging: lustre: lu_object: move retry logic inside htable_lookup
  2018-04-11 21:54 [PATCH 00/20] staging: lustre: convert to rhashtable NeilBrown
                   ` (15 preceding siblings ...)
  2018-04-11 21:54 ` [PATCH 14/20] staging: lustre: fold lu_object_new() into lu_object_find_at() NeilBrown
@ 2018-04-11 21:54 ` NeilBrown
  2018-04-11 21:54 ` [PATCH 18/20] staging: lustre: change how "dump_page_cache" walks a hash table NeilBrown
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2018-04-11 21:54 UTC (permalink / raw)
  To: Oleg Drokin, Greg Kroah-Hartman, James Simmons, Andreas Dilger
  Cc: Linux Kernel Mailing List, Lustre Development List

The current retry logic, to wait when a 'dying' object is found,
spans multiple functions.  The process is attached to a waitqueue and
its state set to TASK_UNINTERRUPTIBLE in htable_lookup(), and this status
is passed back through lu_object_find_try() to lu_object_find_at()
where schedule() is called and the process is removed from the queue.

This can be simplified by moving all the logic (including
hashtable locking) inside htable_lookup(), which now never returns
EAGAIN.

Note that htable_lookup() is called with the hash bucket lock
held, and will drop and retake it if it needs to schedule.

I made this a 'goto' loop rather than a 'while(1)' loop as the
diff is easier to read.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/staging/lustre/lustre/obdclass/lu_object.c |   73 +++++++-------------
 1 file changed, 27 insertions(+), 46 deletions(-)

diff --git a/drivers/staging/lustre/lustre/obdclass/lu_object.c b/drivers/staging/lustre/lustre/obdclass/lu_object.c
index 064166843e64..bf505a9463a3 100644
--- a/drivers/staging/lustre/lustre/obdclass/lu_object.c
+++ b/drivers/staging/lustre/lustre/obdclass/lu_object.c
@@ -596,16 +596,21 @@ EXPORT_SYMBOL(lu_object_print);
 static struct lu_object *htable_lookup(struct lu_site *s,
 				       struct cfs_hash_bd *bd,
 				       const struct lu_fid *f,
-				       wait_queue_entry_t *waiter,
 				       __u64 *version)
 {
+	struct cfs_hash		*hs = s->ls_obj_hash;
 	struct lu_site_bkt_data *bkt;
 	struct lu_object_header *h;
 	struct hlist_node	*hnode;
-	__u64  ver = cfs_hash_bd_version_get(bd);
+	__u64 ver;
+	wait_queue_entry_t waiter;
 
-	if (*version == ver)
+retry:
+	ver = cfs_hash_bd_version_get(bd);
+
+	if (*version == ver) {
 		return ERR_PTR(-ENOENT);
+	}
 
 	*version = ver;
 	/* cfs_hash_bd_peek_locked is a somehow "internal" function
@@ -638,11 +643,15 @@ static struct lu_object *htable_lookup(struct lu_site *s,
 	 * drained), and moreover, lookup has to wait until object is freed.
 	 */
 
-	init_waitqueue_entry(waiter, current);
-	add_wait_queue(&bkt->lsb_marche_funebre, waiter);
+	init_waitqueue_entry(&waiter, current);
+	add_wait_queue(&bkt->lsb_marche_funebre, &waiter);
 	set_current_state(TASK_UNINTERRUPTIBLE);
 	lprocfs_counter_incr(s->ls_stats, LU_SS_CACHE_DEATH_RACE);
-	return ERR_PTR(-EAGAIN);
+	cfs_hash_bd_unlock(hs, bd, 1);
+	schedule();
+	remove_wait_queue(&bkt->lsb_marche_funebre, &waiter);
+	cfs_hash_bd_lock(hs, bd, 1);
+	goto retry;
 }
 
 /**
@@ -706,13 +715,14 @@ static struct lu_object *lu_object_new(const struct lu_env *env,
 }
 
 /**
- * Core logic of lu_object_find*() functions.
+ * Much like lu_object_find(), but top level device of object is specifically
+ * \a dev rather than top level device of the site. This interface allows
+ * objects of different "stacking" to be created within the same site.
  */
-static struct lu_object *lu_object_find_try(const struct lu_env *env,
-					    struct lu_device *dev,
-					    const struct lu_fid *f,
-					    const struct lu_object_conf *conf,
-					    wait_queue_entry_t *waiter)
+struct lu_object *lu_object_find_at(const struct lu_env *env,
+				    struct lu_device *dev,
+				    const struct lu_fid *f,
+				    const struct lu_object_conf *conf)
 {
 	struct lu_object      *o;
 	struct lu_object      *shadow;
@@ -738,17 +748,16 @@ static struct lu_object *lu_object_find_try(const struct lu_env *env,
 	 * It is unnecessary to perform lookup-alloc-lookup-insert, instead,
 	 * just alloc and insert directly.
 	 *
-	 * If dying object is found during index search, add @waiter to the
-	 * site wait-queue and return ERR_PTR(-EAGAIN).
 	 */
 	if (conf && conf->loc_flags & LOC_F_NEW)
 		return lu_object_new(env, dev, f, conf);
 
 	s  = dev->ld_site;
 	hs = s->ls_obj_hash;
-	cfs_hash_bd_get_and_lock(hs, (void *)f, &bd, 1);
-	o = htable_lookup(s, &bd, f, waiter, &version);
-	cfs_hash_bd_unlock(hs, &bd, 1);
+	cfs_hash_bd_get_and_lock(hs, (void *)f, &bd, 0);
+	o = htable_lookup(s, &bd, f, &version);
+	cfs_hash_bd_unlock(hs, &bd, 0);
+
 	if (!IS_ERR(o) || PTR_ERR(o) != -ENOENT)
 		return o;
 
@@ -764,7 +773,7 @@ static struct lu_object *lu_object_find_try(const struct lu_env *env,
 
 	cfs_hash_bd_lock(hs, &bd, 1);
 
-	shadow = htable_lookup(s, &bd, f, waiter, &version);
+	shadow = htable_lookup(s, &bd, f, &version);
 	if (likely(PTR_ERR(shadow) == -ENOENT)) {
 		cfs_hash_bd_add_locked(hs, &bd, &o->lo_header->loh_hash);
 		cfs_hash_bd_unlock(hs, &bd, 1);
@@ -779,34 +788,6 @@ static struct lu_object *lu_object_find_try(const struct lu_env *env,
 	lu_object_free(env, o);
 	return shadow;
 }
-
-/**
- * Much like lu_object_find(), but top level device of object is specifically
- * \a dev rather than top level device of the site. This interface allows
- * objects of different "stacking" to be created within the same site.
- */
-struct lu_object *lu_object_find_at(const struct lu_env *env,
-				    struct lu_device *dev,
-				    const struct lu_fid *f,
-				    const struct lu_object_conf *conf)
-{
-	wait_queue_head_t	*wq;
-	struct lu_object	*obj;
-	wait_queue_entry_t	   wait;
-
-	while (1) {
-		obj = lu_object_find_try(env, dev, f, conf, &wait);
-		if (obj != ERR_PTR(-EAGAIN))
-			return obj;
-		/*
-		 * lu_object_find_try() already added waiter into the
-		 * wait queue.
-		 */
-		schedule();
-		wq = lu_site_wq_from_fid(dev->ld_site, (void *)f);
-		remove_wait_queue(wq, &wait);
-	}
-}
 EXPORT_SYMBOL(lu_object_find_at);
 
 /**

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 00/20] staging: lustre: convert to rhashtable
  2018-04-11 21:54 [PATCH 00/20] staging: lustre: convert to rhashtable NeilBrown
                   ` (19 preceding siblings ...)
  2018-04-11 21:54 ` [PATCH 19/20] staging: lustre: convert lu_object cache to rhashtable NeilBrown
@ 2018-04-17  3:35 ` James Simmons
  2018-04-18  3:17   ` NeilBrown
  2018-04-23 13:08 ` Greg Kroah-Hartman
  21 siblings, 1 reply; 25+ messages in thread
From: James Simmons @ 2018-04-17  3:35 UTC (permalink / raw)
  To: NeilBrown
  Cc: Oleg Drokin, Greg Kroah-Hartman, Andreas Dilger,
	Linux Kernel Mailing List, Lustre Development List


> libcfs in lustre has a resizeable hashtable.
> Linux already has a resizeable hashtable, rhashtable, which is better
> is most metrics. See https://lwn.net/Articles/751374/ in a few days
> for an introduction to rhashtable.

Thanks for starting this work. I was thinking about cleaning up the libcfs
hash but your port to rhashtable is way better. How did you gather
metrics to see that rhashtable was better than the libcfs hash?
 
> This series converts lustre to use rhashtable.  This affects several
> different tables, and each is different is various ways.
> 
> There are two outstanding issues.  One is that a bug in rhashtable
> means that we cannot enable auto-shrinking in one of the tables.  That
> is documented as appropriate and should be fixed soon.
> 
> The other is that rhashtable has an atomic_t which counts the elements
> in a hash table.  At least one table in lustre went to some trouble to
> avoid any table-wide atomics, so that could lead to a regression.
> I'm hoping that rhashtable can be enhanced with the option of a
> per-cpu counter, or similar.
> 

This doesn't sound quite ready to land just yet. This will need some
soak testing and a larger scope of testing to make sure no new regressions
happen. Believe me, I did work to make lustre work better on tickless
systems, which I'm preparing for the linux client, and small changes could
break things in interesting ways. I will port the rhashtable change to the
Intel development branch and get people more familiar with the hash code
to look at it.

> I have enabled automatic shrinking on all tables where it makes sense
> and doesn't trigger the bug.  I have also removed all hints concerning
> min/max size - I cannot see how these could be useful.
> 
> The dump_pgcache debugfs file provided some interesting challenges.  I
> think I have cleaned it up enough so that it all makes sense.  An
> extra pair of eyes examining that code in particular would be
> appreciated.
> 
> This series passes all the same tests that pass before the patches are
> applied.
> 
> Thanks,
> NeilBrown
> 
> 
> ---
> 
> NeilBrown (20):
>       staging: lustre: ptlrpc: convert conn_hash to rhashtable
>       staging: lustre: convert lov_pool to use rhashtable
>       staging: lustre: convert obd uuid hash to rhashtable
>       staging: lustre: convert osc_quota hash to rhashtable
>       staging: lustre: separate buckets from ldlm hash table
>       staging: lustre: ldlm: add a counter to the per-namespace data
>       staging: lustre: ldlm: store name directly in namespace.
>       staging: lustre: simplify ldlm_ns_hash_defs[]
>       staging: lustre: convert ldlm_resource hash to rhashtable.
>       staging: lustre: make struct lu_site_bkt_data private
>       staging: lustre: lu_object: discard extra lru count.
>       staging: lustre: lu_object: factor out extra per-bucket data
>       staging: lustre: lu_object: move retry logic inside htable_lookup
>       staging: lustre: fold lu_object_new() into lu_object_find_at()
>       staging: lustre: llite: use more private data in dump_pgcache
>       staging: lustre: llite: remove redundant lookup in dump_pgcache
>       staging: lustre: use call_rcu() to free lu_object_headers
>       staging: lustre: change how "dump_page_cache" walks a hash table
>       staging: lustre: convert lu_object cache to rhashtable
>       staging: lustre: remove cfs_hash resizeable hashtable implementation.
> 
> 
>  .../staging/lustre/include/linux/libcfs/libcfs.h   |    1 
>  .../lustre/include/linux/libcfs/libcfs_hash.h      |  866 --------
>  drivers/staging/lustre/lnet/libcfs/Makefile        |    2 
>  drivers/staging/lustre/lnet/libcfs/hash.c          | 2064 --------------------
>  drivers/staging/lustre/lnet/libcfs/module.c        |   12 
>  drivers/staging/lustre/lustre/include/lu_object.h  |   55 -
>  drivers/staging/lustre/lustre/include/lustre_dlm.h |   19 
>  .../staging/lustre/lustre/include/lustre_export.h  |    2 
>  drivers/staging/lustre/lustre/include/lustre_net.h |    4 
>  drivers/staging/lustre/lustre/include/obd.h        |   11 
>  .../staging/lustre/lustre/include/obd_support.h    |    9 
>  drivers/staging/lustre/lustre/ldlm/ldlm_request.c  |   31 
>  drivers/staging/lustre/lustre/ldlm/ldlm_resource.c |  370 +---
>  drivers/staging/lustre/lustre/llite/lcommon_cl.c   |    8 
>  drivers/staging/lustre/lustre/llite/vvp_dev.c      |  332 +--
>  drivers/staging/lustre/lustre/llite/vvp_object.c   |    9 
>  drivers/staging/lustre/lustre/lov/lov_internal.h   |   11 
>  drivers/staging/lustre/lustre/lov/lov_obd.c        |   12 
>  drivers/staging/lustre/lustre/lov/lov_object.c     |    8 
>  drivers/staging/lustre/lustre/lov/lov_pool.c       |  159 +-
>  drivers/staging/lustre/lustre/lov/lovsub_object.c  |    9 
>  drivers/staging/lustre/lustre/obdclass/genops.c    |   34 
>  drivers/staging/lustre/lustre/obdclass/lu_object.c |  564 ++---
>  .../staging/lustre/lustre/obdclass/obd_config.c    |  161 +-
>  .../staging/lustre/lustre/obdecho/echo_client.c    |    8 
>  drivers/staging/lustre/lustre/osc/osc_internal.h   |    5 
>  drivers/staging/lustre/lustre/osc/osc_quota.c      |  136 -
>  drivers/staging/lustre/lustre/osc/osc_request.c    |   12 
>  drivers/staging/lustre/lustre/ptlrpc/connection.c  |  164 +-
>  29 files changed, 870 insertions(+), 4208 deletions(-)
>  delete mode 100644 drivers/staging/lustre/include/linux/libcfs/libcfs_hash.h
>  delete mode 100644 drivers/staging/lustre/lnet/libcfs/hash.c
> 
> --
> Signature
> 
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 00/20] staging: lustre: convert to rhashtable
  2018-04-17  3:35 ` [PATCH 00/20] staging: lustre: convert " James Simmons
@ 2018-04-18  3:17   ` NeilBrown
  2018-04-18 21:56     ` [lustre-devel] " Simmons, James A.
  0 siblings, 1 reply; 25+ messages in thread
From: NeilBrown @ 2018-04-18  3:17 UTC (permalink / raw)
  To: James Simmons
  Cc: Oleg Drokin, Greg Kroah-Hartman, Andreas Dilger,
	Linux Kernel Mailing List, Lustre Development List

On Tue, Apr 17 2018, James Simmons wrote:

>> libcfs in lustre has a resizeable hashtable.
>> Linux already has a resizeable hashtable, rhashtable, which is better
>> is most metrics. See https://lwn.net/Articles/751374/ in a few days
>> for an introduction to rhashtable.
>
> Thansk for starting this work. I was think about cleaning the libcfs
> hash but your port to rhashtables is way better. How did you gather
> metrics to see that rhashtable was better than libcfs hash?

Code inspection and reputation.  It is hard to beat inlined lockless
code for lookups.  And rhashtable is heavily used in the network
subsystem and they are very focused on latency there.  I'm not sure that
insertion is as fast as it can be (I have some thoughts on that) but I'm
sure lookup will be better.
I haven't done any performance testing myself, only correctness.
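
For reference, a minimal sketch of that lockless lookup pattern (illustrative
only: the demo_* names are invented, only the rhashtable and refcount calls are
real API, and the refcount step mirrors what the series' htable_lookup() does
with atomic_inc_not_zero()):

#include <linux/rhashtable.h>
#include <linux/refcount.h>

struct demo_obj {
	u32			key;
	refcount_t		ref;
	struct rhash_head	node;	/* object is freed via call_rcu() elsewhere */
};

static const struct rhashtable_params demo_params = {
	.key_len	= sizeof(u32),
	.key_offset	= offsetof(struct demo_obj, key),
	.head_offset	= offsetof(struct demo_obj, node),
	.automatic_shrinking = true,
};

static struct demo_obj *demo_find(struct rhashtable *ht, u32 key)
{
	struct demo_obj *obj;

	rcu_read_lock();
	obj = rhashtable_lookup_fast(ht, &key, demo_params);
	/* pin the object before leaving the RCU section */
	if (obj && !refcount_inc_not_zero(&obj->ref))
		obj = NULL;
	rcu_read_unlock();
	return obj;
}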

>  
>> This series converts lustre to use rhashtable.  This affects several
>> different tables, and each is different is various ways.
>> 
>> There are two outstanding issues.  One is that a bug in rhashtable
>> means that we cannot enable auto-shrinking in one of the tables.  That
>> is documented as appropriate and should be fixed soon.
>> 
>> The other is that rhashtable has an atomic_t which counts the elements
>> in a hash table.  At least one table in lustre went to some trouble to
>> avoid any table-wide atomics, so that could lead to a regression.
>> I'm hoping that rhashtable can be enhanced with the option of a
>> per-cpu counter, or similar.
>> 
>
> This doesn't sound quite ready to land just yet. This will have to do some
> soak testing and a larger scope of test to make sure no new regressions
> happen. Believe me I did work to make lustre work better on tickless 
> systems, which I'm preparing for the linux client, and small changes could 
> break things in interesting ways. I will port the rhashtable change to the 
> Intel developement branch and get people more familar with the hash code
> to look at it.

Whether it is "ready" or not probably depends on perspective and
priorities.  As I see it, getting lustre cleaned up and out of staging
is a fairly high priority, and it will require a lot of code change.
It is inevitable that regressions will slip in (some already have) and
it is important to keep testing (the test suite is of great benefit, but
is only part of the story of course).  But to test the code, it needs to
land.  Testing the code in Intel's devel branch and then porting it
across doesn't really prove much.  For testing to be meaningful, it
needs to be tested in a branch that is up-to-date with mainline and on
track to be merged into mainline.

I have no particular desire to rush this in, but I don't see any
particular benefit in delaying it either.

I guess I see staging as implicitly a 'devel' branch.  You seem to be
treating it a bit like a 'stable' branch - is that right?

As mentioned, I think there is room for improvement in rhashtable.
- making the atomic counter optional
- replacing the separate spinlocks with bit-locks in the hash-chain head
  so that the lock and chain are in the same cache line
- for the common case of inserting into an empty chain, a single atomic
  cmpxchg() should be sufficient (a rough sketch follows below)
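
For the last of those, something like this might be all that is needed on
the fast path (a purely hypothetical sketch - the helper is invented and
this is not how rhashtable currently inserts; the first point could
presumably build on the existing percpu_counter machinery):

	/*
	 * If the bucket is still empty, publish the new entry with a
	 * single cmpxchg() and never touch the bucket lock.  cmpxchg()
	 * is fully ordered, so the initialisation of obj->next is
	 * visible before obj itself becomes visible.  If another insert
	 * won the race, fall back to the locked slow path.
	 */
	static bool insert_into_empty_bucket(struct rhash_head **bkt,
					     struct rhash_head *obj)
	{
		RCU_INIT_POINTER(obj->next, NULL);
		return cmpxchg(bkt, NULL, obj) == NULL;
	}
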
I think we have a better chance of being heard if we have "skin in the
game" and have upstream code that would use this.

Thanks,
NeilBrown

* Re: [lustre-devel] [PATCH 00/20] staging: lustre: convert to rhashtable
  2018-04-18  3:17   ` NeilBrown
@ 2018-04-18 21:56     ` Simmons, James A.
  0 siblings, 0 replies; 25+ messages in thread
From: Simmons, James A. @ 2018-04-18 21:56 UTC (permalink / raw)
  To: NeilBrown, James Simmons
  Cc: Oleg Drokin, Greg Kroah-Hartman, Linux Kernel Mailing List,
	Lustre Development List


>>> libcfs in lustre has a resizeable hashtable.
>>> Linux already has a resizeable hashtable, rhashtable, which is better
>>> is most metrics. See https://lwn.net/Articles/751374/ in a few days
>>> for an introduction to rhashtable.
>>
>> Thansk for starting this work. I was think about cleaning the libcfs
>> hash but your port to rhashtables is way better. How did you gather
>> metrics to see that rhashtable was better than libcfs hash?
>
>Code inspection and reputation.  It is hard to beat inlined lockless
>code for lookups.  And rhashtable is heavily used in the network
>subsystem and they are very focused on latency there.  I'm not sure that
>insertion is as fast as it can be (I have some thoughts on that) but I'm
>sure lookup will be better.
>I haven't done any performance testing myself, only correctness.

Sorry, this email never reached my infradead account, so I'm replying
from my work account.  I was just wondering whether any numbers had been
gathered; I'm curious how well this would scale on an HPC cluster.  In any
case I can do a comparison with and without the patches on one of my test
clusters and share the numbers with you, using real-world workloads.

>>> This series converts lustre to use rhashtable.  This affects several
>>> different tables, and each is different is various ways.
>>>
>>> There are two outstanding issues.  One is that a bug in rhashtable
>>> means that we cannot enable auto-shrinking in one of the tables.  That
>>> is documented as appropriate and should be fixed soon.
>>>
>>> The other is that rhashtable has an atomic_t which counts the elements
>>> in a hash table.  At least one table in lustre went to some trouble to
>>> avoid any table-wide atomics, so that could lead to a regression.
>>> I'm hoping that rhashtable can be enhanced with the option of a
>>> per-cpu counter, or similar.
>>>
>>
>> This doesn't sound quite ready to land just yet. This will have to do some
>> soak testing and a larger scope of test to make sure no new regressions
>> happen. Believe me I did work to make lustre work better on tickless
>> systems, which I'm preparing for the linux client, and small changes could
>> break things in interesting ways. I will port the rhashtable change to the
>> Intel developement branch and get people more familar with the hash code
>> to look at it.
>
>Whether it is "ready" or not probably depends on perspective and
>priorities.  As I see it, getting lustre cleaned up and out of staging
>is a fairly high priority, and it will require a lot of code change.
>It is inevitable that regressions will slip in (some already have) and
>it is important to keep testing (the test suite is of great benefit, but
>is only part of the story of course).  But to test the code, it needs to
>land.  Testing the code in Intel's devel branch and then porting it
>across doesn't really prove much.  For testing to be meaningful, it
>needs to be tested in a branch that up-to-date with mainline and on
>track to be merged into mainline.
>
>I have no particular desire to rush this in, but I don't see any
>particular benefit in delaying it either.
>
>I guess I see staging as implicitly a 'devel' branch.  You seem to be
>treating it a bit like a 'stable' branch - is that right?

So two years ago no one would touch the linux Lustre client due to it
being so broken.  Then, after about a year of work, the client got into a
sort-of-working state, even to the point that actual sites are using it.
It is still broken, and guess who gets notified of the brokenness :-)
The good news is that people do actually test it, but if it regresses too
much we will lose our testing audience.  Sadly it's a chicken-and-egg
problem.  Yes, I want to see it leave staging, but I would like the Lustre
client to be in good working order.  If it leaves staging still broken and
no one uses it, then there is not much point.

To understand why I work with both the Intel development branch and the
linux kernel version, I need to explain my test setup.  I test at several
different levels.

At level one I have my generic x86 nodes and x86 server back ends.  Pretty
vanilla, and the scale is pretty small: three server nodes and a couple of
clients.  In that environment I can easily test the upstream client.  This
is not VMs but real hardware and a real storage back end.

Besides the x86 client nodes for level-one testing, I also have Power9 ppc
nodes, where it is also possible for me to test the upstream client
directly.  In case you want to know: no, lustre doesn't work out of the box
on Power9.  The IB stack needs fixing and we need to handle 64K pages at
the LNet layer.  I have work that resolves those issues; the fixes need to
be pushed upstream.

Next I have an ARM client cluster to test with, which runs a newer kernel
that is "official".  When we first got the cluster I stomped all over it,
but discovered I had to share it, and the other parties involved were not
too happy with my experimenting.  The ARM system had the same issues as the
Power9, so with those fixes it now works.  Once the upstream client is in
better shape I think I could ask the other users of the system to try it
out.  I have been told the ARM vendor has a strong interest in the linux
lustre client as well.

Once I'm done with that testing I move to my personal Cray test cluster.
Sadly Cray has unique hardware, so if I use the latest vanilla upstream
kernel that unique hardware no longer works, which makes the test bed
pretty much useless.  Since I have to work with an older kernel, I
back-port the linux kernel client work and run tests to make sure it
works.  On that cluster I can run real HPC workloads.

Lastly, if the work proves stable at that point, I see if I can put it on a
Cray development system that users at my workplace test on.  If the users
scream that things are broken, then I find out how they break it.  General
users are very creative at finding ways to break things that we wouldn't
think of.

>I think we have a better chance of being heard if we have "skin in the
>game" and have upstream code that would use this.

I agree.  I just want a working baseline that users can work with :-)  For
me, landing the rhashtable work is only the beginning.  Its behavior at
very large scales needs to be examined to find any potential bottlenecks.

Which brings me to the next point, since this is cross-posted to the
lustre-devel list.  For me the largest test cluster I can get my hands on
is 80 nodes.  Would anyone be willing to donate test time on their test-bed
clusters to help out with this work?  It would be really awesome to test on
a many-thousand-node system.  I will be at the Lustre User Group conference
next week, so if you want to meet up with me to discuss details, I would
love to help you set up your test cluster so we really can exercise these
changes.

* Re: [PATCH 00/20] staging: lustre: convert to rhashtable
  2018-04-11 21:54 [PATCH 00/20] staging: lustre: convert to rhashtable NeilBrown
                   ` (20 preceding siblings ...)
  2018-04-17  3:35 ` [PATCH 00/20] staging: lustre: convert " James Simmons
@ 2018-04-23 13:08 ` Greg Kroah-Hartman
  21 siblings, 0 replies; 25+ messages in thread
From: Greg Kroah-Hartman @ 2018-04-23 13:08 UTC (permalink / raw)
  To: NeilBrown
  Cc: Oleg Drokin, James Simmons, Andreas Dilger,
	Linux Kernel Mailing List, Lustre Development List

On Thu, Apr 12, 2018 at 07:54:48AM +1000, NeilBrown wrote:
> libcfs in lustre has a resizeable hashtable.
> Linux already has a resizeable hashtable, rhashtable, which is better
> is most metrics. See https://lwn.net/Articles/751374/ in a few days
> for an introduction to rhashtable.
> 
> This series converts lustre to use rhashtable.  This affects several
> different tables, and each is different is various ways.
> 
> There are two outstanding issues.  One is that a bug in rhashtable
> means that we cannot enable auto-shrinking in one of the tables.  That
> is documented as appropriate and should be fixed soon.
> 
> The other is that rhashtable has an atomic_t which counts the elements
> in a hash table.  At least one table in lustre went to some trouble to
> avoid any table-wide atomics, so that could lead to a regression.
> I'm hoping that rhashtable can be enhanced with the option of a
> per-cpu counter, or similar.
> 
> 
> I have enabled automatic shrinking on all tables where it makes sense
> and doesn't trigger the bug.  I have also removed all hints concerning
> min/max size - I cannot see how these could be useful.
> 
> The dump_pgcache debugfs file provided some interesting challenges.  I
> think I have cleaned it up enough so that it all makes sense.  An
> extra pair of eyes examining that code in particular would be
> appreciated.
> 
> This series passes all the same tests that pass before the patches are
> applied.

I've taken the first 4 patches of this series, as they were "obviously
correct".  I'll let you and James argue about the rest.  Feel free to
resend when there's some sort of agreement.

thanks,

greg k-h

end of thread, other threads:[~2018-04-23 13:08 UTC | newest]

Thread overview: 25+ messages
2018-04-11 21:54 [PATCH 00/20] staging: lustre: convert to rhashtable NeilBrown
2018-04-11 21:54 ` [PATCH 03/20] staging: lustre: convert obd uuid hash " NeilBrown
2018-04-11 21:54 ` [PATCH 04/20] staging: lustre: convert osc_quota " NeilBrown
2018-04-11 21:54 ` [PATCH 05/20] staging: lustre: separate buckets from ldlm hash table NeilBrown
2018-04-11 21:54 ` [PATCH 09/20] staging: lustre: convert ldlm_resource hash to rhashtable NeilBrown
2018-04-11 21:54 ` [PATCH 07/20] staging: lustre: ldlm: store name directly in namespace NeilBrown
2018-04-11 21:54 ` [PATCH 10/20] staging: lustre: make struct lu_site_bkt_data private NeilBrown
2018-04-11 21:54 ` [PATCH 02/20] staging: lustre: convert lov_pool to use rhashtable NeilBrown
2018-04-11 21:54 ` [PATCH 12/20] staging: lustre: lu_object: factor out extra per-bucket data NeilBrown
2018-04-11 21:54 ` [PATCH 08/20] staging: lustre: simplify ldlm_ns_hash_defs[] NeilBrown
2018-04-11 21:54 ` [PATCH 01/20] staging: lustre: ptlrpc: convert conn_hash to rhashtable NeilBrown
2018-04-11 21:54 ` [PATCH 06/20] staging: lustre: ldlm: add a counter to the per-namespace data NeilBrown
2018-04-11 21:54 ` [PATCH 11/20] staging: lustre: lu_object: discard extra lru count NeilBrown
2018-04-11 21:54 ` [PATCH 17/20] staging: lustre: use call_rcu() to free lu_object_headers NeilBrown
2018-04-11 21:54 ` [PATCH 15/20] staging: lustre: llite: use more private data in dump_pgcache NeilBrown
2018-04-11 21:54 ` [PATCH 16/20] staging: lustre: llite: remove redundant lookup " NeilBrown
2018-04-11 21:54 ` [PATCH 14/20] staging: lustre: fold lu_object_new() into lu_object_find_at() NeilBrown
2018-04-11 21:54 ` [PATCH 13/20] staging: lustre: lu_object: move retry logic inside htable_lookup NeilBrown
2018-04-11 21:54 ` [PATCH 18/20] staging: lustre: change how "dump_page_cache" walks a hash table NeilBrown
2018-04-11 21:54 ` [PATCH 20/20] staging: lustre: remove cfs_hash resizeable hashtable implementation NeilBrown
2018-04-11 21:54 ` [PATCH 19/20] staging: lustre: convert lu_object cache to rhashtable NeilBrown
2018-04-17  3:35 ` [PATCH 00/20] staging: lustre: convert " James Simmons
2018-04-18  3:17   ` NeilBrown
2018-04-18 21:56     ` [lustre-devel] " Simmons, James A.
2018-04-23 13:08 ` Greg Kroah-Hartman
