* [lustre-devel] [PATCH 00/37] lustre: latest patches landed to OpenSFS 07/14/2020
From: James Simmons @ 2020-07-15 20:44 UTC
  To: lustre-devel

Latest patches landed to the OpenSFS tree as of July 14, 2020.
Please review to make sure they are correct.

Amir Shehata (2):
  lnet: socklnd: fix local interface binding
  lnet: check rtr_nid is a gateway

Andreas Dilger (1):
  lustre: misc: quiet compiler warning on armv7l

Chris Horn (2):
  lnet: Allow router to forward to healthier NID
  lnet: Set remote NI status in lnet_notify

Hongchao Zhang (1):
  lustre: ptlrpc: fix endless loop issue

James Simmons (1):
  lustre: ptlrpc: handle conn_hash rhashtable resize

Mikhail Pershin (1):
  lustre: ptlrpc: re-enterable signal_completed_replay()

Mr NeilBrown (18):
  lustre: osc: fix osc_extent_find()
  lustre: obdclass: remove init to 0 from lustre_init_lsi()
  lustre: lu_object: convert lu_object cache to rhashtable
  lnet: o2iblnd: allocate init_qp_attr on stack.
  lnet: Fix some out-of-date comments.
  lnet: socklnd: don't fall-back to tcp_sendpage.
  lustre: remove some "#ifdef CONFIG*" from .c files.
  lnet: o2iblnd: Use ib_mtu_int_to_enum()
  lnet: o2iblnd: wait properly for fps->increasing.
  lnet: o2iblnd: use need_resched()
  lnet: o2iblnd: Use list_for_each_entry_safe
  lnet: socklnd: use need_resched()
  lnet: socklnd: use list_for_each_entry_safe()
  lnet: socklnd: convert various refcounts to refcount_t
  lnet: libcfs: don't call unshare_fs_struct()
  lustre: llite: annotate non-owner locking
  lnet: remove LNetMEUnlink and clean up related code
  lnet: socklnd: change ksnd_nthreads to atomic_t

Sebastien Buisson (1):
  lustre: sec: better struct sepol_downcall_data

Shaun Tancheff (1):
  lustre: llite: Fix lock ordering in pagevec_dirty

Vladimir Saveliev (1):
  lustre: osc: consume grants for direct I/O

Wang Shilong (7):
  lustre: ldlm: check slv and limit before updating
  lustre: osc: disable ext merging for rdma only pages and non-rdma
  lustre: obdclass: use offset instead of cp_linkage
  lustre: obdclass: re-declare cl_page variables to reduce its size
  lustre: osc: re-declare ops_from/to to shrink osc_page
  lustre: llite: fix to free cl_dio_aio properly
  lustre: llite: fix short io for AIO

Yang Sheng (1):
  lustre: obdclass: ensure LCT_QUIESCENT take sync

 fs/lustre/include/cl_object.h              |  34 +-
 fs/lustre/include/lu_object.h              |  28 +-
 fs/lustre/include/lustre_osc.h             |  12 +-
 fs/lustre/include/obd.h                    |  21 ++
 fs/lustre/ldlm/ldlm_request.c              |   8 +
 fs/lustre/llite/file.c                     |  32 +-
 fs/lustre/llite/llite_internal.h           |  29 ++
 fs/lustre/llite/llite_lib.c                |  54 +--
 fs/lustre/llite/rw26.c                     |  43 ++-
 fs/lustre/llite/vvp_dev.c                  | 105 ++----
 fs/lustre/llite/vvp_internal.h             |   3 +-
 fs/lustre/llite/vvp_io.c                   | 111 +++---
 fs/lustre/lov/lovsub_dev.c                 |   5 +-
 fs/lustre/mdc/mdc_request.c                |   8 +-
 fs/lustre/obdclass/cl_io.c                 |  19 +-
 fs/lustre/obdclass/cl_page.c               | 360 ++++++++++---------
 fs/lustre/obdclass/llog.c                  |   2 -
 fs/lustre/obdclass/lu_object.c             | 539 ++++++++++++++---------------
 fs/lustre/obdclass/lu_tgt_descs.c          |   2 +-
 fs/lustre/obdclass/obd_mount.c             |   6 +-
 fs/lustre/osc/osc_cache.c                  |  64 ++--
 fs/lustre/osc/osc_page.c                   |  21 +-
 fs/lustre/ptlrpc/connection.c              |  12 +-
 fs/lustre/ptlrpc/import.c                  |   6 +-
 fs/lustre/ptlrpc/niobuf.c                  |  12 +-
 fs/lustre/ptlrpc/pinger.c                  |  11 +-
 fs/lustre/ptlrpc/ptlrpcd.c                 |   1 -
 fs/lustre/ptlrpc/sec_lproc.c               | 134 ++++++-
 fs/lustre/ptlrpc/service.c                 |   3 -
 include/linux/lnet/api.h                   |   6 +-
 include/linux/lnet/lib-lnet.h              |   4 +-
 include/uapi/linux/lustre/lustre_user.h    |  16 +-
 net/lnet/klnds/o2iblnd/o2iblnd.c           | 103 ++----
 net/lnet/klnds/o2iblnd/o2iblnd.h           |   2 -
 net/lnet/klnds/o2iblnd/o2iblnd_cb.c        |  12 +-
 net/lnet/klnds/o2iblnd/o2iblnd_modparams.c |   4 +-
 net/lnet/klnds/socklnd/socklnd.c           |  97 +++---
 net/lnet/klnds/socklnd/socklnd.h           |  56 ++-
 net/lnet/klnds/socklnd/socklnd_cb.c        |  26 +-
 net/lnet/klnds/socklnd/socklnd_lib.c       |   8 +-
 net/lnet/lnet/api-ni.c                     |   5 +-
 net/lnet/lnet/lib-md.c                     |  62 ++--
 net/lnet/lnet/lib-me.c                     |  39 ---
 net/lnet/lnet/lib-move.c                   |  57 ++-
 net/lnet/lnet/router.c                     |   6 +-
 net/lnet/selftest/rpc.c                    |   1 -
 46 files changed, 1176 insertions(+), 1013 deletions(-)

-- 
1.8.3.1


* [lustre-devel] [PATCH 01/37] lustre: osc: fix osc_extent_find()
From: James Simmons @ 2020-07-15 20:44 UTC
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.de>

- Fix a pre-existing bug: osc_extent_merge() should never try to
  merge two extents with different ->oe_mppr, as later alignment
  checks can get confused.
- Remove a redundant list_del_init() which is already included in
  __osc_extent_remove().

Fixes: 85ebb57ddc ("lustre: osc: simplify osc_extent_find()")
WC-bug-id: https://jira.whamcloud.com/browse/LU-9679
Lustre-commit: 80e21cce3dd67 ("LU-9679 osc: simplify osc_extent_find()")
Signed-off-by: Mr NeilBrown <neilb@suse.de>
Reviewed-on: https://review.whamcloud.com/37607
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/osc/osc_cache.c | 33 +++++++++++++++------------------
 1 file changed, 15 insertions(+), 18 deletions(-)

diff --git a/fs/lustre/osc/osc_cache.c b/fs/lustre/osc/osc_cache.c
index 5049aaa..474b711 100644
--- a/fs/lustre/osc/osc_cache.c
+++ b/fs/lustre/osc/osc_cache.c
@@ -574,6 +574,14 @@ static int osc_extent_merge(const struct lu_env *env, struct osc_extent *cur,
 	if (cur->oe_max_end != victim->oe_max_end)
 		return -ERANGE;
 
+	/*
+	 * In the rare case max_pages_per_rpc (mppr) is changed, don't
+	 * merge extents until after old ones have been sent, or the
+	 * "extents are aligned to RPCs" checks are unhappy.
+	 */
+	if (cur->oe_mppr != victim->oe_mppr)
+		return -ERANGE;
+
 	LASSERT(cur->oe_dlmlock == victim->oe_dlmlock);
 	ppc_bits = osc_cli(obj)->cl_chunkbits - PAGE_SHIFT;
 	chunk_start = cur->oe_start >> ppc_bits;
@@ -601,7 +609,6 @@ static int osc_extent_merge(const struct lu_env *env, struct osc_extent *cur,
 	cur->oe_urgent |= victim->oe_urgent;
 	cur->oe_memalloc |= victim->oe_memalloc;
 	list_splice_init(&victim->oe_pages, &cur->oe_pages);
-	list_del_init(&victim->oe_link);
 	victim->oe_nr_pages = 0;
 
 	osc_extent_get(victim);
@@ -727,8 +734,7 @@ static struct osc_extent *osc_extent_find(const struct lu_env *env,
 		cur->oe_start = descr->cld_start;
 	if (cur->oe_end > max_end)
 		cur->oe_end = max_end;
-	LASSERT(*grants >= chunksize);
-	cur->oe_grants = chunksize;
+	cur->oe_grants = chunksize + cli->cl_grant_extent_tax;
 	cur->oe_mppr = max_pages;
 	if (olck->ols_dlmlock) {
 		LASSERT(olck->ols_hold);
@@ -800,17 +806,8 @@ static struct osc_extent *osc_extent_find(const struct lu_env *env,
 			 */
 			continue;
 
-		/* it's required that an extent must be contiguous at chunk
-		 * level so that we know the whole extent is covered by grant
-		 * (the pages in the extent are NOT required to be contiguous).
-		 * Otherwise, it will be too much difficult to know which
-		 * chunks have grants allocated.
-		 */
-		/* On success, osc_extent_merge() will put cur,
-		 * so we take an extra reference
-		 */
-		osc_extent_get(cur);
 		if (osc_extent_merge(env, ext, cur) == 0) {
+			LASSERT(*grants >= chunksize);
 			*grants -= chunksize;
 			found = osc_extent_hold(ext);
 
@@ -824,19 +821,19 @@ static struct osc_extent *osc_extent_find(const struct lu_env *env,
 
 			break;
 		}
-		osc_extent_put(env, cur);
 	}
 
 	osc_extent_tree_dump(D_CACHE, obj);
 	if (found) {
 		LASSERT(!conflict);
-		LASSERT(found->oe_dlmlock == cur->oe_dlmlock);
-		OSC_EXTENT_DUMP(D_CACHE, found,
-				"found caching ext for %lu.\n", index);
+		if (!IS_ERR(found)) {
+			LASSERT(found->oe_dlmlock == cur->oe_dlmlock);
+			OSC_EXTENT_DUMP(D_CACHE, found,
+					"found caching ext for %lu.\n", index);
+		}
 	} else if (!conflict) {
 		/* create a new extent */
 		EASSERT(osc_extent_is_overlapped(obj, cur) == 0, cur);
-		cur->oe_grants = chunksize + cli->cl_grant_extent_tax;
 		LASSERT(*grants >= cur->oe_grants);
 		*grants -= cur->oe_grants;
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 02/37] lustre: ldlm: check slv and limit before updating
From: James Simmons @ 2020-07-15 20:44 UTC
  To: lustre-devel

From: Wang Shilong <wshilong@ddn.com>

slv and limit do not change most of the time, but
ldlm_cli_update_pool() can be called for each RPC reply. Taking the
read lock first to check whether anything changed avoids the heavy
write lock in this hot path.
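
A minimal sketch of the pattern, using a hypothetical struct pool in
place of the obd_device fields the patch actually touches:

#include <linux/spinlock.h>
#include <linux/types.h>

struct pool {
	rwlock_t lock;
	u64	 slv;
	u32	 limit;
};

static void pool_update(struct pool *p, u64 new_slv, u32 new_limit)
{
	/* Cheap shared lock first: in the common case nothing has
	 * changed and the exclusive lock is never touched.
	 */
	read_lock(&p->lock);
	if (p->slv == new_slv && p->limit == new_limit) {
		read_unlock(&p->lock);
		return;
	}
	read_unlock(&p->lock);

	/* Something changed; now pay for the exclusive lock. */
	write_lock(&p->lock);
	p->slv = new_slv;
	p->limit = new_limit;
	write_unlock(&p->lock);
}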

WC-bug-id: https://jira.whamcloud.com/browse/LU-13365
Lustre-commit: 3116b9e19dc09 ("LU-13365 ldlm: check slv and limit before updating")
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Reviewed-on: https://review.whamcloud.com/37969
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ldlm/ldlm_request.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/fs/lustre/ldlm/ldlm_request.c b/fs/lustre/ldlm/ldlm_request.c
index e1ba596..6318137 100644
--- a/fs/lustre/ldlm/ldlm_request.c
+++ b/fs/lustre/ldlm/ldlm_request.c
@@ -1163,6 +1163,14 @@ int ldlm_cli_update_pool(struct ptlrpc_request *req)
 	new_slv = lustre_msg_get_slv(req->rq_repmsg);
 	obd = req->rq_import->imp_obd;
 
+	read_lock(&obd->obd_pool_lock);
+	if (obd->obd_pool_slv == new_slv &&
+	    obd->obd_pool_limit == new_limit) {
+		read_unlock(&obd->obd_pool_lock);
+		return 0;
+	}
+	read_unlock(&obd->obd_pool_lock);
+
 	/*
 	 * Set new SLV and limit in OBD fields to make them accessible
 	 * to the pool thread. We do not access obd_namespace and pool
-- 
1.8.3.1


* [lustre-devel] [PATCH 03/37] lustre: sec: better struct sepol_downcall_data
From: James Simmons @ 2020-07-15 20:44 UTC
  To: lustre-devel

From: Sebastien Buisson <sbuisson@ddn.com>

struct sepol_downcall_data is badly formed for several reasons:
- it uses a __kernel_time_t field, which can be variably sized,
  depending on the size of __kernel_long_t. Replace it with a
  fixed-size __s64 type;
- it has a __u32 sdd_magic immediately before a potentially 64-bit
  field, whereas 64-bit fields in a structure should always be
  naturally aligned on 64-bit boundaries to avoid potential
  incompatibility in the structure definition;
- it has a __u16 sdd_sepol_len which may be followed by padding.

So create a better struct sepol_downcall_data, while maintaining
compatibility with 2.12 by keeping a struct sepol_downcall_data_old.
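
A small userspace check, with stand-in types rather than the real
uapi header, makes the alignment problem visible:

#include <stddef.h>
#include <stdio.h>

struct sepol_old {			/* mirrors the deprecated layout */
	unsigned int   sdd_magic;	/* 32-bit */
	long long      sdd_sepol_mtime;	/* 64-bit, right after a 32-bit field */
	unsigned short sdd_sepol_len;	/* may be followed by tail padding */
	char	       sdd_sepol[];
};

struct sepol_new {			/* mirrors the fixed layout */
	unsigned int   sdd_magic;
	unsigned short sdd_sepol_len;
	unsigned short sdd_padding1;	/* padding made explicit */
	long long      sdd_sepol_mtime;	/* naturally aligned at offset 8 */
	char	       sdd_sepol[];
};

int main(void)
{
	/* On x86-64 the old mtime lands at offset 8 (4 hidden padding
	 * bytes); on common 32-bit ABIs it lands at offset 4, so the
	 * two sides of the downcall can disagree about the layout.
	 * The new struct gives offset 8 everywhere.
	 */
	printf("old mtime offset: %zu\n",
	       offsetof(struct sepol_old, sdd_sepol_mtime));
	printf("new mtime offset: %zu\n",
	       offsetof(struct sepol_new, sdd_sepol_mtime));
	return 0;
}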

WC-bug-id: https://jira.whamcloud.com/browse/LU-13525
Lustre-commit: 82b8cb5528f48 ("LU-13525 sec: better struct sepol_downcall_data")
Signed-off-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-on: https://review.whamcloud.com/38580
Reviewed-by: Olaf Faaland-LLNL <faaland1@llnl.gov>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/sec_lproc.c            | 134 ++++++++++++++++++++++++++++----
 include/uapi/linux/lustre/lustre_user.h |  16 +++-
 2 files changed, 135 insertions(+), 15 deletions(-)

diff --git a/fs/lustre/ptlrpc/sec_lproc.c b/fs/lustre/ptlrpc/sec_lproc.c
index 7db7e81..b34ced4 100644
--- a/fs/lustre/ptlrpc/sec_lproc.c
+++ b/fs/lustre/ptlrpc/sec_lproc.c
@@ -131,6 +131,86 @@ static int sptlrpc_ctxs_lprocfs_seq_show(struct seq_file *seq, void *v)
 
 LPROC_SEQ_FOPS_RO(sptlrpc_ctxs_lprocfs);
 
+#if LUSTRE_VERSION_CODE < OBD_OCD_VERSION(2, 16, 53, 0)
+static ssize_t sepol_seq_write_old(struct obd_device *obd,
+				   const char __user *buffer,
+				   size_t count)
+{
+	struct client_obd *cli = &obd->u.cli;
+	struct obd_import *imp = cli->cl_import;
+	struct sepol_downcall_data_old *param;
+	int size = sizeof(*param);
+	u16 len;
+	int rc = 0;
+
+	if (count < size) {
+		rc = -EINVAL;
+		CERROR("%s: invalid data count = %lu, size = %d: rc = %d\n",
+		       obd->obd_name, (unsigned long) count, size, rc);
+		return rc;
+	}
+
+	param = kmalloc(size, GFP_KERNEL);
+	if (!param)
+		return -ENOMEM;
+
+	if (copy_from_user(param, buffer, size)) {
+		rc = -EFAULT;
+		CERROR("%s: bad sepol data: rc = %d\n", obd->obd_name, rc);
+		goto out;
+	}
+
+	if (param->sdd_magic != SEPOL_DOWNCALL_MAGIC_OLD) {
+		rc = -EINVAL;
+		CERROR("%s: sepol downcall bad params: rc = %d\n",
+		       obd->obd_name, rc);
+		goto out;
+	}
+
+	if (param->sdd_sepol_len == 0 ||
+	    param->sdd_sepol_len >= sizeof(imp->imp_sec->ps_sepol)) {
+		rc = -EINVAL;
+		CERROR("%s: invalid sepol data returned: rc = %d\n",
+		       obd->obd_name, rc);
+		goto out;
+	}
+	len = param->sdd_sepol_len; /* save sdd_sepol_len */
+	kfree(param);
+	size = offsetof(struct sepol_downcall_data_old,
+			sdd_sepol[len]);
+
+	if (count < size) {
+		rc = -EINVAL;
+		CERROR("%s: invalid sepol count = %lu, size = %d: rc = %d\n",
+		       obd->obd_name, (unsigned long) count, size, rc);
+		return rc;
+	}
+
+	/* alloc again with real size */
+	param = kmalloc(size, GFP_KERNEL);
+	if (!param)
+		return -ENOMEM;
+
+	if (copy_from_user(param, buffer, size)) {
+		rc = -EFAULT;
+		CERROR("%s: cannot copy sepol data: rc = %d\n",
+		       obd->obd_name, rc);
+		goto out;
+	}
+
+	spin_lock(&imp->imp_sec->ps_lock);
+	snprintf(imp->imp_sec->ps_sepol, param->sdd_sepol_len + 1, "%s",
+		 param->sdd_sepol);
+	imp->imp_sec->ps_sepol_mtime = ktime_set(param->sdd_sepol_mtime, 0);
+	spin_unlock(&imp->imp_sec->ps_lock);
+
+out:
+	kfree(param);
+
+	return rc ? rc : count;
+}
+#endif
+
 static ssize_t
 lprocfs_wr_sptlrpc_sepol(struct file *file, const char __user *buffer,
 			 size_t count, void *data)
@@ -140,13 +220,41 @@ static int sptlrpc_ctxs_lprocfs_seq_show(struct seq_file *seq, void *v)
 	struct client_obd *cli = &obd->u.cli;
 	struct obd_import *imp = cli->cl_import;
 	struct sepol_downcall_data *param;
-	int size = sizeof(*param);
+	u32 magic;
+	int size = sizeof(magic);
+	u16 len;
 	int rc = 0;
 
 	if (count < size) {
-		CERROR("%s: invalid data count = %lu, size = %d\n",
-		       obd->obd_name, (unsigned long) count, size);
-		return -EINVAL;
+		rc = -EINVAL;
+		CERROR("%s: invalid buffer count = %lu, size = %d: rc = %d\n",
+		       obd->obd_name, (unsigned long) count, size, rc);
+		return rc;
+	}
+
+	if (copy_from_user(&magic, buffer, size)) {
+		rc = -EFAULT;
+		CERROR("%s: bad sepol magic: rc = %d\n", obd->obd_name, rc);
+		return rc;
+	}
+
+	if (magic != SEPOL_DOWNCALL_MAGIC) {
+#if LUSTRE_VERSION_CODE < OBD_OCD_VERSION(2, 16, 53, 0)
+		if (magic == SEPOL_DOWNCALL_MAGIC_OLD)
+			return sepol_seq_write_old(obd, buffer, count);
+#endif
+		rc = -EINVAL;
+		CERROR("%s: sepol downcall bad magic '%#08x': rc = %d\n",
+		       obd->obd_name, magic, rc);
+		return rc;
+	}
+
+	size = sizeof(*param);
+	if (count < size) {
+		rc = -EINVAL;
+		CERROR("%s: invalid data count = %lu, size = %d: rc = %d\n",
+		       obd->obd_name, (unsigned long) count, size, rc);
+		return rc;
 	}
 
 	param = kzalloc(size, GFP_KERNEL);
@@ -154,39 +262,39 @@ static int sptlrpc_ctxs_lprocfs_seq_show(struct seq_file *seq, void *v)
 		return -ENOMEM;
 
 	if (copy_from_user(param, buffer, size)) {
-		CERROR("%s: bad sepol data\n", obd->obd_name);
 		rc = -EFAULT;
+		CERROR("%s: bad sepol data: rc = %d\n", obd->obd_name, rc);
 		goto out;
 	}
 
 	if (param->sdd_magic != SEPOL_DOWNCALL_MAGIC) {
-		CERROR("%s: sepol downcall bad params\n",
-		       obd->obd_name);
 		rc = -EINVAL;
+		CERROR("%s: invalid sepol data returned: rc = %d\n",
+		       obd->obd_name, rc);
 		goto out;
 	}
 
 	if (param->sdd_sepol_len == 0 ||
 	    param->sdd_sepol_len >= sizeof(imp->imp_sec->ps_sepol)) {
-		CERROR("%s: invalid sepol data returned\n",
-		       obd->obd_name);
 		rc = -EINVAL;
+		CERROR("%s: invalid sepol data returned: rc = %d\n",
+		       obd->obd_name, rc);
 		goto out;
 	}
-	rc = param->sdd_sepol_len; /* save sdd_sepol_len */
+	len = param->sdd_sepol_len; /* save sdd_sepol_len */
 	kfree(param);
 	size = offsetof(struct sepol_downcall_data,
-			sdd_sepol[rc]);
+			sdd_sepol[len]);
 
 	/* alloc again with real size */
-	rc = 0;
 	param = kzalloc(size, GFP_KERNEL);
 	if (!param)
 		return -ENOMEM;
 
 	if (copy_from_user(param, buffer, size)) {
-		CERROR("%s: bad sepol data\n", obd->obd_name);
 		rc = -EFAULT;
+		CERROR("%s: cannot copy sepol data: rc = %d\n",
+		       obd->obd_name, rc);
 		goto out;
 	}
 
diff --git a/include/uapi/linux/lustre/lustre_user.h b/include/uapi/linux/lustre/lustre_user.h
index 6a2d5f9..b0301e1 100644
--- a/include/uapi/linux/lustre/lustre_user.h
+++ b/include/uapi/linux/lustre/lustre_user.h
@@ -51,6 +51,7 @@
 #include <linux/types.h>
 #include <linux/unistd.h>
 #include <linux/lustre/lustre_fiemap.h>
+#include <linux/lustre/lustre_ver.h>
 
 #ifndef __KERNEL__
 # define __USE_ISOC99  1
@@ -980,7 +981,6 @@ static inline const char *qtype_name(int qtype)
 }
 
 #define IDENTITY_DOWNCALL_MAGIC 0x6d6dd629
-#define SEPOL_DOWNCALL_MAGIC 0x8b8bb842
 
 /* permission */
 #define N_PERMS_MAX	64
@@ -1002,13 +1002,25 @@ struct identity_downcall_data {
 	__u32			    idd_groups[0];
 };
 
-struct sepol_downcall_data {
+#if LUSTRE_VERSION_CODE < OBD_OCD_VERSION(2, 16, 53, 0)
+/* old interface struct is deprecated in 2.14 */
+#define SEPOL_DOWNCALL_MAGIC_OLD 0x8b8bb842
+struct sepol_downcall_data_old {
 	__u32		sdd_magic;
 	__s64		sdd_sepol_mtime;
 	__u16		sdd_sepol_len;
 	char		sdd_sepol[0];
 };
+#endif
 
+#define SEPOL_DOWNCALL_MAGIC 0x8b8bb843
+struct sepol_downcall_data {
+	__u32		sdd_magic;
+	__u16		sdd_sepol_len;
+	__u16		sdd_padding1;
+	__s64		sdd_sepol_mtime;
+	char		sdd_sepol[0];
+};
 
 /* lustre volatile file support
  * file name header: ".^L^S^T^R:volatile"
-- 
1.8.3.1


* [lustre-devel] [PATCH 04/37] lustre: obdclass: remove init to 0 from lustre_init_lsi()
From: James Simmons @ 2020-07-15 20:44 UTC
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.de>

After allocating a struct with kzalloc, there is no value
in setting a few of the fields to zero.

And as all fields start out zero, lmd_exclude is NULL whenever
lmd_exclude_count is zero, so it is safe to kfree() it
unconditionally.
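
As a rough illustration (hypothetical struct, not the real lmd), the
two properties relied on here are that kzalloc() zeroes the whole
allocation and that kfree(NULL) is a no-op:

#include <linux/slab.h>

struct lmd_like {
	int   exclude_count;
	int  *exclude;		/* NULL until something is stored */
};

static struct lmd_like *lmd_alloc(void)
{
	/* No need to set exclude_count = 0 afterwards. */
	return kzalloc(sizeof(struct lmd_like), GFP_KERNEL);
}

static void lmd_free(struct lmd_like *lmd)
{
	kfree(lmd->exclude);	/* safe even when still NULL */
	kfree(lmd);
}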

WC-bug-id: https://jira.whamcloud.com/browse/LU-9679
Lustre-commit: 513dde601d2e9 ("LU-9679 obdclass: remove init to 0 from lustre_init_lsi()")
Signed-off-by: Mr NeilBrown <neilb@suse.de>
Reviewed-on: https://review.whamcloud.com/39135
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Yang Sheng <ys@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/obd_mount.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/fs/lustre/obdclass/obd_mount.c b/fs/lustre/obdclass/obd_mount.c
index 13e6521..ea5b469 100644
--- a/fs/lustre/obdclass/obd_mount.c
+++ b/fs/lustre/obdclass/obd_mount.c
@@ -515,9 +515,6 @@ struct lustre_sb_info *lustre_init_lsi(struct super_block *sb)
 		return NULL;
 	}
 
-	lsi->lsi_lmd->lmd_exclude_count = 0;
-	lsi->lsi_lmd->lmd_recovery_time_soft = 0;
-	lsi->lsi_lmd->lmd_recovery_time_hard = 0;
 	s2lsi_nocast(sb) = lsi;
 	/* we take 1 extra ref for our setup */
 	atomic_set(&lsi->lsi_mounts, 1);
@@ -544,8 +541,7 @@ static int lustre_free_lsi(struct super_block *sb)
 		kfree(lsi->lsi_lmd->lmd_fileset);
 		kfree(lsi->lsi_lmd->lmd_mgssec);
 		kfree(lsi->lsi_lmd->lmd_opts);
-		if (lsi->lsi_lmd->lmd_exclude_count)
-			kfree(lsi->lsi_lmd->lmd_exclude);
+		kfree(lsi->lsi_lmd->lmd_exclude);
 		kfree(lsi->lsi_lmd->lmd_mgs);
 		kfree(lsi->lsi_lmd->lmd_osd_type);
 		kfree(lsi->lsi_lmd->lmd_params);
-- 
1.8.3.1


* [lustre-devel] [PATCH 05/37] lustre: ptlrpc: handle conn_hash rhashtable resize
From: James Simmons @ 2020-07-15 20:44 UTC
  To: lustre-devel

The -ENOMEM or -EBUSY errors returned by
rhashtable_lookup_get_insert_fast() are due to the hashtable being
resized. This is not fatal, so retry the lookup.
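
Distilled into a hypothetical helper, the retry looks like this (the
patch below open-codes the same loop with a goto):

#include <linux/delay.h>
#include <linux/err.h>
#include <linux/rhashtable.h>

static void *insert_with_retry(struct rhashtable *ht,
			       struct rhash_head *node,
			       const struct rhashtable_params params)
{
	void *old;

	while (1) {
		old = rhashtable_lookup_get_insert_fast(ht, node, params);
		if (!IS_ERR(old))
			return old;	/* NULL: inserted; else existing entry */
		if (PTR_ERR(old) != -ENOMEM && PTR_ERR(old) != -EBUSY)
			return old;	/* a real failure */
		msleep(20);		/* table is resizing; try again */
	}
}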

Fixes: ac2370ac2b ("staging: lustre: ptlrpc: convert conn_hash to rhashtable")
WC-bug-id: https://jira.whamcloud.com/browse/LU-8130
Lustre-commit: 37b29a8f709aa ("LU-8130 ptlrpc: convert conn_hash to rhashtable")
Signed-off-by: James Simmons <jsimmons@infradead.org>
Reviewed-on: https://review.whamcloud.com/33616
Reviewed-by: Shaun Tancheff <shaun.tancheff@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
---
 fs/lustre/ptlrpc/connection.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/lustre/ptlrpc/connection.c b/fs/lustre/ptlrpc/connection.c
index 5466755..a548d99 100644
--- a/fs/lustre/ptlrpc/connection.c
+++ b/fs/lustre/ptlrpc/connection.c
@@ -32,6 +32,8 @@
  */
 
 #define DEBUG_SUBSYSTEM S_RPC
+
+#include <linux/delay.h>
 #include <obd_support.h>
 #include <obd_class.h>
 #include <lustre_net.h>
@@ -103,13 +105,21 @@ struct ptlrpc_connection *
 	 * connection.  The object which exists in the hash will be
 	 * returned, otherwise NULL is returned on success.
 	 */
+try_again:
 	conn2 = rhashtable_lookup_get_insert_fast(&conn_hash, &conn->c_hash,
 						  conn_hash_params);
 	if (conn2) {
 		/* insertion failed */
 		kfree(conn);
-		if (IS_ERR(conn2))
+		if (IS_ERR(conn2)) {
+			/* hash table could be resizing. */
+			if (PTR_ERR(conn2) == -ENOMEM ||
+			    PTR_ERR(conn2) == -EBUSY) {
+				msleep(20);
+				goto try_again;
+			}
 			return NULL;
+		}
 		conn = conn2;
 		ptlrpc_connection_addref(conn);
 	}
-- 
1.8.3.1


* [lustre-devel] [PATCH 06/37] lustre: lu_object: convert lu_object cache to rhashtable
From: James Simmons @ 2020-07-15 20:44 UTC
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.de>

The lu_object cache is a little more complex than the other lustre
hash tables for two reasons.
1/ there is a debugfs file which displays the contents of the cache,
   so we need to use rhashtable_walk in a way that works for seq_file.

2/ There is a (sharded) lru list for objects which are no longer
   referenced, so finding an object needs to consider races with the
   lru as well as with the hash table.

The debugfs file already manages walking the libcfs hash table,
keeping a current position in the private data.  We can fairly
easily convert that to a struct rhashtable_iter.  The debugfs file
actually reports pages, and there are multiple pages per hashtable
object, so as well as the rhashtable_iter we need the current page
index.
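
A minimal sketch of how the walk API maps onto the seq_file
callbacks (hypothetical names; the real code is in vvp_dev.c below):

#include <linux/err.h>
#include <linux/rhashtable.h>

struct demo_priv {
	struct rhashtable_iter iter;
};

/* open: may sleep, must be called without locks held */
static void demo_open(struct demo_priv *p, struct rhashtable *ht)
{
	rhashtable_walk_enter(ht, &p->iter);
}

/* next: start/stop bracket each batch and take/drop the RCU lock */
static void *demo_next(struct demo_priv *p)
{
	void *obj;

	rhashtable_walk_start(&p->iter);
	do {
		/* -EAGAIN just means a resize happened; keep going */
		obj = rhashtable_walk_next(&p->iter);
	} while (IS_ERR(obj) && PTR_ERR(obj) == -EAGAIN);
	rhashtable_walk_stop(&p->iter);

	return obj;			/* NULL at end of table */
}

/* release */
static void demo_release(struct demo_priv *p)
{
	rhashtable_walk_exit(&p->iter);
}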

For the double-locking, the current code uses direct access to the
bucket locks that libcfs_hash provides.  rhashtable doesn't provide
that access - callers must provide their own locking or use rcu
techniques.

The lsb_waitq.lock is still used to manage the lru list, but with
this patch it is no longer nested *inside* the hashtable locks, but
instead is outside.  It is used to protect an object with a refcount
of zero.

When purging old objects from an lru, we first set
LU_OBJECT_HEARD_BANSHEE while holding the lsb_waitq.lock,
then remove all the entries from the hashtable separately.

When removing the last reference from an object, we first take the
lsb_waitq.lock, then decrement the reference and add to the lru list
or discard it setting LU_OBJECT_UNHASHED.

When we find an object in the hashtable with a refcount of zero, we
take the corresponding lsb_waitq.lock and check that neither
LU_OBJECT_HEARD_BANSHEE nor LU_OBJECT_UNHASHED is set.  If neither
is, we can safely increment the refcount.  If either is, the object
is gone.

This way, we only ever manipulate an object with a refcount of zero
while holding the lsb_waitq.lock.
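
Sketched with hypothetical names, the lookup-side rule is: a
refcount may only go from zero to one under the bucket lock, and
only if the object has not started dying:

#include <linux/atomic.h>
#include <linux/bitops.h>
#include <linux/spinlock.h>
#include <linux/types.h>

enum { HEARD_BANSHEE, UNHASHED };

struct hdr { atomic_t ref; unsigned long flags; };
struct bkt { spinlock_t lock; };

static bool hdr_get(struct hdr *h, struct bkt *b)
{
	if (atomic_inc_not_zero(&h->ref))
		return true;		/* fast path: refcount was > 0 */

	spin_lock(&b->lock);
	if (test_bit(HEARD_BANSHEE, &h->flags) ||
	    test_bit(UNHASHED, &h->flags)) {
		spin_unlock(&b->lock);
		return false;		/* object is gone, or going */
	}
	atomic_inc(&h->ref);		/* zero -> one, under bucket lock */
	spin_unlock(&b->lock);
	return true;
}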

As there is nothing to stop us using the resizing capabilities of
rhashtable, the code to try to guess the perfect hash size has been
removed.

Also: the "is_dying" variable in lu_object_put() is racey - the value
could change the moment it is sampled.  It is also not needed as it is
only used to avoid a wakeup, which is not particularly expensive.
In the same code as comment says that 'top' could not be accessed, but
the code then immediately accesses 'top' to calculate 'bkt'.
So move the initialization of 'bkt' to before 'top' becomes unsafe.

Also: Change "wake_up_all()" to "wake_up()".  wake_up_all() is only
relevant when an exclusive wait is used.

Moving from the libcfs hashtable to rhashtable also gives the
benefit of a very large performance boost.

Before patch:

SUMMARY rate: (of 5 iterations)
   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   Directory creation:      12036.610      11091.880      11452.978        318.829
   Directory stat:          25871.734      24232.310      24935.661        574.996
   Directory removal:       12698.769      12239.685      12491.008        149.149
   File creation:           11722.036      11673.961      11692.157         15.966
   File stat:               62304.540      58237.124      60282.003       1479.103
   File read:               24204.811      23889.091      24048.577        110.245
   File removal:             9412.930       9111.828       9217.546        120.894
   Tree creation:            3515.536       3195.627       3442.609        123.792
   Tree removal:              433.917        418.935        428.038          5.545

After patch:

SUMMARY rate: (of 5 iterations)
   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   Directory creation:      11873.308        303.626       9371.860       4539.539
   Directory stat:          31116.512      30190.574      30568.091        335.545
   Directory removal:       13082.121      12645.228      12943.239        157.695
   File creation:           12607.135      12293.319      12466.647        138.307
   File stat:              124419.347     105240.996     116919.977       7847.165
   File read:               39707.270      36295.477      38266.011       1328.857
   File removal:             9614.333       9273.931       9477.299        140.201
   Tree creation:            3572.602       3017.580       3339.547        207.061
   Tree removal:              487.687          0.004        282.188        230.659

WC-bug-id: https://jira.whamcloud.com/browse/LU-8130
Lustre-commit: aff14dbc522e1 ("LU-8130 lu_object: convert lu_object cache to rhashtable")
Signed-off-by: Mr NeilBrown <neilb@suse.de>
Reviewed-on: https://review.whamcloud.com/36707
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Shaun Tancheff <shaun.tancheff@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lu_object.h  |  20 +-
 fs/lustre/llite/vvp_dev.c      | 105 +++------
 fs/lustre/lov/lovsub_dev.c     |   5 +-
 fs/lustre/obdclass/lu_object.c | 481 +++++++++++++++++++----------------------
 4 files changed, 272 insertions(+), 339 deletions(-)

diff --git a/fs/lustre/include/lu_object.h b/fs/lustre/include/lu_object.h
index 2a2f38e..1a6b6e1 100644
--- a/fs/lustre/include/lu_object.h
+++ b/fs/lustre/include/lu_object.h
@@ -36,6 +36,7 @@
 
 #include <stdarg.h>
 #include <linux/percpu_counter.h>
+#include <linux/rhashtable.h>
 #include <linux/libcfs/libcfs.h>
 #include <linux/ctype.h>
 #include <obd_target.h>
@@ -469,11 +470,6 @@ enum lu_object_header_flags {
 	 * initialized yet, the object allocator will initialize it.
 	 */
 	LU_OBJECT_INITED	= 2,
-	/**
-	 * Object is being purged, so mustn't be returned by
-	 * htable_lookup()
-	 */
-	LU_OBJECT_PURGING	= 3,
 };
 
 enum lu_object_header_attr {
@@ -496,6 +492,8 @@ enum lu_object_header_attr {
  * it is created for things like not-yet-existing child created by mkdir or
  * create calls. lu_object_operations::loo_exists() can be used to check
  * whether object is backed by persistent storage entity.
+ * Any object containing this structure which might be placed in an
+ * rhashtable via loh_hash MUST be freed using call_rcu() or kfree_rcu().
  */
 struct lu_object_header {
 	/**
@@ -517,9 +515,9 @@ struct lu_object_header {
 	 */
 	u32			loh_attr;
 	/**
-	 * Linkage into per-site hash table. Protected by lu_site::ls_guard.
+	 * Linkage into per-site hash table.
 	 */
-	struct hlist_node	loh_hash;
+	struct rhash_head	loh_hash;
 	/**
 	 * Linkage into per-site LRU list. Protected by lu_site::ls_guard.
 	 */
@@ -566,7 +564,7 @@ struct lu_site {
 	/**
 	 * objects hash table
 	 */
-	struct cfs_hash	       *ls_obj_hash;
+	struct rhashtable	ls_obj_hash;
 	/*
 	 * buckets for summary data
 	 */
@@ -643,6 +641,8 @@ int lu_object_init(struct lu_object *o,
 void lu_object_fini(struct lu_object *o);
 void lu_object_add_top(struct lu_object_header *h, struct lu_object *o);
 void lu_object_add(struct lu_object *before, struct lu_object *o);
+struct lu_object *lu_object_get_first(struct lu_object_header *h,
+				      struct lu_device *dev);
 
 /**
  * Helpers to initialize and finalize device types.
@@ -697,8 +697,8 @@ static inline int lu_site_purge(const struct lu_env *env, struct lu_site *s,
 	return lu_site_purge_objects(env, s, nr, true);
 }
 
-void lu_site_print(const struct lu_env *env, struct lu_site *s, void *cookie,
-		   lu_printer_t printer);
+void lu_site_print(const struct lu_env *env, struct lu_site *s, atomic_t *ref,
+		   int msg_flags, lu_printer_t printer);
 struct lu_object *lu_object_find_at(const struct lu_env *env,
 				    struct lu_device *dev,
 				    const struct lu_fid *f,
diff --git a/fs/lustre/llite/vvp_dev.c b/fs/lustre/llite/vvp_dev.c
index e1d87f9..aa8b2c5 100644
--- a/fs/lustre/llite/vvp_dev.c
+++ b/fs/lustre/llite/vvp_dev.c
@@ -361,21 +361,13 @@ int cl_sb_fini(struct super_block *sb)
  *
  ****************************************************************************/
 
-struct vvp_pgcache_id {
-	unsigned int		 vpi_bucket;
-	unsigned int		 vpi_depth;
-	u32			 vpi_index;
-
-	unsigned int		 vpi_curdep;
-	struct lu_object_header *vpi_obj;
-};
-
 struct vvp_seq_private {
 	struct ll_sb_info	*vsp_sbi;
 	struct lu_env		*vsp_env;
 	u16			 vsp_refcheck;
 	struct cl_object	*vsp_clob;
-	struct vvp_pgcache_id	 vsp_id;
+	struct rhashtable_iter	 vsp_iter;
+	u32			 vsp_page_index;
 	/*
 	 * prev_pos is the 'pos' of the last object returned
 	 * by ->start of ->next.
@@ -383,81 +375,43 @@ struct vvp_seq_private {
 	loff_t			 vsp_prev_pos;
 };
 
-static int vvp_pgcache_obj_get(struct cfs_hash *hs, struct cfs_hash_bd *bd,
-			       struct hlist_node *hnode, void *data)
-{
-	struct vvp_pgcache_id *id = data;
-	struct lu_object_header *hdr = cfs_hash_object(hs, hnode);
-
-	if (lu_object_is_dying(hdr))
-		return 1;
-
-	if (id->vpi_curdep-- > 0)
-		return 0; /* continue */
-
-	cfs_hash_get(hs, hnode);
-	id->vpi_obj = hdr;
-	return 1;
-}
-
-static struct cl_object *vvp_pgcache_obj(const struct lu_env *env,
-					 struct lu_device *dev,
-					 struct vvp_pgcache_id *id)
-{
-	LASSERT(lu_device_is_cl(dev));
-
-	id->vpi_obj = NULL;
-	id->vpi_curdep = id->vpi_depth;
-
-	cfs_hash_hlist_for_each(dev->ld_site->ls_obj_hash, id->vpi_bucket,
-				vvp_pgcache_obj_get, id);
-	if (id->vpi_obj) {
-		struct lu_object *lu_obj;
-
-		lu_obj = lu_object_locate(id->vpi_obj, dev->ld_type);
-		if (lu_obj) {
-			lu_object_ref_add(lu_obj, "dump", current);
-			return lu2cl(lu_obj);
-		}
-		lu_object_put(env, lu_object_top(id->vpi_obj));
-	}
-	return NULL;
-}
-
 static struct page *vvp_pgcache_current(struct vvp_seq_private *priv)
 {
 	struct lu_device *dev = &priv->vsp_sbi->ll_cl->cd_lu_dev;
+	struct lu_object_header *h;
+	struct page *vmpage = NULL;
 
-	while (1) {
+	rhashtable_walk_start(&priv->vsp_iter);
+	while ((h = rhashtable_walk_next(&priv->vsp_iter)) != NULL) {
 		struct inode *inode;
-		struct page *vmpage;
 		int nr;
 
 		if (!priv->vsp_clob) {
-			struct cl_object *clob;
-
-			while ((clob = vvp_pgcache_obj(priv->vsp_env, dev, &priv->vsp_id)) == NULL &&
-			       ++(priv->vsp_id.vpi_bucket) < CFS_HASH_NHLIST(dev->ld_site->ls_obj_hash))
-				priv->vsp_id.vpi_depth = 0;
-			if (!clob)
-				return NULL;
-			priv->vsp_clob = clob;
-			priv->vsp_id.vpi_index = 0;
+			struct lu_object *lu_obj;
+
+			lu_obj = lu_object_get_first(h, dev);
+			if (!lu_obj)
+				continue;
+
+			priv->vsp_clob = lu2cl(lu_obj);
+			lu_object_ref_add(lu_obj, "dump", current);
+			priv->vsp_page_index = 0;
 		}
 
 		inode = vvp_object_inode(priv->vsp_clob);
 		nr = find_get_pages_contig(inode->i_mapping,
-					   priv->vsp_id.vpi_index, 1, &vmpage);
+					   priv->vsp_page_index, 1, &vmpage);
 		if (nr > 0) {
-			priv->vsp_id.vpi_index = vmpage->index;
-			return vmpage;
+			priv->vsp_page_index = vmpage->index;
+			break;
 		}
 		lu_object_ref_del(&priv->vsp_clob->co_lu, "dump", current);
 		cl_object_put(priv->vsp_env, priv->vsp_clob);
 		priv->vsp_clob = NULL;
-		priv->vsp_id.vpi_index = 0;
-		priv->vsp_id.vpi_depth++;
+		priv->vsp_page_index = 0;
 	}
+	rhashtable_walk_stop(&priv->vsp_iter);
+	return vmpage;
 }
 
 #define seq_page_flag(seq, page, flag, has_flags) do {			\
@@ -521,7 +475,10 @@ static int vvp_pgcache_show(struct seq_file *f, void *v)
 static void vvp_pgcache_rewind(struct vvp_seq_private *priv)
 {
 	if (priv->vsp_prev_pos) {
-		memset(&priv->vsp_id, 0, sizeof(priv->vsp_id));
+		struct lu_site *s = priv->vsp_sbi->ll_cl->cd_lu_dev.ld_site;
+
+		rhashtable_walk_exit(&priv->vsp_iter);
+		rhashtable_walk_enter(&s->ls_obj_hash, &priv->vsp_iter);
 		priv->vsp_prev_pos = 0;
 		if (priv->vsp_clob) {
 			lu_object_ref_del(&priv->vsp_clob->co_lu, "dump",
@@ -534,7 +491,7 @@ static void vvp_pgcache_rewind(struct vvp_seq_private *priv)
 
 static struct page *vvp_pgcache_next_page(struct vvp_seq_private *priv)
 {
-	priv->vsp_id.vpi_index += 1;
+	priv->vsp_page_index += 1;
 	return vvp_pgcache_current(priv);
 }
 
@@ -548,7 +505,7 @@ static void *vvp_pgcache_start(struct seq_file *f, loff_t *pos)
 		/* Return the current item */;
 	} else {
 		WARN_ON(*pos != priv->vsp_prev_pos + 1);
-		priv->vsp_id.vpi_index += 1;
+		priv->vsp_page_index += 1;
 	}
 
 	priv->vsp_prev_pos = *pos;
@@ -580,6 +537,7 @@ static void vvp_pgcache_stop(struct seq_file *f, void *v)
 static int vvp_dump_pgcache_seq_open(struct inode *inode, struct file *filp)
 {
 	struct vvp_seq_private *priv;
+	struct lu_site *s;
 
 	priv = __seq_open_private(filp, &vvp_pgcache_ops, sizeof(*priv));
 	if (!priv)
@@ -588,13 +546,16 @@ static int vvp_dump_pgcache_seq_open(struct inode *inode, struct file *filp)
 	priv->vsp_sbi = inode->i_private;
 	priv->vsp_env = cl_env_get(&priv->vsp_refcheck);
 	priv->vsp_clob = NULL;
-	memset(&priv->vsp_id, 0, sizeof(priv->vsp_id));
 	if (IS_ERR(priv->vsp_env)) {
 		int err = PTR_ERR(priv->vsp_env);
 
 		seq_release_private(inode, filp);
 		return err;
 	}
+
+	s = priv->vsp_sbi->ll_cl->cd_lu_dev.ld_site;
+	rhashtable_walk_enter(&s->ls_obj_hash, &priv->vsp_iter);
+
 	return 0;
 }
 
@@ -607,8 +568,8 @@ static int vvp_dump_pgcache_seq_release(struct inode *inode, struct file *file)
 		lu_object_ref_del(&priv->vsp_clob->co_lu, "dump", current);
 		cl_object_put(priv->vsp_env, priv->vsp_clob);
 	}
-
 	cl_env_put(priv->vsp_env, &priv->vsp_refcheck);
+	rhashtable_walk_exit(&priv->vsp_iter);
 	return seq_release_private(inode, file);
 }
 
diff --git a/fs/lustre/lov/lovsub_dev.c b/fs/lustre/lov/lovsub_dev.c
index 69380fc..0555737 100644
--- a/fs/lustre/lov/lovsub_dev.c
+++ b/fs/lustre/lov/lovsub_dev.c
@@ -88,10 +88,7 @@ static struct lu_device *lovsub_device_free(const struct lu_env *env,
 	struct lovsub_device *lsd = lu2lovsub_dev(d);
 	struct lu_device *next = cl2lu_dev(lsd->acid_next);
 
-	if (atomic_read(&d->ld_ref) && d->ld_site) {
-		LIBCFS_DEBUG_MSG_DATA_DECL(msgdata, D_ERROR, NULL);
-		lu_site_print(env, d->ld_site, &msgdata, lu_cdebug_printer);
-	}
+	lu_site_print(env, d->ld_site, &d->ld_ref, D_ERROR, lu_cdebug_printer);
 	cl_device_fini(lu2cl_dev(d));
 	kfree(lsd);
 	return next;
diff --git a/fs/lustre/obdclass/lu_object.c b/fs/lustre/obdclass/lu_object.c
index ec3f6a3..5cd8231 100644
--- a/fs/lustre/obdclass/lu_object.c
+++ b/fs/lustre/obdclass/lu_object.c
@@ -41,12 +41,11 @@
 
 #define DEBUG_SUBSYSTEM S_CLASS
 
+#include <linux/delay.h>
 #include <linux/module.h>
 #include <linux/processor.h>
 #include <linux/random.h>
 
-/* hash_long() */
-#include <linux/libcfs/libcfs_hash.h>
 #include <obd_class.h>
 #include <obd_support.h>
 #include <lustre_disk.h>
@@ -85,12 +84,10 @@ enum {
 #define LU_CACHE_NR_MAX_ADJUST		512
 #define LU_CACHE_NR_UNLIMITED		-1
 #define LU_CACHE_NR_DEFAULT		LU_CACHE_NR_UNLIMITED
-#define LU_CACHE_NR_LDISKFS_LIMIT	LU_CACHE_NR_UNLIMITED
-#define LU_CACHE_NR_ZFS_LIMIT		256
 
-#define LU_SITE_BITS_MIN		12
-#define LU_SITE_BITS_MAX		24
-#define LU_SITE_BITS_MAX_CL		19
+#define LU_CACHE_NR_MIN			4096
+#define LU_CACHE_NR_MAX			0x80000000UL
+
 /**
  * Max 256 buckets, we don't want too many buckets because:
  * - consume too much memory (currently max 16K)
@@ -111,7 +108,7 @@ enum {
 static void lu_object_free(const struct lu_env *env, struct lu_object *o);
 static u32 ls_stats_read(struct lprocfs_stats *stats, int idx);
 
-static u32 lu_fid_hash(const void *data, u32 seed)
+static u32 lu_fid_hash(const void *data, u32 len, u32 seed)
 {
 	const struct lu_fid *fid = data;
 
@@ -120,9 +117,17 @@ static u32 lu_fid_hash(const void *data, u32 seed)
 	return seed;
 }
 
+static const struct rhashtable_params obj_hash_params = {
+	.key_len	= sizeof(struct lu_fid),
+	.key_offset	= offsetof(struct lu_object_header, loh_fid),
+	.head_offset	= offsetof(struct lu_object_header, loh_hash),
+	.hashfn		= lu_fid_hash,
+	.automatic_shrinking = true,
+};
+
 static inline int lu_bkt_hash(struct lu_site *s, const struct lu_fid *fid)
 {
-	return lu_fid_hash(fid, s->ls_bkt_seed) &
+	return lu_fid_hash(fid, sizeof(*fid), s->ls_bkt_seed) &
 	       (s->ls_bkt_cnt - 1);
 }
 
@@ -147,9 +152,7 @@ void lu_object_put(const struct lu_env *env, struct lu_object *o)
 	struct lu_object_header *top = o->lo_header;
 	struct lu_site *site = o->lo_dev->ld_site;
 	struct lu_object *orig = o;
-	struct cfs_hash_bd bd;
 	const struct lu_fid *fid = lu_object_fid(o);
-	bool is_dying;
 
 	/*
 	 * till we have full fids-on-OST implemented anonymous objects
@@ -157,7 +160,6 @@ void lu_object_put(const struct lu_env *env, struct lu_object *o)
 	 * so we should not remove it from the site.
 	 */
 	if (fid_is_zero(fid)) {
-		LASSERT(!top->loh_hash.next && !top->loh_hash.pprev);
 		LASSERT(list_empty(&top->loh_lru));
 		if (!atomic_dec_and_test(&top->loh_ref))
 			return;
@@ -169,40 +171,45 @@ void lu_object_put(const struct lu_env *env, struct lu_object *o)
 		return;
 	}
 
-	cfs_hash_bd_get(site->ls_obj_hash, &top->loh_fid, &bd);
-
-	is_dying = lu_object_is_dying(top);
-	if (!cfs_hash_bd_dec_and_lock(site->ls_obj_hash, &bd, &top->loh_ref)) {
-		/* at this point the object reference is dropped and lock is
+	bkt = &site->ls_bkts[lu_bkt_hash(site, &top->loh_fid)];
+	if (atomic_add_unless(&top->loh_ref, -1, 1)) {
+still_active:
+		/*
+		 * At this point the object reference is dropped and lock is
 		 * not taken, so lu_object should not be touched because it
-		 * can be freed by concurrent thread. Use local variable for
-		 * check.
+		 * can be freed by concurrent thread.
+		 *
+		 * Somebody may be waiting for this, currently only used for
+		 * cl_object, see cl_object_put_last().
 		 */
-		if (is_dying) {
-			/*
-			 * somebody may be waiting for this, currently only
-			 * used for cl_object, see cl_object_put_last().
-			 */
-			bkt = &site->ls_bkts[lu_bkt_hash(site, &top->loh_fid)];
-			wake_up_all(&bkt->lsb_waitq);
-		}
+		wake_up(&bkt->lsb_waitq);
+
 		return;
 	}
 
+	spin_lock(&bkt->lsb_waitq.lock);
+	if (!atomic_dec_and_test(&top->loh_ref)) {
+		spin_unlock(&bkt->lsb_waitq.lock);
+		goto still_active;
+	}
+
 	/*
-	 * When last reference is released, iterate over object
-	 * layers, and notify them that object is no longer busy.
+	 * Refcount is zero, and cannot be incremented without taking the bkt
+	 * lock, so object is stable.
+	 */
+
+	/*
+	 * When last reference is released, iterate over object layers, and
+	 * notify them that object is no longer busy.
 	 */
 	list_for_each_entry_reverse(o, &top->loh_layers, lo_linkage) {
 		if (o->lo_ops->loo_object_release)
 			o->lo_ops->loo_object_release(env, o);
 	}
 
-	bkt = &site->ls_bkts[lu_bkt_hash(site, &top->loh_fid)];
-	spin_lock(&bkt->lsb_waitq.lock);
-
-	/* don't use local 'is_dying' here because if was taken without lock
-	 * but here we need the latest actual value of it so check lu_object
+	/*
+	 * Don't use local 'is_dying' here because if was taken without lock but
+	 * here we need the latest actual value of it so check lu_object
 	 * directly here.
 	 */
 	if (!lu_object_is_dying(top)) {
@@ -210,26 +217,26 @@ void lu_object_put(const struct lu_env *env, struct lu_object *o)
 		list_add_tail(&top->loh_lru, &bkt->lsb_lru);
 		spin_unlock(&bkt->lsb_waitq.lock);
 		percpu_counter_inc(&site->ls_lru_len_counter);
-		CDEBUG(D_INODE, "Add %p/%p to site lru. hash: %p, bkt: %p\n",
-		       orig, top, site->ls_obj_hash, bkt);
-		cfs_hash_bd_unlock(site->ls_obj_hash, &bd, 1);
+		CDEBUG(D_INODE, "Add %p/%p to site lru. bkt: %p\n",
+		       orig, top, bkt);
 		return;
 	}
 
 	/*
-	 * If object is dying (will not be cached), then removed it
-	 * from hash table (it is already not on the LRU).
+	 * If object is dying (will not be cached) then removed it from hash
+	 * table (it is already not on the LRU).
 	 *
-	 * This is done with hash table lists locked. As the only
-	 * way to acquire first reference to previously unreferenced
-	 * object is through hash-table lookup (lu_object_find())
-	 * which is done under hash-table, no race with concurrent
-	 * object lookup is possible and we can safely destroy object below.
+	 * This is done with bucket lock held. As the only way to acquire first
+	 * reference to previously unreferenced object is through hash-table
+	 * lookup (lu_object_find()) which takes the lock for first reference,
+	 * no race with concurrent object lookup is possible and we can safely
+	 * destroy object below.
 	 */
 	if (!test_and_set_bit(LU_OBJECT_UNHASHED, &top->loh_flags))
-		cfs_hash_bd_del_locked(site->ls_obj_hash, &bd, &top->loh_hash);
+		rhashtable_remove_fast(&site->ls_obj_hash, &top->loh_hash,
+				       obj_hash_params);
+
 	spin_unlock(&bkt->lsb_waitq.lock);
-	cfs_hash_bd_unlock(site->ls_obj_hash, &bd, 1);
 	/* Object was already removed from hash above, can kill it. */
 	lu_object_free(env, orig);
 }
@@ -247,21 +254,19 @@ void lu_object_unhash(const struct lu_env *env, struct lu_object *o)
 	set_bit(LU_OBJECT_HEARD_BANSHEE, &top->loh_flags);
 	if (!test_and_set_bit(LU_OBJECT_UNHASHED, &top->loh_flags)) {
 		struct lu_site *site = o->lo_dev->ld_site;
-		struct cfs_hash *obj_hash = site->ls_obj_hash;
-		struct cfs_hash_bd bd;
+		struct rhashtable *obj_hash = &site->ls_obj_hash;
+		struct lu_site_bkt_data *bkt;
 
-		cfs_hash_bd_get_and_lock(obj_hash, &top->loh_fid, &bd, 1);
+		bkt = &site->ls_bkts[lu_bkt_hash(site, &top->loh_fid)];
+		spin_lock(&bkt->lsb_waitq.lock);
 		if (!list_empty(&top->loh_lru)) {
-			struct lu_site_bkt_data *bkt;
-
-			bkt = &site->ls_bkts[lu_bkt_hash(site, &top->loh_fid)];
-			spin_lock(&bkt->lsb_waitq.lock);
 			list_del_init(&top->loh_lru);
-			spin_unlock(&bkt->lsb_waitq.lock);
 			percpu_counter_dec(&site->ls_lru_len_counter);
 		}
-		cfs_hash_bd_del_locked(obj_hash, &bd, &top->loh_hash);
-		cfs_hash_bd_unlock(obj_hash, &bd, 1);
+		spin_unlock(&bkt->lsb_waitq.lock);
+
+		rhashtable_remove_fast(obj_hash, &top->loh_hash,
+				       obj_hash_params);
 	}
 }
 EXPORT_SYMBOL(lu_object_unhash);
@@ -445,11 +450,9 @@ int lu_site_purge_objects(const struct lu_env *env, struct lu_site *s,
 
 			LINVRNT(lu_bkt_hash(s, &h->loh_fid) == i);
 
-			/* Cannot remove from hash under current spinlock,
-			 * so set flag to stop object from being found
-			 * by htable_lookup().
-			 */
-			set_bit(LU_OBJECT_PURGING, &h->loh_flags);
+			set_bit(LU_OBJECT_UNHASHED, &h->loh_flags);
+			rhashtable_remove_fast(&s->ls_obj_hash, &h->loh_hash,
+					       obj_hash_params);
 			list_move(&h->loh_lru, &dispose);
 			percpu_counter_dec(&s->ls_lru_len_counter);
 			if (did_sth == 0)
@@ -470,7 +473,6 @@ int lu_site_purge_objects(const struct lu_env *env, struct lu_site *s,
 		while ((h = list_first_entry_or_null(&dispose,
 						     struct lu_object_header,
 						     loh_lru)) != NULL) {
-			cfs_hash_del(s->ls_obj_hash, &h->loh_fid, &h->loh_hash);
 			list_del_init(&h->loh_lru);
 			lu_object_free(env, lu_object_top(h));
 			lprocfs_counter_incr(s->ls_stats, LU_SS_LRU_PURGED);
@@ -582,9 +584,9 @@ void lu_object_header_print(const struct lu_env *env, void *cookie,
 	(*printer)(env, cookie, "header@%p[%#lx, %d, " DFID "%s%s%s]",
 		   hdr, hdr->loh_flags, atomic_read(&hdr->loh_ref),
 		   PFID(&hdr->loh_fid),
-		   hlist_unhashed(&hdr->loh_hash) ? "" : " hash",
-		   list_empty((struct list_head *)&hdr->loh_lru) ? \
-		   "" : " lru",
+		   test_bit(LU_OBJECT_UNHASHED,
+			    &hdr->loh_flags) ? "" : " hash",
+		   list_empty(&hdr->loh_lru) ? "" : " lru",
 		   hdr->loh_attr & LOHA_EXISTS ? " exist":"");
 }
 EXPORT_SYMBOL(lu_object_header_print);
@@ -621,54 +623,94 @@ void lu_object_print(const struct lu_env *env, void *cookie,
 EXPORT_SYMBOL(lu_object_print);
 
 /*
- * NOTE: htable_lookup() is called with the relevant
- * hash bucket locked, but might drop and re-acquire the lock.
+ * Limit the lu_object cache to a maximum of lu_cache_nr objects.  Because the
+ * calculation for the number of objects to reclaim is not covered by a lock the
+ * maximum number of objects is capped by LU_CACHE_MAX_ADJUST. This ensures
+ * that many concurrent threads will not accidentally purge the entire cache.
  */
-static struct lu_object *htable_lookup(struct lu_site *s,
-				       struct cfs_hash_bd *bd,
+static void lu_object_limit(const struct lu_env *env,
+			    struct lu_device *dev)
+{
+	u64 size, nr;
+
+	if (lu_cache_nr == LU_CACHE_NR_UNLIMITED)
+		return;
+
+	size = atomic_read(&dev->ld_site->ls_obj_hash.nelems);
+	nr = (u64)lu_cache_nr;
+	if (size <= nr)
+		return;
+
+	lu_site_purge_objects(env, dev->ld_site,
+			      min_t(u64, size - nr, LU_CACHE_NR_MAX_ADJUST),
+			      false);
+}
+
+static struct lu_object *htable_lookup(const struct lu_env *env,
+				       struct lu_device *dev,
+				       struct lu_site_bkt_data *bkt,
 				       const struct lu_fid *f,
-				       u64 *version)
+				       struct lu_object_header *new)
 {
+	struct lu_site *s = dev->ld_site;
 	struct lu_object_header *h;
-	struct hlist_node *hnode;
-	u64 ver = cfs_hash_bd_version_get(bd);
 
-	if (*version == ver)
+try_again:
+	rcu_read_lock();
+	if (new)
+		h = rhashtable_lookup_get_insert_fast(&s->ls_obj_hash,
+						      &new->loh_hash,
+						      obj_hash_params);
+	else
+		h = rhashtable_lookup(&s->ls_obj_hash, f, obj_hash_params);
+	if (IS_ERR_OR_NULL(h)) {
+		/* Not found */
+		if (!new)
+			lprocfs_counter_incr(s->ls_stats, LU_SS_CACHE_MISS);
+		rcu_read_unlock();
+		if (PTR_ERR(h) == -ENOMEM) {
+			msleep(20);
+			goto try_again;
+		}
+		lu_object_limit(env, dev);
+		if (PTR_ERR(h) == -E2BIG)
+			goto try_again;
+
 		return ERR_PTR(-ENOENT);
+	}
 
-	*version = ver;
-	/* cfs_hash_bd_peek_locked is a somehow "internal" function
-	 * of cfs_hash, it doesn't add refcount on object.
-	 */
-	hnode = cfs_hash_bd_peek_locked(s->ls_obj_hash, bd, (void *)f);
-	if (!hnode) {
+	if (atomic_inc_not_zero(&h->loh_ref)) {
+		rcu_read_unlock();
+		return lu_object_top(h);
+	}
+
+	spin_lock(&bkt->lsb_waitq.lock);
+	if (lu_object_is_dying(h) ||
+	    test_bit(LU_OBJECT_UNHASHED, &h->loh_flags)) {
+		spin_unlock(&bkt->lsb_waitq.lock);
+		rcu_read_unlock();
+		if (new) {
+			/*
+			 * Old object might have already been removed, or will
+			 * be soon.  We need to insert our new object, so
+			 * remove the old one just in case it is still there.
+			 */
+			rhashtable_remove_fast(&s->ls_obj_hash, &h->loh_hash,
+					       obj_hash_params);
+			goto try_again;
+		}
 		lprocfs_counter_incr(s->ls_stats, LU_SS_CACHE_MISS);
 		return ERR_PTR(-ENOENT);
 	}
+	/* Now protected by spinlock */
+	rcu_read_unlock();
 
-	h = container_of(hnode, struct lu_object_header, loh_hash);
 	if (!list_empty(&h->loh_lru)) {
-		struct lu_site_bkt_data *bkt;
-
-		bkt = &s->ls_bkts[lu_bkt_hash(s, &h->loh_fid)];
-		spin_lock(&bkt->lsb_waitq.lock);
-		/* Might have just been moved to the dispose list, in which
-		 * case LU_OBJECT_PURGING will be set.  In that case,
-		 * delete it from the hash table immediately.
-		 * When lu_site_purge_objects() tried, it will find it
-		 * isn't there, which is harmless.
-		 */
-		if (test_bit(LU_OBJECT_PURGING, &h->loh_flags)) {
-			spin_unlock(&bkt->lsb_waitq.lock);
-			cfs_hash_bd_del_locked(s->ls_obj_hash, bd, hnode);
-			lprocfs_counter_incr(s->ls_stats, LU_SS_CACHE_MISS);
-			return ERR_PTR(-ENOENT);
-		}
 		list_del_init(&h->loh_lru);
-		spin_unlock(&bkt->lsb_waitq.lock);
 		percpu_counter_dec(&s->ls_lru_len_counter);
 	}
-	cfs_hash_get(s->ls_obj_hash, hnode);
+	atomic_inc(&h->loh_ref);
+	spin_unlock(&bkt->lsb_waitq.lock);
 	lprocfs_counter_incr(s->ls_stats, LU_SS_CACHE_HIT);
 	return lu_object_top(h);
 }
@@ -687,28 +729,37 @@ static struct lu_object *lu_object_find(const struct lu_env *env,
 }
 
 /*
- * Limit the lu_object cache to a maximum of lu_cache_nr objects.  Because
- * the calculation for the number of objects to reclaim is not covered by
- * a lock the maximum number of objects is capped by LU_CACHE_MAX_ADJUST.
- * This ensures that many concurrent threads will not accidentally purge
- * the entire cache.
+ * Get a 'first' reference to an object that was found while looking through the
+ * hash table.
  */
-static void lu_object_limit(const struct lu_env *env, struct lu_device *dev)
+struct lu_object *lu_object_get_first(struct lu_object_header *h,
+				      struct lu_device *dev)
 {
-	u64 size, nr;
+	struct lu_site *s = dev->ld_site;
+	struct lu_object *ret;
 
-	if (lu_cache_nr == LU_CACHE_NR_UNLIMITED)
-		return;
+	if (IS_ERR_OR_NULL(h) || lu_object_is_dying(h))
+		return NULL;
 
-	size = cfs_hash_size_get(dev->ld_site->ls_obj_hash);
-	nr = (u64)lu_cache_nr;
-	if (size <= nr)
-		return;
+	ret = lu_object_locate(h, dev->ld_type);
+	if (!ret)
+		return ret;
 
-	lu_site_purge_objects(env, dev->ld_site,
-			      min_t(u64, size - nr, LU_CACHE_NR_MAX_ADJUST),
-			      false);
+	if (!atomic_inc_not_zero(&h->loh_ref)) {
+		struct lu_site_bkt_data *bkt;
+
+		bkt = &s->ls_bkts[lu_bkt_hash(s, &h->loh_fid)];
+		spin_lock(&bkt->lsb_waitq.lock);
+		if (!lu_object_is_dying(h) &&
+		    !test_bit(LU_OBJECT_UNHASHED, &h->loh_flags))
+			atomic_inc(&h->loh_ref);
+		else
+			ret = NULL;
+		spin_unlock(&bkt->lsb_waitq.lock);
+	}
+	return ret;
 }
+EXPORT_SYMBOL(lu_object_get_first);
 
 /**
  * Core logic of lu_object_find*() functions.
@@ -725,10 +776,8 @@ struct lu_object *lu_object_find_at(const struct lu_env *env,
 	struct lu_object *o;
 	struct lu_object *shadow;
 	struct lu_site *s;
-	struct cfs_hash	*hs;
-	struct cfs_hash_bd bd;
 	struct lu_site_bkt_data *bkt;
-	u64 version = 0;
+	struct rhashtable *hs;
 	int rc;
 
 	/*
@@ -750,16 +799,13 @@ struct lu_object *lu_object_find_at(const struct lu_env *env,
 	 *
 	 */
 	s  = dev->ld_site;
-	hs = s->ls_obj_hash;
+	hs = &s->ls_obj_hash;
 	if (unlikely(OBD_FAIL_PRECHECK(OBD_FAIL_OBD_ZERO_NLINK_RACE)))
 		lu_site_purge(env, s, -1);
 
 	bkt = &s->ls_bkts[lu_bkt_hash(s, f)];
-	cfs_hash_bd_get(hs, f, &bd);
 	if (!(conf && conf->loc_flags & LOC_F_NEW)) {
-		cfs_hash_bd_lock(hs, &bd, 1);
-		o = htable_lookup(s, &bd, f, &version);
-		cfs_hash_bd_unlock(hs, &bd, 1);
+		o = htable_lookup(env, dev, bkt, f, NULL);
 
 		if (!IS_ERR(o)) {
 			if (likely(lu_object_is_inited(o->lo_header)))
@@ -795,29 +841,31 @@ struct lu_object *lu_object_find_at(const struct lu_env *env,
 
 	CFS_RACE_WAIT(OBD_FAIL_OBD_ZERO_NLINK_RACE);
 
-	cfs_hash_bd_lock(hs, &bd, 1);
-
-	if (conf && conf->loc_flags & LOC_F_NEW)
-		shadow = ERR_PTR(-ENOENT);
-	else
-		shadow = htable_lookup(s, &bd, f, &version);
+	if (conf && conf->loc_flags & LOC_F_NEW) {
+		int status = rhashtable_insert_fast(hs, &o->lo_header->loh_hash,
+						    obj_hash_params);
+		if (status)
+			/* Strange error - go the slow way */
+			shadow = htable_lookup(env, dev, bkt, f, o->lo_header);
+		else
+			shadow = ERR_PTR(-ENOENT);
+	} else {
+		shadow = htable_lookup(env, dev, bkt, f, o->lo_header);
+	}
 	if (likely(PTR_ERR(shadow) == -ENOENT)) {
-		cfs_hash_bd_add_locked(hs, &bd, &o->lo_header->loh_hash);
-		cfs_hash_bd_unlock(hs, &bd, 1);
-
 		/*
+		 * The new object has been successfully inserted.
+		 *
 		 * This may result in rather complicated operations, including
 		 * fld queries, inode loading, etc.
 		 */
 		rc = lu_object_start(env, dev, o, conf);
 		if (rc) {
-			set_bit(LU_OBJECT_HEARD_BANSHEE,
-				&o->lo_header->loh_flags);
 			lu_object_put(env, o);
 			return ERR_PTR(rc);
 		}
 
-		wake_up_all(&bkt->lsb_waitq);
+		wake_up(&bkt->lsb_waitq);
 
 		lu_object_limit(env, dev);
 
@@ -825,10 +873,10 @@ struct lu_object *lu_object_find_at(const struct lu_env *env,
 	}
 
 	lprocfs_counter_incr(s->ls_stats, LU_SS_CACHE_RACE);
-	cfs_hash_bd_unlock(hs, &bd, 1);
 	lu_object_free(env, o);
 
 	if (!(conf && conf->loc_flags & LOC_F_NEW) &&
+	    !IS_ERR(shadow) &&
 	    !lu_object_is_inited(shadow->lo_header)) {
 		wait_event_idle(bkt->lsb_waitq,
 				lu_object_is_inited(shadow->lo_header) ||
@@ -906,14 +954,9 @@ struct lu_site_print_arg {
 	lu_printer_t		 lsp_printer;
 };
 
-static int
-lu_site_obj_print(struct cfs_hash *hs, struct cfs_hash_bd *bd,
-		  struct hlist_node *hnode, void *data)
+static void
+lu_site_obj_print(struct lu_object_header *h, struct lu_site_print_arg *arg)
 {
-	struct lu_site_print_arg *arg = (struct lu_site_print_arg *)data;
-	struct lu_object_header *h;
-
-	h = hlist_entry(hnode, struct lu_object_header, loh_hash);
 	if (!list_empty(&h->loh_layers)) {
 		const struct lu_object *o;
 
@@ -924,36 +967,45 @@ struct lu_site_print_arg {
 		lu_object_header_print(arg->lsp_env, arg->lsp_cookie,
 				       arg->lsp_printer, h);
 	}
-	return 0;
 }
 
 /**
  * Print all objects in @s.
  */
-void lu_site_print(const struct lu_env *env, struct lu_site *s, void *cookie,
-		   lu_printer_t printer)
+void lu_site_print(const struct lu_env *env, struct lu_site *s, atomic_t *ref,
+		   int msg_flag, lu_printer_t printer)
 {
 	struct lu_site_print_arg arg = {
 		.lsp_env	= (struct lu_env *)env,
-		.lsp_cookie	= cookie,
 		.lsp_printer	= printer,
 	};
+	struct rhashtable_iter iter;
+	struct lu_object_header *h;
+	LIBCFS_DEBUG_MSG_DATA_DECL(msgdata, msg_flag, NULL);
+
+	if (!s || !atomic_read(ref))
+		return;
 
-	cfs_hash_for_each(s->ls_obj_hash, lu_site_obj_print, &arg);
+	arg.lsp_cookie = (void *)&msgdata;
+
+	rhashtable_walk_enter(&s->ls_obj_hash, &iter);
+	rhashtable_walk_start(&iter);
+	while ((h = rhashtable_walk_next(&iter)) != NULL) {
+		if (IS_ERR(h))
+			continue;
+		lu_site_obj_print(h, &arg);
+	}
+	rhashtable_walk_stop(&iter);
+	rhashtable_walk_exit(&iter);
 }
 EXPORT_SYMBOL(lu_site_print);
 
 /**
  * Return desired hash table order.
  */
-static unsigned long lu_htable_order(struct lu_device *top)
+static void lu_htable_limits(struct lu_device *top)
 {
-	unsigned long bits_max = LU_SITE_BITS_MAX;
 	unsigned long cache_size;
-	unsigned long bits;
-
-	if (!strcmp(top->ld_type->ldt_name, LUSTRE_VVP_NAME))
-		bits_max = LU_SITE_BITS_MAX_CL;
 
 	/*
 	 * Calculate hash table size, assuming that we want reasonable
@@ -979,75 +1031,12 @@ static unsigned long lu_htable_order(struct lu_device *top)
 		lu_cache_percent = LU_CACHE_PERCENT_DEFAULT;
 	}
 	cache_size = cache_size / 100 * lu_cache_percent *
-		(PAGE_SIZE / 1024);
-
-	for (bits = 1; (1 << bits) < cache_size; ++bits)
-		;
-	return clamp_t(typeof(bits), bits, LU_SITE_BITS_MIN, bits_max);
-}
-
-static unsigned int lu_obj_hop_hash(struct cfs_hash *hs,
-				    const void *key, unsigned int mask)
-{
-	struct lu_fid *fid = (struct lu_fid *)key;
-	u32 hash;
+		     (PAGE_SIZE / 1024);
 
-	hash = fid_flatten32(fid);
-	hash += (hash >> 4) + (hash << 12); /* mixing oid and seq */
-	hash = hash_long(hash, hs->hs_bkt_bits);
-
-	/* give me another random factor */
-	hash -= hash_long((unsigned long)hs, fid_oid(fid) % 11 + 3);
-
-	hash <<= hs->hs_cur_bits - hs->hs_bkt_bits;
-	hash |= (fid_seq(fid) + fid_oid(fid)) & (CFS_HASH_NBKT(hs) - 1);
-
-	return hash & mask;
-}
-
-static void *lu_obj_hop_object(struct hlist_node *hnode)
-{
-	return hlist_entry(hnode, struct lu_object_header, loh_hash);
-}
-
-static void *lu_obj_hop_key(struct hlist_node *hnode)
-{
-	struct lu_object_header *h;
-
-	h = hlist_entry(hnode, struct lu_object_header, loh_hash);
-	return &h->loh_fid;
-}
-
-static int lu_obj_hop_keycmp(const void *key, struct hlist_node *hnode)
-{
-	struct lu_object_header *h;
-
-	h = hlist_entry(hnode, struct lu_object_header, loh_hash);
-	return lu_fid_eq(&h->loh_fid, (struct lu_fid *)key);
-}
-
-static void lu_obj_hop_get(struct cfs_hash *hs, struct hlist_node *hnode)
-{
-	struct lu_object_header *h;
-
-	h = hlist_entry(hnode, struct lu_object_header, loh_hash);
-	atomic_inc(&h->loh_ref);
+	lu_cache_nr = clamp_t(typeof(cache_size), cache_size,
+			      LU_CACHE_NR_MIN, LU_CACHE_NR_MAX);
 }
 
-static void lu_obj_hop_put_locked(struct cfs_hash *hs, struct hlist_node *hnode)
-{
-	LBUG(); /* we should never called it */
-}
-
-static struct cfs_hash_ops lu_site_hash_ops = {
-	.hs_hash	= lu_obj_hop_hash,
-	.hs_key		= lu_obj_hop_key,
-	.hs_keycmp	= lu_obj_hop_keycmp,
-	.hs_object	= lu_obj_hop_object,
-	.hs_get		= lu_obj_hop_get,
-	.hs_put_locked	= lu_obj_hop_put_locked,
-};
-
 static void lu_dev_add_linkage(struct lu_site *s, struct lu_device *d)
 {
 	spin_lock(&s->ls_ld_lock);
@@ -1062,35 +1051,19 @@ static void lu_dev_add_linkage(struct lu_site *s, struct lu_device *d)
 int lu_site_init(struct lu_site *s, struct lu_device *top)
 {
 	struct lu_site_bkt_data *bkt;
-	unsigned long bits;
-	unsigned long i;
-	char name[16];
+	unsigned int i;
 	int rc;
 
 	memset(s, 0, sizeof(*s));
 	mutex_init(&s->ls_purge_mutex);
+	lu_htable_limits(top);
 
 	rc = percpu_counter_init(&s->ls_lru_len_counter, 0, GFP_NOFS);
 	if (rc)
 		return -ENOMEM;
 
-	snprintf(name, sizeof(name), "lu_site_%s", top->ld_type->ldt_name);
-	for (bits = lu_htable_order(top); bits >= LU_SITE_BITS_MIN; bits--) {
-		s->ls_obj_hash = cfs_hash_create(name, bits, bits,
-						 bits - LU_SITE_BKT_BITS,
-						 0, 0, 0,
-						 &lu_site_hash_ops,
-						 CFS_HASH_SPIN_BKTLOCK |
-						 CFS_HASH_NO_ITEMREF |
-						 CFS_HASH_DEPTH |
-						 CFS_HASH_ASSERT_EMPTY |
-						 CFS_HASH_COUNTER);
-		if (s->ls_obj_hash)
-			break;
-	}
-
-	if (!s->ls_obj_hash) {
-		CERROR("failed to create lu_site hash with bits: %lu\n", bits);
+	if (rhashtable_init(&s->ls_obj_hash, &obj_hash_params) != 0) {
+		CERROR("failed to create lu_site hash\n");
 		return -ENOMEM;
 	}
 
@@ -1101,8 +1074,7 @@ int lu_site_init(struct lu_site *s, struct lu_device *top)
 	s->ls_bkts = kvmalloc_array(s->ls_bkt_cnt, sizeof(*bkt),
 				    GFP_KERNEL | __GFP_ZERO);
 	if (!s->ls_bkts) {
-		cfs_hash_putref(s->ls_obj_hash);
-		s->ls_obj_hash = NULL;
+		rhashtable_destroy(&s->ls_obj_hash);
 		s->ls_bkts = NULL;
 		return -ENOMEM;
 	}
@@ -1116,9 +1088,8 @@ int lu_site_init(struct lu_site *s, struct lu_device *top)
 	s->ls_stats = lprocfs_alloc_stats(LU_SS_LAST_STAT, 0);
 	if (!s->ls_stats) {
 		kvfree(s->ls_bkts);
-		cfs_hash_putref(s->ls_obj_hash);
-		s->ls_obj_hash = NULL;
 		s->ls_bkts = NULL;
+		rhashtable_destroy(&s->ls_obj_hash);
 		return -ENOMEM;
 	}
 
@@ -1161,13 +1132,12 @@ void lu_site_fini(struct lu_site *s)
 
 	percpu_counter_destroy(&s->ls_lru_len_counter);
 
-	if (s->ls_obj_hash) {
-		cfs_hash_putref(s->ls_obj_hash);
-		s->ls_obj_hash = NULL;
+	if (s->ls_bkts) {
+		rhashtable_destroy(&s->ls_obj_hash);
+		kvfree(s->ls_bkts);
+		s->ls_bkts = NULL;
 	}
 
-	kvfree(s->ls_bkts);
-
 	if (s->ls_top_dev) {
 		s->ls_top_dev->ld_site = NULL;
 		lu_ref_del(&s->ls_top_dev->ld_reference, "site-top", s);
@@ -1323,7 +1293,6 @@ int lu_object_header_init(struct lu_object_header *h)
 {
 	memset(h, 0, sizeof(*h));
 	atomic_set(&h->loh_ref, 1);
-	INIT_HLIST_NODE(&h->loh_hash);
 	INIT_LIST_HEAD(&h->loh_lru);
 	INIT_LIST_HEAD(&h->loh_layers);
 	lu_ref_init(&h->loh_reference);
@@ -1338,7 +1307,6 @@ void lu_object_header_fini(struct lu_object_header *h)
 {
 	LASSERT(list_empty(&h->loh_layers));
 	LASSERT(list_empty(&h->loh_lru));
-	LASSERT(hlist_unhashed(&h->loh_hash));
 	lu_ref_fini(&h->loh_reference);
 }
 EXPORT_SYMBOL(lu_object_header_fini);
@@ -1933,7 +1901,7 @@ struct lu_site_stats {
 static void lu_site_stats_get(const struct lu_site *s,
 			      struct lu_site_stats *stats)
 {
-	int cnt = cfs_hash_size_get(s->ls_obj_hash);
+	int cnt = atomic_read(&s->ls_obj_hash.nelems);
 	/*
 	 * percpu_counter_sum_positive() won't accept a const pointer
 	 * as it does modify the struct by taking a spinlock
@@ -2235,16 +2203,23 @@ static u32 ls_stats_read(struct lprocfs_stats *stats, int idx)
  */
 int lu_site_stats_print(const struct lu_site *s, struct seq_file *m)
 {
+	const struct bucket_table *tbl;
 	struct lu_site_stats stats;
+	unsigned int chains;
 
 	memset(&stats, 0, sizeof(stats));
 	lu_site_stats_get(s, &stats);
 
-	seq_printf(m, "%d/%d %d/%ld %d %d %d %d %d %d %d\n",
+	rcu_read_lock();
+	tbl = rht_dereference_rcu(s->ls_obj_hash.tbl,
+				  &((struct lu_site *)s)->ls_obj_hash);
+	chains = tbl->size;
+	rcu_read_unlock();
+	seq_printf(m, "%d/%d %d/%u %d %d %d %d %d %d %d\n",
 		   stats.lss_busy,
 		   stats.lss_total,
 		   stats.lss_populated,
-		   CFS_HASH_NHLIST(s->ls_obj_hash),
+		   chains,
 		   stats.lss_max_search,
 		   ls_stats_read(s->ls_stats, LU_SS_CREATED),
 		   ls_stats_read(s->ls_stats, LU_SS_CACHE_HIT),
-- 
1.8.3.1

* [lustre-devel] [PATCH 07/37] lustre: osc: disable ext merging for rdma only pages and non-rdma
@ 2020-07-15 20:44 ` James Simmons
From: James Simmons @ 2020-07-15 20:44 UTC (permalink / raw)
  To: lustre-devel

From: Wang Shilong <wshilong@ddn.com>

This patch adds logic to prevent CPU memory pages and RDMA memory
pages from merging into one RPC. Code that sets OBD_BRW_RDMA_ONLY
will be added later, together with the rest of the RDMA-only support.

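For illustration, a minimal user-space model of the rule the patch
enforces; the struct and function names below are illustrative, not
the Lustre ones. Extents may share an RPC only when their RDMA-only
flags agree, so RDMA-only pages and ordinary CPU pages never land in
the same request:

#include <stdbool.h>
#include <stdio.h>

struct extent {
	bool rdma_only;	/* models the oe_is_rdma_only bit */
};

/* two extents are RPC-compatible only when the flags agree */
static bool mergeable(const struct extent *a, const struct extent *b)
{
	return a->rdma_only == b->rdma_only;
}

int main(void)
{
	struct extent cpu = { .rdma_only = false };
	struct extent rdma = { .rdma_only = true };

	printf("cpu+rdma: %d\n", mergeable(&cpu, &rdma));	/* 0 */
	printf("cpu+cpu:  %d\n", mergeable(&cpu, &cpu));	/* 1 */
	return 0;
}
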
WC-bug-id: https://jira.whamcloud.com/browse/LU-13180
Lustre-commit: 9f6c9fa44d6e6 ("LU-13180 osc: disable ext merging for rdma only pages and non-rdma")
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Reviewed-on: https://review.whamcloud.com/37567
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Gu Zheng <gzheng@ddn.com>
Reviewed-by: Yingjin Qian <qian@ddn.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_osc.h | 4 +++-
 fs/lustre/osc/osc_cache.c      | 4 ++++
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/lustre/include/lustre_osc.h b/fs/lustre/include/lustre_osc.h
index 11b7e92..cd08f27 100644
--- a/fs/lustre/include/lustre_osc.h
+++ b/fs/lustre/include/lustre_osc.h
@@ -939,7 +939,9 @@ struct osc_extent {
 	/* Non-delay RPC should be used for this extent. */
 				oe_ndelay:1,
 	/* direct IO pages */
-				oe_dio:1;
+				oe_dio:1,
+	/* this extent consists of RDMA only pages */
+				oe_is_rdma_only;
 	/* how many grants allocated for this extent.
 	 *  Grant allocated for this extent. There is no grant allocated
 	 *  for reading extents and sync write extents.
diff --git a/fs/lustre/osc/osc_cache.c b/fs/lustre/osc/osc_cache.c
index 474b711..f811dadb 100644
--- a/fs/lustre/osc/osc_cache.c
+++ b/fs/lustre/osc/osc_cache.c
@@ -1927,6 +1927,9 @@ static inline unsigned int osc_extent_chunks(const struct osc_extent *ext)
 	if (in_rpc->oe_dio && overlapped(ext, in_rpc))
 		return false;
 
+	if (ext->oe_is_rdma_only != in_rpc->oe_is_rdma_only)
+		return false;
+
 	return true;
 }
 
@@ -2688,6 +2691,7 @@ int osc_queue_sync_pages(const struct lu_env *env, const struct cl_io *io,
 	ext->oe_srvlock = !!(brw_flags & OBD_BRW_SRVLOCK);
 	ext->oe_ndelay = !!(brw_flags & OBD_BRW_NDELAY);
 	ext->oe_dio = !!(brw_flags & OBD_BRW_NOCACHE);
+	ext->oe_is_rdma_only = !!(brw_flags & OBD_BRW_RDMA_ONLY);
 	ext->oe_nr_pages = page_count;
 	ext->oe_mppr = mppr;
 	list_splice_init(list, &ext->oe_pages);
-- 
1.8.3.1

* [lustre-devel] [PATCH 08/37] lnet: socklnd: fix local interface binding
@ 2020-07-15 20:44 ` James Simmons
From: James Simmons @ 2020-07-15 20:44 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

When a node is configured with multiple interfaces in a
Multi-Rail config, socklnd was not using the local interface
requested by LNet. LNet was cycling through all the NIDs in round
robin, but the socklnd module was not binding to the correct
interface, so traffic was sent on only a subset of the interfaces.

The reason is that the route interface number was not being set.
In most cases lnet_connect() is called to create a socket. The
socket is bound to the interface provided and then
ksocknal_create_conn() is called to create the socklnd connection.
ksocknal_create_conn() calls ksocknal_associate_route_conn_locked()
at which point the route's local interface is assigned. However,
this is already too late as the socket has already been created
and bound to a local interface.

Therefore, it's important to assign the route's interface before
calling lnet_connect() to ensure the socket is bound to the correct
local interface.

To address this issue, the route's interface index is initialized
to the NI's interface index when it's added to the peer_ni.

Another bug fixed:
The interface index was not being initialized in the startup
routine.

Note: we strictly assume that there is one interface for each
NI. This is because TCP bonding will be removed from the socklnd,
as it has been deprecated by LNet Multi-Rail.

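A minimal sketch of the underlying constraint in plain BSD-socket
terms, not the LNet/socklnd API (everything below is illustrative):
the local interface must be chosen before the connection is made,
which is why the route's interface has to be known before
lnet_connect() runs:

#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

static int connect_via(const struct sockaddr_in *local,
		       const struct sockaddr_in *peer)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;
	/* bind() must come first: once connect() runs, the kernel
	 * has already picked a source address and interface.
	 */
	if (bind(fd, (const struct sockaddr *)local, sizeof(*local)) < 0 ||
	    connect(fd, (const struct sockaddr *)peer, sizeof(*peer)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
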
WC-bug-id: https://jira.whamcloud.com/browse/LU-13566
Lustre-commit: a7c9aba5eb96d ("LU-13566 socklnd: fix local interface binding")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/38743
Reviewed-by: Neil Brown <neilb@suse.de>
Reviewed-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/klnds/socklnd/socklnd.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/net/lnet/klnds/socklnd/socklnd.c b/net/lnet/klnds/socklnd/socklnd.c
index 444b90b..2b8fd3d 100644
--- a/net/lnet/klnds/socklnd/socklnd.c
+++ b/net/lnet/klnds/socklnd/socklnd.c
@@ -409,12 +409,14 @@ struct ksock_peer_ni *
 {
 	struct ksock_conn *conn;
 	struct ksock_route *route2;
+	struct ksock_net *net = peer_ni->ksnp_ni->ni_data;
 
 	LASSERT(!peer_ni->ksnp_closing);
 	LASSERT(!route->ksnr_peer);
 	LASSERT(!route->ksnr_scheduled);
 	LASSERT(!route->ksnr_connecting);
 	LASSERT(!route->ksnr_connected);
+	LASSERT(net->ksnn_ninterfaces > 0);
 
 	/* LASSERT(unique) */
 	list_for_each_entry(route2, &peer_ni->ksnp_routes, ksnr_list) {
@@ -428,6 +430,11 @@ struct ksock_peer_ni *
 
 	route->ksnr_peer = peer_ni;
 	ksocknal_peer_addref(peer_ni);
+
+	/* set the route's interface to the current net's interface */
+	route->ksnr_myiface = net->ksnn_interfaces[0].ksni_index;
+	net->ksnn_interfaces[0].ksni_nroutes++;
+
 	/* peer_ni's routelist takes over my ref on 'route' */
 	list_add_tail(&route->ksnr_list, &peer_ni->ksnp_routes);
 
@@ -2667,6 +2674,7 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 		net->ksnn_ninterfaces = 1;
 		ni->ni_dev_cpt = ifaces[0].li_cpt;
 		ksi->ksni_ipaddr = ifaces[0].li_ipaddr;
+		ksi->ksni_index = ksocknal_ip2index(ksi->ksni_ipaddr, ni);
 		ksi->ksni_netmask = ifaces[0].li_netmask;
 		strlcpy(ksi->ksni_name, ifaces[0].li_name,
 			sizeof(ksi->ksni_name));
@@ -2706,6 +2714,8 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 				ksi = &net->ksnn_interfaces[j];
 				ni->ni_dev_cpt = ifaces[j].li_cpt;
 				ksi->ksni_ipaddr = ifaces[j].li_ipaddr;
+				ksi->ksni_index =
+					ksocknal_ip2index(ksi->ksni_ipaddr, ni);
 				ksi->ksni_netmask = ifaces[j].li_netmask;
 				strlcpy(ksi->ksni_name, ifaces[j].li_name,
 					sizeof(ksi->ksni_name));
-- 
1.8.3.1

* [lustre-devel] [PATCH 09/37] lnet: o2iblnd: allocate init_qp_attr on stack.
@ 2020-07-15 20:44 ` James Simmons
From: James Simmons @ 2020-07-15 20:44 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.de>

'struct ib_qp_init_attr' is not so large that it cannot be allocated
on the stack.  It is about 100 bytes, various other functions in Linux
allocate it on the stack, and the stack isn't as constrained as it
once was.

So allocate it on the stack instead of using kmalloc() and handling
allocation errors.

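The shape of the change, as a hedged sketch; the struct below is a
stand-in for ib_qp_init_attr, not the real thing. A fixed ~100-byte
object simply becomes a zero-initialized local, and the
allocation-failure path disappears:

/* stand-in: the real struct ib_qp_init_attr is about this size */
struct qp_attr_like {
	char pad[100];
};

static int create_conn_sketch(void)
{
	/* was: attr = kzalloc(sizeof(*attr), ...); if (!attr) goto fail; */
	struct qp_attr_like attr = {};

	/* ... fill attr and pass &attr to the create call ... */
	(void)attr;
	return 0;
}
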
WC-bug-id: https://jira.whamcloud.com/browse/LU-12678
Lustre-commit: 524a5a733ba1c ("LU-12678 o2iblnd: allocate init_qp_attr on stack.")
Signed-off-by: Mr NeilBrown <neilb@suse.de>
Reviewed-on: https://review.whamcloud.com/39122
Reviewed-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/klnds/o2iblnd/o2iblnd.c | 45 +++++++++++++++-------------------------
 1 file changed, 17 insertions(+), 28 deletions(-)

diff --git a/net/lnet/klnds/o2iblnd/o2iblnd.c b/net/lnet/klnds/o2iblnd/o2iblnd.c
index 16edfba..d8fca2a 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd.c
+++ b/net/lnet/klnds/o2iblnd/o2iblnd.c
@@ -699,7 +699,7 @@ struct kib_conn *kiblnd_create_conn(struct kib_peer_ni *peer_ni,
 	rwlock_t *glock = &kiblnd_data.kib_global_lock;
 	struct kib_net *net = peer_ni->ibp_ni->ni_data;
 	struct kib_dev *dev;
-	struct ib_qp_init_attr *init_qp_attr;
+	struct ib_qp_init_attr init_qp_attr = {};
 	struct kib_sched_info *sched;
 	struct ib_cq_init_attr cq_attr = {};
 	struct kib_conn *conn;
@@ -727,18 +727,11 @@ struct kib_conn *kiblnd_create_conn(struct kib_peer_ni *peer_ni,
 	 */
 	cpt = sched->ibs_cpt;
 
-	init_qp_attr = kzalloc_cpt(sizeof(*init_qp_attr), GFP_NOFS, cpt);
-	if (!init_qp_attr) {
-		CERROR("Can't allocate qp_attr for %s\n",
-		       libcfs_nid2str(peer_ni->ibp_nid));
-		goto failed_0;
-	}
-
 	conn = kzalloc_cpt(sizeof(*conn), GFP_NOFS, cpt);
 	if (!conn) {
 		CERROR("Can't allocate connection for %s\n",
 		       libcfs_nid2str(peer_ni->ibp_nid));
-		goto failed_1;
+		goto failed_0;
 	}
 
 	conn->ibc_state = IBLND_CONN_INIT;
@@ -819,27 +812,27 @@ struct kib_conn *kiblnd_create_conn(struct kib_peer_ni *peer_ni,
 		goto failed_2;
 	}
 
-	init_qp_attr->event_handler = kiblnd_qp_event;
-	init_qp_attr->qp_context = conn;
-	init_qp_attr->cap.max_send_sge = *kiblnd_tunables.kib_wrq_sge;
-	init_qp_attr->cap.max_recv_sge = 1;
-	init_qp_attr->sq_sig_type = IB_SIGNAL_REQ_WR;
-	init_qp_attr->qp_type = IB_QPT_RC;
-	init_qp_attr->send_cq = cq;
-	init_qp_attr->recv_cq = cq;
+	init_qp_attr.event_handler = kiblnd_qp_event;
+	init_qp_attr.qp_context = conn;
+	init_qp_attr.cap.max_send_sge = *kiblnd_tunables.kib_wrq_sge;
+	init_qp_attr.cap.max_recv_sge = 1;
+	init_qp_attr.sq_sig_type = IB_SIGNAL_REQ_WR;
+	init_qp_attr.qp_type = IB_QPT_RC;
+	init_qp_attr.send_cq = cq;
+	init_qp_attr.recv_cq = cq;
 	/* kiblnd_send_wrs() can change the connection's queue depth if
 	 * the maximum work requests for the device is maxed out
 	 */
-	init_qp_attr->cap.max_send_wr = kiblnd_send_wrs(conn);
-	init_qp_attr->cap.max_recv_wr = IBLND_RECV_WRS(conn);
+	init_qp_attr.cap.max_send_wr = kiblnd_send_wrs(conn);
+	init_qp_attr.cap.max_recv_wr = IBLND_RECV_WRS(conn);
 
-	rc = rdma_create_qp(cmid, conn->ibc_hdev->ibh_pd, init_qp_attr);
+	rc = rdma_create_qp(cmid, conn->ibc_hdev->ibh_pd, &init_qp_attr);
 	if (rc) {
 		CERROR("Can't create QP: %d, send_wr: %d, recv_wr: %d, send_sge: %d, recv_sge: %d\n",
-		       rc, init_qp_attr->cap.max_send_wr,
-		       init_qp_attr->cap.max_recv_wr,
-		       init_qp_attr->cap.max_send_sge,
-		       init_qp_attr->cap.max_recv_sge);
+		       rc, init_qp_attr.cap.max_send_wr,
+		       init_qp_attr.cap.max_recv_wr,
+		       init_qp_attr.cap.max_send_sge,
+		       init_qp_attr.cap.max_recv_sge);
 		goto failed_2;
 	}
 
@@ -851,8 +844,6 @@ struct kib_conn *kiblnd_create_conn(struct kib_peer_ni *peer_ni,
 		      peer_ni->ibp_queue_depth,
 		      conn->ibc_queue_depth);
 
-	kfree(init_qp_attr);
-
 	conn->ibc_rxs = kzalloc_cpt(IBLND_RX_MSGS(conn) *
 				    sizeof(*conn->ibc_rxs),
 				    GFP_NOFS, cpt);
@@ -918,8 +909,6 @@ struct kib_conn *kiblnd_create_conn(struct kib_peer_ni *peer_ni,
 failed_2:
 	kiblnd_destroy_conn(conn);
 	kfree(conn);
-failed_1:
-	kfree(init_qp_attr);
 failed_0:
 	return NULL;
 }
-- 
1.8.3.1

* [lustre-devel] [PATCH 10/37] lnet: Fix some out-of-date comments.
@ 2020-07-15 20:44 ` James Simmons
From: James Simmons @ 2020-07-15 20:44 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.de>

The structures these comments describe have changed or been removed,
but the comments weren't updated at the time.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12678
Lustre-commit: 617ad3af720a3 ("LU-12678 lnet: Fix some out-of-date comments.")
Signed-off-by: Mr NeilBrown <neilb@suse.de>
Reviewed-on: https://review.whamcloud.com/39127
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/klnds/socklnd/socklnd.h | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/net/lnet/klnds/socklnd/socklnd.h b/net/lnet/klnds/socklnd/socklnd.h
index 7d49fff..0ac3637 100644
--- a/net/lnet/klnds/socklnd/socklnd.h
+++ b/net/lnet/klnds/socklnd/socklnd.h
@@ -255,15 +255,13 @@ struct ksock_nal_data {
 #define SOCKNAL_INIT_DATA	1
 #define SOCKNAL_INIT_ALL	2
 
-/*
- * A packet just assembled for transmission is represented by 1 or more
- * struct iovec fragments (the first frag contains the portals header),
- * followed by 0 or more struct bio_vec fragments.
+/* A packet just assembled for transmission is represented by 1
+ * struct iovec fragment - the portals header -  followed by 0
+ * or more struct bio_vec fragments.
  *
  * On the receive side, initially 1 struct iovec fragment is posted for
  * receive (the header).  Once the header has been received, the payload is
- * received into either struct iovec or struct bio_vec fragments, depending on
- * what the header matched or whether the message needs forwarding.
+ * received into struct bio_vec fragments.
  */
 struct ksock_conn;				/* forward ref */
 struct ksock_route;				/* forward ref */
@@ -296,8 +294,6 @@ struct ksock_tx {				/* transmit packet */
 
 #define KSOCK_NOOP_TX_SIZE (offsetof(struct ksock_tx, tx_payload[0]))
 
-/* network zero copy callback descriptor embedded in struct ksock_tx */
-
 #define SOCKNAL_RX_KSM_HEADER	1 /* reading ksock message header */
 #define SOCKNAL_RX_LNET_HEADER	2 /* reading lnet message header */
 #define SOCKNAL_RX_PARSE	3 /* Calling lnet_parse() */
-- 
1.8.3.1

* [lustre-devel] [PATCH 11/37] lnet: socklnd: don't fall-back to tcp_sendpage.
@ 2020-07-15 20:44 ` James Simmons
From: James Simmons @ 2020-07-15 20:44 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.de>

sk_prot->sendpage is never NULL, so there is no
need for a fallback to tcp_sendpage.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12678
Lustre-commit: 011d760069142 ("LU-12678 socklnd: don't fall-back to tcp_sendpage.")
Signed-off-by: Mr NeilBrown <neilb@suse.de>
Reviewed-on: https://review.whamcloud.com/39134
Reviewed-by: Shaun Tancheff <shaun.tancheff@hpe.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/klnds/socklnd/socklnd_lib.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/net/lnet/klnds/socklnd/socklnd_lib.c b/net/lnet/klnds/socklnd/socklnd_lib.c
index 2adc99c..1d6cd0e 100644
--- a/net/lnet/klnds/socklnd/socklnd_lib.c
+++ b/net/lnet/klnds/socklnd/socklnd_lib.c
@@ -123,12 +123,8 @@
 		    fragsize < tx->tx_resid)
 			msgflg |= MSG_MORE;
 
-		if (sk->sk_prot->sendpage) {
-			rc = sk->sk_prot->sendpage(sk, page,
-						   offset, fragsize, msgflg);
-		} else {
-			rc = tcp_sendpage(sk, page, offset, fragsize, msgflg);
-		}
+		rc = sk->sk_prot->sendpage(sk, page,
+					   offset, fragsize, msgflg);
 	} else {
 		struct msghdr msg = { .msg_flags = MSG_DONTWAIT };
 		int i;
-- 
1.8.3.1

* [lustre-devel] [PATCH 12/37] lustre: ptlrpc: re-enterable signal_completed_replay()
@ 2020-07-15 20:44 ` James Simmons
From: James Simmons @ 2020-07-15 20:44 UTC (permalink / raw)
  To: lustre-devel

From: Mikhail Pershin <mpershin@whamcloud.com>

signal_completed_replay() can race with itself while checking the
imp_replay_inflight counter, so remove the assertion and handle the
race explicitly instead.

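The fix relies on atomic_add_unless() acting as a one-shot gate. A
user-space C11 analogue of that guard (the function name is
illustrative): only the caller that wins the 0 -> 1 increment sends
the ping, and racing callers return quietly:

#include <stdatomic.h>
#include <stdbool.h>

/* analogue of atomic_add_unless(&v, 1, 1): add 1 unless v == 1,
 * returning true only if the add actually happened
 */
static bool claim_replay(atomic_int *inflight)
{
	int old = atomic_load(inflight);

	while (old != 1) {
		if (atomic_compare_exchange_weak(inflight, &old, old + 1))
			return true;	/* we won; proceed with the ping */
	}
	return false;	/* someone else is already signalling */
}
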
Fixes: 8cc7f22847 ("lustre: ptlrpc: limit rate of lock replays")
WC-bug-id: https://jira.whamcloud.com/browse/LU-13600
Lustre-commit: 24451f3790503 ("LU-13600 ptlrpc: re-enterable signal_completed_replay()")
Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/39140
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Hongchao Zhang <hongchao@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/import.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/ptlrpc/import.c b/fs/lustre/ptlrpc/import.c
index 7ec3638..1b62b81 100644
--- a/fs/lustre/ptlrpc/import.c
+++ b/fs/lustre/ptlrpc/import.c
@@ -1407,8 +1407,8 @@ static int signal_completed_replay(struct obd_import *imp)
 	if (unlikely(OBD_FAIL_CHECK(OBD_FAIL_PTLRPC_FINISH_REPLAY)))
 		return 0;
 
-	LASSERT(atomic_read(&imp->imp_replay_inflight) == 0);
-	atomic_inc(&imp->imp_replay_inflight);
+	if (!atomic_add_unless(&imp->imp_replay_inflight, 1, 1))
+		return 0;
 
 	req = ptlrpc_request_alloc_pack(imp, &RQF_OBD_PING, LUSTRE_OBD_VERSION,
 					OBD_PING);
-- 
1.8.3.1

* [lustre-devel] [PATCH 13/37] lustre: obdclass: ensure LCT_QUIESCENT take sync
@ 2020-07-15 20:44 ` James Simmons
From: James Simmons @ 2020-07-15 20:44 UTC (permalink / raw)
  To: lustre-devel

From: Yang Sheng <ys@whamcloud.com>

Add locking in lu_device_init() to ensure the LCT_QUIESCENT
handling is visible to other threads during parallel mounts.
Also add an extra check before clearing the flag to make sure
we don't do so after the device has been started.

(osd_handler.c:7730:osd_device_init0()) ASSERTION( info ) failed:
(osd_handler.c:7730:osd_device_init0()) LBUG
Pid: 28098, comm: mount.lustre 3.10.0-1062.9.1.el7_lustre.x86_64
Call Trace:
 libcfs_call_trace+0x8c/0xc0 [libcfs]
 lbug_with_loc+0x4c/0xa0 [libcfs]
 osd_device_alloc+0x778/0x8f0 [osd_ldiskfs]
 obd_setup+0x129/0x2f0 [obdclass]
 class_setup+0x48f/0x7f0 [obdclass]
 class_process_config+0x190f/0x2830 [obdclass]
 do_lcfg+0x258/0x500 [obdclass]
 lustre_start_simple+0x88/0x210 [obdclass]
 server_fill_super+0xf55/0x1890 [obdclass]
 lustre_fill_super+0x498/0x990 [obdclass]
 mount_nodev+0x4f/0xb0
 lustre_mount+0x18/0x20 [obdclass]
 mount_fs+0x3e/0x1b0
 vfs_kern_mount+0x67/0x110
 do_mount+0x1ef/0xce0
 SyS_mount+0x83/0xd0
 system_call_fastpath+0x25/0x2a
 0xffffffffffffffff
 Kernel panic - not syncing: LBUG

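For illustration, a user-space analogue of the serialization the
patch applies; the pthread/C11 names are stand-ins, not the Lustre
API. The fast path only increments a non-zero count, and the first
device serializes the start under a write lock and re-checks before
starting:

#include <pthread.h>
#include <stdatomic.h>

static atomic_int device_nr;		/* ldt_device_nr stand-in */
static pthread_rwlock_t key_initing = PTHREAD_RWLOCK_INITIALIZER;

static void type_start(void)		/* ldto_start() stand-in */
{
}

static void device_init_sketch(void)
{
	int old = atomic_load(&device_nr);

	/* fast path: atomic_add_unless(&nr, 1, 0), add unless zero */
	while (old != 0 &&
	       !atomic_compare_exchange_weak(&device_nr, &old, old + 1))
		;
	if (old != 0)
		return;		/* not the first device; nothing to start */

	/* slow path: only the first start runs, and it is visible to
	 * every later mount before the count moves off zero
	 */
	pthread_rwlock_wrlock(&key_initing);
	if (atomic_load(&device_nr) == 0)	/* re-check under the lock */
		type_start();
	atomic_fetch_add(&device_nr, 1);
	pthread_rwlock_unlock(&key_initing);
}
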
WC-bug-id: https://jira.whamcloud.com/browse/LU-11814
Lustre-commit: 979f5e1db041d ("LU-11814 obdcalss: ensure LCT_QUIESCENT take sync")
Signed-off-by: Yang Sheng <ys@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/38416
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Wang Shilong <wshilong@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lu_object.h  |  8 +++---
 fs/lustre/obdclass/lu_object.c | 58 ++++++++++++++++++++++++------------------
 2 files changed, 38 insertions(+), 28 deletions(-)

diff --git a/fs/lustre/include/lu_object.h b/fs/lustre/include/lu_object.h
index 1a6b6e1..6c47f43 100644
--- a/fs/lustre/include/lu_object.h
+++ b/fs/lustre/include/lu_object.h
@@ -1151,7 +1151,8 @@ struct lu_context_key {
 void lu_context_key_degister(struct lu_context_key *key);
 void *lu_context_key_get(const struct lu_context *ctx,
 			 const struct lu_context_key *key);
-void lu_context_key_quiesce(struct lu_context_key *key);
+void lu_context_key_quiesce(struct lu_device_type *t,
+			    struct lu_context_key *key);
 void lu_context_key_revive(struct lu_context_key *key);
 
 /*
@@ -1199,7 +1200,7 @@ void *lu_context_key_get(const struct lu_context *ctx,
 #define LU_TYPE_STOP(mod, ...)						\
 	static void mod##_type_stop(struct lu_device_type *t)		\
 	{								\
-		lu_context_key_quiesce_many(__VA_ARGS__, NULL);		\
+		lu_context_key_quiesce_many(t, __VA_ARGS__, NULL);	\
 	}								\
 	struct __##mod##_dummy_type_stop {; }
 
@@ -1223,7 +1224,8 @@ void *lu_context_key_get(const struct lu_context *ctx,
 int lu_context_key_register_many(struct lu_context_key *k, ...);
 void lu_context_key_degister_many(struct lu_context_key *k, ...);
 void lu_context_key_revive_many(struct lu_context_key *k, ...);
-void lu_context_key_quiesce_many(struct lu_context_key *k, ...);
+void lu_context_key_quiesce_many(struct lu_device_type *t,
+				 struct lu_context_key *k, ...);
 
 /*
  * update/clear ctx/ses tags.
diff --git a/fs/lustre/obdclass/lu_object.c b/fs/lustre/obdclass/lu_object.c
index 5cd8231..42bb7a6 100644
--- a/fs/lustre/obdclass/lu_object.c
+++ b/fs/lustre/obdclass/lu_object.c
@@ -1185,14 +1185,25 @@ void lu_device_put(struct lu_device *d)
 }
 EXPORT_SYMBOL(lu_device_put);
 
+enum { /* Maximal number of tld slots. */
+	LU_CONTEXT_KEY_NR = 40
+};
+static struct lu_context_key *lu_keys[LU_CONTEXT_KEY_NR] = { NULL, };
+static DECLARE_RWSEM(lu_key_initing);
+
 /**
  * Initialize device @d of type @t.
  */
 int lu_device_init(struct lu_device *d, struct lu_device_type *t)
 {
-	if (atomic_inc_return(&t->ldt_device_nr) == 1 &&
-	    t->ldt_ops->ldto_start)
-		t->ldt_ops->ldto_start(t);
+	if (atomic_add_unless(&t->ldt_device_nr, 1, 0) == 0) {
+		down_write(&lu_key_initing);
+		if (t->ldt_ops->ldto_start &&
+		    atomic_read(&t->ldt_device_nr) == 0)
+			t->ldt_ops->ldto_start(t);
+		atomic_inc(&t->ldt_device_nr);
+		up_write(&lu_key_initing);
+	}
 
 	memset(d, 0, sizeof(*d));
 	atomic_set(&d->ld_ref, 0);
@@ -1358,17 +1369,6 @@ void lu_stack_fini(const struct lu_env *env, struct lu_device *top)
 	}
 }
 
-enum {
-	/**
-	 * Maximal number of tld slots.
-	 */
-	LU_CONTEXT_KEY_NR = 40
-};
-
-static struct lu_context_key *lu_keys[LU_CONTEXT_KEY_NR] = { NULL, };
-
-static DECLARE_RWSEM(lu_key_initing);
-
 /**
  * Global counter incremented whenever key is registered, unregistered,
  * revived or quiesced. This is used to void unnecessary calls to
@@ -1442,7 +1442,7 @@ void lu_context_key_degister(struct lu_context_key *key)
 	LASSERT(atomic_read(&key->lct_used) >= 1);
 	LINVRNT(0 <= key->lct_index && key->lct_index < ARRAY_SIZE(lu_keys));
 
-	lu_context_key_quiesce(key);
+	lu_context_key_quiesce(NULL, key);
 
 	key_fini(&lu_shrink_env.le_ctx, key->lct_index);
 
@@ -1527,13 +1527,14 @@ void lu_context_key_revive_many(struct lu_context_key *k, ...)
 /**
  * Quiescent a number of keys.
  */
-void lu_context_key_quiesce_many(struct lu_context_key *k, ...)
+void lu_context_key_quiesce_many(struct lu_device_type *t,
+				 struct lu_context_key *k, ...)
 {
 	va_list args;
 
 	va_start(args, k);
 	do {
-		lu_context_key_quiesce(k);
+		lu_context_key_quiesce(t, k);
 		k = va_arg(args, struct lu_context_key*);
 	} while (k);
 	va_end(args);
@@ -1564,18 +1565,22 @@ void *lu_context_key_get(const struct lu_context *ctx,
  * values in "shared" contexts (like service threads), when a module owning
  * the key is about to be unloaded.
  */
-void lu_context_key_quiesce(struct lu_context_key *key)
+void lu_context_key_quiesce(struct lu_device_type *t,
+			    struct lu_context_key *key)
 {
 	struct lu_context *ctx;
 
+	if (key->lct_tags & LCT_QUIESCENT)
+		return;
+	/*
+	 * The write-lock on lu_key_initing will ensure that any
+	 * keys_fill() which didn't see LCT_QUIESCENT will have
+	 * finished before we call key_fini().
+	 */
+	down_write(&lu_key_initing);
 	if (!(key->lct_tags & LCT_QUIESCENT)) {
-		/*
-		 * The write-lock on lu_key_initing will ensure that any
-		 * keys_fill() which didn't see LCT_QUIESCENT will have
-		 * finished before we call key_fini().
-		 */
-		down_write(&lu_key_initing);
-		key->lct_tags |= LCT_QUIESCENT;
+		if (!t || atomic_read(&t->ldt_device_nr) == 0)
+			key->lct_tags |= LCT_QUIESCENT;
 		up_write(&lu_key_initing);
 
 		spin_lock(&lu_context_remembered_guard);
@@ -1584,7 +1589,10 @@ void lu_context_key_quiesce(struct lu_context_key *key)
 			key_fini(ctx, key->lct_index);
 		}
 		spin_unlock(&lu_context_remembered_guard);
+
+		return;
 	}
+	up_write(&lu_key_initing);
 }
 
 void lu_context_key_revive(struct lu_context_key *key)
-- 
1.8.3.1

* [lustre-devel] [PATCH 14/37] lustre: remove some "#ifdef CONFIG*" from .c files.
@ 2020-07-15 20:44 ` James Simmons
From: James Simmons @ 2020-07-15 20:44 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.de>

It is Linux policy to avoid #ifdef in C files where
convenient - .h files are OK.

This patch defines a few inline functions which differ
depending on CONFIG_LUSTRE_FS_POSIX_ACL, and removes
some #ifdefs from .c files.

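The idiom, sketched with illustrative names (CONFIG_EXAMPLE_FEATURE
and everything around it are made up for the example): the header
supplies a real inline when the option is on and an empty stub when
it is off, so .c files call one function unconditionally:

/* in a header file */
struct state {
	void *res;
};

void release_resource(void *res);

#ifdef CONFIG_EXAMPLE_FEATURE
static inline void feature_cleanup(struct state *s)
{
	if (s->res) {
		release_resource(s->res);
		s->res = NULL;
	}
}
#else
static inline void feature_cleanup(struct state *s)
{
}
#endif

/* a .c file then just calls feature_cleanup(s), with no #ifdef */
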
WC-bug-id: https://jira.whamcloud.com/browse/LU-9679
Lustre-commit: f37e26964a34f ("LU-9679 lustre: remove some "#ifdef CONFIG*" from .c files.")
Signed-off-by: Mr NeilBrown <neilb@suse.de>
Reviewed-on: https://review.whamcloud.com/39131
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Jian Yu <yujian@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd.h          | 21 ++++++++++++++++++++
 fs/lustre/llite/llite_internal.h | 29 +++++++++++++++++++++++++++
 fs/lustre/llite/llite_lib.c      | 43 +++++++++-------------------------------
 fs/lustre/mdc/mdc_request.c      |  8 +++-----
 4 files changed, 62 insertions(+), 39 deletions(-)

diff --git a/fs/lustre/include/obd.h b/fs/lustre/include/obd.h
index 438f4ca..ad2b2f4 100644
--- a/fs/lustre/include/obd.h
+++ b/fs/lustre/include/obd.h
@@ -34,6 +34,8 @@
 #ifndef __OBD_H
 #define __OBD_H
 
+#include <linux/fs.h>
+#include <linux/posix_acl.h>
 #include <linux/kobject.h>
 #include <linux/spinlock.h>
 #include <linux/sysfs.h>
@@ -930,6 +932,25 @@ struct lustre_md {
 	struct mdt_remote_perm		*remote_perm;
 };
 
+#ifdef CONFIG_LUSTRE_FS_POSIX_ACL
+static inline void lmd_clear_acl(struct lustre_md *md)
+{
+	if (md->posix_acl) {
+		posix_acl_release(md->posix_acl);
+		md->posix_acl = NULL;
+	}
+}
+
+#define OBD_CONNECT_ACL_FLAGS  \
+	(OBD_CONNECT_ACL | OBD_CONNECT_UMASK | OBD_CONNECT_LARGE_ACL)
+#else
+static inline void lmd_clear_acl(struct lustre_md *md)
+{
+}
+
+#define OBD_CONNECT_ACL_FLAGS  (0)
+#endif
+
 struct md_open_data {
 	struct obd_client_handle	*mod_och;
 	struct ptlrpc_request		*mod_open_req;
diff --git a/fs/lustre/llite/llite_internal.h b/fs/lustre/llite/llite_internal.h
index 2556dd8..31c528f 100644
--- a/fs/lustre/llite/llite_internal.h
+++ b/fs/lustre/llite/llite_internal.h
@@ -350,6 +350,35 @@ static inline void trunc_sem_up_write(struct ll_trunc_sem *sem)
 	wake_up_var(&sem->ll_trunc_readers);
 }
 
+#ifdef CONFIG_LUSTRE_FS_POSIX_ACL
+static inline void lli_clear_acl(struct ll_inode_info *lli)
+{
+	if (lli->lli_posix_acl) {
+		posix_acl_release(lli->lli_posix_acl);
+		lli->lli_posix_acl = NULL;
+	}
+}
+
+static inline void lli_replace_acl(struct ll_inode_info *lli,
+				   struct lustre_md *md)
+{
+	spin_lock(&lli->lli_lock);
+	if (lli->lli_posix_acl)
+		posix_acl_release(lli->lli_posix_acl);
+	lli->lli_posix_acl = md->posix_acl;
+	spin_unlock(&lli->lli_lock);
+}
+#else
+static inline void lli_clear_acl(struct ll_inode_info *lli)
+{
+}
+
+static inline void lli_replace_acl(struct ll_inode_info *lli,
+				   struct lustre_md *md)
+{
+}
+#endif
+
 static inline u32 ll_layout_version_get(struct ll_inode_info *lli)
 {
 	u32 gen;
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index 1a7d805..c62e182 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -265,10 +265,7 @@ static int client_common_fill_super(struct super_block *sb, char *md, char *dt)
 
 	if (sbi->ll_flags & LL_SBI_LRU_RESIZE)
 		data->ocd_connect_flags |= OBD_CONNECT_LRU_RESIZE;
-#ifdef CONFIG_LUSTRE_FS_POSIX_ACL
-	data->ocd_connect_flags |= OBD_CONNECT_ACL | OBD_CONNECT_UMASK |
-				   OBD_CONNECT_LARGE_ACL;
-#endif
+	data->ocd_connect_flags |= OBD_CONNECT_ACL_FLAGS;
 
 	data->ocd_cksum_types = obd_cksum_types_supported_client();
 
@@ -618,13 +615,8 @@ static int client_common_fill_super(struct super_block *sb, char *md, char *dt)
 	ptlrpc_req_finished(request);
 
 	if (IS_ERR(root)) {
-#ifdef CONFIG_LUSTRE_FS_POSIX_ACL
-		if (lmd.posix_acl) {
-			posix_acl_release(lmd.posix_acl);
-			lmd.posix_acl = NULL;
-		}
-#endif
-		err = -EBADF;
+		lmd_clear_acl(&lmd);
+		err = IS_ERR(root) ? PTR_ERR(root) : -EBADF;
 		CERROR("lustre_lite: bad iget4 for root\n");
 		goto out_root;
 	}
@@ -1584,13 +1576,7 @@ void ll_clear_inode(struct inode *inode)
 
 	ll_xattr_cache_destroy(inode);
 
-#ifdef CONFIG_LUSTRE_FS_POSIX_ACL
-	forget_all_cached_acls(inode);
-	if (lli->lli_posix_acl) {
-		posix_acl_release(lli->lli_posix_acl);
-		lli->lli_posix_acl = NULL;
-	}
-#endif
+	lli_clear_acl(lli);
 	lli->lli_inode_magic = LLI_INODE_DEAD;
 
 	if (S_ISDIR(inode->i_mode))
@@ -2233,15 +2219,9 @@ int ll_update_inode(struct inode *inode, struct lustre_md *md)
 			return rc;
 	}
 
-#ifdef CONFIG_LUSTRE_FS_POSIX_ACL
-	if (body->mbo_valid & OBD_MD_FLACL) {
-		spin_lock(&lli->lli_lock);
-		if (lli->lli_posix_acl)
-			posix_acl_release(lli->lli_posix_acl);
-		lli->lli_posix_acl = md->posix_acl;
-		spin_unlock(&lli->lli_lock);
-	}
-#endif
+	if (body->mbo_valid & OBD_MD_FLACL)
+		lli_replace_acl(lli, md);
+
 	inode->i_ino = cl_fid_build_ino(&body->mbo_fid1,
 					sbi->ll_flags & LL_SBI_32BIT_API);
 	inode->i_generation = cl_fid_build_gen(&body->mbo_fid1);
@@ -2691,13 +2671,8 @@ int ll_prep_inode(struct inode **inode, struct ptlrpc_request *req,
 						  sbi->ll_flags & LL_SBI_32BIT_API),
 				 &md);
 		if (IS_ERR(*inode)) {
-#ifdef CONFIG_LUSTRE_FS_POSIX_ACL
-			if (md.posix_acl) {
-				posix_acl_release(md.posix_acl);
-				md.posix_acl = NULL;
-			}
-#endif
-			rc = PTR_ERR(*inode);
+			lmd_clear_acl(&md);
+			rc = IS_ERR(*inode) ? PTR_ERR(*inode) : -ENOMEM;
 			CERROR("new_inode -fatal: rc %d\n", rc);
 			goto out;
 		}
diff --git a/fs/lustre/mdc/mdc_request.c b/fs/lustre/mdc/mdc_request.c
index d6d9f43..cacc58b 100644
--- a/fs/lustre/mdc/mdc_request.c
+++ b/fs/lustre/mdc/mdc_request.c
@@ -675,11 +675,9 @@ static int mdc_get_lustre_md(struct obd_export *exp,
 	}
 
 out:
-	if (rc) {
-#ifdef CONFIG_LUSTRE_FS_POSIX_ACL
-		posix_acl_release(md->posix_acl);
-#endif
-	}
+	if (rc)
+		lmd_clear_acl(md);
+
 	return rc;
 }
 
-- 
1.8.3.1

* [lustre-devel] [PATCH 15/37] lustre: obdclass: use offset instead of cp_linkage
@ 2020-07-15 20:44 ` James Simmons
From: James Simmons @ 2020-07-15 20:44 UTC (permalink / raw)
  To: lustre-devel

From: Wang Shilong <wshilong@ddn.com>

Since we have fixed-size cl_page allocations, we can use an
offset array to store each slice's pointer for the cl_page.

With this patch, we reduce the cl_page size from 392 bytes to
336 bytes, which means we can allocate 12 objects where we could
previously allocate 10.

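A minimal user-space model of the new layout (the struct names are
illustrative, not the Lustre ones): the slices are co-allocated right
after the page header and located by byte offset, which replaces a
two-pointer list_head in the page and in every slice with a few
bytes of offsets:

#include <stdio.h>
#include <stdlib.h>

struct page_hdr {
	unsigned char layer_offset[3];	/* byte offsets past the header */
	unsigned char layer_count;
};

struct slice {
	int layer_id;
};

static struct slice *slice_get(struct page_hdr *p, int i)
{
	if (i < 0 || i >= p->layer_count)
		return NULL;
	return (struct slice *)((char *)p + sizeof(*p) +
				p->layer_offset[i]);
}

int main(void)
{
	struct page_hdr *p = calloc(1, sizeof(*p) + 2 * sizeof(struct slice));

	if (!p)
		return 1;
	p->layer_count = 2;
	p->layer_offset[0] = 0;
	p->layer_offset[1] = sizeof(struct slice);
	slice_get(p, 0)->layer_id = 1;
	slice_get(p, 1)->layer_id = 2;
	printf("%d %d\n", slice_get(p, 0)->layer_id,
	       slice_get(p, 1)->layer_id);
	free(p);
	return 0;
}
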
WC-bug-id: https://jira.whamcloud.com/browse/LU-13134
Lustre-commit: 55967f1e5c701 ("LU-13134 obdclass: use offset instead of cp_linkage")
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Reviewed-on: https://review.whamcloud.com/37428
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/cl_object.h |   8 +-
 fs/lustre/obdclass/cl_page.c  | 284 ++++++++++++++++++++++++------------------
 2 files changed, 168 insertions(+), 124 deletions(-)

diff --git a/fs/lustre/include/cl_object.h b/fs/lustre/include/cl_object.h
index a0b9e87..47997f8 100644
--- a/fs/lustre/include/cl_object.h
+++ b/fs/lustre/include/cl_object.h
@@ -737,8 +737,10 @@ struct cl_page {
 	struct page			*cp_vmpage;
 	/** Linkage of pages within group. Pages must be owned */
 	struct list_head		 cp_batch;
-	/** List of slices. Immutable after creation. */
-	struct list_head		 cp_layers;
+	/** array of slices offset. Immutable after creation. */
+	unsigned char			 cp_layer_offset[3];
+	/** current slice index */
+	unsigned char			 cp_layer_count:2;
 	/**
 	 * Page state. This field is const to avoid accidental update, it is
 	 * modified only internally within cl_page.c. Protected by a VM lock.
@@ -781,8 +783,6 @@ struct cl_page_slice {
 	 */
 	struct cl_object		*cpl_obj;
 	const struct cl_page_operations *cpl_ops;
-	/** Linkage into cl_page::cp_layers. Immutable after creation. */
-	struct list_head		 cpl_linkage;
 };
 
 /**
diff --git a/fs/lustre/obdclass/cl_page.c b/fs/lustre/obdclass/cl_page.c
index d5be0c5..cced026 100644
--- a/fs/lustre/obdclass/cl_page.c
+++ b/fs/lustre/obdclass/cl_page.c
@@ -72,22 +72,47 @@ static void cl_page_get_trust(struct cl_page *page)
 	refcount_inc(&page->cp_ref);
 }
 
+static struct cl_page_slice *
+cl_page_slice_get(const struct cl_page *cl_page, int index)
+{
+	if (index < 0 || index >= cl_page->cp_layer_count)
+		return NULL;
+
+	/* To get the cp_layer_offset values fit under 256 bytes, we
+	 * use the offset beyond the end of struct cl_page.
+	 */
+	return (struct cl_page_slice *)((char *)cl_page + sizeof(*cl_page) +
+					cl_page->cp_layer_offset[index]);
+}
+
+#define cl_page_slice_for_each(cl_page, slice, i)		\
+	for (i = 0, slice = cl_page_slice_get(cl_page, 0);	\
+	     i < (cl_page)->cp_layer_count;			\
+	     slice = cl_page_slice_get(cl_page, ++i))
+
+#define cl_page_slice_for_each_reverse(cl_page, slice, i)	\
+	for (i = (cl_page)->cp_layer_count - 1,			\
+	     slice = cl_page_slice_get(cl_page, i); i >= 0;	\
+	     slice = cl_page_slice_get(cl_page, --i))
+
 /**
- * Returns a slice within a page, corresponding to the given layer in the
+ * Returns a slice within a cl_page, corresponding to the given layer in the
  * device stack.
  *
  * \see cl_lock_at()
  */
 static const struct cl_page_slice *
-cl_page_at_trusted(const struct cl_page *page,
+cl_page_at_trusted(const struct cl_page *cl_page,
 		   const struct lu_device_type *dtype)
 {
 	const struct cl_page_slice *slice;
+	int i;
 
-	list_for_each_entry(slice, &page->cp_layers, cpl_linkage) {
+	cl_page_slice_for_each(cl_page, slice, i) {
 		if (slice->cpl_obj->co_lu.lo_dev->ld_type == dtype)
 			return slice;
 	}
+
 	return NULL;
 }
 
@@ -104,28 +129,28 @@ static void __cl_page_free(struct cl_page *cl_page, unsigned short bufsize)
 	}
 }
 
-static void cl_page_free(const struct lu_env *env, struct cl_page *page,
+static void cl_page_free(const struct lu_env *env, struct cl_page *cl_page,
 			 struct pagevec *pvec)
 {
-	struct cl_object *obj = page->cp_obj;
-	struct cl_page_slice *slice;
+	struct cl_object *obj = cl_page->cp_obj;
 	unsigned short bufsize = cl_object_header(obj)->coh_page_bufsize;
+	struct cl_page_slice *slice;
+	int i;
 
-	PASSERT(env, page, list_empty(&page->cp_batch));
-	PASSERT(env, page, !page->cp_owner);
-	PASSERT(env, page, page->cp_state == CPS_FREEING);
+	PASSERT(env, cl_page, list_empty(&cl_page->cp_batch));
+	PASSERT(env, cl_page, !cl_page->cp_owner);
+	PASSERT(env, cl_page, cl_page->cp_state == CPS_FREEING);
 
-	while ((slice = list_first_entry_or_null(&page->cp_layers,
-						 struct cl_page_slice,
-						 cpl_linkage)) != NULL) {
-		list_del_init(page->cp_layers.next);
+	cl_page_slice_for_each(cl_page, slice, i) {
 		if (unlikely(slice->cpl_ops->cpo_fini))
 			slice->cpl_ops->cpo_fini(env, slice, pvec);
 	}
-	lu_object_ref_del_at(&obj->co_lu, &page->cp_obj_ref, "cl_page", page);
+	cl_page->cp_layer_count = 0;
+	lu_object_ref_del_at(&obj->co_lu, &cl_page->cp_obj_ref,
+			     "cl_page", cl_page);
 	cl_object_put(env, obj);
-	lu_ref_fini(&page->cp_reference);
-	__cl_page_free(page, bufsize);
+	lu_ref_fini(&cl_page->cp_reference);
+	__cl_page_free(cl_page, bufsize);
 }
 
 /**
@@ -212,7 +237,6 @@ struct cl_page *cl_page_alloc(const struct lu_env *env,
 		page->cp_vmpage = vmpage;
 		cl_page_state_set_trust(page, CPS_CACHED);
 		page->cp_type = type;
-		INIT_LIST_HEAD(&page->cp_layers);
 		INIT_LIST_HEAD(&page->cp_batch);
 		lu_ref_init(&page->cp_reference);
 		cl_object_for_each(o2, o) {
@@ -455,22 +479,23 @@ static void cl_page_owner_set(struct cl_page *page)
 }
 
 void __cl_page_disown(const struct lu_env *env,
-		     struct cl_io *io, struct cl_page *pg)
+		      struct cl_io *io, struct cl_page *cl_page)
 {
 	const struct cl_page_slice *slice;
 	enum cl_page_state state;
+	int i;
 
-	state = pg->cp_state;
-	cl_page_owner_clear(pg);
+	state = cl_page->cp_state;
+	cl_page_owner_clear(cl_page);
 
 	if (state == CPS_OWNED)
-		cl_page_state_set(env, pg, CPS_CACHED);
+		cl_page_state_set(env, cl_page, CPS_CACHED);
 	/*
 	 * Completion call-backs are executed in the bottom-up order, so that
 	 * uppermost layer (llite), responsible for VFS/VM interaction runs
 	 * last and can release locks safely.
 	 */
-	list_for_each_entry_reverse(slice, &pg->cp_layers, cpl_linkage) {
+	cl_page_slice_for_each_reverse(cl_page, slice, i) {
 		if (slice->cpl_ops->cpo_disown)
 			(*slice->cpl_ops->cpo_disown)(env, slice, io);
 	}
@@ -494,12 +519,12 @@ int cl_page_is_owned(const struct cl_page *pg, const struct cl_io *io)
  * Waits until page is in cl_page_state::CPS_CACHED state, and then switch it
  * into cl_page_state::CPS_OWNED state.
  *
- * \pre  !cl_page_is_owned(pg, io)
- * \post result == 0 iff cl_page_is_owned(pg, io)
+ * \pre  !cl_page_is_owned(cl_page, io)
+ * \post result == 0 iff cl_page_is_owned(cl_page, io)
  *
  * Return:	0 success
  *
- *		-ve failure, e.g., page was destroyed (and landed in
+ *		-ve failure, e.g., cl_page was destroyed (and landed in
  *		cl_page_state::CPS_FREEING instead of
  *		cl_page_state::CPS_CACHED). or, page was owned by
  *		another thread, or in IO.
@@ -510,19 +535,20 @@ int cl_page_is_owned(const struct cl_page *pg, const struct cl_io *io)
  * \see cl_page_own
  */
 static int __cl_page_own(const struct lu_env *env, struct cl_io *io,
-			 struct cl_page *pg, int nonblock)
+			 struct cl_page *cl_page, int nonblock)
 {
 	const struct cl_page_slice *slice;
 	int result = 0;
+	int i;
 
 	io = cl_io_top(io);
 
-	if (pg->cp_state == CPS_FREEING) {
+	if (cl_page->cp_state == CPS_FREEING) {
 		result = -ENOENT;
 		goto out;
 	}
 
-	list_for_each_entry(slice, &pg->cp_layers, cpl_linkage) {
+	cl_page_slice_for_each(cl_page, slice, i) {
 		if (slice->cpl_ops->cpo_own)
 			result = (*slice->cpl_ops->cpo_own)(env, slice,
 							    io, nonblock);
@@ -533,13 +559,13 @@ static int __cl_page_own(const struct lu_env *env, struct cl_io *io,
 		result = 0;
 
 	if (result == 0) {
-		PASSERT(env, pg, !pg->cp_owner);
-		pg->cp_owner = cl_io_top(io);
-		cl_page_owner_set(pg);
-		if (pg->cp_state != CPS_FREEING) {
-			cl_page_state_set(env, pg, CPS_OWNED);
+		PASSERT(env, cl_page, !cl_page->cp_owner);
+		cl_page->cp_owner = cl_io_top(io);
+		cl_page_owner_set(cl_page);
+		if (cl_page->cp_state != CPS_FREEING) {
+			cl_page_state_set(env, cl_page, CPS_OWNED);
 		} else {
-			__cl_page_disown(env, io, pg);
+			__cl_page_disown(env, io, cl_page);
 			result = -ENOENT;
 		}
 	}
@@ -575,51 +601,53 @@ int cl_page_own_try(const struct lu_env *env, struct cl_io *io,
  *
  * Called when page is already locked by the hosting VM.
  *
- * \pre !cl_page_is_owned(pg, io)
- * \post cl_page_is_owned(pg, io)
+ * \pre !cl_page_is_owned(cl_page, io)
+ * \post cl_page_is_owned(cl_page, io)
  *
  * \see cl_page_operations::cpo_assume()
  */
 void cl_page_assume(const struct lu_env *env,
-		    struct cl_io *io, struct cl_page *pg)
+		    struct cl_io *io, struct cl_page *cl_page)
 {
 	const struct cl_page_slice *slice;
+	int i;
 
 	io = cl_io_top(io);
 
-	list_for_each_entry(slice, &pg->cp_layers, cpl_linkage) {
+	cl_page_slice_for_each(cl_page, slice, i) {
 		if (slice->cpl_ops->cpo_assume)
 			(*slice->cpl_ops->cpo_assume)(env, slice, io);
 	}
 
-	PASSERT(env, pg, !pg->cp_owner);
-	pg->cp_owner = cl_io_top(io);
-	cl_page_owner_set(pg);
-	cl_page_state_set(env, pg, CPS_OWNED);
+	PASSERT(env, cl_page, !cl_page->cp_owner);
+	cl_page->cp_owner = cl_io_top(io);
+	cl_page_owner_set(cl_page);
+	cl_page_state_set(env, cl_page, CPS_OWNED);
 }
 EXPORT_SYMBOL(cl_page_assume);
 
 /**
  * Releases page ownership without unlocking the page.
  *
- * Moves page into cl_page_state::CPS_CACHED without releasing a lock on the
- * underlying VM page (as VM is supposed to do this itself).
+ * Moves cl_page into cl_page_state::CPS_CACHED without releasing a lock
+ * on the underlying VM page (as VM is supposed to do this itself).
  *
- * \pre   cl_page_is_owned(pg, io)
- * \post !cl_page_is_owned(pg, io)
+ * \pre   cl_page_is_owned(cl_page, io)
+ * \post !cl_page_is_owned(cl_page, io)
  *
  * \see cl_page_assume()
  */
 void cl_page_unassume(const struct lu_env *env,
-		      struct cl_io *io, struct cl_page *pg)
+		      struct cl_io *io, struct cl_page *cl_page)
 {
 	const struct cl_page_slice *slice;
+	int i;
 
 	io = cl_io_top(io);
-	cl_page_owner_clear(pg);
-	cl_page_state_set(env, pg, CPS_CACHED);
+	cl_page_owner_clear(cl_page);
+	cl_page_state_set(env, cl_page, CPS_CACHED);
 
-	list_for_each_entry_reverse(slice, &pg->cp_layers, cpl_linkage) {
+	cl_page_slice_for_each_reverse(cl_page, slice, i) {
 		if (slice->cpl_ops->cpo_unassume)
 			(*slice->cpl_ops->cpo_unassume)(env, slice, io);
 	}
@@ -646,21 +674,22 @@ void cl_page_disown(const struct lu_env *env,
 EXPORT_SYMBOL(cl_page_disown);
 
 /**
- * Called when page is to be removed from the object, e.g., as a result of
- * truncate.
+ * Called when cl_page is to be removed from the object, e.g.,
+ * as a result of truncate.
  *
  * Calls cl_page_operations::cpo_discard() top-to-bottom.
  *
- * \pre cl_page_is_owned(pg, io)
+ * \pre cl_page_is_owned(cl_page, io)
  *
  * \see cl_page_operations::cpo_discard()
  */
 void cl_page_discard(const struct lu_env *env,
-		     struct cl_io *io, struct cl_page *pg)
+		     struct cl_io *io, struct cl_page *cl_page)
 {
 	const struct cl_page_slice *slice;
+	int i;
 
-	list_for_each_entry(slice, &pg->cp_layers, cpl_linkage) {
+	cl_page_slice_for_each(cl_page, slice, i) {
 		if (slice->cpl_ops->cpo_discard)
 			(*slice->cpl_ops->cpo_discard)(env, slice, io);
 	}
@@ -669,22 +698,24 @@ void cl_page_discard(const struct lu_env *env,
 
 /**
  * Version of cl_page_delete() that can be called for not fully constructed
- * pages, e.g,. in a error handling cl_page_find()->__cl_page_delete()
+ * cl_pages, e.g,. in a error handling cl_page_find()->__cl_page_delete()
  * path. Doesn't check page invariant.
  */
-static void __cl_page_delete(const struct lu_env *env, struct cl_page *pg)
+static void __cl_page_delete(const struct lu_env *env,
+			     struct cl_page *cl_page)
 {
 	const struct cl_page_slice *slice;
+	int i;
 
-	PASSERT(env, pg, pg->cp_state != CPS_FREEING);
+	PASSERT(env, cl_page, cl_page->cp_state != CPS_FREEING);
 
 	/*
-	 * Sever all ways to obtain new pointers to @pg.
+	 * Sever all ways to obtain new pointers to @cl_page.
 	 */
-	cl_page_owner_clear(pg);
-	__cl_page_state_set(env, pg, CPS_FREEING);
+	cl_page_owner_clear(cl_page);
+	__cl_page_state_set(env, cl_page, CPS_FREEING);
 
-	list_for_each_entry_reverse(slice, &pg->cp_layers, cpl_linkage) {
+	cl_page_slice_for_each_reverse(cl_page, slice, i) {
 		if (slice->cpl_ops->cpo_delete)
 			(*slice->cpl_ops->cpo_delete)(env, slice);
 	}
@@ -729,11 +760,13 @@ void cl_page_delete(const struct lu_env *env, struct cl_page *pg)
  *
  * \see cl_page_operations::cpo_export()
  */
-void cl_page_export(const struct lu_env *env, struct cl_page *pg, int uptodate)
+void cl_page_export(const struct lu_env *env, struct cl_page *cl_page,
+		    int uptodate)
 {
 	const struct cl_page_slice *slice;
+	int i;
 
-	list_for_each_entry(slice, &pg->cp_layers, cpl_linkage) {
+	cl_page_slice_for_each(cl_page, slice, i) {
 		if (slice->cpl_ops->cpo_export)
 			(*slice->cpl_ops->cpo_export)(env, slice, uptodate);
 	}
@@ -741,34 +774,36 @@ void cl_page_export(const struct lu_env *env, struct cl_page *pg, int uptodate)
 EXPORT_SYMBOL(cl_page_export);
 
 /**
- * Returns true, if @pg is VM locked in a suitable sense by the calling
+ * Returns true, if @cl_page is VM locked in a suitable sense by the calling
  * thread.
  */
-int cl_page_is_vmlocked(const struct lu_env *env, const struct cl_page *pg)
+int cl_page_is_vmlocked(const struct lu_env *env,
+			const struct cl_page *cl_page)
 {
 	const struct cl_page_slice *slice;
 	int result;
 
-	slice = list_first_entry(&pg->cp_layers,
-				 const struct cl_page_slice, cpl_linkage);
-	PASSERT(env, pg, slice->cpl_ops->cpo_is_vmlocked);
+	slice = cl_page_slice_get(cl_page, 0);
+	PASSERT(env, cl_page, slice->cpl_ops->cpo_is_vmlocked);
 	/*
 	 * Call ->cpo_is_vmlocked() directly instead of going through
 	 * CL_PAGE_INVOKE(), because cl_page_is_vmlocked() is used by
 	 * cl_page_invariant().
 	 */
 	result = slice->cpl_ops->cpo_is_vmlocked(env, slice);
-	PASSERT(env, pg, result == -EBUSY || result == -ENODATA);
+	PASSERT(env, cl_page, result == -EBUSY || result == -ENODATA);
+
 	return result == -EBUSY;
 }
 EXPORT_SYMBOL(cl_page_is_vmlocked);
 
-void cl_page_touch(const struct lu_env *env, const struct cl_page *pg,
-		  size_t to)
+void cl_page_touch(const struct lu_env *env,
+		   const struct cl_page *cl_page, size_t to)
 {
 	const struct cl_page_slice *slice;
+	int i;
 
-	list_for_each_entry(slice, &pg->cp_layers, cpl_linkage) {
+	cl_page_slice_for_each(cl_page, slice, i) {
 		if (slice->cpl_ops->cpo_page_touch)
 			(*slice->cpl_ops->cpo_page_touch)(env, slice, to);
 	}
@@ -799,20 +834,21 @@ static void cl_page_io_start(const struct lu_env *env,
  * transfer now.
  */
 int cl_page_prep(const struct lu_env *env, struct cl_io *io,
-		 struct cl_page *pg, enum cl_req_type crt)
+		 struct cl_page *cl_page, enum cl_req_type crt)
 {
 	const struct cl_page_slice *slice;
 	int result = 0;
+	int i;
 
 	/*
-	 * XXX this has to be called bottom-to-top, so that llite can set up
+	 * this has to be called bottom-to-top, so that llite can set up
 	 * PG_writeback without risking other layers deciding to skip this
 	 * page.
 	 */
 	if (crt >= CRT_NR)
 		return -EINVAL;
 
-	list_for_each_entry(slice, &pg->cp_layers, cpl_linkage) {
+	cl_page_slice_for_each(cl_page, slice, i) {
 		if (slice->cpl_ops->cpo_own)
 			result = (*slice->cpl_ops->io[crt].cpo_prep)(env, slice,
 								     io);
@@ -822,10 +858,10 @@ int cl_page_prep(const struct lu_env *env, struct cl_io *io,
 
 	if (result >= 0) {
 		result = 0;
-		cl_page_io_start(env, pg, crt);
+		cl_page_io_start(env, cl_page, crt);
 	}
 
-	CL_PAGE_HEADER(D_TRACE, env, pg, "%d %d\n", crt, result);
+	CL_PAGE_HEADER(D_TRACE, env, cl_page, "%d %d\n", crt, result);
 	return result;
 }
 EXPORT_SYMBOL(cl_page_prep);
@@ -840,35 +876,36 @@ int cl_page_prep(const struct lu_env *env, struct cl_io *io,
  * uppermost layer (llite), responsible for the VFS/VM interaction runs last
  * and can release locks safely.
  *
- * \pre  pg->cp_state == CPS_PAGEIN || pg->cp_state == CPS_PAGEOUT
- * \post pg->cp_state == CPS_CACHED
+ * \pre  cl_page->cp_state == CPS_PAGEIN || cl_page->cp_state == CPS_PAGEOUT
+ * \post cl_page->cp_state == CPS_CACHED
  *
  * \see cl_page_operations::cpo_completion()
  */
 void cl_page_completion(const struct lu_env *env,
-			struct cl_page *pg, enum cl_req_type crt, int ioret)
+			struct cl_page *cl_page, enum cl_req_type crt,
+			int ioret)
 {
-	struct cl_sync_io *anchor = pg->cp_sync_io;
+	struct cl_sync_io *anchor = cl_page->cp_sync_io;
 	const struct cl_page_slice *slice;
+	int i;
 
-	PASSERT(env, pg, crt < CRT_NR);
-	PASSERT(env, pg, pg->cp_state == cl_req_type_state(crt));
-
-	CL_PAGE_HEADER(D_TRACE, env, pg, "%d %d\n", crt, ioret);
+	PASSERT(env, cl_page, crt < CRT_NR);
+	PASSERT(env, cl_page, cl_page->cp_state == cl_req_type_state(crt));
 
-	cl_page_state_set(env, pg, CPS_CACHED);
+	CL_PAGE_HEADER(D_TRACE, env, cl_page, "%d %d\n", crt, ioret);
+	cl_page_state_set(env, cl_page, CPS_CACHED);
 	if (crt >= CRT_NR)
 		return;
 
-	list_for_each_entry_reverse(slice, &pg->cp_layers, cpl_linkage) {
+	cl_page_slice_for_each_reverse(cl_page, slice, i) {
 		if (slice->cpl_ops->io[crt].cpo_completion)
 			(*slice->cpl_ops->io[crt].cpo_completion)(env, slice,
 								  ioret);
 	}
 
 	if (anchor) {
-		LASSERT(pg->cp_sync_io == anchor);
-		pg->cp_sync_io = NULL;
+		LASSERT(cl_page->cp_sync_io == anchor);
+		cl_page->cp_sync_io = NULL;
 		cl_sync_io_note(env, anchor, ioret);
 	}
 }
@@ -878,53 +915,56 @@ void cl_page_completion(const struct lu_env *env,
  * Notify layers that transfer formation engine decided to yank this page from
  * the cache and to make it a part of a transfer.
  *
- * \pre  pg->cp_state == CPS_CACHED
- * \post pg->cp_state == CPS_PAGEIN || pg->cp_state == CPS_PAGEOUT
+ * \pre  cl_page->cp_state == CPS_CACHED
+ * \post cl_page->cp_state == CPS_PAGEIN || cl_page->cp_state == CPS_PAGEOUT
  *
  * \see cl_page_operations::cpo_make_ready()
  */
-int cl_page_make_ready(const struct lu_env *env, struct cl_page *pg,
+int cl_page_make_ready(const struct lu_env *env, struct cl_page *cl_page,
 		       enum cl_req_type crt)
 {
-	const struct cl_page_slice *sli;
+	const struct cl_page_slice *slice;
 	int result = 0;
+	int i;
 
 	if (crt >= CRT_NR)
 		return -EINVAL;
 
-	list_for_each_entry(sli, &pg->cp_layers, cpl_linkage) {
-		if (sli->cpl_ops->io[crt].cpo_make_ready)
-			result = (*sli->cpl_ops->io[crt].cpo_make_ready)(env,
-									 sli);
+	cl_page_slice_for_each(cl_page, slice, i) {
+		if (slice->cpl_ops->io[crt].cpo_make_ready)
+			result = (*slice->cpl_ops->io[crt].cpo_make_ready)(env,
+									   slice);
 		if (result != 0)
 			break;
 	}
 
 	if (result >= 0) {
-		PASSERT(env, pg, pg->cp_state == CPS_CACHED);
-		cl_page_io_start(env, pg, crt);
+		PASSERT(env, cl_page, cl_page->cp_state == CPS_CACHED);
+		cl_page_io_start(env, cl_page, crt);
 		result = 0;
 	}
-	CL_PAGE_HEADER(D_TRACE, env, pg, "%d %d\n", crt, result);
+	CL_PAGE_HEADER(D_TRACE, env, cl_page, "%d %d\n", crt, result);
+
 	return result;
 }
 EXPORT_SYMBOL(cl_page_make_ready);
 
 /**
- * Called if a pge is being written back by kernel's intention.
+ * Called if a page is being written back at the kernel's initiative.
  *
- * \pre  cl_page_is_owned(pg, io)
- * \post ergo(result == 0, pg->cp_state == CPS_PAGEOUT)
+ * \pre  cl_page_is_owned(cl_page, io)
+ * \post ergo(result == 0, cl_page->cp_state == CPS_PAGEOUT)
  *
  * \see cl_page_operations::cpo_flush()
  */
 int cl_page_flush(const struct lu_env *env, struct cl_io *io,
-		  struct cl_page *pg)
+		  struct cl_page *cl_page)
 {
 	const struct cl_page_slice *slice;
 	int result = 0;
+	int i;
 
-	 list_for_each_entry(slice, &pg->cp_layers, cpl_linkage) {
+	cl_page_slice_for_each(cl_page, slice, i) {
 		if (slice->cpl_ops->cpo_flush)
 			result = (*slice->cpl_ops->cpo_flush)(env, slice, io);
 		if (result != 0)
@@ -933,7 +973,7 @@ int cl_page_flush(const struct lu_env *env, struct cl_io *io,
 	if (result > 0)
 		result = 0;
 
-	CL_PAGE_HEADER(D_TRACE, env, pg, "%d\n", result);
+	CL_PAGE_HEADER(D_TRACE, env, cl_page, "%d\n", result);
 	return result;
 }
 EXPORT_SYMBOL(cl_page_flush);
@@ -943,14 +983,14 @@ int cl_page_flush(const struct lu_env *env, struct cl_io *io,
  *
  * \see cl_page_operations::cpo_clip()
  */
-void cl_page_clip(const struct lu_env *env, struct cl_page *pg,
+void cl_page_clip(const struct lu_env *env, struct cl_page *cl_page,
 		  int from, int to)
 {
 	const struct cl_page_slice *slice;
+	int i;
 
-	CL_PAGE_HEADER(D_TRACE, env, pg, "%d %d\n", from, to);
-
-	list_for_each_entry(slice, &pg->cp_layers, cpl_linkage) {
+	CL_PAGE_HEADER(D_TRACE, env, cl_page, "%d %d\n", from, to);
+	cl_page_slice_for_each(cl_page, slice, i) {
 		if (slice->cpl_ops->cpo_clip)
 			(*slice->cpl_ops->cpo_clip)(env, slice, from, to);
 	}
@@ -972,24 +1012,24 @@ void cl_page_header_print(const struct lu_env *env, void *cookie,
 EXPORT_SYMBOL(cl_page_header_print);
 
 /**
- * Prints human readable representation of @pg to the @f.
+ * Prints human readable representation of @cl_page to the @f.
  */
 void cl_page_print(const struct lu_env *env, void *cookie,
-		   lu_printer_t printer, const struct cl_page *pg)
+		   lu_printer_t printer, const struct cl_page *cl_page)
 {
 	const struct cl_page_slice *slice;
 	int result = 0;
+	int i;
 
-	cl_page_header_print(env, cookie, printer, pg);
-
-	list_for_each_entry(slice, &pg->cp_layers, cpl_linkage) {
+	cl_page_header_print(env, cookie, printer, cl_page);
+	cl_page_slice_for_each(cl_page, slice, i) {
 		if (slice->cpl_ops->cpo_print)
 			result = (*slice->cpl_ops->cpo_print)(env, slice,
 							      cookie, printer);
 		if (result != 0)
 			break;
 	}
-	(*printer)(env, cookie, "end page@%p\n", pg);
+	(*printer)(env, cookie, "end page@%p\n", cl_page);
 }
 EXPORT_SYMBOL(cl_page_print);
 
@@ -1032,14 +1072,18 @@ size_t cl_page_size(const struct cl_object *obj)
  *
  * \see cl_lock_slice_add(), cl_req_slice_add(), cl_io_slice_add()
  */
-void cl_page_slice_add(struct cl_page *page, struct cl_page_slice *slice,
+void cl_page_slice_add(struct cl_page *cl_page, struct cl_page_slice *slice,
 		       struct cl_object *obj,
 		       const struct cl_page_operations *ops)
 {
-	list_add_tail(&slice->cpl_linkage, &page->cp_layers);
+	unsigned int offset = (char *)slice -
+			      ((char *)cl_page + sizeof(*cl_page));
+
+	LASSERT(offset < (1 << sizeof(cl_page->cp_layer_offset[0]) * 8));
+	cl_page->cp_layer_offset[cl_page->cp_layer_count++] = offset;
 	slice->cpl_obj = obj;
 	slice->cpl_ops = ops;
-	slice->cpl_page = page;
+	slice->cpl_page = cl_page;
 }
 EXPORT_SYMBOL(cl_page_slice_add);
 
-- 
1.8.3.1

* [lustre-devel] [PATCH 16/37] lustre: obdclass: re-declare cl_page variables to reduce its size
  2020-07-15 20:44 [lustre-devel] [PATCH 00/37] lustre: latest patches landed to OpenSFS 07/14/2020 James Simmons
                   ` (14 preceding siblings ...)
  2020-07-15 20:44 ` [lustre-devel] [PATCH 15/37] lustre: obdclass: use offset instead of cp_linkage James Simmons
@ 2020-07-15 20:44 ` James Simmons
  2020-07-15 20:44 ` [lustre-devel] [PATCH 17/37] lustre: osc: re-declare ops_from/to to shrink osc_page James Simmons
                   ` (20 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: James Simmons @ 2020-07-15 20:44 UTC (permalink / raw)
  To: lustre-devel

From: Wang Shilong <wshilong@ddn.com>

With the following changes:
1) make CPS_CACHED start from 1, consistent with CPT_CACHED
2) add CPT_NR to indicate the max allowed CPT value
3) reserve 4 bits for @cp_state, which allows 15 states
4) reserve 2 bits for @cp_type, which allows 3 cl_page types
5) use short int for @cp_kmem_index; we still have another 16 bits
   reserved for future extension
6) move @cp_lov_index after @cp_ref to fill a 4-byte hole

After this patch, the cl_page size shrinks from 336 bytes to
320 bytes.
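
A minimal standalone sketch of the packing technique (hypothetical
"demo_*" names, not the Lustre code): two enums share one word as
bitfields, and build-time asserts guard the widths the same way the
patch's BUILD_BUG_ON() checks do:

	#include <assert.h>
	#include <stdio.h>

	enum demo_state { DS_CACHED = 1, DS_OWNED, DS_PAGEIN,
			  DS_PAGEOUT, DS_FREEING, DS_NR };
	enum demo_type  { DT_CACHEABLE = 1, DT_TRANSIENT, DT_NR };

	#define DS_BITS	4	/* 2^4 = 16 >= DS_NR */
	#define DT_BITS	2	/* 2^2 =  4 >= DT_NR */

	struct demo_page {
		unsigned int	state : DS_BITS;  /* packs with ... */
		unsigned int	type  : DT_BITS;  /* ... the type   */
	};

	int main(void)
	{
		/* equivalent of the patch's BUILD_BUG_ON() guards */
		static_assert((1 << DS_BITS) >= DS_NR, "state bits");
		static_assert((1 << DT_BITS) >= DT_NR, "type bits");

		printf("packed size: %zu bytes\n",
		       sizeof(struct demo_page));
		return 0;
	}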

WC-bug-id: https://jira.whamcloud.com/browse/LU-13134
Lustre-commit: 5fb29cd1e77ca ("LU-13134 obdclass: re-declare cl_page variables to reduce its size")
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Reviewed-on: https://review.whamcloud.com/37480
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/cl_object.h | 26 +++++++++------
 fs/lustre/obdclass/cl_page.c  | 76 +++++++++++++++++++++----------------------
 2 files changed, 53 insertions(+), 49 deletions(-)

diff --git a/fs/lustre/include/cl_object.h b/fs/lustre/include/cl_object.h
index 47997f8..8611285 100644
--- a/fs/lustre/include/cl_object.h
+++ b/fs/lustre/include/cl_object.h
@@ -621,7 +621,7 @@ enum cl_page_state {
 	 *
 	 * \invariant cl_page::cp_owner == NULL && cl_page::cp_req == NULL
 	 */
-	CPS_CACHED,
+	CPS_CACHED = 1,
 	/**
 	 * Page is exclusively owned by some cl_io. Page may end up in this
 	 * state as a result of
@@ -715,8 +715,13 @@ enum cl_page_type {
 	 *  it is used in DirectIO and lockless IO.
 	 */
 	CPT_TRANSIENT,
+	CPT_NR
 };
 
+#define	CP_STATE_BITS	4
+#define	CP_TYPE_BITS	2
+#define	CP_MAX_LAYER	3
+
 /**
  * Fields are protected by the lock on struct page, except for atomics and
  * immutables.
@@ -729,8 +734,9 @@ enum cl_page_type {
 struct cl_page {
 	/** Reference counter. */
 	refcount_t			 cp_ref;
-	/* which slab kmem index this memory allocated from */
-	int				 cp_kmem_index;
+	/** layout_entry + stripe index, composed using lov_comp_index() */
+	unsigned int			 cp_lov_index;
+	pgoff_t				 cp_osc_index;
 	/** An object this page is a part of. Immutable after creation. */
 	struct cl_object		*cp_obj;
 	/** vmpage */
@@ -738,19 +744,22 @@ struct cl_page {
 	/** Linkage of pages within group. Pages must be owned */
 	struct list_head		 cp_batch;
 	/** array of slices offset. Immutable after creation. */
-	unsigned char			 cp_layer_offset[3];
+	unsigned char			 cp_layer_offset[CP_MAX_LAYER]; /* 24 bits */
 	/** current slice index */
-	unsigned char			 cp_layer_count:2;
+	unsigned char			 cp_layer_count:2; /* 26 bits */
 	/**
 	 * Page state. This field is const to avoid accidental update, it is
 	 * modified only internally within cl_page.c. Protected by a VM lock.
 	 */
-	const enum cl_page_state	 cp_state;
+	enum cl_page_state		 cp_state:CP_STATE_BITS; /* 30 bits */
 	/**
 	 * Page type. Only CPT_TRANSIENT is used so far. Immutable after
 	 * creation.
 	 */
-	enum cl_page_type		 cp_type;
+	enum cl_page_type		 cp_type:CP_TYPE_BITS; /* 32 bits */
+	/* which slab kmem index this memory allocated from */
+	short int			 cp_kmem_index; /* 48 bits */
+	unsigned int			 cp_unused1:16; /* 64 bits */
 
 	/**
 	 * Owning IO in cl_page_state::CPS_OWNED state. Sub-page can be owned
@@ -765,9 +774,6 @@ struct cl_page {
 	struct lu_ref_link		 cp_queue_ref;
 	/** Assigned if doing a sync_io */
 	struct cl_sync_io		*cp_sync_io;
-	/** layout_entry + stripe index, composed using lov_comp_index() */
-	unsigned int			 cp_lov_index;
-	pgoff_t				 cp_osc_index;
 };
 
 /**
diff --git a/fs/lustre/obdclass/cl_page.c b/fs/lustre/obdclass/cl_page.c
index cced026..53f88a7 100644
--- a/fs/lustre/obdclass/cl_page.c
+++ b/fs/lustre/obdclass/cl_page.c
@@ -153,17 +153,6 @@ static void cl_page_free(const struct lu_env *env, struct cl_page *cl_page,
 	__cl_page_free(cl_page, bufsize);
 }
 
-/**
- * Helper function updating page state. This is the only place in the code
- * where cl_page::cp_state field is mutated.
- */
-static inline void cl_page_state_set_trust(struct cl_page *page,
-					   enum cl_page_state state)
-{
-	/* bypass const. */
-	*(enum cl_page_state *)&page->cp_state = state;
-}
-
 static struct cl_page *__cl_page_alloc(struct cl_object *o)
 {
 	int i = 0;
@@ -217,44 +206,50 @@ static struct cl_page *__cl_page_alloc(struct cl_object *o)
 	return cl_page;
 }
 
-struct cl_page *cl_page_alloc(const struct lu_env *env,
-			      struct cl_object *o, pgoff_t ind,
-			      struct page *vmpage,
+struct cl_page *cl_page_alloc(const struct lu_env *env, struct cl_object *o,
+			      pgoff_t ind, struct page *vmpage,
 			      enum cl_page_type type)
 {
-	struct cl_page *page;
+	struct cl_page *cl_page;
 	struct cl_object *o2;
 
-	page = __cl_page_alloc(o);
-	if (page) {
+	cl_page = __cl_page_alloc(o);
+	if (cl_page) {
 		int result = 0;
 
-		refcount_set(&page->cp_ref, 1);
-		page->cp_obj = o;
+		/*
+		 * Please fix cl_page:cp_state/type declaration if
+		 * these assertions fail in the future.
+		 */
+		BUILD_BUG_ON((1 << CP_STATE_BITS) < CPS_NR); /* cp_state */
+		BUILD_BUG_ON((1 << CP_TYPE_BITS) < CPT_NR); /* cp_type */
+		refcount_set(&cl_page->cp_ref, 1);
+		cl_page->cp_obj = o;
 		cl_object_get(o);
-		lu_object_ref_add_at(&o->co_lu, &page->cp_obj_ref, "cl_page",
-				     page);
-		page->cp_vmpage = vmpage;
-		cl_page_state_set_trust(page, CPS_CACHED);
-		page->cp_type = type;
-		INIT_LIST_HEAD(&page->cp_batch);
-		lu_ref_init(&page->cp_reference);
+		lu_object_ref_add_at(&o->co_lu, &cl_page->cp_obj_ref,
+				     "cl_page", cl_page);
+		cl_page->cp_vmpage = vmpage;
+		cl_page->cp_state = CPS_CACHED;
+		cl_page->cp_type = type;
+		INIT_LIST_HEAD(&cl_page->cp_batch);
+		lu_ref_init(&cl_page->cp_reference);
 		cl_object_for_each(o2, o) {
 			if (o2->co_ops->coo_page_init) {
 				result = o2->co_ops->coo_page_init(env, o2,
-								   page, ind);
+								   cl_page,
+								   ind);
 				if (result != 0) {
-					__cl_page_delete(env, page);
-					cl_page_free(env, page, NULL);
-					page = ERR_PTR(result);
+					__cl_page_delete(env, cl_page);
+					cl_page_free(env, cl_page, NULL);
+					cl_page = ERR_PTR(result);
 					break;
 				}
 			}
 		}
 	} else {
-		page = ERR_PTR(-ENOMEM);
+		cl_page = ERR_PTR(-ENOMEM);
 	}
-	return page;
+	return cl_page;
 }
 
 /**
@@ -317,7 +312,8 @@ static inline int cl_page_invariant(const struct cl_page *pg)
 }
 
 static void __cl_page_state_set(const struct lu_env *env,
-				struct cl_page *page, enum cl_page_state state)
+				struct cl_page *cl_page,
+				enum cl_page_state state)
 {
 	enum cl_page_state old;
 
@@ -363,12 +359,13 @@ static void __cl_page_state_set(const struct lu_env *env,
 		}
 	};
 
-	old = page->cp_state;
-	PASSERT(env, page, allowed_transitions[old][state]);
-	CL_PAGE_HEADER(D_TRACE, env, page, "%d -> %d\n", old, state);
-	PASSERT(env, page, page->cp_state == old);
-	PASSERT(env, page, equi(state == CPS_OWNED, page->cp_owner));
-	cl_page_state_set_trust(page, state);
+	old = cl_page->cp_state;
+	PASSERT(env, cl_page, allowed_transitions[old][state]);
+	CL_PAGE_HEADER(D_TRACE, env, cl_page, "%d -> %d\n", old, state);
+	PASSERT(env, cl_page, cl_page->cp_state == old);
+	PASSERT(env, cl_page, equi(state == CPS_OWNED,
+				   cl_page->cp_owner));
+	cl_page->cp_state = state;
 }
 
 static void cl_page_state_set(const struct lu_env *env,
@@ -1079,6 +1076,7 @@ void cl_page_slice_add(struct cl_page *cl_page, struct cl_page_slice *slice,
 	unsigned int offset = (char *)slice -
 			      ((char *)cl_page + sizeof(*cl_page));
 
+	LASSERT(cl_page->cp_layer_count < CP_MAX_LAYER);
 	LASSERT(offset < (1 << sizeof(cl_page->cp_layer_offset[0]) * 8));
 	cl_page->cp_layer_offset[cl_page->cp_layer_count++] = offset;
 	slice->cpl_obj = obj;
-- 
1.8.3.1

* [lustre-devel] [PATCH 17/37] lustre: osc: re-declare ops_from/to to shrink osc_page
  2020-07-15 20:44 [lustre-devel] [PATCH 00/37] lustre: latest patches landed to OpenSFS 07/14/2020 James Simmons
                   ` (15 preceding siblings ...)
  2020-07-15 20:44 ` [lustre-devel] [PATCH 16/37] lustre: obdclass: re-declare cl_page variables to reduce its size James Simmons
@ 2020-07-15 20:44 ` James Simmons
  2020-07-15 20:44 ` [lustre-devel] [PATCH 18/37] lustre: llite: Fix lock ordering in pagevec_dirty James Simmons
                   ` (19 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: James Simmons @ 2020-07-15 20:44 UTC (permalink / raw)
  To: lustre-devel

From: Wang Shilong <wshilong@ddn.com>

@ops_from and @ops_to are within PAGE_SIZE, so PAGE_SHIFT bits are
enough to hold each of them; on the x86_64 platform this saves
another 8 bytes.

Note that @ops_to was previously exclusive, so it could equal
PAGE_SIZE; this patch makes it inclusive, so its maximum value is
PAGE_SIZE - 1 and length calculations must add 1.

After this patch, the cl_page size shrinks from 320 to 312 bytes,
and we are able to allocate 13 objects from the slab pool for a 4K
page.
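
A tiny standalone illustration of the exclusive-to-inclusive change
(DEMO_PAGE_SIZE stands in for PAGE_SIZE; nothing here is Lustre
code):

	#include <stdio.h>

	#define DEMO_PAGE_SIZE	4096u

	int main(void)
	{
		unsigned int from = 0;
		unsigned int to_excl = DEMO_PAGE_SIZE;	   /* old */
		unsigned int to_incl = DEMO_PAGE_SIZE - 1; /* new */

		/* old length: to - from;  new length: to - from + 1 */
		printf("old %u, new %u\n",
		       to_excl - from, to_incl - from + 1);
		return 0;
	}

The inclusive bound fits in PAGE_SHIFT bits (0..4095 for a 4K page),
which is what allows ops_from/ops_to to become bitfields.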

WC-bug-id: https://jira.whamcloud.com/browse/LU-13134
Lustre-commit: 9821754235e24 ("LU-13134 osc: re-declare ops_from/to to shrink osc_page")
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Reviewed-on: https://review.whamcloud.com/37487
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Neil Brown <neilb@suse.de>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_osc.h |  8 ++++----
 fs/lustre/osc/osc_cache.c      |  5 +++--
 fs/lustre/osc/osc_page.c       | 21 +++++++++++----------
 3 files changed, 18 insertions(+), 16 deletions(-)

diff --git a/fs/lustre/include/lustre_osc.h b/fs/lustre/include/lustre_osc.h
index cd08f27..3956ab4 100644
--- a/fs/lustre/include/lustre_osc.h
+++ b/fs/lustre/include/lustre_osc.h
@@ -507,17 +507,17 @@ struct osc_page {
 	 * An offset within page from which next transfer starts. This is used
 	 * by cl_page_clip() to submit partial page transfers.
 	 */
-	int			ops_from;
+	unsigned int		ops_from:PAGE_SHIFT,
 	/*
-	 * An offset within page at which next transfer ends.
+	 * An offset within page at which next transfer ends (inclusive).
 	 *
 	 * \see osc_page::ops_from.
 	 */
-	int			ops_to;
+				ops_to:PAGE_SHIFT,
 	/*
 	 * Boolean, true iff page is under transfer. Used for sanity checking.
 	 */
-	unsigned		ops_transfer_pinned:1,
+				ops_transfer_pinned:1,
 	/*
 	 * in LRU?
 	 */
diff --git a/fs/lustre/osc/osc_cache.c b/fs/lustre/osc/osc_cache.c
index f811dadb..fe03c0d 100644
--- a/fs/lustre/osc/osc_cache.c
+++ b/fs/lustre/osc/osc_cache.c
@@ -2395,7 +2395,7 @@ int osc_queue_async_io(const struct lu_env *env, struct cl_io *io,
 
 	oap->oap_cmd = cmd;
 	oap->oap_page_off = ops->ops_from;
-	oap->oap_count = ops->ops_to - ops->ops_from;
+	oap->oap_count = ops->ops_to - ops->ops_from + 1;
 	/*
 	 * No need to hold a lock here,
 	 * since this page is not in any list yet.
@@ -2664,7 +2664,8 @@ int osc_queue_sync_pages(const struct lu_env *env, const struct cl_io *io,
 		++page_count;
 		mppr <<= (page_count > mppr);
 
-		if (unlikely(opg->ops_from > 0 || opg->ops_to < PAGE_SIZE))
+		if (unlikely(opg->ops_from > 0 ||
+			     opg->ops_to < PAGE_SIZE - 1))
 			can_merge = false;
 	}
 
diff --git a/fs/lustre/osc/osc_page.c b/fs/lustre/osc/osc_page.c
index 2856f30..bb605af 100644
--- a/fs/lustre/osc/osc_page.c
+++ b/fs/lustre/osc/osc_page.c
@@ -211,7 +211,8 @@ static void osc_page_clip(const struct lu_env *env,
 	struct osc_async_page *oap = &opg->ops_oap;
 
 	opg->ops_from = from;
-	opg->ops_to = to;
+	/* argument @to is exclusive, but @ops_to is inclusive */
+	opg->ops_to = to - 1;
 	spin_lock(&oap->oap_lock);
 	oap->oap_async_flags |= ASYNC_COUNT_STABLE;
 	spin_unlock(&oap->oap_lock);
@@ -246,28 +247,28 @@ static void osc_page_touch(const struct lu_env *env,
 };
 
 int osc_page_init(const struct lu_env *env, struct cl_object *obj,
-		  struct cl_page *page, pgoff_t index)
+		  struct cl_page *cl_page, pgoff_t index)
 {
 	struct osc_object *osc = cl2osc(obj);
-	struct osc_page *opg = cl_object_page_slice(obj, page);
+	struct osc_page *opg = cl_object_page_slice(obj, cl_page);
 	struct osc_io *oio = osc_env_io(env);
 	int result;
 
 	opg->ops_from = 0;
-	opg->ops_to = PAGE_SIZE;
+	opg->ops_to = PAGE_SIZE - 1;
 	INIT_LIST_HEAD(&opg->ops_lru);
 
-	result = osc_prep_async_page(osc, opg, page->cp_vmpage,
+	result = osc_prep_async_page(osc, opg, cl_page->cp_vmpage,
 				     cl_offset(obj, index));
 	if (result != 0)
 		return result;
 
 	opg->ops_srvlock = osc_io_srvlock(oio);
-	cl_page_slice_add(page, &opg->ops_cl, obj, &osc_page_ops);
-	page->cp_osc_index = index;
+	cl_page_slice_add(cl_page, &opg->ops_cl, obj, &osc_page_ops);
+	cl_page->cp_osc_index = index;
 
-	/* reserve an LRU space for this page */
-	if (page->cp_type == CPT_CACHEABLE) {
+	/* reserve an LRU space for this cl_page */
+	if (cl_page->cp_type == CPT_CACHEABLE) {
 		result = osc_lru_alloc(env, osc_cli(osc), opg);
 		if (result == 0) {
 			result = radix_tree_preload(GFP_KERNEL);
@@ -308,7 +309,7 @@ void osc_page_submit(const struct lu_env *env, struct osc_page *opg,
 
 	oap->oap_cmd = crt == CRT_WRITE ? OBD_BRW_WRITE : OBD_BRW_READ;
 	oap->oap_page_off = opg->ops_from;
-	oap->oap_count = opg->ops_to - opg->ops_from;
+	oap->oap_count = opg->ops_to - opg->ops_from + 1;
 	oap->oap_brw_flags = OBD_BRW_SYNC | brw_flags;
 
 	if (oio->oi_cap_sys_resource) {
-- 
1.8.3.1

* [lustre-devel] [PATCH 18/37] lustre: llite: Fix lock ordering in pagevec_dirty
  2020-07-15 20:44 [lustre-devel] [PATCH 00/37] lustre: latest patches landed to OpenSFS 07/14/2020 James Simmons
                   ` (16 preceding siblings ...)
  2020-07-15 20:44 ` [lustre-devel] [PATCH 17/37] lustre: osc: re-declare ops_from/to to shrink osc_page James Simmons
@ 2020-07-15 20:44 ` James Simmons
  2020-07-15 20:45 ` [lustre-devel] [PATCH 19/37] lustre: misc: quiet compiler warning on armv7l James Simmons
                   ` (18 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: James Simmons @ 2020-07-15 20:44 UTC (permalink / raw)
  To: lustre-devel

From: Shaun Tancheff <shaun.tancheff@hpe.com>

In vvp_set_pagevec_dirty the lock order between i_pages and
lock_page_memcg was inverted, on the assumption that no other
users would conflict.

However, in vvp_page_completion_write the call to
test_clear_page_writeback expects to take lock_page_memcg and then
lock i_pages, which conflicts with that original analysis.

The reported case manifests as RCU stalls, with
vvp_set_pagevec_dirty blocked attempting to lock i_pages.
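
Schematically, the inversion being removed is the classic AB-BA
deadlock (a sketch, not verbatim kernel code):

	/*
	 * vvp_set_pagevec_dirty (old)    test_clear_page_writeback
	 * ---------------------------    --------------------------
	 * xa_lock(&mapping->i_pages);    lock_page_memcg(page);
	 * lock_page_memcg(page);         xa_lock(&mapping->i_pages);
	 *
	 * Two threads taking the locks in opposite order can block
	 * each other forever.  The fix below takes lock_page_memcg()
	 * for every page first, and only then takes xa_lock once for
	 * the whole pagevec.
	 */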

Fixes: f8a5fb036ae ("lustre: vvp: dirty pages with pagevec")
HPE-bug-id: LUS-8798
WC-bug-id: https://jira.whamcloud.com/browse/LU-13746
Lustre-commit: c4ed9b0fb1013 ("LU-13476 llite: Fix lock ordering in pagevec_dirty")
Signed-off-by: Shaun Tancheff <shaun.tancheff@hpe.com>
Reviewed-on: https://review.whamcloud.com/38317
Reviewed-by: Wang Shilong <wshilong@ddn.com>
Reviewed-by: Patrick Farrell <farr0186@gmail.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/vvp_io.c | 34 +++++++++++++++++++++-------------
 1 file changed, 21 insertions(+), 13 deletions(-)

diff --git a/fs/lustre/llite/vvp_io.c b/fs/lustre/llite/vvp_io.c
index 8edd3c1..7627431 100644
--- a/fs/lustre/llite/vvp_io.c
+++ b/fs/lustre/llite/vvp_io.c
@@ -897,19 +897,31 @@ void vvp_set_pagevec_dirty(struct pagevec *pvec)
 	struct page *page = pvec->pages[0];
 	struct address_space *mapping = page->mapping;
 	unsigned long flags;
+	unsigned long skip_pages = 0;
 	int count = pagevec_count(pvec);
 	int dirtied = 0;
-	int i = 0;
-
-	/* From set_page_dirty */
-	for (i = 0; i < count; i++)
-		ClearPageReclaim(pvec->pages[i]);
+	int i;
 
+	BUILD_BUG_ON(PAGEVEC_SIZE > BITS_PER_LONG);
 	LASSERTF(page->mapping,
 		 "mapping must be set. page %p, page->private (cl_page) %p\n",
 		 page, (void *) page->private);
 
-	/* Rest of code derived from __set_page_dirty_nobuffers */
+	for (i = 0; i < count; i++) {
+		page = pvec->pages[i];
+
+		ClearPageReclaim(page);
+
+		lock_page_memcg(page);
+		if (TestSetPageDirty(page)) {
+			/* page is already dirty .. no extra work needed
+			 * set a flag for the i'th page to be skipped
+			 */
+			unlock_page_memcg(page);
+			skip_pages |= (1 << i);
+		}
+	}
+
 	xa_lock_irqsave(&mapping->i_pages, flags);
 
 	/* Notes on differences with __set_page_dirty_nobuffers:
@@ -920,17 +932,13 @@ void vvp_set_pagevec_dirty(struct pagevec *pvec)
 	 * 3. No mapping is impossible. (Race w/truncate mentioned in
 	 * dirty_nobuffers should be impossible because we hold the page lock.)
 	 * 4. All mappings are the same because i/o is only to one file.
-	 * 5. We invert the lock order on lock_page_memcg(page) and the mapping
-	 * xa_lock, but this is the only function that should use that pair of
-	 * locks and it can't race because Lustre locks pages throughout i/o.
 	 */
 	for (i = 0; i < count; i++) {
 		page = pvec->pages[i];
-		lock_page_memcg(page);
-		if (TestSetPageDirty(page)) {
-			unlock_page_memcg(page);
+		/* if the i'th page was unlocked above, skip it here */
+		if ((skip_pages >> i) & 1)
 			continue;
-		}
+
 		LASSERTF(page->mapping == mapping,
 			 "all pages must have the same mapping.  page %p, mapping %p, first mapping %p\n",
 			 page, page->mapping, mapping);
-- 
1.8.3.1

* [lustre-devel] [PATCH 19/37] lustre: misc: quiet compiler warning on armv7l
  2020-07-15 20:44 [lustre-devel] [PATCH 00/37] lustre: latest patches landed to OpenSFS 07/14/2020 James Simmons
                   ` (17 preceding siblings ...)
  2020-07-15 20:44 ` [lustre-devel] [PATCH 18/37] lustre: llite: Fix lock ordering in pagevec_dirty James Simmons
@ 2020-07-15 20:45 ` James Simmons
  2020-07-15 20:45 ` [lustre-devel] [PATCH 20/37] lustre: llite: fix to free cl_dio_aio properly James Simmons
                   ` (17 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: James Simmons @ 2020-07-15 20:45 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

Avoid overflow in lu_prandom_u64_max().

Quiet printk() warnings about mismatched types for size_t
variables by using the %z length modifier.
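
Both issues in miniature (standalone C, not the Lustre code):

	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		uint32_t hi = 0xdeadbeefu;
		/* Shifting a 32-bit value by 32 is undefined in C;
		 * widen to 64 bits first, as the patch does: */
		uint64_t ok = (uint64_t)hi << 32;

		/* size_t needs the %z length modifier; %ld warned on
		 * armv7l, where size_t is unsigned int, not long: */
		size_t n = sizeof(ok);

		printf("ok=%llx n=%zu\n", (unsigned long long)ok, n);
		return 0;
	}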

Fixes: bc2e21c54ba2 ("lustre: obdclass: generate random u64 max correctly")

WC-bug-id: https://jira.whamcloud.com/browse/LU-13673
Lustre-commit: 57bb302461383 ("LU-13673 misc: quiet compiler warning on armv7l")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/38927
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-by: Yang Sheng <ys@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/vvp_io.c          | 4 ++--
 fs/lustre/obdclass/lu_tgt_descs.c | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/lustre/llite/vvp_io.c b/fs/lustre/llite/vvp_io.c
index 7627431..c3fb03a 100644
--- a/fs/lustre/llite/vvp_io.c
+++ b/fs/lustre/llite/vvp_io.c
@@ -791,7 +791,7 @@ static int vvp_io_read_start(const struct lu_env *env,
 		goto out;
 
 	LU_OBJECT_HEADER(D_INODE, env, &obj->co_lu,
-			 "Read ino %lu, %lu bytes, offset %lld, size %llu\n",
+			 "Read ino %lu, %zu bytes, offset %lld, size %llu\n",
 			 inode->i_ino, cnt, pos, i_size_read(inode));
 
 	/* turn off the kernel's read-ahead */
@@ -1197,7 +1197,7 @@ static int vvp_io_write_start(const struct lu_env *env,
 	}
 	if (vio->vui_iocb->ki_pos != (pos + io->ci_nob - nob)) {
 		CDEBUG(D_VFSTRACE,
-		       "%s: write position mismatch: ki_pos %lld vs. pos %lld, written %ld, commit %ld rc %ld\n",
+		       "%s: write position mismatch: ki_pos %lld vs. pos %lld, written %zd, commit %zd rc %zd\n",
 		       file_dentry(file)->d_name.name,
 		       vio->vui_iocb->ki_pos, pos + io->ci_nob - nob,
 		       written, io->ci_nob - nob, result);
diff --git a/fs/lustre/obdclass/lu_tgt_descs.c b/fs/lustre/obdclass/lu_tgt_descs.c
index db5a93b..469c935 100644
--- a/fs/lustre/obdclass/lu_tgt_descs.c
+++ b/fs/lustre/obdclass/lu_tgt_descs.c
@@ -62,7 +62,7 @@ u64 lu_prandom_u64_max(u64 ep_ro)
 		 * 32 bits (truncated to the upper limit, if needed)
 		 */
 		if (ep_ro > 0xffffffffULL)
-			rand = prandom_u32_max((u32)(ep_ro >> 32)) << 32;
+			rand = (u64)prandom_u32_max((u32)(ep_ro >> 32)) << 32;
 
 		if (rand == (ep_ro & 0xffffffff00000000ULL))
 			rand |= prandom_u32_max((u32)ep_ro);
-- 
1.8.3.1

* [lustre-devel] [PATCH 20/37] lustre: llite: fix to free cl_dio_aio properly
  2020-07-15 20:44 [lustre-devel] [PATCH 00/37] lustre: latest patches landed to OpenSFS 07/14/2020 James Simmons
                   ` (18 preceding siblings ...)
  2020-07-15 20:45 ` [lustre-devel] [PATCH 19/37] lustre: misc: quiet compiler warning on armv7l James Simmons
@ 2020-07-15 20:45 ` James Simmons
  2020-07-15 20:45 ` [lustre-devel] [PATCH 21/37] lnet: o2iblnd: Use ib_mtu_int_to_enum() James Simmons
                   ` (16 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: James Simmons @ 2020-07-15 20:45 UTC (permalink / raw)
  To: lustre-devel

From: Wang Shilong <wshilong@ddn.com>

@cl_dio_aio is allocated from a slab cache, so we should use the
slab free helper to release its memory.
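
The rule being enforced, as a short kernel-style sketch
("demo_kmem" and "demo_obj" are hypothetical names):

	struct demo_obj *obj = kmem_cache_alloc(demo_kmem, GFP_KERNEL);

	if (obj) {
		/* ... use obj ... */
		kmem_cache_free(demo_kmem, obj);  /* correct pairing */
		/* kfree(obj) here would return a dedicated-cache
		 * object through the generic allocator instead */
	}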

Fixes: ebdbecbaf50b ("lustre: obdclass: use slab allocation for cl_dio_aio")
WC-bug-id: https://jira.whamcloud.com/browse/LU-13134
Lustre-commit: f71a539c3e41b ("LU-13134 llite: fix to free cl_dio_aio properly")
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Reviewed-on: https://review.whamcloud.com/39103
Reviewed-by: Patrick Farrell <farr0186@gmail.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/cl_object.h |  2 ++
 fs/lustre/llite/rw26.c        |  2 +-
 fs/lustre/obdclass/cl_io.c    | 10 ++++++++--
 3 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/fs/lustre/include/cl_object.h b/fs/lustre/include/cl_object.h
index 8611285..e656c68 100644
--- a/fs/lustre/include/cl_object.h
+++ b/fs/lustre/include/cl_object.h
@@ -2538,6 +2538,8 @@ int cl_sync_io_wait(const struct lu_env *env, struct cl_sync_io *anchor,
 void cl_sync_io_note(const struct lu_env *env, struct cl_sync_io *anchor,
 		     int ioret);
 struct cl_dio_aio *cl_aio_alloc(struct kiocb *iocb);
+void cl_aio_free(struct cl_dio_aio *aio);
+
 static inline void cl_sync_io_init(struct cl_sync_io *anchor, int nr)
 {
 	cl_sync_io_init_notify(anchor, nr, NULL, NULL);
diff --git a/fs/lustre/llite/rw26.c b/fs/lustre/llite/rw26.c
index 0971185..d0e3ff6 100644
--- a/fs/lustre/llite/rw26.c
+++ b/fs/lustre/llite/rw26.c
@@ -384,7 +384,7 @@ static ssize_t ll_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 			vio->u.write.vui_written += tot_bytes;
 			result = tot_bytes;
 		}
-		kfree(aio);
+		cl_aio_free(aio);
 	} else {
 		result = -EIOCBQUEUED;
 	}
diff --git a/fs/lustre/obdclass/cl_io.c b/fs/lustre/obdclass/cl_io.c
index 2f597d1..dcf940f 100644
--- a/fs/lustre/obdclass/cl_io.c
+++ b/fs/lustre/obdclass/cl_io.c
@@ -1106,6 +1106,13 @@ struct cl_dio_aio *cl_aio_alloc(struct kiocb *iocb)
 }
 EXPORT_SYMBOL(cl_aio_alloc);
 
+void cl_aio_free(struct cl_dio_aio *aio)
+{
+	if (aio)
+		kmem_cache_free(cl_dio_aio_kmem, aio);
+}
+EXPORT_SYMBOL(cl_aio_free);
+
 /**
  * Indicate that transfer of a single page completed.
  */
@@ -1143,8 +1150,7 @@ void cl_sync_io_note(const struct lu_env *env, struct cl_sync_io *anchor,
 		 * If anchor->csi_aio is set, we are responsible for freeing
 		 * memory here rather than when cl_sync_io_wait() completes.
 		 */
-		if (aio)
-			kmem_cache_free(cl_dio_aio_kmem, aio);
+		cl_aio_free(aio);
 	}
 }
 EXPORT_SYMBOL(cl_sync_io_note);
-- 
1.8.3.1

* [lustre-devel] [PATCH 21/37] lnet: o2iblnd: Use ib_mtu_int_to_enum()
  2020-07-15 20:44 [lustre-devel] [PATCH 00/37] lustre: latest patches landed to OpenSFS 07/14/2020 James Simmons
                   ` (19 preceding siblings ...)
  2020-07-15 20:45 ` [lustre-devel] [PATCH 20/37] lustre: llite: fix to free cl_dio_aio properly James Simmons
@ 2020-07-15 20:45 ` James Simmons
  2020-07-15 20:45 ` [lustre-devel] [PATCH 22/37] lnet: o2iblnd: wait properly for fps->increasing James Simmons
                   ` (15 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: James Simmons @ 2020-07-15 20:45 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.de>

Rather than open-coding the conversion of an MTU into the enum,
use ib_mtu_int_to_enum().
This has slightly different behaviour for invalid values,
but those are caught when the parameter is set.
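
The round-trip validation the patch adds can be read in isolation
as (a sketch, assuming the RDMA core helpers behave as documented):

	static bool demo_mtu_valid(int mtu)
	{
		/* 0 means "leave the path-record MTU untouched" */
		return mtu == 0 ||
		       ib_mtu_enum_to_int(ib_mtu_int_to_enum(mtu)) == mtu;
	}

A value is a legal IB MTU exactly when converting it to the enum
and back returns it unchanged.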

WC-bug-id: https://jira.whamcloud.com/browse/LU-12678
Lustre-commit: 1b622e2007483 ("LU-12678 o2iblnd: Use ib_mtu_int_to_enum()")
Signed-off-by: Mr NeilBrown <neilb@suse.de>
Reviewed-on: https://review.whamcloud.com/39123
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Shaun Tancheff <shaun.tancheff@hpe.com>
Reviewed-by: Chris Horn <chris.horn@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/klnds/o2iblnd/o2iblnd.c           | 29 +++--------------------------
 net/lnet/klnds/o2iblnd/o2iblnd_modparams.c |  4 +++-
 2 files changed, 6 insertions(+), 27 deletions(-)

diff --git a/net/lnet/klnds/o2iblnd/o2iblnd.c b/net/lnet/klnds/o2iblnd/o2iblnd.c
index d8fca2a..e2e94b7 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd.c
+++ b/net/lnet/klnds/o2iblnd/o2iblnd.c
@@ -560,38 +560,15 @@ static struct kib_conn *kiblnd_get_conn_by_idx(struct lnet_ni *ni, int index)
 	return NULL;
 }
 
-int kiblnd_translate_mtu(int value)
-{
-	switch (value) {
-	default:
-		return -1;
-	case 0:
-		return 0;
-	case 256:
-		return IB_MTU_256;
-	case 512:
-		return IB_MTU_512;
-	case 1024:
-		return IB_MTU_1024;
-	case 2048:
-		return IB_MTU_2048;
-	case 4096:
-		return IB_MTU_4096;
-	}
-}
-
 static void kiblnd_setup_mtu_locked(struct rdma_cm_id *cmid)
 {
-	int mtu;
-
 	/* XXX There is no path record for iWARP, set by netdev->change_mtu? */
 	if (!cmid->route.path_rec)
 		return;
 
-	mtu = kiblnd_translate_mtu(*kiblnd_tunables.kib_ib_mtu);
-	LASSERT(mtu >= 0);
-	if (mtu)
-		cmid->route.path_rec->mtu = mtu;
+	if (*kiblnd_tunables.kib_ib_mtu)
+		cmid->route.path_rec->mtu =
+			ib_mtu_int_to_enum(*kiblnd_tunables.kib_ib_mtu);
 }
 
 static int kiblnd_get_completion_vector(struct kib_conn *conn, int cpt)
diff --git a/net/lnet/klnds/o2iblnd/o2iblnd_modparams.c b/net/lnet/klnds/o2iblnd/o2iblnd_modparams.c
index f341376..73ad22d 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd_modparams.c
+++ b/net/lnet/klnds/o2iblnd/o2iblnd_modparams.c
@@ -230,7 +230,9 @@ int kiblnd_tunables_setup(struct lnet_ni *ni)
 	/* Current API version */
 	tunables->lnd_version = 0;
 
-	if (kiblnd_translate_mtu(*kiblnd_tunables.kib_ib_mtu) < 0) {
+	if (*kiblnd_tunables.kib_ib_mtu &&
+	    ib_mtu_enum_to_int(ib_mtu_int_to_enum(*kiblnd_tunables.kib_ib_mtu)) !=
+	    *kiblnd_tunables.kib_ib_mtu) {
 		CERROR("Invalid ib_mtu %d, expected 256/512/1024/2048/4096\n",
 		       *kiblnd_tunables.kib_ib_mtu);
 		return -EINVAL;
-- 
1.8.3.1

* [lustre-devel] [PATCH 22/37] lnet: o2iblnd: wait properly for fps->increasing.
  2020-07-15 20:44 [lustre-devel] [PATCH 00/37] lustre: latest patches landed to OpenSFS 07/14/2020 James Simmons
                   ` (20 preceding siblings ...)
  2020-07-15 20:45 ` [lustre-devel] [PATCH 21/37] lnet: o2iblnd: Use ib_mtu_int_to_enum() James Simmons
@ 2020-07-15 20:45 ` James Simmons
  2020-07-15 20:45 ` [lustre-devel] [PATCH 23/37] lnet: o2iblnd: use need_resched() James Simmons
                   ` (14 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: James Simmons @ 2020-07-15 20:45 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.de>

If we need to allocate a new fmr_pool and another thread is currently
allocating one, we call schedule() and then try again.  This can spin,
consuming a CPU and wasting power.

Instead, use wait_var_event() and wake_up_var() to
wait for fps_increasing to be cleared.
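
The generic shape of the pattern (a sketch; "flag" and "lock" are
placeholders, and the wait/wake pair is keyed on the variable's
address):

	/* waiter: sleep until the flag clears */
	spin_lock(&lock);
	while (flag) {
		spin_unlock(&lock);
		wait_var_event(&flag, !flag);
		spin_lock(&lock);
	}
	/* ... flag is clear, proceed under the lock ... */
	spin_unlock(&lock);

	/* owner: clear the flag and wake all waiters */
	spin_lock(&lock);
	flag = 0;
	wake_up_var(&flag);
	spin_unlock(&lock);

Unlike the schedule() retry loop, the waiter consumes no CPU until
wake_up_var() fires.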

WC-bug-id: https://jira.whamcloud.com/browse/LU-12678
Lustre-commit: 530eca31556f7 ("LU-12768 o2iblnd: wait properly for fps->increasing.")
Signed-off-by: Mr NeilBrown <neilb@suse.de>
Reviewed-on: https://review.whamcloud.com/39124
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Shaun Tancheff <shaun.tancheff@hpe.com>
Reviewed-by: Chris Horn <chris.horn@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/klnds/o2iblnd/o2iblnd.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/lnet/klnds/o2iblnd/o2iblnd.c b/net/lnet/klnds/o2iblnd/o2iblnd.c
index e2e94b7..6c7659c 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd.c
+++ b/net/lnet/klnds/o2iblnd/o2iblnd.c
@@ -1750,7 +1750,7 @@ int kiblnd_fmr_pool_map(struct kib_fmr_poolset *fps, struct kib_tx *tx,
 	if (fps->fps_increasing) {
 		spin_unlock(&fps->fps_lock);
 		CDEBUG(D_NET, "Another thread is allocating new FMR pool, waiting for her to complete\n");
-		schedule();
+		wait_var_event(fps, !fps->fps_increasing);
 		goto again;
 	}
 
@@ -1767,6 +1767,7 @@ int kiblnd_fmr_pool_map(struct kib_fmr_poolset *fps, struct kib_tx *tx,
 	rc = kiblnd_create_fmr_pool(fps, &fpo);
 	spin_lock(&fps->fps_lock);
 	fps->fps_increasing = 0;
+	wake_up_var(fps);
 	if (!rc) {
 		fps->fps_version++;
 		list_add_tail(&fpo->fpo_list, &fps->fps_pool_list);
-- 
1.8.3.1

* [lustre-devel] [PATCH 23/37] lnet: o2iblnd: use need_resched()
  2020-07-15 20:44 [lustre-devel] [PATCH 00/37] lustre: latest patches landed to OpenSFS 07/14/2020 James Simmons
                   ` (21 preceding siblings ...)
  2020-07-15 20:45 ` [lustre-devel] [PATCH 22/37] lnet: o2iblnd: wait properly for fps->increasing James Simmons
@ 2020-07-15 20:45 ` James Simmons
  2020-07-15 20:45 ` [lustre-devel] [PATCH 24/37] lnet: o2iblnd: Use list_for_each_entry_safe James Simmons
                   ` (13 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: James Simmons @ 2020-07-15 20:45 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.de>

Rather than using a counter to decide when to drop the lock and see
if we need to reschedule, use need_resched(), which is a precise
test instead of a guess.
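
The transformation, schematically (not the exact driver code):

	/* before: guess with a counter */
	if (++nloops >= RESCHED_LIMIT) {
		spin_unlock(&lock);
		cond_resched();
		nloops = 0;
		spin_lock(&lock);
	}

	/* after: ask the scheduler directly */
	if (need_resched()) {
		spin_unlock(&lock);
		cond_resched();
		spin_lock(&lock);
	}

need_resched() is true exactly when the scheduler has flagged this
task for preemption, so the lock is only dropped when it matters.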

WC-bug-id: https://jira.whamcloud.com/browse/LU-12678
Lustre-commit: dcd799269f693 ("LU-12678 o2iblnd: use need_resched()")
Signed-off-by: Mr NeilBrown <neilb@suse.de>
Reviewed-on: https://review.whamcloud.com/39125
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Shaun Tancheff <shaun.tancheff@hpe.com>
Reviewed-by: Chris Horn <chris.horn@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/klnds/o2iblnd/o2iblnd.h    | 2 --
 net/lnet/klnds/o2iblnd/o2iblnd_cb.c | 5 +----
 2 files changed, 1 insertion(+), 6 deletions(-)

diff --git a/net/lnet/klnds/o2iblnd/o2iblnd.h b/net/lnet/klnds/o2iblnd/o2iblnd.h
index f60a69d..9a2fb42 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd.h
+++ b/net/lnet/klnds/o2iblnd/o2iblnd.h
@@ -67,8 +67,6 @@
 #include <linux/lnet/lib-lnet.h>
 
 #define IBLND_PEER_HASH_SIZE		101	/* # peer_ni lists */
-/* # scheduler loops before reschedule */
-#define IBLND_RESCHED			100
 
 #define IBLND_N_SCHED			2
 #define IBLND_N_SCHED_HIGH		4
diff --git a/net/lnet/klnds/o2iblnd/o2iblnd_cb.c b/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
index 3b9d10d..2c670a33 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
+++ b/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
@@ -3605,7 +3605,6 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
 	unsigned long flags;
 	struct ib_wc wc;
 	int did_something;
-	int busy_loops = 0;
 	int rc;
 
 	init_waitqueue_entry(&wait, current);
@@ -3621,11 +3620,10 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
 	spin_lock_irqsave(&sched->ibs_lock, flags);
 
 	while (!kiblnd_data.kib_shutdown) {
-		if (busy_loops++ >= IBLND_RESCHED) {
+		if (need_resched()) {
 			spin_unlock_irqrestore(&sched->ibs_lock, flags);
 
 			cond_resched();
-			busy_loops = 0;
 
 			spin_lock_irqsave(&sched->ibs_lock, flags);
 		}
@@ -3718,7 +3716,6 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
 		spin_unlock_irqrestore(&sched->ibs_lock, flags);
 
 		schedule();
-		busy_loops = 0;
 
 		remove_wait_queue(&sched->ibs_waitq, &wait);
 		spin_lock_irqsave(&sched->ibs_lock, flags);
-- 
1.8.3.1

* [lustre-devel] [PATCH 24/37] lnet: o2iblnd: Use list_for_each_entry_safe
  2020-07-15 20:44 [lustre-devel] [PATCH 00/37] lustre: latest patches landed to OpenSFS 07/14/2020 James Simmons
                   ` (22 preceding siblings ...)
  2020-07-15 20:45 ` [lustre-devel] [PATCH 23/37] lnet: o2iblnd: use need_resched() James Simmons
@ 2020-07-15 20:45 ` James Simmons
  2020-07-15 20:45 ` [lustre-devel] [PATCH 25/37] lnet: socklnd: use need_resched() James Simmons
                   ` (12 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: James Simmons @ 2020-07-15 20:45 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.de>

Several loops use list_for_each_safe() and then call list_entry()
as the first step.  These can be merged using
list_for_each_entry_safe().
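
The mechanical rewrite, on a hypothetical "struct item" linked
through a ->link member:

	/* before */
	struct list_head *pos, *nxt;

	list_for_each_safe(pos, nxt, &head) {
		struct item *it = list_entry(pos, struct item, link);
		/* ... may delete "it" from the list ... */
	}

	/* after */
	struct item *it, *nxt;

	list_for_each_entry_safe(it, nxt, &head, link) {
		/* ... may delete "it" from the list ... */
	}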

WC-bug-id: https://jira.whamcloud.com/browse/LU-12678
Lustre-commit: e5574f72f2fd9 ("LU-12678 o2iblnd: Use list_for_each_entry_safe")
Signed-off-by: Mr NeilBrown <neilb@suse.de>
Reviewed-on: https://review.whamcloud.com/39126
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Shaun Tancheff <shaun.tancheff@hpe.com>
Reviewed-by: Chris Horn <chris.horn@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/klnds/o2iblnd/o2iblnd.c    | 26 ++++++++++----------------
 net/lnet/klnds/o2iblnd/o2iblnd_cb.c |  7 ++-----
 2 files changed, 12 insertions(+), 21 deletions(-)

diff --git a/net/lnet/klnds/o2iblnd/o2iblnd.c b/net/lnet/klnds/o2iblnd/o2iblnd.c
index 6c7659c..c6a077b 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd.c
+++ b/net/lnet/klnds/o2iblnd/o2iblnd.c
@@ -454,18 +454,16 @@ static int kiblnd_get_peer_info(struct lnet_ni *ni, int index,
 
 static void kiblnd_del_peer_locked(struct kib_peer_ni *peer_ni)
 {
-	struct list_head *ctmp;
-	struct list_head *cnxt;
+	struct kib_conn *cnxt;
 	struct kib_conn *conn;
 
 	if (list_empty(&peer_ni->ibp_conns)) {
 		kiblnd_unlink_peer_locked(peer_ni);
 	} else {
-		list_for_each_safe(ctmp, cnxt, &peer_ni->ibp_conns) {
-			conn = list_entry(ctmp, struct kib_conn, ibc_list);
-
+		list_for_each_entry_safe(conn, cnxt, &peer_ni->ibp_conns,
+					 ibc_list)
 			kiblnd_close_conn_locked(conn, 0);
-		}
+
 		/* NB closing peer_ni's last conn unlinked it. */
 	}
 	/*
@@ -952,13 +950,11 @@ void kiblnd_destroy_conn(struct kib_conn *conn)
 int kiblnd_close_peer_conns_locked(struct kib_peer_ni *peer_ni, int why)
 {
 	struct kib_conn *conn;
-	struct list_head *ctmp;
-	struct list_head *cnxt;
+	struct kib_conn *cnxt;
 	int count = 0;
 
-	list_for_each_safe(ctmp, cnxt, &peer_ni->ibp_conns) {
-		conn = list_entry(ctmp, struct kib_conn, ibc_list);
-
+	list_for_each_entry_safe(conn, cnxt, &peer_ni->ibp_conns,
+				 ibc_list) {
 		CDEBUG(D_NET, "Closing conn -> %s, version: %x, reason: %d\n",
 		       libcfs_nid2str(peer_ni->ibp_nid),
 		       conn->ibc_version, why);
@@ -974,13 +970,11 @@ int kiblnd_close_stale_conns_locked(struct kib_peer_ni *peer_ni,
 				    int version, u64 incarnation)
 {
 	struct kib_conn *conn;
-	struct list_head *ctmp;
-	struct list_head *cnxt;
+	struct kib_conn *cnxt;
 	int count = 0;
 
-	list_for_each_safe(ctmp, cnxt, &peer_ni->ibp_conns) {
-		conn = list_entry(ctmp, struct kib_conn, ibc_list);
-
+	list_for_each_entry_safe(conn, cnxt, &peer_ni->ibp_conns,
+				 ibc_list) {
 		if (conn->ibc_version == version &&
 		    conn->ibc_incarnation == incarnation)
 			continue;
diff --git a/net/lnet/klnds/o2iblnd/o2iblnd_cb.c b/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
index 2c670a33..ba2f46f 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
+++ b/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
@@ -1982,15 +1982,12 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
 kiblnd_abort_txs(struct kib_conn *conn, struct list_head *txs)
 {
 	LIST_HEAD(zombies);
-	struct list_head *tmp;
-	struct list_head *nxt;
+	struct kib_tx *nxt;
 	struct kib_tx *tx;
 
 	spin_lock(&conn->ibc_lock);
 
-	list_for_each_safe(tmp, nxt, txs) {
-		tx = list_entry(tmp, struct kib_tx, tx_list);
-
+	list_for_each_entry_safe(tx, nxt, txs, tx_list) {
 		if (txs == &conn->ibc_active_txs) {
 			LASSERT(!tx->tx_queued);
 			LASSERT(tx->tx_waiting || tx->tx_sending);
-- 
1.8.3.1

* [lustre-devel] [PATCH 25/37] lnet: socklnd: use need_resched()
  2020-07-15 20:44 [lustre-devel] [PATCH 00/37] lustre: latest patches landed to OpenSFS 07/14/2020 James Simmons
                   ` (23 preceding siblings ...)
  2020-07-15 20:45 ` [lustre-devel] [PATCH 24/37] lnet: o2iblnd: Use list_for_each_entry_safe James Simmons
@ 2020-07-15 20:45 ` James Simmons
  2020-07-15 20:45 ` [lustre-devel] [PATCH 26/37] lnet: socklnd: use list_for_each_entry_safe() James Simmons
                   ` (11 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: James Simmons @ 2020-07-15 20:45 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.de>

Rather than using a counter to decide when to drop the lock and see
if we need to reschedule, use need_resched(), which is a precise
test instead of a guess.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12678
Lustre-commit: 3f848f85ba3d3 ("LU-12678 socklnd: use need_resched()")
Signed-off-by: Mr NeilBrown <neilb@suse.de>
Reviewed-on: https://review.whamcloud.com/39128
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Shaun Tancheff <shaun.tancheff@hpe.com>
Reviewed-by: Chris Horn <chris.horn@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/klnds/socklnd/socklnd.h    |  1 -
 net/lnet/klnds/socklnd/socklnd_cb.c | 12 +++---------
 2 files changed, 3 insertions(+), 10 deletions(-)

diff --git a/net/lnet/klnds/socklnd/socklnd.h b/net/lnet/klnds/socklnd/socklnd.h
index 0ac3637..0a0f0a7 100644
--- a/net/lnet/klnds/socklnd/socklnd.h
+++ b/net/lnet/klnds/socklnd/socklnd.h
@@ -55,7 +55,6 @@
 #define SOCKNAL_NSCHEDS_HIGH	(SOCKNAL_NSCHEDS << 1)
 
 #define SOCKNAL_PEER_HASH_BITS	7     /* # log2 of # of peer_ni lists */
-#define SOCKNAL_RESCHED		100   /* # scheduler loops before reschedule */
 #define SOCKNAL_INSANITY_RECONN	5000  /* connd is trying on reconn infinitely */
 #define SOCKNAL_ENOMEM_RETRY	1     /* seconds between retries */
 
diff --git a/net/lnet/klnds/socklnd/socklnd_cb.c b/net/lnet/klnds/socklnd/socklnd_cb.c
index 623478c..936054ee 100644
--- a/net/lnet/klnds/socklnd/socklnd_cb.c
+++ b/net/lnet/klnds/socklnd/socklnd_cb.c
@@ -1328,7 +1328,6 @@ int ksocknal_scheduler(void *arg)
 	struct ksock_conn *conn;
 	struct ksock_tx *tx;
 	int rc;
-	int nloops = 0;
 	long id = (long)arg;
 
 	sched = ksocknal_data.ksnd_schedulers[KSOCK_THREAD_CPT(id)];
@@ -1470,12 +1469,10 @@ int ksocknal_scheduler(void *arg)
 
 			did_something = 1;
 		}
-		if (!did_something ||		/* nothing to do */
-		    ++nloops == SOCKNAL_RESCHED) { /* hogging CPU? */
+		if (!did_something ||	/* nothing to do */
+		    need_resched()) {	/* hogging CPU? */
 			spin_unlock_bh(&sched->kss_lock);
 
-			nloops = 0;
-
 			if (!did_something) {   /* wait for something to do */
 				rc = wait_event_interruptible_exclusive(
 					sched->kss_waitq,
@@ -2080,7 +2077,6 @@ void ksocknal_write_callback(struct ksock_conn *conn)
 	spinlock_t *connd_lock = &ksocknal_data.ksnd_connd_lock;
 	struct ksock_connreq *cr;
 	wait_queue_entry_t wait;
-	int nloops = 0;
 	int cons_retry = 0;
 
 	init_waitqueue_entry(&wait, current);
@@ -2158,10 +2154,9 @@ void ksocknal_write_callback(struct ksock_conn *conn)
 		}
 
 		if (dropped_lock) {
-			if (++nloops < SOCKNAL_RESCHED)
+			if (!need_resched())
 				continue;
 			spin_unlock_bh(connd_lock);
-			nloops = 0;
 			cond_resched();
 			spin_lock_bh(connd_lock);
 			continue;
@@ -2173,7 +2168,6 @@ void ksocknal_write_callback(struct ksock_conn *conn)
 					 &wait);
 		spin_unlock_bh(connd_lock);
 
-		nloops = 0;
 		schedule_timeout(timeout);
 
 		remove_wait_queue(&ksocknal_data.ksnd_connd_waitq, &wait);
-- 
1.8.3.1

* [lustre-devel] [PATCH 26/37] lnet: socklnd: use list_for_each_entry_safe()
  2020-07-15 20:44 [lustre-devel] [PATCH 00/37] lustre: latest patches landed to OpenSFS 07/14/2020 James Simmons
                   ` (24 preceding siblings ...)
  2020-07-15 20:45 ` [lustre-devel] [PATCH 25/37] lnet: socklnd: use need_resched() James Simmons
@ 2020-07-15 20:45 ` James Simmons
  2020-07-15 20:45 ` [lustre-devel] [PATCH 27/37] lnet: socklnd: convert various refcounts to refcount_t James Simmons
                   ` (10 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: James Simmons @ 2020-07-15 20:45 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.de>

Several loops use list_for_each_safe() and then call list_entry()
as the first step.  These can be merged using
list_for_each_entry_safe().

In one case, the 'safe' version is clearly not needed, so just use
list_for_each_entry().

WC-bug-id: https://jira.whamcloud.com/browse/LU-12678
Lustre-commit: 03f375e9f6390 ("LU-12678 socklnd: use list_for_each_entry_safe()")
Signed-off-by: Mr NeilBrown <neilb@suse.de>
Reviewed-on: https://review.whamcloud.com/39129
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Chris Horn <chris.horn@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/klnds/socklnd/socklnd.c | 55 ++++++++++++++--------------------------
 1 file changed, 19 insertions(+), 36 deletions(-)

diff --git a/net/lnet/klnds/socklnd/socklnd.c b/net/lnet/klnds/socklnd/socklnd.c
index 2b8fd3d..2e11737 100644
--- a/net/lnet/klnds/socklnd/socklnd.c
+++ b/net/lnet/klnds/socklnd/socklnd.c
@@ -453,15 +453,12 @@ struct ksock_peer_ni *
 	struct ksock_peer_ni *peer_ni = route->ksnr_peer;
 	struct ksock_interface *iface;
 	struct ksock_conn *conn;
-	struct list_head *ctmp;
-	struct list_head *cnxt;
+	struct ksock_conn *cnxt;
 
 	LASSERT(!route->ksnr_deleted);
 
 	/* Close associated conns */
-	list_for_each_safe(ctmp, cnxt, &peer_ni->ksnp_conns) {
-		conn = list_entry(ctmp, struct ksock_conn, ksnc_list);
-
+	list_for_each_entry_safe(conn, cnxt, &peer_ni->ksnp_conns, ksnc_list) {
 		if (conn->ksnc_route != route)
 			continue;
 
@@ -548,9 +545,9 @@ struct ksock_peer_ni *
 ksocknal_del_peer_locked(struct ksock_peer_ni *peer_ni, u32 ip)
 {
 	struct ksock_conn *conn;
+	struct ksock_conn *cnxt;
 	struct ksock_route *route;
-	struct list_head *tmp;
-	struct list_head *nxt;
+	struct ksock_route *rnxt;
 	int nshared;
 
 	LASSERT(!peer_ni->ksnp_closing);
@@ -558,9 +555,8 @@ struct ksock_peer_ni *
 	/* Extra ref prevents peer_ni disappearing until I'm done with it */
 	ksocknal_peer_addref(peer_ni);
 
-	list_for_each_safe(tmp, nxt, &peer_ni->ksnp_routes) {
-		route = list_entry(tmp, struct ksock_route, ksnr_list);
-
+	list_for_each_entry_safe(route, rnxt, &peer_ni->ksnp_routes,
+				 ksnr_list) {
 		/* no match */
 		if (!(!ip || route->ksnr_ipaddr == ip))
 			continue;
@@ -571,29 +567,23 @@ struct ksock_peer_ni *
 	}
 
 	nshared = 0;
-	list_for_each_safe(tmp, nxt, &peer_ni->ksnp_routes) {
-		route = list_entry(tmp, struct ksock_route, ksnr_list);
+	list_for_each_entry(route, &peer_ni->ksnp_routes, ksnr_list)
 		nshared += route->ksnr_share_count;
-	}
 
 	if (!nshared) {
-		/*
-		 * remove everything else if there are no explicit entries
+		/* remove everything else if there are no explicit entries
 		 * left
 		 */
-		list_for_each_safe(tmp, nxt, &peer_ni->ksnp_routes) {
-			route = list_entry(tmp, struct ksock_route, ksnr_list);
-
+		list_for_each_entry_safe(route, rnxt, &peer_ni->ksnp_routes,
+					 ksnr_list) {
 			/* we should only be removing auto-entries */
 			LASSERT(!route->ksnr_share_count);
 			ksocknal_del_route_locked(route);
 		}
 
-		list_for_each_safe(tmp, nxt, &peer_ni->ksnp_conns) {
-			conn = list_entry(tmp, struct ksock_conn, ksnc_list);
-
+		list_for_each_entry_safe(conn, cnxt, &peer_ni->ksnp_conns,
+					 ksnc_list)
 			ksocknal_close_conn_locked(conn, 0);
-		}
 	}
 
 	ksocknal_peer_decref(peer_ni);
@@ -1752,13 +1742,10 @@ struct ksock_peer_ni *
 				 u32 ipaddr, int why)
 {
 	struct ksock_conn *conn;
-	struct list_head *ctmp;
-	struct list_head *cnxt;
+	struct ksock_conn *cnxt;
 	int count = 0;
 
-	list_for_each_safe(ctmp, cnxt, &peer_ni->ksnp_conns) {
-		conn = list_entry(ctmp, struct ksock_conn, ksnc_list);
-
+	list_for_each_entry_safe(conn, cnxt, &peer_ni->ksnp_conns, ksnc_list) {
 		if (!ipaddr || conn->ksnc_ipaddr == ipaddr) {
 			count++;
 			ksocknal_close_conn_locked(conn, why);
@@ -1992,10 +1979,10 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 ksocknal_peer_del_interface_locked(struct ksock_peer_ni *peer_ni,
 				   u32 ipaddr, int index)
 {
-	struct list_head *tmp;
-	struct list_head *nxt;
 	struct ksock_route *route;
+	struct ksock_route *rnxt;
 	struct ksock_conn *conn;
+	struct ksock_conn *cnxt;
 	int i;
 	int j;
 
@@ -2008,9 +1995,8 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 			break;
 		}
 
-	list_for_each_safe(tmp, nxt, &peer_ni->ksnp_routes) {
-		route = list_entry(tmp, struct ksock_route, ksnr_list);
-
+	list_for_each_entry_safe(route, rnxt, &peer_ni->ksnp_routes,
+				 ksnr_list) {
 		if (route->ksnr_myiface != index)
 			continue;
 
@@ -2022,12 +2008,9 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 		}
 	}
 
-	list_for_each_safe(tmp, nxt, &peer_ni->ksnp_conns) {
-		conn = list_entry(tmp, struct ksock_conn, ksnc_list);
-
+	list_for_each_entry_safe(conn, cnxt, &peer_ni->ksnp_conns, ksnc_list)
 		if (conn->ksnc_myipaddr == ipaddr)
 			ksocknal_close_conn_locked(conn, 0);
-	}
 }
 
 static int
-- 
1.8.3.1

* [lustre-devel] [PATCH 27/37] lnet: socklnd: convert various refcounts to refcount_t
  2020-07-15 20:44 [lustre-devel] [PATCH 00/37] lustre: latest patches landed to OpenSFS 07/14/2020 James Simmons
                   ` (25 preceding siblings ...)
  2020-07-15 20:45 ` [lustre-devel] [PATCH 26/37] lnet: socklnd: use list_for_each_entry_safe() James Simmons
@ 2020-07-15 20:45 ` James Simmons
  2020-07-15 20:45 ` [lustre-devel] [PATCH 28/37] lnet: libcfs: don't call unshare_fs_struct() James Simmons
                   ` (9 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: James Simmons @ 2020-07-15 20:45 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.de>

Each of these refcounts exactly follows the expectations of
refcount_t, so change each atomic_t to refcount_t.

We can remove the LASSERTs on incref/decref, as equivalent checks can
now be enabled at build time with CONFIG_REFCOUNT_FULL.

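For reference, a minimal sketch of the refcount_t pattern this
conversion applies (a generic example, not code from this series):

    struct foo {
            refcount_t f_ref;   /* set to 1 at creation with
                                 * refcount_set(&f->f_ref, 1) */
    };

    static void foo_get(struct foo *f)
    {
            /* WARNs and saturates on increment-from-zero or on
             * overflow, replacing the old LASSERT checks */
            refcount_inc(&f->f_ref);
    }

    static void foo_put(struct foo *f)
    {
            /* true only when the final reference is dropped */
            if (refcount_dec_and_test(&f->f_ref))
                    kfree(f);
    }
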
WC-bug-id: https://jira.whamcloud.com/browse/LU-12678
Lustre-commit: db3e51f612069 ("LU-12678 socklnd: convert various refcounts to refcount_t")
Signed-off-by: Mr NeilBrown <neilb@suse.de>
Reviewed-on: https://review.whamcloud.com/39130
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Shaun Tancheff <shaun.tancheff@hpe.com>
Reviewed-by: Chris Horn <chris.horn@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/klnds/socklnd/socklnd.c    | 28 ++++++++++++-------------
 net/lnet/klnds/socklnd/socklnd.h    | 41 +++++++++++++++----------------------
 net/lnet/klnds/socklnd/socklnd_cb.c |  6 +++---
 3 files changed, 33 insertions(+), 42 deletions(-)

diff --git a/net/lnet/klnds/socklnd/socklnd.c b/net/lnet/klnds/socklnd/socklnd.c
index 2e11737..22a73c3 100644
--- a/net/lnet/klnds/socklnd/socklnd.c
+++ b/net/lnet/klnds/socklnd/socklnd.c
@@ -123,7 +123,7 @@ static int ksocknal_ip2index(__u32 ipaddress, struct lnet_ni *ni)
 	if (!route)
 		return NULL;
 
-	atomic_set(&route->ksnr_refcount, 1);
+	refcount_set(&route->ksnr_refcount, 1);
 	route->ksnr_peer = NULL;
 	route->ksnr_retry_interval = 0;		/* OK to connect at any time */
 	route->ksnr_ipaddr = ipaddr;
@@ -142,7 +142,7 @@ static int ksocknal_ip2index(__u32 ipaddress, struct lnet_ni *ni)
 void
 ksocknal_destroy_route(struct ksock_route *route)
 {
-	LASSERT(!atomic_read(&route->ksnr_refcount));
+	LASSERT(!refcount_read(&route->ksnr_refcount));
 
 	if (route->ksnr_peer)
 		ksocknal_peer_decref(route->ksnr_peer);
@@ -174,7 +174,7 @@ static int ksocknal_ip2index(__u32 ipaddress, struct lnet_ni *ni)
 
 	peer_ni->ksnp_ni = ni;
 	peer_ni->ksnp_id = id;
-	atomic_set(&peer_ni->ksnp_refcount, 1);   /* 1 ref for caller */
+	refcount_set(&peer_ni->ksnp_refcount, 1);   /* 1 ref for caller */
 	peer_ni->ksnp_closing = 0;
 	peer_ni->ksnp_accepting = 0;
 	peer_ni->ksnp_proto = NULL;
@@ -198,7 +198,7 @@ static int ksocknal_ip2index(__u32 ipaddress, struct lnet_ni *ni)
 	CDEBUG(D_NET, "peer_ni %s %p deleted\n",
 	       libcfs_id2str(peer_ni->ksnp_id), peer_ni);
 
-	LASSERT(!atomic_read(&peer_ni->ksnp_refcount));
+	LASSERT(!refcount_read(&peer_ni->ksnp_refcount));
 	LASSERT(!peer_ni->ksnp_accepting);
 	LASSERT(list_empty(&peer_ni->ksnp_conns));
 	LASSERT(list_empty(&peer_ni->ksnp_routes));
@@ -235,7 +235,7 @@ struct ksock_peer_ni *
 
 		CDEBUG(D_NET, "got peer_ni [%p] -> %s (%d)\n",
 		       peer_ni, libcfs_id2str(id),
-		       atomic_read(&peer_ni->ksnp_refcount));
+		       refcount_read(&peer_ni->ksnp_refcount));
 		return peer_ni;
 	}
 	return NULL;
@@ -1069,10 +1069,10 @@ struct ksock_peer_ni *
 	 * 2 ref, 1 for conn, another extra ref prevents socket
 	 * being closed before establishment of connection
 	 */
-	atomic_set(&conn->ksnc_sock_refcount, 2);
+	refcount_set(&conn->ksnc_sock_refcount, 2);
 	conn->ksnc_type = type;
 	ksocknal_lib_save_callback(sock, conn);
-	atomic_set(&conn->ksnc_conn_refcount, 1); /* 1 ref for me */
+	refcount_set(&conn->ksnc_conn_refcount, 1); /* 1 ref for me */
 
 	conn->ksnc_rx_ready = 0;
 	conn->ksnc_rx_scheduled = 0;
@@ -1667,7 +1667,7 @@ struct ksock_peer_ni *
 {
 	/* Queue the conn for the reaper to destroy */
 
-	LASSERT(!atomic_read(&conn->ksnc_conn_refcount));
+	LASSERT(!refcount_read(&conn->ksnc_conn_refcount));
 	spin_lock_bh(&ksocknal_data.ksnd_reaper_lock);
 
 	list_add_tail(&conn->ksnc_list, &ksocknal_data.ksnd_zombie_conns);
@@ -1684,8 +1684,8 @@ struct ksock_peer_ni *
 	/* Final coup-de-grace of the reaper */
 	CDEBUG(D_NET, "connection %p\n", conn);
 
-	LASSERT(!atomic_read(&conn->ksnc_conn_refcount));
-	LASSERT(!atomic_read(&conn->ksnc_sock_refcount));
+	LASSERT(!refcount_read(&conn->ksnc_conn_refcount));
+	LASSERT(!refcount_read(&conn->ksnc_sock_refcount));
 	LASSERT(!conn->ksnc_sock);
 	LASSERT(!conn->ksnc_route);
 	LASSERT(!conn->ksnc_tx_scheduled);
@@ -2412,7 +2412,7 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 
 		CWARN("Active peer_ni on shutdown: %s, ref %d, closing %d, accepting %d, err %d, zcookie %llu, txq %d, zc_req %d\n",
 		      libcfs_id2str(peer_ni->ksnp_id),
-		      atomic_read(&peer_ni->ksnp_refcount),
+		      refcount_read(&peer_ni->ksnp_refcount),
 		      peer_ni->ksnp_closing,
 		      peer_ni->ksnp_accepting, peer_ni->ksnp_error,
 		      peer_ni->ksnp_zc_next_cookie,
@@ -2421,7 +2421,7 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 
 		list_for_each_entry(route, &peer_ni->ksnp_routes, ksnr_list) {
 			CWARN("Route: ref %d, schd %d, conn %d, cnted %d, del %d\n",
-			      atomic_read(&route->ksnr_refcount),
+			      refcount_read(&route->ksnr_refcount),
 			      route->ksnr_scheduled,
 			      route->ksnr_connecting,
 			      route->ksnr_connected,
@@ -2430,8 +2430,8 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 
 		list_for_each_entry(conn, &peer_ni->ksnp_conns, ksnc_list) {
 			CWARN("Conn: ref %d, sref %d, t %d, c %d\n",
-			      atomic_read(&conn->ksnc_conn_refcount),
-			      atomic_read(&conn->ksnc_sock_refcount),
+			      refcount_read(&conn->ksnc_conn_refcount),
+			      refcount_read(&conn->ksnc_sock_refcount),
 			      conn->ksnc_type, conn->ksnc_closing);
 		}
 		goto done;
diff --git a/net/lnet/klnds/socklnd/socklnd.h b/net/lnet/klnds/socklnd/socklnd.h
index 0a0f0a7..df863f2 100644
--- a/net/lnet/klnds/socklnd/socklnd.h
+++ b/net/lnet/klnds/socklnd/socklnd.h
@@ -37,6 +37,7 @@
 #include <linux/list.h>
 #include <linux/mm.h>
 #include <linux/module.h>
+#include <linux/refcount.h>
 #include <linux/stat.h>
 #include <linux/string.h>
 #include <linux/syscalls.h>
@@ -270,7 +271,7 @@ struct ksock_tx {				/* transmit packet */
 	struct list_head	tx_list;	/* queue on conn for transmission etc
 						 */
 	struct list_head	tx_zc_list;	/* queue on peer_ni for ZC request */
-	atomic_t		tx_refcount;	/* tx reference count */
+	refcount_t		tx_refcount;	/* tx reference count */
 	int			tx_nob;		/* # packet bytes */
 	int			tx_resid;	/* residual bytes */
 	int			tx_niov;	/* # packet iovec frags */
@@ -311,8 +312,8 @@ struct ksock_conn {
 	void		       *ksnc_saved_write_space;	/* socket's original
 							 * write_space() callback
 							 */
-	atomic_t		ksnc_conn_refcount;	/* conn refcount */
-	atomic_t		ksnc_sock_refcount;	/* sock refcount */
+	refcount_t		ksnc_conn_refcount;	/* conn refcount */
+	refcount_t		ksnc_sock_refcount;	/* sock refcount */
 	struct ksock_sched     *ksnc_scheduler;		/* who schedules this connection
 							 */
 	u32			ksnc_myipaddr;		/* my IP */
@@ -374,7 +375,7 @@ struct ksock_route {
 	struct list_head	ksnr_list;		/* chain on peer_ni route list */
 	struct list_head	ksnr_connd_list;	/* chain on ksnr_connd_routes */
 	struct ksock_peer_ni   *ksnr_peer;		/* owning peer_ni */
-	atomic_t		ksnr_refcount;		/* # users */
+	refcount_t		ksnr_refcount;		/* # users */
 	time64_t		ksnr_timeout;		/* when (in secs) reconnection
 							 * can happen next
 							 */
@@ -404,7 +405,7 @@ struct ksock_peer_ni {
 							 * alive
 							 */
 	struct lnet_process_id	ksnp_id;		/* who's on the other end(s) */
-	atomic_t		ksnp_refcount;		/* # users */
+	refcount_t		ksnp_refcount;		/* # users */
 	int			ksnp_closing;		/* being closed */
 	int			ksnp_accepting;		/* # passive connections pending
 							 */
@@ -510,8 +511,7 @@ struct ksock_proto {
 static inline void
 ksocknal_conn_addref(struct ksock_conn *conn)
 {
-	LASSERT(atomic_read(&conn->ksnc_conn_refcount) > 0);
-	atomic_inc(&conn->ksnc_conn_refcount);
+	refcount_inc(&conn->ksnc_conn_refcount);
 }
 
 void ksocknal_queue_zombie_conn(struct ksock_conn *conn);
@@ -520,8 +520,7 @@ struct ksock_proto {
 static inline void
 ksocknal_conn_decref(struct ksock_conn *conn)
 {
-	LASSERT(atomic_read(&conn->ksnc_conn_refcount) > 0);
-	if (atomic_dec_and_test(&conn->ksnc_conn_refcount))
+	if (refcount_dec_and_test(&conn->ksnc_conn_refcount))
 		ksocknal_queue_zombie_conn(conn);
 }
 
@@ -532,8 +531,7 @@ struct ksock_proto {
 
 	read_lock(&ksocknal_data.ksnd_global_lock);
 	if (!conn->ksnc_closing) {
-		LASSERT(atomic_read(&conn->ksnc_sock_refcount) > 0);
-		atomic_inc(&conn->ksnc_sock_refcount);
+		refcount_inc(&conn->ksnc_sock_refcount);
 		rc = 0;
 	}
 	read_unlock(&ksocknal_data.ksnd_global_lock);
@@ -544,8 +542,7 @@ struct ksock_proto {
 static inline void
 ksocknal_connsock_decref(struct ksock_conn *conn)
 {
-	LASSERT(atomic_read(&conn->ksnc_sock_refcount) > 0);
-	if (atomic_dec_and_test(&conn->ksnc_sock_refcount)) {
+	if (refcount_dec_and_test(&conn->ksnc_sock_refcount)) {
 		LASSERT(conn->ksnc_closing);
 		sock_release(conn->ksnc_sock);
 		conn->ksnc_sock = NULL;
@@ -556,8 +553,7 @@ struct ksock_proto {
 static inline void
 ksocknal_tx_addref(struct ksock_tx *tx)
 {
-	LASSERT(atomic_read(&tx->tx_refcount) > 0);
-	atomic_inc(&tx->tx_refcount);
+	refcount_inc(&tx->tx_refcount);
 }
 
 void ksocknal_tx_prep(struct ksock_conn *, struct ksock_tx *tx);
@@ -566,16 +562,14 @@ struct ksock_proto {
 static inline void
 ksocknal_tx_decref(struct ksock_tx *tx)
 {
-	LASSERT(atomic_read(&tx->tx_refcount) > 0);
-	if (atomic_dec_and_test(&tx->tx_refcount))
+	if (refcount_dec_and_test(&tx->tx_refcount))
 		ksocknal_tx_done(NULL, tx, 0);
 }
 
 static inline void
 ksocknal_route_addref(struct ksock_route *route)
 {
-	LASSERT(atomic_read(&route->ksnr_refcount) > 0);
-	atomic_inc(&route->ksnr_refcount);
+	refcount_inc(&route->ksnr_refcount);
 }
 
 void ksocknal_destroy_route(struct ksock_route *route);
@@ -583,16 +577,14 @@ struct ksock_proto {
 static inline void
 ksocknal_route_decref(struct ksock_route *route)
 {
-	LASSERT(atomic_read(&route->ksnr_refcount) > 0);
-	if (atomic_dec_and_test(&route->ksnr_refcount))
+	if (refcount_dec_and_test(&route->ksnr_refcount))
 		ksocknal_destroy_route(route);
 }
 
 static inline void
 ksocknal_peer_addref(struct ksock_peer_ni *peer_ni)
 {
-	LASSERT(atomic_read(&peer_ni->ksnp_refcount) > 0);
-	atomic_inc(&peer_ni->ksnp_refcount);
+	refcount_inc(&peer_ni->ksnp_refcount);
 }
 
 void ksocknal_destroy_peer(struct ksock_peer_ni *peer_ni);
@@ -600,8 +592,7 @@ struct ksock_proto {
 static inline void
 ksocknal_peer_decref(struct ksock_peer_ni *peer_ni)
 {
-	LASSERT(atomic_read(&peer_ni->ksnp_refcount) > 0);
-	if (atomic_dec_and_test(&peer_ni->ksnp_refcount))
+	if (refcount_dec_and_test(&peer_ni->ksnp_refcount))
 		ksocknal_destroy_peer(peer_ni);
 }
 
diff --git a/net/lnet/klnds/socklnd/socklnd_cb.c b/net/lnet/klnds/socklnd/socklnd_cb.c
index 936054ee..9b3b604 100644
--- a/net/lnet/klnds/socklnd/socklnd_cb.c
+++ b/net/lnet/klnds/socklnd/socklnd_cb.c
@@ -52,7 +52,7 @@ struct ksock_tx *
 	if (!tx)
 		return NULL;
 
-	atomic_set(&tx->tx_refcount, 1);
+	refcount_set(&tx->tx_refcount, 1);
 	tx->tx_zc_aborted = 0;
 	tx->tx_zc_capable = 0;
 	tx->tx_zc_checked = 0;
@@ -381,7 +381,7 @@ struct ksock_tx *
 				tx->tx_hstatus = LNET_MSG_STATUS_LOCAL_ERROR;
 		}
 
-		LASSERT(atomic_read(&tx->tx_refcount) == 1);
+		LASSERT(refcount_read(&tx->tx_refcount) == 1);
 		ksocknal_tx_done(ni, tx, error);
 	}
 }
@@ -1072,7 +1072,7 @@ struct ksock_route *
 	struct lnet_process_id *id;
 	int rc;
 
-	LASSERT(atomic_read(&conn->ksnc_conn_refcount) > 0);
+	LASSERT(refcount_read(&conn->ksnc_conn_refcount) > 0);
 
 	/* NB: sched lock NOT held */
 	/* SOCKNAL_RX_LNET_HEADER is here for backward compatibility */
-- 
1.8.3.1

* [lustre-devel] [PATCH 28/37] lnet: libcfs: don't call unshare_fs_struct()
  2020-07-15 20:44 [lustre-devel] [PATCH 00/37] lustre: latest patches landed to OpenSFS 07/14/2020 James Simmons
                   ` (26 preceding siblings ...)
  2020-07-15 20:45 ` [lustre-devel] [PATCH 27/37] lnet: socklnd: convert various refcounts to refcount_t James Simmons
@ 2020-07-15 20:45 ` James Simmons
  2020-07-15 20:45 ` [lustre-devel] [PATCH 29/37] lnet: Allow router to forward to healthier NID James Simmons
                   ` (8 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: James Simmons @ 2020-07-15 20:45 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.de>

A kthread runs with the same fs_struct as init.
It is only helpful to unshare this if the thread
will change one of the fields in the fs_struct:
  root directory
  current working directory
  umask.

No lustre kthread changes any of these, so there is
no need to call unshare_fs_struct().

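For contrast, a hedged sketch of a hypothetical kthread that would
need the call (illustrative only, not code from this series):

    static int chroot_worker(void *arg)
    {
            /* this thread is about to modify fs_struct state;
             * without unshare_fs_struct() the change would land in
             * init's fs_struct, shared by every other kthread */
            unshare_fs_struct();

            /* ... change root directory, cwd, or umask here ... */
            return 0;
    }
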
WC-bug-id: https://jira.whamcloud.com/browse/LU-9859
Lustre-commit: 9013eb2bb5492 ("LU-9859 libcfs: don't call unshare_fs_struct()")
Signed-off-by: Mr NeilBrown <neilb@suse.de>
Reviewed-on: https://review.whamcloud.com/39132
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Yang Sheng <ys@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/llog.c  | 2 --
 fs/lustre/ptlrpc/import.c  | 2 --
 fs/lustre/ptlrpc/ptlrpcd.c | 1 -
 fs/lustre/ptlrpc/service.c | 3 ---
 4 files changed, 8 deletions(-)

diff --git a/fs/lustre/obdclass/llog.c b/fs/lustre/obdclass/llog.c
index b2667d9..e172ebc 100644
--- a/fs/lustre/obdclass/llog.c
+++ b/fs/lustre/obdclass/llog.c
@@ -449,8 +449,6 @@ static int llog_process_thread_daemonize(void *arg)
 	struct lu_env env;
 	int rc;
 
-	unshare_fs_struct();
-
 	/* client env has no keys, tags is just 0 */
 	rc = lu_env_init(&env, LCT_LOCAL | LCT_MG_THREAD);
 	if (rc)
diff --git a/fs/lustre/ptlrpc/import.c b/fs/lustre/ptlrpc/import.c
index 1b62b81..1490dcf 100644
--- a/fs/lustre/ptlrpc/import.c
+++ b/fs/lustre/ptlrpc/import.c
@@ -1438,8 +1438,6 @@ static int ptlrpc_invalidate_import_thread(void *data)
 {
 	struct obd_import *imp = data;
 
-	unshare_fs_struct();
-
 	CDEBUG(D_HA, "thread invalidate import %s to %s@%s\n",
 	       imp->imp_obd->obd_name, obd2cli_tgt(imp->imp_obd),
 	       imp->imp_connection->c_remote_uuid.uuid);
diff --git a/fs/lustre/ptlrpc/ptlrpcd.c b/fs/lustre/ptlrpc/ptlrpcd.c
index 533f592..b0b81cc 100644
--- a/fs/lustre/ptlrpc/ptlrpcd.c
+++ b/fs/lustre/ptlrpc/ptlrpcd.c
@@ -393,7 +393,6 @@ static int ptlrpcd(void *arg)
 	int rc = 0;
 	int exit = 0;
 
-	unshare_fs_struct();
 	if (cfs_cpt_bind(cfs_cpt_tab, pc->pc_cpt) != 0)
 		CWARN("Failed to bind %s on CPT %d\n", pc->pc_name, pc->pc_cpt);
 
diff --git a/fs/lustre/ptlrpc/service.c b/fs/lustre/ptlrpc/service.c
index 4d5e6b3..5881e0a 100644
--- a/fs/lustre/ptlrpc/service.c
+++ b/fs/lustre/ptlrpc/service.c
@@ -2175,7 +2175,6 @@ static int ptlrpc_main(void *arg)
 
 	thread->t_task = current;
 	thread->t_pid = current->pid;
-	unshare_fs_struct();
 
 	if (svc->srv_cpt_bind) {
 		rc = cfs_cpt_bind(svc->srv_cptable, svcpt->scp_cpt);
@@ -2391,8 +2390,6 @@ static int ptlrpc_hr_main(void *arg)
 	if (!env)
 		return -ENOMEM;
 
-	unshare_fs_struct();
-
 	rc = cfs_cpt_bind(ptlrpc_hr.hr_cpt_table, hrp->hrp_cpt);
 	if (rc != 0) {
 		char threadname[20];
-- 
1.8.3.1

* [lustre-devel] [PATCH 29/37] lnet: Allow router to forward to healthier NID
  2020-07-15 20:44 [lustre-devel] [PATCH 00/37] lustre: latest patches landed to OpenSFS 07/14/2020 James Simmons
                   ` (27 preceding siblings ...)
  2020-07-15 20:45 ` [lustre-devel] [PATCH 28/37] lnet: libcfs: don't call unshare_fs_struct() James Simmons
@ 2020-07-15 20:45 ` James Simmons
  2020-07-15 20:45 ` [lustre-devel] [PATCH 30/37] lustre: llite: annotate non-owner locking James Simmons
                   ` (7 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: James Simmons @ 2020-07-15 20:45 UTC (permalink / raw)
  To: lustre-devel

From: Chris Horn <chris.horn@hpe.com>

When a final-hop router (aka edge router) is forwarding a message,
if both the originator and destination of the message are multi-rail
capable, then allow the router to choose a new destination lpni if
the one selected by the message originator is unhealthy or down.

HPE-bug-id: LUS-8905
WC-bug-id: https://jira.whamcloud.com/browse/LU-13606
Lustre-commit: b0e8ab1a5f6f8 ("LU-13606 lnet: Allow router to forward to healthier NID")
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Reviewed-on: https://review.whamcloud.com/38798
Reviewed-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-lnet.h |  4 ++--
 net/lnet/lnet/lib-move.c      | 37 +++++++++++++++++++++++++++++++++++--
 2 files changed, 37 insertions(+), 4 deletions(-)

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index 75c0da7..b069422 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -819,8 +819,8 @@ int lnet_get_peer_ni_info(u32 peer_index, u64 *nid,
 }
 
 /*
- * A peer is alive if it satisfies the following two conditions:
- *  1. peer health >= LNET_MAX_HEALTH_VALUE * router_sensitivity_percentage
+ * A peer NI is alive if it satisfies the following two conditions:
+ *  1. peer NI health >= LNET_MAX_HEALTH_VALUE * router_sensitivity_percentage
  *  2. the cached NI status received when we discover the peer is UP
  */
 static inline bool
diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index 2f3ef8c..234fbb5 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -2371,6 +2371,8 @@ struct lnet_ni *
 	int cpt, rc;
 	int md_cpt;
 	u32 send_case = 0;
+	bool final_hop;
+	bool mr_forwarding_allowed;
 
 	memset(&send_data, 0, sizeof(send_data));
 
@@ -2447,16 +2449,47 @@ struct lnet_ni *
 	else
 		send_case |= REMOTE_DST;
 
+	final_hop = false;
+	if (msg->msg_routing && (send_case & LOCAL_DST))
+		final_hop = true;
+
+	/* Determine whether to allow MR forwarding for this message.
+	 * NB: MR forwarding is allowed if the message originator and the
+	 * destination are both MR capable, and the destination lpni that was
+	 * originally chosen by the originator is unhealthy or down.
+	 * We check the MR capability of the destination further below
+	 */
+	mr_forwarding_allowed = false;
+	if (final_hop) {
+		struct lnet_peer *src_lp;
+		struct lnet_peer_ni *src_lpni;
+
+		src_lpni = lnet_nid2peerni_locked(msg->msg_hdr.src_nid,
+						  LNET_NID_ANY, cpt);
+		/* We don't fail the send if we hit any errors here. We'll just
+		 * try to send it via non-multi-rail criteria
+		 */
+		if (!IS_ERR(src_lpni)) {
+			src_lp = lpni->lpni_peer_net->lpn_peer;
+			if (lnet_peer_is_multi_rail(src_lp) &&
+			    !lnet_is_peer_ni_alive(lpni))
+				mr_forwarding_allowed = true;
+		}
+		CDEBUG(D_NET, "msg %p MR forwarding %s\n", msg,
+		       mr_forwarding_allowed ? "allowed" : "not allowed");
+	}
+
 	/* Deal with the peer as NMR in the following cases:
 	 * 1. the peer is NMR
 	 * 2. We're trying to recover a specific peer NI
-	 * 3. I'm a router sending to the final destination
+	 * 3. I'm a router sending to the final destination and MR forwarding is
+	 *    not allowed for this message (as determined above).
 	 *    In this case the source of the message would've
 	 *    already selected the final destination so my job
 	 *    is to honor the selection.
 	 */
 	if (!lnet_peer_is_multi_rail(peer) || msg->msg_recovery ||
-	    (msg->msg_routing && (send_case & LOCAL_DST)))
+	    (final_hop && !mr_forwarding_allowed))
 		send_case |= NMR_DST;
 	else
 		send_case |= MR_DST;
-- 
1.8.3.1

* [lustre-devel] [PATCH 30/37] lustre: llite: annotate non-owner locking
  2020-07-15 20:44 [lustre-devel] [PATCH 00/37] lustre: latest patches landed to OpenSFS 07/14/2020 James Simmons
                   ` (28 preceding siblings ...)
  2020-07-15 20:45 ` [lustre-devel] [PATCH 29/37] lnet: Allow router to forward to healthier NID James Simmons
@ 2020-07-15 20:45 ` James Simmons
  2020-07-15 20:45 ` [lustre-devel] [PATCH 31/37] lustre: osc: consume grants for direct I/O James Simmons
                   ` (6 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: James Simmons @ 2020-07-15 20:45 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.de>

The lli_lsm_sem locks taken by ll_prep_md_op_data() are sometimes
released by a different thread.  This confuses lockdep unless we
explain the situation.

So use down_read_non_owner() and up_read_non_owner().

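A minimal sketch of the cross-thread hand-off being annotated
(structure and field names are illustrative, not the lustre code):

    /* thread A: acquires the semaphore and records it for later */
    down_read_non_owner(&info->lsm_sem);
    op_data->sem = &info->lsm_sem;
    /* ... op_data is handed to another thread ... */

    /* thread B: releases a semaphore it never acquired; a plain
     * up_read() here would trip lockdep's owner tracking */
    up_read_non_owner(op_data->sem);
    op_data->sem = NULL;
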
WC-bug-id: https://jira.whamcloud.com/browse/LU-9679
Lustre-commit: f34392412fe22 ("LU-9679 llite: annotate non-owner locking")
Signed-off-by: Mr NeilBrown <neilb@suse.de>
Reviewed-on: https://review.whamcloud.com/39234
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Shaun Tancheff <shaun.tancheff@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/llite_lib.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index c62e182..f52d2b5 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -2783,12 +2783,12 @@ int ll_obd_statfs(struct inode *inode, void __user *arg)
 void ll_unlock_md_op_lsm(struct md_op_data *op_data)
 {
 	if (op_data->op_mea2_sem) {
-		up_read(op_data->op_mea2_sem);
+		up_read_non_owner(op_data->op_mea2_sem);
 		op_data->op_mea2_sem = NULL;
 	}
 
 	if (op_data->op_mea1_sem) {
-		up_read(op_data->op_mea1_sem);
+		up_read_non_owner(op_data->op_mea1_sem);
 		op_data->op_mea1_sem = NULL;
 	}
 }
@@ -2823,7 +2823,7 @@ struct md_op_data *ll_prep_md_op_data(struct md_op_data *op_data,
 	op_data->op_code = opc;
 
 	if (S_ISDIR(i1->i_mode)) {
-		down_read(&ll_i2info(i1)->lli_lsm_sem);
+		down_read_non_owner(&ll_i2info(i1)->lli_lsm_sem);
 		op_data->op_mea1_sem = &ll_i2info(i1)->lli_lsm_sem;
 		op_data->op_mea1 = ll_i2info(i1)->lli_lsm_md;
 		op_data->op_default_mea1 = ll_i2info(i1)->lli_default_lsm_md;
@@ -2833,7 +2833,10 @@ struct md_op_data *ll_prep_md_op_data(struct md_op_data *op_data,
 		op_data->op_fid2 = *ll_inode2fid(i2);
 		if (S_ISDIR(i2->i_mode)) {
 			if (i2 != i1) {
-				down_read(&ll_i2info(i2)->lli_lsm_sem);
+				/* i2 is typically a child of i1, and MUST be
+				 * further from the root to avoid deadlocks.
+				 */
+				down_read_non_owner(&ll_i2info(i2)->lli_lsm_sem);
 				op_data->op_mea2_sem =
 						&ll_i2info(i2)->lli_lsm_sem;
 			}
-- 
1.8.3.1

* [lustre-devel] [PATCH 31/37] lustre: osc: consume grants for direct I/O
  2020-07-15 20:44 [lustre-devel] [PATCH 00/37] lustre: latest patches landed to OpenSFS 07/14/2020 James Simmons
                   ` (29 preceding siblings ...)
  2020-07-15 20:45 ` [lustre-devel] [PATCH 30/37] lustre: llite: annotate non-owner locking James Simmons
@ 2020-07-15 20:45 ` James Simmons
  2020-07-15 20:45 ` [lustre-devel] [PATCH 32/37] lnet: remove LNetMEUnlink and clean up related code James Simmons
                   ` (5 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: James Simmons @ 2020-07-15 20:45 UTC (permalink / raw)
  To: lustre-devel

From: Vladimir Saveliev <c17830@cray.com>

The new IO engine implementation stopped consuming grants for direct
I/O writes. That led to a premature out-of-space condition during
direct I/O. The following illustrates the problem:
  # OSTSIZE=100000 sh llmount.sh
  # dd if=/dev/zero of=/mnt/lustre/file bs=4k count=100 oflag=direct
  dd: error writing '/mnt/lustre/file': No space left on device

Consume grants for direct I/O.

Try to consume grants in osc_queue_sync_pages() when it is called for
pages that are being written in direct I/O.

Tests are added to verify grant consumption in buffered and direct I/O
and to verify direct I/O overwrite when the OST is full.
The overwrite test is for ldiskfs only, as ZFS is unable to overwrite
when it is full.

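A worked example of the grant arithmetic added below, under assumed
values (PAGE_SHIFT = 12 and cl_chunkbits = 16, i.e. 4KiB pages and
64KiB chunks):

    int ppc = 1 << (16 - 12);            /* 16 pages per chunk */
    int page_count = 100;                /* as in the dd reproducer */

    /* 100 pages round up to (100 + 15) / 16 = 7 chunks */
    int grants = grant_extent_tax +      /* cli->cl_grant_extent_tax */
                 (1 << 16) * ((page_count + ppc - 1) / ppc);
    /* i.e. the per-extent tax plus 7 * 64KiB of grant space */
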
Cray-bug-id: LUS-7036
WC-bug-id: https://jira.whamcloud.com/browse/LU-12687
Lustre-commit: 05f326a7988a7a ("LU-12687 osc: consume grants for direct I/O")
Signed-off-by: Vladimir Saveliev <c17830@cray.com>
Reviewed-on: https://review.whamcloud.com/35896
Reviewed-by: Wang Shilong <wshilong@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/osc/osc_cache.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/fs/lustre/osc/osc_cache.c b/fs/lustre/osc/osc_cache.c
index fe03c0d..c7aaabb 100644
--- a/fs/lustre/osc/osc_cache.c
+++ b/fs/lustre/osc/osc_cache.c
@@ -2692,6 +2692,28 @@ int osc_queue_sync_pages(const struct lu_env *env, const struct cl_io *io,
 	ext->oe_srvlock = !!(brw_flags & OBD_BRW_SRVLOCK);
 	ext->oe_ndelay = !!(brw_flags & OBD_BRW_NDELAY);
 	ext->oe_dio = !!(brw_flags & OBD_BRW_NOCACHE);
+	if (ext->oe_dio && !ext->oe_rw) { /* direct io write */
+		int grants;
+		int ppc;
+
+		ppc = 1 << (cli->cl_chunkbits - PAGE_SHIFT);
+		grants = cli->cl_grant_extent_tax;
+		grants += (1 << cli->cl_chunkbits) *
+			  ((page_count + ppc - 1) / ppc);
+
+		spin_lock(&cli->cl_loi_list_lock);
+		if (osc_reserve_grant(cli, grants) == 0) {
+			list_for_each_entry(oap, list, oap_pending_item) {
+				osc_consume_write_grant(cli,
+							&oap->oap_brw_page);
+				atomic_long_inc(&obd_dirty_pages);
+			}
+			osc_unreserve_grant_nolock(cli, grants, 0);
+			ext->oe_grants = grants;
+		}
+		spin_unlock(&cli->cl_loi_list_lock);
+	}
+
 	ext->oe_is_rdma_only = !!(brw_flags & OBD_BRW_RDMA_ONLY);
 	ext->oe_nr_pages = page_count;
 	ext->oe_mppr = mppr;
-- 
1.8.3.1

* [lustre-devel] [PATCH 32/37] lnet: remove LNetMEUnlink and clean up related code
  2020-07-15 20:44 [lustre-devel] [PATCH 00/37] lustre: latest patches landed to OpenSFS 07/14/2020 James Simmons
                   ` (30 preceding siblings ...)
  2020-07-15 20:45 ` [lustre-devel] [PATCH 31/37] lustre: osc: consume grants for direct I/O James Simmons
@ 2020-07-15 20:45 ` James Simmons
  2020-07-15 20:45 ` [lustre-devel] [PATCH 33/37] lnet: Set remote NI status in lnet_notify James Simmons
                   ` (4 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: James Simmons @ 2020-07-15 20:45 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.de>

LNetMEUnlink is not particularly useful, and exposing it as an LNet
interface only provides the opportunity for it to be misused.

Every successful call to LNetMEAttach() is followed by a call to
LNetMDAttach().  If that call succeeds, the ME is owned by
the MD and the caller mustn't touch it again.
If the call fails, the caller is currently required to call
LNetMEUnlink(), which all callers do, and these are the only places
that LNetMEUnlink() are called.

As LNetMDAttach() knows when it will fail, it can unlink the ME itself
and save the caller the effort.
This allows LNetMEUnlink() to be removed which simplifies
the LNet interface.

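The resulting caller pattern looks roughly like this (a sketch with
the LNetMEAttach() argument list abbreviated):

    me = LNetMEAttach(portal, match_id, match_bits, ignore_bits,
                      LNET_UNLINK, LNET_INS_AFTER);
    if (IS_ERR(me))
            return PTR_ERR(me);

    rc = LNetMDAttach(me, &md, LNET_UNLINK, &mdh);
    if (rc) {
            /* LNetMDAttach() has already unlinked and freed the
             * ME; the old LNetMEUnlink(me) call is gone */
            return rc;
    }
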
LNetMEUnlink() is also used in ptl_send_rpc() in a situation where
ptl_send_buf() fails.  In this case both the ME and the MD need to be
unlinked; as they are interconnected, LNetMEUnlink() or
LNetMDUnlink() can equally do the job.  So change it to use
LNetMDUnlink().

LNetMEUnlink() is primarily a call to lnet_me_unlink(). It also
 - has some handling if ->me_md is not NULL, but that is never the
   case
 - takes the lnet_res_lock().  However LNetMDAttach() already
   takes that lock.
So none of this functionality is useful to LNetMDAttach().
On failure, it can call lnet_me_unlink() directly while ensuring
it still has the lock.

This patch:
 - moves the calls to lnet_md_validate() into lnet_md_build()
 - changes LNetMDAttach() to always take the lnet_res_lock(),
   and to call lnet_me_unlink() on failure.
 - removes all calls to LNetMEUnlink() and sometimes simplifies
   surrounding code.
 - changes lnet_md_link() to 'void' as it only ever returns
   '0', and thus simplify error handling in LNetMDAttach()
   and LNetMDBind()

WC-bug-id: https://jira.whamcloud.com/browse/LU-12678
Lustre-commit: e17ee2296c201 ("LU-12678 lnet: remove LNetMEUnlink and clean up related code")
Signed-off-by: Mr NeilBrown <neilb@suse.de>
Reviewed-on: https://review.whamcloud.com/38646
Reviewed-by: Yang Sheng <ys@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/niobuf.c | 12 +++------
 include/linux/lnet/api.h  |  6 ++---
 net/lnet/lnet/api-ni.c    |  5 +---
 net/lnet/lnet/lib-md.c    | 62 +++++++++++++++--------------------------------
 net/lnet/lnet/lib-me.c    | 39 -----------------------------
 net/lnet/selftest/rpc.c   |  1 -
 6 files changed, 26 insertions(+), 99 deletions(-)

diff --git a/fs/lustre/ptlrpc/niobuf.c b/fs/lustre/ptlrpc/niobuf.c
index 6fb79a2..924b9c4 100644
--- a/fs/lustre/ptlrpc/niobuf.c
+++ b/fs/lustre/ptlrpc/niobuf.c
@@ -203,7 +203,6 @@ static int ptlrpc_register_bulk(struct ptlrpc_request *req)
 			CERROR("%s: LNetMDAttach failed x%llu/%d: rc = %d\n",
 			       desc->bd_import->imp_obd->obd_name, mbits,
 			       posted_md, rc);
-			LNetMEUnlink(me);
 			break;
 		}
 	}
@@ -676,7 +675,7 @@ int ptl_send_rpc(struct ptlrpc_request *request, int noreply)
 			request->rq_receiving_reply = 0;
 			spin_unlock(&request->rq_lock);
 			rc = -ENOMEM;
-			goto cleanup_me;
+			goto cleanup_bulk;
 		}
 		percpu_ref_get(&ptlrpc_pending);
 
@@ -720,12 +719,8 @@ int ptl_send_rpc(struct ptlrpc_request *request, int noreply)
 	if (noreply)
 		goto out;
 
-cleanup_me:
-	/* MEUnlink is safe; the PUT didn't even get off the ground, and
-	 * nobody apart from the PUT's target has the right nid+XID to
-	 * access the reply buffer.
-	 */
-	LNetMEUnlink(reply_me);
+	LNetMDUnlink(request->rq_reply_md_h);
+
 	/* UNLINKED callback called synchronously */
 	LASSERT(!request->rq_receiving_reply);
 
@@ -802,7 +797,6 @@ int ptlrpc_register_rqbd(struct ptlrpc_request_buffer_desc *rqbd)
 
 	CERROR("ptlrpc: LNetMDAttach failed: rc = %d\n", rc);
 	LASSERT(rc == -ENOMEM);
-	LNetMEUnlink(me);
 	rqbd->rqbd_refcount = 0;
 
 	return -ENOMEM;
diff --git a/include/linux/lnet/api.h b/include/linux/lnet/api.h
index 24115eb..95805de 100644
--- a/include/linux/lnet/api.h
+++ b/include/linux/lnet/api.h
@@ -90,8 +90,8 @@
  * list is a chain of MEs. Each ME includes a pointer to a memory descriptor
  * and a set of match criteria. The match criteria can be used to reject
  * incoming requests based on process ID or the match bits provided in the
- * request. MEs can be dynamically inserted into a match list by LNetMEAttach()
- * and removed from its list by LNetMEUnlink().
+ * request. MEs can be dynamically inserted into a match list by LNetMEAttach(),
+ * and must then be attached to an MD with LNetMDAttach().
  * @{
  */
 struct lnet_me *
@@ -101,8 +101,6 @@ struct lnet_me *
 	     u64 ignore_bits_in,
 	     enum lnet_unlink unlink_in,
 	     enum lnet_ins_pos pos_in);
-
-void LNetMEUnlink(struct lnet_me *current_in);
 /** @} lnet_me */
 
 /** \defgroup lnet_md Memory descriptors
diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index 3e69435..5f35468 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -1645,14 +1645,12 @@ struct lnet_ping_buffer *
 	rc = LNetMDAttach(me, &md, LNET_RETAIN, ping_mdh);
 	if (rc) {
 		CERROR("Can't attach ping target MD: %d\n", rc);
-		goto fail_unlink_ping_me;
+		goto fail_decref_ping_buffer;
 	}
 	lnet_ping_buffer_addref(*ppbuf);
 
 	return 0;
 
-fail_unlink_ping_me:
-	LNetMEUnlink(me);
 fail_decref_ping_buffer:
 	LASSERT(atomic_read(&(*ppbuf)->pb_refcnt) == 1);
 	lnet_ping_buffer_decref(*ppbuf);
@@ -1855,7 +1853,6 @@ int lnet_push_target_post(struct lnet_ping_buffer *pbuf,
 	rc = LNetMDAttach(me, &md, LNET_UNLINK, mdhp);
 	if (rc) {
 		CERROR("Can't attach push MD: %d\n", rc);
-		LNetMEUnlink(me);
 		lnet_ping_buffer_decref(pbuf);
 		pbuf->pb_needs_post = true;
 		return rc;
diff --git a/net/lnet/lnet/lib-md.c b/net/lnet/lnet/lib-md.c
index e80dc6f..48249f3 100644
--- a/net/lnet/lnet/lib-md.c
+++ b/net/lnet/lnet/lib-md.c
@@ -123,6 +123,8 @@ int lnet_cpt_of_md(struct lnet_libmd *md, unsigned int offset)
 	return cpt;
 }
 
+static int lnet_md_validate(const struct lnet_md *umd);
+
 static struct lnet_libmd *
 lnet_md_build(const struct lnet_md *umd, int unlink)
 {
@@ -132,6 +134,9 @@ int lnet_cpt_of_md(struct lnet_libmd *md, unsigned int offset)
 	struct lnet_libmd *lmd;
 	unsigned int size;
 
+	if (lnet_md_validate(umd) != 0)
+		return ERR_PTR(-EINVAL);
+
 	if (umd->options & LNET_MD_KIOV)
 		niov = umd->length;
 	else
@@ -228,15 +233,14 @@ int lnet_cpt_of_md(struct lnet_libmd *md, unsigned int offset)
 }
 
 /* must be called with resource lock held */
-static int
+static void
 lnet_md_link(struct lnet_libmd *md, lnet_handler_t handler, int cpt)
 {
 	struct lnet_res_container *container = the_lnet.ln_md_containers[cpt];
 
 	/*
 	 * NB we are passed an allocated, but inactive md.
-	 * if we return success, caller may lnet_md_unlink() it.
-	 * otherwise caller may only kfree() it.
+	 * Caller may lnet_md_unlink() it, or may lnet_md_free() it.
 	 */
 	/*
 	 * This implementation doesn't know how to create START events or
@@ -255,8 +259,6 @@ int lnet_cpt_of_md(struct lnet_libmd *md, unsigned int offset)
 
 	LASSERT(list_empty(&md->md_list));
 	list_add(&md->md_list, &container->rec_active);
-
-	return 0;
 }
 
 /* must be called with lnet_res_lock held */
@@ -304,14 +306,11 @@ int lnet_cpt_of_md(struct lnet_libmd *md, unsigned int offset)
  * @handle	On successful returns, a handle to the newly created MD is
  *		saved here. This handle can be used later in LNetMDUnlink().
  *
+ * The ME will either be linked to the new MD, or it will be freed.
+ *
  * Return:	0 on success.
  *		-EINVAL If @umd is not valid.
  *		-ENOMEM If new MD cannot be allocated.
- *		-ENOENT Either @me or @umd.handle does not point to a
- *		valid object. Note that it's OK to supply a NULL @umd.handle
- *		by calling LNetInvalidateHandle() on it.
- *		-EBUSY if the ME pointed to by @me is already associated with
- *		a MD.
  */
 int
 LNetMDAttach(struct lnet_me *me, const struct lnet_md *umd,
@@ -321,33 +320,27 @@ int lnet_cpt_of_md(struct lnet_libmd *md, unsigned int offset)
 	LIST_HEAD(drops);
 	struct lnet_libmd *md;
 	int cpt;
-	int rc;
 
 	LASSERT(the_lnet.ln_refcount > 0);
 
-	if (lnet_md_validate(umd))
-		return -EINVAL;
+	LASSERT(!me->me_md);
 
 	if (!(umd->options & (LNET_MD_OP_GET | LNET_MD_OP_PUT))) {
 		CERROR("Invalid option: no MD_OP set\n");
-		return -EINVAL;
-	}
-
-	md = lnet_md_build(umd, unlink);
-	if (IS_ERR(md))
-		return PTR_ERR(md);
+		md = ERR_PTR(-EINVAL);
+	} else
+		md = lnet_md_build(umd, unlink);
 
 	cpt = me->me_cpt;
-
 	lnet_res_lock(cpt);
 
-	if (me->me_md)
-		rc = -EBUSY;
-	else
-		rc = lnet_md_link(md, umd->handler, cpt);
+	if (IS_ERR(md)) {
+		lnet_me_unlink(me);
+		lnet_res_unlock(cpt);
+		return PTR_ERR(md);
+	}
 
-	if (rc)
-		goto out_unlock;
+	lnet_md_link(md, umd->handler, cpt);
 
 	/*
 	 * attach this MD to portal of ME and check if it matches any
@@ -363,11 +356,6 @@ int lnet_cpt_of_md(struct lnet_libmd *md, unsigned int offset)
 	lnet_recv_delayed_msg_list(&matches);
 
 	return 0;
-
-out_unlock:
-	lnet_res_unlock(cpt);
-	kfree(md);
-	return rc;
 }
 EXPORT_SYMBOL(LNetMDAttach);
 
@@ -383,9 +371,6 @@ int lnet_cpt_of_md(struct lnet_libmd *md, unsigned int offset)
  * Return:		0 On success.
  *			-EINVAL If @umd is not valid.
  *			-ENOMEM If new MD cannot be allocated.
- *			-ENOENT @umd.handle does not point to a valid EQ.
- *			Note that it's OK to supply a NULL @umd.handle by
- *			calling LNetInvalidateHandle() on it.
  */
 int
 LNetMDBind(const struct lnet_md *umd, enum lnet_unlink unlink,
@@ -397,9 +382,6 @@ int lnet_cpt_of_md(struct lnet_libmd *md, unsigned int offset)
 
 	LASSERT(the_lnet.ln_refcount > 0);
 
-	if (lnet_md_validate(umd))
-		return -EINVAL;
-
 	if ((umd->options & (LNET_MD_OP_GET | LNET_MD_OP_PUT))) {
 		CERROR("Invalid option: GET|PUT illegal on active MDs\n");
 		return -EINVAL;
@@ -418,17 +400,13 @@ int lnet_cpt_of_md(struct lnet_libmd *md, unsigned int offset)
 
 	cpt = lnet_res_lock_current();
 
-	rc = lnet_md_link(md, umd->handler, cpt);
-	if (rc)
-		goto out_unlock;
+	lnet_md_link(md, umd->handler, cpt);
 
 	lnet_md2handle(handle, md);
 
 	lnet_res_unlock(cpt);
 	return 0;
 
-out_unlock:
-	lnet_res_unlock(cpt);
 out_free:
 	kfree(md);
 
diff --git a/net/lnet/lnet/lib-me.c b/net/lnet/lnet/lib-me.c
index 14ab21f..f75f3cb 100644
--- a/net/lnet/lnet/lib-me.c
+++ b/net/lnet/lnet/lib-me.c
@@ -118,45 +118,6 @@ struct lnet_me *
 }
 EXPORT_SYMBOL(LNetMEAttach);
 
-/**
- * Unlink a match entry from its match list.
- *
- * This operation also releases any resources associated with the ME. If a
- * memory descriptor is attached to the ME, then it will be unlinked as well
- * and an unlink event will be generated. It is an error to use the ME handle
- * after calling LNetMEUnlink().
- *
- * @me		The ME to be unlinked.
- *
- * \see LNetMDUnlink() for the discussion on delivering unlink event.
- */
-void
-LNetMEUnlink(struct lnet_me *me)
-{
-	struct lnet_libmd *md;
-	struct lnet_event ev;
-	int cpt;
-
-	LASSERT(the_lnet.ln_refcount > 0);
-
-	cpt = me->me_cpt;
-	lnet_res_lock(cpt);
-
-	md = me->me_md;
-	if (md) {
-		md->md_flags |= LNET_MD_FLAG_ABORTED;
-		if (md->md_handler && !md->md_refcount) {
-			lnet_build_unlink_event(md, &ev);
-			md->md_handler(&ev);
-		}
-	}
-
-	lnet_me_unlink(me);
-
-	lnet_res_unlock(cpt);
-}
-EXPORT_SYMBOL(LNetMEUnlink);
-
 /* call with lnet_res_lock please */
 void
 lnet_me_unlink(struct lnet_me *me)
diff --git a/net/lnet/selftest/rpc.c b/net/lnet/selftest/rpc.c
index 799ad99..a72e485 100644
--- a/net/lnet/selftest/rpc.c
+++ b/net/lnet/selftest/rpc.c
@@ -383,7 +383,6 @@ struct srpc_bulk *
 		CERROR("LNetMDAttach failed: %d\n", rc);
 		LASSERT(rc == -ENOMEM);
 
-		LNetMEUnlink(me);
 		return -ENOMEM;
 	}
 
-- 
1.8.3.1

* [lustre-devel] [PATCH 33/37] lnet: Set remote NI status in lnet_notify
  2020-07-15 20:44 [lustre-devel] [PATCH 00/37] lustre: latest patches landed to OpenSFS 07/14/2020 James Simmons
                   ` (31 preceding siblings ...)
  2020-07-15 20:45 ` [lustre-devel] [PATCH 32/37] lnet: remove LNetMEUnlink and clean up related code James Simmons
@ 2020-07-15 20:45 ` James Simmons
  2020-07-15 20:45 ` [lustre-devel] [PATCH 34/37] lustre: ptlrpc: fix endless loop issue James Simmons
                   ` (3 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: James Simmons @ 2020-07-15 20:45 UTC (permalink / raw)
  To: lustre-devel

From: Chris Horn <chris.horn@hpe.com>

The gnilnd receives node health information asynchronously from any
tx failure, so the aliveness of an lpni as reported by
lnet_is_peer_ni_alive() may not match what the LND is telling us. Use
the existing reset flag to set the cached NI status down so we can be
sure that remote NIs are correctly marked down.

HPE-bug-id: LUS-8897
WC-bug-id: https://jira.whamcloud.com/browse/LU-13648
Lustre-commit: 8010dbb660766 ("LU-13648 lnet: Set remote NI status in lnet_notify")
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Reviewed-on: https://review.whamcloud.com/38862
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/router.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/lnet/lnet/router.c b/net/lnet/lnet/router.c
index c0578d9..e3b3e71 100644
--- a/net/lnet/lnet/router.c
+++ b/net/lnet/lnet/router.c
@@ -1671,8 +1671,7 @@ bool lnet_router_checker_active(void)
 
 	CDEBUG(D_NET, "%s notifying %s: %s\n",
 	       !ni ? "userspace" : libcfs_nid2str(ni->ni_nid),
-	       libcfs_nid2str(nid),
-	       alive ? "up" : "down");
+	       libcfs_nid2str(nid), alive ? "up" : "down");
 
 	if (ni &&
 	    LNET_NIDNET(ni->ni_nid) != LNET_NIDNET(nid)) {
@@ -1714,6 +1713,7 @@ bool lnet_router_checker_active(void)
 
 	if (alive) {
 		if (reset) {
+			lpni->lpni_ns_status = LNET_NI_STATUS_UP;
 			lnet_set_lpni_healthv_locked(lpni,
 						     LNET_MAX_HEALTH_VALUE);
 		} else {
@@ -1726,6 +1726,8 @@ bool lnet_router_checker_active(void)
 						     (sensitivity) ? sensitivity :
 						     lnet_health_sensitivity);
 		}
+	} else if (reset) {
+		lpni->lpni_ns_status = LNET_NI_STATUS_DOWN;
 	}
 
 	/* recalculate aliveness */
-- 
1.8.3.1

* [lustre-devel] [PATCH 34/37] lustre: ptlrpc: fix endless loop issue
  2020-07-15 20:44 [lustre-devel] [PATCH 00/37] lustre: latest patches landed to OpenSFS 07/14/2020 James Simmons
                   ` (32 preceding siblings ...)
  2020-07-15 20:45 ` [lustre-devel] [PATCH 33/37] lnet: Set remote NI status in lnet_notify James Simmons
@ 2020-07-15 20:45 ` James Simmons
  2020-07-15 20:45 ` [lustre-devel] [PATCH 35/37] lustre: llite: fix short io for AIO James Simmons
                   ` (2 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: James Simmons @ 2020-07-15 20:45 UTC (permalink / raw)
  To: lustre-devel

From: Hongchao Zhang <hongchao@whamcloud.com>

In ptlrpc_pinger_main(), if pinging the recoverable clients takes
too long, the thread can get stuck in an endless loop because of the
negative value returned by pinger_check_timeout().

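A sketch of the failure mode (simplified; see the diff below for the
real loop):

    /* before the fix, this_ping was sampled only once, here */
    time64_t this_ping = ktime_get_seconds();

    do {
            /* ... ping all imports; this pass can take longer
             * than PING_INTERVAL on a busy node ... */

            /* with a stale this_ping the computed wait can go
             * negative, so the thread never sleeps */
            time_to_next_wake = pinger_check_timeout(this_ping);
    } while (time_to_next_wake <= 0);
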
WC-bug-id: https://jira.whamcloud.com/browse/LU-13667
Lustre-commit: 6be2dbb259512 ("LU-13667 ptlrpc: fix endless loop issue")
Signed-off-by: Hongchao Zhang <hongchao@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/38915
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Olaf Faaland-LLNL <faaland1@llnl.gov>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/pinger.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/ptlrpc/pinger.c b/fs/lustre/ptlrpc/pinger.c
index ec4c51a..9f57c61 100644
--- a/fs/lustre/ptlrpc/pinger.c
+++ b/fs/lustre/ptlrpc/pinger.c
@@ -258,12 +258,13 @@ static void ptlrpc_pinger_process_import(struct obd_import *imp,
 
 static void ptlrpc_pinger_main(struct work_struct *ws)
 {
-	time64_t this_ping = ktime_get_seconds();
-	time64_t time_to_next_wake;
+	time64_t this_ping, time_after_ping, time_to_next_wake;
 	struct timeout_item *item;
 	struct obd_import *imp;
 
 	do {
+		this_ping = ktime_get_seconds();
+
 		mutex_lock(&pinger_mutex);
 		list_for_each_entry(item, &timeout_list, ti_chain) {
 			item->ti_cb(item, item->ti_cb_data);
@@ -277,6 +278,12 @@ static void ptlrpc_pinger_main(struct work_struct *ws)
 		}
 		mutex_unlock(&pinger_mutex);
 
+		time_after_ping = ktime_get_seconds();
+
+		if ((ktime_get_seconds() - this_ping - 3) > PING_INTERVAL)
+			CDEBUG(D_HA, "long time to ping: %lld, %lld, %lld\n",
+			       this_ping, time_after_ping, ktime_get_seconds());
+
 		/* Wait until the next ping time, or until we're stopped. */
 		time_to_next_wake = pinger_check_timeout(this_ping);
 		/* The ping sent by ptlrpc_send_rpc may get sent out
-- 
1.8.3.1

* [lustre-devel] [PATCH 35/37] lustre: llite: fix short io for AIO
  2020-07-15 20:44 [lustre-devel] [PATCH 00/37] lustre: latest patches landed to OpenSFS 07/14/2020 James Simmons
                   ` (33 preceding siblings ...)
  2020-07-15 20:45 ` [lustre-devel] [PATCH 34/37] lustre: ptlrpc: fix endless loop issue James Simmons
@ 2020-07-15 20:45 ` James Simmons
  2020-07-15 20:45 ` [lustre-devel] [PATCH 36/37] lnet: socklnd: change ksnd_nthreads to atomic_t James Simmons
  2020-07-15 20:45 ` [lustre-devel] [PATCH 37/37] lnet: check rtr_nid is a gateway James Simmons
  36 siblings, 0 replies; 38+ messages in thread
From: James Simmons @ 2020-07-15 20:45 UTC (permalink / raw)
  To: lustre-devel

From: Wang Shilong <wshilong@ddn.com>

The problem is that AIO currently cannot handle an I/O size larger
than the stripe size:

We need the cl_io loop to handle I/O that spans stripes, but since
-EIOCBQUEUED is returned for AIO, the loop stops early and a short
I/O results.

Fix the problem by making the IO engine aware of this special error
so that it proceeds to finish all IO requests.

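As a hedged illustration of the user-visible symptom, a userspace
reproducer sketch (libaio; assumes a 1MiB stripe size, so the 4MiB
request spans stripes and previously completed short):

    #define _GNU_SOURCE             /* for O_DIRECT */
    #include <fcntl.h>
    #include <libaio.h>
    #include <stdlib.h>

    int main(void)
    {
            io_context_t ctx = 0;
            struct iocb cb, *cbp = &cb;
            struct io_event ev;
            size_t len = 4 << 20;   /* larger than one stripe */
            void *buf;
            int fd = open("/mnt/lustre/file",
                          O_WRONLY | O_CREAT | O_DIRECT, 0644);

            posix_memalign(&buf, 4096, len);  /* O_DIRECT alignment */
            io_setup(1, &ctx);
            io_prep_pwrite(&cb, fd, buf, len, 0);
            io_submit(ctx, 1, &cbp);
            io_getevents(ctx, 1, 1, &ev, NULL);
            /* before this fix ev.res could be < len; it should now
             * equal len */
            return 0;
    }
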
Fixes: fde7ac1942f5 ("lustre: clio: AIO support for direct IO")
WC-bug-id: https://jira.whamcloud.com/browse/LU-13697
Lustre-commit: 84c3e85ced2dd ("LU-13697 llite: fix short io for AIO")
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Reviewed-on: https://review.whamcloud.com/39104
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Bobi Jam <bobijam@hotmail.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/cl_object.h  |  2 ++
 fs/lustre/llite/file.c         | 32 +++++++++++++++++-
 fs/lustre/llite/rw26.c         | 43 +++++++++++++++++--------
 fs/lustre/llite/vvp_internal.h |  3 +-
 fs/lustre/llite/vvp_io.c       | 73 ++++++++++++++++++++++++++++--------------
 fs/lustre/obdclass/cl_io.c     |  9 +++++-
 6 files changed, 122 insertions(+), 40 deletions(-)

diff --git a/fs/lustre/include/cl_object.h b/fs/lustre/include/cl_object.h
index e656c68..e849f23 100644
--- a/fs/lustre/include/cl_object.h
+++ b/fs/lustre/include/cl_object.h
@@ -1814,6 +1814,8 @@ struct cl_io {
 	enum cl_io_state	ci_state;
 	/** main object this io is against. Immutable after creation. */
 	struct cl_object	*ci_obj;
+	/** one AIO request might be split in cl_io_loop */
+	struct cl_dio_aio	*ci_aio;
 	/**
 	 * Upper layer io, of which this io is a part of. Immutable after
 	 * creation.
diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index 1849229..757950f 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -1514,6 +1514,7 @@ static void ll_heat_add(struct inode *inode, enum cl_io_type iot,
 	int rc = 0;
 	unsigned int retried = 0;
 	unsigned int ignore_lockless = 0;
+	bool is_aio = false;
 
 	CDEBUG(D_VFSTRACE, "file: %pD, type: %d ppos: %llu, count: %zu\n",
 	       file, iot, *ppos, count);
@@ -1536,6 +1537,15 @@ static void ll_heat_add(struct inode *inode, enum cl_io_type iot,
 		vio->vui_fd  = file->private_data;
 		vio->vui_iter = args->u.normal.via_iter;
 		vio->vui_iocb = args->u.normal.via_iocb;
+		if (file->f_flags & O_DIRECT) {
+			if (!is_sync_kiocb(vio->vui_iocb))
+				is_aio = true;
+			io->ci_aio = cl_aio_alloc(vio->vui_iocb);
+			if (!io->ci_aio) {
+				rc = -ENOMEM;
+				goto out;
+			}
+		}
 		/*
 		 * Direct IO reads must also take range lock,
 		 * or multiple reads will try to work on the same pages
@@ -1567,7 +1577,14 @@ static void ll_heat_add(struct inode *inode, enum cl_io_type iot,
 		rc = io->ci_result;
 	}
 
-	if (io->ci_nob > 0) {
+	/*
+	 * In order to move AIO forward, ci_nob was increased,
+	 * but that doesn't mean the io has finished, it only
+	 * means the io has been submitted; we always return
+	 * EIOCBQUEUED to the caller, so we can only return the
+	 * number of bytes in the non-AIO case.
+	 */
+	if (io->ci_nob > 0 && !is_aio) {
 		result += io->ci_nob;
 		count -= io->ci_nob;
 		*ppos = io->u.ci_wr.wr.crw_pos;
@@ -1577,6 +1594,19 @@ static void ll_heat_add(struct inode *inode, enum cl_io_type iot,
 			args->u.normal.via_iter = vio->vui_iter;
 	}
 out:
+	if (io->ci_aio) {
+		/**
+		 * Drop one extra reference so that end_io() can be
+		 * called for this IO context; we call it only after
+		 * we make sure all AIO requests have been processed.
+		 */
+		cl_sync_io_note(env, &io->ci_aio->cda_sync,
+				rc == -EIOCBQUEUED ? 0 : rc);
+		if (!is_aio) {
+			cl_aio_free(io->ci_aio);
+			io->ci_aio = NULL;
+		}
+	}
 	cl_io_fini(env, io);
 
 	CDEBUG(D_VFSTRACE,
diff --git a/fs/lustre/llite/rw26.c b/fs/lustre/llite/rw26.c
index d0e3ff6..b3802cf 100644
--- a/fs/lustre/llite/rw26.c
+++ b/fs/lustre/llite/rw26.c
@@ -290,6 +290,7 @@ static ssize_t ll_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	ssize_t tot_bytes = 0, result = 0;
 	loff_t file_offset = iocb->ki_pos;
 	int rw = iov_iter_rw(iter);
+	struct vvp_io *vio;
 
 	/* if file is encrypted, return 0 so that we fall back to buffered IO */
 	if (IS_ENCRYPTED(inode))
@@ -319,12 +320,13 @@ static ssize_t ll_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 
 	env = lcc->lcc_env;
 	LASSERT(!IS_ERR(env));
+	vio = vvp_env_io(env);
 	io = lcc->lcc_io;
 	LASSERT(io);
 
-	aio = cl_aio_alloc(iocb);
-	if (!aio)
-		return -ENOMEM;
+	aio = io->ci_aio;
+	LASSERT(aio);
+	LASSERT(aio->cda_iocb == iocb);
 
 	/* 0. Need locking between buffered and direct access. and race with
 	 *    size changing by concurrent truncates and writes.
@@ -368,24 +370,39 @@ static ssize_t ll_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	}
 
 out:
-	aio->cda_bytes = tot_bytes;
-	cl_sync_io_note(env, &aio->cda_sync, result);
+	aio->cda_bytes += tot_bytes;
 
 	if (is_sync_kiocb(iocb)) {
+		struct cl_sync_io *anchor = &aio->cda_sync;
 		ssize_t rc2;
 
-		rc2 = cl_sync_io_wait(env, &aio->cda_sync, 0);
+		/**
+		 * @anchor was initialized to 1 to prevent end_io from
+		 * being called before we add all pages for IO, so drop
+		 * one extra reference to make sure we can wait for the
+		 * count to reach zero.
+		 */
+		cl_sync_io_note(env, anchor, result);
+
+		rc2 = cl_sync_io_wait(env, anchor, 0);
 		if (result == 0 && rc2)
 			result = rc2;
 
+		/**
+		 * Take one extra reference again: if @anchor is
+		 * reused, it is assumed to start at 1.
+		 */
+		atomic_add(1, &anchor->csi_sync_nr);
 		if (result == 0) {
-			struct vvp_io *vio = vvp_env_io(env);
 			/* no commit async for direct IO */
-			vio->u.write.vui_written += tot_bytes;
+			vio->u.readwrite.vui_written += tot_bytes;
 			result = tot_bytes;
 		}
-		cl_aio_free(aio);
 	} else {
+		if (rw == WRITE)
+			vio->u.readwrite.vui_written += tot_bytes;
+		else
+			vio->u.readwrite.vui_read += tot_bytes;
 		result = -EIOCBQUEUED;
 	}
 
@@ -523,7 +540,7 @@ static int ll_write_begin(struct file *file, struct address_space *mapping,
 	vmpage = grab_cache_page_nowait(mapping, index);
 	if (unlikely(!vmpage || PageDirty(vmpage) || PageWriteback(vmpage))) {
 		struct vvp_io *vio = vvp_env_io(env);
-		struct cl_page_list *plist = &vio->u.write.vui_queue;
+		struct cl_page_list *plist = &vio->u.readwrite.vui_queue;
 
 		/* if the page is already in dirty cache, we have to commit
 		 * the pages right now; otherwise, it may cause deadlock
@@ -685,17 +702,17 @@ static int ll_write_end(struct file *file, struct address_space *mapping,
 
 	LASSERT(cl_page_is_owned(page, io));
 	if (copied > 0) {
-		struct cl_page_list *plist = &vio->u.write.vui_queue;
+		struct cl_page_list *plist = &vio->u.readwrite.vui_queue;
 
 		lcc->lcc_page = NULL; /* page will be queued */
 
 		/* Add it into write queue */
 		cl_page_list_add(plist, page);
 		if (plist->pl_nr == 1) /* first page */
-			vio->u.write.vui_from = from;
+			vio->u.readwrite.vui_from = from;
 		else
 			LASSERT(from == 0);
-		vio->u.write.vui_to = from + copied;
+		vio->u.readwrite.vui_to = from + copied;
 
 		/*
 		 * To address the deadlock in balance_dirty_pages() where
diff --git a/fs/lustre/llite/vvp_internal.h b/fs/lustre/llite/vvp_internal.h
index cff85ea..6956d6b 100644
--- a/fs/lustre/llite/vvp_internal.h
+++ b/fs/lustre/llite/vvp_internal.h
@@ -88,9 +88,10 @@ struct vvp_io {
 		struct {
 			struct cl_page_list	vui_queue;
 			unsigned long		vui_written;
+			unsigned long		vui_read;
 			int			vui_from;
 			int			vui_to;
-		} write;
+		} readwrite; /* normal io */
 	} u;
 
 	/**
diff --git a/fs/lustre/llite/vvp_io.c b/fs/lustre/llite/vvp_io.c
index c3fb03a..59da56d 100644
--- a/fs/lustre/llite/vvp_io.c
+++ b/fs/lustre/llite/vvp_io.c
@@ -249,10 +249,20 @@ static int vvp_io_write_iter_init(const struct lu_env *env,
 {
 	struct vvp_io *vio = cl2vvp_io(env, ios);
 
-	cl_page_list_init(&vio->u.write.vui_queue);
-	vio->u.write.vui_written = 0;
-	vio->u.write.vui_from = 0;
-	vio->u.write.vui_to = PAGE_SIZE;
+	cl_page_list_init(&vio->u.readwrite.vui_queue);
+	vio->u.readwrite.vui_written = 0;
+	vio->u.readwrite.vui_from = 0;
+	vio->u.readwrite.vui_to = PAGE_SIZE;
+
+	return 0;
+}
+
+static int vvp_io_read_iter_init(const struct lu_env *env,
+				 const struct cl_io_slice *ios)
+{
+	struct vvp_io *vio = cl2vvp_io(env, ios);
+
+	vio->u.readwrite.vui_read = 0;
 
 	return 0;
 }
@@ -262,7 +272,7 @@ static void vvp_io_write_iter_fini(const struct lu_env *env,
 {
 	struct vvp_io *vio = cl2vvp_io(env, ios);
 
-	LASSERT(vio->u.write.vui_queue.pl_nr == 0);
+	LASSERT(vio->u.readwrite.vui_queue.pl_nr == 0);
 }
 
 static int vvp_io_fault_iter_init(const struct lu_env *env,
@@ -824,7 +834,13 @@ static int vvp_io_read_start(const struct lu_env *env,
 			io->ci_continue = 0;
 		io->ci_nob += result;
 		result = 0;
+	} else if (result == -EIOCBQUEUED) {
+		io->ci_nob += vio->u.readwrite.vui_read;
+		if (vio->vui_iocb)
+			vio->vui_iocb->ki_pos = pos +
+						vio->u.readwrite.vui_read;
 	}
+
 	return result;
 }
 
@@ -1017,23 +1033,24 @@ int vvp_io_write_commit(const struct lu_env *env, struct cl_io *io)
 	struct cl_object *obj = io->ci_obj;
 	struct inode *inode = vvp_object_inode(obj);
 	struct vvp_io *vio = vvp_env_io(env);
-	struct cl_page_list *queue = &vio->u.write.vui_queue;
+	struct cl_page_list *queue = &vio->u.readwrite.vui_queue;
 	struct cl_page *page;
 	int rc = 0;
 	int bytes = 0;
-	unsigned int npages = vio->u.write.vui_queue.pl_nr;
+	unsigned int npages = vio->u.readwrite.vui_queue.pl_nr;
 
 	if (npages == 0)
 		return 0;
 
 	CDEBUG(D_VFSTRACE, "commit async pages: %d, from %d, to %d\n",
-	       npages, vio->u.write.vui_from, vio->u.write.vui_to);
+	       npages, vio->u.readwrite.vui_from, vio->u.readwrite.vui_to);
 
 	LASSERT(page_list_sanity_check(obj, queue));
 
 	/* submit IO with async write */
 	rc = cl_io_commit_async(env, io, queue,
-				vio->u.write.vui_from, vio->u.write.vui_to,
+				vio->u.readwrite.vui_from,
+				vio->u.readwrite.vui_to,
 				write_commit_callback);
 	npages -= queue->pl_nr; /* already committed pages */
 	if (npages > 0) {
@@ -1041,18 +1058,18 @@ int vvp_io_write_commit(const struct lu_env *env, struct cl_io *io)
 		bytes = npages << PAGE_SHIFT;
 
 		/* first page */
-		bytes -= vio->u.write.vui_from;
+		bytes -= vio->u.readwrite.vui_from;
 		if (queue->pl_nr == 0) /* last page */
-			bytes -= PAGE_SIZE - vio->u.write.vui_to;
+			bytes -= PAGE_SIZE - vio->u.readwrite.vui_to;
 		LASSERTF(bytes > 0, "bytes = %d, pages = %d\n", bytes, npages);
 
-		vio->u.write.vui_written += bytes;
+		vio->u.readwrite.vui_written += bytes;
 
 		CDEBUG(D_VFSTRACE, "Committed %d pages %d bytes, tot: %ld\n",
-		       npages, bytes, vio->u.write.vui_written);
+		       npages, bytes, vio->u.readwrite.vui_written);
 
 		/* the first page must have been written. */
-		vio->u.write.vui_from = 0;
+		vio->u.readwrite.vui_from = 0;
 	}
 	LASSERT(page_list_sanity_check(obj, queue));
 	LASSERT(ergo(rc == 0, queue->pl_nr == 0));
@@ -1060,10 +1077,10 @@ int vvp_io_write_commit(const struct lu_env *env, struct cl_io *io)
 	/* out of quota, try sync write */
 	if (rc == -EDQUOT && !cl_io_is_mkwrite(io)) {
 		rc = vvp_io_commit_sync(env, io, queue,
-					vio->u.write.vui_from,
-					vio->u.write.vui_to);
+					vio->u.readwrite.vui_from,
+					vio->u.readwrite.vui_to);
 		if (rc > 0) {
-			vio->u.write.vui_written += rc;
+			vio->u.readwrite.vui_written += rc;
 			rc = 0;
 		}
 	}
@@ -1181,15 +1198,15 @@ static int vvp_io_write_start(const struct lu_env *env,
 		result = vvp_io_write_commit(env, io);
 		/* Simulate short commit */
 		if (CFS_FAULT_CHECK(OBD_FAIL_LLITE_SHORT_COMMIT)) {
-			vio->u.write.vui_written >>= 1;
-			if (vio->u.write.vui_written > 0)
+			vio->u.readwrite.vui_written >>= 1;
+			if (vio->u.readwrite.vui_written > 0)
 				io->ci_need_restart = 1;
 		}
-		if (vio->u.write.vui_written > 0) {
-			result = vio->u.write.vui_written;
+		if (vio->u.readwrite.vui_written > 0) {
+			result = vio->u.readwrite.vui_written;
 			io->ci_nob += result;
-
-			CDEBUG(D_VFSTRACE, "write: nob %zd, result: %zd\n",
+			CDEBUG(D_VFSTRACE, "%s: write: nob %zd, result: %zd\n",
+			       file_dentry(file)->d_name.name,
 			       io->ci_nob, result);
 		} else {
 			io->ci_continue = 0;
@@ -1215,11 +1232,18 @@ static int vvp_io_write_start(const struct lu_env *env,
 	if (result > 0 || result == -EIOCBQUEUED) {
 		set_bit(LLIF_DATA_MODIFIED, &(ll_i2info(inode))->lli_flags);
 
-		if (result < cnt)
+		if (result != -EIOCBQUEUED && result < cnt)
 			io->ci_continue = 0;
 		if (result > 0)
 			result = 0;
+		/* move forward */
+		if (result == -EIOCBQUEUED) {
+			io->ci_nob += vio->u.readwrite.vui_written;
+			vio->vui_iocb->ki_pos = pos +
+						vio->u.readwrite.vui_written;
+		}
 	}
+
 	return result;
 }
 
@@ -1509,6 +1533,7 @@ static int vvp_io_read_ahead(const struct lu_env *env,
 	.op = {
 		[CIT_READ] = {
 			.cio_fini	= vvp_io_fini,
+			.cio_iter_init	= vvp_io_read_iter_init,
 			.cio_lock	= vvp_io_read_lock,
 			.cio_start	= vvp_io_read_start,
 			.cio_end	= vvp_io_rw_end,
diff --git a/fs/lustre/obdclass/cl_io.c b/fs/lustre/obdclass/cl_io.c
index dcf940f..1564d9f 100644
--- a/fs/lustre/obdclass/cl_io.c
+++ b/fs/lustre/obdclass/cl_io.c
@@ -695,6 +695,7 @@ int cl_io_submit_sync(const struct lu_env *env, struct cl_io *io,
 int cl_io_loop(const struct lu_env *env, struct cl_io *io)
 {
 	int result = 0;
+	int rc = 0;
 
 	LINVRNT(cl_io_is_loopable(io));
 
@@ -727,7 +728,13 @@ int cl_io_loop(const struct lu_env *env, struct cl_io *io)
 			}
 		}
 		cl_io_iter_fini(env, io);
-	} while (result == 0 && io->ci_continue);
+		if (result)
+			rc = result;
+	} while ((result == 0 || result == -EIOCBQUEUED) &&
+		 io->ci_continue);
+
+	if (rc && !result)
+		result = rc;
 
 	if (result == -EWOULDBLOCK && io->ci_ndelay) {
 		io->ci_need_restart = 1;
-- 
1.8.3.1


* [lustre-devel] [PATCH 36/37] lnet: socklnd: change ksnd_nthreads to atomic_t
@ 2020-07-15 20:45 James Simmons
  36 siblings, 0 replies; 38+ messages in thread
From: James Simmons @ 2020-07-15 20:45 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.de>

This variable is used like an atomic_t, but a global spinlock is
taken to protect updates - and, unnecessarily, to protect reads as
well.

Change it to an atomic_t and drop the spinlock.
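
As an aside, the idiom being adopted can be sketched in isolation
like this (a generic illustration only, not Lustre code - the
identifiers here are hypothetical):

    #include <linux/atomic.h>
    #include <linux/wait_bit.h>

    static atomic_t nthreads = ATOMIC_INIT(0);

    static void thread_start(void)
    {
            atomic_inc(&nthreads);          /* lock-free update */
    }

    static void thread_fini(void)
    {
            /* atomic_dec_and_test() returns true only for the final
             * decrement, so exactly one exiting thread issues the
             * wakeup.
             */
            if (atomic_dec_and_test(&nthreads))
                    wake_up_var(&nthreads);
    }

    static void wait_for_all_threads(void)
    {
            wait_var_event(&nthreads, atomic_read(&nthreads) == 0);
    }

The atomic counter makes increments and decrements safe without any
global lock, while wake_up_var()/wait_var_event() preserve the
existing "wake the waiter when the last thread exits" semantics.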

WC-bug-id: https://jira.whamcloud.com/browse/LU-12678
Lustre-commit: 4b0d3c0e41201 ("LU-12678 socklnd: change ksnd_nthreads to atomic_t")
Signed-off-by: Mr NeilBrown <neilb@suse.de>
Reviewed-on: https://review.whamcloud.com/39121
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Chris Horn <chris.horn@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/klnds/socklnd/socklnd.c    | 4 ++--
 net/lnet/klnds/socklnd/socklnd.h    | 2 +-
 net/lnet/klnds/socklnd/socklnd_cb.c | 8 ++------
 3 files changed, 5 insertions(+), 9 deletions(-)

diff --git a/net/lnet/klnds/socklnd/socklnd.c b/net/lnet/klnds/socklnd/socklnd.c
index 22a73c3..91925475 100644
--- a/net/lnet/klnds/socklnd/socklnd.c
+++ b/net/lnet/klnds/socklnd/socklnd.c
@@ -2260,9 +2260,9 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 		}
 
 		wait_var_event_warning(&ksocknal_data.ksnd_nthreads,
-				       ksocknal_data.ksnd_nthreads == 0,
+				       atomic_read(&ksocknal_data.ksnd_nthreads) == 0,
 				       "waiting for %d threads to terminate\n",
-				       ksocknal_data.ksnd_nthreads);
+				       atomic_read(&ksocknal_data.ksnd_nthreads));
 
 		ksocknal_free_buffers();
 
diff --git a/net/lnet/klnds/socklnd/socklnd.h b/net/lnet/klnds/socklnd/socklnd.h
index df863f2..350f2c8 100644
--- a/net/lnet/klnds/socklnd/socklnd.h
+++ b/net/lnet/klnds/socklnd/socklnd.h
@@ -196,7 +196,7 @@ struct ksock_nal_data {
 							 * known peers
 							 */
 
-	int			ksnd_nthreads;		/* # live threads */
+	atomic_t		ksnd_nthreads;		/* # live threads */
 	int			ksnd_shuttingdown;	/* tell threads to exit
 							 */
 	struct ksock_sched	**ksnd_schedulers;	/* schedulers info */
diff --git a/net/lnet/klnds/socklnd/socklnd_cb.c b/net/lnet/klnds/socklnd/socklnd_cb.c
index 9b3b604..a1c0c3d 100644
--- a/net/lnet/klnds/socklnd/socklnd_cb.c
+++ b/net/lnet/klnds/socklnd/socklnd_cb.c
@@ -976,19 +976,15 @@ struct ksock_route *
 	if (IS_ERR(task))
 		return PTR_ERR(task);
 
-	write_lock_bh(&ksocknal_data.ksnd_global_lock);
-	ksocknal_data.ksnd_nthreads++;
-	write_unlock_bh(&ksocknal_data.ksnd_global_lock);
+	atomic_inc(&ksocknal_data.ksnd_nthreads);
 	return 0;
 }
 
 void
 ksocknal_thread_fini(void)
 {
-	write_lock_bh(&ksocknal_data.ksnd_global_lock);
-	if (--ksocknal_data.ksnd_nthreads == 0)
+	if (atomic_dec_and_test(&ksocknal_data.ksnd_nthreads))
 		wake_up_var(&ksocknal_data.ksnd_nthreads);
-	write_unlock_bh(&ksocknal_data.ksnd_global_lock);
 }
 
 int
-- 
1.8.3.1


* [lustre-devel] [PATCH 37/37] lnet: check rtr_nid is a gateway
@ 2020-07-15 20:45 James Simmons
  36 siblings, 0 replies; 38+ messages in thread
From: James Simmons @ 2020-07-15 20:45 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

The rtr_nid is specified for all REPLY/ACK messages. However, it is
possible for the route through the gateway specified by rtr_nid to
have been removed, in which case we should not use it but instead
look up an alternative path.

This patch checks whether the peer looked up is indeed still a
gateway. If it is not, we attempt to find another path rather than
failing right away; it is not a hard requirement to fail when the
default rtr_nid is no longer valid.
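
Condensed from the diff below (locking and error handling elided),
the new selection logic is roughly:

    bool route_found = false;

    if (sd->sd_rtr_nid != LNET_NID_ANY) {
            gwni = lnet_find_peer_ni_locked(sd->sd_rtr_nid);
            if (gwni) {
                    gw = gwni->lpni_peer_net->lpn_peer;
                    lnet_peer_ni_decref_locked(gwni);
                    /* honour rtr_nid only while it is still a router */
                    if (gw->lp_rtr_refcount)
                            route_found = true;
            }
    }
    if (!route_found)
            /* fall back to ordinary route selection */
            lpni = sd->sd_best_lpni;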

WC-bug-id: https://jira.whamcloud.com/browse/LU-13713
Lustre-commit: 07397a2e7473c ("LU-13713 lnet: check rtr_nid is a gateway")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/39175
Reviewed-by: Chris Horn <chris.horn@hpe.com>
Reviewed-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/lib-move.c | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index 234fbb5..c0dd30c 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -1777,6 +1777,7 @@ struct lnet_ni *
 	struct lnet_route *last_route = NULL;
 	struct lnet_peer_ni *lpni = NULL;
 	struct lnet_peer_ni *gwni = NULL;
+	bool route_found = false;
 	lnet_nid_t src_nid = (sd->sd_src_nid != LNET_NID_ANY) ? sd->sd_src_nid :
 			      sd->sd_best_ni ? sd->sd_best_ni->ni_nid :
 			      LNET_NID_ANY;
@@ -1790,15 +1791,20 @@ struct lnet_ni *
 	 */
 	if (sd->sd_rtr_nid != LNET_NID_ANY) {
 		gwni = lnet_find_peer_ni_locked(sd->sd_rtr_nid);
-		if (!gwni) {
-			CERROR("No peer NI for gateway %s\n",
+		if (gwni) {
+			gw = gwni->lpni_peer_net->lpn_peer;
+			lnet_peer_ni_decref_locked(gwni);
+			if (gw->lp_rtr_refcount) {
+				local_lnet = LNET_NIDNET(sd->sd_rtr_nid);
+				route_found = true;
+			}
+		} else {
+			CWARN("No peer NI for gateway %s. Attempting to find an alternative route.\n",
 			       libcfs_nid2str(sd->sd_rtr_nid));
-			return -EHOSTUNREACH;
 		}
-		gw = gwni->lpni_peer_net->lpn_peer;
-		lnet_peer_ni_decref_locked(gwni);
-		local_lnet = LNET_NIDNET(sd->sd_rtr_nid);
-	} else {
+	}
+
+	if (!route_found) {
 		/* we've already looked up the initial lpni using dst_nid */
 		lpni = sd->sd_best_lpni;
 		/* the peer tree must be in existence */
-- 
1.8.3.1

