* [lustre-devel] [PATCH 00/10] lustre: lnet: fixes for non-x86 systems
From: James Simmons @ 2018-10-14 18:55 UTC (permalink / raw)
  To: lustre-devel

These are the required patches to make LNet work on non-x86
platforms such as ARM or Power8. The map_on_demand and
concurrent_sends tunables assumed pages were 4K in size, which is
not true on all platforms, so they have been heavily reworked and
are now effectively defunct. Also, the size of an LNet packet was
always 256 pages, so on systems with 64K pages, such as some ARM
or Power8 machines, the maximum LNet message sent was 16MB rather
than the expected 1MB. Fixing up the RDMA handling in the ko2iblnd
driver also resolved some performance issues.
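
As a minimal sketch of the fix in patch 10 (macro names assumed
from the lustre tree; LNET_MAX_PAYLOAD is taken to be the usual
1MB), the iovec limit becomes a function of the page size rather
than a fixed 256 entries:

  /* With 4K pages this gives 256 fragments; with 64K pages it
   * gives 16, so the maximum LNet message stays 1MB either way. */
  #define LNET_MAX_PAYLOAD	(1U << 20)	/* assumed 1MB default */
  #define LNET_MAX_IOV		(LNET_MAX_PAYLOAD >> PAGE_SHIFT)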

Alexey Lyashkov (1):
  lustre: lnd: use less CQ entries for each connection

Amir Shehata (6):
  lustre: lnd: set device capabilities
  lustre: o2iblnd: use IB_MR_TYPE_SG_GAPS
  lustre: lnd: rework map_on_demand behavior
  lustre: lnd: calculate qp max_send_wrs properly
  lustre: lnd: remove concurrent_sends tunable
  lustre: lnd: correct WR fast reg accounting

Dmitry Eremin (1):
  lustre: o2iblnd: limit cap.max_send_wr for MLX5

James Simmons (1):
  lustre: lnet: make LNET_MAX_IOV dependent on page size

John L. Hammond (1):
  lustre: o2ib: use splice in kiblnd_peer_connect_failed()

 .../staging/lustre/include/linux/lnet/lib-types.h  |  10 +-
 .../lustre/include/uapi/linux/lnet/lnet-types.h    |   3 -
 drivers/staging/lustre/lnet/Kconfig                |  10 --
 .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c    | 138 +++++++++++++++------
 .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h    |  68 +++-------
 .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c | 133 ++++++++++++++------
 .../lustre/lnet/klnds/o2iblnd/o2iblnd_modparams.c  |  85 ++++++-------
 drivers/staging/lustre/lustre/include/lustre_net.h |   1 +
 8 files changed, 256 insertions(+), 192 deletions(-)

-- 
1.8.3.1


* [lustre-devel] [PATCH 01/10] lustre: lnd: set device capabilities
From: James Simmons @ 2018-10-14 18:55 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

MLX-4, MLX-5 and OPA support different capabilities. Query the
device and cache its capabilities for future use.

At the time this patch was created, MLX-5 could support fast
registration and gaps while MLX-4 and OPA only supported FMR.
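
For illustration, a minimal sketch of how later code might consume
the cached bits (kiblnd_dev_has_fastreg() is a hypothetical helper,
not part of this patch; struct kib_dev and the flag come from the
hunks below):

  static bool kiblnd_dev_has_fastreg(struct kib_dev *dev)
  {
  	/* set once in kiblnd_hdev_get_attr(), read thereafter */
  	return dev->ibd_dev_caps & IBLND_DEV_CAPS_FASTREG_ENABLED;
  }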

Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-10129
Reviewed-on: https://review.whamcloud.com/30309
Reviewed-by: Alexey Lyashkov <c17817@cray.com>
Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c    | 41 ++++++++++++----------
 .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h    |  5 +++
 2 files changed, 28 insertions(+), 18 deletions(-)

diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
index bf969b3..b10658b 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
@@ -1399,6 +1399,7 @@ static int kiblnd_alloc_fmr_pool(struct kib_fmr_poolset *fps, struct kib_fmr_poo
 		else
 			CERROR("FMRs are not supported\n");
 	}
+	fpo->fpo_is_fmr = true;
 
 	return rc;
 }
@@ -1408,6 +1409,8 @@ static int kiblnd_alloc_freg_pool(struct kib_fmr_poolset *fps, struct kib_fmr_po
 	struct kib_fast_reg_descriptor *frd;
 	int i, rc;
 
+	fpo->fpo_is_fmr = false;
+
 	INIT_LIST_HEAD(&fpo->fast_reg.fpo_pool_list);
 	fpo->fast_reg.fpo_pool_size = 0;
 	for (i = 0; i < fps->fps_pool_size; i++) {
@@ -1469,23 +1472,7 @@ static int kiblnd_create_fmr_pool(struct kib_fmr_poolset *fps,
 	fpo->fpo_hdev = kiblnd_current_hdev(dev);
 	dev_attr = &fpo->fpo_hdev->ibh_ibdev->attrs;
 
-	/* Check for FMR or FastReg support */
-	fpo->fpo_is_fmr = 0;
-	if (fpo->fpo_hdev->ibh_ibdev->alloc_fmr &&
-	    fpo->fpo_hdev->ibh_ibdev->dealloc_fmr &&
-	    fpo->fpo_hdev->ibh_ibdev->map_phys_fmr &&
-	    fpo->fpo_hdev->ibh_ibdev->unmap_fmr) {
-		LCONSOLE_INFO("Using FMR for registration\n");
-		fpo->fpo_is_fmr = 1;
-	} else if (dev_attr->device_cap_flags & IB_DEVICE_MEM_MGT_EXTENSIONS) {
-		LCONSOLE_INFO("Using FastReg for registration\n");
-	} else {
-		rc = -ENOSYS;
-		LCONSOLE_ERROR_MSG(rc, "IB device does not support FMRs nor FastRegs, can't register memory\n");
-		goto out_fpo;
-	}
-
-	if (fpo->fpo_is_fmr)
+	if (dev->ibd_dev_caps & IBLND_DEV_CAPS_FMR_ENABLED)
 		rc = kiblnd_alloc_fmr_pool(fps, fpo);
 	else
 		rc = kiblnd_alloc_freg_pool(fps, fpo);
@@ -2261,6 +2248,9 @@ static int kiblnd_net_init_pools(struct kib_net *net, struct lnet_ni *ni,
 
 static int kiblnd_hdev_get_attr(struct kib_hca_dev *hdev)
 {
+	struct ib_device_attr *dev_attr = &hdev->ibh_ibdev->attrs;
+	int rc = 0;
+
 	/*
 	 * It's safe to assume a HCA can handle a page size
 	 * matching that of the native system
@@ -2269,7 +2259,22 @@ static int kiblnd_hdev_get_attr(struct kib_hca_dev *hdev)
 	hdev->ibh_page_size  = 1 << PAGE_SHIFT;
 	hdev->ibh_page_mask  = ~((__u64)hdev->ibh_page_size - 1);
 
-	hdev->ibh_mr_size = hdev->ibh_ibdev->attrs.max_mr_size;
+	if (hdev->ibh_ibdev->alloc_fmr &&
+	    hdev->ibh_ibdev->dealloc_fmr &&
+	    hdev->ibh_ibdev->map_phys_fmr &&
+	    hdev->ibh_ibdev->unmap_fmr) {
+		LCONSOLE_INFO("Using FMR for registration\n");
+		hdev->ibh_dev->ibd_dev_caps |= IBLND_DEV_CAPS_FMR_ENABLED;
+	} else if (dev_attr->device_cap_flags & IB_DEVICE_MEM_MGT_EXTENSIONS) {
+		LCONSOLE_INFO("Using FastReg for registration\n");
+		hdev->ibh_dev->ibd_dev_caps |= IBLND_DEV_CAPS_FASTREG_ENABLED;
+	} else {
+		CERROR("IB device does not support FMRs nor FastRegs, can't register memory: %d\n",
+		       rc);
+		return -ENXIO;
+	}
+
+	hdev->ibh_mr_size = dev_attr->max_mr_size;
 	if (hdev->ibh_mr_size == ~0ULL) {
 		hdev->ibh_mr_shift = 64;
 		return 0;
diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
index a4438d2..9f0a47d 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
@@ -73,6 +73,10 @@
 #define IBLND_N_SCHED			2
 #define IBLND_N_SCHED_HIGH		4
 
+#define IBLND_DEV_CAPS_FASTREG_ENABLED		0x1
+#define IBLND_DEV_CAPS_FASTREG_GAPS_SUPPORT	0x2
+#define IBLND_DEV_CAPS_FMR_ENABLED		0x4
+
 struct kib_tunables {
 	int *kib_dev_failover;           /* HCA failover */
 	unsigned int *kib_service;       /* IB service number */
@@ -162,6 +166,7 @@ struct kib_dev {
 	unsigned int ibd_can_failover; /* IPoIB interface is a bonding master */
 	struct list_head   ibd_nets;
 	struct kib_hca_dev *ibd_hdev;
+	u32			ibd_dev_caps;
 };
 
 struct kib_hca_dev {
-- 
1.8.3.1


* [lustre-devel] [PATCH 02/10] lustre: o2iblnd: use IB_MR_TYPE_SG_GAPS
From: James Simmons @ 2018-10-14 18:55 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

When allocating fastreg buffers, allow the use of IB_MR_TYPE_SG_GAPS
instead of IB_MR_TYPE_MEM_REG, since the fragments we provide to
the fast registration API can have gaps; MEM_REG doesn't handle
that case.

There is a performance drop when using IB_MR_TYPE_SG_GAPS, so it
is recommended not to use it. To mitigate this, we added a module
parameter, use_fastreg_gaps, which defaults to 0. When allocating
the memory region, if this parameter is set to 1 and the hardware
supports gaps, use IB_MR_TYPE_SG_GAPS and warn that performance
may drop; otherwise always use IB_MR_TYPE_MEM_REG. We still want
to give the user the choice to use this option.
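
For example (hypothetical invocation), opting in at module load
time would look like:

  modprobe ko2iblnd use_fastreg_gaps=1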

Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-10089
Reviewed-on: https://review.whamcloud.com/29551
WC-bug-id: https://jira.whamcloud.com/browse/LU-10394
Reviewed-on: https://review.whamcloud.com/30749
Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Doug Oucharek <dougso@me.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c    | 24 +++++++++++++++++++---
 .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h    |  3 +++
 .../lustre/lnet/klnds/o2iblnd/o2iblnd_modparams.c  |  5 +++++
 3 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
index b10658b..ca3e9ce 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
@@ -1404,7 +1404,8 @@ static int kiblnd_alloc_fmr_pool(struct kib_fmr_poolset *fps, struct kib_fmr_poo
 	return rc;
 }
 
-static int kiblnd_alloc_freg_pool(struct kib_fmr_poolset *fps, struct kib_fmr_pool *fpo)
+static int kiblnd_alloc_freg_pool(struct kib_fmr_poolset *fps,
+				  struct kib_fmr_pool *fpo, u32 dev_caps)
 {
 	struct kib_fast_reg_descriptor *frd;
 	int i, rc;
@@ -1414,6 +1415,8 @@ static int kiblnd_alloc_freg_pool(struct kib_fmr_poolset *fps, struct kib_fmr_po
 	INIT_LIST_HEAD(&fpo->fast_reg.fpo_pool_list);
 	fpo->fast_reg.fpo_pool_size = 0;
 	for (i = 0; i < fps->fps_pool_size; i++) {
+		bool fastreg_gaps = false;
+
 		frd = kzalloc_cpt(sizeof(*frd), GFP_NOFS, fps->fps_cpt);
 		if (!frd) {
 			CERROR("Failed to allocate a new fast_reg descriptor\n");
@@ -1421,8 +1424,21 @@ static int kiblnd_alloc_freg_pool(struct kib_fmr_poolset *fps, struct kib_fmr_po
 			goto out;
 		}
 
+		/*
+		 * it is expected to get here if this is an MLX-5 card.
+		 * MLX-4 cards will always use FMR and MLX-5 cards will
+		 * always use fast_reg. It turns out that some MLX-5 cards
+		 * (possibly due to older FW versions) do not natively support
+		 * gaps. So we will need to track them here.
+		 */
+		if ((*kiblnd_tunables.kib_use_fastreg_gaps == 1) &&
+		    (dev_caps & IBLND_DEV_CAPS_FASTREG_GAPS_SUPPORT)) {
+			CWARN("using IB_MR_TYPE_SG_GAPS, expect a performance drop\n");
+			fastreg_gaps = true;
+		}
 		frd->frd_mr = ib_alloc_mr(fpo->fpo_hdev->ibh_pd,
-					  IB_MR_TYPE_MEM_REG,
+					  fastreg_gaps ? IB_MR_TYPE_SG_GAPS :
+							 IB_MR_TYPE_MEM_REG,
 					  LNET_MAX_PAYLOAD / PAGE_SIZE);
 		if (IS_ERR(frd->frd_mr)) {
 			rc = PTR_ERR(frd->frd_mr);
@@ -1475,7 +1491,7 @@ static int kiblnd_create_fmr_pool(struct kib_fmr_poolset *fps,
 	if (dev->ibd_dev_caps & IBLND_DEV_CAPS_FMR_ENABLED)
 		rc = kiblnd_alloc_fmr_pool(fps, fpo);
 	else
-		rc = kiblnd_alloc_freg_pool(fps, fpo);
+		rc = kiblnd_alloc_freg_pool(fps, fpo, dev->ibd_dev_caps);
 	if (rc)
 		goto out_fpo;
 
@@ -2268,6 +2284,8 @@ static int kiblnd_hdev_get_attr(struct kib_hca_dev *hdev)
 	} else if (dev_attr->device_cap_flags & IB_DEVICE_MEM_MGT_EXTENSIONS) {
 		LCONSOLE_INFO("Using FastReg for registration\n");
 		hdev->ibh_dev->ibd_dev_caps |= IBLND_DEV_CAPS_FASTREG_ENABLED;
+		if (dev_attr->device_cap_flags & IB_DEVICE_SG_GAPS_REG)
+			hdev->ibh_dev->ibd_dev_caps |= IBLND_DEV_CAPS_FASTREG_GAPS_SUPPORT;
 	} else {
 		CERROR("IB device does not support FMRs nor FastRegs, can't register memory: %d\n",
 		       rc);
diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
index 9f0a47d..aaf0118 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
@@ -94,6 +94,9 @@ struct kib_tunables {
 	int *kib_use_priv_port; /* use privileged port for active connect */
 	int *kib_nscheds;                /* # threads on each CPT */
 	int *kib_wrq_sge;		 /* # sg elements per wrq */
+	bool *kib_use_fastreg_gaps;	 /* enable discontiguous fastreg
+					  * fragment support
+					  */
 };
 
 extern struct kib_tunables  kiblnd_tunables;
diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_modparams.c b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_modparams.c
index 13b19f3..985ccdf 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_modparams.c
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_modparams.c
@@ -111,6 +111,10 @@
 module_param(concurrent_sends, int, 0444);
 MODULE_PARM_DESC(concurrent_sends, "send work-queue sizing");
 
+static bool use_fastreg_gaps;
+module_param(use_fastreg_gaps, bool, 0444);
+MODULE_PARM_DESC(use_fastreg_gaps, "Enable discontiguous fastreg fragment support. Expect performance drop");
+
 #define IBLND_DEFAULT_MAP_ON_DEMAND IBLND_MAX_RDMA_FRAGS
 static int map_on_demand = IBLND_DEFAULT_MAP_ON_DEMAND;
 module_param(map_on_demand, int, 0444);
@@ -165,6 +169,7 @@ struct kib_tunables kiblnd_tunables = {
 	.kib_use_priv_port     = &use_privileged_port,
 	.kib_nscheds		= &nscheds,
 	.kib_wrq_sge		= &wrq_sge,
+	.kib_use_fastreg_gaps	= &use_fastreg_gaps,
 };
 
 static struct lnet_ioctl_config_o2iblnd_tunables default_tunables;
-- 
1.8.3.1


* [lustre-devel] [PATCH 03/10] lustre: lnd: rework map_on_demand behavior
From: James Simmons @ 2018-10-14 18:55 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

The use of map_on_demand is ambiguous. In kernels which supported
global memory regions, map-on-demand was used to limit the number
of RDMA fragments transferred between peers. That tunable didn't
impact memory allocation, since the maximum allowed was always
allocated; it was, however, used as a variable in determining
max_send_wr. With the introduction of FMR, if the number of
fragments exceeds the negotiated number of fragments, which is
determined by the map-on-demand value, then FMR is used, because
we can transfer all the fragments in one FMR fragment.

The latest kernels have removed support for global memory regions,
so the use of map-on-demand has become ineffectual. However, we
need to keep it for backwards compatibility. The behavior has
changed as follows:

map_on_demand is a flag used to determine whether we use FMR or
FastReg. This is applicable to kernels which support global memory
regions. For later kernels this flag is always enabled, since we
will always use either FMR or FastReg. For kernels which support
global memory regions, map_on_demand defaults to 0, which means we
will be using global memory regions exclusively. If it is set to a
value other than 0, then we will behave as follows:
  1. Always default the number of fragments to IBLND_MAX_RDMA_FRAGS
  2. Create FMR/FastReg pools
  3. Negotiate the supported number of fragments per connection
  4. Attempt to transmit using global memory regions only if
     map-on-demand is off, otherwise use FMR or FastReg.
  5. In case of transmitting a tx with GAPS over FMR we will need
     to transmit it with multiple fragments (see the sketch after
     the second list below). Look at the comments in
     kiblnd_fmr_map_tx() for an explanation of the behavior.

For later kernels we default map_on_demand to 1 and do not allow it
to be set to 0, since there is no longer support for global memory
regions. Behavior:
  1. Default the number of fragments to IBLND_MAX_RDMA_FRAGS
  2. Create FMR/FastReg pools
  3. Negotiate the supported number of fragments per connection
  4. Look at the comments in kiblnd_fmr_map_tx() for an explanation
     of the behavior when transmitting with GAPS versus contiguous
     fragments.
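
As a minimal sketch of what "gaps" means above (mirroring the
detection this patch adds to kiblnd_map_tx()): a fragment that
stops short of its page boundary while more fragments remain makes
the scatterlist discontiguous:

  /* fragnob: bytes mapped from this page; niov: iovs left to map */
  if (fragnob < (int)PAGE_SIZE - page_offset && niov > 1)
  	tx->tx_gaps = true;	/* transfer has gaps */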

Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Signed-off-by: Alex Zhuravlev <alexey.zhuravlev@intel.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-10129
Reviewed-on: https://review.whamcloud.com/29995
WC-bug-id: https://jira.whamcloud.com/browse/LU-8573
Reviewed-on: http://review.whamcloud.com/23439
Reviewed-by: Doug Oucharek <dougso@me.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c    |  31 ++---
 .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h    |  37 ++----
 .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c | 125 +++++++++++++++------
 .../lustre/lnet/klnds/o2iblnd/o2iblnd_modparams.c  |  55 ++++++---
 4 files changed, 154 insertions(+), 94 deletions(-)

diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
index ca3e9ce..8f984a0 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
@@ -130,7 +130,6 @@ static int kiblnd_msgtype2size(int type)
 static int kiblnd_unpack_rd(struct kib_msg *msg, int flip)
 {
 	struct kib_rdma_desc *rd;
-	int msg_size;
 	int nob;
 	int n;
 	int i;
@@ -149,6 +148,12 @@ static int kiblnd_unpack_rd(struct kib_msg *msg, int flip)
 
 	n = rd->rd_nfrags;
 
+	if (n <= 0 || n > IBLND_MAX_RDMA_FRAGS) {
+		CERROR("Bad nfrags: %d, should be 0 < n <= %d\n",
+		       n, IBLND_MAX_RDMA_FRAGS);
+		return 1;
+	}
+
 	nob = offsetof(struct kib_msg, ibm_u) +
 	      kiblnd_rd_msg_size(rd, msg->ibm_type, n);
 
@@ -158,13 +163,6 @@ static int kiblnd_unpack_rd(struct kib_msg *msg, int flip)
 		return 1;
 	}
 
-	msg_size = kiblnd_rd_size(rd);
-	if (msg_size <= 0 || msg_size > LNET_MAX_PAYLOAD) {
-		CERROR("Bad msg_size: %d, should be 0 < n <= %d\n",
-		       msg_size, LNET_MAX_PAYLOAD);
-		return 1;
-	}
-
 	if (!flip)
 		return 0;
 
@@ -336,7 +334,7 @@ int kiblnd_create_peer(struct lnet_ni *ni, struct kib_peer_ni **peerp,
 	peer_ni->ibp_nid = nid;
 	peer_ni->ibp_error = 0;
 	peer_ni->ibp_last_alive = 0;
-	peer_ni->ibp_max_frags = kiblnd_cfg_rdma_frags(peer_ni->ibp_ni);
+	peer_ni->ibp_max_frags = IBLND_MAX_RDMA_FRAGS;
 	peer_ni->ibp_queue_depth = ni->ni_net->net_tunables.lct_peer_tx_credits;
 	atomic_set(&peer_ni->ibp_refcount, 1);  /* 1 ref for caller */
 
@@ -782,6 +780,12 @@ struct kib_conn *kiblnd_create_conn(struct kib_peer_ni *peer_ni,
 			  kiblnd_cq_completion, kiblnd_cq_event, conn,
 			  &cq_attr);
 	if (IS_ERR(cq)) {
+		/*
+		 * on MLX-5 (possibly MLX-4 as well) this error could be
+		 * hit if the concurrent_sends and/or peer_tx_credits is set
+		 * too high. Or due to an MLX-5 bug which tries to
+		 * allocate 256kb via kmalloc for WR cookie array
+		 */
 		CERROR("Failed to create CQ with %d CQEs: %ld\n",
 		       IBLND_CQ_ENTRIES(conn), PTR_ERR(cq));
 		goto failed_2;
@@ -1320,9 +1324,8 @@ static void kiblnd_destroy_fmr_pool(struct kib_fmr_pool *fpo)
 {
 	LASSERT(!fpo->fpo_map_count);
 
-	if (fpo->fpo_is_fmr) {
-		if (fpo->fmr.fpo_fmr_pool)
-			ib_destroy_fmr_pool(fpo->fmr.fpo_fmr_pool);
+	if (fpo->fpo_is_fmr && fpo->fmr.fpo_fmr_pool) {
+		ib_destroy_fmr_pool(fpo->fmr.fpo_fmr_pool);
 	} else {
 		struct kib_fast_reg_descriptor *frd;
 		int i = 0;
@@ -1654,7 +1657,7 @@ void kiblnd_fmr_pool_unmap(struct kib_fmr *fmr, int status)
 
 int kiblnd_fmr_pool_map(struct kib_fmr_poolset *fps, struct kib_tx *tx,
 			struct kib_rdma_desc *rd, __u32 nob, __u64 iov,
-			struct kib_fmr *fmr, bool *is_fastreg)
+			struct kib_fmr *fmr)
 {
 	__u64 *pages = tx->tx_pages;
 	bool is_rx = (rd != tx->tx_rd);
@@ -1674,7 +1677,6 @@ int kiblnd_fmr_pool_map(struct kib_fmr_poolset *fps, struct kib_tx *tx,
 		if (fpo->fpo_is_fmr) {
 			struct ib_pool_fmr *pfmr;
 
-			*is_fastreg = 0;
 			spin_unlock(&fps->fps_lock);
 
 			if (!tx_pages_mapped) {
@@ -1694,7 +1696,6 @@ int kiblnd_fmr_pool_map(struct kib_fmr_poolset *fps, struct kib_tx *tx,
 			}
 			rc = PTR_ERR(pfmr);
 		} else {
-			*is_fastreg = 1;
 			if (!list_empty(&fpo->fast_reg.fpo_pool_list)) {
 				struct kib_fast_reg_descriptor *frd;
 				struct ib_reg_wr *wr;
diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
index aaf0118..a6008ea 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
@@ -121,9 +121,8 @@ struct kib_tunables {
 #define IBLND_OOB_CAPABLE(v)       ((v) != IBLND_MSG_VERSION_1)
 #define IBLND_OOB_MSGS(v)	   (IBLND_OOB_CAPABLE(v) ? 2 : 0)
 
-#define IBLND_FRAG_SHIFT	(PAGE_SHIFT - 12)	/* frag size on wire is in 4K units */
-#define IBLND_MSG_SIZE		(4 << 10)		/* max size of queued messages (inc hdr) */
-#define IBLND_MAX_RDMA_FRAGS	(LNET_MAX_PAYLOAD >> 12)/* max # of fragments supported in 4K size */
+#define IBLND_MSG_SIZE		(4 << 10)	/* max size of queued messages (inc hdr) */
+#define IBLND_MAX_RDMA_FRAGS	LNET_MAX_IOV	/* max # of fragments supported */
 
 /************************/
 /* derived constants... */
@@ -141,8 +140,8 @@ struct kib_tunables {
 /* WRs and CQEs (per connection) */
 #define IBLND_RECV_WRS(c)	IBLND_RX_MSGS(c)
 #define IBLND_SEND_WRS(c)	\
-	(((c->ibc_max_frags + 1) << IBLND_FRAG_SHIFT) * \
-	  kiblnd_concurrent_sends(c->ibc_version, c->ibc_peer->ibp_ni))
+	((c->ibc_max_frags + 1) * kiblnd_concurrent_sends(c->ibc_version, \
+							  c->ibc_peer->ibp_ni))
 #define IBLND_CQ_ENTRIES(c)	(IBLND_RECV_WRS(c) + IBLND_SEND_WRS(c))
 
 struct kib_hca_dev;
@@ -288,7 +287,7 @@ struct kib_fmr_pool {
 	time64_t		fpo_deadline;	/* deadline of this pool */
 	int                   fpo_failed;          /* fmr pool is failed */
 	int                   fpo_map_count;       /* # of mapped FMR */
-	int		      fpo_is_fmr;
+	bool			fpo_is_fmr;	/* True if FMR pools allocated */
 };
 
 struct kib_fmr {
@@ -515,7 +514,9 @@ struct kib_tx {					/* transmit message */
 	int                   tx_nfrags;      /* # entries in... */
 	struct scatterlist    *tx_frags;      /* dma_map_sg descriptor */
 	__u64                 *tx_pages;      /* rdma phys page addrs */
-	struct kib_fmr        fmr;	      /* FMR */
+	/* gaps in fragments */
+	bool			tx_gaps;
+	struct kib_fmr		tx_fmr;		/* FMR */
 	int                   tx_dmadir;      /* dma direction */
 };
 
@@ -616,26 +617,6 @@ struct kib_peer_ni {
 
 int kiblnd_msg_queue_size(int version, struct lnet_ni *ni);
 
-/* max # of fragments configured by user */
-static inline int
-kiblnd_cfg_rdma_frags(struct lnet_ni *ni)
-{
-	struct lnet_ioctl_config_o2iblnd_tunables *tunables;
-	int mod;
-
-	tunables = &ni->ni_lnd_tunables.lnd_tun_u.lnd_o2ib;
-	mod = tunables->lnd_map_on_demand;
-	return mod ? mod : IBLND_MAX_RDMA_FRAGS >> IBLND_FRAG_SHIFT;
-}
-
-static inline int
-kiblnd_rdma_frags(int version, struct lnet_ni *ni)
-{
-	return version == IBLND_MSG_VERSION_1 ?
-			  (IBLND_MAX_RDMA_FRAGS >> IBLND_FRAG_SHIFT) :
-			  kiblnd_cfg_rdma_frags(ni);
-}
-
 static inline int
 kiblnd_concurrent_sends(int version, struct lnet_ni *ni)
 {
@@ -1011,7 +992,7 @@ static inline unsigned int kiblnd_sg_dma_len(struct ib_device *dev,
 
 int  kiblnd_fmr_pool_map(struct kib_fmr_poolset *fps, struct kib_tx *tx,
 			 struct kib_rdma_desc *rd, __u32 nob, __u64 iov,
-			 struct kib_fmr *fmr, bool *is_fastreg);
+			 struct kib_fmr *fmr);
 void kiblnd_fmr_pool_unmap(struct kib_fmr *fmr, int status);
 
 int kiblnd_tunables_setup(struct lnet_ni *ni);
diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
index b16153f..9d30f31 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
@@ -133,6 +133,8 @@ static int kiblnd_init_rdma(struct kib_conn *conn, struct kib_tx *tx, int type,
 	LASSERT(!tx->tx_lntmsg[1]);
 	LASSERT(!tx->tx_nfrags);
 
+	tx->tx_gaps = false;
+
 	return tx;
 }
 
@@ -538,7 +540,7 @@ static int kiblnd_init_rdma(struct kib_conn *conn, struct kib_tx *tx, int type,
 {
 	struct kib_hca_dev *hdev;
 	struct kib_fmr_poolset *fps;
-	bool is_fastreg = 0;
+	struct kib_dev *dev;
 	int cpt;
 	int rc;
 	int i;
@@ -546,11 +548,42 @@ static int kiblnd_init_rdma(struct kib_conn *conn, struct kib_tx *tx, int type,
 	LASSERT(tx->tx_pool);
 	LASSERT(tx->tx_pool->tpo_pool.po_owner);
 
+	dev = net->ibn_dev;
 	hdev = tx->tx_pool->tpo_hdev;
 	cpt = tx->tx_pool->tpo_pool.po_owner->ps_cpt;
 
+	/*
+	 * If we're dealing with FastReg, but the device doesn't
+	 * support GAPS and the tx has GAPS, then there is no real point
+	 * in trying to map the memory, because it'll just fail. So
+	 * preemptively fail with an appropriate message
+	 */
+	if ((dev->ibd_dev_caps & IBLND_DEV_CAPS_FASTREG_ENABLED) &&
+	    !(dev->ibd_dev_caps & IBLND_DEV_CAPS_FASTREG_GAPS_SUPPORT) &&
+	    tx->tx_gaps) {
+		CERROR("Using FastReg with no GAPS support, but tx has gaps\n");
+		return -EPROTONOSUPPORT;
+	}
+
+	/*
+	 * FMR does not support gaps but the tx has gaps then
+	 * we should make sure that the number of fragments we'll be sending
+	 * over fits within the number of fragments negotiated on the
+	 * connection, otherwise, we won't be able to RDMA the data.
+	 * We need to maintain the number of fragments negotiation on the
+	 * connection for backwards compatibility.
+	 */
+	if (tx->tx_gaps && (dev->ibd_dev_caps & IBLND_DEV_CAPS_FMR_ENABLED)) {
+		if (tx->tx_conn &&
+		    tx->tx_conn->ibc_max_frags <= rd->rd_nfrags) {
+			CERROR("TX number of frags (%d) is <= than connection number of frags (%d). Consider setting peer's map_on_demand to 256\n",
+			       tx->tx_nfrags, tx->tx_conn->ibc_max_frags);
+			return -EFBIG;
+		}
+	}
+
 	fps = net->ibn_fmr_ps[cpt];
-	rc = kiblnd_fmr_pool_map(fps, tx, rd, nob, 0, &tx->fmr, &is_fastreg);
+	rc = kiblnd_fmr_pool_map(fps, tx, rd, nob, 0, &tx->tx_fmr);
 	if (rc) {
 		CERROR("Can't map %u bytes: %d\n", nob, rc);
 		return rc;
@@ -560,15 +593,28 @@ static int kiblnd_init_rdma(struct kib_conn *conn, struct kib_tx *tx, int type,
 	 * If rd is not tx_rd, it's going to get sent to a peer_ni,
 	 * who will need the rkey
 	 */
-	rd->rd_key = tx->fmr.fmr_key;
-	if (!is_fastreg) {
+	rd->rd_key = tx->tx_fmr.fmr_key;
+	/*
+	 * for FastReg or FMR with no gaps we can accumulate all
+	 * the fragments in one FastReg or FMR fragment.
+	 */
+	if (((dev->ibd_dev_caps & IBLND_DEV_CAPS_FMR_ENABLED) && !tx->tx_gaps) ||
+	    (dev->ibd_dev_caps & IBLND_DEV_CAPS_FASTREG_ENABLED)) {
+		/* FMR requires zero based address */
+		if (dev->ibd_dev_caps & IBLND_DEV_CAPS_FMR_ENABLED)
+			rd->rd_frags[0].rf_addr &= ~hdev->ibh_page_mask;
+		rd->rd_frags[0].rf_nob = nob;
+		rd->rd_nfrags = 1;
+	} else {
+		/*
+		 * We're transmitting with gaps using FMR.
+		 * We'll need to use multiple fragments and identify the
+		 * zero based address of each fragment.
+		 */
 		for (i = 0; i < rd->rd_nfrags; i++) {
 			rd->rd_frags[i].rf_addr &= ~hdev->ibh_page_mask;
 			rd->rd_frags[i].rf_addr += i << hdev->ibh_page_shift;
 		}
-	} else {
-		rd->rd_frags[0].rf_nob = nob;
-		rd->rd_nfrags = 1;
 	}
 
 	return 0;
@@ -576,8 +622,8 @@ static int kiblnd_init_rdma(struct kib_conn *conn, struct kib_tx *tx, int type,
 
 static void kiblnd_unmap_tx(struct kib_tx *tx)
 {
-	if (tx->fmr.fmr_pfmr || tx->fmr.fmr_frd)
-		kiblnd_fmr_pool_unmap(&tx->fmr, tx->tx_status);
+	if (tx->tx_fmr.fmr_pfmr || tx->tx_fmr.fmr_frd)
+		kiblnd_fmr_pool_unmap(&tx->tx_fmr, tx->tx_status);
 
 	if (tx->tx_nfrags) {
 		kiblnd_dma_unmap_sg(tx->tx_pool->tpo_hdev->ibh_ibdev,
@@ -656,6 +702,13 @@ static int kiblnd_map_tx(struct lnet_ni *ni, struct kib_tx *tx,
 		fragnob = min((int)(iov->iov_len - offset), nob);
 		fragnob = min(fragnob, (int)PAGE_SIZE - page_offset);
 
+		if ((fragnob < (int)PAGE_SIZE - page_offset) && (niov > 1)) {
+			CDEBUG(D_NET,
+			       "fragnob %d < available page %d: with remaining %d iovs\n",
+			       fragnob, (int)PAGE_SIZE - page_offset, niov);
+			tx->tx_gaps = true;
+		}
+
 		sg_set_page(sg, page, fragnob, page_offset);
 		sg = sg_next(sg);
 		if (!sg) {
@@ -704,6 +757,13 @@ static int kiblnd_map_tx(struct lnet_ni *ni, struct kib_tx *tx,
 
 		fragnob = min((int)(kiov->bv_len - offset), nob);
 
+		if ((fragnob < (int)(kiov->bv_len - offset)) && nkiov > 1) {
+			CDEBUG(D_NET,
+			       "fragnob %d < available page %d: with remaining %d kiovs\n",
+			       fragnob, (int)(kiov->bv_len - offset), nkiov);
+			tx->tx_gaps = true;
+		}
+
 		sg_set_page(sg, kiov->bv_page, fragnob,
 			    kiov->bv_offset + offset);
 		sg = sg_next(sg);
@@ -735,6 +795,7 @@ static int kiblnd_map_tx(struct lnet_ni *ni, struct kib_tx *tx,
 	LASSERT(tx->tx_queued);
 	/* We rely on this for QP sizing */
 	LASSERT(tx->tx_nwrq > 0 && tx->tx_nsge >= 0);
+	LASSERT(tx->tx_nwrq <= 1 + conn->ibc_max_frags);
 
 	LASSERT(!credit || credit == 1);
 	LASSERT(conn->ibc_outstanding_credits >= 0);
@@ -814,7 +875,7 @@ static int kiblnd_map_tx(struct lnet_ni *ni, struct kib_tx *tx,
 		/* close_conn will launch failover */
 		rc = -ENETDOWN;
 	} else {
-		struct kib_fast_reg_descriptor *frd = tx->fmr.fmr_frd;
+		struct kib_fast_reg_descriptor *frd = tx->tx_fmr.fmr_frd;
 		const struct ib_send_wr *bad = &tx->tx_wrq[tx->tx_nwrq - 1].wr;
 		struct ib_send_wr *wrq = &tx->tx_wrq[0].wr;
 
@@ -1042,15 +1103,6 @@ static int kiblnd_map_tx(struct lnet_ni *ni, struct kib_tx *tx,
 	LASSERT(!tx->tx_nwrq && !tx->tx_nsge);
 	LASSERT(type == IBLND_MSG_GET_DONE || type == IBLND_MSG_PUT_DONE);
 
-	if (kiblnd_rd_size(srcrd) > conn->ibc_max_frags << PAGE_SHIFT) {
-		CERROR("RDMA is too large for peer_ni %s (%d), src size: %d dst size: %d\n",
-		       libcfs_nid2str(conn->ibc_peer->ibp_nid),
-		       conn->ibc_max_frags << PAGE_SHIFT,
-		       kiblnd_rd_size(srcrd), kiblnd_rd_size(dstrd));
-		rc = -EMSGSIZE;
-		goto too_big;
-	}
-
 	for (srcidx = dstidx = wrq_sge = sge_nob = 0;
 	     resid > 0; resid -= sge_nob) {
 		int prev = dstidx;
@@ -1067,10 +1119,10 @@ static int kiblnd_map_tx(struct lnet_ni *ni, struct kib_tx *tx,
 			break;
 		}
 
-		if (tx->tx_nwrq >= IBLND_MAX_RDMA_FRAGS) {
+		if (tx->tx_nwrq >= conn->ibc_max_frags) {
 			CERROR("RDMA has too many fragments for peer_ni %s (%d), src idx/frags: %d/%d dst idx/frags: %d/%d\n",
 			       libcfs_nid2str(conn->ibc_peer->ibp_nid),
-			       IBLND_MAX_RDMA_FRAGS,
+			       conn->ibc_max_frags,
 			       srcidx, srcrd->rd_nfrags,
 			       dstidx, dstrd->rd_nfrags);
 			rc = -EMSGSIZE;
@@ -1110,7 +1162,7 @@ static int kiblnd_map_tx(struct lnet_ni *ni, struct kib_tx *tx,
 		}
 		tx->tx_nsge++;
 	}
-too_big:
+
 	if (rc < 0) { /* no RDMA if completing with failure */
 		tx->tx_nsge = 0;
 		tx->tx_nwrq = 0;
@@ -2335,21 +2387,20 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
 		goto failed;
 	}
 
-	max_frags = reqmsg->ibm_u.connparams.ibcp_max_frags >> IBLND_FRAG_SHIFT;
-	if (max_frags > kiblnd_rdma_frags(version, ni)) {
+	max_frags = reqmsg->ibm_u.connparams.ibcp_max_frags;
+	if (max_frags > IBLND_MAX_RDMA_FRAGS) {
 		CWARN("Can't accept conn from %s (version %x): max message size %d is too large (%d wanted)\n",
 		      libcfs_nid2str(nid), version, max_frags,
-		      kiblnd_rdma_frags(version, ni));
+		      IBLND_MAX_RDMA_FRAGS);
 
 		if (version >= IBLND_MSG_VERSION)
 			rej.ibr_why = IBLND_REJECT_RDMA_FRAGS;
 
 		goto failed;
-	} else if (max_frags < kiblnd_rdma_frags(version, ni) &&
-		   !net->ibn_fmr_ps) {
+	} else if (max_frags < IBLND_MAX_RDMA_FRAGS && !net->ibn_fmr_ps) {
 		CWARN("Can't accept conn from %s (version %x): max message size %d incompatible without FMR pool (%d wanted)\n",
 		      libcfs_nid2str(nid), version, max_frags,
-		      kiblnd_rdma_frags(version, ni));
+		      IBLND_MAX_RDMA_FRAGS);
 
 		if (version == IBLND_MSG_VERSION)
 			rej.ibr_why = IBLND_REJECT_RDMA_FRAGS;
@@ -2495,7 +2546,7 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
 	kiblnd_init_msg(ackmsg, IBLND_MSG_CONNACK,
 			sizeof(ackmsg->ibm_u.connparams));
 	ackmsg->ibm_u.connparams.ibcp_queue_depth = conn->ibc_queue_depth;
-	ackmsg->ibm_u.connparams.ibcp_max_frags = conn->ibc_max_frags << IBLND_FRAG_SHIFT;
+	ackmsg->ibm_u.connparams.ibcp_max_frags = conn->ibc_max_frags;
 	ackmsg->ibm_u.connparams.ibcp_max_msg_size = IBLND_MSG_SIZE;
 
 	kiblnd_pack_msg(ni, ackmsg, version, 0, nid, reqmsg->ibm_srcstamp);
@@ -2528,7 +2579,7 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
  failed:
 	if (ni) {
 		rej.ibr_cp.ibcp_queue_depth = kiblnd_msg_queue_size(version, ni);
-		rej.ibr_cp.ibcp_max_frags = kiblnd_rdma_frags(version, ni);
+		rej.ibr_cp.ibcp_max_frags = IBLND_MAX_RDMA_FRAGS;
 		lnet_ni_decref(ni);
 	}
 
@@ -2556,7 +2607,7 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
 
 	if (cp) {
 		msg_size = cp->ibcp_max_msg_size;
-		frag_num	= cp->ibcp_max_frags << IBLND_FRAG_SHIFT;
+		frag_num = cp->ibcp_max_frags;
 		queue_dep = cp->ibcp_queue_depth;
 	}
 
@@ -2590,6 +2641,10 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
 			goto out;
 		}
 		tunables = &peer_ni->ibp_ni->ni_lnd_tunables.lnd_tun_u.lnd_o2ib;
+		/*
+		 * This check only makes sense if the kernel supports global
+		 * memory registration. Otherwise, map_on_demand will never == 0
+		 */
 		if (!tunables->lnd_map_on_demand) {
 			reason = "map_on_demand must be enabled";
 			goto out;
@@ -2829,11 +2884,11 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
 		goto failed;
 	}
 
-	if ((msg->ibm_u.connparams.ibcp_max_frags >> IBLND_FRAG_SHIFT) >
+	if ((msg->ibm_u.connparams.ibcp_max_frags) >
 	    conn->ibc_max_frags) {
 		CERROR("%s has incompatible max_frags %d (<=%d wanted)\n",
 		       libcfs_nid2str(peer_ni->ibp_nid),
-		       msg->ibm_u.connparams.ibcp_max_frags >> IBLND_FRAG_SHIFT,
+		       msg->ibm_u.connparams.ibcp_max_frags,
 		       conn->ibc_max_frags);
 		rc = -EPROTO;
 		goto failed;
@@ -2867,7 +2922,7 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
 	conn->ibc_credits = msg->ibm_u.connparams.ibcp_queue_depth;
 	conn->ibc_reserved_credits = msg->ibm_u.connparams.ibcp_queue_depth;
 	conn->ibc_queue_depth = msg->ibm_u.connparams.ibcp_queue_depth;
-	conn->ibc_max_frags = msg->ibm_u.connparams.ibcp_max_frags >> IBLND_FRAG_SHIFT;
+	conn->ibc_max_frags = msg->ibm_u.connparams.ibcp_max_frags;
 	LASSERT(conn->ibc_credits + conn->ibc_reserved_credits +
 		IBLND_OOB_MSGS(ver) <= IBLND_RX_MSGS(conn));
 
@@ -2924,7 +2979,7 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
 	memset(msg, 0, sizeof(*msg));
 	kiblnd_init_msg(msg, IBLND_MSG_CONNREQ, sizeof(msg->ibm_u.connparams));
 	msg->ibm_u.connparams.ibcp_queue_depth = conn->ibc_queue_depth;
-	msg->ibm_u.connparams.ibcp_max_frags = conn->ibc_max_frags << IBLND_FRAG_SHIFT;
+	msg->ibm_u.connparams.ibcp_max_frags = conn->ibc_max_frags;
 	msg->ibm_u.connparams.ibcp_max_msg_size = IBLND_MSG_SIZE;
 
 	kiblnd_pack_msg(peer_ni->ibp_ni, msg, version,
diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_modparams.c b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_modparams.c
index 985ccdf..7fc6a8a 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_modparams.c
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_modparams.c
@@ -115,10 +115,37 @@
 module_param(use_fastreg_gaps, bool, 0444);
 MODULE_PARM_DESC(use_fastreg_gaps, "Enable discontiguous fastreg fragment support. Expect performance drop");
 
-#define IBLND_DEFAULT_MAP_ON_DEMAND IBLND_MAX_RDMA_FRAGS
+/*
+ * map_on_demand is a flag used to determine if we can use FMR or FastReg.
+ * This is applicable for kernels which support global memory regions. For
+ * later kernels this flag is always enabled, since we will always either
+ * use FMR or FastReg
+ * For kernels which support global memory regions map_on_demand defaults
+ * to 0 which means we will be using global memory regions exclusively.
+ * If it is set to a value other than 0, then we will behave as follows:
+ *  1. Always default the number of fragments to IBLND_MAX_RDMA_FRAGS
+ *  2. Create FMR/FastReg pools
+ *  3. Negotiate the supported number of fragments per connection
+ *  4. Attempt to transmit using global memory regions only if
+ *     map-on-demand is not turned on, otherwise use FMR or FastReg
+ *  5. In case of transmitting tx with GAPS over FMR we will need to
+ *     transmit it with multiple fragments. Look at the comments in
+ *     kiblnd_fmr_map_tx() for an explanation of the behavior.
+ *
+ * For later kernels we default map_on_demand to 1 and not allow
+ * it to be set to 0, since there is no longer support for global memory
+ * regions. Behavior:
+ *  1. Default the number of fragments to IBLND_MAX_RDMA_FRAGS
+ *  2. Create FMR/FastReg pools
+ *  3. Negotiate the supported number of fragments per connection
+ *  4. Look at the comments in kiblnd_fmr_map_tx() for an explanation of
+ *     the behavior when transmit with GAPS verses contiguous.
+ */
+
+#define IBLND_DEFAULT_MAP_ON_DEMAND 1
 static int map_on_demand = IBLND_DEFAULT_MAP_ON_DEMAND;
 module_param(map_on_demand, int, 0444);
-MODULE_PARM_DESC(map_on_demand, "map on demand");
+MODULE_PARM_DESC(map_on_demand, "map on demand (obsolete)");
 
 /* NB: this value is shared by all CPTs, it can grow at runtime */
 static int fmr_pool_size = 512;
@@ -234,6 +261,13 @@ int kiblnd_tunables_setup(struct lnet_ni *ni)
 		net_tunables->lct_peer_tx_credits =
 			net_tunables->lct_max_tx_credits;
 
+	/*
+	 * For kernels which do not support global memory regions, always
+	 * enable map_on_demand
+	 */
+	if (tunables->lnd_map_on_demand == 0)
+		tunables->lnd_map_on_demand = 1;
+
 	if (!tunables->lnd_peercredits_hiw)
 		tunables->lnd_peercredits_hiw = peer_credits_hiw;
 
@@ -243,19 +277,8 @@ int kiblnd_tunables_setup(struct lnet_ni *ni)
 	if (tunables->lnd_peercredits_hiw >= net_tunables->lct_peer_tx_credits)
 		tunables->lnd_peercredits_hiw = net_tunables->lct_peer_tx_credits - 1;
 
-	if (tunables->lnd_map_on_demand <= 0 ||
-	    tunables->lnd_map_on_demand > IBLND_MAX_RDMA_FRAGS) {
-		/* Use the default */
-		CWARN("Invalid map_on_demand (%d), expects 1 - %d. Using default of %d\n",
-		      tunables->lnd_map_on_demand,
-		      IBLND_MAX_RDMA_FRAGS, IBLND_DEFAULT_MAP_ON_DEMAND);
-		tunables->lnd_map_on_demand = IBLND_DEFAULT_MAP_ON_DEMAND;
-	}
-
-	if (tunables->lnd_map_on_demand == 1) {
-		/* don't make sense to create map if only one fragment */
-		tunables->lnd_map_on_demand = 2;
-	}
+	if (tunables->lnd_concurrent_sends == 0)
+		tunables->lnd_concurrent_sends = net_tunables->lct_peer_tx_credits;
 
 	if (!tunables->lnd_concurrent_sends) {
 		if (tunables->lnd_map_on_demand > 0 &&
@@ -299,7 +322,7 @@ int kiblnd_tunables_setup(struct lnet_ni *ni)
 void kiblnd_tunables_init(void)
 {
 	default_tunables.lnd_version = 0;
-	default_tunables.lnd_peercredits_hiw = peer_credits_hiw,
+	default_tunables.lnd_peercredits_hiw = peer_credits_hiw;
 	default_tunables.lnd_map_on_demand = map_on_demand;
 	default_tunables.lnd_concurrent_sends = concurrent_sends;
 	default_tunables.lnd_fmr_pool_size = fmr_pool_size;
-- 
1.8.3.1


* [lustre-devel] [PATCH 04/10] lustre: lnd: use less CQ entries for each connection
From: James Simmons @ 2018-10-14 18:55 UTC (permalink / raw)
  To: lustre-devel

From: Alexey Lyashkov <c17817@cray.com>

Currently we have two work-request chains per transfer. This means
the OFED stack will generate only two completion events if a
transfer fails. Reduce the number of CQ entries accordingly to
avoid extra resource consumption.
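
For illustration (numbers assumed): with 256 max frags and 8
concurrent sends, the old formula reserved (256 + 1) * 8 = 2056
send CQEs per connection, whereas two completion events per send
suffice: 2 * 8 = 16.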

Signed-off-by: Alexey Lyashkov <c17817@cray.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-9810
Seagate-bug-id: MRP-4508
Reviewed-on: https://review.whamcloud.com/28279
Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
Reviewed-by: Doug Oucharek <dougso@me.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
index a6008ea..cd64cfb 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
@@ -142,7 +142,9 @@ struct kib_tunables {
 #define IBLND_SEND_WRS(c)	\
 	((c->ibc_max_frags + 1) * kiblnd_concurrent_sends(c->ibc_version, \
 							  c->ibc_peer->ibp_ni))
-#define IBLND_CQ_ENTRIES(c)	(IBLND_RECV_WRS(c) + IBLND_SEND_WRS(c))
+#define IBLND_CQ_ENTRIES(c)	\
+	(IBLND_RECV_WRS(c) + 2 * kiblnd_concurrent_sends(c->ibc_version, \
+							 c->ibc_peer->ibp_ni))
 
 struct kib_hca_dev;
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 05/10] lustre: o2iblnd: limit cap.max_send_wr for MLX5
From: James Simmons @ 2018-10-14 18:55 UTC (permalink / raw)
  To: lustre-devel

From: Dmitry Eremin <dmitry.eremin@intel.com>

Decrease cap.max_send_wr until it is accepted by rdma_create_qp().
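
For example, a request for 4096 send WRs would be retried with
3072, then 2304, then 1728 (dropping a quarter each time) until
rdma_create_qp() succeeds or the count falls below 16.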

Signed-off-by: Dmitry Eremin <dmitry.eremin@intel.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-7124
Reviewed-on: http://review.whamcloud.com/18347
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Doug Oucharek <dougso@me.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
index 8f984a0..99a4650 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
@@ -812,7 +812,14 @@ struct kib_conn *kiblnd_create_conn(struct kib_peer_ni *peer_ni,
 
 	conn->ibc_sched = sched;
 
-	rc = rdma_create_qp(cmid, conn->ibc_hdev->ibh_pd, init_qp_attr);
+	do {
+		rc = rdma_create_qp(cmid, conn->ibc_hdev->ibh_pd, init_qp_attr);
+		if (!rc || init_qp_attr->cap.max_send_wr < 16)
+			break;
+
+		init_qp_attr->cap.max_send_wr -= init_qp_attr->cap.max_send_wr / 4;
+	} while (rc);
+
 	if (rc) {
 		CERROR("Can't create QP: %d, send_wr: %d, recv_wr: %d, send_sge: %d, recv_sge: %d\n",
 		       rc, init_qp_attr->cap.max_send_wr,
@@ -822,6 +829,10 @@ struct kib_conn *kiblnd_create_conn(struct kib_peer_ni *peer_ni,
 		goto failed_2;
 	}
 
+	if (init_qp_attr->cap.max_send_wr != IBLND_SEND_WRS(conn))
+		CDEBUG(D_NET, "original send wr %d, created with %d\n",
+		       IBLND_SEND_WRS(conn), init_qp_attr->cap.max_send_wr);
+
 	kfree(init_qp_attr);
 
 	/* 1 ref for caller and each rxmsg */
-- 
1.8.3.1


* [lustre-devel] [PATCH 06/10] lustre: lnd: calculate qp max_send_wrs properly
From: James Simmons @ 2018-10-14 18:55 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

The maximum number of in-flight transfers cannot exceed the
negotiated queue depth. Instead of calculating max_send_wr as the
negotiated number of frags * concurrent sends, it should be the
negotiated number of frags * queue depth.

If that value is too large for successful qp creation then
we reduce the queue depth in a loop until we successfully
create the qp or the queue depth dips below 2.

Due to the queue depth negotiation protocol it is guaranteed
that the queue depth on both the active and the passive
will match.

This change resolves the discrepancy created by the previous code,
which reduced max_send_wr by a quarter on each retry. That could
lead to:

  mlx5_ib_post_send:4184:(pid 26272): Failed to prepare WQE

when the o2iblnd transfers a message which requires more WRs than
the maximum that was allocated.
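
For illustration (numbers assumed): with ibc_max_frags = 256 and a
negotiated queue depth of 16, kiblnd_send_wrs() below requests
(1 + 256) * 16 = 4112 send WRs; if rdma_create_qp() refuses, the
queue depth steps down to 15, 14, ... until creation succeeds or
the depth reaches 2.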

Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Signed-off-by: Alexey Lyashkov <c17817@cray.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-10213
Reviewed-on: https://review.whamcloud.com/30310
Reviewed-by: Alexey Lyashkov <c17817@cray.com>
Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
Reviewed-by: Doug Oucharek <dougso@me.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c    | 30 +++++++++++++++++-----
 .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h    |  4 +--
 2 files changed, 24 insertions(+), 10 deletions(-)

diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
index 99a4650..43266d8 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
@@ -650,6 +650,19 @@ static struct kib_sched_info *kiblnd_get_scheduler(int cpt)
 	return NULL;
 }
 
+static unsigned int kiblnd_send_wrs(struct kib_conn *conn)
+{
+	/*
+	 * One WR for the LNet message
+	 * And ibc_max_frags for the transfer WRs
+	 */
+	unsigned int ret = 1 + conn->ibc_max_frags;
+
+	/* account for a maximum of ibc_queue_depth in-flight transfers */
+	ret *= conn->ibc_queue_depth;
+	return ret;
+}
+
 struct kib_conn *kiblnd_create_conn(struct kib_peer_ni *peer_ni,
 				    struct rdma_cm_id *cmid,
 				    int state, int version)
@@ -801,8 +814,6 @@ struct kib_conn *kiblnd_create_conn(struct kib_peer_ni *peer_ni,
 
 	init_qp_attr->event_handler = kiblnd_qp_event;
 	init_qp_attr->qp_context = conn;
-	init_qp_attr->cap.max_send_wr = IBLND_SEND_WRS(conn);
-	init_qp_attr->cap.max_recv_wr = IBLND_RECV_WRS(conn);
 	init_qp_attr->cap.max_send_sge = *kiblnd_tunables.kib_wrq_sge;
 	init_qp_attr->cap.max_recv_sge = 1;
 	init_qp_attr->sq_sig_type = IB_SIGNAL_REQ_WR;
@@ -813,11 +824,14 @@ struct kib_conn *kiblnd_create_conn(struct kib_peer_ni *peer_ni,
 	conn->ibc_sched = sched;
 
 	do {
+		init_qp_attr->cap.max_send_wr = kiblnd_send_wrs(conn);
+		init_qp_attr->cap.max_recv_wr = IBLND_RECV_WRS(conn);
+
 		rc = rdma_create_qp(cmid, conn->ibc_hdev->ibh_pd, init_qp_attr);
-		if (!rc || init_qp_attr->cap.max_send_wr < 16)
+		if (!rc || conn->ibc_queue_depth < 2)
 			break;
 
-		init_qp_attr->cap.max_send_wr -= init_qp_attr->cap.max_send_wr / 4;
+		conn->ibc_queue_depth--;
 	} while (rc);
 
 	if (rc) {
@@ -829,9 +843,11 @@ struct kib_conn *kiblnd_create_conn(struct kib_peer_ni *peer_ni,
 		goto failed_2;
 	}
 
-	if (init_qp_attr->cap.max_send_wr != IBLND_SEND_WRS(conn))
-		CDEBUG(D_NET, "original send wr %d, created with %d\n",
-		       IBLND_SEND_WRS(conn), init_qp_attr->cap.max_send_wr);
+	if (conn->ibc_queue_depth != peer_ni->ibp_queue_depth)
+		CWARN("peer %s - queue depth reduced from %u to %u  to allow for qp creation\n",
+		      libcfs_nid2str(peer_ni->ibp_nid),
+		      peer_ni->ibp_queue_depth,
+		      conn->ibc_queue_depth);
 
 	kfree(init_qp_attr);
 
diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
index cd64cfb..c6c8106 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
@@ -139,9 +139,7 @@ struct kib_tunables {
 
 /* WRs and CQEs (per connection) */
 #define IBLND_RECV_WRS(c)	IBLND_RX_MSGS(c)
-#define IBLND_SEND_WRS(c)	\
-	((c->ibc_max_frags + 1) * kiblnd_concurrent_sends(c->ibc_version, \
-							  c->ibc_peer->ibp_ni))
+
 #define IBLND_CQ_ENTRIES(c)	\
 	(IBLND_RECV_WRS(c) + 2 * kiblnd_concurrent_sends(c->ibc_version, \
 							 c->ibc_peer->ibp_ni))
-- 
1.8.3.1


* [lustre-devel] [PATCH 07/10] lustre: lnd: remove concurrent_sends tunable
From: James Simmons @ 2018-10-14 18:55 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

The concurrent_sends tunable was intended to limit the number of
in-flight transfers per connection. However, queue depth does the
exact same job: for example, if the queue depth is negotiated to 16
and concurrent_sends is set to 32, the maximum number of in-flight
transfers still doesn't exceed 16. There is no need to keep
concurrent_sends around since it doesn't add any unique
functionality.

Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-10291
Reviewed-on: https://review.whamcloud.com/30312
WC-bug-id: https://jira.whamcloud.com/browse/LU-10459
Reviewed-on: https://review.whamcloud.com/30751
Reviewed-by: Alexey Lyashkov <c17817@cray.com>
Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
Reviewed-by: Doug Oucharek <dougso@me.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h    | 23 +----------------
 .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c |  5 ++--
 .../lustre/lnet/klnds/o2iblnd/o2iblnd_modparams.c  | 29 +---------------------
 3 files changed, 4 insertions(+), 53 deletions(-)

diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
index c6c8106..c882345 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
@@ -141,8 +141,7 @@ struct kib_tunables {
 #define IBLND_RECV_WRS(c)	IBLND_RX_MSGS(c)
 
 #define IBLND_CQ_ENTRIES(c)	\
-	(IBLND_RECV_WRS(c) + 2 * kiblnd_concurrent_sends(c->ibc_version, \
-							 c->ibc_peer->ibp_ni))
+	(IBLND_RECV_WRS(c) + 2 * c->ibc_queue_depth)
 
 struct kib_hca_dev;
 
@@ -617,26 +616,6 @@ struct kib_peer_ni {
 
 int kiblnd_msg_queue_size(int version, struct lnet_ni *ni);
 
-static inline int
-kiblnd_concurrent_sends(int version, struct lnet_ni *ni)
-{
-	struct lnet_ioctl_config_o2iblnd_tunables *tunables;
-	int concurrent_sends;
-
-	tunables = &ni->ni_lnd_tunables.lnd_tun_u.lnd_o2ib;
-	concurrent_sends = tunables->lnd_concurrent_sends;
-
-	if (version == IBLND_MSG_VERSION_1) {
-		if (concurrent_sends > IBLND_MSG_QUEUE_SIZE_V1 * 2)
-			return IBLND_MSG_QUEUE_SIZE_V1 * 2;
-
-		if (concurrent_sends < IBLND_MSG_QUEUE_SIZE_V1 / 2)
-			return IBLND_MSG_QUEUE_SIZE_V1 / 2;
-	}
-
-	return concurrent_sends;
-}
-
 static inline void
 kiblnd_hdev_addref_locked(struct kib_hca_dev *hdev)
 {
diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
index 9d30f31..1f31798 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
@@ -787,7 +787,6 @@ static int kiblnd_map_tx(struct lnet_ni *ni, struct kib_tx *tx,
 {
 	struct kib_msg *msg = tx->tx_msg;
 	struct kib_peer_ni *peer_ni = conn->ibc_peer;
-	struct lnet_ni *ni = peer_ni->ibp_ni;
 	int ver = conn->ibc_version;
 	int rc;
 	int done;
@@ -803,7 +802,7 @@ static int kiblnd_map_tx(struct lnet_ni *ni, struct kib_tx *tx,
 	LASSERT(conn->ibc_credits >= 0);
 	LASSERT(conn->ibc_credits <= conn->ibc_queue_depth);
 
-	if (conn->ibc_nsends_posted == kiblnd_concurrent_sends(ver, ni)) {
+	if (conn->ibc_nsends_posted == conn->ibc_queue_depth) {
 		/* tx completions outstanding... */
 		CDEBUG(D_NET, "%s: posted enough\n",
 		       libcfs_nid2str(peer_ni->ibp_nid));
@@ -953,7 +952,7 @@ static int kiblnd_map_tx(struct lnet_ni *ni, struct kib_tx *tx,
 		return;
 	}
 
-	LASSERT(conn->ibc_nsends_posted <= kiblnd_concurrent_sends(ver, ni));
+	LASSERT(conn->ibc_nsends_posted <= conn->ibc_queue_depth);
 	LASSERT(!IBLND_OOB_CAPABLE(ver) ||
 		conn->ibc_noops_posted <= IBLND_OOB_MSGS(ver));
 	LASSERT(conn->ibc_reserved_credits >= 0);
diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_modparams.c b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_modparams.c
index 7fc6a8a..47e8a60 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_modparams.c
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_modparams.c
@@ -109,7 +109,7 @@
 
 static int concurrent_sends;
 module_param(concurrent_sends, int, 0444);
-MODULE_PARM_DESC(concurrent_sends, "send work-queue sizing");
+MODULE_PARM_DESC(concurrent_sends, "send work-queue sizing (obsolete)");
 
 static bool use_fastreg_gaps;
 module_param(use_fastreg_gaps, bool, 0444);
@@ -277,32 +277,6 @@ int kiblnd_tunables_setup(struct lnet_ni *ni)
 	if (tunables->lnd_peercredits_hiw >= net_tunables->lct_peer_tx_credits)
 		tunables->lnd_peercredits_hiw = net_tunables->lct_peer_tx_credits - 1;
 
-	if (tunables->lnd_concurrent_sends == 0)
-		tunables->lnd_concurrent_sends = net_tunables->lct_peer_tx_credits;
-
-	if (!tunables->lnd_concurrent_sends) {
-		if (tunables->lnd_map_on_demand > 0 &&
-		    tunables->lnd_map_on_demand <= IBLND_MAX_RDMA_FRAGS / 8) {
-			tunables->lnd_concurrent_sends =
-					net_tunables->lct_peer_tx_credits * 2;
-		} else {
-			tunables->lnd_concurrent_sends =
-				net_tunables->lct_peer_tx_credits;
-		}
-	}
-
-	if (tunables->lnd_concurrent_sends > net_tunables->lct_peer_tx_credits * 2)
-		tunables->lnd_concurrent_sends = net_tunables->lct_peer_tx_credits * 2;
-
-	if (tunables->lnd_concurrent_sends < net_tunables->lct_peer_tx_credits / 2)
-		tunables->lnd_concurrent_sends = net_tunables->lct_peer_tx_credits / 2;
-
-	if (tunables->lnd_concurrent_sends < net_tunables->lct_peer_tx_credits) {
-		CWARN("Concurrent sends %d is lower than message queue size: %d, performance may drop slightly.\n",
-		      tunables->lnd_concurrent_sends,
-		      net_tunables->lct_peer_tx_credits);
-	}
-
 	if (!tunables->lnd_fmr_pool_size)
 		tunables->lnd_fmr_pool_size = fmr_pool_size;
 	if (!tunables->lnd_fmr_flush_trigger)
@@ -324,7 +298,6 @@ void kiblnd_tunables_init(void)
 	default_tunables.lnd_version = 0;
 	default_tunables.lnd_peercredits_hiw = peer_credits_hiw;
 	default_tunables.lnd_map_on_demand = map_on_demand;
-	default_tunables.lnd_concurrent_sends = concurrent_sends;
 	default_tunables.lnd_fmr_pool_size = fmr_pool_size;
 	default_tunables.lnd_fmr_flush_trigger = fmr_flush_trigger;
 	default_tunables.lnd_fmr_cache = fmr_cache;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [lustre-devel] [PATCH 08/10] lustre: lnd: correct WR fast reg accounting
  2018-10-14 18:55 [lustre-devel] [PATCH 00/10] lustre: lnet: fixes for non-x86 systems James Simmons
                   ` (6 preceding siblings ...)
  2018-10-14 18:55 ` [lustre-devel] [PATCH 07/10] lustre: lnd: remove concurrent_sends tunable James Simmons
@ 2018-10-14 18:55 ` James Simmons
  2018-10-14 18:55 ` [lustre-devel] [PATCH 09/10] lustre: o2ib: use splice in kiblnd_peer_connect_failed() James Simmons
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: James Simmons @ 2018-10-14 18:55 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Ensure that enough WRs are allocated for the fast reg case, which
needs two additional WRs per transfer: the first for memory window
registration and the second for memory window invalidation.

Failure to allocate these causes the following problem:
mlx5_warn:mlx5_0:begin_wqe:4085(pid 9590): work queue overflow
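
As a rough illustration of the accounting (a minimal sketch, not
from the patch itself; the fragment count and queue depth below are
hypothetical example values):

	/* Sketch of the per-connection WR budget, mirroring the
	 * logic of kiblnd_send_wrs(). */
	static unsigned int send_wrs(unsigned int max_frags,
				     unsigned int queue_depth,
				     bool fastreg)
	{
		/* one WR for the LNet message plus one per fragment */
		unsigned int per_tx = 1 + max_frags;

		if (fastreg)
			per_tx += 2;	/* map WR + invalidate WR */

		/* up to queue_depth transfers may be in flight */
		return per_tx * queue_depth;
	}

	/* e.g. send_wrs(256, 8, false) == 2056
	 *      send_wrs(256, 8, true)  == 2072
	 * The 16 extra WRs are what the old code failed to reserve. */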

Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Signed-off-by: Alexey Lyashkov <c17817@cray.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-9943
Reviewed-on: https://review.whamcloud.com/30311
Reviewed-by: Alexey Lyashkov <c17817@cray.com>
Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
Reviewed-by: Doug Oucharek <dougso@me.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
index 43266d8..66aa45f 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
@@ -656,8 +656,13 @@ static unsigned int kiblnd_send_wrs(struct kib_conn *conn)
 	 * One WR for the LNet message
 	 * And ibc_max_frags for the transfer WRs
 	 */
+	u32 dev_caps = conn->ibc_hdev->ibh_dev->ibd_dev_caps;
 	unsigned int ret = 1 + conn->ibc_max_frags;
 
+	/* FastReg needs two extra WRs for map and invalidate */
+	if (dev_caps & IBLND_DEV_CAPS_FASTREG_ENABLED)
+		ret += 2;
+
 	/* account for a maximum of ibc_queue_depth in-flight transfers */
 	ret *= conn->ibc_queue_depth;
 	return ret;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [lustre-devel] [PATCH 09/10] lustre: o2ib: use splice in kiblnd_peer_connect_failed()
  2018-10-14 18:55 [lustre-devel] [PATCH 00/10] lustre: lnet: fixes for non-x86 systems James Simmons
                   ` (7 preceding siblings ...)
  2018-10-14 18:55 ` [lustre-devel] [PATCH 08/10] lustre: lnd: correct WR fast reg accounting James Simmons
@ 2018-10-14 18:55 ` James Simmons
  2018-10-14 18:55 ` [lustre-devel] [PATCH 10/10] lustre: lnet: make LNET_MAX_IOV dependent on page size James Simmons
  2018-10-18  4:48 ` [lustre-devel] [PATCH 00/10] lustre: lnet: fixes for non-x86 systems NeilBrown
  10 siblings, 0 replies; 20+ messages in thread
From: James Simmons @ 2018-10-14 18:55 UTC (permalink / raw)
  To: lustre-devel

From: "John L. Hammond" <jhammond@whamcloud.com>

In kiblnd_peer_connect_failed() replace a backwards list_add() and
list_del() with list_splice_init().
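
For context, a minimal sketch of the before/after (assuming only the
standard <linux/list.h> semantics; 'zombies' is the stack-local list
head already used by the function):

	LIST_HEAD(zombies);

	/* before: a "backwards" splice - thread 'zombies' into the
	 * queue's ring, then unlink the old head, leaving every
	 * queued tx anchored at 'zombies' */
	list_add(&zombies, &peer_ni->ibp_tx_queue);
	list_del_init(&peer_ni->ibp_tx_queue);

	/* after: the idiomatic equivalent in a single call */
	list_splice_init(&peer_ni->ibp_tx_queue, &zombies);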

Signed-off-by: John L. Hammond <jhammond@whamcloud.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-10819
Reviewed-on: https://review.whamcloud.com/31643
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
Reviewed-by: Doug Oucharek <dougso@me.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
index 1f31798..23ce59e 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
@@ -2118,8 +2118,7 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
 	peer_ni->ibp_reconnected = 0;
 	if (list_empty(&peer_ni->ibp_conns)) {
 		/* Take peer_ni's blocked transmits to complete with error */
-		list_add(&zombies, &peer_ni->ibp_tx_queue);
-		list_del_init(&peer_ni->ibp_tx_queue);
+		list_splice_init(&peer_ni->ibp_tx_queue, &zombies);
 
 		if (kiblnd_peer_active(peer_ni))
 			kiblnd_unlink_peer_locked(peer_ni);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [lustre-devel] [PATCH 10/10] lustre: lnet: make LNET_MAX_IOV dependent on page size
  2018-10-14 18:55 [lustre-devel] [PATCH 00/10] lustre: lnet: fixes for non-x86 systems James Simmons
                   ` (8 preceding siblings ...)
  2018-10-14 18:55 ` [lustre-devel] [PATCH 09/10] lustre: o2ib: use splice in kiblnd_peer_connect_failed() James Simmons
@ 2018-10-14 18:55 ` James Simmons
  2018-10-18  4:48 ` [lustre-devel] [PATCH 00/10] lustre: lnet: fixes for non-x86 systems NeilBrown
  10 siblings, 0 replies; 20+ messages in thread
From: James Simmons @ 2018-10-14 18:55 UTC (permalink / raw)
  To: lustre-devel

The default behavior of LNet is to always use 256 pages, which is
LNET_MAX_IOV, and LNET_MAX_PAYLOAD is always one megabyte. This
assumes pages are always 4K in size, which is not the case. This
causes bulk I/O errors on platforms like PowerPC or ARM, which tend
to use 64K pages. This is resolved by first making LNET_MAX_PAYLOAD
always one megabyte, since this is what the configuration sets it to
by default and no one ever changes it. In theory it could be set as
high as 16MB, but that would cause I/O errors since the ptlrpc layer
expects packets to always be 1 megabyte in size. It would also be
better to make the maximum payload a per-network configuration
instead of a global one. Second, we make LNET_MAX_IOV equal to
LNET_MAX_PAYLOAD divided by PAGE_SIZE. This way packets will always
be LNET_MAX_PAYLOAD in size, but the number of pages used,
LNET_MAX_IOV, will vary depending on the platform creating the
packets.
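
To illustrate the new relationship (the arithmetic follows directly
from the definitions in the patch below):

	/* LNET_MAX_PAYLOAD = LNET_MTU = 1 << 20 bytes (1 MiB)
	 * LNET_MAX_IOV     = LNET_MAX_PAYLOAD >> PAGE_SHIFT
	 *
	 *   4K pages  (PAGE_SHIFT == 12): LNET_MAX_IOV = 256
	 *   64K pages (PAGE_SHIFT == 16): LNET_MAX_IOV = 16
	 *
	 * so the wire payload stays fixed at 1 MiB while the number
	 * of pages backing it adapts to the platform. */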

Signed-off-by: James Simmons <uja.ornl@yahoo.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-10157
Reviewed-on: https://review.whamcloud.com/31559
Reviewed-by: Wang Shilong <wshilong@ddn.com>
Reviewed-by: Ruth Klundt <rklundt@sandia.gov>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/include/linux/lnet/lib-types.h       | 10 ++++------
 drivers/staging/lustre/include/uapi/linux/lnet/lnet-types.h |  3 ---
 drivers/staging/lustre/lnet/Kconfig                         | 10 ----------
 drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c         |  4 ++--
 drivers/staging/lustre/lustre/include/lustre_net.h          |  1 +
 5 files changed, 7 insertions(+), 21 deletions(-)

diff --git a/drivers/staging/lustre/include/linux/lnet/lib-types.h b/drivers/staging/lustre/include/linux/lnet/lib-types.h
index 19f7b11..8951a53 100644
--- a/drivers/staging/lustre/include/linux/lnet/lib-types.h
+++ b/drivers/staging/lustre/include/linux/lnet/lib-types.h
@@ -37,6 +37,7 @@
 #define __LNET_LIB_TYPES_H__
 
 #include <linux/kthread.h>
+#include <linux/net.h>
 #include <linux/uio.h>
 #include <linux/types.h>
 #include <linux/completion.h>
@@ -46,12 +47,9 @@
 #include <uapi/linux/lnet/lnet-dlc.h>
 
 /* Max payload size */
-#define LNET_MAX_PAYLOAD      CONFIG_LNET_MAX_PAYLOAD
-#if (LNET_MAX_PAYLOAD < LNET_MTU)
-# error "LNET_MAX_PAYLOAD too small - error in configure --with-max-payload-mb"
-#elif (LNET_MAX_PAYLOAD > (PAGE_SIZE * LNET_MAX_IOV))
-# error "LNET_MAX_PAYLOAD too large - error in configure --with-max-payload-mb"
-#endif
+#define LNET_MAX_PAYLOAD	LNET_MTU
+
+#define LNET_MAX_IOV		(LNET_MAX_PAYLOAD >> PAGE_SHIFT)
 
 /* forward refs */
 struct lnet_libmd;
diff --git a/drivers/staging/lustre/include/uapi/linux/lnet/lnet-types.h b/drivers/staging/lustre/include/uapi/linux/lnet/lnet-types.h
index 1ecf18e..e440100 100644
--- a/drivers/staging/lustre/include/uapi/linux/lnet/lnet-types.h
+++ b/drivers/staging/lustre/include/uapi/linux/lnet/lnet-types.h
@@ -522,9 +522,6 @@ struct lnet_md {
 #define LNET_MTU_BITS	20
 #define LNET_MTU	(1 << LNET_MTU_BITS)
 
-/** limit on the number of fragments in discontiguous MDs */
-#define LNET_MAX_IOV	256
-
 /**
  * Options for the MD structure. See lnet_md::options.
  */
diff --git a/drivers/staging/lustre/lnet/Kconfig b/drivers/staging/lustre/lnet/Kconfig
index ad049e6..6062a82 100644
--- a/drivers/staging/lustre/lnet/Kconfig
+++ b/drivers/staging/lustre/lnet/Kconfig
@@ -8,16 +8,6 @@ config LNET
 	  case of Lustre routers only the LNet layer is required. Lately other
 	  projects are also looking into using LNet as their networking API as well.
 
-config LNET_MAX_PAYLOAD
-	int "Lustre lnet max transfer payload (default 1MB)"
-	depends on LNET
-	default "1048576"
-	help
-	  This option defines the maximum size of payload in bytes that lnet
-	  can put into its transport.
-
-	  If unsure, use default.
-
 config LNET_SELFTEST
 	tristate "Lustre networking self testing"
 	depends on LNET
diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
index 66aa45f..68a8963 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
@@ -1414,7 +1414,7 @@ static void kiblnd_destroy_fmr_pool_list(struct list_head *head)
 static int kiblnd_alloc_fmr_pool(struct kib_fmr_poolset *fps, struct kib_fmr_pool *fpo)
 {
 	struct ib_fmr_pool_param param = {
-		.max_pages_per_fmr = LNET_MAX_PAYLOAD / PAGE_SIZE,
+		.max_pages_per_fmr = LNET_MAX_IOV,
 		.page_shift        = PAGE_SHIFT,
 		.access            = (IB_ACCESS_LOCAL_WRITE |
 				      IB_ACCESS_REMOTE_WRITE),
@@ -1474,7 +1474,7 @@ static int kiblnd_alloc_freg_pool(struct kib_fmr_poolset *fps,
 		frd->frd_mr = ib_alloc_mr(fpo->fpo_hdev->ibh_pd,
 					  fastreg_gaps ? IB_MR_TYPE_SG_GAPS :
 							 IB_MR_TYPE_MEM_REG,
-					  LNET_MAX_PAYLOAD / PAGE_SIZE);
+					  LNET_MAX_IOV);
 		if (IS_ERR(frd->frd_mr)) {
 			rc = PTR_ERR(frd->frd_mr);
 			CERROR("Failed to allocate ib_alloc_mr: %d\n", rc);
diff --git a/drivers/staging/lustre/lustre/include/lustre_net.h b/drivers/staging/lustre/lustre/include/lustre_net.h
index 2df72c7..ce7e98c 100644
--- a/drivers/staging/lustre/lustre/include/lustre_net.h
+++ b/drivers/staging/lustre/lustre/include/lustre_net.h
@@ -56,6 +56,7 @@
 #include <linux/libcfs/libcfs.h>
 #include <uapi/linux/lnet/nidstr.h>
 #include <linux/lnet/api.h>
+#include <linux/lnet/lib-types.h>
 #include <uapi/linux/lustre/lustre_idl.h>
 #include <lustre_errno.h>
 #include <lustre_ha.h>
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [lustre-devel] [PATCH 01/10] lustre: lnd: set device capabilities
  2018-10-14 18:55 ` [lustre-devel] [PATCH 01/10] lustre: lnd: set device capabilities James Simmons
@ 2018-10-17  5:54   ` NeilBrown
  2018-10-20 16:58     ` James Simmons
  0 siblings, 1 reply; 20+ messages in thread
From: NeilBrown @ 2018-10-17  5:54 UTC (permalink / raw)
  To: lustre-devel

On Sun, Oct 14 2018, James Simmons wrote:

> From: Amir Shehata <ashehata@whamcloud.com>
>
> MLX-4, MLX-5 and OPA support different capabilities. Query the
> device and cache the capabilities of the device for future use.
>
> At the time of the patches creation MLX5 could support fast
> registration and gaps while MLX4 and OPA only support FMR
>
> Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
> WC-bug-id: https://jira.whamcloud.com/browse/LU-10129
> Reviewed-on: https://review.whamcloud.com/30309
> Reviewed-by: Alexey Lyashkov <c17817@cray.com>
> Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
> Reviewed-by: James Simmons <uja.ornl@yahoo.com>
> Reviewed-by: Oleg Drokin <green@whamcloud.com>
> Signed-off-by: James Simmons <jsimmons@infradead.org>
> ---
>  .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c    | 41 ++++++++++++----------
>  .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h    |  5 +++
>  2 files changed, 28 insertions(+), 18 deletions(-)
>
> diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
> index bf969b3..b10658b 100644
> --- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
> +++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
> @@ -1399,6 +1399,7 @@ static int kiblnd_alloc_fmr_pool(struct kib_fmr_poolset *fps, struct kib_fmr_poo
>  		else
>  			CERROR("FMRs are not supported\n");
>  	}
> +	fpo->fpo_is_fmr = true;
>  
>  	return rc;
>  }
> @@ -1408,6 +1409,8 @@ static int kiblnd_alloc_freg_pool(struct kib_fmr_poolset *fps, struct kib_fmr_po
>  	struct kib_fast_reg_descriptor *frd;
>  	int i, rc;
>  
> +	fpo->fpo_is_fmr = false;
> +
>  	INIT_LIST_HEAD(&fpo->fast_reg.fpo_pool_list);
>  	fpo->fast_reg.fpo_pool_size = 0;
>  	for (i = 0; i < fps->fps_pool_size; i++) {
> @@ -1469,23 +1472,7 @@ static int kiblnd_create_fmr_pool(struct kib_fmr_poolset *fps,
>  	fpo->fpo_hdev = kiblnd_current_hdev(dev);
>  	dev_attr = &fpo->fpo_hdev->ibh_ibdev->attrs;
>  
> -	/* Check for FMR or FastReg support */
> -	fpo->fpo_is_fmr = 0;
> -	if (fpo->fpo_hdev->ibh_ibdev->alloc_fmr &&
> -	    fpo->fpo_hdev->ibh_ibdev->dealloc_fmr &&
> -	    fpo->fpo_hdev->ibh_ibdev->map_phys_fmr &&
> -	    fpo->fpo_hdev->ibh_ibdev->unmap_fmr) {
> -		LCONSOLE_INFO("Using FMR for registration\n");
> -		fpo->fpo_is_fmr = 1;
> -	} else if (dev_attr->device_cap_flags & IB_DEVICE_MEM_MGT_EXTENSIONS) {
> -		LCONSOLE_INFO("Using FastReg for registration\n");
> -	} else {
> -		rc = -ENOSYS;
> -		LCONSOLE_ERROR_MSG(rc, "IB device does not support FMRs nor FastRegs, can't register memory\n");
> -		goto out_fpo;
> -	}
> -
> -	if (fpo->fpo_is_fmr)
> +	if (dev->ibd_dev_caps & IBLND_DEV_CAPS_FMR_ENABLED)
>  		rc = kiblnd_alloc_fmr_pool(fps, fpo);
>  	else
>  		rc = kiblnd_alloc_freg_pool(fps, fpo);
> @@ -2261,6 +2248,9 @@ static int kiblnd_net_init_pools(struct kib_net *net, struct lnet_ni *ni,
>  
>  static int kiblnd_hdev_get_attr(struct kib_hca_dev *hdev)
>  {
> +	struct ib_device_attr *dev_attr = &hdev->ibh_ibdev->attrs;
> +	int rc = 0;
> +
>  	/*
>  	 * It's safe to assume a HCA can handle a page size
>  	 * matching that of the native system
> @@ -2269,7 +2259,22 @@ static int kiblnd_hdev_get_attr(struct kib_hca_dev *hdev)
>  	hdev->ibh_page_size  = 1 << PAGE_SHIFT;
>  	hdev->ibh_page_mask  = ~((__u64)hdev->ibh_page_size - 1);
>  
> -	hdev->ibh_mr_size = hdev->ibh_ibdev->attrs.max_mr_size;
> +	if (hdev->ibh_ibdev->alloc_fmr &&
> +	    hdev->ibh_ibdev->dealloc_fmr &&
> +	    hdev->ibh_ibdev->map_phys_fmr &&
> +	    hdev->ibh_ibdev->unmap_fmr) {
> +		LCONSOLE_INFO("Using FMR for registration\n");
> +		hdev->ibh_dev->ibd_dev_caps |= IBLND_DEV_CAPS_FMR_ENABLED;
> +	} else if (dev_attr->device_cap_flags & IB_DEVICE_MEM_MGT_EXTENSIONS) {
> +		LCONSOLE_INFO("Using FastReg for registration\n");
> +		hdev->ibh_dev->ibd_dev_caps |= IBLND_DEV_CAPS_FASTREG_ENABLED;
> +	} else {
> +		CERROR("IB device does not support FMRs nor FastRegs, can't register memory: %d\n",
> +		       rc);
> +		return -ENXIO;
> +	}
> +
> +	hdev->ibh_mr_size = dev_attr->max_mr_size;
>  	if (hdev->ibh_mr_size == ~0ULL) {
>  		hdev->ibh_mr_shift = 64;
>  		return 0;
> diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
> index a4438d2..9f0a47d 100644
> --- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
> +++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
> @@ -73,6 +73,10 @@
>  #define IBLND_N_SCHED			2
>  #define IBLND_N_SCHED_HIGH		4
>  
> +#define IBLND_DEV_CAPS_FASTREG_ENABLED		0x1
> +#define IBLND_DEV_CAPS_FASTREG_GAPS_SUPPORT	0x2
> +#define IBLND_DEV_CAPS_FMR_ENABLED		0x4
> +

BIT(0), BIT(1), .... ???

>  struct kib_tunables {
>  	int *kib_dev_failover;           /* HCA failover */
>  	unsigned int *kib_service;       /* IB service number */
> @@ -162,6 +166,7 @@ struct kib_dev {
>  	unsigned int ibd_can_failover; /* IPoIB interface is a bonding master */
>  	struct list_head   ibd_nets;
>  	struct kib_hca_dev *ibd_hdev;
> +	u32			ibd_dev_caps;

"unsigned int" would be better I think, but it isn't very important.

Thanks,
NeilBrown


>  };
>  
>  struct kib_hca_dev {
> -- 
> 1.8.3.1

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [lustre-devel] [PATCH 03/10] lustre: lnd: rework map_on_demand behavior
  2018-10-14 18:55 ` [lustre-devel] [PATCH 03/10] lustre: lnd: rework map_on_demand behavior James Simmons
@ 2018-10-17  6:11   ` NeilBrown
  2018-10-20 17:06     ` James Simmons
  0 siblings, 1 reply; 20+ messages in thread
From: NeilBrown @ 2018-10-17  6:11 UTC (permalink / raw)
  To: lustre-devel

On Sun, Oct 14 2018, James Simmons wrote:

> @@ -1320,9 +1324,8 @@ static void kiblnd_destroy_fmr_pool(struct kib_fmr_pool *fpo)
>  {
>  	LASSERT(!fpo->fpo_map_count);
>  
> -	if (fpo->fpo_is_fmr) {
> -		if (fpo->fmr.fpo_fmr_pool)
> -			ib_destroy_fmr_pool(fpo->fmr.fpo_fmr_pool);
> +	if (fpo->fpo_is_fmr && fpo->fmr.fpo_fmr_pool) {
> +		ib_destroy_fmr_pool(fpo->fmr.fpo_fmr_pool);
>  	} else {
>  		struct kib_fast_reg_descriptor *frd;
>  		int i = 0;

This change looks strange.
With the new code: if fpo_fmr_pool is NULL, then the fpo_pool_list is
destroyed, even though fpo_is_fmr.
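
Spelled out against the hunk above, the case I mean (fpo_is_fmr true
but fpo_fmr_pool NULL) behaves differently before and after:

	/* old: the FMR branch is taken and the destroy is skipped */
	if (fpo->fpo_is_fmr) {
		if (fpo->fmr.fpo_fmr_pool)
			ib_destroy_fmr_pool(fpo->fmr.fpo_fmr_pool);
	}

	/* new: the combined condition fails, so control falls into
	 * the 'else' branch and walks fpo_pool_list as if this were
	 * a FastReg pool */
	if (fpo->fpo_is_fmr && fpo->fmr.fpo_fmr_pool) {
		ib_destroy_fmr_pool(fpo->fmr.fpo_fmr_pool);
	} else {
		/* FastReg descriptor teardown runs here even for
		 * an FMR pool whose ib pool was never created */
	}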

Are we sure that is correct??
I don't really understand this code so it might be correct, but the
change looks wrong.

Can someone please confirm that it really is right - and maybe explain
why?

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [lustre-devel] [PATCH 00/10] lustre: lnet: fixes for non-x86 systems
  2018-10-14 18:55 [lustre-devel] [PATCH 00/10] lustre: lnet: fixes for non-x86 systems James Simmons
                   ` (9 preceding siblings ...)
  2018-10-14 18:55 ` [lustre-devel] [PATCH 10/10] lustre: lnet: make LNET_MAX_IOV dependent on page size James Simmons
@ 2018-10-18  4:48 ` NeilBrown
  2018-10-20 19:00   ` James Simmons
  10 siblings, 1 reply; 20+ messages in thread
From: NeilBrown @ 2018-10-18  4:48 UTC (permalink / raw)
  To: lustre-devel

On Sun, Oct 14 2018, James Simmons wrote:

> These are the required patches to make LNet work with non-x86
> platforms like ARM or Power8. The tunables map_on_demand and
> concurrent_sends assumed pages were 4K in size which is not
> correct. Massively reworked to basically defunct those tunables.
> Also the size of the LNet packet was always 256 pages but when
> the page size is 64K like some ARM or Power8 systems the maximum
> LNet message sent was 16MB not 1MB which is what is expected.
> Fixing up the RDMA handling in the ko2iblnd driver also resolved
> some performance issues.

Thanks.
I've applied this series, but I'm still suspicious of that little
section in patch 3.
I hope that can be clarified, or fixed, before I feel the need to move
the series from 'lustre-testing' to 'lustre'.

Thanks,
NeilBrown


>
> Alexey Lyashkov (1):
>   lustre: lnd: use less CQ entries for each connection
>
> Amir Shehata (6):
>   lustre: lnd: set device capabilities
>   lustre: o2iblnd: use IB_MR_TYPE_SG_GAPS
>   lustre: lnd: rework map_on_demand behavior
>   lustre: lnd: calculate qp max_send_wrs properly
>   lustre: lnd: remove concurrent_sends tunable
>   lustre: lnd: correct WR fast reg accounting
>
> Dmitry Eremin (1):
>   lustre: o2iblnd: limit cap.max_send_wr for MLX5
>
> James Simmons (1):
>   lustre: lnet: make LNET_MAX_IOV dependent on page size
>
> John L. Hammond (1):
>   lustre: o2ib: use splice in kiblnd_peer_connect_failed()
>
>  .../staging/lustre/include/linux/lnet/lib-types.h  |  10 +-
>  .../lustre/include/uapi/linux/lnet/lnet-types.h    |   3 -
>  drivers/staging/lustre/lnet/Kconfig                |  10 --
>  .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c    | 138 +++++++++++++++------
>  .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h    |  68 +++-------
>  .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c | 133 ++++++++++++++------
>  .../lustre/lnet/klnds/o2iblnd/o2iblnd_modparams.c  |  85 ++++++-------
>  drivers/staging/lustre/lustre/include/lustre_net.h |   1 +
>  8 files changed, 256 insertions(+), 192 deletions(-)
>
> -- 
> 1.8.3.1

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [lustre-devel] [PATCH 01/10] lustre: lnd: set device capabilities
  2018-10-17  5:54   ` NeilBrown
@ 2018-10-20 16:58     ` James Simmons
  2018-10-22  2:48       ` NeilBrown
  0 siblings, 1 reply; 20+ messages in thread
From: James Simmons @ 2018-10-20 16:58 UTC (permalink / raw)
  To: lustre-devel


> On Sun, Oct 14 2018, James Simmons wrote:
> 
> > From: Amir Shehata <ashehata@whamcloud.com>
> >
> > MLX-4, MLX-5 and OPA support different capabilities. Query the
> > device and cache the capabilities of the device for future use.
> >
> > At the time of the patches creation MLX5 could support fast
> > registration and gaps while MLX4 and OPA only support FMR
> >
> > Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
> > WC-bug-id: https://jira.whamcloud.com/browse/LU-10129
> > Reviewed-on: https://review.whamcloud.com/30309
> > Reviewed-by: Alexey Lyashkov <c17817@cray.com>
> > Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
> > Reviewed-by: James Simmons <uja.ornl@yahoo.com>
> > Reviewed-by: Oleg Drokin <green@whamcloud.com>
> > Signed-off-by: James Simmons <jsimmons@infradead.org>
> > ---
> >  .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c    | 41 ++++++++++++----------
> >  .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h    |  5 +++
> >  2 files changed, 28 insertions(+), 18 deletions(-)
> >
> > diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
> > index bf969b3..b10658b 100644
> > --- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
> > +++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
> > @@ -1399,6 +1399,7 @@ static int kiblnd_alloc_fmr_pool(struct kib_fmr_poolset *fps, struct kib_fmr_poo
> >  		else
> >  			CERROR("FMRs are not supported\n");
> >  	}
> > +	fpo->fpo_is_fmr = true;
> >  
> >  	return rc;
> >  }
> > @@ -1408,6 +1409,8 @@ static int kiblnd_alloc_freg_pool(struct kib_fmr_poolset *fps, struct kib_fmr_po
> >  	struct kib_fast_reg_descriptor *frd;
> >  	int i, rc;
> >  
> > +	fpo->fpo_is_fmr = false;
> > +
> >  	INIT_LIST_HEAD(&fpo->fast_reg.fpo_pool_list);
> >  	fpo->fast_reg.fpo_pool_size = 0;
> >  	for (i = 0; i < fps->fps_pool_size; i++) {
> > @@ -1469,23 +1472,7 @@ static int kiblnd_create_fmr_pool(struct kib_fmr_poolset *fps,
> >  	fpo->fpo_hdev = kiblnd_current_hdev(dev);
> >  	dev_attr = &fpo->fpo_hdev->ibh_ibdev->attrs;
> >  
> > -	/* Check for FMR or FastReg support */
> > -	fpo->fpo_is_fmr = 0;
> > -	if (fpo->fpo_hdev->ibh_ibdev->alloc_fmr &&
> > -	    fpo->fpo_hdev->ibh_ibdev->dealloc_fmr &&
> > -	    fpo->fpo_hdev->ibh_ibdev->map_phys_fmr &&
> > -	    fpo->fpo_hdev->ibh_ibdev->unmap_fmr) {
> > -		LCONSOLE_INFO("Using FMR for registration\n");
> > -		fpo->fpo_is_fmr = 1;
> > -	} else if (dev_attr->device_cap_flags & IB_DEVICE_MEM_MGT_EXTENSIONS) {
> > -		LCONSOLE_INFO("Using FastReg for registration\n");
> > -	} else {
> > -		rc = -ENOSYS;
> > -		LCONSOLE_ERROR_MSG(rc, "IB device does not support FMRs nor FastRegs, can't register memory\n");
> > -		goto out_fpo;
> > -	}
> > -
> > -	if (fpo->fpo_is_fmr)
> > +	if (dev->ibd_dev_caps & IBLND_DEV_CAPS_FMR_ENABLED)
> >  		rc = kiblnd_alloc_fmr_pool(fps, fpo);
> >  	else
> >  		rc = kiblnd_alloc_freg_pool(fps, fpo);
> > @@ -2261,6 +2248,9 @@ static int kiblnd_net_init_pools(struct kib_net *net, struct lnet_ni *ni,
> >  
> >  static int kiblnd_hdev_get_attr(struct kib_hca_dev *hdev)
> >  {
> > +	struct ib_device_attr *dev_attr = &hdev->ibh_ibdev->attrs;
> > +	int rc = 0;
> > +
> >  	/*
> >  	 * It's safe to assume a HCA can handle a page size
> >  	 * matching that of the native system
> > @@ -2269,7 +2259,22 @@ static int kiblnd_hdev_get_attr(struct kib_hca_dev *hdev)
> >  	hdev->ibh_page_size  = 1 << PAGE_SHIFT;
> >  	hdev->ibh_page_mask  = ~((__u64)hdev->ibh_page_size - 1);
> >  
> > -	hdev->ibh_mr_size = hdev->ibh_ibdev->attrs.max_mr_size;
> > +	if (hdev->ibh_ibdev->alloc_fmr &&
> > +	    hdev->ibh_ibdev->dealloc_fmr &&
> > +	    hdev->ibh_ibdev->map_phys_fmr &&
> > +	    hdev->ibh_ibdev->unmap_fmr) {
> > +		LCONSOLE_INFO("Using FMR for registration\n");
> > +		hdev->ibh_dev->ibd_dev_caps |= IBLND_DEV_CAPS_FMR_ENABLED;
> > +	} else if (dev_attr->device_cap_flags & IB_DEVICE_MEM_MGT_EXTENSIONS) {
> > +		LCONSOLE_INFO("Using FastReg for registration\n");
> > +		hdev->ibh_dev->ibd_dev_caps |= IBLND_DEV_CAPS_FASTREG_ENABLED;
> > +	} else {
> > +		CERROR("IB device does not support FMRs nor FastRegs, can't register memory: %d\n",
> > +		       rc);
> > +		return -ENXIO;
> > +	}
> > +
> > +	hdev->ibh_mr_size = dev_attr->max_mr_size;
> >  	if (hdev->ibh_mr_size == ~0ULL) {
> >  		hdev->ibh_mr_shift = 64;
> >  		return 0;
> > diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
> > index a4438d2..9f0a47d 100644
> > --- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
> > +++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
> > @@ -73,6 +73,10 @@
> >  #define IBLND_N_SCHED			2
> >  #define IBLND_N_SCHED_HIGH		4
> >  
> > +#define IBLND_DEV_CAPS_FASTREG_ENABLED		0x1
> > +#define IBLND_DEV_CAPS_FASTREG_GAPS_SUPPORT	0x2
> > +#define IBLND_DEV_CAPS_FMR_ENABLED		0x4
> > +
> 
> BIT(0), BIT(1), .... ???
> 
> >  struct kib_tunables {
> >  	int *kib_dev_failover;           /* HCA failover */
> >  	unsigned int *kib_service;       /* IB service number */
> > @@ -162,6 +166,7 @@ struct kib_dev {
> >  	unsigned int ibd_can_failover; /* IPoIB interface is a bonding master */
> >  	struct list_head   ibd_nets;
> >  	struct kib_hca_dev *ibd_hdev;
> > +	u32			ibd_dev_caps;
> 
> "unsigned int" would be better I think, but it isn't very important.

Actually better yet it could be changed to an enum. Move the below to
just above struct kib_dev {

enum ibd_dev_caps {
	IBLND_DEV_CAPS_FASTREG_ENABLED		= BIT(0),
	IBLND_DEV_CAPS_FASTREG_GAPS_SUPPORT	= BIT(1),
	IBLND_DEV_CAPS_FMR_ENABLED		= BIT(2),
};

and change ibd_dev_caps from u32 to enum ibd_dev_caps.

Do you mind if that is a follow on patch?

> Thanks,
> NeilBrown
> 
> 
> >  };
> >  
> >  struct kib_hca_dev {
> > -- 
> > 1.8.3.1
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [lustre-devel] [PATCH 03/10] lustre: lnd: rework map_on_demand behavior
  2018-10-17  6:11   ` NeilBrown
@ 2018-10-20 17:06     ` James Simmons
  2018-10-22  3:09       ` NeilBrown
  0 siblings, 1 reply; 20+ messages in thread
From: James Simmons @ 2018-10-20 17:06 UTC (permalink / raw)
  To: lustre-devel


> On Sun, Oct 14 2018, James Simmons wrote:
> 
> > @@ -1320,9 +1324,8 @@ static void kiblnd_destroy_fmr_pool(struct kib_fmr_pool *fpo)
> >  {
> >  	LASSERT(!fpo->fpo_map_count);
> >  
> > -	if (fpo->fpo_is_fmr) {
> > -		if (fpo->fmr.fpo_fmr_pool)
> > -			ib_destroy_fmr_pool(fpo->fmr.fpo_fmr_pool);
> > +	if (fpo->fpo_is_fmr && fpo->fmr.fpo_fmr_pool) {
> > +		ib_destroy_fmr_pool(fpo->fmr.fpo_fmr_pool);
> >  	} else {
> >  		struct kib_fast_reg_descriptor *frd;
> >  		int i = 0;
> 
> This change looks strange.
> With the new code: if fpo_fmr_pool is NULL, then the fpo_pool_list is
> destroyed, even though fpo_is_fmr.
> 
> Are we sure that is correct??
> I don't really understand this code so it might be correct, but the
> change looks wrong.
> 
> Can someone please confirm that it really is right - and maybe explain
> why?

It is wrong!!! The only reason it didn't show up is that fpo_fmr_pool is
very unlikely to be NULL. I understand why this bug was introduced, since
it's desirable to reduce the indentation. This is the fault of how the
code is written. To avoid this in the future it should be written:

if (!fpo->fpo_is_fmr) {
	struct kib_fast_reg_descriptor *frd, *tmp;
	....
} else if (fpo->fmr.fpo_fmr_pool) {
	ib_destroy_fmr_pool(fpo->fmr.fpo_fmr_pool);
}

I opened a ticket, https://jira.whamcloud.com/browse/LU-11552, and
submitted a patch, https://review.whamcloud.com/#/c/33408, to address
this.

So please remove this part from the change. We could redo the patch to
include the above or I could just send a follow up patch to make this
code clearer. I leave that up to you.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [lustre-devel] [PATCH 00/10] lustre: lnet: fixes for non-x86 systems
  2018-10-18  4:48 ` [lustre-devel] [PATCH 00/10] lustre: lnet: fixes for non-x86 systems NeilBrown
@ 2018-10-20 19:00   ` James Simmons
  0 siblings, 0 replies; 20+ messages in thread
From: James Simmons @ 2018-10-20 19:00 UTC (permalink / raw)
  To: lustre-devel


> On Sun, Oct 14 2018, James Simmons wrote:
> 
> > These are the required patches to make LNet work with non-x86
> > platforms like ARM or Power8. The tunables map_on_demand and
> > concurrent_sends assumed pages were 4K in size which is not
> > correct. Massively reworked to basically defunct those tunables.
> > Also the size of the LNet packet was always 256 pages but when
> > the page size is 64K like some ARM or Power8 systems the maximum
> > LNet message sent was 16MB not 1MB which is what is expected.
> > Fixing up the RDMA handling in the ko2iblnd driver also resolved
> > some performance issues.
> 
> Thanks.
> I've applied this series, but I'm still suspicious of that little
> section in patch 3.
> I hope that can be clarified, or fixed, before I feel the need to move
> the series from 'lustre-testing' to 'lustre'.

Thank you. Yes, as I pointed out, that code in patch 3 is wrong. I pushed
a fix to the OpenSFS branch already. You can drop that change in patch 3.


> > Alexey Lyashkov (1):
> >   lustre: lnd: use less CQ entries for each connection
> >
> > Amir Shehata (6):
> >   lustre: lnd: set device capabilities
> >   lustre: o2iblnd: use IB_MR_TYPE_SG_GAPS
> >   lustre: lnd: rework map_on_demand behavior
> >   lustre: lnd: calculate qp max_send_wrs properly
> >   lustre: lnd: remove concurrent_sends tunable
> >   lustre: lnd: correct WR fast reg accounting
> >
> > Dmitry Eremin (1):
> >   lustre: o2iblnd: limit cap.max_send_wr for MLX5
> >
> > James Simmons (1):
> >   lustre: lnet: make LNET_MAX_IOV dependent on page size
> >
> > John L. Hammond (1):
> >   lustre: o2ib: use splice in kiblnd_peer_connect_failed()
> >
> >  .../staging/lustre/include/linux/lnet/lib-types.h  |  10 +-
> >  .../lustre/include/uapi/linux/lnet/lnet-types.h    |   3 -
> >  drivers/staging/lustre/lnet/Kconfig                |  10 --
> >  .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c    | 138 +++++++++++++++------
> >  .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h    |  68 +++-------
> >  .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c | 133 ++++++++++++++------
> >  .../lustre/lnet/klnds/o2iblnd/o2iblnd_modparams.c  |  85 ++++++-------
> >  drivers/staging/lustre/lustre/include/lustre_net.h |   1 +
> >  8 files changed, 256 insertions(+), 192 deletions(-)
> >
> > -- 
> > 1.8.3.1
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [lustre-devel] [PATCH 01/10] lustre: lnd: set device capabilities
  2018-10-20 16:58     ` James Simmons
@ 2018-10-22  2:48       ` NeilBrown
  2018-10-23 23:04         ` James Simmons
  0 siblings, 1 reply; 20+ messages in thread
From: NeilBrown @ 2018-10-22  2:48 UTC (permalink / raw)
  To: lustre-devel

On Sat, Oct 20 2018, James Simmons wrote:

>> On Sun, Oct 14 2018, James Simmons wrote:
>> 
>> > From: Amir Shehata <ashehata@whamcloud.com>
>> >
>> > MLX-4, MLX-5 and OPA support different capabilities. Query the
>> > device and cache the capabilities of the device for future use.
>> >
>> > At the time of the patches creation MLX5 could support fast
>> > registration and gaps while MLX4 and OPA only support FMR
>> >
>> > Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
>> > WC-bug-id: https://jira.whamcloud.com/browse/LU-10129
>> > Reviewed-on: https://review.whamcloud.com/30309
>> > Reviewed-by: Alexey Lyashkov <c17817@cray.com>
>> > Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
>> > Reviewed-by: James Simmons <uja.ornl@yahoo.com>
>> > Reviewed-by: Oleg Drokin <green@whamcloud.com>
>> > Signed-off-by: James Simmons <jsimmons@infradead.org>
>> > ---
>> >  .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c    | 41 ++++++++++++----------
>> >  .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h    |  5 +++
>> >  2 files changed, 28 insertions(+), 18 deletions(-)
>> >
>> > diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
>> > index bf969b3..b10658b 100644
>> > --- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
>> > +++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
>> > @@ -1399,6 +1399,7 @@ static int kiblnd_alloc_fmr_pool(struct kib_fmr_poolset *fps, struct kib_fmr_poo
>> >  		else
>> >  			CERROR("FMRs are not supported\n");
>> >  	}
>> > +	fpo->fpo_is_fmr = true;
>> >  
>> >  	return rc;
>> >  }
>> > @@ -1408,6 +1409,8 @@ static int kiblnd_alloc_freg_pool(struct kib_fmr_poolset *fps, struct kib_fmr_po
>> >  	struct kib_fast_reg_descriptor *frd;
>> >  	int i, rc;
>> >  
>> > +	fpo->fpo_is_fmr = false;
>> > +
>> >  	INIT_LIST_HEAD(&fpo->fast_reg.fpo_pool_list);
>> >  	fpo->fast_reg.fpo_pool_size = 0;
>> >  	for (i = 0; i < fps->fps_pool_size; i++) {
>> > @@ -1469,23 +1472,7 @@ static int kiblnd_create_fmr_pool(struct kib_fmr_poolset *fps,
>> >  	fpo->fpo_hdev = kiblnd_current_hdev(dev);
>> >  	dev_attr = &fpo->fpo_hdev->ibh_ibdev->attrs;
>> >  
>> > -	/* Check for FMR or FastReg support */
>> > -	fpo->fpo_is_fmr = 0;
>> > -	if (fpo->fpo_hdev->ibh_ibdev->alloc_fmr &&
>> > -	    fpo->fpo_hdev->ibh_ibdev->dealloc_fmr &&
>> > -	    fpo->fpo_hdev->ibh_ibdev->map_phys_fmr &&
>> > -	    fpo->fpo_hdev->ibh_ibdev->unmap_fmr) {
>> > -		LCONSOLE_INFO("Using FMR for registration\n");
>> > -		fpo->fpo_is_fmr = 1;
>> > -	} else if (dev_attr->device_cap_flags & IB_DEVICE_MEM_MGT_EXTENSIONS) {
>> > -		LCONSOLE_INFO("Using FastReg for registration\n");
>> > -	} else {
>> > -		rc = -ENOSYS;
>> > -		LCONSOLE_ERROR_MSG(rc, "IB device does not support FMRs nor FastRegs, can't register memory\n");
>> > -		goto out_fpo;
>> > -	}
>> > -
>> > -	if (fpo->fpo_is_fmr)
>> > +	if (dev->ibd_dev_caps & IBLND_DEV_CAPS_FMR_ENABLED)
>> >  		rc = kiblnd_alloc_fmr_pool(fps, fpo);
>> >  	else
>> >  		rc = kiblnd_alloc_freg_pool(fps, fpo);
>> > @@ -2261,6 +2248,9 @@ static int kiblnd_net_init_pools(struct kib_net *net, struct lnet_ni *ni,
>> >  
>> >  static int kiblnd_hdev_get_attr(struct kib_hca_dev *hdev)
>> >  {
>> > +	struct ib_device_attr *dev_attr = &hdev->ibh_ibdev->attrs;
>> > +	int rc = 0;
>> > +
>> >  	/*
>> >  	 * It's safe to assume a HCA can handle a page size
>> >  	 * matching that of the native system
>> > @@ -2269,7 +2259,22 @@ static int kiblnd_hdev_get_attr(struct kib_hca_dev *hdev)
>> >  	hdev->ibh_page_size  = 1 << PAGE_SHIFT;
>> >  	hdev->ibh_page_mask  = ~((__u64)hdev->ibh_page_size - 1);
>> >  
>> > -	hdev->ibh_mr_size = hdev->ibh_ibdev->attrs.max_mr_size;
>> > +	if (hdev->ibh_ibdev->alloc_fmr &&
>> > +	    hdev->ibh_ibdev->dealloc_fmr &&
>> > +	    hdev->ibh_ibdev->map_phys_fmr &&
>> > +	    hdev->ibh_ibdev->unmap_fmr) {
>> > +		LCONSOLE_INFO("Using FMR for registration\n");
>> > +		hdev->ibh_dev->ibd_dev_caps |= IBLND_DEV_CAPS_FMR_ENABLED;
>> > +	} else if (dev_attr->device_cap_flags & IB_DEVICE_MEM_MGT_EXTENSIONS) {
>> > +		LCONSOLE_INFO("Using FastReg for registration\n");
>> > +		hdev->ibh_dev->ibd_dev_caps |= IBLND_DEV_CAPS_FASTREG_ENABLED;
>> > +	} else {
>> > +		CERROR("IB device does not support FMRs nor FastRegs, can't register memory: %d\n",
>> > +		       rc);
>> > +		return -ENXIO;
>> > +	}
>> > +
>> > +	hdev->ibh_mr_size = dev_attr->max_mr_size;
>> >  	if (hdev->ibh_mr_size == ~0ULL) {
>> >  		hdev->ibh_mr_shift = 64;
>> >  		return 0;
>> > diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
>> > index a4438d2..9f0a47d 100644
>> > --- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
>> > +++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
>> > @@ -73,6 +73,10 @@
>> >  #define IBLND_N_SCHED			2
>> >  #define IBLND_N_SCHED_HIGH		4
>> >  
>> > +#define IBLND_DEV_CAPS_FASTREG_ENABLED		0x1
>> > +#define IBLND_DEV_CAPS_FASTREG_GAPS_SUPPORT	0x2
>> > +#define IBLND_DEV_CAPS_FMR_ENABLED		0x4
>> > +
>> 
>> BIT(0), BIT(1), .... ???
>> 
>> >  struct kib_tunables {
>> >  	int *kib_dev_failover;           /* HCA failover */
>> >  	unsigned int *kib_service;       /* IB service number */
>> > @@ -162,6 +166,7 @@ struct kib_dev {
>> >  	unsigned int ibd_can_failover; /* IPoIB interface is a bonding master */
>> >  	struct list_head   ibd_nets;
>> >  	struct kib_hca_dev *ibd_hdev;
>> > +	u32			ibd_dev_caps;
>> 
>> "unsigned int" would be better I think, but it isn't very important.
>
> Actually better yet it could be changed to an enum. Move the below to
> just above struct kib_dev {
>
> enum ibd_dev_caps {
> 	IBLND_DEV_CAPS_FASTREG_ENABLED		= BIT(0),
> 	IBLND_DEV_CAPS_FASTREG_GAPS_SUPPORT	= BIT(1),
> 	IBLND_DEV_CAPS_FMR_ENABLED		= BIT(2),
> };
>
> and change ibd_dev_caps from u32 to enum ibd_dev_caps.
>
> Do you mind if that is a follow on patch?
>

Yes, that would be good, thanks.
I'll apply this patch as it is.

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [lustre-devel] [PATCH 03/10] lustre: lnd: rework map_on_demand behavior
  2018-10-20 17:06     ` James Simmons
@ 2018-10-22  3:09       ` NeilBrown
  0 siblings, 0 replies; 20+ messages in thread
From: NeilBrown @ 2018-10-22  3:09 UTC (permalink / raw)
  To: lustre-devel

On Sat, Oct 20 2018, James Simmons wrote:

>> On Sun, Oct 14 2018, James Simmons wrote:
>> 
>> > @@ -1320,9 +1324,8 @@ static void kiblnd_destroy_fmr_pool(struct kib_fmr_pool *fpo)
>> >  {
>> >  	LASSERT(!fpo->fpo_map_count);
>> >  
>> > -	if (fpo->fpo_is_fmr) {
>> > -		if (fpo->fmr.fpo_fmr_pool)
>> > -			ib_destroy_fmr_pool(fpo->fmr.fpo_fmr_pool);
>> > +	if (fpo->fpo_is_fmr && fpo->fmr.fpo_fmr_pool) {
>> > +		ib_destroy_fmr_pool(fpo->fmr.fpo_fmr_pool);
>> >  	} else {
>> >  		struct kib_fast_reg_descriptor *frd;
>> >  		int i = 0;
>> 
>> This change looks strange.
>> With the new code: if fpo_fmr_pool is NULL, then the fpo_pool_list is
>> destroyed, even though fpo_is_fmr.
>> 
>> Are we sure that is correct??
>> I don't really understand this code so it might be correct, but the
>> change looks wrong.
>> 
>> Can someone please confirm that it really is right - and maybe explain
>> why?
>
> It is wrong!!! The only reason it didn't show up is that fpo_fmr_pool is
> very unlikely to be NULL. I understand why this bug was introduced, since
> it's desirable to reduce the indentation. This is the fault of how the
> code is written. To avoid this in the future it should be written:

Thanks for the confirmation.
As the 'else' clause is significantly more indented than the 'then'
clause, reducing the indentation there seems gratuitous.
I've removed the offending hunk from the patch.
If you particularly want to proceed with the version below I won't
object.

>
> if (!fpo->fpo_is_fmr) {
> 	struct kib_fast_reg_descriptor *frd, *tmp;
> 	....
> } else if (fpo->fmr.fpo_fmr_pool) {
> 	ib_destroy_fmr_pool(fpo->fmr.fpo_fmr_pool);
> }
>
> I opened a ticket, https://jira.whamcloud.com/browse/LU-11552, and
> submitted a patch, https://review.whamcloud.com/#/c/33408, to address
> this.

Thanks for following up this side too!

NeilBrown


>
> So please remove this part from the change. We could redo the patch to
> include the above or I could just send a follow up patch to make this
> code clearer. I leave that up to you.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [lustre-devel] [PATCH 01/10] lustre: lnd: set device capabilities
  2018-10-22  2:48       ` NeilBrown
@ 2018-10-23 23:04         ` James Simmons
  0 siblings, 0 replies; 20+ messages in thread
From: James Simmons @ 2018-10-23 23:04 UTC (permalink / raw)
  To: lustre-devel


> >> > index a4438d2..9f0a47d 100644
> >> > --- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
> >> > +++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
> >> > @@ -73,6 +73,10 @@
> >> >  #define IBLND_N_SCHED			2
> >> >  #define IBLND_N_SCHED_HIGH		4
> >> >  
> >> > +#define IBLND_DEV_CAPS_FASTREG_ENABLED		0x1
> >> > +#define IBLND_DEV_CAPS_FASTREG_GAPS_SUPPORT	0x2
> >> > +#define IBLND_DEV_CAPS_FMR_ENABLED		0x4
> >> > +
> >> 
> >> BIT(0), BIT(1), .... ???
> >> 
> >> >  struct kib_tunables {
> >> >  	int *kib_dev_failover;           /* HCA failover */
> >> >  	unsigned int *kib_service;       /* IB service number */
> >> > @@ -162,6 +166,7 @@ struct kib_dev {
> >> >  	unsigned int ibd_can_failover; /* IPoIB interface is a bonding master */
> >> >  	struct list_head   ibd_nets;
> >> >  	struct kib_hca_dev *ibd_hdev;
> >> > +	u32			ibd_dev_caps;
> >> 
> >> "unsigned int" would be better I think, but it isn't very important.
> >
> > Actually better yet it could be changed to an enum. Move the below to
> > just above struct kib_dev {
> >
> > enum ibd_dev_caps {
> > 	IBLND_DEV_CAPS_FASTREG_ENABLED		= BIT(0),
> > 	IBLND_DEV_CAPS_FASTREG_GAPS_SUPPORT	= BIT(1),
> > 	IBLND_DEV_CAPS_FMR_ENABLED		= BIT(2),
> > };
> >
> > and change ibd_dev_caps from u32 to enum ibd_dev_caps.
> >
> > Do you mind if that is a follow on patch?
> >
> 
> Yes, that would be good, thanks.
> I'll apply this patch as it is.

New patch at https://review.whamcloud.com/#/c/33409.
I will push it to the linux client shortly.

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread

Thread overview: 20+ messages
2018-10-14 18:55 [lustre-devel] [PATCH 00/10] lustre: lnet: fixes for non-x86 systems James Simmons
2018-10-14 18:55 ` [lustre-devel] [PATCH 01/10] lustre: lnd: set device capabilities James Simmons
2018-10-17  5:54   ` NeilBrown
2018-10-20 16:58     ` James Simmons
2018-10-22  2:48       ` NeilBrown
2018-10-23 23:04         ` James Simmons
2018-10-14 18:55 ` [lustre-devel] [PATCH 02/10] lustre: o2iblnd: use IB_MR_TYPE_SG_GAPS James Simmons
2018-10-14 18:55 ` [lustre-devel] [PATCH 03/10] lustre: lnd: rework map_on_demand behavior James Simmons
2018-10-17  6:11   ` NeilBrown
2018-10-20 17:06     ` James Simmons
2018-10-22  3:09       ` NeilBrown
2018-10-14 18:55 ` [lustre-devel] [PATCH 04/10] lustre: lnd: use less CQ entries for each connection James Simmons
2018-10-14 18:55 ` [lustre-devel] [PATCH 05/10] lustre: o2iblnd: limit cap.max_send_wr for MLX5 James Simmons
2018-10-14 18:55 ` [lustre-devel] [PATCH 06/10] lustre: lnd: calculate qp max_send_wrs properly James Simmons
2018-10-14 18:55 ` [lustre-devel] [PATCH 07/10] lustre: lnd: remove concurrent_sends tunable James Simmons
2018-10-14 18:55 ` [lustre-devel] [PATCH 08/10] lustre: lnd: correct WR fast reg accounting James Simmons
2018-10-14 18:55 ` [lustre-devel] [PATCH 09/10] lustre: o2ib: use splice in kiblnd_peer_connect_failed() James Simmons
2018-10-14 18:55 ` [lustre-devel] [PATCH 10/10] lustre: lnet: make LNET_MAX_IOV dependent on page size James Simmons
2018-10-18  4:48 ` [lustre-devel] [PATCH 00/10] lustre: lnet: fixes for non-x86 systems NeilBrown
2018-10-20 19:00   ` James Simmons
