Linux-RDMA Archive on lore.kernel.org
* [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs
@ 2021-04-05  5:23 Leon Romanovsky
  2021-04-05  5:23 ` [PATCH rdma-next 01/10] RDMA: Add access flags to ib_alloc_mr() and ib_mr_pool_init() Leon Romanovsky
                   ` (11 more replies)
  0 siblings, 12 replies; 56+ messages in thread
From: Leon Romanovsky @ 2021-04-05  5:23 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe
  Cc: Leon Romanovsky, Adit Ranadive, Anna Schumaker, Ariel Elior,
	Avihai Horon, Bart Van Assche, Bernard Metzler,
	Christoph Hellwig, Chuck Lever, David S. Miller,
	Dennis Dalessandro, Devesh Sharma, Faisal Latif, Jack Wang,
	Jakub Kicinski, J. Bruce Fields, Jens Axboe, Karsten Graul,
	Keith Busch, Lijun Ou, linux-cifs, linux-kernel, linux-nfs,
	linux-nvme, linux-rdma, linux-s390, Max Gurtovoy, Max Gurtovoy,
	Md. Haris Iqbal, Michael Guralnik, Michal Kalderon,
	Mike Marciniszyn, Naresh Kumar PBS, netdev, Potnuri Bharat Teja,
	rds-devel, Sagi Grimberg, samba-technical, Santosh Shilimkar,
	Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun

From: Leon Romanovsky <leonro@nvidia.com>

From Avihai,

Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
imposed on PCI transactions and can thus improve performance.

Until now, relaxed ordering could be set only by user space applications
for user MRs. The following patch series enables relaxed ordering for the
kernel ULPs as well. Relaxed ordering is an optional capability, and as
such, it is ignored by vendors that don't support it.

The following test results show the performance improvement achieved
with relaxed ordering. The test was performed on an NVIDIA A100 to
measure the performance of storage infrastructure over xprtrdma:

Without Relaxed Ordering:
READ: bw=16.5GiB/s (17.7GB/s), 16.5GiB/s-16.5GiB/s (17.7GB/s-17.7GB/s),
io=1987GiB (2133GB), run=120422-120422msec

With Relaxed Ordering:
READ: bw=72.9GiB/s (78.2GB/s), 72.9GiB/s-72.9GiB/s (78.2GB/s-78.2GB/s),
io=2367GiB (2542GB), run=32492-32492msec

Thanks

Avihai Horon (10):
  RDMA: Add access flags to ib_alloc_mr() and ib_mr_pool_init()
  RDMA/core: Enable Relaxed Ordering in __ib_alloc_pd()
  RDMA/iser: Enable Relaxed Ordering
  RDMA/rtrs: Enable Relaxed Ordering
  RDMA/srp: Enable Relaxed Ordering
  nvme-rdma: Enable Relaxed Ordering
  cifs: smbd: Enable Relaxed Ordering
  net/rds: Enable Relaxed Ordering
  net/smc: Enable Relaxed Ordering
  xprtrdma: Enable Relaxed Ordering

 drivers/infiniband/core/mr_pool.c             |  7 +-
 drivers/infiniband/core/rw.c                  | 12 ++--
 drivers/infiniband/core/verbs.c               | 26 +++++--
 drivers/infiniband/hw/bnxt_re/ib_verbs.c      |  2 +-
 drivers/infiniband/hw/bnxt_re/ib_verbs.h      |  2 +-
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h        |  2 +-
 drivers/infiniband/hw/cxgb4/mem.c             |  2 +-
 drivers/infiniband/hw/hns/hns_roce_device.h   |  2 +-
 drivers/infiniband/hw/hns/hns_roce_mr.c       |  2 +-
 drivers/infiniband/hw/i40iw/i40iw_verbs.c     |  3 +-
 drivers/infiniband/hw/mlx4/mlx4_ib.h          |  2 +-
 drivers/infiniband/hw/mlx4/mr.c               |  2 +-
 drivers/infiniband/hw/mlx5/mlx5_ib.h          | 12 ++--
 drivers/infiniband/hw/mlx5/mr.c               | 61 ++++++++--------
 drivers/infiniband/hw/mlx5/wr.c               | 69 ++++++++++++++-----
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c   |  2 +-
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.h   |  2 +-
 drivers/infiniband/hw/qedr/verbs.c            |  2 +-
 drivers/infiniband/hw/qedr/verbs.h            |  2 +-
 drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c  |  4 +-
 .../infiniband/hw/vmw_pvrdma/pvrdma_verbs.h   |  2 +-
 drivers/infiniband/sw/rdmavt/mr.c             |  3 +-
 drivers/infiniband/sw/rdmavt/mr.h             |  2 +-
 drivers/infiniband/sw/rxe/rxe_verbs.c         |  2 +-
 drivers/infiniband/sw/siw/siw_verbs.c         |  2 +-
 drivers/infiniband/sw/siw/siw_verbs.h         |  2 +-
 drivers/infiniband/ulp/iser/iser_memory.c     | 10 ++-
 drivers/infiniband/ulp/iser/iser_verbs.c      |  6 +-
 drivers/infiniband/ulp/rtrs/rtrs-clt.c        |  6 +-
 drivers/infiniband/ulp/rtrs/rtrs-srv.c        | 15 ++--
 drivers/infiniband/ulp/srp/ib_srp.c           |  8 +--
 drivers/nvme/host/rdma.c                      | 19 +++--
 fs/cifs/smbdirect.c                           | 17 +++--
 include/rdma/ib_verbs.h                       | 11 ++-
 include/rdma/mr_pool.h                        |  3 +-
 net/rds/ib_frmr.c                             |  7 +-
 net/smc/smc_ib.c                              |  3 +-
 net/smc/smc_wr.c                              |  3 +-
 net/sunrpc/xprtrdma/frwr_ops.c                | 10 +--
 39 files changed, 209 insertions(+), 140 deletions(-)

-- 
2.30.2



* [PATCH rdma-next 01/10] RDMA: Add access flags to ib_alloc_mr() and ib_mr_pool_init()
  2021-04-05  5:23 [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs Leon Romanovsky
@ 2021-04-05  5:23 ` Leon Romanovsky
  2021-04-05 13:46   ` Christoph Hellwig
  2021-04-05 15:27   ` Bart Van Assche
  2021-04-05  5:23 ` [PATCH rdma-next 02/10] RDMA/core: Enable Relaxed Ordering in __ib_alloc_pd() Leon Romanovsky
                   ` (10 subsequent siblings)
  11 siblings, 2 replies; 56+ messages in thread
From: Leon Romanovsky @ 2021-04-05  5:23 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe
  Cc: Avihai Horon, Adit Ranadive, Anna Schumaker, Ariel Elior,
	Bart Van Assche, Bernard Metzler, Christoph Hellwig, Chuck Lever,
	David S. Miller, Dennis Dalessandro, Devesh Sharma, Faisal Latif,
	Jack Wang, Jakub Kicinski, J. Bruce Fields, Jens Axboe,
	Karsten Graul, Keith Busch, Lijun Ou, linux-cifs, linux-kernel,
	linux-nfs, linux-nvme, linux-rdma, linux-s390, Max Gurtovoy,
	Max Gurtovoy, Md. Haris Iqbal, Michael Guralnik, Michal Kalderon,
	Mike Marciniszyn, Naresh Kumar PBS, netdev, Potnuri Bharat Teja,
	rds-devel, Sagi Grimberg, samba-technical, Santosh Shilimkar,
	Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun

From: Avihai Horon <avihaih@nvidia.com>

Add an access flags parameter to ib_alloc_mr() and ib_mr_pool_init(),
and refactor the relevant code. The new parameter passes MR access
flags at MR allocation time.

In the following patches, the new access flags parameter will be used
to enable Relaxed Ordering for ib_alloc_mr() and ib_mr_pool_init() users.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/infiniband/core/mr_pool.c             |  7 +-
 drivers/infiniband/core/rw.c                  | 12 ++--
 drivers/infiniband/core/verbs.c               | 23 +++++--
 drivers/infiniband/hw/bnxt_re/ib_verbs.c      |  2 +-
 drivers/infiniband/hw/bnxt_re/ib_verbs.h      |  2 +-
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h        |  2 +-
 drivers/infiniband/hw/cxgb4/mem.c             |  2 +-
 drivers/infiniband/hw/hns/hns_roce_device.h   |  2 +-
 drivers/infiniband/hw/hns/hns_roce_mr.c       |  2 +-
 drivers/infiniband/hw/i40iw/i40iw_verbs.c     |  3 +-
 drivers/infiniband/hw/mlx4/mlx4_ib.h          |  2 +-
 drivers/infiniband/hw/mlx4/mr.c               |  2 +-
 drivers/infiniband/hw/mlx5/mlx5_ib.h          | 12 ++--
 drivers/infiniband/hw/mlx5/mr.c               | 61 ++++++++--------
 drivers/infiniband/hw/mlx5/wr.c               | 69 ++++++++++++++-----
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c   |  2 +-
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.h   |  2 +-
 drivers/infiniband/hw/qedr/verbs.c            |  2 +-
 drivers/infiniband/hw/qedr/verbs.h            |  2 +-
 drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c  |  3 +-
 .../infiniband/hw/vmw_pvrdma/pvrdma_verbs.h   |  2 +-
 drivers/infiniband/sw/rdmavt/mr.c             |  3 +-
 drivers/infiniband/sw/rdmavt/mr.h             |  2 +-
 drivers/infiniband/sw/rxe/rxe_verbs.c         |  2 +-
 drivers/infiniband/sw/siw/siw_verbs.c         |  2 +-
 drivers/infiniband/sw/siw/siw_verbs.h         |  2 +-
 drivers/infiniband/ulp/iser/iser_verbs.c      |  4 +-
 drivers/infiniband/ulp/rtrs/rtrs-clt.c        |  2 +-
 drivers/infiniband/ulp/rtrs/rtrs-srv.c        |  2 +-
 drivers/infiniband/ulp/srp/ib_srp.c           |  2 +-
 drivers/nvme/host/rdma.c                      |  4 +-
 fs/cifs/smbdirect.c                           |  7 +-
 include/rdma/ib_verbs.h                       | 11 ++-
 include/rdma/mr_pool.h                        |  3 +-
 net/rds/ib_frmr.c                             |  2 +-
 net/smc/smc_ib.c                              |  2 +-
 net/sunrpc/xprtrdma/frwr_ops.c                |  2 +-
 37 files changed, 163 insertions(+), 105 deletions(-)

diff --git a/drivers/infiniband/core/mr_pool.c b/drivers/infiniband/core/mr_pool.c
index c0e2df128b34..b869c3487475 100644
--- a/drivers/infiniband/core/mr_pool.c
+++ b/drivers/infiniband/core/mr_pool.c
@@ -34,7 +34,8 @@ void ib_mr_pool_put(struct ib_qp *qp, struct list_head *list, struct ib_mr *mr)
 EXPORT_SYMBOL(ib_mr_pool_put);
 
 int ib_mr_pool_init(struct ib_qp *qp, struct list_head *list, int nr,
-		enum ib_mr_type type, u32 max_num_sg, u32 max_num_meta_sg)
+		    enum ib_mr_type type, u32 max_num_sg, u32 max_num_meta_sg,
+		    u32 access)
 {
 	struct ib_mr *mr;
 	unsigned long flags;
@@ -43,9 +44,9 @@ int ib_mr_pool_init(struct ib_qp *qp, struct list_head *list, int nr,
 	for (i = 0; i < nr; i++) {
 		if (type == IB_MR_TYPE_INTEGRITY)
 			mr = ib_alloc_mr_integrity(qp->pd, max_num_sg,
-						   max_num_meta_sg);
+						   max_num_meta_sg, access);
 		else
-			mr = ib_alloc_mr(qp->pd, type, max_num_sg);
+			mr = ib_alloc_mr(qp->pd, type, max_num_sg, access);
 		if (IS_ERR(mr)) {
 			ret = PTR_ERR(mr);
 			goto out;
diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index a588c2038479..d5a0038e82a4 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -110,7 +110,7 @@ static int rdma_rw_init_one_mr(struct ib_qp *qp, u32 port_num,
 
 	reg->reg_wr.wr.opcode = IB_WR_REG_MR;
 	reg->reg_wr.mr = reg->mr;
-	reg->reg_wr.access = IB_ACCESS_LOCAL_WRITE;
+	reg->reg_wr.access = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_RELAXED_ORDERING;
 	if (rdma_protocol_iwarp(qp->device, port_num))
 		reg->reg_wr.access |= IB_ACCESS_REMOTE_WRITE;
 	count++;
@@ -437,7 +437,8 @@ int rdma_rw_ctx_signature_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 	ctx->reg->reg_wr.wr.wr_cqe = NULL;
 	ctx->reg->reg_wr.wr.num_sge = 0;
 	ctx->reg->reg_wr.wr.send_flags = 0;
-	ctx->reg->reg_wr.access = IB_ACCESS_LOCAL_WRITE;
+	ctx->reg->reg_wr.access =
+		IB_ACCESS_LOCAL_WRITE | IB_ACCESS_RELAXED_ORDERING;
 	if (rdma_protocol_iwarp(qp->device, port_num))
 		ctx->reg->reg_wr.access |= IB_ACCESS_REMOTE_WRITE;
 	ctx->reg->reg_wr.mr = ctx->reg->mr;
@@ -711,8 +712,8 @@ int rdma_rw_init_mrs(struct ib_qp *qp, struct ib_qp_init_attr *attr)
 
 	if (nr_mrs) {
 		ret = ib_mr_pool_init(qp, &qp->rdma_mrs, nr_mrs,
-				IB_MR_TYPE_MEM_REG,
-				max_num_sg, 0);
+				      IB_MR_TYPE_MEM_REG, max_num_sg, 0,
+				      IB_ACCESS_RELAXED_ORDERING);
 		if (ret) {
 			pr_err("%s: failed to allocated %d MRs\n",
 				__func__, nr_mrs);
@@ -722,7 +723,8 @@ int rdma_rw_init_mrs(struct ib_qp *qp, struct ib_qp_init_attr *attr)
 
 	if (nr_sig_mrs) {
 		ret = ib_mr_pool_init(qp, &qp->sig_mrs, nr_sig_mrs,
-				IB_MR_TYPE_INTEGRITY, max_num_sg, max_num_sg);
+				      IB_MR_TYPE_INTEGRITY, max_num_sg,
+				      max_num_sg, IB_ACCESS_RELAXED_ORDERING);
 		if (ret) {
 			pr_err("%s: failed to allocated %d SIG MRs\n",
 				__func__, nr_sig_mrs);
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index c576e2bc39c6..a1782f8a6ca0 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -2136,6 +2136,7 @@ EXPORT_SYMBOL(ib_dereg_mr_user);
  * @pd:            protection domain associated with the region
  * @mr_type:       memory region type
  * @max_num_sg:    maximum sg entries available for registration.
+ * @access:	   access flags for the memory region.
  *
  * Notes:
  * Memory registeration page/sg lists must not exceed max_num_sg.
@@ -2144,7 +2145,7 @@ EXPORT_SYMBOL(ib_dereg_mr_user);
  *
  */
 struct ib_mr *ib_alloc_mr(struct ib_pd *pd, enum ib_mr_type mr_type,
-			  u32 max_num_sg)
+			  u32 max_num_sg, u32 access)
 {
 	struct ib_mr *mr;
 
@@ -2159,7 +2160,12 @@ struct ib_mr *ib_alloc_mr(struct ib_pd *pd, enum ib_mr_type mr_type,
 		goto out;
 	}
 
-	mr = pd->device->ops.alloc_mr(pd, mr_type, max_num_sg);
+	if (access & ~IB_ACCESS_RELAXED_ORDERING) {
+		mr = ERR_PTR(-EINVAL);
+		goto out;
+	}
+
+	mr = pd->device->ops.alloc_mr(pd, mr_type, max_num_sg, access);
 	if (IS_ERR(mr))
 		goto out;
 
@@ -2187,15 +2193,15 @@ EXPORT_SYMBOL(ib_alloc_mr);
  * @max_num_data_sg:         maximum data sg entries available for registration
  * @max_num_meta_sg:         maximum metadata sg entries available for
  *                           registration
+ * @access:		     access flags for the memory region.
  *
  * Notes:
  * Memory registration page/sg lists must not exceed max_num_sg,
  * also the integrity page/sg lists must not exceed max_num_meta_sg.
  *
  */
-struct ib_mr *ib_alloc_mr_integrity(struct ib_pd *pd,
-				    u32 max_num_data_sg,
-				    u32 max_num_meta_sg)
+struct ib_mr *ib_alloc_mr_integrity(struct ib_pd *pd, u32 max_num_data_sg,
+				    u32 max_num_meta_sg, u32 access)
 {
 	struct ib_mr *mr;
 	struct ib_sig_attrs *sig_attrs;
@@ -2211,6 +2217,11 @@ struct ib_mr *ib_alloc_mr_integrity(struct ib_pd *pd,
 		goto out;
 	}
 
+	if (access & ~IB_ACCESS_RELAXED_ORDERING) {
+		mr = ERR_PTR(-EINVAL);
+		goto out;
+	}
+
 	sig_attrs = kzalloc(sizeof(struct ib_sig_attrs), GFP_KERNEL);
 	if (!sig_attrs) {
 		mr = ERR_PTR(-ENOMEM);
@@ -2218,7 +2229,7 @@ struct ib_mr *ib_alloc_mr_integrity(struct ib_pd *pd,
 	}
 
 	mr = pd->device->ops.alloc_mr_integrity(pd, max_num_data_sg,
-						max_num_meta_sg);
+						max_num_meta_sg, access);
 	if (IS_ERR(mr)) {
 		kfree(sig_attrs);
 		goto out;
diff --git a/drivers/infiniband/hw/bnxt_re/ib_verbs.c b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
index 2efaa80bfbd2..116febdf999b 100644
--- a/drivers/infiniband/hw/bnxt_re/ib_verbs.c
+++ b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
@@ -3672,7 +3672,7 @@ int bnxt_re_map_mr_sg(struct ib_mr *ib_mr, struct scatterlist *sg, int sg_nents,
 }
 
 struct ib_mr *bnxt_re_alloc_mr(struct ib_pd *ib_pd, enum ib_mr_type type,
-			       u32 max_num_sg)
+			       u32 max_num_sg, u32 access)
 {
 	struct bnxt_re_pd *pd = container_of(ib_pd, struct bnxt_re_pd, ib_pd);
 	struct bnxt_re_dev *rdev = pd->rdev;
diff --git a/drivers/infiniband/hw/bnxt_re/ib_verbs.h b/drivers/infiniband/hw/bnxt_re/ib_verbs.h
index d68671cc6173..3e8342a6f367 100644
--- a/drivers/infiniband/hw/bnxt_re/ib_verbs.h
+++ b/drivers/infiniband/hw/bnxt_re/ib_verbs.h
@@ -201,7 +201,7 @@ struct ib_mr *bnxt_re_get_dma_mr(struct ib_pd *pd, int mr_access_flags);
 int bnxt_re_map_mr_sg(struct ib_mr *ib_mr, struct scatterlist *sg, int sg_nents,
 		      unsigned int *sg_offset);
 struct ib_mr *bnxt_re_alloc_mr(struct ib_pd *ib_pd, enum ib_mr_type mr_type,
-			       u32 max_num_sg);
+			       u32 max_num_sg, u32 access);
 int bnxt_re_dereg_mr(struct ib_mr *mr, struct ib_udata *udata);
 struct ib_mw *bnxt_re_alloc_mw(struct ib_pd *ib_pd, enum ib_mw_type type,
 			       struct ib_udata *udata);
diff --git a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
index cdec5deb37a1..4520c53aa1f6 100644
--- a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
+++ b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
@@ -969,7 +969,7 @@ int c4iw_reject_cr(struct iw_cm_id *cm_id, const void *pdata, u8 pdata_len);
 void c4iw_qp_add_ref(struct ib_qp *qp);
 void c4iw_qp_rem_ref(struct ib_qp *qp);
 struct ib_mr *c4iw_alloc_mr(struct ib_pd *pd, enum ib_mr_type mr_type,
-			    u32 max_num_sg);
+			    u32 max_num_sg, u32 access);
 int c4iw_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg, int sg_nents,
 		   unsigned int *sg_offset);
 void c4iw_dealloc(struct uld_ctx *ctx);
diff --git a/drivers/infiniband/hw/cxgb4/mem.c b/drivers/infiniband/hw/cxgb4/mem.c
index a2c71a1d93d5..c8ed4c56925d 100644
--- a/drivers/infiniband/hw/cxgb4/mem.c
+++ b/drivers/infiniband/hw/cxgb4/mem.c
@@ -596,7 +596,7 @@ struct ib_mr *c4iw_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 }
 
 struct ib_mr *c4iw_alloc_mr(struct ib_pd *pd, enum ib_mr_type mr_type,
-			    u32 max_num_sg)
+			    u32 max_num_sg, u32 access)
 {
 	struct c4iw_dev *rhp;
 	struct c4iw_pd *php;
diff --git a/drivers/infiniband/hw/hns/hns_roce_device.h b/drivers/infiniband/hw/hns/hns_roce_device.h
index 55cbbd524057..3e2aed7e8329 100644
--- a/drivers/infiniband/hw/hns/hns_roce_device.h
+++ b/drivers/infiniband/hw/hns/hns_roce_device.h
@@ -1205,7 +1205,7 @@ struct ib_mr *hns_roce_rereg_user_mr(struct ib_mr *mr, int flags, u64 start,
 				     int mr_access_flags, struct ib_pd *pd,
 				     struct ib_udata *udata);
 struct ib_mr *hns_roce_alloc_mr(struct ib_pd *pd, enum ib_mr_type mr_type,
-				u32 max_num_sg);
+				u32 max_num_sg, u32 access);
 int hns_roce_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg, int sg_nents,
 		       unsigned int *sg_offset);
 int hns_roce_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata);
diff --git a/drivers/infiniband/hw/hns/hns_roce_mr.c b/drivers/infiniband/hw/hns/hns_roce_mr.c
index 79b3c3023fe7..c16638ad66f4 100644
--- a/drivers/infiniband/hw/hns/hns_roce_mr.c
+++ b/drivers/infiniband/hw/hns/hns_roce_mr.c
@@ -381,7 +381,7 @@ int hns_roce_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata)
 }
 
 struct ib_mr *hns_roce_alloc_mr(struct ib_pd *pd, enum ib_mr_type mr_type,
-				u32 max_num_sg)
+				u32 max_num_sg, u32 access)
 {
 	struct hns_roce_dev *hr_dev = to_hr_dev(pd->device);
 	struct device *dev = hr_dev->dev;
diff --git a/drivers/infiniband/hw/i40iw/i40iw_verbs.c b/drivers/infiniband/hw/i40iw/i40iw_verbs.c
index b876d722fcc8..827dbca3ddf3 100644
--- a/drivers/infiniband/hw/i40iw/i40iw_verbs.c
+++ b/drivers/infiniband/hw/i40iw/i40iw_verbs.c
@@ -1451,9 +1451,10 @@ static int i40iw_hw_alloc_stag(struct i40iw_device *iwdev, struct i40iw_mr *iwmr
  * @pd: ibpd pointer
  * @mr_type: memory for stag registrion
  * @max_num_sg: man number of pages
+ * @access: access flags of memory region
  */
 static struct ib_mr *i40iw_alloc_mr(struct ib_pd *pd, enum ib_mr_type mr_type,
-				    u32 max_num_sg)
+				    u32 max_num_sg, u32 access)
 {
 	struct i40iw_pd *iwpd = to_iwpd(pd);
 	struct i40iw_device *iwdev = to_iwdev(pd->device);
diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index e856cf23a0a1..0c99dd57de3f 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -759,7 +759,7 @@ int mlx4_ib_dereg_mr(struct ib_mr *mr, struct ib_udata *udata);
 int mlx4_ib_alloc_mw(struct ib_mw *mw, struct ib_udata *udata);
 int mlx4_ib_dealloc_mw(struct ib_mw *mw);
 struct ib_mr *mlx4_ib_alloc_mr(struct ib_pd *pd, enum ib_mr_type mr_type,
-			       u32 max_num_sg);
+			       u32 max_num_sg, u32 access);
 int mlx4_ib_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg, int sg_nents,
 		      unsigned int *sg_offset);
 int mlx4_ib_modify_cq(struct ib_cq *cq, u16 cq_count, u16 cq_period);
diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c
index 50becc0e4b62..5a6fc7d7a89f 100644
--- a/drivers/infiniband/hw/mlx4/mr.c
+++ b/drivers/infiniband/hw/mlx4/mr.c
@@ -643,7 +643,7 @@ int mlx4_ib_dealloc_mw(struct ib_mw *ibmw)
 }
 
 struct ib_mr *mlx4_ib_alloc_mr(struct ib_pd *pd, enum ib_mr_type mr_type,
-			       u32 max_num_sg)
+			       u32 max_num_sg, u32 access)
 {
 	struct mlx4_ib_dev *dev = to_mdev(pd->device);
 	struct mlx4_ib_mr *mr;
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 69ecd0229322..0a8b33244fdd 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -670,6 +670,9 @@ struct mlx5_ib_mr {
 	struct mlx5_cache_ent *cache_ent;
 	struct ib_umem *umem;
 
+	/* Current access_flags */
+	int access_flags;
+
 	/* This is zero'd when the MR is allocated */
 	union {
 		/* Used only while the MR is in the cache */
@@ -705,8 +708,6 @@ struct mlx5_ib_mr {
 		/* Used only by User MRs (umem != NULL) */
 		struct {
 			unsigned int page_shift;
-			/* Current access_flags */
-			int access_flags;
 
 			/* For User ODP */
 			struct mlx5_ib_mr *parent;
@@ -1306,10 +1307,9 @@ struct ib_mr *mlx5_ib_rereg_user_mr(struct ib_mr *ib_mr, int flags, u64 start,
 				    struct ib_pd *pd, struct ib_udata *udata);
 int mlx5_ib_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata);
 struct ib_mr *mlx5_ib_alloc_mr(struct ib_pd *pd, enum ib_mr_type mr_type,
-			       u32 max_num_sg);
-struct ib_mr *mlx5_ib_alloc_mr_integrity(struct ib_pd *pd,
-					 u32 max_num_sg,
-					 u32 max_num_meta_sg);
+			       u32 max_num_sg, u32 access);
+struct ib_mr *mlx5_ib_alloc_mr_integrity(struct ib_pd *pd, u32 max_num_sg,
+					 u32 max_num_meta_sg, u32 access);
 int mlx5_ib_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg, int sg_nents,
 		      unsigned int *sg_offset);
 int mlx5_ib_map_mr_sg_pi(struct ib_mr *ibmr, struct scatterlist *data_sg,
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 552fecd210c2..9ba7d5d6c668 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -2015,14 +2015,14 @@ int mlx5_ib_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata)
 }
 
 static void mlx5_set_umr_free_mkey(struct ib_pd *pd, u32 *in, int ndescs,
-				   int access_mode, int page_shift)
+				   int access_mode, u32 access, int page_shift)
 {
 	void *mkc;
 
 	mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry);
 
 	/* This is only used from the kernel, so setting the PD is OK. */
-	set_mkc_access_pd_addr_fields(mkc, 0, 0, pd);
+	set_mkc_access_pd_addr_fields(mkc, access, 0, pd);
 	MLX5_SET(mkc, mkc, free, 1);
 	MLX5_SET(mkc, mkc, translations_octword_size, ndescs);
 	MLX5_SET(mkc, mkc, access_mode_1_0, access_mode & 0x3);
@@ -2033,7 +2033,8 @@ static void mlx5_set_umr_free_mkey(struct ib_pd *pd, u32 *in, int ndescs,
 
 static int _mlx5_alloc_mkey_descs(struct ib_pd *pd, struct mlx5_ib_mr *mr,
 				  int ndescs, int desc_size, int page_shift,
-				  int access_mode, u32 *in, int inlen)
+				  int access_mode, u32 access, u32 *in,
+				  int inlen)
 {
 	struct mlx5_ib_dev *dev = to_mdev(pd->device);
 	int err;
@@ -2046,7 +2047,7 @@ static int _mlx5_alloc_mkey_descs(struct ib_pd *pd, struct mlx5_ib_mr *mr,
 	if (err)
 		return err;
 
-	mlx5_set_umr_free_mkey(pd, in, ndescs, access_mode, page_shift);
+	mlx5_set_umr_free_mkey(pd, in, ndescs, access_mode, access, page_shift);
 
 	err = mlx5_ib_create_mkey(dev, &mr->mmkey, in, inlen);
 	if (err)
@@ -2055,6 +2056,7 @@ static int _mlx5_alloc_mkey_descs(struct ib_pd *pd, struct mlx5_ib_mr *mr,
 	mr->mmkey.type = MLX5_MKEY_MR;
 	mr->ibmr.lkey = mr->mmkey.key;
 	mr->ibmr.rkey = mr->mmkey.key;
+	mr->access_flags = access;
 
 	return 0;
 
@@ -2063,9 +2065,10 @@ static int _mlx5_alloc_mkey_descs(struct ib_pd *pd, struct mlx5_ib_mr *mr,
 	return err;
 }
 
-static struct mlx5_ib_mr *mlx5_ib_alloc_pi_mr(struct ib_pd *pd,
-				u32 max_num_sg, u32 max_num_meta_sg,
-				int desc_size, int access_mode)
+static struct mlx5_ib_mr *mlx5_ib_alloc_pi_mr(struct ib_pd *pd, u32 max_num_sg,
+					      u32 max_num_meta_sg,
+					      int desc_size, int access_mode,
+					      u32 access)
 {
 	int inlen = MLX5_ST_SZ_BYTES(create_mkey_in);
 	int ndescs = ALIGN(max_num_sg + max_num_meta_sg, 4);
@@ -2091,7 +2094,7 @@ static struct mlx5_ib_mr *mlx5_ib_alloc_pi_mr(struct ib_pd *pd,
 		page_shift = PAGE_SHIFT;
 
 	err = _mlx5_alloc_mkey_descs(pd, mr, ndescs, desc_size, page_shift,
-				     access_mode, in, inlen);
+				     access_mode, access, in, inlen);
 	if (err)
 		goto err_free_in;
 
@@ -2108,23 +2111,24 @@ static struct mlx5_ib_mr *mlx5_ib_alloc_pi_mr(struct ib_pd *pd,
 }
 
 static int mlx5_alloc_mem_reg_descs(struct ib_pd *pd, struct mlx5_ib_mr *mr,
-				    int ndescs, u32 *in, int inlen)
+				    int ndescs, u32 access, u32 *in, int inlen)
 {
 	return _mlx5_alloc_mkey_descs(pd, mr, ndescs, sizeof(struct mlx5_mtt),
-				      PAGE_SHIFT, MLX5_MKC_ACCESS_MODE_MTT, in,
-				      inlen);
+				      PAGE_SHIFT, MLX5_MKC_ACCESS_MODE_MTT,
+				      access, in, inlen);
 }
 
 static int mlx5_alloc_sg_gaps_descs(struct ib_pd *pd, struct mlx5_ib_mr *mr,
-				    int ndescs, u32 *in, int inlen)
+				    int ndescs, u32 access, u32 *in, int inlen)
 {
 	return _mlx5_alloc_mkey_descs(pd, mr, ndescs, sizeof(struct mlx5_klm),
-				      0, MLX5_MKC_ACCESS_MODE_KLMS, in, inlen);
+				      0, MLX5_MKC_ACCESS_MODE_KLMS, access, in,
+				      inlen);
 }
 
 static int mlx5_alloc_integrity_descs(struct ib_pd *pd, struct mlx5_ib_mr *mr,
 				      int max_num_sg, int max_num_meta_sg,
-				      u32 *in, int inlen)
+				      u32 access, u32 *in, int inlen)
 {
 	struct mlx5_ib_dev *dev = to_mdev(pd->device);
 	u32 psv_index[2];
@@ -2149,14 +2153,14 @@ static int mlx5_alloc_integrity_descs(struct ib_pd *pd, struct mlx5_ib_mr *mr,
 	++mr->sig->sigerr_count;
 	mr->klm_mr = mlx5_ib_alloc_pi_mr(pd, max_num_sg, max_num_meta_sg,
 					 sizeof(struct mlx5_klm),
-					 MLX5_MKC_ACCESS_MODE_KLMS);
+					 MLX5_MKC_ACCESS_MODE_KLMS, access);
 	if (IS_ERR(mr->klm_mr)) {
 		err = PTR_ERR(mr->klm_mr);
 		goto err_destroy_psv;
 	}
 	mr->mtt_mr = mlx5_ib_alloc_pi_mr(pd, max_num_sg, max_num_meta_sg,
 					 sizeof(struct mlx5_mtt),
-					 MLX5_MKC_ACCESS_MODE_MTT);
+					 MLX5_MKC_ACCESS_MODE_MTT, access);
 	if (IS_ERR(mr->mtt_mr)) {
 		err = PTR_ERR(mr->mtt_mr);
 		goto err_free_klm_mr;
@@ -2168,7 +2172,8 @@ static int mlx5_alloc_integrity_descs(struct ib_pd *pd, struct mlx5_ib_mr *mr,
 	MLX5_SET(mkc, mkc, bsf_octword_size, MLX5_MKEY_BSF_OCTO_SIZE);
 
 	err = _mlx5_alloc_mkey_descs(pd, mr, 4, sizeof(struct mlx5_klm), 0,
-				     MLX5_MKC_ACCESS_MODE_KLMS, in, inlen);
+				     MLX5_MKC_ACCESS_MODE_KLMS, access, in,
+				     inlen);
 	if (err)
 		goto err_free_mtt_mr;
 
@@ -2202,7 +2207,7 @@ static int mlx5_alloc_integrity_descs(struct ib_pd *pd, struct mlx5_ib_mr *mr,
 
 static struct ib_mr *__mlx5_ib_alloc_mr(struct ib_pd *pd,
 					enum ib_mr_type mr_type, u32 max_num_sg,
-					u32 max_num_meta_sg)
+					u32 max_num_meta_sg, u32 access)
 {
 	struct mlx5_ib_dev *dev = to_mdev(pd->device);
 	int inlen = MLX5_ST_SZ_BYTES(create_mkey_in);
@@ -2226,14 +2231,16 @@ static struct ib_mr *__mlx5_ib_alloc_mr(struct ib_pd *pd,
 
 	switch (mr_type) {
 	case IB_MR_TYPE_MEM_REG:
-		err = mlx5_alloc_mem_reg_descs(pd, mr, ndescs, in, inlen);
+		err = mlx5_alloc_mem_reg_descs(pd, mr, ndescs, access, in,
+					       inlen);
 		break;
 	case IB_MR_TYPE_SG_GAPS:
-		err = mlx5_alloc_sg_gaps_descs(pd, mr, ndescs, in, inlen);
+		err = mlx5_alloc_sg_gaps_descs(pd, mr, ndescs, access, in,
+					       inlen);
 		break;
 	case IB_MR_TYPE_INTEGRITY:
-		err = mlx5_alloc_integrity_descs(pd, mr, max_num_sg,
-						 max_num_meta_sg, in, inlen);
+		err = mlx5_alloc_integrity_descs(
+			pd, mr, max_num_sg, max_num_meta_sg, access, in, inlen);
 		break;
 	default:
 		mlx5_ib_warn(dev, "Invalid mr type %d\n", mr_type);
@@ -2255,16 +2262,16 @@ static struct ib_mr *__mlx5_ib_alloc_mr(struct ib_pd *pd,
 }
 
 struct ib_mr *mlx5_ib_alloc_mr(struct ib_pd *pd, enum ib_mr_type mr_type,
-			       u32 max_num_sg)
+			       u32 max_num_sg, u32 access)
 {
-	return __mlx5_ib_alloc_mr(pd, mr_type, max_num_sg, 0);
+	return __mlx5_ib_alloc_mr(pd, mr_type, max_num_sg, 0, access);
 }
 
-struct ib_mr *mlx5_ib_alloc_mr_integrity(struct ib_pd *pd,
-					 u32 max_num_sg, u32 max_num_meta_sg)
+struct ib_mr *mlx5_ib_alloc_mr_integrity(struct ib_pd *pd, u32 max_num_sg,
+					 u32 max_num_meta_sg, u32 access)
 {
 	return __mlx5_ib_alloc_mr(pd, IB_MR_TYPE_INTEGRITY, max_num_sg,
-				  max_num_meta_sg);
+				  max_num_meta_sg, access);
 }
 
 int mlx5_ib_alloc_mw(struct ib_mw *ibmw, struct ib_udata *udata)
diff --git a/drivers/infiniband/hw/mlx5/wr.c b/drivers/infiniband/hw/mlx5/wr.c
index cf2852cba45c..a1b6d0ff8461 100644
--- a/drivers/infiniband/hw/mlx5/wr.c
+++ b/drivers/infiniband/hw/mlx5/wr.c
@@ -171,7 +171,8 @@ static u64 get_xlt_octo(u64 bytes)
 	       MLX5_IB_UMR_OCTOWORD;
 }
 
-static __be64 frwr_mkey_mask(bool atomic)
+static __be64 frwr_mkey_mask(bool atomic, int relaxed_ordering_write,
+			     int relaxed_ordering_read)
 {
 	u64 result;
 
@@ -190,10 +191,17 @@ static __be64 frwr_mkey_mask(bool atomic)
 	if (atomic)
 		result |= MLX5_MKEY_MASK_A;
 
+	if (relaxed_ordering_write)
+		result |= MLX5_MKEY_MASK_RELAXED_ORDERING_WRITE;
+
+	if (relaxed_ordering_read)
+		result |= MLX5_MKEY_MASK_RELAXED_ORDERING_READ;
+
 	return cpu_to_be64(result);
 }
 
-static __be64 sig_mkey_mask(void)
+static __be64 sig_mkey_mask(int relaxed_ordering_write,
+			    int relaxed_ordering_read)
 {
 	u64 result;
 
@@ -211,10 +219,17 @@ static __be64 sig_mkey_mask(void)
 		MLX5_MKEY_MASK_FREE		|
 		MLX5_MKEY_MASK_BSF_EN;
 
+	if (relaxed_ordering_write)
+		result |= MLX5_MKEY_MASK_RELAXED_ORDERING_WRITE;
+
+	if (relaxed_ordering_read)
+		result |= MLX5_MKEY_MASK_RELAXED_ORDERING_READ;
+
 	return cpu_to_be64(result);
 }
 
-static void set_reg_umr_seg(struct mlx5_wqe_umr_ctrl_seg *umr,
+static void set_reg_umr_seg(struct mlx5_ib_dev *dev,
+			    struct mlx5_wqe_umr_ctrl_seg *umr,
 			    struct mlx5_ib_mr *mr, u8 flags, bool atomic)
 {
 	int size = (mr->ndescs + mr->meta_ndescs) * mr->desc_size;
@@ -223,7 +238,9 @@ static void set_reg_umr_seg(struct mlx5_wqe_umr_ctrl_seg *umr,
 
 	umr->flags = flags;
 	umr->xlt_octowords = cpu_to_be16(get_xlt_octo(size));
-	umr->mkey_mask = frwr_mkey_mask(atomic);
+	umr->mkey_mask = frwr_mkey_mask(
+		atomic, MLX5_CAP_GEN(dev->mdev, relaxed_ordering_write_umr),
+		MLX5_CAP_GEN(dev->mdev, relaxed_ordering_read_umr));
 }
 
 static void set_linv_umr_seg(struct mlx5_wqe_umr_ctrl_seg *umr)
@@ -370,9 +387,8 @@ static u8 get_umr_flags(int acc)
 		MLX5_PERM_LOCAL_READ | MLX5_PERM_UMR_EN;
 }
 
-static void set_reg_mkey_seg(struct mlx5_mkey_seg *seg,
-			     struct mlx5_ib_mr *mr,
-			     u32 key, int access)
+static void set_reg_mkey_seg(struct mlx5_ib_dev *dev, struct mlx5_mkey_seg *seg,
+			     struct mlx5_ib_mr *mr, u32 key, int access)
 {
 	int ndescs = ALIGN(mr->ndescs + mr->meta_ndescs, 8) >> 1;
 
@@ -390,6 +406,13 @@ static void set_reg_mkey_seg(struct mlx5_mkey_seg *seg,
 	seg->start_addr = cpu_to_be64(mr->ibmr.iova);
 	seg->len = cpu_to_be64(mr->ibmr.length);
 	seg->xlt_oct_size = cpu_to_be32(ndescs);
+
+	if (MLX5_CAP_GEN(dev->mdev, relaxed_ordering_write_umr) &&
+	    (access & IB_ACCESS_RELAXED_ORDERING))
+		MLX5_SET(mkc, seg, relaxed_ordering_write, 1);
+	if (MLX5_CAP_GEN(dev->mdev, relaxed_ordering_read_umr) &&
+	    (access & IB_ACCESS_RELAXED_ORDERING))
+		MLX5_SET(mkc, seg, relaxed_ordering_read, 1);
 }
 
 static void set_linv_mkey_seg(struct mlx5_mkey_seg *seg)
@@ -746,7 +769,8 @@ static int set_sig_data_segment(const struct ib_send_wr *send_wr,
 	return 0;
 }
 
-static void set_sig_mkey_segment(struct mlx5_mkey_seg *seg,
+static void set_sig_mkey_segment(struct mlx5_ib_dev *dev,
+				 struct mlx5_mkey_seg *seg,
 				 struct ib_mr *sig_mr, int access_flags,
 				 u32 size, u32 length, u32 pdn)
 {
@@ -762,23 +786,34 @@ static void set_sig_mkey_segment(struct mlx5_mkey_seg *seg,
 	seg->len = cpu_to_be64(length);
 	seg->xlt_oct_size = cpu_to_be32(get_xlt_octo(size));
 	seg->bsfs_octo_size = cpu_to_be32(MLX5_MKEY_BSF_OCTO_SIZE);
+
+	if (MLX5_CAP_GEN(dev->mdev, relaxed_ordering_write_umr) &&
+	    (access_flags & IB_ACCESS_RELAXED_ORDERING))
+		MLX5_SET(mkc, seg, relaxed_ordering_write, 1);
+	if (MLX5_CAP_GEN(dev->mdev, relaxed_ordering_read_umr) &&
+	    (access_flags & IB_ACCESS_RELAXED_ORDERING))
+		MLX5_SET(mkc, seg, relaxed_ordering_read, 1);
 }
 
-static void set_sig_umr_segment(struct mlx5_wqe_umr_ctrl_seg *umr,
-				u32 size)
+static void set_sig_umr_segment(struct mlx5_ib_dev *dev,
+				struct mlx5_wqe_umr_ctrl_seg *umr, u32 size)
 {
 	memset(umr, 0, sizeof(*umr));
 
 	umr->flags = MLX5_FLAGS_INLINE | MLX5_FLAGS_CHECK_FREE;
 	umr->xlt_octowords = cpu_to_be16(get_xlt_octo(size));
 	umr->bsf_octowords = cpu_to_be16(MLX5_MKEY_BSF_OCTO_SIZE);
-	umr->mkey_mask = sig_mkey_mask();
+	umr->mkey_mask = sig_mkey_mask(
+		MLX5_CAP_GEN(dev->mdev, relaxed_ordering_write_umr),
+		MLX5_CAP_GEN(dev->mdev, relaxed_ordering_read_umr));
 }
 
 static int set_pi_umr_wr(const struct ib_send_wr *send_wr,
 			 struct mlx5_ib_qp *qp, void **seg, int *size,
 			 void **cur_edge)
 {
+	struct mlx5_ib_pd *pd = to_mpd(qp->ibqp.pd);
+	struct mlx5_ib_dev *dev = to_mdev(pd->ibpd.device);
 	const struct ib_reg_wr *wr = reg_wr(send_wr);
 	struct mlx5_ib_mr *sig_mr = to_mmr(wr->mr);
 	struct mlx5_ib_mr *pi_mr = sig_mr->pi_mr;
@@ -806,13 +841,13 @@ static int set_pi_umr_wr(const struct ib_send_wr *send_wr,
 	else
 		xlt_size = sizeof(struct mlx5_klm);
 
-	set_sig_umr_segment(*seg, xlt_size);
+	set_sig_umr_segment(dev, *seg, xlt_size);
 	*seg += sizeof(struct mlx5_wqe_umr_ctrl_seg);
 	*size += sizeof(struct mlx5_wqe_umr_ctrl_seg) / 16;
 	handle_post_send_edge(&qp->sq, seg, *size, cur_edge);
 
-	set_sig_mkey_segment(*seg, wr->mr, wr->access, xlt_size, region_len,
-			     pdn);
+	set_sig_mkey_segment(dev, *seg, wr->mr, wr->access, xlt_size,
+			     region_len, pdn);
 	*seg += sizeof(struct mlx5_mkey_seg);
 	*size += sizeof(struct mlx5_mkey_seg) / 16;
 	handle_post_send_edge(&qp->sq, seg, *size, cur_edge);
@@ -867,7 +902,7 @@ static int set_reg_wr(struct mlx5_ib_qp *qp,
 	u8 flags = 0;
 
 	/* Matches access in mlx5_set_umr_free_mkey() */
-	if (!mlx5_ib_can_reconfig_with_umr(dev, 0, wr->access)) {
+	if (!mlx5_ib_can_reconfig_with_umr(dev, mr->access_flags, wr->access)) {
 		mlx5_ib_warn(
 			to_mdev(qp->ibqp.device),
 			"Fast update for MR access flags is not possible\n");
@@ -885,12 +920,12 @@ static int set_reg_wr(struct mlx5_ib_qp *qp,
 	if (umr_inline)
 		flags |= MLX5_UMR_INLINE;
 
-	set_reg_umr_seg(*seg, mr, flags, atomic);
+	set_reg_umr_seg(dev, *seg, mr, flags, atomic);
 	*seg += sizeof(struct mlx5_wqe_umr_ctrl_seg);
 	*size += sizeof(struct mlx5_wqe_umr_ctrl_seg) / 16;
 	handle_post_send_edge(&qp->sq, seg, *size, cur_edge);
 
-	set_reg_mkey_seg(*seg, mr, wr->key, wr->access);
+	set_reg_mkey_seg(dev, *seg, mr, wr->key, wr->access);
 	*seg += sizeof(struct mlx5_mkey_seg);
 	*size += sizeof(struct mlx5_mkey_seg) / 16;
 	handle_post_send_edge(&qp->sq, seg, *size, cur_edge);
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
index 58619ce64d0d..419711552825 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
@@ -2904,7 +2904,7 @@ int ocrdma_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags cq_flags)
 }
 
 struct ib_mr *ocrdma_alloc_mr(struct ib_pd *ibpd, enum ib_mr_type mr_type,
-			      u32 max_num_sg)
+			      u32 max_num_sg, u32 access)
 {
 	int status;
 	struct ocrdma_mr *mr;
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.h b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.h
index b1c5fad81603..7644e343fcd6 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.h
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.h
@@ -102,7 +102,7 @@ struct ib_mr *ocrdma_get_dma_mr(struct ib_pd *, int acc);
 struct ib_mr *ocrdma_reg_user_mr(struct ib_pd *, u64 start, u64 length,
 				 u64 virt, int acc, struct ib_udata *);
 struct ib_mr *ocrdma_alloc_mr(struct ib_pd *pd, enum ib_mr_type mr_type,
-			      u32 max_num_sg);
+			      u32 max_num_sg, u32 access);
 int ocrdma_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg, int sg_nents,
 		     unsigned int *sg_offset);
 
diff --git a/drivers/infiniband/hw/qedr/verbs.c b/drivers/infiniband/hw/qedr/verbs.c
index 41e12f011f22..51fa57b97928 100644
--- a/drivers/infiniband/hw/qedr/verbs.c
+++ b/drivers/infiniband/hw/qedr/verbs.c
@@ -3132,7 +3132,7 @@ static struct qedr_mr *__qedr_alloc_mr(struct ib_pd *ibpd,
 }
 
 struct ib_mr *qedr_alloc_mr(struct ib_pd *ibpd, enum ib_mr_type mr_type,
-			    u32 max_num_sg)
+			    u32 max_num_sg, u32 access)
 {
 	struct qedr_mr *mr;
 
diff --git a/drivers/infiniband/hw/qedr/verbs.h b/drivers/infiniband/hw/qedr/verbs.h
index 34ad47515861..8ea872b56ee0 100644
--- a/drivers/infiniband/hw/qedr/verbs.h
+++ b/drivers/infiniband/hw/qedr/verbs.h
@@ -86,7 +86,7 @@ int qedr_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg,
 		   int sg_nents, unsigned int *sg_offset);
 
 struct ib_mr *qedr_alloc_mr(struct ib_pd *pd, enum ib_mr_type mr_type,
-			    u32 max_num_sg);
+			    u32 max_num_sg, u32 access);
 int qedr_poll_cq(struct ib_cq *, int num_entries, struct ib_wc *wc);
 int qedr_post_send(struct ib_qp *, const struct ib_send_wr *,
 		   const struct ib_send_wr **bad_wr);
diff --git a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c
index e80848bfb3bd..b3fa783698a0 100644
--- a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c
+++ b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c
@@ -198,11 +198,12 @@ struct ib_mr *pvrdma_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
  * @pd: protection domain
  * @mr_type: type of memory region
  * @max_num_sg: maximum number of pages
+ * @access: access flags of memory region
  *
  * @return: ib_mr pointer on success, otherwise returns an errno.
  */
 struct ib_mr *pvrdma_alloc_mr(struct ib_pd *pd, enum ib_mr_type mr_type,
-			      u32 max_num_sg)
+			      u32 max_num_sg, u32 access)
 {
 	struct pvrdma_dev *dev = to_vdev(pd->device);
 	struct pvrdma_user_mr *mr;
diff --git a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.h b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.h
index 544b94d97c3a..079fb4c09979 100644
--- a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.h
+++ b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.h
@@ -371,7 +371,7 @@ struct ib_mr *pvrdma_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 				 struct ib_udata *udata);
 int pvrdma_dereg_mr(struct ib_mr *mr, struct ib_udata *udata);
 struct ib_mr *pvrdma_alloc_mr(struct ib_pd *pd, enum ib_mr_type mr_type,
-			      u32 max_num_sg);
+			      u32 max_num_sg, u32 access);
 int pvrdma_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg,
 		     int sg_nents, unsigned int *sg_offset);
 int pvrdma_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
diff --git a/drivers/infiniband/sw/rdmavt/mr.c b/drivers/infiniband/sw/rdmavt/mr.c
index 601d18dda1f5..b484a7968681 100644
--- a/drivers/infiniband/sw/rdmavt/mr.c
+++ b/drivers/infiniband/sw/rdmavt/mr.c
@@ -571,11 +571,12 @@ int rvt_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata)
  * @pd: protection domain for this memory region
  * @mr_type: mem region type
  * @max_num_sg: Max number of segments allowed
+ * @access: access flags of memory region
  *
  * Return: the memory region on success, otherwise return an errno.
  */
 struct ib_mr *rvt_alloc_mr(struct ib_pd *pd, enum ib_mr_type mr_type,
-			   u32 max_num_sg)
+			   u32 max_num_sg, u32 access)
 {
 	struct rvt_mr *mr;
 
diff --git a/drivers/infiniband/sw/rdmavt/mr.h b/drivers/infiniband/sw/rdmavt/mr.h
index b3aba359401b..0542b2c6dbfc 100644
--- a/drivers/infiniband/sw/rdmavt/mr.h
+++ b/drivers/infiniband/sw/rdmavt/mr.h
@@ -71,7 +71,7 @@ struct ib_mr *rvt_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 			      struct ib_udata *udata);
 int rvt_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata);
 struct ib_mr *rvt_alloc_mr(struct ib_pd *pd, enum ib_mr_type mr_type,
-			   u32 max_num_sg);
+			   u32 max_num_sg, u32 access);
 int rvt_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg,
 		  int sg_nents, unsigned int *sg_offset);
 
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c b/drivers/infiniband/sw/rxe/rxe_verbs.c
index aeb5e232c195..6a23f54c88a6 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.c
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.c
@@ -925,7 +925,7 @@ static int rxe_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata)
 }
 
 static struct ib_mr *rxe_alloc_mr(struct ib_pd *ibpd, enum ib_mr_type mr_type,
-				  u32 max_num_sg)
+				  u32 max_num_sg, u32 access)
 {
 	struct rxe_dev *rxe = to_rdev(ibpd->device);
 	struct rxe_pd *pd = to_rpd(ibpd);
diff --git a/drivers/infiniband/sw/siw/siw_verbs.c b/drivers/infiniband/sw/siw/siw_verbs.c
index d2313efb26db..a9ea13cd67bd 100644
--- a/drivers/infiniband/sw/siw/siw_verbs.c
+++ b/drivers/infiniband/sw/siw/siw_verbs.c
@@ -1383,7 +1383,7 @@ struct ib_mr *siw_reg_user_mr(struct ib_pd *pd, u64 start, u64 len,
 }
 
 struct ib_mr *siw_alloc_mr(struct ib_pd *pd, enum ib_mr_type mr_type,
-			   u32 max_sge)
+			   u32 max_sge, u32 access)
 {
 	struct siw_device *sdev = to_siw_dev(pd->device);
 	struct siw_mr *mr = NULL;
diff --git a/drivers/infiniband/sw/siw/siw_verbs.h b/drivers/infiniband/sw/siw/siw_verbs.h
index 67ac08886a70..817f72cd242a 100644
--- a/drivers/infiniband/sw/siw/siw_verbs.h
+++ b/drivers/infiniband/sw/siw/siw_verbs.h
@@ -68,7 +68,7 @@ int siw_req_notify_cq(struct ib_cq *base_cq, enum ib_cq_notify_flags flags);
 struct ib_mr *siw_reg_user_mr(struct ib_pd *base_pd, u64 start, u64 len,
 			      u64 rnic_va, int rights, struct ib_udata *udata);
 struct ib_mr *siw_alloc_mr(struct ib_pd *base_pd, enum ib_mr_type mr_type,
-			   u32 max_sge);
+			   u32 max_sge, u32 access);
 struct ib_mr *siw_get_dma_mr(struct ib_pd *base_pd, int rights);
 int siw_map_mr_sg(struct ib_mr *base_mr, struct scatterlist *sl, int num_sle,
 		  unsigned int *sg_off);
diff --git a/drivers/infiniband/ulp/iser/iser_verbs.c b/drivers/infiniband/ulp/iser/iser_verbs.c
index 136f6c4492e0..3c370ee25f2f 100644
--- a/drivers/infiniband/ulp/iser/iser_verbs.c
+++ b/drivers/infiniband/ulp/iser/iser_verbs.c
@@ -121,7 +121,7 @@ iser_create_fastreg_desc(struct iser_device *device,
 	else
 		mr_type = IB_MR_TYPE_MEM_REG;
 
-	desc->rsc.mr = ib_alloc_mr(pd, mr_type, size);
+	desc->rsc.mr = ib_alloc_mr(pd, mr_type, size, 0);
 	if (IS_ERR(desc->rsc.mr)) {
 		ret = PTR_ERR(desc->rsc.mr);
 		iser_err("Failed to allocate ib_fast_reg_mr err=%d\n", ret);
@@ -129,7 +129,7 @@ iser_create_fastreg_desc(struct iser_device *device,
 	}
 
 	if (pi_enable) {
-		desc->rsc.sig_mr = ib_alloc_mr_integrity(pd, size, size);
+		desc->rsc.sig_mr = ib_alloc_mr_integrity(pd, size, size, 0);
 		if (IS_ERR(desc->rsc.sig_mr)) {
 			ret = PTR_ERR(desc->rsc.sig_mr);
 			iser_err("Failed to allocate sig_mr err=%d\n", ret);
diff --git a/drivers/infiniband/ulp/rtrs/rtrs-clt.c b/drivers/infiniband/ulp/rtrs/rtrs-clt.c
index 64990df81937..0d3960ed5b2b 100644
--- a/drivers/infiniband/ulp/rtrs/rtrs-clt.c
+++ b/drivers/infiniband/ulp/rtrs/rtrs-clt.c
@@ -1260,7 +1260,7 @@ static int alloc_sess_reqs(struct rtrs_clt_sess *sess)
 			goto out;
 
 		req->mr = ib_alloc_mr(sess->s.dev->ib_pd, IB_MR_TYPE_MEM_REG,
-				      sess->max_pages_per_mr);
+				      sess->max_pages_per_mr, 0);
 		if (IS_ERR(req->mr)) {
 			err = PTR_ERR(req->mr);
 			req->mr = NULL;
diff --git a/drivers/infiniband/ulp/rtrs/rtrs-srv.c b/drivers/infiniband/ulp/rtrs/rtrs-srv.c
index 5e9bb7bf5ef3..575f31ff20fd 100644
--- a/drivers/infiniband/ulp/rtrs/rtrs-srv.c
+++ b/drivers/infiniband/ulp/rtrs/rtrs-srv.c
@@ -638,7 +638,7 @@ static int map_cont_bufs(struct rtrs_srv_sess *sess)
 			goto free_sg;
 		}
 		mr = ib_alloc_mr(sess->s.dev->ib_pd, IB_MR_TYPE_MEM_REG,
-				 sgt->nents);
+				 sgt->nents, 0);
 		if (IS_ERR(mr)) {
 			err = PTR_ERR(mr);
 			goto unmap_sg;
diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 31f8aa2c40ed..8481ad769ba4 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -436,7 +436,7 @@ static struct srp_fr_pool *srp_create_fr_pool(struct ib_device *device,
 		mr_type = IB_MR_TYPE_MEM_REG;
 
 	for (i = 0, d = &pool->desc[0]; i < pool->size; i++, d++) {
-		mr = ib_alloc_mr(pd, mr_type, max_page_list_len);
+		mr = ib_alloc_mr(pd, mr_type, max_page_list_len, 0);
 		if (IS_ERR(mr)) {
 			ret = PTR_ERR(mr);
 			if (ret == -ENOMEM)
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 53ac4d7442ba..4dbc17311e0b 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -534,7 +534,7 @@ static int nvme_rdma_create_queue_ib(struct nvme_rdma_queue *queue)
 	ret = ib_mr_pool_init(queue->qp, &queue->qp->rdma_mrs,
 			      queue->queue_size,
 			      IB_MR_TYPE_MEM_REG,
-			      pages_per_mr, 0);
+			      pages_per_mr, 0, 0);
 	if (ret) {
 		dev_err(queue->ctrl->ctrl.device,
 			"failed to initialize MR pool sized %d for QID %d\n",
@@ -545,7 +545,7 @@ static int nvme_rdma_create_queue_ib(struct nvme_rdma_queue *queue)
 	if (queue->pi_support) {
 		ret = ib_mr_pool_init(queue->qp, &queue->qp->sig_mrs,
 				      queue->queue_size, IB_MR_TYPE_INTEGRITY,
-				      pages_per_mr, pages_per_mr);
+				      pages_per_mr, pages_per_mr, 0);
 		if (ret) {
 			dev_err(queue->ctrl->ctrl.device,
 				"failed to initialize PI MR pool sized %d for QID %d\n",
diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
index 10dfe5006792..647098a5cf3b 100644
--- a/fs/cifs/smbdirect.c
+++ b/fs/cifs/smbdirect.c
@@ -2178,9 +2178,8 @@ static void smbd_mr_recovery_work(struct work_struct *work)
 				continue;
 			}
 
-			smbdirect_mr->mr = ib_alloc_mr(
-				info->pd, info->mr_type,
-				info->max_frmr_depth);
+			smbdirect_mr->mr = ib_alloc_mr(info->pd, info->mr_type,
+						       info->max_frmr_depth, 0);
 			if (IS_ERR(smbdirect_mr->mr)) {
 				log_rdma_mr(ERR, "ib_alloc_mr failed mr_type=%x max_frmr_depth=%x\n",
 					    info->mr_type,
@@ -2245,7 +2244,7 @@ static int allocate_mr_list(struct smbd_connection *info)
 		if (!smbdirect_mr)
 			goto out;
 		smbdirect_mr->mr = ib_alloc_mr(info->pd, info->mr_type,
-					info->max_frmr_depth);
+					       info->max_frmr_depth, 0);
 		if (IS_ERR(smbdirect_mr->mr)) {
 			log_rdma_mr(ERR, "ib_alloc_mr failed mr_type=%x max_frmr_depth=%x\n",
 				    info->mr_type, info->max_frmr_depth);
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index bed4cfe50554..59138174affa 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -2444,10 +2444,10 @@ struct ib_device_ops {
 				       struct ib_udata *udata);
 	int (*dereg_mr)(struct ib_mr *mr, struct ib_udata *udata);
 	struct ib_mr *(*alloc_mr)(struct ib_pd *pd, enum ib_mr_type mr_type,
-				  u32 max_num_sg);
+				  u32 max_num_sg, u32 access);
 	struct ib_mr *(*alloc_mr_integrity)(struct ib_pd *pd,
 					    u32 max_num_data_sg,
-					    u32 max_num_meta_sg);
+					    u32 max_num_meta_sg, u32 access);
 	int (*advise_mr)(struct ib_pd *pd,
 			 enum ib_uverbs_advise_mr_advice advice, u32 flags,
 			 struct ib_sge *sg_list, u32 num_sge,
@@ -4142,11 +4142,10 @@ static inline int ib_dereg_mr(struct ib_mr *mr)
 }
 
 struct ib_mr *ib_alloc_mr(struct ib_pd *pd, enum ib_mr_type mr_type,
-			  u32 max_num_sg);
+			  u32 max_num_sg, u32 access);
 
-struct ib_mr *ib_alloc_mr_integrity(struct ib_pd *pd,
-				    u32 max_num_data_sg,
-				    u32 max_num_meta_sg);
+struct ib_mr *ib_alloc_mr_integrity(struct ib_pd *pd, u32 max_num_data_sg,
+				    u32 max_num_meta_sg, u32 access);
 
 /**
  * ib_update_fast_reg_key - updates the key portion of the fast_reg MR
diff --git a/include/rdma/mr_pool.h b/include/rdma/mr_pool.h
index e77123bcb43b..2a0ee791037d 100644
--- a/include/rdma/mr_pool.h
+++ b/include/rdma/mr_pool.h
@@ -11,7 +11,8 @@ struct ib_mr *ib_mr_pool_get(struct ib_qp *qp, struct list_head *list);
 void ib_mr_pool_put(struct ib_qp *qp, struct list_head *list, struct ib_mr *mr);
 
 int ib_mr_pool_init(struct ib_qp *qp, struct list_head *list, int nr,
-		enum ib_mr_type type, u32 max_num_sg, u32 max_num_meta_sg);
+		    enum ib_mr_type type, u32 max_num_sg, u32 max_num_meta_sg,
+		    u32 access);
 void ib_mr_pool_destroy(struct ib_qp *qp, struct list_head *list);
 
 #endif /* _RDMA_MR_POOL_H */
diff --git a/net/rds/ib_frmr.c b/net/rds/ib_frmr.c
index 9b6ffff72f2d..694eb916319e 100644
--- a/net/rds/ib_frmr.c
+++ b/net/rds/ib_frmr.c
@@ -76,7 +76,7 @@ static struct rds_ib_mr *rds_ib_alloc_frmr(struct rds_ib_device *rds_ibdev,
 
 	frmr = &ibmr->u.frmr;
 	frmr->mr = ib_alloc_mr(rds_ibdev->pd, IB_MR_TYPE_MEM_REG,
-			 pool->max_pages);
+			       pool->max_pages, 0);
 	if (IS_ERR(frmr->mr)) {
 		pr_warn("RDS/IB: %s failed to allocate MR", __func__);
 		err = PTR_ERR(frmr->mr);
diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c
index 7d7ba0320d5a..4e91ed3dc265 100644
--- a/net/smc/smc_ib.c
+++ b/net/smc/smc_ib.c
@@ -579,7 +579,7 @@ int smc_ib_get_memory_region(struct ib_pd *pd, int access_flags,
 		return 0; /* already done */
 
 	buf_slot->mr_rx[link_idx] =
-		ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, 1 << buf_slot->order);
+		ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, 1 << buf_slot->order, 0);
 	if (IS_ERR(buf_slot->mr_rx[link_idx])) {
 		int rc;
 
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 766a1048a48a..cfbdd197cdfe 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -135,7 +135,7 @@ int frwr_mr_init(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr *mr)
 	struct ib_mr *frmr;
 	int rc;
 
-	frmr = ib_alloc_mr(ep->re_pd, ep->re_mrtype, depth);
+	frmr = ib_alloc_mr(ep->re_pd, ep->re_mrtype, depth, 0);
 	if (IS_ERR(frmr))
 		goto out_mr_err;
 
-- 
2.30.2


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH rdma-next 02/10] RDMA/core: Enable Relaxed Ordering in __ib_alloc_pd()
  2021-04-05  5:23 [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs Leon Romanovsky
  2021-04-05  5:23 ` [PATCH rdma-next 01/10] RDMA: Add access flags to ib_alloc_mr() and ib_mr_pool_init() Leon Romanovsky
@ 2021-04-05  5:23 ` Leon Romanovsky
  2021-04-05 18:01   ` Tom Talpey
  2021-04-05  5:23 ` [PATCH rdma-next 03/10] RDMA/iser: Enable Relaxed Ordering Leon Romanovsky
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 56+ messages in thread
From: Leon Romanovsky @ 2021-04-05  5:23 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe

From: Avihai Horon <avihaih@nvidia.com>

Enable Relaxed Ordering in __ib_alloc_pd() allocation of the
local_dma_lkey.

This will take effect only for devices that don't pre-allocate the lkey
but allocate it per PD allocation.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/infiniband/core/verbs.c              | 3 ++-
 drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index a1782f8a6ca0..9b719f7d6fd5 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -287,7 +287,8 @@ struct ib_pd *__ib_alloc_pd(struct ib_device *device, unsigned int flags,
 	if (device->attrs.device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY)
 		pd->local_dma_lkey = device->local_dma_lkey;
 	else
-		mr_access_flags |= IB_ACCESS_LOCAL_WRITE;
+		mr_access_flags |=
+			IB_ACCESS_LOCAL_WRITE | IB_ACCESS_RELAXED_ORDERING;
 
 	if (flags & IB_PD_UNSAFE_GLOBAL_RKEY) {
 		pr_warn("%s: enabling unsafe global rkey\n", caller);
diff --git a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c
index b3fa783698a0..d74827694f92 100644
--- a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c
+++ b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c
@@ -66,6 +66,7 @@ struct ib_mr *pvrdma_get_dma_mr(struct ib_pd *pd, int acc)
 	int ret;
 
 	/* Support only LOCAL_WRITE flag for DMA MRs */
+	acc &= ~IB_ACCESS_RELAXED_ORDERING;
 	if (acc & ~IB_ACCESS_LOCAL_WRITE) {
 		dev_warn(&dev->pdev->dev,
 			 "unsupported dma mr access flags %#x\n", acc);
-- 
2.30.2



* [PATCH rdma-next 03/10] RDMA/iser: Enable Relaxed Ordering
  2021-04-05  5:23 [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs Leon Romanovsky
  2021-04-05  5:23 ` [PATCH rdma-next 01/10] RDMA: Add access flags to ib_alloc_mr() and ib_mr_pool_init() Leon Romanovsky
  2021-04-05  5:23 ` [PATCH rdma-next 02/10] RDMA/core: Enable Relaxed Ordering in __ib_alloc_pd() Leon Romanovsky
@ 2021-04-05  5:23 ` Leon Romanovsky
  2021-04-05  5:23 ` [PATCH rdma-next 04/10] RDMA/rtrs: " Leon Romanovsky
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 56+ messages in thread
From: Leon Romanovsky @ 2021-04-05  5:23 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe

From: Avihai Horon <avihaih@nvidia.com>

Enable Relaxed Ordering for iser.

Relaxed Ordering is an optional access flag and as such, it is ignored
by vendors that don't support it.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/infiniband/ulp/iser/iser_memory.c | 10 ++++------
 drivers/infiniband/ulp/iser/iser_verbs.c  |  6 ++++--
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/ulp/iser/iser_memory.c b/drivers/infiniband/ulp/iser/iser_memory.c
index afec40da9b58..29849c9e82e8 100644
--- a/drivers/infiniband/ulp/iser/iser_memory.c
+++ b/drivers/infiniband/ulp/iser/iser_memory.c
@@ -271,9 +271,8 @@ iser_reg_sig_mr(struct iscsi_iser_task *iser_task,
 	wr->wr.send_flags = 0;
 	wr->mr = mr;
 	wr->key = mr->rkey;
-	wr->access = IB_ACCESS_LOCAL_WRITE |
-		     IB_ACCESS_REMOTE_READ |
-		     IB_ACCESS_REMOTE_WRITE;
+	wr->access = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ |
+		     IB_ACCESS_REMOTE_WRITE | IB_ACCESS_RELAXED_ORDERING;
 	rsc->mr_valid = 1;
 
 	sig_reg->sge.lkey = mr->lkey;
@@ -318,9 +317,8 @@ static int iser_fast_reg_mr(struct iscsi_iser_task *iser_task,
 	wr->wr.num_sge = 0;
 	wr->mr = mr;
 	wr->key = mr->rkey;
-	wr->access = IB_ACCESS_LOCAL_WRITE  |
-		     IB_ACCESS_REMOTE_WRITE |
-		     IB_ACCESS_REMOTE_READ;
+	wr->access = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE |
+		     IB_ACCESS_REMOTE_READ | IB_ACCESS_RELAXED_ORDERING;
 
 	rsc->mr_valid = 1;
 
diff --git a/drivers/infiniband/ulp/iser/iser_verbs.c b/drivers/infiniband/ulp/iser/iser_verbs.c
index 3c370ee25f2f..1e236b1cf29b 100644
--- a/drivers/infiniband/ulp/iser/iser_verbs.c
+++ b/drivers/infiniband/ulp/iser/iser_verbs.c
@@ -121,7 +121,8 @@ iser_create_fastreg_desc(struct iser_device *device,
 	else
 		mr_type = IB_MR_TYPE_MEM_REG;
 
-	desc->rsc.mr = ib_alloc_mr(pd, mr_type, size, 0);
+	desc->rsc.mr = ib_alloc_mr(pd, mr_type, size,
+		IB_ACCESS_RELAXED_ORDERING);
 	if (IS_ERR(desc->rsc.mr)) {
 		ret = PTR_ERR(desc->rsc.mr);
 		iser_err("Failed to allocate ib_fast_reg_mr err=%d\n", ret);
@@ -129,7 +130,8 @@ iser_create_fastreg_desc(struct iser_device *device,
 	}
 
 	if (pi_enable) {
-		desc->rsc.sig_mr = ib_alloc_mr_integrity(pd, size, size, 0);
+		desc->rsc.sig_mr = ib_alloc_mr_integrity(pd, size, size,
+			IB_ACCESS_RELAXED_ORDERING);
 		if (IS_ERR(desc->rsc.sig_mr)) {
 			ret = PTR_ERR(desc->rsc.sig_mr);
 			iser_err("Failed to allocate sig_mr err=%d\n", ret);
-- 
2.30.2



* [PATCH rdma-next 04/10] RDMA/rtrs: Enable Relaxed Ordering
  2021-04-05  5:23 [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs Leon Romanovsky
                   ` (2 preceding siblings ...)
  2021-04-05  5:23 ` [PATCH rdma-next 03/10] RDMA/iser: Enable Relaxed Ordering Leon Romanovsky
@ 2021-04-05  5:23 ` Leon Romanovsky
  2021-04-05  5:23 ` [PATCH rdma-next 05/10] RDMA/srp: " Leon Romanovsky
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 56+ messages in thread
From: Leon Romanovsky @ 2021-04-05  5:23 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe

From: Avihai Horon <avihaih@nvidia.com>

Enable Relaxed Ordering for the rtrs client and server.

Relaxed Ordering is an optional access flag and as such, it is ignored
by vendors that don't support it.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/infiniband/ulp/rtrs/rtrs-clt.c |  6 ++++--
 drivers/infiniband/ulp/rtrs/rtrs-srv.c | 15 ++++++++-------
 2 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/drivers/infiniband/ulp/rtrs/rtrs-clt.c b/drivers/infiniband/ulp/rtrs/rtrs-clt.c
index 0d3960ed5b2b..a3fbb47a3574 100644
--- a/drivers/infiniband/ulp/rtrs/rtrs-clt.c
+++ b/drivers/infiniband/ulp/rtrs/rtrs-clt.c
@@ -1099,7 +1099,8 @@ static int rtrs_clt_read_req(struct rtrs_clt_io_req *req)
 			.mr = req->mr,
 			.key = req->mr->rkey,
 			.access = (IB_ACCESS_LOCAL_WRITE |
-				   IB_ACCESS_REMOTE_WRITE),
+				   IB_ACCESS_REMOTE_WRITE |
+				   IB_ACCESS_RELAXED_ORDERING),
 		};
 		wr = &rwr.wr;
 
@@ -1260,7 +1261,8 @@ static int alloc_sess_reqs(struct rtrs_clt_sess *sess)
 			goto out;
 
 		req->mr = ib_alloc_mr(sess->s.dev->ib_pd, IB_MR_TYPE_MEM_REG,
-				      sess->max_pages_per_mr, 0);
+				      sess->max_pages_per_mr,
+				      IB_ACCESS_RELAXED_ORDERING);
 		if (IS_ERR(req->mr)) {
 			err = PTR_ERR(req->mr);
 			req->mr = NULL;
diff --git a/drivers/infiniband/ulp/rtrs/rtrs-srv.c b/drivers/infiniband/ulp/rtrs/rtrs-srv.c
index 575f31ff20fd..c28ed5e2245d 100644
--- a/drivers/infiniband/ulp/rtrs/rtrs-srv.c
+++ b/drivers/infiniband/ulp/rtrs/rtrs-srv.c
@@ -312,8 +312,8 @@ static int rdma_write_sg(struct rtrs_srv_op *id)
 		rwr.mr = srv_mr->mr;
 		rwr.wr.send_flags = 0;
 		rwr.key = srv_mr->mr->rkey;
-		rwr.access = (IB_ACCESS_LOCAL_WRITE |
-			      IB_ACCESS_REMOTE_WRITE);
+		rwr.access = (IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE |
+			      IB_ACCESS_RELAXED_ORDERING);
 		msg = srv_mr->iu->buf;
 		msg->buf_id = cpu_to_le16(id->msg_id);
 		msg->type = cpu_to_le16(RTRS_MSG_RKEY_RSP);
@@ -432,8 +432,8 @@ static int send_io_resp_imm(struct rtrs_srv_con *con, struct rtrs_srv_op *id,
 		rwr.wr.send_flags = 0;
 		rwr.mr = srv_mr->mr;
 		rwr.key = srv_mr->mr->rkey;
-		rwr.access = (IB_ACCESS_LOCAL_WRITE |
-			      IB_ACCESS_REMOTE_WRITE);
+		rwr.access = (IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE |
+			      IB_ACCESS_RELAXED_ORDERING);
 		msg = srv_mr->iu->buf;
 		msg->buf_id = cpu_to_le16(id->msg_id);
 		msg->type = cpu_to_le16(RTRS_MSG_RKEY_RSP);
@@ -638,7 +638,7 @@ static int map_cont_bufs(struct rtrs_srv_sess *sess)
 			goto free_sg;
 		}
 		mr = ib_alloc_mr(sess->s.dev->ib_pd, IB_MR_TYPE_MEM_REG,
-				 sgt->nents, 0);
+				 sgt->nents, IB_ACCESS_RELAXED_ORDERING);
 		if (IS_ERR(mr)) {
 			err = PTR_ERR(mr);
 			goto unmap_sg;
@@ -823,8 +823,9 @@ static int process_info_req(struct rtrs_srv_con *con,
 		rwr[mri].wr.send_flags = 0;
 		rwr[mri].mr = mr;
 		rwr[mri].key = mr->rkey;
-		rwr[mri].access = (IB_ACCESS_LOCAL_WRITE |
-				   IB_ACCESS_REMOTE_WRITE);
+		rwr[mri].access =
+			(IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE |
+			 IB_ACCESS_RELAXED_ORDERING);
 		reg_wr = &rwr[mri].wr;
 	}
 
-- 
2.30.2



* [PATCH rdma-next 05/10] RDMA/srp: Enable Relaxed Ordering
  2021-04-05  5:23 [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs Leon Romanovsky
                   ` (3 preceding siblings ...)
  2021-04-05  5:23 ` [PATCH rdma-next 04/10] RDMA/rtrs: " Leon Romanovsky
@ 2021-04-05  5:23 ` Leon Romanovsky
  2021-04-05  5:24 ` [PATCH rdma-next 06/10] nvme-rdma: " Leon Romanovsky
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 56+ messages in thread
From: Leon Romanovsky @ 2021-04-05  5:23 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe

From: Avihai Horon <avihaih@nvidia.com>

Enable Relaxed Ordering for srp.

Relaxed Ordering is an optional access flag and as such, it is ignored
by vendors that don't support it.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/infiniband/ulp/srp/ib_srp.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 8481ad769ba4..0026660c3ead 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -436,7 +436,8 @@ static struct srp_fr_pool *srp_create_fr_pool(struct ib_device *device,
 		mr_type = IB_MR_TYPE_MEM_REG;
 
 	for (i = 0, d = &pool->desc[0]; i < pool->size; i++, d++) {
-		mr = ib_alloc_mr(pd, mr_type, max_page_list_len, 0);
+		mr = ib_alloc_mr(pd, mr_type, max_page_list_len,
+				 IB_ACCESS_RELAXED_ORDERING);
 		if (IS_ERR(mr)) {
 			ret = PTR_ERR(mr);
 			if (ret == -ENOMEM)
@@ -1487,9 +1488,8 @@ static int srp_map_finish_fr(struct srp_map_state *state,
 	wr.wr.send_flags = 0;
 	wr.mr = desc->mr;
 	wr.key = desc->mr->rkey;
-	wr.access = (IB_ACCESS_LOCAL_WRITE |
-		     IB_ACCESS_REMOTE_READ |
-		     IB_ACCESS_REMOTE_WRITE);
+	wr.access = (IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ |
+		     IB_ACCESS_REMOTE_WRITE | IB_ACCESS_RELAXED_ORDERING);
 
 	*state->fr.next++ = desc;
 	state->nmdesc++;
-- 
2.30.2


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH rdma-next 06/10] nvme-rdma: Enable Relaxed Ordering
  2021-04-05  5:23 [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs Leon Romanovsky
                   ` (4 preceding siblings ...)
  2021-04-05  5:23 ` [PATCH rdma-next 05/10] RDMA/srp: " Leon Romanovsky
@ 2021-04-05  5:24 ` Leon Romanovsky
  2021-04-05  5:24 ` [PATCH rdma-next 07/10] cifs: smbd: " Leon Romanovsky
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 56+ messages in thread
From: Leon Romanovsky @ 2021-04-05  5:24 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe

From: Avihai Horon <avihaih@nvidia.com>

Enable Relaxed Ordering for nvme.

Relaxed Ordering is an optional access flag and as such, it is ignored
by vendors that don't support it.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/nvme/host/rdma.c | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 4dbc17311e0b..8f106b20b39c 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -532,9 +532,8 @@ static int nvme_rdma_create_queue_ib(struct nvme_rdma_queue *queue)
 	 */
 	pages_per_mr = nvme_rdma_get_max_fr_pages(ibdev, queue->pi_support) + 1;
 	ret = ib_mr_pool_init(queue->qp, &queue->qp->rdma_mrs,
-			      queue->queue_size,
-			      IB_MR_TYPE_MEM_REG,
-			      pages_per_mr, 0, 0);
+			      queue->queue_size, IB_MR_TYPE_MEM_REG,
+			      pages_per_mr, 0, IB_ACCESS_RELAXED_ORDERING);
 	if (ret) {
 		dev_err(queue->ctrl->ctrl.device,
 			"failed to initialize MR pool sized %d for QID %d\n",
@@ -545,7 +544,8 @@ static int nvme_rdma_create_queue_ib(struct nvme_rdma_queue *queue)
 	if (queue->pi_support) {
 		ret = ib_mr_pool_init(queue->qp, &queue->qp->sig_mrs,
 				      queue->queue_size, IB_MR_TYPE_INTEGRITY,
-				      pages_per_mr, pages_per_mr, 0);
+				      pages_per_mr, pages_per_mr,
+				      IB_ACCESS_RELAXED_ORDERING);
 		if (ret) {
 			dev_err(queue->ctrl->ctrl.device,
 				"failed to initialize PI MR pool sized %d for QID %d\n",
@@ -1382,9 +1382,9 @@ static int nvme_rdma_map_sg_fr(struct nvme_rdma_queue *queue,
 	req->reg_wr.wr.num_sge = 0;
 	req->reg_wr.mr = req->mr;
 	req->reg_wr.key = req->mr->rkey;
-	req->reg_wr.access = IB_ACCESS_LOCAL_WRITE |
-			     IB_ACCESS_REMOTE_READ |
-			     IB_ACCESS_REMOTE_WRITE;
+	req->reg_wr.access = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ |
+			     IB_ACCESS_REMOTE_WRITE |
+			     IB_ACCESS_RELAXED_ORDERING;
 
 	sg->addr = cpu_to_le64(req->mr->iova);
 	put_unaligned_le24(req->mr->length, sg->length);
@@ -1488,9 +1488,8 @@ static int nvme_rdma_map_sg_pi(struct nvme_rdma_queue *queue,
 	wr->wr.send_flags = 0;
 	wr->mr = req->mr;
 	wr->key = req->mr->rkey;
-	wr->access = IB_ACCESS_LOCAL_WRITE |
-		     IB_ACCESS_REMOTE_READ |
-		     IB_ACCESS_REMOTE_WRITE;
+	wr->access = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ |
+		     IB_ACCESS_REMOTE_WRITE | IB_ACCESS_RELAXED_ORDERING;
 
 	sg->addr = cpu_to_le64(req->mr->iova);
 	put_unaligned_le24(req->mr->length, sg->length);
-- 
2.30.2



* [PATCH rdma-next 07/10] cifs: smbd: Enable Relaxed Ordering
  2021-04-05  5:23 [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs Leon Romanovsky
                   ` (5 preceding siblings ...)
  2021-04-05  5:24 ` [PATCH rdma-next 06/10] nvme-rdma: " Leon Romanovsky
@ 2021-04-05  5:24 ` Leon Romanovsky
  2021-04-05  5:24 ` [PATCH rdma-next 08/10] net/rds: " Leon Romanovsky
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 56+ messages in thread
From: Leon Romanovsky @ 2021-04-05  5:24 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe

From: Avihai Horon <avihaih@nvidia.com>

Enable Relaxed Ordering for smbd.

Relaxed Ordering is an optional access flag and as such, it is ignored
by vendors that don't support it.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 fs/cifs/smbdirect.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
index 647098a5cf3b..1e86dc8bbe85 100644
--- a/fs/cifs/smbdirect.c
+++ b/fs/cifs/smbdirect.c
@@ -2178,8 +2178,10 @@ static void smbd_mr_recovery_work(struct work_struct *work)
 				continue;
 			}
 
-			smbdirect_mr->mr = ib_alloc_mr(info->pd, info->mr_type,
-						       info->max_frmr_depth, 0);
+			smbdirect_mr->mr =
+				ib_alloc_mr(info->pd, info->mr_type,
+					    info->max_frmr_depth,
+					    IB_ACCESS_RELAXED_ORDERING);
 			if (IS_ERR(smbdirect_mr->mr)) {
 				log_rdma_mr(ERR, "ib_alloc_mr failed mr_type=%x max_frmr_depth=%x\n",
 					    info->mr_type,
@@ -2244,7 +2246,8 @@ static int allocate_mr_list(struct smbd_connection *info)
 		if (!smbdirect_mr)
 			goto out;
 		smbdirect_mr->mr = ib_alloc_mr(info->pd, info->mr_type,
-					       info->max_frmr_depth, 0);
+					       info->max_frmr_depth,
+					       IB_ACCESS_RELAXED_ORDERING);
 		if (IS_ERR(smbdirect_mr->mr)) {
 			log_rdma_mr(ERR, "ib_alloc_mr failed mr_type=%x max_frmr_depth=%x\n",
 				    info->mr_type, info->max_frmr_depth);
@@ -2406,9 +2409,10 @@ struct smbd_mr *smbd_register_mr(
 	reg_wr->wr.send_flags = IB_SEND_SIGNALED;
 	reg_wr->mr = smbdirect_mr->mr;
 	reg_wr->key = smbdirect_mr->mr->rkey;
-	reg_wr->access = writing ?
-			IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
-			IB_ACCESS_REMOTE_READ;
+	reg_wr->access =
+		(writing ? IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
+			   IB_ACCESS_REMOTE_READ) |
+		IB_ACCESS_RELAXED_ORDERING;
 
 	/*
 	 * There is no need for waiting for complemtion on ib_post_send
-- 
2.30.2



* [PATCH rdma-next 08/10] net/rds: Enable Relaxed Ordering
  2021-04-05  5:23 [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs Leon Romanovsky
                   ` (6 preceding siblings ...)
  2021-04-05  5:24 ` [PATCH rdma-next 07/10] cifs: smbd: " Leon Romanovsky
@ 2021-04-05  5:24 ` Leon Romanovsky
  2021-04-05  5:24 ` [PATCH rdma-next 09/10] net/smc: " Leon Romanovsky
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 56+ messages in thread
From: Leon Romanovsky @ 2021-04-05  5:24 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe

From: Avihai Horon <avihaih@nvidia.com>

Enable Relaxed Ordering for rds.

Relaxed Ordering is an optional access flag and as such, it is ignored
by vendors that don't support it.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 net/rds/ib_frmr.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/net/rds/ib_frmr.c b/net/rds/ib_frmr.c
index 694eb916319e..1a60c2c90c78 100644
--- a/net/rds/ib_frmr.c
+++ b/net/rds/ib_frmr.c
@@ -76,7 +76,7 @@ static struct rds_ib_mr *rds_ib_alloc_frmr(struct rds_ib_device *rds_ibdev,
 
 	frmr = &ibmr->u.frmr;
 	frmr->mr = ib_alloc_mr(rds_ibdev->pd, IB_MR_TYPE_MEM_REG,
-			       pool->max_pages, 0);
+			       pool->max_pages, IB_ACCESS_RELAXED_ORDERING);
 	if (IS_ERR(frmr->mr)) {
 		pr_warn("RDS/IB: %s failed to allocate MR", __func__);
 		err = PTR_ERR(frmr->mr);
@@ -156,9 +156,8 @@ static int rds_ib_post_reg_frmr(struct rds_ib_mr *ibmr)
 	reg_wr.wr.num_sge = 0;
 	reg_wr.mr = frmr->mr;
 	reg_wr.key = frmr->mr->rkey;
-	reg_wr.access = IB_ACCESS_LOCAL_WRITE |
-			IB_ACCESS_REMOTE_READ |
-			IB_ACCESS_REMOTE_WRITE;
+	reg_wr.access = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ |
+			IB_ACCESS_REMOTE_WRITE | IB_ACCESS_RELAXED_ORDERING;
 	reg_wr.wr.send_flags = IB_SEND_SIGNALED;
 
 	ret = ib_post_send(ibmr->ic->i_cm_id->qp, &reg_wr.wr, NULL);
-- 
2.30.2



* [PATCH rdma-next 09/10] net/smc: Enable Relaxed Ordering
  2021-04-05  5:23 [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs Leon Romanovsky
                   ` (7 preceding siblings ...)
  2021-04-05  5:24 ` [PATCH rdma-next 08/10] net/rds: " Leon Romanovsky
@ 2021-04-05  5:24 ` Leon Romanovsky
  2021-04-05  5:24 ` [PATCH rdma-next 10/10] xprtrdma: " Leon Romanovsky
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 56+ messages in thread
From: Leon Romanovsky @ 2021-04-05  5:24 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe

From: Avihai Horon <avihaih@nvidia.com>

Enable Relaxed Ordering for smc.

Relaxed Ordering is an optional access flag and as such, it is ignored
by vendors that don't support it.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 net/smc/smc_ib.c | 3 ++-
 net/smc/smc_wr.c | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c
index 4e91ed3dc265..6b65c5d1f957 100644
--- a/net/smc/smc_ib.c
+++ b/net/smc/smc_ib.c
@@ -579,7 +579,8 @@ int smc_ib_get_memory_region(struct ib_pd *pd, int access_flags,
 		return 0; /* already done */
 
 	buf_slot->mr_rx[link_idx] =
-		ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, 1 << buf_slot->order, 0);
+		ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, 1 << buf_slot->order,
+			    IB_ACCESS_RELAXED_ORDERING);
 	if (IS_ERR(buf_slot->mr_rx[link_idx])) {
 		int rc;
 
diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c
index cbc73a7e4d59..78e9650621f1 100644
--- a/net/smc/smc_wr.c
+++ b/net/smc/smc_wr.c
@@ -555,7 +555,8 @@ static void smc_wr_init_sge(struct smc_link *lnk)
 	lnk->wr_reg.wr.num_sge = 0;
 	lnk->wr_reg.wr.send_flags = IB_SEND_SIGNALED;
 	lnk->wr_reg.wr.opcode = IB_WR_REG_MR;
-	lnk->wr_reg.access = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE;
+	lnk->wr_reg.access = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE |
+			     IB_ACCESS_RELAXED_ORDERING;
 }
 
 void smc_wr_free_link(struct smc_link *lnk)
-- 
2.30.2



* [PATCH rdma-next 10/10] xprtrdma: Enable Relaxed Ordering
  2021-04-05  5:23 [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs Leon Romanovsky
                   ` (8 preceding siblings ...)
  2021-04-05  5:24 ` [PATCH rdma-next 09/10] net/smc: " Leon Romanovsky
@ 2021-04-05  5:24 ` Leon Romanovsky
  2021-04-05 13:41 ` [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs Christoph Hellwig
  2021-04-06  2:37 ` Honggang LI
  11 siblings, 0 replies; 56+ messages in thread
From: Leon Romanovsky @ 2021-04-05  5:24 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe

From: Avihai Horon <avihaih@nvidia.com>

Enable Relaxed Ordering for xprtrdma.

Relaxed Ordering is an optional access flag and as such, it is ignored
by vendors that don't support it.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 net/sunrpc/xprtrdma/frwr_ops.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index cfbdd197cdfe..f9334c0a1a13 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -135,7 +135,8 @@ int frwr_mr_init(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr *mr)
 	struct ib_mr *frmr;
 	int rc;
 
-	frmr = ib_alloc_mr(ep->re_pd, ep->re_mrtype, depth, 0);
+	frmr = ib_alloc_mr(ep->re_pd, ep->re_mrtype, depth,
+			   IB_ACCESS_RELAXED_ORDERING);
 	if (IS_ERR(frmr))
 		goto out_mr_err;
 
@@ -339,9 +340,10 @@ struct rpcrdma_mr_seg *frwr_map(struct rpcrdma_xprt *r_xprt,
 	reg_wr = &mr->frwr.fr_regwr;
 	reg_wr->mr = ibmr;
 	reg_wr->key = ibmr->rkey;
-	reg_wr->access = writing ?
-			 IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
-			 IB_ACCESS_REMOTE_READ;
+	reg_wr->access =
+		(writing ? IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
+			   IB_ACCESS_REMOTE_READ) |
+		IB_ACCESS_RELAXED_ORDERING;
 
 	mr->mr_handle = ibmr->rkey;
 	mr->mr_length = ibmr->length;
-- 
2.30.2



* Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs
  2021-04-05  5:23 [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs Leon Romanovsky
                   ` (9 preceding siblings ...)
  2021-04-05  5:24 ` [PATCH rdma-next 10/10] xprtrdma: " Leon Romanovsky
@ 2021-04-05 13:41 ` Christoph Hellwig
  2021-04-05 14:08   ` Leon Romanovsky
  2021-04-05 20:07   ` Jason Gunthorpe
  2021-04-06  2:37 ` Honggang LI
  11 siblings, 2 replies; 56+ messages in thread
From: Christoph Hellwig @ 2021-04-05 13:41 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Doug Ledford, Jason Gunthorpe, Leon Romanovsky, Adit Ranadive,
	Anna Schumaker, Ariel Elior, Avihai Horon, Bart Van Assche,
	Bernard Metzler, Christoph Hellwig, Chuck Lever, David S. Miller,
	Dennis Dalessandro, Devesh Sharma, Faisal Latif, Jack Wang,
	Jakub Kicinski, J. Bruce Fields, Jens Axboe, Karsten Graul,
	Keith Busch, Lijun Ou, linux-cifs, linux-kernel, linux-nfs,
	linux-nvme, linux-rdma, linux-s390, Max Gurtovoy, Max Gurtovoy,
	Md. Haris Iqbal, Michael Guralnik, Michal Kalderon,
	Mike Marciniszyn, Naresh Kumar PBS, netdev, Potnuri Bharat Teja,
	rds-devel, Sagi Grimberg, samba-technical, Santosh Shilimkar,
	Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun

On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> From Avihai,
> 
> Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
> imposed on PCI transactions, and thus, can improve performance.
> 
> Until now, relaxed ordering could be set only by user space applications
> for user MRs. The following patch series enables relaxed ordering for the
> kernel ULPs as well. Relaxed ordering is an optional capability, and as
> such, it is ignored by vendors that don't support it.
> 
> The following test results show the performance improvement achieved
> with relaxed ordering. The test was performed on a NVIDIA A100 in order
> to check performance of storage infrastructure over xprtrdma:

Isn't the Nvidia A100 a GPU not actually supported by Linux at all?
What does that have to do with storage protocols?

Also if you enable this for basically all kernel ULPs, why not have
an opt-out into strict ordering for the cases that need it (if there are
any).


* Re: [PATCH rdma-next 01/10] RDMA: Add access flags to ib_alloc_mr() and ib_mr_pool_init()
  2021-04-05  5:23 ` [PATCH rdma-next 01/10] RDMA: Add access flags to ib_alloc_mr() and ib_mr_pool_init() Leon Romanovsky
@ 2021-04-05 13:46   ` Christoph Hellwig
  2021-04-06  5:24     ` Leon Romanovsky
  2021-04-05 15:27   ` Bart Van Assche
  1 sibling, 1 reply; 56+ messages in thread
From: Christoph Hellwig @ 2021-04-05 13:46 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Doug Ledford, Jason Gunthorpe, Avihai Horon, Adit Ranadive,
	Anna Schumaker, Ariel Elior, Bart Van Assche, Bernard Metzler,
	Christoph Hellwig, Chuck Lever, David S. Miller,
	Dennis Dalessandro, Devesh Sharma, Faisal Latif, Jack Wang,
	Jakub Kicinski, J. Bruce Fields, Jens Axboe, Karsten Graul,
	Keith Busch, Lijun Ou, linux-cifs, linux-kernel, linux-nfs,
	linux-nvme, linux-rdma, linux-s390, Max Gurtovoy, Max Gurtovoy,
	Md. Haris Iqbal, Michael Guralnik, Michal Kalderon,
	Mike Marciniszyn, Naresh Kumar PBS, netdev, Potnuri Bharat Teja,
	rds-devel, Sagi Grimberg, samba-technical, Santosh Shilimkar,
	Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun

On Mon, Apr 05, 2021 at 08:23:55AM +0300, Leon Romanovsky wrote:
> From: Avihai Horon <avihaih@nvidia.com>
> 
> Add access flags parameter to ib_alloc_mr() and to ib_mr_pool_init(),
> and refactor relevant code. This parameter is used to pass MR access
> flags during MR allocation.
> 
> In the following patches, the new access flags parameter will be used
> to enable Relaxed Ordering for ib_alloc_mr() and ib_mr_pool_init() users.

So this wires up a new RELAXED_ORDERING flag without ever mentioning
that flag in the commit log, never mind explaining what it actually does.


* Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs
  2021-04-05 13:41 ` [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs Christoph Hellwig
@ 2021-04-05 14:08   ` Leon Romanovsky
  2021-04-05 16:11     ` Santosh Shilimkar
  2021-04-05 17:54     ` Tom Talpey
  2021-04-05 20:07   ` Jason Gunthorpe
  1 sibling, 2 replies; 56+ messages in thread
From: Leon Romanovsky @ 2021-04-05 14:08 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Doug Ledford, Jason Gunthorpe, Adit Ranadive, Anna Schumaker,
	Ariel Elior, Avihai Horon, Bart Van Assche, Bernard Metzler,
	Chuck Lever, David S. Miller, Dennis Dalessandro, Devesh Sharma,
	Faisal Latif, Jack Wang, Jakub Kicinski, J. Bruce Fields,
	Jens Axboe, Karsten Graul, Keith Busch, Lijun Ou, linux-cifs,
	linux-kernel, linux-nfs, linux-nvme, linux-rdma, linux-s390,
	Max Gurtovoy, Max Gurtovoy, Md. Haris Iqbal, Michael Guralnik,
	Michal Kalderon, Mike Marciniszyn, Naresh Kumar PBS, netdev,
	Potnuri Bharat Teja, rds-devel, Sagi Grimberg, samba-technical,
	Santosh Shilimkar, Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun

On Mon, Apr 05, 2021 at 03:41:15PM +0200, Christoph Hellwig wrote:
> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> > 
> > From Avihai,
> > 
> > Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
> > imposed on PCI transactions, and thus, can improve performance.
> > 
> > Until now, relaxed ordering could be set only by user space applications
> > for user MRs. The following patch series enables relaxed ordering for the
> > kernel ULPs as well. Relaxed ordering is an optional capability, and as
> > such, it is ignored by vendors that don't support it.
> > 
> > The following test results show the performance improvement achieved
> > with relaxed ordering. The test was performed on a NVIDIA A100 in order
> > to check performance of storage infrastructure over xprtrdma:
> 
> Isn't the Nvidia A100 a GPU not actually supported by Linux at all?
> What does that have to do with storage protocols?

This system is in use by a storage-oriented customer of ours, who performed the
test. He runs the drivers/infiniband/* stack from upstream, simply backported
to a specific kernel version.

The performance boost is seen on other systems too.

> 
> Also if you enable this for basically all kernel ULPs, why not have
> an opt-out into strict ordering for the cases that need it (if there are
> any).

The RO property is optional; it can only improve performance. In addition, no
in-kernel ULP needs strict ordering. I may be mistaken here, and Jason will
correct me, but it comes down to two things: the ULP doesn't touch the data
before the CQE arrives, and the DMA API prohibits it anyway.

Thanks


* Re: [PATCH rdma-next 01/10] RDMA: Add access flags to ib_alloc_mr() and ib_mr_pool_init()
  2021-04-05  5:23 ` [PATCH rdma-next 01/10] RDMA: Add access flags to ib_alloc_mr() and ib_mr_pool_init() Leon Romanovsky
  2021-04-05 13:46   ` Christoph Hellwig
@ 2021-04-05 15:27   ` Bart Van Assche
  2021-04-06  5:23     ` Leon Romanovsky
  1 sibling, 1 reply; 56+ messages in thread
From: Bart Van Assche @ 2021-04-05 15:27 UTC (permalink / raw)
  To: Leon Romanovsky, Doug Ledford, Jason Gunthorpe, Christoph Hellwig
  Cc: Avihai Horon, Adit Ranadive, Anna Schumaker, Ariel Elior,
	Bernard Metzler, Christoph Hellwig, Chuck Lever, David S. Miller,
	Dennis Dalessandro, Devesh Sharma, Faisal Latif, Jack Wang,
	Jakub Kicinski, J. Bruce Fields, Jens Axboe, Karsten Graul,
	Keith Busch, Lijun Ou, linux-cifs, linux-kernel, linux-nfs,
	linux-nvme, linux-rdma, linux-s390, Max Gurtovoy, Max Gurtovoy,
	Md. Haris Iqbal, Michael Guralnik, Michal Kalderon,
	Mike Marciniszyn, Naresh Kumar PBS, netdev, Potnuri Bharat Teja,
	rds-devel, Sagi Grimberg, samba-technical, Santosh Shilimkar,
	Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun

On 4/4/21 10:23 PM, Leon Romanovsky wrote:
> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> index bed4cfe50554..59138174affa 100644
> --- a/include/rdma/ib_verbs.h
> +++ b/include/rdma/ib_verbs.h
> @@ -2444,10 +2444,10 @@ struct ib_device_ops {
>  				       struct ib_udata *udata);
>  	int (*dereg_mr)(struct ib_mr *mr, struct ib_udata *udata);
>  	struct ib_mr *(*alloc_mr)(struct ib_pd *pd, enum ib_mr_type mr_type,
> -				  u32 max_num_sg);
> +				  u32 max_num_sg, u32 access);
>  	struct ib_mr *(*alloc_mr_integrity)(struct ib_pd *pd,
>  					    u32 max_num_data_sg,
> -					    u32 max_num_meta_sg);
> +					    u32 max_num_meta_sg, u32 access);
>  	int (*advise_mr)(struct ib_pd *pd,
>  			 enum ib_uverbs_advise_mr_advice advice, u32 flags,
>  			 struct ib_sge *sg_list, u32 num_sge,
> @@ -4142,11 +4142,10 @@ static inline int ib_dereg_mr(struct ib_mr *mr)
>  }
>  
>  struct ib_mr *ib_alloc_mr(struct ib_pd *pd, enum ib_mr_type mr_type,
> -			  u32 max_num_sg);
> +			  u32 max_num_sg, u32 access);
>  
> -struct ib_mr *ib_alloc_mr_integrity(struct ib_pd *pd,
> -				    u32 max_num_data_sg,
> -				    u32 max_num_meta_sg);
> +struct ib_mr *ib_alloc_mr_integrity(struct ib_pd *pd, u32 max_num_data_sg,
> +				    u32 max_num_meta_sg, u32 access);
>  
>  /**
>   * ib_update_fast_reg_key - updates the key portion of the fast_reg MR
> diff --git a/include/rdma/mr_pool.h b/include/rdma/mr_pool.h
> index e77123bcb43b..2a0ee791037d 100644
> --- a/include/rdma/mr_pool.h
> +++ b/include/rdma/mr_pool.h
> @@ -11,7 +11,8 @@ struct ib_mr *ib_mr_pool_get(struct ib_qp *qp, struct list_head *list);
>  void ib_mr_pool_put(struct ib_qp *qp, struct list_head *list, struct ib_mr *mr);
>  
>  int ib_mr_pool_init(struct ib_qp *qp, struct list_head *list, int nr,
> -		enum ib_mr_type type, u32 max_num_sg, u32 max_num_meta_sg);
> +		    enum ib_mr_type type, u32 max_num_sg, u32 max_num_meta_sg,
> +		    u32 access);
>  void ib_mr_pool_destroy(struct ib_qp *qp, struct list_head *list);
>  
>  #endif /* _RDMA_MR_POOL_H */

Does the new 'access' argument only control whether or not PCIe relaxed
ordering is enabled? It seems wrong to me to make enabling of PCIe
relaxed ordering configurable. I think this mechanism should be enabled
unconditionally if the HCA supports it.

Thanks,

Bart.


* Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs
  2021-04-05 14:08   ` Leon Romanovsky
@ 2021-04-05 16:11     ` Santosh Shilimkar
  2021-04-05 17:54     ` Tom Talpey
  1 sibling, 0 replies; 56+ messages in thread
From: Santosh Shilimkar @ 2021-04-05 16:11 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Christoph Hellwig, Doug Ledford, Jason Gunthorpe, Adit Ranadive,
	Anna Schumaker, Ariel Elior, Avihai Horon, Bart Van Assche,
	Bernard Metzler, Chuck Lever III, David S. Miller,
	Dennis Dalessandro, Devesh Sharma, Faisal Latif, Jack Wang,
	Jakub Kicinski, J. Bruce Fields, Jens Axboe, Karsten Graul,
	Keith Busch, Lijun Ou, linux-cifs, linux-kernel, linux-nfs,
	linux-nvme, linux-rdma, linux-s390, Max Gurtovoy, Max Gurtovoy,
	Md. Haris Iqbal, Michael Guralnik, Michal Kalderon,
	Mike Marciniszyn, Naresh Kumar PBS, netdev, Potnuri Bharat Teja,
	rds-devel, Sagi Grimberg, samba-technical, Selvin Xavier,
	Shiraz Saleem, Somnath Kotur, Sriharsha Basavapatna,
	Steve French, Trond Myklebust, VMware PV-Drivers, Weihang Li,
	Yishai Hadas, Zhu Yanjun


> On Apr 5, 2021, at 7:08 AM, Leon Romanovsky <leon@kernel.org> wrote:
> 
> On Mon, Apr 05, 2021 at 03:41:15PM +0200, Christoph Hellwig wrote:
>> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
>>> From: Leon Romanovsky <leonro@nvidia.com>
>>> 
>>> From Avihai,
>>> 
>>> Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
>>> imposed on PCI transactions, and thus, can improve performance.
>>> 
>>> Until now, relaxed ordering could be set only by user space applications
>>> for user MRs. The following patch series enables relaxed ordering for the
>>> kernel ULPs as well. Relaxed ordering is an optional capability, and as
>>> such, it is ignored by vendors that don't support it.
>>> 
>>> The following test results show the performance improvement achieved
>>> with relaxed ordering. The test was performed on a NVIDIA A100 in order
>>> to check performance of storage infrastructure over xprtrdma:
>> 
>> Isn't the Nvidia A100 a GPU not actually supported by Linux at all?
>> What does that have to do with storage protocols?
> 
> This system is in use by our storage oriented customer who performed the
> test. He runs drivers/infiniband/* stack from the upstream, simply backported
> to specific kernel version.
> 
> The performance boost is seen in other systems too.
> 
>> 
>> Also if you enable this for basically all kernel ULPs, why not have
>> an opt-out into strict ordering for the cases that need it (if there are
>> any).
> 
> The RO property is optional, it can only improve. In addition, all in-kernel ULPs
> don't need strict ordering. I can be mistaken here and Jason will correct me, it
> is because of two things: ULP doesn't touch data before CQE and DMA API prohibits it.
> 

Given that in-kernel ULPs don't need strict ordering, why don't you enable this
by default for HCAs that support it, instead of patching every ULP? Some future
ULP may forget to specify the flag.

Regards,
Santosh



* Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs
  2021-04-05 14:08   ` Leon Romanovsky
  2021-04-05 16:11     ` Santosh Shilimkar
@ 2021-04-05 17:54     ` Tom Talpey
  1 sibling, 0 replies; 56+ messages in thread
From: Tom Talpey @ 2021-04-05 17:54 UTC (permalink / raw)
  To: Leon Romanovsky, Christoph Hellwig
  Cc: Doug Ledford, Jason Gunthorpe, Adit Ranadive, Anna Schumaker,
	Ariel Elior, Avihai Horon, Bart Van Assche, Bernard Metzler,
	Chuck Lever, David S. Miller, Dennis Dalessandro, Devesh Sharma,
	Faisal Latif, Jack Wang, Jakub Kicinski, J. Bruce Fields,
	Jens Axboe, Karsten Graul, Keith Busch, Lijun Ou, linux-cifs,
	linux-kernel, linux-nfs, linux-nvme, linux-rdma, linux-s390,
	Max Gurtovoy, Max Gurtovoy, Md. Haris Iqbal, Michael Guralnik,
	Michal Kalderon, Mike Marciniszyn, Naresh Kumar PBS, netdev,
	Potnuri Bharat Teja, rds-devel, Sagi Grimberg, samba-technical,
	Santosh Shilimkar, Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun

On 4/5/2021 10:08 AM, Leon Romanovsky wrote:
> On Mon, Apr 05, 2021 at 03:41:15PM +0200, Christoph Hellwig wrote:
>> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
>>> From: Leon Romanovsky <leonro@nvidia.com>
>>>
>>> From Avihai,
>>>
>>> Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
>>> imposed on PCI transactions, and thus, can improve performance.
>>>
>>> Until now, relaxed ordering could be set only by user space applications
>>> for user MRs. The following patch series enables relaxed ordering for the
>>> kernel ULPs as well. Relaxed ordering is an optional capability, and as
>>> such, it is ignored by vendors that don't support it.
>>>
>>> The following test results show the performance improvement achieved
>>> with relaxed ordering. The test was performed on a NVIDIA A100 in order
>>> to check performance of storage infrastructure over xprtrdma:
>>
>> Isn't the Nvidia A100 a GPU not actually supported by Linux at all?
>> What does that have to do with storage protocols?
> 
> This system is in use by our storage oriented customer who performed the
> test. He runs drivers/infiniband/* stack from the upstream, simply backported
> to specific kernel version.
> 
> The performance boost is seen in other systems too.

We need to see more information about this test, and platform.

What correctness testing was done, and how was it verified? What
PCI bus type(s) were tested, and with what adapters? What storage
workload was generated, and were all possible RDMA exchanges by
each ULP exercised?

>> Also if you enable this for basically all kernel ULPs, why not have
>> an opt-out into strict ordering for the cases that need it (if there are
>> any).
> 
> The RO property is optional; it can only improve performance. In addition, all in-kernel
> ULPs don't need strict ordering. I may be mistaken here (Jason will correct me), but it
> is because of two things: the ULP doesn't touch the data before the CQE, and the DMA API prohibits it.

+1 on Christoph's comment.

I would hope most well-architected ULPs will support relaxed ordering,
but storage workloads, in my experience, can find ways to cause failures
in adapters. I would not suggest making this the default behavior
without extensive testing.

Tom.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH rdma-next 02/10] RDMA/core: Enable Relaxed Ordering in __ib_alloc_pd()
  2021-04-05  5:23 ` [PATCH rdma-next 02/10] RDMA/core: Enable Relaxed Ordering in __ib_alloc_pd() Leon Romanovsky
@ 2021-04-05 18:01   ` Tom Talpey
  2021-04-05 20:40     ` Adit Ranadive
  2021-04-06  6:28     ` Leon Romanovsky
  0 siblings, 2 replies; 56+ messages in thread
From: Tom Talpey @ 2021-04-05 18:01 UTC (permalink / raw)
  To: Leon Romanovsky, Doug Ledford, Jason Gunthorpe
  Cc: Avihai Horon, Adit Ranadive, Anna Schumaker, Ariel Elior,
	Bart Van Assche, Bernard Metzler, Christoph Hellwig, Chuck Lever,
	David S. Miller, Dennis Dalessandro, Devesh Sharma, Faisal Latif,
	Jack Wang, Jakub Kicinski, J. Bruce Fields, Jens Axboe,
	Karsten Graul, Keith Busch, Lijun Ou, linux-cifs, linux-kernel,
	linux-nfs, linux-nvme, linux-rdma, linux-s390, Max Gurtovoy,
	Max Gurtovoy, Md. Haris Iqbal, Michael Guralnik, Michal Kalderon,
	Mike Marciniszyn, Naresh Kumar PBS, netdev, Potnuri Bharat Teja,
	rds-devel, Sagi Grimberg, samba-technical, Santosh Shilimkar,
	Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun

On 4/5/2021 1:23 AM, Leon Romanovsky wrote:
> From: Avihai Horon <avihaih@nvidia.com>
> 
> Enable Relaxed Ordering in __ib_alloc_pd() allocation of the
> local_dma_lkey.
> 
> This will take effect only for devices that don't pre-allocate the lkey
> but allocate it per PD allocation.
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>   drivers/infiniband/core/verbs.c              | 3 ++-
>   drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c | 1 +
>   2 files changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
> index a1782f8a6ca0..9b719f7d6fd5 100644
> --- a/drivers/infiniband/core/verbs.c
> +++ b/drivers/infiniband/core/verbs.c
> @@ -287,7 +287,8 @@ struct ib_pd *__ib_alloc_pd(struct ib_device *device, unsigned int flags,
>   	if (device->attrs.device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY)
>   		pd->local_dma_lkey = device->local_dma_lkey;
>   	else
> -		mr_access_flags |= IB_ACCESS_LOCAL_WRITE;
> +		mr_access_flags |=
> +			IB_ACCESS_LOCAL_WRITE | IB_ACCESS_RELAXED_ORDERING;

So, do local_dma_lkey's get relaxed ordering unconditionally?

>   	if (flags & IB_PD_UNSAFE_GLOBAL_RKEY) {
>   		pr_warn("%s: enabling unsafe global rkey\n", caller);
> diff --git a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c
> index b3fa783698a0..d74827694f92 100644
> --- a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c
> +++ b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c
> @@ -66,6 +66,7 @@ struct ib_mr *pvrdma_get_dma_mr(struct ib_pd *pd, int acc)
>   	int ret;
>   
>   	/* Support only LOCAL_WRITE flag for DMA MRs */
> +	acc &= ~IB_ACCESS_RELAXED_ORDERING;
>   	if (acc & ~IB_ACCESS_LOCAL_WRITE) {
>   		dev_warn(&dev->pdev->dev,
>   			 "unsupported dma mr access flags %#x\n", acc);

Why does the pvrdma driver require relaxed ordering to be off?

Tom.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs
  2021-04-05 13:41 ` [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs Christoph Hellwig
  2021-04-05 14:08   ` Leon Romanovsky
@ 2021-04-05 20:07   ` Jason Gunthorpe
  2021-04-05 23:42     ` Chuck Lever III
  1 sibling, 1 reply; 56+ messages in thread
From: Jason Gunthorpe @ 2021-04-05 20:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Leon Romanovsky, Doug Ledford, Leon Romanovsky, Adit Ranadive,
	Anna Schumaker, Ariel Elior, Avihai Horon, Bart Van Assche,
	Bernard Metzler, Chuck Lever, David S. Miller,
	Dennis Dalessandro, Devesh Sharma, Faisal Latif, Jack Wang,
	Jakub Kicinski, J. Bruce Fields, Jens Axboe, Karsten Graul,
	Keith Busch, Lijun Ou, linux-cifs, linux-kernel, linux-nfs,
	linux-nvme, linux-rdma, linux-s390, Max Gurtovoy, Max Gurtovoy,
	Md. Haris Iqbal, Michael Guralnik, Michal Kalderon,
	Mike Marciniszyn, Naresh Kumar PBS, netdev, Potnuri Bharat Teja,
	rds-devel, Sagi Grimberg, samba-technical, Santosh Shilimkar,
	Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun

On Mon, Apr 05, 2021 at 03:41:15PM +0200, Christoph Hellwig wrote:
> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> > 
> > From Avihai,
> > 
> > Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
> > imposed on PCI transactions, and thus, can improve performance.
> > 
> > Until now, relaxed ordering could be set only by user space applications
> > for user MRs. The following patch series enables relaxed ordering for the
> > kernel ULPs as well. Relaxed ordering is an optional capability, and as
> > such, it is ignored by vendors that don't support it.
> > 
> > The following test results show the performance improvement achieved
> > with relaxed ordering. The test was performed on a NVIDIA A100 in order
> > to check performance of storage infrastructure over xprtrdma:
> 
> Isn't the Nvidia A100 a GPU not actually supported by Linux at all?
> What does that have to do with storage protocols?

I think it is a typo (or at least it makes no sense to be talking
about NFS with a GPU chip). Probably it should be a DGX A100, which is a
dual-socket AMD server with a lot of PCIe, and xprtrdma is an NFS-RDMA
workload.

AMD dual socket systems are well known to benefit from relaxed
ordering, people have been doing this in userspace for a while now
with the opt in.

What surprises me is the performance difference; I hadn't heard it
was 4x!

Jason

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH rdma-next 02/10] RDMA/core: Enable Relaxed Ordering in __ib_alloc_pd()
  2021-04-05 18:01   ` Tom Talpey
@ 2021-04-05 20:40     ` Adit Ranadive
  2021-04-06  6:28     ` Leon Romanovsky
  1 sibling, 0 replies; 56+ messages in thread
From: Adit Ranadive @ 2021-04-05 20:40 UTC (permalink / raw)
  To: Tom Talpey, Leon Romanovsky, Doug Ledford, Jason Gunthorpe
  Cc: Avihai Horon, Anna Schumaker, Ariel Elior, Bart Van Assche,
	Bernard Metzler, Christoph Hellwig, Chuck Lever, David S. Miller,
	Dennis Dalessandro, Devesh Sharma, Faisal Latif, Jack Wang,
	Jakub Kicinski, J. Bruce Fields, Jens Axboe, Karsten Graul,
	Keith Busch, Lijun Ou, linux-cifs, linux-kernel, linux-nfs,
	linux-nvme, linux-rdma, linux-s390, Max Gurtovoy, Max Gurtovoy,
	Md. Haris Iqbal, Michael Guralnik, Michal Kalderon,
	Mike Marciniszyn, Naresh Kumar PBS, netdev, Potnuri Bharat Teja,
	rds-devel, Sagi Grimberg, samba-technical, Santosh Shilimkar,
	Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun

On 4/5/21 11:01 AM, Tom Talpey wrote:
> On 4/5/2021 1:23 AM, Leon Romanovsky wrote:
>> From: Avihai Horon <avihaih@nvidia.com>
>>
>> Enable Relaxed Ordering in __ib_alloc_pd() allocation of the
>> local_dma_lkey.
>>
>> This will take effect only for devices that don't pre-allocate the lkey
>> but allocate it per PD allocation.
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
>> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
>> ---
>>   drivers/infiniband/core/verbs.c              | 3 ++-
>>   drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c | 1 +
>>   2 files changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
>> index a1782f8a6ca0..9b719f7d6fd5 100644
>> --- a/drivers/infiniband/core/verbs.c
>> +++ b/drivers/infiniband/core/verbs.c
>> @@ -287,7 +287,8 @@ struct ib_pd *__ib_alloc_pd(struct ib_device *device, unsigned int flags,
>>       if (device->attrs.device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY)
>>           pd->local_dma_lkey = device->local_dma_lkey;
>>       else
>> -        mr_access_flags |= IB_ACCESS_LOCAL_WRITE;
>> +        mr_access_flags |=
>> +            IB_ACCESS_LOCAL_WRITE | IB_ACCESS_RELAXED_ORDERING;
>
> So, do local_dma_lkey's get relaxed ordering unconditionally?
>
>>       if (flags & IB_PD_UNSAFE_GLOBAL_RKEY) {
>>           pr_warn("%s: enabling unsafe global rkey\n", caller);
>> diff --git a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c
>> index b3fa783698a0..d74827694f92 100644
>> --- a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c
>> +++ b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c
>> @@ -66,6 +66,7 @@ struct ib_mr *pvrdma_get_dma_mr(struct ib_pd *pd, int acc)
>>       int ret;
>>         /* Support only LOCAL_WRITE flag for DMA MRs */
>> +    acc &= ~IB_ACCESS_RELAXED_ORDERING;
>>       if (acc & ~IB_ACCESS_LOCAL_WRITE) {
>>           dev_warn(&dev->pdev->dev,
>>                "unsupported dma mr access flags %#x\n", acc);
>
> Why does the pvrdma driver require relaxed ordering to be off?

PVRDMA doesn't support any flags other than LOCAL_WRITE for
DMA MRs, so MR creation will fail if any new, unconditionally added
flag isn't cleared.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs
  2021-04-05 20:07   ` Jason Gunthorpe
@ 2021-04-05 23:42     ` Chuck Lever III
  2021-04-05 23:50       ` Keith Busch
                         ` (2 more replies)
  0 siblings, 3 replies; 56+ messages in thread
From: Chuck Lever III @ 2021-04-05 23:42 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Leon Romanovsky, Doug Ledford,
	Leon Romanovsky, Adit Ranadive, Anna Schumaker, Ariel Elior,
	Avihai Horon, Bart Van Assche, Bernard Metzler, David S. Miller,
	Dennis Dalessandro, Devesh Sharma, Faisal Latif, Jack Wang,
	Jakub Kicinski, Bruce Fields, Jens Axboe, Karsten Graul,
	Keith Busch, Lijun Ou, CIFS, LKML, Linux NFS Mailing List,
	linux-nvme, linux-rdma, linux-s390, Max Gurtovoy, Max Gurtovoy,
	Md. Haris Iqbal, Michael Guralnik, Michal Kalderon,
	Mike Marciniszyn, Naresh Kumar PBS, Linux-Net,
	Potnuri Bharat Teja, rds-devel, Sagi Grimberg, samba-technical,
	Santosh Shilimkar, Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun



> On Apr 5, 2021, at 4:07 PM, Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> On Mon, Apr 05, 2021 at 03:41:15PM +0200, Christoph Hellwig wrote:
>> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
>>> From: Leon Romanovsky <leonro@nvidia.com>
>>> 
> >>> From Avihai,
>>> 
>>> Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
>>> imposed on PCI transactions, and thus, can improve performance.
>>> 
>>> Until now, relaxed ordering could be set only by user space applications
>>> for user MRs. The following patch series enables relaxed ordering for the
>>> kernel ULPs as well. Relaxed ordering is an optional capability, and as
>>> such, it is ignored by vendors that don't support it.
>>> 
>>> The following test results show the performance improvement achieved
>>> with relaxed ordering. The test was performed on a NVIDIA A100 in order
>>> to check performance of storage infrastructure over xprtrdma:
>> 
>> Isn't the Nvidia A100 a GPU not actually supported by Linux at all?
>> What does that have to do with storage protocols?
> 
> I think it is a typo (or at least it makes no sense to be talking
> about NFS with a GPU chip). Probably it should be a DGX A100, which is a
> dual-socket AMD server with a lot of PCIe, and xprtrdma is an NFS-RDMA
> workload.

We need to get a better idea what correctness testing has been done,
and whether positive correctness testing results can be replicated
on a variety of platforms.

I have an old Haswell dual-socket system in my lab, but otherwise
I'm not sure I have a platform that would be interesting for such a
test.


> AMD dual socket systems are well known to benefit from relaxed
> ordering, people have been doing this in userspace for a while now
> with the opt in.


--
Chuck Lever




^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs
  2021-04-05 23:42     ` Chuck Lever III
@ 2021-04-05 23:50       ` Keith Busch
  2021-04-06  5:12       ` Leon Romanovsky
  2021-04-06 11:49       ` Jason Gunthorpe
  2 siblings, 0 replies; 56+ messages in thread
From: Keith Busch @ 2021-04-05 23:50 UTC (permalink / raw)
  To: Chuck Lever III
  Cc: Jason Gunthorpe, Christoph Hellwig, Leon Romanovsky,
	Doug Ledford, Leon Romanovsky, Adit Ranadive, Anna Schumaker,
	Ariel Elior, Avihai Horon, Bart Van Assche, Bernard Metzler,
	David S. Miller, Dennis Dalessandro, Devesh Sharma, Faisal Latif,
	Jack Wang, Jakub Kicinski, Bruce Fields, Jens Axboe,
	Karsten Graul, Lijun Ou, CIFS, LKML, Linux NFS Mailing List,
	linux-nvme, linux-rdma, linux-s390, Max Gurtovoy, Max Gurtovoy,
	Md. Haris Iqbal, Michael Guralnik, Michal Kalderon,
	Mike Marciniszyn, Naresh Kumar PBS, Linux-Net,
	Potnuri Bharat Teja, rds-devel, Sagi Grimberg, samba-technical,
	Santosh Shilimkar, Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun

On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
> > On Apr 5, 2021, at 4:07 PM, Jason Gunthorpe <jgg@nvidia.com> wrote:
> > On Mon, Apr 05, 2021 at 03:41:15PM +0200, Christoph Hellwig wrote:
> >> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
> >>> From: Leon Romanovsky <leonro@nvidia.com>
> >>> 
> >>> From Avihai,
> >>> 
> >>> Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
> >>> imposed on PCI transactions, and thus, can improve performance.
> >>> 
> >>> Until now, relaxed ordering could be set only by user space applications
> >>> for user MRs. The following patch series enables relaxed ordering for the
> >>> kernel ULPs as well. Relaxed ordering is an optional capability, and as
> >>> such, it is ignored by vendors that don't support it.
> >>> 
> >>> The following test results show the performance improvement achieved
> >>> with relaxed ordering. The test was performed on a NVIDIA A100 in order
> >>> to check performance of storage infrastructure over xprtrdma:
> >> 
> >> Isn't the Nvidia A100 a GPU not actually supported by Linux at all?
> >> What does that have to do with storage protocols?
> > 
> > I think it is a typo (or at least it makes no sense to be talking
> > about NFS with a GPU chip). Probably it should be a DGX A100, which is a
> > dual-socket AMD server with a lot of PCIe, and xprtrdma is an NFS-RDMA
> > workload.
> 
> We need to get a better idea what correctness testing has been done,
> and whether positive correctness testing results can be replicated
> on a variety of platforms.
> 
> I have an old Haswell dual-socket system in my lab, but otherwise
> I'm not sure I have a platform that would be interesting for such a
> test.

I'm not sure Haswell will be useful for such testing. It looks like many
of those platforms are subject to 'quirk_relaxedordering_disable'.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs
  2021-04-05  5:23 [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs Leon Romanovsky
                   ` (10 preceding siblings ...)
  2021-04-05 13:41 ` [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs Christoph Hellwig
@ 2021-04-06  2:37 ` Honggang LI
  2021-04-06  5:09   ` Leon Romanovsky
  11 siblings, 1 reply; 56+ messages in thread
From: Honggang LI @ 2021-04-06  2:37 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Doug Ledford, Jason Gunthorpe, Leon Romanovsky, Adit Ranadive,
	Anna Schumaker, Ariel Elior, Avihai Horon, Bart Van Assche,
	Bernard Metzler, Christoph Hellwig, Chuck Lever, David S. Miller,
	Dennis Dalessandro, Devesh Sharma, Faisal Latif, Jack Wang,
	Jakub Kicinski, J. Bruce Fields, Jens Axboe, Karsten Graul,
	Keith Busch, Lijun Ou, linux-cifs, linux-kernel, linux-nfs,
	linux-nvme, linux-rdma, linux-s390, Max Gurtovoy, Max Gurtovoy,
	Md. Haris Iqbal, Michael Guralnik, Michal Kalderon,
	Mike Marciniszyn, Naresh Kumar PBS, netdev, Potnuri Bharat Teja,
	rds-devel, Sagi Grimberg, samba-technical, Santosh Shilimkar,
	Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun

On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> From Avihai,
> 
> Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
> imposed on PCI transactions, and thus, can improve performance.
> 
> Until now, relaxed ordering could be set only by user space applications
> for user MRs. The following patch series enables relaxed ordering for the
> kernel ULPs as well. Relaxed ordering is an optional capability, and as
> such, it is ignored by vendors that don't support it.
> 
> The following test results show the performance improvement achieved

Did you test this patchset with a CPU that does not support relaxed ordering?

We observed significant performance degradation when running perftest with
relaxed ordering enabled on an old CPU.

https://github.com/linux-rdma/perftest/issues/116

thanks


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs
  2021-04-06  2:37 ` Honggang LI
@ 2021-04-06  5:09   ` Leon Romanovsky
  2021-04-06 11:53     ` Jason Gunthorpe
  0 siblings, 1 reply; 56+ messages in thread
From: Leon Romanovsky @ 2021-04-06  5:09 UTC (permalink / raw)
  To: Honggang LI
  Cc: Doug Ledford, Jason Gunthorpe, Adit Ranadive, Anna Schumaker,
	Ariel Elior, Avihai Horon, Bart Van Assche, Bernard Metzler,
	Christoph Hellwig, Chuck Lever, David S. Miller,
	Dennis Dalessandro, Devesh Sharma, Faisal Latif, Jack Wang,
	Jakub Kicinski, J. Bruce Fields, Jens Axboe, Karsten Graul,
	Keith Busch, Lijun Ou, linux-cifs, linux-kernel, linux-nfs,
	linux-nvme, linux-rdma, linux-s390, Max Gurtovoy, Max Gurtovoy,
	Md. Haris Iqbal, Michael Guralnik, Michal Kalderon,
	Mike Marciniszyn, Naresh Kumar PBS, netdev, Potnuri Bharat Teja,
	rds-devel, Sagi Grimberg, samba-technical, Santosh Shilimkar,
	Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun

On Tue, Apr 06, 2021 at 10:37:38AM +0800, Honggang LI wrote:
> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> > 
> > From Avihai,
> > 
> > Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
> > imposed on PCI transactions, and thus, can improve performance.
> > 
> > Until now, relaxed ordering could be set only by user space applications
> > for user MRs. The following patch series enables relaxed ordering for the
> > kernel ULPs as well. Relaxed ordering is an optional capability, and as
> > such, it is ignored by vendors that don't support it.
> > 
> > The following test results show the performance improvement achieved
> 
> Did you test this patchset with a CPU that does not support relaxed ordering?

I don't think so; the CPUs that don't support RO are Intel's fourth/fifth
generation, and they are not interesting from a performance point of view.

> 
> We observed significant performance degradation when running perftest with
> relaxed ordering enabled on an old CPU.
> 
> https://github.com/linux-rdma/perftest/issues/116

The perftest case is slightly different, but you raise a valid point.
We forgot to call pcie_relaxed_ordering_enabled() before setting the RO bit,
and arguably this needed to be done in perftest too.

Thanks

> 
> thanks
> 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs
  2021-04-05 23:42     ` Chuck Lever III
  2021-04-05 23:50       ` Keith Busch
@ 2021-04-06  5:12       ` Leon Romanovsky
  2021-04-06 11:49       ` Jason Gunthorpe
  2 siblings, 0 replies; 56+ messages in thread
From: Leon Romanovsky @ 2021-04-06  5:12 UTC (permalink / raw)
  To: Chuck Lever III
  Cc: Jason Gunthorpe, Christoph Hellwig, Doug Ledford, Adit Ranadive,
	Anna Schumaker, Ariel Elior, Avihai Horon, Bart Van Assche,
	Bernard Metzler, David S. Miller, Dennis Dalessandro,
	Devesh Sharma, Faisal Latif, Jack Wang, Jakub Kicinski,
	Bruce Fields, Jens Axboe, Karsten Graul, Keith Busch, Lijun Ou,
	CIFS, LKML, Linux NFS Mailing List, linux-nvme, linux-rdma,
	linux-s390, Max Gurtovoy, Max Gurtovoy, Md. Haris Iqbal,
	Michael Guralnik, Michal Kalderon, Mike Marciniszyn,
	Naresh Kumar PBS, Linux-Net, Potnuri Bharat Teja, rds-devel,
	Sagi Grimberg, samba-technical, Santosh Shilimkar, Selvin Xavier,
	Shiraz Saleem, Somnath Kotur, Sriharsha Basavapatna,
	Steve French, Trond Myklebust, VMware PV-Drivers, Weihang Li,
	Yishai Hadas, Zhu Yanjun

On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
> 
> 
> > On Apr 5, 2021, at 4:07 PM, Jason Gunthorpe <jgg@nvidia.com> wrote:
> > 
> > On Mon, Apr 05, 2021 at 03:41:15PM +0200, Christoph Hellwig wrote:
> >> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
> >>> From: Leon Romanovsky <leonro@nvidia.com>
> >>> 
> >>> From Avihai,
> >>> 
> >>> Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
> >>> imposed on PCI transactions, and thus, can improve performance.
> >>> 
> >>> Until now, relaxed ordering could be set only by user space applications
> >>> for user MRs. The following patch series enables relaxed ordering for the
> >>> kernel ULPs as well. Relaxed ordering is an optional capability, and as
> >>> such, it is ignored by vendors that don't support it.
> >>> 
> >>> The following test results show the performance improvement achieved
> >>> with relaxed ordering. The test was performed on a NVIDIA A100 in order
> >>> to check performance of storage infrastructure over xprtrdma:
> >> 
> >> Isn't the Nvidia A100 a GPU not actually supported by Linux at all?
> >> What does that have to do with storage protocols?
> > 
> > I think it is a typo (or at least it makes no sense to be talking
> > about NFS with a GPU chip). Probably it should be a DGX A100, which is a
> > dual-socket AMD server with a lot of PCIe, and xprtrdma is an NFS-RDMA
> > workload.
> 
> We need to get a better idea what correctness testing has been done,
> and whether positive correctness testing results can be replicated
> on a variety of platforms.

I will ask them to provide more details.

> 
> I have an old Haswell dual-socket system in my lab, but otherwise
> I'm not sure I have a platform that would be interesting for such a
> test.

We don't have such old systems either.

> 
> 
> > AMD dual socket systems are well known to benefit from relaxed
> > ordering, people have been doing this in userspace for a while now
> > with the opt in.
> 
> 
> --
> Chuck Lever
> 
> 
> 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH rdma-next 01/10] RDMA: Add access flags to ib_alloc_mr() and ib_mr_pool_init()
  2021-04-05 15:27   ` Bart Van Assche
@ 2021-04-06  5:23     ` Leon Romanovsky
  2021-04-06  5:27       ` Christoph Hellwig
  0 siblings, 1 reply; 56+ messages in thread
From: Leon Romanovsky @ 2021-04-06  5:23 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Doug Ledford, Jason Gunthorpe, Christoph Hellwig, Avihai Horon,
	Adit Ranadive, Anna Schumaker, Ariel Elior, Bernard Metzler,
	Chuck Lever, David S. Miller, Dennis Dalessandro, Devesh Sharma,
	Faisal Latif, Jack Wang, Jakub Kicinski, J. Bruce Fields,
	Jens Axboe, Karsten Graul, Keith Busch, Lijun Ou, linux-cifs,
	linux-kernel, linux-nfs, linux-nvme, linux-rdma, linux-s390,
	Max Gurtovoy, Max Gurtovoy, Md. Haris Iqbal, Michael Guralnik,
	Michal Kalderon, Mike Marciniszyn, Naresh Kumar PBS, netdev,
	Potnuri Bharat Teja, rds-devel, Sagi Grimberg, samba-technical,
	Santosh Shilimkar, Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun

On Mon, Apr 05, 2021 at 08:27:16AM -0700, Bart Van Assche wrote:
> On 4/4/21 10:23 PM, Leon Romanovsky wrote:
> > diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> > index bed4cfe50554..59138174affa 100644
> > --- a/include/rdma/ib_verbs.h
> > +++ b/include/rdma/ib_verbs.h
> > @@ -2444,10 +2444,10 @@ struct ib_device_ops {
> >  				       struct ib_udata *udata);
> >  	int (*dereg_mr)(struct ib_mr *mr, struct ib_udata *udata);
> >  	struct ib_mr *(*alloc_mr)(struct ib_pd *pd, enum ib_mr_type mr_type,
> > -				  u32 max_num_sg);
> > +				  u32 max_num_sg, u32 access);
> >  	struct ib_mr *(*alloc_mr_integrity)(struct ib_pd *pd,
> >  					    u32 max_num_data_sg,
> > -					    u32 max_num_meta_sg);
> > +					    u32 max_num_meta_sg, u32 access);
> >  	int (*advise_mr)(struct ib_pd *pd,
> >  			 enum ib_uverbs_advise_mr_advice advice, u32 flags,
> >  			 struct ib_sge *sg_list, u32 num_sge,
> > @@ -4142,11 +4142,10 @@ static inline int ib_dereg_mr(struct ib_mr *mr)
> >  }
> >  
> >  struct ib_mr *ib_alloc_mr(struct ib_pd *pd, enum ib_mr_type mr_type,
> > -			  u32 max_num_sg);
> > +			  u32 max_num_sg, u32 access);
> >  
> > -struct ib_mr *ib_alloc_mr_integrity(struct ib_pd *pd,
> > -				    u32 max_num_data_sg,
> > -				    u32 max_num_meta_sg);
> > +struct ib_mr *ib_alloc_mr_integrity(struct ib_pd *pd, u32 max_num_data_sg,
> > +				    u32 max_num_meta_sg, u32 access);
> >  
> >  /**
> >   * ib_update_fast_reg_key - updates the key portion of the fast_reg MR
> > diff --git a/include/rdma/mr_pool.h b/include/rdma/mr_pool.h
> > index e77123bcb43b..2a0ee791037d 100644
> > --- a/include/rdma/mr_pool.h
> > +++ b/include/rdma/mr_pool.h
> > @@ -11,7 +11,8 @@ struct ib_mr *ib_mr_pool_get(struct ib_qp *qp, struct list_head *list);
> >  void ib_mr_pool_put(struct ib_qp *qp, struct list_head *list, struct ib_mr *mr);
> >  
> >  int ib_mr_pool_init(struct ib_qp *qp, struct list_head *list, int nr,
> > -		enum ib_mr_type type, u32 max_num_sg, u32 max_num_meta_sg);
> > +		    enum ib_mr_type type, u32 max_num_sg, u32 max_num_meta_sg,
> > +		    u32 access);
> >  void ib_mr_pool_destroy(struct ib_qp *qp, struct list_head *list);
> >  
> >  #endif /* _RDMA_MR_POOL_H */
> 
> Does the new 'access' argument only control whether or not PCIe relaxed
> ordering is enabled? It seems wrong to me to make enabling of PCIe
> relaxed ordering configurable. I think this mechanism should be enabled
> unconditionally if the HCA supports it.

The same proposal (enable unconditionally) was raised during
submission preparations, and we decided to follow the same pattern
as other verbs objects, which receive a flags parameter.

Thanks

> 
> Thanks,
> 
> Bart.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH rdma-next 01/10] RDMA: Add access flags to ib_alloc_mr() and ib_mr_pool_init()
  2021-04-05 13:46   ` Christoph Hellwig
@ 2021-04-06  5:24     ` Leon Romanovsky
  0 siblings, 0 replies; 56+ messages in thread
From: Leon Romanovsky @ 2021-04-06  5:24 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Doug Ledford, Jason Gunthorpe, Avihai Horon, Adit Ranadive,
	Anna Schumaker, Ariel Elior, Bart Van Assche, Bernard Metzler,
	Chuck Lever, David S. Miller, Dennis Dalessandro, Devesh Sharma,
	Faisal Latif, Jack Wang, Jakub Kicinski, J. Bruce Fields,
	Jens Axboe, Karsten Graul, Keith Busch, Lijun Ou, linux-cifs,
	linux-kernel, linux-nfs, linux-nvme, linux-rdma, linux-s390,
	Max Gurtovoy, Max Gurtovoy, Md. Haris Iqbal, Michael Guralnik,
	Michal Kalderon, Mike Marciniszyn, Naresh Kumar PBS, netdev,
	Potnuri Bharat Teja, rds-devel, Sagi Grimberg, samba-technical,
	Santosh Shilimkar, Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun

On Mon, Apr 05, 2021 at 03:46:18PM +0200, Christoph Hellwig wrote:
> On Mon, Apr 05, 2021 at 08:23:55AM +0300, Leon Romanovsky wrote:
> > From: Avihai Horon <avihaih@nvidia.com>
> > 
> > Add access flags parameter to ib_alloc_mr() and to ib_mr_pool_init(),
> > and refactor relevant code. This parameter is used to pass MR access
> > flags during MR allocation.
> > 
> > In the following patches, the new access flags parameter will be used
> > to enable Relaxed Ordering for ib_alloc_mr() and ib_mr_pool_init() users.
> 
> So this wires up a new RELAXED_ORDERING flag without ever mentioning
> that flag in the commit log, never mind explaining what it actually does.

We will improve commit messages.

Thanks

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH rdma-next 01/10] RDMA: Add access flags to ib_alloc_mr() and ib_mr_pool_init()
  2021-04-06  5:23     ` Leon Romanovsky
@ 2021-04-06  5:27       ` Christoph Hellwig
  2021-04-06  5:58         ` Leon Romanovsky
  0 siblings, 1 reply; 56+ messages in thread
From: Christoph Hellwig @ 2021-04-06  5:27 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Bart Van Assche, Doug Ledford, Jason Gunthorpe,
	Christoph Hellwig, Avihai Horon, Adit Ranadive, Anna Schumaker,
	Ariel Elior, Bernard Metzler, Chuck Lever, David S. Miller,
	Dennis Dalessandro, Devesh Sharma, Faisal Latif, Jack Wang,
	Jakub Kicinski, J. Bruce Fields, Jens Axboe, Karsten Graul,
	Keith Busch, Lijun Ou, linux-cifs, linux-kernel, linux-nfs,
	linux-nvme, linux-rdma, linux-s390, Max Gurtovoy, Max Gurtovoy,
	Md. Haris Iqbal, Michael Guralnik, Michal Kalderon,
	Mike Marciniszyn, Naresh Kumar PBS, netdev, Potnuri Bharat Teja,
	rds-devel, Sagi Grimberg, samba-technical, Santosh Shilimkar,
	Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun

On Tue, Apr 06, 2021 at 08:23:28AM +0300, Leon Romanovsky wrote:
> > The same proposal (enable unconditionally) was raised during
> > submission preparations, and we decided to follow the same pattern
> > as other verbs objects, which receive a flags parameter.

A flags argument can be added when it actually is needed.  Using it
to pass an argument enabled by all ULPs just gets us back to the bad
old days of complete crap APIs someone drew up on a whiteboard.

I think we need to:

 a) document the semantics
 b) sort out any technical concerns
 c) just enable the damn thing

instead of requiring some form of cargo culting.


* Re: [PATCH rdma-next 01/10] RDMA: Add access flags to ib_alloc_mr() and ib_mr_pool_init()
  2021-04-06  5:27       ` Christoph Hellwig
@ 2021-04-06  5:58         ` Leon Romanovsky
  2021-04-06 12:13           ` Jason Gunthorpe
  0 siblings, 1 reply; 56+ messages in thread
From: Leon Romanovsky @ 2021-04-06  5:58 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Bart Van Assche, Doug Ledford, Jason Gunthorpe, Avihai Horon,
	Adit Ranadive, Anna Schumaker, Ariel Elior, Bernard Metzler,
	Chuck Lever, David S. Miller, Dennis Dalessandro, Devesh Sharma,
	Faisal Latif, Jack Wang, Jakub Kicinski, J. Bruce Fields,
	Jens Axboe, Karsten Graul, Keith Busch, Lijun Ou, linux-cifs,
	linux-kernel, linux-nfs, linux-nvme, linux-rdma, linux-s390,
	Max Gurtovoy, Max Gurtovoy, Md. Haris Iqbal, Michael Guralnik,
	Michal Kalderon, Mike Marciniszyn, Naresh Kumar PBS, netdev,
	Potnuri Bharat Teja, rds-devel, Sagi Grimberg, samba-technical,
	Santosh Shilimkar, Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun

On Tue, Apr 06, 2021 at 07:27:17AM +0200, Christoph Hellwig wrote:
> On Tue, Apr 06, 2021 at 08:23:28AM +0300, Leon Romanovsky wrote:
> > The same proposal (enable unconditionally) was raised during
> > submission preparations, and we decided to follow the same pattern
> > as other verbs objects which receive a flags parameter.
> 
> A flags argument can be added when it actually is needed.  Using it
> to pass an argument enabled by all ULPs just gets us back to the bad
> old days of complete crap APIs someone drew up on a whiteboard.

Let's wait till Jason wakes up, before jumping to conclusions.
It was his request to update all ULPs.

> 
> I think we need to:
> 
>  a) document the semantics
>  b) sort out any technical concerns
>  c) just enable the damn thing

Sure

> 
> instead of requiring some form of cargo culting.


* Re: [PATCH rdma-next 02/10] RDMA/core: Enable Relaxed Ordering in __ib_alloc_pd()
  2021-04-05 18:01   ` Tom Talpey
  2021-04-05 20:40     ` Adit Ranadive
@ 2021-04-06  6:28     ` Leon Romanovsky
  1 sibling, 0 replies; 56+ messages in thread
From: Leon Romanovsky @ 2021-04-06  6:28 UTC (permalink / raw)
  To: Tom Talpey
  Cc: Doug Ledford, Jason Gunthorpe, Avihai Horon, Adit Ranadive,
	Anna Schumaker, Ariel Elior, Bart Van Assche, Bernard Metzler,
	Christoph Hellwig, Chuck Lever, David S. Miller,
	Dennis Dalessandro, Devesh Sharma, Faisal Latif, Jack Wang,
	Jakub Kicinski, J. Bruce Fields, Jens Axboe, Karsten Graul,
	Keith Busch, Lijun Ou, linux-cifs, linux-kernel, linux-nfs,
	linux-nvme, linux-rdma, linux-s390, Max Gurtovoy, Max Gurtovoy,
	Md. Haris Iqbal, Michael Guralnik, Michal Kalderon,
	Mike Marciniszyn, Naresh Kumar PBS, netdev, Potnuri Bharat Teja,
	rds-devel, Sagi Grimberg, samba-technical, Santosh Shilimkar,
	Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun

On Mon, Apr 05, 2021 at 02:01:16PM -0400, Tom Talpey wrote:
> On 4/5/2021 1:23 AM, Leon Romanovsky wrote:
> > From: Avihai Horon <avihaih@nvidia.com>
> > 
> > Enable Relaxed Ordering in __ib_alloc_pd() allocation of the
> > local_dma_lkey.
> > 
> > This will take effect only for devices that don't pre-allocate the lkey
> > but allocate it per PD allocation.
> > 
> > Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> > Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > ---
> >   drivers/infiniband/core/verbs.c              | 3 ++-
> >   drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c | 1 +
> >   2 files changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
> > index a1782f8a6ca0..9b719f7d6fd5 100644
> > --- a/drivers/infiniband/core/verbs.c
> > +++ b/drivers/infiniband/core/verbs.c
> > @@ -287,7 +287,8 @@ struct ib_pd *__ib_alloc_pd(struct ib_device *device, unsigned int flags,
> >   	if (device->attrs.device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY)
> >   		pd->local_dma_lkey = device->local_dma_lkey;
> >   	else
> > -		mr_access_flags |= IB_ACCESS_LOCAL_WRITE;
> > +		mr_access_flags |=
> > +			IB_ACCESS_LOCAL_WRITE | IB_ACCESS_RELAXED_ORDERING;
> 
> So, do local_dma_lkey's get relaxed ordering unconditionally?

Yes, in mlx5, this lkey is created on the fly.

Thanks


* Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs
  2021-04-05 23:42     ` Chuck Lever III
  2021-04-05 23:50       ` Keith Busch
  2021-04-06  5:12       ` Leon Romanovsky
@ 2021-04-06 11:49       ` Jason Gunthorpe
  2021-04-09 14:26         ` Tom Talpey
  2 siblings, 1 reply; 56+ messages in thread
From: Jason Gunthorpe @ 2021-04-06 11:49 UTC (permalink / raw)
  To: Chuck Lever III
  Cc: Christoph Hellwig, Leon Romanovsky, Doug Ledford,
	Leon Romanovsky, Adit Ranadive, Anna Schumaker, Ariel Elior,
	Avihai Horon, Bart Van Assche, Bernard Metzler, David S. Miller,
	Dennis Dalessandro, Devesh Sharma, Faisal Latif, Jack Wang,
	Jakub Kicinski, Bruce Fields, Jens Axboe, Karsten Graul,
	Keith Busch, Lijun Ou, CIFS, LKML, Linux NFS Mailing List,
	linux-nvme, linux-rdma, linux-s390, Max Gurtovoy, Max Gurtovoy,
	Md. Haris Iqbal, Michael Guralnik, Michal Kalderon,
	Mike Marciniszyn, Naresh Kumar PBS, Linux-Net,
	Potnuri Bharat Teja, rds-devel, Sagi Grimberg, samba-technical,
	Santosh Shilimkar, Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun

On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
 
> We need to get a better idea what correctness testing has been done,
> and whether positive correctness testing results can be replicated
> on a variety of platforms.

RO has been rolling out slowly on mlx5 over a few years and storage
ULPs are the last to change. eg the mlx5 ethernet driver has had RO
turned on for a long time, userspace HPC applications have been using
it for a while now too.

We know there are platforms with broken RO implementations (like
Haswell), but the kernel is supposed to globally turn off RO in all
those cases. I'd be a bit surprised if we discover any more from this
series.

On the other hand there are platforms that get huge speed ups from
turning this on, AMD is one example, there are a bunch in the ARM
world too.

Still, obviously people should test on the platforms they have.

Jason


* Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs
  2021-04-06  5:09   ` Leon Romanovsky
@ 2021-04-06 11:53     ` Jason Gunthorpe
  2021-04-11 10:09       ` Max Gurtovoy
  0 siblings, 1 reply; 56+ messages in thread
From: Jason Gunthorpe @ 2021-04-06 11:53 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Honggang LI, Doug Ledford, Adit Ranadive, Anna Schumaker,
	Ariel Elior, Avihai Horon, Bart Van Assche, Bernard Metzler,
	Christoph Hellwig, Chuck Lever, David S. Miller,
	Dennis Dalessandro, Devesh Sharma, Faisal Latif, Jack Wang,
	Jakub Kicinski, J. Bruce Fields, Jens Axboe, Karsten Graul,
	Keith Busch, Lijun Ou, linux-cifs, linux-kernel, linux-nfs,
	linux-nvme, linux-rdma, linux-s390, Max Gurtovoy, Max Gurtovoy,
	Md. Haris Iqbal, Michael Guralnik, Michal Kalderon,
	Mike Marciniszyn, Naresh Kumar PBS, netdev, Potnuri Bharat Teja,
	rds-devel, Sagi Grimberg, samba-technical, Santosh Shilimkar,
	Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun

On Tue, Apr 06, 2021 at 08:09:43AM +0300, Leon Romanovsky wrote:
> On Tue, Apr 06, 2021 at 10:37:38AM +0800, Honggang LI wrote:
> > On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
> > > From: Leon Romanovsky <leonro@nvidia.com>
> > > 
> > > From Avihai,
> > > 
> > > Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
> > > imposed on PCI transactions, and thus, can improve performance.
> > > 
> > > Until now, relaxed ordering could be set only by user space applications
> > > for user MRs. The following patch series enables relaxed ordering for the
> > > kernel ULPs as well. Relaxed ordering is an optional capability, and as
> > > such, it is ignored by vendors that don't support it.
> > > 
> > > The following test results show the performance improvement achieved
> > 
> > Did you test this patchset with CPU does not support relaxed ordering?
> 
> I don't think so. The CPUs that don't support RO are Intel's
> fourth/fifth generation, and they are not interesting from a
> performance point of view.
> 
> > 
> > We observed significantly performance degradation when run perftest with
> > relaxed ordering enabled over old CPU.
> > 
> > https://github.com/linux-rdma/perftest/issues/116
> 
> The perftest case is slightly different, but you raised a valid point.
> We forgot to call pcie_relaxed_ordering_enabled() before setting the RO
> bit, and arguably this needed to be done in perftest too.

No, the PCI device should not have the RO bit set in this situation.
It is something mlx5_core needs to do. We can't push this into
applications.

There should be no performance difference between asking for
IBV_ACCESS_RELAXED_ORDERING when RO is disabled in the PCI config and
not asking for it at all.

Either the platform has working relaxed ordering that gives a
performance gain and the RO config spec bit should be set, or it
doesn't and the bit should be clear.

This is not something to decide in userspace, or in RDMA. At worst it
becomes another platform-specific PCI tunable people have to set.

I thought the old haswell systems were quirked to disable RO globally
anyhow?

Jason


* Re: [PATCH rdma-next 01/10] RDMA: Add access flags to ib_alloc_mr() and ib_mr_pool_init()
  2021-04-06  5:58         ` Leon Romanovsky
@ 2021-04-06 12:13           ` Jason Gunthorpe
  2021-04-06 12:30             ` Christoph Hellwig
  0 siblings, 1 reply; 56+ messages in thread
From: Jason Gunthorpe @ 2021-04-06 12:13 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Christoph Hellwig, Bart Van Assche, Doug Ledford, Avihai Horon,
	Adit Ranadive, Anna Schumaker, Ariel Elior, Bernard Metzler,
	Chuck Lever, David S. Miller, Dennis Dalessandro, Devesh Sharma,
	Faisal Latif, Jack Wang, Jakub Kicinski, J. Bruce Fields,
	Jens Axboe, Karsten Graul, Keith Busch, Lijun Ou, linux-cifs,
	linux-kernel, linux-nfs, linux-nvme, linux-rdma, linux-s390,
	Max Gurtovoy, Max Gurtovoy, Md. Haris Iqbal, Michael Guralnik,
	Michal Kalderon, Mike Marciniszyn, Naresh Kumar PBS, netdev,
	Potnuri Bharat Teja, rds-devel, Sagi Grimberg, samba-technical,
	Santosh Shilimkar, Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun

On Tue, Apr 06, 2021 at 08:58:54AM +0300, Leon Romanovsky wrote:
> On Tue, Apr 06, 2021 at 07:27:17AM +0200, Christoph Hellwig wrote:
> > On Tue, Apr 06, 2021 at 08:23:28AM +0300, Leon Romanovsky wrote:
> > > The same proposal (enable unconditionally) was raised during
> > > submission preparations, and we decided to follow the same pattern
> > > as other verbs objects which receive a flags parameter.
> > 
> > A flags argument can be added when it actually is needed.  Using it
> > to pass an argument enabled by all ULPs just gets us back to the bad
> > old days of complete crap APIs someone drew up on a whiteboard.
> 
> Let's wait till Jason wakes up, before jumping to conclusions.
> It was his request to update all ULPs.

We are stuck in a bad spot here

Only the kernel can universally support ACCESS_RELAXED_ORDERING
because all the ULPs are required to work sanely already, but
userspace does not have this limitation and we know of enough popular
verbs users that will break if we unconditionally switch on
ACCESS_RELAXED_ORDERING.

Further, we have the problem with the FMR interface that technically
lets the caller control the entire access_flags, including
ACCESS_RELAXED_ORDERING.

So we broadly have two choices
 1) Diverge the kernel and user interfaces and make the RDMA drivers
    special case all the kernel stuff to force
    ACCESS_RELAXED_ORDERING when they are building MRs and processing
    FMR WQE's
 2) Keep the two interfaces the same and push the
    ACCESS_RELAXED_ORDERING to a couple of places in the ULPs so the
    drivers see a consistent API

They are both poor choices, but I think #2 has a greater chance of
everyone doing their parts correctly.

Jason


* Re: [PATCH rdma-next 01/10] RDMA: Add access flags to ib_alloc_mr() and ib_mr_pool_init()
  2021-04-06 12:13           ` Jason Gunthorpe
@ 2021-04-06 12:30             ` Christoph Hellwig
  2021-04-06 14:04               ` Jason Gunthorpe
  0 siblings, 1 reply; 56+ messages in thread
From: Christoph Hellwig @ 2021-04-06 12:30 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, Christoph Hellwig, Bart Van Assche,
	Doug Ledford, Avihai Horon, Adit Ranadive, Anna Schumaker,
	Ariel Elior, Bernard Metzler, Chuck Lever, David S. Miller,
	Dennis Dalessandro, Devesh Sharma, Faisal Latif, Jack Wang,
	Jakub Kicinski, J. Bruce Fields, Jens Axboe, Karsten Graul,
	Keith Busch, Lijun Ou, linux-cifs, linux-kernel, linux-nfs,
	linux-nvme, linux-rdma, linux-s390, Max Gurtovoy, Max Gurtovoy,
	Md. Haris Iqbal, Michael Guralnik, Michal Kalderon,
	Mike Marciniszyn, Naresh Kumar PBS, netdev, Potnuri Bharat Teja,
	rds-devel, Sagi Grimberg, samba-technical, Santosh Shilimkar,
	Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun

On Tue, Apr 06, 2021 at 09:13:12AM -0300, Jason Gunthorpe wrote:
> So we broadly have two choices
>  1) Diverge the kernel and user interfaces and make the RDMA drivers
>     special case all the kernel stuff to force
>     ACCESS_RELAXED_ORDERING when they are building MRs and processing
>     FMR WQE's
>  2) Keep the two interfaces the same and push the
>     ACCESS_RELAXED_ORDERING to a couple of places in the ULPs so the
>     drivers see a consistent API
> 
> They are both poor choices, but I think #2 has a greater chance of
> everyone doing their parts correctly.

No, 1 is the only sensible choice.  The userspace verbs interface is
a mess and should not inflict pain on the kernel in any way.  We've moved
away from a lot of the idiotic "Verbs API" concepts with things like
how we handle the global lkey, the new CQ API and the RDMA R/W
abstraction and that massively helped the kernel ecosystem.


* Re: [PATCH rdma-next 01/10] RDMA: Add access flags to ib_alloc_mr() and ib_mr_pool_init()
  2021-04-06 12:30             ` Christoph Hellwig
@ 2021-04-06 14:04               ` Jason Gunthorpe
  2021-04-06 14:15                 ` Christoph Hellwig
  0 siblings, 1 reply; 56+ messages in thread
From: Jason Gunthorpe @ 2021-04-06 14:04 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Leon Romanovsky, Bart Van Assche, Doug Ledford, Avihai Horon,
	Adit Ranadive, Anna Schumaker, Ariel Elior, Bernard Metzler,
	Chuck Lever, David S. Miller, Dennis Dalessandro, Devesh Sharma,
	Faisal Latif, Jack Wang, Jakub Kicinski, J. Bruce Fields,
	Jens Axboe, Karsten Graul, Keith Busch, Lijun Ou, linux-cifs,
	linux-kernel, linux-nfs, linux-nvme, linux-rdma, linux-s390,
	Max Gurtovoy, Max Gurtovoy, Md. Haris Iqbal, Michael Guralnik,
	Michal Kalderon, Mike Marciniszyn, Naresh Kumar PBS, netdev,
	Potnuri Bharat Teja, rds-devel, Sagi Grimberg, samba-technical,
	Santosh Shilimkar, Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun

On Tue, Apr 06, 2021 at 02:30:34PM +0200, Christoph Hellwig wrote:
> On Tue, Apr 06, 2021 at 09:13:12AM -0300, Jason Gunthorpe wrote:
> > So we broadly have two choices
> >  1) Diverge the kernel and user interfaces and make the RDMA drivers
> >     special case all the kernel stuff to force
> >     ACCESS_RELAXED_ORDERING when they are building MRs and processing
> >     FMR WQE's
> >  2) Keep the two interfaces the same and push the
> >     ACCESS_RELAXED_ORDERING to a couple of places in the ULPs so the
> >     drivers see a consistent API
> > 
> > They are both poor choices, but I think #2 has a greater chance of
> > everyone doing their parts correctly.
> 
> No, 1 is the only sensible choice.  The userspace verbs interface is
> a mess and should not inflict pain on the kernel in any way.  We've moved
> away from a lot of the idiotic "Verbs API" concepts with things like
> how we handle the global lkey, the new CQ API and the RDMA R/W
> abstraction and that massively helped the kernel ecosystem.

It might be idiotic, but I have to keep the uverbs thing working
too.

There is a lot of assumption baked into all the drivers that
user/kernel is the same thing; we'd have to go in and break this.

Essentially #2 ends up as deleting IB_ACCESS_RELAXED_ORDERING kernel
side and instead doing some IB_ACCESS_DISABLE_RO in kernel,
translating uverbs IBV_ACCESS_* to this, then finding and inverting all
the driver logic, and also finding and unblocking all the places that
enforce valid access flags in the drivers. It is complicated enough.

Maybe Avihai will give it a try

Jason


* Re: [PATCH rdma-next 01/10] RDMA: Add access flags to ib_alloc_mr() and ib_mr_pool_init()
  2021-04-06 14:04               ` Jason Gunthorpe
@ 2021-04-06 14:15                 ` Christoph Hellwig
  2021-04-06 14:40                   ` Jason Gunthorpe
  0 siblings, 1 reply; 56+ messages in thread
From: Christoph Hellwig @ 2021-04-06 14:15 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Leon Romanovsky, Bart Van Assche,
	Doug Ledford, Avihai Horon, Adit Ranadive, Anna Schumaker,
	Ariel Elior, Bernard Metzler, Chuck Lever, David S. Miller,
	Dennis Dalessandro, Devesh Sharma, Faisal Latif, Jack Wang,
	Jakub Kicinski, J. Bruce Fields, Jens Axboe, Karsten Graul,
	Keith Busch, Lijun Ou, linux-cifs, linux-kernel, linux-nfs,
	linux-nvme, linux-rdma, linux-s390, Max Gurtovoy, Max Gurtovoy,
	Md. Haris Iqbal, Michael Guralnik, Michal Kalderon,
	Mike Marciniszyn, Naresh Kumar PBS, netdev, Potnuri Bharat Teja,
	rds-devel, Sagi Grimberg, samba-technical, Santosh Shilimkar,
	Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun

On Tue, Apr 06, 2021 at 11:04:37AM -0300, Jason Gunthorpe wrote:
> It might be idiotic, but I have to keep the uverbs thing working
> too.
> 
> There is a lot of assumption baked into all the drivers that
> user/kernel is the same thing; we'd have to go in and break this.
> 
> Essentially #2 ends up as deleting IB_ACCESS_RELAXED_ORDERING kernel
> side and instead doing some IB_ACCESS_DISABLE_RO in kernel,
> translating uverbs IBV_ACCESS_* to this, then finding and inverting all
> the driver logic, and also finding and unblocking all the places that
> enforce valid access flags in the drivers. It is complicated enough.

Inverting the polarity of a flag at the uapi boundary is pretty
trivial and we already do it all over the kernel.

Do we actually ever need the strict ordering semantics in the kernel?


* Re: [PATCH rdma-next 01/10] RDMA: Add access flags to ib_alloc_mr() and ib_mr_pool_init()
  2021-04-06 14:15                 ` Christoph Hellwig
@ 2021-04-06 14:40                   ` Jason Gunthorpe
  2021-04-06 14:54                     ` Christoph Hellwig
  0 siblings, 1 reply; 56+ messages in thread
From: Jason Gunthorpe @ 2021-04-06 14:40 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Leon Romanovsky, Bart Van Assche, Doug Ledford, Avihai Horon,
	Adit Ranadive, Anna Schumaker, Ariel Elior, Bernard Metzler,
	Chuck Lever, David S. Miller, Dennis Dalessandro, Devesh Sharma,
	Faisal Latif, Jack Wang, Jakub Kicinski, J. Bruce Fields,
	Jens Axboe, Karsten Graul, Keith Busch, Lijun Ou, linux-cifs,
	linux-kernel, linux-nfs, linux-nvme, linux-rdma, linux-s390,
	Max Gurtovoy, Max Gurtovoy, Md. Haris Iqbal, Michael Guralnik,
	Michal Kalderon, Mike Marciniszyn, Naresh Kumar PBS, netdev,
	Potnuri Bharat Teja, rds-devel, Sagi Grimberg, samba-technical,
	Santosh Shilimkar, Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun

On Tue, Apr 06, 2021 at 04:15:52PM +0200, Christoph Hellwig wrote:
> On Tue, Apr 06, 2021 at 11:04:37AM -0300, Jason Gunthorpe wrote:
> > It might be idiotic, but I have to keep the uverbs thing working
> > too.
> > 
> > There is a lot of assumption baked into all the drivers that
> > user/kernel is the same thing; we'd have to go in and break this.
> > 
> > Essentially #2 ends up as deleting IB_ACCESS_RELAXED_ORDERING kernel
> > side and instead doing some IB_ACCESS_DISABLE_RO in kernel,
> > translating uverbs IBV_ACCESS_* to this, then finding and inverting all
> > the driver logic, and also finding and unblocking all the places that
> > enforce valid access flags in the drivers. It is complicated enough.
> 
> Inverting the polarity of a flag at the uapi boundary is pretty
> trivial and we already do it all over the kernel.

Yes, but the complexity is in how the drivers are constructed: they are
designed to reject flags they don't know about.

Hum, it looks like someone has already been in here and we now have an
IB_ACCESS_OPTIONAL concept.

Something like this would be the starting point:

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index bed4cfe50554f7..fcb107df0eefc6 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1440,9 +1440,11 @@ enum ib_access_flags {
 	IB_ZERO_BASED = IB_UVERBS_ACCESS_ZERO_BASED,
 	IB_ACCESS_ON_DEMAND = IB_UVERBS_ACCESS_ON_DEMAND,
 	IB_ACCESS_HUGETLB = IB_UVERBS_ACCESS_HUGETLB,
-	IB_ACCESS_RELAXED_ORDERING = IB_UVERBS_ACCESS_RELAXED_ORDERING,
 
 	IB_ACCESS_OPTIONAL = IB_UVERBS_ACCESS_OPTIONAL_RANGE,
+	_IB_ACCESS_RESERVED1 = IB_UVERBS_ACCESS_RELAXED_ORDERING,
+	IB_ACCESS_DISABLE_RELAXED_ORDERING,
+
 	IB_ACCESS_SUPPORTED =
 		((IB_ACCESS_HUGETLB << 1) - 1) | IB_ACCESS_OPTIONAL,
 };

However I see only EFA actually uses IB_ACCESS_OPTIONAL, so the lead-up
would be to audit all the drivers to process optional access_flags
properly. Maybe this was done, but I don't see much evidence of it.

Sigh. It is a big mess cleaning adventure in drivers really.

> Do we actually ever need the strict ordering semantics in the kernel?

No, only for uverbs.

Jason


* Re: [PATCH rdma-next 01/10] RDMA: Add access flags to ib_alloc_mr() and ib_mr_pool_init()
  2021-04-06 14:40                   ` Jason Gunthorpe
@ 2021-04-06 14:54                     ` Christoph Hellwig
  2021-04-06 15:03                       ` Christoph Hellwig
  2021-04-07 18:28                       ` Jason Gunthorpe
  0 siblings, 2 replies; 56+ messages in thread
From: Christoph Hellwig @ 2021-04-06 14:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Leon Romanovsky, Bart Van Assche,
	Doug Ledford, Avihai Horon, Adit Ranadive, Anna Schumaker,
	Ariel Elior, Bernard Metzler, Chuck Lever, David S. Miller,
	Dennis Dalessandro, Devesh Sharma, Faisal Latif, Jack Wang,
	Jakub Kicinski, J. Bruce Fields, Jens Axboe, Karsten Graul,
	Keith Busch, Lijun Ou, linux-cifs, linux-kernel, linux-nfs,
	linux-nvme, linux-rdma, linux-s390, Max Gurtovoy, Max Gurtovoy,
	Md. Haris Iqbal, Michael Guralnik, Michal Kalderon,
	Mike Marciniszyn, Naresh Kumar PBS, netdev, Potnuri Bharat Teja,
	rds-devel, Sagi Grimberg, samba-technical, Santosh Shilimkar,
	Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun

On Tue, Apr 06, 2021 at 11:40:39AM -0300, Jason Gunthorpe wrote:
> Yes, but the complexity is in how the drivers are constructed: they are
> designed to reject flags they don't know about.
> 
> Hum, it looks like someone has already been in here and we now have an
> IB_ACCESS_OPTIONAL concept.
> 
> Something like this would be the starting point:
>
> [...]
>
> However I see only EFA actually uses IB_ACCESS_OPTIONAL, so the lead-up
> would be to audit all the drivers to process optional access_flags
> properly. Maybe this was done, but I don't see much evidence of it.
>
> Sigh. It is a big mess cleaning adventure in drivers really.

Yes.  When passing flags to drivers we need to have a good pattern
in place for distinguishing between mandatory and optional flags.
That is something everyone gets wrong at first (yourself included)
and which then later needs painful untangling.  So I think we need
to do that anyway.

> > Do we actually ever need the strict ordering semantics in the kernel?
> 
> No, only for uverbs.

Which is a pretty clear indicator that we should avoid all this ULP
churn.  Note that the polarity of the flag passed to the HCA
driver doesn't really matter either - we should just avoid having to
deal with it in the kernel ULP API.


* Re: [PATCH rdma-next 01/10] RDMA: Add access flags to ib_alloc_mr() and ib_mr_pool_init()
  2021-04-06 14:54                     ` Christoph Hellwig
@ 2021-04-06 15:03                       ` Christoph Hellwig
  2021-04-07 18:28                       ` Jason Gunthorpe
  1 sibling, 0 replies; 56+ messages in thread
From: Christoph Hellwig @ 2021-04-06 15:03 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Leon Romanovsky, Bart Van Assche,
	Doug Ledford, Avihai Horon, Adit Ranadive, Anna Schumaker,
	Ariel Elior, Bernard Metzler, Chuck Lever, David S. Miller,
	Dennis Dalessandro, Devesh Sharma, Faisal Latif, Jack Wang,
	Jakub Kicinski, J. Bruce Fields, Jens Axboe, Karsten Graul,
	Keith Busch, Lijun Ou, linux-cifs, linux-kernel, linux-nfs,
	linux-nvme, linux-rdma, linux-s390, Max Gurtovoy, Max Gurtovoy,
	Md. Haris Iqbal, Michael Guralnik, Michal Kalderon,
	Mike Marciniszyn, Naresh Kumar PBS, netdev, Potnuri Bharat Teja,
	rds-devel, Sagi Grimberg, samba-technical, Santosh Shilimkar,
	Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun

On Tue, Apr 06, 2021 at 04:54:57PM +0200, Christoph Hellwig wrote:
> That is something everyone gets wrong at first (yourself included)

The yourself here should have been a yours truly or "myself", sorry.


* Re: [PATCH rdma-next 01/10] RDMA: Add access flags to ib_alloc_mr() and ib_mr_pool_init()
  2021-04-06 14:54                     ` Christoph Hellwig
  2021-04-06 15:03                       ` Christoph Hellwig
@ 2021-04-07 18:28                       ` Jason Gunthorpe
  1 sibling, 0 replies; 56+ messages in thread
From: Jason Gunthorpe @ 2021-04-07 18:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Leon Romanovsky, Bart Van Assche, Doug Ledford, Avihai Horon,
	Adit Ranadive, Anna Schumaker, Ariel Elior, Bernard Metzler,
	Chuck Lever, David S. Miller, Dennis Dalessandro, Devesh Sharma,
	Faisal Latif, Jack Wang, Jakub Kicinski, J. Bruce Fields,
	Jens Axboe, Karsten Graul, Keith Busch, Lijun Ou, linux-cifs,
	linux-kernel, linux-nfs, linux-nvme, linux-rdma, linux-s390,
	Max Gurtovoy, Max Gurtovoy, Md. Haris Iqbal, Michael Guralnik,
	Michal Kalderon, Mike Marciniszyn, Naresh Kumar PBS, netdev,
	Potnuri Bharat Teja, rds-devel, Sagi Grimberg, samba-technical,
	Santosh Shilimkar, Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun

On Tue, Apr 06, 2021 at 04:54:57PM +0200, Christoph Hellwig wrote:
> On Tue, Apr 06, 2021 at 11:40:39AM -0300, Jason Gunthorpe wrote:
> > Yes, but the complexity is in how the drivers are constructed: they are
> > designed to reject flags they don't know about.
> > 
> > Hum, it looks like someone has already been in here and we now have an
> > IB_ACCESS_OPTIONAL concept.
> > 
> > Something like this would be the starting point:
> >
> > [...]
> >
> > However I see only EFA actually uses IB_ACCESS_OPTIONAL, so the lead-up
> > would be to audit all the drivers to process optional access_flags
> > properly. Maybe this was done, but I don't see much evidence of it.
> >
> > Sigh. It is a big mess cleaning adventure in drivers really.
> 
> Yes.  When passing flags to drivers we need to have a good pattern
> in place for distinguishing between mandatory and optional flags.
> That is something everyone gets wrong at first (yourself included)
> and which then later needs painful untangling.  So I think we need
> to do that anyway.

It looks like we did actually do this right here: when
RELAXED_ORDERING was originally added, it was done as an optional flag,
and a range of bits was set aside in the uAPI for future optional
flags.

This should be quite easy to do then

Jason


* Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs
  2021-04-06 11:49       ` Jason Gunthorpe
@ 2021-04-09 14:26         ` Tom Talpey
  2021-04-09 14:45           ` Chuck Lever III
  2021-04-09 16:40           ` Jason Gunthorpe
  0 siblings, 2 replies; 56+ messages in thread
From: Tom Talpey @ 2021-04-09 14:26 UTC (permalink / raw)
  To: Jason Gunthorpe, Chuck Lever III
  Cc: Christoph Hellwig, Leon Romanovsky, Doug Ledford,
	Leon Romanovsky, Adit Ranadive, Anna Schumaker, Ariel Elior,
	Avihai Horon, Bart Van Assche, Bernard Metzler, David S. Miller,
	Dennis Dalessandro, Devesh Sharma, Faisal Latif, Jack Wang,
	Jakub Kicinski, Bruce Fields, Jens Axboe, Karsten Graul,
	Keith Busch, Lijun Ou, CIFS, LKML, Linux NFS Mailing List,
	linux-nvme, linux-rdma, linux-s390, Max Gurtovoy, Max Gurtovoy,
	Md. Haris Iqbal, Michael Guralnik, Michal Kalderon,
	Mike Marciniszyn, Naresh Kumar PBS, Linux-Net,
	Potnuri Bharat Teja, rds-devel, Sagi Grimberg, samba-technical,
	Santosh Shilimkar, Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun

On 4/6/2021 7:49 AM, Jason Gunthorpe wrote:
> On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
>   
>> We need to get a better idea what correctness testing has been done,
>> and whether positive correctness testing results can be replicated
>> on a variety of platforms.
> 
> RO has been rolling out slowly on mlx5 over a few years and storage
> ULPs are the last to change. eg the mlx5 ethernet driver has had RO
> turned on for a long time, userspace HPC applications have been using
> it for a while now too.

I'd love to see RO be used more, it was always something the RDMA
specs supported and carefully architected for. My only concern is
that it's difficult to get right, especially when the platforms
have been running strictly-ordered for so long. The ULPs need
testing, and a lot of it.

> We know there are platforms with broken RO implementations (like
> Haswell) but the kernel is supposed to globally turn off RO on all
> those cases. I'd be a bit surprised if we discover any more from this
> series.
> 
> On the other hand there are platforms that get huge speed ups from
> turning this on, AMD is one example, there are a bunch in the ARM
> world too.

My belief is that the biggest risk is from situations where completions
are batched, and therefore polling is used to detect them without
interrupts (which explicitly flush the pipeline). The RO pipeline will
completely reorder DMA writes, and consumers which infer ordering from
memory contents may break. This can even apply within the provider code,
which may attempt to poll WR and CQ structures, and be tripped up.

The Mellanox adapter, itself, historically has strict in-order DMA
semantics, and while it's great to relax that, changing it by default
for all consumers is something to consider very cautiously.

> Still, obviously people should test on the platforms they have.

Yes, and "test" should be taken seriously, with a focus on ULP data integrity.
Speedups will mean nothing if the data is ever damaged.

Tom.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs
  2021-04-09 14:26         ` Tom Talpey
@ 2021-04-09 14:45           ` Chuck Lever III
  2021-04-09 15:32             ` Tom Talpey
  2021-04-09 16:40           ` Jason Gunthorpe
  1 sibling, 1 reply; 56+ messages in thread
From: Chuck Lever III @ 2021-04-09 14:45 UTC (permalink / raw)
  To: Tom Talpey
  Cc: Jason Gunthorpe, Christoph Hellwig, Leon Romanovsky,
	Doug Ledford, Leon Romanovsky, Adit Ranadive, Anna Schumaker,
	Ariel Elior, Avihai Horon, Bart Van Assche, Bernard Metzler,
	David S. Miller, Dennis Dalessandro, Devesh Sharma, Faisal Latif,
	Jack Wang, Jakub Kicinski, Bruce Fields, Jens Axboe,
	Karsten Graul, Keith Busch, Lijun Ou, CIFS, LKML,
	Linux NFS Mailing List, linux-nvme, linux-rdma, linux-s390,
	Max Gurtovoy, Max Gurtovoy, Md. Haris Iqbal, Michael Guralnik,
	Michal Kalderon, Mike Marciniszyn, Naresh Kumar PBS, Linux-Net,
	Potnuri Bharat Teja, rds-devel, Sagi Grimberg, samba-technical,
	Santosh Shilimkar, Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun



> On Apr 9, 2021, at 10:26 AM, Tom Talpey <tom@talpey.com> wrote:
> 
> On 4/6/2021 7:49 AM, Jason Gunthorpe wrote:
>> On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
>>  
>>> We need to get a better idea what correctness testing has been done,
>>> and whether positive correctness testing results can be replicated
>>> on a variety of platforms.
>> RO has been rolling out slowly on mlx5 over a few years and storage
>> ULPs are the last to change. eg the mlx5 ethernet driver has had RO
>> turned on for a long time, userspace HPC applications have been using
>> it for a while now too.
> 
> I'd love to see RO be used more, it was always something the RDMA
> specs supported and carefully architected for. My only concern is
> that it's difficult to get right, especially when the platforms
> have been running strictly-ordered for so long. The ULPs need
> testing, and a lot of it.
> 
>> We know there are platforms with broken RO implementations (like
>> Haswell) but the kernel is supposed to globally turn off RO on all
>> those cases. I'd be a bit surprised if we discover any more from this
>> series.
>> On the other hand there are platforms that get huge speed ups from
>> turning this on, AMD is one example, there are a bunch in the ARM
>> world too.
> 
> My belief is that the biggest risk is from situations where completions
> are batched, and therefore polling is used to detect them without
> interrupts (which explicitly flush the pipeline). The RO pipeline will completely reorder
> DMA writes, and consumers which infer ordering from memory contents may
> break. This can even apply within the provider code, which may attempt
> to poll WR and CQ structures, and be tripped up.

You are referring specifically to RPC/RDMA depending on Receive
completions to guarantee that previous RDMA Writes have been
retired? Or is there a particular implementation practice in
the Linux RPC/RDMA code that worries you?


> The Mellanox adapter, itself, historically has strict in-order DMA
> semantics, and while it's great to relax that, changing it by default
> for all consumers is something to consider very cautiously.
> 
>> Still, obviously people should test on the platforms they have.
> 
> Yes, and "test" should be taken seriously, with a focus on ULP data integrity.
> Speedups will mean nothing if the data is ever damaged.

I agree that data integrity comes first.

Since I currently don't have facilities to test RO in my lab, the
community will have to agree on a set of tests and expected results
that specifically exercise the corner cases you are concerned about.


--
Chuck Lever





* Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs
  2021-04-09 14:45           ` Chuck Lever III
@ 2021-04-09 15:32             ` Tom Talpey
  2021-04-09 16:27               ` Haakon Bugge
  0 siblings, 1 reply; 56+ messages in thread
From: Tom Talpey @ 2021-04-09 15:32 UTC (permalink / raw)
  To: Chuck Lever III
  Cc: Jason Gunthorpe, Christoph Hellwig, Leon Romanovsky,
	Doug Ledford, Leon Romanovsky, Adit Ranadive, Anna Schumaker,
	Ariel Elior, Avihai Horon, Bart Van Assche, Bernard Metzler,
	David S. Miller, Dennis Dalessandro, Devesh Sharma, Faisal Latif,
	Jack Wang, Jakub Kicinski, Bruce Fields, Jens Axboe,
	Karsten Graul, Keith Busch, Lijun Ou, CIFS, LKML,
	Linux NFS Mailing List, linux-nvme, linux-rdma, linux-s390,
	Max Gurtovoy, Max Gurtovoy, Md. Haris Iqbal, Michael Guralnik,
	Michal Kalderon, Mike Marciniszyn, Naresh Kumar PBS, Linux-Net,
	Potnuri Bharat Teja, rds-devel, Sagi Grimberg, samba-technical,
	Santosh Shilimkar, Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun

On 4/9/2021 10:45 AM, Chuck Lever III wrote:
> 
> 
>> On Apr 9, 2021, at 10:26 AM, Tom Talpey <tom@talpey.com> wrote:
>>
>> On 4/6/2021 7:49 AM, Jason Gunthorpe wrote:
>>> On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
>>>   
>>>> We need to get a better idea what correctness testing has been done,
>>>> and whether positive correctness testing results can be replicated
>>>> on a variety of platforms.
>>> RO has been rolling out slowly on mlx5 over a few years and storage
>>> ULPs are the last to change. eg the mlx5 ethernet driver has had RO
>>> turned on for a long time, userspace HPC applications have been using
>>> it for a while now too.
>>
>> I'd love to see RO be used more, it was always something the RDMA
>> specs supported and carefully architected for. My only concern is
>> that it's difficult to get right, especially when the platforms
>> have been running strictly-ordered for so long. The ULPs need
>> testing, and a lot of it.
>>
>>> We know there are platforms with broken RO implementations (like
>>> Haswell) but the kernel is supposed to globally turn off RO on all
>>> those cases. I'd be a bit surprised if we discover any more from this
>>> series.
>>> On the other hand there are platforms that get huge speed ups from
>>> turning this on, AMD is one example, there are a bunch in the ARM
>>> world too.
>>
>> My belief is that the biggest risk is from situations where completions
>> are batched, and therefore polling is used to detect them without
>> interrupts (which explicitly flush the pipeline). The RO pipeline will completely reorder
>> DMA writes, and consumers which infer ordering from memory contents may
>> break. This can even apply within the provider code, which may attempt
>> to poll WR and CQ structures, and be tripped up.
> 
> You are referring specifically to RPC/RDMA depending on Receive
> completions to guarantee that previous RDMA Writes have been
> retired? Or is there a particular implementation practice in
> the Linux RPC/RDMA code that worries you?

Nothing in the RPC/RDMA code, which is IMO correct. The worry, which
is hopefully unfounded, is that the RO pipeline might not have flushed
when a completion is posted *after* posting an interrupt.

Something like this...

RDMA Write arrives
	PCIe RO Write for data
	PCIe RO Write for data
	...
RDMA Write arrives
	PCIe RO Write for data
	...
RDMA Send arrives
	PCIe RO Write for receive data
	PCIe RO Write for receive descriptor
	PCIe interrupt (flushes RO pipeline for all three ops above)

RPC/RDMA polls CQ
	Reaps receive completion

RDMA Send arrives
	PCIe RO Write for receive data
	PCIe RO write for receive descriptor
	Does *not* interrupt, since CQ not armed

RPC/RDMA continues to poll CQ
	Reaps receive completion
	PCIe RO writes not yet flushed
	Processes incomplete in-memory data
	Bzzzt

Hopefully, the adapter performs a PCIe flushing read, or something
to avoid this when an interrupt is not generated. Alternatively, I'm
overly paranoid.

Tom.

>> The Mellanox adapter, itself, historically has strict in-order DMA
>> semantics, and while it's great to relax that, changing it by default
>> for all consumers is something to consider very cautiously.
>>
>>> Still, obviously people should test on the platforms they have.
>>
>> Yes, and "test" should be taken seriously, with a focus on ULP data integrity.
>> Speedups will mean nothing if the data is ever damaged.
> 
> I agree that data integrity comes first.
> 
> Since I currently don't have facilities to test RO in my lab, the
> community will have to agree on a set of tests and expected results
> that specifically exercise the corner cases you are concerned about.
> 
> 
> --
> Chuck Lever
> 
> 
> 
> 


* Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs
  2021-04-09 15:32             ` Tom Talpey
@ 2021-04-09 16:27               ` Haakon Bugge
  2021-04-09 17:49                 ` Tom Talpey
  0 siblings, 1 reply; 56+ messages in thread
From: Haakon Bugge @ 2021-04-09 16:27 UTC (permalink / raw)
  To: Tom Talpey
  Cc: Chuck Lever III, Jason Gunthorpe, Christoph Hellwig,
	Leon Romanovsky, Doug Ledford, Leon Romanovsky, Adit Ranadive,
	Anna Schumaker, Ariel Elior, Avihai Horon, Bart Van Assche,
	Bernard Metzler, David S. Miller, Dennis Dalessandro,
	Devesh Sharma, Faisal Latif, Jack Wang, Jakub Kicinski,
	Bruce Fields, Jens Axboe, Karsten Graul, Keith Busch, Lijun Ou,
	CIFS, LKML, Linux NFS Mailing List, linux-nvme,
	OFED mailing list, linux-s390, Max Gurtovoy, Max Gurtovoy,
	Md. Haris Iqbal, Michael Guralnik, Michal Kalderon,
	Mike Marciniszyn, Naresh Kumar PBS, Linux-Net,
	Potnuri Bharat Teja, rds-devel, Sagi Grimberg, samba-technical,
	Santosh Shilimkar, Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun



> On 9 Apr 2021, at 17:32, Tom Talpey <tom@talpey.com> wrote:
> 
> On 4/9/2021 10:45 AM, Chuck Lever III wrote:
>>> On Apr 9, 2021, at 10:26 AM, Tom Talpey <tom@talpey.com> wrote:
>>> 
>>> On 4/6/2021 7:49 AM, Jason Gunthorpe wrote:
>>>> On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
>>>>  
>>>>> We need to get a better idea what correctness testing has been done,
>>>>> and whether positive correctness testing results can be replicated
>>>>> on a variety of platforms.
>>>> RO has been rolling out slowly on mlx5 over a few years and storage
>>>> ULPs are the last to change. eg the mlx5 ethernet driver has had RO
>>>> turned on for a long time, userspace HPC applications have been using
>>>> it for a while now too.
>>> 
>>> I'd love to see RO be used more, it was always something the RDMA
>>> specs supported and carefully architected for. My only concern is
>>> that it's difficult to get right, especially when the platforms
>>> have been running strictly-ordered for so long. The ULPs need
>>> testing, and a lot of it.
>>> 
>>>> We know there are platforms with broken RO implementations (like
>>>> Haswell) but the kernel is supposed to globally turn off RO on all
>>>> those cases. I'd be a bit surprised if we discover any more from this
>>>> series.
>>>> On the other hand there are platforms that get huge speed ups from
>>>> turning this on, AMD is one example, there are a bunch in the ARM
>>>> world too.
>>> 
>>> My belief is that the biggest risk is from situations where completions
>>> are batched, and therefore polling is used to detect them without
>>> interrupts (which explicitly flush the pipeline). The RO pipeline will completely reorder
>>> DMA writes, and consumers which infer ordering from memory contents may
>>> break. This can even apply within the provider code, which may attempt
>>> to poll WR and CQ structures, and be tripped up.
>> You are referring specifically to RPC/RDMA depending on Receive
>> completions to guarantee that previous RDMA Writes have been
>> retired? Or is there a particular implementation practice in
>> the Linux RPC/RDMA code that worries you?
> 
> Nothing in the RPC/RDMA code, which is IMO correct. The worry, which
> is hopefully unfounded, is that the RO pipeline might not have flushed
> when a completion is posted *after* posting an interrupt.
> 
> Something like this...
> 
> RDMA Write arrives
> 	PCIe RO Write for data
> 	PCIe RO Write for data
> 	...
> RDMA Write arrives
> 	PCIe RO Write for data
> 	...
> RDMA Send arrives
> 	PCIe RO Write for receive data
> 	PCIe RO Write for receive descriptor

Do you mean the Write of the CQE? It has to be Strongly Ordered for a correct implementation. Then it will ensure prior written RO data has global visibility when the CQE can be observed.



> 	PCIe interrupt (flushes RO pipeline for all three ops above)

Before the interrupt, the HCA will write the EQE (Event Queue Entry). This has to be a Strongly Ordered write to "push" prior written CQEs so that when the EQE is observed, the prior writes of CQEs have global visibility.

And the MSI-X write likewise, to avoid spurious interrupts.



Thxs, Håkon

> 
> RPC/RDMA polls CQ
> 	Reaps receive completion
> 
> RDMA Send arrives
> 	PCIe RO Write for receive data
> 	PCIe RO write for receive descriptor
> 	Does *not* interrupt, since CQ not armed
> 
> RPC/RDMA continues to poll CQ
> 	Reaps receive completion
> 	PCIe RO writes not yet flushed
> 	Processes incomplete in-memory data
> 	Bzzzt
> 
> Hopefully, the adapter performs a PCIe flushing read, or something
> to avoid this when an interrupt is not generated. Alternatively, I'm
> overly paranoid.
> 
> Tom.
> 
>>> The Mellanox adapter, itself, historically has strict in-order DMA
>>> semantics, and while it's great to relax that, changing it by default
>>> for all consumers is something to consider very cautiously.
>>> 
>>>> Still, obviously people should test on the platforms they have.
>>> 
>>> Yes, and "test" should be taken seriously, with a focus on ULP data integrity.
>>> Speedups will mean nothing if the data is ever damaged.
>> I agree that data integrity comes first.
>> Since I currently don't have facilities to test RO in my lab, the
>> community will have to agree on a set of tests and expected results
>> that specifically exercise the corner cases you are concerned about.
>> --
>> Chuck Lever



* Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs
  2021-04-09 14:26         ` Tom Talpey
  2021-04-09 14:45           ` Chuck Lever III
@ 2021-04-09 16:40           ` Jason Gunthorpe
  2021-04-09 17:44             ` Tom Talpey
  1 sibling, 1 reply; 56+ messages in thread
From: Jason Gunthorpe @ 2021-04-09 16:40 UTC (permalink / raw)
  To: Tom Talpey
  Cc: Chuck Lever III, Christoph Hellwig, Leon Romanovsky,
	Doug Ledford, Leon Romanovsky, Adit Ranadive, Anna Schumaker,
	Ariel Elior, Avihai Horon, Bart Van Assche, Bernard Metzler,
	David S. Miller, Dennis Dalessandro, Devesh Sharma, Faisal Latif,
	Jack Wang, Jakub Kicinski, Bruce Fields, Jens Axboe,
	Karsten Graul, Keith Busch, Lijun Ou, CIFS, LKML,
	Linux NFS Mailing List, linux-nvme, linux-rdma, linux-s390,
	Max Gurtovoy, Max Gurtovoy, Md. Haris Iqbal, Michael Guralnik,
	Michal Kalderon, Mike Marciniszyn, Naresh Kumar PBS, Linux-Net,
	Potnuri Bharat Teja, rds-devel, Sagi Grimberg, samba-technical,
	Santosh Shilimkar, Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun

On Fri, Apr 09, 2021 at 10:26:21AM -0400, Tom Talpey wrote:

> My belief is that the biggest risk is from situations where completions
> are batched, and therefore polling is used to detect them without
> interrupts (which explicitly flush the pipeline).

We don't do this in the kernel.

All kernel ULPs only read data after they observe the CQE. We do not
have "last data polling" and our interrupt model does not support some
hacky "interrupt means go and use the data" approach.

ULPs have to be designed this way to use the DMA API properly.
Fencing a DMA before it is completed by the HW will cause IOMMU
errors.

Userspace is a different story, but that will remain as-is with
optional relaxed ordering.

Jason


* Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs
  2021-04-09 16:40           ` Jason Gunthorpe
@ 2021-04-09 17:44             ` Tom Talpey
  0 siblings, 0 replies; 56+ messages in thread
From: Tom Talpey @ 2021-04-09 17:44 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Chuck Lever III, Christoph Hellwig, Leon Romanovsky,
	Doug Ledford, Leon Romanovsky, Adit Ranadive, Anna Schumaker,
	Ariel Elior, Avihai Horon, Bart Van Assche, Bernard Metzler,
	David S. Miller, Dennis Dalessandro, Devesh Sharma, Faisal Latif,
	Jack Wang, Jakub Kicinski, Bruce Fields, Jens Axboe,
	Karsten Graul, Keith Busch, Lijun Ou, CIFS, LKML,
	Linux NFS Mailing List, linux-nvme, linux-rdma, linux-s390,
	Max Gurtovoy, Max Gurtovoy, Md. Haris Iqbal, Michael Guralnik,
	Michal Kalderon, Mike Marciniszyn, Naresh Kumar PBS, Linux-Net,
	Potnuri Bharat Teja, rds-devel, Sagi Grimberg, samba-technical,
	Santosh Shilimkar, Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun

On 4/9/2021 12:40 PM, Jason Gunthorpe wrote:
> On Fri, Apr 09, 2021 at 10:26:21AM -0400, Tom Talpey wrote:
> 
>> My belief is that the biggest risk is from situations where completions
>> are batched, and therefore polling is used to detect them without
>> interrupts (which explicitly flush the pipeline).
> 
> We don't do this in the kernel.
> 
> All kernel ULPs only read data after they observe the CQE. We do not
> have "last data polling" and our interrupt model does not support some
> hacky "interrupt means go and use the data" approach.
> 
> ULPs have to be designed this way to use the DMA API properly.

Yep. Totally agree.

My concern was about the data being written as relaxed, and the CQE
racing it. I'll reply in the other fork.


> Fencing a DMA before it is completed by the HW will cause IOMMU
> errors.
> 
> Userspace is a different story, but that will remain as-is with
> optional relaxed ordering.
> 
> Jason
> 


* Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs
  2021-04-09 16:27               ` Haakon Bugge
@ 2021-04-09 17:49                 ` Tom Talpey
  2021-04-10 13:30                   ` David Laight
  0 siblings, 1 reply; 56+ messages in thread
From: Tom Talpey @ 2021-04-09 17:49 UTC (permalink / raw)
  To: Haakon Bugge
  Cc: Chuck Lever III, Jason Gunthorpe, Christoph Hellwig,
	Leon Romanovsky, Doug Ledford, Leon Romanovsky, Adit Ranadive,
	Anna Schumaker, Ariel Elior, Avihai Horon, Bart Van Assche,
	Bernard Metzler, David S. Miller, Dennis Dalessandro,
	Devesh Sharma, Faisal Latif, Jack Wang, Jakub Kicinski,
	Bruce Fields, Jens Axboe, Karsten Graul, Keith Busch, Lijun Ou,
	CIFS, LKML, Linux NFS Mailing List, linux-nvme,
	OFED mailing list, linux-s390, Max Gurtovoy, Max Gurtovoy,
	Md. Haris Iqbal, Michael Guralnik, Michal Kalderon,
	Mike Marciniszyn, Naresh Kumar PBS, Linux-Net,
	Potnuri Bharat Teja, rds-devel, Sagi Grimberg, samba-technical,
	Santosh Shilimkar, Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun

On 4/9/2021 12:27 PM, Haakon Bugge wrote:
> 
> 
>> On 9 Apr 2021, at 17:32, Tom Talpey <tom@talpey.com> wrote:
>>
>> On 4/9/2021 10:45 AM, Chuck Lever III wrote:
>>>> On Apr 9, 2021, at 10:26 AM, Tom Talpey <tom@talpey.com> wrote:
>>>>
>>>> On 4/6/2021 7:49 AM, Jason Gunthorpe wrote:
>>>>> On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
>>>>>   
>>>>>> We need to get a better idea what correctness testing has been done,
>>>>>> and whether positive correctness testing results can be replicated
>>>>>> on a variety of platforms.
>>>>> RO has been rolling out slowly on mlx5 over a few years and storage
>>>>> ULPs are the last to change. eg the mlx5 ethernet driver has had RO
>>>>> turned on for a long time, userspace HPC applications have been using
>>>>> it for a while now too.
>>>>
>>>> I'd love to see RO be used more, it was always something the RDMA
>>>> specs supported and carefully architected for. My only concern is
>>>> that it's difficult to get right, especially when the platforms
>>>> have been running strictly-ordered for so long. The ULPs need
>>>> testing, and a lot of it.
>>>>
>>>>> We know there are platforms with broken RO implementations (like
>>>>> Haswell) but the kernel is supposed to globally turn off RO on all
>>>>> those cases. I'd be a bit surprised if we discover any more from this
>>>>> series.
>>>>> On the other hand there are platforms that get huge speed ups from
>>>>> turning this on, AMD is one example, there are a bunch in the ARM
>>>>> world too.
>>>>
>>>> My belief is that the biggest risk is from situations where completions
>>>> are batched, and therefore polling is used to detect them without
>>>> interrupts (which explicitly flush the pipeline). The RO pipeline will completely reorder
>>>> DMA writes, and consumers which infer ordering from memory contents may
>>>> break. This can even apply within the provider code, which may attempt
>>>> to poll WR and CQ structures, and be tripped up.
>>> You are referring specifically to RPC/RDMA depending on Receive
>>> completions to guarantee that previous RDMA Writes have been
>>> retired? Or is there a particular implementation practice in
>>> the Linux RPC/RDMA code that worries you?
>>
>> Nothing in the RPC/RDMA code, which is IMO correct. The worry, which
>> is hopefully unfounded, is that the RO pipeline might not have flushed
>> when a completion is posted *after* posting an interrupt.
>>
>> Something like this...
>>
>> RDMA Write arrives
>> 	PCIe RO Write for data
>> 	PCIe RO Write for data
>> 	...
>> RDMA Write arrives
>> 	PCIe RO Write for data
>> 	...
>> RDMA Send arrives
>> 	PCIe RO Write for receive data
>> 	PCIe RO Write for receive descriptor
> 
> Do you mean the Write of the CQE? It has to be Strongly Ordered for a correct implementation. Then it will ensure prior written RO data has global visibility when the CQE can be observed.

I wasn't aware that a strongly-ordered PCIe Write ensures that
prior relaxed-ordered writes complete first. If that's the case, I'm
fine with it - as long as the providers are correctly coded!!

>> 	PCIe interrupt (flushes RO pipeline for all three ops above)
> 
> Before the interrupt, the HCA will write the EQE (Event Queue Entry). This has to be a Strongly Ordered write to "push" prior written CQEs so that when the EQE is observed, the prior writes of CQEs have global visibility.
> 
> And the MSI-X write likewise, to avoid spurious interrupts.

Ok, and yes agreed the same principle would apply.

Is there any implication if a PCIe switch were present on the
motherboard? The switch is allowed to do some creative routing
if the operation is relaxed, correct?

Tom.

> Thxs, Håkon
> 
>>
>> RPC/RDMA polls CQ
>> 	Reaps receive completion
>>
>> RDMA Send arrives
>> 	PCIe RO Write for receive data
>> 	PCIe RO write for receive descriptor
>> 	Does *not* interrupt, since CQ not armed
>>
>> RPC/RDMA continues to poll CQ
>> 	Reaps receive completion
>> 	PCIe RO writes not yet flushed
>> 	Processes incomplete in-memory data
>> 	Bzzzt
>>
>> Hopefully, the adapter performs a PCIe flushing read, or something
>> to avoid this when an interrupt is not generated. Alternatively, I'm
>> overly paranoid.
>>
>> Tom.
>>
>>>> The Mellanox adapter, itself, historically has strict in-order DMA
>>>> semantics, and while it's great to relax that, changing it by default
>>>> for all consumers is something to consider very cautiously.
>>>>
>>>>> Still, obviously people should test on the platforms they have.
>>>>
>>>> Yes, and "test" should be taken seriously, with a focus on ULP data integrity.
>>>> Speedups will mean nothing if the data is ever damaged.
>>> I agree that data integrity comes first.
>>> Since I currently don't have facilities to test RO in my lab, the
>>> community will have to agree on a set of tests and expected results
>>> that specifically exercise the corner cases you are concerned about.
>>> --
>>> Chuck Lever
> 


* RE: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs
  2021-04-09 17:49                 ` Tom Talpey
@ 2021-04-10 13:30                   ` David Laight
  2021-04-12 18:32                     ` Haakon Bugge
  0 siblings, 1 reply; 56+ messages in thread
From: David Laight @ 2021-04-10 13:30 UTC (permalink / raw)
  To: 'Tom Talpey', Haakon Bugge
  Cc: Chuck Lever III, Jason Gunthorpe, Christoph Hellwig,
	Leon Romanovsky, Doug Ledford, Leon Romanovsky, Adit Ranadive,
	Anna Schumaker, Ariel Elior, Avihai Horon, Bart Van Assche,
	Bernard Metzler, David S. Miller, Dennis Dalessandro,
	Devesh Sharma, Faisal Latif, Jack Wang, Jakub Kicinski,
	Bruce Fields, Jens Axboe, Karsten Graul, Keith Busch, Lijun Ou,
	CIFS, LKML, Linux NFS Mailing List, linux-nvme,
	OFED mailing list, linux-s390, Max Gurtovoy, Max Gurtovoy,
	Md. Haris Iqbal, Michael Guralnik, Michal Kalderon,
	Mike Marciniszyn, Naresh Kumar PBS, Linux-Net,
	Potnuri Bharat Teja, rds-devel, Sagi Grimberg, samba-technical,
	Santosh Shilimkar, Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun

From: Tom Talpey
> Sent: 09 April 2021 18:49
> On 4/9/2021 12:27 PM, Haakon Bugge wrote:
> >
> >
> >> On 9 Apr 2021, at 17:32, Tom Talpey <tom@talpey.com> wrote:
> >>
> >> On 4/9/2021 10:45 AM, Chuck Lever III wrote:
> >>>> On Apr 9, 2021, at 10:26 AM, Tom Talpey <tom@talpey.com> wrote:
> >>>>
> >>>> On 4/6/2021 7:49 AM, Jason Gunthorpe wrote:
> >>>>> On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
> >>>>>
> >>>>>> We need to get a better idea what correctness testing has been done,
> >>>>>> and whether positive correctness testing results can be replicated
> >>>>>> on a variety of platforms.
> >>>>> RO has been rolling out slowly on mlx5 over a few years and storage
> >>>>> ULPs are the last to change. eg the mlx5 ethernet driver has had RO
> >>>>> turned on for a long time, userspace HPC applications have been using
> >>>>> it for a while now too.
> >>>>
> >>>> I'd love to see RO be used more, it was always something the RDMA
> >>>> specs supported and carefully architected for. My only concern is
> >>>> that it's difficult to get right, especially when the platforms
> >>>> have been running strictly-ordered for so long. The ULPs need
> >>>> testing, and a lot of it.
> >>>>
> >>>>> We know there are platforms with broken RO implementations (like
> >>>>> Haswell) but the kernel is supposed to globally turn off RO on all
> >>>>> those cases. I'd be a bit surprised if we discover any more from this
> >>>>> series.
> >>>>> On the other hand there are platforms that get huge speed ups from
> >>>>> turning this on, AMD is one example, there are a bunch in the ARM
> >>>>> world too.
> >>>>
> >>>> My belief is that the biggest risk is from situations where completions
> >>>> are batched, and therefore polling is used to detect them without
> >>>> interrupts (which explicitly flush the pipeline). The RO pipeline will completely reorder
> >>>> DMA writes, and consumers which infer ordering from memory contents may
> >>>> break. This can even apply within the provider code, which may attempt
> >>>> to poll WR and CQ structures, and be tripped up.
> >>> You are referring specifically to RPC/RDMA depending on Receive
> >>> completions to guarantee that previous RDMA Writes have been
> >>> retired? Or is there a particular implementation practice in
> >>> the Linux RPC/RDMA code that worries you?
> >>
> >> Nothing in the RPC/RDMA code, which is IMO correct. The worry, which
> >> is hopefully unfounded, is that the RO pipeline might not have flushed
> >> when a completion is posted *after* posting an interrupt.
> >>
> >> Something like this...
> >>
> >> RDMA Write arrives
> >> 	PCIe RO Write for data
> >> 	PCIe RO Write for data
> >> 	...
> >> RDMA Write arrives
> >> 	PCIe RO Write for data
> >> 	...
> >> RDMA Send arrives
> >> 	PCIe RO Write for receive data
> >> 	PCIe RO Write for receive descriptor
> >
> > Do you mean the Write of the CQE? It has to be Strongly Ordered for a correct implementation. Then
> it will ensure prior written RO data has global visibility when the CQE can be observed.
> 
> I wasn't aware that a strongly-ordered PCIe Write will ensure that
> prior relaxed-ordered writes went first. If that's the case, I'm
> fine with it - as long as the providers are correctly coded!!

I remember trying to read the relevant section of the PCIe spec.
(Possibly in a book that was trying to make it easier to understand!)
It is about as clear as mud.

I presume this is all about allowing PCIe targets (eg ethernet cards)
to use relaxed ordering on write requests to host memory.
And that such writes can be completed out of order?

It isn't entirely clear that you aren't talking of letting the
cpu do 'relaxed order' writes to PCIe targets!

For a typical ethernet driver the receive interrupt just means
'go and look at the receive descriptor ring'.
So there is an absolute requirement that the writes for data
buffer complete before the write to the receive descriptor.
There is no requirement for the interrupt (requested after the
descriptor write) to have been seen by the cpu.

Quite often the driver will find the 'receive complete'
descriptor when processing frames from an earlier interrupt
(and nothing to do in response to the interrupt itself).

So the write to the receive descriptor would have to have RO clear
to ensure that all the buffer writes complete first.

(The furthest I've got into PCIe internals was fixing the bug
in some vendor-supplied FPGA logic that failed to correctly
handle multiple data TLP responses to a single read TLP.
Fortunately it wasn't in the hard-IP bit.)

	David



* Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs
  2021-04-06 11:53     ` Jason Gunthorpe
@ 2021-04-11 10:09       ` Max Gurtovoy
  0 siblings, 0 replies; 56+ messages in thread
From: Max Gurtovoy @ 2021-04-11 10:09 UTC (permalink / raw)
  To: Jason Gunthorpe, Leon Romanovsky
  Cc: Honggang LI, Doug Ledford, Adit Ranadive, Anna Schumaker,
	Ariel Elior, Avihai Horon, Bart Van Assche, Bernard Metzler,
	Christoph Hellwig, Chuck Lever, David S. Miller,
	Dennis Dalessandro, Devesh Sharma, Faisal Latif, Jack Wang,
	Jakub Kicinski, J. Bruce Fields, Jens Axboe, Karsten Graul,
	Keith Busch, Lijun Ou, linux-cifs, linux-kernel, linux-nfs,
	linux-nvme, linux-rdma, linux-s390, Max Gurtovoy,
	Md. Haris Iqbal, Michael Guralnik, Michal Kalderon,
	Mike Marciniszyn, Naresh Kumar PBS, netdev, Potnuri Bharat Teja,
	rds-devel, Sagi Grimberg, samba-technical, Santosh Shilimkar,
	Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun


On 4/6/2021 2:53 PM, Jason Gunthorpe wrote:
> On Tue, Apr 06, 2021 at 08:09:43AM +0300, Leon Romanovsky wrote:
>> On Tue, Apr 06, 2021 at 10:37:38AM +0800, Honggang LI wrote:
>>> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
>>>> From: Leon Romanovsky <leonro@nvidia.com>
>>>>
>>>>  From Avihai,
>>>>
>>>> Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
>>>> imposed on PCI transactions, and thus, can improve performance.
>>>>
>>>> Until now, relaxed ordering could be set only by user space applications
>>>> for user MRs. The following patch series enables relaxed ordering for the
>>>> kernel ULPs as well. Relaxed ordering is an optional capability, and as
>>>> such, it is ignored by vendors that don't support it.
>>>>
>>>> The following test results show the performance improvement achieved
>>> Did you test this patchset with CPU does not support relaxed ordering?
>> I don't think so, the CPUs that don't support RO are Intel's fourth/fifth-generation,
>> and they are not interesting from a performance point of view.
>>
>>> We observed significant performance degradation when running perftest with
>>> relaxed ordering enabled on an old CPU.
>>>
>>> https://github.com/linux-rdma/perftest/issues/116
>> The perftest is slightly different, but you pointed to a valid point.
>> We forgot to call pcie_relaxed_ordering_enabled() before setting the RO
>> bit, and arguably this needed to be done in perftest too.
> No, the PCI device should not have the RO bit set in this situation.
> It is something mlx5_core needs to do. We can't push this into
> applications.

pcie_relaxed_ordering_enabled() is called in
drivers/net/ethernet/mellanox/mlx5/core/en_common.c, so it probably
needs to move to mlx5_core in this series.
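A minimal userspace model of that gating (the global stands in for
the kernel's pcie_relaxed_ordering_enabled(); the helper name is
made up, not the actual mlx5_core code):

```c
#include <stdbool.h>

/* Stub for the kernel helper pcie_relaxed_ordering_enabled(pdev):
 * false when the root complex is quirked, e.g. on old Intel parts. */
bool platform_ro_usable;

/* The decision mlx5_core would make once, at probe time: request RO
 * TLPs only when both the device capability and the platform allow
 * it, so individual ULPs never have to check. */
bool mlx5_would_enable_ro(bool hw_supports_ro)
{
	return hw_supports_ro && platform_ro_usable;
}
```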



>
> There should be no performance difference from asking for
> IBV_ACCESS_RELAXED_ORDERING when RO is disabled at the PCI config and
> not asking for it at all.
>
> Either the platform has working relaxed ordering that gives a
> performance gain and the RO config spec bit should be set, or it
> doesn't and the bit should be clear.

Is this the case today?

>
> This is not something to decide in userspace, or in RDMA. At worst it
> becomes another platform specific PCI tunable people have to set.
>
> I thought the old haswell systems were quirked to disable RO globally
> anyhow?
>
> Jason


* Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs
  2021-04-10 13:30                   ` David Laight
@ 2021-04-12 18:32                     ` Haakon Bugge
  2021-04-12 20:20                       ` Tom Talpey
  0 siblings, 1 reply; 56+ messages in thread
From: Haakon Bugge @ 2021-04-12 18:32 UTC (permalink / raw)
  To: David Laight
  Cc: Tom Talpey, Chuck Lever III, Jason Gunthorpe, Christoph Hellwig,
	Leon Romanovsky, Doug Ledford, Leon Romanovsky, Adit Ranadive,
	Anna Schumaker, Ariel Elior, Avihai Horon, Bart Van Assche,
	Bernard Metzler, David S. Miller, Dennis Dalessandro,
	Devesh Sharma, Faisal Latif, Jack Wang, Jakub Kicinski,
	Bruce Fields, Jens Axboe, Karsten Graul, Keith Busch, Lijun Ou,
	CIFS, LKML, Linux NFS Mailing List, linux-nvme,
	OFED mailing list, linux-s390, Max Gurtovoy, Max Gurtovoy,
	Md. Haris Iqbal, Michael Guralnik, Michal Kalderon,
	Mike Marciniszyn, Naresh Kumar PBS, Linux-Net,
	Potnuri Bharat Teja, rds-devel, Sagi Grimberg, samba-technical,
	Santosh Shilimkar, Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun



> On 10 Apr 2021, at 15:30, David Laight <David.Laight@aculab.com> wrote:
> 
> From: Tom Talpey
>> Sent: 09 April 2021 18:49
>> On 4/9/2021 12:27 PM, Haakon Bugge wrote:
>>> 
>>> 
>>>> On 9 Apr 2021, at 17:32, Tom Talpey <tom@talpey.com> wrote:
>>>> 
>>>> On 4/9/2021 10:45 AM, Chuck Lever III wrote:
>>>>>> On Apr 9, 2021, at 10:26 AM, Tom Talpey <tom@talpey.com> wrote:
>>>>>> 
>>>>>> On 4/6/2021 7:49 AM, Jason Gunthorpe wrote:
>>>>>>> On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
>>>>>>> 
>>>>>>>> We need to get a better idea what correctness testing has been done,
>>>>>>>> and whether positive correctness testing results can be replicated
>>>>>>>> on a variety of platforms.
>>>>>>> RO has been rolling out slowly on mlx5 over a few years and storage
>>>>>>> ULPs are the last to change. eg the mlx5 ethernet driver has had RO
>>>>>>> turned on for a long time, userspace HPC applications have been using
>>>>>>> it for a while now too.
>>>>>> 
>>>>>> I'd love to see RO be used more, it was always something the RDMA
>>>>>> specs supported and carefully architected for. My only concern is
>>>>>> that it's difficult to get right, especially when the platforms
>>>>>> have been running strictly-ordered for so long. The ULPs need
>>>>>> testing, and a lot of it.
>>>>>> 
>>>>>>> We know there are platforms with broken RO implementations (like
>>>>>>> Haswell) but the kernel is supposed to globally turn off RO on all
>>>>>>> those cases. I'd be a bit surprised if we discover any more from this
>>>>>>> series.
>>>>>>> On the other hand there are platforms that get huge speed ups from
>>>>>>> turning this on, AMD is one example, there are a bunch in the ARM
>>>>>>> world too.
>>>>>> 
>>>>>> My belief is that the biggest risk is from situations where completions
>>>>>> are batched, and therefore polling is used to detect them without
>>>>>> interrupts (which explicitly). The RO pipeline will completely reorder
>>>>>> DMA writes, and consumers which infer ordering from memory contents may
>>>>>> break. This can even apply within the provider code, which may attempt
>>>>>> to poll WR and CQ structures, and be tripped up.
>>>>> You are referring specifically to RPC/RDMA depending on Receive
>>>>> completions to guarantee that previous RDMA Writes have been
>>>>> retired? Or is there a particular implementation practice in
>>>>> the Linux RPC/RDMA code that worries you?
>>>> 
>>>> Nothing in the RPC/RDMA code, which is IMO correct. The worry, which
>>>> is hopefully unfounded, is that the RO pipeline might not have flushed
>>>> when a completion is posted *after* posting an interrupt.
>>>> 
>>>> Something like this...
>>>> 
>>>> RDMA Write arrives
>>>> 	PCIe RO Write for data
>>>> 	PCIe RO Write for data
>>>> 	...
>>>> RDMA Write arrives
>>>> 	PCIe RO Write for data
>>>> 	...
>>>> RDMA Send arrives
>>>> 	PCIe RO Write for receive data
>>>> 	PCIe RO Write for receive descriptor
>>> 
>>> Do you mean the Write of the CQE? It has to be Strongly Ordered for a correct implementation. Then
>>> it will ensure prior written RO data has global visibility when the CQE can be observed.
>> 
>> I wasn't aware that a strongly-ordered PCIe Write will ensure that
>> prior relaxed-ordered writes went first. If that's the case, I'm
>> fine with it - as long as the providers are correctly coded!!

The PCIe spec (Table Ordering Rules Summary) is quite clear here (A Posted request is Memory Write Request in this context):

	A Posted Request must not pass another Posted Request unless A2b applies.

	A2b: A Posted Request with RO Set is permitted to pass another Posted Request.
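Modeled as code, the posted-vs-posted entry of the table reads as
follows (a sketch of the rule, not spec text):

```c
#include <stdbool.h>

/* PCIe ordering table, posted-vs-posted entry (rules A2/A2b): a
 * later Posted Request (Memory Write) may pass an earlier Posted
 * Request only when the *later* request has the RO attribute set. */
bool posted_may_pass(bool later_has_ro)
{
	return later_has_ro;
}
```

So a CQE written without RO cannot pass the RO data writes queued
before it, which is the guarantee at issue here.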


Thxs, Håkon

> 
> I remember trying to read the relevant section of the PCIe spec.
> (Possibly in a book that was trying to make it easier to understand!)
> It is about as clear as mud.
> 
> I presume this is all about allowing PCIe targets (eg ethernet cards)
> to use relaxed ordering on write requests to host memory.
> And that such writes can be completed out of order?
> 
> It isn't entirely clear that you aren't talking of letting the
> cpu do 'relaxed order' writes to PCIe targets!
> 
> For a typical ethernet driver the receive interrupt just means
> 'go and look at the receive descriptor ring'.
> So there is an absolute requirement that the writes for data
> buffer complete before the write to the receive descriptor.
> There is no requirement for the interrupt (requested after the
> descriptor write) to have been seen by the cpu.
> 
> Quite often the driver will find the 'receive complete'
> descriptor when processing frames from an earlier interrupt
> (and nothing to do in response to the interrupt itself).
> 
> So the write to the receive descriptor would have to have RO clear
> to ensure that all the buffer writes complete first.
> 
> (The furthest I've got into PCIe internals was fixing the bug
> in some vendor-supplied FPGA logic that failed to correctly
> handle multiple data TLP responses to a single read TLP.
> Fortunately it wasn't in the hard-IP bit.)
> 
> 	David
> 
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)



* Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs
  2021-04-12 18:32                     ` Haakon Bugge
@ 2021-04-12 20:20                       ` Tom Talpey
  2021-04-12 22:48                         ` Jason Gunthorpe
  0 siblings, 1 reply; 56+ messages in thread
From: Tom Talpey @ 2021-04-12 20:20 UTC (permalink / raw)
  To: Haakon Bugge, David Laight
  Cc: Chuck Lever III, Jason Gunthorpe, Christoph Hellwig,
	Leon Romanovsky, Doug Ledford, Leon Romanovsky, Adit Ranadive,
	Anna Schumaker, Ariel Elior, Avihai Horon, Bart Van Assche,
	Bernard Metzler, David S. Miller, Dennis Dalessandro,
	Devesh Sharma, Faisal Latif, Jack Wang, Jakub Kicinski,
	Bruce Fields, Jens Axboe, Karsten Graul, Keith Busch, Lijun Ou,
	CIFS, LKML, Linux NFS Mailing List, linux-nvme,
	OFED mailing list, linux-s390, Max Gurtovoy, Max Gurtovoy,
	Md. Haris Iqbal, Michael Guralnik, Michal Kalderon,
	Mike Marciniszyn, Naresh Kumar PBS, Linux-Net,
	Potnuri Bharat Teja, rds-devel, Sagi Grimberg, samba-technical,
	Santosh Shilimkar, Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun

On 4/12/2021 2:32 PM, Haakon Bugge wrote:
> 
> 
>> On 10 Apr 2021, at 15:30, David Laight <David.Laight@aculab.com> wrote:
>>
>> From: Tom Talpey
>>> Sent: 09 April 2021 18:49
>>> On 4/9/2021 12:27 PM, Haakon Bugge wrote:
>>>>
>>>>
>>>>> On 9 Apr 2021, at 17:32, Tom Talpey <tom@talpey.com> wrote:
>>>>>
>>>>> On 4/9/2021 10:45 AM, Chuck Lever III wrote:
>>>>>>> On Apr 9, 2021, at 10:26 AM, Tom Talpey <tom@talpey.com> wrote:
>>>>>>>
>>>>>>> On 4/6/2021 7:49 AM, Jason Gunthorpe wrote:
>>>>>>>> On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
>>>>>>>>
>>>>>>>>> We need to get a better idea what correctness testing has been done,
>>>>>>>>> and whether positive correctness testing results can be replicated
>>>>>>>>> on a variety of platforms.
>>>>>>>> RO has been rolling out slowly on mlx5 over a few years and storage
>>>>>>>> ULPs are the last to change. eg the mlx5 ethernet driver has had RO
>>>>>>>> turned on for a long time, userspace HPC applications have been using
>>>>>>>> it for a while now too.
>>>>>>>
>>>>>>> I'd love to see RO be used more, it was always something the RDMA
>>>>>>> specs supported and carefully architected for. My only concern is
>>>>>>> that it's difficult to get right, especially when the platforms
>>>>>>> have been running strictly-ordered for so long. The ULPs need
>>>>>>> testing, and a lot of it.
>>>>>>>
>>>>>>>> We know there are platforms with broken RO implementations (like
>>>>>>>> Haswell) but the kernel is supposed to globally turn off RO on all
>>>>>>>> those cases. I'd be a bit surprised if we discover any more from this
>>>>>>>> series.
>>>>>>>> On the other hand there are platforms that get huge speed ups from
>>>>>>>> turning this on, AMD is one example, there are a bunch in the ARM
>>>>>>>> world too.
>>>>>>>
>>>>>>> My belief is that the biggest risk is from situations where completions
>>>>>>> are batched, and therefore polling is used to detect them without
>>>>>>> interrupts (which explicitly). The RO pipeline will completely reorder
>>>>>>> DMA writes, and consumers which infer ordering from memory contents may
>>>>>>> break. This can even apply within the provider code, which may attempt
>>>>>>> to poll WR and CQ structures, and be tripped up.
>>>>>> You are referring specifically to RPC/RDMA depending on Receive
>>>>>> completions to guarantee that previous RDMA Writes have been
>>>>>> retired? Or is there a particular implementation practice in
>>>>>> the Linux RPC/RDMA code that worries you?
>>>>>
>>>>> Nothing in the RPC/RDMA code, which is IMO correct. The worry, which
>>>>> is hopefully unfounded, is that the RO pipeline might not have flushed
>>>>> when a completion is posted *after* posting an interrupt.
>>>>>
>>>>> Something like this...
>>>>>
>>>>> RDMA Write arrives
>>>>> 	PCIe RO Write for data
>>>>> 	PCIe RO Write for data
>>>>> 	...
>>>>> RDMA Write arrives
>>>>> 	PCIe RO Write for data
>>>>> 	...
>>>>> RDMA Send arrives
>>>>> 	PCIe RO Write for receive data
>>>>> 	PCIe RO Write for receive descriptor
>>>>
>>>> Do you mean the Write of the CQE? It has to be Strongly Ordered for a correct implementation. Then
>>>> it will ensure prior written RO data has global visibility when the CQE can be observed.
>>>
>>> I wasn't aware that a strongly-ordered PCIe Write will ensure that
>>> prior relaxed-ordered writes went first. If that's the case, I'm
>>> fine with it - as long as the providers are correctly coded!!
> 
> The PCIe spec (Table Ordering Rules Summary) is quite clear here (A Posted request is Memory Write Request in this context):
> 
> 	A Posted Request must not pass another Posted Request unless A2b applies.
> 
> 	A2b: A Posted Request with RO Set is permitted to pass another Posted Request.
> 
> 
> Thxs, Håkon

Ok, good - a non-RO write (for example, to a CQE), or an interrupt
(which would be similarly non-RO), will "get behind" all prior writes.

So the issue is only in testing all the providers and platforms,
to be sure this new behavior isn't tickling anything that went
unnoticed all along, because no RDMA provider ever issued RO.

Honestly, the Haswell sounds like a great first candidate, because
if it has a known-broken RO behavior, verifying that it works with
this change is highly important. I'd have greater confidence in newer
platforms, in other words. They *all* have to work, provably.

Tom.

>> I remember trying to read the relevant section of the PCIe spec.
>> (Possibly in a book that was trying to make it easier to understand!)
>> It is about as clear as mud.
>>
>> I presume this is all about allowing PCIe targets (eg ethernet cards)
>> to use relaxed ordering on write requests to host memory.
>> And that such writes can be completed out of order?
>>
>> It isn't entirely clear that you aren't talking of letting the
>> cpu do 'relaxed order' writes to PCIe targets!
>>
>> For a typical ethernet driver the receive interrupt just means
>> 'go and look at the receive descriptor ring'.
>> So there is an absolute requirement that the writes for data
>> buffer complete before the write to the receive descriptor.
>> There is no requirement for the interrupt (requested after the
>> descriptor write) to have been seen by the cpu.
>>
>> Quite often the driver will find the 'receive complete'
>> descriptor when processing frames from an earlier interrupt
>> (and nothing to do in response to the interrupt itself).
>>
>> So the write to the receive descriptor would have to have RO clear
>> to ensure that all the buffer writes complete first.
>>
>> (The furthest I've got into PCIe internals was fixing the bug
>> in some vendor-supplied FPGA logic that failed to correctly
>> handle multiple data TLP responses to a single read TLP.
>> Fortunately it wasn't in the hard-IP bit.)
>>
>> 	David
>>
>> -
>> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
>> Registration No: 1397386 (Wales)
> 


* Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs
  2021-04-12 20:20                       ` Tom Talpey
@ 2021-04-12 22:48                         ` Jason Gunthorpe
  2021-04-14 14:16                           ` Tom Talpey
  0 siblings, 1 reply; 56+ messages in thread
From: Jason Gunthorpe @ 2021-04-12 22:48 UTC (permalink / raw)
  To: Tom Talpey
  Cc: Haakon Bugge, David Laight, Chuck Lever III, Christoph Hellwig,
	Leon Romanovsky, Doug Ledford, Leon Romanovsky, Adit Ranadive,
	Anna Schumaker, Ariel Elior, Avihai Horon, Bart Van Assche,
	Bernard Metzler, David S. Miller, Dennis Dalessandro,
	Devesh Sharma, Faisal Latif, Jack Wang, Jakub Kicinski,
	Bruce Fields, Jens Axboe, Karsten Graul, Keith Busch, Lijun Ou,
	CIFS, LKML, Linux NFS Mailing List, linux-nvme,
	OFED mailing list, linux-s390, Max Gurtovoy, Max Gurtovoy,
	Md. Haris Iqbal, Michael Guralnik, Michal Kalderon,
	Mike Marciniszyn, Naresh Kumar PBS, Linux-Net,
	Potnuri Bharat Teja, rds-devel, Sagi Grimberg, samba-technical,
	Santosh Shilimkar, Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun

On Mon, Apr 12, 2021 at 04:20:47PM -0400, Tom Talpey wrote:

> So the issue is only in testing all the providers and platforms,
> to be sure this new behavior isn't tickling anything that went
> unnoticed all along, because no RDMA provider ever issued RO.

The mlx5 ethernet driver has run in RO mode for a long time, and it
operates in basically the same way as RDMA. The issues with Haswell
have been worked out there already.

The only open question is if the ULPs have errors in their
implementation, which I don't think we can find out until we apply
this series and people start running their tests aggressively.

Jason


* Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs
  2021-04-12 22:48                         ` Jason Gunthorpe
@ 2021-04-14 14:16                           ` Tom Talpey
  2021-04-14 14:41                             ` David Laight
  2021-04-14 14:44                             ` Jason Gunthorpe
  0 siblings, 2 replies; 56+ messages in thread
From: Tom Talpey @ 2021-04-14 14:16 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Haakon Bugge, David Laight, Chuck Lever III, Christoph Hellwig,
	Leon Romanovsky, Doug Ledford, Leon Romanovsky, Adit Ranadive,
	Anna Schumaker, Ariel Elior, Avihai Horon, Bart Van Assche,
	Bernard Metzler, David S. Miller, Dennis Dalessandro,
	Devesh Sharma, Faisal Latif, Jack Wang, Jakub Kicinski,
	Bruce Fields, Jens Axboe, Karsten Graul, Keith Busch, Lijun Ou,
	CIFS, LKML, Linux NFS Mailing List, linux-nvme,
	OFED mailing list, linux-s390, Max Gurtovoy, Max Gurtovoy,
	Md. Haris Iqbal, Michael Guralnik, Michal Kalderon,
	Mike Marciniszyn, Naresh Kumar PBS, Linux-Net,
	Potnuri Bharat Teja, rds-devel, Sagi Grimberg, samba-technical,
	Santosh Shilimkar, Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun

On 4/12/2021 6:48 PM, Jason Gunthorpe wrote:
> On Mon, Apr 12, 2021 at 04:20:47PM -0400, Tom Talpey wrote:
> 
>> So the issue is only in testing all the providers and platforms,
>> to be sure this new behavior isn't tickling anything that went
>> unnoticed all along, because no RDMA provider ever issued RO.
> 
> The mlx5 ethernet driver has run in RO mode for a long time, and it
> operates in basically the same way as RDMA. The issues with Haswell
> have been worked out there already.
> 
> The only open question is if the ULPs have errors in their
> implementation, which I don't think we can find out until we apply
> this series and people start running their tests aggressively.

I agree that the core RO support should go in. But turning it on
by default for a ULP should be the decision of each ULP maintainer.
It's a huge risk to shift all the storage drivers overnight. How
do you propose to ensure the aggressive testing happens?

One thing that worries me is the patch02 on-by-default for the dma_lkey.
There's no way for a ULP to prevent IB_ACCESS_RELAXED_ORDERING
from being set in __ib_alloc_pd().
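The pattern in question looks roughly like this (a userspace model;
the flag value and helper name are illustrative, not the actual
patch02 code):

```c
/* Illustrative value; the real flag lives in include/rdma/ib_verbs.h. */
#define IB_ACCESS_RELAXED_ORDERING (1u << 20)

/* Model of the concern: the core ORs RO into the dma_lkey access
 * mask unconditionally, leaving the ULP no way to mask it off. */
unsigned int pd_dma_lkey_access(unsigned int ulp_flags)
{
	return ulp_flags | IB_ACCESS_RELAXED_ORDERING;
}
```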

Tom.




* RE: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs
  2021-04-14 14:16                           ` Tom Talpey
@ 2021-04-14 14:41                             ` David Laight
  2021-04-14 14:49                               ` Jason Gunthorpe
  2021-04-14 14:44                             ` Jason Gunthorpe
  1 sibling, 1 reply; 56+ messages in thread
From: David Laight @ 2021-04-14 14:41 UTC (permalink / raw)
  To: 'Tom Talpey', Jason Gunthorpe
  Cc: Haakon Bugge, Chuck Lever III, Christoph Hellwig,
	Leon Romanovsky, Doug Ledford, Leon Romanovsky, Adit Ranadive,
	Anna Schumaker, Ariel Elior, Avihai Horon, Bart Van Assche,
	Bernard Metzler, David S. Miller, Dennis Dalessandro,
	Devesh Sharma, Faisal Latif, Jack Wang, Jakub Kicinski,
	Bruce Fields, Jens Axboe, Karsten Graul, Keith Busch, Lijun Ou,
	CIFS, LKML, Linux NFS Mailing List, linux-nvme,
	OFED mailing list, linux-s390, Max Gurtovoy, Max Gurtovoy,
	Md. Haris Iqbal, Michael Guralnik, Michal Kalderon,
	Mike Marciniszyn, Naresh Kumar PBS, Linux-Net,
	Potnuri Bharat Teja, rds-devel, Sagi Grimberg, samba-technical,
	Santosh Shilimkar, Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun

From: Tom Talpey
> Sent: 14 April 2021 15:16
> 
> On 4/12/2021 6:48 PM, Jason Gunthorpe wrote:
> > On Mon, Apr 12, 2021 at 04:20:47PM -0400, Tom Talpey wrote:
> >
> >> So the issue is only in testing all the providers and platforms,
> >> to be sure this new behavior isn't tickling anything that went
> >> unnoticed all along, because no RDMA provider ever issued RO.
> >
> > The mlx5 ethernet driver has run in RO mode for a long time, and it
> > operates in basically the same way as RDMA. The issues with Haswell
> > have been worked out there already.
> >
> > The only open question is if the ULPs have errors in their
> > implementation, which I don't think we can find out until we apply
> > this series and people start running their tests aggressively.
> 
> I agree that the core RO support should go in. But turning it on
> by default for a ULP should be the decision of each ULP maintainer.
> It's a huge risk to shift all the storage drivers overnight. How
> do you propose to ensure the aggressive testing happens?
> 
> One thing that worries me is the patch02 on-by-default for the dma_lkey.
> There's no way for a ULP to prevent IB_ACCESS_RELAXED_ORDERING
> from being set in __ib_alloc_pd().

What is a ULP in this context?

I've presumed that this is all about getting PCIe targets (ie cards)
to set the RO (relaxed ordering) bit in some of the write TLP they
generate for writing to host memory?

So whatever driver initialises the target needs to configure whatever
target-specific register enables the RO transfers themselves.

After that there could be flags in the PCIe config space of the target
and any bridges that clear the RO flag.

There could also be flags in the bridges and root complex to ignore
the RO flag even if it is set.

Then the Linux kernel can have option(s) to tell the driver not
to enable RO - even though the driver believes it should all work.
This could be a single global flag, or fine-grained in some way.

So what exactly is this patch series doing?

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


* Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs
  2021-04-14 14:16                           ` Tom Talpey
  2021-04-14 14:41                             ` David Laight
@ 2021-04-14 14:44                             ` Jason Gunthorpe
  1 sibling, 0 replies; 56+ messages in thread
From: Jason Gunthorpe @ 2021-04-14 14:44 UTC (permalink / raw)
  To: Tom Talpey
  Cc: Haakon Bugge, David Laight, Chuck Lever III, Christoph Hellwig,
	Leon Romanovsky, Doug Ledford, Leon Romanovsky, Adit Ranadive,
	Anna Schumaker, Ariel Elior, Avihai Horon, Bart Van Assche,
	Bernard Metzler, David S. Miller, Dennis Dalessandro,
	Devesh Sharma, Faisal Latif, Jack Wang, Jakub Kicinski,
	Bruce Fields, Jens Axboe, Karsten Graul, Keith Busch, Lijun Ou,
	CIFS, LKML, Linux NFS Mailing List, linux-nvme,
	OFED mailing list, linux-s390, Max Gurtovoy, Max Gurtovoy,
	Md. Haris Iqbal, Michael Guralnik, Michal Kalderon,
	Mike Marciniszyn, Naresh Kumar PBS, Linux-Net,
	Potnuri Bharat Teja, rds-devel, Sagi Grimberg, samba-technical,
	Santosh Shilimkar, Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun

On Wed, Apr 14, 2021 at 10:16:28AM -0400, Tom Talpey wrote:
> On 4/12/2021 6:48 PM, Jason Gunthorpe wrote:
> > On Mon, Apr 12, 2021 at 04:20:47PM -0400, Tom Talpey wrote:
> > 
> > > So the issue is only in testing all the providers and platforms,
> > > to be sure this new behavior isn't tickling anything that went
> > > unnoticed all along, because no RDMA provider ever issued RO.
> > 
> > The mlx5 ethernet driver has run in RO mode for a long time, and it
> > operates in basically the same way as RDMA. The issues with Haswell
> > have been worked out there already.
> > 
> > The only open question is if the ULPs have errors in their
> > implementation, which I don't think we can find out until we apply
> > this series and people start running their tests aggressively.
> 
> I agree that the core RO support should go in. But turning it on
> by default for a ULP should be the decision of each ULP maintainer.
> It's a huge risk to shift all the storage drivers overnight. How
> do you propose to ensure the aggressive testing happens?

Realistically we do test most of the RDMA storage ULPs at NVIDIA over
mlx5, which is the only HW that will enable this for now.

I disagree it is a "huge risk".

Additional wider testing is welcomed and can happen over the 16-week
release cycle for a kernel. I would aim to get the relaxed ordering
changes merged to linux-next a week or so after the merge window.

Further testing happens before these changes would get picked up in a
distro on something like MLNX_OFED.

I don't think we need to make the patch design worse or overthink the
submission process for something that, so far, hasn't discovered any
issues and already has a proven track record in other ULPs.

Any storage ULP that has a problem here is mis-using verbs and the DMA
API and thus has an existing data-corruption bug that they are simply
lucky to have not yet discovered.

> One thing that worries me is the patch02 on-by-default for the dma_lkey.
> There's no way for a ULP to prevent IB_ACCESS_RELAXED_ORDERING
> from being set in __ib_alloc_pd().

The ULPs are being forced into relaxed_ordering. They don't get to
turn it off one by one. The v2 will be more explicit about this as
there will be no ULP patches, just the verbs core code being updated.

Jason


* Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs
  2021-04-14 14:41                             ` David Laight
@ 2021-04-14 14:49                               ` Jason Gunthorpe
  0 siblings, 0 replies; 56+ messages in thread
From: Jason Gunthorpe @ 2021-04-14 14:49 UTC (permalink / raw)
  To: David Laight
  Cc: 'Tom Talpey',
	Haakon Bugge, Chuck Lever III, Christoph Hellwig,
	Leon Romanovsky, Doug Ledford, Leon Romanovsky, Adit Ranadive,
	Anna Schumaker, Ariel Elior, Avihai Horon, Bart Van Assche,
	Bernard Metzler, David S. Miller, Dennis Dalessandro,
	Devesh Sharma, Faisal Latif, Jack Wang, Jakub Kicinski,
	Bruce Fields, Jens Axboe, Karsten Graul, Keith Busch, Lijun Ou,
	CIFS, LKML, Linux NFS Mailing List, linux-nvme,
	OFED mailing list, linux-s390, Max Gurtovoy, Max Gurtovoy,
	Md. Haris Iqbal, Michael Guralnik, Michal Kalderon,
	Mike Marciniszyn, Naresh Kumar PBS, Linux-Net,
	Potnuri Bharat Teja, rds-devel, Sagi Grimberg, samba-technical,
	Santosh Shilimkar, Selvin Xavier, Shiraz Saleem, Somnath Kotur,
	Sriharsha Basavapatna, Steve French, Trond Myklebust,
	VMware PV-Drivers, Weihang Li, Yishai Hadas, Zhu Yanjun

On Wed, Apr 14, 2021 at 02:41:52PM +0000, David Laight wrote:

> So whatever driver initialises the target needs to configure whatever
> target-specific register enables the RO transfers themselves.

RDMA in general, and mlx5 in particular, is a layered design:

mlx5_core <- owns the PCI function, should turn on RO at the PCI
             function level
 mlx5_en  <- Commands the chip to use RO for queues used in ethernet
ib_core
  ib_uverbs
    mlx5_ib <- Commands the chip to use RO for some queues used in
               userspace
  ib_srp* <- A ULP driver built on RDMA - this patch commands the chip
             to use RO on SRP queues
  nvme-rdma <- Ditto
  ib_iser <- Ditto
  rds <- Ditto

So this series is about expanding the set of queues running on mlx5
that have RO turned when the PCI function is already running with RO
enabled.

We want as many queues as possible RO enabled because it brings big
performance wins.

Jason


Thread overview: 56+ messages
2021-04-05  5:23 [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs Leon Romanovsky
2021-04-05  5:23 ` [PATCH rdma-next 01/10] RDMA: Add access flags to ib_alloc_mr() and ib_mr_pool_init() Leon Romanovsky
2021-04-05 13:46   ` Christoph Hellwig
2021-04-06  5:24     ` Leon Romanovsky
2021-04-05 15:27   ` Bart Van Assche
2021-04-06  5:23     ` Leon Romanovsky
2021-04-06  5:27       ` Christoph Hellwig
2021-04-06  5:58         ` Leon Romanovsky
2021-04-06 12:13           ` Jason Gunthorpe
2021-04-06 12:30             ` Christoph Hellwig
2021-04-06 14:04               ` Jason Gunthorpe
2021-04-06 14:15                 ` Christoph Hellwig
2021-04-06 14:40                   ` Jason Gunthorpe
2021-04-06 14:54                     ` Christoph Hellwig
2021-04-06 15:03                       ` Christoph Hellwig
2021-04-07 18:28                       ` Jason Gunthorpe
2021-04-05  5:23 ` [PATCH rdma-next 02/10] RDMA/core: Enable Relaxed Ordering in __ib_alloc_pd() Leon Romanovsky
2021-04-05 18:01   ` Tom Talpey
2021-04-05 20:40     ` Adit Ranadive
2021-04-06  6:28     ` Leon Romanovsky
2021-04-05  5:23 ` [PATCH rdma-next 03/10] RDMA/iser: Enable Relaxed Ordering Leon Romanovsky
2021-04-05  5:23 ` [PATCH rdma-next 04/10] RDMA/rtrs: " Leon Romanovsky
2021-04-05  5:23 ` [PATCH rdma-next 05/10] RDMA/srp: " Leon Romanovsky
2021-04-05  5:24 ` [PATCH rdma-next 06/10] nvme-rdma: " Leon Romanovsky
2021-04-05  5:24 ` [PATCH rdma-next 07/10] cifs: smbd: " Leon Romanovsky
2021-04-05  5:24 ` [PATCH rdma-next 08/10] net/rds: " Leon Romanovsky
2021-04-05  5:24 ` [PATCH rdma-next 09/10] net/smc: " Leon Romanovsky
2021-04-05  5:24 ` [PATCH rdma-next 10/10] xprtrdma: " Leon Romanovsky
2021-04-05 13:41 ` [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs Christoph Hellwig
2021-04-05 14:08   ` Leon Romanovsky
2021-04-05 16:11     ` Santosh Shilimkar
2021-04-05 17:54     ` Tom Talpey
2021-04-05 20:07   ` Jason Gunthorpe
2021-04-05 23:42     ` Chuck Lever III
2021-04-05 23:50       ` Keith Busch
2021-04-06  5:12       ` Leon Romanovsky
2021-04-06 11:49       ` Jason Gunthorpe
2021-04-09 14:26         ` Tom Talpey
2021-04-09 14:45           ` Chuck Lever III
2021-04-09 15:32             ` Tom Talpey
2021-04-09 16:27               ` Haakon Bugge
2021-04-09 17:49                 ` Tom Talpey
2021-04-10 13:30                   ` David Laight
2021-04-12 18:32                     ` Haakon Bugge
2021-04-12 20:20                       ` Tom Talpey
2021-04-12 22:48                         ` Jason Gunthorpe
2021-04-14 14:16                           ` Tom Talpey
2021-04-14 14:41                             ` David Laight
2021-04-14 14:49                               ` Jason Gunthorpe
2021-04-14 14:44                             ` Jason Gunthorpe
2021-04-09 16:40           ` Jason Gunthorpe
2021-04-09 17:44             ` Tom Talpey
2021-04-06  2:37 ` Honggang LI
2021-04-06  5:09   ` Leon Romanovsky
2021-04-06 11:53     ` Jason Gunthorpe
2021-04-11 10:09       ` Max Gurtovoy
