* [PATCH 0/2] Performance Improvements for Secured Mode NVMe over Fabrics and other RDMA ULPs
@ 2017-10-29 16:38 ` idanb
  0 siblings, 0 replies; 41+ messages in thread
From: idanb-VPRAkNaXOzVWk0Htik3J/w @ 2017-10-29 16:38 UTC (permalink / raw)
  To: leon-DgEjT+Ai2ygdnm+yROfE0A,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, hch-jcswGhMUV9g,
	sagi-NQWnxTmZq1alnMjI0IkVqw, linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: maxg-VPRAkNaXOzVWk0Htik3J/w, idanb-VPRAkNaXOzVWk0Htik3J/w

From: Idan Burstein <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

The following two patches include performance
optimizations for RDMA kernel ULPs that register/invalidate
memory per IO (e.g. NVMe over Fabrics, iSER, NFS over RDMA).
The patches improve the IOPS that can be achieved by a single
network adapter with the above ULPs; specifically for NVMe over
Fabrics, IOPS for small IO reads improved by a factor of 5
using a single network adapter in Linux.

The first improvement, which is relevant to mlx5 RDMA adapters,
changes the way a reg_wr is posted to the adapter by passing the
KLM/MTT list inline (within the work request) instead of using
scatter-gather.

After the first bottleneck was removed, a new bottleneck
was uncovered in send-with-invalidate processing in the receive
path of the ConnectX-5 (mlx5) adapter; therefore a new module
parameter was added to nvme_rdma for disabling remote
invalidation in that case.

We are currently investigating write IOPS optimizations as
well; stay tuned for more patches in the pipeline.
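
As a usage sketch, once patch 2/2 is applied the behavior can be
chosen at module load time (module and parameter names as introduced
by the patch; the sysfs path assumes the standard module_param layout):

```shell
# Disable remote invalidation, falling back to local invalidation:
modprobe nvme_rdma remote_invalidation=0

# The parameter is read-only (0444), so it can be inspected but not
# changed at runtime:
cat /sys/module/nvme_rdma/parameters/remote_invalidation
```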

Idan Burstein (2):
  IB/mlx5: posting klm/mtt list inline in the send queue for reg_wr
  nvme-rdma: Add remote_invalidation module parameter

 drivers/infiniband/hw/mlx5/qp.c | 41 +++++++++++++++++++++++++++++++++++------
 drivers/nvme/host/rdma.c        | 10 ++++++++--
 2 files changed, 43 insertions(+), 8 deletions(-)

-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 41+ messages in thread


* [PATCH 1/2] IB/mlx5: posting klm/mtt list inline in the send queue for reg_wr
  2017-10-29 16:38 ` idanb
@ 2017-10-29 16:38     ` idanb
  -1 siblings, 0 replies; 41+ messages in thread
From: idanb-VPRAkNaXOzVWk0Htik3J/w @ 2017-10-29 16:38 UTC (permalink / raw)
  To: leon-DgEjT+Ai2ygdnm+yROfE0A,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, hch-jcswGhMUV9g,
	sagi-NQWnxTmZq1alnMjI0IkVqw, linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: maxg-VPRAkNaXOzVWk0Htik3J/w, idanb-VPRAkNaXOzVWk0Htik3J/w

From: Idan Burstein <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Like most kernel RDMA ULPs, NVMe over Fabrics in its secure
"register_always" mode registers and invalidates the user buffer
upon each IO.

Today the mlx5 driver posts the registration work request
using a scatter/gather entry for the MTT/KLM list.

Fetching the MTT/KLM list becomes the bottleneck for the
number of IO operations the NVMe over Fabrics host driver
can perform on a single adapter, as shown below.

This patch adds support for an inline registration work
request for MTT/KLM lists of size <= 64B.

The result for NVMe over Fabrics is a x3.6 increase for small
IOs as shown below; I expect other ULPs (e.g. iSER, NFS over RDMA)
to see improved performance as well.

The following results were taken against a single nvme over fabrics
subsystem with a single namespace backed by null_blk:

Block Size       s/g reg_wr            inline reg_wr
++++++++++     +++++++++++++++        ++++++++++++++++
512B            1446.6K/8.57%          5224.2K/76.21%
1KB             1390.6K/8.5%           4656.7K/71.69%
2KB             1343.8K/8.6%           3410.3K/38.96%
4KB             1254.8K/8.39%          2033.6K/15.86%
8KB             1079.5K/7.7%           1143.1K/7.27%
16KB            603.4K/3.64%           593.8K/3.4%
32KB            294.8K/2.04%           293.7K/1.98%
64KB            138.2K/1.32%           141.6K/1.26%

Reviewed-by: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Idan Burstein <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/hw/mlx5/qp.c | 41 +++++++++++++++++++++++++++++++++++------
 1 file changed, 35 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index acb79d3..b25b93d 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -53,6 +53,7 @@ enum {
 
 enum {
 	MLX5_IB_SQ_STRIDE	= 6,
+	MLX5_IB_SQ_UMR_INLINE_THRESHOLD = 64,
 };
 
 static const u32 mlx5_ib_opcode[] = {
@@ -295,7 +296,8 @@ static int sq_overhead(struct ib_qp_init_attr *attr)
 			max(sizeof(struct mlx5_wqe_atomic_seg) +
 			    sizeof(struct mlx5_wqe_raddr_seg),
 			    sizeof(struct mlx5_wqe_umr_ctrl_seg) +
-			    sizeof(struct mlx5_mkey_seg));
+			    sizeof(struct mlx5_mkey_seg) +
+			    MLX5_IB_SQ_UMR_INLINE_THRESHOLD / MLX5_IB_UMR_OCTOWORD);
 		break;
 
 	case IB_QPT_XRC_TGT:
@@ -3164,13 +3166,15 @@ static __be64 sig_mkey_mask(void)
 }
 
 static void set_reg_umr_seg(struct mlx5_wqe_umr_ctrl_seg *umr,
-			    struct mlx5_ib_mr *mr)
+			    struct mlx5_ib_mr *mr, bool umr_inline)
 {
 	int size = mr->ndescs * mr->desc_size;
 
 	memset(umr, 0, sizeof(*umr));
 
 	umr->flags = MLX5_UMR_CHECK_NOT_FREE;
+	if (umr_inline)
+		umr->flags |= MLX5_UMR_INLINE;
 	umr->xlt_octowords = cpu_to_be16(get_xlt_octo(size));
 	umr->mkey_mask = frwr_mkey_mask();
 }
@@ -3341,6 +3345,24 @@ static void set_reg_data_seg(struct mlx5_wqe_data_seg *dseg,
 	dseg->lkey = cpu_to_be32(pd->ibpd.local_dma_lkey);
 }
 
+static void set_reg_umr_inline_seg(void *seg, struct mlx5_ib_qp *qp,
+				   struct mlx5_ib_mr *mr, int mr_list_size)
+{
+	void *qend = qp->sq.qend;
+	void *addr = mr->descs;
+	int copy;
+
+	if (unlikely(seg + mr_list_size > qend)) {
+		copy = qend - seg;
+		memcpy(seg, addr, copy);
+		addr += copy;
+		mr_list_size -= copy;
+		seg = mlx5_get_send_wqe(qp, 0);
+	}
+	memcpy(seg, addr, mr_list_size);
+	seg += mr_list_size;
+}
+
 static __be32 send_ieth(struct ib_send_wr *wr)
 {
 	switch (wr->opcode) {
@@ -3735,6 +3757,8 @@ static int set_reg_wr(struct mlx5_ib_qp *qp,
 {
 	struct mlx5_ib_mr *mr = to_mmr(wr->mr);
 	struct mlx5_ib_pd *pd = to_mpd(qp->ibqp.pd);
+	int mr_list_size = mr->ndescs * mr->desc_size;
+	bool umr_inline = mr_list_size <= MLX5_IB_SQ_UMR_INLINE_THRESHOLD;
 
 	if (unlikely(wr->wr.send_flags & IB_SEND_INLINE)) {
 		mlx5_ib_warn(to_mdev(qp->ibqp.device),
@@ -3742,7 +3766,7 @@ static int set_reg_wr(struct mlx5_ib_qp *qp,
 		return -EINVAL;
 	}
 
-	set_reg_umr_seg(*seg, mr);
+	set_reg_umr_seg(*seg, mr, umr_inline);
 	*seg += sizeof(struct mlx5_wqe_umr_ctrl_seg);
 	*size += sizeof(struct mlx5_wqe_umr_ctrl_seg) / 16;
 	if (unlikely((*seg == qp->sq.qend)))
@@ -3754,9 +3778,14 @@ static int set_reg_wr(struct mlx5_ib_qp *qp,
 	if (unlikely((*seg == qp->sq.qend)))
 		*seg = mlx5_get_send_wqe(qp, 0);
 
-	set_reg_data_seg(*seg, mr, pd);
-	*seg += sizeof(struct mlx5_wqe_data_seg);
-	*size += (sizeof(struct mlx5_wqe_data_seg) / 16);
+	if (umr_inline) {
+		set_reg_umr_inline_seg(*seg, qp, mr, mr_list_size);
+		*size += get_xlt_octo(mr_list_size);
+	} else {
+		set_reg_data_seg(*seg, mr, pd);
+		*seg += sizeof(struct mlx5_wqe_data_seg);
+		*size += (sizeof(struct mlx5_wqe_data_seg) / 16);
+	}
 
 	return 0;
 }
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread


* [PATCH 2/2] nvme-rdma: Add remote_invalidation module parameter
  2017-10-29 16:38 ` idanb
@ 2017-10-29 16:38     ` idanb
  -1 siblings, 0 replies; 41+ messages in thread
From: idanb-VPRAkNaXOzVWk0Htik3J/w @ 2017-10-29 16:38 UTC (permalink / raw)
  To: leon-DgEjT+Ai2ygdnm+yROfE0A,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, hch-jcswGhMUV9g,
	sagi-NQWnxTmZq1alnMjI0IkVqw, linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: maxg-VPRAkNaXOzVWk0Htik3J/w, idanb-VPRAkNaXOzVWk0Htik3J/w

From: Idan Burstein <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

NVMe over Fabrics in its secure "register_always" mode
registers and invalidates the user buffer upon each IO.
The protocol enables the host to request that the subsystem
use the SEND WITH INVALIDATE operation while returning the
response capsule, invalidating the local key
(remote invalidation).
In some HW implementations, the local network adapter may
perform better using local invalidation operations.

The results below show that running with local invalidation
rather than remote invalidation improves the IOPS achievable
with the ConnectX-5 Ex network adapter by a factor of x1.36.
Nevertheless, local invalidation induces more CPU overhead
than having the target invalidate remotely; given this
CPU% vs. IOPS tradeoff, we propose a module parameter to
control whether remote invalidation is requested.

The following results were taken against a single nvme over fabrics
subsystem with a single namespace backed by null_blk:

Block Size       s/g reg_wr      inline reg_wr    inline reg_wr + local inv
++++++++++++   ++++++++++++++   ++++++++++++++++ +++++++++++++++++++++++++++
512B            1446.6K/8.57%    5224.2K/76.21%   7143.3K/79.72%
1KB             1390.6K/8.5%     4656.7K/71.69%   5860.6K/55.45%
2KB             1343.8K/8.6%     3410.3K/38.96%   4106.7K/55.82%
4KB             1254.8K/8.39%    2033.6K/15.86%   2165.3K/17.48%
8KB             1079.5K/7.7%     1143.1K/7.27%    1158.2K/7.33%
16KB            603.4K/3.64%     593.8K/3.4%      588.9K/3.77%
32KB            294.8K/2.04%     293.7K/1.98%     294.4K/2.93%
64KB            138.2K/1.32%     141.6K/1.26%     135.6K/1.34%

Signed-off-by: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Idan Burstein <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/nvme/host/rdma.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 92a03ff..7f8225d 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -146,6 +146,11 @@ static inline struct nvme_rdma_ctrl *to_rdma_ctrl(struct nvme_ctrl *ctrl)
 MODULE_PARM_DESC(register_always,
 	 "Use memory registration even for contiguous memory regions");
 
+static bool remote_invalidation = true;
+module_param(remote_invalidation, bool, 0444);
+MODULE_PARM_DESC(remote_invalidation,
+	 "request remote invalidation from subsystem (default: true)");
+
 static int nvme_rdma_cm_handler(struct rdma_cm_id *cm_id,
 		struct rdma_cm_event *event);
 static void nvme_rdma_recv_done(struct ib_cq *cq, struct ib_wc *wc);
@@ -1152,8 +1157,9 @@ static int nvme_rdma_map_sg_fr(struct nvme_rdma_queue *queue,
 	sg->addr = cpu_to_le64(req->mr->iova);
 	put_unaligned_le24(req->mr->length, sg->length);
 	put_unaligned_le32(req->mr->rkey, sg->key);
-	sg->type = (NVME_KEY_SGL_FMT_DATA_DESC << 4) |
-			NVME_SGL_FMT_INVALIDATE;
+	sg->type = NVME_KEY_SGL_FMT_DATA_DESC << 4;
+	if (remote_invalidation)
+		sg->type |= NVME_SGL_FMT_INVALIDATE;
 
 	return 0;
 }
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread


* Re: [PATCH 0/2] Performance Improvements for Secured Mode NVMe over Fabrics and other RDMA ULPs
  2017-10-29 16:38 ` idanb
@ 2017-10-29 16:59     ` Sagi Grimberg
  -1 siblings, 0 replies; 41+ messages in thread
From: Sagi Grimberg @ 2017-10-29 16:59 UTC (permalink / raw)
  To: idanb-VPRAkNaXOzVWk0Htik3J/w, leon-DgEjT+Ai2ygdnm+yROfE0A,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, hch-jcswGhMUV9g,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: maxg-VPRAkNaXOzVWk0Htik3J/w

Hi Idan,

One minor nit: there is no "secured mode", it's the normal sane
mode. The other mode compromises security for performance.

> The following two patches include performance
> optimizations for RDMA kernel ULPs that register/invalidate
> memory per IO (e.g. NVMe over Fabrics, iSER, NFS over RDMA).
> The patches improve the IOPS that can be achieved by a single
> network adapter with the above ULPs; specifically for NVMe over
> Fabrics, IOPS for small IO reads improved by a factor of 5
> using a single network adapter in Linux.

That is very good! Thanks for doing this!
You should note that this is an mlx5-specific enhancement; mlx5 is
an important device, but as it stands this is not a performance
improvement for RDMA ULPs in general.

What about mlx4 btw?

^ permalink raw reply	[flat|nested] 41+ messages in thread


* Re: [PATCH 1/2] IB/mlx5: posting klm/mtt list inline in the send queue for reg_wr
  2017-10-29 16:38     ` idanb
@ 2017-10-29 17:09         ` Sagi Grimberg
  -1 siblings, 0 replies; 41+ messages in thread
From: Sagi Grimberg @ 2017-10-29 17:09 UTC (permalink / raw)
  To: idanb-VPRAkNaXOzVWk0Htik3J/w, leon-DgEjT+Ai2ygdnm+yROfE0A,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, hch-jcswGhMUV9g,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: maxg-VPRAkNaXOzVWk0Htik3J/w


> Like most kernel RDMA ULPs, NVMe over Fabrics in its secure
> "register_always" mode registers and invalidates the user buffer
> upon each IO.

Again, let's drop the "secured mode" language.

> diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
> index acb79d3..b25b93d 100644
> --- a/drivers/infiniband/hw/mlx5/qp.c
> +++ b/drivers/infiniband/hw/mlx5/qp.c
> @@ -53,6 +53,7 @@ enum {
>   
>   enum {
>   	MLX5_IB_SQ_STRIDE	= 6,
> +	MLX5_IB_SQ_UMR_INLINE_THRESHOLD = 64,
>   };

Is this a device capability? Is it true for all mlx5 devices?
Can we get it from the FW instead?

Otherwise, looks good to me,

Reviewed-by: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>

^ permalink raw reply	[flat|nested] 41+ messages in thread


* Re: [PATCH 0/2] Performance Improvements for Secured Mode NVMe over Fabrics and other RDMA ULPs
  2017-10-29 16:59     ` Sagi Grimberg
@ 2017-10-29 17:09         ` Max Gurtovoy
  -1 siblings, 0 replies; 41+ messages in thread
From: Max Gurtovoy @ 2017-10-29 17:09 UTC (permalink / raw)
  To: Sagi Grimberg, idanb-VPRAkNaXOzVWk0Htik3J/w,
	leon-DgEjT+Ai2ygdnm+yROfE0A,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, hch-jcswGhMUV9g,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA



On 10/29/2017 6:59 PM, Sagi Grimberg wrote:
> Hi Idan,
> 
> One minor nit: there is no "secured mode", it's the normal sane
> mode. The other mode compromises security for performance.
> 
>> The following two patches include performance
>> optimizations for RDMA kernel ULPs that register/invalidate
>> memory per IO (e.g. NVMe over Fabrics, iSER, NFS over RDMA).
>> The patches improve the IOPS that can be achieved by a single
>> network adapter with the above ULPs; specifically for NVMe over
>> Fabrics, IOPS for small IO reads improved by a factor of 5
>> using a single network adapter in Linux.
> 
> That is very good! Thanks for doing this!
> You should note that its mlx5 specific enhancement, which is
> an important device, but its not a performance improvement for
> RDMA ULPs as it is.

Sagi,

the subject refers to both patches :)
And it mentions that more improvements are in the pipeline that
will improve small write IOs for NVMeoF (as a first stage).

> 
> What about mlx4 btw?

We haven't tried to improve mlx4 yet, but it's on our plate.



^ permalink raw reply	[flat|nested] 41+ messages in thread


* Re: [PATCH 0/2] Performance Improvents for Secured Mode NVMe over Fabrics and other RDMA ULPs
  2017-10-29 17:09         ` Max Gurtovoy
@ 2017-10-29 17:43             ` Leon Romanovsky
  -1 siblings, 0 replies; 41+ messages in thread
From: Leon Romanovsky @ 2017-10-29 17:43 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Sagi Grimberg, idanb-VPRAkNaXOzVWk0Htik3J/w,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, hch-jcswGhMUV9g,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA


On Sun, Oct 29, 2017 at 07:09:41PM +0200, Max Gurtovoy wrote:
> > What about mlx4 btw?
>
> We haven't tried to improve mlx4 yet, but it's on our plate.

Max and Idan thanks for doing it.

The improvements look good, but I wonder what kind of testing you did.
Did you run the full verification suite on it?

Thanks



^ permalink raw reply	[flat|nested] 41+ messages in thread


* Re: [PATCH 2/2] nvme-rdma: Add remote_invalidation module parameter
  2017-10-29 16:38     ` idanb
@ 2017-10-29 17:52         ` Jason Gunthorpe
  -1 siblings, 0 replies; 41+ messages in thread
From: Jason Gunthorpe @ 2017-10-29 17:52 UTC (permalink / raw)
  To: idanb-VPRAkNaXOzVWk0Htik3J/w
  Cc: leon-DgEjT+Ai2ygdnm+yROfE0A,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, hch-jcswGhMUV9g,
	sagi-NQWnxTmZq1alnMjI0IkVqw, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	maxg-VPRAkNaXOzVWk0Htik3J/w

On Sun, Oct 29, 2017 at 06:38:21PM +0200, idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org wrote:
  
> +static bool remote_invalidation = true;
> +module_param(remote_invalidation, bool, 0444);
> +MODULE_PARM_DESC(remote_invalidation,
> +	 "request remote invalidation from subsystem (default: true)");

Please no module options.

If your device has a performance anomaly that makes
SEND_WITH_INVALIDATE slower than local invalidation, then we need to
talk about having a new verbs flag for this situation so all the ULPs
can avoid using SEND_WITH_INVALIDATE.

Sagi, is this really an apples to apples comparison?

When I was talking with Chuck about NFS there were various issues on
that side that made the local invalidate run faster, but it wasn't
actually working properly.

Does nvme wait for the local invalidate to finish before moving on?

Jason

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/2] nvme-rdma: Add remote_invalidation module parameter
  2017-10-29 16:38     ` idanb
@ 2017-10-29 18:24         ` Chuck Lever
  -1 siblings, 0 replies; 41+ messages in thread
From: Chuck Lever @ 2017-10-29 18:24 UTC (permalink / raw)
  To: idanb-VPRAkNaXOzVWk0Htik3J/w
  Cc: leon-DgEjT+Ai2ygdnm+yROfE0A,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, hch-jcswGhMUV9g,
	sagi-NQWnxTmZq1alnMjI0IkVqw, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	maxg-VPRAkNaXOzVWk0Htik3J/w


> On Oct 29, 2017, at 12:38 PM, idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org wrote:
> 
> From: Idan Burstein <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> 
> NVMe over Fabrics in its secure "register_always" mode
> registers and invalidates the user buffer upon each IO.
> The protocol enables the host to request that the subsystem
> use the SEND WITH INVALIDATE operation while returning the
> response capsule, invalidating the local key
> (remote_invalidation).
> In some HW implementations, the local network adapter may
> perform better when using local invalidation operations.
> 
> The results below show that running with local invalidation
> rather than remote invalidation improves the IOPS achievable
> with the ConnectX-5Ex network adapter by a factor of 1.36.
> Nevertheless, local invalidation induces more CPU overhead
> than letting the target invalidate remotely; given this
> CPU% vs. IOPS tradeoff, we propose a module parameter to
> control whether remote invalidation is requested.
> 
> The following results were taken against a single nvme over fabrics
> subsystem with a single namespace backed by null_blk:
> 
> Block Size       s/g reg_wr      inline reg_wr    inline reg_wr + local inv
> ++++++++++++   ++++++++++++++   ++++++++++++++++ +++++++++++++++++++++++++++
> 512B            1446.6K/8.57%    5224.2K/76.21%   7143.3K/79.72%
> 1KB             1390.6K/8.5%     4656.7K/71.69%   5860.6K/55.45%
> 2KB             1343.8K/8.6%     3410.3K/38.96%   4106.7K/55.82%
> 4KB             1254.8K/8.39%    2033.6K/15.86%   2165.3K/17.48%
> 8KB             1079.5K/7.7%     1143.1K/7.27%    1158.2K/7.33%
> 16KB            603.4K/3.64%     593.8K/3.4%      588.9K/3.77%
> 32KB            294.8K/2.04%     293.7K/1.98%     294.4K/2.93%
> 64KB            138.2K/1.32%     141.6K/1.26%     135.6K/1.34%

Are the units reported here KIOPS and %CPU? What was the benchmark?

Was any root cause analysis done to understand why the IOPS
rate changes without RI? Reduction in avg RTT? Fewer long-
running outliers? Lock contention in the ULP?

I am curious enough to add a similar setting to NFS/RDMA,
now that I have mlx5 devices.


> Signed-off-by: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Signed-off-by: Idan Burstein <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> ---
> drivers/nvme/host/rdma.c | 10 ++++++++--
> 1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 92a03ff..7f8225d 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -146,6 +146,11 @@ static inline struct nvme_rdma_ctrl *to_rdma_ctrl(struct nvme_ctrl *ctrl)
> MODULE_PARM_DESC(register_always,
> 	 "Use memory registration even for contiguous memory regions");
> 
> +static bool remote_invalidation = true;
> +module_param(remote_invalidation, bool, 0444);
> +MODULE_PARM_DESC(remote_invalidation,
> +	 "request remote invalidation from subsystem (default: true)");

The use of a module parameter would be awkward in systems
that have a heterogeneous collection of HCAs.


> +
> static int nvme_rdma_cm_handler(struct rdma_cm_id *cm_id,
> 		struct rdma_cm_event *event);
> static void nvme_rdma_recv_done(struct ib_cq *cq, struct ib_wc *wc);
> @@ -1152,8 +1157,9 @@ static int nvme_rdma_map_sg_fr(struct nvme_rdma_queue *queue,
> 	sg->addr = cpu_to_le64(req->mr->iova);
> 	put_unaligned_le24(req->mr->length, sg->length);
> 	put_unaligned_le32(req->mr->rkey, sg->key);
> -	sg->type = (NVME_KEY_SGL_FMT_DATA_DESC << 4) |
> -			NVME_SGL_FMT_INVALIDATE;
> +	sg->type = NVME_KEY_SGL_FMT_DATA_DESC << 4;
> +	if (remote_invalidation)
> +		sg->type |= NVME_SGL_FMT_INVALIDATE;
> 
> 	return 0;
> }
> -- 
> 1.8.3.1

--
Chuck Lever


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/2] Performance Improvements for Secured Mode NVMe over Fabrics and other RDMA ULPs
       [not found]             ` <AM0PR0502MB38906F57CD68C3707F8FE765C5580@AM0PR0502MB3890.eurprd05.prod.outlook.com>
@ 2017-10-30  5:21                   ` Leon Romanovsky
  0 siblings, 0 replies; 41+ messages in thread
From: Leon Romanovsky @ 2017-10-30  5:21 UTC (permalink / raw)
  To: Idan Burstein
  Cc: Max Gurtovoy, Sagi Grimberg,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, hch-jcswGhMUV9g,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Tal Alon, Majd Dibbiny

On Sun, Oct 29, 2017 at 06:03:43PM +0000, Idan Burstein wrote:
> Leon,
>
> We only did performance testing at the block level.

Excellent, so please, as a Mellanox employee, run it through our
standard submission channels, so at least we will be sure that you
are not breaking anything.

Please talk with Max; he has done this before and knows how to do it right.

Thanks

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/2] nvme-rdma: Add remote_invalidation module parameter
  2017-10-29 16:38     ` idanb
@ 2017-10-30  8:11         ` Sagi Grimberg
  -1 siblings, 0 replies; 41+ messages in thread
From: Sagi Grimberg @ 2017-10-30  8:11 UTC (permalink / raw)
  To: idanb-VPRAkNaXOzVWk0Htik3J/w, leon-DgEjT+Ai2ygdnm+yROfE0A,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, hch-jcswGhMUV9g,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: maxg-VPRAkNaXOzVWk0Htik3J/w

Idan,

> NVMe over Fabrics in its secure "register_always" mode
> registers and invalidates the user buffer upon each IO.
> The protocol enables the host to request the subsystem
> to use SEND WITH INVALIDATE operation while returning the
> response capsule and invalidate the local key
> (remote_invalidation).
> In some HW implementations, the local network adapter may
> perform better while using local invalidation operations.

This is a device quirk, and if you want to optimize for it
you need to expose it via the verbs interface. We won't add a module
parameter for implementation-specific stuff.

Note that we currently need to rework the case where we use local
invalidates and properly wait for them to complete.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/2] nvme-rdma: Add remote_invalidation module parameter
  2017-10-29 17:52         ` Jason Gunthorpe
@ 2017-10-30  8:14             ` Sagi Grimberg
  -1 siblings, 0 replies; 41+ messages in thread
From: Sagi Grimberg @ 2017-10-30  8:14 UTC (permalink / raw)
  To: Jason Gunthorpe, idanb-VPRAkNaXOzVWk0Htik3J/w
  Cc: leon-DgEjT+Ai2ygdnm+yROfE0A,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, hch-jcswGhMUV9g,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, maxg-VPRAkNaXOzVWk0Htik3J/w


> Sagi, is this really an apples to apples comparison?

No

> When I was talking with Chuck on NFS there were some various issues on
> that side that made the local invalidate run faster, but it wasn't
> actually working properly.
> 
> Does nvme wait for the local invalidate to finish before moving on?

Not really, and we should, I'll get to it.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/2] nvme-rdma: Add remote_invalidation module parameter
  2017-10-30  8:14             ` Sagi Grimberg
@ 2017-10-30  8:38                 ` idanb
  -1 siblings, 0 replies; 41+ messages in thread
From: idanb @ 2017-10-30  8:38 UTC (permalink / raw)
  To: Sagi Grimberg, Jason Gunthorpe
  Cc: leon-DgEjT+Ai2ygdnm+yROfE0A,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, hch-jcswGhMUV9g,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, maxg-VPRAkNaXOzVWk0Htik3J/w



On 10/30/2017 10:14 AM, Sagi Grimberg wrote:
>
>> Sagi, is this really an apples to apples comparison?
>
> No
>
>> When I was talking with Chuck on NFS there were some various issues on
>> that side that made the local invalidate run faster, but it wasn't
>> actually working properly.
>>
>> Does nvme wait for the local invalidate to finish before moving on?
>
> Not really, and we should, I'll get to it.

I agree we should fix this; nevertheless, I don't expect it to create
an IO rate issue at high queue depths.
Indeed, we will have a tradeoff here of latency vs. rate.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/2] nvme-rdma: Add remote_invalidation module parameter
  2017-10-30  8:11         ` Sagi Grimberg
@ 2017-10-30  8:45             ` idanb
  -1 siblings, 0 replies; 41+ messages in thread
From: idanb @ 2017-10-30  8:45 UTC (permalink / raw)
  To: Sagi Grimberg, leon-DgEjT+Ai2ygdnm+yROfE0A,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, hch-jcswGhMUV9g,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: maxg-VPRAkNaXOzVWk0Htik3J/w



On 10/30/2017 10:11 AM, Sagi Grimberg wrote:
> Idan,
>
>> NVMe over Fabrics in its secure "register_always" mode
>> registers and invalidates the user buffer upon each IO.
>> The protocol enables the host to request the subsystem
>> to use SEND WITH INVALIDATE operation while returning the
>> response capsule and invalidate the local key
>> (remote_invalidation).
>> In some HW implementations, the local network adapter may
>> perform better while using local invalidation operations.
>
> This is a device quirk and if you want to optimize for this
> you need to expose it via verbs interface. We won't add a module
> parameter for implementation specific stuff.

After our discussion yesterday, I was thinking it may be better to hide
this within the nvme host driver, with a message size threshold for
using send_with_invalidate. To make it device agnostic, we will need
to come up with a way to expose such a limitation.

>
> Note that we currently need to rework the case where we use local
> invalidates and properly wait for them to complete.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/2] nvme-rdma: Add remote_invalidation module parameter
  2017-10-30  8:38                 ` idanb
@ 2017-10-30  9:44                     ` Sagi Grimberg
  -1 siblings, 0 replies; 41+ messages in thread
From: Sagi Grimberg @ 2017-10-30  9:44 UTC (permalink / raw)
  To: idanb, Jason Gunthorpe
  Cc: leon-DgEjT+Ai2ygdnm+yROfE0A,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, hch-jcswGhMUV9g,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, maxg-VPRAkNaXOzVWk0Htik3J/w


> I agree we should fix this, nevertheless I don't expect this to create 
> an IO rate issue for long queue depths.
> Indeed we will have a tradeoff here of latency vs rate.

I suggest we hold back on this one for now; I have fixes for
the local invalidate flow. Once we get them in, we can see if
this approach makes a difference, and if so, we need to expose
it through a generic mechanism.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/2] nvme-rdma: Add remote_invalidation module parameter
  2017-10-30  9:44                     ` Sagi Grimberg
@ 2017-10-30 10:31                         ` idanb
  -1 siblings, 0 replies; 41+ messages in thread
From: idanb @ 2017-10-30 10:31 UTC (permalink / raw)
  To: Sagi Grimberg, Jason Gunthorpe
  Cc: leon-DgEjT+Ai2ygdnm+yROfE0A,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, hch-jcswGhMUV9g,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, maxg-VPRAkNaXOzVWk0Htik3J/w



On 10/30/2017 11:44 AM, Sagi Grimberg wrote:
>
>> I agree we should fix this, nevertheless I don't expect this to 
>> create an IO rate issue for long queue depths.
>> Indeed we will have a tradeoff here of latency vs rate.
>
> I suggest we hold back with this one for now, I have fixes for
> the local invalidate flow. Once we get them in we can see if
> this approach makes a difference, and if so, we need to expose
> it through a generic mechanism.

I suggest that you post the patches to the list and we'll investigate
the latency implications; if the implications are negligible, we have a
win here for small IOs.

When will you be able to post your patches?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/2] nvme-rdma: Add remote_invalidation module parameter
  2017-10-30 10:31                         ` idanb
@ 2017-10-30 10:33                             ` Sagi Grimberg
  -1 siblings, 0 replies; 41+ messages in thread
From: Sagi Grimberg @ 2017-10-30 10:33 UTC (permalink / raw)
  To: idanb, Jason Gunthorpe
  Cc: leon-DgEjT+Ai2ygdnm+yROfE0A,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, hch-jcswGhMUV9g,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, maxg-VPRAkNaXOzVWk0Htik3J/w



>>> I agree we should fix this, nevertheless I don't expect this to 
>>> create an IO rate issue for long queue depths.
>>> Indeed we will have a tradeoff here of latency vs rate.
>>
>> I suggest we hold back with this one for now, I have fixes for
>> the local invalidate flow. Once we get them in we can see if
>> this approach makes a difference, and if so, we need to expose
>> it through a generic mechanism.
> 
> I suggest that you'll post the patches to the list and we'll investigate
> the latency implications, if the implications are negligible, we have a
> win here for small IOs.
> 
> When will you be able to post your patches?

I can post them now, but I prefer to test them first ;)

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/2] nvme-rdma: Add remote_invalidation module parameter
  2017-10-30 10:33                             ` Sagi Grimberg
@ 2017-10-30 12:35                                 ` idanb
  -1 siblings, 0 replies; 41+ messages in thread
From: idanb @ 2017-10-30 12:35 UTC (permalink / raw)
  To: Sagi Grimberg, Jason Gunthorpe
  Cc: leon-DgEjT+Ai2ygdnm+yROfE0A,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, hch-jcswGhMUV9g,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, maxg-VPRAkNaXOzVWk0Htik3J/w


On 10/30/2017 12:33 PM, Sagi Grimberg wrote:
>
>
>>>> I agree we should fix this, nevertheless I don't expect this to 
>>>> create an IO rate issue for long queue depths.
>>>> Indeed we will have a tradeoff here of latency vs rate.
>>>
>>> I suggest we hold back with this one for now, I have fixes for
>>> the local invalidate flow. Once we get them in we can see if
>>> this approach makes a difference, and if so, we need to expose
>>> it through a generic mechanism.
>>
>> I suggest that you'll post the patches to the list and we'll investigate
>> the latency implications, if the implications are negligible, we have a
>> win here for small IOs.
>>
>> When will you be able to post your patches?
>
> I can post them now, but I prefer to test them first ;)

Sounds good, we will check once you post them.
I expect no impact on IOPS, just latency.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/2] nvme-rdma: Add remote_invalidation module parameter
  2017-10-29 18:24         ` Chuck Lever
@ 2017-10-30 18:18             ` Chuck Lever
  -1 siblings, 0 replies; 41+ messages in thread
From: Chuck Lever @ 2017-10-30 18:18 UTC (permalink / raw)
  To: linux-rdma
  Cc: Leon Romanovsky, idanb-VPRAkNaXOzVWk0Htik3J/w,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, hch, Sagi Grimberg,
	Max Gurtovoy


> On Oct 29, 2017, at 2:24 PM, Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
> 
> 
>> On Oct 29, 2017, at 12:38 PM, idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org wrote:
>> 
>> From: Idan Burstein <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>> 
>> NVMe over Fabrics in its secure "register_always" mode
>> registers and invalidates the user buffer upon each IO.
>> The protocol enables the host to request that the subsystem
>> use a SEND WITH INVALIDATE operation while returning the
>> response capsule, invalidating the local key
>> (remote invalidation).
>> In some HW implementations, the local network adapter may
>> perform better using local invalidation operations.
>> 
>> The results below show that running with local invalidation
>> rather than remote invalidation improves the IOPS one can
>> achieve with the ConnectX-5 Ex network adapter by a factor
>> of 1.36. Nevertheless, local invalidation induces more CPU
>> overhead than having the target invalidate remotely;
>> because of this CPU% vs. IOPS tradeoff, we propose a module
>> parameter to control whether remote invalidation is
>> requested.
>> 
>> The following results were taken against a single nvme over fabrics
>> subsystem with a single namespace backed by null_blk:
>> 
>> Block Size       s/g reg_wr      inline reg_wr    inline reg_wr + local inv
>> ++++++++++++   ++++++++++++++   ++++++++++++++++ +++++++++++++++++++++++++++
>> 512B            1446.6K/8.57%    5224.2K/76.21%   7143.3K/79.72%
>> 1KB             1390.6K/8.5%     4656.7K/71.69%   5860.6K/55.45%
>> 2KB             1343.8K/8.6%     3410.3K/38.96%   4106.7K/55.82%
>> 4KB             1254.8K/8.39%    2033.6K/15.86%   2165.3K/17.48%
>> 8KB             1079.5K/7.7%     1143.1K/7.27%    1158.2K/7.33%
>> 16KB            603.4K/3.64%     593.8K/3.4%      588.9K/3.77%
>> 32KB            294.8K/2.04%     293.7K/1.98%     294.4K/2.93%
>> 64KB            138.2K/1.32%     141.6K/1.26%     135.6K/1.34%
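
As a sanity check on the x1.36 factor quoted above, assuming it compares the 512B row of the "inline reg_wr" column against "inline reg_wr + local inv" (an editorial reconstruction, not from the patch itself):

```python
# IOPS figures (in thousands) copied from the 512B row of the table
# above. Assumption: the quoted x1.36 factor compares these columns.
iops_inline_remote_inv = 5224.2  # inline reg_wr (remote invalidation)
iops_inline_local_inv = 7143.3   # inline reg_wr + local invalidation

speedup = iops_inline_local_inv / iops_inline_remote_inv
print(f"local vs. remote invalidation speedup: x{speedup:.2f}")
```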
> 
> Units reported here are KIOPS and %CPU ? What was the benchmark?
> 
> Was any root cause analysis done to understand why the IOPS
> rate changes without RI? Reduction in avg RTT? Fewer long-
> running outliers? Lock contention in the ULP?
> 
> I am curious enough to add a similar setting to NFS/RDMA,
> now that I have mlx5 devices.

I see the plan is to change the NVMeoF initiator to wait for
invalidation to complete, and then test again. However, I am
still curious what we're dealing with for NFS/RDMA (which
always waits for invalidation before allowing an RPC to
complete).

I tested here with a pair of EDR CX-4s in RoCE mode. I found
that there is a 15-20us latency penalty when Remote Invalidation
is used with RDMA Read, but when used with RDMA Write,
Remote Invalidation behaves as desired. The RDMA Read penalty
vanishes after payloads are larger than about 8kB.

Simple QD=1 iozone test with direct I/O on NFSv3, and a tmpfs
share on the NFS server. In this test, memory registration is
used for all data payloads.

Remote Invalidation enabled, reclen in kB, output in
microseconds:

              kB  reclen    write  rewrite    read    reread
          131072       1       47       54       27       27
          131072       2       61       62       27       28
          131072       4       63       62       28       28
          131072       8       59       65       29       29
          131072      16       67       66       31       32
          131072      32       75       73       42       42
          131072      64       92       87       64       44

Remote Invalidation disabled, reclen in kB, output in
microseconds:

              kB  reclen    write  rewrite    read    reread
          131072       1       43       43       32       32
          131072       2       45       52       32       32
          131072       4       48       45       32       32
          131072       8       68       64       33       33
          131072      16       74       60       35       35
          131072      32       85       82       49       41
          131072      64      102       98       63       52
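
Reading the two tables together, the per-column deltas make the asymmetry concrete. A small sketch (numbers copied from the tables above; the mapping of iozone columns to RDMA operations follows the analysis in this message):

```python
# Latency in microseconds from the two iozone tables above, for
# reclen 1..64 kB. NFS writes exercise RDMA Read on the server side;
# NFS reads exercise RDMA Write.
reclens = [1, 2, 4, 8, 16, 32, 64]
ri_on  = {"write": [47, 61, 63, 59, 67, 75, 92],
          "read":  [27, 27, 28, 29, 31, 42, 64]}
ri_off = {"write": [43, 45, 48, 68, 74, 85, 102],
          "read":  [32, 32, 32, 33, 35, 49, 63]}

for col in ("write", "read"):
    # positive delta = Remote Invalidation is faster for that column
    deltas = [off - on for on, off in zip(ri_on[col], ri_off[col])]
    print(col, deltas)
```

The read column shows the expected ~4-5us benefit at small record sizes, while the write column shows a penalty of up to 16us for 2-4kB records, consistent with the 15-20us RDMA Read penalty noted above.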

I would expect a similar ~5us latency benefit for both RDMA
Reads and RDMA Writes when Remote Invalidation is in use.
Small I/Os involving RDMA Read are precisely where NFS/RDMA
gains the most benefit from Remote Invalidation, so this
result is disappointing.

Unfortunately, the current version of RPC-over-RDMA does not
have the ability to convey an Rkey for the target to invalidate
remotely, though that is one of the features being
considered for the next version.

IOPS results (fio 70/30, multi-thread iozone, etc) are very
clearly worse without Remote Invalidation, so I currently
don't plan to submit a patch that allows RI to be disabled
for some NFS/RDMA mounts. Currently the Linux NFS client
indicates to servers that RI is supported whenever FRWR is
supported on the client's HCA. The server is then free to
decide whether to use Send With Invalidate with any Rkey
in the current RPC.


--
Chuck Lever




^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 1/2] IB/mlx5: posting klm/mtt list inline in the send queue for reg_wr
  2017-10-29 16:38     ` idanb
  (?)
  (?)
@ 2018-04-19 21:11     ` roland
  2018-04-20  2:04       ` Doug Ledford
  -1 siblings, 1 reply; 41+ messages in thread
From: roland @ 2018-04-19 21:11 UTC (permalink / raw)


> The fetch of the MTT/KLM list becomes the bottleneck in
> the number of IO operations that can be done by the NVMe
> over Fabrics host driver on a single adapter, as shown below.
>
> This patch adds support for inline registration
> work requests for MTT/KLM lists of size <= 64B.

It looks like this patch never made it upstream, and I don't see any
discussion or reason why it might have been dropped.

This still seems like a good idea - should we respin it against the latest tree?

 - R.
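
For reference, the 64B criterion quoted above can be modeled roughly as follows. The entry size and all names here are illustrative assumptions, not taken from the patch (mlx5 KLM entries are 16 bytes, so a 64B limit would correspond to lists of up to 4 entries):

```python
# Rough model of the inline-registration criterion described in the
# patch: post the translation list inline in the reg_wr only when it
# fits within 64 bytes. Entry size is an assumption (16B per KLM
# entry); names are illustrative, not from the patch.
KLM_ENTRY_SIZE = 16
INLINE_LIST_LIMIT = 64

def post_list_inline(num_entries, entry_size=KLM_ENTRY_SIZE):
    """True if the KLM/MTT list can go inline in the work request."""
    return num_entries * entry_size <= INLINE_LIST_LIMIT

print(post_list_inline(4))  # 64B list
print(post_list_inline(5))  # 80B list, too large to inline
```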

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 1/2] IB/mlx5: posting klm/mtt list inline in the send queue for reg_wr
  2018-04-19 21:11     ` roland
@ 2018-04-20  2:04       ` Doug Ledford
  2018-04-20 19:07         ` Max Gurtovoy
  0 siblings, 1 reply; 41+ messages in thread
From: Doug Ledford @ 2018-04-20  2:04 UTC (permalink / raw)


On Thu, 2018-04-19@14:11 -0700, roland wrote:
> > The fetch of the MTT/KLM list becomes the bottleneck in
> > the number of IO operations that can be done by the NVMe
> > over Fabrics host driver on a single adapter, as shown below.
> > 
> > This patch adds support for inline registration
> > work requests for MTT/KLM lists of size <= 64B.
> 
> It looks like this patch never made it upstream, and I don't see any
> discussion or reason why it might have been dropped.
> 
> This still seems like a good idea - should we respin it against the latest tree?

If you want it included, then you'll have to respin it, if for no other
reason than that it isn't readily available in Patchworks, and that's
what we're using to track/grab patches nowadays.  I suppose I could go
hunting for it, but that's a good way to grab an out-of-date patch and
make a mistake by accident...

-- 
Doug Ledford <dledford at redhat.com>
    GPG KeyID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 1/2] IB/mlx5: posting klm/mtt list inline in the send queue for reg_wr
  2018-04-20  2:04       ` Doug Ledford
@ 2018-04-20 19:07         ` Max Gurtovoy
  0 siblings, 0 replies; 41+ messages in thread
From: Max Gurtovoy @ 2018-04-20 19:07 UTC (permalink / raw)


Thanks Roland.
Our team actually ran the patch again last week, successfully, using
ConnectX-4 and ConnectX-5 with NVMe-oF/RDMA. We'll push it through
Jason's PR with all the great results published.

-Max.

On 4/20/2018 5:04 AM, Doug Ledford wrote:
> On Thu, 2018-04-19@14:11 -0700, roland wrote:
>>> The fetch of the MTT/KLM list becomes the bottleneck in
>>> the number of IO operations that can be done by the NVMe
>>> over Fabrics host driver on a single adapter, as shown below.
>>>
>>> This patch adds support for inline registration
>>> work requests for MTT/KLM lists of size <= 64B.
>>
>> It looks like this patch never made it upstream, and I don't see any
>> discussion or reason why it might have been dropped.
>>
>> This still seems like a good idea - should we respin it against the latest tree?
> 
> If you want it included, then you'll have to respin it, if for no other
> reason than that it isn't readily available in Patchworks, and that's
> what we're using to track/grab patches nowadays.  I suppose I could go
> hunting for it, but that's a good way to grab an out-of-date patch and
> make a mistake by accident...
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2018-04-20 19:07 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-29 16:38 [PATCH 0/2] Performance Improvents for Secured Mode NVMe over Fabrics and other RDMA ULPs idanb-VPRAkNaXOzVWk0Htik3J/w
2017-10-29 16:38 ` idanb
     [not found] ` <1509295101-14081-1-git-send-email-idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2017-10-29 16:38   ` [PATCH 1/2] IB/mlx5: posting klm/mtt list inline in the send queue for reg_wr idanb-VPRAkNaXOzVWk0Htik3J/w
2017-10-29 16:38     ` idanb
     [not found]     ` <1509295101-14081-2-git-send-email-idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2017-10-29 17:09       ` Sagi Grimberg
2017-10-29 17:09         ` Sagi Grimberg
2018-04-19 21:11     ` roland
2018-04-20  2:04       ` Doug Ledford
2018-04-20 19:07         ` Max Gurtovoy
2017-10-29 16:38   ` [PATCH 2/2] nvme-rdma: Add remote_invalidation module parameter idanb-VPRAkNaXOzVWk0Htik3J/w
2017-10-29 16:38     ` idanb
     [not found]     ` <1509295101-14081-3-git-send-email-idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2017-10-29 17:52       ` Jason Gunthorpe
2017-10-29 17:52         ` Jason Gunthorpe
     [not found]         ` <20171029175237.GD4488-uk2M96/98Pc@public.gmane.org>
2017-10-30  8:14           ` Sagi Grimberg
2017-10-30  8:14             ` Sagi Grimberg
     [not found]             ` <740c93f4-164e-d4e3-97b1-313a0420ae81-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-10-30  8:38               ` idanb
2017-10-30  8:38                 ` idanb
     [not found]                 ` <7e038d80-de95-8fb7-e313-825e40c03e88-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2017-10-30  9:44                   ` Sagi Grimberg
2017-10-30  9:44                     ` Sagi Grimberg
     [not found]                     ` <38523e67-fa00-dd03-5b6f-34cd6f863c8f-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-10-30 10:31                       ` idanb
2017-10-30 10:31                         ` idanb
     [not found]                         ` <5783e083-93d4-8e72-c380-03fc54a0291b-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2017-10-30 10:33                           ` Sagi Grimberg
2017-10-30 10:33                             ` Sagi Grimberg
     [not found]                             ` <360f892f-b88d-0947-1590-ab1d64d4da13-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-10-30 12:35                               ` idanb
2017-10-30 12:35                                 ` idanb
2017-10-29 18:24       ` Chuck Lever
2017-10-29 18:24         ` Chuck Lever
     [not found]         ` <87A0B150-CE67-4C8C-914E-53F66411E1BB-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2017-10-30 18:18           ` Chuck Lever
2017-10-30 18:18             ` Chuck Lever
2017-10-30  8:11       ` Sagi Grimberg
2017-10-30  8:11         ` Sagi Grimberg
     [not found]         ` <4423f96f-bf42-3603-aa6a-fa259a1d09d1-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-10-30  8:45           ` idanb
2017-10-30  8:45             ` idanb
2017-10-29 16:59   ` [PATCH 0/2] Performance Improvents for Secured Mode NVMe over Fabrics and other RDMA ULPs Sagi Grimberg
2017-10-29 16:59     ` Sagi Grimberg
     [not found]     ` <e1ba4993-5c77-4754-967c-4da8ec34bf77-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-10-29 17:09       ` Max Gurtovoy
2017-10-29 17:09         ` Max Gurtovoy
     [not found]         ` <8cb59a0a-c902-0c0e-27a8-fdf9b98982ac-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2017-10-29 17:43           ` Leon Romanovsky
2017-10-29 17:43             ` Leon Romanovsky
     [not found]             ` <AM0PR0502MB38906F57CD68C3707F8FE765C5580@AM0PR0502MB3890.eurprd05.prod.outlook.com>
     [not found]               ` <AM0PR0502MB38906F57CD68C3707F8FE765C5580-EJTefJAZ6OmxAoLrISMGDMDSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
2017-10-30  5:21                 ` Leon Romanovsky
2017-10-30  5:21                   ` Leon Romanovsky
