* [PATCH rdma-next 0/3] Support out of order data placement
@ 2017-06-12  6:49 Leon Romanovsky
       [not found] ` <20170612064918.12510-1-leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Leon Romanovsky @ 2017-06-12  6:49 UTC (permalink / raw)
  To: Doug Ledford; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

Hi Doug,

This patch set which is based on v4.12-rc4 comes from Parav and it
adds support for out of order data placement for RDMA read and
write requests.

Thanks

---
From Parav:

In certain fabric configurations, IB packets for a given QP may take
different paths through the network from source to destination. As a
result, packets can arrive out of order at the receiver. Instead of
dropping such packets, handling them can improve overall network
performance in the following ways:
 * improving network utilization by avoiding retransmission
 * avoiding the latency increase caused by retransmission

This patchset allows an HCA to expose its out of order data placement
capability to users who can benefit from the feature.

This is an optional feature of an HCA and it is enabled on a per QP
basis. The optional QP attribute is set when the QP state changes from
INIT to RTR.
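
The capability check and per QP enablement described above can be
sketched roughly as follows. This is only an illustration, not code from
the series: the two bit values are copied from patch 1/3, while
struct ooo_caps (a stand-in for struct ib_ooo_caps), enum ooo_transport
and rtr_mask_with_ooo() are hypothetical names made up for this example.

```c
#include <stdint.h>

/* Bit values as defined in patch 1/3 of this series. */
#define IB_OOO_RW_DATA_PLACEMENT	(1u << 0)
#define IB_QP_OOO_RW_DATA_PLACEMENT	(1u << 26)

/* Stand-in with the same fields as struct ib_ooo_caps (assumption). */
struct ooo_caps {
	uint32_t rc_caps;
	uint32_t xrc_caps;
	uint32_t ud_caps;
	uint32_t uc_caps;
};

/* Hypothetical transport selector, used only by this sketch. */
enum ooo_transport { OOO_RC, OOO_XRC, OOO_UD, OOO_UC };

/*
 * Hypothetical helper: return the attribute mask to use for the
 * INIT->RTR transition. The OOO bit is added only when the caller
 * wants it and the device reports the capability for that transport.
 */
static uint32_t rtr_mask_with_ooo(const struct ooo_caps *caps,
				  enum ooo_transport t,
				  uint32_t base_mask, int want_ooo)
{
	uint32_t tcaps = 0;

	switch (t) {
	case OOO_RC:
		tcaps = caps->rc_caps;
		break;
	case OOO_XRC:
		tcaps = caps->xrc_caps;
		break;
	case OOO_UD:
		tcaps = caps->ud_caps;
		break;
	case OOO_UC:
		tcaps = caps->uc_caps;
		break;
	}

	if (want_ooo && (tcaps & IB_OOO_RW_DATA_PLACEMENT))
		return base_mask | IB_QP_OOO_RW_DATA_PLACEMENT;

	return base_mask;
}
```

A caller would pass the capabilities it obtained from the device query
and OR the result into whatever mask it already uses for INIT->RTR; if
the capability is absent the mask comes back unchanged.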

Out of order data placement capability indicates that if HCA receives
out of order RDMA packets, their data placement can be done at the
desired memory destination given in the packet(s). This is applicable
to RDMA read and write operations.

Send queue work requests are still completed in-order regardless of
their data placement order at requester or responder side.

In-order semantics can always be guaranteed by setting the Fence
indicator on the appropriate WRs.

An application shall not depend on the contents of the RDMA write buffer
at the responder until one of the following has occurred:
- Receive completion of an RDMA WRITE with immediate data.
- Arrival and completion of a subsequent SEND message.
- Update of a memory element by a subsequent ATOMIC operation.

An application shall not depend on the contents of the RDMA read buffer
at the requester until one of the following has occurred:
- Completion of the RDMA READ work request, if a completion was
  requested, or completion of that work request with error status.
- Completion of a subsequent work request.
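
As an illustration of the responder-side rule for RDMA write buffers, a
consumer could gate its reads on the kind of event it has observed. The
enum and function names below are hypothetical and exist only for this
sketch; it also assumes that every event passed in was generated after
the RDMA WRITE in question.

```c
#include <stdbool.h>

/* Hypothetical event kinds a responder-side consumer might observe. */
enum responder_event {
	EV_POLLED_MEMORY_CHANGED,	/* saw bytes change in the buffer */
	EV_RECV_WRITE_WITH_IMM,		/* receive completion, WRITE with imm */
	EV_RECV_SUBSEQUENT_SEND,	/* completion of a later SEND */
	EV_SUBSEQUENT_ATOMIC_UPDATE,	/* memory element updated by ATOMIC */
};

/*
 * Return true when, per the rules listed above, the contents of the
 * RDMA write buffer may be depended upon. Merely observing that some
 * bytes changed is not sufficient once data can be placed out of order.
 */
static bool write_buffer_is_dependable(enum responder_event ev)
{
	switch (ev) {
	case EV_RECV_WRITE_WITH_IMM:
	case EV_RECV_SUBSEQUENT_SEND:
	case EV_SUBSEQUENT_ATOMIC_UPDATE:
		return true;
	case EV_POLLED_MEMORY_CHANGED:
		return false;
	}
	return false;
}
```

The point of the sketch is the negative case: polling the last byte of
the buffer does not appear in the list above, so it returns false.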

Thanks

CC: Idan Burstein <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Parav Pandit (3):
  IB/core: Expose out of order data placement capability
  IB/uverbs: Enable user space programs to use out of order placement
  IB/mlx5: Support out of order data placement

 Documentation/infiniband/out_of_order.txt | 60 +++++++++++++++++++++++++++++++
 drivers/infiniband/core/uverbs_cmd.c      | 16 +++++++++
 drivers/infiniband/core/verbs.c           |  9 +++--
 drivers/infiniband/hw/mlx5/main.c         | 12 +++++++
 drivers/infiniband/hw/mlx5/qp.c           | 25 +++++++++++++
 include/linux/mlx5/mlx5_ifc.h             |  5 ++-
 include/rdma/ib_verbs.h                   | 22 ++++++++++++
 include/uapi/rdma/ib_user_verbs.h         | 16 +++++++--
 8 files changed, 159 insertions(+), 6 deletions(-)
 create mode 100644 Documentation/infiniband/out_of_order.txt

--
2.12.2

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* [PATCH rdma-next 1/3] IB/core: Expose out of order data placement capability
       [not found] ` <20170612064918.12510-1-leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2017-06-12  6:49   ` Leon Romanovsky
  2017-06-12  6:49   ` [PATCH rdma-next 2/3] IB/uverbs: Enable user space programs to use out of order placement Leon Romanovsky
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 71+ messages in thread
From: Leon Romanovsky @ 2017-06-12  6:49 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein, Parav Pandit

From: Parav Pandit <parav-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

This patch exposes the out of order data placement capability so that
an HCA can report it.
It also defines an optional QP attribute that can be set when moving a QP
from INIT to RTR state to make use of this feature on a particular QP.

Signed-off-by: Parav Pandit <parav-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Reviewed-by: Eli Cohen <eli-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Reviewed-by: Daniel Jurgens <danielj-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 Documentation/infiniband/out_of_order.txt | 60 +++++++++++++++++++++++++++++++
 drivers/infiniband/core/verbs.c           |  9 +++--
 include/rdma/ib_verbs.h                   | 22 ++++++++++++
 3 files changed, 88 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/infiniband/out_of_order.txt

diff --git a/Documentation/infiniband/out_of_order.txt b/Documentation/infiniband/out_of_order.txt
new file mode 100644
index 000000000000..9c585f6801fc
--- /dev/null
+++ b/Documentation/infiniband/out_of_order.txt
@@ -0,0 +1,60 @@
+Out of order data placement
+===========================
+
+This document describes out of order data placement feature and its
+user interface.
+
+1. Overview
+===========
+
+In certain fabric configurations IB packets for a given QP may take up
+different paths in a network from source to destination. This results
+into reaching packets out of order at the receiver side. Instead of
+dropping packets, handling out of order packets can improve overall
+network performance in following ways.
+(a) improve network utilization by avoiding retransmission
+(b) avoiding latency increase due to retransmission
+
+2. Description
+==============
+
+This is optional feature of an HCA.
+Enablement of this feature is done on per QP basis.
+This optional QP attribute is set when a QP state is changed from INIT
+to RTR.
+
+Out of order data placement capability indicates that if HCA receives
+out of order RDMA packets, their data placement can be done at the
+desired memory destination given in the packet(s). This is applicable
+to RDMA read and write operations.
+
+Send queue work requests are still completed in-order regardless of
+their data placement order at requester or responder side.
+
+In-order semantics is always guaranteed by setting the Fence
+indicator for appropriate WRs.
+
+An application shall not depend on the contents of the RDMA write buffer
+at the responder until one of the following occurred:
+- Completion of the RDMA WRITE with immediate data receive completion.
+- Arrival and completion of the subsequent SEND message.
+- Update of a memory element by subsequent ATOMIC operation.
+
+An application shall not depend on the contents of the RDMA read buffer
+at the requester until one of the following occurred:
+- Completion of the RDMA READ work request if requested or such
+  work request completes with error status.
+- Completion of the subsequent work request.
+
+3. Usage
+========
+
+(a) ibv_query_device_ex returns out of order data placement
+capability at ooo_caps structure for a given transport.
+Whenever HCA supports such capability for each transport user
+should get IB_OOO_RW_DATA_PLACEMENT set in the caps.
+
+(b) When such capability is set, user can set
+IB_QP_OOO_RW_DATA_PLACEMENT while invoking ibv_modify_qp().
+IB_QP_OOO_RW_DATA_PLACEMENT is optional attribute which is supported
+only when QP state transitions from INIT to RTR state.
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index 4792f5209ac2..7e761bedd1ad 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -957,13 +957,16 @@ static const struct {
 						 IB_QP_PKEY_INDEX),
 				 [IB_QPT_RC]  = (IB_QP_ALT_PATH			|
 						 IB_QP_ACCESS_FLAGS		|
-						 IB_QP_PKEY_INDEX),
+						 IB_QP_PKEY_INDEX		|
+						 IB_QP_OOO_RW_DATA_PLACEMENT),
 				 [IB_QPT_XRC_INI] = (IB_QP_ALT_PATH		|
 						 IB_QP_ACCESS_FLAGS		|
-						 IB_QP_PKEY_INDEX),
+						 IB_QP_PKEY_INDEX		|
+						 IB_QP_OOO_RW_DATA_PLACEMENT),
 				 [IB_QPT_XRC_TGT] = (IB_QP_ALT_PATH		|
 						 IB_QP_ACCESS_FLAGS		|
-						 IB_QP_PKEY_INDEX),
+						 IB_QP_PKEY_INDEX		|
+						 IB_QP_OOO_RW_DATA_PLACEMENT),
 				 [IB_QPT_SMI] = (IB_QP_PKEY_INDEX		|
 						 IB_QP_QKEY),
 				 [IB_QPT_GSI] = (IB_QP_PKEY_INDEX		|
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index ba8314ec5768..c507f2f0773a 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -278,6 +278,26 @@ struct ib_rss_caps {
 	u32 max_rwq_indirection_table_size;
 };
 
+/*
+ * Out of order data placement capability bits.
+ * Out of order data placement means, if HCA receives RDMA packets in
+ * out of order manner, their data placement can be done at the desired
+ * memory destination given in the packet(s). This is applicable
+ * to RDMA read and write operations.
+ * Send queue work requests are still completed in-order regardless of their
+ * data placement order at local or remote end.
+ */
+enum ib_ooo_transport_caps {
+	IB_OOO_RW_DATA_PLACEMENT	= (1 << 0),
+};
+
+struct ib_ooo_caps {
+	u32 rc_caps;
+	u32 xrc_caps;
+	u32 ud_caps;
+	u32 uc_caps;
+};
+
 enum ib_cq_creation_flags {
 	IB_CQ_FLAGS_TIMESTAMP_COMPLETION   = 1 << 0,
 	IB_CQ_FLAGS_IGNORE_OVERRUN	   = 1 << 1,
@@ -338,6 +358,7 @@ struct ib_device_attr {
 	struct ib_rss_caps	rss_caps;
 	u32			max_wq_type_rq;
 	u32			raw_packet_caps; /* Use ib_raw_packet_caps enum */
+	struct ib_ooo_caps	ooo_caps;
 };
 
 enum ib_mtu {
@@ -1157,6 +1178,7 @@ enum ib_qp_attr_mask {
 	IB_QP_RESERVED3			= (1<<23),
 	IB_QP_RESERVED4			= (1<<24),
 	IB_QP_RATE_LIMIT		= (1<<25),
+	IB_QP_OOO_RW_DATA_PLACEMENT	= (1 << 26),
 };
 
 enum ib_qp_state {
-- 
2.12.2


* [PATCH rdma-next 2/3] IB/uverbs: Enable user space programs to use out of order placement
       [not found] ` <20170612064918.12510-1-leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2017-06-12  6:49   ` [PATCH rdma-next 1/3] IB/core: Expose out of order data placement capability Leon Romanovsky
@ 2017-06-12  6:49   ` Leon Romanovsky
  2017-06-12  6:49   ` [PATCH rdma-next 3/3] IB/mlx5: Support out of order data placement Leon Romanovsky
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 71+ messages in thread
From: Leon Romanovsky @ 2017-06-12  6:49 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein, Parav Pandit

From: Parav Pandit <parav-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

This patch enables user space programs to query out of order device
capability.

Signed-off-by: Parav Pandit <parav-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Reviewed-by: Daniel Jurgens <danielj-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Reviewed-by: Eli Cohen <eli-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 drivers/infiniband/core/uverbs_cmd.c | 16 ++++++++++++++++
 include/uapi/rdma/ib_user_verbs.h    | 16 ++++++++++++++--
 2 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 70b7fb156414..122ba59c633b 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -3766,6 +3766,15 @@ ssize_t ib_uverbs_destroy_srq(struct ib_uverbs_file *file,
 	return in_len;
 }
 
+static void copy_ooo_caps(struct ib_uverbs_ooo_caps *uverb_caps,
+			  struct ib_ooo_caps *attr_caps)
+{
+	uverb_caps->rc_caps = attr_caps->rc_caps;
+	uverb_caps->xrc_caps = attr_caps->xrc_caps;
+	uverb_caps->ud_caps = attr_caps->ud_caps;
+	uverb_caps->uc_caps = attr_caps->uc_caps;
+}
+
 int ib_uverbs_ex_query_device(struct ib_uverbs_file *file,
 			      struct ib_device *ib_dev,
 			      struct ib_udata *ucore,
@@ -3854,6 +3863,13 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file,
 
 	resp.raw_packet_caps = attr.raw_packet_caps;
 	resp.response_length += sizeof(resp.raw_packet_caps);
+
+	if (ucore->outlen < resp.response_length + sizeof(resp.ooo_caps))
+		goto end;
+
+	copy_ooo_caps(&resp.ooo_caps, &attr.ooo_caps);
+	resp.response_length += sizeof(resp.ooo_caps);
+
 end:
 	err = ib_copy_to_udata(ucore, &resp, resp.response_length);
 	return err;
diff --git a/include/uapi/rdma/ib_user_verbs.h b/include/uapi/rdma/ib_user_verbs.h
index 270c350bedc6..3519ff8e34cd 100644
--- a/include/uapi/rdma/ib_user_verbs.h
+++ b/include/uapi/rdma/ib_user_verbs.h
@@ -236,6 +236,17 @@ struct ib_uverbs_rss_caps {
 	__u32 reserved;
 };
 
+struct ib_uverbs_ooo_caps {
+	/*
+	 * Per transport capability indicating whether out of order data
+	 * placement is supported or not.
+	 */
+	 __u32 rc_caps;
+	 __u32 xrc_caps;
+	 __u32 ud_caps;
+	 __u32 uc_caps;
+};
+
 struct ib_uverbs_ex_query_device_resp {
 	struct ib_uverbs_query_device_resp base;
 	__u32 comp_mask;
@@ -247,6 +258,7 @@ struct ib_uverbs_ex_query_device_resp {
 	struct ib_uverbs_rss_caps rss_caps;
 	__u32  max_wq_type_rq;
 	__u32 raw_packet_caps;
+	struct ib_uverbs_ooo_caps ooo_caps;
 };
 
 struct ib_uverbs_query_port {
@@ -555,9 +567,9 @@ enum {
 
 enum {
 	/*
-	 * This value is equal to IB_QP_RATE_LIMIT.
+	 * This value is equal to IB_QP_OOO_RW_DATA_PLACEMENT.
 	 */
-	IB_USER_LAST_QP_ATTR_MASK = 1ULL << 25,
+	IB_USER_LAST_QP_ATTR_MASK = 1ULL << 26,
 };
 
 struct ib_uverbs_ex_create_qp {
-- 
2.12.2


* [PATCH rdma-next 3/3] IB/mlx5: Support out of order data placement
       [not found] ` <20170612064918.12510-1-leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2017-06-12  6:49   ` [PATCH rdma-next 1/3] IB/core: Expose out of order data placement capability Leon Romanovsky
  2017-06-12  6:49   ` [PATCH rdma-next 2/3] IB/uverbs: Enable user space programs to use out of order placement Leon Romanovsky
@ 2017-06-12  6:49   ` Leon Romanovsky
  2017-06-12 15:28   ` [PATCH rdma-next 0/3] " Bart Van Assche
  2017-06-12 17:18   ` Steve Wise
  4 siblings, 0 replies; 71+ messages in thread
From: Leon Romanovsky @ 2017-06-12  6:49 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein, Parav Pandit

From: Parav Pandit <parav-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

This patch sets the out of order data placement capability whenever
the device supports it. Currently it is supported on RC and XRC QPs.
It also extends the support to set this attribute for a QP during
state transition.

Signed-off-by: Parav Pandit <parav-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Reviewed-by: Daniel Jurgens <danielj-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Reviewed-by: Eli Cohen <eli-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 drivers/infiniband/hw/mlx5/main.c | 12 ++++++++++++
 drivers/infiniband/hw/mlx5/qp.c   | 25 +++++++++++++++++++++++++
 include/linux/mlx5/mlx5_ifc.h     |  5 ++++-
 3 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 0c79983c8b1a..7f5284c08b8d 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -572,6 +572,16 @@ static int mlx5_query_node_desc(struct mlx5_ib_dev *dev, char *node_desc)
 				    MLX5_REG_NODE_DESC, 0, 0);
 }
 
+static void mlx5_ib_fill_ooo_caps(struct mlx5_ib_dev *dev,
+				  struct ib_ooo_caps *caps)
+{
+	if (MLX5_CAP_GEN(dev->mdev, multi_path_rc_rdma))
+		caps->rc_caps |= IB_OOO_RW_DATA_PLACEMENT;
+
+	if (MLX5_CAP_GEN(dev->mdev, multi_path_xrc_rdma))
+		caps->xrc_caps |= IB_OOO_RW_DATA_PLACEMENT;
+}
+
 static int mlx5_ib_query_device(struct ib_device *ibdev,
 				struct ib_device_attr *props,
 				struct ib_udata *uhw)
@@ -765,6 +775,8 @@ static int mlx5_ib_query_device(struct ib_device *ibdev,
 			1 << MLX5_CAP_GEN(dev->mdev, log_max_rq);
 	}
 
+	mlx5_ib_fill_ooo_caps(dev, &props->ooo_caps);
+
 	if (field_avail(typeof(resp), cqe_comp_caps, uhw->outlen)) {
 		resp.cqe_comp_caps.max_num =
 			MLX5_CAP_GEN(dev->mdev, cqe_compression) ?
diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index ebb6768684de..bbe616e373b3 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -54,6 +54,10 @@ enum {
 	MLX5_IB_SQ_STRIDE	= 6,
 };
 
+enum {
+	MLX5_QP_OOO_ATTR	= 25,
+};
+
 static const u32 mlx5_ib_opcode[] = {
 	[IB_WR_SEND]				= MLX5_OPCODE_SEND,
 	[IB_WR_LSO]				= MLX5_OPCODE_LSO,
@@ -2810,6 +2814,27 @@ static int __mlx5_ib_modify_qp(struct ib_qp *ibqp,
 	if (qp->flags & MLX5_IB_QP_SQPN_QP1)
 		context->deth_sqpn = cpu_to_be32(1);
 
+	if (attr_mask & IB_QP_OOO_RW_DATA_PLACEMENT) {
+		if (ibqp->qp_type == IB_QPT_RC) {
+			if (MLX5_CAP_GEN(dev->mdev, multi_path_rc_rdma)) {
+				context->flags_pd |=
+					cpu_to_be32(1 << MLX5_QP_OOO_ATTR);
+			} else {
+				err = -EINVAL;
+				goto out;
+			}
+		} else if (ibqp->qp_type == IB_QPT_XRC_INI ||
+			   ibqp->qp_type == IB_QPT_XRC_TGT) {
+			if (MLX5_CAP_GEN(dev->mdev, multi_path_xrc_rdma)) {
+				context->flags_pd |=
+					cpu_to_be32(1 << MLX5_QP_OOO_ATTR);
+			} else {
+				err = -EINVAL;
+				goto out;
+			}
+		}
+	}
+
 	mlx5_cur = to_mlx5_state(cur_state);
 	mlx5_new = to_mlx5_state(new_state);
 	mlx5_st = to_mlx5_st(ibqp->qp_type);
diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index edafedb7b509..3d1e77e97552 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -856,7 +856,10 @@ struct mlx5_ifc_cmd_hca_cap_bits {
 	u8         pps[0x1];
 	u8         pps_modify[0x1];
 	u8         log_max_msg[0x5];
-	u8         reserved_at_1c8[0x4];
+	u8         multi_path_xrc_rdma[0x1];
+	u8         reserved_at_1c9[0x1];
+	u8         multi_path_rc_rdma[0x1];
+	u8         reserved_at_1cb[0x1];
 	u8         max_tc[0x4];
 	u8         reserved_at_1d0[0x1];
 	u8         dcbx[0x1];
-- 
2.12.2


* Re: [PATCH rdma-next 0/3] Support out of order data placement
       [not found] ` <20170612064918.12510-1-leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
                     ` (2 preceding siblings ...)
  2017-06-12  6:49   ` [PATCH rdma-next 3/3] IB/mlx5: Support out of order data placement Leon Romanovsky
@ 2017-06-12 15:28   ` Bart Van Assche
       [not found]     ` <1497281280.2770.1.camel-Sjgp3cTcYWE@public.gmane.org>
  2017-06-12 17:18   ` Steve Wise
  4 siblings, 1 reply; 71+ messages in thread
From: Bart Van Assche @ 2017-06-12 15:28 UTC (permalink / raw)
  To: leon-DgEjT+Ai2ygdnm+yROfE0A, dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, idanb-VPRAkNaXOzVWk0Htik3J/w

On Mon, 2017-06-12 at 09:49 +0300, Leon Romanovsky wrote:
> Out of order data placement capability indicates that if HCA receives
> out of order RDMA packets, their data placement can be done at the
> desired memory destination given in the packet(s). This is applicable
> to RDMA read and write operations.

Hello Leon and Parav,

Since PCIe writes can be executed out of order, shouldn't that be mentioned
in Documentation/infiniband/out_of_order.txt? See also the documentation of
the Device Control Register and the Enable Relaxed Ordering bit in the PCIe
spec.

Additionally, since not handling out-of-order RDMA writes correctly is an ULP
bug and since there are more ULPs that handle out-of-order writes correctly
than ULPs that don't handle out-of-order writes correctly, if a new flag is
introduced, shouldn't that be a flag to disable out-of-order writes?

Thanks,

Bart.

* RE: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]     ` <1497281280.2770.1.camel-Sjgp3cTcYWE@public.gmane.org>
@ 2017-06-12 16:19       ` Parav Pandit
       [not found]         ` <VI1PR0502MB3008478FC7C70D1F398FE2B2D1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Parav Pandit @ 2017-06-12 16:19 UTC (permalink / raw)
  To: Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

Hi Bart,

> -----Original Message-----
> From: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org [mailto:linux-rdma-
> owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org] On Behalf Of Bart Van Assche
> Sent: Monday, June 12, 2017 10:28 AM
> To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org; dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Idan Burstein <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Subject: Re: [PATCH rdma-next 0/3] Support out of order data placement
> 
> On Mon, 2017-06-12 at 09:49 +0300, Leon Romanovsky wrote:
> > Out of order data placement capability indicates that if HCA receives
> > out of order RDMA packets, their data placement can be done at the
> > desired memory destination given in the packet(s). This is applicable
> > to RDMA read and write operations.
> 
> Hello Leon and Parav,
> 
> Since PCIe writes can be executed out of order, shouldn't that be mentioned
> in Documentation/infiniband/out_of_order.txt? See also the
> documentation of the Device Control Register and the Enable Relaxed
> Ordering bit in the PCIe spec.
There is no change in the way PCIe writes are done with respect to this
per QP bit. That is, even if this bit is cleared, the PCIe subsystem can
still perform out of order writes depending on the relaxed ordering flag.

> 
> Additionally, since not handling out-of-order RDMA writes correctly is an
> ULP bug and since there are more ULPs that handle out-of-order writes
> correctly than ULPs that don't handle out-of-order writes correctly, if a new
> flag is introduced, shouldn't that be a flag to disable out-of-order writes?
Not sure I understood correctly. This bit is not a bug fix for ULPs that
don't handle out-of-order writes.
As I described in the Documentation,
"Out of order data placement capability indicates that if HCA receives
out of order RDMA packets, their data placement can be done at the
desired memory destination given in the packet(s). This is applicable to
RDMA read and write operations."
This flag indicates that, in the above condition, the HCA can do data
placement out of order.
Without this flag enabled, when the HCA receives out of order packets, it
drops them due to a PSN sequence error.

So, to your question: shouldn't that be a flag to disable out-of-order writes?
By default, it is disabled at the RDMA level.
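
To make the difference concrete, here is a toy model of the two
behaviours described above. It is a deliberately simplified sketch and
not how any HCA is implemented: struct pkt, toy_receive() and the fixed
4-byte payload are assumptions for illustration, and a real responder
would signal the sequence error back to the requester rather than
silently skip the packet.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAYLOAD 4	/* toy fixed payload size, assumption */

/* Toy packet: sequence number plus the destination offset it carries. */
struct pkt {
	uint32_t psn;
	size_t offset;
	char data[PAYLOAD];
};

/*
 * Toy responder. With ooo_enabled == 0 a packet whose PSN is not the
 * expected one is dropped (PSN sequence error) and must be
 * retransmitted. With ooo_enabled != 0 every packet is placed at the
 * offset it carries, regardless of arrival order.
 * Returns the number of packets whose data was placed.
 */
static size_t toy_receive(const struct pkt *pkts, size_t n,
			  uint32_t first_psn, int ooo_enabled,
			  char *buf, size_t buf_len)
{
	uint32_t expected = first_psn;
	size_t placed = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!ooo_enabled && pkts[i].psn != expected)
			continue;	/* dropped, needs retransmission */
		if (pkts[i].offset + PAYLOAD > buf_len)
			continue;	/* would overrun the toy buffer */

		memcpy(buf + pkts[i].offset, pkts[i].data, PAYLOAD);
		placed++;
		if (!ooo_enabled)
			expected++;
	}
	return placed;
}
```

Feeding four packets in the arrival order PSN 0, 2, 1, 3 places only two
of them in the in-order mode, while the out of order mode places all
four and the buffer ends up fully populated without retransmission.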

> 
> Thanks,
> 
> Bart.--
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the
> body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info
> at  http://vger.kernel.org/majordomo-info.html

* Re: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]         ` <VI1PR0502MB3008478FC7C70D1F398FE2B2D1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
  2017-06-12 16:29           ` Bart Van Assche
@ 2017-06-12 16:29           ` Jason Gunthorpe
       [not found]             ` <20170612162917.GA11993-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  1 sibling, 1 reply; 71+ messages in thread
From: Jason Gunthorpe @ 2017-06-12 16:29 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

On Mon, Jun 12, 2017 at 04:19:37PM +0000, Parav Pandit wrote:

> So, - to your question - shouldn't that be a flag to disable
> out-of-order writes?  By default, its disabled at RDMA level.

Bart's point is that it is not disabled by default; it is platform
specific by default.

Your patch makes it more likely a ULP will see out of order
writes, but it is already the case that any ULP that relies on in-order
writes is platform specific and/or broken.

I think the answer to Bart's question is that some MPI ULPs are
'broken' and only work on x86 - setting this bit by default would
cause them to stop working. It is unfortunate this is the rare case.

I would suggest at least using the inverted sense like Bart describes
in the kernel - every kernel ULP is safe.

Jason

* Re: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]         ` <VI1PR0502MB3008478FC7C70D1F398FE2B2D1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
@ 2017-06-12 16:29           ` Bart Van Assche
       [not found]             ` <1497284956.2770.8.camel-Sjgp3cTcYWE@public.gmane.org>
  2017-06-12 16:29           ` Jason Gunthorpe
  1 sibling, 1 reply; 71+ messages in thread
From: Bart Van Assche @ 2017-06-12 16:29 UTC (permalink / raw)
  To: Bart Van Assche, parav-VPRAkNaXOzVWk0Htik3J/w,
	leon-DgEjT+Ai2ygdnm+yROfE0A, dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, idanb-VPRAkNaXOzVWk0Htik3J/w

On Mon, 2017-06-12 at 16:19 +0000, Parav Pandit wrote:
> Hi Bart,
> 
> > -----Original Message-----
> > From: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org [mailto:linux-rdma-
> > owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org] On Behalf Of Bart Van Assche
> > Sent: Monday, June 12, 2017 10:28 AM
> > To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org; dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Idan Burstein <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> > Subject: Re: [PATCH rdma-next 0/3] Support out of order data placement
> > 
> > On Mon, 2017-06-12 at 09:49 +0300, Leon Romanovsky wrote:
> > > Out of order data placement capability indicates that if HCA receives
> > > out of order RDMA packets, their data placement can be done at the
> > > desired memory destination given in the packet(s). This is applicable
> > > to RDMA read and write operations.
> > 
> > Hello Leon and Parav,
> > 
> > Since PCIe writes can be executed out of order, shouldn't that be mentioned
> > in Documentation/infiniband/out_of_order.txt? See also the
> > documentation of the Device Control Register and the Enable Relaxed
> > Ordering bit in the PCIe spec.
> 
> There is no change in the way PCIe writes are done with respect to this per QP
> bit. Meaning, if this bit is cleared, PCIe subsystem can still do out of order
> writes depending on relaxed ordering flag.

Hello Parav,

That's why I asked to mention PCIe write reordering in out_of_order.txt. Someone
who reads that document could be misled into assuming that no reordering will
occur if the HCA does not reorder writes.

> > Additionally, since not handling out-of-order RDMA writes correctly is an
> > ULP bug and since there are more ULPs that handle out-of-order writes
> > correctly than ULPs that don't handle out-of-order writes correctly, if a new
> > flag is introduced, shouldn't that be a flag to disable out-of-order writes?
> 
> Not sure I understood correctly. This bit is not a bug fix for ULPs who don't
> handle out-of-order writes. As I described in Documentation, "Out of order data
> placement capability indicates that if HCA receives out of order RDMA packets,
> their data placement can be done at the desired memory destination given in the
> packet(s). This is applicable to RDMA read and write operations." This flag
> indicates that - in above condition, HCA can do data placement out-of-order.
> Without enabling this flag, when HCA receives out of order packets, it would drop
> them due to PSN sequence error.
> 
> So, - to your question - shouldn't that be a flag to disable out-of-order writes?
> By default, its disabled at RDMA level.

I don't think that your last two paragraphs mention anything that had not yet
been mentioned in the four e-mails constituting your patch series. Additionally,
I think my question was clear and unambiguous. So please reread my question.

Thanks,

Bart.

* RE: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]             ` <20170612162917.GA11993-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-06-12 16:42               ` Parav Pandit
       [not found]                 ` <VI1PR0502MB3008EA451DA9ECEBECE27362D1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Parav Pandit @ 2017-06-12 16:42 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

Hi Jason,

> -----Original Message-----
> From: Jason Gunthorpe [mailto:jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org]
> Sent: Monday, June 12, 2017 11:29 AM
> To: Parav Pandit <parav-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Cc: Bart Van Assche <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>; leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org;
> dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Idan Burstein
> <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Subject: Re: [PATCH rdma-next 0/3] Support out of order data placement
> 
> On Mon, Jun 12, 2017 at 04:19:37PM +0000, Parav Pandit wrote:
> 
> > So, - to your question - shouldn't that be a flag to disable
> > out-of-order writes?  By default, its disabled at RDMA level.
> 
> Bart's point is that it is not disabled, by default, it is platform specific by
> default.
> 
> Your patch makes makes it more likely a ULP will see out of order writes,
> but it is already the case that any ULP that relies on this is platform specific
> and/or broken.
> 
> I think the answer to Bart's question is that some MPI ULPs are 'broken'
> and only work on x86 - setting this bit by default would cause them to stop
> working. It is unfortunate this is the rare case.
There are two bits, as described in the usage section of the Documentation:
(a) a device capability bit, which indicates that the device is capable of processing out-of-order packets. This doesn't mean it is enabled.
(b) a per-QP enable bit, which the application can set if it wants to.

So no bit is set by default.
An application that sets this per-QP bit and still has logic that polls on write data would be wrong; that would be a bug in that application.
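
[Illustration] A minimal sketch of the two-bit scheme described above: a device capability bit that only advertises support, and a per-QP bit that the application must set explicitly during the INIT to RTR transition. All identifiers here (EX_DEV_CAP_OOO, EX_QP_ATTR_OOO, ex_modify_qp_to_rtr and the structs) are hypothetical names chosen for this example; they are not the names used by the patch set or by any verbs API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag names, for illustration only. */
#define EX_DEV_CAP_OOO  (1u << 0)  /* device is able to place out-of-order packets */
#define EX_QP_ATTR_OOO  (1u << 0)  /* application requests it for this QP */

struct ex_device { unsigned int cap_flags; };
struct ex_qp     { bool ooo_enabled; };

/*
 * Models the INIT -> RTR transition.
 * Returns 0 on success, -1 if the request cannot be honoured.
 */
static int ex_modify_qp_to_rtr(const struct ex_device *dev, struct ex_qp *qp,
                               unsigned int attr_flags)
{
    bool want_ooo = (attr_flags & EX_QP_ATTR_OOO) != 0;

    /* Capable does not mean enabled, but enabling requires capability. */
    if (want_ooo && !(dev->cap_flags & EX_DEV_CAP_OOO))
        return -1;

    /* The feature stays off unless the application asked for it. */
    qp->ooo_enabled = want_ooo;
    return 0;
}
```

With this model, a QP on a capable device that passes no flag stays in-order, which is the "no bit is set by default" behaviour stated above.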

> 
> I would suggest at least using the inverted sense like Bart describes in the
> kernel - every kernel ULP is safe.
> 
I don't see a need to use the inverted sense in code. I can surely make the documentation more descriptive, as Bart suggested.

> Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                 ` <VI1PR0502MB3008EA451DA9ECEBECE27362D1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
@ 2017-06-12 16:43                   ` Jason Gunthorpe
       [not found]                     ` <20170612164343.GA12435-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Jason Gunthorpe @ 2017-06-12 16:43 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

On Mon, Jun 12, 2017 at 04:42:27PM +0000, Parav Pandit wrote:

> > I would suggest at least using the inverted sense like Bart describes in the
> > kernel - every kernel ULP is safe.
 
> I don't see a need to use inverted sense in code. I can surely make documentation more descriptive as Bart suggested.

If this is 'better' then it should be on as much as possible, and I
certainly don't want to see kernel ULPs query caps and other pointless
things when they already, necessarily, deal with out of order.

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]             ` <1497284956.2770.8.camel-Sjgp3cTcYWE@public.gmane.org>
@ 2017-06-12 16:51               ` Parav Pandit
  0 siblings, 0 replies; 71+ messages in thread
From: Parav Pandit @ 2017-06-12 16:51 UTC (permalink / raw)
  To: Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

Hi Bart,

> -----Original Message-----
> From: Bart Van Assche [mailto:Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org]
> Sent: Monday, June 12, 2017 11:29 AM
> To: Bart Van Assche <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>; Parav Pandit
> <parav-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>; leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org; dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Idan Burstein <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Subject: Re: [PATCH rdma-next 0/3] Support out of order data placement
> 
> On Mon, 2017-06-12 at 16:19 +0000, Parav Pandit wrote:
> > Hi Bart,
> >
> > > -----Original Message-----
> > > From: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org [mailto:linux-rdma-
> > > owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org] On Behalf Of Bart Van Assche
> > > Sent: Monday, June 12, 2017 10:28 AM
> > > To: leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org; dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > > Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Idan Burstein <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> > > Subject: Re: [PATCH rdma-next 0/3] Support out of order data
> > > placement
> > >
> > > On Mon, 2017-06-12 at 09:49 +0300, Leon Romanovsky wrote:
> > > > Out of order data placement capability indicates that if HCA
> > > > receives out of order RDMA packets, their data placement can be
> > > > done at the desired memory destination given in the packet(s).
> > > > This is applicable to RDMA read and write operations.
> > >
> > > Hello Leon and Parav,
> > >
> > > Since PCIe writes can be executed out of order, shouldn't that be
> > > mentioned in Documentation/infiniband/out_of_order.txt? See also the
> > > documentation of the Device Control Register and the Enable Relaxed
> > > Ordering bit in the PCIe spec.
> >
> > There is no change in the way PCIe writes are done with respect to
> > this per QP bit. Meaning, if this bit is cleared, PCIe subsystem can
> > still do out of order writes depending on relaxed ordering flag.
> 
> Hello Parav,
> 
> That's why I asked to mention PCIe write reordering in out_of_order.txt.
> Someone who is reading that document could be misled to assume that if
> the HCA does not reorder writes that no reordering will occur.
Makes sense. I will update the documentation to describe that PCIe out-of-order writes can still happen.

> 
> > > Additionally, since not handling out-of-order RDMA writes correctly
> > > is an ULP bug and since there are more ULPs that handle out-of-order
> > > writes correctly than ULPs that don't handle out-of-order writes
> > > correctly, if a new flag is introduced, shouldn't that be a flag to disable
> out-of-order writes?
> >
> > Not sure I understood correctly. This bit is not a bug fix for ULPs
> > who don't handle out-of-order writes. As I described in Documentation,
> > "Out of order data placement capability indicates that if HCA receives
> > out of order RDMA packets, their data placement can be done at the
> > desired memory destination given in the packet(s). This is applicable
> > to RDMA read and write operations." This flag indicates that - in above
> condition, HCA can do data placement out-of-order.
> > Without enabling this flag, when HCA receives out of order packets, it
> > would drop them due to PSN sequence error.
> >
> > So, - to your question - shouldn't that be a flag to disable out-of-order
> writes?
> > By default, it's disabled at RDMA level.
> 
> I don't think that your last two paragraphs mention anything that had not
> yet been mentioned in the four e-mails constituting your patch series.
I elaborated on two points in the last two paragraphs of my email to answer your question.
(a) Without enabling this out-of-order flag, when the HCA receives out-of-order packets, it drops them due to a PSN sequence error.
(b) By default, out-of-order is disabled at the RDMA level.
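
[Illustration] The difference between the two modes in point (a) can be sketched with a toy responder. This is a simplified model, not HCA behaviour taken from the patch: the struct, the EX_MTU payload size, and ex_receive() are hypothetical, and the model does not track which PSNs have already arrived.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define EX_MTU 4  /* tiny payload size, for illustration only */

struct ex_responder {
    bool ooo_enabled;           /* hypothetical per-QP enable bit */
    unsigned int expected_psn;  /* next in-sequence PSN */
    unsigned int first_psn;     /* PSN of the first packet of the message */
    char *buf;                  /* destination buffer of the RDMA write */
};

/* Returns true if the payload was placed, false if the packet was dropped. */
static bool ex_receive(struct ex_responder *r, unsigned int psn,
                       const char payload[EX_MTU])
{
    /* Without the flag, an unexpected PSN is a sequence error: drop it. */
    if (psn != r->expected_psn && !r->ooo_enabled)
        return false;

    /* Each packet carries enough information to compute its own offset. */
    memcpy(r->buf + (psn - r->first_psn) * EX_MTU, payload, EX_MTU);

    if (psn == r->expected_psn)
        r->expected_psn++;
    return true;
}
```

In the in-order case the dropped packet has to be retransmitted by the requester, which is the cost the cover letter wants to avoid.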

> Additionally, I think my question was clear and unambiguous. So please
> reread my question.

I reread your question. Let me try to answer again.
> shouldn't that be a flag to disable out-of-order writes?
No. There is no need for such a flag in the context of this patchset, because by default out-of-order writes are disabled at the RDMA level.

> 
> Thanks,
> 
> Bart.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                     ` <20170612164343.GA12435-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-06-12 16:53                       ` Parav Pandit
       [not found]                         ` <VI1PR0502MB300831A1560531E67B29589DD1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Parav Pandit @ 2017-06-12 16:53 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein



> -----Original Message-----
> From: Jason Gunthorpe [mailto:jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org]
> Sent: Monday, June 12, 2017 11:44 AM
> To: Parav Pandit <parav-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Cc: Bart Van Assche <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>; leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org;
> dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Idan Burstein
> <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Subject: Re: [PATCH rdma-next 0/3] Support out of order data placement
> 
> On Mon, Jun 12, 2017 at 04:42:27PM +0000, Parav Pandit wrote:
> 
> > > I would suggest at least using the inverted sense like Bart
> > > describes in the kernel - every kernel ULP is safe.
> 
> > I don't see a need to use inverted sense in code. I can surely make
> documentation more descriptive as Bart suggested.
> 
> If this is 'better' then it should be on as much as possible, and I certainly
> don't want to see kernel ULPs query caps and other pointless things when
> they already, necessarily, deal with out of order.

Sure. Kernel ULPs and any other user ULPs can skip query caps.

> 
> Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                         ` <VI1PR0502MB300831A1560531E67B29589DD1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
@ 2017-06-12 16:55                           ` Jason Gunthorpe
       [not found]                             ` <20170612165536.GB12435-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Jason Gunthorpe @ 2017-06-12 16:55 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

On Mon, Jun 12, 2017 at 04:53:28PM +0000, Parav Pandit wrote:
> 
> 
> > From: Jason Gunthorpe [mailto:jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org]
> > Sent: Monday, June 12, 2017 11:44 AM
> > To: Parav Pandit <parav-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> > Cc: Bart Van Assche <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>; leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org;
> > dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Idan Burstein
> > <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> > Subject: Re: [PATCH rdma-next 0/3] Support out of order data placement
> > 
> > On Mon, Jun 12, 2017 at 04:42:27PM +0000, Parav Pandit wrote:
> > 
> > > > I would suggest at least using the inverted sense like Bart
> > > > describes in the kernel - every kernel ULP is safe.
> > 
> > > I don't see a need to use inverted sense in code. I can surely make
> > documentation more descriptive as Bart suggested.
> > 
> > If this is 'better' then it should be on as much as possible, and I certainly
> > don't want to see kernel ULPs query caps and other pointless things when
> > they already, necessarily, deal with out of order.
> 
> Sure. Kernel ULPs and any other user ULPs can skip query caps.

No, they can't because only mlx5 accepts the new flag.

This is why inverting the flag in the kernel makes much more sense.

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                             ` <20170612165536.GB12435-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-06-12 17:11                               ` Parav Pandit
       [not found]                                 ` <VI1PR0502MB30089EDC828A142338B1EE06D1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
  2017-06-12 21:09                               ` Tom Talpey
  2017-06-27  9:47                               ` Sagi Grimberg
  2 siblings, 1 reply; 71+ messages in thread
From: Parav Pandit @ 2017-06-12 17:11 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

Hi Jason,

> -----Original Message-----
> From: Jason Gunthorpe [mailto:jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org]
> Sent: Monday, June 12, 2017 11:56 AM
> To: Parav Pandit <parav-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Cc: Bart Van Assche <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>; leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org;
> dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Idan Burstein
> <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Subject: Re: [PATCH rdma-next 0/3] Support out of order data placement
> 
> On Mon, Jun 12, 2017 at 04:53:28PM +0000, Parav Pandit wrote:
> >
> >
> > > From: Jason Gunthorpe [mailto:jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org]
> > > Sent: Monday, June 12, 2017 11:44 AM
> > > To: Parav Pandit <parav-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> > > Cc: Bart Van Assche <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>; leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org;
> > > dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Idan Burstein
> > > <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> > > Subject: Re: [PATCH rdma-next 0/3] Support out of order data
> > > placement
> > >
> > > On Mon, Jun 12, 2017 at 04:42:27PM +0000, Parav Pandit wrote:
> > >
> > > > > I would suggest at least using the inverted sense like Bart
> > > > > describes in the kernel - every kernel ULP is safe.
> > >
> > > > I don't see a need to use inverted sense in code. I can surely
> > > > make
> > > documentation more descriptive as Bart suggested.
> > >
> > > If this is 'better' then it should be on as much as possible, and I
> > > > certainly don't want to see kernel ULPs query caps and other
> > > pointless things when they already, necessarily, deal with out of order.
> >
> > Sure. Kernel ULPs and any other user ULPs can skip query caps.
> 
As you know - capable doesn't mean enabled in query_cap.

> No, they can't because only mlx5 accepts the new flag.
> 
By default, out-of-order processing of RDMA write packets is disabled for all QPs, even on mlx5.
When the user requests it during modify_qp(), as described in usage section 3 of Documentation/out_of_order.txt, the HCA will process packets that way.

> This is why inverting the flag in the kernel makes much more sense.
When you say invert the flag, which flag do you mean?
The capability flag or the QP attribute?
The capability flag indicates whether out-of-order processing at the RDMA level is supported.
The QP attribute enables such OOO processing for a particular QP.

Let's not mix PCIe write ordering with RDMA-level out-of-order processing.
The IB spec clearly says, as I reiterated in Documentation/out_of_order.txt, that an application should not depend on the write data until write_with_imm arrives or a subsequent RDMA send arrives at the responder side.
If some MPI or other application has taken a shortcut based on some platform, it has to help itself. This patch is not intended to address such an issue.
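
[Illustration] A toy model of the shortcut mentioned above, where an application polls the last byte of an RDMA write buffer instead of waiting for a completion. The struct and the ex_* helpers are hypothetical, and ex_completed() is only a stand-in for a receive completion (write with immediate, or a subsequent send); real code would poll a completion queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct ex_msg_state {
    char buf[8];          /* RDMA write destination buffer */
    size_t bytes_placed;  /* bookkeeping that stands in for the HCA's view */
};

/* Place one chunk at its offset, in whatever order it arrives. */
static void ex_place(struct ex_msg_state *s, size_t off,
                     const char *data, size_t len)
{
    memcpy(s->buf + off, data, len);
    s->bytes_placed += len;
}

/* Non-conformant shortcut: a non-zero last byte is taken as "all arrived". */
static bool ex_poll_last_byte(const struct ex_msg_state *s)
{
    return s->buf[sizeof(s->buf) - 1] != 0;
}

/* Stand-in for the completion the spec tells the application to wait for. */
static bool ex_completed(const struct ex_msg_state *s)
{
    return s->bytes_placed == sizeof(s->buf);
}
```

If the second half of the message is placed first, ex_poll_last_byte() already reports success while the first half of the buffer is still stale; ex_completed() does not.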

> 
> Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                                 ` <VI1PR0502MB30089EDC828A142338B1EE06D1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
@ 2017-06-12 17:14                                   ` Jason Gunthorpe
       [not found]                                     ` <20170612171436.GA12739-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Jason Gunthorpe @ 2017-06-12 17:14 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

On Mon, Jun 12, 2017 at 05:11:09PM +0000, Parav Pandit wrote:

> > This is why inverting the flag in the kernel makes much more sense.
> When you say invert the flag, which flag do you mean?
> Capability flag or QP attribute?

QP attribute.

> Let's not mix PCIe write ordering with RDMA level out-of-order
> processing.  IB spec clearly says, as I reiterated in the
> Documentation/out_of_order.txt - that application should not depend
> on the write data until write_with_imm arrives or subsequent rdma
> send arrives at responder side.  If some MPI or other application
> have taken shortcut based on some platform, they have to help
> themselves. This patch is not intended to address such issue.

Exactly - all spec conformant ULPs are compatible with enabling this
new function of mlx5.

There is no reason to add a flag at all, except to retain
compatibility with certain non-conformant ULPs. We have none of those
in the kernel, so kernel QPs should always have the flag on.
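
[Illustration] One way to read the "inverted sense" proposal is as an opt-out attribute: out-of-order placement is used wherever the device supports it, and only a ULP that needs in-order placement sets a flag. The sketch below assumes that reading; EX_QP_ATTR_FORCE_IN_ORDER and ex_ooo_effective() are hypothetical names, not part of the patch set or of any kernel API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical opt-out flag for ULPs that rely on in-order placement. */
#define EX_QP_ATTR_FORCE_IN_ORDER (1u << 0)

/*
 * The core decides the effective setting, so a ULP never has to query
 * the device capability itself.
 */
static bool ex_ooo_effective(bool dev_capable, unsigned int attr_flags)
{
    if (attr_flags & EX_QP_ATTR_FORCE_IN_ORDER)
        return false;    /* ULP asked to keep in-order placement */
    return dev_capable;  /* on where supported, silently off elsewhere */
}
```

Under this model a spec-conformant kernel ULP passes no flag on any device, and devices without the capability simply behave as before.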

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [PATCH rdma-next 0/3] Support out of order data placement
       [not found] ` <20170612064918.12510-1-leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
                     ` (3 preceding siblings ...)
  2017-06-12 15:28   ` [PATCH rdma-next 0/3] " Bart Van Assche
@ 2017-06-12 17:18   ` Steve Wise
  2017-06-12 17:37     ` Parav Pandit
  4 siblings, 1 reply; 71+ messages in thread
From: Steve Wise @ 2017-06-12 17:18 UTC (permalink / raw)
  To: 'Leon Romanovsky', 'Doug Ledford'
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, 'Idan Burstein'

> -----Original Message-----
> From: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org [mailto:linux-rdma-
> owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org] On Behalf Of Leon Romanovsky
> Sent: Monday, June 12, 2017 1:49 AM
> To: Doug Ledford
> Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Idan Burstein
> Subject: [PATCH rdma-next 0/3] Support out of order data placement
> 
> Hi Doug,
> 
> This patch set which is based on v4.12-rc4 comes from Parav and it
> adds support for out of order data placement for RDMA read and
> write requests.
> 
> Thanks
> 
> ---
> >From Parav:
> 
> In certain fabric configurations IB packets for a given QP may take up
> different paths in a network from source to destination. This results
> into reaching packets out of order at the receiver side. Instead of
> dropping packets, handling out of order packets can improve overall
> network performance in following ways:
>  * improve network utilization by avoiding retransmission
>  * avoiding latency increase due to retransmission
> 
> This patchset allows HCA to expose out of order data placement
> capability to users who can take benefit of such feature.
> 
> This is optional feature of an HCA and enablement of this feature
> is done on per QP basis. The optional QP attribute is set when a QP
> state is changed from INIT to RTR.
> 
> Out of order data placement capability indicates that if HCA receives
> out of order RDMA packets, their data placement can be done at the
> desired memory destination given in the packet(s). This is applicable
> to RDMA read and write operations.
> 
> Send queue work requests are still completed in-order regardless of
> their data placement order at requester or responder side.
> 
> In-order semantics is always guaranteed by setting the Fence
> indicator for appropriate WRs.
> 
> An application shall not depend on the contents of the RDMA write buffer
> at the responder until one of the following occurred:
> - Completion of the RDMA WRITE with immediate data receive completion.
> - Arrival and completion of the subsequent SEND message.
> - Update of a memory element by subsequent ATOMIC operation.
> 
> An application shall not depend on the contents of the RDMA read buffer
> at the requester until one of the following occurred:
> - Completion of the RDMA READ work request if requested or such
>   work request completes with error status.
> - Completion of the subsequent work request.


Given that the application follows the above semantics, why does it care if data is
placed out of order?  I.e., why does this impact the application API at all?
 
Steve.



^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                                     ` <20170612171436.GA12739-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-06-12 17:28                                       ` Parav Pandit
       [not found]                                         ` <VI1PR0502MB3008304F8465C19ACCFF3D68D1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Parav Pandit @ 2017-06-12 17:28 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

Hi Jason,

> -----Original Message-----
> From: Jason Gunthorpe [mailto:jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org]
> Sent: Monday, June 12, 2017 12:15 PM
> To: Parav Pandit <parav-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Cc: Bart Van Assche <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>; leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org;
> dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Idan Burstein
> <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Subject: Re: [PATCH rdma-next 0/3] Support out of order data placement
> 
> On Mon, Jun 12, 2017 at 05:11:09PM +0000, Parav Pandit wrote:
> 
> > > This is why inverting the flag in the kernel makes much more sense.
> > When you say invert the flag, which flag do you mean?
> > Capability flag or QP attribute?
> 
> QP attribute.
> 
> > Let's not mix PCIe write ordering with RDMA level out-of-order
> > processing.  IB spec clearly says, as I reiterated in the
> > Documentation/out_of_order.txt - that application should not depend on
> > the write data until write_with_imm arrives or subsequent rdma send
> > arrives at responder side.  If some MPI or other application have
> > taken shortcut based on some platform, they have to help themselves.
> > This patch is not intended to address such issue.
> 
> Exactly - all spec conformant ULPs are compatible with enabling this new
> function of mlx5.
> 
This per-QP attribute applies to both read and write. So responder can receive out-of-order read responses.
And the HCA's QP needs to be told to accept them that way, which it doesn't do by default.

> There is no reason to add a flag at all, except to retain compatibility with
> certain non-conformant ULPs. We have none of those in the kernel, so kernel
> QPs should always have the flag on.
> 
A ULP can be conformant, but the requester and responder sides may not use the same HCA, so only one side may have this feature.
So the requester or responder cannot assume what the other side has enabled.
Hence the need for an explicit enable flag.

As I explained, by default the HCA doesn't process out-of-order packets. The user needs to tell it to do so.
For example, an adapter may be capable of 16 incoming read requests, but when modify_qp() happens, the user needs to say how many incoming outstanding read requests it should accept.
The same goes for outgoing read requests, etc.


> Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                                         ` <VI1PR0502MB3008304F8465C19ACCFF3D68D1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
@ 2017-06-12 17:32                                           ` Jason Gunthorpe
       [not found]                                             ` <20170612173221.GA13302-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Jason Gunthorpe @ 2017-06-12 17:32 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

On Mon, Jun 12, 2017 at 05:28:00PM +0000, Parav Pandit wrote:

> > Exactly - all spec conformant ULPs are compatible with enabling this new
> > function of mlx5.
 
> This per QP attribute is for read and write both. So responder can receive out-of-order read responses.
> And HCA's QP need to be told to accept it that way, which by default doesn't.

I think this one flag is conflating too many things then.

Obviously sending out of order packets is not spec conformant, and I
don't think your discussion is clear enough, as this change to send
side certainly was not clear to me.

I'm not excited about a new end-to-end flag without some kind of
negotiation scheme, and I'm not excited about this being in any of the
common APIs.

Perhaps it should be in libmlx5 instead.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [PATCH rdma-next 0/3] Support out of order data placement
  2017-06-12 17:18   ` Steve Wise
@ 2017-06-12 17:37     ` Parav Pandit
       [not found]       ` <VI1PR0502MB3008CE85A4F274886B74DF2BD1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Parav Pandit @ 2017-06-12 17:37 UTC (permalink / raw)
  To: Steve Wise, 'Leon Romanovsky', 'Doug Ledford'
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

Hi Steve,

> -----Original Message-----
> From: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org [mailto:linux-rdma-
> owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org] On Behalf Of Steve Wise
> Sent: Monday, June 12, 2017 12:19 PM
> To: 'Leon Romanovsky' <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>; 'Doug Ledford'
> <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Idan Burstein <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Subject: RE: [PATCH rdma-next 0/3] Support out of order data placement
> 
> > -----Original Message-----
> > From: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org [mailto:linux-rdma-
> > owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org] On Behalf Of Leon Romanovsky
> > Sent: Monday, June 12, 2017 1:49 AM
> > To: Doug Ledford
> > Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Idan Burstein
> > Subject: [PATCH rdma-next 0/3] Support out of order data placement
> >
> > Hi Doug,
> >
> > This patch set which is based on v4.12-rc4 comes from Parav and it
> > adds support for out of order data placement for RDMA read and write
> > requests.
> >
> > Thanks
> >
> > ---
> > >From Parav:
> >
> > In certain fabric configurations IB packets for a given QP may take up
> > different paths in a network from source to destination. This results
> > into reaching packets out of order at the receiver side. Instead of
> > dropping packets, handling out of order packets can improve overall
> > network performance in following ways:
> >  * improve network utilization by avoiding retransmission
> >  * avoiding latency increase due to retransmission
> >
> > This patchset allows HCA to expose out of order data placement
> > capability to users who can take benefit of such feature.
> >
> > This is optional feature of an HCA and enablement of this feature is
> > done on per QP basis. The optional QP attribute is set when a QP state
> > is changed from INIT to RTR.
> >
> > Out of order data placement capability indicates that if HCA receives
> > out of order RDMA packets, their data placement can be done at the
> > desired memory destination given in the packet(s). This is applicable
> > to RDMA read and write operations.
> >
> > Send queue work requests are still completed in-order regardless of
> > their data placement order at requester or responder side.
> >
> > In-order semantics is always guaranteed by setting the Fence indicator
> > for appropriate WRs.
> >
> > An application shall not depend on the contents of the RDMA write
> > buffer at the responder until one of the following occurred:
> > - Completion of the RDMA WRITE with immediate data receive
> completion.
> > - Arrival and completion of the subsequent SEND message.
> > - Update of a memory element by subsequent ATOMIC operation.
> >
> > An application shall not depend on the contents of the RDMA read
> > buffer at the requester until one of the following occurred:
> > - Completion of the RDMA READ work request if requested or such
> >   work request completes with error status.
> > - Completion of the subsequent work request.
> 
> 
> Given the application follows the above semantics, why does it care if data
> is placed out of order?  IE Why does this impact the application API at all?
> 
If an application wants to benefit from out-of-order placement, it should enable this flag.
Following the semantics is fine, but letting the HCA enable something under the hood for all QPs is not good.
The HCA or stack doesn't know which applications are running or whether they follow such semantics.
So it's best left to the end-user applications to decide whether to benefit from it.


> Steve.
> 
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                                             ` <20170612173221.GA13302-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-06-12 17:46                                               ` Parav Pandit
       [not found]                                                 ` <VI1PR0502MB30081FD74492E97043F45BD8D1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Parav Pandit @ 2017-06-12 17:46 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

Hi Jason,

> -----Original Message-----
> From: Jason Gunthorpe [mailto:jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org]
> Sent: Monday, June 12, 2017 12:32 PM
> To: Parav Pandit <parav-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Cc: Bart Van Assche <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>; leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org;
> dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Idan Burstein
> <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Subject: Re: [PATCH rdma-next 0/3] Support out of order data placement
> 
> On Mon, Jun 12, 2017 at 05:28:00PM +0000, Parav Pandit wrote:
> 
> > > Exactly - all spec conformant ULPs are compatible with enabling this
> > > new function of mlx5.
> 
> > This per QP attribute is for read and write both. So responder can receive
> > out-of-order read responses.
> > And HCA's QP need to be told to accept it that way, which by default
> > doesn't.
> 
> I think this one flag is conflating too many things then.
> 
Too many flags (read/write/requester/responder, etc.) just confuse the end user.
Also, there isn't a well-established use case where a user wants to enable only some of these.

> Obviously sending out of order packets is not spec conformant, and I don't
> think your discussion is clear enough, as this change to send side certainly
> was not clear to me.
> 
OK. I can make the Documentation more elaborate so that it summarizes this discussion and Bart's point.

> I'm not excited about a new end-to-end flag without some kind of
> negotiation scheme, and I'm not excited about this being in any of the
> common APIs.
Many applications do out-of-band negotiation or static configuration, and that is unrelated to the patchset.
So the lack of negotiation shouldn't be an issue. It's outside the scope of this verb extension anyway.

> 
> Perhaps it should be in libmlx5 instead.

We think that it's best exposed through ibv_modify_qp because it's simple and will be well documented, which will address the inputs given in this discussion.

> 
> Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                                                 ` <VI1PR0502MB30081FD74492E97043F45BD8D1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
@ 2017-06-12 17:51                                                   ` Jason Gunthorpe
  0 siblings, 0 replies; 71+ messages in thread
From: Jason Gunthorpe @ 2017-06-12 17:51 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

On Mon, Jun 12, 2017 at 05:46:34PM +0000, Parav Pandit wrote:

> Too many flags just confuses end-user for
> read/write/requester/responder etc.  Also there isn't well
> established use case either where user wants to do only certain
> things.

Your original discussion said this optimized cases on Rx when the
fabric went out of order, eg repath, etc. That seems well defined
enough to be a dedicated flag, default to on, etc.

Sending packets out of order is an entirely different beast, and
sounds like it would also require additional configuration, as IBA
doesn't even have an existing way to describe a set of paths to choose
between.

How are the TX paths chosen? That needs to be documented before this
could even be considered for the common API.

This is why I think one flag is a bad idea.

> > Perhaps it should be in libmlx5 instead.
> 
> We think that its best usable through ibv_modify_qp because it's
> really simple enough and well documented which will address inputs
> given in this discussion.

That isn't really the criteria to be in the common API now that we
have the libmlx5 approach.

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]       ` <VI1PR0502MB3008CE85A4F274886B74DF2BD1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
@ 2017-06-12 19:06         ` Dennis Dalessandro
       [not found]           ` <3fa7a4b5-5c19-8c6a-d78b-93219a9be888-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Dennis Dalessandro @ 2017-06-12 19:06 UTC (permalink / raw)
  To: Parav Pandit, Steve Wise, 'Leon Romanovsky',
	'Doug Ledford'
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

On 6/12/2017 1:37 PM, Parav Pandit wrote:
>> Given the application follows the above semantics, why does it care if data
>> is placed out of order?  IE Why does this impact the application API at all?
>>
> If application wants to benefit from out-of-order placement, it should enable this flag.
> Following semantics is fine, but letting HCA enable something under the hood for all QPs, is not good.
> HCA or stack doesn't know which all applications are running, handling/following such semantics.
> So its best left to the end user applications to benefit or not from it.

I don't understand why the application has to care about this at all. If
the HW wants to place things out of order, OK, go for it. Applications
should be depending on the correct mechanism to know when it's OK to
look at the data. Why do they care what order it arrived in?

The only reason to care is if the application wants to do something
that is not compliant and look at parts of the data early. Perhaps you
can explain *how* an application may benefit from out-of-order placement
and, likewise, why it wouldn't want to benefit.

It seems to me, the real benefit is reducing the number of 
retransmissions which the application shouldn't care about.

-Denny






^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]           ` <3fa7a4b5-5c19-8c6a-d78b-93219a9be888-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
@ 2017-06-12 19:19             ` Hefty, Sean
       [not found]               ` <1828884A29C6694DAF28B7E6B8A82373AB142A9B-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Hefty, Sean @ 2017-06-12 19:19 UTC (permalink / raw)
  To: Dalessandro, Dennis, Parav Pandit, Steve Wise,
	'Leon Romanovsky', 'Doug Ledford'
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein


> I don't understand why the application has to care about this at all.
> If
> the HW wants to place things out of order. Ok go for it. Applications
> should be depending on the correct mechanism to know when it's OK to
> look at the data. Why do they care what order it arrived in?
> 
> The only reason to care is if the applications wants to do something
> that is not compliant and look at parts of the data early. Perhaps you
> can explain *how* an application may benefit from out-of-order
> placement
> and likewise, why it wouldn't want to benefit.

Relaxed ordering can improve performance by allowing packets to take multiple paths from the source to destination.  FWIW, the ofiwg actually discussed ordering at length, and libfabric exposes several ordering related attributes.

The application usually needs to know what sort of ordering the network is providing for correct operation.  For example, can sends be re-ordered by the network, such that they may be received in the opposite order that they were sent?  Can rdma writes be processed out of order, such that the target data ends up with some sort of interleaving of the written data?  Are there size limits to ordering constraints?

It's hard to capture this level of detail with a single flag.  There's ordering of data within a single message, as well as ordering of data between messages.  I don't know how this proposal works with explicit pathing that IB defines, or path failover.

- Sean

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]               ` <1828884A29C6694DAF28B7E6B8A82373AB142A9B-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2017-06-12 20:14                 ` Parav Pandit
       [not found]                   ` <VI1PR0502MB300885A1DD676E1649CDB268D1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Parav Pandit @ 2017-06-12 20:14 UTC (permalink / raw)
  To: Hefty, Sean, Dalessandro, Dennis, Steve Wise,
	'Leon Romanovsky', 'Doug Ledford'
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

Hi Denny, Sean,

As Sean explained, relaxed ordering (out-of-order packets) allows efficient use of network resources.
When both transmitter and receiver are enabled to do so, as I described in the overview section of the Documentation, it helps to
(a) avoid retransmission, which improves network utilization
(b) reduce latency, because timers do not kick in.

More comments to Sean's question below.

> -----Original Message-----
> From: Hefty, Sean [mailto:sean.hefty@intel.com]
> Sent: Monday, June 12, 2017 2:19 PM
> To: Dalessandro, Dennis <dennis.dalessandro@intel.com>; Parav Pandit
> <parav@mellanox.com>; Steve Wise <swise@opengridcomputing.com>;
> 'Leon Romanovsky' <leon@kernel.org>; 'Doug Ledford'
> <dledford@redhat.com>
> Cc: linux-rdma@vger.kernel.org; Idan Burstein <idanb@mellanox.com>
> Subject: RE: [PATCH rdma-next 0/3] Support out of order data placement
> 
> > I don't understand why the application has to care about this at all.
> > If
> > the HW wants to place things out of order. Ok go for it. Applications
> > should be depending on the correct mechanism to know when it's OK to
> > look at the data. Why do they care what order it arrived in?
> >
> > The only reason to care is if the applications wants to do something
> > that is not compliant and look at parts of the data early. Perhaps you
> > can explain *how* an application may benefit from out-of-order
> > placement and likewise, why it wouldn't want to benefit.
> 
> Relaxed ordering can improve performance by allowing packets to take
> multiple paths from the source to destination.  FWIW, the ofiwg actually
> discussed ordering at length, and libfabric exposes several ordering related
> attributes.
> 
> The application usually needs to know what sort of ordering the network is
> providing for correct operation.
Yes. This flag defines how RW data can be placed in a relaxed-order manner.

> For example, can sends be re-ordered by
> the network, such that they may be received in the opposite order that they
> were sent?

No. Only RDMA reads and writes. Table 79 of the IB spec still holds true.
In internal patch review I had the table, but since there was no change, I removed it.
I think it's worth mentioning in out_of_order.txt that it still holds true. I will add this line.

I would also document which compliance statements are not applicable when this flag is set.
How about having a responder-side table, similar to Table 79, in Documentation/out_of_order.txt?
That makes it more readable for users.

> Can rdma writes be processed out of order, such that the target
> data ends up with some sort of interleaving of the written data? 
Yes.

> Are there size limits to ordering constraints?
No.

> 
> It's hard to capture this level of detail with a single flag.  There's ordering of
> data within a single message, as well as ordering of data between messages.
Good documentation should describe what a flag does. The flag name itself has an RW prefix in it.
Are you suggesting there should be two flags, one for ordering within a message and one for ordering across messages?
If so, shouldn't we add those flags when such different implementations arise?
I cannot define something that cannot be implemented or tested at this point. This one is across messages.

> I don't know how this proposal works with explicit pathing that IB defines,
> or path failover.

It doesn't change with path failover or explicit path to my knowledge, but I will confirm if there is any change in v1.

> 
> - Sean

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                   ` <VI1PR0502MB300885A1DD676E1649CDB268D1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
@ 2017-06-12 20:35                     ` Dennis Dalessandro
       [not found]                       ` <b5279c09-027f-e374-ffde-7f236c52322c-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
  2017-06-12 20:41                     ` Hefty, Sean
  1 sibling, 1 reply; 71+ messages in thread
From: Dennis Dalessandro @ 2017-06-12 20:35 UTC (permalink / raw)
  To: Parav Pandit, Hefty, Sean, Steve Wise, 'Leon Romanovsky',
	'Doug Ledford'
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

On 6/12/2017 4:14 PM, Parav Pandit wrote:
> Hi Denny, Sean,
> 
> As Sean explained, relaxed ordering/out-of-order packets, allows making efficient use of network resources.
> When transmitter and receiver is enabled to do so, as I described in overview section of Documentation, it helps
> (a) to avoid retransmission - improves network utilization
> (b) reduces latency due to timers not kicking in.

Yes, those benefits are clear. My point is that I see no reason why it
shouldn't always be done. The application shouldn't have to care, and there
is no need to make this an additional flag.

-Denny


^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                   ` <VI1PR0502MB300885A1DD676E1649CDB268D1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
  2017-06-12 20:35                     ` Dennis Dalessandro
@ 2017-06-12 20:41                     ` Hefty, Sean
       [not found]                       ` <1828884A29C6694DAF28B7E6B8A82373AB142BC8-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  1 sibling, 1 reply; 71+ messages in thread
From: Hefty, Sean @ 2017-06-12 20:41 UTC (permalink / raw)
  To: Parav Pandit, Dalessandro, Dennis, Steve Wise,
	'Leon Romanovsky', 'Doug Ledford'
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein


> > For example, can sends be re-ordered by
> > the network, such that they may be received in the opposite order
> > that they were sent?
> 
> No. Only rdma reads and writes. Table 79 of IB spec still holds true.
> In internal patch review I had the table, but since there was no
> change, I just removed it.
> I think it's worth to mention in the out_of_order.txt that it still
> holds true. I will add this line.
> 
> I would also document which compliant statements are not applicable
> when such flag is set.
> How about having responder side table, similar to table 79 in
> Documentation/out_of_order.txt
> That makes it more readable for users.
> 
> > Can rdma writes be processed out of order, such that the target
> > data ends up with some sort of interleaving of the written data?
> Yes.
> 
> > Are there size limits to ordering constraints?
> No.

My questions were more examples of the level of detail that I think any API change needs to (eventually) be able to address.  Libibverbs is supposed to work over iwarp as well, which has different ordering semantics already.  I think that was part of Steve's point as to why any change is needed to the api.

IMO, read and write ordering should be determined separately from the perspective of the api.  This is exposing a hw specific limitation where these semantics are joined, and send/receive is not supported.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                       ` <b5279c09-027f-e374-ffde-7f236c52322c-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
@ 2017-06-12 20:46                         ` Hefty, Sean
       [not found]                           ` <1828884A29C6694DAF28B7E6B8A82373AB142BEC-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Hefty, Sean @ 2017-06-12 20:46 UTC (permalink / raw)
  To: Dalessandro, Dennis, Parav Pandit, Steve Wise,
	'Leon Romanovsky', 'Doug Ledford'
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein


> > When transmitter and receiver is enabled to do so, as I described in
> overview section of Documentation, it helps
> > (a) to avoid retransmission - improves network utilization
> > (b) reduces latency due to timers not kicking in.
> 
> Yes those benefits are clear. I see no reason why it shouldn't always
> be
> done is my point. Application shouldn't have to care and there is no
> need to make this an additional flag.

The app cares when data from write 2 can be written at the target before data from write 1, especially if the writes target the same memory buffers.  (At least I think this is the intent of exposing this to the app.)

Note that the provider can always provide stronger ordering than what the app needs.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                           ` <1828884A29C6694DAF28B7E6B8A82373AB142BEC-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2017-06-12 20:57                             ` Steve Wise
  2017-06-12 21:02                               ` Jason Gunthorpe
  0 siblings, 1 reply; 71+ messages in thread
From: Steve Wise @ 2017-06-12 20:57 UTC (permalink / raw)
  To: 'Hefty, Sean', 'Dalessandro, Dennis',
	'Parav Pandit', 'Leon Romanovsky',
	'Doug Ledford'
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, 'Idan Burstein'

> > > When transmitter and receiver is enabled to do so, as I described in
> > overview section of Documentation, it helps
> > > (a) to avoid retransmission - improves network utilization
> > > (b) reduces latency due to timers not kicking in.
> >
> > Yes those benefits are clear. I see no reason why it shouldn't always
> > be
> > done is my point. Application shouldn't have to care and there is no
> > need to make this an additional flag.
> 
> The app cares when data from write 2 can be written at the target before data
> from write 1, especially if the writes target the same memory buffers.  (At least I
> think this is the intent of exposing this to the app.)
> 
> Note that the provider can always provide stronger ordering than what the app
> needs.

My understanding is that IB or IW apps should never assume ingress write or read response data is _placed_ into local memory in the order it was transmitted from the peer.  The only guarantee is that the _indication_ of the arrived data preserves the sender's ordering.  However, I'm thinking that there are applications out there that spin polling local memory that is the target of a write or read response and assume the last bit of that memory will get written last...
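
The placement-versus-indication distinction above can be sketched with a toy model. This is only an illustration under stated assumptions, not anything from the patch set or from any HCA: the function name deliver_in_order and the integer sequence numbers are hypothetical. Data is "placed" the moment a message arrives, but an indication is raised only once every earlier message has been placed too.

```python
def deliver_in_order(arrival_order):
    """Toy model: place data on arrival, indicate strictly in sender order.

    arrival_order is the order in which message sequence numbers
    (0, 1, 2, ...) reach the receiver. Returns a list of
    (placed_seq, indicated_seqs) pairs, one per arrival.
    """
    placed = set()          # sequence numbers whose data is in memory
    next_to_indicate = 0    # lowest sequence number not yet indicated
    log = []
    for seq in arrival_order:
        placed.add(seq)     # placement happens immediately, in any order
        indicated = []
        # Indications only advance over a contiguous prefix of placed data.
        while next_to_indicate in placed:
            indicated.append(next_to_indicate)
            next_to_indicate += 1
        log.append((seq, indicated))
    return log


log = deliver_in_order([2, 0, 1, 3])
```

In this model message 2 is placed first, yet nothing is indicated until message 0 arrives, and the indications always come out as 0, 1, 2, 3.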

 


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH rdma-next 0/3] Support out of order data placement
  2017-06-12 20:57                             ` Steve Wise
@ 2017-06-12 21:02                               ` Jason Gunthorpe
       [not found]                                 ` <20170612210259.GA25652-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Jason Gunthorpe @ 2017-06-12 21:02 UTC (permalink / raw)
  To: Steve Wise
  Cc: 'Hefty, Sean', 'Dalessandro, Dennis',
	'Parav Pandit', 'Leon Romanovsky',
	'Doug Ledford',
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, 'Idan Burstein'

On Mon, Jun 12, 2017 at 03:57:29PM -0500, Steve Wise wrote:
> > > > When transmitter and receiver is enabled to do so, as I described in
> > > overview section of Documentation, it helps
> > > > (a) to avoid retransmission - improves network utilization
> > > > (b) reduces latency due to timers not kicking in.
> > >
> > > Yes those benefits are clear. I see no reason why it shouldn't always
> > > be
> > > done is my point. Application shouldn't have to care and there is no
> > > need to make this an additional flag.
> > 
> > The app cares when data from write 2 can be written at the target before data
> > from write 1, especially if the writes target the same memory buffers.  (At least I
> > think this is the intent of exposing this to the app.)
> > 
> > Note that the provider can always provide stronger ordering than what the app
> > needs.
> 
> My understanding is that IB or IW apps should never assume ingress
> write or read response data is _placed_ into local memory in the
> order it was transmitted from the peer.  The only guarantee is that
> the _indication_ of the arrived data preserve the sender's ordering.
> However, I'm thinking that there are applications out there that
> spin polling local memory that is the target of a write or read
> response and assume the last bit of that memory will get written
> last...

That is with respect to the CPU, but IB requires strong ordering
between messages within the same QP, eg if I do

RDMA WRITE addr=0 data=1
RDMA WRITE addr=0 data=2
RDMA WRITE addr=0 data=3
RDMA READ  addr=0

I must always get 3, not something else.
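
A minimal sketch of that invariant, assuming a toy responder whose memory is a plain dictionary (the place helper and the permutation model are hypothetical, purely for illustration): with in-order placement the trailing read can only return 3, while a responder that placed same-address writes in arbitrary order could leave any of the three values behind.

```python
import itertools


def place(writes, order):
    """Apply (addr, data) writes to a dict-backed memory in placement order."""
    mem = {}
    for i in order:
        addr, data = writes[i]
        mem[addr] = data
    return mem


# Three RDMA WRITEs posted on one QP, in this order, all to addr 0.
writes = [(0, 1), (0, 2), (0, 3)]

# In-order placement: the trailing RDMA READ of addr 0 returns 3.
in_order_result = place(writes, range(len(writes)))[0]

# Hypothetical unordered placement: any permutation could be applied.
possible_results = {
    place(writes, perm)[0]
    for perm in itertools.permutations(range(len(writes)))
}
```

Here in_order_result is 3, whereas possible_results contains 1, 2 and 3, which is exactly the outcome the invariant rules out.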

It would be notable if this 'out of order' feature violated that
invariant, but many ULPs would probably still be OK.

Frankly, Parav's original message doesn't seem to describe at all what
this is about, so maybe we should all wait until v2, and maybe more
people from Mellanox could contribute to sensibly describing it if
they want it in ibverbs.

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                             ` <20170612165536.GB12435-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2017-06-12 17:11                               ` Parav Pandit
@ 2017-06-12 21:09                               ` Tom Talpey
       [not found]                                 ` <6747e257-67b0-a364-be21-04f73ef82ffe-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>
  2017-06-27  9:47                               ` Sagi Grimberg
  2 siblings, 1 reply; 71+ messages in thread
From: Tom Talpey @ 2017-06-12 21:09 UTC (permalink / raw)
  To: Jason Gunthorpe, Parav Pandit
  Cc: Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

On 6/12/2017 12:55 PM, Jason Gunthorpe wrote:
> On Mon, Jun 12, 2017 at 04:53:28PM +0000, Parav Pandit wrote:
>>
>>
>>> From: Jason Gunthorpe [mailto:jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org]
>>> Sent: Monday, June 12, 2017 11:44 AM
>>> To: Parav Pandit <parav-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>> Cc: Bart Van Assche <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>; leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org;
>>> dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Idan Burstein
>>> <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>> Subject: Re: [PATCH rdma-next 0/3] Support out of order data placement
>>>
>>> On Mon, Jun 12, 2017 at 04:42:27PM +0000, Parav Pandit wrote:
>>>
>>>>> I would suggest at least using the inverted sense like Bart
>>>>> describes in the kernel - every kernel ULP is safe.
>>>
>>>> I don't see a need to use inverted sense in code. I can surely make
>>> documentation more descriptive as Bart suggested.
>>>
>>> If this is 'better' then it should be on as much as possible, and I certainly
>>> don't want to see kernel ULPs query caps and other pointless things when
>>> they already, necessarily, deal with out of order.
>>
>> Sure. Kernel ULPs and any other user ULPs can skip query caps.
> 
> No, they can't because only mlx5 accepts the new flag.
> 
> This is why inverting the flag in the kernel makes much more sense.

Absolutely. All iWARP adapters support multiple RDMA Writes in flight,
and their placement is not ordered. It's a quirk of the MLX devices,
and in turn their implementation of the IB protocol, that they place
writes in-order. I'm happy to see Mellanox become aware of this, it
limits write performance on networks with any significant latencies.

I agree with Jason, the bit should be 1 by default, if defined as you
propose. Out-of-order is the norm, not the exception, for ULPs.
Honestly, I think you should perhaps consider making it the default
on your devices, and allowing only MLX-aware ULPs to turn it off.

The Windows NDK API has a similar concept, by the way, it flips your
meaning and defaults to 0. Also, it's not settable by the ULP. The
flag isn't used for much, as a result.

NDK_ADAPTER_FLAG_IN_ORDER_DMA_SUPPORTED

https://msdn.microsoft.com/en-us/library/windows/hardware/hh439851(v=vs.85).aspx


Tom.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                                 ` <20170612210259.GA25652-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-06-12 21:18                                   ` Steve Wise
  2017-06-12 21:22                                     ` Jason Gunthorpe
  2017-06-12 21:53                                     ` Parav Pandit
  2017-06-13  5:29                                   ` Leon Romanovsky
  1 sibling, 2 replies; 71+ messages in thread
From: Steve Wise @ 2017-06-12 21:18 UTC (permalink / raw)
  To: 'Jason Gunthorpe'
  Cc: 'Hefty, Sean', 'Dalessandro, Dennis',
	'Parav Pandit', 'Leon Romanovsky',
	'Doug Ledford',
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, 'Idan Burstein'

> On Mon, Jun 12, 2017 at 03:57:29PM -0500, Steve Wise wrote:
> > > > > When transmitter and receiver is enabled to do so, as I described in
> > > > overview section of Documentation, it helps
> > > > > (a) to avoid retransmission - improves network utilization
> > > > > (b) reduces latency due to timers not kicking in.
> > > >
> > > > Yes those benefits are clear. I see no reason why it shouldn't always
> > > > be
> > > > done is my point. Application shouldn't have to care and there is no
> > > > need to make this an additional flag.
> > >
> > > > The app cares when data from write 2 can be written at the target before
> > > > data from write 1, especially if the writes target the same memory buffers.
> > > > (At least I think this is the intent of exposing this to the app.)
> > > >
> > > > Note that the provider can always provide stronger ordering than what the
> > > > app needs.
> >
> > My understanding is that IB or IW apps should never assume ingress
> > write or read response data is _placed_ into local memory in the
> > order it was transmitted from the peer.  The only guarantee is that
> > the _indication_ of the arrived data preserve the sender's ordering.
> > However, I'm thinking that there are applications out there that
> > spin polling local memory that is the target of a write or read
> > response and assume the last bit of that memory will get written
> > last...
> 
> That is with respect to the CPU, but IB requires strong ordering
> between messages within the same QP, eg if I do
> 
> RDMA WRITE addr=0 data=1
> RDMA WRITE addr=0 data=2
> RDMA WRITE addr=0 data=3
> RDMA READ  addr=0
> 
> I must always get 3, not something else.

Correct, but the peer, ie the remote end that is the target of those writes, can
spin looking at local address 0 and it might see 1, 2, or 3.  Eventually it will
see 3.  But there is no guarantee that it will see 1 before 2 or even see 1 or 2
at all depending on timing.   

But what I was getting at is this:  Say you tell your peer to RDMA WRITE 16KB
into your local buffer.  And let us say the last bit of that 16KB data will be a
1, and that the current value of that bit location in the local buffer is 0.  It
is incorrect for the app to spin reading that bit until it reads 1, and assume the
data prior to that bit has been placed at that point.  At least with the iWARP
spec, out of order placement is allowed.  So if the 16KB was broken into X iWARP
DDP segments, the last segment could have been placed before the other segments.
A correct application will require the peer to post a SEND after the WRITE, or
to use WRITE_WITH_IMMEDIATE, and only know the data has been placed into the local
buffer when it polls the recv completion for the SEND or WRITE_W_IMMD.  An iWARP
RNIC _must_ guarantee in-order delivery of data (but not actual placement).  Am
I making sense?
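
To make the spin-on-the-last-bit hazard concrete, here is a rough sketch. It is a toy model only: the 4096-byte segment size, the helper names and the byte-sized flag are assumptions for illustration, not how any RNIC segments or places data. It places the segments of a 16KB write in a given order and reports how many bytes are still missing at the instant a spinning reader would see the last byte flip to 1.

```python
BUF_SIZE = 16 * 1024
SEG_SIZE = 4096  # assumed segment size, for illustration only


def segments(payload, seg_size):
    """Split the payload into (offset, bytes) segments."""
    return [(off, payload[off:off + seg_size])
            for off in range(0, len(payload), seg_size)]


def missing_when_flag_seen(payload, order, seg_size=SEG_SIZE):
    """Place segments in `order`; return the number of payload bytes not yet
    placed at the moment the last byte of the buffer reads 1."""
    buf = bytearray(len(payload))  # local buffer starts out zeroed
    segs = segments(payload, seg_size)
    placed = 0
    for idx in order:
        off, data = segs[idx]
        buf[off:off + len(data)] = data
        placed += len(data)
        if buf[-1] == 1:           # what the spinning reader checks
            return len(payload) - placed
    return None                    # flag byte was never placed


# 16KB of data whose final byte is the "done" flag the reader spins on.
payload = bytes([7]) * (BUF_SIZE - 1) + bytes([1])

in_order_missing = missing_when_flag_seen(payload, [0, 1, 2, 3])
last_first_missing = missing_when_flag_seen(payload, [3, 0, 1, 2])
```

With in-order placement nothing is missing when the flag appears. If the last segment is placed first, the reader sees the flag while 12288 bytes have not been placed yet, which is why only the recv completion is a safe signal.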

I'm guessing no HCAs nor RNICs actually place data out of order.  cxgb* does
not.  So applications _might_ be doing the spin technique I described.  I recall
a long time ago that MVAPICH2 did this.  Not sure if it still does.

> 
> It would be notable if this 'out of order' feature violated that
> invariant, but many ULPs would probably still be OK.
> 
> Frankly, Parav's original message doesn't seem to describe at all what
> this is about, so maybe we should all wait until v2, and maybe more
> people from Mellanox could contribute to sensibly describing it if
> they want it in ibverbs.
> 
> Jason


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH rdma-next 0/3] Support out of order data placement
  2017-06-12 21:18                                   ` Steve Wise
@ 2017-06-12 21:22                                     ` Jason Gunthorpe
  2017-06-12 21:53                                     ` Parav Pandit
  1 sibling, 0 replies; 71+ messages in thread
From: Jason Gunthorpe @ 2017-06-12 21:22 UTC (permalink / raw)
  To: Steve Wise
  Cc: 'Hefty, Sean', 'Dalessandro, Dennis',
	'Parav Pandit', 'Leon Romanovsky',
	'Doug Ledford',
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, 'Idan Burstein'

On Mon, Jun 12, 2017 at 04:18:34PM -0500, Steve Wise wrote:

> buffer when it polls the recv completion for the SEND or WRITE_W_IMMD.  An iWARP
> RNIC _must_ guarantee in-order delivery of data (but not actual placement).  Am
> I making sense?

Yes, and IB is technically the same.

As Tom pointed out, it is a quirk of the x86 & mellanox combination
that apps can reliably 'spin on the last byte', it is certainly not
guaranteed by any spec.

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                       ` <1828884A29C6694DAF28B7E6B8A82373AB142BC8-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2017-06-12 21:25                         ` Parav Pandit
  0 siblings, 0 replies; 71+ messages in thread
From: Parav Pandit @ 2017-06-12 21:25 UTC (permalink / raw)
  To: Hefty, Sean, Dalessandro, Dennis, Steve Wise,
	'Leon Romanovsky', 'Doug Ledford'
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

Hi Sean,

> -----Original Message-----
> From: Hefty, Sean [mailto:sean.hefty@intel.com]
> Sent: Monday, June 12, 2017 3:41 PM
> To: Parav Pandit <parav@mellanox.com>; Dalessandro, Dennis
> <dennis.dalessandro@intel.com>; Steve Wise
> <swise@opengridcomputing.com>; 'Leon Romanovsky' <leon@kernel.org>;
> 'Doug Ledford' <dledford@redhat.com>
> Cc: linux-rdma@vger.kernel.org; Idan Burstein <idanb@mellanox.com>
> Subject: RE: [PATCH rdma-next 0/3] Support out of order data placement
> 
> > > For example, can sends be re-ordered by the network, such that they
> > > may be received in the opposite order
> > that they
> > > were sent?
> >
> > No. Only rdma reads and writes. Table 79 of IB spec still holds true.
> > In internal patch review I had the table, but since there was no
> > change, I just removed it.
> > I think it's worth to mention in the out_of_order.txt that it still
> > holds true. I will add this line.
> >
> > I would also document which compliant statements are not applicable
> > when such flag is set.
> > How about having responder side table, similar to table 79 in
> > Documentation/out_of_order.txt That makes it more readable for users.
> >
> > > Can rdma writes be processed out of order, such that the target data
> > > ends up with some sort of interleaving of the written data?
> > Yes.
> >
> > > Are there size limits to ordering constraints?
> > No.
> 
> My questions were more examples of the level of detail that I think any API
> change needs to (eventually) be able to address.  Libibverbs is supposed to
> work over iwarp as well, which has different ordering semantics already.  I
> think that was part of Steve's point as to why any change is needed to the
> api.
> 
> IMO, read and write ordering should be determined separately from the
> perspective of the api.  This is exposing a hw specific limitation where these
> semantics are joined, and send/receive is not supported.

They are joined for simplicity. If there is an established use case, which I would like to hear about from users,
and if the hardware supports it, more caps can be added.
That’s why caps is a 32-bit number: it leaves room for future addition of RD_ONLY and WR_ONLY modes.
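
As a rough illustration of how a 32-bit caps word leaves room for such modes, here is a small Python sketch. The class name, bit names, and bit positions are hypothetical and do not come from the patchset; only the idea of separate RD_ONLY and WR_ONLY bits is taken from the text above.

```python
from enum import IntFlag


class OooCaps(IntFlag):
    """Hypothetical capability bits; names and values are illustrative only."""
    RD_WR = 1 << 0      # reads and writes joined, as described for this patchset
    RD_ONLY = 1 << 1    # possible future mode
    WR_ONLY = 1 << 2    # possible future mode


def supports(caps, wanted):
    """Return True if every bit in `wanted` is set in the 32-bit `caps` word."""
    wanted = int(wanted)
    return ((caps & 0xFFFFFFFF) & wanted) == wanted


device_caps = int(OooCaps.RD_WR)               # what a device might report today
print(supports(device_caps, OooCaps.RD_WR))    # True
print(supports(device_caps, OooCaps.WR_ONLY))  # False
```

New modes then only consume new bits, so existing users that test for the joined bit keep working unchanged.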

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                                 ` <6747e257-67b0-a364-be21-04f73ef82ffe-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>
@ 2017-06-12 21:32                                   ` Parav Pandit
       [not found]                                     ` <VI1PR0502MB30080B672B80836FF0A328DCD1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Parav Pandit @ 2017-06-12 21:32 UTC (permalink / raw)
  To: Tom Talpey, Jason Gunthorpe
  Cc: Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

Hi Tom,

> -----Original Message-----
> From: Tom Talpey [mailto:tom@talpey.com]
> Sent: Monday, June 12, 2017 4:09 PM
> To: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>; Parav Pandit
> <parav@mellanox.com>
> Cc: Bart Van Assche <Bart.VanAssche@sandisk.com>; leon@kernel.org;
> dledford@redhat.com; linux-rdma@vger.kernel.org; Idan Burstein
> <idanb@mellanox.com>
> Subject: Re: [PATCH rdma-next 0/3] Support out of order data placement
> 
> On 6/12/2017 12:55 PM, Jason Gunthorpe wrote:
> > On Mon, Jun 12, 2017 at 04:53:28PM +0000, Parav Pandit wrote:
> >>
> >>
> >>> From: Jason Gunthorpe [mailto:jgunthorpe@obsidianresearch.com]
> >>> Sent: Monday, June 12, 2017 11:44 AM
> >>> To: Parav Pandit <parav@mellanox.com>
> >>> Cc: Bart Van Assche <Bart.VanAssche@sandisk.com>; leon@kernel.org;
> >>> dledford@redhat.com; linux-rdma@vger.kernel.org; Idan Burstein
> >>> <idanb@mellanox.com>
> >>> Subject: Re: [PATCH rdma-next 0/3] Support out of order data
> >>> placement
> >>>
> >>> On Mon, Jun 12, 2017 at 04:42:27PM +0000, Parav Pandit wrote:
> >>>
> >>>>> I would suggest at least using the inverted sense like Bart
> >>>>> describes in the kernel - every kernel ULP is safe.
> >>>
> >>>> I don't see a need to use inverted sense in code. I can surely make
> >>> documentation more descriptive as Bart suggested.
> >>>
> >>> If this is 'better' then it should be on as much as possible, and I
> >>> certainly don't want to see kernel ULPs query caps and other
> >>> pointless things when they already, necessarily, deal with out of order.
> >>
> >> Sure. Kernel ULPs and any other user ULPs can skip query caps.
> >
> > No, they can't because only mlx5 accepts the new flag.
> >
> > This is why inverting the flag in the kernel makes much more sense.
> 
> Absolutely. All iWARP adapters support multiple RDMA Writes in flight, and
> their placement is not ordered. It's a quirk of the MLX devices, and in turn
> their implementation of the IB protocol, that they place writes in-order. I'm
> happy to see Mellanox become aware of this, it limits write performance on
> networks with any significant latencies.
> 
> I agree with Jason, the bit should be 1 by default, if defined as you propose.
> Out-of-order is the norm, not the exception, for ULPs.
> Honestly, I think you should perhaps consider making it the default on your
> devices, and allowing only MLX-aware ULPs to turn it off.
> 

There can be cases in deployment where the responder has support for receiving out-of-order, but the requester doesn't.
Read responses arriving out of order for different messages at the requester side would then break.
In the other case, where it is supported on the requester side but not on the responder side, the result would be more retransmission, because the responder cannot handle the out-of-order packets and triggers a PSN sequence error.
Both requester and responder need to know when out-of-order is enabled on a given QP.
So we shouldn’t enable it by default on all QPs.
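
The rule described above can be sketched as a small Python model. The function names are made up for illustration, and the strings merely restate the failure cases from the text; this is not how any driver decides it.

```python
def ooo_enabled(requester_supports, responder_supports):
    """Enable out-of-order on a QP only if both ends support it."""
    return requester_supports and responder_supports


def mismatch_effect(requester_supports, responder_supports):
    """Restate what the text says would happen if ooo were turned on regardless."""
    if requester_supports and responder_supports:
        return "ok"
    if responder_supports:
        return "out-of-order read responses break at the requester"
    if requester_supports:
        return "responder raises PSN sequence errors, more retransmission"
    return "ooo unavailable on both sides"


# Walk through all four combinations of requester/responder support.
for req in (True, False):
    for resp in (True, False):
        print(req, resp, ooo_enabled(req, resp), mismatch_effect(req, resp))
```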

> The Windows NDK API has a similar concept, by the way, it flips your
> meaning and defaults to 0. Also, it's not settable by the ULP. The flag isn't
> used for much, as a result.
> 
> NDK_ADAPTER_FLAG_IN_ORDER_DMA_SUPPORTED
> 
> https://msdn.microsoft.com/en-
> us/library/windows/hardware/hh439851(v=vs.85).aspx
> 
> 
> Tom.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                                     ` <VI1PR0502MB30080B672B80836FF0A328DCD1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
@ 2017-06-12 21:41                                       ` Jason Gunthorpe
       [not found]                                         ` <20170612214135.GB30578-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2017-06-12 22:19                                       ` Tom Talpey
  1 sibling, 1 reply; 71+ messages in thread
From: Jason Gunthorpe @ 2017-06-12 21:41 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Tom Talpey, Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

On Mon, Jun 12, 2017 at 09:32:30PM +0000, Parav Pandit wrote:

> There can be cases in deployment where responder has support for
> receiving out-of-order, but requester doesn't.  Read responses

You still haven't explained at all what the transmitter side does
differently..

Is this actually some kind of selective retransmit scheme?

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [PATCH rdma-next 0/3] Support out of order data placement
  2017-06-12 21:18                                   ` Steve Wise
  2017-06-12 21:22                                     ` Jason Gunthorpe
@ 2017-06-12 21:53                                     ` Parav Pandit
       [not found]                                       ` <VI1PR0502MB30082486BC3B1669FE48764ED1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
  1 sibling, 1 reply; 71+ messages in thread
From: Parav Pandit @ 2017-06-12 21:53 UTC (permalink / raw)
  To: Steve Wise, 'Jason Gunthorpe'
  Cc: 'Hefty, Sean', 'Dalessandro, Dennis',
	'Leon Romanovsky', 'Doug Ledford',
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein



> -----Original Message-----
> From: Steve Wise [mailto:swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org]
> Sent: Monday, June 12, 2017 4:19 PM
> To: 'Jason Gunthorpe' <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
> Cc: 'Hefty, Sean' <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>; 'Dalessandro, Dennis'
> <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>; Parav Pandit <parav-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>;
> 'Leon Romanovsky' <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>; 'Doug Ledford'
> <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Idan Burstein
> <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Subject: RE: [PATCH rdma-next 0/3] Support out of order data placement
> 
> > On Mon, Jun 12, 2017 at 03:57:29PM -0500, Steve Wise wrote:
> > > > > > > When transmitter and receiver is enabled to do so, as I
> > > > > > > described in overview section of Documentation, it helps
> > > > > > (a) to avoid retransmission - improves network utilization
> > > > > > (b) reduces latency due to timers not kicking in.
> > > > >
> > > > > Yes those benefits are clear. I see no reason why it shouldn't
> > > > > always be done is my point. Application shouldn't have to care
> > > > > and there is no need to make this an additional flag.
> > > >
> > > > > The app cares when data from write 2 can be written at the target
> > > > > before data from write 1, especially if the writes target the same
> > > > > memory buffers.  (At least I think this is the intent of exposing
> > > > > this to the app.)
> > > > >
> > > > > Note that the provider can always provide stronger ordering than
> > > > > what the app needs.
> > >
> > > My understanding is that IB or IW apps should never assume ingress
> > > write or read response data is _placed_ into local memory in the
> > > order it was transmitted from the peer.  The only guarantee is that
> > > the _indication_ of the arrived data preserve the sender's ordering.
> > > However, I'm thinking that there are applications out there that
> > > spin polling local memory that is the target of a write or read
> > > response and assume the last bit of that memory will get written
> > > last...
> >
> > That is with respect to the CPU, but IB requires strong ordering
> > between messages within the same QP, eg if I do
> >
> > RDMA WRITE addr=0 data=1
> > RDMA WRITE addr=0 data=2
> > RDMA WRITE addr=0 data=3
> > RDMA READ  addr=0
> >
> > I must always get 3, not something else.
> 
> Correct, but the peer, ie the remote end that is the target of those writes,
> can spin looking at local address 0 and it might see 1, 2, or 3.  Eventually it
> will see 3.  But there is no guarantee that it will see 1 before 2 or even see 1
> or 2 at all depending on timing.
> 
> But what I was getting at is this:  Say you tell your peer to RDMA WRITE
> 16KB into your local buffer.  And let us say the last bit of that 16KB data will
> be a 1, and that the current value of that bit location in the local buffer is 0.
> It is incorrect for the app to spin reading that bit until reads 1, and assume
> the data prior to that bit has been placed at that point.  At least with the
> iWARP spec, out of order placement is allowed.  So if the 16KB was broken
> into X iWARP DDP segments, the last segment could have been placed
> before the other segments.
> A correct application will require the peer to post a SEND after the WRITE or
> WRITE_WITH_IMMEDIATE, and only know the data has been placed into the
> local buffer when it polls the recv completion for the SEND or
> WRITE_W_IMMD.  An iWARP RNIC _must_ guarantee in-order delivery of
> data (but not actual placement).  Am I making sense?
> 
> I'm guessing no HCAs nor RNICs actually place data out of order.  cxgb*
> does not.  So applications _might_ be doing the spin technique I described.
> I recall a long time ago that MVAPICH2 did this.  Not sure if it still does.
> 
> >
> > It would be notable if this 'out of order' feature violated that
> > invariant, but many ULPs would probably still be OK.
> >
I certainly don't see a point in breaking users who are polling on data, even though they should have followed the optional requirement o9-20.
Also, read responses can arrive out of order; if an application polls on those as well, it would break too, not just writes.
Refer to Table 79 for two read operations.

> > Frankly, Parav's original message doesn't seem to describe at all what
> > this is about, so maybe we should all wait until v2, and maybe more
> > people from Mellanox could contribute to sensibly describing it if
> > they want it in ibverbs.

I will add the following details to the Documentation.
1. Mention PCIe relaxed ordering (Bart's point).
2. Include a responder side table like Table 79 to crisply describe all cases and ordering with respect to send and other messages.
3. Also indicate that C9-28 is relaxed when ooo is enabled on a QP, as a description of the new responder side table.
This was an offline comment that I received.
4. Provide the examples that Steve and Jason highlighted, with multiple writes to the same memory location.
5. Reiterate Table 79, to make it clear what changes and what doesn't.
Let me know if you want to see any more details.
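
For item 4, the multiple-writes example from Jason and Steve can be sketched as a toy Python model of the existing in-order case. The function names and the notion of "sample points" are assumptions for illustration; how ooo changes this picture is what the documentation still has to spell out.

```python
def apply_in_order(writes, initial=0):
    """Return the value of the target location after each write executes in order."""
    history = [initial]
    for value in writes:
        history.append(value)
    return history


def local_poll(history, sample_points):
    """Values a CPU spinning on the location could observe.

    sample_points are indices into `history`, i.e. how many writes had landed
    each time the poller happened to look.
    """
    return [history[i] for i in sample_points]


history = apply_in_order([1, 2, 3])      # RDMA WRITE addr=0 data=1, 2, 3
rdma_read_result = history[-1]           # the READ is ordered after the writes
print(rdma_read_result)                  # 3

print(local_poll(history, [0, 3]))       # [0, 3]: poller never saw 1 or 2
print(local_poll(history, [1, 2, 3]))    # [1, 2, 3]: poller saw every value
```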

> >
> > Jason


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                                       ` <VI1PR0502MB30082486BC3B1669FE48764ED1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
@ 2017-06-12 21:57                                         ` Jason Gunthorpe
       [not found]                                           ` <20170612215741.GA31693-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Jason Gunthorpe @ 2017-06-12 21:57 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Steve Wise, 'Hefty, Sean', 'Dalessandro, Dennis',
	'Leon Romanovsky', 'Doug Ledford',
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

On Mon, Jun 12, 2017 at 09:53:55PM +0000, Parav Pandit wrote:
> I will add following more details to Documentation.
> 1. Mention about pcie relax ordering - Barts point
> 2. Include responder side table like Table 79 to crisply describe all cases and ordering with respect to send and other messages
> 3. Also indicate that C9-28 is relaxed when ooo is enabled on a QP as a description to new responder side table.
> This was offline comment that I received.
> 4. Provide examples that Steve and Jason highlighted, with multiple writes to same memory location.
> 5. Reiterate table 79, to make it clear that what doesn't change, or changes.
> Let me know if you want to see any more details.

Explain why this needs to be negotiated and what it does to the
transmit side of a QP.

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                                         ` <20170612214135.GB30578-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-06-12 21:59                                           ` Parav Pandit
       [not found]                                             ` <VI1PR0502MB30087762738A4E02A1DA24D0D1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Parav Pandit @ 2017-06-12 21:59 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tom Talpey, Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

Hi Jason,

> -----Original Message-----
> From: Jason Gunthorpe [mailto:jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org]
> Sent: Monday, June 12, 2017 4:42 PM
> To: Parav Pandit <parav-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Cc: Tom Talpey <tom-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>; Bart Van Assche
> <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>; leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org; dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org;
> linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Idan Burstein <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Subject: Re: [PATCH rdma-next 0/3] Support out of order data placement
> 
> On Mon, Jun 12, 2017 at 09:32:30PM +0000, Parav Pandit wrote:
> 
> > There can be cases in deployment where responder has support for
> > receiving out-of-order, but requester doesn't.  Read responses
> 
> You still haven't explained at all what the transmitter side does differently..

Let's say
Read-1
Read-2 are issued by requester.

For some reason the read responses for each read arrive via different paths at the requester.
If ooo is enabled on a QP they will be processed correctly. Otherwise the read responses for the 2nd read could be dropped.
Here the transmitter on the read responder side behaves differently when ooo is enabled.
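
The Read-1/Read-2 case can be sketched as a toy requester in Python. The PSN numbering, the tuple layout, and the function name are assumptions for illustration only; retransmission is not modelled and this is not a description of actual hardware behavior.

```python
def receive_read_responses(arrivals, ooo_enabled):
    """Toy requester: process read-response packets in arrival order.

    arrivals is a list of (read_id, psn) tuples. PSNs are assumed to be
    assigned consecutively from 0 across both reads.
    Returns (placed, dropped) lists.
    """
    placed, dropped = [], []
    expected_psn = 0
    for read_id, psn in arrivals:
        if ooo_enabled:
            placed.append((read_id, psn))        # placed wherever it belongs
        elif psn == expected_psn:
            placed.append((read_id, psn))        # strictly in-order packet
            expected_psn += 1
        else:
            dropped.append((read_id, psn))       # ahead of sequence: dropped
    return placed, dropped


# Read-2's responses (PSN 2, 3) overtake Read-1's (PSN 0, 1) on another path.
arrivals = [(2, 2), (2, 3), (1, 0), (1, 1)]

placed, dropped = receive_read_responses(arrivals, ooo_enabled=True)
print(len(placed), len(dropped))    # 4 0

placed, dropped = receive_read_responses(arrivals, ooo_enabled=False)
print(placed)     # [(1, 0), (1, 1)]
print(dropped)    # [(2, 2), (2, 3)]
```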

> 
> Is this actually some kind of selective retransmit scheme?
> 
> Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                                           ` <20170612215741.GA31693-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-06-12 22:00                                             ` Parav Pandit
  0 siblings, 0 replies; 71+ messages in thread
From: Parav Pandit @ 2017-06-12 22:00 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Steve Wise, 'Hefty, Sean', 'Dalessandro, Dennis',
	'Leon Romanovsky', 'Doug Ledford',
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein


> -----Original Message-----
> From: Jason Gunthorpe [mailto:jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org]
> Sent: Monday, June 12, 2017 4:58 PM
> To: Parav Pandit <parav-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Cc: Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>; 'Hefty, Sean'
> <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>; 'Dalessandro, Dennis'
> <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>; 'Leon Romanovsky' <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>;
> 'Doug Ledford' <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Idan
> Burstein <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Subject: Re: [PATCH rdma-next 0/3] Support out of order data placement
> 
> On Mon, Jun 12, 2017 at 09:53:55PM +0000, Parav Pandit wrote:
> > I will add following more details to Documentation.
> > 1. Mention about pcie relax ordering - Barts point
> > 2. Include responder side table like Table 79 to crisply describe all
> >    cases and ordering with respect to send and other messages
> > 3. Also indicate that C9-28 is relaxed when ooo is enabled on a QP as a
> >    description to new responder side table.
> > This was offline comment that I received.
> > 4. Provide examples that Steve and Jason highlighted, with multiple writes
> >    to same memory location.
> > 5. Reiterate table 79, to make it clear that what doesn't change, or
> >    changes.
> > Let me know if you want to see any more details.
> 
> Explain why this needs to be negotiated and what it does to the transmit
> side of a QP.
> 
Sure. I will add this. This is an important point.

> Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                                             ` <VI1PR0502MB30087762738A4E02A1DA24D0D1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
@ 2017-06-12 22:07                                               ` Jason Gunthorpe
       [not found]                                                 ` <20170612220730.GA32510-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Jason Gunthorpe @ 2017-06-12 22:07 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Tom Talpey, Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

On Mon, Jun 12, 2017 at 09:59:22PM +0000, Parav Pandit wrote:
> Hi Jason,
> 
> > From: Jason Gunthorpe [mailto:jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org]
> > Sent: Monday, June 12, 2017 4:42 PM
> > To: Parav Pandit <parav-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> > Cc: Tom Talpey <tom-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>; Bart Van Assche
> > <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>; leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org; dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org;
> > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Idan Burstein <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> > Subject: Re: [PATCH rdma-next 0/3] Support out of order data placement
> > 
> > On Mon, Jun 12, 2017 at 09:32:30PM +0000, Parav Pandit wrote:
> > 
> > > There can be cases in deployment where responder has support for
> > > receiving out-of-order, but requester doesn't.  Read responses
> > 
> > You still haven't explained at all what the transmitter side does differently..
> 
> Let's say
> Read-1
> Read-2 are issued by requester.
> 
> For some reason read response for each read arrived via different
> path to requester, If ooo is enabled on a QP they will be processed
> correctly. Else read responses for 2nd read could be dropped.

This is still only discussing the receiver side of a QP.

> transmitter of read responder is different when ooo is enabled.

And I keep asking you to explain what is different about the
transmitter side.

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                                                 ` <20170612220730.GA32510-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-06-12 22:16                                                   ` Parav Pandit
       [not found]                                                     ` <VI1PR0502MB30088258D7BADC83B0609F38D1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Parav Pandit @ 2017-06-12 22:16 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tom Talpey, Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

Hi Jason,

> -----Original Message-----
> From: Jason Gunthorpe [mailto:jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org]
> Sent: Monday, June 12, 2017 5:08 PM
> To: Parav Pandit <parav-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Cc: Tom Talpey <tom-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>; Bart Van Assche
> <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>; leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org; dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org;
> linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Idan Burstein <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Subject: Re: [PATCH rdma-next 0/3] Support out of order data placement
> 
> On Mon, Jun 12, 2017 at 09:59:22PM +0000, Parav Pandit wrote:
> > Hi Jason,
> >
> > > From: Jason Gunthorpe [mailto:jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org]
> > > Sent: Monday, June 12, 2017 4:42 PM
> > > To: Parav Pandit <parav-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> > > Cc: Tom Talpey <tom-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>; Bart Van Assche
> > > <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>; leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org;
> > > > dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org;
> > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Idan Burstein <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> > > Subject: Re: [PATCH rdma-next 0/3] Support out of order data
> > > placement
> > >
> > > On Mon, Jun 12, 2017 at 09:32:30PM +0000, Parav Pandit wrote:
> > >
> > > > There can be cases in deployment where responder has support for
> > > > receiving out-of-order, but requester doesn't.  Read responses
> > >
> > > You still haven't explained at all what the transmitter side does
> > > > differently..
> >
> > Let's say
> > Read-1
> > Read-2 are issued by requester.
> >
> > For some reason read response for each read arrived via different path
> > to requester, If ooo is enabled on a QP they will be processed
> > correctly. Else read responses for 2nd read could be dropped.
> 
> This is still only discussing the receiver side of a QP.
> 
> > transmitter of read responder is different when ooo is enabled.
> 
> And I keep asking you to explain what is different about the transmitter
> side.
The above sequence explains the read_response_transmitter behavior.
I will check with the hardware team whether anything is done differently at the write_transmitter (requester) beyond what I explained above.

> 
> Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                                     ` <VI1PR0502MB30080B672B80836FF0A328DCD1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
  2017-06-12 21:41                                       ` Jason Gunthorpe
@ 2017-06-12 22:19                                       ` Tom Talpey
       [not found]                                         ` <475e1873-e842-ecb9-d260-34777da57e51-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 71+ messages in thread
From: Tom Talpey @ 2017-06-12 22:19 UTC (permalink / raw)
  To: Parav Pandit, Jason Gunthorpe
  Cc: Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

On 6/12/2017 5:32 PM, Parav Pandit wrote:
> Hi Tom,
...
>>
>> I agree with Jason, the bit should be 1 by default, if defined as you propose.
>> Out-of-order is the norm, not the exception, for ULPs.
>> Honestly, I think you should perhaps consider making it the default on your
>> devices, and allowing only MLX-aware ULPs to turn it off.
>>
> 
> There can be cases in deployment where responder has support for receiving out-of-order, but requester doesn't.

Yuck! So this needs to be negotiated end-to-end, and by the upper
layer? Talk about barriers to adoption, and opportunities for
disaster.

Tom.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                                                     ` <VI1PR0502MB30088258D7BADC83B0609F38D1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
@ 2017-06-12 22:21                                                       ` Jason Gunthorpe
  0 siblings, 0 replies; 71+ messages in thread
From: Jason Gunthorpe @ 2017-06-12 22:21 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Tom Talpey, Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

On Mon, Jun 12, 2017 at 10:16:26PM +0000, Parav Pandit wrote:

> > And I keep asking you to explain what is different about the transmitter
> > side.
> Above sequence explains read_response_transmitter behavior.

but that has no impact on the other side of the QP - it does not care
if they were processed successfully or a NACK is sent and retransmit
is needed.

Still not seeing any explanation for why negotiation is required, and
I'm getting sick of asking, please craft a response with your internal
team.

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                                         ` <475e1873-e842-ecb9-d260-34777da57e51-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>
@ 2017-06-12 22:54                                           ` Parav Pandit
       [not found]                                             ` <VI1PR0502MB3008B7FE60CAB3BD49907A1BD1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Parav Pandit @ 2017-06-12 22:54 UTC (permalink / raw)
  To: Tom Talpey, Jason Gunthorpe
  Cc: Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

Hi Tom,

> -----Original Message-----
> From: Tom Talpey [mailto:tom@talpey.com]
> Sent: Monday, June 12, 2017 5:20 PM
> To: Parav Pandit <parav@mellanox.com>; Jason Gunthorpe
> <jgunthorpe@obsidianresearch.com>
> Cc: Bart Van Assche <Bart.VanAssche@sandisk.com>; leon@kernel.org;
> dledford@redhat.com; linux-rdma@vger.kernel.org; Idan Burstein
> <idanb@mellanox.com>
> Subject: Re: [PATCH rdma-next 0/3] Support out of order data placement
> 
> On 6/12/2017 5:32 PM, Parav Pandit wrote:
> > Hi Tom,
> ...
> >>
> >> I agree with Jason, the bit should be 1 by default, if defined as you
> propose.
> >> Out-of-order is the norm, not the exception, for ULPs.
> >> Honestly, I think you should perhaps consider making it the default
> >> on your devices, and allowing only MLX-aware ULPs to turn it off.
> >>
> >
> > There can be cases in deployment where responder has support for
> receiving out-of-order, but requester doesn't.
> 
> Yuck! So this needs to be negotiated end-to-end, and by the upper layer?
> Talk about barriers to adoption, and opportunities for disaster.
> 
Since Jason confirmed that all Linux kernel consumers are coded to comply with the o9-20 requirement,
I think kernel-based rdma-cm consumers can be enabled transparently end-to-end through rdma_accept() and rdma_connect(), without the ULP's involvement.
That would be a separate patchset, posted after testing, once rdmacm is extended for negotiation.
We haven't planned that yet.

> Tom.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                                             ` <VI1PR0502MB3008B7FE60CAB3BD49907A1BD1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
@ 2017-06-12 23:44                                               ` Tom Talpey
       [not found]                                                 ` <d3a436a0-9ace-b11a-2e4c-387fc575877e-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Tom Talpey @ 2017-06-12 23:44 UTC (permalink / raw)
  To: Parav Pandit, Jason Gunthorpe
  Cc: Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

On 6/12/2017 6:54 PM, Parav Pandit wrote:
> Hi Tom,
> 
>> -----Original Message-----
>> From: Tom Talpey [mailto:tom-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org]
>> Sent: Monday, June 12, 2017 5:20 PM
>> To: Parav Pandit <parav-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>; Jason Gunthorpe
>> <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
>> Cc: Bart Van Assche <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>; leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org;
>> dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Idan Burstein
>> <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>> Subject: Re: [PATCH rdma-next 0/3] Support out of order data placement
>>
>> On 6/12/2017 5:32 PM, Parav Pandit wrote:
>>> Hi Tom,
>> ...
>>>>
>>>> I agree with Jason, the bit should be 1 by default, if defined as you
>> propose.
>>>> Out-of-order is the norm, not the exception, for ULPs.
>>>> Honestly, I think you should perhaps consider making it the default
>>>> on your devices, and allowing only MLX-aware ULPs to turn it off.
>>>>
>>>
>>> There can be cases in deployment where responder has support for
>> receiving out-of-order, but requester doesn't.
>>
>> Yuck! So this needs to be negotiated end-to-end, and by the upper layer?
>> Talk about barriers to adoption, and opportunities for disaster.
>>
> As Jason confirmed that all Linux kernel consumers are coded to be compliant to o9-20 requirement,
> So I think kernel based rdma-cm consumers can be transparently enabled end-to-end without ULP's involvement with rdma_accept() and rdma_connect().

I have two thoughts here.

1) You seem to assume all consumers are Linux, and do not
need to negotiate. This is a dangerous assumption.

2) I assume that there is some performance benefit to toggling
this setting to non-strict. So, how do existing consumers get
this advantage, especially since they don't need strict
semantics? Bearing in mind that they do have to negotiate
this end-to-end, meaning they require a protocol extension.

Actually. I have a third thought. Since this is an attribute
to qp creation, performed even before establishing a connection,
how does the upper layer know when to set it?

Tom.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                                                 ` <d3a436a0-9ace-b11a-2e4c-387fc575877e-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>
@ 2017-06-12 23:59                                                   ` Parav Pandit
       [not found]                                                     ` <VI1PR0502MB3008E04612EED6BC83F37115D1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Parav Pandit @ 2017-06-12 23:59 UTC (permalink / raw)
  To: Tom Talpey, Jason Gunthorpe
  Cc: Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

Hi Tom,


> -----Original Message-----
> From: Tom Talpey [mailto:tom@talpey.com]
> Sent: Monday, June 12, 2017 6:44 PM
> To: Parav Pandit <parav@mellanox.com>; Jason Gunthorpe
> <jgunthorpe@obsidianresearch.com>
> Cc: Bart Van Assche <Bart.VanAssche@sandisk.com>; leon@kernel.org;
> dledford@redhat.com; linux-rdma@vger.kernel.org; Idan Burstein
> <idanb@mellanox.com>
> Subject: Re: [PATCH rdma-next 0/3] Support out of order data placement
> 
> On 6/12/2017 6:54 PM, Parav Pandit wrote:
> > Hi Tom,
> >
> >> -----Original Message-----
> >> From: Tom Talpey [mailto:tom@talpey.com]
> >> Sent: Monday, June 12, 2017 5:20 PM
> >> To: Parav Pandit <parav@mellanox.com>; Jason Gunthorpe
> >> <jgunthorpe@obsidianresearch.com>
> >> Cc: Bart Van Assche <Bart.VanAssche@sandisk.com>; leon@kernel.org;
> >> dledford@redhat.com; linux-rdma@vger.kernel.org; Idan Burstein
> >> <idanb@mellanox.com>
> >> Subject: Re: [PATCH rdma-next 0/3] Support out of order data
> >> placement
> >>
> >> On 6/12/2017 5:32 PM, Parav Pandit wrote:
> >>> Hi Tom,
> >> ...
> >>>>
> >>>> I agree with Jason, the bit should be 1 by default, if defined as
> >>>> you
> >> propose.
> >>>> Out-of-order is the norm, not the exception, for ULPs.
> >>>> Honestly, I think you should perhaps consider making it the default
> >>>> on your devices, and allowing only MLX-aware ULPs to turn it off.
> >>>>
> >>>
> >>> There can be cases in deployment where responder has support for
> >> receiving out-of-order, but requester doesn't.
> >>
> >> Yuck! So this needs to be negotiated end-to-end, and by the upper layer?
> >> Talk about barriers to adoption, and opportunities for disaster.
> >>
> > As Jason confirmed that all Linux kernel consumers are coded to be
> > compliant to o9-20 requirement, So I think kernel based rdma-cm
> consumers can be transparently enabled end-to-end without ULP's
> involvement with rdma_accept() and rdma_connect().
> 
> I have two thoughts here.
> 
> 1) You seem to assume all consumers are Linux, and do not need to
> negotiate. This is a dangerous assumption.
Certainly not. I didn't assume that. I just gave one example where known consumers can be enabled without modifying the ULP.
This is explained further under the 3rd question.
Other consumers can work with this solution too.
For example, take a Linux rdmacm-based client and a server based on another OS.
The client is ooo capable.
The server is not ooo capable.
Once you follow the rdmacm-based sequence below, it will be clear how this works.

> 
> 2) I assume that there is some performance benefit to toggling this setting
> to non-strict. So, how do existing consumers get this advantage, especially
> since they don't need strict semantics? Bearing in mind that they do have
> to negotiate this end-to-end, meaning they require a protocol extension.
I don't have a completely transparent upstream solution for existing consumers yet.
> 
> Actually. I have a third thought. Since this is an attribute to qp creation,
> performed even before establishing a connection, how does the upper layer
> know when to set it?
This is not at QP creation time. I have described it in usage section 3 of Documentation/out_of_order.txt.
It is set at the QP state transition from INIT to RTR.
Here is the flow. It's just not coded enough for posting patches.

1. When the rdmacm active side creates the QP, it is in INIT state.
2. The active side sends the MAD_Req msg (indicating ooo_requested=1).
3. When the rdmacm passive side receives the message, it looks up the device_cap attribute and matches it against the ooo_requested flag.
4. If the device supports it, the MAD_Rsp msg sets ooo_enabled=1; if it doesn't, ooo_enabled=0.
5. The rdmacm passive side creates the QP and moves it to RTR state (with the QP ooo enabled bit set).
6. The active side receives the message and moves the QP to RTR and then RTS, based on the bit received from the passive side.

The flow is no different from how the rest of the connection-specific parameters are shared, such as IRD/ORD, PSN, timeouts, MTU etc.
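
The negotiation steps above can be sketched roughly as follows. This is only a minimal sketch: every struct and function name here is hypothetical, nothing like it exists in rdmacm or the IB CM today, and the assumption that the active side requests ooo only when its own device is capable is mine, not part of the posted patches.

```c
#include <assert.h>
#include <stdbool.h>

/* All names are hypothetical; none exist in rdmacm or the IB CM today. */
struct cm_req { bool ooo_requested; };  /* step 2: carried in the request */
struct cm_rep { bool ooo_enabled; };    /* step 4: carried in the response */

/* Step 2 (assumption): the active side asks only if its own device can do it. */
static struct cm_req active_build_req(bool local_dev_ooo_cap)
{
	struct cm_req req = { .ooo_requested = local_dev_ooo_cap };
	return req;
}

/* Steps 3-4: the passive side enables ooo only if requested and supported. */
static struct cm_rep passive_handle_req(const struct cm_req *req,
					bool local_dev_ooo_cap)
{
	struct cm_rep rep = {
		.ooo_enabled = req->ooo_requested && local_dev_ooo_cap,
	};
	return rep;
}

/* Step 6: the active side applies the replied bit when moving INIT -> RTR. */
static bool active_qp_ooo(const struct cm_req *req, const struct cm_rep *rep)
{
	/* A response must never enable what was not requested. */
	assert(!rep->ooo_enabled || req->ooo_requested);
	return rep->ooo_enabled;
}
```

With a capable client and a non-capable server, passive_handle_req() returns ooo_enabled=0 and both QPs stay in the default in-order mode, which is the mixed-OS case mentioned earlier.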



> 
> Tom.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                                                     ` <VI1PR0502MB3008E04612EED6BC83F37115D1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
@ 2017-06-13  0:11                                                       ` Tom Talpey
       [not found]                                                         ` <fbdcf05b-ccd8-bd9c-c9c8-86f373303250-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Tom Talpey @ 2017-06-13  0:11 UTC (permalink / raw)
  To: Parav Pandit, Jason Gunthorpe
  Cc: Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

On 6/12/2017 7:59 PM, Parav Pandit wrote:
>> -----Original Message-----
>> From: Tom Talpey [mailto:tom-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org]
>> Sent: Monday, June 12, 2017 6:44 PM
>> To: Parav Pandit <parav-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>; Jason Gunthorpe
>> <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
>> Cc: Bart Van Assche <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>; leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org;
>> dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Idan Burstein
>> <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>> Subject: Re: [PATCH rdma-next 0/3] Support out of order data placement
>>
>> On 6/12/2017 6:54 PM, Parav Pandit wrote:
>>> Hi Tom,
>>>
>>>> -----Original Message-----
>>>> From: Tom Talpey [mailto:tom-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org]
>>>> Sent: Monday, June 12, 2017 5:20 PM
>>>> To: Parav Pandit <parav-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>; Jason Gunthorpe
>>>> <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
>>>> Cc: Bart Van Assche <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>; leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org;
>>>> dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Idan Burstein
>>>> <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>>> Subject: Re: [PATCH rdma-next 0/3] Support out of order data
>>>> placement
>>>>
>>>> On 6/12/2017 5:32 PM, Parav Pandit wrote:
>>>>> Hi Tom,
>>>> ...
>>>>>>
>>>>>> I agree with Jason, the bit should be 1 by default, if defined as
>>>>>> you
>>>> propose.
>>>>>> Out-of-order is the norm, not the exception, for ULPs.
>>>>>> Honestly, I think you should perhaps consider making it the default
>>>>>> on your devices, and allowing only MLX-aware ULPs to turn it off.
>>>>>>
>>>>>
>>>>> There can be cases in deployment where responder has support for
>>>> receiving out-of-order, but requester doesn't.
>>>>
>>>> Yuck! So this needs to be negotiated end-to-end, and by the upper layer?
>>>> Talk about barriers to adoption, and opportunities for disaster.
>>>>
>>> As Jason confirmed that all Linux kernel consumers are coded to be
>>> compliant to o9-20 requirement, So I think kernel based rdma-cm
>> consumers can be transparently enabled end-to-end without ULP's
>> involvement with rdma_accept() and rdma_connect().
>>
>> I have two thoughts here.
>>
>> 1) You seem to assume all consumers are Linux, and do not need to
>> negotiate. This is a dangerous assumption.
> Certainly not. I didn't assume that. I just gave one example that known consumers can be done without modifying the ULP.
> Explained further in 3rd question.
> Even other consumers can work with this solution.
> For example Linux rdmacm based client and Other OS based server.
> Client is ooo capable.
> Server is ooo not capable.
> Once you follow below rdmacm based sequence, it will be clear how this will works.

Oh, so there's a MAD protocol change under the hood. Well,
that's a wider question. And I still don't understand how
existing, non-strict-requiring protocols can take advantage
of this. Nor how this works for non-Mellanox, non-IB/RoCE
implementations.

Again, I'd be a lot less concerned if non-strict were the
default, and strict mode was negotiated. It's all just so
upside-down.

Tom.

>> 2) I assume that there is some performance benefit to toggling this setting
>> to non-strict. So, how do existing consumers get this advantage, especially
>> since they don't need strict semantics? Bearing in mind that they do have
>> to negotiate this end-to-end, meaning they require a protocol extension.
> I don't have completely transparent upstream solution for existing consumers yet.
>>
>> Actually. I have a third thought. Since this is an attribute to qp creation,
>> performed even before establishing a connection, how does the upper layer
>> know when to set it?
> This is not at QP creation time. I have described in Documentation/out_of_order.txt in usage section 3.
> This is at QP state transition from INIT to RTR.
> Here is the flow. It's just not coded enough for posting patches.
> 
> 1. When rdmacm active side creates the QP, It is INIT state.
> 2. Send MAD_Req msg (indicating ooo_requested=1)
> 3. When rdmacm passive side receives the message, it looks up device_cap attribute and matches it against ooo_requested flag.
> 4. when device supports it, MAD_Rsp msg sets ooo_enabled=1, if it doesn't support it, ooo_enabled=0
> 5. rdmacm passive side creates the QP and moves to RTR state (with QP ooo enabled bit set).
> 6. active side receives the message and puts the QP to RTR, RTS state based on received bit setting from passive side.
> 
> Flow is no different than how rest of the connection specific parameters are shared such as IRD/ORD, PSN, timeouts, mtu etc.
> 
> 
> 
>>
>> Tom.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                                                         ` <fbdcf05b-ccd8-bd9c-c9c8-86f373303250-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>
@ 2017-06-13  0:36                                                           ` Parav Pandit
       [not found]                                                             ` <VI1PR0502MB30089271EB542493AA58060CD1C20-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Parav Pandit @ 2017-06-13  0:36 UTC (permalink / raw)
  To: Tom Talpey, Jason Gunthorpe
  Cc: Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

Hi Tom,

> -----Original Message-----
> From: Tom Talpey [mailto:tom@talpey.com]
> Sent: Monday, June 12, 2017 7:12 PM
> To: Parav Pandit <parav@mellanox.com>; Jason Gunthorpe
> <jgunthorpe@obsidianresearch.com>
> Cc: Bart Van Assche <Bart.VanAssche@sandisk.com>; leon@kernel.org;
> dledford@redhat.com; linux-rdma@vger.kernel.org; Idan Burstein
> <idanb@mellanox.com>
> Subject: Re: [PATCH rdma-next 0/3] Support out of order data placement
> 
> On 6/12/2017 7:59 PM, Parav Pandit wrote:
> >> -----Original Message-----
> >> From: Tom Talpey [mailto:tom@talpey.com]
> >> Sent: Monday, June 12, 2017 6:44 PM
> >> To: Parav Pandit <parav@mellanox.com>; Jason Gunthorpe
> >> <jgunthorpe@obsidianresearch.com>
> >> Cc: Bart Van Assche <Bart.VanAssche@sandisk.com>; leon@kernel.org;
> >> dledford@redhat.com; linux-rdma@vger.kernel.org; Idan Burstein
> >> <idanb@mellanox.com>
> >> Subject: Re: [PATCH rdma-next 0/3] Support out of order data
> >> placement
> >>
> >> On 6/12/2017 6:54 PM, Parav Pandit wrote:
> >>> Hi Tom,
> >>>
> >>>> -----Original Message-----
> >>>> From: Tom Talpey [mailto:tom@talpey.com]
> >>>> Sent: Monday, June 12, 2017 5:20 PM
> >>>> To: Parav Pandit <parav@mellanox.com>; Jason Gunthorpe
> >>>> <jgunthorpe@obsidianresearch.com>
> >>>> Cc: Bart Van Assche <Bart.VanAssche@sandisk.com>;
> leon@kernel.org;
> >>>> dledford@redhat.com; linux-rdma@vger.kernel.org; Idan Burstein
> >>>> <idanb@mellanox.com>
> >>>> Subject: Re: [PATCH rdma-next 0/3] Support out of order data
> >>>> placement
> >>>>
> >>>> On 6/12/2017 5:32 PM, Parav Pandit wrote:
> >>>>> Hi Tom,
> >>>> ...
> >>>>>>
> >>>>>> I agree with Jason, the bit should be 1 by default, if defined as
> >>>>>> you
> >>>> propose.
> >>>>>> Out-of-order is the norm, not the exception, for ULPs.
> >>>>>> Honestly, I think you should perhaps consider making it the
> >>>>>> default on your devices, and allowing only MLX-aware ULPs to turn
> it off.
> >>>>>>
> >>>>>
> >>>>> There can be cases in deployment where responder has support for
> >>>> receiving out-of-order, but requester doesn't.
> >>>>
> >>>> Yuck! So this needs to be negotiated end-to-end, and by the upper
> layer?
> >>>> Talk about barriers to adoption, and opportunities for disaster.
> >>>>
> >>> As Jason confirmed that all Linux kernel consumers are coded to be
> >>> compliant to o9-20 requirement, So I think kernel based rdma-cm
> >> consumers can be transparently enabled end-to-end without ULP's
> >> involvement with rdma_accept() and rdma_connect().
> >>
> >> I have two thoughts here.
> >>
> >> 1) You seem to assume all consumers are Linux, and do not need to
> >> negotiate. This is a dangerous assumption.
> > Certainly not. I didn't assume that. I just gave one example that known
> consumers can be done without modifying the ULP.
> > Explained further in 3rd question.
> > Even other consumers can work with this solution.
> > For example Linux rdmacm based client and Other OS based server.
> > Client is ooo capable.
> > Server is ooo not capable.
> > Once you follow below rdmacm based sequence, it will be clear how this
> will works.
> 
> Oh, so there's a MAD protocol change under the hood. 
No. There is no change under the hood.
Your question was how ULPs can benefit from this feature without being changed.
So I said that the rdmacm-based Linux kernel consumers we know of comply with o9-20 and can take the benefit once rdmacm is extended as in the example below.

> Well, that's a wider
> question. And I still don't understand how existing, non-strict-requiring
> protocols can take advantage of this.
> Nor how this works for non-Mellanox, non-IB/RoCE implementations.

The device capability indicates which devices support this. It is explained in the usage section of Documentation/out_of_order.txt.
So whichever vendor or protocol supports it can set this optional device capability.

> 
> Again, I'd be a lot less concerned if non-strict were the default, and strict
> mode was negotiated. It's all just so upside-down.

In the IB spec, in-order delivery is the default. So can you suggest how we can change the default IB behavior without breaking anything?
Adding an optional attribute seems the right way to ensure compatibility.
> 
> Tom.
> 
> >> 2) I assume that there is some performance benefit to toggling this
> >> setting to non-strict. So, how do existing consumers get this
> >> advantage, especially since they don't need strict semantics? Bearing
> >> in mind that they do have to negotiate this end-to-end, meaning they
> require a protocol extension.
> > I don't have completely transparent upstream solution for existing
> consumers yet.
> >>
> >> Actually. I have a third thought. Since this is an attribute to qp
> >> creation, performed even before establishing a connection, how does
> >> the upper layer know when to set it?
> > This is not at QP creation time. I have described in
> Documentation/out_of_order.txt in usage section 3.
> > This is at QP state transition from INIT to RTR.
> > Here is the flow. It's just not coded enough for posting patches.
> >
> > 1. When rdmacm active side creates the QP, It is INIT state.
> > 2. Send MAD_Req msg (indicating ooo_requested=1)
> > 3. When rdmacm passive side receives the message, it looks up device_cap attribute and matches it against ooo_requested flag.
> > 4. when device supports it, MAD_Rsp msg sets ooo_enabled=1, if it doesn't support it, ooo_enabled=0
> > 5. rdmacm passive side creates the QP and moves to RTR state (with QP ooo enabled bit set).
> > 6. active side receives the message and puts the QP to RTR, RTS state based on received bit setting from passive side.
> >
> > Flow is no different than how rest of the connection specific parameters
> are shared such as IRD/ORD, PSN, timeouts, mtu etc.
> >
> >
> >
> >>
> >> Tom.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                                                             ` <VI1PR0502MB30089271EB542493AA58060CD1C20-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
@ 2017-06-13  1:30                                                               ` Tom Talpey
       [not found]                                                                 ` <fb11f261-b80b-f71a-8076-204706267798-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Tom Talpey @ 2017-06-13  1:30 UTC (permalink / raw)
  To: Parav Pandit, Jason Gunthorpe
  Cc: Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

On 6/12/2017 8:36 PM, Parav Pandit wrote:
>> -----Original Message-----
>> From: Tom Talpey [mailto:tom-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org]
>> Sent: Monday, June 12, 2017 7:12 PM
>> To: Parav Pandit <parav-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>; Jason Gunthorpe
>> <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
>> Cc: Bart Van Assche <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>; leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org;
>> dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Idan Burstein
>> <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>> Subject: Re: [PATCH rdma-next 0/3] Support out of order data placement
>>
>> On 6/12/2017 7:59 PM, Parav Pandit wrote:
>>>> -----Original Message-----
>>>> From: Tom Talpey [mailto:tom-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org]
>>>> Sent: Monday, June 12, 2017 6:44 PM
>>>> To: Parav Pandit <parav-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>; Jason Gunthorpe
>>>> <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
>>>> Cc: Bart Van Assche <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>; leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org;
>>>> dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Idan Burstein
>>>> <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>>> Subject: Re: [PATCH rdma-next 0/3] Support out of order data
>>>> placement
>>>>
>>>> On 6/12/2017 6:54 PM, Parav Pandit wrote:
>>>>> Hi Tom,
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Tom Talpey [mailto:tom-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org]
>>>>>> Sent: Monday, June 12, 2017 5:20 PM
>>>>>> To: Parav Pandit <parav-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>; Jason Gunthorpe
>>>>>> <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
>>>>>> Cc: Bart Van Assche <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>;
>> leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org;
>>>>>> dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Idan Burstein
>>>>>> <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>>>>> Subject: Re: [PATCH rdma-next 0/3] Support out of order data
>>>>>> placement
>>>>>>
>>>>>> On 6/12/2017 5:32 PM, Parav Pandit wrote:
>>>>>>> Hi Tom,
>>>>>> ...
>>>>>>>>
>>>>>>>> I agree with Jason, the bit should be 1 by default, if defined as
>>>>>>>> you
>>>>>> propose.
>>>>>>>> Out-of-order is the norm, not the exception, for ULPs.
>>>>>>>> Honestly, I think you should perhaps consider making it the
>>>>>>>> default on your devices, and allowing only MLX-aware ULPs to turn
>> it off.
>>>>>>>>
>>>>>>>
>>>>>>> There can be cases in deployment where responder has support for
>>>>>> receiving out-of-order, but requester doesn't.
>>>>>>
>>>>>> Yuck! So this needs to be negotiated end-to-end, and by the upper
>> layer?
>>>>>> Talk about barriers to adoption, and opportunities for disaster.
>>>>>>
>>>>> As Jason confirmed that all Linux kernel consumers are coded to be
>>>>> compliant to o9-20 requirement, So I think kernel based rdma-cm
>>>> consumers can be transparently enabled end-to-end without ULP's
>>>> involvement with rdma_accept() and rdma_connect().
>>>>
>>>> I have two thoughts here.
>>>>
>>>> 1) You seem to assume all consumers are Linux, and do not need to
>>>> negotiate. This is a dangerous assumption.
>>> Certainly not. I didn't assume that. I just gave one example that known
>> consumers can be done without modifying the ULP.
>>> Explained further in 3rd question.
>>> Even other consumers can work with this solution.
>>> For example Linux rdmacm based client and Other OS based server.
>>> Client is ooo capable.
>>> Server is ooo not capable.
>>> Once you follow below rdmacm based sequence, it will be clear how this
>> will works.
>>
>> Oh, so there's a MAD protocol change under the hood.
> No. There is no change under the hood.
> Your question was how can we avoid ULP change and still they can benefit of this feature?
> So I said rdmacm based Linux kernel consumers that we know of comply to o9-20, can take the benefit once rdmacm is extended as below example.
> 
>> Well, that's a wider
>> question. And I still don't understand how existing, non-strict-requiring
>> protocols can take advantage of this.
>> Nor how this works for non-Mellanox, non-IB/RoCE implementations.
> 
> Device capability indicates that which device supports this. Explained in Documentation/out_of_order.txt usage section.
> So whichever vendor supports it, whichever protocol supports it, can set this optional device capability.
> 
>>
>> Again, I'd be a lot less concerned if non-strict were the default, and strict
>> mode was negotiated. It's all just so upside-down.
> 
> In IB spec, in-order delivery is default.

I don't agree. Requests are sent in-order, and the responder
processes them in-order, but the bytes themselves are not
guaranteed to appear in-order. Additionally, if retries occur,
this is most definitely not the case.

Section 9.5 Transaction Ordering, I believe, covers these
requirements. Can you tell me where I misunderstand them?
In fact, c9-28 explicitly warns:

   • An application shall not depend upon the order of data writes to
   memory within a message. For example, if an application sets up
   data buffers that overlap, for separate data segments within a
   message, it is not guaranteed that the last sent data will always
   overwrite the earlier.
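
To make the c9-28 warning concrete, here is a minimal sketch that models placement of two overlapping data segments in either order. The struct and function names are made up for illustration and do not correspond to any verbs or driver API; memory is just a local byte array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of one data segment of a message. */
struct segment {
	size_t offset;      /* destination offset in the target buffer */
	const char *data;   /* payload bytes */
	size_t len;         /* payload length */
};

/* Place n segments into mem in the given order (indices into segs). */
static void place_segments(char *mem, size_t mem_len,
			   const struct segment *segs,
			   const size_t *order, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		const struct segment *s = &segs[order[i]];

		/* Refuse to write past the end of the modelled buffer. */
		assert(s->offset + s->len <= mem_len);
		memcpy(mem + s->offset, s->data, s->len);
	}
}
```

With segment 0 = "AAAA" at offset 0 and segment 1 = "BBBB" at offset 2, placing them as 0 then 1 leaves "AABBBB", while 1 then 0 leaves "AAAABB". An application that depends on either result is relying on something the spec does not promise.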

My guess is that this bit overrides the MLX behavior of
never pipelining RDMA Write requests, allowing more packets
to be queued at the responder and making better use of the
network. This is not at all prohibited by the spec, nor is
it unexpected by properly-coded upper layers, which all the
kernel consumers are.

I have one other question on Documentation/out_of_order.txt.
It states the fence bit can be used to force ordering on a
non-strict connection. But fence doesn't apply to RDMA Write?
It only applies to operations which produce a reply, such as
RDMA Read or Atomic. Have you changed the semantic?
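
A small sketch of the fence semantic as I read it, assuming the usual interpretation that a fenced WR waits only for earlier RDMA Read and Atomic operations on the same send queue. The types and the function name are hypothetical and model the rule only; this is not the verbs API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum wr_op { WR_SEND, WR_RDMA_WRITE, WR_RDMA_READ, WR_ATOMIC };

struct wr {
	enum wr_op op;
	bool fence;      /* fence indicator set on this WR */
	bool completed;  /* requester has seen this WR complete */
};

/* Return true if the WR at idx must wait before it may start. */
static bool fence_must_wait(const struct wr *sq, size_t idx)
{
	assert(sq != NULL);
	if (!sq[idx].fence)
		return false;
	for (size_t i = 0; i < idx; i++) {
		/* Only operations that produce a reply hold back a fenced WR. */
		bool has_reply = sq[i].op == WR_RDMA_READ ||
				 sq[i].op == WR_ATOMIC;

		if (has_reply && !sq[i].completed)
			return true;
	}
	return false;
}
```

Under this reading, a fenced RDMA Write posted behind an incomplete RDMA Write is not held back at all, which is why the statement in the documentation is surprising.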

Tom.



> So can you suggest how can we change default IB behavior without breaking anything?
> Adding optional attribute seems the right way that ensures compatibility.
>>
>> Tom.
>>
>>>> 2) I assume that there is some performance benefit to toggling this
>>>> setting to non-strict. So, how do existing consumers get this
>>>> advantage, especially since they don't need strict semantics? Bearing
>>>> in mind that they do have to negotiate this end-to-end, meaning they
>> require a protocol extension.
>>> I don't have completely transparent upstream solution for existing
>> consumers yet.
>>>>
>>>> Actually. I have a third thought. Since this is an attribute to qp
>>>> creation, performed even before establishing a connection, how does
>>>> the upper layer know when to set it?
>>> This is not at QP creation time. I have described in
>> Documentation/out_of_order.txt in usage section 3.
>>> This is at QP state transition from INIT to RTR.
>>> Here is the flow. It's just not coded enough for posting patches.
>>>
>>> 1. When rdmacm active side creates the QP, It is INIT state.
>>> 2. Send MAD_Req msg (indicating ooo_requested=1)
>>> 3. When rdmacm passive side receives the message, it looks up device_cap attribute and matches it against ooo_requested flag.
>>> 4. when device supports it, MAD_Rsp msg sets ooo_enabled=1, if it doesn't support it, ooo_enabled=0
>>> 5. rdmacm passive side creates the QP and moves to RTR state (with QP ooo enabled bit set).
>>> 6. active side receives the message and puts the QP to RTR, RTS state based on received bit setting from passive side.
>>>
>>> Flow is no different than how rest of the connection specific parameters
>> are shared such as IRD/ORD, PSN, timeouts, mtu etc.
>>>
>>>
>>>
>>>>
>>>> Tom.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                                 ` <20170612210259.GA25652-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2017-06-12 21:18                                   ` Steve Wise
@ 2017-06-13  5:29                                   ` Leon Romanovsky
  1 sibling, 0 replies; 71+ messages in thread
From: Leon Romanovsky @ 2017-06-13  5:29 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Steve Wise, 'Hefty, Sean', 'Dalessandro, Dennis',
	'Parav Pandit', 'Doug Ledford',
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, 'Idan Burstein'

[-- Attachment #1: Type: text/plain, Size: 2294 bytes --]

On Mon, Jun 12, 2017 at 03:02:59PM -0600, Jason Gunthorpe wrote:
> On Mon, Jun 12, 2017 at 03:57:29PM -0500, Steve Wise wrote:
> > > > > When transmitter and receiver is enabled to do so, as I described in
> > > > overview section of Documentation, it helps
> > > > > (a) to avoid retransmission - improves network utilization
> > > > > (b) reduces latency due to timers not kicking in.
> > > >
> > > > Yes those benefits are clear. I see no reason why it shouldn't always
> > > > be
> > > > done is my point. Application shouldn't have to care and there is no
> > > > need to make this an additional flag.
> > >
> > > The app cares when data from write 2 can be written at the target before data
> > > from write 1, especially if the writes target the same memory buffers.  (At least I
> > > think this is the intent of exposing this to the app.)
> > >
> > > Note that the provider can always provide stronger ordering than what the app
> > > needs.
> >
> > My understanding is that IB or IW apps should never assume ingress
> > write or read response data is _placed_ into local memory in the
> > order it was transmitted from the peer.  The only guarantee is that
> > the _indication_ of the arrived data preserve the sender's ordering.
> > However, I'm thinking that there are applications out there that
> > spin polling local memory that is the target of a write or read
> > response and assume the last bit of that memory will get written
> > last...
>
> That is with respect to the CPU, but IB requires strong ordering
> between messages within the same QP, eg if I do
>
> RDMA WRITE addr=0 data=1
> RDMA WRITE addr=0 data=2
> RDMA WRITE addr=0 data=3
> RDMA READ  addr=0
>
> I must always get 3, not something else.
>
> It would be notable if this 'out of order' feature violated that
> invariant, but many ULPs would probably still be OK.
>
> Frankly, Parav's original message doesn't seem to describe at all what
> this is about, so maybe we should all wait until v2, and maybe more
> people from Mellanox could contribute to sensibly describing it if
> they want it in ibverbs.

I sent this patch series during US hours so that Parav could answer
questions as they arose, but for other people from Mellanox it was night,
which limited the number of participants.

Thanks

>
> Jason


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                                                                 ` <fb11f261-b80b-f71a-8076-204706267798-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>
@ 2017-06-13 19:17                                                                   ` Jason Gunthorpe
  2017-06-23 16:03                                                                   ` Parav Pandit
  2017-07-19  5:33                                                                   ` Parav Pandit
  2 siblings, 0 replies; 71+ messages in thread
From: Jason Gunthorpe @ 2017-06-13 19:17 UTC (permalink / raw)
  To: Tom Talpey
  Cc: Parav Pandit, Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

On Mon, Jun 12, 2017 at 09:30:05PM -0400, Tom Talpey wrote:
> >>Again, I'd be a lot less concerned if non-strict were the default, and strict
> >>mode was negotiated. It's all just so upside-down.
> >
> >In IB spec, in-order delivery is default.
> 
> I don't agree. Requests are sent in-order, and the responder
> processes them in-order, but the bytes themselves are not
> guaranteed to appear in-order. Additionally, if retries occur,
> this is most definitely not the case.

+1

And again, if message ordering and table 79 are unchanged in this new
mode it is still fully conformant to the IBTA (and again, I don't see
how the transmitter can change behavior and not break table 79)

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                                                                 ` <fb11f261-b80b-f71a-8076-204706267798-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>
  2017-06-13 19:17                                                                   ` Jason Gunthorpe
@ 2017-06-23 16:03                                                                   ` Parav Pandit
  2017-07-19  5:33                                                                   ` Parav Pandit
  2 siblings, 0 replies; 71+ messages in thread
From: Parav Pandit @ 2017-06-23 16:03 UTC (permalink / raw)
  To: Tom Talpey, Jason Gunthorpe
  Cc: Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

Hi Tom, Jason,

I will get back with updated v1 documentation and answers to the questions below once I get some more details internally.

Parav

> -----Original Message-----
> From: Tom Talpey [mailto:tom@talpey.com]
> Sent: Monday, June 12, 2017 8:30 PM
> To: Parav Pandit <parav@mellanox.com>; Jason Gunthorpe
> <jgunthorpe@obsidianresearch.com>
> Cc: Bart Van Assche <Bart.VanAssche@sandisk.com>; leon@kernel.org;
> dledford@redhat.com; linux-rdma@vger.kernel.org; Idan Burstein
> <idanb@mellanox.com>
> Subject: Re: [PATCH rdma-next 0/3] Support out of order data placement
> 
> On 6/12/2017 8:36 PM, Parav Pandit wrote:
> >> -----Original Message-----
> >> From: Tom Talpey [mailto:tom@talpey.com]
> >> Sent: Monday, June 12, 2017 7:12 PM
> >> To: Parav Pandit <parav@mellanox.com>; Jason Gunthorpe
> >> <jgunthorpe@obsidianresearch.com>
> >> Cc: Bart Van Assche <Bart.VanAssche@sandisk.com>; leon@kernel.org;
> >> dledford@redhat.com; linux-rdma@vger.kernel.org; Idan Burstein
> >> <idanb@mellanox.com>
> >> Subject: Re: [PATCH rdma-next 0/3] Support out of order data
> >> placement
> >>
> >> On 6/12/2017 7:59 PM, Parav Pandit wrote:
> >>>> -----Original Message-----
> >>>> From: Tom Talpey [mailto:tom@talpey.com]
> >>>> Sent: Monday, June 12, 2017 6:44 PM
> >>>> To: Parav Pandit <parav@mellanox.com>; Jason Gunthorpe
> >>>> <jgunthorpe@obsidianresearch.com>
> >>>> Cc: Bart Van Assche <Bart.VanAssche@sandisk.com>; leon@kernel.org;
> >>>> dledford@redhat.com; linux-rdma@vger.kernel.org; Idan Burstein
> >>>> <idanb@mellanox.com>
> >>>> Subject: Re: [PATCH rdma-next 0/3] Support out of order data
> >>>> placement
> >>>>
> >>>> On 6/12/2017 6:54 PM, Parav Pandit wrote:
> >>>>> Hi Tom,
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Tom Talpey [mailto:tom@talpey.com]
> >>>>>> Sent: Monday, June 12, 2017 5:20 PM
> >>>>>> To: Parav Pandit <parav@mellanox.com>; Jason Gunthorpe
> >>>>>> <jgunthorpe@obsidianresearch.com>
> >>>>>> Cc: Bart Van Assche <Bart.VanAssche@sandisk.com>;
> >> leon@kernel.org;
> >>>>>> dledford@redhat.com; linux-rdma@vger.kernel.org; Idan Burstein
> >>>>>> <idanb@mellanox.com>
> >>>>>> Subject: Re: [PATCH rdma-next 0/3] Support out of order data
> >>>>>> placement
> >>>>>>
> >>>>>> On 6/12/2017 5:32 PM, Parav Pandit wrote:
> >>>>>>> Hi Tom,
> >>>>>> ...
> >>>>>>>>
> >>>>>>>> I agree with Jason, the bit should be 1 by default, if defined
> >>>>>>>> as you
> >>>>>> propose.
> >>>>>>>> Out-of-order is the norm, not the exception, for ULPs.
> >>>>>>>> Honestly, I think you should perhaps consider making it the
> >>>>>>>> default on your devices, and allowing only MLX-aware ULPs to
> >>>>>>>> turn
> >> it off.
> >>>>>>>>
> >>>>>>>
> >>>>>>> There can be cases in deployment where responder has support for
> >>>>>> receiving out-of-order, but requester doesn't.
> >>>>>>
> >>>>>> Yuck! So this needs to be negotiated end-to-end, and by the upper
> >> layer?
> >>>>>> Talk about barriers to adoption, and opportunities for disaster.
> >>>>>>
> >>>>> As Jason confirmed that all Linux kernel consumers are coded to be
> >>>>> compliant to o9-20 requirement, So I think kernel based rdma-cm
> >>>> consumers can be transparently enabled end-to-end without ULP's
> >>>> involvement with rdma_accept() and rdma_connect().
> >>>>
> >>>> I have two thoughts here.
> >>>>
> >>>> 1) You seem to assume all consumers are Linux, and do not need to
> >>>> negotiate. This is a dangerous assumption.
> >>> Certainly not. I didn't assume that. I just gave one example that
> >>> known
> >> consumers can be done without modifying the ULP.
> >>> Explained further in 3rd question.
> >>> Even other consumers can work with this solution.
> >>> For example Linux rdmacm based client and Other OS based server.
> >>> Client is ooo capable.
> >>> Server is ooo not capable.
> >>> Once you follow below rdmacm based sequence, it will be clear how
> >>> this
> >> will works.
> >>
> >> Oh, so there's a MAD protocol change under the hood.
> > No. There is no change under the hood.
> > Your question was how can we avoid ULP change and still they can benefit
> of this feature?
> > So I said rdmacm based Linux kernel consumers that we know of comply to
> o9-20, can take the benefit once rdmacm is extended as below example.
> >
> >> Well, that's a wider
> >> question. And I still don't understand how existing,
> >> non-strict-requiring protocols can take advantage of this.
> >> Nor how this works for non-Mellanox, non-IB/RoCE implementations.
> >
> > Device capability indicates that which device supports this. Explained in
> Documentation/out_of_order.txt usage section.
> > So whichever vendor supports it, whichever protocol supports it, can set
> this optional device capability.
> >
> >>
> >> Again, I'd be a lot less concerned if non-strict were the default,
> >> and strict mode was negotiated. It's all just so upside-down.
> >
> > In IB spec, in-order delivery is default.
> 
> I don't agree. Requests are sent in-order, and the responder processes them
> in-order, but the bytes themselves are not guaranteed to appear in-order.
> Additionally, if retries occur, this is most definitely not the case.
> 
> Section 9.5 Transaction Ordering, I believe, covers these requirements. Can
> you tell me where I misunderstand them?
> In fact, c9-28 explicitly warns:
> 
>    • An application shall not depend upon the order of data writes to
>    memory within a message. For example, if an application sets up
>    data buffers that overlap, for separate data segments within a
>    message, it is not guaranteed that the last sent data will always
>    overwrite the earlier.
> 
> My guess is that this bit overrides the MLX behavior of never pipelining RDMA
> Write requests, allowing more packets to be queued at the responder and
> making better use of the network. This is not at all prohibited by the spec, nor
> is it unexpected by properly-coded upper layers, which all the kernel
> consumers are.
> 
> I have one other question on the Documentation out-of-order.txt.
> It states the fence bit can be used to force ordering on a non-strict
> connection. But fence doesn't apply to RDMA Write?
> It only applies to operations which produce a reply, such as RDMA Read or
> Atomic. Have you changed the semantic?
> 
> Tom.
> 
> 
> 
> So can you suggest how can we change default IB behavior without breaking
> anything?
> > Adding optional attribute seems the right way that ensures compatibility.
> >>
> >> Tom.
> >>
> >>>> 2) I assume that there is some performance benefit to toggling this
> >>>> setting to non-strict. So, how do existing consumers get this
> >>>> advantage, especially since they don't need strict semantics?
> >>>> Bearing in mind that they do have to negotiate this end-to-end,
> >>>> meaning they
> >> require a protocol extension.
> >>> I don't have completely transparent upstream solution for existing
> >> consumers yet.
> >>>>
> >>>> Actually. I have a third thought. Since this is an attribute to qp
> >>>> creation, performed even before establishing a connection, how does
> >>>> the upper layer know when to set it?
> >>> This is not at QP creation time. I have described in
> >> Documentation/out_of_order.txt in usage section 3.
> >>> This is at QP state transition from INIT to RTR.
> >>> Here is the flow. It's just not coded enough for posting patches.
> >>>
> >>> 1. When rdmacm active side creates the QP, it is in INIT state.
> >>> 2. Send MAD_Req msg (indicating ooo_requested=1).
> >>> 3. When rdmacm passive side receives the message, it looks up the
> >>>    device_cap attribute and matches it against the ooo_requested flag.
> >>> 4. When the device supports it, MAD_Rsp msg sets ooo_enabled=1; if it
> >>>    doesn't support it, ooo_enabled=0.
> >>> 5. rdmacm passive side creates the QP and moves it to RTR state (with
> >>>    the QP ooo enabled bit set).
> >>> 6. Active side receives the message and puts the QP into RTR, RTS
> >>>    state based on the bit setting received from the passive side.
> >>>
> >>> The flow is no different from how the rest of the connection-specific
> >>> parameters, such as IRD/ORD, PSN, timeouts, MTU etc., are shared.
> >>>
> >>>
> >>>
> >>>>
> >>>> Tom.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                             ` <20170612165536.GB12435-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2017-06-12 17:11                               ` Parav Pandit
  2017-06-12 21:09                               ` Tom Talpey
@ 2017-06-27  9:47                               ` Sagi Grimberg
  2 siblings, 0 replies; 71+ messages in thread
From: Sagi Grimberg @ 2017-06-27  9:47 UTC (permalink / raw)
  To: Jason Gunthorpe, Parav Pandit
  Cc: Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein


>>> If this is 'better' then it should be on as much as possible, and I certainly
>>> don't want to see kernel ULPs query caps and other pointless things when
>>> they already, necessarily, deal with out of order.
>>
>> Sure. Kernel ULPs and any other user ULPs can skip query caps.
> 
> No, they can't because only mlx5 accepts the new flag.
> 
> This is why inverting the flag in the kernel makes much more sense.

Can you guys explain why should a ULP even be exposed to this at all?

I can today issue RDMA writes to a single remote buffer in a pattern
that is not "in-order". From what I understand, the OOO relates to
multiple packets of a _single_ RDMA transaction, which afair is a
granularity that the ULP isn't (and shouldn't be) exposed to.

I do not really understand what a ULP should do with this knob...
It's really a topology variable, maybe it needs to sit in the
CM or something?

Please correct me if I'm completely off...

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                                                                 ` <fb11f261-b80b-f71a-8076-204706267798-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>
  2017-06-13 19:17                                                                   ` Jason Gunthorpe
  2017-06-23 16:03                                                                   ` Parav Pandit
@ 2017-07-19  5:33                                                                   ` Parav Pandit
       [not found]                                                                     ` <VI1PR0502MB3008D488FEE8A7104A7B0A7CD1A60-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
  2 siblings, 1 reply; 71+ messages in thread
From: Parav Pandit @ 2017-07-19  5:33 UTC (permalink / raw)
  To: Tom Talpey, Jason Gunthorpe
  Cc: Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

Hi Tom, Jason,

Sorry for the late response.
Please find the response inline below.

> -----Original Message-----
> From: Tom Talpey [mailto:tom@talpey.com]
> Sent: Monday, June 12, 2017 8:30 PM
> To: Parav Pandit <parav@mellanox.com>; Jason Gunthorpe
> <jgunthorpe@obsidianresearch.com>
> Cc: Bart Van Assche <Bart.VanAssche@sandisk.com>; leon@kernel.org;
> dledford@redhat.com; linux-rdma@vger.kernel.org; Idan Burstein
> <idanb@mellanox.com>
> Subject: Re: [PATCH rdma-next 0/3] Support out of order data placement
> 
> >
> > In IB spec, in-order delivery is default.
> 
> I don't agree. Requests are sent in-order, and the responder processes them
> in-order, but the bytes themselves are not guaranteed to appear in-order.
> Additionally, if retries occur, this is most definitely not the case.
> 
> Section 9.5 Transaction Ordering, I believe, covers these requirements. Can you
> tell me where I misunderstand them?
> In fact, c9-28 explicitly warns:
> 
>    • An application shall not depend upon the order of data writes to
>    memory within a message. For example, if an application sets up
>    data buffers that overlap, for separate data segments within a
>    message, it is not guaranteed that the last sent data will always
>    overwrite the earlier.
> 
The IB spec indeed does not imply any ordering in the placement of data into memory within a single message.

It does guarantee that writes don't bypass writes and reads don't bypass reads (Table 79), and transport operations are executed in their *message* order (C9-28):
"A responder shall execute SEND requests, RDMA WRITE requests
and ATOMIC Operation requests in the message order in which
they are received."

Thus, ordering between messages is guaranteed - changes to remote memory of an RDMA-W will be observed strictly after any changes done by a previous RDMA-W; changes to local memory of an RDMA-R response will be observed strictly after any changes done by a previous RDMA-R response.

The proposed feature in this patch set is to relax the memory placement ordering *across* messages and not within a single message (which is not mandated by the spec as you noted), such that multiple consecutive RDMA-Ws may be committed to memory in any order, and similarly for RDMA-R responses.
This changes application semantics whenever multiple in-flight RDMA operations write to overlapping locations, or when one operation indicates the completion of the other.
A simple example to clarify: a requester posted the following work elements in the written order:
1. RDMA-W(VA=0x1000, value=0x1)
2. RDMA-W(VA=0x1000, value=0x2)
3. Send()
On the responder side, following completion of the Send() operation, and according to the spec (C9-28), reading from VA=0x1000 will produce the value 0x2. With the proposed feature enabled, the read value is not deterministic and depends on the order in which the RDMA-W operations were received.
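
To make the difference concrete, here is a toy model of the example above. It is only a sketch: the memory is a plain dict, `place_writes` is a made-up helper, and nothing here corresponds to a real verbs call or to how the HCA is implemented. It just enumerates which value can end up at VA=0x1000 when the two RDMA-Ws are placed in posted order versus in any arrival order.

```python
from itertools import permutations

def place_writes(writes, arrival_order):
    """Apply (va, value) payloads to a toy memory in the given arrival order.

    Hypothetical helper for illustration only; it models data placement,
    not completion semantics.
    """
    memory = {}
    for idx in arrival_order:
        va, value = writes[idx]
        memory[va] = value
    return memory

# The two RDMA-Ws from the example, in posted (message) order.
writes = [(0x1000, 0x1), (0x1000, 0x2)]

# Strict placement: data lands in message order, so the last posted write wins.
in_order = place_writes(writes, range(len(writes)))

# Relaxed placement: any arrival order may be committed as it arrives.
relaxed_outcomes = {
    place_writes(writes, order)[0x1000]
    for order in permutations(range(len(writes)))
}

print(hex(in_order[0x1000]))
print(sorted(hex(v) for v in relaxed_outcomes))
```

With strict placement only one final value is possible; with relaxed placement both values are possible, which is why overlapping in-flight writes are the case an application has to avoid or fence.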

The proposed QP flag allows applications to knowingly indicate this relaxed data placement, thereby enabling the HCA to place OOO RDMA messages into memory without buffering them.

> I have one other question on the Documentation out-of-order.txt.
> It states the fence bit can be used to force ordering on a non-strict connection.
> But fence doesn't apply to RDMA Write?
> It only applies to operations which produce a reply, such as RDMA Read or
> Atomic. Have you changed the semantic?
> 
The RDMA-R followed by RDMA-R semantic is changed when the proposed QP flag is set.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                                                                     ` <VI1PR0502MB3008D488FEE8A7104A7B0A7CD1A60-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
@ 2017-07-19 17:12                                                                       ` Jason Gunthorpe
       [not found]                                                                         ` <20170719171211.GB25714-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2017-07-22  2:29                                                                       ` Tom Talpey
  1 sibling, 1 reply; 71+ messages in thread
From: Jason Gunthorpe @ 2017-07-19 17:12 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Tom Talpey, Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

On Wed, Jul 19, 2017 at 05:33:52AM +0000, Parav Pandit wrote:
 
> It does guarantee that writes don't bypass writes and reads don't bypass reads (Table 79), and transport operations are executed in their *message* order (C9-28):
> "A responder shall execute SEND requests, RDMA WRITE requests
> and ATOMIC Operation requests in the message order in which
> they are received."

At a minimum, you need to include a version of table 79 in your
documentation that reflects the modified WR ordering rules for this
mode, and I'm not sure we should continue to call this thing a 'RC QP'
at all to avoid so much confusion.

Particularly since this new thing is not really wire interoperable
with the existing RC QP I don't think it should be the same thing in
the API.

I recommend you rework this patch series to introduce a relaxed RC QP
type instead of using a flag.

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                                                                         ` <20170719171211.GB25714-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-07-20 15:06                                                                           ` Parav Pandit
  0 siblings, 0 replies; 71+ messages in thread
From: Parav Pandit @ 2017-07-20 15:06 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tom Talpey, Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

Hi Jason,

> -----Original Message-----
> From: Jason Gunthorpe [mailto:jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org]
> Sent: Wednesday, July 19, 2017 12:12 PM
> To: Parav Pandit <parav-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Cc: Tom Talpey <tom-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>; Bart Van Assche
> <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>; leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org; dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org;
> linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Idan Burstein <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Subject: Re: [PATCH rdma-next 0/3] Support out of order data placement
> 
> On Wed, Jul 19, 2017 at 05:33:52AM +0000, Parav Pandit wrote:
> 
> > It does guarantee that writes don't bypass writes and reads don't bypass reads
> > (Table 79), and transport operations are executed in their *message* order
> > (C9-28):
> > "A responder shall execute SEND requests, RDMA WRITE requests and
> > ATOMIC Operation requests in the message order in which they are
> > received."
> 
> At a minimum, you need to include a version of table 79 in your documentation
> that reflects the modified WR ordering rules for this mode, 
Sure. In fact I had it in an internal version before posting upstream, but dropped it for some reason.
I will add it in v1.

> and I'm not sure we
> should continue to call this thing a 'RC QP'
1. None of the reliability properties is affected by this optional attribute, so it is still an RC QP.

> at all to avoid so much confusion.
All the details we discussed in this email thread will be described in documentation v1 to avoid confusion.

> I recommend you rework this patch series to introduce a relaxed RC QP type
> instead of using a flag..
Additionally,
2. It doesn't make sense to introduce 3 more relaxed QP types (RC, XRC initiator, target) for one attribute common to them.
3. A new QP type makes it more difficult or impossible to make rdmacm work in future.
The active side creates the QP and shares parameters through REQ-RSP messages.
Such an existing interface would just become impossible to make use of.
This attribute is similar to MTU, IRD/ORD etc.
I prefer to keep it as an optional QP attribute, which brings simplicity and is more extensible compared to a new QP type.
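
As a rough sketch of what I mean by sharing this like MTU or IRD/ORD through REQ-RSP: the toy model below follows the six-step rdmacm flow from earlier in the thread. Everything in it is hypothetical (the `Device` class, `passive_handle_req`, `negotiate`); none of it is existing rdmacm code. It also assumes the active side only sets ooo_requested when its own device has the capability, which the flow itself does not spell out.

```python
from dataclasses import dataclass

@dataclass
class Device:
    # Stand-in for the optional device capability bit (hypothetical name).
    ooo_capable: bool

def passive_handle_req(device: Device, ooo_requested: bool) -> bool:
    """Steps 3-5: match the request against the passive side's device cap.

    The returned bit is what would be carried back in the response and
    applied to the passive QP on its transition to RTR.
    """
    return ooo_requested and device.ooo_capable

def negotiate(active: Device, passive: Device) -> bool:
    # Step 2 (assumption): request OOO only if the active device supports it.
    ooo_requested = active.ooo_capable
    # Steps 3-5: the passive side decides and answers.
    ooo_enabled = passive_handle_req(passive, ooo_requested)
    # Step 6: the active side applies the received bit when moving to RTR/RTS.
    return ooo_enabled

# OOO-capable client talking to a server without the capability.
print(negotiate(Device(ooo_capable=True), Device(ooo_capable=False)))
```

The point of the sketch is only that the feature ends up enabled when both ends have the capability, and that a mixed pair falls back to today's strict behaviour without the ULP being involved.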

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                                                                     ` <VI1PR0502MB3008D488FEE8A7104A7B0A7CD1A60-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
  2017-07-19 17:12                                                                       ` Jason Gunthorpe
@ 2017-07-22  2:29                                                                       ` Tom Talpey
       [not found]                                                                         ` <142c2fed-baa5-1295-1458-be765c94b957-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 71+ messages in thread
From: Tom Talpey @ 2017-07-22  2:29 UTC (permalink / raw)
  To: Parav Pandit, Jason Gunthorpe
  Cc: Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

On 7/18/2017 10:33 PM, Parav Pandit wrote:
> Hi Tom, Jason,
> 
> Sorry for the late response.
> Please find the response inline below.
> 
>> -----Original Message-----
>> From: Tom Talpey [mailto:tom-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org]
>> Sent: Monday, June 12, 2017 8:30 PM
>> To: Parav Pandit <parav-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>; Jason Gunthorpe
>> <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
>> Cc: Bart Van Assche <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>; leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org;
>> dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Idan Burstein
>> <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>> Subject: Re: [PATCH rdma-next 0/3] Support out of order data placement
>>
>>>
>>> In IB spec, in-order delivery is default.
>>
>> I don't agree. Requests are sent in-order, and the responder processes them
>> in-order, but the bytes themselves are not guaranteed to appear in-order.
>> Additionally, if retries occur, this is most definitely not the case.
>>
>> Section 9.5 Transaction Ordering, I believe, covers these requirements. Can you
>> tell me where I misunderstand them?
>> In fact, c9-28 explicitly warns:
>>
>>     • An application shall not depend upon the order of data writes to
>>     memory within a message. For example, if an application sets up
>>     data buffers that overlap, for separate data segments within a
>>     message, it is not guaranteed that the last sent data will always
>>     overwrite the earlier.
>>
> The IB spec indeed does not imply any ordering in the placement of data into memory within a single message.
> 
> It does guarantee that writes don't bypass writes and reads don't bypass reads (Table 79), and transport operations are executed in their *message* order (C9-28):
> "A responder shall execute SEND requests, RDMA WRITE requests
> and ATOMIC Operation requests in the message order in which
> they are received."
> 
> Thus, ordering between messages is guaranteed - changes to remote memory of an RDMA-W will be observed strictly after any changes done by a previous RDMA-W; changes to local memory of an RDMA-R response will be observed strictly after any changes done by a previous RDMA-R response.
> 
> The proposed feature in this patch set is to relax the memory placement ordering *across* messages and not within a single message (which is not mandated by the spec as you noted), such that multiple consecutive RDMA-Ws may be committed to memory in any order, and similarly for RDMA-R responses.
> This changes application semantics whenever multiple-inflight RDMA operations write to overlapping locations, or when one operation indicates the completion of the other.
> A simple example to clarify: a requestor posted the following work elements in the written order:
> 1. RDMA-W(VA=0x1000, value=0x1)
> 2. RDMA-W(VA=0x1000, value=0x2)
> 3. Send()
> On responder side, following the Send() operation completion, and according to spec (C9-28), reading from VA=0x1000 will produce the value 2. With the proposed feature enabled, the read value is not deterministic and dependent on the order in which the RDMA-W operations were received.
> 
> The proposed QP flag allows applications to knowingly indicate this relaxed data placement, thereby enabling the HCA to place OOO RDMA messages into memory without buffering them.

You didn't answer my question: what is the actual benefit of relaxing
the ordering? Is it performance? And, specifically, what applications
*can't* use it?

To me, it appears that most storage upper layers can already use
the extension. If it performs better, I expect they will definitely
want to enable it. In that case I believe it should be the *default*,
not an opt-in that these upper layers are newly responsible for.

>> I have one other question on the Documentation out-of-order.txt.
>> It states the fence bit can be used to force ordering on a non-strict connection.
>> But fence doesn't apply to RDMA Write?
>> It only applies to operations which produce a reply, such as RDMA Read or
>> Atomic. Have you changed the semantic?
>>
> RDMA-R followed by RDMA-R semantic is changed when proposed QP flag is set.

Can you explain that statement in more detail please? Also, please
clarify which operation(s) the fence bit now applies to.

Tom.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                                                                         ` <142c2fed-baa5-1295-1458-be765c94b957-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>
@ 2017-07-22  4:50                                                                           ` Parav Pandit
       [not found]                                                                             ` <VI1PR0502MB300886ED1B5B1363B55F53F0D1A50-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Parav Pandit @ 2017-07-22  4:50 UTC (permalink / raw)
  To: Tom Talpey, Jason Gunthorpe
  Cc: Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

Hi Tom,

> -----Original Message-----
> From: Tom Talpey [mailto:tom@talpey.com]
> Sent: Friday, July 21, 2017 9:29 PM
> To: Parav Pandit <parav@mellanox.com>; Jason Gunthorpe
> <jgunthorpe@obsidianresearch.com>
> Cc: Bart Van Assche <Bart.VanAssche@sandisk.com>; leon@kernel.org;
> dledford@redhat.com; linux-rdma@vger.kernel.org; Idan Burstein
> <idanb@mellanox.com>
> Subject: Re: [PATCH rdma-next 0/3] Support out of order data placement
> 
> On 7/18/2017 10:33 PM, Parav Pandit wrote:
> > Hi Tom, Jason,
> >
> > Sorry for the late response.
> > Please find the response inline below.
> >
> >> -----Original Message-----
> >> From: Tom Talpey [mailto:tom@talpey.com]
> >> Sent: Monday, June 12, 2017 8:30 PM
> >> To: Parav Pandit <parav@mellanox.com>; Jason Gunthorpe
> >> <jgunthorpe@obsidianresearch.com>
> >> Cc: Bart Van Assche <Bart.VanAssche@sandisk.com>; leon@kernel.org;
> >> dledford@redhat.com; linux-rdma@vger.kernel.org; Idan Burstein
> >> <idanb@mellanox.com>
> >> Subject: Re: [PATCH rdma-next 0/3] Support out of order data
> >> placement
> >>
> >>>
> >>> In IB spec, in-order delivery is default.
> >>
> >> I don't agree. Requests are sent in-order, and the responder
> >> processes them in- order, but the bytes thenselves are not guaranteed to
> appear in-order.
> >> Additionally, if retries occur, this is most definitely not the case.
> >>
> >> Section 9.5 Transaction Ordering, I believe, covers these
> >> requirements. Can you tell me where I misunderstand them?
> >> In fact, c9-28 explicitly warns:
> >>
> >>     • An application shall not depend upon the order of data writes to
> >>     memory within a message. For example, if an application sets up
> >>     data buffers that overlap, for separate data segments within a
> >>     message, it is not guaranteed that the last sent data will always
> >>     overwrite the earlier.
> >>
> > The IB spec indeed does not imply any ordering in the placement of data into
> memory within a single message.
> >
> > It does guarantee that writes don't bypass writes and reads don't bypass reads
> (Table 76), and transport operations are executed in their *message* order (C9-
> 28):
> > "A responder shall execute SEND requests, RDMA WRITE requests and
> > ATOMIC Operation requests in the message order in which they are
> > received."
> >
> > Thus, ordering between messages is guaranteed - changes to remote memory
> of an RDMA-W will be observed strictly after any changes done by a previous
> RDMA-W; changes to local memory of an RDMA-R response will be observed
> strictly after any changes done by a previous RDMA-R response.
> >
> > The proposed feature in this patch set is to relax the memory placement
> ordering *across* messages and not within a single message (which is not
> mandated by the spec as u noted), such that multiple consecutive RDMA-Ws
> may be committed to memory in any order, and similarly for RDMA-R responses.
> > This changes application semantics whenever multiple-inflight RDMA
> operations write to overlapping locations, or when one operation indicates the
> completion of the other.
> > A simple example to clarify: a requestor posted the following work elements in
> the written order:
> > 1. RDMA-W(VA=0x1000, value=0x1)
> > 2. RDMA-W(VA=0x1000, value=0x2)
> > 3. Send()
> > On responder side, following the Send() operation completion, and according
> to spec (C9-28), reading from VA=0x1000 will produce the value 2. With the
> proposed feature enabled, the read value is not deterministic and dependent on
> the order in which the RDMA-W operations were received.
> >
> > The proposed QP flag allows applications to knowingly indicate this relaxed
> data placement, thereby enabling the HCA to place OOO RDMA messages into
> memory without buffering them.
> 
> You didn't answer my question what is the actual benefit of relaxing the
> ordering. Is it performance?

Yes. Performance is better.

> And, specifically what applications *can't* use it?
Applications which poll on RDMA-W data at the responder side, or on RDMA-R data at the Read requester side, cannot use this.
As explained in the example above, the 2nd RDMA-W message can be executed first at the responder side.
We cannot break such user space applications deployed in the field by enabling this by default, without negotiation with the peer.
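
To make the hazard concrete, here is a small Python model of the two RDMA-Ws from the example above. It is only a sketch, not verbs code: `final_values`, `WRITES` and the dict used as memory are hypothetical names, and it assumes the HCA may place the writes in any order once the relaxed attribute is set.

```python
from itertools import permutations

# The two RDMA-Ws from the example, in posting order: (VA, value).
WRITES = [(0x1000, 0x1), (0x1000, 0x2)]

def final_values(writes, relaxed):
    """Return every value the responder may read at VA 0x1000 after the Send completes."""
    # Strict ordering allows only the posting order; relaxed placement allows any order.
    orders = permutations(writes) if relaxed else [tuple(writes)]
    results = set()
    for order in orders:
        memory = {}
        for addr, value in order:
            memory[addr] = value  # whichever write is placed last wins
        results.add(memory[0x1000])
    return results

print(sorted(final_values(WRITES, relaxed=False)))  # [2]
print(sorted(final_values(WRITES, relaxed=True)))   # [1, 2]
```

An application that polls on VA=0x1000 sees only 2 in the strict case, but may see either value in the relaxed case.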
> 
> To me, it appears that most storage upper layers can already use the extension.

Yes. Since they don't poll on data and instead depend on an incoming RDMA Send, they can make use of it.

> If it performs better, I expect they will definitely want to enable it. In that case I
> believe it should be the *default*, not an opt-in that these upper layers are
> newly responsible for.

The verbs layer is unaware of the calling ULPs. At most it can easily tell whether the caller is a kernel or a user ULP, which is good enough.
The verbs layer also doesn't know whether the remote side supports it or not.
Once the rdmacm extension is done, all kernel ULPs which use rdmacm can have it enabled by default.
This patchset enables user space applications to take immediate benefit of it without depending on rdmacm.

> 
> >> I have one other question on the Documentation out-of-order.txt.
> >> It states the fence bit can be used to force ordering on a non-strict
> connection.
> >> But fence doesn't apply to RDMA Write?
> >> It only applies to operations which produce a reply, such as RDMA
> >> Read or Atomic. Have you changed the semantic?
> >>
> > RDMA-R followed by RDMA-R semantic is changed when proposed QP flag is
> set.
> 
> Can you explain that statement in more detail please? Also, please clarify on
> what operation(s) the fence bit now applies.
Sure.
Let's take an example.
A requestor posted the following work elements in the written order:
1. RDMA-R(VA=0x1000, len=0x400)
2. RDMA-R(VA=0x1400, len=0x4)
Currently, as per Table 76, the read response data of the 1st RDMA-R is placed first.
With the relaxed data placement attribute, the 4 bytes of data of the 2nd RDMA-R can be placed first.
If the user needs the current Table 76 ordering, it needs to set fence on the 2nd RDMA-R.
This will ensure that the 1st RDMA-R is executed before the 2nd RDMA-R.

In terms of Table 76, this means the entry for
RDMA-R (row) and RDMA-R (column) changes from '#' to 'F'.
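
The fence behaviour described above can be modelled in a few lines of Python. This is only a sketch, not verbs code: `allowed_orders` and the (name, fence) tuples are hypothetical, and it assumes that a fenced RDMA-R may only be placed after every RDMA-R posted before it.

```python
from itertools import permutations

def allowed_orders(reads, relaxed):
    """reads: list of (name, fence) in posting order.
    Returns the set of placement orders permitted for the read responses."""
    names = [name for name, _ in reads]
    if not relaxed:
        # Current Table 76 behaviour: responses are placed in posting order.
        return {tuple(names)}
    allowed = set()
    for order in permutations(range(len(reads))):
        # Position of each posted read within this candidate placement order.
        pos = {idx: p for p, idx in enumerate(order)}
        # A fenced read must come after all reads posted before it.
        ok = all(pos[j] < pos[i]
                 for i, (_, fence) in enumerate(reads) if fence
                 for j in range(i))
        if ok:
            allowed.add(tuple(names[i] for i in order))
    return allowed

print(sorted(allowed_orders([("R1", False), ("R2", False)], relaxed=True)))
# [('R1', 'R2'), ('R2', 'R1')]
print(sorted(allowed_orders([("R1", False), ("R2", True)], relaxed=True)))
# [('R1', 'R2')]
```

With the fence set on the 2nd RDMA-R, the relaxed QP allows the same single order as the strict QP.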

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                                                                             ` <VI1PR0502MB300886ED1B5B1363B55F53F0D1A50-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
@ 2017-07-22  5:03                                                                               ` Tom Talpey
       [not found]                                                                                 ` <e5cee768-586c-516a-6056-ea4a44f89134-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Tom Talpey @ 2017-07-22  5:03 UTC (permalink / raw)
  To: Parav Pandit, Jason Gunthorpe
  Cc: Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

On 7/21/2017 9:50 PM, Parav Pandit wrote:
> Hi Tom,
> 
>> -----Original Message-----
>> From: Tom Talpey [mailto:tom-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org]
>> Sent: Friday, July 21, 2017 9:29 PM
>> To: Parav Pandit <parav-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>; Jason Gunthorpe
>> <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
>> Cc: Bart Van Assche <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>; leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org;
>> dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Idan Burstein
>> <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>> Subject: Re: [PATCH rdma-next 0/3] Support out of order data placement
>>
>> On 7/18/2017 10:33 PM, Parav Pandit wrote:
>>> Hi Tom, Jason,
>>>
>>> Sorry for the late response.
>>> Please find the response inline below.
>>>
>>>> -----Original Message-----
>>>> From: Tom Talpey [mailto:tom-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org]
>>>> Sent: Monday, June 12, 2017 8:30 PM
>>>> To: Parav Pandit <parav-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>; Jason Gunthorpe
>>>> <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
>>>> Cc: Bart Van Assche <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>; leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org;
>>>> dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Idan Burstein
>>>> <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>>> Subject: Re: [PATCH rdma-next 0/3] Support out of order data
>>>> placement
>>>>
>>>>>
>>>>> In IB spec, in-order delivery is default.
>>>>
>>>> I don't agree. Requests are sent in-order, and the responder
>>>> processes them in- order, but the bytes thenselves are not guaranteed to
>> appear in-order.
>>>> Additionally, if retries occur, this is most definitely not the case.
>>>>
>>>> Section 9.5 Transaction Ordering, I believe, covers these
>>>> requirements. Can you tell me where I misunderstand them?
>>>> In fact, c9-28 explicitly warns:
>>>>
>>>>      • An application shall not depend upon the order of data writes to
>>>>      memory within a message. For example, if an application sets up
>>>>      data buffers that overlap, for separate data segments within a
>>>>      message, it is not guaranteed that the last sent data will always
>>>>      overwrite the earlier.
>>>>
>>> The IB spec indeed does not imply any ordering in the placement of data into
>> memory within a single message.
>>>
>>> It does guarantee that writes don't bypass writes and reads don't bypass reads
>> (Table 76), and transport operations are executed in their *message* order (C9-
>> 28):
>>> "A responder shall execute SEND requests, RDMA WRITE requests and
>>> ATOMIC Operation requests in the message order in which they are
>>> received."
>>>
>>> Thus, ordering between messages is guaranteed - changes to remote memory
>> of an RDMA-W will be observed strictly after any changes done by a previous
>> RDMA-W; changes to local memory of an RDMA-R response will be observed
>> strictly after any changes done by a previous RDMA-R response.
>>>
>>> The proposed feature in this patch set is to relax the memory placement
>> ordering *across* messages and not within a single message (which is not
>> mandated by the spec as u noted), such that multiple consecutive RDMA-Ws
>> may be committed to memory in any order, and similarly for RDMA-R responses.
>>> This changes application semantics whenever multiple-inflight RDMA
>> operations write to overlapping locations, or when one operation indicates the
>> completion of the other.
>>> A simple example to clarify: a requestor posted the following work elements in
>> the written order:
>>> 1. RDMA-W(VA=0x1000, value=0x1)
>>> 2. RDMA-W(VA=0x1000, value=0x2)
>>> 3. Send()
>>> On responder side, following the Send() operation completion, and according
>> to spec (C9-28), reading from VA=0x1000 will produce the value 2. With the
>> proposed feature enabled, the read value is not deterministic and dependent on
>> the order in which the RDMA-W operations were received.
>>>
>>> The proposed QP flag allows applications to knowingly indicate this relaxed
>> data placement, thereby enabling the HCA to place OOO RDMA messages into
>> memory without buffering them.
>>
>> You didn't answer my question what is the actual benefit of relaxing the
>> ordering. Is it performance?
> 
> Yes. Performance is better.
> 
>> And, specifically what applications *can't* use it?
> Applications which poll on RDMA-W data at responder side or RDMA-R data on Read requester side, cannot use this.
> Because as explained in above example 2nd RDMA-W message can be executed first at responder side.
> We cannot break such user space applications deployed in field by enabling this by default and without negotiation with peer.

Those applications ignored the spec, and got away with it only
because the Mellanox (is that who "we" is?) implementation was
strongly ordered. That's not much of an excuse, in my opinion, to
force change on the well-behaved, spec-observing ULPs in order
that they might take advantage of it.

>> To me, it appears that most storage upper layers can already use the extension.
> 
> Yes. As they don't poll on data and they depend on incoming RDMA Send they can make use of it.

Not without changing their protocols and implementations. I think
you should reconsider your approach to throw the responsibility to
them, and them only.

>> If it performs better, I expect they will definitely want to enable it. In that case I
>> believe it should be the *default*, not an opt-in that these upper layers are
>> newly responsible for.
> 
> Verb layer is unware of caller ULPs. At most it knows that its kernel vs user ULP easily - which is good enough.
> Verb layer also doesn't know whether remote side support it or not.
> Once rdmacm extension is done, all kernel ULPs which uses rdmacm - can be enabled by default.

Well, then this change should wait for that to become available.

> This patchset enables user space applications to take immediate benefit of it which doesn't depend on rdmacm.

But it changes the API in a way that we don't want to survive.
Let's get the interface right first.

>>>> I have one other question on the Documentation out-of-order.txt.
>>>> It states the fence bit can be used to force ordering on a non-strict
>> connection.
>>>> But fence doesn't apply to RDMA Write?
>>>> It only applies to operations which produce a reply, such as RDMA
>>>> Read or Atomic. Have you changed the semantic?
>>>>
>>> RDMA-R followed by RDMA-R semantic is changed when proposed QP flag is
>> set.
>>
>> Can you explain that statement in more detail please? Also, please clarify on
>> what operation(s) the fence bit now applies.
> Sure.
> Let's take example.
> A requestor posted the following work elements in the written order:
> 1. RDMA-R(VA=0x1000, len=0x400)
> 2. RDMA-R(VA=0x1400, len=0x4)
> Currently as per Table-76, RDMA-R read response data of 1st RDMA-R is placed first.
> With relaxed data placement attribute, 4 bytes data of 2nd RDMA-R can be placed first.
> If user needs a ordering of current Table-76, it needs to set fence on 2nd RDMA-R.
> This will ensure that 1st RDMA-R is executed before 2nd RDMA-R.

Oh, that's the same issue as the initial one - polling the "last"
bit was never guaranteed. I don't see this as a change to the
semantic.

But, I take it that the fence bit still applies as before, this is
not a proposal to extend fencing to RDMA Write. Ok.


> 
> This translates to In Table 76,
> RDMA-R (Row) and RDMA-R(Column) changes from '#' to 'F'.
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                                                                                 ` <e5cee768-586c-516a-6056-ea4a44f89134-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>
@ 2017-07-22  5:32                                                                                   ` Parav Pandit
       [not found]                                                                                     ` <VI1PR0502MB30089915EB792CD20CA9D179D1A50-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Parav Pandit @ 2017-07-22  5:32 UTC (permalink / raw)
  To: Tom Talpey, Jason Gunthorpe
  Cc: Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

Hi Tom,

> -----Original Message-----
> From: Tom Talpey [mailto:tom@talpey.com]
> Sent: Saturday, July 22, 2017 12:03 AM
> To: Parav Pandit <parav@mellanox.com>; Jason Gunthorpe
> <jgunthorpe@obsidianresearch.com>
> Cc: Bart Van Assche <Bart.VanAssche@sandisk.com>; leon@kernel.org;
> dledford@redhat.com; linux-rdma@vger.kernel.org; Idan Burstein
> <idanb@mellanox.com>
> Subject: Re: [PATCH rdma-next 0/3] Support out of order data placement
> 
> On 7/21/2017 9:50 PM, Parav Pandit wrote:
> > Hi Tom,
> >
> >> -----Original Message-----
> >> From: Tom Talpey [mailto:tom@talpey.com]
> >> Sent: Friday, July 21, 2017 9:29 PM
> >> To: Parav Pandit <parav@mellanox.com>; Jason Gunthorpe
> >> <jgunthorpe@obsidianresearch.com>
> >> Cc: Bart Van Assche <Bart.VanAssche@sandisk.com>; leon@kernel.org;
> >> dledford@redhat.com; linux-rdma@vger.kernel.org; Idan Burstein
> >> <idanb@mellanox.com>
> >> Subject: Re: [PATCH rdma-next 0/3] Support out of order data
> >> placement
> >>
> >> On 7/18/2017 10:33 PM, Parav Pandit wrote:
> >>> Hi Tom, Jason,
> >>>
> >>> Sorry for the late response.
> >>> Please find the response inline below.
> >>>
> >>>> -----Original Message-----
> >>>> From: Tom Talpey [mailto:tom@talpey.com]
> >>>> Sent: Monday, June 12, 2017 8:30 PM
> >>>> To: Parav Pandit <parav@mellanox.com>; Jason Gunthorpe
> >>>> <jgunthorpe@obsidianresearch.com>
> >>>> Cc: Bart Van Assche <Bart.VanAssche@sandisk.com>; leon@kernel.org;
> >>>> dledford@redhat.com; linux-rdma@vger.kernel.org; Idan Burstein
> >>>> <idanb@mellanox.com>
> >>>> Subject: Re: [PATCH rdma-next 0/3] Support out of order data
> >>>> placement
> >>>>
> >>>>>
> >>>>> In IB spec, in-order delivery is default.
> >>>>
> >>>> I don't agree. Requests are sent in-order, and the responder
> >>>> processes them in- order, but the bytes thenselves are not
> >>>> guaranteed to
> >> appear in-order.
> >>>> Additionally, if retries occur, this is most definitely not the case.
> >>>>
> >>>> Section 9.5 Transaction Ordering, I believe, covers these
> >>>> requirements. Can you tell me where I misunderstand them?
> >>>> In fact, c9-28 explicitly warns:
> >>>>
> >>>>      • An application shall not depend upon the order of data writes to
> >>>>      memory within a message. For example, if an application sets up
> >>>>      data buffers that overlap, for separate data segments within a
> >>>>      message, it is not guaranteed that the last sent data will always
> >>>>      overwrite the earlier.
> >>>>
> >>> The IB spec indeed does not imply any ordering in the placement of
> >>> data into
> >> memory within a single message.
> >>>
> >>> It does guarantee that writes don't bypass writes and reads don't
> >>> bypass reads
> >> (Table 76), and transport operations are executed in their *message*
> >> order (C9-
> >> 28):
> >>> "A responder shall execute SEND requests, RDMA WRITE requests and
> >>> ATOMIC Operation requests in the message order in which they are
> >>> received."
> >>>
> >>> Thus, ordering between messages is guaranteed - changes to remote
> >>> memory
> >> of an RDMA-W will be observed strictly after any changes done by a
> >> previous RDMA-W; changes to local memory of an RDMA-R response will
> >> be observed strictly after any changes done by a previous RDMA-R response.
> >>>
> >>> The proposed feature in this patch set is to relax the memory
> >>> placement
> >> ordering *across* messages and not within a single message (which is
> >> not mandated by the spec as u noted), such that multiple consecutive
> >> RDMA-Ws may be committed to memory in any order, and similarly for
> RDMA-R responses.
> >>> This changes application semantics whenever multiple-inflight RDMA
> >> operations write to overlapping locations, or when one operation
> >> indicates the completion of the other.
> >>> A simple example to clarify: a requestor posted the following work
> >>> elements in
> >> the written order:
> >>> 1. RDMA-W(VA=0x1000, value=0x1)
> >>> 2. RDMA-W(VA=0x1000, value=0x2)
> >>> 3. Send()
> >>> On responder side, following the Send() operation completion, and
> >>> according
> >> to spec (C9-28), reading from VA=0x1000 will produce the value 2.
> >> With the proposed feature enabled, the read value is not
> >> deterministic and dependent on the order in which the RDMA-W operations
> were received.
> >>>
> >>> The proposed QP flag allows applications to knowingly indicate this
> >>> relaxed
> >> data placement, thereby enabling the HCA to place OOO RDMA messages
> >> into memory without buffering them.
> >>
> >> You didn't answer my question what is the actual benefit of relaxing
> >> the ordering. Is it performance?
> >
> > Yes. Performance is better.
> >
> >> And, specifically what applications *can't* use it?
> > Applications which poll on RDMA-W data at responder side or RDMA-R data on
> Read requester side, cannot use this.
> > Because as explained in above example 2nd RDMA-W message can be
> executed first at responder side.
> > We cannot break such user space applications deployed in field by enabling
> this by default and without negotiation with peer.
> 
> Those applications ignored the spec, and got away with it only because the
> Mellanox (is that who "we" is?) implementation was strongly ordered. Thats not
> much of an excuse, in my opinion, to force change on the well-behaved, spec-
> observing ULPs in order that they might take advantage of it.
> 
As discussed via Table 76 and C9-28, the current IB spec assures that the RDMA-R of 4 bytes is executed after the RDMA-R of 1K is executed.
The application didn't adhere to the optional requirements o9-20 and o9-21.
I don't see a reason why such applications should be broken when we already have a way to avoid that.

> >> To me, it appears that most storage upper layers can already use the
> extension.
> >
> > Yes. As they don't poll on data and they depend on incoming RDMA Send they
> can make use of it.
> 
> Not without changing their protocols and implementations. I think you should
> reconsider your approach to throw the responsibility to them, and them only.
> 
The approach is currently open, with at least two options.
1. It can be done in the core layer, so that it is enabled by default for kernel ULPs, with peer negotiation transparent to the ULPs.
Or
2. The ULP gets explicit control to enable/disable it, similar to other connection parameters.
This patch is a layer below that and is unaffected by the layers above.
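
As a rough illustration of how the two options differ, here is a small Python sketch of the enable decision. Nothing here is from the patch set: `ooo_enabled` and its parameters are hypothetical, and it assumes that in either option the attribute is only set when both peers support it.

```python
def ooo_enabled(local_supports, remote_supports, ulp_opt_in=None):
    """Decide whether to set the relaxed placement attribute on a QP.

    ulp_opt_in=None   -> option 1: the core layer decides, transparent to the ULP.
    ulp_opt_in=True/False -> option 2: the ULP controls it explicitly.
    """
    # Both sides must support the capability, whoever makes the decision.
    both = local_supports and remote_supports
    if ulp_opt_in is None:
        return both
    return both and ulp_opt_in

print(ooo_enabled(True, True))                    # True  (option 1, negotiated)
print(ooo_enabled(True, False))                   # False (peer lacks support)
print(ooo_enabled(True, True, ulp_opt_in=False))  # False (option 2, ULP declined)
```

In both cases the QP attribute itself is the same; only who decides to set it changes.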

> >> If it performs better, I expect they will definitely want to enable
> >> it. In that case I believe it should be the *default*, not an opt-in
> >> that these upper layers are newly responsible for.
> >
> > Verb layer is unware of caller ULPs. At most it knows that its kernel vs user ULP
> easily - which is good enough.
> > Verb layer also doesn't know whether remote side support it or not.
> > Once rdmacm extension is done, all kernel ULPs which uses rdmacm - can be
> enabled by default.
> 
> Well, then this change should wait for that to become available.
The kernel provides the service to both user applications and kernel ULPs.
This attribute is a layer below such applications.
I don't see a reason to add a dependency on the connection manager for applications which don't use it.
Rdmacm would be an extension on top of this that can make use of this attribute.

> 
> > This patchset enables user space applications to take immediate benefit of it
> which doesn't depend on rdmacm.
> 
> But it changes the API in a way that we don't want to survive.
> Let's get the interface right first.

This is a QP attribute, similar to many other QP attributes, that the ULP can set appropriately.
> 
> >>> RDMA-R followed by RDMA-R semantic is changed when proposed QP flag
> >>> is
> >> set.
> >>
> >> Can you explain that statement in more detail please? Also, please
> >> clarify on what operation(s) the fence bit now applies.
> > Sure.
> > Let's take example.
> > A requestor posted the following work elements in the written order:
> > 1. RDMA-R(VA=0x1000, len=0x400)
> > 2. RDMA-R(VA=0x1400, len=0x4)
> > Currently as per Table-76, RDMA-R read response data of 1st RDMA-R is
> placed first.
> > With relaxed data placement attribute, 4 bytes data of 2nd RDMA-R can be
> placed first.
> > If user needs a ordering of current Table-76, it needs to set fence on 2nd
> RDMA-R.
> > This will ensure that 1st RDMA-R is executed before 2nd RDMA-R.
> 
> Oh, that's the same issue a the initial one - polling the "last"
> bit was never guaranteed. I dont see this as a change to the semantic.
> 
It's a clear deviation from Table 76 and C9-28, and therefore a semantic change that deserves a bit.

> But, I take it that the fence bit still applies as before, this is not a proposal to
> extended fencing to RDMA Write. Ok.

The rest of Table 76 stays as is.

> > This translates to In Table 76,
> > RDMA-R (Row) and RDMA-R(Column) changes from '#' to 'F'.
> >

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                                                                                     ` <VI1PR0502MB30089915EB792CD20CA9D179D1A50-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
@ 2017-07-22 14:51                                                                                       ` Tom Talpey
       [not found]                                                                                         ` <4444a96a-c1e6-ad33-204a-680982e19bfe-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Tom Talpey @ 2017-07-22 14:51 UTC (permalink / raw)
  To: Parav Pandit, Jason Gunthorpe
  Cc: Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

Well, if the broken applications won't use the extension, and
the existing storage protocols and applications will have to
change both their implementation and their protocol to use it,
who do you envision actually doing so?

Sorry, but I just don't see the point of making it optional.

Tom.

On 7/21/2017 10:32 PM, Parav Pandit wrote:
> Hi Tom,
> 
>> -----Original Message-----
>> From: Tom Talpey [mailto:tom-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org]
>> Sent: Saturday, July 22, 2017 12:03 AM
>> To: Parav Pandit <parav-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>; Jason Gunthorpe
>> <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
>> Cc: Bart Van Assche <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>; leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org;
>> dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Idan Burstein
>> <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>> Subject: Re: [PATCH rdma-next 0/3] Support out of order data placement
>>
>> On 7/21/2017 9:50 PM, Parav Pandit wrote:
>>> Hi Tom,
>>>
>>>> -----Original Message-----
>>>> From: Tom Talpey [mailto:tom-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org]
>>>> Sent: Friday, July 21, 2017 9:29 PM
>>>> To: Parav Pandit <parav-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>; Jason Gunthorpe
>>>> <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
>>>> Cc: Bart Van Assche <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>; leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org;
>>>> dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Idan Burstein
>>>> <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>>> Subject: Re: [PATCH rdma-next 0/3] Support out of order data
>>>> placement
>>>>
>>>> On 7/18/2017 10:33 PM, Parav Pandit wrote:
>>>>> Hi Tom, Jason,
>>>>>
>>>>> Sorry for the late response.
>>>>> Please find the response inline below.
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Tom Talpey [mailto:tom-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org]
>>>>>> Sent: Monday, June 12, 2017 8:30 PM
>>>>>> To: Parav Pandit <parav-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>; Jason Gunthorpe
>>>>>> <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
>>>>>> Cc: Bart Van Assche <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>; leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org;
>>>>>> dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Idan Burstein
>>>>>> <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>>>>> Subject: Re: [PATCH rdma-next 0/3] Support out of order data
>>>>>> placement
>>>>>>
>>>>>>>
>>>>>>> In IB spec, in-order delivery is default.
>>>>>>
>>>>>> I don't agree. Requests are sent in-order, and the responder
>>>>>> processes them in- order, but the bytes thenselves are not
>>>>>> guaranteed to
>>>> appear in-order.
>>>>>> Additionally, if retries occur, this is most definitely not the case.
>>>>>>
>>>>>> Section 9.5 Transaction Ordering, I believe, covers these
>>>>>> requirements. Can you tell me where I misunderstand them?
>>>>>> In fact, c9-28 explicitly warns:
>>>>>>
>>>>>>       • An application shall not depend upon the order of data writes to
>>>>>>       memory within a message. For example, if an application sets up
>>>>>>       data buffers that overlap, for separate data segments within a
>>>>>>       message, it is not guaranteed that the last sent data will always
>>>>>>       overwrite the earlier.
>>>>>>
>>>>> The IB spec indeed does not imply any ordering in the placement of
>>>>> data into
>>>> memory within a single message.
>>>>>
>>>>> It does guarantee that writes don't bypass writes and reads don't
>>>>> bypass reads
>>>> (Table 76), and transport operations are executed in their *message*
>>>> order (C9-
>>>> 28):
>>>>> "A responder shall execute SEND requests, RDMA WRITE requests and
>>>>> ATOMIC Operation requests in the message order in which they are
>>>>> received."
>>>>>
>>>>> Thus, ordering between messages is guaranteed - changes to remote
>>>>> memory
>>>> of an RDMA-W will be observed strictly after any changes done by a
>>>> previous RDMA-W; changes to local memory of an RDMA-R response will
>>>> be observed strictly after any changes done by a previous RDMA-R response.
>>>>>
>>>>> The proposed feature in this patch set is to relax the memory
>>>>> placement
>>>> ordering *across* messages and not within a single message (which is
>>>> not mandated by the spec as u noted), such that multiple consecutive
>>>> RDMA-Ws may be committed to memory in any order, and similarly for
>> RDMA-R responses.
>>>>> This changes application semantics whenever multiple-inflight RDMA
>>>> operations write to overlapping locations, or when one operation
>>>> indicates the completion of the other.
>>>>> A simple example to clarify: a requestor posted the following work
>>>>> elements in
>>>> the written order:
>>>>> 1. RDMA-W(VA=0x1000, value=0x1)
>>>>> 2. RDMA-W(VA=0x1000, value=0x2)
>>>>> 3. Send()
>>>>> On responder side, following the Send() operation completion, and
>>>>> according
>>>> to spec (C9-28), reading from VA=0x1000 will produce the value 2.
>>>> With the proposed feature enabled, the read value is not
>>>> deterministic and dependent on the order in which the RDMA-W operations
>> were received.
>>>>>
>>>>> The proposed QP flag allows applications to knowingly indicate this
>>>>> relaxed
>>>> data placement, thereby enabling the HCA to place OOO RDMA messages
>>>> into memory without buffering them.
>>>>
>>>> You didn't answer my question what is the actual benefit of relaxing
>>>> the ordering. Is it performance?
>>>
>>> Yes. Performance is better.
>>>
>>>> And, specifically what applications *can't* use it?
>>> Applications which poll on RDMA-W data at responder side or RDMA-R data on
>> Read requester side, cannot use this.
>>> Because as explained in above example 2nd RDMA-W message can be
>> executed first at responder side.
>>> We cannot break such user space applications deployed in field by enabling
>> this by default and without negotiation with peer.
>>
>> Those applications ignored the spec, and got away with it only because the
>> Mellanox (is that who "we" is?) implementation was strongly ordered. Thats not
>> much of an excuse, in my opinion, to force change on the well-behaved, spec-
>> observing ULPs in order that they might take advantage of it.
>>
> As discussed for Table-76 and C9-28, the current IB spec ensures that the RDMA-R of 4 bytes is executed after the RDMA-R of 1K.
> Applications didn't adhere to the optional requirements o9-20 and o9-21.
> I don't see a reason why such applications should be broken when we already have a way to avoid that.
> 
>>>> To me, it appears that most storage upper layers can already use the
>> extension.
>>>
>>> Yes. As they don't poll on data and they depend on incoming RDMA Send they
>> can make use of it.
>>
>> Not without changing their protocols and implementations. I think you should
>> reconsider your approach to throw the responsibility to them, and them only.
>>
> The approach is currently open, with at least two options.
> 1. Either it can be done in the core layer, enabled by default for kernel ULPs, with peer negotiation transparent to the ULPs.
> Or
> 2. The ULP gets explicit control to enable/disable it, similar to other connection parameters.
> This patch is a layer below that and is unaffected by the layers above.
> 
>>>> If it performs better, I expect they will definitely want to enable
>>>> it. In that case I believe it should be the *default*, not an opt-in
>>>> that these upper layers are newly responsible for.
>>>
>>> The verbs layer is unaware of caller ULPs. At most it can easily tell a kernel ULP from a
>> user ULP, which is good enough.
>>> Verb layer also doesn't know whether remote side support it or not.
>>> Once rdmacm extension is done, all kernel ULPs which uses rdmacm - can be
>> enabled by default.
>>
>> Well, then this change should wait for that to become available.
> The kernel provides the service to both user applications and kernel ULPs.
> This attribute is a layer below such applications.
> I don't see a reason to add a dependency on a connection manager for applications that don't use one.
> Rdmacm would be an extension on top of this that can make use of this attribute.
> 
>>
>>> This patchset enables user space applications to take immediate benefit of it
>> which doesn't depend on rdmacm.
>>
>> But it changes the API in a way that we don't want to survive.
>> Let's get the interface right first.
> 
> This is a QP attribute similar to many other QP attributes that ULP can set appropriately.
>>
>>>>> RDMA-R followed by RDMA-R semantic is changed when proposed QP flag
>>>>> is
>>>> set.
>>>>
>>>> Can you explain that statement in more detail please? Also, please
>>>> clarify on what operation(s) the fence bit now applies.
>>> Sure.
>>> Let's take example.
>>> A requestor posted the following work elements in the written order:
>>> 1. RDMA-R(VA=0x1000, len=0x400)
>>> 2. RDMA-R(VA=0x1400, len=0x4)
>>> Currently as per Table-76, RDMA-R read response data of 1st RDMA-R is
>> placed first.
>>> With relaxed data placement attribute, 4 bytes data of 2nd RDMA-R can be
>> placed first.
>>> If user needs a ordering of current Table-76, it needs to set fence on 2nd
>> RDMA-R.
>>> This will ensure that 1st RDMA-R is executed before 2nd RDMA-R.
>>
>> Oh, that's the same issue as the initial one - polling the "last"
>> bit was never guaranteed. I don't see this as a change to the semantic.
>>
> It's a clear deviation from Table-76 and C9-28, and therefore a semantic change that deserves a bit.
> 
>> But, I take it that the fence bit still applies as before, this is not a proposal to
>> extend fencing to RDMA Write. Ok.
> 
> Rest of the Table 76 stays as is.
> 
>>> This translates to In Table 76,
>>> RDMA-R (Row) and RDMA-R(Column) changes from '#' to 'F'.
>>>
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                                                                                         ` <4444a96a-c1e6-ad33-204a-680982e19bfe-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>
@ 2017-07-22 15:11                                                                                           ` Parav Pandit
  2017-07-22 16:09                                                                                           ` Jason Gunthorpe
  1 sibling, 0 replies; 71+ messages in thread
From: Parav Pandit @ 2017-07-22 15:11 UTC (permalink / raw)
  To: Tom Talpey, Jason Gunthorpe
  Cc: Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein


> -----Original Message-----
> From: Tom Talpey [mailto:tom@talpey.com]
> Sent: Saturday, July 22, 2017 9:52 AM
> To: Parav Pandit <parav@mellanox.com>; Jason Gunthorpe
> <jgunthorpe@obsidianresearch.com>
> Cc: Bart Van Assche <Bart.VanAssche@sandisk.com>; leon@kernel.org;
> dledford@redhat.com; linux-rdma@vger.kernel.org; Idan Burstein
> <idanb@mellanox.com>
> Subject: Re: [PATCH rdma-next 0/3] Support out of order data placement
> 
> Well, if the broken applications won't use the extension, and the existing
> storage protocols and applications will have to change both their
> implementation and their protocol to use it, who do you envision actually doing
> so?
Current application binaries and their middleware binaries can continue to work as is.
New applications can query this feature and enable it to benefit from this optional extension.
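
As a rough illustration of that opt-in flow, here is a minimal Python model. It is not the verbs API: `Device`, `QueuePair`, `ooo_data_placement`, `modify_to_rtr` and `connect` are hypothetical names invented for this sketch. The only behavior taken from the cover letter is that the capability is optional per HCA and that the attribute is set per QP on the INIT to RTR transition.

```python
from dataclasses import dataclass


@dataclass
class Device:
    # Hypothetical capability flag; stands in for whatever the HCA reports.
    ooo_data_placement: bool = False


@dataclass
class QueuePair:
    device: Device
    state: str = "INIT"
    ooo_enabled: bool = False

    def modify_to_rtr(self, want_ooo: bool = False) -> None:
        """Move INIT -> RTR, optionally requesting relaxed data placement."""
        if self.state != "INIT":
            raise ValueError("attribute can only be set on INIT -> RTR")
        if want_ooo and not self.device.ooo_data_placement:
            raise ValueError("HCA does not expose out of order data placement")
        self.ooo_enabled = want_ooo
        self.state = "RTR"


def connect(device: Device, app_tolerates_ooo: bool) -> QueuePair:
    """Enable the feature only if both the application and the HCA allow it."""
    qp = QueuePair(device)
    qp.modify_to_rtr(want_ooo=app_tolerates_ooo and device.ooo_data_placement)
    return qp
```

In this model an existing application simply never passes `app_tolerates_ooo=True`, so its QPs keep the strict behavior.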

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                                                                                         ` <4444a96a-c1e6-ad33-204a-680982e19bfe-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>
  2017-07-22 15:11                                                                                           ` Parav Pandit
@ 2017-07-22 16:09                                                                                           ` Jason Gunthorpe
       [not found]                                                                                             ` <20170722160939.GA30007-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  1 sibling, 1 reply; 71+ messages in thread
From: Jason Gunthorpe @ 2017-07-22 16:09 UTC (permalink / raw)
  To: Tom Talpey
  Cc: Parav Pandit, Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

On Sat, Jul 22, 2017 at 07:51:33AM -0700, Tom Talpey wrote:
> Well, if the broken applications won't use the extension, and
> the existing storage protocols and applications will have to
> change both their implementation and their protocol to use it,
> who do you envision actually doing so?

The apps are not broken, it appears that this extension allows the
receiver HCA to process inbound packets without seeing earlier
packets.

This means it can really start doing things out of order, subject only
to the ability of the transmitter to halt sending (eg fence) until
acks are seen.

Since there is no way to predict what is in the data the HCA didn't
see, this would seem to allow quite a lot of out of order execution
inside the HCA.

Eg

 RDMA-W VA=0 Data=1
 RDMA-R VA=0
 RDMA-W VA=0 Data=2

What does the read return? Spec says 1, but it sounds like this
relaxed ordering could return 2.

What about

 RDMA-W VA=0 Data=1
 SEND WITH INVALIDATE VA=0
 RDMA-W VA=0 Data=2

Spec says the second RDMA-W must fail, but it sounds like this relaxed
ordering would allow it to happen.

So it cannot be turned on by default.. But a traditional well behaved
storage ULP will only do single access to a single VA and should be
able to safely turn something like this on.

All of this is independent of observability at the CPU.

All this means it needs to be negotiated via RDMA-CM, and making it a
common verbs option would make sense - is there a plan to do that?

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                                                                                             ` <20170722160939.GA30007-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-07-22 17:32                                                                                               ` Parav Pandit
       [not found]                                                                                                 ` <VI1PR0502MB3008E483BB36F944C2E212ACD1A50-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Parav Pandit @ 2017-07-22 17:32 UTC (permalink / raw)
  To: Jason Gunthorpe, Tom Talpey
  Cc: Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

Hi Jason,


> -----Original Message-----
> From: Jason Gunthorpe [mailto:jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org]
> Sent: Saturday, July 22, 2017 11:10 AM
> To: Tom Talpey <tom-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>
> Cc: Parav Pandit <parav-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>; Bart Van Assche
> <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>; leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org; dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org;
> linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Idan Burstein <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Subject: Re: [PATCH rdma-next 0/3] Support out of order data placement
> 
> On Sat, Jul 22, 2017 at 07:51:33AM -0700, Tom Talpey wrote:
> > Well, if the broken applications won't use the extension, and the
> > existing storage protocols and applications will have to change both
> > their implementation and their protocol to use it, who do you envision
> > actually doing so?
> 
> The apps are not broken, it appears that this extension allows the receiver HCA
> to process inbound packets without seeing earlier packets.
> 
Correct.

> This means it can really start doing things out of order, subject only to the ability
> of the transmitter to halt sending (eg fence) until acks are seen.
> 
> Since there is no way to predict what is in the data the HCA didn't see, this would
> seem to allow quite a lot of out of order execution inside the HCA.
> 
> Eg
> 
>  RDMA-W VA=0 Data=1
>  RDMA-R VA=0
>  RDMA-W VA=0 Data=2
> 
> What does the read return? Spec says 1, but it sounds like this relaxed ordering
> could return 2.
> 
The spec says the RDMA-R returns Data=1 only if Fence is set on the read operation, per Table-76.
Otherwise a duplicate read request, executed after RDMA-W2 (Data=2), can return Data=2.

> What about
> 
>  RDMA-W VA=0 Data=1
>  SEND WITH INVALIDATE VA=0
>  RDMA-W VA=0 Data=2
> 
> Spec says the second RDMA-W must fail, 
Right.

> but it sounds like this relaxed ordering would allow it to happen.
No. Table 76 is followed in this case.
The 1st operation is a Write.
The 2nd operation is a Send.
There is an implicit fence defined by '#' in the table, which is followed.
So the 2nd RDMA-W continues to fail.

> 
> So it cannot be turned on by default.. But a traditional well behaved storage ULP
> will only do single access to a single VA and should be able to safely turn
> something like this on.
> 
> All of these is independent to observability at the CPU..
> 
> All this means it needs to be negotiated via RDMA-CM, and making it a common
> verbs option would make sense - is there a plan to do that?

I completely understand and agree that storage protocols that depend on RDMA-CM would like to have this.
But again, the unavailability of this bit in RDMA-CM is not a blocker for userland apps; let them start using it.
Once RDMA-CM has it, storage kernel ULPs can also be enabled.

I am not the right person to comment on the RDMA-CM plan.
Idan,
Do you know?


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                                                                                                 ` <VI1PR0502MB3008E483BB36F944C2E212ACD1A50-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
@ 2017-07-22 21:27                                                                                                   ` Jason Gunthorpe
       [not found]                                                                                                     ` <20170722212706.GA14714-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Jason Gunthorpe @ 2017-07-22 21:27 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Tom Talpey, Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

On Sat, Jul 22, 2017 at 05:32:05PM +0000, Parav Pandit wrote:
> > 
> > Eg
> > 
> >  RDMA-W VA=0 Data=1
> >  RDMA-R VA=0
> >  RDMA-W VA=0 Data=2
> > 
> > What does the read return? Spec says 1, but it sounds like this relaxed ordering
> > could return 2.
> > 
> Spec says Data=1 on RDMA-R only if Fence is set on read operation in Table-76.
> Otherwise duplicate read request after executing RDMA-W2 of Data=2 can return Data=2 on read request.

Erm, I should have written it like this

 Initial Condition VA=0 Data = 0
 RDMA-W VA=0 Data=1
 RDMA-R VA=0

Spec says 1 must be returned, but sounds like this relaxed version
could return 0. So RDMA Write -> RDMA Read degrades to a F

Similarly,

 RDMA-W VA=0 Data=1
 RDMA-W VA=0 Data=2
 SEND

Sounds like with the relaxed version the app could see 1 at SEND CQ
time.

So RDMA-W -> RDMA-W degrades to a F
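
The two outcomes can be enumerated with a small model. This is a sketch only, assuming relaxed placement lets the responder commit the two writes in any order. `possible_values_at_send` is a name made up for the illustration and is not part of any verbs API.

```python
from itertools import permutations


def possible_values_at_send(writes, relaxed):
    """Return the set of values a responder could read at one VA once the
    trailing SEND completes, given the posted order of RDMA writes."""
    # Strict placement commits writes in posted order; the relaxed model
    # (an assumption of this sketch) allows every placement order.
    orders = permutations(writes) if relaxed else [tuple(writes)]
    # The last write placed into memory determines the visible value.
    return {order[-1] for order in orders}


posted = [1, 2]  # RDMA-W VA=0 Data=1, then RDMA-W VA=0 Data=2, then SEND
print(possible_values_at_send(posted, relaxed=False))
print(possible_values_at_send(posted, relaxed=True))
```

With strict placement only the value 2 is possible. Under the relaxed model both 1 and 2 are, which is the non-determinism in question.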

> > What about
> > 
> >  RDMA-W VA=0 Data=1
> >  SEND WITH INVALIDATE VA=0
> >  RDMA-W VA=0 Data=2
> > 
> > Spec says the second RDMA-W must fail, 
> Right.
> 
> > but it sounds like this relaxed ordering would allow it to happen.

> No. Table 76 is followed in this case.
> 1st operation Write.
> 2nd operation send.
> There is implicit fence defined by '#' in Table, which is followed.
> So 2nd RDMA-W continue to fail.

So, I expect what is happening here is that the SEND RCQ is delayed
until the sequence numbers catch up, e.g. guaranteeing that all
packets prior to the SEND have been seen and committed to memory.
Which is what table 76 is primarily talking about.

However, SEND WITH INVALIDATE is a special case that impacts the
processing of work itself, not just the CPU observation, which is a
bit outside what table 76 is talking about.

I'd advocate for allowing this to be out of order (but documented as
such), as implicitly fencing SEND WITH INVALIDATE is not acceptable for
performance and most workloads using that feature do not care about
this strict ordering.

The requirement is really that by the time the SEND RCQ is seen that
the INVALIDATE has taken effect.

Atomics are basically similar; it sounds like Atomic Op -> RDMA Read should
degrade to an F as well. I'd say that is desired as well.

The point is we want a definition for this feature that is broad
enough to allow future hardware optimization, and not just some
narrow definition that follows what mlx5 happened to implement.

> I completely understand and agree that storage protocols who depend
> on RDMA-CM would like to have this.  But again, unavailability of
> this bit in RDMA-CM is not a blocker for user land apps and let them
> start using it.  Once RDMA-CM has it, storage kernel ULPs can also
> be enabled.

If there is no path to get this into the RDMA CM then it is just
another vendor feature and it does not belong in the common API.

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                                                                                                     ` <20170722212706.GA14714-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-08-01 18:14                                                                                                       ` Parav Pandit
       [not found]                                                                                                         ` <VI1PR0502MB300871C35E1CC06EB378D6C8D1B30-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Parav Pandit @ 2017-08-01 18:14 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tom Talpey, Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

Hi Jason,

> -----Original Message-----
> From: Jason Gunthorpe [mailto:jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org]
> Sent: Saturday, July 22, 2017 4:27 PM
> To: Parav Pandit <parav-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Cc: Tom Talpey <tom-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>; Bart Van Assche
> <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>; leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org; dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org;
> linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Idan Burstein <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Subject: Re: [PATCH rdma-next 0/3] Support out of order data placement
> 
> On Sat, Jul 22, 2017 at 05:32:05PM +0000, Parav Pandit wrote:
> > >
> > > Eg
> > >
> > >  RDMA-W VA=0 Data=1
> > >  RDMA-R VA=0
> > >  RDMA-W VA=0 Data=2
> > >
> > > What does the read return? Spec says 1, but it sounds like this
> > > relaxed ordering could return 2.
> > >
> > Spec says Data=1 on RDMA-R only if Fence is set on read operation in Table-
> 76.
> > Otherwise duplicate read request after executing RDMA-W2 of Data=2 can
> return Data=2 on read request.
> 
> Erm, I should have written it like this
> 
>  Initial Condition VA=0 Data = 0
>  RDMA-W VA=0 Data=1
>  RDMA-R VA=0
> 
> Spec says 1 must be returned, but sounds like this relaxed version could return 0.
No. Table 76 stays as is, as described before.

> So RDMA Write -> RDMA Read degrades to a F
> 
> Similarly,
> 
>  RDMA-W VA=0 Data=1
>  RDMA-W VA=0 Data=2
>  SEND
> 
> Sounds like with the relaxed version the app could see 1 at SEND CQ time.
> 
> So RDMA-W -> RDMA-W degrades to a F
No. Table-76 is based on how the requester sees the execution.
So it stays as '#'.

> 
> > > What about
> > >
> > >  RDMA-W VA=0 Data=1
> > >  SEND WITH INVALIDATE VA=0
> > >  RDMA-W VA=0 Data=2
> > >
> > > Spec says the second RDMA-W must fail,
> > Right.
> >
> > > but it sounds like this relaxed ordering would allow it to happen.
> 
> > No. Table 76 is followed in this case.
> > 1st operation Write.
> > 2nd operation send.
> > There is implicit fence defined by '#' in Table, which is followed.
> > So 2nd RDMA-W continue to fail.
> 
> So, I expect what is happening here is that the SEND RCQ is delayed until the
> sequence numbers catch up, e.g. guaranteeing that all packets prior to the SEND
> have been seen and committed to memory.
> Which is what table 76 is primarily talking about.
> 
> However, SEND WITH INVALIDATE is a special case that impacts the processing
> of work itself, not just the CPU observation, which is a bit outside what table 76
> is talking about.
> 
SEND, SEND WITH IMM, and SEND WITH INVALIDATE all fall into the same category, Send, which is the first column in Table-76.

> I'd advocate for allowing this to be out of order (but documented as such), as
> implicitly fencing SEND WITH INVALIDATE is not acceptable for performance and
It is as per the first column of Table-76.

> most workloads using that feature do not care about this strict ordering.
> 
NVMe fabrics do care.
The NVMe fabrics target issues an RDMA-W, RDMA_S_INV sequence on the same memory key used by the RDMA-W, without waiting for the RDMA-W completion, for good reason.
I recall SMB doing the same as well.
RDMA_S_INV after RDMA-W cannot break the order.

> The requirement is really that by the time the SEND RCQ is seen that the
> INVALIDATE has taken effect.
> 
The current Table-76 requirement is already relaxed for
RDMA-R -> RDMA_S_INV.
However, most users won't issue that sequence because they would not want duplicate read requests to fail.
So let's continue with Table-76 as defined today for SEND as the 2nd operation (the first column stays as is).

> Atomic are basically similar, sounds like Atomic Op -> RDMA Read should
> degrade to a F as well. I'd say that is desired as well.

No. Table-76 stays as is.
Atomic->Atomic is already 'F'.
Atomic->RDMA_R continues as '#' (similar to RDMA_W->RDMA_R).

> > I completely understand and agree that storage protocols who depend on
> > RDMA-CM would like to have this.  But again, unavailability of this
> > bit in RDMA-CM is not a blocker for user land apps and let them start
> > using it.  Once RDMA-CM has it, storage kernel ULPs can also be
> > enabled.
> 
> If there is no path to get this into the RDMA CM then it is just another vendor
> feature and it does not belong in the common API.

RDMA CM is not the only connection manager in use today.
Once RDMA CM has it, it can be extended there as well.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                                                                                                         ` <VI1PR0502MB300871C35E1CC06EB378D6C8D1B30-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
@ 2017-08-01 19:00                                                                                                           ` Jason Gunthorpe
       [not found]                                                                                                             ` <20170801190052.GA31205-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Jason Gunthorpe @ 2017-08-01 19:00 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Tom Talpey, Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

On Tue, Aug 01, 2017 at 06:14:08PM +0000, Parav Pandit wrote:

> >  Initial Condition VA=0 Data = 0
> >  RDMA-W VA=0 Data=1
> >  RDMA-R VA=0
> > 
> > Spec says 1 must be returned, but sounds like this relaxed version could return 0.

> No. Table 76 stays as is as described before.

How is this possible?

> >  RDMA-W VA=0 Data=1
> >  RDMA-W VA=0 Data=2
> >  SEND
> > 
> > Sounds like with the relaxed version the app could see 1 at SEND CQ time.
> > 
> > So RDMA-W -> RDMA-W degrades to a F

> No. Table-76 is based on  how requester sees the execution.
> So it stays as '#'.

How is this possible?

You've clearly stated this feature allows out of order execution
across packet boundaries, there is no way to know at the responder
what the missed packets were, so inevitably, both of these cases
must be possible. Or you are wrong about the statement on out of
order.

Frankly, I still don't think you know what this feature actually does
and consequently cannot document it properly, which is not
acceptable for a common verbs feature.

> > However, SEND WITH INVALIDATE is a special cases that impacts the processing
> > of work itself, not just the CPU observation, which is a bit outside what table 76
> > is talking about.
 
> SEND, SEND WITH IMM, SEND WITH INVALIDATE falls in same category as send as first column in Table76.

Not really, I don't think you understand how this all fits together..

> > I'd advocate for allowing this to be out of order (but documented as such), as
> > implicitly fencing SEND WITH INVALIDATE is not acceptable for
> > performance and

> It is as per first column of Table-76.

Don't understand your remark, it is clearly ordered.

> > most workloads using that feature do not care about this strict ordering.
> > 
> nvme fabrics do care.
> nvme fabrics target does RDMA-W, RDMA_S_INV sequence on the same memory key that is being used in RDMA-W without waiting for RDMA-W completion for good reason.
> I recall SMB doing the same as well.
> RDMA-S_INV after RDMA-W cannot break the order.

They don't care, because RDMA_WRITE, SEND_WITH_INVALIDATE on the same
rkey does not try to write to the rkey memory twice, which is the only
case where adding out of order execution really matters.

They only care that the invalidate guarantees no DMA is possible once
it reaches the receiver's CQ.

> > The requirement is really that by the time the SEND RCQ is seen that the
> > INVALIDATE has taken effect.
> > 
> Current Table-76 requirement already relaxes for 
> RDMA-R-> RDMA_S_INV.
> However most users won't do above sequence because users would not like to fail duplicate read requests.
> So let's continue with Table-76 for SEND as 2nd operation as defined today. (first column stays as is)

As I said, table 76 does not really capture the full behavior of
INVALIDATE.

The spec requires WRITE,INV,WRITE to fail, but it would be just fine
for storage protocols if WRITE,INV,WRITE could succeed, so long as
delivering the INV to the CQ fences the DMA, which can be done in a
high performance way. Fencing the WRITE,INV,WRITE can not be done
with high performance.

> > Atomic are basically similar, sounds like Atomic Op -> RDMA Read should
> > degrade to a F as well. I'd say that is desired as well.
> 
> No. Table-76 stays as is.
> Atomic->Atomic is already 'F'.
> Atomic->RDMA_R is continues as '#'. (Similar to RDMA_W->RDMA_R).

Same argument as above, many apps will tolerate out of order for
atomics, the default for an out of order mode should be to allow it,
and let apps request in order with fence.

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                                                                                                             ` <20170801190052.GA31205-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-08-01 22:06                                                                                                               ` Parav Pandit
       [not found]                                                                                                                 ` <VI1PR0502MB3008D65E9568764C4A781230D1B30-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Parav Pandit @ 2017-08-01 22:06 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tom Talpey, Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein



> -----Original Message-----
> From: Jason Gunthorpe [mailto:jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org]
> Sent: Tuesday, August 01, 2017 2:01 PM
> To: Parav Pandit <parav-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Cc: Tom Talpey <tom-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>; Bart Van Assche
> <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>; leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org; dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org;
> linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Idan Burstein <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Subject: Re: [PATCH rdma-next 0/3] Support out of order data placement
> 
> On Tue, Aug 01, 2017 at 06:14:08PM +0000, Parav Pandit wrote:
> 
> > >  Initial Condition VA=0 Data = 0
> > >  RDMA-W VA=0 Data=1
> > >  RDMA-R VA=0
> > >
> > > Spec says 1 must be returned, but sounds like this relaxed version could
> return 0.
> 
> > No. Table 76 stays as is as described before.
> 
> How is this possible?
I am not sure what more I can explain to you, Jason.
The requester-side HCA follows Table-76.
Incoming read responses are not processed until previous writes are ACKed, either implicitly (in read responses) or explicitly by ACK packets.
This is the same as already described in the spec. No extra description is needed for this patchset.

> 
> > >  RDMA-W VA=0 Data=1
> > >  RDMA-W VA=0 Data=2
> > >  SEND
> > >
> > > Sounds like with the relaxed version the app could see 1 at SEND CQ time.
> > >
> > > So RDMA-W -> RDMA-W degrades to a F
> 
> > No. Table-76 is based on  how requester sees the execution.
> > So it stays as '#'.
> 
> How is this possible?
> 
Please don't mix requester side ordering with responder side execution.
C9-28 on the responder side is relaxed, as explained a few times before.

> You've clearly stated this feature allows out of order execution across packet
> boundaries, there is no way to know at the responder what the missed packets
> were, so inevitably, both of these cases must be possible. Or you are wrong
> about the statement on out of order.
Two examples of out-of-order execution have already been explained.
It appears to me that you are confusing requester-side and responder-side execution.
What more can I explain other than repeating that
(a) Table 76 on the requester side stays as is, and
(b) C9-28 is relaxed on the responder side.

> > > However, SEND WITH INVALIDATE is a special cases that impacts the
> > > processing of work itself, not just the CPU observation, which is a
> > > bit outside what table 76 is talking about.
> 
> > SEND, SEND WITH IMM, SEND WITH INVALIDATE falls in same category as
> send as first column in Table76.
> 
> Not really, I don't think you understand how this all fits together..
> 

> > > I'd advocate for allowing this to be out of order (but documented as
> > > such), as impliclty fencing SEND WITH INVALIDATE is not acceptable
> > > for performance and
> 
> > It is as per first column of Table-76.
> 
> Don't understand your remark, it is clearly ordered.
> 
I was saying that there is no change to send ordering at the requester side.

> > > most workloads using that feature do not care about this strict ordering.
> > >
> > nvme fabrics do care.
> > nvme fabrics target does RDMA-W, RDMA_S_INV sequence on the same
> memory key that is being used in RDMA-W without waiting for RDMA-W
> completion for good reason.
> > I recall SMB doing the same as well.
> > RDMA-S_INV after RDMA-W cannot break the order.
> 
> They don't care, because RDMA_WRITE, SEND_WITH_INVALIDATE on the same
> rkey does not try to write to the rkey memory twice, which is the only case
> where adding out of order execution really matters.
> 
It doesn't have to be an overlapping write to the same rkey.
One block IO can translate to multiple RDMA-Ws from the target side, potentially to the same rkey.
One possibility is that the target code ran out of local SGEs or had fragmented memory.
So:
RDMA-W1 (key=A, VA=0x1000, len=16K, with 4 SGEs)
RDMA-W2 (key=A, VA=0x5000, len=16K, with 4 SGEs)
Send(Invalidate_key=A)
You do not want SEND_INVALIDATE to stop the DMA of RDMA-W1 at a later point.
So the Send CQE cannot arrive before the previous RDMA-W1 and RDMA-W2 are completed.
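
To make the shape of that example concrete, here is a minimal Python sketch. The addresses and lengths come from the example above (16K taken as 0x4000 bytes); the helper names `overlaps`, `place` and `order_independent` are made up for the illustration. It only checks that the two writes are disjoint and that their final placement does not depend on arrival order. It does not model the invalidate itself.

```python
from itertools import permutations

# (name, start VA, length) from the example above; 16K = 0x4000 bytes.
WRITES = [("W1", 0x1000, 0x4000), ("W2", 0x5000, 0x4000)]


def overlaps(a, b):
    """True if two (name, va, length) writes touch a common byte."""
    return a[1] < b[1] + b[2] and b[1] < a[1] + a[2]


def place(order):
    """Apply writes in the given placement order; return byte -> writer name."""
    memory = {}
    for name, va, length in order:
        for addr in range(va, va + length):
            memory[addr] = name
    return memory


def order_independent(writes):
    """True if every placement order leaves memory in the same final state."""
    results = [place(order) for order in permutations(writes)]
    return all(r == results[0] for r in results)


print(overlaps(WRITES[0], WRITES[1]))
print(order_independent(WRITES))
```

Because the ranges are disjoint, reordering W1 and W2 is harmless in this model; the remaining constraint is the one stated above, that the invalidate must not take effect before both writes have been placed.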

> They only care that the invalidate guarantees no DMA is possible once it reaches
> the receiver's CQ.
> 
> > > The requirement is really that by the time the SEND RCQ is seen that
> > > the INVALIDATE has taken effect.
> > >
> > Current Table-76 requirement already relaxes for
> > RDMA-R-> RDMA_S_INV.
> > However most users won't do above sequence because users would not like to
> fail duplicate read requests.
> > So let's continue with Table-76 for SEND as 2nd operation as defined
> > today. (first column stays as is)
> 
> As I said, table 76 does not really capture the full behavior of INVALIDATE.
> 
It covers only requester side.
Send with invalidate execution on responder side is described in 9.4.1.1.1

> The spec requires WRITE,INV,WRITE to fail, but it would be just fine for
> storage protocols if WRITE,INV,WRITE could succeed, so long as delivering the
> INV to the CQ fences the DMA, which can be done in a high performance way.
> Fencing the WRITE,INV,WRITE can not be done with high performance.
> 
You are proposing a different behavior and attribute, which may be done for an HCA that supports such a thing.
Please submit a different patch for it whenever it's appropriate.
The current query HCA attribute is a bit field to allow for future relaxation. Maybe what you described can be done.

> > > Atomic are basically similar, sounds like Atomic Op -> RDMA Read
> > > should degrade to a F as well. I'd say that is desired as well.
> >
> > No. Table-76 stays as is.
> > Atomic->Atomic is already 'F'.
> > Atomic->RDMA_R continues as '#'. (Similar to RDMA_W->RDMA_R).
> 
> Same argument as above, many apps will tolerate out of order for atomics, the
> default for an out of order mode should be to allow it, and let apps request in
> order with fence.
>
The following are already 'F':
Atomic->Atomic
Atomic->Write 
Read->Atomic

Other out of order atomics such as
Atomic->Read
Write->Atomic may be done in future under different attribute.
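
For reference, the entries named in this exchange can be written down as a small lookup. This is only a sketch in Python: it covers just the pairs mentioned in this thread, not the whole of Table 76, and it reads 'F' as "ordered only if the second WR sets the Fence indicator" and '#' as "always ordered". The names TABLE_76_ENTRIES and needs_fence are made up for illustration.

```python
# (first op, second op) -> entry, limited to the pairs named in this thread.
TABLE_76_ENTRIES = {
    ("atomic", "atomic"): "F",
    ("atomic", "write"): "F",
    ("read", "atomic"): "F",
    ("atomic", "read"): "#",
    ("write", "read"): "#",
}


def needs_fence(first, second):
    """Return True if ordering of `second` after `first` requires a fence."""
    try:
        entry = TABLE_76_ENTRIES[(first, second)]
    except KeyError:
        # Deliberately not guessing entries the thread does not mention.
        raise KeyError(f"{first}->{second} is not covered by this sketch")
    return entry == "F"


print(needs_fence("atomic", "atomic"))  # True
print(needs_fence("atomic", "read"))    # False
```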

Jason,
One single attribute is not a solution to all out-of-order needs.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                                                                                                                 ` <VI1PR0502MB3008D65E9568764C4A781230D1B30-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
@ 2017-08-01 23:37                                                                                                                   ` Jason Gunthorpe
       [not found]                                                                                                                     ` <20170801233754.GB10239-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Jason Gunthorpe @ 2017-08-01 23:37 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Tom Talpey, Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

On Tue, Aug 01, 2017 at 10:06:14PM +0000, Parav Pandit wrote:

> > > >  Initial Condition VA=0 Data = 0
> > > >  RDMA-W VA=0 Data=1
> > > >  RDMA-R VA=0
> > > >
> > > > Spec says 1 must be returned, but sounds like this relaxed version
> > > > could return 0.
> > 
> > > No. Table 76 stays as is as described before.
> > 
> > How is this possible?

> I am not sure what more I can explain to you, Jason.
> Requester side HCA follows HCA Table-76.
> Incoming read responses are not processed until previous writes are
> ACKed implicitly (in read responses) or explicitly by ACK packets.
> Same as before described in spec. No extra description needed for
> this patchset.

But doing that pretty much destroys much of the entire point of having
a relaxed ordering :P

> > > >  RDMA-W VA=0 Data=1
> > > >  RDMA-W VA=0 Data=2
> > > >  SEND
> > > >
> > > > Sounds like with the relaxed version the app could see 1 at SEND CQ time.
> > > >
> > > > So RDMA-W -> RDMA-W degrades to a F
> > 
> > > No. Table-76 is based on  how requester sees the execution.
> > > So it stays as '#'.
> > 
> > How is this possible?
> > 
> Please don't mix requester side ordering with responder side execution.
> C9-28 on responder side is relaxed - as explained few times before.

No, I see what you are trying to say now. I disagree with
this. Table-76 and C9-28 are describing the same thing, you cannot
weaken C9-28 without also restating Table 76.

Table 76 is clearly talking about the entire system, including the
execution and memory subsystem of the completer.
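
The W, W, SEND case quoted above can be sketched as a toy model. This is Python, purely illustrative, and the function name is hypothetical. It assumes "out of order placement" means the two writes may land in either order, and it reports what the completer's memory at VA=0 could hold once both are placed and the SEND completion is seen:

```python
from itertools import permutations


def values_visible_at_send(writes, in_order):
    """Return the sorted possible values at VA=0 after all writes are placed.

    `writes` is a list of (va, data) tuples in posting order.
    """
    orders = [tuple(writes)] if in_order else permutations(writes)
    outcomes = set()
    for order in orders:
        memory = {0: 0}              # Initial Condition VA=0 Data=0
        for va, data in order:
            memory[va] = data        # last write placed wins
        outcomes.add(memory[0])
    return sorted(outcomes)


writes = [(0, 1), (0, 2)]            # RDMA-W VA=0 Data=1, RDMA-W VA=0 Data=2
print(values_visible_at_send(writes, in_order=True))    # [2]
print(values_visible_at_send(writes, in_order=False))   # [1, 2]
```

With in-order execution only 2 is possible; once placement order is free, 1 becomes a legal outcome at SEND CQ time, which is the end-to-end difference being argued about.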

> It covers only requester side.
> Send with invalidate execution on responder side is described in 9.4.1.1.1

I suppose 9.4.1.1.1 point #1 already allows the out of order behavior.

> You are proposing a different behavior and attribute which may be
> done for a HCA that support such thing.  Please submit a different
> patch for it whenever its appropriate.  Current query HCA attribute
> is bit field for future relaxation. May be what you described can be
> done.

Since you guys are hell bent on avoiding the IBTA for your new verbs
features, you do have to define them in a sensible and usefully broad
way using the community process.

> Other out of order atomics such as
> Atomic->Read
> Write->Atomic may be done in future under different attribute.

I think that is a mistake, you should start with them being out of
order and require the app to fence to bring order back, even if the
current HCAs execute them in order anyhow.

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                                                                                                                     ` <20170801233754.GB10239-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-08-02  0:30                                                                                                                       ` Parav Pandit
       [not found]                                                                                                                         ` <VI1PR0502MB300860BDA4E6BD0B10F5DB1ED1B00-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Parav Pandit @ 2017-08-02  0:30 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tom Talpey, Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein



> -----Original Message-----
> From: Jason Gunthorpe [mailto:jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org]
> Sent: Tuesday, August 01, 2017 6:38 PM
> To: Parav Pandit <parav-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Cc: Tom Talpey <tom-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>; Bart Van Assche
> <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>; leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org; dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org;
> linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Idan Burstein <idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Subject: Re: [PATCH rdma-next 0/3] Support out of order data placement
> 
> On Tue, Aug 01, 2017 at 10:06:14PM +0000, Parav Pandit wrote:
> 
> > > > >  Initial Condition VA=0 Data = 0
> > > > >  RDMA-W VA=0 Data=1
> > > > >  RDMA-R VA=0
> > > > >
> > > > > Spec says 1 must be returned, but sounds like this relaxed
> > > > > version could return 0.
> > >
> > > > No. Table 76 stays as is as described before.
> > >
> > > How is this possible?
> 
> > I am not sure what more I can explain to you, Jason.
> > Requester side HCA follows HCA Table-76.
> > Incoming read responses are not processed until previous writes are
> > ACKed implicitly (in read responses) or explicitly by ACK packets.
> > Same as before described in spec. No extra description needed for this
> > patchset.
> 
> But doing that pretty much destroys much of the entire point of having a relaxed
> ordering :P
>
Probably not. I can understand that having that would possibly be the ultimate thing.
I would like to add that too whenever it's available.
Some applications are heavily write- or read-driven rather than mixed - those benefit from this attribute.
 
> > > > >  RDMA-W VA=0 Data=1
> > > > >  RDMA-W VA=0 Data=2
> > > > >  SEND
> > > > >
> > > > > Sounds like with the relaxed version the app could see 1 at SEND CQ time.
> > > > >
> > > > > So RDMA-W -> RDMA-W degrades to a F
> > >
> > > > No. Table-76 is based on  how requester sees the execution.
> > > > So it stays as '#'.
> > >
> > > How is this possible?
> > >
> > Please don't mix requester side ordering with responder side execution.
> > C9-28 on responder side is relaxed - as explained few times before.
> 
> No, I see what you are trying to say now.
Great.

> I disagree with this. Table-76 and C9-28
> are describing the same thing, you cannot weaken C9-28 without also restating
> Table 76.
>
The intent here is not to extend the definition of the fence bit beyond RDMA reads.
What you are asking is that, when the ooo attribute is set and the user still wants in-order RDMA Writes for W1 and W2, the fence bit should be extended to cover it.
There are likely very few use cases where some operations need to follow ordering and others don't within a single QP.
A user is better off not setting this attribute on a QP when it needs W1, W2 ordering.

> Table 76 is clearly talking about the entire system, including the execution and
> memory subsystem of the completer.
> 
> > It covers only requester side.
> > Send with invalidate execution on responder side is described in
> > 9.4.1.1.1
> 
> I suppose 9.4.1.1.1 point #1 already allows the out of order behavior.
>
It allows it, as indicated below.
(a) The paragraph above #1. Snippet below.
"Since the invalidation operation is not executed
by the transport layer, the Invalidate operation may take place either
before or after the transport-level acknowledge has been generated"

So it can still send out the ACK while the invalidation is in progress (or not yet started).
While that is in progress, a new operation can still target the region being invalidated.
Depending on how slowly the invalidation proceeds, a subsequent operation may DMA as well.
To my knowledge most good adapters won't send out the ACK before invalidating, even though the spec allows it,
because doing so is very hard to debug and leaves a hole open for accidental memory corruption.

Also 9.4.1.1.1 is for subsequent operations where Send_Inv is second operation.

The example of 
RDMA-W1, RDMA-W2, Send_Invalidate actually follows section 9.4.1.1.1 point #2.

> > You are proposing a different behavior and attribute which may be done
> > for a HCA that support such thing.  Please submit a different patch
> > for it whenever its appropriate.  Current query HCA attribute is bit
> > field for future relaxation. May be what you described can be done.
> 
> you do have to define them in a sensible and usefully broad way using the
> community process.
>
Sure. That's why we are discussing here.
 
> > Other out of order atomics such as
> > Atomic->Read
> > Write->Atomic may be done in future under different attribute.
> 
> I think that is a mistake, you should start with them being out of order and
> require the app to fence to bring order back, even if the current HCAs execute
> them in order anyhow.
> 
I would agree for the Atomic->Read case because it's very similar to Read->Read.
Write->Atomic goes back to the first point, as atomic completions can trigger implicit Write completions.
So I will keep Write->Atomic out of this attribute.
Let me check with Idan what he thinks about applying the fence bit to Atomic->Read.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH rdma-next 0/3] Support out of order data placement
       [not found]                                                                                                                         ` <VI1PR0502MB300860BDA4E6BD0B10F5DB1ED1B00-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
@ 2017-08-02  2:48                                                                                                                           ` Jason Gunthorpe
  0 siblings, 0 replies; 71+ messages in thread
From: Jason Gunthorpe @ 2017-08-02  2:48 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Tom Talpey, Bart Van Assche, leon-DgEjT+Ai2ygdnm+yROfE0A,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Idan Burstein

On Wed, Aug 02, 2017 at 12:30:33AM +0000, Parav Pandit wrote:

> > But doing that pretty much destroys much of the entire point of having a relaxed
> > ordering :P

> Probably not. I can understand that having that would possibly be the ultimate thing.
> I would like to add that too whenever it's available.
> Some applications are heavily write- or read-driven rather than mixed - those benefit from this attribute.

My point is, defining a 'relaxed ordering' feature such that as much
as is sensible is allowed to be reordered allows the hardware to catch up
to prepared software.

The work to update the standards and CM is going to be large enough to
not want to do it more than once.

So you are better off coming up with something broad and well described
that can be 'grown into'.

> > > > > >  RDMA-W VA=0 Data=1
> > > > > >  RDMA-W VA=0 Data=2
> > > > > >  SEND
> > > > > >
> > > > > > Sounds like with the relaxed version the app could see 1 at SEND CQ time.
> > > > > >
> > > > > > So RDMA-W -> RDMA-W degrades to a F
> > > >
> > > > > No. Table-76 is based on  how requester sees the execution.
> > > > > So it stays as '#'.
> > > >
> > > > How is this possible?
> > > >
> > > Please don't mix requester side ordering with responder side execution.
> > > C9-28 on responder side is relaxed - as explained few times before.
> > 
> > No, I see what you are trying to say now.
> Great.
> 
> > I disagree with this. Table-76 and C9-28
> > are describing the same thing, you cannot weaken C9-28 without also restating
> > Table 76.

> The intent is to not extend the definition of fence bit beyond RDMA
> reads here.

I am not talking about fence, I am saying, when the OOO attribute is set
some of the '#' must become 'F'  because the entire end-to-end system
no longer guarantees strict in order execution.

> What you are asking is when ooo attribute is set, and if user still
> wants to do in-order RDMA Writes for W1 and W2, fence bit should be
> extended for it.

Yes, I am also suggesting this should be true. The fence bit would
have to cause the sender to stop sending until acks are seen.

> There can be very few use cases where certain operations needs to
> follow ordering and certain don't in a single QP.  User is rather
> better off not setting this attribute on a QP when it needs W1, W2
> ordering.

Depends on use case, an app might do well to track outstanding ops by
address and only set fence if there is a collision, for instance.
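
A minimal sketch of that idea, in Python for brevity. FenceTracker and its methods are hypothetical helpers, not part of any verbs API, and the sketch assumes the extended fence semantics discussed in this thread. It remembers the address range of every outstanding op and asks for a fence only when a new op overlaps one of them:

```python
class FenceTracker:
    """Hypothetical helper: tracks outstanding [addr, addr + length) ranges."""

    def __init__(self):
        self._outstanding = {}  # wr_id -> (start, end)

    def post(self, wr_id, addr, length):
        """Record a new op; return True if it should be posted with a fence."""
        start, end = addr, addr + length
        # Two half-open ranges overlap when each starts before the other ends.
        collides = any(start < e and s < end
                       for s, e in self._outstanding.values())
        self._outstanding[wr_id] = (start, end)
        return collides

    def complete(self, wr_id):
        """Forget an op once its completion has been seen."""
        self._outstanding.pop(wr_id, None)


t = FenceTracker()
print(t.post(1, 0x1000, 0x4000))  # False: nothing outstanding
print(t.post(2, 0x5000, 0x4000))  # False: adjacent, no overlap
print(t.post(3, 0x4000, 0x2000))  # True: overlaps both earlier writes
```

A linear scan is fine for a handful of outstanding ops; an app with deep queues would presumably want an interval tree or similar instead.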

> So it can still send out the ACK while invalidation in progress (not
> yet started).

Not only an ack but it can begin processing another RDMA WRITE while
invalidation is ongoing, which is compatible with the idea that RDMA
WRITE can begin processing before seeing prior packets.

In essence, there is another set of entries in table 76 that discuss
the INVALIDATE behavior and they are all 'F' with today's standard.

> I would agree for Atomic->read case because it's very similar to READ->Read.
> Write->Atomic goes back to first point as atomic completions can trigger implicit Write completions.
> So Write->atomic I will keep out of this attribute.
> Let me check with Idan for applying fence bit on Atomic->read, what he thinks about it.

Atomics do not cause completions at the responder, so I'm not sure
what you are talking about here.

It is never, ever, OK to issue completions at the initiator out of
order, that would totally break every ULP we have.

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

end of thread, other threads:[~2017-08-02  2:48 UTC | newest]

Thread overview: 71+ messages (download: mbox.gz / follow: Atom feed)
2017-06-12  6:49 [PATCH rdma-next 0/3] Support out of order data placement Leon Romanovsky
     [not found] ` <20170612064918.12510-1-leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2017-06-12  6:49   ` [PATCH rdma-next 1/3] IB/core: Expose out of order data placement capability Leon Romanovsky
2017-06-12  6:49   ` [PATCH rdma-next 2/3] IB/uverbs: Enable user space programs to use out of order placement Leon Romanovsky
2017-06-12  6:49   ` [PATCH rdma-next 3/3] IB/mlx5: Support out of order data placement Leon Romanovsky
2017-06-12 15:28   ` [PATCH rdma-next 0/3] " Bart Van Assche
     [not found]     ` <1497281280.2770.1.camel-Sjgp3cTcYWE@public.gmane.org>
2017-06-12 16:19       ` Parav Pandit
     [not found]         ` <VI1PR0502MB3008478FC7C70D1F398FE2B2D1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
2017-06-12 16:29           ` Bart Van Assche
     [not found]             ` <1497284956.2770.8.camel-Sjgp3cTcYWE@public.gmane.org>
2017-06-12 16:51               ` Parav Pandit
2017-06-12 16:29           ` Jason Gunthorpe
     [not found]             ` <20170612162917.GA11993-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-06-12 16:42               ` Parav Pandit
     [not found]                 ` <VI1PR0502MB3008EA451DA9ECEBECE27362D1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
2017-06-12 16:43                   ` Jason Gunthorpe
     [not found]                     ` <20170612164343.GA12435-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-06-12 16:53                       ` Parav Pandit
     [not found]                         ` <VI1PR0502MB300831A1560531E67B29589DD1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
2017-06-12 16:55                           ` Jason Gunthorpe
     [not found]                             ` <20170612165536.GB12435-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-06-12 17:11                               ` Parav Pandit
     [not found]                                 ` <VI1PR0502MB30089EDC828A142338B1EE06D1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
2017-06-12 17:14                                   ` Jason Gunthorpe
     [not found]                                     ` <20170612171436.GA12739-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-06-12 17:28                                       ` Parav Pandit
     [not found]                                         ` <VI1PR0502MB3008304F8465C19ACCFF3D68D1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
2017-06-12 17:32                                           ` Jason Gunthorpe
     [not found]                                             ` <20170612173221.GA13302-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-06-12 17:46                                               ` Parav Pandit
     [not found]                                                 ` <VI1PR0502MB30081FD74492E97043F45BD8D1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
2017-06-12 17:51                                                   ` Jason Gunthorpe
2017-06-12 21:09                               ` Tom Talpey
     [not found]                                 ` <6747e257-67b0-a364-be21-04f73ef82ffe-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>
2017-06-12 21:32                                   ` Parav Pandit
     [not found]                                     ` <VI1PR0502MB30080B672B80836FF0A328DCD1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
2017-06-12 21:41                                       ` Jason Gunthorpe
     [not found]                                         ` <20170612214135.GB30578-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-06-12 21:59                                           ` Parav Pandit
     [not found]                                             ` <VI1PR0502MB30087762738A4E02A1DA24D0D1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
2017-06-12 22:07                                               ` Jason Gunthorpe
     [not found]                                                 ` <20170612220730.GA32510-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-06-12 22:16                                                   ` Parav Pandit
     [not found]                                                     ` <VI1PR0502MB30088258D7BADC83B0609F38D1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
2017-06-12 22:21                                                       ` Jason Gunthorpe
2017-06-12 22:19                                       ` Tom Talpey
     [not found]                                         ` <475e1873-e842-ecb9-d260-34777da57e51-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>
2017-06-12 22:54                                           ` Parav Pandit
     [not found]                                             ` <VI1PR0502MB3008B7FE60CAB3BD49907A1BD1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
2017-06-12 23:44                                               ` Tom Talpey
     [not found]                                                 ` <d3a436a0-9ace-b11a-2e4c-387fc575877e-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>
2017-06-12 23:59                                                   ` Parav Pandit
     [not found]                                                     ` <VI1PR0502MB3008E04612EED6BC83F37115D1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
2017-06-13  0:11                                                       ` Tom Talpey
     [not found]                                                         ` <fbdcf05b-ccd8-bd9c-c9c8-86f373303250-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>
2017-06-13  0:36                                                           ` Parav Pandit
     [not found]                                                             ` <VI1PR0502MB30089271EB542493AA58060CD1C20-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
2017-06-13  1:30                                                               ` Tom Talpey
     [not found]                                                                 ` <fb11f261-b80b-f71a-8076-204706267798-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>
2017-06-13 19:17                                                                   ` Jason Gunthorpe
2017-06-23 16:03                                                                   ` Parav Pandit
2017-07-19  5:33                                                                   ` Parav Pandit
     [not found]                                                                     ` <VI1PR0502MB3008D488FEE8A7104A7B0A7CD1A60-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
2017-07-19 17:12                                                                       ` Jason Gunthorpe
     [not found]                                                                         ` <20170719171211.GB25714-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-07-20 15:06                                                                           ` Parav Pandit
2017-07-22  2:29                                                                       ` Tom Talpey
     [not found]                                                                         ` <142c2fed-baa5-1295-1458-be765c94b957-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>
2017-07-22  4:50                                                                           ` Parav Pandit
     [not found]                                                                             ` <VI1PR0502MB300886ED1B5B1363B55F53F0D1A50-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
2017-07-22  5:03                                                                               ` Tom Talpey
     [not found]                                                                                 ` <e5cee768-586c-516a-6056-ea4a44f89134-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>
2017-07-22  5:32                                                                                   ` Parav Pandit
     [not found]                                                                                     ` <VI1PR0502MB30089915EB792CD20CA9D179D1A50-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
2017-07-22 14:51                                                                                       ` Tom Talpey
     [not found]                                                                                         ` <4444a96a-c1e6-ad33-204a-680982e19bfe-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>
2017-07-22 15:11                                                                                           ` Parav Pandit
2017-07-22 16:09                                                                                           ` Jason Gunthorpe
     [not found]                                                                                             ` <20170722160939.GA30007-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-07-22 17:32                                                                                               ` Parav Pandit
     [not found]                                                                                                 ` <VI1PR0502MB3008E483BB36F944C2E212ACD1A50-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
2017-07-22 21:27                                                                                                   ` Jason Gunthorpe
     [not found]                                                                                                     ` <20170722212706.GA14714-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-08-01 18:14                                                                                                       ` Parav Pandit
     [not found]                                                                                                         ` <VI1PR0502MB300871C35E1CC06EB378D6C8D1B30-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
2017-08-01 19:00                                                                                                           ` Jason Gunthorpe
     [not found]                                                                                                             ` <20170801190052.GA31205-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-08-01 22:06                                                                                                               ` Parav Pandit
     [not found]                                                                                                                 ` <VI1PR0502MB3008D65E9568764C4A781230D1B30-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
2017-08-01 23:37                                                                                                                   ` Jason Gunthorpe
     [not found]                                                                                                                     ` <20170801233754.GB10239-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-08-02  0:30                                                                                                                       ` Parav Pandit
     [not found]                                                                                                                         ` <VI1PR0502MB300860BDA4E6BD0B10F5DB1ED1B00-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
2017-08-02  2:48                                                                                                                           ` Jason Gunthorpe
2017-06-27  9:47                               ` Sagi Grimberg
2017-06-12 17:18   ` Steve Wise
2017-06-12 17:37     ` Parav Pandit
     [not found]       ` <VI1PR0502MB3008CE85A4F274886B74DF2BD1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
2017-06-12 19:06         ` Dennis Dalessandro
     [not found]           ` <3fa7a4b5-5c19-8c6a-d78b-93219a9be888-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
2017-06-12 19:19             ` Hefty, Sean
     [not found]               ` <1828884A29C6694DAF28B7E6B8A82373AB142A9B-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2017-06-12 20:14                 ` Parav Pandit
     [not found]                   ` <VI1PR0502MB300885A1DD676E1649CDB268D1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
2017-06-12 20:35                     ` Dennis Dalessandro
     [not found]                       ` <b5279c09-027f-e374-ffde-7f236c52322c-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
2017-06-12 20:46                         ` Hefty, Sean
     [not found]                           ` <1828884A29C6694DAF28B7E6B8A82373AB142BEC-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2017-06-12 20:57                             ` Steve Wise
2017-06-12 21:02                               ` Jason Gunthorpe
     [not found]                                 ` <20170612210259.GA25652-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-06-12 21:18                                   ` Steve Wise
2017-06-12 21:22                                     ` Jason Gunthorpe
2017-06-12 21:53                                     ` Parav Pandit
     [not found]                                       ` <VI1PR0502MB30082486BC3B1669FE48764ED1CD0-o1MPJYiShExKsLr+rGaxW8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
2017-06-12 21:57                                         ` Jason Gunthorpe
     [not found]                                           ` <20170612215741.GA31693-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-06-12 22:00                                             ` Parav Pandit
2017-06-13  5:29                                   ` Leon Romanovsky
2017-06-12 20:41                     ` Hefty, Sean
     [not found]                       ` <1828884A29C6694DAF28B7E6B8A82373AB142BC8-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2017-06-12 21:25                         ` Parav Pandit
