All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH V1 libibverbs 0/8] Completion timestamping
@ 2016-02-24  9:41 Yishai Hadas
       [not found] ` <1456306924-31298-1-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Yishai Hadas @ 2016-02-24  9:41 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	yishaih-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w,
	majd-VPRAkNaXOzVWk0Htik3J/w, talal-VPRAkNaXOzVWk0Htik3J/w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w

Hi Doug,

This series from Matan adds completion timestamp for libibverbs.
The kernel part was already accepted.

We made some extra review on V0 and addressed some issues,
details below.

The series was retested successfully with mlx4 driver (lib, kernel)
and can be accessed also from my openfabrics GIT at:
git://openfabrics.org/~yishaih/libibverbs.git
branch: for-upstream

This timestamp series is sent on top of QP creation flags patch
that is pending your merge.

If may be helpful, it exists also on top of the master branch 
from my public GIT above, branch: ts.

Yishai
------

In order to do so, we add an extensible poll cq. The problem with
extending the WC is that you could run out of the current cache
line when adding new features and degrade performance. This is solved
by introducing a custom WC.

The user creates a CQ using ibv_create_cq_ex, stating which WC fields
should be returned by this CQ. When the user calls ibv_poll_cq_ex,
this custom WC is returned. The fields orders and sizes are declared
in advanced (we avoid alignment rules by putting the fields starting
from the 64bit fields --> 8bit fields). Each WC has a wc_flags field
representing which fields are valid in this WC.
The vendor drivers could optimize those calls extensively.

Completion timestamp is added on top of these extended ibv_create_cq_ex
verb and ibv_poll_cq_ex verb. The user should call ibv_create_cq_ex
stating that this CQ should support reporting completion timestamp.
ibv_poll_cq_ex reports this raw completion timestamp value in every
packet.

The timestamp mask (number of supported bits) and the HCA's frequency
are given in ibv_query_device_ex verb.

We also give the user an ability to read the HCA's current clock.
This is done via ibv_query_rt_values_ex. This verb could be extended
in the future for other interesting information.

Changes from V0:
- Split the series to small logical patches.
- Align naming in some places to match other verbs.
- Fix and improve the man pages.
- Add an example code as part of rc_pingpong.

Matan Barak (7):
  Add ibv_poll_cq_ex verb
  Add timestamp_mask and hca_core_clock to ibv_query_device_ex
  Add support for extended ibv_create_cq
  Add completion timestamp support for ibv_poll_cq_ex
  Add helper functions to work with the extended WC
  Add ibv_query_rt_values_ex
  Man pages for time stamping support

Yishai Hadas (1):
  Add timestamp support in rc_pingpong

 Makefile.am                   |   7 +-
 examples/devinfo.c            |  10 ++
 examples/rc_pingpong.c        | 130 ++++++++++++++---
 include/infiniband/compiler.h |  91 ++++++++++++
 include/infiniband/driver.h   |   9 ++
 include/infiniband/kern-abi.h |  22 +++
 include/infiniband/verbs.h    | 318 ++++++++++++++++++++++++++++++++++++++++++
 man/ibv_create_cq_ex.3        |  75 ++++++++++
 man/ibv_poll_cq_ex.3          | 181 ++++++++++++++++++++++++
 man/ibv_query_device_ex.3     |   6 +-
 man/ibv_query_rt_values_ex.3  |  50 +++++++
 src/cmd.c                     |  63 +++++++++
 src/device.c                  |  44 ++++++
 src/ibverbs.h                 |   5 +
 src/libibverbs.map            |   1 +
 15 files changed, 992 insertions(+), 20 deletions(-)
 create mode 100644 include/infiniband/compiler.h
 create mode 100644 man/ibv_create_cq_ex.3
 create mode 100644 man/ibv_poll_cq_ex.3
 create mode 100644 man/ibv_query_rt_values_ex.3

-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH V1 libibverbs 1/8] Add ibv_poll_cq_ex verb
       [not found] ` <1456306924-31298-1-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2016-02-24  9:41   ` Yishai Hadas
       [not found]     ` <1456306924-31298-2-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2016-02-24  9:41   ` [PATCH V1 libibverbs 2/8] Add timestamp_mask and hca_core_clock to ibv_query_device_ex Yishai Hadas
                     ` (6 subsequent siblings)
  7 siblings, 1 reply; 23+ messages in thread
From: Yishai Hadas @ 2016-02-24  9:41 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	yishaih-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w,
	majd-VPRAkNaXOzVWk0Htik3J/w, talal-VPRAkNaXOzVWk0Htik3J/w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w

From: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

This is an extension verb for ibv_poll_cq. It allows the user to poll
the cq for specific wc fields only, while allowing to extend the wc.
The verb calls the provider in order to fill the WC with the required
information.

Signed-off-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Reviewed-by: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 include/infiniband/verbs.h | 107 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 107 insertions(+)

diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h
index 6451d0f..99d6265 100644
--- a/include/infiniband/verbs.h
+++ b/include/infiniband/verbs.h
@@ -386,6 +386,79 @@ struct ibv_wc {
 	uint8_t			dlid_path_bits;
 };
 
+enum ibv_wc_flags_ex {
+	IBV_WC_EX_GRH			= 1 << 0,
+	IBV_WC_EX_IMM			= 1 << 1,
+	IBV_WC_EX_CSUM_OK		= 1 << 2,
+	IBV_WC_EX_INV			= 1 << 3,
+	IBV_WC_EX_WITH_BYTE_LEN		= 1 << 4,
+	IBV_WC_EX_WITH_IMM		= 1 << 5,
+	IBV_WC_EX_WITH_QP_NUM		= 1 << 6,
+	IBV_WC_EX_WITH_SRC_QP		= 1 << 7,
+	IBV_WC_EX_WITH_PKEY_INDEX	= 1 << 8,
+	IBV_WC_EX_WITH_SLID		= 1 << 9,
+	IBV_WC_EX_WITH_SL		= 1 << 10,
+	IBV_WC_EX_WITH_DLID_PATH_BITS	= 1 << 11,
+};
+
+enum {
+	IBV_WC_STANDARD_FLAGS = IBV_WC_EX_WITH_BYTE_LEN		|
+				 IBV_WC_EX_WITH_IMM		|
+				 IBV_WC_EX_WITH_QP_NUM		|
+				 IBV_WC_EX_WITH_SRC_QP		|
+				 IBV_WC_EX_WITH_PKEY_INDEX	|
+				 IBV_WC_EX_WITH_SLID		|
+				 IBV_WC_EX_WITH_SL		|
+				 IBV_WC_EX_WITH_DLID_PATH_BITS
+};
+
+/* fields order in wc_ex
+ * Note: To guarantee compatibility a new field will be added after fields
+ * of the same size.
+ *	uint32_t		byte_len,
+ *	uint32_t		imm_data;
+ *	uint32_t		qp_num;
+ *	uint32_t		src_qp;
+ *	uint16_t		pkey_index;
+ *	uint16_t		slid;
+ *	uint8_t			sl;
+ *	uint8_t			dlid_path_bits;
+ */
+
+enum {
+	IBV_WC_EX_WITH_64BIT_FIELDS = 0
+};
+
+enum {
+	IBV_WC_EX_WITH_32BIT_FIELDS = IBV_WC_EX_WITH_BYTE_LEN		|
+				      IBV_WC_EX_WITH_IMM		|
+				      IBV_WC_EX_WITH_QP_NUM		|
+				      IBV_WC_EX_WITH_SRC_QP
+};
+
+enum {
+	IBV_WC_EX_WITH_16BIT_FIELDS = IBV_WC_EX_WITH_PKEY_INDEX		|
+				      IBV_WC_EX_WITH_SLID
+};
+
+enum {
+	IBV_WC_EX_WITH_8BIT_FIELDS = IBV_WC_EX_WITH_SL			|
+				     IBV_WC_EX_WITH_DLID_PATH_BITS
+};
+
+struct ibv_wc_ex {
+	uint64_t		wr_id;
+	/* wc_flags is a combination of ibv_wc_flags_ex flags. The IBV_WC_EX_WITH_XXX
+	 * flags dynamically define the valid fields in buffer[0].
+	 */
+	uint64_t		wc_flags;
+	uint32_t		status;
+	uint32_t		opcode;
+	uint32_t		vendor_err;
+	uint32_t		reserved;
+	uint8_t			buffer[0];
+};
+
 enum ibv_access_flags {
 	IBV_ACCESS_LOCAL_WRITE		= 1,
 	IBV_ACCESS_REMOTE_WRITE		= (1<<1),
@@ -1052,8 +1125,16 @@ enum verbs_context_mask {
 	VERBS_CONTEXT_RESERVED	= 1 << 5
 };
 
+struct ibv_poll_cq_ex_attr {
+	unsigned int	max_entries;
+	uint32_t	comp_mask;
+};
+
 struct verbs_context {
 	/*  "grows up" - new fields go here */
+	int (*poll_cq_ex)(struct ibv_cq *ibcq,
+			  struct ibv_wc_ex *wc,
+			  struct ibv_poll_cq_ex_attr *attr);
 	int (*query_device_ex)(struct ibv_context *context,
 			       const struct ibv_query_device_ex_input *input,
 			       struct ibv_device_attr_ex *attr,
@@ -1454,6 +1535,32 @@ ibv_create_srq_ex(struct ibv_context *context,
 }
 
 /**
+ * ibv_poll_cq_ex - Poll a CQ for work completions
+ * @cq:the CQ being polled
+ * @wc:array of @max_entries of &struct ibv_wc_ex where completions
+ *   will be returned
+ * @attr: Poll cq attributs
+ *
+ * Poll a CQ for (possibly multiple) completions.  If the return value
+ * is < 0, an error occurred.  If the return value is >= 0, it is the
+ * number of completions returned.  If the return value is
+ * non-negative and strictly less than max_entries, then the CQ was
+ * emptied.
+ */
+
+static inline int ibv_poll_cq_ex(struct ibv_cq *ibcq,
+				 struct ibv_wc_ex *wc,
+				 struct ibv_poll_cq_ex_attr *attr)
+{
+	struct verbs_context *vctx = verbs_get_ctx_op(ibcq->context,
+						      poll_cq_ex);
+	if (!vctx)
+		return ENOSYS;
+
+	return vctx->poll_cq_ex(ibcq, wc, attr);
+}
+
+/**
  * ibv_modify_srq - Modifies the attributes for the specified SRQ.
  * @srq: The SRQ to modify.
  * @srq_attr: On input, specifies the SRQ attributes to modify.  On output,
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH V1 libibverbs 2/8] Add timestamp_mask and hca_core_clock to ibv_query_device_ex
       [not found] ` <1456306924-31298-1-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2016-02-24  9:41   ` [PATCH V1 libibverbs 1/8] Add ibv_poll_cq_ex verb Yishai Hadas
@ 2016-02-24  9:41   ` Yishai Hadas
  2016-02-24  9:41   ` [PATCH V1 libibverbs 3/8] Add support for extended ibv_create_cq Yishai Hadas
                     ` (5 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Yishai Hadas @ 2016-02-24  9:41 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	yishaih-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w,
	majd-VPRAkNaXOzVWk0Htik3J/w, talal-VPRAkNaXOzVWk0Htik3J/w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w

From: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

The fields timestamp_mask and hca_core_clock were added
to the extended version of ibv_query_device verb.
timestamp_mask represents the supported mask of the timestamp.
Users could infer the accuracy of the reported possible
timestamp.
hca_core_clock represents the frequency of the HCA (in kHZ).

Since timestamp is given in hardware cycles, knowing the frequency
is mandatory in order to convert this number into seconds.

Signed-off-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Reviewed-by: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 examples/devinfo.c            | 10 ++++++++++
 include/infiniband/kern-abi.h |  2 ++
 include/infiniband/verbs.h    |  2 ++
 src/cmd.c                     | 21 +++++++++++++++++++++
 4 files changed, 35 insertions(+)

diff --git a/examples/devinfo.c b/examples/devinfo.c
index a8de982..0af8c3b 100644
--- a/examples/devinfo.c
+++ b/examples/devinfo.c
@@ -339,6 +339,16 @@ static int print_hca_cap(struct ibv_device *ib_dev, uint8_t ib_port)
 		printf("\tlocal_ca_ack_delay:\t\t%d\n", device_attr.orig_attr.local_ca_ack_delay);
 
 		print_odp_caps(&device_attr.odp_caps);
+		if (device_attr.completion_timestamp_mask)
+			printf("\tcompletion timestamp_mask:\t\t\t0x%016lx\n",
+			       device_attr.completion_timestamp_mask);
+		else
+			printf("\tcompletion_timestamp_mask not supported\n");
+
+		if (device_attr.hca_core_clock)
+			printf("\thca_core_clock:\t\t\t%lukHZ\n", device_attr.hca_core_clock);
+		else
+			printf("\tcore clock not supported\n");
 	}
 
 	for (port = 1; port <= device_attr.orig_attr.phys_port_cnt; ++port) {
diff --git a/include/infiniband/kern-abi.h b/include/infiniband/kern-abi.h
index 31da4be..b196a04 100644
--- a/include/infiniband/kern-abi.h
+++ b/include/infiniband/kern-abi.h
@@ -269,6 +269,8 @@ struct ibv_query_device_resp_ex {
 	__u32 comp_mask;
 	__u32 response_length;
 	struct ibv_odp_caps_resp odp_caps;
+	__u64 timestamp_mask;
+	__u64 hca_core_clock;
 };
 
 struct ibv_query_port {
diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h
index 99d6265..3ddf07c 100644
--- a/include/infiniband/verbs.h
+++ b/include/infiniband/verbs.h
@@ -207,6 +207,8 @@ struct ibv_device_attr_ex {
 	struct ibv_device_attr	orig_attr;
 	uint32_t		comp_mask;
 	struct ibv_odp_caps	odp_caps;
+	uint64_t		completion_timestamp_mask;
+	uint64_t		hca_core_clock;
 };
 
 enum ibv_mtu {
diff --git a/src/cmd.c b/src/cmd.c
index b8c51ce..e22a24b 100644
--- a/src/cmd.c
+++ b/src/cmd.c
@@ -189,6 +189,27 @@ int ibv_cmd_query_device_ex(struct ibv_context *context,
 		}
 	}
 
+	if (attr_size >= offsetof(struct ibv_device_attr_ex,
+				  completion_timestamp_mask) +
+			 sizeof(attr->completion_timestamp_mask)) {
+		if (resp->response_length >=
+		    offsetof(struct ibv_query_device_resp_ex, timestamp_mask) +
+		    sizeof(resp->timestamp_mask))
+			attr->completion_timestamp_mask = resp->timestamp_mask;
+		else
+			attr->completion_timestamp_mask = 0;
+	}
+
+	if (attr_size >= offsetof(struct ibv_device_attr_ex, hca_core_clock) +
+			 sizeof(attr->hca_core_clock)) {
+		if (resp->response_length >=
+		    offsetof(struct ibv_query_device_resp_ex, hca_core_clock) +
+		    sizeof(resp->hca_core_clock))
+			attr->hca_core_clock = resp->hca_core_clock;
+		else
+			attr->hca_core_clock = 0;
+	}
+
 	return 0;
 }
 
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH V1 libibverbs 3/8] Add support for extended ibv_create_cq
       [not found] ` <1456306924-31298-1-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2016-02-24  9:41   ` [PATCH V1 libibverbs 1/8] Add ibv_poll_cq_ex verb Yishai Hadas
  2016-02-24  9:41   ` [PATCH V1 libibverbs 2/8] Add timestamp_mask and hca_core_clock to ibv_query_device_ex Yishai Hadas
@ 2016-02-24  9:41   ` Yishai Hadas
  2016-02-24  9:42   ` [PATCH V1 libibverbs 4/8] Add completion timestamp support for ibv_poll_cq_ex Yishai Hadas
                     ` (4 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Yishai Hadas @ 2016-02-24  9:41 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	yishaih-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w,
	majd-VPRAkNaXOzVWk0Htik3J/w, talal-VPRAkNaXOzVWk0Htik3J/w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w

From: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Add ibv_create_cq_ex. This extended verb follows
the extension verbs scheme and hence could be
extendible in the future for more features.

The new command supports creation flags with timestamp.

Signed-off-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Reviewed-by: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 include/infiniband/driver.h   |  9 +++++++
 include/infiniband/kern-abi.h | 20 ++++++++++++++
 include/infiniband/verbs.h    | 63 +++++++++++++++++++++++++++++++++++++++++++
 src/cmd.c                     | 42 +++++++++++++++++++++++++++++
 src/device.c                  | 44 ++++++++++++++++++++++++++++++
 src/ibverbs.h                 |  5 ++++
 src/libibverbs.map            |  1 +
 7 files changed, 184 insertions(+)

diff --git a/include/infiniband/driver.h b/include/infiniband/driver.h
index 053ad5f..176eef0 100644
--- a/include/infiniband/driver.h
+++ b/include/infiniband/driver.h
@@ -155,6 +155,15 @@ int ibv_cmd_create_cq(struct ibv_context *context, int cqe,
 		      int comp_vector, struct ibv_cq *cq,
 		      struct ibv_create_cq *cmd, size_t cmd_size,
 		      struct ibv_create_cq_resp *resp, size_t resp_size);
+int ibv_cmd_create_cq_ex(struct ibv_context *context,
+			 struct ibv_cq_init_attr_ex *cq_attr,
+			 struct ibv_cq *cq,
+			 struct ibv_create_cq_ex *cmd,
+			 size_t cmd_core_size,
+			 size_t cmd_size,
+			 struct ibv_create_cq_resp_ex *resp,
+			 size_t resp_core_size,
+			 size_t resp_size);
 int ibv_cmd_poll_cq(struct ibv_cq *cq, int ne, struct ibv_wc *wc);
 int ibv_cmd_req_notify_cq(struct ibv_cq *cq, int solicited_only);
 #define IBV_CMD_RESIZE_CQ_HAS_RESP_PARAMS
diff --git a/include/infiniband/kern-abi.h b/include/infiniband/kern-abi.h
index b196a04..2012879 100644
--- a/include/infiniband/kern-abi.h
+++ b/include/infiniband/kern-abi.h
@@ -112,6 +112,8 @@ enum {
 					    IB_USER_VERBS_CMD_QUERY_DEVICE,
 	IB_USER_VERBS_CMD_CREATE_QP_EX = IB_USER_VERBS_CMD_EXTENDED_MASK |
 					 IB_USER_VERBS_CMD_CREATE_QP,
+	IB_USER_VERBS_CMD_CREATE_CQ_EX = IB_USER_VERBS_CMD_EXTENDED_MASK |
+					 IB_USER_VERBS_CMD_CREATE_CQ,
 	IB_USER_VERBS_CMD_CREATE_FLOW = IB_USER_VERBS_CMD_EXTENDED_MASK +
 					IB_USER_VERBS_CMD_THRESHOLD,
 	IB_USER_VERBS_CMD_DESTROY_FLOW
@@ -445,6 +447,23 @@ struct ibv_create_cq_resp {
 	__u32 cqe;
 };
 
+struct ibv_create_cq_ex {
+	struct ex_hdr	hdr;
+	__u64		user_handle;
+	__u32		cqe;
+	__u32		comp_vector;
+	__s32		comp_channel;
+	__u32		comp_mask;
+	__u32		flags;
+	__u32		reserved;
+};
+
+struct ibv_create_cq_resp_ex {
+	struct ibv_create_cq_resp	base;
+	__u32				comp_mask;
+	__u32				response_length;
+};
+
 struct ibv_kern_wc {
 	__u64  wr_id;
 	__u32  status;
@@ -1100,6 +1119,7 @@ enum {
 	IB_USER_VERBS_CMD_DESTROY_FLOW_V2 = -1,
 	IB_USER_VERBS_CMD_QUERY_DEVICE_EX_V2 = -1,
 	IB_USER_VERBS_CMD_CREATE_QP_EX_V2 = -1,
+	IB_USER_VERBS_CMD_CREATE_CQ_EX_V2 = -1,
 };
 
 struct ibv_modify_srq_v3 {
diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h
index 3ddf07c..a64e320 100644
--- a/include/infiniband/verbs.h
+++ b/include/infiniband/verbs.h
@@ -1118,6 +1118,42 @@ struct ibv_context {
 	void		       *abi_compat;
 };
 
+enum ibv_cq_init_attr_mask {
+	IBV_CQ_INIT_ATTR_FLAGS		= 1 << 0,
+	IBV_CQ_INIT_ATTR_RESERVED	= 1 << 1
+};
+
+enum ibv_create_cq_attr_flags {
+	IBV_CREATE_CQ_ATTR_COMPLETION_TIMESTAMP	= 1 << 0,
+};
+
+struct ibv_cq_init_attr_ex {
+	/* Minimum number of entries required for CQ */
+	int			cqe;
+	/* Consumer-supplied context returned for completion events */
+	void			*cq_context;
+	/* Completion channel where completion events will be queued.
+	 * May be NULL if completion events will not be used.
+	 */
+	struct ibv_comp_channel *channel;
+	/* Completion vector used to signal completion events.
+	 *  Must be >= 0 and < context->num_comp_vectors.
+	 */
+	int			comp_vector;
+	/* The wc_flags that should be returned in ibv_poll_cq_ex.
+	 * Or'ed bit of enum ibv_wc_flags_ex.
+	 */
+	uint64_t		wc_flags;
+	/* compatibility mask (extended verb). Or'd flags of
+	 * enum ibv_cq_init_attr_mask
+	 */
+	uint32_t		comp_mask;
+	/* create cq attr flags - one or more flags from
+	 * enum ibv_create_cq_attr_flags
+	 */
+	uint32_t		flags;
+};
+
 enum verbs_context_mask {
 	VERBS_CONTEXT_XRCD	= 1 << 0,
 	VERBS_CONTEXT_SRQ	= 1 << 1,
@@ -1134,6 +1170,9 @@ struct ibv_poll_cq_ex_attr {
 
 struct verbs_context {
 	/*  "grows up" - new fields go here */
+	struct ibv_cq *(*create_cq_ex)(struct ibv_context *context,
+				       struct ibv_cq_init_attr_ex *);
+	struct verbs_ex_private *priv;
 	int (*poll_cq_ex)(struct ibv_cq *ibcq,
 			  struct ibv_wc_ex *wc,
 			  struct ibv_poll_cq_ex_attr *attr);
@@ -1428,6 +1467,30 @@ struct ibv_cq *ibv_create_cq(struct ibv_context *context, int cqe,
 			     int comp_vector);
 
 /**
+ * ibv_create_cq_ex - Create a completion queue
+ * @context - Context CQ will be attached to
+ * @cq_attr - Attributes to create the CQ with
+ */
+static inline
+struct ibv_cq *ibv_create_cq_ex(struct ibv_context *context,
+				struct ibv_cq_init_attr_ex *cq_attr)
+{
+	struct verbs_context *vctx = verbs_get_ctx_op(context, create_cq_ex);
+
+	if (!vctx) {
+		errno = ENOSYS;
+		return NULL;
+	}
+
+	if (cq_attr->comp_mask & ~(IBV_CQ_INIT_ATTR_RESERVED - 1)) {
+		errno = EINVAL;
+		return NULL;
+	}
+
+	return vctx->create_cq_ex(context, cq_attr);
+}
+
+/**
  * ibv_resize_cq - Modifies the capacity of the CQ.
  * @cq: The CQ to resize.
  * @cqe: The minimum size of the CQ.
diff --git a/src/cmd.c b/src/cmd.c
index e22a24b..0e03740 100644
--- a/src/cmd.c
+++ b/src/cmd.c
@@ -459,6 +459,48 @@ int ibv_cmd_create_cq(struct ibv_context *context, int cqe,
 	return 0;
 }
 
+int ibv_cmd_create_cq_ex(struct ibv_context *context,
+			 struct ibv_cq_init_attr_ex *cq_attr,
+			 struct ibv_cq *cq,
+			 struct ibv_create_cq_ex *cmd,
+			 size_t cmd_core_size,
+			 size_t cmd_size,
+			 struct ibv_create_cq_resp_ex *resp,
+			 size_t resp_core_size,
+			 size_t resp_size)
+{
+	int err;
+
+	memset(cmd, 0, cmd_core_size);
+	IBV_INIT_CMD_RESP_EX_V(cmd, cmd_core_size, cmd_size, CREATE_CQ_EX, resp,
+			       resp_core_size, resp_size);
+
+	if (cq_attr->comp_mask & ~(IBV_CQ_INIT_ATTR_RESERVED - 1))
+		return EINVAL;
+
+	cmd->user_handle   = (uintptr_t)cq;
+	cmd->cqe           = cq_attr->cqe;
+	cmd->comp_vector   = cq_attr->comp_vector;
+	cmd->comp_channel  = cq_attr->channel ? cq_attr->channel->fd : -1;
+	cmd->comp_mask = 0;
+
+	if (cmd_core_size >= offsetof(struct ibv_create_cq_ex, flags) +
+	    sizeof(cmd->flags) && cq_attr->comp_mask & IBV_CQ_INIT_ATTR_FLAGS)
+		cmd->flags = cq_attr->flags;
+
+	err = write(context->cmd_fd, cmd, cmd_size);
+	if (err != cmd_size)
+		return errno;
+
+	(void)VALGRIND_MAKE_MEM_DEFINED(resp, resp_size);
+
+	cq->handle  = resp->base.cq_handle;
+	cq->cqe     = resp->base.cqe;
+	cq->context = context;
+
+	return 0;
+}
+
 int ibv_cmd_poll_cq(struct ibv_cq *ibcq, int ne, struct ibv_wc *wc)
 {
 	struct ibv_poll_cq       cmd;
diff --git a/src/device.c b/src/device.c
index f2b889c..bff5836 100644
--- a/src/device.c
+++ b/src/device.c
@@ -122,6 +122,33 @@ uint64_t __ibv_get_device_guid(struct ibv_device *device)
 }
 default_symver(__ibv_get_device_guid, ibv_get_device_guid);
 
+struct ibv_cq *__lib_ibv_create_cq_ex(struct ibv_context *context,
+				      struct ibv_cq_init_attr_ex *cq_attr)
+{
+	struct verbs_context *vctx = verbs_get_ctx(context);
+	struct ibv_cq *cq;
+
+	pthread_mutex_lock(&context->mutex);
+
+	cq = vctx->priv->create_cq_ex(context, cq_attr);
+
+	if (cq) {
+		cq->context		   = context;
+		cq->channel		   = cq_attr->channel;
+		if (cq->channel)
+			++cq->channel->refcnt;
+		cq->cq_context		   = cq_attr->cq_context;
+		cq->comp_events_completed  = 0;
+		cq->async_events_completed = 0;
+		pthread_mutex_init(&cq->mutex, NULL);
+		pthread_cond_init(&cq->cond, NULL);
+	}
+
+	pthread_mutex_unlock(&context->mutex);
+
+	return cq;
+}
+
 struct ibv_context *__ibv_open_device(struct ibv_device *device)
 {
 	struct verbs_device *verbs_device = verbs_get_device(device);
@@ -148,6 +175,8 @@ struct ibv_context *__ibv_open_device(struct ibv_device *device)
 		if (!context)
 			goto err;
 	} else {
+		struct verbs_ex_private *priv;
+
 		/* Library now allocates the context */
 		context_ex = calloc(1, sizeof(*context_ex) +
 				       verbs_device->size_of_context);
@@ -156,6 +185,14 @@ struct ibv_context *__ibv_open_device(struct ibv_device *device)
 			goto err;
 		}
 
+		priv = calloc(1, sizeof(*priv));
+		if (!priv) {
+			errno = ENOMEM;
+			free(context_ex);
+			goto err;
+		}
+
+		context_ex->priv = priv;
 		context_ex->context.abi_compat  = __VERBS_ABI_IS_EXTENDED;
 		context_ex->sz = sizeof(*context_ex);
 
@@ -177,6 +214,11 @@ struct ibv_context *__ibv_open_device(struct ibv_device *device)
 		 */
 		context_ex->ABI_placeholder1 = (void (*)(void)) context_ex->ibv_create_flow;
 		context_ex->ABI_placeholder2 = (void (*)(void)) context_ex->ibv_destroy_flow;
+
+		if (context_ex->create_cq_ex) {
+			priv->create_cq_ex = context_ex->create_cq_ex;
+			context_ex->create_cq_ex = __lib_ibv_create_cq_ex;
+		}
 	}
 
 	context->device = device;
@@ -186,6 +228,7 @@ struct ibv_context *__ibv_open_device(struct ibv_device *device)
 	return context;
 
 verbs_err:
+	free(context_ex->priv);
 	free(context_ex);
 err:
 	close(cmd_fd);
@@ -204,6 +247,7 @@ int __ibv_close_device(struct ibv_context *context)
 	if (context_ex) {
 		struct verbs_device *verbs_device = verbs_get_device(context->device);
 		verbs_device->uninit_context(verbs_device, context);
+		free(context_ex->priv);
 		free(context_ex);
 	} else {
 		context->device->ops.free_context(context);
diff --git a/src/ibverbs.h b/src/ibverbs.h
index ff206f9..ceecb5d 100644
--- a/src/ibverbs.h
+++ b/src/ibverbs.h
@@ -81,6 +81,11 @@ extern HIDDEN int abi_ver;
 
 HIDDEN int ibverbs_init(struct ibv_device ***list);
 
+struct verbs_ex_private {
+	struct ibv_cq *(*create_cq_ex)(struct ibv_context *context,
+				       struct ibv_cq_init_attr_ex *);
+};
+
 #define IBV_INIT_CMD(cmd, size, opcode)					\
 	do {								\
 		if (abi_ver > 2)					\
diff --git a/src/libibverbs.map b/src/libibverbs.map
index a150416..282699c 100644
--- a/src/libibverbs.map
+++ b/src/libibverbs.map
@@ -115,6 +115,7 @@ IBVERBS_1.1 {
 		ibv_cmd_create_srq_ex;
 		ibv_cmd_create_qp_ex;
 		ibv_cmd_create_qp_ex2;
+		ibv_cmd_create_cq_ex;
 		ibv_cmd_open_qp;
 		ibv_cmd_rereg_mr;
 
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH V1 libibverbs 4/8] Add completion timestamp support for ibv_poll_cq_ex
       [not found] ` <1456306924-31298-1-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (2 preceding siblings ...)
  2016-02-24  9:41   ` [PATCH V1 libibverbs 3/8] Add support for extended ibv_create_cq Yishai Hadas
@ 2016-02-24  9:42   ` Yishai Hadas
  2016-02-24  9:42   ` [PATCH V1 libibverbs 5/8] Add helper functions to work with the extended WC Yishai Hadas
                     ` (3 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Yishai Hadas @ 2016-02-24  9:42 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	yishaih-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w,
	majd-VPRAkNaXOzVWk0Htik3J/w, talal-VPRAkNaXOzVWk0Htik3J/w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w

From: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Add support for raw completion timestamp through
ibv_poll_cq_ex.

Signed-off-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Reviewed-by: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 include/infiniband/verbs.h | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h
index a64e320..232b0df 100644
--- a/include/infiniband/verbs.h
+++ b/include/infiniband/verbs.h
@@ -401,6 +401,7 @@ enum ibv_wc_flags_ex {
 	IBV_WC_EX_WITH_SLID		= 1 << 9,
 	IBV_WC_EX_WITH_SL		= 1 << 10,
 	IBV_WC_EX_WITH_DLID_PATH_BITS	= 1 << 11,
+	IBV_WC_EX_WITH_COMPLETION_TIMESTAMP = 1 << 12,
 };
 
 enum {
@@ -417,6 +418,10 @@ enum {
 /* fields order in wc_ex
  * Note: To guarantee compatibility a new field will be added after fields
  * of the same size.
+ *	// Raw timestamp of completion. A raw timestamp is implementation
+ *	// defined and can not be relied upon to have any ordering value
+ *	// between more than one HCA or driver.
+ *	uint64_t		completion_timestamp;
  *	uint32_t		byte_len,
  *	uint32_t		imm_data;
  *	uint32_t		qp_num;
@@ -428,7 +433,7 @@ enum {
  */
 
 enum {
-	IBV_WC_EX_WITH_64BIT_FIELDS = 0
+	IBV_WC_EX_WITH_64BIT_FIELDS = IBV_WC_EX_WITH_COMPLETION_TIMESTAMP
 };
 
 enum {
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH V1 libibverbs 5/8] Add helper functions to work with the extended WC
       [not found] ` <1456306924-31298-1-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (3 preceding siblings ...)
  2016-02-24  9:42   ` [PATCH V1 libibverbs 4/8] Add completion timestamp support for ibv_poll_cq_ex Yishai Hadas
@ 2016-02-24  9:42   ` Yishai Hadas
  2016-02-24  9:42   ` [PATCH V1 libibverbs 6/8] Add ibv_query_rt_values_ex Yishai Hadas
                     ` (2 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Yishai Hadas @ 2016-02-24  9:42 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	yishaih-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w,
	majd-VPRAkNaXOzVWk0Htik3J/w, talal-VPRAkNaXOzVWk0Htik3J/w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w

From: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Add bunch of functions that may help application and drivers
to work with the extended WC. (i.e. ibv_wc_ex)

It includes:
- ibv_wc_ex_get_offsetof_xx functions to get an offset of a specific
  field.
- ibv_wc_ex_getxx functions to get the data of a specific field.

- ibv_wc_ex_get_size - function to get ibv_wc_ex size.

Those functions uses the CQ's WC creation flags and rely
on the order of the fields as described in verbs.h.

To implement the above functions this patch uses the
__builtin_popcountll functionality to calculate how many
bits are set in a given u64.

As on some compilers popcountll doesn't exist the patch adds
ibv_popcount64 with the required implementation and use
it all around.

Signed-off-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Reviewed-by: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 Makefile.am                   |   3 +-
 include/infiniband/compiler.h |  91 +++++++++++++++++++++++++++++++++++
 include/infiniband/verbs.h    | 108 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 201 insertions(+), 1 deletion(-)
 create mode 100644 include/infiniband/compiler.h

diff --git a/Makefile.am b/Makefile.am
index 4698ecb..a916acb 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -44,7 +44,8 @@ libibverbsincludedir = $(includedir)/infiniband
 
 libibverbsinclude_HEADERS = include/infiniband/arch.h include/infiniband/driver.h \
     include/infiniband/kern-abi.h include/infiniband/opcode.h include/infiniband/verbs.h \
-    include/infiniband/sa-kern-abi.h include/infiniband/sa.h include/infiniband/marshall.h
+    include/infiniband/sa-kern-abi.h include/infiniband/sa.h include/infiniband/marshall.h \
+    include/infiniband/compiler.h
 
 man_MANS = man/ibv_asyncwatch.1 man/ibv_devices.1 man/ibv_devinfo.1	\
     man/ibv_rc_pingpong.1 man/ibv_uc_pingpong.1 man/ibv_ud_pingpong.1	\
diff --git a/include/infiniband/compiler.h b/include/infiniband/compiler.h
new file mode 100644
index 0000000..5d54fbe
--- /dev/null
+++ b/include/infiniband/compiler.h
@@ -0,0 +1,91 @@
+/*
+ * Copyright (c) 2016 Mellanox, Ltd.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef INFINIBAND_COMPILER_H
+#define INFINIBAND_COMPILER_H
+
+#if (__GNUC__ > 3 || (__GNUC__ == 3 && __GNUC_MINOR__ >= 4))
+#define ibv_popcount64		__builtin_popcountll
+#endif
+
+#ifndef __has_builtin
+#define __has_builtin(x) 0 /* Compatibility with non-clang compilers. */
+#endif
+
+#if __has_builtin(__builtin_popcountll) && !defined(ibv_popcount64)
+#define ibv_popcount64		__builtin_popcountll
+#endif
+
+#ifndef ibv_popcount64
+/* From FreeBSD:
+ * Copyright (C) 1995, 1996, 1997, and 1998 WIDE Project.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of the project nor the names of its contributors
+ *    may be used to endorse or promote products derived from this software
+ *    without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE PROJECT AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL THE PROJECT OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+ /* Population count of how many bits are set */
+static inline int ibv_popcount64(uint64_t _x)
+{
+	_x = (_x & 0x5555555555555555) + ((_x & 0xaaaaaaaaaaaaaaaa) >> 1);
+	_x = (_x & 0x3333333333333333) + ((_x & 0xcccccccccccccccc) >> 2);
+	_x = (_x + (_x >> 4)) & 0x0f0f0f0f0f0f0f0f;
+	_x = (_x + (_x >> 8));
+	_x = (_x + (_x >> 16));
+	_x = (_x + (_x >> 32)) & 0x000000ff;
+	return _x;
+}
+#endif
+
+#endif
diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h
index 232b0df..67a2cb1 100644
--- a/include/infiniband/verbs.h
+++ b/include/infiniband/verbs.h
@@ -41,6 +41,7 @@
 #include <stddef.h>
 #include <errno.h>
 #include <string.h>
+#include <infiniband/compiler.h>
 
 #ifdef __cplusplus
 #  define BEGIN_C_DECLS extern "C" {
@@ -466,6 +467,113 @@ struct ibv_wc_ex {
 	uint8_t			buffer[0];
 };
 
+static inline size_t ibv_wc_ex_get_offsetof64(uint64_t cq_creation_wc_flags,
+					      enum ibv_wc_flags_ex flag)
+{
+	return ibv_popcount64(cq_creation_wc_flags &
+			      IBV_WC_EX_WITH_64BIT_FIELDS &
+			      (flag - 1)) * sizeof(uint64_t);
+}
+
+static inline size_t ibv_wc_ex_get_offsetof32(uint64_t cq_creation_wc_flags,
+					      enum ibv_wc_flags_ex flag)
+{
+	return ibv_popcount64(cq_creation_wc_flags &
+			      (IBV_WC_EX_WITH_64BIT_FIELDS)) *
+		sizeof(uint64_t) +
+		ibv_popcount64(cq_creation_wc_flags &
+			       IBV_WC_EX_WITH_32BIT_FIELDS &
+			       (flag - 1)) * sizeof(uint32_t);
+}
+
+static inline size_t ibv_wc_ex_get_offsetof16(uint64_t cq_creation_wc_flags,
+					      enum ibv_wc_flags_ex flag)
+{
+	return ibv_popcount64(cq_creation_wc_flags &
+			      (IBV_WC_EX_WITH_64BIT_FIELDS)) *
+		sizeof(uint64_t) +
+		ibv_popcount64(cq_creation_wc_flags &
+			       (IBV_WC_EX_WITH_32BIT_FIELDS)) *
+		sizeof(uint32_t) +
+		ibv_popcount64(cq_creation_wc_flags &
+			       IBV_WC_EX_WITH_16BIT_FIELDS &
+			       (flag - 1)) * sizeof(uint16_t);
+}
+
+static inline size_t ibv_wc_ex_get_offsetof8(uint64_t cq_creation_wc_flags,
+					     enum ibv_wc_flags_ex flag)
+{
+	return ibv_popcount64(cq_creation_wc_flags &
+			      (IBV_WC_EX_WITH_64BIT_FIELDS)) *
+		sizeof(uint64_t) +
+		ibv_popcount64(cq_creation_wc_flags &
+			       (IBV_WC_EX_WITH_32BIT_FIELDS)) *
+		sizeof(uint32_t) +
+		ibv_popcount64(cq_creation_wc_flags &
+			       (IBV_WC_EX_WITH_16BIT_FIELDS)) *
+		sizeof(uint16_t) +
+		ibv_popcount64(cq_creation_wc_flags &
+			       IBV_WC_EX_WITH_8BIT_FIELDS &
+			       (flag - 1)) * sizeof(uint8_t);
+}
+
+static inline uint64_t ibv_wc_ex_get64(const struct ibv_wc_ex *wc_ex,
+				       uint64_t cq_creation_wc_flags,
+				       enum ibv_wc_flags_ex flag)
+{
+	uint64_t *pdata = (uint64_t *)(wc_ex->buffer +
+				       ibv_wc_ex_get_offsetof64(cq_creation_wc_flags,
+								flag));
+
+	return *pdata;
+}
+
+static inline uint32_t ibv_wc_ex_get32(const struct ibv_wc_ex *wc_ex,
+				       uint64_t cq_creation_wc_flags,
+				       enum ibv_wc_flags_ex flag)
+{
+	uint32_t *pdata = (uint32_t *)(wc_ex->buffer +
+				       ibv_wc_ex_get_offsetof32(cq_creation_wc_flags,
+								flag));
+
+	return *pdata;
+}
+
+static inline uint16_t ibv_wc_ex_get16(const struct ibv_wc_ex *wc_ex,
+				       uint64_t cq_creation_wc_flags,
+				       enum ibv_wc_flags_ex flag)
+{
+	uint16_t *pdata = (uint16_t *)(wc_ex->buffer +
+				       ibv_wc_ex_get_offsetof16(cq_creation_wc_flags,
+								flag));
+
+	return *pdata;
+}
+
+static inline uint8_t ibv_wc_ex_get8(const struct ibv_wc_ex *wc_ex,
+				     uint64_t cq_creation_wc_flags,
+				     enum ibv_wc_flags_ex flag)
+{
+	uint8_t *pdata = (uint8_t *)(wc_ex->buffer +
+				     ibv_wc_ex_get_offsetof8(cq_creation_wc_flags,
+							     flag));
+
+	return *pdata;
+}
+
+static inline size_t ibv_wc_ex_get_size(uint64_t flags)
+{
+	return ((ibv_popcount64(flags & IBV_WC_EX_WITH_64BIT_FIELDS) *
+		 sizeof(uint64_t) +
+		 ibv_popcount64(flags & IBV_WC_EX_WITH_32BIT_FIELDS) *
+		 sizeof(uint32_t) +
+		 ibv_popcount64(flags & IBV_WC_EX_WITH_16BIT_FIELDS) *
+		 sizeof(uint16_t) +
+		 ibv_popcount64(flags & IBV_WC_EX_WITH_8BIT_FIELDS) *
+		 sizeof(uint8_t) + (sizeof(uint64_t) - 1)) &
+		~(sizeof(uint64_t) - 1)) + sizeof(struct ibv_wc_ex);
+}
+
 enum ibv_access_flags {
 	IBV_ACCESS_LOCAL_WRITE		= 1,
 	IBV_ACCESS_REMOTE_WRITE		= (1<<1),
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH V1 libibverbs 6/8] Add ibv_query_rt_values_ex
       [not found] ` <1456306924-31298-1-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (4 preceding siblings ...)
  2016-02-24  9:42   ` [PATCH V1 libibverbs 5/8] Add helper functions to work with the extended WC Yishai Hadas
@ 2016-02-24  9:42   ` Yishai Hadas
  2016-02-24  9:42   ` [PATCH V1 libibverbs 7/8] Man pages for time stamping support Yishai Hadas
  2016-02-24  9:42   ` [PATCH V1 libibverbs 8/8] Add timestamp support in rc_pingpong Yishai Hadas
  7 siblings, 0 replies; 23+ messages in thread
From: Yishai Hadas @ 2016-02-24  9:42 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	yishaih-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w,
	majd-VPRAkNaXOzVWk0Htik3J/w, talal-VPRAkNaXOzVWk0Htik3J/w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w

From: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Add an extension verb to query certain real time values of a device.
Currently, only IBV_VALUES_MASK_RAW_CLOCK is supported, but this
verb could support other flags like IBV_VALUES_MASK_TEMP_SENSOR,
IBV_VALUES_MASK_CORE_FREQ, etc.
This extension verb only calls the provider.
The provider has to query this value somehow and mark the queried
values in comp_mask.

Signed-off-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Reviewed-by: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 include/infiniband/verbs.h | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h
index 67a2cb1..49de86c 100644
--- a/include/infiniband/verbs.h
+++ b/include/infiniband/verbs.h
@@ -1267,6 +1267,16 @@ struct ibv_cq_init_attr_ex {
 	uint32_t		flags;
 };
 
+enum ibv_values_mask {
+	IBV_VALUES_MASK_RAW_CLOCK	= 1 << 0,
+	IBV_VALUES_MASK_RESERVED	= 1 << 1
+};
+
+struct ibv_values_ex {
+	uint32_t	comp_mask;
+	struct timespec raw_clock;
+};
+
 enum verbs_context_mask {
 	VERBS_CONTEXT_XRCD	= 1 << 0,
 	VERBS_CONTEXT_SRQ	= 1 << 1,
@@ -1283,6 +1293,8 @@ struct ibv_poll_cq_ex_attr {
 
 struct verbs_context {
 	/*  "grows up" - new fields go here */
+	int (*query_rt_values)(struct ibv_context *context,
+			       struct ibv_values_ex *values);
 	struct ibv_cq *(*create_cq_ex)(struct ibv_context *context,
 				       struct ibv_cq_init_attr_ex *);
 	struct verbs_ex_private *priv;
@@ -1817,6 +1829,27 @@ ibv_create_qp_ex(struct ibv_context *context, struct ibv_qp_init_attr_ex *qp_ini
 }
 
 /**
+ * ibv_query_rt_values_ex - Get current real time @values of a device.
+ * @values - in/out - defines the attributes we need to query/queried.
+ * (Or's bits of enum ibv_values_mask on values->comp_mask field)
+ */
+static inline int
+ibv_query_rt_values_ex(struct ibv_context *context,
+		       struct ibv_values_ex *values)
+{
+	struct verbs_context *vctx;
+
+	vctx = verbs_get_ctx_op(context, query_rt_values);
+	if (!vctx)
+		return ENOSYS;
+
+	if (values->comp_mask & ~(IBV_VALUES_MASK_RESERVED - 1))
+		return EINVAL;
+
+	return vctx->query_rt_values(context, values);
+}
+
+/**
  * ibv_query_device_ex - Get extended device properties
  */
 static inline int
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH V1 libibverbs 7/8] Man pages for time stamping support
       [not found] ` <1456306924-31298-1-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (5 preceding siblings ...)
  2016-02-24  9:42   ` [PATCH V1 libibverbs 6/8] Add ibv_query_rt_values_ex Yishai Hadas
@ 2016-02-24  9:42   ` Yishai Hadas
  2016-02-24  9:42   ` [PATCH V1 libibverbs 8/8] Add timestamp support in rc_pingpong Yishai Hadas
  7 siblings, 0 replies; 23+ messages in thread
From: Yishai Hadas @ 2016-02-24  9:42 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	yishaih-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w,
	majd-VPRAkNaXOzVWk0Htik3J/w, talal-VPRAkNaXOzVWk0Htik3J/w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w

From: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Add and update man pages to describe how to
work with the extended new verbs to have time stamping
support.

Signed-off-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Reviewed-by: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 Makefile.am                  |   4 +-
 man/ibv_create_cq_ex.3       |  75 ++++++++++++++++++
 man/ibv_poll_cq_ex.3         | 181 +++++++++++++++++++++++++++++++++++++++++++
 man/ibv_query_device_ex.3    |   6 +-
 man/ibv_query_rt_values_ex.3 |  50 ++++++++++++
 5 files changed, 313 insertions(+), 3 deletions(-)
 create mode 100644 man/ibv_create_cq_ex.3
 create mode 100644 man/ibv_poll_cq_ex.3
 create mode 100644 man/ibv_query_rt_values_ex.3

diff --git a/Makefile.am b/Makefile.am
index a916acb..59d3006 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -64,7 +64,9 @@ man_MANS = man/ibv_asyncwatch.1 man/ibv_devices.1 man/ibv_devinfo.1	\
     man/ibv_req_notify_cq.3 man/ibv_resize_cq.3 man/ibv_rate_to_mbps.3  \
     man/ibv_create_qp_ex.3 man/ibv_create_srq_ex.3 man/ibv_open_xrcd.3  \
     man/ibv_get_srq_num.3 man/ibv_open_qp.3 man/ibv_query_device_ex.3	\
-    man/ibv_alloc_mw.3 man/ibv_bind_mw.3 man/ibv_inc_rkey.3
+    man/ibv_alloc_mw.3 man/ibv_bind_mw.3 man/ibv_inc_rkey.3		\
+    man/ibv_poll_cq_ex.3 man/ibv_create_cq_ex.3           		\
+    man/ibv_query_rt_values_ex.3
 
 DEBIAN = debian/changelog debian/compat debian/control debian/copyright \
     debian/ibverbs-utils.install debian/libibverbs1.install \
diff --git a/man/ibv_create_cq_ex.3 b/man/ibv_create_cq_ex.3
new file mode 100644
index 0000000..3f655b4
--- /dev/null
+++ b/man/ibv_create_cq_ex.3
@@ -0,0 +1,75 @@
+.\" -*- nroff -*-
+.\"
+.TH IBV_CREATE_CQ_EX 3 2016-2-10 libibverbs "Libibverbs Programmer's Manual"
+.SH "NAME"
+ibv_create_cq_ex \- create a completion queue (CQ)
+.SH "SYNOPSIS"
+.nf
+.B #include <infiniband/verbs.h>
+.sp
+.BI "struct ibv_cq *ibv_create_cq_ex(struct ibv_context " "*context" ",
+.BI "                                struct ibv_cq_init_attr_ex " "*cq_attr" );
+.fi
+.SH "DESCRIPTION"
+.B ibv_create_cq_ex()
+creates a completion queue (CQ) for RDMA device context
+.I context\fR.
+The argument
+.I cq_attr
+is a pointer to struct ibv_create_cq_attr_ex as defined in <infiniband/verbs.h>.
+.PP
+.nf
+struct ibv_cq_init_attr_ex {
+.in +8
+int                     cqe;               /* Minimum number of entries required for CQ */
+void                    *cq_context;       /* Consumer-supplied context returned for completion events */
+struct ibv_comp_channel *channel;          /* Completion channel where completion events will be queued. May be NULL if completion events will not be used. */
+int                     comp_vector;       /* Completion vector used to signal completion events. Must be >= 0 and < context->num_comp_vectors. */
+uint64_t                wc_flags;          /* The wc_flags that should be returned in ibv_poll_cq_ex. Or'ed bit of enum ibv_wc_flags_ex. */
+uint32_t                comp_mask;         /* compatibility mask (extended verb). Or'd flags of enum ibv_cq_init_attr_mask */
+uint32_t                flags;             /* create cq flags - one or more flags from enum ibv_create_cq_attr_flags */
+.in -8
+};
+
+enum ibv_wc_flags_ex {
+        IBV_WC_EX_GRH                        = 1 << 0,  /* Output only */
+        IBV_WC_EX_IMM                        = 1 << 1,  /* Output only */
+        IBV_WC_EX_CSUM_OK                    = 1 << 2,  /* Output only */
+        IBV_WC_EX_INV                        = 1 << 3,  /* Output only */
+        IBV_WC_EX_WITH_BYTE_LEN              = 1 << 4,  /* Require byte len in WC */
+        IBV_WC_EX_WITH_IMM                   = 1 << 5,  /* Require immediate in WC */
+        IBV_WC_EX_WITH_QP_NUM                = 1 << 6,  /* Require QP number in WC */
+        IBV_WC_EX_WITH_SRC_QP                = 1 << 7,  /* Require source QP in WC */
+        IBV_WC_EX_WITH_PKEY_INDEX            = 1 << 8,  /* Require pkey index in WC */
+        IBV_WC_EX_WITH_SLID                  = 1 << 9,  /* Require slid in WC */
+        IBV_WC_EX_WITH_SL                    = 1 << 10, /* Require sl in WC */
+        IBV_WC_EX_WITH_DLID_PATH_BITS        = 1 << 11, /* Require dlid path bits in WC */
+        IBV_WC_EX_WITH_COMPLETION_TIMESTAMP  = 1 << 12, /* Require completion timestamp in WC */
+};
+
+
+enum ibv_cq_init_attr_mask {
+        IBV_CQ_INIT_ATTR_FLAGS               = 1 << 0,
+};
+.SH "RETURN VALUE"
+.B ibv_create_cq_ex()
+returns a pointer to the CQ, or NULL if the request fails.
+.SH "NOTES"
+.B ibv_create_cq_ex()
+may create a CQ with size greater than or equal to the requested
+size. Check the cqe attribute in the returned CQ for the actual size.
+.PP
+CQ should be destroyed with ibv_destroy_cq.
+.PP
+.SH "SEE ALSO"
+.BR ibv_create_cq (3),
+.BR ibv_destroy_cq (3),
+.BR ibv_resize_cq (3),
+.BR ibv_req_notify_cq (3),
+.BR ibv_ack_cq_events (3),
+.BR ibv_create_qp (3)
+.SH "AUTHORS"
+.TP
+Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
+.TP
+Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
diff --git a/man/ibv_poll_cq_ex.3 b/man/ibv_poll_cq_ex.3
new file mode 100644
index 0000000..5474dd7
--- /dev/null
+++ b/man/ibv_poll_cq_ex.3
@@ -0,0 +1,181 @@
+.\" -*- nroff -*-
+.\"
+.TH IBV_POLL_CQ_EX 3 2016-2-18 libibverbs "Libibverbs Programmer's Manual"
+.SH "NAME"
+ibv_poll_cq_ex \- poll a completion queue (CQ)
+.SH "SYNOPSIS"
+.nf
+.B #include <infiniband/verbs.h>
+.sp
+.BI "int ibv_poll_cq_ex(struct ibv_cq " "*cq" ", struct ibv_wc_ex " "*wc" ,
+.BI "                   struct ibv_poll_cq_ex_attr " "*attr");
+.fi
+.SH "DESCRIPTION"
+.B ibv_poll_cq_ex()
+polls the CQ
+.I cq
+for work completions according to the pointer argument
+.I attr
+of type struct ibv_poll_cq_ex_attr which is defined in <infiniband/verbs.h>.
+.PP
+.nf
+struct ibv_poll_cq_ex_attr {
+.in +8
+unsigned int    max_entries; /* the number of maximum wc_ex entries in array to return */
+uint32_t        comp_mask;   /* which commands extensions are available */
+.in -8
+};
+
+The first work completions (up to max_entries) are returned by this command.
+.I wc\fR is an array of type struct ibv_wc_ex which is also defined in <infiniband/verbs.h>.
+.PP
+.nf
+struct ibv_wc_ex {
+.in +8
+uint64_t                wr_id;       /* ID of the completed Work Request (WR) */
+uint64_t                wc_flags;    /* Combination of ibv_wc_flags_ex flags. The IBV_WC_EX_WITH_XXX flags
+                                      * dynamically define the fields exist in buffer[0] and hence defines the ibv_wc_ex's size.
+                                      */
+uint32_t                status;      /* Status of the operation */
+uint32_t                opcode;      /* Operation type specified in the completed WR */
+uint32_t                vendor_err;  /* Vendor error syndrome */
+uint32_t                reserved;    /* For future usage
+uint8_t                 buffer[0];   /* Other attributes, as described in wc_flags */
+.in -8
+};
+
+enum ibv_wc_flags_ex {
+.in +8
+IBV_WC_EX_GRH                           = 1 << 0,  /* GRH is present (valid only for UD QPs) */
+IBV_WC_EX_IMM                           = 1 << 1,  /* Action was carried with immediate data */
+IBV_WC_EX_IP_CSUM_OK                    = 1 << 2,  /* TCP/UDP checksum over IPv4 were verified */
+IBV_WC_EX_INV                           = 1 << 3,  /* Action was carried with invalidate immediate data */
+IBV_WC_EX_WITH_BYTE_LEN                 = 1 << 4,  /* The returned wc_ex contains byte_len field */
+IBV_WC_EX_WITH_IMM                      = 1 << 5,  /* The returned wc_ex contains imm_data field */
+IBV_WC_EX_WITH_QP_NUM                   = 1 << 6,  /* The returned wc_ex contains qp_num field */
+IBV_WC_EX_WITH_SRC_QP                   = 1 << 7,  /* The returned wc_ex contains src_qp field */
+IBV_WC_EX_WITH_PKEY_INDEX               = 1 << 8,  /* The returned wc_ex contains pkey_index field */
+IBV_WC_EX_WITH_SLID                     = 1 << 9,  /* The returned wc_ex contains slid field */
+IBV_WC_EX_WITH_SL                       = 1 << 10, /* The returned wc_ex contains sl field */
+IBV_WC_EX_WITH_DLID_PATH_BITS           = 1 << 11, /* The returned wc_ex contains dlid_path_bits field */
+IBV_WC_EX_WITH_COMPLETION_TIMESTAMP     = 1 << 12, /* The returned wc_ex contains completion_timestmap field */
+.in -8
+};
+
+.fi
+wc_flags describes which of the fields in buffer[0] have a valid value. The order of these fields and sizes are always as follows.
+Future fields will be added after the last field from their size.
+
+.nf
+uint64_t        completion_timestamp; /* Raw timestamp of completion. Implementation defined. Can't be relied upon to have any ordering value between more than one driver/hca */
+uint32_t        byte_len;
+uint32_t        imm_data;
+uint32_t        qp_num;
+uint32_t        src_qp;
+uint16_t        pkey_index;
+uint16_t        slid;
+uint8_t         sl;
+uint8_t         dlid_path_bits;
+
+.fi
+Upon creating a CQ using ibv_create_cq_ex, the required wc_flags are given.
+The user is advised to build a custom wc_ex struct that represents those field.
+For example, if
+.I IBV_WC_EX_WITH_QP_NUM\fR and
+.I IBV_WC_EX_WITH_DLID_PATH_BITS\fR are given to ibv_create_cq_ex in wc_flags parameter, the user is advised to build the following struct:
+
+.nf
+struct user_custom_wc {
+.in +8
+struct ibv_wc_ex     wc_ex;
+uint32_t             qp_num;
+uint8_t              dlid_path_bits;
+.in -8
+}
+
+.fi
+This allows the user to use ibv_poll_cq_ex with an array of type struct user_custom_cq.
+An alternative is to use the following functions:
+.nf
+static inline size_t ibv_wc_ex_get_offsetof64(uint64_t cq_creation_wc_flags,
+					      enum ibv_wc_flags_ex flag);
+static inline size_t ibv_wc_ex_get_offsetof32(uint64_t cq_creation_wc_flags,
+					      enum ibv_wc_flags_ex flag);
+static inline size_t ibv_wc_ex_get_offsetof16(uint64_t cq_creation_wc_flags,
+					      enum ibv_wc_flags_ex flag);
+static inline size_t ibv_wc_ex_get_offsetof8(uint64_t cq_creation_wc_flags,
+					     enum ibv_wc_flags_ex flag);
+static inline uint64_t ibv_wc_ex_get64(const struct ibv_wc_ex *wc_ex,
+				       uint64_t cq_creation_wc_flags,
+				       enum ibv_wc_flags_ex flag);
+static inline uint32_t ibv_wc_ex_get32(const struct ibv_wc_ex *wc_ex,
+				       uint64_t cq_creation_wc_flags,
+				       enum ibv_wc_flags_ex flag);
+static inline uint16_t ibv_wc_ex_get16(const struct ibv_wc_ex *wc_ex,
+				       uint64_t cq_creation_wc_flags,
+				       enum ibv_wc_flags_ex flag);
+static inline uint8_t ibv_wc_ex_get8(const struct ibv_wc_ex *wc_ex,
+				     uint64_t cq_creation_wc_flags,
+				     enum ibv_wc_flags_ex flag);
+static inline size_t ibv_wc_ex_get_size(uint64_t flags);
+
+.fi
+.PP
+The
+.I ibv_wc_ex_get_offsetxx\fR functions get the wc_flags that were used to create the CQ with and return the offset of the field in the returned WC.
+It's assumed that this flag exists in the returned wc_ex (meaning wc_ex->wc_flags & flag != 0).
+.PP
+.I ibv_wc_ex_getxx\fR functions return the value of this field given in flag. The same assumption as in
+.I ibv_wc_ex_get_offsetxx\fR also holds here.
+xx refers to the size of the required fields in bits. This size must match the sizes described above.
+.PP
+.I ibv_wc_ex_get_size\fR returns the size of a WC which is returned from CQs created with wc_flags given as the flags parameter.
+
+Users could use these functions by defining an array of ibv_wc_ex_get_size bytes:
+
+.nf
+struct ibv_wc_ex     *wc_ex = calloc(num_entries, ibv_wc_ex_get_size(cq_wc_creation_flags));
+
+.fi
+This array could be used for ibv_poll_cq_ex.
+After calling ibv_poll_cq_ex, the fields could be extracted using:
+
+.nf
+if (wc_ex->wc_flags & IBV_WC_EX_WITH_SL) {
+.in +8
+uint8_t sl = ibv_wc_ex_get8(wc_ex, cq_wc_creation_flags, IBV_WC_EX_WITH_SL);
+.in -8
+}
+
+.fi
+Please note that in x86, using SSE4.2 machine with "-mpopcount" compilation flag will cause these functions to run faster.
+
+Fields that aren't available for a completion won't be indicated in wc_ex->wc_flags and their values are undefined.
+
+.PP
+Not all
+.I wc\fR attributes are always valid. If the completion status is other than
+.B IBV_WC_SUCCESS\fR,
+only the following attributes are valid: wr_id, status, qp_num (if its respective wc_flags bit is set), and vendor_err.
+.SH "RETURN VALUE"
+On success,
+.B ibv_poll_cq_ex()
+returns a non-negative value equal to the number of completions
+found.  On failure, a negative value is returned.
+.SH "NOTES"
+.PP
+Each polled completion is removed from the CQ and cannot be returned to it.
+.PP
+The user should consume work completions at a rate that prevents CQ
+overrun from occurrence.  In case of a CQ overrun, the async event
+.B IBV_EVENT_CQ_ERR
+will be triggered, and the CQ cannot be used.
+.SH "SEE ALSO"
+.BR ibv_poll_cq (3),
+.BR ibv_post_send (3),
+.BR ibv_post_recv (3)
+.SH "AUTHORS"
+.TP
+Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
+.TP
+Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
diff --git a/man/ibv_query_device_ex.3 b/man/ibv_query_device_ex.3
index 1f483d2..db12c2b 100644
--- a/man/ibv_query_device_ex.3
+++ b/man/ibv_query_device_ex.3
@@ -22,8 +22,10 @@ is a pointer to an ibv_device_attr_ex struct, as defined in <infiniband/verbs.h>
 struct ibv_device_attr_ex {
 .in +8
 struct ibv_device_attr orig_attr;
-uint32_t               comp_mask;              /* Compatibility mask that defines which of the following variables are valid */
-struct ibv_odp_caps    odp_caps;               /* On-Demand Paging capabilities */
+uint32_t               comp_mask;                  /* Compatibility mask that defines which of the following variables are valid */
+struct ibv_odp_caps    odp_caps;                   /* On-Demand Paging capabilities */
+uint64_t               completion_timestamp_mask;  /* Completion timestamp mask (0 = unsupported) */
+uint64_t               hca_core_clock;             /* The frequency (in kHZ) of the HCA (0 = unsupported) */
 .in -8
 };
 
diff --git a/man/ibv_query_rt_values_ex.3 b/man/ibv_query_rt_values_ex.3
new file mode 100644
index 0000000..fcc460c
--- /dev/null
+++ b/man/ibv_query_rt_values_ex.3
@@ -0,0 +1,50 @@
+.\" -*- nroff -*-
+.\"
+.TH IBV_QUERY_RT_VALUES_EX 3 2016-2-20 libibverbs "Libibverbs Programmer's Manual"
+.SH "NAME"
+ibv_query_rt_values_ex \- query an RDMA device for some real time values
+.SH "SYNOPSIS"
+.nf
+.B #include <infiniband/verbs.h>
+.sp
+.BI "int ibv_query_rt_values_ex(struct ibv_context " "*context",
+.BI "                           struct ibv_values_ex " "*values" );
+.fi
+.SH "DESCRIPTION"
+.B ibv_query_rt_values_ex()
+returns certain real time values of a device
+.I context\fR.
+The argument
+.I attr
+is a pointer to an ibv_device_attr_ex struct, as defined in <infiniband/verbs.h>.
+.PP
+.nf
+struct ibv_values_ex {
+.in +8
+uint32_t             comp_mask;    /* Compatibility mask that defines the query/queried fields [in/out] */
+struct timespec      raw_clock;    /* HW raw clock */
+.in -8
+};
+
+enum ibv_values_mask {
+        IBV_VALUES_MASK_RAW_CLOCK = 1 << 0, /* HW raw clock */
+};
+
+.fi
+.SH "RETURN VALUE"
+.B ibv_query_rt_values_ex()
+returns 0 on success, or the value of errno on failure (which indicates the failure reason).
+.SH "NOTES"
+This extension verb only calls the provider, the provider has to query this value somehow and mark
+the queried values in the comp_mask field.
+.SH "SEE ALSO"
+.BR ibv_query_device (3),
+.BR ibv_open_device (3),
+.BR ibv_query_port (3),
+.BR ibv_query_pkey (3),
+.BR ibv_query_gid (3)
+.SH "AUTHORS"
+.TP
+Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
+.TP
+Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH V1 libibverbs 8/8] Add timestamp support in rc_pingpong
       [not found] ` <1456306924-31298-1-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (6 preceding siblings ...)
  2016-02-24  9:42   ` [PATCH V1 libibverbs 7/8] Man pages for time stamping support Yishai Hadas
@ 2016-02-24  9:42   ` Yishai Hadas
  7 siblings, 0 replies; 23+ messages in thread
From: Yishai Hadas @ 2016-02-24  9:42 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	yishaih-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w,
	majd-VPRAkNaXOzVWk0Htik3J/w, talal-VPRAkNaXOzVWk0Htik3J/w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w

Add timestamp support in rc_pingpong, it can serve as some example
of using ibv_create_cq_ex and ibv_poll_cq_ex verbs.

Signed-off-by: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Reviewed-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 examples/rc_pingpong.c | 130 +++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 114 insertions(+), 16 deletions(-)

diff --git a/examples/rc_pingpong.c b/examples/rc_pingpong.c
index 90a8320..bffcfe5 100644
--- a/examples/rc_pingpong.c
+++ b/examples/rc_pingpong.c
@@ -49,6 +49,14 @@
 
 #include "pingpong.h"
 
+#ifndef max
+#define max(x, y) (((x) > (y)) ? (x) : (y))
+#endif
+
+#ifndef min
+#define min(x, y) (((x) < (y)) ? (x) : (y))
+#endif
+
 enum {
 	PINGPONG_RECV_WRID = 1,
 	PINGPONG_SEND_WRID = 2,
@@ -56,6 +64,7 @@ enum {
 
 static int page_size;
 static int use_odp;
+static int use_ts;
 
 struct pingpong_context {
 	struct ibv_context	*context;
@@ -70,6 +79,7 @@ struct pingpong_context {
 	int			 rx_depth;
 	int			 pending;
 	struct ibv_port_attr     portinfo;
+	uint64_t		 completion_timestamp_mask;
 };
 
 struct pingpong_dest {
@@ -357,7 +367,7 @@ static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev, int size,
 		goto clean_comp_channel;
 	}
 
-	if (use_odp) {
+	if (use_odp || use_ts) {
 		const uint32_t rc_caps_mask = IBV_ODP_SUPPORT_SEND |
 					      IBV_ODP_SUPPORT_RECV;
 		struct ibv_device_attr_ex attrx;
@@ -367,12 +377,22 @@ static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev, int size,
 			goto clean_comp_channel;
 		}
 
-		if (!(attrx.odp_caps.general_caps & IBV_ODP_SUPPORT) ||
-		    (attrx.odp_caps.per_transport_caps.rc_odp_caps & rc_caps_mask) != rc_caps_mask) {
-			fprintf(stderr, "The device isn't ODP capable or does not support RC send and receive with ODP\n");
-			goto clean_comp_channel;
+		if (use_odp) {
+			if (!(attrx.odp_caps.general_caps & IBV_ODP_SUPPORT) ||
+			    (attrx.odp_caps.per_transport_caps.rc_odp_caps & rc_caps_mask) != rc_caps_mask) {
+				fprintf(stderr, "The device isn't ODP capable or does not support RC send and receive with ODP\n");
+				goto clean_comp_channel;
+			}
+			access_flags |= IBV_ACCESS_ON_DEMAND;
+		}
+
+		if (use_ts) {
+			if (!attrx.completion_timestamp_mask) {
+				fprintf(stderr, "The device isn't completion timestamp capable\n");
+				goto clean_comp_channel;
+			}
+			ctx->completion_timestamp_mask = attrx.completion_timestamp_mask;
 		}
-		access_flags |= IBV_ACCESS_ON_DEMAND;
 	}
 	ctx->mr = ibv_reg_mr(ctx->pd, ctx->buf, size, access_flags);
 
@@ -381,8 +401,23 @@ static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev, int size,
 		goto clean_pd;
 	}
 
-	ctx->cq = ibv_create_cq(ctx->context, rx_depth + 1, NULL,
+	if (use_ts) {
+		struct ibv_cq_init_attr_ex attr_ex = {
+			.cqe = rx_depth + 1,
+			.cq_context = NULL,
+			.channel = ctx->channel,
+			.comp_vector = 0,
+			.comp_mask   = IBV_CQ_INIT_ATTR_FLAGS,
+			.flags = IBV_CREATE_CQ_ATTR_COMPLETION_TIMESTAMP,
+			.wc_flags = IBV_WC_EX_WITH_COMPLETION_TIMESTAMP
+		};
+
+		ctx->cq = ibv_create_cq_ex(ctx->context, &attr_ex);
+	} else {
+		ctx->cq = ibv_create_cq(ctx->context, rx_depth + 1, NULL,
 				ctx->channel, 0);
+	}
+
 	if (!ctx->cq) {
 		fprintf(stderr, "Couldn't create CQ\n");
 		goto clean_mr;
@@ -561,6 +596,7 @@ static void usage(const char *argv0)
 	printf("  -e, --events           sleep on CQ events (default poll)\n");
 	printf("  -g, --gid-idx=<gid index> local port gid index\n");
 	printf("  -o, --odp		    use on demand paging\n");
+	printf("  -t, --ts	            get CQE with timestamp\n");
 }
 
 int main(int argc, char *argv[])
@@ -586,6 +622,12 @@ int main(int argc, char *argv[])
 	int                      sl = 0;
 	int			 gidx = -1;
 	char			 gid[33];
+	unsigned int		 comp_recv_max_time_delta = 0;
+	unsigned int		 comp_recv_min_time_delta = 0xffffffff;
+	uint64_t		 comp_recv_total_time_delta = 0;
+	uint64_t		 comp_recv_prev_time = 0;
+	int			 last_comp_with_ts = 0;
+	unsigned int		 comp_with_time_iters = 0;
 
 	srand48(getpid() * time(NULL));
 
@@ -604,11 +646,12 @@ int main(int argc, char *argv[])
 			{ .name = "events",   .has_arg = 0, .val = 'e' },
 			{ .name = "gid-idx",  .has_arg = 1, .val = 'g' },
 			{ .name = "odp",      .has_arg = 0, .val = 'o' },
+			{ .name = "ts",       .has_arg = 0, .val = 't' },
 			{ 0 }
 		};
 
-		c = getopt_long(argc, argv, "p:d:i:s:m:r:n:l:eg:o",
-							long_options, NULL);
+		c = getopt_long(argc, argv, "p:d:i:s:m:r:n:l:eg:ot",
+				long_options, NULL);
 
 		if (c == -1)
 			break;
@@ -669,6 +712,9 @@ int main(int argc, char *argv[])
 		case 'o':
 			use_odp = 1;
 			break;
+		case 't':
+			use_ts = 1;
+			break;
 
 		default:
 			usage(argv[0]);
@@ -812,10 +858,24 @@ int main(int argc, char *argv[])
 
 		{
 			struct ibv_wc wc[2];
+			struct {
+				struct ibv_wc_ex wc_ex;
+				uint64_t	 completion_timestamp;
+			} wc_with_ts[2];
+			struct ibv_poll_cq_ex_attr attr = {
+				.comp_mask = 0,
+				.max_entries = 2
+			};
+			enum ibv_wc_status       status;
+			uint64_t                 wr_id;
 			int ne, i;
 
 			do {
-				ne = ibv_poll_cq(ctx->cq, 2, wc);
+				if (use_ts)
+					ne = ibv_poll_cq_ex(ctx->cq, &wc_with_ts[0].wc_ex, &attr);
+				else
+					ne = ibv_poll_cq(ctx->cq, 2, wc);
+
 				if (ne < 0) {
 					fprintf(stderr, "poll CQ failed %d\n", ne);
 					return 1;
@@ -824,14 +884,22 @@ int main(int argc, char *argv[])
 			} while (!use_event && ne < 1);
 
 			for (i = 0; i < ne; ++i) {
-				if (wc[i].status != IBV_WC_SUCCESS) {
+				if (use_ts) {
+					status = wc_with_ts[i].wc_ex.status;
+					wr_id = wc_with_ts[i].wc_ex.wr_id;
+				} else {
+					status = wc[i].status;
+					wr_id = wc[i].wr_id;
+				}
+
+				if (status != IBV_WC_SUCCESS) {
 					fprintf(stderr, "Failed status %s (%d) for wr_id %d\n",
-						ibv_wc_status_str(wc[i].status),
-						wc[i].status, (int) wc[i].wr_id);
+						ibv_wc_status_str(status),
+						status, (int)wr_id);
 					return 1;
 				}
 
-				switch ((int) wc[i].wr_id) {
+				switch ((int)wr_id) {
 				case PINGPONG_SEND_WRID:
 					++scnt;
 					break;
@@ -848,15 +916,39 @@ int main(int argc, char *argv[])
 					}
 
 					++rcnt;
+					if (use_ts && (wc_with_ts[i].wc_ex.wc_flags &
+						       IBV_WC_EX_WITH_COMPLETION_TIMESTAMP)) {
+						if (last_comp_with_ts) {
+							uint64_t delta;
+
+							/* checking whether the clock was wrapped around */
+							if (wc_with_ts[i].completion_timestamp >= comp_recv_prev_time)
+								delta = wc_with_ts[i].completion_timestamp - comp_recv_prev_time;
+							else
+								delta = ctx->completion_timestamp_mask - comp_recv_prev_time +
+										wc_with_ts[i].completion_timestamp + 1;
+
+							comp_recv_max_time_delta = max(comp_recv_max_time_delta, delta);
+							comp_recv_min_time_delta = min(comp_recv_min_time_delta, delta);
+							comp_recv_total_time_delta += delta;
+							comp_with_time_iters++;
+						}
+
+						comp_recv_prev_time = wc_with_ts[i].completion_timestamp;
+						last_comp_with_ts = 1;
+					} else {
+						last_comp_with_ts = 0;
+					}
+
 					break;
 
 				default:
 					fprintf(stderr, "Completion for unknown wr_id %d\n",
-						(int) wc[i].wr_id);
+						(int)wr_id);
 					return 1;
 				}
 
-				ctx->pending &= ~(int) wc[i].wr_id;
+				ctx->pending &= ~(int)wr_id;
 				if (scnt < iters && !ctx->pending) {
 					if (pp_post_send(ctx)) {
 						fprintf(stderr, "Couldn't post send\n");
@@ -883,6 +975,12 @@ int main(int argc, char *argv[])
 		       bytes, usec / 1000000., bytes * 8. / usec);
 		printf("%d iters in %.2f seconds = %.2f usec/iter\n",
 		       iters, usec / 1000000., usec / iters);
+
+		if (use_ts && comp_with_time_iters) {
+			printf("Max receive completion clock cycles = %u\n", comp_recv_max_time_delta);
+			printf("Min receive completion clock cycles = %u\n", comp_recv_min_time_delta);
+			printf("Average receive completion clock cycles = %f\n", (double)comp_recv_total_time_delta / comp_with_time_iters);
+		}
 	}
 
 	ibv_ack_cq_events(ctx->cq, num_cq_events);
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH V1 libibverbs 1/8] Add ibv_poll_cq_ex verb
       [not found]     ` <1456306924-31298-2-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2016-02-24 19:02       ` Jason Gunthorpe
       [not found]         ` <20160224190230.GA10588-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Jason Gunthorpe @ 2016-02-24 19:02 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, matanb-VPRAkNaXOzVWk0Htik3J/w,
	majd-VPRAkNaXOzVWk0Htik3J/w, talal-VPRAkNaXOzVWk0Htik3J/w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w

On Wed, Feb 24, 2016 at 11:41:57AM +0200, Yishai Hadas wrote:

> +enum {
> +	IBV_WC_EX_WITH_8BIT_FIELDS = IBV_WC_EX_WITH_SL			|
> +				     IBV_WC_EX_WITH_DLID_PATH_BITS
> +};
> +
> +struct ibv_wc_ex {
> +	uint64_t		wr_id;
> +	/* wc_flags is a combination of ibv_wc_flags_ex flags. The IBV_WC_EX_WITH_XXX
> +	 * flags dynamically define the valid fields in buffer[0].
> +	 */
> +	uint64_t		wc_flags;
> +	uint32_t		status;
> +	uint32_t		opcode;
> +	uint32_t		vendor_err;
> +	uint32_t		reserved;
> +	uint8_t			buffer[0];
> +};

Um, maybe you should give an example of how on earth anyone is
supposed to use this, all of this looks like a *really bad idea* to
me.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V1 libibverbs 1/8] Add ibv_poll_cq_ex verb
       [not found]         ` <20160224190230.GA10588-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2016-02-25  8:01           ` Yishai Hadas
       [not found]             ` <56CEB4C7.60607-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Yishai Hadas @ 2016-02-25  8:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yishai Hadas, dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, matanb-VPRAkNaXOzVWk0Htik3J/w,
	majd-VPRAkNaXOzVWk0Htik3J/w, talal-VPRAkNaXOzVWk0Htik3J/w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w

On 2/24/2016 9:02 PM, Jason Gunthorpe wrote:
> On Wed, Feb 24, 2016 at 11:41:57AM +0200, Yishai Hadas wrote:
>
>> +enum {
>> +	IBV_WC_EX_WITH_8BIT_FIELDS = IBV_WC_EX_WITH_SL			|
>> +				     IBV_WC_EX_WITH_DLID_PATH_BITS
>> +};
>> +
>> +struct ibv_wc_ex {
>> +	uint64_t		wr_id;
>> +	/* wc_flags is a combination of ibv_wc_flags_ex flags. The IBV_WC_EX_WITH_XXX
>> +	 * flags dynamically define the valid fields in buffer[0].
>> +	 */
>> +	uint64_t		wc_flags;
>> +	uint32_t		status;
>> +	uint32_t		opcode;
>> +	uint32_t		vendor_err;
>> +	uint32_t		reserved;
>> +	uint8_t			buffer[0];
>> +};
>
> Um, maybe you should give an example of how on earth anyone is
> supposed to use this, all of this looks like a *really bad idea* to
> me.
>

The last patch is this series is a clear example of a typical usage of.
It was added as part of rc_pingpong, please look at.

In addition, there are detailed man pages that describe the idea/usage 
of the new verbs around, see patch #7.


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V1 libibverbs 1/8] Add ibv_poll_cq_ex verb
       [not found]             ` <56CEB4C7.60607-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2016-02-25 17:05               ` Jason Gunthorpe
       [not found]                 ` <20160225170541.GA22513-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Jason Gunthorpe @ 2016-02-25 17:05 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: Yishai Hadas, dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, matanb-VPRAkNaXOzVWk0Htik3J/w,
	majd-VPRAkNaXOzVWk0Htik3J/w, talal-VPRAkNaXOzVWk0Htik3J/w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w

On Thu, Feb 25, 2016 at 10:01:11AM +0200, Yishai Hadas wrote:
> On 2/24/2016 9:02 PM, Jason Gunthorpe wrote:
> >On Wed, Feb 24, 2016 at 11:41:57AM +0200, Yishai Hadas wrote:
> >
> >>+enum {
> >>+	IBV_WC_EX_WITH_8BIT_FIELDS = IBV_WC_EX_WITH_SL			|
> >>+				     IBV_WC_EX_WITH_DLID_PATH_BITS
> >>+};
> >>+
> >>+struct ibv_wc_ex {
> >>+	uint64_t		wr_id;
> >>+	/* wc_flags is a combination of ibv_wc_flags_ex flags. The IBV_WC_EX_WITH_XXX
> >>+	 * flags dynamically define the valid fields in buffer[0].
> >>+	 */
> >>+	uint64_t		wc_flags;
> >>+	uint32_t		status;
> >>+	uint32_t		opcode;
> >>+	uint32_t		vendor_err;
> >>+	uint32_t		reserved;
> >>+	uint8_t			buffer[0];
> >>+};
> >
> >Um, maybe you should give an example of how on earth anyone is
> >supposed to use this, all of this looks like a *really bad idea* to
> >me.
> >
> 
> The last patch is this series is a clear example of a typical usage of.
> It was added as part of rc_pingpong, please look at.
> 
> In addition, there are detailed man pages that describe the idea/usage of
> the new verbs around, see patch #7.

The manual page and rc_pingpong do different things.

This still looks like a horrible user API.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V1 libibverbs 1/8] Add ibv_poll_cq_ex verb
       [not found]                 ` <20160225170541.GA22513-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2016-02-28 16:03                   ` Matan Barak (External)
       [not found]                     ` <56D31A58.20205-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Matan Barak (External) @ 2016-02-28 16:03 UTC (permalink / raw)
  To: Jason Gunthorpe, Yishai Hadas
  Cc: Yishai Hadas, dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, majd-VPRAkNaXOzVWk0Htik3J/w,
	talal-VPRAkNaXOzVWk0Htik3J/w, ogerlitz-VPRAkNaXOzVWk0Htik3J/w,
	cl-vYTEC60ixJUAvxtiuMwx3w

On 25/02/2016 19:05, Jason Gunthorpe wrote:
> On Thu, Feb 25, 2016 at 10:01:11AM +0200, Yishai Hadas wrote:
>> On 2/24/2016 9:02 PM, Jason Gunthorpe wrote:
>>> On Wed, Feb 24, 2016 at 11:41:57AM +0200, Yishai Hadas wrote:
>>>
>>>> +enum {
>>>> +	IBV_WC_EX_WITH_8BIT_FIELDS = IBV_WC_EX_WITH_SL			|
>>>> +				     IBV_WC_EX_WITH_DLID_PATH_BITS
>>>> +};
>>>> +
>>>> +struct ibv_wc_ex {
>>>> +	uint64_t		wr_id;
>>>> +	/* wc_flags is a combination of ibv_wc_flags_ex flags. The IBV_WC_EX_WITH_XXX
>>>> +	 * flags dynamically define the valid fields in buffer[0].
>>>> +	 */
>>>> +	uint64_t		wc_flags;
>>>> +	uint32_t		status;
>>>> +	uint32_t		opcode;
>>>> +	uint32_t		vendor_err;
>>>> +	uint32_t		reserved;
>>>> +	uint8_t			buffer[0];
>>>> +};
>>>
>>> Um, maybe you should give an example of how on earth anyone is
>>> supposed to use this, all of this looks like a *really bad idea* to
>>> me.
>>>
>>
>> The last patch is this series is a clear example of a typical usage of.
>> It was added as part of rc_pingpong, please look at.
>>
>> In addition, there are detailed man pages that describe the idea/usage of
>> the new verbs around, see patch #7.
>
> The manual page and rc_pingpong do different things.
>

There are two options to use the API. They are both described well in 
the man page. This example uses the more trivial and easy-to-use way.

> This still looks like a horrible user API.
>

Why do you think so?

We want to present a completion structure that could be extended, 
without adding the overhead of setting unnecessary fields and without 
risking that adding future attributes will make the completion be bigger 
than a cache line (and create a great penalty). This also came up in the 
earlier libibverbs API discussion.

We could introduce a verb for every new field (poll_cq_ts), but what 
will a user do if he wants new_feature_a and new_feature_b from the same 
completion? In addition, polluting libibverbs with so many polling verbs 
is (to say the least) awkward.

The second option we could think of is to give the user a way to define 
which fields he would like to get - not wasting time, space and cache 
lines on fields he doesn't want. If we could do that automatically, it 
would of course be better (for example, a smart JIT or a compiler that 
compiles the user-space application along with the vendor libraries 
might be able to do that automatically. Even then, in low level 
languages it won't optimize the data layout), but since in the current 
way we work I don't see a way we could do that, I think it's best to add 
that to the API.

The user defines which fields interest him via create_cq_ex and then get 
a compact struct that consists only of the fields he need, sorted in a 
way that avoids holes if possible (alignment wise). This API guaranteed 
to be extensible without penalty and be backward compatible. In respect 
to the discussion we had on list when the kernel part was introduced, we 
believe this API represents a valid choice to fill all the requirements.

> Jason
>

Matan
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V1 libibverbs 1/8] Add ibv_poll_cq_ex verb
       [not found]                     ` <56D31A58.20205-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2016-02-29 19:17                       ` Jason Gunthorpe
       [not found]                         ` <20160229191734.GA15042-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Jason Gunthorpe @ 2016-02-29 19:17 UTC (permalink / raw)
  To: Matan Barak (External)
  Cc: Yishai Hadas, Yishai Hadas, dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, majd-VPRAkNaXOzVWk0Htik3J/w,
	talal-VPRAkNaXOzVWk0Htik3J/w, ogerlitz-VPRAkNaXOzVWk0Htik3J/w,
	cl-vYTEC60ixJUAvxtiuMwx3w

On Sun, Feb 28, 2016 at 06:03:36PM +0200, Matan Barak (External) wrote:

> >The manual page and rc_pingpong do different things.
> 
> There are two options to use the API. They are both described well in the
> man page. This example uses the more trivial and easy-to-use way.

Gross.

> >This still looks like a horrible user API.
> >
> 
> Why do you think so?

By my count it scores about a 2 on the Rusty API scale, which is
pretty bad.

> We want to present a completion structure that could be extended, without
> adding the overhead of setting unnecessary fields and without risking that
> adding future attributes will make the completion be bigger than a cache
> line (and create a great penalty). This also came up in the earlier
> libibverbs API discussion.

This series trade cache line efficiency for computation efficiency,
and it isn't at all clear to me that is going to be a win. Better
include some good benchmarks.

Hint: Cacheline size is much less important if the cache lines are
resident and dirty, and the driver writes make them dirty. The driver
should be dirtying them in a way that avoids a RMW penalty.

> We could introduce a verb for every new field (poll_cq_ts), but what will a
> user do if he wants new_feature_a and new_feature_b from the same
> completion? In addition, polluting libibverbs with so many polling verbs is
> (to say the least) awkward.

As opposed to asking the user to hand craft structures and use ugly
awkward macros?

I'd say give it another think and try and make the user facing API
saner.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V1 libibverbs 1/8] Add ibv_poll_cq_ex verb
       [not found]                         ` <20160229191734.GA15042-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2016-03-01  8:52                           ` Matan Barak (External)
       [not found]                             ` <56D55851.60206-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Matan Barak (External) @ 2016-03-01  8:52 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yishai Hadas, Yishai Hadas, dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, majd-VPRAkNaXOzVWk0Htik3J/w,
	talal-VPRAkNaXOzVWk0Htik3J/w, ogerlitz-VPRAkNaXOzVWk0Htik3J/w,
	cl-vYTEC60ixJUAvxtiuMwx3w

On 29/02/2016 21:17, Jason Gunthorpe wrote:
> On Sun, Feb 28, 2016 at 06:03:36PM +0200, Matan Barak (External) wrote:
>
>>> The manual page and rc_pingpong do different things.
>>
>> There are two options to use the API. They are both described well in the
>> man page. This example uses the more trivial and easy-to-use way.
>
> Gross.
>
>>> This still looks like a horrible user API.
>>>
>>
>> Why do you think so?
>
> By my count it scores about a 2 on the Rusty API scale, which is
> pretty bad.
>

This is like writing something that means nothing....
Opinions are always appreciated if they contain valid arguments, so please.

>> We want to present a completion structure that could be extended, without
>> adding the overhead of setting unnecessary fields and without risking that
>> adding future attributes will make the completion be bigger than a cache
>> line (and create a great penalty). This also came up in the earlier
>> libibverbs API discussion.
>
> This series trade cache line efficiency for computation efficiency,
> and it isn't at all clear to me that is going to be a win. Better
> include some good benchmarks.
>

WRONG. Which computational overhead are you talking about?
The user-space driver could optimize the common cases and eliminate 
*almost* all extra computational overhead (this is done in libmlx4 and 
libmlx5 user-space drivers).
Even if there was such an overhead, this is not related to the API. 
Future hardwares could do that even entirely in hardware eliminating 
this all together.

> Hint: Cacheline size is much less important if the cache lines are
> resident and dirty, and the driver writes make them dirty. The driver
> should be dirtying them in a way that avoids a RMW penalty.
>

The user-space driver writes one completion at a time, which (depending 
on the user configuration) is probably less than a cache line. Why do 
you think it's worse than how ibv_poll_cq behaves? The way the 
user-space driver optimizes/dirties the cache line is unrelated to the API.

>> We could introduce a verb for every new field (poll_cq_ts), but what will a
>> user do if he wants new_feature_a and new_feature_b from the same
>> completion? In addition, polluting libibverbs with so many polling verbs is
>> (to say the least) awkward.
>
> As opposed to asking the user to hand craft structures and use ugly
> awkward macros?
>

Function per every completion fields permutation is uglier for sure 
(there are currently 9 optional completion fields in this proposal, you 
could easily do the math to calculate how many permutations we have).

> I'd say give it another think and try and make the user facing API
> saner.
>

Lets think of the main requirement here - an extensible way of getting 
completions with only user required data. So either you give an explicit 
function for every permutation - which is pretty awful (to say the 
least) or you have a way to say "this is what I want" and "if the vendor 
reported this field, please give it to me".

Since it could be possible telling a future hardware "I'm only 
interested in these fields in a CQ", the second approach seems to be better.

> Jason
>

Matan
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V1 libibverbs 1/8] Add ibv_poll_cq_ex verb
       [not found]                             ` <56D55851.60206-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2016-03-01 16:10                               ` Christoph Lameter
  2016-03-01 17:24                               ` Jason Gunthorpe
  1 sibling, 0 replies; 23+ messages in thread
From: Christoph Lameter @ 2016-03-01 16:10 UTC (permalink / raw)
  To: Matan Barak (External)
  Cc: Jason Gunthorpe, Yishai Hadas, Yishai Hadas,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, majd-VPRAkNaXOzVWk0Htik3J/w,
	talal-VPRAkNaXOzVWk0Htik3J/w, ogerlitz-VPRAkNaXOzVWk0Htik3J/w

On Tue, 1 Mar 2016, Matan Barak (External) wrote:

> Function per every completion fields permutation is uglier for sure (there are
> currently 9 optional completion fields in this proposal, you could easily do
> the math to calculate how many permutations we have).
>
> > I'd say give it another think and try and make the user facing API
> > saner.
> >
>
> Lets think of the main requirement here - an extensible way of getting
> completions with only user required data. So either you give an explicit
> function for every permutation - which is pretty awful (to say the least) or
> you have a way to say "this is what I want" and "if the vendor reported this
> field, please give it to me".
>
> Since it could be possible telling a future hardware "I'm only interested in
> these fields in a CQ", the second approach seems to be better.

I like the approach in this patch and the example shows me how to
integrate that easily into our code base. Please lets get this merged.

Being able to customize the CQ contents so that they fit into a cacheline
is pretty important. We have high throughput packet capture devices that
are affected by slight changes to the cache footprint. We have seen
significant regressions in the past because fields were added that we did
not see to be important. I like it.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V1 libibverbs 1/8] Add ibv_poll_cq_ex verb
       [not found]                             ` <56D55851.60206-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2016-03-01 16:10                               ` Christoph Lameter
@ 2016-03-01 17:24                               ` Jason Gunthorpe
       [not found]                                 ` <20160301172448.GA24031-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  1 sibling, 1 reply; 23+ messages in thread
From: Jason Gunthorpe @ 2016-03-01 17:24 UTC (permalink / raw)
  To: Matan Barak (External)
  Cc: Yishai Hadas, Yishai Hadas, dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, majd-VPRAkNaXOzVWk0Htik3J/w,
	talal-VPRAkNaXOzVWk0Htik3J/w, ogerlitz-VPRAkNaXOzVWk0Htik3J/w,
	cl-vYTEC60ixJUAvxtiuMwx3w

On Tue, Mar 01, 2016 at 10:52:33AM +0200, Matan Barak (External) wrote:
> >By my count it scores about a 2 on the Rusty API scale, which is
> >pretty bad.
> 
> This is like writing something that means nothing....

If you don't agree with the points on the Rusty scale as what
constitutes good API design then we really have no common ground on
this point at all.

> >This series trade cache line efficiency for computation efficiency,
> >and it isn't at all clear to me that is going to be a win. Better
> >include some good benchmarks.
> >
> 
> WRONG. Which computational overhead are you talking about?

I don't see the driver side posted, but obviously there must be
indexing maths and possibly branching, depending on the design, that
is overhead.

> The user-space driver writes one completion at a time, which (depending on
> the user configuration) is probably less than a cache line. Why do you think
> it's worse than how ibv_poll_cq behaves? The way the user-space driver
> optimizes/dirties the cache line is unrelated to the API.

I didn't say it was worse, I questioned that this should be the
overarching goal when no data to support this has been posted and it
doesn't even make inutitive sense that worrying about the size of
*dirty cache lines* is even very important, (well, within limits).

Christoph may have some data, but I do wonder if his results are
polluted by the current ibv_poll_cq driver implementations which do
trigger RMW overheads.

> Function per every completion fields permutation is uglier for sure (there
> are currently 9 optional completion fields in this proposal, you could
> easily do the math to calculate how many permutations we have).

Why would you do every permutation?

Try a non-memcpy API design. Provide an opaque 'cursor' and functions
to return certain data. This has more unconditional branches, but may
avoid branching within the driver and certainly avoid memcpy's and
pretty much all cache line dirtying. Sean had some results that
were positive on this sort of direction.

Justify this horrible API is *necessary* with something concrete, just
not random hand waving and mumbling about cache lines. That means
benchmark several options.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V1 libibverbs 1/8] Add ibv_poll_cq_ex verb
       [not found]                                 ` <20160301172448.GA24031-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2016-03-02  7:34                                   ` Matan Barak (External)
       [not found]                                     ` <56D6979F.6000400-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Matan Barak (External) @ 2016-03-02  7:34 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yishai Hadas, Yishai Hadas, dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, majd-VPRAkNaXOzVWk0Htik3J/w,
	talal-VPRAkNaXOzVWk0Htik3J/w, ogerlitz-VPRAkNaXOzVWk0Htik3J/w,
	cl-vYTEC60ixJUAvxtiuMwx3w

On 01/03/2016 19:24, Jason Gunthorpe wrote:
> On Tue, Mar 01, 2016 at 10:52:33AM +0200, Matan Barak (External) wrote:
>>> By my count it scores about a 2 on the Rusty API scale, which is
>>> pretty bad.
>>
>> This is like writing something that means nothing....
>
> If you don't agree with the points on the Rusty scale as what
> constitutes good API design then we really have no common ground on
> this point at all.
>

We can discuss actual points. That's how things can really progress.

>>> This series trade cache line efficiency for computation efficiency,
>>> and it isn't at all clear to me that is going to be a win. Better
>>> include some good benchmarks.
>>>
>>
>> WRONG. Which computational overhead are you talking about?
>
> I don't see the driver side posted, but obviously there must be
> indexing maths and possibly branching, depending on the design, that
> is overhead.
>

libmlx4 patches were posted on the mailing list awhile ago [1].
One of the patches there optimized ibv_poll_cq_ex for common cases with 
almost zero overhead. Anyway, as I wrote before - this could be 
delegated to a future hardware so I'm not sure it's relevant here.

>> The user-space driver writes one completion at a time, which (depending on
>> the user configuration) is probably less than a cache line. Why do you think
>> it's worse than how ibv_poll_cq behaves? The way the user-space driver
>> optimizes/dirties the cache line is unrelated to the API.
>
> I didn't say it was worse, I questioned that this should be the
> overarching goal when no data to support this has been posted and it
> doesn't even make inutitive sense that worrying about the size of
> *dirty cache lines* is even very important, (well, within limits).
>

We ran ib_send_bw (perftest package) when developing this - one time 
using the regular ibv_poll_cq and another time with ibv_poll_cq_ex and 
got identical results (the difference was tiny and not conclusive), but 
I don't have the actual results right now. This was one of the common 
cases the user-space driver optimized. We could re-run them and post 
results if you want.

> Christoph may have some data, but I do wonder if his results are
> polluted by the current ibv_poll_cq driver implementations which do
> trigger RMW overheads.
>
>> Function per every completion fields permutation is uglier for sure (there
>> are currently 9 optional completion fields in this proposal, you could
>> easily do the math to calculate how many permutations we have).
>
> Why would you do every permutation?
>

Because user-space applications might want to get completion 
time-stamping and size, but not the rest of the fields. They could 
pretty much decide which data matters to them and get only that.

> Try a non-memcpy API design. Provide an opaque 'cursor' and functions
> to return certain data. This has more unconditional branches, but may
> avoid branching within the driver and certainly avoid memcpy's and
> pretty much all cache line dirtying. Sean had some results that
> were positive on this sort of direction.
>

Does the opaque pointer guarantees an aligned access? Who allocates the 
space for the vendor's CQE? Any concrete example?
One of the problems here are that CQEs could be formatted as -
"if QP type is y, then copy the field z from o". Doing that this way may 
result doing the same "if" multiple times. The current approach could 
still avoid memcpy and write straight to the user's buffer.

> Justify this horrible API is *necessary* with something concrete, just
> not random hand waving and mumbling about cache lines. That means
> benchmark several options.
>
> Jason
>

Matan
[1] https://www.spinics.net/lists/linux-rdma/msg29837.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V1 libibverbs 1/8] Add ibv_poll_cq_ex verb
       [not found]                                     ` <56D6979F.6000400-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2016-03-02 18:28                                       ` Jason Gunthorpe
       [not found]                                         ` <20160302182836.GA7084-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Jason Gunthorpe @ 2016-03-02 18:28 UTC (permalink / raw)
  To: Matan Barak (External)
  Cc: Yishai Hadas, Yishai Hadas, dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, majd-VPRAkNaXOzVWk0Htik3J/w,
	talal-VPRAkNaXOzVWk0Htik3J/w, ogerlitz-VPRAkNaXOzVWk0Htik3J/w,
	cl-vYTEC60ixJUAvxtiuMwx3w

On Wed, Mar 02, 2016 at 09:34:55AM +0200, Matan Barak (External) wrote:
> >I didn't say it was worse, I questioned that this should be the
> >overarching goal when no data to support this has been posted and it
> >doesn't even make inutitive sense that worrying about the size of
> >*dirty cache lines* is even very important, (well, within limits).
> >
> 
> We ran ib_send_bw (perftest package) when developing this - one time using
> the regular ibv_poll_cq and another time with ibv_poll_cq_ex and got
> identical results (the difference was tiny and not conclusive), but I don't
> have the actual results right now. This was one of the common cases the
> user-space driver optimized. We could re-run them and post results if you
> want.

So all this ugly API to minimize cache line usage has no measured
performance gain?

You see why I'm concerned ..

> >Try a non-memcpy API design. Provide an opaque 'cursor' and functions
> >to return certain data. This has more unconditional branches, but may
> >avoid branching within the driver and certainly avoid memcpy's and
> >pretty much all cache line dirtying. Sean had some results that
> >were positive on this sort of direction.
> >
> 
> Does the opaque pointer guarantees an aligned access? Who allocates the
> space for the vendor's CQE? Any concrete example?
> One of the problems here are that CQEs could be formatted as -
> "if QP type is y, then copy the field z from o". Doing that this way may
> result doing the same "if" multiple times. The current approach could still
> avoid memcpy and write straight to the user's buffer.

No, none of that...

struct ibv_cq
{
    int (*read_next_cq)(ibv_cq *cq,struct common_wr *res);
    int (*read_address)(ibv_cq *cq,struct wr_address *res);
    uint64_t (*read_timestamp(ibv_cq *cq);
    uint32_t (*read_immediate_data(ibv_cq *cq);
    void (*read_something_else)(ibv_cq *cq,struct something_else *res);
};

while (cq->read_next_cq(cq,&wc)) // Advance the cursor
{
    // Read the cursor
    uint64_t ts = cq->read_timestamp(cq);
    if (wc->opcode ==  ..)
        uint32_t imm = cq->read_immediate_data(cq);
}

Lots of variety for syntax, I've choosen the above for clarity.

1) The function pointers provided by the driver read directly from the
   CQ memory written by the adaptor. No memcpy. The pointers vary
   on a CQ by CQ basis so they can be optimal.
   Alignment is a driver/hardware problem and is exactly the same as
   any other scheme
2) The driver fills in the pointers in a way that is optimal for the
   specific CQ - if there are multiple formats then the driver
   provides multiple function implementations and chooses the
   combiantion that matches the CQ when it is created.
   The intent is to avoid all conditional branching in the driver on
   these callbacks. The functions should be small enough to avoid
   a frame setup (eg build them with -fomit-frame-pointer and other
   function call related optimizations to achieve this)
2a) If the CQ does not support an extension like read_something_else
    then the driver fills the pointer with something that calls
    abort(), no conditional branching.
    The application is responsible for only calling allowed accessors
    based on how the CQ is created. No branching is needed.
3) read_next_cq goes entry by entry. If signaling the hardware
   that a CQe is now available is expensive then the driver will need
   some kind of batching scheme for that. Ie signal on empty, signal
   ever N, etc. An additional entry point may be required to manage
   this.
3a) This could also follow the new kAPI pattern and use a callback
    scheme which makes the batching and budgeting alot simpler for
    the application.
4) A basic analysis says this trades cache line dirtying of the wc
   array for unconditional branches.
   It  eliminates at least 1 conditional branch per CQE iteration by
   using only 1 loop.
   For some things it may entirely eliminate memory traffic in favor of
   register returns (eg my read_immediate_data example)
   For some things this may entirely eliminate work, eg the
   address and immediate data rarely need to be parsed for some apps

   Compared to the poll_ex proposal this eliminates a huge
   number of conditional branches as 'wc_flags' and related no longer
   exist at all.

   Only careful benchmarking will tell if the unconditional branch
   overhead is greater than the savings from dropping conditional
   branches and cache line dirtying.

5) If the vendors could get together, it might be possible to replace
   some function pointers with inlined pointer math eg:

   struct ibv_cq
   {
        const void *cur_cqe;
	size_t immdata_offset;
   }

   inline uint32_t ibv_cqe_read_immedate_data(ibv_cq *cq)
   {
      return *(uint32_t *)(cq->cur_cqe + cq->immdata_offset);
   }

   Obvious this requires most vendors to use a HW CQE format that
   can support the above expression.
   
   Further, touching on an idea Cristoph brought up a long time
   ago - it would be easy to use this API to build a Vendor Optimized
   libibverbs this would support a single driver's card and would
   inline the CQE parse using something like the above, tuned for the
   single driver. It would be interesting to see if that is a win or
   not.

I hope you see how this sort of API is alot saner than what is
proposed, it is type safe, misuses not caught by the compiler blow up
100% of the time at runtime, the usage is intuitive and matches
typical C conventions.

The question that must be answered is the peformance for the various
options, and that is what I mean by benchmarking. I would probably
approach this with the perf tool and look at the various CPU counters
for each option and then get someone like Christoph to chime in with
his specalized benchmark environment when I think the best option has
been found and has already been carefully micro optimized.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V1 libibverbs 1/8] Add ibv_poll_cq_ex verb
       [not found]                                         ` <20160302182836.GA7084-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2016-03-02 19:08                                           ` Christoph Lameter
       [not found]                                             ` <alpine.DEB.2.20.1603021300491.15609-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Lameter @ 2016-03-02 19:08 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Matan Barak (External),
	Yishai Hadas, Yishai Hadas, dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, majd-VPRAkNaXOzVWk0Htik3J/w,
	talal-VPRAkNaXOzVWk0Htik3J/w, ogerlitz-VPRAkNaXOzVWk0Htik3J/w

On Wed, 2 Mar 2016, Jason Gunthorpe wrote:

> So all this ugly API to minimize cache line usage has no measured
> performance gain?

We have seen an increased cacheline footprint adding ~100-200ns to receive
latencies during loops. This does not show up in synthetic loads that do
not do much processing since their cache footprint is minimal.

> > Does the opaque pointer guarantees an aligned access? Who allocates the
> > space for the vendor's CQE? Any concrete example?
> > One of the problems here are that CQEs could be formatted as -
> > "if QP type is y, then copy the field z from o". Doing that this way may
> > result doing the same "if" multiple times. The current approach could still
> > avoid memcpy and write straight to the user's buffer.
>
> No, none of that...
>
> struct ibv_cq
> {
>     int (*read_next_cq)(ibv_cq *cq,struct common_wr *res);
>     int (*read_address)(ibv_cq *cq,struct wr_address *res);
>     uint64_t (*read_timestamp(ibv_cq *cq);
>     uint32_t (*read_immediate_data(ibv_cq *cq);
>     void (*read_something_else)(ibv_cq *cq,struct something_else *res);
> };

Argh. You are requiring multiple indirect function calls
top retrieve the same imformation and therefore significantly increase
latency. This is going to cause lots of problems for procesing at high
speed where we have to use the processor caches as carefully as possible
to squeeze out all we can get.

> 4) A basic analysis says this trades cache line dirtying of the wc
>    array for unconditional branches.
>    It  eliminates at least 1 conditional branch per CQE iteration by
>    using only 1 loop.

This done none of that stuff at all if the device directly follows the
programmed format. There will be no need to do any driver formatting at
all.

>    Compared to the poll_ex proposal this eliminates a huge
>    number of conditional branches as 'wc_flags' and related no longer
>    exist at all.

wc_flags may be something bothersome. You do not want to check inside the
loop. All cqe's should come with the fields requested and the
layout of the data must be fixed when in the receive loop. No additional
branches in the loop.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V1 libibverbs 1/8] Add ibv_poll_cq_ex verb
       [not found]                                             ` <alpine.DEB.2.20.1603021300491.15609-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
@ 2016-03-02 19:51                                               ` Jason Gunthorpe
       [not found]                                                 ` <20160302195138.GA8427-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Jason Gunthorpe @ 2016-03-02 19:51 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Matan Barak (External),
	Yishai Hadas, Yishai Hadas, dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, majd-VPRAkNaXOzVWk0Htik3J/w,
	talal-VPRAkNaXOzVWk0Htik3J/w, ogerlitz-VPRAkNaXOzVWk0Htik3J/w

On Wed, Mar 02, 2016 at 01:08:30PM -0600, Christoph Lameter wrote:
> On Wed, 2 Mar 2016, Jason Gunthorpe wrote:
> 
> > So all this ugly API to minimize cache line usage has no measured
> > performance gain?
> 
> We have seen an increased cacheline footprint adding ~100-200ns to receive
> latencies during loops. This does not show up in synthetic loads that do
> not do much processing since their cache footprint is minimal.

I know you've seen effects..

But apparently the patch authors haven't.

> > > Does the opaque pointer guarantees an aligned access? Who allocates the
> > > space for the vendor's CQE? Any concrete example?
> > > One of the problems here are that CQEs could be formatted as -
> > > "if QP type is y, then copy the field z from o". Doing that this way may
> > > result doing the same "if" multiple times. The current approach could still
> > > avoid memcpy and write straight to the user's buffer.
> >
> > No, none of that...
> >
> > struct ibv_cq
> > {
> >     int (*read_next_cq)(ibv_cq *cq,struct common_wr *res);
> >     int (*read_address)(ibv_cq *cq,struct wr_address *res);
> >     uint64_t (*read_timestamp(ibv_cq *cq);
> >     uint32_t (*read_immediate_data(ibv_cq *cq);
> >     void (*read_something_else)(ibv_cq *cq,struct something_else *res);
> > };
> 
> Argh. You are requiring multiple indirect function calls
> top retrieve the same imformation and therefore significantly increase
> latency.

It depends. These are unconditional branches and they elide alot of
other code and conditional branches by their design.

The mlx4 driver does something like this on every CQE to parse the
immediate data:

+	if (is_send) {
+		switch (cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) {
+		case MLX4_RECV_OPCODE_RDMA_WRITE_IMM:
+			if (wc_flags & IBV_WC_EX_WITH_IMM) {
+				*wc_buffer.b32++ = cqe->immed_rss_invalid;

With this additional overhead for all the parse paths that have no
immediate:

+			if (wc_flags & IBV_WC_EX_WITH_IMM)
+				wc_buffer.b32++; // No imm to set

Whereas my suggestion would looke more like:

mlx4_read_immediate_data(cq) {return cq->cur_cqe->immed_rss_invalid;};

[and even that can be micro-optimized further]

And the setting of WITH_IMM during the common parse is much less branchy:

opcode = cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK;
wc->flags |= WITH_IMM_BIT << ((unsigned int)(opcode == MLX4_RECV_OPCODE_RDMA_WRITE_IMM ||
                                            opcode == MLX4_RECV_OPCODE_SEND_IMM ...))

I bet that is a lot less work going on, even including the indirect
jump overhead.

> This is going to cause lots of problems for procesing at high
> speed where we have to use the processor caches as carefully as possible
> to squeeze out all we can get.

The driver should be built so the branch targets are all in close
cache lines, with function prologues deliberately elided so the branch
targets are small. There may even be further techniques to micro
optimize the jump, if one wanted to spend the effort.

Further, since we are only running the code the caller actually needs
we are not wasting icache lines trundling through the driver CQE parse
that has to handle all cases, even ones the caller doesn't care about.

It is possible this uses fewer icache lines than what we have today,
it depends how big the calls end up..

> > 4) A basic analysis says this trades cache line dirtying of the wc
> >    array for unconditional branches.
> >    It  eliminates at least 1 conditional branch per CQE iteration by
> >    using only 1 loop.
> 
> This done none of that stuff at all if the device directly follows the
> programmed format. There will be no need to do any driver formatting at
> all.

Maybe, but no such device exists? I'm not sure I want to see a
miserable API on the faint hope of hardware support..

Even if hardware appears, this basic API pattern can still support
that optimally by using the #5 technique - and still avoids the memcpy
from the DMA buffer to the user memory.

> >    Compared to the poll_ex proposal this eliminates a huge
> >    number of conditional branches as 'wc_flags' and related no longer
> >    exist at all.
> 
> wc_flags may be something bothersome. You do not want to check inside the
> loop.

Did you look at the mlx4 driver patch? It checks wc_flags constantly
when memcpying the HW CQE to the user memory - in the per CQE loop:

+			if (wc_flags & IBV_WC_EX_WITH_IMM) {
+				*wc_buffer.b32++ = cqe->immed_rss_invalid;

Every single user CQE write (or even not write) is guarded by a
conditional branch on the flags, and use of post-increment like that
creates critical path data dependencies and kills ILP. All this extra
code eats icache lines.

Sure, the scheme saves dritying cache lines, but at the cost of a huge
number of these conditional branches and more code. I'm deeply
skeptical that is an overall win compared to dirtying more cache
lines - mispredicted branches are expensive.

What I've suggested avoids all the cache line dirtying and avoids all
the above conditional branching and data dependencies at the cost of
a few indirect unconditional jumps. (and if #5 is possible they aren't
even jumps)

I say there is no way to tell which scheme performs better without
careful benchmarking - it is not obvious any of these trade offs are
winners.

> All cqe's should come with the fields requested and the
> layout of the data must be fixed when in the receive loop. No additional
> branches in the loop.

Sure, for the caller, I'm looking at this from the whole system
perspective. The driver is doing a wack more crap to produce this
formatting and it certainly isn't free.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V1 libibverbs 1/8] Add ibv_poll_cq_ex verb
       [not found]                                                 ` <20160302195138.GA8427-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2016-03-21 15:24                                                   ` Matan Barak
       [not found]                                                     ` <56F01227.9050900-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Matan Barak @ 2016-03-21 15:24 UTC (permalink / raw)
  To: Jason Gunthorpe, Christoph Lameter
  Cc: Yishai Hadas, Yishai Hadas, dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, majd-VPRAkNaXOzVWk0Htik3J/w,
	talal-VPRAkNaXOzVWk0Htik3J/w, ogerlitz-VPRAkNaXOzVWk0Htik3J/w

On 02/03/2016 21:51, Jason Gunthorpe wrote:
> On Wed, Mar 02, 2016 at 01:08:30PM -0600, Christoph Lameter wrote:
>> On Wed, 2 Mar 2016, Jason Gunthorpe wrote:
>>
>>> So all this ugly API to minimize cache line usage has no measured
>>> performance gain?
>>
>> We have seen an increased cacheline footprint adding ~100-200ns to receive
>> latencies during loops. This does not show up in synthetic loads that do
>> not do much processing since their cache footprint is minimal.
>
> I know you've seen effects..
>
> But apparently the patch authors haven't.
>
>>>> Does the opaque pointer guarantees an aligned access? Who allocates the
>>>> space for the vendor's CQE? Any concrete example?
>>>> One of the problems here are that CQEs could be formatted as -
>>>> "if QP type is y, then copy the field z from o". Doing that this way may
>>>> result doing the same "if" multiple times. The current approach could still
>>>> avoid memcpy and write straight to the user's buffer.
>>>
>>> No, none of that...
>>>
>>> struct ibv_cq
>>> {
>>>      int (*read_next_cq)(ibv_cq *cq,struct common_wr *res);
>>>      int (*read_address)(ibv_cq *cq,struct wr_address *res);
>>>      uint64_t (*read_timestamp(ibv_cq *cq);
>>>      uint32_t (*read_immediate_data(ibv_cq *cq);
>>>      void (*read_something_else)(ibv_cq *cq,struct something_else *res);
>>> };
>>
>> Argh. You are requiring multiple indirect function calls
>> top retrieve the same imformation and therefore significantly increase
>> latency.
>
> It depends. These are unconditional branches and they elide alot of
> other code and conditional branches by their design.
>

Our performance team and I measured the new ibv_poll_cq_ex vs the old 
poll_cq implementation. We measured the number of cycles around the 
poll_cq call. We saw ~15 cycles increase for every CQE, which is about 
2% over the regular poll_cq (which is not extensible). We've used 
ib_write_lat in order to do these measurements.

Pinpointing the source of this increase, we found that the function 
pointer of our internal poll_one routine is the source of this. Our 
poll_cq_ex implementation uses a per-CQ poll_one_ex function, which is 
*almost* tailored for the user required fields. Using a static poll_one 
will cause a lot of conditional branches and will decrease performance.
Meaning, using a function pointer vs using a static poll_one function 
(that the compiler is free to inline) causes this effect, but using a 
monolithic poll_one function will incur substantial computation overhead.

Using a per field getter introduces such a call (unconditional jump) for 
every field. If we were using static linking (+ whole program linkage), 
I agree this route is better. However, based on the results we saw 
earlier, we are worried this might incur tremendous overhead.
In addition, the user has to call read_next_cq for every completion 
entry (which is another unconditional jump).

> The mlx4 driver does something like this on every CQE to parse the
> immediate data:
>
> +	if (is_send) {
> +		switch (cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) {
> +		case MLX4_RECV_OPCODE_RDMA_WRITE_IMM:
> +			if (wc_flags & IBV_WC_EX_WITH_IMM) {
> +				*wc_buffer.b32++ = cqe->immed_rss_invalid;
>
> With this additional overhead for all the parse paths that have no
> immediate:
>
> +			if (wc_flags & IBV_WC_EX_WITH_IMM)
> +				wc_buffer.b32++; // No imm to set
>
> Whereas my suggestion would looke more like:
>
> mlx4_read_immediate_data(cq) {return cq->cur_cqe->immed_rss_invalid;};
>
> [and even that can be micro-optimized further]
>
> And the setting of WITH_IMM during the common parse is much less branchy:
>
> opcode = cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK;
> wc->flags |= WITH_IMM_BIT << ((unsigned int)(opcode == MLX4_RECV_OPCODE_RDMA_WRITE_IMM ||
>                                              opcode == MLX4_RECV_OPCODE_SEND_IMM ...))
>
> I bet that is a lot less work going on, even including the indirect
> jump overhead.
>

So we would have read the owner_sr_opcode several times (for almost 
every related wc field) and parse it accordingly or we'll have to store 
the pre-processed owner_sr_opcode in the CQ itself and incur the 
overhead of "copying" it. In addition, we'll need another API (end_cq) 
in order to indicate return-cq-element-to-hardware-ownership (and 
release vendor's per poll_cq locks if exist). So, as far as we 
understand, you propose something like:

struct ibv_cq
{
     /* more argument is new */
     int (*read_next_cq)(ibv_cq *cq,struct common_wr *res, bool more);
     /* function is new */
     int (*end_cq)(ibv_cq *cq);
     int (*read_address)(ibv_cq *cq,struct wr_address *res);
     uint64_t (*read_timestamp(ibv_cq *cq);
     uint32_t (*read_immediate_data(ibv_cq *cq);
     void (*read_something_else)(ibv_cq *cq,struct something_else *res);
};

Again, every call here introduces an unconditional jump and makes the CQ 
structure larger. So, the CQ struct itself might not fit in the cache line.

In addition, ibv_cq isn't extensible as it's today. Addign a new 
ibv_cq_ex will change all APIs that use it.

>> This is going to cause lots of problems for procesing at high
>> speed where we have to use the processor caches as carefully as possible
>> to squeeze out all we can get.
>
> The driver should be built so the branch targets are all in close
> cache lines, with function prologues deliberately elided so the branch
> targets are small. There may even be further techniques to micro
> optimize the jump, if one wanted to spend the effort.
>
> Further, since we are only running the code the caller actually needs
> we are not wasting icache lines trundling through the driver CQE parse
> that has to handle all cases, even ones the caller doesn't care about.
>

The current optimization we implement in our user-space drivers is to 
introduce multiple poll_one_ex functions, where every function assigns 
only some of the WC fields. Each function is tailored for a different 
use case and common scenario. By doing this, we don't assign fields that 
the user doesn't care about and we can avoid all conditional branches.

> It is possible this uses fewer icache lines than what we have today,
> it depends how big the calls end up..
>

The current proposal offers several poll_one_ex function - a function 
per common case. That's true we have more functions, but the function we 
actually used is *almost* tailored to the user's requirements.
All checks and conditions are done once. We trade-off the function 
pointer + re-checking/re-formatting some fields over and over again by 
copying the fields to ibv_wc_ex.

>>> 4) A basic analysis says this trades cache line dirtying of the wc
>>>     array for unconditional branches.
>>>     It  eliminates at least 1 conditional branch per CQE iteration by
>>>     using only 1 loop.
>>
>> This done none of that stuff at all if the device directly follows the
>> programmed format. There will be no need to do any driver formatting at
>> all.
>
> Maybe, but no such device exists? I'm not sure I want to see a
> miserable API on the faint hope of hardware support..
>
> Even if hardware appears, this basic API pattern can still support
> that optimally by using the #5 technique - and still avoids the memcpy
> from the DMA buffer to the user memory.
>
>>>     Compared to the poll_ex proposal this eliminates a huge
>>>     number of conditional branches as 'wc_flags' and related no longer
>>>     exist at all.
>>
>> wc_flags may be something bothersome. You do not want to check inside the
>> loop.
>
> Did you look at the mlx4 driver patch? It checks wc_flags constantly
> when memcpying the HW CQE to the user memory - in the per CQE loop:
>
> +			if (wc_flags & IBV_WC_EX_WITH_IMM) {
> +				*wc_buffer.b32++ = cqe->immed_rss_invalid;
>
> Every single user CQE write (or even not write) is guarded by a
> conditional branch on the flags, and use of post-increment like that
> creates critical path data dependencies and kills ILP. All this extra
> code eats icache lines.
>
 >

Look at the new code for libmlx5. This code is optimized *at compile 
time*. We create a poll_one_ex function for a common case. This has zero 
overhead.

Regarding user-space check for wc_flags, I agree on eliminating this 
check. If you created a CQ with a set of attributes, they are guaranteed 
to exist on known values for the opcode.

> Sure, the scheme saves dritying cache lines, but at the cost of a huge
> number of these conditional branches and more code. I'm deeply
> skeptical that is an overall win compared to dirtying more cache
> lines - mispredicted branches are expensive.
>
> What I've suggested avoids all the cache line dirtying and avoids all
> the above conditional branching and data dependencies at the cost of
> a few indirect unconditional jumps. (and if #5 is possible they aren't
> even jumps)
>

When eliminating "if (wc_flags & IBV_WC_EX_WITH_IMM)" in the user's 
space application, conditional branching is eliminated in this proposal 
as well.

The question becomes - what costs more - copying the data or using an 
unconditional branch.

> I say there is no way to tell which scheme performs better without
> careful benchmarking - it is not obvious any of these trade offs are
> winners.
>

Agree. It may depend on the user application as well.

>> All cqe's should come with the fields requested and the
>> layout of the data must be fixed when in the receive loop. No additional
>> branches in the loop.
>
> Sure, for the caller, I'm looking at this from the whole system
> perspective. The driver is doing a wack more crap to produce this
> formatting and it certainly isn't free.
>

Dynamic WC format (or getter functions) isn't going to be free, at least 
not without linking the vendor's user-space driver to the application 
itself or making all vendors behave the same.

> Jason
>

Yishai and Matan.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V1 libibverbs 1/8] Add ibv_poll_cq_ex verb
       [not found]                                                     ` <56F01227.9050900-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2016-03-21 17:09                                                       ` Jason Gunthorpe
  0 siblings, 0 replies; 23+ messages in thread
From: Jason Gunthorpe @ 2016-03-21 17:09 UTC (permalink / raw)
  To: Matan Barak
  Cc: Christoph Lameter, Yishai Hadas, Yishai Hadas,
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, majd-VPRAkNaXOzVWk0Htik3J/w,
	talal-VPRAkNaXOzVWk0Htik3J/w, ogerlitz-VPRAkNaXOzVWk0Htik3J/w

On Mon, Mar 21, 2016 at 05:24:23PM +0200, Matan Barak wrote:

> Our performance team and I measured the new ibv_poll_cq_ex vs the old
> poll_cq implementation. We measured the number of cycles around the poll_cq
> call. We saw ~15 cycles increase for every CQE, which is about 2% over the
> regular poll_cq (which is not extensible). We've used ib_write_lat in order
> to do these measurements.

So you understand my concern, this really ugly interface is
obstensibly designed this way for performance and yet it is 2% slower
than what we have today. :(

> Using a per field getter introduces such a call (unconditional jump) for
> every field.

As I keep saying, there is an overall trade off that could reduce the
number of jumps and code being run.

> In addition, the user has to call read_next_cq for every completion entry
> (which is another unconditional jump).

This is just replacing the loop inside the existing ibv_poll_cq_ex,
the number of overall jumps doesn't change.

> >opcode = cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK;
> >wc->flags |= WITH_IMM_BIT << ((unsigned int)(opcode == MLX4_RECV_OPCODE_RDMA_WRITE_IMM ||
> >                                             opcode == MLX4_RECV_OPCODE_SEND_IMM ...))
> >
> >I bet that is a lot less work going on, even including the indirect
> >jump overhead.
> >
> 
> So we would have read the owner_sr_opcode several times (for almost every
> related wc field)

No, I think you've missed the point, the accessors pretty much just
return fixed offsets in the HW WC. From what I've seen the HW WC looks
well structured, so, eg, the offset of the timestamp field is fixed.
No need to check owner_sr_opcode again.

> "copying" it. In addition, we'll need another API (end_cq) in order to
> indicate return-cq-element-to-hardware-ownership (and release vendor's per
> poll_cq locks if exist). So, as far as we understand, you propose
> something

Yes, I mentioned this, some kind of completion batching scheme would
be needed.

This could actually prove to be a net win, depending on how carefully
it is designed and typical application needs. The current batch every
poll_cq loop is not necessarily ideal.

> Again, every call here introduces an unconditional jump and makes the CQ
> structure larger. So, the CQ struct itself might not fit in the cache line.

Don't obsess so much about 1 or two cache lines in this sort of
context. Burning a large number of stack lines for the wc array is
bad of course, but an extra line isn't so bad, especially if it
is offset by an icache line.

> In addition, ibv_cq isn't extensible as it's today. Addign a new ibv_cq_ex
> will change all APIs that use it.

Not really, it just makes the structure longer, this is trivially
done.

> The current optimization we implement in our user-space drivers is to
> introduce multiple poll_one_ex functions, where every function assigns only
> some of the WC fields. Each function is tailored for a different use case
> and common scenario. By doing this, we don't assign fields that the user
> doesn't care about and we can avoid all conditional branches.

But the provider can't cover all cases - so uncommon cases will run
slow? Do you understand why I keep calling this a horrible API?

> Look at the new code for libmlx5. This code is optimized *at compile time*.
> We create a poll_one_ex function for a common case. This has zero overhead.

I didn't see this on patchworks. Do you have a link?

Even if the wc_flags is constant and thus optimized there is still
alot of opcode related branching and ++'s running around that
code. That is 'free' in the sense is similar to what is in the
existing implementation, but I'm suggesting a direction that largely
merges those branches with the mandatory user branches as well.

> >What I've suggested avoids all the cache line dirtying and avoids all
> >the above conditional branching and data dependencies at the cost of
> >a few indirect unconditional jumps. (and if #5 is possible they aren't
> >even jumps)
> >
> 
> When eliminating "if (wc_flags & IBV_WC_EX_WITH_IMM)" in the user's space
> application, conditional branching is eliminated in this proposal as well.

Just the wc_flags branching is gone, there is still excess branching
around the opcode that is eliminated.

> The question becomes - what costs more - copying the data or using an
> unconditional branch.

This will depend very much on how many completions are being
processed in one go. Copying is almost certainly better for a small
number of completions - but if there is a small number then why care
so much about dcache lines?

> Dynamic WC format (or getter functions) isn't going to be free, at
> least not without linking the vendor's user-space driver to the
> application itself or making all vendors behave the same.

I don't think this should be so quickly discounted - clearly this API
allows for the accessors to be fully inlined if the user desires, and
you can't beat that for performance.

At the very least it, or some other better API option, deserves study
and benchmarking before comitting to the ugly API.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2016-03-21 17:09 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-02-24  9:41 [PATCH V1 libibverbs 0/8] Completion timestamping Yishai Hadas
     [not found] ` <1456306924-31298-1-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2016-02-24  9:41   ` [PATCH V1 libibverbs 1/8] Add ibv_poll_cq_ex verb Yishai Hadas
     [not found]     ` <1456306924-31298-2-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2016-02-24 19:02       ` Jason Gunthorpe
     [not found]         ` <20160224190230.GA10588-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2016-02-25  8:01           ` Yishai Hadas
     [not found]             ` <56CEB4C7.60607-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2016-02-25 17:05               ` Jason Gunthorpe
     [not found]                 ` <20160225170541.GA22513-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2016-02-28 16:03                   ` Matan Barak (External)
     [not found]                     ` <56D31A58.20205-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2016-02-29 19:17                       ` Jason Gunthorpe
     [not found]                         ` <20160229191734.GA15042-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2016-03-01  8:52                           ` Matan Barak (External)
     [not found]                             ` <56D55851.60206-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2016-03-01 16:10                               ` Christoph Lameter
2016-03-01 17:24                               ` Jason Gunthorpe
     [not found]                                 ` <20160301172448.GA24031-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2016-03-02  7:34                                   ` Matan Barak (External)
     [not found]                                     ` <56D6979F.6000400-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2016-03-02 18:28                                       ` Jason Gunthorpe
     [not found]                                         ` <20160302182836.GA7084-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2016-03-02 19:08                                           ` Christoph Lameter
     [not found]                                             ` <alpine.DEB.2.20.1603021300491.15609-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
2016-03-02 19:51                                               ` Jason Gunthorpe
     [not found]                                                 ` <20160302195138.GA8427-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2016-03-21 15:24                                                   ` Matan Barak
     [not found]                                                     ` <56F01227.9050900-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2016-03-21 17:09                                                       ` Jason Gunthorpe
2016-02-24  9:41   ` [PATCH V1 libibverbs 2/8] Add timestamp_mask and hca_core_clock to ibv_query_device_ex Yishai Hadas
2016-02-24  9:41   ` [PATCH V1 libibverbs 3/8] Add support for extended ibv_create_cq Yishai Hadas
2016-02-24  9:42   ` [PATCH V1 libibverbs 4/8] Add completion timestamp support for ibv_poll_cq_ex Yishai Hadas
2016-02-24  9:42   ` [PATCH V1 libibverbs 5/8] Add helper functions to work with the extended WC Yishai Hadas
2016-02-24  9:42   ` [PATCH V1 libibverbs 6/8] Add ibv_query_rt_values_ex Yishai Hadas
2016-02-24  9:42   ` [PATCH V1 libibverbs 7/8] Man pages for time stamping support Yishai Hadas
2016-02-24  9:42   ` [PATCH V1 libibverbs 8/8] Add timestamp support in rc_pingpong Yishai Hadas

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.