All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH V4 libibverbs 0/7] Completion timestamping
@ 2016-05-29 14:51 Yishai Hadas
       [not found] ` <1464533475-18949-1-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 21+ messages in thread
From: Yishai Hadas @ 2016-05-29 14:51 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	yishaih-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w,
	majd-VPRAkNaXOzVWk0Htik3J/w, talal-VPRAkNaXOzVWk0Htik3J/w,
	jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/

Hi Doug,

This V4 series addressed few notes that we got for V3,
details below.

The kernel part was already accepted.

The series was retested successfully with mlx5 driver (lib, kernel)
and can be accessed also from my openfabrics GIT at:
git://openfabrics.org/~yishaih/libibverbs.git
branch: ts_v4.

Thanks,
Yishai

In order to do so, we add an extensible poll cq mechanism.
Former attempts of extending poll CQ were made. An attempt to solve this
problem tried to split the WC into mandatory and optional fields.
The user declared which optional fields each CQ should report and the
WC was constructed in a dynamic way representing all requested fields.
We got some comments regarding this complex approach and API. Furthermore,
it resulted in degraded performance in some flows.

The current approach is based on Jason's proposal. Instead of using a WC
struct, we report completion fields by request. A new ibv_cq_ex is added.
This new extended CQ contains accessor functions to the completion fields.
Each vendor assigns these function pointers in order to provide the
completion data efficiently. In order to create a suitable CQ and maintain
backward and forward compatibility, the user declares which completion
attributes he needs while creating the CQ. A successful creation of the CQ
guarantees that all requested attributes can be queried using the accessor
function pointers.

This approach prevents copying the WC fields in cost of indirect function calls.
However, as most applications don't use most completion fields anyway, the new
approach fully makes sense.

Benchmarks we ran in our test lab found that this new approach generally
equals to current API but *not* worse than. As the new API enables
extending the polled fields we can overall say that it's a better API than
the legacy one.

The user creates a CQ using ibv_create_cq_ex, stating which completion
attributes could be queried later on from this CQ.
In order to decrease per-completion polling overhead, as of updating indices in
the hardware, we split the polling into batches.

A batch is started when calling ibv_start_poll_ex. If a completion is
successfully fetched, the user could query its attributes using accessor
functions ibv_wc_read_xxx.  In order to fetch the next completion in the batch,
the user uses ibv_next_poll_ex.  The same ibv_wc_read_xxx functions are used in
order to query these completions as well. In order to end a batch, the user
uses ibv_end_poll_ex.  Of course, starting a new batch incurs some overhead.

Each batch could poll zero or more completions.
Each completion polling starts with either ibv_start_poll_ex/ibv_next_poll_ex
and ends with ibv_next_poll_ex/ibv_end_poll_ex.
Completion attributes could only be queried between these calls.
These attributes represents the values of the completion already fetched by
the last ibv_start_poll_ex/ibv_next_poll_ex.

The batching API is thread-safe (assuming the CQ wasn't created with
SINGLE_THREADED attribute) and represents a series of completions the user
would like to poll one after another.  The vendor user space driver should
guarantee this.

Completion timestamp is added on top of these extended ibv_create_cq_ex verb by
using wc_flags field of init_cq_attr. The user could query the CQ's completion
timestamp using ibv_wc_read_completion_ts. The timestamp mask (number of
supported bits) and the HCA's frequency are given in ibv_query_device_ex verb.

We also give the user an ability to read the HCA's current clock.
This is done via ibv_query_rt_values_ex. This verb could be extended
in the future for other interesting information.

Changes from V3:
Addressed Jason's notes as of below:
- Reorder ibv_cq_ex fields to be memory aligned.
- Add a check as part of __lib_ibv_create_cq_ex that libibverbs supports all in coming WC flags,
  this should prevent future compatibility issues.
- Change ibv_wc_read_slid to return uint32_t as of other vendor request.
- Change ibv_cq_init_attr_ex to use uint32 instead of int for some fields.

Changes from V2:
Addressed Jason's notes as of below:
- Remove the '_ex' notation where was no legacy one.
- Use 'wr_id' and 'status' fields directly on ibv_cq_ex to improve
  performance. We ran some benchmarking and verified that this change is
  really useful.

Changes from V1:
- Moved to indirect function calls in order to poll a CQ.

Changes from V0:
- Split the series to small logical patches.
- Align naming in some places to match other verbs.
- Fix and improve the man pages.
- Add an example code as part of rc_pingpong.

Matan Barak (6):
  Add support for extended creating CQ verb
  Add member functions to poll an extended CQ
  Add timestamp_mask and hca_core_clock to ibv_query_device_ex
  Add completion timestamp to poll_cq
  Create a single threaded CQ
  Add a verb that queries real time values from the HCA

Yishai Hadas (1):
  Add timestamp support in rc_pingpong

 Makefile.am                   |   3 +-
 examples/devinfo.c            |  10 ++
 examples/rc_pingpong.c        | 278 ++++++++++++++++++++++++++++++++----------
 include/infiniband/driver.h   |   9 ++
 include/infiniband/kern-abi.h |  26 ++++
 include/infiniband/verbs.h    | 243 ++++++++++++++++++++++++++++++++++++
 man/ibv_create_cq_ex.3        | 150 +++++++++++++++++++++++
 man/ibv_query_device_ex.3     |   6 +-
 man/ibv_query_rt_values_ex.3  |  50 ++++++++
 src/cmd.c                     |  69 +++++++++++
 src/device.c                  |  49 ++++++++
 src/ibverbs.h                 |   5 +
 src/libibverbs.map            |   1 +
 13 files changed, 833 insertions(+), 66 deletions(-)
 create mode 100644 man/ibv_create_cq_ex.3
 create mode 100644 man/ibv_query_rt_values_ex.3

-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH V4 libibverbs 1/7] Add support for extended creating CQ verb
       [not found] ` <1464533475-18949-1-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2016-05-29 14:51   ` Yishai Hadas
  2016-05-29 14:51   ` [PATCH V4 libibverbs 2/7] Add member functions to poll an extended CQ Yishai Hadas
                     ` (5 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Yishai Hadas @ 2016-05-29 14:51 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	yishaih-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w,
	majd-VPRAkNaXOzVWk0Htik3J/w, talal-VPRAkNaXOzVWk0Htik3J/w,
	jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/

From: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Add ibv_create_cq_ex. This extended verb follows the extension verbs
scheme and hence could be extendible in the future for more features.
The new command requires the user to declare which fields are going
to be polled. This is mandatory in order to maintain compatibility
between new applications and old libraries.
The user shall only read fields from the completion which he requested
upon creating the CQ.

Signed-off-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Reviewed-by: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 Makefile.am                   |  2 +-
 include/infiniband/driver.h   |  9 ++++
 include/infiniband/kern-abi.h | 20 +++++++++
 include/infiniband/verbs.h    | 98 +++++++++++++++++++++++++++++++++++++++++++
 man/ibv_create_cq_ex.3        | 63 ++++++++++++++++++++++++++++
 src/cmd.c                     | 38 +++++++++++++++++
 src/device.c                  | 49 ++++++++++++++++++++++
 src/ibverbs.h                 |  5 +++
 src/libibverbs.map            |  1 +
 9 files changed, 284 insertions(+), 1 deletion(-)
 create mode 100644 man/ibv_create_cq_ex.3

diff --git a/Makefile.am b/Makefile.am
index cae80bb..a1c2122 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -65,7 +65,7 @@ man_MANS = man/ibv_asyncwatch.1 man/ibv_devices.1 man/ibv_devinfo.1	\
     man/ibv_create_qp_ex.3 man/ibv_create_srq_ex.3 man/ibv_open_xrcd.3  \
     man/ibv_get_srq_num.3 man/ibv_open_qp.3 man/ibv_query_device_ex.3	\
     man/ibv_alloc_mw.3 man/ibv_bind_mw.3 man/ibv_inc_rkey.3		\
-    man/ibv_rereg_mr.3
+    man/ibv_rereg_mr.3 man/ibv_create_cq_ex.3
 
 DEBIAN = debian/changelog debian/compat debian/control debian/copyright \
     debian/ibverbs-utils.install debian/libibverbs1.install \
diff --git a/include/infiniband/driver.h b/include/infiniband/driver.h
index 053ad5f..65fa44f 100644
--- a/include/infiniband/driver.h
+++ b/include/infiniband/driver.h
@@ -155,6 +155,15 @@ int ibv_cmd_create_cq(struct ibv_context *context, int cqe,
 		      int comp_vector, struct ibv_cq *cq,
 		      struct ibv_create_cq *cmd, size_t cmd_size,
 		      struct ibv_create_cq_resp *resp, size_t resp_size);
+int ibv_cmd_create_cq_ex(struct ibv_context *context,
+			 struct ibv_cq_init_attr_ex *cq_attr,
+			 struct ibv_cq_ex *cq,
+			 struct ibv_create_cq_ex *cmd,
+			 size_t cmd_core_size,
+			 size_t cmd_size,
+			 struct ibv_create_cq_resp_ex *resp,
+			 size_t resp_core_size,
+			 size_t resp_size);
 int ibv_cmd_poll_cq(struct ibv_cq *cq, int ne, struct ibv_wc *wc);
 int ibv_cmd_req_notify_cq(struct ibv_cq *cq, int solicited_only);
 #define IBV_CMD_RESIZE_CQ_HAS_RESP_PARAMS
diff --git a/include/infiniband/kern-abi.h b/include/infiniband/kern-abi.h
index 31da4be..36a2c8c 100644
--- a/include/infiniband/kern-abi.h
+++ b/include/infiniband/kern-abi.h
@@ -112,6 +112,8 @@ enum {
 					    IB_USER_VERBS_CMD_QUERY_DEVICE,
 	IB_USER_VERBS_CMD_CREATE_QP_EX = IB_USER_VERBS_CMD_EXTENDED_MASK |
 					 IB_USER_VERBS_CMD_CREATE_QP,
+	IB_USER_VERBS_CMD_CREATE_CQ_EX = IB_USER_VERBS_CMD_EXTENDED_MASK |
+						IB_USER_VERBS_CMD_CREATE_CQ,
 	IB_USER_VERBS_CMD_CREATE_FLOW = IB_USER_VERBS_CMD_EXTENDED_MASK +
 					IB_USER_VERBS_CMD_THRESHOLD,
 	IB_USER_VERBS_CMD_DESTROY_FLOW
@@ -443,6 +445,23 @@ struct ibv_create_cq_resp {
 	__u32 cqe;
 };
 
+struct ibv_create_cq_ex {
+	struct ex_hdr	hdr;
+	__u64		user_handle;
+	__u32		cqe;
+	__u32		comp_vector;
+	__s32		comp_channel;
+	__u32		comp_mask;
+	__u32		flags;
+	__u32		reserved;
+};
+
+struct ibv_create_cq_resp_ex {
+	struct ibv_create_cq_resp	base;
+	__u32				comp_mask;
+	__u32				response_length;
+};
+
 struct ibv_kern_wc {
 	__u64  wr_id;
 	__u32  status;
@@ -1098,6 +1117,7 @@ enum {
 	IB_USER_VERBS_CMD_DESTROY_FLOW_V2 = -1,
 	IB_USER_VERBS_CMD_QUERY_DEVICE_EX_V2 = -1,
 	IB_USER_VERBS_CMD_CREATE_QP_EX_V2 = -1,
+	IB_USER_VERBS_CMD_CREATE_CQ_EX_V2 = -1,
 };
 
 struct ibv_modify_srq_v3 {
diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h
index bda31a8..fb68c45 100644
--- a/include/infiniband/verbs.h
+++ b/include/infiniband/verbs.h
@@ -360,6 +360,32 @@ enum {
 	IBV_WC_IP_CSUM_OK_SHIFT	= 2
 };
 
+enum ibv_create_cq_wc_flags {
+	IBV_WC_EX_WITH_BYTE_LEN		= 1 << 0,
+	IBV_WC_EX_WITH_IMM		= 1 << 1,
+	IBV_WC_EX_WITH_QP_NUM		= 1 << 2,
+	IBV_WC_EX_WITH_SRC_QP		= 1 << 3,
+	IBV_WC_EX_WITH_PKEY_INDEX	= 1 << 4,
+	IBV_WC_EX_WITH_SLID		= 1 << 5,
+	IBV_WC_EX_WITH_SL		= 1 << 6,
+	IBV_WC_EX_WITH_DLID_PATH_BITS	= 1 << 7,
+};
+
+enum {
+	IBV_WC_STANDARD_FLAGS = IBV_WC_EX_WITH_BYTE_LEN		|
+				 IBV_WC_EX_WITH_IMM		|
+				 IBV_WC_EX_WITH_QP_NUM		|
+				 IBV_WC_EX_WITH_SRC_QP		|
+				 IBV_WC_EX_WITH_PKEY_INDEX	|
+				 IBV_WC_EX_WITH_SLID		|
+				 IBV_WC_EX_WITH_SL		|
+				 IBV_WC_EX_WITH_DLID_PATH_BITS
+};
+
+enum {
+	IBV_CREATE_CQ_SUP_WC_FLAGS = IBV_WC_STANDARD_FLAGS
+};
+
 enum ibv_wc_flags {
 	IBV_WC_GRH		= 1 << 0,
 	IBV_WC_WITH_IMM		= 1 << 1,
@@ -834,6 +860,26 @@ struct ibv_cq {
 	uint32_t		async_events_completed;
 };
 
+struct ibv_cq_ex {
+	struct ibv_context     *context;
+	struct ibv_comp_channel *channel;
+	void		       *cq_context;
+	uint32_t		handle;
+	int			cqe;
+
+	pthread_mutex_t		mutex;
+	pthread_cond_t		cond;
+	uint32_t		comp_events_completed;
+	uint32_t		async_events_completed;
+
+	uint32_t		comp_mask;
+};
+
+static inline struct ibv_cq *ibv_cq_ex_to_cq(struct ibv_cq_ex *cq)
+{
+	return (struct ibv_cq *)cq;
+}
+
 struct ibv_ah {
 	struct ibv_context     *context;
 	struct ibv_pd	       *pd;
@@ -1044,6 +1090,31 @@ struct ibv_context {
 	void		       *abi_compat;
 };
 
+enum ibv_cq_init_attr_mask {
+	IBV_CQ_INIT_ATTR_MASK_RESERVED	= 0 << 1
+};
+
+struct ibv_cq_init_attr_ex {
+	/* Minimum number of entries required for CQ */
+	uint32_t			cqe;
+	/* Consumer-supplied context returned for completion events */
+	void			*cq_context;
+	/* Completion channel where completion events will be queued.
+	 * May be NULL if completion events will not be used.
+	 */
+	struct ibv_comp_channel *channel;
+	/* Completion vector used to signal completion events.
+	 *  Must be < context->num_comp_vectors.
+	 */
+	uint32_t			comp_vector;
+	 /* Or'ed bit of enum ibv_create_cq_wc_flags. */
+	uint64_t		wc_flags;
+	/* compatibility mask (extended verb). Or'd flags of
+	 * enum ibv_cq_init_attr_mask
+	 */
+	uint32_t		comp_mask;
+};
+
 enum verbs_context_mask {
 	VERBS_CONTEXT_XRCD	= 1 << 0,
 	VERBS_CONTEXT_SRQ	= 1 << 1,
@@ -1055,6 +1126,9 @@ enum verbs_context_mask {
 
 struct verbs_context {
 	/*  "grows up" - new fields go here */
+	struct ibv_cq_ex *(*create_cq_ex)(struct ibv_context *context,
+					  struct ibv_cq_init_attr_ex *init_attr);
+	struct verbs_ex_private *priv;
 	int (*query_device_ex)(struct ibv_context *context,
 			       const struct ibv_query_device_ex_input *input,
 			       struct ibv_device_attr_ex *attr,
@@ -1360,6 +1434,30 @@ struct ibv_cq *ibv_create_cq(struct ibv_context *context, int cqe,
 			     int comp_vector);
 
 /**
+ * ibv_create_cq_ex - Create a completion queue
+ * @context - Context CQ will be attached to
+ * @cq_attr - Attributes to create the CQ with
+ */
+static inline
+struct ibv_cq_ex *ibv_create_cq_ex(struct ibv_context *context,
+				   struct ibv_cq_init_attr_ex *cq_attr)
+{
+	struct verbs_context *vctx = verbs_get_ctx_op(context, create_cq_ex);
+
+	if (!vctx) {
+		errno = ENOSYS;
+		return NULL;
+	}
+
+	if (cq_attr->comp_mask & ~(IBV_CQ_INIT_ATTR_MASK_RESERVED - 1)) {
+		errno = EINVAL;
+		return NULL;
+	}
+
+	return vctx->create_cq_ex(context, cq_attr);
+}
+
+/**
  * ibv_resize_cq - Modifies the capacity of the CQ.
  * @cq: The CQ to resize.
  * @cqe: The minimum size of the CQ.
diff --git a/man/ibv_create_cq_ex.3 b/man/ibv_create_cq_ex.3
new file mode 100644
index 0000000..cf43784
--- /dev/null
+++ b/man/ibv_create_cq_ex.3
@@ -0,0 +1,63 @@
+.\" -*- nroff -*-
+.\"
+.TH IBV_CREATE_CQ_EX 3 2016-05-08 libibverbs "Libibverbs Programmer's Manual"
+.SH "NAME"
+ibv_create_cq_ex \- create a completion queue (CQ)
+.SH "SYNOPSIS"
+.nf
+.B #include <infiniband/verbs.h>
+.sp
+.BI "struct ibv_cq_ex *ibv_create_cq_ex(struct ibv_context " "*context" ",
+.BI "                                   struct ibv_create_cq_attr_ex " "*cq_attr" );
+.fi
+.SH "DESCRIPTION"
+.B ibv_create_cq_ex()
+creates a completion queue (CQ) for RDMA device context
+.I context\fR.
+The argument
+.I cq_attr
+is a pointer to struct ibv_create_cq_attr_ex as defined in <infiniband/verbs.h>.
+.PP
+.nf
+struct ibv_create_cq_attr_ex {
+.in +8
+int                     cqe;               /* Minimum number of entries required for CQ */
+void                    *cq_context;       /* Consumer-supplied context returned for completion events */
+struct ibv_comp_channel *channel;          /* Completion channel where completion events will be queued. May be NULL if completion events will not be used. */
+int                     comp_vector;       /* Completion vector used to signal completion events. Must be >= 0 and < context->num_comp_vectors. */
+uint64_t                wc_flags;          /* The wc_flags that should be returned in ibv_poll_cq_ex. Or'ed bit of enum ibv_wc_flags_ex. */
+uint32_t                comp_mask;         /* compatibility mask (extended verb). */
+.in -8
+};
+
+enum ibv_wc_flags_ex {
+        IBV_WC_EX_WITH_BYTE_LEN              = 1 << 0,  /* Require byte len in WC */
+        IBV_WC_EX_WITH_IMM                   = 1 << 1,  /* Require immediate in WC */
+        IBV_WC_EX_WITH_QP_NUM                = 1 << 2,  /* Require QP number in WC */
+        IBV_WC_EX_WITH_SRC_QP                = 1 << 3,  /* Require source QP in WC */
+        IBV_WC_EX_WITH_PKEY_INDEX            = 1 << 4,  /* Require pkey index in WC */
+        IBV_WC_EX_WITH_SLID                  = 1 << 5,  /* Require slid in WC */
+        IBV_WC_EX_WITH_SL                    = 1 << 6,  /* Require sl in WC */
+        IBV_WC_EX_WITH_DLID_PATH_BITS        = 1 << 7,  /* Require dlid path bits in WC */
+};
+
+.SH "RETURN VALUE"
+.B ibv_create_cq_ex()
+returns a pointer to the CQ, or NULL if the request fails.
+.SH "NOTES"
+.B ibv_create_cq_ex()
+may create a CQ with size greater than or equal to the requested
+size. Check the cqe attribute in the returned CQ for the actual size.
+.PP
+CQ should be destroyed with ibv_destroy_cq.
+.PP
+.SH "SEE ALSO"
+.BR ibv_create_cq (3),
+.BR ibv_destroy_cq (3),
+.BR ibv_resize_cq (3),
+.BR ibv_req_notify_cq (3),
+.BR ibv_ack_cq_events (3),
+.BR ibv_create_qp (3)
+.SH "AUTHORS"
+.TP
+Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
diff --git a/src/cmd.c b/src/cmd.c
index b8c51ce..d292452 100644
--- a/src/cmd.c
+++ b/src/cmd.c
@@ -438,6 +438,44 @@ int ibv_cmd_create_cq(struct ibv_context *context, int cqe,
 	return 0;
 }
 
+int ibv_cmd_create_cq_ex(struct ibv_context *context,
+			 struct ibv_cq_init_attr_ex *cq_attr,
+			 struct ibv_cq_ex *cq,
+			 struct ibv_create_cq_ex *cmd,
+			 size_t cmd_core_size,
+			 size_t cmd_size,
+			 struct ibv_create_cq_resp_ex *resp,
+			 size_t resp_core_size,
+			 size_t resp_size)
+{
+	int err;
+
+	memset(cmd, 0, cmd_core_size);
+	IBV_INIT_CMD_RESP_EX_V(cmd, cmd_core_size, cmd_size, CREATE_CQ_EX, resp,
+			       resp_core_size, resp_size);
+
+	if (cq_attr->comp_mask & ~(IBV_CQ_INIT_ATTR_MASK_RESERVED - 1))
+		return EINVAL;
+
+	cmd->user_handle   = (uintptr_t)cq;
+	cmd->cqe           = cq_attr->cqe;
+	cmd->comp_vector   = cq_attr->comp_vector;
+	cmd->comp_channel  = cq_attr->channel ? cq_attr->channel->fd : -1;
+	cmd->comp_mask = 0;
+
+	err = write(context->cmd_fd, cmd, cmd_size);
+	if (err != cmd_size)
+		return errno;
+
+	(void)VALGRIND_MAKE_MEM_DEFINED(resp, resp_size);
+
+	cq->handle  = resp->base.cq_handle;
+	cq->cqe     = resp->base.cqe;
+	cq->context = context;
+
+	return 0;
+}
+
 int ibv_cmd_poll_cq(struct ibv_cq *ibcq, int ne, struct ibv_wc *wc)
 {
 	struct ibv_poll_cq       cmd;
diff --git a/src/device.c b/src/device.c
index f2b889c..e520295 100644
--- a/src/device.c
+++ b/src/device.c
@@ -122,6 +122,38 @@ uint64_t __ibv_get_device_guid(struct ibv_device *device)
 }
 default_symver(__ibv_get_device_guid, ibv_get_device_guid);
 
+struct ibv_cq_ex *__lib_ibv_create_cq_ex(struct ibv_context *context,
+					 struct ibv_cq_init_attr_ex *cq_attr)
+{
+	struct verbs_context *vctx = verbs_get_ctx(context);
+	struct ibv_cq_ex *cq;
+
+	if (cq_attr->wc_flags & ~IBV_CREATE_CQ_SUP_WC_FLAGS) {
+		errno = EOPNOTSUPP;
+		return NULL;
+	}
+
+	pthread_mutex_lock(&context->mutex);
+
+	cq = vctx->priv->create_cq_ex(context, cq_attr);
+
+	if (cq) {
+		cq->context		   = context;
+		cq->channel		   = cq_attr->channel;
+		if (cq->channel)
+			++cq->channel->refcnt;
+		cq->cq_context		   = cq_attr->cq_context;
+		cq->comp_events_completed  = 0;
+		cq->async_events_completed = 0;
+		pthread_mutex_init(&cq->mutex, NULL);
+		pthread_cond_init(&cq->cond, NULL);
+	}
+
+	pthread_mutex_unlock(&context->mutex);
+
+	return cq;
+}
+
 struct ibv_context *__ibv_open_device(struct ibv_device *device)
 {
 	struct verbs_device *verbs_device = verbs_get_device(device);
@@ -148,6 +180,8 @@ struct ibv_context *__ibv_open_device(struct ibv_device *device)
 		if (!context)
 			goto err;
 	} else {
+		struct verbs_ex_private *priv;
+
 		/* Library now allocates the context */
 		context_ex = calloc(1, sizeof(*context_ex) +
 				       verbs_device->size_of_context);
@@ -156,6 +190,14 @@ struct ibv_context *__ibv_open_device(struct ibv_device *device)
 			goto err;
 		}
 
+		priv = calloc(1, sizeof(*priv));
+		if (!priv) {
+			errno = ENOMEM;
+			free(context_ex);
+			goto err;
+		}
+
+		context_ex->priv = priv;
 		context_ex->context.abi_compat  = __VERBS_ABI_IS_EXTENDED;
 		context_ex->sz = sizeof(*context_ex);
 
@@ -177,6 +219,11 @@ struct ibv_context *__ibv_open_device(struct ibv_device *device)
 		 */
 		context_ex->ABI_placeholder1 = (void (*)(void)) context_ex->ibv_create_flow;
 		context_ex->ABI_placeholder2 = (void (*)(void)) context_ex->ibv_destroy_flow;
+
+		if (context_ex->create_cq_ex) {
+			priv->create_cq_ex = context_ex->create_cq_ex;
+			context_ex->create_cq_ex = __lib_ibv_create_cq_ex;
+		}
 	}
 
 	context->device = device;
@@ -186,6 +233,7 @@ struct ibv_context *__ibv_open_device(struct ibv_device *device)
 	return context;
 
 verbs_err:
+	free(context_ex->priv);
 	free(context_ex);
 err:
 	close(cmd_fd);
@@ -204,6 +252,7 @@ int __ibv_close_device(struct ibv_context *context)
 	if (context_ex) {
 		struct verbs_device *verbs_device = verbs_get_device(context->device);
 		verbs_device->uninit_context(verbs_device, context);
+		free(context_ex->priv);
 		free(context_ex);
 	} else {
 		context->device->ops.free_context(context);
diff --git a/src/ibverbs.h b/src/ibverbs.h
index ff206f9..062a490 100644
--- a/src/ibverbs.h
+++ b/src/ibverbs.h
@@ -81,6 +81,11 @@ extern HIDDEN int abi_ver;
 
 HIDDEN int ibverbs_init(struct ibv_device ***list);
 
+struct verbs_ex_private {
+	struct ibv_cq_ex *(*create_cq_ex)(struct ibv_context *context,
+					  struct ibv_cq_init_attr_ex *init_attr);
+};
+
 #define IBV_INIT_CMD(cmd, size, opcode)					\
 	do {								\
 		if (abi_ver > 2)					\
diff --git a/src/libibverbs.map b/src/libibverbs.map
index a150416..5134bd9 100644
--- a/src/libibverbs.map
+++ b/src/libibverbs.map
@@ -47,6 +47,7 @@ IBVERBS_1.0 {
 		ibv_cmd_reg_mr;
 		ibv_cmd_dereg_mr;
 		ibv_cmd_create_cq;
+		ibv_cmd_create_cq_ex;
 		ibv_cmd_poll_cq;
 		ibv_cmd_req_notify_cq;
 		ibv_cmd_resize_cq;
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH V4 libibverbs 2/7] Add member functions to poll an extended CQ
       [not found] ` <1464533475-18949-1-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2016-05-29 14:51   ` [PATCH V4 libibverbs 1/7] Add support for extended creating CQ verb Yishai Hadas
@ 2016-05-29 14:51   ` Yishai Hadas
       [not found]     ` <1464533475-18949-3-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2016-05-29 14:51   ` [PATCH V4 libibverbs 3/7] Add timestamp_mask and hca_core_clock to ibv_query_device_ex Yishai Hadas
                     ` (4 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Yishai Hadas @ 2016-05-29 14:51 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	yishaih-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w,
	majd-VPRAkNaXOzVWk0Htik3J/w, talal-VPRAkNaXOzVWk0Htik3J/w,
	jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/

From: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Currently, ibv_poll_cq is one stop shop for polling all the
completion's attribute. The current implementation has a few
implications:
1. The vendor's completion format is transformed to an agnostic IB
   format (copying data which might has been used directly).
2. Adding new completion attributes means wasting more time in copies.
3. Extensible functions require wasting cycles on logic which
   determines whether a function is known and implemented in the
   provider.

In order to solve these problems, we introduce a new poll_cq mechanism
for extensible CQs. Polling is done in batches, where every batch
could contain more than one completion. A batch is started with
calling to start_poll function. This already fetches a completion
that the user can now query using the various query functions.
Advancing to the next completion is done using next_poll.
After querying all completions (or when the user wants to stop
fetching completions in the current batch), end_poll is called.

As 'wr_id' and 'status' are mostly used, they can be accessed directly
and there is no function call for.

Signed-off-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Reviewed-by: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 include/infiniband/verbs.h | 92 ++++++++++++++++++++++++++++++++++++++++++++++
 man/ibv_create_cq_ex.3     | 75 +++++++++++++++++++++++++++++++++++++
 2 files changed, 167 insertions(+)

diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h
index fb68c45..76423c3 100644
--- a/include/infiniband/verbs.h
+++ b/include/infiniband/verbs.h
@@ -860,6 +860,10 @@ struct ibv_cq {
 	uint32_t		async_events_completed;
 };
 
+struct ibv_poll_cq_attr {
+	uint32_t comp_mask;
+};
+
 struct ibv_cq_ex {
 	struct ibv_context     *context;
 	struct ibv_comp_channel *channel;
@@ -873,6 +877,23 @@ struct ibv_cq_ex {
 	uint32_t		async_events_completed;
 
 	uint32_t		comp_mask;
+	enum ibv_wc_status status;
+	uint64_t wr_id;
+	int (*start_poll)(struct ibv_cq_ex *current,
+			     struct ibv_poll_cq_attr *attr);
+	int (*next_poll)(struct ibv_cq_ex *current);
+	void (*end_poll)(struct ibv_cq_ex *current);
+	enum ibv_wc_opcode (*read_opcode)(struct ibv_cq_ex *current);
+	uint32_t (*read_vendor_err)(struct ibv_cq_ex *current);
+	uint32_t (*read_byte_len)(struct ibv_cq_ex *current);
+	uint32_t (*read_imm_data)(struct ibv_cq_ex *current);
+	uint32_t (*read_qp_num)(struct ibv_cq_ex *current);
+	uint32_t (*read_src_qp)(struct ibv_cq_ex *current);
+	int (*read_wc_flags)(struct ibv_cq_ex *current);
+	uint16_t (*read_pkey_index)(struct ibv_cq_ex *current);
+	uint32_t (*read_slid)(struct ibv_cq_ex *current);
+	uint8_t (*read_sl)(struct ibv_cq_ex *current);
+	uint8_t (*read_dlid_path_bits)(struct ibv_cq_ex *current);
 };
 
 static inline struct ibv_cq *ibv_cq_ex_to_cq(struct ibv_cq_ex *cq)
@@ -880,6 +901,77 @@ static inline struct ibv_cq *ibv_cq_ex_to_cq(struct ibv_cq_ex *cq)
 	return (struct ibv_cq *)cq;
 }
 
+static inline int ibv_start_poll(struct ibv_cq_ex *cq,
+				    struct ibv_poll_cq_attr *attr)
+{
+	return cq->start_poll(cq, attr);
+}
+
+static inline int ibv_next_poll(struct ibv_cq_ex *cq)
+{
+	return cq->next_poll(cq);
+}
+
+static inline void ibv_end_poll(struct ibv_cq_ex *cq)
+{
+	cq->end_poll(cq);
+}
+
+static inline enum ibv_wc_opcode ibv_wc_read_opcode(struct ibv_cq_ex *cq)
+{
+	return cq->read_opcode(cq);
+}
+
+static inline uint32_t ibv_wc_read_vendor_err(struct ibv_cq_ex *cq)
+{
+	return cq->read_vendor_err(cq);
+}
+
+static inline uint32_t ibv_wc_read_byte_len(struct ibv_cq_ex *cq)
+{
+	return cq->read_byte_len(cq);
+}
+
+static inline uint32_t ibv_wc_read_imm_data(struct ibv_cq_ex *cq)
+{
+	return cq->read_imm_data(cq);
+}
+
+static inline uint32_t ibv_wc_read_qp_num(struct ibv_cq_ex *cq)
+{
+	return cq->read_qp_num(cq);
+}
+
+static inline uint32_t ibv_wc_read_src_qp(struct ibv_cq_ex *cq)
+{
+	return cq->read_src_qp(cq);
+}
+
+static inline int ibv_wc_read_wc_flags(struct ibv_cq_ex *cq)
+{
+	return cq->read_wc_flags(cq);
+}
+
+static inline uint16_t ibv_wc_read_pkey_index(struct ibv_cq_ex *cq)
+{
+	return cq->read_pkey_index(cq);
+}
+
+static inline uint32_t ibv_wc_read_slid(struct ibv_cq_ex *cq)
+{
+	return cq->read_slid(cq);
+}
+
+static inline uint8_t ibv_wc_read_sl(struct ibv_cq_ex *cq)
+{
+	return cq->read_sl(cq);
+}
+
+static inline uint8_t ibv_wc_read_dlid_path_bits(struct ibv_cq_ex *cq)
+{
+	return cq->read_dlid_path_bits(cq);
+}
+
 struct ibv_ah {
 	struct ibv_context     *context;
 	struct ibv_pd	       *pd;
diff --git a/man/ibv_create_cq_ex.3 b/man/ibv_create_cq_ex.3
index cf43784..3a24632 100644
--- a/man/ibv_create_cq_ex.3
+++ b/man/ibv_create_cq_ex.3
@@ -41,6 +41,81 @@ enum ibv_wc_flags_ex {
         IBV_WC_EX_WITH_DLID_PATH_BITS        = 1 << 7,  /* Require dlid path bits in WC */
 };
 
+.SH "Polling an extended CQ"
+In order to poll an extended CQ efficiently, a user could use the following functions.
+
+.TP
+.B Completion iterator functions
+
+.BI "int ibv_start_poll(struct ibv_cq_ex " "*cq" ", struct ibv_poll_cq_attr " "*attr")
+.br
+Start polling a batch of work completions.
+.I attr
+is given in order to make this function
+easily extensible in the future. This function either returns 0 on success or an error code
+otherwise. When no completions are available on the CQ, ENOENT is returned, but the CQ remains
+in a valid state. On success, querying the completion's attribute could be done using the query
+functions described below. If an error code is given, end_poll shouldn't be called.
+
+.BI "int ibv_next_poll(struct ibv_cq_ex " "*cq")
+.br
+This function is called in order to get the next work completion. It has to be called after
+.I start_poll
+and before
+.I end_poll
+are called. This function either returns 0 on success or an error code
+otherwise. When no completions are available on the CQ, ENOENT is returned, but the CQ remains
+in a valid state. On success, querying the completion's attribute could be done using the query
+functions described below. If an error code is given, end_poll should still be called,
+indicating this is the end of the polled batch.
+
+.BI "void ibv_end_poll(struct ibv_cq_ex " "*cq")
+.br
+This function indicates the end of polling batch of work completions. After calling this function, the user should start a new batch
+by calling
+.I start_poll.
+
+.TP
+.B Polling fields in the completion
+Below members and functions are used in order to poll the current completion. The current completion is the completion which the iterator points to (start_poll and next_poll advances this iterator). Only fields that the user requested via wc_flags in ibv_create_cq_ex could be queried. In addition, some fields are only valid in certain opcodes and status codes.
+
+.BI "uint64_t wr_id - Can be accessed directly from struct ibv_cq_ex".
+
+.BI "enum ibv_wc_status - Can be accessed directly from struct ibv_cq_ex".
+
+.BI "enum ibv_wc_opcode ibv_wc_read_opcode(struct ibv_cq_ex " "*cq"); \c
+ Get the opcode from the current completion.
+
+.BI "uint32_t ibv_wc_read_vendor_err(struct ibv_cq_ex " "*cq"); \c
+ Get the vendor error from the current completion.
+
+.BI "uint32_t ibv_wc_read_byte_len(struct ibv_cq_ex " "*cq"); \c
+ Get the vendor error from the current completion.
+
+.BI "uint32_t ibv_wc_read_imm_data(struct ibv_cq_ex " "*cq"); \c
+ Get the immediate data field from the current completion.
+
+.BI "uint32_t ibv_wc_read_qp_num(struct ibv_cq_ex " "*cq"); \c
+ Get the QP number field from the current completion.
+
+.BI "uint32_t ibv_wc_read_src_qp(struct ibv_cq_ex " "*cq"); \c
+ Get the source QP number field from the current completion.
+
+.BI "int ibv_wc_read_wc_flags(struct ibv_cq_ex " "*cq"); \c
+ Get the QP flags field from the current completion.
+
+.BI "uint16_t ibv_wc_read_pkey_index(struct ibv_cq_ex " "*cq"); \c
+ Get the pkey index field from the current completion.
+
+.BI "uint32_t ibv_wc_read_slid(struct ibv_cq_ex " "*cq"); \c
+ Get the slid field from the current completion.
+
+.BI "uint8_t ibv_wc_read_sl(struct ibv_cq_ex " "*cq"); \c
+ Get the sl field from the current completion.
+
+.BI "uint8_t ibv_wc_read_dlid_path_bits(struct ibv_cq_ex " "*cq"); \c
+ Get the dlid_path_bits field from the current completion.
+
 .SH "RETURN VALUE"
 .B ibv_create_cq_ex()
 returns a pointer to the CQ, or NULL if the request fails.
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH V4 libibverbs 3/7] Add timestamp_mask and hca_core_clock to ibv_query_device_ex
       [not found] ` <1464533475-18949-1-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2016-05-29 14:51   ` [PATCH V4 libibverbs 1/7] Add support for extended creating CQ verb Yishai Hadas
  2016-05-29 14:51   ` [PATCH V4 libibverbs 2/7] Add member functions to poll an extended CQ Yishai Hadas
@ 2016-05-29 14:51   ` Yishai Hadas
  2016-05-29 14:51   ` [PATCH V4 libibverbs 4/7] Add completion timestamp to poll_cq Yishai Hadas
                     ` (3 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Yishai Hadas @ 2016-05-29 14:51 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	yishaih-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w,
	majd-VPRAkNaXOzVWk0Htik3J/w, talal-VPRAkNaXOzVWk0Htik3J/w,
	jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/

From: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

The fields timestamp_mask and hca_core_clock were added
to the extended version of ibv_query_device verb.
timestamp_mask represents the supported mask of the timestamp.
Users could infer the accuracy of the reported possible
timestamp.
hca_core_clock represents the frequency of the HCA (in kHZ).

Since timestamp is given in hardware cycles, knowing the frequency
is mandatory in order to convert this number into seconds.

Signed-off-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Reviewed-by: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 examples/devinfo.c            | 10 ++++++++++
 include/infiniband/kern-abi.h |  2 ++
 include/infiniband/verbs.h    |  2 ++
 man/ibv_query_device_ex.3     |  6 ++++--
 src/cmd.c                     | 21 +++++++++++++++++++++
 5 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/examples/devinfo.c b/examples/devinfo.c
index a8de982..0af8c3b 100644
--- a/examples/devinfo.c
+++ b/examples/devinfo.c
@@ -339,6 +339,16 @@ static int print_hca_cap(struct ibv_device *ib_dev, uint8_t ib_port)
 		printf("\tlocal_ca_ack_delay:\t\t%d\n", device_attr.orig_attr.local_ca_ack_delay);
 
 		print_odp_caps(&device_attr.odp_caps);
+		if (device_attr.completion_timestamp_mask)
+			printf("\tcompletion timestamp_mask:\t\t\t0x%016lx\n",
+			       device_attr.completion_timestamp_mask);
+		else
+			printf("\tcompletion_timestamp_mask not supported\n");
+
+		if (device_attr.hca_core_clock)
+			printf("\thca_core_clock:\t\t\t%lukHZ\n", device_attr.hca_core_clock);
+		else
+			printf("\tcore clock not supported\n");
 	}
 
 	for (port = 1; port <= device_attr.orig_attr.phys_port_cnt; ++port) {
diff --git a/include/infiniband/kern-abi.h b/include/infiniband/kern-abi.h
index 36a2c8c..a2c0c46 100644
--- a/include/infiniband/kern-abi.h
+++ b/include/infiniband/kern-abi.h
@@ -271,6 +271,8 @@ struct ibv_query_device_resp_ex {
 	__u32 comp_mask;
 	__u32 response_length;
 	struct ibv_odp_caps_resp odp_caps;
+	__u64 timestamp_mask;
+	__u64 hca_core_clock;
 };
 
 struct ibv_query_port {
diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h
index 76423c3..2c9cb08 100644
--- a/include/infiniband/verbs.h
+++ b/include/infiniband/verbs.h
@@ -207,6 +207,8 @@ struct ibv_device_attr_ex {
 	struct ibv_device_attr	orig_attr;
 	uint32_t		comp_mask;
 	struct ibv_odp_caps	odp_caps;
+	uint64_t		completion_timestamp_mask;
+	uint64_t		hca_core_clock;
 };
 
 enum ibv_mtu {
diff --git a/man/ibv_query_device_ex.3 b/man/ibv_query_device_ex.3
index 1f483d2..db12c2b 100644
--- a/man/ibv_query_device_ex.3
+++ b/man/ibv_query_device_ex.3
@@ -22,8 +22,10 @@ is a pointer to an ibv_device_attr_ex struct, as defined in <infiniband/verbs.h>
 struct ibv_device_attr_ex {
 .in +8
 struct ibv_device_attr orig_attr;
-uint32_t               comp_mask;              /* Compatibility mask that defines which of the following variables are valid */
-struct ibv_odp_caps    odp_caps;               /* On-Demand Paging capabilities */
+uint32_t               comp_mask;                  /* Compatibility mask that defines which of the following variables are valid */
+struct ibv_odp_caps    odp_caps;                   /* On-Demand Paging capabilities */
+uint64_t               completion_timestamp_mask;  /* Completion timestamp mask (0 = unsupported) */
+uint64_t               hca_core_clock;             /* The frequency (in kHZ) of the HCA (0 = unsupported) */
 .in -8
 };
 
diff --git a/src/cmd.c b/src/cmd.c
index d292452..777fded 100644
--- a/src/cmd.c
+++ b/src/cmd.c
@@ -189,6 +189,27 @@ int ibv_cmd_query_device_ex(struct ibv_context *context,
 		}
 	}
 
+	if (attr_size >= offsetof(struct ibv_device_attr_ex,
+				  completion_timestamp_mask) +
+			 sizeof(attr->completion_timestamp_mask)) {
+		if (resp->response_length >=
+		    offsetof(struct ibv_query_device_resp_ex, timestamp_mask) +
+		    sizeof(resp->timestamp_mask))
+			attr->completion_timestamp_mask = resp->timestamp_mask;
+		else
+			attr->completion_timestamp_mask = 0;
+	}
+
+	if (attr_size >= offsetof(struct ibv_device_attr_ex, hca_core_clock) +
+			 sizeof(attr->hca_core_clock)) {
+		if (resp->response_length >=
+		    offsetof(struct ibv_query_device_resp_ex, hca_core_clock) +
+		    sizeof(resp->hca_core_clock))
+			attr->hca_core_clock = resp->hca_core_clock;
+		else
+			attr->hca_core_clock = 0;
+	}
+
 	return 0;
 }
 
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH V4 libibverbs 4/7] Add completion timestamp to poll_cq
       [not found] ` <1464533475-18949-1-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (2 preceding siblings ...)
  2016-05-29 14:51   ` [PATCH V4 libibverbs 3/7] Add timestamp_mask and hca_core_clock to ibv_query_device_ex Yishai Hadas
@ 2016-05-29 14:51   ` Yishai Hadas
  2016-05-29 14:51   ` [PATCH V4 libibverbs 5/7] Create a single threaded CQ Yishai Hadas
                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Yishai Hadas @ 2016-05-29 14:51 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	yishaih-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w,
	majd-VPRAkNaXOzVWk0Htik3J/w, talal-VPRAkNaXOzVWk0Htik3J/w,
	jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/

From: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

In order to report when a completion was created, we add a member
function to the CQ which reads the completion timestamp from the
current CQ. The time is given in raw format.

Signed-off-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Reviewed-by: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 include/infiniband/kern-abi.h |  4 ++++
 include/infiniband/verbs.h    | 10 +++++++++-
 man/ibv_create_cq_ex.3        |  4 ++++
 src/cmd.c                     |  6 ++++++
 4 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/include/infiniband/kern-abi.h b/include/infiniband/kern-abi.h
index a2c0c46..f882cc9 100644
--- a/include/infiniband/kern-abi.h
+++ b/include/infiniband/kern-abi.h
@@ -447,6 +447,10 @@ struct ibv_create_cq_resp {
 	__u32 cqe;
 };
 
+enum ibv_create_cq_ex_kernel_flags {
+	IBV_CREATE_CQ_EX_KERNEL_FLAG_COMPLETION_TIMESTAMP = 1 << 0,
+};
+
 struct ibv_create_cq_ex {
 	struct ex_hdr	hdr;
 	__u64		user_handle;
diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h
index 2c9cb08..2bcc70d 100644
--- a/include/infiniband/verbs.h
+++ b/include/infiniband/verbs.h
@@ -371,6 +371,7 @@ enum ibv_create_cq_wc_flags {
 	IBV_WC_EX_WITH_SLID		= 1 << 5,
 	IBV_WC_EX_WITH_SL		= 1 << 6,
 	IBV_WC_EX_WITH_DLID_PATH_BITS	= 1 << 7,
+	IBV_WC_EX_WITH_COMPLETION_TIMESTAMP	= 1 << 8,
 };
 
 enum {
@@ -385,7 +386,8 @@ enum {
 };
 
 enum {
-	IBV_CREATE_CQ_SUP_WC_FLAGS = IBV_WC_STANDARD_FLAGS
+	IBV_CREATE_CQ_SUP_WC_FLAGS = IBV_WC_STANDARD_FLAGS |
+				IBV_WC_EX_WITH_COMPLETION_TIMESTAMP
 };
 
 enum ibv_wc_flags {
@@ -896,6 +898,7 @@ struct ibv_cq_ex {
 	uint32_t (*read_slid)(struct ibv_cq_ex *current);
 	uint8_t (*read_sl)(struct ibv_cq_ex *current);
 	uint8_t (*read_dlid_path_bits)(struct ibv_cq_ex *current);
+	uint64_t (*read_completion_ts)(struct ibv_cq_ex *current);
 };
 
 static inline struct ibv_cq *ibv_cq_ex_to_cq(struct ibv_cq_ex *cq)
@@ -974,6 +977,11 @@ static inline uint8_t ibv_wc_read_dlid_path_bits(struct ibv_cq_ex *cq)
 	return cq->read_dlid_path_bits(cq);
 }
 
+static inline uint64_t ibv_wc_read_completion_ts(struct ibv_cq_ex *cq)
+{
+	return cq->read_completion_ts(cq);
+}
+
 struct ibv_ah {
 	struct ibv_context     *context;
 	struct ibv_pd	       *pd;
diff --git a/man/ibv_create_cq_ex.3 b/man/ibv_create_cq_ex.3
index 3a24632..b0b5e2d 100644
--- a/man/ibv_create_cq_ex.3
+++ b/man/ibv_create_cq_ex.3
@@ -39,6 +39,7 @@ enum ibv_wc_flags_ex {
         IBV_WC_EX_WITH_SLID                  = 1 << 5,  /* Require slid in WC */
         IBV_WC_EX_WITH_SL                    = 1 << 6,  /* Require sl in WC */
         IBV_WC_EX_WITH_DLID_PATH_BITS        = 1 << 7,  /* Require dlid path bits in WC */
+        IBV_WC_EX_WITH_COMPLETION_TIMESTAMP  = 1 << 8,  /* Require completion timestamp in WC /*
 };
 
 .SH "Polling an extended CQ"
@@ -116,6 +117,9 @@ Below members and functions are used in order to poll the current completion. Th
 .BI "uint8_t ibv_wc_read_dlid_path_bits(struct ibv_cq_ex " "*cq"); \c
  Get the dlid_path_bits field from the current completion.
 
+.BI "uint64_t ibv_wc_read_completion_ts(struct ibv_cq_ex " "*cq"); \c
+ Get the completion timestamp from the current completion.
+
 .SH "RETURN VALUE"
 .B ibv_create_cq_ex()
 returns a pointer to the CQ, or NULL if the request fails.
diff --git a/src/cmd.c b/src/cmd.c
index 777fded..647fbbb 100644
--- a/src/cmd.c
+++ b/src/cmd.c
@@ -484,6 +484,12 @@ int ibv_cmd_create_cq_ex(struct ibv_context *context,
 	cmd->comp_channel  = cq_attr->channel ? cq_attr->channel->fd : -1;
 	cmd->comp_mask = 0;
 
+	if (cmd_core_size >= offsetof(struct ibv_create_cq_ex, flags) +
+	    sizeof(cmd->flags)) {
+		if (cq_attr->wc_flags & IBV_WC_EX_WITH_COMPLETION_TIMESTAMP)
+			cmd->flags |= IBV_CREATE_CQ_EX_KERNEL_FLAG_COMPLETION_TIMESTAMP;
+	}
+
 	err = write(context->cmd_fd, cmd, cmd_size);
 	if (err != cmd_size)
 		return errno;
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH V4 libibverbs 5/7] Create a single threaded CQ
       [not found] ` <1464533475-18949-1-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (3 preceding siblings ...)
  2016-05-29 14:51   ` [PATCH V4 libibverbs 4/7] Add completion timestamp to poll_cq Yishai Hadas
@ 2016-05-29 14:51   ` Yishai Hadas
  2016-05-29 14:51   ` [PATCH V4 libibverbs 6/7] Add a verb that queries real time values from the HCA Yishai Hadas
  2016-05-29 14:51   ` [PATCH V4 libibverbs 7/7] Add timestamp support in rc_pingpong Yishai Hadas
  6 siblings, 0 replies; 21+ messages in thread
From: Yishai Hadas @ 2016-05-29 14:51 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	yishaih-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w,
	majd-VPRAkNaXOzVWk0Htik3J/w, talal-VPRAkNaXOzVWk0Htik3J/w,
	jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/

From: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

When a CQ is used only from one thread, there's no need to waste cycles
on locking. Since this series introduces a mechanism which allows the
vendor to introduce different polling functions per CQ, it allows the
vendor to implement both locking and lockless CQs and assign them
accordingly.

Adding a new creation flag for this.

Signed-off-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Reviewed-by: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 include/infiniband/verbs.h | 12 +++++++++++-
 man/ibv_create_cq_ex.3     |  8 ++++++++
 src/cmd.c                  |  4 ++++
 3 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h
index 2bcc70d..6642336 100644
--- a/include/infiniband/verbs.h
+++ b/include/infiniband/verbs.h
@@ -1193,7 +1193,13 @@ struct ibv_context {
 };
 
 enum ibv_cq_init_attr_mask {
-	IBV_CQ_INIT_ATTR_MASK_RESERVED	= 0 << 1
+	IBV_CQ_INIT_ATTR_MASK_FLAGS	= 1 << 0,
+	IBV_CQ_INIT_ATTR_MASK_RESERVED	= 1 << 1
+};
+
+enum ibv_create_cq_attr_flags {
+	IBV_CREATE_CQ_ATTR_SINGLE_THREADED = 1 << 0,
+	IBV_CREATE_CQ_ATTR_RESERVED = 1 << 1,
 };
 
 struct ibv_cq_init_attr_ex {
@@ -1215,6 +1221,10 @@ struct ibv_cq_init_attr_ex {
 	 * enum ibv_cq_init_attr_mask
 	 */
 	uint32_t		comp_mask;
+	/* create cq attr flags - one or more flags from
+	 * enum ibv_create_cq_attr_flags
+	 */
+	uint32_t		flags;
 };
 
 enum verbs_context_mask {
diff --git a/man/ibv_create_cq_ex.3 b/man/ibv_create_cq_ex.3
index b0b5e2d..93a1a38 100644
--- a/man/ibv_create_cq_ex.3
+++ b/man/ibv_create_cq_ex.3
@@ -42,6 +42,14 @@ enum ibv_wc_flags_ex {
         IBV_WC_EX_WITH_COMPLETION_TIMESTAMP  = 1 << 8,  /* Require completion timestamp in WC /*
 };
 
+enum ibv_cq_init_attr_mask {
+        IBV_CQ_INIT_ATTR_MASK_FLAGS             = 1 << 0,
+};
+
+enum ibv_create_cq_attr_flags {
+        IBV_CREATE_CQ_ATTR_SINGLE_THREADED      = 1 << 0, /* This CQ is used from a single threaded, thus no locking is required */
+};
+
 .SH "Polling an extended CQ"
 In order to poll an extended CQ efficiently, a user could use the following functions.
 
diff --git a/src/cmd.c b/src/cmd.c
index 647fbbb..5cd3e78 100644
--- a/src/cmd.c
+++ b/src/cmd.c
@@ -486,6 +486,10 @@ int ibv_cmd_create_cq_ex(struct ibv_context *context,
 
 	if (cmd_core_size >= offsetof(struct ibv_create_cq_ex, flags) +
 	    sizeof(cmd->flags)) {
+		if ((cq_attr->comp_mask & IBV_CQ_INIT_ATTR_MASK_FLAGS) &&
+		    (cq_attr->flags & ~(IBV_CREATE_CQ_ATTR_RESERVED - 1)))
+			return EOPNOTSUPP;
+
 		if (cq_attr->wc_flags & IBV_WC_EX_WITH_COMPLETION_TIMESTAMP)
 			cmd->flags |= IBV_CREATE_CQ_EX_KERNEL_FLAG_COMPLETION_TIMESTAMP;
 	}
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH V4 libibverbs 6/7] Add a verb that queries real time values from the HCA
       [not found] ` <1464533475-18949-1-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (4 preceding siblings ...)
  2016-05-29 14:51   ` [PATCH V4 libibverbs 5/7] Create a single threaded CQ Yishai Hadas
@ 2016-05-29 14:51   ` Yishai Hadas
  2016-05-29 14:51   ` [PATCH V4 libibverbs 7/7] Add timestamp support in rc_pingpong Yishai Hadas
  6 siblings, 0 replies; 21+ messages in thread
From: Yishai Hadas @ 2016-05-29 14:51 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	yishaih-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w,
	majd-VPRAkNaXOzVWk0Htik3J/w, talal-VPRAkNaXOzVWk0Htik3J/w,
	jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/

From: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Add ibv_query_rt_values_ex, an extension verb to query certain real
time values of a device. Currently, only IBV_VALUES_MASK_RAW_CLOCK is
supported, but this verb could support other flags like
IBV_VALUES_MASK_TEMP_SENSOR, IBV_VALUES_MASK_CORE_FREQ, etc.

This extension verb only calls the provider.
The provider has to query this value somehow and mark the queried
values in comp_mask.

Signed-off-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Reviewed-by: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 Makefile.am                  |  3 ++-
 include/infiniband/verbs.h   | 33 +++++++++++++++++++++++++++++
 man/ibv_query_rt_values_ex.3 | 50 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 85 insertions(+), 1 deletion(-)
 create mode 100644 man/ibv_query_rt_values_ex.3

diff --git a/Makefile.am b/Makefile.am
index a1c2122..16f838d 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -65,7 +65,8 @@ man_MANS = man/ibv_asyncwatch.1 man/ibv_devices.1 man/ibv_devinfo.1	\
     man/ibv_create_qp_ex.3 man/ibv_create_srq_ex.3 man/ibv_open_xrcd.3  \
     man/ibv_get_srq_num.3 man/ibv_open_qp.3 man/ibv_query_device_ex.3	\
     man/ibv_alloc_mw.3 man/ibv_bind_mw.3 man/ibv_inc_rkey.3		\
-    man/ibv_rereg_mr.3 man/ibv_create_cq_ex.3
+    man/ibv_rereg_mr.3 man/ibv_create_cq_ex.3				\
+    man/ibv_query_rt_values_ex.3
 
 DEBIAN = debian/changelog debian/compat debian/control debian/copyright \
     debian/ibverbs-utils.install debian/libibverbs1.install \
diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h
index 6642336..c5878ae 100644
--- a/include/infiniband/verbs.h
+++ b/include/infiniband/verbs.h
@@ -1227,6 +1227,16 @@ struct ibv_cq_init_attr_ex {
 	uint32_t		flags;
 };
 
+enum ibv_values_mask {
+	IBV_VALUES_MASK_RAW_CLOCK	= 1 << 0,
+	IBV_VALUES_MASK_RESERVED	= 1 << 1
+};
+
+struct ibv_values_ex {
+	uint32_t	comp_mask;
+	struct timespec raw_clock;
+};
+
 enum verbs_context_mask {
 	VERBS_CONTEXT_XRCD	= 1 << 0,
 	VERBS_CONTEXT_SRQ	= 1 << 1,
@@ -1238,6 +1248,8 @@ enum verbs_context_mask {
 
 struct verbs_context {
 	/*  "grows up" - new fields go here */
+	int (*query_rt_values)(struct ibv_context *context,
+			       struct ibv_values_ex *values);
 	struct ibv_cq_ex *(*create_cq_ex)(struct ibv_context *context,
 					  struct ibv_cq_init_attr_ex *init_attr);
 	struct verbs_ex_private *priv;
@@ -1757,6 +1769,27 @@ ibv_create_qp_ex(struct ibv_context *context, struct ibv_qp_init_attr_ex *qp_ini
 }
 
 /**
+ * ibv_query_rt_values_ex - Get current real time @values of a device.
+ * @values - in/out - defines the attributes we need to query/queried.
+ * (Or's bits of enum ibv_values_mask on values->comp_mask field)
+ */
+static inline int
+ibv_query_rt_values_ex(struct ibv_context *context,
+		       struct ibv_values_ex *values)
+{
+	struct verbs_context *vctx;
+
+	vctx = verbs_get_ctx_op(context, query_rt_values);
+	if (!vctx)
+		return ENOSYS;
+
+	if (values->comp_mask & ~(IBV_VALUES_MASK_RESERVED - 1))
+		return EINVAL;
+
+	return vctx->query_rt_values(context, values);
+}
+
+/**
  * ibv_query_device_ex - Get extended device properties
  */
 static inline int
diff --git a/man/ibv_query_rt_values_ex.3 b/man/ibv_query_rt_values_ex.3
new file mode 100644
index 0000000..fcc460c
--- /dev/null
+++ b/man/ibv_query_rt_values_ex.3
@@ -0,0 +1,50 @@
+.\" -*- nroff -*-
+.\"
+.TH IBV_QUERY_RT_VALUES_EX 3 2016-2-20 libibverbs "Libibverbs Programmer's Manual"
+.SH "NAME"
+ibv_query_rt_values_ex \- query an RDMA device for some real time values
+.SH "SYNOPSIS"
+.nf
+.B #include <infiniband/verbs.h>
+.sp
+.BI "int ibv_query_rt_values_ex(struct ibv_context " "*context",
+.BI "                           struct ibv_values_ex " "*values" );
+.fi
+.SH "DESCRIPTION"
+.B ibv_query_rt_values_ex()
+returns certain real time values of a device
+.I context\fR.
+The argument
+.I attr
+is a pointer to an ibv_device_attr_ex struct, as defined in <infiniband/verbs.h>.
+.PP
+.nf
+struct ibv_values_ex {
+.in +8
+uint32_t             comp_mask;    /* Compatibility mask that defines the query/queried fields [in/out] */
+struct timespec      raw_clock;    /* HW raw clock */
+.in -8
+};
+
+enum ibv_values_mask {
+        IBV_VALUES_MASK_RAW_CLOCK = 1 << 0, /* HW raw clock */
+};
+
+.fi
+.SH "RETURN VALUE"
+.B ibv_query_rt_values_ex()
+returns 0 on success, or the value of errno on failure (which indicates the failure reason).
+.SH "NOTES"
+This extension verb only calls the provider, the provider has to query this value somehow and mark
+the queried values in the comp_mask field.
+.SH "SEE ALSO"
+.BR ibv_query_device (3),
+.BR ibv_open_device (3),
+.BR ibv_query_port (3),
+.BR ibv_query_pkey (3),
+.BR ibv_query_gid (3)
+.SH "AUTHORS"
+.TP
+Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
+.TP
+Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH V4 libibverbs 7/7] Add timestamp support in rc_pingpong
       [not found] ` <1464533475-18949-1-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (5 preceding siblings ...)
  2016-05-29 14:51   ` [PATCH V4 libibverbs 6/7] Add a verb that queries real time values from the HCA Yishai Hadas
@ 2016-05-29 14:51   ` Yishai Hadas
  6 siblings, 0 replies; 21+ messages in thread
From: Yishai Hadas @ 2016-05-29 14:51 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	yishaih-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w,
	majd-VPRAkNaXOzVWk0Htik3J/w, talal-VPRAkNaXOzVWk0Htik3J/w,
	jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/

Add timestamp support in rc_pingpong, it can serve as some example
of using ibv_create_cq_ex verb and the new
ibv_wc_read_xxx accessors.

Signed-off-by: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Reviewed-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 examples/rc_pingpong.c | 278 ++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 215 insertions(+), 63 deletions(-)

diff --git a/examples/rc_pingpong.c b/examples/rc_pingpong.c
index 90a8320..8d58357 100644
--- a/examples/rc_pingpong.c
+++ b/examples/rc_pingpong.c
@@ -49,6 +49,14 @@
 
 #include "pingpong.h"
 
+#ifndef max
+#define max(x, y) (((x) > (y)) ? (x) : (y))
+#endif
+
+#ifndef min
+#define min(x, y) (((x) < (y)) ? (x) : (y))
+#endif
+
 enum {
 	PINGPONG_RECV_WRID = 1,
 	PINGPONG_SEND_WRID = 2,
@@ -56,13 +64,17 @@ enum {
 
 static int page_size;
 static int use_odp;
+static int use_ts;
 
 struct pingpong_context {
 	struct ibv_context	*context;
 	struct ibv_comp_channel *channel;
 	struct ibv_pd		*pd;
 	struct ibv_mr		*mr;
-	struct ibv_cq		*cq;
+	union {
+		struct ibv_cq		*cq;
+		struct ibv_cq_ex	*cq_ex;
+	} cq_s;
 	struct ibv_qp		*qp;
 	void			*buf;
 	int			 size;
@@ -70,8 +82,15 @@ struct pingpong_context {
 	int			 rx_depth;
 	int			 pending;
 	struct ibv_port_attr     portinfo;
+	uint64_t		 completion_timestamp_mask;
 };
 
+struct ibv_cq *pp_cq(struct pingpong_context *ctx)
+{
+	return use_ts ? ibv_cq_ex_to_cq(ctx->cq_s.cq_ex) :
+		ctx->cq_s.cq;
+}
+
 struct pingpong_dest {
 	int lid;
 	int qpn;
@@ -357,7 +376,7 @@ static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev, int size,
 		goto clean_comp_channel;
 	}
 
-	if (use_odp) {
+	if (use_odp || use_ts) {
 		const uint32_t rc_caps_mask = IBV_ODP_SUPPORT_SEND |
 					      IBV_ODP_SUPPORT_RECV;
 		struct ibv_device_attr_ex attrx;
@@ -367,12 +386,22 @@ static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev, int size,
 			goto clean_comp_channel;
 		}
 
-		if (!(attrx.odp_caps.general_caps & IBV_ODP_SUPPORT) ||
-		    (attrx.odp_caps.per_transport_caps.rc_odp_caps & rc_caps_mask) != rc_caps_mask) {
-			fprintf(stderr, "The device isn't ODP capable or does not support RC send and receive with ODP\n");
-			goto clean_comp_channel;
+		if (use_odp) {
+			if (!(attrx.odp_caps.general_caps & IBV_ODP_SUPPORT) ||
+			    (attrx.odp_caps.per_transport_caps.rc_odp_caps & rc_caps_mask) != rc_caps_mask) {
+				fprintf(stderr, "The device isn't ODP capable or does not support RC send and receive with ODP\n");
+				goto clean_comp_channel;
+			}
+			access_flags |= IBV_ACCESS_ON_DEMAND;
+		}
+
+		if (use_ts) {
+			if (!attrx.completion_timestamp_mask) {
+				fprintf(stderr, "The device isn't completion timestamp capable\n");
+				goto clean_comp_channel;
+			}
+			ctx->completion_timestamp_mask = attrx.completion_timestamp_mask;
 		}
-		access_flags |= IBV_ACCESS_ON_DEMAND;
 	}
 	ctx->mr = ibv_reg_mr(ctx->pd, ctx->buf, size, access_flags);
 
@@ -381,9 +410,22 @@ static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev, int size,
 		goto clean_pd;
 	}
 
-	ctx->cq = ibv_create_cq(ctx->context, rx_depth + 1, NULL,
-				ctx->channel, 0);
-	if (!ctx->cq) {
+	if (use_ts) {
+		struct ibv_cq_init_attr_ex attr_ex = {
+			.cqe = rx_depth + 1,
+			.cq_context = NULL,
+			.channel = ctx->channel,
+			.comp_vector = 0,
+			.wc_flags = IBV_WC_EX_WITH_COMPLETION_TIMESTAMP
+		};
+
+		ctx->cq_s.cq_ex = ibv_create_cq_ex(ctx->context, &attr_ex);
+	} else {
+		ctx->cq_s.cq = ibv_create_cq(ctx->context, rx_depth + 1, NULL,
+					     ctx->channel, 0);
+	}
+
+	if (!pp_cq(ctx)) {
 		fprintf(stderr, "Couldn't create CQ\n");
 		goto clean_mr;
 	}
@@ -391,8 +433,8 @@ static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev, int size,
 	{
 		struct ibv_qp_attr attr;
 		struct ibv_qp_init_attr init_attr = {
-			.send_cq = ctx->cq,
-			.recv_cq = ctx->cq,
+			.send_cq = pp_cq(ctx),
+			.recv_cq = pp_cq(ctx),
 			.cap     = {
 				.max_send_wr  = 1,
 				.max_recv_wr  = rx_depth,
@@ -438,7 +480,7 @@ clean_qp:
 	ibv_destroy_qp(ctx->qp);
 
 clean_cq:
-	ibv_destroy_cq(ctx->cq);
+	ibv_destroy_cq(pp_cq(ctx));
 
 clean_mr:
 	ibv_dereg_mr(ctx->mr);
@@ -469,7 +511,7 @@ int pp_close_ctx(struct pingpong_context *ctx)
 		return 1;
 	}
 
-	if (ibv_destroy_cq(ctx->cq)) {
+	if (ibv_destroy_cq(pp_cq(ctx))) {
 		fprintf(stderr, "Couldn't destroy CQ\n");
 		return 1;
 	}
@@ -543,6 +585,89 @@ static int pp_post_send(struct pingpong_context *ctx)
 	return ibv_post_send(ctx->qp, &wr, &bad_wr);
 }
 
+struct ts_params {
+	unsigned int		 comp_recv_max_time_delta;
+	unsigned int		 comp_recv_min_time_delta;
+	uint64_t		 comp_recv_total_time_delta;
+	uint64_t		 comp_recv_prev_time;
+	int			 last_comp_with_ts;
+	unsigned int		 comp_with_time_iters;
+};
+
+static inline int parse_single_wc(struct pingpong_context *ctx, int *scnt,
+				  int *rcnt, int *routs, int iters,
+				  uint64_t wr_id, enum ibv_wc_status status,
+				  uint64_t completion_timestamp,
+				  struct ts_params *ts)
+{
+	if (status != IBV_WC_SUCCESS) {
+		fprintf(stderr, "Failed status %s (%d) for wr_id %d\n",
+			ibv_wc_status_str(status),
+			status, (int)wr_id);
+		return 1;
+	}
+
+	switch ((int)wr_id) {
+	case PINGPONG_SEND_WRID:
+		++(*scnt);
+		break;
+
+	case PINGPONG_RECV_WRID:
+		if (--(*routs) <= 1) {
+			*routs += pp_post_recv(ctx, ctx->rx_depth - *routs);
+			if (*routs < ctx->rx_depth) {
+				fprintf(stderr,
+					"Couldn't post receive (%d)\n",
+					*routs);
+				return 1;
+			}
+		}
+
+		++(*rcnt);
+		if (use_ts) {
+			if (ts->last_comp_with_ts) {
+				uint64_t delta;
+
+				/* checking whether the clock was wrapped around */
+				if (completion_timestamp >= ts->comp_recv_prev_time)
+					delta = completion_timestamp - ts->comp_recv_prev_time;
+				else
+					delta = ctx->completion_timestamp_mask - ts->comp_recv_prev_time +
+						completion_timestamp + 1;
+
+				ts->comp_recv_max_time_delta = max(ts->comp_recv_max_time_delta, delta);
+				ts->comp_recv_min_time_delta = min(ts->comp_recv_min_time_delta, delta);
+				ts->comp_recv_total_time_delta += delta;
+				ts->comp_with_time_iters++;
+			}
+
+			ts->comp_recv_prev_time = completion_timestamp;
+			ts->last_comp_with_ts = 1;
+		} else {
+			ts->last_comp_with_ts = 0;
+		}
+
+		break;
+
+	default:
+		fprintf(stderr, "Completion for unknown wr_id %d\n",
+			(int)wr_id);
+		return 1;
+	}
+
+	ctx->pending &= ~(int)wr_id;
+	if (*scnt < iters && !ctx->pending) {
+		if (pp_post_send(ctx)) {
+			fprintf(stderr, "Couldn't post send\n");
+			return 1;
+		}
+		ctx->pending = PINGPONG_RECV_WRID |
+			PINGPONG_SEND_WRID;
+	}
+
+	return 0;
+}
+
 static void usage(const char *argv0)
 {
 	printf("Usage:\n");
@@ -561,6 +686,7 @@ static void usage(const char *argv0)
 	printf("  -e, --events           sleep on CQ events (default poll)\n");
 	printf("  -g, --gid-idx=<gid index> local port gid index\n");
 	printf("  -o, --odp		    use on demand paging\n");
+	printf("  -t, --ts	            get CQE with timestamp\n");
 }
 
 int main(int argc, char *argv[])
@@ -586,6 +712,7 @@ int main(int argc, char *argv[])
 	int                      sl = 0;
 	int			 gidx = -1;
 	char			 gid[33];
+	struct ts_params	 ts;
 
 	srand48(getpid() * time(NULL));
 
@@ -604,11 +731,12 @@ int main(int argc, char *argv[])
 			{ .name = "events",   .has_arg = 0, .val = 'e' },
 			{ .name = "gid-idx",  .has_arg = 1, .val = 'g' },
 			{ .name = "odp",      .has_arg = 0, .val = 'o' },
+			{ .name = "ts",       .has_arg = 0, .val = 't' },
 			{ 0 }
 		};
 
-		c = getopt_long(argc, argv, "p:d:i:s:m:r:n:l:eg:o",
-							long_options, NULL);
+		c = getopt_long(argc, argv, "p:d:i:s:m:r:n:l:eg:ot",
+				long_options, NULL);
 
 		if (c == -1)
 			break;
@@ -669,6 +797,9 @@ int main(int argc, char *argv[])
 		case 'o':
 			use_odp = 1;
 			break;
+		case 't':
+			use_ts = 1;
+			break;
 
 		default:
 			usage(argv[0]);
@@ -683,6 +814,15 @@ int main(int argc, char *argv[])
 		return 1;
 	}
 
+	if (use_ts) {
+		ts.comp_recv_max_time_delta = 0;
+		ts.comp_recv_min_time_delta = 0xffffffff;
+		ts.comp_recv_total_time_delta = 0;
+		ts.comp_recv_prev_time = 0;
+		ts.last_comp_with_ts = 0;
+		ts.comp_with_time_iters = 0;
+	}
+
 	page_size = sysconf(_SC_PAGESIZE);
 
 	dev_list = ibv_get_device_list(NULL);
@@ -720,7 +860,7 @@ int main(int argc, char *argv[])
 	}
 
 	if (use_event)
-		if (ibv_req_notify_cq(ctx->cq, 0)) {
+		if (ibv_req_notify_cq(pp_cq(ctx), 0)) {
 			fprintf(stderr, "Couldn't request CQ notification\n");
 			return 1;
 		}
@@ -788,6 +928,8 @@ int main(int argc, char *argv[])
 
 	rcnt = scnt = 0;
 	while (rcnt < iters || scnt < iters) {
+		int ret;
+
 		if (use_event) {
 			struct ibv_cq *ev_cq;
 			void          *ev_ctx;
@@ -799,72 +941,73 @@ int main(int argc, char *argv[])
 
 			++num_cq_events;
 
-			if (ev_cq != ctx->cq) {
+			if (ev_cq != pp_cq(ctx)) {
 				fprintf(stderr, "CQ event for unknown CQ %p\n", ev_cq);
 				return 1;
 			}
 
-			if (ibv_req_notify_cq(ctx->cq, 0)) {
+			if (ibv_req_notify_cq(pp_cq(ctx), 0)) {
 				fprintf(stderr, "Couldn't request CQ notification\n");
 				return 1;
 			}
 		}
 
-		{
-			struct ibv_wc wc[2];
+		if (use_ts) {
+			struct ibv_poll_cq_attr attr = {};
+
+			do {
+				ret = ibv_start_poll(ctx->cq_s.cq_ex, &attr);
+			} while (!use_event && ret == ENOENT);
+
+			if (ret) {
+				fprintf(stderr, "poll CQ failed %d\n", ret);
+				return ret;
+			}
+			ret = parse_single_wc(ctx, &scnt, &rcnt, &routs,
+					      iters,
+					      ctx->cq_s.cq_ex->wr_id,
+					      ctx->cq_s.cq_ex->status,
+					      ibv_wc_read_completion_ts(ctx->cq_s.cq_ex),
+					      &ts);
+			if (ret) {
+				ibv_end_poll(ctx->cq_s.cq_ex);
+				return ret;
+			}
+			ret = ibv_next_poll(ctx->cq_s.cq_ex);
+			if (!ret)
+				ret = parse_single_wc(ctx, &scnt, &rcnt, &routs,
+						      iters,
+						      ctx->cq_s.cq_ex->wr_id,
+						      ctx->cq_s.cq_ex->status,
+						      ibv_wc_read_completion_ts(ctx->cq_s.cq_ex),
+						      &ts);
+			ibv_end_poll(ctx->cq_s.cq_ex);
+			if (ret && ret != ENOENT) {
+				fprintf(stderr, "poll CQ failed %d\n", ret);
+				return ret;
+			}
+		} else {
 			int ne, i;
+			struct ibv_wc wc[2];
 
 			do {
-				ne = ibv_poll_cq(ctx->cq, 2, wc);
+				ne = ibv_poll_cq(pp_cq(ctx), 2, wc);
 				if (ne < 0) {
 					fprintf(stderr, "poll CQ failed %d\n", ne);
 					return 1;
 				}
-
 			} while (!use_event && ne < 1);
 
 			for (i = 0; i < ne; ++i) {
-				if (wc[i].status != IBV_WC_SUCCESS) {
-					fprintf(stderr, "Failed status %s (%d) for wr_id %d\n",
-						ibv_wc_status_str(wc[i].status),
-						wc[i].status, (int) wc[i].wr_id);
+				ret = parse_single_wc(ctx, &scnt, &rcnt, &routs,
+						      iters,
+						      wc[i].wr_id,
+						      wc[i].status,
+						      0, &ts);
+				if (ret) {
+					fprintf(stderr, "parse WC failed %d\n", ne);
 					return 1;
 				}
-
-				switch ((int) wc[i].wr_id) {
-				case PINGPONG_SEND_WRID:
-					++scnt;
-					break;
-
-				case PINGPONG_RECV_WRID:
-					if (--routs <= 1) {
-						routs += pp_post_recv(ctx, ctx->rx_depth - routs);
-						if (routs < ctx->rx_depth) {
-							fprintf(stderr,
-								"Couldn't post receive (%d)\n",
-								routs);
-							return 1;
-						}
-					}
-
-					++rcnt;
-					break;
-
-				default:
-					fprintf(stderr, "Completion for unknown wr_id %d\n",
-						(int) wc[i].wr_id);
-					return 1;
-				}
-
-				ctx->pending &= ~(int) wc[i].wr_id;
-				if (scnt < iters && !ctx->pending) {
-					if (pp_post_send(ctx)) {
-						fprintf(stderr, "Couldn't post send\n");
-						return 1;
-					}
-					ctx->pending = PINGPONG_RECV_WRID |
-						       PINGPONG_SEND_WRID;
-				}
 			}
 		}
 	}
@@ -883,9 +1026,18 @@ int main(int argc, char *argv[])
 		       bytes, usec / 1000000., bytes * 8. / usec);
 		printf("%d iters in %.2f seconds = %.2f usec/iter\n",
 		       iters, usec / 1000000., usec / iters);
+
+		if (use_ts && ts.comp_with_time_iters) {
+			printf("Max receive completion clock cycles = %u\n",
+			       ts.comp_recv_max_time_delta);
+			printf("Min receive completion clock cycles = %u\n",
+			       ts.comp_recv_min_time_delta);
+			printf("Average receive completion clock cycles = %f\n",
+			       (double)ts.comp_recv_total_time_delta / ts.comp_with_time_iters);
+		}
 	}
 
-	ibv_ack_cq_events(ctx->cq, num_cq_events);
+	ibv_ack_cq_events(pp_cq(ctx), num_cq_events);
 
 	if (pp_close_ctx(ctx))
 		return 1;
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH V4 libibverbs 2/7] Add member functions to poll an extended CQ
       [not found]     ` <1464533475-18949-3-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2016-05-29 22:38       ` Doug Ledford
       [not found]         ` <574B6F71.9060808-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 21+ messages in thread
From: Doug Ledford @ 2016-05-29 22:38 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, matanb-VPRAkNaXOzVWk0Htik3J/w,
	majd-VPRAkNaXOzVWk0Htik3J/w, talal-VPRAkNaXOzVWk0Htik3J/w,
	jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/


[-- Attachment #1.1: Type: text/plain, Size: 4169 bytes --]

On 5/29/2016 10:51 AM, Yishai Hadas wrote:
> From: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> 
> Currently, ibv_poll_cq is one stop shop for polling all the
> completion's attribute. The current implementation has a few
> implications:
> 1. The vendor's completion format is transformed to an agnostic IB
>    format (copying data which might has been used directly).
> 2. Adding new completion attributes means wasting more time in copies.
> 3. Extensible functions require wasting cycles on logic which
>    determines whether a function is known and implemented in the
>    provider.
> 
> In order to solve these problems, we introduce a new poll_cq mechanism
> for extensible CQs. Polling is done in batches, where every batch
> could contain more than one completion. A batch is started with
> calling to start_poll function. This already fetches a completion
> that the user can now query using the various query functions.
> Advancing to the next completion is done using next_poll.
> After querying all completions (or when the user wants to stop
> fetching completions in the current batch), end_poll is called.
> 
> As 'wr_id' and 'status' are mostly used, they can be accessed directly
> and there is no function call for.

So I've been busy with kernel stuff, but I don't really like this patch.
 I'm not really enamored with the entire poll_ex API.

First, I don't like the start_poll/poll/end_poll setup.  I would rather
we do something like using refcounting on the WCs.  Maybe something like
returning an array of pointers to WCs where each WC already has a ref
count taken, and when the user is done, they call ibv_release_wc() (or
something like that) on one or more of these WCs, at which point they
are freed and the hardware can do with them as they wish.

Then, I've been thinking about how to get this data to the user, and
again, I don't really like where we are headed.  I was thinking about
changing it so that we create an enum like so:

enum wc_elements {
	WC_WR_ID,
	WC_WR_STATUS,
	etc.
};

Then we define each cq struct to have a new array called offsets:

struct cq {
	unsigned char	offsets[64]; /* One cache line, should be 64byte aligned */
	....
} __align__(64);

When a cq is created by the driver, it is responsible for filling out
the offsets array (can be static if your driver only has one CQE format,
but if it changes with different options, it allows for that).
Essentially, it is a mapping from the enum of WC items to the position
in the hardware's CQE.  Then you define something like this in the verbs
header file:

static inline u8 __ibv_get_wc_u8(struct cq *cq, struct wc *wc, unsigned
char wc_element)
{
	return *(wc + cq->offsets[wc_element]);
}

and one for the other data sizes too, then something like this:

define ibv_get_wc_element(cq, wc, wc_element, data) \
	{ \
		switch(wc_element) { \
			/* All char size elements */ \
			case WC_WR_STATUS: \
				data = __ibv_get_wc_u8(cq, wc, wc_element); \

You get the idea.  The compiler will optimize the above down to static
code at each location.  And you could simplify things a bit by having
another static array that keeps the element sizes in it, and then run
the switch on that.

The long and short of this method is that accessing CQE struct elements
becomes essentially a load of a variable offset from the cq->offsets[]
item, which should stay hot in cache, followed by a lea of the cqe
address and the offset.

I would prefer having to have an individual routine to access each element.

I would prefer to avoid all the calls if possible.

And I want something that can be optimized away by the compiler as much
as possible.

The above accomplishes that and has the room to support up to 64
different WC struct elements, with a maximum offset of 256 bytes (much
larger than any WC today).

What do people think of this idea?  If no one shoots holes in it, I will
try to code something up this upcoming week (although I can't promise
it, I currently have two sick children at home and that prevents me from
getting much coding done).




[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 884 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V4 libibverbs 2/7] Add member functions to poll an extended CQ
       [not found]         ` <574B6F71.9060808-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2016-05-29 23:30           ` Jason Gunthorpe
       [not found]             ` <20160529233009.GA12420-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 21+ messages in thread
From: Jason Gunthorpe @ 2016-05-29 23:30 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Yishai Hadas, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	matanb-VPRAkNaXOzVWk0Htik3J/w, majd-VPRAkNaXOzVWk0Htik3J/w,
	talal-VPRAkNaXOzVWk0Htik3J/w

On Sun, May 29, 2016 at 06:38:41PM -0400, Doug Ledford wrote:
 
> So I've been busy with kernel stuff, but I don't really like this patch.
>  I'm not really enamored with the entire poll_ex API.

You should really read the whole prior discussion

> First, I don't like the start_poll/poll/end_poll setup.  I would rather
> we do something like using refcounting on the WCs.  Maybe something like
> returning an array of pointers to WCs where each WC already has a
> ref

Refs? Yuk. That doesn't fit the typical use model, and refs
involve more expensive locking per wc.

> When a cq is created by the driver, it is responsible for filling out
> the offsets array (can be static if your driver only has one CQE
> format,

*shrug* Careful benchmarking would have to prove if this is better or
not. Based on Yishai's comments I expect it is not better, since I
expect an offsets array to perform worse than the original bitmask
thing.

Yishai's said this benchmarked better than the bitmask and equal to
or better than today's versions..

This is a whole picture optimization and the user side is only one
part of the equation - the function pointer scheme is also optimizing
the driver path quite heavily, which is why it is showing positively
in benchmarks.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V4 libibverbs 2/7] Add member functions to poll an extended CQ
       [not found]             ` <20160529233009.GA12420-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2016-05-29 23:47               ` Doug Ledford
       [not found]                 ` <8F7BC9E2-75EC-413B-BEBE-11450225AF06-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 21+ messages in thread
From: Doug Ledford @ 2016-05-29 23:47 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yishai Hadas, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	matanb-VPRAkNaXOzVWk0Htik3J/w, majd-VPRAkNaXOzVWk0Htik3J/w,
	talal-VPRAkNaXOzVWk0Htik3J/w



Sent from my iPhone

> On May 29, 2016, at 7:30 PM, Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> 
>> On Sun, May 29, 2016 at 06:38:41PM -0400, Doug Ledford wrote:
>> 
>> So I've been busy with kernel stuff, but I don't really like this patch.
>> I'm not really enamored with the entire poll_ex API.
> 
> You should really read the whole prior discussion

If there's something in particular you'd like to point out, feel free. 

>> First, I don't like the start_poll/poll/end_poll setup.  I would rather
>> we do something like using refcounting on the WCs.  Maybe something like
>> returning an array of pointers to WCs where each WC already has a
>> ref
> 
> Refs? Yuk. That doesn't fit the typical use model, and refs
> involve more expensive locking per wc.

It doesn't have to be refs. If I recall correctly, the libmlx4 driver has to clear each cqe by changing the ownership status back to the hardware or something like that. The release could just as easily be setting that ownership on the cqe.

>> When a cq is created by the driver, it is responsible for filling out
>> the offsets array (can be static if your driver only has one CQE
>> format,
> 
> *shrug* Careful benchmarking would have to prove if this is better or
> not.

Agreed.

> Based on Yishai's comments I expect it is not better, since I
> expect an offsets array to perform worse than the original bitmask
> thing.

The original bitmask version included a copy to an interim, variable struct.  The idea here is that the cqe is an opaque pointer directly to the hardware cqe. There would be no interim copies, just retrieving the desired data directly from the cqe. Maybe I should have called it a hcqe to be more clear. 

> Yishai's said this benchmarked better than the bitmask

I would totally expect that. The bitmap version had both a data copy and a variable struct requiring lots of branches.

> and equal to
> or better than today's versions..

Today's version has a data copy. What I'm suggesting here eliminates that entirely and all of the branches can be optimized away at compile time, leaving a load from a hot cache line of an offset value and then a direct load from that offset in the hcqe. I would think that because of the elimination of the data copy and the compile time elimination of branches, this has the potential to meet or exceed today's implementation. I would certainly like to see the numbers. 

> This is a whole picture optimization and the user side is only one
> part of the equation - the function pointer scheme is also optimizing
> the driver path quite heavily, which is why it is showing positively
> in benchmarks.

Passing an opaque hcqe pointer to the user and then only getting the data they want out of it should similarly help optimize the driver's path I would think.--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V4 libibverbs 2/7] Add member functions to poll an extended CQ
       [not found]                 ` <8F7BC9E2-75EC-413B-BEBE-11450225AF06-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2016-05-30  1:35                   ` Jason Gunthorpe
       [not found]                     ` <20160530013507.GA19230-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2016-05-30  7:47                   ` Matan Barak
  1 sibling, 1 reply; 21+ messages in thread
From: Jason Gunthorpe @ 2016-05-30  1:35 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Yishai Hadas, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	matanb-VPRAkNaXOzVWk0Htik3J/w, majd-VPRAkNaXOzVWk0Htik3J/w,
	talal-VPRAkNaXOzVWk0Htik3J/w

On Sun, May 29, 2016 at 07:47:54PM -0400, Doug Ledford wrote:
> > You should really read the whole prior discussion
> 
> If there's something in particular you'd like to point out, feel free. 

You can't ignore the mailing list and expect the rest of us to carry
you along when you finally wake up..

> It doesn't have to be refs. If I recall correctly, the libmlx4
> driver has to clear each cqe by changing the ownership status back
> to the hardware or something like that. The release could just as
> easily be setting that ownership on the cqe.

It also has to notify the hardware of the new ring head pointer which
becomes more expensive with refs, especially on a lockable CQ.

> The original bitmask version included a copy to an interim, variable
> struct.  The idea here is that the cqe is an opaque pointer directly
> to the hardware cqe. There would be no interim copies, just
> retrieving the desired data directly from the cqe. Maybe I should
> have called it a hcqe to be more clear.

Unless I'm mistaken, something much more more complex than just a
simple offsets is required to handle the usual hcqe format used by our
various drivers.

I agree if you could describe the typical hcqe extractor in a format
like offsets/mask/shift for all supported drivers then that might be
better (unclear if an unaligned load/mask/shift is more expensive than
a branch).  I believe I already explained that possibility earlier in
the thread, and went further to suggest doing a direct HW optimized
DIRECT option like libfabric has.

However, getting this wrong limits the future hardware we can support,
which is scary, and the branch option clearly doesn't have that
problem.

So the performance win would have to be substantial, IMHO.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V4 libibverbs 2/7] Add member functions to poll an extended CQ
       [not found]                     ` <20160530013507.GA19230-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2016-05-30  3:39                       ` Doug Ledford
  2016-05-31 17:46                       ` Hefty, Sean
  1 sibling, 0 replies; 21+ messages in thread
From: Doug Ledford @ 2016-05-30  3:39 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yishai Hadas, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	matanb-VPRAkNaXOzVWk0Htik3J/w, majd-VPRAkNaXOzVWk0Htik3J/w,
	talal-VPRAkNaXOzVWk0Htik3J/w


[-- Attachment #1.1: Type: text/plain, Size: 2151 bytes --]

On 5/29/2016 9:35 PM, Jason Gunthorpe wrote:
> On Sun, May 29, 2016 at 07:47:54PM -0400, Doug Ledford wrote:
>>> You should really read the whole prior discussion
>>
>> If there's something in particular you'd like to point out, feel free. 
> 
> You can't ignore the mailing list and expect the rest of us to carry
> you along when you finally wake up..

Who said I ignored anything?  I didn't get your reference and needed you
to be more explicit.  You don't have to be a dick about it.

>> It doesn't have to be refs. If I recall correctly, the libmlx4
>> driver has to clear each cqe by changing the ownership status back
>> to the hardware or something like that. The release could just as
>> easily be setting that ownership on the cqe.
> 
> It also has to notify the hardware of the new ring head pointer which
> becomes more expensive with refs, especially on a lockable CQ.

It's no more expensive under what I just outlined.

>> The original bitmask version included a copy to an interim, variable
>> struct.  The idea here is that the cqe is an opaque pointer directly
>> to the hardware cqe. There would be no interim copies, just
>> retrieving the desired data directly from the cqe. Maybe I should
>> have called it a hcqe to be more clear.
> 
> Unless I'm mistaken, something much more more complex than just a
> simple offsets is required to handle the usual hcqe format used by our
> various drivers.

That's a valid point of discussion.

> I agree if you could describe the typical hcqe extractor in a format
> like offsets/mask/shift for all supported drivers then that might be
> better (unclear if an unaligned load/mask/shift is more expensive than
> a branch).  I believe I already explained that possibility earlier in
> the thread, and went further to suggest doing a direct HW optimized
> DIRECT option like libfabric has.

And that's what I want to explore.

> However, getting this wrong limits the future hardware we can support,
> which is scary, and the branch option clearly doesn't have that
> problem.
> 
> So the performance win would have to be substantial, IMHO.



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 884 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V4 libibverbs 2/7] Add member functions to poll an extended CQ
       [not found]                 ` <8F7BC9E2-75EC-413B-BEBE-11450225AF06-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2016-05-30  1:35                   ` Jason Gunthorpe
@ 2016-05-30  7:47                   ` Matan Barak
       [not found]                     ` <4958edf4-7296-26c9-4cbe-8fab45be11a3-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
       [not found]                     ` <4e8befc4-aec5-a17d-24ce-40ff97d345da@redhat.com>
  1 sibling, 2 replies; 21+ messages in thread
From: Matan Barak @ 2016-05-30  7:47 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe
  Cc: Yishai Hadas, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	majd-VPRAkNaXOzVWk0Htik3J/w, talal-VPRAkNaXOzVWk0Htik3J/w

On 30/05/2016 02:47, Doug Ledford wrote:
>
>
> Sent from my iPhone
>
>> On May 29, 2016, at 7:30 PM, Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
>>
>>> On Sun, May 29, 2016 at 06:38:41PM -0400, Doug Ledford wrote:
>>>
>>> So I've been busy with kernel stuff, but I don't really like this patch.
>>> I'm not really enamored with the entire poll_ex API.
>>
>> You should really read the whole prior discussion
>
> If there's something in particular you'd like to point out, feel free.
>
>>> First, I don't like the start_poll/poll/end_poll setup.  I would rather
>>> we do something like using refcounting on the WCs.  Maybe something like
>>> returning an array of pointers to WCs where each WC already has a
>>> ref
>>
>> Refs? Yuk. That doesn't fit the typical use model, and refs
>> involve more expensive locking per wc.
>
> It doesn't have to be refs. If I recall correctly, the libmlx4 driver has to clear each cqe by changing the ownership status back to the hardware or something like that. The release could just as easily be setting that ownership on the cqe.
>
>>> When a cq is created by the driver, it is responsible for filling out
>>> the offsets array (can be static if your driver only has one CQE
>>> format,
>>
>> *shrug* Careful benchmarking would have to prove if this is better or
>> not.
>
> Agreed.
>
>> Based on Yishai's comments I expect it is not better, since I
>> expect an offsets array to perform worse than the original bitmask
>> thing.
>
> The original bitmask version included a copy to an interim, variable struct.  The idea here is that the cqe is an opaque pointer directly to the hardware cqe. There would be no interim copies, just retrieving the desired data directly from the cqe. Maybe I should have called it a hcqe to be more clear.
>
>> Yishai's said this benchmarked better than the bitmask
>
> I would totally expect that. The bitmap version had both a data copy and a variable struct requiring lots of branches.
>

Incorrect, the driver part of the bitmap version was optimized as such 
it had a specific poll function for every common variable struct - so 
effectively you didn't (or almost didn't) have any branches.
The indirect function call + copying all required user fields to the 
ibv_wc_ex struct and then back to the user killed it.

>> and equal to
>> or better than today's versions..
>
> Today's version has a data copy. What I'm suggesting here eliminates that entirely and all of the branches can be optimized away at compile time, leaving a load from a hot cache line of an offset value and then a direct load from that offset in the hcqe. I would think that because of the elimination of the data copy and the compile time elimination of branches, this has the potential to meet or exceed today's implementation. I would certainly like to see the numbers.
>

Not necessarily, smart compiler could fetch the data straight from the 
hardware's CQE and pass it through a register.

>> This is a whole picture optimization and the user side is only one
>> part of the equation - the function pointer scheme is also optimizing
>> the driver path quite heavily, which is why it is showing positively
>> in benchmarks.
>
> Passing an opaque hcqe pointer to the user and then only getting the data they want out of it should similarly help optimize the driver's path I would think.
>

What if the driver's CQE format has a field which is split between two 
different offsets?
We looked at storing offsets and masks before implementing this and 
supplying inline functions to get the various fields, but IMHO, that's a 
no go as it's not generic enough.

Anyway, if you really want to tie yourself to one vendor, maybe we could 
allow static linking with the user-space driver and by using lto you 
could get inline-like function behavior.


Matan
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V4 libibverbs 2/7] Add member functions to poll an extended CQ
       [not found]                     ` <4958edf4-7296-26c9-4cbe-8fab45be11a3-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2016-05-30 10:01                       ` Yishai Hadas
  0 siblings, 0 replies; 21+ messages in thread
From: Yishai Hadas @ 2016-05-30 10:01 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Matan Barak, Jason Gunthorpe, Yishai Hadas,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, majd-VPRAkNaXOzVWk0Htik3J/w,
	talal-VPRAkNaXOzVWk0Htik3J/w

On 5/30/2016 10:47 AM, Matan Barak wrote:
> On 30/05/2016 02:47, Doug Ledford wrote:
>>
>>
>> Sent from my iPhone
>>
>>> On May 29, 2016, at 7:30 PM, Jason Gunthorpe
>>> <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
>>>
>>>> On Sun, May 29, 2016 at 06:38:41PM -0400, Doug Ledford wrote:
>>>>
>>>> So I've been busy with kernel stuff, but I don't really like this
>>>> patch.
>>>> I'm not really enamored with the entire poll_ex API.
>>>
>>> You should really read the whole prior discussion
>>
>> If there's something in particular you'd like to point out, feel free.
>>
>>>> First, I don't like the start_poll/poll/end_poll setup.  I would rather
>>>> we do something like using refcounting on the WCs.  Maybe something
>>>> like
>>>> returning an array of pointers to WCs where each WC already has a
>>>> ref
>>>
>>> Refs? Yuk. That doesn't fit the typical use model, and refs
>>> involve more expensive locking per wc.
>>
>> It doesn't have to be refs. If I recall correctly, the libmlx4 driver
>> has to clear each cqe by changing the ownership status back to the
>> hardware or something like that. The release could just as easily be
>> setting that ownership on the cqe.
>>
>>>> When a cq is created by the driver, it is responsible for filling out
>>>> the offsets array (can be static if your driver only has one CQE
>>>> format,
>>>
>>> *shrug* Careful benchmarking would have to prove if this is better or
>>> not.
>>
>> Agreed.
>>
>>> Based on Yishai's comments I expect it is not better, since I
>>> expect an offsets array to perform worse than the original bitmask
>>> thing.
>>
>> The original bitmask version included a copy to an interim, variable
>> struct.  The idea here is that the cqe is an opaque pointer directly
>> to the hardware cqe. There would be no interim copies, just retrieving
>> the desired data directly from the cqe. Maybe I should have called it
>> a hcqe to be more clear.
>>
>>> Yishai's said this benchmarked better than the bitmask
>>
>> I would totally expect that. The bitmap version had both a data copy
>> and a variable struct requiring lots of branches.
>>
>
> Incorrect, the driver part of the bitmap version was optimized as such
> it had a specific poll function for every common variable struct - so
> effectively you didn't (or almost didn't) have any branches.
> The indirect function call + copying all required user fields to the
> ibv_wc_ex struct and then back to the user killed it.
>
>>> and equal to
>>> or better than today's versions..
>>
>> Today's version has a data copy. What I'm suggesting here eliminates
>> that entirely and all of the branches can be optimized away at compile
>> time, leaving a load from a hot cache line of an offset value and then
>> a direct load from that offset in the hcqe. I would think that because
>> of the elimination of the data copy and the compile time elimination
>> of branches, this has the potential to meet or exceed today's
>> implementation. I would certainly like to see the numbers.
>>
>
> Not necessarily, smart compiler could fetch the data straight from the
> hardware's CQE and pass it through a register.
>
>>> This is a whole picture optimization and the user side is only one
>>> part of the equation - the function pointer scheme is also optimizing
>>> the driver path quite heavily, which is why it is showing positively
>>> in benchmarks.
>>
>> Passing an opaque hcqe pointer to the user and then only getting the
>> data they want out of it should similarly help optimize the driver's
>> path I would think.
>>
>
> What if the driver's CQE format has a field which is split between two
> different offsets?
> We looked at storing offsets and masks before implementing this and
> supplying inline functions to get the various fields, but IMHO, that's a
> no go as it's not generic enough.

Correct, this approach is not generic enough and can't fit in the 
general case to all current HW(s) (e.g. Time stamping in mlx4 is 
composed from data that resides in two *non* contiguous offsets).
In addition, there is *no* guarantee from future HWs' implementations to 
fit to this model.

In addition, there might be some extra MD/logic which may be vendor 
specific. For example, validity of fields might depends on some device 
capabilities (e.g. checksum is valid), endianness representation of data 
in some cases, etc.

As of that, we believe that the read_xxx model which lets the vendor who 
is familiar with the internal semantics to do the work is the way to go.

Re benchmark/performance,
As was previously discussed on the list and approved to be useful by our 
benchmark testing we moved wr_id & status fields which are mostly used 
to reside on ibv_cq_ex with direct access, so there is no read_xxx calls 
for.

Other fields might not be used at all (e.g. opcode, when there is only 
one operation, or there is a CQ per operation, wc_flags in many cases, 
etc.) so there will be no usage for their read_xxx calls.

We found that copying the data of other fields to ibv_cq_ex with the 
chance that they won't be used has a penalty to performance in the 
general case over the case that they will be read later on. As the API 
is extensible if future benchmarking will justify supplying direct 
access to other fields it can be done.

Doug,
This API was introduced and discussed for a while in both the list and 
in some forums (OFAWG, last summit) and found to be clean and make sense 
by the involved people, our benchmarking shows that as well.

Please approve that you are fine with so that we can go forward and 
finalize/send the libmlx5 code as some vendor usage of the API as Jason 
already asked.

Yishai








--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V4 libibverbs 2/7] Add member functions to poll an extended CQ
       [not found]                       ` <4e8befc4-aec5-a17d-24ce-40ff97d345da-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2016-05-30 13:50                         ` Doug Ledford
       [not found]                           ` <8708a378-4c48-df98-86a4-d210bbe690b5-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 21+ messages in thread
From: Doug Ledford @ 2016-05-30 13:50 UTC (permalink / raw)
  To: Matan Barak, Jason Gunthorpe; +Cc: Yishai Hadas, linux-rdma

[-- Attachment #1: Type: text/plain, Size: 3204 bytes --]

Resend because of my mailer...

On 05/30/2016 09:46 AM, Doug Ledford wrote:
> On 05/30/2016 03:47 AM, Matan Barak wrote:
>> On 30/05/2016 02:47, Doug Ledford wrote:
> 
>>> Today's version has a data copy. What I'm suggesting here eliminates
>>> that entirely and all of the branches can be optimized away at compile
>>> time, leaving a load from a hot cache line of an offset value and then
>>> a direct load from that offset in the hcqe. I would think that because
>>> of the elimination of the data copy and the compile time elimination
>>> of branches, this has the potential to meet or exceed today's
>>> implementation. I would certainly like to see the numbers.
>>>
>>
>> Not necessarily, smart compiler could fetch the data straight from the
>> hardware's CQE and pass it through a register.
> 
> Not while still maintaining multi-device capability.  What I laid out
> works without having to compile for a specific piece of hardware.
> 
>>>> This is a whole picture optimization and the user side is only one
>>>> part of the equation - the function pointer scheme is also optimizing
>>>> the driver path quite heavily, which is why it is showing positively
>>>> in benchmarks.
>>>
>>> Passing an opaque hcqe pointer to the user and then only getting the
>>> data they want out of it should similarly help optimize the driver's
>>> path I would think.
>>>
>>
>> What if the driver's CQE format has a field which is split between two
>> different offsets?
> 
> I have an idea for that as well.
> 
>> We looked at storing offsets and masks before implementing this and
>> supplying inline functions to get the various fields, but IMHO, that's a
>> no go as it's not generic enough.
>>
>> Anyway, if you really want to tie yourself to one vendor, maybe we could
>> allow static linking with the user-space driver and by using lto you
>> could get inline-like function behavior.
> 
> No, what I'm outlining works with multi-vendor support.  It's opaque to
> the user application.
> 
> Now, as to how to handle complex fields.
> 
> I was thinking of making the wc_elements enum encode some complex
> information.  Right now, it's just the offset number.  It would be
> possible to set each particular element as either A) simple offset, B)
> possibly complex (which means some vendors can do a simple offset and
> others need a complex function), or C) all complex.  Then the #define
> can be made smart enough to know the difference and on simple items
> return the data, on possibly complex items you have an if statement and
> branch so we avoid this when possible but allows different hardware
> types to be handled, and on always complex items we are able to optimize
> down to just the call in the compiler.  Although, I freely admit that if
> you have more than just a handful of these complex items then it likely
> becomes cleaner to just do the "function call per wc element" way of
> doing things like your patch has.  But I doubt there are that many
> complex items in your CQE.  Do you have a count?
> 
> 
> 


-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 884 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V4 libibverbs 2/7] Add member functions to poll an extended CQ
       [not found]                           ` <8708a378-4c48-df98-86a4-d210bbe690b5-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2016-05-30 15:01                             ` Matan Barak (External)
       [not found]                               ` <ecdbec76-31cd-74e1-25b4-7d60c3fa2af0-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 21+ messages in thread
From: Matan Barak (External) @ 2016-05-30 15:01 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe; +Cc: Yishai Hadas, linux-rdma

On 30/05/2016 16:50, Doug Ledford wrote:
> Resend because of my mailer...
>
> On 05/30/2016 09:46 AM, Doug Ledford wrote:
>> On 05/30/2016 03:47 AM, Matan Barak wrote:
>>> On 30/05/2016 02:47, Doug Ledford wrote:
>>
>>>> Today's version has a data copy. What I'm suggesting here eliminates
>>>> that entirely and all of the branches can be optimized away at compile
>>>> time, leaving a load from a hot cache line of an offset value and then
>>>> a direct load from that offset in the hcqe. I would think that because
>>>> of the elimination of the data copy and the compile time elimination
>>>> of branches, this has the potential to meet or exceed today's
>>>> implementation. I would certainly like to see the numbers.
>>>>
>>>
>>> Not necessarily, smart compiler could fetch the data straight from the
>>> hardware's CQE and pass it through a register.
>>
>> Not while still maintaining multi-device capability.  What I laid out
>> works without having to compile for a specific piece of hardware.
>>

It has nothing to do with multi-device capability. For example, in 
cdecl, integer values are returned thorough EAX (RAX) register.

>>>>> This is a whole picture optimization and the user side is only one
>>>>> part of the equation - the function pointer scheme is also optimizing
>>>>> the driver path quite heavily, which is why it is showing positively
>>>>> in benchmarks.
>>>>
>>>> Passing an opaque hcqe pointer to the user and then only getting the
>>>> data they want out of it should similarly help optimize the driver's
>>>> path I would think.
>>>>
>>>
>>> What if the driver's CQE format has a field which is split between two
>>> different offsets?
>>
>> I have an idea for that as well.
>>
>>> We looked at storing offsets and masks before implementing this and
>>> supplying inline functions to get the various fields, but IMHO, that's a
>>> no go as it's not generic enough.
>>>
>>> Anyway, if you really want to tie yourself to one vendor, maybe we could
>>> allow static linking with the user-space driver and by using lto you
>>> could get inline-like function behavior.
>>
>> No, what I'm outlining works with multi-vendor support.  It's opaque to
>> the user application.
>>
>> Now, as to how to handle complex fields.
>>
>> I was thinking of making the wc_elements enum encode some complex
>> information.  Right now, it's just the offset number.  It would be
>> possible to set each particular element as either A) simple offset, B)
>> possibly complex (which means some vendors can do a simple offset and
>> others need a complex function), or C) all complex.  Then the #define
>> can be made smart enough to know the difference and on simple items
>> return the data, on possibly complex items you have an if statement and
>> branch so we avoid this when possible but allows different hardware
>> types to be handled, and on always complex items we are able to optimize
>> down to just the call in the compiler.  Although, I freely admit that if
>> you have more than just a handful of these complex items then it likely
>> becomes cleaner to just do the "function call per wc element" way of
>> doing things like your patch has.  But I doubt there are that many
>> complex items in your CQE.  Do you have a count?
>>
>>
>>

So, you suggest that "simple" attributes will take into account 
byte_offset, bit_offset, bits_size (or mask) and endianity. While the 
right endianity conversion function will be chosen by inspecting the 
bits_size?
Of course, this is in addition to the attribute's nature 
(simple/possible simple/complex), which we need to condition on that too.
This doesn't sound simple for me and I'm afraid it will affect 
performance more than just a real simple function call.

In addition, can we guarantee that for "simple" attribute we won't have 
any future "non-simple" vendors? If such a vendor comes up, the binary 
compatibility will break, since the attributes nature is encoded in the 
libibverbs inlined functions. Unless old applications are recompiled, 
you won't be able to use this new hardware.

Common fields (status and wr_id) are already stored in ibv_cq_ex and 
accessed directly, so we don't "pay" penalty for this fields.

There might not be a lot of complex fields in mlx5 based hardware (a 
quick count found only a few), but accessing the "simple" fields isn't 
that simple and requires a lot of logic.

The proposed API is clean, usable and simple both in respect of the 
application API and the vendor's API.

Matan and Yishai

>
>

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V4 libibverbs 2/7] Add member functions to poll an extended CQ
       [not found]                               ` <ecdbec76-31cd-74e1-25b4-7d60c3fa2af0-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2016-05-30 17:05                                 ` Jason Gunthorpe
  0 siblings, 0 replies; 21+ messages in thread
From: Jason Gunthorpe @ 2016-05-30 17:05 UTC (permalink / raw)
  To: Matan Barak (External); +Cc: Doug Ledford, Yishai Hadas, linux-rdma

On Mon, May 30, 2016 at 06:01:24PM +0300, Matan Barak (External) wrote:
> So, you suggest that "simple" attributes will take into account byte_offset,
> bit_offset, bits_size (or mask) and endianity. While the right endianity
> conversion function will be chosen by inspecting the bits_size?
> Of course, this is in addition to the attribute's nature (simple/possible
> simple/complex), which we need to condition on that too.
> This doesn't sound simple for me and I'm afraid it will affect performance
> more than just a real simple function call.

Right, that is exactly the point - if the inlined version requires a
branch to choose betwen simple/complex then it will be about the same
cost to just branch to the correct code in the driver.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [PATCH V4 libibverbs 2/7] Add member functions to poll an extended CQ
       [not found]                     ` <20160530013507.GA19230-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2016-05-30  3:39                       ` Doug Ledford
@ 2016-05-31 17:46                       ` Hefty, Sean
       [not found]                         ` <1828884A29C6694DAF28B7E6B8A82373AB05CA6D-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  1 sibling, 1 reply; 21+ messages in thread
From: Hefty, Sean @ 2016-05-31 17:46 UTC (permalink / raw)
  To: Jason Gunthorpe, Doug Ledford
  Cc: Yishai Hadas, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	matanb-VPRAkNaXOzVWk0Htik3J/w, majd-VPRAkNaXOzVWk0Htik3J/w,
	talal-VPRAkNaXOzVWk0Htik3J/w

> However, getting this wrong limits the future hardware we can support,
> which is scary, and the branch option clearly doesn't have that
> problem.
> 
> So the performance win would have to be substantial, IMHO.

I agree that we need to be careful about exposing implementation specific details through the API, but we're already partially down that path.  I'm not sure about the best way to measure performance with any proposal.
 
These are more thoughts than questions:  If an app wants to support iWarp/RoCE/OPA, is it just going to have a branch around the calls anyway?  What data is actually needed by the apps?  Is there any relationship between the various calls?  For example, if an app calls read_slid, is it more likely than not to also call read_pkey_index and read_dlid_path_bits, etc.?  Is the app expected to retrieve addressing data piecemeal like this for every possible architecture?  Are 5 function calls better than one call that fills out a structure, when all fields are needed?  Basically, meaningful benchmarks of poll cq likely require including use of the data.

The proposal is the equivalent of C++ accessor functions, which I admit I have never been a fan of, though the use seems simple enough.  And I don't have a better proposal offhand.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V4 libibverbs 2/7] Add member functions to poll an extended CQ
       [not found]                         ` <1828884A29C6694DAF28B7E6B8A82373AB05CA6D-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2016-05-31 18:06                           ` Jason Gunthorpe
       [not found]                             ` <20160531180608.GA21834-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 21+ messages in thread
From: Jason Gunthorpe @ 2016-05-31 18:06 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Doug Ledford, Yishai Hadas, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	matanb-VPRAkNaXOzVWk0Htik3J/w, majd-VPRAkNaXOzVWk0Htik3J/w,
	talal-VPRAkNaXOzVWk0Htik3J/w

On Tue, May 31, 2016 at 05:46:38PM +0000, Hefty, Sean wrote:

> These are more thoughts than questions: If an app wants to support
> iWarp/RoCE/OPA, is it just going to have a branch around the calls
> anyway?

This doesn't really change one way or the other, apps have to
understand what attributes are available on the CQ and do whatever
branching is needed to access them.

It is a good question if we should have different accessors, like
get_src_addr(buf,sizeof(buf)) or something that can handle the other
protocols with less hassle.

But at the end of the day, the user will still have to cast the buf to
the right type...

> What data is actually needed by the apps?

The apps decide when they create the CQ what things the CQ is needed
to return. So apps are in control. Are you musing if some of these are
needed or not?

Oh, that reminds me: Yishai, please purge pkey_index from the new
API. It was a mistake in the old API, the standard says it is only
valid for QP1, and we don't support QP1 in user space. Any app that
uses pkey_index is broken. It was a mistake to ever expose this to
userspace.

> Is there any relationship between the various calls?  For example,
> if an app calls read_slid, is it more likely than not to also call
> read_pkey_index and read_dlid_path_bits, etc.?
> Is the app expected to retrieve addressing data piecemeal like this
> for every possible architecture?  Are 5 function calls better than
> one call that fills out a structure, when all fields are needed?

There are certainly logical groups, eg the various path parameters are
almost always going to be needed together.

I was asking this same question.. I think the response was that
memcpying into a stack struct was more expensive than the branches..

I'm skeptical of that, but maybe if the C ABI doesn't use return value
optimization like C++ does it could get a bit costly....

Even so, it seems like an easier API for the end user...

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [PATCH V4 libibverbs 2/7] Add member functions to poll an extended CQ
       [not found]                             ` <20160531180608.GA21834-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2016-05-31 19:48                               ` Hefty, Sean
  0 siblings, 0 replies; 21+ messages in thread
From: Hefty, Sean @ 2016-05-31 19:48 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Doug Ledford, Yishai Hadas, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	matanb-VPRAkNaXOzVWk0Htik3J/w, majd-VPRAkNaXOzVWk0Htik3J/w,
	talal-VPRAkNaXOzVWk0Htik3J/w

> > These are more thoughts than questions: If an app wants to support
> > iWarp/RoCE/OPA, is it just going to have a branch around the calls
> > anyway?
> 
> This doesn't really change one way or the other, apps have to
> understand what attributes are available on the CQ and do whatever
> branching is needed to access them.
> 
> It is a good question if we should have different accessors, like
> get_src_addr(buf,sizeof(buf)) or something that can handle the other
> protocols with less hassle.

I agree.

> But at the end of the day, the user will still have to cast the buf to
> the right type...

I think this depends on what the app is wanting or needing to do with this data.  If the app needs this in order to create an address handle, for example, this format of the data could be transparent to the app (assuming a new create ah API).

> > What data is actually needed by the apps?
> 
> The apps decide when they create the CQ what things the CQ is needed
> to return. So apps are in control. Are you musing if some of these are
> needed or not?

I was just musing which fields are needed by which apps and for what purpose.

> > Is there any relationship between the various calls?  For example,
> > if an app calls read_slid, is it more likely than not to also call
> > read_pkey_index and read_dlid_path_bits, etc.?
> > Is the app expected to retrieve addressing data piecemeal like this
> > for every possible architecture?  Are 5 function calls better than
> > one call that fills out a structure, when all fields are needed?
> 
> There are certainly logical groups, eg the various path parameters are
> almost always going to be needed together.
> 
> I was asking this same question.. I think the response was that
> memcpying into a stack struct was more expensive than the branches..
> 
> I'm skeptical of that, but maybe if the C ABI doesn't use return value
> optimization like C++ does it could get a bit costly....
> 
> Even so, it seems like an easier API for the end user...

Personally, I would lean toward having a an easier to use API, at the cost of a few extra cycles.  I guess if you add the equivalent of FABRIC_DIRECT to libibverbs, the proposed model seems like it would be optimal.  This probably requires that all CQs have the same format.

- Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2016-05-31 19:48 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-29 14:51 [PATCH V4 libibverbs 0/7] Completion timestamping Yishai Hadas
     [not found] ` <1464533475-18949-1-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2016-05-29 14:51   ` [PATCH V4 libibverbs 1/7] Add support for extended creating CQ verb Yishai Hadas
2016-05-29 14:51   ` [PATCH V4 libibverbs 2/7] Add member functions to poll an extended CQ Yishai Hadas
     [not found]     ` <1464533475-18949-3-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2016-05-29 22:38       ` Doug Ledford
     [not found]         ` <574B6F71.9060808-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-05-29 23:30           ` Jason Gunthorpe
     [not found]             ` <20160529233009.GA12420-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2016-05-29 23:47               ` Doug Ledford
     [not found]                 ` <8F7BC9E2-75EC-413B-BEBE-11450225AF06-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-05-30  1:35                   ` Jason Gunthorpe
     [not found]                     ` <20160530013507.GA19230-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2016-05-30  3:39                       ` Doug Ledford
2016-05-31 17:46                       ` Hefty, Sean
     [not found]                         ` <1828884A29C6694DAF28B7E6B8A82373AB05CA6D-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2016-05-31 18:06                           ` Jason Gunthorpe
     [not found]                             ` <20160531180608.GA21834-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2016-05-31 19:48                               ` Hefty, Sean
2016-05-30  7:47                   ` Matan Barak
     [not found]                     ` <4958edf4-7296-26c9-4cbe-8fab45be11a3-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2016-05-30 10:01                       ` Yishai Hadas
     [not found]                     ` <4e8befc4-aec5-a17d-24ce-40ff97d345da@redhat.com>
     [not found]                       ` <4e8befc4-aec5-a17d-24ce-40ff97d345da-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-05-30 13:50                         ` Doug Ledford
     [not found]                           ` <8708a378-4c48-df98-86a4-d210bbe690b5-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-05-30 15:01                             ` Matan Barak (External)
     [not found]                               ` <ecdbec76-31cd-74e1-25b4-7d60c3fa2af0-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2016-05-30 17:05                                 ` Jason Gunthorpe
2016-05-29 14:51   ` [PATCH V4 libibverbs 3/7] Add timestamp_mask and hca_core_clock to ibv_query_device_ex Yishai Hadas
2016-05-29 14:51   ` [PATCH V4 libibverbs 4/7] Add completion timestamp to poll_cq Yishai Hadas
2016-05-29 14:51   ` [PATCH V4 libibverbs 5/7] Create a single threaded CQ Yishai Hadas
2016-05-29 14:51   ` [PATCH V4 libibverbs 6/7] Add a verb that queries real time values from the HCA Yishai Hadas
2016-05-29 14:51   ` [PATCH V4 libibverbs 7/7] Add timestamp support in rc_pingpong Yishai Hadas

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.