* [PATCH v3 0/9] Introduce per-device completion queue pools
From: Sagi Grimberg @ 2017-11-08  9:57 UTC (permalink / raw)
  To: linux-rdma@vger.kernel.org
  Cc: linux-nvme@lists.infradead.org, Christoph Hellwig,
	Max Gurtuvoy

This is the third incarnation of the CQ pool patches proposed
by Christoph and me.

Our ULPs often want to make smart decisions on completion vector
affinitization when using multiple completion queues spread over
multiple CPU cores. Examples of this can be seen in iser, srp and nvme-rdma.

This patch set moves this smartness into the rdma core by
introducing per-device CQ pools that, by definition, spread
across CPU cores. In addition, completion queue allocation becomes
completely transparent to the ULP: affinity hints passed to
create_qp tell the rdma core to select (or allocate) a completion
queue with the required affinity.

This API gives us an approach similar to what is used in the
networking stack, where the device completion queues are hidden
from the application. With the affinity hints we also do not
compromise performance, as the completion queue will be affinitized correctly.
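
To illustrate the ULP-facing side, here is a rough sketch (not taken
from the series itself; the capacities, the my_qp_event_handler,
my_conn, pd and queue_cpu names, and the choice of IB_POLL_SOFTIRQ are
made up for the example). The consumer simply stops providing its own
CQs and lets the core pick a pooled one:

	struct ib_qp_init_attr attr = {};
	struct ib_qp *qp;

	attr.event_handler = my_qp_event_handler;	/* example only */
	attr.qp_context = my_conn;			/* example only */
	attr.qp_type = IB_QPT_RC;
	attr.sq_sig_type = IB_SIGNAL_REQ_WR;
	attr.cap.max_send_wr = 256;
	attr.cap.max_recv_wr = 256;
	attr.cap.max_send_sge = 1;
	attr.cap.max_recv_sge = 1;

	/* no send_cq/recv_cq: ask the core for a pooled CQ instead */
	attr.create_flags = IB_QP_CREATE_ASSIGN_CQS | IB_QP_CREATE_AFFINITY_HINT;
	attr.poll_ctx = IB_POLL_SOFTIRQ;
	attr.affinity_hint = queue_cpu;	/* cpu this queue is expected to run on */

	qp = ib_create_qp(pd, &attr);
	if (IS_ERR(qp))
		return PTR_ERR(qp);

	/* ib_destroy_qp() later returns the claimed CQ entries to the pool */

With this, the ULP never tracks CQs explicitly; the core claims CQ
entries at QP creation and releases them when the QP is destroyed.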

One thing to note is that different ULPs using this API may now
share completion queues (provided they use the same polling context).
However, even without this API they already share interrupt vectors
(and the CPUs assigned to them), so aggregating consumers onto fewer
completion queues results in better overall completion-processing
efficiency per completion event (or interrupt).

In addition, we introduce a configfs knob to the nvme target to
bind I/O threads to a given cpulist (which can be a subset). This is
useful for NUMA configurations where backend device access is
configured with care for NUMA affinity, and we want to restrict the
rdma device and I/O thread affinity accordingly.

The patch set converts iser, isert, srpt, svcrdma, nvme-rdma and
nvmet-rdma to the new API.

Comments and feedback are welcome.

Christoph Hellwig (1):
  nvme-rdma: use implicit CQ allocation

Sagi Grimberg (8):
  RDMA/core: Add implicit per-device completion queue pools
  IB/isert: use implicit CQ allocation
  IB/iser: use implicit CQ allocation
  IB/srpt: use implicit CQ allocation
  svcrdma: Use RDMA core implicit CQ allocation
  nvmet-rdma: use implicit CQ allocation
  nvmet: allow assignment of a cpulist for each nvmet port
  nvmet-rdma: assign cq completion vector based on the port allowed cpus

 drivers/infiniband/core/core_priv.h      |   6 +
 drivers/infiniband/core/cq.c             | 193 +++++++++++++++++++++++++++++++
 drivers/infiniband/core/device.c         |   4 +
 drivers/infiniband/core/verbs.c          |  69 ++++++++++-
 drivers/infiniband/ulp/iser/iscsi_iser.h |  19 ---
 drivers/infiniband/ulp/iser/iser_verbs.c |  82 ++-----------
 drivers/infiniband/ulp/isert/ib_isert.c  | 165 ++++----------------------
 drivers/infiniband/ulp/isert/ib_isert.h  |  16 ---
 drivers/infiniband/ulp/srpt/ib_srpt.c    |  46 +++-----
 drivers/infiniband/ulp/srpt/ib_srpt.h    |   1 -
 drivers/nvme/host/rdma.c                 |  62 +++++-----
 drivers/nvme/target/configfs.c           |  75 ++++++++++++
 drivers/nvme/target/nvmet.h              |   4 +
 drivers/nvme/target/rdma.c               |  71 +++++-------
 include/linux/sunrpc/svc_rdma.h          |   2 -
 include/rdma/ib_verbs.h                  |  31 ++++-
 net/sunrpc/xprtrdma/svc_rdma_transport.c |  22 +---
 17 files changed, 468 insertions(+), 400 deletions(-)

-- 
2.14.1

* [PATCH v3 1/9] RDMA/core: Add implicit per-device completion queue pools
From: Sagi Grimberg @ 2017-11-08  9:57 UTC (permalink / raw)
  To: linux-rdma@vger.kernel.org
  Cc: linux-nvme@lists.infradead.org, Christoph Hellwig,
	Max Gurtuvoy

Allow a ULP to ask the core to implicitly assign a completion
queue to a queue-pair based on a least-used search over the
per-device CQ pools. The device CQ pools grow lazily with
QP creation.

In addition, expose an affinity hint for queue pair creation.
If passed, the core will attempt to attach a CQ whose completion
vector is directed to the CPU core given by the affinity hint.
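
For context, a sketch of a typical caller (illustration only, not part
of this patch; nr_queues and init_attr are made-up names): a
multi-queue ULP spreads the hints over CPUs, one per queue, and the
core either finds a completion vector affine to that CPU or falls back
to hint % num_comp_vectors:

	int i;

	for (i = 0; i < nr_queues; i++) {
		init_attr[i].create_flags = IB_QP_CREATE_ASSIGN_CQS |
					    IB_QP_CREATE_AFFINITY_HINT;
		init_attr[i].poll_ctx = IB_POLL_SOFTIRQ;
		/* steer each queue's CQ towards a different cpu core */
		init_attr[i].affinity_hint = i % num_online_cpus();
	}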

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/infiniband/core/core_priv.h |   6 ++
 drivers/infiniband/core/cq.c        | 193 ++++++++++++++++++++++++++++++++++++
 drivers/infiniband/core/device.c    |   4 +
 drivers/infiniband/core/verbs.c     |  69 +++++++++++--
 include/rdma/ib_verbs.h             |  31 ++++--
 5 files changed, 291 insertions(+), 12 deletions(-)

diff --git a/drivers/infiniband/core/core_priv.h b/drivers/infiniband/core/core_priv.h
index a1d687a664f8..4f6cd4cf5116 100644
--- a/drivers/infiniband/core/core_priv.h
+++ b/drivers/infiniband/core/core_priv.h
@@ -179,6 +179,12 @@ static inline bool rdma_is_upper_dev_rcu(struct net_device *dev,
 	return netdev_has_upper_dev_all_rcu(dev, upper);
 }
 
+void ib_init_cq_pools(struct ib_device *dev);
+void ib_purge_cq_pools(struct ib_device *dev);
+struct ib_cq *ib_find_get_cq(struct ib_device *dev, unsigned int nr_cqe,
+		enum ib_poll_context poll_ctx, int affinity_hint);
+void ib_put_cq(struct ib_cq *cq, unsigned int nr_cqe);
+
 int addr_init(void);
 void addr_cleanup(void);
 
diff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c
index f2ae75fa3128..8b9f9be5386b 100644
--- a/drivers/infiniband/core/cq.c
+++ b/drivers/infiniband/core/cq.c
@@ -15,6 +15,9 @@
 #include <linux/slab.h>
 #include <rdma/ib_verbs.h>
 
+/* XXX: wild guess - should not be too large or too small to avoid wastage */
+#define IB_CQE_BATCH			1024
+
 /* # of WCs to poll for with a single call to ib_poll_cq */
 #define IB_POLL_BATCH			16
 
@@ -149,6 +152,8 @@ struct ib_cq *ib_alloc_cq(struct ib_device *dev, void *private,
 	cq->cq_context = private;
 	cq->poll_ctx = poll_ctx;
 	atomic_set(&cq->usecnt, 0);
+	cq->cqe_used = 0;
+	cq->comp_vector = comp_vector;
 
 	cq->wc = kmalloc_array(IB_POLL_BATCH, sizeof(*cq->wc), GFP_KERNEL);
 	if (!cq->wc)
@@ -194,6 +199,8 @@ void ib_free_cq(struct ib_cq *cq)
 
 	if (WARN_ON_ONCE(atomic_read(&cq->usecnt)))
 		return;
+	if (WARN_ON_ONCE(cq->cqe_used != 0))
+		return;
 
 	switch (cq->poll_ctx) {
 	case IB_POLL_DIRECT:
@@ -213,3 +220,189 @@ void ib_free_cq(struct ib_cq *cq)
 	WARN_ON_ONCE(ret);
 }
 EXPORT_SYMBOL(ib_free_cq);
+
+void ib_init_cq_pools(struct ib_device *dev)
+{
+	int i;
+
+	spin_lock_init(&dev->cq_lock);
+	for (i = 0; i < ARRAY_SIZE(dev->cq_pools); i++)
+		INIT_LIST_HEAD(&dev->cq_pools[i]);
+}
+
+void ib_purge_cq_pools(struct ib_device *dev)
+{
+	struct ib_cq *cq, *n;
+	LIST_HEAD(tmp_list);
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(dev->cq_pools); i++) {
+		unsigned long flags;
+
+		spin_lock_irqsave(&dev->cq_lock, flags);
+		list_splice_init(&dev->cq_pools[i], &tmp_list);
+		spin_unlock_irqrestore(&dev->cq_lock, flags);
+	}
+
+	list_for_each_entry_safe(cq, n, &tmp_list, pool_entry)
+		ib_free_cq(cq);
+}
+
+/**
+ * ib_find_vector_affinity() - Find the first completion vector mapped to a given
+ *     cpu core affinity
+ * @device:            rdma device
+ * @cpu:               cpu for the corresponding completion vector affinity
+ * @vector:            output target completion vector
+ *
+ * If the device exposes vector affinity, we search each of the vectors
+ * and, if we find one that includes the desired cpu core, we return true
+ * and assign @vector to the corresponding completion vector. Otherwise
+ * we return false. We stop at the first suitable completion vector
+ * we find, as we have no preference among multiple vectors with the
+ * same affinity.
+ */
+static bool ib_find_vector_affinity(struct ib_device *device, int cpu,
+		unsigned int *vector)
+{
+	bool found = false;
+	unsigned int c;
+	int vec;
+
+	if (cpu == -1)
+		goto out;
+
+	for (vec = 0; vec < device->num_comp_vectors; vec++) {
+		const struct cpumask *mask;
+
+		mask = ib_get_vector_affinity(device, vec);
+		if (!mask)
+			goto out;
+
+		for_each_cpu(c, mask) {
+			if (c == cpu) {
+				*vector = vec;
+				found = true;
+				goto out;
+			}
+		}
+	}
+
+out:
+	return found;
+}
+
+static int ib_alloc_cqs(struct ib_device *dev, int nr_cqes,
+		enum ib_poll_context poll_ctx)
+{
+	LIST_HEAD(tmp_list);
+	struct ib_cq *cq;
+	unsigned long flags;
+	int nr_cqs, ret, i;
+
+	/*
+	 * Allocate at least as many CQEs as requested, and otherwise
+	 * a reasonable batch size so that we can share CQs between
+	 * multiple users instead of allocating a larger number of CQs.
+	 */
+	nr_cqes = max(nr_cqes, min(dev->attrs.max_cqe, IB_CQE_BATCH));
+	nr_cqs = min_t(int, dev->num_comp_vectors, num_possible_cpus());
+	for (i = 0; i < nr_cqs; i++) {
+		cq = ib_alloc_cq(dev, NULL, nr_cqes, i, poll_ctx);
+		if (IS_ERR(cq)) {
+			ret = PTR_ERR(cq);
+			pr_err("%s: failed to create CQ ret=%d\n",
+				__func__, ret);
+			goto out_free_cqs;
+		}
+		list_add_tail(&cq->pool_entry, &tmp_list);
+	}
+
+	spin_lock_irqsave(&dev->cq_lock, flags);
+	list_splice(&tmp_list, &dev->cq_pools[poll_ctx]);
+	spin_unlock_irqrestore(&dev->cq_lock, flags);
+
+	return 0;
+
+out_free_cqs:
+	list_for_each_entry(cq, &tmp_list, pool_entry)
+		ib_free_cq(cq);
+	return ret;
+}
+
+/*
+ * ib_find_get_cq() - Find the least used completion queue that matches
+ *     a given affinity hint (or least used for wild card affinity)
+ *     and fits nr_cqe
+ * @dev:              rdma device
+ * @nr_cqe:           number of needed cqe entries
+ * @poll_ctx:         cq polling context
+ * @affinity_hint:    affinity hint, or -1 for wild-card assignment
+ *
+ * Finds a cq that satisfies @affinity_hint and @nr_cqe requirements and claims
+ * entries in it for us. If no suitable cq is available, allocate new cqs
+ * that meet the requirements and add them to the device pool.
+ */
+struct ib_cq *ib_find_get_cq(struct ib_device *dev, unsigned int nr_cqe,
+		enum ib_poll_context poll_ctx, int affinity_hint)
+{
+	struct ib_cq *cq, *found;
+	unsigned long flags;
+	int vector, ret;
+
+	if (poll_ctx >= ARRAY_SIZE(dev->cq_pools))
+		return ERR_PTR(-EINVAL);
+
+	if (!ib_find_vector_affinity(dev, affinity_hint, &vector)) {
+		/*
+		 * Couldn't find matching vector affinity so project
+		 * the affinity to the device completion vector range
+		 */
+		vector = affinity_hint % dev->num_comp_vectors;
+	}
+
+restart:
+	/*
+	 * Find the least used CQ with correct affinity and
+	 * enough free cq entries
+	 */
+	found = NULL;
+	spin_lock_irqsave(&dev->cq_lock, flags);
+	list_for_each_entry(cq, &dev->cq_pools[poll_ctx], pool_entry) {
+		if (vector != -1 && vector != cq->comp_vector)
+			continue;
+		if (cq->cqe_used + nr_cqe > cq->cqe)
+			continue;
+		if (found && cq->cqe_used >= found->cqe_used)
+			continue;
+		found = cq;
+	}
+
+	if (found) {
+		found->cqe_used += nr_cqe;
+		spin_unlock_irqrestore(&dev->cq_lock, flags);
+		return found;
+	}
+	spin_unlock_irqrestore(&dev->cq_lock, flags);
+
+	/*
+	 * Didn't find a match, or the device pool ran out of CQE space;
+	 * allocate a new batch of CQs and search again.
+	 */
+	ret = ib_alloc_cqs(dev, nr_cqe, poll_ctx);
+	if (ret)
+		return ERR_PTR(ret);
+
+	/* Now search again */
+	goto restart;
+}
+
+void ib_put_cq(struct ib_cq *cq, unsigned int nr_cqe)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&cq->device->cq_lock, flags);
+	cq->cqe_used -= nr_cqe;
+	WARN_ON_ONCE(cq->cqe_used < 0);
+	spin_unlock_irqrestore(&cq->device->cq_lock, flags);
+}
diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 84fc32a2c8b3..c828845c46d8 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -468,6 +468,8 @@ int ib_register_device(struct ib_device *device,
 		device->dma_device = parent;
 	}
 
+	ib_init_cq_pools(device);
+
 	mutex_lock(&device_mutex);
 
 	if (strchr(device->name, '%')) {
@@ -590,6 +592,8 @@ void ib_unregister_device(struct ib_device *device)
 	up_write(&lists_rwsem);
 
 	device->reg_state = IB_DEV_UNREGISTERED;
+
+	ib_purge_cq_pools(device);
 }
 EXPORT_SYMBOL(ib_unregister_device);
 
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index de57d6c11a25..fcc9ecba6741 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -793,14 +793,16 @@ struct ib_qp *ib_create_qp(struct ib_pd *pd,
 			   struct ib_qp_init_attr *qp_init_attr)
 {
 	struct ib_device *device = pd ? pd->device : qp_init_attr->xrcd->device;
+	struct ib_cq *cq = NULL;
 	struct ib_qp *qp;
-	int ret;
+	u32 nr_cqes = 0;
+	int ret = -EINVAL;
 
 	if (qp_init_attr->rwq_ind_tbl &&
 	    (qp_init_attr->recv_cq ||
 	    qp_init_attr->srq || qp_init_attr->cap.max_recv_wr ||
 	    qp_init_attr->cap.max_recv_sge))
-		return ERR_PTR(-EINVAL);
+		goto out;
 
 	/*
 	 * If the callers is using the RDMA API calculate the resources
@@ -811,9 +813,51 @@ struct ib_qp *ib_create_qp(struct ib_pd *pd,
 	if (qp_init_attr->cap.max_rdma_ctxs)
 		rdma_rw_init_qp(device, qp_init_attr);
 
+	if (qp_init_attr->create_flags & IB_QP_CREATE_ASSIGN_CQS) {
+		int affinity = -1;
+
+		if (WARN_ON(qp_init_attr->recv_cq))
+			goto out;
+		if (WARN_ON(qp_init_attr->send_cq))
+			goto out;
+
+		if (qp_init_attr->create_flags & IB_QP_CREATE_AFFINITY_HINT)
+			affinity = qp_init_attr->affinity_hint;
+
+		nr_cqes = qp_init_attr->cap.max_recv_wr +
+			  qp_init_attr->cap.max_send_wr;
+		if (nr_cqes) {
+			cq = ib_find_get_cq(device, nr_cqes,
+					    qp_init_attr->poll_ctx, affinity);
+			if (IS_ERR(cq)) {
+				ret = PTR_ERR(cq);
+				goto out;
+			}
+
+			if (qp_init_attr->cap.max_send_wr)
+				qp_init_attr->send_cq = cq;
+
+			if (qp_init_attr->cap.max_recv_wr) {
+				qp_init_attr->recv_cq = cq;
+
+				/*
+				 * Low-level drivers expect max_recv_wr == 0
+				 * for the SRQ case:
+				 */
+				if (qp_init_attr->srq)
+					qp_init_attr->cap.max_recv_wr = 0;
+			}
+		}
+
+		qp_init_attr->create_flags &=
+			~(IB_QP_CREATE_ASSIGN_CQS | IB_QP_CREATE_AFFINITY_HINT);
+	}
+
 	qp = device->create_qp(pd, qp_init_attr, NULL);
-	if (IS_ERR(qp))
-		return qp;
+	if (IS_ERR(qp)) {
+		ret = PTR_ERR(qp);
+		goto out_put_cq;
+	}
 
 	ret = ib_create_qp_security(qp, device);
 	if (ret) {
@@ -826,6 +870,7 @@ struct ib_qp *ib_create_qp(struct ib_pd *pd,
 	qp->uobject    = NULL;
 	qp->qp_type    = qp_init_attr->qp_type;
 	qp->rwq_ind_tbl = qp_init_attr->rwq_ind_tbl;
+	qp->nr_cqes    = nr_cqes;
 
 	atomic_set(&qp->usecnt, 0);
 	qp->mrs_used = 0;
@@ -865,8 +910,7 @@ struct ib_qp *ib_create_qp(struct ib_pd *pd,
 		ret = rdma_rw_init_mrs(qp, qp_init_attr);
 		if (ret) {
 			pr_err("failed to init MR pool ret= %d\n", ret);
-			ib_destroy_qp(qp);
-			return ERR_PTR(ret);
+			goto out_destroy_qp;
 		}
 	}
 
@@ -880,6 +924,14 @@ struct ib_qp *ib_create_qp(struct ib_pd *pd,
 				 device->attrs.max_sge_rd);
 
 	return qp;
+
+out_destroy_qp:
+	ib_destroy_qp(qp);
+out_put_cq:
+	if (cq)
+		ib_put_cq(cq, nr_cqes);
+out:
+	return ERR_PTR(ret);
 }
 EXPORT_SYMBOL(ib_create_qp);
 
@@ -1478,6 +1530,11 @@ int ib_destroy_qp(struct ib_qp *qp)
 			atomic_dec(&ind_tbl->usecnt);
 		if (sec)
 			ib_destroy_qp_security_end(sec);
+
+		if (qp->nr_cqes) {
+			WARN_ON_ONCE(rcq && rcq != scq);
+			ib_put_cq(scq, qp->nr_cqes);
+		}
 	} else {
 		if (sec)
 			ib_destroy_qp_security_abort(sec);
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index bdb1279a415b..56d42e753eb4 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1098,11 +1098,22 @@ enum ib_qp_create_flags {
 	IB_QP_CREATE_SCATTER_FCS		= 1 << 8,
 	IB_QP_CREATE_CVLAN_STRIPPING		= 1 << 9,
 	IB_QP_CREATE_SOURCE_QPN			= 1 << 10,
+
+	/* only used by the core, not passed to low-level drivers */
+	IB_QP_CREATE_ASSIGN_CQS			= 1 << 24,
+	IB_QP_CREATE_AFFINITY_HINT		= 1 << 25,
+
 	/* reserve bits 26-31 for low level drivers' internal use */
 	IB_QP_CREATE_RESERVED_START		= 1 << 26,
 	IB_QP_CREATE_RESERVED_END		= 1 << 31,
 };
 
+enum ib_poll_context {
+	IB_POLL_SOFTIRQ,	/* poll from softirq context */
+	IB_POLL_WORKQUEUE,	/* poll from workqueue */
+	IB_POLL_DIRECT,		/* caller context, no hw completions */
+};
+
 /*
  * Note: users may not call ib_close_qp or ib_destroy_qp from the event_handler
  * callback to destroy the passed in QP.
@@ -1124,6 +1135,13 @@ struct ib_qp_init_attr {
 	 * Only needed for special QP types, or when using the RW API.
 	 */
 	u8			port_num;
+
+	/*
+	 * Only needed when not passing in explicit CQs.
+	 */
+	enum ib_poll_context	poll_ctx;
+	int			affinity_hint;
+
 	struct ib_rwq_ind_table *rwq_ind_tbl;
 	u32			source_qpn;
 };
@@ -1536,12 +1554,6 @@ struct ib_ah {
 
 typedef void (*ib_comp_handler)(struct ib_cq *cq, void *cq_context);
 
-enum ib_poll_context {
-	IB_POLL_DIRECT,		/* caller context, no hw completions */
-	IB_POLL_SOFTIRQ,	/* poll from softirq context */
-	IB_POLL_WORKQUEUE,	/* poll from workqueue */
-};
-
 struct ib_cq {
 	struct ib_device       *device;
 	struct ib_uobject      *uobject;
@@ -1549,9 +1561,12 @@ struct ib_cq {
 	void                  (*event_handler)(struct ib_event *, void *);
 	void                   *cq_context;
 	int               	cqe;
+	unsigned int		cqe_used;
 	atomic_t          	usecnt; /* count number of work queues */
 	enum ib_poll_context	poll_ctx;
+	int			comp_vector;
 	struct ib_wc		*wc;
+	struct list_head	pool_entry;
 	union {
 		struct irq_poll		iop;
 		struct work_struct	work;
@@ -1731,6 +1746,7 @@ struct ib_qp {
 	struct ib_rwq_ind_table *rwq_ind_tbl;
 	struct ib_qp_security  *qp_sec;
 	u8			port;
+	u32			nr_cqes;
 };
 
 struct ib_mr {
@@ -2338,6 +2354,9 @@ struct ib_device {
 
 	u32                          index;
 
+	spinlock_t		     cq_lock;
+	struct list_head	     cq_pools[IB_POLL_WORKQUEUE + 1];
+
 	/**
 	 * The following mandatory functions are used only at device
 	 * registration.  Keep functions such as these at the end of this
-- 
2.14.1

* [PATCH v3 2/9] IB/isert: use implicit CQ allocation
From: Sagi Grimberg @ 2017-11-08  9:57 UTC (permalink / raw)
  To: linux-rdma@vger.kernel.org
  Cc: linux-nvme@lists.infradead.org, Christoph Hellwig,
	Max Gurtuvoy

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
[hch: ported to the new API]
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/infiniband/ulp/isert/ib_isert.c | 165 ++++----------------------------
 drivers/infiniband/ulp/isert/ib_isert.h |  16 ----
 2 files changed, 20 insertions(+), 161 deletions(-)

diff --git a/drivers/infiniband/ulp/isert/ib_isert.c b/drivers/infiniband/ulp/isert/ib_isert.c
index ceabdb85df8b..bcf4adac5d8c 100644
--- a/drivers/infiniband/ulp/isert/ib_isert.c
+++ b/drivers/infiniband/ulp/isert/ib_isert.c
@@ -35,8 +35,6 @@
 #define ISER_MAX_RX_CQ_LEN	(ISERT_QP_MAX_RECV_DTOS * ISERT_MAX_CONN)
 #define ISER_MAX_TX_CQ_LEN \
 	((ISERT_QP_MAX_REQ_DTOS + ISCSI_DEF_XMIT_CMDS_MAX) * ISERT_MAX_CONN)
-#define ISER_MAX_CQ_LEN		(ISER_MAX_RX_CQ_LEN + ISER_MAX_TX_CQ_LEN + \
-				 ISERT_MAX_CONN)
 
 static int isert_debug_level;
 module_param_named(debug_level, isert_debug_level, int, 0644);
@@ -89,55 +87,26 @@ isert_qp_event_callback(struct ib_event *e, void *context)
 	}
 }
 
-static struct isert_comp *
-isert_comp_get(struct isert_conn *isert_conn)
-{
-	struct isert_device *device = isert_conn->device;
-	struct isert_comp *comp;
-	int i, min = 0;
-
-	mutex_lock(&device_list_mutex);
-	for (i = 0; i < device->comps_used; i++)
-		if (device->comps[i].active_qps <
-		    device->comps[min].active_qps)
-			min = i;
-	comp = &device->comps[min];
-	comp->active_qps++;
-	mutex_unlock(&device_list_mutex);
-
-	isert_info("conn %p, using comp %p min_index: %d\n",
-		   isert_conn, comp, min);
-
-	return comp;
-}
-
-static void
-isert_comp_put(struct isert_comp *comp)
-{
-	mutex_lock(&device_list_mutex);
-	comp->active_qps--;
-	mutex_unlock(&device_list_mutex);
-}
-
 static struct ib_qp *
-isert_create_qp(struct isert_conn *isert_conn,
-		struct isert_comp *comp,
-		struct rdma_cm_id *cma_id)
+isert_create_qp(struct isert_conn *isert_conn, struct rdma_cm_id *cma_id)
 {
 	struct isert_device *device = isert_conn->device;
 	struct ib_qp_init_attr attr;
 	int ret;
 
-	memset(&attr, 0, sizeof(struct ib_qp_init_attr));
+	memset(&attr, 0, sizeof(attr));
+	attr.create_flags = IB_QP_CREATE_ASSIGN_CQS;
 	attr.event_handler = isert_qp_event_callback;
 	attr.qp_context = isert_conn;
-	attr.send_cq = comp->cq;
-	attr.recv_cq = comp->cq;
+	attr.poll_ctx = IB_POLL_WORKQUEUE;
+
 	attr.cap.max_send_wr = ISERT_QP_MAX_REQ_DTOS + 1;
-	attr.cap.max_recv_wr = ISERT_QP_MAX_RECV_DTOS + 1;
 	attr.cap.max_rdma_ctxs = ISCSI_DEF_XMIT_CMDS_MAX;
 	attr.cap.max_send_sge = device->ib_device->attrs.max_sge;
+
+	attr.cap.max_recv_wr = ISERT_QP_MAX_RECV_DTOS + 1;
 	attr.cap.max_recv_sge = 1;
+
 	attr.sq_sig_type = IB_SIGNAL_REQ_WR;
 	attr.qp_type = IB_QPT_RC;
 	if (device->pi_capable)
@@ -152,25 +121,6 @@ isert_create_qp(struct isert_conn *isert_conn,
 	return cma_id->qp;
 }
 
-static int
-isert_conn_setup_qp(struct isert_conn *isert_conn, struct rdma_cm_id *cma_id)
-{
-	struct isert_comp *comp;
-	int ret;
-
-	comp = isert_comp_get(isert_conn);
-	isert_conn->qp = isert_create_qp(isert_conn, comp, cma_id);
-	if (IS_ERR(isert_conn->qp)) {
-		ret = PTR_ERR(isert_conn->qp);
-		goto err;
-	}
-
-	return 0;
-err:
-	isert_comp_put(comp);
-	return ret;
-}
-
 static int
 isert_alloc_rx_descriptors(struct isert_conn *isert_conn)
 {
@@ -237,61 +187,6 @@ isert_free_rx_descriptors(struct isert_conn *isert_conn)
 	isert_conn->rx_descs = NULL;
 }
 
-static void
-isert_free_comps(struct isert_device *device)
-{
-	int i;
-
-	for (i = 0; i < device->comps_used; i++) {
-		struct isert_comp *comp = &device->comps[i];
-
-		if (comp->cq)
-			ib_free_cq(comp->cq);
-	}
-	kfree(device->comps);
-}
-
-static int
-isert_alloc_comps(struct isert_device *device)
-{
-	int i, max_cqe, ret = 0;
-
-	device->comps_used = min(ISERT_MAX_CQ, min_t(int, num_online_cpus(),
-				 device->ib_device->num_comp_vectors));
-
-	isert_info("Using %d CQs, %s supports %d vectors support "
-		   "pi_capable %d\n",
-		   device->comps_used, device->ib_device->name,
-		   device->ib_device->num_comp_vectors,
-		   device->pi_capable);
-
-	device->comps = kcalloc(device->comps_used, sizeof(struct isert_comp),
-				GFP_KERNEL);
-	if (!device->comps)
-		return -ENOMEM;
-
-	max_cqe = min(ISER_MAX_CQ_LEN, device->ib_device->attrs.max_cqe);
-
-	for (i = 0; i < device->comps_used; i++) {
-		struct isert_comp *comp = &device->comps[i];
-
-		comp->device = device;
-		comp->cq = ib_alloc_cq(device->ib_device, comp, max_cqe, i,
-				IB_POLL_WORKQUEUE);
-		if (IS_ERR(comp->cq)) {
-			isert_err("Unable to allocate cq\n");
-			ret = PTR_ERR(comp->cq);
-			comp->cq = NULL;
-			goto out_cq;
-		}
-	}
-
-	return 0;
-out_cq:
-	isert_free_comps(device);
-	return ret;
-}
-
 static int
 isert_create_device_ib_res(struct isert_device *device)
 {
@@ -301,16 +196,12 @@ isert_create_device_ib_res(struct isert_device *device)
 	isert_dbg("devattr->max_sge: %d\n", ib_dev->attrs.max_sge);
 	isert_dbg("devattr->max_sge_rd: %d\n", ib_dev->attrs.max_sge_rd);
 
-	ret = isert_alloc_comps(device);
-	if (ret)
-		goto out;
-
 	device->pd = ib_alloc_pd(ib_dev, 0);
 	if (IS_ERR(device->pd)) {
 		ret = PTR_ERR(device->pd);
-		isert_err("failed to allocate pd, device %p, ret=%d\n",
-			  device, ret);
-		goto out_cq;
+		isert_err("%s: failed to allocate pd, ret=%d\n",
+			  ib_dev->name, ret);
+		return ret;
 	}
 
 	/* Check signature cap */
@@ -318,22 +209,6 @@ isert_create_device_ib_res(struct isert_device *device)
 			     IB_DEVICE_SIGNATURE_HANDOVER ? true : false;
 
 	return 0;
-
-out_cq:
-	isert_free_comps(device);
-out:
-	if (ret > 0)
-		ret = -EINVAL;
-	return ret;
-}
-
-static void
-isert_free_device_ib_res(struct isert_device *device)
-{
-	isert_info("device %p\n", device);
-
-	ib_dealloc_pd(device->pd);
-	isert_free_comps(device);
 }
 
 static void
@@ -343,7 +218,7 @@ isert_device_put(struct isert_device *device)
 	device->refcount--;
 	isert_info("device %p refcount %d\n", device, device->refcount);
 	if (!device->refcount) {
-		isert_free_device_ib_res(device);
+		ib_dealloc_pd(device->pd);
 		list_del(&device->dev_node);
 		kfree(device);
 	}
@@ -535,13 +410,15 @@ isert_connect_request(struct rdma_cm_id *cma_id, struct rdma_cm_event *event)
 
 	isert_set_nego_params(isert_conn, &event->param.conn);
 
-	ret = isert_conn_setup_qp(isert_conn, cma_id);
-	if (ret)
+	isert_conn->qp = isert_create_qp(isert_conn, cma_id);
+	if (IS_ERR(isert_conn->qp)) {
+		ret = PTR_ERR(isert_conn->qp);
 		goto out_conn_dev;
+	}
 
 	ret = isert_login_post_recv(isert_conn);
 	if (ret)
-		goto out_conn_dev;
+		goto out_conn_qp;
 
 	ret = isert_rdma_accept(isert_conn);
 	if (ret)
@@ -553,6 +430,8 @@ isert_connect_request(struct rdma_cm_id *cma_id, struct rdma_cm_event *event)
 
 	return 0;
 
+out_conn_qp:
+	ib_destroy_qp(isert_conn->qp);
 out_conn_dev:
 	isert_device_put(device);
 out_rsp_dma_map:
@@ -577,12 +456,8 @@ isert_connect_release(struct isert_conn *isert_conn)
 	    !isert_conn->dev_removed)
 		rdma_destroy_id(isert_conn->cm_id);
 
-	if (isert_conn->qp) {
-		struct isert_comp *comp = isert_conn->qp->recv_cq->cq_context;
-
-		isert_comp_put(comp);
+	if (isert_conn->qp)
 		ib_destroy_qp(isert_conn->qp);
-	}
 
 	if (isert_conn->login_req_buf)
 		isert_free_login_buf(isert_conn);
diff --git a/drivers/infiniband/ulp/isert/ib_isert.h b/drivers/infiniband/ulp/isert/ib_isert.h
index 87d994de8c91..bb7fda807471 100644
--- a/drivers/infiniband/ulp/isert/ib_isert.h
+++ b/drivers/infiniband/ulp/isert/ib_isert.h
@@ -165,27 +165,11 @@ struct isert_conn {
 
 #define ISERT_MAX_CQ 64
 
-/**
- * struct isert_comp - iSER completion context
- *
- * @device:     pointer to device handle
- * @cq:         completion queue
- * @active_qps: Number of active QPs attached
- *              to completion context
- */
-struct isert_comp {
-	struct isert_device     *device;
-	struct ib_cq		*cq;
-	int                      active_qps;
-};
-
 struct isert_device {
 	bool			pi_capable;
 	int			refcount;
 	struct ib_device	*ib_device;
 	struct ib_pd		*pd;
-	struct isert_comp	*comps;
-	int                     comps_used;
 	struct list_head	dev_node;
 };
 
-- 
2.14.1

* [PATCH v3 3/9] IB/iser: use implicit CQ allocation
From: Sagi Grimberg @ 2017-11-08  9:57 UTC (permalink / raw)
  To: linux-rdma@vger.kernel.org
  Cc: linux-nvme@lists.infradead.org, Christoph Hellwig,
	Max Gurtuvoy

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
[hch: ported to the new API]
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/infiniband/ulp/iser/iscsi_iser.h | 19 --------
 drivers/infiniband/ulp/iser/iser_verbs.c | 82 ++++----------------------------
 2 files changed, 8 insertions(+), 93 deletions(-)

diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.h b/drivers/infiniband/ulp/iser/iscsi_iser.h
index c1ae4aeae2f9..cc4134acebdf 100644
--- a/drivers/infiniband/ulp/iser/iscsi_iser.h
+++ b/drivers/infiniband/ulp/iser/iscsi_iser.h
@@ -317,18 +317,6 @@ struct iser_conn;
 struct ib_conn;
 struct iscsi_iser_task;
 
-/**
- * struct iser_comp - iSER completion context
- *
- * @cq:         completion queue
- * @active_qps: Number of active QPs attached
- *              to completion context
- */
-struct iser_comp {
-	struct ib_cq		*cq;
-	int                      active_qps;
-};
-
 /**
  * struct iser_device - Memory registration operations
  *     per-device registration schemes
@@ -365,9 +353,6 @@ struct iser_reg_ops {
  * @event_handler: IB events handle routine
  * @ig_list:	   entry in devices list
  * @refcount:      Reference counter, dominated by open iser connections
- * @comps_used:    Number of completion contexts used, Min between online
- *                 cpus and device max completion vectors
- * @comps:         Dinamically allocated array of completion handlers
  * @reg_ops:       Registration ops
  * @remote_inv_sup: Remote invalidate is supported on this device
  */
@@ -377,8 +362,6 @@ struct iser_device {
 	struct ib_event_handler      event_handler;
 	struct list_head             ig_list;
 	int                          refcount;
-	int			     comps_used;
-	struct iser_comp	     *comps;
 	const struct iser_reg_ops    *reg_ops;
 	bool                         remote_inv_sup;
 };
@@ -456,7 +439,6 @@ struct iser_fr_pool {
  * @sig_count:           send work request signal count
  * @rx_wr:               receive work request for batch posts
  * @device:              reference to iser device
- * @comp:                iser completion context
  * @fr_pool:             connection fast registration poool
  * @pi_support:          Indicate device T10-PI support
  */
@@ -467,7 +449,6 @@ struct ib_conn {
 	u8                           sig_count;
 	struct ib_recv_wr	     rx_wr[ISER_MIN_POSTED_RX];
 	struct iser_device          *device;
-	struct iser_comp	    *comp;
 	struct iser_fr_pool          fr_pool;
 	bool			     pi_support;
 	struct ib_cqe		     reg_cqe;
diff --git a/drivers/infiniband/ulp/iser/iser_verbs.c b/drivers/infiniband/ulp/iser/iser_verbs.c
index 55a73b0ed4c6..8f8d853b1dc9 100644
--- a/drivers/infiniband/ulp/iser/iser_verbs.c
+++ b/drivers/infiniband/ulp/iser/iser_verbs.c
@@ -68,40 +68,17 @@ static void iser_event_handler(struct ib_event_handler *handler,
 static int iser_create_device_ib_res(struct iser_device *device)
 {
 	struct ib_device *ib_dev = device->ib_device;
-	int ret, i, max_cqe;
+	int ret;
 
 	ret = iser_assign_reg_ops(device);
 	if (ret)
 		return ret;
 
-	device->comps_used = min_t(int, num_online_cpus(),
-				 ib_dev->num_comp_vectors);
-
-	device->comps = kcalloc(device->comps_used, sizeof(*device->comps),
-				GFP_KERNEL);
-	if (!device->comps)
-		goto comps_err;
-
-	max_cqe = min(ISER_MAX_CQ_LEN, ib_dev->attrs.max_cqe);
-
-	iser_info("using %d CQs, device %s supports %d vectors max_cqe %d\n",
-		  device->comps_used, ib_dev->name,
-		  ib_dev->num_comp_vectors, max_cqe);
-
 	device->pd = ib_alloc_pd(ib_dev,
 		iser_always_reg ? 0 : IB_PD_UNSAFE_GLOBAL_RKEY);
-	if (IS_ERR(device->pd))
-		goto pd_err;
-
-	for (i = 0; i < device->comps_used; i++) {
-		struct iser_comp *comp = &device->comps[i];
-
-		comp->cq = ib_alloc_cq(ib_dev, comp, max_cqe, i,
-				       IB_POLL_SOFTIRQ);
-		if (IS_ERR(comp->cq)) {
-			comp->cq = NULL;
-			goto cq_err;
-		}
+	if (IS_ERR(device->pd)) {
+		ret = PTR_ERR(device->pd);
+		goto out;
 	}
 
 	INIT_IB_EVENT_HANDLER(&device->event_handler, ib_dev,
@@ -109,19 +86,9 @@ static int iser_create_device_ib_res(struct iser_device *device)
 	ib_register_event_handler(&device->event_handler);
 	return 0;
 
-cq_err:
-	for (i = 0; i < device->comps_used; i++) {
-		struct iser_comp *comp = &device->comps[i];
-
-		if (comp->cq)
-			ib_free_cq(comp->cq);
-	}
-	ib_dealloc_pd(device->pd);
-pd_err:
-	kfree(device->comps);
-comps_err:
+out:
 	iser_err("failed to allocate an IB resource\n");
-	return -1;
+	return ret;
 }
 
 /**
@@ -130,20 +97,8 @@ static int iser_create_device_ib_res(struct iser_device *device)
  */
 static void iser_free_device_ib_res(struct iser_device *device)
 {
-	int i;
-
-	for (i = 0; i < device->comps_used; i++) {
-		struct iser_comp *comp = &device->comps[i];
-
-		ib_free_cq(comp->cq);
-		comp->cq = NULL;
-	}
-
 	ib_unregister_event_handler(&device->event_handler);
 	ib_dealloc_pd(device->pd);
-
-	kfree(device->comps);
-	device->comps = NULL;
 	device->pd = NULL;
 }
 
@@ -423,7 +378,6 @@ static int iser_create_ib_conn_res(struct ib_conn *ib_conn)
 	struct ib_device	*ib_dev;
 	struct ib_qp_init_attr	init_attr;
 	int			ret = -ENOMEM;
-	int index, min_index = 0;
 
 	BUG_ON(ib_conn->device == NULL);
 
@@ -431,23 +385,10 @@ static int iser_create_ib_conn_res(struct ib_conn *ib_conn)
 	ib_dev = device->ib_device;
 
 	memset(&init_attr, 0, sizeof init_attr);
-
-	mutex_lock(&ig.connlist_mutex);
-	/* select the CQ with the minimal number of usages */
-	for (index = 0; index < device->comps_used; index++) {
-		if (device->comps[index].active_qps <
-		    device->comps[min_index].active_qps)
-			min_index = index;
-	}
-	ib_conn->comp = &device->comps[min_index];
-	ib_conn->comp->active_qps++;
-	mutex_unlock(&ig.connlist_mutex);
-	iser_info("cq index %d used for ib_conn %p\n", min_index, ib_conn);
-
+	init_attr.create_flags = IB_QP_CREATE_ASSIGN_CQS;
 	init_attr.event_handler = iser_qp_event_callback;
 	init_attr.qp_context	= (void *)ib_conn;
-	init_attr.send_cq	= ib_conn->comp->cq;
-	init_attr.recv_cq	= ib_conn->comp->cq;
+	init_attr.poll_ctx = IB_POLL_SOFTIRQ;
 	init_attr.cap.max_recv_wr  = ISER_QP_MAX_RECV_DTOS;
 	init_attr.cap.max_send_sge = 2;
 	init_attr.cap.max_recv_sge = 1;
@@ -483,11 +424,7 @@ static int iser_create_ib_conn_res(struct ib_conn *ib_conn)
 	return ret;
 
 out_err:
-	mutex_lock(&ig.connlist_mutex);
-	ib_conn->comp->active_qps--;
-	mutex_unlock(&ig.connlist_mutex);
 	iser_err("unable to alloc mem or create resource, err %d\n", ret);
-
 	return ret;
 }
 
@@ -597,9 +534,6 @@ static void iser_free_ib_conn_res(struct iser_conn *iser_conn,
 		  iser_conn, ib_conn->cma_id, ib_conn->qp);
 
 	if (ib_conn->qp != NULL) {
-		mutex_lock(&ig.connlist_mutex);
-		ib_conn->comp->active_qps--;
-		mutex_unlock(&ig.connlist_mutex);
 		rdma_destroy_qp(ib_conn->cma_id);
 		ib_conn->qp = NULL;
 	}
-- 
2.14.1


* [PATCH v3 4/9] IB/srpt: use implicit CQ allocation
  2017-11-08  9:57 ` Sagi Grimberg
@ 2017-11-08  9:57     ` Sagi Grimberg
  -1 siblings, 0 replies; 92+ messages in thread
From: Sagi Grimberg @ 2017-11-08  9:57 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Christoph Hellwig,
	Max Gurtuvoy

Signed-off-by: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
[hch: ported to the new API]
Signed-off-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
---
 drivers/infiniband/ulp/srpt/ib_srpt.c | 46 ++++++++++++-----------------------
 drivers/infiniband/ulp/srpt/ib_srpt.h |  1 -
 2 files changed, 15 insertions(+), 32 deletions(-)

diff --git a/drivers/infiniband/ulp/srpt/ib_srpt.c b/drivers/infiniband/ulp/srpt/ib_srpt.c
index 9e8e9220f816..256d0d5b32e5 100644
--- a/drivers/infiniband/ulp/srpt/ib_srpt.c
+++ b/drivers/infiniband/ulp/srpt/ib_srpt.c
@@ -798,7 +798,7 @@ static int srpt_zerolength_write(struct srpt_rdma_ch *ch)
 
 static void srpt_zerolength_write_done(struct ib_cq *cq, struct ib_wc *wc)
 {
-	struct srpt_rdma_ch *ch = cq->cq_context;
+	struct srpt_rdma_ch *ch = wc->qp->qp_context;
 
 	if (wc->status == IB_WC_SUCCESS) {
 		srpt_process_wait_list(ch);
@@ -1201,7 +1201,7 @@ static int srpt_abort_cmd(struct srpt_send_ioctx *ioctx)
  */
 static void srpt_rdma_read_done(struct ib_cq *cq, struct ib_wc *wc)
 {
-	struct srpt_rdma_ch *ch = cq->cq_context;
+	struct srpt_rdma_ch *ch = wc->qp->qp_context;
 	struct srpt_send_ioctx *ioctx =
 		container_of(wc->wr_cqe, struct srpt_send_ioctx, rdma_cqe);
 
@@ -1526,7 +1526,7 @@ static void srpt_handle_new_iu(struct srpt_rdma_ch *ch,
 
 static void srpt_recv_done(struct ib_cq *cq, struct ib_wc *wc)
 {
-	struct srpt_rdma_ch *ch = cq->cq_context;
+	struct srpt_rdma_ch *ch = wc->qp->qp_context;
 	struct srpt_recv_ioctx *ioctx =
 		container_of(wc->wr_cqe, struct srpt_recv_ioctx, ioctx.cqe);
 
@@ -1580,7 +1580,7 @@ static void srpt_process_wait_list(struct srpt_rdma_ch *ch)
  */
 static void srpt_send_done(struct ib_cq *cq, struct ib_wc *wc)
 {
-	struct srpt_rdma_ch *ch = cq->cq_context;
+	struct srpt_rdma_ch *ch = wc->qp->qp_context;
 	struct srpt_send_ioctx *ioctx =
 		container_of(wc->wr_cqe, struct srpt_send_ioctx, ioctx.cqe);
 	enum srpt_command_state state;
@@ -1626,23 +1626,14 @@ static int srpt_create_ch_ib(struct srpt_rdma_ch *ch)
 		goto out;
 
 retry:
-	ch->cq = ib_alloc_cq(sdev->device, ch, ch->rq_size + srp_sq_size,
-			0 /* XXX: spread CQs */, IB_POLL_WORKQUEUE);
-	if (IS_ERR(ch->cq)) {
-		ret = PTR_ERR(ch->cq);
-		pr_err("failed to create CQ cqe= %d ret= %d\n",
-		       ch->rq_size + srp_sq_size, ret);
-		goto out;
-	}
-
+	qp_init->create_flags = IB_QP_CREATE_ASSIGN_CQS;
 	qp_init->qp_context = (void *)ch;
 	qp_init->event_handler
 		= (void(*)(struct ib_event *, void*))srpt_qp_event;
-	qp_init->send_cq = ch->cq;
-	qp_init->recv_cq = ch->cq;
 	qp_init->srq = sdev->srq;
 	qp_init->sq_sig_type = IB_SIGNAL_REQ_WR;
 	qp_init->qp_type = IB_QPT_RC;
+	qp_init->poll_ctx = IB_POLL_WORKQUEUE;
 	/*
 	 * We divide up our send queue size into half SEND WRs to send the
 	 * completions, and half R/W contexts to actually do the RDMA
@@ -1653,6 +1644,9 @@ static int srpt_create_ch_ib(struct srpt_rdma_ch *ch)
 	qp_init->cap.max_send_wr = srp_sq_size / 2;
 	qp_init->cap.max_rdma_ctxs = srp_sq_size / 2;
 	qp_init->cap.max_send_sge = min(attrs->max_sge, SRPT_MAX_SG_PER_WQE);
+
+	qp_init->cap.max_recv_wr = ch->rq_size;
+
 	qp_init->port_num = ch->sport->port;
 
 	ch->qp = ib_create_qp(sdev->pd, qp_init);
@@ -1660,19 +1654,17 @@ static int srpt_create_ch_ib(struct srpt_rdma_ch *ch)
 		ret = PTR_ERR(ch->qp);
 		if (ret == -ENOMEM) {
 			srp_sq_size /= 2;
-			if (srp_sq_size >= MIN_SRPT_SQ_SIZE) {
-				ib_destroy_cq(ch->cq);
+			if (srp_sq_size >= MIN_SRPT_SQ_SIZE)
 				goto retry;
-			}
 		}
 		pr_err("failed to create_qp ret= %d\n", ret);
-		goto err_destroy_cq;
+		goto out;
 	}
 
 	atomic_set(&ch->sq_wr_avail, qp_init->cap.max_send_wr);
 
-	pr_debug("%s: max_cqe= %d max_sge= %d sq_size = %d cm_id= %p\n",
-		 __func__, ch->cq->cqe, qp_init->cap.max_send_sge,
+	pr_debug("%s: max_sge= %d sq_size = %d cm_id= %p\n",
+		 __func__, qp_init->cap.max_send_sge,
 		 qp_init->cap.max_send_wr, ch->cm_id);
 
 	ret = srpt_init_ch_qp(ch, ch->qp);
@@ -1685,17 +1677,9 @@ static int srpt_create_ch_ib(struct srpt_rdma_ch *ch)
 
 err_destroy_qp:
 	ib_destroy_qp(ch->qp);
-err_destroy_cq:
-	ib_free_cq(ch->cq);
 	goto out;
 }
 
-static void srpt_destroy_ch_ib(struct srpt_rdma_ch *ch)
-{
-	ib_destroy_qp(ch->qp);
-	ib_free_cq(ch->cq);
-}
-
 /**
  * srpt_close_ch() - Close an RDMA channel.
  *
@@ -1812,7 +1796,7 @@ static void srpt_release_channel_work(struct work_struct *w)
 
 	ib_destroy_cm_id(ch->cm_id);
 
-	srpt_destroy_ch_ib(ch);
+	ib_destroy_qp(ch->qp);
 
 	srpt_free_ioctx_ring((struct srpt_ioctx **)ch->ioctx_ring,
 			     ch->sport->sdev, ch->rq_size,
@@ -2070,7 +2054,7 @@ static int srpt_cm_req_recv(struct ib_cm_id *cm_id,
 	ch->sess = NULL;
 
 destroy_ib:
-	srpt_destroy_ch_ib(ch);
+	ib_destroy_qp(ch->qp);
 
 free_ring:
 	srpt_free_ioctx_ring((struct srpt_ioctx **)ch->ioctx_ring,
diff --git a/drivers/infiniband/ulp/srpt/ib_srpt.h b/drivers/infiniband/ulp/srpt/ib_srpt.h
index 1b817e51b84b..4ab0d94af174 100644
--- a/drivers/infiniband/ulp/srpt/ib_srpt.h
+++ b/drivers/infiniband/ulp/srpt/ib_srpt.h
@@ -265,7 +265,6 @@ enum rdma_ch_state {
 struct srpt_rdma_ch {
 	struct ib_cm_id		*cm_id;
 	struct ib_qp		*qp;
-	struct ib_cq		*cq;
 	struct ib_cqe		zw_cqe;
 	struct kref		kref;
 	int			rq_size;
-- 
2.14.1


* [PATCH v3 5/9] svcrdma: Use RDMA core implicit CQ allocation
  2017-11-08  9:57 ` Sagi Grimberg
@ 2017-11-08  9:57     ` Sagi Grimberg
  -1 siblings, 0 replies; 92+ messages in thread
From: Sagi Grimberg @ 2017-11-08  9:57 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Christoph Hellwig,
	Max Gurtuvoy

Get some of the wisdom of CQ completion vector spreading
and CQ queue-pair chunking for free.
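
For readers unfamiliar with the new core API, a minimal sketch of the
conversion pattern follows (it assumes the IB_QP_CREATE_ASSIGN_CQS flag
and the poll_ctx field added earlier in this series; the function and
variable names are placeholders, not svcrdma code):

#include <rdma/ib_verbs.h>

/*
 * Sketch only: instead of allocating send/recv CQs explicitly with
 * ib_alloc_cq() and wiring them into send_cq/recv_cq, the ULP sets
 * IB_QP_CREATE_ASSIGN_CQS plus a polling context and lets the RDMA
 * core pick a suitably affinitized CQ from the per-device pool.
 */
static struct ib_qp *example_create_qp(struct ib_pd *pd, void *ctx,
				       u32 sq_depth, u32 rq_depth)
{
	struct ib_qp_init_attr attr = {};

	attr.qp_context       = ctx;
	attr.qp_type          = IB_QPT_RC;
	attr.sq_sig_type      = IB_SIGNAL_REQ_WR;
	attr.cap.max_send_wr  = sq_depth;
	attr.cap.max_recv_wr  = rq_depth;
	attr.cap.max_send_sge = 1;
	attr.cap.max_recv_sge = 1;

	/* new in this series: implicit CQ selection from the device pool */
	attr.create_flags = IB_QP_CREATE_ASSIGN_CQS;
	attr.poll_ctx     = IB_POLL_WORKQUEUE;

	return ib_create_qp(pd, &attr);
}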

Signed-off-by: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
---
 include/linux/sunrpc/svc_rdma.h          |  2 --
 net/sunrpc/xprtrdma/svc_rdma_transport.c | 22 ++--------------------
 2 files changed, 2 insertions(+), 22 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 995c6fe9ee90..95e0b7a1b311 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -118,8 +118,6 @@ struct svcxprt_rdma {
 	struct list_head     sc_rq_dto_q;
 	spinlock_t	     sc_rq_dto_lock;
 	struct ib_qp         *sc_qp;
-	struct ib_cq         *sc_rq_cq;
-	struct ib_cq         *sc_sq_cq;
 
 	spinlock_t	     sc_lock;		/* transport lock */
 
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 5caf8e722a11..d51ead156898 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -780,21 +780,11 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
 		dprintk("svcrdma: error creating PD for connect request\n");
 		goto errout;
 	}
-	newxprt->sc_sq_cq = ib_alloc_cq(dev, newxprt, newxprt->sc_sq_depth,
-					0, IB_POLL_WORKQUEUE);
-	if (IS_ERR(newxprt->sc_sq_cq)) {
-		dprintk("svcrdma: error creating SQ CQ for connect request\n");
-		goto errout;
-	}
-	newxprt->sc_rq_cq = ib_alloc_cq(dev, newxprt, newxprt->sc_rq_depth,
-					0, IB_POLL_WORKQUEUE);
-	if (IS_ERR(newxprt->sc_rq_cq)) {
-		dprintk("svcrdma: error creating RQ CQ for connect request\n");
-		goto errout;
-	}
 
 	memset(&qp_attr, 0, sizeof qp_attr);
 	qp_attr.event_handler = qp_event_handler;
+	qp_attr.create_flags = IB_QP_CREATE_ASSIGN_CQS;
+	qp_attr.poll_ctx = IB_POLL_WORKQUEUE;
 	qp_attr.qp_context = &newxprt->sc_xprt;
 	qp_attr.port_num = newxprt->sc_port_num;
 	qp_attr.cap.max_rdma_ctxs = ctxts;
@@ -804,8 +794,6 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
 	qp_attr.cap.max_recv_sge = newxprt->sc_max_sge;
 	qp_attr.sq_sig_type = IB_SIGNAL_REQ_WR;
 	qp_attr.qp_type = IB_QPT_RC;
-	qp_attr.send_cq = newxprt->sc_sq_cq;
-	qp_attr.recv_cq = newxprt->sc_rq_cq;
 	dprintk("svcrdma: newxprt->sc_cm_id=%p, newxprt->sc_pd=%p\n",
 		newxprt->sc_cm_id, newxprt->sc_pd);
 	dprintk("    cap.max_send_wr = %d, cap.max_recv_wr = %d\n",
@@ -959,12 +947,6 @@ static void __svc_rdma_free(struct work_struct *work)
 	if (rdma->sc_qp && !IS_ERR(rdma->sc_qp))
 		ib_destroy_qp(rdma->sc_qp);
 
-	if (rdma->sc_sq_cq && !IS_ERR(rdma->sc_sq_cq))
-		ib_free_cq(rdma->sc_sq_cq);
-
-	if (rdma->sc_rq_cq && !IS_ERR(rdma->sc_rq_cq))
-		ib_free_cq(rdma->sc_rq_cq);
-
 	if (rdma->sc_pd && !IS_ERR(rdma->sc_pd))
 		ib_dealloc_pd(rdma->sc_pd);
 
-- 
2.14.1


* [PATCH v3 6/9] nvme-rdma: use implicit CQ allocation
  2017-11-08  9:57 ` Sagi Grimberg
@ 2017-11-08  9:57     ` Sagi Grimberg
  -1 siblings, 0 replies; 92+ messages in thread
From: Sagi Grimberg @ 2017-11-08  9:57 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Christoph Hellwig,
	Max Gurtuvoy

From: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>

Signed-off-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
---
 drivers/nvme/host/rdma.c | 62 +++++++++++++++++++++---------------------------
 1 file changed, 27 insertions(+), 35 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 32e21ab1ae52..3acf4d1ccfed 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -90,7 +90,6 @@ struct nvme_rdma_queue {
 	size_t			cmnd_capsule_len;
 	struct nvme_rdma_ctrl	*ctrl;
 	struct nvme_rdma_device	*device;
-	struct ib_cq		*ib_cq;
 	struct ib_qp		*qp;
 
 	unsigned long		flags;
@@ -241,24 +240,38 @@ static int nvme_rdma_wait_for_cm(struct nvme_rdma_queue *queue)
 	return queue->cm_error;
 }
 
-static int nvme_rdma_create_qp(struct nvme_rdma_queue *queue, const int factor)
+static int nvme_rdma_create_qp(struct nvme_rdma_queue *queue)
 {
 	struct nvme_rdma_device *dev = queue->device;
 	struct ib_qp_init_attr init_attr;
-	int ret;
+	int ret, idx;
+	const int send_wr_factor = 3;		/* MR, SEND, INV */
 
 	memset(&init_attr, 0, sizeof(init_attr));
+	init_attr.create_flags = IB_QP_CREATE_ASSIGN_CQS;
 	init_attr.event_handler = nvme_rdma_qp_event;
+	init_attr.qp_context = queue;
+	init_attr.sq_sig_type = IB_SIGNAL_REQ_WR;
+	init_attr.qp_type = IB_QPT_RC;
+	init_attr.poll_ctx = IB_POLL_SOFTIRQ;
+
 	/* +1 for drain */
-	init_attr.cap.max_send_wr = factor * queue->queue_size + 1;
+	init_attr.cap.max_send_wr = send_wr_factor * queue->queue_size + 1;
+	init_attr.cap.max_send_sge = 1 + NVME_RDMA_MAX_INLINE_SEGMENTS;
+
 	/* +1 for drain */
 	init_attr.cap.max_recv_wr = queue->queue_size + 1;
 	init_attr.cap.max_recv_sge = 1;
-	init_attr.cap.max_send_sge = 1 + NVME_RDMA_MAX_INLINE_SEGMENTS;
-	init_attr.sq_sig_type = IB_SIGNAL_REQ_WR;
-	init_attr.qp_type = IB_QPT_RC;
-	init_attr.send_cq = queue->ib_cq;
-	init_attr.recv_cq = queue->ib_cq;
+
+	/*
+	 * The admin queue is barely used once the controller is live, so don't
+	 * bother to spread it out.
+	 */
+	idx = nvme_rdma_queue_idx(queue);
+	if (idx > 0) {
+		init_attr.affinity_hint = idx;
+		init_attr.create_flags |= IB_QP_CREATE_AFFINITY_HINT;
+	}
 
 	ret = rdma_create_qp(queue->cm_id, dev->pd, &init_attr);
 
@@ -440,7 +453,6 @@ static void nvme_rdma_destroy_queue_ib(struct nvme_rdma_queue *queue)
 	struct ib_device *ibdev = dev->dev;
 
 	rdma_destroy_qp(queue->cm_id);
-	ib_free_cq(queue->ib_cq);
 
 	nvme_rdma_free_ring(ibdev, queue->rsp_ring, queue->queue_size,
 			sizeof(struct nvme_completion), DMA_FROM_DEVICE);
@@ -451,9 +463,6 @@ static void nvme_rdma_destroy_queue_ib(struct nvme_rdma_queue *queue)
 static int nvme_rdma_create_queue_ib(struct nvme_rdma_queue *queue)
 {
 	struct ib_device *ibdev;
-	const int send_wr_factor = 3;			/* MR, SEND, INV */
-	const int cq_factor = send_wr_factor + 1;	/* + RECV */
-	int comp_vector, idx = nvme_rdma_queue_idx(queue);
 	int ret;
 
 	queue->device = nvme_rdma_find_get_device(queue->cm_id);
@@ -464,24 +473,9 @@ static int nvme_rdma_create_queue_ib(struct nvme_rdma_queue *queue)
 	}
 	ibdev = queue->device->dev;
 
-	/*
-	 * Spread I/O queues completion vectors according their queue index.
-	 * Admin queues can always go on completion vector 0.
-	 */
-	comp_vector = idx == 0 ? idx : idx - 1;
-
-	/* +1 for ib_stop_cq */
-	queue->ib_cq = ib_alloc_cq(ibdev, queue,
-				cq_factor * queue->queue_size + 1,
-				comp_vector, IB_POLL_SOFTIRQ);
-	if (IS_ERR(queue->ib_cq)) {
-		ret = PTR_ERR(queue->ib_cq);
-		goto out_put_dev;
-	}
-
-	ret = nvme_rdma_create_qp(queue, send_wr_factor);
+	ret = nvme_rdma_create_qp(queue);
 	if (ret)
-		goto out_destroy_ib_cq;
+		goto out_put_dev;
 
 	queue->rsp_ring = nvme_rdma_alloc_ring(ibdev, queue->queue_size,
 			sizeof(struct nvme_completion), DMA_FROM_DEVICE);
@@ -494,8 +488,6 @@ static int nvme_rdma_create_queue_ib(struct nvme_rdma_queue *queue)
 
 out_destroy_qp:
 	rdma_destroy_qp(queue->cm_id);
-out_destroy_ib_cq:
-	ib_free_cq(queue->ib_cq);
 out_put_dev:
 	nvme_rdma_dev_put(queue->device);
 	return ret;
@@ -999,7 +991,7 @@ static void nvme_rdma_error_recovery(struct nvme_rdma_ctrl *ctrl)
 static void nvme_rdma_wr_error(struct ib_cq *cq, struct ib_wc *wc,
 		const char *op)
 {
-	struct nvme_rdma_queue *queue = cq->cq_context;
+	struct nvme_rdma_queue *queue = wc->qp->qp_context;
 	struct nvme_rdma_ctrl *ctrl = queue->ctrl;
 
 	if (ctrl->ctrl.state == NVME_CTRL_LIVE)
@@ -1361,7 +1353,7 @@ static int __nvme_rdma_recv_done(struct ib_cq *cq, struct ib_wc *wc, int tag)
 {
 	struct nvme_rdma_qe *qe =
 		container_of(wc->wr_cqe, struct nvme_rdma_qe, cqe);
-	struct nvme_rdma_queue *queue = cq->cq_context;
+	struct nvme_rdma_queue *queue = wc->qp->qp_context;
 	struct ib_device *ibdev = queue->device->dev;
 	struct nvme_completion *cqe = qe->data;
 	const size_t len = sizeof(struct nvme_completion);
@@ -1678,7 +1670,7 @@ static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
 static int nvme_rdma_poll(struct blk_mq_hw_ctx *hctx, unsigned int tag)
 {
 	struct nvme_rdma_queue *queue = hctx->driver_data;
-	struct ib_cq *cq = queue->ib_cq;
+	struct ib_cq *cq = queue->cm_id->qp->recv_cq;
 	struct ib_wc wc;
 	int found = 0;
 
-- 
2.14.1


* [PATCH v3 7/9] nvmet-rdma: use implicit CQ allocation
  2017-11-08  9:57 ` Sagi Grimberg
@ 2017-11-08  9:57     ` Sagi Grimberg
  -1 siblings, 0 replies; 92+ messages in thread
From: Sagi Grimberg @ 2017-11-08  9:57 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Christoph Hellwig,
	Max Gurtuvoy

Signed-off-by: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
[hch: ported to the new API]
Signed-off-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
---
 drivers/nvme/target/rdma.c | 60 +++++++++++++---------------------------------
 1 file changed, 16 insertions(+), 44 deletions(-)

diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 3333d417b248..d9cdfd2bd623 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -83,7 +83,6 @@ enum nvmet_rdma_queue_state {
 struct nvmet_rdma_queue {
 	struct rdma_cm_id	*cm_id;
 	struct nvmet_port	*port;
-	struct ib_cq		*cq;
 	atomic_t		sq_wr_avail;
 	struct nvmet_rdma_device *dev;
 	spinlock_t		state_lock;
@@ -557,7 +556,7 @@ static void nvmet_rdma_read_data_done(struct ib_cq *cq, struct ib_wc *wc)
 {
 	struct nvmet_rdma_rsp *rsp =
 		container_of(wc->wr_cqe, struct nvmet_rdma_rsp, read_cqe);
-	struct nvmet_rdma_queue *queue = cq->cq_context;
+	struct nvmet_rdma_queue *queue = wc->qp->qp_context;
 
 	WARN_ON(rsp->n_rdma <= 0);
 	atomic_add(rsp->n_rdma, &queue->sq_wr_avail);
@@ -735,7 +734,7 @@ static void nvmet_rdma_recv_done(struct ib_cq *cq, struct ib_wc *wc)
 {
 	struct nvmet_rdma_cmd *cmd =
 		container_of(wc->wr_cqe, struct nvmet_rdma_cmd, cqe);
-	struct nvmet_rdma_queue *queue = cq->cq_context;
+	struct nvmet_rdma_queue *queue = wc->qp->qp_context;
 	struct nvmet_rdma_rsp *rsp;
 
 	if (unlikely(wc->status != IB_WC_SUCCESS)) {
@@ -893,62 +892,41 @@ static int nvmet_rdma_create_queue_ib(struct nvmet_rdma_queue *queue)
 {
 	struct ib_qp_init_attr qp_attr;
 	struct nvmet_rdma_device *ndev = queue->dev;
-	int comp_vector, nr_cqe, ret, i;
-
-	/*
-	 * Spread the io queues across completion vectors,
-	 * but still keep all admin queues on vector 0.
-	 */
-	comp_vector = !queue->host_qid ? 0 :
-		queue->idx % ndev->device->num_comp_vectors;
-
-	/*
-	 * Reserve CQ slots for RECV + RDMA_READ/RDMA_WRITE + RDMA_SEND.
-	 */
-	nr_cqe = queue->recv_queue_size + 2 * queue->send_queue_size;
-
-	queue->cq = ib_alloc_cq(ndev->device, queue,
-			nr_cqe + 1, comp_vector,
-			IB_POLL_WORKQUEUE);
-	if (IS_ERR(queue->cq)) {
-		ret = PTR_ERR(queue->cq);
-		pr_err("failed to create CQ cqe= %d ret= %d\n",
-		       nr_cqe + 1, ret);
-		goto out;
-	}
+	int ret, i;
 
 	memset(&qp_attr, 0, sizeof(qp_attr));
+	qp_attr.create_flags = IB_QP_CREATE_ASSIGN_CQS;
 	qp_attr.qp_context = queue;
 	qp_attr.event_handler = nvmet_rdma_qp_event;
-	qp_attr.send_cq = queue->cq;
-	qp_attr.recv_cq = queue->cq;
 	qp_attr.sq_sig_type = IB_SIGNAL_REQ_WR;
 	qp_attr.qp_type = IB_QPT_RC;
+	qp_attr.poll_ctx = IB_POLL_WORKQUEUE;
+
 	/* +1 for drain */
 	qp_attr.cap.max_send_wr = queue->send_queue_size + 1;
 	qp_attr.cap.max_rdma_ctxs = queue->send_queue_size;
 	qp_attr.cap.max_send_sge = max(ndev->device->attrs.max_sge_rd,
 					ndev->device->attrs.max_sge);
 
-	if (ndev->srq) {
+	/* +1 for drain */
+	qp_attr.cap.max_recv_wr = queue->recv_queue_size + 1;
+
+	if (ndev->srq)
 		qp_attr.srq = ndev->srq;
-	} else {
-		/* +1 for drain */
-		qp_attr.cap.max_recv_wr = 1 + queue->recv_queue_size;
+	else
 		qp_attr.cap.max_recv_sge = 2;
-	}
 
 	ret = rdma_create_qp(queue->cm_id, ndev->pd, &qp_attr);
 	if (ret) {
 		pr_err("failed to create_qp ret= %d\n", ret);
-		goto err_destroy_cq;
+		return ret;
 	}
 
 	atomic_set(&queue->sq_wr_avail, qp_attr.cap.max_send_wr);
 
-	pr_debug("%s: max_cqe= %d max_sge= %d sq_size = %d cm_id= %p\n",
-		 __func__, queue->cq->cqe, qp_attr.cap.max_send_sge,
-		 qp_attr.cap.max_send_wr, queue->cm_id);
+	pr_debug("%s: max_sge= %d sq_size = %d cm_id=%p\n", __func__,
+		qp_attr.cap.max_send_sge, qp_attr.cap.max_send_wr,
+		queue->cm_id);
 
 	if (!ndev->srq) {
 		for (i = 0; i < queue->recv_queue_size; i++) {
@@ -957,19 +935,13 @@ static int nvmet_rdma_create_queue_ib(struct nvmet_rdma_queue *queue)
 		}
 	}
 
-out:
-	return ret;
-
-err_destroy_cq:
-	ib_free_cq(queue->cq);
-	goto out;
+	return 0;
 }
 
 static void nvmet_rdma_destroy_queue_ib(struct nvmet_rdma_queue *queue)
 {
 	ib_drain_qp(queue->cm_id->qp);
 	rdma_destroy_qp(queue->cm_id);
-	ib_free_cq(queue->cq);
 }
 
 static void nvmet_rdma_free_queue(struct nvmet_rdma_queue *queue)
-- 
2.14.1


* [PATCH v3 8/9] nvmet: allow assignment of a cpulist for each nvmet port
  2017-11-08  9:57 ` Sagi Grimberg
@ 2017-11-08  9:57     ` Sagi Grimberg
  -1 siblings, 0 replies; 92+ messages in thread
From: Sagi Grimberg @ 2017-11-08  9:57 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Christoph Hellwig,
	Max Gurtuvoy

Users might want to assign a specific affinity, in the form of
a cpulist, to an nvmet port. This can make sense in multi-socket
systems where each socket is connected to an HBA (e.g. an RDMA
device) and a set of backend storage devices (e.g. NVMe or other
PCI storage devices), and the user wants to provision the backend
storage via the HBA that belongs to the same NUMA socket.

So, allow the user to pass a cpulist. However, if the underlying
devices do not expose access to these mappings, the transport
driver is not obligated to enforce it, so it is merely a hint.

Default to all online cpus.
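
For illustration, here is a simplified sketch of how a transport could
consume the hint (roughly what the nvmet-rdma patch that follows does,
but the helper name and the exact queue-to-cpu mapping below are made
up for the example):

#include <rdma/ib_verbs.h>
#include "nvmet.h"

/*
 * Sketch only: pick one of the cpus the user allowed on this port,
 * based on the queue index, and pass it to the RDMA core as the QP
 * affinity hint.  Queue 0 (the admin queue) is left unspread, and
 * the exact semantics of the hint are up to the core CQ pool code.
 */
static void example_apply_port_affinity(struct ib_qp_init_attr *attr,
					struct nvmet_port *port, int qidx)
{
	if (qidx == 0 || !port->nr_cpus)
		return;

	attr->affinity_hint = port->cpus[(qidx - 1) % port->nr_cpus];
	attr->create_flags |= IB_QP_CREATE_AFFINITY_HINT;
}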

Signed-off-by: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
---
 drivers/nvme/target/configfs.c | 75 ++++++++++++++++++++++++++++++++++++++++++
 drivers/nvme/target/nvmet.h    |  4 +++
 2 files changed, 79 insertions(+)

diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
index b6aeb1d70951..723af3baeb7b 100644
--- a/drivers/nvme/target/configfs.c
+++ b/drivers/nvme/target/configfs.c
@@ -17,12 +17,63 @@
 #include <linux/slab.h>
 #include <linux/stat.h>
 #include <linux/ctype.h>
+#include <linux/cpumask.h>
 
 #include "nvmet.h"
 
 static struct config_item_type nvmet_host_type;
 static struct config_item_type nvmet_subsys_type;
 
+static ssize_t nvmet_addr_cpulist_show(struct config_item *item,
+		char *page)
+{
+	struct nvmet_port *port = to_nvmet_port(item);
+
+	return sprintf(page, "%*pbl\n", cpumask_pr_args(port->cpumask));
+}
+
+static ssize_t nvmet_addr_cpulist_store(struct config_item *item,
+		const char *page, size_t count)
+{
+	struct nvmet_port *port = to_nvmet_port(item);
+	cpumask_var_t cpumask;
+	int i, err;
+
+	if (port->enabled) {
+		pr_err("Cannot specify cpulist while enabled\n");
+		pr_err("Disable the port before changing cores\n");
+		return -EACCES;
+	}
+
+	if (!alloc_cpumask_var(&cpumask, GFP_KERNEL))
+		return -ENOMEM;
+
+	err = cpulist_parse(page, cpumask);
+	if (err) {
+		pr_err("bad cpumask given (%d): %s\n", err, page);
+		return err;
+	}
+
+	if (!cpumask_intersects(cpumask, cpu_online_mask)) {
+		pr_err("cpulist consists of offline cpus: %s\n", page);
+		return -EINVAL;
+	}
+
+	/* copy cpumask */
+	cpumask_copy(port->cpumask, cpumask);
+	free_cpumask_var(cpumask);
+
+	/* clear port cpulist */
+	port->nr_cpus = 0;
+	/* reset port cpulist */
+	for_each_cpu(i, cpumask)
+		port->cpus[port->nr_cpus++] = i;
+
+	return count;
+}
+
+CONFIGFS_ATTR(nvmet_, addr_cpulist);
+
 /*
  * nvmet_port Generic ConfigFS definitions.
  * Used in any place in the ConfigFS tree that refers to an address.
@@ -843,6 +894,7 @@ static struct config_group *nvmet_referral_make(
 		return ERR_PTR(-ENOMEM);
 
 	INIT_LIST_HEAD(&port->entry);
+
 	config_group_init_type_name(&port->group, name, &nvmet_referral_type);
 
 	return &port->group;
@@ -864,6 +916,8 @@ static void nvmet_port_release(struct config_item *item)
 {
 	struct nvmet_port *port = to_nvmet_port(item);
 
+	kfree(port->cpus);
+	free_cpumask_var(port->cpumask);
 	kfree(port);
 }
 
@@ -873,6 +927,7 @@ static struct configfs_attribute *nvmet_port_attrs[] = {
 	&nvmet_attr_addr_traddr,
 	&nvmet_attr_addr_trsvcid,
 	&nvmet_attr_addr_trtype,
+	&nvmet_attr_addr_cpulist,
 	NULL,
 };
 
@@ -891,6 +946,7 @@ static struct config_group *nvmet_ports_make(struct config_group *group,
 {
 	struct nvmet_port *port;
 	u16 portid;
+	int i;
 
 	if (kstrtou16(name, 0, &portid))
 		return ERR_PTR(-EINVAL);
@@ -903,6 +959,20 @@ static struct config_group *nvmet_ports_make(struct config_group *group,
 	INIT_LIST_HEAD(&port->subsystems);
 	INIT_LIST_HEAD(&port->referrals);
 
+	if (!alloc_cpumask_var(&port->cpumask, GFP_KERNEL))
+		goto err_free_port;
+
+	port->nr_cpus = num_possible_cpus();
+
+	port->cpus = kcalloc(sizeof(int), port->nr_cpus, GFP_KERNEL);
+	if (!port->cpus)
+		goto err_free_cpumask;
+
+	for_each_possible_cpu(i) {
+		cpumask_set_cpu(i, port->cpumask);
+		port->cpus[i] = i;
+	}
+
 	port->disc_addr.portid = cpu_to_le16(portid);
 	config_group_init_type_name(&port->group, name, &nvmet_port_type);
 
@@ -915,6 +985,11 @@ static struct config_group *nvmet_ports_make(struct config_group *group,
 	configfs_add_default_group(&port->referrals_group, &port->group);
 
 	return &port->group;
+
+err_free_cpumask:
+	free_cpumask_var(port->cpumask);
+err_free_port:
+	return ERR_PTR(-ENOMEM);
 }
 
 static struct configfs_group_operations nvmet_ports_group_ops = {
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index e342f02845c1..6aaf86e1439e 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -98,6 +98,10 @@ struct nvmet_port {
 	struct list_head		referrals;
 	void				*priv;
 	bool				enabled;
+
+	int				nr_cpus;
+	cpumask_var_t			cpumask;
+	int				*cpus;
 };
 
 static inline struct nvmet_port *to_nvmet_port(struct config_item *item)
-- 
2.14.1


* [PATCH v3 9/9] nvmet-rdma: assign cq completion vector based on the port allowed cpus
  2017-11-08  9:57 ` Sagi Grimberg
@ 2017-11-08  9:57     ` Sagi Grimberg
  -1 siblings, 0 replies; 92+ messages in thread
From: Sagi Grimberg @ 2017-11-08  9:57 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Christoph Hellwig,
	Max Gurtuvoy

We take a cpu assignment from the port's configured cpulist
(spreading the queues uniformly across those cpus) and pass it to
the queue pair as an affinity hint.

Note that if the rdma device does not expose a vector affinity mask,
or the core couldn't find a match, it will fall back to the old
behavior as we don't have sufficient information to make the "correct"
vector assignment.

Signed-off-by: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
---
 drivers/nvme/target/rdma.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index d9cdfd2bd623..98d7f2ded511 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -892,7 +892,8 @@ static int nvmet_rdma_create_queue_ib(struct nvmet_rdma_queue *queue)
 {
 	struct ib_qp_init_attr qp_attr;
 	struct nvmet_rdma_device *ndev = queue->dev;
-	int ret, i;
+	struct nvmet_port *port = queue->port;
+	int ret, cpu, i;
 
 	memset(&qp_attr, 0, sizeof(qp_attr));
 	qp_attr.create_flags = IB_QP_CREATE_ASSIGN_CQS;
@@ -916,6 +917,14 @@ static int nvmet_rdma_create_queue_ib(struct nvmet_rdma_queue *queue)
 	else
 		qp_attr.cap.max_recv_sge = 2;
 
+	/*
+	 * Spread the io queues across port cpus,
+	 * but still keep all admin queues on cpu 0.
+	 */
+	cpu = !queue->host_qid ? 0 : port->cpus[queue->idx % port->nr_cpus];
+	qp_attr.affinity_hint = cpu;
+	qp_attr.create_flags |= IB_QP_CREATE_AFFINITY_HINT;
+
 	ret = rdma_create_qp(queue->cm_id, ndev->pd, &qp_attr);
 	if (ret) {
 		pr_err("failed to create_qp ret= %d\n", ret);
@@ -1052,6 +1061,7 @@ nvmet_rdma_alloc_queue(struct nvmet_rdma_device *ndev,
 	INIT_WORK(&queue->release_work, nvmet_rdma_release_queue_work);
 	queue->dev = ndev;
 	queue->cm_id = cm_id;
+	queue->port = cm_id->context;
 
 	spin_lock_init(&queue->state_lock);
 	queue->state = NVMET_RDMA_Q_CONNECTING;
@@ -1170,7 +1180,6 @@ static int nvmet_rdma_queue_connect(struct rdma_cm_id *cm_id,
 		ret = -ENOMEM;
 		goto put_device;
 	}
-	queue->port = cm_id->context;
 
 	if (queue->host_qid == 0) {
 		/* Let inflight controller teardown complete */
-- 
2.14.1
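
To see the spreading with concrete numbers, here is a small
stand-alone user-space sketch (not kernel code); the cpulist
{0, 2, 4, 6} is an arbitrary example of a port restricted to a subset
of cores:

#include <stdio.h>

/* Mimics the hint selection above: admin queue (qid 0) -> cpu 0,
 * I/O queue with index idx -> cpus[idx % nr_cpus]. */
static int queue_cpu(const int *cpus, int nr_cpus, int qid, int idx)
{
	return qid == 0 ? 0 : cpus[idx % nr_cpus];
}

int main(void)
{
	const int cpus[] = { 0, 2, 4, 6 };	/* e.g. addr_cpulist "0,2,4,6" */
	const int nr_cpus = 4;
	int idx;

	printf("admin queue -> cpu %d\n", queue_cpu(cpus, nr_cpus, 0, 0));
	for (idx = 0; idx < 8; idx++)
		printf("io queue %d -> cpu %d\n", idx,
		       queue_cpu(cpus, nr_cpus, 1, idx));
	return 0;
}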


^ permalink raw reply related	[flat|nested] 92+ messages in thread

* Re: [PATCH v3 3/9] IB/iser: use implicit CQ allocation
  2017-11-08  9:57     ` Sagi Grimberg
  (?)
@ 2017-11-08 10:25       ` Nicholas A. Bellinger
  -1 siblings, 0 replies; 92+ messages in thread
From: Nicholas A. Bellinger @ 2017-11-08 10:25 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-rdma, linux-nvme, Christoph Hellwig, Max Gurtuvoy, target-devel

On Wed, 2017-11-08 at 11:57 +0200, Sagi Grimberg wrote:
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> [hch: ported to the new API]
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  drivers/infiniband/ulp/iser/iscsi_iser.h | 19 --------
>  drivers/infiniband/ulp/iser/iser_verbs.c | 82 ++++----------------------------
>  2 files changed, 8 insertions(+), 93 deletions(-)
> 

A nice improvement to make IB_QP_CREATE_ASSIGN_CQS common across ULPs.

Reviewed-by: Nicholas Bellinger <nab@linux-iscsi.org>
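
For readers not following the ULP code, the conversion amounts to
dropping the per-connection CQ allocation and vector bookkeeping and
letting the core assign a CQ at QP creation time. A rough, generic
sketch of the new-style call site (cm_id, pd and the wr/sge sizes are
placeholders; create_flags, poll_ctx and affinity_hint are the fields
added in patch 1/9):

	struct ib_qp_init_attr attr = {};
	int ret;

	attr.qp_type          = IB_QPT_RC;
	attr.sq_sig_type      = IB_SIGNAL_REQ_WR;
	attr.cap.max_send_wr  = max_send_wr;
	attr.cap.max_recv_wr  = max_recv_wr;
	attr.cap.max_send_sge = 1;
	attr.cap.max_recv_sge = 1;

	/* No explicit send_cq/recv_cq: let the core pick one from the
	 * per-device pool, polled from softirq context. */
	attr.create_flags = IB_QP_CREATE_ASSIGN_CQS;
	attr.poll_ctx     = IB_POLL_SOFTIRQ;

	ret = rdma_create_qp(cm_id, pd, &attr);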

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v3 2/9] IB/isert: use implicit CQ allocation
  2017-11-08  9:57     ` Sagi Grimberg
  (?)
@ 2017-11-08 10:27       ` Nicholas A. Bellinger
  -1 siblings, 0 replies; 92+ messages in thread
From: Nicholas A. Bellinger @ 2017-11-08 10:27 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-rdma, linux-nvme, Christoph Hellwig, Max Gurtuvoy, target-devel

On Wed, 2017-11-08 at 11:57 +0200, Sagi Grimberg wrote:
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> [hch: ported to the new API]
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  drivers/infiniband/ulp/isert/ib_isert.c | 165 ++++----------------------------
>  drivers/infiniband/ulp/isert/ib_isert.h |  16 ----
>  2 files changed, 20 insertions(+), 161 deletions(-)
> 

Likewise.

Reviewed-by: Nicholas Bellinger <nab@linux-iscsi.org>

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v3 0/9] Introduce per-device completion queue pools
  2017-11-08  9:57 ` Sagi Grimberg
@ 2017-11-08 16:42     ` Chuck Lever
  -1 siblings, 0 replies; 92+ messages in thread
From: Chuck Lever @ 2017-11-08 16:42 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Christoph Hellwig,
	Max Gurtuvoy


> On Nov 8, 2017, at 4:57 AM, Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> wrote:
> 
> This is the third re-incarnation of the CQ pool patches proposed
> by Christoph and I.
> 
> Our ULPs often want to make smart decisions on completion vector
> affinitization when using multiple completion queues spread on
> multiple cpu cores. We can see examples for this in iser, srp, nvme-rdma.
> 
> This patch set attempts to move this smartness to the rdma core by
> introducing per-device CQ pools that by definition spread
> across cpu cores. In addition, we completely make the completion
> queue allocation transparent to the ULP by adding affinity hints
> to create_qp which tells the rdma core to select (or allocate)
> a completion queue that has the needed affinity for it.
> 
> This API gives us a similar approach to whats used in the networking
> stack where the device completion queues are hidden from the application.
> With the affinitization hints, we also do not compromise performance
> as the completion queue will be affinitized correctly.
> 
> One thing that should be noticed is that now different ULPs using this
> API may share completion queues (given that they use the same polling context).
> However, even without this API they share interrupt vectors (and CPUs
> that are assigned to them). Thus aggregating consumers on less completion
> queues will result in better overall completion processing efficiency per
> completion event (or interrupt).

Hi Sagi, glad to see progress on this!

When running on the same CPU, Send and Receive completions compete
for the same finite CPU resource. In addition, they compete with
soft IRQ tasks that are also pinned to that CPU, and any other
BOUND workqueue tasks that are running there.

Send and Receive completions often have significant work to do
(for example, DMA syncing or unmapping followed by some parsing
of the completion results) and are all serialized on ib_poll_wq or
by soft IRQ.

This limits IOPS, and restricts other users of that shared CQ.

I recognize that handling interrupts on the same core where they
fired is best, but some of this work has to be allowed to migrate
when this CPU core is already fully utilized. A lot of the RDMA
core and ULP workqueues are BOUND, which prevents task migration,
even in the upper layers.

I would like to see a capability of intelligently spreading the
CQ workload for a single QP onto more CPU cores.

As an example, I've found that ensuring that NFS/RDMA's Receive
and Send completions are handled on separate CPU cores results in
slightly higher IOPS (~5%) and lower latency jitter on one mount
point.

This is more critical now that our ULPs are handling more Send
completions.


> In addition, we introduce a configfs knob to our nvme-target to
> bound I/O threads to a given cpulist (can be a subset). This is
> useful for numa configurations where the backend device access is
> configured with care to numa affinity, and we want to restrict rdma
> device and I/O threads affinity accordingly.
> 
> The patch set convert iser, isert, srpt, svcrdma, nvme-rdma and
> nvmet-rdma to use the new API.

Is there a straightforward way to assess whether this work
improves scalability and performance when multiple ULPs share a
device?


> Comments and feedback is welcome.
> 
> Christoph Hellwig (1):
>  nvme-rdma: use implicit CQ allocation
> 
> Sagi Grimberg (8):
>  RDMA/core: Add implicit per-device completion queue pools
>  IB/isert: use implicit CQ allocation
>  IB/iser: use implicit CQ allocation
>  IB/srpt: use implicit CQ allocation
>  svcrdma: Use RDMA core implicit CQ allocation
>  nvmet-rdma: use implicit CQ allocation
>  nvmet: allow assignment of a cpulist for each nvmet port
>  nvmet-rdma: assign cq completion vector based on the port allowed cpus
> 
> drivers/infiniband/core/core_priv.h      |   6 +
> drivers/infiniband/core/cq.c             | 193 +++++++++++++++++++++++++++++++
> drivers/infiniband/core/device.c         |   4 +
> drivers/infiniband/core/verbs.c          |  69 ++++++++++-
> drivers/infiniband/ulp/iser/iscsi_iser.h |  19 ---
> drivers/infiniband/ulp/iser/iser_verbs.c |  82 ++-----------
> drivers/infiniband/ulp/isert/ib_isert.c  | 165 ++++----------------------
> drivers/infiniband/ulp/isert/ib_isert.h  |  16 ---
> drivers/infiniband/ulp/srpt/ib_srpt.c    |  46 +++-----
> drivers/infiniband/ulp/srpt/ib_srpt.h    |   1 -
> drivers/nvme/host/rdma.c                 |  62 +++++-----
> drivers/nvme/target/configfs.c           |  75 ++++++++++++
> drivers/nvme/target/nvmet.h              |   4 +
> drivers/nvme/target/rdma.c               |  71 +++++-------
> include/linux/sunrpc/svc_rdma.h          |   2 -
> include/rdma/ib_verbs.h                  |  31 ++++-
> net/sunrpc/xprtrdma/svc_rdma_transport.c |  22 +---
> 17 files changed, 468 insertions(+), 400 deletions(-)
> 
> -- 
> 2.14.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Chuck Lever
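
As a point of comparison, the split Chuck describes (Send and Receive
completions handled on different cores) is something a ULP can do
today with explicit CQs; a simplified sketch, where dev, ep, depth and
vec are placeholders and error handling is omitted:

	struct ib_cq *send_cq, *recv_cq;
	int nr_vecs = dev->num_comp_vectors;

	/* Put Send and Receive completions on different completion
	 * vectors, and hence typically on different cores. */
	send_cq = ib_alloc_cq(dev, ep, depth, vec % nr_vecs,
			      IB_POLL_WORKQUEUE);
	recv_cq = ib_alloc_cq(dev, ep, depth, (vec + 1) % nr_vecs,
			      IB_POLL_WORKQUEUE);

	qp_attr.send_cq = send_cq;
	qp_attr.recv_cq = recv_cq;

With the pool API as posted, send_cq and recv_cq come from the same
pool CQ, so expressing this kind of split would need an additional
hint.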




^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v3 1/9] RDMA/core: Add implicit per-device completion queue pools
  2017-11-08  9:57     ` Sagi Grimberg
@ 2017-11-09 10:45         ` Max Gurtovoy
  -1 siblings, 0 replies; 92+ messages in thread
From: Max Gurtovoy @ 2017-11-09 10:45 UTC (permalink / raw)
  To: Sagi Grimberg, linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Christoph Hellwig



On 11/8/2017 11:57 AM, Sagi Grimberg wrote:
> Allow a ULP to ask the core to implicitly assign a completion
> queue to a queue-pair based on a least-used search over the per-device
> CQ pools. The device CQ pools grow in a lazy fashion with every
> QP creation.
> 
> In addition, expose an affinity hint for queue pair creation.
> If passed, the core will attempt to attach a CQ whose completion
> vector is directed to the cpu core given as the affinity hint.
> 
> Signed-off-by: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
> ---
>   drivers/infiniband/core/core_priv.h |   6 ++
>   drivers/infiniband/core/cq.c        | 193 ++++++++++++++++++++++++++++++++++++
>   drivers/infiniband/core/device.c    |   4 +
>   drivers/infiniband/core/verbs.c     |  69 +++++++++++--
>   include/rdma/ib_verbs.h             |  31 ++++--
>   5 files changed, 291 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/infiniband/core/core_priv.h b/drivers/infiniband/core/core_priv.h
> index a1d687a664f8..4f6cd4cf5116 100644
> --- a/drivers/infiniband/core/core_priv.h
> +++ b/drivers/infiniband/core/core_priv.h
> @@ -179,6 +179,12 @@ static inline bool rdma_is_upper_dev_rcu(struct net_device *dev,
>   	return netdev_has_upper_dev_all_rcu(dev, upper);
>   }
>   
> +void ib_init_cq_pools(struct ib_device *dev);
> +void ib_purge_cq_pools(struct ib_device *dev);
> +struct ib_cq *ib_find_get_cq(struct ib_device *dev, unsigned int nr_cqe,
> +		enum ib_poll_context poll_ctx, int affinity_hint);
> +void ib_put_cq(struct ib_cq *cq, unsigned int nr_cqe);
> +
>   int addr_init(void);
>   void addr_cleanup(void);
>   
> diff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c
> index f2ae75fa3128..8b9f9be5386b 100644
> --- a/drivers/infiniband/core/cq.c
> +++ b/drivers/infiniband/core/cq.c
> @@ -15,6 +15,9 @@
>   #include <linux/slab.h>
>   #include <rdma/ib_verbs.h>
>   
> +/* XXX: wild guess - should not be too large or too small to avoid wastage */
> +#define IB_CQE_BATCH			1024
> +
>   /* # of WCs to poll for with a single call to ib_poll_cq */
>   #define IB_POLL_BATCH			16
>   
> @@ -149,6 +152,8 @@ struct ib_cq *ib_alloc_cq(struct ib_device *dev, void *private,
>   	cq->cq_context = private;
>   	cq->poll_ctx = poll_ctx;
>   	atomic_set(&cq->usecnt, 0);
> +	cq->cqe_used = 0;
> +	cq->comp_vector = comp_vector;
>   
>   	cq->wc = kmalloc_array(IB_POLL_BATCH, sizeof(*cq->wc), GFP_KERNEL);
>   	if (!cq->wc)
> @@ -194,6 +199,8 @@ void ib_free_cq(struct ib_cq *cq)
>   
>   	if (WARN_ON_ONCE(atomic_read(&cq->usecnt)))
>   		return;
> +	if (WARN_ON_ONCE(cq->cqe_used != 0))
> +		return;
>   
>   	switch (cq->poll_ctx) {
>   	case IB_POLL_DIRECT:
> @@ -213,3 +220,189 @@ void ib_free_cq(struct ib_cq *cq)
>   	WARN_ON_ONCE(ret);
>   }
>   EXPORT_SYMBOL(ib_free_cq);
> +
> +void ib_init_cq_pools(struct ib_device *dev)
> +{
> +	int i;
> +
> +	spin_lock_init(&dev->cq_lock);
> +	for (i = 0; i < ARRAY_SIZE(dev->cq_pools); i++)
> +		INIT_LIST_HEAD(&dev->cq_pools[i]);
> +}
> +
> +void ib_purge_cq_pools(struct ib_device *dev)
> +{
> +	struct ib_cq *cq, *n;
> +	LIST_HEAD(tmp_list);
> +	int i;
> +
> +	for (i = 0; i < ARRAY_SIZE(dev->cq_pools); i++) {
> +		unsigned long flags;
> +
> +		spin_lock_irqsave(&dev->cq_lock, flags);
> +		list_splice_init(&dev->cq_pools[i], &tmp_list);
> +		spin_unlock_irqrestore(&dev->cq_lock, flags);
> +	}
> +
> +	list_for_each_entry_safe(cq, n, &tmp_list, pool_entry)
> +		ib_free_cq(cq);
> +}
> +
> +/**
> + * ib_find_vector_affinity() - Find the first completion vector mapped to a given
> + *     cpu core affinity
> + * @device:            rdma device
> + * @cpu:               cpu for the corresponding completion vector affinity
> + * @vector:            output target completion vector
> + *
> + * If the device exposes vector affinity we will search each of the vectors
> + * and if we find one that gives us the desired cpu core we return true
> + * and assign @vector to the corresponding completion vector. Otherwise
> + * we return false. We stop at the first appropriate completion vector
> + * we find as we don't have any preference for multiple vectors with the
> + * same affinity.
> + */
> +static bool ib_find_vector_affinity(struct ib_device *device, int cpu,
> +		unsigned int *vector)
> +{
> +	bool found = false;
> +	unsigned int c;
> +	int vec;
> +
> +	if (cpu == -1)
> +		goto out;
> +
> +	for (vec = 0; vec < device->num_comp_vectors; vec++) {
> +		const struct cpumask *mask;
> +
> +		mask = ib_get_vector_affinity(device, vec);
> +		if (!mask)
> +			goto out;
> +
> +		for_each_cpu(c, mask) {
> +			if (c == cpu) {
> +				*vector = vec;
> +				found = true;
> +				goto out;
> +			}
> +		}
> +	}
> +
> +out:
> +	return found;
> +}
> +
> +static int ib_alloc_cqs(struct ib_device *dev, int nr_cqes,
> +		enum ib_poll_context poll_ctx)
> +{
> +	LIST_HEAD(tmp_list);
> +	struct ib_cq *cq;
> +	unsigned long flags;
> +	int nr_cqs, ret, i;
> +
> +	/*
> +	 * Allocate at least as many CQEs as requested, and otherwise
> +	 * a reasonable batch size so that we can share CQs between
> +	 * multiple users instead of allocating a larger number of CQs.
> +	 */
> +	nr_cqes = max(nr_cqes, min(dev->attrs.max_cqe, IB_CQE_BATCH));

did you mean min() ?

> +	nr_cqs = min_t(int, dev->num_comp_vectors, num_possible_cpus());
> +	for (i = 0; i < nr_cqs; i++) {
> +		cq = ib_alloc_cq(dev, NULL, nr_cqes, i, poll_ctx);
> +		if (IS_ERR(cq)) {
> +			ret = PTR_ERR(cq);
> +			pr_err("%s: failed to create CQ ret=%d\n",
> +				__func__, ret);
> +			goto out_free_cqs;
> +		}
> +		list_add_tail(&cq->pool_entry, &tmp_list);
> +	}
> +
> +	spin_lock_irqsave(&dev->cq_lock, flags);
> +	list_splice(&tmp_list, &dev->cq_pools[poll_ctx]);
> +	spin_unlock_irqrestore(&dev->cq_lock, flags);
> +
> +	return 0;
> +
> +out_free_cqs:
> +	list_for_each_entry(cq, &tmp_list, pool_entry)
> +		ib_free_cq(cq);
> +	return ret;
> +}
> +
> +/*
> + * ib_find_get_cq() - Find the least used completion queue that matches
> + *     a given affinity hint (or least used for wild card affinity)
> + *     and fits nr_cqe
> + * @dev:              rdma device
> + * @nr_cqe:           number of needed cqe entries
> + * @poll_ctx:         cq polling context
> + * @affinity_hint:    affinity hint (-1) for wild-card assignment
> + *
> + * Finds a cq that satisfies @affinity_hint and @nr_cqe requirements and claim
> + * entries in it for us. In case there is no available cq, allocate a new cq
> + * with the requirements and add it to the device pool.
> + */
> +struct ib_cq *ib_find_get_cq(struct ib_device *dev, unsigned int nr_cqe,
> +		enum ib_poll_context poll_ctx, int affinity_hint)
> +{
> +	struct ib_cq *cq, *found;
> +	unsigned long flags;
> +	int vector, ret;
> +
> +	if (poll_ctx >= ARRAY_SIZE(dev->cq_pools))
> +		return ERR_PTR(-EINVAL);
> +
> +	if (!ib_find_vector_affinity(dev, affinity_hint, &vector)) {
> +		/*
> +		 * Couldn't find matching vector affinity so project
> +		 * the affinity to the device completion vector range
> +		 */
> +		vector = affinity_hint % dev->num_comp_vectors;
> +	}
> +
> +restart:
> +	/*
> +	 * Find the least used CQ with correct affinity and
> +	 * enough free cq entries
> +	 */
> +	found = NULL;
> +	spin_lock_irqsave(&dev->cq_lock, flags);
> +	list_for_each_entry(cq, &dev->cq_pools[poll_ctx], pool_entry) {
> +		if (vector != -1 && vector != cq->comp_vector)

how can vector be -1 ?

> +			continue;
> +		if (cq->cqe_used + nr_cqe > cq->cqe)
> +			continue;
> +		if (found && cq->cqe_used >= found->cqe_used)
> +			continue;
> +		found = cq;
> +	}
> +
> +	if (found) {
> +		found->cqe_used += nr_cqe;
> +		spin_unlock_irqrestore(&dev->cq_lock, flags);
> +		return found;
> +	}
> +	spin_unlock_irqrestore(&dev->cq_lock, flags);
> +
> +	/*
> +	 * Didn't find a match or ran out of CQs in the
> +	 * device pool, allocate a new array of CQs.
> +	 */
> +	ret = ib_alloc_cqs(dev, nr_cqe, poll_ctx);
> +	if (ret)
> +		return ERR_PTR(ret);
> +
> +	/* Now search again */
> +	goto restart;
> +}
> +
> +void ib_put_cq(struct ib_cq *cq, unsigned int nr_cqe)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&cq->device->cq_lock, flags);
> +	cq->cqe_used -= nr_cqe;
> +	WARN_ON_ONCE(cq->cqe_used < 0);
> +	spin_unlock_irqrestore(&cq->device->cq_lock, flags);
> +}
> diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
> index 84fc32a2c8b3..c828845c46d8 100644
> --- a/drivers/infiniband/core/device.c
> +++ b/drivers/infiniband/core/device.c
> @@ -468,6 +468,8 @@ int ib_register_device(struct ib_device *device,
>   		device->dma_device = parent;
>   	}
>   
> +	ib_init_cq_pools(device);
> +
>   	mutex_lock(&device_mutex);
>   
>   	if (strchr(device->name, '%')) {
> @@ -590,6 +592,8 @@ void ib_unregister_device(struct ib_device *device)
>   	up_write(&lists_rwsem);
>   
>   	device->reg_state = IB_DEV_UNREGISTERED;
> +
> +	ib_purge_cq_pools(device);
>   }
>   EXPORT_SYMBOL(ib_unregister_device);
>   
> diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
> index de57d6c11a25..fcc9ecba6741 100644
> --- a/drivers/infiniband/core/verbs.c
> +++ b/drivers/infiniband/core/verbs.c
> @@ -793,14 +793,16 @@ struct ib_qp *ib_create_qp(struct ib_pd *pd,
>   			   struct ib_qp_init_attr *qp_init_attr)
>   {
>   	struct ib_device *device = pd ? pd->device : qp_init_attr->xrcd->device;
> +	struct ib_cq *cq = NULL;
>   	struct ib_qp *qp;
> -	int ret;
> +	u32 nr_cqes = 0;
> +	int ret = -EINVAL;
>   
>   	if (qp_init_attr->rwq_ind_tbl &&
>   	    (qp_init_attr->recv_cq ||
>   	    qp_init_attr->srq || qp_init_attr->cap.max_recv_wr ||
>   	    qp_init_attr->cap.max_recv_sge))
> -		return ERR_PTR(-EINVAL);
> +		goto out;
>   
>   	/*
>   	 * If the callers is using the RDMA API calculate the resources
> @@ -811,9 +813,51 @@ struct ib_qp *ib_create_qp(struct ib_pd *pd,
>   	if (qp_init_attr->cap.max_rdma_ctxs)
>   		rdma_rw_init_qp(device, qp_init_attr);
>   
> +	if (qp_init_attr->create_flags & IB_QP_CREATE_ASSIGN_CQS) {
> +		int affinity = -1;
> +
> +		if (WARN_ON(qp_init_attr->recv_cq))
> +			goto out;
> +		if (WARN_ON(qp_init_attr->send_cq))
> +			goto out;
> +
> +		if (qp_init_attr->create_flags & IB_QP_CREATE_AFFINITY_HINT)
> +			affinity = qp_init_attr->affinity_hint;
> +
> +		nr_cqes = qp_init_attr->cap.max_recv_wr +
> +			  qp_init_attr->cap.max_send_wr;
> +		if (nr_cqes) {

what will happen if nr_cqes == 0 in that case ?

> +			cq = ib_find_get_cq(device, nr_cqes,
> +					    qp_init_attr->poll_ctx, affinity);
> +			if (IS_ERR(cq)) {
> +				ret = PTR_ERR(cq);
> +				goto out;
> +			}
> +
> +			if (qp_init_attr->cap.max_send_wr)
> +				qp_init_attr->send_cq = cq;
> +
> +			if (qp_init_attr->cap.max_recv_wr) {
> +				qp_init_attr->recv_cq = cq;
> +
> +				/*
> +				 * Low-level drivers expect max_recv_wr == 0
> +				 * for the SRQ case:
> +				 */
> +				if (qp_init_attr->srq)
> +					qp_init_attr->cap.max_recv_wr = 0;
> +			}
> +		}
> +
> +		qp_init_attr->create_flags &=
> +			~(IB_QP_CREATE_ASSIGN_CQS | IB_QP_CREATE_AFFINITY_HINT);
> +	}
> +
>   	qp = device->create_qp(pd, qp_init_attr, NULL);
> -	if (IS_ERR(qp))
> -		return qp;
> +	if (IS_ERR(qp)) {
> +		ret = PTR_ERR(qp);
> +		goto out_put_cq;
> +	}
>   
>   	ret = ib_create_qp_security(qp, device);
>   	if (ret) {
> @@ -826,6 +870,7 @@ struct ib_qp *ib_create_qp(struct ib_pd *pd,
>   	qp->uobject    = NULL;
>   	qp->qp_type    = qp_init_attr->qp_type;
>   	qp->rwq_ind_tbl = qp_init_attr->rwq_ind_tbl;
> +	qp->nr_cqes    = nr_cqes;
>   
>   	atomic_set(&qp->usecnt, 0);
>   	qp->mrs_used = 0;
> @@ -865,8 +910,7 @@ struct ib_qp *ib_create_qp(struct ib_pd *pd,
>   		ret = rdma_rw_init_mrs(qp, qp_init_attr);
>   		if (ret) {
>   			pr_err("failed to init MR pool ret= %d\n", ret);
> -			ib_destroy_qp(qp);
> -			return ERR_PTR(ret);
> +			goto out_destroy_qp;
>   		}
>   	}
>   
> @@ -880,6 +924,14 @@ struct ib_qp *ib_create_qp(struct ib_pd *pd,
>   				 device->attrs.max_sge_rd);
>   
>   	return qp;
> +
> +out_destroy_qp:
> +	ib_destroy_qp(qp);
> +out_put_cq:
> +	if (cq)
> +		ib_put_cq(cq, nr_cqes);
> +out:
> +	return ERR_PTR(ret);
>   }
>   EXPORT_SYMBOL(ib_create_qp);
>   
> @@ -1478,6 +1530,11 @@ int ib_destroy_qp(struct ib_qp *qp)
>   			atomic_dec(&ind_tbl->usecnt);
>   		if (sec)
>   			ib_destroy_qp_security_end(sec);
> +
> +		if (qp->nr_cqes) {
> +			WARN_ON_ONCE(rcq && rcq != scq);
> +			ib_put_cq(scq, qp->nr_cqes);
> +		}
>   	} else {
>   		if (sec)
>   			ib_destroy_qp_security_abort(sec);
> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> index bdb1279a415b..56d42e753eb4 100644
> --- a/include/rdma/ib_verbs.h
> +++ b/include/rdma/ib_verbs.h
> @@ -1098,11 +1098,22 @@ enum ib_qp_create_flags {
>   	IB_QP_CREATE_SCATTER_FCS		= 1 << 8,
>   	IB_QP_CREATE_CVLAN_STRIPPING		= 1 << 9,
>   	IB_QP_CREATE_SOURCE_QPN			= 1 << 10,
> +
> +	/* only used by the core, not passed to low-level drivers */
> +	IB_QP_CREATE_ASSIGN_CQS			= 1 << 24,
> +	IB_QP_CREATE_AFFINITY_HINT		= 1 << 25,
> +
>   	/* reserve bits 26-31 for low level drivers' internal use */
>   	IB_QP_CREATE_RESERVED_START		= 1 << 26,
>   	IB_QP_CREATE_RESERVED_END		= 1 << 31,
>   };
>   
> +enum ib_poll_context {
> +	IB_POLL_SOFTIRQ,	/* poll from softirq context */
> +	IB_POLL_WORKQUEUE,	/* poll from workqueue */
> +	IB_POLL_DIRECT,		/* caller context, no hw completions */
> +};
> +
>   /*
>    * Note: users may not call ib_close_qp or ib_destroy_qp from the event_handler
>    * callback to destroy the passed in QP.
> @@ -1124,6 +1135,13 @@ struct ib_qp_init_attr {
>   	 * Only needed for special QP types, or when using the RW API.
>   	 */
>   	u8			port_num;
> +
> +	/*
> +	 * Only needed when not passing in explicit CQs.
> +	 */
> +	enum ib_poll_context	poll_ctx;
> +	int			affinity_hint;
> +
>   	struct ib_rwq_ind_table *rwq_ind_tbl;
>   	u32			source_qpn;
>   };
> @@ -1536,12 +1554,6 @@ struct ib_ah {
>   
>   typedef void (*ib_comp_handler)(struct ib_cq *cq, void *cq_context);
>   
> -enum ib_poll_context {
> -	IB_POLL_DIRECT,		/* caller context, no hw completions */
> -	IB_POLL_SOFTIRQ,	/* poll from softirq context */
> -	IB_POLL_WORKQUEUE,	/* poll from workqueue */
> -};
> -
>   struct ib_cq {
>   	struct ib_device       *device;
>   	struct ib_uobject      *uobject;
> @@ -1549,9 +1561,12 @@ struct ib_cq {
>   	void                  (*event_handler)(struct ib_event *, void *);
>   	void                   *cq_context;
>   	int               	cqe;
> +	unsigned int		cqe_used;
>   	atomic_t          	usecnt; /* count number of work queues */
>   	enum ib_poll_context	poll_ctx;
> +	int			comp_vector;
>   	struct ib_wc		*wc;
> +	struct list_head	pool_entry;
>   	union {
>   		struct irq_poll		iop;
>   		struct work_struct	work;
> @@ -1731,6 +1746,7 @@ struct ib_qp {
>   	struct ib_rwq_ind_table *rwq_ind_tbl;
>   	struct ib_qp_security  *qp_sec;
>   	u8			port;
> +	u32			nr_cqes;
>   };
>   
>   struct ib_mr {
> @@ -2338,6 +2354,9 @@ struct ib_device {
>   
>   	u32                          index;
>   
> +	spinlock_t		     cq_lock;

maybe it can be called cq_pools_lock (cq_lock is too generic) ?

> +	struct list_head	     cq_pools[IB_POLL_WORKQUEUE + 1];

maybe it's better to add and use IB_POLL_LAST ?

> +
>   	/**
>   	 * The following mandatory functions are used only at device
>   	 * registration.  Keep functions such as these at the end of this
> 
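
To make the selection policy in ib_find_get_cq() quoted above easier
to follow, here is a toy user-space model of the "least used CQ with a
matching vector and enough room" search; locking and the
allocate-and-retry path are intentionally left out:

#include <stddef.h>

struct toy_cq {
	int comp_vector;
	unsigned int cqe;	/* capacity */
	unsigned int cqe_used;	/* entries already claimed */
};

/* Return the least-used CQ on 'vector' (-1 for wild card) that still
 * has room for nr_cqe, or NULL if a new batch would be allocated. */
static struct toy_cq *find_cq(struct toy_cq *pool, size_t n,
			      int vector, unsigned int nr_cqe)
{
	struct toy_cq *found = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		struct toy_cq *cq = &pool[i];

		if (vector != -1 && vector != cq->comp_vector)
			continue;
		if (cq->cqe_used + nr_cqe > cq->cqe)
			continue;
		if (found && cq->cqe_used >= found->cqe_used)
			continue;
		found = cq;
	}
	if (found)
		found->cqe_used += nr_cqe;
	return found;
}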

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v3 0/9] Introduce per-device completion queue pools
  2017-11-08  9:57 ` Sagi Grimberg
@ 2017-11-09 16:42     ` Bart Van Assche
  -1 siblings, 0 replies; 92+ messages in thread
From: Bart Van Assche @ 2017-11-09 16:42 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, sagi-NQWnxTmZq1alnMjI0IkVqw
  Cc: hch-jcswGhMUV9g, maxg-VPRAkNaXOzVWk0Htik3J/w,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Wed, 2017-11-08 at 11:57 +0200, Sagi Grimberg wrote:
> This is the third re-incarnation of the CQ pool patches proposed
> by Christoph and I.

Hello Sagi,

This work looks interesting to me and I think it is a good idea to introduce
a CQ pool implementation in the RDMA core. However, I have a concern about
the approach. This patch series associates a single CQ pool with each RDMA
device. Wouldn't it be better to let CQ pool users choose the CQ pool size and
to let these users manage the CQ pool lifetime instead of binding the
lifetime of a CQ pool to that of an RDMA device? RDMA drivers are loaded
during system startup. I think allocation of memory for CQ pools should be
deferred until the ULP protocol driver(s) are loaded to avoid allocating
memory for CQs while these are not in use. Additionally, on many setups each
RDMA port only runs a single ULP. I think that's another argument to let the
ULP allocate CQ pool(s) instead of having one such pool per HCA.

Bart.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v3 0/9] Introduce per-device completion queue pools
  2017-11-08 16:42     ` Chuck Lever
@ 2017-11-09 17:06         ` Sagi Grimberg
  -1 siblings, 0 replies; 92+ messages in thread
From: Sagi Grimberg @ 2017-11-09 17:06 UTC (permalink / raw)
  To: Chuck Lever
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Christoph Hellwig,
	Max Gurtuvoy

> Hi Sagi, glad to see progress on this!

Hi Chuck,

> When running on the same CPU, Send and Receive completions compete
> for the same finite CPU resource. In addition, they compete with
> soft IRQ tasks that are also pinned to that CPU, and any other
> BOUND workqueue tasks that are running there.

That's true.

> Send and Receive completions often have significant work to do
> (for example, DMA syncing or unmapping followed by some parsing
> of the completion results) and are all serialized on ib_poll_wq or
> by soft IRQ.

Yes, that's correct.

> This limits IOPS, and restricts other users of that shared CQ.

I agree that's true from a single-queue perspective. When multiple queues
are used, keeping each queue's context on its own cpu core is usually
the best approach to achieve linear scalability, otherwise we pay
more for context switches, cacheline bounces, resource contention, etc.

> I recognize that handling interrupts on the same core where they
> fired is best, but some of this work has to be allowed to migrate
> when this CPU core is already fully utilized. A lot of the RDMA
> core and ULP workqueues are BOUND, which prevents task migration,
> even in the upper layers.

So ib_comp_wq started out as an UNBOUND workqueue, but the fact
that unbound workqueue workers are not cpu bound did not fit well
with the cpu/numa locality used with high-end storage devices and was a
source of latency.

See:
--
commit b7363e67b23e04c23c2a99437feefac7292a88bc
Author: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
Date:   Wed Mar 8 22:03:17 2017 +0200

     IB/device: Convert ib-comp-wq to be CPU-bound

     This workqueue is used by our storage target mode ULPs
     via the new CQ API. Recent observations when working
     with very high-end flash storage devices reveal that
     UNBOUND workqueue threads can migrate between cpu cores
     and even numa nodes (although some numa locality is accounted
     for).

     While this attribute can be useful in some workloads,
     it does not fit in very nicely with the normal
     run-to-completion model we usually use in our target-mode
     ULPs and the block-mq irq<->cpu affinity facilities.

     The whole block-mq concept is that the completion will
     land on the same cpu where the submission was performed.
     The fact that our submitter thread is migrating cpus
     can break this locality.

     We assume that as a target mode ULP, we will serve multiple
     initiators/clients and we can spread the load enough without
     having to use unbound kworkers.

     Also, while we're at it, expose this workqueue via sysfs which
     is harmless and can be useful for debug.
--

The rationale is that storage targets (or file servers) usually serve
multiple clients, and the spreading across cpu cores for more efficient
utilization would come from spreading the completion vectors.

However if this is not the case, then by all means we need a knob for
it (maybe have two ib completion workqueues and ULP will choose).

> I would like to see a capability of intelligently spreading the
> CQ workload for a single QP onto more CPU cores.

That is a different use case than what I was trying to achieve.
ULP consumers such as nvme-rdma (or srp and the like) will use multiple
qp-cq pairs (usually even per-core), and for that use-case cpu locality
is probably a better approach to take imo.

How likely is it that multiple NFS mount-points will be used on a single
server? Is that something you are looking to optimize? Or is
a single (or few) mount-points per server the common use-case?
If it's the latter, then I fully agree with you, and we should
come up with a core api for it (probably rds or smc will want it
too).

> As an example, I've found that ensuring that NFS/RDMA's Receive
> and Send completions are handled on separate CPU cores results in
> slightly higher IOPS (~5%) and lower latency jitter on one mount
> point.

That is valuable information. I do agree that what you are proposing
is useful. I'll need some time to think on that.

> This is more critical now that our ULPs are handling more Send
> completions.

We still need to fix some more...

>> In addition, we introduce a configfs knob to our nvme-target to
>> bound I/O threads to a given cpulist (can be a subset). This is
>> useful for numa configurations where the backend device access is
>> configured with care to numa affinity, and we want to restrict rdma
>> device and I/O threads affinity accordingly.
>>
>> The patch set convert iser, isert, srpt, svcrdma, nvme-rdma and
>> nvmet-rdma to use the new API.
> 
> Is there a straightforward way to assess whether this work
> improves scalability and performance when multiple ULPs share a
> device?

I guess the only way is running multiple ULPs in parallel? I tried
running iser+nvme-rdma in parallel, but my poor 2 VMs are not the best
performance platform to evaluate this on...

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v3 0/9] Introduce per-device completion queue pools
  2017-11-09 16:42     ` Bart Van Assche
@ 2017-11-09 17:22         ` Sagi Grimberg
  -1 siblings, 0 replies; 92+ messages in thread
From: Sagi Grimberg @ 2017-11-09 17:22 UTC (permalink / raw)
  To: Bart Van Assche, linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: hch-jcswGhMUV9g, maxg-VPRAkNaXOzVWk0Htik3J/w,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r


>> This is the third re-incarnation of the CQ pool patches proposed
>> by Christoph and I.
> 
> Hello Sagi,

Hi Bart,

> This work looks interesting to me and I think it is a good idea to introduce
> a CQ pool implementation in the RDMA core. However, I have a concern about
> the approach. This patch series associates a single CQ pool with each RDMA
> device. Wouldn't it be better to let CQ pool users choose the CQ pool size and
> to let these users manage the CQ pool lifetime instead of binding the
> lifetime of a CQ pool to that of an RDMA device?

I think the first approach I started from was introducing a CQ pool
entity that ULPs would manage. Christoph then took the idea further and
suggested we move all the cq assignment "smarts" to the rdma
core...

> RDMA drivers are loaded
> during system startup. I think allocation of memory for CQ pools should be
> deferred until the ULP protocol driver(s) are loaded to avoid allocating
> memory for CQs while these are not in use.

I completely agree with you. The pool implementation uses lazy
allocation. Every create_qp with the IB_QP_CREATE_ASSIGN_CQS flag will
search for a cq based on least-used selection, or for a specific cq if
IB_QP_CREATE_AFFINITY_HINT is also passed.

If no candidate cq is found, the pool expands with more CQs (allocated
in per-cpu chunks). When the device is removed, the pool is freed (at
that point all the ULPs have already freed their queue-pairs in their
DEVICE_REMOVAL event handlers).
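
From the ULP side it then boils down to something like this (rough
sketch against the fields this series adds; the queue_size, comp_vector,
pd and event handler names are made up):

	struct ib_qp_init_attr init_attr = {};
	struct ib_qp *qp;

	init_attr.event_handler = my_qp_event_handler;	/* ULP callback */
	init_attr.qp_type = IB_QPT_RC;
	init_attr.sq_sig_type = IB_SIGNAL_REQ_WR;
	init_attr.cap.max_send_wr = queue_size;
	init_attr.cap.max_recv_wr = queue_size;
	init_attr.cap.max_send_sge = 1;
	init_attr.cap.max_recv_sge = 1;

	/* no send_cq/recv_cq: let the core pick (or lazily grow) a pooled CQ */
	init_attr.create_flags = IB_QP_CREATE_ASSIGN_CQS |
				 IB_QP_CREATE_AFFINITY_HINT;
	init_attr.poll_ctx = IB_POLL_SOFTIRQ;
	init_attr.affinity_hint = comp_vector;	/* or -1 for "least used" */

	qp = ib_create_qp(pd, &init_attr);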

However, there is a catch: the CQ size is chosen arbitrarily at the
moment, and given that different ULPs will use different cq sizes, this
approach might leave a few cq entries unused. That is indeed
a down-side compared to explicit cq pools owned by the ULP itself.

> Additionally, on many setups each
> RDMA port only runs a single ULP. I think that's another argument to let the
> ULP allocate CQ pool(s) instead of having one such pool per HCA.

I think that most modern HCAs are exposing device-per-port and will
probably continue to do so, but yes, mlx4 is a dual-ported HCA.

But I'm afraid I don't understand how the fact that ULPs will run on
different ports matters. How would having two different
pools on different ports make a difference?

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v3 1/9] RDMA/core: Add implicit per-device completion queue pools
  2017-11-09 10:45         ` Max Gurtovoy
@ 2017-11-09 17:31             ` Sagi Grimberg
  -1 siblings, 0 replies; 92+ messages in thread
From: Sagi Grimberg @ 2017-11-09 17:31 UTC (permalink / raw)
  To: Max Gurtovoy, linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Christoph Hellwig



>> +static int ib_alloc_cqs(struct ib_device *dev, int nr_cqes,
>> +        enum ib_poll_context poll_ctx)
>> +{
>> +    LIST_HEAD(tmp_list);
>> +    struct ib_cq *cq;
>> +    unsigned long flags;
>> +    int nr_cqs, ret, i;
>> +
>> +    /*
>> +     * Allocated at least as many CQEs as requested, and otherwise
>> +     * a reasonable batch size so that we can share CQs between
>> +     * multiple users instead of allocating a larger number of CQs.
>> +     */
>> +    nr_cqes = max(nr_cqes, min(dev->attrs.max_cqe, IB_CQE_BATCH));
> 
> did you mean min() ?

No, I meant max. If we choose the CQ size, we choose the min between the
default and the device capability; if the user chooses, we rely on it
asking for no more than the device capability (and if not, the allocation
will fail, as it should).

>> +restart:
>> +    /*
>> +     * Find the least used CQ with correct affinity and
>> +     * enough free cq entries
>> +     */
>> +    found = NULL;
>> +    spin_lock_irqsave(&dev->cq_lock, flags);
>> +    list_for_each_entry(cq, &dev->cq_pools[poll_ctx], pool_entry) {
>> +        if (vector != -1 && vector != cq->comp_vector)
> 
> how can vector be -1 ?

-1 is a wild-card affinity hint value that chooses the least used cq
(see ib_create_qp).
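
So the selection ends up looking roughly like this (the exact
bookkeeping below is my sketch, based on the cqe_used/comp_vector
fields this patch adds):

	list_for_each_entry(cq, &dev->cq_pools[poll_ctx], pool_entry) {
		/* vector == -1 matches any completion vector */
		if (vector != -1 && vector != cq->comp_vector)
			continue;
		/* skip CQs without room for this QP's work requests */
		if (cq->cqe_used + nr_cqes > cq->cqe)
			continue;
		/* remember the least loaded candidate */
		if (!found || cq->cqe_used < found->cqe_used)
			found = cq;
	}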

>> @@ -811,9 +813,51 @@ struct ib_qp *ib_create_qp(struct ib_pd *pd,
>>       if (qp_init_attr->cap.max_rdma_ctxs)
>>           rdma_rw_init_qp(device, qp_init_attr);
>> +    if (qp_init_attr->create_flags & IB_QP_CREATE_ASSIGN_CQS) {
>> +        int affinity = -1;
>> +
>> +        if (WARN_ON(qp_init_attr->recv_cq))
>> +            goto out;
>> +        if (WARN_ON(qp_init_attr->send_cq))
>> +            goto out;
>> +
>> +        if (qp_init_attr->create_flags & IB_QP_CREATE_AFFINITY_HINT)
>> +            affinity = qp_init_attr->affinity_hint;
>> +
>> +        nr_cqes = qp_init_attr->cap.max_recv_wr +
>> +              qp_init_attr->cap.max_send_wr;
>> +        if (nr_cqes) {
> 
> what will happen if nr_cqes == 0 in that case ?

The same thing that would happen without this code, I think. This is
creating a qp without the ability to post send and/or receive work requests.

>> @@ -2338,6 +2354,9 @@ struct ib_device {
>>       u32                          index;
>> +    spinlock_t             cq_lock;
> 
> maybe can be called cq_pools_lock (cq_lock is general) ?

I can change that.

>> +    struct list_head         cq_pools[IB_POLL_WORKQUEUE + 1];
> 
> maybe it's better to add and use IB_POLL_LAST ?

Yea, I can change that.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v3 0/9] Introduce per-device completion queue pools
  2017-11-09 17:22         ` Sagi Grimberg
@ 2017-11-09 17:31             ` Bart Van Assche
  -1 siblings, 0 replies; 92+ messages in thread
From: Bart Van Assche @ 2017-11-09 17:31 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, sagi-NQWnxTmZq1alnMjI0IkVqw
  Cc: hch-jcswGhMUV9g, maxg-VPRAkNaXOzVWk0Htik3J/w,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Thu, 2017-11-09 at 19:22 +0200, Sagi Grimberg wrote:
> But I'm afraid I don't understand how the fact that ULPs will run on
> different ports matters. How would having two different
> pools on different ports make a difference?

If each RDMA port is only used by a single ULP then the ULP driver can provide
a better value for the CQ size than IB_CQE_BATCH. If CQ pools would be created
by ULPs then it would be easy for ULPs to pass their choice of CQ size to the
RDMA core.

In case multiple ULPs share an RDMA port then which CQ is chosen for the ULP
will depend on the order in which the ULP drivers are loaded. This may lead to
hard to debug performance issues, e.g. due to different lock contention
behavior. That's another reason why per-ULP CQ pools look more interesting to
me than one CQ pool per HCA.

Bart.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v3 1/9] RDMA/core: Add implicit per-device completion queue pools
  2017-11-09 17:31             ` Sagi Grimberg
@ 2017-11-09 17:33                 ` Bart Van Assche
  -1 siblings, 0 replies; 92+ messages in thread
From: Bart Van Assche @ 2017-11-09 17:33 UTC (permalink / raw)
  To: maxg-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	sagi-NQWnxTmZq1alnMjI0IkVqw
  Cc: hch-jcswGhMUV9g, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r


On Thu, 2017-11-09 at 19:31 +0200, Sagi Grimberg wrote:
> > > +static int ib_alloc_cqs(struct ib_device *dev, int nr_cqes,
> > > +        enum ib_poll_context poll_ctx)
> > > +{
> > > +    LIST_HEAD(tmp_list);
> > > +    struct ib_cq *cq;
> > > +    unsigned long flags;
> > > +    int nr_cqs, ret, i;
> > > +
> > > +    /*
> > > +     * Allocated at least as many CQEs as requested, and otherwise
> > > +     * a reasonable batch size so that we can share CQs between
> > > +     * multiple users instead of allocating a larger number of CQs.
> > > +     */
> > > +    nr_cqes = max(nr_cqes, min(dev->attrs.max_cqe, IB_CQE_BATCH));
> > 
> > did you mean min() ?
> 
> No, I meant max. If we choose the CQ size, we choose the min between the
> default and the device capability; if the user chooses, we rely on it
> asking for no more than the device capability (and if not, the allocation
> will fail, as it should).

Hello Sagi,

How about the following:

	min(dev->attrs.max_cqe, max(nr_cqes, IB_CQE_BATCH))
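
The two expressions only differ when nr_cqes exceeds what the device
supports (numbers below are made up):

	max_cqe = 512, IB_CQE_BATCH = 1024, nr_cqes = 2048

	max(nr_cqes, min(max_cqe, IB_CQE_BATCH)) = max(2048, 512) = 2048 -> CQ allocation fails
	min(max_cqe, max(nr_cqes, IB_CQE_BATCH)) = min(512, 2048) =  512 -> silently clamped below the request

so the choice is really between failing loudly and clamping quietly.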

Bart.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v3 0/9] Introduce per-device completion queue pools
  2017-11-09 17:22         ` Sagi Grimberg
@ 2017-11-09 18:52             ` Leon Romanovsky
  -1 siblings, 0 replies; 92+ messages in thread
From: Leon Romanovsky @ 2017-11-09 18:52 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Bart Van Assche, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	hch-jcswGhMUV9g, maxg-VPRAkNaXOzVWk0Htik3J/w,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r


On Thu, Nov 09, 2017 at 07:22:58PM +0200, Sagi Grimberg wrote:
>
>
> > Additionally, on many setups each
> > RDMA port only runs a single ULP. I think that's another argument to let the
> > ULP allocate CQ pool(s) instead of having one such pool per HCA.
>
> I think that most modern HCAs are exposing device-per-port and will
> probably continue to do so, but yes, mlx4 is a dual-ported HCA.

I have patches in my queue for -next cycle which introduce similar
functionality for mlx5 too - one mlx5_ib device with two ports.

Thanks


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v3 0/9] Introduce per-device completion queue pools
  2017-11-09 17:06         ` Sagi Grimberg
@ 2017-11-10 19:27             ` Chuck Lever
  -1 siblings, 0 replies; 92+ messages in thread
From: Chuck Lever @ 2017-11-10 19:27 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-rdma, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Christoph Hellwig, Max Gurtuvoy


> On Nov 9, 2017, at 12:06 PM, Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> wrote:
> 
>> Hi Sagi, glad to see progress on this!
> 
> Hi Chuck,
> 
>> When running on the same CPU, Send and Receive completions compete
>> for the same finite CPU resource. In addition, they compete with
>> soft IRQ tasks that are also pinned to that CPU, and any other
>> BOUND workqueue tasks that are running there.
> 
> That's true.
> 
>> Send and Receive completions often have significant work to do
>> (for example, DMA syncing or unmapping followed by some parsing
>> of the completion results) and are all serialized on ib_poll_wq or
>> by soft IRQ.
> 
> Yes, that's correct.
> 
>> This limits IOPS, and restricts other users of that shared CQ.
> 
> I agree that's true from a single-queue perspective. When multiple queues
> are used, keeping each queue's context on its own cpu core is usually
> the best approach to achieve linear scalability, otherwise we pay
> more for context switches, cacheline bounces, resource contention, etc.
> 
>> I recognize that handling interrupts on the same core where they
>> fired is best, but some of this work has to be allowed to migrate
>> when this CPU core is already fully utilized. A lot of the RDMA
>> core and ULP workqueues are BOUND, which prevents task migration,
>> even in the upper layers.
> 
> So ib_comp_wq started out as an UNBOUND workqueue, but the fact
> that unbound workqueue workers are not cpu bound did not fit well
> with the cpu/numa locality used with high-end storage devices and was a source of latency.
> 
> See:
> --
> commit b7363e67b23e04c23c2a99437feefac7292a88bc
> Author: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
> Date:   Wed Mar 8 22:03:17 2017 +0200
> 
>    IB/device: Convert ib-comp-wq to be CPU-bound
> 
>    This workqueue is used by our storage target mode ULPs
>    via the new CQ API. Recent observations when working
>    with very high-end flash storage devices reveal that
>    UNBOUND workqueue threads can migrate between cpu cores
>    and even numa nodes (although some numa locality is accounted
>    for).
> 
>    While this attribute can be useful in some workloads,
>    it does not fit in very nicely with the normal
>    run-to-completion model we usually use in our target-mode
>    ULPs and the block-mq irq<->cpu affinity facilities.
> 
>    The whole block-mq concept is that the completion will
>    land on the same cpu where the submission was performed.
>    The fact that our submitter thread is migrating cpus
>    can break this locality.
> 
>    We assume that as a target mode ULP, we will serve multiple
>    initiators/clients and we can spread the load enough without
>    having to use unbound kworkers.
> 
>    Also, while we're at it, expose this workqueue via sysfs which
>    is harmless and can be useful for debug.
> --
> 
> The rationale is that storage targets (or file servers) usually serve
> multiple clients, and the spreading across cpu cores for more efficient
> utilization would come from spreading the completion vectors.

This works for me. It seems like an appropriate design.

On targets, the CPUs are typically shared with other ULPs,
so there is little more to do.

On initiators, CPUs are shared with user applications.
In fact, applications will use the majority of CPU and
scheduler resources.

Using BOUND workqueues seems to be very typical in file
systems, and we may be stuck with that design. What we
can't have is RDMA completions forcing user processes to
pile up on the CPU core that handles Receives.

Quite probably, initiator ULP implementations will need
to ensure explicitly that their transactions complete on
the same CPU core where the application started them.
The downside is this frequently adds the latency cost of
a context switch.


> However if this is not the case, then by all means we need a knob for
> it (maybe have two ib completion workqueues and ULP will choose).
> 
>> I would like to see a capability of intelligently spreading the
>> CQ workload for a single QP onto more CPU cores.
> 
> That is a different use case than what I was trying to achieve.
> ULP consumers such as nvme-rdma (or srp and the like) will use multiple
> qp-cq pairs (usually even per-core), and for that use-case cpu locality
> is probably a better approach to take imo.
> 
> How likely is it that multiple NFS mount-points will be used on a single
> server? Is that something you are looking to optimize? Or is
> a single (or few) mount-points per server the common use-case?
> If it's the latter, then I fully agree with you, and we should
> come up with a core api for it (probably rds or smc will want it
> too).
> 
>> As an example, I've found that ensuring that NFS/RDMA's Receive
>> and Send completions are handled on separate CPU cores results in
>> slightly higher IOPS (~5%) and lower latency jitter on one mount
>> point.
> 
> That is valuable information. I do agree that what you are proposing
> is useful. I'll need some time to think on that.
> 
>> This is more critical now that our ULPs are handling more Send
>> completions.
> 
> We still need to fix some more...
> 
>>> In addition, we introduce a configfs knob to our nvme-target to
>>> bound I/O threads to a given cpulist (can be a subset). This is
>>> useful for numa configurations where the backend device access is
>>> configured with care to numa affinity, and we want to restrict rdma
>>> device and I/O threads affinity accordingly.
>>> 
>>> The patch set convert iser, isert, srpt, svcrdma, nvme-rdma and
>>> nvmet-rdma to use the new API.
>> Is there a straightforward way to assess whether this work
>> improves scalability and performance when multiple ULPs share a
>> device?
> 
> I guess the only way is running multiple ULPs in parallel? I tried
> running iser+nvme-rdma in parallel, but my poor 2 VMs are not the best
> performance platform to evaluate this on...

--
Chuck Lever




^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v3 1/9] RDMA/core: Add implicit per-device completion queue pools
  2017-11-09 17:33                 ` Bart Van Assche
@ 2017-11-13 20:28                     ` Sagi Grimberg
  -1 siblings, 0 replies; 92+ messages in thread
From: Sagi Grimberg @ 2017-11-13 20:28 UTC (permalink / raw)
  To: Bart Van Assche, maxg-VPRAkNaXOzVWk0Htik3J/w,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: hch-jcswGhMUV9g, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

> Hello Sagi,
> 
> How about the following:
> 
> 	min(dev->attrs.max_cqe, max(nr_cqes, IB_CQE_BATCH))

That would work too, thanks Bart!

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v3 0/9] Introduce per-device completion queue pools
  2017-11-09 17:31             ` Bart Van Assche
@ 2017-11-13 20:31                 ` Sagi Grimberg
  -1 siblings, 0 replies; 92+ messages in thread
From: Sagi Grimberg @ 2017-11-13 20:31 UTC (permalink / raw)
  To: Bart Van Assche, linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: hch-jcswGhMUV9g, maxg-VPRAkNaXOzVWk0Htik3J/w,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r


>> But I'm afraid I don't understand how the fact that ULPs will run on
>> different ports matters. How would having two different
>> pools on different ports make a difference?
> 
> If each RDMA port is only used by a single ULP then the ULP driver can provide
> a better value for the CQ size than IB_CQE_BATCH. If CQ pools would be created
> by ULPs then it would be easy for ULPs to pass their choice of CQ size to the
> RDMA core.

But if that is not the case, we may get less completion
aggregation per interrupt.

> In case multiple ULPs share an RDMA port then which CQ is chosen for the ULP
> will depend on the order in which the ULP drivers are loaded. This may lead to
> hard to debug performance issues, e.g. due to different lock contention
> behavior. That's another reason why per-ULP CQ pools look more interesting to
> me than one CQ pool per HCA.

The ULP is free to pass in an affinity hint to enforce locality to a
specific cpu core. Would that solve this issue?

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v3 0/9] Introduce per-device completion queue pools
  2017-11-09 18:52             ` Leon Romanovsky
@ 2017-11-13 20:32                 ` Sagi Grimberg
  -1 siblings, 0 replies; 92+ messages in thread
From: Sagi Grimberg @ 2017-11-13 20:32 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Bart Van Assche, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	hch-jcswGhMUV9g, maxg-VPRAkNaXOzVWk0Htik3J/w,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

Hey Leon,

>>> Additionally, on many setups each
>>> RDMA port only runs a single ULP. I think that's another argument to let the
>>> ULP allocate CQ pool(s) instead of having one such pool per HCA.
>>
>> I think that most modern HCAs are exposing device-per-port and will
>> probably continue to do so, but yes, mlx4 is a dual-ported HCA.
> 
> I have patches in my queue for -next cycle which introduce similar
> functionality for mlx5 too - one mlx5_ib device with two ports.

Thanks for the heads up.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v3 0/9] Introduce per-device completion queue pools
  2017-11-13 20:31                 ` Sagi Grimberg
@ 2017-11-13 20:34                     ` Jason Gunthorpe
  -1 siblings, 0 replies; 92+ messages in thread
From: Jason Gunthorpe @ 2017-11-13 20:34 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Bart Van Assche, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	hch-jcswGhMUV9g, maxg-VPRAkNaXOzVWk0Htik3J/w,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Mon, Nov 13, 2017 at 10:31:26PM +0200, Sagi Grimberg wrote:
> 
> >>But I'm afraid I don't understand how the fact that ULPs will run on
> >>different ports matters. How would having two different
> >>pools on different ports make a difference?
> >
> >If each RDMA port is only used by a single ULP then the ULP driver can provide
> >a better value for the CQ size than IB_CQE_BATCH. If CQ pools would be created
> >by ULPs then it would be easy for ULPs to pass their choice of CQ size to the
> >RDMA core.
> 
> But if that is not the case, we may get less completion
> aggregation per interrupt.

It is too bad we can't re-size CQs.. Can we?

Jason

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v3 0/9] Introduce per-device completion queue pools
  2017-11-10 19:27             ` Chuck Lever
@ 2017-11-13 20:47                 ` Sagi Grimberg
  -1 siblings, 0 replies; 92+ messages in thread
From: Sagi Grimberg @ 2017-11-13 20:47 UTC (permalink / raw)
  To: Chuck Lever
  Cc: linux-rdma, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Christoph Hellwig, Max Gurtuvoy

Hey Chuck,

> This works for me. It seems like an appropriate design.
> 
> On targets, the CPUs are typically shared with other ULPs,
> so there is little more to do.
> 
> On initiators, CPUs are shared with user applications.
> In fact, applications will use the majority of CPU and
> scheduler resources.
> 
> Using BOUND workqueues seems to be very typical in file
> systems, and we may be stuck with that design. What we
> can't have is RDMA completions forcing user processes to
> pile up on the CPU core that handles Receives.

I'm not sure I understand what you mean by:
"RDMA completions forcing user processes to pile up on the CPU core that
handles Receives"

My baseline assumption is that other cpu cores have their own tasks
that they are handling, and processing RDMA completions on a different
cpu is blocking something, maybe not the submitter, but something else.
So under the assumption that completion processing always comes at the
expense of something, choosing anything other than the cpu core that
the I/O was submitted on is an inferior choice.

Is my understanding correct that you are trying to emphasize that
unbound workqueues make sense in some use-cases for initiator drivers
(like xprtrdma)?

> Quite probably, initiator ULP implementations will need
> to ensure explicitly that their transactions complete on
> the same CPU core where the application started them.

Just to be clear, you mean the CPU core where the I/O was
submitted correct?

> The downside is this frequently adds the latency cost of
> a context switch.

That is true: if the interrupt is directed to another cpu core
then a context switch will be involved, and that adds latency.

I'm stating the obvious here, but this issue has historically existed in
various devices ranging from network to storage and more. The solution
is to use multiple queues (ideally per-cpu), keep synchronization in the
submission path minimal (like XPS for networking), and keep completions
as local as possible to the submission cores (like flow
steering).
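
In its crudest form that just means deriving the completion vector from
the cpu a queue serves when the per-cpu queues are set up, e.g. (made-up
names, and assuming the vector irq affinity is spread across cpus
accordingly):

	/* one queue per cpu: queue i serves cpu i, steer its completions there */
	queue->comp_vector = queue_idx % dev->num_comp_vectors;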

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v3 0/9] Introduce per-device completion queue pools
  2017-11-13 20:34                     ` Jason Gunthorpe
@ 2017-11-13 20:48                         ` Sagi Grimberg
  -1 siblings, 0 replies; 92+ messages in thread
From: Sagi Grimberg @ 2017-11-13 20:48 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Bart Van Assche, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	hch-jcswGhMUV9g, maxg-VPRAkNaXOzVWk0Htik3J/w,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r


>> But if that is not the case, we may have less completion
>> aggregation per interrupt.
> 
> It is too bad we can't re-size CQs.. Can we?

I wish we could, but it's an optional feature so I can't see how we can
use it.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v3 0/9] Introduce per-device completion queue pools
  2017-11-08  9:57 ` Sagi Grimberg
@ 2017-11-13 22:11     ` Doug Ledford
  -1 siblings, 0 replies; 92+ messages in thread
From: Doug Ledford @ 2017-11-13 22:11 UTC (permalink / raw)
  To: Sagi Grimberg, linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Christoph Hellwig,
	Max Gurtuvoy

On Wed, 2017-11-08 at 11:57 +0200, Sagi Grimberg wrote:
> Comments and feedback is welcome.

From what I gathered reading the feedback, there is still some concern
as to whether or not the design is ready to be set in stone, so I'm
going to skip this series for this merge window.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
    GPG KeyID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v3 0/9] Introduce per-device completion queue pools
  2017-11-13 20:47                 ` Sagi Grimberg
@ 2017-11-13 22:15                     ` Chuck Lever
  -1 siblings, 0 replies; 92+ messages in thread
From: Chuck Lever @ 2017-11-13 22:15 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-rdma, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Christoph Hellwig, Max Gurtuvoy


> On Nov 13, 2017, at 3:47 PM, Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> wrote:
> 
> Hey Chuck,
> 
>> This works for me. It seems like an appropriate design.
>> On targets, the CPUs are typically shared with other ULPs,
>> so there is little more to do.
>> On initiators, CPUs are shared with user applications.
>> In fact, applications will use the majority of CPU and
>> scheduler resources.
>> Using BOUND workqueues seems to be very typical in file
>> systems, and we may be stuck with that design. What we
>> can't have is RDMA completions forcing user processes to
>> pile up on the CPU core that handles Receives.
> 
> I'm not sure I understand what you mean by:
> "RDMA completions forcing user processes to pile up on the CPU core that
> handles Receives"

Recall that NFS is limited to a single QP per client-server
pair.

ib_alloc_cq(compvec) determines which CPU will handle Receive
completions for a QP. Let's call this CPU R.
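
As a minimal sketch of that allocation (the variable and handler names
below are placeholders, not xprtrdma's actual identifiers):

    struct ib_cq *recv_cq;
    struct ib_cqe recv_cqe;

    /* the comp_vector argument is what selects "CPU R": completions
     * for this CQ are processed on the CPU serving that vector
     */
    recv_cq = ib_alloc_cq(device, NULL /* cq context */, nr_cqe,
                          compvec, IB_POLL_WORKQUEUE);
    if (IS_ERR(recv_cq))
            return PTR_ERR(recv_cq);

    /* every Receive WR names a done handler, which will run on CPU R */
    recv_cqe.done = receive_done;
    recv_wr.wr_cqe = &recv_cqe;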

I assume any CPU can initiate an RPC Call. For example, let's
say an application is running on CPU C != R.

The Receive completion occurs on CPU R. Suppose the Receive
matches to an incoming RPC that had no registered MRs. The
Receive completion can invoke xprt_complete_rqst in the
Receive completion handler to complete the RPC on CPU R
without another context switch.

The problem is that the RPC completes on CPU R because the
RPC stack uses a BOUND workqueue, and so does NFS. Thus at
least the RPC and NFS completion processing are competing
for CPU R, instead of being handled on other CPUs, and
maybe the requesting application is also likely to migrate
onto CPU R.

I observed this behavior experimentally.

Today, the xprtrdma Receive completion handler processes
simple RPCs (ie, RPCs with no MRs) immediately, but finishes
completion processing for RPCs with MRs by re-scheduling
them on an UNBOUND secondary workqueue.
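
Roughly, that split looks like this (the workqueue and function names
are made up for illustration, not xprtrdma's actual ones):

    static struct workqueue_struct *reply_wq;

    static int alloc_reply_wq(void)
    {
            /* UNBOUND: deferred work may run on any CPU, not just CPU R */
            reply_wq = alloc_workqueue("xprtrdma_reply",
                                       WQ_UNBOUND | WQ_MEM_RECLAIM, 0);
            return reply_wq ? 0 : -ENOMEM;
    }

    /* called from the Receive completion handler on CPU R for RPCs
     * that still have MRs to clean up; simple RPCs complete inline
     */
    static void defer_reply_work(struct work_struct *work)
    {
            queue_work(reply_wq, work);
    }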

I thought it would save me a context switch if the Receive
completion handler dealt with an RPC with only one MR that
had been remotely invalidated as a simple RPC, and allowed
it to complete immediately (all it needs to do is DMA unmap
that already-invalidated MR) rather than re-scheduling.

Assuming NFS READs and WRITEs are less than 1MB and the
payload can be registered in a single MR, I can avoid
that context switch for every I/O (and this assumption
is valid for my test system, using CX-3 Pro).

Except when I tried this, the IOPS throughput dropped
considerably, even while the measured per-RPC latency was
lower by the expected 5-15 microseconds. CPU R was running
flat out handling Receives, RPC completions, and NFS I/O
completions. In one case I recall seeing a 12 thread fio
run not using CPU on any other core on the client.


> My baseline assumption is that other cpu cores have their own tasks
> that they are handling, and making RDMA completions be processed
> on a different cpu is blocking something, maybe not the submitter,
> but something else. So under the assumption that completion processing
> always comes at the expense of something, choosing anything other
> than the cpu core that the I/O was submitted on is an inferior choice.
> 
> Is my understanding correct that you are trying to emphasize that
> unbound workqueues make sense in some use cases for initiator drivers
> (like xprtrdma)?

No, I'm just searching for the right tool for the job.

I think what you are saying is that when a file system
like XFS resides on an RDMA-enabled block device, you
have multiple QPs and CQs to route the completion
workload back to the CPUs that dispatched the work. There
shouldn't be an issue there similar to NFS, even though
XFS might also use BOUND workqueues. Fair enough.


>> Quite probably, initiator ULP implementations will need
>> to ensure explicitly that their transactions complete on
>> the same CPU core where the application started them.
> 
> Just to be clear, you mean the CPU core where the I/O was
> submitted, correct?

Yes.


>> The downside is this frequently adds the latency cost of
>> a context switch.
> 
> That is true: if the interrupt was directed to another cpu core,
> then a context switch will be involved, and that adds latency.

Latency is also introduced when ib_comp_wq cannot get
scheduled for some time because of competing work on
the same CPU. Soft IRQ, Send completions, or other
HIGHPRI work can delay the dispatch of RPC and NFS work
on a particular CPU.


> I'm stating the obvious here, but this issue historically existed in
> various devices ranging from network to storage and more. The solution
> is using multiple queues (ideally per-cpu) and try to have minimal
> synchronization in the submission path (like XPS for networking) and
> keep completions as local as possible to the submission cores (like flow
> steering).

For the time being, the Linux NFS client does not support
multiple connections to a single NFS server. There is some
protocol standards work to be done to help clients discover
all distinct network paths to a server. We're also looking
at safe ways to schedule NFS RPCs over multiple connections.

To get multiple connections today you can use pNFS with
block devices, but that doesn't help the metadata workload
(GETATTRs, LOOKUPs, and the like), and not everyone wants
to use pNFS.

Also, there are some deployment scenarios where "creating
another connection" has an undesirable scalability impact:

- The NFS client has dozens or hundreds of CPUs. Typical
for a single large host running containers, where the
host's kernel NFS client manages the mounts, which are
shared among containers.

- The NFS client has mounted dozens or hundreds of NFS
servers, and thus wants to conserve its connection count
to avoid managing MxN connections.

- The device prefers a lower system QP count for good
performance, or the client's workload has hit the device's
QP count limit.


--
Chuck Lever




^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v3 0/9] Introduce per-device completion queue pools
  2017-11-13 20:48                         ` Sagi Grimberg
@ 2017-11-14  2:48                             ` Jason Gunthorpe
  -1 siblings, 0 replies; 92+ messages in thread
From: Jason Gunthorpe @ 2017-11-14  2:48 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Bart Van Assche, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	hch-jcswGhMUV9g, maxg-VPRAkNaXOzVWk0Htik3J/w,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Mon, Nov 13, 2017 at 10:48:32PM +0200, Sagi Grimberg wrote:
> 
> >>But if that is not the case, we may have less completion
> >>aggregation per interrupt.
> >
> >It is too bad we can't re-size CQs.. Can we?
> 
> I wish we could, but it's an optional feature so I can't see how we can
> use it.

Well, it looks like mlx4/5 can do it, which covers a huge swath of
deployed hardware..

I'd say make an optimal implementation using resize_cq and just a working
implementation without it?
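
Something along these lines, perhaps (the pool helper below is only a
sketch, not an existing API):

    /* try to grow a pool CQ in place; ib_resize_cq() fails (e.g. with
     * -ENOSYS) when the device does not implement the optional
     * resize_cq verb, and the pool then allocates another CQ instead
     */
    static int cq_pool_try_grow(struct ib_cq *cq, int extra_cqe)
    {
            return ib_resize_cq(cq, cq->cqe + extra_cqe);
    }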

The only way to encourage vendors to implement optional features is to
actually use them in the OS...

Jason

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v3 2/9] IB/isert: use implicit CQ allocation
  2017-11-08  9:57     ` Sagi Grimberg
@ 2017-11-14  9:14         ` Max Gurtovoy
  -1 siblings, 0 replies; 92+ messages in thread
From: Max Gurtovoy @ 2017-11-14  9:14 UTC (permalink / raw)
  To: Sagi Grimberg, linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Christoph Hellwig

Hi Sagi,

> @@ -535,13 +410,15 @@ isert_connect_request(struct rdma_cm_id *cma_id, struct rdma_cm_event *event)
>   
>   	isert_set_nego_params(isert_conn, &event->param.conn);
>   
> -	ret = isert_conn_setup_qp(isert_conn, cma_id);
> -	if (ret)
> +	isert_conn->qp = isert_create_qp(isert_conn, cma_id);
> +	if (IS_ERR(isert_conn->qp)) {
> +		ret = PTR_ERR(isert_conn->qp);
>   		goto out_conn_dev;
> +	}
>   
>   	ret = isert_login_post_recv(isert_conn);
>   	if (ret)
> -		goto out_conn_dev;
> +		goto out_conn_qp;

This is a bug fix, right?

>   
>   	ret = isert_rdma_accept(isert_conn);
>   	if (ret)
> @@ -553,6 +430,8 @@ isert_connect_request(struct rdma_cm_id *cma_id, struct rdma_cm_event *event)
>   
>   	return 0;
>   
> +out_conn_qp:
> +	ib_destroy_qp(isert_conn->qp);

Maybe use rdma_destroy_qp(isert_conn->cm_id), as we do in nvme/nvmet_rdma?
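
I.e. something like this, assuming the cm_id is still valid at that
point:

    out_conn_qp:
            rdma_destroy_qp(isert_conn->cm_id);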


Otherwise, looks good.

Reviewed-by: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v3 0/9] Introduce per-device completion queue pools
  2017-11-09 17:31             ` Bart Van Assche
@ 2017-11-14 10:06                 ` Max Gurtovoy
  -1 siblings, 0 replies; 92+ messages in thread
From: Max Gurtovoy @ 2017-11-14 10:06 UTC (permalink / raw)
  To: Bart Van Assche, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	sagi-NQWnxTmZq1alnMjI0IkVqw
  Cc: hch-jcswGhMUV9g, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r



On 11/9/2017 7:31 PM, Bart Van Assche wrote:
> On Thu, 2017-11-09 at 19:22 +0200, Sagi Grimberg wrote:
>> But I'm afraid I don't understand how the fact that ULPs will run on
>> different ports matter? how would the fact that we had two different
>> pools on different ports make a difference?
> 
> If each RDMA port is only used by a single ULP then the ULP driver can provide
> a better value for the CQ size than IB_CQE_BATCH. If CQ pools would be created
> by ULPs then it would be easy for ULPs to pass their choice of CQ size to the
> RDMA core.

I also prefer the CQ pools per ULP approach (like we did with the
MR pools per QP) in the first stage. For example, we saw a big 
improvement in NVMEoF performance when we did CQ moderation (currently 
local implementation in our labs). If we moderate a shared CQ (iser +
nvmf CQ) we can ruin another ULP's performance. ISER/SRP/NVMEoF/NFS have
different needs and different architectures, so even adaptive moderation 
will not supply the best performance in that case.

We can (I meant I can :)) also implement an SRQ pool per ULP (and then push
my NVMEoF target SRQ-per-completion-vector feature, which saves resource
allocation and still gives us very good numbers - almost the same as using
a non-shared RQ).

> 
> In case multiple ULPs share an RDMA port then which CQ is chosen for the ULP
> will depend on the order in which the ULP drivers are loaded. This may lead to
> hard to debug performance issues, e.g. due to different lock contention
> behavior. That's another reason why per-ULP CQ pools look more interesting to
> me than one CQ pool per HCA.

Ease of debugging is also a good point.

> 
> Bart.

-Max.


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v3 0/9] Introduce per-device completion queue pools
  2017-11-13 20:31                 ` Sagi Grimberg
@ 2017-11-14 16:21                     ` Bart Van Assche
  -1 siblings, 0 replies; 92+ messages in thread
From: Bart Van Assche @ 2017-11-14 16:21 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, sagi-NQWnxTmZq1alnMjI0IkVqw
  Cc: hch-jcswGhMUV9g, maxg-VPRAkNaXOzVWk0Htik3J/w,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Mon, 2017-11-13 at 22:31 +0200, Sagi Grimberg wrote:
> On Thu, 2017-11-09 at 17:31 +0000, Bart Van Assche wrote:
> > In case multiple ULPs share an RDMA port then which CQ is chosen for the ULP
> > will depend on the order in which the ULP drivers are loaded. This may lead to
> > hard to debug performance issues, e.g. due to different lock contention
> > behavior. That's another reason why per-ULP CQ pools look more interesting to
> > me than one CQ pool per HCA.
> 
> The ULP is free to pass in an affinity hint to enforce locality to a
> specific cpu core. Would that solve this issue?

Only for mlx5 adapters because only the mlx5 driver implements
.get_vector_affinity(). For other adapters the following code is used to choose a
vector:

    vector = affinity_hint % dev->num_comp_vectors;

That means whether or not a single CQ will be used by different CPUs depends on
how the ULP associates 'affinity_hint' with CPUs.
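
For instance, a ULP could simply hand out one hint per queue (a sketch
only, using the ib_find_get_cq() proposed in patch 1):

    for (i = 0; i < nr_queues; i++) {
            /* the hint could equally be the CPU each queue is bound to */
            queue[i].cq = ib_find_get_cq(dev, queue_depth,
                                         IB_POLL_WORKQUEUE,
                                         i /* affinity_hint */);
    }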

Bart.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v3 1/9] RDMA/core: Add implicit per-device completion queue pools
  2017-11-08  9:57     ` Sagi Grimberg
@ 2017-11-14 16:28         ` Bart Van Assche
  -1 siblings, 0 replies; 92+ messages in thread
From: Bart Van Assche @ 2017-11-14 16:28 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, sagi-NQWnxTmZq1alnMjI0IkVqw
  Cc: hch-jcswGhMUV9g, maxg-VPRAkNaXOzVWk0Htik3J/w,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Wed, 2017-11-08 at 11:57 +0200, Sagi Grimberg wrote:
> +struct ib_cq *ib_find_get_cq(struct ib_device *dev, unsigned int nr_cqe,
> +               enum ib_poll_context poll_ctx, int affinity_hint)
> +{
> +       struct ib_cq *cq, *found;
> +       unsigned long flags;
> +       int vector, ret;
> +
> +       if (poll_ctx >= ARRAY_SIZE(dev->cq_pools))
> +               return ERR_PTR(-EINVAL);
> +
> +       if (!ib_find_vector_affinity(dev, affinity_hint, &vector)) {
> +               /*
> +                * Couldn't find matching vector affinity so project
> +                * the affinty to the device completion vector range
> +                */
> +               vector = affinity_hint % dev->num_comp_vectors;
> +       }

So depending on whether or not the HCA driver implements .get_vector_affinity()
either pci_irq_get_affinity() is used or "vector = affinity_hint %
dev->num_comp_vectors"? Sorry but I think that kind of differences makes it
unnecessarily hard for ULP maintainers to provide predictable performance and
consistent behavior across HCAs.

Bart.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v3 0/9] Introduce per-device completion queue pools
  2017-11-13 22:15                     ` Chuck Lever
@ 2017-11-20 12:08                         ` Sagi Grimberg
  -1 siblings, 0 replies; 92+ messages in thread
From: Sagi Grimberg @ 2017-11-20 12:08 UTC (permalink / raw)
  To: Chuck Lever
  Cc: linux-rdma, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Christoph Hellwig, Max Gurtuvoy


> Recall that NFS is limited to a single QP per client-server
> pair.
> 
> ib_alloc_cq(compvec) determines which CPU will handle Receive
> completions for a QP. Let's call this CPU R.
> 
> I assume any CPU can initiate an RPC Call. For example, let's
> say an application is running on CPU C != R.
> 
> The Receive completion occurs on CPU R. Suppose the Receive
> matches to an incoming RPC that had no registered MRs. The
> Receive completion can invoke xprt_complete_rqst in the
> Receive completion handler to complete the RPC on CPU R
> without another context switch.
> 
> The problem is that the RPC completes on CPU R because the
> RPC stack uses a BOUND workqueue, and so does NFS. Thus at
> least the RPC and NFS completion processing are competing
> for CPU R, instead of being handled on other CPUs, and
> maybe the requesting application is also likely to migrate
> onto CPU R.
> 
> I observed this behavior experimentally.
> 
> Today, the xprtrdma Receive completion handler processes
> simple RPCs (ie, RPCs with no MRs) immediately, but finishes
> completion processing for RPCs with MRs by re-scheduling
> them on an UNBOUND secondary workqueue.
> 
> I thought it would save me a context switch if the Receive
> completion handler dealt with an RPC with only one MR that
> had been remotely invalidated as a simple RPC, and allowed
> it to complete immediately (all it needs to do is DMA unmap
> that already-invalidated MR) rather than re-scheduling.
> 
> Assuming NFS READs and WRITEs are less than 1MB and the
> payload can be registered in a single MR, I can avoid
> that context switch for every I/O (and this assumption
> is valid for my test system, using CX-3 Pro).
> 
> Except when I tried this, the IOPS throughput dropped
> considerably, even while the measured per-RPC latency was
> lower by the expected 5-15 microseconds. CPU R was running
> flat out handling Receives, RPC completions, and NFS I/O
> completions. In one case I recall seeing a 12 thread fio
> run not using CPU on any other core on the client.

I see your point Chuck. The design choice here assumes that
other CPUs are equally occupied (even with NFS-RPC context), hence
running completion processing on the local cpu is almost always the
right choice.

If this is not the case, then this design does not apply.

>> My baseline assumption is that other cpu cores have their own tasks
>> that they are handling, and making RDMA completions be processed
>> on a different cpu is blocking something, maybe not the submitter,
>> but something else. So under the assumption that completion processing
>> always comes at the expense of something, choosing anything other
>> than the cpu core that the I/O was submitted on is an inferior choice.
>>
>> Is my understanding correct that you are trying to emphasize that
>> unbound workqueues make sense in some use cases for initiator drivers
>> (like xprtrdma)?
> 
> No, I'm just searching for the right tool for the job.
> 
> I think what you are saying is that when a file system
> like XFS resides on an RDMA-enabled block device, you
> have multiple QPs and CQs to route the completion
> workload back to the CPUs that dispatched the work. There
> shouldn't be an issue there similar to NFS, even though
> XFS might also use BOUND workqueues. Fair enough.

The issue I've seen with unbound workqueues is that the
worker thread can migrate between cpus, which messes up
the locality we are trying to achieve. However, we could
easily add IB_POLL_UNBOUND_WORKQUEUE polling context if
that helps your use case.

> Latency is also introduced when ib_comp_wq cannot get
> scheduled for some time because of competing work on
> the same CPU. Soft IRQ, Send completions, or other
> HIGHPRI work can delay the dispatch of RPC and NFS work
> on a particular CPU.

True, but again, the design assumes that other cores can (and
will) run similar tasks. The overhead of trying to select an
"optimal" cpu at exactly that moment is something we would want
to avoid for fast storage devices. Moreover, under high stress these
decisions are not guaranteed to be optimal and might be
counterproductive (as estimations often can be).

>> I'm stating the obvious here, but this issue historically existed in
>> various devices ranging from network to storage and more. The solution
>> is using multiple queues (ideally per-cpu) and try to have minimal
>> synchronization in the submission path (like XPS for networking) and
>> keep completions as local as possible to the submission cores (like flow
>> steering).
> 
> For the time being, the Linux NFS client does not support
> multiple connections to a single NFS server. There is some
> protocol standards work to be done to help clients discover
> all distinct network paths to a server. We're also looking
> at safe ways to schedule NFS RPCs over multiple connections.
> 
> To get multiple connections today you can use pNFS with
> block devices, but that doesn't help the metadata workload
> (GETATTRs, LOOKUPs, and the like), and not everyone wants
> to use pNFS.
> 
> Also, there are some deployment scenarios where "creating
> another connection" has an undesirable scalability impact:

I can understand that.

> - The NFS client has dozens or hundreds of CPUs. Typical
> for a single large host running containers, where the
> host's kernel NFS client manages the mounts, which are
> shared among containers.
> 
> - The NFS client has mounted dozens or hundreds of NFS
> servers, and thus wants to conserve its connection count
> to avoid managing MxN connections.

So in this use case, do you really see non-local cpu selection
for completion processing performing better?

From my experience, linear scaling is much harder to achieve when
bouncing between cpus, with all the context-switching overhead involved.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v3 0/9] Introduce per-device completion queue pools
  2017-11-14  2:48                             ` Jason Gunthorpe
@ 2017-11-20 12:10                                 ` Sagi Grimberg
  -1 siblings, 0 replies; 92+ messages in thread
From: Sagi Grimberg @ 2017-11-20 12:10 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Bart Van Assche, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	hch-jcswGhMUV9g, maxg-VPRAkNaXOzVWk0Htik3J/w,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r


>>> It is too bad we can't re-size CQs.. Can we?
>>
>> I wish we could, but it's an optional feature so I can't see how we can
>> use it.
> 
> Well, it looks like mlx4/5 can do it, which covers a huge swath of
> deployed hardware..
> 
> I'd say make an optimal implementation using resize_cq and just a working
> implementation without it?

I can experiment with it, sure. Do you think that it's a must-have for
the first phase, though?

> The only way to encourage vendors to implement optional features is to
> actually use them in the OS...

I agree with you on that.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v3 0/9] Introduce per-device completion queue pools
  2017-11-14 10:06                 ` Max Gurtovoy
@ 2017-11-20 12:20                     ` Sagi Grimberg
  -1 siblings, 0 replies; 92+ messages in thread
From: Sagi Grimberg @ 2017-11-20 12:20 UTC (permalink / raw)
  To: Max Gurtovoy, Bart Van Assche, linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: hch-jcswGhMUV9g, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r


> I also prefer the CQ pools per ULP approach (like we did with the
> MR pools per QP) in the first stage.

Fair enough.

> For example, we saw a big 
> improvement in NVMEoF performance when we did CQ moderation (currently 
> local implementation in our labs). If we moderate a shared CQ (iser +
> nvmf CQ) we can ruin another ULP's performance. ISER/SRP/NVMEoF/NFS have
> different needs and different architectures, so even adaptive moderation 
> will not supply the best performance in that case.

Here I disagree. Using hard-coded or pre-configured adaptive moderation
is something we should move away from. I have a generic adaptive
moderation implementation for rdma and nvme in the works, and once that
does its job correctly, it should benefit everyone equally. Moreover,
IMO it even *supports* the notion of sharing CQs across ULPs because the
more consumers we have on a CQ, the better the adaptive moderation
works.
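
The knob such a scheme would drive is the existing (optional) CQ
moderation verb, e.g. (the values are arbitrary, and cq_period units
are device specific, typically usecs):

    int ret;

    /* coalesce up to 32 completions or one period per event */
    ret = ib_modify_cq(cq, 32 /* cq_count */, 16 /* cq_period */);
    if (ret)
            pr_debug("CQ moderation not supported (%d)\n", ret);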

In fact, if it works well I will vote for turning it on by default and
not even letting ULPs control it, only the user (via sysctl or
something), because if you think about it, ULPs can't really choose
better than the core.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v3 0/9] Introduce per-device completion queue pools
  2017-11-14 16:21                     ` Bart Van Assche
@ 2017-11-20 12:26                         ` Sagi Grimberg
  -1 siblings, 0 replies; 92+ messages in thread
From: Sagi Grimberg @ 2017-11-20 12:26 UTC (permalink / raw)
  To: Bart Van Assche, linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: hch-jcswGhMUV9g, maxg-VPRAkNaXOzVWk0Htik3J/w,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r


>> The ULP is free to pass in an affinity hint to enforce locality to a
>> specific cpu core. Would that solve this issue?
> 
> Only for mlx5 adapters because only the mlx5 driver implements
> .get_vector_affinity(). For other adapters the following code is used to choose a
> vector:

Looks like even that will get reverted for 4.15 :) due to the change in
user experience for managed irq vectors.

> 
>      vector = affinity_hint % dev->num_comp_vectors;
> 
> That means whether or not a single CQ will be used by different CPUs depends on
> how the ULP associates 'affinity_hint' with CPUs.

My intention was to pass it down to ib_alloc_cq and then convert
ib_cq_poll_work to run queue_work_on(cq->cpu, ib_comp_wq, &cq->work).
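
Roughly (assuming a new cq->cpu field carries the selected affinity):

    static void ib_cq_poll_work(struct work_struct *work)
    {
            struct ib_cq *cq = container_of(work, struct ib_cq, work);
            int completed;

            completed = __ib_process_cq(cq, IB_POLL_BUDGET_WORKQUEUE);
            if (completed >= IB_POLL_BUDGET_WORKQUEUE ||
                ib_req_notify_cq(cq, IB_POLL_FLAGS) > 0)
                    /* requeue on the requested CPU instead of any CPU */
                    queue_work_on(cq->cpu, ib_comp_wq, &cq->work);
    }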

Note that you point out something that is more relevant to server/target
side ULPs (hence the assumption of workqueue mode).

So up until now I count Bart and Max for per-ULP pools; on the other
hand, there's Christoph and me for per-device pools. Can we get a
tie-breaker?

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v3 1/9] RDMA/core: Add implicit per-device completion queue pools
  2017-11-14 16:28         ` Bart Van Assche
@ 2017-11-20 12:31             ` Sagi Grimberg
  -1 siblings, 0 replies; 92+ messages in thread
From: Sagi Grimberg @ 2017-11-20 12:31 UTC (permalink / raw)
  To: Bart Van Assche, linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: hch-jcswGhMUV9g, maxg-VPRAkNaXOzVWk0Htik3J/w,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r


>> +struct ib_cq *ib_find_get_cq(struct ib_device *dev, unsigned int nr_cqe,
>> +               enum ib_poll_context poll_ctx, int affinity_hint)
>> +{
>> +       struct ib_cq *cq, *found;
>> +       unsigned long flags;
>> +       int vector, ret;
>> +
>> +       if (poll_ctx >= ARRAY_SIZE(dev->cq_pools))
>> +               return ERR_PTR(-EINVAL);
>> +
>> +       if (!ib_find_vector_affinity(dev, affinity_hint, &vector)) {
>> +               /*
>> +                * Couldn't find matching vector affinity so project
>> +                * the affinty to the device completion vector range
>> +                */
>> +               vector = affinity_hint % dev->num_comp_vectors;
>> +       }
> 
> So depending on whether or not the HCA driver implements .get_vector_affinity()
> either pci_irq_get_affinity() is used or "vector = affinity_hint %
> dev->num_comp_vectors"? Sorry, but I think that kind of difference makes it
> unnecessarily hard for ULP maintainers to provide predictable performance and
> consistent behavior across HCAs.

Well, as a ULP maintainer I think that, in the absence of
.get_vector_affinity(), I would do the same thing as this code. srp
itself is already doing the same thing in srp_create_target().

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v3 0/9] Introduce per-device completion queue pools
  2017-11-20 12:08                         ` Sagi Grimberg
@ 2017-11-20 15:54                             ` Chuck Lever
  -1 siblings, 0 replies; 92+ messages in thread
From: Chuck Lever @ 2017-11-20 15:54 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-rdma, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Christoph Hellwig, Max Gurtuvoy


> On Nov 20, 2017, at 7:08 AM, Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> wrote:
> 
> 
>> Recall that NFS is limited to a single QP per client-server
>> pair.
>> ib_alloc_cq(compvec) determines which CPU will handle Receive
>> completions for a QP. Let's call this CPU R.
>> I assume any CPU can initiate an RPC Call. For example, let's
>> say an application is running on CPU C != R.
>> The Receive completion occurs on CPU R. Suppose the Receive
>> matches to an incoming RPC that had no registered MRs. The
>> Receive completion can invoke xprt_complete_rqst in the
>> Receive completion handler to complete the RPC on CPU R
>> without another context switch.
>> The problem is that the RPC completes on CPU R because the
>> RPC stack uses a BOUND workqueue, and so does NFS. Thus at
>> least the RPC and NFS completion processing are competing
>> for CPU R, instead of being handled on other CPUs, and
>> the requesting application is also likely to migrate
>> onto CPU R.
>> I observed this behavior experimentally.
>> Today, the xprtrdma Receive completion handler processes
>> simple RPCs (ie, RPCs with no MRs) immediately, but finishes
>> completion processing for RPCs with MRs by re-scheduling
>> them on an UNBOUND secondary workqueue.
>> I thought it would save me a context switch if the Receive
>> completion handler dealt with an RPC with only one MR that
>> had been remotely invalidated as a simple RPC, and allowed
>> it to complete immediately (all it needs to do is DMA unmap
>> that already-invalidated MR) rather than re-scheduling.
>> Assuming NFS READs and WRITEs are less than 1MB and the
>> payload can be registered in a single MR, I can avoid
>> that context switch for every I/O (and this assumption
>> is valid for my test system, using CX-3 Pro).
>> Except when I tried this, the IOPS throughput dropped
>> considerably, even while the measured per-RPC latency was
>> lower by the expected 5-15 microseconds. CPU R was running
>> flat out handling Receives, RPC completions, and NFS I/O
>> completions. In one case I recall seeing a 12 thread fio
>> run not using CPU on any other core on the client.
> 
> I see your point Chuck. The design choice here assumes that
> other CPUs are equally occupied (even with NFS-RPC context), hence the
> choice of which cpu to run on would almost always favor the local
> cpu.
> 
> If this is not the case, then this design does not apply.
> 
>>> My baseline assumption is that other cpu cores have their own tasks
>>> that they are handling, and making RDMA completions be processed
>>> on a different cpu is blocking something, maybe not the submitter,
>>> but something else. So under the assumption that completion processing
>>> always comes at the expense of something, choosing anything other
>>> than the cpu core that the I/O was submitted on is an inferior choice.
>>> 
>>> Is my understanding correct that you are trying to emphasize that
>>> unbound workqueues make sense on some use-cases for initiator drivers
>>> (like xprtrdma)?
>> No, I'm just searching for the right tool for the job.
>> I think what you are saying is that when a file system
>> like XFS resides on an RDMA-enabled block device, you
>> have multiple QPs and CQs to route the completion
>> workload back to the CPUs that dispatched the work. There
>> shouldn't be an issue there similar to NFS, even though
>> XFS might also use BOUND workqueues. Fair enough.
> 
> The issue I've seen with unbound workqueues is that the
> worker thread can migrate between cpus, which messes up
> the locality we are trying to achieve. However, we could
> easily add an IB_POLL_UNBOUND_WORKQUEUE polling context if
> that helps your use case.

I agree that arbitrary process migration is undesirable.
Therefore UNBOUND workqueues should not be used in these
cases, IMO.

I would prefer the ULP controls where transaction completion
is dispatched. The block ULPs use multiple connections,
and eventually xprtrdma will too. Just not today :-)
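
Roughly, the block ULPs do something like this per hardware queue
today (a sketch of the pattern, not any one driver's exact code):

    /*
     * Sketch: one CQ per hardware queue, each spread onto its own
     * completion vector so completions land near the submitting CPU.
     */
    static struct ib_cq *alloc_queue_cq(struct ib_device *dev, void *queue,
                                        int depth, int queue_idx)
    {
            return ib_alloc_cq(dev, queue, depth,
                               queue_idx % dev->num_comp_vectors,
                               IB_POLL_SOFTIRQ);
    }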


>> Latency is also introduced when ib_comp_wq cannot get
>> scheduled for some time because of competing work on
>> the same CPU. Soft IRQ, Send completions, or other
>> HIGHPRI work can delay the dispatch of RPC and NFS work
>> on a particular CPU.
> 
> True, but again, the design assumes that other cores can (and
> will) run similar tasks. The overhead of trying to select an
> "optimal" cpu at exactly that moment is something we would want
> to avoid for fast storage devices. Moreover, under high stress these
> decisions are not guaranteed to be optimal and might be
> counter-productive (as estimations often can be).

Well, I guess more to the point: Even when the CQs are
operating in IB_POLL_WORKQUEUE mode, some network adapters
will need significant soft IRQ resources on the same CPU as
the completion workqueue, and these two tasks will compete
for the CPU resource. We should strive to make this
situation as efficient as possible because it appears to
be unavoidable. The ULPs, the core, and the drivers need
to be attentive to it.


>>> I'm stating the obvious here, but this issue historically existed in
>>> various devices ranging from network to storage and more. The solution
>>> is using multiple queues (ideally per-cpu) and try to have minimal
>>> synchronization in the submission path (like XPS for networking) and
>>> keep completions as local as possible to the submission cores (like flow
>>> steering).
>> For the time being, the Linux NFS client does not support
>> multiple connections to a single NFS server. There is some
>> protocol standards work to be done to help clients discover
>> all distinct network paths to a server. We're also looking
>> at safe ways to schedule NFS RPCs over multiple connections.
>> To get multiple connections today you can use pNFS with
>> block devices, but that doesn't help the metadata workload
>> (GETATTRs, LOOKUPs, and the like), and not everyone wants
>> to use pNFS.
>> Also, there are some deployment scenarios where "creating
>> another connection" has an undesirable scalability impact:
> 
> I can understand that.
> 
>> - The NFS client has dozens or hundreds of CPUs. Typical
>> for a single large host running containers, where the
>> host's kernel NFS client manages the mounts, which are
>> shared among containers.
>> - The NFS client has mounted dozens or hundreds of NFS
>> servers, and thus wants to conserve its connection count
>> to avoid managing MxN connections.
> 
> So in this use-case, do you really see that non-local cpu
> selection for completion processing performs better?
> 
> From my experience, linear scaling is much harder to achieve
> with cpus bouncing around, given all the context-switching overhead
> involved.

I agree that migrating arbitrarily is a similar evil to
delivering to the wrong CPU.

It is clear that some cases can use multiple QPs to steer
Receive completions, others cannot. My humble requests
for your new API would be:

1. Don't assume the ULP can open lots of connections as
a mechanism for steering completions. Or, to state it
another way, the single QP case has to be efficient too.

2. Provide a mechanism that either allows the ULP to
select the CPU where the completion handler runs, or
alternatively lets the ULP query the CQ to find out
where it is going to physically handle completions
(see the sketch at the end of this message).

That way the ULP has better control over how many
connections it might want to open, and it can
allocate memory on the correct NUMA node for device-
specific tasks like Receives.

Automating the selection of interrupt and CPU can work
OK, but IMO completely hiding the physical resources in
this case is not good.

The per-ULP CQ pool idea might help for both 1 and 2.
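
For instance, request 2 could take a shape roughly like this (purely a
sketch for discussion; neither helper exists in the RDMA core today):

    /*
     * Hypothetical helpers: let a ULP pin a CQ's completion work to a
     * CPU of its choosing, or ask where completions will run so it can
     * allocate Receive buffers on the right NUMA node.
     */
    int ib_cq_set_completion_cpu(struct ib_cq *cq, int cpu);
    int ib_cq_get_completion_cpu(struct ib_cq *cq);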


--
Chuck Lever




^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v3 0/9] Introduce per-device completion queue pools
  2017-11-20 12:10                                 ` Sagi Grimberg
@ 2017-11-20 19:24                                     ` Jason Gunthorpe
  -1 siblings, 0 replies; 92+ messages in thread
From: Jason Gunthorpe @ 2017-11-20 19:24 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Bart Van Assche, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	hch-jcswGhMUV9g, maxg-VPRAkNaXOzVWk0Htik3J/w,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Mon, Nov 20, 2017 at 02:10:13PM +0200, Sagi Grimberg wrote:
> 
> >>>It is too bad we can't re-size CQs.. Can we?
> >>
> >>I wish we could, but it's an optional feature so I can't see how we can
> >>use it.
> >
> >Well, it looks like mlx4/5 can do it, which covers a huge swath of
> >deployed hardware..
> >
> >I'd say make an optimal implementation using resize_cq and just a working
> >implementation without it?
> 
> I can experiment with it, sure. Do you think that it's a must-have for
> the first phase though?

Not clear to me.. Bart?

Jason

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v3 0/9] Introduce per-device completion queue pools
  2017-11-20 19:24                                     ` Jason Gunthorpe
@ 2017-11-20 21:29                                         ` Bart Van Assche
  -1 siblings, 0 replies; 92+ messages in thread
From: Bart Van Assche @ 2017-11-20 21:29 UTC (permalink / raw)
  To: jgg-uk2M96/98Pc, sagi-NQWnxTmZq1alnMjI0IkVqw
  Cc: hch-jcswGhMUV9g, maxg-VPRAkNaXOzVWk0Htik3J/w,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Mon, 2017-11-20 at 12:24 -0700, Jason Gunthorpe wrote:
> On Mon, Nov 20, 2017 at 02:10:13PM +0200, Sagi Grimberg wrote:
> > > > > It is too bad we can't re-size CQs.. Can we?
> > > > 
> > > > I wish we could, but it's an optional feature so I can't see how we can
> > > > use it.
> > > 
> > > Well, it looks like mlx4/5 can do it, which covers a huge swath of
> > > deployed hardware..
> > > 
> > > I'd say make an optimal implementation using resize_cq and just a working
> > > implementation without it?
> > 
> > I can experiment with it, sure. Do you think that it's a must-have for
> > the first phase though?
> 
> Not clear to me.. Bart?

Hi Jason,

Having the completion queue pool implementation use resize_cq internally if
supported by the HCA sounds interesting to me. It would help avoid the CQ
pool implementation using more memory than needed. Sagi, if you don't have
the time to work on this, please let me know; in that case I will try to
free up some time and implement it myself.
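
A rough sketch of what that could look like (cq_pool_try_resize() is a
hypothetical helper; the pool would still need the plain allocation
path for HCAs that do not implement resize_cq):

    /*
     * Sketch only: try to grow a pooled CQ in place; the caller falls
     * back to allocating another CQ when the HCA cannot resize.
     */
    static int cq_pool_try_resize(struct ib_cq *cq, int extra_cqe)
    {
            if (!cq->device->resize_cq)
                    return -EOPNOTSUPP;
            return ib_resize_cq(cq, cq->cqe + extra_cqe);
    }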

Bart.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [v3,1/9] RDMA/core: Add implicit per-device completion queue pools
  2017-11-08  9:57     ` Sagi Grimberg
@ 2017-12-11 23:50         ` Jason Gunthorpe
  -1 siblings, 0 replies; 92+ messages in thread
From: Jason Gunthorpe @ 2017-12-11 23:50 UTC (permalink / raw)
  To: Sagi Grimberg, Bart Van Assche
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Christoph Hellwig,
	Max Gurtuvoy

On Wed, Nov 08, 2017 at 11:57:34AM +0200, Sagi Grimberg wrote:
> Allow a ULP to ask the core to implicitly assign a completion
> queue to a queue-pair based on a least-used search of the per-device
> CQ pools. The device CQ pools grow lazily with every QP creation.
> 
> In addition, expose an affinity hint for queue pair creation.
> If passed, the core will attempt to attach a CQ with a completion
> vector directed to the cpu core indicated by the affinity hint.
> 
> Signed-off-by: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>

Sagi, Bart,

Did we reach a conclusion on this? Is v3 the series to take, and
should it all go through the RDMA tree? It looks like there are some
missing acks for that??

I think there was also an unapplied comment from Bart in the
patchworks notes...

Could you please add some commit messages when you resend it? Not sure
I should be accepting such large commits with empty messages???

Thanks,
Jason

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [v3,1/9] RDMA/core: Add implicit per-device completion queue pools
  2017-12-11 23:50         ` Jason Gunthorpe
@ 2018-01-03 17:47             ` Jason Gunthorpe
  -1 siblings, 0 replies; 92+ messages in thread
From: Jason Gunthorpe @ 2018-01-03 17:47 UTC (permalink / raw)
  To: Sagi Grimberg, Bart Van Assche
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Christoph Hellwig,
	Max Gurtuvoy

On Mon, Dec 11, 2017 at 04:50:44PM -0700, Jason Gunthorpe wrote:
> On Wed, Nov 08, 2017 at 11:57:34AM +0200, Sagi Grimberg wrote:
> > Allow a ULP to ask the core to implicitly assign a completion
> > queue to a queue-pair based on a least-used search of the per-device
> > CQ pools. The device CQ pools grow lazily with every QP creation.
> > 
> > In addition, expose an affinity hint for queue pair creation.
> > If passed, the core will attempt to attach a CQ with a completion
> > vector directed to the cpu core indicated by the affinity hint.
> > 
> > Signed-off-by: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
> 
> Sagi, Bart,
> 
> Did we reach a conclusion on this? Is v3 the series to take, and
> should it all go through the RDMA tree? It looks like there are some
> missing acks for that??
> 
> I think there was also an unapplied comment from Bart in the
> patchworks notes...
> 
> Could you please add some commit messages when you resend it? Not sure
> I should be accepting such large commits with empty messages???

Hearing nothing, I've dropped this series off patchworks. Please
resend it when you are ready.

Jason

^ permalink raw reply	[flat|nested] 92+ messages in thread

end of thread, other threads:[~2018-01-03 17:47 UTC | newest]

Thread overview: 92+ messages
-- links below jump to the message on this page --
2017-11-08  9:57 [PATCH v3 0/9] Introduce per-device completion queue pools Sagi Grimberg
2017-11-08  9:57 ` Sagi Grimberg
     [not found] ` <20171108095742.25365-1-sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-11-08  9:57   ` [PATCH v3 1/9] RDMA/core: Add implicit " Sagi Grimberg
2017-11-08  9:57     ` Sagi Grimberg
     [not found]     ` <20171108095742.25365-2-sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-11-09 10:45       ` Max Gurtovoy
2017-11-09 10:45         ` Max Gurtovoy
     [not found]         ` <fe2c23b7-2923-2693-28c3-6a6f399bda26-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2017-11-09 17:31           ` Sagi Grimberg
2017-11-09 17:31             ` Sagi Grimberg
     [not found]             ` <23b598f2-6982-0d15-69e4-c526c627ec33-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-11-09 17:33               ` Bart Van Assche
2017-11-09 17:33                 ` Bart Van Assche
     [not found]                 ` <1510248814.2608.19.camel-Sjgp3cTcYWE@public.gmane.org>
2017-11-13 20:28                   ` Sagi Grimberg
2017-11-13 20:28                     ` Sagi Grimberg
2017-11-14 16:28       ` Bart Van Assche
2017-11-14 16:28         ` Bart Van Assche
     [not found]         ` <1510676885.2280.9.camel-Sjgp3cTcYWE@public.gmane.org>
2017-11-20 12:31           ` Sagi Grimberg
2017-11-20 12:31             ` Sagi Grimberg
2017-12-11 23:50       ` [v3,1/9] " Jason Gunthorpe
2017-12-11 23:50         ` Jason Gunthorpe
     [not found]         ` <20171211235044.GA32331-uk2M96/98Pc@public.gmane.org>
2018-01-03 17:47           ` Jason Gunthorpe
2018-01-03 17:47             ` Jason Gunthorpe
2017-11-08  9:57   ` [PATCH v3 2/9] IB/isert: use implicit CQ allocation Sagi Grimberg
2017-11-08  9:57     ` Sagi Grimberg
2017-11-08 10:27     ` Nicholas A. Bellinger
2017-11-08 10:27       ` Nicholas A. Bellinger
2017-11-08 10:27       ` Nicholas A. Bellinger
     [not found]     ` <20171108095742.25365-3-sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-11-14  9:14       ` Max Gurtovoy
2017-11-14  9:14         ` Max Gurtovoy
2017-11-08  9:57   ` [PATCH v3 3/9] IB/iser: " Sagi Grimberg
2017-11-08  9:57     ` Sagi Grimberg
2017-11-08 10:25     ` Nicholas A. Bellinger
2017-11-08 10:25       ` Nicholas A. Bellinger
2017-11-08 10:25       ` Nicholas A. Bellinger
2017-11-08  9:57   ` [PATCH v3 4/9] IB/srpt: " Sagi Grimberg
2017-11-08  9:57     ` Sagi Grimberg
2017-11-08  9:57   ` [PATCH v3 5/9] svcrdma: Use RDMA core " Sagi Grimberg
2017-11-08  9:57     ` Sagi Grimberg
2017-11-08  9:57   ` [PATCH v3 6/9] nvme-rdma: use " Sagi Grimberg
2017-11-08  9:57     ` Sagi Grimberg
2017-11-08  9:57   ` [PATCH v3 7/9] nvmet-rdma: " Sagi Grimberg
2017-11-08  9:57     ` Sagi Grimberg
2017-11-08  9:57   ` [PATCH v3 8/9] nvmet: allow assignment of a cpulist for each nvmet port Sagi Grimberg
2017-11-08  9:57     ` Sagi Grimberg
2017-11-08  9:57   ` [PATCH v3 9/9] nvmet-rdma: assign cq completion vector based on the port allowed cpus Sagi Grimberg
2017-11-08  9:57     ` Sagi Grimberg
2017-11-08 16:42   ` [PATCH v3 0/9] Introduce per-device completion queue pools Chuck Lever
2017-11-08 16:42     ` Chuck Lever
     [not found]     ` <F7FAF7FD-AE2C-4428-A779-06C9768E0C73-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2017-11-09 17:06       ` Sagi Grimberg
2017-11-09 17:06         ` Sagi Grimberg
     [not found]         ` <b502f211-c477-3acd-be01-eaf645edc117-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-11-10 19:27           ` Chuck Lever
2017-11-10 19:27             ` Chuck Lever
     [not found]             ` <4B49C591-8671-4683-A437-215BAA6B56CD-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2017-11-13 20:47               ` Sagi Grimberg
2017-11-13 20:47                 ` Sagi Grimberg
     [not found]                 ` <a79c3899-5599-aa7e-5c5b-75ec98cdc490-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-11-13 22:15                   ` Chuck Lever
2017-11-13 22:15                     ` Chuck Lever
     [not found]                     ` <A6E320D9-63F4-4FFE-A5E2-EB3ED19D0CB3-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2017-11-20 12:08                       ` Sagi Grimberg
2017-11-20 12:08                         ` Sagi Grimberg
     [not found]                         ` <b4493cd9-62d6-c855-2b55-9749df7fa90d-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-11-20 15:54                           ` Chuck Lever
2017-11-20 15:54                             ` Chuck Lever
2017-11-09 16:42   ` Bart Van Assche
2017-11-09 16:42     ` Bart Van Assche
     [not found]     ` <1510245771.2608.6.camel-Sjgp3cTcYWE@public.gmane.org>
2017-11-09 17:22       ` Sagi Grimberg
2017-11-09 17:22         ` Sagi Grimberg
     [not found]         ` <d5820b65-b428-7cd1-f7f1-7d8d89d21cf5-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-11-09 17:31           ` Bart Van Assche
2017-11-09 17:31             ` Bart Van Assche
     [not found]             ` <1510248716.2608.17.camel-Sjgp3cTcYWE@public.gmane.org>
2017-11-13 20:31               ` Sagi Grimberg
2017-11-13 20:31                 ` Sagi Grimberg
     [not found]                 ` <6af627c8-2814-d562-21a2-f10788e44458-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-11-13 20:34                   ` Jason Gunthorpe
2017-11-13 20:34                     ` Jason Gunthorpe
     [not found]                     ` <20171113203444.GJ22610-uk2M96/98Pc@public.gmane.org>
2017-11-13 20:48                       ` Sagi Grimberg
2017-11-13 20:48                         ` Sagi Grimberg
     [not found]                         ` <75a7cfe3-58e0-b416-0b5b-beba1e8207fd-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-11-14  2:48                           ` Jason Gunthorpe
2017-11-14  2:48                             ` Jason Gunthorpe
     [not found]                             ` <20171114024822.GN22610-uk2M96/98Pc@public.gmane.org>
2017-11-20 12:10                               ` Sagi Grimberg
2017-11-20 12:10                                 ` Sagi Grimberg
     [not found]                                 ` <fd911816-576f-5a8e-eb41-b9081c960685-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-11-20 19:24                                   ` Jason Gunthorpe
2017-11-20 19:24                                     ` Jason Gunthorpe
     [not found]                                     ` <20171120192402.GK29075-uk2M96/98Pc@public.gmane.org>
2017-11-20 21:29                                       ` Bart Van Assche
2017-11-20 21:29                                         ` Bart Van Assche
2017-11-14 16:21                   ` Bart Van Assche
2017-11-14 16:21                     ` Bart Van Assche
     [not found]                     ` <1510676487.2280.4.camel-Sjgp3cTcYWE@public.gmane.org>
2017-11-20 12:26                       ` Sagi Grimberg
2017-11-20 12:26                         ` Sagi Grimberg
2017-11-14 10:06               ` Max Gurtovoy
2017-11-14 10:06                 ` Max Gurtovoy
     [not found]                 ` <1abee653-425d-940a-a51c-7fcd43af7203-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2017-11-20 12:20                   ` Sagi Grimberg
2017-11-20 12:20                     ` Sagi Grimberg
2017-11-09 18:52           ` Leon Romanovsky
2017-11-09 18:52             ` Leon Romanovsky
     [not found]             ` <20171109185239.GL18825-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
2017-11-13 20:32               ` Sagi Grimberg
2017-11-13 20:32                 ` Sagi Grimberg
2017-11-13 22:11   ` Doug Ledford
2017-11-13 22:11     ` Doug Ledford
