Linux-RDMA Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH V3 0/4] Introducing RDMA shared CQ pool
@ 2020-05-19 12:43 Yamin Friedman
  2020-05-19 12:43 ` [PATCH V3 1/4] RDMA/core: Add protection for shared CQs used by ULPs Yamin Friedman
                   ` (4 more replies)
  0 siblings, 5 replies; 22+ messages in thread
From: Yamin Friedman @ 2020-05-19 12:43 UTC (permalink / raw)
  To: Jason Gunthorpe, Sagi Grimberg, Or Gerlitz, Leon Romanovsky
  Cc: linux-rdma, Yamin Friedman

This is the fourth re-incarnation of the CQ pool patches proposed
by Sagi and Christoph. I have started with the patches that Sagi last
submitted and built the CQ pool as a new API for acquiring shared CQs.

The main change from Sagi's last proposal is that I have simplified the
method that ULP drivers interact with the CQ pool. Instead of calling 
ib_alloc_cq they now call ib_cq_pool_get but use the CQ in the same manner
that they did before. This allows for a much easier transition to using
shared CQs by the ULP and makes it easier to deal with IB_POLL_DIRECT
contexts. Certain types of actions on CQs have been prevented on shared
CQs in order to prevent one user from harming another.

Our ULPs often want to make smart decisions on completion vector
affinitization when using multiple completion queues spread on
multiple cpu cores. We can see examples for this in iser, srp, nvme-rdma.

This patch set attempts to move this smartness to the rdma core by
introducing per-device CQ pools that by definition spread
across cpu cores. In addition, we completely make the completion
queue allocation transparent to the ULP by adding affinity hints
to create_qp which tells the rdma core to select (or allocate)
a completion queue that has the needed affinity for it.

This API gives us a similar approach to whats used in the networking
stack where the device completion queues are hidden from the application.
With the affinitization hints, we also do not compromise performance
as the completion queue will be affinitized correctly.

One thing that should be noticed is that now different ULPs using this
API may share completion queues (given that they use the same polling context).
However, even without this API they share interrupt vectors (and CPUs
that are assigned to them). Thus aggregating consumers on less completion
queues will result in better overall completion processing efficiency per
completion event (or interrupt).

An advantage of this method of using the CQ pool is that changes in the ULP
driver are minimal (around 14 altered lines of code).

The patch set converts nvme-rdma and nvmet-rdma to use the new API.

Test results can be found in patch-0002.

Comments and feedback are welcome.

Changes since v2
----------------

*Minor code refactoring

Changes since v1
----------------

*Simplified cq pool shutdown process
*Renamed cq pool functions to be like mr pool
*Simplified process for finding cqs in pool
*Removed unhelpful WARN prints
*Removed one liner functions
*Replaced cq_type with boolean shared
*Updated test results to more properly show effect of patch
*Minor bug fixes

Yamin Friedman (4):
  RDMA/core: Add protection for shared CQs used by ULPs
  RDMA/core: Introduce shared CQ pool API
  nvme-rdma: use new shared CQ mechanism
  nvmet-rdma: use new shared CQ mechanism

 drivers/infiniband/core/core_priv.h |   3 +
 drivers/infiniband/core/cq.c        | 144 ++++++++++++++++++++++++++++++++++++
 drivers/infiniband/core/device.c    |   2 +
 drivers/infiniband/core/verbs.c     |   9 +++
 drivers/nvme/host/rdma.c            |  75 ++++++++++++-------
 drivers/nvme/target/rdma.c          |  14 ++--
 include/rdma/ib_verbs.h             |  38 ++++++++++
 7 files changed, 253 insertions(+), 32 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH V3 1/4] RDMA/core: Add protection for shared CQs used by ULPs
  2020-05-19 12:43 [PATCH V3 0/4] Introducing RDMA shared CQ pool Yamin Friedman
@ 2020-05-19 12:43 ` Yamin Friedman
  2020-05-19 12:43 ` [PATCH V3 2/4] RDMA/core: Introduce shared CQ pool API Yamin Friedman
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 22+ messages in thread
From: Yamin Friedman @ 2020-05-19 12:43 UTC (permalink / raw)
  To: Jason Gunthorpe, Sagi Grimberg, Or Gerlitz, Leon Romanovsky
  Cc: linux-rdma, Yamin Friedman

A pre-step for adding shared CQs. Add the infrastructure to prevent
shared CQ users from altering the CQ configurations. For now all cqs are
marked as private (non-shared). The core driver should use the new force
functions to perform resize/destroy/moderation changes that are not
allowed for users of shared CQs.

Signed-off-by: Yamin Friedman <yaminf@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
---
 drivers/infiniband/core/verbs.c | 9 +++++++++
 include/rdma/ib_verbs.h         | 3 +++
 2 files changed, 12 insertions(+)

diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index bf0249f..675f601 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -1990,6 +1990,9 @@ struct ib_cq *__ib_create_cq(struct ib_device *device,
 
 int rdma_set_cq_moderation(struct ib_cq *cq, u16 cq_count, u16 cq_period)
 {
+	if (cq->shared)
+		return -EOPNOTSUPP;
+
 	return cq->device->ops.modify_cq ?
 		cq->device->ops.modify_cq(cq, cq_count,
 					  cq_period) : -EOPNOTSUPP;
@@ -1998,6 +2001,9 @@ int rdma_set_cq_moderation(struct ib_cq *cq, u16 cq_count, u16 cq_period)
 
 int ib_destroy_cq_user(struct ib_cq *cq, struct ib_udata *udata)
 {
+	if (WARN_ON_ONCE(cq->shared))
+		return -EOPNOTSUPP;
+
 	if (atomic_read(&cq->usecnt))
 		return -EBUSY;
 
@@ -2010,6 +2016,9 @@ int ib_destroy_cq_user(struct ib_cq *cq, struct ib_udata *udata)
 
 int ib_resize_cq(struct ib_cq *cq, int cqe)
 {
+	if (cq->shared)
+		return -EOPNOTSUPP;
+
 	return cq->device->ops.resize_cq ?
 		cq->device->ops.resize_cq(cq, cqe, NULL) : -EOPNOTSUPP;
 }
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 4c488ca..1659131 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1582,6 +1582,7 @@ struct ib_cq {
 	 * Implementation details of the RDMA core, don't use in drivers:
 	 */
 	struct rdma_restrack_entry res;
+	u8 shared:1;
 };
 
 struct ib_srq {
@@ -3824,6 +3825,8 @@ static inline struct ib_cq *ib_alloc_cq_any(struct ib_device *dev,
  * ib_free_cq_user - Free kernel/user CQ
  * @cq: The CQ to free
  * @udata: Valid user data or NULL for kernel objects
+ *
+ * NOTE: This function shouldn't be called on shared CQs.
  */
 void ib_free_cq_user(struct ib_cq *cq, struct ib_udata *udata);
 
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH V3 2/4] RDMA/core: Introduce shared CQ pool API
  2020-05-19 12:43 [PATCH V3 0/4] Introducing RDMA shared CQ pool Yamin Friedman
  2020-05-19 12:43 ` [PATCH V3 1/4] RDMA/core: Add protection for shared CQs used by ULPs Yamin Friedman
@ 2020-05-19 12:43 ` Yamin Friedman
  2020-05-20  6:19   ` Devesh Sharma
                     ` (3 more replies)
  2020-05-19 12:43 ` [PATCH V3 3/4] nvme-rdma: use new shared CQ mechanism Yamin Friedman
                   ` (2 subsequent siblings)
  4 siblings, 4 replies; 22+ messages in thread
From: Yamin Friedman @ 2020-05-19 12:43 UTC (permalink / raw)
  To: Jason Gunthorpe, Sagi Grimberg, Or Gerlitz, Leon Romanovsky
  Cc: linux-rdma, Yamin Friedman

Allow a ULP to ask the core to provide a completion queue based on a
least-used search on a per-device CQ pools. The device CQ pools grow in a
lazy fashion when more CQs are requested.

This feature reduces the amount of interrupts when using many QPs.
Using shared CQs allows for more effcient completion handling. It also
reduces the amount of overhead needed for CQ contexts.

Test setup:
Intel(R) Xeon(R) Platinum 8176M CPU @ 2.10GHz servers.
Running NVMeoF 4KB read IOs over ConnectX-5EX across Spectrum switch.
TX-depth = 32. The patch was applied in the nvme driver on both the target
and initiator. Four controllers are accessed from each core. In the
current test case we have exposed sixteen NVMe namespaces using four
different subsystems (four namespaces per subsystem) from one NVM port.
Each controller allocated X queues (RDMA QPs) and attached to Y CQs.
Before this series we had X == Y, i.e for four controllers we've created
total of 4X QPs and 4X CQs. In the shared case, we've created 4X QPs and
only X CQs which means that we have four controllers that share a
completion queue per core. Until fourteen cores there is no significant
change in performance and the number of interrupts per second is less than
a million in the current case.
==================================================
|Cores|Current KIOPs  |Shared KIOPs  |improvement|
|-----|---------------|--------------|-----------|
|14   |2332           |2723          |16.7%      |
|-----|---------------|--------------|-----------|
|20   |2086           |2712          |30%        |
|-----|---------------|--------------|-----------|
|28   |1971           |2669          |35.4%      |
|=================================================
|Cores|Current avg lat|Shared avg lat|improvement|
|-----|---------------|--------------|-----------|
|14   |767us          |657us         |14.3%      |
|-----|---------------|--------------|-----------|
|20   |1225us         |943us         |23%        |
|-----|---------------|--------------|-----------|
|28   |1816us         |1341us        |26.1%      |
========================================================
|Cores|Current interrupts|Shared interrupts|improvement|
|-----|------------------|-----------------|-----------|
|14   |1.6M/sec          |0.4M/sec         |72%        |
|-----|------------------|-----------------|-----------|
|20   |2.8M/sec          |0.6M/sec         |72.4%      |
|-----|------------------|-----------------|-----------|
|28   |2.9M/sec          |0.8M/sec         |63.4%      |
====================================================================
|Cores|Current 99.99th PCTL lat|Shared 99.99th PCTL lat|improvement|
|-----|------------------------|-----------------------|-----------|
|14   |67ms                    |6ms                    |90.9%      |
|-----|------------------------|-----------------------|-----------|
|20   |5ms                     |6ms                    |-10%       |
|-----|------------------------|-----------------------|-----------|
|28   |8.7ms                   |6ms                    |25.9%      |
|===================================================================

Performance improvement with sixteen disks (sixteen CQs per core) is
comparable.

Signed-off-by: Yamin Friedman <yaminf@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
---
 drivers/infiniband/core/core_priv.h |   3 +
 drivers/infiniband/core/cq.c        | 144 ++++++++++++++++++++++++++++++++++++
 drivers/infiniband/core/device.c    |   2 +
 include/rdma/ib_verbs.h             |  35 +++++++++
 4 files changed, 184 insertions(+)

diff --git a/drivers/infiniband/core/core_priv.h b/drivers/infiniband/core/core_priv.h
index cf42acc..a1e6a67 100644
--- a/drivers/infiniband/core/core_priv.h
+++ b/drivers/infiniband/core/core_priv.h
@@ -414,4 +414,7 @@ void rdma_umap_priv_init(struct rdma_umap_priv *priv,
 			 struct vm_area_struct *vma,
 			 struct rdma_user_mmap_entry *entry);
 
+void ib_cq_pool_init(struct ib_device *dev);
+void ib_cq_pool_destroy(struct ib_device *dev);
+
 #endif /* _CORE_PRIV_H */
diff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c
index 4f25b24..7175295 100644
--- a/drivers/infiniband/core/cq.c
+++ b/drivers/infiniband/core/cq.c
@@ -7,7 +7,11 @@
 #include <linux/slab.h>
 #include <rdma/ib_verbs.h>
 
+#include "core_priv.h"
+
 #include <trace/events/rdma_core.h>
+/* Max size for shared CQ, may require tuning */
+#define IB_MAX_SHARED_CQ_SZ		4096
 
 /* # of WCs to poll for with a single call to ib_poll_cq */
 #define IB_POLL_BATCH			16
@@ -218,6 +222,7 @@ struct ib_cq *__ib_alloc_cq_user(struct ib_device *dev, void *private,
 	cq->cq_context = private;
 	cq->poll_ctx = poll_ctx;
 	atomic_set(&cq->usecnt, 0);
+	cq->comp_vector = comp_vector;
 
 	cq->wc = kmalloc_array(IB_POLL_BATCH, sizeof(*cq->wc), GFP_KERNEL);
 	if (!cq->wc)
@@ -309,6 +314,8 @@ void ib_free_cq_user(struct ib_cq *cq, struct ib_udata *udata)
 {
 	if (WARN_ON_ONCE(atomic_read(&cq->usecnt)))
 		return;
+	if (WARN_ON_ONCE(cq->cqe_used))
+		return;
 
 	switch (cq->poll_ctx) {
 	case IB_POLL_DIRECT:
@@ -334,3 +341,140 @@ void ib_free_cq_user(struct ib_cq *cq, struct ib_udata *udata)
 	kfree(cq);
 }
 EXPORT_SYMBOL(ib_free_cq_user);
+
+void ib_cq_pool_init(struct ib_device *dev)
+{
+	int i;
+
+	spin_lock_init(&dev->cq_pools_lock);
+	for (i = 0; i < ARRAY_SIZE(dev->cq_pools); i++)
+		INIT_LIST_HEAD(&dev->cq_pools[i]);
+}
+
+void ib_cq_pool_destroy(struct ib_device *dev)
+{
+	struct ib_cq *cq, *n;
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(dev->cq_pools); i++) {
+		list_for_each_entry_safe(cq, n, &dev->cq_pools[i],
+					 pool_entry) {
+			cq->shared = false;
+			ib_free_cq_user(cq, NULL);
+		}
+	}
+
+}
+
+static int ib_alloc_cqs(struct ib_device *dev, int nr_cqes,
+			enum ib_poll_context poll_ctx)
+{
+	LIST_HEAD(tmp_list);
+	struct ib_cq *cq;
+	unsigned long flags;
+	int nr_cqs, ret, i;
+
+	/*
+	 * Allocated at least as many CQEs as requested, and otherwise
+	 * a reasonable batch size so that we can share CQs between
+	 * multiple users instead of allocating a larger number of CQs.
+	 */
+	nr_cqes = min(dev->attrs.max_cqe, max(nr_cqes, IB_MAX_SHARED_CQ_SZ));
+	nr_cqs = min_t(int, dev->num_comp_vectors, num_online_cpus());
+	for (i = 0; i < nr_cqs; i++) {
+		cq = ib_alloc_cq(dev, NULL, nr_cqes, i, poll_ctx);
+		if (IS_ERR(cq)) {
+			ret = PTR_ERR(cq);
+			goto out_free_cqs;
+		}
+		cq->shared = true;
+		list_add_tail(&cq->pool_entry, &tmp_list);
+	}
+
+	spin_lock_irqsave(&dev->cq_pools_lock, flags);
+	list_splice(&tmp_list, &dev->cq_pools[poll_ctx - 1]);
+	spin_unlock_irqrestore(&dev->cq_pools_lock, flags);
+
+	return 0;
+
+out_free_cqs:
+	list_for_each_entry(cq, &tmp_list, pool_entry) {
+		cq->shared = false;
+		ib_free_cq(cq);
+	}
+	return ret;
+}
+
+struct ib_cq *ib_cq_pool_get(struct ib_device *dev, unsigned int nr_cqe,
+			     int comp_vector_hint,
+			     enum ib_poll_context poll_ctx)
+{
+	static unsigned int default_comp_vector;
+	int vector, ret, num_comp_vectors;
+	struct ib_cq *cq, *found = NULL;
+	unsigned long flags;
+
+	if (poll_ctx > ARRAY_SIZE(dev->cq_pools) || poll_ctx == IB_POLL_DIRECT)
+		return ERR_PTR(-EINVAL);
+
+	num_comp_vectors = min_t(int, dev->num_comp_vectors,
+				 num_online_cpus());
+	/* Project the affinty to the device completion vector range */
+	if (comp_vector_hint < 0)
+		vector = default_comp_vector++ % num_comp_vectors;
+	else
+		vector = comp_vector_hint % num_comp_vectors;
+
+	/*
+	 * Find the least used CQ with correct affinity and
+	 * enough free CQ entries
+	 */
+	while (!found) {
+		spin_lock_irqsave(&dev->cq_pools_lock, flags);
+		list_for_each_entry(cq, &dev->cq_pools[poll_ctx - 1],
+				    pool_entry) {
+			/*
+			 * Check to see if we have found a CQ with the
+			 * correct completion vector
+			 */
+			if (vector != cq->comp_vector)
+				continue;
+			if (cq->cqe_used + nr_cqe > cq->cqe)
+				continue;
+			found = cq;
+			break;
+		}
+
+		if (found) {
+			found->cqe_used += nr_cqe;
+			spin_unlock_irqrestore(&dev->cq_pools_lock, flags);
+
+			return found;
+		}
+		spin_unlock_irqrestore(&dev->cq_pools_lock, flags);
+
+		/*
+		 * Didn't find a match or ran out of CQs in the device
+		 * pool, allocate a new array of CQs.
+		 */
+		ret = ib_alloc_cqs(dev, nr_cqe, poll_ctx);
+		if (ret)
+			return ERR_PTR(ret);
+	}
+
+	return found;
+}
+EXPORT_SYMBOL(ib_cq_pool_get);
+
+void ib_cq_pool_put(struct ib_cq *cq, unsigned int nr_cqe)
+{
+	unsigned long flags;
+
+	if (WARN_ON_ONCE(nr_cqe > cq->cqe_used))
+		return;
+
+	spin_lock_irqsave(&cq->device->cq_pools_lock, flags);
+	cq->cqe_used -= nr_cqe;
+	spin_unlock_irqrestore(&cq->device->cq_pools_lock, flags);
+}
+EXPORT_SYMBOL(ib_cq_pool_put);
diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index d9f565a..0966f86 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -1418,6 +1418,7 @@ int ib_register_device(struct ib_device *device, const char *name)
 		device->ops.dealloc_driver = dealloc_fn;
 		return ret;
 	}
+	ib_cq_pool_init(device);
 	ib_device_put(device);
 
 	return 0;
@@ -1446,6 +1447,7 @@ static void __ib_unregister_device(struct ib_device *ib_dev)
 	if (!refcount_read(&ib_dev->refcount))
 		goto out;
 
+	ib_cq_pool_destroy(ib_dev);
 	disable_device(ib_dev);
 
 	/* Expedite removing unregistered pointers from the hash table */
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 1659131..d40604a 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1555,6 +1555,7 @@ enum ib_poll_context {
 	IB_POLL_SOFTIRQ,	   /* poll from softirq context */
 	IB_POLL_WORKQUEUE,	   /* poll from workqueue */
 	IB_POLL_UNBOUND_WORKQUEUE, /* poll from unbound workqueue */
+	IB_POLL_LAST,
 };
 
 struct ib_cq {
@@ -1564,9 +1565,12 @@ struct ib_cq {
 	void                  (*event_handler)(struct ib_event *, void *);
 	void                   *cq_context;
 	int               	cqe;
+	int			cqe_used;
 	atomic_t          	usecnt; /* count number of work queues */
 	enum ib_poll_context	poll_ctx;
+	int                     comp_vector;
 	struct ib_wc		*wc;
+	struct list_head        pool_entry;
 	union {
 		struct irq_poll		iop;
 		struct work_struct	work;
@@ -2695,6 +2699,10 @@ struct ib_device {
 #endif
 
 	u32                          index;
+
+	spinlock_t                   cq_pools_lock;
+	struct list_head             cq_pools[IB_POLL_LAST - 1];
+
 	struct rdma_restrack_root *res;
 
 	const struct uapi_definition   *driver_def;
@@ -3952,6 +3960,33 @@ static inline int ib_req_notify_cq(struct ib_cq *cq,
 	return cq->device->ops.req_notify_cq(cq, flags);
 }
 
+/*
+ * ib_cq_pool_get() - Find the least used completion queue that matches
+ *     a given cpu hint (or least used for wild card affinity)
+ *     and fits nr_cqe
+ * @dev: rdma device
+ * @nr_cqe: number of needed cqe entries
+ * @comp_vector_hint: completion vector hint (-1) for the driver to assign
+ *   a comp vector based on internal counter
+ * @poll_ctx: cq polling context
+ *
+ * Finds a cq that satisfies @comp_vector_hint and @nr_cqe requirements and
+ * claim entries in it for us. In case there is no available cq, allocate
+ * a new cq with the requirements and add it to the device pool.
+ * IB_POLL_DIRECT cannot be used for shared cqs so it is not a valid value
+ * for @poll_ctx.
+ */
+struct ib_cq *ib_cq_pool_get(struct ib_device *dev, unsigned int nr_cqe,
+			     int comp_vector_hint,
+			     enum ib_poll_context poll_ctx);
+
+/**
+ * ib_cq_pool_put - Return a CQ taken from a shared pool.
+ * @cq: The CQ to return.
+ * @nr_cqe: The max number of cqes that the user had requested.
+ */
+void ib_cq_pool_put(struct ib_cq *cq, unsigned int nr_cqe);
+
 /**
  * ib_req_ncomp_notif - Request completion notification when there are
  *   at least the specified number of unreaped completions on the CQ.
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH V3 3/4] nvme-rdma: use new shared CQ mechanism
  2020-05-19 12:43 [PATCH V3 0/4] Introducing RDMA shared CQ pool Yamin Friedman
  2020-05-19 12:43 ` [PATCH V3 1/4] RDMA/core: Add protection for shared CQs used by ULPs Yamin Friedman
  2020-05-19 12:43 ` [PATCH V3 2/4] RDMA/core: Introduce shared CQ pool API Yamin Friedman
@ 2020-05-19 12:43 ` Yamin Friedman
  2020-05-19 12:43 ` [PATCH V3 4/4] nvmet-rdma: " Yamin Friedman
  2020-05-20  7:03 ` [PATCH V3 0/4] Introducing RDMA shared CQ pool Sagi Grimberg
  4 siblings, 0 replies; 22+ messages in thread
From: Yamin Friedman @ 2020-05-19 12:43 UTC (permalink / raw)
  To: Jason Gunthorpe, Sagi Grimberg, Or Gerlitz, Leon Romanovsky
  Cc: linux-rdma, Yamin Friedman

Has the driver use shared CQs providing ~10%-20% improvement as seen in the
patch introducing shared CQs. Instead of opening a CQ for each QP per
controller connected, a CQ for each QP will be provided by the RDMA core
driver that will be shared between the QPs on that core reducing interrupt
overhead.

Signed-off-by: Yamin Friedman <yaminf@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
---
 drivers/nvme/host/rdma.c | 75 ++++++++++++++++++++++++++++++++----------------
 1 file changed, 50 insertions(+), 25 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index cac8a93..83d5f29 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -85,6 +85,7 @@ struct nvme_rdma_queue {
 	struct rdma_cm_id	*cm_id;
 	int			cm_error;
 	struct completion	cm_done;
+	int			cq_size;
 };
 
 struct nvme_rdma_ctrl {
@@ -261,6 +262,7 @@ static int nvme_rdma_create_qp(struct nvme_rdma_queue *queue, const int factor)
 	init_attr.qp_type = IB_QPT_RC;
 	init_attr.send_cq = queue->ib_cq;
 	init_attr.recv_cq = queue->ib_cq;
+	init_attr.qp_context = queue;
 
 	ret = rdma_create_qp(queue->cm_id, dev->pd, &init_attr);
 
@@ -389,6 +391,14 @@ static int nvme_rdma_dev_get(struct nvme_rdma_device *dev)
 	return NULL;
 }
 
+static void nvme_rdma_free_cq(struct nvme_rdma_queue *queue)
+{
+	if (nvme_rdma_poll_queue(queue))
+		ib_free_cq(queue->ib_cq);
+	else
+		ib_cq_pool_put(queue->ib_cq, queue->cq_size);
+}
+
 static void nvme_rdma_destroy_queue_ib(struct nvme_rdma_queue *queue)
 {
 	struct nvme_rdma_device *dev;
@@ -408,7 +418,7 @@ static void nvme_rdma_destroy_queue_ib(struct nvme_rdma_queue *queue)
 	 * the destruction of the QP shouldn't use rdma_cm API.
 	 */
 	ib_destroy_qp(queue->qp);
-	ib_free_cq(queue->ib_cq);
+	nvme_rdma_free_cq(queue);
 
 	nvme_rdma_free_ring(ibdev, queue->rsp_ring, queue->queue_size,
 			sizeof(struct nvme_completion), DMA_FROM_DEVICE);
@@ -422,13 +432,42 @@ static int nvme_rdma_get_max_fr_pages(struct ib_device *ibdev)
 		     ibdev->attrs.max_fast_reg_page_list_len - 1);
 }
 
+static int nvme_rdma_create_cq(struct ib_device *ibdev,
+				struct nvme_rdma_queue *queue)
+{
+	int ret, comp_vector, idx = nvme_rdma_queue_idx(queue);
+	enum ib_poll_context poll_ctx;
+
+	/*
+	 * Spread I/O queues completion vectors according their queue index.
+	 * Admin queues can always go on completion vector 0.
+	 */
+	comp_vector = idx == 0 ? idx : idx - 1;
+
+	/* Polling queues need direct cq polling context */
+	if (nvme_rdma_poll_queue(queue)) {
+		poll_ctx = IB_POLL_DIRECT;
+		queue->ib_cq = ib_alloc_cq(ibdev, queue, queue->cq_size,
+					   comp_vector, poll_ctx);
+	} else {
+		poll_ctx = IB_POLL_SOFTIRQ;
+		queue->ib_cq = ib_cq_pool_get(ibdev, queue->cq_size,
+					      comp_vector, poll_ctx);
+	}
+
+	if (IS_ERR(queue->ib_cq)) {
+		ret = PTR_ERR(queue->ib_cq);
+		return ret;
+	}
+
+	return 0;
+}
+
 static int nvme_rdma_create_queue_ib(struct nvme_rdma_queue *queue)
 {
 	struct ib_device *ibdev;
 	const int send_wr_factor = 3;			/* MR, SEND, INV */
 	const int cq_factor = send_wr_factor + 1;	/* + RECV */
-	int comp_vector, idx = nvme_rdma_queue_idx(queue);
-	enum ib_poll_context poll_ctx;
 	int ret, pages_per_mr;
 
 	queue->device = nvme_rdma_find_get_device(queue->cm_id);
@@ -439,26 +478,12 @@ static int nvme_rdma_create_queue_ib(struct nvme_rdma_queue *queue)
 	}
 	ibdev = queue->device->dev;
 
-	/*
-	 * Spread I/O queues completion vectors according their queue index.
-	 * Admin queues can always go on completion vector 0.
-	 */
-	comp_vector = idx == 0 ? idx : idx - 1;
-
-	/* Polling queues need direct cq polling context */
-	if (nvme_rdma_poll_queue(queue))
-		poll_ctx = IB_POLL_DIRECT;
-	else
-		poll_ctx = IB_POLL_SOFTIRQ;
-
 	/* +1 for ib_stop_cq */
-	queue->ib_cq = ib_alloc_cq(ibdev, queue,
-				cq_factor * queue->queue_size + 1,
-				comp_vector, poll_ctx);
-	if (IS_ERR(queue->ib_cq)) {
-		ret = PTR_ERR(queue->ib_cq);
+	queue->cq_size = cq_factor * queue->queue_size + 1;
+
+	ret = nvme_rdma_create_cq(ibdev, queue);
+	if (ret)
 		goto out_put_dev;
-	}
 
 	ret = nvme_rdma_create_qp(queue, send_wr_factor);
 	if (ret)
@@ -484,7 +509,7 @@ static int nvme_rdma_create_queue_ib(struct nvme_rdma_queue *queue)
 	if (ret) {
 		dev_err(queue->ctrl->ctrl.device,
 			"failed to initialize MR pool sized %d for QID %d\n",
-			queue->queue_size, idx);
+			queue->queue_size, nvme_rdma_queue_idx(queue));
 		goto out_destroy_ring;
 	}
 
@@ -498,7 +523,7 @@ static int nvme_rdma_create_queue_ib(struct nvme_rdma_queue *queue)
 out_destroy_qp:
 	rdma_destroy_qp(queue->cm_id);
 out_destroy_ib_cq:
-	ib_free_cq(queue->ib_cq);
+	nvme_rdma_free_cq(queue);
 out_put_dev:
 	nvme_rdma_dev_put(queue->device);
 	return ret;
@@ -1093,7 +1118,7 @@ static void nvme_rdma_error_recovery(struct nvme_rdma_ctrl *ctrl)
 static void nvme_rdma_wr_error(struct ib_cq *cq, struct ib_wc *wc,
 		const char *op)
 {
-	struct nvme_rdma_queue *queue = cq->cq_context;
+	struct nvme_rdma_queue *queue = wc->qp->qp_context;
 	struct nvme_rdma_ctrl *ctrl = queue->ctrl;
 
 	if (ctrl->ctrl.state == NVME_CTRL_LIVE)
@@ -1481,7 +1506,7 @@ static void nvme_rdma_recv_done(struct ib_cq *cq, struct ib_wc *wc)
 {
 	struct nvme_rdma_qe *qe =
 		container_of(wc->wr_cqe, struct nvme_rdma_qe, cqe);
-	struct nvme_rdma_queue *queue = cq->cq_context;
+	struct nvme_rdma_queue *queue = wc->qp->qp_context;
 	struct ib_device *ibdev = queue->device->dev;
 	struct nvme_completion *cqe = qe->data;
 	const size_t len = sizeof(struct nvme_completion);
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH V3 4/4] nvmet-rdma: use new shared CQ mechanism
  2020-05-19 12:43 [PATCH V3 0/4] Introducing RDMA shared CQ pool Yamin Friedman
                   ` (2 preceding siblings ...)
  2020-05-19 12:43 ` [PATCH V3 3/4] nvme-rdma: use new shared CQ mechanism Yamin Friedman
@ 2020-05-19 12:43 ` Yamin Friedman
  2020-05-20  7:03 ` [PATCH V3 0/4] Introducing RDMA shared CQ pool Sagi Grimberg
  4 siblings, 0 replies; 22+ messages in thread
From: Yamin Friedman @ 2020-05-19 12:43 UTC (permalink / raw)
  To: Jason Gunthorpe, Sagi Grimberg, Or Gerlitz, Leon Romanovsky
  Cc: linux-rdma, Yamin Friedman

Has the driver use shared CQs providing ~10%-20% improvement when multiple
disks are used. Instead of opening a CQ for each QP per controller, a CQ
for each core will be provided by the RDMA core driver that will be shared
between the QPs on that core reducing interrupt overhead.

Signed-off-by: Yamin Friedman <yaminf@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/target/rdma.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index fd47de0..50e4c40 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -588,7 +588,7 @@ static void nvmet_rdma_read_data_done(struct ib_cq *cq, struct ib_wc *wc)
 {
 	struct nvmet_rdma_rsp *rsp =
 		container_of(wc->wr_cqe, struct nvmet_rdma_rsp, read_cqe);
-	struct nvmet_rdma_queue *queue = cq->cq_context;
+	struct nvmet_rdma_queue *queue = wc->qp->qp_context;
 
 	WARN_ON(rsp->n_rdma <= 0);
 	atomic_add(rsp->n_rdma, &queue->sq_wr_avail);
@@ -793,7 +793,7 @@ static void nvmet_rdma_recv_done(struct ib_cq *cq, struct ib_wc *wc)
 {
 	struct nvmet_rdma_cmd *cmd =
 		container_of(wc->wr_cqe, struct nvmet_rdma_cmd, cqe);
-	struct nvmet_rdma_queue *queue = cq->cq_context;
+	struct nvmet_rdma_queue *queue = wc->qp->qp_context;
 	struct nvmet_rdma_rsp *rsp;
 
 	if (unlikely(wc->status != IB_WC_SUCCESS)) {
@@ -995,9 +995,8 @@ static int nvmet_rdma_create_queue_ib(struct nvmet_rdma_queue *queue)
 	 */
 	nr_cqe = queue->recv_queue_size + 2 * queue->send_queue_size;
 
-	queue->cq = ib_alloc_cq(ndev->device, queue,
-			nr_cqe + 1, comp_vector,
-			IB_POLL_WORKQUEUE);
+	queue->cq = ib_cq_pool_get(ndev->device, nr_cqe + 1, comp_vector,
+				   IB_POLL_WORKQUEUE);
 	if (IS_ERR(queue->cq)) {
 		ret = PTR_ERR(queue->cq);
 		pr_err("failed to create CQ cqe= %d ret= %d\n",
@@ -1056,7 +1055,7 @@ static int nvmet_rdma_create_queue_ib(struct nvmet_rdma_queue *queue)
 err_destroy_qp:
 	rdma_destroy_qp(queue->cm_id);
 err_destroy_cq:
-	ib_free_cq(queue->cq);
+	ib_cq_pool_put(queue->cq, nr_cqe + 1);
 	goto out;
 }
 
@@ -1066,7 +1065,8 @@ static void nvmet_rdma_destroy_queue_ib(struct nvmet_rdma_queue *queue)
 	if (queue->cm_id)
 		rdma_destroy_id(queue->cm_id);
 	ib_destroy_qp(queue->qp);
-	ib_free_cq(queue->cq);
+	ib_cq_pool_put(queue->cq, queue->recv_queue_size + 2 *
+		       queue->send_queue_size + 1);
 }
 
 static void nvmet_rdma_free_queue(struct nvmet_rdma_queue *queue)
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V3 2/4] RDMA/core: Introduce shared CQ pool API
  2020-05-19 12:43 ` [PATCH V3 2/4] RDMA/core: Introduce shared CQ pool API Yamin Friedman
@ 2020-05-20  6:19   ` Devesh Sharma
  2020-05-20  9:23     ` Yamin Friedman
  2020-05-25 13:06   ` Yamin Friedman
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 22+ messages in thread
From: Devesh Sharma @ 2020-05-20  6:19 UTC (permalink / raw)
  To: Yamin Friedman
  Cc: Jason Gunthorpe, Sagi Grimberg, Or Gerlitz, Leon Romanovsky, linux-rdma

On Tue, May 19, 2020 at 6:13 PM Yamin Friedman <yaminf@mellanox.com> wrote:
>
> Allow a ULP to ask the core to provide a completion queue based on a
> least-used search on a per-device CQ pools. The device CQ pools grow in a
> lazy fashion when more CQs are requested.
>
> This feature reduces the amount of interrupts when using many QPs.
> Using shared CQs allows for more effcient completion handling. It also
> reduces the amount of overhead needed for CQ contexts.
>
> Test setup:
> Intel(R) Xeon(R) Platinum 8176M CPU @ 2.10GHz servers.
> Running NVMeoF 4KB read IOs over ConnectX-5EX across Spectrum switch.
> TX-depth = 32. The patch was applied in the nvme driver on both the target
> and initiator. Four controllers are accessed from each core. In the
> current test case we have exposed sixteen NVMe namespaces using four
> different subsystems (four namespaces per subsystem) from one NVM port.
> Each controller allocated X queues (RDMA QPs) and attached to Y CQs.
> Before this series we had X == Y, i.e for four controllers we've created
> total of 4X QPs and 4X CQs. In the shared case, we've created 4X QPs and
> only X CQs which means that we have four controllers that share a
> completion queue per core. Until fourteen cores there is no significant
> change in performance and the number of interrupts per second is less than
> a million in the current case.
> ==================================================
> |Cores|Current KIOPs  |Shared KIOPs  |improvement|
> |-----|---------------|--------------|-----------|
> |14   |2332           |2723          |16.7%      |
> |-----|---------------|--------------|-----------|
> |20   |2086           |2712          |30%        |
> |-----|---------------|--------------|-----------|
> |28   |1971           |2669          |35.4%      |
> |=================================================
> |Cores|Current avg lat|Shared avg lat|improvement|
> |-----|---------------|--------------|-----------|
> |14   |767us          |657us         |14.3%      |
> |-----|---------------|--------------|-----------|
> |20   |1225us         |943us         |23%        |
> |-----|---------------|--------------|-----------|
> |28   |1816us         |1341us        |26.1%      |
> ========================================================
> |Cores|Current interrupts|Shared interrupts|improvement|
> |-----|------------------|-----------------|-----------|
> |14   |1.6M/sec          |0.4M/sec         |72%        |
> |-----|------------------|-----------------|-----------|
> |20   |2.8M/sec          |0.6M/sec         |72.4%      |
> |-----|------------------|-----------------|-----------|
> |28   |2.9M/sec          |0.8M/sec         |63.4%      |
> ====================================================================
> |Cores|Current 99.99th PCTL lat|Shared 99.99th PCTL lat|improvement|
> |-----|------------------------|-----------------------|-----------|
> |14   |67ms                    |6ms                    |90.9%      |
> |-----|------------------------|-----------------------|-----------|
> |20   |5ms                     |6ms                    |-10%       |
> |-----|------------------------|-----------------------|-----------|
> |28   |8.7ms                   |6ms                    |25.9%      |
> |===================================================================
>
> Performance improvement with sixteen disks (sixteen CQs per core) is
> comparable.
>
> Signed-off-by: Yamin Friedman <yaminf@mellanox.com>
> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
> Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
> Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
> ---
>  drivers/infiniband/core/core_priv.h |   3 +
>  drivers/infiniband/core/cq.c        | 144 ++++++++++++++++++++++++++++++++++++
>  drivers/infiniband/core/device.c    |   2 +
>  include/rdma/ib_verbs.h             |  35 +++++++++
>  4 files changed, 184 insertions(+)
>
> diff --git a/drivers/infiniband/core/core_priv.h b/drivers/infiniband/core/core_priv.h
> index cf42acc..a1e6a67 100644
> --- a/drivers/infiniband/core/core_priv.h
> +++ b/drivers/infiniband/core/core_priv.h
> @@ -414,4 +414,7 @@ void rdma_umap_priv_init(struct rdma_umap_priv *priv,
>                          struct vm_area_struct *vma,
>                          struct rdma_user_mmap_entry *entry);
>
> +void ib_cq_pool_init(struct ib_device *dev);
> +void ib_cq_pool_destroy(struct ib_device *dev);
> +
>  #endif /* _CORE_PRIV_H */
> diff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c
> index 4f25b24..7175295 100644
> --- a/drivers/infiniband/core/cq.c
> +++ b/drivers/infiniband/core/cq.c
> @@ -7,7 +7,11 @@
>  #include <linux/slab.h>
>  #include <rdma/ib_verbs.h>
>
> +#include "core_priv.h"
> +
>  #include <trace/events/rdma_core.h>
> +/* Max size for shared CQ, may require tuning */
> +#define IB_MAX_SHARED_CQ_SZ            4096
>
>  /* # of WCs to poll for with a single call to ib_poll_cq */
>  #define IB_POLL_BATCH                  16
> @@ -218,6 +222,7 @@ struct ib_cq *__ib_alloc_cq_user(struct ib_device *dev, void *private,
>         cq->cq_context = private;
>         cq->poll_ctx = poll_ctx;
>         atomic_set(&cq->usecnt, 0);
> +       cq->comp_vector = comp_vector;
>
>         cq->wc = kmalloc_array(IB_POLL_BATCH, sizeof(*cq->wc), GFP_KERNEL);
>         if (!cq->wc)
> @@ -309,6 +314,8 @@ void ib_free_cq_user(struct ib_cq *cq, struct ib_udata *udata)
>  {
>         if (WARN_ON_ONCE(atomic_read(&cq->usecnt)))
>                 return;
> +       if (WARN_ON_ONCE(cq->cqe_used))
> +               return;
>
>         switch (cq->poll_ctx) {
>         case IB_POLL_DIRECT:
> @@ -334,3 +341,140 @@ void ib_free_cq_user(struct ib_cq *cq, struct ib_udata *udata)
>         kfree(cq);
>  }
>  EXPORT_SYMBOL(ib_free_cq_user);
> +
> +void ib_cq_pool_init(struct ib_device *dev)
> +{
> +       int i;
> +
> +       spin_lock_init(&dev->cq_pools_lock);
> +       for (i = 0; i < ARRAY_SIZE(dev->cq_pools); i++)
> +               INIT_LIST_HEAD(&dev->cq_pools[i]);
> +}
> +
> +void ib_cq_pool_destroy(struct ib_device *dev)
> +{
> +       struct ib_cq *cq, *n;
> +       int i;
> +
> +       for (i = 0; i < ARRAY_SIZE(dev->cq_pools); i++) {
> +               list_for_each_entry_safe(cq, n, &dev->cq_pools[i],
> +                                        pool_entry) {
> +                       cq->shared = false;
> +                       ib_free_cq_user(cq, NULL);
> +               }
> +       }
> +
> +}
> +
> +static int ib_alloc_cqs(struct ib_device *dev, int nr_cqes,
> +                       enum ib_poll_context poll_ctx)
> +{
> +       LIST_HEAD(tmp_list);
> +       struct ib_cq *cq;
> +       unsigned long flags;
> +       int nr_cqs, ret, i;
> +
> +       /*
> +        * Allocated at least as many CQEs as requested, and otherwise
> +        * a reasonable batch size so that we can share CQs between
> +        * multiple users instead of allocating a larger number of CQs.
> +        */
> +       nr_cqes = min(dev->attrs.max_cqe, max(nr_cqes, IB_MAX_SHARED_CQ_SZ));
> +       nr_cqs = min_t(int, dev->num_comp_vectors, num_online_cpus());
No WARN() or return with failure as pointed by Leon and me. Has
anything else changes elsewhere?

> +       for (i = 0; i < nr_cqs; i++) {
> +               cq = ib_alloc_cq(dev, NULL, nr_cqes, i, poll_ctx);
> +               if (IS_ERR(cq)) {
> +                       ret = PTR_ERR(cq);
> +                       goto out_free_cqs;
> +               }
> +               cq->shared = true;
> +               list_add_tail(&cq->pool_entry, &tmp_list);
> +       }
> +
> +       spin_lock_irqsave(&dev->cq_pools_lock, flags);
> +       list_splice(&tmp_list, &dev->cq_pools[poll_ctx - 1]);
> +       spin_unlock_irqrestore(&dev->cq_pools_lock, flags);
> +
> +       return 0;
> +
> +out_free_cqs:
> +       list_for_each_entry(cq, &tmp_list, pool_entry) {
> +               cq->shared = false;
> +               ib_free_cq(cq);
> +       }
> +       return ret;
> +}
> +
> +struct ib_cq *ib_cq_pool_get(struct ib_device *dev, unsigned int nr_cqe,
> +                            int comp_vector_hint,
> +                            enum ib_poll_context poll_ctx)
> +{
> +       static unsigned int default_comp_vector;
> +       int vector, ret, num_comp_vectors;
> +       struct ib_cq *cq, *found = NULL;
> +       unsigned long flags;
> +
> +       if (poll_ctx > ARRAY_SIZE(dev->cq_pools) || poll_ctx == IB_POLL_DIRECT)
> +               return ERR_PTR(-EINVAL);
> +
> +       num_comp_vectors = min_t(int, dev->num_comp_vectors,
> +                                num_online_cpus());
> +       /* Project the affinty to the device completion vector range */
> +       if (comp_vector_hint < 0)
> +               vector = default_comp_vector++ % num_comp_vectors;
> +       else
> +               vector = comp_vector_hint % num_comp_vectors;
> +
> +       /*
> +        * Find the least used CQ with correct affinity and
> +        * enough free CQ entries
> +        */
> +       while (!found) {
> +               spin_lock_irqsave(&dev->cq_pools_lock, flags);
> +               list_for_each_entry(cq, &dev->cq_pools[poll_ctx - 1],
> +                                   pool_entry) {
> +                       /*
> +                        * Check to see if we have found a CQ with the
> +                        * correct completion vector
> +                        */
> +                       if (vector != cq->comp_vector)
> +                               continue;
> +                       if (cq->cqe_used + nr_cqe > cq->cqe)
> +                               continue;
> +                       found = cq;
> +                       break;
> +               }
> +
> +               if (found) {
> +                       found->cqe_used += nr_cqe;
> +                       spin_unlock_irqrestore(&dev->cq_pools_lock, flags);
> +
> +                       return found;
> +               }
> +               spin_unlock_irqrestore(&dev->cq_pools_lock, flags);
> +
> +               /*
> +                * Didn't find a match or ran out of CQs in the device
> +                * pool, allocate a new array of CQs.
> +                */
> +               ret = ib_alloc_cqs(dev, nr_cqe, poll_ctx);
> +               if (ret)
> +                       return ERR_PTR(ret);
> +       }
> +
> +       return found;
> +}
> +EXPORT_SYMBOL(ib_cq_pool_get);
> +
> +void ib_cq_pool_put(struct ib_cq *cq, unsigned int nr_cqe)
> +{
> +       unsigned long flags;
> +
> +       if (WARN_ON_ONCE(nr_cqe > cq->cqe_used))
> +               return;
> +
> +       spin_lock_irqsave(&cq->device->cq_pools_lock, flags);
> +       cq->cqe_used -= nr_cqe;
> +       spin_unlock_irqrestore(&cq->device->cq_pools_lock, flags);
> +}
> +EXPORT_SYMBOL(ib_cq_pool_put);
> diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
> index d9f565a..0966f86 100644
> --- a/drivers/infiniband/core/device.c
> +++ b/drivers/infiniband/core/device.c
> @@ -1418,6 +1418,7 @@ int ib_register_device(struct ib_device *device, const char *name)
>                 device->ops.dealloc_driver = dealloc_fn;
>                 return ret;
>         }
> +       ib_cq_pool_init(device);
>         ib_device_put(device);
>
>         return 0;
> @@ -1446,6 +1447,7 @@ static void __ib_unregister_device(struct ib_device *ib_dev)
>         if (!refcount_read(&ib_dev->refcount))
>                 goto out;
>
> +       ib_cq_pool_destroy(ib_dev);
>         disable_device(ib_dev);
>
>         /* Expedite removing unregistered pointers from the hash table */
> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> index 1659131..d40604a 100644
> --- a/include/rdma/ib_verbs.h
> +++ b/include/rdma/ib_verbs.h
> @@ -1555,6 +1555,7 @@ enum ib_poll_context {
>         IB_POLL_SOFTIRQ,           /* poll from softirq context */
>         IB_POLL_WORKQUEUE,         /* poll from workqueue */
>         IB_POLL_UNBOUND_WORKQUEUE, /* poll from unbound workqueue */
> +       IB_POLL_LAST,
>  };
>
>  struct ib_cq {
> @@ -1564,9 +1565,12 @@ struct ib_cq {
>         void                  (*event_handler)(struct ib_event *, void *);
>         void                   *cq_context;
>         int                     cqe;
> +       int                     cqe_used;
>         atomic_t                usecnt; /* count number of work queues */
>         enum ib_poll_context    poll_ctx;
> +       int                     comp_vector;
>         struct ib_wc            *wc;
> +       struct list_head        pool_entry;
>         union {
>                 struct irq_poll         iop;
>                 struct work_struct      work;
> @@ -2695,6 +2699,10 @@ struct ib_device {
>  #endif
>
>         u32                          index;
> +
> +       spinlock_t                   cq_pools_lock;
> +       struct list_head             cq_pools[IB_POLL_LAST - 1];
> +
>         struct rdma_restrack_root *res;
>
>         const struct uapi_definition   *driver_def;
> @@ -3952,6 +3960,33 @@ static inline int ib_req_notify_cq(struct ib_cq *cq,
>         return cq->device->ops.req_notify_cq(cq, flags);
>  }
>
> +/*
> + * ib_cq_pool_get() - Find the least used completion queue that matches
> + *     a given cpu hint (or least used for wild card affinity)
> + *     and fits nr_cqe
> + * @dev: rdma device
> + * @nr_cqe: number of needed cqe entries
> + * @comp_vector_hint: completion vector hint (-1) for the driver to assign
> + *   a comp vector based on internal counter
> + * @poll_ctx: cq polling context
> + *
> + * Finds a cq that satisfies @comp_vector_hint and @nr_cqe requirements and
> + * claim entries in it for us. In case there is no available cq, allocate
> + * a new cq with the requirements and add it to the device pool.
> + * IB_POLL_DIRECT cannot be used for shared cqs so it is not a valid value
> + * for @poll_ctx.
> + */
> +struct ib_cq *ib_cq_pool_get(struct ib_device *dev, unsigned int nr_cqe,
> +                            int comp_vector_hint,
> +                            enum ib_poll_context poll_ctx);
> +
> +/**
> + * ib_cq_pool_put - Return a CQ taken from a shared pool.
> + * @cq: The CQ to return.
> + * @nr_cqe: The max number of cqes that the user had requested.
> + */
> +void ib_cq_pool_put(struct ib_cq *cq, unsigned int nr_cqe);
> +
>  /**
>   * ib_req_ncomp_notif - Request completion notification when there are
>   *   at least the specified number of unreaped completions on the CQ.
> --
> 1.8.3.1
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V3 0/4] Introducing RDMA shared CQ pool
  2020-05-19 12:43 [PATCH V3 0/4] Introducing RDMA shared CQ pool Yamin Friedman
                   ` (3 preceding siblings ...)
  2020-05-19 12:43 ` [PATCH V3 4/4] nvmet-rdma: " Yamin Friedman
@ 2020-05-20  7:03 ` Sagi Grimberg
  2020-05-20  8:15   ` Yamin Friedman
  4 siblings, 1 reply; 22+ messages in thread
From: Sagi Grimberg @ 2020-05-20  7:03 UTC (permalink / raw)
  To: Yamin Friedman, Jason Gunthorpe, Or Gerlitz, Leon Romanovsky; +Cc: linux-rdma


> This is the fourth re-incarnation of the CQ pool patches proposed
> by Sagi and Christoph. I have started with the patches that Sagi last
> submitted and built the CQ pool as a new API for acquiring shared CQs.
> 
> The main change from Sagi's last proposal is that I have simplified the
> method that ULP drivers interact with the CQ pool. Instead of calling
> ib_alloc_cq they now call ib_cq_pool_get but use the CQ in the same manner
> that they did before. This allows for a much easier transition to using
> shared CQs by the ULP and makes it easier to deal with IB_POLL_DIRECT
> contexts. Certain types of actions on CQs have been prevented on shared
> CQs in order to prevent one user from harming another.
> 
> Our ULPs often want to make smart decisions on completion vector
> affinitization when using multiple completion queues spread on
> multiple cpu cores. We can see examples for this in iser, srp, nvme-rdma.

Yamin, didn't you promise to adjust other ulps as well in the next post?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V3 0/4] Introducing RDMA shared CQ pool
  2020-05-20  7:03 ` [PATCH V3 0/4] Introducing RDMA shared CQ pool Sagi Grimberg
@ 2020-05-20  8:15   ` Yamin Friedman
  0 siblings, 0 replies; 22+ messages in thread
From: Yamin Friedman @ 2020-05-20  8:15 UTC (permalink / raw)
  To: Sagi Grimberg, Jason Gunthorpe, Or Gerlitz, Leon Romanovsky; +Cc: linux-rdma


On 5/20/2020 10:03 AM, Sagi Grimberg wrote:
>
>> This is the fourth re-incarnation of the CQ pool patches proposed
>> by Sagi and Christoph. I have started with the patches that Sagi last
>> submitted and built the CQ pool as a new API for acquiring shared CQs.
>>
>> The main change from Sagi's last proposal is that I have simplified the
>> method that ULP drivers interact with the CQ pool. Instead of calling
>> ib_alloc_cq they now call ib_cq_pool_get but use the CQ in the same 
>> manner
>> that they did before. This allows for a much easier transition to using
>> shared CQs by the ULP and makes it easier to deal with IB_POLL_DIRECT
>> contexts. Certain types of actions on CQs have been prevented on shared
>> CQs in order to prevent one user from harming another.
>>
>> Our ULPs often want to make smart decisions on completion vector
>> affinitization when using multiple completion queues spread on
>> multiple cpu cores. We can see examples for this in iser, srp, 
>> nvme-rdma.
>
> Yamin, didn't you promise to adjust other ulps as well in the next post?
I was looking to get it accepted first, I did not want to tie acceptance 
of the feature to the use in all the different ulps. I can prepare 
another patch set with other ulps now or later

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V3 2/4] RDMA/core: Introduce shared CQ pool API
  2020-05-20  6:19   ` Devesh Sharma
@ 2020-05-20  9:23     ` Yamin Friedman
  2020-05-20  9:32       ` Leon Romanovsky
  2020-05-20 10:50       ` Devesh Sharma
  0 siblings, 2 replies; 22+ messages in thread
From: Yamin Friedman @ 2020-05-20  9:23 UTC (permalink / raw)
  To: Devesh Sharma
  Cc: Jason Gunthorpe, Sagi Grimberg, Or Gerlitz, Leon Romanovsky, linux-rdma


On 5/20/2020 9:19 AM, Devesh Sharma wrote:
>
>> +
>> +static int ib_alloc_cqs(struct ib_device *dev, int nr_cqes,
>> +                       enum ib_poll_context poll_ctx)
>> +{
>> +       LIST_HEAD(tmp_list);
>> +       struct ib_cq *cq;
>> +       unsigned long flags;
>> +       int nr_cqs, ret, i;
>> +
>> +       /*
>> +        * Allocated at least as many CQEs as requested, and otherwise
>> +        * a reasonable batch size so that we can share CQs between
>> +        * multiple users instead of allocating a larger number of CQs.
>> +        */
>> +       nr_cqes = min(dev->attrs.max_cqe, max(nr_cqes, IB_MAX_SHARED_CQ_SZ));
>> +       nr_cqs = min_t(int, dev->num_comp_vectors, num_online_cpus());
> No WARN() or return with failure as pointed by Leon and me. Has
> anything else changes elsewhere?

Hey Devesh,

I am not sure what you are referring to, could you please clarify?

>
>> +       for (i = 0; i < nr_cqs; i++) {
>> +               cq = ib_alloc_cq(dev, NULL, nr_cqes, i, poll_ctx);
>> +               if (IS_ERR(cq)) {
>> +                       ret = PTR_ERR(cq);
>> +                       goto out_free_cqs;
>> +               }
>> +               cq->shared = true;
>> +               list_add_tail(&cq->pool_entry, &tmp_list);
>> +       }
>> +
>> +       spin_lock_irqsave(&dev->cq_pools_lock, flags);
>> +       list_splice(&tmp_list, &dev->cq_pools[poll_ctx - 1]);
>> +       spin_unlock_irqrestore(&dev->cq_pools_lock, flags);
>> +
>> +       return 0;
>> +
>> +out_free_cqs:
>> +       list_for_each_entry(cq, &tmp_list, pool_entry) {
>> +               cq->shared = false;
>> +               ib_free_cq(cq);
>> +       }
>> +       return ret;
>> +}
>> +
>>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V3 2/4] RDMA/core: Introduce shared CQ pool API
  2020-05-20  9:23     ` Yamin Friedman
@ 2020-05-20  9:32       ` Leon Romanovsky
  2020-05-20 10:50       ` Devesh Sharma
  1 sibling, 0 replies; 22+ messages in thread
From: Leon Romanovsky @ 2020-05-20  9:32 UTC (permalink / raw)
  To: Yamin Friedman
  Cc: Devesh Sharma, Jason Gunthorpe, Sagi Grimberg, Or Gerlitz, linux-rdma

On Wed, May 20, 2020 at 12:23:01PM +0300, Yamin Friedman wrote:
>
> On 5/20/2020 9:19 AM, Devesh Sharma wrote:
> >
> > > +
> > > +static int ib_alloc_cqs(struct ib_device *dev, int nr_cqes,
> > > +                       enum ib_poll_context poll_ctx)
> > > +{
> > > +       LIST_HEAD(tmp_list);
> > > +       struct ib_cq *cq;
> > > +       unsigned long flags;
> > > +       int nr_cqs, ret, i;
> > > +
> > > +       /*
> > > +        * Allocated at least as many CQEs as requested, and otherwise
> > > +        * a reasonable batch size so that we can share CQs between
> > > +        * multiple users instead of allocating a larger number of CQs.
> > > +        */
> > > +       nr_cqes = min(dev->attrs.max_cqe, max(nr_cqes, IB_MAX_SHARED_CQ_SZ));
> > > +       nr_cqs = min_t(int, dev->num_comp_vectors, num_online_cpus());
> > No WARN() or return with failure as pointed by Leon and me. Has
> > anything else changes elsewhere?
>
> Hey Devesh,
>
> I am not sure what you are referring to, could you please clarify?

He is saying that dev->num_comp_vectors can be 0.

Thanks

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V3 2/4] RDMA/core: Introduce shared CQ pool API
  2020-05-20  9:23     ` Yamin Friedman
  2020-05-20  9:32       ` Leon Romanovsky
@ 2020-05-20 10:50       ` Devesh Sharma
  2020-05-20 12:01         ` Yamin Friedman
  1 sibling, 1 reply; 22+ messages in thread
From: Devesh Sharma @ 2020-05-20 10:50 UTC (permalink / raw)
  To: Yamin Friedman
  Cc: Jason Gunthorpe, Sagi Grimberg, Or Gerlitz, Leon Romanovsky, linux-rdma

On Wed, May 20, 2020 at 2:53 PM Yamin Friedman <yaminf@mellanox.com> wrote:
>
>
> On 5/20/2020 9:19 AM, Devesh Sharma wrote:
> >
> >> +
> >> +static int ib_alloc_cqs(struct ib_device *dev, int nr_cqes,
> >> +                       enum ib_poll_context poll_ctx)
> >> +{
> >> +       LIST_HEAD(tmp_list);
> >> +       struct ib_cq *cq;
> >> +       unsigned long flags;
> >> +       int nr_cqs, ret, i;
> >> +
> >> +       /*
> >> +        * Allocated at least as many CQEs as requested, and otherwise
> >> +        * a reasonable batch size so that we can share CQs between
> >> +        * multiple users instead of allocating a larger number of CQs.
> >> +        */
> >> +       nr_cqes = min(dev->attrs.max_cqe, max(nr_cqes, IB_MAX_SHARED_CQ_SZ));
> >> +       nr_cqs = min_t(int, dev->num_comp_vectors, num_online_cpus());
> > No WARN() or return with failure as pointed by Leon and me. Has
> > anything else changes elsewhere?
>
> Hey Devesh,
>
> I am not sure what you are referring to, could you please clarify?
>
I thought on V2 Leon gave a comment "how this will work if
dev->num_comp_vectors" is 0.
there I had suggested to fail the pool creation and issue a
WARN_ONCE() or something.
> >
> >> +       for (i = 0; i < nr_cqs; i++) {
> >> +               cq = ib_alloc_cq(dev, NULL, nr_cqes, i, poll_ctx);
> >> +               if (IS_ERR(cq)) {
> >> +                       ret = PTR_ERR(cq);
> >> +                       goto out_free_cqs;
> >> +               }
> >> +               cq->shared = true;
> >> +               list_add_tail(&cq->pool_entry, &tmp_list);
> >> +       }
> >> +
> >> +       spin_lock_irqsave(&dev->cq_pools_lock, flags);
> >> +       list_splice(&tmp_list, &dev->cq_pools[poll_ctx - 1]);
> >> +       spin_unlock_irqrestore(&dev->cq_pools_lock, flags);
> >> +
> >> +       return 0;
> >> +
> >> +out_free_cqs:
> >> +       list_for_each_entry(cq, &tmp_list, pool_entry) {
> >> +               cq->shared = false;
> >> +               ib_free_cq(cq);
> >> +       }
> >> +       return ret;
> >> +}
> >> +
> >>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V3 2/4] RDMA/core: Introduce shared CQ pool API
  2020-05-20 10:50       ` Devesh Sharma
@ 2020-05-20 12:01         ` Yamin Friedman
  2020-05-20 13:48           ` Devesh Sharma
  0 siblings, 1 reply; 22+ messages in thread
From: Yamin Friedman @ 2020-05-20 12:01 UTC (permalink / raw)
  To: Devesh Sharma
  Cc: Jason Gunthorpe, Sagi Grimberg, Or Gerlitz, Leon Romanovsky, linux-rdma


On 5/20/2020 1:50 PM, Devesh Sharma wrote:
> On Wed, May 20, 2020 at 2:53 PM Yamin Friedman <yaminf@mellanox.com> wrote:
>>
>> On 5/20/2020 9:19 AM, Devesh Sharma wrote:
>>>> +
>>>> +static int ib_alloc_cqs(struct ib_device *dev, int nr_cqes,
>>>> +                       enum ib_poll_context poll_ctx)
>>>> +{
>>>> +       LIST_HEAD(tmp_list);
>>>> +       struct ib_cq *cq;
>>>> +       unsigned long flags;
>>>> +       int nr_cqs, ret, i;
>>>> +
>>>> +       /*
>>>> +        * Allocated at least as many CQEs as requested, and otherwise
>>>> +        * a reasonable batch size so that we can share CQs between
>>>> +        * multiple users instead of allocating a larger number of CQs.
>>>> +        */
>>>> +       nr_cqes = min(dev->attrs.max_cqe, max(nr_cqes, IB_MAX_SHARED_CQ_SZ));
>>>> +       nr_cqs = min_t(int, dev->num_comp_vectors, num_online_cpus());
>>> No WARN() or return with failure as pointed by Leon and me. Has
>>> anything else changes elsewhere?
>> Hey Devesh,
>>
>> I am not sure what you are referring to, could you please clarify?
>>
> I thought on V2 Leon gave a comment "how this will work if
> dev->num_comp_vectors" is 0.
> there I had suggested to fail the pool creation and issue a
> WARN_ONCE() or something.

I understood his comment to be regarding if the comp_vector itself is 0. 
There should not be any issue with that case.

As far as I am aware there must be a non-zero amount of comp_vectors for 
the ib_dev otherwise we will not be able to get any indication for cqes. 
I don't see any reason to add a special check here.

Thanks

>>>> +       for (i = 0; i < nr_cqs; i++) {
>>>> +               cq = ib_alloc_cq(dev, NULL, nr_cqes, i, poll_ctx);
>>>> +               if (IS_ERR(cq)) {
>>>> +                       ret = PTR_ERR(cq);
>>>> +                       goto out_free_cqs;
>>>> +               }
>>>> +               cq->shared = true;
>>>> +               list_add_tail(&cq->pool_entry, &tmp_list);
>>>> +       }
>>>> +
>>>> +       spin_lock_irqsave(&dev->cq_pools_lock, flags);
>>>> +       list_splice(&tmp_list, &dev->cq_pools[poll_ctx - 1]);
>>>> +       spin_unlock_irqrestore(&dev->cq_pools_lock, flags);
>>>> +
>>>> +       return 0;
>>>> +
>>>> +out_free_cqs:
>>>> +       list_for_each_entry(cq, &tmp_list, pool_entry) {
>>>> +               cq->shared = false;
>>>> +               ib_free_cq(cq);
>>>> +       }
>>>> +       return ret;
>>>> +}
>>>> +
>>>>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V3 2/4] RDMA/core: Introduce shared CQ pool API
  2020-05-20 12:01         ` Yamin Friedman
@ 2020-05-20 13:48           ` Devesh Sharma
  0 siblings, 0 replies; 22+ messages in thread
From: Devesh Sharma @ 2020-05-20 13:48 UTC (permalink / raw)
  To: Yamin Friedman
  Cc: Jason Gunthorpe, Sagi Grimberg, Or Gerlitz, Leon Romanovsky, linux-rdma

On Wed, May 20, 2020 at 5:32 PM Yamin Friedman <yaminf@mellanox.com> wrote:
>
>
> On 5/20/2020 1:50 PM, Devesh Sharma wrote:
> > On Wed, May 20, 2020 at 2:53 PM Yamin Friedman <yaminf@mellanox.com> wrote:
> >>
> >> On 5/20/2020 9:19 AM, Devesh Sharma wrote:
> >>>> +
> >>>> +static int ib_alloc_cqs(struct ib_device *dev, int nr_cqes,
> >>>> +                       enum ib_poll_context poll_ctx)
> >>>> +{
> >>>> +       LIST_HEAD(tmp_list);
> >>>> +       struct ib_cq *cq;
> >>>> +       unsigned long flags;
> >>>> +       int nr_cqs, ret, i;
> >>>> +
> >>>> +       /*
> >>>> +        * Allocated at least as many CQEs as requested, and otherwise
> >>>> +        * a reasonable batch size so that we can share CQs between
> >>>> +        * multiple users instead of allocating a larger number of CQs.
> >>>> +        */
> >>>> +       nr_cqes = min(dev->attrs.max_cqe, max(nr_cqes, IB_MAX_SHARED_CQ_SZ));
> >>>> +       nr_cqs = min_t(int, dev->num_comp_vectors, num_online_cpus());
> >>> No WARN() or return with failure as pointed by Leon and me. Has
> >>> anything else changes elsewhere?
> >> Hey Devesh,
> >>
> >> I am not sure what you are referring to, could you please clarify?
> >>
> > I thought on V2 Leon gave a comment "how this will work if
> > dev->num_comp_vectors" is 0.
> > there I had suggested to fail the pool creation and issue a
> > WARN_ONCE() or something.
>
> I understood his comment to be regarding if the comp_vector itself is 0.
> There should not be any issue with that case.
>
> As far as I am aware there must be a non-zero amount of comp_vectors for
> the ib_dev otherwise we will not be able to get any indication for cqes.
> I don't see any reason to add a special check here.
>
Okay, maybe a WARN_ONCE() would be useful from a debug point of view.
Otherwise for a given buggy driver, things may not be obvious and the
user may still think that the pool was created successfully, but
traffic will not move.
Add it if you see a value in this point of view otherwise:
Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>

Thanks
> Thanks
>
> >>>> +       for (i = 0; i < nr_cqs; i++) {
> >>>> +               cq = ib_alloc_cq(dev, NULL, nr_cqes, i, poll_ctx);
> >>>> +               if (IS_ERR(cq)) {
> >>>> +                       ret = PTR_ERR(cq);
> >>>> +                       goto out_free_cqs;
> >>>> +               }
> >>>> +               cq->shared = true;
> >>>> +               list_add_tail(&cq->pool_entry, &tmp_list);
> >>>> +       }
> >>>> +
> >>>> +       spin_lock_irqsave(&dev->cq_pools_lock, flags);
> >>>> +       list_splice(&tmp_list, &dev->cq_pools[poll_ctx - 1]);
> >>>> +       spin_unlock_irqrestore(&dev->cq_pools_lock, flags);
> >>>> +
> >>>> +       return 0;
> >>>> +
> >>>> +out_free_cqs:
> >>>> +       list_for_each_entry(cq, &tmp_list, pool_entry) {
> >>>> +               cq->shared = false;
> >>>> +               ib_free_cq(cq);
> >>>> +       }
> >>>> +       return ret;
> >>>> +}
> >>>> +
> >>>>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V3 2/4] RDMA/core: Introduce shared CQ pool API
  2020-05-19 12:43 ` [PATCH V3 2/4] RDMA/core: Introduce shared CQ pool API Yamin Friedman
  2020-05-20  6:19   ` Devesh Sharma
@ 2020-05-25 13:06   ` Yamin Friedman
  2020-05-26  7:09     ` Yamin Friedman
  2020-05-25 15:14   ` Bart Van Assche
  2020-05-25 16:42   ` Jason Gunthorpe
  3 siblings, 1 reply; 22+ messages in thread
From: Yamin Friedman @ 2020-05-25 13:06 UTC (permalink / raw)
  To: Jason Gunthorpe, Sagi Grimberg, Or Gerlitz, Leon Romanovsky; +Cc: linux-rdma


[-- Attachment #1: Type: text/plain, Size: 12799 bytes --]

Minor fix brought to my attention by MaxG.

On 5/19/2020 3:43 PM, Yamin Friedman wrote:
> Allow a ULP to ask the core to provide a completion queue based on a
> least-used search on a per-device CQ pools. The device CQ pools grow in a
> lazy fashion when more CQs are requested.
>
> This feature reduces the amount of interrupts when using many QPs.
> Using shared CQs allows for more effcient completion handling. It also
> reduces the amount of overhead needed for CQ contexts.
>
> Test setup:
> Intel(R) Xeon(R) Platinum 8176M CPU @ 2.10GHz servers.
> Running NVMeoF 4KB read IOs over ConnectX-5EX across Spectrum switch.
> TX-depth = 32. The patch was applied in the nvme driver on both the target
> and initiator. Four controllers are accessed from each core. In the
> current test case we have exposed sixteen NVMe namespaces using four
> different subsystems (four namespaces per subsystem) from one NVM port.
> Each controller allocated X queues (RDMA QPs) and attached to Y CQs.
> Before this series we had X == Y, i.e for four controllers we've created
> total of 4X QPs and 4X CQs. In the shared case, we've created 4X QPs and
> only X CQs which means that we have four controllers that share a
> completion queue per core. Until fourteen cores there is no significant
> change in performance and the number of interrupts per second is less than
> a million in the current case.
> ==================================================
> |Cores|Current KIOPs  |Shared KIOPs  |improvement|
> |-----|---------------|--------------|-----------|
> |14   |2332           |2723          |16.7%      |
> |-----|---------------|--------------|-----------|
> |20   |2086           |2712          |30%        |
> |-----|---------------|--------------|-----------|
> |28   |1971           |2669          |35.4%      |
> |=================================================
> |Cores|Current avg lat|Shared avg lat|improvement|
> |-----|---------------|--------------|-----------|
> |14   |767us          |657us         |14.3%      |
> |-----|---------------|--------------|-----------|
> |20   |1225us         |943us         |23%        |
> |-----|---------------|--------------|-----------|
> |28   |1816us         |1341us        |26.1%      |
> ========================================================
> |Cores|Current interrupts|Shared interrupts|improvement|
> |-----|------------------|-----------------|-----------|
> |14   |1.6M/sec          |0.4M/sec         |72%        |
> |-----|------------------|-----------------|-----------|
> |20   |2.8M/sec          |0.6M/sec         |72.4%      |
> |-----|------------------|-----------------|-----------|
> |28   |2.9M/sec          |0.8M/sec         |63.4%      |
> ====================================================================
> |Cores|Current 99.99th PCTL lat|Shared 99.99th PCTL lat|improvement|
> |-----|------------------------|-----------------------|-----------|
> |14   |67ms                    |6ms                    |90.9%      |
> |-----|------------------------|-----------------------|-----------|
> |20   |5ms                     |6ms                    |-10%       |
> |-----|------------------------|-----------------------|-----------|
> |28   |8.7ms                   |6ms                    |25.9%      |
> |===================================================================
>
> Performance improvement with sixteen disks (sixteen CQs per core) is
> comparable.
>
> Signed-off-by: Yamin Friedman <yaminf@mellanox.com>
> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
> Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
> Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
> ---
>   drivers/infiniband/core/core_priv.h |   3 +
>   drivers/infiniband/core/cq.c        | 144 ++++++++++++++++++++++++++++++++++++
>   drivers/infiniband/core/device.c    |   2 +
>   include/rdma/ib_verbs.h             |  35 +++++++++
>   4 files changed, 184 insertions(+)
>
> diff --git a/drivers/infiniband/core/core_priv.h b/drivers/infiniband/core/core_priv.h
> index cf42acc..a1e6a67 100644
> --- a/drivers/infiniband/core/core_priv.h
> +++ b/drivers/infiniband/core/core_priv.h
> @@ -414,4 +414,7 @@ void rdma_umap_priv_init(struct rdma_umap_priv *priv,
>   			 struct vm_area_struct *vma,
>   			 struct rdma_user_mmap_entry *entry);
>   
> +void ib_cq_pool_init(struct ib_device *dev);
> +void ib_cq_pool_destroy(struct ib_device *dev);
> +
>   #endif /* _CORE_PRIV_H */
> diff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c
> index 4f25b24..7175295 100644
> --- a/drivers/infiniband/core/cq.c
> +++ b/drivers/infiniband/core/cq.c
> @@ -7,7 +7,11 @@
>   #include <linux/slab.h>
>   #include <rdma/ib_verbs.h>
>   
> +#include "core_priv.h"
> +
>   #include <trace/events/rdma_core.h>
> +/* Max size for shared CQ, may require tuning */
> +#define IB_MAX_SHARED_CQ_SZ		4096
>   
>   /* # of WCs to poll for with a single call to ib_poll_cq */
>   #define IB_POLL_BATCH			16
> @@ -218,6 +222,7 @@ struct ib_cq *__ib_alloc_cq_user(struct ib_device *dev, void *private,
>   	cq->cq_context = private;
>   	cq->poll_ctx = poll_ctx;
>   	atomic_set(&cq->usecnt, 0);
> +	cq->comp_vector = comp_vector;
>   
>   	cq->wc = kmalloc_array(IB_POLL_BATCH, sizeof(*cq->wc), GFP_KERNEL);
>   	if (!cq->wc)
> @@ -309,6 +314,8 @@ void ib_free_cq_user(struct ib_cq *cq, struct ib_udata *udata)
>   {
>   	if (WARN_ON_ONCE(atomic_read(&cq->usecnt)))
>   		return;
> +	if (WARN_ON_ONCE(cq->cqe_used))
> +		return;
>   
>   	switch (cq->poll_ctx) {
>   	case IB_POLL_DIRECT:
> @@ -334,3 +341,140 @@ void ib_free_cq_user(struct ib_cq *cq, struct ib_udata *udata)
>   	kfree(cq);
>   }
>   EXPORT_SYMBOL(ib_free_cq_user);
> +
> +void ib_cq_pool_init(struct ib_device *dev)
> +{
> +	int i;
> +
> +	spin_lock_init(&dev->cq_pools_lock);
> +	for (i = 0; i < ARRAY_SIZE(dev->cq_pools); i++)
> +		INIT_LIST_HEAD(&dev->cq_pools[i]);
> +}
> +
> +void ib_cq_pool_destroy(struct ib_device *dev)
> +{
> +	struct ib_cq *cq, *n;
> +	int i;
> +
> +	for (i = 0; i < ARRAY_SIZE(dev->cq_pools); i++) {
> +		list_for_each_entry_safe(cq, n, &dev->cq_pools[i],
> +					 pool_entry) {
> +			cq->shared = false;
> +			ib_free_cq_user(cq, NULL);
> +		}
> +	}
> +
> +}
> +
> +static int ib_alloc_cqs(struct ib_device *dev, int nr_cqes,
> +			enum ib_poll_context poll_ctx)
> +{
> +	LIST_HEAD(tmp_list);
> +	struct ib_cq *cq;
> +	unsigned long flags;
> +	int nr_cqs, ret, i;
> +
> +	/*
> +	 * Allocated at least as many CQEs as requested, and otherwise
> +	 * a reasonable batch size so that we can share CQs between
> +	 * multiple users instead of allocating a larger number of CQs.
> +	 */
> +	nr_cqes = min(dev->attrs.max_cqe, max(nr_cqes, IB_MAX_SHARED_CQ_SZ));
> +	nr_cqs = min_t(int, dev->num_comp_vectors, num_online_cpus());
> +	for (i = 0; i < nr_cqs; i++) {
> +		cq = ib_alloc_cq(dev, NULL, nr_cqes, i, poll_ctx);
> +		if (IS_ERR(cq)) {
> +			ret = PTR_ERR(cq);
> +			goto out_free_cqs;
> +		}
> +		cq->shared = true;
> +		list_add_tail(&cq->pool_entry, &tmp_list);
> +	}
> +
> +	spin_lock_irqsave(&dev->cq_pools_lock, flags);
> +	list_splice(&tmp_list, &dev->cq_pools[poll_ctx - 1]);
> +	spin_unlock_irqrestore(&dev->cq_pools_lock, flags);
> +
> +	return 0;
> +
> +out_free_cqs:
> +	list_for_each_entry(cq, &tmp_list, pool_entry) {
> +		cq->shared = false;
> +		ib_free_cq(cq);
> +	}
> +	return ret;
> +}
> +
> +struct ib_cq *ib_cq_pool_get(struct ib_device *dev, unsigned int nr_cqe,
> +			     int comp_vector_hint,
> +			     enum ib_poll_context poll_ctx)
> +{
> +	static unsigned int default_comp_vector;
> +	int vector, ret, num_comp_vectors;
> +	struct ib_cq *cq, *found = NULL;
> +	unsigned long flags;
> +
> +	if (poll_ctx > ARRAY_SIZE(dev->cq_pools) || poll_ctx == IB_POLL_DIRECT)
> +		return ERR_PTR(-EINVAL);
> +
> +	num_comp_vectors = min_t(int, dev->num_comp_vectors,
> +				 num_online_cpus());
> +	/* Project the affinty to the device completion vector range */
> +	if (comp_vector_hint < 0)
> +		vector = default_comp_vector++ % num_comp_vectors;
> +	else
> +		vector = comp_vector_hint % num_comp_vectors;
> +
> +	/*
> +	 * Find the least used CQ with correct affinity and
> +	 * enough free CQ entries
> +	 */
> +	while (!found) {
> +		spin_lock_irqsave(&dev->cq_pools_lock, flags);
> +		list_for_each_entry(cq, &dev->cq_pools[poll_ctx - 1],
> +				    pool_entry) {
> +			/*
> +			 * Check to see if we have found a CQ with the
> +			 * correct completion vector
> +			 */
> +			if (vector != cq->comp_vector)
> +				continue;
> +			if (cq->cqe_used + nr_cqe > cq->cqe)
> +				continue;
> +			found = cq;
> +			break;
> +		}
> +
> +		if (found) {
> +			found->cqe_used += nr_cqe;
> +			spin_unlock_irqrestore(&dev->cq_pools_lock, flags);
> +
> +			return found;
> +		}
> +		spin_unlock_irqrestore(&dev->cq_pools_lock, flags);
> +
> +		/*
> +		 * Didn't find a match or ran out of CQs in the device
> +		 * pool, allocate a new array of CQs.
> +		 */
> +		ret = ib_alloc_cqs(dev, nr_cqe, poll_ctx);
> +		if (ret)
> +			return ERR_PTR(ret);
> +	}
> +
> +	return found;
> +}
> +EXPORT_SYMBOL(ib_cq_pool_get);
> +
> +void ib_cq_pool_put(struct ib_cq *cq, unsigned int nr_cqe)
> +{
> +	unsigned long flags;
> +
> +	if (WARN_ON_ONCE(nr_cqe > cq->cqe_used))
> +		return;
> +
> +	spin_lock_irqsave(&cq->device->cq_pools_lock, flags);
> +	cq->cqe_used -= nr_cqe;
> +	spin_unlock_irqrestore(&cq->device->cq_pools_lock, flags);
> +}
> +EXPORT_SYMBOL(ib_cq_pool_put);
> diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
> index d9f565a..0966f86 100644
> --- a/drivers/infiniband/core/device.c
> +++ b/drivers/infiniband/core/device.c
> @@ -1418,6 +1418,7 @@ int ib_register_device(struct ib_device *device, const char *name)
>   		device->ops.dealloc_driver = dealloc_fn;
>   		return ret;
>   	}
> +	ib_cq_pool_init(device);
>   	ib_device_put(device);
>   
>   	return 0;
> @@ -1446,6 +1447,7 @@ static void __ib_unregister_device(struct ib_device *ib_dev)
>   	if (!refcount_read(&ib_dev->refcount))
>   		goto out;
>   
> +	ib_cq_pool_destroy(ib_dev);
>   	disable_device(ib_dev);
>   
>   	/* Expedite removing unregistered pointers from the hash table */
> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> index 1659131..d40604a 100644
> --- a/include/rdma/ib_verbs.h
> +++ b/include/rdma/ib_verbs.h
> @@ -1555,6 +1555,7 @@ enum ib_poll_context {
>   	IB_POLL_SOFTIRQ,	   /* poll from softirq context */
>   	IB_POLL_WORKQUEUE,	   /* poll from workqueue */
>   	IB_POLL_UNBOUND_WORKQUEUE, /* poll from unbound workqueue */
> +	IB_POLL_LAST,
>   };
>   
>   struct ib_cq {
> @@ -1564,9 +1565,12 @@ struct ib_cq {
>   	void                  (*event_handler)(struct ib_event *, void *);
>   	void                   *cq_context;
>   	int               	cqe;
> +	int			cqe_used;
>   	atomic_t          	usecnt; /* count number of work queues */
>   	enum ib_poll_context	poll_ctx;
> +	int                     comp_vector;
>   	struct ib_wc		*wc;
> +	struct list_head        pool_entry;
>   	union {
>   		struct irq_poll		iop;
>   		struct work_struct	work;
> @@ -2695,6 +2699,10 @@ struct ib_device {
>   #endif
>   
>   	u32                          index;
> +
> +	spinlock_t                   cq_pools_lock;
> +	struct list_head             cq_pools[IB_POLL_LAST - 1];
> +
>   	struct rdma_restrack_root *res;
>   
>   	const struct uapi_definition   *driver_def;
> @@ -3952,6 +3960,33 @@ static inline int ib_req_notify_cq(struct ib_cq *cq,
>   	return cq->device->ops.req_notify_cq(cq, flags);
>   }
>   
> +/*
> + * ib_cq_pool_get() - Find the least used completion queue that matches
> + *     a given cpu hint (or least used for wild card affinity)
> + *     and fits nr_cqe
> + * @dev: rdma device
> + * @nr_cqe: number of needed cqe entries
> + * @comp_vector_hint: completion vector hint (-1) for the driver to assign
> + *   a comp vector based on internal counter
> + * @poll_ctx: cq polling context
> + *
> + * Finds a cq that satisfies @comp_vector_hint and @nr_cqe requirements and
> + * claim entries in it for us. In case there is no available cq, allocate
> + * a new cq with the requirements and add it to the device pool.
> + * IB_POLL_DIRECT cannot be used for shared cqs so it is not a valid value
> + * for @poll_ctx.
> + */
> +struct ib_cq *ib_cq_pool_get(struct ib_device *dev, unsigned int nr_cqe,
> +			     int comp_vector_hint,
> +			     enum ib_poll_context poll_ctx);
> +
> +/**
> + * ib_cq_pool_put - Return a CQ taken from a shared pool.
> + * @cq: The CQ to return.
> + * @nr_cqe: The max number of cqes that the user had requested.
> + */
> +void ib_cq_pool_put(struct ib_cq *cq, unsigned int nr_cqe);
> +
>   /**
>    * ib_req_ncomp_notif - Request completion notification when there are
>    *   at least the specified number of unreaped completions on the CQ.

[-- Attachment #2: minor_fix.patch --]
[-- Type: text/plain, Size: 485 bytes --]

^[[1mdiff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c^[[m
^[[1mindex 7175295..c462d48 100644^[[m
^[[1m--- a/drivers/infiniband/core/cq.c^[[m
^[[1m+++ b/drivers/infiniband/core/cq.c^[[m
^[[36m@@ -360,7 +360,7 @@^[[m ^[[mvoid ib_cq_pool_destroy(struct ib_device *dev)^[[m
 		list_for_each_entry_safe(cq, n, &dev->cq_pools[i],^[[m
 					 pool_entry) {^[[m
 			cq->shared = false;^[[m
^[[31m-			ib_free_cq_user(cq, NULL);^[[m
^[[32m+^[[m			^[[32mib_free_cq(cq);^[[m
 		}^[[m
 	}^[[m
 ^[[m

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V3 2/4] RDMA/core: Introduce shared CQ pool API
  2020-05-19 12:43 ` [PATCH V3 2/4] RDMA/core: Introduce shared CQ pool API Yamin Friedman
  2020-05-20  6:19   ` Devesh Sharma
  2020-05-25 13:06   ` Yamin Friedman
@ 2020-05-25 15:14   ` Bart Van Assche
  2020-05-25 16:45     ` Jason Gunthorpe
  2020-05-25 16:42   ` Jason Gunthorpe
  3 siblings, 1 reply; 22+ messages in thread
From: Bart Van Assche @ 2020-05-25 15:14 UTC (permalink / raw)
  To: Yamin Friedman, Jason Gunthorpe, Sagi Grimberg, Or Gerlitz,
	Leon Romanovsky
  Cc: linux-rdma

On 2020-05-19 05:43, Yamin Friedman wrote:
> +	/*
> +	 * Allocated at least as many CQEs as requested, and otherwise
           ^^^^^^^^^
           allocate?

> +	spin_lock_irqsave(&dev->cq_pools_lock, flags);
> +	list_splice(&tmp_list, &dev->cq_pools[poll_ctx - 1]);
> +	spin_unlock_irqrestore(&dev->cq_pools_lock, flags);

Please add a WARN_ONCE() or WARN_ON_ONCE() statement that checks that
poll_ctx >= 1.

> +struct ib_cq *ib_cq_pool_get(struct ib_device *dev, unsigned int nr_cqe,
> +			     int comp_vector_hint,
> +			     enum ib_poll_context poll_ctx)
> +{
> +	static unsigned int default_comp_vector;
> +	int vector, ret, num_comp_vectors;
> +	struct ib_cq *cq, *found = NULL;
> +	unsigned long flags;
> +
> +	if (poll_ctx > ARRAY_SIZE(dev->cq_pools) || poll_ctx == IB_POLL_DIRECT)
> +		return ERR_PTR(-EINVAL);

How about changing this into the following?

	if ((unsigned)(poll_ctx - 1) >= ARRAY_SIZE(dev->cq_pools))
		return ...;

I think that change will make this code easier to verify.

> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> index 1659131..d40604a 100644
> --- a/include/rdma/ib_verbs.h
> +++ b/include/rdma/ib_verbs.h
> @@ -1555,6 +1555,7 @@ enum ib_poll_context {
>  	IB_POLL_SOFTIRQ,	   /* poll from softirq context */
>  	IB_POLL_WORKQUEUE,	   /* poll from workqueue */
>  	IB_POLL_UNBOUND_WORKQUEUE, /* poll from unbound workqueue */
> +	IB_POLL_LAST,
>  };

Please consider changing IB_POLL_LAST into IB_POLL_LAST =
IB_POLL_UNBOUND_WORKQUEUE. Otherwise the compiler will produce annoying
warnings on switch statements that do not handle IB_POLL_LAST explicitly.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V3 2/4] RDMA/core: Introduce shared CQ pool API
  2020-05-19 12:43 ` [PATCH V3 2/4] RDMA/core: Introduce shared CQ pool API Yamin Friedman
                     ` (2 preceding siblings ...)
  2020-05-25 15:14   ` Bart Van Assche
@ 2020-05-25 16:42   ` Jason Gunthorpe
  2020-05-25 16:47     ` Leon Romanovsky
  3 siblings, 1 reply; 22+ messages in thread
From: Jason Gunthorpe @ 2020-05-25 16:42 UTC (permalink / raw)
  To: Yamin Friedman; +Cc: Sagi Grimberg, Or Gerlitz, Leon Romanovsky, linux-rdma

On Tue, May 19, 2020 at 03:43:34PM +0300, Yamin Friedman wrote:

> +void ib_cq_pool_init(struct ib_device *dev)
> +{
> +	int i;

I generally rather see unsigned types used for unsigned values

> +
> +	spin_lock_init(&dev->cq_pools_lock);
> +	for (i = 0; i < ARRAY_SIZE(dev->cq_pools); i++)
> +		INIT_LIST_HEAD(&dev->cq_pools[i]);
> +}
> +
> +void ib_cq_pool_destroy(struct ib_device *dev)
> +{
> +	struct ib_cq *cq, *n;
> +	int i;
> +
> +	for (i = 0; i < ARRAY_SIZE(dev->cq_pools); i++) {
> +		list_for_each_entry_safe(cq, n, &dev->cq_pools[i],
> +					 pool_entry) {
> +			cq->shared = false;
> +			ib_free_cq_user(cq, NULL);

WARN_ON cqe_used == 0?

> +		}
> +	}
> +
> +}
> +
> +static int ib_alloc_cqs(struct ib_device *dev, int nr_cqes,

unsigned types especially in function signatures please

> +struct ib_cq *ib_cq_pool_get(struct ib_device *dev, unsigned int nr_cqe,
> +			     int comp_vector_hint,
> +			     enum ib_poll_context poll_ctx)
> +{
> +	static unsigned int default_comp_vector;
> +	int vector, ret, num_comp_vectors;
> +	struct ib_cq *cq, *found = NULL;
> +	unsigned long flags;
> +
> +	if (poll_ctx > ARRAY_SIZE(dev->cq_pools) || poll_ctx == IB_POLL_DIRECT)
> +		return ERR_PTR(-EINVAL);
> +
> +	num_comp_vectors = min_t(int, dev->num_comp_vectors,
> +				 num_online_cpus());
> +	/* Project the affinty to the device completion vector range */
> +	if (comp_vector_hint < 0)
> +		vector = default_comp_vector++ % num_comp_vectors;
> +	else
> +		vector = comp_vector_hint % num_comp_vectors;

Modulo with signed types..

> +	/*
> +	 * Find the least used CQ with correct affinity and
> +	 * enough free CQ entries
> +	 */
> +	while (!found) {
> +		spin_lock_irqsave(&dev->cq_pools_lock, flags);
> +		list_for_each_entry(cq, &dev->cq_pools[poll_ctx - 1],
> +				    pool_entry) {
> +			/*
> +			 * Check to see if we have found a CQ with the
> +			 * correct completion vector
> +			 */
> +			if (vector != cq->comp_vector)
> +				continue;
> +			if (cq->cqe_used + nr_cqe > cq->cqe)
> +				continue;
> +			found = cq;
> +			break;
> +		}
> +
> +		if (found) {
> +			found->cqe_used += nr_cqe;
> +			spin_unlock_irqrestore(&dev->cq_pools_lock, flags);
> +
> +			return found;
> +		}
> +		spin_unlock_irqrestore(&dev->cq_pools_lock, flags);
> +
> +		/*
> +		 * Didn't find a match or ran out of CQs in the device
> +		 * pool, allocate a new array of CQs.
> +		 */
> +		ret = ib_alloc_cqs(dev, nr_cqe, poll_ctx);
> +		if (ret)
> +			return ERR_PTR(ret);
> +	}
> +
> +	return found;
> +}
> +EXPORT_SYMBOL(ib_cq_pool_get);
> +
> +void ib_cq_pool_put(struct ib_cq *cq, unsigned int nr_cqe)
> +{
> +	unsigned long flags;
> +
> +	if (WARN_ON_ONCE(nr_cqe > cq->cqe_used))
> +		return;
> +
> +	spin_lock_irqsave(&cq->device->cq_pools_lock, flags);
> +	cq->cqe_used -= nr_cqe;
> +	spin_unlock_irqrestore(&cq->device->cq_pools_lock, flags);

It doesn't look to me like this spinlock can be used from anywhere but
a user context, why is it an irqsave?

> diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
> index d9f565a..0966f86 100644
> +++ b/drivers/infiniband/core/device.c
> @@ -1418,6 +1418,7 @@ int ib_register_device(struct ib_device *device, const char *name)
>  		device->ops.dealloc_driver = dealloc_fn;
>  		return ret;
>  	}
> +	ib_cq_pool_init(device);
>  	ib_device_put(device);

This look like wrong placement, it should be done before enable_device
as enable_device triggers ULPs t start using the device and they might
start allocating using this API.
  
>  	return 0;
> @@ -1446,6 +1447,7 @@ static void __ib_unregister_device(struct ib_device *ib_dev)
>  	if (!refcount_read(&ib_dev->refcount))
>  		goto out;
>  
> +	ib_cq_pool_destroy(ib_dev);
>  	disable_device(ib_dev);

similar issue, should be after disable_device as ULPs are still
running here

  
>  	/* Expedite removing unregistered pointers from the hash table */
> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> index 1659131..d40604a 100644
> +++ b/include/rdma/ib_verbs.h
> @@ -1555,6 +1555,7 @@ enum ib_poll_context {
>  	IB_POLL_SOFTIRQ,	   /* poll from softirq context */
>  	IB_POLL_WORKQUEUE,	   /* poll from workqueue */
>  	IB_POLL_UNBOUND_WORKQUEUE, /* poll from unbound workqueue */
> +	IB_POLL_LAST,
>  };
>  
>  struct ib_cq {
> @@ -1564,9 +1565,12 @@ struct ib_cq {
>  	void                  (*event_handler)(struct ib_event *, void *);
>  	void                   *cq_context;
>  	int               	cqe;
> +	int			cqe_used;

unsigned

>  	atomic_t          	usecnt; /* count number of work queues */
>  	enum ib_poll_context	poll_ctx;
> +	int                     comp_vector;

and put new members in sane places, don't make holes, etc

>  	const struct uapi_definition   *driver_def;
> @@ -3952,6 +3960,33 @@ static inline int ib_req_notify_cq(struct ib_cq *cq,
>  	return cq->device->ops.req_notify_cq(cq, flags);
>  }
>  
> +/*
> + * ib_cq_pool_get() - Find the least used completion queue that matches
> + *     a given cpu hint (or least used for wild card affinity)
> + *     and fits nr_cqe
> + * @dev: rdma device
> + * @nr_cqe: number of needed cqe entries
> + * @comp_vector_hint: completion vector hint (-1) for the driver to assign
> + *   a comp vector based on internal counter
> + * @poll_ctx: cq polling context
> + *
> + * Finds a cq that satisfies @comp_vector_hint and @nr_cqe requirements and
> + * claim entries in it for us. In case there is no available cq, allocate
> + * a new cq with the requirements and add it to the device pool.
> + * IB_POLL_DIRECT cannot be used for shared cqs so it is not a valid value
> + * for @poll_ctx.
> + */
> +struct ib_cq *ib_cq_pool_get(struct ib_device *dev, unsigned int nr_cqe,
> +			     int comp_vector_hint,
> +			     enum ib_poll_context poll_ctx);

kdoc comments belong in the C files please, and this isn't even in
proper kdoc format.

Jason

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V3 2/4] RDMA/core: Introduce shared CQ pool API
  2020-05-25 15:14   ` Bart Van Assche
@ 2020-05-25 16:45     ` Jason Gunthorpe
  2020-05-26 11:43       ` Yamin Friedman
  0 siblings, 1 reply; 22+ messages in thread
From: Jason Gunthorpe @ 2020-05-25 16:45 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Yamin Friedman, Sagi Grimberg, Or Gerlitz, Leon Romanovsky, linux-rdma

On Mon, May 25, 2020 at 08:14:23AM -0700, Bart Van Assche wrote:
> On 2020-05-19 05:43, Yamin Friedman wrote:
> > +	/*
> > +	 * Allocated at least as many CQEs as requested, and otherwise
>            ^^^^^^^^^
>            allocate?
> 
> > +	spin_lock_irqsave(&dev->cq_pools_lock, flags);
> > +	list_splice(&tmp_list, &dev->cq_pools[poll_ctx - 1]);
> > +	spin_unlock_irqrestore(&dev->cq_pools_lock, flags);
> 
> Please add a WARN_ONCE() or WARN_ON_ONCE() statement that checks that
> poll_ctx >= 1.
> 
> > +struct ib_cq *ib_cq_pool_get(struct ib_device *dev, unsigned int nr_cqe,
> > +			     int comp_vector_hint,
> > +			     enum ib_poll_context poll_ctx)
> > +{
> > +	static unsigned int default_comp_vector;
> > +	int vector, ret, num_comp_vectors;
> > +	struct ib_cq *cq, *found = NULL;
> > +	unsigned long flags;
> > +
> > +	if (poll_ctx > ARRAY_SIZE(dev->cq_pools) || poll_ctx == IB_POLL_DIRECT)
> > +		return ERR_PTR(-EINVAL);
> 
> How about changing this into the following?
> 
> 	if ((unsigned)(poll_ctx - 1) >= ARRAY_SIZE(dev->cq_pools))
> 		return ...;
> 
> I think that change will make this code easier to verify.

Yuk also.. It would be alot better to re-order IB_POLL_DIRECT to the
end of the enum and use a IB_POLL_LAST_POOL_TYPE to exclude it
directly.

Jason

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V3 2/4] RDMA/core: Introduce shared CQ pool API
  2020-05-25 16:42   ` Jason Gunthorpe
@ 2020-05-25 16:47     ` Leon Romanovsky
  2020-05-26 11:39       ` Yamin Friedman
  0 siblings, 1 reply; 22+ messages in thread
From: Leon Romanovsky @ 2020-05-25 16:47 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Yamin Friedman, Sagi Grimberg, Or Gerlitz, linux-rdma

On Mon, May 25, 2020 at 01:42:15PM -0300, Jason Gunthorpe wrote:
> On Tue, May 19, 2020 at 03:43:34PM +0300, Yamin Friedman wrote:
>
> > +void ib_cq_pool_init(struct ib_device *dev)
> > +{
> > +	int i;
>
> I generally rather see unsigned types used for unsigned values
>
> > +
> > +	spin_lock_init(&dev->cq_pools_lock);
> > +	for (i = 0; i < ARRAY_SIZE(dev->cq_pools); i++)
> > +		INIT_LIST_HEAD(&dev->cq_pools[i]);
> > +}
> > +
> > +void ib_cq_pool_destroy(struct ib_device *dev)
> > +{
> > +	struct ib_cq *cq, *n;
> > +	int i;
> > +
> > +	for (i = 0; i < ARRAY_SIZE(dev->cq_pools); i++) {
> > +		list_for_each_entry_safe(cq, n, &dev->cq_pools[i],
> > +					 pool_entry) {
> > +			cq->shared = false;
> > +			ib_free_cq_user(cq, NULL);
>
> WARN_ON cqe_used == 0?

An opposite is better - WARN_ON(cqe_used).

<...>

> > @@ -1418,6 +1418,7 @@ int ib_register_device(struct ib_device *device, const char *name)
> >  		device->ops.dealloc_driver = dealloc_fn;
> >  		return ret;
> >  	}
> > +	ib_cq_pool_init(device);
> >  	ib_device_put(device);
>
> This look like wrong placement, it should be done before enable_device
> as enable_device triggers ULPs t start using the device and they might
> start allocating using this API.
>
> >  	return 0;
> > @@ -1446,6 +1447,7 @@ static void __ib_unregister_device(struct ib_device *ib_dev)
> >  	if (!refcount_read(&ib_dev->refcount))
> >  		goto out;
> >
> > +	ib_cq_pool_destroy(ib_dev);
> >  	disable_device(ib_dev);
>
> similar issue, should be after disable_device as ULPs are still
> running here

Sorry, this were my mistakes. I suggested to Yamin to put it here.

Thanks

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V3 2/4] RDMA/core: Introduce shared CQ pool API
  2020-05-25 13:06   ` Yamin Friedman
@ 2020-05-26  7:09     ` Yamin Friedman
  0 siblings, 0 replies; 22+ messages in thread
From: Yamin Friedman @ 2020-05-26  7:09 UTC (permalink / raw)
  To: Jason Gunthorpe, Sagi Grimberg, Or Gerlitz, Leon Romanovsky; +Cc: linux-rdma


On 5/25/2020 4:06 PM, Yamin Friedman wrote:
> Minor fix brought to my attention by MaxG.
>
> On 5/19/2020 3:43 PM, Yamin Friedman wrote:
>> Allow a ULP to ask the core to provide a completion queue based on a
>> least-used search on a per-device CQ pools. The device CQ pools grow 
>> in a
>> lazy fashion when more CQs are requested.
>>
>> This feature reduces the amount of interrupts when using many QPs.
>> Using shared CQs allows for more effcient completion handling. It also
>> reduces the amount of overhead needed for CQ contexts.
>>
>> Test setup:
>> Intel(R) Xeon(R) Platinum 8176M CPU @ 2.10GHz servers.
>> Running NVMeoF 4KB read IOs over ConnectX-5EX across Spectrum switch.
>> TX-depth = 32. The patch was applied in the nvme driver on both the 
>> target
>> and initiator. Four controllers are accessed from each core. In the
>> current test case we have exposed sixteen NVMe namespaces using four
>> different subsystems (four namespaces per subsystem) from one NVM port.
>> Each controller allocated X queues (RDMA QPs) and attached to Y CQs.
>> Before this series we had X == Y, i.e for four controllers we've created
>> total of 4X QPs and 4X CQs. In the shared case, we've created 4X QPs and
>> only X CQs which means that we have four controllers that share a
>> completion queue per core. Until fourteen cores there is no significant
>> change in performance and the number of interrupts per second is less 
>> than
>> a million in the current case.
>> ==================================================
>> |Cores|Current KIOPs  |Shared KIOPs  |improvement|
>> |-----|---------------|--------------|-----------|
>> |14   |2332           |2723          |16.7%      |
>> |-----|---------------|--------------|-----------|
>> |20   |2086           |2712          |30%        |
>> |-----|---------------|--------------|-----------|
>> |28   |1971           |2669          |35.4%      |
>> |=================================================
>> |Cores|Current avg lat|Shared avg lat|improvement|
>> |-----|---------------|--------------|-----------|
>> |14   |767us          |657us         |14.3%      |
>> |-----|---------------|--------------|-----------|
>> |20   |1225us         |943us         |23%        |
>> |-----|---------------|--------------|-----------|
>> |28   |1816us         |1341us        |26.1%      |
>> ========================================================
>> |Cores|Current interrupts|Shared interrupts|improvement|
>> |-----|------------------|-----------------|-----------|
>> |14   |1.6M/sec          |0.4M/sec         |72%        |
>> |-----|------------------|-----------------|-----------|
>> |20   |2.8M/sec          |0.6M/sec         |72.4%      |
>> |-----|------------------|-----------------|-----------|
>> |28   |2.9M/sec          |0.8M/sec         |63.4%      |
>> ====================================================================
>> |Cores|Current 99.99th PCTL lat|Shared 99.99th PCTL lat|improvement|
>> |-----|------------------------|-----------------------|-----------|
>> |14   |67ms                    |6ms |90.9%      |
>> |-----|------------------------|-----------------------|-----------|
>> |20   |5ms                     |6ms |-10%       |
>> |-----|------------------------|-----------------------|-----------|
>> |28   |8.7ms                   |6ms |25.9%      |
>> |===================================================================
>>
>> Performance improvement with sixteen disks (sixteen CQs per core) is
>> comparable.
>>
>> Signed-off-by: Yamin Friedman <yaminf@mellanox.com>
>> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
>> Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
>> Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
>> ---
>>   drivers/infiniband/core/core_priv.h |   3 +
>>   drivers/infiniband/core/cq.c        | 144 
>> ++++++++++++++++++++++++++++++++++++
>>   drivers/infiniband/core/device.c    |   2 +
>>   include/rdma/ib_verbs.h             |  35 +++++++++
>>   4 files changed, 184 insertions(+)
>>
>> diff --git a/drivers/infiniband/core/core_priv.h 
>> b/drivers/infiniband/core/core_priv.h
>> index cf42acc..a1e6a67 100644
>> --- a/drivers/infiniband/core/core_priv.h
>> +++ b/drivers/infiniband/core/core_priv.h
>> @@ -414,4 +414,7 @@ void rdma_umap_priv_init(struct rdma_umap_priv 
>> *priv,
>>                struct vm_area_struct *vma,
>>                struct rdma_user_mmap_entry *entry);
>>   +void ib_cq_pool_init(struct ib_device *dev);
>> +void ib_cq_pool_destroy(struct ib_device *dev);
>> +
>>   #endif /* _CORE_PRIV_H */
>> diff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c
>> index 4f25b24..7175295 100644
>> --- a/drivers/infiniband/core/cq.c
>> +++ b/drivers/infiniband/core/cq.c
>> @@ -7,7 +7,11 @@
>>   #include <linux/slab.h>
>>   #include <rdma/ib_verbs.h>
>>   +#include "core_priv.h"
>> +
>>   #include <trace/events/rdma_core.h>
>> +/* Max size for shared CQ, may require tuning */
>> +#define IB_MAX_SHARED_CQ_SZ        4096
>>     /* # of WCs to poll for with a single call to ib_poll_cq */
>>   #define IB_POLL_BATCH            16
>> @@ -218,6 +222,7 @@ struct ib_cq *__ib_alloc_cq_user(struct ib_device 
>> *dev, void *private,
>>       cq->cq_context = private;
>>       cq->poll_ctx = poll_ctx;
>>       atomic_set(&cq->usecnt, 0);
>> +    cq->comp_vector = comp_vector;
>>         cq->wc = kmalloc_array(IB_POLL_BATCH, sizeof(*cq->wc), 
>> GFP_KERNEL);
>>       if (!cq->wc)
>> @@ -309,6 +314,8 @@ void ib_free_cq_user(struct ib_cq *cq, struct 
>> ib_udata *udata)
>>   {
>>       if (WARN_ON_ONCE(atomic_read(&cq->usecnt)))
>>           return;
>> +    if (WARN_ON_ONCE(cq->cqe_used))
>> +        return;
>>         switch (cq->poll_ctx) {
>>       case IB_POLL_DIRECT:
>> @@ -334,3 +341,140 @@ void ib_free_cq_user(struct ib_cq *cq, struct 
>> ib_udata *udata)
>>       kfree(cq);
>>   }
>>   EXPORT_SYMBOL(ib_free_cq_user);
>> +
>> +void ib_cq_pool_init(struct ib_device *dev)
>> +{
>> +    int i;
>> +
>> +    spin_lock_init(&dev->cq_pools_lock);
>> +    for (i = 0; i < ARRAY_SIZE(dev->cq_pools); i++)
>> +        INIT_LIST_HEAD(&dev->cq_pools[i]);
>> +}
>> +
>> +void ib_cq_pool_destroy(struct ib_device *dev)
>> +{
>> +    struct ib_cq *cq, *n;
>> +    int i;
>> +
>> +    for (i = 0; i < ARRAY_SIZE(dev->cq_pools); i++) {
>> +        list_for_each_entry_safe(cq, n, &dev->cq_pools[i],
>> +                     pool_entry) {
>> +            cq->shared = false;
>> +            ib_free_cq_user(cq, NULL);
>> +        }
>> +    }
>> +
>> +}
>> +
>> +static int ib_alloc_cqs(struct ib_device *dev, int nr_cqes,
>> +            enum ib_poll_context poll_ctx)
>> +{
>> +    LIST_HEAD(tmp_list);
>> +    struct ib_cq *cq;
>> +    unsigned long flags;
>> +    int nr_cqs, ret, i;
>> +
>> +    /*
>> +     * Allocated at least as many CQEs as requested, and otherwise
>> +     * a reasonable batch size so that we can share CQs between
>> +     * multiple users instead of allocating a larger number of CQs.
>> +     */
>> +    nr_cqes = min(dev->attrs.max_cqe, max(nr_cqes, 
>> IB_MAX_SHARED_CQ_SZ));
>> +    nr_cqs = min_t(int, dev->num_comp_vectors, num_online_cpus());
>> +    for (i = 0; i < nr_cqs; i++) {
>> +        cq = ib_alloc_cq(dev, NULL, nr_cqes, i, poll_ctx);
>> +        if (IS_ERR(cq)) {
>> +            ret = PTR_ERR(cq);
>> +            goto out_free_cqs;
>> +        }
>> +        cq->shared = true;
>> +        list_add_tail(&cq->pool_entry, &tmp_list);
>> +    }
>> +
>> +    spin_lock_irqsave(&dev->cq_pools_lock, flags);
>> +    list_splice(&tmp_list, &dev->cq_pools[poll_ctx - 1]);
>> +    spin_unlock_irqrestore(&dev->cq_pools_lock, flags);
>> +
>> +    return 0;
>> +
>> +out_free_cqs:
>> +    list_for_each_entry(cq, &tmp_list, pool_entry) {
>> +        cq->shared = false;
>> +        ib_free_cq(cq);
>> +    }
>> +    return ret;
>> +}
>> +
>> +struct ib_cq *ib_cq_pool_get(struct ib_device *dev, unsigned int 
>> nr_cqe,
>> +                 int comp_vector_hint,
>> +                 enum ib_poll_context poll_ctx)
>> +{
>> +    static unsigned int default_comp_vector;
>> +    int vector, ret, num_comp_vectors;
>> +    struct ib_cq *cq, *found = NULL;
>> +    unsigned long flags;
>> +
>> +    if (poll_ctx > ARRAY_SIZE(dev->cq_pools) || poll_ctx == 
>> IB_POLL_DIRECT)
>> +        return ERR_PTR(-EINVAL);
>> +
>> +    num_comp_vectors = min_t(int, dev->num_comp_vectors,
>> +                 num_online_cpus());
>> +    /* Project the affinty to the device completion vector range */
>> +    if (comp_vector_hint < 0)
>> +        vector = default_comp_vector++ % num_comp_vectors;
>> +    else
>> +        vector = comp_vector_hint % num_comp_vectors;
>> +
>> +    /*
>> +     * Find the least used CQ with correct affinity and
>> +     * enough free CQ entries
>> +     */
>> +    while (!found) {
>> +        spin_lock_irqsave(&dev->cq_pools_lock, flags);
>> +        list_for_each_entry(cq, &dev->cq_pools[poll_ctx - 1],
>> +                    pool_entry) {
>> +            /*
>> +             * Check to see if we have found a CQ with the
>> +             * correct completion vector
>> +             */
>> +            if (vector != cq->comp_vector)
>> +                continue;
>> +            if (cq->cqe_used + nr_cqe > cq->cqe)
>> +                continue;
>> +            found = cq;
>> +            break;
>> +        }
>> +
>> +        if (found) {
>> +            found->cqe_used += nr_cqe;
>> +            spin_unlock_irqrestore(&dev->cq_pools_lock, flags);
>> +
>> +            return found;
>> +        }
>> +        spin_unlock_irqrestore(&dev->cq_pools_lock, flags);
>> +
>> +        /*
>> +         * Didn't find a match or ran out of CQs in the device
>> +         * pool, allocate a new array of CQs.
>> +         */
>> +        ret = ib_alloc_cqs(dev, nr_cqe, poll_ctx);
>> +        if (ret)
>> +            return ERR_PTR(ret);
>> +    }
>> +
>> +    return found;
>> +}
>> +EXPORT_SYMBOL(ib_cq_pool_get);
>> +
>> +void ib_cq_pool_put(struct ib_cq *cq, unsigned int nr_cqe)
>> +{
>> +    unsigned long flags;
>> +
>> +    if (WARN_ON_ONCE(nr_cqe > cq->cqe_used))
>> +        return;
>> +
>> +    spin_lock_irqsave(&cq->device->cq_pools_lock, flags);
>> +    cq->cqe_used -= nr_cqe;
>> + spin_unlock_irqrestore(&cq->device->cq_pools_lock, flags);
>> +}
>> +EXPORT_SYMBOL(ib_cq_pool_put);
>> diff --git a/drivers/infiniband/core/device.c 
>> b/drivers/infiniband/core/device.c
>> index d9f565a..0966f86 100644
>> --- a/drivers/infiniband/core/device.c
>> +++ b/drivers/infiniband/core/device.c
>> @@ -1418,6 +1418,7 @@ int ib_register_device(struct ib_device 
>> *device, const char *name)
>>           device->ops.dealloc_driver = dealloc_fn;
>>           return ret;
>>       }
>> +    ib_cq_pool_init(device);
>>       ib_device_put(device);
>>         return 0;
>> @@ -1446,6 +1447,7 @@ static void __ib_unregister_device(struct 
>> ib_device *ib_dev)
>>       if (!refcount_read(&ib_dev->refcount))
>>           goto out;
>>   +    ib_cq_pool_destroy(ib_dev);
>>       disable_device(ib_dev);
>>         /* Expedite removing unregistered pointers from the hash 
>> table */
>> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
>> index 1659131..d40604a 100644
>> --- a/include/rdma/ib_verbs.h
>> +++ b/include/rdma/ib_verbs.h
>> @@ -1555,6 +1555,7 @@ enum ib_poll_context {
>>       IB_POLL_SOFTIRQ,       /* poll from softirq context */
>>       IB_POLL_WORKQUEUE,       /* poll from workqueue */
>>       IB_POLL_UNBOUND_WORKQUEUE, /* poll from unbound workqueue */
>> +    IB_POLL_LAST,
>>   };
>>     struct ib_cq {
>> @@ -1564,9 +1565,12 @@ struct ib_cq {
>>       void                  (*event_handler)(struct ib_event *, void *);
>>       void                   *cq_context;
>>       int                   cqe;
>> +    int            cqe_used;
>>       atomic_t              usecnt; /* count number of work queues */
>>       enum ib_poll_context    poll_ctx;
>> +    int                     comp_vector;
>>       struct ib_wc        *wc;
>> +    struct list_head        pool_entry;
>>       union {
>>           struct irq_poll        iop;
>>           struct work_struct    work;
>> @@ -2695,6 +2699,10 @@ struct ib_device {
>>   #endif
>>         u32                          index;
>> +
>> +    spinlock_t                   cq_pools_lock;
>> +    struct list_head             cq_pools[IB_POLL_LAST - 1];
>> +
>>       struct rdma_restrack_root *res;
>>         const struct uapi_definition   *driver_def;
>> @@ -3952,6 +3960,33 @@ static inline int ib_req_notify_cq(struct 
>> ib_cq *cq,
>>       return cq->device->ops.req_notify_cq(cq, flags);
>>   }
>>   +/*
>> + * ib_cq_pool_get() - Find the least used completion queue that matches
>> + *     a given cpu hint (or least used for wild card affinity)
>> + *     and fits nr_cqe
>> + * @dev: rdma device
>> + * @nr_cqe: number of needed cqe entries
>> + * @comp_vector_hint: completion vector hint (-1) for the driver to 
>> assign
>> + *   a comp vector based on internal counter
>> + * @poll_ctx: cq polling context
>> + *
>> + * Finds a cq that satisfies @comp_vector_hint and @nr_cqe 
>> requirements and
>> + * claim entries in it for us. In case there is no available cq, 
>> allocate
>> + * a new cq with the requirements and add it to the device pool.
>> + * IB_POLL_DIRECT cannot be used for shared cqs so it is not a valid 
>> value
>> + * for @poll_ctx.
>> + */
>> +struct ib_cq *ib_cq_pool_get(struct ib_device *dev, unsigned int 
>> nr_cqe,
>> +                 int comp_vector_hint,
>> +                 enum ib_poll_context poll_ctx);
>> +
>> +/**
>> + * ib_cq_pool_put - Return a CQ taken from a shared pool.
>> + * @cq: The CQ to return.
>> + * @nr_cqe: The max number of cqes that the user had requested.
>> + */
>> +void ib_cq_pool_put(struct ib_cq *cq, unsigned int nr_cqe);
>> +
>>   /**
>>    * ib_req_ncomp_notif - Request completion notification when there are
>>    *   at least the specified number of unreaped completions on the CQ.

 From dabcad9a5813d9cb4bb5f5ac6931a5a9b1dd2dc2 Mon Sep 17 00:00:00 2001
From: Yamin Friedman <yaminf@mellanox.com>
Date: Mon, 25 May 2020 16:39:05 +0300
Subject: [PATCH] Fixup RDMA/core: Correct ib_cq_free usage

Signed-off-by: Yamin Friedman <yaminf@mellanox.com>
---
  drivers/infiniband/core/cq.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c
index 7175295..c462d48 100644
--- a/drivers/infiniband/core/cq.c
+++ b/drivers/infiniband/core/cq.c
@@ -360,7 +360,7 @@ void ib_cq_pool_destroy(struct ib_device *dev)
                 list_for_each_entry_safe(cq, n, &dev->cq_pools[i],
                                          pool_entry) {
                         cq->shared = false;
-                       ib_free_cq_user(cq, NULL);
+                       ib_free_cq(cq);
                 }
         }

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V3 2/4] RDMA/core: Introduce shared CQ pool API
  2020-05-25 16:47     ` Leon Romanovsky
@ 2020-05-26 11:39       ` Yamin Friedman
  2020-05-26 12:09         ` Jason Gunthorpe
  0 siblings, 1 reply; 22+ messages in thread
From: Yamin Friedman @ 2020-05-26 11:39 UTC (permalink / raw)
  To: Leon Romanovsky, Jason Gunthorpe; +Cc: Sagi Grimberg, Or Gerlitz, linux-rdma


On 5/25/2020 7:47 PM, Leon Romanovsky wrote:
> On Mon, May 25, 2020 at 01:42:15PM -0300, Jason Gunthorpe wrote:
>> On Tue, May 19, 2020 at 03:43:34PM +0300, Yamin Friedman wrote:
>>
>>> +void ib_cq_pool_init(struct ib_device *dev)
>>> +{
>>> +	int i;
>> I generally rather see unsigned types used for unsigned values
>>
>>> +
>>> +	spin_lock_init(&dev->cq_pools_lock);
>>> +	for (i = 0; i < ARRAY_SIZE(dev->cq_pools); i++)
>>> +		INIT_LIST_HEAD(&dev->cq_pools[i]);
>>> +}
>>> +
>>> +void ib_cq_pool_destroy(struct ib_device *dev)
>>> +{
>>> +	struct ib_cq *cq, *n;
>>> +	int i;
>>> +
>>> +	for (i = 0; i < ARRAY_SIZE(dev->cq_pools); i++) {
>>> +		list_for_each_entry_safe(cq, n, &dev->cq_pools[i],
>>> +					 pool_entry) {
>>> +			cq->shared = false;
>>> +			ib_free_cq_user(cq, NULL);
>> WARN_ON cqe_used == 0?
> An opposite is better - WARN_ON(cqe_used).
>
> <...>

Is this check really necessary as we are closing the device?

>
>>> @@ -1418,6 +1418,7 @@ int ib_register_device(struct ib_device *device, const char *name)
>>>   		device->ops.dealloc_driver = dealloc_fn;
>>>   		return ret;
>>>   	}
>>> +	ib_cq_pool_init(device);
>>>   	ib_device_put(device);
>> This look like wrong placement, it should be done before enable_device
>> as enable_device triggers ULPs t start using the device and they might
>> start allocating using this API.
>>
>>>   	return 0;
>>> @@ -1446,6 +1447,7 @@ static void __ib_unregister_device(struct ib_device *ib_dev)
>>>   	if (!refcount_read(&ib_dev->refcount))
>>>   		goto out;
>>>
>>> +	ib_cq_pool_destroy(ib_dev);
>>>   	disable_device(ib_dev);
>> similar issue, should be after disable_device as ULPs are still
>> running here
> Sorry, this were my mistakes. I suggested to Yamin to put it here.
>
> Thanks

I will move them to the suggest location.

Thanks


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V3 2/4] RDMA/core: Introduce shared CQ pool API
  2020-05-25 16:45     ` Jason Gunthorpe
@ 2020-05-26 11:43       ` Yamin Friedman
  0 siblings, 0 replies; 22+ messages in thread
From: Yamin Friedman @ 2020-05-26 11:43 UTC (permalink / raw)
  To: Jason Gunthorpe, Bart Van Assche
  Cc: Sagi Grimberg, Or Gerlitz, Leon Romanovsky, linux-rdma


On 5/25/2020 7:45 PM, Jason Gunthorpe wrote:
> On Mon, May 25, 2020 at 08:14:23AM -0700, Bart Van Assche wrote:
>> On 2020-05-19 05:43, Yamin Friedman wrote:
>>> +	/*
>>> +	 * Allocated at least as many CQEs as requested, and otherwise
>>             ^^^^^^^^^
>>             allocate?
>>
>>> +	spin_lock_irqsave(&dev->cq_pools_lock, flags);
>>> +	list_splice(&tmp_list, &dev->cq_pools[poll_ctx - 1]);
>>> +	spin_unlock_irqrestore(&dev->cq_pools_lock, flags);
>> Please add a WARN_ONCE() or WARN_ON_ONCE() statement that checks that
>> poll_ctx >= 1.
>>
>>> +struct ib_cq *ib_cq_pool_get(struct ib_device *dev, unsigned int nr_cqe,
>>> +			     int comp_vector_hint,
>>> +			     enum ib_poll_context poll_ctx)
>>> +{
>>> +	static unsigned int default_comp_vector;
>>> +	int vector, ret, num_comp_vectors;
>>> +	struct ib_cq *cq, *found = NULL;
>>> +	unsigned long flags;
>>> +
>>> +	if (poll_ctx > ARRAY_SIZE(dev->cq_pools) || poll_ctx == IB_POLL_DIRECT)
>>> +		return ERR_PTR(-EINVAL);
>> How about changing this into the following?
>>
>> 	if ((unsigned)(poll_ctx - 1) >= ARRAY_SIZE(dev->cq_pools))
>> 		return ...;
>>
>> I think that change will make this code easier to verify.
> Yuk also.. It would be alot better to re-order IB_POLL_DIRECT to the
> end of the enum and use a IB_POLL_LAST_POOL_TYPE to exclude it
> directly.
>
> Jason

You are right, this shouldn't have made it this far without refactoring. 
I will move POLL_DIRECT to the end and clean up all of these references.

Thanks


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V3 2/4] RDMA/core: Introduce shared CQ pool API
  2020-05-26 11:39       ` Yamin Friedman
@ 2020-05-26 12:09         ` Jason Gunthorpe
  0 siblings, 0 replies; 22+ messages in thread
From: Jason Gunthorpe @ 2020-05-26 12:09 UTC (permalink / raw)
  To: Yamin Friedman; +Cc: Leon Romanovsky, Sagi Grimberg, Or Gerlitz, linux-rdma

On Tue, May 26, 2020 at 02:39:33PM +0300, Yamin Friedman wrote:
> 
> On 5/25/2020 7:47 PM, Leon Romanovsky wrote:
> > On Mon, May 25, 2020 at 01:42:15PM -0300, Jason Gunthorpe wrote:
> > > On Tue, May 19, 2020 at 03:43:34PM +0300, Yamin Friedman wrote:
> > > 
> > > > +void ib_cq_pool_init(struct ib_device *dev)
> > > > +{
> > > > +	int i;
> > > I generally rather see unsigned types used for unsigned values
> > > 
> > > > +
> > > > +	spin_lock_init(&dev->cq_pools_lock);
> > > > +	for (i = 0; i < ARRAY_SIZE(dev->cq_pools); i++)
> > > > +		INIT_LIST_HEAD(&dev->cq_pools[i]);
> > > > +}
> > > > +
> > > > +void ib_cq_pool_destroy(struct ib_device *dev)
> > > > +{
> > > > +	struct ib_cq *cq, *n;
> > > > +	int i;
> > > > +
> > > > +	for (i = 0; i < ARRAY_SIZE(dev->cq_pools); i++) {
> > > > +		list_for_each_entry_safe(cq, n, &dev->cq_pools[i],
> > > > +					 pool_entry) {
> > > > +			cq->shared = false;
> > > > +			ib_free_cq_user(cq, NULL);
> > > WARN_ON cqe_used == 0?
> > An opposite is better - WARN_ON(cqe_used).
> > 
> > <...>
> 
> Is this check really necessary as we are closing the device?

It checks that no ULPs forgot to destroy something

Jason

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, back to index

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-19 12:43 [PATCH V3 0/4] Introducing RDMA shared CQ pool Yamin Friedman
2020-05-19 12:43 ` [PATCH V3 1/4] RDMA/core: Add protection for shared CQs used by ULPs Yamin Friedman
2020-05-19 12:43 ` [PATCH V3 2/4] RDMA/core: Introduce shared CQ pool API Yamin Friedman
2020-05-20  6:19   ` Devesh Sharma
2020-05-20  9:23     ` Yamin Friedman
2020-05-20  9:32       ` Leon Romanovsky
2020-05-20 10:50       ` Devesh Sharma
2020-05-20 12:01         ` Yamin Friedman
2020-05-20 13:48           ` Devesh Sharma
2020-05-25 13:06   ` Yamin Friedman
2020-05-26  7:09     ` Yamin Friedman
2020-05-25 15:14   ` Bart Van Assche
2020-05-25 16:45     ` Jason Gunthorpe
2020-05-26 11:43       ` Yamin Friedman
2020-05-25 16:42   ` Jason Gunthorpe
2020-05-25 16:47     ` Leon Romanovsky
2020-05-26 11:39       ` Yamin Friedman
2020-05-26 12:09         ` Jason Gunthorpe
2020-05-19 12:43 ` [PATCH V3 3/4] nvme-rdma: use new shared CQ mechanism Yamin Friedman
2020-05-19 12:43 ` [PATCH V3 4/4] nvmet-rdma: " Yamin Friedman
2020-05-20  7:03 ` [PATCH V3 0/4] Introducing RDMA shared CQ pool Sagi Grimberg
2020-05-20  8:15   ` Yamin Friedman

Linux-RDMA Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-rdma/0 linux-rdma/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-rdma linux-rdma/ https://lore.kernel.org/linux-rdma \
		linux-rdma@vger.kernel.org
	public-inbox-index linux-rdma

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-rdma


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git